Türkçe dokümanların sınıflandırılması

Please use this identifier to cite or link to this item: http://hdl.handle.net/11607/1018

Title:	Türkçe dokümanların sınıflandırılması
Other Titles:	Classification of Turkish documents
Authors:	Aşlıyan, Rıfat; Yılmaz, Rumeysa; Adnan Menderes Üniversitesi, Fen Bilimleri Enstitüsü, Matematik Anabilim Dalı
Keywords:	Doküman Sınıflandırma; K-En Yakın Komşu Modeli; Çok Katmanlı Algılayıcı Modeli; Destek Vektör Makinesi; n-gram; Document Classification; K-Nearest Neighbor; Multi-Layer Perceptron; Support Vector Machine
Issue Date:	1-Jan-2013
Publisher:	Adnan Menderes Üniversitesi, Fen Bilimleri Enstitüsü
Abstract:	The rapid growth of the Internet has increased the amount of information and the number of operations in electronic environments. However, the growing volume of information stored and processed in these environments has made it harder to reach the information being sought. As a result, users need to reach the information they want more accurately and quickly, and work on new methods for classifying electronic documents is ongoing. This study aims to classify documents obtained from web sites with Turkish text content. The documents are handled in four categories: stem-based, word-based, syllable-based and character-based. n-gram analyses were carried out for stems, words, syllables and characters. The system was trained and tested with three methods: K-Nearest Neighbor (K-NN), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM). Two corpora, one for training and one for testing, were built for the study. Six classes were considered (education, economy, culture-art, automobile, health and sport), each containing 75 documents collected from the Internet. From each class, 25 documents (150 in total) were used to train the system and 50 documents (300 in total) were used to test it. The documents given to the system were first preprocessed. After the frequencies and probabilities of the preprocessed documents were computed, a feature vector database was built for each class. While building the feature vector database, threshold values of 0.25, 0.50, 0.75 and 0.90 were used when comparing words across documents. The documents in the training set were given to the system and compared with the words in the feature vector database built for each class to determine which class each document belongs to.
The documents in the test set were then given to the system, and its performance was measured by precision, recall, F-measure and accuracy. The highest accuracy, 99.9%, was obtained with the SVM method on word 1-grams; the corresponding F-measure was 99.7%.
Advances in Internet technology have greatly increased the amount of digital information and the number of digital operations. Because of this volume, it has become difficult to reach information that is stored in databases and processed by systems. Users therefore need to reach information faster and more reliably, and studies on new approaches to classifying digital documents are ongoing. This study aims to classify documents obtained from Turkish web sites. The documents are handled with four approaches: word-based, stem-based, syllable-based and character-based. The n-grams of these units are also used for classification. The developed systems classify the web-based documents with three methods: K-Nearest Neighbor (K-NN), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM). The systems that use MLP and SVM have been trained and tested. For training and testing, two corpora were built from Turkish web pages. The documents fall into six classes: "education", "economy", "art and culture", "automobile", "health" and "sport". Each class contains 25 documents in the training set and 50 documents in the test set, so the corpora contain 450 documents in total. In the preprocessing stage, all unnecessary characters such as punctuation marks are removed from the documents, all capital letters are converted to lower case, and only one space character is allowed between two consecutive words.
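The preprocessing stage described above can be sketched roughly as follows. This is an illustrative sketch, not the code used in the thesis: the function name `preprocess` is hypothetical, and replacing punctuation with a space (rather than deleting it outright) is an assumption. The explicit handling of "I" and "İ" reflects the Turkish dotted and dotless i, which Python's default `str.lower()` does not map the Turkish way.

```python
import re

# Turkish distinguishes dotted and dotless i: "I" lowercases to "ı" and
# "İ" to "i". str.lower() alone would map "I" to "i", so both capital
# forms are translated explicitly first.
_TURKISH_LOWER = str.maketrans({"I": "ı", "İ": "i"})


def preprocess(text: str) -> str:
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = text.translate(_TURKISH_LOWER).lower()
    # Assumption: every character that is not a letter, digit, underscore,
    # or whitespace counts as punctuation and is replaced by a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # \w also matches "_", which is treated as punctuation here.
    text = text.replace("_", " ")
    # Leave exactly one space between consecutive words.
    return " ".join(text.split())


cleaned = preprocess("İstanbul'da  IŞIK, Spor!")
```

Because punctuation becomes a space, a suffix attached with an apostrophe ends up as a separate token; deleting the apostrophe instead would keep the word and its suffix together.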
In the feature extraction stage, the frequencies of word, stem, syllable and character n-grams are computed over all documents, and each document is represented as a column vector of these frequencies. In feature selection, the word, stem, syllable and character n-gram tokens are chosen using threshold values of 0.25, 0.50, 0.75 and 0.90. In the training stage, every document is given to the MLP and SVM methods as a feature vector, and a model is constructed for each class. Finally, the documents in the test set are assigned to classes using these models. The systems are evaluated by Precision, Recall, Accuracy and F-measure. The most successful method is SVM with word 1-grams, with an Accuracy of 99.9% and an F-measure of 99.7%.
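As a rough illustration of the feature extraction and evaluation steps named in the abstract, the sketch below builds n-gram frequency vectors and computes the four reported scores. All function names are hypothetical. The abstract does not say how per-class scores are combined, so macro-averaging over classes is an assumption, and the threshold-based feature selection is left out because its exact rule is not described.

```python
from collections import Counter


def ngrams(units, n):
    """Return the n-grams of a sequence of units (words, syllables, characters)."""
    return [tuple(units[i:i + n]) for i in range(len(units) - n + 1)]


def frequency_vector(units, n, vocabulary):
    """Count how often each vocabulary n-gram occurs in one document."""
    counts = Counter(ngrams(units, n))
    return [counts[gram] for gram in vocabulary]


def evaluate(true_labels, predicted_labels):
    """Accuracy plus macro-averaged precision, recall and F-measure (assumed averaging)."""
    pairs = list(zip(true_labels, predicted_labels))
    classes = sorted(set(true_labels) | set(predicted_labels))
    precisions, recalls, f_scores = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_scores.append(2 * precision * recall / denom if denom else 0.0)
        precisions.append(precision)
        recalls.append(recall)
    k = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f_measure": sum(f_scores) / k,
    }


# Word 1-gram frequencies for a toy document over a toy vocabulary.
words = "spor haberleri spor".split()
vocabulary = [("spor",), ("haberleri",), ("ekonomi",)]
vector = frequency_vector(words, 1, vocabulary)  # [2, 1, 0]

# Character 2-grams use the same helper on a list of characters.
char_bigrams = ngrams(list("spor"), 2)
```

The resulting vectors could be passed to any K-NN, MLP or SVM implementation; which library or parameters the thesis used is not stated in this record.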
URI:	http://adumilas.adu.edu.tr/web/catalog/info.php?idx=53748640&idt=1 http://hdl.handle.net/11607/1018
Appears in Collections:	Yüksek Lisans (Master's)

Files in This Item:

File	Description	Size	Format
ABSTRACT.pdf		38.86 kB	Adobe PDF
ÖZET.pdf		125.18 kB	Adobe PDF
Rümeysa YILMAZ.pdf	Master's Thesis	1.19 MB	Adobe PDF
