Türkçe dokümanların sınıflandırılması

dc.contributor.advisor	Aşlıyan, Rıfat
dc.contributor.author	Yılmaz, Rumeysa
dc.date.accessioned	2016-01-13T11:14:18Z
dc.date.available	2016-01-13T11:14:18Z
dc.date.issued	2013-01-01
dc.date.submitted	2013-01-01
dc.identifier.uri	http://adumilas.adu.edu.tr/web/catalog/info.php?idx=53748640&idt=1
dc.identifier.uri	http://hdl.handle.net/11607/1018
dc.description.abstract	The rapid growth of the Internet has increased the amount of information and the number of operations in electronic environments. However, the growing volume of information stored and processed in these environments has made it harder to reach the information being sought. Users therefore need to reach the information they want more accurately and quickly, and work continues on new methods for classifying electronic documents. This study aims to classify documents obtained from web sites with Turkish text content. The documents are handled in four categories: stem based, word based, syllable based and character based. N-gram analyses were performed for stems, words, syllables and characters. The system was trained and tested with three methods: the K-Nearest Neighbor model (K-NN), the Multi-Layer Perceptron model (MLP) and the Support Vector Machine (SVM). Two corpora, one for training and one for testing, were built for the study. Six classes were considered (education, economy, culture and arts, automobile, health and sport), each containing 75 documents collected from the Internet. From each class, 25 documents (150 in total) were used to train the system and 50 documents (300 in total) were used to test it. The documents given to the system were first preprocessed. After the frequencies and probabilities of the preprocessed documents were computed, a feature vector database was created for each class. While building the feature vector database, threshold values of 0.25, 0.50, 0.75 and 0.90 were used when comparing words across documents. The documents in the training set were given to the system and compared with the words in the feature vector database built for each class to determine which class each document belongs to. The documents in the test set were then given to the system, and its performance was measured by precision, recall, F-measure and accuracy. The highest accuracy, 99.9%, was obtained with the SVM method on word 1-grams; the corresponding F-measure was 99.7%.	tr_TR
dc.description.abstract	Advances in Internet technologies have produced a great deal of digital information and operations. Because of this volume, it has become difficult to reach information that is stored in databases and processed by systems. Users therefore need to reach information faster and more reliably, and work on new approaches to classifying digital documents is ongoing. In this study, we aim to classify documents obtained from Turkish web sites. The documents are handled according to word-, stem-, syllable- and character-based approaches, and n-grams of these units are also used for classification. The developed systems classify the web-based documents with three methods: K-Nearest Neighbor (K-NN), Multi-Layer Perceptron (MLP) and Support Vector Machine (SVM). The systems that use MLP and SVM have been trained and tested. For training and testing, two corpora were generated from Turkish web pages. The documents are classified into six classes: "education", "economy", "art and culture", "automobile", "health" and "sport". Each class includes 25 documents in the training set and 50 documents in the testing set, so the corpora contain 450 documents in total. In the preprocessing stage, all unnecessary characters such as punctuation marks are removed from the documents, all capital letters are converted to lower case, and only one space character is allowed between two consecutive words. In the feature extraction stage, the frequencies of word, stem, syllable and character n-grams in all documents are computed, and the documents are represented as column vectors that contain these frequencies. In feature selection, the word, stem, syllable and character n-gram tokens are determined with threshold values of 0.25, 0.50, 0.75 and 0.90. In the training stage, every document is given as a feature vector to the MLP and SVM methods, and a model is constructed for each class. Finally, the documents in the test set are assigned to the classes using the models. The designed systems are evaluated according to Precision, Recall, Accuracy and F-measure. The most successful method is SVM with word 1-grams, with Accuracy and F-measure scores of 99.9% and 99.7% respectively.	tr_TR
dc.language.iso	tur	tr_TR
dc.publisher	Adnan Menderes Üniversitesi, Fen Bilimleri Enstitüsü	tr_TR
dc.rights	info:eu-repo/semantics/openAccess	tr_TR
dc.subject	Doküman Sınıflandırma	tr_TR
dc.subject	K-En Yakın Komşu Modeli	tr_TR
dc.subject	Çok Katmanlı Algılayıcı Modeli	tr_TR
dc.subject	Destek Vektör Makinesi	tr_TR
dc.subject	n-gram	tr_TR
dc.subject	Document Classification	tr_TR
dc.subject	K-Nearest Neighbor	tr_TR
dc.subject	Multi-Layer Perceptron	tr_TR
dc.subject	Support Vector Machine	tr_TR
dc.title	Türkçe dokümanların sınıflandırılması	tr_TR
dc.title.alternative	Classification of Turkish documents	tr_TR
dc.type	masterThesis	tr_TR
dc.contributor.department	Adnan Menderes Üniversitesi, Fen Bilimleri Enstitüsü, Matematik Anabilim Dalı	tr_TR
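
The preprocessing and n-gram frequency steps described in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names (preprocess, word_ngrams, char_ngrams, relative_frequencies) are hypothetical, the Turkish-specific handling of "I" and "İ" is an assumption about how lowercasing should behave, and stem and syllable units are left out because they need a Turkish stemmer and syllabifier that the record does not name.

```python
import re
from collections import Counter


def preprocess(text: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    # Assumption: map Turkish dotless/dotted capitals by hand, because
    # str.lower() alone turns "I" into "i" and "İ" into "i" plus a combining dot.
    text = text.replace("I", "ı").replace("İ", "i").lower()
    # Replace anything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # Allow only one space between consecutive words.
    return re.sub(r"\s+", " ", text).strip()


def word_ngrams(tokens: list[str], n: int) -> list[str]:
    """Return contiguous word n-grams joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n: int) -> list[str]:
    """Return contiguous character n-grams, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def relative_frequencies(items: list[str]) -> dict[str, float]:
    """Map each distinct item to its count divided by the total count."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}


if __name__ == "__main__":
    clean = preprocess("Spor,  HABERLERİ: Işık!")
    tokens = clean.split()
    print(clean)
    print(word_ngrams(tokens, 2))
    print(relative_frequencies(char_ngrams(clean, 1)))
```

The resulting frequency dictionaries could then be turned into fixed-length vectors over a shared vocabulary before being passed to a classifier; how the thesis applies its 0.25 to 0.90 thresholds during feature selection is not specified in enough detail here to reproduce.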