impex: comp 491: preliminary report

Remember the project report I kept mentioning? Well, here it is. But first, we should know what the report is about, right?

12.11.2005

Topic: Ekşi Sözlük Graph Visualization Tool

Motivation: Ekşi Sözlük (http://sozluk.sourtimes.org/) is a popular Turkish web site that has been up and running since February 15th, 1999. With about 10,000 active contributors (susers, short for "Sözlük users" in Ekşi Sözlük jargon), the site is essentially a hypertext dictionary made up of its contributors' entries. In Ekşi Sözlük, one can find explanations and definitions of almost any concept one can think of. In Ekşi Sözlük’s jargon, a concept for which information can be found is called a “title” (a literal translation of the Turkish “başlık”). Each individual definition, explanation, or piece of information is called an “entry”. Any number of entries may be posted under a title. What sets Sözlük apart from a plain-text dictionary is that its entries contain hypertext references to other titles. The data used in this project is obtained by crawling the entries of Ekşi Sözlük.

Scope: A detailed inspection of Ekşi Sözlük data represented as a directed graph (digraph), with some simple algorithms used to construct the digraph. Extensions such as marking the titles a specific suser has written under, finding cycles of association, or creating a timeline (or histogram) of activity for a specific title can also be implemented.
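To make the "cycles of association" extension a bit more concrete, here is a minimal sketch in Python (not the language of the project itself). It assumes the digraph is held as a plain dictionary mapping each title to the titles it links to; the function name `find_cycle` and the placeholder titles are made up for illustration. It uses an iterative depth-first search so that long link chains do not hit Python's recursion limit.

```python
from typing import Dict, Iterable, List, Optional

def find_cycle(graph: Dict[str, Iterable[str]]) -> Optional[List[str]]:
    """Return one directed cycle as a list of titles, or None if none exists.

    `graph` maps a title to the titles its entries link to (assumed layout).
    """
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color: Dict[str, int] = {}

    for start in graph:
        if color.get(start, WHITE) != WHITE:
            continue
        color[start] = GREY
        path = [start]                     # titles on the current DFS path
        stack = [iter(graph.get(start, ()))]
        while stack:
            for target in stack[-1]:
                state = color.get(target, WHITE)
                if state == GREY:
                    # target is already on the path: we closed a cycle
                    return path[path.index(target):] + [target]
                if state == WHITE:
                    color[target] = GREY
                    path.append(target)
                    stack.append(iter(graph.get(target, ())))
                    break
            else:
                # all links of the current title explored, backtrack
                color[path.pop()] = BLACK
                stack.pop()
    return None

# Placeholder titles, purely illustrative
links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": []}
print(find_cycle(links))
```

This only reports the first cycle it runs into; enumerating every cycle in a graph of this size would need a different, more expensive approach.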

Method: As an initial step, a crawler for extracting Ekşi Sözlük data, named ek$iDump, was written in C#. It is a simple, single-threaded application that accesses Sözlük entries one by one by their numerical ID and dumps the necessary details to a non-relational Microsoft Access database. Currently, all entries up to ID #3300000 have been crawled. Because many entries have been deleted (by moderation, by the suser's own choice, or when a suser account is voided), the total number of entries stored locally stands close to 2,000,000. As of December 12th, 2005, there are more than 5,000,000 entries posted under about 1,100,000 titles, and the ID of the most recent entry is #8685844. This may give a measure of the density of Sözlük data (detailed statistics can be found at http://sozluk.sourtimes.org/stats.asp). Due to time limitations, a cutoff point will be selected (ID #4000000 or #5000000 is being considered). The second and final step is to design and implement the graphing tool that will work on the extracted data. This tool will make use of some simple algorithms or checks, including:

Checking the number of entries under a destination title before creating a connection between the two nodes that represent the titles. This is necessary because susers sometimes use links for other purposes, such as emphasizing part of an entry. Also, some links point to non-existent titles, which should be eliminated.
Possibly, a node distribution algorithm, so that no two nodes of the graph overlap, for clarity of presentation.
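The two checks above could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the function names `build_digraph` and `grid_layout`, the `min_entries` threshold, and the idea that links arrive as (source title, destination title) pairs with a separate table of entry counts per title. The grid placement is only a naive stand-in for a real node distribution algorithm; it avoids overlap simply by giving every node its own cell.

```python
import math
from typing import Dict, Iterable, Set, Tuple

def build_digraph(
    links: Iterable[Tuple[str, str]],
    entry_counts: Dict[str, int],
    min_entries: int = 1,              # hypothetical threshold
) -> Dict[str, Set[str]]:
    """Keep a link only if its destination title has enough entries."""
    graph: Dict[str, Set[str]] = {}
    for source, target in links:
        # Titles with no entries (non-existent, or links used only for
        # emphasis) are missing from entry_counts or fall below the threshold.
        if entry_counts.get(target, 0) < min_entries:
            continue
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())
    return graph

def grid_layout(nodes: Iterable[str], spacing: float = 100.0) -> Dict[str, Tuple[float, float]]:
    """Place nodes on a square grid so that no two share a position."""
    ordered = sorted(nodes)
    columns = max(1, math.ceil(math.sqrt(len(ordered))))
    return {
        node: ((i % columns) * spacing, (i // columns) * spacing)
        for i, node in enumerate(ordered)
    }

# Placeholder data, purely illustrative
links = [("A", "B"), ("A", "just for emphasis"), ("B", "A")]
counts = {"A": 3, "B": 1}
graph = build_digraph(links, counts)
positions = grid_layout(graph)
```

Nodes will not overlap as long as `spacing` is larger than the size at which a node is drawn.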

Expected Results: A report on the senior design project with extended demonstrations of the final product: the graphing tool, which is expected to generate a “forest” of Ekşi Sözlük data. As mentioned above, the data extraction tool (crawler) is complete, along with a collection of classes for acquiring and arranging Ekşi Sözlük data, namely the Ek$iAPI, although its speed can still be improved. The graphing tool is currently in the drawing-board phase.
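For readers curious about what the crawl loop amounts to, here is a rough Python sketch of walking entry IDs one by one and skipping the deleted ones. It is not ek$iDump or the Ek$iAPI: Python and SQLite stand in for C# and Access, and the `fetch_entry` callable, the table layout, and the field names (`title`, `suser`, `body`) are all hypothetical. How an entry is actually downloaded and parsed is left out entirely.

```python
import sqlite3
from typing import Callable, Dict, Optional

Entry = Dict[str, str]

def crawl(
    fetch_entry: Callable[[int], Optional[Entry]],  # hypothetical downloader
    conn: sqlite3.Connection,
    first_id: int,
    last_id: int,
) -> int:
    """Visit IDs first_id..last_id in order; return how many entries were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries "
        "(id INTEGER PRIMARY KEY, title TEXT, suser TEXT, body TEXT)"
    )
    stored = 0
    for entry_id in range(first_id, last_id + 1):
        entry = fetch_entry(entry_id)
        if entry is None:
            # Deleted entry (moderation, suser's choice, voided account): skip
            continue
        conn.execute(
            "INSERT OR REPLACE INTO entries (id, title, suser, body) "
            "VALUES (?, ?, ?, ?)",
            (entry_id, entry["title"], entry["suser"], entry["body"]),
        )
        stored += 1
    conn.commit()
    return stored

# Stand-in for the real site: only IDs 1 and 3 "exist"
fake_site = {
    1: {"title": "t1", "suser": "s1", "body": "first"},
    3: {"title": "t1", "suser": "s2", "body": "second"},
}
connection = sqlite3.connect(":memory:")
print(crawl(fake_site.get, connection, 1, 4))
```

A single-threaded loop like this spends most of its time waiting on the network, which is one plausible place to look for speed improvements.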


Thursday, June 22, 2006
