Friday, June 23, 2006

comp 491: report | ek$iEdgeDump

ek$iEdgeDump

The Problem: Extracting links (connections) from the existing “heap” of entries.

Design: To produce a digraph, ek$iVista needs a list of directed edges, and ek$iEdgeDump was written to supply it. For the sake of simplicity it is, like ek$iDump, designed as a console application. Two versions of ek$iEdgeDump were produced during development. The first version gets the full list of titles in the database (from a table named Titles) and scans them one by one. During this scan, the entries under the title being scanned are inspected, and any link that points to a title is parsed out of the entry text (links that point to single entries are omitted). Another database query then checks whether the title pointed to by the link exists in the title list. If it does, the pair consisting of the IDs of the source title and the destination title (each record in the Titles table has a title ID and a title name) is written to a table named EdgeData, which exists only so that ek$iVista can draw the digraph. This approach proved too slow, because a verification query has to be made for every link found in an entry. The scan rate of this version of ek$iEdgeDump was less than 1,000 titles/day. Since the database contained more than 700,000 titles, the job would have taken more than 700 days, that is, roughly two years. Clearly, another approach had to be adopted.
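The first version's title-driven scan can be sketched roughly as below. This is only an illustration in Python with SQLite, not the original program: the column names (TitleID, TitleName, EntryText, SourceID, DestID) and the "(bkz: ...)" link markup captured by LINK_RE are assumptions made for the example; only the table names Titles, Entries and EdgeData come from the report.

```python
import re
import sqlite3

# Assumed link markup (hypothetical): "(bkz: some title)" points to a title,
# while "(bkz: #12345)" points to a single entry and is skipped.
LINK_RE = re.compile(r"\(bkz: ([^)#][^)]*)\)")


def dump_edges_v1(conn: sqlite3.Connection) -> None:
    """Title-driven scan: one verification query per link found."""
    cur = conn.cursor()
    titles = cur.execute("SELECT TitleID, TitleName FROM Titles").fetchall()
    for source_id, source_name in titles:
        entries = cur.execute(
            "SELECT EntryText FROM Entries WHERE TitleName = ?",
            (source_name,),
        ).fetchall()
        for (text,) in entries:
            for target in LINK_RE.findall(text):
                # The per-link lookup below is the step the report
                # identifies as the bottleneck.
                row = cur.execute(
                    "SELECT TitleID FROM Titles WHERE TitleName = ?",
                    (target.strip(),),
                ).fetchone()
                if row is not None:
                    cur.execute(
                        "INSERT INTO EdgeData (SourceID, DestID) VALUES (?, ?)",
                        (source_id, row[0]),
                    )
    conn.commit()
```

Every link costs one extra round trip to the database, so the total number of queries grows with the number of links rather than with the number of titles.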

In the second version of ek$iEdgeDump, the focus shifts from the titles back to the entries. As described for the Entry class in ek$iAPI, one of the details acquired from Ekşi Sözlük when an entry is extracted is the title the entry is placed under. Thus, a table similar to the EdgeData table described above, holding the source and destination vertices of each directed edge, can be built from the entries alone. In the new approach, the table is produced by scanning the entries in the database (they reside in a table named Entries), parsing out the links that point to titles, and writing the pair consisting of the names of the source title and the destination title to a table named EksiEdgeData, without checking whether the destination title exists in the Titles table. This verification was what slowed the first version down, and ek$iVista can handle it without querying the database (the details are given in the section discussing ek$iVista). Because the title data in the Entries table is stored as strings (not as integer foreign keys referencing the title ID column of the Titles table), the EksiEdgeData table is significantly larger than EdgeData. ek$iVista uses the data from the EksiEdgeData table, generated by the second version of ek$iEdgeDump.
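A minimal sketch of the entry-driven scan, again in Python with SQLite and not the original program. The column names (TitleName, EntryText, SourceTitle, DestTitle) and the "(bkz: ...)" link markup are assumptions made for illustration; the table names Entries and EksiEdgeData are the ones used in the report.

```python
import re
import sqlite3

# Assumed link markup (hypothetical): "(bkz: some title)" points to a title,
# while "(bkz: #12345)" points to a single entry and is skipped.
LINK_RE = re.compile(r"\(bkz: ([^)#][^)]*)\)")


def dump_edges_v2(conn: sqlite3.Connection) -> int:
    """Entry-driven scan: write (source, destination) title names unverified."""
    cur = conn.cursor()
    rows = cur.execute("SELECT TitleName, EntryText FROM Entries").fetchall()
    pairs = [
        (source, target.strip())
        for source, text in rows
        for target in LINK_RE.findall(text)
    ]
    # No lookup in Titles here: dangling destinations are written as-is
    # and left for the consumer of the table to filter out.
    cur.executemany(
        "INSERT INTO EksiEdgeData (SourceTitle, DestTitle) VALUES (?, ?)",
        pairs,
    )
    conn.commit()
    return len(pairs)
```

The scan reads the Entries table once and issues no per-link queries, at the cost of storing title names instead of integer IDs and of keeping edges whose destination may not exist.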




Fig. 5: ek$iEdgeDump versions 1 and 2 in action
