Conference Name Disambiguation

<title></title><div class="page" title="Page 1"><div class="section" style="background-color: rgb(100.000000%, 100.000000%, 100.000000%);"><div class="layoutArea"><div class="column"><p><span><b>The Problem:</b></span></p></div></div><div class="layoutArea"><div class="column"><p><span>We need to match the names in the canonical set of DBLP venues (around 10K) to the set of venues listed here: </span><span style="color: rgb(6.670000%, 33.330000%, 80.000000%);"></span><span>.</span><span style="color: rgb(6.670000%, 33.330000%, 80.000000%);"> </span><span>The output will be a table of entries of the form:</span></p><p><span>canonical venue,nickname1,nickname2,...</span></p><p><span>For example:</span></p><p><span>“EMNLP”, “empirical methods in NLP”, “EMNLP-CoNLL”,...<br/></span></p><p><span>Note that many of the entries in the attached input do not have a match in the Wikipedia list.</span></p><p><span><b>Challenges:</b></span></p><p><span>There are two aspects of this problem:</span></p><p><span>1. Matching venues and determining the right clusters.</span></p><p><span>2. Determining the canonical venue name for a given cluster.</span></p><p><span><b>Evaluation:</b></span></p><p><span>For matching, we will use a precision / recall metric based on a subset of venue clusters that we will label and audit.</span></p><p><span><b>Matching Precision</b>: </span><span>For each cluster of nicknames produced by the algorithm, Precision = (Number of correct pairs / Number of total pairs in the cluster). </span><span>Overall precision = Average precision across all audited clusters.<b> </b></span></p><p><span><b>Matching Recall</b>: We will manually label a set of 10 clusters as ground truth. For each labeled cluster, Recall = 1 if at least one cluster output by the algorithm contains all of its venues, and 0 otherwise. 
Overall recall = Average recall across the 10 clusters.</span></p><p><span>For determining the right canonical name, we will measure canonical name accuracy.</span></p><p><span><b>Canonical venue name accuracy</b> = (# of correct canonical names) / (# of clusters with precision = 1).<br/></span></p><p><span><b>Algorithm Ideas:</b></span></p><p><span>A successful solution will almost certainly involve extensive interaction with the data to develop appropriate heuristics. Because this data contains many acronyms, you may want to consider algorithms that convert full descriptions to acronyms (or rather to a set of possible acronyms in a heuristic fashion, given the sometimes creative nature of academic acronyms) and algorithms that calculate how likely a given acronym is to match a given name. For instance, in the example above, EMNLP should have a relatively strong match with “empirical methods in NLP”, a non-zero but smaller match with “EMNLP-CoNLL” due to the extra letters in the latter, and a very low or zero match with “AAAI”. A modified Levenshtein distance or other string matching techniques might be useful as well.</span></p><p>You can find the DBLP data at <a href=""></a> and the Wikipedia data at <a href=""></a>.</p><p><b>Sample algorithm:</b></p><p>A sample algorithm that does naive matching between strings can be seen at: <a href=""></a>. The algorithm is not optimized and takes about 5 minutes to match the input strings. The results of this basic algorithm can be seen at: <a href=""><font color="#9963ff"><span style="background-color: rgb(255, 255, 255);"></span></font><font color="#000000"><span style="background-color: rgb(255, 255, 255);"></span></font></a><a href=""></a><a title="data://Nilojyoti/dblp/naive_results.txt" class="filename ng-binding" href=""></a>. Adding acronym handling as described above should improve the accuracy.</p></div></div></div></div>
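<p>The acronym idea under Algorithm Ideas could be sketched roughly as follows. This is only an illustration under stated assumptions, not the sample algorithm referenced above: the function names <code>candidate_acronyms</code> and <code>acronym_match_score</code>, the stop-word list, and the rule that a name without whitespace is treated as an acronym are all hypothetical choices, and Python's <code>difflib.SequenceMatcher</code> stands in for the modified Levenshtein distance mentioned in the text.</p>

```python
import re
from difflib import SequenceMatcher

# Assumption: a small, hand-picked list of words that acronyms often skip.
STOPWORDS = {"of", "on", "in", "the", "and", "for", "to", "a", "an", "at"}


def candidate_acronyms(name):
    """Return a set of plausible acronyms for a full venue name."""
    words = re.findall(r"[A-Za-z0-9]+", name)

    def piece(word):
        # A token that is already all caps (e.g. "NLP") is kept whole;
        # any other token contributes only its first letter.
        if word.isupper() and len(word) > 1:
            return word
        return word[0].upper()

    with_all_words = "".join(piece(w) for w in words)
    without_stopwords = "".join(
        piece(w) for w in words if w.lower() not in STOPWORDS
    )
    return {a for a in (with_all_words, without_stopwords) if a}


def acronym_match_score(acronym, name):
    """Score in [0, 1] for how well `acronym` matches the venue `name`."""
    target = re.sub(r"[^A-Za-z0-9]", "", acronym).upper()
    if re.search(r"\s", name.strip()):
        # Multi-word description: compare against its candidate acronyms.
        candidates = candidate_acronyms(name)
    else:
        # Assumption: a name with no whitespace is itself acronym-like,
        # so compare against it directly after stripping punctuation.
        candidates = {re.sub(r"[^A-Za-z0-9]", "", name).upper()}
    candidates.discard("")
    if not target or not candidates:
        return 0.0
    # SequenceMatcher.ratio() is a similarity measure, not an edit distance.
    return max(SequenceMatcher(None, target, c).ratio() for c in candidates)


if __name__ == "__main__":
    for venue in ("empirical methods in NLP", "EMNLP-CoNLL", "AAAI"):
        print(venue, acronym_match_score("EMNLP", venue))
```

<p>On the three names from the example, this sketch ranks “empirical methods in NLP” highest, “EMNLP-CoNLL” lower but above zero, and “AAAI” at zero. Any threshold for accepting a match would still need to be tuned against the data.</p>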