Sunday, June 21, 2009

Plagiarism Detection Competition

The SEPLN'09 Workshop PAN, "Uncovering Plagiarism, Authorship and Social Software Misuse", ran an international "Plagiarism" Detection Competition this year and has recently published its results. I've put the word plagiarism in quotes, as my definition of plagiarism encompasses much more than just character sequence matching. Copies and near copies can perhaps be detected by a programming system, but whether something is plagiarism is a judgment that only a teacher can make, as there may be legitimate reasons for copies (they are part of properly quoted material), and structural plagiarism can exist where no exact copy can be found.

They have developed a massive corpus of English-language artefacts with documents of various sizes, various amounts and types of copying, and passages automatically translated from Spanish and German into English. They give the following statistics about their corpus:
  • Corpus size: 20 611 suspicious documents, 20 612 source documents.
  • Document lengths: small (up to paper size), medium, large (up to book size).
  • Plagiarism contamination per document: 0%-100% (higher fractions with lower probabilities).
  • Plagiarized passage length: short (few sentences), medium, long (many pages).
  • Plagiarism types: monolingual (obfuscation degrees none, low, and high), and multilingual (automatic translation).
They provide a development corpus in which the copied portions are annotated, so that researchers can train their systems; the competition corpus is, of course, without such annotations.
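
To make the training setup concrete, here is a minimal sketch (in Python) of how such offset annotations might be loaded; the XML layout, tag names, and attribute names below are my own illustrative assumptions, not the corpus's actual schema.

    # Sketch: load (offset, length) pairs for annotated plagiarized passages.
    # ASSUMPTION: one XML annotation file per suspicious document; the tag
    # and attribute names here are hypothetical, not the real corpus format.
    import xml.etree.ElementTree as ET

    def read_annotations(path):
        """Return (offset, length) pairs for the annotated passages."""
        passages = []
        for feature in ET.parse(path).getroot():
            if feature.get("name") == "plagiarism":  # hypothetical marker
                passages.append((int(feature.get("offset")),
                                 int(feature.get("length"))))
        return passages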

They calculate precision, recall, and granularity for each of the contestants at the character sequence level. Precision measures how many of the reported detections were correct. Recall measures how much of the plagiarism that was actually there was identified. Granularity measures how often a particular copy is flagged - this should be close to one, that is, any given copy should be found only once.
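
To make these measures concrete, here is a small Python sketch of how they could be computed at the character level. This is my reading of the definitions, not the organizers' exact formulas, and representing passages as (offset, length) pairs is an assumption on my part.

    # Sketch of character-level precision, recall, and granularity.
    # Passages are (offset, length) pairs; this mirrors my reading of the
    # measures, not necessarily the competition's exact formulas.

    def char_set(passages):
        """Expand (offset, length) passages into a set of character positions."""
        return {i for start, length in passages
                  for i in range(start, start + length)}

    def precision_recall(detections, actual):
        det, act = char_set(detections), char_set(actual)
        overlap = len(det & act)
        precision = overlap / len(det) if det else 0.0
        recall = overlap / len(act) if act else 0.0
        return precision, recall

    def granularity(detections, actual):
        """Average number of detections overlapping each detected passage;
        1.0 means every plagiarized passage is reported exactly once."""
        counts = []
        for start, length in actual:
            passage = set(range(start, start + length))
            hits = sum(1 for d in detections if passage & char_set([d]))
            if hits:
                counts.append(hits)
        return sum(counts) / len(counts) if counts else 1.0

    # One true passage reported twice in pieces: precision 1.0, recall 0.8,
    # granularity 2.0 (the passage was flagged twice).
    actual = [(100, 50)]
    detections = [(100, 20), (130, 20)]
    print(precision_recall(detections, actual))  # (1.0, 0.8)
    print(granularity(detections, actual))       # 2.0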

They split the competition into external copy identification (against a given, finite corpus, not the open Internet), in which matches with a given set of papers are to be found, and intrinsic plagiarism identification, in which stylistic analysis alone, without the use of any external documents, is to identify the plagiarisms.
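
For readers unfamiliar with the intrinsic approach, here is a toy illustration of the idea (my own sketch, not any contestant's method): slide a window over the document, compare each window's character-trigram profile to the profile of the whole document, and flag windows that deviate strongly. The window size, step, and threshold are arbitrary placeholder values.

    # Toy intrinsic detector: flag windows whose character-trigram profile
    # deviates strongly from the document's overall profile. Window, step,
    # and threshold values are arbitrary illustrative choices.
    from collections import Counter

    def trigram_profile(text):
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def dissimilarity(p, q):
        # Half the L1 distance between the normalized profiles (0.0 .. 1.0).
        total_p = sum(p.values()) or 1
        total_q = sum(q.values()) or 1
        return sum(abs(p[k] / total_p - q[k] / total_q)
                   for k in set(p) | set(q)) / 2

    def intrinsic_outliers(text, window=500, step=250, threshold=0.6):
        doc = trigram_profile(text)
        flagged = []
        for start in range(0, max(len(text) - window, 1), step):
            win = trigram_profile(text[start:start + window])
            if dissimilarity(win, doc) > threshold:
                flagged.append((start, window))  # suspicious passage
        return flagged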

The results are, as I expected, wildly different between external and intrinsic. I find the recall values most important - how many of the possible copies were found - although precision is also important, so that not too many false positives are registered.

The recall for the 10 systems doing the external identification ranges wildly between 1% and 69% of possible copies found. This corresponds with my results from 2008 with a small corpus of hand-made plagiarisms and hand-detection, in which we found a recall of between 20% and 75% (the ones finding nothing were disqualified in our test). The median recall in the competition is 49%, the average 45%, which validates my informal assertion that flipping a coin to decide if a paper is plagiarized is about as effective as running software over a digital version of the paper (of course, flipping a coin gives no indication as to which part is indeed plagiarized). Precision had a median of 63% and an average of 60%.

The intrinsic identification was quite different. Although the recall was good (median 51% and average 56%, with one of the four systems reaching 94%), the precision had a median of 15% and an average of 16%. The best system had only 23% correct answers - that means that over 3 in 4 passages identified as plagiarism by stylistic analysis were, in fact, incorrectly flagged. This has interesting ramifications for stylistic analysis.

The overall score (I am not sure exactly how this is computed) has, over all of the systems, a median of 32% and an average of 29% for recall, and a median of only 39% (average 28%) for precision.

I can identify only two of the authors as having written software that I have tested. The group from Zhytomyr State University, Ukraine, are the authors of Plagiarism Detector; this system was removed from our ranking for installing a trojan on systems using it, although its results gave it second place in my test (overall fourth place in this competition). I also tested WCopyFind, but this is a system for detecting collusion. Its recall was overall about 32%, but with less than 1% precision it generates a *lot* of false positives!

I applaud the competition organizers for this very valuable competition, and I especially applaud them for making their results and the corpus available online. I'll download the corpus when I get my new laptop; I currently only seem to have 7 GB free :)

4 comments:

  1. Maybe you know where I can get the data (the set of documents) they used in the competition? :)

  2. I didn't know where to post this. The prof was attempting to teach students about plagiarism, found examples in the weekly paper, and then...

    UToledo prof finds unattributed work in Toledo Free Press, informs publisher & president - job threatened: http://bit.ly/RQTh0

  3. Hello,

I'm one of the participants in the PAN09 competition (the Plagiarism Detector project) and would like to disagree with Prof. Debora Weber-Wulff in relation to this post.

Again, I'm really sorry to see the scepticism you show in relation to the PAN09 theoretical basis.

As a matter of fact, I taught students that there are always two attitudes you can take towards a problem. One: you can criticize and be negative, saying why it's bad and why it can't be done that way; or you can be constructive and positive, pointing to the possible problems and the ways they can be solved.

I wish Prof. Debora Weber-Wulff had taken an active part in the preparation of the PAN09 competition, given a definition of "structural plagiarism", pointed to the possible issues of the "artificial plagiarism" problem, and helped to overcome it. The Google Groups discussions had been open long before the competition even started - http://groups.google.com/group/pan09-competition

There has been a lot of discussion about the problem of "artificial plagiarism": http://groups.google.com/group/pan09-competition/browse_thread/thread/e3645557c5eec011

And a much more serious problem was discovered later.

One more thing to say is that in relation to intrinsic plagiarism detection, the term "stylistic analysis" is not correct - it does not reflect the whole wide range of algorithms that are used in it, or even part of it!

The following algorithms were used in the competition apart from "stylistic analysis":

    1. Statistical term frequency analysis.
    2. TF-IDF analysis.
    3. General language melodics analysis.
    4. Keyword density analysis.
    5. Meta-information analysis at the text and/or sentence level.
    And I believe many others!

I strongly believe that thanks to such scientific occasions you will discover that software will outperform a human at plagiarism detection, and I have a lot of arguments to put forward.

I bet anyone who takes a closer look will be surprised at how software can solve plagiarism-related tasks!

I fear that you do not quite understand the scope and the main idea of the PAN09 competition. It should not be treated as a "universal panacea for plagiarism" in any way. You can safely remove the inverted commas from your statement about plagiarism.

    - It is a really fantastic opportunity to measure, without bias, the effectiveness of plagiarism detection algorithms and software!
    - It is the first time a mathematical model has been developed to assess the results of plagiarism detection.
    - It is a unique, complex framework being developed that allows improvements of the plagiarism detection systems built on it.
    - It is statistically valid (extremely valid!) research.
    - It is a solution to the extremely hot, existing problem of automatic content originality checking. It cannot be related only to academic plagiarism (!!!) - this is a very important thing! This research is valuable in the field of high-load search systems such as Google or Yahoo.

    I more than highly appreciate the work of Martin Potthast and the organizers of the Competition, and look forward to meeting them in September!

P.S. I seriously doubt that downloading the plagiarism corpus will bring any benefit. Without the software that processes the corpus, it is absolutely useless :-) for any kind of academic interest.
    --
Best regards, J.A. Palkovskii [using my colleague's account :-)]


Please note that I moderate comments. Any comments that I consider unscientific will not be published.