Posts by richard
-
Jeremy -- respect! May go to the library for that one.
-
@David
On the number of holes, Jolisa is not digging them at random but is following her highly trained nose... I think you are right that your procedure would eventually give you a reliable estimate of the amount of plagiarized material -- however, it is likely to undercount in two ways. First, not all potentially plagiarized material is known to Google; second, some of the material is disguised by small changes in wording and tense and thus will not return a clear hit.
But we are already at 0.5% (since we are effectively sampling without replacement, in that there is no point in digging two holes in the same place) so it can only get worse.
The real issue is how much you need to take to be a thief.
On the number of novels, try this. I would guess that you can choose the first word of a sentence at random from the entire corpus.
Then (I again guess) there are half as many ways to choose the second word, and half as many again to choose the third word, etc, once the first words are fixed. That means given an M word corpus, an N word sentence can be formed in
M^N /(2^0 2^1 2^2 ... 2^(N-1) )
ways. Looking at the denominator, and summing the exponents, you have 2^(N(N-1)/2 ) -- so for large enough N the number of legal sentences you can form will start to go down (and I guess I am ignoring sentences with conjunctions here -- these should just be "atomic" sentences).
For M=50,000 and N=10, I think that amounts to 2.7 x 10^33 legal 10 word sentences. If you want to compose a 100,000 word novel from 10 word sentences, you could combine these sentences (with or without replacement, it would make little difference) in
(2.7 x 10^33)^10000
ways. This is roughly 10^334000 possible novels.
This is 10^135000 times smaller than David's estimate, but still insanely huge. On the other hand, my estimate would go up if I allowed a mix of sentence sizes. And David and I only computed the number of novels with exactly 100,000 words.
On the third hand, two novels with one different word between them count twice in both of our measures -- so that would go in the other direction from extending the computation to include novels of arbitrary length.
Ok, back to real work.
(Mistakes with powers of ten and typing to be expected. But it seems like a reasonable approach).
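For anyone who wants to check the powers of ten, here is a minimal sketch of the arithmetic above, done in logarithms so nothing overflows. It assumes the same guesses as the estimate (the number of choices halves with each successive word, and the novel is a sequence of fixed-length sentences chosen with replacement); the function names are just illustrative.

```python
import math


def log10_sentences(m: int, n: int) -> float:
    """log10 of M^N / 2^(N(N-1)/2): the count of N-word sentences
    from an M-word corpus under the 'halve the choices each word' guess."""
    return n * math.log10(m) - (n * (n - 1) / 2) * math.log10(2)


def log10_novels(m: int, n: int, novel_words: int) -> float:
    """log10 of the number of ordered sequences of N-word sentences
    (chosen with replacement) that fill a novel of novel_words words."""
    sentences_per_novel = novel_words // n
    return sentences_per_novel * log10_sentences(m, n)


M, N, NOVEL_WORDS = 50_000, 10, 100_000

print(f"legal {N}-word sentences: about 10^{log10_sentences(M, N):.2f}")
print(f"possible novels: about 10^{log10_novels(M, N, NOVEL_WORDS):.0f}")

# Each extra word multiplies the count by M / 2^N, so the count
# stops growing once 2^N exceeds M.
print(f"sentence count stops growing near N = {math.log2(M):.1f}")
```

The last line makes the "start to go down" remark concrete: under these assumptions the count of legal sentences only begins to shrink once 2^N passes M.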
-
+ permute? Is that a word?
It certainly is where I work. It is the result of verbing permutation, as Calvin would say.
-
Even if you don't want to do the math (and I must admit, I did it just to warm my fingers up this morning), try taking random snippets of text and dropping them into Google with quotes round them, and see how many "random" matches you get -- and once you get beyond five or six words, with at least one "rare" word, my guess is that most such strings do not produce any accidental matches.
e.g. Then you can foucault with the rest of us
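If you would rather not pick the snippets by hand, something like the following would do it. This is only a sketch of the sampling step: it pulls random runs of consecutive words out of a text and wraps them in quotes ready for pasting into a search box. It does not talk to any search engine, and the function name and parameters are my own invention.

```python
import random
from typing import List, Optional


def quoted_snippets(text: str, length: int = 6, count: int = 5,
                    seed: Optional[int] = None) -> List[str]:
    """Return `count` random runs of `length` consecutive words from
    `text`, each wrapped in double quotes for an exact-phrase search."""
    words = text.split()
    if len(words) < length:
        return []  # text too short to yield a snippet of this length
    rng = random.Random(seed)
    last_start = len(words) - length
    starts = [rng.randrange(last_start + 1) for _ in range(count)]
    return ['"' + " ".join(words[s:s + length]) + '"' for s in starts]


sample = ("It certainly is where I work. It is the result of verbing "
          "permutation, as Calvin would say.")
for query in quoted_snippets(sample, length=6, count=3, seed=0):
    print(query)
```

Whether a given snippet contains a "rare" word is left to the eye; the code makes no attempt to judge that.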
But at this point I am uncomfortably reminded of news footage showing police digging up a serial killer's back yard. If every second or third hole they dig turns up something nasty, at what point can they be sure that they have discovered everything that is there to be found...
[Not that I am implying that plagiarism is akin to serial killing -- but anyone who cares about it (Ihimaera's employers for instance?) might be well advised to make a systematic search, rather than simply assuming that they already had a complete accounting of the "borrowings"]
-
And what about the plagiarised artist's opinions of the plagiaree? Do they matter? Cos you did say yourself in your blog piece that it pissed you off when someone else nicked your upside-down NZ idea.
Homage to Ian Brackenberry Channell, presumably?
-
You have to admit though Russell, this "punishment" does have a whiff of the briar patch about it.
-
And then there's Kaavya Viswanathan, who borrowed phrases and words from several different sources. She fared less well.
Indeed. There is something almost biblical about the term "pulped", is there not?
-
Auckland University's investigation was indeed impressively quick and efficient.
Especially since it cannot be known whether the list of unattributed passages is complete -- given that we are told that the copying was unintentional, no-one can know for sure whether the list they looked at is exhaustive.
-
What is it about "White Pride" advocates that always makes you think that if these sad clowns really are the ubermensch, then the world is very likely headed to hell in a handbasket?
-
The new iMac does look nice.
I am also amused by the new Mini server -- I actually have a Mac mini that runs OS X server that serves my research group's website and an internal wiki for file and code sharing. Lovely little thing and it perches cheerfully on the corner of my desk.