geek-y-ness
Jun. 17th, 2009 03:16 pm
Okay, now I talk corpus. At work, we compile a text corpus, that is, a collection of texts in the variety of English spoken in Ghana. It is amazing! It is cool! It is funny! ... It is sometimes verrrrry difficult.
Allow me to introduce our corpus work in a few sentences:
We scan or type up texts (school books, novels, newspaper reports etc.) and tag them. Headlines get <h></h> around them, paragraphs <p></p>, and so on and so forth. That's easy. It gets difficult when we have to decide what to include in the corpus and what not. For example, we have a quote mark-up (<quote></quote>, who would have thought?) for words and sentences which were not originally written by the author. We would not want to include a universal translation of an utterance by Socrates in our corpus of Ghanaian English, would we? We keep such sentences in the text, but we make up for them by including some more Ghanaian English words. This all has to do with word counts, really: we have different categories of text genres, and each category includes such and such many texts of two thousand words each (that's the important part for the <quote> stuff) to finally get one million words altogether. Speaking of which, another example of the importance of the <quote> tag is our massive bulk of self-help books, which consist 50% of Bible quotes. If we took only these books and did not kick out the word count of the Bible quotes, we would have a half-Bible, half-Ghanaian-English corpus. That's why we mark Bible quotes, for example, as so-called extra-corpus material and do not include their word counts in the corpus itself.
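For the fellow geeks, here is a minimal sketch in Python of how such a word count could work. It is not our actual tooling: the function name corpus_word_count and the sample sentence are made up for illustration, and it assumes that <quote> tags are well-formed and never nested inside each other.

```python
import re

# Hypothetical helpers for illustration; only the tag names come from the post.
QUOTE_RE = re.compile(r"<quote>.*?</quote>", re.DOTALL)  # extra-corpus material
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")                # any remaining markup


def corpus_word_count(marked_up_text: str) -> int:
    """Count words, skipping everything inside <quote>...</quote>."""
    # Drop the quoted passages first (assumes <quote> is not nested).
    without_quotes = QUOTE_RE.sub(" ", marked_up_text)
    # Strip the remaining tags so they are not counted as words.
    plain = TAG_RE.sub(" ", without_quotes)
    return len(plain.split())


# Made-up sample sentence, not taken from the corpus.
sample = "<p>The pastor said <quote>Love thy neighbour</quote> and sat down.</p>"
print(corpus_word_count(sample))
```

The quoted words stay in the text itself; they are only left out of the count, which is the whole point of marking them as extra-corpus material.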
With this little introduction to corpus work, on to my little confusing text passage here.
Right now, I'm finding my way through a handwritten exam on modern poetry (which is nothing compared to the pain-in-the-ass biochemistry exams, really...) and am standing dumbfounded before the saying "Dulce et decorum est". It is not Ghanaian English, that's for sure. But there are still so many possibilities:
1. It is a sentence in a foreign language which deserves the <foreign> tag.
2. It is a syntactically complete sentence, which also deserves the <X> tag, but it is integrated into the Ghanaian student's ongoing sentence, which normally only gets the <quote> tag.
3. The full problem is that the original passage reads "the saying 'Dulce et decorum est'", which actually deserves an additional <mention> tag, too.
This is so not funny. U., N. and I will break our brains talking about this together later. We can apply multiple tags to such sentences, but at a certain point it gets ridiculous to tag, tag, and tag them further. I'm curious which tag we will leave out.
And you know what? I am so thankful that Latin is not an indigenous language of West Africa. Because we have an <indig> tag as well...