DeepDive Application for Authorship Attribution
May 21
1. Run toy_6 (Tim, Zifei)
- paper-level single regressor?
- what rules / features help the most
2. space of problems (Denny)
3. Create 3' (Denny)
- run on that (Zifei)
- Hard voting?
- Tune weights?
- Tune features?
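The hard-voting and weight-tuning questions above could be prototyped with a few lines of plain Python before committing to a framework. A minimal sketch (function names and the example weights are placeholders, not anything from the project code):

```python
from collections import Counter

def hard_vote(predictions):
    """Majority vote over per-classifier label lists.

    predictions: one label sequence per classifier, all the same length.
    Returns one winning label per example (ties broken by first vote seen).
    """
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions)]

def weighted_vote(predictions, weights):
    """Same idea, but each classifier's vote counts with its weight."""
    winners = []
    for votes in zip(*predictions):
        tally = Counter()
        for w, label in zip(weights, votes):
            tally[label] += w
        winners.append(tally.most_common(1)[0][0])
    return winners
```

Tuning weights then reduces to a small search over the `weights` vector against a held-out set.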
X - Balance negative examples
X - adopt multinomial variables
X - evaluation based on Tim's holdout & reuse script
X - Word shape (frequency of words with different combinations of upper & lower case letters)
? - word length distribution (characters per word)
X - Function word frequencies (word n-gram might capture a lot of content)
X - number of spelling errors
? - Parse tree features (e.g. frequency of pair (A,B) where A is parent of B in parse tree)
! - synonyms
! - punctuation
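The word-shape feature in the checklist above could be sketched as follows. The shape encoding here (upper → `X`, lower → `x`, digit → `d`, other characters kept) is one common convention, chosen as an assumption, not taken from the project code:

```python
from collections import Counter

def word_shape(word):
    """Collapse a token to its case/digit shape, e.g. 'iPhone6' -> 'xXxxxxd'."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)  # punctuation etc. kept verbatim
    return "".join(shape)

def shape_frequencies(tokens):
    """Relative frequency of each word shape in a token list."""
    counts = Counter(word_shape(t) for t in tokens)
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}
```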
Content words are words that have meaning. They are words we would look up in a dictionary, such as "lamp," "computer," "drove." New content words are constantly added to the English language; old content words constantly leave the language as they become obsolete. Therefore, we refer to content words as an "open" class.
Also see Wikipedia.
NN* (nouns) JJ* (adjs) RB* (adverbs) VB* (verbs)
Function words are words that exist to express grammatical or structural relationships into which the content words may fit. Words like "of," "the," and "to" have little meaning on their own. They are far fewer in number and generally do not change as English adds and omits content words. Therefore, we refer to function words as a "closed" class.
Pronouns, prepositions, conjunctions, determiners, qualifiers/intensifiers, and interrogatives are some function parts of speech.
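Because the class is closed, function-word frequencies can be computed against a fixed word list. A minimal sketch (the ten-word `FUNCTION_WORDS` list here is a tiny illustrative sample; a real list would have a few hundred entries):

```python
# Tiny sample list for illustration only; real stylometry lists are much longer.
FUNCTION_WORDS = ["the", "of", "to", "and", "a", "in", "that", "it", "is", "was"]

def function_word_profile(tokens):
    """Relative frequency of each function word, as a fixed-length profile.

    Every list word gets an entry (0.0 if absent), so profiles from
    different documents are directly comparable as feature vectors.
    """
    lowered = [t.lower() for t in tokens]
    n = len(lowered) or 1  # avoid division by zero on empty input
    return {w: lowered.count(w) / n for w in FUNCTION_WORDS}
```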
Features
- Length
- Number of characters, words, sentences, paragraphs, sections
- How many words per sentence etc
- Vocabulary richness (Yule's K? used in Arvind's paper)
- Number of words used only once in the text [Could be extended to a distribution]
! - Word shape (frequency of words with different combinations of upper & lower case letters)
! - word length distribution (characters per word)
- Character n-gram frequencies (all chars are interesting: letters, digits, punctuation, special characters), at least for n=1
! - Function word frequencies (word n-gram might capture a lot of content)
! - number of spelling errors
- POS tag ngrams
! - Parse tree features (e.g. frequency of pair (A,B) where A is parent of B in parse tree)
- Features from dependency parse
! - synonyms
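Several of the features above (hapax legomena count, word length distribution, Yule's K as the vocabulary-richness measure) can be computed from one token frequency table. A sketch, assuming the standard formula K = 10^4 · (Σᵢ i²·Vᵢ − N) / N², where Vᵢ is the number of distinct words occurring exactly i times and N is the token count:

```python
from collections import Counter

def stylometric_features(tokens):
    """Hapax count, word-length distribution, and Yule's K from a token list."""
    n = len(tokens)
    freqs = Counter(tokens)          # word -> occurrence count
    vi = Counter(freqs.values())     # i -> number of words occurring exactly i times
    yule_k = 1e4 * (sum(i * i * v for i, v in vi.items()) - n) / (n * n)
    lengths = Counter(len(t) for t in tokens)
    return {
        "hapax_count": vi.get(1, 0),                       # words used only once
        "word_length_dist": {L: c / n for L, c in lengths.items()},
        "yule_k": yule_k,
    }
```

On larger texts Yule's K is roughly length-invariant, which is why it is a popular richness measure for attribution.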