For my current dev project I need a word frequency list derived from anime. Since the project builds on the Anki decks found on https://japanesedecks.blogspot.com, I computed the list from the content of these decks. To reduce each word to its dictionary form I used MeCab + IPAdic and a C++ port of Jisho.org’s parser.
The list is found in this spreadsheet.
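For readers who want to reproduce the counting step, here is a minimal sketch in Python. It is not my actual pipeline (that one is C++ and also uses the Jisho.org parser port). It only assumes that you already have MeCab's default text output with an IPAdic-style feature list, where the dictionary form sits in the seventh comma-separated field. The names `count_base_forms`, `BASE_FORM_FIELD` and `SAMPLE` are made up for illustration, and the sample lines are hand-written in that layout rather than taken from the decks.

```python
from collections import Counter

# Assumption: IPAdic-style features, dictionary form in the 7th field (index 6).
BASE_FORM_FIELD = 6


def count_base_forms(mecab_output: str) -> Counter:
    """Count dictionary forms in MeCab-style 'surface<TAB>features' lines."""
    counts = Counter()
    for line in mecab_output.splitlines():
        if not line or line == "EOS":
            continue  # blank line or sentence boundary
        surface, sep, features = line.partition("\t")
        if not sep:
            continue  # not a token line
        fields = features.split(",")
        base = fields[BASE_FORM_FIELD] if len(fields) > BASE_FORM_FIELD else "*"
        # Tokens without a dictionary form carry "*"; fall back to the surface.
        counts[surface if base == "*" else base] += 1
    return counts


# Hand-written sample for "食べた" and "食べる" (illustrative only).
SAMPLE = (
    "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
    "食べる\t動詞,自立,*,*,一段,基本形,食べる,タベル,タベル\n"
    "EOS\n"
)

if __name__ == "__main__":
    for word, n in count_base_forms(SAMPLE).most_common():
        print(word, n)
```

Both the inflected 食べ and the plain 食べる are counted under 食べる, which is what counting by dictionary form is meant to achieve.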
Here’s a list of bias and error sources:
- The data source is limited: only ~2 million sentences were used. Since I’m mostly interested in the top 2k words, that’s sufficient for now.
- Parsing errors often yield wrong words. They usually occur when the text reflects slang and sloppy pronunciation rather than standard Japanese. An example is “く” at #54, which mostly results from “ったく”.
- The parser operates on dictionary entries (ipadic and jmdict). Since the dictionaries contain not only words but also many expressions, the list contains expressions as standalone entries, for example ような instead of よう な (with space).
- Names are not parsed correctly, because they are not part of the dictionaries. They often result in individual, “dangling” kanji with strange frequencies.
- Ambiguous spelling can result in separate entries (i.e. kana & kanji spelling counted separately).
- Verb stems used as nouns are often counted separately.
- The passive form of a verb sometimes has its own dictionary entry.
- Compound nouns are counted separately (not a real bias, but still something to be aware of): nakigao = naku + kao; natsuyasumi = natsu + yasumi; ekiben = eki + bento.
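
To make the effect of split spellings concrete, here is a small Python sketch of how separately counted kana and kanji entries could be folded together after the fact. The list itself does not do this. The counts are made up, and the `canonical` mapping and the `merge_variants` helper are hypothetical: you would have to supply such a mapping yourself, for example from a dictionary.

```python
from collections import Counter


def merge_variants(counts: Counter, canonical: dict) -> Counter:
    """Add up counts of spellings that map to the same canonical entry."""
    merged = Counter()
    for word, n in counts.items():
        # Words without a mapping keep their own entry.
        merged[canonical.get(word, word)] += n
    return merged


# Made-up counts: the same word appears once in kanji and once in kana.
raw = Counter({"事": 120, "こと": 300, "時": 200})
canonical = {"こと": "事"}  # hypothetical hand-made mapping

merged = merge_variants(raw, canonical)

if __name__ == "__main__":
    print(raw.most_common())
    print(merged.most_common())
```

In this toy data the kanji spelling alone ranks below 時, while the merged entry ranks above it, so split spellings can shift ranks noticeably.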