Import ftfy and use its
uncurl_quotes method to turn curly quotes into
straight ones, providing consistency with multiple forms of apostrophes.
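For illustration, here is roughly what that ftfy function does on its own (a minimal sketch; ftfy's fixes module exposes uncurl_quotes in recent releases):

```python
# Sketch: ftfy's uncurl_quotes turns curly apostrophes and quotation marks
# into their straight ASCII equivalents.
from ftfy.fixes import uncurl_quotes

print(uncurl_quotes("It’s a “quote”"))  # -> It's a "quote"
```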
Set minimum version requirements on our dependencies so that tokenization will give consistent results.
Work around an inconsistency in the msgpack API.
When tokenizing Japanese or Korean, MeCab's dictionaries no longer have to
be installed separately as system packages. They can now be found via Python packages.
When the tokenizer had to infer word boundaries in languages without spaces, inputs that were too long (such as the letter 'l' repeated 800 times) were causing overflow errors. We changed the sequence of operations so that it no longer overflows, and such inputs simply get a frequency of 0.
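A hedged sketch of that behavior, using wordfreq's word_frequency function:

```python
# An absurdly long spaceless input should no longer raise an overflow error;
# it is simply expected to come back with a frequency of 0.
from wordfreq import word_frequency

print(word_frequency('l' * 800, 'zh'))  # expected: 0.0
```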
The Exquisite Corpus data has been updated to include Google Books Ngrams 2019, Reddit data through 2019, Wikipedia data from 2020, and Twitter-sampled data from 2020, and somewhat more reliable language detection.
Updated dependencies to require recent versions, to get tokenization that's consistent with the word lists. The regex dependency now requires a version after 2020.04.04.
Relaxing the dependency on regex had an unintended consequence in 2.3.1: it could no longer get the frequency of French phrases such as "l'écran", because their tokenization changed depending on the version of regex.
2.3.2 fixes this with a more complex tokenization rule that should handle apostrophes the same way across these various versions of regex.
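An illustrative check of the fix (the exact value isn't guaranteed, but it should be nonzero again):

```python
# After the 2.3.2 fix, a French phrase with an elided article can be looked
# up again and is expected to have a frequency greater than 0.
from wordfreq import word_frequency

print(word_frequency("l'écran", 'fr'))
```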
Python 3.5 is the oldest maintained version of Python, and we have stopped claiming support for earlier versions.
Updated to langcodes 2.0.
Removed the match_cutoff parameter, which was intended for situations where we need to approximately match a language code, but was not usefully configurable in those situations.
Relaxed the version requirement on the 'regex' dependency, allowing compatibility with spaCy.
The range of regex versions that wordfreq now allows is from 2017.07.11 to 2018.02.21. No changes to word boundary matching were made between these versions.
Fixed a call to msgpack.load that used a deprecated parameter.
Updated the data from Exquisite Corpus to filter the ParaCrawl web crawl better. ParaCrawl provides two metrics (Zipporah and Bicleaner) for the goodness of its data, and we now filter it to only use texts that get positive scores on both metrics.
The input data includes the change to tokenization described above, giving us word frequencies for words such as "l@s".
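A hedged example: the language isn't specified above, so the Spanish wordlist here is an assumption, and the exact value isn't guaranteed.

```python
# With the updated data, an inclusive spelling like "l@s" is expected to
# have a nonzero frequency (assuming Spanish for this illustration).
from wordfreq import word_frequency

print(word_frequency('l@s', 'es'))
```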
The output of
word_frequency is rounded to three significant digits. This
provides friendlier output, and better reflects the precision of the
underlying data anyway.
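For example (the value shown is illustrative, not exact):

```python
# Results from word_frequency come back rounded to three significant digits.
from wordfreq import word_frequency

print(word_frequency('the', 'en'))  # e.g. 0.0537 rather than 0.0536964...
```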
The MeCab interface can now look for Korean and Japanese dictionaries in
/usr/lib/x86_64-linux-gnu/mecab, which is where Ubuntu 18.04 puts them
when they are installed from source.
Fixed edge cases that inserted spurious token boundaries when Japanese text is run through simple_tokenize, because of a few characters that don't match any of our "spaceless scripts".
It is not a typical situation for Japanese text to be passed through
simple_tokenize, because Japanese text should instead use the
Japanese-specific tokenization in wordfreq.mecab.
However, some downstream uses of wordfreq have justifiable reasons to pass all terms through simple_tokenize, even terms that may be in Japanese, and in
those cases we want to detect only the most obvious token boundaries.
In this situation, we no longer try to detect script changes, such as between kanji and katakana, as token boundaries. This particularly allows us to keep together Japanese words where ヶ appears between kanji, as well as words that use the iteration mark 々.
This change does not affect any word frequencies. (The Japanese word list uses
wordfreq.mecab for tokenization, not simple_tokenize.)
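A hedged illustration of the new behavior; the exact output isn't guaranteed, but ヶ should no longer split the word:

```python
# A name where ヶ appears between kanji is expected to stay together as a
# single token when passed through simple_tokenize.
from wordfreq import simple_tokenize

print(simple_tokenize('霞ヶ関'))  # expected: ['霞ヶ関']
```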
The big change in this version is that text preprocessing, tokenization, and postprocessing to look up words in a list are separate steps.
If all you need is preprocessing to make text more consistent, use
wordfreq.preprocess.preprocess_text(text, lang). If you need preprocessing
and tokenization, use
wordfreq.tokenize(text, lang) as before. If you need
all three steps, use the new function wordfreq.lossy_tokenize(text, lang).
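A quick sketch of the three steps using the function names above; the outputs in the comments are illustrative rather than exact:

```python
from wordfreq.preprocess import preprocess_text
from wordfreq import tokenize, lossy_tokenize

text = "It's CAFÉ time"
print(preprocess_text(text, 'en'))  # preprocessing only, e.g. "it's café time"
print(tokenize(text, 'en'))         # preprocessing + tokenization
print(lossy_tokenize(text, 'en'))   # all three steps, as used for lookups
```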
As a breaking change, this means that the
tokenize function no longer has the combine_numbers option, because that's a postprocessing step. For
the same behavior, use
lossy_tokenize, which always combines numbers.
tokenize will no longer replace Chinese characters with their
Simplified Chinese version, while lossy_tokenize will.
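A hedged contrast between the two functions, using the Traditional character 愛 (Simplified 爱) as an example; exact outputs may differ slightly:

```python
# tokenize preserves the Traditional character; lossy_tokenize is expected
# to map it to its Simplified form.
from wordfreq import tokenize, lossy_tokenize

print(tokenize('愛', 'zh'))        # expected: ['愛']
print(lossy_tokenize('愛', 'zh'))  # expected: ['爱']
```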
There's a new default wordlist for each language, called "best". This chooses the "large" wordlist for that language, or if that list doesn't exist, it falls back on "small".
The wordlist formerly named "combined" (this name made sense long ago) is now named "small". "combined" remains as a deprecated alias.
The "twitter" wordlist has been removed. If you need to compare word frequencies from individual sources, you can work with the separate files in exquisite-corpus.
Tokenizing Chinese will preserve the original characters, no matter whether they are Simplified or Traditional, instead of replacing them all with Simplified characters.
Different languages require different processing steps, and the decisions about what these steps are now appear in the wordfreq.language_info module, replacing a bunch of scattered and inconsistent special-case logic.
Tokenizing CJK languages while preserving punctuation now has a less confusing implementation.
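A hedged sketch, assuming tokenize still accepts an include_punctuation flag:

```python
# CJK text can be tokenized with punctuation kept; the '！' is expected to
# appear as its own token rather than being dropped.
from wordfreq import tokenize

print(tokenize('谢谢你！', 'zh', include_punctuation=True))
```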
The preprocessing step can transliterate Azerbaijani, although we don't yet have wordlists in this language. This is similar to how the tokenizer supports many more languages than the ones with wordlists, making future wordlists possible.
Speaking of that, the tokenizer will log a warning (once) if you ask to tokenize text written in a script we can't tokenize (such as Thai).
New source data from exquisite-corpus includes OPUS OpenSubtitles 2018.
Nitty gritty dependency changes:
Updated the regex dependency to 2018.02.21. (We would love suggestions on how to coexist with other libraries that use other versions of regex, without a >= requirement that could introduce unexpected data-altering changes.)
We now depend on
msgpack, the new name for msgpack-python.
Depend on langcodes 1.4, with a new language-matching system that does not depend on SQLite.
This avoids silly conflicts in which langcodes' SQLite connection was preventing langcodes from being used in threads.