粵典 (words.hk) data sets
Counts the number of occurrences of each character in the corpus, including non-CJK characters, and returns a JSON dictionary.
Useful for measuring the usage frequency of characters (not words) in the corpus.
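The counting step can be sketched roughly as follows. This is a minimal illustration, not the code used by words.hk: the function name `count_characters` and the choice to skip whitespace are assumptions made for the example.

```python
import json
from collections import Counter


def count_characters(articles):
    """Count every character across the given article texts.

    Hypothetical helper: `articles` is any iterable of strings.
    Non-CJK characters are counted too; whitespace is skipped
    (an assumption, the real data set may treat it differently).
    """
    counts = Counter()
    for text in articles:
        counts.update(ch for ch in text if not ch.isspace())
    # ensure_ascii=False keeps CJK characters readable in the JSON output
    return json.dumps(counts, ensure_ascii=False)


if __name__ == "__main__":
    print(count_characters(["我哋食飯", "我 ok"]))
```

Because the count is per character, no word segmentation is needed.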
Data License: public domain. Credits to words.hk appreciated.
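Queries the database for all word representations and counts the number of occurrences of each one, without attempting any segmentation of the article content.
Useful for measuring the usage frequency of the words that already exist in the database.
Note that this list only includes words that have been seen in the corpus (粵文庫); words that are recorded in 粵典 but never appear in the corpus are not included.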
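One way to count known words without segmentation is plain substring matching. The sketch below assumes that approach; `count_known_words` is a hypothetical name, and `str.count` only counts non-overlapping matches of the same word.

```python
from collections import Counter


def count_known_words(words, articles):
    """Count how often each known word occurs as a substring.

    Hypothetical helper: `words` stands in for the word
    representations from the database, `articles` for the corpus texts.
    Words that never occur are left out of the result.
    """
    counts = Counter()
    for text in articles:
        for word in words:
            if not word:
                continue  # an empty string would match everywhere
            n = text.count(word)
            if n:
                counts[word] += n
    return dict(counts)


if __name__ == "__main__":
    print(count_known_words(["食飯", "食", "飲茶"], ["我食飯,你食飯未?"]))
```

With no segmentation, a short word such as 食 is also counted inside a longer word such as 食飯, so the counts of overlapping words are not independent.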
Data License: public domain. Credits to words.hk appreciated.
Queries the database for the Jyutping of every word and dumps it out character by character when it seems valid.
Can be used as {character => Jyutping} data.
The list does not contain the character variants that the system recognizes.
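A per-character dump of this kind might look like the sketch below. The validity rule is an assumption, since the source only says "if it seems valid": here a pair is kept when the number of syllables equals the number of characters and every syllable looks like lowercase letters followed by a tone digit 1-6. The name `character_readings` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed shape of a Jyutping syllable: letters plus a tone number 1-6.
JYUTPING_SYLLABLE = re.compile(r"^[a-z]+[1-6]$")


def character_readings(entries):
    """Map each character to the sorted list of syllables seen for it.

    Hypothetical helper: `entries` is an iterable of
    (word, space-separated Jyutping) pairs.
    """
    readings = defaultdict(set)
    for word, jyutping in entries:
        syllables = jyutping.split()
        if len(syllables) != len(word):
            continue  # cannot align characters with syllables
        if not all(JYUTPING_SYLLABLE.match(s) for s in syllables):
            continue  # does not look like Jyutping
        for ch, syllable in zip(word, syllables):
            readings[ch].add(syllable)
    return {ch: sorted(s) for ch, s in readings.items()}


if __name__ == "__main__":
    print(character_readings([("食飯", "sik6 faan6"), ("長", "coeng4"), ("長", "zoeng2")]))
```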
Data License: public domain. Credits to words.hk appreciated.
Contains the word list and pronunciations of all entries recorded in the dictionary. The result is a dictionary (pun not intended) with written forms as keys and the list of their possible pronunciations as values. Words that contain variants (異體字) are marked with a "*".
Data License: public domain. Credits to words.hk appreciated.
This data set exposes the index used to search English words in the dictionary. The index is built by mapping the words seen in each entry's English explanation to its Cantonese word.
This may be useful for English->Cantonese translation purposes.
The "English" terms are normalized (to US spelling variant using https://github.com/en-wl/wordlist/blob/master/varcon/README ). If they are prefixed with "!" it means that they are stemmed with PorterStemmer (see http://www.tartarus.org/~martin/PorterStemmer for implementations).
The score in each entry is an estimate of how important the English term is to the definition of the Cantonese word, computed with some form of tf–idf. The formula is #magic and will probably change anyway. Scores range from 0 to 100, but we have limited the dataset to scores >40 to reduce noise.
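Since the actual formula is not published, the following is only a generic tf–idf sketch of how such an index could be scored and filtered. Everything in it is an assumption: the function name `build_english_index`, the smoothed idf, and the scaling that gives the top term of each entry a score of 100. Spelling normalization and Porter stemming are left out.

```python
import math
import re
from collections import Counter


def build_english_index(entries, min_score=40.0):
    """Build {english_term: [(cantonese_word, score), ...]}.

    Hypothetical helper: `entries` maps each Cantonese word to its
    English explanation. This is NOT the words.hk formula.
    """
    tokenized = {w: re.findall(r"[a-z]+", text.lower()) for w, text in entries.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many explanations each term appears.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    index = {}
    for word, tokens in tokenized.items():
        if not tokens:
            continue
        term_freq = Counter(tokens)
        raw = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        }
        best = max(raw.values())
        for term, value in raw.items():
            score = round(100.0 * (value / best), 1)  # scale to 0-100 per entry
            if score > min_score:
                index.setdefault(term, []).append((word, score))
    return index


if __name__ == "__main__":
    sample = {"貓": "cat; a small domesticated animal", "狗": "dog; a domesticated animal"}
    for term, hits in sorted(build_english_index(sample).items()):
        print(term, hits)
```

Raising `min_score` trades recall for precision: terms shared by many explanations drop out first because their idf is lower.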
Data License: public domain. Credits to words.hk appreciated.