words.hk (粵典) data

** Download words.hk dictionary data **

[ Details... ] [ JSON | CSV ]

Corpus character usage frequency

Description:

Counts the number of occurrences of each character in the corpus and returns a JSON dictionary. Non-CJK characters are included.

Used to compute the usage frequency of individual characters (not words) in the corpus.
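The counting described above can be sketched in a few lines of Python. This is only an illustration, not the script words.hk actually runs: the function name `character_frequency`, the two sample sentences, and the decision to skip whitespace are all assumptions.

```python
import json
from collections import Counter


def character_frequency(texts):
    """Count every character (CJK or not) across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Assumption: whitespace is skipped; every other character is kept.
        counts.update(ch for ch in text if not ch.isspace())
    return counts


# Hypothetical two-sentence corpus, for illustration only.
corpus = ["我哋OK", "我 OK"]
freq = character_frequency(corpus)

# ensure_ascii=False keeps CJK keys readable in the JSON output.
print(json.dumps(freq, ensure_ascii=False))
```

Because `Counter` is a `dict` subclass, it serializes directly to a JSON object keyed by character.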

Data License: public domain. Credits to words.hk appreciated.

[ Details... ] [ JSON | CSV ]

Corpus x word list usage frequency (word frequency list)

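Description:

Queries the database for all word representations and counts the number of occurrences of each, without trying to segment the article content.

Used to compute the usage frequency of the words currently in the database.
Please note that this list only includes words that have been seen in the corpus (粵文庫). Words that words.hk (粵典) records but that never appear in the corpus are not included.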
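A rough sketch of counting without segmentation: each known word is searched for as a plain substring, so words that overlap in the text are all counted. The function name and the sample data are hypothetical, and `str.count` counts non-overlapping matches of a single word, which may differ from how the real query counts.

```python
from collections import Counter


def count_known_words(words, articles):
    """Count substring occurrences of each known word, with no segmentation."""
    counts = Counter()
    for article in articles:
        for word in words:
            n = article.count(word)  # non-overlapping matches of this one word
            if n:
                counts[word] += n
    return counts


# Hypothetical word list and corpus, for illustration only.
words = ["香港", "港人", "好", "粵典"]
articles = ["香港人好好"]
counts = count_known_words(words, articles)

for word, n in counts.most_common():
    print(word, n)
```

Both 香港 and 港人 are counted from the same three characters, which is the expected side effect of skipping segmentation. A word that never appears in the articles (粵典 here) does not show up in the result at all.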

Data License: public domain. Credits to words.hk appreciated.

[ Details... ] [ JSON | CSV ]

words.hk character list

Description:

Queries the database for the Jyutping of every word, and dumps it out character by character if it seems valid.

Can be used as {character => Jyutping} data.

The list does not contain the character variants that the system recognizes.
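One way such a character table could be derived is sketched below. The validity check is not spelled out here, so the sketch assumes a pronunciation "seems valid" when it has exactly one space-separated syllable per character. The function name, the input layout, and the sample entries are all hypothetical.

```python
from collections import defaultdict


def chars_to_jyutping(entries):
    """Build {character: [jyutping, ...]} from (word, jyutping) pairs."""
    table = defaultdict(set)
    for word, jyutping in entries:
        syllables = jyutping.split()
        # Assumed validity check: one syllable per character, else skip.
        if len(syllables) != len(word):
            continue
        for ch, syllable in zip(word, syllables):
            table[ch].add(syllable)
    return {ch: sorted(readings) for ch, readings in table.items()}


# Hypothetical entries; the last one is deliberately malformed.
entries = [
    ("香港", "hoeng1 gong2"),
    ("好", "hou2"),
    ("好", "hou3"),
    ("壞咗", "waai6"),
]
table = chars_to_jyutping(entries)
print(table)
```

A character with several readings keeps all of them, and the malformed entry contributes nothing to the table.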

Data License: public domain. Credits to words.hk appreciated.

[ Details... ] [ JSON | CSV ]

words.hk word list

Description:

Contains the word list and pronunciations of all entries recorded in the dictionary. The result is a dictionary (pun not intended) with written forms as keys and their respective lists of possible pronunciations as values. Words that contain variants (異體字) are marked with a "*".
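A minimal sketch of reading data in this shape. The inline sample is hypothetical, and because the description does not say where the "*" marker is placed, the sketch checks both the key and the pronunciation values. Check the real file before relying on either.

```python
import json

# Hypothetical sample in the described shape:
# written form -> list of possible pronunciations.
sample = """
{
  "香港": ["hoeng1 gong2"],
  "好": ["hou2", "hou3"],
  "裡面*": ["leoi5 min6"]
}
"""

wordlist = json.loads(sample)


def has_variant_marker(key, pronunciations):
    """Assumption: the "*" marker may appear in the key or in a value."""
    return "*" in key or any("*" in p for p in pronunciations)


marked = {k: v for k, v in wordlist.items() if has_variant_marker(k, v)}
unmarked = {k: v for k, v in wordlist.items() if k not in marked}

print(unmarked.get("好", []))
print(sorted(marked))
```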

Data License: public domain. Credits to words.hk appreciated.

[ Details... ] [ JSON | CSV ]

English-Cantonese index (English index)

Description:

This data set exposes the index used to search English words in the dictionary. The index is built by mapping the words seen in each entry's English explanation to its Cantonese word.

This may be useful for English->Cantonese translation purposes.

The "English" terms are normalized to the US spelling variant (using https://github.com/en-wl/wordlist/blob/master/varcon/README ). Terms prefixed with "!" have been stemmed with PorterStemmer (see http://www.tartarus.org/~martin/PorterStemmer for implementations).

The score in each entry estimates how important the English term is to the definition of the Cantonese word, using some form of tf–idf. The formula is #magic and will probably change anyway. Score values range from 0 to 100, but we have limited the dataset to scores >40 to reduce noise.
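A lookup over this index might look like the sketch below. The record layout is not documented here, so the `{term: [(Cantonese word, score), ...]}` shape, the function name `candidates`, and every sample word and score are assumptions. The stem is passed in by the caller, since producing it requires a Porter stemmer, which the sketch does not include.

```python
def candidates(index, term, stem=None, min_score=40):
    """Return (Cantonese word, score) pairs for an English term, best first."""
    keys = [term]
    if stem is not None:
        keys.append("!" + stem)  # stemmed terms carry a "!" prefix

    best = {}
    for key in keys:
        for word, score in index.get(key, []):
            # Keep only scores above the cutoff, and the highest score per word.
            if score > min_score and score > best.get(word, 0):
                best[word] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Hypothetical index contents, for illustration only.
index = {
    "run": [("跑", 85.0), ("走", 55.0)],
    "!run": [("跑步", 70.0), ("跑", 60.0)],
}

result = candidates(index, "run", stem="run")
print(result)
```

Merging the exact and stemmed hits and keeping the highest score per word is a design choice of this sketch, not something the dataset prescribes.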

Data License: public domain. Credits to words.hk appreciated.