universal character encoding detector

PyYoshi 148efe50fb updated		2012-06-21 00:11:47 +09:00
testdata	rename testdata files.	2012-06-20 21:45:29 +09:00
.gitignore	update	2012-06-20 21:47:25 +09:00
cchardet.pyx	first commit	2012-06-20 10:41:36 +09:00
readme.md	updated	2012-06-21 00:11:47 +09:00
setup.py	remove platform, sys and os module	2012-06-20 22:07:05 +09:00
tests.py	rename testdata files.	2012-06-20 21:45:29 +09:00

readme.md

cChardet

cChardet is a high-speed universal character encoding detector, implemented as a binding to charsetdetect.

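It is much faster than chardet: in the benchmark below, it identified a Shift_JIS sample in about 0.001 s, versus about 4.01 s for chardet.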

Supported codecs

Big5
EUC-JP
EUC-KR
GB18030
gb18030
HZ-GB-2312
IBM855
IBM866
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-2
ISO-8859-5
ISO-8859-7
ISO-8859-8
KOI8-R
Shift_JIS
TIS-620
UTF-8
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
windows-1250
windows-1251
windows-1252
windows-1253
windows-1255
x-euc-tw
X-ISO-10646-UCS-4-2143
X-ISO-10646-UCS-4-3412
x-mac-cyrillic

Requirements

Cython: http://www.cython.org/
uchardet-enhanced: https://bitbucket.org/medoc/uchardet-enhanced/overview

Install

Build uchardet-enhanced

$ cd /tmp
$ hg clone https://bitbucket.org/medoc/uchardet-enhanced
$ cd uchardet-enhanced/libcharsetdetect
$ ./configure
$ make
$ sudo make install
$ ls -la /usr/local/lib
$ ls -la /usr/local/include

Build cChardet

$ cd /tmp
$ git clone git://github.com/PyYoshi/cChardet.git
$ cd cChardet
$ sudo pip install -U cython    # or: sudo easy_install -U cython
# On Ubuntu, "sudo apt-get install python-dev cython" is recommended instead.
$ python setup.py build
$ sudo python setup.py install

Example

# coding: utf8
import cchardet

# Read the sample as raw bytes and close the file when done.
with open(r"testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
    msg = f.read()
result = cchardet.detect(msg)
print(result)
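
The detection result can then be used to decode the raw bytes. The following is a minimal sketch, not part of cChardet itself. It assumes that detect() returns a dict with an "encoding" key, as chardet does, and decode_with_detection is a hypothetical helper name. Some names in the codec list above (x-euc-tw, for example) may not be known to Python's codec registry, so the helper falls back to a default encoding in that case.

```python
import codecs


def decode_with_detection(data, detect, fallback="utf-8"):
    """Decode raw bytes using the encoding reported by a detect() callable.

    Hypothetical helper. Assumes detect(data) returns a dict with an
    "encoding" key (an assumption modelled on chardet's result format).
    """
    result = detect(data) or {}
    encoding = result.get("encoding") or fallback
    try:
        # Raises LookupError if Python has no codec with this name.
        codecs.lookup(encoding)
    except LookupError:
        encoding = fallback
    # "replace" substitutes undecodable bytes instead of raising,
    # which helps when the detector's guess is wrong.
    return data.decode(encoding, "replace")
```

With cChardet installed, a call would look like decode_with_detection(msg, cchardet.detect). Passing the detector in as an argument keeps the helper usable with chardet.detect as well.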

Test

$ sudo pip install -U chardet nose    # or: sudo easy_install -U chardet nose
$ nosetests --nocapture tests.py

Benchmark

See tests.TestCchardetSpeed in tests.py.

Sample (shift_jis):

testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt

PC specs:

CPU: Intel Core i7 860 2.8GHz
RAM: DDR3-1333 16GB
Platform: Windows 7 HP x64, Python 2.7.3 32-bit

Result:

chardet: 4.009999990463257s, shift_jis
cchardet: 0.0009999275207519531s, shift_jis
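
A comparison like this can be reproduced with a small timing helper. The sketch below is one way to measure it and is not the code in tests.TestCchardetSpeed. time_detector is a hypothetical name, and the detector is passed in as a callable so the same helper can time both chardet.detect and cchardet.detect.

```python
from timeit import default_timer


def time_detector(detect, data, repeat=1):
    """Return (average seconds per call, last result) for detect(data).

    Hypothetical helper; `detect` is any callable that takes raw bytes.
    """
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    result = None
    start = default_timer()
    for _ in range(repeat):
        result = detect(data)
    # Average over all runs to smooth out timer granularity.
    elapsed = (default_timer() - start) / repeat
    return elapsed, result
```

With both libraries installed, time_detector(chardet.detect, msg) and time_detector(cchardet.detect, msg) can be compared directly. A single cchardet call may finish close to the resolution of the system timer, so a larger repeat value gives a steadier figure.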

Contact

My blog

Sorry for my poor English :)