universal character encoding detector

cChardet

cChardet is a high-speed universal character encoding detector, implemented as a Cython binding to libcharsetdetect.

Requires

Cython: http://www.cython.org/

uchardet-enhanced: https://bitbucket.org/medoc/uchardet-enhanced/overview

Install

Build uchardet-enhanced

$ cd /tmp

$ hg clone https://bitbucket.org/medoc/uchardet-enhanced

$ cd uchardet-enhanced/libcharsetdetect

$ ./configure

$ make

$ sudo make install

$ ls -la /usr/local/lib

$ ls -la /usr/local/include
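
Besides listing the directories by hand, the install can be checked from Python. The following is only a rough sketch: the helper name `library_is_visible` is made up for illustration, and the library name "charsetdetect" is an assumption derived from the libcharsetdetect directory above. It relies on the standard `ctypes.util.find_library`, which may not search /usr/local/lib on every system until the linker cache has been refreshed.

```python
import ctypes.util


def library_is_visible(name="charsetdetect"):
    """Return True if the system linker can locate a shared library by name.

    `name` is given without the "lib" prefix or file extension. The default
    value is an assumption based on the libcharsetdetect build directory.
    """
    # find_library returns a path or file name on success, None otherwise.
    return ctypes.util.find_library(name) is not None


def describe_library(name="charsetdetect"):
    """Build a one-line, human-readable status message for the library."""
    if library_is_visible(name):
        return "found: %s" % name
    return "not found: %s" % name
```

A "not found" result does not necessarily mean the build failed; on some Linux systems running "sudo ldconfig" after "make install" is needed before newly installed libraries become visible.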

Build cChardet

$ cd /tmp

$ git clone git://github.com/PyYoshi/cChardet.git

$ cd cChardet

$ sudo pip install -U cython

(Alternatively, use "sudo easy_install -U cython". On Ubuntu, "sudo apt-get install python-dev cython" is recommended.)

$ python setup.py build

$ sudo python setup.py install
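
Once installed, the module can be tried from Python. This is a minimal sketch, not taken from the project itself: it assumes that cchardet exposes a `detect` function that takes raw bytes, in the same way as `chardet.detect`. The wrapper `detect_encoding` and its `detector` parameter are hypothetical names used only for this example.

```python
def detect_encoding(data, detector=None):
    """Run an encoding detector on raw bytes and return its result.

    If no detector is passed, try to use cchardet.detect (assumed API).
    Returns None when cchardet cannot be imported.
    """
    if detector is None:
        try:
            import cchardet  # only available after the install steps above
        except ImportError:
            return None
        detector = cchardet.detect
    return detector(data)


def detect_file(path, detector=None):
    """Read a file in binary mode and pass its contents to detect_encoding."""
    with open(path, "rb") as f:
        return detect_encoding(f.read(), detector)
```

For example, `detect_file("testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt")` should report shift_jis if the assumption about `cchardet.detect` holds.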

Benchmark

See tests.TestCchardetSpeed.

Sample (shift_jis):

testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt

PC specs:

CPU: Intel Core i7 860 2.8GHz

RAM: DDR3-1333 16GB

Platform: Windows 7 HP x64, Python 2.7.3 32-bit

Result:

chardet: 4.009999990463257s, shift_jis

cchardet: 0.0009999275207519531s, shift_jis
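
A comparison like the one above could be reproduced roughly as follows. This is a sketch under stated assumptions, not the code of tests.TestCchardetSpeed: the helper names `time_detector`, `compare` and `run_benchmark` are invented here, and both `chardet.detect` and `cchardet.detect` are assumed to accept raw bytes.

```python
import time


def time_detector(detect, data):
    """Call detect(data) once and return (elapsed_seconds, result)."""
    start = time.time()
    result = detect(data)
    return time.time() - start, result


def compare(data, detectors):
    """Time every detector in a {name: callable} dict on the same data."""
    return dict((name, time_detector(fn, data)) for name, fn in detectors.items())


def run_benchmark(path):
    """Compare chardet and cchardet on one file (both must be installed)."""
    import chardet
    import cchardet  # assumed to expose detect(), like chardet

    with open(path, "rb") as f:
        data = f.read()
    detectors = {"chardet": chardet.detect, "cchardet": cchardet.detect}
    for name, (elapsed, result) in sorted(compare(data, detectors).items()):
        print("%s: %ss, %s" % (name, elapsed, result))
```

Calling `run_benchmark("testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt")` from the repository root would print one line per detector. A single call is timed here, so the numbers will vary between runs and machines.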

Contact

My blog

Sorry for my poor English :)