universal character encoding detector

PyYoshi ffb55ca55b update		2012-07-07 12:22:34 +09:00

cChardet

cChardet is a high-speed universal character encoding detector, implemented as a binding to charsetdetect.

This library is faster than chardet.

Supported codecs

Big5
EUC-JP
EUC-KR
GB18030
gb18030
HZ-GB-2312
IBM855
IBM866
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-2
ISO-8859-5
ISO-8859-7
ISO-8859-8
KOI8-R
Shift_JIS
TIS-620
UTF-8
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
WINDOWS-1250
WINDOWS-1251
WINDOWS-1252
WINDOWS-1253
WINDOWS-1255
EUC-TW
X-ISO-10646-UCS-4-2143
X-ISO-10646-UCS-4-3412
x-mac-cyrillic

Requires

Cython: http://www.cython.org/

For example, on Ubuntu 12.04:

$ sudo apt-get install build-essential python-dev cython

Installation

Build and install from source:

$ cd /tmp
$ git clone git://github.com/PyYoshi/cChardet.git
$ cd cChardet
$ python setup.py build
$ sudo python setup.py install

Or install with easy_install:

$ sudo easy_install cchardet

Example

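# coding: utf8
import cchardet

# Read the sample as raw bytes; detection works on undecoded data.
with open(r"test/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
    msg = f.read()

# Detect the encoding.
result = cchardet.detect(msg)
print(result)
# Detect the encoding together with a confidence value.
result2 = cchardet.detect_with_confidence(msg)
print(result2)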
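
Once an encoding name has been detected, the usual next step is to decode the bytes with it. The following is a minimal sketch of that step, not part of cChardet. The helper name decode_with_detected and its fallback parameter are hypothetical, it assumes Python 3, and it takes the encoding name as a plain string because the exact return shape of cchardet.detect is not described here.

```python
def decode_with_detected(data, encoding, fallback="utf-8"):
    """Decode bytes using a detected encoding name (hypothetical helper).

    Falls back to `fallback` with replacement characters when the name is
    missing, unknown to Python, or does not actually fit the data.
    """
    try:
        if encoding:
            return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # LookupError: Python has no codec with that name.
        # UnicodeDecodeError: the detector guessed wrong for this data.
        pass
    return data.decode(fallback, errors="replace")


# Illustrative input only; in practice the name would come from the detector.
raw = "日本語のテキスト".encode("shift_jis")
print(decode_with_detected(raw, "SHIFT_JIS"))
print(decode_with_detected(b"plain ascii", None))
```

Keeping the fallback explicit means a wrong or missing guess degrades to replacement characters instead of raising an exception.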

Test

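Install chardet and nose with either easy_install or pip:

$ sudo easy_install -U chardet nose
$ sudo pip install -U chardet nose

Then run the tests:

$ cd test
$ nosetests --nocapture tests.py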

Benchmark

code: tests.TestCchardetSpeed

sample: test/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt

Test machine:

CPU: Intel Core i7 860 2.8GHz

RAM: DDR3-1333 16GB

Platform: Windows 7 HP x64, Python 2.7.3 32-bit

Result:

Library    Requests (calls/s)    Detected encoding
chardet    0.25                  shift_jis
cchardet   500.03                shift_jis
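
A calls-per-second figure like the one above can be measured by timing repeated calls on the same input. The sketch below is not the code in tests.TestCchardetSpeed. The names calls_per_second and stand_in_detect are hypothetical, the stand-in detector is a trivial placeholder so the snippet runs without cchardet or chardet installed, and Python 3 is assumed for time.perf_counter.

```python
import time


def calls_per_second(detect, data, repeat=100):
    """Return how many times per second detect(data) completes."""
    start = time.perf_counter()
    for _ in range(repeat):
        detect(data)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a coarse timer.
    return repeat / elapsed if elapsed > 0 else float("inf")


def stand_in_detect(data):
    """Placeholder detector (NOT cchardet): returns the first codec that decodes."""
    for name in ("utf-8", "shift_jis"):
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None


# Illustrative sample; the real benchmark uses the file under test/testdata.
sample = "千夜一夜物語".encode("shift_jis") * 1000
rate = calls_per_second(stand_in_detect, sample)
print("stand-in: %.2f call/s" % rate)
```

To compare real detectors, pass cchardet.detect or chardet.detect in place of stand_in_detect and use the same sample for both.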

License

The files of this library ("cchardet.pyx", "setup.py", "tests.py") are released under the MIT License.

Licenses of other bundled libraries: see the ext directory.

Thanks

Contact

My blog

Sorry for my poor English :)