universal character encoding detector
cChardet
cChardet is a high-speed universal character encoding detector, implemented as a binding to charsetdetect.
It is faster than chardet.
Supported codecs
- Big5
- EUC-JP
- EUC-KR
- GB18030
- HZ-GB-2312
- IBM855
- IBM866
- ISO-2022-CN
- ISO-2022-JP
- ISO-2022-KR
- ISO-8859-2
- ISO-8859-5
- ISO-8859-7
- ISO-8859-8
- KOI8-R
- Shift_JIS
- TIS-620
- UTF-8
- UTF-16BE
- UTF-16LE
- UTF-32BE
- UTF-32LE
- WINDOWS-1250
- WINDOWS-1251
- WINDOWS-1252
- WINDOWS-1253
- WINDOWS-1255
- EUC-TW
- X-ISO-10646-UCS-4-2143
- X-ISO-10646-UCS-4-3412
- x-mac-cyrillic
Requires
- Cython: http://www.cython.org/
e.g. on Ubuntu 12.04:
$ sudo apt-get install build-essential python-dev cython
Installation
$ cd /tmp
$ git clone https://github.com/PyYoshi/cChardet.git
$ cd cChardet
$ python setup.py build
$ sudo python setup.py install
or
$ sudo easy_install cchardet
Example
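# coding: utf8
import cchardet

# Read the sample as raw bytes; detection works on undecoded data.
with open(r"test/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
    msg = f.read()
# Detect the encoding of the sample.
result = cchardet.detect(msg)
print(result)
# Detect the encoding together with a confidence value.
result2 = cchardet.detect_with_confidence(msg)
print(result2)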
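Once an encoding name has been detected, the usual next step is to decode the bytes. The following is a minimal sketch of that step, not part of cChardet: `decode_detected` and its `fallback` parameter are hypothetical names. Because the return shape of `cchardet.detect` is not described here, the helper takes the encoding name directly. It relies only on Python's standard `codecs` module, which does not know every name in the list of supported codecs (for example `X-ISO-10646-UCS-4-2143`), so unknown names fall back to a default.

```python
import codecs


def decode_detected(data, encoding_name, fallback="utf-8"):
    """Decode bytes using a detected encoding name, with a fallback.

    encoding_name may be None or a name Python has no codec for;
    in both cases the fallback codec is used instead.
    (Hypothetical helper, not part of cChardet.)
    """
    try:
        codec = codecs.lookup(encoding_name) if encoding_name else None
    except LookupError:
        codec = None  # e.g. a detector-specific name unknown to Python
    name = codec.name if codec else fallback
    # errors="replace" substitutes U+FFFD for bytes that do not fit the codec
    return data.decode(name, errors="replace"), name


# Stand-in for file contents: a short Shift_JIS byte string.
sample = "千夜一夜物語".encode("shift_jis")
text, used = decode_detected(sample, "SHIFT_JIS")
print(used, text)
```

In real use, the second argument would be the encoding name obtained from the detector instead of a hard-coded string.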
Test
$ sudo easy_install -U chardet nose
or
$ sudo pip install -U chardet nose
$ cd test
$ nosetests --nocapture tests.py
Benchmark
code: tests.TestCchardetSpeed
sample: test/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt
Test environment:
CPU: Intel Core i7 860 2.8GHz
RAM: DDR3-1333 16GB
Platform: Windows 7 HP x64, Python 2.7.3 32-bit
Result:
Library | Requests (call/s) | Detected encoding |
---|---|---|
chardet | 0.25 | shift_jis |
cchardet | 500.03 | shift_jis |
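A calls-per-second figure like the one above can be measured with a small timing loop. The sketch below is an assumption about how such a measurement might look; it is not the code in tests.TestCchardetSpeed. `calls_per_second` and `fake_detect` are hypothetical names, and `fake_detect` is only a stand-in so the example runs without any detector installed.

```python
import time


def calls_per_second(func, payload, min_seconds=0.2):
    """Call func(payload) repeatedly for at least min_seconds; return calls/s."""
    calls = 0
    elapsed = 0.0
    start = time.perf_counter()
    while elapsed < min_seconds:
        func(payload)
        calls += 1
        elapsed = time.perf_counter() - start
    return calls / elapsed


def fake_detect(data):
    # Hypothetical stand-in detector: reports "ascii" for pure 7-bit input.
    return "ascii" if all(b < 0x80 for b in data) else None


rate = calls_per_second(fake_detect, b"hello world" * 100)
print("%.2f call/s" % rate)
```

To compare real detectors, pass each library's detect function and the same sample bytes to `calls_per_second`; absolute numbers depend on the machine and Python version.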
- Licenses of other libraries: please see the ext directory.
Thanks
Contact
Sorry for my poor English :)