universal character encoding detector
cChardet
cChardet is a high-speed universal character encoding detector, implemented as a binding to charsetdetect.
Supported codecs
- Big5
- EUC-JP
- EUC-KR
- GB18030
- HZ-GB-2312
- IBM855
- IBM866
- ISO-2022-CN
- ISO-2022-JP
- ISO-2022-KR
- ISO-8859-2
- ISO-8859-5
- ISO-8859-7
- ISO-8859-8
- KOI8-R
- Shift_JIS
- TIS-620
- UTF-8
- UTF-16BE
- UTF-16LE
- UTF-32BE
- UTF-32LE
- WINDOWS-1250
- WINDOWS-1251
- WINDOWS-1252
- WINDOWS-1253
- WINDOWS-1255
- EUC-TW
- X-ISO-10646-UCS-4-2143
- X-ISO-10646-UCS-4-3412
- x-mac-cyrillic
Requires
- Cython: http://www.cython.org/
For example, on Ubuntu 12.04:
$ sudo apt-get install build-essential python-dev cython
Installation
$ cd /tmp
$ git clone https://github.com/PyYoshi/cChardet.git
$ cd cChardet
$ python setup.py build
$ python setup.py install
or
$ pip install -U cchardet
Example
# -*- coding: utf-8 -*-
import cchardet as chardet
with open(r"tests/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
    msg = f.read()
result = chardet.detect(msg)
print(result)
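A common next step is to decode the bytes with the detected encoding. The following is a minimal sketch, assuming that `detect` returns a dict with an `encoding` key that may be `None` when nothing is recognized. The helper name `decode_bytes` and its `fallback` parameter are hypothetical and not part of cChardet.

```python
try:
    import cchardet  # the library described in this README
except ImportError:
    cchardet = None  # keeps the sketch importable where cchardet is not installed


def decode_bytes(data, detect, fallback="utf-8"):
    """Decode `data` using the encoding reported by `detect(data)`.

    `detect` is any callable returning a dict with an "encoding" key,
    for example cchardet.detect. Hypothetical helper, for illustration only.
    """
    result = detect(data)
    # Assumption: the encoding may be None when detection fails.
    encoding = result.get("encoding") or fallback
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: fall back without raising.
        return data.decode(fallback, errors="replace")


if cchardet is not None:
    sample = "千夜一夜物語".encode("shift_jis")
    print(decode_bytes(sample, cchardet.detect))
```

Passing the detector in as an argument keeps the helper usable with either `cchardet.detect` or `chardet.detect`.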
Benchmark
code: tests.TestCchardetSpeed
sample: tests/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt
Test environment:
CPU: Intel Core i7 860 2.8GHz
RAM: DDR3-1333 16GB
Platform: Kubuntu 12.04 amd64, Python 2.7.3 64-bit
Result:
| Library | Request (call/s) |
|---|---|
| chardet | 0.32 |
| cchardet | 975.46 |
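A calls-per-second figure like the one above can be approximated with a simple timing loop. This is only a rough sketch and not the project's `tests.TestCchardetSpeed` code. The function name `calls_per_second` and the `repeat` parameter are hypothetical.

```python
import time


def calls_per_second(func, data, repeat=100):
    """Return how many times per second `func(data)` can be called.

    Hypothetical helper: times `repeat` consecutive calls with a
    monotonic high-resolution clock and divides the count by the elapsed time.
    """
    start = time.perf_counter()
    for _ in range(repeat):
        func(data)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very fast calls.
    return repeat / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Stand-in workload; replace `len` with cchardet.detect or chardet.detect
    # and `payload` with the bytes of the sample file to compare libraries.
    payload = b"sample bytes" * 1000
    print("%.2f call/s" % calls_per_second(len, payload, repeat=1000))
```

Results depend heavily on hardware, input size, and Python version, so numbers from such a loop are only comparable when measured on the same machine with the same sample.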
License
- The MIT License: src/cchardet
- Other libraries' licenses: see the src/ext directory.