universal character encoding detector

cChardet

cChardet is a high-speed universal character encoding detector. It is a binding to charsetdetect.

Supported codecs

Big5
EUC-JP
EUC-KR
GB18030
HZ-GB-2312
IBM855
IBM866
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-2
ISO-8859-5
ISO-8859-7
ISO-8859-8
KOI8-R
Shift_JIS
TIS-620
UTF-8
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
WINDOWS-1250
WINDOWS-1251
WINDOWS-1252
WINDOWS-1253
WINDOWS-1255
EUC-TW
X-ISO-10646-UCS-4-2143
X-ISO-10646-UCS-4-3412
x-mac-cyrillic
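
The names above are the labels the detector reports, and they are not guaranteed to match the codec names your Python installation knows. The following is a minimal sketch, not part of cChardet, that checks names against Python's codec registry. The helpers `is_python_codec` and `split_by_support` are hypothetical names introduced for illustration.

```python
import codecs


def is_python_codec(name):
    """Return True if Python's codec registry can resolve this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


def split_by_support(names):
    """Split detector labels into (resolvable, unresolvable) by Python."""
    known = [n for n in names if is_python_codec(n)]
    unknown = [n for n in names if not is_python_codec(n)]
    return known, unknown
```

Running `split_by_support` over the list above shows which labels need a manual mapping or a fallback before calling `bytes.decode`.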

Requires

Cython: http://www.cython.org/

For example, on Ubuntu 12.04:

$ sudo apt-get install build-essential python-dev cython

Installation

Build and install from source:

$ cd /tmp
$ git clone git://github.com/PyYoshi/cChardet.git
$ cd cChardet
$ python setup.py build
$ python setup.py install

Or install from PyPI:

$ pip install -U cchardet

Example

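# -*- coding: utf-8 -*-
import cchardet as chardet

# Read the sample as raw bytes; detection works on undecoded data.
with open(r"tests/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
    msg = f.read()

# Guess the encoding of the bytes and print the result.
result = chardet.detect(msg)
print(result)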
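
Once an encoding has been detected, the usual next step is to decode the bytes. The sketch below assumes that `detect` returns a dict with an `encoding` key that may be `None` when nothing is recognized; the helper `decode_with_detection` and its `fallback` parameter are hypothetical and not part of cChardet. The detector is passed in as an argument so the helper also works with chardet.

```python
def decode_with_detection(data, detect, fallback="utf-8"):
    """Decode bytes using the encoding reported by a detect() callable.

    Assumes detect(data) returns a dict with an "encoding" key.
    """
    result = detect(data)
    # Fall back when the detector reports no encoding at all.
    encoding = result.get("encoding") or fallback
    try:
        return data.decode(encoding, errors="replace")
    except LookupError:
        # The reported label is not a codec name Python knows.
        return data.decode(fallback, errors="replace")
```

With the example above this could be called as `decode_with_detection(msg, chardet.detect)`.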

Benchmark

code: tests.TestCchardetSpeed

sample: tests/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt

Performance:

CPU: Intel Core i7 860 2.8GHz

RAM: DDR3-1333 16GB

Platform: Kubuntu 12.04 amd64, Python 2.7.3 64-bit

Result:

Library	Requests (calls/s)
chardet	0.32
cchardet	975.46
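
A calls-per-second figure like the one above can be measured roughly as follows. This is a minimal sketch and not the code in tests.TestCchardetSpeed; the helper `calls_per_second` and its `repeat` parameter are hypothetical, and the detector is passed in as a callable so the same function can time either library.

```python
import time


def calls_per_second(detect, data, repeat=100):
    """Call detect(data) `repeat` times and return the average call rate."""
    start = time.perf_counter()
    for _ in range(repeat):
        detect(data)
    elapsed = time.perf_counter() - start
    # Guard against a zero reading from a very fast callable.
    return repeat / elapsed if elapsed > 0 else float("inf")
```

For a comparison, pass `cchardet.detect` and `chardet.detect` in turn with the same sample bytes; absolute numbers will depend on the machine and Python version.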