universal character encoding detector

The dev branch is very buggy. I recommend using the master branch.

cChardet

This library is a high-speed universal character encoding detector, implemented as a binding to charsetdetect.

This library is faster than chardet.

Supported codecs

  • Big5
  • EUC-JP
  • EUC-KR
  • GB18030
  • gb18030
  • HZ-GB-2312
  • IBM855
  • IBM866
  • ISO-2022-CN
  • ISO-2022-JP
  • ISO-2022-KR
  • ISO-8859-2
  • ISO-8859-5
  • ISO-8859-7
  • ISO-8859-8
  • KOI8-R
  • Shift_JIS
  • TIS-620
  • UTF-8
  • UTF-16BE
  • UTF-16LE
  • UTF-32BE
  • UTF-32LE
  • windows-1250
  • windows-1251
  • windows-1252
  • windows-1253
  • windows-1255
  • x-euc-tw
  • X-ISO-10646-UCS-4-2143
  • X-ISO-10646-UCS-4-3412
  • x-mac-cyrillic

Requires

Install

Build uchardet-enhanced

  1. $ cd /tmp

  2. $ hg clone https://bitbucket.org/medoc/uchardet-enhanced

  3. $ cd uchardet-enhanced/libcharsetdetect

  4. $ ./configure

  5. $ make

  6. $ sudo make install

  7. $ ls -la /usr/local/lib

  8. $ ls -la /usr/local/include
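
Steps 7 and 8 are a manual check that the build was installed. A minimal Python sketch of the same check for the library directory follows. The file name stem `libcharsetdetect` is an assumption taken from the directory name in step 3, and `find_shared_libs` is a hypothetical helper, not part of cChardet.

```python
import glob
import os


def find_shared_libs(libdir="/usr/local/lib", stem="libcharsetdetect"):
    """List files in `libdir` whose names start with `stem`.

    The default stem is an assumption based on the source directory name.
    """
    return sorted(glob.glob(os.path.join(libdir, stem + "*")))


matches = find_shared_libs()
if matches:
    print("found: " + ", ".join(matches))
else:
    print("no libcharsetdetect files found in /usr/local/lib")
```

If nothing is found, the install prefix may differ from /usr/local on your system.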

Example

# coding: utf8
import cchardet

# Read the file as raw bytes; detection works on undecoded data.
with open(r"testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
    msg = f.read()
result = cchardet.detect(msg)
print(result)
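
Once an encoding has been detected, the usual next step is to decode the bytes with it. The sketch below assumes the detector returns a chardet-style dict with an "encoding" key (which may be None). The helper name `decode_with_detection` and its `fallback` parameter are hypothetical, not part of cChardet. The detector is passed in as an argument, so the helper does not depend on which library is installed.

```python
def decode_with_detection(data, detect, fallback="utf-8"):
    """Decode raw bytes using the encoding reported by a detector.

    `detect` is any callable that takes bytes and returns a dict with an
    "encoding" key (the assumed shape of the cchardet.detect result).
    """
    result = detect(data)
    encoding = result.get("encoding") or fallback
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: fall back and replace
        # undecodable bytes instead of raising.
        return data.decode(fallback, "replace")
```

With cChardet installed, this would be called as `decode_with_detection(msg, cchardet.detect)`. The LookupError branch matters because some detector names, for example x-euc-tw, have no matching Python codec.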

Test

  • $ sudo pip install -U chardet nose (or: $ sudo easy_install -U chardet nose)

  • $ nosetests --nocapture tests.py

Benchmark

See tests.TestCchardetSpeed.

Sample (shift_jis):

PC Spec.:

  • CPU: Intel Core i7 860 2.8GHz

  • RAM: DDR3-1333 16GB

  • Platform: Windows 7 HP x64, Python 2.7.3 32-bit

Result:

  • chardet: 4.009999990463257s, shift_jis

  • cchardet: 0.0009999275207519531s, shift_jis
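
The two timings above work out to a speedup on the order of 4000x. The sketch below computes that ratio and shows one way to time a detector over several runs, which may give steadier numbers than a single call when the measured time is close to 1 ms. `time_detector` is a hypothetical helper, not the code in tests.TestCchardetSpeed.

```python
import time


def time_detector(detect, data, repeat=5):
    """Return the best wall-clock time in seconds over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.time()
        detect(data)
        best = min(best, time.time() - start)
    return best


# Ratio of the two single-run timings reported above.
chardet_seconds = 4.009999990463257
cchardet_seconds = 0.0009999275207519531
speedup = chardet_seconds / cchardet_seconds
print("speedup: about %.0fx" % speedup)
```

With both libraries installed, `time_detector(chardet.detect, msg)` and `time_detector(cchardet.detect, msg)` would give comparable measurements on your own machine.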

License

  • The files of this library ("cchardet.pyx", "setup.py", "tests.py") are released under the MIT License.

  • Licenses of the other libraries: please see the "ext" directory.

Contact

My blog

Sorry for my poor English :)