# The dev branch is too buggy! I recommend the master branch.
# cChardet
This library is a high-speed universal character encoding detector, a binding to [charsetdetect](https://bitbucket.org/medoc/uchardet-enhanced/overview).
It is faster than [chardet](http://pypi.python.org/pypi/chardet); see the Benchmark section for measurements.
# Supported codecs
* Big5
* EUC-JP
* EUC-KR
* GB18030
* gb18030
* HZ-GB-2312
* IBM855
* IBM866
* ISO-2022-CN
* ISO-2022-JP
* ISO-2022-KR
* ISO-8859-2
* ISO-8859-5
* ISO-8859-7
* ISO-8859-8
* KOI8-R
* Shift_JIS
* TIS-620
* UTF-8
* UTF-16BE
* UTF-16LE
* UTF-32BE
* UTF-32LE
* windows-1250
* windows-1251
* windows-1252
* windows-1253
* windows-1255
* x-euc-tw
* X-ISO-10646-UCS-4-2143
* X-ISO-10646-UCS-4-3412
* x-mac-cyrillic
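
Not every name in this list is a codec name that Python itself accepts, which matters if a detected name is passed straight to `bytes.decode`. A minimal sketch for checking a name first; `is_python_codec` is a hypothetical helper for illustration and is not part of cChardet:

```python
import codecs


def is_python_codec(name):
    """Return True if Python's codec registry recognizes `name`."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False


# A few names taken from the list above; the helper reports which of them
# Python can decode directly.
for label in ("Shift_JIS", "windows-1252", "X-ISO-10646-UCS-4-2143"):
    print(label, is_python_codec(label))
```

Names that Python does not recognize need a manual mapping or a fallback before decoding.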
# Requirements
* Cython: [http://www.cython.org/](http://www.cython.org/)
* uchardet-enhanced: [https://bitbucket.org/medoc/uchardet-enhanced/overview](https://bitbucket.org/medoc/uchardet-enhanced/overview)
# Install
### Build uchardet-enhanced
1. $ cd /tmp
2. $ hg clone https://bitbucket.org/medoc/uchardet-enhanced
3. $ cd uchardet-enhanced/libcharsetdetect
4. $ ./configure
5. $ make
6. $ sudo make install
7. $ ls -la /usr/local/lib (check that the library was installed)
8. $ ls -la /usr/local/include (check that the header was installed)
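
The last two steps check by eye that the build landed under `/usr/local`. The same check can be sketched in Python. The `charsetdetect` name fragment is an assumption based on the `libcharsetdetect` directory name, and `find_installed` is a hypothetical helper, so adjust both to what the build actually installs:

```python
import glob
import os


def find_installed(fragment, directories):
    """Return sorted paths in `directories` whose file name contains `fragment`."""
    matches = []
    for directory in directories:
        pattern = os.path.join(directory, "*" + fragment + "*")
        matches.extend(glob.glob(pattern))
    return sorted(matches)


if __name__ == "__main__":
    # Assumed name fragment; the installed file names may differ.
    found = find_installed("charsetdetect", ["/usr/local/lib", "/usr/local/include"])
    if found:
        for path in found:
            print(path)
    else:
        print("nothing matching 'charsetdetect' was found under /usr/local")
```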
# Example
```python
# coding: utf8
import cchardet

# Read the sample as raw bytes; detection works on undecoded data.
with open(r"testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
    msg = f.read()
result = cchardet.detect(msg)
print(result)
```
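
Once an encoding has been detected, the usual next step is to decode the bytes with it. A minimal sketch, assuming `detect` returns a dict with an `encoding` key as chardet does (this README does not show the output, so treat that shape as an assumption); `decode_with_result` is a hypothetical helper, not part of cChardet:

```python
def decode_with_result(data, result, fallback="utf-8"):
    """Decode `data` with result["encoding"], using `fallback` when that fails."""
    encoding = (result or {}).get("encoding") or fallback
    try:
        return data.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Either Python has no codec for the reported name, or the bytes do
        # not match it; decode leniently with the fallback instead.
        return data.decode(fallback, errors="replace")


# Stand-in for cchardet.detect(msg); the dict shape is assumed, not taken
# from this README.
sample = "千夜一夜物語".encode("shift_jis")
fake_result = {"encoding": "SHIFT_JIS", "confidence": 0.99}
text = decode_with_result(sample, fake_result)
print(len(text), "characters decoded")
```

In real use, pass `msg` and the value returned by `cchardet.detect(msg)` in place of the stand-ins.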
# Test
* $ sudo pip install -U chardet nose (or $ sudo easy_install -U chardet nose)
* $ nosetests --nocapture tests.py
# Benchmark
See [tests.TestCchardetSpeed](https://github.com/PyYoshi/cChardet/blob/master/tests.py#L416).
### Sample (shift_jis):
* [testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt](https://github.com/PyYoshi/cChardet/blob/master/testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt)
### PC Spec.:
* CPU: Intel Core i7 860 2.8GHz
* RAM: DDR3-1333 16GB
* Platform: Windows 7 HP x64, Python 2.7.3 32-bit
### Result:
* chardet: 4.009999990463257s, shift_jis
* cchardet: 0.0009999275207519531s, shift_jis
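
A comparison like this can be reproduced by timing each detector on the same bytes. The following is a minimal sketch and not the code in tests.py; `time_detector` is a hypothetical helper, and the benchmark part is skipped if either package or the sample file is missing:

```python
import time


def time_detector(detect, data, repeat=1):
    """Return (best elapsed seconds, last result) over `repeat` calls of detect(data)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = detect(data)
        best = min(best, time.perf_counter() - start)
    return best, result


if __name__ == "__main__":
    path = "testdata/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt"
    try:
        import chardet
        import cchardet
        with open(path, "rb") as f:
            data = f.read()
    except (ImportError, OSError) as exc:
        print("skipping benchmark:", exc)
    else:
        for name, module in (("chardet", chardet), ("cchardet", cchardet)):
            elapsed, result = time_detector(module.detect, data)
            print(name, elapsed, result)
```

Taking the best of several runs (`repeat` greater than 1) reduces noise from timer resolution, which matters when a single call takes around a millisecond.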
# License
* The files of this library ("cchardet.pyx", "setup.py", "tests.py") are released under the MIT License.
* Other libraries: see the "ext" directory for their licenses.
# Contact
[My blog](http://blog.remu.biz)
Sorry for my poor English :)