cChardet
========

cChardet is a high-speed universal character encoding detector, implemented as a binding to `uchardet`_.

.. image:: https://badge.fury.io/py/cchardet.svg
   :target: https://badge.fury.io/py/cchardet
   :alt: PyPI version

.. image:: https://travis-ci.org/PyYoshi/cChardet.svg?branch=master
   :target: https://travis-ci.org/PyYoshi/cChardet
   :alt: Travis CI build status

.. image:: https://ci.appveyor.com/api/projects/status/lwkc4rgf3gncb1ne/branch/master?svg=true
   :target: https://ci.appveyor.com/project/PyYoshi/cchardet/branch/master
   :alt: AppVeyor build status

Supported Languages/Encodings
-----------------------------

- International (Unicode)

  - UTF-8
  - UTF-16BE / UTF-16LE
  - UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-34121 / X-ISO-10646-UCS-4-21431

- Arabic

  - ISO-8859-6
  - WINDOWS-1256

- Bulgarian

  - ISO-8859-5
  - WINDOWS-1251

- Chinese

  - ISO-2022-CN
  - BIG5
  - EUC-TW
  - GB18030
  - HZ-GB-2312

- Croatian

  - ISO-8859-2
  - ISO-8859-13
  - ISO-8859-16
  - Windows-1250
  - IBM852
  - MAC-CENTRALEUROPE

- Czech

  - Windows-1250
  - ISO-8859-2
  - IBM852
  - MAC-CENTRALEUROPE

- Danish

  - ISO-8859-1
  - ISO-8859-15
  - WINDOWS-1252

- English

  - ASCII

- Esperanto

  - ISO-8859-3

- Estonian

  - ISO-8859-4
  - ISO-8859-13
  - Windows-1252
  - Windows-1257

- Finnish

  - ISO-8859-1
  - ISO-8859-4
  - ISO-8859-9
  - ISO-8859-13
  - ISO-8859-15
  - WINDOWS-1252

- French

  - ISO-8859-1
  - ISO-8859-15
  - WINDOWS-1252

- German

  - ISO-8859-1
  - WINDOWS-1252

- Greek

  - ISO-8859-7
  - WINDOWS-1253

- Hebrew

  - ISO-8859-8
  - WINDOWS-1255

- Hungarian

  - ISO-8859-2
  - WINDOWS-1250

- Irish Gaelic

  - ISO-8859-1
  - ISO-8859-9
  - ISO-8859-15
  - WINDOWS-1252

- Italian

  - ISO-8859-1
  - ISO-8859-3
  - ISO-8859-9
  - ISO-8859-15
  - WINDOWS-1252

- Japanese

  - ISO-2022-JP
  - SHIFT\_JIS
  - EUC-JP

- Korean

  - ISO-2022-KR
  - EUC-KR / UHC

- Lithuanian

  - ISO-8859-4
  - ISO-8859-10
  - ISO-8859-13

- Latvian

  - ISO-8859-4
  - ISO-8859-10
  - ISO-8859-13

- Maltese

  - ISO-8859-3

- Polish

  - ISO-8859-2
  - ISO-8859-13
  - ISO-8859-16
  - Windows-1250
  - IBM852
  - MAC-CENTRALEUROPE

- Portuguese

  - ISO-8859-1
  - ISO-8859-9
  - ISO-8859-15
  - WINDOWS-1252

- Romanian

  - ISO-8859-2
  - ISO-8859-16
  - Windows-1250
  - IBM852

- Russian

  - ISO-8859-5
  - KOI8-R
  - WINDOWS-1251
  - MAC-CYRILLIC
  - IBM866
  - IBM855

- Slovak

  - Windows-1250
  - ISO-8859-2
  - IBM852
  - MAC-CENTRALEUROPE

- Slovene

  - ISO-8859-2
  - ISO-8859-16
  - Windows-1250
  - IBM852
  - M

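The names above are the labels reported by the detector, and not every one of them is necessarily a codec name that Python recognises. As a minimal sketch (the helper ``python_codec_name`` is hypothetical and not part of cChardet), the standard ``codecs`` module can be used to check a label before trying to decode with it:

.. code-block:: python

    import codecs


    def python_codec_name(label):
        """Return Python's canonical codec name for a label, or None if unknown."""
        try:
            # codecs.lookup() raises LookupError for names Python has no codec for.
            return codecs.lookup(label).name
        except LookupError:
            return None


    # A few labels taken from the list above.
    for label in ("UTF-8", "SHIFT_JIS", "X-ISO-10646-UCS-4-34121"):
        print(label, "->", python_codec_name(label))

A ``None`` result means the bytes cannot be decoded with that label directly and the caller would need its own mapping or fallback.
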
Example
-------

.. code-block:: python

    # -*- coding: utf-8 -*-
    import cchardet as chardet

    # Read the sample as raw bytes; detection works on bytes, not decoded text.
    with open(r"src/tests/samples/wikipediaJa_One_Thousand_and_One_Nights_SJIS.txt", "rb") as f:
        msg = f.read()

    # detect() returns a dict describing the guessed encoding.
    result = chardet.detect(msg)
    print(result)

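A common next step is to decode the bytes with the detected encoding. The following is only a sketch: ``decode_detected`` is a hypothetical helper, and it assumes the result dict exposes an ``encoding`` key that may be ``None`` when nothing could be detected. The detection function is passed in as an argument, so the helper itself does not import cChardet.

.. code-block:: python

    def decode_detected(raw, detect, fallback="utf-8"):
        """Decode ``raw`` bytes using the encoding reported by ``detect``.

        ``detect`` is any callable that takes bytes and returns a dict with an
        "encoding" key (assumed shape). ``fallback`` is used when detection
        yields nothing or Python cannot decode with the reported name.
        """
        encoding = detect(raw).get("encoding") or fallback
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or a wrong guess: decode leniently instead.
            return raw.decode(fallback, errors="replace")


    # With cChardet installed, usage would look like:
    #     import cchardet
    #     text = decode_detected(raw_bytes, cchardet.detect)

Passing the detector as a parameter also makes it easy to swap in another library with the same ``detect`` interface.
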
Benchmark
---------

.. code-block:: bash

    $ cd src/
    $ pip install chardet
    $ python tests/bench.py

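The contents of ``tests/bench.py`` are not reproduced here. As a rough illustration of how a calls-per-second figure can be obtained, the sketch below times repeated calls to a detection function. The name ``calls_per_second`` and the ``repeat`` parameter are assumptions made for this example, and it targets Python 3 (``time.perf_counter``).

.. code-block:: python

    import time


    def calls_per_second(detect, data, repeat=100):
        """Call ``detect(data)`` ``repeat`` times and return the average rate."""
        start = time.perf_counter()
        for _ in range(repeat):
            detect(data)
        elapsed = time.perf_counter() - start
        # Guard against a zero interval on very coarse clocks.
        return repeat / elapsed if elapsed > 0 else float("inf")


    # With both libraries installed, a comparison might look like:
    #     import chardet, cchardet
    #     print(calls_per_second(chardet.detect, sample_bytes, repeat=5))
    #     print(calls_per_second(cchardet.detect, sample_bytes, repeat=5))

Figures obtained this way depend heavily on the sample data and the machine, so they are only comparable when measured under the same conditions.
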
Results
~~~~~~~

CPU: Intel(R) Core(TM) i5-4690 CPU @ 3.50GHz

RAM: DDR3 1600MHz 16GB

Platform: Ubuntu 16.04 amd64

Python 2.7.13
^^^^^^^^^^^^^

+-----------------+--------------------+
| Library         | Requests (calls/s) |
+=================+====================+
| chardet v3.0.2  | 0.36               |
+-----------------+--------------------+
| cchardet v2.0.1 | 1396.42            |
+-----------------+--------------------+

Python 3.6.1
^^^^^^^^^^^^

+-----------------+--------------------+
| Library         | Requests (calls/s) |
+=================+====================+
| chardet v3.0.2  | 0.35               |
+-----------------+--------------------+
| cchardet v2.0.1 | 1467.77            |
+-----------------+--------------------+

LICENSE
-------

See the **COPYING** file.

Contact
-------

- `Issues`_

.. _uchardet: https://github.com/PyYoshi/uchardet
.. _Issues: https://github.com/PyYoshi/cChardet/issues?page=1&state=open