码迷,mamicode.com
首页 > 其他好文 > 详细

用juniversalchardet解决爬虫乱码问题

时间:2017-05-22 11:11:01      阅读:216      评论:0      收藏:0      [点我收藏+]

标签:out   win   结束   arch   最简   utf-16   other   rac   notify   

 

 

        爬虫往往会遇到乱码问题。最简单的方法是根据http的响应信息来获取编码信息。但如果对方网站的响应信息不包含编码信息或编码信息错误,那么爬虫取下来的信息就很可能是乱码。

       好的解决办法是直接根据页面内容来自动判断页面的编码。如Mozilla公司的firefox使用的universalchardet编码自动检测工具。

       juniversalchardet是universalchardet的Java版本。源码开源于 https://github.com/thkoch2001/juniversalchardet

       自动编码主要是根据统计学的方法来判断。具体原理,可以看http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

       现在以Java爬虫常用的httpclient来讲解如何使用。看以下关键代码:

 
UniversalDetector encDetector = new UniversalDetector(null);  
    while ((l = myStream.read(tmp)) != -1) {  
        buffer.append(tmp, 0, l);  
        if (!encDetector.isDone()) {  
            encDetector.handleData(tmp, 0, l);  
        }  
    }  
encDetector.dataEnd();  
String encoding = encDetector.getDetectedCharset();  
if (encoding != null) {  
    return new String(buffer.toByteArray(), encoding);  
}  
encDetector.reset();  

  

  1. myStream.read(tmp)) 读取httpclient得到的流。我们要做的就是在读流的同时,运用juniversalchardet来检测编码,如果有符合特征的编码的出现,则最后可通过detector.getDetectedCharset()  
  2. 可以得到编码,否则返回null。至此,检测工作结束,通过String的构造方法来进行按一定编码构建字符串。  



http://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet/1.0.3

<!-- https://mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet -->
<dependency>
    <groupId>com.googlecode.juniversalchardet</groupId>
    <artifactId>juniversalchardet</artifactId>
    <version>1.0.3</version>
</dependency>

  

 

https://code.google.com/archive/p/juniversalchardet/

 

Java port of universalchardet

1. What is it?

juniversalchardet is a Java port of ‘universalchardet‘, that is the encoding detector library of Mozilla.

The original code of universalchardet is available athttp://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/

Techniques used by universalchardet are described athttp://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

2. Encodings that can be detected

  • Chinese

    • ISO-2022-CN
    • BIG5
    • EUC-TW
    • GB18030
    • HZ-GB-23121
  • Cyrillic

    • ISO-8859-5
    • KOI8-R
    • WINDOWS-1251
    • MACCYRILLIC
    • IBM866
    • IBM855
  • Greek

    • ISO-8859-7
    • WINDOWS-1253
  • Hebrew

    • ISO-8859-8
    • WINDOWS-1255
  • Japanese

    • ISO-2022-JP
    • SHIFT_JIS
    • EUC-JP
  • Korean

    • ISO-2022-KR
    • EUC-KR
  • Unicode

    • UTF-8
    • UTF-16BE / UTF-16LE
    • UTF-32BE / UTF-32LE / X-ISO-10646-UCS-4-34121 / X-ISO-10646-UCS-4-21431
  • Others

    • WINDOWS-1252

1 Currently not supported by Java

3. How to use it

  1. Construct an instance of org.mozilla.universalchardet.UniversalDetector.
  2. Feed some data (typically several thousands bytes) to the detector by calling UniversalDetector.handleData().
  3. Notify the detector of the end of data by calling UniversalDetector.dataEnd().
  4. Get the detected encoding name by calling UniversalDetector.getDetectedCharset().
  5. Don‘t forget to call UniversalDetector.reset() before you reuse the detector instance.

Sample Code

Download ``` import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector { public static void main(String[] args) throws java.io.IOException { byte[] buf = new byte[4096]; String fileName = args[0]; java.io.FileInputStream fis = new java.io.FileInputStream(fileName);

// (1)
UniversalDetector detector = new UniversalDetector(null);

// (2)
int nread;
while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
  detector.handleData(buf, 0, nread);
}
// (3)
detector.dataEnd();

// (4)
String encoding = detector.getDetectedCharset();
if (encoding != null) {
  System.out.println("Detected encoding = " + encoding);
} else {
  System.out.println("No encoding detected.");
}

// (5)
detector.reset();

} } ```

4. Related Works

jchardet

  • http://jchardet.sourceforge.net/ jchardet is another Java port of the Mozilla‘s encoding dectection library. The main difference between jchardet and juniversalchardet is modules they are based on. jchardet is based on the ‘chardet‘ module that has long existed. juniversalchardet is based on the ‘universalchardet‘ module that is new and generally provides better accuracy on detection results.

5. License

The library is subject to the Mozilla Public License Version 1.1. Alternatively, the library may be used under the terms of either the GNU General Public License Version 2 or later, or the GNU Lesser General Public License 2.1 or later.

 

用juniversalchardet解决爬虫乱码问题

标签:out   win   结束   arch   最简   utf-16   other   rac   notify   

原文地址:http://www.cnblogs.com/lhp2012/p/6888318.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!