Bloom Filter

Date: 2015-03-14 15:30:58

While studying web crawlers I came across Bloom filters, so I am writing the algorithm down here.

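A Bloom filter is an improvement on plain hashing. Take the textbook example of filtering email addresses: if each address is stored as an MD5 digest (128 bits, 16 bytes), 100 million addresses need 12.8 billion bits, or 1.6 GB of memory. If those 100 million items were all distinct and consecutive, 100 million flag bits (one per item) would be enough. To tolerate collisions between arbitrary items, the filter instead uses 1.6 billion bits (200 MB), and an algorithm maps each item to 8 different flag bits. If any of those 8 bits is unset, the item has definitely not been seen before; if all 8 are set, the item has very probably been marked before, but not certainly, because the bits may have been set by other items. The method therefore has a false-positive rate. Building on the same idea, if 8 positions are not enough, 16 or 32 can be used with a correspondingly larger bit array; as long as the array stays below 128 bits per item, it needs less space than storing the full hashes. A Java implementation follows: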

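import java.util.BitSet;

public class BloomFilter {

    // Length of the bit array: 2 << 24 = 2^25 = 33,554,432 bits (4 MiB).
    // Must be a power of two, because SimpleHash masks with (cap - 1).
    private static final int DEFAULT_SIZE = 2 << 24;
    // One seed per hash function, so each value maps to 8 bit positions.
    private static final int[] seeds = { 3, 5, 7, 11, 13, 31, 37, 61 };
    private static final BitSet bits = new BitSet(DEFAULT_SIZE);
    private static final SimpleHash[] func = new SimpleHash[seeds.length];

    // Build the hash functions when the class loads, so that add() and
    // contains() work even when main() is not the entry point.
    static {
        for (int i = 0; i < seeds.length; i++) {
            func[i] = new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }

    // Set the bit chosen by every hash function for this value.
    public static void addValue(String value) {
        for (SimpleHash f : func) {
            bits.set(f.hash(value), true);
        }
    }

    // Add a value to the filter; null is ignored.
    public static void add(String value) {
        if (value != null) addValue(value);
    }

    // Returns false if the value was definitely never added, and true if it
    // was probably added (false positives are possible).
    public static boolean contains(String value) {
        if (value == null) return false;
        for (SimpleHash f : func) {
            if (!bits.get(f.hash(value))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String value = "xkeyideal@gmail.com";
        add(value);
        System.out.println(contains(value)); // prints "true"
    }
}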

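class SimpleHash {

    private final int cap;  // size of the bit array; must be a power of two
    private final int seed; // multiplier that distinguishes this hash function

    public SimpleHash(int cap, int seed) {
        this.cap = cap;
        this.seed = seed;
    }

    // Polynomial string hash: result = seed * result + next character.
    // Integer overflow simply wraps around, which is harmless here.
    public int hash(String value) {
        int result = 0;
        int len = value.length();
        for (int i = 0; i < len; i++) {
            result = seed * result + value.charAt(i);
        }
        // cap is a power of two, so the mask keeps the low bits and yields
        // a non-negative index in [0, cap).
        return (cap - 1) & result;
    }
}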
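
The sizing argument in the opening paragraph (100 million items, 128-bit digests versus 1.6 billion flag bits with 8 positions per item) can be checked with a few lines of arithmetic. The sketch below is not part of the original listing: the class name BloomSizing and its two helper methods are hypothetical, and the error estimate uses the standard approximation p ≈ (1 - e^(-kn/m))^k, which assumes the hash functions are independent and uniform. The simple seeded hashes above may not meet that assumption, so the result is a rough guide only.

```java
// Hypothetical helper, not from the original post: checks the memory and
// false-positive figures for the email example.
class BloomSizing {

    // Approximate false-positive rate for a filter of m bits holding n items
    // with k hash functions, assuming independent, uniform hashes:
    // p ≈ (1 - e^(-k*n/m))^k
    static double falsePositiveRate(long m, long n, int k) {
        return Math.pow(1.0 - Math.exp(-(double) k * n / m), k);
    }

    // Memory in megabytes (10^6 bytes) needed for the given number of bits.
    static double megabytes(long bits) {
        return bits / 8.0 / 1_000_000.0;
    }

    public static void main(String[] args) {
        long n = 100_000_000L;      // 100 million items, as in the text
        long hashBits = 128L * n;   // one full MD5 digest per item
        long filterBits = 16L * n;  // 1.6 billion flag bits, as in the text
        int k = 8;                  // 8 flag bits per item, as in the text

        System.out.println("full hashes (MB):  " + megabytes(hashBits));
        System.out.println("bloom filter (MB): " + megabytes(filterBits));
        System.out.println("estimated false-positive rate: "
                + falsePositiveRate(filterBits, n, k));
    }
}
```

Under these assumptions the filter needs 200 MB instead of 1600 MB, and the estimated false-positive rate comes out at roughly 0.06%. Note that the listing above allocates only 2^25 bits, so it would have to be enlarged before holding anything close to 100 million items.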

Original source: http://my.oschina.net/u/659405/blog/386999
