Sensitive Word Filtering Based on the DFA Algorithm

Posted: 2019-09-20 22:52:03


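DFA stands for Deterministic Finite Automaton: a machine with a finite set of states in which each state and input character together determine at most one next state.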
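To make the idea concrete, here is a minimal sketch of a DFA that recognizes exactly one word. The class name TinyDfa and the sample word are assumptions made for illustration; they are not part of the filter in this post. A missing transition is treated as an implicit rejecting state.

```java
import java.util.HashMap;
import java.util.Map;

class TinyDfa {
    // transitions.get(state).get(symbol) gives the next state; a missing entry means "reject".
    private final Map<Integer, Map<Character, Integer>> transitions = new HashMap<>();
    private final int acceptState;

    TinyDfa(String word) {
        // State i means "the first i characters of word have been read".
        for (int i = 0; i < word.length(); i++) {
            transitions.computeIfAbsent(i, k -> new HashMap<>()).put(word.charAt(i), i + 1);
        }
        acceptState = word.length();
    }

    boolean accepts(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            Map<Character, Integer> row = transitions.get(state);
            Integer next = (row == null) ? null : row.get(input.charAt(i));
            if (next == null) {
                return false; // no transition defined for this (state, symbol) pair
            }
            state = next;
        }
        return state == acceptState;
    }

    public static void main(String[] args) {
        TinyDfa dfa = new TinyDfa("bad");
        System.out.println(dfa.accepts("bad"));  // true
        System.out.println(dfa.accepts("ba"));   // false
        System.out.println(dfa.accepts("bade")); // false
    }
}
```

A trie of sensitive words is the same structure generalized to many words: each node is a state, and each child link is a transition on one character.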

To store Chinese characters, each node in the trie keys its children by the Java Character type rather than by ASCII code.
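The following sketch shows why Character keys are enough for this purpose: a Java char is a 16-bit UTF-16 code unit, so a common Chinese character fits in one key just as an ASCII letter does. The class name CharKeyDemo and the sample string are assumptions for illustration only; the string is written with Unicode escapes so the source stays plain ASCII.

```java
import java.util.HashMap;
import java.util.Map;

class CharKeyDemo {
    // Counts how often each char (one UTF-16 code unit) occurs in the text.
    static Map<Character, Integer> countChars(String text) {
        Map<Character, Integer> counts = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
            counts.merge(text.charAt(i), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two CJK characters (U+4E2D, U+6587), the letter a, then U+4E2D again.
        Map<Character, Integer> counts = countChars("\u4e2d\u6587a\u4e2d");
        System.out.println(counts.get('\u4e2d'));  // 2
        System.out.println((int) '\u4e2d' > 127);  // true: outside the ASCII range
    }
}
```

Characters outside the Basic Multilingual Plane occupy two char values and would be split across two keys; the filter in this post does not treat them specially.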

Define the default replacement text for sensitive words

private static final String REPLACE = " whatever";

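Define the trie

Each node in the trie has the following members:

1. Child nodes stored as key-value pairs: the key is one character of a sensitive word, and the value is the TrieNode for that character

2. An end flag that marks the end of a branch

3. A method that sets the end flag on a branch

4. A method that looks up a child node by its key

5. A method that checks whether the node is an end node

6. A method that adds a child node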

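import java.util.HashMap;
import java.util.Map;

public class TrieNode {

    // Marks the end of a sensitive word: true means a keyword ends here, false means it continues.
    private boolean end = false;

    // key is the next character, value is the node for that character.
    private Map<Character, TrieNode> subNodes = new HashMap<>();

    // Add a child node under the given character.
    void addSubNode(Character key, TrieNode node) { subNodes.put(key, node); }

    // Get the child node for the given character, or null if there is none.
    TrieNode getSubNode(Character key) { return subNodes.get(key); }

    boolean isKeywordEnd() { return end; }

    void setKeywordEnd(boolean end) { this.end = end; }

    public int getSubNodeCount() { return subNodes.size(); }

}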
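As a quick illustration of how these members fit together, the sketch below builds a two-word trie by hand. The wrapper class TrieNodeDemo, the helper buildSample, and the words "ab" and "ac" are assumptions for the example; the nested TrieNode is a compact copy of the class above so that the sketch compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

class TrieNodeDemo {
    // Compact copy of the node class, repeated so the sketch is self-contained.
    static class TrieNode {
        private boolean end = false;
        private final Map<Character, TrieNode> subNodes = new HashMap<>();

        void addSubNode(Character key, TrieNode node) { subNodes.put(key, node); }
        TrieNode getSubNode(Character key) { return subNodes.get(key); }
        boolean isKeywordEnd() { return end; }
        void setKeywordEnd(boolean end) { this.end = end; }
        int getSubNodeCount() { return subNodes.size(); }
    }

    // Builds root -> 'a' -> 'b' (end) and root -> 'a' -> 'c' (end) by hand.
    static TrieNode buildSample() {
        TrieNode root = new TrieNode();
        TrieNode a = new TrieNode();
        TrieNode b = new TrieNode();
        TrieNode c = new TrieNode();
        root.addSubNode('a', a);
        a.addSubNode('b', b);
        a.addSubNode('c', c);
        b.setKeywordEnd(true);
        c.setKeywordEnd(true);
        return root;
    }

    public static void main(String[] args) {
        TrieNode root = buildSample();
        System.out.println(root.getSubNodeCount());                 // 1: both words share 'a'
        System.out.println(root.getSubNode('a').getSubNodeCount()); // 2: 'b' and 'c'
        System.out.println(root.getSubNode('a').isKeywordEnd());    // false: "a" alone is not a word
        System.out.println(root.getSubNode('a').getSubNode('b').isKeywordEnd()); // true
    }
}
```

Words with a common prefix share nodes, which is why lookups cost time proportional to the length of the text being scanned rather than to the number of words in the list.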

Create an empty node as the root of the trie

private TrieNode rootNode = new TrieNode();

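// Check whether the input character is a meaningless symbol.
// CharUtils comes from Apache Commons Lang (org.apache.commons.lang3.CharUtils).
boolean isSymbol(char c) {
    int ic = (int) c;
    // 0x2E80-0x9FFF is the East Asian script range
    return !CharUtils.isAsciiAlphanumeric(c) && (ic < 0x2E80 || ic > 0x9FFF);
}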
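For readers who want to try the symbol check without Apache Commons Lang, here is a rough plain-Java equivalent. The class name SymbolCheck and the hand-written isAsciiAlphanumeric are stand-ins assumed for this sketch; they are meant to mirror the behavior of the library call, not to replace it.

```java
class SymbolCheck {
    // Plain-Java stand-in for CharUtils.isAsciiAlphanumeric: true for a-z, A-Z, 0-9 only.
    static boolean isAsciiAlphanumeric(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // A char counts as a symbol when it is neither ASCII alphanumeric
    // nor inside the 0x2E80-0x9FFF East Asian range.
    static boolean isSymbol(char c) {
        int ic = (int) c;
        return !isAsciiAlphanumeric(c) && (ic < 0x2E80 || ic > 0x9FFF);
    }

    public static void main(String[] args) {
        System.out.println(isSymbol('a'));      // false
        System.out.println(isSymbol('7'));      // false
        System.out.println(isSymbol(' '));      // true
        System.out.println(isSymbol('*'));      // true
        System.out.println(isSymbol('\u4e2d')); // false: U+4E2D is a CJK ideograph
    }
}
```

One consequence of the range test is that punctuation encoded inside 0x2E80-0x9FFF, such as the ideographic full stop U+3002, is not treated as a symbol, while fullwidth forms above 0x9FFF are.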

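Filtering sensitive words

First check whether the input is blank with StringUtils.isBlank(); if it is, return the input string unchanged.

Define the result as a StringBuffer named result; everything that is kept or replaced is appended to it.

int begin = 0; // rollback position: where the current match attempt started

int position = 0; // position currently being compared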

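public String filter(String text) {
    if (StringUtils.isBlank(text)) {
        return text;
    }
    StringBuffer result = new StringBuffer();
    TrieNode tempNode = rootNode;
    int begin = 0;    // rollback position
    int position = 0; // position currently being compared

    while (position < text.length()) {
        char c = text.charAt(position);
        // Skip symbols such as spaces
        if (isSymbol(c)) {
            // Outside a match attempt the symbol is kept as-is
            if (tempNode == rootNode) {
                result.append(c);
                ++begin;
            }
            ++position;
            continue;
        }

        tempNode = tempNode.getSubNode(c);

        if (tempNode == null) {
            // No sensitive word starts at begin: keep that character
            result.append(text.charAt(begin));
            // Restart the match from the next character
            position = begin + 1;
            begin = position;
            tempNode = rootNode;
        } else if (tempNode.isKeywordEnd()) {
            // Sensitive word found: replace everything from begin to position with REPLACE
            result.append(REPLACE);
            position = position + 1;
            begin = position;
            tempNode = rootNode;
        } else {
            ++position;
        }
    }

    // Append the tail left over from an unfinished match attempt
    result.append(text.substring(begin));

    return result.toString();
}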

private void addWord(String lineTxt) {
    TrieNode tempNode = rootNode;
    // Iterate over each character
    for (int i = 0; i < lineTxt.length(); ++i) {
        Character c = lineTxt.charAt(i);
        // Skip symbols such as spaces
        if (isSymbol(c)) {
            continue;
        }
        TrieNode node = tempNode.getSubNode(c);

        if (node == null) { // not created yet
            node = new TrieNode();
            tempNode.addSubNode(c, node);
        }

        tempNode = node;

        if (i == lineTxt.length() - 1) {
            // End of the keyword: set the end flag
            tempNode.setKeywordEnd(true);
        }
    }
}
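Putting the pieces together, the following is a minimal self-contained sketch of the trie, addWord, and filter, so the behavior can be tried without the rest of the project. Several things are assumptions made for the example: the class name SensitiveFilterDemo, the replacement text "***", the sample words, a simplified TrieNode with direct field access, and a plain-Java symbol check in place of CharUtils and StringUtils.

```java
import java.util.HashMap;
import java.util.Map;

class SensitiveFilterDemo {
    // Simplified node: fields are accessed directly to keep the sketch short.
    static class TrieNode {
        boolean end = false;
        final Map<Character, TrieNode> subNodes = new HashMap<>();
    }

    private static final String REPLACE = "***"; // hypothetical replacement text
    private final TrieNode rootNode = new TrieNode();

    // Plain-Java version of the symbol check (no Apache Commons Lang).
    static boolean isSymbol(char c) {
        boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
        return !alnum && (c < 0x2E80 || c > 0x9FFF);
    }

    void addWord(String word) {
        TrieNode tempNode = rootNode;
        for (int i = 0; i < word.length(); ++i) {
            char c = word.charAt(i);
            if (isSymbol(c)) {
                continue;
            }
            TrieNode node = tempNode.subNodes.get(c);
            if (node == null) {
                node = new TrieNode();
                tempNode.subNodes.put(c, node);
            }
            tempNode = node;
            if (i == word.length() - 1) {
                tempNode.end = true; // last character of the word
            }
        }
    }

    String filter(String text) {
        if (text == null || text.trim().isEmpty()) {
            return text;
        }
        StringBuilder result = new StringBuilder();
        TrieNode tempNode = rootNode;
        int begin = 0;    // where the current match attempt started
        int position = 0; // character currently being compared
        while (position < text.length()) {
            char c = text.charAt(position);
            if (isSymbol(c)) {
                if (tempNode == rootNode) { // not inside a match: keep the symbol
                    result.append(c);
                    ++begin;
                }
                ++position;
                continue;
            }
            tempNode = tempNode.subNodes.get(c);
            if (tempNode == null) {           // no word starts at begin
                result.append(text.charAt(begin));
                position = begin + 1;
                begin = position;
                tempNode = rootNode;
            } else if (tempNode.end) {        // whole word matched
                result.append(REPLACE);
                position = position + 1;
                begin = position;
                tempNode = rootNode;
            } else {
                ++position;
            }
        }
        result.append(text.substring(begin));
        return result.toString();
    }

    public static void main(String[] args) {
        SensitiveFilterDemo demo = new SensitiveFilterDemo();
        demo.addWord("bad");
        demo.addWord("evil");
        System.out.println(demo.filter("a b*a*d idea")); // a *** idea
        System.out.println(demo.filter("bab"));          // bab
    }
}
```

The first call shows why symbols are skipped inside a match attempt: "b*a*d" is still recognized as "bad". The second shows the rollback: after "ba" fails on the third character, scanning restarts one character after begin and nothing is replaced.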

public void afterPropertiesSet() throws Exception {
        rootNode = new TrieNode();

        // Read one sensitive word per line from SensitiveWords.txt on the classpath.
        // try-with-resources closes the streams even if reading fails.
        try (InputStream is = Thread.currentThread().getContextClassLoader()
                     .getResourceAsStream("SensitiveWords.txt");
             InputStreamReader read = new InputStreamReader(is);
             BufferedReader bufferedReader = new BufferedReader(read)) {
            String lineTxt;
            while ((lineTxt = bufferedReader.readLine()) != null) {
                lineTxt = lineTxt.trim();
                addWord(lineTxt);
            }
        } catch (Exception e) {
            logger.error("Failed to read the sensitive words file: " + e.getMessage());
        }
    }

}
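The loading step can be tried in isolation with an in-memory reader. The sketch below assumes the word file holds one word per line, which is what the readLine loop implies; the class name WordListLoader, the helper readWords, and the sample words are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class WordListLoader {
    // Reads one word per line, trimming whitespace and skipping blank lines.
    static List<String> readWords(Reader source) {
        List<String> words = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    words.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Stand-in for the contents of SensitiveWords.txt; the words are made up.
        String sample = "bad\n  evil  \n\nspam\n";
        System.out.println(readWords(new StringReader(sample))); // [bad, evil, spam]
    }
}
```

Each returned word would then be passed to addWord. Skipping blank lines here is a small addition of this sketch; in the method above a blank line is passed to addWord, whose loop simply does nothing for an empty string.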


Original article: https://www.cnblogs.com/bowenqianngzhibushiwo/p/11560161.html
