Sensitive Word Filtering in Java: Implementation Methods and Optimizations Explained

When building a system or application, we often need to review user-submitted comments or articles and check or filter the sensitive words they contain.

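Sensitive word filtering is a very common way to keep prohibited terms out of user-generated content.

A Complete Approach to Sensitive Word Filtering in Java, with Optimization Strategies

Sensitive word filtering is an important part of content safety. The sections below describe several ways to implement it in Java and how to optimize each of them.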

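I. Basic Implementations

1. Simple string matching (suitable for small-scale use)

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SimpleFilter {
    // Small in-memory word list; every word is checked against the text in turn
    private static final Set<String> SENSITIVE_WORDS =
            new HashSet<>(Arrays.asList("敏感词1", "敏感词2"));

    public static String filter(String text) {
        for (String word : SENSITIVE_WORDS) {
            // replace() leaves the text unchanged when the word is absent,
            // so a separate contains() check is not needed
            text = text.replace(word, "***");
        }
        return text;
    }
}

Drawbacks: the time complexity is roughly O(n*m) for a text of length n and m words, so performance is poor, and it cannot handle extensions such as obfuscated variants or pinyin spellings.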

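2. Regular expression matching

import java.util.regex.Pattern;

public class RegexFilter {
    // Compile once: String.replaceAll would recompile the regex on every call.
    // Words containing regex metacharacters should be escaped with Pattern.quote.
    private static final Pattern PATTERN = Pattern.compile("敏感词1|敏感词2|敏感词3");

    public static String filter(String text) {
        return PATTERN.matcher(text).replaceAll("***");
    }
}

Drawbacks: building the regex is slow, and performance drops noticeably as the word list grows. It is still worth considering in some scenarios, for example by splitting the word list into shards and matching each shard separately.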
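
The sharding idea can be sketched as follows. This is only an illustration under assumptions: the class name ShardedRegexFilter and the shardSize parameter are hypothetical, and the right shard size would have to be measured for a real word list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical sketch: one compiled Pattern per shard of the word list
class ShardedRegexFilter {
    private final List<Pattern> shards = new ArrayList<>();

    ShardedRegexFilter(List<String> words, int shardSize) {
        if (shardSize <= 0) {
            throw new IllegalArgumentException("shardSize must be positive");
        }
        for (int from = 0; from < words.size(); from += shardSize) {
            int to = Math.min(from + shardSize, words.size());
            // Pattern.quote escapes regex metacharacters inside each word
            String alternation = words.subList(from, to).stream()
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|"));
            shards.add(Pattern.compile(alternation));
        }
    }

    // Applies every shard in turn; each shard masks the words it knows about
    String filter(String text) {
        String result = text;
        for (Pattern shard : shards) {
            result = shard.matcher(result).replaceAll("***");
        }
        return result;
    }
}
```

Each shard still scans the whole text, so sharding bounds the size of any single regex rather than the total work; an updated shard can be recompiled without rebuilding the others.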

II. Efficient Implementations

1. Trie (prefix tree)

import java.util.HashMap;
import java.util.Map;

class TrieNode {
    private final Map<Character, TrieNode> children = new HashMap<>();
    private boolean isEnd;

    public Map<Character, TrieNode> getChildren() { return children; }
    public boolean isEnd() { return isEnd; }
    public void setEnd(boolean end) { this.isEnd = end; }
}

public class TrieFilter {
    private final TrieNode root = new TrieNode();

    // Build the trie: one node per character, shared across common prefixes
    public void addWord(String word) {
        TrieNode node = root;
        for (char c : word.toCharArray()) {
            node = node.getChildren().computeIfAbsent(c, k -> new TrieNode());
        }
        node.setEnd(true);
    }

    // Replace every match with '*' (shortest match starting at each position)
    public String filter(String text) {
        StringBuilder result = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            TrieNode node = root;
            int matchEnd = -1; // exclusive end index of the match, or -1 if none
            for (int j = i; j < text.length(); j++) {
                node = node.getChildren().get(text.charAt(j));
                if (node == null) break;
                if (node.isEnd()) {
                    matchEnd = j + 1;
                    break;
                }
            }
            if (matchEnd > 0) {
                // Sensitive word found: mask it and continue right after it
                result.append("*".repeat(matchEnd - i)); // String.repeat needs Java 11+
                i = matchEnd;
            } else {
                result.append(text.charAt(i));
                i++;
            }
        }
        return result.toString();
    }
}

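In effect, a trie is a tree-shaped directed acyclic graph, and a special case of a DFA (tree-structured, with no failure transitions).

Advantage: matching time is close to O(n) in the text length n (more precisely O(n·L), where L is the length of the longest word, because the scan restarts from the root at every position) and does not depend on the number of words, so it suits large word lists.

A trie offers efficient insertion and lookup, especially when the sensitive words share prefixes (for example ab, abc, abcd). It is also fairly space-efficient, because common prefixes are stored only once.

It has drawbacks too: the upfront cost of building the tree is relatively high, and for words without common prefixes the gain is small.

Tries are therefore a good fit for fast dictionary lookup, prefix-based autocompletion, and fast routing by prefix match.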
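
To make prefix sharing and prefix-based completion concrete, here is a minimal sketch. The class PrefixTrieSketch and its methods are hypothetical names for illustration, not part of the filter above; a TreeMap is used only so that completions come back in a stable order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PrefixTrieSketch {
    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>(); // sorted for stable output
        boolean isEnd;
    }

    private final Node root = new Node();
    private int nodeCount = 0; // nodes created so far, excluding the root

    void add(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            Node next = node.children.get(c);
            if (next == null) {          // only unseen characters cost a new node
                next = new Node();
                node.children.put(c, next);
                nodeCount++;
            }
            node = next;
        }
        node.isEnd = true;
    }

    int nodeCount() { return nodeCount; }

    // All stored words that start with the given prefix, in sorted order
    List<String> complete(String prefix) {
        List<String> out = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return out; // no stored word has this prefix
        }
        collect(node, new StringBuilder(prefix), out);
        return out;
    }

    // Depth-first walk that rebuilds each word from the path taken
    private void collect(Node node, StringBuilder path, List<String> out) {
        if (node.isEnd) out.add(path.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            collect(e.getValue(), path, out);
            path.deleteCharAt(path.length() - 1);
        }
    }
}
```

Adding ab, abc and abcd creates four nodes for nine characters of input, which is the space saving the text describes.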

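2. DFA (Deterministic Finite Automaton)

DFA stands for Deterministic Finite Automaton. Matching with a DFA is an efficient text-matching technique that is particularly well suited to sensitive word filtering.

A DFA consists of a set of states and the transitions between them, and the transitions are driven by the input string. In every state, the next character determines exactly which state to move to. If the DFA is in an accepting state when the input ends, the input is considered a match.

In effect it is a general directed graph (which may contain cycles, such as self-loops). Following a path from the initial state to an accepting state counts as a successful match, that is, one sensitive word.

It is made up of three parts:

Nodes (states): the states of the automaton, including:
- the initial state (the starting point)
- intermediate states
- accepting states (reached when a sensitive word has been matched)
Edges (transitions): the conditions for moving between states; each edge corresponds to one input character (a letter, a Chinese character, and so on).
Accepting states: some nodes are marked as accepting, meaning that the path from the initial state to that node spells a complete sensitive word.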


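The matching process works like this:

Read an input character c and check whether the current state has an edge labeled c.
If it does, move to the next state; if not, the match fails.
If the walk ends in an accepting state, the input text contains a sensitive word.
Otherwise, it does not.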

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class DFAFilter {
    // Nested maps: each key is a single character; the special key "isEnd" marks the end of a word
    private final Map<String, Object> sensitiveWordMap = new HashMap<>();

    // Build the sensitive word dictionary
    @SuppressWarnings("unchecked")
    public void init(Set<String> words) {
        for (String word : words) {
            Map<String, Object> nowMap = sensitiveWordMap;
            for (int i = 0; i < word.length(); i++) {
                String key = String.valueOf(word.charAt(i));
                Object tempMap = nowMap.get(key);
                if (tempMap == null) {
                    Map<String, Object> newMap = new HashMap<>();
                    newMap.put("isEnd", "0");
                    nowMap.put(key, newMap);
                    nowMap = newMap;
                } else {
                    nowMap = (Map<String, Object>) tempMap;
                }
                if (i == word.length() - 1) {
                    nowMap.put("isEnd", "1");
                }
            }
        }
    }

    // Filter: replace each matched word with '*'
    public String filter(String text) {
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            int length = checkWord(text, i);
            if (length > 0) {
                result.append("*".repeat(length)); // String.repeat needs Java 11+
                i += length - 1;
            } else {
                result.append(text.charAt(i));
            }
        }
        return result.toString();
    }

    // Returns the length of the (shortest) sensitive word starting at beginIndex, or 0 if none
    @SuppressWarnings("unchecked")
    private int checkWord(String text, int beginIndex) {
        boolean flag = false;
        int matchLength = 0;
        Map<String, Object> tempMap = sensitiveWordMap;
        for (int i = beginIndex; i < text.length(); i++) {
            String word = String.valueOf(text.charAt(i));
            tempMap = (Map<String, Object>) tempMap.get(word);
            if (tempMap == null) break;
            matchLength++;
            if ("1".equals(tempMap.get("isEnd"))) {
                flag = true;
                break;
            }
        }
        return flag ? matchLength : 0;
    }
}

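Memory optimizations:
- Double-array trie: compresses the state storage and reduces memory usage.
- Shared prefixes: the DFA merges states with the same prefix (for example, "敏感词" and "敏感内容" share the path "敏感").
Matching speed:
- AC automaton: adds failure pointers on top of the DFA to support multi-pattern matching (similar to the KMP algorithm).
- Batch processing: split long texts into chunks and check them in parallel.
Engineering practice:
- Hot updates: load the word list dynamically, with no service restart.
- Multi-level filtering: use a Bloom filter first to quickly rule out texts with no sensitive words, then run the exact DFA match.

One recommended tool is sensitive-word, a high-performance Java sensitive word filtering framework based on the DFA algorithm.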

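III. Advanced Optimizations

1. Multi-pattern matching

AC automaton (Aho-Corasick algorithm)

import java.util.Arrays;
import java.util.Set;

// ACTrie (insert, buildFailureLinks, parseText, Match) is a helper class that is not shown in this article
public class ACFilter {
    private ACTrie trie;

    public void init(Set<String> words) {
        trie = new ACTrie();
        for (String word : words) {
            trie.insert(word);
        }
        // Failure links must be built after all words have been inserted
        trie.buildFailureLinks();
    }

    public String filter(String text) {
        Set<ACTrie.Match> matches = trie.parseText(text);
        char[] chars = text.toCharArray();
        for (ACTrie.Match match : matches) {
            // getEnd() is treated as inclusive here, hence the + 1
            Arrays.fill(chars, match.getStart(), match.getEnd() + 1, '*');
        }
        return new String(chars);
    }
}

Advantage: all patterns are matched in a single scan of the text, so the time complexity is O(n) in the text length plus the number of matches reported.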
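
Since the ACTrie helper is not shown, the following is a minimal self-contained sketch of the Aho-Corasick idea itself: a trie plus failure links built breadth-first. The class AhoCorasickSketch and its field names are assumptions made for illustration, not the ACTrie API used above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class AhoCorasickSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail; // state for the longest proper suffix that is also a trie prefix
        final List<Integer> wordLengths = new ArrayList<>(); // lengths of all words ending here
    }

    private final Node root = new Node();

    AhoCorasickSketch(Collection<String> words) {
        for (String w : words) {
            if (w.isEmpty()) continue;
            Node node = root;
            for (char c : w.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.wordLengths.add(w.length());
        }
        buildFailureLinks();
    }

    // Breadth-first pass: a child's failure link is found by walking its parent's failure chain
    private void buildFailureLinks() {
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(c)) {
                    f = f.fail;
                }
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                // Inherit words that end here through a suffix, so one lookup reports them all
                child.wordLengths.addAll(child.fail.wordLengths);
                queue.add(child);
            }
        }
    }

    // Single left-to-right scan; every occurrence of every word is masked, overlaps included
    String filter(String text) {
        char[] out = text.toCharArray();
        Node state = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (state != root && !state.next.containsKey(c)) {
                state = state.fail; // fall back instead of restarting the scan
            }
            Node target = state.next.get(c);
            state = (target != null) ? target : root;
            for (int len : state.wordLengths) {
                Arrays.fill(out, i - len + 1, i + 1, '*');
            }
        }
        return new String(out);
    }
}
```

Unlike the trie and DFA filters earlier, the scan never moves backwards in the text: on a mismatch it follows a failure link, which is what keeps the pass linear.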

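2. Bloom filter preprocessing

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class BloomFilterPreprocessor {
    private BloomFilter<String> bloomFilter;
    private Set<String> exactMatchSet;

    public void init(Set<String> words) {
        // Guava Bloom filter sized for the word list, with a 1% false-positive rate
        bloomFilter = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8), words.size(), 0.01);
        exactMatchSet = new HashSet<>(words);
        words.forEach(bloomFilter::put);
    }

    // May return false positives, never false negatives
    public boolean mightContain(String text) {
        return bloomFilter.mightContain(text);
    }

    // Exact check used to confirm a Bloom filter hit
    public boolean exactMatch(String text) {
        return exactMatchSet.contains(text);
    }
}

Purpose: first decide quickly whether the input might be sensitive, then confirm with an exact match. Note that the filter above tests whole strings for membership, so it should be fed candidate tokens or substrings rather than an entire text.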
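
One way to pre-screen a whole text rather than a single token is to store only the first two characters of every word and scan the text's character pairs. The sketch below illustrates that idea with a tiny hand-rolled Bloom filter so that it has no external dependencies. The class name, the two hash functions and the sizeInBits parameter are all assumptions, and the sketch assumes every word has at least two characters.

```java
import java.util.BitSet;
import java.util.Collection;

class BigramBloomPrefilter {
    private final BitSet bits;
    private final int size;

    BigramBloomPrefilter(Collection<String> words, int sizeInBits) {
        if (sizeInBits <= 0) {
            throw new IllegalArgumentException("sizeInBits must be positive");
        }
        this.size = sizeInBits;
        this.bits = new BitSet(sizeInBits);
        for (String w : words) {
            if (w.length() < 2) {
                throw new IllegalArgumentException("sketch assumes words of length >= 2: " + w);
            }
            add(w.charAt(0), w.charAt(1)); // only the leading pair is stored
        }
    }

    // Two simple, illustrative hash functions over a character pair
    private int h1(char a, char b) { return Math.floorMod(31 * a + b, size); }
    private int h2(char a, char b) { return Math.floorMod(131 * b + 7 * a + 17, size); }

    private void add(char a, char b) {
        bits.set(h1(a, b));
        bits.set(h2(a, b));
    }

    private boolean mightContainPair(char a, char b) {
        return bits.get(h1(a, b)) && bits.get(h2(a, b));
    }

    // false: the text certainly contains no listed word; true: run the exact matcher
    boolean mightContainSensitiveWord(String text) {
        for (int i = 0; i + 1 < text.length(); i++) {
            if (mightContainPair(text.charAt(i), text.charAt(i + 1))) {
                return true;
            }
        }
        return false;
    }
}
```

Any text that contains a listed word also contains that word's leading pair, so the pre-check can produce false positives but no false negatives; texts that pass it still go through the trie, DFA or AC matcher.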

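IV. Engineering Practices

1. Dynamic loading of the word list

import java.util.Map;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DynamicWordFilter {
    // volatile: readers always see a fully built map after a reload
    private volatile Map<String, Object> wordMap;
    private ScheduledExecutorService executor;

    public void init() {
        loadWords();
        executor = Executors.newSingleThreadScheduledExecutor();
        // Reload the word list every hour
        executor.scheduleAtFixedRate(this::loadWords, 1, 1, TimeUnit.HOURS);
    }

    private void loadWords() {
        // Load the sensitive words from a database or file (loadFromDB is not shown here)
        Set<String> words = loadFromDB();
        // Build the DFA structure off to the side, then publish it with a single assignment
        // (buildDFA is not shown here)
        this.wordMap = buildDFA(words);
    }
}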
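
The hot-swap pattern above can be tried in isolation. In this sketch the Supplier stands in for the database or file loader and a plain Set stands in for the DFA; both are assumptions for illustration, as is the class name HotReloadSketch. It also keeps the previous snapshot when a reload fails, which matters because scheduleAtFixedRate stops scheduling a task once it throws.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

class HotReloadSketch {
    private final Supplier<Set<String>> source; // hypothetical loader (database, file, ...)
    private final AtomicReference<Set<String>> current =
            new AtomicReference<>(Collections.emptySet());

    HotReloadSketch(Supplier<Set<String>> source) {
        this.source = source;
        reload();
    }

    // Builds a fresh immutable snapshot and publishes it in one step.
    // Returns false and keeps the previous snapshot if loading fails.
    boolean reload() {
        try {
            Set<String> fresh = Collections.unmodifiableSet(new HashSet<>(source.get()));
            current.set(fresh);
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    // Readers never lock: they see either the old snapshot or the new one, never a half-built one
    boolean containsWord(String word) {
        return current.get().contains(word);
    }
}
```

A scheduler can call reload() periodically; because the method catches loader failures, one bad load does not cancel later runs.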

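2. Distributed sensitive word filtering

import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.web.client.RestTemplate;

public class DistributedFilter {
    private RedisTemplate<String, String> redisTemplate;
    private RestTemplate restTemplate;

    public boolean isSensitive(String text) {
        // The sensitive words are stored in a Redis Set; isMember returns a Boolean that may be null
        return Boolean.TRUE.equals(redisTemplate.opsForSet().isMember("sensitive:words", text));
    }

    public String filter(String text) {
        // Delegate to the distributed filtering service
        return restTemplate.postForObject("http://filter-service/filter", text, String.class);
    }
}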

3. Using Elasticsearch as the search engine is another option

For background on why ES can be used here, see the article "Why use ElasticSearch? How does it differ from a traditional database such as MySQL?"

In short, ES's strong text analysis and query capabilities make this possible. The detailed approaches follow.

Implementation options

Option 1: mark sensitive words at index time (recommended)

Steps:

Define a custom analyzer:

PUT /sensitive_content_index
{
"settings": {
"analysis": {
"analyzer": {
"sensitive_filter_analyzer": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"sensitive_word_filter"
]
}
},
"filter": {
"sensitive_word_filter": {
"type": "stop",
"stopwords": ["敏感词1", "敏感词2", "违法词"]
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "sensitive_filter_analyzer",
"fields": {
"original": {
"type": "keyword" // keeps the original content
}
}
}
}
}
}

Detect sensitive words:

GET /sensitive_content_index/_analyze
{
"analyzer": "sensitive_filter_analyzer",
"text": "这是一段包含敏感词1的文本"
}

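Output: sensitive words are filtered out and only ordinary terms are returned. (The standard tokenizer splits Chinese text into single characters, so multi-character stopwords like these only match when a Chinese-aware tokenizer is used.)

Mark automatically on write:

POST /sensitive_content_index/_doc
{
"content": "这是需要检测的文本",
"has_sensitive": false // updated by the pipeline
}

Use an ingest pipeline to detect sensitive words automatically: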

PUT _ingest/pipeline/sensitive_check_pipeline
{
"processors": [
{
"script": {
"source": """
def sensitiveWords = ['敏感词1', '违禁词'];
for (word in sensitiveWords) {
if (ctx.content.contains(word)) {
ctx.has_sensitive = true;
ctx.sensitive_word = word;
break;
}
}
"""
}
}
]
}

Option 2: filter sensitive words at query time

Detect with a terms query:

GET /content_index/_search
{
"query": {
"bool": {
"must_not": [
{ "terms": { "content": ["敏感词1", "违禁词"] }}
]
}
}
}

Highlight sensitive words:

GET /content_index/_search
{
"query": {
"match": { "content": "正常文本" }
},
"highlight": {
"fields": {
"content": {
"highlight_query": {
"terms": { "content": ["敏感词1", "违禁词"] }
}
}
}
}
}

Option 3: combine with machine learning (ES 7.15+)

Train a sensitive word classification model:

PUT _ml/trained_models/sensitive_words_classifier
{
"input": {"field_names": ["text"]},
"inference_config": {
"text_classification": {
"vocabulary": ["敏感词1", "变体词", "拼音词"]
}
}
}

Deploy the inference processor:

PUT _ingest/pipeline/ml_sensitive_detection
{
"processors": [
{
"inference": {
"model_id": "sensitive_words_classifier",
"field_map": { "content": "text" }
}
}
]
}

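Performance tips

Word list storage:
- Use ES's synonym token filter to manage synonyms and variant spellings
- Store the word list in a separate index and update it regularly

Cache acceleration: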

PUT /sensitive_words_cache
{
"mappings": {
"properties": {
"word": { "type": "keyword" }
}
}
}

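Distributed detection:
- Process large documents in shards
- Use the _search_shards API for parallel detection

V. Performance Optimization Tips

Memory:
- Use primitive types instead of wrapper classes
- Compress the trie structure (Ternary Search Tree)
- Reuse objects to reduce GC pressure
Algorithms:
- Fail fast on short texts
- Use multi-level filtering (coarse screening first, then fine screening)
- Parallelize processing (Fork/Join framework)
Preprocessing:
- Text normalization (full-width to half-width, Traditional to Simplified Chinese)
- Pinyin conversion (for example "taobao" -> "淘宝")
- Handling of near-homophones and visually similar characters
Caching:
- Cache the filter results of frequently seen texts
- Use Caffeine for local caching
- Pre-screen with a Bloom filter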
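
As an illustration of the text normalization step, here is a minimal sketch covering only full-width to half-width conversion and lowercasing; Traditional to Simplified conversion needs a mapping table and is not covered. The class and method names are hypothetical.

```java
class TextNormalizerSketch {
    // Maps full-width ASCII variants (U+FF01..U+FF5E) and the ideographic space (U+3000)
    // to their half-width forms, then lowercases, so "ＢＡＤ" and "bad" compare equal
    static String normalize(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u3000') {
                c = ' ';                         // ideographic space -> ASCII space
            } else if (c >= '\uFF01' && c <= '\uFF5E') {
                c = (char) (c - 0xFEE0);         // fixed offset between the two blocks
            }
            sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }
}
```

The filter would then run on the normalized text. The mapping is one character to one character, so match positions found in the normalized text can be used directly to mask the original.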

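VI. A Complete Production-Grade Example

import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.springframework.beans.factory.InitializingBean;

// loadWords, addWord, findNextMatch, MatchResult and FilterResult are helpers not shown here
public class ProductionWordFilter implements InitializingBean {
    // volatile rather than final: reload() swaps in a freshly built trie
    private volatile TrieNode root = new TrieNode();
    private final List<String> wordSources;
    private final ScheduledExecutorService executor;

    public ProductionWordFilter(List<String> wordSources) {
        this.wordSources = wordSources;
        this.executor = Executors.newSingleThreadScheduledExecutor();
    }

    @Override
    public void afterPropertiesSet() {
        reload();
        // Rebuild the trie one hour after each reload finishes
        executor.scheduleWithFixedDelay(this::reload, 1, 1, TimeUnit.HOURS);
    }

    public synchronized void reload() {
        // Build the new trie off to the side, then publish it with one assignment
        TrieNode newRoot = new TrieNode();
        wordSources.stream()
                .flatMap(source -> loadWords(source).stream())
                .forEach(word -> addWord(newRoot, word));
        this.root = newRoot;
    }

    public FilterResult filter(String text) {
        StringBuilder result = new StringBuilder();
        Set<String> foundWords = new HashSet<>();
        int replacedCount = 0;
        for (int i = 0; i < text.length(); ) {
            MatchResult match = findNextMatch(text, i);
            if (match != null) {
                foundWords.add(match.getWord());
                result.append("*".repeat(match.getLength()));
                replacedCount++;
                i = match.getEndIndex(); // exclusive end: scanning resumes right after the match
            } else {
                result.append(text.charAt(i));
                i++;
            }
        }
        return new FilterResult(result.toString(), foundWords, replacedCount);
    }
    // Other helper methods...
}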
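
The matching helper is left out of the example above. One plausible shape for it is a longest-match lookup starting at a given position, sketched below. Everything here (LongestMatchSketch, Node, Match, findMatchAt) is a hypothetical stand-in, not the author's findNextMatch or MatchResult.

```java
import java.util.HashMap;
import java.util.Map;

class LongestMatchSketch {
    static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isEnd;
    }

    static final class Match {
        final String word;
        final int endIndex; // exclusive
        Match(String word, int endIndex) {
            this.word = word;
            this.endIndex = endIndex;
        }
    }

    private final Node root = new Node();

    void addWord(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isEnd = true;
    }

    // Longest listed word that starts exactly at 'start', or null if there is none
    Match findMatchAt(String text, int start) {
        Node node = root;
        int bestEnd = -1;
        for (int j = start; j < text.length(); j++) {
            node = node.children.get(text.charAt(j));
            if (node == null) break;
            if (node.isEnd) bestEnd = j + 1; // keep walking: a longer word may still follow
        }
        return bestEnd < 0 ? null : new Match(text.substring(start, bestEnd), bestEnd);
    }
}
```

Preferring the longest match means that with both "ab" and "abcd" in the list, the text "abcd" is masked as one four-character word instead of leaving "cd" visible, which is the main behavioral difference from the shortest-match filters in section II.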