feat: improve semantic interest scoring and typo generation

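- Implement background warm-up for the Chinese typo generator to improve first-use performance.
- Update MessageStorageBatcher to support a configurable commit batch size and interval, improving database write performance.
- Enhance the dataset generator with a hard limit on sample size and more efficient sampling.
- Raise the maximum sample count in AutoTrainer to 1000 to make better use of training data.
- Refactor the affinity interest calculator to avoid concurrent initialization and optimize model loading logic.
- Introduce batch processing for semantic interest scoring to handle high-frequency chat scenarios.
- Update the configuration template to reflect the new scoring parameters and remove the deprecated interest threshold.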
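
For readers unfamiliar with the "configurable commit batch size and interval" idea mentioned for MessageStorageBatcher, the following is a minimal sketch of the general pattern only. It is not the project's implementation: the class name `SimpleBatcher`, its parameters (`commit`, `batch_size`, `interval`, `clock`), and the default values are all hypothetical and chosen for illustration.

```python
import time
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class SimpleBatcher(Generic[T]):
    """Hypothetical stand-in, NOT the project's MessageStorageBatcher.

    Buffers items and hands them to `commit` in one call when either
    `batch_size` items are pending or `interval` seconds have passed
    since the last flush.
    """

    def __init__(
        self,
        commit: Callable[[List[T]], None],
        batch_size: int = 50,      # assumed default, for illustration only
        interval: float = 2.0,     # assumed default, for illustration only
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._commit = commit
        self._batch_size = batch_size
        self._interval = interval
        self._clock = clock
        self._pending: List[T] = []
        self._last_flush = clock()

    def add(self, item: T) -> None:
        """Queue one item; flush if the size or time threshold is reached."""
        self._pending.append(item)
        too_many = len(self._pending) >= self._batch_size
        too_old = self._clock() - self._last_flush >= self._interval
        if too_many or too_old:
            self.flush()

    def flush(self) -> None:
        """Commit all pending items as a single batch, if there are any."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._commit(batch)
        self._last_flush = self._clock()
```

In this sketch the time check only runs when `add` is called, so a quiet period leaves items buffered until the next `add` or an explicit `flush()`; a real batcher would more likely use a background task or timer for the interval. Larger batches and longer intervals mean fewer database commits at the cost of more delay before a message is persisted.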
2025-12-12 14:11:36 +08:00
parent 9d01b81cef
commit e6a4f855a2
17 changed files with 433 additions and 554 deletions
--- a/src/chat/semantic_interest/features_tfidf.py
+++ b/src/chat/semantic_interest/features_tfidf.py
@@ -26,7 +26,7 @@ class TfidfFeatureExtractor:
    def __init__(
        self,
        analyzer: str = "char",  # type: ignore
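-        ngram_range: tuple[int, int] = (2, 3),  # Optimization: narrow the n-gram range
+        ngram_range: tuple[int, int] = (2, 4),  # Optimization: narrow the n-gram range
        max_features: int = 10000,  # Optimization: fewer features; halves matrix size and dot product cost
        min_df: int = 3,  # Optimization: filter out low-frequency n-grams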
        max_df: float = 0.95,