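feat(cache): improve memory management and monitoring

- Add a health monitoring system to CacheManager with detailed memory statistics
- Implement accurate memory estimation using the new memory_utils module
- Add size-based limits on cache entries to prevent oversized items
- Optimize cache statistics by deduplicating memory calculations
- Add automatic cleanup of expired entries in MultiLevelCache
- Enhance the batch scheduler cache with LRU eviction and memory tracking
- Update the configuration to support a maximum item size limit
- Add comprehensive memory profiling documentation and tools

BREAKING CHANGE: the default TTL parameter of CacheManager is now None instead of 3600. The database compatibility layer disables caching by default to prevent legacy code from overusing the cache.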
2025-11-03 15:18:00 +08:00
parent 29a5357728
commit ecef8edd28
10 changed files with 1923 additions and 20 deletions
--- a/MEMORY_PROFILING.md
+++ b/MEMORY_PROFILING.md
@@ -0,0 +1,471 @@
 # Bot Memory Profiling Tool Guide
 A unified memory diagnostics tool that provides process monitoring, object analysis, and data visualization.
 ## 🚀 Quick Start
 > **Tip**: Run the script with the virtual environment interpreter (`.\.venv\Scripts\python.exe`).
 ```powershell
 # Show help
 .\.venv\Scripts\python.exe scripts/memory_profiler.py --help
 # Process monitoring mode (simplest)
 .\.venv\Scripts\python.exe scripts/memory_profiler.py --monitor
 # Object analysis mode (in-depth analysis)
 .\.venv\Scripts\python.exe scripts/memory_profiler.py --objects --output memory_data.txt
 # Visualization mode (generate charts)
 .\.venv\Scripts\python.exe scripts/memory_profiler.py --visualize --input memory_data.txt.jsonl
 ```
 **Or use the short command** (if your system `python` already points to the virtual environment):
 ```powershell
 python scripts/memory_profiler.py --monitor
 ```
 ## 📦 Installing Dependencies
 ```powershell
 # Basic functionality (process monitoring)
 pip install psutil
 # Object analysis
 pip install pympler
 # Visualization
 pip install matplotlib
 # Install everything at once
 pip install psutil pympler matplotlib
 ```
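 ## 🔧 The Three Modes in Detail
 ### 1. Process Monitoring Mode (--monitor)
 **Purpose**: Monitor the bot process's total memory and its child processes from the outside
 **Features**:
 - ✅ Starts bot.py automatically (using the virtual environment)
 - ✅ Shows process memory (RSS, VMS) in real time
 - ✅ Lists every child process and its memory usage
 - ✅ Shows the bot's log output
 - ✅ Saves the monitoring history automatically
 **Usage examples**:
 ```powershell
 # Basic usage
 python scripts/memory_profiler.py --monitor
 # Custom monitoring interval (10 seconds)
 python scripts/memory_profiler.py --monitor --interval 10
 # Short form
 python scripts/memory_profiler.py -m -i 5
 ```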
 **Sample output**:
 ```
 ================================================================================
 Checkpoint #1 - 14:23:15
 Bot process (PID: 12345)
  RSS: 45.82 MB
  VMS: 12.34 MB
  Share: 0.25%
  Child processes: 2
  Child process memory: 723.64 MB
  Total memory: 769.46 MB
  📋 Child process details:
    [1] PID 12346: python.exe - 520.15 MB
        Command: python.exe -m chromadb.server ...
    [2] PID 12347: python.exe - 203.49 MB
        Command: python.exe -m uvicorn ...
 ================================================================================
 ```
 **Saved to**: `data/memory_diagnostics/process_monitor_<timestamp>_pid<PID>.txt`
 ---
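 ### 2. Object Analysis Mode (--objects)
 **Purpose**: Measure the memory used by all Python objects from inside the bot process
 **Features**:
 - ✅ Counts every object type (dict, list, str, AsyncOpenAI, etc.)
 - ✅ **Per-module memory statistics (new)**: shows which module uses the most memory
 - ✅ Includes objects from all threads
 - ✅ Shows object changes (diff)
 - ✅ Thread information and GC statistics
 - ✅ Saves JSONL data for visualization
 **Usage examples**:
 ```powershell
 # Basic usage (specifying an output file is recommended)
 python scripts/memory_profiler.py --objects --output memory_data.txt
 # Custom parameters (PowerShell continues a line with a backtick, not a backslash)
 python scripts/memory_profiler.py --objects `
    --interval 10 `
    --output memory_data.txt `
    --object-limit 30
 # Short form
 python scripts/memory_profiler.py -o -i 10 --output data.txt -l 30
 ```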
 **Sample output**:
 ```
 ================================================================================
 🔍 Object-level memory analysis #1 - 14:25:30
 ================================================================================
 📦 Object statistics (top 20 types):
 Type                                                 Count       Total size
 --------------------------------------------------------------------------------
 <class 'dict'>                                     125,843         45.23 MB
 <class 'str'>                                      234,567         23.45 MB
 <class 'list'>                                      56,789         12.34 MB
 <class 'tuple'>                                     89,012          8.90 MB
 <class 'openai.resources.chat.completions'>            12          5.67 MB
 ...
 📚 Memory by module (top 20 modules):
 Module                                           Objects     Total memory
 --------------------------------------------------------------------------------
 builtins                                         169,144        26.20 MB
 src                                               12,345         5.67 MB
 openai                                             3,456         2.34 MB
 chromadb                                           2,345         1.89 MB
 ...
  Total modules: 85
 🧵 Threads (8):
  [1] ✓ MainThread
  [2] ✓ AsyncOpenAIClient (daemon)
  [3] ✓ ChromaDBWorker (daemon)
  ...
 🗑️  Garbage collection:
  Gen 0: 1,234 collections
  Gen 1: 56 collections
  Gen 2: 3 collections
  Tracked objects: 456,789
 📊 Total objects: 567,890
 ================================================================================
 ```
 **Object changes are shown every 3 iterations**:
 ```
 📈 Object change analysis:
 --------------------------------------------------------------------------------
                types |   # objects |   total size
 ==================== | =========== | ============
            <class 'dict'> |      +1234 |    +1.23 MB
             <class 'str'> |       +567 |   +0.56 MB
 ...
 --------------------------------------------------------------------------------
 ```
 **Saved to** (where `<output>` is the path passed to `--output`, for example `memory_data.txt`):
 - Text: `<output>`
 - Structured data: `<output>.jsonl`
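
Each line of the `.jsonl` file is one JSON record per sampling iteration. The following is a minimal sketch of reading it back, assuming the field names written by `save_to_file` in `scripts/memory_profiler.py` (`iteration`, `total_objects`, and a `summary` list of `type`/`count`/`size` entries). `load_records` and `size_by_type` are hypothetical helper names, not part of the script, and the sample values are made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_records(jsonl_path):
    """Read one JSON object per line, skipping blank lines."""
    records = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def size_by_type(record):
    """Map each type name in a record's summary to its total size in bytes."""
    return {row["type"]: row["size"] for row in record.get("summary", [])}


# Demo with a throwaway file; the record below is illustrative only.
sample = {
    "iteration": 1,
    "total_objects": 3,
    "summary": [{"type": "<class 'dict'>", "count": 2, "size": 464}],
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "memory_data.txt.jsonl"
    path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    for record in load_records(path):
        print(record["iteration"], size_by_type(record))
```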
 ---
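 ### 3. Visualization Mode (--visualize)
 **Purpose**: Plot the JSONL data produced by object analysis mode as a chart
 **Features**:
 - ✅ Shows how the memory of each object type changes over time
 - ✅ Automatically selects the N types with the highest memory usage
 - ✅ Generates a high-resolution PNG chart
 **Usage examples**:
 ```powershell
 # Basic usage
 python scripts/memory_profiler.py --visualize `
    --input memory_data.txt.jsonl
 # Custom parameters
 python scripts/memory_profiler.py --visualize `
    --input memory_data.txt.jsonl `
    --top 15 `
    --plot-output my_plot.png
 # Short form (--input has no short flag; -i means --interval)
 python scripts/memory_profiler.py -v --input data.txt.jsonl -t 15
 ```
 **Output**: A PNG image showing the memory usage of the top N object types over time
 **Saved to**: `memory_analysis_plot.png` by default; override with `--plot-output`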
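
The "top N types" selection can be sketched in plain Python before any plotting happens. This is only an illustration: ranking by peak size is an assumption, since the selection rule actually used by `--visualize` is not shown here, and `top_types_by_peak` and `series_for` are hypothetical names. The records follow the JSONL layout described for object analysis mode.

```python
from collections import defaultdict


def top_types_by_peak(records, top=10):
    """Return the `top` type names ranked by their largest observed size."""
    peak = defaultdict(int)
    for rec in records:
        for row in rec.get("summary", []):
            peak[row["type"]] = max(peak[row["type"]], row["size"])
    return sorted(peak, key=peak.get, reverse=True)[:top]


def series_for(records, type_name):
    """Size per record; 0 when the type is missing from that record's summary."""
    series = []
    for rec in records:
        sizes = {row["type"]: row["size"] for row in rec.get("summary", [])}
        series.append(sizes.get(type_name, 0))
    return series


# Two made-up snapshots: each name in `top_types_by_peak` yields one curve.
records = [
    {"summary": [{"type": "dict", "count": 1, "size": 100},
                 {"type": "str", "count": 1, "size": 50}]},
    {"summary": [{"type": "dict", "count": 1, "size": 80},
                 {"type": "str", "count": 1, "size": 200}]},
]
for name in top_types_by_peak(records, top=2):
    print(name, series_for(records, name))
```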
 ---
 ## 💡 Use Cases
 | Scenario | Recommended mode | Command |
 |------|----------|------|
 | Quick look at total memory | `--monitor` | `python scripts/memory_profiler.py -m` |
 | Inspect child process usage | `--monitor` | `python scripts/memory_profiler.py -m` |
 | Analyze usage by object | `--objects` | `python scripts/memory_profiler.py -o --output data.txt` |
 | Track down a memory leak | `--objects` | `python scripts/memory_profiler.py -o --output data.txt` |
 | Visualize trends | `--visualize` | `python scripts/memory_profiler.py -v --input data.txt.jsonl` |
 ## 📊 Complete Workflows
 ### Scenario 1: Quickly diagnose a memory problem
 ```powershell
 # 1. Run process monitoring (to see the overall picture)
 python scripts/memory_profiler.py --monitor --interval 5
 # Watch the output; if memory looks abnormal, continue with scenario 2
 ```
 ### Scenario 2: Analyze object usage in depth
 ```powershell
 # 1. Start object analysis (saving the data)
 python scripts/memory_profiler.py --objects `
    --interval 10 `
    --output data/memory_diagnostics/analysis_$(Get-Date -Format 'yyyyMMdd_HHmmss').txt
 # 2. Let it run for a while (at least 5-10 minutes is recommended), then press Ctrl+C to stop
 # 3. Generate the chart
 python scripts/memory_profiler.py --visualize `
    --input data/memory_diagnostics/analysis_<timestamp>.txt.jsonl `
    --top 15 `
    --plot-output data/memory_diagnostics/plot_<timestamp>.png
 # 4. Inspect the chart to see which object types grow over time
 ```
 ### Scenario 3: Continuous monitoring
 ```powershell
 # Run object analysis in the background (Windows)
 Start-Process powershell -ArgumentList "-Command", "python scripts/memory_profiler.py -o -i 30 --output logs/memory_continuous.txt" -WindowStyle Minimized
 # Periodically read the JSONL file and generate a chart
 python scripts/memory_profiler.py -v --input logs/memory_continuous.txt.jsonl -t 20
 ```
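 ## 🎯 Parameter Reference
 ### Common parameters
 | Parameter | Short | Default | Description |
 |------|------|--------|------|
 | `--interval` | `-i` | 10 | Monitoring interval (seconds) |
 ### Object analysis mode parameters
 | Parameter | Short | Default | Description |
 |------|------|--------|------|
 | `--output` | - | None | Output file path (strongly recommended) |
 | `--object-limit` | `-l` | 20 | Number of object types to display |
 ### Visualization mode parameters
 | Parameter | Short | Default | Description |
 |------|------|--------|------|
 | `--input` | - | **Required** | Input JSONL file path |
 | `--top` | `-t` | 10 | Show the top N object types |
 | `--plot-output` | - | `memory_analysis_plot.png` | Output chart path |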
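
The tables above can be expressed as an `argparse` definition. This is a sketch reconstructed from the documented flags and defaults, not the parser in `scripts/memory_profiler.py`. Making the three mode flags mutually exclusive and required is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Memory profiler CLI (sketch)")
    # Assumption: exactly one mode must be chosen per run.
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--monitor", "-m", action="store_true")
    mode.add_argument("--objects", "-o", action="store_true")
    mode.add_argument("--visualize", "-v", action="store_true")
    # Common parameter
    parser.add_argument("--interval", "-i", type=int, default=10)
    # Object analysis mode
    parser.add_argument("--output")
    parser.add_argument("--object-limit", "-l", type=int, default=20)
    # Visualization mode
    parser.add_argument("--input")
    parser.add_argument("--top", "-t", type=int, default=10)
    parser.add_argument("--plot-output", default="memory_analysis_plot.png")
    return parser


args = build_parser().parse_args(["-o", "-i", "5", "--output", "data.txt"])
print(args.objects, args.interval, args.output, args.object_limit)
```

Because `-i` is bound to `--interval` with `type=int`, a command such as `-v -i data.txt.jsonl` would be rejected under this sketch; the input file has to be passed with `--input`.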
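 ## ⚠️ Notes
 ### Performance impact
 | Mode | Performance impact | Notes |
 |------|----------|------|
 | `--monitor` | < 1% | Almost no impact; suitable for production |
 | `--objects` | 5-15% | Noticeable impact; best used in a test environment |
 | `--visualize` | 0% | Offline analysis; no impact |
 ### FAQ
 **Q: Object analysis mode reports "pympler is not installed"?**
 ```powershell
 pip install pympler
 ```
 **Q: Visualization mode reports "matplotlib is not installed"?**
 ```powershell
 pip install matplotlib
 ```
 **Q: Object analysis mode says "bot.py has no main_async() or main() function"?**
 This is expected. If the main logic of your bot.py lives under `if __name__ == "__main__":`, the monitoring thread still runs in the background. You can:
 - Keep the bot running; the monitor keeps collecting statistics
 - Or add a `main_async()` or `main()` function to bot.py
 **Q: Process monitoring mode shows no child processes?**
 Make sure bot.py has already started its child processes (for example ChromaDB). If you look right after startup, they may not have been created yet.
 **Q: Where is the JSONL file?**
 When you pass `--output <file>`, two files are generated:
 - `<file>`: human-readable text
 - `<file>.jsonl`: structured data (used for visualization)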
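 ## 📁 Output Files
 ### Process monitoring output
 **Location**: `data/memory_diagnostics/process_monitor_<timestamp>_pid<PID>.txt`
 **Contents**: Process memory information for each checkpoint
 ### Object analysis output
 **Text file**: `<output>`
 - Human-readable format
 - Contains the object statistics for each iteration
 **JSONL file**: `<output>.jsonl`
 - One JSON object per line
 - Fields: timestamp, iteration, total_objects, summary, module_stats, threads, gc_stats
 - Used for visualization
 ### Visualization output
 **PNG image**: `memory_analysis_plot.png` by default
 - Line chart showing the memory of each object type over time
 - Rendered at 150 DPI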
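 ## 🔍 Diagnostic Tips
 ### 1. Identifying memory leaks
 Run object analysis mode for an extended period and watch for:
 - An object type whose count or size keeps growing
 - Entries in the object change diff that are always positive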
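
The "keeps growing" check can be automated once per-type sizes have been collected across iterations. The sketch below flags types whose size never decreases and ends higher than it started. `steadily_growing` and the `min_points` threshold are hypothetical, and a flagged type is only a candidate for investigation, not proof of a leak.

```python
def steadily_growing(series_by_type, min_points=3):
    """Return (type, growth_in_bytes) for series that never shrink and end higher."""
    suspects = []
    for name, sizes in series_by_type.items():
        if len(sizes) < min_points:
            continue  # too few samples to call it a trend
        non_decreasing = all(b >= a for a, b in zip(sizes, sizes[1:]))
        if non_decreasing and sizes[-1] > sizes[0]:
            suspects.append((name, sizes[-1] - sizes[0]))
    # Largest growth first
    return sorted(suspects, key=lambda item: item[1], reverse=True)


# Made-up byte counts per iteration for three types.
history = {
    "dict": [10, 20, 30],   # grows every time: flagged
    "str": [10, 5, 30],     # dips in the middle: not flagged
    "list": [1, 1],         # fewer than min_points samples: skipped
}
print(steadily_growing(history))
```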
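 ### 2. Locating large objects
 **Check the object statistics**:
 - If `<class 'dict'>` is very large, a cache may not be getting cleaned up
 - If you see a specific class (such as `AsyncOpenAI`), check how many instances of it exist
 **Check the module statistics** (recommended):
 - Look at the 📚 memory-by-module section
 - If the `src` module is large, your own code is holding a lot of objects
 - If third-party modules such as `openai` or `chromadb` are large, the way those libraries are used may be the problem
 - Compare different points in time to see which module keeps growing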
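
The per-module figures come from grouping objects by the top-level module of their type. Below is a minimal standard-library sketch of that grouping. It assumes `gc.get_objects()` as the object source, which only returns objects tracked by the garbage collector, so its totals will differ from the script's pympler-based numbers. `module_usage` is a hypothetical name. Note that `sys.getsizeof` reports shallow sizes only, so a container's contents are not included in its own size.

```python
import gc
import sys
from collections import defaultdict


def module_usage(objects, top=5):
    """Group objects by the top-level module of their type; sum shallow sizes."""
    totals = defaultdict(lambda: [0, 0])  # module -> [count, bytes]
    for obj in objects:
        try:
            module = type(obj).__module__
            size = sys.getsizeof(obj)
        except Exception:
            continue  # skip objects that cannot be inspected
        if not isinstance(module, str) or not module:
            module = "unknown"
        entry = totals[module.split(".")[0]]  # e.g. src.chat.xxx -> src
        entry[0] += 1
        entry[1] += size
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(name, count, size) for name, (count, size) in ranked[:top]]


for name, count, size in module_usage(gc.get_objects()):
    print(f"{name:<20} {count:>10,} {size:>14,} B")
```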
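 ### 3. Analyzing child process usage
 Use process monitoring mode:
 - Check the command line shown in the child process details
 - Identify which child process uses a lot of memory (such as ChromaDB)
 ### 4. Comparing points in time
 Use visualization mode:
 - After generating the chart, look for object types whose curves keep rising
 - Compare the memory changes while different features are running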
 ## 🎓 Advanced Usage
 ### Long-running monitoring script
 Create `monitor_continuously.ps1`:
 ```powershell
 # Continuous monitoring script
 $timestamp = Get-Date -Format "yyyyMMdd_HHmmss"
 $logPath = "logs/memory_analysis_$timestamp.txt"
 Write-Host "Starting continuous monitoring; data is saved to: $logPath"
 Write-Host "Press Ctrl+C to stop monitoring"
 python scripts/memory_profiler.py --objects --interval 30 --output $logPath
 ```
 ### Generating a daily report automatically
 Create `generate_daily_report.ps1`:
 ```powershell
 # Generate the daily memory analysis report
 $date = Get-Date -Format "yyyyMMdd"
 $jsonlFiles = Get-ChildItem "logs" -Filter "*$date*.jsonl"
 foreach ($file in $jsonlFiles) {
    # -replace takes a regex: escape the dot and anchor the match to the end of the name
    $outputPlot = $file.FullName -replace '\.jsonl$', '_plot.png'
    python scripts/memory_profiler.py --visualize --input $file.FullName --plot-output $outputPlot --top 20
    Write-Host "Generated chart: $outputPlot"
 }
 ```
 ## 📚 Further Reading
 - **Python memory management**: https://docs.python.org/3/c-api/memory.html
 - **psutil documentation**: https://psutil.readthedocs.io/
 - **Pympler documentation**: https://pympler.readthedocs.io/
 - **Matplotlib documentation**: https://matplotlib.org/
 ## 🆘 Getting Help
 ```powershell
 # Show the full help text
 python scripts/memory_profiler.py --help
 # Show the mode-specific examples
 python scripts/memory_profiler.py --help | Select-String "Examples"
 ```
 ---
 **Quick start reminder**:
 ```powershell
 # Use the virtual environment (recommended)
 .\.venv\Scripts\python.exe scripts/memory_profiler.py --monitor
 # Or use the system Python
 python scripts/memory_profiler.py --monitor
 # In-depth analysis
 .\.venv\Scripts\python.exe scripts/memory_profiler.py --objects --output memory.txt
 # Visualization
 .\.venv\Scripts\python.exe scripts/memory_profiler.py --visualize --input memory.txt.jsonl
 ```
 ### 💡 Virtual environment notes
 **Windows**:
 ```powershell
 .\.venv\Scripts\python.exe scripts/memory_profiler.py [options]
 ```
 **Linux/Mac**:
 ```bash
 ./.venv/bin/python scripts/memory_profiler.py [options]
 ```
 In process monitoring mode the script detects the project's virtual environment and uses it to start the bot. In object analysis mode it adds the project root to the Python path automatically.
 🎉 You now know how to use the complete memory profiling tool!
--- a/docs/guides/OBJECT_LEVEL_MEMORY_ANALYSIS.md
+++ b/docs/guides/OBJECT_LEVEL_MEMORY_ANALYSIS.md
@@ -0,0 +1,267 @@
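 # Object-Level Memory Analysis Guide
 ## 🎯 Overview
 Object-level memory analysis helps you:
 - See which Python object types use the most memory
 - Track changes in object counts and sizes
 - Identify the specific objects behind a memory leak
 - Monitor garbage collection efficiency
 ## 🚀 Quick Start
 ### 1. Install dependencies
 ```powershell
 pip install pympler
 ```
 ### 2. Enable object-level analysis
 ```powershell
 # Basic usage: enable object analysis
 python scripts/run_bot_with_tracking.py --objects
 # Custom monitoring interval (10 seconds)
 python scripts/run_bot_with_tracking.py --objects --interval 10
 # Show more object types (top 20)
 python scripts/run_bot_with_tracking.py --objects --object-limit 20
 # Full example (short flags)
 python scripts/run_bot_with_tracking.py -o -i 10 -l 20
 ```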
 ## 📊 Sample Output
 ### Process-level information
 ```
 ================================================================================
 Checkpoint #1 - 12:34:56
 Bot process (PID: 12345)
  RSS: 45.23 MB
  VMS: 125.45 MB
  Share: 0.35%
  Child processes: 1
  Child process memory: 32.10 MB
  Total memory: 77.33 MB
 Change:
  RSS: +2.15 MB
 ```
 ### Object-level analysis information
 ```
 📦 Object-level memory analysis (checkpoint #1)
 --------------------------------------------------------------------------------
 Type                                      Count    Total size
 --------------------------------------------------------------------------------
 dict                                     12,345      15.23 MB
 str                                      45,678       8.92 MB
 list                                      8,901       5.67 MB
 tuple                                    23,456       4.32 MB
 type                                      1,234       3.21 MB
 code                                      2,345       2.10 MB
 set                                       1,567       1.85 MB
 function                                  3,456       1.23 MB
 method                                    4,567     890.45 KB
 weakref                                   2,345     678.12 KB
 🗑️  Garbage collection statistics:
  - Gen 0 collections: 125
  - Gen 1 collections: 12
  - Gen 2 collections: 2
  - Uncollectable objects: 0
  - Tracked objects: 89,456
 📊 Total objects: 123,456
 --------------------------------------------------------------------------------
 ```
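 ## 🔍 How to Read the Output
 ### 1. Object type statistics
 Each row shows:
 - **Type name**: the Python object type (dict, str, list, etc.)
 - **Count**: the number of instances of that type
 - **Total size**: the total memory used by all objects of that type
 **Key indicators**:
 - Many `dict` objects are normal (Python uses dictionaries heavily)
 - Many `str` objects are normal too (strings are everywhere)
 - A custom type whose count grows abnormally → possible leak
 - A type that uses an abnormally large amount of memory → needs optimization
 ### 2. Garbage collection statistics
 **Gen 0/1/2 collection counts**:
 - Gen 0: collected most often; holds newly created objects
 - Gen 1: collected at a moderate rate; holds objects that have survived for a while
 - Gen 2: collected least often; holds long-lived objects
 **Uncollectable objects**:
 - Should be 0 or a very small number
 - Steady growth → possibly a memory leak caused by reference cycles
 **Tracked objects**:
 - The total number of objects tracked by Python's garbage collector
 - Steady growth may indicate a memory leak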
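
These figures map onto the standard `gc` module. Here is a small sketch of gathering them, with `gc_snapshot` as a hypothetical helper name. `gc.get_stats()` holds the cumulative per-generation collection counts, while `gc.get_count()` returns the current counters that are compared against the collection thresholds, so the two should not be confused.

```python
import gc


def gc_snapshot():
    """Collect the garbage-collector figures discussed above."""
    stats = gc.get_stats()  # one dict per generation
    return {
        # Cumulative number of collections run for each generation
        "collections": [s["collections"] for s in stats],
        # Objects found uncollectable in each generation
        "uncollectable": [s["uncollectable"] for s in stats],
        # Current counters checked against gc.get_threshold()
        "pending": gc.get_count(),
        # Objects the collector found unreachable but could not free
        "garbage": len(gc.garbage),
        # Number of container objects currently tracked by the collector
        "tracked": len(gc.get_objects()),
    }


snapshot = gc_snapshot()
for key, value in snapshot.items():
    print(f"{key}: {value}")
```

Taking two snapshots some time apart and comparing `tracked` and `uncollectable` is usually more informative than looking at a single reading.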
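 ### 3. Total objects
 The number of Python objects in the current process.
 ## 🎯 Common Use Cases
 ### Scenario 1: Finding a memory leak
 ```powershell
 # Long run with frequent checks
 python scripts/run_bot_with_tracking.py -o -i 5
 ```
 **Watch for**:
 - Which object types keep growing in number?
 - Does RSS growth match the growth in object counts?
 - Is garbage collection working normally?
 ### Scenario 2: Reducing memory usage
 ```powershell
 # Longer interval to observe the steady state
 python scripts/run_bot_with_tracking.py -o -i 30 -l 25
 ```
 **Analyze**:
 - Which of the top 25 object types are created by your own code?
 - Are there large object caches that are not needed?
 - Could a lighter data structure be used?
 ### Scenario 3: Debugging a specific feature
 ```powershell
 # Short interval for fast feedback
 python scripts/run_bot_with_tracking.py -o -i 3
 ```
 **Use it to**:
 - Observe memory changes right after triggering a feature
 - Check that objects are released correctly
 - Verify the effect of an optimization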
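
For a single feature, the same before/after comparison can also be made inside the process. The sketch below uses the standard library's `tracemalloc` rather than pympler, so it is an alternative technique and not what the scripts in this guide do. `measure_growth` is a hypothetical helper and the lambda stands in for the feature being triggered.

```python
import tracemalloc


def measure_growth(action, top=3):
    """Run `action` and return its result plus the largest allocation changes."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    result = action()  # keep the result alive so its memory is still counted
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Differences grouped by source line, largest change first
    diffs = after.compare_to(before, "lineno")
    return result, diffs[:top]


# Stand-in for "trigger a feature": allocate roughly 1 MB of byte strings.
data, diffs = measure_growth(lambda: [bytes(1000) for _ in range(1000)])
for diff in diffs:
    print(diff)
```

If the result is released and the measurement repeated, memory that does not come back between runs points at objects that are still referenced somewhere.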
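 ## 📝 Saved History Files
 When monitoring ends, the history is saved automatically to:
 ```
 data/memory_diagnostics/bot_memory_monitor_YYYYMMDD_HHMMSS_pidXXXXX.txt
 ```
 The file contains:
 - Process memory information for each checkpoint
 - Object statistics for each checkpoint (top 10 types)
 - Overall statistics (start/end/peak/average)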
 ## 🔧 Advanced Techniques
 ### 1. Combining with code changes
 Add checkpoints to your own code:
 ```python
 import gc
 from pympler import muppy, summary
 def debug_memory():
    """Call this function at key points in your code"""
    gc.collect()
    all_objects = muppy.get_objects()
    sum_data = summary.summarize(all_objects)
    summary.print_(sum_data, limit=10)
 ```
 ### 2. Comparing points in time
 ```powershell
 # Run for 1 minute
 python scripts/run_bot_with_tracking.py -o -i 10
 # Press Ctrl+C to stop, then inspect the file
 # Wait 5 minutes and run again
 python scripts/run_bot_with_tracking.py -o -i 10
 # Compare the object statistics from the two runs
 ```
 ### 3. Focusing on specific object types
 Modify the `get_object_stats()` function in `run_bot_with_tracking.py` to add a filter:
 ```python
 def get_object_stats(limit: int = 10) -> Dict:
    # ...existing code...
    # Keep only rows whose type name contains the class of interest
    filtered_summary = [
        row for row in sum_data 
        if 'YourClassName' in row[0]
    ]
    return {
        "summary": filtered_summary[:limit],
        # ...
    }
 ```
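 ## ⚠️ Notes
 ### Performance impact
 Object-level analysis affects performance:
 - **pympler analysis**: roughly 10-20% overhead
 - **gc.collect()**: every checkpoint triggers a garbage collection, which can cause a brief stall
 **Recommendations**:
 - Use object analysis during development and debugging
 - Use plain monitoring in production (without `--objects`)
 ### Memory overhead
 Object analysis itself uses memory:
 - `muppy.get_objects()` builds a list of objects
 - The statistics are kept in the history
 **Recommendations**:
 - Do not set `--interval` too low (5 seconds or more is recommended)
 - Consider turning object analysis off for long runs
 ### Accuracy
 - Object statistics are a **snapshot**, not a live view
 - Statistics are taken after `gc.collect()`, so collectable garbage has already been reclaimed
 - Objects in child processes cannot be counted (only the main process is measured)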
 ## 📚 Related Tools
 | Tool | Purpose | Object-level analysis |
 |------|------|----------|
 | `run_bot_with_tracking.py` | One-step launch and monitoring | ✅ Supported |
 | `memory_monitor.py` | Manual monitoring | ✅ Supported |
 | `windows_memory_profiler.py` | Detailed analysis | ✅ Supported |
 | `run_bot_with_pympler.py` | Dedicated object tracking | ✅ Its main focus |
 ## 🎓 Learning Resources
 - [Pympler documentation](https://pympler.readthedocs.io/)
 - [Python gc module](https://docs.python.org/3/library/gc.html)
 - [Debugging memory leaks with tracemalloc](https://docs.python.org/3/library/tracemalloc.html)
 ---
 **Quick start**: 
 ```powershell
 pip install pympler
 python scripts/run_bot_with_tracking.py --objects
 ```
 🎉
--- a/scripts/memory_profiler.py
+++ b/scripts/memory_profiler.py
@@ -0,0 +1,757 @@
 #!/usr/bin/env python3
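 """
 Unified memory profiling tool - a complete memory diagnostics solution for the bot
 Three modes are supported:
  1. Process monitoring mode (--monitor): monitor the bot process's memory and child processes from the outside
  2. Object analysis mode (--objects): count all objects from inside the bot (including all threads)
  3. Visualization mode (--visualize): plot JSONL data as a chart
 Examples:
  # Process monitoring (start the bot and monitor it)
  python scripts/memory_profiler.py --monitor --interval 10
  # Object analysis (in-depth object statistics)
  python scripts/memory_profiler.py --objects --interval 10 --output memory_data.txt
  # Generate a chart
  python scripts/memory_profiler.py --visualize --input memory_data.txt.jsonl --top 15
 """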
 import argparse
 import asyncio
 import gc
 import json
 import os
 import subprocess
 import sys
 import threading
 import time
 from collections import defaultdict
 from datetime import datetime
 from pathlib import Path
 from typing import Dict, List, Optional
 import psutil
 try:
    from pympler import muppy, summary, tracker
    PYMPLER_AVAILABLE = True
 except ImportError:
    PYMPLER_AVAILABLE = False
 try:
    import matplotlib.pyplot as plt
    MATPLOTLIB_AVAILABLE = True
 except ImportError:
    MATPLOTLIB_AVAILABLE = False
 # ============================================================================
 # Process monitoring mode
 # ============================================================================
 async def monitor_bot_process(bot_process: subprocess.Popen, interval: int = 5):
    """Monitor the bot process's memory usage from the outside (process level)"""
    if bot_process.pid is None:
        print("❌ Bot process PID is empty")
        return
    print(f"🔍 Monitoring bot memory (PID: {bot_process.pid})")
    print(f"Monitoring interval: {interval} seconds")
    print("Press Ctrl+C to stop monitoring and the bot\n")
    try:
        process = psutil.Process(bot_process.pid)
    except psutil.NoSuchProcess:
        print("❌ Could not find the bot process")
        return
    history = []
    iteration = 0
    try:
        while bot_process.poll() is None:
            try:
                mem_info = process.memory_info()
                mem_percent = process.memory_percent()
                children = process.children(recursive=True)
                # A child may exit between enumeration and sampling; skip it
                # so that NoSuchProcess does not end the whole monitoring loop
                children_mem = 0
                for child in children:
                    try:
                        children_mem += child.memory_info().rss
                    except (psutil.NoSuchProcess, psutil.AccessDenied):
                        continue
                info = {
                    "timestamp": time.strftime("%H:%M:%S"),
                    "rss_mb": mem_info.rss / 1024 / 1024,
                    "vms_mb": mem_info.vms / 1024 / 1024,
                    "percent": mem_percent,
                    "children_count": len(children),
                    "children_mem_mb": children_mem / 1024 / 1024,
                }
                history.append(info)
                iteration += 1
                print(f"{'=' * 80}")
                print(f"Checkpoint #{iteration} - {info['timestamp']}")
                print(f"Bot process (PID: {bot_process.pid})")
                print(f"  RSS: {info['rss_mb']:.2f} MB")
                print(f"  VMS: {info['vms_mb']:.2f} MB")
                print(f"  Share: {info['percent']:.2f}%")
                if children:
                    print(f"  Child processes: {info['children_count']}")
                    print(f"  Child process memory: {info['children_mem_mb']:.2f} MB")
                    total_mem = info['rss_mb'] + info['children_mem_mb']
                    print(f"  Total memory: {total_mem:.2f} MB")
                    print("\n  📋 Child process details:")
                    for idx, child in enumerate(children, 1):
                        try:
                            child_mem = child.memory_info().rss / 1024 / 1024
                            child_name = child.name()
                            child_cmdline = " ".join(child.cmdline()[:3])
                            if len(child_cmdline) > 80:
                                child_cmdline = child_cmdline[:77] + "..."
                            print(f"    [{idx}] PID {child.pid}: {child_name} - {child_mem:.2f} MB")
                            print(f"        Command: {child_cmdline}")
                        except (psutil.NoSuchProcess, psutil.AccessDenied):
                            print(f"    [{idx}] Cannot access process information")
                if len(history) > 1:
                    prev = history[-2]
                    rss_diff = info['rss_mb'] - prev['rss_mb']
                    print("\nChange:")
                    print(f"  RSS: {rss_diff:+.2f} MB")
                    if rss_diff > 10:
                        print("  ⚠️  Memory is growing quickly!")
                    if info['rss_mb'] > 1000:
                        print("  ⚠️  Memory usage exceeds 1 GB!")
                print(f"{'=' * 80}\n")
                await asyncio.sleep(interval)
            except psutil.NoSuchProcess:
                print("\n❌ Bot process has exited")
                break
            except Exception as e:
                print(f"\n❌ Monitoring error: {e}")
                break
    except KeyboardInterrupt:
        print("\n\n⚠️  Monitoring interrupted by user")
    finally:
        if history and bot_process.pid:
            save_process_history(history, bot_process.pid)
 def save_process_history(history: list, pid: int):
    """Save the process monitoring history"""
    output_dir = Path("data/memory_diagnostics")
    output_dir.mkdir(parents=True, exist_ok=True)
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = output_dir / f"process_monitor_{timestamp}_pid{pid}.txt"
    with open(output_file, "w", encoding="utf-8") as f:
        f.write("Bot process memory monitoring history\n")
        f.write("=" * 80 + "\n\n")
        f.write(f"Bot PID: {pid}\n\n")
        for info in history:
            f.write(f"Time: {info['timestamp']}\n")
            f.write(f"RSS: {info['rss_mb']:.2f} MB\n")
            f.write(f"VMS: {info['vms_mb']:.2f} MB\n")
            f.write(f"Share: {info['percent']:.2f}%\n")
            if info['children_count'] > 0:
                f.write(f"Child processes: {info['children_count']}\n")
                f.write(f"Child process memory: {info['children_mem_mb']:.2f} MB\n")
            f.write("\n")
    print(f"\n✅ Monitoring history saved to: {output_file}")
 async def run_monitor_mode(interval: int):
    """Entry point for process monitoring mode"""
    print("=" * 80)
    print("🚀 Process monitoring mode")
    print("=" * 80)
    print("This mode will:")
    print("  1. Start bot.py using the virtual environment")
    print("  2. Monitor process memory (RSS, VMS) in real time")
    print("  3. Show detailed child process information")
    print("  4. Save the monitoring history automatically")
    print("=" * 80 + "\n")
    project_root = Path(__file__).parent.parent
    bot_file = project_root / "bot.py"
    if not bot_file.exists():
        print(f"❌ Cannot find bot.py: {bot_file}")
        return 1
    # Detect the virtual environment (Windows layout first, then Linux/Mac)
    venv_python = project_root / ".venv" / "Scripts" / "python.exe"
    if not venv_python.exists():
        venv_python = project_root / ".venv" / "bin" / "python"
    if venv_python.exists():
        python_exe = str(venv_python)
        print(f"🐍 Using virtual environment: {venv_python}")
    else:
        python_exe = sys.executable
        print(f"⚠️  No virtual environment found, using the current Python: {python_exe}")
    print(f"🤖 Starting bot: {bot_file}")
    bot_process = subprocess.Popen(
        [python_exe, str(bot_file)],
        cwd=str(project_root),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,
    )
    await asyncio.sleep(2)
    if bot_process.poll() is not None:
        print("❌ Bot failed to start")
        if bot_process.stdout:
            output = bot_process.stdout.read()
            if output:
                print(f"\nBot output:\n{output}")
        return 1
    print(f"✅ Bot started (PID: {bot_process.pid})\n")
    # Start a thread that relays the bot's output
    def read_bot_output():
        if bot_process.stdout:
            try:
                for line in bot_process.stdout:
                    print(f"[Bot] {line}", end="")
            except Exception:
                pass
    output_thread = threading.Thread(target=read_bot_output, daemon=True)
    output_thread.start()
    try:
        await monitor_bot_process(bot_process, interval)
    except KeyboardInterrupt:
        print("\n\n⚠️  Interrupted by user")
    finally:
        # Always stop the bot: monitor_bot_process handles Ctrl+C itself, so the
        # except branch above may never run and the bot would otherwise be left running
        if bot_process.poll() is None:
            print("\nStopping bot...")
            bot_process.terminate()
            try:
                bot_process.wait(timeout=10)
            except subprocess.TimeoutExpired:
                print("⚠️  Force-killing bot...")
                bot_process.kill()
                bot_process.wait()
        print("✅ Bot stopped")
    return 0
 # ============================================================================
 # Object analysis mode
 # ============================================================================
 class ObjectMemoryProfiler:
    """Object-level memory profiler"""
    def __init__(self, interval: int = 10, output_file: Optional[str] = None, object_limit: int = 20):
        self.interval = interval
        self.output_file = output_file
        self.object_limit = object_limit
        self.running = False
        self.tracker = None
        if PYMPLER_AVAILABLE:
            self.tracker = tracker.SummaryTracker()
        self.iteration = 0
    def get_object_stats(self) -> Dict:
        """Collect object statistics for the current process (all threads)"""
        if not PYMPLER_AVAILABLE:
            return {}
        try:
            gc.collect()
            all_objects = muppy.get_objects()
            sum_data = summary.summarize(all_objects)
            # Sort by total size (third element) in descending order
            sorted_sum_data = sorted(sum_data, key=lambda x: x[2], reverse=True)
            # Aggregate memory by module
            module_stats = self._get_module_stats(all_objects)
            threads = threading.enumerate()
            thread_info = [
                {
                    "name": t.name,
                    "daemon": t.daemon,
                    "alive": t.is_alive(),
                }
                for t in threads
            ]
            gc_stats = {
                # gc.get_stats() holds the cumulative collections per generation;
                # gc.get_count() only returns the current threshold counters
                "collections": [s["collections"] for s in gc.get_stats()],
                "garbage": len(gc.garbage),
                "tracked": len(gc.get_objects()),
            }
            return {
                "summary": sorted_sum_data[:self.object_limit],
                "module_stats": module_stats,
                "gc_stats": gc_stats,
                "total_objects": len(all_objects),
                "threads": thread_info,
            }
        except Exception as e:
            print(f"❌ Failed to collect object statistics: {e}")
            return {}
    def _get_module_stats(self, all_objects: list) -> Dict:
        """Aggregate memory usage by module (shallow sizes from sys.getsizeof)"""
        module_mem = defaultdict(lambda: {"count": 0, "size": 0})
        for obj in all_objects:
            try:
                # Find the module that defines the object's type
                obj_type = type(obj)
                module_name = obj_type.__module__
                if module_name:
                    # Use the top-level module name (for example src.chat.xxx -> src)
                    top_module = module_name.split('.')[0]
                    obj_size = sys.getsizeof(obj)
                    module_mem[top_module]["count"] += 1
                    module_mem[top_module]["size"] += obj_size
            except Exception:
                # Skip objects whose size cannot be determined
                continue
        # Convert to a list and sort by size
        sorted_modules = sorted(
            [(mod, stats["count"], stats["size"]) 
             for mod, stats in module_mem.items()],
            key=lambda x: x[2],
            reverse=True
        )
        return {
            "top_modules": sorted_modules[:20],  # top 20 modules
            "total_modules": len(module_mem)
        }
    def print_stats(self, stats: Dict, iteration: int):
        """Print the statistics"""
        def format_size(num_bytes: int) -> str:
            # Pick the largest unit that keeps the value at or above 1
            if num_bytes >= 1024 * 1024 * 1024:
                return f"{num_bytes / 1024 / 1024 / 1024:.2f} GB"
            if num_bytes >= 1024 * 1024:
                return f"{num_bytes / 1024 / 1024:.2f} MB"
            if num_bytes >= 1024:
                return f"{num_bytes / 1024:.2f} KB"
            return f"{num_bytes} B"
        print("\n" + "=" * 80)
        print(f"🔍 Object-level memory analysis #{iteration} - {time.strftime('%H:%M:%S')}")
        print("=" * 80)
        if "summary" in stats:
            print(f"\n📦 Object statistics (top {self.object_limit} types):\n")
            print(f"{'Type':<50} {'Count':>12} {'Total size':>15}")
            print("-" * 80)
            for obj_type, obj_count, obj_size in stats["summary"]:
                print(f"{obj_type:<50} {obj_count:>12,} {format_size(obj_size):>15}")
        if "module_stats" in stats and stats["module_stats"]:
            print("\n📚 Memory by module (top 20 modules):\n")
            print(f"{'Module':<40} {'Objects':>12} {'Total memory':>15}")
            print("-" * 80)
            for module_name, obj_count, obj_size in stats["module_stats"]["top_modules"]:
                print(f"{module_name:<40} {obj_count:>12,} {format_size(obj_size):>15}")
            print(f"\n  Total modules: {stats['module_stats']['total_modules']}")
        if "threads" in stats:
            print(f"\n🧵 Threads ({len(stats['threads'])}):")
            for idx, t in enumerate(stats["threads"], 1):
                status = "✓" if t["alive"] else "✗"
                daemon = "(daemon)" if t["daemon"] else ""
                print(f"  [{idx}] {status} {t['name']} {daemon}")
        if "gc_stats" in stats:
            gc_stats = stats["gc_stats"]
            print("\n🗑️  Garbage collection:")
            print(f"  Gen 0: {gc_stats['collections'][0]:,} collections")
            print(f"  Gen 1: {gc_stats['collections'][1]:,} collections")
            print(f"  Gen 2: {gc_stats['collections'][2]:,} collections")
            print(f"  Tracked objects: {gc_stats['tracked']:,}")
        if "total_objects" in stats:
            print(f"\n📊 Total objects: {stats['total_objects']:,}")
        print("=" * 80 + "\n")
    def print_diff(self):
        """Print object changes since the previous diff"""
        if not PYMPLER_AVAILABLE or not self.tracker:
            return
        print("\n📈 Object change analysis:")
        print("-" * 80)
        self.tracker.print_diff()
        print("-" * 80)
    def save_to_file(self, stats: Dict):
        """Append the statistics to the text file and the JSONL file"""
        if not self.output_file:
            return
        try:
            # Save the human-readable text
            with open(self.output_file, "a", encoding="utf-8") as f:
                f.write(f"\n{'=' * 80}\n")
                f.write(f"Time: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
                f.write(f"Iteration: #{self.iteration}\n")
                f.write(f"{'=' * 80}\n\n")
                if "summary" in stats:
                    f.write("Object statistics:\n")
                    for obj_type, obj_count, obj_size in stats["summary"]:
                        f.write(f"  {obj_type}: {obj_count:,} objects, {obj_size:,} bytes\n")
                if "module_stats" in stats and stats["module_stats"]:
                    f.write("\nModule statistics (top 20):\n")
                    for module_name, obj_count, obj_size in stats["module_stats"]["top_modules"]:
                        f.write(f"  {module_name}: {obj_count:,} objects, {obj_size:,} bytes\n")
                f.write(f"\nTotal objects: {stats.get('total_objects', 0):,}\n")
                f.write(f"Threads: {len(stats.get('threads', []))}\n")
            # Save the JSONL record
            jsonl_path = str(self.output_file) + ".jsonl"
            record = {
                "timestamp": time.strftime('%Y-%m-%d %H:%M:%S'),
                "iteration": self.iteration,
                "total_objects": stats.get("total_objects", 0),
                "threads": stats.get("threads", []),
                "gc_stats": stats.get("gc_stats", {}),
                "summary": [
                    {"type": t, "count": c, "size": s} 
                    for (t, c, s) in stats.get("summary", [])
                ],
                "module_stats": stats.get("module_stats", {}),
            }
            with open(jsonl_path, "a", encoding="utf-8") as jf:
                jf.write(json.dumps(record, ensure_ascii=False) + "\n")
            if self.iteration == 1:
                print(f"💾 Data saved to: {self.output_file}")
                print(f"💾 Structured data: {jsonl_path}")
        except Exception as e:
            print(f"⚠️  Failed to save file: {e}")
    def start_monitoring(self):
        """Start the monitoring thread"""
        self.running = True
        def monitor_loop():
            print("🚀 Object profiler started")
            print(f"   Monitoring interval: {self.interval} seconds")
            print(f"   Object type limit: {self.object_limit}")
            print(f"   Output file: {self.output_file or 'none'}")
            print()
            while self.running:
                try:
                    self.iteration += 1
                    stats = self.get_object_stats()
                    self.print_stats(stats, self.iteration)
                    if self.iteration % 3 == 0 and self.tracker:
                        self.print_diff()
                    if self.output_file:
                        self.save_to_file(stats)
                except Exception as e:
                    print(f"❌ Monitoring error: {e}")
                    import traceback
                    traceback.print_exc()
                # Sleep outside the try block so a repeated failure cannot become a busy loop
                time.sleep(self.interval)
        monitor_thread = threading.Thread(target=monitor_loop, daemon=True)
        monitor_thread.start()
        print("✓ Monitoring thread started\n")
    def stop(self):
        """Stop monitoring"""
        self.running = False
 def run_objects_mode(interval: int, output: Optional[str], object_limit: int):
    """对象分析模式主函数"""
    if not PYMPLER_AVAILABLE:
        print("❌ pympler 未安装，无法使用对象分析模式")
        print("   安装: pip install pympler")
        return 1
    print("=" * 80)
    print("🔬 对象分析模式")
    print("=" * 80)
    print("此模式将:")
    print("  1. 在 bot.py 进程内部运行")
    print("  2. 统计所有对象（包括所有线程）")
    print("  3. 显示对象变化（diff）")
    print("  4. 保存 JSONL 数据用于可视化")
    print("=" * 80 + "\n")
    # 添加项目根目录到 Python 路径
    project_root = Path(__file__).parent.parent
    if str(project_root) not in sys.path:
        sys.path.insert(0, str(project_root))
        print(f"✓ 已添加项目根目录到 Python 路径: {project_root}\n")
    profiler = ObjectMemoryProfiler(
        interval=interval,
        output_file=output,
        object_limit=object_limit
    )
    profiler.start_monitoring()
    print("🤖 正在启动 Bot...\n")
    try:
        import bot
        if hasattr(bot, 'main_async'):
            asyncio.run(bot.main_async())
        elif hasattr(bot, 'main'):
            bot.main()
        else:
            print("⚠️  bot.py 未找到 main_async() 或 main() 函数")
            print("   Bot 模块已导入，监控线程在后台运行")
            print("   按 Ctrl+C 停止\n")
            while profiler.running:
                time.sleep(1)
    except KeyboardInterrupt:
        print("\n\n⚠️  用户中断")
    except Exception as e:
        print(f"\n❌ Bot 运行出错: {e}")
        import traceback
        traceback.print_exc()
    finally:
        profiler.stop()
    return 0
 # ============================================================================
 # 可视化模式
 # ============================================================================
 def load_jsonl(path: Path) -> List[Dict]:
    """加载 JSONL 文件"""
    snapshots = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                snapshots.append(json.loads(line))
            except json.JSONDecodeError:
                # Skip malformed lines instead of aborting the whole load
                continue
    return snapshots
 def aggregate_top_types(snapshots: List[Dict], top_n: int = 10):
    """聚合前 N 个对象类型的时间序列"""
    type_max = defaultdict(int)
    for snap in snapshots:
        for item in snap.get("summary", []):
            t = item.get("type")
            s = int(item.get("size", 0))
            type_max[t] = max(type_max[t], s)
    top_types = sorted(type_max.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    top_names = [t for t, _ in top_types]
    times = []
    series = {t: [] for t in top_names}
    for snap in snapshots:
        ts = snap.get("timestamp")
        try:
            times.append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
        except Exception:
            times.append(None)
        summary = {item.get("type"): int(item.get("size", 0)) 
                   for item in snap.get("summary", [])}
        for t in top_names:
            series[t].append(summary.get(t, 0) / 1024.0 / 1024.0)
    return times, series
 def plot_series(times: List, series: Dict, output: Path, top_n: int):
    """绘制时间序列图"""
    plt.figure(figsize=(14, 8))
    for name, values in series.items():
        if all(v == 0 for v in values):
            continue
        plt.plot(times, values, marker="o", label=name, linewidth=2)
    plt.xlabel("时间", fontsize=12)
    plt.ylabel("内存 (MB)", fontsize=12)
    plt.title(f"对象类型随时间的内存占用 (前 {top_n} 类型)", fontsize=14)
    plt.legend(loc="upper left", fontsize="small")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(str(output), dpi=150)
    print(f"✅ 已保存图像: {output}")
 def run_visualize_mode(input_file: str, output_file: str, top: int):
    """可视化模式主函数"""
    if not MATPLOTLIB_AVAILABLE:
        print("❌ matplotlib 未安装，无法使用可视化模式")
        print("   安装: pip install matplotlib")
        return 1
    print("=" * 80)
    print("📊 可视化模式")
    print("=" * 80)
    path = Path(input_file)
    if not path.exists():
        print(f"❌ 找不到输入文件: {path}")
        return 1
    print(f"📂 读取数据: {path}")
    snaps = load_jsonl(path)
    if not snaps:
        print("❌ 未读取到任何快照数据")
        return 1
    print(f"✓ 读取 {len(snaps)} 个快照")
    times, series = aggregate_top_types(snaps, top_n=top)
    print(f"✓ 提取前 {top} 个对象类型")
    output_path = Path(output_file)
    plot_series(times, series, output_path, top)
    return 0
 # ============================================================================
 # 主入口
 # ============================================================================
 def main():
    """主函数"""
    parser = argparse.ArgumentParser(
        description="统一内存分析工具 - Bot 内存诊断完整解决方案",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
 模式说明:
  --monitor    进程监控模式：从外部监控 bot 进程内存、子进程
  --objects    对象分析模式：在 bot 内部统计所有对象（包括所有线程）
  --visualize  可视化模式：将 JSONL 数据绘制成图表
 使用示例:
  # 进程监控（启动 bot 并监控）
  python scripts/memory_profiler.py --monitor --interval 10
  # 对象分析（深度对象统计）
  python scripts/memory_profiler.py --objects --interval 10 --output memory_data.txt
  # 生成可视化图表
  python scripts/memory_profiler.py --visualize --input memory_data.txt.jsonl --top 15 --plot-output plot.png
 注意:
  - 对象分析模式需要: pip install pympler
  - 可视化模式需要: pip install matplotlib
        """,
    )
    # 模式选择
    mode_group = parser.add_mutually_exclusive_group(required=True)
    mode_group.add_argument("--monitor", "-m", action="store_true", 
                           help="进程监控模式（外部监控 bot 进程）")
    mode_group.add_argument("--objects", "-o", action="store_true", 
                           help="对象分析模式（内部统计所有对象）")
    mode_group.add_argument("--visualize", "-v", action="store_true", 
                           help="可视化模式（绘制 JSONL 数据）")
    # 通用参数
    parser.add_argument("--interval", "-i", type=int, default=10,
                       help="监控间隔（秒），默认 10")
    # 对象分析参数
    parser.add_argument("--output", type=str,
                       help="输出文件路径（对象分析模式）")
    parser.add_argument("--object-limit", "-l", type=int, default=20,
                       help="对象类型显示数量，默认 20")
    # 可视化参数
    parser.add_argument("--input", type=str,
                       help="输入 JSONL 文件（可视化模式）")
    parser.add_argument("--top", "-t", type=int, default=10,
                       help="展示前 N 个类型（可视化模式），默认 10")
    parser.add_argument("--plot-output", type=str, default="memory_analysis_plot.png",
                       help="图表输出文件，默认 memory_analysis_plot.png")
    args = parser.parse_args()
    # 根据模式执行
    if args.monitor:
        return asyncio.run(run_monitor_mode(args.interval))
    elif args.objects:
        if not args.output:
            print("⚠️  建议使用 --output 指定输出文件以保存数据")
        return run_objects_mode(args.interval, args.output, args.object_limit)
    elif args.visualize:
        if not args.input:
            print("❌ 可视化模式需要 --input 参数指定 JSONL 文件")
            return 1
        return run_visualize_mode(args.input, args.plot_output, args.top)
    return 0
 if __name__ == "__main__":
    sys.exit(main())
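
The ranking step in `aggregate_top_types` can be sketched in isolation. The snippet below is a minimal illustration, not the script's actual code: `top_types_by_peak` and the two sample records are hypothetical, and the records only mimic the `summary` field that `save_to_file` writes to the JSONL file.

```python
import json
from collections import defaultdict


def top_types_by_peak(lines, top_n=2):
    """Rank object types by the largest size they reach in any snapshot."""
    peak = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines, as load_jsonl does
        snapshot = json.loads(line)
        for item in snapshot.get("summary", []):
            peak[item["type"]] = max(peak[item["type"]], int(item["size"]))
    ranked = sorted(peak.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]


# Hypothetical snapshots shaped like the profiler's JSONL records
sample = [
    json.dumps({"timestamp": "2025-11-03 15:00:00", "summary": [
        {"type": "dict", "count": 10, "size": 4096},
        {"type": "str", "count": 50, "size": 2048},
    ]}),
    json.dumps({"timestamp": "2025-11-03 15:00:10", "summary": [
        {"type": "dict", "count": 4, "size": 1024},
        {"type": "list", "count": 7, "size": 8192},
    ]}),
]

print(top_types_by_peak(sample))
```

Ranking by peak rather than by the latest snapshot keeps a type in the plot even if it has since been freed, which is usually what one wants when hunting for transient spikes.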
--- a/src/common/cache_manager.py
+++ b/src/common/cache_manager.py
@@ -33,12 +33,12 @@ class CacheManager:
            cls._instance = super().__new__(cls)
        return cls._instance
-    def __init__(self, default_ttl: int = 3600):
+    def __init__(self, default_ttl: int | None = None):
        """
        初始化缓存管理器。
        """
        if not hasattr(self, "_initialized"):
-            self.default_ttl = default_ttl
+            self.default_ttl = 3600 if default_ttl is None else default_ttl
            self.semantic_cache_collection_name = "semantic_cache"
            # L1 缓存 (内存)
@@ -360,6 +360,60 @@ class CacheManager:
        if expired_keys:
            logger.info(f"清理了 {len(expired_keys)} 个过期的L1缓存条目")
    def get_health_stats(self) -> dict[str, Any]:
        """获取缓存健康统计信息"""
        from src.common.memory_utils import format_size
        return {
            "l1_count": len(self.l1_kv_cache),
            "l1_memory": self.l1_current_memory,
            "l1_memory_formatted": format_size(self.l1_current_memory),
            "l1_max_memory": self.l1_max_memory,
            "l1_memory_usage_percent": round((self.l1_current_memory / self.l1_max_memory) * 100, 2),
            "l1_max_size": self.l1_max_size,
            "l1_size_usage_percent": round((len(self.l1_kv_cache) / self.l1_max_size) * 100, 2),
            "average_item_size": self.l1_current_memory // len(self.l1_kv_cache) if self.l1_kv_cache else 0,
            "average_item_size_formatted": format_size(self.l1_current_memory // len(self.l1_kv_cache)) if self.l1_kv_cache else "0 B",
            "largest_item_size": max(self.l1_size_map.values()) if self.l1_size_map else 0,
            "largest_item_size_formatted": format_size(max(self.l1_size_map.values())) if self.l1_size_map else "0 B",
        }
    def check_health(self) -> tuple[bool, list[str]]:
        """检查缓存健康状态
        Returns:
            (is_healthy, warnings) - 是否健康，警告列表
        """
        warnings = []
        # 检查内存使用
        memory_usage = (self.l1_current_memory / self.l1_max_memory) * 100
        if memory_usage > 90:
            warnings.append(f"⚠️ L1缓存内存使用率过高: {memory_usage:.1f}%")
        elif memory_usage > 75:
            warnings.append(f"⚡ L1缓存内存使用率较高: {memory_usage:.1f}%")
        # 检查条目数
        size_usage = (len(self.l1_kv_cache) / self.l1_max_size) * 100
        if size_usage > 90:
            warnings.append(f"⚠️ L1缓存条目数过多: {size_usage:.1f}%")
        # 检查平均条目大小
        if self.l1_kv_cache:
            avg_size = self.l1_current_memory // len(self.l1_kv_cache)
            if avg_size > 100 * 1024:  # >100KB
                from src.common.memory_utils import format_size
                warnings.append(f"⚡ 平均缓存条目过大: {format_size(avg_size)}")
        # 检查最大单条目
        if self.l1_size_map:
            max_size = max(self.l1_size_map.values())
            if max_size > 500 * 1024:  # >500KB
                from src.common.memory_utils import format_size
                warnings.append(f"⚠️ 发现超大缓存条目: {format_size(max_size)}")
        return len(warnings) == 0, warnings
 # 全局实例
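
The threshold logic in `check_health` can be exercised without a `CacheManager` instance. The sketch below is a standalone approximation under stated assumptions: `check_cache_health` is a hypothetical free function, the thresholds (90%, 75%, 100 KB average) are copied from the diff, and the zero-capacity guard is an addition that the original method does not have.

```python
def check_cache_health(current_memory, max_memory, item_count, max_items):
    """Return (is_healthy, warnings) using the same thresholds as the diff."""
    warnings = []

    # Guard against a zero capacity (assumption: the original does not do this)
    memory_pct = current_memory / max_memory * 100 if max_memory > 0 else 0.0
    if memory_pct > 90:
        warnings.append(f"L1 memory usage critical: {memory_pct:.1f}%")
    elif memory_pct > 75:
        warnings.append(f"L1 memory usage elevated: {memory_pct:.1f}%")

    size_pct = item_count / max_items * 100 if max_items > 0 else 0.0
    if size_pct > 90:
        warnings.append(f"L1 entry count critical: {size_pct:.1f}%")

    if item_count:
        average = current_memory // item_count
        if average > 100 * 1024:  # more than 100 KB per entry
            warnings.append(f"average entry too large: {average} bytes")

    return not warnings, warnings


healthy, problems = check_cache_health(80 * 1024 * 1024, 100 * 1024 * 1024, 10, 100)
print(healthy, problems)
```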
--- a/src/common/database/compatibility/adapter.py
+++ b/src/common/database/compatibility/adapter.py
@@ -175,7 +175,8 @@ async def db_query(
        if query_type == "get":
            # 使用QueryBuilder
-            query_builder = QueryBuilder(model_class)
+            # 🔧 兼容层默认禁用缓存（避免旧代码产生大量缓存）
            query_builder = QueryBuilder(model_class).no_cache()
            # 应用过滤条件
            if filters:
--- a/src/common/database/optimization/batch_scheduler.py
+++ b/src/common/database/optimization/batch_scheduler.py
@@ -19,6 +19,7 @@ from sqlalchemy import delete, insert, select, update
 from src.common.database.core.session import get_db_session
 from src.common.logger import get_logger
 from src.common.memory_utils import estimate_size_smart
 logger = get_logger("batch_scheduler")
@@ -65,6 +66,10 @@ class BatchStats:
    last_batch_duration: float = 0.0
    last_batch_size: int = 0
    congestion_score: float = 0.0  # 拥塞评分 (0-1)
    # 🔧 新增：缓存统计
    cache_size: int = 0  # 缓存条目数
    cache_memory_mb: float = 0.0  # 缓存内存占用（MB）
 class AdaptiveBatchScheduler:
@@ -118,8 +123,11 @@ class AdaptiveBatchScheduler:
        # 统计信息
        self.stats = BatchStats()
-        # 简单的结果缓存
+        # 🔧 改进的结果缓存（带大小限制和内存统计）
        self._result_cache: dict[str, tuple[Any, float]] = {}
        self._cache_max_size = 1000  # 最大缓存条目数
        self._cache_memory_estimate = 0  # 缓存内存估算（字节）
        self._cache_size_map: dict[str, int] = {}  # 每个缓存条目的大小
        logger.info(
            f"自适应批量调度器初始化: "
@@ -530,11 +538,53 @@ class AdaptiveBatchScheduler:
        return None
    def _set_cache(self, cache_key: str, result: Any) -> None:
-        """设置缓存"""
+        """设置缓存（改进版，带大小限制和内存统计）"""
        # 🔧 检查缓存大小限制
        if len(self._result_cache) >= self._cache_max_size:
            # 首先清理过期条目
            current_time = time.time()
            expired_keys = [
                k for k, (_, ts) in self._result_cache.items()
                if current_time - ts >= self.cache_ttl
            ]
            for k in expired_keys:
                # 更新内存统计
                if k in self._cache_size_map:
                    self._cache_memory_estimate -= self._cache_size_map[k]
                    del self._cache_size_map[k]
                del self._result_cache[k]
            # 如果还是太大，清理最老的条目（LRU）
            if len(self._result_cache) >= self._cache_max_size:
                oldest_key = min(
                    self._result_cache.keys(), 
                    key=lambda k: self._result_cache[k][1]
                )
                # 更新内存统计
                if oldest_key in self._cache_size_map:
                    self._cache_memory_estimate -= self._cache_size_map[oldest_key]
                    del self._cache_size_map[oldest_key]
                del self._result_cache[oldest_key]
                logger.debug(f"缓存已满，淘汰最老条目: {oldest_key}")
        # 🔧 Estimate the entry's memory footprint
        # If the key is already cached, release its old size first so overwrites are not double-counted
        self._cache_memory_estimate -= self._cache_size_map.pop(cache_key, 0)
        try:
            total_size = estimate_size_smart(cache_key) + estimate_size_smart(result)
        except Exception as e:
            logger.debug(f"估算缓存大小失败: {e}")
            # Fall back to a default of 1 KB
            total_size = 1024
        self._cache_size_map[cache_key] = total_size
        self._cache_memory_estimate += total_size
        self._result_cache[cache_key] = (result, time.time())
    async def get_stats(self) -> BatchStats:
-        """获取统计信息"""
+        """获取统计信息（改进版，包含缓存统计）"""
        async with self._lock:
            return BatchStats(
                total_operations=self.stats.total_operations,
@@ -547,6 +597,9 @@ class AdaptiveBatchScheduler:
                last_batch_duration=self.stats.last_batch_duration,
                last_batch_size=self.stats.last_batch_size,
                congestion_score=self.stats.congestion_score,
                # 🔧 新增：缓存统计
                cache_size=len(self._result_cache),
                cache_memory_mb=self._cache_memory_estimate / (1024 * 1024),
            )
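
The eviction and accounting scheme in `_set_cache` (purge expired entries first, then drop the entry with the oldest timestamp, while keeping a running byte estimate) can be sketched as a small standalone class. This is an illustration under assumptions, not the scheduler's code: `BoundedResultCache` is a hypothetical name, it uses the shallow `sys.getsizeof` as a stand-in for `estimate_size_smart`, and it takes an explicit `now` argument so the behavior is deterministic.

```python
import sys
import time


class BoundedResultCache:
    """Size-bounded cache with TTL purge, oldest-first eviction and byte accounting."""

    def __init__(self, max_size=1000, ttl=5.0):
        self.max_size = max_size
        self.ttl = ttl
        self.memory_estimate = 0
        self._data = {}   # key -> (value, timestamp)
        self._sizes = {}  # key -> estimated bytes

    def _drop(self, key):
        # Keep the running estimate in sync with the stored entries
        self.memory_estimate -= self._sizes.pop(key, 0)
        self._data.pop(key, None)

    def set(self, key, value, now=None):
        now = time.time() if now is None else now
        if key in self._data:
            self._drop(key)  # overwrite: release the old size first
        if len(self._data) >= self.max_size:
            expired = [k for k, (_, ts) in self._data.items() if now - ts >= self.ttl]
            for k in expired:
                self._drop(k)
            if len(self._data) >= self.max_size:
                oldest = min(self._data, key=lambda k: self._data[k][1])
                self._drop(oldest)
        size = sys.getsizeof(key) + sys.getsizeof(value)  # stand-in estimator
        self._sizes[key] = size
        self.memory_estimate += size
        self._data[key] = (value, now)

    def keys(self):
        return set(self._data)


cache = BoundedResultCache(max_size=2, ttl=10)
cache.set("a", [1, 2, 3], now=0)
cache.set("b", "hello", now=1)
cache.set("a", [4, 5, 6], now=2)  # overwrite refreshes the timestamp of "a"
cache.set("c", {"k": 1}, now=3)   # cache is full, so "b" (oldest) is evicted
print(cache.keys(), cache.memory_estimate)
```

Because eviction picks the smallest stored timestamp, this is oldest-first by write time; it only behaves like LRU if reads also refresh the timestamp.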
--- a/src/common/database/optimization/cache_manager.py
+++ b/src/common/database/optimization/cache_manager.py
@@ -16,6 +16,7 @@ from dataclasses import dataclass
 from typing import Any, Generic, TypeVar
 from src.common.logger import get_logger
 from src.common.memory_utils import estimate_size_smart
 logger = get_logger("cache_manager")
@@ -230,13 +231,12 @@ class LRUCache(Generic[T]):
            )
    def _estimate_size(self, value: Any) -> int:
-        """估算数据大小（字节）
+        """估算数据大小（字节）- 使用准确的估算方法
-        这是一个简单的估算，实际大小可能不同
+        使用深度递归估算，比 sys.getsizeof() 更准确
        """
        import sys
        try:
-            return sys.getsizeof(value)
+            return estimate_size_smart(value)
        except (TypeError, AttributeError):
            # 无法获取大小，返回默认值
            return 1024
@@ -259,6 +259,7 @@ class MultiLevelCache:
        l2_max_size: int = 10000,
        l2_ttl: float = 300,
        max_memory_mb: int = 100,
        max_item_size_mb: int = 1,
    ):
        """初始化多级缓存
@@ -268,15 +269,19 @@ class MultiLevelCache:
            l2_max_size: L2缓存最大条目数
            l2_ttl: L2缓存TTL（秒）
            max_memory_mb: 最大内存占用（MB）
            max_item_size_mb: 单个缓存条目最大大小（MB）
        """
        self.l1_cache: LRUCache[Any] = LRUCache(l1_max_size, l1_ttl, "L1")
        self.l2_cache: LRUCache[Any] = LRUCache(l2_max_size, l2_ttl, "L2")
        self.max_memory_bytes = max_memory_mb * 1024 * 1024
        self.max_item_size_bytes = max_item_size_mb * 1024 * 1024
        self._cleanup_task: asyncio.Task | None = None
        self._is_closing = False  # 🔧 添加关闭标志
        logger.info(
            f"多级缓存初始化: L1({l1_max_size}项/{l1_ttl}s) "
-            f"L2({l2_max_size}项/{l2_ttl}s) 内存上限({max_memory_mb}MB)"
+            f"L2({l2_max_size}项/{l2_ttl}s) 内存上限({max_memory_mb}MB) "
            f"单项上限({max_item_size_mb}MB)"
        )
    async def get(
@@ -337,6 +342,19 @@ class MultiLevelCache:
            size: 数据大小（字节）
            ttl: 自定义过期时间（秒），如果为None则使用默认TTL
        """
        # 估算数据大小（如果未提供）
        if size is None:
            size = estimate_size_smart(value)
        # 检查单个条目大小是否超过限制
        if size > self.max_item_size_bytes:
            logger.warning(
                f"缓存条目过大，跳过缓存: key={key}, "
                f"size={size / (1024 * 1024):.2f}MB, "
                f"limit={self.max_item_size_bytes / (1024 * 1024):.2f}MB"
            )
            return
        # 根据TTL决定写入哪个缓存层
        if ttl is not None:
            # 有自定义TTL，根据TTL大小决定写入层级
@@ -373,17 +391,51 @@ class MultiLevelCache:
        logger.info("所有缓存已清空")
    async def get_stats(self) -> dict[str, Any]:
-        """获取所有缓存层的统计信息"""
+        """获取所有缓存层的统计信息（修正版，避免重复计数）"""
        l1_stats = await self.l1_cache.get_stats()
        l2_stats = await self.l2_cache.get_stats()
-        total_size_bytes = l1_stats.total_size + l2_stats.total_size
+        
        # 🔧 修复：计算实际独占的内存，避免L1和L2共享数据的重复计数
        l1_keys = set(self.l1_cache._cache.keys())
        l2_keys = set(self.l2_cache._cache.keys())
        shared_keys = l1_keys & l2_keys
        l1_only_keys = l1_keys - l2_keys
        l2_only_keys = l2_keys - l1_keys
        # 计算实际总内存（避免重复计数）
        # L1独占内存
        l1_only_size = sum(
            self.l1_cache._cache[k].size 
            for k in l1_only_keys 
            if k in self.l1_cache._cache
        )
        # L2独占内存
        l2_only_size = sum(
            self.l2_cache._cache[k].size 
            for k in l2_only_keys 
            if k in self.l2_cache._cache
        )
        # 共享内存（只计算一次，使用L1的数据）
        shared_size = sum(
            self.l1_cache._cache[k].size 
            for k in shared_keys 
            if k in self.l1_cache._cache
        )
        actual_total_size = l1_only_size + l2_only_size + shared_size
        return {
            "l1": l1_stats,
            "l2": l2_stats,
-            "total_memory_mb": total_size_bytes / (1024 * 1024),
+            "total_memory_mb": actual_total_size / (1024 * 1024),
            "l1_only_mb": l1_only_size / (1024 * 1024),
            "l2_only_mb": l2_only_size / (1024 * 1024),
            "shared_mb": shared_size / (1024 * 1024),
            "shared_keys_count": len(shared_keys),
            "dedup_savings_mb": (l1_stats.total_size + l2_stats.total_size - actual_total_size) / (1024 * 1024),
            "max_memory_mb": self.max_memory_bytes / (1024 * 1024),
-            "memory_usage_percent": (total_size_bytes / self.max_memory_bytes * 100) if self.max_memory_bytes > 0 else 0,
+            "memory_usage_percent": (actual_total_size / self.max_memory_bytes * 100) if self.max_memory_bytes > 0 else 0,
        }
    async def check_memory_limit(self) -> None:
@@ -421,9 +473,13 @@ class MultiLevelCache:
            return
        async def cleanup_loop():
-            while True:
+            while not self._is_closing:
                try:
                    await asyncio.sleep(interval)
                    if self._is_closing:
                        break
                    stats = await self.get_stats()
                    l1_stats = stats["l1"]
                    l2_stats = stats["l2"]
@@ -433,9 +489,14 @@ class MultiLevelCache:
                        f"L2: {l2_stats.item_count}项, "
                        f"命中率{l2_stats.hit_rate:.2%} | "
                        f"内存: {stats['total_memory_mb']:.2f}MB/{stats['max_memory_mb']:.2f}MB "
-                        f"({stats['memory_usage_percent']:.1f}%)"
+                        f"({stats['memory_usage_percent']:.1f}%) | "
                        f"共享: {stats['shared_keys_count']}键/{stats['shared_mb']:.2f}MB "
                        f"(去重节省{stats['dedup_savings_mb']:.2f}MB)"
                    )
                    # 🔧 清理过期条目
                    await self._clean_expired_entries()
                    # 检查内存限制
                    await self.check_memory_limit()
@@ -449,6 +510,8 @@ class MultiLevelCache:
    async def stop_cleanup_task(self) -> None:
        """停止清理任务"""
        self._is_closing = True
        if self._cleanup_task is not None:
            self._cleanup_task.cancel()
            try:
@@ -457,6 +520,45 @@ class MultiLevelCache:
                pass
            self._cleanup_task = None
            logger.info("缓存清理任务已停止")
    async def _clean_expired_entries(self) -> None:
        """清理过期的缓存条目"""
        try:
            current_time = time.time()
            expired_count = 0
            # Clean expired entries from L1, then L2
            for cache in (self.l1_cache, self.l2_cache):
                async with cache._lock:
                    expired_keys = [
                        key for key, entry in cache._cache.items()
                        if current_time - entry.created_at > cache.ttl
                    ]
                    for key in expired_keys:
                        entry = cache._cache.pop(key, None)
                        if entry:
                            cache._stats.evictions += 1
                            cache._stats.item_count -= 1
                            cache._stats.total_size -= entry.size
                            expired_count += 1
            # Report the combined count for both cache levels
            if expired_count:
                logger.debug(f"清理了 {expired_count} 个过期缓存条目")
        except Exception as e:
            logger.error(f"清理过期条目失败: {e}", exc_info=True)
 # 全局缓存实例
@@ -498,11 +600,13 @@ async def get_cache() -> MultiLevelCache:
                    l2_max_size = db_config.cache_l2_max_size
                    l2_ttl = db_config.cache_l2_ttl
                    max_memory_mb = db_config.cache_max_memory_mb
                    max_item_size_mb = db_config.cache_max_item_size_mb
                    cleanup_interval = db_config.cache_cleanup_interval
                    logger.info(
                        f"从配置加载缓存参数: L1({l1_max_size}/{l1_ttl}s), "
-                        f"L2({l2_max_size}/{l2_ttl}s), 内存限制({max_memory_mb}MB)"
+                        f"L2({l2_max_size}/{l2_ttl}s), 内存限制({max_memory_mb}MB), "
                        f"单项限制({max_item_size_mb}MB)"
                    )
                except Exception as e:
                    # 配置未加载，使用默认值
@@ -512,6 +616,7 @@ async def get_cache() -> MultiLevelCache:
                    l2_max_size = 10000
                    l2_ttl = 300
                    max_memory_mb = 100
                    max_item_size_mb = 1
                    cleanup_interval = 60
                _global_cache = MultiLevelCache(
@@ -520,6 +625,7 @@ async def get_cache() -> MultiLevelCache:
                    l2_max_size=l2_max_size,
                    l2_ttl=l2_ttl,
                    max_memory_mb=max_memory_mb,
                    max_item_size_mb=max_item_size_mb,
                )
                await _global_cache.start_cleanup_task(interval=cleanup_interval)
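
The de-duplicated accounting in `MultiLevelCache.get_stats` reduces to set arithmetic over the two key sets. The sketch below assumes each level can be viewed as a plain `key -> size` mapping; `deduplicated_total` and the sample sizes are hypothetical and only illustrate the calculation.

```python
def deduplicated_total(l1_sizes, l2_sizes):
    """Count entries present in both levels once, using the L1 size for shared keys."""
    shared = l1_sizes.keys() & l2_sizes.keys()
    l1_only = sum(size for key, size in l1_sizes.items() if key not in shared)
    l2_only = sum(size for key, size in l2_sizes.items() if key not in shared)
    shared_size = sum(l1_sizes[key] for key in shared)

    total = l1_only + l2_only + shared_size
    naive = sum(l1_sizes.values()) + sum(l2_sizes.values())
    return {
        "total": total,
        "l1_only": l1_only,
        "l2_only": l2_only,
        "shared": shared_size,
        "savings": naive - total,  # bytes that the naive sum counted twice
    }


stats = deduplicated_total({"a": 100, "b": 200}, {"b": 200, "c": 300})
print(stats)
```

The result is only as accurate as the assumption that both levels hold references to the same object for a shared key; if L2 stored an independent copy, the naive sum would be the correct figure.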
--- a/src/common/memory_utils.py
+++ b/src/common/memory_utils.py
@@ -0,0 +1,192 @@
 """
 准确的内存大小估算工具
 提供比 sys.getsizeof() 更准确的内存占用估算方法
 """
 import sys
 import pickle
 from typing import Any
 import numpy as np
 def get_accurate_size(obj: Any, seen: set | None = None) -> int:
    """
    准确估算对象的内存大小（递归计算所有引用对象）
    比 sys.getsizeof() 准确得多，特别是对于复杂嵌套对象。
    Args:
        obj: 要估算大小的对象
        seen: 已访问对象的集合（用于避免循环引用）
    Returns:
        估算的字节数
    """
    if seen is None:
        seen = set()
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    seen.add(obj_id)
    size = sys.getsizeof(obj)
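    # NumPy arrays: sys.getsizeof() already includes the data buffer when the array owns it,
    # so add nbytes only for views that do not own their data
    if isinstance(obj, np.ndarray):
        if not obj.flags.owndata:
            size += obj.nbytes
        return size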
    # 字典：递归计算所有键值对
    if isinstance(obj, dict):
        size += sum(get_accurate_size(k, seen) + get_accurate_size(v, seen) 
                   for k, v in obj.items())
    # 列表、元组、集合：递归计算所有元素
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(get_accurate_size(item, seen) for item in obj)
    # 有 __dict__ 的对象：递归计算属性
    elif hasattr(obj, '__dict__'):
        size += get_accurate_size(obj.__dict__, seen)
    # 其他可迭代对象
    elif hasattr(obj, '__iter__') and not isinstance(obj, (str, bytes, bytearray)):
        try:
            size += sum(get_accurate_size(item, seen) for item in obj)
        except Exception:
            pass
    return size
 def get_pickle_size(obj: Any) -> int:
    """
    使用 pickle 序列化大小作为参考
    通常比 sys.getsizeof() 更接近实际内存占用，
    但可能略小于真实内存占用（不包括 Python 对象开销）
    Args:
        obj: 要估算大小的对象
    Returns:
        pickle 序列化后的字节数，失败返回 0
    """
    try:
        return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    except Exception:
        return 0
 def estimate_size_smart(obj: Any, max_depth: int = 5, sample_large: bool = True) -> int:
    """
    智能估算对象大小（平衡准确性和性能）
    使用深度受限的递归估算+采样策略，平衡准确性和性能：
    - 深度5层足以覆盖99%的缓存数据结构
    - 对大型容器（>100项）进行采样估算
    - 性能开销约60倍于sys.getsizeof，但准确度提升1000+倍
    Args:
        obj: 要估算大小的对象
        max_depth: 最大递归深度（默认5层，可覆盖大多数嵌套结构）
        sample_large: 对大型容器是否采样（默认True，提升性能）
    Returns:
        估算的字节数
    """
    return _estimate_recursive(obj, max_depth, set(), sample_large)
 def _estimate_recursive(obj: Any, depth: int, seen: set, sample_large: bool) -> int:
    """递归估算，带深度限制和采样"""
    # 检查深度限制
    if depth <= 0:
        return sys.getsizeof(obj)
    # 检查循环引用
    obj_id = id(obj)
    if obj_id in seen:
        return 0
    seen.add(obj_id)
    # 基本大小
    size = sys.getsizeof(obj)
    # 简单类型直接返回
    if isinstance(obj, (int, float, bool, type(None), str, bytes, bytearray)):
        return size
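    # NumPy arrays: sys.getsizeof() already includes the data buffer when the array owns it,
    # so add nbytes only for views that do not own their data
    if isinstance(obj, np.ndarray):
        return size if obj.flags.owndata else size + obj.nbytes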
    # 字典递归
    if isinstance(obj, dict):
        items = list(obj.items())
        if sample_large and len(items) > 100:
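            # Sample large dicts: first 50 + up to 50 from the middle + last 50.
            # The middle slice is clamped so it never overlaps the head or tail when len(items) < 150.
            mid = len(items) // 2
            sample_items = items[:50] + items[max(50, mid - 25):min(len(items) - 50, mid + 25)] + items[-50:]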
            sampled_size = sum(
                _estimate_recursive(k, depth - 1, seen, sample_large) + 
                _estimate_recursive(v, depth - 1, seen, sample_large)
                for k, v in sample_items
            )
            # 按比例推算总大小
            size += int(sampled_size * len(items) / len(sample_items))
        else:
            # 小字典全部计算
            for k, v in items:
                size += _estimate_recursive(k, depth - 1, seen, sample_large)
                size += _estimate_recursive(v, depth - 1, seen, sample_large)
        return size
    # 列表、元组、集合递归
    if isinstance(obj, (list, tuple, set, frozenset)):
        items = list(obj)
        if sample_large and len(items) > 100:
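            # Sample large containers: first 50 + up to 50 from the middle + last 50.
            # The middle slice is clamped so it never overlaps the head or tail when len(items) < 150.
            mid = len(items) // 2
            sample_items = items[:50] + items[max(50, mid - 25):min(len(items) - 50, mid + 25)] + items[-50:]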
            sampled_size = sum(
                _estimate_recursive(item, depth - 1, seen, sample_large)
                for item in sample_items
            )
            # 按比例推算总大小
            size += int(sampled_size * len(items) / len(sample_items))
        else:
            # 小容器全部计算
            for item in items:
                size += _estimate_recursive(item, depth - 1, seen, sample_large)
        return size
    # 有 __dict__ 的对象
    if hasattr(obj, '__dict__'):
        size += _estimate_recursive(obj.__dict__, depth - 1, seen, sample_large)
    return size
 def format_size(size_bytes: int) -> str:
    """
    格式化字节数为人类可读的格式
    Args:
        size_bytes: 字节数
    Returns:
        格式化后的字符串，如 "1.23 MB"
    """
    if size_bytes < 1024:
        return f"{size_bytes} B"
    elif size_bytes < 1024 * 1024:
        return f"{size_bytes / 1024:.2f} KB"
    elif size_bytes < 1024 * 1024 * 1024:
        return f"{size_bytes / 1024 / 1024:.2f} MB"
    else:
        return f"{size_bytes / 1024 / 1024 / 1024:.2f} GB"
 # 向后兼容的别名
 get_deep_size = get_accurate_size
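
To see why a recursive estimate matters for cache accounting, the following sketch compares `sys.getsizeof` with a simplified deep traversal. `deep_size` is a hypothetical, stripped-down stand-in for `get_accurate_size` (no NumPy handling, no `__dict__` traversal), and the payload is made-up data.

```python
import sys


def deep_size(obj, seen=None):
    """Recursively sum sys.getsizeof over containers, visiting each object once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0  # already counted (shared reference or cycle)
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_size(k, seen) + deep_size(v, seen) for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_size(item, seen) for item in obj)
    return size


# A nested payload: the shallow size only covers the outer dict
payload = {"rows": [[str(i) * 500] for i in range(100)]}
print(sys.getsizeof(payload), deep_size(payload))

# A shared reference is counted once, not twice
shared = [1, 2, 3]
pair = [shared, shared]
print(deep_size(pair) == sys.getsizeof(pair) + deep_size(shared))
```

The `seen` set is what makes the traversal safe on cyclic structures, at the cost of undercounting when the same object is legitimately reachable from two separately measured cache entries.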
--- a/src/config/official_configs.py
+++ b/src/config/official_configs.py
@@ -49,6 +49,7 @@ class DatabaseConfig(ValidatedConfigBase):
    cache_l2_ttl: int = Field(default=300, ge=60, le=7200, description="L2缓存生存时间（秒）")
    cache_cleanup_interval: int = Field(default=60, ge=30, le=600, description="缓存清理任务执行间隔（秒）")
    cache_max_memory_mb: int = Field(default=100, ge=10, le=1000, description="缓存最大内存占用（MB），超过此值将触发强制清理")
    cache_max_item_size_mb: int = Field(default=1, ge=1, le=100, description="单个缓存条目最大大小（MB），超过此值将不缓存")
 class BotConfig(ValidatedConfigBase):
--- a/template/bot_config_template.toml
+++ b/template/bot_config_template.toml
@@ -1,5 +1,5 @@
 [inner]
-version = "7.5.5"
+version = "7.5.6"
 #----以下是给开发人员阅读的，如果你只是部署了MoFox-Bot，不需要阅读----
 #如果你想要修改配置文件，请递增version的值
@@ -50,7 +50,8 @@ cache_l1_ttl = 60 # L1缓存生存时间（秒）
 cache_l2_max_size = 10000 # L2缓存最大条目数（温数据，内存占用约10-50MB）
 cache_l2_ttl = 300 # L2缓存生存时间（秒）
 cache_cleanup_interval = 60 # 缓存清理任务执行间隔（秒）
-cache_max_memory_mb = 100 # 缓存最大内存占用（MB），超过此值将触发强制清理
+cache_max_memory_mb = 500 # 缓存最大内存占用（MB），超过此值将触发强制清理
 cache_max_item_size_mb = 5 # 单个缓存条目最大大小（MB），超过此值将不缓存
 [permission] # 权限系统配置
 # Master用户配置（拥有最高权限，无视所有权限节点）