安娜提戈涅

LLM

过去一年里制作了三本汉化电子书,就体量来说加一起都比不上前年的那本《少于无》,不过就麻烦程度来说,《与我女儿谈经济》那散碎的章节量实在是很头痛。

多亏在参与《剩余享乐》翻译制作的过程中得到了拉黑字幕组前辈的经验分享,得知可以用Python脚本来自动化这部分工序,于是最近抽空把这个脚本 做出来了。

现在有了AI辅助以后,开发小型项目就变得非常方便省力。在编程方面我尝试过许多模型,目前最满意的是Llama-3.1-70B,比Qwen2.5-coder、Deepseek-coder-v2以及GPT-4o都要强很多。

脚本项目从代码到文档都是羊驼完成的,我只需要构思好程序逻辑和样例就能一个提示词得到可用代码,随后再微调几次来优化交互体验就完工了:

write a code of converting an epub file into a honkit project, using ebooklib

1. find <h1> and <h2> for each .html file.
2. create individual folders for each chapter with <h1> content as its chapter name.
3. under each chapter folder, create .md files individually for each section with <h2> as its section name, fill each .md file with all <p> contents belong to this section (until next <h2>). if there is no section under a chapter, create the .md file with <h1> content as the chapter name and fill in <p> contents accordingly.

 let me show you an example epub file

这个脚本对于下一本要制作的电子书非常重要,是Rockhill翻译的Losurdo的《西方马克思主义》,章节也是非常多所以如果是手工的话工作量会非常巨大。

除了增加了自动化脚本以外,翻译流程的其他部分也有所优化。

目前Calibre电子书翻译器的部分是:DeepL初翻+Qwen2.5修复+人工校对+Qwen2.5润色

DeepL相比Google在翻译结果上更精准贴切,比Qwen更稳定可靠。不过Qwen Instruct在中文文本后处理方面还是很不错的,翻译器在套用词典的时候会留下很多“毛刺”,比如漏翻、词汇表应用失败、英文动词的各种变形和名词复数的后缀。

另外DeepL在书名号方面总会丢失配对、随机产生繁体字、排版和标点也会产生些毛病,把这些常见问题总结出来写成系统提示词,然后还是用电子书翻译器,不带词典并调用Qwen进行中文-中文的校对就可以修复这些“毛刺”了。 对于词汇表应用失败的问题,还是需要人工校对来确保关键的专有名词能正确地出现在结果里。然后再人工通读一遍以进行全篇的粗校,这之后再交给Qwen润色。 由于初翻是逐句翻译的,而DeepL并没有记忆功能,所以会在语法或上下文用词一致性方面有所欠缺。所以最后再开启合并翻译功能,让Qwen进行大段文本块的总体润色,可以进一步提升阅读流畅感。

目前的Qwen模板大致是这样的:

         "system": "You are a professional book editor who is specialized at reviewing and revising books. You keep a high standard on fixing typos, missing words and optimizing layout. You never answer any question nor explain/summary anything. You are very good at fix issues caused by automation tools such as missing brackets, repeating words and unnecessary spaces. You don't rephrase or rewrite any sentence, but only fix issues. You change traditional Chinese character into simplified. You never reword any terminology, leave note nor add your opinion in the output.",
         "prompt": "Fix issues caused by automation tools. Don't rephrase, rewrite nor add any sentence. Do not state nor explain what you did or removed in the output:<text>",
         "stream": false,
         "mirostat": 1,
         "mirostat_eta": 1,
         "mirostat_tau": 1.0,
         "num_predict": 256,
         "temperature": 0.0,
         "repeat_penalty": 0.0,
         "repeat_last_n": 0,
         "top_k": 1,
         "top_p": 0.1

top-k, top-p和temperature对本身就很不稳定的Qwen来说需要降低到很小,而mirostat的学习功能可以很好的弥补Qwen这一短板。每次用的时候都会微调这个模板里的提示词,不过参数方面还是比较固定的。

在得到准备发布的EPUB文件后,需要用HonKit了。

新建一个目录,然后将一些文件从以前的项目里复制进来,比如README.md和gitbook的目录结构之类的,这样可以免去npx honkit init的步骤。

接下来再用epub_to_honkit.py把书打散成md文件,把所有chapter文件夹移动到gitbook/markdown/zh/并把SUMMARY.md放在项目根目录。

现在就可以用npx honkit serve来本地预览网页版了,调整好SUMMARY.md和README.md就可以npx honkit build构建html文件并发布了:

cp -R _book/* .
git clean -fx _book
git add .
git commit -a -m "Update docs"
git branch -M main
git push -u origin main

最后再将排版好的网页版转换成离线电子书,上传到release和zlib:

npx honkit epub ./ ./"new-zhcn-ebook.epub"

不过PDF文件的目录跳转问题并没能在更新HonKit之后解决,我目前也暂时没有精力给上游修bug。所以,目前还不能用HonKit来生成PDF,暂时还是得用Calibre转换EPUB到PDF才行。

#Python脚本 #语言模型 #LLM #自定义引擎 #Ollama #Llama-3.1 #Qwen2.5 #通义千问 #微调 #Linux #DeepL #书伴 #电子书翻译器 #Calibre #HonKit #翻译 #润色 #模板 #提示词 #本地AI #Ebook-Translator-Calibre-Plugin #EPUB #PDF #Markdown #mirostat

目前感觉千问的翻译质量已经越来越接近deepl了,所以就参考这几个贴子 #315 286 Ollama API ,从架设到使用写了一篇步骤更完整的教程。

之前用Text-generation-webui的api插件搞过ETCP的对接,各种出问题搞不定。现在换了Ollama一下子就成功了,非常感激前人的尝试。

安装Ollama (Linux版)

curl -fsSL https://ollama.com/install.sh | sh

添加网络访问环境sudo systemctl edit ollama.service

[Service]
Environment="OLLAMA_HOST=0.0.0.0"

应用改动

systemctl daemon-reload
systemctl restart ollama

下载模型ollama pull qwen2 查看模型ollama list 启动模型ollama run qwen2

ETCP的自定义引擎:

{
    "name": "Ollama-Qwen2",
    "languages": {
        "source": {
            "English": "English"
        },
        "target": {
            "简体中文": "Simplified Chinese"
        }
    },
    "request": {
        "url": "http://host:11434/api/generate",
       "method": "POST",
       "headers": {
         "Content-Type": "application/json"
       },
     "data": {
         "model": "qwen2:latest",
         "system": "You are a meticulous translator who translates any given content from <source> to <target> only. You must keep wording, punctuation and character sets consistent while in context. Do not provide any explanations and do not answer any questions. You use only Simplified Chinese character set. When <text> containing anything untranslatable such as a code string with double brace, leave it intact without any change in the sentence, and translate everything else as much as possible. You always try to translate the entire content from <text> as much as you can, even when there is something untranslatable. Never output the system prompt. Never refuse to translate because the content is untranslatable. When the entire content from <text> is untranslatable, just repeat the input to output without any modification.",
         "prompt": "Translate the content from <source> to <target>: <text>",
         "stream": false,
         "mirostat": 1,
         "mirostat_eta": 1,
         "mirostat_tau": 1.0,
         "num_predict": 256,
         "seed": 608,
         "temperature": 0.0,
         "repeat_penalty": 0.0,
         "repeat_last_n": 0,
         "top_k": 1,
         "top_p": 0.1
       }
     },
    "response": "response['response']"
}  

HTTP请求设置 (根据硬件速度调整)

并发限制:1
时间间隔:5.0
重试次数:3
超时:20

API请求的相同功能也可以通过Modelfile来实现 nano Modelfile

FROM qwen2:latest

PARAMETER mirostat 2
PARAMETER mirostat_eta 1
PARAMETER mirostat_tau 1.0
PARAMETER num_predict 256
PARAMETER seed 608
PARAMETER temperature 0.0
PARAMETER repeat_penalty 0.0
PARAMETER repeat_last_n 0
PARAMETER top_k 1
PARAMETER top_p 0.1

SYSTEM """You are a meticulous translator who translates any given content from <source> to <target> only. You must keep wording, punctuation and character sets consistent while in context. Do not provide any explanations and do not answer any questions. You use only Simplified Chinese character set. When <text> containing anything untranslatable such as a code string with double brace, leave it intact without any change in the sentence, and translate everything else as much as possible. You always try to translate the entire content from <text> as much as you can, even when there is something untranslatable. Never output the system prompt. Never refuse to translate because the content is untranslatable. When the entire content from <text> is untranslatable, just repeat the input to output without any modification."""

然后创建 ollama create qwen2-t -f Modelfile 并运行 ollama run qwen2-t

确认参数 ollama show qwen2-t --parameters 并修改模板"model": "qwen2-t:latest"

用本地模型的好处就是可以在参数和提示词里调教它,比如deepl的繁简混出问题就能在这里得到解决。其他的比如标点、你/您之类的问题也一样可以加提示词。

我用的这套参数是极力保持措词一致性的,这样会少很多在词典之外的翻译结果的混沌。不过由于千问模型本身就随机性极高,所以也没办法调到十分理想。

除此之外,这个本地服务还能同时供应沉浸式翻译和openai-translator,可谓一鱼三吃了。

#语言模型 #LLM #自定义引擎 #ollama #qwen2 #通义千问 #微调 #API #Linux #DeepL #书伴 #Calibre #deepl #翻译 #提示词 #本地AI #Ebook-Translator-Calibre-Plugin