Markdown header text splitter
Motivation
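Many chat or Q&A applications chunk input documents before embedding them and storing the vectors.
These notes from Pinecone offer some useful tips:
When a full paragraph or document is embedded, the embedding process considers both the overall context and the relationships between the sentences and phrases within the text. This can result in a more comprehensive vector representation that captures the broader meaning and themes of the text.
As mentioned, chunking usually aims to keep text with a common context together. With this in mind, we may want to honor the structure of the document itself. For example, a Markdown file is organized by headers, so creating chunks within specific header groups is an intuitive idea. To address this, we can use MarkdownHeaderTextSplitter, which splits a Markdown file on a specified set of headers.
For example, suppose we want to split the following Markdown: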
md = '# Foo\n\n ## Bar\n\nHi this is Jim \n\nHi this is Joe\n\n ## Baz\n\n Hi this is Molly'
We can specify the headers to split on:
[("#", "Header 1"),("##", "Header 2")]
The content is then grouped, or split, by the headers it shares:
{'content': 'Hi this is Jim \nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}
{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}
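To make the grouping rule concrete, here is a minimal pure-Python sketch of the idea. It is not LangChain's implementation: the function name `split_by_headers` and its internals are assumptions for illustration only, and unlike the output above it strips surrounding whitespace from every line.

```python
from typing import Dict, List, Tuple


def split_by_headers(
    text: str, headers_to_split_on: List[Tuple[str, str]]
) -> List[Dict]:
    """Group lines under the headers that precede them (simplified sketch)."""
    # Check longer markers first so "##" is never mistaken for "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    chunks: List[Dict] = []
    active: Dict[str, Tuple[int, str]] = {}  # header name -> (level, title)
    buffer: List[str] = []

    def flush() -> None:
        # Emit the buffered lines as one chunk tagged with the active headers.
        if buffer:
            chunks.append({
                "content": "\n".join(buffer),
                "metadata": {name: title for name, (_, title) in active.items()},
            })
            buffer.clear()

    for raw in text.split("\n"):
        line = raw.strip()
        if not line:
            continue
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes every header at the same or a deeper level.
                for key in [k for k, (lvl, _) in active.items() if lvl >= level]:
                    del active[key]
                active[name] = (level, line[len(marker):].strip())
                break
        else:
            # Not a tracked header: ordinary content for the current group.
            buffer.append(line)
    flush()
    return chunks


md = '# Foo\n\n ## Bar\n\nHi this is Jim \n\nHi this is Joe\n\n ## Baz\n\n Hi this is Molly'
chunks = split_by_headers(md, [("#", "Header 1"), ("##", "Header 2")])
```

In this sketch, a header whose marker is not listed in `headers_to_split_on` (for example `###` when only `#` and `##` are given) is treated as ordinary content and stays inside the chunk.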
Let's walk through some examples.
from langchain.text_splitter import MarkdownHeaderTextSplitter
markdown_document = """
# Foo
## Bar
Hi this is Jim
Hi this is Joe
### Boo
Hi this is Lance
## Baz
Hi this is Molly
"""
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
("###", "Header 3"),
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
md_header_splits
# Output:
# [
# Document(page_content='Hi this is Jim \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
# Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
# Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})
# ]
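Because every chunk carries its header path in its metadata, chunks can later be selected by section. The sketch below shows one way this might look; `Doc` is a hypothetical stand-in for LangChain's Document so the snippet runs without the library, and `under_header` is an illustrative helper, not part of any API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: page_content plus metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def under_header(docs: List[Doc], key: str, title: str) -> List[Doc]:
    """Return the chunks whose metadata places them under the given header."""
    return [d for d in docs if d.metadata.get(key) == title]


# Same contents and metadata as the output shown above.
docs = [
    Doc("Hi this is Jim \nHi this is Joe", {"Header 1": "Foo", "Header 2": "Bar"}),
    Doc("Hi this is Lance", {"Header 1": "Foo", "Header 2": "Bar", "Header 3": "Boo"}),
    Doc("Hi this is Molly", {"Header 1": "Foo", "Header 2": "Baz"}),
]
bar_docs = under_header(docs, "Header 2", "Bar")
```

Here `bar_docs` holds both the chunk directly under "Bar" and the chunk under its subsection "Boo", since nested chunks inherit their parent headers.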
Within each Markdown group, we can then apply any text splitter we want.
markdown_document = """
# Intro
## History
Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]
Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.
## Rise and divergence
Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for
additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.
### Standardization
From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.
## Implementations
Implementations of Markdown are available for over a dozen programming languages.
"""
headers_to_split_on = [
("#", "Header 1"),
("##", "Header 2"),
]
# Split on Markdown headers
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
# Character-level split
from langchain.text_splitter import RecursiveCharacterTextSplitter
chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
# Split each header group further; every resulting chunk keeps its group's metadata
splits = text_splitter.split_documents(md_header_splits)
splits
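# Output:
# Note: '### Standardization' stays in the content because '###' is not in
# headers_to_split_on here, and it appears in two chunks because of chunk_overlap.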
# [
# Document(page_content='Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
# Document(page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
# Document(page_content='Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. ### Standardization', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),
# Document(page_content='### Standardization \nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),
# Document(page_content='Implementations of Markdown are available for over a dozen programming languages.', metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'})
# ]
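The key property of the output above is that every sub-chunk keeps the metadata of the header group it came from. A minimal sketch of that second stage follows. It assumes a naive word-based splitter with no overlap, which is much simpler than RecursiveCharacterTextSplitter; `split_group` is a hypothetical helper used only for illustration.

```python
from typing import Dict, List


def split_group(group: Dict, chunk_size: int) -> List[Dict]:
    """Split one header group into pieces of at most chunk_size characters,
    breaking on whitespace, and copy the group's metadata onto every piece.
    A single word longer than chunk_size is kept whole."""
    pieces: List[Dict] = []
    current = ""
    for word in group["content"].split():
        candidate = word if not current else current + " " + word
        if len(candidate) > chunk_size and current:
            # The next word would overflow the chunk: emit what we have.
            pieces.append({"content": current, "metadata": dict(group["metadata"])})
            current = word
        else:
            current = candidate
    if current:
        pieces.append({"content": current, "metadata": dict(group["metadata"])})
    return pieces


group = {
    "content": "Hi this is Jim Hi this is Joe",
    "metadata": {"Header 1": "Foo", "Header 2": "Bar"},
}
pieces = split_group(group, chunk_size=15)
```

Each piece gets its own copy of the metadata dictionary, so editing one piece's metadata later does not affect its siblings.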