Knowledge ¶
Assets¶
Knowledge assets are instances of KnowledgeAsset that hold a list of Python objects (most often dicts) and expose various methods to manipulate them. For usage examples, see the API documentation of the particular method.
VBT assets¶
There are two knowledge assets in VBT: 1) website pages, and 2) Discord messages. The former asset consists of pages and headings that you can find on the (mainly private) website. Each data item represents a page or a heading of a page. Pages usually just point to one or more other pages and/or headings, while headings themselves hold text content - it all reflects the structure of Markdown files. The latter asset consists of the message history of the "vectorbt.pro" Discord server. Here, each data item represents a Discord message that may reference other Discord message(s) through replies.
The assets are attached to each release as pages.json.zip and messages.json.zip respectively; each is a ZIP-compressed JSON file managed by the class PagesAsset or MessagesAsset. An asset can be loaded either automatically or manually. When loading automatically, a GitHub token must be provided.
Hint
The first pull will download the assets, while subsequent pulls will use the cached versions. Once VBT is upgraded, new assets will be downloaded automatically.
env["GITHUB_TOKEN"] = "<YOUR_GITHUB_TOKEN>" # (1)!
pages_asset = vbt.PagesAsset.pull()
messages_asset = vbt.MessagesAsset.pull()
# ______________________________________________________________
vbt.settings.set("knowledge.assets.vbt.token", "YOUR_GITHUB_TOKEN") # (2)!
pages_asset = vbt.PagesAsset.pull()
messages_asset = vbt.MessagesAsset.pull()
# ______________________________________________________________
# The same arguments apply to vbt.MessagesAsset.pull
pages_asset = vbt.PagesAsset.pull(release_name="v2024.8.20") # (3)!
pages_asset = vbt.PagesAsset.pull(cache_dir="my_cache_dir") # (4)!
pages_asset = vbt.PagesAsset.pull(clear_cache=True) # (5)!
pages_asset = vbt.PagesAsset.pull(cache=False) # (6)!
# ______________________________________________________________
pages_asset = vbt.PagesAsset.from_json_file("pages.json.zip") # (7)!
messages_asset = vbt.MessagesAsset.from_json_file("messages.json.zip")
- Set the token as an environment variable
- Set the token as a global setting
- Use the asset of a different release. By default, uses the asset of the installed release (see vbt.version).
- Specify a different cache directory. By default, stores data under ./knowledge/vbt/$release_name/pages/assets/ for pages and ./knowledge/vbt/$release_name/messages/assets/ for messages.
- Clear the cache directory and pull the asset once again. By default, won't pull if the asset already exists.
- Create a temporary directory
- Load the asset file that has already been downloaded
Generic assets¶
Knowledge assets are not limited to VBT assets - you can construct an asset out of any list!
asset = vbt.KnowledgeAsset(my_list) # (1)!
asset = vbt.KnowledgeAsset.from_json_file("my_list.json") # (2)!
asset = vbt.KnowledgeAsset.from_json_bytes(vbt.load_bytes("my_list.json")) # (3)!
- Create an instance by wrapping a list directly
- Create an instance from a JSON file
- Create an instance from JSON bytes
Describing¶
Knowledge assets behave like regular lists; thus, to describe an asset, describe it as you would a list. This opens up many analysis options, such as assessing the length or printing out a random data item, but also more sophisticated ones like printing out the field schema: most data items of an asset are dicts, so you can describe them by their fields.
print(len(asset)) # (1)!
asset.sample().print() # (2)!
asset.print_sample()
asset.print_schema() # (3)!
vbt.pprint(messages_asset.describe()) # (4)!
pages_asset.print_site_schema() # (5)!
- Get the number of data items
- Pick a random data item and print it
- Visualize the asset's field schema as a tree, showing each field with its frequency and type. Works on all assets where data items are dicts.
- Describe each field. Works on all assets where data items are dicts.
- Visualize URL schema as a tree. Works on pages and headings only.
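Conceptually, the field schema is just a frequency and type count over the fields of each dict. Below is a minimal pure-Python sketch of that idea; the `field_schema` helper is hypothetical and not part of VBT:

```python
from collections import Counter

def field_schema(items):
    """Count the frequency and value types of each top-level field
    across a list of dict data items (illustrative sketch only)."""
    freq = Counter()
    types = {}
    for d in items:
        for key, value in d.items():
            freq[key] += 1
            types.setdefault(key, set()).add(type(value).__name__)
    return {k: (freq[k], sorted(types[k])) for k in freq}

items = [
    {"author": "abc", "reactions": 3},
    {"author": "xyz", "reactions": 0, "attachments": []},
]
print(field_schema(items))
```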
Manipulating¶
A knowledge asset is just a sophisticated list: it looks like a VBT object but behaves like a list. For manipulation, it offers a collection of methods ending in item or items to get, set, or remove data items, either by returning a new asset instance (the default) or by modifying the asset instance in place.
d = asset.get_items(0) # (1)!
d = asset[0]
data = asset[0:100] # (2)!
data = asset[mask] # (3)!
data = asset[indices] # (4)!
# ______________________________________________________________
new_asset = asset.set_items(0, new_d) # (5)!
asset.set_items(0, new_d, inplace=True) # (6)!
asset[0] = new_d # (7)!
asset[0:100] = new_data
asset[mask] = new_data
asset[indices] = new_data
# ______________________________________________________________
new_asset = asset.delete_items(0) # (8)!
asset.delete_items(0, inplace=True)
asset.remove(0)
del asset[0]
del asset[0:100]
del asset[mask]
del asset[indices]
# ______________________________________________________________
new_asset = asset.append_item(new_d) # (9)!
asset.append_item(new_d, inplace=True)
asset.append(new_d)
# ______________________________________________________________
new_asset = asset.extend_items([new_d1, new_d2]) # (10)!
asset.extend_items([new_d1, new_d2], inplace=True)
asset.extend([new_d1, new_d2])
asset += [new_d1, new_d2]
# ______________________________________________________________
print(d in asset) # (11)!
print(asset.index(d)) # (12)!
print(asset.count(d)) # (13)!
# ______________________________________________________________
for d in asset: # (14)!
...
- Get the first data item
- Get the first 100 data items
- Get the data items that correspond to True in the mask
- Get the data items that correspond to the positions in the index array
- Set the first data item by returning a new asset instance
- Set the first data item by modifying the asset instance
- Built-in methods all modify the existing asset instance
- Remove the first data item
- Append a new data item
- Extend with new data items
- Check if the asset instance contains a data item
- Get the position of a data item in the asset instance
- Get the number of data items in the asset instance that match a data item
- Iterate over data items
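The mask- and index-based selection shown above can be sketched on a plain list. The `select_items` helper below is purely illustrative, not VBT's implementation:

```python
def select_items(data, key):
    """Emulate integer, slice, boolean-mask, and index-array selection
    on a plain list (illustrative sketch of asset[...] semantics)."""
    if isinstance(key, slice):
        return data[key]
    if isinstance(key, list) and key and isinstance(key[0], bool):
        # Boolean mask: keep items where the mask is True
        return [d for d, m in zip(data, key) if m]
    if isinstance(key, list):
        # Index array: pick items by position
        return [data[i] for i in key]
    return data[key]

data = ["a", "b", "c", "d"]
print(select_items(data, [True, False, True, False]))
print(select_items(data, [3, 1]))
```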
Querying¶
There is a zoo of methods to query an asset: get / select, query / filter, and find. The first pair gets and processes one or more fields from each data item: get returns the raw output, while select returns a new asset instance. The second pair runs queries against the asset using various engines such as JMESPath; again, query returns the raw output, while filter returns a new asset instance. Finally, find specializes in finding information across one or more fields. By default, it returns a new asset instance.
messages = messages_asset.get() # (1)!
total_reactions = sum(messages_asset.get("reactions")) # (2)!
first_attachments = messages_asset.get("attachments[0]['content']", skip_missing=True) # (3)!
first_attachments = messages_asset.get("attachments.0.content", skip_missing=True) # (4)!
stripped_contents = pages_asset.get("content", source="x.strip() if x else ''") # (5)!
stripped_contents = pages_asset.get("content", source=lambda x: x.strip() if x else '') # (6)!
stripped_contents = pages_asset.get(source="content.strip() if content else ''") # (7)!
# (8)!
all_contents = pages_asset.select("content").remove_empty().get() # (9)!
all_attachments = messages_asset.select("attachments").merge().get() # (10)!
combined_content = messages_asset.select(source=vbt.Sub('[$author] $content')).join() # (11)!
# ______________________________________________________________
user_questions = messages_asset.query("content if '@polakowo' in mentions else vbt.NoResult") # (12)!
is_user_question = messages_asset.query("'@polakowo' in mentions", return_type="bool") # (13)!
all_attachments = messages_asset.query("[].attachments | []", query_engine="jmespath") # (14)!
all_classes = pages_asset.query("name[obj_type == 'class'].sort_values()", query_engine="pandas") # (15)!
# (16)!
support_messages = messages_asset.filter("channel == 'support'") # (17)!
# ______________________________________________________________
new_messages_asset = messages_asset.find("@polakowo") # (18)!
new_messages_asset = messages_asset.find("@polakowo", path="author") # (19)!
new_messages_asset = messages_asset.find(vbt.Not("@polakowo"), path="author") # (20)!
new_messages_asset = messages_asset.find( # (21)!
["@polakowo", "from_signals"],
path=["author", "content"],
find_all=True
)
found_fields = messages_asset.find( # (22)!
["vbt.Portfolio", "vbt.PF"],
return_type="field"
).get()
found_code_matches = messages_asset.find( # (23)!
r"(?<!`)`([^`]*)`(?!`)",
mode="regex",
return_type="match",
).sort().get()
- Get all data items. The same as messages_asset.data.
- Get the value under a simple field from each data item. Here, get and sum the number of reactions.
- Get the value under a nested field from each data item. Also, if the existence of the field is not guaranteed, skip data items where the field is missing.
- The path to the value can also be expressed via the dot notation
- Post-process the value using a source expression. Here, strip the selected value ("content").
- The source can also be a function
- If no value was selected, the source expression/function is applied to the entire data item, where the data item is denoted by "x" and its fields are denoted by their names
- If the result needs to be processed further, use select instead of get. It accepts the same arguments.
- Select contents while removing None and empty strings
- Select attachments, merge them into a list, and extract
- Format the author and content into a string, and join the strings
- Return the content if @polakowo is in the mentions, else ignore. The expression here acts similarly to the source expression in get.
- Return True if @polakowo is in the mentions, else False. Without return_type="bool", it would have acted like a filter and returned the data items that satisfy the condition.
- Use JMESPath to extract all attachments
- You can even use Pandas as a query engine, where each field becomes a Series. Here, get the heading name of every data item that has "class" as object type, and sort.
- If the result needs to be processed further, use filter instead of query. It accepts the same arguments.
- Filter messages that belong to the support channel
- Find messages that mention @polakowo in any field
- Find messages that have @polakowo as author
- Find messages that don't have @polakowo as author
- Find messages that have @polakowo as author and mention from_signals in the content. If find_all was False, the conjunction would be "or".
- Find all fields that mention either vbt.Portfolio or vbt.PF
- Find all RegEx matches for code snippets with a single backtick
Tip
To make chained calls more readable, use one of the following two styles:
Code¶
There is a specialized method for finding code, either in single backticks or blocks.
found_code_blocks = messages_asset.find_code().get() # (1)!
found_code_blocks = messages_asset.find_code(language="python").get() # (2)!
found_code_blocks = messages_asset.find_code(language=True).get() # (3)!
found_code_blocks = messages_asset.find_code("from_signals").get() # (4)!
found_code_blocks = messages_asset.find_code("from_signals", in_blocks=False).get() # (5)!
found_code_blocks = messages_asset.find_code("from_signals", path="attachments").get() # (6)!
- Find any code blocks across all fields
- Find any Python code blocks across all fields
- Find code blocks annotated with any language across all fields
- Find code blocks that mention from_signals across all fields
- Find code that mentions from_signals across all fields
- Find code blocks that mention from_signals across attachments
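To see what the single-backtick pattern from the earlier find example matches, you can run it through the standard re module directly:

```python
import re

# A backtick that is neither preceded nor followed by another backtick
# delimits an inline code snippet (the pattern from the example above)
pattern = re.compile(r"(?<!`)`([^`]*)`(?!`)")

text = "Call `pf.stats()` and `pf.plot()` to inspect results."
print(pattern.findall(text))
```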
Links¶
Custom knowledge assets like pages and messages also have specialized methods for finding data items by their link. The default behavior is to match the target against the end of each link, so that searching for either "https://vectorbt.pro/become-a-member/" or "become-a-member/" will reliably return "https://vectorbt.pro/become-a-member/". In the "exact" and "end" modes, a variant with or without the trailing slash is added automatically, so that searching for "become-a-member" (without the slash) will still return "https://vectorbt.pro/become-a-member/". This also disregards the otherwise-matching link "https://vectorbt.pro/become-a-member/#become-a-member", as it belongs to the same page.
new_messages_asset = messages_asset.find_link( # (1)!
"https://discord.com/channels/918629562441695344/919715148896301067/923327319882485851"
)
new_messages_asset = messages_asset.find_link("919715148896301067/923327319882485851") # (2)!
new_pages_asset = pages_asset.find_page( # (3)!
"https://vectorbt.pro/pvt_xxxxxxxx/getting-started/installation/"
)
new_pages_asset = pages_asset.find_page("https://vectorbt.pro/pvt_7a467f6b/getting-started/installation/") # (4)!
new_pages_asset = pages_asset.find_page("installation/")
new_pages_asset = pages_asset.find_page("installation") # (5)!
new_pages_asset = pages_asset.find_page("installation", aggregate=True) # (6)!
- Find the message based on a Discord URL
- Find the message based on a suffix. Here, channel_id/message_id.
- Find the page based on a website URL
- Find the page based on a suffix
- Slash will be added automatically
- Find page but also aggregate it
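The suffix-and-slash matching described above can be sketched in pure Python. The `find_link` helper is illustrative, not VBT's implementation, and only handles the variant logic of the "exact" and "end" modes:

```python
def find_link(links, target, mode="end"):
    """Match a target against the end of each link, trying both the
    trailing-slash and no-slash variants (illustrative sketch only)."""
    variants = {target}
    if mode in ("exact", "end"):
        # Add the complementary slash variant automatically
        variants.add(target[:-1] if target.endswith("/") else target + "/")
    return [
        link for link in links
        if any(link == v or link.endswith("/" + v.lstrip("/")) for v in variants)
    ]

links = [
    "https://vectorbt.pro/become-a-member/",
    "https://vectorbt.pro/become-a-member/#become-a-member",
]
print(find_link(links, "become-a-member"))
```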
Objects¶
You can also find headings that correspond to VBT objects.
new_pages_asset = pages_asset.find_obj(vbt.Portfolio) # (1)!
new_pages_asset = pages_asset.find_obj(vbt.Portfolio, aggregate=True) # (2)!
new_pages_asset = pages_asset.find_obj(vbt.PF.from_signals, aggregate=True)
new_pages_asset = pages_asset.find_obj(vbt.pf_nb, aggregate=True)
new_pages_asset = pages_asset.find_obj("SignalContext", aggregate=True)
- Get the top-level heading corresponding to the class Portfolio
- Get the top-level heading corresponding to the class Portfolio with all sub-headings
Nodes¶
You can also traverse pages and messages similarly to nodes in a graph.
new_vbt_asset = vbt_asset.select_previous(link) # (1)!
new_vbt_asset = vbt_asset.select_next(link)
# ______________________________________________________________
new_pages_asset = pages_asset.select_parent(link) # (2)!
new_pages_asset = pages_asset.select_children(link)
new_pages_asset = pages_asset.select_siblings(link)
new_pages_asset = pages_asset.select_descendants(link)
new_pages_asset = pages_asset.select_branch(link)
new_pages_asset = pages_asset.select_ancestors(link)
new_pages_asset = pages_asset.select_parent_page(link)
new_pages_asset = pages_asset.select_descendant_headings(link)
# ______________________________________________________________
new_messages_asset = messages_asset.select_reference(link)
new_messages_asset = messages_asset.select_replies(link)
new_messages_asset = messages_asset.select_block(link) # (3)!
new_messages_asset = messages_asset.select_thread(link)
new_messages_asset = messages_asset.select_channel(link)
- This method and the ones below work on both pages and messages
- This method and the ones below do not include the link itself by default (use incl_link=True to enable)
- This method and the ones below include the link by default (use incl_link=False to disable)
Note
Each operation requires at least one full data pass; use sparingly.
Applying¶
"Find" and many other methods rely upon KnowledgeAsset.apply, which executes a function on each data item. These functions are so-called asset functions and consist of two parts: argument preparation and function calling. The main benefit is that arguments are prepared only once and then passed to each function call. The execution is done via the mighty execute function, which is capable of parallelization.
links = messages_asset.apply("get", "link") # (1)!
from vectorbtpro.utils.knowledge.base_asset_funcs import GetAssetFunc # (2)!
args, kwargs = GetAssetFunc.prepare("link")
links = [GetAssetFunc.call(d, *args, **kwargs) for d in messages_asset]
# ______________________________________________________________
links_asset = messages_asset.apply(lambda d: d["link"]) # (3)!
links = messages_asset.apply(lambda d: d["link"], wrap=False) # (4)!
json_asset = messages_asset.apply(vbt.dump, dump_engine="json") # (5)!
# ______________________________________________________________
new_asset = asset.apply( # (6)!
...,
execute_kwargs=dict(
n_chunks="auto",
distribute="chunks",
engine="processpool"
)
)
- Apply a built-in function. This operation is equivalent to get("link"). Whether to return a new asset instance or the raw output depends on the function.
- The same as above but manually
- Apply a custom function. By default, returns a new asset instance.
- Apply a custom function and return the raw output
- All functions that require a single argument can be used. Here, we serialize each message with JSON.
- Any operation can be distributed by specifying execution-related keyword arguments
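The prepare-once, call-many split can be sketched in plain Python. The `apply` helper below is illustrative, not VBT's implementation:

```python
def apply(items, func, *args, **kwargs):
    """Prepare arguments once, then call the function on each data
    item with those prepared arguments (illustrative sketch only)."""
    prepared_args, prepared_kwargs = args, dict(kwargs)  # prepared once
    return [func(d, *prepared_args, **prepared_kwargs) for d in items]

messages = [{"link": "a"}, {"link": "b"}]
print(apply(messages, lambda d, key: d[key], "link"))
```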
Pipelines¶
Most examples show how to execute a chain of standalone operations, but each operation passes through the data at least once. To pass through the data exactly once regardless of the number of operations, use asset pipelines. There are two kinds of asset pipelines: basic and complex. Basic pipelines take a list of tasks (i.e., functions and their arguments) and compose them into a single operation that takes a single data item; this composed operation is then applied to all data items. Complex pipelines take a Python expression in a functional-programming style, where one function receives a data item and returns a result that becomes the argument of another function.
tasks = [("find", ("@polakowo",), dict(return_type="match")), len, "get"] # (1)!
tasks = [vbt.Task("find", "@polakowo", return_type="match"), vbt.Task(len), vbt.Task("get")] # (2)!
mention_count = messages_asset.apply(tasks) # (3)!
asset_pipeline = vbt.BasicAssetPipeline(tasks) # (4)!
mention_count = [asset_pipeline(d) for d in messages_asset]
# ______________________________________________________________
expression = "get(len(find(d, '@polakowo', return_type='match')))"
mention_count = messages_asset.apply(expression) # (5)!
asset_pipeline = vbt.ComplexAssetPipeline(expression) # (6)!
mention_count = [asset_pipeline(d) for d in messages_asset]
- Tasks can be provided as strings, tuples, or functions
- They can also be provided as instances of Task
- Get the number of @polakowo mentions in each message by using a list of tasks
- The same as above but manually
- Get the number of @polakowo mentions in each message by using an expression
- The same as above but manually
Info
In both pipelines, arguments are prepared only once during initialization.
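The basic-pipeline idea, composing a list of tasks into one callable that each data item passes through exactly once, can be sketched as follows (illustrative, not VBT's implementation):

```python
import re

def basic_pipeline(tasks):
    """Compose (func, args, kwargs) tasks into a single callable that
    threads one data item through every task (illustrative sketch)."""
    def run(d):
        for func, args, kwargs in tasks:
            d = func(d, *args, **kwargs)
        return d
    return run

# Count @polakowo mentions per message: find matches, then take the length
tasks = [
    (lambda d, pat: re.findall(pat, d["content"]), ("@polakowo",), {}),
    (len, (), {}),
]
pipeline = basic_pipeline(tasks)
messages = [{"content": "@polakowo hi @polakowo"}, {"content": "hello"}]
print([pipeline(d) for d in messages])
```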
Reducing¶
Reducing means merging all data items into one. This requires a function that takes two data items. In the first iteration, these are the initializer (such as an empty dict) and the first data item; if no initializer is given, the first two data items are used. The result of each iteration is then passed as the first argument to the next. The execution is done by KnowledgeAsset.reduce and cannot be parallelized, since each iteration depends on the previous one.
all_attachments = messages_asset.select("attachments").reduce("merge_lists") # (1)!
from vectorbtpro.utils.knowledge.base_asset_funcs import MergeListsAssetFunc # (2)!
args, kwargs = MergeListsAssetFunc.prepare()
d1 = []
for d2 in messages_asset.select("attachments"):
d1 = MergeListsAssetFunc.call(d1, d2, *args, **kwargs)
all_attachments = d1
# ______________________________________________________________
total_reactions = messages_asset.select("reactions").reduce(lambda d1, d2: d1 + d2) # (3)!
- Apply a built-in function. This operation is equivalent to select("attachments").merge_lists(). Whether to return a new asset instance or the raw output depends on the function.
- The same as above but manually
- Apply a custom function. By default, returns the raw output.
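The reduction pattern described above maps directly onto functools.reduce from the standard library; for example, summing reactions with an initializer of 0:

```python
from functools import reduce

messages = [{"reactions": 2}, {"reactions": 0}, {"reactions": 5}]

# Start from the initializer 0 and fold in one item at a time; each
# iteration feeds its result into the next, hence no parallelization
total_reactions = reduce(lambda acc, d: acc + d["reactions"], messages, 0)
print(total_reactions)
```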
In addition, you can split a knowledge asset into groups and reduce each group. The iteration over groups is done by the execute function, which is capable of parallelization.
reactions_by_channel = messages_asset.groupby_reduce( # (1)!
lambda d1, d2: d1 + d2["reactions"],
by="channel",
initializer=0,
return_group_keys=True
)
# ______________________________________________________________
result = asset.groupby_reduce( # (2)!
...,
execute_kwargs=dict(
n_chunks="auto",
distribute="chunks",
engine="processpool"
)
)
- Get the total number of reactions per channel
- Any group-by operation can be distributed by specifying execution-related keyword arguments
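A pure-Python sketch of the group-then-reduce idea; the `groupby_reduce` helper below is illustrative, not VBT's implementation:

```python
from collections import defaultdict
from functools import reduce

def groupby_reduce(items, func, by, initializer):
    """Split items into groups by a field value, then reduce each
    group separately (illustrative sketch only)."""
    groups = defaultdict(list)
    for d in items:
        groups[d[by]].append(d)
    return {key: reduce(func, group, initializer) for key, group in groups.items()}

messages = [
    {"channel": "support", "reactions": 2},
    {"channel": "announcements", "reactions": 7},
    {"channel": "support", "reactions": 1},
]
print(groupby_reduce(messages, lambda acc, d: acc + d["reactions"], "channel", 0))
```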
Aggregating¶
Since headings are represented as individual data items, they can be aggregated back into their parent page. This is useful for formatting or displaying the page. Note that only headings can be aggregated; pages cannot be aggregated into other pages.
new_pages_asset = pages_asset.aggregate() # (1)!
new_pages_asset = pages_asset.aggregate(append_obj_type=False, append_github_link=False) # (2)!
- Aggregate headings into the content of the parent page
- The same as above but don't append object type and GitHub source to the API headings
Messages, on the other hand, can be aggregated across multiple levels: "message", "block", "thread", and "channel". Aggregation here simply means taking the messages that belong to the specified level, dumping them, and putting them into the content of a single, bigger message.
- The level "message" means that attachments are included in the content of the message.
- The level "block" puts together messages of the same author that reference the same block or don't reference anything at all. The link of the block is the link of the first message in the block.
- The level "thread" puts together messages that belong to the same channel and are connected through a chain of replies. The link of the thread is the link of the first message in the thread.
- The level "channel" puts together messages that belong to the same channel.
new_messages_asset = messages_asset.aggregate() # (1)!
new_messages_asset = messages_asset.aggregate(by="message") # (2)!
new_messages_asset = messages_asset.aggregate(by="block") # (3)!
new_messages_asset = messages_asset.aggregate(by="thread") # (4)!
new_messages_asset = messages_asset.aggregate(by="channel") # (5)!
new_messages_asset = messages_asset.aggregate(
...,
minimize_metadata=True # (6)!
)
new_messages_asset = messages_asset.aggregate(
...,
dump_metadata_kwargs=dict(dump_engine="nestedtext") # (7)!
)
- Aggregate into the content of the parent if a single parent exists; otherwise, an error is raised
- Aggregate attachments into the content of the parent message
- Aggregate messages into the content of the parent block
- Aggregate messages into the content of the parent thread
- Aggregate messages into the content of the parent channel
- Remove irrelevant keys from metadata before dumping
- When putting messages into the parent content, dump their metadata using the selected engine (here NestedText)
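The "thread" level can be sketched by following reply chains: each message joins the thread of the message it replies to. The sketch below is illustrative only; it ignores the same-channel constraint and assumes messages arrive in chronological order:

```python
def group_threads(messages):
    """Group messages connected through chains of replies. Each message
    is a dict with a unique "link" and an optional "reference" pointing
    at the link of the message it replies to (illustrative sketch)."""
    thread_of = {}  # message link -> link of the first message in its thread
    threads = {}    # thread link -> list of messages
    for msg in messages:  # assumes chronological order
        ref = msg.get("reference")
        # Inherit the root of the referenced message, else start a new thread
        root = thread_of.get(ref, msg["link"]) if ref else msg["link"]
        thread_of[msg["link"]] = root
        threads.setdefault(root, []).append(msg)
    return threads

messages = [
    {"link": "m1"},
    {"link": "m2", "reference": "m1"},
    {"link": "m3"},
    {"link": "m4", "reference": "m2"},
]
print(sorted(group_threads(messages)))
```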
Formatting¶
Most Python objects can be dumped (i.e., serialized) into strings.
new_asset = asset.dump() # (1)!
new_asset = asset.dump(dump_engine="nestedtext", indent=4) # (2)!
# ______________________________________________________________
print(asset.dump().join()) # (3)!
print(asset.dump().join(separator="\n\n---------------------\n\n")) # (4)!
print(asset.dump_all()) # (5)!
- Dump each data item. By default, dumps into YAML using Ruamel (if installed) or PyYAML.
- Dump each data item using a custom engine. Here, using NestedText with indentation of 4 spaces.
- Join all dumped data items. Chooses the separator automatically.
- Join all dumped data items with a custom separator
- Dump the entire list as a single object
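The dump-and-join pattern can be sketched with stdlib JSON; VBT dumps to YAML by default, so this is purely an illustration of the mechanism:

```python
import json

items = [{"author": "abc", "content": "hi"}, {"author": "xyz", "content": "bye"}]

# Dump each data item into a string, then join with a visible separator,
# mirroring asset.dump().join(separator=...)
dumped = [json.dumps(d, indent=2) for d in items]
print("\n\n---------------------\n\n".join(dumped))
```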
Custom knowledge assets like pages and messages can be converted to, and optionally saved in, Markdown or HTML format. Only the field "content" will be converted, while the other fields build the metadata block displayed at the beginning of each file.
Note
Without aggregation, each page heading will become a separate file.
new_pages_asset = pages_asset.to_markdown() # (1)!
new_pages_asset = pages_asset.to_markdown(root_metadata_key="pages") # (2)!
new_pages_asset = pages_asset.to_markdown(clear_metadata=False) # (3)!
new_pages_asset = pages_asset.to_markdown(remove_code_title=False, even_indentation=False) # (4)!
dir_path = pages_asset.save_to_markdown() # (5)!
dir_path = pages_asset.save_to_markdown(cache_dir="markdown") # (6)!
dir_path = pages_asset.save_to_markdown(clear_cache=True) # (7)!
dir_path = pages_asset.save_to_markdown(cache=False) # (8)!
# (9)!
# ______________________________________________________________
new_pages_asset = pages_asset.to_html() # (10)!
new_pages_asset = pages_asset.to_html(to_markdown_kwargs=dict(root_metadata_key="pages")) # (11)!
new_pages_asset = pages_asset.to_html(make_links=False) # (12)!
new_pages_asset = pages_asset.to_html(extensions=[], use_pygments=False) # (13)!
extensions = vbt.settings.get("knowledge.formatting.markdown_kwargs.extensions")
new_pages_asset = pages_asset.to_html(extensions=extensions + ["pymdownx.smartsymbols"]) # (14)!
extensions = vbt.settings.get("knowledge.formatting.markdown_kwargs.extensions")
extensions.append("pymdownx.smartsymbols") # (15)!
extension_configs = vbt.settings.get("knowledge.formatting.markdown_kwargs.extension_configs")
extension_configs["pymdownx.superfences"]["preserve_tabs"] = False # (16)!
new_pages_asset = pages_asset.to_html(format_html_kwargs=dict(pygments_kwargs=dict(style="dracula"))) # (17)!
vbt.settings.set("knowledge.formatting.pygments_kwargs.style", "dracula") # (18)!
style_extras = vbt.settings.get("knowledge.formatting.style_extras")
style_extras.append("""
.admonition.success {
background-color: #00c8531a;
border-left-color: #00c853;
}
""") # (19)!
head_extras = vbt.settings.get("knowledge.formatting.head_extras")
head_extras.append('<link ...>') # (20)!
body_extras = vbt.settings.get("knowledge.formatting.body_extras")
body_extras.append('<script>...</script>') # (21)!
vbt.settings.get("knowledge.formatting").reset() # (22)!
dir_path = pages_asset.save_to_html() # (23)!
# (24)!
- Convert each data item to a Markdown-formatted string. Returns a new asset instance with a list of strings.
- Prepend a root key called "pages" (or use "messages" for messages) to the metadata block
- Keep empty fields in the metadata block. By default, they will be removed.
- Keep the content as is, without removing code titles and fixing uneven indentation
- Save each data item to a Markdown-formatted file. Returns the path to the parent directory. By default, saves to ./knowledge/vbt/$release_name/pages/markdown/ for pages and ./knowledge/vbt/$release_name/messages/markdown/ for messages.
- Specify a different cache directory
- Clear the cache directory before saving. By default, a data item is skipped if its file already exists.
- Create a temporary directory
- Method save_to_markdown accepts the same arguments as to_markdown
- Convert each data item to an HTML-formatted string. Returns a new asset instance with a list of strings.
- The same options as for the method to_markdown can be provided via to_markdown_kwargs
- If any URL is detected in text, don't make it a link. By default, makes all URLs clickable.
- Disable all Markdown extensions and don't use Pygments for highlighting
- Add a new extension to the preset list of Markdown extensions (locally)
- Add a new extension to the preset list of Markdown extensions (globally)
- Change the config of a Markdown extension
- Change the default Pygments highlighting style (locally)
- Change the default Pygments highlighting style (globally)
- Add extra CSS at the end of the <style> element
- Add extra HTML such as links at the end of the <head> element
- Add extra HTML such as JavaScript at the end of the <body> element
- Reset formatting
- Save each data item to an HTML-formatted file. Returns the path to the parent directory. By default, saves to ./knowledge/vbt/$release_name/pages/html/ for pages and ./knowledge/vbt/$release_name/messages/html/ for messages.
- Method save_to_html accepts the same arguments as to_html and save_to_markdown
Browsing¶
Pages and messages can be displayed and browsed via static HTML files. When a single item is displayed, VBT creates a temporary HTML file and opens it in the default browser; all links in this file remain external. When multiple items are displayed, VBT creates a single HTML file where the items are embedded as iframes that can be paged through.
file_path = pages_asset.display() # (1)!
file_path = pages_asset.display(link="documentation/fundamentals") # (2)!
file_path = pages_asset.display(link="documentation/fundamentals", aggregate=True) # (3)!
# ______________________________________________________________
file_path = messages_asset.display() # (4)!
file_path = messages_asset.display(link="919715148896301067/923327319882485851") # (5)!
file_path = messages_asset.filter("channel == 'announcements'").display() # (6)!
- Display one or more pages
- Choose a page and display
- If the asset isn't aggregated, choose a page, find its headings, merge them into one page, and display
- Display one or more messages
- Choose a message and display
- Select messages that belong to the selected channel, aggregate them, and display
When one or more pages (and/or headings) should be browsed like a website, VBT can convert all data items to HTML and replace all external links with internal ones, so that you can jump from one page to another locally. But which page is displayed first? Pages and headings build a directed graph. If there is one page from which all other pages are reachable, it is displayed first. If there are multiple such pages, VBT creates an index page with metadata blocks from which you can access the other pages (unless you specify entry_link).
dir_path = pages_asset.browse() # (1)!
dir_path = pages_asset.browse(aggregate=True) # (2)!
dir_path = pages_asset.browse(entry_link="documentation/fundamentals", aggregate=True) # (3)!
dir_path = pages_asset.browse(entry_link="documentation", descendants_only=True, aggregate=True) # (4)!
dir_path = pages_asset.browse(cache_dir="website") # (5)!
dir_path = pages_asset.browse(clear_cache=True) # (6)!
dir_path = pages_asset.browse(cache=False) # (7)!
# ______________________________________________________________
dir_path = messages_asset.browse() # (8)!
dir_path = messages_asset.browse(entry_link="919715148896301067/923327319882485851") # (9)!
# (10)!
- Generate an HTML file for every page and heading (unless already aggregated)
- Aggregate headings into pages and generate an HTML file for every page
- Generate an HTML file for every page and start browsing from the selected page
- Generate an HTML file for every page that is a descendant of the selected page
- Specify a different cache directory. By default, stores data under ./knowledge/vbt/$release_name/pages/html/ for pages and ./knowledge/vbt/$release_name/messages/html/ for messages.
- Clear the cache directory before saving. By default, a page or heading is skipped if its file already exists.
- Create a temporary directory
- Generate an HTML file for every message (note that there are a lot of messages!)
- In addition to the above, choose a message to open in the default browser
- Messages also take the same caching-related arguments as pages
Combining¶
Assets can be easily combined. When the target class is not specified, their common superclass is used. For example, combining PagesAsset and MessagesAsset will yield an instance of VBTAsset, which is based on overlapping features of both assets, such as "link" and "content" fields.
vbt_asset = pages_asset + messages_asset # (1)!
vbt_asset = pages_asset.combine(messages_asset) # (2)!
vbt_asset = vbt.VBTAsset.combine(pages_asset, messages_asset) # (3)!
- Concatenate both lists and wrap by the common superclass
- Concatenate both lists and wrap by the first class
- Concatenate both lists and wrap by the selected class
If both assets have the same number of data items, you can also merge them on the data item level. This works even for complex containers like nested dictionaries and lists by flattening their nested structures into flat dicts, merging them, and then unflattening them back into the original container type.
new_asset = asset1.merge(asset2) # (1)!
new_asset = vbt.KnowledgeAsset.merge(asset1, asset2) # (2)!
- Merge both lists and wrap by the first class
- Merge both lists and wrap by the selected class
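The flatten-merge-unflatten approach can be illustrated with a minimal sketch (dict containers only for brevity; VBT's actual implementation also restores lists and other container types):

```python
def flatten(obj, prefix=()):
    """Flatten nested dicts/lists into a flat dict keyed by path tuples."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        flat.update(flatten(value, prefix + (key,)))
    return flat

def unflatten(flat):
    """Rebuild a nested dict from path-keyed entries (dicts only, for brevity)."""
    nested = {}
    for path, value in flat.items():
        node = nested
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return nested

a = {"link": "x", "meta": {"tags": "a"}}
b = {"content": "y", "meta": {"author": "z"}}
merged = unflatten({**flatten(a), **flatten(b)})
# {'link': 'x', 'meta': {'tags': 'a', 'author': 'z'}, 'content': 'y'}
```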
You can also merge data items of a single asset into a single data item.
new_asset = asset.merge() # (1)!
new_asset = asset.merge_dicts() # (2)!
new_asset = asset.merge_lists() # (3)!
- Calls either merge_dicts or merge_lists depending on the data type
- (Deep-)merge all dicts into one
- Concatenate all lists into one
Searching¶
For objects¶
There are 4 methods to search for an arbitrary VBT object in pages and messages. The first method searches for the API documentation of the object, the second method searches for object mentions in the non-API (human-readable) documentation, the third method searches for object mentions in Discord messages, and the last method searches for object mentions in the code of both pages and messages.
api_asset = vbt.find_api(vbt.PFO) # (1)!
api_asset = vbt.find_api(vbt.PFO, incl_bases=False, incl_ancestors=False) # (2)!
api_asset = vbt.find_api(vbt.PFO, use_parent=True) # (3)!
api_asset = vbt.find_api(vbt.PFO, use_refs=True) # (4)!
api_asset = vbt.find_api(vbt.PFO.row_stack) # (5)!
api_asset = vbt.find_api(vbt.PFO.from_uniform, incl_refs=False) # (6)!
api_asset = vbt.find_api([vbt.PFO.from_allocate_func, vbt.PFO.from_optimize_func]) # (7)!
# ______________________________________________________________
api_asset = vbt.PFO.find_api() # (8)!
api_asset = vbt.PFO.find_api(attr="from_optimize_func")
- Find the (aggregated) API page for PortfolioOptimizer. Includes the (aggregated) base classes, such as Configured, as well as the (non-aggregated) parent modules, such as portfolio.pfopt.base.
- Don't include base classes and parent modules, only this class
- Uses the (aggregated) parent of this class instead. Here, portfolio.pfopt.base.
- Include the (non-aggregated) pages of the objects that this object references
- Find the API page for PortfolioOptimizer.row_stack. Includes the (aggregated) base methods, such as Wrapping.row_stack, the (non-aggregated) parent objects, such as PortfolioOptimizer, and the (non-aggregated) references.
- Don't include references
- Search for multiple objects
- Make a call directly from the object's interface
docs_asset = vbt.find_docs(vbt.PFO) # (1)!
docs_asset = vbt.find_docs(vbt.PFO, incl_shortcuts=False, incl_instances=False) # (2)!
docs_asset = vbt.find_docs(vbt.PFO, incl_custom=["pf_opt"]) # (3)!
docs_asset = vbt.find_docs(vbt.PFO, incl_custom=[r"pf_opt\s*=\s*.+"], is_custom_regex=True) # (4)!
docs_asset = vbt.find_docs(vbt.PFO, as_code=True) # (5)!
docs_asset = vbt.find_docs([vbt.PFO.from_allocate_func, vbt.PFO.from_optimize_func]) # (6)!
docs_asset = vbt.find_docs(vbt.PFO, up_aggregate_th=0) # (7)!
docs_asset = vbt.find_docs(vbt.PFO, up_aggregate_pages=True) # (8)!
docs_asset = vbt.find_docs(vbt.PFO, incl_pages=["documentation", "tutorials"]) # (9)!
docs_asset = vbt.find_docs(vbt.PFO, incl_pages=[r"(features|cookbook)"], page_find_mode="regex") # (10)!
docs_asset = vbt.find_docs(vbt.PFO, excl_pages=["release-notes"]) # (11)!
# ______________________________________________________________
docs_asset = vbt.PFO.find_docs() # (12)!
docs_asset = vbt.PFO.find_docs(attr="from_optimize_func")
- Find the mentions of PortfolioOptimizer across all non-API pages. Searches for full reference names, shortcuts (such as vbt.PFO), imports (such as from ... import PFO), typical instance names (such as pfo =), and access/call notations (such as PFO.).
- Include only full reference names and imports
- Include custom literal mentions
- Include custom regex mentions
- Find the mentions only as code
- Search for multiple objects
- By default, if the method matches 2/3 of all the headings that share the same parent, it takes the (aggregated) parent instead. Here, take the entire page if any mention is found.
- Similarly, if the method matches 2/3 of all the pages that share the same parent page, it includes the parent page as well
- Include only the links from the documentation and tutorials (the targets are substrings)
- Include only the links from the features and cookbook (the target is a regex)
- Exclude the links from the release notes (the target is a substring)
- Make a call directly from the object's interface
messages_asset = vbt.find_messages(vbt.PFO) # (1)!
# ______________________________________________________________
messages_asset = vbt.PFO.find_messages() # (2)!
messages_asset = vbt.PFO.find_messages(attr="from_optimize_func")
- The same as above but finds mentions in messages. Accepts the same arguments related to targets (first block in find_docs recipes) and no arguments related to pages (second block in find_docs recipes).
- Make a call directly from the object's interface
examples_asset = vbt.find_examples(vbt.PFO) # (1)!
# ______________________________________________________________
examples_asset = vbt.PFO.find_examples() # (2)!
examples_asset = vbt.PFO.find_examples(attr="from_optimize_func")
- The same as above but finds code mentions in both pages and messages
- Make a call directly from the object's interface
The first three methods are guaranteed to return non-overlapping results, while the last method may return examples that also appear in the results of the first three. There is thus another method that, by default, calls the first three methods and combines their results into a single asset. This way, we can gather all relevant knowledge about a VBT object.
combined_asset = vbt.find_assets(vbt.Trades) # (1)!
combined_asset = vbt.find_assets(vbt.Trades, asset_names=["api", "docs"]) # (2)!
combined_asset = vbt.find_assets(vbt.Trades, asset_names=["messages", ...]) # (3)!
combined_asset = vbt.find_assets(vbt.Trades, asset_names="all") # (4)!
combined_asset = vbt.find_assets( # (5)!
vbt.Trades,
api_kwargs=dict(incl_ancestors=False),
docs_kwargs=dict(as_code=True),
messages_kwargs=dict(as_code=True),
)
combined_asset = vbt.find_assets(vbt.Trades, minimize=False) # (6)!
asset_list = vbt.find_assets(vbt.Trades, combine=False) # (7)!
combined_asset = vbt.find_assets([vbt.EntryTrades, vbt.ExitTrades]) # (8)!
# ______________________________________________________________
combined_asset = vbt.find_assets("SQL", resolve=False) # (9)!
combined_asset = vbt.find_assets(["SQL", "database"], resolve=False) # (10)!
# ______________________________________________________________
messages_asset = vbt.Trades.find_assets() # (11)!
messages_asset = vbt.Trades.find_assets(attr="plot")
messages_asset = pf.trades.find_assets(attr="expectancy")
- Combine assets for Trades
- Use only the API asset and documentation asset
- Put the messages asset first and the other assets in their usual order after it by using ... (Ellipsis)
- Use all assets, including the examples asset
- Provide asset-specific keyword arguments
- Disable minimization. Keeps all fields but takes more context space.
- Disable combining. Returns a dictionary of assets by their name.
- Search for multiple objects
- Search for arbitrary keywords (not actual VBT objects)
- Search for multiple keywords
- Make a call directly from the object's interface
vbt.Trades.find_assets().select("link").print() # (1)!
file_path = vbt.Trades.find_assets( # (2)!
asset_names="docs",
docs_kwargs=dict(excl_pages="release-notes")
).display()
dir_path = vbt.Trades.find_assets( # (3)!
asset_names="docs",
docs_kwargs=dict(excl_pages="release-notes")
).browse(cache=False)
- Print all found links
- Browse all found documentation (but not release notes) as a single HTML file
- Browse all found documentation (but not release notes) as multiple HTML files
Globally¶
Not only can we search for knowledge related to an individual VBT object, we can also search for any VBT items that match a query in natural language. This works by embedding the query and the data items, computing their pairwise similarity scores, and sorting the data items by their mean score in descending order. Since the result contains all the data items from the original set, only in a different order, it's advised to select the top-k results before displaying.
All the methods discussed in objects work on queries too!
api_asset = vbt.find_api("How to rebalance weekly?", top_k=20)
docs_asset = vbt.find_docs("How to hedge a position?", top_k=20)
messages_asset = vbt.find_messages("How to trade live?", top_k=20)
combined_asset = vbt.find_assets("How to create a custom data class?", top_k=20)
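Conceptually, the ranking boils down to the following sketch, with toy two-dimensional embeddings standing in for the ones produced by an embeddings provider (one embedding per item here; VBT aggregates scores over chunks):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy embeddings; in practice these come from an embeddings provider
query_emb = [1.0, 0.0]
doc_embs = {"doc_a": [0.9, 0.1], "doc_b": [0.1, 0.9], "doc_c": [0.7, 0.7]}

# Sort documents by similarity to the query, in descending order
ranked = sorted(doc_embs, key=lambda d: cosine(query_emb, doc_embs[d]), reverse=True)
top_docs = ranked[:2]  # select top-k before displaying
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```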
There also exists a specialized search function that calls find_assets, caches the documents (so that the next search call becomes an order of magnitude faster), and displays the top results as a static HTML page.
Info
The first time you run this command, it may take up to 15 minutes to prepare and embed documents. However, most of the preparation steps are cached and stored, so future searches will be significantly faster without needing to repeat the process.
file_path = vbt.search("How to turn df into data?") # (1)!
found_asset = vbt.search("How to turn df into data?", display=False) # (2)!
file_path = vbt.search("How to turn df into data?", display_kwargs=dict(open_browser=False)) # (3)!
file_path = vbt.search("How to fix 'Symbols have mismatching columns'?", asset_names="messages") # (4)!
file_path = vbt.search("How to use templates in signal_func_nb?", asset_names="examples", display=100) # (5)!
- Search API pages, documentation pages, and Discord messages for a query, and display the 20 most relevant results
- Return the results instead of displaying them
- Don't automatically open the HTML file in the browser
- Search Discord messages only
- Search for and display the 100 most relevant code examples
Chatting¶
Knowledge assets can be used as a context in chatting with LLMs. The method responsible for chatting is Contextable.chat, which dumps the asset instance, packs it together with your question and chat history into messages, sends them to the LLM service, and displays and persists the response. The response can be displayed in a variety of formats, including raw text, Markdown, and HTML. All three formats support streaming. This method also supports multiple LLM APIs, including OpenAI, LiteLLM, and LLamaIndex.
env["OPENAI_API_KEY"] = "<OPENAI_API_KEY>" # (1)!
# ______________________________________________________________
patterns_tutorial = pages_asset.find_page( # (2)!
"https://vectorbt.pro/pvt_xxxxxxxx/tutorials/patterns-and-projections/patterns/",
aggregate=True
)
patterns_tutorial.chat("How to detect a pattern?")
data_documentation = pages_asset.select_branch("documentation/data").aggregate() # (3)!
data_documentation.chat("How to convert DataFrame into vbt.Data?")
pfo_api = pages_asset.find_obj(vbt.PFO, aggregate=True) # (4)!
pfo_api.chat("How to rebalance weekly?")
combined_asset = pages_asset + messages_asset
signal_func_nb_code = combined_asset.find_code("signal_func_nb") # (5)!
signal_func_nb_code.chat("How to pass an array to signal_func_nb?")
polakowo_messages = messages_asset.filter("author == '@polakowo'").minimize().shuffle()
polakowo_messages.chat("Describe the author of these messages", max_tokens=10_000) # (6)!
mention_fields = combined_asset.find(
"parameterize",
mode="substring",
return_type="field",
merge_fields=False
)
mention_counts = combined_asset.find(
"parameterize",
mode="substring",
return_type="match",
merge_matches=False
).apply(len)
sorted_fields = mention_fields.sort(keys=mention_counts, reverse=True).merge()
sorted_fields.chat("How to parameterize a function?") # (7)!
vbt.settings.set("knowledge.chat.max_tokens", None) # (8)!
# ______________________________________________________________
chat_history = []
signal_func_nb_code.chat("How to check if we're in a long position?", chat_history) # (9)!
signal_func_nb_code.chat("How about short one?", chat_history) # (10)!
chat_history.clear() # (11)!
signal_func_nb_code.chat("How to access close price?", chat_history)
# ______________________________________________________________
asset.chat(..., completions="openai", model="o1-mini", system_as_user=True) # (12)!
# (13)!
# vbt.settings.set("knowledge.chat.completions_configs.openai.model", "o1-mini")
# (14)!
# vbt.OpenAICompletions.set_settings({"model": "o1-mini"})
env["OPENAI_API_KEY"] = "<YOUR_OPENROUTER_API_KEY>"
asset.chat(..., completions="openai", base_url="https://openrouter.ai/api/v1", model="openai/gpt-4o")
# vbt.settings.set("knowledge.chat.completions_configs.openai.base_url", "https://openrouter.ai/api/v1")
# vbt.settings.set("knowledge.chat.completions_configs.openai.model", "openai/gpt-4o")
# vbt.OpenAICompletions.set_settings({
# "base_url": "https://openrouter.ai/api/v1",
# "model": "openai/gpt-4o"
# })
env["DEEPSEEK_API_KEY"] = "<YOUR_DEEPSEEK_API_KEY>"
asset.chat(..., completions="litellm", model="deepseek/deepseek-coder")
# vbt.settings.set("knowledge.chat.completions_configs.litellm.model", "deepseek/deepseek-coder")
# vbt.LiteLLMCompletions.set_settings({"model": "deepseek/deepseek-coder"})
asset.chat(..., completions="llama_index", llm="perplexity", model="claude-3-5-sonnet-20240620") # (15)!
# vbt.settings.set("knowledge.chat.completions_configs.llama_index.llm", "anthropic")
# anthropic_config = {"model": "claude-3-5-sonnet-20240620"}
# vbt.settings.set("knowledge.chat.completions_configs.llama_index.anthropic", anthropic_config)
# vbt.LlamaIndexCompletions.set_settings({"llm": "anthropic", "anthropic": anthropic_config})
vbt.settings.set("knowledge.chat.completions", "litellm") # (16)!
# ______________________________________________________________
asset.chat(..., stream=False) # (17)!
asset.chat(..., formatter="plain") # (18)!
asset.chat(..., formatter="ipython_markdown") # (19)!
asset.chat(..., formatter="ipython_html") # (20)!
file_path = asset.chat(..., formatter="html") # (21)!
file_path = asset.chat(..., formatter="html", formatter_kwargs=dict(cache_dir="chat")) # (22)!
file_path = asset.chat(..., formatter="html", formatter_kwargs=dict(clear_cache=True)) # (23)!
file_path = asset.chat(..., formatter="html", formatter_kwargs=dict(cache=False)) # (24)!
file_path = asset.chat( # (25)!
...,
formatter="html",
formatter_kwargs=dict(
to_markdown_kwargs=dict(...),
to_html_kwargs=dict(...),
format_html_kwargs=dict(...)
)
)
asset.chat(..., formatter_kwargs=dict(update_interval=1.0)) # (26)!
asset.chat(..., formatter_kwargs=dict(output_to="response.txt")) # (27)!
asset.chat( # (28)!
...,
system_prompt="You are a helpful assistant",
context_prompt="Here's what you need to know: $context"
)
- Setting the API key as an environment variable makes it available to all packages. Another way is to pass api_key directly or save it to the settings, similarly to model below.
- Paste a link from the website (here, the tutorial on patterns) and use it as a context. If you don't know the private hash, you can paste the suffix - see querying.
- Select and aggregate all documentation pages related to data and use them as a context
- Select the API documentation page related to the portfolio optimizer and use it as a context
- Find all mentions of signal_func_nb across the code of all pages and messages and use them as a context
- Filter all messages by @polakowo and keep only relevant fields to fit more data. By default, the context is trimmed to 120,000 tokens (this should depend on the model; GPT-4o has an allowance of 128,000). Content is shuffled to avoid putting more emphasis on older or newer content when trimming the context.
- Get all fields with at least one "parameterize" mention, sort them by the number of mentions in descending order, and merge into one list. When embeddings are unavailable, this is a common workflow when there's too much data.
- Allow for unlimited context
- Append the question and answer to the chat history
- Re-use the chat history to post the question to the current chat
- Clear the chat history to start a new chat
- Specify the model and other client and chat-related arguments for OpenAI API
- Save the model to the settings to use it by default next time
- The same as above
- You can define configs by LLM when using LLamaIndex
- Set the default class for completions
- Disable streaming. By default, streaming is enabled.
- Print the response
- Display the response in Markdown format (requires iPython environment)
- Display the response in HTML format (requires iPython environment)
- Store the response in a static HTML file and display
- Specify a different cache directory. By default, stores data under ./knowledge/vbt/$release_name/chat.
- Clear the cache directory before saving
- Create a temporary file
- When working with HTML, you can provide many arguments that are accepted by VBTAsset.to_html
- When update interval is used, the streaming data is buffered and released periodically. Note that when displaying HTML, the minimum update interval is 1 second.
- In addition to displaying, the raw response can be forwarded to a file
- Customize the system and context prompts
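To make the mechanics concrete, here is a simplified sketch of how a chat method might pack the dumped context, the chat history, and the question into messages; the exact prompts and payload used by VBT may differ:

```python
from string import Template

def pack_messages(context, question, chat_history=(),
                  system_prompt="You are a helpful assistant",
                  context_prompt="Here's what you need to know: $context"):
    """Pack the dumped asset, prior turns, and the question into chat messages."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.append({
        "role": "user",
        "content": Template(context_prompt).substitute(context=context),
    })
    messages.extend(chat_history)  # prior user/assistant turns
    messages.append({"role": "user", "content": question})
    return messages

history = [
    {"role": "user", "content": "How to check if we're in a long position?"},
    {"role": "assistant", "content": "..."},
]
messages = pack_messages("<dumped asset>", "How about short one?", history)
```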
About objects¶
We can chat about a VBT object using chat_about. Under the hood, it calls the method above, but on code examples only. Passed arguments are automatically distributed between find_assets and KnowledgeAsset.chat (see chatting for recipes).
vbt.chat_about(vbt.Portfolio, "How to get trading expectancy?") # (1)!
vbt.chat_about( # (2)!
vbt.Portfolio,
"How to get returns accessor with log returns?",
asset_names="api",
api_kwargs=dict(incl_bases=False, incl_ancestors=False)
)
vbt.chat_about( # (3)!
vbt.Portfolio,
"How to backtest a basic strategy?",
model="o1-mini",
system_as_user=True,
max_tokens=100_000,
shuffle=True
)
# ______________________________________________________________
vbt.Portfolio.chat("How to create portfolio from order records?") # (4)!
vbt.Portfolio.chat("How to get grouped stats?", attr="stats")
- Find knowledge about Portfolio and ask a question by using this knowledge as a context
- Pass knowledge-related arguments. Here, find API knowledge that only contains the Portfolio class.
- Pass chat-related arguments. Here, use the o1-mini model on a shuffled context with a maximum number of tokens of 100k. Also, use the user role instead of the system role for the initial instruction.
- Make a call directly from the object's interface
You can also ask a question about objects that technically do not exist in VBT, or keywords in general, such as "quantstats", which will search for mentions of "quantstats" in pages and messages.
vbt.chat_about(
"sql",
"How to import data from a SQL database?",
resolve=False, # (1)!
find_kwargs=dict(
ignore_case=True,
allow_prefix=True, # (2)!
allow_suffix=True # (3)!
)
)
- Use this to avoid searching for a VBT object with the same name
- Allows prefixes for "sql", such as "from_sql"
- Allows suffixes for "sql", such as "SQLData"
Globally¶
Similarly to the global search function, there is also a global function for chatting - chat. It manipulates documents in the same way, but instead of displaying, it sends them to an LLM for completion.
Info
The first time you run this command, it may take up to 15 minutes to prepare and embed documents. However, most of the preparation steps are cached and stored, so future searches will be significantly faster without needing to repeat the process.
vbt.chat("How to turn df into data?") # (1)!
file_path = vbt.chat("How to turn df into data?", formatter="html") # (2)!
vbt.chat("How to fix 'Symbols have mismatching columns'?", asset_names="messages") # (3)!
vbt.chat(
"How to use templates in signal_func_nb?",
asset_names="examples",
top_k=None,
cutoff=None,
return_chunks=False
) # (4)!
chat_history = []
vbt.chat("How to turn df into data?", chat_history) # (5)!
vbt.chat("What if I have symbols as columns?", chat_history) # (6)!
vbt.chat("How to replace index of data?", chat_history, incl_past_queries=False) # (7)!
_, chat = vbt.chat("How to turn df into data?", return_chat=True) # (8)!
chat.complete("What if I have symbols as columns?")
- Search API pages, documentation pages, and Discord messages for a query, and chat about them. By default, uses a maximum of 100 document chunks.
- Accepts the same arguments as
asset.chat - Chat about Discord messages only
- Use the entire (ranked) asset as a context
- Re-use the chat history to post the question to the current chat
- Take into account the chat history. Considers all user messages when ranking the context.
- Take into account the chat history. Considers only the query when ranking the context.
- The same as above, but the context is fixed for all subsequent queries
RAG¶
VBT deploys a collection of components for vanilla RAG. Most of them are orchestrated and deployed automatically whenever you globally search for knowledge on VBT or chat about VBT.
Tokenizer¶
The Tokenizer class and its subclasses offer an interface for converting text into tokens.
tokenizer = vbt.TikTokenizer() # (1)!
tokenizer = vbt.TikTokenizer(encoding="o200k_base")
tokenizer = vbt.TikTokenizer(model="gpt-4o")
vbt.TikTokenizer.set_settings(encoding="o200k_base") # (2)!
token_count = tokenizer.count_tokens(text) # (3)!
tokens = tokenizer.encode(text)
text = tokenizer.decode(tokens)
# ______________________________________________________________
tokens = vbt.tokenize(text) # (4)!
text = vbt.detokenize(tokens)
tokens = vbt.tokenize(text, tokenizer="tiktoken", model="gpt-4o") # (5)!
- Use the tiktoken package as a tokenizer
- Set default settings
- Use the tokenizer to count tokens in a text, encode text into tokens, or decode tokens back into text
- There are also shortcut methods that construct a tokenizer for you
- Tokenizer type as well as parameters can be passed as keyword arguments to both methods
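The interface can be illustrated with a toy word-level tokenizer; real tokenizers such as tiktoken operate on subword units and fixed vocabularies:

```python
class ToyTokenizer:
    """Word-level stand-in for the Tokenizer interface."""

    def __init__(self):
        self.vocab, self.inverse = {}, {}

    def encode(self, text):
        """Encode text into a list of integer tokens."""
        tokens = []
        for word in text.split():
            if word not in self.vocab:
                self.vocab[word] = len(self.vocab)
                self.inverse[self.vocab[word]] = word
            tokens.append(self.vocab[word])
        return tokens

    def decode(self, tokens):
        """Decode tokens back into text."""
        return " ".join(self.inverse[t] for t in tokens)

    def count_tokens(self, text):
        """Count tokens in a text."""
        return len(self.encode(text))

tok = ToyTokenizer()
tokens = tok.encode("to be or not to be")
print(tok.decode(tokens))  # to be or not to be
print(tok.count_tokens("to be or not to be"))  # 6
```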
Embeddings¶
The Embeddings class and its subclasses offer an interface for generating vector representations of text.
embeddings = vbt.OpenAIEmbeddings() # (1)!
embeddings = vbt.OpenAIEmbeddings(batch_size=256) # (2)!
embeddings = vbt.OpenAIEmbeddings(model="text-embedding-3-large") # (3)!
embeddings = vbt.LiteLLMEmbeddings(model="openai/text-embedding-3-large") # (4)!
embeddings = vbt.LlamaIndexEmbeddings(embedding="openai", model="text-embedding-3-large") # (5)!
embeddings = vbt.LlamaIndexEmbeddings(embedding="huggingface", model_name="BAAI/bge-small-en-v1.5")
vbt.OpenAIEmbeddings.set_settings(model="text-embedding-3-large") # (6)!
emb = embeddings.get_embedding(text) # (7)!
embs = embeddings.get_embeddings(texts)
# ______________________________________________________________
emb = vbt.embed(text) # (8)!
embs = vbt.embed(texts)
emb = vbt.embed(text, embeddings="openai", model="text-embedding-3-large") # (9)!
- Use the openai package as an embeddings provider
- Process embeddings in batches of 256
- Specify the model
- Use the litellm package as an embeddings provider
- Use the llamaindex package as an embeddings provider
- Set default settings
- Use the embeddings provider to get embeddings for one or more texts
- There is also a shortcut method that constructs an embeddings provider for you. It accepts one or more texts.
- Embeddings provider type as well as parameters can be passed as keyword arguments to the method
Completions¶
The Completions class and its subclasses offer an interface for generating text completions based on user queries. For arguments such as formatter, see chatting.
completions = vbt.OpenAICompletions() # (1)!
completions = vbt.OpenAICompletions(stream=False)
completions = vbt.OpenAICompletions(max_tokens=100_000, tokenizer="tiktoken")
completions = vbt.OpenAICompletions(model="o1-mini", system_as_user=True)
completions = vbt.OpenAICompletions(formatter="html", formatter_kwargs=dict(cache=False))
completions = vbt.LiteLLMCompletions(model="openai/o1-mini", system_as_user=True) # (2)!
completions = vbt.LlamaIndexCompletions(llm="openai", model="o1-mini", system_as_user=True) # (3)!
vbt.OpenAICompletions.set_settings(model="o1-mini", system_as_user=True) # (4)!
completions.get_completion(text) # (5)!
# ______________________________________________________________
vbt.complete(text) # (6)!
vbt.complete(text, completions="openai", model="o1-mini", system_as_user=True) # (7)!
- Use the openai package as a completions provider
- Use the litellm package as a completions provider
- Use the llamaindex package as a completions provider
- Set default settings
- Use the completions provider to get a completion for a text
- There is also a shortcut method that constructs a completions provider for you
- Completions provider type as well as parameters can be passed as keyword arguments to the method
Text splitter¶
The TextSplitter class and its subclasses offer an interface for splitting text.
text_splitter = vbt.TokenSplitter() # (1)!
text_splitter = vbt.TokenSplitter(chunk_size=1000, chunk_overlap=200)
text_splitter = vbt.SegmentSplitter() # (2)!
text_splitter = vbt.SegmentSplitter(separators=r"\s+") # (3)!
text_splitter = vbt.SegmentSplitter(separators=[r"(?<=[.!?])\s+", r"\s+", None]) # (4)!
text_splitter = vbt.SegmentSplitter(tokenizer="tiktoken", tokenizer_kwargs=dict(model="gpt-4o"))
text_splitter = vbt.LlamaIndexSplitter(node_parser="SentenceSplitter") # (5)!
vbt.TokenSplitter.set_settings(chunk_size=1000, chunk_overlap=200) # (6)!
text_chunks = text_splitter.split_text(text) # (7)!
# ______________________________________________________________
text_chunks = vbt.split_text(text) # (8)!
text_chunks = vbt.split_text(text, text_splitter="llamaindex", node_parser="SentenceSplitter") # (9)!
- Create a splitter that splits into tokens
- Create a splitter that splits into segments. By default, splits into paragraphs and sentences, words (if the sentence is too long), and then tokens (if the word is too long).
- Split text strictly into words
- Split text into sentences, words (if the sentence is too long), and then tokens (if the word is too long).
- Use SentenceSplitter from the llamaindex package as a text splitter
- Set default settings
- Use the text splitter to split a text, which returns a generator
- There is also a shortcut method that constructs a text splitter for you
- Text splitter type as well as parameters can be passed as keyword arguments to the method
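The fallback-separator idea behind SegmentSplitter can be sketched as a recursive splitter: split with the first separator, and re-split any segment that is still too long using the remaining separators. This simplified sketch uses a character limit instead of a token limit and does not pack segments back into chunks:

```python
import re

def split_segments(text, separators, max_len=40):
    """Split with the first separator; recurse with the rest on long segments."""
    if len(text) <= max_len or not separators:
        return [text]
    sep = separators[0]
    # None means "split into individual characters" (the last resort)
    parts = re.split(sep, text) if sep is not None else list(text)
    chunks = []
    for part in parts:
        if part:
            chunks.extend(split_segments(part, separators[1:], max_len))
    return chunks

text = "First sentence here. Second sentence is considerably longer than the limit allows."
chunks = split_segments(text, [r"(?<=[.!?])\s+", r"\s+", None])
print(chunks[0])  # First sentence here.
```

Here the short first sentence survives intact, while the long second sentence falls through to the word-level separator.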
Object store¶
The ObjectStore class and its subclasses offer an interface for efficiently storing and retrieving arbitrary Python objects, such as text documents and embeddings. Such objects must subclass StoreObject.
obj_store = vbt.DictStore() # (1)!
obj_store = vbt.MemoryStore(store_id="abc") # (2)!
obj_store = vbt.MemoryStore(purge_on_open=True) # (3)!
obj_store = vbt.FileStore(dir_path="./file_store") # (4)!
obj_store = vbt.FileStore(consolidate=True, use_patching=False) # (5)!
obj_store = vbt.LMDBStore(dir_path="./lmdb_store") # (6)!
obj_store = vbt.CachedStore(obj_store=vbt.FileStore()) # (7)!
obj_store = vbt.CachedStore(obj_store=vbt.FileStore(), mirror=True) # (8)!
vbt.FileStore.set_settings(consolidate=True, use_patching=False) # (9)!
obj = vbt.TextDocument(id_, text) # (10)!
obj = vbt.TextDocument.from_data(text) # (11)!
obj = vbt.TextDocument.from_data( # (12)!
{"timestamp": timestamp, "content": text},
text_path="content",
excl_embed_metadata=["timestamp"],
dump_kwargs=dict(dump_engine="nestedtext")
)
obj1 = vbt.StoreEmbedding(id1, child_ids=[id2, id3]) # (13)!
obj2 = vbt.StoreEmbedding(id2, parent_id=id1, embedding=embedding2)
obj3 = vbt.StoreEmbedding(id3, parent_id=id1, embedding=embedding3)
with obj_store: # (14)!
obj = obj_store[obj.id_]
obj_store[obj.id_] = obj
del obj_store[obj.id_]
print(len(obj_store))
for id_, obj in obj_store.items():
...
- Create an object store based on a simple dictionary, where data persists only for the lifetime of the instance
- Create an object store based on a global dictionary (memory_store), where data persists for the lifetime of the Python process
- Clear the store upon opening, removing any data saved by previous instances
- Create an object store based on a file (without patching) or folder (with patching). Patching means that additional changes will be added as separate files.
- Consolidate the folder (if any) into a file and disable patching
- Create an object store based on LMDB
- Create an object store that caches operations of another object store internally
- Same as above, but mirrors operations in the global dictionary (memory_store) to persist the cache for the lifetime of the Python process
- Set default settings
- Create a text document object
- Generate id automatically from text
- Create a text document object with metadata
- Create three embedding objects with a relationship
- An object store is very easy to use: it behaves just like a regular dictionary
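A minimal dict-backed stand-in illustrating this mapping-style interface (the real stores add persistence, patching, and caching on top):

```python
class ToyStore:
    """Dict-backed object store mimicking the mapping-style interface."""

    def __init__(self, purge_on_open=False):
        self._objects = {}
        self.purge_on_open = purge_on_open

    def __enter__(self):  # "open" the store
        if self.purge_on_open:
            self._objects.clear()
        return self

    def __exit__(self, *exc_info):  # "close" the store
        return False

    def __getitem__(self, id_):
        return self._objects[id_]

    def __setitem__(self, id_, obj):
        self._objects[id_] = obj

    def __delitem__(self, id_):
        del self._objects[id_]

    def __len__(self):
        return len(self._objects)

    def items(self):
        return self._objects.items()

store = ToyStore()
with store:
    store["doc1"] = "hello"
    print(len(store))  # 1
```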
Document ranker¶
The DocumentRanker class offers an interface for embedding, scoring, and ranking documents.
doc_ranker = vbt.DocumentRanker() # (1)!
doc_ranker = vbt.DocumentRanker(dataset_id="abc") # (2)!
doc_ranker = vbt.DocumentRanker( # (3)!
embeddings="litellm",
embeddings_kwargs=dict(model="openai/text-embedding-3-large")
)
doc_ranker = vbt.DocumentRanker( # (4)!
doc_store="file",
doc_store_kwargs=dict(dir_path="./doc_file_store"),
emb_store="file",
emb_store_kwargs=dict(dir_path="./emb_file_store"),
)
doc_ranker = vbt.DocumentRanker(score_func="dot", score_agg_func="max") # (5)!
vbt.DocumentRanker.set_settings(doc_store="memory", emb_store="memory") # (6)!
documents = [vbt.TextDocument("text1"), vbt.TextDocument("text2")] # (7)!
doc_ranker.embed_documents(documents) # (8)!
emb_documents = doc_ranker.embed_documents(documents, return_documents=True)
embs = doc_ranker.embed_documents(documents, return_embeddings=True)
doc_ranker.embed_documents(documents, refresh=True) # (9)!
doc_scores = doc_ranker.score_documents("How to use VBT?", documents) # (10)!
chunk_scores = doc_ranker.score_documents("How to use VBT?", documents, return_chunks=True)
scored_documents = doc_ranker.score_documents("How to use VBT?", documents, return_documents=True)
documents = doc_ranker.rank_documents("How to use VBT?", documents) # (11)!
scored_documents = doc_ranker.rank_documents("How to use VBT?", documents, return_scores=True)
documents = doc_ranker.rank_documents("How to use VBT?", documents, top_k=50) # (12)!
documents = doc_ranker.rank_documents("How to use VBT?", documents, top_k=0.1) # (13)!
documents = doc_ranker.rank_documents("How to use VBT?", documents, top_k="elbow") # (14)!
documents = doc_ranker.rank_documents("How to use VBT?", documents, cutoff=0.5, min_top_k=20) # (15)!
# ______________________________________________________________
vbt.embed_documents(documents) # (16)!
vbt.embed_documents(documents, embeddings="openai", model="text-embedding-3-large")
documents = vbt.rank_documents("How to use VBT?", documents)
- Create a document ranker
- Set store id for both the document store and embedding store
- Specify the embeddings provider type as well as parameters
- Specify the object store types for documents and embeddings
- Specify the score function and score aggregation function (both can be callables)
- Set default settings
- A document ranker accepts an iterable of store objects, such as text documents
- Use the document ranker to embed documents (embeddings are stored in an embedding store). You can also specify whether to return the embedded documents, or the embeddings themselves.
- If a document or embedding already exists in the store, override it
- Give each document a similarity score relative to the query, and return the scores. You can also specify whether to return the scores for the chunks (they are aggregated by default), or the documents together with their scores.
- Rank documents based on similarity scores relative to the query. By default, simply reorders the documents, but you can also specify whether to return the documents together with their scores.
- Pick top 50 documents
- Pick top 10% documents
- Pick top documents based on an algorithm such as Elbow method
- Pick top 20 documents or more with a similarity score above 0.5
- There are also shortcut methods for embedding and ranking that construct a document ranker for you
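The top_k and cutoff selection logic can be sketched as follows; this is an illustration of the parameter semantics described above (the "elbow" variant is omitted), not VBT's actual implementation:

```python
def select_top(scores, top_k=None, cutoff=None, min_top_k=None):
    """Return indices of documents to keep, ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if cutoff is not None:
        keep = [i for i in order if scores[i] >= cutoff]
        if min_top_k is not None and len(keep) < min_top_k:
            keep = order[:min_top_k]  # keep at least min_top_k documents
        order = keep
    if top_k is not None:
        if isinstance(top_k, float) and 0 < top_k < 1:
            top_k = max(1, int(len(order) * top_k))  # fraction of documents
        order = order[:top_k]
    return order

scores = [0.9, 0.2, 0.7, 0.4]
print(select_top(scores, top_k=2))                  # [0, 2]
print(select_top(scores, cutoff=0.5, min_top_k=3))  # [0, 2, 3]
```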
Pipeline¶
The components mentioned above can enhance RAG pipelines, extending their utility beyond the VBT scope.
data = [
"The Eiffel Tower is not located in London.",
"The Great Wall of China is not visible from Jupiter.",
"HTML is not a programming language."
]
query = "Where the Eiffel Tower is not located?"
documents = map(vbt.TextDocument.from_data, data)
retrieved_documents = vbt.rank_documents(query, documents, top_k=1)
context = "\n\n".join(map(str, retrieved_documents))
vbt.complete(query, context=context)