Knowledge ¶
Assets¶
Knowledge assets are instances of KnowledgeAsset that hold a list of Python objects (most often dicts) and expose various methods to manipulate them. For usage examples, see the API documentation of the particular method.
VBT assets¶
There are two knowledge assets in VBT: 1) website pages, and 2) Discord messages. The former asset consists of pages and headings that you can find on the (mainly private) website. Each data item represents a page or a heading of a page. Pages usually just point to one or more other pages and/or headings, while headings themselves hold text content - it all reflects the structure of Markdown files. The latter asset consists of the message history of the "vectorbt.pro" Discord server. Here, each data item represents a Discord message that may reference other Discord message(s) through replies.
The assets are attached to each release as pages.json.zip and messages.json.zip respectively; each is a ZIP-compressed JSON file managed by the class PagesAsset or MessagesAsset. An asset can be loaded either automatically or manually. When loading automatically, a GitHub token must be provided.
Hint
The first pull will download the assets, while subsequent pulls will use the cached versions. Once VBT is upgraded, new assets will be downloaded automatically.
env["GITHUB_TOKEN"] = "<YOUR_GITHUB_TOKEN>" # (1)!
pages_asset = vbt.PagesAsset.pull()
messages_asset = vbt.MessagesAsset.pull()
# ______________________________________________________________
vbt.settings.set("knowledge.assets.vbt.token", "YOUR_GITHUB_TOKEN") # (2)!
pages_asset = vbt.PagesAsset.pull()
messages_asset = vbt.MessagesAsset.pull()
# ______________________________________________________________
# The same arguments apply to vbt.MessagesAsset.pull
pages_asset = vbt.PagesAsset.pull(release_name="v2024.8.20") # (3)!
pages_asset = vbt.PagesAsset.pull(cache_dir="my_cache_dir") # (4)!
pages_asset = vbt.PagesAsset.pull(clear_cache=True) # (5)!
pages_asset = vbt.PagesAsset.pull(cache=False) # (6)!
# ______________________________________________________________
pages_asset = vbt.PagesAsset.from_json_file("pages.json.zip") # (7)!
messages_asset = vbt.MessagesAsset.from_json_file("messages.json.zip")
- Set the token as an environment variable
- Set the token as a global setting
- Use the asset of a different release. By default, uses the asset of the installed release (see vbt.version).
- Specify a different cache directory. By default, stores data under ./knowledge/vbt/$release_name/pages/assets/ for pages and ./knowledge/vbt/$release_name/messages/assets/ for messages.
- Clear the cache directory and pull the asset once again. By default, won't pull if the asset already exists.
- Create a temporary directory
- Load the asset file that has already been downloaded
Generic assets¶
Knowledge assets are not limited to VBT assets - you can construct an asset out of any list!
asset = vbt.KnowledgeAsset(my_list) # (1)!
asset = vbt.KnowledgeAsset.from_json_file("my_list.json") # (2)!
asset = vbt.KnowledgeAsset.from_json_bytes(vbt.load_bytes("my_list.json")) # (3)!
- Create an instance by wrapping a list directly
- Create an instance from a JSON file
- Create an instance from JSON bytes
Describing¶
Knowledge assets behave like regular lists; thus, to describe an asset, describe it as you would a list. This opens up many analysis options, such as assessing the length or printing out a random data item, but also more sophisticated ones like printing out the field schema: most data items of an asset are dicts, so you can describe them by their fields.
print(len(asset)) # (1)!
asset.sample().print() # (2)!
asset.print_sample()
asset.print_schema() # (3)!
vbt.pprint(messages_asset.describe()) # (4)!
pages_asset.print_site_schema() # (5)!
- Get the number of data items
- Pick a random data item and print it
- Visualize the asset's field schema as a tree, showing each field with its frequency and type. Works on all assets where data items are dicts.
- Describe each field. Works on all assets where data items are dicts.
- Visualize URL schema as a tree. Works on pages and headings only.
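Conceptually, the field schema is just a frequency and type count over the fields of each dict. Below is a minimal pure-Python sketch of that idea; the `field_schema` helper is hypothetical and not part of VBT:

```python
from collections import Counter

def field_schema(items):
    """Count the frequency and value types of each top-level field
    across a list of dict data items (illustrative sketch only)."""
    freq = Counter()
    types = {}
    for d in items:
        for key, value in d.items():
            freq[key] += 1
            types.setdefault(key, set()).add(type(value).__name__)
    return {k: (freq[k], sorted(types[k])) for k in freq}

items = [
    {"author": "abc", "reactions": 3},
    {"author": "xyz", "reactions": 0, "attachments": []},
]
print(field_schema(items))
```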
Manipulating¶
A knowledge asset is just a sophisticated list: it looks like a VBT object but behaves like a list. For manipulation, it offers a collection of methods ending in item or items to get, set, or remove data items, either by returning a new asset instance (the default) or by modifying the asset instance in place.
d = asset.get_items(0) # (1)!
d = asset[0]
data = asset[0:100] # (2)!
data = asset[mask] # (3)!
data = asset[indices] # (4)!
# ______________________________________________________________
new_asset = asset.set_items(0, new_d) # (5)!
asset.set_items(0, new_d, inplace=True) # (6)!
asset[0] = new_d # (7)!
asset[0:100] = new_data
asset[mask] = new_data
asset[indices] = new_data
# ______________________________________________________________
new_asset = asset.delete_items(0) # (8)!
asset.delete_items(0, inplace=True)
asset.remove(0)
del asset[0]
del asset[0:100]
del asset[mask]
del asset[indices]
# ______________________________________________________________
new_asset = asset.append_item(new_d) # (9)!
asset.append_item(new_d, inplace=True)
asset.append(new_d)
# ______________________________________________________________
new_asset = asset.extend_items([new_d1, new_d2]) # (10)!
asset.extend_items([new_d1, new_d2], inplace=True)
asset.extend([new_d1, new_d2])
asset += [new_d1, new_d2]
# ______________________________________________________________
print(d in asset) # (11)!
print(asset.index(d)) # (12)!
print(asset.count(d)) # (13)!
# ______________________________________________________________
for d in asset: # (14)!
...
- Get the first data item
- Get the first 100 data items
- Get the data items that correspond to True in the mask
- Get the data items that correspond to the positions in the index array
- Set the first data item by returning a new asset instance
- Set the first data item by modifying the asset instance
- Built-in methods all modify the existing asset instance
- Remove the first data item
- Append a new data item
- Extend with new data items
- Check if the asset instance contains a data item
- Get the position of a data item in the asset instance
- Get the number of data items in the asset instance that match a data item
- Iterate over data items
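The mask- and index-based selection shown above can be sketched on a plain list. The `select_items` helper below is purely illustrative, not VBT's implementation:

```python
def select_items(data, key):
    """Emulate integer, slice, boolean-mask, and index-array selection
    on a plain list (illustrative sketch of asset[...] semantics)."""
    if isinstance(key, slice):
        return data[key]
    if isinstance(key, list) and key and isinstance(key[0], bool):
        # Boolean mask: keep items where the mask is True
        return [d for d, m in zip(data, key) if m]
    if isinstance(key, list):
        # Index array: pick items by position
        return [data[i] for i in key]
    return data[key]

data = ["a", "b", "c", "d"]
print(select_items(data, [True, False, True, False]))
print(select_items(data, [3, 1]))
```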
Querying¶
There is a zoo of methods to query an asset: get / select, query / filter, and find. The first pair gets and processes one or more fields from each data item: get returns the raw output, while select returns a new asset instance. The second pair runs queries against the asset using various engines such as JMESPath; again, query returns the raw output, while filter returns a new asset instance. Finally, find specializes in finding information across one or more fields. By default, it returns a new asset instance.
messages = messages_asset.get() # (1)!
total_reactions = sum(messages_asset.get("reactions")) # (2)!
first_attachments = messages_asset.get("attachments[0]['content']", skip_missing=True) # (3)!
first_attachments = messages_asset.get("attachments.0.content", skip_missing=True) # (4)!
stripped_contents = pages_asset.get("content", source="x.strip() if x else ''") # (5)!
stripped_contents = pages_asset.get("content", source=lambda x: x.strip() if x else '') # (6)!
stripped_contents = pages_asset.get(source="content.strip() if content else ''") # (7)!
# (8)!
all_contents = pages_asset.select("content").remove_empty().get() # (9)!
all_attachments = messages_asset.select("attachments").merge().get() # (10)!
combined_content = messages_asset.select(source=vbt.Sub('[$author] $content')).join() # (11)!
# ______________________________________________________________
user_questions = messages_asset.query("content if '@polakowo' in mentions else vbt.NoResult") # (12)!
is_user_question = messages_asset.query("'@polakowo' in mentions", return_type="bool") # (13)!
all_attachments = messages_asset.query("[].attachments | []", query_engine="jmespath") # (14)!
all_classes = pages_asset.query("name[obj_type == 'class'].sort_values()", query_engine="pandas") # (15)!
# (16)!
support_messages = messages_asset.filter("channel == 'support'") # (17)!
# ______________________________________________________________
new_messages_asset = messages_asset.find("@polakowo") # (18)!
new_messages_asset = messages_asset.find("@polakowo", path="author") # (19)!
new_messages_asset = messages_asset.find(vbt.Not("@polakowo"), path="author") # (20)!
new_messages_asset = messages_asset.find( # (21)!
["@polakowo", "from_signals"],
path=["author", "content"],
find_all=True
)
found_fields = messages_asset.find( # (22)!
["vbt.Portfolio", "vbt.PF"],
return_type="field"
).get()
found_code_matches = messages_asset.find( # (23)!
r"(?<!`)`([^`]*)`(?!`)",
mode="regex",
return_type="match",
).sort().get()
- Get all data items. The same as messages_asset.data.
- Get the value under a simple field from each data item. Here, get and sum the number of reactions.
- Get the value under a nested field from each data item. Also, if the existence of the field is not guaranteed, skip data items where the field is missing.
- The path to the value can also be expressed via the dot notation
- Post-process the value using a source expression. Here, strip the selected value ("content").
- The source can also be a function
- If no value was selected, the source expression/function is applied to the entire data item, where the data item is denoted by "x" and its fields are denoted by their names
- If the result needs to be processed further, use select instead of get. It accepts the same arguments.
- Select contents while removing None and empty strings
- Select attachments, merge them into a list, and extract
- Format the author and content into a string, and join the strings
- Return the content if @polakowo is in the mentions, else ignore. The expression here acts similarly to the source expression in get.
- Return True if @polakowo is in the mentions, else False. Without return_type="bool", it would have acted like a filter and returned the data items that satisfy the condition.
- Use JMESPath to extract all attachments
- You can even use Pandas as a query engine, where each field becomes a Series. Here, get the heading name of every data item that has "class" as object type, and sort.
- If the result needs to be processed further, use filter instead of query. It accepts the same arguments.
- Filter messages that belong to the support channel
- Find messages that mention @polakowo in any field
- Find messages that have @polakowo as author
- Find messages that don't have @polakowo as author
- Find messages that have @polakowo as author and mention from_signals in the content. If find_all was False, the conjunction would be "or".
- Find all fields that mention either vbt.Portfolio or vbt.PF
- Find all RegEx matches for code snippets with a single backtick
Tip
To make chained calls more readable, use one of the following two styles:
Code¶
There is a specialized method for finding code, either in single backticks or blocks.
found_code_blocks = messages_asset.find_code().get() # (1)!
found_code_blocks = messages_asset.find_code(language="python").get() # (2)!
found_code_blocks = messages_asset.find_code(language=True).get() # (3)!
found_code_blocks = messages_asset.find_code("from_signals").get() # (4)!
found_code_blocks = messages_asset.find_code("from_signals", in_blocks=False).get() # (5)!
found_code_blocks = messages_asset.find_code("from_signals", path="attachments").get() # (6)!
- Find any code blocks across all fields
- Find any Python code blocks across all fields
- Find code blocks annotated with any language across all fields
- Find code blocks that mention from_signals across all fields
- Find code that mentions from_signals across all fields
- Find code blocks that mention from_signals across attachments
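To see what the single-backtick pattern from the earlier find example matches, you can run it through the standard re module directly:

```python
import re

# A backtick that is neither preceded nor followed by another backtick
# delimits an inline code snippet (the pattern from the example above)
pattern = re.compile(r"(?<!`)`([^`]*)`(?!`)")

text = "Call `pf.stats()` and `pf.plot()` to inspect results."
print(pattern.findall(text))
```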
Links¶
Custom knowledge assets like pages and messages also have specialized methods for finding data items by their link. The default behavior is to match the target against the end of each link, so that searching for either "https://vectorbt.pro/become-a-member/" or "become-a-member/" will reliably return "https://vectorbt.pro/become-a-member/". In the "exact" and "end" modes, a variant with or without the trailing slash is added automatically, so that searching for "become-a-member" (without the slash) will still return "https://vectorbt.pro/become-a-member/". This also disregards the otherwise-matching link "https://vectorbt.pro/become-a-member/#become-a-member", as it belongs to the same page.
new_messages_asset = messages_asset.find_link( # (1)!
"https://discord.com/channels/918629562441695344/919715148896301067/923327319882485851"
)
new_messages_asset = messages_asset.find_link("919715148896301067/923327319882485851") # (2)!
new_pages_asset = pages_asset.find_page( # (3)!
"https://vectorbt.pro/pvt_xxxxxxxx/getting-started/installation/"
)
new_pages_asset = pages_asset.find_page("https://vectorbt.pro/pvt_7a467f6b/getting-started/installation/") # (4)!
new_pages_asset = pages_asset.find_page("installation/")
new_pages_asset = pages_asset.find_page("installation") # (5)!
new_pages_asset = pages_asset.find_page("installation", aggregate=True) # (6)!
- Find the message based on a Discord URL
- Find the message based on a suffix. Here, channel_id/message_id.
- Find the page based on a website URL
- Find the page based on a suffix
- Slash will be added automatically
- Find page but also aggregate it
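The suffix-and-slash matching described above can be sketched in pure Python. The `find_link` helper is illustrative, not VBT's implementation, and only handles the variant logic of the "exact" and "end" modes:

```python
def find_link(links, target, mode="end"):
    """Match a target against the end of each link, trying both the
    trailing-slash and no-slash variants (illustrative sketch only)."""
    variants = {target}
    if mode in ("exact", "end"):
        # Add the complementary slash variant automatically
        variants.add(target[:-1] if target.endswith("/") else target + "/")
    return [
        link for link in links
        if any(link == v or link.endswith("/" + v.lstrip("/")) for v in variants)
    ]

links = [
    "https://vectorbt.pro/become-a-member/",
    "https://vectorbt.pro/become-a-member/#become-a-member",
]
print(find_link(links, "become-a-member"))
```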
Objects¶
You can also find headings that correspond to VBT objects.
new_pages_asset = pages_asset.find_obj(vbt.Portfolio) # (1)!
new_pages_asset = pages_asset.find_obj(vbt.Portfolio, aggregate=True) # (2)!
new_pages_asset = pages_asset.find_obj(vbt.PF.from_signals, aggregate=True)
new_pages_asset = pages_asset.find_obj(vbt.pf_nb, aggregate=True)
new_pages_asset = pages_asset.find_obj("SignalContext", aggregate=True)
- Get the top-level heading corresponding to the class Portfolio
- Get the top-level heading corresponding to the class Portfolio with all sub-headings
Nodes¶
You can also traverse pages and messages similarly to nodes in a graph.
new_vbt_asset = vbt_asset.select_previous(link) # (1)!
new_vbt_asset = vbt_asset.select_next(link)
# ______________________________________________________________
new_pages_asset = pages_asset.select_parent(link) # (2)!
new_pages_asset = pages_asset.select_children(link)
new_pages_asset = pages_asset.select_siblings(link)
new_pages_asset = pages_asset.select_descendants(link)
new_pages_asset = pages_asset.select_branch(link)
new_pages_asset = pages_asset.select_ancestors(link)
new_pages_asset = pages_asset.select_parent_page(link)
new_pages_asset = pages_asset.select_descendant_headings(link)
# ______________________________________________________________
new_messages_asset = messages_asset.select_reference(link)
new_messages_asset = messages_asset.select_replies(link)
new_messages_asset = messages_asset.select_block(link) # (3)!
new_messages_asset = messages_asset.select_thread(link)
new_messages_asset = messages_asset.select_channel(link)
- This method and the ones below work on both pages and messages
- This method and the ones below do not include the link itself by default (use incl_link=True to enable)
- This method and the ones below include the link by default (use incl_link=False to disable)
Note
Each operation requires at least one full data pass; use sparingly.
Applying¶
"Find" and many other methods rely upon KnowledgeAsset.apply, which executes a function on each data item. These functions are so-called asset functions and consist of two parts: argument preparation and function calling. The main benefit is that arguments are prepared only once and then passed to each function call. The execution is done via the mighty execute function, which is capable of parallelization.
links = messages_asset.apply("get", "link") # (1)!
from vectorbtpro.utils.knowledge.base_asset_funcs import GetAssetFunc # (2)!
args, kwargs = GetAssetFunc.prepare("link")
links = [GetAssetFunc.call(d, *args, **kwargs) for d in messages_asset]
# ______________________________________________________________
links_asset = messages_asset.apply(lambda d: d["link"]) # (3)!
links = messages_asset.apply(lambda d: d["link"], wrap=False) # (4)!
json_asset = messages_asset.apply(vbt.dump, dump_engine="json") # (5)!
# ______________________________________________________________
new_asset = asset.apply( # (6)!
...,
execute_kwargs=dict(
n_chunks="auto",
distribute="chunks",
engine="processpool"
)
)
- Apply a built-in function. This operation is equivalent to get("link"). Whether to return a new asset instance or the raw output depends on the function.
- The same as above but manually
- Apply a custom function. By default, returns a new asset instance.
- Apply a custom function and return the raw output
- All functions that require a single argument can be used. Here, we serialize each message with JSON.
- Any operation can be distributed by specifying execution-related keyword arguments
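The prepare-once, call-many split can be sketched in plain Python. The `apply` helper below is illustrative, not VBT's implementation:

```python
def apply(items, func, *args, **kwargs):
    """Prepare arguments once, then call the function on each data
    item with those prepared arguments (illustrative sketch only)."""
    prepared_args, prepared_kwargs = args, dict(kwargs)  # prepared once
    return [func(d, *prepared_args, **prepared_kwargs) for d in items]

messages = [{"link": "a"}, {"link": "b"}]
print(apply(messages, lambda d, key: d[key], "link"))
```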
Pipelines¶
Most examples show how to execute a chain of standalone operations, but each operation passes through the data at least once. To pass through the data exactly once regardless of the number of operations, use asset pipelines. There are two kinds of asset pipelines: basic and complex. Basic pipelines take a list of tasks (i.e., functions and their arguments) and compose them into a single operation that takes a single data item; this composed operation is then applied to all data items. Complex pipelines take a Python expression in a functional-programming style, where one function receives a data item and returns a result that becomes the argument of another function.
tasks = [("find", ("@polakowo",), dict(return_type="match")), len, "get"] # (1)!
tasks = [vbt.Task("find", "@polakowo", return_type="match"), vbt.Task(len), vbt.Task("get")] # (2)!
mention_count = messages_asset.apply(tasks) # (3)!
asset_pipeline = vbt.BasicAssetPipeline(tasks) # (4)!
mention_count = [asset_pipeline(d) for d in messages_asset]
# ______________________________________________________________
expression = "get(len(find(d, '@polakowo', return_type='match')))"
mention_count = messages_asset.apply(expression) # (5)!
asset_pipeline = vbt.ComplexAssetPipeline(expression) # (6)!
mention_count = [asset_pipeline(d) for d in messages_asset]
- Tasks can be provided as strings, tuples, or functions
- They can also be provided as instances of Task
- Get the number of @polakowo mentions in each message by using a list of tasks
- The same as above but manually
- Get the number of @polakowo mentions in each message by using an expression
- The same as above but manually
Info
In both pipelines, arguments are prepared only once during initialization.
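The basic-pipeline idea, composing a list of tasks into one callable that each data item passes through exactly once, can be sketched as follows (illustrative, not VBT's implementation):

```python
import re

def basic_pipeline(tasks):
    """Compose (func, args, kwargs) tasks into a single callable that
    threads one data item through every task (illustrative sketch)."""
    def run(d):
        for func, args, kwargs in tasks:
            d = func(d, *args, **kwargs)
        return d
    return run

# Count @polakowo mentions per message: find matches, then take the length
tasks = [
    (lambda d, pat: re.findall(pat, d["content"]), ("@polakowo",), {}),
    (len, (), {}),
]
pipeline = basic_pipeline(tasks)
messages = [{"content": "@polakowo hi @polakowo"}, {"content": "hello"}]
print([pipeline(d) for d in messages])
```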
Reducing¶
Reducing means merging all data items into one. This requires a function that takes two data items. In the first iteration, these are the initializer (such as an empty dict) and the first data item; if no initializer is given, the first two data items are used. The result of each iteration is then passed as the first argument to the next. The execution is done by KnowledgeAsset.reduce and cannot be parallelized, since each iteration depends on the previous one.
all_attachments = messages_asset.select("attachments").reduce("merge_lists") # (1)!
from vectorbtpro.utils.knowledge.base_asset_funcs import MergeListsAssetFunc # (2)!
args, kwargs = MergeListsAssetFunc.prepare()
d1 = []
for d2 in messages_asset.select("attachments"):
d1 = MergeListsAssetFunc.call(d1, d2, *args, **kwargs)
all_attachments = d1
# ______________________________________________________________
total_reactions = messages_asset.select("reactions").reduce(lambda d1, d2: d1 + d2) # (3)!
- Apply a built-in function. This operation is equivalent to select("attachments").merge_lists(). Whether to return a new asset instance or the raw output depends on the function.
- The same as above but manually
- Apply a custom function. By default, returns the raw output.
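The reduction pattern described above maps directly onto functools.reduce from the standard library; for example, summing reactions with an initializer of 0:

```python
from functools import reduce

messages = [{"reactions": 2}, {"reactions": 0}, {"reactions": 5}]

# Start from the initializer 0 and fold in one item at a time; each
# iteration feeds its result into the next, hence no parallelization
total_reactions = reduce(lambda acc, d: acc + d["reactions"], messages, 0)
print(total_reactions)
```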
In addition, you can split a knowledge asset into groups and reduce each group. The iteration over groups is done by the execute function, which is capable of parallelization.
reactions_by_channel = messages_asset.groupby_reduce( # (1)!
lambda d1, d2: d1 + d2["reactions"],
by="channel",
initializer=0,
return_group_keys=True
)
# ______________________________________________________________
result = asset.groupby_reduce( # (2)!
...,
execute_kwargs=dict(
n_chunks="auto",
distribute="chunks",
engine="processpool"
)
)
- Get the total number of reactions per channel
- Any group-by operation can be distributed by specifying execution-related keyword arguments
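A pure-Python sketch of the group-then-reduce idea; the `groupby_reduce` helper below is illustrative, not VBT's implementation:

```python
from collections import defaultdict
from functools import reduce

def groupby_reduce(items, func, by, initializer):
    """Split items into groups by a field value, then reduce each
    group separately (illustrative sketch only)."""
    groups = defaultdict(list)
    for d in items:
        groups[d[by]].append(d)
    return {key: reduce(func, group, initializer) for key, group in groups.items()}

messages = [
    {"channel": "support", "reactions": 2},
    {"channel": "announcements", "reactions": 7},
    {"channel": "support", "reactions": 1},
]
print(groupby_reduce(messages, lambda acc, d: acc + d["reactions"], "channel", 0))
```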
Aggregating¶
Since headings are represented as individual data items, they can be aggregated back into their parent page. This is useful for formatting or displaying the page. Note that only headings can be aggregated; pages cannot be aggregated into other pages.
new_pages_asset = pages_asset.aggregate() # (1)!
new_pages_asset = pages_asset.aggregate(append_obj_type=False, append_github_link=False) # (2)!
- Aggregate headings into the content of the parent page
- The same as above but don't append object type and GitHub source to the API headings
Messages, on the other hand, can be aggregated across multiple levels: "message", "block", "thread", and "channel". Aggregation here simply means taking the messages that belong to the specified level, dumping them, and putting them into the content of a single, bigger message.
- The level "message" means that attachments are included in the content of the message.
- The level "block" puts together messages of the same author that reference the same block or don't reference anything at all. The link of the block is the link of the first message in the block.
- The level "thread" puts together messages that belong to the same channel and are connected through a chain of replies. The link of the thread is the link of the first message in the thread.
- The level "channel" puts together messages that belong to the same channel.
new_messages_asset = messages_asset.aggregate() # (1)!
new_messages_asset = messages_asset.aggregate(by="message") # (2)!
new_messages_asset = messages_asset.aggregate(by="block") # (3)!
new_messages_asset = messages_asset.aggregate(by="thread") # (4)!
new_messages_asset = messages_asset.aggregate(by="channel") # (5)!
new_messages_asset = messages_asset.aggregate(
...,
minimize_metadata=True # (6)!
)
new_messages_asset = messages_asset.aggregate(
...,
dump_metadata_kwargs=dict(dump_engine="nestedtext") # (7)!
)
- Aggregate into the content of the parent if a single parent exists; otherwise, an error is raised
- Aggregate attachments into the content of the parent message
- Aggregate messages into the content of the parent block
- Aggregate messages into the content of the parent thread
- Aggregate messages into the content of the parent channel
- Remove irrelevant keys from metadata before dumping
- When putting messages into the parent content, dump their metadata using the selected engine (here NestedText)
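The "thread" level can be sketched by following reply chains: each message joins the thread of the message it replies to. The sketch below is illustrative only; it ignores the same-channel constraint and assumes messages arrive in chronological order:

```python
def group_threads(messages):
    """Group messages connected through chains of replies. Each message
    is a dict with a unique "link" and an optional "reference" pointing
    at the link of the message it replies to (illustrative sketch)."""
    thread_of = {}  # message link -> link of the first message in its thread
    threads = {}    # thread link -> list of messages
    for msg in messages:  # assumes chronological order
        ref = msg.get("reference")
        # Inherit the root of the referenced message, else start a new thread
        root = thread_of.get(ref, msg["link"]) if ref else msg["link"]
        thread_of[msg["link"]] = root
        threads.setdefault(root, []).append(msg)
    return threads

messages = [
    {"link": "m1"},
    {"link": "m2", "reference": "m1"},
    {"link": "m3"},
    {"link": "m4", "reference": "m2"},
]
print(sorted(group_threads(messages)))
```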
Formatting¶
Most Python objects can be dumped (i.e., serialized) into strings.
new_asset = asset.dump() # (1)!
new_asset = asset.dump(dump_engine="nestedtext", indent=4) # (2)!
# ______________________________________________________________
print(asset.dump().join()) # (3)!
print(asset.dump().join(separator="\n\n---------------------\n\n")) # (4)!
print(asset.dump_all()) # (5)!
- Dump each data item. By default, dumps into YAML using Ruamel (if installed) or PyYAML.
- Dump each data item using a custom engine. Here, using NestedText with indentation of 4 spaces.
- Join all dumped data items. Chooses the separator automatically.
- Join all dumped data items with a custom separator
- Dump the entire list as a single object
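The dump-and-join pattern can be sketched with stdlib JSON; VBT dumps to YAML by default, so this is purely an illustration of the mechanism:

```python
import json

items = [{"author": "abc", "content": "hi"}, {"author": "xyz", "content": "bye"}]

# Dump each data item into a string, then join with a visible separator,
# mirroring asset.dump().join(separator=...)
dumped = [json.dumps(d, indent=2) for d in items]
print("\n\n---------------------\n\n".join(dumped))
```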
Custom knowledge assets like pages and messages can be converted to, and optionally saved in, Markdown or HTML format. Only the field "content" will be converted, while the other fields build the metadata block displayed at the beginning of each file.
Note
Without aggregation, each page heading will become a separate file.
new_pages_asset = pages_asset.to_markdown() # (1)!
new_pages_asset = pages_asset.to_markdown(root_metadata_key="pages") # (2)!
new_pages_asset = pages_asset.to_markdown(clear_metadata=False) # (3)!
new_pages_asset = pages_asset.to_markdown(remove_code_title=False, even_indentation=False) # (4)!
dir_path = pages_asset.save_to_markdown() # (5)!
dir_path = pages_asset.save_to_markdown(cache_dir="markdown") # (6)!
dir_path = pages_asset.save_to_markdown(clear_cache=True) # (7)!
dir_path = pages_asset.save_to_markdown(cache=False) # (8)!
# (9)!
# ______________________________________________________________
new_pages_asset = pages_asset.to_html() # (10)!
new_pages_asset = pages_asset.to_html(to_markdown_kwargs=dict(root_metadata_key="pages")) # (11)!
new_pages_asset = pages_asset.to_html(make_links=False) # (12)!
new_pages_asset = pages_asset.to_html(extensions=[], use_pygments=False) # (13)!
extensions = vbt.settings.get("knowledge.formatting.markdown_kwargs.extensions")
new_pages_asset = pages_asset.to_html(extensions=extensions + ["pymdownx.smartsymbols"]) # (14)!
extensions = vbt.settings.get("knowledge.formatting.markdown_kwargs.extensions")
extensions.append("pymdownx.smartsymbols") # (15)!
extension_configs = vbt.settings.get("knowledge.formatting.markdown_kwargs.extension_configs")
extension_configs["pymdownx.superfences"]["preserve_tabs"] = False # (16)!
new_pages_asset = pages_asset.to_html(format_html_kwargs=dict(pygments_kwargs=dict(style="dracula"))) # (17)!
vbt.settings.set("knowledge.formatting.pygments_kwargs.style", "dracula") # (18)!
style_extras = vbt.settings.get("knowledge.formatting.style_extras")
style_extras.append("""
.admonition.success {
background-color: #00c8531a;
border-left-color: #00c853;
}
""") # (19)!
head_extras = vbt.settings.get("knowledge.formatting.head_extras")
head_extras.append('<link ...>') # (20)!
body_extras = vbt.settings.get("knowledge.formatting.body_extras")
body_extras.append('<script>...</script>') # (21)!
vbt.settings.get("knowledge.formatting").reset() # (22)!
dir_path = pages_asset.save_to_html() # (23)!
# (24)!
- Convert each data item to a Markdown-formatted string. Returns a new asset instance with a list of strings.
- Prepend a root key called "pages" (or use "messages" for messages) to the metadata block
- Keep empty fields in the metadata block. By default, they will be removed.
- Keep the content as is, without removing code titles and fixing uneven indentation
- Save each data item to a Markdown-formatted file. Returns the path to the parent directory. By default, saves to ./knowledge/vbt/$release_name/pages/markdown/ for pages and ./knowledge/vbt/$release_name/messages/markdown/ for messages.
- Specify a different cache directory
- Clear the cache directory before saving. By default, a data item is skipped if its file already exists.
- Create a temporary directory
- Method save_to_markdown accepts the same arguments as to_markdown
- Convert each data item to an HTML-formatted string. Returns a new asset instance with a list of strings.
- The same options as for the method to_markdown can be provided via to_markdown_kwargs
- If any URL is detected in text, don't make it a link. By default, makes all URLs clickable.
- Disable all Markdown extensions and don't use Pygments for highlighting
- Add a new extension to the preset list of Markdown extensions (locally)
- Add a new extension to the preset list of Markdown extensions (globally)
- Change the config of a Markdown extension
- Change the default Pygments highlighting style (locally)
- Change the default Pygments highlighting style (globally)
- Add extra CSS at the end of the <style> element
- Add extra HTML such as links at the end of the <head> element
- Add extra HTML such as JavaScript at the end of the <body> element
- Reset formatting
- Save each data item to an HTML-formatted file. Returns the path to the parent directory. By default, saves to ./knowledge/vbt/$release_name/pages/html/ for pages and ./knowledge/vbt/$release_name/messages/html/ for messages.
- Method save_to_html accepts the same arguments as to_html and save_to_markdown
Browsing¶
Pages and messages can be displayed and browsed via static HTML files. When a single item is displayed, VBT creates a temporary HTML file and opens it in the default browser; all links in this file remain external. When multiple items are displayed, VBT creates a single HTML file where the items are embedded as iframes that can be paged through.
file_path = pages_asset.display() # (1)!
file_path = pages_asset.display(link="documentation/fundamentals") # (2)!
file_path = pages_asset.display(link="documentation/fundamentals", aggregate=True) # (3)!
# ______________________________________________________________
file_path = messages_asset.display() # (4)!
file_path = messages_asset.display(link="919715148896301067/923327319882485851") # (5)!
file_path = messages_asset.filter("channel == 'announcements'").display() # (6)!
- Display one or more pages
- Choose a page and display
- If the asset isn't aggregated, choose a page, find its headings, merge them into one page, and display
- Display one or more messages
- Choose a message and display
- Select messages that belong to the selected channel, aggregate them, and display
When one or more pages (and/or headings) should be browsed like a website, VBT can convert all data items to HTML and replace all external links with internal ones, so that you can jump from one page to another locally. But which page is displayed first? Pages and headings build a directed graph. If there is one page from which all other pages are reachable, it is displayed first. If there are multiple such pages, VBT creates an index page with metadata blocks from which you can access the other pages (unless you specify entry_link).
dir_path = pages_asset.browse() # (1)!
dir_path = pages_asset.browse(aggregate=True) # (2)!
dir_path = pages_asset.browse(entry_link="documentation/fundamentals", aggregate=True) # (3)!
dir_path = pages_asset.browse(entry_link="documentation", descendants_only=True, aggregate=True) # (4)!
dir_path = pages_asset.browse(cache_dir="website") # (5)!
dir_path = pages_asset.browse(clear_cache=True) # (6)!
dir_path = pages_asset.browse(cache=False) # (7)!
# ______________________________________________________________
dir_path = messages_asset.browse() # (8)!
dir_path = messages_asset.browse(entry_link="919715148896301067/923327319882485851") # (9)!
# (10)!
- Generate an HTML file for every page and heading (unless already aggregated)
- Aggregate headings into pages and generate an HTML file for every page
- Generate an HTML file for every page and start browsing from the selected page
- Generate an HTML file for every page that is a descendant of the selected page
- Specify a different cache directory. By default, stores data under ./knowledge/vbt/$release_name/pages/html/ for pages and ./knowledge/vbt/$release_name/messages/html/ for messages.
- Clear the cache directory before saving. By default, a page or heading is skipped if its file already exists.
- Create a temporary directory
- Generate an HTML file for every message (note that there are a lot of messages!)
- In addition to the above, choose a message to open in the default browser
- Messages also take the same caching-related arguments as pages
Combining¶
Assets can be easily combined. When the target class is not specified, their common superclass is used. For example, combining PagesAsset and MessagesAsset will yield an instance of VBTAsset, which is based on overlapping features of both assets, such as "link" and "content" fields.
vbt_asset = pages_asset + messages_asset # (1)!
vbt_asset = pages_asset.combine(messages_asset) # (2)!
vbt_asset = vbt.VBTAsset.combine(pages_asset, messages_asset) # (3)!
- Concatenate both lists and wrap by the common superclass
- Concatenate both lists and wrap by the first class
- Concatenate both lists and wrap by the selected class
If both assets have the same number of data items, you can also merge them on the data item level. This works even for complex containers like nested dictionaries and lists by flattening their nested structures into flat dicts, merging them, and then unflattening them back into the original container type.
new_asset = asset1.merge(asset2) # (1)!
new_asset = vbt.KnowledgeAsset.merge(asset1, asset2) # (2)!
- Merge both lists and wrap by the first class
- Merge both lists and wrap by the selected class
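The flatten-merge-unflatten approach can be illustrated with a minimal sketch (dict containers only for brevity; VBT's actual implementation also restores lists and other container types):

```python
def flatten(obj, prefix=()):
    """Flatten nested dicts/lists into a flat dict keyed by path tuples."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        flat.update(flatten(value, prefix + (key,)))
    return flat

def unflatten(flat):
    """Rebuild a nested dict from path-keyed entries (dicts only, for brevity)."""
    nested = {}
    for path, value in flat.items():
        node = nested
        for key in path[:-1]:
            node = node.setdefault(key, {})
        node[path[-1]] = value
    return nested

a = {"link": "x", "meta": {"tags": "a"}}
b = {"content": "y", "meta": {"author": "z"}}
merged = unflatten({**flatten(a), **flatten(b)})
# {'link': 'x', 'meta': {'tags': 'a', 'author': 'z'}, 'content': 'y'}
```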
You can also merge data items of a single asset into a single data item.
new_asset = asset.merge() # (1)!
new_asset = asset.merge_dicts() # (2)!
new_asset = asset.merge_lists() # (3)!
- Calls either merge_dicts or merge_lists depending on the data type
- (Deep-)merge all dicts into one
- Concatenate all lists into one
Searching¶
For objects¶
There are 4 methods to search for an arbitrary VBT object in pages and messages. The first method searches for the API documentation of the object, the second method searches for object mentions in the non-API (human-readable) documentation, the third method searches for object mentions in Discord messages, and the last method searches for object mentions in the code of both pages and messages.
api_asset = vbt.find_api(vbt.PFO) # (1)!
api_asset = vbt.find_api(vbt.PFO, incl_bases=False, incl_ancestors=False) # (2)!
api_asset = vbt.find_api(vbt.PFO, use_parent=True) # (3)!
api_asset = vbt.find_api(vbt.PFO, use_refs=True) # (4)!
api_asset = vbt.find_api(vbt.PFO.row_stack) # (5)!
api_asset = vbt.find_api(vbt.PFO.from_uniform, incl_refs=False) # (6)!
api_asset = vbt.find_api([vbt.PFO.from_allocate_func, vbt.PFO.from_optimize_func]) # (7)!
# ______________________________________________________________
api_asset = vbt.PFO.find_api() # (8)!
api_asset = vbt.PFO.find_api(attr="from_optimize_func")
- Find the (aggregated) API page for PortfolioOptimizer. Includes the (aggregated) base classes, such as Configured, as well as the (non-aggregated) parent modules, such as portfolio.pfopt.base.
- Don't include base classes and parent modules, only this class
- Uses the (aggregated) parent of this class instead. Here, portfolio.pfopt.base.
- Include the (non-aggregated) pages of the objects that this object references
- Find the API page for PortfolioOptimizer.row_stack. Includes the (aggregated) base methods, such as Wrapping.row_stack, the (non-aggregated) parent objects, such as PortfolioOptimizer, and the (non-aggregated) references.
- Don't include references
- Search for multiple objects
- Make a call directly from the object's interface
docs_asset = vbt.find_docs(vbt.PFO) # (1)!
docs_asset = vbt.find_docs(vbt.PFO, incl_shortcuts=False, incl_instances=False) # (2)!
docs_asset = vbt.find_docs(vbt.PFO, incl_custom=["pf_opt"]) # (3)!
docs_asset = vbt.find_docs(vbt.PFO, incl_custom=[r"pf_opt\s*=\s*.+"], is_custom_regex=True) # (4)!
docs_asset = vbt.find_docs(vbt.PFO, as_code=True) # (5)!
docs_asset = vbt.find_docs([vbt.PFO.from_allocate_func, vbt.PFO.from_optimize_func]) # (6)!
docs_asset = vbt.find_docs(vbt.PFO, up_aggregate_th=0) # (7)!
docs_asset = vbt.find_docs(vbt.PFO, up_aggregate_pages=True) # (8)!
docs_asset = vbt.find_docs(vbt.PFO, incl_pages=["documentation", "tutorials"]) # (9)!
docs_asset = vbt.find_docs(vbt.PFO, incl_pages=[r"(features|cookbook)"], page_find_mode="regex") # (10)!
docs_asset = vbt.find_docs(vbt.PFO, excl_pages=["release-notes"]) # (11)!
# ______________________________________________________________
docs_asset = vbt.PFO.find_docs() # (12)!
docs_asset = vbt.PFO.find_docs(attr="from_optimize_func")
- Find the mentions of PortfolioOptimizer across all non-API pages. Searches for full reference names, shortcuts (such as vbt.PFO), imports (such as from ... import PFO), typical instance names (such as pfo =), and access/call notations (such as PFO.).
- Include only full reference names and imports
- Include custom literal mentions
- Include custom regex mentions
- Find the mentions only as code
- Search for multiple objects
- By default, if the method matches 2/3 of all the headings that share the same parent, it takes the (aggregated) parent instead. Here, take the entire page if any mention is found.
- Similarly, if the method matches 2/3 of all the pages that share the same parent page, it includes the parent page as well
- Include only the links from the documentation and tutorials (the targets are substrings)
- Include only the links from the features and cookbook (the target is a regex)
- Exclude the links from the release notes (the target is a substring)
- Make a call directly from the object's interface
messages_asset = vbt.find_messages(vbt.PFO) # (1)!
# ______________________________________________________________
messages_asset = vbt.PFO.find_messages() # (2)!
messages_asset = vbt.PFO.find_messages(attr="from_optimize_func")
- The same as above but finds mentions in messages. Accepts the same arguments related to targets (first block in find_docs recipes) and no arguments related to pages (second block in find_docs recipes).
- Make a call directly from the object's interface
examples_asset = vbt.find_examples(vbt.PFO) # (1)!
# ______________________________________________________________
examples_asset = vbt.PFO.find_examples() # (2)!
examples_asset = vbt.PFO.find_examples(attr="from_optimize_func")
- The same as above but finds code mentions in both pages and messages
- Make a call directly from the object's interface
The first three methods are guaranteed to return non-overlapping results, while the last method may return examples that also appear in the results of the first three. There is thus another method that, by default, calls the first three methods and combines their results into a single asset. This way, we can gather all relevant knowledge about a VBT object.
combined_asset = vbt.find_assets(vbt.Trades) # (1)!
combined_asset = vbt.find_assets(vbt.Trades, asset_names=["api", "docs"]) # (2)!
combined_asset = vbt.find_assets(vbt.Trades, asset_names=["messages", ...]) # (3)!
combined_asset = vbt.find_assets(vbt.Trades, asset_names="all") # (4)!
combined_asset = vbt.find_assets( # (5)!
vbt.Trades,
api_kwargs=dict(incl_ancestors=False),
docs_kwargs=dict(as_code=True),
messages_kwargs=dict(as_code=True),
)
combined_asset = vbt.find_assets(vbt.Trades, minimize=False) # (6)!
asset_list = vbt.find_assets(vbt.Trades, combine=False) # (7)!
combined_asset = vbt.find_assets([vbt.EntryTrades, vbt.ExitTrades]) # (8)!
# ______________________________________________________________
combined_asset = vbt.find_assets("SQL", resolve=False) # (9)!
combined_asset = vbt.find_assets(["SQL", "database"], resolve=False) # (10)!
# ______________________________________________________________
messages_asset = vbt.Trades.find_assets() # (11)!
messages_asset = vbt.Trades.find_assets(attr="plot")
messages_asset = pf.trades.find_assets(attr="expectancy")
- Combine assets for Trades
- Use only the API asset and documentation asset
- Put the messages asset first and the other assets in their usual order after it by using ... (Ellipsis)
- Use all assets, including the examples asset
- Provide asset-specific keyword arguments
- Disable minimization. Keeps all fields but takes more context space.
- Disable combining. Returns a dictionary of assets by their name.
- Search for multiple objects
- Search for arbitrary keywords (not actual VBT objects)
- Search for multiple keywords
- Make a call directly from the object's interface
vbt.Trades.find_assets().select("link").print() # (1)!
file_path = vbt.Trades.find_assets( # (2)!
asset_names="docs",
docs_kwargs=dict(excl_pages="release-notes")
).display()
dir_path = vbt.Trades.find_assets( # (3)!
asset_names="docs",
docs_kwargs=dict(excl_pages="release-notes")
).browse(cache=False)
- Print all found links
- Browse all found documentation (but not release notes) as a single HTML file
- Browse all found documentation (but not release notes) as multiple HTML files
Globally¶
Not only can we search for knowledge related to an individual VBT object, we can also search for any VBT items that match a query in natural language. This works by embedding the query and the data items, computing their pairwise similarity scores, and sorting the data items by their mean score in descending order. Since the result contains all the data items from the original set, only in a different order, it's advised to select the top-k results before displaying.
All the methods discussed in objects work on queries too!
api_asset = vbt.find_api("How to rebalance weekly?", top_k=20)
docs_asset = vbt.find_docs("How to hedge a position?", top_k=20)
messages_asset = vbt.find_messages("How to trade live?", top_k=20)
combined_asset = vbt.find_assets("How to create a custom data class?", top_k=20)
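Conceptually, the ranking boils down to the following sketch, with toy two-dimensional embeddings standing in for the ones produced by an embeddings provider (one embedding per item here; VBT aggregates scores over chunks):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy embeddings; in practice these come from an embeddings provider
query_emb = [1.0, 0.0]
doc_embs = {"doc_a": [0.9, 0.1], "doc_b": [0.1, 0.9], "doc_c": [0.7, 0.7]}

# Sort documents by similarity to the query, in descending order
ranked = sorted(doc_embs, key=lambda d: cosine(query_emb, doc_embs[d]), reverse=True)
top_docs = ranked[:2]  # select top-k before displaying
print(ranked)  # ['doc_a', 'doc_c', 'doc_b']
```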
There also exists a specialized search function that calls find_assets, caches the documents (so that the next search call becomes an order of magnitude faster), and displays the top results as a static HTML page.
Info
The first time you run this command, it may take up to 15 minutes to prepare and embed documents. However, most of the preparation steps are cached and stored, so future searches will be significantly faster without needing to repeat the process.
file_path = vbt.search("How to turn df into data?") # (1)!
found_asset = vbt.search("How to turn df into data?", display=False) # (2)!
file_path = vbt.search("How to turn df into data?", display_kwargs=dict(open_browser=False)) # (3)!
file_path = vbt.search("How to fix 'Symbols have mismatching columns'?", asset_names="messages") # (4)!
file_path = vbt.search("How to use templates in signal_func_nb?", asset_names="examples", display=100) # (5)!
- Search API pages, documentation pages, and Discord messages for a query, and display the 20 most relevant results
- Return the results instead of displaying them
- Don't automatically open the HTML file in the browser
- Search Discord messages only
- Search for and display the 100 most relevant code examples
Chatting¶
Knowledge assets can be used as a context in chatting with LLMs. The method responsible for chatting is Contextable.chat, which dumps the asset instance, packs it together with your question and chat history into messages, sends them to the LLM service, and displays and persists the response. The response can be displayed in a variety of formats, including raw text, Markdown, and HTML. All three formats support streaming. This method also supports multiple LLM APIs, including OpenAI, LiteLLM, and LLamaIndex.
env["OPENAI_API_KEY"] = "<OPENAI_API_KEY>" # (1)!
# ______________________________________________________________
patterns_tutorial = pages_asset.find_page( # (2)!
"https://vectorbt.pro/pvt_xxxxxxxx/tutorials/patterns-and-projections/patterns/",
aggregate=True
)
patterns_tutorial.chat("How to detect a pattern?")
data_documentation = pages_asset.select_branch("documentation/data").aggregate() # (3)!
data_documentation.chat("How to convert DataFrame into vbt.Data?")
pfo_api = pages_asset.find_obj(vbt.PFO, aggregate=True) # (4)!
pfo_api.chat("How to rebalance weekly?")
combined_asset = pages_asset + messages_asset
signal_func_nb_code = combined_asset.find_code("signal_func_nb") # (5)!
signal_func_nb_code.chat("How to pass an array to signal_func_nb?")
polakowo_messages = messages_asset.filter("author == '@polakowo'").minimize().shuffle()
polakowo_messages.chat("Describe the author of these messages", max_tokens=10_000) # (6)!
mention_fields = combined_asset.find(
"parameterize",
mode="substring",
return_type="field",
merge_fields=False
)
mention_counts = combined_asset.find(
"parameterize",
mode="substring",
return_type="match",
merge_matches=False
).apply(len)
sorted_fields = mention_fields.sort(keys=mention_counts, reverse=True).merge()
sorted_fields.chat("How to parameterize a function?") # (7)!
vbt.settings.set("knowledge.chat.max_tokens", None) # (8)!
# ______________________________________________________________
chat_history = []
signal_func_nb_code.chat("How to check if we're in a long position?", chat_history) # (9)!
signal_func_nb_code.chat("How about short one?", chat_history) # (10)!
chat_history.clear() # (11)!
signal_func_nb_code.chat("How to access close price?", chat_history)
# ______________________________________________________________
asset.chat(..., completions="openai", model="o1-mini", system_as_user=True) # (12)!
# (13)!
# vbt.settings.set("knowledge.chat.completions_configs.openai.model", "o1-mini")
# (14)!
# vbt.OpenAICompletions.set_settings({"model": "o1-mini"})
env["OPENAI_API_KEY"] = "<YOUR_OPENROUTER_API_KEY>"
asset.chat(..., completions="openai", base_url="https://openrouter.ai/api/v1", model="openai/gpt-4o")
# vbt.settings.set("knowledge.chat.completions_configs.openai.base_url", "https://openrouter.ai/api/v1")
# vbt.settings.set("knowledge.chat.completions_configs.openai.model", "openai/gpt-4o")
# vbt.OpenAICompletions.set_settings({
# "base_url": "https://openrouter.ai/api/v1",
# "model": "openai/gpt-4o"
# })
env["DEEPSEEK_API_KEY"] = "<YOUR_DEEPSEEK_API_KEY>"
asset.chat(..., completions="litellm", model="deepseek/deepseek-coder")
# vbt.settings.set("knowledge.chat.completions_configs.litellm.model", "deepseek/deepseek-coder")
# vbt.LiteLLMCompletions.set_settings({"model": "deepseek/deepseek-coder"})
asset.chat(..., completions="llama_index", llm="perplexity", model="claude-3-5-sonnet-20240620") # (15)!
# vbt.settings.set("knowledge.chat.completions_configs.llama_index.llm", "anthropic")
# anthropic_config = {"model": "claude-3-5-sonnet-20240620"}
# vbt.settings.set("knowledge.chat.completions_configs.llama_index.anthropic", anthropic_config)
# vbt.LlamaIndexCompletions.set_settings({"llm": "anthropic", "anthropic": anthropic_config})
vbt.settings.set("knowledge.chat.completions", "litellm") # (16)!
# ______________________________________________________________
asset.chat(..., stream=False) # (17)!
asset.chat(..., formatter="plain") # (18)!
asset.chat(..., formatter="ipython_markdown") # (19)!
asset.chat(..., formatter="ipython_html") # (20)!
file_path = asset.chat(..., formatter="html") # (21)!
file_path = asset.chat(..., formatter="html", formatter_kwargs=dict(cache_dir="chat")) # (22)!
file_path = asset.chat(..., formatter="html", formatter_kwargs=dict(clear_cache=True)) # (23)!
file_path = asset.chat(..., formatter="html", formatter_kwargs=dict(cache=False)) # (24)!
file_path = asset.chat( # (25)!
...,
formatter="html",
formatter_kwargs=dict(
to_markdown_kwargs=dict(...),
to_html_kwargs=dict(...),
format_html_kwargs=dict(...)
)
)
asset.chat(..., formatter_kwargs=dict(update_interval=1.0)) # (26)!
asset.chat(..., formatter_kwargs=dict(output_to="response.txt")) # (27)!
asset.chat( # (28)!
...,
system_prompt="You are a helpful assistant",
context_prompt="Here's what you need to know: $context"
)
- Setting the API key as an environment variable makes it available to all packages. Another way is to pass api_key directly or save it to the settings, similarly to model below.
- Paste a link from the website (here, the tutorial on patterns) and use it as a context. If you don't know the private hash, you can paste the suffix - see querying.
- Select and aggregate all documentation pages related to data and use them as a context
- Select the API documentation page related to the portfolio optimizer and use it as a context
- Find all mentions of signal_func_nb across the code of all pages and messages and use them as a context
- Filter all messages by @polakowo and keep only relevant fields to fit more data. By default, the context is trimmed to 120,000 tokens (this should depend on the model; GPT-4o has an allowance of 128,000). Content is shuffled to avoid putting more emphasis on older or newer content when trimming the context.
- Get all fields with at least one "parameterize" mention, sort them by the number of mentions in descending order, and merge into one list. When embeddings are unavailable, this is a common workflow when there's too much data.
- Allow for unlimited context
- Append the question and answer to the chat history
- Re-use the chat history to post the question to the current chat
- Clear the chat history to start a new chat
- Specify the model and other client and chat-related arguments for OpenAI API
- Save the model to the settings to use it by default next time
- The same as above
- You can define configs by LLM when using LLamaIndex
- Set the default class for completions
- Disable streaming. By default, streaming is enabled.
- Print the response
- Display the response in Markdown format (requires iPython environment)
- Display the response in HTML format (requires iPython environment)
- Store the response in a static HTML file and display
- Specify a different cache directory. By default, stores data under ./knowledge/vbt/$release_name/chat.
- Clear the cache directory before saving
- Create a temporary file
- When working with HTML, you can provide many arguments that are accepted by VBTAsset.to_html
- When update interval is used, the streaming data is buffered and released periodically. Note that when displaying HTML, the minimum update interval is 1 second.
- In addition to displaying, the raw response can be forwarded to a file
- Customize the system and context prompts
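To make the mechanics concrete, here is a simplified sketch of how a chat method might pack the dumped context, the chat history, and the question into messages; the exact prompts and payload used by VBT may differ:

```python
from string import Template

def pack_messages(context, question, chat_history=(),
                  system_prompt="You are a helpful assistant",
                  context_prompt="Here's what you need to know: $context"):
    """Pack the dumped asset, prior turns, and the question into chat messages."""
    messages = [{"role": "system", "content": system_prompt}]
    messages.append({
        "role": "user",
        "content": Template(context_prompt).substitute(context=context),
    })
    messages.extend(chat_history)  # prior user/assistant turns
    messages.append({"role": "user", "content": question})
    return messages

history = [
    {"role": "user", "content": "How to check if we're in a long position?"},
    {"role": "assistant", "content": "..."},
]
messages = pack_messages("<dumped asset>", "How about short one?", history)
```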
About objects¶
We can chat about a VBT object using chat_about. Under the hood, it calls the method above, but on code examples only. Passed arguments are automatically distributed between find_assets and KnowledgeAsset.chat (see chatting for recipes).
vbt.chat_about(vbt.Portfolio, "How to get trading expectancy?") # (1)!
vbt.chat_about( # (2)!
vbt.Portfolio,
"How to get returns accessor with log returns?",
asset_names="api",
api_kwargs=dict(incl_bases=False, incl_ancestors=False)
)
vbt.chat_about( # (3)!
vbt.Portfolio,
"How to backtest a basic strategy?",
model="o1-mini",
system_as_user=True,
max_tokens=100_000,
shuffle=True
)
# ______________________________________________________________
vbt.Portfolio.chat("How to create portfolio from order records?") # (4)!
vbt.Portfolio.chat("How to get grouped stats?", attr="stats")
- Find knowledge about Portfolio and ask a question by using this knowledge as a context
- Pass knowledge-related arguments. Here, find API knowledge that only contains the Portfolio class.
- Pass chat-related arguments. Here, use the o1-mini model on a shuffled context with a maximum number of tokens of 100k. Also, use the user role instead of the system role for the initial instruction.
- Make a call directly from the object's interface
You can also ask a question about objects that technically do not exist in VBT, or keywords in general, such as "quantstats", which will search for mentions of "quantstats" in pages and messages.
vbt.chat_about(
"sql",
"How to import data from a SQL database?",
resolve=False, # (1)!
find_kwargs=dict(
ignore_case=True,
allow_prefix=True, # (2)!
allow_suffix=True # (3)!
)
)
- Use this to avoid searching for a VBT object with the same name
- Allows prefixes for "sql", such as "from_sql"
- Allows suffixes for "sql", such as "SQLData"
Globally¶
Similarly to the global search function, there is also a global function for chatting - chat. It manipulates documents in the same way, but instead of displaying, it sends them to an LLM for completion.
Info
The first time you run this command, it may take up to 15 minutes to prepare and embed documents. However, most of the preparation steps are cached and stored, so future searches will be significantly faster without needing to repeat the process.
vbt.chat("How to turn df into data?") # (1)!
file_path = vbt.chat("How to turn df into data?", formatter="html") # (2)!
vbt.chat("How to fix 'Symbols have mismatching columns'?", asset_names="messages") # (3)!
vbt.chat(
"How to use templates in signal_func_nb?",
asset_names="examples",
top_k=None,
cutoff=None,
return_chunks=False
) # (4)!
chat_history = []
vbt.chat("How to turn df into data?", chat_history) # (5)!
vbt.chat("What if I have symbols as columns?", chat_history) # (6)!
vbt.chat("How to replace index of data?", chat_history, incl_past_queries=False) # (7)!
_, chat = vbt.chat("How to turn df into data?", return_chat=True) # (8)!
chat.complete("What if I have symbols as columns?")
- Search API pages, documentation pages, and Discord messages for a query, and chat about them. By default, uses a maximum of 100 document chunks.
- Accepts the same arguments as
asset.chat - Chat about Discord messages only
- Use the entire (ranked) asset as a context
- Re-use the chat history to post the question to the current chat
- Take into account the chat history. Considers all user messages when ranking the context.
- Take into account the chat history. Considers only the query when ranking the context.
- The same as above, but the context is fixed for all subsequent queries
RAG¶
VBT deploys a collection of components for vanilla RAG. Most of them are orchestrated and deployed automatically whenever you globally search for knowledge on VBT or chat about VBT.
Tokenizer¶
The Tokenizer class and its subclasses offer an interface for converting text into tokens.
tokenizer = vbt.TikTokenizer() # (1)!
tokenizer = vbt.TikTokenizer(encoding="o200k_base")
tokenizer = vbt.TikTokenizer(model="gpt-4o")
vbt.TikTokenizer.set_settings(encoding="o200k_base") # (2)!
token_count = tokenizer.count_tokens(text) # (3)!
tokens = tokenizer.encode(text)
text = tokenizer.decode(tokens)
# ______________________________________________________________
tokens = vbt.tokenize(text) # (4)!
text = vbt.detokenize(tokens)
tokens = vbt.tokenize(text, tokenizer="tiktoken", model="gpt-4o") # (5)!
- Use the tiktoken package as a tokenizer
- Set default settings
- Use the tokenizer to count tokens in a text, encode text into tokens, or decode tokens back into text
- There are also shortcut methods that construct a tokenizer for you
- Tokenizer type as well as parameters can be passed as keyword arguments to both methods
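The interface can be illustrated with a toy word-level tokenizer; real tokenizers such as tiktoken operate on subword units and fixed vocabularies:

```python
class ToyTokenizer:
    """Word-level stand-in for the Tokenizer interface."""

    def __init__(self):
        self.vocab, self.inverse = {}, {}

    def encode(self, text):
        """Encode text into a list of integer tokens."""
        tokens = []
        for word in text.split():
            if word not in self.vocab:
                self.vocab[word] = len(self.vocab)
                self.inverse[self.vocab[word]] = word
            tokens.append(self.vocab[word])
        return tokens

    def decode(self, tokens):
        """Decode tokens back into text."""
        return " ".join(self.inverse[t] for t in tokens)

    def count_tokens(self, text):
        """Count tokens in a text."""
        return len(self.encode(text))

tok = ToyTokenizer()
tokens = tok.encode("to be or not to be")
print(tok.decode(tokens))  # to be or not to be
print(tok.count_tokens("to be or not to be"))  # 6
```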
Embeddings¶
The Embeddings class and its subclasses offer an interface for generating vector representations of text.
embeddings = vbt.OpenAIEmbeddings() # (1)!
embeddings = vbt.OpenAIEmbeddings(batch_size=256) # (2)!
embeddings = vbt.OpenAIEmbeddings(model="text-embedding-3-large") # (3)!
embeddings = vbt.LiteLLMEmbeddings(model="openai/text-embedding-3-large") # (4)!
embeddings = vbt.LlamaIndexEmbeddings(embedding="openai", model="text-embedding-3-large") # (5)!
embeddings = vbt.LlamaIndexEmbeddings(embedding="huggingface", model_name="BAAI/bge-small-en-v1.5")
vbt.OpenAIEmbeddings.set_settings(model="text-embedding-3-large") # (6)!
emb = embeddings.get_embedding(text) # (7)!
embs = embeddings.get_embeddings(texts)
# ______________________________________________________________
emb = vbt.embed(text) # (8)!
embs = vbt.embed(texts)
emb = vbt.embed(text, embeddings="openai", model="text-embedding-3-large") # (9)!
- Use the openai package as an embeddings provider
- Process embeddings in batches of 256
- Specify the model
- Use the litellm package as an embeddings provider
- Use the llamaindex package as an embeddings provider
- Set default settings
- Use the embeddings provider to get embeddings for one or more texts
- There is also a shortcut method that constructs an embeddings provider for you. It accepts one or more texts.
- Embeddings provider type as well as parameters can be passed as keyword arguments to the method
Completions¶
The Completions class and its subclasses offer an interface for generating text completions based on user queries. For arguments such as formatter, see chatting.
completions = vbt.OpenAICompletions() # (1)!
completions = vbt.OpenAICompletions(stream=False)
completions = vbt.OpenAICompletions(max_tokens=100_000, tokenizer="tiktoken")
completions = vbt.OpenAICompletions(model="o1-mini", system_as_user=True)
completions = vbt.OpenAICompletions(formatter="html", formatter_kwargs=dict(cache=False))
completions = vbt.LiteLLMCompletions(model="openai/o1-mini", system_as_user=True) # (2)!
completions = vbt.LlamaIndexCompletions(llm="openai", model="o1-mini", system_as_user=True) # (3)!
vbt.OpenAICompletions.set_settings(model="o1-mini", system_as_user=True) # (4)!
completions.get_completion(text) # (5)!
# ______________________________________________________________
vbt.complete(text) # (6)!
vbt.complete(text, completions="openai", model="o1-mini", system_as_user=True) # (7)!
- Use the openai package as a completions provider
- Use the litellm package as a completions provider
- Use the llamaindex package as a completions provider
- Set default settings
- Use the completions provider to get a completion for a text
- There is also a shortcut method that constructs a completions provider for you
- Completions provider type as well as parameters can be passed as keyword arguments to the method
Text splitter¶
The TextSplitter class and its subclasses offer an interface for splitting text.
text_splitter = vbt.TokenSplitter() # (1)!
text_splitter = vbt.TokenSplitter(chunk_size=1000, chunk_overlap=200)
text_splitter = vbt.SegmentSplitter() # (2)!
text_splitter = vbt.SegmentSplitter(separators=r"\s+") # (3)!
text_splitter = vbt.SegmentSplitter(separators=[r"(?<=[.!?])\s+", r"\s+", None]) # (4)!
text_splitter = vbt.SegmentSplitter(tokenizer="tiktoken", tokenizer_kwargs=dict(model="gpt-4o"))
text_splitter = vbt.LlamaIndexSplitter(node_parser="SentenceSplitter") # (5)!
vbt.TokenSplitter.set_settings(chunk_size=1000, chunk_overlap=200) # (6)!
text_chunks = text_splitter.split_text(text) # (7)!
# ______________________________________________________________
text_chunks = vbt.split_text(text) # (8)!
text_chunks = vbt.split_text(text, text_splitter="llamaindex", node_parser="SentenceSplitter") # (9)!
- Create a splitter that splits into tokens
- Create a splitter that splits into segments. By default, splits into paragraphs and sentences, words (if the sentence is too long), and then tokens (if the word is too long).
- Split text strictly into words
- Split text into sentences, words (if the sentence is too long), and then tokens (if the word is too long).
- Use SentenceSplitter from the llamaindex package as a text splitter
- Set default settings
- Use the text splitter to split a text, which returns a generator
- There is also a shortcut method that constructs a text splitter for you
- Text splitter type as well as parameters can be passed as keyword arguments to the method
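The fallback-separator idea behind SegmentSplitter can be sketched as a recursive splitter: split with the first separator, and re-split any segment that is still too long using the remaining separators. This simplified sketch uses a character limit instead of a token limit and does not pack segments back into chunks:

```python
import re

def split_segments(text, separators, max_len=40):
    """Split with the first separator; recurse with the rest on long segments."""
    if len(text) <= max_len or not separators:
        return [text]
    sep = separators[0]
    # None means "split into individual characters" (the last resort)
    parts = re.split(sep, text) if sep is not None else list(text)
    chunks = []
    for part in parts:
        if part:
            chunks.extend(split_segments(part, separators[1:], max_len))
    return chunks

text = "First sentence here. Second sentence is considerably longer than the limit allows."
chunks = split_segments(text, [r"(?<=[.!?])\s+", r"\s+", None])
print(chunks[0])  # First sentence here.
```

Here the short first sentence survives intact, while the long second sentence falls through to the word-level separator.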
Object store¶
The ObjectStore class and its subclasses offer an interface for efficiently storing and retrieving arbitrary Python objects, such as text documents and embeddings. Such objects must subclass StoreObject.
obj_store = vbt.DictStore() # (1)!
obj_store = vbt.MemoryStore(store_id="abc") # (2)!
obj_store = vbt.MemoryStore(purge_on_open=True) # (3)!
obj_store = vbt.FileStore(dir_path="./file_store") # (4)!
obj_store = vbt.FileStore(consolidate=True, use_patching=False) # (5)!
obj_store = vbt.LMDBStore(dir_path="./lmdb_store") # (6)!
obj_store = vbt.CachedStore(obj_store=vbt.FileStore()) # (7)!
obj_store = vbt.CachedStore(obj_store=vbt.FileStore(), mirror=True) # (8)!
vbt.FileStore.set_settings(consolidate=True, use_patching=False) # (9)!
obj = vbt.TextDocument(id_, text) # (10)!
obj = vbt.TextDocument.from_data(text) # (11)!
obj = vbt.TextDocument.from_data( # (12)!
{"timestamp": timestamp, "content": text},
text_path="content",
excl_embed_metadata=["timestamp"],
dump_kwargs=dict(dump_engine="nestedtext")
)
obj1 = vbt.StoreEmbedding(id1, child_ids=[id2, id3]) # (13)!
obj2 = vbt.StoreEmbedding(id2, parent_id=id1, embedding=embedding2)
obj3 = vbt.StoreEmbedding(id3, parent_id=id1, embedding=embedding3)
with obj_store: # (14)!
obj = obj_store[obj.id_]
obj_store[obj.id_] = obj
del obj_store[obj.id_]
print(len(obj_store))
for id_, obj in obj_store.items():
...
- Create an object store based on a simple dictionary, where data persists only for the lifetime of the instance
- Create an object store based on a global dictionary (memory_store), where data persists for the lifetime of the Python process
- Clear the store upon opening, removing any data saved by previous instances
- Create an object store based on a file (without patching) or folder (with patching). Patching means that additional changes will be added as separate files.
- Consolidate the folder (if any) into a file and disable patching
- Create an object store based on LMDB
- Create an object store that caches operations of another object store internally
- Same as above, but mirrors operations in the global dictionary (memory_store) to persist the cache for the lifetime of the Python process
- Set default settings
- Create a text document object
- Generate id automatically from text
- Create a text document object with metadata
- Create three embedding objects with a relationship
- An object store is very easy to use: it behaves just like a regular dictionary
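A minimal dict-backed stand-in illustrating this mapping-style interface (the real stores add persistence, patching, and caching on top):

```python
class ToyStore:
    """Dict-backed object store mimicking the mapping-style interface."""

    def __init__(self, purge_on_open=False):
        self._objects = {}
        self.purge_on_open = purge_on_open

    def __enter__(self):  # "open" the store
        if self.purge_on_open:
            self._objects.clear()
        return self

    def __exit__(self, *exc_info):  # "close" the store
        return False

    def __getitem__(self, id_):
        return self._objects[id_]

    def __setitem__(self, id_, obj):
        self._objects[id_] = obj

    def __delitem__(self, id_):
        del self._objects[id_]

    def __len__(self):
        return len(self._objects)

    def items(self):
        return self._objects.items()

store = ToyStore()
with store:
    store["doc1"] = "hello"
    print(len(store))  # 1
```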
Document ranker¶
The DocumentRanker class offers an interface for embedding, scoring, and ranking documents.
doc_ranker = vbt.DocumentRanker() # (1)!
doc_ranker = vbt.DocumentRanker(dataset_id="abc") # (2)!
doc_ranker = vbt.DocumentRanker( # (3)!
embeddings="litellm",
embeddings_kwargs=dict(model="openai/text-embedding-3-large")
)
doc_ranker = vbt.DocumentRanker( # (4)!
doc_store="file",
doc_store_kwargs=dict(dir_path="./doc_file_store"),
emb_store="file",
emb_store_kwargs=dict(dir_path="./emb_file_store"),
)
doc_ranker = vbt.DocumentRanker(score_func="dot", score_agg_func="max") # (5)!
vbt.DocumentRanker.set_settings(doc_store="memory", emb_store="memory") # (6)!
documents = [vbt.TextDocument("text1"), vbt.TextDocument("text2")] # (7)!
doc_ranker.embed_documents(documents) # (8)!
emb_documents = doc_ranker.embed_documents(documents, return_documents=True)
embs = doc_ranker.embed_documents(documents, return_embeddings=True)
doc_ranker.embed_documents(documents, refresh=True) # (9)!
doc_scores = doc_ranker.score_documents("How to use VBT?", documents) # (10)!
chunk_scores = doc_ranker.score_documents("How to use VBT?", documents, return_chunks=True)
scored_documents = doc_ranker.score_documents("How to use VBT?", documents, return_documents=True)
documents = doc_ranker.rank_documents("How to use VBT?", documents) # (11)!
scored_documents = doc_ranker.rank_documents("How to use VBT?", documents, return_scores=True)
documents = doc_ranker.rank_documents("How to use VBT?", documents, top_k=50) # (12)!
documents = doc_ranker.rank_documents("How to use VBT?", documents, top_k=0.1) # (13)!
documents = doc_ranker.rank_documents("How to use VBT?", documents, top_k="elbow") # (14)!
documents = doc_ranker.rank_documents("How to use VBT?", documents, cutoff=0.5, min_top_k=20) # (15)!
# ______________________________________________________________
vbt.embed_documents(documents) # (16)!
vbt.embed_documents(documents, embeddings="openai", model="text-embedding-3-large")
documents = vbt.rank_documents("How to use VBT?", documents)
- Create a document ranker
- Set store id for both the document store and embedding store
- Specify the embeddings provider type as well as parameters
- Specify the object store types for documents and embeddings
- Specify the score function and score aggregation function (both can be callables)
- Set default settings
- A document ranker accepts an iterable of store objects, such as text documents
- Use the document ranker to embed documents (embeddings are stored in an embedding store). You can also specify whether to return the embedded documents, or the embeddings themselves.
- If a document or embedding already exists in the store, override it
- Give each document a similarity score relative to the query, and return the scores. You can also specify whether to return the scores for the chunks (they are aggregated by default), or the documents together with their scores.
- Rank documents based on similarity scores relative to the query. By default, simply reorders the documents, but you can also specify whether to return the documents together with their scores.
- Pick top 50 documents
- Pick top 10% documents
- Pick top documents based on an algorithm such as Elbow method
- Pick top 20 documents or more with a similarity score above 0.5
- There are also shortcut methods for embedding and ranking that construct a document ranker for you
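The top_k and cutoff selection logic can be sketched as follows; this is an illustration of the parameter semantics described above (the "elbow" variant is omitted), not VBT's actual implementation:

```python
def select_top(scores, top_k=None, cutoff=None, min_top_k=None):
    """Return indices of documents to keep, ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    if cutoff is not None:
        keep = [i for i in order if scores[i] >= cutoff]
        if min_top_k is not None and len(keep) < min_top_k:
            keep = order[:min_top_k]  # keep at least min_top_k documents
        order = keep
    if top_k is not None:
        if isinstance(top_k, float) and 0 < top_k < 1:
            top_k = max(1, int(len(order) * top_k))  # fraction of documents
        order = order[:top_k]
    return order

scores = [0.9, 0.2, 0.7, 0.4]
print(select_top(scores, top_k=2))                  # [0, 2]
print(select_top(scores, cutoff=0.5, min_top_k=3))  # [0, 2, 3]
```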
Pipeline¶
The components mentioned above can enhance RAG pipelines, extending their utility beyond the VBT scope.
data = [
"The Eiffel Tower is not located in London.",
"The Great Wall of China is not visible from Jupiter.",
"HTML is not a programming language."
]
query = "Where the Eiffel Tower is not located?"
documents = map(vbt.TextDocument.from_data, data)
retrieved_documents = vbt.rank_documents(query, documents, top_k=1)
context = "\n\n".join(map(str, retrieved_documents))
vbt.complete(query, context=context)