vllm.benchmarks.datasets.create_txt_slices_dataset ¶
Convert a plain-text file (local path or URL) into a JSONL dataset compatible with CustomDataset (--dataset-name custom) by randomly slicing the tokenized text into prompts.
Each line of the output JSONL contains a prompt (decoded from a random slice of the tokenized source text) and an output_tokens count.
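The slicing step can be sketched as follows. `slice_token_windows` is a hypothetical helper standing in for the real implementation, and integer ids stand in for a real tokenizer's output:

```python
import random

def slice_token_windows(token_ids, num_prompts, input_len, seed=0):
    # Draw num_prompts random fixed-length windows from the tokenized text.
    # Each window would then be decoded back to text to form one prompt.
    rng = random.Random(seed)
    windows = []
    for _ in range(num_prompts):
        start = rng.randint(0, len(token_ids) - input_len)
        windows.append(token_ids[start:start + input_len])
    return windows

# Toy example: 100 integer "token ids", three prompts of 10 tokens each.
windows = slice_token_windows(list(range(100)), num_prompts=3, input_len=10)
```

With a fixed seed the windows are reproducible across runs, which matters for comparable benchmark datasets.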
Usage¶
::

    python -m vllm.benchmarks.datasets.create_txt_slices_dataset \
        --input sonnet.txt \
        --output sonnet_dataset.jsonl \
        --tokenizer gpt2 \
        --num-prompts 1000 \
        --input-len 1024 \
        --output-len 128
The resulting JSONL file can then be used with the serving benchmark::
    python -m vllm.benchmarks.serve \
        --dataset-name custom \
        --dataset-path sonnet_dataset.jsonl \
        ...
create_txt_slices_jsonl ¶
create_txt_slices_jsonl(
    *,
    input_path: str,
    output_path: str,
    tokenizer_name: str,
    num_prompts: int,
    input_len: int,
    output_len: int,
    range_ratio: RangeRatio = 0.0,
    seed: int = 0,
    trust_remote_code: bool = False,
) -> None
Read input_path, slice it into prompts, and write JSONL to output_path.
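A minimal sketch of the records it writes, assuming only the two fields described above (`make_record` is a hypothetical helper, not part of the module):

```python
import json

def make_record(prompt: str, output_len: int) -> str:
    # One JSONL line in the shape CustomDataset consumes:
    # the prompt text plus the requested completion length.
    return json.dumps({"prompt": prompt, "output_tokens": output_len})

line = make_record("Shall I compare thee to a summer's day?", 128)
```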
Source code in vllm/benchmarks/datasets/create_txt_slices_dataset.py
load_text ¶
Load text from a local file or URL.
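A possible shape for this helper, assuming the stdlib urllib for the URL branch (a sketch under those assumptions, not the actual implementation):

```python
import urllib.request

def load_text(source: str) -> str:
    # Fetch from http(s) URLs; otherwise treat source as a local file path.
    if source.startswith(("http://", "https://")):
        with urllib.request.urlopen(source) as resp:
            return resp.read().decode("utf-8")
    with open(source, encoding="utf-8") as f:
        return f.read()
```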