Today we’re releasing a synthetic audio dataset based on TinyStories, which we hope will help with multimodal scaling research. Instructions for dataset usage can be found on our GitHub repo or on the Hugging Face page. We release under CDLA-Sharing-1.0, as per the original dataset.
The data consists of 30,000 hours of story narrations from the original GPT-4-generated instruct dataset, synthesized with XTTS-v2 over three days on one of our H100 nodes. The audio was generated in sentence chunks and concatenated into files of approximately 30 minutes. Due to the causal conditioning of the model, no two words are pronounced identically, so it should be non-trivial for a model to extract semantics from the data. In total, the dataset is about 15TB in size, with a validation subset of about 1%. We also include pre-tokenized data using Meta’s HuBERT and EnCodec models, as used in architectures like AudioLM.
The TinyStories paper showed that very small language models can generate coherent narratives when trained on highly distilled data. Given identical content, plus the additional information conveyed through synthetic intonation, audio models should be able to reach a similar level of performance with the same order of magnitude of compute. We’re very excited to see work pushing this direction of research.
To get started, follow the instructions on the GitHub repo.
import torch
from datasets import load_dataset

val_split = load_dataset('sfcompute/TinyNarrations', split='validation', streaming=True)

# stream one example and lift the waveform into a (1, num_samples) tensor
sample_wav = torch.from_numpy(next(iter(val_split))['audio']['array']).unsqueeze(0)
# open the README.md for more info
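With a waveform tensor in hand, one common preprocessing step is slicing it into fixed-length training windows. A minimal sketch follows; `to_windows` and the choice of window length are our own illustration, not part of the TinyNarrations repo.

```python
import torch

def to_windows(wav: torch.Tensor, window_samples: int) -> torch.Tensor:
    """Split a (1, num_samples) waveform into non-overlapping
    (num_windows, window_samples) chunks, dropping the ragged tail.
    Hypothetical helper for batching long narration files."""
    num_windows = wav.shape[-1] // window_samples
    return wav[..., : num_windows * window_samples].reshape(num_windows, window_samples)
```

For example, a 30-minute file at 24 kHz yields 43.2M samples, which `to_windows` would cut into uniform batches ready for a dataloader.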