Speech to Text Datasets

Base class


STTDataset

 STTDataset (tokenizer:lhotse.dataset.collation.TokenCollater,
             num_mel_bins:int=80)

An abstract class representing a Dataset.

All datasets that map keys to data samples should subclass it. Every subclass should override __getitem__, which fetches the data sample for a given key. Subclasses may also override __len__, which many torch.utils.data.Sampler implementations and the default options of torch.utils.data.DataLoader expect to return the size of the dataset. Subclasses may also implement __getitems__ to speed up batched sample loading; this method accepts a list of sample indices for a batch and returns a list of samples.

Note: torch.utils.data.DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
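
The map-style protocol and the custom-sampler requirement described above can be illustrated without PyTorch. The following is a minimal sketch in plain Python: KeyedDataset and KeySampler are hypothetical names invented for this illustration, not part of this library or of PyTorch, and the sample data is made up.

```python
class KeyedDataset:
    """Plain-Python stand-in for a map-style dataset keyed by strings."""

    def __init__(self, samples):
        self._samples = dict(samples)

    def __getitem__(self, key):
        # Fetch a single sample for the given key.
        return self._samples[key]

    def __len__(self):
        # Size of the dataset, as samplers and loaders expect.
        return len(self._samples)

    def __getitems__(self, keys):
        # Batched fetch: a list of keys in, a list of samples out.
        return [self._samples[k] for k in keys]

    def keys(self):
        return list(self._samples)


class KeySampler:
    """Yields the dataset's own keys instead of integer indices."""

    def __init__(self, keys):
        self._keys = list(keys)

    def __iter__(self):
        return iter(self._keys)

    def __len__(self):
        return len(self._keys)


dataset = KeyedDataset({"utt-a": "hello", "utt-b": "world", "utt-c": "again"})
sampler = KeySampler(dataset.keys())

# One sample at a time, driven by the sampler's keys.
fetched = [dataset[key] for key in sampler]

# One batch at a time, using the optional batched method.
batch = dataset.__getitems__(["utt-a", "utt-c"])
```

In a real pipeline the dataset would subclass torch.utils.data.Dataset and the sampler would be passed to the DataLoader through its sampler argument, so that the loader asks for string keys rather than integers.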

Parameter     Type           Default     Details
tokenizer     TokenCollater  (required)  text tokenizer
num_mel_bins  int            80          number of mel spectrogram bins

LibriSpeech DataModule


LibriSpeechDataModule

 LibriSpeechDataModule (target_dir='/data/en',
                        dataset_parts='mini_librispeech',
                        output_dir='../recipes/stt/librispeech/data',
                        num_jobs=1)

A DataModule standardizes the training, validation, and test splits, data preparation, and transforms. The main advantage is that data splits, data preparation, and transforms stay consistent across models.

Example:

import torch
import lightning.pytorch as L
import torch.utils.data as data
from lightning.pytorch.demos.boring_classes import RandomDataset

class MyDataModule(L.LightningDataModule):
    def prepare_data(self):
        # download, IO, etc. Useful with shared filesystems
        # only called on 1 GPU/TPU in distributed
        ...

    def setup(self, stage):
        # make assignments here (val/train/test split)
        # called on every process in DDP
        dataset = RandomDataset(1, 100)
        self.train, self.val, self.test = data.random_split(
            dataset, [80, 10, 10], generator=torch.Generator().manual_seed(42)
        )

    def train_dataloader(self):
        return data.DataLoader(self.train)

    def val_dataloader(self):
        return data.DataLoader(self.val)

    def test_dataloader(self):
        return data.DataLoader(self.test)

    def teardown(self, stage):
        # clean up state after the trainer stops, delete files...
        # called on every process in DDP
        ...

Parameter      Type  Default                          Details
target_dir     str   /data/en                         where data will be saved / retrieved
dataset_parts  str   mini_librispeech                 either full librispeech or mini subset
output_dir     str   ../recipes/stt/librispeech/data  where to save manifest
num_jobs       int   1                                num_jobs depending on number of cpus available
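
Since num_jobs is meant to follow the number of available CPUs, one way to pick a value is to derive it from the machine at run time. This is only a sketch: suggest_num_jobs is a hypothetical helper, not part of this library, and leaving one core free is an assumption rather than a documented recommendation.

```python
import os


def suggest_num_jobs(reserve=1):
    """Return a job count based on the CPU count, never less than 1."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cpus - reserve)


# Hypothetical usage: pass this value as num_jobs to the DataModule.
num_jobs = suggest_num_jobs()
```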

Usage

dm = LibriSpeechDataModule(
    target_dir="../data/en", 
    dataset_parts="mini_librispeech",
    output_dir="../data/en/LibriSpeech/dev-clean-2",
    num_jobs=1
)
# prepare_data() downloads the data; skip it at export time to save time
# dm.prepare_data()
dm.setup(stage='test')
Dataset parts: 100%|██████████| 1/1 [00:00<00:00, 111.33it/s]
from lhotse import RecordingSet, SupervisionSet

# load both manifests from disk with from_file
recs = RecordingSet.from_file("../data/en/LibriSpeech/dev-clean-2/librispeech_recordings_dev-clean-2.jsonl.gz")
sup = SupervisionSet.from_file("../data/en/LibriSpeech/dev-clean-2/librispeech_supervisions_dev-clean-2.jsonl.gz")
print(len(recs), len(sup))
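
The manifest files loaded here are gzip-compressed JSON Lines files, so the length of a manifest set should correspond to the number of JSON objects in the file, one per line. As a rough cross-check that does not depend on lhotse, the entries can be counted with the standard library. The helper name count_manifest_entries and the two demo entries are made up for this sketch; real manifests carry more fields than "id".

```python
import gzip
import json
import os
import tempfile


def count_manifest_entries(path):
    """Count non-empty lines (one JSON object each) in a .jsonl.gz file."""
    n = 0
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                json.loads(line)  # raises if a line is not valid JSON
                n += 1
    return n


# Made-up two-entry manifest, written to a temporary file for the demo.
demo_entries = [{"id": "rec-1"}, {"id": "rec-2"}]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_manifest.jsonl.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as f:
        for entry in demo_entries:
            f.write(json.dumps(entry) + "\n")
    demo_count = count_manifest_entries(demo_path)

print(demo_count)
```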
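import matplotlib.pyplot as plt

test_dl = dm.test_dataloader()
b = next(iter(test_dl))
print(b["feats_pad"].shape, b["tokens_pad"].shape, b["ilens"].shape)
# transpose to (mel bins, frames) so that time runs along the x-axis
plt.imshow(b["feats_pad"][0].transpose(0, 1), origin='lower')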

# dm.tokenizer.idx2token(b["tokens_pad"][0])
# dm.tokenizer.inverse(b["tokens_pad"][0], b["ilens"][0])
torch.Size([1, 1113, 80]) torch.Size([1, 163]) torch.Size([1])
<matplotlib.image.AxesImage>
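
The batch keys feats_pad, tokens_pad, and ilens suggest that variable-length sequences are padded to a common length and returned together with their original lengths; with a batch size of 1, as in the shapes above, no padding is actually needed. The sketch below illustrates that general idea in plain Python. It is an assumption about the batching scheme, not the library's collation code, and pad_batch and the token values are invented for the illustration.

```python
def pad_batch(sequences, pad_value=0):
    """Pad each sequence to the longest one; also return original lengths."""
    lengths = [len(s) for s in sequences]
    max_len = max(lengths)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    return padded, lengths


# Two made-up token sequences of different lengths.
tokens = [[5, 2, 9], [7, 1]]
tokens_pad, token_lens = pad_batch(tokens)

print(tokens_pad)  # [[5, 2, 9], [7, 1, 0]]
print(token_lens)  # [3, 2]
```

Keeping the lengths next to the padded data lets downstream code ignore the padded positions, for example when computing a loss.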

print(dm.cuts_test)
cut = dm.cuts_test[0]
# pprint(cut.to_dict())
cut.plot_audio()
CutSet(len=25) [underlying data type: <class 'dict'>]
<Axes: >