dm = LibriSpeechDataModule(
    target_dir="../data/en",
    dataset_parts="mini_librispeech",
    output_dir="../data/en/LibriSpeech/dev-clean-2",
    num_jobs=1
)
Speech to Text Datasets
Base class
STTDataset
STTDataset (tokenizer:lhotse.dataset.collation.TokenCollater, num_mel_bins:int=80)
An abstract class representing a :class:`Dataset`.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite :meth:`__getitem__`, which fetches the data sample for a given key. Subclasses may also overwrite :meth:`__len__`, which many :class:`~torch.utils.data.Sampler` implementations and the default options of :class:`~torch.utils.data.DataLoader` expect to return the size of the dataset. Subclasses may also implement :meth:`__getitems__` to speed up loading of batched samples; this method accepts a list of sample indices for a batch and returns a list of samples.

.. note:: :class:`~torch.utils.data.DataLoader` by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
| | Type | Default | Details |
|---|---|---|---|
| tokenizer | TokenCollater | | text tokenizer |
| num_mel_bins | int | 80 | number of mel spectrogram bins |
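The note above about non-integral keys can be illustrated with a minimal sketch. It uses only plain PyTorch, not `STTDataset` itself; `KeyedDataset`, `KeySampler`, and the `utt-*` keys are hypothetical names invented for this example.

```python
import torch
from torch.utils.data import DataLoader, Dataset, Sampler


class KeyedDataset(Dataset):
    """Map-style dataset indexed by string keys (hypothetical example)."""

    def __init__(self, samples):
        self.samples = samples  # dict: utterance id -> 1-D tensor

    def __getitem__(self, key):
        return self.samples[key]

    def __len__(self):
        return len(self.samples)

    def __getitems__(self, keys):
        # Optional batched fetch: list of keys in, list of samples out.
        return [self.samples[k] for k in keys]


class KeySampler(Sampler):
    """Yields the dataset's keys instead of integer indices."""

    def __init__(self, keys):
        self.keys = list(keys)

    def __iter__(self):
        return iter(self.keys)

    def __len__(self):
        return len(self.keys)


# Six toy "utterances", each a constant vector of length 4.
samples = {f"utt-{i}": torch.full((4,), float(i)) for i in range(6)}
dataset = KeyedDataset(samples)

# Passing the custom sampler replaces the default integer index sampler.
loader = DataLoader(dataset, sampler=KeySampler(samples.keys()), batch_size=2)
batches = list(loader)
print(len(batches), batches[0].shape)
```

Without the custom sampler, the default `DataLoader` would request integer indices such as `0` and `1`, which this dataset cannot resolve.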
LibriSpeech DataModule
LibriSpeechDataModule
LibriSpeechDataModule (target_dir='/data/en', dataset_parts='mini_librispeech', output_dir='../recipes/stt/librispeech/data', num_jobs=1)
A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.
Example::
    import lightning.pytorch as L
    import torch
    import torch.utils.data as data
    from lightning.pytorch.demos.boring_classes import RandomDataset

    class MyDataModule(L.LightningDataModule):
        def prepare_data(self):
            # download, IO, etc. Useful with shared filesystems
            # only called on 1 GPU/TPU in distributed
            ...

        def setup(self, stage):
            # make assignments here (val/train/test split)
            # called on every process in DDP
            dataset = RandomDataset(1, 100)
            self.train, self.val, self.test = data.random_split(
                dataset, [80, 10, 10], generator=torch.Generator().manual_seed(42)
            )

        def train_dataloader(self):
            return data.DataLoader(self.train)

        def val_dataloader(self):
            return data.DataLoader(self.val)

        def test_dataloader(self):
            return data.DataLoader(self.test)

        def teardown(self, stage):
            # clean up state after the trainer stops, delete files...
            # called on every process in DDP
            ...
| | Type | Default | Details |
|---|---|---|---|
| target_dir | str | /data/en | where data will be saved / retrieved |
| dataset_parts | str | mini_librispeech | either full librispeech or mini subset |
| output_dir | str | ../recipes/stt/librispeech/data | where to save manifest |
| num_jobs | int | 1 | number of parallel jobs, depending on the number of CPUs available |
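Since `num_jobs` depends on the CPUs available, one possible way to pick a value is sketched below. `pick_num_jobs` and its `reserve` parameter are hypothetical helpers for illustration and are not part of `LibriSpeechDataModule`.

```python
import os


def pick_num_jobs(reserve: int = 1) -> int:
    """Return a worker count that leaves `reserve` cores free, never below 1."""
    available = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, available - reserve)


num_jobs = pick_num_jobs()
print(num_jobs)
```

The result could then be passed as the `num_jobs` argument when constructing the data module.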
Usage
# skip this at export time to not waste time
# download
# dm.prepare_data()
dm.setup(stage='test')
Dataset parts: 100%|██████████| 1/1 [00:00<00:00, 111.33it/s]
recs = RecordingSet.from_file("../data/en/LibriSpeech/dev-clean-2/librispeech_recordings_dev-clean-2.jsonl.gz")
sup = SupervisionSet("../data/en/LibriSpeech/dev-clean-2/librispeech_supervisions_dev-clean-2.jsonl.gz")
print(len(recs), len(sup))
25 80
test_dl = dm.test_dataloader()
b = next(iter(test_dl))
print(b["feats_pad"].shape, b["tokens_pad"].shape, b["ilens"].shape)
plt.imshow(b["feats_pad"][0].transpose(0,1), origin='lower')
# dm.tokenizer.idx2token(b["tokens_pad"][0])
# dm.tokenizer.inverse(b["tokens_pad"][0], b["ilens"][0])
torch.Size([1, 1113, 80]) torch.Size([1, 163]) torch.Size([1])
<matplotlib.image.AxesImage>
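The batch above has shape `(batch, frames, mel_bins)` for `feats_pad` and `(batch, tokens)` for `tokens_pad`. As a rough illustration of how variable-length utterances could be padded into such a batch, here is a minimal sketch in plain PyTorch. It is not the data module's actual collate function: `pad_batch` is a hypothetical helper, and `ilens` is assumed here to hold the unpadded frame counts, which the source does not state.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_batch(feats, tokens):
    """Pad variable-length features and token ids into dense tensors.

    feats:  list of (frames, mel_bins) float tensors
    tokens: list of 1-D integer tensors
    """
    # Assumption: ilens stores the number of frames before padding.
    ilens = torch.tensor([f.shape[0] for f in feats])
    feats_pad = pad_sequence(feats, batch_first=True, padding_value=0.0)
    tokens_pad = pad_sequence(tokens, batch_first=True, padding_value=0)
    return {"feats_pad": feats_pad, "tokens_pad": tokens_pad, "ilens": ilens}


# Two toy utterances with 7 and 5 frames of 80 mel bins each.
feats = [torch.randn(7, 80), torch.randn(5, 80)]
tokens = [torch.tensor([3, 4, 5]), torch.tensor([6, 7])]
batch = pad_batch(feats, tokens)
print(batch["feats_pad"].shape, batch["tokens_pad"].shape, batch["ilens"].shape)
```

Shorter utterances are filled with zeros up to the longest item in the batch, so the stored lengths are needed later to ignore the padded positions.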
print(dm.cuts_test)
cut = dm.cuts_test[0]
# pprint(cut.to_dict())
cut.plot_audio()
CutSet(len=25) [underlying data type: <class 'dict'>]
<Axes: >