# Audio TTS Datasets

Text-To-Speech datasets.

## Lhotse-based Base Class

References:

- https://github.com/Lightning-AI/lightning/issues/10358
- https://colab.research.google.com/drive/1HKSYPsWx_HoCdrnLpaPdYj5zwlPsM3NH
### LhotseTTSDataset

```python
LhotseTTSDataset(tokenizer=TokenCollater,
                 extractor=<lhotse.dataset.input_strategies.OnTheFlyFeatures object>)
```
An abstract class representing a `Dataset`.

All datasets that represent a map from keys to data samples should subclass it. Subclasses must overwrite `__getitem__` to fetch a data sample for a given key. Subclasses may also overwrite `__len__`, which many `torch.utils.data.Sampler` implementations and the default options of `torch.utils.data.DataLoader` expect to return the size of the dataset, and may implement `__getitems__` to speed up batched sample loading; that method accepts a list of sample indices for a batch and returns the corresponding list of samples.

> **Note:** `torch.utils.data.DataLoader` by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
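To make the note concrete, here is a minimal sketch (the class and sampler names are hypothetical, not part of nimrod) of a map-style dataset keyed by strings, served through a custom sampler:

```python
import torch
from torch.utils.data import Dataset, DataLoader, Sampler

class KeyedDataset(Dataset):
    """Map-style dataset whose keys are utterance IDs (strings), not integers."""
    def __init__(self, data: dict):
        self.data = data          # e.g. {"1089_134686_000001": tensor, ...}
        self.keys = list(data)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, key: str):
        return self.data[key]     # fetch by string key, not integer index

class KeySampler(Sampler):
    """Custom sampler that yields the string keys the dataset expects."""
    def __init__(self, keys):
        self.keys = keys

    def __iter__(self):
        return iter(self.keys)

    def __len__(self):
        return len(self.keys)

ds = KeyedDataset({"utt1": torch.zeros(3), "utt2": torch.ones(3)})
dl = DataLoader(ds, sampler=KeySampler(ds.keys), batch_size=2)
print(next(iter(dl)))   # default collate stacks the two (3,) tensors
```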
| | Type | Default | Details |
|---|---|---|---|
| tokenizer | type | `TokenCollater` | text tokenizer |
| extractor | `OnTheFlyFeatures` | `OnTheFlyFeatures` instance | feature extractor |
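The stray snippet originally at the top of this page shows the intended usage (note that lhotse's `TokenCollater` is normally built from a `CutSet`; the no-argument call is kept as in the source):

```python
tok = TokenCollater()
ds = LhotseTTSDataset(tok)
```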
## Default base class

### TTSDataset

```python
TTSDataset(tokenizer, num_mel_bins:int=80, sampling_rate:int=16000)
```
An abstract class representing a `Dataset`; it inherits the same `torch.utils.data.Dataset` docstring reproduced under `LhotseTTSDataset` above.
| | Type | Default | Details |
|---|---|---|---|
| tokenizer | | | text tokenizer |
| num_mel_bins | int | 80 | number of mel spectrogram bins |
| sampling_rate | int | 16000 | sampling rate (Hz) |
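A hedged sketch of what a concrete subclass might look like; the stored `tokenizer` attribute, the utterance container, and the feature extraction below are assumptions for illustration, not nimrod's actual implementation:

```python
import torch

class MyTTSDataset(TTSDataset):
    """Hypothetical subclass returning (mel_features, token_ids) per utterance."""
    def __init__(self, tokenizer, utterances, num_mel_bins=80, sampling_rate=16000):
        super().__init__(tokenizer, num_mel_bins, sampling_rate)
        self.tokenizer = tokenizer     # assumed callable: text -> list of ints
        self.utterances = utterances   # assumed list of (waveform, text) pairs

    def __len__(self):
        return len(self.utterances)

    def __getitem__(self, idx):
        waveform, text = self.utterances[idx]
        # placeholder features; real code would compute a mel spectrogram
        feats = torch.randn(max(1, waveform.shape[-1] // 160), 80)
        tokens = torch.tensor(self.tokenizer(text))
        return feats, tokens
```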
## LibriTTS DataModule

```python
from torchaudio.datasets import LIBRITTS

# (Waveform, Sample_rate, Original_text, Normalized_text, Speaker_ID, Chapter_ID, Utterance_ID)
ds = LIBRITTS("../data/en", 'test-clean')
print(ds[0])
```
```
(tensor([[0.0007, 0.0008, 0.0012, ..., 0.0039, 0.0042, 0.0042]]), 24000, 'He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour-fattened sauce. Stuff it into you, his belly counselled him.', 'He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fattened sauce. Stuff it into you, his belly counselled him.', 1089, 134686, '1089_134686_000001_000001')
```
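Each item is the 7-tuple described in the comment above, so it can be unpacked directly:

```python
waveform, sample_rate, original_text, normalized_text, speaker_id, chapter_id, utterance_id = ds[0]
print(waveform.shape, sample_rate, utterance_id)
```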
```python
plot_waveform(ds[0][0], ds[0][1])
```
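`plot_waveform` is not defined in this section; it likely resembles the helper from the torchaudio tutorials. A minimal sketch under that assumption:

```python
import matplotlib.pyplot as plt
import torch

def plot_waveform(waveform: torch.Tensor, sample_rate: int):
    """Plot each channel of a (channels, frames) waveform against time in seconds."""
    num_channels, num_frames = waveform.shape
    time_axis = torch.arange(num_frames) / sample_rate
    fig, axes = plt.subplots(num_channels, 1, squeeze=False)
    for c in range(num_channels):
        axes[c][0].plot(time_axis, waveform[c], linewidth=1)
        axes[c][0].set_ylabel(f"channel {c}")
    axes[-1][0].set_xlabel("time (s)")
    plt.show()
```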
### LibriTTSDataModule

```python
LibriTTSDataModule(target_dir='/data/en/libriTTS',
                   dataset_parts=['dev-clean', 'test-clean'],
                   output_dir='/home/syl20/slg/nimrod/recipes/libritts/data',
                   num_jobs=0)
```
A DataModule standardizes the training, validation, and test splits, data preparation, and transforms. The main advantage is consistent data splits, data preparation, and transforms across models.

Example:

```python
import lightning.pytorch as L
import torch
import torch.utils.data as data
from pytorch_lightning.demos.boring_classes import RandomDataset

class MyDataModule(L.LightningDataModule):
    def prepare_data(self):
        # download, IO, etc. Useful with shared filesystems
        # only called on 1 GPU/TPU in distributed
        ...

    def setup(self, stage):
        # make assignments here (val/train/test split)
        # called on every process in DDP
        dataset = RandomDataset(1, 100)
        self.train, self.val, self.test = data.random_split(
            dataset, [80, 10, 10], generator=torch.Generator().manual_seed(42)
        )

    def train_dataloader(self):
        return data.DataLoader(self.train)

    def val_dataloader(self):
        return data.DataLoader(self.val)

    def test_dataloader(self):
        return data.DataLoader(self.test)

    def teardown(self):
        # clean up state after the trainer stops, delete files...
        # called on every process in DDP
        ...
```
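To show how such a DataModule is consumed, a short sketch (`MyModel` is a hypothetical `LightningModule`; `Trainer.fit`/`test` accept a `datamodule` argument):

```python
model = MyModel()          # hypothetical LightningModule
dm = MyDataModule()

trainer = L.Trainer(max_epochs=1)
trainer.fit(model, datamodule=dm)    # Lightning calls the prepare_data()/setup() hooks
trainer.test(model, datamodule=dm)
```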
| | Type | Default | Details |
|---|---|---|---|
| target_dir | str | /data/en/libriTTS | where data will be saved / retrieved |
| dataset_parts | list | ['dev-clean', 'test-clean'] | either full LibriTTS or a subset |
| output_dir | str | /home/syl20/slg/nimrod/recipes/libritts/data | where to save the manifest |
| num_jobs | int | 0 | number of jobs, depending on the number of CPUs available |
## Usage
```python
# num_jobs=1 turns parallel computing off within a jupyter notebook; otherwise it fails.
dm = LibriTTSDataModule(
    target_dir="../data/en",
    dataset_parts="test-clean",
    output_dir="../data/en/LibriTTS/test-clean",
    num_jobs=1
)
```
```python
# skip download and use local data folder
# dm.prepare_data()
dm.setup(stage='test')
```
```
Preparing LibriTTS parts: 100%|██████████| 7/7 [00:00<00:00, 8169.21it/s]
```
```python
test_dl = dm.test_dataloader()
batch = next(iter(test_dl))
print(batch.keys())
```
```
dict_keys(['feats_pad', 'feats_lens', 'tokens_pad', 'tokens_lens'])
```
```python
import matplotlib.pyplot as plt  # assumed imported earlier in the notebook

print(batch['feats_pad'].shape)
print(batch['feats_lens'])
plt.imshow(batch['feats_pad'][3].transpose(0, 1))
print(batch['tokens_pad'][3], batch['tokens_lens'][3])
```
```
torch.Size([4, 1515, 80])
tensor([1515, 1478, 1464, 1299])
tensor([ 2, 30, 43, 40,  4, 54, 40, 49, 55, 40, 49, 38, 40,  4, 50, 41,  4, 54,
        36, 44, 49, 55,  4, 45, 36, 48, 40, 54,  4, 58, 43, 44, 38, 43,  4, 54,
        36, 60, 54,  4, 55, 43, 36, 55,  4, 43, 40,  4, 58, 43, 50,  4, 50, 41,
        41, 40, 49, 39, 54,  4, 36, 42, 36, 44, 49, 54, 55,  4, 50, 49, 40,  4,
        38, 50, 48, 48, 36, 49, 39, 48, 40, 49, 55,  4, 37, 40, 38, 50, 48, 40,
        54,  4, 42, 56, 44, 47, 55, 60,  4, 50, 41,  4, 36, 47, 47,  7,  4, 43,
        36, 39,  4, 54, 40, 40, 48, 40, 39,  4, 55, 50,  4, 43, 44, 48,  4, 41,
        44, 53, 54, 55,  4, 36,  4, 54, 58, 50, 47, 47, 40, 49,  4, 51, 43, 53,
        36, 54, 40,  4, 56, 49, 55, 44, 47,  4, 43, 40,  4, 43, 36, 39,  4, 37,
        40, 42, 56, 49,  4, 55, 50,  4, 42, 53, 50, 51, 40,  4, 44, 49,  4, 55,
        43, 40,  4, 39, 36, 53, 46, 49, 40, 54, 54,  4, 50, 41,  4, 43, 44, 54,
         4, 50, 58, 49,  4, 54, 55, 36, 55, 40,  9,  3,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0]) tensor(210, dtype=torch.int32)
```
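Since the batch is zero-padded to the longest utterance, downstream code usually needs a padding mask. A minimal sketch of building one from `feats_lens` (using the batch keys above):

```python
import torch

feats_pad = batch['feats_pad']    # (batch, max_frames, num_mel_bins)
feats_lens = batch['feats_lens']  # true frame count per utterance

# boolean mask: True for real frames, False for padding
max_frames = feats_pad.size(1)
mask = torch.arange(max_frames)[None, :] < feats_lens[:, None]   # (batch, max_frames)

# e.g. a length-aware mean over time that ignores padded frames
masked_sum = (feats_pad * mask[:, :, None]).sum(dim=1)
mean_feats = masked_sum / feats_lens[:, None]
print(mean_feats.shape)   # torch.Size([4, 80])
```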
```python
original_sentences = dm.tokenizer.inverse(batch['tokens_pad'], batch['tokens_lens'])
print(original_sentences)
```
["A certain pride, a certain awe, withheld him from offering to God even one prayer at night, though he knew it was in God's power to take away his life while he slept and hurl his soul hellward ere he could beg for mercy.", 'It was strange too that he found an arid pleasure in following up to the end the rigid lines of the doctrines of the church and penetrating into obscure silences only to hear and feel the more deeply his own condemnation.', 'He had sinned mortally not once but many times and he knew that, while he stood in danger of eternal damnation for the first sin alone, by every succeeding sin he multiplied his guilt and his punishment.', 'The sentence of saint james which says that he who offends against one commandment becomes guilty of all, had seemed to him first a swollen phrase until he had begun to grope in the darkness of his own state.']