Audio TTS Datasets

Dataset classes and Lightning DataModules for text-to-speech (TTS) training

Text-To-Speech

Lhotse-based Base Class

https://github.com/Lightning-AI/lightning/issues/10358
https://colab.research.google.com/drive/1HKSYPsWx_HoCdrnLpaPdYj5zwlPsM3NH


source

LhotseTTSDataset

 LhotseTTSDataset (tokenizer=TokenCollater, extractor=OnTheFlyFeatures(...))

An abstract class representing a torch.utils.data.Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__, which fetches a data sample for a given key. Subclasses may also overwrite __len__, which many Sampler implementations and the default options of DataLoader expect to return the size of the dataset, and may implement __getitems__ to speed up loading of batched samples; that method accepts a list of sample indices for a batch and returns the list of samples.

Note: DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

            Type              Default                Details
tokenizer   type              TokenCollater          text tokenizer
extractor   OnTheFlyFeatures  OnTheFlyFeatures(...)  feature extractor
# tok = TokenCollater(cuts)  # TokenCollater builds its vocabulary from a lhotse CutSet
# ds = LhotseTTSDataset(tok)
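
For context, here is a minimal runnable sketch of how these pieces fit together, assuming a lhotse cut manifest is available on disk (the manifest path below is hypothetical):

from lhotse import CutSet, Fbank
from lhotse.dataset.collation import TokenCollater
from lhotse.dataset.input_strategies import OnTheFlyFeatures

cuts = CutSet.from_file("../data/en/cuts.jsonl.gz")  # hypothetical manifest path
tok = TokenCollater(cuts)              # vocabulary is built from the cuts' supervision texts
extractor = OnTheFlyFeatures(Fbank())  # compute log-mel filterbank features on the fly
ds = LhotseTTSDataset(tok, extractor)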

Default base class


source

TTSDataset

 TTSDataset (tokenizer, num_mel_bins:int=80, sampling_rate:int=16000)

An abstract class representing a torch.utils.data.Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__, which fetches a data sample for a given key. Subclasses may also overwrite __len__, which many Sampler implementations and the default options of DataLoader expect to return the size of the dataset, and may implement __getitems__ to speed up loading of batched samples; that method accepts a list of sample indices for a batch and returns the list of samples.

Note: DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

               Type  Default  Details
tokenizer                     text tokenizer
num_mel_bins   int   80       number of mel spectrogram bins
sampling_rate  int   16000    sampling rate
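
To make the map-style contract above concrete, here is a toy subclass over dummy data; it only illustrates __getitem__ and __len__ and is not part of nimrod:

import torch
from torch.utils.data import Dataset

class ToyTTSDataset(Dataset):
    # toy map-style dataset: placeholder 80-bin "features" paired with character ids
    def __init__(self, texts):
        self.texts = texts

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        feats = torch.zeros(80, 100)                   # placeholder mel frames
        tokens = torch.tensor([ord(c) for c in text])  # naive character tokenization
        return feats, tokens

toy = ToyTTSDataset(["hello world"])
print(len(toy), toy[0][0].shape)  # 1 torch.Size([80, 100])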

LibriTTS DataModule

from torchaudio.datasets import LIBRITTS

# each item: (waveform, sample_rate, original_text, normalized_text, speaker_id, chapter_id, utterance_id)
ds = LIBRITTS("../data/en", 'test-clean')
print(ds[0])
(tensor([[0.0007, 0.0008, 0.0012,  ..., 0.0039, 0.0042, 0.0042]]), 24000, 'He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour-fattened sauce. Stuff it into you, his belly counselled him.', 'He hoped there would be stew for dinner, turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fattened sauce. Stuff it into you, his belly counselled him.', 1089, 134686, '1089_134686_000001_000001')
plot_waveform(ds[0][0], ds[0][1])
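
As a sanity check on the raw audio, an 80-bin mel spectrogram (matching num_mel_bins above) can be computed with torchaudio; this is a sketch, not necessarily the exact front end used by the dataset classes:

import torchaudio.transforms as T

waveform, sr = ds[0][0], ds[0][1]
mel = T.MelSpectrogram(sample_rate=sr, n_mels=80)(waveform)
print(mel.shape)  # (1, 80, num_frames)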


source

LibriTTSDataModule

 LibriTTSDataModule (target_dir='/data/en/libriTTS',
                     dataset_parts=['dev-clean', 'test-clean'],
                     output_dir='/home/syl20/slg/nimrod/recipes/libritts/data',
                     num_jobs=0)

A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.

Example:

import torch
import torch.utils.data as data
import lightning.pytorch as L
from lightning.pytorch.demos.boring_classes import RandomDataset

class MyDataModule(L.LightningDataModule):
    def prepare_data(self):
        # download, IO, etc. Useful with shared filesystems
        # only called on 1 GPU/TPU in distributed
        ...

    def setup(self, stage):
        # make assignments here (val/train/test split)
        # called on every process in DDP
        dataset = RandomDataset(1, 100)
        self.train, self.val, self.test = data.random_split(
            dataset, [80, 10, 10], generator=torch.Generator().manual_seed(42)
        )

    def train_dataloader(self):
        return data.DataLoader(self.train)

    def val_dataloader(self):
        return data.DataLoader(self.val)

    def test_dataloader(self):
        return data.DataLoader(self.test)

    def teardown(self, stage):
        # clean up state after the trainer stops, delete files...
        # called on every process in DDP
        ...
               Type  Default                                       Details
target_dir     str   /data/en/libriTTS                             where data will be saved / retrieved
dataset_parts  list  ['dev-clean', 'test-clean']                   full LibriTTS or a subset
output_dir     str   /home/syl20/slg/nimrod/recipes/libritts/data  where to save the manifest
num_jobs       int   0                                             number of parallel jobs, set according to available CPUs

Usage

# num_jobs=0 turns parallel computing off; within a Jupyter notebook, higher values can make data preparation fail
dm = LibriTTSDataModule(
    target_dir="../data/en", 
    dataset_parts="test-clean",
    output_dir="../data/en/LibriTTS/test-clean",
    num_jobs=1
)
# skip download and use local data folder
# dm.prepare_data()
dm.setup(stage='test')
Preparing LibriTTS parts: 100%|██████████| 7/7 [00:00<00:00, 8169.21it/s]
test_dl = dm.test_dataloader()
batch = next(iter(test_dl))
print(batch.keys())
dict_keys(['feats_pad', 'feats_lens', 'tokens_pad', 'tokens_lens'])
import matplotlib.pyplot as plt

print(batch['feats_pad'].shape)
plt.imshow(batch['feats_pad'][3].transpose(0,1))
print(batch['feats_lens'])
print(batch['tokens_pad'][3], batch['tokens_lens'][3])
torch.Size([4, 1515, 80])
tensor([1515, 1478, 1464, 1299])
tensor([ 2, 30, 43, 40,  4, 54, 40, 49, 55, 40, 49, 38, 40,  4, 50, 41,  4, 54,
        36, 44, 49, 55,  4, 45, 36, 48, 40, 54,  4, 58, 43, 44, 38, 43,  4, 54,
        36, 60, 54,  4, 55, 43, 36, 55,  4, 43, 40,  4, 58, 43, 50,  4, 50, 41,
        41, 40, 49, 39, 54,  4, 36, 42, 36, 44, 49, 54, 55,  4, 50, 49, 40,  4,
        38, 50, 48, 48, 36, 49, 39, 48, 40, 49, 55,  4, 37, 40, 38, 50, 48, 40,
        54,  4, 42, 56, 44, 47, 55, 60,  4, 50, 41,  4, 36, 47, 47,  7,  4, 43,
        36, 39,  4, 54, 40, 40, 48, 40, 39,  4, 55, 50,  4, 43, 44, 48,  4, 41,
        44, 53, 54, 55,  4, 36,  4, 54, 58, 50, 47, 47, 40, 49,  4, 51, 43, 53,
        36, 54, 40,  4, 56, 49, 55, 44, 47,  4, 43, 40,  4, 43, 36, 39,  4, 37,
        40, 42, 56, 49,  4, 55, 50,  4, 42, 53, 50, 51, 40,  4, 44, 49,  4, 55,
        43, 40,  4, 39, 36, 53, 46, 49, 40, 54, 54,  4, 50, 41,  4, 43, 44, 54,
         4, 50, 58, 49,  4, 54, 55, 36, 55, 40,  9,  3,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0]) tensor(210, dtype=torch.int32)
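
The length tensors exist so that padded positions can be masked out downstream; a common pattern (an illustrative sketch, not nimrod code):

import torch

feat_lens = torch.tensor([1515, 1478, 1464, 1299])
mask = torch.arange(feat_lens.max())[None, :] < feat_lens[:, None]  # (B, T), True on real frames
print(mask.shape, mask.sum(dim=1))  # per-utterance frame counts match feats_lens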

original_sentences = dm.tokenizer.inverse(batch['tokens_pad'], batch['tokens_lens'])
print(original_sentences)
["A certain pride, a certain awe, withheld him from offering to God even one prayer at night, though he knew it was in God's power to take away his life while he slept and hurl his soul hellward ere he could beg for mercy.", 'It was strange too that he found an arid pleasure in following up to the end the rigid lines of the doctrines of the church and penetrating into obscure silences only to hear and feel the more deeply his own condemnation.', 'He had sinned mortally not once but many times and he knew that, while he stood in danger of eternal damnation for the first sin alone, by every succeeding sin he multiplied his guilt and his punishment.', 'The sentence of saint james which says that he who offends against one commandment becomes guilty of all, had seemed to him first a swollen phrase until he had begun to grope in the darkness of his own state.']