Audio Embedders

EncoDec



EncoDec

 EncoDec (device:str='cpu')

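Wrapper around the EnCodec neural audio codec. The device argument selects where the model runs (default 'cpu'). Calling an instance with a waveform and its sample rate returns the quantized codes, and decode maps codes back to a waveform at encodec.sample_rate.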

Usage

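import matplotlib.pyplot as plt
import torchaudio

# EncoDec and plot_waveform are provided by this package.
wav, sr = torchaudio.load("../data/audio/obama.wav")
# wav, sr = torch.rand((1, 24000)), 24000
# wav, sr = np.random.random((1, 24000)), 24000

# Encode the waveform into discrete codes.
encodec = EncoDec(device='cpu')
codes = encodec(wav, sr)
print(f"wav: {wav.shape}, code: {codes.shape} ")

# Show the codes as an image: one row per quantizer, one column per frame.
plt.rcParams["figure.figsize"] = (5,5)
plt.xlabel('frames')
plt.ylabel('quantization')
plt.imshow(codes.squeeze().cpu().numpy())

# Decode the codes back to audio and plot the reconstruction.
decoded = encodec.decode(codes)
plot_waveform(decoded.detach().cpu().squeeze(0), encodec.sample_rate)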
/Users/slegroux/miniforge3/envs/nimrod/lib/python3.11/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
wav: torch.Size([1, 102400]), code: torch.Size([1, 8, 480]) 

plt.plot(codes[0][0])
print(codes[0][0].shape)
torch.Size([480])
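
The shapes printed above can be read as (batch, quantizers, frames). The short sketch below works out the timing they imply. It assumes a frame rate of 75 frames per second, taken from the default frame_shift of EncoDecConfig rather than from the EncoDec class itself, and it assumes the frames cover the clip exactly. The helper name describe_codes is hypothetical and not part of this package.

```python
# Shapes copied from the output above. FRAME_RATE is an assumption taken from
# the default frame_shift of EncoDecConfig (1/75 s), not read from EncoDec.
NUM_SAMPLES = 102400          # wav: torch.Size([1, 102400])
CODES_SHAPE = (1, 8, 480)     # code: torch.Size([1, 8, 480])
FRAME_RATE = 75               # frames per second


def describe_codes(codes_shape, num_samples, frame_rate=FRAME_RATE):
    """Return the timing implied by a (batch, quantizers, frames) code tensor."""
    batch, n_quantizers, n_frames = codes_shape
    duration_s = n_frames / frame_rate
    # Multiply before dividing so the result is not affected by rounding of duration_s.
    implied_input_rate_hz = num_samples * frame_rate / n_frames
    return {
        "batch": batch,
        "quantizers": n_quantizers,
        "frames": n_frames,
        "duration_s": duration_s,
        "implied_input_rate_hz": implied_input_rate_hz,
    }


info = describe_codes(CODES_SHAPE, NUM_SAMPLES)
print(info)
```

Under these assumptions, the 480 frames correspond to a clip of about 6.4 seconds, which suggests the input file was sampled at 16 kHz before being passed to the encoder.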

Lhotse-style EnCodec feature extractor



EncoDecExtractor

 EncoDecExtractor (config=EncoDecConfig(frame_shift=0.013333333333333334,
                   n_q=8))

EncoDecExtractor builds on FeatureExtractor, the base class for all feature extractors in Lhotse. A feature extractor is initialized with a config object specific to its feature extraction method (here, EncoDecConfig). The config is expected to be a dataclass so that it can be easily serialized.

All derived feature extractors must implement at least the following:

  • a name class attribute (what these features are called, e.g. ‘mfcc’),
  • a config_type class attribute that points to the configuration dataclass type,
  • the extract method,
  • the frame_shift property.

Feature extractors that support feature-domain mixing should additionally specify two static methods:

  • compute_energy, and
  • mix.

By itself, the FeatureExtractor offers the following high-level methods that are not intended for overriding:

  • extract_from_samples_and_store
  • extract_from_recording_and_store

These methods run a larger feature extraction pipeline that involves data augmentation and disk storage.
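
The four required members listed above can be illustrated with a minimal sketch. To stay self-contained, it does not import Lhotse: ToyConfig and ToyEnergyExtractor are hypothetical stand-ins that only mirror the contract (name, config_type, extract, frame_shift). A real extractor would subclass Lhotse's FeatureExtractor and return a NumPy array from extract.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional, Sequence


@dataclass
class ToyConfig:
    """Hypothetical config; a dataclass so it serializes with asdict()."""
    frame_shift: float = 0.01   # seconds between consecutive frames
    frame_length: float = 0.02  # seconds covered by each frame


class ToyEnergyExtractor:
    """Stand-in that mirrors the four required members; not a Lhotse class."""

    name = "toy-energy"        # what these features are called
    config_type = ToyConfig    # points to the configuration dataclass type

    def __init__(self, config: Optional[ToyConfig] = None):
        self.config = config if config is not None else ToyConfig()

    @property
    def frame_shift(self) -> float:
        return self.config.frame_shift

    def extract(self, samples: Sequence[float], sampling_rate: int) -> List[List[float]]:
        """Return one mean-energy value per frame, shaped (frames, 1)."""
        hop = int(round(self.config.frame_shift * sampling_rate))
        win = int(round(self.config.frame_length * sampling_rate))
        feats = []
        for start in range(0, max(len(samples) - win, 0) + 1, hop):
            frame = samples[start:start + win]
            feats.append([sum(x * x for x in frame) / max(len(frame), 1)])
        return feats


extractor = ToyEnergyExtractor()
feats = extractor.extract([0.5] * 1600, sampling_rate=16000)
print(len(feats), feats[0], asdict(extractor.config))
```

Keeping every tunable value in the dataclass is what makes the extractor easy to serialize: asdict yields a plain dictionary that can be written to a manifest and used to rebuild the same extractor later.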



EncoDecConfig

 EncoDecConfig (frame_shift:float=0.013333333333333334, n_q:int=8)

encodec_extractor = EncoDecExtractor()
# cuts = CutSet.from_file("../recipes/tts/ljspeech/data/first_3.jsonl.gz")
cuts = CutSet.from_file("../data/en/LJSpeech-1.1/first_3.encodec.jsonl.gz")
print(cuts[0])
print(cuts[1])
MonoCut(id='LJ001-0001-0', start=0, duration=9.65501133786848, channel=0, supervisions=[SupervisionSegment(id='LJ001-0001', recording_id='LJ001-0001', start=0.0, duration=9.65501133786848, channel=0, text='Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition', language='English', speaker=None, gender='female', custom=None, alignment=None)], features=Features(type='encodec', num_frames=724, num_features=8, frame_shift=0.013333333333333334, sampling_rate=22050, start=0, duration=9.65501134, storage_type='lilcom_chunky', storage_path='../data/en/LJSpeech-1.1/encodec.lca', storage_key='0,8029,3610', recording_id='None', channels=0), recording=Recording(id='LJ001-0001', sources=[AudioSource(type='file', channels=[0], source='/data/en/LJSpeech/LJSpeech-1.1/wavs/LJ001-0001.wav')], sampling_rate=22050, num_samples=212893, duration=9.65501133786848, channel_ids=[0], transforms=None), custom=None)
MonoCut(id='LJ001-0002-1', start=0, duration=1.899546485260771, channel=0, supervisions=[SupervisionSegment(id='LJ001-0002', recording_id='LJ001-0002', start=0.0, duration=1.899546485260771, channel=0, text='in being comparatively modern.', language='English', speaker=None, gender='female', custom=None, alignment=None)], features=Features(type='encodec', num_frames=142, num_features=8, frame_shift=0.013333333333333334, sampling_rate=22050, start=0, duration=1.89954649, storage_type='lilcom_chunky', storage_path='../data/en/LJSpeech-1.1/encodec.lca', storage_key='11639,2294', recording_id='None', channels=0), recording=Recording(id='LJ001-0002', sources=[AudioSource(type='file', channels=[0], source='/data/en/LJSpeech/LJSpeech-1.1/wavs/LJ001-0002.wav')], sampling_rate=22050, num_samples=41885, duration=1.899546485260771, channel_ids=[0], transforms=None), custom=None)
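
The numbers in the two cuts printed above can be cross-checked with a few lines of arithmetic. This is a sketch under one assumption: that the stored frame count is the cut duration divided by frame_shift, rounded to the nearest integer. The helper expected_num_frames is hypothetical and not a Lhotse function. All values are copied from the printed manifests.

```python
FRAME_SHIFT = 0.013333333333333334  # EncoDecConfig default, i.e. 1/75 s

# (num_samples, sampling_rate, duration, num_frames) copied from the two cuts above.
CUTS = {
    "LJ001-0001-0": (212893, 22050, 9.65501133786848, 724),
    "LJ001-0002-1": (41885, 22050, 1.899546485260771, 142),
}


def expected_num_frames(duration, frame_shift=FRAME_SHIFT):
    # Assumption: frame count is duration / frame_shift, rounded to nearest.
    return int(round(duration / frame_shift))


for cut_id, (num_samples, sampling_rate, duration, num_frames) in CUTS.items():
    # The recording duration should equal num_samples / sampling_rate.
    assert abs(num_samples / sampling_rate - duration) < 1e-9
    # The stored frame count should match the rounded estimate.
    assert expected_num_frames(duration) == num_frames
    print(cut_id, expected_num_frames(duration), "frames")
```

Both cuts pass these checks, so num_frames, duration and frame_shift in the manifest are mutually consistent, and num_features=8 matches the n_q=8 default of EncoDecConfig.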
# torch.set_num_threads(1)
# torch.set_num_interop_threads(1)
# feats = cuts.compute_and_store_features(extractor=Fbank(), storage_path="../recipes/tts/ljspeech/data/feats")
# storage_path = "../.data/en/LJSpeech-1.1"
# # storage_path = "../recipes/tts/ljspeech/data/feats"
# # TODO: make it work with num_jobs>1
# cuts = cuts.compute_and_store_features(
#     extractor=encodec_extractor,
#     storage_path=storage_path,
#     num_jobs=1,
# )
# cuts.to_file("../recipes/tts/ljspeech/data/cuts_encodec.jsonl.gz")
# print(cuts[0])
# cuts[0].plot_features()
# print(cuts)
files = "../data/en/LJSpeech-1.1/cuts_encodec.jsonl.gz"
# files = "../recipes/tts/ljspeech/data/cuts_encodec.jsonl.gz"
cuts = CutSet.from_file(files)
print(cuts)
None

AudioLM

# TO DO