RunPod 5w ago
Yasmin

Very Slow Mapping

Hello! I am trying to run dataset.map(), and it takes only a few minutes on Colab. However, when I run it on any machine on RunPod, it estimates several hours to finish. I reported this to Support, but there is no solution yet. Has anyone faced a similar issue, and how did you solve it? The code below pre-processes an audio dataset for Whisper fine-tuning. Thanks!
def prepare_dataset(batch):
    audio = batch["audio"]

    batch["input_features"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]

    batch["labels"] = tokenizer(batch["translation"]).input_ids

    return batch

dataset = dataset.map(prepare_dataset,
                      remove_columns=dataset.column_names["train"],
                      num_proc=None)
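
Since the same code behaves so differently on the two platforms, one way to narrow it down is to time a sequential write and read in the directory that holds the dataset, once on Colab and once on the RunPod machine. The following is a minimal sketch using only the standard library; the function name `measure_throughput` is hypothetical, and the numbers are only a rough indicator, since the read pass may be served partly from the page cache.

```python
import os
import tempfile
import time


def measure_throughput(directory, size_mb=64):
    """Write then read a scratch file in `directory`.

    Returns (write_mb_per_s, read_mb_per_s). Rough indicator only:
    the read pass may be served partly from the OS page cache.
    """
    chunk = os.urandom(1024 * 1024)  # 1 MiB of random bytes
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        with os.fdopen(fd, "wb") as f:
            for _ in range(size_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push the data to the device
        write_s = max(time.perf_counter() - start, 1e-9)

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(1024 * 1024):
                pass
        read_s = max(time.perf_counter() - start, 1e-9)
    finally:
        os.remove(path)  # always clean up the scratch file
    return size_mb / write_s, size_mb / read_s
```

Calling `measure_throughput(path_to_dataset_dir)` in both environments gives two pairs of MB/s figures to compare. A large gap would point at storage rather than at the mapping function itself.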
5 Replies
nerdylive 5w ago
Is it the same code and dataset size? Where do you store the dataset, and what library do you use in that code?
Yasmin 5w ago
Transformers
nerdylive 5w ago
Yeah
Yasmin 5w ago
Same everything. The dataset is stored on a RunPod "network volume".
nerdylive 5w ago
Ooh, hmm, a network volume is a bit slower than usual. But I haven't done fine-tuning for that yet. Maybe select NVMe when creating pods and filter for 9 vCPUs.
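
The two suggestions above (faster local storage and more vCPUs) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `stage_locally` and `pick_num_proc` are hypothetical, and the source and destination paths are whatever the pod actually mounts. Note also that `os.cpu_count()` can report the host's cores rather than the pod's vCPU allocation, so the result is best treated as an upper bound.

```python
import os
import shutil


def stage_locally(src_dir, dst_dir):
    """Copy a dataset directory from slow storage to local disk once.

    If dst_dir already exists, the earlier copy is reused. An
    interrupted copy is not detected, so delete dst_dir to retry.
    """
    if not os.path.isdir(dst_dir):
        shutil.copytree(src_dir, dst_dir)
    return dst_dir


def pick_num_proc(reserve=1):
    """Use the visible CPU count minus `reserve`, never fewer than one."""
    return max(1, (os.cpu_count() or 1) - reserve)
```

The dataset could then be loaded from the directory returned by `stage_locally(...)`, and `num_proc=pick_num_proc()` passed to the `dataset.map(...)` call from the question in place of `num_proc=None`, so the mapping runs in several worker processes.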