HF asks for WORLD_SIZE and other variables | Learn AI Together | Page 1

pseudo hearth Sep 9, 2023, 9:25 AM


I'm trying to fine-tune an LLM using HuggingFace Transformers, but it keeps asking for WORLD_SIZE, MASTER_ADDR and MASTER_PORT. The process is running on a PC with a single GPU, so I guess WORLD_SIZE=1 and RANK=0, but what about the other variables?

`from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")`

error: Error initializing torch.distributed using env:// rendezvous: environment variable MASTER_ADDR expected, but not set

prime hearth Sep 10, 2023, 8:38 AM


pseudo hearth I'm trying to fine-tune an LLM using HuggingFace Transformers, but it keeps aski...

I'm not familiar with this myself, but you may find a solution in this issue: torch.distributed.launch don't set the right MASTER_ADDR

GitHub

torch.distributed.launch don't set the right MASTER_ADDR · Issue #7...

🐛 Describe the bug it run well in torch 1.8.1, but get hang in torch 1.10.2 and 1.11.0, I guess the reason is the MASTER_ADDR value. TRAINING_SCRIPT.py as following env_dist = os.environ print('e...

pseudo hearth Sep 10, 2023, 9:16 AM


Thanks for your reply. I solved the problem by setting `os.environ['MASTER_ADDR'] = 'localhost'` and `os.environ['MASTER_PORT'] = '29500'`, and then calling `torch.distributed.init_process_group(backend='nccl', world_size=1, rank=0)`.
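For anyone landing here with the same error, the workaround described above could be sketched roughly as follows. The helper names `set_single_process_env` and `init_single_process_group` are made up for illustration; the values `localhost` and `29500` are the ones from the message above, and the `nccl` backend assumes a CUDA-capable GPU. This is only a sketch of the workaround, not an official HF or PyTorch recipe.

```python
import os


def set_single_process_env(addr="localhost", port="29500"):
    """Fill in the variables the env:// rendezvous looks for.

    Hypothetical helper: existing values are kept, so a real launcher
    that already set them is not overridden.
    """
    defaults = {
        "MASTER_ADDR": addr,
        "MASTER_PORT": port,
        "WORLD_SIZE": "1",  # single process
        "RANK": "0",        # that process is rank 0
    }
    for name, value in defaults.items():
        os.environ.setdefault(name, value)
    return {name: os.environ[name] for name in defaults}


def init_single_process_group(backend="nccl"):
    """Initialize a one-process group (needs PyTorch; nccl needs a GPU)."""
    import torch.distributed as dist

    set_single_process_env()
    if not dist.is_initialized():
        dist.init_process_group(backend=backend, world_size=1, rank=0)


if __name__ == "__main__":
    # Only the environment part runs here; call init_single_process_group()
    # before creating TrainingArguments if the error still appears.
    print(set_single_process_env())
```

Using `setdefault` rather than plain assignment is a design choice so the snippet stays harmless if the script is later started through a launcher that sets these variables itself.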


I find it weird that this is needed in a single-machine, single-GPU environment, especially since the HF documentation makes no mention of it and assumes that it will just work.

prime hearth Sep 10, 2023, 9:27 AM


pseudo hearth I find it weird that this is needed in a single machine single gpu environment, ...

Well, this is more of a PyTorch issue than an HF one, but I find it weird that you triggered the distributed APIs while working on a single machine with a single GPU. How did you trigger them?

pseudo hearth Sep 10, 2023, 9:28 AM


I did nothing more than run the code you see above.
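Since the snippet alone does not explain why the distributed path was taken, one thing that might be worth checking is whether any launcher-style variables are already present in the environment (for example left over from a shell profile or an IDE run configuration). The sketch below only reports the variables that `torchrun` normally sets; the helper name `distributed_env_snapshot` is hypothetical, and whether any of these variables is actually the cause in this thread is an assumption, not something confirmed here.

```python
import os

# Variables normally set by a distributed launcher such as torchrun.
LAUNCHER_VARS = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")


def distributed_env_snapshot():
    """Return each launcher variable with its current value, or None if unset."""
    return {name: os.environ.get(name) for name in LAUNCHER_VARS}


if __name__ == "__main__":
    for name, value in distributed_env_snapshot().items():
        status = "unset" if value is None else repr(value)
        print(f"{name}: {status}")
```

If something like `LOCAL_RANK` or `WORLD_SIZE` shows up as set even though the script was started with plain `python`, that would be a place to start looking.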
