accelerate multi-GPU training crashes immediately on launch

(kdc) d25-lfz@MILab:/media/ubuntu/data/lfz/kdc/kuavo_data_challenge$ accelerate launch --config_file configs/accelerate/accelerate_config.yaml kuavo_train/train_policy_with_accelerate.py --config-name=act_config task=task1 method=act                                                                                                             
/media/ubuntu/data/lfz/miniconda3/envs/kdc/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'repr' attribute with value False was provided to the `Field()` function, which has no effect in the context it was used. 'repr' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.                                                                                                                  
  warnings.warn(                                                                                                                                                           
/media/ubuntu/data/lfz/miniconda3/envs/kdc/lib/python3.10/site-packages/pydantic/_internal/_generate_schema.py:2249: UnsupportedFieldAttributeWarning: The 'frozen' attribute with value True was provided to the `Field()` function, which has no effect in the context it was used. 'frozen' is field-specific metadata, and can only be attached to a model field using `Annotated` metadata or by assignment. This may have happened because an `Annotated` type alias using the `type` statement was used, or if the `Field()` function was attached to a single member of a union type.                                                                                                               
  warnings.warn(
W0304 17:25:11.423000 1022108 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1022279 closing signal SIGTERM
W0304 17:25:11.424000 1022108 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 1022280 closing signal SIGTERM
E0304 17:25:11.839000 1022108 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: -11) local_rank: 0 (pid: 1022277) of binary: /media/ubuntu/data/lfz/miniconda3/envs/kdc/bin/python3.10
Traceback (most recent call last):
  File "/media/ubuntu/data/lfz/miniconda3/envs/kdc/bin/accelerate", line 6, in <module>
    sys.exit(main())
  File "/media/ubuntu/data/lfz/miniconda3/envs/kdc/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/media/ubuntu/data/lfz/miniconda3/envs/kdc/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1226, in launch_command
    multi_gpu_launcher(args)
  File "/media/ubuntu/data/lfz/miniconda3/envs/kdc/lib/python3.10/site-packages/accelerate/commands/launch.py", line 853, in multi_gpu_launcher
    distrib_run.run(args)
  File "/media/ubuntu/data/lfz/miniconda3/envs/kdc/lib/python3.10/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/media/ubuntu/data/lfz/miniconda3/envs/kdc/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/media/ubuntu/data/lfz/miniconda3/envs/kdc/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=========================================================
kuavo_train/train_policy_with_accelerate.py FAILED
---------------------------------------------------------
Failures:
[1]:
  time      : 2026-03-04_17:25:11
  host      : MILab
  rank      : 1 (local_rank: 1)
  exitcode  : -11 (pid: 1022278)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 1022278
---------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-03-04_17:25:11
  host      : MILab
  rank      : 0 (local_rank: 0)
  exitcode  : -11 (pid: 1022277)
  error_file: <N/A>
  traceback : Signal 11 (SIGSEGV) received by PID 1022277
=========================================================
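The log above carries no Python traceback: the launcher reports only `exitcode: -11`, meaning the workers were killed by signal 11 (SIGSEGV). A minimal sketch of two things that may help narrow this down, using only the standard library: decoding the exit code, and enabling `faulthandler` so that each worker dumps its Python stack when it segfaults. The helper names `describe_exitcode` and `enable_crash_tracebacks` are made up for this illustration and are not part of accelerate or torch.

```python
import faulthandler
import signal
import sys


def describe_exitcode(exitcode: int) -> str:
    """Turn a child exit code, as printed by the launcher, into readable text.

    A negative value -N means the process was terminated by signal N.
    """
    if exitcode < 0:
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            return f"killed by unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


def enable_crash_tracebacks() -> None:
    """Ask Python to dump the stack of every thread on fatal signals.

    Hypothetical helper: call it as the first statement of the training
    script, before torch is imported, so that a SIGSEGV inside native code
    still reports which Python line was executing.
    """
    faulthandler.enable(file=sys.stderr, all_threads=True)


print(describe_exitcode(-11))
```

The same effect is available without touching the script by setting `PYTHONFAULTHANDLER=1` in the shell before running `accelerate launch`.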
[Image: GPU topology]

The image shows the topology: 8× 4090, with no NVLink.

accelerate_config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU     # Multi-GPU mode
fp16: true                     # Set according to your amp configuration
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
num_machines: 1
num_processes: 4                # 4 processes, one per GPU listed in gpu_ids
gpu_ids: "4,5,6,7"              # Which GPU IDs to use for training
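With `distributed_type: MULTI_GPU` the usual arrangement is one process per GPU, so `num_processes` is expected to match the number of entries in `gpu_ids`. A small sanity check along those lines might look like the sketch below. The function `check_launch_config` is hypothetical and takes the two values directly rather than parsing the YAML file; it does not reproduce any validation that accelerate itself performs.

```python
def check_launch_config(num_processes: int, gpu_ids: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK.

    Assumes one process per GPU. The value "all" is accepted as-is because
    the number of visible GPUs is not known without querying the driver.
    """
    if gpu_ids.strip().lower() == "all":
        return []

    ids = [part.strip() for part in gpu_ids.split(",") if part.strip()]
    problems = []
    if len(set(ids)) != len(ids):
        problems.append(f"gpu_ids contains duplicates: {ids}")
    if len(ids) != num_processes:
        problems.append(
            f"num_processes={num_processes} but gpu_ids lists {len(ids)} GPU(s)"
        )
    return problems


# Values taken from the config above.
print(check_launch_config(4, "4,5,6,7"))
```

For the config in this post the check passes, so the mismatch was only in the old comment, not in the values.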

Solution:

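export NCCL_DEBUG=INFO                # print NCCL initialization and transport details
export NCCL_ASYNC_ERROR_HANDLING=1    # read by PyTorch: abort on asynchronous NCCL errors instead of hanging
export NCCL_SHM_DISABLE=1             # disable the shared-memory transport
export NCCL_P2P_DISABLE=1             # disable direct GPU-to-GPU (P2P) transport
export NCCL_IB_DISABLE=1              # disable the InfiniBand/RoCE transport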
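If exporting these in every shell is inconvenient, the same settings can presumably be applied from inside the training script, provided this happens before the process group (and therefore NCCL) is initialized, for example before the `Accelerator` object is created. The sketch below shows one way to do that. `apply_nccl_workarounds` is a hypothetical helper, and values already set in the shell take precedence so the workaround can still be overridden from outside.

```python
import os

# The settings from the solution above.
NCCL_WORKAROUND_ENV = {
    "NCCL_DEBUG": "INFO",
    "NCCL_ASYNC_ERROR_HANDLING": "1",
    "NCCL_SHM_DISABLE": "1",
    "NCCL_P2P_DISABLE": "1",
    "NCCL_IB_DISABLE": "1",
}


def apply_nccl_workarounds(environ=os.environ) -> dict:
    """Set each variable unless it is already defined.

    Returns only the entries this call actually added, which is handy
    for logging what was changed.
    """
    applied = {}
    for name, value in NCCL_WORKAROUND_ENV.items():
        if name not in environ:
            environ[name] = value
            applied[name] = value
    return applied


# Demonstration on a plain dict so the real environment is left untouched.
demo_env = {"NCCL_DEBUG": "WARN"}
print(apply_nccl_workarounds(demo_env))
```

The post does not say which of the variables actually removes the segfault. Since disabling P2P and shared memory generally costs communication bandwidth, it may be worth re-enabling them one at a time once training runs, and keeping only the ones that turn out to be necessary.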