Shengheng Yan

The workflow of Python

Virtual Environment

This section provides most common-used commands.

Create a virtual environment with conda

conda create -n env_name python=3.8

Activate the environment

conda activate env_name

Import Environment Configuration:

import using conda:

conda env create -f environment.yml

import using pip:

pip install -r requirement.txt

List exsisting the environment

conda env list

Delete the environment

conda deactivate
conda remove -n env_name --all

Export Environment Configuration:

export using conda:

conda env export | grep -v "^prefix: " > environment.yml

export using pip:

pip freeze > requirements.txt

Pytorch

check if the pytorch use cuda

import torch
torch.cuda.is_available() # shall output: True

check the version of pytorch

import torch
print(torch.__version__)

running llama2 with pytorch issues

torch.distributed.elastic.multiprocessing.errors.ChildFailedError

this error also mentioned here

after running the script from llama2:

torchrun --nproc_per_node 1 example_chat_completion.py \
    --ckpt_dir llama-2-7b-chat/ \
    --tokenizer_path tokenizer.model \
    --max_seq_len 512 --max_batch_size 4

the error :

> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
[2024-03-26 16:50:50,221] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 11127) of binary: /home/yan/anaconda3/envs/llama2/bin/python
Traceback (most recent call last):
  File "/home/yan/anaconda3/envs/llama2/bin/torchrun", line 8, in <module>
    ...
  File "/home/yan/anaconda3/envs/llama2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
example_chat_completion.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-26_16:50:50
  host      : yan-ml
  rank      : 0 (local_rank: 0)
  exitcode  : -9 (pid: 11127)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 11127
======================================================

Dignose the issue

~~for me it probabily run out of memoery. ~~ [20240326] I have the same error on HCP,