Updated Megatron version by jlamypoirier · Pull Request #85 · bigcode-project/Megatron-LM

jlamypoirier · 2023-12-21T00:50:24Z

No description provided.

PytLab and others added 30 commits January 18, 2024 23:08

Add ImportError catch for one_logger

bf9c0a1

Add message on how to install one_logger

85c4034

Better code formatting

54de98d

Fixed merge conflicts

909bda3

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

add is_first_microbatch for TE

3c44fb9

Signed-off-by: jiemingz <jiemingz@nvidia.com>

add arg name

27879a7

Signed-off-by: jiemingz <jiemingz@nvidia.com>

add docstring and move set_is_first_microbatch

7dc2ee8

Signed-off-by: jiemingz <jiemingz@nvidia.com>

Fixed formatting

3e19c76

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

Merge branch 'jiemingz/is_first_microbatch' into 'main'

bed60a8

add is_first_microbatch for TE See merge request ADLR/megatron-lm!1033

fix a bug in branch and format

cf1a1c6

Merge branch 'main' into fuse_rope_swiglu_main

036605d

fix tests

568da5a

Merge branch megatron-lm:main into atomic_gemm_switch

140642c

enable swiglu and rope fusion by default and disable them in tests

de9428a

Merge branch 'atomic_gemm_switch' into 'main'

599f558

Need a switch at NeMo level to enable Atomic GEMM See merge request ADLR/megatron-lm!1017

Merge branch 'mblaz/dist-ckpt-layernorms' into 'main'

ca8a00a

Add distributed checkpoint support to non-TE based models See merge request ADLR/megatron-lm!1005

Docstring removed for context config

79269fa

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

Decoupled cpu offloading and SplitAlongDim imports

4b05862

Signed-off-by: Selvaraj Anandaraj <selvaraja@login-eos01.eos.clusters.nvidia.com>

Merge branch 'cpu_offload' into 'main'

a5165ac

Support for activation offloading to CPU in M-LM See merge request ADLR/megatron-lm!1016

Merge branch 'fuse_rope_swiglu_main' into 'main'

640af6b

add rope and swiglu fusion See merge request ADLR/megatron-lm!946

Add jit_fuser to switch between torch.jit.script and torch.compile

473225f

Merge branch 'jaeminc/mcore-jit' into 'main'

de4028a

Add jit_fuser to switch between torch.jit.script and torch.compile See merge request ADLR/megatron-lm!1036

misc

716204e

Merge branch 'black_on_optimizer' into 'main'

8c2cd99

Run black on megatron/optimizer See merge request ADLR/megatron-lm!1050

Router and communication refactoring.

c795038

Add Z-loss and aux loss. Code cleanup.

2016969

Code clean.

9b5cd88

Add top-k router and documentation.

dc436f2

Add UT. Fix top-k >1 when EP is off.

a98c5ba

Normalize the token scores.

0f80408

deepakn94 and others added 30 commits February 27, 2024 20:31

Fix NaN checking in grads: should be performed before data-parallel communication

d668077

Compute norm once per batch (instead of once per microbatch) and once per bucket (instead of once per param)

Merge branch 'check_nan_in_grad' into 'main'

53a350e

Fix NaN checking in grads: should be performed before data-parallel all-reduce See merge request ADLR/megatron-lm!989

Make throughput and memory footprint formulae compatible with arbitrary ffn_hidden_size

9677b3b

Move to Draco OCI

3dafc0e

Merge branch 'maanug/jet-oci' into 'main'

17c487a

Move to Draco OCI See merge request ADLR/megatron-lm!1137

Merge branch 'theoretical_memory_fix' into 'main'

3b0fcd1

Print number of transformer and embedding parameters separately See merge request ADLR/megatron-lm!1159

Mcore LLaVA model

7bc3c74

Merge branch 'trintamaki/llava-model-mr' into 'main'

d1acce3

Mcore LLaVA model See merge request ADLR/megatron-lm!1151

[OMNIML-614] AMMO ptq + TensorRT-LLM export examples for megatron-lm

80e180d

Merge branch 'chenhany/ammo_ptq_example' into 'main'

36e9b6b

[OMNIML-614] AMMO ptq + TensorRT-LLM export examples for megatron-lm See merge request ADLR/megatron-lm!1013

Merge branch 'variable_ffn_size' into 'main'

0c1e53d

Make throughput and memory footprint formulae compatible with arbitrary ffn_hidden_size See merge request ADLR/megatron-lm!1169

Experimental Yaml configs

47cb630

Merge branch 'yaml' into 'main'

8957468

Experimental Yaml configs See merge request ADLR/megatron-lm!1134

MOE support

63d9d3e

stuff

40a134a

Merge branch 'main' into compare_tensors_updated

1a96a99

Support megatron core models

fdd668c

Fix arg

4238a80

fixes

fe38434

fix

3c6652e

fix

f6b9b4b

update

cb6baf1

misc

fe1f23c

Revert "misc"

2e23b9b

This reverts commit fe1f23c.

version

511e8f5

fix

75b0d97

misc

f02b413

stuff

89f391e

fix

30e7aec

misc

dee2745

Updated Megatron version #85
jlamypoirier wants to merge 437 commits into bigcode-project:nvidia_main from ServiceNow:compare_tensors_updated

20 participants
