Use first bad_words as extra parameters, and implement min-p #1536

Closed
pathorn wants to merge 2 commits from the minp_via_badwords_apr30 branch

Conversation

@pathorn (Contributor) commented May 2, 2024

An approach for implementing #1154

The user-facing classes in BatchManager and tensorrt_llm::executor::SamplingConfig are not open source (the constructor implementations live in .a files, and modifying the class layout causes a segfault), so we work around this by repurposing the integers in the first bad_words entry as extra parameters.

In this case, the first integer, reinterpreted as a float, represents min_p (the default value of 0.0 matches the zero padding in bad_words).
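
As a standalone illustration of that bit-level trick (not code from this PR): the float's IEEE-754 bits ride along in an int32 slot, and the default 0.0 packs to integer 0, which is indistinguishable from the tensor's zero padding.

    import struct

    def pack_min_p(min_p: float) -> int:
        # Reinterpret the float's IEEE-754 bits as a signed 32-bit integer.
        packed, = struct.unpack('i', struct.pack('f', min_p))
        return packed

    def unpack_min_p(packed: int) -> float:
        # Inverse: recover the float32 value from its bit pattern.
        value, = struct.unpack('f', struct.pack('i', packed))
        return value

    assert pack_min_p(0.0) == 0  # the default is indistinguishable from zero padding
    assert abs(unpack_min_p(pack_min_p(0.1)) - 0.1) < 1e-7  # round-trips up to float32 precision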

I implemented min-p by piggybacking on the existing logprobs calculation in CUDA, so it should add no performance overhead beyond the logprobs calculation itself.
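
For context, min-p keeps only tokens whose probability is at least min_p times the probability of the most likely token. The PR fuses this into the CUDA logprobs kernel; the NumPy sketch below only illustrates the rule itself and is not the PR's implementation.

    import numpy as np

    def min_p_filter(logits: np.ndarray, min_p: float) -> np.ndarray:
        """Mask out (set to -inf) tokens whose probability falls below min_p * max_prob."""
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        threshold = min_p * probs.max()
        return np.where(probs >= threshold, logits, -np.inf)

    # With min_p = 0.1, any token whose probability is under 10% of the
    # top token's probability is excluded from sampling.
    logits = np.array([2.0, 1.0, -3.0])
    print(min_p_filter(logits, min_p=0.1))  # [2., 1., -inf]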

@pathorn (Contributor, Author) commented May 2, 2024

Here is some example BLS code for adding the min_p value into the bad_words list in the way this PR expects:

            # Assumes the usual Triton BLS imports at module level:
            #   import struct
            #   import numpy as np
            #   from pprint import pprint
            #   import triton_python_backend_utils as pb_utils
            numpy_tensor = preproc_output_tensor.as_numpy()
            if trtllm_tensor_name == "bad_words_list":
                # bad_words_list carries a row of word-token data and a row of
                # offsets, padded to the same length.
                bad_words_data, bad_words_offsets = numpy_tensor[0]
                # Debug output: temporarily print the full tensors.
                opt = np.get_printoptions()
                np.set_printoptions(threshold=np.inf)
                pprint(numpy_tensor)
                min_p = 0.0
                if "min_p" in bls_input_tensors_map:
                    minptensor = bls_input_tensors_map["min_p"].as_numpy()
                    pprint(minptensor)
                    min_p = float(minptensor[0, 0])
                # Reinterpret the float's bits as an int32 so it can ride
                # along in the integer bad_words tensor.
                min_p_int, = struct.unpack('i', struct.pack('f', min_p))
                extra_data = np.array([min_p_int], dtype=np.int32)
                if bad_words_offsets[0] == -1:
                    # Special case: if no bad_words are passed, numpy_tensor will be [[[0], [-1]]].
                    # In this case, we don't want to prepend [0] because that would add a
                    # bad-word offset where there otherwise was none.
                    bad_words_data = extra_data
                    bad_words_offsets = np.array([-1], dtype=np.int32)
                else:
                    # Prepend the min_p word.
                    bad_words_data = np.concatenate((extra_data, bad_words_data), axis=0)
                    # The offsets array is padded with -1, so we first add one to make the
                    # padding all zeros, then trim_zeros and subtract one.
                    bad_words_offsets = np.trim_zeros(bad_words_offsets + 1) - 1
                    # Then, we prepend an extra 0 element to account for the extra bad_word being added.
                    bad_words_offsets = np.concatenate((np.array([0], dtype=np.int32), bad_words_offsets), axis=0)
                    # Then, we shift the offsets by the length of the newly added data.
                    bad_words_offsets = bad_words_offsets + len(extra_data)
                    # Finally, we pad the offsets with -1 to match the length of bad_words_data:
                    bad_words_offsets = np.concatenate(
                        (bad_words_offsets,
                         np.array([-1] * (len(bad_words_data) - len(bad_words_offsets)), dtype=np.int32)),
                        axis=0)
                numpy_tensor = np.array([[bad_words_data, bad_words_offsets]], dtype=np.int32)
                print("Final:")
                pprint(numpy_tensor)
                np.set_printoptions(**opt)

            trtllm_input_tensors.append(
                pb_utils.Tensor(trtllm_tensor_name,
                                numpy_tensor))
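
For completeness, here is a hypothetical decode-side sketch (illustrative only; the PR's actual consumer is the C++ batch manager) showing how min_p is recovered from the tensor built above:

    import struct
    import numpy as np

    def extract_min_p(bad_words_tensor: np.ndarray) -> float:
        # The first integer of bad_words_data carries the float bits of min_p.
        bad_words_data, _bad_words_offsets = bad_words_tensor[0]
        min_p, = struct.unpack('f', struct.pack('i', int(bad_words_data[0])))
        return min_p

    # Round trip for the "no real bad words" case, where the tensor is [[[bits], [-1]]]:
    min_p_bits, = struct.unpack('i', struct.pack('f', 0.05))
    tensor = np.array([[[min_p_bits], [-1]]], dtype=np.int32)
    print(extract_min_p(tensor))  # ~0.05 (float32 precision)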

@pathorn force-pushed the minp_via_badwords_apr30 branch from 7e1acc9 to 3731f5b on May 3, 2024
@juney-nvidia (Collaborator) commented

@pathorn

Hi Pathorn

Thanks for your interest in submitting this MR to TRT-LLM.

The current process for merging a community MR into TRT-LLM is:

  • After the contributor finishes the implementation and it passes local tests, TRT-LLM engineers will review the MR and provide feedback; several iterations of code refinement/discussion are usually necessary :)
  • Once the MR is ready to land, a TRT-LLM engineer will cherry-pick it into our internal git repo.
  • Later, when the new TRT-LLM version is pushed to GitHub, we will acknowledge the contributor by name in the announcement notes.

Please let me know whether the above process makes sense to you.
Thanks

June

@pathorn force-pushed the minp_via_badwords_apr30 branch from 3731f5b to 3d7d658 on May 29, 2024
@pathorn force-pushed the minp_via_badwords_apr30 branch from 3d7d658 to 0481a36 on June 5, 2024
@pathorn force-pushed the minp_via_badwords_apr30 branch from 0481a36 to f4c8e1c on June 25, 2024
@pathorn force-pushed the minp_via_badwords_apr30 branch from f4c8e1c to a248ba1 on September 25, 2024
@pathorn force-pushed the minp_via_badwords_apr30 branch 2 times, most recently from d1f2263 to 55db1b6 on October 9, 2024
@pathorn force-pushed the minp_via_badwords_apr30 branch from 55db1b6 to 6b54714 on October 16, 2024
@DanBlanaru mentioned this pull request on Feb 6, 2025
@pathorn closed this on Feb 7, 2025