vllm.v1.spec_decode.utils ¶
eagle_prepare_inputs_padded_kernel ¶
eagle_prepare_inputs_padded_kernel(
    cu_num_draft_tokens_ptr,
    valid_sampled_tokens_count_ptr,
    query_start_loc_gpu_ptr,
    token_indices_to_sample_ptr,
    num_reqs,
)
Fused kernel for Eagle prepare_inputs_padded. This kernel computes the token index to sample for each request, taking into account the number of draft tokens and the number of valid sampled tokens (one more than the number of accepted tokens).
Source code in vllm/v1/spec_decode/utils.py
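The per-request indexing can be sketched in pure Python. This is a hedged CPU reference, not the Triton kernel itself: the exact formula is an assumption based on the docstring, namely that rejected draft tokens sit at the tail of each request's query span, so the sample index is the span end minus the number of rejected tokens.

```python
def prepare_inputs_padded_reference(
    cu_num_draft_tokens: list[int],        # cumulative draft-token counts, one per request
    valid_sampled_tokens_count: list[int],  # accepted tokens + 1, per request
    query_start_loc: list[int],            # length num_reqs + 1; token offsets per request
) -> list[int]:
    """CPU sketch of the "token index to sample" computation (assumed formula)."""
    num_reqs = len(valid_sampled_tokens_count)
    token_indices_to_sample = []
    for r in range(num_reqs):
        prev = cu_num_draft_tokens[r - 1] if r > 0 else 0
        num_draft = cu_num_draft_tokens[r] - prev
        # rejected = drafted - accepted = num_draft + 1 - valid_sampled
        num_rejected = num_draft + 1 - valid_sampled_tokens_count[r]
        # sample at the last non-rejected position of this request's query span
        token_indices_to_sample.append(query_start_loc[r + 1] - 1 - num_rejected)
    return token_indices_to_sample
```

For two requests with 2 draft tokens each (query spans of 3 tokens), a fully accepted request samples at its last position, while a fully rejected one backs up by the number of drafts.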
eagle_prepare_next_token_padded_kernel ¶
eagle_prepare_next_token_padded_kernel(
    sampled_token_ids_ptr,
    discard_request_mask_ptr,
    backup_next_token_ids_ptr,
    next_token_ids_ptr,
    valid_sampled_tokens_count_ptr,
    vocab_size,
    num_sampled_tokens_per_req,
    num_reqs,
    stride_sampled_token_ids,
    BLOCK_SIZE_TOKENS: constexpr,
)
Fused kernel for Eagle prepare_next_token_ids_padded. This kernel computes the number of valid (1 + accepted) tokens for each request, and the corresponding "next" token id to sample from during speculative decoding. This is the "last accepted token" from the sampled tokens, or the backup token if no tokens were accepted or if the request is marked as discarded.
Source code in vllm/v1/spec_decode/utils.py
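The selection logic described above can be sketched as a CPU reference. This is an illustrative assumption, not the Triton kernel: it assumes rejected positions in `sampled_token_ids` hold an out-of-vocab placeholder id (e.g. -1), so a token is valid iff it lies in `[0, vocab_size)`, and valid tokens are contiguous from the start of each row.

```python
def prepare_next_token_ids_padded_reference(
    sampled_token_ids: list[list[int]],  # [num_reqs][num_sampled_tokens_per_req]
    discard_request_mask: list[bool],    # True if the request is discarded
    backup_next_token_ids: list[int],    # fallback token id per request
    vocab_size: int,
) -> tuple[list[int], list[int]]:
    """CPU sketch: count valid (1 + accepted) tokens and pick the next token id."""
    next_token_ids, valid_counts = [], []
    for r, row in enumerate(sampled_token_ids):
        valid = 0
        for tok in row:
            if 0 <= tok < vocab_size:  # assumption: rejected slots are out-of-vocab
                valid += 1
            else:
                break
        valid_counts.append(valid)
        if discard_request_mask[r] or valid == 0:
            next_token_ids.append(backup_next_token_ids[r])  # fall back to backup token
        else:
            next_token_ids.append(row[valid - 1])  # last accepted token
    return next_token_ids, valid_counts
```

The "last accepted token" is simply the last valid entry in the row; the backup id is used both for discarded requests and for requests where nothing was accepted.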
is_spec_decode_unsupported ¶
is_spec_decode_unsupported(
sampling_params: SamplingParams,
) -> bool
Return True if the request is incompatible with speculative decoding.
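A minimal sketch of what such a check can look like, using a hypothetical stand-in for `SamplingParams`. The disqualifying criteria shown here (non-default penalties, whose computation depends on the full per-step token history) are an assumption for illustration, not the actual vLLM rule set.

```python
from dataclasses import dataclass


@dataclass
class SamplingParamsStub:
    """Minimal stand-in for vLLM's SamplingParams (illustration only)."""
    frequency_penalty: float = 0.0
    presence_penalty: float = 1.0 - 1.0  # 0.0; spelled out to mirror the default
    repetition_penalty: float = 1.0


def is_spec_decode_unsupported_sketch(p: SamplingParamsStub) -> bool:
    # Assumed criteria: penalty-based sampling depends on token history that
    # speculative decoding does not replay step by step, so any non-default
    # penalty disqualifies the request.
    return (
        p.frequency_penalty != 0.0
        or p.presence_penalty != 0.0
        or p.repetition_penalty != 1.0
    )
```

A request with default sampling parameters passes; setting any penalty marks it unsupported.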