GPT-OSS-20B Issues With VLLM And Open WebUI
Hey guys! I'm running into some serious issues trying to get GPT-OSS-20B to play nice with vLLM as the backend for Open WebUI, and I'm hoping to get some help from the community. Two problems in particular are hurting the setup: output quality degrades badly after the first paragraph, and tool/function calling doesn't work. Below I outline both problems and include the exact command, configuration, and screenshots so the issues can be reproduced.
I really like the potential of this setup, and I'm sure others would like to use it too, so hopefully we can figure this out together.
Output Quality Degradation
The first problem is a significant drop in the quality of the output generated by GPT-OSS-20B when it's served through vLLM and accessed via Open WebUI. The model performs well initially, but the quality degrades noticeably after the first paragraph. I'm running the model with the configuration shown below.
I've included an example to illustrate this: the initial response is fine, but the subsequent text turns garbled or incoherent. This makes the model pretty much unusable for any extended conversation or complex task.
I've tried a few things to fix this, like adjusting parameters in the vLLM command such as --max-model-len, but I haven't seen any improvement. The core issue remains the degraded output, and I'm not sure how to resolve it.
To make the issue clear, I've included a screenshot of the output. Notably, I don't see this problem when I query the vLLM server directly with a simple curl command; the output is much more coherent and of higher quality that way, which suggests the problem lies in how Open WebUI issues its requests rather than in vLLM itself. Here is the exact Docker command I'm using to serve the model, with a curl sketch after it:
docker run --gpus all \
--name llama-server-gptoss-20b-vllm \
--init --rm \
-p ${PORT}:9028 \
--ipc=host \
--ulimit nofile=65536:65536 \
-v /mnt/user/appdata/vllm/hf_cache:/root/.cache/huggingface \
-v /mnt/user/appdata/vllm/my_tool_template.jinja:/templates/my_tool_template.jinja \
-e CUDA_DEVICE_ORDER=PCI_BUS_ID \
vllm/vllm-openai:v0.11.1rc1 \
--model openai/gpt-oss-20b --port 9028 \
--tensor-parallel-size "2" \
--gpu-memory-utilization "0.6" \
--max-model-len "60000" \
--enable-auto-tool-choice \
--tool-call-parser openai \
--served-model-name "GPT-OSS-20B-vllm" \
--disable-custom-all-reduce \
--reasoning-parser openai_gptoss
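For reference, here's the kind of direct query that gives me coherent output. This is a minimal sketch: it assumes the server is reachable at localhost on the ${PORT} mapped above, uses the served model name from the command, and pipes through jq (if installed) just to pretty-print the message.

# Direct chat-completions request to the vLLM OpenAI-compatible endpoint.
# No API key is configured on the server, so none is sent here.
curl -s "http://localhost:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GPT-OSS-20B-vllm",
    "messages": [
      {"role": "user", "content": "Explain tensor parallelism in two paragraphs."}
    ],
    "max_tokens": 512
  }' | jq '.choices[0].message'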
Has anyone encountered a similar issue with this specific model, or with vLLM in general? The degraded output makes the whole setup pretty much unusable for me, so any suggestions would be much appreciated!
Tool/Function Calling Issues
The second problem involves tool/function calling. When I use GPT-OSS-20B through Open WebUI with vLLM, tool calls don't work as expected: the model struggles to properly format and execute them. This is critical functionality for a lot of applications. When I try to get it to call tools and extract information, the calls either fail outright or come back as malformed JSON.
I'm serving with the --enable-auto-tool-choice and --tool-call-parser openai flags, but they don't seem to help. The tool-call attempts often end up inside the reasoning block of the output rather than being emitted as properly formatted tool calls, so the tools are never actually executed, which breaks a lot of the value of this setup. A direct request is a good way to check whether the parser itself is at fault; see the sketch below.
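To isolate whether vLLM's parser or Open WebUI is responsible, you can send a request with a tool definition straight to the server and look at where the call ends up. A minimal sketch, assuming the same port and served model name as above; the get_weather tool is a made-up example, and I'm assuming the reasoning text is exposed under a reasoning_content field, as vLLM's reasoning parsers generally do.

# Send a chat request with one tool defined and inspect where the call lands.
# A working parser populates .tool_calls; if the call text shows up in
# reasoning_content or plain content instead, the parser is mishandling it.
curl -s "http://localhost:${PORT}/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GPT-OSS-20B-vllm",
    "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }' | jq '.choices[0].message | {content, reasoning_content, tool_calls}'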
Here are some screenshots of what I'm seeing: the tool calls show up as plain text in the output instead of properly formatted JSON. Again, this renders the system unusable, because it defeats the whole point of using tools.
I'm also including my Open WebUI setup, since the way it packages prompts and parses responses may be part of the problem; a sketch of how it's wired to the vLLM server follows.
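For completeness, this is roughly how Open WebUI gets pointed at the vLLM server; treat it as a sketch under assumptions rather than my exact command. The ghcr.io/open-webui/open-webui:main image and the OPENAI_API_BASE_URL / OPENAI_API_KEY variables are standard for Open WebUI, but the host port, data path, and placeholder key here are illustrative.

# Run Open WebUI against the vLLM OpenAI-compatible endpoint.
# host.docker.internal lets the container reach the vLLM server on the host;
# vLLM isn't checking keys here, so any placeholder value works.
docker run -d --rm \
  --name open-webui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL="http://host.docker.internal:${PORT}/v1" \
  -e OPENAI_API_KEY="placeholder" \
  -v /mnt/user/appdata/open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main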
I'm hoping to hear from others who have used tool calling with GPT-OSS-20B, or a similar model, on vLLM. Has anyone found a workaround or a specific configuration that makes this work? Getting tools to be called properly is the main goal here, and right now it's failing.
Expected Behavior
What I'm expecting is for GPT-OSS-20B to perform consistently with other backends, such as llama.cpp: high-quality, coherent output that stays stable throughout the conversation, and tool calls that are emitted as valid JSON and executed correctly. In short, a smooth, reliable integration. For reference, a llama.cpp serving setup for the same model looks roughly like the sketch below.
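This is an illustrative sketch of serving gpt-oss-20b with llama.cpp's built-in server, not a verified command: the ggml-org/gpt-oss-20b-GGUF repo name is an assumption, and --jinja is the llama-server flag that applies the chat template so tool calls come out in the OpenAI tool_calls format.

# Serve gpt-oss-20b with llama.cpp's OpenAI-compatible server.
# -hf downloads the GGUF from Hugging Face; -c sets the context length.
llama-server \
  -hf ggml-org/gpt-oss-20b-GGUF \
  --port 9028 \
  -c 60000 \
  --jinja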
If you have any pointers or suggestions on what might be causing these issues, please let me know. Thanks!