WEBVTT

00:00.000 --> 00:11.800 Max is a good friend of mine from IBM. He's going to be talking about LLM tool use in 00:11.800 --> 00:15.000 vLLM. Take it away, Max.

00:15.000 --> 00:23.120 Thanks. So hi, everyone. I'm Max from, I work at, yeah, okay: I'm Max, I work at IBM Research. 00:23.120 --> 00:29.200 So I have a resolution problem here, so some things are cut off below, but in the interest of 00:29.200 --> 00:33.840 time, let's move on. So my goal with this presentation is that by the end, you will know 00:33.840 --> 00:43.960 how tool calling with LLMs works at the lowest level.

So what is vLLM? Are there any 00:43.960 --> 00:49.120 vLLM users out there? Oh, okay, there are some, but yeah, it's nice to know that I'm not 00:49.120 --> 00:55.520 preaching to the choir. But anyway, the project's goal is to be the fastest and easiest to 00:55.520 --> 01:03.720 use open-source LLM inference and serving engine, right? So it started in, yeah, okay, you would 01:03.720 --> 01:10.360 see 2023 below there, but it started as a PhD project at UC Berkeley. It has 01:10.360 --> 01:18.200 grown a lot since then and has become a Linux Foundation project. There are contributions 01:18.200 --> 01:26.760 by many companies, including IBM, but also Intel and Mistral; all the model providers and hardware 01:26.760 --> 01:34.440 providers contribute. And some of these companies also use it in production.

01:34.440 --> 01:41.080 Okay, so there are two ways to use vLLM. One is in Python with the offline batched inference 01:41.080 --> 01:48.120 API. So this is basically like using the transformers library, only a bit faster, right? 01:48.120 --> 01:57.120 And then there is online serving using the OpenAI API. So you can use vLLM to host 01:57.120 --> 02:06.000 models for your LLM-based apps.

All right, so just to contextualize this a little bit. 02:06.000 --> 02:12.840 So when you think about a program, it has a central logic, right, a control flow that, 02:13.400 --> 02:19.840 well, where you program what you want to do. And it uses subroutines in your own program, 02:19.840 --> 02:26.120 it uses libraries, it uses the operating system to interact with the world. So you could see these 02:26.120 --> 02:33.720 things as tools for your business logic, right? But the problem with programs is that this 02:33.720 --> 02:38.920 business logic is fixed. So if you want new behavior, you have to program it and update all 02:38.920 --> 02:44.520 your deployments, right? So with LLMs, you can do new things. You can put an LLM there in 02:44.520 --> 02:51.160 the middle, and it can come up with new plans depending on the user input. And it can use all these 02:51.160 --> 02:58.200 things as tools, right? So that's the motivation.

So this allows you to do some things, 02:58.200 --> 03:05.000 like the classical example of an AI assistant, where you give the model a bunch of APIs 03:05.000 --> 03:09.960 that it can call, here, in this example, a restaurant reservation API. And so, you know, 03:09.960 --> 03:17.320 you can handle user input in natural language, so that the model can, yeah, satisfy 03:17.320 --> 03:24.920 the user's requests that come in natural language with these API calls.

Okay, so there are 03:24.920 --> 03:31.560 basically three types of tool calling out there. One is JSON-based, where the model input, 03:31.640 --> 03:36.920 I mean, the function descriptions are in JSON and the model output is also JSON.
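NOTE
A minimal sketch of what such a JSON function description can look like, in the OpenAI-style tools format the talk refers to; the get_weather function, its parameters, and the example call are hypothetical illustrations, not taken from the slides:

    # One tool description in the OpenAI-compatible format:
    # a name, a description, and JSON Schema parameters.
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "Name of the city"},
                        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                    },
                    "required": ["city"],
                },
            },
        }
    ]

When the model decides to call a tool, its output is then itself JSON, along the lines of {"name": "get_weather", "arguments": {"city": "Zurich"}}.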
Then there's 03:36.920 --> 03:43.720 code-based tool calling, where the model generates code. For example, the Llama 3 models generate 03:43.720 --> 03:49.960 Python code, and that has to be executed. And there are built-in tools, so, for example, again, in the 03:49.960 --> 03:56.760 Llama 3 models, during instruction tuning, the model learns to use Brave Search and Wolfram 03:56.840 --> 04:04.520 Alpha.

Okay, so in this talk, we're going to focus on JSON tool calling. So in this modality, 04:04.520 --> 04:09.960 what you give the model is a description of all the functions that it can call 04:09.960 --> 04:19.240 in this JSON format that is very similar to OpenAPI, if not identical. And what you do is you 04:19.240 --> 04:26.360 describe your function: you give a function name, you tell the model what it does in the description, 04:26.920 --> 04:34.120 then you list all the parameters, their types and names, and also descriptions. And so when you 04:34.120 --> 04:40.520 get a user input in natural language, the model should generate an output where it identifies 04:41.720 --> 04:48.200 the arguments for these functions from the user input, selects the correct function, 04:48.200 --> 04:56.440 and returns JSON like this one.

Okay, so this is modeled in the OpenAI API as a 04:56.760 --> 05:03.880 chat. So here you have a sequence of different messages with different roles. So in the beginning, 05:03.880 --> 05:11.320 you usually, as the application developer, you start with a system prompt, where you tell the model 05:11.320 --> 05:17.240 what its role is. For example, you would say something like "you are a helpful assistant", in this case, 05:17.240 --> 05:22.600 "you know a little bit about weather prediction". Then you have a user request, 05:22.680 --> 05:30.920 right, a message with the user role, coming in with the user's request, and then there can be some 05:30.920 --> 05:36.600 back and forth between the model and the user, and at some point the model will generate a tool 05:36.600 --> 05:44.360 call in JSON format, as we've seen. Then you can put this output back into the model, and here, 05:44.360 --> 05:51.240 well, it's cut off below, but the model, when it sees this output from the tool, can generate an explanation 05:51.320 --> 05:57.880 in natural language, so that you can send it back to the user.

All right, so we know that 05:57.880 --> 06:05.000 the model only handles text, so to translate from this list of messages to actual text that the model 06:05.000 --> 06:12.280 can process, we use chat templates. The chat templates are usually bundled with the tokenizer; 06:12.280 --> 06:19.160 in vLLM, we curate some chat templates specifically for tool use.
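NOTE
A rough sketch of the translation step just described, using the chat template bundled with the tokenizer via the Hugging Face transformers API; the model ID, the messages, and the tools variable (from the note above) are illustrative assumptions, and the exact prompt text depends on the model's template:

    from transformers import AutoTokenizer

    # Any chat model whose template supports tools works similarly; this ID is just an example.
    tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-3.1-8b-instruct")

    messages = [
        {"role": "system", "content": "You are a helpful assistant that knows a bit about weather prediction."},
        {"role": "user", "content": "What's the weather like in Zurich right now?"},
    ]

    # Render the message list plus the tool definitions into the plain text the model
    # actually sees, ending with the opening of the assistant role so the model completes
    # from there.
    prompt = tokenizer.apply_chat_template(
        messages,
        tools=tools,
        add_generation_prompt=True,
        tokenize=False,
    )
    print(prompt)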
So this example is for IBM 06:19.240 --> 06:27.560 Granite 3.1. So as you can see, in the beginning, the available tools are inserted with a special 06:27.560 --> 06:32.920 role called available tools, but depending on the model, it can also be in the system role, 06:32.920 --> 06:39.160 or in the first user message. Then there is a for loop iterating over all of the messages 06:39.160 --> 06:46.040 and formatting them accordingly, depending on the role. So for most roles, the message is just 06:46.120 --> 06:53.320 copied as plain text into the prompt, but in some cases, like with tool calls, there has to be some 06:53.320 --> 07:02.440 JSON manipulation. And we can't see it, but if you want the model to complete text as the 07:02.440 --> 07:09.880 assistant, then you also end by inserting start of role, assistant, end of role, so that the model 07:09.880 --> 07:17.480 knows it has to go on from there.

All right, so the text looks like this: at the beginning, as I 07:17.480 --> 07:27.720 mentioned, you have the functions, then the system prompt, then the user input, and then just the 07:27.720 --> 07:35.960 beginning of the assistant role, so that the model will complete, and it will complete with a tool call. 07:36.040 --> 07:43.560 There's a special format that's model-specific, so for each different model in vLLM, we have 07:43.560 --> 07:52.520 a different parser to take the model output and return JSON. Then we can put the response back, 07:53.320 --> 07:57.800 and by the end, the model can generate, well, we can't see it here, but it can generate a 07:57.800 --> 08:04.440 natural-language description for the user.

All right, so from an application developer perspective, 08:04.840 --> 08:14.520 wait, why is it not correctly formatted? Okay, but anyway, yeah, as a client developer, in Python, 08:14.520 --> 08:24.120 for example, you can use the OpenAI Python client; our API in vLLM is compatible. So you define 08:24.120 --> 08:33.880 your tools as JSON dicts, and the messages as, sorry, as a list, and then, when you call 08:33.960 --> 08:39.880 the API, you get back this chat completion object, where the finish reason is tool calls, 08:39.880 --> 08:45.640 and you get a nice array of the tool calls that were parsed from the model's return. So 08:47.320 --> 08:52.600 notice that there's a special parameter here, tool choice auto, which I will explain now.
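NOTE
A minimal sketch of the client call just described, against a vLLM server started with tool calling enabled; the base URL, the model name, and the messages and tools variables from the notes above are illustrative assumptions:

    from openai import OpenAI

    # vLLM serves an OpenAI-compatible API; the key is unused by default.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="ibm-granite/granite-3.1-8b-instruct",
        messages=messages,
        tools=tools,
        tool_choice="auto",   # let the model decide between plain text and a tool call
    )

    choice = response.choices[0]
    if choice.finish_reason == "tool_calls":
        for call in choice.message.tool_calls:
            # arguments comes back as a JSON string, not a dict
            print(call.function.name, call.function.arguments)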
08:52.680 --> 09:04.760 Okay, so the tool choice parameter lets you generate the tool calls in different ways. 09:04.760 --> 09:11.560 So the first option is to actually pass it a JSON with the function that you 09:11.560 --> 09:18.600 want to call, so in vLLM, this forces the model to use only one specific function, 09:19.480 --> 09:26.200 using structured output, right. So with structured outputs, we restrict which tokens the model can 09:26.200 --> 09:31.880 generate, so it has to generate output that follows the JSON schema for that specific 09:31.880 --> 09:38.280 function. So if your user input has nothing to do with that function, the model can be forced to make 09:38.280 --> 09:43.960 up arguments. Then there is tool choice required, which is similar, but now the model can choose 09:44.040 --> 09:51.320 between different functions. Then there is tool choice none, where you don't want the model to 09:51.320 --> 09:55.480 generate a tool call, so you could use this, for example, if you want to do more prompt engineering 09:55.480 --> 10:00.280 after you have inserted the tools, for example, a chain of thought or something like this. 10:00.840 --> 10:06.200 And finally, there is tool choice auto, where the model is free to either generate text to send back 10:06.200 --> 10:15.640 to the user or to generate a function call.

Okay, so as we have seen, the model only handles text, and 10:15.640 --> 10:22.200 vLLM translates this from and into JSON, but vLLM does not actually call the tools. 10:22.760 --> 10:28.680 So as an application developer, this is your responsibility, right. So what you need to do is to 10:28.680 --> 10:36.520 basically implement this executor box to orchestrate this interaction: you take the user prompt, 10:36.520 --> 10:43.080 send it to the model, the model returns the tool call and sends it to you, you execute the tool, 10:43.080 --> 10:47.480 get the response, show it to the model, so that the model can generate output for the user, 10:48.200 --> 10:57.640 and then finally, this sequence ends with text sent to the user.

All right, so when you're using 10:57.720 --> 11:04.920 tool calling, it's nice to know how your model was trained, right. So I like this paper from IBM 11:04.920 --> 11:11.560 Research a lot, because it details the tasks on which the model was trained, and also 11:11.560 --> 11:21.000 the data set that was used, so highly recommended. I have to rush through these other 11:21.000 --> 11:28.120 slides now because I'm running out of time, but yeah, wrapping up: based on natural language, the 11:28.120 --> 11:35.960 LLM can figure out what functions to call, the chat API is used for that, and since the model 11:35.960 --> 11:43.000 and its inference server only handle text and JSON, the orchestration is your responsibility as a developer. 11:43.000 --> 11:48.200 So if that's a lot of work, you might also want to check out agent frameworks such as the 11:49.160 --> 11:54.040 agent framework, and with that, I'll leave you with some pointers, and thank you.

11:55.880 --> 11:56.520 Give him a hand.

12:01.000 --> 12:05.240 Thank you so much, Max, and I'm going to kick him out over there, so if you have questions for him, 12:05.240 --> 12:08.440 you should find him over there, because Peter, you're up next, buddy. 12:11.400 --> 12:14.840 All right, we'll give it about two minutes, or however quickly he gets us about 12:18.200 --> 12:27.720 every minute.
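NOTE
For reference, a minimal sketch of the executor loop Max describes: the application itself runs the tool and feeds the result back as a message with the tool role, then asks the model for the final answer. It continues from the notes above; the execute_tool dispatcher and its dummy result are hypothetical:

    import json

    def execute_tool(name, arguments):
        # Hypothetical dispatcher: call the real implementation of the named function here.
        if name == "get_weather":
            return {"city": arguments["city"], "temperature_c": 7, "condition": "cloudy"}
        raise ValueError(f"unknown tool: {name}")

    choice = response.choices[0]
    if choice.finish_reason == "tool_calls":
        # Keep the assistant's tool-call message in the conversation history.
        messages.append(choice.message)
        for call in choice.message.tool_calls:
            result = execute_tool(call.function.name, json.loads(call.function.arguments))
            # Feed the tool's result back under the "tool" role, tied to the call id.
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
        # Second round trip: the model now answers the user in natural language.
        final = client.chat.completions.create(
            model="ibm-granite/granite-3.1-8b-instruct",
            messages=messages,
            tools=tools,
        )
        print(final.choices[0].message.content)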