
Hi, I'm using llama2 from a Cloudflare Worker via the ai.run binding, and the responses I get back are cut off after fewer than 300 tokens. Is there a way to raise the token limit for a response above whatever it's currently set to? A silly example to illustrate: when I ask for a recipe for potatoes au gratin with bubble gum syrup, the response gets cut off midway through the instructions:
const response = await ai.run('@cf/meta/llama-2-7b-chat-int8', {
  prompt: `Create a recipe for potatoes au gratin that includes the special ingredient Bubble Gum Syrup\n\n`,
});
return new Response(JSON.stringify(response));
{"response":"Potatoes au gratin is a classic French dish that consists of thinly sliced potatoes layered in a baking dish and topped with cheese, cream, and breadcrumbs. To give this dish a unique twist, you can add a special ingredient like Bubble Gum Syrup. This sweet and tangy syrup is made with real bubble gum and adds a fun and playful touch to the dish. Here's a recipe for potatoes au gratin with Bubble Gum Syrup:\nIngredients:\n* 3-4 large potatoes, peeled and thinly sliced\n* 1/4 cup Bubble Gum Syrup\n* 1/4 cup grated cheese (such as cheddar or Parmesan)\n* 1/4 cup heavy cream\n* 1/4 cup breadcrumbs\n* Salt and pepper to taste\nInstructions:\n1. Preheat the oven to 375°F (190°C).\n2. In a large baking dish, arrange a layer of over"}
If I paste that response into a llama token counter (https://huggingface.co/spaces/Xanthius/llama-token-counter), it's only 259 tokens, yet it's cut off. My understanding is that llama2 is supposed to have a context window of 4096 tokens, so there should be no reason it couldn't have finished the instructions.
With the OpenAI APIs we can pass a max_tokens argument to the chat completions API. I don't see an equivalent in Workers AI. Is this something that you might add?
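To make the request concrete, here is a minimal sketch of what such an option might look like on the Workers AI side. It assumes the input object would accept a `max_tokens` field next to `prompt`, mirroring the OpenAI name; nothing in this thread confirms that field, and `TextGenInput` and `buildInput` are hypothetical names used only for illustration.

```typescript
// Hypothetical input shape: `prompt` is what the thread uses today,
// `max_tokens` is an ASSUMED optional field, not confirmed in this thread.
interface TextGenInput {
  prompt: string;
  max_tokens?: number;
}

// Build the input object, adding the output-token cap only when one is given.
function buildInput(prompt: string, maxTokens?: number): TextGenInput {
  if (maxTokens === undefined) {
    return { prompt };
  }
  if (!Number.isInteger(maxTokens) || maxTokens <= 0) {
    throw new RangeError("maxTokens must be a positive integer");
  }
  return { prompt, max_tokens: maxTokens };
}

// Intended use inside a Worker (not executed here, `ai` is the binding):
// const response = await ai.run('@cf/meta/llama-2-7b-chat-int8', buildInput(prompt, 1024));
```

Keeping the cap optional means existing calls that only pass `prompt` would keep whatever default the platform applies.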
9 Replies
Ian Taylor (OP), 15mo ago
Realized this should probably have been posted in workers-ai-beta, sorry
Ian Taylor (OP), 15mo ago
I see... 256 output tokens is not going to be enough for many use-cases.
lnicola, 15mo ago
I'm sure they'll increase these in time, it's very early
Ian Taylor (OP), 15mo ago
Yeah, I read the top part of the limits doc before I started playing around with this, but I didn't get down to the bottom of it. My bad. Thanks for your help!
lnicola, 15mo ago
By the way, since you've used OpenAI before, how do you feel about the pricing of Cloudflare AI? I'm new to all this, and it seems pretty good to me, but maybe there's a catch.
Ian Taylor (OP), 15mo ago
I haven't really been able to wrap my head fully around it yet 😀 I think a lot depends on how many LLM tokens = 1 neuron, and on how much someone really needs the "Fast Twitch Neurons" vs the regular ones, because the fast ones are 12x the price (assuming that you're able to control whether you use fast twitch or not???)
lnicola, 15mo ago
You'll be able to control it, but did I misread the blog post? I thought they're only 25% more.
The token count is probably the current maximum.
Regular Twitch Neurons (RTN) - running wherever there's capacity at $0.01 / 1k neurons
Fast Twitch Neurons (FTN) - running at nearest user location at $0.125 / 1k neurons
Right, missed that 😅
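For anyone checking the "12x" vs "25% more" numbers, here is a small sketch of the arithmetic using only the two prices quoted above. The constant and function names are made up for illustration, and how many LLM tokens map to one neuron is still unknown in this thread, so the example stays in neurons.

```typescript
// Prices quoted in the thread, in USD per 1,000 neurons.
const REGULAR_USD_PER_1K = 0.01; // Regular Twitch Neurons (RTN)
const FAST_USD_PER_1K = 0.125;   // Fast Twitch Neurons (FTN)

// Cost in USD for a given number of neurons at a given per-1k price.
function costUsd(neurons: number, usdPer1k: number): number {
  return (neurons / 1000) * usdPer1k;
}

// How many times more expensive the fast tier is than the regular tier.
const fastToRegularRatio = FAST_USD_PER_1K / REGULAR_USD_PER_1K;

console.log(fastToRegularRatio.toFixed(1)); // prints "12.5"
```

So at the quoted prices the fast tier is about 12.5 times the regular one, which matches the rough "12x" figure and not "25% more".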
Victor, 15mo ago
@ian.b.taylor https://socket.dev/npm/package/llama-tokenizer-js I'm about to personally try it out, but haven't yet
llama-tokenizer-js: JS tokenizer for LLaMA-based LLMs. Version 1.1.3, published by belladoreai.