So, I got vLLM running locally to test it out and see if it's an option. The results are really cool, but it's not the right approach for me. 😄
============================= test session starts =============================
collecting ... collected 1 item
test_model_wrapper.py::test_complete
======================== 1 passed in 100.25s (0:01:40) ========================
PASSED [100%]
Time to set up: 0.1114599 seconds
Runs: [31.6107654, 31.6132964, 34.2333528, 34.2130582, 34.2042625, 34.2340227, 34.2474756, 34.2105221, 34.2394911, 34.2595923, 34.2379803, 34.2343132, 34.2297294, 34.2636501, 34.223144, 34.2007176, 34.2261261, 34.1895347, 34.2410051, 34.2694363, 34.2138979, 34.219749, 34.2380913, 34.2763124, 34.2519246, 34.24931, 34.2471614, 34.2769107, 42.0093234, 42.0015118, 42.0404314, 42.0552968, 42.0382447, 42.0084274, 42.0031958, 42.054302, 42.002341, 42.0075319, 42.0172066, 47.9992168, 47.953934, 47.9645648, 48.0265022, 48.0249268, 47.951652, 48.0254267, 47.9690977, 49.3527739, 49.3451308, 50.7696567, 50.767339, 50.7534204, 50.775435, 50.759712, 50.7630917, 50.7589791, 50.7768211, 53.4804957, 53.4305316, 56.1890333, 56.1694433, 56.1409132, 56.1523421, 56.1788844, 56.1737424, 56.1204063, 56.1047533, 56.1247003, 56.1293441, 56.1116698, 56.1414604, 57.3100886, 57.2900195, 57.3443251, 58.5790726, 59.776073, 59.6932115, 59.7322922, 59.7565171, 59.7780871, 59.7480857, 60.7950622, 63.1008595, 62.9985082, 63.0248589, 63.0549046, 64.3303572, 64.3385164, 64.3284183, 68.3025694, 68.2875625, 69.6675595, 69.6902716, 70.5738979, 70.6244101, 70.5820796, 71.4233821, 78.2565978, 81.2923646, 99.9321581]
Average run time: 49.985901578 seconds
100 completions in about 100s is amazing as total throughput. But each individual completion is too slow: about 50s on average, and almost 100s for the slowest one. I wonder if tabbyAPI (with sequential requests) and multiple parallel workers might actually be better. I expect no more than 2-3 requests at the same time for now, and no more than 5 for a while (assumptions can be wrong, of course).
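To compare the options at the concurrency I actually expect (2-3 requests in flight instead of 100), something like the rough sketch below could work. It is not the test I ran above: `fake_complete`, `benchmark`, and `run_benchmark` are hypothetical names, and `fake_complete` is just a stand-in that sleeps, so it would need to be swapped for a real call to whichever backend is being tested.

```python
import asyncio
import statistics
import time


async def fake_complete(prompt: str) -> str:
    # Hypothetical stand-in for a real completion call; replace with your client.
    await asyncio.sleep(0.01)
    return prompt.upper()


async def benchmark(complete, prompts, max_concurrent):
    """Run `complete` over `prompts` with at most `max_concurrent` in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)
    latencies = []

    async def timed(prompt):
        async with semaphore:
            # Latency is measured after a slot is acquired, so time spent
            # waiting for the semaphore on the client side is not included.
            start = time.perf_counter()
            await complete(prompt)
            latencies.append(time.perf_counter() - start)

    wall_start = time.perf_counter()
    await asyncio.gather(*(timed(p) for p in prompts))
    wall = time.perf_counter() - wall_start

    return {
        "count": len(latencies),
        "wall_seconds": wall,
        "throughput_per_second": len(prompts) / wall,
        "mean_latency": statistics.mean(latencies),
        "max_latency": max(latencies),
    }


def run_benchmark(max_concurrent=3, n=6):
    prompts = [f"prompt {i}" for i in range(n)]
    return asyncio.run(benchmark(fake_complete, prompts, max_concurrent))


if __name__ == "__main__":
    for limit in (1, 3, 5):
        print(limit, run_benchmark(max_concurrent=limit, n=10))
```

Running it with limits of 1, 3, and 5 against each backend should show whether per-request latency stays acceptable at those loads, which matters more to me here than the total throughput at 100 parallel requests.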