When you make a Docker image and push it to Docker Hub, all of the instructions used to build it appear there, so it's very transparent. It's also super easy for anyone to build it themselves, unlike executables: just download the Dockerfile and run a single command.
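Roughly like this, as a sketch; the repo and image names are placeholders, not a real project:

```sh
# Grab the source with the Dockerfile in it ("yourname/yourtool" is a placeholder).
git clone https://github.com/yourname/yourtool.git
cd yourtool

# Build the image locally; -t just tags it with a name.
docker build -t yourtool .

# Run your locally built image instead of trusting a prebuilt one.
docker run --rm yourtool
```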
Besides the obvious option of telling your users to build the exe themselves, have you considered alternative distribution methods like Docker?
Agreed, it seems quite capable. I haven't tested all the way down to Q2 to verify, but I'm not surprised.
There's apparently a pip command to display the leaderboard. If this ends up being of interest to people, I could make a post and just update it every so often with the latest results.
Yeah, it's a step in the right direction at least. Though now that you mention it, doesn't LMSYS or someone do the same thing with human evals and side-by-side comparisons?
It's such a tricky line to walk between deterministic questions (repeatable but cheatable) and user-submitted questions (real-world but potentially unfair).
I have the snap installed; for what it's worth, it's pretty painless AS LONG AS YOU DON'T WANT TO DO ANYTHING SILLY.
I've found it nearly impossible to alter the base behaviour without it entirely breaking, so if Nextcloud out of the box does exactly what you want, go ahead and install it via snap...
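For anyone curious, the out-of-the-box setup really is minimal, something like this (the admin credentials are placeholders):

```sh
# Install the official Nextcloud snap.
sudo snap install nextcloud

# Create the initial admin account (username/password here are placeholders).
sudo nextcloud.manual-install admin your-strong-password
```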
I predict that on Docker you're going to have a bad time if you can't give it host network mode and instead try to just forward ports (see the sketch below).
That said, Docker >>>> VM in my books.
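To sketch the difference I mean, using the stock nextcloud image and example ports:

```sh
# Host network mode: the container shares the host's network stack,
# so multi-port services and discovery tend to just work.
docker run -d --network host nextcloud

# Plain port forwarding: only explicitly mapped ports are reachable,
# which is where I'd expect the pain to start.
docker run -d -p 8080:80 nextcloud
```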
lmao, a reasonable request. I'm pretty disappointed they don't have it hosted anywhere...
Here's a link to their latest image of the leaderboard, for what it's worth:
https://cdn.discordapp.com/attachments/1134163974296961195/1138833170838589471/image1.png
I've managed to get it running in koboldcpp; I had to add --forceversion 405 because it wasn't being detected properly. Even with Q5_1 I was getting an impressive 15 T/s, and the code actually seemed decent. This might be a really good candidate for fine-tuning on large datasets and passing massive contexts of basically entire small repos, or at least several full files.
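For reference, the launch looked roughly like this (the model filename is a placeholder for whichever quant you grabbed):

```sh
# --forceversion overrides format autodetection, which was picking wrong for me.
python koboldcpp.py --model some-model.ggmlv3.q5_1.bin --forceversion 405
```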
Odd that they chose NeoX as their base model; I think only CTranslate2 can offload those? I had trouble getting the GPTQ version running in AutoGPTQ... maybe the Hugging Face TGI would work better.
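If anyone wants to try the CTranslate2 route, the usual flow is a one-off conversion and then loading the output folder. A sketch, with EleutherAI/gpt-neox-20b standing in for the actual checkpoint:

```sh
pip install ctranslate2 transformers

# Convert a Hugging Face NeoX-style checkpoint to CTranslate2's format;
# int8 quantization keeps the converted model small.
ct2-transformers-converter --model EleutherAI/gpt-neox-20b \
    --output_dir neox-ct2 --quantization int8
```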
I've been impressed with Vicuna 1.5; it seems quite competent and enjoyable. Unfortunately I'm only able to run 13B at any reasonable speed, so that's where I tend to stay. Funnily enough, I haven't tried any 70Bs since llama.cpp added support; I'll have to start some downloads...
The same thing is happening here that happened with smartphones: we started out treating benchmarks as the be-all and end-all, largely because the phones were all so goddamn different that it was impossible to compare them 1:1 in any meaningful way without some kind of automation like benchmarks.
Then some people started cheating them, and we noticed that the benchmarks, while nice for generating big pretty numbers, didn't actually correlate much with real-world performance, and more often than not misrepresented what the product was capable of.
Eventually we'll get to a point where we can strike a balance: benchmarks providing useful metrics and a frame of reference for spotting when something is off, alongside real reviews that dive into how the model actually behaves in the real world.
I would love to see more of this, and maybe it could be its own post for more traction and discussion. Do you have a link to those pictures elsewhere? I can't seem to get a large version loaded on desktop, haha.
You could definitely do clever things to obfuscate what you're doing, but it's much easier to reproduce the image build since there are no external dependencies: if you have Docker installed, you can build any Docker image.