Kerfuffle

joined 2 years ago
[–] Kerfuffle 1 points 1 year ago

Because I don’t live in fantasy land where prepared food costs are exactly the same as raw food costs?

Obviously it doesn't. Either your time is so valuable that it's clearly better to pay someone else to prepare stuff (which appears to be your position) or it's not. The equation doesn't change when we're talking about 10 meals or 1 meal. You don't seem to realize the inconsistency in your position.

You don't save "hundreds of dollars" by preparing one meal yourself; you might save a couple dollars at the expense of your time. Roasting some coffee is roughly the same amount of effort as preparing one meal yourself, and you probably save about the same amount of money. So if your time is so valuable that roasting coffee would be a ridiculous waste of it, then, if you were consistent, that would also apply to meal prep.

Yes, I agree with you that your entire argument doesn’t make sense.

The "I know you are, but what am I?" turnaround seems a bit immature, don't you think?

[–] Kerfuffle 1 points 1 year ago (1 children)

Ah, I see. Wouldn't it be pretty easy to determine if MPS is actually the issue by trying to run the model with the non-MPS PyTorch version? Since it's a 7B model, CPU inference should be reasonably fast. If you still get the memory leak, then you'll know it's not MPS at fault.
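Something like this (rough, untested sketch; I'm assuming the Qwen-7B-Chat model from the other thread, so swap in whatever you're actually running) is what I have in mind:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat"  # assumption: substitute the model that's leaking
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,   # plain fp32 on CPU
    trust_remote_code=True,
)  # note: no .to("mps") anywhere, so the MPS backend is never touched

inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```

If memory stays flat when you run that in a loop on CPU, that points at MPS (or PyTorch's MPS backend) rather than the model code.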

[–] Kerfuffle 1 points 1 year ago (3 children)

You can find the remote code in the huggingface repo.

Ahh, interesting.

I mean, it's published by a fairly reputable organization so the chances of a problem are fairly low but I'm not sure there's any guarantee that the compiled Python in the pickle matches the source files there. I wrote my own pickle interpreter a while back and it's an insane file format. I think it would be nearly impossible to verify something like that. Loading a pickle file with the safety stuff disabled is basically the same as running a .pyc file: it can do anything a Python script can.
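Just to illustrate the point (toy example, nothing to do with Qwen specifically):

```python
import pickle

class Sketchy:
    def __reduce__(self):
        # This runs at load time; it could just as easily be os.system("...").
        return (print, ("arbitrary code executed during pickle.loads()",))

payload = pickle.dumps(Sketchy())
pickle.loads(payload)  # prints the message: code ran just by loading the data
```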

So I think my caution still applies.

It could also be PyTorch or one of the huggingface libraries, since mps support is still very beta.

From their description here: https://github.com/QwenLM/Qwen-7B/blob/main/tech_memo.md#model

It doesn't seem like anything super crazy is going on. I doubt the issue would be in Transformers or PyTorch.

I'm not completely sure what you mean by "MPS".

[–] Kerfuffle 2 points 1 year ago

I have also tried to generate code using deterministic sampling (always pick the token with the highest probability). I didn’t notice any appreciable improvement.

Well, you said you sometimes did that, so it's not entirely clear which of your conclusions are based on deterministic sampling and which aren't. Anyway, like I said, it's not just temperature that may be causing issues.

I want to be clear I'm not criticizing you personally or anything like that. I'm not trying to catch you out and you don't have to justify anything about your decisions or approach to me. The only thing I'm trying to do here is provide information that might help you and potentially other people get better results or understand why the results with a certain approach may be better or worse.

[–] Kerfuffle 3 points 1 year ago* (last edited 1 year ago) (5 children)

Another one that made a good impression on me is Qwen-7B-Chat

Bit off-topic but if I'm looking at this correctly, it uses a custom architecture which requires turning on trust_remote_code, and the code that would be embedded into the models and trusted is not included in the repo. In fact, there's no real code in the repo: it's just a bit of boilerplate to run inference and tests. If so, that's kind of spooky and I suggest being careful not to run inference on those models outside of a locked-down environment like a container.
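For reference, the load path I mean looks roughly like this (sketch): trust_remote_code=True tells transformers to download and execute whatever Python modelling code ships with the checkpoint, so it deserves the same caution as running any random script.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen-7B-Chat"
# Refuses to load without trust_remote_code=True, because the architecture
# isn't implemented inside the transformers library itself.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
```

Running that inside a throwaway container (or at least under a separate user account) limits what the downloaded code can touch.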

[–] Kerfuffle 2 points 1 year ago (2 children)

For sampling I normally use the llama-cpp-python defaults

Most default settings have the temperature around 0.8-0.9, which is likely way too high for code generation. Default settings also frequently include stuff like a repetition penalty. Imagine the LLM is trying to generate Python: it has to produce a bunch of spaces before every line, but a repetition penalty can severely reduce the probability of the tokens it basically must select for the result to be valid. With code, there's often very little leeway for choosing what to write.
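To make it concrete, something along these lines (sketch with a placeholder model path) is what I'd try for code generation instead of the defaults:

```python
from llama_cpp import Llama

llm = Llama(model_path="./model.gguf")  # placeholder path

out = llm(
    "# Python function that reverses a string\n",
    max_tokens=256,
    temperature=0.1,     # near-greedy; code leaves little room for "creative" tokens
    top_p=0.95,
    repeat_penalty=1.0,  # 1.0 disables the penalty so required repeats (indentation, brackets) aren't punished
)
print(out["choices"][0]["text"])
```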

So you said:

I’m aware of how sampling and prompt format affect models.

But judging the model by what it outputs with the default settings (I checked and it looks like for llama-cpp-python it has both a pretty high temperature setting and a repetition penalty enabled) kind of contradicts that.

By the way, you might also want to look into the grammar sampling stuff that recently got added to llama.cpp. This can force the model to generate tokens that conform to some grammar, which is pretty useful for code and other output that has to follow a strict structure. You should still carefully look at the other settings to make sure they suit the type of result you want to generate, though; the defaults are not suitable for every use case.
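If I'm remembering the llama-cpp-python API correctly, using it looks roughly like this (toy yes/no grammar just to show the idea; the llama.cpp repo ships more useful ones like a JSON grammar):

```python
from llama_cpp import Llama, LlamaGrammar

llm = Llama(model_path="./model.gguf")  # placeholder path

# Trivial GBNF grammar: the sampler can only ever produce "yes" or "no".
grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

out = llm(
    "Is Python dynamically typed? Answer yes or no: ",
    max_tokens=4,
    temperature=0.2,
    grammar=grammar,
)
print(out["choices"][0]["text"])
```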

[–] Kerfuffle 1 points 1 year ago (4 children)

But I can’t accept such an immediately-noticeable decline in real-world performance (model literally craps itself) compared to previous models while simultaneously bragging about how outstanding the benchmark performance is.

Your criticisms are at least partially true, and benchmarks like "x% of ChatGPT" should be looked at with extreme skepticism. In my experience as well, parameter count is extremely important. Actually, it's very important even on the benchmarks: if you look at the leaderboards that collect results you'll see, for example, that there are no 33B models with an MMLU score in the 70s.

However, I wonder if all of the criticism is entirely fair. Just for example, I believe MMLU is 5-shot and ARC is 10-shot. That means the prompt contains a bunch of examples of that type of question with the correct answer before the one the LLM has to answer. If you're just asking it a question directly, that's 0-shot: it has to get it right the first time, without any examples of correct question/answer pairs. A high MMLU score doesn't necessarily translate directly to 0-shot performance, so your expectations might not be in line with reality.

Also, different models have different prompt formats. For these fine-tuned models, it won't necessarily just say "ERROR" if you use the wrong prompt format, but the results can be a lot worse. Are you making sure you're using exactly the prompt format that was used when benchmarking?
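For example (illustrative only, check the model card for the exact template), two common fine-tune formats look nothing alike:

```python
question = "Write a function that reverses a string."

# Vicuna-style template
vicuna_prompt = f"USER: {question}\nASSISTANT:"

# Alpaca-style template
alpaca_prompt = f"### Instruction:\n{question}\n\n### Response:\n"
```

Feed one of those to a model trained on the other and it will usually still answer, just noticeably worse.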

Finally, sampling settings can make a really big difference too. A relatively high temperature can be good when generating creative output, but not when generating source code. Stuff like repetition or frequency/presence penalties can be good in some situations, but maybe not when generating source code. The wrong sampler settings can force a random token to be picked even when it's not valid for the language, or ban/reduce the probability of tokens that are necessary to produce valid output.

You may or may not already know, but LLMs don't produce any specific answer after evaluation. You get back an array of probabilities, one for every token ID the model understands (~32,000 for LLaMA models). So sampling can be extremely important.
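A toy illustration of that last point (numpy stand-in for a real model's output):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=32000)  # stand-in for one evaluation step: one score per token id

def sample(logits, temperature):
    probs = np.exp(logits / temperature)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

print(sample(logits, temperature=0.8))   # fairly random pick
print(sample(logits, temperature=0.05))  # almost always the highest-probability token
```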

[–] Kerfuffle 10 points 1 year ago (1 children)

And others don’t feel guilty for eating meat.

Carrots are incapable of feeling anything: they can't be affected in a morally relevant way. Animals have emotions, preferences, can experience suffering and can be deprived of positive/pleasurable experiences in their lives.

Thank you for recognizing that people have different feelings.

Obviously this isn't a sufficient justification for harming others. "I don't care about people with dark skin, please recognize that different people have different feelings." The fact that I don't care about the individuals I'm victimizing doesn't mean victimizing them is okay.

[–] Kerfuffle 1 points 1 year ago (2 children)

You’re comparing a need (food) to not even a want,

This makes no sense. We're talking about preparing it yourself vs buying it. In either case, you get the item so there's no "this need doesn't get satisfied" possibility.

You don't need to roast your own coffee, just as you don't need to prepare your own meals: instead of spending the time to do those things yourself, you could buy them. So if your position is "my time is so valuable that I'd rather pay someone else to do the work", why does that only apply to roasting coffee and not to preparing meals?

[–] Kerfuffle 3 points 1 year ago

I’m not a utilitarian, I’m a virtue ethicist.

We'll probably never agree then.

I think participating in rape isn’t virtuous.

Drinking 1/10th of a teaspoon of milk is necessarily never virtuous and always wrong then. Correct?

[–] Kerfuffle 5 points 1 year ago (2 children)

It’s a fucking bottle of milk, it’s not insulin. You’re doing the kid a favor by refusing.

Kind of surprised by some of the responses here.

The kid doesn't need the milk, but there's basically no upside to refusing:

  1. Any adult that hears your response (or learns about it later) is going to come away with a very negative perception of vegans.
  2. This may apply to the kid (maybe later on), or at the least they'll just be confused and think you're mean.
  3. The kid is already holding the bottle of milk, whatever harm the demand caused is already done. If the kid throws the milk away and buys some other beverage, that's probably going to be more harm overall: basically anything we eat causes some amount of animal suffering, human suffering, environmental damage, etc.

Would you cook a hamburger for the kid if he looked up at you with big eyes and said pweeeze?

There's a massive difference between opening a container and cooking a meal.

I think it's better to look at this from the perspective of what actually makes it less likely for animals to be harmed than to stand on principle no matter what the effect is. Hmm, sounds like the kind of thing a Utilitarian would say.

[–] Kerfuffle 2 points 1 year ago

It is only a matter of time before we’re running 40B+ parameters at home (casually).

I guess that's kind of my problem. :) With 64GB RAM you can run 40, 65, 70B parameter quantized models pretty casually. It's not super fast, but I don't really have a specific "use case" so something like 600ms/token is acceptable. That being the case, how do I get excited about a 7B or 13B? It would have to be doing something really special that even bigger models can't.
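Rough back-of-the-envelope for why that fits (assuming something like 4.5 bits per weight on average, which I think is in the right ballpark for the common 4-bit quants):

```python
params = 70e9            # 70B parameter model
bits_per_weight = 4.5    # assumed average for a 4-bit quant format
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~39 GB, leaving headroom in 64 GB for KV cache and the OS
```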

I assume they'll be working on a Vicuna-70B 1.5 based on LLaMA 2, so I'll definitely try that one out when it's released, assuming it performs well.
