this post was submitted on 29 Oct 2023
93 points (93.5% liked)

Programming

16760 readers
207 users here now

Welcome to the main community in programming.dev! Feel free to post anything relating to programming here!

Cross posting is strongly encouraged in the instance. If you feel your post or another person's post makes sense in another community cross post into it.

Hope you enjoy the instance!

Rules

Rules

  • Follow the programming.dev instance rules
  • Keep content related to programming in some way
  • If you're posting long videos try to add in some form of tldr for those who don't want to watch videos

Wormhole

Follow the wormhole through a path of communities [email protected]



founded 1 year ago
MODERATORS
 

Got myself a few months ago into the optimization rabbit hole as I had a slow quant finance library to take care of, and for now my most successful optimizations are using local memory allocators (see my C++ post, I also played with mimalloc which helped but custom local memory allocators are even better) and rethinking class layouts in a more "data-oriented" way (mostly going from array-of-structs to struct-of-arrays layouts whenever it's more advantageous to do so, see for example this talk).

What are some of your preferred optimizations that yielded sizeable gains in speed and/or memory usage? I realize that many optimizations aren't necessarily specific to any given language so I'm asking in [email protected].

top 38 comments
sorted by: hot top controversial new old
[–] [email protected] 43 points 9 months ago (1 children)

I don’t know if this counts, but making sure not a single query in my app results in a full table scan in MySQL made a huge difference.

[–] [email protected] 44 points 9 months ago* (last edited 9 months ago) (2 children)

Lemmy is probably a good live example of how sometimes going for a "faster language" like Rust isn't going to magically make a bad SQL database design better or slow queries faster: https://github.com/LemmyNet/lemmy/issues/2877

So yeah, SQL optimization stories are definitely welcome too!

[–] [email protected] 5 points 9 months ago

That was a good read. Thanks for the link.

[–] [email protected] 5 points 9 months ago

I read through most of this and it was very informative. Thank you very much.

[–] [email protected] 33 points 9 months ago (1 children)

More than ten years ago, I was a team-lead in a game development company and during another crunch I was very overloaded with tasks, including a code review. And of course, I missed some things. We optimized the game, which in general did not do 30 fps on the target machine with minimum requirements, and the target was at least 40 fps. During the month, we jointly optimized everything so that we were able to achieve almost stable 40 frames, but this was not enough and periodically there were drawdowns up to 15-20 fps. Everyone is in panic, there are no ideas, we’re optimized everything.

No, I understand that there is no optimization limit apriori, but I mean an adequate opts without rewriting everything to assembler specifically for all supported architectures.

So, one night I looked into the event loop initialization and found that the initialization happens twice. Twice. Two parallel event loops. That is, two parallel cycles of logic and state updates. That night, by deleting one line, I optimized the performance of the game by more than 100%. 🤦🏻‍♂️

I investigated and found out that this line of secondary initialization was left by junior "for debugging" and forgotten in the crunch. And I missed it on the review.

I’ve keep this story a secret all these years. And now I'm not revealing names. Otherwise, it can have a dramatic impact on the careers of many.

[–] [email protected] 8 points 9 months ago

Human code review is very error prone

[–] [email protected] 24 points 9 months ago

Staying on the SQL theme... The company I work for has a fairly old (~20 years) system. There's a feature for users and site admins to export massive amounts of data, with the option to export data from when the system was first released. Purely CSV or XML data formats. On large datasets, the time for export would vary from 10-20+ hours, and would frequently timeout, forcing you to split exports into multiple timeframes and manually merging them into a single file. The solution? Indexes! Indexes were non-existent. After adding them, export times have dropped to ~10-15 minutes, which is a rather insane performance increase, especially since a single export is accepted per account at a time.

[–] [email protected] 23 points 9 months ago

One of my most productive days was throwing away 1000 lines of code.

-- Ken Thompson

[–] [email protected] 23 points 9 months ago* (last edited 9 months ago)

ok maybe not my biggest or cleanest optimization, but an interesting one

I made the WCS Predictor for StarCraft 2 eSports, which would simulate the whole year of tournaments to figure out the chances for each player qualifying for the world championship (the top 16 players by WCS Points), and it would also show the events that would help/hurt the players. It was a big monte carlo simulation that would run millions of iterations. (My memory is fuzzy here because this was 2013 to 2014)

Originally I did a data-based approach where each Tournament object had a vector of Round objects (like round of 32, or semifinals, etc) which had a vector of Match objects (like Soulkey vs Innovation, or a 4 player group). The Round class had a property for best_of (best of 3, best of 5, etc). This was really inflexible cause I couldn't properly simulate complex tournament formats that didn't fit my data, and it required a lot of if statements that could cause branch prediction misses.

A tournament definition looked something like this:

older code example

		Round ro32("ro32");
		ro32.best_of=3;
		ro32.points_for_3rd=150+25;
		ro32.points_for_4th=100+25;
		Match ro32GroupA;

		ro32.matches.push_back(ro32GroupA);
		ro32.matches.push_back(ro32GroupB);
		ro32.matches.push_back(ro32GroupC);
		ro32.matches.push_back(ro32GroupD);
		ro32.matches.push_back(ro32GroupE);
		ro32.matches.push_back(ro32GroupF);
		ro32.matches.push_back(ro32GroupG);
		ro32.matches.push_back(ro32GroupH);

		GSL.rounds.push_back( ro32 );

When I rewrote it for the following year, I ditched the data-driven approach to try to get more efficiency and flexibility, but I definitely didn't want virtual functions either, so I decided to use templates instead and a custom class for each tournament. Now creating a tournament looked like this:

newer code example


template
class Round : public RoundBase
{
public:
	MatchType matches[num_matches];
// ... more stuff ...
};

class CodeSBase : public TournamentBase
{
public:
	Round<32, SwissGroup, 8, RandomAdvancement<16, 16>, true > ro32;
	Round<16, SwissGroup, 4, A1vsB2<8, 8>, true > ro16;
	Round<8, SingleMatch, 4, StraightAdvancement<4, 4>, true > quarterfinals;
	Round<4, SingleMatch, 2, StraightAdvancement<2, 2>, true > semifinals;
	Round<2, SingleMatch, 1, StraightAdvancement<1, 1>, true > finals;

	void init(vector &prev_matches, vector &upcoming_matches)
	{
		quarterfinals.best_of = 5;
		semifinals.best_of = 7;
		finals.best_of = 7;
		ro32.points_for_placing[3] = 100;
		ro32.points_for_placing[2] = 150;
		ro16.points_for_placing[3] = 300;
		ro16.points_for_placing[2] = 400;
		quarterfinals.points_for_placing[1] = 600;
		semifinals.points_for_placing[1] = 900;
		finals.points_for_placing[1] = 1250;
		finals.points_for_placing[0] = 2000;
		finals.match_placing_to_tournament_placing[0] = 1;
	}


	void predict(Simulation &sim, array &top8, array &bottom24, array &finalists, Rand64 &rng)
	{
		RandomAdvancement<16, 16> Ro32toRo16;
		A1vsB2<8, 8> Ro16toRo8;
		StraightAdvancement<4, 4> Ro8toRo4;
		StraightAdvancement<2, 2> Ro4toRo2;
		StraightAdvancement<1, 1> finalsadv;

		ro32.predict(sim, t_id, Ro32toRo16, rng);
		ro16.AcceptAdvancements(Ro32toRo16.advancing_players);
		ro16.predict(sim, t_id, Ro16toRo8, rng);
		quarterfinals.AcceptAdvancements(Ro16toRo8.advancing_players);
		quarterfinals.predict(sim, t_id, Ro8toRo4, rng);
		semifinals.AcceptAdvancements(Ro8toRo4.advancing_players);
		semifinals.predict(sim, t_id, Ro4toRo2, rng);
		finals.AcceptAdvancements(Ro4toRo2.advancing_players);
		finals.predict(sim, t_id, finalsadv, rng);

		top8 = Ro16toRo8.advancing_players;
		for (uint i = 0; i<8; i++) {
			bottom24[i] = Ro32toRo16.falling_players[i * 2];
			bottom24[i + 8] = Ro32toRo16.falling_players[i * 2 + 1];
			bottom24[i + 16] = Ro16toRo8.falling_players[i];
		}
		finalists[0] = finalsadv.advancing_players[0];
		finalists[1] = finalsadv.falling_players[0];
	}

All data had very good locality using arrays instead of vectors, everything could be inlined and branch predicted and prefecthed, complex tournament formats could be built even properly handling high placements from one tournament granting you seeding into a different tournament. I think I gained like 3x to 5x speedboost from this alone, I also made it multithreaded and work in batches which improved performance further, which allowed me to get updated results more quickly and with a higher number of samples. I also made it so it could output the results and then keep processing more batches of samples to refine the numbers from there. I wouldn't normally suggest these kinds of optimizations but it's a very unusual program to have such a wide hotloop

The website is gone, but here's a working archive of it (which I didn't know existed until writing this post, I thought there were only broken archives)

archive links, me reminiscing on old times, get ready for stats and graphs overload

home page: https://web.archive.org/web/20150822091605/http://sc2.4ever.tv:80/

a tournament page: https://web.archive.org/web/20160323082041/http://sc2.4ever.tv/?page=tournament&tid=27

a player's page: https://web.archive.org/web/20161129233856/http://sc2.4ever.tv/?pid=73

a general checkup page: https://web.archive.org/web/20161130055346/http://sc2.4ever.tv/?page=checkup

a page for players who must win tournaments to qualify: https://web.archive.org/web/20161129235740/http://sc2.4ever.tv/?page=must_wins

page showing the simulations history: https://web.archive.org/web/20161130055230/http://sc2.4ever.tv/?page=simulations

FAQ page: https://web.archive.org/web/20160323073344/http://sc2.4ever.tv/?page=faq

the fantasy league WCS Wars: https://web.archive.org/web/20161130055233/http://sc2.4ever.tv/?page=gamehome

my WCS Wars user page https://web.archive.org/web/20160704040152/http://sc2.4ever.tv/?page=user&uid=1

[–] [email protected] 17 points 9 months ago

A database optimization I made was changing a tables id generation from a manual generation scheme (some other table had an entry with the next usable id, it was updated with every entry written) to a uuid generation scheme. The table stores data from a daily import, on a fresh import all previous data is deleted. On some systems, there are more than 10 000 000 entries to be imported on a daily basis, which took 8 hours. Now, with batched inserts and the mentioned improvement in the db scheme, it's at about 20 minutes.

TLDR: Reducing the amount of queries sent is good, because although network is usually fast (ms), db requests are still slow compared to the speed of an application (clock cycles).

And yeah, there is an option to only import changes daily, but sadly that isn't supported in every environment.

[–] [email protected] 16 points 9 months ago (2 children)

I was working on a pretty well known game, porting it to consoles.

On PS4 we started getting OOM crashes after you've played a few levels, because PS4 doesn't have that much memory. I was mostly new on the project and didn't know it very well, so I started profiling.

It turned out that all the levels are saved in a pretty descriptive JSON files. And all of them are in Unity's Scriptable Objects, so even if you are not playing that level, they all get loaded into memory, since once something references a SO, it gets loaded immediately. It was 1.7Gb of JSON strings loaded into memory once the game started, that stays there for the whole gameplay.

I wrote a build script that compresses the JSON strings using gzip, and then uncompresses it when loading the actual level.

It reduced the memory of all the levels to 46Mb down from 1.7Gb, while also reduced the game load by around 5 seconds.

[–] [email protected] 6 points 9 months ago

See now we want to know what game it was...

[–] [email protected] 2 points 9 months ago

I don't want to know how many rushed games so stuff like this

[–] [email protected] 15 points 9 months ago (1 children)

The optimization I'm the most proud about was when I worked on a legacy project whose end-to-end builds took around 1 hour. After spending some time working on its architecture, project layout and build system, I managed to get the full end-to-end builds to take 10 minutes, and incremental builds to be almost instant.

What makes me the most proud about this project is that the technical debt plaguing the legacy project was directly and indirectly the reason why half a dozen of my team members burned out and quit the company. After that point my remaining team members started to be far less stressed and team velocity skyrocketed, just for the fact that the thought of iterating over a bugfix and posting a pull request didn't cost at least one hour, and sometimes two or three.

[–] [email protected] 7 points 9 months ago* (last edited 9 months ago) (1 children)

Surely you got a bonus and a raise out of it right? Right??

Who am I kidding only managers get such things

[–] [email protected] 2 points 9 months ago

Surely you got a bonus and a raise out of it right? Right??

The only reward I got from it was recognition from my team members, which was already more than what I was expecting to get.

My manager was praised for the higher team velocity and improvements in the team's burndown chart. The hallmark of having done good work is seeing others trying to take credit for it.

[–] [email protected] 10 points 9 months ago

I patched someone else's program which was known for being slower when it's used for a longer time. It was iterating over items in its window just to reach the last element, the more items the slower, it became snappy when i taught it to keep a pointer to the last element.

Not mind blowing, I know, but it was a popular program and this made life better for many people.

[–] [email protected] 10 points 9 months ago* (last edited 9 months ago) (1 children)

I've got so many more stories about bad optimizations. I guess I'll pick one of those.

There was an infamous (and critical) internal application somewhere I used to work. It took in a ton of data, putting it in the database, and then running a ton of updates to populate various fields and states. It was something like,

  • Put all data in x table with batch y.
  • Update rows in batch y with condition a, set as type a. (just using letters as placeholders for real states)
  • Update rows in batch y that haven't been updated and have condition b, set as type b.
  • Update rows in batch y that haven't been updated and have condition c, set as type c.
  • Update rows in batch y that have condition b and c and condition d, set as type d.
  • (Repeat many, many times)

It was an unreadable mess. Trying to debug it was awful. Business rules encoded as a chain of sql updates are incredibly hard to reason about. Like, how did this row end up with that data??

Me and a coworker eventually inherited the mess. Once we deciphered exactly what the rules were and realized they weren't actually that complicated, we changed the architecture to:

  • Pull data row by row (instead of immediately into a database)
  • Hydrate the data into a model
  • Set up and work with the model based on the business rules we painstakingly reverse engineered (i.e. this row is type b because conditions x,y,z)
  • Insert models to database in batches

I don't remember the exact performance impact, but it wasn't markedly faster or slower than the previous "fast" SQL-based approach. We found and fixed numerous bugs, and when new issues came up, issues could be fixed in hours rather than days/weeks.

A few words of caution: Don't assume that building things with a certain tech or architecture will absolutely be "too slow". Always favor building things in a way that can be understood. Jumping to the wrong tool "because it's fast" is a terrible idea.

Edit: fixed formatting on Sync

[–] [email protected] 12 points 9 months ago* (last edited 9 months ago) (1 children)

If you think it’s slow, first prove it with a benchmark

So many crimes against maintainability are committed in the name of performance. Optimisation tears down abstractions, exposes internals, and couples tightly. If you’re choosing to shoulder that cost, ensure it is done for good reason.

[–] [email protected] 5 points 9 months ago

Yep, absolutely.

In another project, I had some throwaway code, where I used a naive approach that was easy to understand/validate. I assumed I would need to replace it once we made sure it was right because it would be too slow.

Turns out it wasn't a bottleneck at all. It was my first time using Java streams with relatively large volumes of data (~10k items) and it turned out they were damn fast in this case. I probably could have optimized it to be faster, but for their simplicity and speed, I ended up using them everywhere in that project.

[–] [email protected] 9 points 9 months ago

As a more fun or absurd anecdote, I reviewed a change in a SQL server stored procedure that merely copied the parameter value into a variable, solving a significant performance issue.

The query optimizer on first invocation determines the optimal query execution and resolution approach. For a stored procedure with parameters, parameter values can change what the best approach is.

In this case, the parameter could hold a record ID or null for all records. Because it included sub queries, whether it's one or the other makes a huge difference on the query execution approach.

Using an intermediate variable prevents it of taking and remembering the optimization path for the other case.

[–] [email protected] 8 points 9 months ago

Don't have an objective definition of "Best" but here are few over the top of my head.

  1. Profiled backend's startup and reduced startup time by ~20-30 times. Issue was we were traversing node_modules folder to scan for API documentation.
  2. Profiled python using cProfiler & snakeviz and optimized dozens of regex calls by simple pre-filtering. Improved performance by ~20 times.
  3. Profiled python using pprofile & qcalgrind (IIRC) to pinpoint a single heavy regex compilation and optimized it by pre-filtering. Improved performance 8-10 times.
  4. Use nodejs streams to do load data from s3, parse as CSV, transform into different structure and save into s3 in a continuous stream. Didn't measure the performance gains but should be several times atleast. 5... many more, some coming from embedded world and even more drastic & unexpected
[–] [email protected] 7 points 9 months ago

I just recently updated a database call to return the last 7 days of data instead of all data from all time. It went from taking 30+seconds to around 2-4 seconds. Still a long way to go to make it "good" but at least it's not timing out now.

[–] [email protected] 7 points 9 months ago* (last edited 9 months ago)

This isn't any great C++ low level optimization but I have an Angular Project which contains a list that shows the live logs of various backend services and other things comming in through websockets (I keep up to 500K lines in memory). Essentially a extremly long scrollable list. First I obviously used lazy list which only renders the elements in view that are visibile instead of one that renders everything all the time. Secondly I removed as many html elements from each list element as possible, mostly divs that were just used to apply a css class.

When it comes to JS code I got a huge performance improvement by replacing a bunch of pure array.map, array.filter, array.toSorted, ... method calls with a big ol for loop. These pure functions caused an insane memory churn because they create a shallow copy each time and doing that for an array with 500K entries caused the garbage collector to work overtime. Also the whole filtering and searching in the first draft was done every time new elements were added to the list via websocket for the whole entire list. I changed that so it only does it for new elements.

Overall this allowed me to increase the number of logs kept in memory from about 20K (because then performance would start to dip) to now 500K.

[–] [email protected] 5 points 9 months ago

Memorization of pure functions can make a WORLD of difference. It's just not as easy as it should be.

[–] [email protected] 5 points 9 months ago* (last edited 9 months ago)

I had a pretty standard linear-list scan initially. Each time the program started, I'd check the list for some values. The list of course grew each time the program started. I maximized the list size to like 2MB or something (I forget), but it was in the millions and therefore MBs range. I figured it was too small for me to care about optimization.

I was somewhat correct, even when I simulated a full-sized list, the program booted faster than I could react, so I didn't care.


Later, I wrote some test code that exhaustively tested startup conditions. Instead of just running the startup once, I was running it millions of times. Suddenly I cared about startup speed, so I replaced it with a Hash Table so that my test-code would finish within 10 minutes (instead of taking a projected 3 days to exhaustively test all startup conditions).


Honestly, I'm more impressed at the opposite. This is perhaps one of the few times I've actually taken the linear-list and optimized it into a hash table. Almost all other linear-lists I've used in the last 10 years of my professional coding life remain just that: a linear scan, with no one caring about performance. I've got linear-lists doing some crazy things, even with MBs of data, that no one has ever came back to me and said it needs optimization.

Do not underestimate the power of std::vector. Its probably faster than you expect, even with O(n^2) algorithms all over the place. std::map and std::unordered_map certainly have their uses, but there's a lot of situations where the std::vector is far, far, far easier to think about, so its my preferred solution rather than preoptimizing to std::map ahead of time.

[–] [email protected] 5 points 9 months ago

another funny optimization I made was we were setting up a big Perl project for Docker. The automated tests were very slow because each test was doing includes which was searching many include paths and if you're using Docker on Windows or Mac then I/O stuff can be slow. So we made the docker init script create a single master lib folder and have it make symlinks to the actual files, perl then only had the default include path and this master folder, gave us like 2x or 3x speed boost on running the test suite.

[–] [email protected] 5 points 9 months ago* (last edited 9 months ago)

Noticed that Hibernate session (DB ORM session) was leaking to Jackson (JSON marshalling), potentially causing infinite n+1 problem. Changing a few lines of code to lazy loading and fixing the session leak reduced our daily data transfer from DB from 5.6Gb to 170Mb.

Not sure if this was the biggest optimisation, but definitely the dumbest issue.

[–] [email protected] 5 points 9 months ago

I remember getting 9x improvement on a set of Shell scripts at work, piping the results instead of writing and reading from files at each step.

[–] [email protected] 5 points 9 months ago

Was working on a game where we drew a fake audio graph to a black texture. The code originally cleared this texture every frame by drawing each pixel black before drawing the audio wave again. I cached the pixels we drew the audio wave to and only changed those back to black saving a huge amount of time.

[–] [email protected] 4 points 9 months ago

Unrolled a loop for a microcontroller and improved performance by like 300x. You can achieve amazing optimizations doing basics when you're at low enough levels. The compiler was not smart enough on its own. Then again this was like 20 years ago. But anyway, will likely never hit that level of optimization ever again lol.

[–] [email protected] 4 points 9 months ago

Not sure if it counts, but a web server I took over had a checkout process that took about 10 seconds. I added a queue and offloaded everything not a required verification step and took the request down to under a second.

[–] [email protected] 4 points 9 months ago

Meme answer: Changed a manual wait from 20 seconds to 18 seconds. Boom, 10% performance improvement. I can only do it so often though.

[–] [email protected] 3 points 9 months ago

We had a service that compiles a dataset once per quarter. The total size is ~30gb. We were starting a container, storing it on an EFS volume, and mounting like any other disk.

Every time a pod started it would need to read this data into memory so we would get quick initial start-up time but the time to be ready for traffic still took a while.

Since we didn't need to update it very often, we decided to just package the compiled dataset into the container and skip the EFS volume. We updated the image pull policy to ifNotPresent so it cut egress traffic pricing from EFS to zero. Now there is a cost to pull the image from ECR but that's only if the pod is being scheduled onto a node it hasn't been run on before. There was no noticable change in behavior or performance and we saved a bunch on cost.

Sometimes the big, dumb option is the right choice.

[–] [email protected] 3 points 9 months ago

I recently spent some time optimizing a small Julia program I wrote that generates a lookup table of brainfuck constants. Because it only needs to run once, I originally didn't care about performance when I originally wrote it (and the optimization was mostly for fun).

I achieved an ~100x improvement by adding types, using static arrays and memoization. In the end, the performance was mostly limited by primitive math operations, I tried using multiple threads, but any synchronization destroyed the performance.

However, the most impressive thing was the ability of Julia to scale from dynamically typed scripting language to almost a compiled language with minimal changes to the code.

[–] [email protected] 3 points 9 months ago

Implemented a dual ring buffer read/write multiprocessing approach that removed locking, doubled memory requirements but massively sped up the process.

[–] [email protected] 2 points 9 months ago

I was playing bloons td back when it was flash, in firefox. It was sometimes too slow. So i fired up perf and found out what horrors flash player was doing with memcpy. One byte memcpy, completely unaligned memcpy.

So i wrote an ssse3 memcpy that could do one byte unaligned with xmm registers. It was 30% faster then whatever glibc was doing and made the game playable. Was planing to submit it to glibc, but they came up with something different that was just as fast.

[–] [email protected] 1 points 9 months ago

This is still on my todo list, but getting rid of std::regex in my C++ code. It's abhorrently slow.