Content

A long-standing issue with Wabbajack is that it always wants to recreate BSA archives when updating a list, even if the contents of those archives haven't changed since the last modlist release. This has been fixed; keep reading for the nitty-gritty details.

The reason for this long-standing issue is that BSAs and the compression they use are non-deterministic. The table of contents in a BSA file must be in a very specific order (ordered by Bethesda's wonky hash value), but the order of the file data itself is unspecified. So BSArch may write all the files based on size, Creation Kit may do it based on name, and Wabbajack may do it based on something completely different. All are correct, but since the data in the archive is in a different order, the hashes of the archives won't match.
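To make the ordering problem concrete, here is a minimal sketch in Python. It uses a made-up toy archive layout, not the real BSA format, and SHA-256 as a stand-in for whatever hash a tool actually uses. The names `toy_archive` and `read_toy_archive` are hypothetical.

```python
import hashlib


def toy_archive(files: dict[str, bytes], data_order: list[str]) -> bytes:
    """Build a toy archive (NOT the real BSA layout).

    The table of contents is always sorted by name, but the file data
    is written in whatever order the caller asks for.
    """
    offsets = {}
    blob = b""
    for name in data_order:
        offsets[name] = len(blob)
        blob += files[name]
    toc = b""
    for name in sorted(files):
        toc += f"{name}:{offsets[name]}:{len(files[name])}\n".encode()
    # A single NUL byte separates the table of contents from the data.
    return toc + b"\0" + blob


def read_toy_archive(archive: bytes) -> dict[str, bytes]:
    """Extract every file from a toy archive."""
    toc, blob = archive.split(b"\0", 1)
    out = {}
    for line in toc.decode().splitlines():
        name, offset, size = line.rsplit(":", 2)
        out[name] = blob[int(offset):int(offset) + int(size)]
    return out


files = {"meshes/a.nif": b"AAAA", "textures/b.dds": b"BB"}

# Two "tools" that disagree only on the order of the file data.
by_name = toy_archive(files, sorted(files))
by_size = toy_archive(files, sorted(files, key=lambda n: len(files[n])))

hash_by_name = hashlib.sha256(by_name).hexdigest()
hash_by_size = hashlib.sha256(by_size).hexdigest()

print("same extracted files:", read_toy_archive(by_name) == read_toy_archive(by_size))
print("same archive hash:   ", hash_by_name == hash_by_size)
```

Both archives extract to identical files, yet their bytes, and therefore their hashes, differ.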

Likewise, BSA archives use what's known as zlib (zip-style) or LZ4 compression. Like BSAs themselves, these compression formats are not lossy, and when extracted will always give back exactly the same data they were handed, but on disk the format is non-deterministic. These compression formats are a bit like driving directions. One person may say "turn left, then right, then stop at the third light" and someone else may say "go to the intersection of fifth and third, then turn on Worcestershire street, and finally stop at the 3rd house". Both may be correct, but the instructions are completely different.
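The same effect is easy to see with Python's standard `zlib` module. This is only an illustration of the general point: it varies the compression level, which is one of several reasons two tools can produce different compressed bytes for the same input.

```python
import zlib

data = b"The quick brown fox jumps over the lazy dog. " * 200

# Same input, two different compression settings.
fast = zlib.compress(data, 1)
small = zlib.compress(data, 9)

print("compressed bytes identical:", fast == small)
print("round-trips identical:     ", zlib.decompress(fast) == zlib.decompress(small) == data)
```

The compressed streams differ, but both decompress to exactly the original data.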

All this means that there are countless ways to write a BSA file that all result in the same correct file for the game, but the contents at the binary level look nothing alike. This is a problem that has stumped me for some time.

But in reality the solution was fairly simple. Wabbajack keeps a cache of the hashes of all files. It's essentially a mapping of "this file name has this hash as of this date". If you update the file, the modified date will change and WJ knows to then go and recalculate the hash. But some time ago we added an optimization that allows WJ to set its own hash value for a file whenever it wants. There's no reason to re-hash that 200MB ESP file once you install it: you got it from an archive with a matching hash, so the ESP has to also have the correct hash. So when installing files in WJ we simply pre-populate this cache with hash values from the modlist author's system.
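A cache like that can be sketched roughly as follows. This is not Wabbajack's actual code: the `HashCache` class and its `get`/`set` methods are hypothetical names, the cache lives only in memory, and SHA-256 stands in for the real hash function.

```python
import hashlib
import os
import tempfile
from pathlib import Path


class HashCache:
    """Maps a file path to (modified time, hash)."""

    def __init__(self) -> None:
        self._entries: dict[str, tuple[int, str]] = {}

    def set(self, path: Path, file_hash: str) -> None:
        """Record a hash we already trust, without reading the file."""
        self._entries[str(path)] = (path.stat().st_mtime_ns, file_hash)

    def get(self, path: Path) -> str:
        """Return the cached hash, re-hashing only if the file changed."""
        mtime = path.stat().st_mtime_ns
        entry = self._entries.get(str(path))
        if entry is not None and entry[0] == mtime:
            return entry[1]
        file_hash = hashlib.sha256(path.read_bytes()).hexdigest()
        self._entries[str(path)] = (mtime, file_hash)
        return file_hash


with tempfile.TemporaryDirectory() as tmp:
    esp = Path(tmp) / "example.esp"
    esp.write_bytes(b"plugin data")

    cache = HashCache()
    cache.set(esp, "hash-from-modlist")  # pre-populated, file never read
    trusted = cache.get(esp)

    # Edit the file and push its modified time forward by one second.
    esp.write_bytes(b"edited plugin data")
    st = esp.stat()
    os.utime(esp, ns=(st.st_atime_ns, st.st_mtime_ns + 10**9))
    rehashed = cache.get(esp)

print(trusted)
print(rehashed)
```

The first lookup returns the pre-populated value untouched; the second notices the new modified time and hashes the file for real.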

But for the longest time we didn't do this caching for BSAs... and that was the problem. Now, as of 2.5.3.15, we effectively lie to the hash cache. We pre-populate the hash for the BSA file with the value from the author's machine. Then when reinstalling it's easy enough to see that the BSA's cached hash still matches the hash from the modlist, and we know we don't have to create the BSA. Based on that we can then remove a lot of archives and extracted files that we no longer need.
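The resulting decision can be sketched like this. Again, this is an assumption-laden toy rather than the real installer: `install_bsas` is a hypothetical function, the hashes are placeholder strings, and `build` is a stand-in for whatever actually writes the archive.

```python
from typing import Callable


def install_bsas(
    modlist_hashes: dict[str, str],
    cached_hashes: dict[str, str],
    build: Callable[[str], None],
) -> list[str]:
    """Rebuild only the BSAs whose cached hash differs from the modlist's.

    After building, store the *author's* hash in the cache rather than
    hashing our own output, since our bytes may legitimately differ.
    """
    rebuilt = []
    for name, author_hash in sorted(modlist_hashes.items()):
        if cached_hashes.get(name) == author_hash:
            continue  # cached value matches the modlist: nothing to do
        build(name)
        cached_hashes[name] = author_hash
        rebuilt.append(name)
    return rebuilt


modlist = {"Textures.bsa": "aaa", "Meshes.bsa": "bbb"}
cache: dict[str, str] = {}

first = install_bsas(modlist, cache, build=lambda name: None)
second = install_bsas(modlist, cache, build=lambda name: None)

# The author ships a new version of one archive.
modlist["Meshes.bsa"] = "ccc"
third = install_bsas(modlist, cache, build=lambda name: None)

print(first, second, third)
```

With an empty cache everything is rebuilt once; after that, only archives whose modlist hash actually changed are rebuilt.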

How much does this help? Well, in a test modlist I used, there were 400k files. It took a little bit to optimize the list, and from there WJ realized it only needed to install 1000 files instead of 100,000 files. In addition, no BSAs were deleted or recreated, and the file installation happened in about 10 seconds. Those remaining 1000 files are all .meta files and .ini files that are even harder to figure out how to optimize away, but they take just a second or two to write, so I doubt we'll ever go back and work on this.

One note about this: the faster install method only works if the BSAs have a pre-cached value, so sadly you'll need to update once more with the slow method. That should be the last time, though; future updates to that same list shouldn't recreate the BSAs.

It feels good to finally get this thorn in the side removed.

Thank you all for your support as always.

Comments

John Barry

Fascinating as always, you keep writing them, I'll keep reading them. Will now go and update :)