
Over the past week I've spent quite a few hours working through our rather messy file extraction code. File extraction has a surprising number of problems, and finding the right combination of features and limitations has been hard. As an example, here's a list of some of the challenges with extracting mod files:

* Extracting a file requires both reading and writing data. This can be problematic for HDDs, which prefer to read data in fairly large chunks in order to reduce drive seek times

* Sometimes we need files inside of files. A WJ list may call for a file to be copied to disk that comes from a .BSA inside a .7z, so there could be a 2GB mod of which we only need about 100MB worth of data

* Several archive types (like 7z) are what's known as "solid" archives: many files are compressed together as one continuous stream. This means we can't just jump into the archive and pull out the one bit of data we need; we have to decompress the archive up to that point to get the file.

* Because solid extraction is so slow, we need some sort of batched extraction. We only want to read each archive once, so we have to batch up all the files we need from that archive and extract them in one pass

* Files inside archives inside archives are a pain; it would be nice not to have to write an archive to disk just to read it and then delete it right afterwards.

* The 7z format is largely undocumented, and there's really only one utility that handles it well: 7z.exe. Controlling that from C# is problematic; it either has to be done via a badly documented COM library or via command-line switches that offer little control over how the data is processed.
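The batching idea above is simple to sketch. Here's a Python illustration of the grouping step (a stand-in for the C# design; the function and data shapes are hypothetical, not WJ's actual code):

```python
from collections import defaultdict

def batch_by_archive(wanted_files):
    """Group (archive, inner_path, destination) requests by archive so each
    archive -- solid or not -- is opened and scanned exactly once."""
    batches = defaultdict(list)
    for archive, inner_path, destination in wanted_files:
        batches[archive].append((inner_path, destination))
    return dict(batches)

# Hypothetical example: three files needed from two archives.
wanted = [
    ("mod_a.7z", "textures/rock.dds", "install/textures/rock.dds"),
    ("mod_b.7z", "meshes/tree.nif", "install/meshes/tree.nif"),
    ("mod_a.7z", "textures/moss.dds", "install/textures/moss.dds"),
]
batches = batch_by_archive(wanted)
# mod_a.7z now maps to both of its files, so one pass over the archive
# can satisfy every request against it.
```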

If all this sounds complex, it is, but let me explain how extraction used to work:

1) We'd write a file to disk that contains the names of the files we want to extract
2) We'd call 7z.exe and hand it the archive and the file names, telling it to extract to a temp folder
3) We'd copy the files from the temp folders to their install locations.
4) We'd duplicate any files that needed to be duplicated
5) If the file we need is in a nested archive, goto #1
6) We'd patch any files on disk that require it.
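Step 2 of the old flow amounts to building a 7z.exe command line around a list file. A sketch of what that invocation looks like (7-Zip's `@listfile` syntax and `-o` output switch are documented; the exact switches the old code used are an assumption):

```python
def build_7z_extract_command(archive, listfile, temp_dir):
    """Build a 7z.exe command line like the old flow relied on.
    'x' extracts with full paths, '-o<dir>' sets the output folder (no space),
    '@<listfile>' tells 7-Zip to read the file names from a list file,
    and '-y' answers yes to all prompts."""
    return ["7z.exe", "x", archive, f"-o{temp_dir}", f"@{listfile}", "-y"]

cmd = build_7z_extract_command("mod.7z", "files.txt", "C:/temp/extract")
# The process would then be launched with this argument list, and the
# extracted files copied out of the temp folder afterwards.
```

Note how little control this gives the caller: the data always lands in a temp folder first, which is exactly the extra read/write pass the new design removes.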

If you count it up, we're reading and writing the contents of the data 3-4 times during this whole process. So the solution (as it is with any programming problem) is to add another layer of abstraction. In the new code, we have a FileExtractor class that takes two callbacks and an archive name. Through this interface the extractor can ask WJ whether it should extract a file, and once a file is extracted, WJ has full control over where to put it. We're interfacing with 7z.dll (via the COM interface), and using several tweaks to keep this all working well:

1) We tell the Extractor to extract a file
2) The Extractor asks if we want a given file (if not, the file is skipped over and not extracted)
3) The Extractor decompresses the file:
 3.1) If the file is > 500MB it is written to a temp file to reduce memory usage
 3.2) If the file is < 500MB it is held in memory and handed to WJ
4) If the file is an archive inside an archive, the file is kept in memory (or in a temp file) and we goto #1
5) WJ writes the data to the install locations, patching it in the process (if required)
6) The temp file is deleted and we move on
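The two-callback interface described above can be sketched like this (a Python stand-in for the C# class; the method and callback names are hypothetical, and a real implementation would stream entries through 7z.dll rather than hold them whole):

```python
class FileExtractor:
    """Sketch of the callback-driven extractor: the caller decides which
    files to extract and where each one ends up."""

    def __init__(self, should_extract, on_extracted):
        self.should_extract = should_extract  # path -> bool: do we want this file?
        self.on_extracted = on_extracted      # (path, data): caller places the data

    def extract(self, entries):
        # 'entries' stands in for the archive's file table.
        for path, data in entries:
            if not self.should_extract(path):
                continue  # unwanted entries are discarded, never written anywhere
            self.on_extracted(path, data)

# Usage: WJ supplies the two callbacks and keeps full control of placement.
wanted = {"meshes/tree.nif": "install/meshes/tree.nif"}
written = {}
extractor = FileExtractor(
    should_extract=lambda p: p in wanted,
    on_extracted=lambda p, data: written.update({wanted[p]: data}),
)
extractor.extract([("meshes/tree.nif", b"NIF..."), ("readme.txt", b"skip me")])
# 'written' now holds only the file WJ asked for, at the location WJ chose.
```

The design choice here is inversion of control: the extractor knows how to decompress, while the caller owns all policy (what to keep, where to put it, whether to patch), so no intermediate temp folder is needed for small files.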

All in all, this new approach greatly reduces HDD thrashing, allows fine-grained control over memory usage (the 500MB limit can be changed quite easily), and allows for a streaming approach where we keep very few archives open and very little data in temp storage, because we reclaim that storage the instant we are done extracting from a temporary file.
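Incidentally, the hold-in-RAM-or-spill-to-disk behaviour from steps 3.1/3.2 is exactly what Python's `tempfile.SpooledTemporaryFile` provides out of the box; here's a sketch of the idea (an illustration, not WJ's actual C# code):

```python
import tempfile

# Data under max_size stays in an in-memory buffer; the first write that
# pushes it past the limit transparently rolls the buffer over to a real
# temp file on disk. The post describes the same split with a 500MB cutoff.
THRESHOLD = 500 * 1024 * 1024  # 500MB, as in the post

with tempfile.SpooledTemporaryFile(max_size=THRESHOLD) as buf:
    buf.write(b"a small extracted file stays in memory")
    buf.seek(0)
    data = buf.read()  # hand the bytes straight to the installer
# Leaving the 'with' block deletes any on-disk spill immediately,
# which is the "reclaim storage the instant we are done" property.
```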

It'll most likely be another week before the release of 2.3.0.0, as there are several other bugs I want to work through and tests I want to run, but I'm quite happy with these improvements and look forward to seeing them in use.
