#ASH0 decompression questions


manic stratus
#

Hey there! I was brought here by this repo https://github.com/PretendoNetwork/ASH0 which said i should ping @cunning yoke for questions.

I'm currently trying to untangle the archive.org archive of Mario Maker servers, both to verify that its data is good and to hopefully put it in a bit more usable format for those who want to recover their courses.

The first issue I'm running into is that while the data seems to have the ASH0 heading, it's actually four ASH0 files concatenated together. While I could search for the ASH0 magic to separate the chunks, it turns out that among the millions of files in the archive there are many where the string "ASH0" randomly happens to appear in the binary data.

So question 1 is: Is there a way for me to determine the length of the ASH0 blob from the binary?

The other issue that I'm having is that although the data does in fact begin with the ASH0 byte sequence, the library here tells me Ash is not compressed. I'm running this on a little-endian Debian server.
never mind, this was PEBKAC: i was passing it an uninitialized buffer. question 1 still stands though

#

(cc @floral kiln who i think has looked at this before maybe?)

manic stratus
#

i'm checking now but it may be the case that ASH0\x00 only occurs four times per file - i'm pretty sure the next byte is always null in practice

manic stratus
#

using the null byte works enough in practice that i'm going ahead and assuming it for the purpose of decompressing the data, and i'll deal with any exceptions if they come up
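A minimal sketch of the null-byte heuristic described above, assuming the whole course blob is already in memory as a Buffer. The names `findAshOffsets` and `splitCourse` are made up for illustration, and the approach is only a heuristic: it fails loudly when the count is not four instead of guessing.

```javascript
// Heuristic only: treat "ASH0" followed by a null byte as a chunk start.
const MAGIC = Buffer.from('ASH0\x00', 'latin1');

// Return every offset at which the 5-byte pattern occurs.
function findAshOffsets(blob) {
    const offsets = [];
    let offset = blob.indexOf(MAGIC);

    while (offset !== -1) {
        offsets.push(offset);
        offset = blob.indexOf(MAGIC, offset + MAGIC.length);
    }

    return offsets;
}

// Split a course blob into its 4 chunks, or throw if the
// heuristic does not find exactly 4 candidate starts.
function splitCourse(blob) {
    const offsets = findAshOffsets(blob);

    if (offsets.length !== 4) {
        throw new Error(`expected 4 chunk starts, found ${offsets.length}`);
    }

    // For the last chunk offsets[i + 1] is undefined,
    // so subarray runs to the end of the blob.
    return offsets.map((start, i) => blob.subarray(start, offsets[i + 1]));
}
```

Courses that throw here can be set aside and inspected by hand.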

#

days-long decompression script has commenced

naive sparrow
#

Nybbit has abandoned that account for many years now, so you won't get any answers out of him

With that said, the goal of that tool was to eventually make it into a compressor as well, but that idea was abandoned a long time ago

As for your question, no there is not. You just have to find the offsets, as you discovered. Yes, SMM courses are stored as 4 ASH compressed files stitched together so the ASH0 magic will only appear 4 times

That being said, that dump is fairly old. I am working with someone right now who has spent several years using our archive tool (https://github.com/PretendoNetwork/smm1-course-archive) to download all the courses, which we at Pretendo will be using and making available for historic purposes

I would suggest just waiting until that process is finished, since it will have much more recent and accurate information

manic stratus
#

okay, sounds good. i have observed that ASH0 appears more than 4 times in a fair amount of the levels

#

as of right now i have developed enough tools to convert the warc archives into a sqlite database and a mountable squashfs of de-ash0'd files.

#

all the ones i've tested load properly when patched into a cemu savefile.

#

the only things that are out of date as far as i can tell are the course metadata (anything maker-related is excluded from the archive, and would be nice to have), and two levels that were 403 forbidden on the original scan

#

for posterity, these two ids:

63933982
69352261

manic stratus
#

so far though, ASH0\x00 (ASH0 plus a null byte) appears to always appear exactly 4 times

manic stratus
#

and just to be clear - it's the metadata that's out of date, right? the archive.org dump was taken shortly after the readonly cutoff i think

naive sparrow
#

IIRC this dump was gotten through just interacting with the game. The person I'm working with is using my archive tool, which connects directly to the game's servers and blows through every single possible course ID, including things like special event courses, and includes all metadata including the maker

You can see the metadata my tool saves here https://github.com/PretendoNetwork/smm1-course-archive/#meta-data

manic stratus
#

ok! that’s pretty much the same schema i have in this dump

#

except the event courses bit

naive sparrow
#

Oh I wonder if they used an older version of my script then

manic stratus
#

maybe! what's the resulting format of your script? this dump is in archive.org warc files, which is non-optimal

manic stratus
#

and later when i get home i can give you a couple ids of courses where the text ASH0 appears more than four times in the blob

manic stratus
#
12994274
17569250
24853613
26091169
29077098
3101633
35161306
44452439
49901121
60099849
7560914
7850168

here's a few course ids

#

that might give you some trouble

manic stratus
#

another issue i'm running into is that the golang library segfaults about 1/10 of the time i run it

manic stratus
#

@naive sparrow okay i've figured out a format that can have fast random access while still being very compressed. essentially i'm decompressing each WARC chunk file into its own squashfs archive, just reusing the sharding already present. happy to compare notes afterwards, as this archive may contain courses that were deleted after the snapshot was taken.

#

(definitely the metadata will be more up to date on the new one though)

#

each squashfs archive uses zstd compression at max compression settings, and it doesn't seem to slow down decompression significantly, especially for the use-case of grabbing one level's files, which is nearly instantaneous

#

i'm also still missing these two level ids:

03CF-8E1E
0422-3B45
#

(still haven't looked into how to compute the checksum to get the full level id but that's the final two segments of each)
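For reference, the two segments above match the two numeric ids posted earlier (63933982 and 69352261) when each id is written as eight uppercase hex digits. A small sketch of that conversion follows. It is based only on those two values, `idToCodeSuffix` is a made-up name, and it does not compute the checksum segment.

```javascript
// Format a numeric course id as the last two segments of its level code:
// 8 uppercase hex digits, zero-padded, split 4-4.
function idToCodeSuffix(courseId) {
    const hex = courseId.toString(16).toUpperCase().padStart(8, '0');

    return `${hex.slice(0, 4)}-${hex.slice(4)}`;
}

console.log(idToCodeSuffix(63933982)); // 03CF-8E1E
console.log(idToCodeSuffix(69352261)); // 0422-3B45
```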

naive sparrow
# manic stratus (still haven't looked into how to compute the checksum to get the full level id ...

I detail how level codes work and how to calculate the checksum here https://github.com/jonbarrow/smm1-course-archive/issues/2#issuecomment-1636900475

manic stratus
#

but i do have a process and a data format that gets good compression and fast random access. i'll be publishing that... somewhere once the churning is done, and i can also do the same process for the new data set, to update my metadata and fill in any courses i might be missing.

manic stratus
#

(i should also note that the archive.org dataset has about 10 million courses)

$ sqlite3 courses.db 'select count(*) from courses;'
10610412
manic stratus
#

@naive sparrow course id 7500440 has "ASH0\x00" five times in its compressed blob

manic stratus
#

dm'd

#

the game itself must have some way of doing this splitting consistently, or we'd see errors on courses like this

#

on account of the WARC archive format i do have the full headers returned for that query as well, but i don't see anything relevant in it

manic stratus
#
; sqlite3 courses.db -json 'select * from warc_refs where id = 7500440'
[{"id":7500440,"chunk":573,"offset":442912909,"size":44580}]
#

the offset and size give the position and length to read from the un-gzipped warc file from archive.org
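A rough sketch of that lookup, assuming the chunk has already been un-gzipped to a local file. `readWarcSlice` and the file name in the usage comment are hypothetical, and whether the slice still includes WARC or HTTP headers depends on how the offsets were recorded.

```javascript
const fs = require('fs');

// Read exactly `size` bytes starting at `offset` from an un-gzipped WARC file.
function readWarcSlice(warcPath, offset, size) {
    const fd = fs.openSync(warcPath, 'r');

    try {
        const buf = Buffer.alloc(size);
        const bytesRead = fs.readSync(fd, buf, 0, size, offset);

        if (bytesRead !== size) {
            throw new Error(`wanted ${size} bytes, got ${bytesRead}`);
        }

        return buf;
    } finally {
        fs.closeSync(fd);
    }
}

// Usage with the row above (file name is a placeholder):
// const blob = readWarcSlice('./chunk-573.warc', 442912909, 44580);
```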

naive sparrow
manic stratus
#

okay! i think i can deal with that

#

if one of the other ones was a coincidence it would be quite a bit harder

naive sparrow
# manic stratus the game itself must have some way of doing this splitting consistently, or we'd...

I'm unsure how the game itself does it, but it's likely that either the ASH file itself contains some reference to its own size (it contains the decompressed size, but I'm unsure about the compressed size), or the decompression library keeps track of how many bytes were decompressed, and the game just goes one by one and doesn't chunk the file at all
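A sketch of reading the decompressed size mentioned above. The layout is an assumption, not something confirmed in this thread: a 4-byte magic followed by a big-endian 32-bit field whose low 24 bits hold the decompressed size. If that holds, the byte after the magic would be null for any chunk under 16 MiB, which would fit the observation earlier in the thread. `readAshHeader` is a made-up name.

```javascript
// ASSUMED layout (not confirmed here):
//   bytes 0-3: "ASH0"
//   bytes 4-7: big-endian u32, low 24 bits = decompressed size
function readAshHeader(chunk) {
    if (chunk.length < 8 || chunk.toString('latin1', 0, 4) !== 'ASH0') {
        throw new Error('not an ASH0 chunk');
    }

    const uncompressedSize = chunk.readUInt32BE(4) & 0x00ffffff;

    return { uncompressedSize };
}
```

Note that this gives the size after decompression only, so on its own it does not say where one compressed chunk ends and the next begins.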

My workaround was just to always limit it to 4 offsets and chunk based on that, since they will always only have the 4

const fs = require('fs');

const file = fs.readFileSync('./problem.ash0');

const offsets = [];
let offset = file.indexOf('ASH0');

while (offset !== -1) {
    offsets.push(offset);

    // * Stop once 4 offsets have been found.
    // * Compressed data can sometimes contain
    // * the ASH0 magic by coincidence
    if (offsets.length === 4) {
        offset = -1;
    } else {
        offset = file.indexOf('ASH0', offset+4);
    }
}

for (let i = 0; i < offsets.length; i++) {
    const start = offsets[i];
    const end = offsets[i+1];

    const chunk = file.subarray(start, end);

    fs.writeFileSync(`./chunk-${i+1}.ash0`, chunk);
}
#

Like you said this won't work if the extra ASH header is anywhere else in the data but it's better than nothing

manic stratus
#

yep

naive sparrow
#

Oh wait no I remember now

manic stratus
#

i suspect only this course is a problem

naive sparrow
#

I know exactly how the game knows the offsets lol

manic stratus
#

i'm 950/999 chunks into verifying that every other course has "ASH0\x00" exactly four times

#

oh!

naive sparrow
#

So it's not actually possible for you to know with the data you have, unfortunately. And looking back at my own tool, it also does not directly provide the data needed (though there's enough data in my tool to reconstruct it for Pretendo's use)

I'll try not to get too technical but it does actually require some context as to how Nintendo Network games function overall

manic stratus
#

hm

#

is it in the headers?

#

like the http headers?

naive sparrow
#

No

#

The way Nintendo Network games work is by using a piece of middleware called NEX. NEX provides a set of services (also called protocols) that the client and server both implement and use to communicate. The client registers other protocol clients for each protocol it wants to use. The client and server then use these protocols to communicate and do whatever operations they need to do using a UDP-based transport protocol called PRUDP with tightly packed RMC payloads

These protocols are highly specialized and tend to only serve one purpose. The notifications protocol is for sending and reacting to notifications. The matchmaking protocols are for setting up multiplayer sessions. Etc

One of these protocols is called DataStore. DataStore is a very basic "object" storage protocol. Objects, in the official context, refer to Amazon S3 objects. Though they can really be anything. The entire point of the protocol is to provision URLs for the client to be able to upload, modify, delete, etc, "objects". This is how things like Ghosts in MK8 work, MK8 just uploads an object to S3, the object being all the data needed for the Ghost

This is also how SMM works. Every course is just an object in DataStore, which is just uploaded to and downloaded from S3

When an object is created, metadata about the object is also created. Stuff like its flags, an access password, etc. Objects also optionally have what's called a "meta binary". These contain game-specific data about a unique object. For instance Mario vs DK Tipping Stars uses the "meta binary" to store the Miiverse post ID associated with the level

SMM stores some metadata about the course inside its course object's "meta binary", stuff like the course theme, etc. It also contains the CRCs and sizes of the compressed chunks, which it very likely uses to chunk the downloaded object data (using the sizes) and verify the integrity (using the CRCs)

Since you don't have the "meta binary", you can't know the sizes or the CRCs
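For anyone who does have the "meta binary", the chunking described above might look roughly like the sketch below. Everything here is an assumption: the layout of the meta binary is not covered in this thread, so the sizes and CRCs are taken as plain arrays, and the standard CRC-32 is used as a stand-in because the exact CRC variant the game uses is not stated. `splitBySizes` is a made-up name.

```javascript
// Standard CRC-32 lookup table (reflected polynomial 0xEDB88320).
// ASSUMPTION: the game's CRC variant may differ.
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
    let c = n;
    for (let k = 0; k < 8; k++) {
        c = (c & 1) ? (0xedb88320 ^ (c >>> 1)) : (c >>> 1);
    }
    return c >>> 0;
});

function crc32(buf) {
    let crc = 0xffffffff;
    for (const byte of buf) {
        crc = CRC_TABLE[(crc ^ byte) & 0xff] ^ (crc >>> 8);
    }
    return (crc ^ 0xffffffff) >>> 0;
}

// Cut `blob` into consecutive chunks of the given sizes and,
// if `crcs` is provided, check each chunk against its CRC.
function splitBySizes(blob, sizes, crcs) {
    const chunks = [];
    let start = 0;

    sizes.forEach((size, i) => {
        const chunk = blob.subarray(start, start + size);

        if (chunk.length !== size) {
            throw new Error(`chunk ${i} is truncated`);
        }
        if (crcs && crc32(chunk) !== crcs[i]) {
            throw new Error(`chunk ${i} failed its CRC check`);
        }

        chunks.push(chunk);
        start += size;
    });

    return chunks;
}
```

Because the split is driven by the sizes, a stray "ASH0" inside the compressed data no longer matters.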

manic stratus
#

ahhh