https://github.com/lifthrasiir/j40
Older discussions are around #off-topic .
#J40 discussion
so, after wasting one full day with build systems, my next topic is probably a proper render system
essentially I would need something similar to libjxl's render_pipeline 😮
Do you want to be able to do streaming decode with progressive rendering and all that? Because if not, things are quite a bit simpler...
I already have lots of different "planes" (typed bitmaps, really) around, so I figured out that I already need some sort of pipeline anyway
for now, J40 decodes global modular channels (if any) and produces per-group XYB samples that are immediately converted to sRGB before being integrated into the aforementioned global modular channels
there are multiple issues with this architecture; to name a few: J40 in VarDCT always creates u16 planes for color channels, while ideally they should be f32
and sRGB conversion should happen a lot later
honestly speaking, the point of having a pipeline is that I can easily verify and adjust how planes are combined and converted
that's the main goal of the pipeline; progressive rendering is a nice bonus but ultimately a side effect
I think you're probably right
@minor lark is the render pipeline architect in libjxl so he might have some tips...
I had the following in my multi-month-old notes:
render ops:
J40_OP_LOAD planeref (-- plane)
J40_OP_COPY x y sx sy (dst src -- dst)
J40_OP_IDCT coeffref (-- plane)
J40_OP_UNSQUEEZEH (avg residu -- merged)
J40_OP_UNSQUEEZEV (avg residu -- merged)
J40_OP_UNRCT
as you can see this is a render graph encoded in forth-like IL
I'm yet to fully figure out which ops would be needed and which ops can be axed
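A minimal sketch of how a forth-like render IL of this kind could be interpreted, assuming a tiny fixed-size plane type. The opcode names (OP_LOAD, OP_MERGE, OP_END), plane_t and run_ops are hypothetical; OP_MERGE is only a placeholder with the same (a b -- merged) stack effect as the unsqueeze ops in the notes, not an actual unsqueeze.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes; only the stack effects mirror the notes above. */
enum { OP_LOAD, OP_MERGE, OP_END };
enum { PLANE_LEN = 4, STACK_MAX = 8 };

typedef struct { int px[PLANE_LEN]; } plane_t;

/* Runs `code` against a table of input planes. Returns 0 and stores the
 * single remaining stack entry in *out, or -1 on a malformed program. */
int run_ops(const int *code, const plane_t *inputs, size_t ninputs, plane_t *out) {
    plane_t stack[STACK_MAX];
    size_t sp = 0;
    assert(code != NULL && inputs != NULL && out != NULL);
    for (;;) {
        switch (*code++) {
        case OP_LOAD: { /* planeref (-- plane) */
            int ref = *code++;
            if (ref < 0 || (size_t)ref >= ninputs || sp >= STACK_MAX) return -1;
            stack[sp++] = inputs[ref];
            break;
        }
        case OP_MERGE: { /* (a b -- merged); placeholder: samplewise sum */
            int i;
            if (sp < 2) return -1;
            for (i = 0; i < PLANE_LEN; ++i) stack[sp - 2].px[i] += stack[sp - 1].px[i];
            --sp;
            break;
        }
        case OP_END: /* a well-formed graph leaves exactly one plane */
            if (sp != 1) return -1;
            *out = stack[0];
            return 0;
        default:
            return -1;
        }
    }
}
```

A program such as `{OP_LOAD, 0, OP_LOAD, 1, OP_MERGE, OP_END}` then reads as "load plane 0, load plane 1, merge them", which is what makes the graph easy to inspect and rearrange.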
today in J40: fuzzing! which means that I have really... a lot of things to fix.
with fuzzing, I really feel the need for a testing architecture. I actually had an ad hoc testing infra using labrat https://github.com/squarewave/labrat but I quickly learned its limitations.
one unorthodox possibility is to use Rust as a testing infrastructure (!)
oh, thank you for noting that.
today in j40: now pretty much every fuzzer outcome is panicking at // TODO midbits can overflow! comment in j40__hybrid_int, which means I need to do something with this...
while not guaranteed, you can always file an issue.
to me the biggest blocker is that we don't have any freely available copy of ISO/IEC 18181-2
other aspects of container format can be inferred from libjxl code, but jbrd is complex
my fuzzing process now regularly stalls at j40__icc, which is given a very large enc_size and zero-bit symbols :p
okay, I polished and pushed most commits in my working copy
you can see how I usually work, I start by tackling a particular task, solve any side tasks as needed, split commits for ease of reading and/or reviewing and push a bulk of commits.
(this is a bad strategy when that task turned out to be much larger than imagined though, I would have to detect this to avoid stalling)
yes, lots of edge cases!
it shouldn't affect valid images, except for the very last commit which finally limits the input to the Main level 5 (API pending).
today in j40: dealing with fallouts in MSVC.
https://github.com/lifthrasiir/j40/commit/0ce79d31c833780238e76633c65a8bf9543a865f took a while but here it is.
you can run make and get dj40.exe if you have Visual Studio (!)
possibly, I think it should be done via CI and this commit is a part of preparation
okay, I've pushed all local commits (which took a while as I had tons of them in the queue)
and you can now fuzz this damn thing! well, not too long, my current corpus crashes after 500K iterations :p
(after fixing a lot of low-hanging fruits)
Of course robustness is always good to have, but maybe it's not super critical for the main use cases of j40, which I imagine would be things like games that don't want to introduce a dependency on libjxl to decode game assets.
Yeah, but it is far less than ideal that just a few minutes of fuzzing finds yet another bug.
And fuzzing does help for finding leaks, which would be equally important even for trusted inputs as well.
true
the latest bug is confirmed to be a swapped grows vs. gcolumns (i.e. groupwise height and width). it turns out that my quick test corpus doesn't have any image that has more than one group and where grows and gcolumns differ... facepalm
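A small sketch of the group-grid arithmetic where such a swap can hide, assuming square groups of a caller-supplied size and raster-order group indices. group_grid_t, group_grid and group_origin are hypothetical names, not J40's.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { int32_t gcolumns, grows; } group_grid_t;

/* gcolumns = ceil(width / group_dim), grows = ceil(height / group_dim) */
group_grid_t group_grid(int32_t width, int32_t height, int32_t group_dim) {
    group_grid_t g;
    assert(width > 0 && height > 0 && group_dim > 0);
    g.gcolumns = (width + group_dim - 1) / group_dim;
    g.grows = (height + group_dim - 1) / group_dim;
    return g;
}

/* Maps a raster-order group index to the top-left pixel of that group.
 * Dividing by grows instead of gcolumns here is exactly the kind of bug
 * that a square or single-group test image cannot reveal. */
void group_origin(group_grid_t g, int32_t group_dim, int32_t gidx,
                  int32_t *x, int32_t *y) {
    assert(gidx >= 0 && gidx < g.gcolumns * g.grows);
    *x = (gidx % g.gcolumns) * group_dim;
    *y = (gidx / g.gcolumns) * group_dim;
}
```

A 600x200 image with 256-pixel groups gives three columns and one row, so it would make a cheap regression input for this class of bug.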
yet another bug:
- for (i = 0; i < 4; ++i) J40__TRY(j40__modular_channel(st, &m, i, sidx2));
+ for (i = 0; i < m.num_channels; ++i) J40__TRY(j40__modular_channel(st, &m, i, sidx2));
basically modular subimages can have transformations and can have a different number of channels to decode
yet, yet another bug or something else: palette with nb_colours == 0, is it even possible????
ah it is possible, because every palette index will be synthetic, oh okay.
so it's just J40 not handling this edge case.
yes, the default palette is quite useful: everything with index >= nb_colours maps to two color cubes, and everything with index < 0 maps to default delta palette entries
so you can actually encode a pretty nice image without specifying any custom palette colors
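A sketch of the three index ranges described above, which also shows why nb_colours == 0 is meaningful: every non-negative index then falls into the implicit range. The enum and function names are hypothetical, and the actual colour values for the implicit and delta entries are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    PAL_DELTA,         /* index < 0: default delta palette entry */
    PAL_EXPLICIT,      /* 0 <= index < nb_colours: stored palette colour */
    PAL_IMPLICIT_CUBE  /* index >= nb_colours: default colour cube entry */
} palette_kind_t;

/* Classifies a palette index; does not compute the colour itself. */
palette_kind_t classify_palette_index(int32_t index, int32_t nb_colours) {
    assert(nb_colours >= 0);
    if (index < 0) return PAL_DELTA;
    if (index < nb_colours) return PAL_EXPLICIT;
    return PAL_IMPLICIT_CUBE;
}
```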
now I have to deal with the possibility that those zero-width images can be further transformed
oh right we still nominally insert a 0x0 palette channel in that case, just to keep the invariant that every palette transform adds one metachannel
empty channels do not end up getting encoded but you do need to take them into account for the channel indices
for example there can be 3 palette metachannels and they can undergo RCT... which should be valid but may need a special handling
squeeze can also introduce empty channels btw
yeah, a large enough number of transforms
and I think it will be hairier than other transforms
yeah rct on channel-palettes could actually be useful, current encoder doesn't do that but it would help a bit for the case where you e.g. have a 10-bit image encoded nominally as a 16-bit one where only 1024 sample values actually get used; in that case the encoder will currently produce 3 channel palettes that are identical but still encoded 3 times; doing an RCT on that would reduce it to 1 channel with some entropy and 2 channels that are just zeroes.
so we kept that possibility open in the spec even if the current encoder isn't doing it yet
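To illustrate the point about identical channel palettes, here is a sketch using a simple "subtract the first channel" transform. It is a stand-in for a reversible colour transform, not the exact RCT variants defined by JPEG XL, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Forward: channel 0 is kept, channels 1 and 2 become differences from it. */
void subtract_first_forward(const int32_t *c0, int32_t *c1, int32_t *c2, size_t n) {
    size_t i;
    assert(c0 != NULL && c1 != NULL && c2 != NULL);
    for (i = 0; i < n; ++i) {
        c1[i] -= c0[i];
        c2[i] -= c0[i];
    }
}

/* Inverse: adds channel 0 back, restoring the original samples exactly. */
void subtract_first_inverse(const int32_t *c0, int32_t *c1, int32_t *c2, size_t n) {
    size_t i;
    assert(c0 != NULL && c1 != NULL && c2 != NULL);
    for (i = 0; i < n; ++i) {
        c1[i] += c0[i];
        c2[i] += c0[i];
    }
}
```

When the three palettes are identical, the forward step leaves one channel with real entropy and two all-zero channels, which cost almost nothing to code.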
but all the transforms (rct,palette and squeeze) have to operate either on metachannels or on real channels but not a mix of them
I came to realize that the fuzzing corpus from J40 can be used against other implementations including libjxl
I guess libjxl already has a lot of them though
because that introduces too much weirdness: it's then no longer clear what channel is meta and what channel is nonmeta
yes, you could probably get a fuzzing corpus from libjxl to try it on j40 too
in libjxl we also have a funky way to make fuzzing more effective, which is to have a variant of the encoder/decoder where entropy coding is skipped and things are just raw bits that the fuzzer can flip directly, or at least that's how I remember it, @minor lark or @golden basin probably know more about this
yeah vshift and hshift are important to keep, otherwise we will have an ambiguity (was afk)
nah, just using huffman with 8-bit symbols everywhere
(just for the corpus)
also removing the check for the final state
ah right, huffman with 8-bit symbols so it's actually still a valid bitstream, just poorly compressed and symbols nicely byte-aligned, right?
Heh, I thought the thread is dead. Will reply after I get back home.
Spoiler: not that good I think
I believe GCC is a bit more aggressive on autovectorization, which J40 heavily relies on
today in j40: after a few days of fuzzing I've reached the point where I need a radically different corpus or source code modification to continue fuzzing, which seems like the perfect moment to stop active fuzzing
so far fuzzing covered almost all modular code but not much vardct code, as expected
I tried to manually put known vardct images to the corpus but that didn't help much, possibly because those inputs are too large
So what's next? Implementing the missing coding tools?
for now, finishing up restoration filters is the highest priority and that probably involves rendering
oh wow, it's based on ccgo, which is kinda unexpected
today in j40: I'm populating the issue tracker, and here is an initial issue about the Rust version: https://github.com/lifthrasiir/j40/issues/10
@boreal cairn will want to be pinged
thanks! I gave my 2 cents to the issue.
I don't know enough about J40 or JXL in general to help with the real details, but happy to work out some of the architectural kinks with you if you wanna chat about it. Obviously as soon as there is code I'm happy to contribute!
I didn't get to work on my own implementation in a long time now, I will 100% make time for this sooner or later, so either way I'm gonna either contribute to J40 or roll my own thing, whatever happens first
back from a bit of vacation, playing too much Horizon: Zero Dawn (what year is this), thinking about rendering again
I would say that libjxl will be lightweight in this case, because you absolutely need NEON-specific optimizations that are in libjxl
The ARM build of libjxl-dec is about 200kb iirc
Maybe has debug symbols still in there or something?
Freely suggested improvements to j40 makefile to increase portability and readability and brevity:
CFLAGS = -O3 $(CFLAGS_WRN)
CFLAGS_DBG = -DJ40_DEBUG -g -Og $(CFLAGS_WRN)
CFLAGS_WRN = -W -Wall -Wconversion -Wc++-compat
LDFLAGS = -lm
CLANG = clang
# -lm goes after the source so linkers that resolve libraries in order still find libm
dj40: dj40.c j40.h extra/stb_image_write.h
	$(CC) $(CFLAGS) -o $@ dj40.c $(LDFLAGS)
dj40-cxx: dj40.c j40.h extra/stb_image_write.h
	$(CC) -xc++ $(CFLAGS) -o $@ dj40.c $(LDFLAGS)
dj40-o0g: dj40.c j40.h extra/stb_image_write.h
	$(CC) $(CFLAGS_DBG) -fsanitize=address,undefined -o $@ dj40.c $(LDFLAGS)
j40-fuzz: extra/j40-fuzz.c j40.h
	$(CLANG) $(CFLAGS_DBG) -fsanitize=fuzzer,address,undefined -o $@ extra/j40-fuzz.c $(LDFLAGS)
It should be pretty much equivalent...
Since it's an all-in-one-file lib the makefile is probably not that important anyways.
@ruby fable any plans to resume work on j40 at some point?
I hope so, for the last 6 months I didn't really have much energy to spare (not just about J40 but more generally) though
@unkempt knot by the way, I think jxl-oxide already surpassed what I wanted to achieve by J40 and wonder if I should keep working on J40 as I originally intended
specifically there were two goals I had in mind: producing a complete reimplementation from the spec is one, more or less achieved by now (not by J40, of course)
the second goal was to provide a minimal ground for working with JPEG XL
the minimal ground here means, for example, a test suite completely independent from libjxl
I expected J40 will eventually need them anyway and it might be easier to produce such one from J40 and not from libjxl
that's another goal, and I'm still unsure of the optimal way to achieve that
it was another reason I didn't have much progress recently, if I had some concrete plans maybe I could have tried to make one (I indeed had other side projects that were going very slowly but still steady in the same period), but I didn't have any actionable plan
so if libjxl had some (ideally concrete) ideas that might be hard to do themselves, maybe I can look at them instead
rethinking about J40 rn, mostly about how to restructure it to avoid known pains
(yes, I've got a new job since then and a heavy milestone has been passed, so I now have some peace in my mind.)
(still recovering from the mental health issues though, mine is not as critical as others' but nevertheless medications really helped, anyway)
one of the main PITA in J40 was the cleanup path, which is... uh... always painful in C to be frank
it greatly relates to the testing strategy as well, because there should be a guarantee that the cleanup fully restores the known-good state
I knew this from the beginning, but after months of hiatus with fresh eyes I feel it more acutely
I'm starting to think about a lightweight preprocessor (still written in C) that would help quite a bit, but not sure
my initial idea was to make the source code itself written in a non-portable C (i.e. allowing some GNU extensions) but it can be converted to a portable C with that preprocessor
but __attribute__((cleanup)) was something inferior compared to what I actually wanted, so... I'm not sure how I develop this idea further
I always saw other projects implementing cleanup by prologue and epilogue macros.
I don't think there's any other options to do it portably.
yes, the current J40 also does this, but it is increasingly harder to deal with new features compared to other languages like C++. (but I don't like to write C++.)
let me give a concrete example. I eventually want to support a progressive decoding, i.e. the decoder can signal increasingly precise renders at any time.
if the language supports a coroutine this is really an easy task, because you can park the decoder after the last available input and resume it whenever you want.
but it is really annoying and error-prone to write an equivalent state machine in C.
as a practical and very relevant example here is a decoder loop of Brotli: https://github.com/google/brotli/blob/ed738e842d2fbdf2d6459e39267a633c4a9b2f5d/c/dec/decode.c#L2264
Brotli decoder is composed of tons of state machines with delicate invariants, and it is possible as you can see, but it is inhumane to be honest
and JPEG XL is many times more complex than Brotli (in fact, JPEG XL contains a big portion of Brotli)
so both J40 and libjxl are designed to roll back to the known-good state when the input is not enough.
for example, if the signature has been read but the image header cannot be fully read, you roll back to the end of the signature and next time you will try to re-read the image header.
this is suboptimal (if you supply a single byte every time, it will decode the same thing over and over) but practically easier to manage...
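A minimal sketch of the roll-back idea for one field, assuming the caller keeps all received bytes in a single growing buffer. input_t, read_u32be and the 4-byte big-endian field are hypothetical and are not taken from J40 or libjxl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *buf; /* all bytes received so far */
    size_t len;         /* number of valid bytes in buf */
    size_t pos;         /* committed read position, i.e. the known-good state */
} input_t;

enum { PARSE_OK = 0, PARSE_NEED_MORE = 1 };

/* Reads a hypothetical 4-byte big-endian field. Either all 4 bytes are
 * consumed and the position is committed, or in->pos is left untouched so
 * that the next call re-reads the field from the same known-good point. */
int read_u32be(input_t *in, uint32_t *out) {
    size_t p = in->pos; /* work on a copy, so failure needs no undo */
    uint32_t v = 0;
    int i;
    assert(in->pos <= in->len);
    for (i = 0; i < 4; ++i) {
        if (p >= in->len) return PARSE_NEED_MORE; /* roll back: pos unchanged */
        v = (v << 8) | in->buf[p++];
    }
    in->pos = p; /* commit */
    *out = v;
    return PARSE_OK;
}
```

This is cheap when the state is one integer. The pain described here starts when the partially completed step has also allocated memory or mutated decoder state that must be undone.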
...if it is easy to roll back. which is mostly the case in C++, while it is not even trivial in C.
so in C you can't have an easy coroutine nor an easy state rollback. what to do then? this is my current question.
many C projects can cope with macros because they generally have (or constrain themselves to have) a small number of exit conditions per each function.
for example:
int foo(state_t *st) {
    // precondition: st->a thru st->c are not initialized
    st->a = malloc(sizeof(*st->a)); // size of the pointee, not of the pointer
    if (!st->a) goto error;
    st->b = malloc(sizeof(*st->b));
    if (!st->b) goto free_a;
    st->c = malloc(sizeof(*st->c));
    if (!st->c) goto free_a_and_b;
    // postcondition when return value is 1: st->a thru st->c are all valid
    return 1;
free_a_and_b:
    free(st->b);
free_a:
    free(st->a);
error:
    // postcondition when return value is 0: st->a thru st->c are not initialized
    // and no memory has been leaked
    return 0;
}
this is a contrived example, people know this is really wordy so they actually use shortcuts, but you'd get my point
people generally write the following instead (and J40 does this as well):
int foo(state_t *st) {
    // ensure that `error` is always free to jump
    st->a = NULL;
    st->b = NULL;
    st->c = NULL;
    st->a = malloc(sizeof(*st->a)); // size of the pointee, not of the pointer
    if (!st->a) goto error;
    st->b = malloc(sizeof(*st->b));
    if (!st->b) goto error;
    st->c = malloc(sizeof(*st->c));
    if (!st->c) goto error;
    return 1;
error:
    free(st->a); // safe to call for NULL
    free(st->b);
    free(st->c);
    return 0;
}
How about using a callback cleanup mechanism?
Basically you introduce the concept of a callback structure, and a stack of those. Add callbacks to the structure and push it to the stack. Popping them from the stack would call those callback functions.
How does it sound?
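A minimal sketch of such a callback stack, assuming a fixed capacity and callbacks of the form void fn(void *). All names (cleanup_stack_t, cleanup_push, cleanup_run, count_call) are hypothetical. Because each callback gets its context through the void * argument, no nested functions are needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { CLEANUP_MAX = 16 };

typedef struct {
    void (*fn[CLEANUP_MAX])(void *);
    void *arg[CLEANUP_MAX];
    size_t count;
} cleanup_stack_t;

/* Registers fn(arg) to run later. Returns 0, or -1 if the stack is full. */
int cleanup_push(cleanup_stack_t *cs, void (*fn)(void *), void *arg) {
    assert(cs != NULL && fn != NULL);
    if (cs->count >= CLEANUP_MAX) return -1;
    cs->fn[cs->count] = fn;
    cs->arg[cs->count] = arg;
    ++cs->count;
    return 0;
}

/* Pops and runs every registered callback, last registered first. */
void cleanup_run(cleanup_stack_t *cs) {
    assert(cs != NULL);
    while (cs->count > 0) {
        --cs->count;
        cs->fn[cs->count](cs->arg[cs->count]);
    }
}

/* Example callback: counts how often it was invoked. */
void count_call(void *counter) {
    ++*(int *)counter;
}
```

Since free already has the signature void(void *), `cleanup_push(&cs, free, ptr)` registers a plain deallocation directly. The fixed capacity is the weak spot: the number of entries needed has to be bounded up front.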
first, C doesn't have a portable nested function (sadly). and second, I don't know how many of them are required :S
for now I'm considering an arena-based approach, trading strict memory usage for convenience.
but the arena will not fully solve my problem because I have so many states...
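A minimal sketch of the arena idea, assuming the simplest possible design in which every allocation is a separate malloc block chained into a list. arena_t, arena_alloc and arena_free_all are hypothetical names; a real arena would more likely carve allocations out of large chunks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Header placed in front of each allocation; the union forces the payload
 * that follows it to be suitably aligned for any object type (C11). */
typedef union arena_block {
    union arena_block *next;
    max_align_t align_;
} arena_block_t;

typedef struct { arena_block_t *head; } arena_t;

/* Allocates `size` bytes owned by the arena, or returns NULL on failure. */
void *arena_alloc(arena_t *a, size_t size) {
    arena_block_t *b;
    assert(a != NULL);
    if (size > SIZE_MAX - sizeof(*b)) return NULL; /* size overflow */
    b = malloc(sizeof(*b) + size);
    if (!b) return NULL;
    b->next = a->head;
    a->head = b;
    return b + 1; /* payload starts right after the header */
}

/* Releases everything at once: the only cleanup path callers need. */
void arena_free_all(arena_t *a) {
    assert(a != NULL);
    while (a->head) {
        arena_block_t *next = a->head->next;
        free(a->head);
        a->head = next;
    }
}
```

Error paths then shrink to one arena_free_all call, at the price of holding memory until the arena is dropped. It does not help with non-memory state, which matches the caveat about having many states.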
to be accurate, if there is a way to make a non-portable but working equivalent of Go defer, I'll seriously consider that
I'm okay with the non-portability if it i) can be mechanically translated later and ii) works as is with GCC and clang
that would comfortably cover pretty much every use case
There isn't to my knowledge.
Plans to have it in C23 got cancelled.
It's effectively pushed back to C30.
yeah I know (cf. https://thephd.dev/lambdas-nested-functions-block-expressions-oh-my), but I can't use it in the portable, translated version anyway, so the base version doesn't have to be that portable
Your only non-portable option in that case would be GCC's cleanup attribute, which you did try.
But yes, make sure that the source file is machine-preprocessable to portable C code if you decide to go with that option.
@ruby fable i know quite some time has passed since this discussion stopped, but I normally use this macro to implement a Go-like defer in C:
#define PPCAT2(n,x) n ## x
#define PPCAT(n,x) PPCAT2(n,x)
#define DEFER2(stmt, counter) \
void PPCAT(__cleanup, counter) (int* u) { stmt; } \
int PPCAT(__var, counter) __attribute__((unused, cleanup(PPCAT(__cleanup, counter ))));
#define DEFER(stmt) DEFER2(stmt, __COUNTER__)
which is based on __attribute__((cleanup)) which you mention above, so I'm not sure if this is something you already attempted or not
Sample usage:
void *buf = malloc(128);
DEFER(free(buf));
this works on GCC only though, as clang doesn't implement nested functions in C
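For comparison, a sketch of using __attribute__((cleanup)) without nested functions, which both GCC and clang accept (MSVC does not). It binds a fixed handler to one variable instead of deferring an arbitrary statement, so it is weaker than the DEFER macro above. The names AUTOFREE, free_ptr, use_buffer and g_freed are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

int g_freed = 0; /* counts handler runs, only to make the sketch observable */

/* A cleanup handler receives a pointer to the annotated variable. */
void free_ptr(void *pp) {
    free(*(void **)pp); /* free(NULL) is fine if the allocation failed */
    ++g_freed;
}

#define AUTOFREE __attribute__((cleanup(free_ptr)))

/* Every return path below releases buf without an explicit free(). */
int use_buffer(int fail_early) {
    AUTOFREE char *buf = malloc(128);
    if (!buf) return -1;
    if (fail_early) return 0; /* buf is freed on the way out */
    buf[0] = 1;
    return buf[0];            /* value is read first, then buf is freed */
}
```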
I have also one question; is there a way to use cjxl to force to encode files that j40 will know how to decode? Or put in other terms, is there a way to disable features that j40 doesn't support, so that the current version of j40 can successfully be used without fear of generating images that can't be later decoded?
I didn't use that mainly because it is not really portable, and I wanted to use it unconditionally
for j40-compatible images, any low enough level (I think it was up to -e 7?) works I think. see the README.
Thanks, should I expect some runtime error for images using unsupported features, or just a corrupted picture?
modulo any bugs, it will probably reject them.
I haven't touched J40 for a long time now though, so there will be several bugs around...
I'm slowly investigating several paths to revive the project, but all paths depend on how to reliably write C code without making it too large, and that's a big problem
I should make a more accurate statement on this:
I will revive J40 if I have a way to write C code roughly equivalent to the current J40 code without exactly using barebones C
if Zig could have been translated to C I would have picked it, but AFAIK Zig-to-C is not supported
in my humble opinion, you're already using several clever macro tricks in the current codebase to push the C boundaries
i don't think there's much more that can be done in the realm of C
i wish clang didn't decide to skip nested functions in C, that would have helped me greatly, but alas they went for that decision years ago and haven't looked back
that would solve defer, but then if you want to have more solid coroutines or stuff like that, I guess that's not something that really matches C
yeah, that is my current dilemma 😦
i guess translation from a higher level language is one option as you said
something like https://github.com/google/wuffs/ but for a broader library would be beneficial
(Wuffs is truly great and I would like it to be more widespread, but it doesn't fit my use case)
yes i like wuffs too but i think it fits more into the realm of safety, not sure if that's also your focus
i haven't really evaluated that as a higher level language by itself
I'd like to ensure that my library is reasonably safe, but not necessarily in the formally verifiable way (which is... hard I know)
it doesn't look so but there are several guiding principles throughout the J40's code to avoid usual problems, like consistent coding styles
yeah, though some fuzzing will bring you somewhere into a safe area
J40 has been surely fuzzed, but it will need an additional restructuring to make it fuzz-friendly
I think the current fuzzing attempt covers roughly a half of the entire code
I will test decompression speed on my target platform for this project, which is a Nintendo 64; that'll give me a first number to see whether a full porting is viable or not
oh, that would be adventurous to be sure
does libjxl compile in that platform after all?
i haven't tried; I have been working on that platform for quite some time and I ported several modern formats like h264 and opus to it. I'm looking for a solution for lossy encoding, so i figured i'd start from jpegxl and move back in case of trouble
notice that porting doesn't mean only recompiling (that's the easy part), because optimizing for n64 means offloading part of the calculations to a DSP with SIMD instructions that must be programmed in assembly
so the work ends up being similar to a task like "adding arm+neon acceleration to a C codebase" or something like that
is it just out of curiosity or do you have some concrete goal like a homebrew game?
i maintain an open source library that's used for homebrew games (https://github.com/DragonMinded/libdragon), I'd like to offer a lossy image compression solution to my users
does JPEG work? if it doesn't, there is not much chance for JPEG XL either (because it includes a baseline JPEG as a part of backward compatibility)
yes, jpeg was also used by commercial games back in the day
oh, that's good to hear
we are pushing the boundaries more than they were ever able to do, this is why i was targeting something more modern
I think a VarDCT subset of JPEG XL might be actually viable enough
that is, a single-frame lossy subset
(JPEG XL is designed for many more use cases, so you don't need the full library)
yeah probably. BTW I have ported mpeg1 already so for jpeg actually i should have most of the blocks
the DSP with SIMD is fixed point though, i think VarDCT is defined with floating points?
H264 is luckily integer only, that helped quite a bit
and Opus reference implementation supports both floating and fixed, that also helped
yeah, J40 also assumes a working floating point impl
there are floating points on the CPU; it's more about the parts that I want to offload to the DSP ... those would have to be converted to fixed point
You can't have fully VarDCT-only since even VarDCT uses Modular for the LF image etc, but yes, we're thinking about a "lightweight" profile that would also have a hardware implementation. It would restrict the use of Modular to put constraints on the kind of MA trees and predictors that can be used, probably not have extra channels at all, definitely no splines and patches, etc. How it will look will depend on what the hardware folks are willing to implement, it's still too early to tell.
not to necessarily mean a proper subset of lossy JPEG XL, the LF image can be possibly encoded in other means for example.
There's a subset of Modular that is just no-context with uniform West prediction, that's basically what JPEG does. Probably we can define a somewhat larger subset of Modular that still gives some of the gains while still being simple enough for a hardware implementation, where you obviously don't want to deal with arbitrary MA trees and funky predictors like the self-correcting Weighted predictor.
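A sketch of what "uniform West prediction" amounts to on one row: each sample is predicted by its left neighbour and only the residual is coded. The function names are hypothetical, and predicting the first sample of a row as 0 is a simplifying assumption; edge handling in the actual format may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encoder side: residual = sample - west neighbour (0 at the row start). */
void west_residuals(const int32_t *row, int32_t *res, size_t width) {
    int32_t west = 0;
    size_t x;
    assert(row != NULL && res != NULL);
    for (x = 0; x < width; ++x) {
        res[x] = row[x] - west;
        west = row[x];
    }
}

/* Decoder side: sample = residual + already reconstructed west neighbour. */
void west_reconstruct(const int32_t *res, int32_t *row, size_t width) {
    int32_t west = 0;
    size_t x;
    assert(row != NULL && res != NULL);
    for (x = 0; x < width; ++x) {
        row[x] = res[x] + west;
        west = row[x];
    }
}
```

The decoder needs one previous sample and no context modelling, which is why this subset is attractive for hardware or a DSP.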
yeah, that might be a possible alternative
the point is that, I think N64 will need some specialized format and/or subset for a desired performance
I think it's fine, we usually control both sides of the pipeline (encoding and decoding)
For h264 I compress videos only in baseline profile and I disable the in loop filter for instance as that creates performance issues
For opus I select Celt only (disable SILK) and fix a few internal parameters leaving a bit less flexibility to the encoder
So yes I'm ready to disable a few things at encoding time, that's not an issue for my use case
I'm just wondering if I should base the work on j40 or not given that its development is paused; I don't necessarily need it to be maintained and improved if what's there is sufficient, I just fear bugs
@broken flint I think I can answer that, I had trouble until I settled for -e 4
(decodable with j40.h)
Thanks!
@ruby fable i have a question on j40; can you please explain at the high level why j40__advance is designed as a coroutine? In what case it is necessary for it to yield leaving the work uncompleted?
the whole setup was designed for incremental parsing, as any individual step can stop at the end of currently available inputs in addition to genuine errors.
but is there an API for that? It seems like you can fetch from either a file or a memory callback but in both cases they seem to assume full consuming of the input
nope, it was never implemented to this day
I should mention that this approach is also similar to how libjxl does incremental parsing
(which inspired my design)
I already knew that a resumable coroutine would be necessary for incremental parsing in general, but C is too primitive to support that in a pleasant way, so I used that coroutine macro hack to retain a reasonable chunk of resumable routines
that said, in retrospect I think it wasn't enough because each individual routine does have to roll back perfectly on error, which was quite hard to do in general (especially in C!)
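For readers unfamiliar with the trick, here is a sketch of a switch-based coroutine macro in the style popularised by protothreads and Simon Tatham's "Coroutines in C". The macro names and the parser are hypothetical and are not J40's actual macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* resume_at stores the source line of the last yield; re-entering the
 * function jumps straight to the matching case label. */
#define CO_BEGIN(st) switch ((st)->resume_at) { case 0:
#define CO_YIELD(st, ret) \
    do { (st)->resume_at = __LINE__; return (ret); case __LINE__:; } while (0)
#define CO_END(st) } (st)->resume_at = 0

typedef struct {
    int resume_at;          /* 0 means "start from the top" */
    const uint8_t *p, *end; /* currently available input */
    uint32_t value;         /* "locals" live here so they survive a yield */
    int i;
} parser_t;

enum { DONE = 0, NEED_MORE = 1 };

/* Reads a big-endian 16-bit value, yielding whenever the input runs dry.
 * The caller refills p/end and calls again to continue. */
int parse_u16(parser_t *st) {
    assert(st != NULL);
    CO_BEGIN(st);
    st->value = 0;
    for (st->i = 0; st->i < 2; ++st->i) {
        while (st->p == st->end) CO_YIELD(st, NEED_MORE);
        st->value = (st->value << 8) | *st->p++;
    }
    CO_END(st);
    return DONE;
}
```

The limits show up quickly: at most one yield per source line, no real local variables across a yield, and nothing here undoes side effects, which is the roll-back problem mentioned above.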
i think it's complicated in general
a stackful coroutine would have a much easier life of course
if we are not using C...
yes but on the other hand, C is the standard for embedding and it is like that specifically because it is a simple language
so of course one has to take compromises here; in my case for instance i don't need incremental parsing so i will just remove that
@ruby fable Hello there how's things going with the library?
see above for the current situation.
I'm currently trying to revive the project with some AI sprinkles, let me see whether it would work or not...