#representation of nonstandard character encodings in pattern language?
Custom encodings are supported by the hex editor to a certain extent. The pattern editor does support UTF-8 but nothing else. The pattern language has no encoding capabilities that I'm aware of, except the wide char data type. Are you trying to parse some character encoding using the pattern language? All encodings are more or less unique; do you mean that the encoding is rare or unusual?
the encoding is already known, just unique, i'm just looking to represent it in a struct
rn i'm just representing it as a char[n], but this of course produces utf8
char[n] reads n bytes; there is no UTF-8 here. In UTF-8 a char can be 1 to 4 bytes long. I'm not sure it is possible to represent encodings using patterns. To deal with an encoding you need to read one byte at a time and interpret it according to the encoding, so you can tell what the next byte is going to be.
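A minimal Python sketch of the byte-at-a-time idea, using UTF-8 as the example (sequences are 1 to 4 bytes long): the first byte of each character determines how many bytes belong to it. The function names are made up for illustration, and the splitter does not validate continuation bytes or truncated input.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the length in bytes of the UTF-8 sequence that starts with `lead`."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: single-byte (ASCII) character
    if 0xC2 <= lead <= 0xDF:
        return 2  # 110xxxxx: two-byte sequence (0xC0/0xC1 are never valid)
    if 0xE0 <= lead <= 0xEF:
        return 3  # 1110xxxx: three-byte sequence
    if 0xF0 <= lead <= 0xF4:
        return 4  # 11110xxx: four-byte sequence
    raise ValueError("0x%02X cannot start a UTF-8 sequence" % lead)


def split_utf8(data: bytes) -> list:
    """Split raw bytes into one chunk per character, driven by the lead bytes."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks
```

A custom variable-width encoding would need its own version of `utf8_sequence_length`; a fixed-width one does not need this step at all.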
those are non printable ascii chars because pattern language doesn't know utf8
I also would like this feature, especially coupled with the ability to search for such strings.
What feature do you have in mind? The part displaying the chars is not the pattern language. Searching for strings can be done in a few places. Is there an app that has the ability to display custom encoded text and do searches for them? How would you even type strings encoded in custom tables?
I don't have the idea cleanly in my head for where it would fit into ImHex, but...
Well, I use a Japanese keyboard and Japanese/English files, especially from older machines. So as well as the somewhat standard ones like the various revisions of (Shift-)JIS I'm often looking at custom encodings. Japanese has a LOT of custom encodings, often for individual projects. Please note that these encodings don't cover the entirety of the file in most cases, just stored strings, and often multiple different encodings within each file.
I'd like to be able to define a mapping (1 or 2 bytes per character) that maps modern Unicode to whatever custom encoding I encountered.
That way, I could define a mapping/encoding, perhaps within the pattern editor. Then in the pattern editor I could make a new variable with my custom string type.
Additionally the search function could use this mapping, such that I type in Unicode as normal, and it maps to search for the bytestring that matches.
It would likely be defined like an enum.
' ' = 0x00, 'A' = 0x01, 'B', 'C', 'D', ...and so on.
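A rough Python sketch of what such an enum-like definition could mean, assuming an entry without an explicit value takes the previous code plus one, as in a C-style enum. `build_table`, `decode` and the table contents are hypothetical; this is not an ImHex feature.

```python
def build_table(entries):
    """Build {code: char} from (char, code_or_None) pairs.

    A code of None means "previous code + 1", like an enum member
    that has no explicit value.
    """
    table = {}
    next_code = 0
    for ch, code in entries:
        if code is not None:
            next_code = code
        table[next_code] = ch
        next_code += 1
    return table


# Hypothetical mapping following the example above.
TABLE = build_table([(" ", 0x00), ("A", 0x01), ("B", None), ("C", None), ("D", None)])


def decode(data: bytes, table: dict, unknown: str = "?") -> str:
    """Decode one byte per character; unmapped bytes become `unknown`."""
    return "".join(table.get(b, unknown) for b in data)
```

For example, `decode(bytes([0x01, 0x02, 0x00, 0x04]), TABLE)` gives `"AB D"`.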
Encodings need to be able to distinguish text from non-text. They have to be applied to the entire file and they need to detect what is a char and how many bytes would be needed to encode such a char. ImHex has limited support for displaying custom encodings using Unicode. You can create a table file that is imported and shown in the custom encoding column next to the ASCII column.
Yeah I anticipated you saying some of that, hence the clarifications above. When I said "1 or 2 bytes", I didn't mean like unicode where it's sometimes one and sometimes 2+, I don't need that at all. I just mean sometimes it's 8-bit encoding and sometimes it's 16-bit encoding.
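To make the "8-bit or 16-bit, but fixed" distinction concrete, here is a small Python sketch. The byte order of the 16-bit codes is an assumption (big-endian by default), and both tables are invented for illustration rather than taken from any real encoding.

```python
def decode_fixed(data: bytes, table: dict, width: int = 1,
                 byteorder: str = "big", unknown: str = "?") -> str:
    """Decode a string where every character is exactly `width` bytes (1 or 2)."""
    if width not in (1, 2):
        raise ValueError("width must be 1 or 2")
    if len(data) % width != 0:
        raise ValueError("data length is not a multiple of the character width")
    chars = []
    for i in range(0, len(data), width):
        code = int.from_bytes(data[i:i + width], byteorder)
        chars.append(table.get(code, unknown))
    return "".join(chars)


# Hypothetical tables: one 8-bit, one 16-bit.
TABLE_8 = {0x18: "a", 0x19: "b", 0x1A: "c"}
TABLE_16 = {0x0101: "ア", 0x0102: "カ"}
```

Because the width never changes inside a string, no lead-byte inspection is needed; the only per-string parameters are the table, the width and the byte order.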
I expect that you're thinking about encoding at the file level, hence some of what I was writing above, but I don't mean this at all. I just mean that the file itself is straight up binary. Just like I can search it for "abc" I want to also be able to search for "abc" when "abc" doesn't mean 0x61 62 63 but instead, through a representation mapping, is actually 0x18 19 1A or whatever.
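The search described here amounts to encoding the typed text through the mapping and then doing an ordinary byte search. A Python sketch under that assumption; the mapping values come from the example in the message, and the helper names are hypothetical.

```python
def encode_query(text: str, char_to_code: dict, width: int = 1,
                 byteorder: str = "big") -> bytes:
    """Turn typed text into the byte string it has under the custom mapping.

    Raises KeyError if a character has no entry in the mapping.
    """
    return b"".join(char_to_code[ch].to_bytes(width, byteorder) for ch in text)


def find_all(haystack: bytes, needle: bytes) -> list:
    """Return every offset at which `needle` occurs in `haystack`."""
    if not needle:
        raise ValueError("empty search pattern")
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits


# "abc" is stored as 0x18 0x19 0x1A in this hypothetical encoding.
CHAR_TO_CODE = {"a": 0x18, "b": 0x19, "c": 0x1A}
needle = encode_query("abc", CHAR_TO_CODE)
data = bytes([0x00, 0x18, 0x19, 0x1A, 0xFF, 0x18, 0x19, 0x1A])
offsets = find_all(data, needle)  # offsets 1 and 5 in this sample buffer
```

The same encoded pattern could also be pasted into any existing hex-sequence search, which may be a workable stopgap.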
Sure, all files are just binary, text needs to be encoded for that exact reason.
what you describe is not very useful because encodings usually don't change what a, b and c are but define new representations of chars that are not in ASCII
@obtuse stag said above about doing char[n], I get it. The representation of that is unhelpful, so I'd kind of have to be writing like a format function to get it to be a useful representation. Which is probably a good way to approach it, not sure, I've only used that for certain things.
This is 100% not true for old games in non-roman languages, especially Japanese.
that's why I said usually
That's an extremely Anglocentric view.
no, it is a true statement of fact. usually != 100%
whatever, not getting into the weeds with that one.
Besides, there is nothing Anglo about ascii. it is for the most part latin based.
This still is exactly what I want to achieve, because as well as the roman representation being mutable or non-existent, I need to map in kana all the time. It's currently hard to search for a string of 'アカサタナ', even when it's a standard encoding.
I guess I kind of want several related features here, but I'm also not sure how they would best be achieved.
Firstly, I'd like to give an attribute to a defined string to tell it a mapping/encoding/enum to use to represent the string in the Pattern Data table.
Even if that has to crush down to ascii, it would still be useful as long as I could use more than one ascii letter to represent a single byte.
Something like
char crazystring[16] [[mapping("JISX0208")]]; where I've defined the JISX0208 function or enum or mapping or whatever in a separate place.
Probably totally possible that I could do that with a format function?
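The logic such a format function would need is small. A Python sketch of it (not Pattern Language), showing one stored byte rendered as several ASCII letters; the table values, the `<XX>` fallback and the optional terminator byte are all assumptions made for illustration.

```python
# Hypothetical table: each stored byte is shown as a romaji syllable.
ROMAJI = {0x01: "a", 0x02: "ka", 0x03: "sa", 0x04: "ta", 0x05: "na"}


def format_bytes(data: bytes, table: dict = ROMAJI, terminator=None) -> str:
    """Render raw string bytes as readable text, one table entry per byte.

    Unmapped bytes are shown as <XX>. If `terminator` is given, decoding
    stops at the first byte with that value.
    """
    parts = []
    for b in data:
        if terminator is not None and b == terminator:
            break
        parts.append(table.get(b, "<%02X>" % b))
    return "".join(parts)
```

With this table, `format_bytes(bytes([0x01, 0x02, 0x03]))` yields `"akasa"`, so the display stays plain ASCII even though each byte stands for a kana.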
lol got it in one
tbh i'd like a formatting function as well, since it would be helpful for a bunch of other stuff, not just alternate string encodings (alternate bases, numbers with offsets, images if you wanted to be really fancy). but first the pattern lang will need to be more stable; unfortunately it's gotten to the point where my relatively simple script crashes more than it doesn't :(
Some of that exists already. Format functions already exist as [[name]] and [[format]]. Even images with [[hex::visualize]]
Not sure about using format attribute for this use case though. like, I guess I maybe could, but that still doesn't help me identify these strings in the first place, because I don't have an easy way to search for them, and also Values don't seem to be searchable in Pattern Data either, only Names, so the best I can do is export the data and then search that?
Please report crashes you encounter so they can be fixed.
working on producing a minimum reproduction yea, though it only happens sometimes so idk what's triggering it
if your script is relatively simple and it always crashes given the same inputs, that should be good enough for a report or a message in the support server.
i would classify it as fairly simple but it has several very large enums, so idk