representation of nonstandard character encodings in pattern language? | ImHex

obtuse stag Sep 12, 2024, 12:43 PM

#

is there a best practice here? this isn't like ascii vs utf8 either, it is to my knowledge completely unique unfortunately

near niche Sep 12, 2024, 1:17 PM

#

Custom encodings are supported by the hex editor to a certain extent. The pattern editor does support UTF-8 but nothing else. The pattern language has no encoding capabilities that I'm aware of except the wide char data type. Are you trying to parse some character encoding using the pattern language? All encodings are more or less unique; do you mean that the encoding is rare or unusual?

obtuse stag Sep 12, 2024, 1:18 PM

#

the encoding is already known, just unique, i'm just looking to represent it in a struct
rn i'm just representing it as a char[n], but this of course produces utf8

near niche Sep 12, 2024, 1:23 PM

#

char[n] reads n bytes, there is no UTF-8 involved here. In UTF-8 a char can be 1 to 4 bytes long. I'm not sure if it is possible to represent encodings using patterns. To deal with an encoding you need to read one byte at a time and interpret it according to the encoding, so you can tell whether the next byte continues the same character or starts a new one.
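
As a rough illustration of the "read a byte, then decide how many more belong to this character" idea, here is a small Python sketch outside of ImHex. The helper name `utf8_sequence_length` is made up for this example; the bit patterns are the standard UTF-8 lead-byte prefixes.

```python
# Byte length of a few sample characters when encoded as UTF-8.
samples = ["A", "é", "ア", "😀"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}


def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judged by its first byte."""
    if lead_byte < 0x80:            # 0xxxxxxx: single-byte (ASCII range)
        return 1
    if lead_byte >> 5 == 0b110:     # 110xxxxx: two-byte sequence
        return 2
    if lead_byte >> 4 == 0b1110:    # 1110xxxx: three-byte sequence
        return 3
    if lead_byte >> 3 == 0b11110:   # 11110xxx: four-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 lead byte")


for ch in samples:
    encoded = ch.encode("utf-8")
    print(ch, encoded.hex(" "), utf8_sequence_length(encoded[0]))
```

A plain char[n] does none of this; it simply takes n bytes, which is why a variable-width or custom encoding needs an extra decoding step somewhere.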

obtuse stag Sep 12, 2024, 1:25 PM

#

it's represented as a string in the data tables

#

is what i meant

near niche Sep 12, 2024, 1:27 PM

#

those show up as non-printable ASCII chars because the pattern language doesn't know UTF-8

sinful sandal Sep 12, 2024, 4:29 PM

#

I also would like this feature, especially coupled with the ability to search for such strings.

near niche Sep 12, 2024, 4:38 PM

#

What feature do you have in mind? The part displaying the chars is not the pattern language. Searching for strings can be done in a few places. Is there an app that has the ability to display custom encoded text and do searches for them? How would you even type strings encoded in custom tables?

sinful sandal Sep 12, 2024, 4:48 PM

#

I don't have the idea cleanly in my head for where it would fit into ImHex, but...
Well, I use a Japanese keyboard and Japanese/English files, especially from older machines. So as well as the somewhat standard ones like the various revisions of (Shift-)JIS, I'm often looking at custom encodings. Japanese has a LOT of custom encodings, often for individual projects. Please note that these encodings don't cover the entirety of the file in most cases, just the stored strings, and there are often multiple different encodings within each file.

I'd like to be able to define a mapping (1 or 2 bytes per character) that maps modern Unicode to whatever custom encoding I encountered.
That way, I could define a mapping/encoding, perhaps within the pattern editor. Then in the pattern editor I could make a new variable with my custom string type.

#

Additionally the search function could use this mapping, such that I type in Unicode as normal, and it maps to search for the bytestring that matches.

#

It would likely be defined like an enum.

#

' ' = 0x00, 'A' = 0x01, 'B', 'C', 'D', ...and so on.
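
To make the enum-style idea concrete, here is a Python sketch (not pattern language, and not an existing ImHex feature). `build_table` and `decode` are hypothetical helpers; an entry without an explicit code takes the previous code plus one, the way enum members do.

```python
def build_table(entries):
    """Build {code: char} from (char, code) pairs; code=None means previous + 1."""
    table = {}
    next_code = 0
    for ch, code in entries:
        if code is not None:
            next_code = code
        table[next_code] = ch
        next_code += 1
    return table


# ' ' = 0x00, 'A' = 0x01, 'B', 'C', 'D' (the last three are auto-numbered)
ENTRIES = [(" ", 0x00), ("A", 0x01), ("B", None), ("C", None), ("D", None)]
DECODE = build_table(ENTRIES)


def decode(data: bytes, table, unknown="?") -> str:
    """Map each byte through the table; unmapped bytes become `unknown`."""
    return "".join(table.get(b, unknown) for b in data)


print(decode(bytes([0x04, 0x01, 0x04, 0x00, 0x03, 0x01, 0x02]), DECODE))
```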

near niche Sep 12, 2024, 4:54 PM

#

Encodings need to be able to distinguish text from non-text. They have to be applied to the entire file, and they need to detect what is a char and how many bytes are needed to encode it. ImHex has limited support for displaying custom encodings using Unicode: you can create a table file that is imported and shown in a custom encoding column next to the ASCII column.
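
A table file like that can be generated from a mapping with a few lines of Python. This sketch assumes the common `HEX=text` one-entry-per-line layout used by "Thingy"-style .tbl files; the exact format ImHex accepts should be checked against its documentation, and the mapping values here are made up.

```python
def to_table_lines(table):
    """Render {code: text} as 'HEX=text' lines, sorted by code (assumed layout)."""
    return [f"{code:02X}={text}" for code, text in sorted(table.items())]


# Hypothetical custom encoding: 0x18 -> a, 0x19 -> b, 0x1A -> c
TABLE = {0x18: "a", 0x19: "b", 0x1A: "c"}
table_text = "\n".join(to_table_lines(TABLE)) + "\n"
print(table_text, end="")
```

Writing `table_text` to a file with UTF-8 encoding would give something to try importing as a custom encoding.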

sinful sandal Sep 12, 2024, 4:57 PM

#

Yeah I anticipated you saying some of that, hence the clarifications above. When I said "1 or 2 bytes", I didn't mean like unicode where it's sometimes one and sometimes 2+, I don't need that at all. I just mean sometimes it's 8-bit encoding and sometimes it's 16-bit encoding.
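
In other words the width is fixed per encoding, so the decoder never has to inspect a byte to find out how long a character is. A Python sketch under that assumption (the tables, the byte order, and the name `decode_fixed` are all made up for illustration):

```python
def decode_fixed(data: bytes, table, width: int, byteorder="big", unknown="?") -> str:
    """Decode a fixed-width encoding: every character is exactly `width` bytes."""
    if len(data) % width != 0:
        raise ValueError("data length is not a multiple of the character width")
    out = []
    for i in range(0, len(data), width):
        code = int.from_bytes(data[i:i + width], byteorder)
        out.append(table.get(code, unknown))
    return "".join(out)


TABLE_8 = {0x01: "ア", 0x02: "カ"}       # hypothetical 8-bit encoding
TABLE_16 = {0x0101: "ア", 0x0102: "カ"}  # hypothetical 16-bit encoding

print(decode_fixed(b"\x01\x02", TABLE_8, width=1))
print(decode_fixed(b"\x01\x01\x01\x02", TABLE_16, width=2))
```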

#

I expect that you're thinking about encoding at the file level, hence some of what I was writing above, but I don't mean this at all. I just mean that the file itself is straight up binary. Just like I can search it for "abc" I want to also be able to search for "abc" when "abc" doesn't mean 0x61 62 63 but instead, through a representation mapping, is actually 0x18 19 1A or whatever.
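
Sketched in Python, that search is just "encode the query through the mapping, then do an ordinary byte search". The mapping below uses the 0x18 19 1A values from the message; `encode_query` and `find_all` are hypothetical helper names, and none of this is an existing ImHex feature.

```python
# Hypothetical custom mapping: 'a' -> 0x18, 'b' -> 0x19, 'c' -> 0x1A
ENCODE = {"a": 0x18, "b": 0x19, "c": 0x1A}


def encode_query(text: str, table) -> bytes:
    """Turn typed text into the byte string to search for (KeyError if unmapped)."""
    return bytes(table[ch] for ch in text)


def find_all(haystack: bytes, needle: bytes):
    """Return every offset at which `needle` occurs in `haystack`."""
    hits = []
    pos = haystack.find(needle)
    while pos != -1:
        hits.append(pos)
        pos = haystack.find(needle, pos + 1)
    return hits


blob = bytes([0xFF, 0x18, 0x19, 0x1A, 0x00, 0x18, 0x19, 0x1A])
needle = encode_query("abc", ENCODE)
offsets = find_all(blob, needle)
print(needle.hex(" "), offsets)
```

The same approach works for kana or any other characters, as long as the table has an entry for each one typed.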

near niche Sep 12, 2024, 5:01 PM

#

Sure, all files are just binary, text needs to be encoded for that exact reason.

#

what you describe is not very useful because encodings usually don't change what a, b and c are, but define new representations for chars that are not in ASCII

sinful sandal Sep 12, 2024, 5:06 PM

#

@obtuse stag said above about doing char[n], I get it. The representation of that is unhelpful, so I'd kind of have to be writing like a format function to get it to be a useful representation. Which is probably a good way to approach it, not sure, I've only used that for certain things.

sinful sandal Sep 12, 2024, 5:06 PM

#

near niche what you describe is not very useful because encodings usually don't change what...

This is 100% not true for old games in non-roman languages, especially Japanese.

near niche Sep 12, 2024, 5:07 PM

#

that's why i said usually

sinful sandal Sep 12, 2024, 5:07 PM

#

That's an extremely Anglocentric view.

near niche Sep 12, 2024, 5:07 PM

#

no, it is a true statement of fact. usually != 100%

sinful sandal Sep 12, 2024, 5:08 PM

#

whatever, not getting into the weeds with that one.

near niche Sep 12, 2024, 5:09 PM

#

Besides, there is nothing Anglo about ASCII. It is for the most part Latin-based.

sinful sandal Sep 12, 2024, 5:11 PM

#

near niche what you describe is not very useful because encodings usually don't change what...

This still is exactly what I want to achieve, because as well as the roman representation being mutable or non-existent, I need to map in kana all the time. It's currently hard to search for a string of 'アカサタナ', even when it's a standard encoding.

#

I guess I kind of want several related features here, but I'm also not sure how they would best be achieved.

sinful sandal Sep 12, 2024, 5:25 PM

#

obtuse stag its represented as a string in the data tables

Firstly, I'd like to give an attribute to a defined string to tell it a mapping/encoding/enum to use to represent the string in the Pattern Data table.

#

Even if that has to crush down to ascii, it would still be useful as long as I could use more than one ascii letter to represent a single byte.
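
A Python sketch of that fallback, where each byte is shown as one or more ASCII letters. The romaji table and the name `to_ascii` are made up for illustration; unmapped bytes are shown as hex in angle brackets so nothing is silently dropped.

```python
# Hypothetical mapping from single bytes to multi-letter ASCII labels
ROMAJI = {0x01: "a", 0x02: "ka", 0x03: "sa", 0x04: "ta", 0x05: "na"}


def to_ascii(data: bytes, table, sep="-") -> str:
    """Render each byte as its ASCII label; unknown bytes appear as <HH>."""
    return sep.join(table.get(b, f"<{b:02X}>") for b in data)


print(to_ascii(bytes([0x01, 0x02, 0x03, 0x04, 0x05, 0x7F]), ROMAJI))
```

The separator keeps multi-letter labels unambiguous, since "ka" could otherwise be read as two one-letter entries.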

sinful sandal Sep 12, 2024, 5:44 PM

#

Something like
char crazystring[16] [[mapping("JISX0208")]]; where I've defined the JISX0208 function or enum or mapping or whatever in a separate place.

#

Probably totally possible that I could do that with a format function?

obtuse stag Sep 12, 2024, 5:54 PM

#

sinful sandal This is 100% not true for old games in non-roman languages, especially Japanese.

lol got it in one

#

tbh i'd like a formatting function as well, since it would be helpful for a bunch of other stuff, not just alternate string encodings (alternate bases, numbers with offsets, images if you wanted to be really fancy). but first the pattern lang will need to be more stable, unfortunately it's gotten to the point where my relatively simple script crashes more than it doesn't :(

sinful sandal Sep 12, 2024, 6:00 PM

#

Some of that exists already. Custom display names and format functions already exist as the [[name]] and [[format]] attributes. Even images, with [[hex::visualize]].

#

Not sure about using the format attribute for this use case though. Like, I guess I maybe could, but that still doesn't help me identify these strings in the first place, because I don't have an easy way to search for them. Also, Values don't seem to be searchable in Pattern Data either, only Names, so the best I can do is export the data and then search that?

near niche Sep 12, 2024, 6:05 PM

#

obtuse stag tbh i'd like a formatting function as well, since it would be helpful for a bunc...

Please report crashes you encounter so they can be fixed.

obtuse stag Sep 12, 2024, 6:09 PM

#

working on producing a minimum reproduction yea, though it only happens sometimes so idk what's triggering it

near niche Sep 12, 2024, 6:13 PM

#

if your script is relatively simple and it always crashes given the same inputs, that should be good enough for a report or a message in the support server.

obtuse stag Sep 12, 2024, 6:22 PM

#

i would classify it as fairly simple but it has several very large enums, so idk
