Serde deserializer for custom format | Rust Programming Language Community | Page 1

opal obsidian Jan 5, 2023, 11:13 PM

#

I have a string with a format that I can't change.
The format is: <dict>;<dict>;<dict>
With each <dict> being comma-separated keys and values: key1,12345,key2,09244,key3,0,key4,datadata, keyname,value,keyname,value
The dict has a non-constant number of fields, and the string can have an unlimited amount of these dicts.

I have already written a parser by looping through each dict with the split() function, but it is very ugly and I would like to improve it by using Serde and implementing the Deserialize trait myself.

Another problem is that a dict can have many keys, which I would like to deserialize into structs. But each dict can represent a different struct with its own properties.
For example, if a dict has id,10,time,10440, I would like to deserialize it to Time, as it has a time key.
For id,2948,action,10, I would like to deserialize it to Action, as it has an action key.

But for some of the structs, they have common keys, so I cannot easily distinguish between structs.
For id,11,time,5320,duration,1055, I would like to deserialize it to Length even when there's a time because it has duration.
Etc...

I have considered a large struct with many Option<>s, but a dict can have 100+ different fields (not all of them at once), a string can have as many as 1 million dicts or even more, and there are usually multiple strings like this (about 20).
From what I've read, an Option<> still takes up the same amount of memory even when it is None, so memory would be a problem with millions of large structs that each use only a small part of their allocated memory.
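The memory concern can be checked with `std::mem::size_of`. The sketch below compares a hypothetical struct of Options with a hypothetical enum that stores only the fields each variant needs. The type and field names are invented for illustration, and exact sizes depend on the compiler's layout choices.

```rust
use std::mem::size_of;

// Hypothetical "one big struct": every possible key gets its own Option field,
// and each field occupies space whether it is Some or None.
#[allow(dead_code)]
struct Wide {
    id: Option<u32>,
    time: Option<u32>,
    duration: Option<u32>,
    action: Option<u32>,
}

// Hypothetical enum: a value is only as large as its biggest variant plus a tag.
#[allow(dead_code)]
enum Compact {
    Time { id: u32, time: u32 },
    Action { id: u32, action: u32 },
    Length { id: u32, time: u32, duration: u32 },
}

/// Returns (size of Wide, size of Compact) in bytes.
fn sizes() -> (usize, usize) {
    (size_of::<Wide>(), size_of::<Compact>())
}
```

With 100+ optional fields the gap grows with every field added to the struct, while the enum only grows when its largest variant does.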

Should I even have a custom Deserializer for the string, since the string is only an array?
Or should I make a Deserializer for the dicts instead?
And if I should make a Deserializer, how should I make it deserialize into an enum depending on the fields?
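One way to picture the "enum depending on fields" part, independent of serde, is to check for the most specific key first. This is a minimal sketch using only the standard library. `Record`, `pairs`, and `classify` are hypothetical names, and it assumes all three values are unsigned integers as in the examples above.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum Record {
    Time { id: u32, time: u32 },
    Action { id: u32, action: u32 },
    Length { id: u32, time: u32, duration: u32 },
}

/// Split one `key,value,key,value` dict into borrowed pairs.
fn pairs(dict: &str) -> HashMap<&str, &str> {
    let mut parts = dict.split(',');
    let mut map = HashMap::new();
    while let (Some(key), Some(value)) = (parts.next(), parts.next()) {
        map.insert(key, value);
    }
    map
}

/// Pick a variant from the keys that are present; `None` if nothing fits.
fn classify(dict: &str) -> Option<Record> {
    let map = pairs(dict);
    let get = |key: &str| -> Option<u32> { map.get(key)?.parse().ok() };
    let id = get("id")?;
    // Most specific shape first: `duration` wins over a bare `time`.
    if let Some(duration) = get("duration") {
        Some(Record::Length { id, time: get("time")?, duration })
    } else if let Some(time) = get("time") {
        Some(Record::Time { id, time })
    } else {
        get("action").map(|action| Record::Action { id, action })
    }
}
```

For example, `classify("id,11,time,5320,duration,1055")` yields the `Length` variant even though a `time` key is present, because the `duration` check runs first.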

zenith mica Jan 5, 2023, 11:19 PM

#

It sounds like they should just be &str or &[u8] until you need to turn them into a specific type. Is this format recursive at all (even just with an escaped dict for a value)? What if a value has a , or ; in it? Are the keys (supposed to be) unique?
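A rough sketch of the "keep it as `&str` until needed" idea: each dict stays a borrowed slice of the input, and a value is only looked up when asked for. `RawDict` and `raw_dicts` are hypothetical names, and the sketch assumes `,` and `;` never appear inside a value.

```rust
/// Borrowed view of one dict; nothing is parsed until `get` is called.
struct RawDict<'a>(&'a str);

impl<'a> RawDict<'a> {
    /// Linear scan over the `key,value` pairs; returns the first match.
    fn get(&self, key: &str) -> Option<&'a str> {
        let mut parts = self.0.split(',');
        while let (Some(k), Some(v)) = (parts.next(), parts.next()) {
            if k == key {
                return Some(v);
            }
        }
        None
    }
}

/// Split the whole string on `;` without copying, skipping empty segments
/// such as the one after a trailing `;`.
fn raw_dicts(input: &str) -> impl Iterator<Item = RawDict<'_>> {
    input.split(';').filter(|part| !part.is_empty()).map(RawDict)
}
```

Each `RawDict` is just a pointer and a length, so a million of them cost far less than a million fully populated structs. The trade-off is that every lookup rescans the dict.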

opal obsidian Jan 5, 2023, 11:30 PM

#

zenith mica It sounds like they should just be `&str` or `&[u8]` until you need to turn them...

This format does not seem recursive and the user modifiable values are encoded with base64-url from what I can see, so no , or ; will occur unless it's an actual separation.

#

Also the keys are not all unique, as many of these are shared between the dicts, with additional ones providing extra information for certain specific dicts that occur pretty frequently.

zenith mica Jan 6, 2023, 12:10 AM

#

So when you make a serialization format for serde (aka implementing Serializer and Deserializer), you need to handle at least turning any Serialize type into your format, and any Deserialize type from your format. Idk if it's possible to handle the "extra information" part with your format, so you may need to do that after deserialization/before serialization.

Fortunately, you can use the base64 values as your recursive part (since serde will give your format recursive types). You might want to make your format generic over what subformat gets base64 encoded, but that can also be handled by the Serialize/Deserialize type. It's kind of weird because your format has two levels (the , separated parts and the ; separated parts) and rust types only have one level, so you might want to make your deserializer work only with the , separated parts, and handle the ; parts outside of serde (i.e. return an iterator of T where T is an enum of the expected types).

For things like the time key, you'd make your own type (something like FormatNameTime) and use the usual serde macros to decide how it's represented. Then any structs that are expecting such a value would contain that type as a member. This is completely separate from making your format, although of course, the format needs to be able to represent them (e.g. not turning everything into base64).

serde_json has the Value type which can serialize/deserialize any JSON. That sounds like what you're doing with the Options, but this is also completely separate from writing a serde data format. You'll probably reuse a bunch of parser code, but as far as serde is concerned, it's just another type that implements Serialize and Deserialize.
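To illustrate what a `Value`-style type could look like for this format (this is not serde_json's type, just a hypothetical analogue), here is a small std-only sketch that classifies a raw value as an integer, a float, or text.

```rust
/// Hypothetical catch-all value for this format, loosely analogous to
/// `serde_json::Value` but with only the shapes seen in the sample data.
#[derive(Debug, PartialEq)]
enum Value<'a> {
    Int(i64),
    Float(f64),
    Text(&'a str),
}

impl<'a> Value<'a> {
    /// Try the narrowest interpretation first and fall back to borrowed text.
    fn parse(raw: &'a str) -> Self {
        if let Ok(int) = raw.parse::<i64>() {
            Value::Int(int)
        } else if let Ok(float) = raw.parse::<f64>() {
            Value::Float(float)
        } else {
            Value::Text(raw)
        }
    }
}
```

Two caveats with this guess-the-type approach: a value like `09244` loses its leading zero when read as an integer, and Rust's float parser also accepts strings such as `inf` or `1e5`, so text fields that happen to look numeric would be misclassified. Typed structs avoid both problems because the field type decides how the value is read.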

#

Oh and here's the part about writing your own format: https://serde.rs/data-format.html


opal obsidian Jan 6, 2023, 12:24 AM

#

zenith mica So when you make a serialization format for serde (aka implementing `Serializer`...

So I should handle the first level separately and then pass each of the second level (, separated parts) to the custom Deserializer and leave them in Value?
But then I would need to make a match statement depending on the contents of the Value instead of an enum of the types of the dict, making it so I'll still have to do some get("<key name>").unwrap_or("<default value>").parse().unwrap_or_default()

Also, the format is not entirely base64 encoded:
Example of the string: kA13,0,kA15,0,kA16,0,kA14,,kA6,13,kA7,0,kA17,0,kA18,0,kS39,1,kA2,0,kA3,0,kA8,0,kA4,0,kA9,0,kA10,0,kA11,0;1,211,2,75,3,75,21,19;1,211,2,75,3,135,21,13;1,211,2,75,3,195,21,7;1,211,2,75,3,255,21,1;1,211,2,75,3,375,21,7;1,211,2,75,3,397.5,21,13;1,211,2,75,3,420,21,19;1,211,2,75,3,315,21,25;1,211,2,135,3,75,21,20;1,211,2,135,3,135,21,14;1,211,2,135,3,195,21,8;1,211,2,135,3,255,21,2;1,211,2,195,3,75,21,21;1,211,2,195,3,135,21,15;1,211,2,195,3,195,21,9;1,211,2,195,3,255,21,3;1,211,2,135,3,375,21,8;1,211,2,135,3,397.5,21,14;1,211,2,135,3,420,21,20;1,211,2,195,3,375,21,9;1,211,2,195,3,397.5,21,15;1,211,2,195,3,420,21,21;1,211,2,135,3,315,21,26;1,211,2,195,3,315,21,27;1,211,2,255,3,75,21,22;1,211,2,255,3,135,21,16;1,211,2,255,3,195,21,10;1,211,2,255,3,255,21,4;1,211,2,255,3,375,21,10;1,211,2,255,3,397.5,21,16;1,211,2,255,3,420,21,22;1,211,2,255,3,315,21,28;1,899,2,405,3,225,36,1,7,0,8,0,9,0,10,0.5,35,1,23,1000;1,899,2,585,3,225,36,1,7,255,8,0,9,0,10,0.5,35,1,23,1000;1,899,2,765,3,225,25,2,36,1,7,0,8,255,9,0,10,0.5,35,1,23,1000;1,899,2,945,3,225,25,2,36,1,7,0,8,0,9,255,10,0.5,35,1,23,1000;1,899,2,1125,3,225,25,2,36,1,7,37,8,37,9,37,10,0.5,35,1,23,1000;

As you can see, the key names are just unreadable messes, so I intend to use #[serde(rename = "")] to fix that.

#

Also, each of the dicts can either be this small 1,211,2,75,3,75,21,19
Or very big: kS38,1_40_2_125_3_255_11_255_12_255_13_255_4_-1_6_1000_7_1_15_1_18_0_8_1|1_0_2_102_3_255_11_255_12_255_13_255_4_-1_6_1001_7_1_15_1_18_0_8_1|1_0_2_102_3_255_11_255_12_255_13_255_4_-1_6_1009_7_1_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_1002_5_1_7_1_15_1_18_0_8_1|1_135_2_135_3_135_11_255_12_255_13_255_4_-1_6_1005_5_1_7_1_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_1006_5_1_7_1_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_1_5_1_7_1_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_2_5_1_7_0.75_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_3_5_1_7_0.5_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_4_5_1_7_0.25_15_1_18_0_8_1|1_255_2_0_3_0_11_255_12_255_13_255_4_-1_6_7_5_1_7_1_15_1_18_0_8_1|1_255_2_0_3_0_11_255_12_255_13_255_4_-1_6_8_5_1_7_0.75_15_1_18_0_8_1|1_255_2_0_3_0_11_255_12_255_13_255_4_-1_6_9_5_1_7_0.5_15_1_18_0_8_1|1_255_2_0_3_0_11_255_12_255_13_255_4_-1_6_10_5_1_7_0.25_15_1_18_0_8_1|1_0_2_0_3_255_11_255_12_255_13_255_4_-1_6_13_5_1_7_1_15_1_18_0_8_1|1_0_2_0_3_255_11_255_12_255_13_255_4_-1_6_14_5_1_7_0.75_15_1_18_0_8_1|1_0_2_0_3_255_11_255_12_255_13_255_4_-1_6_15_5_1_7_0.5_15_1_18_0_8_1|1_0_2_0_3_255_11_255_12_255_13_255_4_-1_6_16_5_1_7_0.25_15_1_18_0_8_1|1_0_2_255_3_0_11_255_12_255_13_255_4_-1_6_19_5_1_7_1_15_1_18_0_8_1|1_0_2_255_3_0_11_255_12_255_13_255_4_-1_6_20_5_1_7_0.75_15_1_18_0_8_1|,kA13,0,kA15,0,kA16,0,kA14,,kA6,13,kA7,0,kA17,0,kA18,0,kS39,1,kA2,0,kA3,0,kA8,0,kA4,0,kA9,0,kA10,0,kA11,0

#

But a majority of these dicts are pretty small, so calling the deserializer on each of these dicts separately seems kinda wasteful.

zenith mica Jan 6, 2023, 12:30 AM

#

Oh dear, another separator. Maybe you should make 3 formats, one for each level? And then nest them in this specific order and use that as the main format.
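As a sketch of what the innermost level might look like for the `kS38` value, assuming (from the sample only) that it is a `|`-separated list of entries, each a `_`-separated run of key/value pairs. `nested_entries` is a hypothetical helper name.

```rust
/// Split a `kS38`-style value into entries (`|`) and then into
/// key/value pairs (`_`), borrowing from the input throughout.
fn nested_entries(value: &str) -> Vec<Vec<(&str, &str)>> {
    value
        .split('|')
        // The sample ends with a trailing `|`, which produces an empty segment.
        .filter(|entry| !entry.is_empty())
        .map(|entry| {
            let mut parts = entry.split('_');
            let mut pairs = Vec::new();
            while let (Some(key), Some(val)) = (parts.next(), parts.next()) {
                pairs.push((key, val));
            }
            pairs
        })
        .collect()
}
```

The shape is the same as the outer `;` and `,` levels, only with different separators, which is what makes the "one format per level" idea plausible.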

opal obsidian Jan 6, 2023, 12:37 AM

#

zenith mica Oh dear, another separator. Maybe you should make 3 formats, one for each level?...

Yeah, it's pretty unconventional, but it's from another program that I can't control and if it was me, I would have just used a datatype with proper standards.
Also, bad news: there are certainly more than three separators in the whole format.
Let me give you one string out of the 20 I have to deal with. Maybe it will help explain the format more.

📎 string.txt

#

The reason I wanted to rewrite it in serde in the first place is because I had multiple loops each dealing with their own separator.

zenith mica Jan 6, 2023, 1:18 AM

#

Well, serde isn't going to magically handle different separators, but in this case it doesn't seem too bad. You have to keep track of the current and parent separators, and just ensure anything else is a valid integer or base64.
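A minimal sketch of the "current and parent separators" idea, using only the standard library: the separators are passed as a list ordered from outermost to innermost, and each recursion level consumes one. `Node` and `split_levels` are hypothetical names. In the real format the inner separators depend on the key (for example `|` and `_` only inside certain values), so this shows only the bookkeeping, not a full parser.

```rust
/// Tree of borrowed slices: one `List` per separator level, `Leaf` at the bottom.
#[derive(Debug, PartialEq)]
enum Node<'a> {
    Leaf(&'a str),
    List(Vec<Node<'a>>),
}

/// Split `input` by the first separator, then split each part by the rest.
fn split_levels<'a>(input: &'a str, separators: &[char]) -> Node<'a> {
    match separators.split_first() {
        // No separators left: whatever remains is a plain value.
        None => Node::Leaf(input),
        Some((sep, rest)) => Node::List(
            input
                .split(*sep)
                .map(|part| split_levels(part, rest))
                .collect(),
        ),
    }
}
```

Calling `split_levels("1,211;2,75", &[';', ','])` gives a list of two lists, each holding two leaves. Validating that every leaf is an integer or base64 would then be a separate pass over the leaves.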

It seems like the big advantage to using serde here will be in reducing the effort of writing the rust types, which means you'd need to weigh that against the effort of writing the serde format. It's not going to affect the difficulty of parsing at all.

opal obsidian Jan 6, 2023, 1:19 AM

#

zenith mica Well, serde isn't going to magically handle different separators, but in this ca...

Alright, seems like a good deal, thanks for all the help!

undone eagle Jan 6, 2023, 1:32 AM

#

You might try using nom, it’s good at parsing simple non-recursive formats like this

#

serde is better suited to very generic formats like JSON

opal obsidian Jan 6, 2023, 1:34 AM

#

undone eagle serde is better suited to very generic formats like JSON

Hm, but for the key names, how would I map them to fields?

#

Since they are just 1,2,3,4 and etc.

undone eagle Jan 6, 2023, 1:35 AM

#

opal obsidian Hm, but for the key names, how would I map them to fields?

with nom or serde?

opal obsidian Jan 6, 2023, 1:35 AM

#

undone eagle with nom or serde?

With nom

undone eagle Jan 6, 2023, 1:36 AM

#

opal obsidian With nom

hang on, I don’t feel like typing that on mobile

zenith mica Jan 6, 2023, 1:38 AM

#

nom and serde aren't exclusive, they do different things. You will need to parse it somehow no matter what. You can do that with nom whether you're making a serde format or not.

undone eagle Jan 6, 2023, 1:39 AM

#

separated_list0(char(','), separated_pair(is_not(",;"), char(','), is_not(",;")))

undone eagle Jan 6, 2023, 1:39 AM

#

zenith mica nom and serde aren't exclusive, they do different things. You will need to parse...

yeah, I know, nom just seemed like a good fit for this to me

opal obsidian Jan 6, 2023, 1:52 AM

#

My idea originally was to pass a new separator for each different format I want to parse.
Like first level:

let mything: Level1 = Deserializer::from_str("<my str>", b'<separator for level 1>');

Into a vec:

struct Level1 {
    level2s: Vec<Level2>
}

Then for level2 I would have deserialize like:

impl<'de> Deserialize<'de> for Level2 {
    fn deserialize<D>(deserializer: D) -> Result<Level2, D::Error>
    where
        D: Deserializer<'de>,
    {
        let s: &'de [u8] = Deserialize::deserialize(deserializer)?;
        Ok(Deserializer::from_slice(s, b'<separator for level2>')?)
    }
}

And so forth...

zenith mica Jan 6, 2023, 1:54 AM

#

Put your first Deserializer:: etc in between two `

undone eagle Jan 6, 2023, 1:54 AM

#

use std::collections::HashMap;

use nom::{
    bytes::complete::{is_not, take_till},
    character::complete::char,
    combinator::{map, opt},
    error::VerboseError,
    multi::separated_list1,
    sequence::{separated_pair, terminated},
    IResult,
};

// Written against nom 7. A key must be non-empty; a value may be empty (e.g. `kA14,,`).
fn kv_pair(i: &str) -> IResult<&str, (&str, &str), VerboseError<&str>> {
    separated_pair(is_not(",;"), char(','), take_till(|c: char| c == ',' || c == ';'))(i)
}

// One dict: `key,value` pairs separated by `,`, with an optional trailing `,`.
fn dict(i: &str) -> IResult<&str, HashMap<&str, &str>, VerboseError<&str>> {
    map(
        terminated(separated_list1(char(','), kv_pair), opt(char(','))),
        |pairs: Vec<(&str, &str)>| pairs.into_iter().collect(),
    )(i)
}

// The whole string: dicts separated by `;`, with an optional trailing `;`.
fn dict_many(i: &str) -> IResult<&str, Vec<HashMap<&str, &str>>, VerboseError<&str>> {
    terminated(separated_list1(char(';'), dict), opt(char(';')))(i)
}

#

this is what it might look like with nom

#

idk

#

do you want a struct or a Vec<HashMap<_, _>> as output?

zenith mica Jan 6, 2023, 1:55 AM

#

Can't be HashMap since there's duplicate keys

undone eagle Jan 6, 2023, 1:55 AM

#

ah

opal obsidian Jan 6, 2023, 1:55 AM

#

The main thing is to deserialize into a struct

opal obsidian Jan 6, 2023, 1:55 AM

#

zenith mica Can't be HashMap since there's duplicate keys

No?

#

At least I haven't seen them

zenith mica Jan 6, 2023, 1:57 AM

#

opal obsidian Also the keys are not all unique, as many of these are shared between the dicts....

What did you mean here?

opal obsidian Jan 6, 2023, 1:57 AM

#

zenith mica What did you mean here?

Not unique between dicts

#

One dict can have a key x

#

The other can also have a key x

#

Maybe I haven't explained it very clearly, sorry

zenith mica Jan 6, 2023, 1:58 AM

#

Oh okay, that seems pretty normal then.

undone eagle Jan 6, 2023, 1:58 AM

#

A combination of the two might be good if you want structs, nom is an easy way to write a parser and serde is good at producing structured output

opal obsidian Jan 6, 2023, 2:00 AM

#

opal obsidian The main thing is to deserialize into a struct

Before, I had a struct with an `other` field of type HashMap<String, String> for fields not in the struct.
So each time I wanted to access something that wasn't in the struct, I had to do level2.other.get("51").unwrap().parse().unwrap() multiple times.

#

Also this was my previous solution:

// Excerpt from a larger function: `bytes` is the input as &[u8], `objects` collects the results.
for object_string in bytes.split(|byte| *byte == b';') {
    let mut object = Level2::default();
    // Walk the `,`-separated parts two at a time: property id, then property value.
    let mut iterator = object_string.split(|byte| *byte == b',');
    while let Some(property_id) = iterator.next() {
        let property_value = match iterator.next() {
            Some(value) => value,
            None => break,
        };
        match property_id {
            b"1" => object.id = String::from_utf8_lossy(property_value).parse().unwrap(),
            b"2" => {
                object.x = String::from_utf8_lossy(property_value)
                    .parse::<f32>()
                    .unwrap()
            }
            b"3" => {
                object.y = String::from_utf8_lossy(property_value)
                    .parse::<f32>()
                    .unwrap()
            }
            b"4" => object.flip_x = u8_to_bool(property_value),
            b"5" => object.flip_y = u8_to_bool(property_value),
            b"6" => object.rot = String::from_utf8_lossy(property_value).parse().unwrap(),
            b"21" => {
                object.main_color = String::from_utf8_lossy(property_value).parse().unwrap()
            }
            b"22" => {
                object.second_color = String::from_utf8_lossy(property_value).parse().unwrap()
            }
            // Anything without a dedicated field goes into the catch-all map.
            _ => {
                object.other.insert(
                    String::from_utf8_lossy(property_id).to_string(),
                    String::from_utf8_lossy(property_value).to_string(),
                );
            }
        }
    }
    objects.push(object);
}

With this, I had to make a separate match condition for each variable I wanted to parse.
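Short of a full serde format, some of the repetition in a loop like this can be factored into one generic helper. The sketch below is a hypothetical, trimmed-down variant (`Object`, `parse_field`, and `parse_object` are illustrative names, with only three typed fields) that returns `None` on a bad value instead of panicking.

```rust
use std::collections::HashMap;
use std::str::FromStr;

#[derive(Debug, Default)]
struct Object {
    id: u32,
    x: f32,
    y: f32,
    other: HashMap<String, String>,
}

/// Parse a raw byte value into any `FromStr` type.
/// Returns `None` on invalid UTF-8 or when the text does not parse.
fn parse_field<T: FromStr>(raw: &[u8]) -> Option<T> {
    std::str::from_utf8(raw).ok()?.parse().ok()
}

/// Parse one `,`-separated dict; the target field's type drives the parsing.
fn parse_object(raw: &[u8]) -> Option<Object> {
    let mut object = Object::default();
    let mut parts = raw.split(|byte| *byte == b',');
    while let (Some(key), Some(value)) = (parts.next(), parts.next()) {
        match key {
            b"1" => object.id = parse_field(value)?,
            b"2" => object.x = parse_field(value)?,
            b"3" => object.y = parse_field(value)?,
            _ => {
                object.other.insert(
                    String::from_utf8_lossy(key).into_owned(),
                    String::from_utf8_lossy(value).into_owned(),
                );
            }
        }
    }
    Some(object)
}
```

This still needs one match arm per key, which is the part that serde's derive with renamed fields would generate automatically.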

undone eagle Jan 6, 2023, 2:07 AM

#

serde seems like a good idea then
