#Serde deserializer for custom format


opal obsidian
#

I have a string with a format that I can't change.
The format is: <dict>;<dict>;<dict>
With each <dict> being comma-separated keys and values: key1,12345,key2,09244,key3,0,key4,datadata, keyname,value,keyname,value
The dict has a non-constant number of fields, and the string can have an unlimited amount of these dicts.

I have already written a parser by looping through each dict with the split() function, but it is very ugly and I would like to improve it by using Serde and implementing the Deserialize trait myself.

Another problem is that a dict can have many keys, which I would like to deserialize into structs. But each dict can represent a different struct with its own properties.
For example: dict has id,10,time,10440, I would like to deserialize it to Time as it has a time key.
For id,2948,action,10, I would like to deserialize it to Action as it has an action key.

But for some of the structs, they have common keys, so I cannot easily distinguish between structs.
For id,11,time,5320,duration,1055, I would like to deserialize it to Length even when there's a time because it has duration.
Etc...

I have considered a large struct with many Option<>s, but a dict can have 100+ different fields (not all of them at once), a string can have as many as 1 million dicts or even more, and there are usually multiple strings like this (about 20).
From what I've read, an Option<> still takes up the same amount of memory even if it is None, so memory would be a problem with millions of large structs that each use only a small part of their allocated memory.
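A quick way to check the memory concern is to ask the compiler directly. This is only a sketch: the Sparse struct and its four fields are made up for illustration, and exact sizes depend on the compiler and target, so nothing here should be read as a guaranteed layout.

```rust
use std::mem::size_of;

// Hypothetical stand-in for the "large struct with many Options" idea.
// Only four fields are shown; the real one would have 100+.
#[derive(Default)]
struct Sparse {
    id: Option<u32>,
    x: Option<f32>,
    y: Option<f32>,
    rot: Option<f32>,
}

/// Size of one `Sparse` value in bytes, regardless of how many fields are `None`.
fn sparse_size() -> usize {
    size_of::<Sparse>()
}

/// Extra bytes an `Option<u32>` needs on top of a plain `u32`
/// (u32 has no spare bit pattern, so the `None` case needs its own tag).
fn option_overhead() -> usize {
    size_of::<Option<u32>>() - size_of::<u32>()
}
```

Every Sparse value occupies sparse_size() bytes whether its fields are Some or None, which is why a list of (key, value) pairs or a per-shape enum tends to be more compact when most fields are absent.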

Should I even have a custom Deserializer for the string? As the string is only an array.
Or should I make a Deserializer for the dicts instead?
And if I should make a Deserializer, how should I make it deserialize into enum depending on fields?
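One way to pick the enum variant from the keys, without serde at all, is to check for the most specific key first. The sketch below assumes the three example shapes from this message (Time, Action, Length), assumes every value is an unsigned integer, and uses made-up names (Record, parse_pairs, classify); it is an illustration, not a full solution for 100+ keys.

```rust
#[derive(Debug, PartialEq)]
enum Record {
    Length { id: u32, time: u32, duration: u32 },
    Time { id: u32, time: u32 },
    Action { id: u32, action: u32 },
}

/// Split `key,value,key,value` into pairs; a dangling key without a value is dropped.
fn parse_pairs(dict: &str) -> Vec<(&str, &str)> {
    let mut parts = dict.split(',');
    let mut pairs = Vec::new();
    while let (Some(key), Some(value)) = (parts.next(), parts.next()) {
        pairs.push((key, value));
    }
    pairs
}

/// Choose a variant by which keys are present, most specific shape first.
fn classify(dict: &str) -> Option<Record> {
    let pairs = parse_pairs(dict);
    // Look up a key and parse its value as u32; None if missing or invalid.
    let get = |key: &str| -> Option<u32> {
        pairs.iter().find(|(k, _)| *k == key)?.1.parse().ok()
    };

    let id = get("id")?;
    // `duration` wins over `time`, because a Length also carries a time.
    if let Some(duration) = get("duration") {
        return Some(Record::Length { id, time: get("time")?, duration });
    }
    if let Some(time) = get("time") {
        return Some(Record::Time { id, time });
    }
    get("action").map(|action| Record::Action { id, action })
}
```

The order of the checks encodes the rule "Length even when there's a time", so adding a new shape means deciding where it sits in that order.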

zenith mica
#

It sounds like they should just be &str or &[u8] until you need to turn them into a specific type. Is this format recursive at all (even just with an escaped dict for a value)? What if a value has a , or ; in it? Are the keys (supposed to be) unique?

opal obsidian
#

Also the keys are not all unique, as many of these are shared between the dicts, with additional ones providing extra information for certain specific dicts that occur pretty frequently.

zenith mica
#

So when you make a serialization format for serde (aka implementing Serializer and Deserializer), you need to handle at least turning any Serialize type into your format, and any Deserialize type from your format. Idk if it's possible to handle the "extra information" part with your format, so you may need to do that after deserialization/before serialization.

Fortunately, you can use the base64 values as your recursive part (since serde will give your format recursive types). You might want to make your format generic over what subformat gets base64 encoded, but that can also be handled by the Serialize/Deserialize type. It's kind of weird because your format has two levels (the , separated parts and the ; separated parts) and rust types only have one level, so you might want to make your deserializer work only with the , separated parts, and handle the ; parts outside of serde (i.e. return an iterator of T where T is an enum of the expected types).
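Handling the ; level outside of serde could look roughly like this. The function name records and the parse_dict callback are hypothetical; the callback stands in for whatever turns one , separated dict into a T (a serde-based deserializer, a hand-written parser, and so on).

```rust
/// Lazily walk the `;` separated dicts and hand each one to `parse_dict`.
/// Empty segments (for example after a trailing `;`) are skipped.
fn records<'a, T, F>(input: &'a str, parse_dict: F) -> impl Iterator<Item = T> + 'a
where
    T: 'a,
    F: Fn(&'a str) -> T + 'a,
{
    input
        .split(';')
        .filter(|dict| !dict.is_empty())
        .map(parse_dict)
}
```

Because the iterator is lazy, a string with a million dicts is never materialized as a million intermediate slices; each dict is parsed as it is pulled.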

For things like the time key, you'd make your own type (something like FormatNameTime) and use the usual serde macros to decide how it's represented. Then any structs that are expecting such a value would contain that type as a member. This is completely separate from making your format, although of course, the format needs to be able to represent them (e.g. not turning everything into base64).

serde_json has the Value type which can serialize/deserialize any JSON. That sounds like what you're doing with the Options, but this is also completely separate from writing a serde data format. You'll probably reuse a bunch of parser code, but as far as serde is concerned, it's just another type that implements Serialize and Deserialize.

opal obsidian
#

So I should handle the first level separately and then pass each of the second level (, separated parts) to the custom Deserializer and leave them in Value?
But then I would need to make a match statement depending on the contents of the Value instead of an enum of the types of the dict, making it so I'll still have to do some get("<key name>").unwrap_or("<default value>").parse().unwrap_or_default()
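If the dicts do stay as loose key/value pairs, the get/unwrap_or/parse chain can at least be written once. A minimal sketch, assuming a dict is held as a slice of (key, value) string pairs; the helper name field_or_default is made up here.

```rust
use std::str::FromStr;

/// Look up `key` and parse its value, falling back to `T::default()`
/// when the key is missing or the value does not parse.
fn field_or_default<T>(pairs: &[(&str, &str)], key: &str) -> T
where
    T: FromStr + Default,
{
    pairs
        .iter()
        .find(|(k, _)| *k == key)
        .and_then(|(_, v)| v.parse().ok())
        .unwrap_or_default()
}
```

A call then reads like field_or_default::<f32>(&pairs, "2"). Note that this silently hides malformed values, which may or may not be what you want.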

Also, the format is not entirely base64 encoded:
Example of the string: kA13,0,kA15,0,kA16,0,kA14,,kA6,13,kA7,0,kA17,0,kA18,0,kS39,1,kA2,0,kA3,0,kA8,0,kA4,0,kA9,0,kA10,0,kA11,0;1,211,2,75,3,75,21,19;1,211,2,75,3,135,21,13;1,211,2,75,3,195,21,7;1,211,2,75,3,255,21,1;1,211,2,75,3,375,21,7;1,211,2,75,3,397.5,21,13;1,211,2,75,3,420,21,19;1,211,2,75,3,315,21,25;1,211,2,135,3,75,21,20;1,211,2,135,3,135,21,14;1,211,2,135,3,195,21,8;1,211,2,135,3,255,21,2;1,211,2,195,3,75,21,21;1,211,2,195,3,135,21,15;1,211,2,195,3,195,21,9;1,211,2,195,3,255,21,3;1,211,2,135,3,375,21,8;1,211,2,135,3,397.5,21,14;1,211,2,135,3,420,21,20;1,211,2,195,3,375,21,9;1,211,2,195,3,397.5,21,15;1,211,2,195,3,420,21,21;1,211,2,135,3,315,21,26;1,211,2,195,3,315,21,27;1,211,2,255,3,75,21,22;1,211,2,255,3,135,21,16;1,211,2,255,3,195,21,10;1,211,2,255,3,255,21,4;1,211,2,255,3,375,21,10;1,211,2,255,3,397.5,21,16;1,211,2,255,3,420,21,22;1,211,2,255,3,315,21,28;1,899,2,405,3,225,36,1,7,0,8,0,9,0,10,0.5,35,1,23,1000;1,899,2,585,3,225,36,1,7,255,8,0,9,0,10,0.5,35,1,23,1000;1,899,2,765,3,225,25,2,36,1,7,0,8,255,9,0,10,0.5,35,1,23,1000;1,899,2,945,3,225,25,2,36,1,7,0,8,0,9,255,10,0.5,35,1,23,1000;1,899,2,1125,3,225,25,2,36,1,7,37,8,37,9,37,10,0.5,35,1,23,1000;

As you can see, the key names are just unreadable messes, so I intend to use #[serde(rename = "")] to fix that.

#

Also, each of the dicts can either be this small 1,211,2,75,3,75,21,19
Or very big: kS38,1_40_2_125_3_255_11_255_12_255_13_255_4_-1_6_1000_7_1_15_1_18_0_8_1|1_0_2_102_3_255_11_255_12_255_13_255_4_-1_6_1001_7_1_15_1_18_0_8_1|1_0_2_102_3_255_11_255_12_255_13_255_4_-1_6_1009_7_1_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_1002_5_1_7_1_15_1_18_0_8_1|1_135_2_135_3_135_11_255_12_255_13_255_4_-1_6_1005_5_1_7_1_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_1006_5_1_7_1_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_1_5_1_7_1_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_2_5_1_7_0.75_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_3_5_1_7_0.5_15_1_18_0_8_1|1_255_2_255_3_255_11_255_12_255_13_255_4_-1_6_4_5_1_7_0.25_15_1_18_0_8_1|1_255_2_0_3_0_11_255_12_255_13_255_4_-1_6_7_5_1_7_1_15_1_18_0_8_1|1_255_2_0_3_0_11_255_12_255_13_255_4_-1_6_8_5_1_7_0.75_15_1_18_0_8_1|1_255_2_0_3_0_11_255_12_255_13_255_4_-1_6_9_5_1_7_0.5_15_1_18_0_8_1|1_255_2_0_3_0_11_255_12_255_13_255_4_-1_6_10_5_1_7_0.25_15_1_18_0_8_1|1_0_2_0_3_255_11_255_12_255_13_255_4_-1_6_13_5_1_7_1_15_1_18_0_8_1|1_0_2_0_3_255_11_255_12_255_13_255_4_-1_6_14_5_1_7_0.75_15_1_18_0_8_1|1_0_2_0_3_255_11_255_12_255_13_255_4_-1_6_15_5_1_7_0.5_15_1_18_0_8_1|1_0_2_0_3_255_11_255_12_255_13_255_4_-1_6_16_5_1_7_0.25_15_1_18_0_8_1|1_0_2_255_3_0_11_255_12_255_13_255_4_-1_6_19_5_1_7_1_15_1_18_0_8_1|1_0_2_255_3_0_11_255_12_255_13_255_4_-1_6_20_5_1_7_0.75_15_1_18_0_8_1|,kA13,0,kA15,0,kA16,0,kA14,,kA6,13,kA7,0,kA17,0,kA18,0,kS39,1,kA2,0,kA3,0,kA8,0,kA4,0,kA9,0,kA10,0,kA11,0

#

But a majority of these dicts are pretty small, so calling the deserializer on each of these dicts separately seems kinda wasteful.

zenith mica
#

Oh dear, another separator. Maybe you should make 3 formats, one for each level? And then nest them in this specific order and use that as the main format.

opal obsidian
#

The reason I wanted to rewrite it in serde in the first place is because I had multiple loops each dealing with their own separator.

zenith mica
#

Well, serde isn't going to magically handle different separators, but in this case it doesn't seem too bad. You have to keep track of the current and parent separators, and just ensure anything else is a valid integer or base64.
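Since every level is the same "alternating key and value" shape with a different separator, one helper parameterized by the separator can cover all of them. This sketch assumes, from the kS38 sample above, that | separates the nested dicts and _ separates keys and values inside them; the function names are made up.

```rust
/// Iterate over `key<sep>value<sep>key<sep>value...` as (key, value) pairs.
/// A dangling key without a value ends the iteration.
fn pairs<'a>(s: &'a str, sep: char) -> impl Iterator<Item = (&'a str, &'a str)> + 'a {
    let mut parts = s.split(sep);
    std::iter::from_fn(move || Some((parts.next()?, parts.next()?)))
}

/// Parse a nested value such as `1_40_2_125|1_0_2_102|` into one pair list per sub-dict.
fn sub_dicts<'a>(value: &'a str) -> Vec<Vec<(&'a str, &'a str)>> {
    value
        .split('|')
        .filter(|dict| !dict.is_empty()) // skip the empty piece after a trailing `|`
        .map(|dict| pairs(dict, '_').collect())
        .collect()
}
```

The outer level would use pairs(dict, ',') on each ; separated piece, and only values known to be nested get passed on to sub_dicts.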

It seems like the big advantage to using serde here will be in reducing the effort of writing the rust types, which means you'd need to weigh that against the effort of writing the serde format. It's not going to affect the difficulty of parsing at all.

undone eagle
#

You might try using nom, it’s good at parsing simple non-recursive formats like this

#

serde is better suited to very generic formats like JSON

opal obsidian
#

Since they are just 1,2,3,4 and etc.

zenith mica
#

nom and serde aren't exclusive, they do different things. You will need to parse it somehow no matter what. You can do that with nom whether you're making a serde format or not.

undone eagle
#
separated_list0(char(','), separated_pair(is_not(",;"), char(','), is_not(",;")))
opal obsidian
#

My idea originally was to pass a new separator for each different format I want to parse.
Like first level:

let mything: Level1 = Deserializer::from_str("<my str>", b'<separator for level 1>');

Into a vec:

struct Level1 {
    level2s: Vec<Level2>
}

Then for level2 I would have deserialize like:

impl<'de> Deserialize<'de> for Level2 {
    fn deserialize<D>(deserializer: D) -> Result<Level2, D::Error>
    where
        D: Deserializer<'de>,
    {
        let s: &'de [u8] = Deserialize::deserialize(deserializer)?;
        Ok(Deserializer::from_slice(s, b'<separator for level2>')?)
    }
}

And so forth...

zenith mica
#

Put your first Deserializer:: etc in between two `

undone eagle
#
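// Written against the nom 7 function-style API.
use std::collections::HashMap;

use nom::{
    bytes::complete::{is_not, take_while},
    character::complete::char,
    combinator::{map, opt},
    error::VerboseError,
    multi::separated_list1,
    sequence::{separated_pair, terminated},
    IResult,
};

// One `key,value` pair. The key must be non-empty; the value may be empty (e.g. `kA14,,`).
fn kv_pair(i: &str) -> IResult<&str, (&str, &str), VerboseError<&str>> {
    separated_pair(is_not(",;"), char(','), take_while(|c: char| c != ',' && c != ';'))(i)
}

// One dict: comma-separated pairs with an optional trailing comma.
fn dict(i: &str) -> IResult<&str, HashMap<&str, &str>, VerboseError<&str>> {
    map(
        terminated(separated_list1(char(','), kv_pair), opt(char(','))),
        |pairs: Vec<(&str, &str)>| pairs.into_iter().collect::<HashMap<_, _>>(),
    )(i)
}

// The whole string: semicolon-separated dicts with an optional trailing semicolon.
fn dict_many(i: &str) -> IResult<&str, Vec<HashMap<&str, &str>>, VerboseError<&str>> {
    terminated(separated_list1(char(';'), dict), opt(char(';')))(i)
}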
#

this is what it might look like with nom

#

idk

#

do you want a struct or a Vec<HashMap<_, _>> as output?

zenith mica
#

Can't be HashMap since there's duplicate keys

undone eagle
#

ah

opal obsidian
#

The main thing is to deserialize into a struct

opal obsidian
#

At least I haven't seen them

opal obsidian
#

One dict can have a key x

#

The other can also have a key x

#

Maybe I haven't explained it very clearly, sorry

zenith mica
#

Oh okay, that seems pretty normal then.

undone eagle
#

A combination of the two might be good if you want structs, nom is an easy way to write a parser and serde is good at producing structured output

opal obsidian
#

Also this was my previous solution:

for object_string in bytes.split(|byte| *byte == b';') {
        let mut object = Level2::default();
        let mut iterator = object_string.split(|byte| *byte == b',');
        while let Some(property_id) = iterator.next() {
            let property_value = match iterator.next() {
                Some(value) => value,
                None => break,
            };
            match property_id {
                b"1" => object.id = String::from_utf8_lossy(property_value).parse().unwrap(),
                b"2" => {
                    object.x = String::from_utf8_lossy(property_value)
                        .parse::<f32>()
                        .unwrap()
                }
                b"3" => {
                    object.y = String::from_utf8_lossy(property_value)
                        .parse::<f32>()
                        .unwrap()
                }
                b"4" => object.flip_x = u8_to_bool(property_value),
                b"5" => object.flip_y = u8_to_bool(property_value),
                b"6" => object.rot = String::from_utf8_lossy(property_value).parse().unwrap(),
                b"21" => {
                    object.main_color = String::from_utf8_lossy(property_value).parse().unwrap()
                }
                b"22" => {
                    object.second_color = String::from_utf8_lossy(property_value).parse().unwrap()
                }
                _ => {
                    object.other.insert(
                        String::from_utf8_lossy(property_id).to_string(),
                        String::from_utf8_lossy(property_value).to_string(),
                    );
                }
            }
        }
        objects.push(object);
    }
}

With this, I had to make a separate match condition for each variable I wanted to parse.

undone eagle
#

serde seems like a good idea then