unexpected behavior (bug?) when using serde `untagged` with an enum to deserialize csv data | Rust Programming Language Community

vale kayak Mar 18, 2024, 7:51 PM

#

mod example {

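    // Assumes the parent module imports csv::{ReaderBuilder, Trim},
    // serde::{Deserialize, Serialize}, and std::fmt::Debug.
    use super::*;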

    #[derive(Clone,Debug,Serialize, Deserialize)]
    pub struct MyStruct {
        a: usize,
        b: String,
        c: i32,
    }

    #[derive(Clone,Debug,Serialize, Deserialize)]
    #[serde(untagged)]
    pub enum MyUntaggedEnum {
        V1 {
            a: usize,
            b: String,
            c: i32,
        }
    }

    #[cfg(test)]
    mod tests {

        use super::*;

        pub const CSV_DATA: &'static str = "\
            a,b,c
            1,cat,-4
            0,dog,4
            1,mouse,19";

        pub fn csv_reader() -> csv::Reader<&'static [u8]> {

            ReaderBuilder::new()
                .trim(Trim::All)
                .has_headers(true)
                .flexible(true)
                .terminator(csv::Terminator::Any(b'\n'))
                .from_reader(CSV_DATA.as_bytes())
        }

        pub fn deserialize_csv<RecordType>(
            rdr: &mut csv::Reader<&'static [u8]>

        ) where RecordType: Debug + serde::de::DeserializeOwned {

            let mut rows = vec![];

            for result in rdr.deserialize::<RecordType>() {
                match result {
                    Ok(row) => {
                        println!("{:?}", row);
                        rows.push(row);
                    },
                    Err(e) => eprintln!("Error: {}", e),
                }
            }

            assert!(!rows.is_empty());
        }

        // this test passes
        #[test]
        fn test_my_struct() {

            let mut rdr = csv_reader();

            deserialize_csv::<MyStruct>(&mut rdr);
        }

        // this test fails
        #[test]
        fn test_my_untagged_enum() {

            let mut rdr = csv_reader();

            deserialize_csv::<MyUntaggedEnum>(&mut rdr);
        }
    }
}

stone silo Mar 18, 2024, 8:17 PM

#

It seems like there's only support for field-level enums, not row-level enums: https://docs.rs/csv/latest/csv/struct.Reader.html#rules

vale kayak Mar 18, 2024, 8:28 PM

#

i'm not sure i understand what you mean

stone silo Mar 18, 2024, 8:32 PM

#

You can't deserialize an entire row of CSV to an enum

vale kayak Mar 18, 2024, 8:33 PM

#

i just saw in the docs you sent:

Finally, simple enums in Rust can be deserialized as well. Namely, enums must either be variants with no arguments or variants with a single argument. Variants with no arguments are deserialized based on which variant name the field matches. Variants with a single argument are deserialized based on which variant can store the data. The latter is only supported when using “untagged” enum deserialization.

#

i understand what they are saying but not why it is this way

#

it seems to me that the expected behavior should be to use the header fields to determine which variant of the enum we want

#

additionally, we also fail when we modify MyUntaggedEnum to be this:

#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(untagged)]
pub enum MyUntaggedEnum {
    V1(MyStruct),
}

#

"Variants with a single argument are deserialized based on which variant can store the data. "

#

we know MyStruct can store the data already and we use untagged enum deserialization, so i think this should work but it doesn't

#

thanks for helping me btw

stone silo Mar 18, 2024, 8:51 PM

#

Those rules only seem to work for individual CSV fields. I don't think it's technically impossible, just unimplemented.

I found a way that works (I think):

#[derive(Clone, Debug, Serialize, Deserialize)]
#[serde(untagged)]
pub enum MyUntaggedEnumInner {
    V1(MyStruct),
}

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct MyUntaggedEnum {
    #[serde(flatten)]
    inner: MyUntaggedEnumInner
}

But if you want to have different representations based on the headers, it would be better to handle the enum outside of serde. You'd change your enum to this:

pub enum MyUntaggedEnum {
    V1(Vec<MyStruct>)
}
Then inspect the headers before iterating over rows. If the headers match the fields of `MyStruct`, you would make `Vec<MyStruct>` and put it in the enum. This is kind of inconvenient because you would need to list the fields manually, but it easily validates that all the rows are the same.

The serde enum is only necessary if you have different kinds of rows with the same headers in a single CSV.
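
A rough sketch of the header check described above, not a drop-in solution. To stay dependency-free it assumes plain comma splitting in place of the `csv` crate, and `V1_HEADERS` and `parse_by_headers` are made-up names for illustration:

```rust
/// Same shape as `MyStruct` from the question, without the serde derives.
#[derive(Debug, PartialEq)]
pub struct MyStruct {
    pub a: usize,
    pub b: String,
    pub c: i32,
}

/// The variant is chosen once per file, so it holds all rows.
#[derive(Debug, PartialEq)]
pub enum MyUntaggedEnum {
    V1(Vec<MyStruct>),
}

/// Hypothetical constant: the field names have to be listed by hand.
const V1_HEADERS: [&str; 3] = ["a", "b", "c"];

/// Hypothetical helper: pick the variant from the header line, then parse
/// every remaining line as that variant.
pub fn parse_by_headers(data: &str) -> Result<MyUntaggedEnum, String> {
    // Trim each line and skip blank ones.
    let mut lines = data.lines().map(str::trim).filter(|l| !l.is_empty());
    let header_line = lines.next().ok_or("empty input")?;
    let headers: Vec<&str> = header_line.split(',').map(str::trim).collect();

    if headers == V1_HEADERS {
        let mut rows = Vec::new();
        for line in lines {
            let cells: Vec<&str> = line.split(',').map(str::trim).collect();
            if cells.len() != 3 {
                return Err(format!("expected 3 fields, got {}", cells.len()));
            }
            rows.push(MyStruct {
                a: cells[0].parse::<usize>().map_err(|e| format!("a: {e}"))?,
                b: cells[1].to_string(),
                c: cells[2].parse::<i32>().map_err(|e| format!("c: {e}"))?,
            });
        }
        Ok(MyUntaggedEnum::V1(rows))
    } else {
        // A second layout would get its own header list and branch here.
        Err(format!("unrecognized headers: {headers:?}"))
    }
}
```

Because the decision is made once from the header line, a file whose headers match no known layout is rejected before any row is parsed.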

#

The main downside of the serde enum is that it tries V1 first on every single row. So if you can tell from the headers alone that every row is V2, V1 will still fail once per row, after parsing, allocating, and dropping all the fields that did parse successfully for V1.
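
To make that cost concrete, here is a small dependency-free model of the "try each variant in order" behavior. It is only a sketch: it does not use serde, the `Row` layouts are cut down to two columns each, amounts are whole numbers, and `get`, `try_v1`, `try_v2`, and `parse_untagged` are made-up names:

```rust
/// Simplified stand-ins for the two layouts (hypothetical, not the real types).
#[derive(Debug, PartialEq)]
pub enum Row {
    V1 { date: String, amount: i64 },
    V2 { posted_date: String, amount: i64 },
}

/// Look up a cell by header name in one (header, cell) record.
fn get<'a>(rec: &[(&str, &'a str)], key: &str) -> Option<&'a str> {
    rec.iter().find(|(k, _)| *k == key).map(|(_, v)| *v)
}

/// Succeeds only if the V1 columns are present and parse.
fn try_v1(rec: &[(&str, &str)]) -> Option<Row> {
    Some(Row::V1 {
        date: get(rec, "Date")?.to_string(),
        amount: get(rec, "Amount")?.parse::<i64>().ok()?,
    })
}

/// Succeeds only if the V2 columns are present and parse.
fn try_v2(rec: &[(&str, &str)]) -> Option<Row> {
    Some(Row::V2 {
        posted_date: get(rec, "Posted Date")?.to_string(),
        amount: get(rec, "Amount")?.parse::<i64>().ok()?,
    })
}

/// Try the variants in declaration order and count every failed attempt.
pub fn parse_untagged(rec: &[(&str, &str)], wasted: &mut usize) -> Option<Row> {
    if let Some(row) = try_v1(rec) {
        return Some(row);
    }
    *wasted += 1; // V1 was attempted and thrown away
    if let Some(row) = try_v2(rec) {
        return Some(row);
    }
    *wasted += 1; // V2 was attempted and thrown away
    None
}
```

In this model every V2-shaped record adds one to `wasted`, while a V1-shaped record adds nothing. Picking the layout once from the headers avoids the repeated failed attempt.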

vale kayak Mar 18, 2024, 9:30 PM

#

thanks!! your solution about using an inner enum with flatten seems to work for my use case -- thank you!!

#

the rest of the information is quite interesting and good to know for future. i appreciate it

#

i am currently doing something like this:

#[derive(Clone,Debug,Serialize, Deserialize)]
pub struct Transaction {

    #[serde(flatten)]
    inner: TransactionInner,
}


#[derive(Clone,Debug,Serialize, Deserialize)]
#[serde(untagged)]
enum TransactionInner {

    /// this V1 is as of roughly 2023
    V1 {
        #[serde(rename = "Date")]
        #[serde(with = "naive_date_format")]
        date: NaiveDate,

        #[serde(rename = "Transaction Type")]
        transaction_type: TransactionType,

        #[serde(rename = "Check/Serial #")]
        check_or_serial_number: Option<usize>,

        #[serde(rename = "Description")]
        description: Option<String>,

        #[serde(rename = "Amount")]
        #[serde(deserialize_with = "parse_amount")]
        amount: Decimal,
    },

    /// V2 is as of 2024
    V2 {
        #[serde(rename = "Posted Date")]
        #[serde(with = "naive_date_format")]
        posted_date: NaiveDate,

        #[serde(rename = "Transaction Date")]
        #[serde(with = "naive_date_format")]
        transaction_date: NaiveDate,

        #[serde(rename = "Transaction Type")]
        transaction_type: TransactionType,

        #[serde(rename = "Check/Serial #")]
        check_or_serial_number: Option<usize>,

        #[serde(rename = "Description")]
        description: Option<String>,

        #[serde(rename = "Amount")]
        #[serde(deserialize_with = "parse_amount")]
        amount: Decimal,
    },
}

#

i have one additional question for this use case:

in the V2 variant there is an optional column in the csv files. If it was not optional, we would specify it like this:

        #[serde(rename = "Daily Posted Balance")]
        #[serde(deserialize_with = "parse_amount")]
        posted_balance: Decimal,

#

do you think there is a good way to include this in the V2 variant without creating a third variant? we could do V2WithPostedBalance, but maybe there is something cleaner

stone silo Mar 18, 2024, 9:57 PM

#

Should be able to just do Option<Decimal>

#

Actually, not sure how this works with deserialize_with, but you could move that into its own struct anyway.

stone silo Mar 18, 2024, 10:30 PM

#

Okay it looks like you can do this

#[serde(
    rename = "Daily Posted Balance",
    deserialize_with = "parse_amount_opt",
    default
)]
posted_balance: Option<Decimal>,

fn parse_amount_opt<'de, D>(de: D) -> Result<Option<Decimal>, D::Error>
where
    D: Deserializer<'de>,
{
    parse_amount(de).map(Some)
}

vale kayak Mar 19, 2024, 4:32 AM

#

wow, thank you so much -- this was brilliant

#

i would have probably never figured that out on my own and ended up with something not quite optimal

#

the `default` attribute seems to be the key: it's what gives us None when the column isn't there
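
For future reference, a tiny dependency-free model of that behavior. It is a sketch only: a `HashMap` stands in for one CSV record, amounts are whole numbers, and this `parse_amount` is a made-up stand-in for the real function, whose body isn't shown in this thread:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the real `parse_amount`.
pub fn parse_amount(raw: &str) -> Result<i64, String> {
    raw.trim()
        .parse::<i64>()
        .map_err(|e| format!("bad amount {raw:?}: {e}"))
}

/// Models `default` plus `parse_amount_opt` for the optional column.
pub fn posted_balance(record: &HashMap<&str, &str>) -> Result<Option<i64>, String> {
    match record.get("Daily Posted Balance") {
        // Column absent: this is the case `default` covers, giving None.
        None => Ok(None),
        // Column present: same shape as `parse_amount(de).map(Some)`.
        Some(raw) => parse_amount(raw).map(Some),
    }
}
```

Note that in this model a column that is present but empty still goes through `parse_amount` and comes back as an error, not as None.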
