#How do I overwrite a dtype that Polars has inferred when reading a .CSV file?


gentle thunder
#

I am attempting to read data from a .CSV file into a Polars dataframe for analysis. I followed this prior Stack Overflow answer to get help on the layout of the CsvReader's chained methods.

fn read_csv_to_dataframe(path: &str) -> PolarsResult<DataFrame> {
    CsvReader::from_path(path)?.has_header(true).finish()
}

When I run the code, it fails at runtime with the following error:

```
Could not parse "TA1305000009" as dtype i64 at column 'end_station_id' (column number 8).
The current offset in the file is 108268613 bytes.

You might want to try:

  • increasing infer_schema_length (e.g. infer_schema_length=10000),
  • specifying correct dtype with the dtypes argument
  • setting ignore_errors to True,
  • adding "TA1305000009" to the null_values list.
```

What I'd like to do:

1) Parse the CSV file
2) Override the inferred i64 dtype of the `end_station_id` column so that it is read as a string.
3) Return the resulting dataframe.

I can get the dataframe to load when I use `.with_ignore_errors(true)`, but that doesn't solve my problem: I specifically need the values in the `end_station_id` and `start_station_id` columns.
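To illustrate why ignoring errors is not enough here, below is a minimal std-only sketch, not Polars code. It assumes that a reader in "ignore errors" mode replaces any value it cannot parse with a null, which is my understanding of the behaviour rather than something confirmed in this thread. The function name `parse_i64_column` is hypothetical, and so is the numeric ID `13022` used when calling it.

```rust
/// Hypothetical helper: parse each raw value as i64.
/// With `ignore_errors` set, values that fail to parse become `None` (a null);
/// otherwise the first failure is returned as an error.
fn parse_i64_column(values: &[&str], ignore_errors: bool) -> Result<Vec<Option<i64>>, String> {
    values
        .iter()
        .map(|v| match v.parse::<i64>() {
            Ok(n) => Ok(Some(n)),
            // Assumed behaviour: the bad value is dropped and replaced by a null.
            Err(_) if ignore_errors => Ok(None),
            Err(_) => Err(format!("Could not parse \"{v}\" as dtype i64")),
        })
        .collect()
}
```

Called on `["13022", "TA1305000009"]` with `ignore_errors` set to `true`, the second ID comes back as `None`, so the column loads but the station ID itself is lost.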

From reading the docs for CsvReader, I've run into a few possibilities:

1- `with_dtypes()` - Overwrite the schema with the dtypes in this given Schema. The given schema may be a subset of the total schema. 

This looks like the best option: I don't want to overwrite the whole schema, I only need to override a specific field.

2- `with_schema()` - Set the CSV file’s schema. This only accepts datatypes that are implemented in the csv parser and expects a complete Schema. It is recommended to use with_dtypes instead. 

I'd prefer to avoid this second option, since it means writing out the whole schema from scratch.

I don't know how to implement either of these options: both appear to require a Schema, and I can't work out how to construct one.

Help for this newbie Rustacean would be appreciated!
small finch
#

Maybe you could try infer_schema: https://docs.rs/polars/latest/polars/prelude/struct.CsvReader.html#method.infer_schema

gentle thunder
#

That's actually not a bad suggestion at all. As long as I give the CSV a quick scan to make sure the range of rows I pass covers every kind of value in the column, Polars should be able to infer the type properly, and that could solve it.

small finch
#

Lemme know if that works 🙂

gentle thunder
#

Worked like a charm actually.

#
use polars::prelude::*;

fn read_csv_from_file(filepath: &str) -> PolarsResult<DataFrame> {
    // Scan up to 10,000 rows when inferring each column's dtype.
    let num_records: Option<usize> = Some(10000);
    CsvReader::from_path(filepath)?
        .has_header(true)
        .infer_schema(num_records)
        .finish()
}

fn main() {
    let path: String = String::from("/example_path/example.csv");
    let df: PolarsResult<DataFrame> = read_csv_from_file(&path);
    println!("{:?}", df);
}
#

Adding infer_schema let Polars scan more rows of the CSV file when inferring the column types. It then correctly inferred a string type for that column, because some station IDs have a TA prefix before the digits.
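The effect of the row limit can be sketched without Polars. This is a simplified, std-only model of prefix-based type inference, not the library's actual algorithm. The `InferredType` enum and `infer_column_type` function are hypothetical names, and the numeric IDs used when calling it are made up.

```rust
/// Simplified stand-in for a column dtype; this is not the Polars `DataType` enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InferredType {
    Int64,
    Utf8,
}

/// Infer a column's type from at most `max_rows` leading values.
/// `None` means "scan every value".
fn infer_column_type(values: &[&str], max_rows: Option<usize>) -> InferredType {
    let limit = max_rows.unwrap_or(values.len()).min(values.len());
    // The column stays Int64 only if every sampled value parses as an integer.
    if values[..limit].iter().all(|v| v.parse::<i64>().is_ok()) {
        InferredType::Int64
    } else {
        InferredType::Utf8
    }
}
```

For a column holding `"13022"`, `"13045"`, `"TA1305000009"`, sampling only the first two rows yields `Int64`, so the later `TA...` value fails to parse. Sampling all three yields `Utf8`. The catch is that a larger limit only helps if a non-numeric ID actually falls inside the sampled range.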

#

Thank you!

gentle thunder
#

This may prove useful to someone else, so I'll leave it here. I actually figured out how to overwrite the schema directly, so I solved my own initial problem.

use polars::prelude::*;
use std::sync::Arc;
use std::{env, error::Error};

pub fn configure_the_environment() {
    env::set_var("POLARS_FMT_TABLE_ROUNDED_CORNERS", "1"); // apply rounded corners to UTF8-styled tables.
    env::set_var("POLARS_FMT_MAX_COLS", "15"); // maximum number of columns shown when formatting DataFrames.
    env::set_var("POLARS_FMT_MAX_ROWS", "10");   // maximum number of rows shown when formatting DataFrames.
    env::set_var("POLARS_FMT_STR_LEN", "50");    // maximum number of characters printed per string value.
}

fn read_csv_from_file(filepath: &str) -> PolarsResult<DataFrame> {
    // Only the columns listed here are overridden; every other column keeps its inferred dtype.
    let my_schema = Schema::from_iter(vec![Field::new("end_station_id", DataType::Utf8)]);

    println!("{:?}", my_schema);

    CsvReader::from_path(filepath)?
        .has_header(true)
        // with_dtypes() accepts a subset of the full schema, so the override can be surgical.
        .with_dtypes(Some(Arc::new(my_schema)))
        .finish()
}

fn main() -> Result<(), Box<dyn Error>> {
    configure_the_environment(); // This sets a few of my preferred configs for Polars, can be changed on a user basis.

    let path: String = String::from("/example_path/example.csv");
    let df: DataFrame = read_csv_from_file(&path)?;

    println!("{:?}", df);
    Ok(())
}
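
For anyone wondering what "a subset of the total schema" means in practice, here is a rough std-only sketch of the idea: take the inferred schema and replace only the entries named in the overrides. It models schemas as plain (column name, dtype name) pairs and is not how Polars implements `with_dtypes()` internally. The function name `apply_overrides` and the dtype labels `"i64"` and `"str"` are illustrative assumptions.

```rust
/// Hypothetical helper: apply a partial set of dtype overrides to an inferred schema.
/// Schemas are modeled as (column name, dtype name) pairs for illustration only.
fn apply_overrides(
    inferred: &[(&str, &str)],
    overrides: &[(&str, &str)],
) -> Vec<(String, String)> {
    inferred
        .iter()
        .map(|&(name, dtype)| {
            // Use the override when the column is listed; otherwise keep the inferred dtype.
            let chosen = overrides
                .iter()
                .find(|&&(col, _)| col == name)
                .map_or(dtype, |&(_, forced)| forced);
            (name.to_string(), chosen.to_string())
        })
        .collect()
}
```

With an inferred schema of `start_station_id: i64` and `end_station_id: i64` and a single override for `end_station_id`, only that column changes. Overriding `start_station_id` as well would mean adding a second entry to the overrides, just as a second `Field` would be added to the schema passed to `with_dtypes()`.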