#[newbie] list comprehension


cobalt crescent
#

I don't understand why this does not work:

```elixir
dfs_map = Enum.reduce(fileList, %{}, fn %{"split" => split, "url" => url}, acc ->
  case Explorer.DataFrame.from_parquet(url) do
    {:ok, df} -> Map.update(acc, split, [df], fn dfs -> [df | dfs] end)
    {:error, _} ->
      IO.puts("Error processing file: #{url}")
      acc
  end
end)
```

while this works:

```elixir
urls_map = Enum.reduce(fileList, %{}, fn %{"split" => split, "url" => url}, acc ->
  Map.update(acc, split, [url], fn urls -> [url | urls] end)
end)

dfs_map = Map.new(urls_map, fn {key, urls} ->
  dfs = Enum.map(urls, fn url ->
    case DataFrame.from_parquet(url) do
      {:ok, df} -> df
      {:error, reason} ->
        # inspect/1 is used because the error term may not be printable as a string
        IO.puts("Error loading DataFrame from URL #{url}: #{inspect(reason)}")
        nil  # or handle the error as needed
    end
  end)
  {key, dfs}
end)
```

rigid birch
#

When you say it does not work, what does it do? Is there an error, or simply the wrong return value? If it's wrong, what is it and what are you expecting?

hot ledge
#

One difference I see is that the one that doesn't work won't contain URLs that didn't parse, but without more information about your goals, I can't help further.

#

If you care about speed, you should build a list and use Map.new. Calling Map.update a lot will definitely be slower.

hot ledge
#

I think you can improve the urls_map function to:

```elixir
urls_map = Enum.group_by(file_list, & &1["split"], & &1["url"])
```

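
A quick way to sanity-check that suggestion is to run both versions on a few made-up entries (the file names below are placeholders, not real dataset files). Both group the URLs by split, but the reduce version prepends to each list, so its lists come out in reverse input order, while `Enum.group_by/3` keeps input order:

```elixir
# Placeholder entries shaped like the "parquet_files" items in the thread.
file_list = [
  %{"split" => "train", "url" => "train-0.parquet"},
  %{"split" => "validation", "url" => "validation-0.parquet"},
  %{"split" => "train", "url" => "train-1.parquet"}
]

# The original approach: prepend each URL to the list stored under its split.
reduced =
  Enum.reduce(file_list, %{}, fn %{"split" => split, "url" => url}, acc ->
    Map.update(acc, split, [url], fn urls -> [url | urls] end)
  end)

# The suggested approach: key function first, value function second.
grouped = Enum.group_by(file_list, & &1["split"], & &1["url"])

IO.inspect(reduced, label: "reduce")
IO.inspect(grouped, label: "group_by")
```

If the order of files within a split matters (for example when concatenating rows), the `group_by` result is probably the one to keep.
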
cobalt crescent
#

Oh... I restarted the notebook and now it is working 🤔 This is weird. I should have posted the error I was having... it was something about right-hand side matching.

hot ledge
#

I looked at it for a while, and couldn't figure out why it wouldn't work

cobalt crescent
#

So sorry... but that was strange.

Is there something like a progress bar for Elixir enums? Like tqdm in Python?

hot ledge
#

how many elements are you processing? This definitely could be made more efficient.

#

using Map.update will definitely be a lot slower than building a list, and then a map

cobalt crescent
#
 
```elixir
Mix.install([
  {:req, "~> 0.4.8"},
  {:explorer, "~> 0.7.0"},
  {:kino, "~> 0.10.0"},
  {:kino_explorer, "~> 0.1.10"},
  {:progress_bar, "~> 3.0"}
])

alias Explorer.DataFrame
req = Req.new(base_url: "https://datasets-server.huggingface.co")

dataset =
  Req.get!(req, url: "/parquet?dataset=blog_authorship_corpus").body["parquet_files"]
  |> Enum.reduce(%{}, fn %{"split" => split, "url" => url}, acc ->
    case DataFrame.from_parquet(url) do
      # Concatenate each new frame onto the one already stored for this split.
      {:ok, df} -> Map.update(acc, split, df, fn dfs -> DataFrame.concat_rows(dfs, df) end)
      {:error, _} ->
        IO.puts("Error processing file: #{url}")
        acc
    end
  end)
```
#

But I don't know how to add a progress bar:

```elixir
format = [
  bar: "=~",
  blank: ". ",
  left: [IO.ANSI.green(), "[", IO.ANSI.reset()],
  right: [IO.ANSI.red(), "]", IO.ANSI.reset()]
]

Enum.each(1..100, fn i ->
  ProgressBar.render(i, 100, format)
  :timer.sleep(30)
end)
```

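
One way to get tqdm-like behaviour is to wrap the mapping step so the bar is redrawn after each element. The sketch below uses only the standard library so it can run anywhere. `ProgressSketch`, `bar/3` and `map_with_progress/2` are hypothetical names, and the `IO.write/1` call is where `ProgressBar.render(index, total, format)` could presumably go instead. Whether a carriage return redraws cleanly inside a Livebook cell is an assumption that would need checking.

```elixir
defmodule ProgressSketch do
  # Builds a text bar such as "[=====.....] 5/10" (hypothetical helper).
  def bar(done, total, width \\ 20) do
    filled = div(done * width, total)

    "[" <>
      String.duplicate("=", filled) <>
      String.duplicate(".", width - filled) <>
      "] #{done}/#{total}"
  end

  # Like Enum.map/2, but redraws the bar after each processed element.
  def map_with_progress(enumerable, fun) do
    items = Enum.to_list(enumerable)
    total = length(items)

    items
    |> Enum.with_index(1)
    |> Enum.map(fn {item, index} ->
      result = fun.(item)
      # "\r" moves the cursor back to the start of the line in a terminal.
      IO.write("\r" <> bar(index, total))
      result
    end)
  end
end

squares = ProgressSketch.map_with_progress(1..5, fn n -> n * n end)
IO.puts("")
IO.inspect(squares)
```
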
hot ledge
#

how many entries are you processing? Maybe if this goes faster, you won't have to use a progress bar

cobalt crescent
#

In this case, only 2... so yes, I don't need one.

#

But... I will eventually need one, so I will have to learn it. I can ask again when I have several files.

hot ledge
#

there are only two parquet files?

cobalt crescent
#

And... I was wondering if I could make this function generic to all HF datasets.
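
A rough sketch of what a generic version might look like, assuming every dataset answers on the same `/parquet?dataset=...` path used above and returns the same `"parquet_files"` shape. `HFParquetSketch` and its functions are hypothetical names, and the HTTP call is passed in as a function so the example runs without network access:

```elixir
defmodule HFParquetSketch do
  # Builds the request path for a dataset name, URL-encoding it if needed.
  def parquet_path(dataset) do
    "/parquet?" <> URI.encode_query(%{"dataset" => dataset})
  end

  # fetch_body is any function that takes the path and returns the decoded
  # JSON body as a map. Returns a map of split name => list of URLs.
  def urls_by_split(dataset, fetch_body) do
    dataset
    |> parquet_path()
    |> fetch_body.()
    |> Map.get("parquet_files", [])
    |> Enum.group_by(& &1["split"], & &1["url"])
  end
end

# Stand-in for the real request; the URLs are placeholders.
fake_fetch = fn "/parquet?dataset=demo" ->
  %{
    "parquet_files" => [
      %{"split" => "train", "url" => "https://example.invalid/train-0.parquet"},
      %{"split" => "validation", "url" => "https://example.invalid/validation-0.parquet"},
      %{"split" => "train", "url" => "https://example.invalid/train-1.parquet"}
    ]
  }
end

splits = HFParquetSketch.urls_by_split("demo", fake_fetch)
IO.inspect(splits)
```

With Req, the second argument would presumably be something like `fn path -> Req.get!(req, url: path).body end`.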

#

A Parquet file is a compressed, columnar file format for datasets. It goes hand in hand with the Arrow in-memory format... think of it like a zip for tabular data.

#

Hugging Face has hundreds of datasets in Parquet format.

In Python, instead of using Polars (a DataFrame), I would use DuckDB

hot ledge
#

so when i'm looking at this, i see a lot of room to make this faster

cobalt crescent
#

I use DuckDB to take the information I need from the dataset...

hot ledge
#

you're doing a lot of work sequentially that doesn't need to be done sequentially

cobalt crescent
#

In the case of DuckDB, it does not need to read everything... it is lazy.

hot ledge
#

if i read the above correctly, you're downloading things from a set of urls

#

then, for each thing you download, you're applying a transformation to it

cobalt crescent
#

yes, Hugging Face has an API with information about their datasets

hot ledge
#

i believe this part can be parallelized:
```elixir
Explorer.DataFrame.from_parquet(url)
```

#

something tells me that downloads something, and then transforms it

cobalt crescent
#

Does Explorer.DataFrame.from_parquet allow multiple urls?

#

In Duck db I can pass a list of urls

hot ledge
#

i have no idea

#

it also looks like there's an elixir duckdb library

cobalt crescent
#

maybe I could use it. I looked for it, but didn't find examples in Livebook.

hot ledge
#

you don't have an elixir project?

cobalt crescent
#

Well, I am just experimenting in Livebook first. In Python, I am used to developing using Jupyter and nbdev (which takes a notebook and transforms into documentation and lib)

hot ledge
#

elixir isn't python though

#

livebook is good for quick things, but this seems a lot more complicated

cobalt crescent
#

For duckdb to get parquet files, you need extensions. When I looked for the elixir duckdb I didn't find how to add extensions.

hot ledge
cobalt crescent
#

Hmmm, I was looking at the other duckdb package. There are 2!

hot ledge
#

it happens

cobalt crescent
#

I will try this one.

hot ledge
#

i still think there's a lot of optimizations to be had

cobalt crescent
#

Can I download all the files concurrently? That would be faster.

#

(I think duckdbex does not like Livebook: 160 seconds to set up and it hasn't finished.)

#

238 seconds to evaluate!

hot ledge
#

you can definitely download the files concurrently

#
```elixir
list_of_urls
|> Task.async_stream(&Explorer.DataFrame.from_parquet/1)
|> Enum.map(...)
```
#

the Enum.map function will receive the results of the calls to DataFrame.from_parquet, each wrapped by Task.async_stream in an {:ok, result} tuple
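
To make the shape of those results concrete, here is a small standard-library sketch. `fake_load` is a made-up stand-in for `DataFrame.from_parquet/1` that returns `{:ok, _}` or `{:error, _}` without touching the network. `Task.async_stream/3` wraps each return value in its own `{:ok, _}` tuple and, by default, yields results in input order:

```elixir
# Stand-in loader (an assumption for illustration): pretends that only
# names ending in ".parquet" can be loaded.
fake_load = fn url ->
  if String.ends_with?(url, ".parquet") do
    {:ok, "frame for " <> url}
  else
    {:error, :not_parquet}
  end
end

results =
  ["a.parquet", "b.txt", "c.parquet"]
  # Real downloads can take longer than the default 5-second task timeout.
  |> Task.async_stream(fake_load, timeout: :infinity)
  # Unwrap the outer {:ok, _} that async_stream adds around each result.
  |> Enum.map(fn {:ok, result} -> result end)

IO.inspect(results)
```

The inner `{:ok, _}` and `{:error, _}` tuples are still the loader's own return values, so any later step has to match on both layers or unwrap the outer one first, as done here.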

cobalt crescent
#

I am scratching my head here:
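```elixir
train_ds =
  urls_map["train"]
  # Downloads can exceed the default 5-second task timeout.
  |> Task.async_stream(&DataFrame.from_parquet/1, timeout: :infinity)
  # async_stream wraps each from_parquet result in its own {:ok, _} tuple.
  |> Enum.reduce(nil, fn
    {:ok, {:ok, df}}, nil -> df
    {:ok, {:ok, df}}, acc -> DataFrame.concat_rows(acc, df)
    {:ok, {:error, _reason}}, acc -> acc
  end)
```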

#

I am overcomplicating, am I not?

#

you thought Enum.map...

hot ledge
#

this seems fine

#

you want to concatenate them into the dataframe