#data-science-and-ml


distant horizon
#

Do it

misty flint
shrewd saddle
#

Is there any reason that tanh could be causing some overfitting compared to ReLU in a CNN? I am trying out a CNN model and the training accuracy for using tanh is always more than ReLU, even though the test accuracy is more or less same

iron rampart
#

Does someone know what kind of data this is from keras? keras.datasets.fashion_mnist

tidal bough
celest vine
#

Hi

#

Is anyone here?

vagrant trench
#

yes

celest vine
#

I needed help with something

#

I have a dataset that contains followers data of a Twitter account.

#

The data contains username and profile pic URL.

#

But I want the actual profile pic in JPEG format

#

How to get that

#

I have around 300k users data.

#

That means 300k profile pic URLs

serene scaffold
#

@celest vine if the Twitter API gives you the URL for the profile picture, you can probably use requests to download the jpeg. but make sure you're not attempting to download the profile picture for the same person more than once, so you don't add needless load to their servers
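A minimal sketch of that suggestion, assuming the profile-picture URLs are already in a list. It uses the standard library's `urllib.request` so it runs without extra installs (`requests.get(url).content` would do the same job). The function names, the hash-based file naming, and the `fetch`/`delay` parameters are illustrative choices, not anything the Twitter API prescribes.

```python
import hashlib
import time
import urllib.request
from pathlib import Path


def fetch_bytes(url: str) -> bytes:
    """Download one URL and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()


def download_unique(urls, out_dir, fetch=fetch_bytes, delay=0.0):
    """Download each distinct URL once; return {url: saved path}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for url in dict.fromkeys(urls):  # drops duplicates, keeps order
        # hypothetical naming scheme: hash of the URL, so names never clash
        target = out_dir / (hashlib.sha1(url.encode()).hexdigest() + ".jpg")
        if not target.exists():  # lets an interrupted run resume
            target.write_bytes(fetch(url))
            time.sleep(delay)  # crude politeness delay between requests
        saved[url] = target
    return saved
```

Running it on the first few hundred URLs and timing that run gives a rough figure to extrapolate to the full 300k, and a non-zero `delay` is one simple way to stay clear of rate limits.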

celest vine
serene scaffold
#

it's going to depend on the speed of your network and of their network, and everything in between.

#

and possibly also rate limits.

tidal bough
#

You can experiment by downloading the few hundred first ones (using something like aiohttp, ideally) and seeing how long that takes. Extrapolate on the full 300k and decide if it's worth it.

#

ratelimits might be the biggest problem though

#

I wonder if it's possible to, without downloading a file, request its hash or something like that, to avoid redownloading equal files.

serene scaffold
#

I had a similar thought. it's weird that the twitter API exposes the pfp URL but doesn't have an official way to download it. the idea of downloading them all "manually" seems a bit questionable from a TOS standpoint

celest vine
#

I already have the pfp URLs though.

misty flint
#

api rate limiting is def a big hurdle

serene scaffold
#

@celest vine what are you going to do with these images once you have them?

#

twitter PFPs could be of almost anything, so I'm not sure what one would do with a big dump of them

brazen totem
#

is it good to use SMOTE on a pretty even dataset

vagrant trench
#

guys please who work with kmeans ?

iron rampart
misty flint
#

hmm

#

need to try some model stuff on aws

#

guess i should just expect unexpected cloud costs

glossy mist
#

Hey guys, sorry, I am new to Python. I recently wanted to read in this table, which is generated by Adobe Premiere Pro. I use pandas to read it, but it gives a very weird output. Do you know why this is the case?

serene scaffold
#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

glossy mist
glossy mist
sullen hazel
#

Guys, any way out there to use decision tree with n-bit values?

burnt iron
#

Hello

#

I am facing a problem

#

actually I'm struggling to find a scalable solution to a problem

#

So here it is:

#

I want to merge two csv-files with soccer data. They hold different data of the same and different games (partial overlap). Normally I would do a merge with df.merge, but the problem is, that the nomenclature differs for some teams in the two Datasets. E.g. "Atletic Bilbao" is called "Club Atletic" in the second set.

#

So the main question is: How could I automatically analyse the differences in the two datasets naming?

#

so dataset 1 looks something like this:

#
Atletic Bilbao   Leicester   2022-05-20 22:00:00 0.2812 
#

and dataset two has the same structure but atletic bilbao is called club atletic:

#
hometeam            awayteam    date  
Club Atletic   Leicester   2022-05-20 22:00:00 0.2812 
#

there are a few more columns that I want to do analysis on but basically this is what the merge is going to operate on

frigid elk
#

is it not an option to fix the data in one of them? then concat the dataframes together?

burnt iron
#

I mean yes that could work

#

but what if I had more than 2 dataframes

frigid elk
#

you could create a table of master_id mappings to hometeam names, .. then join your dataframes together using master_id.

burnt iron
#

hmm yeah I get what you mean but with .concat

#

I don't have the freedom to choose what algorithm I use for comparing the strings

#

Or dont I?

frigid elk
#

no algorithm needed, just replace all team names (hometeam/awayteam) with the master_id (master_team_name).

#

you could join all dataframes, then group by date and pull any rows that have more than one record to validate your results. ... that would give you a list of team names you'll need to fix
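One way to get that list of names to fix without reading every row, sketched with the standard library only. It compares the sets of names in two datasets and uses `difflib.get_close_matches` to propose a likely counterpart for each leftover; `suggest_name_matches` and the sample names are made up for illustration, and anything that comes back as `None` still needs a manual decision.

```python
import difflib


def suggest_name_matches(names_a, names_b, cutoff=0.6):
    """For each name only in names_b, suggest the closest name only in names_a."""
    only_a = sorted(set(names_a) - set(names_b))
    only_b = sorted(set(names_b) - set(names_a))
    suggestions = {}
    for name in only_b:
        close = difflib.get_close_matches(name, only_a, n=1, cutoff=cutoff)
        suggestions[name] = close[0] if close else None  # None = review by hand
    return suggestions


first = ["Atletic Bilbao", "Leicester", "Arsenal"]
second = ["Club Atletic", "Leicester City", "Arsenal"]
print(suggest_name_matches(first, second))
```

String similarity only catches near-identical spellings. A rename that shares little text with the original may fall below the cutoff, so the suggestions are a starting point for the mapping table rather than a replacement for it.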

burnt iron
#

ok thanks I'll look into that

odd meteor
burnt iron
#

Yeah that sounds good too but again, what if I have more than 2 datasets?

#

I mean the problem's complexity grows exponentially then because for each dataframe I have to have mappings to the other data frames

odd meteor
burnt iron
#

Is that what you mean?

odd meteor
#

First thing first. How many data frames do you have? 3?

burnt iron
#

Well

#

Sometimes

#

3

#

Sometimes more

#

But ideally I would like to make is so that the number of dfs doesn't matter

serene scaffold
#

without having read the context, you usually want the number of dataframes in your code to be constant. if there's a step where the number of dataframes is variable, but they all have the same schema, you should be using a multiindex.
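A small sketch of what that can look like, assuming every source has the same columns. `pd.concat` given a dict of DataFrames puts the dict keys into an outer index level, so the number of sources stops mattering to the rest of the code. The source names, columns, and values below are invented for the example.

```python
import pandas as pd

frames = {
    "source_a": pd.DataFrame({"hometeam": ["Arsenal"], "odds": [0.28]}),
    "source_b": pd.DataFrame({"hometeam": ["Leicester"], "odds": [0.31]}),
    "source_c": pd.DataFrame({"hometeam": ["Arsenal"], "odds": [0.27]}),
}

# One DataFrame; the outer index level records which source a row came from.
combined = pd.concat(frames, names=["source", "row"])

# Any per-source operation becomes a groupby on that level.
mean_odds = combined.groupby(level="source")["odds"].mean()

# A single source can still be pulled back out by its key.
only_b = combined.loc["source_b"]
```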

odd meteor
# burnt iron Is that what you mean?

Let's presume you have 3 data frames.

  1. df1 ==> the team names in team column are correctly spelt no errors
  2. df2 and df3 ==> not so good

All you need to do is this

combined_df = pd.concat([df2, df3], axis=0)
combined_df['team_name'].unique()

Use the result to create a CSV file with two columns, i.e. the faulty team name and the correct team name (you'd have to do this part manually in Excel or somewhere).

team_names_fixed = {}
with open('team_names.csv', encoding='utf-8') as f:
    for line in f:
        # strip the trailing newline so it doesn't end up in the good name
        bad_name, good_name = line.strip().split(',')
        team_names_fixed[bad_name] = good_name

# replace() leaves names that aren't in the mapping untouched (map() would turn them into NaN)
combined_df['team_name'] = combined_df['team_name'].replace(team_names_fixed)

This should be able to fix the problem. If there's a better approach, feel free to try it as well.
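To keep this from being hardcoded per DataFrame, the same idea can be wrapped in one function and applied in a loop. This is a sketch under assumptions: `CANONICAL`, `normalise_teams`, and the `odds`/`xg` columns are hypothetical, and the column names `hometeam`, `awayteam`, and `date` follow the sample rows posted earlier.

```python
import pandas as pd

# Hypothetical lookup table: every known spelling -> one canonical name.
CANONICAL = {
    "Club Atletic": "Atletic Bilbao",
    "Leicester City": "Leicester",
}


def normalise_teams(df, columns=("hometeam", "awayteam")):
    """Return a copy with team columns rewritten to canonical names."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].replace(CANONICAL)
    return out


df1 = pd.DataFrame({"hometeam": ["Atletic Bilbao"], "awayteam": ["Leicester"],
                    "date": ["2022-05-20 22:00:00"], "odds": [0.2812]})
df2 = pd.DataFrame({"hometeam": ["Club Atletic"], "awayteam": ["Leicester"],
                    "date": ["2022-05-20 22:00:00"], "xg": [1.4]})

frames = [normalise_teams(df) for df in (df1, df2)]  # works for any number

merged = frames[0]
for other in frames[1:]:
    merged = merged.merge(other, on=["hometeam", "awayteam", "date"], how="outer")
```

Because every dataset is mapped to the same canonical names, the mapping grows with the number of distinct spellings, not with the number of DataFrame pairs.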

burnt iron
#

Which I have done before but now scalability becomes a problem because I would have to hardcode this for every df

#

Or I could use this multiindexing thing which I haven't heard about before so Ill definitely look into that as well

ashen umbra
#

hi I am not sure if this is the right channel to ask a ques abt git hub

#

but I have couple of pickle ML models that I want to load on my colab

#

I am getting an error saying no such file or directory

#

does anyone know how to fix that or even load an ML model pickle on colab directly from github?

urban prism
#

I don't think this is the right place but what's the github of the models?

ashen umbra
#

I can dm u the link if u would like

serene scaffold
#

@ashen umbra if the models are pickle files in a git repository on github, then you need to !git clone them into your colab environment.

ashen umbra
#

because I already cloned it in my colab env

#

but not sure how to load it

serene scaffold
#

but if the file is already there, and it's what it's supposed to be, then you need to know what library can open and use that pickle.

misty flint
#

something akin to torch.load() or similar?
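If the files really are plain pickles, the standard `pickle` module can open them. The sketch below assumes that, and adds a clearer error for the "no such file or directory" case, which is usually a path that is relative to a different working directory than expected. `load_pickled_model` is an illustrative helper, not part of any library.

```python
import pickle
from pathlib import Path


def load_pickled_model(path):
    """Load a pickle, with a clearer error if the path is wrong."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path.resolve()} not found; current directory is {Path.cwd()}"
        )
    with path.open("rb") as f:  # pickles must be opened in binary mode
        return pickle.load(f)  # only unpickle files you trust
```

In Colab a cloned repository typically ends up in a folder named after the repo under the current directory, so the path has to include that folder. Models saved by a specific library (for example with `torch.save`) generally need that library's own loader and the same library installed.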

urban prism
ashen umbra
ashen umbra
cunning parrot
#

uhm... anyone got an idea how i can stop my grafana from connecting two points if there is no value inbetween, im getting CO2 values but the device crashed until like 20mins ago when i started it, so there was no data, but grafana still connected it, how do i stop that?

it doesn't receive null values, it receives nothing

(not sure if this question is for this channel, if its not just say it to me)

celest vine
#

Hi

#

How can I know if two images are similar or not?

weary cloud
#

Hi guys, I am building an artificial intelligence. Does anyone know if there is a speech standard I can embed in my project? Otherwise I would have to write every sentence myself.

rigid summit
#

Hello all
I'm having issues CONCATENATING the output from my LSTM layer and one hot encoding values, that will finally be passed to a dense layer.

Can anyone help ??

bold timber
#

Hi, I have a question about text data: What is get_feature_names()? Why does it return more words than just the stopwords?

tawny vine
#

sure

young granite
#

someone with a bit of plotly 3dsurface knowledge in here to help me in #help-bread

glacial sparrow
#

anyone know any website where a shapes file for Taipei is available for download?

loud cove
#

Wouldn't this be 1.8/3 = 0.6?

hidden frigate
#

I'm struggling a bit with handling wheelEvents in pyqtgraph. Does anyone with pyqtgraph/qt experience have a moment to provide a bit of guidance?

steep oyster
#

Can someone clearly explain to me (a 7th-grade boy) how this works exactly? I never learned about circles, and I'm unsure how can sine need a list instead of opposite/hypotenuse

#

Ping me, thanks.

#

Also i'm sorry if asking both in a help channel AND here is against the rules, I'll delete this one if it's against.

rain sand
#

is there anyone who can help me out with a machine learning project

steep oyster
#

@rain sand don't ask to ask

rain sand
#

??

green wasp
#

So I have a quickie question. I managed to get live flight data from flightradar24 and it only returns 1500 elements each call. I was wondering if it would be smarter to keep it open as a stream, since it seems to support it, or make a 50ms long request every 5 seconds

#

My objective is to gather data and make some dashboards using django and some other libraries(not sure which but I’ll find some) and study stuff like. How many flights from airport, how many are intercontinental and how many are extra continental, how many are interstate and how many extrastate and so on

#

I was debating using elasticsearch and kibana for storage and visualization but I decided to opt for mongodb and a custom django site because it’s a teaching experience

#

So my main question is, stream the data or make periodic requests a few ms long? Because 1500 json entries are not enough and I’m not sure each call returns different entries

rain sand
#

why i cannot ask

#

here

#

???

#

is there any reason behind it

#

???

wicked pike
#

Hi guys i have a question on pandas

sand cliff
#

Hey guys, attempting to extract a google sheet content into a dataframe, feeding it into an URL:

url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet=Today's Records"
df = pd.read_csv(url)

As you can see I have a control character in the Today's Records (') and it's throwing an error as URL cannot contain control characters, unfortunately, the sheet is named thus and I cannot change it. Anyone know if it can be encoded or a workaround?
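The sheet name can be percent-encoded before it goes into the URL. A short sketch using `urllib.parse.quote` from the standard library; the error is most likely triggered by the space rather than the apostrophe, but `quote` encodes both. `SHEET_ID` is a placeholder.

```python
from urllib.parse import quote

sheet_id = "SHEET_ID"  # placeholder, not a real sheet
sheet_name = "Today's Records"

url = (
    f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq"
    f"?tqx=out:csv&sheet={quote(sheet_name)}"
)
print(url)  # the sheet parameter becomes Today%27s%20Records
```

The resulting `url` can then be passed to `pd.read_csv(url)` as before.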

shadow halo
#

Hello people, I'm working to get better at working with different datasets, and I got one with 60 samples of olive oils coming from 4 regions, each with 570 features. I want to make an MLP that can classify them; the best I could get is an accuracy of 0.87 with 2 hidden layers of 100 neurons. I want to do better than that, and changing the topology doesn't get me any further. So I'm kind of lost about how I should process my data, given that I already passed it through a StandardScaler before training. So I'm looking for how to reduce the number of features to maybe get better results, knowing that the values of said features are spectrographic values.

wicked pike
#

how do i filter latest timestamp hourly in this df?

serene scaffold
wicked pike
#

say if i want to keep the df?

#

does drop duplicate works?

serene scaffold
still wind
#

Hey guys I am running tensorflow with gpt2, I am trying to run the sequence generator file and I followed all the other steps. This mf been running for a day, how long does it take to finish and what do I do after that lol

odd meteor
serene scaffold
#

the question on everyone's mind, apparently

odd meteor
raw mortar
misty flint
#

i think the default for the "live dashboards" is 30s

steady basalt
#

Yooo got a m1 mac arriving tomorrow

misty flint
#

for streaming, i believe you can use kafka or something similar

steady basalt
#

Finally goin to use a non potato

misty flint
#

nice

steady basalt
#

Does the gpu count as a gpu?

#

How does it fare with TF

misty flint
#

ive heard their gpu is actually one of the better ones for ML (compared to older stuff)

steady basalt
#

I might run some benchmarks if anyone’s interested

#

See how it stacks up v my old 2015 i5

misty flint
#

but this is info from podcasts so dont know how accurate it is

steady basalt
#

I believe I need to use Metal

misty flint
#

please ping me if you do so

steady basalt
#

It won’t be anything thorough, just a simple dataset and maybe a simple Cnn

#

And just time them

misty flint
#

results are results

#

still would be interesting to see

steady basalt
#

I’m estimating it’s going to triple training speed

misty flint
#

lol dont get your hopes up before it arrives

steady basalt
#

But probably not beat co lab by much

#

Co lab gpus aren’t amazing tho

misty flint
#

yeah

#

most likely

steady basalt
#

I’ll try all 3

#

It may depend on other stuff tho

#

Such as how I run it on arm

cunning parrot
steady basalt
#

How do people access more powerful gpu on co lab

#

Surprising

#

However

#

Let's see how Apple's new M1 Pro and M1 Max deal with various machine learning workloads.

Blog post with results - https://www.mrdbourke.com/m1-pro-m1-max-machine-learning-speed-test-comparison
Code on GitHub - https://github.com/mrdbourke/m1-machine-learning-test
Setup your M1 Mac for machine learning video - https://youtu.be/_1CaUOHhI6U

#

Tbh, google co lab is OP for small projects

#

I would just stick with a cheap laptop and use that if my current one didn’t get super bad

#

And I like shiny new things

#

Plus u don’t need internet

quartz raptor
#

i have a 1billion line sorted csv file and i would like to find a certain entry with binary search, can pandas do this?

#

i can implement the binary search myself, by jumping to the approx middle of the file using file.seek() and then find the next line but is there nothing better?

steady basalt
#

Can’t u just do it the old fashioned way using min max list and divide by two ?

ripe garnet
steady basalt
#

What?

quartz raptor
steady basalt
#

@quartz raptor do u know binary search alg

quartz raptor
#

yes

steady basalt
#

Like the one to solve that leetcode question

#

Is it possible to turn a Normal list into nodes?

quartz raptor
#

the file is about 100gb

steady basalt
#

In fact u don’t even need nodes surely

quartz raptor
#

i cant load that into memory

steady basalt
#

Where’s this data from

#

U gona need some serious power

quartz raptor
#

binance all trades since 2017

steady basalt
#

What are u looking for

quartz raptor
#

no i can do it in few ms if i do it myself

#

but i want it even faster

#

im just doing some datascience on it

steady basalt
#

Well what are u looking for in the search

quartz raptor
#

yes so the data is 1billion rows of trades, sorted by timestamp

#

and i would like to get a dataframe of all trades between two timestamps

steady basalt
#

What cpu u using

#

And ram

quartz raptor
#

its not a problem

#

i have to just not open the file all at once

steady basalt
#

Just do normal pandas selection then

#

Try 20gb at a time

#

That’s manageable

quartz raptor
#

no that opens the whole file

steady basalt
#

No load it in

#

In chunks

#

Load the first 20 gn

quartz raptor
#

yes thats still slow

steady basalt
#

Gb

quartz raptor
#

it will still have to load everything eventually

steady basalt
#

When u specify read csv

#

There’s a way I think to just take the first x rows

quartz raptor
#

yes i know

#

but

#

that still reads all lines one by one just doesnt save them in memory

#

that still takes like a minute

steady basalt
#

I’m sorry I don’t know a faster method

quartz raptor
#

i think i will do binary search with file.seek() and then pass the filestream to pandas

#

was just wondering if there exists a library that does that in c
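For reference, the seek-based binary search can be written fairly compactly in pure Python. This is a sketch, not a tuned implementation: it assumes the file has no header row and no blank lines, that rows are sorted by an integer timestamp, and that the timestamp sits in column `col` (0 here; adjust for the real file). The function names are illustrative.

```python
import os


def line_key(line: bytes, col: int) -> int:
    """Parse the integer timestamp in column `col` of a CSV line."""
    return int(line.split(b",")[col])


def seek_first_at_least(f, target: int, col: int = 0) -> int:
    """Position f at the first line whose timestamp is >= target."""
    f.seek(0, os.SEEK_END)
    lo, hi = 0, f.tell()
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid)
        if mid > 0:
            f.readline()  # discard the (possibly partial) line containing mid
        line = f.readline()
        if not line or line_key(line, col) >= target:
            hi = mid
        else:
            lo = mid + 1
    f.seek(lo)
    if lo > 0:
        f.readline()
    return f.tell()


def read_interval(path, start: int, end: int, col: int = 0):
    """Return the raw lines with start <= timestamp <= end."""
    rows = []
    with open(path, "rb") as f:  # binary mode: byte offsets are well defined
        seek_first_at_least(f, start, col)
        for line in f:
            if line_key(line, col) > end:
                break
            rows.append(line)
    return rows
```

The search touches only about log2(file size) positions, so the file size barely matters. The returned lines can be handed to pandas with `pd.read_csv(io.BytesIO(b"".join(rows)), header=None)`.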

steady basalt
#

The biggest dataset I’ve ever worked on is 20gb

#

I’ve never needed to deal with this much

#

I’m sure there’s a solution

#

If u only need to find this once just do it the normal way

quartz raptor
#

no i need to find it every batch

#

so i would like it to be less than 100ms

steady basalt
#

The search or loading?

quartz raptor
#

give me like 10mins ill show the code

#

well both, but once the search is done i will only load like 10k lines

steady basalt
#

Load the csv in first entirely and then do one single search

quartz raptor
#

which is fast enough

steady basalt
#

One batch is 10k?

quartz raptor
#

yes

#

well idk yes

steady basalt
#

I don’t understand, if it’s time ordered you’d know roughly how much lines u need

#

So load until the time stamps out of desired range

#

if it’s only a small period of time you can cut down excess and ur left to work on your ?2000 rows of data

tacit basin
#

Let dask to take care of batching and stuff 😜

steady basalt
#

I think keeping things under X milliseconds and using binary searches is getting into SWE territory

#

Is the issue loading the data in or optimising search once it’s loaded

quartz raptor
#

from what i understand dask is to have big data in memory

#

i dont need alot in memory actually

steady basalt
#

You’d have found the data u want to work on by now if u just loaded ur csv in and ran a search

quartz raptor
#

no you dont get it

#

i will use all the data

#

just not all at once

#

so i would have to do that over and over and over again

steady basalt
#

Do it at once

#

Ur specs are elite

#

It will take 10 min

#

And maybe doing it over and over would be faster anyway

#

Try

#

Where did u get this data from is it public

#

They actually posted their entire transaction times and not allow you to download instead yearly ones

frigid elk
#

can you convert the csv to parquet and load using predicate pushdown?

#

you could take it a step further and partition the parquet by date ranges

wicked pike
#

how would i filter to get the latest timestamp in this diagram?

frigid elk
#

you could take it a step further and utilize spark to do the heavy lifting

steady basalt
wicked pike
#

i would say filter to get a output like this

steady basalt
#

In general or just for this specific case

#

You want 4 and 9 only

wicked pike
#

for this specific case

steady basalt
#

Then just Index it

#

Much easier than having to filter for biggest time

#

It seems u already done so

wicked pike
#

okay my bad, i forget to mention the rows is 74520

#

so i guess in general

#

i just print the head for that case

steady basalt
#

You want to return the two highest values ?

#

In entire data

#

Sort in order by time stamp and take the top two

wicked pike
#

the highest time stamp for each hour

steady basalt
#

I’d create new columns for each hour

wicked pike
#

and the measurement timestamp is already sorted by ascending

steady basalt
#

Now have 24 hours

#

Columns

#

Easy to find top from each

#

Conditional index hour 01:00 time stamps to 24:00

#

Or I guess u can just do a search that way

#

Using order and take the top row

#

No need to make columns

wicked pike
steady basalt
#

U mean without making new columns ?

wicked pike
#

yes without making new columns

steady basalt
#

Locate indexes where hour = 15:00 order by time stamp iloc 0 th row?

#

I’d do that 24 times cause I forgot how to return a data frame for each one in one command

#

U might be better off using sql?

wicked pike
#

im trying to learn and practice my pandas foundations as i suck at it lool

#

but i appreciate the help

steady basalt
#

Google best for syntax

#

I think u can double sort
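In pandas this is usually a one-step groupby rather than 24 separate selections. A sketch, assuming the column is called `timestamp` and already has a datetime dtype (the original screenshot isn't available, so the column names and sample values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-05-13 10:05", "2022-05-13 10:40",
        "2022-05-13 11:10", "2022-05-13 11:55",
    ]),
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Bucket each row by the hour it falls in, then keep the row holding the
# largest timestamp in every bucket. The original DataFrame is left untouched.
hour = df["timestamp"].dt.floor("h")
latest_per_hour = df.loc[df.groupby(hour)["timestamp"].idxmax()]
```

Since the data is already sorted ascending, `df.groupby(hour).tail(1)` should give the same rows.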

pale vortex
#

How do I replace all columns in a dataframe with a single list? I have a dataframe of shape (5,8) and a list of shape (5,) which I called a. Doing df[:] = a raises a could not broadcast shape (5,) to (5,8) error.

tidal bough
#

Hmm, maybe do df[:] = a.reshape(-1,1)

#

worst case scenario, you'd need to manually broadcast a to the right shape

serene scaffold
pale vortex
serene scaffold
#

also, why do you have an empty dataframe in the first place?

pale vortex
#

I have a list of numbers five numbers [1,2,3,4,5]. I wanted to get a DF that has each column as that list of numbers, so in this case eight columns which look like:

1
2
3
4
5
#

my strategy was to create an empty df of size (5,8), and fill them all in one go

serene scaffold
#
In [1]: arr = np.arange(1, 6)

In [2]: arr
Out[2]: array([1, 2, 3, 4, 5])

In [6]: arr.reshape(-1, 1).repeat(8, axis=1)
Out[6]:
array([[1, 1, 1, 1, 1, 1, 1, 1],
       [2, 2, 2, 2, 2, 2, 2, 2],
       [3, 3, 3, 3, 3, 3, 3, 3],
       [4, 4, 4, 4, 4, 4, 4, 4],
       [5, 5, 5, 5, 5, 5, 5, 5]])
In [9]: pd.DataFrame(arr.reshape(-1, 1).repeat(8, axis=1))
Out[9]:
   0  1  2  3  4  5  6  7
0  1  1  1  1  1  1  1  1
1  2  2  2  2  2  2  2  2
2  3  3  3  3  3  3  3  3
3  4  4  4  4  4  4  4  4
4  5  5  5  5  5  5  5  5
#

or, as a stand-alone expression, pd.DataFrame(np.arange(1, 6).reshape(-1, 1).repeat(8, axis=1))

#

with numpy/pandas, you don't want to allocate empty space for data and fill it later. the library supports doing these sorts of things in one go.

quartz raptor
# steady basalt Do it at once

so i did this:

```python
from timeit import timeit
from pathlib import Path
import os

from typing import IO
import pandas as pd

FILE_NAME = Path(__file__).parent / '../data/trades.csv'
MAX_LINE_LENGTH = 256


def get_timestamp_from_line(line: str, col: int = 4) -> int:
    r = line.split(',')
    r = r[col]
    return int(r)


def get_line(file: IO, cursor: int) -> str:
    '''return the line in a file containing the cursor'''
    file.seek(cursor)

    # walk backwards until the previous newline, then read the full line
    char = ''
    while char != '\n':
        cursor -= 1
        file.seek(cursor)
        char = file.read(1)
    return file.readline()


def binary_search_interval(from_timestamp: int, to_timestamp: int) -> pd.DataFrame:

    with open(FILE_NAME, 'r') as file:
        file.seek(0, os.SEEK_END)
        cursor = jump = file.tell() // 2  # start at half the file size in bytes

        # get cursor to before the line containing from_timestamp
        while jump > 1:
            # get timestamp at current step
            line = get_line(file, cursor)
            timestamp = get_timestamp_from_line(line)

            jump //= 2
            if timestamp < from_timestamp:
                cursor += jump
            elif timestamp > from_timestamp:
                cursor -= jump
            else:
                break
        start = file.tell()
        num = 0
        # count the lines up to to_timestamp
        while True:
            line = file.readline()
            if get_timestamp_from_line(line) > to_timestamp:
                break
            num += 1
        file.seek(start)
        return pd.read_csv(file, nrows=num - 1)


def test_pd():
    df = pd.read_csv(FILE_NAME, nrows=1000, skiprows=10000000)
    print(df)
    ...


def test_bs():
    df = binary_search_interval(1620255222803, 1620255249925)
    print(df)


def main() -> int:
    t1 = timeit(test_pd, number=1)
    t2 = timeit(test_bs, number=1)
    print(t1)
    print(t2)
    return 0  # exit code 0 = success


if __name__ == '__main__':
    raise SystemExit(main())
```

serene scaffold
#

I'll break down what this does.

pd.DataFrame(  # convert it to a DF at the end
    np.arange(1, 6)  # make an array from [1, 6)
    .reshape(-1, 1)  # make it a column
    .repeat(8, axis=1)  # repeat this column 8 times, side by side
)
pale vortex
#

I see, this makes sense. If I had a pandas series instead of an array, would this reshape and repeat logic still work?
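A Series has `.repeat` but no `.reshape`, so the chain doesn't carry over unchanged. Two options that should work, sketched with made-up variable names: drop to the underlying array with `.to_numpy()`, or stay in pandas and concatenate the Series with itself.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Option 1: go through the NumPy array, keep the Series index as row labels.
wide = pd.DataFrame(
    np.repeat(s.to_numpy().reshape(-1, 1), 8, axis=1),
    index=s.index,
)

# Option 2: pure pandas, eight copies of the Series side by side.
alt = pd.concat([s] * 8, axis=1, ignore_index=True)
```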

quartz raptor
pale vortex
steady basalt
#

It’s cool and stuff but unless I have a global template I’m cool waiting a few seconds before I look at the data

quartz raptor
steady basalt
#

I mean with this data

quartz raptor
#

well i guess im just trying stuff out

steady basalt
#

U said u only wanted to work on a subset of the data

#

Ml project?

quartz raptor
#

well ok, so i am trying to tokenize this 100gb data into smaller chunks using an auto-encoder

steady basalt
#

Unless ur constantly working with 100gb files it’s prob faster to wait for a csv to load than write 200 lines making a binary search

quartz raptor
#

and the auto encoder will have some attention span which has to get loaded at the same time

#

so to train i will have to load random chunks of the data for each batch

#

so i have to be able to read the whole file very fast

steady basalt
#

But you only want 2017 days don’t u

quartz raptor
#

no why?

steady basalt
#

I thought u said

quartz raptor
#

i have all the data from 2017 to 2022

#

no

steady basalt
#

What’s the project aim

#

What are u predicting

#

Or classifying

quartz raptor
#

well some im trying to see how much noise vs information is in the btc (or any crypto)'s price.

steady basalt
#

How will u do that

#

(I didn’t graduate math)

quartz raptor
#

and if it turns out that the market is inefficient i.e. the price can be predicted and isnt just random then i can maybe make a trading bot

#

well im also just a first year, so i didnt learn any of this (yet)

steady basalt
#

First year math ?

quartz raptor
#

yes

steady basalt
#

This is hella hard project bro

#

Is it ur second degree or?

quartz raptor
#

yeah i dont really need to hit the goal, its just about learning

#

no my first

steady basalt
#

Did u already do smaller ML projects

quartz raptor
#

actually i did some work on differential equation solvers using ml

steady basalt
#

I’m curious, where did you learn to code like this as a first year math student

quartz raptor
#

and i managed to outperfom state of the art in some metrics

#

oh im programming for like 8 years since im 14 or so

steady basalt
#

How is ml solving your equations?

#

Bruh ur a first year student outperforming soa ML?

quartz raptor
#

i wrote a paper, but its in german

steady basalt
#

We’re talking phd here not bs right

quartz raptor
#

no its not soa in ML it was outperforming soa in regular algorithms

#

but also only in very specific metrics

#

so its somewhat cheating, but still intresting

steady basalt
#

I thought I was pretty nerdy

#

At 18

#

There’s a new generation 😂

quartz raptor
#

youre 18?

#

or when you were 18

steady basalt
#

No I’m 23

quartz raptor
#

oh ok, well even then at 23 you have alot of time

steady basalt
#

I can tell u no one I’ve ever met has been writing papers at 18…

#

How old are you?

quartz raptor
#

21 now

steady basalt
#

In Germany u start uni at 20?

quartz raptor
#

switzerland

steady basalt
#

Here we are first year at 18

quartz raptor
#

yes because i had to go to the army for a year

steady basalt
#

Do u have a rifle?

quartz raptor
#
  • we have an extra year of highschool compared to everywhere else
steady basalt
#

How did you write machine learning papers at such an age

#

How do you find supervisor lol

green wasp
green wasp
#

I never hooked kafka up to an api and I’m not sure it can be done but maybe I’ve never used it to its full capabilities

misty flint
#

you seem to know more than me about kafka. let me know what you find out though. im curious

green wasp
#

I don’t like that it involves jdbc though

#

This looks promising

#

So in theory I wouldn’t need a python fetcher at all, just kafka pushing to mongo and a django site for the dashboards

wise falcon
#

Hi! I am using PySpark to read, cleanse and write csv files (the files are more than 10gb), after I read the csv file and work with it I try to save it with dataframe.write but it saves it into multiple parts, how could I save the csv file into only one big csv?

ocean swallow
#

bro, google just sent me an alert of aftershock and I was like what? weirdo.

#

20 seconds later earthquake happened

#

future is now

serene scaffold
plush jungle
#

but I can't figure out how to run this command in terminal

#
```
# Generate curated MetFaces images without truncation (Fig.10 left)
python generate.py --outdir=out --trunc=1 --seeds=85,265,297,849 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/metfaces.pkl
```
#

if I copy paste it into terminal, it only runs the first line until it gets to the \ and newline

#

if I get rid of the newline and run it all as one line, it gives this

#
```
D:\Python\stylegan2-ada\stylegan2-ada-main>py generate.py --outdir=out --trunc=1 --seeds=85,265,297,849 \ --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/metfaces.pkl
2022-05-13 20:29:26.269912: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-05-13 20:29:26.270276: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From C:\Users\mac\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorflow\python\compat\v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
usage: generate.py [-h] {generate-images,truncation-traversal,generate-latent-walk,generate-neighbors,lerp-video} ...
generate.py: error: argument command: invalid choice: '\\' (choose from 'generate-images', 'truncation-traversal', 'generate-latent-walk', 'generate-neighbors', 'lerp-video')
```
misty flint
#

like

#

do you have a google pixel?

#

im curious if that would also happen to me if i was near an earthquake

misty flint
#

the dependencies seem pretty specific

plush jungle
#

I'm not sure I have all the dependencies, but I don't think the github project works anyway, because when I just put in seeds and network, it gave me this error

#

and I tried the solutions in that but they didn't change it

#

my worry is that the code is just out of date and no longer maintained or compatible with tensorflow 2.x

misty flint
plush jungle
#

right, but when I tried to uninstall tensorflow 2 and get a tensorflow version with pip, it couldn't find any matching packages

#

so I don't think this code actually works anymore

misty flint
#

welp. thats a bummer

plush jungle
#

yeah. maybe I should try stylegan3

misty flint
#

give it a shot

#

our team used pix2pix and cyclegan recently

#

so their stuff def works

plush jungle
misty flint
#

image translation being like

#

stuff like turning daytime images into nighttime ones

#

and vice versa, etc.

#

but i believe cyclegan gives quite good results for plain image generation as well

#

pix2pix can be used for a lot of different image generation tasks as well

#

heres a horse2zebra model using cycleGAN i believe

plush jungle
misty flint
#

right? its pretty dope

#

you can also train the model to do a variety of things

#

my imagination is limited but the potential is def there

plush jungle
#

@misty flint any idea how to set up Cuda with gpu?

#

I'm on windows 10 and I definitely have an nvidia gpu, but when I try to run stylegan3 I get

```
AssertionError: Torch not compiled with CUDA enabled
```
#

and google is not being very clear about how to do it

misty flint
misty flint
#

hmm hmm

#

can you do transformations inside big query

#

they have their Big Query ML

#

but like

#

i think thats autoML or something

#

tbh im not sure since even after reading it im still confused

#

interesting

#

there is google data studio

#

but it just looks like a worse tableau

#

oh i forgot to mention

#

earlier i tried AWS Sagemaker for the first time today

#

pretty interesting

#

its like a jupyter notebook but on aws

#

you might ask what could you need this for? you can train a model on aws and create an API endpoint with Sagemaker

#

and then create some web app to call the API for model inference

#

TIL many things, like how annoying working with the cloud is kekHands

misty flint
#

my boss was like

#

oh let me get you aws access for the company

#

since these things can end up costing you a wild amount of money sometimes

#

im like

#

uhh ok

#

ill try not to go into the tens of thousands i guess

#

i will quadruple-check to make sure i dont leave instances running

misty flint
#

i found your alternate self

#

this is a good summary:

For any ML model, the time spent in a Jupyter notebook is inversely proportional to its reproducibility. The reasons behind this rule are poor modularity and reusability of the code in notebooks, and poor integration with Git. The worst part of it is the habit of using notebooks which incentivizes the practices that go against reproducibility. This seems to be a vicious circle. We use notebooks because they are a great way of prototyping models or exploring data. However, the more you use notebooks, the more problems you’ll face at the deployment stage.

plush jungle
#

I'm trying to run stylegan3 code, but when I do I get

```
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
```
#

I'm on windows 10 and I've already installed ninja version (1.10.2.3)

flat fiber
#

Hi. I need to classify 25 unlabelled images, depending on their similarity.
I have never done Computer Vision. Can someone tell me a general process to tackle this, or a tutorial that does something similar?

quasi ether
#

what does embedding_dim arg do in keras.layers.Embedding?

wooden sail
#

it's the dimension of the vector space you want to use to encode a set of integers

#

you can think of it as a function f: Z -> R^n (integer ids in, real vectors out), where n is the embedding_dim

#

for example, if you chose n=3, you're asking keras to represent your list of integers by using vectors in R^3, i.e. vectors of the form [x,y,z], where x,y,z are no longer integer valued

#

the idea is that by doing this, you assign a geometric representation to the integers of the classes or words you were encoding, and this makes it possible to look for structure like clustering of similar classes. whether the clustering is evident depends on the number of dimensions. too few makes it so that no structure is present. too many becomes wasteful because structure usually has a low rank
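
To make the lookup-table picture concrete, here is a minimal NumPy sketch of what an embedding layer does. It is not Keras code: `vocab_size`, `table` and `ids` are made-up stand-ins, and the random `table` only plays the role of the weights that `keras.layers.Embedding` would learn during training.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 10      # hypothetical: number of distinct integer ids
embedding_dim = 3    # n in the explanation above: each id becomes a vector in R^3

# The layer's weights are one row of floats per integer id.
table = rng.normal(size=(vocab_size, embedding_dim))

# A batch of 2 sequences, each holding 4 integer ids.
ids = np.array([[1, 5, 5, 2],
                [0, 9, 3, 3]])

# "Embedding" is just row lookup: every id is replaced by its vector.
vectors = table[ids]

print(vectors.shape)   # (2, 4, 3)
```

The same id always maps to the same vector, so any structure (such as clusters of similar words) has to come from how training moves the rows of the table.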

steady basalt
#

Who was it that wanted tf gpu benchmarked

#

@misty flint ?

#

M1 pro using tf metal for the gpu is so fucking fast

steady basalt
#

And co lab is faster

#

Well

#

😦

#

if anyone cares

#

Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
0.632617707999998
GPU (s):
0.12009491599928879
GPU speedup over CPU: 5x

#

google co lab Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
4.349970917000064
GPU (s):
0.03639778600017962
GPU speedup over CPU: 119x

#

so... co lab is much much faster than my m1 gpu, seems wrong considering people have reported the m1 pro being faster

#

anyone any idea why

#

im using miniforge so i believe its native

ocean swallow
ocean swallow
#

Said an earthquake just happened. beware of aftershocks.

#

kinda like this but this is an actual prediction.

#

I think they are using accelerometer data of people around and sending them faster than the shockwave. wow

urban prism
#

The future is now.

iron basalt
# steady basalt anyone any idea why

This is probably just TF being bad. The CPU/GPU difference for convolutions should probably be much larger (on pretty much any device that has GPGPU). Some have TF slower on GPU than on CPU for the m1 (clearly something is broken).

steady basalt
#

how did those people get the correct performance

steady basalt
#

yes

#

maybe you need to install numpy specially?

iron basalt
#

Numpy does not use the GPU.

steady basalt
#

have to use veclib?

iron basalt
#

Also a K80 is a pretty beefy device relative to the M1, so while it could be faster than the K80, it would probably require a decent amount of effort that actually makes use of the M1 in the way that it wants to be used.

#

I don't think most are used to working with that kind of device.

steady basalt
#

initial benchmarks when it released quickly found it's better

iron basalt
#

Who tested it?

steady basalt
#

like tonnes of people, google it

#

so not sure why i found it slower than co lab

iron basalt
#

Can you link me to one that does the same benchmark you showed above of the 32x7x7x3 on 100x100x100x3 images?

steady basalt
#

closed it now but its

#

official google or tf benchmarking code

#

u shud be able to find it by pasting

wooden vault
#

is it possible to make a python program that plays an online driving game. (asked out of curiosity)

iron basalt
#

Well, IDK, but the largest M1 max's GPU is about 10.4 TFLOPS while the K80 is about 8.73 TFLOPS. The largest M1 pro has half as many GPU execution units as the M1 max. So just from that it seems a bit suspicious. But, if the model is not too large I could totally see the M1 being faster simply due to its GPU being integrated (memory/batch transfer rate / speed) (for training, for non-training it's probably even faster).

#

*Single precision.

#

**Nvidia lies all the time though so the 8.73 TFLOPS could be complete BS.

#

**But depends how much Apple is lying too.

steady basalt
iron basalt
#

Last I checked in on TF the issue was not using the m1's zero copy memory transfer to the GPU so it would only be faster if you did large enough batch sizes.

#

Back then the CPU was often reported as faster than GPU.

#

So it probably is better now, but still.

steady basalt
#

So if I used a massive data set I’d find the M1 is faster than the K80?

#

I’ll give it a go later

iron basalt
#

The m1 or m1 pro or m1 max (and there are multiple of each)?

steady basalt
#

The iris data is like 1400

#

M1pro

iron basalt
#

Maybe, gotta just try it. But if the TFLOPS are true (although they are theoretical) then no.

#

In practice most programs won't even make use of like 20% of the theoretical.

steady basalt
#

There’s no way google are cheating right, they wouldn’t just rig notebook to run better on iris dataset

iron basalt
#

(takes too much effort, people don't want to spend so much time on a single device when new ones come out all the time and the libraries are targeting many different ones)

steady basalt
#

And the m1 demolishes anything similar to it cpu wise, I just need to get RTX benchmark next to see

iron basalt
#

Google, Apple, Nvidia, and AMD heavily cheat on performance benchmarks and straight up lie (sometimes they pay fines for it like Nvidia, but that is a small cost for them).

#

(Nvidia lies probably the most IMO)

#

(And they also lock off parts of the hardware unless you pay for a special license or hack it)

#

For CPUs, yeah, the M1 should win, at least in performance per watt, and for the GPU too.

#

For CPUs the m1 max is very fast, but still slower than the fastest AMD CPUs.

#

Although they use way less power, which is what really matters if you want to have a bunch of them training stuff.

#

M1 should beat out current Intel CPUs (in pretty much every way) (until Intel finishes their new factory and maybe their new stuff is better).

steady basalt
#

It’s twice as fast as Ryzen 9

iron basalt
#

Which Ryzen 9?

steady basalt
#

HX

#

I posted the benchmark

obsidian bough
#

Hey

#

Anyone know what happened to pywhat?

#

It's not working in my code

#

I tried to repair it

#

But it didn't work

iron basalt
#

If by similar you mean for power consumption yeah.

#

The power consumption on AMD CPUs is pretty bad, but ofc if you are not worried about that, then AMD CPUs are def. faster than any M1.

obsidian bough
arctic wedgeBOT
#

PyWhatKit is a Simple and Powerful WhatsApp Automation Library with many useful Features

obsidian bough
versed gulch
#

Hi is there any way I can save my numpy array of shape (30, 400, 400) = (number of slices/images, height, width) as an image file type?

wooden sail
#

PIL, scipy, and cv2 all seem to have ways of doing this

#

if it's just 2D images, you could also save each slice using matplotlib in a loop, using plt.savefig(...) while iterating over the slices

versed gulch
#

i want to use the library SITK which only reads images that are of 3D file types thats why I need to save the image as a 3D format

wooden sail
#

in that case you'd wanna look up how to convert exactly into the format you need, since having more than 4 layers in an image is not a standard image format that goes around

versed gulch
wooden sail
#

you want the image to be 30 x 400 x 400?

versed gulch
#

yh

wooden sail
#

that's not a standard image format

#

anyway sitk can stack them afterwards, so the format doesn't matter

versed gulch
#

even 400x400x30?

wooden sail
#

you have to look up a format that supports more than 4 layers. none of the normal ones do

misty flint
#

interesting

wooden sail
#

sitk also says it only takes 2,3 and 4D images. what are you trying to do?

versed gulch
# wooden sail sitk also says it only takes 2,3 and 4D images. what are you trying to do?

im trying to run this function

def oof3response(image=None, radii=[], resp_type=3):
    print('   ¬ Compute OOF filter response ...')

    # response_type = 3 :: sqrt(max(0, l1) .*max(0, l2));
    # OOF tensor eigenvalues :: l1 >> l2 >> l3
    # normalisation_type: blob-like (0), curvilinear (1), planar (2)
    # sigma: sigma >= min(radii), otherwise normalisation_type = 0
    opts = {'ntype': 1, 'sigma': min(image.GetSpacing()), 'use_absolute': True,
            'radii': radii, 'resp_type': resp_type}

    if min(radii)<opts['sigma'] and opts['ntype']>0:
        print('Normalisation type is set to zero since sigma<min(radii)')
        opts['ntype'] = 0

    # image
    data = sitk.GetArrayFromImage(image)
    size = [image.GetSize()[i] for i in [2,1,0]]
    spacing = [image.GetSpacing()[i] for i in [2,1,0]]

    # output
    output_data = np.zeros_like(data).astype('float64')

    # Fast Fourier Transform
    fft = np.fft.fftn(data)

    # Radius from Fourier coordinates
    x, y, z = ifft_shifted_coord_matrix(size, spacing)
    x /= size[0] * spacing[0]
    y /= size[1] * spacing[1]
    z /= size[2] * spacing[2]
    radius = np.sqrt(x**2 + y**2 + z**2) + 1e-12

 ...... (could not be shown in discord as it hit the character limit)
wooden sail
#

the code seems to imply the image has only 3 axes

versed gulch
wooden sail
#

and what did you create the image with?

versed gulch
wooden sail
#

all right. well, sitk seems to be able to make images out of numpy arrays

#

but it anyway seems like you want the image as a numpy array

#

how do you currently have the image stored?

versed gulch
#

yh but I wont be able to do GetSize and GetSpacing

versed gulch
wooden sail
#

the easiest way, to me, looks like just reading the slices, putting them into a numpy array, and giving that numpy array to sitk

versed gulch
#

yh but I wont be able to do this "GetSize and GetSpacing"

#

thats why Im having a problem

wooden sail
#

hmm?

#

your images don't have that to begin with if they are just a bunch of 2d images in a generic format

#

if you have the measurement parameters as separate metadata, you can put it in yourself

versed gulch
#

hmm okay so GetSpacing and GetSize would be meaningless if it wasn't originally saved as a 3D image file?

wooden sail
#

more like, if that metadata was not included into the image format

versed gulch
wooden sail
#

mhm

young granite
#

hi guys i used plotly express to plot 3d scatter plot and now wanted to use plotly.go for a 3d surface plot of the same data.
Sadly when i define x, y and z the same way as i did in the scatter plot i run into a java error. So i looked up the plotly site and there the data is provided as arrays not as df columns. Would someone help me to make it work?

young granite
# serene scaffold try showing the code
```
df_dict = {}
if group_name not in df_dict:
    df_dict[group_name] = [df_new]
    df_dict[group_name].append(df_new)
```
i stored dfs into a dict with the key being the group_name

```
data = {}
for group_name in df_dict:
    df_new = pd.concat(df_dict[group_name])

    df_new['Area'] = df_new['Area'].fillna(0)

    df = df_new.rename(columns={"ID#": "Compound Name:", "Temp": "Temp. [°C]"})

    if group_name not in data:
        data[group_name] = df
```

i concatenated the dfs of a group into one big df, which i plotted from

```
for group_name in data:
    Headline = group_name
    df = data[group_name]
    x = df["Temp. [°C]"]
    z = df["Area"]
    y = df.index
```
#
                    x=x,
                    range_y=[1, 42],
                    y=y,
                    z=z,
                    color=name,
                    hover_name=df["display_name"],
                    #log_z=True
                       )```
and then used plotly express to plot a scatter plot.
Now i wanted to do the same thing with the go.Surface
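
One possible cause, offered as a guess: a surface plot needs `z` as a 2D grid (one row per y value, one column per x value), while a scatter plot accepts three flat columns. A pandas-only sketch of reshaping long-format data into such a grid follows. The `Run` column and all values are hypothetical, and the plotly call is left commented out since it is not verified here.

```python
import pandas as pd

# Hypothetical long-format data: one row per (run, temperature) measurement.
df = pd.DataFrame({
    "Run": [0, 0, 0, 1, 1, 1],
    "Temp. [°C]": [20, 40, 60, 20, 40, 60],
    "Area": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Rows become y, columns become x, cell values become the surface height z.
grid = df.pivot_table(index="Run", columns="Temp. [°C]", values="Area")

z = grid.to_numpy()          # shape (len(y), len(x))
x = grid.columns.to_numpy()  # temperatures
y = grid.index.to_numpy()    # runs

print(z.shape)  # (2, 3)

# With plotly installed, the grid could then be passed on, for example:
# import plotly.graph_objects as go
# fig = go.Figure(go.Surface(x=x, y=y, z=z))
```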
strange stag
#

was hoping someone could tell me whats wrong with this gym environment

import random

from gym import Env
from gym.spaces import Discrete


class ShowerEnv(Env):
    def __init__(self):
        self.action_space = Discrete(3)
        self.observation_space = Discrete(100)
        self.state = 38 + random.randint(-3, 3)
        self.shower_length = 60

    def step(self, action):
        self.state += action - 1
        self.shower_length -= 1

        # Calculating the reward
        if 37 <= self.state <= 39:
            reward = 1
        else:
            reward = -1

        # Checking if shower is done
        if self.shower_length <= 0:
            done = True
        else:
            done = False

        # Setting the placeholder for info
        info = {}

        # Returning the step information
        return self.state, reward, done, info

    def reset(self):
        self.state = 38 + random.randint(-3, 3)
        self.shower_length = 60
        return self.state
#

cause when im using ray rllib, im getting a shape error

ValueError: Cannot feed value of shape (11, 256) for Tensor default_policy/Placeholder_default_policy/default_policy/fc_1/kernel/Adam:0, which has shape (100, 256)
steady basalt
misty flint
#

squiggle has a point with those caveats too

steady basalt
#

any idea for a fix?

#

also, does anyone know how to make another env in conda default?

#

i dont wana use base anymore i wana use my 'ml' env

#

permanently

#

else im gona have to like do some effort to clean up and install tf on base

misty flint
steady basalt
#

He has some ideas why but no fix

brazen spire
#

what are some cheap options to get GPUs to run machine learning models?

#

My current GPU (RTX 2080 TI) doesn't have enough VRAM for my models

#

(11 GB)

iron basalt
#

Apple just gives you their ML tools that they want you to use.

spare briar
spare briar
spare briar
ripe flare
#

Any scipy expert here?

serene scaffold
strange stag
#

@spare briar mm, could you help me a bit more? how do i increase the sample size?

spare briar
#

what is your dataloader

strange stag
#

60 steps before a reset
not using a dataloader

#

apologies this isnt my code and my ML is really really really rusty

spare briar
strange stag
#

the dataset size is indefinte tho

#

dictated by the # of epochs

spare briar
#

for some reason a batch has only 11 samples but your optimizer is complaining that it expects 100

strange stag
#

what is telling the optimizer to expect 100?

spare briar
#

my guess is you are iterating over a dataset where dataset % 100 = 11

strange stag
#

think it would be better to keep the batch size the same tho

spare briar
strange stag
#

ah

#

so somewhere in rays default config

spare briar
#

and the optimizer has initialized with a kernel that expects this

strange stag
#

thanks @spare briar

#

Got another ray rllib question

really got no idea whats happening

Worker crashed during call to step_attempt(). To try to continue training without the failed worker, set ignore_worker_failures=True.

code + env + code traceback ~> https://bpa.st/YOEA

misty flint
#

redshift is basically amazon's version of postgres right?

#

i bring this up because apparently gcp's big query is used more for actual data warehousing

#

and has cooler features for ML apparently

orchid carbon
#

UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior

What is this telling ?

#

Is it bad bad?

potent parrot
worldly dawn
misty flint
#

ah i see, i see

#

what am i thinking of? amazon RDS?

#

tbh im still trying to learn the services atm

worldly dawn
pure badge
#

Hello everyone. I am going to do thesis on machine learning, AI. For that, I should learn Python. I do have basic programming knowledge in both C and Java. Can anyone suggest me the best free tutorial or road map which will be a huge help for my upcoming thesis? Thanks in advance!!

serene scaffold
lapis sequoia
#

I fed the whole X matrix and y labels to the cross_val_score function. Does it do the train test split automatically for each fold and calculate the accuracy? If yes then what's the point of using the k-fold generator which returns the indexes for each fold's train and test, if I just want the accuracy.

pure badge
serene scaffold
pure badge
#

I haven't selected any specific topic yet. I do have interest in Machine Learning and AI

serene scaffold
spare briar
#

this is just telling you it chose to set the scores to 0
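
A hand-rolled sketch of where the warning comes from. This is not scikit-learn's implementation; the `zero_division` argument only mirrors the parameter named in the warning, and the labels are made up.

```python
def precision(y_true, y_pred, label, zero_division=0.0):
    """Precision for one label: TP / (TP + FP)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    if tp + fp == 0:
        # The label was never predicted, so TP / (TP + FP) is 0 / 0.
        return zero_division
    return tp / (tp + fp)

y_true = [0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 0, 1]   # label 2 is never predicted

print(precision(y_true, y_pred, label=0))  # 2 of the 3 predictions of 0 are right
print(precision(y_true, y_pred, label=2))  # 0.0, the fallback value
```

So the warning is not an error by itself, but it does say the model never predicted at least one class, which is usually worth checking.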

serene scaffold
lapis sequoia
#

How can I explore a Boolean column. Other than value counts

misty flint
#

this is for a data journalist position...interesting

serene scaffold
lapis sequoia
#

Just wanna do it solo rn. @serene scaffold

serene scaffold
lapis sequoia
#

There is when you are in school 🤪

#

Well It's just analysis, like finding how data is distributed, outliers, etc.

#

For categorical ones I just wrote the percentage of each unique value.

thin pelican
#

Can the samples you use for accuracy be from the dataset used to train the model?

wooden sail
#

for validation, you mean?

#

if so, it's a bad idea to do that. some recent papers have shown that under relatively mild conditions, the error w.r.t. the training data set will decay to 0, and this tells you nothing about the predictive power when the model is used on new data
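
A toy illustration of the point, using only the standard library. The data is made up (labels are coin flips, so nothing can be learned), and the "model" is just a lookup table that memorizes the training set.

```python
import random

random.seed(0)

# Hypothetical data: features are ids, labels are pure coin flips.
data = [(i, random.randint(0, 1)) for i in range(200)]
train, held_out = data[:100], data[100:]

# A "model" that simply memorizes every training example.
memory = dict(train)

def predict(x):
    # Unseen inputs fall back to a constant guess.
    return memory.get(x, 0)

def accuracy(samples):
    return sum(predict(x) == y for x, y in samples) / len(samples)

print(accuracy(train))     # 1.0: perfect on data it has already seen
print(accuracy(held_out))  # roughly chance level on new data
```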

winged vessel
thin pelican
#

Alright thanks amigos

brazen spire
#

Rendering

#

Computer vision

#

that's why I need more than 11 gb of Vram because I'm limited on the size of the batch at 7

#

can't go beyond or else I run out of Vram after 1 epoch and it crashes

hasty kiln
#

Which is better for machine learning deployments, Django or Flask, regardless of the learning curve 😶?

wooden sail
#

hmm are either of those for ML?

strange stag
hasty kiln
# strange stag lol wut, those arent for ML.... they are for webapps n the like

Flask is best for beginners while Django is for more advanced machine learning deployments. Flask is a microframework making it more reliant on extensions for functionality. Django is a full-stack web framework. It comes with more ready to access features.

#

I should have made the question clear pithink

orchid carbon
steady basalt
#

@misty flint i must say the laptops great

marble stag
#

Hello, I recently started learning about ML and encountered SVR. I understood the theory behind it but am having a hard time understanding the maths behind it. Can anyone teach me?

serene scaffold
steady basalt
#

start with SVM?

marble stag
#

i know that it creates a tube where error is accepted and all the values outside the tube are support vectors, so i wanna know what happens mathematically. i am not good at maths so i am having trouble understanding.

marble stag
heady rivet
#

Hi, I'm about to go to college and they offered me a scholarship for either an

#

information systems degree or business analytics

#

Is there anyone in the field who can tell me which one is more related to data science?

#

I like to make web scraping scripts and manipulate and play with the data

#

So I want to ask which specialty is closest to my interests, because I'm not familiar with the field of databases or data analysis.
Sorry to bother you, but it's a pivotal point in my life, and I want to ask so I don't regret it

rose agate
heady rivet
#

@rose agate I have the curriculum for both of them, can you look at them and tell me?

rose agate
#

I don't know if I'd be able to say what's best, but maybe send them in this chat and someone else can try?

opaque estuary
opaque estuary
#

I still suggest make your own decision.

mighty spoke
#

Hi, I'm trying to bin some values, but the data frame has NaN values and the max count in bins is 2. I'm not sure why, though. Any help appreciated
```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.fft import rfft, rfftfreq
from scipy import fftpack
import scipy.signal as sg
from PyAstronomy.pyasl import foldAt
#method 2 for binning

t=np.arange(0,1.024, 4e-3)
fr=60.546875# fundamental frequency
Tp=1/fr#time period

phases = foldAt(t, Tp, T0=0)

plt.figure()
#data is the intensity

xp, yd = zip(*sorted(zip(phases, data)))#ensures x and y values correspond to each others in pairs when sorted

plt.plot(xp, yd)#plot the phase xp and Flux

df4 = pd.DataFrame({'X' : xp, 'Y' : yd}) #we build a dataframe from the data
M=20#no of bins

bins=np.arange(0, max(phases), step=Tp/M)

categorical_object = pd.cut(xp, bins)
count=pd.value_counts(categorical_object)#count in bins
grp = df4.groupby(by = categorical_object) #we group the data by the cut
ret = grp.aggregate(np.mean)#calculates the mean in each bin
plt.figure()
plt.plot(ret.X, ret.Y)```
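
One likely cause, assuming `foldAt` returns phases in [0, 1): the bin edges step by `Tp/M` (about 0.0008), which gives over a thousand bins for only 256 samples, so most bins are empty (NaN mean) and none holds more than a couple of points. A sketch with `M` equal-width bins across the phase range is below. The modulo line stands in for `foldAt`, and the sine wave is a made-up replacement for the real intensity `data`.

```python
import numpy as np
import pandas as pd

t = np.arange(0, 1.024, 4e-3)      # same time axis as in the question
fr = 60.546875                     # fundamental frequency
Tp = 1 / fr                        # period

# Assumption: equivalent to foldAt(t, Tp, T0=0), i.e. phases in [0, 1).
phases = (t / Tp) % 1.0
data = np.sin(2 * np.pi * fr * t)  # hypothetical stand-in for the intensity

M = 20
# M equal-width bins across the whole phase range, instead of steps of Tp/M.
bins = np.linspace(0.0, 1.0, M + 1)

df = pd.DataFrame({"X": phases, "Y": data})
cut = pd.cut(df["X"], bins, include_lowest=True).rename("phase_bin")

counts = cut.value_counts(sort=False)           # samples per bin
means = df.groupby(cut, observed=False).mean()  # mean phase and flux per bin

print(len(bins) - 1)                  # 20
print(int(counts.sum()) == len(df))   # True: every sample landed in a bin
```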

hidden frigate
# potent parrot just saw this, I'm a pyqtgraph maintainer, might be able to help, what's up

awesome, thanks! I kiiiind of solved my problem for now. I was trying to get an ImageView to scroll through slices of a (2D) matrix with mouse wheel scrolls, but it wasn't calling the wheelEvent function I had defined. I ended up discovering that ImageView either doesn't inherit the wheelEvent function or isn't passed wheelEvent calls, but its view attribute does receive them, so I can just override the wheelEvent definition for the view attribute. It's still a bit hacky, but it works. I appreciate the offer for help! I'll hit you up if I have more questions; I'm trying to exercise pyqtgraph for a work project, so I'm sure I'll find more bits and bobs.

potent parrot
orchid carbon
#

ValueError: `logits` and `labels` must have the same shape, received ((32, 2) vs (32, 1)). What's this about?

#

What's a logit

misty flint
serene scaffold
orchid carbon
#

Well it's working now, it was the loss function I had used

#

it's pretty big lemme try a site paster

#

but I was using loss="binary_crossentropy"

strange stag
#

@hasty kiln go with django

versed gulch
#

Does anyone know how to add 2 gray scale images as numpy arrays (that are between 0-1) i.e. just overlaying the images over each other?

wooden sail
#

if they're already numpy arrays, just do +. they need to have the same or broadcastable sizes though
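
A small sketch of the addition, plus two common ways to keep the result inside [0, 1]. The arrays `a` and `b` are made-up stand-ins for the two images.

```python
import numpy as np

# Two hypothetical grayscale images with values in [0, 1] and the same shape.
a = np.full((4, 4), 0.7)
b = np.full((4, 4), 0.6)

# Plain addition can leave the [0, 1] range (0.7 + 0.6 = 1.3) ...
summed = a + b

# ... so either clip the sum back into range,
clipped = np.clip(summed, 0.0, 1.0)

# or blend with weights that add up to 1 so the result stays in range.
blended = 0.5 * a + 0.5 * b

print(summed.max(), clipped.max(), blended.max())
```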

versed gulch
hidden frigate
# orchid carbon ```ValueError: `logits` and `labels` must have the same shape, received ((32, 2)...

logit is a statistics term; practically, in this case it probably means the raw, unnormalized class scores of a classifier. The function you're calling is trying to compare an array of predictions to the correct labels and it's an element-wise comparison, so the arrays need to be the same shape. It looks like the logits you're passing aren't a single number per prediction, but two numbers (the second dimension of the array shape is 2) while the labels only have 1. Check the output of whatever neural network layer or classifier you're coding; if the labels are shape (32,1) the logits have to be (32,1) also
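
The shape mismatch can be reproduced with plain NumPy arrays. This is only an illustration of the shapes, not the Keras loss itself; the zero-filled arrays stand in for a batch of 32 labels and for the outputs of a final layer with 2 units or 1 unit.

```python
import numpy as np

batch = 32
labels = np.zeros((batch, 1))       # one 0/1 label per sample
logits_two = np.zeros((batch, 2))   # output of a final layer with 2 units
logits_one = np.zeros((batch, 1))   # output of a final layer with 1 unit

print(logits_two.shape == labels.shape)  # False: the mismatch from the error
print(logits_one.shape == labels.shape)  # True: one score per label

# Alternative: keep two outputs and one-hot encode the labels instead.
one_hot = np.eye(2)[labels.astype(int).ravel()]
print(one_hot.shape == logits_two.shape)  # True
```

In Keras terms, the usual pairings are a single sigmoid output with `binary_crossentropy`, or two softmax outputs with a categorical cross-entropy loss.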

wooden sail
hasty kiln
oblique drum
#

How much stuff/programming do i need to program a website/app that detects if something is an apple or not

misty flint
#

any of the major clouds have APIs that can do object recognition like that

#

if you dont want to build your own

#

otherwise you could probably build your own model with ImageNet

rose quarry
#

How can I replace all values of nan with 0?

#

atm Im using this CNN_values2 = [[n if n!=np.nan else 0 for n in k[:4096]] for k in data_array2] but it still gives me nan values
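
The comparison `n != np.nan` is always True, because NaN never compares equal to anything, including itself, so the comprehension keeps every value. A short sketch of two alternatives follows. The small `data_array2` here is a made-up stand-in for the real array.

```python
import numpy as np

print(np.nan != np.nan)  # True: NaN never compares equal, even to itself

# Hypothetical stand-in for data_array2.
data_array2 = np.array([[1.0, np.nan, 3.0],
                        [np.nan, 5.0, 6.0]])

# Option 1: test with np.isnan inside the comprehension.
cleaned_list = [[0 if np.isnan(n) else n for n in row] for row in data_array2]

# Option 2: let NumPy replace every NaN in one call.
cleaned_array = np.nan_to_num(data_array2, nan=0.0)

print(cleaned_array)
```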

vital ruin
#

I am working with an xml file that I am parsing to a dict with xmltodict. Is there a way to use a list to call keys to sheet the info I am wanting out of this dict. I.E. data[list] instead of data[key1][0][key2]

#

Get*
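
One way to do this with the standard library is to fold `operator.getitem` over the list of keys. The helper name `get_path` and the sample `data` are made up for illustration.

```python
from functools import reduce
from operator import getitem

def get_path(data, path):
    """Follow a list of keys/indexes into nested dicts and lists."""
    return reduce(getitem, path, data)

# Hypothetical structure shaped like xmltodict output.
data = {"key1": [{"key2": "value"}, {"key2": "other"}]}

print(get_path(data, ["key1", 0, "key2"]))  # value
```

A missing key or index raises the usual `KeyError` or `IndexError`, the same as `data[key1][0][key2]` would.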

wooden sail
gloomy anvil
#

I have the stupidest issue:

boxplot_df = pd.DataFrame()
for currency in currencies:
    sub_df = evaluation_df.query(f'{currency}'.format(currency=currency))
    boxplot_df [f'{currency}_allin'.format(currency=currency)] = sub_df['allin_trades'].copy()

I have an empty boxplot_df and want to add the column 'allin_trades' from the evaluation_df, if the string in the column 'currency' matches the currency i am currently looping through

#

If I run this code, I have the expected result for bitcoin, which is my first currency in the list, but for all following currencies I receive a 'nan' even though I can clearly see the populated column in the queried sub_df
What is the issue here? There must be some stupid bug that I am missing

serene scaffold
#

@gloomy anvil you're using f strings and .format at the same time, which is wrong.

You can't selectively add columns. Every row always has an element for every column, and vice versa.

Is there any data transformation going on here? Or are you just copying certain columns from one df to another?

#

Because you don't want to allocate empty DataFrames and then copy stuff into it. If you want a new df that contains copies of the columns in a given df, you just do df[['a', 'b', 'c']] and what you get are copies.

gloomy anvil
gloomy anvil
serene scaffold
#

Can you show evaluation_df?

#

Also I'm on my phone, so my explanations will be high level

gloomy anvil
#

evaluation_df

#

it also has a column at the end called 'allin_trades'

#

the query part works fine as well. I can create the sub_df based on the currency and I can see the populated 'allin_trades' column

#

but somehow it does not copy to the boxplot_df... is this some stupid typo i might have? I am losing my mind about this simple stupid issue

serene scaffold
#

@gloomy anvil look into how to pivot a DataFrame. Because that's what you're actually trying to do.

gloomy anvil
serene scaffold
#

Also, what is each row supposed to represent?

gloomy anvil
#

each row should represent the return of different classification models for each currency. I want to create a boxplot for each column/currency

serene scaffold
#

@gloomy anvil there are columns like TN, FP-- are these supposed to be the rows in the desired df?

gloomy anvil
# serene scaffold <@803185107547586600> there are columns like TN, FP-- are these supposed to be t...

No, I have a evaluation_df with all currencies, models and evaluation data like True positives (TP, FP); and so on, as well as their return ('allin_trades'), respectively. I query the evaluation_df by currency and create a sub_df with all the data from the evaluation_df for the currency, that I queried. From the sub_df I want to take only the column 'allin_trades' and copy it to the boxplot_df into a new column with the currencies name

gloomy anvil
#

figured it out

#

i don't know why but this here works: all fields populated

#

I just added reset_index(drop=True).

#

saw it at stackoverflow, copy, pasted, works. Don't know what it does or why it works though
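
A likely explanation is index alignment: when a Series is assigned as a new column, pandas matches rows by index label, not by position. The sketch below reproduces the NaN columns and the fix on a made-up `evaluation_df` with two currencies.

```python
import pandas as pd

# Hypothetical stand-in for evaluation_df.
evaluation_df = pd.DataFrame({
    "currency": ["bitcoin", "bitcoin", "ethereum", "ethereum"],
    "allin_trades": [1.0, 2.0, 3.0, 4.0],
})

aligned = pd.DataFrame()
fixed = pd.DataFrame()
for currency in ["bitcoin", "ethereum"]:
    sub_df = evaluation_df[evaluation_df["currency"] == currency]
    # Assignment matches rows by index label: bitcoin rows are 0, 1 but
    # ethereum rows are 2, 3, which do not exist in `aligned` yet.
    aligned[f"{currency}_allin"] = sub_df["allin_trades"]
    # Resetting the index relabels every subset as 0, 1, ... so rows line up.
    fixed[f"{currency}_allin"] = sub_df["allin_trades"].reset_index(drop=True)

print(aligned)  # ethereum_allin is all NaN
print(fixed)    # both columns populated
```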

gloomy anvil
royal hound
#

hello fellas im tryna generate a image using perlin noise

#

where the white spots would be grey colored and the black spots would be transparent
(0,0,0,0)

#

I have tried countless efforts

#

but everything is slow

#

atleast 1 minute

plush glacier
mellow vapor
#

how often will I have to write sql queries to retrieve data from databases, I mean we do have pandas and I can work on it to perform the necessary changes
I am familiar with the sql queries but do I need to have extremely solid grip on it or is it fine if I work on the data using libraries like pandas?

royal hound
plush glacier
#

you can try making some perlin noise in numpy might be a bit faster

#

also how does the code look there might be other ways to improve the speed a bit

#

and is it possible to find out what parts take long (also i wont be available to help a lot because will go to bed now it is like 1am for me)

royal hound
marsh yacht
#

hello guys

#

im looking for a data scientist in my team

#

im currently working on a game project and its related to AI and machine learning

#

im also a data scientist

#

but i need your help

#

if you're interested just pm me

serene scaffold
#

@marsh yacht please post the GitHub for the project in the chat. If this is a closed source project, kindly remove your messages

brave sand
lapis sequoia
serene scaffold
brave sand
serene scaffold
#

but if it's an internship, they don't necessarily expect you to know.

brave sand
#

could I learn the basics of PyTorch in 24 hours?

serene scaffold
#

probably not.

brave sand
#

could I DM you a picture of the email he sent? I’m not sure what the interview is asking for

serene scaffold
#

so you haven't been offered the internship? you just have an interview for it?

brave sand
#

yeah, it’s weird like that

#

he wants to interview me

#

and I’m kinda screwed

serene scaffold
#

being interviewed for positions is normal. but why don't you copy and paste the email into this chat, with sensitive parts censored out?

brave sand
#

alright

#

The Intro slides to MARL can be found here. You may be interested in this blog post which gives an exciting overview into the future of Artificial Intelligence with Multi-agent Reinforcement Learning. A summary of MARL is attached as a Word document along with relevant papers in MARL.

This summer, 4 students (1 UMD CS Undergrad, 1 Virginia Tech CS Undergrad, 2 Blair HS Junior students) will be working with me on MARL.

Working on the project will involve coding in python, pytorch and a background mathematical knowledge specifically in calculus, probability and optimization. All 4 students currently in my team are prepared in these areas so I will directly start introducing Reinforcement Learning from May 23 before the actual project in Multi-Agent Reinforcement Learning begins from May 30.

This position is unpaid. No Professor will be involved in this Summer research project. So if that's something that you are not interested in, please let me know immediately.

If you are interested, please meet me tomorrow at 7pm EST for a 20 minute time slot on Zoom. I will evaluate your Python programming skills, background Mathematical knowledge, and willingness to learn new things quickly. In case you qualify for the position, I will let you know immediately. Then, please let me know about your interest in the position by May 17.

serene scaffold
brave sand
#

oh I’m in high school, all internships are non paid practically. Getting an internship is lucky enough…

#

internships for me right now are for experience and resume

serene scaffold
#

I suppose. but once you're in college/university, don't waste your time on unpaid internships.

if you don't already have "a background mathematical knowledge specifically in calculus, probability and optimization", there is no way you can learn enough in 24 hours to become a more competitive applicant for this position than you are currently.

brave sand
#

yeah, I’m terrible at calc, and lin alg. should I just give up?

serene scaffold
#

you can still continue with the process for the interview experience. and who knows, maybe something interesting will happen.

brave sand
#

yeah, hopefully. do you think the python testing part will be difficult?

#

I’m terrible at programming under a time limit or stress

serene scaffold
#

if you already know Python basics, it's unlikely that they'd ask you anything particularly esoteric. but I don't think you can really learn how to use pytorch if you don't understand deep learning in general

brave sand
#

yeah, gotcha. is it inappropriate to ask programming questions in a channel during the interview?

serene scaffold
#

I suppose that counts as cheating, but if you're in a Zoom interview, you can't just pause everything while you type out your question in Discord

#

and if you don't know the answers to their questions, just tell them what you do know. if you're squirming, they'll probably move on to a different question.

brave sand
#

yeah, I understand

#

jeez this is tough

orchid carbon
#

I added a second Dense layer with 64 neurons and softmax activation; apparently it makes my net give the same result for every prediction

#

It's quite a simple problem; the simpler the problem, the fewer neurons required, I suppose?

serene scaffold
# brave sand how do zoom interviews work?

you have your webcam on and you talk to them like you're sitting at a table. it's basically the same as in-person interviews, except you secretly don't wear pants, and everyone secretly knows it.

brave sand
#

jkjk

misty flint
#

they will know if you cheat lol

#

but anyway if youre in highschool and its unpaid, cant hurt to try for the interview

#

high schoolers have plenty of time anyways lol

#

might as well spend your summer doing something productive/educational

serene scaffold
# misty flint high schoolers have plenty of time anyways lol

in general, or just over the summer? because I think the amount of time HS students spend at school, plus commuting to school and back, plus homework, plus the kajillion extracurriculars that kids are pressured to participate in, is very unreasonable.

royal crest
#

what would be a good way to visualise co-occurrence matrix? i'm thinking something like a graph network visualisation where the co-occurrence value determines the thickness of the line between two nodes but i don't think there are good ways to integrate with a pandas dataframe

#

one i'm looking at is holoviews, but it seems way too complicated for my purpose

#

I'm thinking along the lines of R's igraph

wooden sail
#

a common approach is to just plot the matrix itself as an image
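
A minimal sketch of that idea, assuming the co-occurrence counts already live in a square pandas DataFrame. The labels and numbers below are made up for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical co-occurrence counts; rows and columns share the same labels.
labels = ["cat", "dog", "fish"]
cooc = pd.DataFrame(
    [[0, 5, 1],
     [5, 0, 2],
     [1, 2, 0]],
    index=labels,
    columns=labels,
)

fig, ax = plt.subplots()
image = ax.imshow(cooc.to_numpy(), cmap="viridis")  # one colored cell per pair
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
fig.colorbar(image, ax=ax, label="co-occurrence count")
# fig.savefig("cooccurrence.png") or plt.show() to view the result
```

This stays readable for a few dozen labels; beyond that a network-style plot or a clustered heatmap may work better.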

pliant pewter
#

Is there a canonical graduate level textbook for data science and AI/ML?

lapis sequoia
#

what would be a good place to draw some 2d graphs? like I want to have some vectors over a graph, and then show possible vectors having norm of one.

royal crest
zenith panther
#

hello can anyone figure out where is the issue in this code ?

smoky shadow
#

Can someone just tell me for this code:

#

def speak(audio):
    path = r"D:\Python\Jarvis\Voice_Files\Voice"
    directories = os.listdir(path)

    # This would print all the files and directories
    for file in directories:
        filename = audio.replace(" ", "_")
        file_path = (rf"{path}\\{filename}"+".mp3")
        playsound(file_path)
#

why does the mp3 file keeps on repeating?
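
One likely cause, as a sketch: the loop runs once per entry in the folder, but `file` is never used inside it, so the same path gets played once for every file in the directory. Building the path once and playing it once avoids that. The `play` parameter below is a hypothetical stand-in for `playsound` so the snippet runs without audio hardware, and the folder name is illustrative.

```python
from pathlib import Path


def speak(audio, voice_dir, play):
    """Play the single clip matching `audio`, once.

    `play` is any callable that takes a file path (e.g. playsound.playsound).
    """
    filename = audio.replace(" ", "_") + ".mp3"
    file_path = Path(voice_dir) / filename
    play(str(file_path))  # called exactly once, no loop over the directory
    return file_path


# Usage with a stand-in player that just records what it was asked to play.
played = []
speak("hello world", "Voice_Files/Voice", played.append)
```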

misty flint
brave sand
#

that’s true

misty flint
#

if youre serious about ML Engineering, this just came out from one of my favorite ML Eng

brave sand
#

so what do you think the professor is going to ask me math and programming wise?

misty flint
#

bit dense if you arent graduate level so..you have been warned

brave sand
spare briar
#

Both of those books are 2nd year undergrad level (not to knock them, Bishop’s book might be my favorite textbook). For grad level content, if you let me know a more specific subfield you are interested in i might have recs

misty flint
pliant pewter
#

They sound like a good starting point, I kind of need something general before getting into specifics

#

Are there any good books specifically about Python tools for ML (and related)? I've used Pandas a tiny bit

#

But would like to learn stuff like TensorFlow

spare briar
misty flint
#

but like anokhi said

#

def check out bishop

#

for sure

arctic wedgeBOT
#

Hey @mighty spoke!

You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.

green wasp
#

Hello, I’ve checked the pins. Is it right to say that machine learning is a step up from data analytics? I try to find books about data analysis and I get ML books

versed gulch
#

Hi guys i have 2 arrays (one of them is just filled with 0's and 1's lets say this is arr2) after adding them i.e. arr1 +arr2, I only want to divide by 2 on those elements in arr1 that have been changed, can anyone help me with this?

wooden sail
#

if you're using numpy arrays, you can do something like

my_sum = arr1 + arr2
indices = my_sum != arr1  # True where adding arr2 changed the value
my_sum[indices] = my_sum[indices] / 2
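
A small worked example of the masking idea, assuming "changed" means the positions where `arr2` is nonzero. The arrays are made up; casting to float first keeps the halves from being truncated when both inputs are integer arrays.

```python
import numpy as np

arr1 = np.array([2, 4, 6, 8])
arr2 = np.array([0, 1, 0, 1])

my_sum = (arr1 + arr2).astype(float)  # float so that 5 / 2 stays 2.5
changed = arr2 != 0                   # True where adding arr2 changed arr1
my_sum[changed] /= 2                  # halve only the changed positions
print(my_sum)
```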
misty flint
#

you are better off looking at statistics books for data analysis, since that would lay a good foundation i believe

green wasp
#

Mhh ok so I guess like. You need both?

#

Like I should have asked my question better. Is data analysis as a role more machine learning based, data analysis based or both?

wooden sail
#

it's kinda difficult to make a clear-cut distinction, since these are buzz words that grew out of something else

#

to the extent that what "data analysis" and "machine learning" means depends on who you ask. if you have a job in mind, you can see what they are asking for. if not, it's kinda hard to say

#

e.g. you could do data analysis to prep data for machine learning. or, you could do machine learning as part of your data analysis to extract some useful info

#

you could put both under estimation theory, which falls under statistics... along with an overlap with other topics here and there

green wasp
#

Mmmmm I see. I thought the field would be more like. Idk. Intertwined?

#

Like ok practical example. My pet project

#

I’m fetching live flight data and I want to make some visualization dashboards like total distance flown in a day, total time, fuel estimation, how many aircrafts of a given type

#

Which airports are more popular, which less so etc etc

#

What does that fall under

wooden sail
#

depends what your boss calls it 😛 possibly data analysis?

#

again, the words are made up and misused, but it's likely you'll find it under "data analysis"

green wasp
#

I don’t have a boss I’m doing it on my spare time

#

I’m a sysadmin

#

Like would that project even have a ML aspect? I can’t think of any, idk

wooden sail
#

depends on how you wanna find the quantities you mentioned. distance flown, time, fuel estimation (!! this one, especially, since estimation is one of ML's applications), popularity (!! here too if you wanna use a fancy definition of popularity other than just counting all flights. if you have info from several disjoint time periods, for example, you might wanna try and predict the behavior in the times you didn't observe and then do some weighted sum or something, or directly estimate a popularity metric that describes the observed data)

green wasp
#

Don’t have information from other time periods :/ but I could like

#

Ingest daily and predict the next day?

#

Or after a while

#

Like maybe live flight data isn’t the best to work with. Idk. Lots of updates for little information

#

Like it updates every 5 seconds, and all that changes usually is altitude and position, which I guess is what matters tbh? But also I’m not sure what to do with it

#

What could I do actually? I have the idea for the project but I’m now realizing I have a bunch of useless data that I’m collecting

wooden sail
#

idk, maybe you don't need ML. you could still find interesting statistics and cool ways of visualizing and interpreting them

#

or maybe you could see if some of the values are good predictors for the others if you find that interesting. that could let you do supervised training

green wasp
#

Mmmmm yeah but my main gripe rn is that I have a lot of data that means nothing

#

The live api is updated every 5/10 seconds and it’s great for studying and watching the planes on the map but from a data analysis/ml standpoint having that much data just adds noise I guess?

#

I’m not really sure. I’ve never done this and I’m making it up as I go, and I just now realized how much data I get and how little it changes in a short period of time

wooden sail
#

that's a cool ML task on its own, you know? data interpolation. you use the super frequently gathered data to train so that, later, your network can take much fewer samples and fill in what is missing

#

then you can replace API calls with inference

green wasp
#

👀

wooden sail
#

or alternatively, instead of filling in missing samples using the API infrequently, you could still use it frequently and exploit those slow changes to predict the future samples for some time window

#

those are both cool applications

#

that the data changes slowly means it is structured, and that makes it a good candidate for interpolation and extrapolation
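
As a rough illustration of why slowly changing data is easy to fill in, the sketch below keeps only every sixth altitude sample from a made-up climb profile and reconstructs the rest with plain linear interpolation (`np.interp`). This is a simple baseline, not the trained network described above, and all numbers are invented.

```python
import numpy as np

# Made-up track: one sample every 5 s for 2 minutes, steady climb at 10 m/s.
t_full = np.arange(0, 125, 5.0)              # seconds
altitude_full = 1000.0 + 10.0 * t_full       # metres

# Pretend we only polled the API every 30 s (every sixth sample).
t_sparse = t_full[::6]
altitude_sparse = altitude_full[::6]

# Fill in the skipped samples from the sparse ones.
altitude_filled = np.interp(t_full, t_sparse, altitude_sparse)
max_error = np.max(np.abs(altitude_filled - altitude_full))
```

Because this toy climb is perfectly linear, the reconstruction error is essentially zero; real tracks with turns and level-offs would need something better than straight lines between samples.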

green wasp
#

So like I could take the position of aircraft X

#

Train the model to predict where that will go based on waypoints in the area, the flight path and the speed at which is traveling?

#

And see if I can predict the correct flight path of the aircraft

wooden sail
#

sure

#

you know how they do those forecasts for hurricanes and tornadoes?

#

where they show current radar measurements and the current position of a storm

#

and a cone of probability where they expect it to be in the next hours/days

green wasp
#

Mmmm

#

Something similar?

#

Or what else? Hit me with ideas, I’ve never done this before so I don’t know what questions to ask the data

wooden sail
#

you could do that probably only with the location and speed. and with the other info you have, possibly predict which airport it's headed to, with a probability updated on the fly

green wasp
#

Well not super useful since the data contains the destination airport

#

But that’s just an easy test tbh!

wooden sail
#

sure, but if you're just doing this to play around, it's a nice experiment

#

it could fail, sure. then you test something else

green wasp
#

Yeah!

wooden sail
#

maybe you'll get some other idea as you're setting it all up. gotta get your hands dirty at some point

green wasp
#

Possibly

#

An idea I had is try to correlate airport destinations with holidays/festivities and weather

#

Like pick airport x. What’s the probability that that’s the diversion and the original airport is airport Y which has bad weather

#

Conversely, if that airport has bad weather what could be the diversion

#

Not sure how to go about it though

misty flint
#

i think the more you work with different datasets, the more you develop your data intuition aka what questions you can ask the data vs. getting stuck

green wasp
#

Yeah

#

I guess it’s also not very smart to plunge right into a big project like this but I’m confident that with the help of the discord server, some research and books I can at least get something working

#

Oh yeah forgor to ask! What books could help me with this project?

wooden forge
#

Hey there, I'm wondering if there is a way to get the color of a histogram plot with Matplotlib that I already plotted?
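
A minimal sketch of one way to read the color back, assuming the default bar-style histogram: `hist` returns the bar patches as its third value, and each patch reports its face color as an RGBA tuple. The data is made up.

```python
import matplotlib.colors as mcolors
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
counts, bin_edges, bars = ax.hist([1, 2, 2, 3, 3, 3])  # made-up data

# Each bar is a Rectangle patch; its face color is an RGBA tuple.
rgba = bars[0].get_facecolor()
hex_color = mcolors.to_hex(rgba)  # same color as a hex string
```

If only the Axes is still around, the bars should also be reachable through `ax.patches`.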

long locust
mint palm
#

why is it not advised to have val set when dataset is too small??

misty mulch
#

im trying to get into reinforcement learning, where should I begin?

serene scaffold
misty mulch
#

nothing rlly, was starting to get into it yesterday and learning Linear regression

main gorge
#

Does anyone know how we can fetch schema from parquet files using python. And then create schema for hive external table?

#

Please help I am totally out of context and really need help in this.

#

Please help 🙏

#

I have been searching Google for this for the past week

frigid elk
#

a quick google shows

from pyarrow.parquet import ParquetFile
ParquetFile(source).metadata
main gorge
#

Source file is on S3

#

Can I read parquet using this library?

#

And I need to connect to Hive using Pyspark and need to query

#

Does anyone know how to query hive on my local machine?

frigid elk
#
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()

pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()

per https://stackoverflow.com/a/48809552/1538838

main gorge
#

How can I create a Hive schema from the dataframe I receive?

#

I am able to fetch the dataframe, but I can't work out how to automate creating the Hive structure from it

#

I really need help creating the Hive structure from the parquet data I receive, in whatever form it comes, be it a PySpark dataframe or a pandas dataframe
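
One possible starting point, sketched under assumptions: walk the pandas dtypes, map each to a Hive column type, and assemble a `CREATE EXTERNAL TABLE` statement as a string. The helper names `hive_type` and `hive_ddl`, the table name and the bucket path are all hypothetical, and the type mapping is deliberately simplified, so it may need adjusting for your files (for example narrower integer or decimal columns).

```python
import pandas as pd
from pandas.api import types as pdt


def hive_type(dtype):
    """Map a pandas dtype to a Hive column type (simplified, illustrative)."""
    if pdt.is_bool_dtype(dtype):
        return "BOOLEAN"
    if pdt.is_integer_dtype(dtype):
        return "BIGINT"
    if pdt.is_float_dtype(dtype):
        return "DOUBLE"
    if pdt.is_datetime64_any_dtype(dtype):
        return "TIMESTAMP"
    return "STRING"  # fallback for object/text and anything unrecognised


def hive_ddl(df, table, location):
    """Build a CREATE EXTERNAL TABLE statement from a DataFrame's columns."""
    columns = ",\n".join(
        f"  `{name}` {hive_type(dtype)}" for name, dtype in df.dtypes.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{columns}\n)\n"
        f"STORED AS PARQUET\nLOCATION '{location}'"
    )


# Hypothetical frame, table name and bucket path.
sample = pd.DataFrame({"id": [1, 2], "price": [1.5, 2.0], "name": ["a", "b"]})
ddl = hive_ddl(sample, "my_db.my_table", "s3://your-bucket/path/")
print(ddl)
```

The resulting string could then be passed to whatever runs Hive statements in your setup.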

digital lynx
#

anyone know how to remove the dots

#
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("datasets/covid19cases_test.csv")
sonoma_df = df[df["area"] == "Sonoma"]

print(sonoma_df)

dates = pd.to_datetime(sonoma_df['date'])
cases = sonoma_df['cases']

#plt.style.use('seaborn')
plt.plot_date(dates, cases, linestyle='solid', linewidth=0.5)
plt.gcf().autofmt_xdate()

plt.title('Sonoma County Cases vs Time')
plt.xlabel('Date')
plt.ylabel('Cases')

plt.tight_layout()

plt.savefig('figures/sonoma_cases.png')
#

nvm i got it
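
For reference, the dots come from `plot_date`, which draws circle markers by default. A sketch of one way around it, assuming the same kind of date and case columns (the values below are made up): `plot` accepts datetime x-values directly and draws a plain line with no markers.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-ins for the 'date' and 'cases' columns.
dates = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"])
cases = [10, 15, 12]

fig, ax = plt.subplots()
(line,) = ax.plot(dates, cases, linestyle="solid", linewidth=0.5)  # no markers
fig.autofmt_xdate()
```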

main gorge
frigid elk
#

anybody out here using palantir foundry cloud? curious what your workflow looks like for development, .. prior to racking up the bills in production

echo vigil
#

How do you guys handle code review / PRs for jupyter notebooks? Since notebooks are painful to read in plain text.

pliant pewter
#

VS Code does syntax highlighting in notebooks, this is my go-to way to work with notebooks now

echo vigil
#

Ah ty, I've never used VS Code, can it show diffs?

pliant pewter
#

probably, I've only explored a fraction of its features.

#

Diffs in Jupyter notebooks, though? I'm not sure

frigid elk
pliant pewter
#

Note however: this is after installing loads of plugins. I don't even remember what I've installed anymore, but fortunately there's a feature to transport the list from one installation to another.

echo vigil
#

thanks both!

tidal bough
#

VSCode does show diffs of jupyter notebooks, but badly: it doesn't exclude cells that have no diff.

#

So you have to scroll to find the cells that actually have changes.

#

This is only with the Python extension and the stuff that comes with it like the Jupyter extension.

desert oar
#

oh nvm i was doing https

delicate apex
#

I've had a very persistent bug with Pandas for a while now, and I'm curious if any of you have a possible solution. Note that both of the problems here require pandas 1.4.0rc0 or greater, as pandas 1.3.5 works completely fine. I was hoping that the update to 1.4.2 back in April might quash these bugs, but it didn't.
I have confirmed the existence of the bugs on a separate computer.

import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
dfsty = df.style
dfsty.applymap(lambda _: 'color: red')

with open('kaboom.html', 'w') as f:
    # version 1 - fails
    dfsty.to_html(f)  # regression: breaks with ValueError

    # version 2 - incorrect
    # f.write('bye bye style tag\n')  # mangles css style below
    # dfsty.to_html('kaboom.html')  # bad css due to above

    # version 3 - works, but deprecated
    # f.write(dfsty.render())

produces (redacted for brevity, can post full traceback and details of css bug if requested)

  File "C:\Users\Thurisatic\miniconda3\lib\site-packages\pandas\io\formats\format.py", line 1220, in get_buffer
    raise ValueError("buf is not a file name and encoding is specified.")
ValueError: buf is not a file name and encoding is specified.
iron peak
#

Anyone online? I would like to ask for a favor.

desert oar
serene scaffold
iron peak
barren wedge
#

What are embeddings in PyTorch?

mellow vortex
#

how can i graph kinematic equations in matplotlib?

#

or how can i structure a kinematic equation to be a graphable function?
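
A minimal sketch, assuming the constant-acceleration case: write position as a function of time, evaluate it on an array of time values, and hand both arrays to `plot`. The function name `position` and the launch numbers are just for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np


def position(t, x0=0.0, v0=0.0, a=0.0):
    """Constant-acceleration kinematics: x(t) = x0 + v0*t + 0.5*a*t**2."""
    return x0 + v0 * t + 0.5 * a * t**2


t = np.linspace(0.0, 4.0, 100)              # seconds
x = position(t, x0=0.0, v0=20.0, a=-9.81)   # ball thrown upward at 20 m/s

fig, ax = plt.subplots()
ax.plot(t, x)
ax.set_xlabel("time (s)")
ax.set_ylabel("height (m)")
# fig.savefig("kinematics.png") or plt.show() to view the curve
```

Velocity can be treated the same way with `v0 + a * t`.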

tacit basin
brisk nest
#

Hi guys, I've been doing a comparison of the four models LiR, RR, LASSO and EN. Am I wrong to assume that if a model is the best on one metric, it should also be the best on the other metrics? Because in the image below EN is the best performing for MAE, but for MSE and R2 it is LASSO. Am I doing something wrong here?

wooden sail
#

that is indeed wrong to assume

brisk nest
#

Interesting, can you please elaborate?

wooden sail
#

each model will have its own properties, that's about it

#

that's the whole reason there are different models and different metrics. you have to try and find the best one for your use case. it goes deeper, too: you can read about the no free lunch theorems. roughly, averaged over all possible problems, all estimators (what you use to fit the chosen model) have the same performance. this means if they are good in one scenario, they are necessarily worse in others.
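
A tiny made-up example of how the ranking can flip between metrics: MAE treats all errors linearly, while MSE punishes large misses much more heavily. The two prediction sets below are invented to show this, and the helper names are just for illustration.

```python
import numpy as np

y_true = np.zeros(4)
pred_a = np.array([1.0, 1.0, 1.0, 1.0])  # small error everywhere
pred_b = np.array([0.0, 0.0, 0.0, 3.0])  # perfect except one large miss


def mae(y, p):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - p)))


def mse(y, p):
    """Mean squared error."""
    return float(np.mean((y - p) ** 2))


scores = {
    "A": (mae(y_true, pred_a), mse(y_true, pred_a)),  # (1.0, 1.0)
    "B": (mae(y_true, pred_b), mse(y_true, pred_b)),  # (0.75, 2.25)
}
```

Here B wins on MAE and A wins on MSE, so which model counts as "best" depends on which kind of error matters for the use case.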