#data-science-and-ml

Is there any reason that tanh could be causing more overfitting than ReLU in a CNN? I am trying out a CNN model, and the training accuracy with tanh is always higher than with ReLU, even though the test accuracy is more or less the same
Does someone know what kind of data this is from keras? keras.datasets.fashion_mnist
https://www.tensorflow.org/api_docs/python/tf/keras/datasets/fashion_mnist/load_data
This is a dataset of 60,000 28x28 grayscale images of 10 fashion categories, along with a test set of 10,000 images. This dataset can be used as a drop-in replacement for MNIST.
Loads the Fashion-MNIST dataset.
or similar docs on keras's site: https://keras.io/api/datasets/fashion_mnist/
yes
I needed help with something
I have a dataset that contains followers data of a Twitter account.
The data contains username and profile pic URL.
But I want the actual profile pic in JPEG format
How to get that
I have around 300k users data.
That means 300k profile pic URLs
@celest vine if the Twitter API gives you the URL for the profile picture, you can probably use requests to download the jpeg. but make sure you're not attempting to download the profile picture for the same person more than once, so you don't add needless load to their servers
this page talks about how to pick a version of a profile picture of a given size. I would download the smallest one that is suitable for your purposes: https://developer.twitter.com/en/docs/twitter-api/v1/accounts-and-users/user-profile-images-and-banners
How much time will it take to download 300k profile pics using requests?
no idea
it's going to depend on the speed of your network and of their network, and everything in between.
and possibly also rate limits.
You can experiment by downloading the few hundred first ones (using something like aiohttp, ideally) and seeing how long that takes. Extrapolate on the full 300k and decide if it's worth it.
ratelimits might be the biggest problem though
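A minimal sketch of the "download each picture only once" advice above. It uses the standard library's `urllib` instead of `requests` so it has no dependencies. The `.jpg` extension, the hashed file names, and the `delay` parameter are assumptions for illustration, not anything the Twitter API prescribes.

```python
import hashlib
import time
import urllib.request
from pathlib import Path


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download one URL and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_unique(urls, out_dir, fetch=fetch_bytes, delay=0.0):
    """Download each distinct URL once and return {url: saved path}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for url in dict.fromkeys(urls):  # drops duplicates, keeps first-seen order
        # Name the file after a hash of the URL so a rerun can skip finished work.
        name = hashlib.sha256(url.encode()).hexdigest()[:16] + ".jpg"
        target = out_dir / name
        if not target.exists():
            target.write_bytes(fetch(url))
            time.sleep(delay)  # crude pause between requests to stay polite
        saved[url] = target
    return saved
```

Timing this on the first few hundred URLs and scaling up gives a rough estimate for all 300k, rate limits aside.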
I wonder if it's possible to, without downloading a file, request its hash or something like that, to avoid redownloading equal files.
I had a similar thought. it's weird that the twitter API exposes the pfp URL but doesn't have an official way to download it. the idea of downloading them all "manually" seems a bit questionable from a TOS standpoint
I already have the pfp URLs though.
@celest vine what are you going to do with these images once you have them?
twitter PFPs could be of almost anything, so I'm not sure what one would do with a big dump of them
is it good to use SMOTE on a pretty even dataset
guys please who work with kmeans ?
I'm new to AI and all. And I've trained a neural network on it. How could I expand this to real-world images?
hmm
need to try some model stuff on aws
guess i should just expect unexpected cloud costs

Hey guys, sorry, I am new to Python. I want to read in this table that was generated by Adobe Premiere Pro. I use pandas to read it, but it gives a very weird output. Do you know why that is?
please show the code you used to read it as text
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
no worries! I just changed the encoding with the pandas function and now it works
Thanks though
Guys, any way out there to use decision tree with n-bit values?
Hello
I am facing a problem
actually I'm struggling to find a scalable solution to a problem
So here it is:
I want to merge two csv-files with soccer data. They hold different data of the same and different games (partial overlap). Normally I would do a merge with df.merge, but the problem is, that the nomenclature differs for some teams in the two Datasets. E.g. "Atletic Bilbao" is called "Club Atletic" in the second set.
So the main question is: How could I automatically analyse the differences in the two datasets naming?
so dataset 1 looks something like this:
Atletic Bilbao Leicester 2022-05-20 22:00:00 0.2812
and dataset two has the same structure but atletic bilbao is called club atletic:
hometeam awayteam date
Club Atletic Leicester 2022-05-20 22:00:00 0.2812
there are a few more columns that I want to do analysis on but basically this is what the merge is going to operate on
is it not an option to fix the data in one of them? then concat the dataframes together?
you could create a table of master_id mappings to hometeam names, .. then join your dataframes together using master_id.
hmm yeah I get what you mean but with .concat
I don't have the freedom to choose what algorithm I use for comparing the strings
Or dont I?
no algorithm needed, just replace all team names (hometeam/awayteam) with the master_id (master_team_name).
you could join all dataframes, then group by date and pull any rows that have more than one record to validate your results. ... that would give you a list of team names you'll need to fix
ok thanks I'll look into that
1. Get the unique team names from the df where the team column has issues.
2. create a dictionary using team names from #1 as key and their respective correct team names as the value/item.
3. Call the map function on df in #1 and pass your dictionary to it to fix the problem.
4. Now you can merge your df1 and df2 on team names
Yeah that sounds good too but again, what if I have more than 2 datasets?
I mean the problem's complexity grows exponentially then because for each dataframe I have to have mappings to the other data frames
That's why you need to create the dictionary; it's reusable. . If you have more dfs, so long as the faulty team names from the new df is present as a key in already existing dictionary, you can reuse it. But if otherwise, you can simply update your dictionary and you're good to go.
Yeah. So that means that I would have a dictionary of team1 of df1 to a list of the corresponding names in df2, df3 etc?
Is that what you mean?
First thing first. How many data frames do you have? 3?
Well
Sometimes
3
Sometimes more
But ideally I would like to make is so that the number of dfs doesn't matter
without having read the context, you usually want the number of dataframes in your code to be constant. if there's a step where the number of dataframes is variable, but they all have the same schema, you should be using a multiindex.
Let's presume you have 3 data frames.
- df1 ==> the team names in team column are correctly spelt no errors
- df2 and df3 ==> not so good
All you need to do is this
combined_df = pd.concat([df2, df3], axis=0)
combined_df['team_names'].unique()
Use the result to create a csv file that has two columns i.e faulty team names & correct team name (you'd have to do this part manually on your excel or somewhere)
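team_names_fixed = {}
with open('team_names.csv', encoding='utf-8') as f:
    lines = f.readlines()
# use lines[1:] instead if the csv file has a header row
for line in lines:
    bad_name, good_name = line.strip().split(',')
    team_names_fixed[bad_name] = good_name
# map returns a new column, so assign it back
combined_df['team_names'] = combined_df['team_names'].map(team_names_fixed)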
This should be able to fix the problem. If there's a better approach, feel free to try it as well.
I think I understand what your solution is but there are thousands of records in these dfs and I can't really hardcode each difference by hand. What I can do though is use some string matching algorithm to determine which names are most likely the same for all dfs.
Which I have done before but now scalability becomes a problem because I would have to hardcode this for every df
Or I could use this multiindexing thing which I haven't heard about before so Ill definitely look into that as well
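A rough sketch of the string-matching idea, assuming one dataset is treated as the canonical list of names and every other dataset is matched against it, so the number of dataframes does not matter. It uses `difflib` from the standard library. The helper name `build_name_map` and the `cutoff=0.5` threshold are assumptions for illustration, and any match near the cutoff should be checked by hand.

```python
import difflib


def build_name_map(canonical, observed, cutoff=0.5):
    """Map each observed name to its closest canonical name, or None if nothing is close."""
    lowered = {name.lower(): name for name in canonical}
    mapping = {}
    for name in observed:
        hits = difflib.get_close_matches(name.lower(), list(lowered), n=1, cutoff=cutoff)
        mapping[name] = lowered[hits[0]] if hits else None
    return mapping


canonical = ["Atletic Bilbao", "Leicester"]                  # names as spelled in dataset 1
observed = ["Club Atletic", "Leicester", "Some Unknown FC"]  # names from any other dataset
name_map = build_name_map(canonical, observed)
print(name_map)
```

Names that come back as `None` are the ones that still need a manual entry. The resulting dictionary can be passed to `map` on the team columns of each additional dataframe before merging.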
hi I am not sure if this is the right channel to ask a ques abt git hub
but I have couple of pickle ML models that I want to load on my colab
I am getting an error saying no such file or directory
does anyone know how to fix that or even load an ML model pickle on colab directly from github?
I don't think this is the right place but what's the github of the models?
u mean the github link to those pickle files?
I can dm u the link if u would like
@ashen umbra if the models are pickle files in a git repository on github, then you need to !git clone them into your colab environment.
it gives me this error:
fatal: destination path 'final_random_forest_model.sav' already exists and is not an empty directory.
because I already cloned it in my colab env
but not sure how to load it
thanks for giving part of the error message as text. I also need to see the code that caused the error.
but if the file is already there, and it's what it's supposed to be, then you need to know what library can open and use that pickle.
Does it exist as a file in your local?
actually I went ahead with loading the model from Google drive and it worked! thanks so much everyone
yess. I added in my local env
uhm... anyone got an idea how i can stop my grafana from connecting two points if there is no value inbetween, im getting CO2 values but the device crashed until like 20mins ago when i started it, so there was no data, but grafana still connected it, how do i stop that?
it doesn't receive null values, it receives nothing
(not sure if this question is for this channel, if its not just say it to me)
Hi guys, I am building an artificial intelligence. Does anyone know if there is a standard speech library I can embed in my project? Otherwise I would have to write every sentence myself
Hello all
I'm having issues CONCATENATING the output from my LSTM layer and one hot encoding values, that will finally be passed to a dense layer.
Can anyone help ??
Hi, I have a question about data text: What is get_features_names()? Why does it make more words than stopwords?
sure
someone with a bit of plotly 3dsurface knowledge in here to help me in #help-bread
anyone know any website where a shapes file for Taipei is available for download?
previously I used this https://download.geofabrik.de/asia/taiwan.html but it does not print the whole thing as the image shows
Wouldn't this be 1.8/3 = 0.6?
I'm struggling a bit with handling wheelEvents in pyqtgraph. Does anyone with pyqtgraph/qt experience have a moment to provide a bit of guidance?
Can someone clearly explain to me (a 7th-grade boy) how this works exactly? I never learned about circles, and I'm unsure how can sine need a list instead of opposite/hypotenuse
Ping me, thanks.
Also i'm sorry if asking both in a help channel AND here is against the rules, I'll delete this one if it's against.
is there anyone who can help me out with a machine learning project
@rain sand don't ask to ask
So I have a quickie question. I managed to get live flight data from flightradar24 and it only returns 1500 elements each call. I was wondering if it would be smarter to keep it open as a stream, since it seems to support it, or make a 50ms long request every 5 seconds
My objective is to gather data and make some dashboards using django and some other libraries(not sure which but I’ll find some) and study stuff like. How many flights from airport, how many are intercontinental and how many are extra continental, how many are interstate and how many extrastate and so on
I was debating using Elasticsearch and Kibana for storage and visualization but I decided to opt for mongodb and a custom django site because it’s a teaching experience
So my main question is, stream the data or make periodic requests a few ms long? Because 1500 json entries are not enough and I’m not sure they’re different entries each time
Hi guys i have a question on pandas
Hey guys, attempting to extract a google sheet content into a dataframe, feeding it into an URL:
url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet=Today's Records"
df = pd.read_csv(url)
As you can see I have a control character in the Today's Records (') and it's throwing an error as URL cannot contain control characters, unfortunately, the sheet is named thus and I cannot change it. Anyone know if it can be encoded or a workaround?
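One possible workaround, sketched under the assumption that percent-encoding the sheet name is enough for this endpoint. `urllib.parse.quote` turns the space into `%20` and the apostrophe into `%27`. The space is the more likely trigger for the "control characters" error. `SHEET_ID` is a placeholder.

```python
from urllib.parse import quote


def sheet_csv_url(sheet_id: str, sheet_name: str) -> str:
    """Build the CSV export URL with the sheet name percent-encoded."""
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name, safe='')}"
    )


url = sheet_csv_url("SHEET_ID", "Today's Records")
print(url)
```

The resulting `url` can then be passed to `pd.read_csv(url)` as before.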
Hello people, I'm working on getting better at handling different datasets, and I got one with 60 samples of olive oils coming from 4 regions, each with 570 features. I want to make an MLP that can classify them; the best I could get is an accuracy of 0.87 with 2 hidden layers of 100 neurons. I want to do better than that, and changing the topology doesn't get me any further. So I'm kind of lost on how I should process my data, given that I already passed it through a StandardScaler before training. I'm looking for a way to reduce the number of features to maybe get better results, knowing that the values of said features are spectrographic values.
how do i filter latest timestamp hourly in this df?
you could groupby hour and take the max
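A small sketch of the groupby suggestion, assuming the dataframe has a datetime column called `timestamp`. The column names and sample values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2022-05-13 10:05", "2022-05-13 10:40",
        "2022-05-13 11:15", "2022-05-13 11:59",
    ]),
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Sort by time, bucket every row into its hour, then keep the last row of each bucket.
df = df.sort_values("timestamp")
hour = df["timestamp"].dt.floor("h")
latest = df.groupby(hour).tail(1)
print(latest)
```

Using `tail(1)` keeps the whole row with the latest timestamp in each hour, whereas `max()` would take the maximum of every column independently.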
pandas operations pretty much always return copies. it's evident in the code when something you're doing overwrites your existing data
Hey guys I am running tensorflow with gpt2, I am trying to run the sequence generator file and I followed all the other steps. This mf been running for a day, how long does it take to finish and what do I do after that lol
are you using a GPU?
it's been running for + 24 hours now? Like, you've slept, rave, ate, woke up and it's still running? 😃 How large is the data? Are you using GPU?
the question on everyone's mind, apparently
Lol, hopefully it doesn't extend to 48hrs
A bit late, but under display section for the panel, you can toggle points and disable lines, there some other settings in there too
you could go either way. if you use mongodb, mongodb atlas allows the dashboards to update according to the time interval you set
i think the default for the "live dashboards" is 30s

Yooo got a m1 mac arriving tomorrow
for streaming, i believe you can use kafka or something similar
Finally goin to use a non potato
ive heard their gpu is actually one of the better ones for ML (compared to older stuff)
I might run some benchmarks if anyone’s interested
See how it stacks up v my old 2015 i5
but this is info from podcasts so dont know how accurate it is
I believe I need to use Metal
i would be interested
please ping me if you do so
It won’t be anything thorough, just a simple dataset and maybe a simple Cnn
And just time them
I’m estimating it’s going to triple training speed
lol dont get your hopes up before it arrives
fixed it by using $__timeGroup(timestamp, '5m', 0) macro
How do people access more powerful gpu on co lab
Surprising
However
Let's see how Apple's new M1 Pro and M1 Max deal with various machine learning workloads.
Blog post with results - https://www.mrdbourke.com/m1-pro-m1-max-machine-learning-speed-test-comparison
Code on GitHub - https://github.com/mrdbourke/m1-machine-learning-test
Setup your M1 Mac for machine learning video - https://youtu.be/_1CaUOHhI6U
Tbh, google co lab is OP for small projects
I would just stick with a cheap laptop and use that if my current one didn’t get super bad
And I like shiny new things
Plus u don’t need internet
i have a 1billion line sorted csv file and i would like to find a certain entry with binary search, can pandas do this?
i can implement the binary search myself, by jumping to the approx middle of the file using file.seek() and then find the next line but is there nothing better?
What is the data
Can’t u just do it the old fashioned way using min max list and divide by two ?
wowwww
What?
what do you mean by min max list?
@quartz raptor do u know binary search alg
yes
Like the one to solve that leetcode question
Is it possible to turn a Normal list into nodes?
the file is about 100gb
In fact u don’t even need nodes surely
i cant load that into memory
binance all trades since 2017
What are u looking for
no i can do it in few ms if i do it myself
but i want it even faster
im just doing some datascience on it
Well what are u looking for in the search
yes so the data is 1billion rows of trades, sorted by timestamp
and i would like to get a dataframe of all trades between two timestamps
no that opens the whole file
yes thats still slow
Gb
it will still have to load everything eventually
yes i know
but
that still reads all lines one by one just doesnt save them in memory
that still takes like a minute
I’m sorry I don’t know a faster method
i think i will do binary search with file.seek() and then pass the filestream to pandas
was just wondering if there exists a library that does that in c
The biggest dataset I’ve ever worked on is 20gb
I’ve never needed to deal with this much
I’m sure there’s a solution
If u only need to find this once just do it the normal way
The search or loading?
give me like 10mins ill show the code
well both, but once the search is done i will only load like 10k lines
Load the csv in first entirely and then do one single search
which is fast enough
One batch is 10k?
I don’t understand, if it’s time ordered you’d know roughly how much lines u need
So load until the time stamps out of desired range
if it’s only a small period of time you can cut down excess and ur left to work on your ?2000 rows of data
Let dask to take care of batching and stuff 😜
I think keeping things under X milliseconds and using binary searches is getting into SWE territory
Is the issue loading the data in or optimising search once it’s loaded
from what i understand dask is to have big data in memory
i dont need alot in memory actually
You’d have found the data u want to work on by now if u just loaded ur csv in and ran a search
no you dont get it
i will use all the data
just not all at once
so i would have to do that over and over and over again
Do it at once
Ur specs are elite
It will take 10 min
And maybe doing it over and over would be faster anyway
Try
Where did u get this data from is it public
They actually posted their entire transaction times and not allow you to download instead yearly ones
can you convert the csv to parquet and load using predicate pushdown?
you could take it a step further and partition the parquet by date ranges
how would i filter to get the latest timestamp in this diagram?
you could take it a step further and utilize spark to do the heavy lifting
Filter or order?
i would say filter to get a output like this
for this specific case
Then just Index it
Much easier than having to filter for biggest time
It seems u already done so
okay my bad, i forget to mention the rows is 74520
so i guess in general
i just print the head for that case
You want to return the two highest values ?
In entire data
Sort in order by time stamp and take the top two
the highest time stamp for each hour
I’d create new columns for each hour
and the measurement timestamp is already sorted by ascending
Now have 24 hours
Columns
Easy to find top from each
Conditional index hour 01:00 time stamps to 24:00
Or I guess u can just do a search that way
Using order and take the top row
No need to make columns
what do you mean by this?
U mean without making new columns ?
yes without making new columns
Locate indexes where hour = 15:00 order by time stamp iloc 0 th row?
I’d do that 24 times cause I forgot how to return a data frame for each one in one command
U might be better off using sql?
im trying to learn and practice my pandas foundations as i suck at it lool
but i appreciate the help
How do I replace all columns in a dataframe with a single list? I have a dataframe of shape (5,8) and a list of shape (5,) which I called a. Doing df[:] = a raises a could not broadcast shape (5,) to (5,8) error.
Hmm, maybe do df[:] = a.reshape(-1,1)
worst case scenario, you'd need to manually broadcast a to the right shape
I'm not sure I follow. are you trying rename the columns or what?
I have an empty DF of size (5,8), I just want to replace all columns with a list of size 5
I don't understand. columns are just a vertical slice of the data. if the shape of the dataframe is (5, 8), then you have five rows and 8 columns.
also, why do you have an empty dataframe in the first place?
I have a list of numbers five numbers [1,2,3,4,5]. I wanted to get a DF that has each column as that list of numbers, so in this case eight columns which look like:
1
2
3
4
5
my strategy was to create an empty df of size (5,8), and fill them all in one go
In [1]: arr = np.arange(1, 6)
In [2]: arr
Out[2]: array([1, 2, 3, 4, 5])
In [6]: arr.reshape(-1, 1).repeat(8, axis=1)
Out[6]:
array([[1, 1, 1, 1, 1, 1, 1, 1],
[2, 2, 2, 2, 2, 2, 2, 2],
[3, 3, 3, 3, 3, 3, 3, 3],
[4, 4, 4, 4, 4, 4, 4, 4],
[5, 5, 5, 5, 5, 5, 5, 5]])
In [9]: pd.DataFrame(arr.reshape(-1, 1).repeat(8, axis=1))
Out[9]:
0 1 2 3 4 5 6 7
0 1 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5 5
or, as a stand-alone expression, pd.DataFrame(np.arange(1, 6).reshape(-1, 1).repeat(8, axis=1))
with numpy/pandas, you don't want to allocate empty space for data and fill it later. the library supports doing these sorts of things in one go.
so i did this:
```python
from timeit import timeit
from pathlib import Path
import os
from typing import IO

import pandas as pd

FILE_NAME = Path(__file__).parent / '../data/trades.csv'
MAX_LINE_LENGTH = 256


def get_timestamp_from_line(line: str, col: int = 4) -> int:
    r = line.split(',')
    r = r[col]
    return int(r)


def get_line(file: IO, cursor: int) -> str:
    '''return the line in a file containing the cursor'''
    # step backwards until the previous character is a newline
    # (or we reach the start of the file)
    while cursor > 0:
        file.seek(cursor - 1)
        if file.read(1) == '\n':
            break
        cursor -= 1
    file.seek(cursor)
    return file.readline()


def binary_search_interval(from_timestamp: int, to_timestamp: int) -> pd.DataFrame:
    with open(FILE_NAME, 'r') as file:
        file.seek(0, os.SEEK_END)
        cursor = jump = file.tell() // 2  # start in the middle (file size in bytes // 2)
        # get cursor to before the line containing from_timestamp
        while jump > 1:
            # get timestamp at current step
            line = get_line(file, cursor)
            timestamp = get_timestamp_from_line(line)
            jump //= 2
            if timestamp < from_timestamp:
                cursor += jump
            elif timestamp > from_timestamp:
                cursor -= jump
            else:
                break
        start = file.tell()
        num = 0
        while True:
            line = file.readline()
            # stop at end of file or once we are past the requested interval
            if not line or get_timestamp_from_line(line) > to_timestamp:
                break
            num += 1
        file.seek(start)
        return pd.read_csv(file, nrows=num - 1)


def test_pd():
    df = pd.read_csv(FILE_NAME, nrows=1000, skiprows=10000000)
    print(df)
    ...


def test_bs():
    df = binary_search_interval(1620255222803, 1620255249925)
    print(df)


def main() -> int:
    t1 = timeit(test_pd, number=1)
    t2 = timeit(test_bs, number=1)
    print(t1)
    print(t2)
    return 0  # exit code 0 = success


if __name__ == '__main__':
    raise SystemExit(main())
```
I'll break down what this does.
pd.DataFrame(              # convert it to a DF at the end
    np.arange(1, 6)        # make an array from [1, 6)
    .reshape(-1, 1)        # make it a column
    .repeat(8, axis=1)     # repeat that column 8 times, side by side
)
I see, this makes sense. If I had a pandas series instead of an array, would this reshape and repeat logic still work?
try it 😄
the binary search runs in 7ms while pd.read_csv takes 2seconds, keep in mind that this is only a 4gb subset of the actual data so pd.read_csv should take about 40seconds on the full data while mine will stay at pretty much the same speed
sure
Are you developing software ?
It’s cool and stuff but unless I have a global template I’m cool waiting a few seconds before I look at the data
you mean as a job? no. I study math
I mean with this data
well i guess im just trying stuff out
well ok, so i am trying to tokenize this 100gb data into smaller chunks using an auto-encoder
Unless ur constantly working with 100gb files it’s prob faster to wait for a csv to load than write 200 lines making a binary search
and the auto encoder will have some attention span which has to get loaded at the same time
so to train i will have to load random chunks of the data for each batch
so i have to be able to read the whole file very fast
But you only want 2017 days don’t u
no why?
I thought u said
well some im trying to see how much noise vs information is in the btc (or any crypto)'s price.
and if it turns out that the market is inefficient i.e. the price can be predicted and isnt just random then i can maybe make a trading bot
well im also just a first year, so i didnt learn any of this (yet)
First year math ?
yes
Did u already do smaller ML projects
actually i did some work on differential equation solvers using ml
I’m curious, where did you learn to code like this as a first year math student
and i managed to outperfom state of the art in some metrics
oh im programming for like 8 years since im 14 or so
How is ml solving your equations?
Bruh ur a first year student outperforming soa ML?
i wrote a paper, but its in german
We’re talking phd here not bs right
no its not soa in ML it was outperforming soa in regular algorithms
but also only in very specific metrics
so its somewhat cheating, but still intresting
No I’m 23
oh ok, well even then at 23 you have alot of time
I can tell u no one I’ve ever met has been writing papers at 18…
How old are you?
21 now
In Germany u start uni at 20?
switzerland
Here we are first year at 18
yes because i had to go to the army for a year
Do u have a rifle?
- we have an extra year of highschool compared to everywhere else
How did you write machine learning papers at such an age
How do you find supervisor lol
Interesting. Mongodb has live dashboards? I didn’t know, how does that work? I wanted to make an interactive dashboard maker and visualizer in django, looks like there are some libraries that make it relatively easy, but I never knew mongo had something for it.
checkout mongodb atlas
Kafka still needs something to produce events. It needs a producer, but that could be it yeah, I’ll look at it. I’ve used kafka at work but we always had either a beat producing to it or a lightweight application doing it
I never hooked kafka up to an api and I’m not sure it can be done but maybe I’ve never used it to its full capabilities
you seem to know more than me about kafka. let me know what you find out though. im curious

From a quick search you could, in theory, write a connector that fetches from a rest api or use this https://www.progress.com/tutorials/jdbc/import-data-from-any-rest-api-to-kafka-incrementally-using-jdbc
I don’t like that it involves jdbc though
This looks promising
So in theory I wouldn’t need a python fetcher at all, just kafka pushing to mongo and a django site for the dashboards
Hi! I am using PySpark to read, cleanse and write csv files (the files are more than 10gb), after I read the csv file and work with it I try to save it with dataframe.write but it saves it into multiple parts, how could I save the csv file into only one big csv?
bro, google just sent me an alert of aftershock and I was like what? weirdo.
20 seconds later earthquake happened
future is now
nice. how far were you from the epicenter?
i'm trying to run the stylegan2 code here https://github.com/johndpope/stylegan2-ada
but I can't figure out how to run this command in terminal
```
# Generate curated MetFaces images without truncation (Fig.10 left)
python generate.py --outdir=out --trunc=1 --seeds=85,265,297,849 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/metfaces.pkl
```
if I copy paste it into terminal, it only runs the first line until it gets to the \ and newline
if I get rid of the newline and run it all as one line, it gives this
```
D:\Python\stylegan2-ada\stylegan2-ada-main>py generate.py --outdir=out --trunc=1 --seeds=85,265,297,849 \ --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/metfaces.pkl
2022-05-13 20:29:26.269912: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-05-13 20:29:26.270276: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
WARNING:tensorflow:From C:\Users\mac\AppData\Local\Programs\Python\Python39\lib\site-packages\tensorflow\python\compat\v2_compat.py:107: disable_resource_variables (from tensorflow.python.ops.variable_scope) is deprecated and will be removed in a future version.
Instructions for updating:
non-resource variables are not supported in the long term
usage: generate.py [-h] {generate-images,truncation-traversal,generate-latent-walk,generate-neighbors,lerp-video} ...
generate.py: error: argument command: invalid choice: '\\' (choose from 'generate-images', 'truncation-traversal', 'generate-latent-walk', 'generate-neighbors', 'lerp-video')
```
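On the backslash problem: `\` is a line continuation in bash, but the Windows `cmd` prompt uses `^` instead, so the stray `\` was passed to the script as its command argument. One way around shell quoting altogether is to build the call as an argument list. This is only a sketch: the `generate-images` subcommand is taken from the usage message above, and whether this fork accepts the remaining flags in this position is an assumption.

```python
import shlex
import sys

NETWORK_URL = "https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada/pretrained/metfaces.pkl"

# Each argument is its own list item, so no line-continuation character is needed.
cmd = [
    sys.executable, "generate.py",
    "generate-images",  # assumed subcommand, taken from the usage message
    "--outdir=out",
    "--trunc=1",
    "--seeds=85,265,297,849",
    f"--network={NETWORK_URL}",
]
print(shlex.join(cmd))

# To actually run it from inside the repository folder:
# import subprocess
# subprocess.run(cmd, check=True)
```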
wait which google service
like
do you have a google pixel?

im curious if that would also happen to me if i was near an earthquake

not sure. did you double check the requirements?
the dependencies seem pretty specific
I'm not sure I have all the dependencies, but I don't think the github project works anyway, because when I just put in seeds and network, it gave me this error
and I tried the solutions in that but they didn't change it
my worry is that the code is just out of date and no longer maintained or compatible with tensorflow 2.x
bro...it says tensorflow 2.x is not supported here...

right, but when I tried to uninstall tensorflow 2 and get a tensorflow version with pip, it couldn't find any matching packages
so I don't think this code actually works anymore
welp. thats a bummer
yeah. maybe I should try stylegan3
give it a shot
our team used pix2pix and cyclegan recently
so their stuff def works

thanks for the suggestions! are those specialized for a certain type of image generation?
they are pretty much the seminal models in the image translation problem space
image translation being like
stuff like turning daytime images into nighttime ones
and vice versa, etc.
but i believe cyclegan gives quite good results for plain image generation as well
pix2pix can be used for a lot of different image generation tasks as well
@plush jungle check this out https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix
heres a horse2zebra model using cycleGAN i believe


that's awesome
right? its pretty dope
you can also train the model to do a variety of things
my imagination is limited but the potential is def there

@misty flint any idea how to set up Cuda with gpu?
I'm on windows 10 and I definitely have an nvidia gpu, but when I try to run stylegan3 I get
```
AssertionError: Torch not compiled with CUDA enabled
```
and google is not being very clear about how to do it
not sure. have you checked this out: https://www.tensorflow.org/install/gpu
this might also be helpful https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/
The installation instructions for the CUDA Toolkit on MS-Windows systems.
theres some documentation on pytorch too https://pytorch.org/docs/master/notes/cuda.html#cuda-semantics
hmm hmm
can you do transformations inside big query

they have their Big Query ML
but like
i think thats autoML or something
tbh im not sure since even after reading it im still confused

interesting
there is google data studio
but it just looks like a worse tableau
oh i forgot to mention
earlier i tried AWS Sagemaker for the first time today
pretty interesting
its like a jupyter notebook but on aws
you might ask what could you need this for? you can train a model on aws and create an API endpoint with Sagemaker
and then create some web app to to call the API for model inference
TIL many things, like how annoying working with the cloud is 
Yeah bills can be annoying
bro luckily i stopped myself before trying anything on my personal aws account
my boss was like
oh let me get you aws access for the company
since these things can end up costing you a wild amount of money sometimes
im like
uhh ok

ill try not to go into the tens of thousands i guess

i will quadruple-check to make sure i dont leave instances running

@serene scaffold https://mlopsfluff.dstack.ai/p/notebooks-and-mlops-choose-one
In the previous issue, I wrote about what MLOps suffers from. Now that I come to think of it, I have realized that it is worth writing about one more thing that stands in our way towards MLOps. You know this thing very well. It’s Jupyter notebooks. In fairness to Jupyter notebooks, they have become the standard way of prototyping ML models all o...
i found your alternate self

this is a good summary:
For any ML model, the time spent in a Jupyter notebook is inversely proportional to its reproducibility. The reasons behind this rule are poor modularity and reusability of the code in notebooks, and poor integration with Git. The worst part of it is the habit of using notebooks which incentivizes the practices that go against reproducibility. This seems to be a vicious circle. We use notebooks because they are a great way of prototyping models or exploring data. However, the more you use notebooks, the more problems you’ll face at the deployment stage.
I'm trying to run stylegan3 code, but when I do I get
```
raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions
```
I'm on windows 10 and I've already installed ninja version (1.10.2.3)
Hope this helps!
Hi. I need to classify 25 unlabelled images, depending on their similarity.
I have never done Computer Vision. Can someone tell me a general process to tackle this, or a tutorial that does something similar?
what does embedding_dim arg do in keras.layers.Embedding?
it's the dimension of the vector space you want to use to encode a set of integers
you can think of it as a lookup function f: {0, ..., input_dim - 1} -> R^n, where n is the embedding_dim
for example, if you chose n=3, you're asking keras to represent your list of integers by using vectors in R^3, i.e. vectors of the form [x,y,z], where x,y,z are no longer integer valued
the idea is that by doing this, you assign a geometric representation to the integers of the classes or words you were encoding, and this makes it possible to look for structure like clustering of similar classes. whether the clustering is evident depends on the number of dimensions. too few makes it so that no structure is present. too many becomes wasteful because structure usually has a low rank
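To make the lookup idea concrete, here is a minimal NumPy sketch of what an embedding layer computes. It is not Keras's implementation; the names `vocab_size`, `table` and `ids` are made up for illustration, and in a real layer the table entries are trained rather than left random.

```python
import numpy as np

vocab_size = 10      # number of distinct integer ids (Keras calls this input_dim)
embedding_dim = 3    # length of the vector assigned to each id

rng = np.random.default_rng(0)
# One row per integer id; a real Embedding layer learns these values.
table = rng.normal(size=(vocab_size, embedding_dim))

ids = np.array([[1, 4, 4], [7, 0, 2]])   # a batch of 2 sequences, 3 ids each
vectors = table[ids]                     # integer indexing acts as the lookup

print(vectors.shape)   # (2, 3, 3): batch x sequence length x embedding_dim
```

The same id always maps to the same row, so the two 4s in the first sequence get identical vectors.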
Who was it that wanted tf gpu benchmarked
@misty flint ?
M1 pro using tf metal for the gpu is so fucking fast
And co lab is faster
Well
😦
if anyone cares
Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
0.632617707999998
GPU (s):
0.12009491599928879
GPU speedup over CPU: 5x
google co lab Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
4.349970917000064
GPU (s):
0.03639778600017962
GPU speedup over CPU: 119x
so... co lab is much much faster than my m1 gpu, seems wrong considering people have reported the m1 pro being faster
anyone any idea why
im using miniforge so i believe its native
approx. 45km
it is straight up google service. a page opened similar to that 2-factor number bubbles. scary as hell.
Said an earthquake just happened. beware of aftershocks.
kinda like this but this is an actual prediction.
I think they are using accelerometer data of people around and sending them faster than the shockwave. wow
The future is now.
This is probably just TF being bad. The CPU/GPU difference for convolutions should probably be much larger (on pretty much any device that has GPGPU). Some have TF slower on GPU than on CPU for the m1 (clearly something is broken).
I have read in many places that the m1 gpu beats COLAB though, and from the chart you can see i plotted training a basic neural net the k80 wins
how did those people get the correct performance
Did they do the same task?
Numpy does not use the GPU.
have to use veclib?
Also a K80 is a pretty beefy device relative to the M1, so while it could be faster than the K80, it would probably require a decent amount of effort that actually makes use of the M1 in the way that it wants to be used.
I don't think most are used to working with that kind of device.
initial benchmarks when it released quickly found it's better
Who tested it?
Can you link me to one that does the same benchmark you showed above of the 32x7x7x3 on 100x100x100x3 images?
closed it now but its
official google or tf benchmarking code
u shud be able to find it by pasting
is it possible to make a python program that plays an online driving game. (asked out of curiosity)
Well, IDK, but the largest M1 max's GPU is about 10.4 TFLOPS while the K80 is about 8.73 TFLOPS. The largest M1 pro has half as many GPU execution units as the M1 max. So just from that it seems a bit suspicious. But, if the model is not too large I could totally see the M1 being faster simply due to its GPU being integrated (memory/batch transfer rate / speed) (for training, for non-training it's probably even faster).
*Single precision.
**Nvidia lies all the time though so the 8.73 TFLOPS could be complete BS.
**But depends how much Apple is lying too.
^
So any reason why training a model on iris data takes 6ms per step on co lab and 14 on m1 pro
Last I checked in on TF the issue was not using the m1's zero copy memory transfer to the GPU so it would only be faster if you did large enough batch sizes.
Back then the CPU was often reported as faster than GPU.
So it probably is better now, but still.
So if I used a massive data set I’d find the m1 is faster than the k80?
I’ll give it a go later
The m1 or m1 pro or m1 max (and there are multiple of each)?
Maybe, gotta just try it. But if the TFLOPS are true (although they are theoretical) then no.
In practice most programs won't even make use of like 20% of the theoretical.
There’s no way google are cheating right, they wouldn’t just rig notebook to run better on iris dataset
(takes too much effort, people don't want to spend so much time on a single device when new ones come out all the time and the libraries are targeting many different ones)
And the m1 demolishes anything similar to it cpu wise, I just need to get RTX benchmark next to see
Google, Apple, Nvidia, and AMD heavily cheat on performance benchmarks and straight up lie (sometimes they pay fines for it like Nvidia, but that is a small cost for them).
(Nvidia lies probably the most IMO)
(And they also lock off parts of the hardware unless you pay for a special license or hack it)
For CPUs, yeah, the M1 should win, at least in performance per watt, and for the GPU too.
For CPUs the m1 max is very fast, but still slower than the fastest AMD CPUs.
Although they use way less power, which is what really matters if you want to have a bunch of them training stuff.
M1 should beat out current Intel CPUs (in pretty much every way) (until Intel finishes their new factory and maybe their new stuff is better).
It’s twice as fast as Ryzen 9
Which Ryzen 9?
Hey
Anyone know what happened to pywhat?
It's not working in my code
I tried to repair it
But it didn't work
The mobile CPU? Yeah that's a weaker CPU.
If by similar you mean for power consumption yeah.
The power consumption on AMD CPUs is pretty bad, but ofc if you are not worried about that, then AMD CPUs are def. faster than any M1.
!pypi pywhatkit
???
Hi is there any way I can save my numpy array of shape (30, 400, 400) = (number of slices/images, height, width) as an image file type?
PIL, scipy, and cv2 all seem to have ways of doing this
if it's just 2D images, you could also save each slice using matplotlib in a loop, using plt.savefig(...) while iterating over the slices
i want to use the library SITK which only reads images that are of 3D file types thats why I need to save the image as a 3D format
in that case you'd wanna look up how to convert exactly into the format you need, since having more than 4 layers in an image is not a standard image format that goes around
PIL doesnt seem to be able to do this
you want the image to be 30 x 400 x 400?
yh
that's not a standard image format
anyway sitk can stack them afterwards, so the format doesn't matter
even 400x400x30?
you have to look up a format that supports more than 4 layers. none of the normal ones do
sitk also says it only takes 2,3 and 4D images. what are you trying to do?
im trying to run this function
def oof3response(image=None, radii=[], resp_type=3):
    print(' ¬ Compute OOF filter response ...')
    # response_type = 3 :: sqrt(max(0, l1) .*max(0, l2));
    # OOF tensor eigenvalues :: l1 >> l2 >> l3
    # normalisation_type: blob-like (0), curvilinear (1), planar (2)
    # sigma: sigma >= min(radii), otherwise normalisation_type = 0
    opts = {'ntype': 1, 'sigma': min(image.GetSpacing()), 'use_absolute': True,
            'radii': radii, 'resp_type': resp_type}
    if min(radii)<opts['sigma'] and opts['ntype']>0:
        print('Normalisation type is set to zero since sigma<min(radii)')
        opts['ntype'] = 0
    # image
    data = sitk.GetArrayFromImage(image)
    size = [image.GetSize()[i] for i in [2,1,0]]
    spacing = [image.GetSpacing()[i] for i in [2,1,0]]
    # output
    output_data = np.zeros_like(data).astype('float64')
    # Fast Fourier Transform
    fft = np.fft.fftn(data)
    # Radius from Fourier coordinates
    x, y, z = ifft_shifted_coord_matrix(size, spacing)
    x /= size[0] * spacing[0]
    y /= size[1] * spacing[1]
    z /= size[2] * spacing[2]
    radius = np.sqrt(x**2 + y**2 + z**2) + 1e-12
...... (could not be shown in discord as hit the character limit)
the code seems to imply the image has only 3 axes
which is what Im trying to do with mine
and what did you create the image with?
originally these images are microscopy images
all right. well, sitk seems to be able to make images out of numpy arrays
but it anyway seems like you want the image as a numpy array
how do you currently have the image stored?
yh but I wont be able to do GetSize and GetSpacing
they're of PNG file format (2D slices of a 3D object)
the easiest way, to me, looks like just reading the slices, putting them into a numpy array, and giving that numpy array to sitk
yh but I wont be able to do this "GetSize and GetSpacing"
thats why Im having a problem
hmm?
your images don't have that to begin with if they are just a bunch of 2d images in a generic format
if you have the measurement parameters as separate metadata, you can put it in yourself
hmm okay so Get spacing and Getsize would be meaningless if it wasn't orignally saved as a 3D image file?
more like, if that metadata was not included into the image format
metadata meaning the size of each pixel for e.g 3 micro metres and the spacing between them?
mhm
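A minimal NumPy sketch of the "read the slices, stack them, attach the metadata yourself" route described above. The slices here are synthetic stand-ins for the PNGs, and the spacing values are hypothetical; the real numbers have to come from the microscope settings.

```python
import numpy as np

# Stand-ins for 30 PNG slices already loaded as 2D arrays (e.g. via PIL or imageio).
slices = [np.full((400, 400), i, dtype=np.uint8) for i in range(30)]

# Stack along a new first axis: (number of slices, height, width).
volume = np.stack(slices, axis=0)
print(volume.shape)  # (30, 400, 400)

# Hypothetical physical voxel size in micrometres, ordered (x, y, z).
# The PNG files do not carry this; it has to be supplied by hand.
spacing_xyz = (3.0, 3.0, 5.0)
```

With SimpleITK installed, `sitk.GetImageFromArray(volume)` followed by `image.SetSpacing(spacing_xyz)` should give an image whose `GetSize()` and `GetSpacing()` are populated. SimpleITK reports sizes in (x, y, z) order, the reverse of the NumPy axes, which is why the function above re-orders them with `[2,1,0]`.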
hi guys i used plotly express to plot 3d scatter plot and now wanted to use plotly.go for a 3d surface plot of the same data.
Sadly when i define x, y and z the same way as i did in the scatter plot i run into a java error. So i looked up the plotly site and there the data is provided as arrays not as df columns. Would someone help me to make it work?
try showing the code
```
df_dict = {}
if group_name not in df_dict:
    df_dict[group_name] = [df_new]
df_dict[group_name].append(df_new)
```
i stored dfs into a dict with the key being the group_name
```
data = {}
for group_name in df_dict:
    df_new = pd.concat(df_dict[group_name])
    df_new['Area'] = df_new['Area'].fillna(0)
    df = df_new.rename(columns={"ID#": "Compound Name:", "Temp": "Temp. [°C]"})
    if group_name not in data:
        data[group_name] = df
```
i concatenated the dfs of a group into one big df from which i plotted
```
for group_name in data:
    Headline = group_name
    df = data[group_name]
    x = df["Temp. [°C]"]
    z = df["Area"]
    y = df.index
```
```
    x=x,
    range_y=[1, 42],
    y=y,
    z=z,
    color=name,
    hover_name=df["display_name"],
    #log_z=True
)
```
and then used plotly express to plot a scatter plot.
Now i wanted to do the same thing with the go.Surface
was hoping someone could tell me whats wrong with this gym environment
import random
from gym import Env
from gym.spaces import Discrete

class ShowerEnv(Env):
    def __init__(self):
        self.action_space = Discrete(3)
        self.observation_space = Discrete(100)
        self.state = 38 + random.randint(-3, 3)
        self.shower_length = 60

    def step(self, action):
        self.state += action - 1
        self.shower_length -= 1
        # Calculating the reward
        if 37 <= self.state <= 39:
            reward = 1
        else:
            reward = -1
        # Checking if shower is done
        if self.shower_length <= 0:
            done = True
        else:
            done = False
        # Setting the placeholder for info
        info = {}
        # Returning the step information
        return self.state, reward, done, info

    def reset(self):
        self.state = 38 + random.randint(-3, 3)
        self.shower_length = 60
        return self.state
cause when im using ray rllib, im getting a shape error
ValueError: Cannot feed value of shape (11, 256) for Tensor default_policy/Placeholder_default_policy/default_policy/fc_1/kernel/Adam:0, which has shape (100, 256)
here is the ray code: https://bpa.st/6JDQ
yeah its not expected
squiggle has a point with those caveats too
any idea for a fix?
also, does anyone know how to make another env in conda default?
i dont wana use base anymore i wana use my 'ml' env
permanently
else im gona have to like do some effort to clean up and install tf on base
thats squiggle's expertise. def knows gpu stuffs
He has some ideas why but no fix
what are some cheap options to get GPUs to run machine learning models?
My current GPU (RTX 2080 TI) doesn't have enough VRAM for my models
(11 GB)
The M1 is too closed off and I don't test on it like most.
Apple just gives you their ML tools that they want you to use.
why not aws? if you need cheap try vast.ai
your optimizer expects batch size 100 but this batch has only 11 samples
activate env in .bashrc or .zshrc
Any scipy expert here?
You should always ask your actual question, not ask if people know about the topic of the question.
@spare briar mm, could you help me a bit more? how do i increase the sample size?
is your dataset size divisible by 100?
what is your dataloader
60 steps before a reset
not using a dataloader
apologies this isnt my code and my ML is really really really rusty
^
for some reason a batch has only 11 samples but your optimizer is complaining that it expects 100
what is telling the optimizer to expect 100?
my guess is you are iterating over a dataset where dataset % 100 = 11
think it would be better to keep the batch size the same tho
the batch size is defined as 100
and the optimizer has initialized with a kernel that expects this
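The remainder guess above is easy to check with plain arithmetic. This is only a sketch of how a short last batch arises; the dataset size of 1011 is made up for illustration and is not taken from the code being debugged, and `batch_sizes` is a hypothetical helper.

```python
def batch_sizes(n_samples, batch_size):
    """Return the size of each batch when n_samples is split in order."""
    full, remainder = divmod(n_samples, batch_size)
    sizes = [batch_size] * full
    if remainder:
        sizes.append(remainder)   # the last, smaller batch
    return sizes

sizes = batch_sizes(1011, 100)
print(sizes[-1])   # 11
print(len(sizes))  # 11 batches: ten full ones and one of 11 samples
```

If a short last batch really is the cause, the usual options are to drop it or to pick a batch size that divides the dataset evenly.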
thanks @spare briar
Got another ray rllib question
really got no idea whats happening
Worker crashed during call to step_attempt(). To try to continue training without the failed worker, set ignore_worker_failures=True.
code + env + code traceback ~> https://bpa.st/YOEA
redshift is basically amazon's version of postgres right?
i bring this up because apparently gcp's big query is used more for actual data warehousing
and has cooler features for ML apparently

UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior
What is this telling ?
Is it bad bad?
just saw this, I'm a pyqtgraph maintainer, might be able to help, what's up
no. redshift is more like presto or trino
ah i see, i see
what am i thinking of? amazon RDS?
tbh im still trying to learn the services atm

postgres is more like mysql or RDS/aurora
Hello everyone. I am going to do thesis on machine learning, AI. For that, I should learn Python. I do have basic programming knowledge in both C and Java. Can anyone suggest me the best free tutorial or road map which will be a huge help for my upcoming thesis? Thanks in advance!!
how long is your thesis going to take?
I fed the whole X matrix and y labels to the cross_val_score function. Does it do the train test split automatically for each fold and calculate the accuracy? If yes then what's the point of using the k-fold generator which returns the indexes for each fold's train and test, if I just want the accuracy.
1 year. 3 semester each 4 month.
what is it going to be about specifically?
I haven't selected any specific topic yet. I do have interest in Machine Learning and AI
well, we have this: https://www.pythondiscord.com/resources/?topics=data-science
We're a large, friendly community focused around the Python programming language. Our community is open to those who wish to learn the language, as well as those looking to help others.
thank you so much!!
how do you measure precision in a class with no samples
this is just telling you it chose to set the scores to 0
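A small pure-Python sketch of where that warning comes from. Precision for a label is TP / (TP + FP), and if the model never predicts that label the ratio is 0/0. The `precision` helper and its `zero_division` argument below are hand-written to mirror the idea and are not scikit-learn's code.

```python
def precision(y_true, y_pred, label, zero_division=0.0):
    """Precision for one label: TP / (TP + FP)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    if tp + fp == 0:
        # The label was never predicted, so the ratio is 0/0.
        return zero_division
    return tp / (tp + fp)

y_true = [0, 0, 1, 2]
y_pred = [0, 0, 0, 0]   # the model only ever predicts class 0

print(precision(y_true, y_pred, label=0))  # 0.5
print(precision(y_true, y_pred, label=2))  # 0.0
```

So the warning is not an error in itself, but it does say that at least one class never shows up in the predictions, which is often worth a look.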
if a tree falls in a forest and Helen Keller is--nvm
How can I explore a Boolean column. Other than value counts
Do those bools correlate with another feature in an interesting way?
Just wanna do it solo rn. @serene scaffold
Not sure what you mean. I don't think there's any inherent virtue in looking at columns of data in isolation.
There is when you are in school 🤪
Well It's just analysis, like finding how data is distributed, outliers, etc.
For categorical ones I just wrote the percentage of each unique value.
Can the samples you use for accuracy be from the dataset used to train the model?
for validation, you mean?
if so, it's a bad idea to do that. some recent papers have shown that under relatively mild conditions, the error w.r.t. the training data set will decay to 0, and this tells you nothing about the predictive power when the model is used on new data
I wouldn't recommend it as it doesn't show if the model is overfitting and you can't see how well it works on data it's never seen before. Instead use validation_split=0.1 as a parameter in the model.fit function
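For anyone who wants to see what holding out 10% looks like by hand, here is a NumPy sketch with made-up data. It shuffles first and then splits, which is a design choice of this example rather than a description of what Keras does internally.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))          # made-up features
y = rng.integers(0, 2, size=100)       # made-up binary labels

# Shuffle once, then set the first 10% of the shuffled order aside for validation.
order = rng.permutation(len(X))
n_val = len(X) // 10
val_idx, train_idx = order[:n_val], order[n_val:]

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

print(len(X_train), len(X_val))  # 90 10
```

The model is then fitted on `X_train` only, and accuracy is reported on `X_val`, which it has never seen.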
Alright thanks amigos
what models
Rendering
Computer vision
that's why I need more than 11 gb of Vram because I'm limited on the size of the batch at 7
can't go beyond or else I run out of Vram after 1 epoch and it crashs
bump
Which is better for ML applications Django or Flask for machine learning deployments, regardless of the learning curve 😶?
hmm are either of those for ML?
lol wut, those arent for ML.... they are for webapps n the like
I should have made the question clear 
By samples you mean labeled data?
@misty flint i must say the laptops great
Hello i am recently started learning about ml and encountered SVR and i understood the theory behind it but am having a hard time understanding the maths behind it can anyone teach me?
how much do you understand about it currently?
start with SVM?
i know that it creates a tube where error is accepted and all the values outside the tube are support vectors so i wanna know what happens mathamatically . i am not good at maths so i am having trouble understanding.
i started with linear regression then polynomial regression then came svr
Hi im about to go to college and they offered me a scholarship for either an
information systems degree or business analytics
Is there anyone in the field who can help me with which one is more related to data science
I like to make webscraping scripts and manipulate and play with the data
So i want to ask which speciality is the closest to my interests because I am not yet familiar with the field of databases or data analysis
And sorry for the disturbance but it's a pivotal point in my life, and I want to ask, so as not to regret it
Is there a list of classes/units that each degree will take? Or maybe a summary of the degree contents? At my university if you just search the degree you'll find a summary of the degree and the core units that you must take, which should be helpful
@rose agate i have the curriculum for both of them, can you look at them and tell me what you think??
I don't know if I'd be able to say what's best, but maybe send them in this chat and someone else can try?
University courses usually sucks when it come to these. But if you go in depth business analytics will be more useful to you compared to IS.
thanks for the advise
I still suggest make your own decision.
Hi I'm trying to bin some values but the data frame has NaN values and the max count in bins is 2, i'm not sure why though, any help appreciated
```
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.fft import rfft, rfftfreq
from scipy import fftpack
import scipy.signal as sg
from PyAstronomy.pyasl import foldAt
#method 2 for binning
t=np.arange(0,1.024, 4e-3)
fr=60.546875# fundamental frequency
Tp=1/fr#time period
phases = foldAt(t, Tp, T0=0)
plt.figure()
#data is the intensity
xp, yd = zip(*sorted(zip(phases, data)))#ensures x and y values correspond to each others in pairs when sorted
plt.plot(xp, yd)#plot the phase xp and Flux
df4 = pd.DataFrame({'X' : xp, 'Y' : yd}) #we build a dataframe from the data
M=20#no of bins
bins=np.arange(0, max(phases), step=Tp/M)
categorical_object = pd.cut(xp, bins)
count=pd.value_counts(categorical_object)#count in bins
grp = df4.groupby(by = categorical_object) #we group the data by the cut
ret = grp.aggregate(np.mean)#calculates the mean in each bin
plt.figure()
plt.plot(ret.X, ret.Y)
```
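One possible explanation, offered as a sketch rather than a confirmed diagnosis: if `foldAt` returns phases in [0, 1), then a bin step of `Tp/M` (roughly 0.0008) cuts that range into on the order of a thousand bins for a few hundred samples, so most bins are empty (NaN mean) and the rest hold one or two points. The NumPy-only version below folds by hand and uses a bin width of `1/M` in phase units instead; `data` is a synthetic stand-in because the real intensities are not shown.

```python
import numpy as np

t = np.arange(0, 1.024, 4e-3)        # same time grid as above
period = 1 / 60.546875               # same period as above
# Synthetic stand-in for `data`, since the real intensities are not shown.
data = np.sin(2 * np.pi * t / period)

phases = (t / period) % 1.0          # fold by hand: phase in [0, 1)

n_bins = 20
edges = np.linspace(0.0, 1.0, n_bins + 1)   # bin width 1/n_bins in phase units
bin_idx = np.digitize(phases, edges) - 1    # values 0 .. n_bins-1

counts = np.bincount(bin_idx, minlength=n_bins)
sums = np.bincount(bin_idx, weights=data, minlength=n_bins)
# Mean per bin; an empty bin would become NaN instead of raising a warning.
means = np.divide(sums, counts, out=np.full(n_bins, np.nan), where=counts > 0)

print(counts)
```

With 20 bins across the whole phase range, every bin should receive several samples, so the per-bin means are defined everywhere.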
awesome, thanks! I kiiiind of solved my problem for now. I was trying to get an ImageView to scroll through slices of a (2D) matrix with mouse wheel scrolls, but it wasn't calling the wheelEvent function I had defined. I ended up discovering that ImageView either doesn't inherit the wheelEvent function or isn't passed wheelEvent calls, but the view attribute does get them, so I can just override the wheelEvent definition for the view attribute. It's still a bit hacky, but it works. I appreciate the offer for help! I'll hit you up if I have more questions; I'm trying to exercise pyqtgraph for a work project, so I'm sure I'll find more bits and bobs.
that's a clever use of wheelEvent; but yeah not really hacky to monkey-patch in this case, you sort of have to do that, hacky or not 😆
ValueError: `logits` and `labels` must have the same shape, received ((32, 2) vs (32, 1)). What's this about?
What's a logit
damn stop making me jealous

can you show the code related to the error?
Well it's working now, it was the loss function I had used
it's pretty big lemme try a site paster
but I was using loss="binary_crossentropy"
@hasty kiln go with django
Does anyone know how to add 2 gray scale images as numpy arrays (that are between 0-1) i.e. just overlaying the images over each other?
if they're already numpy arrays, just do +. they need to have the same or broadcastable sizes though
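A short sketch of two ways to combine the arrays, using tiny made-up images. A plain sum can leave the 0-1 range, so it is clipped here; taking the element-wise maximum is an alternative overlay that never exceeds 1. Which one looks right depends on the images.

```python
import numpy as np

a = np.array([[0.0, 0.8], [0.5, 1.0]])
b = np.array([[0.0, 0.6], [0.5, 0.0]])

added = np.clip(a + b, 0.0, 1.0)   # plain sum, clipped back into [0, 1]
brightest = np.maximum(a, b)       # keeps the brighter pixel at each position

print(added)
print(brightest)
```

Note that two mid-grey pixels sum to white, while the maximum leaves them mid-grey, so the two results can differ visibly.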
hmm but i get unwanted grey pixels that shouldnt be there
logit is a statistics term; practically, in this case it probably means the unnormalized probabilities of classes in a classifier. The function you're calling is trying to compare an array of probability predictions to the correct labels and it's an element-wise comparison, so the arrays need to be the same shape. It looks like the logits you're passing aren't a single number per prediction, but two numbers (the second dimension of the array shape is 2) while the labels only have 1. Check the output of whatever neural network layer or classifier you're coding; if the labels are shape (32,1) the logits have to be (32,1) also
wdym they shouldn't be there?
I finally understand what I mean 😂😂😂, I will probably go with Django
How much stuff/programming do i need to program a website/app that detects if something is an apple or not
any of the major clouds have APIs that can do object recognition like that
if you dont want to build your own
otherwise you could probably build your own model with ImageNet
How can I replace all values of nan with 0?
atm Im using this CNN_values2 = [[n if n!=np.nan else 0 for n in k[:4096]] for k in data_array2] but it still gives me nan values
I am working with an xml file that I am parsing to a dict with xmltodict. Is there a way to use a list to call keys to sheet the info I am wanting out of this dict. I.E. data[list] instead of data[key1][0][key2]
Get*
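One way to do this with only the standard library is to fold `operator.getitem` over the list of keys. The `get_path` helper and the sample structure are made up for illustration; the structure is only shaped like typical xmltodict output.

```python
from functools import reduce
from operator import getitem

def get_path(data, path):
    """Follow a list of keys and list indexes into nested dicts/lists."""
    return reduce(getitem, path, data)

# Made-up structure shaped like xmltodict output.
data = {"root": {"item": [{"name": "first"}, {"name": "second"}]}}

print(get_path(data, ["root", "item", 1, "name"]))  # second
```

A missing key still raises `KeyError` (or `IndexError` for a list), the same as chained square brackets would.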
one of the properties of np.nan is that it is not equal to anything, including nan. you should instead use numpy's isnan function, like so:
In [1]: import numpy as np
In [2]: x = np.nan
In [3]: x
Out[3]: nan
In [4]: x == np.nan
Out[4]: False
In [5]: np.isnan(x)
Out[5]: True
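Applied to a whole array, the same check can be done without a Python loop. A minimal sketch with a small made-up array:

```python
import numpy as np

data = np.array([[1.0, np.nan, 3.0], [np.nan, 5.0, 6.0]])

cleaned = np.where(np.isnan(data), 0.0, data)   # explicit mask
same = np.nan_to_num(data, nan=0.0)             # built-in shortcut

print(cleaned)
```

Both lines return a new array and leave `data` unchanged.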
Thank you very much!
I have the stupidest issue:
boxplot_df = pd.DataFrame()
for currency in currencies:
    sub_df = evaluation_df.query(f'{currency}'.format(currency=currency))
    boxplot_df [f'{currency}_allin'.format(currency=currency)] = sub_df['allin_trades'].copy()
I have an empty boxplot_df and want to add the column 'allin_trades' from the evaluation_df, if the string in the column 'currency' matches the currency i am currently looping through
If I run this code, I have the expected result for bitcoin, which is my first currency in the list, but for all following currencies I receive a 'nan' even though I can clearly see the populated column in the queried sub_df
What is the issue here? There must be some stupid bug that I am missing
@gloomy anvil you're using f strings and .format at the same time, which is wrong.
You can't selectively add columns. Every row always has an element for every column, and vice versa.
Is there any data transformation going on here? Or are you just copying certain columns from one df to another?
Because you don't want to allocate empty DataFrames and then copy stuff into it. If you want a new df that contains copies of the columns in a given df, you just do df[['a', 'b', 'c']] and what you get are copies.
thanks! I fixed the format as you can see in the screenshot at the bottom
Can you show evaluation_df?
Also I'm on my phone, so my explanations will be high level
evaluation_df
it also has a column at the end called 'allin_trades'
the query part works fine as well. I can create the sub_df based on the currency and I can see the populated 'allin_trades' column
but somehow it does not copy to the boxplot_df... is this some stupid typo i might have? I am losing my mind about this simple stupid issue
@gloomy anvil look into how to pivot a DataFrame. Because that's what you're actually trying to do.
will do! but is there any reason why the last line boxplot_df[f'{currency}_allin'] = sub_df['allin_trades'].copy() does not work?
That's not how pandas works. You don't allocate empty space and put stuff into it later
Also, what is each row supposed to represent?
each row should represent the return of different classification models for each currency. I want to create a boxplot for each column/currency
@gloomy anvil there are columns like TN, FP-- are these supposed to be the rows in the desired df?
No, I have a evaluation_df with all currencies, models and evaluation data like True positives (TP, FP); and so on, as well as their return ('allin_trades'), respectively. I query the evaluation_df by currency and create a sub_df with all the data from the evaluation_df for the currency, that I queried. From the sub_df I want to take only the column 'allin_trades' and copy it to the boxplot_df into a new column with the currencies name
and by the way: thank you for your time and effort to help me here. I feel so lost and I am very grateful for that
figured it out
i don't know why but this here works: all fields populated
I just added reset_index(drop=True).
saw it at stackoverflow, copy, pasted, works. Don't know what it does or why it works though
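A likely reason `reset_index(drop=True)` helps, shown on a tiny made-up frame: pandas aligns a Series to the target DataFrame by index label when it is assigned as a column. The first currency's rows define the index of `boxplot_df`; later currencies live at different row labels, so nothing lines up and the column fills with NaN. The column names follow the messages above, but the data is invented.

```python
import pandas as pd

evaluation_df = pd.DataFrame({
    "currency": ["bitcoin", "bitcoin", "ethereum", "ethereum"],
    "allin_trades": [1.0, 2.0, 3.0, 4.0],
})

boxplot_df = pd.DataFrame()
for currency in ["bitcoin", "ethereum"]:
    sub = evaluation_df.loc[evaluation_df["currency"] == currency, "allin_trades"]
    # Without reset_index, "ethereum" keeps labels 2 and 3, which do not exist
    # in boxplot_df (labels 0 and 1), so the new column would be all NaN.
    boxplot_df[f"{currency}_allin"] = sub.reset_index(drop=True)

print(boxplot_df)
```

This only lines up neatly when every currency has the same number of rows; a pivot, as suggested above, handles the general case.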
thanks again for your help and your time. very appreciated
hello fellas im tryna generate a image using perlin noise
where the white spots would be grey colored and the black spots would be transparent
(0,0,0,0)
I have tried countless efforts
but everything is slow
atleast 1 minute
are you using libraries like numpy and what resolution images are you generating
how often will I have to write sql queries to retrieve data from databases, I mean we do have pandas and I can work on it to perform the necessary changes
I am familiar with the sql queries but do I need to have extremely solid grip on it or is it fine if I work on the data using libraries like pandas?
no im using some noise module from pypi and the resolution is 2048x2048
you can try making some perlin noise in numpy might be a bit faster
also how does the code look there might be other ways to improve the speed a bit
and is it possible to find out what parts take long (also i wont be available to help a lot because will go to bed now it is like 1am for me)
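If the slow part turns out to be a per-pixel Python loop that colours the image, that step can be done on the whole array at once. This sketch assumes the noise is already available as a 2D array scaled to [0, 1]; random values stand in for it here, and the grey level of 128 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a 2048x2048 noise field scaled to [0, 1]; any noise source works.
noise = rng.random((2048, 2048))

rgba = np.zeros(noise.shape + (4,), dtype=np.uint8)
rgba[..., :3] = 128                               # constant grey colour
rgba[..., 3] = (noise * 255).astype(np.uint8)     # bright noise = opaque, dark = transparent

print(rgba.shape, rgba.dtype)
```

With Pillow installed, `Image.fromarray(rgba)` should turn the array into an RGBA image that can be saved as PNG.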
im trying this currently
we on the same timezone 
im currently getting this
hello guys
im looking for a data scientist in my team
im currently working on a game project and its related to AI and machine learning
im also a data scientist
but i need your help
if you're interested just pm me
@marsh yacht please post the GitHub for the project in the chat. If this is a closed source project, kindly remove your messages
do u have any knowledge in deep learning?
mad sus buddy
why do you ask?
I got an internship for deep learning and I’m kinda screwed lol. just asking if it’s possible to learn the basics in 24 hours
absolutely not.
but if it's an internship, they don't necessarily expect you to know.
could I learn the basics of PyTorch in 24 hours?
probably not.
could I DM you a picture of the email he sent? I’m not sure what the interview is asking for
so you haven't been offered the internship? you just have an interview for it?
being interviewed for positions is normal. but why don't you copy and paste the email into this chat, with sensitive parts censored out?
alright
The Intro slides to MARL can be found here. You may be interested in this blog post which gives an exciting overview into the future of Artificial Intelligence with Multi-agent Reinforcement Learning. A summary of MARL is attached as a Word document along with relevant papers in MARL.
This summer, 4 students (1 UMD CS Undergrad, 1 Virginia Tech CS Undergrad, 2 Blair HS Junior students) will be working with me on MARL.
Working on the project will involve coding in python, pytorch and a background mathematical knowledge specifically in calculus, probability and optimization. All 4 students currently in my team are prepared in these areas so I will directly start introducing Reinforcement Learning from May 23 before the actual project in Multi-Agent Reinforcement Learning begins from May 30.
This position is unpaid. No Professor will be involved in this Summer research project. So if that's something that you are not interested in, please let me know immediately.
If you are interested, please meet me tomorrow at 7pm EST for a 20 minute time slot on Zoom. I will evaluate your Python programming skills, background Mathematical knowledge, and willingness to learn new things quickly. In case you qualify for the position, I will let you know immediately. Then, please let me know about your interest in the position by May 17.
if the position isn't paid, I wouldn't even bother
oh I’m in high school, all internships are non paid practically. Getting an internship is lucky enough…
internships for me right now are for experience and resume
I suppose. but once you're in college/university, don't waste your time on unpaid internships.
if you don't already have "a background mathematical knowledge specifically in calculus, probability and optimization", there is no way you can learn enough in 24 hours to become a more competitive applicant for this position than you are currently.
yeah, I’m terrible at calc, and lin alg. should I just give up?
you can still continue with the process for the interview experience. and who knows, maybe something interesting will happen.
yeah, hopefully. do you think the python testing part will be difficult?
I’m terrible at programming under a time limit or stress
if you already know Python basics, it's unlikely that they'd ask you anything particularly esoteric. but I don't think you can really learn how to use pytorch if you don't understand deep learning in general
yeah, gotcha. is it inappropriate to ask programming questions in a channel during the interview?
I suppose that counts as cheating, but if you're in a Zoom interview, you can't just pause everything while you type out your question in Discord
and if you don't know the answers to their questions, just tell them what you do know. if you're squirming, they'll probably move on to a different question.
how do zoom interviews work?
I added a second Dense layer with 64 neurons and softmax activation, apparently it makes my net give the same predict results to all predicition
It's quite a simple problem, the simpler the less neurons required I suppose?
you have your webcam on and you talk to them like you're sitting at a table. it's basically the same as in-person interviews, except you secretly don't wear pants, and everyone secretly knows it.
holy do they actually know I don’t wear pants??
jkjk
they will know if you cheat lol
but anyway if youre in highschool and its unpaid, cant hurt to try for the interview
high schoolers have plenty of time anyways lol
might as well spend your summer doing something productive/educational
in general, or just over the summer? because I think the amount of time HS students spend at school, plus commuting to school and back, plus homework, plus the kajillion extracurriculars that kids are pressured to participate in, is very unreasonable.
what would be a good way to visualise co-occurrence matrix? i'm thinking something like a graph network visualisation where the co-occurrence value determines the thickness of the line between two nodes but i don't think there are good ways to integrate with a pandas dataframe
one i'm looking at is holoviews, but it seems way too complicated for my purpose
I'm thinking along the lines of R's igraph
a common approach is to just plot the matrix itself as an image
I know it's an option, but in the field something like this is more common
Is there a canonical graduate level textbook for data science and AI/ML?
what would be a good place to draw some 2d graphs? like I want to have some vectors over a graph, and then show possible vectors having norm of one.
I managed to achieve it by using python-igraph! The documentation is quite poor, but it seems to integrate very well with pandas.
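For the network-style plot described above, one possible first step is to turn the co-occurrence matrix into a weighted edge list, which graph libraries generally accept in some form. A minimal sketch in plain Python; the labels, the counts and the `cooccurrence_to_edges` helper are made up for illustration:

```python
def cooccurrence_to_edges(labels, matrix, min_weight=1):
    """Return (label_a, label_b, weight) for each pair above the diagonal."""
    edges = []
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):  # symmetric matrix: upper triangle only
            weight = matrix[i][j]
            if weight >= min_weight:
                edges.append((labels[i], labels[j], weight))
    return edges


# Hypothetical symmetric co-occurrence counts for three terms.
labels = ["a", "b", "c"]
matrix = [
    [0, 3, 0],
    [3, 0, 5],
    [0, 5, 0],
]
edges = cooccurrence_to_edges(labels, matrix)

# Scale the weights into line widths, so the strongest pair gets the thickest line.
max_weight = max(weight for _, _, weight in edges)
widths = [4 * weight / max_weight for _, _, weight in edges]
```

With a pandas DataFrame, `list(df.index)` and `df.values.tolist()` should give the same two inputs.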
hello can anyone figure out where is the issue in this code ?
"ConnectionError: HTTPSConnectionPool(host='www.mubawab.tnhttps', port=443): Max retries exceeded with url: //www.mubawab.tn/en/a/7429900/apartment-for-sale-in-les-jardins-de-carthage-2-rooms-reinforced-door-and-double-glazing- (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002357B8BB6D0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')) " this is the error that i get
Can someone just tell me for this code:
def speak(audio):
    path = r"D:\Python\Jarvis\Voice_Files\Voice"
    directories = os.listdir(path)
    # This would print all the files and directories
    for file in directories:
        filename = audio.replace(" ", "_")
        file_path = (rf"{path}\\{filename}"+".mp3")
        playsound(file_path)
why does the mp3 file keeps on repeating?
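On the repeating mp3: assuming `playsound(file_path)` sits inside the `for` loop, the same file gets played once per entry in the folder, because the loop body never uses `file`. A minimal sketch without the loop; the `play` parameter is a stand-in added here so the function can be tried without audio, and in the real script it would be `playsound`:

```python
import os


def speak(audio, play, path=r"D:\Python\Jarvis\Voice_Files\Voice"):
    """Play the single mp3 whose name matches the phrase, once."""
    filename = audio.replace(" ", "_") + ".mp3"
    file_path = os.path.join(path, filename)
    play(file_path)  # called exactly once, no loop over the folder
    return file_path


# Stand-in player that only records what it was asked to play.
played = []
speak("good morning", played.append, path="voices")
```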
the ones that dont do extracurriculars have all the time in the world 
that’s true
if youre serious about ML Engineering, this just came out from one of my favorite ML Eng
so what do you think the professor is going to ask me math and programming wise?
some would say Artificial Intelligence: A Modern Approach by russell and norvig. Pattern Recognition and Machine Learning by bishop is another common one. they dont really cover data science per se, more statistics. the former covers more of the newer ML methods including CV, NLP, etc., while the latter covers more classical ML
bit dense if you arent graduate level so..you have been warned

how complex could the programming prompt be if he’s not expecting much?
Both of those books are 2nd year undergrad level (not to knock them, Bishop’s book might be my favorite textbook). For grad level content, if you let me know a more specific subfield you are interested in i might have recs
i guess thats fair. but they are canonical imo

They sound like a good starting point, I kind of need something general before getting into specifics
Are there any good books specifically about Python tools for ML (and related)? I've used Pandas a tiny bit
But would like to learn stuff like TensorFlow
yes definitely must read bishop
youre looking for more o'reilly books then; those tend to be more practical than theoretical
but like anokhi said
def check out bishop
for sure
Hey @mighty spoke!
You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.
Helo
I’ve checked the pins. Is it right to say that machine learning is a step up from data analytics? I try and find books about data analysis and I get ML books 
Hi guys i have 2 arrays (one of them is just filled with 0's and 1's lets say this is arr2) after adding them i.e. arr1 +arr2, I only want to divide by 2 on those elements in arr1 that have been changed, can anyone help me with this?
if using numpy array, you can do something like
...
my_sum = arr1 + arr2
indices = my_sum != arr1
my_sum[indices] = my_sum[indices]/2
...
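Spelled out as something runnable: the mask needs to be True where the sum differs from `arr1`, since those are the elements that changed. A minimal sketch with made-up values:

```python
import numpy as np

arr1 = np.array([4.0, 6.0, 8.0, 10.0])  # made-up values
arr2 = np.array([0, 1, 0, 1])           # only 0s and 1s

my_sum = arr1 + arr2
changed = my_sum != arr1                # True where adding arr2 changed the value
my_sum[changed] = my_sum[changed] / 2   # halve only the changed elements
```

`arr1` is a float array here on purpose. With integer arrays the halved values get truncated when they are written back into `my_sum`.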
there is a decent amount of overlap. but data analysis tends to be more descriptive analysis (looking at the past) while classical ML tends to be predictive or even prescriptive analysis (looking at the future)
you are better off looking at statistics books for data analysis, since that would lay a good foundation i believe
Mhh
ok so I guess like. You need both?
Like I should have asked my question better. Is data analysis as a role more machine learning based, data analysis based or both?
it's kinda difficult to make a clear-cut distinction, since these are buzz words that grew out of something else
to the extent that what "data analysis" and "machine learning" means depends on who you ask. if you have a job in mind, you can see what they are asking for. if not, it's kinda hard to say
e.g. you could do data analysis to prep data for machine learning. or, you could do machine learning as part of your data analysis to extract some useful info
you could put both under estimation theory, which falls under statistics... along with an overlap with other topics here and there
Mmmmm I see. I thought the field would be more like. Idk. Intertwined?
Like ok practical example. My pet project
I’m fetching live flight data and I want to make some visualization dashboards like total distance flown in a day, total time, fuel estimation, how many aircraft of a given type
Which airports are more popular, which less so etc etc
What does that fall under
depends what your boss calls it 😛 possibly data analysis?
again, the words are made up and misused, but it's likely you'll find it under "data analysis"
I don’t have a boss I’m doing it on my spare time
I’m a sysadmin 
Like would that project even have a ML aspect? I can’t think of any
idk
depends on how you wanna find the quantities you mentioned. distance flown, time, fuel estimation (!! this one, especially, since estimation is one of ML's applications), popularity (!! here too if you wanna use a fancy definition of popularity other than just counting all flights. if you have info from several disjoint time periods, for example, you might wanna try and predict the behavior in the times you didn't observe and then do some weighted sum or something, or directly estimate a popularity metric that describes the observed data)
Don’t have information from other time periods :/ but I could like
Ingest daily and predict the next day?
Or after a while
Like maybe live flight data isn’t the best to work with. Idk. Lots of updates for little information
Like it updates every 5 seconds, and all that changes usually is altitude and position, which I guess is what matters tbh? But also I’m not sure what to do with it
What could I do actually? I have the idea for the project but I’m now realizing i have a bunch of useless data that I’m collecting
idk, maybe you don't need ML. you could still find interesting statistics and cool ways of visualizing and interpreting them
or maybe you could see if some of the values are good predictors for the others if you find that interesting. that could let you do supervised training
Mmmmm yeah but my main gripe rn is that I have a lot of data that means nothing
The live api is updated every 5/10 seconds and it’s great for studying and watching the planes on the map but from a data analysis/ml standpoint having that much data just adds noise I guess?
I’m not really sure. I’ve never done this and I’m making it up as I go, and I just now realized how much data I get and how little it changes in a short period of time
that's a cool ML task on its own, you know? data interpolation. you use the super frequently gathered data to train so that, later, your network can take much fewer samples and fill in what is missing
then you can replace API calls with inference
👀
or alternatively, instead of filling in missing samples using the API infrequently, you could still use it frequently and exploit those slow changes to predict the future samples for some time window
those are both cool applications
that the data changes slowly means it is structured, and that makes it a good candidate for inter and extrapolation
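To get a feel for how much a slowly changing signal can be thinned out, a plain linear interpolation makes a reasonable baseline before training anything. A minimal sketch with `np.interp` standing in for the network mentioned above; the 5-second altitude samples are synthetic:

```python
import numpy as np

# Synthetic track: one altitude sample every 5 s during a steady climb.
t_full = np.arange(0, 65, 5)        # 0, 5, ..., 60 seconds
alt_full = 1000.0 + 12.0 * t_full   # feet, made-up climb rate

# Keep only every 4th sample, as if the API were polled every 20 s.
t_sparse = t_full[::4]
alt_sparse = alt_full[::4]

# Fill the gaps by linear interpolation and measure the worst error.
alt_filled = np.interp(t_full, t_sparse, alt_sparse)
max_error = float(np.max(np.abs(alt_filled - alt_full)))
```

On this perfectly linear toy track the error is essentially zero. On real tracks it will not be, and how fast it grows with the polling interval gives a rough idea of how much of the 5-second data is redundant.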

So like I could take the position of aircraft X
Train the model to predict where it will go based on waypoints in the area, the flight path and the speed at which it is traveling?
And see if I can predict the correct flight path of the aircraft
sure
you know how they do those forecasts for hurricanes and tornadoes?
where they show current radar measurements and the current position of a storm
and a cone of probability where they expect it to be in the next hours/days
Mmmm
Something similar?
Or what else? Hit me with ideas, I’ve never done this before so I don’t know what questions to ask the data 
you could do that probably only with the location and speed. and with the other info you have, possibly predict which airport it's headed to, with a probability updated on the fly
Yah!
Well not super useful since the data contains the destination airport
But that’s just an easy test tbh!
sure, but if you're just doing this to play around, it's a nice experiment
it could fail, sure. then you test something else
maybe you'll get some other idea as you're setting it all up. gotta get your hands dirty at some point
Possibly
An idea I had is try to correlate airport destinations with holidays/festivities and weather
Like pick airport x. What’s the probability that that’s the diversion and the original airport is airport Y which has bad weather
Conversely, if that airport has bad weather what could be the diversion
Not sure how to go about it though
i think the more you work with different datasets, the more you develop your data intuition aka what questions you can ask the data vs. getting stuck

Yeah 
I guess it’s also not very smart to plunge right into a big project like this but I’m confident that with the help of the discord server, some research and books I can at least get something working
Oh yeah forgor to ask! What books could help me with this project?
Hey there, I'm wondering if there is a way to get the color of a histogram plot with Matplotlib that I already plotted?
I think the default is viridis
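One way to check rather than guess is to read the colour back from the bars themselves. As far as I know, the default bar colour comes from the colour cycle (`C0`), while viridis is the default colormap for images. A minimal sketch with made-up data:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

fig, ax = plt.subplots()
counts, bin_edges, bars = ax.hist([1, 2, 2, 3, 3, 3])  # made-up data

# Every bar of one histogram shares a face colour, so the first one is enough.
bar_color = to_hex(bars[0].get_facecolor())
```

If the return value of `hist` was not kept, `ax.patches[0].get_facecolor()` should reach the same rectangle on an axes that only holds the histogram.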
why is it not advised to have val set when dataset is too small??
im trying to get into reinforcement learning, where should I begin?
what do you currently know about reinforcement learning, and ML in general?
nothing rlly, was starting to get into it yesterday and learning Linear regression
Does anyone know how we can fetch schema from parquet files using python. And then create schema for hive external table?
Please help I am totally out of context and really need help in this.
Please help 🙏
I have been searching for this on Google for the past week
a quick google shows
from pyarrow.parquet import ParquetFile
ParquetFile(source).metadata
Source file is on S3
Can I read parquet using this library?
And I need to connect to Hive using Pyspark and need to query
Does anyone know how to query hive on my local machine?
import pyarrow.parquet as pq
import s3fs
s3 = s3fs.S3FileSystem()
pandas_dataframe = pq.ParquetDataset('s3://your-bucket/', filesystem=s3).read_pandas().to_pandas()
How can I create a hive schema from the dataframe I receive?
I am able to fetch the dataframe, but I can't work out how to automatically create the matching hive structure from it
I really need help creating the hive structure from the parquet data I receive, whatever form it comes in. Be it a Pyspark dataframe or a pandas dataframe
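One possible approach is to generate the `CREATE EXTERNAL TABLE` statement as a string from the column dtypes. A minimal sketch, not a complete solution: the `PANDAS_TO_HIVE` mapping, the `hive_ddl` helper, the table name and the column names are all assumptions for illustration, and unknown dtypes simply fall back to `STRING`:

```python
# Assumed mapping from pandas dtype names to Hive column types.
PANDAS_TO_HIVE = {
    "int64": "BIGINT",
    "int32": "INT",
    "float64": "DOUBLE",
    "float32": "FLOAT",
    "bool": "BOOLEAN",
    "object": "STRING",
    "datetime64[ns]": "TIMESTAMP",
}


def hive_ddl(table, dtypes, location):
    """Build a CREATE EXTERNAL TABLE statement from {column: dtype name}."""
    columns = ",\n".join(
        f"  `{name}` {PANDAS_TO_HIVE.get(dtype, 'STRING')}"
        for name, dtype in dtypes.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n{columns}\n)\n"
        f"STORED AS PARQUET\nLOCATION '{location}'"
    )


# Hypothetical columns; with a real frame: {c: str(t) for c, t in df.dtypes.items()}
dtypes = {"user_id": "int64", "score": "float64", "name": "object"}
ddl = hive_ddl("my_table", dtypes, "s3://your-bucket/")
```

The resulting string could then be passed to something like `spark.sql(ddl)` on a session with Hive support enabled. Nested or decimal columns would need a richer mapping than this.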
anyone know how to remove the dots
df = pd.read_csv("datasets/covid19cases_test.csv")
sonoma_df = df[df["area"] == "Sonoma"]
print(sonoma_df)
dates = pd.to_datetime(sonoma_df['date'])
cases = sonoma_df['cases']
#plt.style.use('seaborn')
plt.plot_date(dates, cases, linestyle='solid', linewidth=0.5)
plt.gcf().autofmt_xdate()
plt.title('Sonoma County Cases vs Time')
plt.xlabel('Date')
plt.ylabel('Cases')
plt.tight_layout()
plt.savefig('figures/sonoma_cases.png')
nvm i got it
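In case someone else hits the same thing: the dots are most likely the default markers of `plt.plot_date`. A minimal sketch assuming a plain line is what's wanted, using `plot`, which draws no markers by default; the dates and case counts are invented stand-ins for the CSV columns:

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Invented stand-ins for the 'date' and 'cases' columns.
dates = [dt.date(2022, 5, day) for day in range(1, 6)]
cases = [10, 12, 9, 15, 11]

fig, ax = plt.subplots()
(line,) = ax.plot(dates, cases, linestyle="solid", linewidth=0.5)  # line only, no markers
fig.autofmt_xdate()
ax.set_title("Sonoma County Cases vs Time")
ax.set_xlabel("Date")
ax.set_ylabel("Cases")
```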
I am still waiting for your input
this may be of value.. https://stackoverflow.com/a/34344654/1538838
anybody out here using palantir foundry cloud? curious what your workflow looks like for development, .. prior to racking up the bills in production
How do you guys handle code review / PRs for jupyter notebooks? Since notebooks are painful to read in plain text.
VS Code does syntax highlighting in notebooks, this is my go-to way to work with notebooks now
Ah ty, I've never used VS Code, can it show diffs?
probably, I've only explored a fraction of its features.
Diffs in Jupyter notebooks, though? I'm not sure
yes, there is tight integration with git as well, ... you can show diffs between different files (in explorer) or diffs of different commits
Note however: this is after installing loads of plugins. I don't even remember what I've installed anymore, but fortunately there's a feature to transport the list from one installation to another.
thanks both!
VSCode does show diffs of jupyter notebooks, but badly: it doesn't exclude cells that have no diff.
So you have to scroll to find the cells that actually have changes.
E.g:
This is only with the Python extension and the stuff that comes with it like the Jupyter extension.
read the bible http://incompleteideas.net/book/the-book-2nd.html
there's no site here
oh nvm i was doing https
I've had a very persistent bug with Pandas for a while now, and I'm curious if any of you have a possible solution. Note that both of the problems here require pandas 1.4.0rc0 or greater, as pandas 1.3.5 works completely fine. I was hoping that the update to 1.4.2 back in April might quash these bugs, but it didn't.
Have confirmed existence of bugs on a separate computer.
import pandas as pd
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
dfsty = df.style
dfsty.applymap(lambda _: 'color: red')
with open('kaboom.html', 'w') as f:
    # version 1 - fails
    dfsty.to_html(f) # regression: breaks with ValueError
    # version 2 - incorrect
    # f.write('bye bye style tag\n') # mangles css style below
    # dfsty.to_html('kaboom.html') # bad css due to above
    # version 3 - works, but deprecated
    # f.write(dfsty.render())
produces (redacted for brevity, can post full traceback and details of css bug if requested)
File "C:\Users\Thurisatic\miniconda3\lib\site-packages\pandas\io\formats\format.py", line 1220, in get_buffer
raise ValueError("buf is not a file name and encoding is specified.")
ValueError: buf is not a file name and encoding is specified.
Anyone online? I would like to ask for a favor.
this is possibly a bug in pandas' handling of optional arguments. i would file a bug report. my guess is that the default encoding= argument changed, but they forgot to check for the case when the output isn't a filename. try encoding=None explicitly in to_html, as a test / workaround
you have to say what it is that you need help with.
Creating mock data for a quiz.
What are embeddings in PyTorch?
how can i graph kinematic equations in matplotlib?
or how can i structure a kinematic equation to be a graphable function?
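One way to make a kinematic equation graphable is to write it as a function of time and evaluate it over a list of time points. A minimal sketch assuming constant acceleration; the function names and the numbers are made up for illustration:

```python
def position(t, x0=0.0, v0=0.0, a=0.0):
    """Constant-acceleration position: x(t) = x0 + v0*t + a*t^2/2."""
    return x0 + v0 * t + 0.5 * a * t ** 2


def velocity(t, v0=0.0, a=0.0):
    """Constant-acceleration velocity: v(t) = v0 + a*t."""
    return v0 + a * t


# Made-up example: thrown upward at 20 m/s under gravity.
times = [i * 0.5 for i in range(9)]  # 0.0 s to 4.0 s
heights = [position(t, v0=20.0, a=-9.81) for t in times]
speeds = [velocity(t, v0=20.0, a=-9.81) for t in times]
```

`plt.plot(times, heights)` would then draw the curve, with time on the x axis.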
This course is live now https://github.com/huggingface/deep-rl-class
Hi guys, I've been doing a comparison of the four models LiR, RR, LASSO and EN. Am I wrong to assume that if a model is the best on one metric, it should also be the best on the other metrics? Because in the image below EN is the best performing for MAE but for MSE and R2 it is LASSO. Am I doing something wrong here?
that is indeed wrong to assume
Interesting, can you please elaborate?
each model will have its own properties, that's about it
that's the whole reason there are different models and different metrics. you have to try and find the best one for your use case. it goes deeper, too: you can read about the no free lunch theorems of estimation. roughly, all estimators (what you use to fit the chosen model) have the same performance when averaged over all possible problems. this means if one is good in some scenarios, it is necessarily worse in others.
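A tiny made-up example shows how two metrics can disagree about which model is best: MSE punishes one large miss much more than MAE does. The predictions below are invented, not taken from the four models in the question:

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [0.0, 0.0, 0.0, 0.0]
pred_a = [1.0, 1.0, 1.0, 1.0]  # model A: always off by 1
pred_b = [0.0, 0.0, 0.0, 3.0]  # model B: perfect except one large miss

mae_a, mae_b = mae(y_true, pred_a), mae(y_true, pred_b)  # 1.0 and 0.75
mse_a, mse_b = mse(y_true, pred_a), mse(y_true, pred_b)  # 1.0 and 2.25
```

Model B wins on MAE and model A wins on MSE, so neither result means something went wrong.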

