#data-science-and-ml
ah nice haha
im just messing around i learnt this huckel theory approximation quite some time ago but never really understood how they computed the energies for benzene or other larger molecules
3x3 or 4x4 matrices were quite easy to examine but the larger ones like benzene made me curious as to how they're computed
Are some computations faster in pyarrow than in numpy?
I see
That shouldn't be very difficult imo
ig for directly putting the values and calculating for matrices with larger dimensions you could use numpy.linalg.det(array)
Hmm, maybe just use an arbitrary real number for x?
how do you mean?
what is the determinant finding here?
the determinant gives an equation in x which is set equal to 0
in this case it's a quartic equation
the roots of this equation are the coefficients i'm looking for
Ah I see, thank you!
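For larger molecules the determinant does not have to be expanded by hand. A minimal sketch, assuming the usual Hückel setup where the secular matrix has x on the diagonal and 1 between bonded atoms: the roots of det(x*I + A) = 0 are then the negated eigenvalues of the adjacency matrix A. The helper names `ring`, `chain` and `huckel_roots` are made up for this illustration.

```python
import numpy as np

def chain(n):
    """Adjacency matrix of a linear chain of n atoms (e.g. butadiene for n=4)."""
    a = np.zeros((n, n))
    for i in range(n - 1):
        a[i, i + 1] = a[i + 1, i] = 1.0
    return a

def ring(n):
    """Adjacency matrix of a ring of n atoms (e.g. benzene for n=6)."""
    a = chain(n)
    a[0, n - 1] = a[n - 1, 0] = 1.0
    return a

def huckel_roots(adjacency):
    """Roots x of det(x*I + A) = 0, i.e. the negated eigenvalues of A."""
    return np.sort(-np.linalg.eigvalsh(adjacency))

butadiene = chain(4)
benzene = ring(6)

# Coefficients of det(x*I + A) for butadiene: the quartic x^4 - 3x^2 + 1.
quartic = np.poly(-butadiene)

butadiene_roots = huckel_roots(butadiene)
benzene_roots = huckel_roots(benzene)
print(quartic)
print(butadiene_roots)
print(benzene_roots)
```

The same two functions work for any size, since only the adjacency matrix changes.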
Hello, my name is Agustin. I am from Argentina. I am currently working in data analytics. I am trying to solve a problem with pandas, specifically with the method .astype(), which triggers a deprecation warning in my code. I don't know how to replace it. The warning itself suggests this replacement: Use obj.tz_localize(None) or obj.tz_convert('UTC').tz_localize(None) instead
Can someone help me with this? I dont understand how to replace this function with those options
that is the error that I get when executing the astype() method. Its more a warning rather than an error, but in the near future it will become an error
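A minimal sketch of the two replacements named in the warning, assuming the column holds timezone-aware datetimes (the sample value and variable names are made up). On a Series the methods are reached through the `.dt` accessor.

```python
import pandas as pd

# Hypothetical tz-aware column: noon at a fixed UTC-03:00 offset.
stamps = pd.Series(pd.to_datetime(["2022-05-14 12:00:00-03:00"]))

# Option 1: drop the timezone but keep the local wall-clock time.
naive_local = stamps.dt.tz_localize(None)

# Option 2: convert to UTC first, then drop the timezone.
naive_utc = stamps.dt.tz_convert("UTC").dt.tz_localize(None)

print(naive_local.iloc[0])
print(naive_utc.iloc[0])
```

Which one to pick depends on whether the naive values should mean local time or UTC.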
Hello, I have a question about numpy, specifically about numpy.linalg.eig. In the documentation, it says that it returns “normalized” eigenvectors. However, I don’t want them normalized for a project I’m working on. I’ve looked at stackoverflow, but there’s no suggestion that doesn’t involve going to another package sympy. Is there any way to use just numpy and calculate the eigenvectors of a matrix without the normalizing?
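An eigenvector is only defined up to a scalar factor, so there is no single "un-normalized" version; numpy just picks unit length. One possible approach with numpy alone is to rescale each column afterwards to whatever convention the project needs. The sketch below divides each eigenvector by its smallest nonzero entry; the helper name `rescale_columns` and that convention are assumptions for illustration.

```python
import numpy as np

def rescale_columns(vecs, tol=1e-12):
    """Rescale each eigenvector (column) so its smallest nonzero entry becomes 1."""
    out = np.empty_like(vecs)
    for j in range(vecs.shape[1]):
        col = vecs[:, j]
        nonzero = col[np.abs(col) > tol]
        pivot = nonzero[np.argmin(np.abs(nonzero))]
        out[:, j] = col / pivot
    return out

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

vals, unit_vecs = np.linalg.eig(A)   # columns of unit_vecs have length 1
scaled = rescale_columns(unit_vecs)  # same directions, different scale

print(vals)
print(scaled)
```

The rescaled columns still satisfy A @ v = lambda * v, because multiplying an eigenvector by a constant does not change that.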
Hello, general question about machine learning from a beginner here: what does it mean to train a model? I often see that being said in context of machine learning, but based on my experience with kNN and GAs, I dont really get what it means. How would I "train" my GA or kNN algorithm? Or is training needed for other algorithms and methods besides those simple ones?
Also, how exactly can I imagine machine "learning"? How is my machine "learning" something by using kNN or GAs? I only see it as following a strict pattern/algorithm and coming to a solution that way?
I hope someone can answer my questions, thanks in advance!
so from my point of view (also a beginner), training consists in adjusting the parameters of your network. This is also the reason why you usually divide your data into training and validation data.
If you look at this model you will notice a certain peculiarity for areas of your model that are not covered by data.
if you would then analyze known data with it and check the predictions in a truth matrix you could make statements about the quality of your model
Okay, so training is not to be understood as the model training itself, but just us humans adjusting parameters so its giving better results?
depends on the training method
For example genetic algorithms or the k-nearest neighbors algorithm?
u got 3 main methods: supervised, unsupervised and reinforcement learning
those are algorithm types
the method is how ur modell handles ur data
Can you give an example for that?
different algorithms result in different predictions for same dataset
Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, K-Nearest Neighbour and Naive Bayes are the main ones
and its to say that for all the above the input data differs so u have to normalise the inputs but from the original dataset
so the dataset always is the same
if that makes any sense 🗿
maybe @serene scaffold can crosscheck my explanation
Thanks for your explanations! Can you also answer my 2nd question there, about how I can imagine this learning process?
as stated above it depends on the learning model chosen
^dis
basically its linear regression and u got functions in ur NN with y=f(x)
NN stands for...?
and u got labelled data with which ur NN can check autonomously if it was right or wrong
neural network
Oh, im not using NNs
u always do
The only thing I did so far is really just implementing the k-nearest neighbor algorithm to predict data, and an evolutionary algorithm
Is that not machine learning, if im just using those algorithms without a NN?
ok so u just use statistics?
Hmm yeah I guess
I basically got these algorithms from a machine learning tutorial series, so I thought this would already be some sort of learning, but ig its just statistics then? Where/when does the actual learning part come in? With NNs?
is this a good explanation of gradient descent ?
https://github.com/sivansh11/machine-learning-explained/blob/main/gradient_decent.ipynb
if u use tensorflow or pytorch 🗿
and for my understanding of whats happening inside the hidden layers of a NN -> Black Box cause the Network finds correlations and causalitys from n-data
Ok so, how is the NN interacting with, for example, the kNN algorithm then on a higher level? Who is influencing/adjusting what?
u got input(blue) hidden(yellow,red) and output(green)
all got a function y=f(x) and weights(black strings) now it gives different approaches but easiest is forward so blue->yellow->red->green
"what wires together fires together"
depending on the value for a given neuron it fires or it wont
and thats the learning part
fyi knn and nn are 2 fundamentally different algorithms
dont get confused
I got a general idea of NNs, but how do those inputs and outputs now interact with an algorithm like kNN? Where would I put that in a NN?
knn stands for k nearest neighbours and nn stands for neural networks
NN and kNN are unrelated.
Well u can use them tgt tho, cant u?
they do different things internally
And my question is, how are you using them together?
I guess?
classification
first then learning
statistics always first step
u dont really use them together
they are 2 different things to do something similar
like you can ride a car, or a bike
both do the same thing
but both are fundamentally different
Alright, is it the same for GAs and NNs? Or is that possible to be used tgt?
what is tgt 😅
@strong sedge i could give statistic functions as neuron functions tho cant i?
yes, you can use a GA and NN together
can you elaborate ?
And why not kNNs? Because GAs are just making more sense to be combined with NNs?
for my understanding and like a tried to explain to bruce neurons got functions and weights applied to em (y=f(x))
Maybe because the kNN wouldnt make sense cuz its a supervised algorithm and there wouldnt be much left to learn if ur just comparing the data etc.?
so it would be possible to say make a predicition with knn and use the value of it
ummm, I would honestly suggest that you understand what neural networks are, k nearest neighbour is and how genetic algorithms work at a deeper level
no no no,
knn != nn
they are different
ofc
knn doesnt have a neuron in it
i do know that
but neurons got functions
so i assumed i could apply a function to the neuron like knn
different approaches yes
and nn is not always >> statistics but i thought i could "combine" if wanted
elaborate pls
nn doesnt work on statistics, rather calculus
knn also doesnt work on statistics, it works on the distance between points
there is a separate algorithm called naive bayes that works on statistical idea called bayes theorem
a neuron in a neural network takes in inputs, does some processing on it and gives some outputs
in function form
y = f(x)
yes
knn works on distance between k points, the resultant value of a new point is the average of the k nearest points
2 very different ideas
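The kNN description above can be sketched in a few lines of numpy. This is a toy regression version with made-up data and a hypothetical helper name `knn_predict`, not a library implementation. Note that there is no training step: the "model" is just the stored data, and all the work happens at prediction time.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Predict the value at `query` as the mean of its k nearest training points."""
    dists = np.linalg.norm(train_x - query, axis=1)  # distance to every stored point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    return train_y[nearest].mean()

# Toy data: one feature, target equal to the feature.
train_x = np.array([[0.0], [1.0], [2.0], [10.0]])
train_y = np.array([0.0, 1.0, 2.0, 10.0])

pred = knn_predict(train_x, train_y, np.array([1.2]), k=3)
print(pred)
```

A neural network, by contrast, has weights that are adjusted during training, which is why the two do not plug into each other directly.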
i tried to keep it simple and not explain the underlying idea but i thought i could combine em, thanks for correcting me
no worries
but in a nn i work somewhat with statistics cause the system searches for correlations and causalitys doesnt it?
do give feed back on this, tell me what changes I should make
idrk
best answer BLACK BOX 🗿
@serene scaffold u got any clue on that?
give me a few mins
"note: the multiplier used should be small, there is no fixed value, can **you **what ever works for you"
looks fine to me and is sufficient but i dont quite remember what my prof said a few years back
ill fix it
thanks
gladly
if I have an array of zeroes np.zeros((100, 336)) how do I update the 318th to 325th entries to 1?
by "entries" do you mean columns?
arr = np.zeros((100, 336))
arr[:, 318:326] = 1
thanks
can i ask a more complicated question?
i actually have an array of zeros np.zeros((100, 48*7))
each row is a factory shift in a data frame
and each column is a half hour segment of the week
well fuck
i'm trying to iterate over a dataframe
which will count how many factory shifts are active at each half hour of the week
so something like
for i, r in df.iterrows():
when you're doing numpy or pandas, just banish "iterate" from your mind.
hold that thought
do print(df.head().to_dict('list')) and put the result in the chat
and then explain what you're trying to do, without any code.
I won't look at any screenshots.
ok give me one sec
thanks
crap I dont have access to the file on my home computer
guess we can't do it?
do you have the name of each column and its dtype memorized?
yes
that's what I need to know
what time unit is start_shift_time? seconds?
how should we proceed? should I tell you what I tried to do?
so you need to know how many half-hour blocks in each day (00:00, 00:30, 01:00, etc.) are covered by a factory position?
i need to know how many factory workers/positions are needed at each half hour of the week
so you need to know how many factory positions are active during each half-hour block?
okay, we can work with that.
thanks
so we loop through the df right and need to figure out the start half hour and end half hour right?
no looping.
the first step is to represent everything as actual timestamps
In [3]: pd.Series([430, 1200, 0000])
Out[3]:
0 430
1 1200
2 0
dtype: int64
In [4]: s = _
In [6]: s.astype(str).str.zfill(4)
Out[6]:
0 0430
1 1200
2 0000
dtype: object
In [7]: pd.to_datetime(s.astype(str).str.zfill(4), format='%H%M')
Out[7]:
0 1900-01-01 04:30:00
1 1900-01-01 12:00:00
2 1900-01-01 00:00:00
dtype: datetime64[ns]
ok got it
You can also add the days.
In [9]: pd.to_timedelta(pd.Series([1, 2, 3]), unit='D')
Out[9]:
0 1 days
1 2 days
2 3 days
dtype: timedelta64[ns]
In [10]: pd.to_datetime(s.astype(str).str.zfill(4), format='%H%M') + _
Out[10]:
0 1900-01-02 04:30:00
1 1900-01-03 12:00:00
2 1900-01-04 00:00:00
dtype: datetime64[ns]
yeah
will the start and end always be on an hour or on the half hour? like it will never be at 617 or 1535?
no
no to which part
itll always be on the hour or half hour
okay. I wonder if there's a way to "expand" each row into one row for each half-hour block
you said there's 7 days, right?
yes
so you can do this
In [23]: pd.date_range(start='1900-01-01', freq='30min', periods=24 * 2 * 7)
Out[23]:
DatetimeIndex(['1900-01-01 00:00:00', '1900-01-01 00:30:00',
'1900-01-01 01:00:00', '1900-01-01 01:30:00',
'1900-01-01 02:00:00', '1900-01-01 02:30:00',
'1900-01-01 03:00:00', '1900-01-01 03:30:00',
'1900-01-01 04:00:00', '1900-01-01 04:30:00',
...
'1900-01-07 19:00:00', '1900-01-07 19:30:00',
'1900-01-07 20:00:00', '1900-01-07 20:30:00',
'1900-01-07 21:00:00', '1900-01-07 21:30:00',
'1900-01-07 22:00:00', '1900-01-07 22:30:00',
'1900-01-07 23:00:00', '1900-01-07 23:30:00'],
dtype='datetime64[ns]', length=336, freq='30T')
is this on each row?
no, this is separate. it's every possible half-hour block
oh i see
anyway, you can do something like this
In [36]: blocks = pd.date_range(start='1900-01-01', freq='30min', periods=24 * 2 * 7).to_numpy()[None, :]
In [37]: blocks.shape
Out[37]: (1, 336)
In [38]: shift_starts
Out[38]:
array([['1900-01-01T04:30:00.000000000'],
['1900-01-01T12:00:00.000000000'],
['1900-01-01T00:00:00.000000000']], dtype='datetime64[ns]')
In [39]: (shift_starts <= blocks) & (blocks < shift_ends)
Out[39]:
array([[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]])
where you use broadcasting to get a 2d array of bools. each column is a block and each row is a shift. and it's True if that shift overlaps with that block
well, that's it
how do you get to a count of shift for each half hour?
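sum of each column (axis 0): each column is a half-hour block, so its sum is the number of shifts active in that block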
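Putting the steps from this exchange together, a minimal end-to-end sketch. The column names `day`, `start` and `end` and the three sample shifts are assumptions (the real dataframe was never posted), and shifts are assumed not to cross midnight. A block counts as covered if it starts at or after the shift start and before the shift end.

```python
import numpy as np
import pandas as pd

# Hypothetical shifts: day offset within the week plus HHMM start/end times.
shifts = pd.DataFrame({
    "day": [0, 0, 1],
    "start": [430, 1200, 0],
    "end": [1230, 2000, 800],
})

def to_timestamp(day, hhmm):
    """Turn a day offset and an HHMM integer column into timestamps in the 1900-01-01 week."""
    clock = pd.to_datetime(hhmm.astype(str).str.zfill(4), format="%H%M")
    return clock + pd.to_timedelta(day, unit="D")

# Column vectors of shape (n_shifts, 1).
starts = to_timestamp(shifts["day"], shifts["start"]).to_numpy()[:, None]
ends = to_timestamp(shifts["day"], shifts["end"]).to_numpy()[:, None]

# Row vector of shape (1, 336): every half-hour block of the week.
blocks = pd.date_range("1900-01-01", freq="30min", periods=24 * 2 * 7).to_numpy()[None, :]

# Broadcasting gives a (n_shifts, 336) boolean array: True where a shift covers a block.
active = (starts <= blocks) & (blocks < ends)

# Summing over the shift axis gives the number of active shifts per half-hour block.
counts = active.sum(axis=0)
print(counts[:30])
```

Shifts that run past midnight would need their end moved to the next day before the comparison.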
'''
y = wx + b
dy = dwx + wdx + db
dy / dw = dw * x / dw + w * dx / dw + db / dw
what I think should be correct
dy / dw = x
dy / db = 1
dy / dx = w
'''
this is technically wrong
it should be
'''
dw = dy * x
db = dy
dx = dy * w
'''```
what am I doing wrong here ?
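One possible reading is that nothing is wrong and the two blocks use different notation. dy/dw = x is the local derivative of y = wx + b. In backpropagation code, "dw" usually means the gradient of the loss with respect to w, and "dy" the gradient of the loss with respect to y, so dw = dy * x is just the chain rule. A small numerical check, assuming a made-up loss L = 0.5 * y**2:

```python
def forward(w, x, b):
    return w * x + b

def loss(y):
    # Hypothetical loss, chosen only so that dL/dy is easy to write down.
    return 0.5 * y ** 2

w, x, b = 2.0, 3.0, 1.0
y = forward(w, x, b)

# Backprop shorthand: each name is the gradient of the loss w.r.t. that variable.
dy = y        # dL/dy for L = 0.5 * y**2
dw = dy * x   # chain rule: dL/dw = dL/dy * dy/dw, with dy/dw = x
db = dy       # dy/db = 1
dx = dy * w   # dy/dx = w

# Central finite difference as an independent check of dw.
eps = 1e-6
numeric_dw = (loss(forward(w + eps, x, b)) - loss(forward(w - eps, x, b))) / (2 * eps)

print(dw, numeric_dw)
```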
I have a h5 model, but how do I run it with opencv
this is the code im using, but it doesnt work :
```
from cvzone.ClassificationModule import Classifier
import cv2
cap = cv2.VideoCapture(0)
myClassifier = Classifier('eyedisease.h5','labels.txt')
while True:
    _, img = cap.read()
    predictions, index = myClassifier.getPrediction(img)
    print(predictions)
    cv2.imshow("Image", img)
    cv2.waitKey(1)
```
gives this error
my reinforcement learning network has 5 inputs and 3 outputs. No matter how many middle layers there are or how many nodes it has, my output is always only 1 option. I have tried different training algorithms and different activation functions but nothing works. Do I not have enough input nodes or something? I am not sure what to do. I would appreciate the help
how can I compare how many "lowest" values a column has compared to 3 other columns?
why does jax.grad fail on the following method
```
def U(x):
    return np.sum(np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1))
```
i truncated to np.linalg.norm after trying with np.sqrt(np.sum(np.square(... for a while
jax.grad is possible up until the np.sqrt part
also np is jax.numpy, not the standard numpy
Hi, has anyone worked with receipt data extraction before? Like extracting the invoice number, receipt date, amount etc.
Are there any models that are ready to train for this?
can anyone recommend a good course on data science?
Guys, any tips on how to deal with vanishing gradients in a discriminator from a GAN?
(My discriminator has only 3 layers and its optimizer is an adam with lr=1)
I can think about residual blocks and batchnormalization, but I suppose residual blocks aren't really a good option for a GAN, right?
Hello. I have plotted a histogram for temperature to see which temperature occurs the most on any given day. And I have a question about the mode. The mode is at the red line. But I can see that on the right there's a temperature value that occurs more often. So why is the mode shown on the left instead of the right? If needed the data comes from here https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/daily-bike-share.csv
ACCURACY = THE POINT AND RANGE OF A MEASURED AMOUNT OF CAPABILITY A POSSIBILITY CAN HAPPEN AND DETERMINE COME INTO EFFECT
RADIUS = SET RANGE OF A CENTERED POINT TO THE END DESTINATION
DIAMETER = SET RANGE POINT FROM START TO MIDDLE TO THE END WHILE PASSING THE RADIUS
CONVERT = CHANGE FORM AND OR CHARACTER AND OR FUNCTION
PATTERN = REPEATING METHOD
WRITE = ENSCRIBE FROM LOOKING AT WORDS
READ = DESCRIBE FROM LOOKING AT A PATH OF WORDS
SPAN = MEASURED LIMITED RANGE
VIBRATION = PARTS THAT MOVE BACK AND FORTH AT A GIVEN SPEED
TRANSFORM = MAKE A CHANGE IN FORM
SYNCHRONIZE = LINK AND SEND THE SAME RESULT TO ALL SOURCES
SCAN = ANALYZE A SPECIFIC WORD OR FIELD AND OR GIVE DATA ON THE ASKED INFORMATION TO SEARCH FOR
ANALYZE = READ AND LOOK OVER
CALCULATE = GIVE A DESIGNATED OF A CALCULATES DESCRIPTION FOR A NUMBER AND GIVE ANSWER FOR ALL OF VALUE
LIMIT = SET DEFINED AMOUNT FOR KNOWLEDGE WITH A GIVEN POWER LEVEL
RECALL = GAIN THE ABILITY TO VIEW PAST MEMORY INSIDE BRAIN
REACH = GRAB TO PULL INWARDS
PREDICT = GIVE PERFECT VALUE
REPEAT = CYCLE SAME EFFECT AGAIN INTO SAME FREQUENCY
RECOGNIZE = RECALL FROM AN EARLIER POINT WITHIN TIME
ENCODE = COMPRESS CODE
DECODE = DECOMPRESS CODE
RECODE = COMPRESS CODE ONCE MORE
LOOP = BIND IN A CYCLE
MEASURE = TAKE IN THE AMOUNT AND DISTANCE OF
ANSWER = SOLUTION TO A PROBLEM
SOLUTION = FINAL OUTCOME TO AN FORMULA
PROBLEM = UNFINISHED SOLUTION
SEARCH = FIND AND LOCATE SOMETHING
ASK = STATE A QUESTION
TIME = MEASUREMENT IN WHICH CURRENT REALITIES MUST PASS
SPACE = CONTAINER IN WHICH TIME MUST PASS THROUGH
UPLOAD = TRANSFER3 INTO DESCRIBED LOCATION
DOWNLOAD = TRANSFER3 TO CURRENT DEVICE
SIDELOAD = TRANSFER3 TO ALL DEVICES WITH STATUS OF STATED SET LOCATION
CLONE2 = MAKE AN IDENTICAL COPY OF
SETTING = A MEASUREMENT COMMAND THAT CAN BE ADJUSTED AND BY AN OPERATOR
ADJUST = EDIT AND MODIFY
EDIT = CHANGE AND OR MODIFY TO ADJUST TO A SPECIFIED PURPOSE
WORK = PRODUCING EFFORT TO FINISH A TASK
WORKLOAD = THE AMOUNT OF WORK
COMMAND = ORDER TO BE GIVEN
LINK = BRING TOGETHER AND ATTACH TO
BIND = EDIT AND MODIFY
LEVEL = NUMBER AMOUNT OF OR SIZE
UNIT = STORAGE CONTAINER
DIMENSION = NUMBER OF GIVEN AXIS POINTS
NUMBER = ARITHMETICAL VALUE THAT IS EXPRESSED BY A WORD AND OR SYMBOLE AND OR FIGURE REPRESENTING A PARTICULAR QUANTITY AND USED IN COUNTING AND MAKING CALCULATIONS AND OR FOR SHOWING ORDER IN A SERIES OR FOR IDENTIFICATION
FREQUENCY = REPEATED PATTERN AND OR SETTING
POWER = AMOUNT
STRENGTH = LEVEL INTENSITY
CALIBRATE = SCALE WITH A STANDARD SET OF READINGS THAT CORRELATES THE READINGS WITH THOSE OF A STANDARD IN ORDER TO CHECK THE INSTRUMENT AND ITS ACCURACY
PUBLIC = ACCESS TO ALL OF CREATORS INTERIOR DOMINION
PRIVATE = HIDDEN TO EVERYONE BUT CURRENT2 USER2
PERSONAL = EXCLUSIVE TO THE CREATOR
ESCAPE = RETURN TO SOURCE PLACE2
RETURN = GO BACK
CONSTANT = ALWAYS IN EFFECT
CYCLE = PROCESS OF REPEATING AN EVENT CONTINUOUSLY IN THE SAME ORDER
MEASUREMENT = AN ACT TO CALCULATE AND GIVE A SPECIFIC LENGTH ON SOMETHING
CALCULATOR = A DEVICE USED TO CALCULATE INFORMATION AND ANALYZE SET TASKS AS A ROOT VALUE OF LOGIC
WAVELENGTH = A SET OF WAVE PATTERNS GIVEN FREQUENCY FORMAT IN A LENGTH OF A WAVE VALUE DETERMINED BY A PREVIOUS VALUE EFFECT
LENGTH = HOW LONG A MEASURED DIMENSIONAL OBJECT IS EXTENDED
LATTICE = INTERLACED STRUCTURE AND OR PATTERN
LOCATION = SPECIFIED AREA
LINE2 = CHOSEN DIRECTION THAT IS SET IN A SINGLE PATH
WAVE2 = CONTINUAL FLUCTUATION OF FREQUENCY AND OR PATTERN
WIDTH = MEASUREMENT OF SOMETHING FROM SIDE TO SIDE
HEIGHT = THE LENGTH OF RAISING OR LOWERING IN A VERTICAL PATH
HERTZ = DEFINED SOUND WAVE FREQUENCY
Those are the mathematical variables my language has
just some and not a complete list yet
looking for input and feedback and what others think of it. How others see it could be used if I made it as a smaller library/module for it to connect to the full language with.
What others see within its potential as well
Can anyone help me turn excel data into something that can be worked with in python? I'm new to data science and have already tried all the built in functions from pandas but it can never recognize my file for some reason, not sure if i should be saving it in a particular place first? Would appreciate if someone could hop on a call or something and help me work through this!
Bit confused about how to combine two different models into one. I.e. if I fit a linear regression model and also fit a XGBoost model to a dataset. I know sometimes you can get better scores utilizing both models but am unsure how to go about this process. Can anyone point me in the right direction? Thanks! (@ me on reply if you can please)
Dm me what your file formatting looks like I can help point you in the right direction
holy sht
@topaz night You Like it
If you're using keras, you can pass the output of a model as input of another model.
Example: create a convolution model to extract features from an image and pass those features into XGBoost. Or you can extract features with PCA and pass them into a Decision Tree.
If you're using tensorflow or Pytorch for neural networks, things can get more interesting, as you can create a Neural Network with XGBoost inside of it.
I was using sklearn for both
Need help with python and pandas code
Traceback (most recent call last):
File "/local/scratch/v_rahul_pratap_singh/UnsupervisedVAD/video_feature_extractor/extract.py", line 50, in <module>
model = get_model(args)
File "/local/scratch/v_rahul_pratap_singh/UnsupervisedVAD/video_feature_extractor/model.py", line 32, in get_model
model = model.cuda()
File "/shared/home/v_rahul_pratap_singh/miniconda3/envs/envRahul/lib/python3.10/site-packages/torch/nn/modules/module.py", line 689, in cuda
return self._apply(lambda t: t.cuda(device))
File "/shared/home/v_rahul_pratap_singh/miniconda3/envs/envRahul/lib/python3.10/site-packages/torch/nn/modules/module.py", line 579, in _apply
module._apply(fn)
File "/shared/home/v_rahul_pratap_singh/miniconda3/envs/envRahul/lib/python3.10/site-packages/torch/nn/modules/module.py", line 602, in _apply
param_applied = fn(param)
File "/shared/home/v_rahul_pratap_singh/miniconda3/envs/envRahul/lib/python3.10/site-packages/torch/nn/modules/module.py", line 689, in <lambda>
return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.```
is this an error or what? my code seems to continue running and making the relevant files but this pops up
Anyone interested to have a look in my project and help me
ask your problem, some one will help
lr = 1 is too much ?
export it as csv in excel
in python, import pandas as pd and then use pd.read_csv()
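A minimal sketch of that suggestion. "File not recognized" problems are often path problems, so the example builds an explicit path instead of relying on the current folder. The file name and contents are made up, and the CSV is written to a temporary folder only so the snippet runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Relative file names are resolved against this folder.
print(Path.cwd())

with tempfile.TemporaryDirectory() as folder:
    # Stand-in for a sheet exported from Excel as CSV (hypothetical contents).
    csv_path = Path(folder) / "sales.csv"
    csv_path.write_text("region,units\nnorth,12\nsouth,7\n")

    # Passing the full path avoids "file not found" surprises.
    df = pd.read_csv(csv_path)

print(df)
```

pandas can also read .xlsx files directly with pd.read_excel, which needs the openpyxl package installed.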
its called stacking, take a read at this https://www.javatpoint.com/stacking-in-machine-learning#:~:text=Stacking is one of the,new model with improved performance.
Awesome thanks! I'll watch some videos on it
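Since both models were fit with sklearn, stacking could look roughly like the sketch below. It uses synthetic data, and GradientBoostingRegressor stands in for XGBoost so the example only needs scikit-learn; the final estimator RidgeCV is also just one possible choice.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic data: one linear effect plus one nonlinear effect.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 2))
y = 2.0 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.1, size=400)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The base models' cross-validated predictions become the inputs of the final model.
stack = StackingRegressor(
    estimators=[
        ("linear", LinearRegression()),
        ("boosted", GradientBoostingRegressor(random_state=0)),
    ],
    final_estimator=RidgeCV(),
)
stack.fit(X_train, y_train)

score = stack.score(X_test, y_test)  # R^2 on held-out data
print(score)
```

Comparing this score with each base model's score on the same test split shows whether stacking actually helps for a given dataset.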
@strong sedge can we connect need to share my screen and make you understand my problem
Hi guys ive got my entry, take profit, and stop loss stored in my dataframe, but cant figure out how to track the profit and loss of the strategy. Any advice? thanks. This is for a trading strategy
yeah thats really cool bro like damn yk
what if myself is the problem :v
return int(v.strip(',')) ```
does anyone know why a .strip won't work in this instance? i am trying to pull from a column in a data frame where it is all strings since the number values have commas (EX: 36,098) but i want to convert all of those values to ints without commas
```---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
/tmp/ipykernel_191/1557144099.py in <module>
2 def convert_votes_to_int(v):
3 return int(v.strip(','))
----> 4 video_games = video_games.con
5 video_games
AttributeError: 'DataFrame' object has no attribute 'con'```
this is the error message i am getting
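Two separate things seem to be going on. The traceback is about the truncated line `video_games.con`, not about strip. And even once that is fixed, `str.strip(',')` only removes commas at the start and end of a string, so "36,098" comes back unchanged. A minimal sketch using the vectorized string methods instead, where the column name `votes` is an assumption:

```python
import pandas as pd

# Hypothetical frame with numbers stored as comma-formatted strings.
video_games = pd.DataFrame({"votes": ["36,098", "1,204", "987"]})

# strip only touches the ends of the string, so the inner comma survives.
unchanged = "36,098".strip(",")

# Remove every comma in the column, then convert to integers.
video_games["votes"] = (
    video_games["votes"].str.replace(",", "", regex=False).astype(int)
)
print(video_games)
```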
anyone know a cute way of getting the bottom 3 strings from a specific column? so lets say i did .tail(3) I want 3 values from that, it would be the same column for each row
lets call that rowCbottom3 = []
and then to match for row B
sorry
not row, COL
ah, worked it out
How can I see if my prediction model is the best model?
what does the model predict?
one model for solar energy production and another one for electricity prices
"best" by what standard?
accuracy I guess
i am not being glib. that's a legitimate question and an important one that you must answer in any modeling project!
in general, it's hard to know if your model is "best" but you can compare various models to see which one is better
So the only way is by testing multiple models and comparing them?
in general yes. in statistics specifically, certain models have certain desirable mathematically-proven characteristics in some situations.
but that also doesn't make them "best" for any particular application
bias-variance tradeoff is also important to consider
would you prefer a model with really small average error, but huge variation in the predictions? or would you prefer a model with modest average error, but less variation in the predictions?
wdym by huge variation in predictions?
if you don't know what "bias-variance" tradeoff is, go look it up right now
ok thnx
The best would be ofc a small bias and small variation, but I think for our project a small bias would be more important
ok, but then you have to be comfortable with the chance that your particular sample gives substantially "incorrect" results!
but wouldnt a bigger bias also give incorrect results?
this often comes up with observational data, such as that collected from the environment. it's often helpful to think of "the environment" as a big random sampling engine: physical phenomena are the outcomes of random data generating processes. you get exactly one opportunity to observe that data generating process, because time only runs forward!
so it's tempting to look at a time series at the millisecond scale of something like solar energy, and conclude that you have a big data set, and therefore that you don't care about variance and must minimize bias. but there is a legitimate interpretation in which you have a data set of exactly 1 data point.
so its actually better to find a balance between the bias and variation error?
yes. it's not something you can always tune precisely, but it's something important to consider when asking what the "best" model is
And how can I calculate these, because right now I'm only calculating the mse of the last training data and the mse of the prediction
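With real data the true function is unknown, so bias and variance cannot be computed exactly; the spread of scores across cross-validation folds is a common practical proxy. The idea itself can be illustrated on simulated data where the truth is known. The sketch below is such a simulation (sine curve, polynomial fits, all names and numbers made up): it refits each model on many noisy samples and measures how far the average prediction is from the truth (bias squared) and how much predictions move between samples (variance).

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 2 * np.pi, 30)
x_test = np.linspace(0, 2 * np.pi, 50)
true_test = np.sin(x_test)   # known truth, only available in a simulation
noise_sd = 0.3

def bias_variance(degree, n_repeats=200):
    """Estimate bias^2 and variance of a polynomial fit over repeated noisy samples."""
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        y_noisy = np.sin(x_train) + rng.normal(scale=noise_sd, size=x_train.size)
        coeffs = np.polyfit(x_train, y_noisy, degree)
        preds[i] = np.polyval(coeffs, x_test)
    bias_sq = np.mean((preds.mean(axis=0) - true_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

simple = bias_variance(1)     # straight line: high bias, low variance
flexible = bias_variance(5)   # degree-5 polynomial: low bias, higher variance
print(simple, flexible)
```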
Also another question, since I am using 2 models for the energy prices and solar energy production, would stacking be a good method to make a more accurate prediction?
i need help choosing between tf and pytorch.
i've read that pytorch pretty much beats tf when it comes to use in research, and is starting to get more and more popular in the industry
i'm a bit concerned about deployment though (i'm only concerned about deploying to web apps)
read that it's a bit harder to deploy with pytorch. is this still true? or has it become easier to deploy pytorch now?
my interests include mostly NLP(mostly japanese) and music (music theory, metadata, genres)
this command Dataset.from_dict(dutch_dict) gives me this error pyarrow.lib.ArrowInvalid: Column 1 named validation expected length 43410 but got length 5426
I just want to convert a dictionary to a Dataset object which I've imported from datasets but I don't know why I'm getting this error.
Do both
Tensorflow still has the most weight behind it, but it's a bit of a relic
Pytorch is the up and comer and will likely overtake eventually
aight thanks
last question
how hard is it to deploy to the web with pytorch vs tensorflow as of now? (the articles i've been reading are from 1-2 years ago and i'd guess pytorch has improved since then)
If you're using cloud platforms they're both as easy as each other, not sure about 3rd party software though
Like I use GCP for model deployment and both are integrated in the same way
I got this error and can't fix it pls help : (
Graph execution error:
Real programmers delete and start from scratch whenever they reload their IDE
but i am beginner
Then begin
Kinda... It's quite rare to see algorithms with lr = 1
Oh, you can use sklearn, too. My head was still in the neural networks
Batchnorm tends to be my go to for GAN vanishing gradients, are they not working for your case?
In sklearn you can make a model's output be another model's input
No, they weren't. I had to lower the Linear layers in the discriminator
Use less neurons
Ah
However, it seems that it's stabilized for now... I'm adding random noise to the discriminator's inputs, using label-smoothing, weights initialization...
Aaaand updating after each batch, instead of each epoch
First time I am seeing something like this
What are you doing?
A Text GAN
Ohhh cool
hello, I am trying to change a single weight and bias in my model but I am not sure how to go about. Is there some sort of indexing through the model? model[column][row] <-- like this?
I'm using pytorch
Yes, you can access a model's parameters by looping over model.parameters()
for param in model.parameters():
    print(param)
or
for name, param in model.named_parameters():
    print(name, param)
thank you
is there a way to index model.parameters() without the loop?
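One way to do this without a loop is to go through the layer itself: each layer exposes its `weight` and `bias` tensors, which can be indexed like arrays. A minimal sketch with a made-up nn.Sequential model; for a custom nn.Module the layer would be reached by attribute name instead of `model[0]`.

```python
import torch
from torch import nn

# Hypothetical model, only for illustration.
model = nn.Sequential(nn.Linear(5, 4), nn.ReLU(), nn.Linear(4, 3))

first = model[0]  # first Linear layer; weight has shape (out_features, in_features)

# Edit single entries in place; no_grad keeps autograd from tracking the assignment.
with torch.no_grad():
    first.weight[2, 1] = 0.5
    first.bias[2] = -1.0

# Name-based lookup: a dict of all parameters, keyed like "0.weight".
params = dict(model.named_parameters())
print(params["0.weight"][2, 1])
```

`list(model.parameters())[i]` also works for positional access, though the name-based lookup is easier to read.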
I got this error
RuntimeError: CUDA out of memory. Tried to allocate 2.43 GiB (GPU 0; 8.00 GiB total capacity; 5.70 GiB already allocated; 0 bytes free; 6.52 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
which I find strange, because my memory is supposed to be 16 gb
so why does it say 8.00 GiB total capacity?
probably GPU memory vs RAM memory?
CUDA uses the Video memory, not the pc RAM memory
Try checking your dxdiag
so my gpu only has 8gb of memory?
is that a lot? stylegan3 recommends 16gb
Because it probably relies on multiple GPUs from cloud servers
Those big models usually do that
well yeah, some huge neural network requires completely absurd amounts of memory that no personal computer should have, and are expected to run (exclusively) on cloud providers
task manager says I've got 2 gpus, a 3080 and "AMD radeon Graphics"
but when I try to train neural nets it only lets me use the 3080
is the amd one not a real gpu or something?
CUDA = nvidia = probably doesn't support amd I guess
It doesn't 
so why would the manufacturer of the computer put two incompatible gpus?
they're not incompatible per se
it's just not compatible with cuda
oh
Unfortunately, it doesn't seem to be that easy to use AMD GPUs for neural networks... I remember I tried to search about it, and...nothing.
Maybe most algorithms tend to rely on NVidia GPUs because of that...except for Google's, since they like to use their TPUs
if you're just messing with GANs for fun or even some project you haven't got much progress into yet, you might as well move on to Stable Diffusion tbh - almost completely (if not completely) different network architecture, but generalises a lot better as far as I know
really? I've heard a lot of hype about stable diffusion recently, but I thought it's still worse at high quality images and face datasets
But diffusion models tend to be heavier than GANs
do they? I don't know how much you need to train it, but I know that you can run inference with 8GB for sd
weren't you doing some Avatar the last airbender stuff?
yep!
that doesn't really fit into either "high quality" or "face datasets" I think?
my main goal was to generate new character designs, kind of like a "novel pokemon gan" I saw someone do
but I trained a vanilla gan from scratch and the results were both very blurry and extremely overfit
and then I discovered thiswaifudoesnotexist, which retrained stylegan2 on a small dataset of anime girls
Oh, inference is usually ok, the training is the problem.
The diffusion model creates too many samples for a single training loop.
and it was super crisp
so now I'm trying to retrain stylegan3 on my dataset
to achieve the same result
OpenAI even developed a diffusion model that is better than DeepMind's BigGAN, which is the state of the art GAN, but the computation power that thing demands...
Each checkpoint file has, like, 1 Gb.
link?
I found two github repos about Pokemon GAN, both of which are very much not high quality at all
yeah they weren't great
but the waifu one was excellent
What was the size of images you tried to generate? 64x64?
Blurry images tend to be normal, so GAN models usually rely on SuperResolution nets...
Maybe some of them don't, but others do.
I think BigGAN uses something to avoid this, but it was so complicated that I can't remember... but there's NVidia's Progressive Growing of GANs, which adds layers as training progresses and generates quite interesting images at fairly high resolution.
i am using ssh.
if i clean GPU cache because i am getting CUDA out of memory, will it affect others using that GPU?
Working with dates/times in a pandas dataframe.
One of the columns in a df of data from our SQL server is ip_date (initial production date). Pandas says it's type object. I need to work on this as a date, so I run .to_datetime on it, and now its type datetime64[ns]. However, when I try to get the data type off of an individual value in that column of data, its type is <class 'pandas._libs.tslibs.timestamps.Timestamp'>.
meters_sql = ...  # result of the sql query
print(meters_sql.dtypes)  # Says column `ip_date` is `object`
meters_sql['ip_date'] = pd.to_datetime(meters_sql['ip_date'])
print(meters_sql.dtypes)  # Says column `ip_date` is `datetime64[ns]`
print(type(meters_sql['ip_date'][1]))  # Says it's type ...timestamps.Timestamp
How do I force this to be a datetime? Or what module would I use to work with timestamp?
timestamps are pandas's version of datetimes
!e import pandas; print(pandas.Timestamp.mro())
@agile cobalt :white_check_mark: Your 3.11 eval job has completed with return code 0.
[<class 'pandas._libs.tslibs.timestamps.Timestamp'>, <class 'pandas._libs.tslibs.timestamps._Timestamp'>, <class 'pandas._libs.tslibs.base.ABCTimestamp'>, <class 'datetime.datetime'>, <class 'datetime.date'>, <class 'object'>]
pandas.Timestamp is to datetime.datetime what numpy.float64 is to float
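A quick way to see that relationship in practice, as a small sketch (the date is just an example value): a Timestamp passes an isinstance check against datetime.datetime, takes the usual datetime methods, and can be converted explicitly if a plain datetime is really needed.

```python
import datetime
import pandas as pd

ts = pd.Timestamp("2018-05-19")

# Timestamp subclasses datetime.datetime, so datetime methods work on it
is_datetime = isinstance(ts, datetime.datetime)
first_of_month = ts.replace(day=1)

# convert explicitly only if a plain datetime object is really required
plain = ts.to_pydatetime()
print(is_datetime, first_of_month, type(plain))
```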
OK... So the reason I'm asking this is I tried to run (forgive me for using terms badly) a list comprehension on the df to replace all the day values in the dates with the number 1. So 5/14/2022 would become 5/1/2022. I'm very much a beginner with list comprehensions so I tried this and got an error:
meters_sql['ip_date'] = [meters_sql['ip_date'].replace(day=1) for x in meters_sql['ip_date']]
error message: Series.replace() got an unexpected keyword argument 'day'
So my first thought was that the type of data in that series is not datetime so that's how I got to where I am now.
OK, so I think I figured out the first part of the list comprehension error. I have this now:
meters_sql['ip_date'] = [meters_sql['ip_date'][x].replace(day=1) for x in meters_sql['ip_date']]
The error message is: Exception has occurred: KeyError Timestamp('2018-05-19 00:00:00')
I'm at a loss as to what to do here.
don't ever iterate over pandas dataframes - especially, do not use list comprehensions for that kind of stuff.
So use if/else instead?
look up pandas vectorized operations
hokey pokey, thx 🙂
Sounds complicated so if I start dropping those words around the developers maybe they'll think I'm smart lol
Oh man, that looks a lot simpler and easier to understand. At least the first couple examples I see.
explicit loops are as bad as (or even worse than) pure python code without pandas
apply()/map() with user defined functions is bad and shouldn't be used either, but still beats explicit loops
you should always use specific built-in methods that operate over the entire series
Understood. This discord is my main point of education for such things (specific built-in methods). Thank you!
So to vectorize the replacement of the day in the field ip_date, it would be something like this I suppose?
df['ip_date'] = df['ip_date'].replace(day=1)
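close, but Series.replace() swaps values and won't take day=1; date parts on a column go through the .dt accessor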
I recommend taking a look at the pandas documentation at https://pandas.pydata.org/docs/user_guide if you haven't yet
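For the day-to-1 case specifically, here is a sketch of two column-wide options on made-up dates. The extra column names are only for illustration; both go through the .dt accessor rather than Series.replace.

```python
import pandas as pd

df = pd.DataFrame(
    {"ip_date": pd.to_datetime(["2022-05-14", "2018-05-19", "2021-12-01"])}
)

# Option 1: round-trip through a monthly period, which snaps to the month start
df["first_of_month"] = df["ip_date"].dt.to_period("M").dt.to_timestamp()

# Option 2: subtract (day - 1) days from every date in one operation
df["first_of_month_alt"] = df["ip_date"] - pd.to_timedelta(
    df["ip_date"].dt.day - 1, unit="D"
)
print(df)
```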
Hey there! I'm building a Django app and I use pandas a lot to process data. I have come across one big problem: at some point in my app, data analysis takes like forever. I have the following code:
i = df['agencia'] == 'DHL'
for row in tqdm(df[i].index):
    for col in df.columns:
        for supplement_col in supplements_columns_names:
            for supplement_col_total in supplements_columns_names_total:
                for supplement_price_col in list_df_supplements_prices_columns:
                    if df.loc[row, col] == supplement_price_col:
                        df.at[row, supplement_price_col] = df_supplements_prices.at[0, supplement_price_col]
                        theoretical_price = df.at[row, supplement_price_col]
                        invoiced_price = df.at[row, supplement_col_total]
                        if theoretical_price != invoiced_price:
                            errors_data.append(
                                {'Package number': df.at[row, 'agencia'], 'Supplement error': supplement_price_col,
                                 'Invoiced price': invoiced_price, 'Theoretical price': theoretical_price,
                                 'Difference': invoiced_price - theoretical_price})
# Generation DF errors
df_errors = pd.DataFrame(errors_data)
I know that pandas does not recommend to loop through a DF. But in my case, I have to get to a precise cell to append data, i.e. getting the row and column for this part :
df.at[row, supplement_price_col] = df_supplements_prices.at[0, supplement_price_col]
For 2000 rows, the analysis takes like 4 min (!), which is way too long. So here's my question, I know it is possible to do better, but could you please guide me? I did look up and saw df.apply(lambda row) for example but since I'm using a lot of conditions and loops, it is unclear to me whether I should use this function or not...
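One possible direction, sketched on toy data: loop only over the handful of supplement names and let pandas handle all rows at once with boolean masks. Everything here is an assumption made for illustration, not the real schema: the column names (`supplement_1`, `fuel_total`, `remote_total`), the `prices` dict standing in for `df_supplements_prices`, and the idea that each supplement has one matching `<name>_total` column.

```python
import pandas as pd

# Hypothetical toy data: "supplement_1" names the supplement applied to a row,
# and "<name>_total" holds the invoiced amount for that supplement.
df = pd.DataFrame({
    "agencia": ["DHL", "DHL", "UPS"],
    "supplement_1": ["fuel", "remote", "fuel"],
    "fuel_total": [5.0, 0.0, 5.0],
    "remote_total": [0.0, 12.0, 0.0],
})
prices = {"fuel": 5.0, "remote": 10.0}  # theoretical price per supplement
supplement_cols = ["supplement_1"]

is_dhl = df["agencia"] == "DHL"
pieces = []
for name, price in prices.items():
    # DHL rows where any supplement column mentions this supplement
    applies = is_dhl & df[supplement_cols].eq(name).any(axis=1)
    invoiced = df.loc[applies, f"{name}_total"]
    wrong = invoiced[invoiced != price]  # keep only the mismatches
    if not wrong.empty:
        pieces.append(pd.DataFrame({
            "Supplement error": name,
            "Invoiced price": wrong,
            "Theoretical price": price,
            "Difference": wrong - price,
        }))

# assumes at least one mismatch exists; pd.concat raises on an empty list
df_errors = pd.concat(pieces).rename_axis("row").reset_index()
print(df_errors)
```

The remaining Python loop runs once per supplement rather than once per cell, which is usually where the time goes.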
does anyone know how to keep only a certain amount of rows in a data frame
like for example i just want to keep the first 10 rows of a df with like over 200 rows
what function would i use
@inland eagle maybe df.head(10) ?
do you want to actually delete the rows, or just display the first 10 rows?
"to a DataFrame that contains the ten most common genres of video games, in descending order"
so i think just display the first 10 rows
or like 0-9
for some reason it won't let me use that
AttributeError Traceback (most recent call last)
/tmp/ipykernel_644/3463392015.py in <module>
2 sheesh = yolo.get('title').sort_values(ascending=False)
3 wut = yolo.get(['title']).assign(count = sheesh).drop(columns='title')
----> 4 most_common_genres = wut.head(10)
5 most_common_genres
AttributeError: 'DataFrame' object has no attribute 'head'
this is the error message i am getting
ignore the variable and df names lol
Skip the part where you assign most_common_genres and just go straight to wut.head(10)
print(wut.head(10))
I'm trying to retrain stylegan3 starting from one of their pretrained models, and I'm getting this
File "C:\python\Generative-Adversarial-Networks\stylegan3-main\stylegan3-main\torch_utils\misc.py", line 162, in copy_params_and_buffers
tensor.copy_(src_tensors[name].detach()).requires_grad_(tensor.requires_grad)
RuntimeError: The size of tensor a (12) must match the size of tensor b (24) at non-singleton dimension 0
this stackoverflow post about the same issue in stylegan2 says it's because my dataset doesn't have the same number of classes as the pretrained model
but how could I fix this?
what type of ai?
there's image recognition, image generation, NLP
image recognition sounds cool
if that's the case, then you should look into the basics of neural nets
train a pytorch neural net on the MNIST dataset
okay
I highly recommend 3blue1brown's youtube video on neural networks
What are the neurons, why are there layers, and what is the math underlying it?
Written/interactive form of this series: https://www.3blue1brown.com/topics/neural-networks
the mnist dataset refers to the dataset of handwritten 0-9 digits compiled by NIST
training a neural net to look at images of handwritten digits and predict what number it is is a great starting project
okay
im still getting error messages
pytorch actually has the mnist dataset as one of the builtin ones
https://pytorch.org/vision/stable/generated/torchvision.datasets.MNIST.html#torchvision.datasets.MNIST
That's bizarre. Never seen that issue before. The error even says it's a dataframe.
Try print(type(most_common_genres)) and see what you get
this is the output:
babypandas.bpd.DataFrame
my class is using baby pandas which is basically an smaller version of pandas for beginners
ok
idk what that is tho 😄
it is basically pandas
like majority of the functions are the same
maybe the head command isn't available in babypandas
Take a look in there and see if there's a 'head' command anywhere
You could always just make a loop:
for i in range(10):
    print(df[i])
That might work for ya.
IndexError: BabyPandas only accepts Boolean objects when indexing against the data frame; please use .get to get columns, and .loc or .iloc for more complex cases.
hmmm, just a sec...
Give this a try:
print(df.iloc[0:10])
idk how limited baby pandas is tho
...why would you use that over normal pandas?
it is because this version of pandas was made for this specific course, so like ya
i have no choice but to use it over regular pandas
OMG thank you that finally worked
I'm glad! Now... can you help me with datetime stuff?
i can try, i am only a beginner. what are you trying to do?
Trying to replace the day in a datetime if the value of day is < 15.
I have this, but I get an error on df.loc[df['fom'].day
df.loc[df['fom'].day < 15, 'fom'] = df['fom'].apply(lambda dt: dt.replace(day=1))
I got the basic syntax from here: https://datagy.io/pandas-conditional-column/
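A sketch of how that condition could be written, assuming `fom` is already a datetime64 column (the dates are made up). A Series has no bare `.day`; the per-element day lives under `.dt.day`.

```python
import pandas as pd

df = pd.DataFrame(
    {"fom": pd.to_datetime(["2022-05-14", "2022-05-20", "2022-06-03"])}
)

# boolean mask: True where the day of the month is below 15
early = df["fom"].dt.day < 15

# snap only those rows to the 1st; the other rows keep their date
month_start = df["fom"].dt.to_period("M").dt.to_timestamp()
df.loc[early, "fom"] = month_start[early]
print(df)
```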
Hello, does anyone here have a good amount of experience with PyTorch?
I was just wondering if someone can help me understand how to prepare data for nn.LSTM or nn.LSTMCell, the long short term memory, a recurrent neural network
anyone knows why my graph looks like this?
instead of this reference image.. i cant figure it out.. been at it for a while
fig2,ax2 = plt.subplots(figsize=(5,4))
fig2.patch.set_facecolor("None")
ax2.set_xlabel("Year")
ax2.set_ylabel("Double faults per match")
x,y = df2["year"],df2["player1 double faults"]/df2["player1 total points total"]
ax2.scatter(x,y,alpha=0.5)
ax2.plot(x,y,"-",color="orange")
mpl.style.use("default")
Does anyone have experience with federated learning?
No, what's that? 
how do I get all the values with pandas groupby? I want to sort it by 2 columns but I want the rest of the columns to come as a result too, but all I'm getting is a generic GroupBy object in return. Do I necessarily have to use a function with groupby for it to return something?
yes. think of the GroupBy as a bag of dataframes, where each dataframe is one of the groups you made. but you can't see them again until you do something that reduces them back to one dataframe.
but if you want "all the values", you might rethink why you're using groupby. you usually end up with less data after grouping and doing something with the groups, not the same amount.
oooh, I see. It makes total sense now, thanks! Basically, I wanted to turn this dataset:
into this one, where it separates by months
I thought using groupby would do, but doesn't look like it's the proper solution
I think you're looking for pivot_table
oh, let me check that
if you get stuck, do print(df.sample(10).to_dict('list')) for me and put it in the chat as text (no screenshots), and ping me.
also if the .sample(10) part doesn't give you rows with at least two months represented, just do it again until it does.
alrighty, thanks a lot, will try it out with the documentation and will reach you if I get stuck
you can iterate over the groupby object, which yields group_label, group_data pairs
as stelercus said, usually you don't need to do this
but sometimes it comes in handy. i do it now and then
note that pivot_table might give funny results with a datetime column
obviously you can manually construct a "year-month" column first and use that for pivoting
personally i can never remember the arguments for pivot_table so i would probably do .resample followed by .unstack
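A minimal sketch of the "year-month column, then pivot" idea. The original dataset was only shared as an image, so the `date`, `product` and `sales` columns are invented. The last line also shows what iterating over a GroupBy yields.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-05", "2022-01-20", "2022-02-03", "2022-02-10"]),
    "product": ["a", "b", "a", "a"],
    "sales": [10, 5, 7, 3],
})

# explicit year-month key so every day of a month lands in the same column
df["year_month"] = df["date"].dt.to_period("M").astype(str)

# one row per product, one column per month, summed sales in the cells
table = df.pivot_table(index="product", columns="year_month",
                       values="sales", aggfunc="sum", fill_value=0)
print(table)

# iterating a GroupBy yields (group_label, group_data) pairs
sizes = {label: len(group) for label, group in df.groupby("year_month")}
```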
(what is RDD?)
Hey guys, when I load an .wav file using librosa.load, what is the unit of measurement for the y axis?
I know that it loads audio data in a time-series, so the x is seconds, but what about the y? Amplitude in decibels?
I posted a question in #🤡help-banana would be very grateful if anyone could answer 🙏🏻
i did some changes to my code, ive lost my orange line. but the data looks better. tho the x axis doesnt give anything proper. can anyone provide pointers?
fig2,ax2 = plt.subplots(figsize=(6,4))
fig2.patch.set_facecolor("None")
dbl_ratio = pd.DataFrame(df2["player1 double faults"]/df2["player1 total points total"]) # good
y_avr = dbl_ratio
x_grpby = df2.groupby("year")
x,y = df2["start date"].values,dbl_ratio
ax2.set_xlabel("Year")
ax2.set_ylabel("Double faults per match")
ax2.scatter(x,y,alpha=0.2) # good
ax2.plot(x_grpby,y_avr,"-",color="orange")
mpl.style.use("default")
it would need to look like this
i was thinking, using the mean for the data points to draw the orange line.. but nothing of what i use/do works
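The mean idea can be sketched like this, with a tiny invented stand-in for `df2` that uses the same column names. A line drawn through raw points connects them in row order, so it zigzags; averaging per year first gives one point per x value.

```python
import pandas as pd

# Hypothetical stand-in for df2 with the columns used above
df2 = pd.DataFrame({
    "year": [2019, 2019, 2020, 2020],
    "player1 double faults": [2, 4, 3, 1],
    "player1 total points total": [100, 100, 50, 50],
})

ratio = df2["player1 double faults"] / df2["player1 total points total"]

# one mean per year, indexed and sorted by year
yearly_mean = ratio.groupby(df2["year"]).mean()
print(yearly_mean)
```

Something like `ax2.plot(yearly_mean.index, yearly_mean.values, "-", color="orange")` should then draw the trend line, provided the scatter uses the same year scale on x. Passing the GroupBy object itself to `plot` is what makes the line disappear.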
Splitting ur stuff up to protect privacy of the data
Think of it this way, if the plot of prediction and y_true is same, then the prediction is 100% accurate
If it's not the same, you can see when for what values the model predicts wrong answers
Basically
U plot 2 graphs on the same figure
1 is test vs preds
2 is test vs y_true
Ahh I see thank you
I am trying to create a neat implementation in python and in the papers it says that neural networks that haven't improved in x generations will be removed, what is the definition for having not improved and how would I check for it?
a species (not a genome) whose fitness doesn't go up after x generations will be removed
is the fitness of the species the average fitness of the genomes in that species?
So the species themselves aren't preserved between generations? As in, even if the same genomes exist, they might not end up in the same species?
yeah
I dont remember how, but I think there is a way to keep track of which genome is from which species
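One way the stagnation check could be tracked, as a rough sketch rather than the reference NEAT code. It assumes "species fitness" means the best member fitness seen so far; some implementations use the mean instead. The `SpeciesRecord` class, the `update_stagnation` helper and the default of 15 generations are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpeciesRecord:
    best_fitness: float = float("-inf")
    generations_since_improvement: int = 0


def update_stagnation(record, member_fitnesses, max_stagnation=15):
    """Update the record for one generation; return True if the species is stagnant."""
    current_best = max(member_fitnesses)
    if current_best > record.best_fitness:
        # new record: remember it and reset the counter
        record.best_fitness = current_best
        record.generations_since_improvement = 0
    else:
        record.generations_since_improvement += 1
    return record.generations_since_improvement >= max_stagnation


# four generations of member fitnesses for one species
record = SpeciesRecord()
history = [[1.0, 2.0], [1.5, 2.5], [2.0, 2.5], [1.0, 2.4]]
flags = [update_stagnation(record, gen, max_stagnation=2) for gen in history]
print(flags)
```

For this to work the record has to follow the species from one generation to the next, which is why implementations keep some species identity (for example a representative genome) instead of rebuilding species from scratch each time.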
watch this guys video
https://www.youtube.com/watch?v=3nbvrrdymF0&ab_channel=NeatAI
Does speciation make a difference when finding solutions using neural nets ?
he has a bunch of videos on explaining parts of neat
Why is summing an array of numbers faster with pyarrow than with NumPy?
https://stackoverflow.com/questions/74123523/why-does-summing-array-of-number-using-pyarrow-is-faster-than-numpy
I have watched his video but he does not mention how he checks if the species has not improved
I honestly can't remember how
Did u try reading the original research paper ?
I have tried reading the original paper but I couldn't find anything where he explicitly mentions it
He also has a website with a lot of Q&A
Check that out
tysm for ur help
how do I read a .ps file?
Do you by any chance know java?
Because I have the java version of neat and that references the drop off age but I would need help understanding how that works
No i don't
I would suggest just reading it like it's pseudocode
The language hardly ever matters for understanding how stuff works
ok thank ty for ur help
Whats the best way to display confusion matrices?
sklearn, but are there better alternatives
empty_df = pd.DataFrame()
for name in sd_eth_list:
profile_url = df.loc[df['name'].str.contains(name, case=False)]
empty_df.append(profile_url)
The dataframe still remains empty after running the code.
What am I doing wrong?
you're not supposed to append dataframes like that: .append() returns a new frame instead of modifying empty_df. Use pd.concat() instead. Also is df a temp variable or how is it defined?
y = np.linalg.solve(random_img, heart_img)
why arent these the same matrix
also dont iterate through dataframes at all just apply. one sec
contains_name = df.query('name in sth_ed_list')
filtering a dataframe is not done through iteration
You have to query or aggregate the data to reduce the size
or dimensionality, for aggregating
df is a dataframe of usernames
does my code work
Did not work
native-country salary
United-States >50K 7171
? >50K 146
Philippines >50K 61
Germany >50K 44
i got a lil dataframe here
e= df[['native-country', 'salary']]
highest_earning_country= e[e['salary']== '>50K'].value_counts()
i filtered it to show the country and the salary over 50K
now i want to print out the country with the leading salary which would be the US
urls = []
for name in sd_eth_list:
if profile_url == (df['name'].str.contains(name, case=False)).any():
url = df[df['name'].str.contains(name, case=False)]['profileUrl']
urls.append(url)
Tried this as well but did not work as well
send the df
print(highest_earning_country[0])
profileUrl screenName name bio followersCount friendsCount
0 https://twitter.com/TheSnoopAvatars TheSnoopAvatars The Doggies Enter tha Metaverse with @SnoopDogg x @TheSand... 88130 16
1 https://twitter.com/JulienROMAN13 JulienROMAN13 Julien ROMAN 💵 Investisseur / Youtube 🎬\n\n💸 Finance - Inve... 88768 162
2 https://twitter.com/landz_nft landz_nft Landz.io - Minting NOW The first disruptive Real Estate NFT collectio... 53608 266
3 https://twitter.com/borgetsebastien borgetsebastien Sebastien 🏞 Co-Founder & COO of @TheSandboxGame, the open ... 93652 1138
4 https://twitter.com/cryptoamazo cryptoamazo Crypto Amazo Crypto Promoter | Giveaway | DM to sponsor a #... 15743 56
share it the way you constructed it lol or as a csv
actually nvm ill figure it out with another df
i get the number 7171 but not the name, i want the output to be
United-States
Basically I have a list of names that I want to find in the dataframe's name column.
I can do it with .isin but that looks for exact match.
I want results the way .str.contains() gives
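A sketch of one way to get `.isin`-style list matching with `.str.contains` semantics: join the names into a single regex alternation and test the whole column once. The rows are a trimmed-down version of the frame above, and the two search terms are made up.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "name": ["The Doggies", "Julien ROMAN", "Crypto Amazo"],
    "profileUrl": [
        "https://twitter.com/TheSnoopAvatars",
        "https://twitter.com/JulienROMAN13",
        "https://twitter.com/cryptoamazo",
    ],
})
sd_eth_list = ["doggies", "amazo"]  # hypothetical search terms

# escape each name, then join into one pattern: "doggies|amazo"
pattern = "|".join(re.escape(name) for name in sd_eth_list)

# one vectorized, case-insensitive substring test over the whole column
mask = df["name"].str.contains(pattern, case=False, na=False)
urls = df.loc[mask, "profileUrl"].tolist()
print(urls)
```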
try as_index=False for .value_counts()
nvm thats not a thing
highest_earning_country= e[e['salary']== '>50K'].value_counts().reset_index()['native-country']
Try this
alr
@rich olive where you at help me
one sec
works 🫂
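If only the single top country is needed, something like this may be a bit more direct. It is a sketch on invented rows, with column names as in the frame above. value_counts sorts by count in descending order, so idxmax returns the label with the highest count.

```python
import pandas as pd

df = pd.DataFrame({
    "native-country": ["United-States", "United-States", "Germany",
                       "Philippines", "United-States"],
    "salary": [">50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# keep only the high earners, then count rows per country
high = df[df["salary"] == ">50K"]
top_country = high["native-country"].value_counts().idxmax()
print(top_country)
```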
filtered = df.query(lambda row: name_ele in row.name for name_ele in sd_eth_list)
maybe
wait you want contains
this
filtered = df.query(lambda row: row.name.contains(name_ele) for name_ele in sd_eth_list)
case=False
ValueError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_8584\2498179663.py in <module>
----> 1 filtered = df.query(lambda row: row.name.contains(name_ele, case=False) for name_ele in sd_eth_list)
c:\users\user\appdata\local\programs\python\python37\lib\site-packages\pandas\core\frame.py in query(self, expr, inplace, **kwargs)
4055 if not isinstance(expr, str):
4056 msg = f"expr must be a string to be evaluated, {type(expr)} given"
-> 4057 raise ValueError(msg)
4058 kwargs["level"] = kwargs.pop("level", 0) + 1
4059 kwargs["target"] = None
ValueError: expr must be a string to be evaluated, <class 'generator'> given
yeah that makes sense. sorry Im pretty new too. can fix tho one sec
filtered = df.apply(lambda row: row.name.contains(name_ele) for name_ele in sd_eth_list)
I have a multiindexed dataframe like:
Duration Duration/MW Cost Cost (m) Times Offshore Exceeded Times Vessel Full
Vessel Start Date
JUV 2022-01-01 34.688 4.818 3.983e+06 3.983 1.5 0.0
2022-01-02 33.296 4.624 3.839e+06 3.839 1.4 0.0
2022-01-03 34.354 4.771 3.948e+06 3.948 1.6 0.1
2022-01-04 30.342 4.214 3.534e+06 3.534 1.5 0.1
2022-01-05 35.092 4.874 4.025e+06 4.025 1.6 0.1
2022-01-06 31.342 4.353 3.637e+06 3.637 1.4 0.2
2022-01-07 30.100 4.181 3.509e+06 3.509 1.3 0.2
WTIV 2022-01-01 34.688 4.818 3.983e+06 3.983 1.5 0.0
2022-01-02 33.296 4.624 3.839e+06 3.839 1.4 0.0
2022-01-03 34.354 4.771 3.948e+06 3.948 1.6 0.1
2022-01-04 30.342 4.214 3.534e+06 3.534 1.5 0.1
2022-01-05 35.092 4.874 4.025e+06 4.025 1.6 0.1
2022-01-06 31.342 4.353 3.637e+06 3.637 1.4 0.2
2022-01-07 30.100 4.181 3.509e+06 3.509 1.3 0.2
I need to create a boxplot of Duration for each Vessel/ Start Date combination. Ive been struggling to make it work could someone help me? It would be much appreciated
whoops wrong df, I created this for the time-series analysis of duration and costs but they are the mean of 10 runs for each pair Vessel, Start Date
I have a long ~7300x7 dataframe, where each entry Vessel/ Start Date is separate 20 times, with the same index
AttributeError: 'str' object has no attribute 'contains'
that looks like
...
723 WTIV 2022-12-28 10.750 ... 2.248 0 0
724 JUV 2022-12-29 43.333 ... 4.876 2 0
725 WTIV 2022-12-29 6.833 ... 1.647 0 0
726 JUV 2022-12-30 43.667 ... 4.910 2 0
727 WTIV 2022-12-30 12.083 ... 2.452 0 0
728 JUV 2022-12-31 47.917 ... 5.349 2 0
729 WTIV 2022-12-31 8.000 ... 1.826 0 0
0 JUV 2022-01-01 35.375 ... 4.054 1 0
1 WTIV 2022-01-01 6.500 ... 1.596 0 0
2 JUV 2022-01-02 33.083 ... 3.817 1 0
3 WTIV 2022-01-02 10.250 ... 2.171 0 0
4 JUV 2022-01-03 30.875 ... 3.589 1 0
5 WTIV 2022-01-03 9.250 ... 2.018 0 0
6 JUV 2022-01-04 10.917 ... 1.528 0 1
...
contains isnt a python method lol use in
.
with df.apply() tho
you have 7300 x values?
No I basically have 20 times the identical 729x7 dataframe
Just appended to each other:
simulation(strategy) generates one of those 729x7 dfs
full_results = []
for i in range(0, 10):
    print("Run", i+1, "of 10")
    full_results.append(simulation(strategy))
I managed to get a boxplot for each vessel, but using the entire year as data for each boxplot. Instead I need to have a boxplot for each day of the year per vessel
I don't quite know how to implement the Date still
you can datetime or just manually parse the date
whats the difference doing it year vs day
my research investigates the impact of weather seasonality on offshore wind farm decommissioning project performance
okay what are you trying to boxplot lol
the box and whisker for the duration per day
Because the time-series graph I create takes the average of 20 runs, so very high values and very low values are not considered
so for each day of the year, you want the spread of duration across all vessels
for each day of the year, I want the spread of duration per vessel
Because I want to investigate if one of the vessels is more subject to weather uncertainties/ impacts
you cant boxplot that, its 3 dimensional
I found something like this, that's how I imagined it but I can't get it to work: https://stackoverflow.com/questions/46603823/boxplot-with-multiindex
The graph at the bottom of the thread
just with the year representing my vessels, and the a/b the start date on the x axis
so you want each vessel as a sub-hierarchy to each day in a boxplot
me too so we'll see if I can even be any help
wait a sec... Isn't my structure pretty much identical to the df in the thread?
Vessel Start Date Duration
717 WTIV 2022-12-25 12.000 ... 2.439 0 0
718 JUV 2022-12-26 47.333 ... 5.289 1 0
719 WTIV 2022-12-26 10.000 ... 2.133 0 0
720 JUV 2022-12-27 45.917 ... 5.143 2 0
721 WTIV 2022-12-27 10.500 ... 2.210 0 0
Year in his example would be my Vessel, Text would be my Start Date and data would be duration?
sure. I imagine most dfs would apply. Im reading through it now but pivoting is hard lol
Pivoting is a bitch yeah
I ran this code
filtered = df.apply(lambda row: row.name.contains(name_ele, case=False) for name_ele in sd_eth_list)
I mean I did it here, but couldn't get anywhere with it for the boxplot, works perfectly for the time-series
Fuck this shit man! I think I should open a small grocery shop
how come
is contains a method? i cant find documentation lol
You gave that code
filtered = df.apply(lambda row: name_ele in row for name_ele in sd_eth_list)
yeah because I assumed you were using .contains() correctly lmao
Means you are saying I am dumb, right? Yes I am but don't say that directly
Do you guys reckon it would help if I created an individual plot for each month?
i mean, that looks about right
it looks pretty much like a greyed out version of my time-series
i just wouldnt use a boxplot
hm one sec
Something like this would be nice as well, but probably same story as the boxplot
Maybe heatmap the duration on a 2x2 vessel x day and examine the spread separately
bless you
lmao np
haha oh yeah theyre not words just had a micro-seizure
fair enough loool
I see what you mean by heatmap
but how can I imagine the structure there?
Would you mean duration on y axis, date on x axis
and then heatmap per vessel
one axis of heatmap is day the other is vessel so each square is a vessel on a day, trends along each axis, colour is avg duration
but then that would not model spread (?)
not spread per vessel per day. Thats what I meant by examine it separately like per vessel or per day
aaah
but youll be able to see spread of avg duration across year and vessel
Yeah I think this one is so messy because it has one pair duration spread/vessel for every single day of the year
and then use other examinations to highlight areas of interest on the heatmap
Maybe if I were to combine the months instead of doing a boxplot for every single day that could work
its still 20 vessels so whatever you think 240 boxes looks like
its 2 vessel
oh lol
...you could do a 3d heatmap from an offset angle with cell height and whiskers showing spread
one sec
uuh 3d heatmap looking nice tho
actually the whiskers would be nonsensical
heatmap doesnt make sense with two of one variable
can you just dual candlestick chart it
ayo
calm down with words
im out here googling their meaning nonstop haha
Wouldn't dual candlestick essentially be dual boxplot tho
like a stock chart but with the grouped bars like in your SO example
lmao yeah
im dumb
yeah nah I think a boxplot would be the best approach here
because it is intended to visualize variability isnt it
yeah i guess you have to reduce the timeframe if you wanna see spread
yeah I might just do one separately for each month
Then I could derive something like "if you use vessel x in month y, the project performance is significantly uncertain"
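A sketch of the per-month version on a tiny invented slice of the stacked results, using the column names from the frame above. It computes the numbers a box plot would draw (quartiles and extremes) for each vessel and month, so the spread can be checked before plotting.

```python
import pandas as pd

# Hypothetical miniature of the stacked simulation results
runs = pd.DataFrame({
    "Vessel": ["JUV", "JUV", "WTIV", "WTIV", "JUV", "WTIV"],
    "Start Date": pd.to_datetime(["2022-01-01", "2022-01-15", "2022-01-01",
                                  "2022-01-15", "2022-02-01", "2022-02-01"]),
    "Duration": [35.0, 45.0, 6.0, 10.0, 30.0, 8.0],
})

# collapse the daily start dates into a month number
runs["Month"] = runs["Start Date"].dt.month

# count, mean, std, min, quartiles and max per (vessel, month) pair
spread = runs.groupby(["Vessel", "Month"])["Duration"].describe()
print(spread)
```

With matplotlib installed, `runs.boxplot(column="Duration", by=["Month", "Vessel"])` should draw one box per month/vessel pair, which is 24 boxes for two vessels rather than hundreds.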
what's that?
Oh god have you heard of pandasgui??
.getdbt
It's a tool for writing select statements inside a data warehouse
guys im getting this error could someone help
if query[0] == activationWord:
TypeError: 'builtin_function_or_method' object is not subscriptable
can you do print(query), so we know what it is?
sure
<built-in method split of str object at 0x000001DE82E003F0>
this is what got printed
so, somewhere along the way, you tried to use the split method without calling it. can you show where query = is defined?
hello i got a score of 1.0 accuracy for my kNN for k=1 to k=15, what am i doing wrong?
Hey @pure moat!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
1.0 is 100%, not 1%.
look at query = parseCommand().lower().split. you forgot the () at the end of split
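The difference in miniature, as a sketch (the command string is a made-up stand-in for whatever parseCommand() returns):

```python
command = "Computer Open Browser"  # hypothetical stand-in for parseCommand()

not_called = command.lower().split   # the bound method itself, never run
called = command.lower().split()     # the list of words

# indexing the method object reproduces the error from the traceback
try:
    not_called[0]
except TypeError as exc:
    error = str(exc)

print(error)
print(called[0])
```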
so you think your model can only have 100% accuracy if you've done something incorrectly? we can't guess what you did wrong if we don't know what you did.
no problem. thanks for helping me to help you 👍🏻
it's normal for my kNN model to get 100%?
it's possible, if you have enough training data and there aren't very many/any outliers.
ahh i've been told to be worried if my model has 100% accuracy haha. my dimensions are (247165, 19) for my training set. i scaled all numerical datasets before inserting it into a kNN
usually, if you have 100%, it means that your model is very dependent on the training data, and wouldn't perform well in real situations. but I don't know what your model is intended to do.
Hi I’m new to data science so what all modules should I learn in python for data science 🙂
i know pandas and numpy
I'm not sure what data science does actually, but I'm doing big data analysis & AI master right now.
the first block, I'm learning Machine learning and AI
basically, in ML, we make use of matplotlib to plot graphs and sklearn to do the heavy lifting, and try to explain what the dataframe says
thanks
"learning modules" isn't a viable strategy for actually learning data science. you have to understand the theory, and libraries in the data science ecosystem are not designed to gradually teach that to you as you use them. Try one of the books on our website
!resources data science
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
sure I’ll check em out
I am trying to manually randomly change the weights and biases in my neural network but the only way I found to access them is to loop through them so this is what I did but it is not changing the weights and biases at all. This is very cumbersome, is there an easier way to index through it so that it actually changes the weights and biases?
using pytorch
Basically I have around 1,250,000 photos uploaded on boto3 and am trying to make an X file with all the rgb values, but colab takes way too long to download the files and turn them into np arrays
anyone have a better idea?
You’re changing j (the variable) but not the value in the parameter
You should enumerate in the loops to get the indexes along with the values, then change the value at that index to the new value
It would be something like param[idx1][idx2] = j
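The same effect can be shown with plain nested lists standing in for one layer's values. This is a sketch only, since the original code was posted as a screenshot and is not visible here. Rebinding the loop variable changes nothing; writing back through the indexes does.

```python
import random

random.seed(0)
original = [[0.5, -0.2], [0.1, 0.3]]
weights = [row[:] for row in original]  # hypothetical stand-in for a layer

# rebinding the loop variable leaves the container untouched
for row in weights:
    for j in row:
        j = j + random.uniform(-0.1, 0.1)
unchanged = weights == original

# writing back through the indexes updates the stored values
for idx1, row in enumerate(weights):
    for idx2, j in enumerate(row):
        weights[idx1][idx2] = j + random.uniform(-0.1, 0.1)
changed = weights != original

print(unchanged, changed)
```

In PyTorch itself the usual shortcut is an in-place update per parameter, along the lines of `with torch.no_grad():` and then `p.add_(torch.randn_like(p) * 0.01)` for each `p in model.parameters()`. The 0.01 scale is an arbitrary choice.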
Please don't ask people to read screenshots of text
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
I'm trying to retrain stylegan3, but I keep running into the following error
this command
python train.py --outdir=~/training-runs --cfg=stylegan3-t --data=datasets/ffhq_control.zip --gpus=1 --batch=4 --gamma=8.2 --mirror=1 --workers=1 --snap=50 --tick=4 --cbase=16384 --resume=C:\python\Generative-Adversarial-Networks\stylegan3-main\stylegan3-main\pretrained_models\stylegan3-t-ffhqu-1024x1024.pkl
produces this error
File "C:\python\Generative-Adversarial-Networks\stylegan3-main\stylegan3-main\training\training_loop.py", line 162, in training_loop
misc.copy_params_and_buffers(resume_data[name], module, require_all=False)
File "C:\python\Generative-Adversarial-Networks\stylegan3-main\stylegan3-main\torch_utils\misc.py", line 163, in copy_params_and_buffers
tensor.copy_(src_tensors[name].detach()).requires_grad_(tensor.requires_grad)
RuntimeError: The size of tensor a (16) must match the size of tensor b (32) at non-singleton dimension 1
I'm resuming from a model trained on 1024x1024 images from the ffhq dataset
I see thanks
and my dataset, "ffhq_control.zip" is 12 images from that same dataset
Are your images the same size as the ones it was trained on before
yes, they're literally taken from the same dataset because I was getting this same issue on my own images
so I figured if I can't even retrain it on the same dataset it's not an issue with my images
it also says this
Output directory: ~/training-runs\00006-stylegan3-t-ffhq_control-gpus1-batch32-gamma8.2
Number of GPUs: 1
Batch size: 32 images
Training duration: 25000 kimg
Dataset path: datasets/ffhq_control.zip
Dataset size: 12 images
Dataset resolution: 1024
Dataset labels: False
Dataset x-flips: True
Creating output directory...
Launching processes...
Loading training set...
Num images: 24
Image shape: [3, 1024, 1024]
Label shape: [0]
do they provide instructions for running it?
yeah
in the section of the readme under Preparing Datasets and Training
my working theory is that it's something to do with my class labels (or lack thereof)
so you ran python dataset_tool.py first?
yeah, like this
python dataset_tool.py --source=C:\python\Generative-Adversarial-Networks\stylegan3-main\stylegan3-main\datasets\ffhq_control --dest=datasets/ffhq_control.zip
what happens if you follow one of their examples exactly as written? e.g. the metfaces example
"Fine-tune StyleGAN3-R for MetFaces-U using 1 GPU, starting from the pre-trained FFHQ-U pickle."
don't you need to download the entire metfaces dataset for that?
that's a terabyte at least I think
70,000 images or so
which is why I instead tried it with the 12 images I downloaded from ffhq
you think I'm missing a config file that comes with those datasets?
oh i didn't realize it was a huge dataset
maybe there's a sample you can download
oh i see, the ffhq set is more manageable
seems like that should work too though
from the readme:
Datasets are stored as uncompressed ZIP archives containing uncompressed PNG files and a metadata file dataset.json for labels. Custom datasets can be created from a folder containing images; see python dataset_tool.py --help for more information. Alternatively, the folder can also be used directly as a dataset, without running it through dataset_tool.py first, but doing so may lead to suboptimal performance.```
makes sense
debugging other people's code is always difficult
hard to know where the breakdown is... if it were me, i'd probably file a bug report
this person seemed to have the same issue
this is a good question for a help channel, see #❓|how-to-get-help . the channel #data-science-and-ml is for something specific and not really related to boto3
don't "ask to ask". the #❓|how-to-get-help information (as well as the popup when you ask a new question) provides detailed instructions for asking answerable questions. read them.
bru i asked this in help
and they told me to come here
then state your question in detail and maybe someone can help. but boto3 is the client library for AWS. this channel is about data science.
please do read the guide on asking good questions
yeah... i wonder if there's some other magic required number here. the cbase argument is "capacity multiplier" and i have no idea what that means
apparently your cbase etc. options need to match the pre-trained model
i have 1.2 mil photos on s3 and am trying to turn them into a large csv file. Anyone have any idea on how to pickle these, because right now I am downloading each one and it's going to take around 50 days
csv? how do you expect to turn a photo into csv data?
pixel data
what are you doing with them? why do you need csv data?
into a huge np array
...do you see how withholding information in your question wastes both yours and everyone else's time?
now it is (kind of) a data science question
you're trying to make a single numpy array of 1.2 million images?
why do you need it all in a huge numpy array?
that seems ill-advised and like an "XY" problem
i don't im trying to find some better way to do it
Is there even enough ram on any computer to do that?
that's why im asking
that's the problem
idk how to manage this much data
what are you actually trying to do? don't force people to interrogate you for information.
I'm with salt rock on this one, tell us exactly what the end goal is
i'm making an animal recognition software and I have a harddrive with a lot of trail cam footage that I'm trying to make into something I can train a model with
I'm new to this
and am trying to learn
I've just never made something with this much data
normally you don't try to load all this data at once, and normally you don't need to load it into one big numpy array. ML frameworks like pytorch have some kind of "data loader" mechanism, and usually that includes ready-to-use functionality for working with images.
it's sometimes enticing to try to DIY things, but with a relatively large amount of data, and the relatively sophisticated models required to do ML on it, then you should probably just use a framework and spare yourself the difficulty
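A rough, framework-free sketch of what a "data loader" does, for illustration only: it walks over the files and keeps a single batch in memory at a time instead of one giant array. The names `iter_batches` and `fake_load` are made up here. In PyTorch the same job is done by `torch.utils.data.Dataset` plus `torch.utils.data.DataLoader`, and torchvision ships `ImageFolder` for labelled image folders.

```python
from pathlib import Path
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(
    paths: Iterable[Path], batch_size: int, load: Callable[[Path], T]
) -> Iterator[List[T]]:
    """Yield lists of loaded items, holding at most one batch in memory."""
    batch: List[T] = []
    for path in paths:
        batch.append(load(path))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch


# Hypothetical stand-in for real image decoding (e.g. reading pixels from disk).
def fake_load(path: Path) -> str:
    return path.stem


paths = [Path(f"img_{i}.jpg") for i in range(5)]
batches = list(iter_batches(paths, batch_size=2, load=fake_load))
```

With 1.2 million images the training loop would consume `iter_batches(...)` directly rather than wrapping it in `list(...)`, so only `batch_size` images are decoded at once.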
i'll look into this
thank you so much ❤️
is the footage labelled? cause if it's not labelled it probably won't be useful as a training dataset for animal classification
it is!
wow, nice
remember: if you stated your actual question first, then you'd have gotten this answer a lot faster. see: https://xyproblem.info/
so uh, im trying to do this (first image) but im getting this (second image)
here's the code
fig2,ax2 = plt.subplots(figsize=(6,4))
fig2.set_facecolor("None")
dbl_ratio = pd.DataFrame(df2["player1 double faults"]/df2["player1 total points total"]) # good
dbl_ratio_avr = dbl_ratio
year_grpby = df2.groupby("year").max()
x,y = df2["start date"],dbl_ratio
ax2.set_xlabel("Year")
ax2.set_ylabel("Double faults per match")
ax2.scatter(x,y,alpha=0.3) # good
ax2.plot(year_grpby,dbl_ratio,"-",color="orange")
mpl.style.use("default")
what is the data type of year? and what is the data type of start date?
ok, but what data types? are they strings? datetime? something else?
those look like strings
i highly recommend instead converting start date to a proper datetime column
!d pandas.to_datetime
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True)```
Convert argument to datetime.
This function converts a scalar, array-like, [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series "pandas.Series") or [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame")/dict-like to a pandas datetime object.
year is pandas.core.series.Series and same for start date
every pandas series has its own "dtype" which describes the data stored in it. that is what i'm asking about
df['start date'].dtype
if you convert from strings to a proper datetime type, then you don't need the year column at all. you can do something like df.resample('AS', on='start date').max()
'O' usually means "strings"
also... what is year_groupby supposed to be the max of? right now your code takes the max of all columns
furthermore that line looks like a mean, not a max (and a smoothed one at that)
df2['start date'] = pd.to_datetime(df2['start date'])
x = df2["start date"]
y = dbl_ratio
y_year_mean = df2.resample('AS', on='start date').mean()
this might get you started, but i think you are missing some other things here
please do read the docs and not just copy my code though
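For reference, a small sketch of the dtype check and the conversion being discussed. The rows are made up and only the column name `start date` comes from the thread. The exact dtype names printed depend on the pandas version, so the example checks the type with `pd.api.types.is_datetime64_any_dtype` rather than relying on a specific string.

```python
import pandas as pd

# Made-up rows; the column name mirrors the one in the question.
df = pd.DataFrame({"start date": ["2010-01-18", "2010-06-21", "2011-01-17"]})

# Before conversion the dates are plain strings (shown as 'O'/object in older pandas).
print(df["start date"].dtype)
was_datetime = pd.api.types.is_datetime64_any_dtype(df["start date"])

# Convert once, right after loading the data.
df["start date"] = pd.to_datetime(df["start date"])
is_datetime = pd.api.types.is_datetime64_any_dtype(df["start date"])

# With a real datetime column the year can be derived, no separate column needed.
years = df["start date"].dt.year
```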
@desert oar not sure if you'll remember me from yesterday, but only got time now to test again. Whenever you're free let me know and I'll send you the sample and the pivot table I created.
ive been on this for 2 days in a row, ive tried pretty much everything of what they provide in the lecture notebook. scatter works, but i cant get the year on x to show properly and also my mean for the line isnt working
ill be reading now. ill try to come up with something. thanks a lot
lmfao, getting somewhere i guess
did they not show you how to work with datetime data in pandas?
its a bootcamp, so im doing my best to follow
that looks more like a daily maximum. maybe 'AS' was wrong, but that's what i thought i saw in the docs for "start of year"
well you did max() in your code, not mean()!
thats the last line of code i wrote, i tried anything that i could remember. sum() mean() max() etc
well why would you do max when you meant mean?
in my head its clear what i want to do, but putting it into code, doesnt look like its working much
well answer the practical question here. you want the yearly mean, right? so why would you use anything other than mean?
show me the code you used for the messed up chart you just posted above
it might seem simple to you, but to me, it wasnt working. ive tried 100s of iteration to the code. for some reasons, its just not clicking for this exercise. its really the first one where im having this much trouble. im behind, this is the first workshop, theres a second one. and i only have a few days left to upload it to github. i have a full time job, i wish i could take the time to dig into every single documentation, but right now its not possible
fig2,ax2 = plt.subplots(figsize=(6,4))
fig2.set_facecolor("None")
df2['start date'] = pd.to_datetime(df2['start date'])
dbl_ratio = pd.DataFrame(df2["player1 double faults"]/df2["player1 total points total"]) # good
dbl_ratio_avr = dbl_ratio
x = df2["start date"]
y = dbl_ratio
ax2.set_xlabel("Year")
ax2.set_ylabel("Double faults per match")
ax2.scatter(x,y,alpha=0.3) # good
ax2.plot(x,dbl_ratio,"-",color="orange")
mpl.style.use("default")
in the future, i suggest asking sooner! if you don't understand an error message, trying other random stuff usually isn't a good approach
you forgot to actually take any kind of average:
dbl_ratio_avr = dbl_ratio
use the resample code i showed you
i totally understand the stress of being short on time and not understanding what's going on
yeah, i really thought i could handle it myself
also ax2.plot(x,dbl_ratio,"-",color="orange") you didn't even plot dbl_ratio_avr
i think you understand more than you realize, you are just making silly mistakes at this point. maybe fatigue?
yeah, full time welder. it's exhausting. usually my code is cleaner during the weekend
been at it for 12 years. thats why im trying the bootcamp and maybe get a job. change of career. im all in
you only need to do this once, when you load the dataset:
df2['start date'] = pd.to_datetime(df2['start date'])
and this should produce something like the plot you're looking for:
dbl_ratio = pd.DataFrame(df2["player1 double faults"] / df2["player1 total points total"])
dbl_ratio_avg = dbl_ratio.resample('AS', on='start date').mean()
x = df2["start date"]
y = dbl_ratio
fig2, ax2 = plt.subplots(figsize=(6,4))
fig2.set_facecolor("None")
ax2.scatter(x, y, alpha=0.3)
ax2.plot(x, dbl_ratio_avg, "-", color="C1")
ax2.set_xlabel("Year")
ax2.set_ylabel("Double faults per match")
plt.show()
also, i usually come back from work, take a shower and code.. then i realise, like now, that i did not eat yet
go eat and don't look at a computer screen. then go look at the code i just posted above and see if it makes more sense
yes sir! 🫡
can someone explain to me what pytorch is and why I need it to train YoloV5?
pytorch is a library that helps you write sophisticated machine learning models. if you need it to train yolov5, that's because the code for yolov5 was written using pytorch.
oh ok
also, I have no idea what CUDA i'm supposed to choose
or how to find out
did you install a cuda toolkit yet?
no lol
do you have a python environment set up?
I was following tutorials and they just brought me here
I created a venv
I managed to get the stylegan code to not throw any errors by just trying random pretrained models until one worked. But now the code freezes at
Setting up PyTorch plugin "filtered_lrelu_plugin"...```
tbh it might be a little easier to do this with conda, but installing and setting up conda is a bit of a pain
is there any way to know why it's freezing here?
so should I do it?
no, don't bother.
activate the venv, then run:
python -m pip install --extra-index-url https://pypi.ngc.nvidia.com nvidia-cuda-runtime-cu11
this will install cuda toolkit, as per https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#pip-wheels-installation-windows
you should be able to check the exact cuda version that was installed using the nvcc command.
then you should be able to run (with the venv still active):
python -m pip install torch torchvision
and you should have torch available in the venv
i assume you installed python 3.10 from python.org?
it was a while ago, I guess I did
maybe the nvcc command is not working?
@desert oar hmm i get KeyError: 'The grouper name start date is not found'
meh, skip it. idk 😆
in the documentation for resample, i cant find "AS". where did you get that keyword?
ok, so now I don't need to install it through the website?
oh my mistake. you should probably put dbl_ratio back into the dataframe so you can do this more easily:
df2['dbl_ratio'] = dbl_ratio
dbl_ratio_avg = df2.resample('AS', on='start date')['dbl_ratio'].mean()
i don't think you need to
@fringe anvil i think it would be easier to do this using a datetime index, but that's a whole big pandas topic that i think we can hold off on (but you should learn it at some point)
oh also, one more thing
you don't need pd.DataFrame here:
dbl_ratio = df2["player1 double faults"] / df2["player1 total points total"]
the full code:
df2 = ...
df2["start date"] = pd.to_datetime(df2["start date"])
df2["dbl_ratio"] = df2["player1 double faults"] / df2["player1 total points total"]
dbl_ratio_year_avg = df2.resample("AS", on="start date")["dbl_ratio"].mean()
x = df2["start date"]
y = df2["dbl_ratio"]
fig2, ax2 = plt.subplots(figsize=(6,4))
fig2.set_facecolor("None")
ax2.scatter(x, y, alpha=0.3)
ax2.plot(x, dbl_ratio_year_avg, "-", color="C1")
ax2.set_xlabel("Year")
ax2.set_ylabel("Double faults per match")
plt.show()
hmm, title dont show on the y and x now, and the style isnt white anymore.. idk if its my computer being janky lol.. i restarted the kernel reran everything. now i get ValueError: x and y must have same first dimension, but have shapes (1179,) and (15,)
ok set_xlabel needs to be called before .scatter and .plot
ok I found out that the reason it's freezing at filter_lrelu_plugin is cause I have two versions of it in my pytorch files
how do I know which one to delete?
3.9 is the latest version, but which version does pytorch use?
I just pivoted a table and wanted to get rid of the top row (all DATA_USAGE_GB__C), and bring the months 1 row down so that the 3rd current row becomes the top one
wanted to do that with python, not simply moving them on the .csv file
anyone got an idea how to do that? got kinda confused trying here
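One possible way to do this in pandas, sketched on made-up data (only the `DATA_USAGE_GB__C` name comes from the question; `account` and `month` are assumed column names). A pivot built with a list for `values` gets two-level column labels, and `droplevel` removes the top one.

```python
import pandas as pd

# Made-up data; only the DATA_USAGE_GB__C name comes from the question.
raw = pd.DataFrame({
    "account": ["a", "a", "b", "b"],
    "month": ["2022-01", "2022-02", "2022-01", "2022-02"],
    "DATA_USAGE_GB__C": [1.5, 2.0, 0.5, 0.7],
})

# Passing `values` as a list yields two-level column labels.
pivot = raw.pivot_table(index="account", columns="month", values=["DATA_USAGE_GB__C"])

flat = pivot.droplevel(0, axis=1)  # drop the DATA_USAGE_GB__C level
flat.columns.name = None           # remove the leftover "month" label
flat = flat.reset_index()          # turn "account" back into a normal column
```

After this, `flat.to_csv(...)` writes a file with a single header row.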
@desert oar the resample creates a shape of 15, which doesnt match the shape of x "start date" start date has 1179 row
we passed it the whole column, it should have the same rows, both of them
new code, new error. getting closer to the shape of x..
im able to generate the same graph with groupby.. not sure if its better or not
ValueError: x and y must have same first dimension, but have shapes (1179,) and (926,)
df2["start date"] = pd.to_datetime(df2["start date"]) # should be good now
fig2, ax2 = plt.subplots(figsize=(6,4)) # good
fig2.set_facecolor("None") # good
plt.style.use("default") # good
ax2.set_xlabel("Year") # good
ax2.set_ylabel("Double faults per match") # good
df2["dbl_ratio"] = (df2["player1 double faults"]/df2["player1 total points total"]) # good
dbl_ratio_avr = df2.groupby(["start date","dbl_ratio"])["dbl_ratio"].mean() # not good
x = df2["start date"] # good
y = df2["dbl_ratio"] # good
ax2.scatter(x, y, alpha=0.3) # good
ax2.plot(x, dbl_ratio_avr, "-", color="C1") # need to change something for y
152+926=1078 .. so still missing 101 rows .. ah geez this graph.. 3 failed days in a row lol
sorry my mistake again. you'll need to plot using
ax2.plot(dbl_ratio_avr.index, dbl_ratio_avr, "-", color="C1") # need to change something for y
or even just
dbl_ratio_avr.plot(ax=ax2, color='C1')
using the pandas built-in plotting helpers
(this is a taste of why indexes are useful)
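A tiny sketch of why the shapes disagreed, using made-up matches: the yearly mean has one row per year, so it has to be plotted against its own index rather than against the per-match dates. `"YS"` is the year-start alias in current pandas; older versions also accept `"AS"`.

```python
import pandas as pd

# Made-up matches; column names follow the thread.
df2 = pd.DataFrame({
    "start date": pd.to_datetime(
        ["2010-01-18", "2010-06-21", "2011-01-17", "2011-08-29"]
    ),
    "dbl_ratio": [0.02, 0.04, 0.03, 0.05],
})

# One row per calendar year, indexed by the first day of that year.
yearly = df2.resample("YS", on="start date")["dbl_ratio"].mean()

# 4 matches, but only 2 yearly means: the lengths differ by design.
n_matches, n_years = len(df2), len(yearly)

# So the scatter uses the per-match columns and the line uses the yearly index:
# ax2.scatter(df2["start date"], df2["dbl_ratio"], alpha=0.3)
# ax2.plot(yearly.index, yearly, "-", color="C1")
```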
i wouldn't just delete stuff by hand. pip uninstall the things you don't need
how can I keep rows in a df, which have/don't have a match in another df while merging
anti_join, semi_join kind of thing
Greetings! I have a functional json implementation, for the most part. I am having difficulties with this section:
puntreturns_t1 = dataGameStats['teams'][0]['stats'][7]['data'].split("-")[0]
puntreturnsyards_t1 = dataGameStats['teams'][0]['stats'][7]['data'].split("-")[0]
Appropriate JSON code:
{ "stat" : "Punt Returns: Number-Yards", "data" : "-" }
How can I get puntreturns_t1 AND puntreturnsyards_t1 == 0 / None?
I am getting the following error with the current code:
ValueError: invalid literal for int() with base 10: ''
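The error comes from `int('')`: splitting `"-"` on the hyphen gives two empty strings. A minimal sketch of one way to handle it, returning `None` for missing parts. The helper names are made up, and it assumes the yards are the part after the first hyphen, as the "Number-Yards" label suggests.

```python
from typing import Optional, Tuple


def _to_int(text: str) -> Optional[int]:
    """Convert text to int, treating an empty or blank string as missing."""
    text = text.strip()
    return int(text) if text else None


def parse_number_yards(data: str) -> Tuple[Optional[int], Optional[int]]:
    """Split a 'Number-Yards' string such as '3-41' at the first hyphen."""
    number, _, yards = data.partition("-")
    return _to_int(number), _to_int(yards)


print(parse_number_yards("3-41"))  # (3, 41)
print(parse_number_yards("-"))     # (None, None)
```

Swapping `None` for `0` in `_to_int` gives zeros instead, if that suits the rest of the code better.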
I need some help. I have to do a project on pneumonia detection using deep learning and machine learning. Its a group project and we just know machine learning basics and a little algo. We don't know any deep learning. We do have the code but don't know how to distribute among 3 people. And also how to quickly learn deep learning.. just need to learn straight from the code... They will teach us later. Any tactics?
df.merge should work.
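A sketch of both joins on made-up frames. The semi join here uses `isin` rather than `merge`, which avoids duplicating rows when the other frame has repeated keys; the anti join uses `merge` with `indicator=True`, which adds a `_merge` column saying where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3, 4], "val": ["a", "b", "c", "d"]})
right = pd.DataFrame({"key": [2, 4, 4]})

# Semi join: rows of `left` whose key appears in `right` (no duplication).
semi = left[left["key"].isin(right["key"])]

# Anti join: rows of `left` with no match in `right`.
merged = left.merge(
    right.drop_duplicates("key"), on="key", how="left", indicator=True
)
anti = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
```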
Does anyone know why my spline looks like this (blue)? I would expect it to be like the one i drew op top (red)..
Please do not ask people to read screenshots of text.
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
:incoming_envelope: :ok_hand: applied mute to @earnest raven until <t:1666271518:f> (10 minutes) (reason: newlines rule: sent 106 newlines in 10s).
The <@&831776746206265384> have been alerted for review.
!unmute 130213385265610753
:incoming_envelope: :ok_hand: pardoned infraction mute for @earnest raven.
Pasting large amounts of code
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
Thank you! appreciate it.
but I appreciate that you used ```py. sorry you got zapped.
No problem 🙂
I've been trying to make a distribution graph based on a dataset that contains duplicate data in an array like so: [5, 15, 15, 15, 15, 15, 20, 20, 20, 30]
However when I change the X axis to be a linear range instead of the actual values of the array, the graph morphs into something completely different.
This is my code which results in the first graph: https://paste.pythondiscord.com/omojaqoxoz
The second graph was created by using x = np.linspace(min(gatewayLatencyValues), max(gatewayLatencyValues), len(gatewayLatencyValues)), however this completely morphs the graph. It is notable that the boxplot stays correct regardless of what the X axis is, since it is generated by matplotlib and not based on array indices.
Anyone have any idea how to solve this?
This is the code for the calculate_normal function: https://paste.pythondiscord.com/iromatirox
what are you trying to achieve by changing the values on the x-axis?
it looks like you're trying to plot a histogram of discrete values, perhaps https://numpy.org/doc/stable/reference/generated/numpy.histogram.html will help you?
Id like to smooth the lines out, but for that I need an X axis that has a lot of steps in order to use cubic interpolation
Your calculate_normal function does 2 things.
First it calculates the average and standard deviation
Then it makes the normal
You should separate this into two functions, the first avg, std = get_fit(array) and the second y = make_curve(x, avg, std)
The x you input to the second can have a high density of points, and should not be the same as the array you input to the first. This will give you a smooth curve
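A minimal sketch of that split, assuming the curve is a plain normal density (the original `calculate_normal` is only linked, so its exact formula is a guess). The names `get_fit` and `make_curve` follow the suggestion above.

```python
import numpy as np


def get_fit(values):
    """Return the mean and standard deviation of the samples."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std()


def make_curve(x, avg, std):
    """Evaluate the normal probability density at each point in x."""
    return np.exp(-0.5 * ((x - avg) / std) ** 2) / (std * np.sqrt(2 * np.pi))


latencies = [5, 15, 15, 15, 15, 15, 20, 20, 20, 30]  # sample from the question
avg, std = get_fit(latencies)

# A dense, evenly spaced x grid, independent of how many samples there are.
x = np.linspace(min(latencies), max(latencies), 200)
y = make_curve(x, avg, std)
```

Because `x` no longer has to match the data array, a second dataset can be fitted with its own `get_fit` call and drawn on the same `x` grid.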
I also forgot to mention another thing, i'd like to add another graph to it with a different dataset using the same x axis
but I will keep what you mentioned in mind
I tried to just add it to the existing one, but it has more datapoints but less latency so that means the entire x axis has a different scale
I also advise you to set both dpi and figsize in plt.subplots(). Doing so allows you to control the fontsize regardless of the number of pixels in your graph
A high dpi+low figsize makes the text big while low dpi+high figsize makes the font small
if I do
from numba import cuda
device = cuda.get_current_device()
device.reset()
will it affect other users(using that gpu)?
it shouldn't
i hope my prof data doesnt get reset
Good shout, looks much better now! Appreciate it.
in the absence of further information, all we can suggest is to find a GPU with more memory.
keep in mind that none of us have any idea what you're doing that resulted in you getting that error unless you tell us.
is this relevant?
$ free -g
              total        used        free      shared  buff/cache   available
Mem:            503         414          40           8          47          75
Swap:           255         255           0
seems less than normal to me though
I assume you were trying to train a model, or something. your options are to get a GPU with more memory, or see if you can still train your model by using the available memory more efficiently
but if the model itself needs more memory than the size of the GPU, I think you're SOL.
i am trying to extract feature of videos. using resNext model
how big is the model, and how much memory does your GPU have? please answer using the same unit for both
my video is pretty small, and i am doing it one by one
ok one minute
its this one(the model).
but i dont know about the GPU, can i find out over ssh?
hello guys, I'm very new here to discord and data science. I started an internship and I have to do some forecasting with ml. is there anybody who can help me find some resources?
got this
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 42%   89C    P2   282W / 350W |  23647MiB / 24268MiB |     92%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:25:00.0 Off |                  N/A |
| 47%   86C    P2   215W / 350W |   2667MiB / 24268MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:41:00.0 Off |                  N/A |
| 30%   37C    P8    21W / 350W |  14999MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:61:00.0 Off |                  N/A |
| 30%   34C    P8    25W / 350W |  13515MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce ...  Off  | 00000000:81:00.0 Off |                  N/A |
| 87%   68C    P2   302W / 350W |  16301MiB / 24268MiB |     98%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce ...  Off  | 00000000:A1:00.0 Off |                  N/A |
| 50%   58C    P2   143W / 350W |   3213MiB / 24268MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce ...  Off  | 00000000:C1:00.0 Off |                  N/A |
| 33%   54C    P2   144W / 350W |   3213MiB / 24268MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce ...  Off  | 00000000:E1:00.0 Off |                  N/A |
| 31%   47C    P2   143W / 350W |   3159MiB / 24268MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage
in sklearn's accuracy_score function, how do I implement sample_weight? https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
would I use
label, count = np.unique(y_true, return_counts = True)
and call accuracy_score(y_true, y_pred, count)
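One thing to watch: `sample_weight` expects one weight per sample (same length as `y_true`), so the per-class `count` array from `np.unique` would not line up as is. A sketch of mapping class counts back to per-sample weights, assuming the goal is inverse-frequency weighting (that goal is a guess). The weighted accuracy is computed with NumPy here; the same `weights` array can be passed as `accuracy_score(y_true, y_pred, sample_weight=weights)`.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])

# return_inverse maps every sample back to the position of its class in `labels`.
labels, inverse, counts = np.unique(y_true, return_inverse=True, return_counts=True)

# One weight per sample; samples from the rarer class weigh more.
weights = 1.0 / counts[inverse]

# Weighted accuracy: sum of weights of correct samples over the sum of all weights.
correct = y_true == y_pred
weighted_acc = np.average(correct, weights=weights)
plain_acc = correct.mean()
```

With inverse-frequency weights this is the mean of the per-class recalls, which is what `sklearn.metrics.balanced_accuracy_score` reports.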
I'm trying to retrain stylegan3
https://github.com/NVlabs/stylegan3
but I keep getting this error:
File "C:\python\Generative-Adversarial-Networks\stylegan3-main\stylegan3-main\torch_utils\misc.py", line 163, in copy_params_and_buffers
tensor.copy_(src_tensors[name].detach()).requires_grad_(tensor.requires_grad)
RuntimeError: The size of tensor a (512) must match the size of tensor b (1024) at non-singleton dimension 1```
I can't figure out why the tensor shapes would be wrong. I'm running this command
py train.py --outdir=~/training-runs --cfg=stylegan3-t --data=datasets/control_dataset.zip --gpus=1 --batch=32 --gamma=8.2 --mirror=1 --workers=1 --snap=50 --tick=4 --resume=C:\python\Generative-Adversarial-Networks\stylegan3-main\stylegan3-main\pretrained_models\stylegan3-r-ffhqu-256x256.pkl```
which runs a model trained on 256x256 images on my dataset (control_dataset.zip) which is also 256x256 images
Hi, I have a question regarding ANNs, in particular what the neurons in the hidden layer represent
In the screenshot, ive just drawn up a quick ANN thats used to predict house prices based on the three features: num. of bedrooms, area, and dist. from closest school.
So the hidden layer consists of two layers of neurons. Looking at the first layer in the hidden layer, each neuron will take as input the same features, but the weights may be different, and hence different features may have more impact in some neurons than others. The neurons then apply an activation function etc and produce an output.
My question is, what exactly is this output? What sort of information is a specific neuron calculating and outputting?
Im assuming this is completely flying over my head, but I can not seem to find a clear/direct answer on this, and would appreciate any help
every hidden layer neuron is computing a score (the activation of that neuron) based on all the ones in the previous layer
I like to think of it like each hidden layer neuron is an olympic judge, and the input neurons are a diver
each input neuron is a quality that the diver had: gracefulness, amount of splash, difficulty of dive
and the hidden layer judges each value those qualities differently
the second hidden layer is like another panel of judges
only instead of judging the diver, they judge the olympic judges
and they too have preferences, so maybe one really hates the russian judge but really likes the swedish judge etc.
I don't know if this is making any sense, but TLDR; the first hidden layer finds patterns in the input data. the second hidden layer finds patterns in the patterns
drawing networks like that is always kinda deceptive imo. you can think of the circular nodes you drew as being entries of a single vector. the lines joining the nodes are matrices performing linear or affine transformations on those vectors
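To make the "vectors and matrices" view concrete, a small NumPy sketch of a forward pass for the house-price network from the question. The hidden widths (4 and 4), the random weights and the ReLU activation are assumptions for illustration; a trained network would have learned values for `W1`, `W2`, `W3` and the biases.

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 inputs -> hidden layer of 4 -> hidden layer of 4 -> 1 output.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 4)), np.zeros(4)
W3, b3 = rng.normal(size=(1, 4)), np.zeros(1)


def relu(v):
    """Element-wise activation: negative entries become zero."""
    return np.maximum(v, 0.0)


x = np.array([3.0, 120.0, 0.8])  # bedrooms, area, distance to school (made up)

h1 = relu(W1 @ x + b1)   # first hidden layer: one entry per "neuron"
h2 = relu(W2 @ h1 + b2)  # second hidden layer, computed from h1 only
price = W3 @ h2 + b3     # output layer, no activation for regression
```

Each row of `W1` holds the weights of one first-layer neuron, so "each neuron scores the inputs differently" is just "each row of the matrix is different".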
imo, those "connected bipartite subgraphs" visualizations are only intelligible if you already understand how neural networks work. which means that they have no communication power.
yeah that'd be my take as well
agreed, but it sure is hard to draw a bunch of vectors in a diagram like that. 3blue1brown displays them as vectors in his videos but they're animated which makes it easier
you don't need to (and actually can't) draw them geometrically. you could just use thin rectangles and fat rectangles (and cubes/prisms/etc when dealing with multidimensional stuff and/or tensors)
I actually thought of 3b1b right after I said that. you're right that they're more communicative when they're animated.
if you look, he never actually draws the whole net as vectors
this is literally just one neuron
do you have an example of someone drawing it like that? it sounds cool
well it's basically what you drew there just now, just removing the annotations of the elements of each object
but lemme fish something up
like what you see here for the convolutional parts
there's no reason you can't do the same for a dense network
Alright this is starting to make more sense, especially with the analogy
There just one thing im still a bit unclear on
what would these patterns consist of?
my artistic interpretation. regarding the patterns, that depends entirely on what you're training the network to do, but in general they are not human-interpretable. most deep learning architectures are not interpretable
So artistic
Oh, so all we know is that it builds upon some sort of pattern?
And so via training the model, we set the weights so that at each layer our neurons find the pattern that lead to the best/most accurate output?
pretty much. that's why many people dislike it
it's hard to derive strict guarantees for its performance, but so far it anyway works better than most classical methods
Ok ok I see, so just to summarize:
The different set of weights for each neuron will essentially lead to a different pattern being detected by each neuron. The outputs of all neurons when reaching the end in a way "combine"/each neuron contributes to affecting how the overall model will look at the end, and hence we can get models that can fit to any kind of data (ie models with many squiggly lines when graphed)?
OH so like if neuron 1 for example had weights that emphasized num of bedrooms and area
there could be a pattern in terms of num of bedrooms and area of house having a particular effect on the output right?
sure
Alright, I think its making sense to me now. Thank you all very much for ur help 🙂
i prefer looking at it from the perspective of parameter estimation. you assume a model and find the model parameters that best explain the data
the deep learning model is "ayyy lmao idk what the model is, but this thing has so many parameters it can't go wrong"
Right I see, so like in this image for example, the neurons which are connected with a purple line may have higher weights/parameters when connecting to the dog output neuron. and the neurons connected with the green lines may have higher weights for the cat neuron and lower weights for the dog neuron
So our model learned via estimating the parameters which neurons have more emphasis on determining if we have a dog or a cat.
well, but what you're calling a "neuron" here are just entries of intermediate (or final) vectors
the only reason those matter are because you yourself chose which one represents dog and which one represents cat
but yeah that's more or less the idea
the caveat being that the stuff going into that layer already has no interpretation
just to confirm, is this referring to the red circles?
mhm
Sorry, I was referring to the neurons before those 2
the idea is basically the same
since you're applying an affine transformation, it's two vectors related via a matrix
you're finding the entries of that matrix, which correspond to the weights, as you call them
Ahaa ok yes. Thank you so much for all of your help! I really appreciate it 🙂
I have been working on my own neural network implementation using numpy
https://github.com/sivansh11/sklearn-nn-extension
try it out! I feel like there is probably a bug some where in the code lmao
I want this to be an extension to sklearn's neural network capabilities, ie work with all the infrastructure that sklearn has built
nice little project. admittedly i don't think i or many other people would use this when something like skorch is available:
https://skorch.readthedocs.io/en/latest/?badge=latest
but it seems like a good self study project!
ohh, I had no idea this existed,
fair enough,
I did start this project while trying to understand gradient descent and other optimizers (Adam etc). so not all is wasted :), I learnt something new
the only thing I dont understand is how/where to implement l1 and l2 regularisation
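A common place to put it, sketched independently of the linked repo (whose internals are not shown here): add the penalty to the loss, and add the penalty's gradient to each weight gradient just before the optimizer step. For a penalty of `l1 * sum(|W|) + 0.5 * l2 * sum(W**2)` that extra gradient is `l1 * sign(W) + l2 * W`. Biases are usually left unregularised. The function names below are made up.

```python
import numpy as np


def penalty(W, l1=0.0, l2=0.0):
    """Value added to the loss for one weight matrix."""
    return l1 * np.abs(W).sum() + 0.5 * l2 * (W ** 2).sum()


def regularised_grad(W, grad_W, l1=0.0, l2=0.0):
    """Data-loss gradient plus the gradient of the penalty above."""
    return grad_W + l1 * np.sign(W) + l2 * W


W = np.array([[1.0, -2.0], [0.0, 0.5]])
grad_W = np.zeros_like(W)  # pretend the data-loss gradient is zero

lr = 0.1
# A plain SGD step with only the L2 term shrinks every weight towards zero.
W_new = W - lr * regularised_grad(W, grad_W, l2=0.1)
```

The regularised gradient can then be handed to whichever optimizer (SGD, Adam, ...) the network already uses.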
Hello, I have looked everywhere for the answer to this. I am using keras / tensorflow and creating a model
history = model.fit(
File "C:\Python310\lib\site-packages\keras\utils\traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Python310\lib\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
Input is empty.
[[{{node decode_image/DecodeImage}}]]
[[IteratorGetNext]] [Op:__inference_test_function_7901]
2022-10-20 22:04:21.910095: W tensorflow/core/kernels/data/cache_dataset_ops.cc:856] The calling iterator did not fully read the dataset being cached. In order to avoid unexpected truncation of the dataset, the partially cached contents of the dataset will be discarded. This can happen if you have an input pipeline similar to `dataset.cache().take(k).repeat()`. You should use `dataset.take(k).cache().repeat()` instead.```
The error is within the package, tried reinstalling, doesnt wanna work
so I am trying to split a csv into X and y
here is my code:
# Python version
import sys
from sklearn.metrics import make_scorer
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.feature_selection import RFE
# Load dataset
url = "energy.csv"
#url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['YEAR', 'TOTAL', 'PURCHASED', 'NUCLEAR', 'SOLAR', 'WIND', 'NATURAL_GAS', 'COAL', 'OIL']
dataset = read_csv(url, names=names)
print(dataset.shape)
# Split-out validation dataset
array = dataset.values
X = array[:, 0:8]
y = array[:, 8]
print(y)
when I print y in the last line tho
I get this:
[ 19.9948 0. 0. 0. 0. 0. 0.
0. 0. 260.2 1326.9 nan nan nan
nan nan 723.18 2070. ]
which I dont believe is supposed to happen (the 'nan' thing)
can anyone who knows this kind of stuff tell me whats wrong because im not really sure
thanks in advance
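a quick way to check where the nan values come from, assuming they are blank cells in the csv (that's a guess, energy.csv itself isn't shown here). the small inline csv below is a made-up stand-in with fewer columns:

```python
import io
import pandas as pd

# Hypothetical stand-in for energy.csv: two rows have a blank last field.
csv_text = "2001,10.0,1.0,2.0\n2002,11.0,1.5,\n2003,12.0,,\n2004,13.0,2.5,4.0\n"
names = ["YEAR", "TOTAL", "SOLAR", "OIL"]
dataset = pd.read_csv(io.StringIO(csv_text), names=names)

# Blank cells are parsed as NaN; count them per column.
nan_counts = dataset.isna().sum()
print(nan_counts)

# Show the rows that contain at least one NaN.
print(dataset[dataset.isna().any(axis=1)])

# One option: drop incomplete rows before splitting into X and y.
clean = dataset.dropna()
X = clean.iloc[:, :-1].to_numpy()
y = clean.iloc[:, -1].to_numpy()
print(y)
```

whether dropping rows or filling the gaps (e.g. with `fillna`) is the right call depends on what the blanks mean in the data.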
One of my images was corrupt I believe, ignore me
let's say i made a nice scatterplot that uses the whole dataframe, and i want to generate smaller scatterplots that each limit the dataframe to a specific entry in one of the columns. e.g. a column "species" has 6 different birds. how do i generate those similar but slightly different subsets of my main scatterplot? idk if i make any sense
so here's what im working with and it made my original scatterplot, but 6 times. now i just need 6 different scatterplots with just the data of a specific entry of my column "surface". which has 6 entries.
num_rows, num_cols = 3,2
fig3, ax3 = plt.subplots(num_rows,num_cols,figsize=(10,12))
fig3.set_facecolor("None") # good
plt.style.use("default") # good
for i in range(num_rows):
    for j in range(num_cols):
        ax3[i,j].scatter(x,y,alpha=0.3)
        ax3[i,j].plot(dbl_ratio_year_avg.index, dbl_ratio_year_avg, "-", color="C1")
ax3[2,0].set_xlabel("Year")
ax3[2,1].set_xlabel("Year")
ax3[0,0].set_ylabel("Double faults per match")
ax3[1,0].set_ylabel("Double faults per match")
ax3[2,0].set_ylabel("Double faults per match")
you just need to filter x and y in the loop. these are called "small multiples" plots, fyi.
thats the name! and good evening to you! thanks for taking my questions again
and maybe do some clever indexing as well, but that's not strictly necessary
let's say that you want to split according to a series or array called categ
the only tricky bit here is figuring out which element in the axes array corresponds to which category
this is what ive found. fifth column of my dataframe
there are a couple different ways to do it actually
@fringe anvil
you can use some clever indexing for this:
df2["dbl_ratio"] = (df2["player1 double faults"] / df2["player1 total points total"])
surfaces = df2['surface'].unique().tolist()  # unique() typically returns a numpy array, which has tolist(), not to_list()
num_rows, num_cols = 3, 2
fig3, axs3 = plt.subplots(
    num_rows, num_cols,
    figsize=(10, 12),
    sharex=True, sharey=True,
)
for k, surface in enumerate(surfaces):
    # keep only the rows for this surface
    df_surface = df2.loc[df2['surface'] == surface]
    dbl_ratio_year_avg = df_surface.resample('AS', on='start date')["dbl_ratio"].mean()
    # convert the flat index k into (row, column) indexes of the axes grid
    i, j = np.unravel_index(k, (num_rows, num_cols))
    a = axs3[i, j]
    a.scatter(df_surface['start date'], df_surface['dbl_ratio'], alpha=0.3)
    a.plot(dbl_ratio_year_avg.index, dbl_ratio_year_avg, color="C1")
axs3[2, 0].set_xlabel("Year")
axs3[2, 1].set_xlabel("Year")
axs3[0, 0].set_ylabel("Double faults per match")
axs3[1, 0].set_ylabel("Double faults per match")
axs3[2, 0].set_ylabel("Double faults per match")
fig3.tight_layout()
plt.show()
you can also do this a bit more elegantly with pandas groupby, but this is good enough to start with
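a rough sketch of what the groupby version could look like. the data here is made up (only the column names match the chat), it uses a 2x2 grid with 4 surfaces, and it averages by calendar year via `dt.year` instead of `resample`, so treat it as an illustration rather than a drop-in replacement:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to open a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data with the same column names as in the chat.
rng = np.random.default_rng(0)
n = 120
df2 = pd.DataFrame({
    "surface": rng.choice(["Clay", "Grass", "Hard", "Carpet"], size=n),
    "start date": pd.to_datetime("2000-01-01")
    + pd.to_timedelta(rng.integers(0, 3650, size=n), unit="D"),
    "dbl_ratio": rng.random(n) * 0.1,
})

num_rows, num_cols = 2, 2
fig, axs = plt.subplots(num_rows, num_cols, figsize=(8, 8), sharex=True, sharey=True)

# groupby yields (key, sub-frame) pairs; zip pairs each one with an axes
# from the flattened grid, so no manual index bookkeeping is needed.
for ax, (surface, df_surface) in zip(axs.ravel(), df2.groupby("surface")):
    year = df_surface["start date"].dt.year
    yearly_avg = df_surface.groupby(year)["dbl_ratio"].mean()
    ax.scatter(year, df_surface["dbl_ratio"], alpha=0.3)
    ax.plot(yearly_avg.index, yearly_avg, color="C1")
    ax.set_title(surface)

fig.tight_layout()
```

groupby sorts the keys, so the panels come out in alphabetical order of surface.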
np.unravel_index is worth understanding. think of a 3x3 array:
a00 a01 a02
a10 a11 a12
a20 a21 a22
now imagine "walking" through this array by going across each row. when you get to the end of the row, jump down to the beginning of the next row, like a typewriter:
-->---->---->-|
a00 a01 a02 |
|--------------
-->---->---->-|
a10 a11 a12 |
|--------------
-->---->---->-|
a20 a21 a22
idk if my hilariously bad illustration helps
what's the array index of the 6th step (k = 5 with zero-indexing) along that walk? it's 1, 2.
imagine if you were to flatten out the array, connecting rows end-to-end, to produce a 1-d array. then flatten(a)[5] == a[1, 2]
!eval numpy calls this "ravel" (a pun on "unravel", like yarn or thread):
import numpy as np
a = np.arange(9).reshape((3, 3))
assert a.ravel()[5] == a[1, 2]
@desert oar :warning: Your 3.11 eval job has completed with return code 0.
[No output]
and you can convert between these "flat" ("raveled") indexes and the "non-flat" ("unraveled") array indexes with np.unravel_index and np.ravel_multi_index
so either of these would work
i, j = np.unravel_index(k, (num_rows, num_cols))
a = axs3[i, j]
a = axs3.ravel()[k]
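a small check of the round trip between flat and array indexes, using a plain integer array (`grid`, a made-up name) in place of the axes array:

```python
import numpy as np

num_rows, num_cols = 3, 2
shape = (num_rows, num_cols)
grid = np.arange(num_rows * num_cols).reshape(shape)

for k in range(grid.size):
    i, j = np.unravel_index(k, shape)
    # ravel_multi_index is the inverse of unravel_index.
    assert np.ravel_multi_index((i, j), shape) == k
    # For the default row-major order this is the same as divmod.
    assert (i, j) == divmod(k, num_cols)
    # Indexing the flattened array with k hits the same element as [i, j].
    assert grid.ravel()[k] == grid[i, j]
    print(k, "->", (int(i), int(j)))
# 0 -> (0, 0)
# 1 -> (0, 1)
# 2 -> (1, 0)
# 3 -> (1, 1)
# 4 -> (2, 0)
# 5 -> (2, 1)
```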
@fringe anvil does that make any sense at all?
this is actually how numpy arrays are stored internally: as one big flat array. all the multi-dimensional axis stuff is an illusion, produced by looping over the array contents internally
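to make the "one big flat array" point concrete: for a default (C-contiguous) array, `ravel()` returns a view on the same memory, and the strides say how many bytes to skip per step along each axis. a short sketch, with the dtype pinned to int64 so the byte counts are the same on every platform:

```python
import numpy as np

a = np.arange(9, dtype=np.int64).reshape((3, 3))
flat = a.ravel()

# For a contiguous array, ravel() is a view, not a copy.
print(np.shares_memory(a, flat))  # True

# 8 bytes per element: one step along a row is 8 bytes,
# one step down a column skips a whole row of 3 elements (24 bytes).
print(a.strides)  # (24, 8)

# Element [i, j] lives at byte offset i * 24 + j * 8, i.e. flat index i * 3 + j.
flat[5] = -1
print(a[1, 2])  # -1
```

so `a[i, j]` is index arithmetic on the flat buffer rather than a nested structure. non-contiguous arrays (e.g. some slices or transposes) have different strides, and `ravel()` may then return a copy.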
sorry im back. almost forgot to load my winter tires in the car for tomorrow lol
that looks 3x3 to my untrained eye
it looks like 3x3 to my trained eye as well 😆
yeah it does, you iterate over k,v with enumerate. then use .loc to assign the iteration of the surfaces type to itself then assign that to the figure
right. but do you understand my explanation about the array indexing business?
yeah looks like it's doing the job of a double for loop, for i in stuff for j in stuff.. so iterate through the elements in first row, then go to second row and do the same etc
thats what unravel_index does from what i can see
yeah, it's not looping, but it's converting the k "flat" index into i, j array indexes, at each step of the loop
ah yeah, thats what i just saw in your few next lines. im trying to read while out of breath, brain isnt following apparently
so we are enumerating on surfaces, is that df2["surface"]