#data-science-and-ml
hmm, I'm not sure how to do it. Covariance matrix of what, for one? The deviations of individual samples?
ah, if I understand correctly, you're basically saying the variance depends on p here - higher close to the center, smaller at the edges. So, diagonal covariance matrix but not an identity one.
yep, that's exactly what i mean
the least squares function you proposed is optimal when the deviation from the mean follows an IID normal distribution
my naive idea would be to do something like (((pred - Y)**2)/pred).mean() to attempt to fix that - to make samples far from the center have more weight. This essentially assumes that the variance is proportional to the value here
that works too, sorta like the noise-based rescaling in wiener filters
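A quick sketch of that reweighting next to the plain least-squares loss, assuming the variance really is proportional to the predicted value as described above. The arrays are made-up numbers, and the `eps` guard against dividing by zero is an addition, not part of the original one-liner.

```python
import numpy as np

def weighted_mse(pred, y, eps=1e-12):
    """Squared error divided by the prediction (variance assumed proportional to pred)."""
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(y, dtype=float)
    # eps is a hypothetical safeguard so a zero prediction cannot divide by zero
    return float(np.mean((pred - y) ** 2 / np.maximum(pred, eps)))

# Hypothetical values: large near the centre, small at the edges
pred = np.array([0.50, 0.10, 0.01])
y = np.array([0.40, 0.08, 0.02])

plain = float(np.mean((pred - y) ** 2))   # ordinary least squares
weighted = weighted_mse(pred, y)          # edge samples count for more
```

With these numbers the edge sample (0.01 vs 0.02) contributes almost nothing to the plain loss but about a third of the weighted one, which is the intended effect.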
this looks like a fun sunday afternoon task 😛
really shitty way I've been thinking of: sample some random values according to this probability distribution. Calculate the mean and std of these samples. Use them. 🥴
that can be done... but it's kinda like solving another copy of this exact same problem at every single point on the graph
nothing wrong with that ofc, it's a montecarlo approach
ah right, of course, I didn't think of it that way (that it's just the monte-carlo solution to this).
i kinda wanna try it out now 😛 care to share the data? x and y axes
I'm planning on training a multi-label classifier for a project. The goal is a classifier that can tell if one or multiple specific letters are in an image. Could I still use only training images with a single letter in the image?
I also have images of bigrams and trigrams, but the problem would be that all images would have very different shapes. Would a solution to this be padding uni-grams and bi-grams with white space on the left and right before classification?
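A minimal sketch of the padding idea, assuming 8-bit grayscale crops where 255 is white and all crops already share the same height. The function name `pad_to_width` and the 32x20 / 64-pixel sizes are made up for illustration.

```python
import numpy as np

def pad_to_width(img, target_width, fill=255):
    """Centre a 2-D grayscale image horizontally by padding left and right with `fill`."""
    height, width = img.shape
    if width > target_width:
        raise ValueError("image is wider than the target width")
    extra = target_width - width
    left = extra // 2
    right = extra - left
    # Pad only along the width axis; the height stays untouched
    return np.pad(img, ((0, 0), (left, right)), mode="constant", constant_values=fill)

unigram = np.zeros((32, 20), dtype=np.uint8)   # hypothetical single-letter crop (all black)
padded = pad_to_width(unigram, 64)             # same height, now 64 pixels wide
```

Whether a classifier trained this way generalises to real bigrams and trigrams is a separate question; this only makes the shapes uniform.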
me ? sorry I just checked
yeah, you indeed. idk if i'll have a chance now, but i can play around with it at some point soonish i think. maybe tomorrow
or i can assign it to my students and see what they come up with haha
would you rather have the numpy file directly or the github repo with all the functions?
it's basically for my internship at uni
hooo
just the numpy file, some npy file with the vectors inside
Hey @wooden forge!
It looks like you tried to attach file type(s) that we do not allow (.npy). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
:|
you know what
Imma put them in my repo
and share you the link
https://github.com/ChrisZeThird/Kicked-Rotor Quantum>Graphs>Spin @wooden sail
Also Edd, you can use the different scripts to generate Gaussian distributions
Basically the Quantum Case (quantumKickedRotor.py)
you can also play with the initial State, this one is a simple Dirac spike, but you can put whatever you want
for the spin case, Psi is a ket with a Dirac distribution and a 0 array, here again you can put whatever you want
I'm kinda busy rn so I can't generate some array for you
you can also play with the beta parameters, that just allows you to compute an average value
if you could generate it and upload it later on, that'd be best. i'm familiar, but not fluent in this notation 😛
ho sure no problem
just remind me later in the day
What is wrong with my pd dataframe here?
import pandas as pd
file_path = 'data/ngrams_frequencies_withNames.xlsx'
df = pd.read_excel(file_path)
print(df.head(), '\n')
print(df['Frequencies'].head())
Do the hebrew characters somehow flip the stuff? haha
when the helper needs help you know it's a hard question
maybe it's the utf thing ?
maybe you have to specify somehow that there are Hebrew characters?
Yeah I'm just dropping them since the information is also supplied by the names column, but definitely weird
@mild dirge did you try making a copy and filling the names column with Latin letters? Or some other LTR characters?
No, but after removing them it displays it as normal
Welp
👍🏽
I blame Zig and Scaleios.
why do you need to train anything? can you use an existing trained Optical Character Recognition?
Well they are Hebrew characters, and it's a school assignment to make the OCR and segmentation etc. ourselves
tesseract says it supports Hebrew, but if you have to do it yourself that makes sense
It's also really old text, and I'm making a multi-label classifier for better image segmentation, so I need more than just the most probable letter
Hi, can someone help me interpret the results of my VECM estimation? I have a hard time understanding the coefficients that I get!
Hello again everyone. I have been training a CNN in PyTorch to classify images from the CIFAR10 dataset, with the stipulation that it only be three hidden layers. With help from @mild dirge I changed the architecture to be 2 convolutional layers and a fully connected linear layer, leading to much stabler results.
While tuning hyperparameters, I noticed that there continued to be a gap between the training and validation data; the former achieved accuracies of 70-80% while the latter languished around 45-50% (picture 1 below). After implementing weight decay of 0.01 in the optimiser, I significantly closed the gap, but with the result that the training accuracy itself now achieves only around 45% (picture 2 below). Does anyone have any suggestions to what I could tinker with to improve the accuracies while keeping the training-validation gap small?
The hyperparameters are:
- batch size of 8
- optimiser using SGD with a l.r. of 0.001, momentum of 0.9
- cross entropy loss function
If you need any more information, please let me know :)
In case anyone's wondering, After Steve is a good read so far (1/7th through)
for a moment I thought you were doing what I am doing and I got so confused
and then I read the legend lol
It's probably important to note that your goal shouldn't be to get a small gap between the train/valid accuracy, the second graph isn't better than the first imo
The first just shows over-fitting, and the second shows that the model is not even able to correctly predict the patterns it is trained on, thus maybe even underfitting
What size images do you have?
@lone yacht
Hi @mild dirge , the images are 32x32 pixels with 3 channels for RGB
A problem could be that the receptive field of the final convolutional layer is too small to learn more general patterns
Currently it is 5x5 for each feature
You could try experimenting with different kernel sizes, and maybe dilation, to increase this receptive field; that might help depending on your data
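To see how kernel size and dilation change that number, here is a small receptive-field calculator. A 5x5 field would be consistent with two stacked 3x3 convolutions with stride 1 and no pooling in between, but that is an assumption about the architecture, not something confirmed here.

```python
def receptive_field(layers):
    """Receptive field (one side, in pixels) of a stack of conv layers.

    layers: list of (kernel_size, stride, dilation) tuples, first layer first.
    """
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1   # dilation spreads the taps apart
        rf += (effective_kernel - 1) * jump
        jump *= stride                                   # stride scales later layers' reach
    return rf

two_3x3 = receptive_field([(3, 1, 1), (3, 1, 1)])   # assumed current setup
two_5x5 = receptive_field([(5, 1, 1), (5, 1, 1)])   # larger kernels
dilated = receptive_field([(3, 1, 1), (3, 1, 2)])   # dilation 2 on the second layer
```

Under these assumptions the field grows from 5 to 9 with 5x5 kernels, or to 7 with dilation on the second layer at no extra parameter cost.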
Thanks Camel, I'll have a look at that 👍
Hello all, I'm working on a NN that can predict the next dataset from a trend. As of now I can gather the differences from one array to another, the problem is what would my inputs be as they would be different each time and how could I use that to help determine outputs?
This is what I have so far and it works, but I'm uncertain how to store the "trend" data.
Hey @scenic tulip!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
Hey @scenic tulip!
You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.
So I have lists in a file each containing 20 elements, which are integers. I need to be able to find the difference of the first 2, take that set of values and apply it against the next list....so on and so forth until there are no more lists to process
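If the lists are stacked into one array, the element-wise change from each list to the next can be computed in one call. This is only a sketch of the "difference" step: the rows below are invented and shortened to three elements instead of 20.

```python
import numpy as np

# Hypothetical data: one row per list, shortened for readability
draws = np.array([
    [1, 5, 9],
    [2, 7, 9],
    [4, 7, 12],
])

# Row i of `deltas` is draws[i + 1] - draws[i]
deltas = np.diff(draws, axis=0)
```

Each row of `deltas` could then serve as a stored "trend" vector, though whether those trends carry any predictive signal depends entirely on the data.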
Hello guys, I have some data with a ranking of X names each name assigned with a score. The ranking is performed according to the score going from the highest to the lowest score. I want to perform some clustering to see if there are some clusters in the dataset between the names. Is k-means approach the best way to do this ?
What model are you using, linear or logistic regression?
Oh a neural network
I don’t think a neural network would be required for that
linear
What’s the NN for
to predict the trend based on all previous results
How big is this dataset?
Is each feature a cumulative sum of the previous ?
Or it’s just independently made from two of its own columns
ok so im trying to predict Keno, so i have all the previous keno drawings
What is keno
If ur trying to predict the lottery 😂
but it doesn't get picked by floating balls that get shat out tubes, it uses PRNG
Ohh
I tried to find the seed that matched the first lottery drawing
after 4.29 billion combos still nothing
so i don't think i can find the Mersenne value of an array lol, or else id find them all
One for each week results?
By mersenne you’re referencing the twister?
yes, because PRNG is actually not random. so if i could find the MT values of the drawings, perhaps i could find a correlation between the PRNG idk lol
Wait what is a mersenne number again
ive looked into bit shifting, which in my case means id have to reverse the Mersenne Twister. that's fine with single integers
the Mersenne number is what is used to generate the random number
Sounds completely impossible dude
Interesting topic though
so if i seed the PRNG with the value of 1 and run it, it will produce a result, if i close and restart it produces the same result
if you run that same value of 1 in a loop of x, you get different results because they bit shift the seed value to produce a new result
Ahhh, a Mersenne number is 2^n - 1
that being said, i do think its quite impossible so im trying to go back through and find the trend of all the data
There’s prob only been a few thousand winning combos so far
but there are a finite amount of drawings
I do not think that’s enough
hmmm
Even if u had 100,000 drawings
Just a load of pseudo random draws
What’s the plan?
ill show you what i tried before
from random import randrange, sample, shuffle
import numpy as np
import time as t
ra = []
fa = []
temp = 0.0
inf = float('inf')
counter = 0
actualcount = 0
dummyarr = np.array([ 1, 2, 18, 19, 22, 24, 25, 35, 41, 44, 45, 50, 52, 54, 55, 59, 67, 68, 70, 74])
while np.array_equiv(dummyarr, ra) != True:
    actualcount += 1
    np.random.seed(counter)
    #ra = np.random.choice(range(1, 80), size=20, replace=False)
    ra = np.array(sample(range(1, 81), k=20))
    fa = np.sort(ra)
    ra = fa
    print(ra)
    if actualcount <= inf:
        counter += 1
        temp = inf
print(ra)
print("Count")
print(counter)
print("ActualCount")
print(actualcount)
print("Infinity = {}".format(temp))
so in that i had a dummy array that i tried to find the seed value for. that failed after i ran out of seed range at 2^32, which is about 4.29 billion seeds (the 32-bit limit, not 64-bit). i think theres like a trillion combos for keno, not sure
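One thing worth checking in that loop: `np.random.seed(counter)` seeds NumPy's generator, while `sample` comes from the standard-library `random` module, so the seed would not influence the draw at all. A small sketch of a seeded draw that is actually reproducible, using only `random`. The helper name `draw` is made up, and nothing here says anything about how the real keno system seeds its generator.

```python
import random

def draw(seed):
    """Return one sorted 20-of-80 draw from a generator seeded with `seed`."""
    rng = random.Random(seed)            # independent generator, seeded explicitly
    return sorted(rng.sample(range(1, 81), k=20))

first = draw(1)
again = draw(1)    # same seed, so the same 20 numbers come back
other = draw(2)    # different seed, in practice a different draw
```

Using a private `random.Random(seed)` instance instead of the global state also avoids any confusion about which generator a given seed call affects.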
you like the actualcount <= inf ? 😂
are you trying to roll until you hit the dummy?
How long did that run for?
It’s not a trillion it’s 20^20 combos right?
Not possible unless you have nasa supercomputer from 100 years in the future
Even if u did manage to land on their combo that doesn’t tell u about any seed value and it doesn’t tell u if they use a constant value
Waste of time
well there are 80 numbers
20 get picked each draw
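For scale: order doesn't matter and numbers can't repeat within a draw, so the number of possible draws is "80 choose 20" rather than 20^20 or a trillion. A quick check with the standard library:

```python
import math

n_draws = math.comb(80, 20)     # distinct sets of 20 numbers out of 80, roughly 3.5e18
n_seeds_32bit = 2 ** 32         # the 4.29 billion seeds a 32-bit seed space allows

# How many possible draws there are per available 32-bit seed
draws_per_seed = n_draws / n_seeds_32bit
```

So even an exhaustive sweep over every 32-bit seed can reach only a tiny fraction of the possible draws, which fits with the search never hitting the dummy array.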
yeah i was using different seed values to try to hit the dummy array, but yeah finding all those would be impossible
you get to pick up to 10 numbers. I just really want to see if they truly randomize the game with PRNG or if someone is behind the scenes inputting the numbers
i ran it for like 2 days and it hit 4.29 billion combinations before it ran out of memory for the counter
It really should because it would give me the seed value for that set of drawings. if i can find the seed values in a consecutive set of drawings I could tell if it's actually random. really i would need to sample everything. there are actually alot of ways to go about it but it's just time
I could figure out how they shifted the bits to arrive at that conclusion, basically deciphering the Mersenne Twister and reverse-engineering all future drawings back to the correct seed
well, just ask your pandas question. try dragging the CSV file into this chat and explain what you are trying to do.
i'm trying to read a csv file into jupyter notebooks but i am getting an error
copy/paste the text of the error
table = pd.read_csv("wineReviews.csv")
le Code*
Hey @fading geyser!
You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.
read the message
yes, so read the message and it tells you what to do.
Hey @fading geyser!
It looks like you tried to attach file type(s) that we do not allow (.docx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
does anyone know a good way to open .fst files inside of python to be used with pandas?
sry bro no idea
yeah just did it
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x93 in position 10: invalid start byte -- so the problem is that the file encoding for the csv file is not utf-8
and how to change it
try pd.read_csv("wineReviews.csv", encoding='ascii')
ohk
File "<ipython-input-9-b6ef65161787>", line 1
table = pd.read_csv("wineReviews.csv" encoding='ascii')
^
SyntaxError: invalid syntax
add the comma
this error is showing
I have to leave soon, btw. did you add the comma and re-run?
yeah it didn't work
this is the error
try pd.read_csv("wineReviews.csv", encoding_errors='ignore')
I won't be able to diagnose the problem without seeing the file itself, unfortunately.
something about the way it's encoded is unexpected.
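One guess, without having seen the file: byte 0x93 is the left curly quote in Windows-1252 (cp1252), a common encoding for CSVs exported on Windows. `encoding='ascii'` cannot help because ASCII only covers bytes below 0x80. The sketch below reproduces the same error on a made-up row and shows the cp1252 decode. If the guess is right, `pd.read_csv("wineReviews.csv", encoding="cp1252")` would be the pandas equivalent.

```python
# Hypothetical CSV row containing curly quotes, encoded the way Windows tools often do
raw = "\u201cCrisp and dry\u201d,87\n".encode("cp1252")

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # Same failure as in the traceback: 0x93 is not a valid UTF-8 start byte
    utf8_ok = False

text = raw.decode("cp1252")   # decodes cleanly, curly quotes intact
```

If the file still looks like gibberish when opened in a text editor, it may not be a plain CSV at all, and no encoding argument will fix that.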
why does it look like that?
no idea
how did you get it?
it must be corrupted in some way. because csv files are supposed to be human readable
like, it will literally just be comma-separated values
night night
night
If u crack it share the winnings with me
I’ll take 1%
Hello, I have some data with a ranking of X names each name assigned with a score. The ranking is performed according to the score going from the highest to the lowest score. I want to perform some clustering to see if there are some clusters in the dataset between the names. Is k-means approach the best way to do this ?
clustering involves points in space. what are the coordinates going to be?
Ok so basically I have 7 sub-rankings and 1 ranking which aggregate the 7 sub rankings. Can these 7 sub-rankings be my coordinates ?
Or should I lower the number of variables ?
if we have a correlation that shows -.7 between two features, my intuition would be to drop it. What's everyone else's thoughts on this?
correlation of -0.7, what does that mean you think?
correlation of 0 means no correlation, 1 means a lot, -1 means a lot (but in negative direction)
so when one is higher, the other is lower and vice versa
Doesn't sound like useless information
and even then, no correlation between a feature and the desired output does not mean the feature cannot be useful in combination with other features
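Both points can be checked on toy numbers. The arrays below are invented: the first pair shows a strong negative correlation (as informative as a strong positive one), and the XOR-style pair shows features with zero correlation to the target that still determine it together.

```python
import numpy as np

# Two features that move in opposite directions: strongly (negatively) correlated
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([9.0, 8.5, 6.0, 5.0, 2.0])
r_ab = np.corrcoef(a, b)[0, 1]

# XOR-style target: each feature alone has zero correlation with y,
# yet the two features together determine y exactly
x1 = np.array([0, 0, 1, 1])
x2 = np.array([0, 1, 0, 1])
y = x1 ^ x2
r_x1_y = np.corrcoef(x1, y)[0, 1]
r_x2_y = np.corrcoef(x2, y)[0, 1]
```

A linear model would not pick up the XOR pattern, but tree-based models or models with interaction terms can, which is one reason not to drop features on correlation alone.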
I guess. what would it mean if two instances were close together according to that coordinate system?
okay gotcha, makes sense. Good looks on this
It would mean that 2 names have a similar profile in their ranking profile. Just imagine X names with classes rankings in Maths, Physics, English, German etc. People from a « scientific » cluster would have typically higher scores in Physics & Maths etc
Does that make sense to you?
Ok dropping rankings w/ an absolute correlation of 0.7 makes sense
And above*
yeah, so if you want to use those grades to cluster the people with different sets of skills that would work
OK
It wouldn't really give a label for each cluster though
it would just show a cluster for people with certain grades for certain subjects, but it won't say something like "scientific people"
But maybe you find a cluster of people who are really good at science subjects, and that they are commonly bad at languages f.e.
Thats perfect, I can interpret each category by myself, is kmeans the proper approach to do this ?
Clustering of unlabeled data can be performed with the module sklearn.cluster. Each clustering algorithm comes in two variants: a class, that implements the fit method to learn the clusters on trai...
Maybe check this out
there is more than kmeans, some work better for specific settings
The interpretation of the clusters would mainly be the mean of the cluster (so the average grades for the courses), the spread and the amount of points in a cluster
Perfect
Thanks
Do you have some idea about which approach would be the most adapted to the problem at first sight
From your experience
?
absolutely not, I don't have your data 😛
Might be good to try visualize your data in some way, and get to understand it such that you can make an informed decision yourself
the link I just gave also shows what usecases each method has, try to see what your situation is with respect to those usecases
Does anybody know how to interpret this? It's telling me yearsExperience, milesFromMetropolis, and degree have the least benefits? no way
Hey, I have a project for university. It's the prediction of energy usage of electric buses.
I decided to use XGBoost and after all I get an r2_score of 0.991. I'm really surprised because that felt really easy, and now I'm worried something isn't right.
I mean i put such low effort into it.
Is going for a lower r2score better sometimes?
Why do those values go so high? wut
that's what I'm sayingggg
maybe because I have 92 features
well I'm not sure... my model scored 74%
thats surprisingly good for that values lol
def takes some time to run, let me try it real quick
cv_score['test_score'].mean()
apparently "If None, the estimator’s default scorer is used"
didn't know the model would have an associated score
got an error?
ValueError: Classification metrics can't handle a mix of multiclass and continuous targets
ahh we can't use accuracy for regression metrics
yeah
Your model scored an accuracy you say?
But isn't the problem a regression problem?
yeah but how did you get this accuracy?
Correct a regression problem, I guess technically I put a cross validate score which found the 'test_score'.mean()
alright, try ‘r2’
or neg mean squared
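For reference, this is what those two regression metrics compute, written out by hand on made-up salary numbers (in the same "120, 200, 45" style of units). In scikit-learn the corresponding `scoring` strings are `'r2'` and `'neg_mean_squared_error'`. The negated form exists because scorers follow a "higher is better" convention.

```python
import numpy as np

# Hypothetical true and predicted salaries
y_true = np.array([120.0, 200.0, 45.0, 90.0])
y_pred = np.array([110.0, 210.0, 50.0, 95.0])

residuals = y_true - y_pred
mse = float(np.mean(residuals ** 2))
rmse = float(np.sqrt(mse))                         # same units as the target

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot
```

Accuracy has no meaning here because it counts exact label matches, and a continuous salary prediction almost never matches exactly.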
okay running now
it's a salary column
alright, so nothing over a few million then I assume
mostly int64 of numbers ranging from 50k to 300k
thats a hell lot of money
surgeons lmao
rubles*
haha USD
then those numbers are in cents?
no it's like this for example 120, 200, 45, etc
what is the NONE feature btw?
NONE meaning no degree obtained
so we want to predict salaries, degree obtained is with ordinal encoding- 0(HS) 1(bachelores) etc
LinearRegression
score came back: 0.7434167004759114
^ first model is super basic, no features being dropped, just wanna test it out
yeah thought if it was a nn, it might just only focus on the job, as that already says quite a lot about the salary
How did you normalize the years experience?
or did you not?
Normalized them with MinMax Scaler
I normalized my numerical features, which were yearsExperience and milesFromMetropolis
Also I had a column of companyId which was 63 unique values that I OHE.. thought about dropping this as we have a industry feature that may also tell us about the companies and thus reduce our features by 63 columns
Well I mean, that is what we are trying to find out with this permutation right
Maybe there is one or two companies that pay their employees like 10x more than others, and it would still be good to keep those in
Basically yes, my matrix correlation showed some cool things and the table made sense but then I ran into hits
anyone here use hyperopt or some kind of DEOptim library? I'm not sure at least with hyperopt how to batch runs together, i.e, have it return a set of 50 or so samples that I can run at a time then return all 50 results before it iterates the next batch of parameters
anyone know how complicated it is to convert a large matlab script to python?
not counting specific, just general similarity between syntaxes
I mean it heavily depends on what kind of code you want to convert
sometimes there are high level functions that are similar in both languages
If you want to plot something using matplotlib f.e. then obviously things are going to be quite similar etc.
I think i'm pretty comfortable with python, but having a hard time getting used to matlab, so to me the languages feel quite different
gotcha
im much more familiar with numerical computing in MATLAB
but i have an equation in the project im working on that wont plot in matlab, so i had to plot it in wolfram mathematica and generate a matrix of the values to import into matlab for a specific example
so im just trying to find a better solution, as i need my project to be able to run self sufficiently
do u know any trained CNN for salient object detection?
I'm kinda new with AI, but I made this program that used a dataset of types of brain tumors and used tensorflow to make a neural network which evaluates the accuracy it can distinguish the types of brain tumors, I was wondering how this could actually be used? Like what's the logic of using your own data entry of a made up brain tumor data and inputting it in this newly built AI?
I don't really know how brain tumors work, but there's such a thing as assistive AI, where models can be used to assist (but not replace) professionals in certain areas. Models that assist doctors in making diagnoses or designing treatment plans exist, but they have to be held to very high standards
Also I'm confused. Did you create a model that does this, or are you wanting to make one?
I made one, but its with a csv file with data not actual images
such as width, height, depth, color, of certain examinations
and i ran it and it gave me a overall accuracy of around 97%
so i just was wondering how you could actually use it, like could i use my own piece of data
i can show the program its pretty small
```
import pandas as pd
import tensorflow as tf
dataset = pd.read_csv('cancer.csv')
x = dataset.drop(columns=["diagnosis(1=m, 0=b)"])
y = dataset["diagnosis(1=m, 0=b)"]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(256, input_shape=x_train.shape[1:], activation='sigmoid'))
model.add(tf.keras.layers.Dense(256, activation='sigmoid'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1000)
model.evaluate(x_test, y_test)
```
So you trained a model. Nice. What you want to do with it you said?
Like put it into practice
like if i ever wanted to make a UI and have some enter their own data
whats the next step after training a model and putting it into practice
You would need an input data in the same form as training data, without target of course. Then you can make prediction
so another csv file with inputted data
is there a way it can show me the value 0 or 1 for each entry that it predicts
model.predict() https://www.tensorflow.org/guide/keras/train_and_evaluate
Is your training data 0 ans 1s now?
well the tumor is either malignant or benign so it's set with m as 1 and b as 0
so i meant show either as a 0 or 1
i can show one sec
Yeah, if that's the target then you will get such output on predict too.
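One detail: with a sigmoid on the final layer, `model.predict` returns a probability between 0 and 1 per row rather than a hard 0 or 1, so a threshold is needed to get labels. A sketch with invented probabilities standing in for the model output. The 0.5 cut-off is a common default, not something dictated by the model.

```python
import numpy as np

# Hypothetical stand-in for model.predict(new_x): one probability per row, shape (n, 1)
probs = np.array([[0.03], [0.92], [0.51], [0.49]])

threshold = 0.5                                   # assumed cut-off, tune if needed
labels = (probs > threshold).astype(int).ravel()  # 1 = malignant, 0 = benign per the column name
```

For a medical setting it may be worth lowering the threshold so fewer malignant cases are missed, at the cost of more false alarms.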
Just note that if your dataset is not balanced, then you 97% accuracy score could mean nothing if there are a lot more 1s than zeros
Say you have 97% of ones in your set. Then if you always predict 1, you always get 97% accuracy. But you can always predict 1 without ML ;)
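A quick way to check this on any label column is to compute the accuracy of the "always predict the most common class" baseline and compare the model against it. The labels below are made up to match the 97% example.

```python
from collections import Counter

# Hypothetical labels: 97 ones and 3 zeros
y = [1] * 97 + [0] * 3

majority_label, majority_count = Counter(y).most_common(1)[0]
baseline_accuracy = majority_count / len(y)   # accuracy of always predicting the majority class
```

If the trained model's test accuracy is not clearly above `baseline_accuracy`, the 97% figure says little, and metrics such as precision and recall per class are more informative.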
oh ya lol theres a lot more 0's below idk why theres a bunch of 1's in the beginning
could anyone take a look at my A* algorithm and tell me what im doing wrong
help pls
This may seem like a dumb question, But if im trying to get into data science and machine learning; Where is a good place to start or learn?
I have 1 million lines of data... any advice on how to optimize this for training and fitting a model? My random forest regression takes soooooooooo long
still need to test other models as well
You could use a GPU for training, that's one option
Can I just reduce the sample to 20% of the data? Variance will be the same
Take a small sample instead of running experiments, feature engineering, and training baseline models on all the data. Typically, 10–20% is enough. Here is how it is done in pandas:
```
sample_df.shape
(191583, 120)
```
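The sampling line itself didn't make it into the message above. One common way to do it is `DataFrame.sample`, sketched here on synthetic data (the column names and sizes are invented), followed by the quick "does the sample look like the full data" comparison.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "salary": rng.normal(120, 30, size=10_000),
    "yearsExperience": rng.integers(0, 25, size=10_000),
})

# Take a reproducible 20% random sample of the rows
sample_df = df.sample(frac=0.2, random_state=42)

# Rough representativeness check: compare summary statistics side by side
comparison = pd.DataFrame({"full": df.mean(), "sample": sample_df.mean()})
```

For a more careful check, compare `describe()` output or histograms per column, and for categorical columns compare `value_counts(normalize=True)`.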
Also valid option
Yeah will do this to build my model and I'll be happy 🙂
Just need to be sure the sample of the data is valid representation of the full dataset or even better production data
How do I do that?
Do eda on the data and compare it to sample
ahh I see what you mean
How do you train? What hparams?
For time sake I may not do that, because it's for job interview assignment I'm sure it's a cool technique I'll show them. They gave me 1 million lines of data so that could've been a tester
1 million rows should not be that painful
But why is it??
Unless you have millions of features?
I'm testing 1 million lines of data and 27 features
I assumed that's what's taking it so long?
Depends of machine as well
@tacit basin Does this graph make sense to you for important features for determining your salary??
2021 Mac 16inch... Fairly good computer
really would've thought miles From Metropolis would've been least concern lol
Don't know not a salary expert. But would love to learn outcome of your study to optimize my salary :)
I just hope my machine learning model is good :(( I'd like this job haha
I want roadmap for data analyst
thank you!
hey how can i apply kfold with MLPRegressor ?
i have 2 outputs and 7 input dimensions
```
from pandas import read_csv
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve,cross_val_predict
import numpy as np
import matplotlib.pyplot as plt
inputs = df_xg_nu[feat]
targets = df_xg_nu[['home_team_goal','away_team_goal']]
x_train, x_test, y_train, y_test = train_test_split(
    inputs, targets, test_size=0.25, random_state=2)
rate1 = 0.005
rate2 = 0.1
mlpr = MLPRegressor(hidden_layer_sizes=(12,10), max_iter=700, learning_rate_init=rate1)
scores = cross_val_predict(mlpr, inputs, targets, cv=5)
print(scores)
```
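Note that `cross_val_predict` returns the out-of-fold predictions, not scores. To get one score per fold, `KFold` together with `cross_val_score` is one option. The sketch below uses random stand-in data in place of `df_xg_nu[feat]` and the two goal columns, so the numbers themselves mean nothing.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPRegressor

# Random stand-ins for the real inputs (7 features) and the 2 target columns
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])

mlpr = MLPRegressor(hidden_layer_sizes=(12, 10), max_iter=700,
                    learning_rate_init=0.005, random_state=0)

# Shuffled 5-fold split; the model is refit from scratch on each training fold
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
neg_mse = cross_val_score(mlpr, X, Y, cv=kfold, scoring="neg_mean_squared_error")

# Scores come back negated ("higher is better"), so flip the sign before the root
rmse_per_fold = np.sqrt(-neg_mse)
```

With two outputs the error is averaged over both targets. Scaling the inputs inside a `Pipeline` usually helps an MLP converge, but that is left out here to keep the sketch short.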
I'm trying to fit elo scores against soccer match scores .
feat=['elo_offensive_1', 'elo_defensive_1', 'elo_home_offensive_1', 'elo_home_defensive_1', 'elo_offensive_2', 'elo_defensive_2', 'elo_away_offensive_2', 'elo_away_defensive_2','homey']
hi, i am currently researching and making a script to autoplay my game, now i want to add a command line at line 23 so that it can recognize that the match has been matched earlier than expected. initial opinion and continue to execute the next commands in the event, how to do it?
Makes sense, don’t people in cities get paid more
Interesting how exec roles are valued low
Hello guys, i'm currently working on Pyspark. I have a question about i can do a thing. I want to eliminate rows based on column values. I have two features (Home,HomeVariant) and i want to drop if both are positive in the same row. How do i do this?
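I wanted to use something like this: "if ViewHome = 1 & ViewHomeVariant = 1, then drop"
in pandas it would be df.drop(df.loc[(df['ViewHome'] > 0) & (df['ViewHomeVariant'] > 0)].index), but a PySpark DataFrame has no .loc or .index, so there you'd filter instead, e.g. df = df.filter(~((df['ViewHome'] > 0) & (df['ViewHomeVariant'] > 0)))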
how do i load data from a BatchDataSet tf?
May I know why https://gist.github.com/buttercutter/b6f526c56e20f029d68e6f9041c3f5c0#file-gdas-py-L396 gives runtime error on inplace operation ?
gdas.py line 396
```
self.nodes[n-1].connections[ni].forward(x, types=types) # Ltrain(w±, alpha)
```
Hello everyone
Please I need a little assistance in credit risk modeling in python. Link to some learning resources will do.
you're unlikely to get help if you ask to ask. try asking your actual question.
Does anyone know why it appear as null?
Does anyone know how to solve the cuda discrepancy between these two envs? I need cuda working on my yolo env as well for training some models
I tried uninstalling torch and reinstalling with https://pytorch.org/get-started/locally/
What all math concepts do I need to learn (like matrix) to learn ML using Scikit learn?
Hello! I am wanting to use the numba njit decorator in one of my classes but only if the import is successful. I was able to handle the import with:
with contextlib.suppress(Exception): from numba import jit
But I can't figure out how to only use the decorator if jit was imported. The only way I can get it to work is pull it out of the class and use a try and except block on a normal function and call it from the class but thats not very clean and i end up writing the function twice
try:
    from numba import jit
except ImportError:
    jit = lambda func: func
I guess
if you have jit = lambda func: func then using @jit as a decorator will have no effect
Ya I was able to handle the import just fine but if I add
@jit def func(self): ...
it will obviously error if I run it outside of my venv without numba installed
do you understand what my proposed code does?
simply suppressing the potential import error eliminates your only opportunity to know that the import failed
Ya I see what you are saying that I was just hoping for a way to do something like
try: @jit except: pass def func.....
inside the class
that isn't syntactically valid, unfortunately.
Yep which sucks. I will probably just abstract two versions or use an interface or factory instead. Thanks for your help!
you don't want to use my solution?
here's another option
HAS_NUMBA = True
try:
    from numba import jit
except ImportError:
    HAS_NUMBA = False
...
def func(a, b):
    ...
if HAS_NUMBA:
    func = jit(func)
Oh.... I understand what you mean now. I haven't used jit before now so I didn't look into if you could use it as not a decorator. My bad
Their overview just shows it being used as a decorator
in Python, a decorator is always a function. so this is just leveraging how Python itself works, rather than something specific to numba.
Got ya thank you!
@decorate
def func(a, b):
    ...
# is the same as...
def func(a, b):
    ...
func = decorate(func)
when you use a decorator, the function you're decorating gets passed to the decorator, which is also a function. and then whatever the decorator returns gets re-bound to the original name of the function
it's not even required that the decorator return a different function. or that it return a function
it can return whatever it wants
If I am passing kwargs to the decorator (currently nopython=True), I assume I can just add those in the decorate func call?
Is there a standard for the kwarg for the function I am passing it?
no, if your decorator looks like this
@decorator(x, y, z)
def func(a, b):
    ...
then decorator is a function that returns a function, and that returned function is the decorator.
def func(a, b):
    ...
func = decorator(x, y, z)(func)
meaning that this is the semantics.
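Putting the two ideas together: since `@jit(nopython=True)` calls `jit` first and then applies the result, a plain `lambda func: func` fallback would break as soon as keyword arguments are passed. A sketch of a fallback that tolerates both `@jit` and `@jit(...)`. The stand-in function is hypothetical and simply ignores whatever options it is given.

```python
try:
    from numba import jit
except ImportError:
    def jit(*args, **kwargs):
        """No-op stand-in that supports both @jit and @jit(nopython=True)."""
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]          # used as bare @jit: hand the function back
        def decorator(func):
            return func             # used as @jit(...): ignore the options
        return decorator

@jit(nopython=True)
def add(a, b):
    return a + b
```

With numba installed, `add` is compiled; without it, `add` is the original Python function, and the calling code does not need to know which case applies.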
ok cool! Thanks!
welcome to #decorators-and-ai
so by this it means that we dont have many values of type C execs, but we can see the importance of that column of "JobType"
any ideas on dropping the highly negative correlations? I'm afraid of dropping NONE because it's used up as the top correlation
U don’t have to drop anything btw
I feel like feature selection is such a forced process in many projects
Which model will you use
Or will you choose best performing
Man was seriously thinking this...
I'm running three Models one for LinearRegression, Random Forest, and KNN Regressor
Why don’t u run them all and choose best
It takes like 10 lines of code
Use a box plot
Yup exactly what I'm doing haha
box plot for what?
U can do it in one cell
i'm using a CV_SCORE to see where they range at
Run all three default models and output a plot of each average score over 10 folds
Ez
But u shud seriously consider adding other models to this plot
how many models should I run? I was only gonna do 3 and compare the RMSE scores?
perfect
sweet man, sounds good! Appreciate the help
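A rough sketch of the "run them all over 10 folds and compare RMSE" idea. To keep it self-contained it uses plain NumPy with synthetic data and two stand-in models (a mean baseline and ordinary least squares) instead of the LinearRegression / Random Forest / KNN models mentioned above; all names here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (assumption: stands in for the real dataset).
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=300)

# Build the 10 folds once so every model is scored on the same splits.
folds = np.array_split(rng.permutation(len(X)), 10)


def fit_mean(X_tr, y_tr):
    """Baseline: always predict the training mean."""
    mean = y_tr.mean()
    return lambda X_new: np.full(len(X_new), mean)


def fit_linear(X_tr, y_tr):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(X_tr)), X_tr])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ coef


def cv_rmse(fit, X, y, folds):
    """Average RMSE over the folds; each fold is held out exactly once."""
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predict = fit(X[train_idx], y[train_idx])
        err = predict(X[val_idx]) - y[val_idx]
        scores.append(np.sqrt(np.mean(err ** 2)))
    return float(np.mean(scores))


results = {
    "mean baseline": cv_rmse(fit_mean, X, y, folds),
    "linear": cv_rmse(fit_linear, X, y, folds),
}
for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

Lower RMSE is better, so the comparison picks the smallest average, not the largest.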
Then consider taking the highest accuracy model
Unless you will prioritise AUC and ignore accuracy
the job interview wants me to work with RMSE
so I'm gonna take the RMSE of each model and compare them
Fair
but they do ask for another metric so i'm gonna see what other good approaches are
maybe accuracy
That’s not rly gona work?
why's that
oh true true
I think rmse is popular
How do I solve this question?
9. a) Please estimate the RMSE that your model will achieve on the test dataset.
I mean the question is a bit confusing to me haha
Yeah just estimate rmse post tuning?
It will spit out a result
But it’s gonna overfit
Bit odd
You'd probably need to use some validation dataset if you want to estimate your model's performance on the test data
Otherwise you can't really estimate what the performance would be on the test data, as you only have the training data available
So I actually do have the X_test file
right but that would just straight up give you the rmse, not some estimation of it
I thought in order to get an estimate was with the X_train,y_train, x_test, y_test and we would use the x_test and y_test to get an estimate
Maybe they just want to have the rmse of the test data
hmm how the heck do I get them that?
a) Please estimate the RMSE that your model will achieve on the test dataset.
b) How did you create this estimate?
you can just give the model the test input, compare the output with the desired output, which gives the actual rmse
but they want an estimate
interesting
so a naive solution would be to give the rmse on your training data as an estimate of the rmse on your test data
right
but this would be bad, as it will perform better on the data it is trained on than new data
absolutely
so you want some new data, that it is not trained on
hmm
i.e. validation data
so then take the rmse of the test data?
how would you approach this?
you want to give an as close as possible estimate of your model's performance on the test data, without using the test data
Like I said, some data that the model is not trained on, as this would give a biased performance result
so Validation data
okay so using cross validation
if you haven't left out any validation data, you'd have to retrain your model on less data, and test it out on this validation data that you left out
or cross validation could work too
But then it would also be bad if you already used this data for tuning your parameters
```py
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate, train_test_split

# instantiate model
rf = RandomForestRegressor()
# X, y are the features and target loaded earlier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)
# examine scores
cv_scores = cross_validate(rf, X_train, y_train, cv=5)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)
y_pred = rf.predict(X_test)
# results of MAE
print(metrics.mean_absolute_error(y_test, y_pred))
# print results of MSE
print(metrics.mean_squared_error(y_test, y_pred))
# print results of RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```
But now I'm just confused because how the heck do you find out the RMSE without using test data?????
and if I use my actualy test.csv file that's not estimating
you don't find the rmse, that is not what they want
Here you split your data into training and testing
okay right
do you have test data beyond this scope?
I have another file called test.csv that I have not used yet
to be used on my model
okay, and then you have some other file, that you split into train and test?
No I will not use that file unless it's for testing on my model
So basically have 2 test sets, one is the actual test set, and one that you created
yes
Okay, so one would be called the validation set normally
you have training data/validation data (both from 1 file), and then test data
From this you use only the training data for tuning the model, and training it
this can be done with k-fold cross validation
Then on the validation data you get an estimate of your model's rmse on new data
This is then also an estimate of your model's rmse on the test data, as this validation set is data that is also never shown to your model, yet you don't use the actual test data for this estimation
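A small sketch of that workflow, under loud assumptions: the data is synthetic, `make_data` is a made-up stand-in for loading the training file and test.csv, and the model is plain least squares rather than a random forest. The point is only that the validation RMSE is computed without touching the test set and can then be compared with the real thing.

```python
import numpy as np

rng = np.random.default_rng(1)


def make_data(n):
    """Hypothetical stand-in for loading a CSV: 3 features, noisy linear target."""
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=1.0, size=n)
    return X, y


def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


X_all, y_all = make_data(2000)    # plays the role of the training file
X_test, y_test = make_data(1000)  # plays the role of test.csv, untouched until the end

# Hold out 30% of the training file as validation data.
split = int(0.7 * len(X_all))
X_train, y_train = X_all[:split], y_all[:split]
X_val, y_val = X_all[split:], y_all[split:]

# Fit on the training part only.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

estimated_rmse = rmse(y_val, X_val @ coef)   # answer to (a): the estimate
actual_rmse = rmse(y_test, X_test @ coef)    # only known once test.csv is used

print(f"estimated from validation: {estimated_rmse:.3f}")
print(f"actual on test:            {actual_rmse:.3f}")
```

If hyperparameters were tuned, the validation rows used for the estimate should not be the ones used for tuning, otherwise the estimate is optimistic.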
ok ok got it man, thank you very much!
Ur splitting ur training data into test set ? U can just make test y
Take 20%
You could also just see it as having 2 test set, one you got from your "train data", and the actual test set
Why? Why not just use train and test
and then you ofc wouldn't use the test set from the "training data" for training
Because they want an estimate of the rmse on the test set they give
but an estimate implies that they don't want you to use the test set
then it would be the actual rmse
not if you specify that it is on the test data
it would be an estimate of the rmse on new data
Why would you use validation set
but it would be the rmse on the test data
I cross validate on training set
Because you want an unbiased estimate, so you can't use the data you use for tuning params
I thought after cross validation this gives you an idea of what your model score is. Once I do that I FIT my model and then score it on x_test and y_test
thus giving me my RMSE Scores
U don’t use training anyway so what’s the difference
Oh
I’m not sure how much it impacts the final result
Hey need help on a project
Is anyone ready to help .?
It’s based on linear regression
Maybe if you ask your question
Yeah so I built this model on a car resale dataset and I just want to decrease the bic
Also my model has like 30 features and if. I reduce them my efficiency falls
Should I share the file ?
That’s what I said, train and test no validation set, those are made temporarily using a sklearn function
yeah, but 2 test sets
I Didn’t bother before
well you'd need an extra one if you'd want an estimate of the rmse on the test.csv data, without using test.csv
but normally you wouldn't
But maybe they just mean the actual rmse of the model on the test.csv data, it would be kinda unclear
this line is giving me an error:
outputs=(layers.Dense(output_nodes, activation=tf.nn.softmax))
code block:
def create_model(input_shape, output_nodes):
    model = keras.Model(
        (layers.Conv2D(filters=64, kernel_size=3, input_shape=input_shape[1:], activation='relu')),
        (layers.MaxPooling2D(2, 2)),
        (layers.Conv2D(filters=64, kernel_size=3, input_shape=input_shape[1:], activation='relu')),
        (layers.MaxPooling2D(2, 2)),
        (layers.Conv2D(filters=64, kernel_size=3, input_shape=input_shape[1:], activation='relu')),
        (layers.MaxPooling2D(2, 2)),
        (layers.Flatten()),
        (layers.Dense(64)),
        outputs=(layers.Dense(output_nodes, activation=tf.nn.softmax)),
        inputs=(tf.keras.Input(shape=(input_shape))),
    )
    return(model)
error:
TypeError: __init__() got multiple values for argument 'outputs'
first tensorflow project not directly using a tutorial, so apologies if it's a bit scuffed
the problem is in how you created the class instance
you supplied too many arguments for a single parameter
I get that from the error, but im more confused on how that might be happening
because tf.keras.Input(shape=(input_shape)) is in parentheses
so if it was to return multiple values
wouldn't they still be put into a single tuple?
(input_shape) is not a tuple btw
ah, what is it
no idea, probably did it because it looks better imo
deleted it
(the parentheses I mean)
but yeah specifically looking at this line:
outputs=(layers.Dense(output_nodes, activation=tf.nn.softmax))
print returns a single object as expected
this error normally comes up when you supply the argument as a positional argument, and then as a keyword argument
!e
def my_func(key, *args):
    print("blabla")
my_func('red', key='blue')
@mild dirge :x: Your eval job has completed with return code 1.
001 | Traceback (most recent call last):
002 | File "<string>", line 4, in <module>
003 | TypeError: my_func() got multiple values for argument 'key'
like this
alright, thanks
So the problem is likely not that line, but one of the others
I'm more familiar with pytorch, so don't know what arguments it takes :/
Hey, got this error, my drivers are updated, version is compatible
Used command from pytorch website to install pytorch
Does this make sense for feature importances in a linear model for coef??
what's RFE?
so this graph for feature importance is wrong correct?
I believe this is only with random forest though, right? I need to test this for LinearRegression
@steady basalt how about now
seem like very normal scores again 👀
is that bad or good 😦
features that change your score by 100,000,000,000 seems pretty bad
it's just in my dang linear model
No I doubt it, those values don't really make sense, why would permuting any of your values result in values that are 10^11 lower than what they were
I really don’t know because I’m going back through my process and everything seems to make sense
@mild dirge I’ve never seen such high values before in coefficient so I’m like what the heck
There seem to be some more normally valued features
maybe try to find the difference between these and the 10^11 score affecting features
hmmm, just for time sake I may not include this aspect and only show feature importance of my random forest which makes more sense lol
Hey guys I had a quick question. Working on gathering data to put into my neural network. I have files, containing lists of 20 elements. I want to keep track of the "trend" in the data, so if i had 2 arrays 1.) [ 1, 2 ] 2.) [ 2, 3] the trend would be [ +1, +1 ]. For some reason I can't seem to store the result to add to the next 2 arrays differences. Here's my code.
I guess I'm asking. How do you store the difference of the current 2 arrays im using, then apply it to the next 2 arrays difference?
Not sure what you mean exactly, but for time dependent data some initial approach would be rolling regression
@scenic tulip
I'm not even to applying it to the NN yet
Yes
If you see my code, you'll see what im trying to do.
So you have multiple arrays like:
[1, 2, 3]
[4, 6, 2]
[1, 9, 2]
and then you want difference arrays:
[3, 4, -1]
[-3, 3, 0]
?
What kinda job?
Yeah, i've accomplished getting the difference of the arrays, but how to do hold that value and then apply it to the next two arrays that get differenced
"apply it to the next two arrays"
you have an array with differences, and two arrays with values, what do you mean apply
[1, 2, 3]
[4, 6, 2]
[1, 9, 2]
ok 123 is first array
difference would be like you stated 3, 4, -1
between array 1 and array 2
now
I have that difference, how could i keep that data going by grabbing 4,6,2 and 1, 9 , 2s difference and adding it to the first difference
So let me first show you how to make the differences a lot easier
and then i'll show you how to easily do that second part
you can rewrite it as a multiplication by a single row vector from the left
!e
import numpy as np
values = np.array([
[1, 2, 3],
[4, 6, 2],
[1, 9, 2]
])
differences = np.array([arr_b-arr_a for arr_a, arr_b in zip(values, values[1:])])
print(differences)
sum_of_differences = np.sum(differences, axis=0)
print(sum_of_differences)
oops 1 sec
should be the same as [-1,0,1] multiplied from the left
@mild dirge :white_check_mark: Your eval job has completed with return code 0.
001 | [[ 3 4 -1]
002 | [-3 3 0]]
003 | [ 0 7 -1]
In [1]: import numpy as np
...: values = np.array([
...: [1, 2, 3],
...: [4, 6, 2],
...: [1, 9, 2]
...: ])
In [2]: diffs = np.array([[-1,1,0],[0,-1,1]])
In [3]: diffs.dot(values)
Out[3]:
array([[ 3, 4, -1],
[-3, 3, 0]])
In [4]: adder = np.array([1,1])
In [5]: adder.dot(diffs.dot(values))
Out[5]: array([ 0, 7, -1])
In [7]: adder_diff = adder.dot(diffs)
In [8]: adder_diff
Out[8]: array([-1, 0, 1])
In [9]: adder_diff.dot(values)
Out[9]: array([ 0, 7, -1])
as corroboration. all you need to do is multiply the values matrix from the left by the vector [-1,0,1] to get the sum of the differences. if you want to apply this to N matrices of values at the same time, concatenate them along the columns axis. the result will be a vector of size 3*N of summed differences
hey im catching up on what you guys said, my neighbor needed help. sorry one sec
oh, wow that is cool never thought to use dot product. I was so caught up in handling the specific arrays rather than the entire dataset at one time.
so since im dealing with lists i should convert them to np arrays?
yeah both our methods use numpy arrays
edd uses some mathematical tricks, I mostly use python tricks
yeah im still new ish to python so, id rather learn those hehe
you can pick whichever one you find more intuitive
thank you all much. I'll try some things and get back to ya with (hopefully) good results
In [10]: values2 = np.array([[1,2,3],[4,5,6],[7,8,9]])
In [11]: vals_concat = np.concatenate((values,values2), axis=1)
In [12]: vals_concat
Out[12]:
array([[1, 2, 3, 1, 2, 3],
[4, 6, 2, 4, 5, 6],
[1, 9, 2, 7, 8, 9]])
In [13]: adder_diff.dot(vals_concat)
Out[13]: array([ 0, 7, -1, 6, 6, 6])
to finish the example
and indeed, however you like. numpy is fast though, and it should scale really well with this type of operation
to be honest the scope of my project is to take in Keno Drawings and identify a trend in the drawings since the beginning. then use that data in my NN
Recursive feature elimination
the trick is only that the operation you wanted can be written as a linear combination, and matrix multiplication is associative. so the whole thing can be done in a single product
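A short check of that claim on the same `values` array from above, as a sketch: `np.diff` gives the row-to-row differences directly, the grouping of the matrix products doesn't matter, and the summed differences telescope to last row minus first row.

```python
import numpy as np

values = np.array([[1, 2, 3],
                   [4, 6, 2],
                   [1, 9, 2]])

# Row-to-row differences: row[i + 1] - row[i]
diffs = np.diff(values, axis=0)
total = diffs.sum(axis=0)

# The same thing written as matrix products.
adder = np.array([1, 1])
diff_matrix = np.array([[-1, 1, 0],
                        [0, -1, 1]])

# Associativity: grouping the products either way gives the same vector.
left = (adder @ diff_matrix) @ values
right = adder @ (diff_matrix @ values)

print(diffs)
print(total)
print(left, right)
```

Because the sum telescopes, `values[-1] - values[0]` gives the same vector without computing any intermediate differences.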
to predict future trends
If you want to predict future trends, you don't need to calculate all the differences
if you train a rolling regression model it will already make some prediction on future values if you feed it new data
im going back from the very beginning of when the drawings started
Could a neural network predict the universe; can I predict the euro millions
i mean, the same array of 20 PRNG elements should never come up again in our life times so. That's one point, but I've found the mean, median, modes variance and std deviation of all drawing s that exist
i think if i could see how the drawings fluctuate and correlate that with all the other simple maths i could have something
I think you’d need to get an opinion from the Japanese dude who invented the twister you need some serious math expertise
LOL yeah
To rule out if this is even feasible
I don’t think it is
How do you know that they use a random seed
nothing is impossible
The same one each time
oh ive moved beyond that lol
What other approach can you do now
There’s nothing
It’s akin to predicting the future to the highest accuracy
one thing i have noticed from data that ive gathered is most drawings actually average out to between 35 and 46
so i started trying to generate random arrays of 20 elements that fit that average range
im gonna need divine intervention
And millions of dollars
Even still it isn’t possible
no not millions, just will power
well, my NN should be able to do that work of mathematicians if i can find the right data and plug it in
No you’d need the worlds best statisticians and mathematicians
To know what to do
What are you trying to do it?
Did you make it
I did
its been months in the making
if you run get keno numbers prepare to wait, there's like 1.6 million drawings now of 20 numbers
I don’t really understand how one calculation can spit out the future values
its a combination of multiple calculations, gathering all of them, and figuring out how to plug them into the NN
Explain it
What are the calculations
Mean, median, mode, variance, and std deviation
How can they predict a future array
for 1, no drawing will ever appear twice
Really?
yep
Rules?
well, not for quite a long time
80 numbers
20 get selected at random
you get to pick up to 10 numbers
numbers being 1 - 80
I want to know how summary statistics of draws can predict future draws
Because when it happens next time it’s random
correct, but if you had a '1' show up 3 times in a row, i would bet it wont next time?
What’s the prize for the winning draw
matching 10 numbers is 110k
Is this a popular lottery
it is
Why would you bet it won't next time? Does the probability decrease each time? If so, how?
I don’t think it’s mathematically possible
im actually not trying to cheat it. i want to see if they control the random factor or if it's actually random
If it’s using mersenne twister it’s randomised well enough
You just can’t predict that for 20 number array matching
because the notion of their being 80 numbers to randomly select that you selected 1 of them 3 times in a row isn't probable that it will be selected again
mersenne twisters are pseudo random, not random
Funnily this is a concept I’ve always struggled with. In school we’re taught about how two flips landing heads is 0.25
And yet each flip is independant!
If I flip a fair coin 3 times and get 3 heads, it does not mean that next must be a tails, nor a heads again. They are independent events.
Yes that’s true
correct, but the odds are
that it will be the other, the universe has to balance itself out at some point
This is literally gambler's fallacy.
It's a fallacy for a reason.
Often described as "the slot machine owes me".
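A quick simulation of the independence point, as a sketch with a seeded pseudo-random coin: look only at flips that come right after three heads in a row and check how often the next one is heads.

```python
import random

rng = random.Random(42)

# True = heads, False = tails; a fair simulated coin.
flips = [rng.random() < 0.5 for _ in range(200_000)]

# Collect the flip that follows every run of three consecutive heads.
after_streak = [
    flips[i]
    for i in range(3, len(flips))
    if flips[i - 3] and flips[i - 2] and flips[i - 1]
]

p_heads_after_streak = sum(after_streak) / len(after_streak)
print(f"{len(after_streak)} streaks of three heads found")
print(f"P(heads on the next flip) = {p_heads_after_streak:.3f}")
```

The fraction should land close to 0.5, i.e. the streak tells you nothing about the next flip.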
You can try to probabilistically predict this lottery but I think it is a inhuman task unles you had entire research teams
Keno is the only lottery in Ohio that is PRNG
PRNG huh.
the rest get selected with the floating balls
Oh, well the floating balls are not PRNG.
Can anyone explain how odds change with prng
yes it is PRNG, that's why initially i wanted to find the mersenne value of one of the drawings and compare it to the next
Compared to balls
if the repetition cycle of the prng is long, you'd need the statistics of hundreds of thousands of games in the past to find a pattern
yes i know edd, i ran my code on a dummy array till ihit my 4.29 billion bit limit lol
a prng can still appear uniformly distributed over moderately long sample sizes
2 ^^ 32
yeah, so if they used seed value 1, then i could find it if i itereated over it till i found it.
You know which PRNG method they are using?
what if they hit you with a 1-time pad and they change the seed each time
Per iteration
depends on the alg. mersenne twister can be cracked with less than 1000, and in some cases, less than 100
You’d need to pay 4 billion times?
no LOL
Do you know if they change the seed every once in a while?
Can you explain
I do not, i haven't found it yet
aight, i'll read up on it. i'm not familiar with the method, i was just offering a word of caution
yeah i tried to do bit shifting with the mersenne....kinda like a reverse mersenne twister
If they just random the seed, then there is no way. The seeds are often done with not PRNG.
problem is you can only do that with single int values
But if they kept using the same one for a long time, they might be dumb. Lotteries in the past have done dumber things.
i speculated that they used system time as a seed too
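To make the seed discussion concrete, a sketch using Python's own `random` module, which is a Mersenne Twister in CPython. `draw` is a hypothetical keno-style draw (20 distinct numbers from 1 to 80) and says nothing about how any real lottery generates its numbers.

```python
import random


def draw(seed):
    """One hypothetical keno-style draw: 20 distinct numbers from 1..80."""
    return random.Random(seed).sample(range(1, 81), 20)


same_a = draw(12345)
same_b = draw(12345)   # same seed, so the same "drawing"
other = draw(12346)    # neighbouring seed, unrelated-looking drawing

print(same_a == same_b)
print(same_a == other)
```

This is why guessing a time-based seed is tempting: if the seed is known, every later draw from that generator is known too. If the seed comes from a non-deterministic source and changes between draws, this approach gets you nothing.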
@scenic tulip considered paying star and math PhDs to help?
lol no i want to do this
i'm not familiar with the cracking process, but mersenne twister is a very common random number generator that isn't cryptographically secure: its output looks statistically random, but the internal state can be reconstructed from enough consecutive outputs. this doesn't matter in quite a few fields, however, so it's still used just about everywhere.
i sat at the bar and watched the drawings. when the current drawing ends a 2 minute timer starts. I ran my code that used system time as a seed and generated PRNG arrays up until 1 minute. I compared my results with the following drawing to see if a certain time within that range was being used. Some arrays matched alot, others not so much
it was difficult to tell if that was actually getting me anywhere other than more tipsy 😂
and i wanna be clear about this. lets say i found that mersenne value that could repeatedly win that would be "breaking" the game and is illegal. Im just trying to validate they are actually random and not done by someone after everyone has put their drawings in ....not trying to break laws lol
So with one hundred generations you can confidently predict next array?
that is the theory super
sorry probably a dumb idea but thank you all for helping anyways
is anyone here who has ever done something with CUDA (or more precisely cupy)?
you have to ask your actual question, not if anyone has used a certain library.
ok
How do I answer this question?
Please estimate the RMSE that your model will achieve on the test dataset.
well: i was thinking about numpy/numba on very big images doing simple arithmethic operations is very slow
so when i have 20 cores
i can just use 5800 cores of my GPU
yes, AI that involves images is often faster on a GPU
so i looked into using CUDA and found cupy
i found this article:
https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56
what GPU do you have
i redid the exact same example this guy did
3070
and where he got a x10 increase in speed
i got 0.5x
so cupy is slower
can you make a reproducible example, with every variable defined?
u mean code example?
yes. I happen to also have a 3070, so I can run it
ok one sec
but it has to be exact code that I can run without any changes
yes, anyone can download cupy at any time
ok one sec
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
actually its not exactly the same as on website because he use 1000x1000x1000 float64 and this does not fit in our GPU
so we compare 950^3
import numpy as np
import cupy as cp
import time
shaper = 950
arrShape = (shaper,shaper,shaper)
#NUMPY
time0 = time.perf_counter()
arr = np.ones((arrShape))
time1 = time.perf_counter()
print(time1-time0)
#CUPY
time0 = time.perf_counter()
arr = cp.ones((arrShape))
time1 = time.perf_counter()
print(time1-time0)
still working on it
hm what u mean?
the installation for cupy isn't as straightforward as intended
take ur time all good. im just happy u help me xd
my first try got disrupted by windows11-upgrade assasination attempt too
x)
I'm reinstalling the cuda drivers 😛
Anyone experienced with LDA? Got some issues with the topics (numbering?) and the words are not corresponding with the correct topic number. Using Gensim
Sorry for interrupting :))
Hey I've got a Pandas question in #help-lemon if anybody's got some bandwidth
@rugged falcon behold the results.
In [9]: import cupy
In [10]: %timeit cupy.random.random((500, 500, 500))
The slowest run took 5.11 times longer than the fastest. This could mean that an intermediate result is being cached.
18.8 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [11]: import numpy as np
In [12]: %timeit np.random.random((500, 500, 500))
601 ms ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
so how is that possible?
how is it possible that cupy is faster?
so what is your question
how come mines not working
idk
we have same Hardware
prolly same vram not that it matters with 500**3
how can i recreate what u did with %timeit
this should be equivalent to what you did @serene scaffold or am i missing something?
import time
import cupy
timecp0 = time.perf_counter()
cupy.random.random((500, 500, 500))
timecp1 = time.perf_counter()
import numpy as np
timenp0 = time.perf_counter()
np.random.random((500, 500, 500))
timenp1 = time.perf_counter()
print(timecp1-timecp0)
print(timenp1-timenp0)
0.186s for cupy
0.46s for numpy
so our numpy is similar but your cupy is 10x faster. what am i missing?
I used IPython, which can run it a bunch of times to get a better read
python -m pip install IPython
python -m IPython
you can run that if you want to use IPython in a shell.
but this has nothing to do with IPython having no GIL right?
IPython is just a console. the %timeit command will run a statement a bunch of times and report the average
which is more reliable than what you're doing.
@rugged falcon :white_check_mark: Your eval job has completed with return code 0.
1.88e-05 s
ok i implemented "run a statement a bunch of times"
i end up with 400µs, so yours is still 20x faster
i thought ironpython is python in C
*C#
this statement seems to be completely unrelated to anything we've discussed.
oh
IPython is not ironpython, whatever that is
IPython is the whole name of it.
oh
@serene scaffold
In [5]: %timeit cupy.random.random((500, 500, 500))
The slowest run took 8.97 times longer than the fastest. This could mean that an intermediate result is being cached.
14.3 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [7]: %timeit np.random.random((500, 500, 500))
489 ms ± 24.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
so cupy is faster
woo
so ur results are reproducible
why would i ever not work in this IPython thing if its faster then without?
yes its so much faster (as i have hoped / otherwise i download so much for nothing xd)
IPython isn't making it faster
it's just telling you how fast it is.
so if ipython is not making anything faster
then why time.perf_counter() has
how you say it
(mean ± std. dev. of 7 runs, 1 loop each)
it did it seven times and reported the average
"right to exist"?
I haven't used it, so idk
for i in range(0,10):
    timecp0 = time.perf_counter()
    cp.random.random((500, 500, 500))
    timecp1 = time.perf_counter()
    timenp0 = time.perf_counter()
    np.random.random((500, 500, 500))
    timenp1 = time.perf_counter()
    print(timecp1-timecp0)
    print(timenp1-timenp0)
@serene scaffold wouldnt u agree it technically should do exactly the same (except summing/averaging)
the fact that it doesn't average it is a big deal
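For reference, roughly what "run it several times and report the average" looks like without IPython, using the standard-library `timeit` module. `work` is a made-up CPU-only placeholder, and the output format only imitates the `%timeit` line; it is not claimed to match IPython's exact calculation.

```python
import statistics
import timeit


def work():
    """Placeholder workload; swap in the statement you actually want to time."""
    return sum(i * i for i in range(10_000))


LOOPS = 10
RUNS = 7

# Each entry is the total time for LOOPS calls; there are RUNS such measurements.
totals = timeit.repeat(work, repeat=RUNS, number=LOOPS)
per_loop = [t / LOOPS for t in totals]

mean = statistics.mean(per_loop)
std = statistics.stdev(per_loop)
print(f"{mean * 1e6:.1f} µs ± {std * 1e6:.1f} µs per loop "
      f"(mean ± std. dev. of {RUNS} runs, {LOOPS} loops each)")
```

A single `perf_counter` pair measures one call, including any one-off startup cost, which is why one-shot numbers can look very different from the averaged ones.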
from the doc it states:
4.665306263360271e-07 2,143,482
for resolution / tickrate
but the minimum is still manifold higher then Ipython %timeit
I don't really have anything to offer about time.perf_counter
ok i will google this behavior
one more question tho if you dont mind
can you think of a reason why the first execution is always so slow?
it might be that cupy does additional startup when its used for the first time, rather than eagerly when it's first imported.
hmm okay so some magic going on like always
thanks for comparison/answer/help!
!close
dxd
I'm doing another test.
this seems to support my theory
In [1]: import cupy
In [2]: _ = cupy.random.random((100, 100, 100))
In [3]: %timeit cupy.random.random((100, 100, 100))
57.9 µs ± 706 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
i dont really believe it tbh
cupy advertises itself with several x up to 800x speed increases
and you get 16000x
you have to look at what operations they did to measure that speedup
simply allocating an array is just one thing
so?
this implies all array entries being initialized as random value
it doesn't imply anything. it just means that each element is filled as a random float between 0 and 1.
yes
that's not implied; it's the spec for that function
yes yes sry english xd
i can assume that ones() is even easier because it skips RNG part?
or faster
I guess
so
https://towardsdatascience.com/heres-how-to-use-cupy-to-make-numpy-700x-faster-4b920dda1f56
a possibility exist that this author did some major wrong?
because when i follow what u did (with not only doing it once, but doing it few times with %timeit
i get not 10x increase
but way better increase
Idk if TDS has quality control
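One likely explanation for the microsecond timings: CuPy queues GPU work asynchronously, so a call like `cupy.random.random(...)` can return to Python before the GPU has finished. Timing it without synchronizing mostly measures the launch, not the computation. Below is a sketch of a timing helper that optionally synchronizes; `bench` is a made-up name, and the CuPy part only runs if CuPy is importable. CuPy also ships its own helper, `cupyx.profiler.benchmark`, in recent versions.

```python
import time


def bench(fn, sync=None, warmup=1, repeats=7):
    """Time fn() several times; if sync is given, call it before reading the clock."""
    for _ in range(warmup):
        fn()                      # absorbs one-off startup cost
    if sync is not None:
        sync()
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()                # wait until queued GPU work has actually finished
        times.append(time.perf_counter() - start)
    return times


try:
    import cupy as cp
except ImportError:
    cp = None                     # no CuPy here; the helper still works for CPU code

if cp is not None:
    shape = (500, 500, 500)
    launch_only = bench(lambda: cp.random.random(shape))
    with_sync = bench(lambda: cp.random.random(shape),
                      sync=cp.cuda.Stream.null.synchronize)
    print(f"without sync: {min(launch_only):.6f} s")
    print(f"with sync:    {min(with_sync):.6f} s")
```

If the synchronized number is much larger than the unsynchronized one, the 16000x figure was mostly measuring how fast work can be queued. The comparison with NumPy should use the synchronized time.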
Hey I've got a statistical question in #help-candy
for a beginner starting out in python (kinda know the basics) and wanting to do AI,ML,DS in the future. what would you guys suggest as good beginner courses for that
How to make PyTorch faster?
Sentdex has some fun stuff, there are some great youtube videos on these topics and I feel like the entry level is really low for such a sophisticated topic, you don’t necessarily need to know all the math behind algorithms / …
If anyone has the time to glance over my deliverable for this job interview I am completing and give some feedback that would be great! I'm only 8 slides in but take a peek 🙂
https://docs.google.com/presentation/d/1ibyiIDu-b3k3y_yI4UwVdqK7I-4F2Kh5AFBUD74N5p4/edit#slide=id.g12e390617fa_0_811
whats a good method to automatically identify and remove columns with no change (from a csv file recording data for a long period of time that may have channels not hooked up to anything, but was never turned off)?
you could do a finite difference approximation of the derivative/gradient of the quantity you are measuring. if it is close enough to zero, you omit it
you'd probably want one that is accurate to 2nd order
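A minimal sketch of the difference-based idea, assuming the CSV has already been loaded into a 2D float array with one column per channel. It uses simple first-order differences via `np.diff` rather than a 2nd-order-accurate stencil; `changing_columns` and the tolerance are made up for illustration.

```python
import numpy as np


def changing_columns(data, tol=1e-9):
    """Boolean mask: True for columns whose sample-to-sample change ever exceeds tol."""
    step = np.abs(np.diff(data, axis=0))   # |x[t + 1] - x[t]| per column
    return step.max(axis=0) > tol


# Three channels: the first and third never change (e.g. nothing hooked up).
data = np.array([[0.0, 1.0, 5.0],
                 [0.0, 1.5, 5.0],
                 [0.0, 0.9, 5.0]])

mask = changing_columns(data)
cleaned = data[:, mask]

print(mask)
print(cleaned)
```

For noisy unconnected channels the tolerance would need to sit above the noise floor, which depends on the instrument.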
I tried torch.jit
but it got error
hi is there someone who knows what you can use to predict something in tensorflow lite. in tensorflow it is just predict but in tflite you dont have that.
anyone here specialises in econometrics, specifically microeconometrics
getting this error while fitting a tf model ```
ValueError: Exception encountered when calling layer "lstm" (type LSTM).
slice index 0 of dimension 0 out of bounds. for '{{node strided_slice_1}} = StridedSlice[Index=DT_INT32, T=DT_FLOAT, begin_mask=0, ellipsis_mask=0, end_mask=0, new_axis_mask=0, shrink_axis_mask=1](transpose, strided_slice_1/stack, strided_slice_1/stack_1, strided_slice_1/stack_2)' with input shapes: [0,?,50], [1], [1], [1] and with computed input tensors: input[1] = <0>, input[2] = <1>, input[3] = <1>.
Call arguments received by layer "lstm" (type LSTM):
• inputs=tf.Tensor(shape=(None, 0, 50), dtype=float32)
• mask=None
• training=True
• initial_state=None```
```py
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=seq_length))
model.add(LSTM(100, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(100, activation='relu'))
model.add(Dense(50, activation='relu'))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())
# compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit model
model.fit(X, y, batch_size=128, epochs=100)
```
@ machine learning community
what do you prefer? reading research papers or watching YT Vids on those papers?
they can't achieve the same thing unless the youtube video is a recording of a talk or lecture that covers the paper in depth. you can watch videos while you have your coffee in the morning, but once you find an interesting idea, you need either a lecture on it or to read the full paper yourself if you wanna understand all of it
For pytorch how does one encode a target output for multi-label classification?
using multi-hot encoded rn like [1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
But getting this when trying to calculate some performance measure using the prediction and the label
Hello, I believe this fits in data science. Could anyone help me with manipulating csv files? I'm trying to convert a file by deleting a column, copying 65,000 rows of data, and adding a header on top.
All of this should be done in a new file so to preserve the original file
It's trying to take a file that looks like the first image and convert it to a file that looks like the 2nd image. The header info is derived from the original file + a second file for Date, Time and VUnit.
I'm struggling trying to code this algorithm so I would appreciate any assistance 🙂 thank you so much!
use pandas
@serene scaffold I have very little experience with pandas unfortunately, could you drop some function suggestions or dataframe manipulation techniques to achieve this?
What's happening here? why doesn't the type change work?
.dtypes returns float, but info() shows the column isn't float
did you expect df.astype to modify it in-place?
I thought it would, guess I'll have to reassign?
you can keep chaining methods onto pd.read_csv
you would just need to use read_csv and to_csv
and drop
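A rough sketch of that read_csv / drop / to_csv flow. The real layout was only shown in screenshots, so the file names, the `Unused` column, and the header lines here are all placeholders to swap for your own.

```python
import tempfile
from pathlib import Path

import pandas as pd

workdir = Path(tempfile.mkdtemp())
src = workdir / "original.csv"    # placeholder input file
dst = workdir / "converted.csv"   # new file, so the original is never modified

# Small stand-in for the real data.
pd.DataFrame({
    "Time": [0.0, 0.1, 0.2, 0.3],
    "Unused": ["a", "b", "c", "d"],
    "Voltage": [1.2, 1.3, 1.1, 1.4],
}).to_csv(src, index=False)

max_rows = 65_000
df = (
    pd.read_csv(src, nrows=max_rows)  # read at most the first 65,000 data rows
    .drop(columns=["Unused"])         # delete the unwanted column
)

# Hypothetical header block; in the real task these values would come
# from the original file and the second file.
header_lines = ["Date,2021-01-01", "Time,12:00:00", "VUnit,V"]

with open(dst, "w", newline="") as f:
    f.write("\n".join(header_lines) + "\n")  # header goes on top
    df.to_csv(f, index=False)                # then the table itself
```

Passing the open file handle to `to_csv` is what lets the extra header lines and the table end up in the same file.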
you didn't chain the additional method calls onto read_csv, so nothing interesting happened.
I don't understand, shouldn't df.astype({'Quantity': 'float64'}).dtypes just change it?
df = pd.read_csv(...).astype({'Quantity': 'float64'})
this is what I mean by chaining.
yes i understand
you didn't do that.
but how is this any different from doing it in another line with df?
because if you don't display it or save it to a variable, it just creates a new object and immediately throws it away.
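A tiny illustration of that point, using a made-up one-column frame: `astype` returns a new DataFrame, so the original only changes if the result is assigned back.

```python
import pandas as pd

# Strings, similar to what read_csv(..., dtype=str) would give.
df = pd.DataFrame({"Quantity": ["1", "2", "3"]})

# .dtypes here describes the temporary converted frame, which is then discarded.
converted_dtype = df.astype({"Quantity": "float64"}).dtypes["Quantity"]
original_dtype = df["Quantity"].dtype  # df itself still holds strings

# Reassigning keeps the converted frame.
df = df.astype({"Quantity": "float64"})

print(converted_dtype, original_dtype, df["Quantity"].dtype)
```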
nah
the issue was the last argument, as you said
the .dtypes just returns that when I don't care about it
I don't think I can help with this. sorry.
How do I copy the contents of a csv file to the csv I want to write to?
you can either overwrite or append https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html
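A minimal sketch of the two modes, with invented frames and a temporary output path: the default `mode="w"` creates or overwrites the file, while `mode="a"` appends to it (usually with `header=False` so the column names aren't written twice).

```python
import tempfile
from pathlib import Path

import pandas as pd

out = Path(tempfile.mkdtemp()) / "combined.csv"  # placeholder output path

first = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
second = pd.DataFrame({"id": [3], "value": [30]})

first.to_csv(out, index=False)                           # overwrite (default mode="w")
second.to_csv(out, mode="a", header=False, index=False)  # append rows only

combined = pd.read_csv(out)
print(combined)
```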
thank you!
@serene scaffold thank you so much
I'm trying to group by columns (a list of columns other than Quantity) and get the sum of the quantity, but it doesn't seem to be working
df2 = df.groupby(columns).agg(Quantity = ('Quantity', 'sum'))
my other idea is just dropping duplicates then getting sum by order ID and merging.
that keyword form is named aggregation, which needs pandas 0.25 or newer; on older versions it isn't valid agg syntax. the docs describe the valid inputs. if the tuple ('Quantity', 'sum') isn't giving you the sums you expect, it's not doing what you think it's doing
i recommend always checking the docs for usage help when in doubt https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.aggregate.html
!e ```python
import pandas as pd
data = pd.DataFrame({
'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
'Type': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
'Quantity': [3, 2, 4, 9, 4, 7],
})
columns = ['Category', 'Type']
result1 = data.groupby(columns).agg({'Quantity': 'sum'})
result2 = data.groupby(columns)[['Quantity']].sum()
result3 = data.groupby(columns)['Quantity'].sum()
print(result1)
print()
print(result2)
print()
print(result3)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 |                Quantity
002 | Category Type
003 | A        X            7
004 |          Y            2
005 | B        X            4
006 |          Y           16
007 |
008 |                Quantity
009 | Category Type
010 | A        X            7
011 |          Y            2
... (truncated - too many lines)
Full output: https://paste.pythondiscord.com/ixepomulew.txt?noredirect
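For reference, on pandas 0.25 and later the keyword form from the question, `agg(Quantity=('Quantity', 'sum'))`, is accepted as named aggregation. A quick check on the same toy data, meant as a sketch rather than a diagnosis of the original problem:

```python
import pandas as pd

data = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'A', 'B'],
    'Type': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Quantity': [3, 2, 4, 9, 4, 7],
})
columns = ['Category', 'Type']

# Named aggregation: output_column=(input_column, aggregation function).
named = data.groupby(columns).agg(Quantity=('Quantity', 'sum'))

# Dict form, as in the eval above.
by_dict = data.groupby(columns).agg({'Quantity': 'sum'})

print(named['Quantity'].tolist() == by_dict['Quantity'].tolist())
```

If both forms agree on the toy data but not on the real file, that points at the data (or the grouping keys) rather than the `agg` call.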
I did check docs, and tried your thing.
this is what I ended up doing.
something appears to be wrong with your data then
if you provide a sample of your actual data (use our paste site) then i can investigate further
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
yeah that's why I posted, all columns are strings other than quantity.
the columns list has column names other than quantity.
right, but without seeing the data you're forcing people to guess
only code i ran
columns = ["Sale Code",
"Order ID",
"Store Name",
"Player First Name",
"Player Last Name",
"Shipping First Name",
"Shipping Last Name",
"Shipping Address",
"Shipping City",
"Shipping State",
"Shipping Zip",
"Billing Phone",
"Billing Email"]
df = pd.read_csv("order_report.csv", usecols=(columns + ["Quantity"]), dtype=str).astype({'Quantity': 'float64'})
df.info()
thanks for sharing the file
as a side note, you probably do not want to include sale code in the grouping columns... it looks like a unique row id
it is here, but not always.
what do you mean by that?
in the current file, it is the same, but in the future there might be different ones.
ultimately, the reason im passing those columns is just because i just want the total for each order, the rest of the attributes will always be the same.
oh, i see
yea
i will show you a tidier way to do that
thanks, im also curious why that isn't working since i kept googling for quite a bit and just went with my idea of merging.
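The tidier version isn't shown in this part of the log. One possible sketch, assuming `Order ID` alone identifies an order: group on that single key, sum `Quantity`, and use `first` to carry the other attributes along. It also shows one thing worth checking in the real file (a guess, not a confirmed cause): `groupby` drops rows whose key is missing by default, so grouping on many columns can silently lose orders with a blank field unless `dropna=False` is passed (pandas 1.1+). The rows below are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "Order ID": ["1001", "1001", "1002", "1003"],
    "Store Name": ["North", "North", "South", "East"],
    "Billing Phone": ["555-0100", "555-0100", None, "555-0199"],
    "Quantity": [1.0, 2.0, 5.0, 4.0],
})

# One row per order: sum Quantity, keep the first value of everything else.
agg_spec = {col: "first" for col in df.columns if col not in ("Order ID", "Quantity")}
agg_spec["Quantity"] = "sum"
per_order = df.groupby("Order ID", as_index=False).agg(agg_spec)

# Grouping on every attribute drops order 1002, whose phone is missing...
keys = ["Order ID", "Store Name", "Billing Phone"]
strict = df.groupby(keys)["Quantity"].sum()

# ...unless missing keys are kept explicitly.
lenient = df.groupby(keys, dropna=False)["Quantity"].sum()

print(per_order)
print(len(strict), len(lenient))
```

Comparing `strict.sum()` with `df["Quantity"].sum()` on the real data is a quick way to see whether any rows are being dropped this way.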