#data-science-and-ml
which leads to avoiding a lot of the conveniences and tools that pandas gives you
e.g. vectorisation
anyway yeah @misty mica it also depends on what you want to do with the data after you're done with it
that'd influence storage concerns
just for completeness, it's worth noting that the SQL way to handle this (canonically) would be to create two separate tables
and in the second table each row would be an original filename ID, an index, and one part of the filename
I split it into a list so I could do some tag-like analysis, but I think just storing the string instead of a list is adequate.
if you did that you could just do df[(df['index'] == 1) & (df['filename'] == whatever)]
which is nice, but also adds cognitive overhead because now you have to juggle two tables
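For reference, a rough pandas sketch of that two-table layout. The filenames, the underscore separator and the column names (`file_id`, `index`, `part`) are all made up for illustration, since the real naming scheme isn't shown here:

```python
import pandas as pd

# Hypothetical filenames; the real naming scheme is an assumption.
files = pd.DataFrame({
    "file_id": [0, 1],
    "filename": ["siteA_2020_raw.csv", "siteB_2021_clean.csv"],
})

# Second table: one row per (file_id, position, part of the filename).
stems = files["filename"].str.rsplit(".", n=1).str[0]
parts = (
    files.assign(part=stems.str.split("_"))
    .explode("part")
    .drop(columns="filename")
    .reset_index(drop=True)
)
parts["index"] = parts.groupby("file_id").cumcount()

# All files whose second part (index 1) is "2021".
match = parts[(parts["index"] == 1) & (parts["part"] == "2021")]
print(match["file_id"].tolist())
```

The first table keeps one row per file, and the second can be joined back to it on `file_id` when the full filename is needed.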
In general for most things I'll only care about the files that are formatted according to my standard example above, so I'm going to store those three fields and also filename as a string.
Thanks for the assistance!
yw
speaking of snake_case vs camelCase, constantly going between R and Python for school is starting to destroy my sanity
hm
I use TypeScript (camelCase) for my frontend and Python (snake_case) for my backend
it's p okay but
because API responses are returned in snake case
I ALSO have snake case variables in my TypeScript
🥴
That's funny, I've been wanting to use camel/pascal casing in python because my databases all tend to be snake case.
is there anything inherently wrong with camelCase in python? or just a matter of good/bad practice?
@indigo obsidian snake case is preferred
but not inherently wrong, no
If everyone uses the same style guide it is nice, but not worth going to battle over in most organizations.
I mean, not worth fighting an existing standard that isn't your preferred, it's definitely worth having a standard.
in a list of tuples how do i get the first index of the list but second element in the tuple
something like list = [(2.3, 1), (3.5, 0)]
and get out the 1
@wheat pilot 1. this probably isn't the best place but since you asked
you would do smth like x[0][1]
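Spelled out with the example data from the question (the variable is renamed to `pairs` here, because calling it `list` would shadow the built-in):

```python
pairs = [(2.3, 1), (3.5, 0)]

# pairs[0] is the first tuple; [1] then picks its second element.
first_tuple_second_item = pairs[0][1]
print(first_tuple_second_item)  # 1

# The same position from every tuple, in case that is needed later.
second_items = [second for _, second in pairs]
print(second_items)  # [1, 0]
```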
just wondering, where would be the best place to ask more fundamental questions like this?
This is a python-specific question, really, so #python-discussion or a help channel.
how to solve it
Anyone know of a good textbook for Time Series Analysis in Python? Understanding ACF, PACF, ARIMA, SARIMA with a good depth on the formulae etc...
What is the most performant way of inserting lots of records into a SQL database? It took me almost 100 seconds for 57k records using executemany()
SqlBulkCopy? from what i understand executemany isn't a true bulk operation, and is actually inserting your rows one by one under the hood.
https://programmingwithmosh.com/net/using-sqlbulkcopy-for-fast-inserts/
BULK INSERT is another possible option
@glacial rune : executemany() isn't a true bulk operation, so it tends to be pretty slow.
I found I get significant performance gains by composing a single operation as a massive string and sending it all in one go.
Since you only have 57k records, that may be your best option. Beware that you have to be careful about how you convert numerical data into strings to avoid character truncation.
Another option is to use pyodbc.cursor.fast_executemany. I've not tried this, it just looks promising.
https://github.com/mkleehammer/pyodbc/wiki/Features-beyond-the-DB-API
sqlalchemy added support for this feature
https://docs.sqlalchemy.org/en/13/changelog/migration_13.html#support-for-pyodbc-fast-executemany
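A hedged sketch of the "one big statement" idea, done with bound parameters so no numbers have to be turned into strings by hand. It uses the standard-library `sqlite3` purely as a stand-in, since the actual database isn't known at this point; the table name, columns and chunk size are assumptions, and the placeholder style differs between drivers:

```python
import sqlite3

def insert_in_chunks(conn, rows, chunk_size=400):
    """Insert (id, value) rows using one multi-row INSERT per chunk."""
    cur = conn.cursor()
    for start in range(0, len(rows), chunk_size):
        chunk = rows[start:start + chunk_size]
        # One "(?, ?)" group per row; the values stay bound parameters.
        placeholders = ", ".join(["(?, ?)"] * len(chunk))
        flat_params = [value for row in chunk for value in row]
        cur.execute(
            f"INSERT INTO records (id, value) VALUES {placeholders}",
            flat_params,
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, value REAL)")
rows = [(i, i * 0.5) for i in range(57_000)]
insert_in_chunks(conn, rows)
row_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
print(row_count)
```

The chunk size is kept small because databases cap the number of parameters per statement. For tens of millions of rows, a database's own bulk loader is likely the better tool.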
Thanks, ultimately I will have like 30 million records... the db is on google cloud so I wonder if it would be faster to upload csv files
@glacial rune what database are you actually using?
they all have different features for this
I have a question. Normally is it possible for an image classifier built on CNN to give the count of something in an image. (e.g. the no of cats in an image)?
@opaque isle yes
hey so i want to train a GRU network with an input of shape (748, 500, 12)
but i'm getting this error:
the model: can someone please help?
can you show a full picture of the error message?
I think you're messing something with x_train with regards to the shape.
my guess is that xtr has a wrong shape here.
the first dim is the no. of examples, the next is the time steps(500 samples), and the last one is the features
is it the pytorch GRU?
off the top of my head, I can't give a proper answer right now.
I do think it's a size mismatch between the input_size you're using and the size of xtr, though.
I think the GRU outputs five different sequences, each of which I have to pass through another Dense layer, but I don't know how to do that... maybe that may be the reason for the error...
@cobalt jetty input size i have given is (500,12) to the GRU, and xtr is of shape (748, 500, 12)
I looked back at the error and it seems to arise from this function. I.e. your output shape and your y_train shape have a mismatch.
sorry i tried everything it is not working still, i want a multiclass classification, and i think i'm doing something wrong here
but i don't know how to solve that
Anyone have experience with using eGPUs for DL? Most of the benchmarks I see online show a 10-30% performance hit for gaming. Should I expect similar for DL?
Try inputting (5,) as an output shape rather than len(labels), @runic stream
Hey squad - hope everyone had a good weekend and is staying safe/healthy. I'm starting to dip my toes into sentiment analysis, and can imagine this work has been explored in so much detail that there are some good Python libraries that can basically "plug and play" with text that you feed it.
If this is the case, are there any libraries y'all recommend I explore? I imagine there are easier alternatives to running e.g. CountVectorizer
hey um does anyone know how much space building tensorflow from source takes? bc rn it has taken up frickin 30 gigs
@void anvil might have to just read the source code
or use subprocess
cat your_file.naf | corefgraph -l en_conll > output.naf
ugh, useless use of cat
and sudo pip install too
yeah you'd have to check the source code for how it works
good old academic software
no idea
let me know if you figure it out though @void anvil
i like having all this nlp stuff in my toolbox
okay im really dumb and new to everything and would like some help. ive spent 2 days trying to figure out what is wrong with my neural network. im trynna do the handwritten digits thing (mnist) and my code is both super slow and the cost only goes up
can someone look it over and tell me where i am going wrong?
import numpy as np
def cross_entropy(output, y_target):
    return - np.sum(np.log(output) * y_target, axis=1)
def cost(output, y_target):
    return np.mean(cross_entropy(output, y_target))
def sigmoid(z):
    return 1 / (1 + np.exp(z * -1))
def sigmoid_deriv(z):
    return sigmoid(z) * (1 - sigmoid(z))
def softmax(z):
    return (np.exp(z.T) / np.sum(np.exp(z), axis=1)).T
m = 10000
y = np.zeros((m, 10))
x = np.zeros((m, 784))
file = open("data\\mnist_train.txt", 'r')
for i in range(m):
    line = file.readline()
    x_line = line[2:].split(',')
    x_line = np.array([int(i) for i in x_line]).reshape(1, 784)
    x[i] = x_line
    y_line = np.zeros((10, 1))
    y_line[int(line[0])] = 1
    y[i] = np.array(y_line.T)
y = y.reshape(m, 10).T
x = x.T
alpha = .01
W1 = np.random.rand(256, 784) * .01
b1 = np.zeros((256, 1))
W2 = np.random.rand(256, 256) * .01
b2 = np.zeros((256, 1))
W3 = np.random.rand(10, 256) * .01
b3 = np.zeros((10, 1))
for i in range(1000):
    # feed forward
    Z1 = np.dot(W1, x) + b1
    A1 = sigmoid(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)
    Z3 = np.dot(W3, A2) + b3
    A3 = softmax(Z3)
    # calculating gradients
    dz3 = A3 - y
    dw3 = np.dot(dz3, A2.T) / m
    db3 = np.sum(dz3, axis=1, keepdims=True) / m
    da2 = np.dot(W3.T, dz3)
    dz2 = np.multiply(da2, sigmoid_deriv(Z2))
    dw2 = np.dot(dz2, A1.T) / m
    db2 = np.sum(dz2, axis=1, keepdims=True) / m
    da1 = np.dot(W2.T, dz2)
    dz1 = np.multiply(da1, sigmoid_deriv(Z1))
    dw1 = np.dot(dz1, x.T) / m
    db1 = np.sum(dz1, axis=1, keepdims=True) / m
    # updating weights and biases
    W3 = W3 - alpha * dw3
    b3 = b3 - alpha * db3
    W2 = W2 - alpha * dw2
    b2 = b2 - alpha * db2
    W1 = W1 - alpha * dw1
    b1 = b1 - alpha * db1
ping me if u can help or something
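One thing that stands out in the posted code, offered as a likely culprit rather than a confirmed diagnosis: with `Z3` shaped (10, m), the posted `softmax` sums `np.exp(z)` over `axis=1`, which is across samples instead of across the 10 classes, so the outputs for one sample don't sum to 1. A sketch of a per-sample version for that layout (the function name `softmax_columns` is made up here):

```python
import numpy as np

def softmax_columns(z):
    """Softmax over classes for z of shape (n_classes, n_samples)."""
    # Subtracting the per-column max leaves the result unchanged
    # but keeps np.exp from overflowing.
    shifted = z - z.max(axis=0, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
z = rng.normal(size=(10, 5))   # 10 classes, 5 samples
probs = softmax_columns(z)
print(probs.sum(axis=0))       # every column sums to 1 (up to rounding)
```

It may also be worth scaling the raw 0-255 pixel values down (for example dividing by 255) before training, since large inputs can push the sigmoids into their flat regions.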
@void anvil what are you ultimately trying to do
Guys, I have created a small corpus of 1.8M sentences and 250K unique words in Spanish for NLP, but I really don't know where to post it.
what is the output from a tool like this @void anvil ?
some kind of matrix of coreferences?
e.g. matrix C where Cij = 1 if entities i and j appear in the same doc?
hm
interesting
so you need to see coreferences of things like "BOILER" and "TECHNICIAN"?
Anyone with TF lite experience need some help
interesting that coreference resolution is a separate task from "just" entity resolution
how do i standard scale a dataframe to have 0 mean and 1 stdev using sklearn?
i tried standardscaler and scale but when i manually check the returned data the mean is not 0
@wheat pilot it might be +/- some small amount due to floating point error
when i use df.mean() one row returns -3.552714e-16
so i guess that might be it
but then this is being used towards data preprocessing and my initial accuracy for a knn implementation is 1 but for a standard scaled is lower
and even lower for a min max scaled
shouldnt they be higher? @desert oar
do you have any idea how tiny 1e-16 is
yea super close to 0
i wasnt sure if it should be exact or not
but are my accuracies supposed to get worse?
with preprocessing
also i thought the way things were standard scaled is (x - mean) / stdev
but when i manually do that for one row of my dataset i get a different value
7.3,0.74,0.08,1.7,0.094,10.0,45.0,0.9957600000000001,3.24,0.5,9.8
this is one row of my dataframe
its mean is 7.2227054545455
and its stdev is 13.098395928751
the mean of each column should be around 0, and the stddev of each column should be around 1
the point is so that all the data is centered in roughly the same place and occupies roughly the same amount of "space"
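A small NumPy-only sketch of what "per column" means here, on made-up data. Note that 7.2227 is the mean of the eleven values in the row quoted above, i.e. it was taken across features, which is not what standard scaling uses:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 500 rows, 3 columns on very different scales.
X = rng.normal(loc=[7.0, 0.5, 45.0], scale=[1.5, 0.2, 10.0], size=(500, 3))

# Standardise each column: subtract the column mean, divide by the column std.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)
print(col_means)  # tiny values around 1e-16: zero up to floating point error
print(col_stds)   # 1 for every column (up to rounding)
```

`X.std(axis=0)` uses the population standard deviation (ddof=0), which is also what `StandardScaler` uses. pandas' `df.std()` defaults to ddof=1, so checking with it gives values slightly different from 1.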
eh?
first of all what are your data types
how many columns
any missing values
etc
just so i know what you are dealing with
its a pandas dataframe i think
initially it has 12 columns and one of those is the label column
no missing values
in the start of my standard scaling def i removed the last column of the dataframe with
```python
xTrain = xTrain[xTrain.columns[:-1]];
xTest = xTest[xTest.columns[:-1]];
```
since i dont want to standardize the labels i think?
yea
you dont need semicolons in python btw
oh yea
i assume you use javascript?
we didnt have any introduction to python just an assignment on the topic of the course
yeah a bit
yeah python has some things in common with java and some things that are very different
each column has a name
yeah we dont care about the column names
pandas is smart enough not to mix those up with your data
x_train = x_train.iloc[:, :-1]
x_test = x_test.iloc[:, :-1]
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
i imported the preprocessing packet(not sure if its the right term) using
```python
from sklearn import preprocessing
```
("module" or "package")
(a "package" is just a "module" that contains other modules, called submodules)
the code i posted should "just work"
it will scale each column independently
do i have to use iloc instead of what i had?
no but its less typing
or if you know the column name of the label you can do
x_train = x_train.drop(columns=[label_colname])
pandas gives you a few different ways to perform similar operations, depending on what exactly you want
for the line with scaler = StandardScaler i need to use preprocessing.StandardScaler() right?
in your case yes
i usually write from sklearn.preprocessing import StandardScaler
they both work
using both is weird
from sklearn.preprocessing import StandardScaler, MinMaxScaler
this is one option
i wouldnt recommend using both unless you know what you're doing and have a good reason to do it
using both minmax and standard?
min-max scaling (aka "normalizing") works best when the data has a logical "maximum" and "minimum" value
they are for two different definitions to see how preprocessing affects my accuracy
whereas shifting by mean and scaling by std dev (aka "standardizing") works best on unbounded data
yeah, dont use them on the same data
but if youre comparing then go for it
ah ok good i think its set up to use a new copy each time
the "function-only" versions like sklearn.preprocessing.minmax_scale don't preserve the values you need to re-apply the scaling later
whereas the class-based versions like sklearn.preprocessing.MinMaxScaler store the scaling parameters, which lets you do fit_transform on the training data and then just transform on the test data
where you named things scaled are there issues if i use
```python
xTrain = scaler.fit_transform(xTrain)
```
ah wait
on what u said about function only
for knn should i be using the same transform values on the test data?
or scaling test data separately
yes, you should do this
think about it practically: the test data is meant to simulate "out of sample" data. if the data is out of sample, where are you going to get the scaling parameters? nowhere. you have to use the parameters from the training data
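To make the "parameters come from the training data" point concrete, here is a NumPy-only sketch of min-max scaling. The helper names `fit_min_max` and `apply_min_max` and the numbers are invented for illustration; they only mirror the fit/transform split, they are not sklearn's code:

```python
import numpy as np

def fit_min_max(train):
    """Learn per-column minimum and range from the training data only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    return col_min, col_range

def apply_min_max(data, col_min, col_range):
    """Scale any data with parameters that were learned elsewhere."""
    return (data - col_min) / col_range

train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
test = np.array([[2.0, 20.0], [6.0, 60.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)
print(test_scaled)
```

The second test row ends up above 1 because it lies outside the range seen in training. That is expected and is what would happen with genuinely new data.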
ohh
i see
my teaching assistant mentioned not doing the same process on the test data
but the assignment info made it seem like we were supposed to
but i think it meant same process as in same parameters as the training and not same code process
so for the same deal but a min max version i would just replace the standard scaler with a min max?
xTrain = xTrain.iloc[:, :-1]
xTest = xTest.iloc[:, :-1]
scaler = MinMaxScaler()
xTrain_scaled = scaler.fit_transform(xTrain)
xTest_scaled = scaler.transform(xTest)
return xTrain_scaled, xTest_scaled
yep
do you know anything about adding noisy features?
i think they might also be called irrelevant features
i implemented this but im not sure this is actually what i should be trying to do
```python
def add_irr_feature(xTrain, xTest):
    """
    Add 2 features using Gaussian distribution with 0 mean,
    standard deviation of 1.

    Parameters
    ----------
    xTrain : nd-array with shape n x d
        Training data
    xTest : nd-array with shape m x d
        Test data

    Returns
    -------
    xTrain : nd-array with shape n x (d+2)
        Training data with 2 new noisy Gaussian features
    xTest : nd-array with shape m x (d+2)
        Test data with 2 new noisy Gaussian features
    """
    # TODO FILL IN
    feature1_train = np.random.normal(0, 1, len(xTrain))
    feature2_train = np.random.normal(0, 1, len(xTrain))
    feature1_test = np.random.normal(0, 1, len(xTest))
    feature2_test = np.random.normal(0, 1, len(xTest))
    xTrain['irr_feat1'] = feature1_train
    xTrain['irr_feat2'] = feature2_train
    xTest['irr_feat1'] = feature1_test
    xTest['irr_feat2'] = feature2_test
    return xTrain, xTest
```
note that those features are uncorrelated with your "meaningful" features
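A quick way to see the "uncorrelated" part, sketched with plain NumPy arrays and made-up data (the helper name `add_noise_columns` and the seeds are assumptions, not part of the assignment):

```python
import numpy as np

def add_noise_columns(X, n_new=2, seed=None):
    """Return a new array: X with n_new standard-normal columns appended."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=1.0, size=(X.shape[0], n_new))
    return np.hstack([X, noise])

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 11))   # made-up stand-in for the real features
X_noisy = add_noise_columns(X_train, seed=2)

print(X_noisy.shape)
# Correlation between a real feature and a noise column: small, not exactly 0.
noise_corr = np.corrcoef(X_noisy[:, 0], X_noisy[:, 11])[0, 1]
print(noise_corr)
```

Unlike assigning new columns onto the input DataFrame, this returns a new array and leaves the original data untouched, which matters when the same train/test split is reused for several comparisons.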
is this part of your homework?
if not, sklearn has a nice routine for making fake classification data https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_classification.html
yea it is
we never went over any code in class though
just the concept of knn
TT
im having trouble following what the classification thing means
what does the homework actually ask you to do?
i do understand im supposed to be making extra columns that are not "necessary" and may mess with results
Fill in the add irr feature function to add two irrelevant features to the training
and test data. The data for each column should be drawn from a Gaussian (normal)
distribution with 0 mean and standard deviation of 1.
oh cool cool
when i run this though my accuracy went up for the one i thought it would go down
and it went down for the ones i thought should go up
well yeah you were using the scaling totally wrong lol
oh i mean with new code
my results are:
```python
Test Acc (no-preprocessing): 1.0
Test Acc (standard scale): 0.8
Test Acc (min max scale): 0.7
Test Acc (with irrelevant feature): 1.0
```
actually this may be because i ran it on a small test set
make sure you use the same test/train split for all 4 of those methods
to get a fair comparison
thats fine
x has the data and y has labels
theres something preimplemented to get each thing
i think its pd.read_csv
thats fine
its running now but since my knn has some for loops it takes a few mins for each test
the csvs are 500x12
ah it still goes down
Test Acc (no-preprocessing): 0.8395833333333333
Test Acc (standard scale): 0.70625
so far
Evaluate the accuracy of the model on the test dataset for the different preprocessing
techniques as a function of k. What conclusions can you draw with regards to the
different forms of preprocessing and the sensitivity to irrelevant features for this dataset?
i feel like its fishing for an answer about accuracy getting better since scale should usually matter for knn?
i ended up with Test Acc (no-preprocessing): 0.8395833333333333
Test Acc (standard scale): 0.70625
Test Acc (min max scale): 0.8
Test Acc (with irrelevant feature): 0.84375
yeah that would be my guess as well
im not sure what conclusions to draw as its opposite of my guess
would you be able to skim through my knn to see if i had anything major causing this to be wrong?
i can, but i cant offer that much since this is homework
it works for the first data set that i had to write it for but this question uses that knn for a different set
[two Pastebin links; URLs not preserved]
the first is my knn and the second is the preprocessing tests
Are there any websites similar to "kaggle" that offer competitions?
@desert oar were you able to find anything?
does someone know what this number -> 22/22 means when training a nn with tensorflow?
Epoch 94/100
22/22 [==============================] - 0s 750us/step - loss: 0.0123
i understand its related to the dataset size but i cant find the relation
is that a progress bar for stochastic gradient descent?
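Assuming this is Keras-style training output, the counter is batches (steps) finished out of the total per epoch, so it relates to the dataset size through the batch size. The 700 samples below are an assumed figure, since the real dataset size isn't given; 32 is the default `batch_size` of `model.fit` when none is passed:

```python
import math

def steps_per_epoch(n_samples, batch_size=32):
    """Number of batches needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

# Assumed example: 700 samples with a batch size of 32.
steps = steps_per_epoch(700)
print(steps)  # 22
```

With a batch size of 32, any dataset of 673 to 704 samples would show 22 steps.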
Does anyone have an idea how I could recognise the steps annotated below using python? I've got the xy data in a dataframe at the moment
smooth out the curve using moving average then find the derivative at each point?
Some sort of "if the difference between y and y+1 > n, print x and x+1"
scipy.signal.find_peaks
the difference between y and y+1
basically the derivative
Aye. At the moment I've got something looking a little like this code wise
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter
df = pd.read_csv('e:/Projects/HiringTest/submission/sample.txt', sep = ' ')
# Data before filtering.
df.plot(x ='x', y='y', linewidth=0.2)
# Savitzky-Golay filter implementation
dataIIR = df
dataIIR['y'] = savgol_filter(df['y'], 101, 2)
dataIIR.plot(x ='x', y='y', linewidth=0.4)
So importing the data, and reducing the noise a little with filter
'e:/Projects/HiringTest/submission/sample.txt' lol,
have you tried with find_peaks or similar? tweaking it, it should find the steps
@verbal haven So say I want to find a point with difference of 1 between y and y+1, would that be height=1?
ie. jumps = find_peaks(dataIIR['y'], height=1.5)
I'm struggling to get to grips with understanding how the function works
I think you want threshold instead of height
yes its threshold
Hmmm, so
```
jumps = find_peaks(dataIIR['y'], threshold=1)
print(jumps)
```
Doesn't seem to return any values as such
So, I've got some data that looks like this
2.24756189047 2.70009589679
2.24831207802 2.85466124369
2.24906226557 2.85664726093
2.24981245311 2.84088991726
2.25056264066 5.23410679429
2.25131282821 5.01424916475
2.25206301575 4.81484599199
2.2528132033 5.25819389546
2.25356339085 4.99236143949
you can see the jump from 2.8 to 5.2
I'm looking to find where in the code that jump happens. I've put it as 1 just because it seems like a reasonable value to measure the number of jumps
what does it return?
So running starts with this
ok so it literally returns nothing
Basically yes
You are not allowed to use that command here. Please use the #bot-commands channel instead.
But I'm not sure why. As you can see above, there clearly is a difference of greater than 2. Even 1 returns nothing
After messing around in #bot-commands I think I see the problem
The threshold checks both sides
if it's just a jump it wouldn't register it as a peak as the other side doesn't have a big enough jump
Hmm I did think that might be a possibility. It looks for spikes not steps basically
I might need to do it manually in that case
try this
!e
import numpy as np
arr = np.array([2.24756189047, 2.70009589679,
2.24831207802, 2.85466124369,
2.24906226557, 2.85664726093,
2.24981245311, 2.84088991726,
2.25056264066, 5.23410679429,
2.25131282821, 5.01424916475,
2.25206301575, 4.81484599199,
2.2528132033, 5.25819389546,
2.25356339085, 4.99236143949])
x, y = arr[::2], arr[1::2]
print(f"x={x}")
print(f"y={y}")
print(x[np.gradient(y) >= 1])
It returns:
x=[2.24756189 2.24831208 2.24906227 2.24981245 2.25056264 2.25131283 2.25206302 2.2528132 2.25356339]
y=[2.7000959 2.85466124 2.85664726 2.84088992 5.23410679 5.01424916 4.81484599 5.2581939 4.99236144]
[2.24981245 2.25056264]
if you only want a strict difference between consecutive elements, instead of np.gradient you can just do np.ediff1d
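A sketch of the strict-difference version on the same sample values, using `np.diff` as an equivalent of `np.ediff1d` here. The `>= 1` threshold is the one used above; it is a choice, not a rule:

```python
import numpy as np

x = np.array([2.24756189047, 2.24831207802, 2.24906226557, 2.24981245311,
              2.25056264066, 2.25131282821, 2.25206301575, 2.2528132033,
              2.25356339085])
y = np.array([2.70009589679, 2.85466124369, 2.85664726093, 2.84088991726,
              5.23410679429, 5.01424916475, 4.81484599199, 5.25819389546,
              4.99236143949])

# np.diff gives y[i] - y[i-1]; prepend=y[0] keeps the result as long as y.
jumps = np.diff(y, prepend=y[0]) >= 1
print(x[jumps])  # only the x value where the step up lands
```

Unlike `np.gradient`, which averages the change on both sides and so flags two neighbouring points here, this flags just the first point after the jump.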
Can't seem to get it to how I'd want
I've currently mocked up some code to possibly solve it:
```python
n = 0
for row_index, row in dataIIR.iterrows():
    np1 = row['y']
    diff = np1 - n
    if diff > 2:
        print(row_index)
    n = row['y']
```
But it's not returning anything. I'm trying to get it so it will print the row index if the difference between y and y+1 is greater than 2
Sorry for what part? @velvet thorn
!e
import numpy as np
arr = np.array([2.24756189047, 2.70009589679,
2.24831207802, 2.85466124369,
2.24906226557, 2.85664726093,
2.24981245311, 2.84088991726,
2.25056264066, 5.23410679429,
2.25131282821, 5.01424916475,
2.25206301575, 4.81484599199,
2.2528132033, 5.25819389546,
2.25356339085, 4.99236143949])
x, y = arr[::2], arr[1::2]
print(f"x={x}")
print(f"y={y}")
print(x[np.ediff1d(np.concatenate([[y[0]], y])) >= 1])
This works?
The concatenation is to ensure that the result of ediff1d is the same length as the original array
Give me a sec to wrap my head around it haha
I think you're correct, I just need to try it with the full data file
so I'll need to convert the dataframe to a similar array
but data.to_numpy() gives it in the following format:
[4000 rows x 2 columns]
[[ 0.00000000e+00 -5.72766726e-03]
[ 7.50187547e-04 -5.37550170e-03]
[ 1.50037509e-03 -5.03534022e-03]
...
[ 2.99849962e+00 5.02267064e+00]
[ 2.99924981e+00 5.02299900e+00]
[ 3.00000000e+00 5.02332816e+00]]
print(x[np.ediff1d(np.concatenate([[y[0]], y])) >= 1]) should still work as intended as far as I can see right?
Ah, wait, hmm
I think it should
running into an issue that my dataFrames aren't keeping separate
It's returning nothing atm
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter
from scipy.signal import find_peaks
data = pd.read_csv('e:/Projects/HiringTest/submission/sample.txt', sep = ' ')
# Data before filtering.
data.plot(x ='x', y='y', linewidth=0.2)
# Savitzky-Golay filter implementation
dataIIR = data
savgol_filter(dataIIR['y'], 101, 2)
dataIIR.plot(x ='x', y='y', linewidth=0.2)
# Locating jumps
arr = dataIIR.to_numpy()
x, y = np.split(arr, 2, axis=1)
print(x[np.ediff1d(np.concatenate([[y[0]], y])) >= 1])
#plt.show()
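On the DataFrames not staying separate: a likely reason, going by the posted code, is that `dataIIR = data` only gives the same DataFrame a second name. A tiny sketch with made-up numbers:

```python
import pandas as pd

raw = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [1.0, 2.0, 3.0]})
alias = raw                  # same object under a second name
alias["y"] = 0.0             # this also changes raw["y"]
print(raw["y"].tolist())     # [0.0, 0.0, 0.0]

original = pd.DataFrame({"x": [0.0, 1.0, 2.0], "y": [1.0, 2.0, 3.0]})
filtered = original.copy()   # independent copy
filtered["y"] = 0.0          # original is left untouched
print(original["y"].tolist())  # [1.0, 2.0, 3.0]
```

Separately, `savgol_filter` returns the filtered values instead of changing its input, so its result has to be assigned to something (for example a column of the copy) to have any effect.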
Wait, think it was a bug
Fantastic, I think it's found the 3 steps!!
[[0.75018755]
[1.50037509]
[2.25056264]]
Thanks for all your help. I may be back in a few minutes with some more questions as to how I actually smooth the step between them, but I should be able to do that by appending the array with some values after doing some exponential smoothing
np
I don't understand any of this but good work
any idea what Im doing wrong here btw Im using PostgreSQL
c.execute("SELECT salary FROM EMPLOYEE WHERE name=$1", ("James",))
gives me
Traceback (most recent call last):
File "D:/Projects/DSaML/Main.py", line 21, in <module>
c.execute("SELECT salary FROM EMPLOYEE WHERE name=$1", "James",)
psycopg2.errors.UndefinedParameter: there is no parameter $1
LINE 1: SELECT salary FROM EMPLOYEE WHERE name=$1
just ping me when you answer pls
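For what it's worth, psycopg2 uses `%s` placeholders and does not translate PostgreSQL's `$1` style, which would explain the `UndefinedParameter` error. A minimal sketch; the helper name `fetch_salary` is made up, and the cursor is assumed to come from an existing psycopg2 connection:

```python
QUERY = "SELECT salary FROM employee WHERE name = %s"

def fetch_salary(cursor, name):
    """Run the query with `name` bound as a parameter and return one row."""
    # The second argument must be a sequence, hence the one-element tuple.
    cursor.execute(QUERY, (name,))
    return cursor.fetchone()
```

Usage would look like `fetch_salary(c, "James")`. The `%s` is used for every type and is never quoted by hand; the driver handles quoting.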
Does anyone know how I'd go about smoothing data between two points? I've got the sections I want to smooth coloured below:
and my data as a dataframe
did your moving average trick not work?
I couldn't get it implemented without error
code?
2 secs
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import signal
from scipy import optimize
data = pd.read_csv('e:/Projects/HiringTest/submission/sample.txt', sep = ' ')
# Data before filtering.
plt.figure()
plt.plot(data['x'], data['y'], linewidth=0.2)
# Locating jumps
arr = data.to_numpy()
x, y = np.split(arr, 2, axis=1)
stepIndex = data.index[np.ediff1d(np.concatenate([[y[0]], y])) >= 1]
print(stepIndex)
step1x = x[range(stepIndex[0]-100, stepIndex[0]+100)]
step2x = x[range(stepIndex[1]-100, stepIndex[1]+100)]
step3x = x[range(stepIndex[2]-100, stepIndex[2]+100)]
step1y = y[range(stepIndex[0]-100, stepIndex[0]+100)]
step2y = y[range(stepIndex[1]-100, stepIndex[1]+100)]
step3y = y[range(stepIndex[2]-100, stepIndex[2]+100)]
def moving_avg(x, n):
    cumsum = np.cumsum(np.insert(x, 0, 0))
    return (cumsum[n:] - cumsum[:-n]) / float(n)
step1yMA = moving_avg(step1y, 3)
step2yMA = moving_avg(step2y, 3)
step3yMA = moving_avg(step3y, 3)
plt.plot(step1x, step1yMA)
plt.plot(step2x, step2yMA)
plt.plot(step3x, step3yMA)
# Savitzky-Golay filter implementation
dataIIR = data
dataIIR['y'] = signal.savgol_filter(data['y'], 101, 2)
plt.figure()
plt.plot(dataIIR['x'], dataIIR['y'], linewidth=0.2)
plt.show()
@hasty grail
So I haven't modified the dataframe with the new data yet, just plotted it, but it should still work. Instead, it prints the error:
ValueError: x and y must have same first dimension, but have shapes (200, 1) and (198,)
which line?
I think that the issue is that the moving average calculation isn't defined for the first n-1 values
hence the difference in shape
I don't understand what you mean by that sorry
@lean wharf say you have an average of 3 values
over this data: [10, 5, 10, 15, 20, 15]
if you always want to have 3 values in the calculation
you'll end up with [25/3, 10, 15, 50/3] (4 values)
1-3, 2-4, 3-5, 4-6
if you wanted 4 values in the moving average, you'd have 3 values in the result
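The same example in NumPy, as a small sketch. `mode="valid"` keeps only windows that fit entirely inside the data, so the output has `len(values) - window + 1` points:

```python
import numpy as np

data = np.array([10, 5, 10, 15, 20, 15], dtype=float)

def moving_average_valid(values, window):
    """Average over full windows only."""
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode="valid")

avg3 = moving_average_valid(data, 3)
avg4 = moving_average_valid(data, 4)
print(avg3)  # 4 values: 25/3, 10, 15, 50/3
print(avg4)  # 3 values
```

That is where the 198 in the error comes from: 200 points with a window of 3 leave 200 - 3 + 1 = 198 averages, while the x slice still has 200 entries.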
So my value of n is wrong?
I'm still not quite grasping the issue
Trying out a different implementation:
# Locating jumps
arr = data.to_numpy()
x, y = np.split(arr, 2, axis=1)
stepIndex = data.index[np.ediff1d(np.concatenate([[y[0]], y])) >= 1]
print(stepIndex)
step1x = x[range(stepIndex[0]-100, stepIndex[0]+100)]
step2x = x[range(stepIndex[1]-100, stepIndex[1]+100)]
step3x = x[range(stepIndex[2]-100, stepIndex[2]+100)]
step1y = y[range(stepIndex[0]-100, stepIndex[0]+100)]
step2y = y[range(stepIndex[1]-100, stepIndex[1]+100)]
step3y = y[range(stepIndex[2]-100, stepIndex[2]+100)]
def movingaverage(interval, window_size):
    window = np.ones(int(window_size))/float(window_size)
    return np.convolve(interval, window, 'same')
y_av1 = movingaverage(step1y, 10)
y_av2 = movingaverage(step2y, 10)
y_av3 = movingaverage(step3y, 10)
plt.plot(step1x, y_av1)
plt.plot(step2x, y_av2)
plt.plot(step3x, y_av3)
But now getting the error:
ValueError: object too deep for desired array
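A probable cause of that error, sketched on made-up data: `np.split(arr, 2, axis=1)` hands back columns of shape (N, 1), which are still 2-D, while `np.convolve` only accepts 1-D input. Plain column indexing avoids that:

```python
import numpy as np

# Made-up stand-in for data.to_numpy(): 200 rows, columns x and y.
arr = np.column_stack([np.linspace(0.0, 1.0, 200), np.linspace(0.0, 5.0, 200)])

x_col, y_col = np.split(arr, 2, axis=1)
print(y_col.shape)   # (200, 1): a column, i.e. still 2-D

# Column indexing gives the 1-D arrays np.convolve expects.
x, y = arr[:, 0], arr[:, 1]
window = np.ones(10) / 10
smoothed = np.convolve(y, window, mode="same")
print(x.shape, smoothed.shape)   # (200,) (200,)
```

`mode="same"` keeps the output as long as the input by treating values beyond the ends as zero, so the first and last few points are pulled towards 0.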
what gm said
I feel like there's a fundamental flaw in my understanding here
no matter what the value of n is, you will have to define the first n-1 values of your moving average
otherwise you will always be a couple of values short and the arrays won't align to each other
Can I do that by simply increasing the size of the data points I draw from?
ie. step1y = y[range(stepIndex[0]-101, stepIndex[0]+100)]
ok the problem here is that you might get out-of-bound indices when you add/subtract from stepIndex
If I were you I would create a padded version of y before doing that
Apologies again, I don't know what you mean by that. I'm new to these concepts
window_len, exp_alpha = 201, 0.5
pad_left, pad_right = window_len // 2, (window_len - 1) // 2
y_padded = np.pad(y, (pad_left, pad_right), constant_values=(np.nan, np.nan))
exp_kernel_left = ((1 - exp_alpha) ** np.arange(1, pad_left + 1))[::-1]
exp_kernel_right = (1 - exp_alpha) ** np.arange(1, pad_right + 1)
exp_kernel = np.concatenate([exp_kernel_left, [1], exp_kernel_right])
avg_values = []
for i in range(len(y)):
    window = y_padded[i:i+window_len]
    exp_sum = exp_kernel[~np.isnan(window)].sum()
    exp_avg = np.nansum(window * exp_kernel) / exp_sum
    avg_values.append(exp_avg)
something like this maybe
Made some errors, I have edited the code
oops the range should be from 1 to N instead of 0 to N-1
Do I no longer need the step1y = y[range(stepIndex[0]-100, stepIndex[0]+100)] functions then, to define the range?
replace all that with what I wrote
I'm struggling to make sense of this, it's a bit above my pay grade haha
But I'll give it a shot
exp_kernel_left = ((1 - exp_alpha) ** np.arange(1, pad_left + 1))[::-1]
exp_kernel_right = (1 - exp_alpha) ** np.arange(1, pad_right + 1)
exp_kernel = np.concatenate([exp_kernel_left, [1], exp_kernel_right])
This builds the kernel for calculating the moving average. It's equal to one in the center then exponentially falls off towards the sides
y_padded = np.pad(y, (pad_left, pad_right), constant_values=(np.nan, np.nan))
This creates a padded version of y where the padded values are NaNs so they can be filtered out later
in the for loop, it takes a window from y_padded such that y[i] is in the center of the window
then, it is multiplied with the kernel to get the numerator
the denominator is the sum of the values (aka weights) in the kernel
how did you get that colour in the text?
Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.
To do this, use the following method:
```python
print('Hello world!')
```
Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them
This will result in the following:
print('Hello world!')
Alright, that makes a bit more sense
you can just copy the text from that bot message
np
I'm a beginner at python
yeah
i knew that
thanks sir
one last question, what's the python bot's prefix?
an exclamation mark
Traceback (most recent call last):
File "e:/Projects/HiringTest/submission/assignment1.py", line 31, in <module>
exp_sum = exp_kernel[~np.isnan(window)].sum()
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
hmm can you print the shape of each variable?
Is that possible without being able to run the code?
any idea what Im doing wrong here btw Im using PostgreSQL
c.execute("SELECT salary FROM EMPLOYEE WHERE name=$1", ("James",)) gives me Traceback (most recent call last): File "D:/Projects/DSaML/Main.py", line 21, in <module> c.execute("SELECT salary FROM EMPLOYEE WHERE name=$1", "James",) psycopg2.errors.UndefinedParameter: there is no parameter $1 LINE 1: SELECT salary FROM EMPLOYEE WHERE name=$1
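For what it's worth, psycopg2 uses `%s` as its placeholder rather than PostgreSQL's native `$1`, which would explain the "there is no parameter $1" error. A minimal sketch follows; `fetch_salary` and the `RecordingCursor` stand-in are hypothetical helpers so the snippet runs without a database, and in real code the cursor would come from a psycopg2 connection.

```python
def fetch_salary(cursor, name):
    """Run the query from the message above with psycopg2-style placeholders."""
    # %s is the placeholder; the parameters must be a sequence, hence (name,).
    cursor.execute("SELECT salary FROM EMPLOYEE WHERE name = %s", (name,))
    return cursor.fetchone()


class RecordingCursor:
    """Hypothetical stand-in for a DB-API cursor, used only for this demo."""

    def __init__(self):
        self.calls = []

    def execute(self, query, params=None):
        self.calls.append((query, params))

    def fetchone(self):
        return None


cur = RecordingCursor()
fetch_salary(cur, "James")
```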
This is the total code thus far for continuity sake:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import signal
from scipy import optimize
data = pd.read_csv('e:/Projects/HiringTest/submission/sample.txt', sep = ' ')
# Data before filtering.
plt.figure()
plt.plot(data['x'], data['y'], linewidth=0.2)
# Locating jumps
arr = data.to_numpy()
x, y = np.split(arr, 2, axis=1)
# stepIndex = data.index[np.ediff1d(np.concatenate([[y[0]], y])) >= 1]
# print(stepIndex)
window_len, exp_alpha = 201, 0.5
pad_left, pad_right = window_len // 2, (window_len - 1) // 2
y_padded = np.pad(y, (pad_left, pad_right), constant_values=(np.nan, np.nan))
exp_kernel_left = ((1 - exp_alpha) ** np.arange(1, pad_left + 1))[::-1]
exp_kernel_right = (1 - exp_alpha) ** np.arange(1, pad_right + 1)
exp_kernel = np.concatenate([exp_kernel_left, [1], exp_kernel_right])
avg_values = []
for i in range(len(y)):
    window = y_padded[i:i+window_len]
    exp_sum = exp_kernel[~np.isnan(window)].sum()
    exp_avg = np.nansum(window * exp_kernel) / exp_sum
    avg_values.append(exp_avg)
# Savitzky-Golay filter implementation
dataIIR = data
dataIIR['y'] = signal.savgol_filter(data['y'], 101, 2)
plt.figure()
plt.plot(dataIIR['x'], dataIIR['y'], linewidth=0.2)
plt.show()
you need to run the whole thing probably
Yeah I'm doing so
Possible unbalanced tuple unpacking with sequence defined at line 785 of numpy.lib.shape_base: left side has 2 label(s), right side has 0 value(s)
not sure which line of your code that is on
maybe that's just an error of the interpreter
as long as arr indeed has 2 columns it should be ok
That's what I'm thinking also - I've ignored that specific message up until now - but it's not wanting to run
But it keeps throwing
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
at me
yeah print their shapes
It doesn't return anything. It won't run due to the index error
comment it out then
exp_kernel return (201,)
hii, sorry random question! was curious if there's a way to write code to create columns in google sheets? i want to insert values from a different sheet into another by recognizing the same location names- is there a way to do that? i know how to do it in a Jupyter notebook, but don't know how to apply it to google sheets
how about window?
Ah, window is (201, 201)
oh I think I know why, the dimensions are not squeezed after split
maybe you should do the more straightforward x, y = arr[:, 0], arr[:, 1] instead
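A quick shape check shows the difference. The toy array here is invented; the point is that `np.split` keeps the column axis, and `np.pad` with a single `(before, after)` pair then pads both axes, which is how a window can end up as (201, 201).

```python
import numpy as np

arr = np.arange(8.0).reshape(4, 2)   # stand-in for data.to_numpy()

# np.split keeps the second axis, so each piece is a (4, 1) column.
x_col, y_col = np.split(arr, 2, axis=1)

# Plain column indexing drops that axis and gives 1-D arrays.
x, y = arr[:, 0], arr[:, 1]

# Padding a 2-D array with one (before, after) pair pads every axis.
padded_col = np.pad(y_col, (2, 2), constant_values=np.nan)
padded_flat = np.pad(y, (2, 2), constant_values=np.nan)

shapes = (x_col.shape, x.shape, padded_col.shape, padded_flat.shape)
```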
Awesome, they're all single column now
So, this should smooth the step theoretically now?
try
I'm plotting x against avg_values just to clarify?
yeah
So it's reduced the noise (the moving average filter is Fig2):
the end result is Fig3 right?
ah
The MA has maintained the step gradient better
if you want it to be smoother you can try adjusting alpha
you can try to selectively smooth the graph only close to the points where the jump is significant
that's what I was trying to do earlier with the +100/-100
and then somehow make it sine wave like to join the steps
I'm gonna call it a day though I think. Thanks again for all your help, it's been super appreciated! @hasty grail
np
Does somebody know of a numpy function, which takes for example 3 vectors, and represents the first as a linear combination of the others two (I need this for representing a face, as a linear combination of its eigenfaces)
numpy.linalg.solve?
Maybe that won't work depending on the rank of the matrix you construct from the two vectors
Looks like numpy.linalg.lstsq would be able to handle it for a general mx2 matrix to me.
hii, i have image recognition model which recognizes passport and driving_licence images
as we know some countries have statewise driving_licence
how can i write code that checks whether my model is statewise or countrywise
my image recognition model is basically recognizes documents like "passports" and "driving_licence"
as some country has driving_licence "statewise" and some countries has "countrywise
as u know some countries has their driving_licence statewise
for e.g. "usa" , "australia", "india"
some country has "countrywise"
for e.g. "albania", "united_kingdom" etc
how i can make condition if country has state and user does not provide state name
so it should return "provide valid state name"
my inputs this way
Guys I need some help with pandas
Suppose there are two columns with values in it.
I want to find the entry that has the highest difference in values.
How should I go around it?
nvm guys
df.loc[(df['A']-df['B']).idxmax()] , this should be useful.
@merry ridge I need to find the linear combination of the vectors, that form the first one, so I can take the coefficients, and put them into a weight vector (If you are familiar with eigenfaces, this is used to reconstruct the main face from the eigen ones + the mean one)
I don't see why that won't do what you're asking
@feral spoke dmi.
Given a v_3 in span{v_1, v_2} linalg.lstsq finds the minimizer of the norm of Ax-b where A = [v1, v2]. and b = v3
the minimizer x is precisely the coefficients that satisfy x[0]v_1 + x[1]v_2 = v_3
Obviously if the norm is greater than some epsilon tolerance level, then v_3 is not in the span and there is no solution. I am assuming you already know that v_3 is in the set otherwise you will have to measure the norm at each step and do some error handling if you try to do this.
I will try it with two basis orthogonal vectors, and one other, which must be in the span, just to test for now
That sounds like a good approach
It works
Great!
But what about this one? It seems that the function is returning me wrong coefficients... I mean those 2 vectors are linearly independent, so it has to return me the right solutions
I haven't used this function before. Let me load up a Jupyter notebook and see
1, 8 is a perplexing answer
Ok, it looks like when you type [[2,1] , [0,1]] that is inputting your vectors as the rows, not the columns
I think that I have to pass it like [[2, 0], [1, 1]]
And this way it would work
Like a T-Matrix
So if you do 1*[2,0] + 8*[1,1] you get [10, 8] as required
Yeah, you basically need to transpose it
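To make the row/column point concrete, here is a small sketch. The vectors `[2, 1]` and `[0, 1]` and the target `[10, 8]` are inferred from the messages above, so treat them as an assumption about what was actually typed.

```python
import numpy as np

v1 = np.array([2.0, 1.0])
v2 = np.array([0.0, 1.0])
target = np.array([10.0, 8.0])

# lstsq expects the basis vectors as COLUMNS of A, so stack them that way
# instead of writing [[2, 1], [0, 1]] (which puts them in the rows).
A = np.column_stack([v1, v2])

coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, target, rcond=None)

# Sanity check: rebuild the target from the coefficients.
reconstructed = coeffs[0] * v1 + coeffs[1] * v2
```

For eigenfaces the same idea applies with one flattened eigenface per column; checking `reconstructed` against the target is a cheap way to catch a transposed matrix.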
hi , i have problem on my model in pytorch lightning , first i wonder what the possible reason of increasing validation loss while train loss decrease
this takes my 3 days. in above sample validation set and train set are same.
From a pure linear algebra perspective, passing in [2,0] [1,1] feels very unnatural, but it makes sense
This way works completely fine, I can pass my vectors as array, and numpy will do the transposition, instead of manually changing
Yeah. I'm just spoiled by matlab notation so whenever I start manipulating vectors in python I turn stupid
Thank you for the help!
Yep, good luck!
does sorting algorithms fit here?
#algos-and-data-structs would be best.
hey guys, if anybody has experience with pandas I have a question over in #help-corn
I'd be grateful for any help
Yo
I started working last week, and i'm joining the I.T team as a minor
My boss really wants me to learn Python so i've taken a course to learn the basics of the language
I already worked with C# before so im mildly familiar with programming
But my boss has been demanding that i do some data scraping for him lately
And i don't really know how to do that
So i just wanted to ask if anyone is willing to help me learn a little bit about it
If no one is that is fine, it's not your job to take requests from randos and i know it's demanding
I just really don't know how to start doing this
Here's a guide to use Python for data scraping webpages: https://www.edureka.co/blog/web-scraping-with-python/
Hello everyone, is it possible to use ipython without installing Jupyter notebooks, cause I use vs code for data science.
I can send you some YouTube links to learn web scraping, and I have a PDF specifically for data scraping.
Where did you originally get it?
Check out <<data school>> on YouTube, they can give you a good foundation for learning web scraping, after which you can move on to data scraping and web crawlers.
I downloaded it from YouTube to my phone.
Although, I've watched all of them
There's also this book I studied earlier this year, it gives a solid summary and foundation for web scraping.
Python projects for beginners by Connor Milliken.
That's the book title.
Do you use WhatsApp?
I do, but i don't have a personal number
The company gave me the phone since i couldnt buy one, but i can only use it for work
DM me, I'll send you the YouTube links and PDFs. I'll tell you where to read so you can go straight to web scraping and stuff.
@boreal summit u can just run ipython in the terminal and it'll work
Also vscode has support for jupyter notebooks, if u want to use that
Yea, I use Jupyter notebooks in vs code and run code in the terminal. When I used Jupyter notebooks, I could just search ipython in the windows search bar and use it as a stand alone application. @flat quest
So that's why I was wondering if I could use ipython as a stand alone application without installing Jupyter notebooks.
I currently use notebooks in vs code for data science projects.
Hello guys, I'm working on a project for my portfolio. I'm trying to predict how long does it take to a dog to be adopted from an animal shelter. After a couple rounds of preprocessing and feature engineering, I thought it was time to do some encoding. But, when it comes to the dog's race, theres about 150 unique races available. So, for this project, I'm thinking about encoding the 5 most frequent races like 'isGermanSheperd' or 'isPoodle' and the other non-frequent races in a 'otherRaces' variable. What do you guys think about this strategy? And in this case, is there other strategies you would suggest?
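A minimal sketch of the "top N plus other" encoding described above, assuming pandas. The sample breed names, `top_n` and the `is_` prefix are illustrative only, not taken from the actual shelter data.

```python
import pandas as pd

breeds = pd.Series(
    ["Poodle", "Poodle", "Beagle", "Poodle", "Beagle", "Husky", "Corgi"],
    name="breed",
)
top_n = 2  # the message above suggests 5; 2 keeps the toy example small

# Most frequent categories, by count.
top = breeds.value_counts().nlargest(top_n).index

# Keep the frequent breeds, lump everything else into one bucket.
grouped = breeds.where(breeds.isin(top), other="Other")

# One indicator column per remaining category.
encoded = pd.get_dummies(grouped, prefix="is")
```

Whether 5 is the right cut-off is a judgment call; comparing a couple of values of `top_n` on a validation set is one way to decide.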
hi, i need some help. i am trying to compare three images and find out which one is the closest to one of the images, like if i had two pictures, one that would is red and the other one is blue, and i had another one that would be close to fully blue. the program would decide which picture is the closest to the third picture. someone told me to make a siamese network but i dont know how
I have a question about ANN, not python and I would like to have your opinion. I have a multivariate timeserie that I used to train a multilayer perceptron ANN. When use just a layer with 100 neurons I get a result like this and when I increase the number of layers I get much worse results. Is this because of I have a very small set of training (around 50)?
@velvet thorn I love you
thank you
you just solved 40 minutes of frustration in 30 seconds you are a god
yeah, it can get confusing
part of solving this kind of problems is knowing what to Google
I would probably have tried "generate random integers numpy"
yeah, that gives np.random.randint as the first result
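For reference, a short sketch of generating random integers with a `size` argument. The bounds and shapes are arbitrary examples, since the original exercise isn't shown.

```python
import numpy as np

# Legacy interface: integers in [1, 7), i.e. 1..6, as a 2x3 array.
legacy = np.random.randint(1, 7, size=(2, 3))

# Newer Generator interface; the upper bound is exclusive here as well.
rng = np.random.default_rng(seed=0)
rolls = rng.integers(low=1, high=7, size=10)
```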
I have that part correct further up the page, but there was just too many things for me to comprehend
That was step # 2 of part E, have done like 90 others before getting to this one
it can be overwhelming, too
@velvet thorn First time combining that with size in the same line, I overthought it. Thank you so much I may need advice again before this part is over, but I'll probably be able to google my way through it
sorry about that
just now was fine
I mean, in the future
sorry for the misunderstanding
like in general don't tag anyone who hasn't replied to your new question, I guess
That makes sense, I've done it when someones question goes unanswered for quite a few messages
I'm stuck again but I'm gonna pick it back up tomorrow I'm overwhelming myself
Does anyone have any experience implementing a smoothstep function?
I'm looking to smooth the transition between the highlighted areas
I've got some code that provides a value for the centre of the jumps as an array [ 999 1999 2999]
what about applying a low pass filter on the slices?
ive never smoothed a signal with steps like that tbh
Was yesterday's result not good enough?
i mean the general method for dealing with those kinds of time series is ARIMA @lean wharf
If you want the neural network to figure it out thats a whole nother matter
i can't quite remember, but I'm pretty sure i could use ipython from terminal without installing jupyter notebooks @boreal summit
It was a while back tho so i might be wrong
Slightly rewritten to be more efficient
window_len, exp_alpha = 201, 0.5
pad_left, pad_right = window_len // 2, (window_len - 1) // 2
y_padded = np.pad(y, (pad_left, pad_right), constant_values=(np.nan, np.nan))
exp_kernel = (1 - exp_alpha) ** np.abs(np.arange(-pad_left, pad_right + 1))
def get_avg_values():
    for i in range(len(y)):
        window = y_padded[i:i+window_len]
        mask = ~np.isnan(window)
        yield np.average(window[mask], weights=exp_kernel[mask])
avg_values = np.fromiter(get_avg_values(), float, count=len(y))
@flat quest yes, Jupyter relies on IPython
yes it does
but the question is if its also standalone.
Ah looks like it is. There's an ipython package u can just download
@flat quest yeah, that was what I meant to say
but I got distracted
🥴
@hasty grail it did the job of smoothing the whole signal, but not joining the steps
I'm nearly getting there. I know I need something like a sigmoid function to connect them
What I've got thus far
```python
#_______________ Sigmoid function to smooth steps _______________
xs = np.array(x)
ys = np.array(avg_values)
diff = ys[1:] - ys[:-1]
indexBool = diff > 0.385 # Variable adjusted to fit number of steps
index = np.argwhere(indexBool).reshape(-1)
step1x = xs[(index[0]-100):(index[0]+100)]
step1y = ys[(index[0]-100):(index[0]+100)]
step2x = xs[(index[1]-100):(index[1]+100)]
step2y = ys[(index[1]-100):(index[1]+100)]
step3x = xs[(index[2]-100):(index[2]+100)]
step3y = ys[(index[2]-100):(index[2]+100)]

def sigmoid(x, mi, mx):
    return mi + (mx-mi)*(lambda t: (1+200**(-t+0.5))**(-1) )( (x-mi)/(mx-mi) )

# Alternative to sigmoid junction
def smoothclamp(x, mi, mx):
    return mi + (mx-mi)*(lambda t: np.where(t < 0 , 0, np.where( t <= 1 , 3*t**2-2*t**3, 1 ) ) )( (x-mi)/(mx-mi) )

plt.figure()
plt.plot(xs, sigmoid(x, y[index[0]-100], y[index[0]+100]),'b-', lw=3, alpha=0.5, label='sigmoid')
plt.plot(xs, sigmoid(x, y[index[1]-100], y[index[1]+100]),'b-', lw=3, alpha=0.5, label='sigmoid')
plt.plot(xs, sigmoid(x, y[index[2]-100], y[index[2]+100]),'b-', lw=3, alpha=0.5, label='sigmoid')
plt.plot(xs, ys)
plt.plot(step1x, step1y)
plt.plot(step2x, step2y)
plt.plot(step3x, step3y)
plt.show()
```
Which give a plot like this:
So I basically need those purple plots scaled down on the x axis
I'm just having some difficulties now where:
plt.plot(step1x, sigmoid(step1x, y[index[0]-100], y[index[0]+100]),'b-', lw=3, alpha=0.5, label='sigmoid')
Only takes a "slice" of the data, and isn't scaled down to it if you get my meaning:
I think what I actually want is
```python
plt.plot(xs, sigmoid(step1x, ys[index[0]-100], ys[index[0]+100]),'b-', lw=1, alpha=0.5, label='sigmoid')
```
but then we get a similar problem to yesterday where:
```ValueError: x and y must have same first dimension, but have shapes (4000,) and (200,)```
It's not clear what you are actually trying to achieve. You just want to replace the colored regions by a smooth continuous function without a jump discontinuity?
Yes basically
So replace these 200 values or so for y in each region with a smooth function
It's not clear to me what the difficulty is. You can just choose basically any interpolating function you want and fix it to join those points
To me the laziest solution would be to just cut out say the yellow part and interpolate the start and the end of the yellow section by a straight line
and if that is not sufficiently smooth at the connecting points, pass the signal through a one dimension heat equation or some other mollifier to smoothen it out
in a neighborhood of the connecting points, not the whole signal
You could also try connecting it with spline or something, but it really depends on what properties you require in that section
At this stage the laziest solution is best
I still want the 200 values in that region though, so I can slice them back into the original array
so just write down a parametric equation of a line that intercepts those two points and evaluate it at 200 points?
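The straight-line version of that could look like the sketch below. `bridge_linear` is a made-up helper name, and the step signal is a toy example rather than the real data.

```python
import numpy as np

def bridge_linear(y, start, stop):
    """Replace y[start:stop] with a straight line between its two end values."""
    y = np.array(y, dtype=float)          # work on a copy
    n_points = stop - start               # keeps the same number of samples
    y[start:stop] = np.linspace(y[start], y[stop - 1], n_points)
    return y

# Toy step: 0 for the first half, 1 for the second half.
signal_in = np.zeros(10)
signal_in[5:] = 1.0

bridged = bridge_linear(signal_in, start=2, stop=8)
```

Because the slice keeps its length, the result can be used in place of the original array; for the 200-sample regions above the call would be along the lines of `bridge_linear(ys, index[0] - 100, index[0] + 100)`.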
Hey guys. I'm training a model that in every epoch, the train data completely differs. But the model overfits. How is that possible?
Like this. I would be glad if you mention me for any suggestion
I was attempting to get the properties of a sigmoid function @merry ridge
ie.
If you have that plot already then what is the problem?
The black line was drawn to demonstrate sorry, should have specified
like Hexicle said, just write down the equation and evaluate it at the points
That's the bit I'm struggling with I think. I'm an EE engineer, I'm still a python novice relatively speaking
to be more specific, write a function that takes an array of X and Y values, calculates the equation of a sigmoid function that passes through the first and the last of the points, and then evaluate it at each of the X points, and return the result
let's see...
This is just high school math, it's function compositions to shift and scale
like take S(x), center it by function composition with S(x-0.7) or whatever the mid point of the deleted data is
etc
well, there is a problem of the sigmoid function, strictly speaking, not passing through 0 and 1 ever
I understand the maths, it's being able to modify the values of the array
Right, but you don't need it to pass through 0 or 1
Just restrict the domain and truncate it at some y values
then scaling the sigmoid will match it up for you
What I would do is to identify which indices your yellow parts contain then use a lambda function to plot a sigmoid over a linspace of 200 points in a list and then shove those values into where the yellow part was
Hmm, I'll give that a shot
I want to add L_0.5 loss in my model while training
i have written :: loss = tf.reduce_sum(tf.pow(tf.abs(self.Coef),0.5))
But its giving NaN Error!
Whereas its working perfectly with L1 loss and L2 loss
loss = tf.reduce_sum(tf.square(self.Coef))
loss = tf.reduce_sum(tf.abs(self.Coef))
the above 2 lines are working perfectly, but i want to use L_0.5 loss....How to do that?
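One likely cause (an assumption, since the values of `self.Coef` aren't shown) is that the gradient of |w|^0.5 blows up at w = 0, which L1 and L2 don't suffer from. The NumPy sketch below just evaluates the analytic gradient to show the effect; `l_half_grad` and `eps` are illustrative names.

```python
import numpy as np

def l_half_grad(w, eps=0.0):
    """Analytic gradient of sum((|w| + eps) ** 0.5) with respect to w."""
    w = np.asarray(w, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        # d/dw (|w| + eps)^0.5 = 0.5 * sign(w) * (|w| + eps)^(-0.5)
        return 0.5 * np.sign(w) * (np.abs(w) + eps) ** -0.5

w = np.array([0.0, 0.25, -4.0])

raw = l_half_grad(w)                    # first entry is 0 * inf, i.e. NaN
stabilised = l_half_grad(w, eps=1e-8)   # small offset keeps it finite
```

In TensorFlow the same idea would be something like `tf.reduce_sum(tf.pow(tf.abs(self.Coef) + eps, 0.5))` with a small `eps`; whether that is enough for this particular model is untested here.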
a-ha
It's a wonderful sight, thanks for your help @tidal bough @merry ridge
Good work
Yeah, had to use smoothclamp instead of sigmoid though
here's mine:
import numpy as np
from scipy.special import expit # for single values, manual implementation is faster, but expit is better for arrays
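def sig_approx(X,Y,x_scale=10):
    X = X.copy()
    middle = X[len(X)//2]
    X -= middle  # centre X on the middle of the slice
    x_coeff = (2*x_scale)/(X[-1]-X[0])
    X = X*x_coeff  # rescale so X spans roughly [-x_scale, x_scale]
    return Y[0]+(Y[-1]-Y[0])*expit(X)  # offset by the start level, scale by the step height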
# plotting stuff:
%matplotlib widget
import matplotlib.pyplot as plt
#test case:
end = 10
X = np.linspace(0,end,100)
midpoint = X.shape[0]//2
Y = np.zeros(X.shape)
Y[midpoint:] = 1
#usage:
inds = slice(midpoint-10,midpoint+10)
Y[inds] = sig_approx(X[inds],Y[inds])
plt.plot(X,Y)
Sigmoid gave me something a little like this
How are you scaling it?
return mi + (mx-mi)*(lambda t: np.where(t < 0 , 0, np.where( t <= 1 , 3*t**2-2*t**3, 1 ) ) )( (x-mi)/(mx-mi) )
so i am watching a kinda of outdated course on pandas does anyone know what happend to the ix[] and are there any equivalent?
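On the `.ix` question: it was deprecated in pandas 0.20 and removed in 1.0, and the replacements are `.loc` (label-based) and `.iloc` (position-based). A small sketch with a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

by_label = df.loc["y", "b"]       # row label "y", column label "b"
by_position = df.iloc[1, 1]       # second row, second column

# Mixed access (label for the row, position for the column), which .ix used to guess at:
mixed = df.loc["y", df.columns[1]]
```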
I mean how are you scaling the sigmoid
Such that the first X gets changed to -10, and the last one to 10.
and by the Y axis - by Y[-1]-Y[0]
Sorry I mean Aromasin's plot
diff = ys[1:] - ys[:-1]
indexBool = diff > 0.385 # Variable adjusted to fit number of steps
index = np.argwhere(indexBool).reshape(-1)
def smoothclamp(x, mi, mx):
    return mi + (mx-mi)*(lambda t: np.where(t < 0 , 0, np.where( t <= 1 , 3*t**2-2*t**3, 1 ) ) )( (x-mi)/(mx-mi) )
plt.plot(xl, sigmoid(yl, ys[index[0]-100], ys[index[0]+100]))
So the top set of code returns the point where the step happens
What is the definition of your sigmoid function
I didn't use sigmoid in the end, my sigmoid def was: def sigmoid(x, mi, mx): return mi + (mx-mi)*(lambda t: (1+200**(-t+0.5))**(-1) )( (x-mi)/(mx-mi) )
I would have defined it differently, to be honest, I don't know why that notation even works
Define the signal as f(t) and the sigmoid as S(t) = exp(t)/(1+exp(t)).
Find the interval [a,b] that contains the yellow part
1/(1+exp(-t)) is one less exponent
Then replace S(t) by S(t - (a+b)/2), so the sigmoid is centered on the midpoint of [a,b]. Call this function g(t). Then find a constant K such that Kg(b) = f(b) so that K = f(b)/g(b). Then use g(t)*f(b)/g(b)
Yeah my implementation is scrappy as hell
I don't even understand that implementation. You have a (mx-mi) on the left, and a "/mx-mi" on the right. Those would cancel? I don't understand the notation here
@merry ridge The last parenthesis group is the argument passed to the lambda
Oh, thanks
so it's (mx-mi) * f( (x-mi) / (mx-mi) ), which is about right
I could probably rewrite it like:
from scipy.special import comb

def smoothstep(x, x_min=0, x_max=1, N=1):
    x = np.clip((x - x_min) / (x_max - x_min), 0, 1)
    result = 0
    for n in range(0, N + 1):
        result += comb(N + n, n) * comb(2 * N + 1, N - n) * (-x) ** n
    result *= x ** (N + 1)
    return result
@lean wharf I highly suggest you generally split code into more lines - it's more readable for us, and believe me, you too are going to regret this in a week when you try to read the code and can't
But if you are evaluating at (x-mi)/(mx-mi) that isn't what you want
Yeah, I do generally, just code vomiting atm till it works
Where N is how smooth I want the curve
Probably a tad more legible
that code looks like it'd be inefficient, if it works
I don't quite get what's happening, but you generally want to vectorize things when possible
Yeah, I've read that vectorizing is more efficient in python but again I'm still relatively new to it
I'm totally drawing a blank - what's the name for the method of determining the statistical significance of multiple variables on an output? It's not ANOVA, it's <<something>> <<something>> analysis
Nvm, it's principal component analysis
im trying to learn the exponent using tf.math.pow in Tensorflow Keras
my layer is created doing
class Dense_Power(Layer):
    def __init__(self, **kwargs):
        super(Dense_Power, self).__init__(**kwargs)

    def build(self, input_shape):
        self.kernel = self.add_weight('kernel',
                                      shape=(input_shape[1],),
                                      initializer=tf.keras.initializers.glorot_uniform(),
                                      trainable=True)
        # Create a trainable weight variable for this layer.
        self.power = self.add_weight('power',
                                     shape=(input_shape[1],),
                                     initializer=tf.keras.initializers.glorot_uniform(),
                                     trainable=True)
        super(Dense_Power, self).build(input_shape) # Be sure to call this at the end

    def call(self, x):
        power_val = tf.math.pow(x, self.power)
        dot_prod = tf.linalg.matmul(power_val, self.kernel)
        return dot_prod

    def compute_output_shape(self, input_shape):
        return (input_shape[0], input_shape[1])
however i only get nan's in my training doing this
as it somehow invalidates all weights in my model
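A possible explanation, assuming the inputs can be negative or zero (the data isn't shown): a negative base raised to a non-integer power is NaN in real arithmetic, and the gradient with respect to the exponent is x^p * ln(x), which is also undefined for x <= 0. The NumPy sketch below only illustrates the arithmetic; the sign-preserving variant is one workaround among several, not something from the original layer.

```python
import numpy as np

x = np.array([-2.0, 0.0, 3.0])
p = np.array([0.5, 0.5, 0.5])   # Glorot-style initial powers are small non-integers

with np.errstate(invalid="ignore"):
    naive = np.power(x, p)      # negative base, fractional power -> NaN

# Possible workaround: take the power of |x| and restore the sign afterwards.
safe = np.sign(x) * np.power(np.abs(x), p)
```

A single NaN in the forward pass is enough to turn every weight into NaN after one optimizer step, which would match the symptom described above.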
While reading Hands-on Machine Learning with Scikit Learn, Keras and Tensorflow Book, I came across this equation for batch gradient descent partial derivative. In batch gradient descent, we try to minimize a prediction error by finding an appropriate weights for our features. In order to find the weights, batch gradient descent calculates partial derivative of the cost function as mentioned in the equation below where it uses random weights initialized at the start of the training, input values, etc(confused about what are all the values). But in this equation, I am not able to understand all the variables, theta is for randomly initialized weights for sure, what are xi and xij out side the braces? I believe yi should be actual value of a dependent variable but need confirmation for the same because it could be predicted value also.
Tried looking on the internet, but could not find the explanation of this same equation anywhere.
@gritty jackal Fairly sure it's the actual value, because Theta^T @ X is how you get a prediction.
So (Theta^T @ x^i - y^i) is just the prediction error of this point.
@tidal bough alright that makes sense, but what about x outside the braces? which value is that?
x^i_j is the jth component of the ith input point
and x^i is input values matrix right?
it might be easier to see if you see how this equation is derived
the cost function is:
1/m * Sum(i from 1 to m)[(Theta^T @ X^i - y^i)^2]
Does that make sense so far?
yes
now, we take the derivative with regards to Theta_j
that's a bit tricky, since it's a single component of the Theta row vector
but we can notice that if we were to expand the product:
Theta^T @ X^i
(for any i), we would notice that there's a single term involving Theta_j:
Theta_j * X^i_j
so it's only it that contributes to the derivative.
The derivative of the outer sum is just the sum of derivatives of the things that are summed over.
Each thing is:
(Theta^T @ X^i - y^i)^2
, the derivative of which is:
2 * (Theta^T @ X^i - y^i) * d/d(Theta_j) (Theta^T @ X^i - y^i)
Does that make sense? It's the derivative of a composite function rule, if I remember the name right:
d/dx (f^2 (x)) == 2*f(x)*d/dx (f(x))
That is the chain rule.
yup, understood. @tidal bough Thank you so much for your time and efforts. Appreciated
And then, as I said before, only one of the components of the vector product contributes to the derivative, so that derivative on the right is just X^i_j
And so we obtain the right formula
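Written out in one go, the steps above look like this (same symbols as in the messages: m points, weights theta, the i-th input x^(i) with j-th component x_j^(i), and actual value y^(i)):

```latex
\begin{aligned}
J(\theta) &= \frac{1}{m}\sum_{i=1}^{m}\left(\theta^{T}x^{(i)} - y^{(i)}\right)^{2} \\
\frac{\partial J}{\partial \theta_j}
  &= \frac{1}{m}\sum_{i=1}^{m} 2\left(\theta^{T}x^{(i)} - y^{(i)}\right)
     \frac{\partial}{\partial \theta_j}\left(\theta^{T}x^{(i)} - y^{(i)}\right) \\
  &= \frac{2}{m}\sum_{i=1}^{m}\left(\theta^{T}x^{(i)} - y^{(i)}\right)x_j^{(i)}
\end{aligned}
```

The last step uses the fact that only the term theta_j * x_j^(i) of the product depends on theta_j.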
Hmm, yes I got it now.
If you want to read many pages on this equation, Cosma Shalizi has a very good and free book on advanced data analysis
I have a 5gb csv with 11 columns and over 30M rows. I have to connect it to a MySQL db which I will then connect to AWS (I believe RDS) and then access in Tableau.
I am unfamiliar with MySQL and AWS. I know this is a python channel but can someone please help me set this up? I can pay you
He has another 400 pages on this topic (as if the first 5 chapters of that other book isn't enough) that is easier to read here: http://www.stat.cmu.edu/~cshalizi/TALR/
That's great
Anyone ever created a pandas ExtensionArray and ExtensionDtype
damn this stuff looks hard lmfao
so true lol
Worst is when it keeps jumping up and down like it does with gans
Hey, what's the best way to get a list or dataframe of the counts of a certain field? I've created a DataFrame using value_counts but that value seems to store as the field Key which existed before and the value I want the counts of has no field name now
```python
values[:25]
```
Just shows something like:
Key
AB 3
BC 2
Key is the field of AB, BC, etc
np.unique with return_counts=True will return two arrays of the unique values and the count of each unique value. I don't understand exactly what you want but it might help.
I'm just trying to be able to access the value_counts info as a string
I want the value_counts and the AB/BC
I can use values.iloc[0][0] to get the value, but not the key (AB/BC)
AB and BC are the index of your new dataframe you can access them with values.index[0] for example.
If you want, AB and BC can become a new column of your dataframe by using values.reset_index(inplace=True, drop=False)
then you can have both the value name and value counts as columns of your dataframe which can be accessed with .loc.
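A small sketch of both routes, assuming `values` came from `value_counts` on a column called `Key`. The sample data and the column name `count` are made up.

```python
import pandas as pd

df = pd.DataFrame({"Key": ["AB", "AB", "BC", "AB", "BC"]})
counts = df["Key"].value_counts()

# The category labels live in the index, the counts in the values.
first_key = counts.index[0]
first_count = int(counts.iloc[0])

# Turn the index into a regular column, naming both columns explicitly.
table = counts.rename_axis("Key").reset_index(name="count")
label = f"{table.loc[0, 'Key']}: {table.loc[0, 'count']}"
```

Naming the axis and the value column explicitly avoids depending on the default column names, which differ between pandas versions.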
Still not sure if that's what you're after
@lofty meteor Sorry for late reply
You still there?
Actually, let's move to another channel
Yep
Hello, I have a general question. I can show the code if someone wants details but the question is like, I had a MATLAB program that i converted to python because it was slow in matlab. the run time in matlab is 0.048 seconds per run, the run time in python is 0.008 seconds, now I am looping the program to verify the speed difference but for some reasons when I loop over python it's slower than looping over matlab.
Just to clarify the loop isn't represented in the body of the code at all
like I have for i=1:3000
xx
end
for matlab where i , is not part of the code
and in python I have
for i in range(3000):
    xx
basically I recorded run time for matlab growth and it doesn't grow linearly while python grows linearly. I just want to see if there is a feature used in matlab to like skip some calculations when looping? that i could import to python as it starts faster
in python the time to run the 100th loop is 100 * run time
but in matlab it's just 10x
probably something specific to what you're doing.
but i converted the code from matlab to python code per code
sec
well only thing is I just added the code in matlab to calculate run time
oh dear
I can suggest using a debugger to see what's different on the 100th iteration from the 1st one, for instance
whether the execution time per iteration is constant or not shouldn't depend on the language
ok that sounds great, how do i do that
so something's probably not right
well in python it is constant
but matlab it gets faster
for some reason
and i want to incorporate that
again my python isn't slowing down but time grows linearly
matlab grows below linearly
as shown in time of run vs number of iteration graph i created
...how did you determine that it grows below-linearly?
that looks pretty linear to me
if it was linear time to run 2 iteration = 2 * time to run 1 iteration
also it's clear on the end it's flattening
not if it has a constant term
yep you are right
but the main idea persists that python does first iteration for 0.008 seconds but the 3500th is 3500* 0.008= 28 seconds
while matlab starts with .05 for 1 iteration but does 3500 in 8 seconds
this looks pretty linear for me, again
you can do a linear regression and calculate the R^2 if you want, but it sure looks like a linear function + some noise.
true but that's beside the point
there are some calculations matlab doesn't have to redo
that python does
represented in the constant term
as you said
my point is that matlab is also slow for me in large iterations
and i switched to python because it's faster and it's faster in first 10 iterations
that's not true if there's a constant term.
The right rule is:
(y3-y2)/(x3-x2) == (y2-y1)/(x2-x1)
which should be true for any three points for a linear function.
Anyway, so is the total running time slower in Python or not? I don't quite get what you're concerned about.
it is slower
in python
anyone know how to turn this list into a dictionary?
["Max "Brown Eye" Scherzer, P (2008-)", "", "Season Pitching", "gamesPlayed: 9", "gamesStarted: 9", "groundOuts: 33", "airOuts: 46", "runs: 19", "doubles: 8", "triples: 1", "homeRuns: 6", "strikeOuts: 69", "baseOnBalls: 17", "intentionalWalks: 0", "hits: 50", "hitByPitch: 1", "avg: .255", "atBats: 196", "obp: .315", "slg: .398", "ops: .713", "caughtStealing: 0", "stolenBases: 5", "stolenBasePercentage: 1.000", "groundIntoDoublePlay: 1", "numberOfPitches: 866", "era: 3.40", "inningsPitched: 50.1", "wins: 4", "losses: 2", "saves: 0", "saveOpportunities: 0", "holds: 0", "earnedRuns: 19", "whip: 1.33", "battersFaced: 216", "gamesPitched: 9", "completeGames: 1", "shutouts: 0", "strikes: 560", "strikePercentage: 64.7", "hitBatsmen: 1", "balks: 0", "wildPitches: 3", "pickoffs: 0", "groundOutsToAirouts: 0.72", "winPercentage: .667", "pitchesPerInning: 17.2", "gamesFinished: 0", "strikeoutWalkRatio: 4.06", "strikeoutsPer9Inn: 12.34", "walksPer9Inn: 3.04", "hitsPer9Inn: 8.94", "runsScoredPer9: 5.01", "homeRunsPer9: 1.07", "inheritedRunners: 0", "inheritedRunnersScored: 0", "sacBunts: 0", "sacFlies: 2", "", ""]
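One way to do it, sketched on a shortened copy of the list: split each entry on the first `": "` and skip anything without it (the player header, the section title and the empty strings). The name `stats` is arbitrary and the values are left as strings.

```python
lines = [
    'Max "Brown Eye" Scherzer, P (2008-)',
    "",
    "Season Pitching",
    "gamesPlayed: 9",
    "era: 3.40",
    "avg: .255",
    "",
]

stats = {}
for line in lines:
    key, separator, value = line.partition(": ")
    if separator:               # only entries that look like "key: value"
        stats[key] = value
```

Converting the values with `float()` afterwards is possible, but something like `inningsPitched: 50.1` is baseball notation rather than a decimal, so it may be safer to keep them as strings.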
there is no constant term
this is the run time in python
for 3500
for 1 seconds iteration i will show u
u can see range(1) compared to range (3500)
the ratio is 3500
for matlab most of the time is in the constant term, so the growth is less
do you understand my concern now?
while for matlab ratio is 8 seconds / .05 seconds = 160
growth
for matlab, I'm getting:
0.3966 / 0.2016 * 101 ~= 198.69
so it also is exactly proportional to the number of iterations
so there's nothing weird going on here - they both scale with the number of iterations directly. The Python implementation is just slower in general.
how is the python implementation slower when first iteration for python is 0.008 seconds
while first one with matlab is
like ~0.05
sounds like either one first iteration isn't timed right, or there's some bug that causes all the later iterations to be slower
I'd check the former first.
i just have
start=datetime.now()
before the start of the loop
and
at the end
print( datetime.now()-start)
of the loop
i mean after the loop ends
as you said because there is a constant in matlab
ratio of last iteration to first is (slope*100+constant)/(slope*1+constant), if the constant is high then the ratio is small. i am not saying the slope isn't constant
i am saying the constant is a large component of the run time
while for python it's just directly proportional to the ratios
yeah i fitted regression and the time is
time=0.0017*iteration+.007
while for python it's simply time =0.008*iteration
so matlab grows at a lower rate but yes still linearly
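A small sketch of why the constant term shrinks the last-to-first ratio, using the two fitted lines quoted just above. The function names are hypothetical, and the model is only the straight-line fit, not the real timing data.

```python
def run_time(iterations, slope, constant=0.0):
    """Linear run-time model: slope * iterations + constant."""
    return slope * iterations + constant

def growth_ratio(iterations, slope, constant=0.0):
    """Run time for `iterations` divided by run time for one iteration."""
    return run_time(iterations, slope, constant) / run_time(1, slope, constant)

# Fits quoted above: the python fit has no constant, the matlab fit has one.
python_ratio = growth_ratio(3500, slope=0.008)
matlab_ratio = growth_ratio(3500, slope=0.0017, constant=0.007)
```

With no constant the ratio is just the iteration count; with a constant it is smaller, even though both models are linear.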
so is my solution to try to rewrite the python code independent of the matlab code? i think by force imposing the format of the matlab code i probably am not utilizing all the features of python?
or maybe it has to do with spyder itself? cause some people told me spyder is not fast
what does sequence[:,:-1] return
@lapis sequoia use time.perf_counter for timing
Spyder shouldn't affect the speed of running python code
Python is doing a lot of work in the background, can't speak for matlab but it's probably doing less work as the runtime is more specialized
Eg the "first iteration" in Python also involves implicitly calling iter() which takes some overhead
Not to mention whatever memory allocation and garbage collection is being triggered intermittently
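A rough sketch of timing each iteration separately with `time.perf_counter`, which would show whether later iterations really get slower or whether the first one is just mistimed. `time_iterations` and the dummy workload are placeholders, not the actual finite element code.

```python
from time import perf_counter

def time_iterations(work, n):
    """Call work(i) for i in range(n) and return the duration of each call."""
    durations = []
    for i in range(n):
        start = perf_counter()
        work(i)
        durations.append(perf_counter() - start)
    return durations

# Placeholder workload; swap in one iteration of the real loop body.
durations = time_iterations(lambda i: sum(range(10_000)), 20)
first, last = durations[0], durations[-1]
total = sum(durations)
```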
i think one important factor is in matlab i can clear all variables except the ones i need and maybe that makes it faster?
while with python all variables stay in memory through the run
Matlab probably has a lot of optimizations that python does not have
That isn't likely to make a big difference but it might help trigger garbage collection at more regular intervals
well the thing is everyone who used python for their calculations told me it's faster than matlab
but who knows i guess i have to know how to write a fast python code
rather than convert a matlab code to python
i mean i wrote the code in matlab and tried to convert it line per line and probably that's not the best thing
That I don't know, but in general porting code from one language and runtime to another is not a guarantee that you'll get a fair comparison
You posted both versions above?
yes
well it says matlab is shell but it's just matlab, but because in matlab comment is %
by it i meant the discord server xd
is there a "short version"
this is a lot of code
and what exactly are you concerned about
the matlab code seems faster per iteration than the python code?
e=np.array([2e5]); # Young's MODULUS OF ELASTICITY
g=np.array([1e5]); # MODULUS OF RIGIDTY
den=np.array([7850]); # MASS DENSITY
why are you creating all these length-1 arrays?
because the form
has to accept multiple size
i mean i am coding it for running over various values
of e,g
etc
and what is it with matlab programmers and horrible variable names
would it kill you to write density instead of den?
i dont understand the point of this though
what is igtyp supposed to be
well this is finite element modelling, so igtyp is like whether u have a tower made of same material
or different material
if i have 3 types of material i would have [1 2 3]
and each value would correspond to a type
it looks like you're just trying to broadcast a number into some shape, right?
since igtyp and imtyp are all 1?
yeah but they need not be
basically i am drawing a beam
that beam could be uniform or it could have like increasing cross section
or like different types of metals
and each node represents a segment
right, but practically you're using imtyp to "expand" this density into a matrix of some size
is that right?
yes
em=np.array(e[(imtyp-1).astype(int)])
gm=np.array(g[(imtyp-1).astype(int)])
em will be a 40x1 vector
let's say i had 4 types of element, e would have 4 elements, and i would have em(0)... em(9) equal to e(0)
then em(10) to em(19) equal to e(1) etc
my em would have each element as a sub of the possible building blocks defined in e
same for gm
ok. im not sure about matlab but in numpy you can just do this
DENSITY = 7850.0
...
rho = np.full(imtyp.shape, DENSITY)
or better yet
rho = np.full_like(imtyp, DENSITY)
i needed to use astype int
cuz it kept telling me things like it's floating or tuple or something
i don't remember
thats fine,
```python
rho = np.full_like(imtyp, DENSITY, dtype=int)
```
i keep having to use as type int when i try to index
oh
i see
well you arent using any indexing here
all this is saying is, "make a new array in the same shape as imtyp and fill it with DENSITY"
the actual contents of imtyp are irrelevant
making an array of 1s just to "expand" numbers to arrays isn't necessary at all ever in numpy
no the thing is let's say if imtype was [ 1 1 2 3 1 ] then it should have [ den(1) den(1) den(2) den(3) den(1) ]
would that code actuate that?
den can have as many elements as there are unique elements
of imtype
den has 1 element because in the simple example i am doing
basically this is like me calling each element an ID and associated a set of values to that ID
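To make the ID-to-value idea concrete, here is a sketch comparing a lookup by type ID with a plain fill. The three densities are made-up placeholder values, and `imtyp` is assumed to hold 1-based integer IDs.

```python
import numpy as np

# Hypothetical material table: one density per material type (IDs 1, 2, 3).
den = np.array([7850.0, 2700.0, 8960.0])
imtyp = np.array([1, 1, 2, 3, 1])  # integer dtype, 1-based type IDs

# Lookup: each element gets the density of its own type.
rho_lookup = den[imtyp - 1]

# Fill: every element gets the same value, the contents of imtyp are ignored.
rho_filled = np.full(imtyp.shape, den[0])
```

So a fill only matches the lookup when there is a single material type; with several types the indexing form is the one that maps IDs to values.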
hi folks
Simulate a portfolio of home insurance policies (5,000 homes insured).
The value of damages is distributed according to a Uniform law between $ 250,000 and $ 2.25 million.
An "accident" can occur with probability p. If this is the case, there is a probability q that the damage is the maximum possible (total loss). With probability 1-q, the loss is partial according to a Uniform distribution on (0,1).
You don't know what the liability loss could be, but it can be up to 10 times the value of the property.
what module i need for this?
numpy and maybe scipy @slender nymph
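A rough numpy-only sketch of the property-damage part, under some assumptions: p and q are not given in the problem, so they are plain parameters here; the Uniform(0,1) draw is read as the fraction of the home value that is lost; the liability part is left out because the problem does not specify its distribution. The function name is hypothetical.

```python
import numpy as np

def simulate_portfolio(p, q, n_homes=5000, seed=None):
    """Simulate home values and property losses for one period."""
    rng = np.random.default_rng(seed)
    # Home values: Uniform between $250,000 and $2.25 million.
    values = rng.uniform(250_000, 2_250_000, size=n_homes)
    # Accident happens with probability p.
    accident = rng.random(n_homes) < p
    # Given an accident, total loss with probability q, else a Uniform(0,1) share.
    total_loss = rng.random(n_homes) < q
    fraction = np.where(total_loss, 1.0, rng.uniform(0.0, 1.0, size=n_homes))
    losses = np.where(accident, values * fraction, 0.0)
    return values, losses

# Example call with made-up probabilities.
values, losses = simulate_portfolio(p=0.05, q=0.10, seed=0)
portfolio_loss = losses.sum()
```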
the thing is
em=np.array(e[(imtyp-1).astype(int)])
gm=np.array(g[(imtyp-1).astype(int)])
rho=den[(imtyp-1).astype(int)];
sxi=mi[(igtyp-1).astype(int)];
a=aa[(igtyp-1).astype(int)];
sk=shp[(igtyp-1).astype(int)];
dx=xp[(n[1,0:]-1).astype(int)]-xp[(n[0,0:]-1).astype(int)]
dy=yp[(n[1,0:]-1).astype(int)]-yp[(n[0,0:]-1).astype(int)]
me having to use astype int all the time
@lapis sequoia ok, other than that i don't see anything too strange in your code. although
f[tuple([fdof,0])]=f1
is weird
is it slowing me down and how can i change it?
you can save it as another variable
no i mean how can i make it normally accept indexing
how can i simulate 5k house insurance policies?
without having to write astype int all the time
imtyp_indexer = (imtyp - 1).astype(int)
em = e[imtyp_indexer]
gm = g[imtyp_indexer]
rho = den[imtyp_indexer]
...
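A guess at why astype keeps being needed, as a sketch: this assumes imtyp was built with something like np.ones(...), which gives floats by default, and numpy refuses float arrays as indices. Creating the ID arrays with an integer dtype from the start would avoid the conversion.

```python
import numpy as np

den = np.array([7850.0])

imtyp_float = np.ones(5)             # float64 by default
imtyp_int = np.ones(5, dtype=int)    # integer IDs from the start

# Float arrays are rejected as indices.
try:
    den[imtyp_float - 1]
    float_index_ok = True
except IndexError:
    float_index_ok = False

# Integer arrays index directly, no astype needed.
rho = den[imtyp_int - 1]
```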
btw that tuple is because it refused to accept f[fdof,0]=f1
what is f
oh i see, f=np.zeros((ntdof,1))
btw you can remove the trailing semicolons
python doesn't need them
it looks like you have them in some places but not others
i know but i copy pasted from matlab
so
i mean i copy pasted the matlab code and tried removing some
i guess i can just do replace all
f1 = np.array(100) this is a 0-dimensional array, don't do this
it can act on multiple component of the beam
f1 is the value of f
fdof is which part of the beam does it act on
basically it's like
but is that supposed to be an array too?
import numpy as np
nn=41
ntdof=nn*3
f1 = np.array([100])
fdof = np.array([122-1])
f = np.zeros((ntdof, 1))
f[fdof, 0] = f1
print(f)
so i won't need tuple
np.array(100) is a weird 0-dimensional array thing whose shape is the empty tuple ()
if i do that?
this code could use some variable names
correct
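For reference, a quick look at how np.array(100) differs from np.array([100]). The variable names are just for illustration.

```python
import numpy as np

scalar_like = np.array(100)    # 0-dimensional array
one_element = np.array([100])  # 1-dimensional array with one element

shapes = (scalar_like.shape, one_element.shape)
dims = (scalar_like.ndim, one_element.ndim)
sizes = (scalar_like.size, one_element.size)
```

Both hold a single value, but only the second has an axis to index along.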
@tidal bough i know i already said
it seems to be a plague among matlab programmers
every matlab programmer ive worked with writes code like this
as little whitespace as possible and as short variable names as possible
I suddenly feel the urge to check my old Octave code
because we are engineers first and not experienced with programming practices
@slender nymph is this for school?
@lapis sequoia no its because you have bad role models who also write code like this
yes exactly but majority of engineers are such xD
bad role models at programming
at least the ones i work with
@crisp jewel if sequence is a numpy array, that returns everything in the array except the last column
obviously i can't speak worldwide
...ehh, it was good enough
anyway, other than creating the imtyp_indexer variable i dont see anything really slow about this code
totl=sum(xxl);
glom=np.zeros((ntdof,ntdof));
glost=np.zeros((ntdof,ntdof));
for ie in range(ne):
    est=estif_frame(ndofe,ie,a,sxi,xxl,em,rho,theta);
    for id in range(ndofe):
        for jd in range(ndofe):
            igdof=ndof[id,ie]
            jgdof=ndof[jd,ie]
            glost[igdof.astype(int)-1,jgdof.astype(int)-1]=glost[igdof.astype(int)-1,jgdof.astype(int)-1]+est[id,jd]
im not sure if there is a better way to do this
iterating over arrays is slow
but there might not be a vectorized version
right, but thats changing the algorithm
i use the same values of igdof
which is fine, but not a fair comparison w/ the matlab code
because ndof will be the same my entire loop
sure
of course you can always try JIT compiling this with numba too
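One possible way to drop the two inner loops without numba is `np.add.at`, which adds a whole element matrix into the global matrix in one call. This is only a sketch: it assumes ndof is an integer array of 1-based global DOF numbers with shape (ndofe, ne), and it takes precomputed element matrices instead of calling estif_frame. The function names and the tiny test data are hypothetical.

```python
import numpy as np

def assemble_loop(ndof, element_matrices, ntdof):
    """Reference assembly with explicit loops over element DOFs."""
    glost = np.zeros((ntdof, ntdof))
    ndofe, ne = ndof.shape
    for ie in range(ne):
        est = element_matrices[ie]
        for i in range(ndofe):
            for j in range(ndofe):
                glost[ndof[i, ie] - 1, ndof[j, ie] - 1] += est[i, j]
    return glost

def assemble_add_at(ndof, element_matrices, ntdof):
    """Same assembly, scattering each element matrix with np.add.at."""
    glost = np.zeros((ntdof, ntdof))
    for ie in range(ndof.shape[1]):
        idx = ndof[:, ie] - 1  # zero-based global DOFs of this element
        np.add.at(glost, (idx[:, None], idx[None, :]), element_matrices[ie])
    return glost

# Tiny made-up example: 2 elements, 3 DOFs each, 4 global DOFs.
ndof = np.array([[1, 2], [2, 3], [3, 4]])
element_matrices = [np.arange(9.0).reshape(3, 3), np.ones((3, 3))]
k_loop = assemble_loop(ndof, element_matrices, 4)
k_fast = assemble_add_at(ndof, element_matrices, 4)
```

Whether this is actually faster for the real problem would need to be measured; it just removes the per-entry Python loop.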
i do however recommend you read through PEP 8
!pep 8
C:\Users\hamad\Downloads\Frame_2D_EU123C_new.py:121: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)] instead of arr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)], which will result either in an error or a different result.
btw when i did what u wrote
it told me this
a non-tuple sequence?
when i removed tuple
what did you write
and put []
no
instead
f1= np.array([100]) # load value
f[fdof, 0] = f1
it sounds like you did this
f[[fdof, 0]] = f1
@lapis sequoia with f[fdof, 0] you are indexing with fdof and 0. with f[[fdof, 0]] you are indexing with a single object, [fdof, 0]
numpy tries to be smart and infers that if you write f[[fdof, 0]] you mean f[fdof, 0]
the warning has to do with the fact that the default inference behavior is changing
however i recommend not relying on inference
ah ok
could something completely unrelated to the code cause the speed difference?
like whether spyder is installed in ssd or hdd etc
or the code i am running
spyder? no
i mean... maybe, if it has some kind of debugging features that are slowing down the interpreter
but probably not
#Absolute
@cli.command()
@click.argument('N1', type=int)
@click.option('--num', is_flag=True, help='INTEGER')
def abs(n1):
    """Calculates absolute value."""
    answer = int(abs(n1))
    click.echo('absolute value = {}'.format(answer))
code is above, im trying to make a command so it shows the absolute value but an error says its not valid.. ideas?
i dont think it has to do with my code @desert oar
if I do this
from time import perf_counter
start= perf_counter()
for j in range(1):
    for i in range(j):
        x=1;
end= perf_counter()
delta=end-start
@tame pelican what is the error? it looks like you're missing the num parameter to abs