#data-science-and-ml
oh, so lin reg probably isnt the best for a heteroscedastic dataset
of residuals
uh
errors
sorry im new to all this
y_hat - y
y_hat is predicted y
anyway
just try it out and see how it goes
linear regression is p robust
to assumption violation
I see
i got the score of the model, which was 0.36
i think thats the r^2
is that good?
ok
basically it means that the variance in the x variable explains 36% of variance in the y variable
is that correct?
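To make the "explains 36% of the variance" reading concrete, here is a minimal sketch of how R^2 is computed for a one-variable linear fit. The data is synthetic and every name in it is made up for illustration; it is not the dataset discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.75 * x + rng.normal(size=200)   # made-up noisy linear data

# ordinary least squares fit of y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the fit
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation of y around its mean
r2 = 1.0 - ss_res / ss_tot

# for a one-variable linear fit, R^2 equals the squared correlation of x and y
r = np.corrcoef(x, y)[0, 1]
print(round(r2, 3), round(r ** 2, 3))
```

So an R^2 of 0.36 would mean the fitted line accounts for 36% of the variation of y around its mean.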
This is off topic, but I can't help but think of the meme from chernobyl
3.6 roentgen, not great, not terrible
lol
lol
did you watch it?
i've only seen the meme around
hilarious that your R^2 was .36
yeah same digits lol
anyway, i would actually say that .36 is more on the terrible side
you want like 0.7 or higher
yeah
I would say it depends on the problem
because some are just harder
Hello, is anyone here very familiar with numpy?
just ask.
I am currently wondering about this numpy code and i'm pretty confused about the resulting shape.
>> a = np.ones((784,))
>> b = np.ones((784,1))
>> a.shape
(784,)
>> b.shape
(784, 1)
>> a = a - b
>> a.shape
(784, 784)
this is an example of broadcasting.
simplest example
!e
import numpy as np
a = np.array([1, 2, 3, 4, 5])
print(a - 1)
@velvet thorn :white_check_mark: Your eval job has completed with return code 0.
[0 1 2 3 4]
so, a is an array of shape (5,), but 1 is a scalar
how can you subtract a scalar from an array? you broadcast it - duplicating it across axes
now, scale that concept up.
!e
import numpy as np
a = np.array([[1, 2, 3],
[4, 5, 6]])
b = np.array([5, 10, 15])
print(a - b)
@velvet thorn :white_check_mark: Your eval job has completed with return code 0.
[[ -4  -8 -12]
 [ -1  -5  -9]]
yes, but with the code above I was expecting element-wise subtraction, with (784,) being treated the same as (784,1).
nope
a singleton dimension isn't the same as no dimension.
although I must say that is a bit of an edge case
so it does element-wise subtraction but per axis? broadcasted up?
I think I get it, just not sure how to describe it in text.
yes
I think what you're imagining is for
!e
import numpy as np
a = np.array([1, 2, 3])
b = np.array([[4, 5, 6]])
print(a.shape)
print(b.shape)
print((a - b).shape)
print(a - b)
@velvet thorn :white_check_mark: Your eval job has completed with return code 0.
(3,)
(1, 3)
(1, 3)
[[-3 -3 -3]]
is this what you would expect? @iron basalt
yes
note the shapes
leading dimension is 1 this time
yes
in your original case
the 784 and 1 axes are matched
leading to a 2nd axis of size 784
so in my case it took the last axis from a and matched with the last axis of b, because a only has one axis?
It matches from "right" to "left"?
In your last example 3 matches with 3?
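A short sketch of the right-to-left matching, using the shapes from the original question. The two workarounds at the end are just common options, not the only ones.

```python
import numpy as np

a = np.ones((784,))
b = np.ones((784, 1))

# Shapes are aligned from the right; a missing leading axis counts as size 1:
#   a: (     784)  -> treated as (1, 784)
#   b: (784,   1)
# Every size-1 axis is then stretched to match the other operand.
print((a - b).shape)             # (784, 784)
print(np.broadcast(a, b).shape)  # same shape, computed without doing the subtraction

# To get the element-wise result, make the shapes agree first.
print((a[:, np.newaxis] - b).shape)  # (784, 1)
print((a - b.ravel()).shape)         # (784,)
```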
These days if I hear that a product uses "deep learning and AI" I assume that they either used some off-the-shelf AI solution for something that didn't need it, or the AI that they're using isn't very effective. But maybe that's because I see how much AI doesn't work before it does.
Is this something a lot of people start to feel after they've been working with AI for a while?
yes
@velvet thorn thank you, numpy's broadcasting was something I never really fully learned.
i feel the same tbh 
@serene scaffold Yes
sorry for the crappy paint drawing
but
how would i graph something like this with matplotlib
anyone good with webscraping here?
uhh youre probably looking for the subplot function
I guess i can just wing it with just swapping the axis for the left graph when plotting
hello
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(1, 10, num=10)
>>> x
array([ 1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
>>> y_1 = x
>>> y_2 = 2*x
>>> y_3 = x**2
>>> plt.subplot(1, 2, 1)
<AxesSubplot:>
>>> plt.plot(x, y_1)
[<matplotlib.lines.Line2D object at 0x7fd3513eb1f0>]
>>> plt.title("Left")
Text(0.5, 1.0, 'Left')
>>> plt.subplot(1, 2, 2)
<AxesSubplot:>
>>> plt.plot(x, y_2)
[<matplotlib.lines.Line2D object at 0x7fd3512939a0>]
>>> plt.plot(x, y_3)
[<matplotlib.lines.Line2D object at 0x7fd351293d00>]
>>> plt.title("Right")
Text(0.5, 1.0, 'Right')
>>> plt.tight_layout(4)
<stdin>:1: MatplotlibDeprecationWarning: Passing the pad parameter of tight_layout() positionally is deprecated since Matplotlib 3.3; the parameter will become keyword-only two minor releases later.
>>> plt.show()
Except swap the axes on the left one.
yw! there's a p comprehensive tutorial/explanation in the docs
it might help to read through it
if you're willing to help me
can Python help me analyze soccer matches and predict the outcome?
ahh i see. what do you mean by swapping axes
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(1, 10, num=10)
y_1 = x
y_2 = 2*x
y_3 = x**2
ax = plt.subplot(1, 2, 1)
plt.plot(y_3, x)
ax.invert_xaxis()
plt.title("Left")
plt.subplot(1, 2, 2)
plt.plot(x, y_2)
plt.plot(x, y_3)
plt.title("Right")
plt.tight_layout(pad=4)
plt.show()
@magic summit
thank you very much
could i get some help with mongolite
my classmate on piazza said im suppose to sum over purchaseMethod rather than items
and that the $sum:1 sums over the rows
but my total items in the df is all 0
so im doing something wrong with the total_item sum part of the function
lolll mango lite
so my docker code is in the wrong time zone
how do i change time zone
its in utc rn
does anyone know how to lower loss on a tensorflow model? My loss is really high and then goes to nan.
It depends. You'll have to provide more information on what you're doing exactly
i cant figure out how to change this darn time zone on docker
Especially the details of your model, your learning rate, and which loss function you're using
I figured it out nvm
Ok cool
anybody know some good pip packages for gradcam in pytorch?
https://github.com/vickyliin/gradcam_plus_plus-pytorch this is the only one i could find
but im trying to see if there are other better ones
too much scattering

if i expand figsize, i wonder if this will be fixed 
oh it helped
too many columns for a scatter matrix; better to do it individually 
actually im going to see what tableau can do with this
Trying to load yolov5 weights into pytorch, and gives this error:
The code for loading this in is almost exactly copied from the yolov5 detect.py script https://github.com/ultralytics/yolov5/blob/master/detect.py
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
!git clone https://github.com/ultralytics/yolov5
%cd yolov5
%pip install -qr requirements.txt
import torch
import cv2
from models.experimental import attempt_load
model = attempt_load('/content/last.pt')
img = cv2.imread('/content/test.jpg')
img = torch.from_numpy(img).to('cuda')
img = img.float()
img /= 255.0
if img.ndimension() == 3:
    img = img.unsqueeze(0)
model(img)
i am looking to learn more about data science but I don't know where to begin
I remember watching those sentdex videos but otherwise I don't know much
I'll give it a shot in Linux in a second where I can easily check for folder permissions. I still don't get permissions in windows after using that OS for all my life. Thanks for the heads up!
@topaz oracle https://www.youtube.com/watch?v=3Mm1U1CbzNw&list=PL2zq7klxX5ATMsmyRazei7ZXkP1GHt-vs&ab_channel=KenJee
thanks
my fave DS YTber so far
for sklearn's OrdinalEncoder function, is there a way to reverse the 1's and 0's? to make abnormal coded as 1 and normal coded as 0?
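One way this could be done, as a sketch: `OrdinalEncoder` takes a `categories` argument, and the position of each label in that list becomes its code. The labels and the toy array below are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# made-up single-column input with the two labels from the question
X = np.array([["normal"], ["abnormal"], ["normal"]], dtype=object)

# The position in `categories` becomes the code: normal -> 0, abnormal -> 1.
# Without this argument the labels are sorted, which gives abnormal -> 0.
enc = OrdinalEncoder(categories=[["normal", "abnormal"]])
codes = enc.fit_transform(X)
print(codes.ravel())   # [0. 1. 0.]
```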
is there a way to do a multiple regression of every column in a subset of a dataframe as a function of all the other columns? e.g. if I have a dataframe with columns "age", "pclass", "sex", "embarked", "fare", "sibsp", "parch", I would want to perform multiple regression of age as a function of "pclass", "sex", "embarked", "fare", "sibsp", "parch", pclass as a function of "age", "sex", "embarked", "fare", "sibsp", "parch", and so on...
Could anyone here help us choose a good ML algorithm for our scenario?
Heya!
We're trying to predict the weather/specific statistics using heat maps we took from NASA's website
An example of our dataframe
Y,X,Year,Month,Land_Surface_Temperature Color Index,Land_Surface_Temperature Is Valuable,Vegetation Color Index,Vegetation Is Valuable
28,160,2000,3,-1,False,15.0,True
28,161,2000,3,-1,False,10.0,True
28,162,2000,3,-1,False,10.0,True
28,176,2000,3,-1,False,19.0,True
28,177,2000,3,-1,False,11.0,True
28,178,2000,3,-1,False,15.0,True
28,179,2000,3,-1,False,14.0,True
28,180,2000,3,5,True,16.0,True
28,181,2000,3,-1,False,14.0,True
28,182,2000,3,0,True,19.0,True
28,183,2000,3,0,True,12.0,True
28,184,2000,3,2,True,14.0,True
28,185,2000,3,0,True,11.0,True
28,186,2000,3,-1,False,18.0,True
28,187,2000,3,0,True,15.0,True
28,188,2000,3,-1,False,17.0,True
28,189,2000,3,-1,False,17.0,True
28,190,2000,3,-1,False,15.0,True
28,191,2000,3,-1,False,18.0,True
28,192,2000,3,-1,False,17.0,True
28,193,2000,3,-1,False,21.0,True
28,194,2000,3,-1,False,25.0,True
28,195,2000,3,-1,False,29.0,True
28,196,2000,3,-1,False,35.0,True
28,197,2000,3,-1,False,29.0,True
Basically it contains a row for each entry * month * year (~Feb 2000-Dec2020)
Hmm
We tried running a couple of ML algorithms on the data, mostly linear models and we got really bad accuracy
0.35 was the max
I'm thinking of graph-based methods
We even got a negative value once
I don't think negative accuracy is possible xD
We didn't either
But here it is
And yes, the print statement is okay, we printed other algorithms as well and they returned normal values
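One possible explanation, assuming the number came from the `.score()` method of a scikit-learn regressor: that method reports R^2 rather than accuracy, and R^2 goes negative whenever the predictions are worse than always predicting the mean. A minimal sketch with made-up numbers:

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
bad_model = np.array([4.0, 3.0, 2.0, 1.0])      # made-up, anti-correlated predictions

print(r2(y_true, baseline))   # 0.0
print(r2(y_true, bad_model))  # -3.0
```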
So is each pixel you referred to here one of those entries?
Might be, I could try to send you the code if you'd like
Is it a custom implementation?
Each of those rows is a single pixel in a single month, we also cleaned values which were colorless throughout all of the maps used
Nope, but maybe we didn't call it correctly
Hmm I don't use sklearn that much so idk whether I can help
We'll try to research it, I personally never heard of it
Would we need to change the data structure?
Yeah you probably need an adjacency matrix
to represent your stations as nodes in the graph
We have about 36 hours to turn it in
wait a second
nvm I looked at your data again and it seems that your heat map has a value for each point
(i.e. it is a dense 2d map)
maybe you want to use Conv-LSTM then
BTW, we took each and every pixel from a map like that
https://earthobservatory.nasa.gov/global-maps/MOD_NDVI_M
We calculate the scale index for each of the pixels using the given scale
turn your data into an "image" (according to the x/y values) and store the metadata as channels
And we only leave the colored pixels, so we have about ~12MM lines
Yup, we're short, real short
This sort of problem pretty much requires deep learning
But each of those pixels contain data for multiple map types
It's harder than image classification already xD
We didn't learn it so we can't really use it
Since you've also got the time dimension to worry about
I mean I guess we can but we never tried it so it'll take more time and it might be an issue
Is it prediction per coordinate or per time?
I think they need both
Yup
Idk how you're even supposed to do this using traditional ML methods
And we have multiple maps, so either take a single one at a time or take them all in favor of a more successful prediction
We weren't given that project, we came up with it
That's what the channel dimension of the "image" is for
e.g. in RGB images you have R, G and B maps
just extend this to your scenario
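A rough sketch of that idea, assuming rows laid out as (Y, X, temperature index, vegetation index) like the sample above. The three rows and all variable names are hypothetical.

```python
import numpy as np

# hypothetical rows: Y, X, temperature index, vegetation index
rows = np.array([
    [28, 160, -1.0, 15.0],
    [28, 161, -1.0, 10.0],
    [29, 160,  5.0, 16.0],
])

ys = rows[:, 0].astype(int)
xs = rows[:, 1].astype(int)
y0, x0 = ys.min(), xs.min()
height = ys.max() - y0 + 1
width = xs.max() - x0 + 1

# (height, width, channels); NaN marks pixels with no row in the table
image = np.full((height, width, 2), np.nan)
image[ys - y0, xs - x0, :] = rows[:, 2:]

print(image.shape)   # (2, 2, 2)
```

Building one such array per month and stacking them would give a (time, height, width, channels) array, which is the kind of input a Conv-LSTM layer usually expects.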
Is there a time series?
BTW, we couldn't use MLP since it took way too much RAM, got any idea if DL algorithms will be more lax on it?
As I mentioned above, I would go for training a Conv-LSTM model on your data
Kinda
Convolution is more memory efficient than MLP (Dense)
Amazing
It should be ok as long as you don't use too many timesteps / maps that are too large
My solution would be to use a generative predictive model.
I am not confident that you can get a decent model from all this in 36 hours though
Just make it micro first
Our data is about 1GB, as a CSV
It works for both the time aspect and learns all the correlations.
I am referring to the size of your batch
All good, we're trying our hardest, worst case we won't get it high enough, otherwise we might be able to get a time extension or maybe even just a lucky streak
Isn't the size of the CSV relevant though?
Not really - if you use lazy loading to feed your data into the model, you don't need to fit the entire thing in memory
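A minimal sketch of lazy loading with pandas, assuming the data sits in a CSV. The in-memory string stands in for the real file and the column names are simplified placeholders.

```python
import io
import pandas as pd

# stand-in for a large file on disk; in practice pass a file path instead
csv_text = "Y,X,Year,Month,Vegetation\n" + "".join(
    f"28,{160 + i},2000,3,{10 + i}\n" for i in range(10)
)

running_sum = 0.0
n_rows = 0
# chunksize makes read_csv return an iterator of small DataFrames,
# so only one chunk is held in memory at a time
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    running_sum += chunk["Vegetation"].sum()
    n_rows += len(chunk)

print(n_rows, running_sum / n_rows)
```

The same loop shape works for feeding batches to a model instead of accumulating a sum.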
that's kinda cheating

I assume this is for a course project
True
is it cheating? what if you tell your prof 
If you need speed, then your best bet is dimensionality reduction via sparse methods.
However if you want to keep within the bounds of your course, I think it is worthwhile to show that traditional ML methods are unable to solve such a difficult problem
gl tho. 36 hours 
(Well not really, since the images are so small that you could still fit an MLP in memory to basically brute force it)
assuming you have to turn in a report/presentation too? 
MLPs are fine for MNIST, they get like 97-98% which is pretty much the max since MNIST has mislabelled data.
What is? Using a cloud server for fast computing? Why would it be?
True
And we did some complex calculations for the data so that might get us some points haha
im already stressed on your behalf 
Because you get way more resources than the other groups
to solve your problem
I would just fail with grace and say why it's not really do-able, so you gain something out of it and they do too.
Yeah, but the issue is the code writing and the presentation
No one cares how you run it
ML is not this almighty, can-do-everything thing, no matter how much people may hype it up.
Very much WIP.
Well, if you're only using traditional ML, sure
if this was for my AI class, my prof would be okay with it but thats bc she gave everyone cloud credits to use

But in DL, models can take hours or even days to train
And then there's hyperparameter tuning, so you have to repeat the training process many times
so differences in resources could matter a lot
(Unless you use very modern ML which can run on the CPU due to exploitation of sparse operations, but that is some bleeding edge / very not common place, and needs much more research)
(non-differentiable models are very hard to grasp since all commonly used techniques are out the window)
(no backprop)
But yeah, better to focus on the process than the results @twin moth
So convolution it is?
Thanks mate
Stick to what was taught in the course
You don't have the time for DL methods
Especially since you have not dealt with DL before
Oh, Conv is a DL method?
So just try to do ML, stick with the highest percentage and show that ML is not an option for such a scenario?
Yeah
Also DL is a form of ML, to be precise, so you should refer to them as "traditional ML methods"
IMO it's more like "common traditional ML methods"
And not the improved versions, some of them still have new variants popping up each year.
A "traditional" ML method (just based on time period it was invented) that could actually handle something like this problem would be Adaptive Resonance Theory methods. But very few people know of it.
(And it has many modern variants that drastically improve on the original models)
How difficult would it be to implement it?
The implementation is actually trivial which makes it very elegant, but it would take some reading.
(There are some python implementations on github I think)
(with numpy)
You don't have time for that either though, just stick to the course knowledge.
I'd love some names if you know them off the top of your head
How come?
ART could have its own entire course, and many more for its variants. It builds on a lot of ideas that are much more neuroscience-ish (biologically plausible), which would take you down the rabbit hole of spiking neural models, and much more.
If you're going to do a presentation using out-of-class materials, you're probably going to be asked on them in Q&A
Better to stick to what you actually understand
There is an entire other community within ML that does biologically plausible models that are very much like real neural networks.
(Personally I jumped right into DL so won't be of much assistance in this situation)
You would need to understand DL too though, since DL is based on an idealization of old neuroscience from which you then can learn why the new neuroscience makes more sense and what you can do with it (how to improve upon DL).
If I send you the list of all of the methods we were taught, would you be able to tell me what should be most fitting for our situation?
sure
We'd be taking an ML course next year
you mean DL?
huh
So I guess that we'll learn ML in depth and maybe even DL
Yeah, we took it kinda far lol
TBH this task seems way outside the scope of your course. I have been told by others similar stories in which they get an ML task that is outside the scope of the course.
(unless the entire point is to show that the methods are insufficient)
Yeah, I mentioned that earlier
Again, we came up with this task
Showing that the methods you learned do not work and why should be fine then. If grading were up to me I would give you full credit if you can give me all the reasons why and also show the best results you got.
We were only told to think of a research question and try to answer it using DS
Hi guys! I have this data frame, Im trying to group by season and year how can I do it?
df.groupby('[Season','Date])['25900MS'].mean()
thats not working
got it.... P25900MS.groupby([P25900MS.iloc[:,0].dt.year, P25900MS.iloc[:,2]]).mean().reset_index()
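For reference, a sketch of the same grouping with named columns. The frame is made up; only the column names `Date`, `Season` and `25900MS` come from the messages above, and `Year` is a name chosen here.

```python
import pandas as pd

# made-up frame using the column names mentioned above
df = pd.DataFrame({
    "Date": pd.to_datetime(["2019-01-15", "2019-02-10", "2019-07-01", "2020-01-20"]),
    "Season": ["Winter", "Winter", "Summer", "Winter"],
    "25900MS": [1.0, 3.0, 10.0, 5.0],
})

# the group keys go in one list; the year is derived from the datetime column
yearly = (
    df.groupby([df["Date"].dt.year.rename("Year"), "Season"])["25900MS"]
      .mean()
      .reset_index()
)
print(yearly)
```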
Hi, I want to perform a multiclass classification. I have a very large dataset, and the number of inputs on the machine can easily extend beyond the 1000 inputs. So far I've used scikit's learn API, with the RidgeClassifier, but if I'm not mistaken this method relies on doing a lot of linear algebra, and if the number of inputs can get quite large I presume that the training time will scale up quite a bit. So I was thinking of maybe implementing a NN, maybe just a simple Perceptron, would that be better? Are there any advantages?
How can I run out of memory with 3D CNN (TensorFlow Keras), even with a batch size of 1, when each of my images is only ~14mb in size?
This is both for CPU and GPU.
Model too big a possibility or something else?
Hello guys, I have code written with tensorflow 1.14, and I need to migrate to 2.0, but I can't find the equivalent of tf.contrib in 2.0. Can someone help me?
What does your model look like?
Could you elaborate?
in my code I used tf.contrib but when i upgraded to tensorflow 2.0 and ran my code it says no module named tf.contrib
@hasty grail do you know the equivalent of tf.contrib in tensorflow 2.0 ?
If you can't find the function in tfa (TensorFlow Addons) then I'm afraid you're out of luck
Take a look at the source code of the original function and see if you can implement it yourself
Could you share your experience in writing a new custom layer in Pytorch? I usually create a jupyter notebook and code the layer with dummy input so that I can get instant feedback if I mess up the dimensions. Maybe there is a better way?
thx
np
Who, where should I report this?
URL: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
no idea
The site ahead may contain harmful programs
replace pandas-docs with just docs
!d pandas.concat
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
Concatenate pandas objects along a particular axis with optional set logic along the other axes.
Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.
Parameters
**objs**: a sequence or mapping of Series or DataFrame objects. If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
**axis**: {0/'index', 1/'columns'}, default 0. The axis to concatenate along.
**join**: {'inner', 'outer'}, default 'outer'. How to handle indexes on other axis (or axes).... [read more](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat)
hmm
i think pandas site got hijacked actually
cus even old links that ive visited have that message
np
@austere swift @hasty grail
https://github.com/pandas-dev/pandas/issues/39862
Nice
are you sure you don't have some weird plugin or something?
does anyone know how to change the .jupyter directory location to somewhere else in linux
you mean change the software's directory or change the open folder location? :/
@viscid dagger
ow yeah, you could make a shortcut to it
the .jupyter directory in the home folder
yeah i figured
no i actually wanna i move it to another directory
cause my home directory is so cluttered
no i can't really help you with that, maybe someone else could, never had that desire so never faced that issue
Hey does someone know if it's possible to specify dtypes when writing a pandas dataframe to a feather file? I'm trying this because feather infers the wrong dtype for one of my columns, later resulting in an error. I only found this on stackoverflow which does not answer the question: https://stackoverflow.com/questions/41439564/is-it-possible-to-specify-column-types-when-saving-a-pandas-dataframe-to-feather
lol okay chill
ow didn't mean to sound rude :/
lol no it's chill
actually i found it
export JUPYTER_CONFIG_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/jupyter"
i have to set this env variable that jupyter uses
but thanks anyway for trying to help me out
@lapis sequoia
yeah sorry, but never faced this issue hence
i also have this when i tried to look up something on pandas
☹️
do you guys have vpns or something? or wtf is wrong with my computer?
See link to issue above
i tried on safari, chrome
If you go to pandas.pydata.org and make your way to the docs there is no warning.
oh missed that. thanks
is there a way to do a multiple regression of every column in a subset of a dataframe as a function of all the other columns? e.g. if I have a dataframe with columns "age", "pclass", "sex", "embarked", "fare", "sibsp", "parch", I would want to perform multiple regression of age as a function of "pclass", "sex", "embarked", "fare", "sibsp", "parch". pclass as a function of "age", "sex", "embarked", "fare", "sibsp", "parch", and so on...
are you having issue with https://pandas.pydata.org?
I don't
Could you explain your issues more simply?
I have several variables (some that take on values from 1-10000, some that only take on 1 and 0), and want to run multiple regression on each of these variables with values in the columns of a dataframe to find correlations between each variable
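One way to loop over the columns, sketched with plain least squares from NumPy. The data is synthetic, only a few of the column names from the question are used, and categorical columns such as "sex" or "embarked" would need to be encoded as numbers first.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# made-up numeric data standing in for the real dataframe
df = pd.DataFrame({
    "age": rng.normal(30, 10, 100),
    "fare": rng.normal(50, 20, 100),
    "sibsp": rng.integers(0, 4, 100).astype(float),
})

coefs = {}
for target in df.columns:
    predictors = df.columns.drop(target)
    # design matrix: intercept column plus every other column
    X = np.column_stack([np.ones(len(df)), df[predictors].to_numpy()])
    y = df[target].to_numpy()
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    coefs[target] = dict(zip(["intercept", *predictors], beta))

for target, terms in coefs.items():
    print(target, terms)
```

If the goal is only pairwise correlations rather than full regressions, `df.corr()` would be a simpler starting point.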
@grave frost
pretty sure pandas.plot has parameters you can insert
X and Y
might be what youre looking for
oh yeah you have to navigate from home page instead https://pandas.pydata.org/
ok
so many tabs 
what would plot do in this scenario? isn't it just for plotting?
oh no... this is my window with the least amount of tabs
lol
so I did quantile regression two different ways, I'm confused why they are different
one is a straight line and the other is bumpy (to state the obvious lol)
former is gradient boosting regressor from sklearn with loss=quantile, and the latter is quantreg from statsmodels
Idk why they are different
Maybe its because one of them considers the data as a time series interpretation? That would explain why it's wavy opposed to just a regular linear regression
I honestly don't know

Hey guys, does anyone here know Grafana well?
# Use k-fold cross validation to create two lists, train and holdout, which hold the indices of the rows of X
# used for training and holdout (validation) at each iteration (fold of the cross validator)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

Cvals = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6]
k_fold = KFold(n_splits=5)
results_l2 = []
for C in Cvals:
    # instantiate a logistic regression with L2 penalty and the proper C value for this iteration of the loop
    model = LogisticRegression(penalty='l2', C=C)
    # collect the predicted y values and true y values of each hold out set
    predicteds = []
    trueys = []
    train = []
    holdout = []  # WTF ARE THESE TWO LISTS FOR?
    for train, holdout in k_fold.split(X):  # I ONLY HAD TO ADD THIS LINE
        model.fit(X[train], y[train])
        predicteds.append(model.predict(X[holdout]))
        trueys.append(y[holdout])
Can someone help me with this please?
idk if the for loop is OK
Hey anyone has a clue how i can select string index in pivot table?
^ pandas
i wanna do a project
When you say number of inputs, do you mean the dimensionality of the input, the number of samples, or are you working with a time series?
I mean the number of neurons, or the number of parameters of the perceptron, which I believe to be the dimensionality of the input as you put it
To confirm, the dimensionality of the input in case of image input would be the dimensions of the image multiplied together (e.g. a grayscale image that is 32x32 pixels would be 32 rows * 32 columns * 1 channels = dimensionality of 1024).
@keen root
Is that what you mean?
(Not that you have an image as input, but something like that)
Hi yall. I recently got into machine learning and AI. Can yall give me some interesting projects to try to finish without watching any tutorials?
I wanna test my skills and see how much can I do alone
Classify MNIST. Note it has mislabelled data so don't pull your hair out over not getting 100% accuracy.
is that hand written digits?
yes
I already watched a tutorial on that one hahah
I might try to redo it by myself tho
See how good is my memory
Then try something a bit harder, fashion-MNIST.
So basically I tried it and it overfits
while training, the acc was 90 and after in testing it was 30
how do I know what to change in my code
Yes, that's it
(Sorry for the delay)
Research / learn about regularization.
will do
@keen root Yeah so 1000 inputs is actually not that bad, try just throwing an MLP at the problem, scikit has one: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
If that does not work, try using a dimensionality reduction algorithm and then feeding that into an MLP.
That's pretty awesome, I didn't know that existed in scikit
The key thing it does for you is called softmax in case you want to learn more about it just search for softmax.
That's the generalization of the logistic curve, right?
generalization of logistic regression yes
Ok ok, awesome. Thank you. I'll give it a go then
generally if you have a multi-class prediction problem it's the go to
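A small sketch of softmax in NumPy, including the two-class case where it reduces to the logistic function. The scores are made up.

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result is unchanged
    e = np.exp(z - np.max(z))
    return e / e.sum()

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

scores = np.array([2.0, 1.0, 0.1])   # made-up class scores
probs = softmax(scores)
print(probs, probs.sum())            # probabilities that sum to 1

# with two classes, softmax equals the logistic function of the score difference
two = softmax(np.array([1.5, 0.0]))
print(two[0], sigmoid(1.5))
```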
One quick question
So basically I decided to see how did other people make their fashion_mnist code, changed mine to be like theirs and it still overfits
Yup, at that point you gotta try harder. No nice out the box solution.
ML is kind of open-ended, lots of room for improvement.
*very
Just see what seems to work and what does not, and after that you have to get creative.
(Try to come up with reasons why it works or does not and then test those ideas / science)
anyone know how to find a hidden option on target like iuts oos rn and i cant find the websites html for the add to cart button
The numbers MNIST can technically get 100% accuracy on a 20:80 test-train split
There is already a configuration of convolution and pooling that had achieved a 100% CCR
Make support vector machine [SVM] to classify between dogs and cats (aim for a CCR above 80%)
Oh that's interesting. I did that but with CNN using a tutorial
I might try it
Yeah the idea is to use the CNN to feature extract, but instead of using an MLP to classify use an SVM
Not sure if this is widespread or has been asked a lot lately, but is chrome saying the pandas documentation site is "dangerous" for you guys?
...am I safe to still view the pandas docs?
There's an issue for it:
https://github.com/pandas-dev/pandas/issues/39862
I was just reading that, no real insight in the thread tho
lol, yeah, the pandas devs are really confused
good thing I neither use Chrome nor read documentation 🥴
New question
Regarding machine learning and AI i was wondering where is whats being learnt is saved/stored? Apparently MachineL can do a linear distribution but i don't get how the machine is learning anything, or with Test/train because nothing is being remembered, the program is just running an approximation on some data... Or with a Chess AI how does the program remember and train against itself? where would each trial be stored and in what format?
I made a post a few weeks ago of an AI playing the chrome Dino, now this is the frogger
guys and girls, if I have a set of input parameters and I want to minimize one of themm, how do i go about making an objective function? what do people do to find a relationship between my variables?
I would like to do a lot of realtime high-speed data analysis. One of my analysis techniques will probably use either fixed-length time series where old data is dropped off and replaced with new data or time series that grow as new data arrives. Will I be hindered by using Pandas? Should I try to focus exclusively on Numpy? Maybe focus on something else entirely? The new data will (at the start of the analysis) be entirely on a SQL server. New data will also arrive to the SQL server periodically.
I'm pretty new to Python. Just want to make sure that I'm practicing with the approaches, techniques, and packages that will serve me best over the long-run
When you say speed do you mean throughput or latency? @digital crescent
I don't think I know the difference well enough. I would like to be able to pull the first set of data + any smaller "new" data quickly from SQL, update my dataset in Python quickly, and run a new updated analysis as fast as possible. So every step in the process is important to me in terms of speed
I've heard a bit in terms of certain packages used to interact with SQL servers being faster/slower, but I guess I was mostly concerned with me inefficiently manipulating the data during the analysis part
throughput = how much gets done in a period of time
latency = how long it takes for one unit of work to get done
Latency would be the time from new data entering the system (before it even gets to the database) to the output of the analysis being updated. Throughput would be how many points in the time series you can process per unit time.
so from this it sounds like you want low latency at least
Optimizing for one is very different than the other.
If we go by Squiggle's definition, I'd say I'm mostly concerned with the speed of analysis
whether you also need high throughput will depend on how much input is coming in at any one time
I think in general it will take much more time for me to analyze the data than to move it from SQL to Python
Thousands of new data points per second potentially but the analysis could involve as many computations as I wanted (so thousands, tens of thousands, hundreds of thousands, millions)
So you want low latency? You do not have it backing up in terms of how many points it can process? By that I mean does the database get more points than it gives to the analysis system (think like how much water is flowing into a container versus flowing out).
in particular, Spark streaming
A mixture of things. Regressions, pattern recognition, random stuff with probability distributions
The database will have more data points than I will be using, yes
are you willing to spend $$?
But I would like the flexibility to basically decide to stop looking at one portion of the dataset and look at a completely different portion
Probably not, no
The SQL server will be on my computer, and I will be doing the analysis mostly in Python on my computer as well
(Just to clarify that the "server" isn't like a separate set of hardware)
of course it depends on your exact requirements and it's defo possible to run analyses locally without incurring additional costs
but @ some point you might need to scale up
hard to say without knowing numbers
Then without spending anything I think a solution may be to have the data points go directly to the analysis system to reduce the latency of having to go through a database system. However, at the same time the data points are also sent to the database to be stored for later.
thousands per second is not impossible
but it would require some configuration and data engineering, at least
oh on the same system
I guess my main concern is just the Python tools I should use for analysis provided that the data I'm analyzing is constantly changing (i.e. changing in size, looking at a completely different set of data points, stuff like that)
And I see stuff like with Pandas that say that adding extra rows is super slow
And it makes me wonder about other concerns I should have
adding rows is slow (relatively) because you need to reallocate the backing array
pandas isn't really meant for data that changes
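For the fixed-length case (old data dropped, new data added), one way around the reallocation cost is a ring buffer over a numpy array that is allocated once. This is only a rough sketch: `RingBuffer` is a made-up helper name, not something numpy or pandas provides.

```python
import numpy as np


class RingBuffer:
    """Fixed-length window over a stream; hypothetical helper, not a numpy or pandas API."""

    def __init__(self, capacity):
        self._data = np.empty(capacity, dtype=float)  # allocated once, never resized
        self._capacity = capacity
        self._next = 0    # slot the next value is written to
        self._count = 0   # number of valid values stored so far

    def append(self, value):
        # Overwrite the oldest slot instead of growing the array.
        self._data[self._next] = value
        self._next = (self._next + 1) % self._capacity
        self._count = min(self._count + 1, self._capacity)

    def values(self):
        """Return the stored values in arrival order (oldest first) as a new array."""
        if self._count < self._capacity:
            return self._data[:self._count].copy()
        return np.concatenate((self._data[self._next:], self._data[:self._next]))


buf = RingBuffer(capacity=5)
for x in range(8):          # push 0..7 into a window of length 5
    buf.append(x)
window = buf.values()       # the five most recent values: 3, 4, 5, 6, 7
```

Appending never reallocates; only `values()` copies, and only when the analysis actually needs the window in order.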
And which preferred approaches I should be considering for realtime analysis
which is why I said
look into Spark streaming
which adds a lot of overhead
but shrugs
is a bit heavyweight for local usage, too
I mean
you could defo build abstractions around numpy that would allow you to do this kind of thing but
Yeah, this is starting to sound like a database question, which could be asked in the databases section. There are database systems that are designed for quickly adding new types of data and such, but I am not exactly an expert on them.
so
this goes back to the kind of analyses you need to perform
but yeah, probably Spark.
Huh. I didn't think this was really a databases question primarily, but then again I don't know much about these kinds of things
it's a data engineering problem, specifically.
you're basically asking "how do I construct an ETL pipeline -> data warehouse that will meet my needs?"
Does ETL include an analysis step?
possibly, as part of the T step, but
depends on the complexity of the analysis
ideally that would come after
I'm more concerned with the speed of the analysis part than the speed of the "ETL" part
when you say "computations" you mean "data points"?
the reason pandas is fast (relatively speaking) is that it holds the entire dataset in memory.
No, I meant the number of computations required to perform the analysis. Basically you can take 10000 points and look at them a million different ways
so how many points @ any one time?
But if I want to add 100 or replace 100 and look at it differently, I feel like I could run into problems with Pandas
and how often will the subset of data to be looked at change?
relative to the number of analyses being run
The first step is to get upper bounds on things. You can't do as many computations as you want, computation is a finite resource.
Could be 500 - 10000, I suspect
Even if those upper bounds are massive
this is very manageable
It will be a balancing act. I would like to produce updated analyses as fast as possible (ideally within a few seconds or a fraction of a second), so I'm aware that I won't be able to do the best analysis nonstop. But say I've got 5000 data points and add/replace 300. I would like to run some regressions or do some pattern recognition or generate some new probability distributions as fast as possible
But it will be constantly running. And the faster I can analyze, the better analysis I can do if the goal is to produce an updated analysis on a rolling, realtime, and almost infinite basis
a million computations is a lot
if it's anything complex like regression analysis, you don't have enough compute for that
nowhere near
I'm trying to think in my head how many computations would be required to do a simple linear regression with 10,000 coordinate pairs
Probably a lot
do you mean like primitive computations...?
Don't imagine, just try it out, test it.
Might not take a lot of memory though
I mean, I can almost count them in my head
Just make sure you are measuring it correctly.
Yeah. I was abstracting it into the stuff I would do on paper which obviously isn't the same as what goes on in a computer
But again, I feel like this is somewhat beside the point, right?
A simple timing can tell a lot
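A minimal way to do that timing for the 10,000-pair regression, assuming numpy is installed. The data is synthetic and `np.polyfit` stands in for whatever regression ends up being used, so treat the number it prints as a rough guide only.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)          # fixed seed so runs are comparable
x = rng.normal(size=10_000)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=10_000)   # synthetic line plus noise


def fit():
    # Degree-1 least-squares fit; returns [slope, intercept].
    return np.polyfit(x, y, 1)


runs = 100
seconds_per_fit = timeit.timeit(fit, number=runs) / runs
slope, intercept = fit()
print(f"{seconds_per_fit * 1e3:.3f} ms per fit, slope={slope:.2f}, intercept={intercept:.2f}")
```

Averaging over many runs smooths out one-off noise, which is part of "measuring it correctly".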
Regardless of the analysis, if the analysis is taking up the bulk of the time (and not the data-fetching part), is there a generally preferred way to handle the data and intermediate calculations in Python? Or is it really not that simple? Like if I give you 10,000 data points and tell you that every so often the analysis will randomly be performed on a somewhat differently sized database and sometimes the analysis itself will be slightly different, what tools do you use to run the analysis? Not Pandas? Yes to Pandas? Only Numpy? Python Lists?
for numeric data, never lists.
it depends on the analysis.
but numpy is generally faster
I guess I'm just worried that Pandas seems almost useless if speed is at all an issue if you decide to add some data to your existing dataset
pandas can provide better abstractions though
pandas is backed by numpy arrays.
It's not that simple, but like gm wrote, there are definitely some things NOT to do. Python itself is pretty slow, so all the speed must come from the C libraries.
they have the exact same problem.
Is there even an efficient way to handle data of a changing size or is that kind of a problem that can't be solved?
there is
but
that is not necessarily the correct question
I've just read that Pandas is like Numpy + overhead and is slightly slower. I have no idea if that's relevant to me though
"how often will the dataset size change?"
like let's say
running analyses takes 15s
then you change once
and that takes 0.5s
Gotcha. Thanks. You guys have given me some stuff to consider
It is almost like I should just spend more time thinking about ways to efficiently structure stuff with the tools I have rather than look for a tool that magically solves these problems
No library can magically overcome the limitations set by the hardware itself. In general, the less you know up front (which types of data you will have, how much, etc), the slower the solution will be, but with the trade-off of hack-ability / extensibility.
Here is an example of what I mean (not necessarily one that applies to my project but I think it is in the same line of thought): Imagine your dataset for analysis will be anywhere from 5000 - 6000 rows. Maybe you could just make a 6000-row Pandas table and fill in the ones you don't need with zero or something like that. Or track the rows that aren't being used. And then have some kind of flag to ignore the portion of the vector calculations done on the unneeded rows
Something like that
you could
or you could also use a numpy masked array
which I think might be a more appropriate abstraction, BUT
Not saying this is the best way to do things. But it is what I mean in terms of just coming up with better solutions rather than looking for better tools
shrugs
okay so you see
this will boil down to performing a filtering operation at the start of every set of analyses, probably
because you want to retrieve the subset that you'll use
That is a technique known as pre-allocation, it works (faster and makes the code more simple), but you will be using up more memory.
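A small sketch of that idea combined with the masked array suggestion, using the 5000 to 6000 row example from above. The array names and the fill pattern are made up for illustration.

```python
import numpy as np

CAPACITY = 6000                           # upper bound from the example above
values = np.zeros(CAPACITY)               # allocated once
in_use = np.zeros(CAPACITY, dtype=bool)   # True where a row holds real data

# Pretend 5000 rows arrived; the remaining 1000 slots stay unused.
values[:5000] = np.arange(5000, dtype=float)
in_use[:5000] = True

# Masked arrays treat mask=True as "ignore", so invert the in-use flags.
active = np.ma.masked_array(values, mask=~in_use)
mean_active = active.mean()               # averages only the 5000 real rows
n_active = active.count()                 # number of unmasked entries
```

Adding or retiring rows then means flipping flags and overwriting slots, not resizing anything. The cost is the memory for the unused slots.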
Just it doesn't seem like this stuff is written out anywhere. Or at least I don't see a good guide saying "this is how you should do x/y/z analysis in Python if you want to do it quickly"
Which is fine
I just wanted to make sure that I wasn't missing anything obvious
I.e. If it was as simple as like "oh, realtime data analysis with changing amounts of data? Do/don't use pandas"
Or do/don't use numpy
But I think you guys have gotten me somewhere
So thanks
yw
Is sql ever going to become obsolete due to libraries in languages or is learning it valuable?
python implements an sql package
no idea whether it'll ever become obsolete
probably, in a hundred years+?
SQL might, but relational databases probably not (speculation). It will probably stick around for a long time in case anyone needs to manually query things.
but the fact that ORMs can abstract away the need to know raw SQL is not in itself a reason not to learn SQL
...why do you ask
question
what does it mean when people say SQL Server and SQL Developer have their own database? Aren't we, the users, creating the database? what does it mean to come with one?
im using database synonymously with tabular data
```py
tag = self.soup.body.find('div', class_='fulfillment-add-to-cart-button')
if tag and 'add to cart' in tag.text.lower():
    self.alert_subject = alert_subject
    self.alert_content = f'{alert_content.strip()}\n{self.url}'
```
anyone see anything wrong with this?
its saying its in stock but its not idk if the code is messed up or sum
im not making a auto buy bot btw

SQL is going to stick around for a long time for the same reasons that Java is sticking around. There are better languages for the job, but so many businesses already use it that they'd never think of not using it #Mongo
That being said, SQL definitely does have times where it shines and if you are looking to get into it with python, try SQLite3
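For anyone who wants to try it, the standard library's `sqlite3` module needs no server at all. A minimal sketch with a made-up `readings` table, which also shows the "only fetch rows newer than what I already have" pattern from the earlier realtime discussion:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO readings (value) VALUES (?)",
    [(1.5,), (2.5,), (4.0,)],
)
conn.commit()

# Parameterized query: fetch only rows newer than the last id already seen.
last_seen_id = 1
new_rows = conn.execute(
    "SELECT id, value FROM readings WHERE id > ? ORDER BY id",
    (last_seen_id,),
).fetchall()
conn.close()
```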
@misty flint spill the beans
@misty flint ppl cant better themselves with inside jokes
Hello everyone,
I am having an issue with RollingOLS from statsmodels .
'''
mod = RollingOLS(Y, X, window=75, min_nobs=None,expanding=True)
fit=mod.fit()
'''
When i want to get the AIC
i get a list of multiple values
and i think my X and Y are in the wrong format
since i have X and Y two lists
of numbers
How should i proceed ?
oh that was just me being hopeful

so why people used Azure at first and now that theyve upped their services, people are using it for reals?

Can I see how you created the X and Y arrays?
lol, I mean just as Fortran and COBOL are still around for banks, SQL is here to stay for a while @misty flint
Both are numpy array
1D
containing numbers
np.size(X) give 529
and np.size(Y) give 529 too
@odd aspen Do i have to use a pandas DataFrame ?
shape gives (529,)
How would i put the labels for the bars in a matplotlib bar graph above the bars themselves
something like this with the "text field" being the label
I use altair for statistical plot, the API is easier to remember
Woah, dont tell me this is made in python ....
I want to specify in pandas dataframe timezone as CDT IST EDT etc , instead of Region/Country, is there a way to do that?? All examples I came across specify timezone as Region/Country Ex - "Africa/Douala"
I don't know I just found that on the internet
It looks like it is made in excel lol
The label on top, this example will do? https://matplotlib.org/stable/gallery/lines_bars_and_markers/barchart.html#sphx-glr-gallery-lines-bars-and-markers-barchart-py
A standard 3 layer CNN.
not quite sure where to put this
from MTM import matchTemplates
import cv2
r10 = 'out2.png'
lt = [('small', r10)]
image = 'rank101.png'
Hits = matchTemplates(lt, image, score_threshold=0.5, method=cv2.TM_CCOEFF_NORMED, maxOverlap=0)
MTM is a library for template detection, so i don't have to fuss about with more code than i have to
but, it keeps outputting AttributeError: 'str' object has no attribute 'dtype' at line 7
For the example, they use coins from skimage.data (from skimage.data import coins)... how do i make my image have a dtype
You misunderstand the API. The matchTemplates function wants numpy arrays not filepaths to image files. image should be a numpy array, as should r10. The reason the error says that it can't find dtype is because you gave it a string instead of a numpy array, strings don't have dtypes.
Basically it wants the actual image data, not a string telling it where to find the image data.
Could you print the summary?
Don't know if this is the right place but can someone tell me why this is throwing up an error/what I can do to fix it
btw why are you using 3d cnn when you want to process images?
```py
x = 0
index = ufos[ufos['country'] == 'us']
for value in index['datetime']:
    if ("24:" in value):
        value.replace('24:00', '00:00')
        index.iloc[x,0] = value
    x+=1
```
there we go
pd.to_datetime doesn't like when values are as 24:00 but im having trouble reassigning the values back into the dataframe once i change 24:00 to 00:00
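One thing that bites in a loop like the one above: strings are immutable, so `value.replace(...)` returns a new string and the result has to be assigned somewhere. A vectorized sketch that avoids the loop entirely; the rows and the month/day/year hour:minute layout are assumptions, since the real file is not shown here.

```python
import pandas as pd

# Made-up rows in an assumed "month/day/year hour:minute" layout.
ufos = pd.DataFrame({
    "datetime": ["10/10/1949 20:30", "10/10/1955 24:00", "12/12/2000 24:00"],
    "country": ["us", "us", "gb"],
})

# .copy() makes an independent frame, so later assignments are unambiguous.
us = ufos[ufos["country"] == "us"].copy()

# Vectorized string replace; str.replace returns a new Series, so assign it back.
us["datetime"] = us["datetime"].str.replace("24:00", "00:00", regex=False)
us["datetime"] = pd.to_datetime(us["datetime"], format="%m/%d/%Y %H:%M")
us["year"] = us["datetime"].dt.year
```

Note that turning 24:00 into 00:00 keeps the same calendar date and does not roll over to the next day. That is harmless if only the year matters, except for a 24:00 on December 31.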
I'm processing 3D volumetric images.
oh okay
Hang on, I'll get the model summary!
```
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv3d (Conv3D) (None, 222, 126, 222, 32) 896
_________________________________________________________________
activation (Activation) (None, 222, 126, 222, 32) 0
_________________________________________________________________
max_pooling3d (MaxPooling3D) (None, 111, 63, 111, 32) 0
_________________________________________________________________
conv3d_1 (Conv3D) (None, 109, 61, 109, 32) 27680
_________________________________________________________________
activation_1 (Activation) (None, 109, 61, 109, 32) 0
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 54, 30, 54, 32) 0
_________________________________________________________________
flatten (Flatten) (None, 2799360) 0
_________________________________________________________________
dense (Dense) (None, 32) 89579552
_________________________________________________________________
activation_2 (Activation) (None, 32) 0
_________________________________________________________________
dense_1 (Dense) (None, 1) 33
_________________________________________________________________
activation_3 (Activation) (None, 1) 0
=================================================================
Total params: 89,608,161
Trainable params: 89,608,161
Non-trainable params: 0
_________________________________________________________________```
And even with a batch size of 1, I get this OOM error during the first epoch:
yeah that definitely looks too big
200 cubed is huge
What can I do to avoid this? I mean I have to process the images somehow.
Does it have to do with how I am shaping my data?
it basically means that your data is too large
what is the kernel size of each CNN layer?
yes
because when you input it into a CNN layer
inside the convolution operation, your 3d image is multiplied by each value in the kernel, resulting in a total size = (size of image) x (size of kernel)
kernel size, not number of channels
Sorry - (3,3,3)
ok that's as small as it's going to get
Yeah
so yeah, probably your input is too large
Okay so how do I make this work now basically?
you'll need to downsample it before the CNN layers
I have 38 images, each with a dimension of (766, 200, 760)
After creating X and resizing, the shape of X is (39, 224, 128, 224)
Then I add one more channel to make it a 5D tensor fit for Conv3D so it becomes (39, 224, 128, 224, 1)
you'll have to resize your images to be even smaller
Hmm, perhaps (128, 64, 128) will do the trick?
try it
There is an example of volumetric MRI image classification on Keras
They are using the same size without any problems
Yeah
Yes, seems to work now!
Need to increase the accuracy but that's another issue
Great, so the cubed size was far too large
yeah, by halving each dimension you're now using 1/8th of the original memory
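The numbers in the model summary can be reproduced with plain arithmetic, which makes the effect of shrinking the input easy to check. This is a back-of-the-envelope sketch: the 2x2x2 pooling and unpadded 3x3x3 convolutions are inferred from the shapes in the summary, and float32 storage is an assumption.

```python
def conv_out(shape):
    # 3x3x3 kernel, stride 1, no padding: each spatial dimension shrinks by 2.
    return tuple(d - 2 for d in shape)


def pool_out(shape):
    # 2x2x2 max pooling halves each dimension, rounding down.
    return tuple(d // 2 for d in shape)


def flattened_size(input_shape, filters=32):
    # conv -> pool -> conv -> pool -> flatten, as in the summary above.
    d, h, w = pool_out(conv_out(pool_out(conv_out(input_shape))))
    return d * h * w * filters


def first_conv_bytes(input_shape, filters=32, bytes_per_value=4):
    # Size of the first Conv3D output for one sample, assuming float32.
    d, h, w = conv_out(input_shape)
    return d * h * w * filters * bytes_per_value


big = (224, 128, 224)
small = (128, 64, 128)

flat_big = flattened_size(big)               # 2_799_360, matches the Flatten row
dense_params_big = flat_big * 32 + 32        # 89_579_552, matches the Dense row
flat_small = flattened_size(small)           # 403_200
mib_big = first_conv_bytes(big) / 2**20      # about 758 MiB per sample
mib_small = first_conv_bytes(small) / 2**20  # about 120 MiB per sample
```

So the first convolution output alone is roughly 758 MiB per sample at the larger size, before counting the other layers or anything kept around for backprop. Going from (224, 128, 224) to (128, 64, 128) is not an exact halving, which is why the ratio comes out nearer 6 to 7 than exactly 8.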
can somebody help with sklearn
Any Idea why this may be happening (The red letters is the heuristic algorithm used to expand Manhattan distance as cost, Misplaced Tiles as cost, and BFS as cost)
U can see that 3rd last for manhattan and the 4th last for tiles has a drop in nodes expanded
whereas it keeps going up for BFS
@bold olive how thick is your FC layer?
...how many neurons does your Dense layer have?
man that was incoherent
how thick xd lol i like that
it's not even correct
it should be "how wide"
🥴
Ah
"thick" is alright with me 
64
I have a dataset that contains names of natural reservoirs. I've also got a few cols about Area, but they don't seem useful for my situation since I need longitude and latitude... I found this website and filled in the names of a few reservoirs, and it seems to return the right longitude and latitude. Now since the dataset is quite big, I obviously don't want to do this manually. Could someone give me some help with how I can make a Python script that just requests this inside the script for each name in the dataset?
A handy tool to get lat long from address, helps you to convert address to coordinates (latitude longitude) on map, also calculates the gps coordinates.
Can I get a seed from a picture's pixel colors? I mean, if you generate 256x256 random color values and you can do it with a seed, can you reverse it? You input a 256x256 image and get the seed?
@hexed parrot numpy has a seed https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.seed.html
why don't you just generate an array that way?
but can you recover the seed with the list of generated numbers?
or there is no way?
Off the top of my head, you could maybe brute-force it if a dedicated function to do that is not available.
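A toy version of the brute-force idea. It only works when you already know the exact generator and recipe that produced the pixels and the seed lies in a range small enough to search; an arbitrary photo almost certainly has no matching seed. `make_image` is a made-up recipe, and it uses `np.random.default_rng` in place of the older `np.random.seed` interface from the link.

```python
import numpy as np


def make_image(seed, shape=(8, 8, 3)):
    # Hypothetical generator: one fixed recipe that turns a seed into pixel values.
    rng = np.random.default_rng(seed)
    return rng.integers(0, 256, size=shape, dtype=np.uint8)


def recover_seed(image, max_seed):
    # Try every candidate seed and compare the regenerated pixels with the target.
    for seed in range(max_seed):
        if np.array_equal(make_image(seed, image.shape), image):
            return seed
    return None  # not found in the searched range


target = make_image(1234)
found = recover_seed(target, max_seed=5000)
```

The search cost grows linearly with the number of candidate seeds, so this stops being practical well before the full seed space is covered.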
ahhh ok so i have to open the image file and then give it the string
hello guys, i want to create a model using Keras or Tensorflow that synthesizes two images, joining a body to a head. I don't know much about deep learning and honestly I'm a little lost. Any tips on how to get started?
Can you elaborate on what exactly you want to do?
input 1
input 2
output
i dont know exactly how to describe this
Ignore the colors. The first two I took a picture of the computer screen
in this case that was made with photoshop. but it takes a looong time to edit and thats why im searching a way to do this with deep learning
would anyone review my recently worked on neural network?
Also how to backpropagate
What ai framework are you using?
python
As in pure python with a few simple modules like math and random
So not really any framework and from scratch
Nope not even numpy. Thats alright thankyou for taking an interest
from scratch... 
what's the easiest way to remove the time from a column in pandas? The csv i downloaded goes from 0:24 inclusive and its kinda messing with to_datetime
And I only need the year anyways
So I'm working on a project with a couple of friends. We want to be able remove the background of a picture (usually, but not necessarily a portrait), somewhat like remove.bg. Would PyTorch be a good fit for this project?
@analog pike you slice with loc
@acoustic forge CV2 https://stackoverflow.com/questions/63001988/how-to-remove-background-of-images-in-python
@nova widget Yeah, that suggestion that he made doesn't work
that link might not work but i would go down the route of opencv @acoustic forge
they have some useful modules in their documentation
Okay, I'm gonna check it out @misty flint. Thanks
Consider using https://pypi.org/project/geopy/ instead.
Show code. What type of neural network?
thanks
will do!
thanks
It's rather long and might be difficult to understand. I only know a limited amount about the math behind neural networks, backpropagation and so forth. This is my attempt so far.
I have been working on another that uses a genetic/fitness approach.
How familiar are you with linear algebra?
somewhat familiar, only what I know from studying it in maths education
but I think most of the math behind machine learning is beyond me.
I can understand what sigmoid and other activation functions do
Ok, my first note is that this code is much smaller and simple if you make use of matrices.
(Which is their entire purpose)
I just thought I'd give it a try! Still not sure if it works as intended but we can hope.
Right that makes sense.
I used a one dimensional array for most of the weights etc
thankyou
You can simply implement matrix multiplication and transpose yourself, it does not need to be fast.
As long as you get the idea
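Roughly what that could look like with nothing but lists. The weights and inputs are arbitrary example numbers, and speed is not the goal here.

```python
def transpose(m):
    # Rows become columns.
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    # (n x k) times (k x p) -> (n x p); each entry is a row-by-column dot product.
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


inputs = [[1.0, 2.0]]            # one sample with two features (1 x 2)
weights = [[0.5, -1.0, 0.0],     # 2 inputs -> 3 neurons (2 x 3)
           [0.25, 1.0, 2.0]]

pre_activation = matmul(inputs, weights)   # 1 x 3: weighted sums for all three neurons
```

One `matmul` call replaces the loop over individual neurons: every neuron's weighted sum comes out of the same operation.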
that sounds like a good idea. I do see what you mean. Then I wouldn't have to loop through each neuron individually?
yeah, that's why matrices are cool, they make everything easier to think about and code, since you are thinking at a higher level.
By that I mean like as in programming higher level.
Like assembly vs python
ohh right I see what you mean now. They sound super cool actually! I will try it out thanks
that will be useful
It's also why they were invented in math, nobody wants to manually juggle all those numbers.
It is quite annoying and one of the problems that took me the longest. I have reworked it a few times!
That may be a much better approach
A sign that you may not be doing things the best way is when your objects are too small. For example, neuron does not really need to be its own object unless you intend to create a neuron by itself outside of a neural network. Or another sign is when something exists not by itself ever, but in a group / cluster. Rather than making it its own object, just have the data held by the object that manages the group / cluster.
Generally you will always be working with groups of neurons.
I see what you mean. It would be much simpler to store the all the weights inside a matrix in a single layer object than a neuron. I will try playing around with different lists to see what I can do. You are right, I do not plan to use a single neuron on its own. Thankyou for that explanation!
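A possible shape for such a layer object in pure Python, with no neuron class at all. `Layer`, its attribute names, and the fixed random seed are illustrative choices only, not the code under review.

```python
import math
import random


class Layer:
    """Hypothetical layer that owns every weight and bias for its neurons."""

    def __init__(self, n_inputs, n_neurons, rng=None):
        rng = rng or random.Random(0)
        # weights[i][j] connects input i to neuron j.
        self.weights = [[rng.uniform(-1, 1) for _ in range(n_neurons)]
                        for _ in range(n_inputs)]
        self.biases = [0.0] * n_neurons

    def forward(self, inputs):
        # One weighted sum per neuron, then a sigmoid activation.
        outputs = []
        for j, bias in enumerate(self.biases):
            total = bias + sum(x * row[j] for x, row in zip(inputs, self.weights))
            outputs.append(1.0 / (1.0 + math.exp(-total)))
        return outputs


layers = [Layer(2, 3), Layer(3, 1)]
signal = [0.5, -0.2]
for layer in layers:             # the output of one layer feeds the next
    signal = layer.forward(signal)
```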
Btw that group vs single thing idea applies to pretty much all programming.
(Computers like groups of things)
(Contiguous)
thankyou for that advice! That is the sort of thing that will really help me improve.
Groups do seem to be used a lot in programming, list logic is essential to a lot of software it seems. Or at least, it is used often for challenges etc
Thankyou so much for all your help it has been really helpful
I might redo the neural network using a different method with matrices now, thanks
On line 101, you use this count = 0. To keep track of the current layer index correct?
I'm looping over a geo API call and want to append some of the JSON results to my dataframe. Since I select only latitude and longitude out of the JSON results, I use the selecting technique response['latitude']. The thing is, for some responses there is no 'latitude' value. How can I keep my code from crashing and just continue to the next record instead?
@lapis sequoia
Oh yes pretty sure I do
let me check
That is correct
It doesn't need to be 1 I don't think as the number of the weights for one neuron in one layer of the neural network should always equal the number of neurons in the next
@thin remnant Use python dict's get function, you can set an optional return value for when there is no entry .get(key, ret_val_when_not_there), e.g. lat = response.get('latitude', None) ... if lat is None: ...
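Spelled out as a loop, that could look like the sketch below. The `responses` list is made-up stand-in data; the real API's response layout may differ.

```python
responses = [
    {"latitude": 52.37, "longitude": 4.89},
    {"label": "no coordinates here"},          # missing both keys
    {"latitude": 48.85, "longitude": 2.35},
]

rows = []
for response in responses:
    lat = response.get("latitude")     # None when the key is absent
    lon = response.get("longitude")
    if lat is None or lon is None:
        continue                       # skip this record instead of raising KeyError
    rows.append({"latitude": lat, "longitude": lon})
```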
@lapis sequoia Use enumerate instead, also the if statement if(count != len(self.layers)): will always be True.
When using the iterators, we need to keep track of the number of items in the iterator. This is achieved by an in-built method called enumerate(). The enumerate ...
ohh right that is a good idea thankyou I will try that!
Oh I see what you mean about the if statement...
thankyou
As I am using len() rather than it counting from 0
on 117 if(i + 1 != len(self.layers)): you are using this to make it only loop up until the layer before the last right?
just change the range then
they are not connected to neurons in the next layer
for i in range(0, len(self.layers)): to for i in range(0, len(self.layers) - 1):
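Both suggestions side by side, on a placeholder list, since the original code is not shown here. A counter that runs from 0 to len - 1 never equals len, which is why a check like `count != len(self.layers)` is always True; shortening the range removes the need for any check inside the loop.

```python
layers = ["input", "hidden_1", "hidden_2", "output"]   # placeholder layer objects

# enumerate supplies the index, so no manual counter is needed.
indexed = [(i, layer) for i, layer in enumerate(layers)]

# Stop one short of the end: the last layer has no next layer to connect to.
pairs_by_range = [(layers[i], layers[i + 1]) for i in range(len(layers) - 1)]

# zip over the list and the list shifted by one gives the same pairs.
pairs_by_zip = list(zip(layers, layers[1:]))
```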
rightttt I see
that would also work very well
I do not think about these things sometimes that is a nice and simple solution!
You got a list index out of range, so data is an empty list, check to make sure len of data is greater than zero.
so data['data'] is the actual data
which is a list
and it was empty, but you tried to access the element at index 0.
wow what IDE is that
jupyter notebook xd
@lapis sequoia Python has a bunch of loop control that allows you avoid having to put if statements inside loops to control where they loop.
@thin remnant you are trying to access the index using a string datatype
data['data'] is a list, and data['data'][0] seems to also be a list (i'm guessing length 2 for lat and long).
I recommend trying to print out the types of things
e.g. print(type(data['data'][0]))
These features seem really useful, I will have a think next time about how I can use loop control instead!
Also I found printing literally every variable helps
I have a "debug" mode boolean that I can enable and disable to print everything
Sometimes it can be helpful
ive tried some things but couldnt figure it out
You should also be able to print the json itself probably or save it to a file. Then analyze it.
this is what i did to check stuff
you have any idea how to check if lat and lon are there and otherwise just not care instead of crash xd
gimme a sec, ill make a picture
You can check whether a given key exists in a dictionary using:
```py
if key in dictionary:
    print("will execute if this key is present")
```
My hunch is that since you are in Jupyter Notebook it may be an out of order cell execution thing (or other), create a new file on your pc and run the script in there to make sure it's nothing strange going on with the notebook.
It could also be that not all responses are the same. Some could be giving different structures.
I would wrap the loop in a try/except and on error print the current and previous data to compare valid and invalid data.
My linux is rebooting, the window froze
You are using a vm?
no
i run linux as main
anyway, this is what it looks like when i run the script
thats gonna give me a huge output since its a loop
but i got an idea
i can just print the index number
and then next run just print the data of that index where it stopped
yeah
So, yeah, there is no guarantee for anything, just gotta do a ton of checks on everything. A bunch of if key in x, if len(y) > 0, and maybe even if isinstance(z, (typeA, typeB, ...)).
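Those checks can be collected into one small function so the loop stays readable. The payload layout assumed here (a `data` list whose first element is a dict with `latitude` and `longitude`) is a guess for illustration; adjust it to whatever the printed responses actually look like.

```python
def extract_lat_lon(payload):
    """Return (lat, lon) or None; the payload layout here is an assumption."""
    if not isinstance(payload, dict) or "data" not in payload:
        return None
    results = payload["data"]
    if not isinstance(results, list) or len(results) == 0:
        return None                      # empty list: nothing to index into
    first = results[0]
    if not isinstance(first, dict):
        return None                      # e.g. a nested list instead of a mapping
    lat, lon = first.get("latitude"), first.get("longitude")
    if lat is None or lon is None:
        return None
    return lat, lon
```

The caller then only has to test for `None` and move on to the next record.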
sometimes it stops faster
Yeah it's random it seems.
i think i have the right checks to make it work now
>>> from geopy.geocoders import Nominatim
>>> geolocator = Nominatim(user_agent="specify_your_app_name_here")
>>> location = geolocator.geocode("175 5th Avenue NYC")
>>> print(location.address)
Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York, ...
>>> print((location.latitude, location.longitude))
(40.7410861, -73.9896297241625)
>>> print(location.raw)
{'place_id': '9167009604', 'type': 'attraction', ...}
ill take a look at geopy later this evening
mmm
wow
that looks pretty easy yea haha
ill give it a shot i guess
They did what you are doing right now, but wrapped it up for you with a bow tie.
haha your sample code does look easy yea
but what is the max amount of calls ?
cause im doing it for each record in a dataset
Depends on which site you choose
They chose Nominatim in this example.
Open source geocoding with OpenStreetMap data
my dataset has 5000 +- records
so i used positionstack
but ill take a look into those things as well
thanks a lot for the help though!
free tier probably too and probably a lot of requests, because google is big
positionstack didn't have very good docs imo but their calls are at least free
if you do some filtering yourself..
here is the list of all the ones it has
np, I gain more practice from this stuff.
me too
i wouldnt have ever touched this stuff if it werent for my gf
she doesn't know anything about datascience and has to do data analysis/linear regression in SPSS
and so she doesnt know data cleaning or python at all
and the school didnt give her the data
so that sucked haha
But I knew it was possible in python so i wanted to try it
Hey @cerulean spindle!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
I'm new to TensorFlow and I am having trouble lowering the loss of my model. Please ping me if you know how to improve the model. https://paste.pythondiscord.com/gosebesocu.py
@thin remnant I recommend just trying to look on https://pypi.org/ to see if there is already a package for what you are trying to do, then go to their github page and see if the README has a simple example. If it seems overly complex for what you are trying to do then try doing it yourself.
so im still having troubles with my csv, now its throwing a SettingWithCopyWarning
x = 0
for item in index['datetime']:
    index.iloc[x,0] = item[:-6]
    x+=1
im just trying to shave the end of a string in each row of a column in pandas
since datetime isnt cooperating
If I have a column in my dataset which contains short string descriptions using keywords, how could I include that in a heatmap/correlogram to show relationships between the keyword and other variables? e.g. I could use this to find that, for example, descriptions that contain the word "red" and "dress" have a smaller value in a column called stock than a description that includes "green" and "bag"
example of data
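One common approach is to turn each keyword into a 0/1 indicator column and correlate those with the numeric columns. A sketch with invented rows; `description` and `stock` are the column names from the question, everything else is made up.

```python
import pandas as pd

# Made-up rows; "description" and "stock" are the column names from the question.
df = pd.DataFrame({
    "description": ["red summer dress", "green leather bag", "red dress long",
                    "green bag small", "blue hat"],
    "stock": [3, 40, 5, 35, 20],
})

keywords = ["red", "dress", "green", "bag"]
for kw in keywords:
    # 1 if the description contains the keyword, else 0.
    df[kw] = df["description"].str.contains(kw, case=False, regex=False).astype(int)

# Pearson correlation between each indicator column and stock.
corr = df[keywords + ["stock"]].corr()
```

The resulting `corr` table is an ordinary DataFrame, so it can be handed to whatever heatmap plotting function you prefer.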
@analog pike Try modifying your column like so:
I just started data science in uni so I'm willing to get help
import pandas as pd
df = pd.DataFrame(
[[1, 2], [4, 5], [7, 8]],
index=['cobra', 'viper', 'sidewinder'],
columns=['max_speed', 'shield']
)
print(df)
column = df.iloc[:, 0]
print("-------------------")
print(column)
for i, val in enumerate(column):
    column[i] = val + 1
print("-------------------")
print(column)
ah elite dangerous
@iron basalt the problem is that these are the values im trying to modify: https://gyazo.com/931714db8974dfa5c1524a31f69d0255
im trying to just strip off the time portion and whatever im trying just doesnt seem to want to work
since pd.to_datetime only uses values 0-23 for time and for some reason the csv goes 0-24
and I don't need the times anyways just the year
just the middle part? the year?
Are they always formatted like this all entries?
yeah I downloaded the cleaned one
if you google it, there are about a million fixes for such a problem
so I wouldn't have to deal with all the data cleaning
they are
yet settingwithcopywarning is messing with me again
A value is trying to be set on a copy of a slice from a DataFrame
are you modifying the column like I did above?
this is my whole code atm
```py
import matplotlib.pyplot as plt
import pandas as pd
import DateTime as dt
ufos = pd.read_csv("scrubbed.csv",low_memory=False)
countries = ufos['country'].unique()
print(countries)
fig,ax = plt.subplots()
index = ufos[ufos['country'] == 'us']
print(index['datetime'])
column = index.iloc[:, 0]
print("-------------------")
print(column)
for i, val in enumerate(column):
    column[i] = val.split()[1]
index['datetime'] = pd.to_datetime(index['datetime'])
index['year'] = index['datetime'].dt.year
```
damn the highlighting didnt work
just edit your message
there we go
what does print(index['datetime']) look like?
thats the image i posted before
it gives just the list of dates and times
for each entry
How many columns are there? just one?
no, though now that I think about it i really only need the one column
since im not doing by state or anything and this is just the US
so when you print column what do you get?
same thing as printing index['datetime']
so there is only 1 column in index['datetime']?
oh shoot wait a minute i think I know why
after I made the copy with only US pandas doesn't fix the rows
:/
it leaves the gaps where the other countries were
Damn I forgot how i fix that
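The function being hunted for is most likely `reset_index(drop=True)`. A sketch with made-up rows (the real file's layout is assumed to be month/day/year followed by a time) that also takes an explicit `.copy()`, which is the usual way to make SettingWithCopyWarning go away:

```python
import pandas as pd

ufos = pd.DataFrame({
    "datetime": ["10/10/1949 20:30", "10/10/1955 17:00", "11/15/2000 21:00"],
    "country": ["us", "gb", "us"],
})   # made-up rows

filtered = ufos[ufos["country"] == "us"]
labels_before = filtered.index.tolist()        # [0, 2]: the gap where "gb" was

# copy() detaches the result from ufos; reset_index(drop=True) renumbers from 0.
us = filtered.copy().reset_index(drop=True)
labels_after = us.index.tolist()               # [0, 1]

# Take the date part of each string in one vectorized step, then the year.
date_part = us["datetime"].str.split().str[0]
us["year"] = pd.to_datetime(date_part, format="%m/%d/%Y").dt.year
```

Splitting on whitespace and taking element 0 keeps the date; element 1 would be the time, which is the part that trips up `pd.to_datetime` here.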
sounds like that would be a kata if katas did arrays

i did a similar thing but it was just elements and a list
pain
theres probably a function
ye



