#data-science-and-ml

velvet thorn
#

linear regression assumes HOMOSCEDASTICITY of RESIDUALS

young dock
#

oh, so lin reg probably isnt the best for a heteroscedastic dataset

young dock
#

uh

velvet thorn
#

errors

young dock
#

sorry im new to all this

velvet thorn
#

y_hat - y

#

y_hat is predicted y

#

anyway

#

just try it out and see how it goes

#

linear regression is pretty robust

#

to assumption violation

young dock
#

I see

#

i got the score of the model, which was 0.36

#

i think thats the r^2

#

is that good?

velvet thorn
#

uh

#

depends on your use case, but generally no

#

not horrible

#

but not great

young dock
#

ok

#

basically it means that the variance in the x variable explains 36% of variance in the y variable

#

is that correct?
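
Roughly: R^2 is the share of the variance in y that the fitted model accounts for, and for a single predictor with an intercept it equals the squared correlation between x and y. A minimal sketch with made-up numbers, assuming an ordinary least-squares line fitted with NumPy:

```python
import numpy as np

# Made-up data for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 3.8, 6.1, 5.4])

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# For one predictor with an intercept, R^2 equals the squared correlation.
r = np.corrcoef(x, y)[0, 1]
print(r2, r ** 2)
```

An R^2 of 0.36 would mean about 36% of the spread in y is captured by the line and the rest is left in the residuals.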

feral shard
#

This is off topic, but I can't help but think of the meme from chernobyl

#

3.6 roentgen, not great, not terrible

#

lol

young dock
#

lol

feral shard
#

did you watch it?

young dock
#

i've only seen the meme around

feral shard
#

hilarious that your R^2 was .36

young dock
#

yeah same digits lol

feral shard
#

anyway, i would actually say that .36 is more on the terrible side

#

you want like 0.7 or higher

young dock
#

fair

#

i guess i should try other types of reg in that case?

feral shard
#

yeah you could try that

#

there sure is a lot of variance though

young dock
#

yeah

velvet thorn
#

because some are just harder

iron basalt
#

Hello, is anyone here very familiar with numpy?

velvet thorn
iron basalt
#

I am currently wondering about this numpy code and i'm pretty confused about the resulting shape.

#
>> a = np.ones((784,))
>> b = np.ones((784,1))
>> a.shape
(784,)
>> b.shape
(784, 1)
>> a = a - b
>> a.shape
(784, 784)
velvet thorn
#

simplest example

#

!e

import numpy as np

a = np.array([1, 2, 3, 4, 5])
print(a - 1)
arctic wedgeBOT
#

@velvet thorn :white_check_mark: Your eval job has completed with return code 0.

[0 1 2 3 4]
velvet thorn
#

so, a is an array of shape (5,), but 1 is a scalar

#

how can you subtract a scalar from an array? you broadcast it - duplicating it across axes

#

now, scale that concept up.

#

!e

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

b = np.array([5, 10, 15])

print(a - b)
arctic wedgeBOT
#

@velvet thorn :white_check_mark: Your eval job has completed with return code 0.

[[ -4  -8 -12]
 [ -1  -5  -9]]
iron basalt
#

yes, but with the code above I was expecting element-wise subtraction, with (784,) being treated the same as (784,1).

velvet thorn
#

a singleton dimension isn't the same as no dimension.

#

although I must say that is a bit of an edge case

iron basalt
#

so it does element-wise subtraction but per axis? broadcasted up?

#

I think I get it, just not sure how to describe it in text.

velvet thorn
#

I think what you're imagining is for

#

!e

import numpy as np

a = np.array([1, 2, 3])
b = np.array([[4, 5, 6]])

print(a.shape)
print(b.shape)

print((a - b).shape)
print(a - b)
arctic wedgeBOT
#

@velvet thorn :white_check_mark: Your eval job has completed with return code 0.

(3,)
(1, 3)
(1, 3)
[[-3 -3 -3]]
velvet thorn
#

is this what you would expect? @iron basalt

iron basalt
#

yes

velvet thorn
#

note the shapes

iron basalt
#

leading dimension is 1 this time

velvet thorn
#

yes

#

in your original case

#

the 784 and 1 axes are matched

#

leading to a 2nd axis of size 784

iron basalt
#

so in my case it took the last axis from a and matched with the last axis of b, because a only has one axis?

#

It matches from "right" to "left"?

#

In your last example 3 matches with 3?
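
Yes on both counts. NumPy lines the shapes up from the right, pads the shorter one with 1s on the left, and stretches any axis of size 1 to match the other. A short sketch of how that plays out for the (784,) and (784, 1) case from above (`np.broadcast_shapes` needs NumPy 1.20 or newer):

```python
import numpy as np

a = np.ones((784,))
b = np.ones((784, 1))

# Shapes are aligned from the right; (784,) is treated as (1, 784).
#   a: (  1, 784)
#   b: (784,   1)
# Each size-1 axis is stretched, giving (784, 784).
print(np.broadcast_shapes(a.shape, b.shape))  # (784, 784)

# To get element-wise subtraction, make the shapes agree first.
diff_1d = a - b.ravel()     # (784,) - (784,)     -> (784,)
diff_2d = a[:, None] - b    # (784, 1) - (784, 1) -> (784, 1)
print(diff_1d.shape, diff_2d.shape)
```

Which of the two fixes is right depends on whether the rest of the code expects a flat vector or a column.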

serene scaffold
#

These days if I hear that a product uses "deep learning and AI" I assume that they either used some off-the-shelf AI solution for something that didn't need it, or the AI that they're using isn't very effective. But maybe that's because I see how much AI doesn't work before it does.

#

Is this something a lot of people start to feel after they've been working with AI for a while?

velvet thorn
iron basalt
#

@velvet thorn thank you, numpy's broadcasting was something I never really fully learned.

iron basalt
#

@serene scaffold Yes

magic summit
#

sorry for the crappy paint drawing

#

how would i graph something like this with matplotlib

storm lintel
#

anyone good with webscraping here?

misty flint
#

uhh youre probably looking for the subplot function

magic summit
lapis sequoia
#

hello

iron basalt
# magic summit I guess i can just wing it with just swapping the axis for the left graph when p...
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> x = np.linspace(1, 10, num=10)
>>> x
array([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.])
>>> y_1 = x
>>> y_2 = 2*x
>>> y_3 = x**2
>>> plt.subplot(1, 2, 1)
<AxesSubplot:>
>>> plt.plot(x, y_1)
[<matplotlib.lines.Line2D object at 0x7fd3513eb1f0>]
>>> plt.title("Left")
Text(0.5, 1.0, 'Left')
>>> plt.subplot(1, 2, 2)
<AxesSubplot:>
>>> plt.plot(x, y_2)
[<matplotlib.lines.Line2D object at 0x7fd3512939a0>]
>>> plt.plot(x, y_3)
[<matplotlib.lines.Line2D object at 0x7fd351293d00>]
>>> plt.title("Right")
Text(0.5, 1.0, 'Right')
>>> plt.tight_layout(4)
<stdin>:1: MatplotlibDeprecationWarning: Passing the pad parameter of tight_layout() positionally is deprecated since Matplotlib 3.3; the parameter will become keyword-only two minor releases later.
>>> plt.show()
#

Except swap the axes on the left one.

velvet thorn
#

it might help to read through it

lapis sequoia
#

i m new to this

#

and i have a question

iron basalt
lapis sequoia
#

if you re willing to help me

#

can Python help me analyze soccer matches and predict the outcome?

magic summit
iron basalt
#
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(1, 10, num=10)
y_1 = x
y_2 = 2*x
y_3 = x**2

ax = plt.subplot(1, 2, 1)
plt.plot(y_3, x)
ax.invert_xaxis()
plt.title("Left")

plt.subplot(1, 2, 2)
plt.plot(x, y_2)
plt.plot(x, y_3)
plt.title("Right")

plt.tight_layout(pad=4)
plt.show()

#

@magic summit

magic summit
#

thank you very much

next tree
#

could i get some help with mongolite

#

my classmate on piazza said I'm supposed to sum over purchaseMethod rather than items

#

and that the $sum:1 sums over the rows

#

but my total items in the df is all 0

#

so im doing something wrong with the total_item sum part of the function

misty flint
#

..excel..?

#

what is this?

#

oh mongolite

#

nice edit kannaSus

next tree
#

lolll mango lite

lapis sequoia
#

can someone help me get started with data science?

#

i have an idea for a project

storm lintel
#

so my docker code is in the wrong time zone

#

how do i change time zone

#

its in utc rn

cerulean spindle
#

does anyone know how to lower loss on a tensorflow model? My loss is really high and then goes to nan.

hasty grail
storm lintel
#

i cant figure out how to change this darn time zone on docker

hasty grail
#

Especially the details of your model, your learning rate, and which loss function you're using

cerulean spindle
#

I figured it out nvm

hasty grail
#

Ok cool

austere swift
#

anybody know some good pip packages for gradcam in pytorch?

#

but im trying to see if there are other better ones

misty flint
#

too much scattering

#

if i expand figsize, i wonder if this will be fixed pithink

#

oh it helped

#

too many columns for a scatter matrix; better to do it individually pithink

#

actually im going to see what tableau can do with this

meager shoal
#

Trying to load yolov5 weights into pytorch, and gives this error:

#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

meager shoal
#
!git clone https://github.com/ultralytics/yolov5
%cd yolov5
%pip install -qr requirements.txt
import torch
import cv2
from models.experimental import attempt_load
model = attempt_load('/content/last.pt')
img = cv2.imread('/content/test.jpg')
img = torch.from_numpy(img).to('cuda')
img = img.float()
img /= 255.0
if img.ndimension() == 3:
  img = img.unsqueeze(0)
model(img)
topaz oracle
#

i am looking to learn more about data science but I don't know where to begin

#

I remember watching those sentdex videos but otherwise I don't know much

spring mortar
#

I'll give it a shot in Linux in a second where I can easily check for folder permissions. I still don't get permissions in windows after using that OS for all my life. Thanks for the heads up!

misty flint
#

In this video I help you to answer if data science is a good fit for you. I provide 5 questions that you should ask yourself that will assess your fit for the field.

#DataScience #DataScienceJobs #DataScienceCareers

Questions to Ask Yourself:

  • Am I prepared to seriously commit to learning? Data Science has a steep learning curve. You also ha...
topaz oracle
#

thanks

misty flint
#

my fave DS YTber so far

#

for sklearn's OrdinalEncoder function, is there a way to reverse the 1's and 0's? to make abnormal coded as 1 and normal coded as 0?

astral path
#

is there a way to do a multiple regression of every column in a subset of a dataframe as a function of all the other columns? e.g. if I have a dataframe with columns "age", "pclass", "sex", "embarked", "fare", "sibsp", "parch", I would want to perform multiple regression of age as a function of "pclass", "sex", "embarked", "fare", "sibsp", "parch", pclass as a function of "age", "sex", "embarked", "fare", "sibsp", "parch", and so on...
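
One way to sketch this, assuming every column is already numeric (categoricals such as "sex" or "embarked" would need encoding first): loop over the columns and fit ordinary least squares for each one against the rest. The column names and random data below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
columns = ["age", "pclass", "fare", "sibsp", "parch"]   # placeholder names
data = rng.normal(size=(200, len(columns)))             # placeholder data

coefs = {}
for i, target in enumerate(columns):
    y = data[:, i]
    others = np.delete(data, i, axis=1)
    # Add an intercept column, then solve the least-squares problem.
    X = np.column_stack([np.ones(len(y)), others])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    predictors = [c for c in columns if c != target]
    coefs[target] = dict(zip(["intercept"] + predictors, beta))

for target, beta in coefs.items():
    print(target, beta)
```

With a DataFrame, `df.to_numpy()` and `list(df.columns)` would supply `data` and `columns`.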

twin moth
#

Could anyone here help us choose a good ML algorithm for our scenario?

hasty grail
#

About your scenario...

#

@twin moth

twin moth
#

An example of our dataframe

Y,X,Year,Month,Land_Surface_Temperature Color Index,Land_Surface_Temperature Is Valuable,Vegetation Color Index,Vegetation Is Valuable
28,160,2000,3,-1,False,15.0,True
28,161,2000,3,-1,False,10.0,True
28,162,2000,3,-1,False,10.0,True
28,176,2000,3,-1,False,19.0,True
28,177,2000,3,-1,False,11.0,True
28,178,2000,3,-1,False,15.0,True
28,179,2000,3,-1,False,14.0,True
28,180,2000,3,5,True,16.0,True
28,181,2000,3,-1,False,14.0,True
28,182,2000,3,0,True,19.0,True
28,183,2000,3,0,True,12.0,True
28,184,2000,3,2,True,14.0,True
28,185,2000,3,0,True,11.0,True
28,186,2000,3,-1,False,18.0,True
28,187,2000,3,0,True,15.0,True
28,188,2000,3,-1,False,17.0,True
28,189,2000,3,-1,False,17.0,True
28,190,2000,3,-1,False,15.0,True
28,191,2000,3,-1,False,18.0,True
28,192,2000,3,-1,False,17.0,True
28,193,2000,3,-1,False,21.0,True
28,194,2000,3,-1,False,25.0,True
28,195,2000,3,-1,False,29.0,True
28,196,2000,3,-1,False,35.0,True
28,197,2000,3,-1,False,29.0,True
#

Basically it contains a row for each entry * month * year (~Feb 2000-Dec2020)

hasty grail
#

Hmm

twin moth
#

We tried running a couple of ML algorithms on the data, mostly linear models and we got really bad accuracy

#

0.35 was the max

hasty grail
#

I'm thinking of graph-based methods

twin moth
#

We even got a negative value once

hasty grail
#

I don't think negative accuracy is possible xD

twin moth
#

We didn't either

#

But here it is

#

And yes, the print statement is okay, we printed other algorithms as well and they returned normal values

hasty grail
#

Weird lol

#

Maybe LassoLars is not implemented correctly
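
A likely explanation, assuming the number came from a scikit-learn regressor's `score` method: for regressors that method returns R^2 rather than accuracy, and R^2 goes negative whenever the model predicts worse than always guessing the mean of y. A small sketch with made-up numbers:

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

print(r2_score(y_true, [1.0, 2.0, 3.0, 4.0]))   # 1.0  (perfect fit)
print(r2_score(y_true, [2.5, 2.5, 2.5, 2.5]))   # 0.0  (always predict the mean)
print(r2_score(y_true, [4.0, 3.0, 2.0, 1.0]))   # -3.0 (worse than the mean)
```

So a negative score does not by itself point to a broken implementation; it says the fit is poor on the data it was scored on.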

iron basalt
#

So is each pixel you referred to here one of those entries?

twin moth
hasty grail
#

Is it a custom implementation?

twin moth
twin moth
hasty grail
#

Hmm I don't use sklearn that much so idk whether I can help

twin moth
#

😦

#

So how would you approach it?

hasty grail
#

Regardless of whether your loss function is correct

twin moth
#

We'll try to research it, I personally never heard of it

#

Would we need to change the data structure?

hasty grail
#

Yeah you probably need an adjacency matrix

#

to represent your stations as nodes in the graph

twin moth
#

We have about 36 hours to turn it in

hasty grail
#

wait a second

twin moth
#

😅

hasty grail
#

nvm I looked at your data again and it seems that your heat map has a value for each point

#

(i.e. it is a dense 2d map)

#

maybe you want to use Conv-LSTM then

twin moth
#

BTW, we took each and every pixel from a map like that
https://earthobservatory.nasa.gov/global-maps/MOD_NDVI_M

We calculate the scale index for each of the pixels using the given scale

hasty grail
#

turn your data into an "image" (according to the x/y values) and store the metadata as channels
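
A minimal sketch of that reshaping, using a few made-up rows in the same layout as the sample above (Y, X, then one value per map). The fill value of -1 for pixels with no row is an assumption borrowed from the sample data.

```python
import numpy as np

# (Y, X, land_surface_temperature_index, vegetation_index) for one month.
rows = [
    (28, 160, -1, 15.0),
    (28, 161, -1, 10.0),
    (29, 160,  5, 16.0),
    (29, 162,  0, 19.0),
]

ys = np.array([r[0] for r in rows])
xs = np.array([r[1] for r in rows])
values = np.array([r[2:] for r in rows], dtype=float)

# Shift coordinates so the smallest Y/X land at index 0.
y_idx = ys - ys.min()
x_idx = xs - xs.min()

height = y_idx.max() + 1
width = x_idx.max() + 1
channels = values.shape[1]

# Pixels with no row keep the fill value.
image = np.full((height, width, channels), -1.0)
image[y_idx, x_idx] = values

print(image.shape)   # (2, 3, 2)
```

Stacking one such array per month would give a (time, height, width, channels) sequence, which is the kind of input a Conv-LSTM consumes.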

twin moth
#

And we only leave the colored pixels, so we have about ~12MM lines

hasty grail
#

only 36 hours though...

#

Idk whether you can train your model in time

#

oof

twin moth
hasty grail
#

This sort of problem pretty much requires deep learning

twin moth
hasty grail
#

It's harder than image classification already xD

twin moth
hasty grail
#

Since you've also got the time dimension to worry about

twin moth
nova widget
#

Is it prediction per coordinate or per time?

hasty grail
#

I think they need both

twin moth
hasty grail
#

Idk how you're even supposed to do this using traditional ML methods

twin moth
#

And we have multiple maps, so either take a single one at a time or take them all in favor of a more successful prediction

twin moth
hasty grail
#

e.g. in RGB images you have R, G and B maps

#

just extend this to your scenario

twin moth
#

But we have about 6 maps

#

How would you use it?

nova widget
#

Is there a time series?

twin moth
#

BTW, we couldn't use MLP since it took way too much RAM, got any idea if DL algorithms will be more lax on it?

hasty grail
#

As I mentioned above, I would go for training a Conv-LSTM model on your data

twin moth
hasty grail
#

Convolution is more memory efficient than MLP (Dense)

hasty grail
#

It should be ok as long as you don't use too many timesteps / maps that are too large
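
A rough back-of-the-envelope comparison of parameter counts, with made-up layer sizes, shows where the difference comes from: a dense layer needs a weight for every input-output pair, while a convolution reuses one small kernel across the whole map.

```python
# Hypothetical sizes for illustration.
height, width, in_channels = 180, 360, 6

# Dense layer mapping the flattened map to 256 units (weights + biases).
dense_units = 256
dense_params = height * width * in_channels * dense_units + dense_units

# 3x3 convolution with 32 filters over the same input (weights + biases).
kernel, filters = 3, 32
conv_params = kernel * kernel * in_channels * filters + filters

print(dense_params)   # 99533056
print(conv_params)    # 1760
```

This only counts weights; the activations still grow with map size and number of timesteps, which is why those two are the limits mentioned above.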

iron basalt
#

My solution would be to use a generative predictive model.

hasty grail
#

I am not confident that you can get a decent model from all this in 36 hours though

nova widget
#

Just make it micro first

iron basalt
#

It works for both the time aspect and learns all the correlations.

hasty grail
twin moth
hasty grail
#

Should have researched the problem beforehand xD

#

but anyways

twin moth
hasty grail
#

Not really - if you use lazy loading to feed your data into the model, you don't need to fit the entire thing in memory
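
A minimal sketch of that idea with only the standard library: a generator that yields fixed-size batches from a CSV stream, so only one batch is in memory at a time. The in-memory `io.StringIO` stands in for a real file, and `batch_size` is an arbitrary choice.

```python
import csv
import io
from itertools import islice

def iter_batches(file_obj, batch_size):
    """Yield lists of parsed rows, at most batch_size rows at a time."""
    reader = csv.DictReader(file_obj)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch

# Stand-in for an opened CSV file; a real file object works the same way.
fake_file = io.StringIO(
    "Y,X,Year,Month,Vegetation\n"
    "28,160,2000,3,15.0\n"
    "28,161,2000,3,10.0\n"
    "28,162,2000,3,10.0\n"
    "28,176,2000,3,19.0\n"
    "28,177,2000,3,11.0\n"
)

batch_sizes = [len(batch) for batch in iter_batches(fake_file, batch_size=2)]
print(batch_sizes)   # [2, 2, 1]
```

pandas offers the same pattern through `read_csv(..., chunksize=...)`, and deep learning frameworks have their own data-loader classes built around it.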

misty flint
#

buy some cloud GPUs for the model

hasty grail
#

that's kinda cheating

misty flint
hasty grail
#

I assume this is for a course project

twin moth
misty flint
#

is it cheating? what if you tell your prof pithink

iron basalt
#

If you need speed, then your best bet is dimensionality reduction via sparse methods.

hasty grail
#

However if you want to keep within the bounds of your course, I think it is worthwhile to show that traditional ML methods are unable to solve such a difficult problem 😛

misty flint
#

gl tho. 36 hours amegablobsweats

hasty grail
#

(Well not really, since the images are so small that you could still fit an MLP in memory to basically brute force it)

misty flint
#

assuming you have to turn in a report/presentation too? NervousSip

iron basalt
#

MLPs are fine for MNIST, they get like 97-98%, which is pretty much the max since MNIST has mislabelled data.

twin moth
twin moth
#

And we did some complex calculations for the data so that might get us some points haha

misty flint
#

im already stressed on your behalf amegablobsweats

hasty grail
#

to solve your problem

iron basalt
#

I would just fail with grace and say why it's not really do-able, so you gain something out of it and they do too.

twin moth
#

No one cares how you run it

iron basalt
#

ML is not this almighty, can-do-everything thing, no matter how much people may hype it up to be.

#

Very much WIP.

hasty grail
misty flint
#

if this was for my AI class, my prof would be okay with it but thats bc she gave everyone cloud credits to use

hasty grail
#

But in DL, models can take hours or even days to train

#

And then there's hyperparameter tuning, so you have to repeat the training process many times

#

so differences in resources could matter a lot

iron basalt
#

(Unless you use very modern ML which can run on the CPU due to exploitation of sparse operations, but that is some bleeding edge / very not common place, and needs much more research)

#

(non-differentiable models are very hard to grasp since all commonly used techniques are out the window)

#

(no backprop)

hasty grail
#

But yeah, better to focus on the process than the results @twin moth

misty flint
#

gl dude

twin moth
hasty grail
#

Stick to what was taught in the course

#

You don't have the time for DL methods

#

Especially since you have not dealt with DL before

twin moth
#

We weren't taught conv

#

But I don't think that anyone would care if we used it

hasty grail
#

You don't have the time for DL methods

twin moth
#

Oh, Conv is a DL method?

#

So just try to do ML, stick with the highest percentage and show that ML is not an option for such a scenario?

hasty grail
#

in its formulation, not necessarily

#

but working models tend to be deep

hasty grail
#

Also DL is a form of ML, to be precise, so you should refer to them as "traditional ML methods" 😛

twin moth
#

lol, true

#

Got any traditional ML methods you'd recommend? 😛

iron basalt
#

IMO it's more like "common traditional ML methods"

#

And not the improved versions, some of them still have new variants popping up each year.

#

A "traditional" ML method (just based on time period it was invented) that could actually handle something like this problem would be Adaptive Resonance Theory methods. But very few people know of it.

#

(And it has many modern variants that drastically improve on the original models)

twin moth
iron basalt
#

The implementation is actually trivial which makes it very elegant, but it would take some reading.

#

(There are some python implementations on github I think)

#

(with numpy)

#

You don't have time for that either though, just stick to the course knowledge.

twin moth
#

I'd love some names if you know them off the top of your head

iron basalt
#

ART could have its own entire course, and many more for its variants. It builds on a lot of ideas that are much more neuroscience-ish (biologically plausible), which would take you down the rabbit hole of spiking neural models, and much more.

hasty grail
#

If you're going to do a presentation using out-of-class materials, you're probably going to be asked on them in Q&A

#

Better to stick to what you actually understand

iron basalt
#

There is an entire other community within ML that does biologically plausible models that are very much like real neural networks.

hasty grail
#

(Personally I jumped right into DL so won't be of much assistance in this situation)

twin moth
#

😩

iron basalt
#

You would need to understand DL too though, since DL is based on an idealization of old neuroscience from which you then can learn why the new neuroscience makes more sense and what you can do with it (how to improve upon DL).

twin moth
#

If I send you the list of all of the methods we were taught, would you be able to tell me what should be most fitting for our situation?

iron basalt
#

sure

twin moth
hasty grail
#

you mean DL?

twin moth
#

Nope, the current course is an introduction to DS

#

The next will be ML

hasty grail
#

huh

twin moth
#

So I guess that we'll learn ML in depth and maybe even DL

twin moth
iron basalt
#

TBH this task seems way outside the scope of your course. I have been told by others similar stories in which they get an ML task that is outside the scope of the course.

#

(unless the entire point is to show that the methods are insufficient)

hasty grail
#

Yeah, I mentioned that earlier

twin moth
#

Again, we came up with this task

iron basalt
#

Showing that the methods you learned do not work and why should be fine then. If grading were up to me I would give you full credit if you can give me all the reasons why and also show the best results you got.

twin moth
#

We were only told to think of a research question and try to answer it using DS

torpid pilot
#

anyone?

hasty grail
#

Don't ask to ask

#

Just ask your question, if anyone can help you they will answer

rotund dock
#

Hi guys! I have this data frame, Im trying to group by season and year how can I do it?

#

df.groupby('[Season','Date])['25900MS'].mean()

#

thats not working

#

got it.... P25900MS.groupby([P25900MS.iloc[:,0].dt.year, P25900MS.iloc[:,2]]).mean().reset_index()

keen root
#

Hi, I want to perform a multiclass classification. I have a very large dataset, and the number of inputs on the machine can easily extend beyond 1000. So far I've used scikit-learn's API, with the RidgeClassifier, but if I'm not mistaken this method relies on doing a lot of linear algebra, and if the number of inputs can get quite large I presume that the training time will scale up quite a bit. So I was thinking of maybe implementing a NN, maybe just a simple Perceptron, would that be better? Are there any advantages?

delicate yarrow
#

help

#

GOT IT! has to be .csv (I'm a newb sorry)

bold olive
#

How can I run out of memory with 3D CNN (TensorFlow Keras), even with a batch size of 1, when each of my images is only ~14mb in size?

#

This is both for CPU and GPU.

#

Model too big a possibility or something else?

thin kindle
#

Hello guys, I have code written with tensorflow 1.14, and I need to migrate to 2.0, but I don't know the equivalent of tf.contrib in 2.0. Can someone help me?

hasty grail
thin kindle
#

@hasty grail do you know the equivalent of tf.contrib in tensorflow 2.0?

hasty grail
#

If you can't find the function in tfa (TensorFlow Addons) then I'm afraid you're out of luck

#

Take a look at the source code of the original function and see if you can implement it yourself

keen kestrel
#

Could you share your experience in writing a new custom layer in Pytorch? I usually create a jupyter notebook and code the layer with dummy input so that I can get instant feedback if I mess up the dimensions. Maybe there is a better way?

hasty grail
#

np

shadow ridge
#

Who, where should I report this?
URL: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

hasty grail
#

no idea

shadow ridge
#

The site ahead may contain harmful programs

hasty grail
#

replace pandas-docs with just docs

austere swift
#

!d pandas.concat

arctic wedgeBOT
#
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
Concatenate pandas objects along a particular axis with optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

Parameters

**objs**: a sequence or mapping of Series or DataFrame objects. If a mapping is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.

**axis**: {0/'index', 1/'columns'}, default 0. The axis to concatenate along.

**join**: {'inner', 'outer'}, default 'outer'. How to handle indexes on other axis (or axes).... [read more](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html#pandas.concat)
austere swift
#

hmm

#

i think pandas site got hijacked actually

#

cus even old links that ive visited have that message

shadow ridge
#

I just submitted an issue on github

#

@austere swift thanks for the concat doc

austere swift
#

np

shadow ridge
hasty grail
#

Nice

lapis sequoia
#

No issue at all with safari

#

same on chrome :/

lapis sequoia
viscid dagger
#

does anyone know how to change the .jupyter directory location to somewhere else in linux

lapis sequoia
#

you mean change the software's directory or change the open folder location? :/

#

@viscid dagger

viscid dagger
#

no the config folder

#

if that makes sense

#

@lapis sequoia

lapis sequoia
#

ow yeah, you could make a shortcut to it

viscid dagger
#

the .jupyter directory in the home folder

lapis sequoia
#

yeah i figured

viscid dagger
#

no i actually wanna move it to another directory

#

cause my home directory is so cluttered

lapis sequoia
#

no i can't really help you with that, maybe someone else could, never had that desire so never faced that issue

devout scroll
#

Hey does someone know if it's possible to specify dtypes when writing a pandas dataframe to a feather file? I'm trying this because feather infers the wrong dtype for one of my columns, later resulting in an error. I only found this on stackoverflow which does not answer the question: https://stackoverflow.com/questions/41439564/is-it-possible-to-specify-column-types-when-saving-a-pandas-dataframe-to-feather

lapis sequoia
#

ow didn't mean to sound rude :/

viscid dagger
#

actually i found it

#

export JUPYTER_CONFIG_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/jupyter"

#

i have to set this env variable that jupyter uses

#

but thanks anyway for trying to help me out

#

@lapis sequoia

lapis sequoia
#

yeah sorry, but never faced this issue hence

misty flint
#

☹️

lapis sequoia
#

do you guys have vpns or something? or wtf is wrong with my computer?

shadow ridge
lapis sequoia
#

i tried on safari, chrome

shadow ridge
misty flint
astral path
#

is there a way to do a multiple regression of every column in a subset of a dataframe as a function of all the other columns? e.g. if I have a dataframe with columns "age", "pclass", "sex", "embarked", "fare", "sibsp", "parch", I would want to perform multiple regression of age as a function of "pclass", "sex", "embarked", "fare", "sibsp", "parch". pclass as a function of "age", "sex", "embarked", "fare", "sibsp", "parch", and so on...

lapis sequoia
shadow ridge
grave frost
astral path
#

I have several variables (some that take on values from 1-10000, some that only take on 1 and 0), and want to run multiple regression on each of these variables with values in the columns of a dataframe to find correlations between each variable

#

@grave frost

misty flint
#

pretty sure pandas.plot has parameters you can insert

#

X and Y

#

might be what youre looking for

astral path
#

woah

#

im getting that error too now

misty flint
astral path
#

ok 👍

misty flint
#

so many tabs ID_BoomKek

astral path
#

what would plot do in this scenario? isn't it just for plotting?

misty flint
#

why are half of those sound cloud

astral path
misty flint
#

second-hand stressed amegablobsweats

astral path
#

lol

young dock
#

so I did quantile regression two different ways, I'm confused why they are different

#

one is a straight line and the other is bumpy (to state the obvious lol)

#

former is gradient boosting regressor from sklearn with loss=quantile, and the latter is quantreg from statsmodels

#

Idk why they are different

charred umbra
#

Maybe it's because one of them treats the data as a time series? That would explain why it's wavy as opposed to just a regular linear regression

#

I honestly don't know
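
One possible explanation, offered as an assumption since the plots are not shown: both fits minimise the same quantile (pinball) loss, but statsmodels' `quantreg` fits a linear model, while a gradient boosting regressor is a sum of decision trees and so produces a piecewise-constant curve. On that reasoning one would expect the boosted fit to be the bumpy one and the `quantreg` fit to be the straight line. A plain-Python sketch of the loss they share:

```python
def pinball_loss(y_true, y_pred, q):
    """Mean quantile (pinball) loss for quantile level q in (0, 1)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        # Under-predictions are weighted by q, over-predictions by 1 - q.
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

y = [1.0, 2.0, 3.0, 4.0, 100.0]

# Among constant predictions, the loss for q = 0.5 is smallest at the median.
candidates = [1.0, 2.0, 3.0, 4.0, 100.0]
best = min(candidates, key=lambda c: pinball_loss(y, [c] * len(y), q=0.5))
print(best)   # 3.0
```

The loss is the same in both libraries; what differs is the family of functions each model is allowed to pick from.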

misty flint
outer fulcrum
#

Hey guys, does anyone here know Grafana well?

graceful scaffold
#
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Use the kfold cross validation to create two lists: train and holdout, which hold the indices of those elements of the X matrix
# that will be used for the training and holdout (validation) at each iteration (fold of the cross validator)

Cvals = [1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6]
k_fold = KFold(n_splits=5)

results_l2=[]
for C in Cvals:
    # instantiate a logistic regression with L2 penalty and the proper C value for this iteration of the loop
    model = LogisticRegression(penalty='l2',C=C)
    
    # collect the predicted y values and true y values of each hold out set
    predicteds=[]
    trueys=[]
    train=[]       
    holdout=[]    #WTF ARE THESE TWO LISTS FOR?
    for train, holdout in k_fold.split(X):  ##I ONLY HAD TO ADD THIS LINE
        model.fit(X[train],y[train])
        predicteds.append( model.predict(X[holdout]) )
        trueys.append( y[holdout] )
#

Can someone help me with this please?

#

idk if the for loop is OK
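
As far as the snippet shows, the loop looks fine: `KFold.split(X)` yields one `(train, holdout)` pair of index arrays per fold, and the `for` statement rebinds both names on every iteration, so the two empty lists defined just above it are never used. A pure-Python sketch of what the splitter hands back, assuming no shuffling (this mimics the idea and is not scikit-learn's actual code):

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, holdout) index lists for consecutive, unshuffled folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        holdout = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, holdout
        start += size

folds = list(kfold_indices(n_samples=10, n_splits=5))
for train, holdout in folds:
    print(train, holdout)
```

Each sample lands in exactly one holdout set, which is why collecting the predictions across folds gives one prediction per row of X.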

lapis sequoia
#

Hey anyone has a clue how i can select string index in pivot table?

#

^ pandas

#

i wanna do a project

iron basalt
keen root
#

I mean the number of neurons, or the number of parameters of the perceptron, which I believe to be the dimensionality of the input as you put it

iron basalt
#

To confirm, the dimensionality of the input in case of image input would be the dimensions of the image multiplied together (e.g. a grayscale image that is 32x32 pixels would be 32 rows * 32 columns * 1 channels = dimensionality of 1024).

#

@keen root

#

Is that what you mean?

#

(Not that you have an image as input, but something like that)

stray roost
#

Hi yall. I recently got into machine learning and AI. Can yall give me some interesting projects to try to finish without watching any tutorials?

#

I wanna test my skills and see how much can I do alone

iron basalt
iron basalt
#

yes

stray roost
#

I already watched a tutorial on that one hahah

#

I might try to redo it by myself tho

#

See how good is my memory

iron basalt
#

Then try something a bit harder, fashion-MNIST.

stray roost
#

I might check that one out.

#

Thank you man

stray roost
#

So basically I tried it and it overfits

#

while training, the acc was 90 and after in testing it was 30

#

how do I know what to change in my code

keen root
#

(Sorry for the delay)

iron basalt
stray roost
#

will do

iron basalt
#

If they do not work, try using a dimensionality reduction algorithm and then feeding that into an MLP.

keen root
#

That's pretty awesome, I didn't know that existed in scikit

iron basalt
#

The key thing it does for you is called softmax in case you want to learn more about it just search for softmax.

keen root
#

That's the generalization of the logistic curve, right?

iron basalt
#

generalization of logistic regression yes

keen root
#

Ok ok, awesome. Thank you. I'll give it a go then

iron basalt
#

generally if you have a multi-class prediction problem it's the go to
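
A small sketch of that relationship in plain Python: softmax turns a list of class scores into probabilities, and with two classes it reduces to the logistic (sigmoid) function of the score difference. The scores are made up.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Two classes: softmax([a, b])[0] equals sigmoid(a - b).
a, b = 1.5, -0.5
print(softmax([a, b])[0], sigmoid(a - b))
```

The predicted class is simply the index of the largest probability.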

stray roost
#

One quick question

#

So basically I decided to see how did other people make their fashion_mnist code, changed mine to be like theirs and it still overfits

iron basalt
#

Yup, at that point you gotta try harder. No nice out the box solution.

#

ML is kind of open-ended, lots of room for improvement.

#

*very

#

Just see what seems to work and what does not, and after that you have to get creative.

#

(Try to come up with reasons why it works or does not and then test those ideas / science)

storm lintel
#

anyone know how to find a hidden option on target? like it's OOS rn and i cant find the website's html for the add to cart button

charred umbra
#

The numbers MNIST can technically get 100% accuracy on a 20:80 test-train split

#

There is already a configuration of convolution and pooling that had achieved a 100% CCR

charred umbra
astral path
#

just a nice looking heatmap i made

stray roost
#

I might try it

misty flint
astral path
#

seaborn!

#

and pandas

#

but matplotlib is under the hood of seaborn

charred umbra
opal ferry
#

Not sure if this is widespread or has been asked a lot lately, but is chrome saying the pandas documentation site is "dangerous" for you guys?

tidal bough
#

yeah, people have noticed

#

it's pretty weird

opal ferry
#

...am I safe to still view the pandas docs?

tidal bough
opal ferry
#

I was just reading that, no real insight in the thread tho

tidal bough
#

lol, yeah, the pandas devs are really confused

velvet thorn
#

good thing I neither use Chrome nor read documentation 🥴

tidal bough
#

it shows for me in Firefox too

#

it's just pretty weird, only some paths are affected

lapis sequoia
#

@shadow ridge just got it too

#

date_range

prisma willow
#

New question

Regarding machine learning and AI i was wondering where is whats being learnt is saved/stored? Apparently MachineL can do a linear distribution but i don't get how the machine is learning anything, or with Test/train because nothing is being remembered, the program is just running an approximation on some data... Or with a Chess AI how does the program remember and train against itself? where would each trial be stored and in what format?

plain jungle
#

I made a post a few weeks ago of an AI playing the chrome Dino, now this is the frogger

magic panther
#

guys and girls, if I have a set of input parameters and I want to minimize one of them, how do i go about making an objective function? what do people do to find a relationship between my variables?

digital crescent
#

I would like to do a lot of realtime high-speed data analysis. One of my analysis techniques will probably use either fixed-length time series where old data is dropped off and replaced with new data or time series that grow as new data arrives. Will I be hindered by using Pandas? Should I try to focus exclusively on Numpy? Maybe focus on something else entirely? The new data will (at the start of the analysis) be entirely on a SQL server. New data will also arrive to the SQL server periodically.

#

I'm pretty new to Python. Just want to make sure that I'm practicing with the approaches, techniques, and packages that will serve me best over the long-run

iron basalt
#

When you say speed do you mean throughput or latency? @digital crescent

digital crescent
#

I've heard a bit in terms of certain packages used to interact with SQL servers being faster/slower, but I guess I was mostly concerned with me inefficiently manipulating the data during the analysis part

velvet thorn
iron basalt
#

Latency would be the time from new data entering the system (before it even gets to the database) to the output of the analysis being updated. Throughput would be how many points in the time series you can process per unit time.

velvet thorn
iron basalt
#

Optimizing for one is very different than the other.

digital crescent
#

If we go by Squiggle's definition, I'd say I'm mostly concerned with the speed of analysis

velvet thorn
#

whether you also need high throughput will depend on how much input is coming in at any one time

digital crescent
#

I think in general it will take much more time for me to analyze the data than to move it from SQL to Python

#

Thousands of new data points per second potentially but the analysis could involve as many computations as I wanted (so thousands, tens of thousands, hundreds of thousands, millions)

velvet thorn
#

what kind of analysis?

#

you might want to look into Spark

iron basalt
#

So you want low latency? You do not have it backing up in terms of how many points it can process? By that I mean does the database get more points than it gives to the analysis system (think like how much water is flowing into a container versus flowing out).

velvet thorn
#

in particular, Spark streaming

digital crescent
#

A mixture of things. Regressions, pattern recognition, random stuff with probability distributions

velvet thorn
#

do you want a managed solution?

#

(you probably do, right...)

digital crescent
velvet thorn
#

are you willing to spend $$?

digital crescent
#

But I would like the flexibility to basically decide to stop looking at one portion of the dataset and look at a completely different portion

digital crescent
velvet thorn
#

well

#

cheap, good, fast, choose 2

digital crescent
#

The SQL server will be on my computer, and I will be doing the analysis mostly in Python on my computer as well

#

(Just to clarify that the "server" isn't like a separate set of hardware)

velvet thorn
#

of course it depends on your exact requirements and it's defo possible to run analyses locally without incurring additional costs

#

but @ some point you might need to scale up

#

hard to say without knowing numbers

iron basalt
#

Then without spending anything I think a solution may be to have the data points go directly to the analysis system to reduce the latency of having to go through a database system. However, at the same time the data points are also sent to the database to be stored for later.

velvet thorn
#

but it would require some configuration and data engineering, at least

iron basalt
#

oh on the same system

digital crescent
#

I guess my main concern is just the Python tools I should use for analysis provided that the data I'm analyzing is constantly changing (i.e. changing in size, looking at a completely different set of data points, stuff like that)

#

And I see stuff like with Pandas that say that adding extra rows is super slow

#

And it makes me wonder about other concerns I should have

velvet thorn
#

pandas isn't really meant for data that changes

digital crescent
#

And which preferred approaches I should be considering for realtime analysis

velvet thorn
#

which is why I said

#

look into Spark streaming

#

which adds a lot of overhead

#

but shrugs

#

is a bit heavyweight for local usage, too

#

I mean

#

you could defo build abstractions around numpy that would allow you to do this kind of thing but
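A rough sketch of what such an abstraction around numpy could look like for the fixed-length case, where old data drops off as new data arrives. The class name `RollingWindow` and its methods are made up for illustration. The idea is to allocate the array once and overwrite the oldest slot, instead of growing a DataFrame row by row.

```python
import numpy as np


class RollingWindow:
    """Fixed-capacity buffer: new samples overwrite the oldest ones."""

    def __init__(self, capacity):
        self._buf = np.empty(capacity, dtype=float)  # allocated once, never resized
        self._capacity = capacity
        self._next = 0    # slot the next sample will be written to
        self._count = 0   # number of valid samples stored so far

    def append(self, value):
        self._buf[self._next] = value
        self._next = (self._next + 1) % self._capacity
        self._count = min(self._count + 1, self._capacity)

    def values(self):
        """Return the stored samples, oldest first, as a new array."""
        if self._count < self._capacity:
            return self._buf[:self._count].copy()
        # Buffer is full: the oldest sample sits at the next write position.
        return np.concatenate((self._buf[self._next:], self._buf[:self._next]))


window = RollingWindow(capacity=3)
for sample in [1.0, 2.0, 3.0, 4.0, 5.0]:
    window.append(sample)
latest = window.values()  # only the three most recent samples remain
```

Appending is constant time here, and `values()` hands back an ordinary array that any numpy routine can take.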

iron basalt
#

Yeah this is starting to sound like a database question, which could be asked in the databases section, there are database systems that are designed for quickly adding new types of data and such, but I am not exactly an expert on them.

velvet thorn
#

so

#

this goes back to the kind of analyses you need to perform

#

but yeah, probably Spark.

digital crescent
#

Huh. I didn't think this was really a databases question primarily, but then again I don't know much about these kinds of things

velvet thorn
#

you're basically asking "how do I construct an ETL pipeline -> data warehouse that will meet my needs?"

digital crescent
#

Does ETL include an analysis step?

velvet thorn
#

possibly, as part of the T step, but

#

depends on the complexity of the analysis

#

ideally that would come after

digital crescent
#

I'm mostly concerned with the speed of the analysis part than the speed of the "ETL" part

velvet thorn
#

the reason pandas is fast (relatively speaking) is that it holds the entire dataset in memory.

digital crescent
velvet thorn
digital crescent
#

But if I want to add 100 or replace 100 and look at it differently, I feel like I could run into problems with Pandas

velvet thorn
#

and how often will the subset of data to be looked at change?

#

relative to the number of analyses being run

iron basalt
#

The first step is to get upper bounds on things. You can't do as many computations as you want, computation is finite resource.

digital crescent
iron basalt
#

Even if those upper bounds are massive

velvet thorn
digital crescent
# velvet thorn and how often will the subset of data to be looked at change?

It will be a balancing act. I would like to produce updated analyses as fast as possible (ideally within a few seconds or a fraction of a second), so I'm aware that I won't be able to do the best analysis nonstop. But say I've got 5000 data points and add/replace 300. I would like to run some regressions or do some pattern recognition or generate some new probability distributions as fast as possible

#

But it will be constantly running. And the faster I can analyze, the better analysis I can do if the goal is to produce an updated analysis on a rolling, realtime, and almost infinite basis

velvet thorn
#

if it's anything complex like regression analysis, you don't have enough compute for that

#

nowhere near

digital crescent
#

I'm trying to think in my head how many computations would be required to do a simple linear regression with 10,000 coordinate pairs

#

Probably a lot

velvet thorn
iron basalt
digital crescent
#

Might not take a lot of memory though

velvet thorn
#

no, 10,000 is very small

#

but yeah, just try it.

digital crescent
#

I mean, I can almost count them in my head

iron basalt
#

Just make sure you are measuring it correctly.

digital crescent
#

Yeah. I was abstracting it into the stuff I would do on paper which obviously isn't the same as what goes on in a computer

#

But again, I feel like this is somewhat beside the point, right?

iron basalt
#

A simple timing can tell a lot
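As a sketch of the "just try it" advice, this times one simple linear regression on 10,000 synthetic coordinate pairs. The data is made up (a line with slope 3 and intercept 2 plus noise) and `np.polyfit` is only one of several ways to fit a line. The elapsed time depends on the machine, so no number is claimed here.

```python
import time

import numpy as np

# Synthetic data: 10,000 points scattered around y = 3x + 2 (made-up values).
rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=10_000)
y = 3.0 * x + 2.0 + rng.normal(0, 1.0, size=x.size)

# perf_counter is the clock intended for measuring short durations.
start = time.perf_counter()
slope, intercept = np.polyfit(x, y, deg=1)
elapsed = time.perf_counter() - start

print(f"slope={slope:.3f} intercept={intercept:.3f} took {elapsed * 1e3:.2f} ms")
```

For a steadier measurement, repeat the fit many times (for example with the `timeit` module) and look at the best or median run rather than a single one.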

digital crescent
#

Regardless of the analysis, if the analysis is taking up the bulk of the time (and not the data-fetching part), is there a generally preferred way to handle the data and intermediate calculations in Python? Or is it really not that simple? Like if I give you 10,000 data points and tell you that every so often the analysis will randomly be performed on a somewhat differently sized database and sometimes the analysis itself will be slightly different, what tools do you use to run the analysis? Not Pandas? Yes to Pandas? Only Numpy? Python Lists?

velvet thorn
#

it depends on the analysis.

#

but numpy is generally faster

digital crescent
#

I guess I'm just worried that Pandas seems almost useless if speed is at all an issue if you decide to add some data to your existing dataset

velvet thorn
#

pandas can provide better abstractions though

velvet thorn
iron basalt
#

It's not that simple, but like gm wrote, there are definitely some things NOT to do. Python itself is pretty slow so all the speed must come from the C libraries.

velvet thorn
#

they have the exact same problem.

digital crescent
#

Is there even an efficient way to handle data of a changing size or is that kind of a problem that can't be solved?

velvet thorn
#

but

#

that is not necessarily the correct question

digital crescent
velvet thorn
#

"how often will the dataset size change?"

#

like let's say

#

running analyses takes 15s

#

then you change once

#

and that takes 0.5s

digital crescent
#

Gotcha. Thanks. You guys have given me some stuff to consider

#

It is almost like I should just spend more time thinking about ways to efficiently structure stuff with the tools I have rather than look for a tool that magically solves these problems

velvet thorn
#

data engineering is an art

#

not one I particularly like, but it is important

iron basalt
#

No library can magically overcome the limitations set by the hardware itself. In general, the less you know up front (which types of data you will have, how much, etc), the slower the solution will be, but with the trade-off of hack-ability / extensibility.

digital crescent
#

Here is an example of what I mean (not necessarily one that applies to my project but I think it is in the same line of thought): Imagine your dataset for analysis will be anywhere from 5000 - 6000 rows. Maybe you could just make a 6000-row Pandas table and fill in the ones you don't need with zero or something like that. Or track the rows that aren't being used. And then have some kind of flag to ignore the portion of the vector calculations done on the unneeded rows

#

Something like that

velvet thorn
#

or you could also use a numpy masked array

#

which I think might be a more appropriate abstraction, BUT
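A small sketch of the masked-array idea applied to the "allocate the maximum and ignore unused rows" suggestion. The capacity of 6 and the sample values are invented for illustration. Masked entries are skipped by reductions such as `mean`, so the padding zeros do not distort the result.

```python
import numpy as np

capacity = 6                      # hypothetical maximum number of rows
data = np.zeros(capacity)
data[:4] = [10.0, 20.0, 30.0, 40.0]

# Track which rows currently hold real data.
in_use = np.zeros(capacity, dtype=bool)
in_use[:4] = True

# In a masked array, mask=True means "ignore this entry", hence the inversion.
masked = np.ma.masked_array(data, mask=~in_use)

mean_used = masked.mean()         # computed over the 4 live rows only
plain_mean = data.mean()          # for comparison: includes the 2 padding zeros
```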

digital crescent
#

Not saying this is the best way to do things. But it is what I mean in terms of just coming up with better solutions rather than looking for better tools

velvet thorn
#

shrugs

velvet thorn
#

this will boil down to performing a filtering operation at the start of every set of analyses, probably

#

because you want to retrieve the subset that you'll use

iron basalt
digital crescent
#

Just it doesn't seem like this stuff is written out anywhere. Or at least I don't see a good guide saying "this is how you should do x/y/z analysis in Python if you want to do it quickly"

#

Which is fine

#

I just wanted to make sure that I wasn't missing anything obvious

#

I.e. If it was as simple as like "oh, realtime data analysis with changing amounts of data? Do/don't use pandas"

#

Or do/don't use numpy

#

But I think you guys have gotten me somewhere 🙂

#

So thanks

velvet thorn
#

yw

atomic obsidian
#

Is sql ever going to become obsolete due to libraries in languages or is learning it valuable?

prisma willow
#

python implements an sql package

velvet thorn
#

probably, in a hundred years+?

iron basalt
#

SQL might, but relational databases probably not (speculation). It will probably stick around for a long time in case anyone needs to manually query things.

velvet thorn
#

but the fact that ORMs can abstract away the need to know raw SQL is not in itself a reason not to learn SQL

prisma willow
#

question
what does it mean when people say SQL Server and SQL Developer have their own database? Aren't we, the users, creating the database? what does it mean to come with one?

#

im using database synonymous with tabular data

storm lintel
#
```python
tag = self.soup.body.find('div', class_='fulfillment-add-to-cart-button')
if tag and 'add to cart' in tag.text.lower():
    self.alert_subject = alert_subject
    self.alert_content = f'{alert_content.strip()}\n{self.url}'
```
#

anyone see anything wrong with this?

#

its saying its in stock but its not idk if the code is messed up or sum

#

im not making a auto buy bot btw

misty flint
plain jungle
#

SQL is going to stick around for a long time because of the same reasons that Java is sticking around. Theres better languages for the job, but so many businesses already use it that they'd never think of not using it #Mongo

#

That being said, SQL definitely does have times where it shines and if you are looking to get into it with python, try SQLite3
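For reference, a minimal sketch of the standard-library `sqlite3` module, using an in-memory database so nothing is written to disk. The table name `readings` and its rows are made up for the example.

```python
import sqlite3

# ":memory:" creates a throwaway database that lives only in RAM.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, value REAL)")

# Parameterised inserts: the ? placeholders keep data separate from the SQL text.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("a", 1.5), ("a", 2.5), ("b", 10.0)],
)

# Plain SQL does the aggregation; Python just receives the rows as tuples.
rows = conn.execute(
    "SELECT sensor, AVG(value) FROM readings GROUP BY sensor ORDER BY sensor"
).fetchall()
conn.close()
```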

prisma willow
#

@misty flint spill the beans

#

@misty flint ppl cant better themselves with inside jokes

lapis sequoia
#

Hello everyone,

I am having an issue with RollingOLS from statsmodels.
```python
mod = RollingOLS(Y, X, window=75, min_nobs=None, expanding=True)
fit = mod.fit()
```

#

When i want to get the AIC

#

i get a list of multiple values

#

and i think my X and Y are in the wrong format

#

since i have X and Y two lists

#

of numbers

#

How should i proceed ?

misty flint
misty flint
odd aspen
plain jungle
#

lol, I mean just as Fortran and COBOL are still around for banks, SQL is here to stay for a while @misty flint

lapis sequoia
#

1D

#

containing numbers

#

np.size(X) give 529

#

and np.size(Y) give 529 too

#

@odd aspen Do i have to use a panda dataframe ?

#

shape gives (529,)

austere swift
#

How would i put the labels for the bars in a matplotlib bar graph above the bars themselves

#

something like this with the "text field" being the label
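One possible way to do this, sketched with made-up bar names and heights: place a text object at the top centre of each bar with `Axes.text`. The `Agg` backend is selected only so the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

names = ["a", "b", "c"]   # made-up categories
heights = [3, 7, 5]       # made-up values

fig, ax = plt.subplots()
bars = ax.bar(names, heights)

labels = []
for bar, height in zip(bars, heights):
    # x: horizontal centre of the bar, y: its top edge.
    text = ax.text(
        bar.get_x() + bar.get_width() / 2,
        bar.get_height(),
        str(height),
        ha="center",   # centre the text horizontally on that x position
        va="bottom",   # sit the text just above the top of the bar
    )
    labels.append(text)
```

Calling `fig.savefig(...)` or `plt.show()` afterwards renders the chart. Recent matplotlib releases also provide `Axes.bar_label` for the same purpose.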

misty flint
#

the more i use matplotlib, the more i realize i hate it

keen kestrel
#

I use altair for statistical plot, the API is easier to remember

tight jewel
still salmon
#

I want to specify in pandas dataframe timezone as CDT IST EDT etc , instead of Region/Country, is there a way to do that?? All examples I came across specify timezone as Region/Country Ex - "Africa/Douala"

austere swift
keen kestrel
#

It looks like it is made in excel lol

bold olive
honest adder
#

not quite sure where to put this

#
from MTM import matchTemplates
import cv2
r10 = 'out2.png'
lt = [('small', r10)]
image = 'rank101.png'
Hits = matchTemplates(lt, image, score_threshold=0.5, method=cv2.TM_CCOEFF_NORMED, maxOverlap=0)
#

MTM is a library for template detection, so i don't have to fuss about with more code than i have to

#

but, it keeps outputting AttributeError: 'str' object has no attribute 'dtype' at line 7

#

For the example, they use coins from skimage.data (from skimage.data import coins)... how do i make my image have a dtype

iron basalt
#

Basically it wants the actual image data, not a string telling it where to find the image data.

hasty grail
analog pike
#

Don't know if this is the right place but can someone tell me why this is throwing up an error/what I can do to fix it

fierce shadow
analog pike
#
```python
x = 0
index = ufos[ufos['country'] == 'us']
for value in index['datetime']:
    if ("24:" in value):
        value.replace('24:00', '00:00')
        index.iloc[x,0] = value
    x+=1
```
there we go
#

pd.to_datetime doesn't like when values are as 24:00 but im having trouble reassigning the values back into the dataframe once i change 24:00 to 00:00
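Two things seem to go wrong in the loop above: `str.replace` returns a new string instead of changing `value` in place, and `index` is a filtered copy, so writing into it does not reliably update `ufos`. A vectorised sketch that avoids both, assuming the column holds strings in a month/day/year hour:minute layout (the three sample rows are invented):

```python
import pandas as pd

# Invented sample rows; the real CSV layout is an assumption.
ufos = pd.DataFrame({
    "datetime": ["10/10/1949 20:30", "10/11/2006 24:00", "10/12/2001 24:00"],
    "country": ["us", "us", "gb"],
})

is_us = ufos["country"] == "us"

# Replace on the whole column at once and assign back through .loc,
# so the change lands in ufos itself rather than in a temporary copy.
ufos.loc[is_us, "datetime"] = (
    ufos.loc[is_us, "datetime"].str.replace("24:00", "00:00", regex=False)
)

parsed = pd.to_datetime(ufos.loc[is_us, "datetime"], format="%m/%d/%Y %H:%M")
years = parsed.dt.year
```

Strictly speaking, 24:00 means midnight at the end of that day, so swapping in 00:00 moves the timestamp back by a day. That only matters for the year on 31 December rows.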

bold olive
fierce shadow
#

oh okay

bold olive
#

Hang on, I'll get the model summary!

bold olive
# hasty grail Could you print the summary?
```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv3d (Conv3D)              (None, 222, 126, 222, 32) 896       
_________________________________________________________________
activation (Activation)      (None, 222, 126, 222, 32) 0         
_________________________________________________________________
max_pooling3d (MaxPooling3D) (None, 111, 63, 111, 32)  0         
_________________________________________________________________
conv3d_1 (Conv3D)            (None, 109, 61, 109, 32)  27680     
_________________________________________________________________
activation_1 (Activation)    (None, 109, 61, 109, 32)  0         
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 54, 30, 54, 32)    0         
_________________________________________________________________
flatten (Flatten)            (None, 2799360)           0         
_________________________________________________________________
dense (Dense)                (None, 32)                89579552  
_________________________________________________________________
activation_2 (Activation)    (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
_________________________________________________________________
activation_3 (Activation)    (None, 1)                 0         
=================================================================
Total params: 89,608,161
Trainable params: 89,608,161
Non-trainable params: 0
_________________________________________________________________```
#

And even with a batch size of 1, I get this OOM error during the first epoch:

hasty grail
#

yeah that definitely looks too big

bold olive
hasty grail
#

200 cubed is huge

bold olive
#

What can I do to avoid this? I mean I have to process the images somehow.

#

Does it have to do with how I am shaping my data?

hasty grail
#

it basically means that your data is too large

#

what is the kernel size of each CNN layer?

bold olive
#

The tensor (or X) you mean?

#

Because each image file is only around 14mb

hasty grail
#

yes

#

because when you input it into a CNN layer

#

inside the convolution operation, your 3d image is multiplied by each value in the kernel, resulting in a total size = (size of image) x (size of kernel)

#

kernel size, not number of channels

bold olive
#

Sorry - (3,3,3)

hasty grail
#

ok that's the smallest as it's going to get

bold olive
#

Yeah

hasty grail
#

so yeah, probably your input is too large

bold olive
#

Okay so how do I make this work now basically?

hasty grail
#

you'll need to downsample it before the CNN layers

bold olive
#

I have 38 images, each with a dimension of (766, 200, 760)

#

After creating X and resizing, the shape of X is (39, 224, 128, 224)

#

Then I add one more channel to make it a 5D tensor fit for Conv3D so it becomes (39, 224, 128, 224, 1)

hasty grail
#

you'll have to resize your images to be even smaller

bold olive
#

Hmm, perhaps (128, 64, 128) will do the trick?

hasty grail
#

try it

bold olive
#

There is an example of volumetric MRI image classification on Keras

#

They are using the same size without any problems

hasty grail
#

Yeah

bold olive
#

Yes, seems to work now!

#

Need to increase the accuracy but that's another issue

#

Great, so the cubed size was far too large

hasty grail
#

yeah, by halving each dimension you're now using 1/8th of the original memory
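The sizes in the summary above can be checked with a bit of arithmetic. This sketch assumes what the printed shapes imply: 3x3x3 kernels with no padding, stride 1, pooling by 2, 32 filters and float32 values. The helper names are made up.

```python
def conv_out(n, kernel=3):
    """Output length along one axis for an unpadded, stride-1 convolution."""
    return n - kernel + 1


def pool_out(n, pool=2):
    """Output length along one axis after max pooling by `pool`."""
    return n // pool


def flattened_size(shape, filters=32, blocks=2):
    """Number of values reaching Flatten after `blocks` conv+pool stages."""
    dims = tuple(shape)
    for _ in range(blocks):
        dims = tuple(pool_out(conv_out(d)) for d in dims)
    depth, height, width = dims
    return depth * height * width * filters


big = flattened_size((224, 128, 224))     # the input size that ran out of memory
small = flattened_size((128, 64, 128))    # the reduced input size

# Dense(32) needs one weight per flattened value per unit, plus 32 biases.
dense_params_big = big * 32 + 32
dense_params_small = small * 32 + 32

# Memory for the first conv layer's output alone, per sample, at 4 bytes per float.
first_activation_bytes = conv_out(224) * conv_out(128) * conv_out(224) * 32 * 4
```

`dense_params_big` reproduces the 89,579,552 shown for the dense layer. The reduction is close to, but not exactly, a factor of 8, because each convolution trims the edges.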

short heart
#

can somebody help wiht sklearn

floral flare
#

Any Idea why this may be happening (The red letters is the heuristic algorithm used to expand Manhattan distance as cost, Misplaced Tiles as cost, and BFS as cost)

#

U can see that 3rd last for manhattan and the 4th last for tiles has a drop in nodes expanded

#

whereas it keeps going up for BFS

velvet thorn
#

@bold olive how thick is your FC layer?

#

...how many neurons does your Dense layer have?

#

man that was incoherent

floral flare
#

how thick xd lol i like that

velvet thorn
#

it should be "how wide"

#

🥴

floral flare
#

Ah

bold olive
#

"thick" is alright with me joe_maverick

bold olive
thin remnant
#

I have a dataset that contains names of natural reservoirs. I've also got a few cols about Area but they don't seem useful for my situation since I need longitude and latitude... I found this website and filled in the names of a few reservoirs and it seems to return the right longitude and latitude. Now since the dataset is quite big, I obviously don't want to do this manually. Could someone give me some help with how I can make a python script that just requests this inside the script for each name in the dataset?

hexed parrot
#

Can i get a seed from a picture's pixel colors? i mean if you generate 256x256 random color values and you can do it with a seed, can you reverse it? you input a 256x256 image and get the seed?

grave frost
#

why don't you just generate an array that way?

hexed parrot
#

but can you recover the seed with the list of generated numbers?

#

or there is no way?

grave frost
#

Off the top of my head, you could maybe Bruteforce if a dedicated function to do that is not available.
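A sketch of the brute-force idea, assuming the values came from Python's own `random` module and the seed is a small integer. Both are assumptions; with a different generator or a large seed space this will not work. The function names are made up.

```python
import random


def generate(seed, n=16):
    """Produce n pseudo-random byte values (0-255) from an integer seed."""
    rng = random.Random(seed)
    return [rng.randrange(256) for _ in range(n)]


def recover_seed(values, max_seed):
    """Try every seed below max_seed; return the first that reproduces values."""
    for seed in range(max_seed):
        if generate(seed, len(values)) == values:
            return seed
    return None  # no seed in the searched range matches


target = generate(1234)               # pretend these are the observed pixel values
found = recover_seed(target, 10_000)  # search seeds 0..9999
```

Only the first few values need comparing to rule out a wrong seed, so the search cost is driven by the size of the seed space, not the size of the image.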

honest adder
rotund rampart
#

hello guys, i want to create a model using Keras or Tensorflow that synthesizes two images, joining a body to a head. I don't know much about deep learning and honestly I'm a little lost. Any tips on how to get started?

grave frost
#

Can you elaborate on what exactly you want to do?

rotund rampart
#

input 1

#

input 2

#

output

#

i dont know exactly how to describe this

#

Ignore the colors. The first two I took a picture of the computer screen

rotund rampart
# rotund rampart output

in this case that was made with photoshop. but it takes a looong time to edit and thats why im searching a way to do this with deep learning

lapis sequoia
#

would anyone review my recently worked on neural network?

#

Also how to backpropagate

woeful estuary
lapis sequoia
#

python

#

As in pure python with a few simple modules like math and random

#

So not really any framework and from scratch

woeful estuary
#

Oh, not even numpy?

#

I think i can't help with this one

#

Try asking someone else

lapis sequoia
#

Nope not even numpy. That's alright, thank you for taking an interest 🙂

misty flint
#

from scratch... NervousSip

analog pike
#

what's the easiest way to remove the time from a column in pandas. The csv i downloaded goes from 0:24 inclusive and it's kinda messing with to_datetime

#

And I only need the year anyways

acoustic forge
#

So I'm working on a project with a couple of friends. We want to be able remove the background of a picture (usually, but not necessarily a portrait), somewhat like remove.bg. Would PyTorch be a good fit for this project?

nova widget
#

@analog pike you slice with loc

acoustic forge
#

@nova widget Yeah, that suggestion that he made doesn't work

misty flint
#

that link might not work but i would go down the route of opencv @acoustic forge

#

they have some useful modules in their documentation

acoustic forge
#

Okay, I'm gonna check it out @misty flint. Thanks 🙂

iron basalt
lapis sequoia
#

will do!

#

thanks

#

It's rather long and might be difficult to understand. I only know a limited amount about the math behind neural networks, backpropagation and so forth. This is my attempt so far.

#

I have been working on another that uses a genetic/fitness approach.

iron basalt
#

How familiar are you with linear algebra?

lapis sequoia
#

somewhat familiar, only what I know from studying it in maths education

#

but I think most of the math behind machine learning is beyond me.

#

I can understand what sigmoid and other activation functions do

iron basalt
#

Ok, my first note is that this code is much smaller and simple if you make use of matrices.

#

(Which is their entire purpose)

lapis sequoia
#

I just thought I'd give it a try! Still not sure if it works as intended but we can hope.

#

Right that makes sense.

#

I used a one dimensional array for most of the weights etc

#

thankyou

iron basalt
#

You can simply implement matrix multiplication and transpose yourself, it does not need to be fast.

#

As long as you get the idea
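A minimal pure-Python sketch of that suggestion, using lists of lists as matrices. The layer layout (inputs as a row vector, weights as an inputs-by-outputs matrix) is one common convention picked for illustration, not necessarily the layout in the code being reviewed.

```python
import math


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply matrix a (r x n) by matrix b (n x c)."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    b_cols = transpose(b)
    # Each output cell is the dot product of a row of a with a column of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, weights):
    """One layer: a row vector (1 x n_in) times weights (n_in x n_out), then sigmoid."""
    return [[sigmoid(v) for v in row] for row in matmul(inputs, weights)]


product = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
flipped = transpose([[1, 2, 3], [4, 5, 6]])
activations = forward([[0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]])
```

With this, a whole layer is one `matmul` call, so there is no loop over individual neurons in the network code.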

lapis sequoia
#

that sounds like a good idea. I do see what you mean. Then I wouldn't have to loop through each neuron individually?

iron basalt
#

yeah, that's why matrices are cool, they make everything easier to think about and code, since you are thinking at a higher level.

#

By that I mean like as in programming higher level.

#

Like assembly vs python

lapis sequoia
#

ohh right I see what you mean now. They sound super cool actually! I will try it out thanks

#

that will be useful

iron basalt
#

It's also why they were invented in math, nobody wants to manually juggle all those numbers.

lapis sequoia
#

It is quite annoying and one of the problems that took me the longest. I have reworked it a few times!

#

That may be a much better approach

iron basalt
#

A sign that you may not be doing things the best way is when your objects are too small. For example, a neuron does not really need to be its own object unless you intend to create a neuron by itself outside of a neural network. Another sign is when something never exists by itself, only in a group / cluster. Rather than making it its own object, just have the data held by the object that manages the group / cluster.

#

Generally you will always be working with groups of neurons.

lapis sequoia
#

I see what you mean. It would be much simpler to store the all the weights inside a matrix in a single layer object than a neuron. I will try playing around with different lists to see what I can do. You are right, I do not plan to use a single neuron on its own. Thankyou for that explanation!

iron basalt
#

Btw that group vs single thing idea applies to pretty much all programming.

#

(Computers like groups of things)

#

(Contiguous)

lapis sequoia
#

thankyou for that advice! That is the sort of thing that will really help me improve.

#

Groups do seem to be used a lot in programming, list logic is essential to a lot of software it seems. Or at least, it is used often for challenges etc

#

Thank you so much for all your help, it has been really helpful 🙂

#

I might redo the neural network using a different method with matrices now, thanks 🙂

iron basalt
#

On line 101, you use this count = 0. To keep track of the current layer index correct?

thin remnant
#

im looping over a geo API call and want to append some of the json results to my dataframe. Since i select only latitude and longitude out of the json results i use the selecting technique response['latitude']. The thing is, for some responses there is no 'latitude' value. How can i keep my code from crashing and just continue to the next record instead?

iron basalt
#

@lapis sequoia

lapis sequoia
#

Oh yes pretty sure I do

#

let me check

#

That is correct

#

It doesn't need to be 1 I don't think as the number of the weights for one neuron in one layer of the neural network should always equal the number of neurons in the next

iron basalt
#

@thin remnant Use python dict's get function, you can set an optional return value for when there is no entry .get(key, ret_val_when_not_there), e.g. lat = response.get('latitude', None) ... if lat is None: ...

#

@lapis sequoia Use enumerate instead, also the if statement if(count != len(self.layers)): will always be True.

lapis sequoia
#

ohh right that is a good idea thankyou I will try that!

#

Oh I see what you mean about the if statement...

#

thank you 🙂

#

As I am using len() rather than it counting from 0

iron basalt
#

on 117 if(i + 1 != len(self.layers)): you are using this to make it only loop up until the layer before the last right?

lapis sequoia
#

Yes that is also correct

#

as the neurons in the last layer do not need weights

iron basalt
#

just change the range then

lapis sequoia
#

they are not connected to neurons in the next layer

iron basalt
#

for i in range(0, len(self.layers)): to for i in range(0, len(self.layers) - 1):
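A small sketch of both suggestions with a stand-in list of layer sizes (the `[4, 3, 2]` values are hypothetical): `enumerate` replaces the manual counter, and a shortened range stops before the last layer, so no `if` is needed inside the loop.

```python
layers = [4, 3, 2]  # hypothetical neuron counts per layer

# enumerate yields (index, item) pairs, so no separate count variable is needed.
indexed = [(index, size) for index, size in enumerate(layers)]

# Stop one short of the end: the last layer has no outgoing weights.
connections = []
for i in range(len(layers) - 1):
    # Each neuron in layer i needs one weight per neuron in layer i + 1.
    connections.append((i, layers[i], layers[i + 1]))

# An equivalent pairing without indices: zip each layer with the next one.
pairs = list(zip(layers, layers[1:]))
```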

lapis sequoia
#

rightttt I see

#

that would also work very well

#

I do not think about these things sometimes that is a nice and simple solution!

thin remnant
#

squigle

#

still the same

iron basalt
#

You got a list index out of range, so data is an empty list, check to make sure len of data is greater than zero.

#

so data['data'] is the actual data

#

which is a list

#

and it was empty, but you tried to access the element at index 0.

static grail
thin remnant
#

jupyter notebook xd

iron basalt
#

@lapis sequoia Python has a bunch of loop control that allows you to avoid having to put if statements inside loops to control where they loop.

thin remnant
#

@iron basalt rip

#

this gives more error

lapis sequoia
#

@thin remnant you are trying to access the index using a string datatype

iron basalt
#

data['data'] is a list, and data['data'][0] seems to also be a list (i'm guessing length 2 for lat and long).

#

I recommend trying to print out the types of things

#

e.g. print(type(data['data'][0]))

lapis sequoia
#

These features seem really useful, I will have a think next time about how I can use loop control instead!

#

Also I found printing literally every variable helps

#

I have a "debug" mode boolean that I can enable and disable to print everything

#

Sometimes it can be helpful

thin remnant
#

ive tried some things but couldnt figure it out

iron basalt
#

You should also be able to print the json itself probably or save it to a file. Then analyze it.

thin remnant
#

this is what i did to check stuff

#

you have any idea how to check if lat en lon are there and otherwise just not care instead of crash xd

iron basalt
#

can you display just data for me?

#

the whole thing

#

or is it really long?

thin remnant
#

gimme a sec, ill make a picture

lapis sequoia
#

You can check whether a given key exists in a dictionary using:

#
```python
if key in dictionary:
    print("will execute if this key is present")
```
thin remnant
iron basalt
#

My hunch is that since you are in a Jupyter notebook it may be an out of order cell execution thing (or other), create a new file on your pc and run the script in there to make sure there's nothing strange going on with the notebook.

#

It could also be that not all responses are the same. Some could be giving different structures.

#

I would wrap the loop in a try/except and on error print the current and previous data to compare a valid and an invalid response.

thin remnant
#

My linux is rebooting, the window froze

iron basalt
#

You are using a vm?

thin remnant
#

no

#

i run linux as main

#

anyway, this is what it looks like when i run the script

iron basalt
#

ok can you print data?

#

or display it somehow

thin remnant
#

thats gonna give me a huge output since its a loop

#

but i got an idea

#

i can just print the index number

#

and then next run just print the data of that index where it stopped

iron basalt
#

yeah

thin remnant
#

im jsoning it real quick

#

sec

#

it's weird

iron basalt
#

So, yeah, there is no guarantee for anything, just gotta do a ton of checks on everything. A bunch of if key in x, if len(y) > 0, and maybe even if isinstance(z, (typeA, typeB, ...)).
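
As a sketch, those checks could be collected in one helper. The field names `latitude` and `longitude` and the overall payload shape are assumptions here, not something confirmed by the API docs:

```python
def get_lat_lon(payload):
    """Return (lat, lon), or None if the payload does not contain them."""
    if not isinstance(payload, dict) or "data" not in payload:
        return None
    entries = payload["data"]
    if not isinstance(entries, list) or len(entries) == 0:
        return None
    first = entries[0]
    if not isinstance(first, dict):
        return None
    lat = first.get("latitude")
    lon = first.get("longitude")
    if not isinstance(lat, (int, float)) or not isinstance(lon, (int, float)):
        return None
    return lat, lon


print(get_lat_lon({"data": [{"latitude": 1.0, "longitude": 2.0}]}))
print(get_lat_lon({"data": [[]]}))
```

Returning `None` instead of raising lets the calling loop skip bad records and carry on.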

thin remnant
#

sometimes it stops faster

iron basalt
#

Yeah it's random it seems.

thin remnant
#

sometimes it stops at index 5 and sometimes index 9

#

it shouldnt be random haha

iron basalt
#

Server is not always giving the same thing.

#

Why not just use geopy though?

thin remnant
#

i dont know how xd

#

is it easy ?

iron basalt
#

yes

#

Much easier than doing all this.

thin remnant
#

i think i have the right checks to make it work now

iron basalt
#
>>> from geopy.geocoders import Nominatim
>>> geolocator = Nominatim(user_agent="specify_your_app_name_here")
>>> location = geolocator.geocode("175 5th Avenue NYC")
>>> print(location.address)
Flatiron Building, 175, 5th Avenue, Flatiron, New York, NYC, New York, ...
>>> print((location.latitude, location.longitude))
(40.7410861, -73.9896297241625)
>>> print(location.raw)
{'place_id': '9167009604', 'type': 'attraction', ...}
thin remnant
#

ill take a look at geopy later this evening

#

mmm

#

wow

#

that looks pretty easy yea haha

#

ill give it a shot i guess

iron basalt
#

They did what you are doing right now, but wrapped it up for you with a bow tie.

thin remnant
#

haha your sample code does look easy yea

#

but what is the max amount of calls ?

#

cause im doing it for each record in a dataset

iron basalt
#

Depends on which site you choose

#

They chose Nominatim in this example.
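
Whatever the provider's limit turns out to be, a per-record loop usually benefits from spacing out calls and caching repeated addresses. The sketch below uses only the standard library and a hypothetical `fake_geocode` stand-in rather than a real service. geopy also ships its own helper, `geopy.extra.rate_limiter.RateLimiter`, which may be the simpler choice in practice:

```python
import time


def make_throttled(func, min_delay_seconds=1.0):
    """Wrap func so calls are spaced out and repeated queries hit a cache."""
    cache = {}
    last_call = None

    def wrapper(query):
        nonlocal last_call
        if query in cache:
            return cache[query]
        if last_call is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
        last_call = time.monotonic()
        cache[query] = func(query)
        return cache[query]

    return wrapper


calls = []


def fake_geocode(address):
    """Hypothetical stand-in for a real geocoding request."""
    calls.append(address)
    return (0.0, 0.0)


geocode = make_throttled(fake_geocode, min_delay_seconds=0.05)
coords = [geocode(a) for a in ["A St 1", "B St 2", "A St 1"]]
print(len(calls))  # the repeated address is served from the cache
```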

thin remnant
#

my dataset has 5000 +- records

#

so i used positionstack

#

but ill take a look into those things as well

#

thanks a lot for the help though!

iron basalt
#

It can use google's geocoding

#

I assume there is a paid tier for that or something

thin remnant
#

yea its with a lot of these geocoding sites

#

almost all of them

iron basalt
#

free tier probably too and probably a lot of requests, because google is big

thin remnant
#

positionstack didn't have very good docs imo but their calls are at least free

#

if you do some filtering yourself..

iron basalt
#

here is the list of all the ones it has

thin remnant
#

Thanks

#

you need to receive a reward haha

#

Can I send you a trophy or sth ? 😄

iron basalt
#

np, I gain more practice from this stuff.

thin remnant
#

me too 😄

#

i wouldnt have ever touched this stuff if it werent for my gf

#

she doesn't know anything about datascience and has to do data analysis/linear regression in SPSS

#

and so she doesnt know data cleaning or python at all

#

and the school didnt give her the data

#

so that sucked haha

#

But I knew it was possible in python so i wanted to try it

arctic wedgeBOT
cerulean spindle
iron basalt
#

@thin remnant I recommend just trying to look on https://pypi.org/ to see if there is already a package for what you are trying to do, then go to their github page and see if the README has a simple example. If it seems overly complex for what you are trying to do then try doing it yourself.

analog pike
#

so im still having troubles with my csv, now its throwing a SettingWithCopyWarning

#
x = 0
for item in index['datetime']:
    index.iloc[x,0] = item[:-6]
    x+=1
#

im just trying to shave the end of a string in each row of a column in pandas

#

since datetime isnt cooperating
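
For trimming every string in a column, pandas' vectorised `.str` accessor avoids the row-by-row loop entirely. A small sketch with made-up rows, assuming each entry ends in a space plus `HH:MM` (six characters):

```python
import pandas as pd

# Hypothetical sample; the real column comes from the CSV.
df = pd.DataFrame({"datetime": ["10/10/1949 20:30", "10/10/1955 17:00"]})

# Vectorised slice: drop the last 6 characters (" HH:MM") from every row.
df["date_only"] = df["datetime"].str[:-6]
print(df)
```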

astral path
#

If I have a column in my dataset which contains short string descriptions using keywords, how could I include that in a heatmap/correlogram to show relationships between the keyword and other variables? e.g. I could use this to find that, for example, descriptions that contain the word "red" and "dress" have a smaller value in a column called stock than a description that includes "green" and "bag"

#

example of data
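
One possible approach is to turn each keyword into a 0/1 indicator column and correlate those with the numeric columns. The data and keyword list below are invented for illustration; the resulting `corr` table could then be passed to a heatmap function such as `seaborn.heatmap`:

```python
import pandas as pd

# Made-up sample data standing in for the real dataset.
df = pd.DataFrame({
    "description": ["red dress", "green bag", "red bag",
                    "green dress", "red dress long"],
    "stock": [3, 40, 20, 25, 5],
})

keywords = ["red", "green", "dress", "bag"]
for word in keywords:
    # 1 if the description contains the keyword as a whole word, else 0.
    pattern = rf"\b{word}\b"
    df[f"has_{word}"] = (
        df["description"].str.contains(pattern, case=False).astype(int)
    )

# Correlation between every indicator column and the numeric columns.
corr = df.drop(columns="description").corr()
print(corr["stock"])
```

Correlation with a 0/1 column only captures single keywords; combinations such as "red" and "dress" together would need their own indicator column.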

iron basalt
#

@analog pike Try modifying your column like so:

tawny geode
#

I just started data science in uni so I'm willing to get help

iron basalt
#
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [4, 5], [7, 8]],
    index=['cobra', 'viper', 'sidewinder'],
    columns=['max_speed', 'shield']
)

print(df)

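column = df.iloc[:, 0].copy()  # explicit copy, so edits do not touch df

print("-------------------")
print(column)

for i, val in enumerate(column):
    column.iloc[i] = val + 1  # positional assignment; the index labels are strings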

print("-------------------")
print(column)
analog pike
#

ah elite dangerous

#

im trying to just strip off the time portion and whatever im trying just doesnt seem to want to work

#

since pd.to_datetime only uses values 0-23 for time and for some reason the csv goes 0-24

#

and I don't need the times anyways just the year

iron basalt
#

just the middle part? the year?

analog pike
#

ye

#

since im just trying to get frequency per year

iron basalt
#

Are all entries always formatted like this?

analog pike
#

yeah I downloaded the cleaned one

grave frost
analog pike
#

so I wouldn't have to deal with all the data cleaning

iron basalt
#

val.split()[1] is the year then

#

(Assuming each entry is a string)
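
Which index holds the year depends on the exact string format, which is only visible in the screenshot. Assuming entries look like `10/10/1949 20:30` (an assumption), a quick check shows what `split()` returns and how the year could be pulled out of the date part:

```python
val = "10/10/1949 20:30"  # assumed format of one entry

parts = val.split()  # splits on whitespace
date_part, time_part = parts[0], parts[1]

# The year is the last "/"-separated piece of the date part.
year = date_part.split("/")[-1]
print(parts, year)
```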

analog pike
#

they are

#

yet settingwithcopywarning is messing with me again

#

A value is trying to be set on a copy of a slice from a DataFrame
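
This warning usually shows up when a filtered frame is modified afterwards, because pandas cannot tell whether the change should reach the original. A minimal sketch of one common way around it, using made-up rows: take an explicit `.copy()` of the filtered frame and assign whole columns rather than single cells.

```python
import pandas as pd

# Made-up rows standing in for the real CSV.
ufos = pd.DataFrame({
    "datetime": ["10/10/1949 20:30", "10/10/1955 17:00", "10/10/1956 21:00"],
    "country": ["us", "gb", "us"],
})

# .copy() makes the filtered frame independent of `ufos`.
us = ufos[ufos["country"] == "us"].copy()

# Assign the whole column at once instead of cell by cell.
us["date_only"] = us["datetime"].str.split().str[0]
print(us)
```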

iron basalt
#

are you modifying the column like I did above?

analog pike
#

this is my whole code atm

```py
import matplotlib.pyplot as plt
import pandas as pd
import DateTime as dt
ufos = pd.read_csv("scrubbed.csv",low_memory=False)
countries = ufos['country'].unique()
print(countries)
fig,ax = plt.subplots()

index = ufos[ufos['country'] == 'us']
print(index['datetime'])

column = index.iloc[:, 0]

print("-------------------")
print(column)

for i, val in enumerate(column):
    column[i] = val.split()[1]

index['datetime'] = pd.to_datetime(index['datetime'])

index['year'] = index['datetime'].dt.year
```

#

damn the highlighting didnt work

iron basalt
#

just edit your message

analog pike
#

there we go

iron basalt
#

what does print(index['datetime']) look like?

analog pike
#

thats the image i posted before

#

it gives just the list of dates and times

#

for each entry

iron basalt
#

How many columns are there? just one?

analog pike
#

no, though now that I think about it i really only need the one column

#

since im not doing by state or anything and this is just the US

iron basalt
#

so when you print column what do you get?

analog pike
#

same thing as printing index['datetime']

iron basalt
#

so there is only 1 column in index['datetime']?

analog pike
#

oh shoot wait a minute i think I know why

#

after I made the copy with only US, pandas doesn't renumber the rows

#

:/

#

it leaves the gaps where the other countries were

#

Damn I forgot how i fix that
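
The gaps are the original index labels, which filtering keeps. `reset_index(drop=True)` renumbers the rows from 0. A small sketch with invented rows:

```python
import pandas as pd

# Invented rows; only the index behaviour matters here.
ufos = pd.DataFrame({
    "country": ["us", "gb", "us", "ca"],
    "year": [1949, 1955, 1956, 1960],
})

us = ufos[ufos["country"] == "us"]
print(us.index.tolist())  # labels kept from the original frame: [0, 2]

# drop=True discards the old labels instead of adding them as a column.
us = us.reset_index(drop=True)
print(us.index.tolist())  # [0, 1]
```

Positional access with `.iloc` ignores the labels altogether, so it works with or without the reset.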

misty flint
#

sounds like that would be a kata if katas did arrays

#

i did a similar thing but it was just elements and a list

analog pike
#

pain

misty flint
#

theres probably a function

analog pike
#

Probably

#

just have to find it

misty flint
#

ye

analog pike
#

I want to say sort would do it

#

but i don't think so

#

I just want to visualize the number of ufo sightings in the us per year bro
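
A rough end-to-end sketch of that goal, on made-up rows and assuming the `M/D/YYYY HH:MM` format discussed above: parse only the date part (which sidesteps the `24:00` entries), take the year, and count rows per year.

```python
import pandas as pd

# Made-up rows; the real data would come from pd.read_csv("scrubbed.csv").
ufos = pd.DataFrame({
    "datetime": ["10/10/1949 20:30", "10/11/1949 24:00", "6/1/1955 17:00",
                 "7/4/1955 21:15", "1/1/1960 00:10"],
    "country": ["us", "us", "gb", "us", "us"],
})

us = ufos[ufos["country"] == "us"].copy()

# Keep only the date part so the "24:00" time never reaches the parser.
date_part = us["datetime"].str.split().str[0]
us["year"] = pd.to_datetime(date_part, format="%m/%d/%Y").dt.year

# Number of sightings per year, in chronological order.
per_year = us["year"].value_counts().sort_index()
print(per_year)
```

With matplotlib installed, `per_year.plot(kind="bar")` would draw the counts as a bar chart.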