#data-science-and-ml


lapis sequoia
#

Is there any possibility to sort data in linked list with time complexity O(n log(n))
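
Merge sort is the usual answer here, since it only needs sequential access and runs in O(n log n) on a singly linked list. The following is a minimal sketch, assuming a hypothetical `Node` class and helper names that are not from the original question:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical singly linked list node, for illustration only."""
    value: int
    next: Optional["Node"] = None


def merge_sort(head):
    """Sort a singly linked list with merge sort, O(n log n) time."""
    if head is None or head.next is None:
        return head
    # Find the middle with slow/fast pointers, then cut the list in two.
    slow, fast = head, head.next
    while fast is not None and fast.next is not None:
        slow, fast = slow.next, fast.next.next
    right = slow.next
    slow.next = None
    left, right = merge_sort(head), merge_sort(right)
    # Merge the two sorted halves behind a dummy head node.
    dummy = tail = Node(0)
    while left is not None and right is not None:
        if left.value <= right.value:
            tail.next, left = left, left.next
        else:
            tail.next, right = right, right.next
        tail = tail.next
    tail.next = left if left is not None else right
    return dummy.next


def from_list(values):
    """Build a linked list from a Python list."""
    head = None
    for value in reversed(values):
        head = Node(value, head)
    return head


def to_list(head):
    """Collect the node values back into a Python list."""
    out = []
    while head is not None:
        out.append(head.value)
        head = head.next
    return out


sorted_values = to_list(merge_sort(from_list([5, 1, 4, 2, 3])))
```

The split step walks the list once per recursion level and there are about log n levels, which is where the O(n log n) bound comes from.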

odd yoke
serene scaffold
#

huh, markdown didn't work

teal sluice
#

I have a pandas dataframe which contains spending and the month of spending, what would be the best way to create another dataframe which holds the total spent for each month??

torpid cave
#

@teal sluice group + sum maybe

#

or summarise
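
A minimal sketch of the group + sum suggestion, assuming the columns are called `month` and `spending` (the real column names were not given):

```python
import pandas as pd

# Hypothetical data: one row per purchase.
df = pd.DataFrame({
    "month": ["Jan", "Jan", "Feb", "Mar", "Feb"],
    "spending": [10.0, 5.0, 7.5, 3.0, 2.5],
})

# One row per month with the total spent; sort=False keeps first-seen order.
monthly_totals = df.groupby("month", sort=False, as_index=False)["spending"].sum()
```

`as_index=False` keeps `month` as an ordinary column, so the result is itself a small dataframe rather than a Series.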

#

Hi all, has anyone configured a Raspberry PI as a server/VM to run your own scripts via http requests?

#

I just need some quick guidance

lapis sequoia
#

is it worth dividing images by 255 when training?

#

like, from uint8 to float64
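
Scaling pixel values into [0, 1] is a common preprocessing step. A small sketch, assuming a NumPy `uint8` image; casting to `float32` first is a suggestion, since plain `image / 255` produces `float64` and doubles the memory:

```python
import numpy as np

# Hypothetical 1x3 grayscale "image" with the extreme uint8 values.
image = np.array([[0, 128, 255]], dtype=np.uint8)

# Cast first, then divide, so the result stays float32.
scaled = image.astype(np.float32) / 255.0

# Dividing the uint8 array directly promotes to float64.
scaled64 = image / 255
```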

serene scaffold
#

ping me if you do.

velvet thorn
velvet thorn
#

there's something called dependent typing

#

which basically means that the type of a value depends on another value

#

this can be used to encode, for example, an array's length in its type

#

which gives you stronger guarantees regarding whether any particular expression is well-formed

#

e.g. elementwise addition of unequally sized arrays

#

however, this is still (relatively) an academic thing

#

(and is also pretty complex)

#

so there are some things we probz can't do right now.

#

and this also raises the question: what do we want to encode in a dataframe's type?

#

at some point, it might be a question of whether this should be delegated to runtime property checking instead

velvet thorn
#

i.e. not encoded within a type parameter

#

for example

#

in most languages, division by zero is handled as a runtime error

#

not as a compile time type error, because there is no NonzeroNumber type

#

there's also the question of axis alignment

#

if two dataframes have the same column names in a different order

#

are they the same type?

#

different column names, but same data types?

serene scaffold
#

my current thinking is that there would be power in having a language for documenting what properties a dataframe has, even if the linter can ultimately only assume that if a function returns a SomeDataFrameType, that object is a valid argument for a function that takes a SomeDataFrameType.

lapis sequoia
#

I have another question. I have moved to colab since it has gpu acceleration (better than mine for sure) and i uploaded all my images to drive (1 hour it took). Now, i need to append all the images to an array, to give it as input to my nn. But colab takes like tooooooooooo long to append all the images. Any suggestion?

velvet thorn
#

but that would still be a type parameter

#

and it would be irrelevant for other functions

#

so if you wanted to spec this out

#

you would need to think about whether there's a practically viable type hierarchy

#

that can encode the necessary information

lapis sequoia
#
```py
for pok in pokemons:
    path = os.path.join(datadir, pok)
    images = os.listdir(path)
    amount = len(images)
    for i in range(amount):
        print(f'Doing {pok}. {amount - i} remaining images')
        img_array = cv2.imread(os.path.join(path, images[i]), params['color_mode'])
        new_array = cv2.resize(img_array, params['dimensions'])
        if i < amount * params['percentage']:
            train_data.append(new_array / 255)
            train_label.append(pok)
        else:
            valid_data.append(new_array / 255)
            valid_label.append(pok)
```
#

i open the image with opencv, i resize it, and i append it to an array (input for later)

velvet thorn
#

what is train_data?

#

where is it defined

lapis sequoia
#

an array

#

above

velvet thorn
#

show.

velvet thorn
#

paste code

#

instead of images

#

way too small to see

lapis sequoia
#

idk how to copy paste code from colab. it is on different cells. wait

#
```py
os.chdir('/content/drive/MyDrive/Colab Notebooks/Python ML/Pokeguesser')

train_data  = []
train_label = []
valid_data  = []
valid_label = []

data_dir = 'dataset'
pokemons = os.listdir(data_dir)
dimensions = (71, 71, 3)
batch_size = 126
num_epochs = 12
percentage = 0.8
```
#
```py
for pok in pokemons:
    path = os.path.join(data_dir, pok)
    images = os.listdir(path)
    amount = len(images)
    for i in range(amount):
        img_array = cv2.imread(os.path.join(path, images[i]), cv2.IMREAD_COLOR)
        new_array = cv2.resize(img_array, dimensions[:2])
        if i < amount * percentage:
            train_data.append(new_array)
            train_label.append(pok)
        else:
            valid_data.append(new_array)
            valid_label.append(pok)
```
velvet thorn
#

it's a list

#

it's important to be clear on this

lapis sequoia
#

well, sorry if both are different. For me are the same ^^'

velvet thorn
#

no

#

they are different.

#

very different.

lapis sequoia
#

an array from java is a list on python

velvet thorn
#

yes

#

but

#

normally it doesn't matter that much

#

however, in this case

lapis sequoia
#

thats why sometimes i call them array

velvet thorn
#

when you are working with numpy

#

numpy.ndarray is what is normally called an "array"

#

and because the semantics are different

lapis sequoia
#

okey okey

velvet thorn
#

from a native Python list

#

it is important to distinguish the two

#

anyway

lapis sequoia
#

26k

velvet thorn
#

how big is each image?

lapis sequoia
#

mmm there are different sizes

velvet thorn
#

disk size

#

what's the range like

#

few hundred kb?

lapis sequoia
#

1.14 gb

velvet thorn
#

well

#

then

#

that's why it's taking so long

#

loading images is (relatively) slow

lapis sequoia
#

i think i am not explaining well, wait

lapis sequoia
#

hold on 1 sec

#

sorry for a gif, cant think of a different way to show

#

this is on my local computer

velvet thorn
#

yeah

#

on your local computer

#

I don't know the specifics of Google hardware

lapis sequoia
#

this is on colab

velvet thorn
#

but it's very possible that there needs to be transfer over the wire

#

from Drive to Colab

#

which would make it much slower

#

here you can see

#

that loading is much slower

#

and

#

okay, simple way to show if this is true or not

lapis sequoia
#

oh

velvet thorn
#

img_array = cv2.imread(os.path.join(path, images[i]), cv2.IMREAD_COLOR)

#

this is the line that loads the images

lapis sequoia
#

so colab doesnt actually have my images directly?

velvet thorn
#

include a print before and after

#

to see how long it takes to load

#

IO should be the primary bottleneck here
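
A sketch of the "print before and after" idea, using `time.perf_counter` to measure one call. The `timed` helper is hypothetical; in the loading loop it would wrap the `cv2.imread` call:

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# In the image loop this would look like (cv2 and the path are assumed):
#   img_array, seconds = timed(cv2.imread, os.path.join(path, images[i]), cv2.IMREAD_COLOR)
#   print(f"load took {seconds:.3f}s")
total, seconds = timed(sum, range(1000))
```

If most of the per-image time is spent inside the load call, the bottleneck is IO rather than the resize or the append.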

#

this is what I found after a quick search

#

It takes forever to copy files from Drive to Colab. While this is no problem when dealing with very small datasets, it’s very annoying when facing larger data, for example for image classification.

#

you said your data was in Drive

lapis sequoia
#

yeah, but idk why i thought linking drive to colab would make like a copy on colab side

#

i will try that, one sec

#

idk if i fcked up but

#

!cp -r "{data_dir}" ~

#

will copy the folder on root?

#

cuz i am trying to avoid zipping the images and uploading them again to drive

#

if this doesnt work i will do it tomorrow

serene scaffold
#

not to distract from the help that's happening, but now I'm wondering: is the only runtime optimization for numpy that it does iterative operations in C, or can it also secretly run independent operations in parallel?

velvet thorn
#

they're run with SIMD

#

stuff like elementwise addition is run in parallel

#

with aforesaid SIMD

#

uh

velvet thorn
#

very good with shell stuff TBH

lapis sequoia
#

nvm. dont do chdir on colab

#

it messes up xD

lapis sequoia
#

btw. Does colab index files in a different way? my subdirectories are named like 001_name, 002_name, 003_name and so on

#

But when i do os.listdir it returns some weird sorted list

#

the first item is the 083
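
`os.listdir` is documented as returning entries in arbitrary order, so this is expected on any platform, not something specific to Colab. A small sketch, using a temporary directory with made-up folder names, of sorting the result explicitly:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    # Hypothetical class folders, created out of order on purpose.
    for name in ["083_c", "001_a", "002_b"]:
        os.mkdir(os.path.join(data_dir, name))

    # sorted() gives a stable, name-based order regardless of the filesystem.
    pokemons = sorted(os.listdir(data_dir))
```

Zero-padded prefixes such as `001_` sort correctly as plain strings, so no custom key is needed for names like these.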

proper tendon
#
{
    "server1":
        [
            "id":
                [
                    "s1",
                ],
            "channel1":
                [
                    "c1",
                ],
        ],
    "server2":
        [
            "id":
                [
                    "s2"
                ],
            "channel2":
                [
                    "c2",
                ],
        ],
},
#

so i got this json

#
```py
import json

with open(r"D:\Heres\Bots\Messager\Files\saves.json") as f:
    data = json.load(f)

server1 = data["server1"]["id"]
channel1 = data["server1"]["channel1"]
server2 = data["server2"]["id"]
channel2 = data["server2"]["channel2"]

print(server1, channel1, server2, channel2)
```
#

and this py

#

and for some reason its not working

#

may someone help? i am doing lotsa stuff, may ya ping me if u can help 🙂

still verge
#

what is the error you're seeing?

#

either way, you can't access id and so on since the json data structure isn't a nested dict, you have a list

#

so you'll have to do data["server1"][0]
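
An alternative to indexing into the list is to restructure the file so each server is a JSON object, which loads as a nested dict and lets the string keys work. This is a sketch under that assumption; the inline `raw` string stands in for `saves.json`:

```python
import json

# Assumed restructuring: objects ({}) for each server instead of lists ([]).
raw = """
{
    "server1": {"id": ["s1"], "channel1": ["c1"]},
    "server2": {"id": ["s2"], "channel2": ["c2"]}
}
"""

data = json.loads(raw)

server1 = data["server1"]["id"]
channel1 = data["server1"]["channel1"]
server2 = data["server2"]["id"]
channel2 = data["server2"]["channel2"]
```

Note that JSON does not allow `"key": value` pairs inside `[...]` or trailing commas, so a file shaped like the one posted would need this kind of change before `json.load` accepts it.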

proper tendon
#

fixed the issue ty

lapis sequoia
#

ValueError: Failed to find data adapter that can handle input: (<class 'list'> containing values of types {"<class 'numpy.ndarray'>"}), (<class 'list'> containing values of types {"<class 'int'>"})

#

Can someone help me fixing this error?

uneven monolith
#

How much math is needed for Data Science?

trail jacinth
uneven monolith
#

Ty

sweet zenith
#

hey guys, I'm looking for a startup idea on AI.. If you have some good ideas do tell me..

lapis sequoia
#

u could help me doing a nn that recognizes pokemons 😄

sweet zenith
#

is AR a big thing in future?

foggy swift
sweet zenith
#

oh

#

ok

clever raft
#

So do i

vestal tiger
#

probably a noob ass question but i have a correlation matrix, how do i extract the highest pairs, as well as what each pair is? for example, that the correlation between x and y was .7? most of the methods I am seeing show the correlation number, not what the two variables are

#

basically i want to extract the highest values from a matrix and what the two variables are
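
One way to get both the value and the variable names is to keep only the upper triangle of the correlation matrix and stack it into a Series indexed by (variable, variable). A minimal sketch with made-up data and column names:

```python
import numpy as np
import pandas as pd

# Hypothetical data; x and y are positively correlated (about 0.77).
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 4, 5],
    "z": [5, 3, 4, 1, 2],
})
corr = df.corr()

# Keep each pair once: mask out the diagonal and the lower triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Stack into a Series whose index holds the two variable names.
pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
top_pair = pairs.index[0]
top_value = pairs.iloc[0]
```

Sorting on `pairs.abs()` instead would rank strong negative correlations alongside strong positive ones.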

silver shard
#

Hi guys, I don't suppose anyone understands this and can help me get a solution out?

#

I've been looking at this for hours, inspecting it with debugging tools trying to find the relationship between the input and output

next moat
#
(base) C:\Users\siebe>conda install tensorflow-gpu
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

it keeps doing this. I also tried installing only tensorflow (without the gpu)
I have an RTX 2070 Super GPU from Nvidia (MSI)

mellow vapor
#

Hey guys, i'm a bit confused with jupyter notebook and anaconda, basically just wanted to know can i use the jupyter notebook as an independent desktop application, or does it only run on the web or with anaconda?

ocean dawn
boreal summit
#

Mehn, I'm still struggling with my Tensorflow installation. I've uninstalled Python 3.8, I even created a virtual environment and stuff, installed just TF in it but I'm still getting issues.

#

It's saying, DLL runtime error: can't load Tensorflow runtime bla bla bla. But I figured out it has to do with my PC, it doesn't have a GPU, that's why. I successfully installed Tensorflow but it can't run without a GPU.

#

I'd try to get another PC January to resolve this.

long gate
#

Can anyone help me with just what algorithm or structure to use on this problem?

https://open.kattis.com/problems/bokforing

My answer is way too slow, although I use python I still know I don't have the right answer. I have one solution to it that would work I guess, but that would be kinda cheating and I want to solve it the right way, any thoughts how to speed up the solution?

violet talon
#

sort of a dumb question, but I can't find the answer to this: how do I read an exponent in this format from mpl? 1e-12+9.9995833333 (other than 'really small' 😉 )?

#

read as in, interpret

lapis sequoia
#

use regex

ember roost
#

I want to find out certain metrics in my model. I have binomial distribution of daily pattern and I have average rate of daily metric. How do I find out the rate at particular time ? ( Basically multiplying binomial curve with average should reflect the distribution of data for certain ranges ) Is there any utility to do such kind of analysis ?

lapis sequoia
#

How can i use ImageDataGenerator to fit my model after? It is complaining idk why

trim oar
lapis sequoia
#

Feelssadman I’m doing python at home on this christmas day :/

boreal summit
#

Also, I just installed Pytorch and it's working fine.

#

If TF doesn't work on my PC, I'll just go with PyTorch.

earnest herald
#

Hello everyone,

I am trying to make a web scraper for the Fortune 500. I was thinking of using Scrapy but I can do well with BeautifulSoup.

When I make a get request and build the soup (and print the soup itself) I end up with useless information named DNS Prefetch and no relevant info from the page. Any idea how I could bypass it?

Thanks a lot!!

lapis sequoia
#

guys what's a better way to show lots of graphs in one chart?

#

below is what I did

#

this is so messed up

#

Each line graph shows a historical price of certain good for 5 years

#

I want them to be in one graph but is that even viable to make it look better than this :/

#

x is time, y is price btw

gleaming gyro
#

why do you want to show that many stuff in one graph

#

is it how it is normally done?

lapis sequoia
#

basically that graph is to compare housing price differences between cities

#

I have no other better thought

#

showing some of them makes no sense to me

#

Any advice is highly appreciated!

fleet heath
#

@lapis sequoia you're visualizing a lot of data

#

generally, line plot is the best one if you want to compare real estate prices

#

but this is surely not looking good

#

you can try to take the mean values of different cities and then try plotting a bar chart

#

where one bar will represent the mean price of house in that city in a given year

lapis sequoia
tight torrent
#

Guys will heroku charge me for my add-ons like MySQL, i have my credit card info registered that's why

||sorry if offtopic||

lapis sequoia
#

I used LabelEncoder from sklearn to transform my labels into something valid for keras. But once i do the label encoder, i get 1 list of train_data length

#

And i think keras needs a matrix

#

Yesterday i downloaded cifar10 dataset to see what is has. x_train was a ndarray of 50k of images (ndarrays too). But y_train.shape was (50k, 10) cuz 10 classes. I printed what was y_train[0] and it was a list full of zeros except one

#

On my case, y_train is just 1 list where y_train[i] is the class x_train[i] belongs to

#

But model.fit doesnt accept this

lapis sequoia
#

nvm, i fixed it

shadow spruce
#

import pandas as pd

# dataframe "ind": returns since 1926, shape (1110, 30)
ind = pd.read_csv("ind30_m_vw_rets.csv", header=0, index_col=0)/100
ind.index = pd.to_datetime(ind.index, format="%Y%m").to_period('M')
ind.columns = ind.columns.str.strip()

# time series correlations over a 36 month rolling window: shape (33300, 30)

ts_corr= ind.rolling(window= 36).corr()

ts_corr.index.names = ["Date","Industry"]

ts_grby =ts_corr.groupby(level = "Date")

ts_grby.get_group("2018-12")

KeyError Traceback (most recent call last)
<ipython-input-8-484dc0e2c324> in <module>
----> 1 ts_grby.get_group("2018-12")

~\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py in get_group(self, name, obj)
808 inds = self._get_index(name)
809 if not len(inds):
--> 810 raise KeyError(name)
811
812 return obj._take_with_is_copy(inds, axis=self.axis)

KeyError: '2018-12'
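
A likely cause is that the group keys are `pd.Period` objects (the index was converted with `to_period('M')`), so the plain string `"2018-12"` does not match any key. A sketch with a small made-up frame standing in for `ind`; the column names and the random data are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the industry returns, indexed by monthly periods.
idx = pd.period_range("2018-07", periods=6, freq="M")
rng = np.random.default_rng(0)
ind = pd.DataFrame(rng.normal(size=(6, 3)), index=idx,
                   columns=["Food", "Beer", "Smoke"])

ts_corr = ind.rolling(window=3).corr()
ts_corr.index.names = ["Date", "Industry"]
ts_grby = ts_corr.groupby(level="Date")

# Look the group up with a Period of the same frequency, not a string.
dec_2018 = ts_grby.get_group(pd.Period("2018-12", freq="M"))
```

Each group is the correlation matrix for one month, so here `dec_2018` has one row per industry.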

lapis sequoia
lapis sequoia
#

Hello!

I have a dataset with Quora questions and another dataset with individuals' preferences (ranked from 1 to 5) in which each row represents a different individual.

I would like to match each question with a group of individuals whose preferences may match the topics I obtained with lda modelling. The problem is that I don’t know how to exactly do this…

I don't know what answers I need to to look for... I don't know what to google to find out the magic key! Please help!

What do you think? What would you advice me to look for or how do you think I approach this?

Thanks so much in advance! And Merry Christmas!! 🎄

upbeat jetty
#

How to optimise the local pyspark so it would run the fastest on the work laptop? I need it to run tests, but they are taking waaay too long.

nimble lotus
#

@lapis sequoia what does your quora dataset contain? Was it text responses to various questions?

#

This sounds like a question recommendation system based on an individual's preferences

timid sand
#

Hello everyone

nova smelt
#

Yo guys,

So I first created a space invaders game with a friend and then we tried to add a neat ai which kind of worked but it's not really learning anything. If there is someone that might wanna hop into vc and look at the code and maybe help us make it more efficient and learning, that would be awesome :) if there is someone just DM me!
Not sure if I am right here or at #game-development

trim oar
fiery apex
#

Hi Guys, can somebody help me? Does anybody know how to get data from OSIsoft PI with R or python?

astral path
#

I have a pandas dataframe that looks like this, and I'm trying to explode each list into a new row (so I would have a shape of len(list1) * len(list2) * len(list3) rows x 377 cols). The code I'm using to do this is

for column in df.columns: 
      df[column].explode() 

but this does literally nothing. Anyone know how this might be fixed? full code here: https://hastebin.com/pifoseripo.properties
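
`explode` returns a new object instead of modifying the dataframe in place, so the result has to be assigned back. A minimal sketch with a made-up two-list frame, exploding one column at a time so the rows multiply out:

```python
import pandas as pd

# Hypothetical frame: one row, two list columns and one scalar column.
df = pd.DataFrame({"a": [[1, 2]], "b": [["x", "y", "z"]], "c": ["k"]})

# Explode on the whole DataFrame and reassign; each pass multiplies the rows.
for column in ["a", "b"]:
    df = df.explode(column, ignore_index=True)
```

Calling `df[column].explode()` only explodes that one Series and discards the result, which is why the original loop appears to do nothing.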

lapis sequoia
nova smelt
#

hey

#

anyone knows a tutorial to learn how to save a neat module? or some docs?

#

cause when i train my ai for a few hours on a game i would like to save it so it doesn't have to start from zero

lapis sequoia
#

Hey guys, I'd like to increase my knowledge about scientific python and also dangers that come with machine learning. For my university, I am asked to write a paper (it's going to be desk research and I want to state my thoughts on a topic that is controversial, so there is room for critical thinking). Therefore, I was wondering if you guys have any book recommendations? I don't need to get into a hands on how-to right away, but something that takes you by the hand and explains the depth of the scientific data world 🙂

opaque seal
#

Nice english.

#

||no sarcasm||

lapis sequoia
#

Uh, thanks? I guess

lament nova
#

Hi, How can I merge two data frames where
df1 has index "Key0"
df2 has indexes ["Key1", "Key2", "Key3"]

for each row ["Key1", "Key2", "Key3"] might contain "Key0"

I came up with a solution using apply but it is really slow...
My Solution

def matchMerge(x, key, df, keys):
  for key in keys:
    try:
      x.update(df.loc[x[key]])
    except:
      ...
df1.apply(matchMerge, key="Key0", df=df2, keys=["Key1", "Key2", "Key3"], axis=1)

is there a way to do this with merge?

pd.merge(df1, df2, left_on="Key0", right_on=["Key1", "Key2", "Key3"], how="outer")
# throws indexes must have same length
opaque seal
#

hi

#

so

#

uhh just learning it for now and create some projects

#

with it

#

and then prolly might use it for game deving

#

later

#

for making stuff like traffic in cities and stuff

#

oh

#

like not making cars and stuff crash with each other

#

oh

#

uhh

#

idk

#

e

#

idk much about groups and stuff

#

:c

#

uhh yeah

#

you can say that

#

e

#

e damn i was thinking impossible stuff them

#

alr

#

tru

#

yeah yr

#

alr

#

thanks

#

damn thanks a lot for ur time

#

alr

#

kk

#

oh

#

then it must be good

lapis sequoia
gritty obsidian
#

Hey, is anybody working here with PySpark ?

exotic bronze
#
```py
soup = BeautifulSoup(r.content, 'html.parser')
find = soup.find_all('img')
```
output 

<img alt="blablabla" data-src="linkhere.jpg" height="451" src="anotherlink.jpg" width="300"/>,

How can i specifically select "data-src"?
hearty token
#

How do you create an XPATH expression into a new HTML file that lives inside an iframe?

upbeat jetty
#

@trim oar It is meant to be eventually deployed in Azure ecosystem, but it doesn't solve the problem with local tests.

lapis sequoia
#

Guy what layers may i add to my pretrained model (Xception) if i wanna do transferlearning?

#

Like, this is what i have

#
```py
base_model = keras.applications.Xception(weights='imagenet',
                                         input_shape=dimensions,
                                         include_top=False)
base_model.trainable = False
inputs = keras.Input(shape=dimensions)
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
outputs = keras.layers.Dense(len(pokemons))(x)
model = keras.Model(inputs, outputs)
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
```
#

But it seems not to be training at all

knotty whale
#

I want to use a supervised ML model for some grade prediction, so I have my train data
X: All 6 mock results
Y: Final grade
And all my train data has all the fields filled fine

But when a user wants to use the app, they may not have all 6 mock results, so can I predict their final grade on less such as 3 mocks?
(Plan on using scikitlearn)
(Please ping me if you reply!)

vague vector
#

Hi guys, Im new to ML.
My question is: if the model is trained on normalised or standardised data, do we also need to normalise or standardise the data when the model is in production?
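
In general yes: the model only ever saw scaled inputs, so production inputs need the same transformation, using the statistics computed on the training data rather than recomputed on new data. A standard-library sketch with made-up numbers:

```python
from statistics import mean, pstdev

# Hypothetical training feature values.
train = [50.0, 60.0, 70.0, 80.0]

# Fit the scaling parameters once, on the training data only.
mu = mean(train)
sigma = pstdev(train)


def standardise(x):
    """Apply the training-time scaling to any value, at training or serving time."""
    return (x - mu) / sigma


train_scaled = [standardise(x) for x in train]

# In production, reuse the stored mu and sigma on the incoming value.
production_input = standardise(65.0)
```

With scikit-learn the same idea is usually expressed by fitting a scaler on the training set and saving it alongside the model.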

obtuse mango
#

Hello guys, I am trying to improve my oop skills and ml skills

#

So I am trying to write ml algorithms but I kinda need some guidance

#

Do you know any resource that gives you steps for this kind of things

lapis sequoia
#

greetings all, as regards NLP what tool do you recommend to create text annotations, other than carving it by hand.

soft salmon
#

how to get started with data-science?

#

any beginner guides?

twilit wind
#

Has anybody tried the Faster RCNN implementation

misty rivet
soft salmon
#

@misty rivet where is the solution though?

misty rivet
#

Bro i don't know..😂, I'm new here...!

#

wait until someone experienced replies to us @soft salmon

desert parcel
#

If I get the following loss of 15813.8125 from an mse function do I need to find its square root to know its actual loss or is that the loss already

high badge
#

that value is already the loss (the mean squared error); square root it if you want the error in the same units as your targets
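
For reference, the value returned by an MSE function is the loss being minimised; its square root is the RMSE, which is easier to read because it is in the same units as the targets. A quick sketch with the number from the question:

```python
import math

mse = 15813.8125       # value reported by the MSE loss
rmse = math.sqrt(mse)  # roughly 125.75, in the units of the target
```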

sullen crescent
twilit wind
#

No I need it with Tensorflow

#

Actually my program is showing some bad outputs

#

also the mAP is about 23%

#

?

sullen crescent
#

what kind of dataset?

#

playing with deep learning need at least 5000 images if you're working on image detection

twilit wind
#

yea

#

my train split have about 4300 images

#

in total its about 7800 images

sullen crescent
#

wow thats some bounding box issues no wonder your mAP is quite low

#

did you manage to do offline augmentation?

twilit wind
#

means ?

sullen crescent
#

try to optimize your parameters, double check your ground truth, augment your training dataset so you will have more data

lapis sequoia
#

You can find great material on YouTube, Udemy, Coursera, something like https://www.udemy.com/course/datascience/, https://www.coursera.org/browse/data-science, https://m.youtube.com/watch?v=ua-CiDNNj30 the last link is awesome for beginners! @soft salmon @misty rivet

lapis sequoia
#

is there a dedicated channel for NLP?

#

or is data-science the channel? 🙂

velvet thorn
solid kindle
#

i'm trying to add another column to my dataframe

#

this is what i am currently doing

#

if first:
    first = False
    df = pd.DataFrame([stock, tempdf.iloc[:,3]])
else:
    print(stock)
    df[stock] = tempdf.iloc[:,3].tolist()
#

but it adds it as a row

#

how do i get it to add the 3rd column to the stock?

velvet thorn
solid kindle
#

its a string

#

sorry should have specified it

#

the 3rd column of tempdf is integers

#

and the rows are indexed by datetimes

#

sorry wrong code, this is the only thing thats working

#

if first:
    first = False
    df = pd.DataFrame([stock, tempdf.iloc[:,3]])
else:
    print(stock)
    tempthing = tempdf.iloc[:,3].tolist()
    df[tempthing] = stock
velvet thorn
#

use code blocks

#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

solid kindle
#

sorry not working, but doesn't throw error

#

ok i will

#

'''py

velvet thorn
#

okay maybe you can give me a bit more context

#

as to what you're trying to do

#

it's `, not '

solid kindle
#

right

velvet thorn
#

if you're using a standard English keyboard

#

it'll be below your Esc key

solid kindle
#

ya same as tilde

velvet thorn
#

yup

solid kindle
#

basically, i'm trying to put the closing prices for a bunch of different stocks at different dates into a pandas dataframe

#

i'm doing this by looping through a list of names (stock), and trying to add them one by one to the dataframe

velvet thorn
#

probably not a good idea

#

every time you "add" a column

#

you in fact create a new DataFrame

solid kindle
#

yes

#

it doesn't have to be fast

#

only has to happen once

velvet thorn
#

it's also kind of harder to debug but

#

is your choice

solid kindle
#

what would u recommend?

#

i'm new to pandas and data science in general

velvet thorn
#

of the data

solid kindle
#

an api

#

yfinance

velvet thorn
#

you have other programming experience?

solid kindle
#

yes

velvet thorn
#

what format does it return the data in

#

okay, good

solid kindle
#

it returns it in a pandas dataframe

velvet thorn
#

so pandas has this concat function

solid kindle
#

yes, ive used it

velvet thorn
solid kindle
#

yes

#

sorry should have specified

velvet thorn
#

so I'm guessing

#

you get a bunch of DataFrames from the API

#

and you want to combine subsets thereof

#

in a specified manner?

solid kindle
#

yes

#

yes

#

just add the same column together

velvet thorn
#

same column from each DataFrame?

#

or what

solid kindle
#

yes

velvet thorn
#

okay so let me just get this right

#

you have, say, 10 DataFrames, and each has a 'output' column, and you want to take that column from each and combine them into one big DataFrame

#

is that right

solid kindle
#

yes

#

exactly

velvet thorn
#

pd.concat([df['output'] for df in dfs], axis=1)

solid kindle
#

thank you!

velvet thorn
#

yw

#

tell me if it works

#

(it should if I have understood you correctly)

solid kindle
#

ya, i'll need to do a little restructuring of my code real quick

velvet thorn
#

so dfs is the iterable of all your source DataFrames
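
A small sketch of that `pd.concat` call, assuming each source DataFrame has a `Close` column and a shared datetime index. The ticker names and prices are made up, and `keys=` is added so the resulting columns are labelled per stock instead of all being called `Close`:

```python
import pandas as pd

# Hypothetical per-stock frames, as a price API might return them.
idx = pd.date_range("2020-12-21 00:27", periods=3, freq="min", tz="UTC")
dfs = {
    "AAA": pd.DataFrame({"Close": [1.0, 2.0, 3.0]}, index=idx),
    "BBB": pd.DataFrame({"Close": [10.0, 20.0, 30.0]}, index=idx),
}

# Take the same column from every frame and line them up side by side.
closes = pd.concat(
    [df["Close"] for df in dfs.values()],
    axis=1,
    keys=list(dfs.keys()),
)
```

Rows are aligned on the index, so timestamps present in one frame but not another show up as NaN in the combined result.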

solid kindle
#

i got a TypeError: Cannot join tz-naive with tz-aware DatetimeIndex

#

i used this instead: pd.concat([tempdf['Close'], df], axis=1)

velvet thorn
#

what is df

solid kindle
#

the dataframe i am adding everything to

velvet thorn
#

ah, okay

#

so

solid kindle
#

i initialize it as an empty dataframe

#

and then add something to it every time

velvet thorn
#

okay

#

so

#

to use the approach

#

above

#

you need to put all the individual DataFrames in a collection

solid kindle
#

ok

#

so i made a list of all the dataframes

#

and then ran your command

#

and it worked except one of the columns has a bunch of NaNs (i forgot what they are called, its not null is it?)

#

also thank you so much for helping me

velvet thorn
velvet thorn
#

one of the columns

#

like

#

check the DataFrame

#

that that column came from

#

most likely

#

the source data is bad

#

or its index is misaligned

solid kindle
#

                                  Close  Close       Close
Datetime
2020-12-21 00:27:00+00:00  23526.640625    NaN  641.566772
2020-12-21 00:28:00+00:00  23486.863281    NaN  640.518188
2020-12-21 00:29:00+00:00  23493.597656    NaN  640.609680
2020-12-21 00:30:00+00:00  23497.607422    NaN  640.758362
2020-12-21 00:31:00+00:00  23550.359375    NaN  641.541931
...                                 ...    ...         ...
2020-12-28 00:19:00+00:00  26493.246094    NaN  708.414062
2020-12-28 00:20:00+00:00  26520.251953    NaN  709.166321
2020-12-28 00:21:00+00:00  26509.263672    NaN  708.537170
2020-12-28 00:22:00+00:00  26530.599609    NaN  707.455750
2020-12-28 00:23:02+00:00  26558.570312    NaN  708.471008

#

ok seems to be working

#

just a scattering of NaNs somewhere

velvet thorn
#

check dfs[1]

solid kindle
#

ok thanks

#

its fine

#

i think its just the api

#

and that one dataset

#

all the other ones are fine

lapis sequoia
#

if someone could help

lapis sequoia
#

guys

#

am I the only one who don't use tuples that much

desert parcel
lapis sequoia
#

Have any of you worked with the mal api?

#

I’m trying to extract the user ids of the users on mal, I’ve tried mal,jikan but nothing seems to work

#

Is there no other way than to make a crawler and scrape the user ids?

#

Also I need to extract the rating given by each user to the anime

desert parcel
#

Could someone explain this paragraph to me? I've been replaying the video, but still don't understand it.

lapis sequoia
# desert parcel

not the kind of reply you’re looking for, but may I ask about the guide? Looks cool, and is there any video tutorial for that?

lapis sequoia
# desert parcel

In a very layman language a loss function is a way of telling a model how bad it is doing

#

So the less the loss is

#

It’s better

#

Coz that means it’s doing better

#

won't overfitting be the problem though

#

That’s when you train the model too much on one dataset

desert parcel
#

It's on youtube and it's free

lapis sequoia
#

Thanks m8

lapis sequoia
lapis sequoia
desert parcel
#

I know this is late lol

lapis sequoia
#

there are different types of functions which are used to determine the loss

#

they basically see the difference between what your model is predicting

#

versus the prediction that should be

desert parcel
#

English isn't my first language so don't use too advanced words

lapis sequoia
#

oh

#

prediction is the ideal thing

#

like the actual answer

desert parcel
#

Alright

lapis sequoia
#

versus what the model gave

desert parcel
#

I thought what the model gave is the prediction

#

From the tutorial it says that the predictions should be close to or equal to the targets

#

Looking at the first element in each tensor. The guy says that -4252.4780 is what happens when you differentiate with respect to the 0.2761

#

Correct me if I'm wrong.

#

And the value -4252.4780 is the derivative of the loss with respect to 0.2761?

lapis sequoia
#

is anyone here familiar with naive bayes?

flint sierra
#

I'm a self-taught programmer. I'm lucky enough to have a job where I get to use python every day as a data analyst. However I feel like I've hit a wall on my professional development. Internet bootcamps can only take me so far, I think what I'm missing is peer interactions and networking. Unfortunately I don't work with anyone else who codes in python. I'm considering taking a more rigorous online course, applying to a university or pouring time into an open source project.

#

Any advice?

arctic wedgeBOT
#

Hey @radiant urchin!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

radiant urchin
#

Hello, I'm having some issues with curve fitting using scipy.optimize.curve_fit.

I keep getting the following issues
ValueError: array must not contain infs or NaNs

```py
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def func(x, A, m, hf):
    return A * (x - hf)**m

ff = 'data.txt'
data = pd.read_csv(ff, skiprows=3, delimiter='\t', encoding="ISO-8859-1")
load = np.array(data.iloc[:, 1])
disp = np.array(data.iloc[:, 0])
istart = np.where(disp == max(disp))[0][0]
p0 = [0.001, 2, 250]
ulfit, pcov = curve_fit(func, disp[istart:], load[istart:], p0,
                        bounds=(0, [0.1, 5, max(disp)]))
```

I have a lot of similar curves, some work fine, and others give me errors depending on how I adjust p0.. (even though all the curves are similar) I can share a raw data file too if that helps

lapis sequoia
radiant urchin
#

the raw data has no NaN values

lapis sequoia
#

I can’t seem to find a problem with the code above

#

Sorry that I couldn’t help you

radiant urchin
#

if I drop the initial values, then I get a different error:

RuntimeWarning: invalid value encountered in power

#

But it spits out a reasonable result anyways? not sure if this is an issue?
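
One possible cause, not confirmed in this thread: whenever the optimizer tries an `hf` larger than some `x`, the base `x - hf` is negative, and a negative number raised to a non-integer `m` is NaN in NumPy. That would explain both the warning and the "infs or NaNs" error for some starting points. The sketch below uses made-up displacements and a hypothetical `func_safe` that clamps the base at zero.

```python
import numpy as np

def func(x, A, m, hf):
    return A * (x - hf) ** m

def func_safe(x, A, m, hf):
    # Clamp the base at zero so a fractional exponent never sees a negative number.
    return A * np.maximum(x - hf, 0.0) ** m

x = np.array([240.0, 250.0, 260.0])  # made-up displacements

with np.errstate(invalid="ignore"):
    raw = func(x, 0.001, 2.5, 250.0)    # 240 - 250 < 0, raised to 2.5 -> nan
safe = func_safe(x, 0.001, 2.5, 250.0)  # the same point evaluates to 0.0

print(raw)
print(safe)
```

Whether clamping is physically appropriate depends on the data; the alternative is to tighten the upper bound on `hf` so it stays below the smallest `x` being fitted.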

solemn oracle
#

can someone help me understand how the quantiles are calculated in pandas?

#

Im looking at the documentation on pandas and I see the example:

#
df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
                  columns=['a', 'b'])
df.quantile(.1)
a    1.3
b    3.7
Name: 0.1, dtype: float64
df.quantile([.1, .5])
       a     b
0.1  1.3   3.7
0.5  2.5  55.0
#

q = 0.1 should represent what the bottom ten percent of the data is below

#

so for column a, q=0.5 makes sense, there are 4 data points and an even number of values so you just take the average between them

#

I dont, however, understand how q=0.1 results in 1.3

solemn oracle
#

Ahhh, would the equation be

#

q(n +1) ?

lapis sequoia
#

I think it’s 1.25 but rounded

#

yes

solemn oracle
#

how would I calculate this

#

q = 0.1, n = 4 ?

#

so according to that logic, the 0.1 quantile should be 0.1(4+1) which it isnt

#

or is that the position its found at

#

and I need to do some math to find what the value is

lapis sequoia
#

(n+1)*q

#

oh wait

solemn oracle
#

so that would get 0.5 is the position where 10% of the values are below

#

but I dont get where position 0.5 is

lapis sequoia
#

Have you tried to look up numpy percentile

#

quantile is basically the same as numpy percentile

solemn oracle
#

I get how to use it in np, I just cant figure out where the numbers are coming from

#

when the data set is small

upbeat storm
#

Anyone who is interested in learning more about AI and ML please join this server!

spiral peak
#

So I have an odd pandas question about how to best approach this. Essentially I have 3 columns with data and they're indexed on the time values. They do not, however, have data for the same time values. One column might be missing data in the beginning, one might be missing data at the end and the beginning, and the other might be missing data at the end.

What I want to do: Shift the columns so that their end data all occurs at the same time and back fill the values with NaN. I was going to use df.shift() and the number of NaNs to do the shift, but I can't with the column that also has data missing in the beginning. I'll overshift it. Any suggestions besides manually iterating and count through the NaN values from the back until I have a non-NaN for each column?
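
A possible approach without manual counting, sketched under the assumption that the index is unique: `last_valid_index()` gives the label of each column's last non-NaN value, so the number of trailing NaNs is just the distance from that position to the end. Leading NaNs do not affect the count. `align_ends` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def align_ends(df):
    """Shift each column down so its last valid value lands on the last row."""
    out = {}
    for col in df.columns:
        s = df[col]
        last = s.last_valid_index()
        if last is None:          # column is entirely NaN, nothing to align
            out[col] = s
            continue
        trailing = len(s) - 1 - s.index.get_loc(last)
        out[col] = s.shift(trailing)   # vacated rows are filled with NaN
    return pd.DataFrame(out, index=df.index)

# Made-up frame: 'a' is missing data at both ends, 'b' at the start, 'c' at the end.
df = pd.DataFrame({
    "a": [np.nan, 1.0, 2.0, np.nan, np.nan],
    "b": [np.nan, np.nan, 3.0, 4.0, 5.0],
    "c": [6.0, 7.0, np.nan, np.nan, np.nan],
})
aligned = align_ends(df)
print(aligned)
```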

vague vector
#

Hey guys, apology for a dumb question, Regression, Classification and Clustering can also be done in Deep Learning(like using Keras), or it can only be done in Machine Learning, Deep Learning is only for RNN, CNN etc...?

graceful glacier
#

has anyone come up with problem while trying to debug a program running pyspark?

#

im currently using pycharm to debug it and this is the code

#

im taking a spark rdd (i believe) called tweets and taking the stopwords out of its "text" column

#

i can place a breakpoint on the last line and the debugger will work fine, but if i place it anywhere inside the remove_stopword function the debugger will disconnect

#

any one have an idea as to why? is it because of how spark works under the hood maybe?

lapis sequoia
#

does someone know how to make violin plots?

#

from lists

#

i have seen how to do it with csv file, but i just need to use list and its not working

astral path
#

if I have a time series as a feature (e.g. pitch over time for an audio file) while clustering, is it bad practice to use the mean of the time series as a feature instead to simplify it and avoid the curse of dimensionality?

shrewd pewter
#

More of a web scraping question but what libs can I use to parse this kind of data?

#

Returned from an HTTP request

obsidian crow
simple iron
#

Hey all, does anyone have good resources for preparing for technical ML interviews? Currently an ML eng at big tech co. I've been using leetcode.com for coding prep for traditional data structures & algorithms, and datascienceprep.com for ML/stats questions, was wondering if anyone knew of others.

old pendant
#

Is there a way to select values by an array that defines which column for every row I will select in numpy? (without iterating every row)

Example:

column_indexes = np.array([1, 0, 1, 1, 1])

values = np.array([[1981.5       , 1894.        ],
       [ 489.33333333,  492.        ],
       [1110.        , 1110.        ],
       [ 197.        ,  197.        ],
       [ 301.66666667,  319.        ]])

values_selected = array([1894.        ,  489.33333333, 1110.        ,  197.        ,
        319.        ])

Thanks!

old pendant
velvet thorn
astral path
#

I'm making a feature set where the features are based on an analysis of audio files of differing length. For example, I have audio files A and B and the feature is the loudness over time, but A is 2 times the length of time as B. As a result, the feature for A would be an array of 2x the length of the feature for B. What would the best way to cluster be when I have feature sets of differing length?

old pendant
#

@velvet thorn thank you!

old pendant
#

@velvet thorn if values matrix have n_cols > 3, the method is still valid? the trick with column_indexes[:, None] will need to be rewritten, correct?
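
The reply being referred to is not shown above, so this is only a sketch of two standard NumPy ways to do the selection. Neither depends on the number of columns, so a matrix with more than 3 columns should not need a rewrite.

```python
import numpy as np

column_indexes = np.array([1, 0, 1, 1, 1])
values = np.array([[1981.5, 1894.0],
                   [489.33333333, 492.0],
                   [1110.0, 1110.0],
                   [197.0, 197.0],
                   [301.66666667, 319.0]])

# Option 1: pair each row number with its chosen column.
rows = np.arange(values.shape[0])
picked = values[rows, column_indexes]

# Option 2: take_along_axis needs the index array to have as many
# dimensions as values, hence column_indexes[:, None].
picked_alt = np.take_along_axis(values, column_indexes[:, None], axis=1)[:, 0]

# The same indexing on a made-up 4-column matrix.
wide = np.arange(20).reshape(5, 4)
wide_idx = np.array([3, 0, 2, 1, 3])
picked_wide = wide[np.arange(wide.shape[0]), wide_idx]
```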

astral path
#

thanks

desert parcel
flint sierra
#

Thanks!!

proven sigil
# solemn oracle I dont, however, understand how q=0.1 results in 1.3

[1, 2, 3, 4]. There's a 3 element gap between first and last element. (n - 1).
q=0.1 means the value sits 3 * 0.1 = 0.3 positions after the first element (sorted),
so it is the "1.3rd" element => 0.7 * first_element + 0.3 * second_element => 1.3
Same for [1, 10, 100, 1000]
0.7 * 1 + 0.3 * 10 = 3.7
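
The rule above can be written out by hand. This is a minimal sketch of linear interpolation, which is the default `interpolation='linear'` behaviour of pandas `quantile`; `linear_quantile` is a hypothetical helper, not a pandas function.

```python
import math

def linear_quantile(data, q):
    """Quantile by linear interpolation between the two nearest sorted values."""
    ordered = sorted(data)
    pos = q * (len(ordered) - 1)        # fractional 0-based position
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

a = linear_quantile([1, 2, 3, 4], 0.1)         # position 0.3, between 1 and 2
b = linear_quantile([1, 10, 100, 1000], 0.1)   # position 0.3, between 1 and 10
mid = linear_quantile([1, 2, 3, 4], 0.5)       # position 1.5, between 2 and 3
print(a, b, mid)
```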

vague vector
#

Please correct me where I'm wrong, I'm trying to clear my basic concepts:

Regression, Classification, Clustering, dimensionality reduction etc are some major algorithms in Machine Learning.

Machine Learning also has another set of special algorithms called Neural Networks.
Deep Learning is when Neural Networks has depth, i.e. with multiple Layers.
Deep Learning specialize in non-linearities, feature engineering is also done automatically.

RNN, CNN, GAN are some popular architectures of Deep Learning.

lapis sequoia
velvet thorn
#

but they're not the same

lapis sequoia
velvet thorn
lapis sequoia
#

Anyone is working on Data Engineering Platform?

lapis sequoia
#

whole ml comes under ai

#

AI>ML>DL in short

lapis sequoia
#

What is DL

torpid cave
#

Hi all, anyone who works with classes for your data pipes

#

Do you prefer long methods to do all the lifting, or many small methods which you can edit later

lapis sequoia
lapis sequoia
fast vector
#

Hello, I'm a second year data science major at a state university. I have been disappointed with my curriculum thus far because my courses don't cover python for data science specifically and the Intro to R class was pretty basic. I'd like to become more familiar with both of these and reach a level in which I could comfortably apply for internships. I eventually want to build a good foundation on python to start with ML. My understanding is that projects are incredibly important. Does anyone have a list of resources, specific python and R libraries, projects, books, or websites I could use to reach my goals?

crisp gazelle
#

!resources

arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

crisp gazelle
#

@fast vector use this

fast vector
#

Wow that's awesome! Thanks @crisp gazelle!

crisp gazelle
#

No problem!

dense nova
#

Hello, I have trained a model in https://teachablemachine.withgoogle.com that has 3 classes (Hand Raised, Thumbs Up and Neutral).
I exported the model as a .h5 Keras model, and I've managed to make some predictions from some testing data that i've gathered.

the predictions output looks like this:

[[9.9910396e-01 1.9197341e-05 8.7688433e-04]]

im not sure what to make of this, any help would be great

vestal magnet
#

Hello! I have a question regarding matplotlib. How can I plot different lists in the same scale?
So, ideally, the dark blue line, should be within the boundaries of the green and red line, but it doesn't.
The dark blue line is a list, composed of several altitude values.

#

I asked in #help-grapes and they told me to adjust the number of samples of the dark blue line same as the red and green lines, which is 512

#

And I can't get anymore cause those values come from server request to Google Maps API, and the max number of samples is 512
On the other hand, the dark blue line comes from the path provided by A* on a csv file
It's like this:

I have a set of altitudes, imagine: 60, 100, 120

For those values I can only trace a path between 60+ and 120+, summed to those numbers, that is, I can only trace between 120-180, 160-220 and 180-240, cause those are the max limits for the drone, in this case.
But I have like 3000 samples or so

#

But even if I adjust the sample numbers to be the same, I still get the plots above

lament loom
#

Is anyone preparing for Google Summer of code, or have experience with the same?
I was planning to participate in GSoC 2021, as I have done a bunch of Machine Learning, NLP and Data Science Projects, also have some entry-level experience with Open-source contribution and Git/GitHub.

sand sluice
#

Is there a way to get the underlying numpy array of a matplotlib plot. I want to apply a color map to 1D data points, and then use openCV to threshold the rgb image. The problem is I want to run k-means on the points inside the threshold, so they need to correspond exactly with the original. The way I am currently doing it, by saving the plot to a file, means that the size will depend on the DPI, and the pixels don't match.

limpid oak
#

how can i add list of columns to df

#

KILLA_LINE_Col = ['SR_NO', 'DISTRICT_N', 'TEHSIL_NAM', 'VILLAGE_NA',
'HB_NO', 'LAYER_NAM', 'DESCRIPTIO', 'LENGTH_MTR',
'LENGTH_KAR', 'AREA_SQMTR', 'DES_MEASUR']

#

i want to add this colmns to existing dataframe

limpid oak
#

KILLA_LINE_file_copy[:,KILLA_LINE_Col] = np.nan

#

TypeError: unhashable type: 'slice'

#

getting this error

#

anybody here for help
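
A likely reading of the error: plain `df[...]` only takes column labels, so `df[:, cols]` passes a tuple containing a slice, which cannot be hashed as a label. One way around it, sketched with a made-up stand-in for `KILLA_LINE_file_copy` and a hypothetical helper `add_empty_columns`, is to reindex the columns; any label that does not exist yet comes back filled with NaN.

```python
import pandas as pd

def add_empty_columns(df, new_columns):
    """Return a copy of df with any missing columns appended, filled with NaN."""
    missing = [c for c in new_columns if c not in df.columns]
    return df.reindex(columns=list(df.columns) + missing)

# Hypothetical stand-in for KILLA_LINE_file_copy.
frame = pd.DataFrame({"SR_NO": [1, 2], "geometry": ["g1", "g2"]})
wanted = ["SR_NO", "DISTRICT_N", "TEHSIL_NAM"]
frame = add_empty_columns(frame, wanted)
print(frame)
```

Columns that already exist (here `SR_NO`) keep their data, because only the missing labels are appended.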

fervent flax
#

hey guys, im using the dog.ceo api and sometimes it'll give slightly misspelled names (like "Germanshepherd" or "Stbernard" instead of "German Shepherd" or "St. Bernard")

#

any way to return a "correct" dog breed or fix it? not sure if this is the correct channel
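
If a list of correctly spelled breeds is available, one option is to compare names after stripping spaces and punctuation, with a fuzzy fallback from the standard library. This is only a sketch: `CANONICAL` is a hypothetical, deliberately short list, and the 0.8 cutoff is a guess.

```python
import difflib
import re

# Hypothetical list of correctly spelled breeds; extend as needed.
CANONICAL = ["German Shepherd", "St. Bernard", "Golden Retriever"]

def _key(name):
    """Lowercase and drop everything except letters."""
    return re.sub(r"[^a-z]", "", name.lower())

LOOKUP = {_key(name): name for name in CANONICAL}

def fix_breed(raw):
    """Map an API name to its canonical spelling, or return it unchanged."""
    key = _key(raw)
    if key in LOOKUP:
        return LOOKUP[key]
    close = difflib.get_close_matches(key, list(LOOKUP), n=1, cutoff=0.8)
    return LOOKUP[close[0]] if close else raw

print(fix_breed("Germanshepherd"), "|", fix_breed("Stbernard"))
```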

pastel glacier
#

beautiful soup vs selenium vs scrapy??

#

which

#

is best for web scraping

fleet heath
velvet thorn
#

is there a finite list of misspellings?

fervent flax
#

Kinda, but the list is long soo i didnt wanna go through it, i fixed it by using a different api though

My original idea was to use wikipedia's api to search using the misspelled word, and then use the suggested article's name for the correct breed name, but it didnt work for edge cases

Or do a google search and use the first suggested wikipedia link's article name (so i did misspelled name + dog for the search query) but that took wayyy too long

#

It's fine now though, thanks

sullen crescent
inland iron
#

sup guys howre you going ?

sullen crescent
#

I think it shows the confidence for each of the 3 classes you made (hand raised, thumbs up and neutral), but i'm not sure tho @dense nova
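
To make that concrete, here is a small sketch that turns the posted array into per-class percentages. The label order is an assumption; it should be checked against the labels file exported alongside the model, if there is one.

```python
import numpy as np

# Label order is an assumption; verify it against the exported labels.
labels = ["Hand Raised", "Thumbs Up", "Neutral"]
predictions = np.array([[9.9910396e-01, 1.9197341e-05, 8.7688433e-04]])

probs = predictions[0]               # one row per input image
best = int(np.argmax(probs))         # index of the highest score
summary = {label: round(float(p) * 100, 2) for label, p in zip(labels, probs)}
top_label = labels[best]

print(top_label, summary)
```

The three numbers sum to roughly 1, which is what a softmax output layer produces, so under the assumed order the model is about 99.9% confident in the first class.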

lapis sequoia
#

howdy, working on some NLP projects. anyone here can answer a question about annotations? I see this type of annotation framework: https://universaldependencies.org/format.html are there any other type of annotation standards, frameworks you know of? Thanks

austere swift
#

I'm having an issue with pytorch

#

so what happens is whenever i try to import it in a python file i get this error

Traceback (most recent call last):
  File ".\script.py", line 7, in <module>
    import torch
  File "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\__init__.py", line 117, in <module>
    raise err
OSError: [WinError 127] The specified procedure could not be found. Error loading "C:\Users\user\AppData\Local\Programs\Python\Python38\lib\site-packages\torch\lib\cufftw64_10.dll" or one of its dependencies.
#

but when i do it in repl its completely fine

#

i don't understand it lol

#

i tried reinstalling cuda/cudnn and reinstalling pytorch but neither worked

#

and i havent found anything on this error

#

also i verified that the cufftw64_10.dll file is there

#

python 3.8.6 pytorch 1.7.1 cuda 11.0 and cudnn 8.0.5 btw

#

and gpu drivers are the latest

austere swift
#

@ me if you have an answer btw

austere swift
#

So i just fully deleted the torch folder from site-packages as well as fully deleted the cuda folder and then reinstalled both and it worked now

sullen crescent
#

configuring cuda, cudnn, and ML/DL framework on windows is such a pain

lapis sequoia
#

I might think of participating

#

When is it held?

#

The dates

lament loom
lapis sequoia
#

How can I change multiple labels from a value to another in a pandas dataframe. I have tried train_df[train_df['label']=='humor'].label = 'fake' and doesn't really work. .label is a column or Series of the train_df dataframe.

#

wdym label?

#

columns?

#

It's fine, I figured it out train_df.loc[train_df['label']=='humor', 'label'] = 'fake'

#

thanks man

#

aight Pepex3

sturdy wren
hollow scarab
#

if I have a df like this, is it possible to add a row which is not related to these?

#

It would be on the 2nd row and it would be: 'Weighted number', and then for the next 4 columns it would have a formula, the 3rd row * 0.6 + the 4th row * 0.4

molten hamlet
lapis sequoia
#

Wow

#

Thats some fancy looking plot there

lapis sequoia
hollow scarab
#

no, a new row @lapis sequoia

#

but instead of the formula I want a number displayed in the 2. row ofc

lapis sequoia
hollow scarab
#

Oh, that could work, thank you! @lapis sequoia

lapis sequoia
#

you're welcome (:

gritty wedge
#

@lapis sequoia ur name is very nice

lapis sequoia
#

hi

#

I am 14 year old and 9th grade student

#

I was interested in learning science

#

unfortunately almost all tutorials I've seen have a lot of complex math

#

they have weird symbols and terms I haven't even heard of

#

I wanted to ask that what all maths is required to learn data science

#

please ping me with reply

#

thank you

#

🤔

#

Statistics

#

Discrete Math

#

oh ok

#

I was doing that rn for my exam

#

but

#

they had those weird symbols

#

one looked like a mirrored e

#

which they call sigma

#

I am a math teacher

#

And math is not easy for me either

#

ok

#

I was thinking of just dropping it due to high level maths, but that would be quitting

#

so I thought I might study a math book or 2

lapis sequoia
#

thanks

#

so I need statistics and discrete math right?

#

that's all?

#

I don't think that's all, it depends on how deep you want to go into data science

#

well i need enough for machine learning

#

Some higher-level math in data science is calculus and linear algebra too

lapis sequoia
#

thanks for advice, I'll try to get a grip on these topics

lapis sequoia
gritty wedge
lapis sequoia
#

Is there any great DL tutorial video on youtube that you guys would recommend? I’ve been doing data analysis with pandas, and I want to dig into deep learning with tensorflow, but can’t seem to find a good tutorial for total beginners.

sly niche
#

Word2vec, what use?

#

Just want to play with nlp a bit. Make a thesaurus, gpt suggested that.

#

Also, are colab tpus really free?

twilit pilot
#

Right now I have a pandas series that looks like this time 2020-12-24 12:34:00-05:00 222.600 2020-12-24 12:35:00-05:00 222.480 2020-12-24 12:36:00-05:00 222.520 2020-12-24 12:37:00-05:00 222.510 2020-12-24 12:38:00-05:00 222.330 ... 2020-12-30 12:51:00-05:00 222.510 2020-12-30 12:52:00-05:00 222.505 2020-12-30 12:53:00-05:00 222.565 2020-12-30 12:54:00-05:00 222.565 2020-12-30 12:55:00-05:00 222.535 Name: close, Length: 1000, dtype: float64 The time column is the index and i want to edit it to be numerical like 1, 2, 3, 4, 5, 6, 7, 8, 9.... Can someone help?
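
Two possible ways to do this, sketched on a three-row stand-in for the series (timezone dropped for brevity): either replace the index outright, or move the timestamps into a column first so they are not lost.

```python
import pandas as pd

idx = pd.to_datetime(["2020-12-24 12:34", "2020-12-24 12:35", "2020-12-24 12:36"])
close = pd.Series([222.600, 222.480, 222.520], index=idx, name="close")
close.index.name = "time"

# Option 1: throw the timestamps away and count from 1.
numbered = close.copy()
numbered.index = pd.RangeIndex(start=1, stop=len(close) + 1)

# Option 2: keep the timestamps as a column next to the new counter.
table = close.reset_index()
table.index = table.index + 1

print(numbered)
print(table)
```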

lapis sequoia
#

guys, i wanna use Xception as my model to train
Can i load it somehow and just train it from scratch?

limpid oak
#

how can i make polygon from linestring

#

my code is not working

#

```py
import geopandas as gpd
from shapely.geometry import Polygon, mapping

def linestring_to_polygon(fili_shps):
    gdf = gpd.read_file(fili_shps)  # LINESTRING
    gdf['geometry'] = [Polygon(mapping(x)['coordinates']) for x in gdf.geometry]
    return gdf
```

#

LINESTRING Z (528736.796 3513075.750 0.000, 52...)

limpid oak
#

need help

lapis sequoia
#

Hi guys,
I would like to make a user interface in order to visualize stock data that is being webscraped in real time.
I was wondering what you would recommend as a simple user interface. Would something like HTML and CSS suffice to create a basic real-time UI locally? Or is that not ideal as you have to constantly refresh the page to get new data? Or is it easier to stick to something like tkinter or another python package. I'm new to this so I would appreciate any type of advice!!

soft dock
#

Flask and Pusher

lapis sequoia
#

Nice!! Thanks a lot, @soft dock !!
Is Pusher some kind of online host?

soft dock
#

more of an API

nova smelt
serene scaffold
#

@gentle wagon to answer your question about numpy: It's used for linear algebra, or just do do large numbers of computations in batches. Suppose you're tracking data about the daily temperature in a given city: the array will have 365 elements. If you have that data for ten years, you can stack all those arrays to get a (10, 365)-shaped matrix. And then if you want to get an array of the daily average, you just have to make an array that's the average of each column. Not linear algebra per se, but numpy makes this kind of math easy to do.
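
The temperature example might look like this, with random numbers standing in for real measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up temperatures: 10 years (rows) x 365 days (columns).
temps = rng.normal(loc=15.0, scale=8.0, size=(10, 365))

daily_average = temps.mean(axis=0)   # average down each column: one value per calendar day
yearly_average = temps.mean(axis=1)  # average across each row: one value per year

print(daily_average.shape, yearly_average.shape)
```

The `axis` argument names the dimension that gets collapsed, so `axis=0` averages over the 10 years and leaves 365 values.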

lapis sequoia
#

guys, i wanna use Xception as my model to train
Can i load it somehow and just train it from scratch?

vocal bay
#

Hi guys. I want to learn data science and ml (including dl, rl and drl) but i don't think i have the necessary mathematical background for me to understand it properly. Which resources would you recommend to get me up to speed? And which resources would you recommend for learning data science and ml?

lapis sequoia
#

Pusher seems to depend on the Microsoft Visual C++ build tools. Is there something I can do to avoid that? I prefer to stick to PyCharm. But I keep getting this error whenever I try to install pusher:

error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

austere swift
#

pip needs the build tools when it needs to build some package from source (which probably means theres no whl file for your version of python)

#

what version of python are you on?

astral path
#

is it a bad idea to use feature agglomeration on a time series?

lapis sequoia
lapis sequoia
#

hey guys

soft dock
#

@lapis sequoia I did a pip install in an isolated virtual environment, as I normally do

modest mantle
#

Uh, I'm sorry but I guess it's a mistake.. I don't remember asking anything here '^'

soft dock
#

wrong dean, apologies

modest mantle
#

No problem :D

lapis sequoia
#

guys, i wanna use Xception as my model to train
Can i load it somehow and just train it from scratch?

sullen crescent
#

what do you mean by train it from scratch?

#

if you mean train it with your own dataset which is totally different categories,/classes, you can

hasty grail
lapis sequoia
#

well, i am trying and acc is 0.007

#

i made my own small model with 3 layers and 0.25

#

so idk

hasty grail
#

Can you provide more information on what you're doing?

sullen crescent
#

i'm kinda lost here, what do you mean with "3 layers and 0.25"?

hasty grail
#

I'm guessing 3x conv2d and a dropout of 0.25

austere swift
#

I’m not completely sure if it’s on there but you can look up Christoph gohlke he made a repository of wheel files that you can download, check if the one you need is on there

#

Make sure you get the right one, should have cp<Python version> and if you have 64 bit Python then you need the one that says 64

#

By cp I mean like if you have Python 3.8 it would be cp38, 3.9 would be cp39, etc

lapis sequoia
#

nvm i think i got it

#

this is going good, isnt it? @hasty grail

hasty grail
lapis sequoia
#

i am doing the predictions "manually"

#
```py
def predict(path, dims, color):
    img = cv2.imread(path, color)
    img = cv2.resize(img, dims) / 255
    prediction = model.predict(img[np.newaxis, ...])
    print(np.argmax(prediction))```
#

But for example, ive already found one image it fails

#

Anyway, i would like to know which is the other class most likely to be

#

like, the top 5 classes

#

idk if i am explaining
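
One way to get the top 5 instead of only the `argmax`, sketched with a made-up prediction array; it assumes a single image per call, as in `predict` above. `top_k` is a hypothetical helper.

```python
import numpy as np

def top_k(prediction, k=5):
    """Return (class_index, score) pairs for the k highest scores, best first."""
    scores = np.asarray(prediction).reshape(-1)   # assumes a batch of one
    order = np.argsort(scores)[::-1][:k]          # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in order]

# Made-up output for a 7-class model, shaped like a (1, n_classes) result.
fake_prediction = np.array([[0.05, 0.40, 0.02, 0.25, 0.10, 0.15, 0.03]])
ranked = top_k(fake_prediction)
print(ranked)
```

The first pair is the same class `np.argmax` would return; the rest are the runners-up with their scores.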

velvet thorn
#

@fervent flume you want a join

#

if you havne't solved it

fervent flume
#

@velvet thorn hmm that makes sense

#

not sure what the join should look like tho

velvet thorn
#

join on index

#

the datetime index

fervent flume
#

so rather than using the for loop, I have this, but it's just as slow:

temp = df.loc[right,:].set_index(pd.DatetimeIndex(left))
temp = temp.groupby(temp.index).apply(lambda x: x.ffill())
temp = temp[~temp.index.duplicated(keep="last")]
df.update(temp)

but you're saying i should be able to join on this rather than doing the group by?

serene scaffold
#

like what kind of data are in these dataframes and what are you doing with it?

fervent flume
#

yeah i have some daily data for a bunch of columns, and there are some dates that are "bad" (holidays and weekends), but sometimes data comes in on those "bad" days. So what I want to do is update the data on the day before the "bad" day with the "bad" days data. So for the most part that's going to look like updating Friday's data with data that came in on Saturday and Sunday, if any

#

But I can't just backfill, because Sunday's data would overwrite Saturday's data

#

(on the off chance that data came in on both saturday and sunday)

#

so I ahve a function that gives me these date pairs, and the code above is what I have to solve the issue
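
A possible alternative to the per-group `ffill`, offered as a sketch only: relabel every "bad" date with the date that should receive its data, then use `groupby(...).last()`, which by default takes the last non-NaN value per column within each group. The frame and the `bad_to_good` mapping below are made up; the real mapping would come from the date-pair function mentioned above.

```python
import numpy as np
import pandas as pd

# Made-up frame: Fri, Sat, Sun, Mon.
idx = pd.to_datetime(["2001-05-04", "2001-05-05", "2001-05-06", "2001-05-07"])
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 5.0],       # Saturday overwrites Friday
    "y": [10.0, np.nan, np.nan, 50.0],  # no weekend data, Friday is kept
    "z": [np.nan, 7.0, 8.0, np.nan],    # Sunday wins over Saturday
}, index=idx)

# Hypothetical mapping: bad day -> day that should receive its data.
bad_to_good = {
    pd.Timestamp("2001-05-05"): pd.Timestamp("2001-05-04"),
    pd.Timestamp("2001-05-06"): pd.Timestamp("2001-05-04"),
}

target = df.index.map(lambda d: bad_to_good.get(d, d))
rolled = df.groupby(target).last()   # last non-NaN per column within each target date
print(rolled)
```

Because rows keep their original order inside each group, a later day overrides an earlier one only where it actually has a value.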

serene scaffold
fervent flume
#

correct

serene scaffold
#

And what does it mean for Friday to "get" that data? Is this addition of numbers or something?

fervent flume
#
```py
temp = temp.groupby(temp.index).apply(lambda x: x.ffill().iloc[-1])
df.update(temp)```
#

no just overwrite

#

overwrite if not nan

serene scaffold
#

ah

#

can you show me an example of what the dataframe looks like?

#

like if you print it?

tepid pawn
#

I have a question on how to impute values given the contents of a different column. Like if colA=1 impute 2 into colB, if colA=2 impute 3 into colB. Anyone have an idea on how to do this?

serene scaffold
tepid pawn
#

It's the titanic training set. I want to impute average age of people within the same class/sex rather than the mean of the column.

serene scaffold
tepid pawn
#

that sounds right

serene scaffold
#

Let's see if I still have that code.

tepid pawn
#

great, thanks!

serene scaffold
tepid pawn
#

I've done it I think, but it's been a while

fervent flume
#

@serene scaffold

```
2001-05-04   NaN   NaN   NaN      NaN   NaN         NaN       NaN   NaN   NaN   NaN   NaN  NaN NaN   NaN  NaN  NaN   NaN  NaN  ...        NaN  NaN       NaN   NaN   NaN   NaN    NaN       NaN   NaN   NaN   NaN  NaN   NaN  NaN   NaN   NaN   NaN   NaN
2001-05-04   NaN   NaN   NaN      NaN   NaN         NaN       NaN   NaN   NaN   NaN   NaN  NaN NaN   NaN  NaN  NaN   NaN  NaN  ...        NaN  NaN       NaN   NaN   NaN   NaN    NaN       NaN   NaN   NaN   NaN  NaN   NaN  NaN   NaN   NaN   NaN   NaN
2001-05-11   NaN   NaN   NaN      NaN   NaN         NaN       NaN   NaN   NaN   NaN   NaN  NaN NaN   NaN  NaN  NaN   NaN  NaN  ...        NaN  NaN       NaN   NaN   NaN   NaN    NaN       NaN   NaN   NaN   NaN  NaN   NaN  NaN   NaN   NaN   NaN   NaN
2001-05-11   NaN   NaN   NaN      NaN   NaN         NaN       NaN   NaN   NaN   NaN   NaN  NaN NaN   NaN  NaN  NaN   NaN  NaN  ...        NaN  NaN       NaN   NaN   NaN   NaN    NaN       NaN   NaN   NaN   NaN  NaN   NaN  NaN   NaN   NaN   NaN   NaN
2001-05-18   NaN   NaN   NaN      NaN   NaN         NaN       NaN   NaN   NaN   NaN   NaN  NaN NaN   NaN  NaN  NaN   NaN  NaN  ...        NaN  NaN       NaN   NaN   NaN   NaN    NaN       NaN   NaN   NaN   NaN  NaN   NaN  NaN   NaN   NaN   NaN   NaN```
#

basically

#

lol

tepid pawn
#

um... @fervent flume

serene scaffold
tepid pawn
#

lol

#

ok, following

serene scaffold
#

And then you can mask another column with that to only get the columns where those conditions are true in the other columns.

#

And take the mean of that

#

💥

tepid pawn
#

Ok, thanks. I'll give it a shot

serene scaffold
#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

fervent flume
#

it's very sparse

#

but you can imagine creating a df with like a 10 year daily frequency, and 1-3000 columns

#

then injecting random data with a 95% chance of being nan, and you'd have the dataframe

serene scaffold
serene scaffold
# tepid pawn lol

the numbers you're trying to impute: you're replacing nan values, yes?

tepid pawn
#

yes

#

blanks

fluid pike
#

rn i'm trying to read a csv file in a Jupyter notebook

#

but i keep getting this issue

fervent flume
#

@serene scaffold yeah basically. I have a list of date pairs that I want to replace, Friday is just the most common example, there're other dates in general that i'd want to do this to. And I want to keep the last value if there's a value on both saturday/sunday

fluid pike
#

this doesn't work

#

wait nvm

#

lemme see again

serene scaffold
# fluid pike

how much programming experience would you say you have?

fluid pike
#

nvm I got it to work

#

I put in the wrong filename

serene scaffold
#

I was just going to say that jupyter notebooks tend to be confusing for learners.

fluid pike
serene scaffold
#

Jupyter notebooks can become convoluted since the cells can be executed in any order you'd like.

serene scaffold
serene scaffold
#

Someone ping me if they want me to come back.

tepid pawn
#

I'm working on it now.

hasty grail
serene scaffold
#

or is that tensorflow?

hasty grail
#

the latter

serene scaffold
# hasty grail the latter

I've used tensorflow but I still haven't got a clear picture of what it "is". Is it basically numpy on the gpu?

hasty grail
#

It is so much more

#

I think the most defining aspect of it would be graph execution

tepid pawn
#

@serene scaffold I have the numbers (means) that I want to insert into the NaN points. I just don't know how to conditionally impute them. At first I was thinking something like if df['column'] == x & df['column'] == y, but don't know where to go from there.

#

I got the means with grouping

serene scaffold
#

but if it's everything then I'll never wrap my head around it.

hasty grail
#

Also since it's integrated with Keras you don't have to write your own training loops

#

Makes training ML models so much more convenient

serene scaffold
serene scaffold
hasty grail
#

It's too much for me to explain, would be easier to read the docs

tepid pawn
#

Here I created a new df with fillna, but it was with the mean of the entire column.

train_num2 = train_num.fillna(train_num.mean().round(0))

hasty grail
#

ML training is usually done in graphs, while regular computation uses eager execution

#

Eager execution is essentially just Python logic

fervent flume
#

@serene scaffold no all values should be updated if there's a new non-nan value later

slender oracle
#

I think lazy execution is a little more accurate. e.g. spark

serene scaffold
#
mask = (df['age'] == 40) & (df['class'] == 'first')
df['died'].fillna(np.nanmean(df['died', mask]))

I didn't look up the methods or anything for this so this is probably wrong, but I think something along these lines will work.

tepid pawn
#

ok, I'll play with it. thanks again

fervent flume
#

i think it's mask, died (index, columns)

#

but i could be wrong

serene scaffold
#

might even be df['died'][mask]

hasty grail
#

Graph execution is a bit like lazy execution but not really. It involves compiling Python functions through tf.function which run as graphs during runtime, allowing the engine to perform optimizations such as parallelizing and merging operations.

slender oracle
#

Gotcha. That is similar to spark as well.

#

Can see it when looking at the "explain" for a given dataframe

hasty grail
#

This is also where a lot of the original notoriety of TensorFlow came from though. Originally, everything had to be done via graphs, which made it incredibly difficult to debug because breakpoints don't work in graph execution, as the code that is actually executed is dynamically generated elsewhere when the function is compiled.

#

also you had to write boilerplate code for the compile-run process

slender oracle
#

Was it TensorFlow 2.0 that added the ability to do stuff outside of graphs? I haven't really messed around with it for a long, long time (~2015-ish)

hasty grail
#

Yup.

#

Also even in graph mode you don't have to mess around with tf.Session anymore. You just use the tf.function decorator around whatever function you want to compile.

#

the first time the function is evaluated, it is automatically compiled

slender oracle
#

Have you tried using PyTorch? If so, what are your thoughts on it vs TensorFlow?

#

I used the old Torch package in Lua, but haven't touched the python version yet.

hasty grail
#

Only in passing, the thing I don't like is that you still have to define your training/evaluation loop explicitly whereas TF 2.0 already has a default implementation thanks to Keras

#

However, it is easier to debug because it uses eager execution all the way

slender oracle
#

I think I'm missing something, but can't you use something like a CrossValidator class (or variant thereof) to abstract away the training/evaluation part?

fervent flume
#

tensorflow is so annoying to work with

#

PyTorch is so much easier

hasty grail
fervent flume
#

microsoft's version was the best though. By far the most intuitive to understand and to use imo. the lack of explicit loops and the way recurrence was handled was also super nice.

#

too bad that died

desert parcel
#

can someone explain this error. Specifically what does "non-singleton dimension" mean?

#

RuntimeError: The size of tensor a (1338) must match the size of tensor b (5) at non-singleton dimension 1

tepid pawn
#

@serene scaffold I couldn't get it to work with masking. I figured it out with a different groupby: define a function to impute, then transform. titanic_tr is the training df

# impute age based on sex/pclass

# Create a groupby object: by_sex_class
by_sex_class = titanic_tr.groupby(['Sex', 'Pclass'])

# Write a function that imputes the group median
def impute_median(series):
    return series.fillna(series.median())

# Impute age within each group and assign to titanic_tr['Age']
titanic_tr['Age'] = by_sex_class['Age'].transform(impute_median)

serene scaffold
#

df[BIN_LABEL] was the column that identified the class for that row. Also worth noting that this is doing the imputation for every column, or something

velvet thorn
#

the last dimension is a singleton dimension

#

the first two are not
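
A "singleton" dimension is one of size 1. Broadcasting can stretch a size-1 dimension to match the other tensor, but two different sizes that are both larger than 1 cannot be reconciled. Here is a minimal pure-Python sketch of that rule. It is not PyTorch's implementation, and the shapes are assumptions chosen only to reproduce the numbers in the error message.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast shape, or raise ValueError naming the clashing dimension."""
    ndim = max(len(shape_a), len(shape_b))
    # Pad the shorter shape with leading 1s so both have the same rank.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for dim, (size_a, size_b) in enumerate(zip(a, b)):
        if size_a == size_b or size_a == 1 or size_b == 1:
            # Equal sizes, or a singleton that can be stretched.
            result.append(max(size_a, size_b))
        else:
            raise ValueError(
                f"size of a ({size_a}) must match size of b ({size_b}) "
                f"at non-singleton dimension {dim}"
            )
    return tuple(result)


# Hypothetical shapes: a column of 1338 values against a (1338, 5) matrix.
print(broadcast_shape((1338, 1), (1338, 5)))  # (1338, 5)

# A flat (1338,) vector is padded to (1, 1338), which clashes with 5 at dimension 1.
try:
    broadcast_shape((1338,), (1338, 5))
except ValueError as err:
    print(err)
```

If that is what is happening, reshaping the flat tensor to (1338, 1) is a common fix, but check it against what the two tensors are supposed to represent.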

sleek fjord
#

Hey

#

I'm new to coding

#

and I'm trying to data scrape

#

some nba stats

lapis sequoia
#

@sleek fjord, no

velvet thorn
#

go on

sleek fjord
#

what am i doing wrong?

velvet thorn
#

okay in general

#

don't post screenshots please

#

post code as text; it's easier to read and debug.

sleek fjord
#

o sorry

velvet thorn
#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

sleek fjord
#

Traceback (most recent call last):
File "/Users/airmac/Documents/NBA Python/Untitled.py", line 1, in <module>
from basketball_reference_scraper.teams import get_roster, get_team_stats, get_opp_stats, get_roster_stats, get_team_misc
ImportError: No module named basketball_reference_scraper.teams
[Finished in 0.072s]

#

this is the error code

#

a = get_opp_stats('BOS', 1955, data_format='TOTAL')
print(a)
#

thats the code

velvet thorn
#

what do you understand by this ImportError: No module named basketball_reference_scraper.teams

sleek fjord
#

nothing

velvet thorn
#

so you're trying to import something

sleek fjord
#

i did the pip install basketball_reference_scraper

velvet thorn
#

from a module (Python file) that it can't find

#

presumably either your install failed

#

or you're using the wrong Python installation

sleek fjord
#

it didnt fail

sleek fjord
velvet thorn
#

well

#

that seems to be the case, considering you can't import it

#

or the module name could be wrong
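
One way to tell these cases apart is to ask the interpreter that actually runs the script which Python it is and whether it can see the package. This is a small sketch using only the standard library; describe_module is a helper name made up for illustration, and the script itself needs Python 3 (failing with a SyntaxError is a hint that the editor is still launching Python 2).

```python
import importlib.util
import sys


def describe_module(name):
    """Report whether the interpreter running this script can find `name`."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        spec = None
    if spec is None:
        return f"{name}: not found by {sys.executable}"
    return f"{name}: found at {spec.origin}"


print("interpreter:", sys.executable)
print("version:", sys.version.split()[0])
print(describe_module("json"))  # stdlib module, should always be found
print(describe_module("basketball_reference_scraper"))
```

If the interpreter path printed here differs from the one pip installed into, the package went into a different Python installation.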

sleek fjord
#

can i post links?

#

to the api thing?

#

its been updated recently

#

and I have the latest version of python

#

Python 2.7.16

#

this is the version

#

damn left on read

#

okay

serene scaffold
sleek fjord
#

why

serene scaffold
#

It was released a long time ago and the python community has moved on to python 3.

sleek fjord
#

by updating

#

should my error be resolved?

vital ocean
#

i think so

serene scaffold
#

it might somehow solve it. The problem is that Python can't see the module you're referring to.

serene scaffold
# sleek fjord should my error be resolved?

you should have python 3 in either case. There's almost no point learning python 2 at this point because anyone who hasn't updated their project to 3 has probably abandoned that project.

vital ocean
#

btw @sleek fjord if u are scraping data u can use Parsehub it will help u

sleek fjord
#

do i need to relaunch atom after i get the new version

vital ocean
#

it can scrape any website for free

serene scaffold
sleek fjord
vital ocean
#

i think so

serene scaffold
vital ocean
#

try adding /robots.txt to the end of the URL of the site you're scraping

#

that will tell u what u can scrape
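
Python's standard library can read those rules for you. The sketch below parses a made-up robots.txt from a string so it runs without network access; the rules, the example.com URLs, and the "my-scraper" user agent are all placeholders, not the rules of any real site.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, used here instead of downloading a real one.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE_ROBOTS_TXT.splitlines())

# can_fetch(user_agent, url) answers "may this agent request this URL?"
print(parser.can_fetch("my-scraper", "https://example.com/teams/BOS/1955.html"))  # True
print(parser.can_fetch("my-scraper", "https://example.com/private/data.html"))    # False
```

Against a live site you would call parser.set_url(...) with the site's robots.txt address and then parser.read() instead of parse().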

sleek fjord
#

i just downloaded the new version of python and its still saying my version is 2.7.16

serene scaffold
vital ocean
#

hmm

sleek fjord
#

yeah its saying 3.9.1

#

now

vital ocean
#

cool

sleek fjord
#

and its saying that ive installed basketball-reference-scraper

#

if i do pip show

#

and its still coming up with the same error

#

```
Traceback (most recent call last):
  File "/Users/airmac/Documents/NBA Python/Untitled.py", line 1, in <module>
    from basketball_reference_scraper.teams import get_roster, get_team_stats, get_opp_stats, get_roster_stats, get_team_misc
ImportError: No module named basketball_reference_scraper.teams
[Finished in 0.139s]
```

vital ocean
#

is the module name right?

sleek fjord
#

yes

#

just to double check i downloaded the example off the official github

#

and ran it

#

and that dont work

serene scaffold
vital ocean
#

just try from cmd

desert parcel
sleek fjord
serene scaffold
#

try pip install git+https://github.com/vishaalagartha/basketball_reference_scraper.git

vital ocean
#

hmm

#

in cmd

sleek fjord
#

im on mac

vital ocean
#

ok

serene scaffold
#

that's fine

sleek fjord
#

its saying

#

'zsh: command not found: pip'

serene scaffold
#

try the same command with python3 -m pip instead of just pip

vital ocean
#

yes

sleek fjord
#

so

#

python3 -m pip https://github.com/vishaalagartha/basketball_reference_scraper.git

#

?

serene scaffold
#

python3 -m pip install git+https://github.com/vishaalagartha/basketball_reference_scraper.git
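
The reason python3 -m pip helps is that it runs pip as a module of one specific interpreter, so the package lands where that interpreter will look for it. The same idea from inside Python, as a sketch that only builds and prints the command (pip_install_command is a made-up helper name):

```python
import shlex
import sys


def pip_install_command(requirement):
    """Build a pip command bound to the interpreter running this script."""
    return [sys.executable, "-m", "pip", "install", requirement]


command = pip_install_command(
    "git+https://github.com/vishaalagartha/basketball_reference_scraper.git"
)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```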

vital ocean
#

yeah that's what i am sayin'

sleek fjord
#

thank you

lapis sequoia
#

@gaunt heron no

serene scaffold
#

no?

vital ocean
#

what's this

sleek fjord
#

im going to relaunch atom

vital ocean
sleek fjord
#

?

#

i just downloaded

vital ocean
#

np

sleek fjord
#

its coming up with the same problem

#

a = get_opp_stats('BOS', 1955, data_format='TOTAL')
print(a)
#

something wrong with my code?

desert parcel
#

Is there a place I can share a jupyter notebook?

#

Cause I wanna ask a question

serene scaffold
sleek fjord
#

'Traceback (most recent call last):
File "/Users/airmac/Documents/NBA Python/Untitled.py", line 1, in <module>
from basketball_reference_scraper.teams import get_roster, get_team_stats, get_opp_stats, get_roster_stats, get_team_misc
ImportError: No module named basketball_reference_scraper.teams
[Finished in 0.12s]'

#

thats the entire error message

serene scaffold
#

okay, and what was the terminal output when you ran that command from before?

sleek fjord
#

when i downloaded it?

#

what

serene scaffold
sleek fjord
#

yes

serene scaffold
#

What happened?

arctic wedgeBOT
#

Hey @sleek fjord!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

serene scaffold
#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

sleek fjord
#

it basically said

#

done

serene scaffold
#

Okay

#

Where is your .py file that contains this code located?

serene scaffold
sleek fjord
#

in my documents folder

#

under a folder called

serene scaffold
#

Is that where your terminal is operating from?

sleek fjord
#

no

#

am i supposed to do that

serene scaffold
#

Can you go there in the terminal?

#

Yes, that's the easiest way for us to help you debug

sleek fjord
#

how do i do that on mac

serene scaffold
#

if you use a UI, we'd have to have extensive knowledge about how that UI works.

#

cd is usually the command to change directories

#

and ls usually tells you what is in your current directory

sleek fjord
#

'cd: string not in pwd: /Users/airmac/Documents/NBA'

#

when i do ls

#

'Applications Documents Library Music Public
Desktop Downloads Movies Pictures get-pip.py'

serene scaffold
#

do cd Documents

sleek fjord
#

when i write ls now it comes up with

#

'Excel NBA Python School'

#

should i do cd nba python

serene scaffold
#

yes, but you might need to put "NBA Python" in quotes
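
The quotes matter because the shell splits a command on spaces before cd ever sees it. As far as I know, the earlier 'cd: string not in pwd' message is what zsh prints when cd receives two separate arguments. Python's shlex module follows POSIX shell splitting rules, so it can show the difference:

```python
import shlex

# Unquoted: the folder name is split into two separate arguments.
print(shlex.split("cd NBA Python"))    # ['cd', 'NBA', 'Python']

# Quoted: the folder name stays together as one argument.
print(shlex.split('cd "NBA Python"'))  # ['cd', 'NBA Python']

# shlex.quote produces a safely quoted form of a path with spaces.
print(shlex.quote("NBA Python"))       # 'NBA Python'
```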

sleek fjord
#

yeah okay

#

thanks

#

do i do the pip install

#

now

serene scaffold
#

no, you said that worked

#

can you do ls again?

sleek fjord
serene scaffold
#

is Untitled.py the file that contains the code you referred to earlier?

sleek fjord
#

yes

serene scaffold
sleek fjord
#

youre a fast typer

serene scaffold
#

thxxx

sleek fjord
#

ily

#

wow

serene scaffold
#

it worked?

sleek fjord
#

```
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/basketball_reference_scraper/teams.py", line 6, in <module>
    from constants import TEAM_TO_TEAM_ABBR, TEAM_SETS
ModuleNotFoundError: No module named 'constants'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/airmac/Documents/NBA Python/Untitled.py", line 1, in <module>
    from basketball_reference_scraper.teams import get_roster, get_team_stats, get_opp_stats, get_roster_stats, get_team_misc
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/basketball_reference_scraper/teams.py", line 10, in <module>
    from basketball_reference_scraper.utils import remove_accents
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/basketball_reference_scraper/utils.py", line 4, in <module>
    import unicodedata, unidecode
ModuleNotFoundError: No module named 'unidecode'
```

serene scaffold
#

so their code (not yours) is broken.
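
The two-part traceback looks like a try/except import fallback where the fallback also failed: the first import of 'constants' fails, and while handling that, the second path hits the missing 'unidecode'. The sketch below reproduces that shape with made-up module names; it is a guess at the pattern from the traceback, not the package's actual source.

```python
def load_constants():
    """Mimic a try/except import fallback where the fallback also fails."""
    try:
        # Stands in for `from constants import ...` in the traceback.
        import hypothetical_constants_module
    except ImportError:
        # The fallback path needs a dependency that is not installed either
        # (stands in for `unidecode`).
        import hypothetical_missing_dependency


try:
    load_constants()
except ModuleNotFoundError as err:
    # The last error in the traceback is the one that actually stopped the import.
    print("final error:", err)
    # Python keeps the earlier error as context, which is why both are printed.
    print("raised while handling:", repr(err.__context__))
```

Only the last error needs fixing here. Installing the missing dependency with python3 -m pip install unidecode would probably get past it, though other dependencies may turn out to be missing too.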

sleek fjord
#

no

#

wow

#

how could they

desert parcel
#

@sleek fjord is your problem fixed?

serene scaffold
desert parcel
#

Ahh

#

Well I don't think he can fix that then