#data-science-and-ml

worn stratus
#

Someone in the pins recommends a different course

#

But studying it academically is probably for the best

acoustic scaffold
#

@plain jungle AI/ML is a very broad and rapidly advancing field. Don't worry if you do not find a "clear point of beginning".

#

I started by initially doing some digital image processing which led to machine vision.

plain jungle
#

Thank you

acoustic scaffold
#

Andrew Ng's course is good. It starts from the ground up, which means lots of mathematics that might put some people off. You can do machine learning even without comprehending all of the mathematics.

stray spade
#

Hi guys
I need to run a test with ResNet154. As far as I know, training it is too time-consuming, especially on a PC.
My question is: is there a pre-trained ResNet I can run on my dataset? If yes, how long would it take?
I have a face dataset with 1999 portrait images

acoustic scaffold
#

Are you using Tensorflow or Pytorch (or something else)?

stray spade
#

@acoustic scaffold Tensorflow

acoustic scaffold
#

If I recall correctly, there were pretrained ResNets for TensorFlow 1.12 a year ago

stray spade
#

Do you have a link to download it?
And do you have any idea how long it takes to run it on 1990 images?
I want to get results as fast as possible, to submit my thesis

acoustic scaffold
stray spade
#

@acoustic scaffold thank you

acoustic scaffold
#

No problem

jolly briar
#

I've a vector of postcodes that I want to convert to the geographical regions they're within, so I want to lower their resolution.

What's the best way to go about this? not sure if there's a google maps approach or something

#

I'm sure there's a google phrase i'm missing to get info on this... given postcode I want to get region 🤔

worn stratus
#

Whereabouts are the postcodes? Global?

jolly briar
#

european

worn stratus
#

https://postcodes.io/ does exactly what you're looking for in the UK at least - there's probably something similar for Europe as a whole

jolly briar
#

not uk, can't find eu

#

as in - i'm not looking at the uk atm

worn stratus
jolly briar
#

this seems to want a house number as well, i don't have that information, just postcode

#

this is uk as well i think 🤔

worn stratus
jolly briar
#

looking at geocoding atm

#

gah, lookup of italian postcodes just returns american stuff 😦

worn stratus
#

My guess is you can add a param for country into the address info

jolly briar
#

Yeah just reading through the docs atm

uncut shadow
#

Hey. I have another question. What skills do you think are required to master (or at least learn some) machine learning, apart from knowledge of the programming language you're going to use? Also, do you know any good book/course/tutorial etc. to learn the math required for machine learning?

#

Without any libs like Tensorflow, sklearn etc.

polar acorn
#

Have you looked at the pinned messages in this channel? I think the r/LearnMachineLearning wiki might have what you're looking for.

oblique belfry
#

Grit and perseverance.

jolly briar
#
df['x'].astype <doesn't work
df.x.astype    <does work

why is that?

lapis sequoia
#

why is what

#

as type what.. did you try

#

did you pass an argument and check the output

#

@jolly briar

jolly briar
#

Yes

#

One worked, the other didn't. I thought both these indexing methods were analogous

lapis sequoia
#

they are

#

please show code

#

you can check df['x'] == df.x

jolly briar
#

was just with int, code's gone now unfortunately

velvet thorn
#

typo somewhere?

jolly briar
#

hrm, maybe... i'm not sure now the code has gone... thought I'd bumped into some kind of df[x] df[[x]] thing ala R, all good

velvet thorn
#

I'm like 95% certain it was a typo or something

jolly briar
#

i think if it's between me and pandas being wrong i'm willing to raise my hand 😅

lapis sequoia
#

it's very quiet in here today

#

Im bored.. someone ask something

drowsy ingot
#

anyone using Gym-retro?

worn stratus
#

What's a good way to get started with computer vision stuff?

jolly briar
#

i have many csv files with different separators, i want to convert them to all be comma separated... is there a straightforward approach to this?
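
One possible approach, sketched with only the standard library: let `csv.Sniffer` guess each file's delimiter and write the rows back out comma-separated. The helper name `to_comma_separated` and the list of candidate delimiters are assumptions made for illustration. Sniffing is a heuristic, so it can guess wrong on unusual files.

```python
import csv
import io


def to_comma_separated(text, candidates=";,\t|"):
    """Return `text` rewritten as comma-separated CSV.

    `candidates` is an assumed set of delimiters to choose from.
    """
    # Guess the dialect from a sample of the input.
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
    rows = csv.reader(io.StringIO(text, newline=""), dialect)
    out = io.StringIO(newline="")
    # The writer quotes any field that itself contains a comma.
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


semicolons = "name;city;zip\nAnn;Rome;00100\nBob;Milan;20100\n"
tabs = "name\tcity\tzip\nAnn\tRome\t00100\nBob\tMilan\t20100\n"
print(to_comma_separated(semicolons))
print(to_comma_separated(tabs))
```

For real files, the same function could be fed the contents of each file opened with `newline=""`, and the result written to a new path rather than over the original.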

misty lake
#

Hi All,

Has anyone worked on NLP , Information Retrieval and building a search engine
Any YT links or other references to build an intelligent search engine based on data in local DB is appreciated

jagged raven
#

The download is stuck here.

misty lake
#

Pls tag me when you answer. Thanks

#

@jagged raven the screenshot seems to show it is still downloading. What is the issue? Maybe you can try with sudo and pip3

jagged raven
#

It's stuck there for a couple of minutes now.

rain flare
#

Any insights on how to extract melody from a song using python?

austere oar
#
#

and that would also mean going to every other search result on the first link

frail flower
#

Fun data science story.

This week I was at the 100th Annual Meeting of the American Meteorological Society. Whilst at one of the Python symposia, we had just been introduced to the SPORK project, which uses machine learning to track supercell thunderstorms (and can potentially be used to predict tornadoes). I mentioned I wanted to use it in conjunction with a library called Py-ART for operational forecasting purposes, and a friend next to me started complaining at length about how slow Py-ART is.

Anyway, the lead developer of Py-ART was sitting right next to him on the opposite side.

#

Awkward...

deft harbor
#

Did he say anything?

frail flower
#

Oh boy did he!

#

He was also leading the panel on which said friend was presenting his research!

#

After some serious backpedaling he invited us to a data science reception with the other representatives from Argonne.

#

Oh, and on my other side one of the matplotlib devs was sitting there giggling at Dave's sudden realization that you do not trash talk popular python weather data science libraries at a python weather data science symposium.

jolly briar
#

has anyone ever done predictive modelling just using uplift?
idk if it's even classed as predictive modelling... perhaps forecast is a better term

jolly briar
#

Something that I wish was possible -
an excel sheet with the dataframe I'm working on that updated live

jolly briar
#

how to convert a group of values into the percentages

so if I have a dataframe and there are groups G=a,b,c,d, so I .groupby('G'). Within a I have a1 = 15, a2=80, a3=90, so that after the group by operation i want to have values of a1=0.08, a2 = 0.43, a3=0.48, and similarly for the other groups

#

i just did groupby().sum() then merged that output with the original, then computed percentage from there
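
The groupby-then-merge route works. A shorter sketch of the same calculation uses `transform`, which returns the group total aligned to the original rows. The column names `G` and `value` and the rows for group `b` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "G": ["a", "a", "a", "b", "b"],
    "value": [15, 80, 90, 10, 30],
})

# transform("sum") broadcasts each group's total back onto its rows,
# so no separate merge step is needed.
df["share"] = df["value"] / df.groupby("G")["value"].transform("sum")
print(df.round(2))
```

Within each group the shares add up to 1, which is a quick sanity check on the result.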

oblique belfry
#

@frail flower that is hilarious.

gilded dagger
#

Hello everybody, I'm having some trouble with gspread_pandas

#

I can open the spread, I have read permissions, I have the right values for row 1 if I directly look at them, but trying to make into a DF... doesn't work

#

Any clue? Not finding anything on google or related.

#

It actually works on other spreadsheet, but for this one (where I have only read access) it doesn't work

urban silo
#

I have a pandas question

I have 3 dataframes (channel, video, comment)
Column mapping is:
channel.channelId = video.channelId = comment.channelId
video.videoId = comment.videoId

I need to get a subset of each dataframe.

  • Only channels that have a video and a comment
  • Only videos that have a channel and a comment
  • Only comments that have a channel and a video

I tried it with a double merge + inner join like

total_channels = total_channels.merge(total_videos, on='channelId')
.merge(total_comments, left_on=['channelId', 'videoId'], right_on=['videoId', 'channelId'])

But that only gives an empty dataframe with all columns from all 3 dataframes instead of only a channel subset that matches the requirements (at least 1 video and 1 comment)

I can't set a PK/FK when writing to SQL in pandas so my SQL solution takes ages, that's why I need to do it directly in Python/pandas to speed stuff up.

How can I achieve that?

paper niche
#

@urban silo the order in which you specify the join columns matters. you set left to be channelId and videoId, but right to be videoId and channelId, so pandas will try to join left.channelId with right.videoId

#

and, if you just want the channels' columns, I'd probably just go for total_videos[['channelId']] and total_comments[['channelId','videoId']] in the merge arguments directly.
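
A minimal sketch of that advice, on toy frames invented for illustration: the key columns are listed in the same order on both sides, and each frame is then cut down to the rows that take part in a full channel/video/comment match. Only the column names `channelId` and `videoId` come from the question. Everything else is an assumption.

```python
import pandas as pd

channels = pd.DataFrame({"channelId": ["c1", "c2", "c3"]})
videos = pd.DataFrame({"channelId": ["c1", "c2", "c2"],
                       "videoId": ["v1", "v2", "v3"]})
comments = pd.DataFrame({"channelId": ["c1", "c2"],
                         "videoId": ["v1", "v9"],
                         "text": ["hi", "orphan"]})

# Key pairs present in BOTH videos and comments, same column order on each side.
keys = ["channelId", "videoId"]
matched = videos[keys].merge(comments[keys], on=keys).drop_duplicates()
# Also require that the channel itself exists.
matched = matched[matched["channelId"].isin(channels["channelId"])]

# Keep only the rows of each frame that take part in a full match.
channels_sub = channels[channels["channelId"].isin(matched["channelId"])]
videos_sub = videos.merge(matched, on=keys)
comments_sub = comments.merge(matched, on=keys)

print(channels_sub, videos_sub, comments_sub, sep="\n\n")
```

Merging against the narrow `matched` frame keeps each result limited to its own columns instead of one wide table.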

jolly briar
#

does anyone work with python+R? Or maybe another mixture ( i just use python+R though).
I'm wondering if you have a set way of arranging / organising your projects, code/docs etc

lapis sequoia
#

It's craaaazzyyy

#

It has to be some kind of 3D program but how do they take and interact and move?

#

Really cool stuff

unkempt delta
#

when I open up Jupyter Notebook it shows all the files saved on my C drive. is there any way to clean this up a bit? If I partition my hard drive and have it open up in the partition, will it still be able to import python packages? I'm using anaconda btw

thorny ocean
#

hey

#

someone for little help in numpy?

worn stratus
#

don't ask to ask

#

just ask the question

thorny ocean
#

i have a 3d binary matrix (M). i want to create a function that, given an axis (x, y, z), reduces the matrix along that axis with a logical OR. for example, if i choose the x axis, my output would be a 2d matrix (m) where m[x,y] == True means that there exists an X value such that M[X,x,y] == True
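
A minimal sketch of such a reduction, assuming `M` is (or can be converted to) a boolean NumPy array. The function name `or_reduce` is made up for illustration. `ndarray.any(axis=...)` is the logical OR along one axis.

```python
import numpy as np


def or_reduce(M, axis):
    """Collapse a boolean 3D array along `axis` with logical OR."""
    return np.asarray(M, dtype=bool).any(axis=axis)


M = np.zeros((2, 3, 4), dtype=bool)
M[1, 2, 3] = True

m = or_reduce(M, axis=0)   # the first axis is collapsed, shape (3, 4)
print(m.shape, m[2, 3])    # (3, 4) True
```

`np.logical_or.reduce(M, axis=axis)` gives the same result for boolean input.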

unkempt delta
#

nvm figured it out , just made a partition

thorny ocean
#

another question in numpy:
i want to have all 3 digits numbers containing "0,1,2,3"

#

like "000, 001, 002,003,010...333"
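
One way to generate these is `itertools.product` from the standard library. The sketch below builds them as strings, and also as an integer array in case NumPy is needed downstream. The variable names are illustrative only.

```python
from itertools import product

import numpy as np

digits = "0123"
# Every ordered choice of 3 digits, with repetition: 4**3 = 64 strings.
codes = ["".join(p) for p in product(digits, repeat=3)]
print(codes[:5], codes[-1], len(codes))

# The same combinations as a (64, 3) integer array.
grid = np.array(list(product(range(4), repeat=3)))
print(grid.shape)
```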

strange stag
#

from this csv https://pastebin.com/RnF5rpXQ ive got this data in a pandas groupby object, and im trying to find the min/max price with the associated location, however .agg is giving me whacked results... trying to figure out why
(upon request, more csv data will be given, thus giving reason to groupby)

df2.agg({'price': ['max','min']}).reset_index()

https://pastebin.com/2xS53YTX
as you can see the very first upc is mismatched with max and min

#

the example i have given is the fourth upc

#

ending in 816

#

min should be 59.99 and max should be 127.00

lapis sequoia
#

it's difficult to see formatting here.. can you show the groupby dataframe another way

#

@strange stag

strange stag
#

@lapis sequoia

lapis sequoia
#

what is the min max on

strange stag
#

price, or so i hope

lapis sequoia
#

I think you're applying it wrong

#

let me check

strange stag
#

ye i think ur right, seems the upc is the min/max

#

cause upcs are ascending

lapis sequoia
#

what is the group by on?

strange stag
#

upc

lapis sequoia
#

df.groupby('upc').agg({'price': ['min', 'max']}) then?

strange stag
#

same as i have now, yes

#

for w/e reason that seems to work slightly better

#

first result seems off

#

4th is still technically wrong, but idky

#

those values shouldnt be there at all

#

ill post full csv, sec

#

@lapis sequoia

lapis sequoia
#

what is the upc

strange stag
#

wdym

lapis sequoia
#

is it really common across these merchants

strange stag
#

yes

lapis sequoia
#

ok.. lemme think

strange stag
#

would moving the upc to an index, or making it a string help?

lapis sequoia
#

what dtype is it now

strange stag
#

also, if you look at the upc 013964765816, the max is 127.00 and the min is 59.99, which is odd

#

sec

#

object

lapis sequoia
#

I think you should set the dtypes for these columns.. then do the groupby and aggregation

#

it'll work better

#

set upc to int.. and the price to float
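
A small sketch of why the dtype matters, using the UPC and the two prices mentioned above in an otherwise made-up frame. While `price` is `object` (strings), min and max compare text character by character. After `pd.to_numeric` they compare numbers. One caveat that is an assumption about this data: casting `upc` to int would drop a leading zero such as the one in `013964765816`, so keeping it as a string may be the safer choice.

```python
import pandas as pd

df = pd.DataFrame({
    "upc": ["013964765816", "013964765816"],
    "price": ["127.00", "59.99"],      # strings, i.e. dtype object
})

# Text comparison: "1" sorts before "5", so "127.00" counts as the minimum.
as_text = df.groupby("upc")["price"].agg(["min", "max"])

df["price"] = pd.to_numeric(df["price"])   # now float64
as_number = df.groupby("upc")["price"].agg(["min", "max"])

print(as_text, as_number, sep="\n\n")
```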

strange stag
#

ye, they're all objects

#

ight, ill try that

#

@lapis sequoia tyvm!!!! Been trying for hours to figure out what i was doing wrong!!! WOOO tyvm!!!! very nice to see that the data is in a working condition 😄

lapis sequoia
#

np.. always here

strange stag
#

@lapis sequoia do have one more operation that i hope you could help me with...

#

so i need to drop rows that amazons price is lower than the other prices (associated with the same upc)

#

im using this to remove no margins, possibly something similar for this other operation?

counts = df['upc'].value_counts()
df = df[~df['upc'].isin(counts[counts < 2].index)]
#

also, i think this is kinda weird df.groupby('upc').agg({'price': ['min', 'max']})
giving me url min/max, and upc min/max

lapis sequoia
#

try: df.groupby('upc').price.agg(['min', 'max'])

#

I dont understand your other question

#

drop what now?

strange stag
#

so with that last code you just posted, i still need those other columns

#

cause i need to drop the rows that price_min is associated with the location "Amazon.com"

#

min correlates to a location, and max may correlate to another location

#

if min correlates to amazon, i need to drop the row

#

or in other words, if amazons price for the upc is lower than the other suppliers, i need to drop the row/upc

#

lower than ALL other suppliers*

lapis sequoia
#

hmm.. an easy way to do that would be, for each upc finding row indexes where the row meets your condition.. then dropping multiple rows together by index

strange stag
#

tried df.groupby('upc')['price','location','url'].price.agg(['min', 'max'])
however, it says its already selected the columns, so im not sure how to keep the other columns when aggregating

lapis sequoia
#

df.groupby(['upc','price','location', 'url'], as_index=False).price.agg(______

strange stag
#

so with the above code (multiindex), i just need to convert to a regular index, and then iterate through the df, and if "Amazon.com" in min, then drop the row

lapis sequoia
#

no iterating through dfs.. that's not efficient

#

find another way.. but you can do that as a last resort.. because I'm not able to think of a way right now

#

go through them by upc, check the condition, save the indices somewhere.. then drop by indices together

strange stag
lapis sequoia
#

oops

#

you need to remove price

#

I made a mistake

#

df.groupby(['upc','location', 'url'], as_index=False).price.agg(__

#

which you should've caught btw.. lol

strange stag
#

was kinda wondering why all were grouped, but well 😛

lapis sequoia
#

it's early morning here.. still getting up.. if you have anything else feel free to ping here.. I'll respond later

strange stag
#

nw 😛 im very grateful for your help, saved me so much time!

strange stag
#

welllll nvm hehe, just shifted the data

strange stag
#

@lapis sequoia you there?

lapis sequoia
#

!ask

arctic wedgeBOT
#
ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.

You can find a much more detailed explanation on our website.

strange stag
#

okay, so with the data previously, i have (credited to you) a group for each upc, which is shown in the picture above, however, the min and max is each url, this would be fine if i could sort the high ~> low of each upc grouped by the location, and now that i wrote this out, i think i might have a better idea on what i need to do

#

that and some sleep

#

so this is closer, however, i would still like to see the lowest, and the highest of each upc rather than the lowest/highest for each location

#

df.groupby(['upc','location', 'url'], as_index=False).price.agg(['min','max']).groupby('location', as_index=False).head(len(df))

#

@lapis sequoia

lapis sequoia
#

reading your question

#

yeah no I dont get it.. lol

strange stag
#

so using the first upc as an example
I would like to see amazon has a 14.23 price(max), and walmart has a 8.95 price (min)

lapis sequoia
#

ok so you want to see min max for each upc, and the url

#

yeah?

strange stag
#

yes

#

exactly

lapis sequoia
#

why didnt you say that

strange stag
#

thought i did 😄

#

have manually parsed like 200 lines so far lolz 😛

#

cba to parse 1k lines manually a day

lapis sequoia
#

yeah I dont understand what you're saying.. but wait, let me write the code

#

df['max_val'] = df.groupby(['upc'])['price'].transform(max)

#

do you understand what's happening here

strange stag
#

lemme try/think a bit, and ill brb

#

ahhh i forgot...

#

i need to see if any1 else has a lower price than amazon, not just min/max per upc....

#

sorry.........

#

ima see what i can do with this tho

#

oh now i need to drop

#

🙂

#

so to answer your question @lapis sequoia i think i understand what its doing

#

creating a new column, by grouping the upc and then performing a max transformation on the price column

#

or min
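
That reading matches what `transform` does. A minimal sketch on made-up rows (the UPC values and the merchant "Target" are hypothetical, the 14.23 and 8.95 prices are the ones from the example above) shows the per-UPC max and min as new columns, and uses `idxmin`/`idxmax` to recover which location each extreme belongs to.

```python
import pandas as pd

df = pd.DataFrame({
    "upc": ["111", "111", "222", "222"],          # hypothetical UPCs
    "location": ["Amazon.com", "Walmart", "Amazon.com", "Target"],
    "price": [14.23, 8.95, 5.00, 7.50],
})

# transform keeps the original row count: every row gets its group's value.
df["max_val"] = df.groupby("upc")["price"].transform("max")
df["min_val"] = df.groupby("upc")["price"].transform("min")

# idxmin/idxmax return the row label of the extreme price per UPC,
# which lets the matching location come along.
cols = ["upc", "location", "price"]
cheapest = df.loc[df.groupby("upc")["price"].idxmin(), cols]
priciest = df.loc[df.groupby("upc")["price"].idxmax(), cols]
print(cheapest, priciest, sep="\n\n")
```

If two locations tie on price, `idxmin` and `idxmax` return the first matching row only.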

#

this has gotten me a bit closer to what i need (the above) and this

# Drop upcs that arent sold on amazon
df = df[df['upc'] != "Amazon.com"]
#

df['max_val'] = df.groupby(['upc', 'location'])['price'].transform(max)
🙂

autumn night
#

how are Data Science and AI related???

worn stratus
#

The vast majority of AI is trained (the process of the AI learning) using data collected from the real world. In order to work with AI you need to be able to understand the data, and what it means and how to work with it.

#

It's worth noting that Data Science, AI, Machine Learning, Data Mining and probably more are all pretty ill-defined semi-buzzwords that sometimes get used interchangeably

lapis sequoia
#

Hey guys! I'm trying to add a new column, or rather replace one with a mistake and therefore I'm trying to merge 2 datasets exactly like I have done dozens of times before in my project... however, this time, something is different and I just can't seem to figure out why.

#

Even though I tried both "left" and "inner" merge, I'm getting out more data in the merged set than in either of the two original sets
df1 1136388 rows × 31 columns
df2 1247995 rows × 8 columns
so with left join I should be getting 1,136,388 rows right?
however, what I'm getting is 1935106 rows × 32 columns (column number is correct, rows are waayyyyyyy off)

#

So in an attempt to find out what's going on, I used the indicator=True function of merge. And guess what... there is only one category [both] and no values that are only either from left or right data set.
How is this even possible? Any help would be much appreciated... this should only be a 2 minute problem, but it cost me 2 days already :[

#

I'm merging on 7 out of those 8 columns as they're identical in both sets, that's why 32 columns instead of 31 is the correct output for the merge.... but rows increased by almost 800,000 !?!?!? There are no NaNs and no duplicates... i absolutely cannot explain how this is even possible
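
One common cause of this pattern, offered as a guess about the data rather than a diagnosis: rows can be unique as whole rows while the combination of merge keys still repeats on both sides. Each repeated key on the left is then paired with every matching row on the right, and `indicator=True` reports all of them as `both`. The toy frames below are invented to show the effect and two ways to check for it.

```python
import pandas as pd

left = pd.DataFrame({"k": [1, 1, 2], "a": ["x", "y", "z"]})
right = pd.DataFrame({"k": [1, 1, 2], "b": [10, 20, 30]})

merged = left.merge(right, on="k", how="left", indicator=True)
print(len(left), len(right), len(merged))   # 3 3 5
# Every row is tagged "both", yet the result is longer than either input.

# Count rows whose key occurs more than once, per side.
dup_left = int(left.duplicated("k", keep=False).sum())
dup_right = int(right.duplicated("k", keep=False).sum())
print(dup_left, dup_right)

# validate= makes pandas raise instead of silently multiplying rows.
try:
    left.merge(right, on="k", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError as exc:
    raised = True
    print("merge refused:", exc)
```

With seven key columns, the same check would pass the list of key names to `duplicated` in place of `"k"`.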

coral yoke
#

if anyone has any experience or understanding of RNNs i'd love to talk whenever you're free. currently doing a project for gun recognition in images and video. just trying to perfect my classifier before working too hard on the object detection factor

oblique belfry
#

Well...why do you need RNNs? I mean...what are you trying to do? RNNs and CNNs can be used for similar problems. If you are doing gun recognition, seems like an object detection problem.

coral yoke
#

unless all of the research papers are uninformed, you need an R-CNN or similar

#

CNN for the quick classification, RNN for the object detection

#

RNN is meant for object detection in my case. i've not seen any other network types used

#

unless you know something i'm missing

#

@oblique belfry

oblique belfry
#

Yolo v3 for Object Detection....

coral yoke
#

i'm not using a pre-trained network

#

and if i'm not wrong, yolo has an RNN

oblique belfry
#

All CNNs. Faster inference than R-CNN.

#

I used it to train on a custom dataset.

#

Are you doing object detection or action recognition, or both?

coral yoke
#

just object detection

#

did you use keras or?

oblique belfry
#

So...this guy wrote it in C. Trains very fast and inference time is fast. However, it is finicky to work with.

#

There are Keras, Tensorflow, and Pytorch ports. The Pytorch one was the most stable port.

coral yoke
#

honestly looking to use it for reference and still do my own

oblique belfry
#

The hardest part is the input data. Each object detection algorithm has different formats of input data.

coral yoke
#

hence why i'm going to do my best for my own

#

i know that part though

oblique belfry
#

And....I get you wanna do your own. But, it is a solved problem.

coral yoke
#

i very well understand that

oblique belfry
#

Okay.

#

You can use a RNN, but you don't have to.

Input data for object detection is tricky since you can either scale the width and height or just keep it as it is. You can do it all with RNNs. And, these model architectures are out there. I'd copy them.

Why reinvent the wheel if you do not have to?

coral yoke
#

my end goal isn't to have some pre-trained model returning images with all of its former trained classes filling the image. this is also still a ridiculously new field and while i did not look too far into yolo's latest model i now know it's the latest reach. i'm still not going to use something just handed to me. i'm looking to make my own like i said

oblique belfry
#

You can train it yourself, from scratch and not use other people's weights. I trained it to locate a tennis ball in real time. Tried doing it myself and tried other methods out there, Yolo was the best. Even still, you are going to need a large corpus of labeled data of bounding boxes around the objects in questions. I would spend my time there.

But, good luck.

coral yoke
#

i know what i need data wise. i have 10k images self-collected and already 1k labeled by hand with labelimg. i'm not looking to locate tennis balls because somebody else did that already, i'm looking to do something myself from scratch to prove that i can to clients looking to hire me for this industry so i'm not going to just use something handed to me and say "look, i can use what anyone else can!"

i appreciate you pointing out that yolo wasn't what i thought it was but i feel like you're acting very high and mighty just because i don't want to use somebody else's work and you think i should. have a nice day

oblique belfry
#

@coral yoke It's not high and mighty. Most people don't reinvent the wheel unless they have to. Unless you were trying to go into research, there are just many very good solutions to this problem out there.

And, you finally explained why doing it from scratch is so important to you. If I knew that before, I could have given you different answers.

coral yoke
#

i said feels like. and honestly, especially in this field, please don't give the answer of "just use what exists" to somebody asking how to make their own thing

jolly briar
#

not reinventing the wheel is pretty sound advice a lot of the time 🤔

coral yoke
#

it is, but it isn't always relevant

oblique belfry
#

I would read the papers behind Yolo, R-CNN, Faster R-CNN, etc. They make interesting points on why they chose the architecture.

coral yoke
#

if you want to make a discord bot should i go tell you to use this server's bot instead of making your own?

jolly briar
#

it isn't always relevant
in this context perhaps leading with your reasoning would have made more sense, but all good

coral yoke
#

i've read some papers already tony

worn stratus
#

Choosing to reinvent the wheel is a great way of understanding how the wheel works

coral yoke
#

thank you charlie

oblique belfry
#

That's not me telling you to copy them. Just the logic behind the choices might encourage you on your journey.

coral yoke
#

i understand that tony. that's why i was asking for people familiar with RNNs

oblique belfry
#

I know....I was grouping them in. Yolo is one of the few famous strategies that is all CNNs. The rest are a mix between the two.

#

Are you wanting to run this on a live video stream?

coral yoke
#

no offense but i don't believe you're the person i'd be willing to give any more information to regarding this

#

again, thanks for pointing out my misunderstanding of yolo's architecture

oblique belfry
#

Reinventing the wheel to learn is a great way to learn. But we didn't know you were trying to do that. Hence the miscommunication.

#

Okay. Well, good luck.

coral yoke
#

even without the learning purpose, i would definitely still make my own. especially if the project was specialized enough i would want full control of what was going on.

#

and most of it isn't for learning. i'm having to piece together the last bit of the object detection myself but the rest i mostly understand. it's for showing clients i understand

jolly briar
#

i'm trying to imagine billing someone and pricing in building everything from scratch lol

coral yoke
#

it's not the kind of clients you're imagining

jolly briar
#

cool

oblique belfry
#

Got it. Next time, try to convey that up front. Not just when talking to me, but to other devs. There are gonna be others who will be confused at your request like I was.

I am upset that this convo got derailed so quickly. Because, this is the stuff that interests me.

#

I gotta ask....what kind of clients are you targeting?

coral yoke
#

again no offense, but never when speaking to any other developer in any part of any industry have they told me "use what exists." especially not ones in this discord, they seem to like to help you from scratch irregardless

#

and none of your business

jolly briar
#

lol

oblique belfry
#

Alright. Just curious.

#

If you wanna impress them a bit more, look into image segmentation as well. Don't know if that would be relevant to you, but it would def be cool to show you did that by scratch too.

coral yoke
#

my timeframe doesn't allow any more than i have set

#

i've seen that already, thanks though

oblique belfry
#

Gotcha. Wanted you to really impress them.

coral yoke
#

👌

oblique belfry
#

Has anyone had luck with graph neural networks?

lapis sequoia
#

Hi! Sorry, not sure if this is right channel for my problem. Where can I ask about data preprocessing for text clusterization?

worn stratus
#

Here probably

lapis sequoia
#

Ok, I don't even understand my task properly...

#

I want to cluster different text to k different authors.(k-means clustering)
My data is: different files with text and other things from different authors in json format,
It looks like this:

{
  "author": "Tolstoy",
  "date": "unknown",
  "format": "unknown",
  "text": "here is some short text by Tolstoy",
  "title": "Anna Karenina",
  "year": "unknown",
  "lang": "ru"
}

Also there is already training data that consists of many dictionaries like this in json format too.

What do I need for k-means clustering? Do I only need "text" strings?

chilly geyser
#

cluster different text to k different authors
Your task is to create k clusters of authors. Presumably this means that authors within each cluster are similar to each other in some way.
What do I need for k-means clustering? Do I only need "text" strings?
To cluster the text you'd probably need to turn the 'text' into a format you can perform operations on, so that you can talk about any kind of similarity or dissimilarity. There are different ways to do this, and I think you have been given raw book data, along with some metadata. It's honestly up to you whether to use just the data and/or the metadata, as long as at the end of the clustering process you have a good idea of what the algorithms you used are doing

lapis sequoia
#

So can I only take those "text" values from data and put them all in one big list of texts(is this even right?) and then preprocess this list?

lapis sequoia
#

Do you use the Anaconda environment?

#

that is like a software package

#

Me? No, I don't.

#

Anybody here

jolly briar
#

@lapis sequoia yes

lapis sequoia
jolly briar
#

I've never used windows

jolly briar
#

@velvet thorn

x = pd.DataFrame({'index' : [5,6], 'blah' : ['a', 'b']})
print(f"""x.index : {list(x.index)}, x['index'] : {list(x['index'])}""")

this seems like a reasonable example of .v and ['v'] not being exactly the same

velvet thorn
#

@jolly briar yup

#

this applies also to every other attribute that is already bound

#

e.g. min, max, groupby

jolly briar
#

yeah

velvet thorn
#

I think I said "prefer __getitem__ access, because it works in more cases"

#

but if I didn't then I'm saying it now roothink

jolly briar
#

so they're not exactly the same, like running code with ipython vs python, people often say they're the same but it's different

velvet thorn
#

because it is most correct to say that __getitem__ works everywhere __getattr__ does, and some places it doesn't, for the purpose of Series access
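
A compact sketch of that rule, extending the example above with a hypothetical column named `min`. Attribute access finds the DataFrame's own attributes and methods first, so it only reaches a column when the name is not already taken. Bracket access always means the column.

```python
import pandas as pd

x = pd.DataFrame({"index": [5, 6], "min": [1, 2], "blah": ["a", "b"]})

print(list(x["index"]))   # the column: [5, 6]
print(list(x.index))      # the row index: [0, 1]
print(callable(x.min))    # True: the DataFrame.min method, not the column
print(list(x.blah))       # ['a', 'b']; no clash, so both spellings agree
```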

jolly briar
#

can't recall exactly what you said, just thought of it now though ( the index thing ), all good

stray spade
jovial river
#

How does an algorithm like KNN handle duplicate data? Meaning we have a set of data objects with identical attributes and the distance between these data objects is 0. Does it make sense to remove these duplicate points here or include it?

jovial river
#

If we were to include duplicates, would it make sense to treat duplicate data points as one observation? Like if k=3 and n1 has 3 duplicates, n1', n1'' and n1''', then n1 would only have 1 nearest neighbor instead of 3.

jolly briar
#

I often get confused when making dataframes with rows, for some reason.

for example - pd.DataFrame( pd.factorize( data.var ) )
If i want this to create a dataframe with columns instead of rows how would I do that?
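
A sketch of one way to get columns, on toy data invented for illustration. `pd.factorize` returns a pair `(codes, uniques)`, and passing a dict to `pd.DataFrame` maps each name to a column. Note also that `data.var` resolves to the `DataFrame.var` method rather than a column called `var`, so bracket access is used here.

```python
import pandas as pd

data = pd.DataFrame({"var": ["b", "a", "b", "c"]})   # toy data

# Bracket access, because data.var is the DataFrame.var method.
codes, uniques = pd.factorize(data["var"])

# A dict maps names to columns, so each array becomes a column.
encoded = pd.DataFrame({"var": data["var"], "code": codes})
lookup = pd.DataFrame({"code": range(len(uniques)), "var": uniques})
print(encoded, lookup, sep="\n\n")
```

When a frame has already come out with the data in rows, `.T` transposes it.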

lapis sequoia
#

Hi! Can I use Random Forest to evaluate k-means clustering? does this make sense?

cinder viper
#

@lapis sequoia I don't understand what you mean when you say you are trying to "evaluate" k-means. I suspect the answer is no... Random Forest is a supervised classification algorithm, whereas k-Means is an unsupervised clustering algorithm, so they differ in what they do and how they do it

lapis sequoia
#

k-means is unsupervised so I wanted to check clusters I got with RF or something

strange stag
#

hey was hoping someone could help me with pandas, im trying to keep the amazon price for each upc, and drop others that are a higher price than amazon (for each upc)

#

if you need me to provide more information, in any way shape or form, please dont hesitate to ask!

chilly geyser
#

@lapis sequoia As previously said, it doesn't make sense. Both k-means and RF clusters are fundamentally different.

You can evaluate the clustering quality of each algorithm using metrics such as cluster purity, or compute/speed requirements, etc. and then compare the results from RF or from k-Means. Indeed, k-Means is likely to be superior in both fitting and prediction, while RF depends on the number of trees, as well as tree parameters. If RF does not produce significantly better clusters, then I would use k-Means.

But there are probably many different ways of generalising each k-Means, RF, and there would be other algorithms. What works might typically depend on your use case.

#

@strange stag So you want to conditionally drop depending on the price column? Is there only one Amazon.com under location or would there be multiple? If there is only one, you can grab the Amazon.com price, store it as a constant, then do a conditional slice using .map
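
A minimal sketch of that suggestion on made-up rows, assuming at most one `Amazon.com` row per UPC. The UPC values, the merchant "Target" and the `amazon_price` column name are hypothetical. Instead of one constant, the Amazon price is looked up per UPC with `.map`, and rows priced above it are dropped.

```python
import pandas as pd

df = pd.DataFrame({
    "upc": ["111", "111", "111", "222", "222"],    # hypothetical UPCs
    "location": ["Amazon.com", "Walmart", "Target", "Amazon.com", "Walmart"],
    "price": [14.23, 8.95, 20.00, 5.00, 7.50],
})

# One Amazon price per UPC (assumes at most one Amazon.com row per UPC).
amazon_price = df[df["location"] == "Amazon.com"].set_index("upc")["price"]

# Look up each row's Amazon price; UPCs without one become NaN.
df["amazon_price"] = df["upc"].map(amazon_price)

# Keep Amazon's own row and any row priced at or below Amazon.
# NaN comparisons are False, so UPCs missing from Amazon are dropped too.
kept = df[df["price"] <= df["amazon_price"]]
print(kept)
```

If a UPC had several Amazon rows, the lookup would need a rule first, for example taking the lowest Amazon price per UPC with a groupby.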

lapis sequoia
#

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np

# X_train / X_test are raw texts and y_train / y_test their known labels,
# defined earlier (not shown).

# Turn the texts into TF-IDF vectors; fit on the training texts only.
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)

# Cluster with as many clusters as there are known labels.
n_clusters = len(np.unique(y_train))
clf = KMeans(n_clusters=n_clusters, random_state=42)
clf.fit(X_train)
y_labels_train = clf.labels_
y_labels_test = clf.predict(X_test)

# The cluster id becomes the only feature for the next step.
X_train = y_labels_train[:, np.newaxis]
X_test = y_labels_test[:, np.newaxis]

# Learn a mapping from cluster id to true label, then score it.
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy is:', accuracy_score(y_test, y_pred))
#

Sorry

#

Can't I use RF for mapping labels like here?

#

I had my data labelled before but needed to use k means

chilly geyser
#

@lapis sequoia You mean change clf to RF?

lapis sequoia
#

clf was for k means, I mean there the last part is RF used on data that was "produced" by k means. or? maybe I don't understand the last part of this code, where RF comes

chilly geyser
#

@lapis sequoia That doesn't make sense, why would you use K-means then sequentially run random-forest on it?

#

Why would you fit an RF model after your K-means clustering algorithm?

#

@lapis sequoia Ok I think I get what your script is doing

#

@lapis sequoia Have you read https://scikit-learn.org/stable/modules/model_evaluation.html
I think you should just use the common metrics for evaluating the quality of the clusters out of K-means.

It doesn't make sense to 'evaluate' how good K-means is via RF. RF is itself another classifier that can result in classification errors on its own. If you're trying to do a meta-analysis of algorithmic analysis of either K-means outputs or RF-inputs then it makes sense, but it wouldn't make sense for the implied original problem of 'given N-datapoints and K-possible labels, what is the best way to separate and give each datapoint one of the K-possible labels?'
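
Since the data here already has labels, a minimal sketch of scoring a clustering directly against them might look like this. The toy labels are made up, and the two scikit-learn metrics are just one reasonable pick from that model evaluation page, not the only option:

```python
from sklearn.metrics import adjusted_rand_score, homogeneity_score

# Hypothetical toy data: true classes and the cluster ids k-means assigned.
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2]
# Same grouping as y_true, but with permuted ids (k-means ids are arbitrary).
clusters = [2, 2, 2, 0, 0, 0, 1, 1, 1]

# Both scores compare groupings, so they ignore how the cluster ids are named.
ari = adjusted_rand_score(y_true, clusters)
hom = homogeneity_score(y_true, clusters)
print(ari, hom)
```

Neither metric needs a second classifier, which is the point: the comparison is between the clusters and the known labels only.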

lapis sequoia
#

I wanted to make a confusion matrix in the end and I don't know how to make it without labels, the code is not mine, I just thought I found something similar, because in the end there is confusion matrix and classification report, and that's what I wanted from k-means. The initial data that I have already has labels and is actually more for classification tasks but I have to use it for k means

#

what is the script doing then?

strange stag
#

there are multiple prices under the location amazon

#

@chilly geyser

chilly geyser
#

@lapis sequoia
Um, how do you get the confusion matrix in the first place? In the first place, do you have a ground truth of classifiers?

#

Setting the K-means as a ground truth does not make sense

strange stag
#

each upc should have an amazon price, if not multiple

chilly geyser
#

To get a confusion matrix you need to say that a cluster X has common property related to its elements being members of X

strange stag
#

not sure what u mean by multiple amazon.com under location tho

chilly geyser
#

Unfortunately K-means only produces indices or rather centroids. You'd need to remap the centroids to get clusters of meaning
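
A rough sketch of that remapping, assuming you have true labels to compare against: give each cluster the label that is most common inside it, then build the confusion matrix from the remapped predictions. The data and names (`y_true`, `clusters`) are invented for illustration:

```python
import pandas as pd

# Hypothetical true labels and k-means cluster ids for seven samples.
y_true = pd.Series(["cat", "cat", "cat", "dog", "dog", "dog", "dog"], name="true")
clusters = pd.Series([1, 1, 0, 0, 0, 0, 0], name="cluster")

# Contingency table: rows are cluster ids, columns are true labels.
table = pd.crosstab(clusters, y_true)

# Map every cluster id to its majority label.
cluster_to_label = table.idxmax(axis=1)
y_pred = clusters.map(cluster_to_label).rename("predicted")

# Confusion matrix and accuracy of the remapped clustering.
confusion = pd.crosstab(y_true, y_pred)
accuracy = (y_pred == y_true).mean()
print(confusion)
print(accuracy)
```

Majority voting can map two clusters to the same label, so treat the result as a sanity check rather than a proper classifier evaluation.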

#

@strange stag Brb I'll give you a fake table

strange stag
#

location can have maceys, walmart, home-depot, office-depot, or a few others

#

@chilly geyser i can give u a real 1 if u want

chilly geyser
#

@strange stag I'd avoid giving real data.

strange stag
#

its fine idc

#

but yes, that is basically identical to the data i have now

#

id like to keep the 6th row and the 2nd

#

for upc==1

lapis sequoia
#

thank you @chilly geyser but do you understand what script I posted is doing?

chilly geyser
#

@lapis sequoia It's running K-means, then setting it as a ground truth for RF to classify

#

@strange stag I'd look into conditional slicing with pandas. A very naive (aka slow) way to do it is to take subsets of each UPC value, then do the conditional

#

As for faster/simultaneous checking I'm not too sure, I've not used pandas other than for general things and I've never exactly needed it to be speed-optimised

strange stag
#

hmm

#

will possibly be doing millions of rows per day

#

however, shouldnt be a problem for now

#

so something like df.groupby(['upc'])

chilly geyser
#

Yeah my googling seems to imply that too

strange stag
#

i understand i can do something like this (this is what im using to drop single suppliers corresponding to 1 upc)

counts = df['upc'].value_counts()
df = df[~df['upc'].isin(counts[counts < 2].index)]
#

so this selects a column, but not subsets for column values

#

so groupby would render subsets?

chilly geyser
#

I'd try it, I'm not a pd expert here :>

strange stag
#

do also have something like this

#

df1 = df[ df['location'] == "Amazon.com" ].drop_duplicates(subset='upc', keep='first')

#

think i should be using != instead but w/e

chilly geyser
#

That keeps the Amazon.com stuff right?

strange stag
#

this assumes that the df has been sorted by price

#

should

chilly geyser
#

lol TBH IDK what you're doing, but it seems you're doing ok

strange stag
#

actually nvm it doesnt do anything

#

that was an attempt to drop the lower price amazon offers

chilly geyser
#

@strange stag Are you doing this all in VSC or IDLE? I'd recommend a more interactive thing like Google Colab or at least your own localhost JuPyteR notebook if you think Google's snooping around your data.

strange stag
#

going back to the beginning, just trying to get amazons high vs the lowest of others

#

im on a notebook atm

chilly geyser
#

That way you can see how the pd dataframes are changing

#

Ah ok that's good

#

So you can quickly see stuff

strange stag
#

yes

#

well, not really doing ok

#

still blind as a bat atm..

#

mind boggling me why i cant get amazons high price, and then the lowest price for each upc other than amazon

#

mk

#

this is better...

grouped = df.groupby(['upc'])
result = [g for g in list(grouped)]
#

might actually be able to work with this data 🙂

vast temple
#

hey guys, are stackoverflow links allowed here?

strange stag
#

@chilly geyser tyvm for suggesting subsets! 😄

chilly geyser
#

@strange stag I got the code if you want it, it's ugly and IDK if it scales

strange stag
#

o.O

#

wrote the code for me 😄 wooo i got the code too

#

perhaps we shall compare?

chilly geyser
#
for _, y in df.groupby("upc"):
    amazon_min = y[y["location"] == "Amazon.com"]["price"].min()
    # print(y[y["location"] == "Amazon.com"]["price"].min())
    print(y[(y["location"] == "Amazon.com") | (y[y["location"] != "Amazon.com"]["price"] < amazon_min)])
strange stag
#
grouped = df.groupby(['upc'])
result = [g for g in list(grouped)]
result_length = len(result)
new_df = result[0]
high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
for index in new_df.index:
    if new_df['price'][index] > amazons_price:
        new_df = new_df.drop(index)
chilly geyser
#

Mine is still UPC-prints tho, I haven't done the dropping yet, mine is only a view

strange stag
#

i like ur code is wayyyyy shorter tho..

chilly geyser
#

Basically I get Amazon price minimum per UPC

strange stag
#

erm, need the max

chilly geyser
#

Then any other location (e.g. Walmart) with higher prices are dropped

#

Amazon max?

strange stag
#

ah okay, thats good

chilly geyser
#

I see

strange stag
#

yes

chilly geyser
#

You need the Amazon min rite?

strange stag
#

no

#

max

#

could explain, but with regards to your earlier post of using real data

#

dw, i account for amazons lower price later

chilly geyser
#

Umm I'm trying to make sure I can recreate df now

strange stag
#

this is super sweet tho

chilly geyser
#

Not sure how to get from the group-bys all the way back to the modified df

#

And I think df.append would be slow

strange stag
#

however, i think my version is slightly better

#

but this is alot closer than i have been the past week 😄

#

and ye, i just need to concat my new_df for each loop

#

i think im biased tho so

chilly geyser
#

@strange stag Up to you, it's your project

strange stag
#

🙂

#

think ill keep yours posted tho, incase i want the other amazon prices

chilly geyser
#

my final code is this

keep_indices = []
for _, y in df.groupby("upc"):
    # highest Amazon.com price within this UPC group
    amazon_max = y[y["location"] == "Amazon.com"]["price"].max()
    # keep every Amazon.com row plus any row priced below that maximum
    COND = (y["location"] == "Amazon.com") | (y["price"] < amazon_max)
    keep_indices += y[COND].index.tolist()

# to get the subset just use loc
df.loc[keep_indices]
#

I'm using .max() now
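
For larger frames, the same filter can probably be done without a Python loop. This is only a sketch under the column names used above (`upc`, `location`, `price`) with made-up rows:

```python
import pandas as pd

# Hypothetical sample: UPC 1 has Amazon.com offers, UPC 2 has none.
df = pd.DataFrame({
    "upc": [1, 1, 1, 1, 2, 2],
    "location": ["Amazon.com", "Amazon.com", "Walmart", "Carrefour", "Walmart", "Carrefour"],
    "price": [10.0, 20.0, 15.0, 25.0, 5.0, 6.0],
})

is_amazon = df["location"] == "Amazon.com"

# Per-row copy of the highest Amazon.com price for that row's UPC
# (NaN for UPCs that have no Amazon.com row at all).
amazon_max = df["price"].where(is_amazon).groupby(df["upc"]).transform("max")

# Keep Amazon.com rows plus anything cheaper than the Amazon.com maximum.
keep = is_amazon | (df["price"] < amazon_max)
filtered = df[keep]
print(filtered)
```

Comparisons against NaN are False, so UPCs without any Amazon.com row drop out here, which matches what the loop does.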

strange stag
#

wdym location?

chilly geyser
#

?

strange stag
#

df.loc

chilly geyser
#

Basically I get a list of indices that match the condition

#

This index uses the original DF's index, so it will be fine

#

in fact I don't think I'm changing the original df

#

You only modify the original DF if you have to

strange stag
#

so faster?

#

than mine by alot?

chilly geyser
#

lol for that I recommend using %%timeit

#

Also, not just this part by itself solo.

#

You need to do a %%timeit on your fullscript if you can

vast temple
strange stag
#

well, only got 1k lines atm so

chilly geyser
#

Unless you are really really sure of your test-case and likely inputs and/or outputs

#

I see

#

The issue with %%timeit on just this portion is even if this part is faster, it might be because it's not evaluating certain parts

#

like list comprehension being stored as a generator, not being used

strange stag
#

well, im concating dfs, for each upc...so

#

im sure thats probably not cheap

chilly geyser
#

Ya, that's what I think too, but maybe pd has an internal magic for that too

#

I'm trying to grab just the indices, but TBH I'm not sure if it's faster

strange stag
#

i think grabbing indices would be way faster, but im no expert

chilly geyser
#

Carrefour because....well, why not :^)
prices are literally from random. upc is choice(range(10)).

#

basically 1000 rows -> 967 rows, cutting off via Amazon max per upc

strange stag
#

think my biggest improvement would be switching how im saving data tho

#

cause loading jsonlines to a df is really slow

#
df = pd.DataFrame()
with jsonlines.open(filename, 'r') as reader:
    for obj in reader:
        df = df.append(obj, ignore_index=True)
#

its like 1 second per 100 rows or something...

#

how do i do a %%timeit?

chilly geyser
#

%%timeit is a JuPyteR magic. You put it at the top of the cell

strange stag
#

ah

#

that code above is...
617 ms ± 4.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

chilly geyser
strange stag
#

wow... 10m lines would take 12 hours....

chilly geyser
#

The 1+2 is so that I don't have a single line. You can actually just %timeit [SINGLE_LINE_CODE]

#

While %%timeit is for whole cell execution
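
Outside a notebook, the standard library `timeit` module does roughly the same job. A small sketch, where `build_squares` is just a stand-in for whatever you want to time:

```python
import timeit

def build_squares(n):
    # Stand-in workload; replace with the code you actually care about.
    return [i * i for i in range(n)]

# Five timing runs, each calling the function 100 times.
runs = timeit.repeat(lambda: build_squares(1000), number=100, repeat=5)

# Convert each run's total into seconds per call.
per_call = [t / 100 for t in runs]
best = min(per_call)
print(f"best of {len(runs)} runs: {best * 1e6:.1f} us per call")
```

In the notebook output, the number before the `±` is the mean time per loop and the number after it is the standard deviation across the runs.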

strange stag
#

ye...

chilly geyser
#

@strange stag Lol I don't think you can just linearly extrapolate so easily, just try for a slightly larger subset rather than a unittest

strange stag
#

im assuming the 617ms was used to create the df, and the 4.36 is for each line that its appending

chilly geyser
#

The fact is, unittests are unittests for a reason, and that integration testing is required

strange stag
#

no idea what that means

#

think the above is giving me a ballpark of what to expect tho

chilly geyser
#

Unit tests are for single things by themselves, while integration tests means you have multiple different things working together

#

It's common testing terminology

strange stag
#

tbh testing is outa my league atm

#

not necessary at all

chilly geyser
#

Well TBH IDK how much production-level code you're doing, and honestly personally I've never been involved in production-level stuff

strange stag
#

##autopilot

#

😄

#

got a LONG fkn ways to go tho

#

id say im 10% done

#

what would be better to save data than jsonlines?

#

for importing to pandas

#

well nvm

#

hmm

jolly briar
#

i'm wondering how to know what coordinate system i'm in wrt geographic data

velvet thorn
#

@strange stag

#

oh lord why

#

for loop + df.append = death
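
The usual workaround is to collect the records first and build the DataFrame once, since every `df.append` call copies the whole frame (and `DataFrame.append` was removed in pandas 2.0). A sketch, using an in-memory string in place of the real jsonlines file:

```python
import io
import json

import pandas as pd

# Stand-in for the jsonlines file; the fields are made up.
raw = io.StringIO(
    '{"upc": 1, "location": "Amazon", "price": 11.53}\n'
    '{"upc": 1, "location": "Walmart", "price": 24.97}\n'
)

# Option 1: parse every line into a dict, then build the frame in one go.
records = [json.loads(line) for line in raw if line.strip()]
df = pd.DataFrame(records)

# Option 2: let pandas read the JSON Lines format directly.
raw.seek(0)
df_alt = pd.read_json(raw, lines=True)
print(df)
```

With a real file, `pd.read_json(filename, lines=True)` should be the shortest version.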

velvet thorn
#

@jolly briar in the general case?

#

or what

strange stag
#

@chilly geyser you still there?

#

@velvet thorn what about this, atm im getting a blank df for total_df

total_df = pd.DataFrame()
for x in range(result_length):
    new_df = result[x]

    high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
    new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
    try:
        amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
    except IndexError:
        continue
    price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
    for index in new_df.index:
        if new_df['price'][index] > amazons_price:
            new_df = new_df.drop(index)
            pd.concat([new_df, total_df])
velvet thorn
#

I feel a bit weak just looking at the loops

#

okay, maybe you can tell me what you want to do first?

strange stag
#

mk, so i have a df with all the data and i am able to parse the data that i need with

grouped = df.groupby(['upc'])
result = [g for g in list(grouped)]
result_length = len(result)
new_df = result[0]
high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
for index in new_df.index:
    if new_df['price'][index] > amazons_price:
        new_df = new_df.drop(index)
#

however, having difficulties running this in a loop

#

because not all of my grouped subsets have the amazon bit

#

so amazon may not be a location when iterating through the df (group)

#

so that code does everything that i want except...

#

i cant figure out how to drop upcs that dont have an amazon location

#

so im grouping by upc, keeping amazons highest price, and dropping anything that is higher than that

jolly briar
#

@velvet thorn i was just thinking generally... i've just been merging some shapey stuff but i'm not too sure how to check that i did it correctly

strange stag
#

new_df is when im separating each upc into a new dataframe, and parsing it from here, and now im trying to add it back into a master dataframe

velvet thorn
#

hm

#

okay, so first you want to drop entire groups with values of upc that don't have 'Amazon.com' in location, correct?

strange stag
#

yes

velvet thorn
#

df.groupby('upc').filter(lambda g: 'Amazon.com' in set(g['location']))

#

or, actually

#

df.groupby('upc').filter(lambda g: 'Amazon.com' in g['location'].unique())
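
To see what that filter keeps, here is a small sketch on invented rows. The second variant (a boolean mask via `transform("any")`) is an alternative that avoids calling a lambda per group:

```python
import pandas as pd

# Hypothetical offers: UPC 2 has no Amazon.com row.
df = pd.DataFrame({
    "upc": [1, 1, 2, 2, 3],
    "location": ["Amazon.com", "Walmart", "Walmart", "Target", "Amazon.com"],
    "price": [10.0, 8.0, 5.0, 6.0, 3.0],
})

# Keep whole groups that contain at least one Amazon.com row.
with_amazon = df.groupby("upc").filter(lambda g: (g["location"] == "Amazon.com").any())

# Same result as a boolean mask broadcast back to every row of the group.
has_amazon = (df["location"] == "Amazon.com").groupby(df["upc"]).transform("any")
with_amazon_fast = df[has_amazon]
print(with_amazon)
```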

strange stag
#

ok, so now that i have amazon only upcs, how do i concat the dfs?

#
new_dataframe = df.groupby('upc').filter(lambda g: 'Amazon' in g['location'].unique())
grouped = new_dataframe.groupby('upc')

result = [g for g in list(grouped)]
result_length = len(result)

total_df = pd.DataFrame()

for x in range(result_length):
    new_df = result[x]
    high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
    new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
    amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
    price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
    for index in new_df.index:
        if new_df['price'][index] > amazons_price:
            new_df = new_df.drop(index)
            print(new_df)
            pd.concat([total_df, new_df])
velvet thorn
#

uh

#

so now

#

you want all the rows where prices are lower than the highest Amazon price for that group, right?

strange stag
#

yes

#

all the upcs with that/those conditions, yes

#

rows include upcs, so yeah

#

basically the high of amazon and the low of anywhere else

velvet thorn
#

wait

#

what?

#

the last line does not mean the same thing

#

as what I said

strange stag
#

which line

velvet thorn
#

basically the high of amazon and the low of anywhere else

strange stag
#

df.groupby('upc').filter(lambda g: 'Amazon' in g['location'].unique())
this is getting all upcs that have an amazon price yes?

velvet thorn
#

I would interpret "low" to mean "only the lowest value", not "everything lower than the highest Amazon value"

#

since it seems to me that there are multiple values of price for each value of location

strange stag
#

low as the lowest value

#

yes, the lowest of anywhere besides amazon

#

and the high of amazon

velvet thorn
#

you want all the rows where prices are lower than the highest Amazon price for that group, right?

#

so this is wrong

strange stag
#

well, its right in the manner that it dropped the upcs that dont have an amazon price, or are you asking about the next step?

velvet thorn
#

next step

#

probably

strange stag
#

okay

velvet thorn
#

you should come up with some sample data

strange stag
#
7847    Amazon    11.53     806481288353    https://www.amazon.com/gp/offer-listing/B083CP...
7850    HomeDepot 28.99     806481288353    https://www.amazon.com/gp/offer-listing/B083CP...
7848    Walmart    24.97    806481288353    //goto.walmart.com/c/1914133/566719/9383?veh=a...
7851    Amazon    136.73    806481288353    https://www.amazon.com/gp/offer-listing/B01IBI...
#

should yield row 7851 and 7848

velvet thorn
#

in other words

#

each group

#

should yield 2 rows

#

?

strange stag
#

yes

velvet thorn
#

okay

#

let me think about that for a moment

strange stag
#

courtesy of another user (earlier)
this yields that, but all of amazon prices, not just the highest

keep_indices = list()
for _, y in df.groupby("upc"):
    # highest Amazon price within this UPC group
    amazon_max = y[y["location"] == "Amazon"]["price"].max()
    # keep every Amazon row plus any row priced below that maximum
    COND = (y["location"] == "Amazon") | (y["price"] < amazon_max)
    keep_indices += y[COND].index.tolist()

df.loc[keep_indices]

id prefer to keep only the highest

velvet thorn
#

sure

#

and it doesn't matter if, for example

#

the highest Amazon price is lower than the lowest non-Amazon price, right

#

in all cases you just want the highest Amazon price and the lowest non-Amazon price

strange stag
#

yes

#

exactly

#

only 1 amazon price should be listed

#

for any given upc

velvet thorn
#

and this is applied on the previous DataFrame

#

the one with UPCs without Amazon filtered out

strange stag
#

with amazon upcs filtered

#

so applied to

new_dataframe = df.groupby('upc').filter(lambda g: 'Amazon' in g['location'].unique())
velvet thorn
#

👍

strange stag
#

😄

velvet thorn
#

try aggs = df.groupby([(df['location'] == 'Amazon').rename('amazon'), 'upc']).agg(['min', 'max'])

#

and pd.concat([aggs.xs(c, level=0)[[('location', 'min'), ('price', 'min')]] for c in {False, True}]) to filter out

strange stag
#

filter out?

velvet thorn
#

yeah

#

try it and tell me if it's what you're looking for

strange stag
#

so, filter seems to be almost what im looking for, cept 2 things

#

still need the price for amazon with the upc, and 2 if amazon is the lowest price, then i need to drop that row

velvet thorn
#

huh.

strange stag
#

but other than that the filter is perfect i think, checking now

velvet thorn
#

you didn't say that

strange stag
#

my apologies... 😦

velvet thorn
#

oh wait, the second part is wrong though, ignore it

strange stag
#

?

#

the filter?

velvet thorn
#

it should be

#
pd.concat([
    aggs.xs(False, level=0)[[('location', 'min'), ('price', 'min')]], 
    aggs.xs(True, level=0)[[('location', 'max'), ('price', 'max')]]
])
#

because you want the max for Amazon, right
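
One caveat with aggregating `location` by min/max: on a string column that is alphabetical order, so the location you get back is not necessarily the store that has that price. A sketch of an alternative that picks whole rows with `idxmax` / `idxmin`, using the four sample rows from earlier (index values as posted):

```python
import pandas as pd

# The sample rows posted above, keeping their original index labels.
df = pd.DataFrame(
    {
        "location": ["Amazon", "HomeDepot", "Walmart", "Amazon"],
        "price": [11.53, 28.99, 24.97, 136.73],
        "upc": [806481288353] * 4,
    },
    index=[7847, 7850, 7848, 7851],
)

is_amazon = df["location"] == "Amazon"

# Index label of the highest Amazon price per UPC.
amazon_rows = df[is_amazon].groupby("upc")["price"].idxmax()
# Index label of the lowest non-Amazon price per UPC.
other_rows = df[~is_amazon].groupby("upc")["price"].idxmin()

# Pull the complete rows, so location and price stay together.
picked = df.loc[amazon_rows.tolist() + other_rows.tolist()]
print(picked)
```

On this sample it returns rows 7851 and 7848, the two rows asked for.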

strange stag
#

yes

velvet thorn
#

okay, I need to go now

#

but basically

#

for the last step where you wanna drop the rows

strange stag
#

same

velvet thorn
#

you can just do another groupby and filter on that condition

strange stag
#

ye, thought so as much 🙂

#

@velvet thorn anyways tyvm!!!!!

#

would elaborate how helpful u have been, but as u, i really have to go like right now!!

leaden bobcat
#

Is anyone available to answer a couple questions regarding how to turn a JSON file into a pandas dataframe? I've got an API call from a sports data website, but I'm missing something obvious

velvet thorn
#

@leaden bobcat do elaborate

jolly briar
#

when merging two df's with

pd.merge(df1, df2, on='shared_column', how='left')

i expect there to be the same number of rows after as there are in df1, this isn't usually the case

#

how is it possible to create more rows than the original df when doing a left join, i figured the max would be the number of rows in the original data

#

when i instead do df1.join(df2, how='left') i get the expected result so idk

jolly briar
#

how to replace a section of a dataframe?

Say i have a df with columns A,B,C and where C == 4 i want to replace C with the value of B.

I'm not sure how to do this without a bunch of for loops

#

i just created a different vector and used that to overwrite

worn stratus
#

select the section you want to replace with .loc or .iloc and just assign it

#

dataframe['column_to_change'] = new_col

#

I think should work
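
For the conditional case (C == 4 takes the value of B), a minimal sketch with `.loc` and a boolean mask, on a made-up frame:

```python
import pandas as pd

# Hypothetical frame with the columns from the question.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30], "C": [4, 5, 4]})

# Rows where C equals 4.
mask = df["C"] == 4

# Overwrite C with B only on those rows; other rows are untouched.
df.loc[mask, "C"] = df.loc[mask, "B"]
print(df)
```

`df["C"] = np.where(df["C"] == 4, df["B"], df["C"])` should give the same column if you prefer building the whole vector at once.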

jolly briar
#

yeah i actually did that - just new_col was a replace with np.where

#

cheers

velvet thorn
#

@jolly briar doesn't seem right, got example?

jolly briar
#

@velvet thorn re what, the joins?

#

It's UK here so not now 🙃
But this seemed to be the case

#

As in, I used merge and got way more. Used join and got less

velvet thorn
#

how do you know you got more?

coral yoke
#

he compared his rows before and after. can confirm, when he posted before it showed some weird shit

velvet thorn
#

hm.

#

shouldn't be the case

#

you did pass on to join, right?

coral yoke
#

backpropagation is a general thing for all NNs, what is your question?

#

wait, let me get this right, you're trying to make your own algorithm for backpropagation when the one used is used for a reason?

#

i'm not sure if any of us here honestly know enough about the deep math behind these algorithms that have been around for years for reason. if you'd like to learn them i would definitely just suggest learning about what's there and how it works instead of trying to replace it

#

recreating the core of how any of our NNs work isn't exactly common as far as i'm aware. making your own network? sure yeah, but not recreating the essence

#

i support you totally btw, power to you if you can understand that stuff cause fuckin hell i'm not going through that much

#

i'm afraid i won't be able to help much though, past just understanding how backprop works 😛

velvet thorn
#

@coral yoke I would disagree that this is “deep” math...

coral yoke
#

👌

velvet thorn
#

@keen geyser how do you intend to normalise the weights?

#

and which articles are you looking at?

coral yoke
#

i honestly didn't need your ping just for a disagreement, but sure

chilly geyser
#

@strange stag lol I didn't know you only wanted the highest Amazon. Your original said all amazons and every other lower than this Amazon

#

@strange stag Lol now I think I get what you want
You should have just said this at the very start

in all cases you just want the highest Amazon price and the lowest non-Amazon price
So basically all non-Amazons would be the same 🤦

#

@keen geyser Would help if you could share the articles you are using. CNN backprop should be ok-ish material

#

@velvet thorn btw looking at your thing. Why do you need to rename "Amazon" to "amazon"?

velvet thorn
#

don’t need to

#

but if you want to look @ the intermediate result it’s slightly more comprehensible to have a name for that level of the index

strange stag
#

@chilly geyser you still there?

#

ah, confused u with gm

#

ye... my apologies... i have difficulty explaining what i want...so

#

@chilly geyser

velvet thorn
#

@strange stag in general for this kind of data wrangling question

strange stag
#

@velvet thorn how do i merge the two location max / location min?

velvet thorn
#

providing expected output helps everyone out a lot

strange stag
#

i shall try to do so in the future

velvet thorn
#

on phone so I can’t write code, but you want a groupby

strange stag
#
aggs = df.groupby([(df['location'] == 'Amazon').rename('amazon'), 'upc']).agg(['min', 'max'])
df2k = pd.concat([
    aggs.xs(False, level=0)[[('location', 'min'), ('price', 'min')]], 
    aggs.xs(True, level=0)[[('location', 'max'), ('price', 'max')]]
])
df2k.groupby('upc').head(len(df2k)).sort_values(by='upc')
#

merging the upc 8888359036 for example

#

thought grouping by upc would do this however

#

i need an agg yeah?

#

expected output of the first two lines merged would be

8888359036, Amazon, BestBuy, 14.23, 9.99
#

third, fourth, fifth, sixth, would be dropped (later with df2k.dropna())

#

seventh upc merged would be

8421134096783, Amazon, Target, 15.24, 4.99
#

and for extra credit, dropping any rows where price max is not greater than or equal to twice the price min

#

i can probably figure this out tho 😛
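
One possible way to get that one-row-per-UPC shape, sketched with the two UPCs mentioned above plus an invented UPC that has no Amazon row. The column names after the merge come from the `suffixes` chosen here, not from the original data:

```python
import pandas as pd

# Hypothetical long-format offers; upc 111 has no Amazon row.
df = pd.DataFrame({
    "upc": [8888359036, 8888359036, 8421134096783, 8421134096783, 111],
    "location": ["Amazon", "BestBuy", "Amazon", "Target", "Walmart"],
    "price": [14.23, 9.99, 15.24, 4.99, 3.50],
})

is_amazon = df["location"] == "Amazon"

# Highest Amazon row and lowest non-Amazon row for every UPC.
amazon = df.loc[df[is_amazon].groupby("upc")["price"].idxmax().tolist()]
other = df.loc[df[~is_amazon].groupby("upc")["price"].idxmin().tolist()]

# Inner merge on upc: UPCs missing either side drop out, so no NaNs to clean.
wide = amazon.merge(other, on="upc", suffixes=("_amazon", "_other"))
wide = wide[["upc", "location_amazon", "location_other", "price_amazon", "price_other"]]

# "Extra credit": keep rows where the Amazon price is at least twice the other price.
deals = wide[wide["price_amazon"] >= 2 * wide["price_other"]]
print(wide)
print(deals)
```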

velvet thorn
#

actually try groupby fillna

strange stag
#

what would be the value?

#

basically perfect besides nan values

#

what is this doing in aggs? .rename('amazon'), 'upc']

#

.stack() o.O

#

how do i filter these though....

strange stag
#

nvm on the stack...not what im looking for

strange stag
#

ah, u were right

#

.fillna(method='ffill')

strange stag
#

nvm

chilly geyser
#

Uhm so did it work o,o

jolly briar
#

@velvet thorn like i just inner joined two dfs with (1173, 14) and (17000,40) (ish) dimensions respectively and got a df with 2.5 million rows back

#

that just makes zero sense to me for an inner join

velvet thorn
#

that seems like an outer join...

jolly briar
#

right, but it's not

velvet thorn
#

do you have the code?

jolly briar
#

i do but i can't share anything

#

i mean, i can 100% say this has happened with an inner join

#

this is a left

#

outers the same dim, so ive no idea 🤔

velvet thorn
#

the reason

#

is duplicates.

jolly briar
#

hrm, i'm not sure what to do there then

velvet thorn
#
>>> import pandas as pd
>>> left = pd.DataFrame([[0, 'a'], [0, 'b']], columns=['a', 'b'])
>>> right = pd.DataFrame([[0, 'c'], [0, 'd']], columns=['a', 'b'])
>>> pd.merge(left, right, on='a')
   a b_x b_y
0  0   a   c
1  0   a   d
2  0   b   c
3  0   b   d
jolly briar
#

because i think this duplicate information is valuable - it would be grouped by

velvet thorn
#

left and right both have 2 rows

#

but the merged result has 4

#

quite clear why, I think

jolly briar
#

@velvet thorn yeah, it's giving all combinations

velvet thorn
#

yeah, so that's why you have more rows in your case too
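
A quick way to check for this before merging is to look for duplicated keys, or to let pandas enforce the expected relationship with `validate`. A sketch with invented frames and a column named `key`:

```python
import pandas as pd

# Hypothetical frames; key 0 is duplicated on both sides.
left = pd.DataFrame({"key": [0, 0, 1], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [0, 0, 2], "r": ["x", "y", "z"]})

# Left join: every left row is paired with every matching right row.
merged = pd.merge(left, right, on="key", how="left")

# Which keys appear more than once on the right-hand side?
dup_keys = right["key"][right["key"].duplicated()].unique()

# validate makes pandas raise instead of silently multiplying rows.
try:
    pd.merge(left, right, on="key", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(left), len(merged), dup_keys, raised)
```

Also worth knowing: `df1.join(df2)` matches on the index by default rather than on a shared column, which could explain why `join` and `merge` gave different row counts.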

jolly briar
#

yes, i'm confused about what to do with the data now :/

#

the duplicates are for geographic regions , eh

#

thanks tho - that explains it 👍

jolly briar
#

given a df with columns A,B where A are groups and B are count values, how to find the column B percentages per group?

so if i have

A    B
a1   50
a1   50
a2   80
a2   20

i would want to have column B_perc as [0.5, 0.5, 0.8, 0.2]

i get that in this case the data sums to 100, this can't be assumed ( so *0.01 isn't ok)

velvet thorn
#

>>> df.groupby('A').transform(lambda g: g / g.sum())
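
Spelled out as a new column, a sketch on data where the groups deliberately don't sum to 100:

```python
import pandas as pd

# Hypothetical counts; group a2 sums to 40, not 100.
df = pd.DataFrame({"A": ["a1", "a1", "a2", "a2"], "B": [50, 50, 30, 10]})

# transform("sum") returns the group total repeated on every row of the group,
# so the division lines up row by row.
df["B_perc"] = df["B"] / df.groupby("A")["B"].transform("sum")
print(df)
```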

lapis sequoia
#

I always have difficulty understanding groupby

#

@velvet thorn you have shown the table, it got 2 columns and 4 rows. We can see how it looks. I always wondered how this looks:

df.groupby('A')

Because Python never shows how it looks in reality

velvet thorn
#

it doesn't really make sense

#

to have a raw groupby

#

for reasons I can explain another time, since I'm going to bed soon

lapis sequoia
#

oh..

velvet thorn
#

have you read the pandas groupby docs?

lapis sequoia
#

good night then 🙂

velvet thorn
#

they might help

lapis sequoia
#

Pandas groupby docs, been reading from last 4 days

#

I can read C++ technical definition from the ISO standard

#

But can't understand groupby >:-\

velvet thorn
#

hm

#

okay real quick

#

imagine this

#
A    B
a1   50
a1   50
a2   80
a2   20

you have this, right

#

and say you want the mean of B for each unique value of A

#

you could do this:

#
for a in df['A'].unique():
    print(df.loc[df['A'] == a, 'B'].mean())
#

and this gets each subset of the DataFrame

#

for which A has a specific unique value

#

and then performs some transformation on it

#

this is equivalent to df.groupby('A')['B'].mean()
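
If it helps to actually see what a raw groupby holds, you can iterate over it: each step gives a key and the sub-DataFrame for that key. A small sketch with the same table:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a1", "a1", "a2", "a2"], "B": [50, 50, 80, 20]})

grouped = df.groupby("A")

# Each iteration yields (group key, rows of df belonging to that key).
for key, group in grouped:
    print(key)
    print(group)

# The explicit loop and the groupby one-liner compute the same means.
loop_means = {a: df.loc[df["A"] == a, "B"].mean() for a in df["A"].unique()}
groupby_means = grouped["B"].mean()
print(groupby_means)
```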

#

@lapis sequoia make sense?

lapis sequoia
#

So far, no.

but I will try to understand while you sleep

jolly briar
#

@velvet thorn thanks again - I didn't know about transform , i used apply with a lambda function, is there any reason to reach for one over the other?

#

ah i see it's late for you, no worries

#

👍

oblique belfry
plain turret
#

i can imagine a regular algo

#

These vacuums use a navigation algorithm called VSLAM (or visual simultaneous location and mapping

#

according to wikipedia there is some algorithms that are open source

#

you could get some inspiration from this

#

i don't suggest anything i just googled :p

#

i would guess you would need some camera system and the processing power to treat it in real time

oblique belfry
#

I wonder how well Reinforcement Learning would work in this situation.

jolly briar
#

df.isna() will give me true / false for each cell based on whether it's nan or not, how can i select only rows which have some NA though?

chilly geyser
#

Does df[df.isna().any(axis=1)] work?
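
A tiny check of that expression on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 and row 2 each have one missing value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# any(axis=1) is True for a row as soon as one of its cells is NA.
rows_with_na = df[df.isna().any(axis=1)]
print(rows_with_na)
```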

strange stag
#

alright yall... how do i merge rows by upcs?
For example, I have 2 rows with NaN values. The first row's missing values are found within the second row, and vice versa (however a simple .fillna(method='ffill') does not work, because the data is not perfect, and what i mean by that is, not all upcs have 2 rows to make up for the NaNs)
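
One option that might fit: `groupby(...).first()` takes the first non-null value per column within each group, so complementary rows collapse into one and single-row UPCs simply keep their NaNs. A sketch with invented column names:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: upc 1 is split over two rows, upc 2 has only one row.
df = pd.DataFrame({
    "upc": [1, 1, 2],
    "amazon_price": [14.23, np.nan, 15.24],
    "other_price": [np.nan, 9.99, np.nan],
})

# first() skips NaN, so each column takes its first real value per upc.
merged = df.groupby("upc", as_index=False).first()
print(merged)
```

Rows that still contain NaN afterwards can then be dropped with `dropna()` if they are not wanted.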

sand gyro
#

I created the functions dropna, which drops rows with empty values, and isnull, which keeps rows with empty columns, to filter the dataframe and it works as I am able to print both. Then I would append them to previously created xlsx files

wb = Workbook()
ws = wb.active
wb.title = 'Contacts'
wb2 = Workbook()
ws2 = wb2.active
wb2.title = 'Contacts'

r1 = df.dropna(subset=['Firstname', 'Lastname', ('work_phones' or 'mobile_phones') or (('Work_City','Work_Street','Work_State','Work_Zip') or ('Personal_Street','Personal_City','Personal_State','Personal_Zip')) or ('Work_email' or 'Personal_email')])

r2 = df.loc[(df['Firstname'].isnull()) | (df['Lastname'].isnull()) | (((df['work_phones'].isnull()) & (df['mobile_phones'].isnull())) | (((df['Work_Street'].isnull()) | (df['Work_City'].isnull()) | (df['Work_State'].isnull()) & (df['Work_Zip'].isnull())) | (df['Personal_Street'].isnull()) | (df['Personal_City'].isnull()) | (df['Personal_State'].isnull()) | (df['Personal_Zip'].isnull())) & (df['Work_email'].isnull()) & (df['Personal_email'].isnull()))]

for r in dataframe_to_rows(r1, index=False, header=False):
   ws.append([r])

for r in dataframe_to_rows(r2, index=False, header=False):
    ws.append([r])
  
   

wb.save("Accepted Contacts.xlsx")
wb2.save("Rejected Contacts.xlsx")

However, when I try to add them to the excel files I get this error for r1

raise ValueError("Cannot convert {0!r} to Excel".format(value))

ValueError: Cannot convert ['Doe', 'Jane', nan, nan, nan, nan, '5678743546', 'j@greenbriar.com', '54 George street', 'Ridge Springs', 'VA', '25678', nan, nan, nan, nan, '3245687907', nan, nan, nan] to Excel
plain turret
#

hmm i don't really understand what you're trying to do, but nan is not an excel character no?

#

if you want an empty value in excel/csv it should be "Jane",,,,"56787453"

sand gyro
#

It needs to be column specific

plain turret
#

,, is one column

sand gyro
#

instead of nan I make it an empty string?

plain turret
#

it would work but then you would have an empty string in your excel

#

so , "",

#

it probably doesn't matter, but sometimes, some excel macro doesn't consider empty string as blank value

sand gyro
#

  File "<ipython-input-2-de3603ab2d77>", line 1, in <module>
    runfile('C:/Users/mosta/.spyder-py3/CRMnew.py', wdir='C:/Users/mosta/.spyder-py3')

  File "C:\Users\mosta\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)

  File "C:\Users\mosta\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/mosta/.spyder-py3/CRMnew.py", line 1311, in <module>
    ws.append([r])

  File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\worksheet\worksheet.py", line 644, in append
    cell = Cell(self, row=row_idx, column=col_idx, value=content)

  File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\cell\cell.py", line 133, in __init__
    self.value = value

  File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\cell\cell.py", line 239, in value
    self._bind_value(value)

  File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\cell\cell.py", line 222, in _bind_value
    raise ValueError("Cannot convert {0!r} to Excel".format(value))

ValueError: Cannot convert ['Doe', 'Jane', '', '', '', '', '5678743546', 'j@greenbriar.com', '54 George street', 'Ridge Springs', 'VA', '25678', '', '', '', '', '3245687907', '', '', ''] to Excel
#

This is not the problem

#

I don't know what is {0!r}?

strange stag
#

@velvet thorn @chilly geyser @gilded harness

plain turret
#

what is r in your code

#

@sand gyro ws.append([r]) -> is r already a list maybe

sand gyro
#

It is in the for loop:

for r in dataframe_to_rows(r2, index=False, header=False):
    ws.append([r])
plain turret
#

what type is this?

sand gyro
#

When I print it

   Lastname Firstname          Company  ...  Personal_email  Note  Note_Category
1   Malcoun       Joe  8/28/2019 14:29  ...             NaN   NaN            NaN
4      None    Jordan              NaN  ...             NaN   NaN            NaN
5      None       NaN              NaN  ...             NaN   NaN            NaN
6  Zachuani     Reemo              NaN  ...             NaN   NaN            NaN
7    Suarez   Geraldo              NaN  ...             NaN   NaN            NaN

[5 rows x 20 columns]
  Lastname Firstname Company  ...  Personal_email  Note  Note_Category
0      Doe      Jane     NaN  ...             NaN   NaN            NaN
2  Ramirez    Morgan     NaN  ...             NaN   NaN            NaN
3    Burki     Roman     NaN  ...             NaN   NaN            NaN

[3 rows x 20 columns]
plain turret
#

can you print(type(r)) in your loop ? maybe you pass something like [[your_row]]

sand gyro
#

I print r1 and r2 before the loop

plain turret
#

openpyxl write it as :

#

for r in dataframe_to_rows(df, index=True, header=True): ws.append(r)

#

here you ws.append([r]) so you put list in list probably ?
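
A short sketch of the difference between the two calls, assuming openpyxl and pandas are installed. The `contacts` frame is made up for illustration; only `dataframe_to_rows` and `ws.append` come from the discussion above:

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# Hypothetical contact data; None stands in for a missing value.
contacts = pd.DataFrame({
    "Lastname": ["Doe", "Burki"],
    "Firstname": ["Jane", "Roman"],
    "Work_email": ["j@greenbriar.com", None],
})

wb = Workbook()
ws = wb.active
ws.title = "Contacts"

for row in dataframe_to_rows(contacts, index=False, header=False):
    ws.append(row)  # row is already a list of cell values: one value per column

# Wrapping the row in another list makes openpyxl try to store a whole list
# in a single cell, which it refuses to do.
try:
    ws.append([["Doe", "Jane"]])
    nested_error = None
except ValueError as exc:
    nested_error = str(exc)

print(nested_error)
```

Each element passed to `ws.append` becomes one cell, so the argument has to be the flat row itself.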

sand gyro
#

What a stupid mistake by me. It took me days

plain turret
#

happens

sand gyro
#

Thank you very much @plain turret

plain turret
#

you're welcome, sometimes you just need fresh eyes
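
Separately from the nested-list issue, the `subset=` list in the earlier snippet joins column names with `or`. In Python, `'work_phones' or 'mobile_phones'` evaluates to just `'work_phones'`, so the other columns are never checked. A hedged sketch of one alternative, assuming the intent is "a full name plus at least one phone number or email"; the data and that rule are assumptions, not something confirmed in the thread:

```python
import pandas as pd

# `or` between non-empty strings returns the first one.
assert ("work_phones" or "mobile_phones") == "work_phones"

# Hypothetical contacts using a few of the column names from the snippet.
df = pd.DataFrame({
    "Firstname": ["Jane", "Joe", None],
    "Lastname": ["Doe", "Malcoun", "Suarez"],
    "work_phones": ["5678743546", None, None],
    "mobile_phones": [None, None, "3245687907"],
    "Work_email": [None, None, None],
    "Personal_email": [None, None, None],
})

# Build one boolean mask per requirement, then combine them.
has_name = df[["Firstname", "Lastname"]].notna().all(axis=1)
has_phone = df[["work_phones", "mobile_phones"]].notna().any(axis=1)
has_email = df[["Work_email", "Personal_email"]].notna().any(axis=1)

accepted_mask = has_name & (has_phone | has_email)
accepted = df[accepted_mask]
rejected = df[~accepted_mask]  # exact complement, so no row is lost or duplicated
```

In the earlier snippet both loops also append to `ws`, so the rejected rows would presumably need `ws2.append(r)` to end up in the second workbook.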

crystal sluice
#

hey guys, to learn data science, what subject should i focus first?

#

i'm good with math and statistics, i understand usability of data very well, but idk what to learn to work with data science

#

anybody could give me a north?

strange stag
#

udacity

plain turret
#

i'd just pick a book on the subject i want to explore? data science is super big

coral yoke
#

subject? get used to the python libraries that are used most in the area such as pandas, numpy, etc.

strange stag
#

scikitlearn perhaps as well depending on your preference

crystal sluice
#

pandas, numpy and which others are used? so i can focus on this first

strange stag
#

what is your goal?

crystal sluice
#

i want to be able to get dataframes and work data to information, create information for decision making

coral yoke
#

definitely just pandas and numpy then for that

strange stag
#

pandas > numpy in priority

crystal sluice
#

and let's suppose i want to make a little dashboard

#

to show data

#

in real time, as the database is working

strange stag
#

still pandas, but then flask, django or something else

coral yoke
#

flask to handle the automatic population of your table

crystal sluice
#

hmmm, nice

strange stag
#

django depending on the scale of the site

crystal sluice
#

nice, thanks guys, helped a lot, i'll start right now

strange stag
#

flask for smaller projects

crystal sluice
#

@strange stag this is something i would ask too

coral yoke
#

i've seen flask used on large projects as well. preference 😛

crystal sluice
#

what is a small project and a large project? is based on data or views?

strange stag
#

well yes, can happen, but generally that is not done

coral yoke
#

whatever you say yeah

#

both generally georg

strange stag
#

id start out in flask

coral yoke
#

your traffic and how much you're handling

crystal sluice
#

django > flask?

coral yoke
#

no

strange stag
#

management is different

coral yoke
#

neither's better than the other

crystal sluice
#

i tried to start with django

plain turret
#

flask is easier to set up / less stuff to learn imo

coral yoke
#

^

crystal sluice
#

but it was really difficult to me

strange stag
#

flask has more flexibility, django has more structure

coral yoke
#

and flask is generally preferred starting off, even in businesses, as you only add what you need

crystal sluice
#

flask worked very well for me

plain turret
#

so to advance fast and get result i would prefer flask

crystal sluice
#

nice

plain turret
#

most of the stuff you'll learn can be transferred to django since i think they both work with templates

crystal sluice
#

i have an idea i'm developing, it can get some size someday, but i'll start with flask

coral yoke
#

they both work with the exact same template engine, yes

plain turret
#

@void anvil seaborn have nice heatmaps with pandas.corr if you want to plot them easily

crystal sluice
#

sorry for any misspelling or word order, english is not my main language

plain turret
#

you can still print on top

#

i think

coral yoke
#

your english is fine georg, no worries!

plain turret
#

i did this two years ago so i can't say for sure

#

you can with hmm

#

the keyword annot

#

i had made another df with the p-value significances as * and plotted them on top

#

since you have correlation with color anyway

#

but you can mess with it
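
A minimal sketch of the heatmap idea, assuming seaborn and matplotlib are installed. The data is random and only there to produce a correlation matrix:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: column d is built from a, so the two correlate strongly.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
data["d"] = data["a"] * 2 + rng.normal(scale=0.1, size=50)

corr = data.corr()

# annot=True prints each coefficient inside its cell; fmt controls the number format.
ax = sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm")
```

As far as I recall, `annot` also accepts an array-like with the same shape as the data (together with `fmt=""` for strings), which is probably how a separate frame of `*` markers can be drawn on top.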

crystal sluice
#

@coral yoke thank you!!

jolly briar
#

anyone made use of yellowbrick?
it seems to have changed the output of seaborn after importing it, i don't just mean style wise, but the actual data looks a bit different as though there's some kinda transformation or something... just wondering if anyone's noticed anything similar

plain turret
#

ah i didn't no

#

i see they have ranks that's cool

jolly briar
#

i always thought R plots were nice from regression models, seems that this has diagnostics now at least

plain turret
#

what am i watching

jolly briar
#

a horror

plain turret
#

why do you have some sort of regression line with columns lol

strange stag
#

anyone able to help with my previous q?

jolly briar
#

yeah it's an odd one - it wasn't like that earlier @plain turret , i don't think 🤔

plain turret
#

kinda what i get after i try every tutorial tbh

jolly briar
#

i'm also getting test R2 consistently higher than training 🙃

#

so there's clearly something very wrong somewhere lol

jolly briar
#

am i being thick or is drawing a horizontal line on a seaborn plot a bit of a faff

velvet thorn
#

get the Axes

#

ax.axhline
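
A small sketch of that suggestion. It uses plain matplotlib so it stays self-contained; seaborn's axes-level functions (for example `sns.scatterplot`) return the same kind of matplotlib `Axes`, so the `axhline` call should carry over unchanged. The data and the 0.5 threshold are made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter([1, 2, 3, 4], [0.2, 0.7, 0.4, 0.9])

# axhline draws a horizontal line across the full width of the Axes at the given y.
threshold_line = ax.axhline(y=0.5, color="red", linestyle="--")
```

With figure-level seaborn functions (which return a grid object rather than an `Axes`), the underlying axes would first have to be pulled out of the grid.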

#

@crystal sluice you can consider Dash for that

#

also, another reason to use transform is that it better signals your intent

crystal sluice
#

@velvet thorn what is dash

velvet thorn
#

it's a framework meant for data analysis

#

integrates with pandas

#

Google "dash python"

halcyon venture
#

do I have to use an old version (1.8) of Anaconda if I need to use python 2.6?

#

I don't want it to interfere with the current version installation

jolly briar
#

for two models A,B, if mse( A ) < mse( B ) yet mae( A ) > mae ( B ), how to choose the model based on these metrics?

lapis sequoia
#

could anyone help me translate a function from intention into code? it's probably a bit of text to explain, would appreciate a PN

jolly briar
#

@lapis sequoia what's a PN

lapis sequoia
#

it's supposed to be a private message, but i see the acronym doesn't make sense in English haha

jolly briar
#

either PM or DM would be the english for that @lapis sequoia , and i think you're better off just putting your problem into the channel as best as you're able too

lapis sequoia
#

i'd spam the whole room, because it's a lot to explain 😕

jolly briar
#

well, not sure what to say then i guess

lapis sequoia
#

ok so i don't know how to explain the problem w/o context

#

i have a huge data set, it's about delays and delay prediction... i still need to engineer some features

#

in the tidy dataset there are columns for delays, train stations, train-line, stop sequence number and so on... what i'm working on right now is a directional index for every train line, to have a dummy variable in the regression part

#

my plan is to get a list of station acronyms sorted by their sequence of occurrence within a line, let's say LINE 1

#

which would look like this:

#

[(0, 'TKT'), (1, 'TKTO'), (2, 'TWD'), (3, 'TWER'), ... (21, 'TSRO'), (22, 'TGOL'), (23, 'TBO'), (24, 'THUB'), (25, 'TEHN'), (26, 'TGT'), (27, 'TNUF'), (28, 'THE')]

#

now i would want to find any match of any train event for the given LINE 1 where the station is in that list, and write the corresponding number into a new column

#

I'd have to do that for every train-line

#

when that column is finished i'd be able to check for every train, from its start to its end, whether it goes from a higher number to a lower number or vice versa

#

why so complicated? because the dataset is complex and not every train of one specific train-line goes all the way from 0 to XX. some start later and stop earlier etc.

#

do you get it? 🤔

#

The procedure would have to be done for each of the 8 train LINEs to fill the entire column, so I would like to write a function or pipeline that does the same for every LINE. I can't just give every station abbreviation one specific number, because while the station abbreviations are "general", the corresponding number would be LINE-specific.

velvet thorn
#

what is a train event?

#

@lapis sequoia

#

@jolly briar which is more important to you...?
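
A worked example of how the two metrics can disagree, using made-up absolute errors for two models A and B. MSE squares each error, so a single large miss dominates it, while MAE weighs all errors linearly:

```python
import numpy as np

# Hypothetical absolute errors of two models on the same four test points.
errors_a = np.array([2.0, 2.0, 2.0, 2.0])  # consistently off by a moderate amount
errors_b = np.array([0.0, 0.0, 0.0, 5.0])  # mostly exact, with one large miss

mse_a, mae_a = np.mean(errors_a ** 2), np.mean(np.abs(errors_a))
mse_b, mae_b = np.mean(errors_b ** 2), np.mean(np.abs(errors_b))

print(f"A: mse={mse_a}, mae={mae_a}")  # A: mse=4.0, mae=2.0
print(f"B: mse={mse_b}, mae={mae_b}")  # B: mse=6.25, mae=1.25
```

Here mse(A) < mse(B) while mae(A) > mae(B), so the choice comes down to whether rare large errors or typical errors matter more for the application.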

lapis sequoia
#

@velvet thorn there are 5 different train events:

  1. departure of a train from its start station
  2. arrival of a train at a stopover
  3. a passing train
  4. departure of a train at a stopover, and
  5. arrival at its final destination
#

those are coded for example with 1) = 10, 2) = 20, ... 5) = 50 so you can find the specific events for every train and every LINE etc in the dataset... every day has like thousands of logged events... every minute of the day at every station etc.

velvet thorn
#

hm

#

I see

#

that doesn't sound too hard, if I get what you mean

#

basically a join

lapis sequoia
#

i don't think you get me

jolly briar
#

are you able to post the example data @lapis sequoia ?

velvet thorn
#

in general, posting sample data and expected results helps a lot.

lapis sequoia
#

I'm a total beginner and not very used to discord either, so I simply don't know how to post that stuff properly

#

can I msg you @velvet thorn to clarify things?

velvet thorn
#

post here please

lapis sequoia
#

can you load it like that?

#
{'SERVICE_ID': {0: 29664277470, 1: 29664277470, 2: 29664277470, 3: 29664277470, 4: 29664277470}, 'TRAIN_ID': {0: 7087, 1: 7087, 2: 7087, 3: 7087, 4: 7087}, 'STOPSEQUENCE_NO': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'DS100': {0: 'TP', 1: 'TACH', 2: 'TACH', 3: 'TEZL', 4: 'TEZL'}, 'EVENT_TYPE': {0: 10, 1: 20, 2: 40, 3: 20, 4: 40}, 'Actual_Time': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-09-16 13:50:00'), 2: Timestamp('2017-09-16 13:51:00'), 3: Timestamp('2017-09-16 13:52:00'), 4: Timestamp('2017-09-16 13:53:00')}, 'Sched_Time': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-09-16 13:50:00'), 2: Timestamp('2017-09-16 13:50:00'), 3: Timestamp('2017-09-16 13:52:00'), 4: Timestamp('2017-09-16 13:52:00')}, 'LINE': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}, 'START_TIME': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-09-16 13:48:00'), 2: Timestamp('2017-09-16 13:48:00'), 3: Timestamp('2017-09-16 13:48:00'), 4: Timestamp('2017-09-16 13:48:00')}}
velvet thorn
#

with a bit of effort

#

yes

lapis sequoia
#

thanks... I'd do better if I knew how to... I just made a dict and printed it

jolly briar
#
In [111]: from pandas import Timestamp

In [112]: d = {'SERVICE_ID': {0: 29664277470, 1: 29664277470, 2: 29664277470, 3: 29664277470, 4: 29664277470}, 'TRAIN_ID': {0: 708
     ...: 7, 1: 7087, 2: 7087, 3: 7087, 4: 7087}, 'STOPSEQUENCE_NO': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'DS100': {0: 'TP', 1: 'TACH',
     ...:  2: 'TACH', 3: 'TEZL', 4: 'TEZL'}, 'EVENT_TYPE': {0: 10, 1: 20, 2: 40, 3: 20, 4: 40}, 'Actual_Time': {0: Timestamp('2017
     ...: -09-16 13:48:00'), 1: Timestamp('2017-09-16 13:50:00'), 2: Timestamp('2017-09-16 13:51:00'), 3: Timestamp('2017-09-16 13
     ...: :52:00'), 4: Timestamp('2017-09-16 13:53:00')}, 'Sched_Time': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-0
     ...: 9-16 13:50:00'), 2: Timestamp('2017-09-16 13:50:00'), 3: Timestamp('2017-09-16 13:52:00'), 4: Timestamp('2017-09-16 13:5
     ...: 2:00')}, 'LINE': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}, 'START_TIME': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-
     ...: 09-16 13:48:00'), 2: Timestamp('2017-09-16 13:48:00'), 3: Timestamp('2017-09-16 13:48:00'), 4: Timestamp('2017-09-16 13:
     ...: 48:00')}}
     ...:

In [113]: pd.DataFrame.from_dict(d)
Out[113]:
    SERVICE_ID  TRAIN_ID  STOPSEQUENCE_NO DS100  EVENT_TYPE         Actual_Time          Sched_Time  LINE          START_TIME
0  29664277470      7087                1    TP          10 2017-09-16 13:48:00 2017-09-16 13:48:00     1 2017-09-16 13:48:00
1  29664277470      7087                2  TACH          20 2017-09-16 13:50:00 2017-09-16 13:50:00     1 2017-09-16 13:48:00
2  29664277470      7087                3  TACH          40 2017-09-16 13:51:00 2017-09-16 13:50:00     1 2017-09-16 13:48:00
3  29664277470      7087                4  TEZL          20 2017-09-16 13:52:00 2017-09-16 13:52:00     1 2017-09-16 13:48:00
4  29664277470      7087                5  TEZL          40 2017-09-16 13:53:00 2017-09-16 13:52:00     1 2017-09-16 13:48:00
lapis sequoia
#

that looks good

#

thanks mate

#

so basically that's a very reduced dataset... usually there are like 30 more columns and millions of rows

velvet thorn
#

when I said a bit I really meant a very tiny bit

#

like what @jolly briar did

#

that's perfectly fine, don't worry about it

lapis sequoia
#

DS100 column is the abbreviation code for each station, so each "event" is at some station, at some point in time, on a specific LINE etc

jolly briar
#

i only posted that for noobs future reference

lapis sequoia
#

unfortunately the "STOPSEQUENCE_NO" column is not usable to make the directional index, as one and the same line can have a different number of stops e.g. one train goes the full way from A to Z, another only goes from C to K etc. depending on the time of day or weekday or whatever... and also it doesn't differentiate whether the train goes from A to Z or from Z to A (direction).

#

so my plan was to make a list for every LINE (1, 2, 3, ... , 8) that puts a number (1, 2, ...., 28) next to every station-abbreviation.
Like so:

#

[(0, 'TKT'), (1, 'TKTO'), (2, 'TWD'), (3, 'TWER'), ... (24, 'THUB'), (25, 'TEHN'), (26, 'TGT'), (27, 'TNUF'), (28, 'THE')]

jolly briar
#

it might be easier to manually edit a small section in excel as an example of what you want

lapis sequoia
#

mh..

#

not easy to explain at all

#

to someone who isn't familiar with the data and the problems etc

jolly briar
#

then make an example

lapis sequoia
#

don't know how 🤔

jolly briar
#

you put the data into excel and edit it by hand

lapis sequoia
#

if i could program it in excel i could just google how to translate it to python, lol

jolly briar
#

well if you can't do that you've very little hope of explaining it to someone else

timid vortex
#

If I have a numpy.int64 object and I want to iterate over that specific column, how can I go about doing that?
I get an AttributeError when I try to do dataframe.apply(lambda x . . .)

velvet thorn
#

uh.

#

so basically

#

if I understand you correctly

#

you want to convert the values in the last column to 1 if the original value is 2, and 0 otherwise?

#

@timid vortex

timid vortex
#

yea

velvet thorn
#

hm

#

is there a reason

timid vortex
#

the last column doesn't have a label

velvet thorn
#

df.iloc[:, -1] = (df.iloc[:, -1] == 2).astype(int)

#

.columns accesses the column names

#

also, avoid apply if you can

#

IMO it promotes lazy (and inefficient) thinking

timid vortex
#

How should I properly go about this

velvet thorn
#

is that the sklearn cancer dataset?

#

it should have a label...

timid vortex
#

it's the Wisconsin cancer dataset

#

doesn't have labels

velvet thorn
#

breast cancer, yes?

timid vortex
#

yeah

velvet thorn
#

hm

#

that's not right

#

but anyway you can rename the column, so

#

anyway the code I provided should work for you

#

tell me if it doesn't

timid vortex
#

I guess it did...wow

#

Don't understand iloc and astype(int)

#

thank you so much though

#

just for the future, instead of using apply, what should I do instead

#

if I want to change all elements in a column

#

additionally, if I want to change the labels from just being a list of numbers, how could I do that?

jolly briar
#

if it's a single column you can use replace( )

#

i think that's a done thing , maybe there's something better

velvet thorn
#

.iloc is an indexer

#

basically, you can specify which rows and which columns you want, in that order

#

: means all

#

so basically I said - get me all the rows from the last column (because -1)

#

then I compared them elementwise to 2

lapis sequoia
velvet thorn
#

which returns results of either True or False

timid vortex
#

yeah

velvet thorn
#

the last part, .astype, converts True to 1 and False to 0

#

which is the same logic as yours

timid vortex
#

ahhhhhh

#

that's amazing

velvet thorn
#

the reason to avoid apply is that apply is generally just a big for loop, which means you iterate over each value in turn.

#

very quickly, but still one at a time

#

whereas if you do an == comparison, it's vectorised, which basically means that pandas (through numpy) uses certain special instructions in your CPU to perform multiple operations at once

#

tl;dr: apply is slower.
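
A side-by-side sketch of the two approaches, assuming a small made-up frame whose last, unlabeled column holds the class codes 2 and 4:

```python
import pandas as pd

# Hypothetical frame with default integer column labels; the last column is the class.
df = pd.DataFrame([[5, 1, 2], [8, 10, 4], [1, 1, 2]])

# apply: calls the Python lambda once per value.
via_apply = df.iloc[:, -1].apply(lambda value: 1 if value == 2 else 0)

# vectorised: one comparison over the whole column, then True/False -> 1/0.
via_vector = (df.iloc[:, -1] == 2).astype(int)

# Write the result back into the last column.
df.iloc[:, -1] = via_vector
```

Both give the same values; the vectorised version just avoids the per-element Python call.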

timid vortex
#

I'll remember this

#

thank you!

velvet thorn
#

lastly, if you have a finite number of source values

#

look into map.
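
A small sketch of `map` with a lookup dict, plus renaming numeric column labels. The column names are hypothetical; the codes 2 and 4 follow the class column discussed above:

```python
import pandas as pd

# Hypothetical frame whose columns are just 0, 1, 2.
df = pd.DataFrame([[5, 1, 2], [8, 10, 4], [1, 1, 2]])

# Give the integer column labels readable names (names chosen for illustration).
df = df.rename(columns={0: "thickness", 1: "cell_size", 2: "diagnosis"})

# map looks each value up in the dict; values missing from the dict become NaN.
df["is_benign"] = df["diagnosis"].map({2: 1, 4: 0})
```

Because unmapped values turn into NaN, `map` also makes unexpected codes easy to spot.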

jolly briar
#

if i have

2015 : a = 40%
2016 : a = 45%
2018 : a = 44%

what would an uplift model look like for predicting this years percentage?

lapis sequoia
#

Do you get it now? @velvet thorn @jolly briar
@lapis sequoia so i want to do 2 things. First write that GREY column on the far right. I can't just simply give any DS100 abbreviation a unique number, it has to be line specific. LINE 1 can have a 1st station, and so can LINE 2, ..., LINE X. The 1st station will always have a "1" in that column for every LINE. But a train can also start at the 28th station and go to 5th or the 1st (backwards direction).

#

The excel screenshot should give an idea

#

the second problem would be to code the function right below the table in the screenshot.

#

If df.LINE_STATION_NO[df.EVENT_TYPE == 10] < df.LINE_STATION_NO[df.EVENT_TYPE == 50], then the train, for example, starts at station 5 of that LINE and maybe goes to station 20. Because 5 < 20, the direction is then defined as +1. However, if it were going from station 20 to station 5, the directional index would be -1 because the train is going backwards.
Why the numbers 5 and 20 in the example? Because not every train is serving all the stations from 1 to 28. Some only serve sections in between.
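
One way the "basically a join" idea could look, as a sketch under assumptions: the lookup table `station_order`, its station numbering, and the `DIRECTION` column name are all hypothetical, and the events are a made-up miniature of the sample data (two services on LINE 1 running in opposite directions):

```python
import numpy as np
import pandas as pd

# Hypothetical events: service 1 runs TP -> TEZL, service 2 runs the other way.
events = pd.DataFrame({
    "SERVICE_ID": [1, 1, 1, 2, 2, 2],
    "LINE": [1, 1, 1, 1, 1, 1],
    "DS100": ["TP", "TACH", "TEZL", "TEZL", "TACH", "TP"],
    "EVENT_TYPE": [10, 20, 50, 10, 20, 50],
})

# Hypothetical lookup: one row per (LINE, station) with its position on that line.
station_order = pd.DataFrame({
    "LINE": [1, 1, 1],
    "DS100": ["TP", "TACH", "TEZL"],
    "LINE_STATION_NO": [1, 2, 3],
})

# The number depends on both LINE and DS100, so merge on both keys.
events = events.merge(station_order, on=["LINE", "DS100"], how="left")

# Station number at first departure (10) and final arrival (50), indexed by service.
start = events.loc[events["EVENT_TYPE"] == 10].set_index("SERVICE_ID")["LINE_STATION_NO"]
end = events.loc[events["EVENT_TYPE"] == 50].set_index("SERVICE_ID")["LINE_STATION_NO"]

# +1 if station numbers increase along the trip, -1 if they decrease.
direction = np.sign(end - start)
events["DIRECTION"] = events["SERVICE_ID"].map(direction)
```

Because only the start and end stations are compared, trains that serve just a section of the line are handled the same way. A service missing either a 10 or a 50 event would end up with NaN and need separate treatment.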

crystal sluice
#

guys, is it really that hard to configure git on vscode?

#

i've been struggling for like 2 hours

#

i have my github account, installed 3 hundred thousand extensions on vscode and i'm not having success

lapis sequoia
#

Hi everyone
fairly simple question here
I'm trying to create a graph to show the univariate distribution of my training data (the target values)
how can I do this effectively?
I've tried doing sns.distplot(y, hist=False, rug=True), but the graphs before and after oversampling+undersampling remain the same. In other words, it doesn't seem to properly represent my dataset
also, the target values are continuous

shadow quiver
#

Does anyone have a simple explanation of what a graph in TensorFlow means?

lapis sequoia
#

if you dont need tensorflow as a hard requirement.. I would suggest you drop it and move on..

#

really hard to accept.. but I wish I had done that a year ago.. it's really a waste of time because you can't iterate and scale as fast as you can on other frameworks

velvet thorn
#

@shadow quiver a graph is basically a way to represent the flow of data through mathematical operations.

lapis sequoia
#

Pandas groupby example: df.groupby('points').points.count(). Here "df" has 17 columns. When you group by the "points" column using groupby(), what happens to the rest of the columns? Where do they exist?

#

I know groupby() does not change the original dataset; it operates on a copy. What does that look like: a mashup of 2 columns while the other 15 do not change?

velvet thorn
#

no

#

I think

#

you are focusing too much on the idea of the groupby being something concrete

#

think of it as an incomplete instruction.

#

okay, for example, if I tell you "go by car", the very natural question you would ask is "go where?"

#

what that groupby does, conceptually, is separate df into a number of dataframes, and in each dataframe the values of points are all the same.

#

however, because this is an expensive operation, when you just execute df.groupby('points'), all that happens is that pandas stores your instruction for later execution

#

because how exactly the groupby is performed will depend on what you want to do with it.
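
A small sketch of that description, with a made-up three-column frame standing in for the 17-column one. It shows that `groupby` alone only returns a grouping object, and that each conceptual piece still carries every column:

```python
import pandas as pd

# Hypothetical stand-in for the real frame.
df = pd.DataFrame({
    "points": [85, 85, 90, 90, 90],
    "country": ["IT", "FR", "IT", "US", "FR"],
    "price": [10.0, 12.0, 30.0, 25.0, 28.0],
})

grouped = df.groupby("points")  # no aggregation has happened yet
print(type(grouped).__name__)   # DataFrameGroupBy

# Iterating shows the conceptual split: one sub-frame per distinct points value.
pieces = {key: sub for key, sub in grouped}

# Selecting a column and aggregating is what finally produces a result.
counts = grouped["points"].count()
```

The other columns are not dropped by the grouping itself; they only disappear from the result because `.points` selects a single column before `count()` runs.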

lapis sequoia
#

hmmm ... conceptually, is separate 'df' into a number of dataframes, and in each dataframe the values of points are all the same

#

this is good

lapis sequoia
#

dataframe.groupby().count() returns -- "Count of values within each group"
dataframe.groupby().size() returns -- "Number of rows in each group"

What's the difference between these 2?

velvet thorn
#

count ignores nulls, size doesn't @lapis sequoia
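
A tiny made-up example of that difference, with one missing price in the first group:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row has no price.
df = pd.DataFrame({
    "points": [85, 85, 90, 90],
    "price": [10.0, np.nan, 30.0, 25.0],
})

sizes = df.groupby("points").size()             # rows per group, NaN or not
counts = df.groupby("points")["price"].count()  # non-null prices per group

print(sizes)
print(counts)
```

For the 85 group, `size` reports 2 rows while `count` reports 1 non-null price.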

lapis sequoia
#

See you tomorrow @velvet thorn .. good night, will spend some time with Dale Carnegie's book

oblique belfry
#

I dunno if this is the best place for this question, but....

How would you normalize an audio waveform? I am working on an audio classification problem. I know normalizing data is a good practice, but I am not sure if one should do it for waveforms.

plain turret
#

Audio normalization is the application of a constant amount of gain to an audio recording to bring the amplitude to a target level (the norm). Because the same amount of gain is applied across the entire recording, the signal-to-noise ratio and relative dynamics are unchanged...

#

This ?

#

Or removing noise ?

oblique belfry
#

That.

#

I just want the amplitudes to be consistent among samples.
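
A minimal sketch of peak normalization with NumPy, matching the "constant gain" description above. The function name `peak_normalize`, the target peak of 1.0, and the sample arrays are assumptions for illustration; scaling to a common RMS level would be another option:

```python
import numpy as np

def peak_normalize(waveform: np.ndarray, target_peak: float = 1.0) -> np.ndarray:
    """Scale the whole waveform by one constant so its largest |sample| equals target_peak."""
    peak = np.max(np.abs(waveform))
    if peak == 0:
        # Pure silence: nothing to scale, avoid dividing by zero.
        return waveform.astype(float)
    return waveform * (target_peak / peak)

# Hypothetical recordings at very different levels.
quiet = np.array([0.0, 0.05, -0.1, 0.02])
loud = np.array([0.3, -0.8, 0.5])

quiet_norm = peak_normalize(quiet)
loud_norm = peak_normalize(loud)
```

Since every sample is multiplied by the same factor, the shape of each waveform is preserved and only its overall level changes.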

alpine stream
#

Hi guys! I have a question.
I have conversations between a customer and an agent (without punctuation). They contain phrases from several categories of promises that an agent gave to a customer (call back, make an appointment, etc.). These have been labeled manually. Altogether there are 12 categories. Now I'm thinking of creating an algorithm for this, and I am thinking of doing the task in two steps.

  1. In the first step, I need to create an algorithm that can find an end and a beginning of all promises. This algorithm has to insert a start tag and an end tag.
  2. The second step is to create a classifier that would label a promise to the necessary categories.

As I understand, the second step is well known and is called text classification. But for the first step, I could not find any articles or GitHub repositories. I think it is an important NLP task, so there must be information on it. Are there maybe approaches that solve both steps at the same time?

proud iron
#

Guys, how can one make his own speech recognition model and train it well on multiple languages? The point of that is to avoid Google's API which has a file size limit. 🙂

oblique belfry
#

@proud iron I read a few papers showing how transformer networks like BERT and GPT-2 worked well in translation scenarios. Might want to start there. This isn't my expertise though so...def want to read up more on that.

austere oar
#

Question: How do I return a javascript object from a python function (after scraping some data from different websites) then putting them back together (in HTML)

oblique belfry
#

I think returning JSON would be the easiest.

#

What's the use case? Like...a Flask app and some JS front end?

austere oar
#

yeah it's a Flask App

#

Use MongoDB with Flask templating to create a new HTML page that displays all of the information that was scraped from the URLs above.

Start by converting your Jupyter notebook into a Python script called scrape_mars.py with a function called scrape that will execute all of your scraping code from above and return one Python dictionary containing all of the scraped data.

Next, create a route called /scrape that will import your scrape_mars.py script and call your scrape function.
    Store the return value in Mongo as a Python dictionary.

Create a root route / that will query your Mongo database and pass the mars data into an HTML template to display the data.

Create a template HTML file called index.html that will take the mars data dictionary and display all of the data in the appropriate HTML elements. Use the following as a guide for what the final product should look like, but feel free to create your own design.
oblique belfry
#

return one Python dictionary containing all of the scraped data
Just means a JSON object.
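
Strictly speaking a Python dict is not JSON itself, but a dict of plain values serializes to JSON directly. A standard-library sketch, where `scrape` is a hypothetical stand-in for the real scraper and all keys and values are placeholders:

```python
import json

def scrape():
    """Hypothetical stand-in for the real scraper: returns one plain dict."""
    return {
        "news_title": "Example headline",
        "featured_image_url": "https://example.com/image.jpg",
        "facts": {"label": "placeholder", "value": 123},
    }

mars_data = scrape()

# A dict of strings, numbers, lists and nested dicts converts straight to a JSON string...
payload = json.dumps(mars_data)

# ...and back again, which is what a JavaScript front end would parse.
round_tripped = json.loads(payload)
```

The same dict shape is also what a template or a document store would typically receive, so the scraper itself does not need to know about JSON.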

upbeat jetty
#

Semi-repost from career channel. What are the essential skills to break into healthcare/pharma data science? Data scientist positions i've seen usually revolve around economics - banking, marketing, etc.

austere oar
#

Ah okay it's JSON, thankfully

oblique belfry
lapis sequoia
#

Can somebody give me a quick tip on how to write columns by checking if-then conditions?
Like: If "HourOfDay" >= 6 and <= 9, then write "NewColumn"=1, otherwise 0.

#

maybe @velvet thorn ?

#

what are you trying to do

#

and what do you mean write columns..

lapis sequoia
#

@lapis sequoia I'm working on a dataset, currently adding features for the predictive regression. I want to add multiple columns with dummy variables

velvet thorn
#

hm

lapis sequoia
#

in this case it's going to be a "morning peak" dummy variable (I'm working on delay prediction)

velvet thorn
#

assuming the column is called HourOfDay (bad practice IMO, should be snake case)

#

the simplest way to do it is df['new_column'] = ((df['hour_of_day'] >= 6) & (df['hour_of_day'] <= 9)).astype(int)
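
A runnable sketch of that one-liner on made-up hours, next to an equivalent spelling with `Series.between`, which is inclusive on both ends by default. The column name `morning_peak` is an assumption:

```python
import pandas as pd

# Hypothetical hours of the day.
df = pd.DataFrame({"hour_of_day": [5, 6, 8, 9, 10, 17]})

# Explicit comparison: each condition is parenthesised before combining with &.
explicit = ((df["hour_of_day"] >= 6) & (df["hour_of_day"] <= 9)).astype(int)

# Equivalent: between(6, 9) includes both 6 and 9.
df["morning_peak"] = df["hour_of_day"].between(6, 9).astype(int)
```

Further dummy columns (an evening peak, for instance) would follow the same pattern with different bounds.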

lapis sequoia
#

you can do that together..