#data-science-and-ml | Python | Page 215

worn stratus Jan 15, 2020, 1:39 PM

#

Someone in the pins recommends a different course

#

But studying it academically is probably for the best

acoustic scaffold Jan 15, 2020, 2:43 PM

#

@plain jungle AI/ML is a very broad and rapidly advancing field. Don't worry if you do not find a clear starting point.

#

I started by initially doing some digital image processing which led to machine vision.

plain jungle Jan 15, 2020, 3:02 PM

#

Thank you

acoustic scaffold Jan 15, 2020, 3:16 PM

#

Andrew Ng's course is good. It starts from the ground up, which means lots of mathematics that might put some people off. You can do machine learning even without comprehending all of the mathematics.

#

@plain jungle This video might help you at some point https://www.youtube.com/watch?v=FmpDIaiMIeA

YouTube

Brandon Rohrer

How Convolutional Neural Networks work

Find the rest of the How Neural Networks Work video series in this free online course:
https://end-to-end-machine-learning.teachable.com/p/how-deep-neural-networks-work

A gentle guided tour of Convolutional Neural Networks. Come lift the curtain and see how the magic is don...

stray spade Jan 15, 2020, 3:18 PM

#

Hi guys
I need to run a test with ResNet154. As far as I know, it is too time-consuming to train it, especially on a PC.
My question: is there a pretrained ResNet I can run on my dataset? If yes, how long would it take?
I have a face dataset with 1999 portrait images.

acoustic scaffold Jan 15, 2020, 3:19 PM

#

Are you using Tensorflow or Pytorch (or something else)?

stray spade Jan 15, 2020, 3:20 PM

#

@acoustic scaffold Tensorflow

acoustic scaffold Jan 15, 2020, 3:21 PM

#

If I recall correctly, there were pretrained ResNets for TensorFlow 1.12 a year ago

stray spade Jan 15, 2020, 3:25 PM

#

Do you have a link to download it?
And do you have any idea how long it takes to run with 1990 images?
I want to get results as fast as possible, to submit my thesis

acoustic scaffold Jan 15, 2020, 3:30 PM

#

Here might be clues https://github.com/tensorflow/models/tree/master/official/r1/resnet

GitHub

tensorflow/models

Models and examples built with TensorFlow. Contribute to tensorflow/models development by creating an account on GitHub.

#

https://github.com/tensorflow/models/tree/master/research/slim#Pretrained

stray spade Jan 15, 2020, 3:38 PM

#

@acoustic scaffold thank you

acoustic scaffold Jan 15, 2020, 3:38 PM

#

No problem

jolly briar Jan 15, 2020, 4:34 PM

#

I've a vector of postcodes that I want to convert to the geographical regions they're within, so I want to lower their resolution.

What's the best way to go about this? not sure if there's a google maps approach or something

#

I'm sure there's a google phrase i'm missing to get info on this... given postcode I want to get region 🤔

worn stratus Jan 15, 2020, 5:03 PM

#

Whereabouts are the postcodes? Global?

jolly briar Jan 15, 2020, 5:04 PM

#

european

worn stratus Jan 15, 2020, 5:04 PM

#

https://postcodes.io/ does exactly what you're looking for in the UK at least - there's probably something similar for europe as a whole

Postcodes.io

Free Postcode API for Addresses in Great Britain

jolly briar Jan 15, 2020, 5:05 PM

#

not uk, can't find eu

#

as in - i'm not looking at the uk atm

worn stratus Jan 15, 2020, 5:06 PM

#

~~https://getaddress.io/ this looks like it might do EU. But a web API like this is definitely what you're looking for.~~ is uk

You might have to do something a bit awkward like getting lat/long from one api, then using another api to look up the region

getAddress() - A simple postcode lookup API

getAddress.io is a simple JSON API to lookup UK postal addresses by postcode.

jolly briar Jan 15, 2020, 5:08 PM

#

this seems to want a house number as well, i don't have that information, just postcode

#

this is uk as well i think 🤔

worn stratus Jan 15, 2020, 5:09 PM

#

You can request on that site without a house number. But yeah, it is UK. https://developers.google.com/maps/documentation/geocoding/intro should work. Even if it doesn't, searching for "postcode lookup api" comes up with a bunch of different stuff

Google Developers

Developer Guide | Geocoding API | Google Developers

Geocoding converts addresses into geographic coordinates to be placed on a map. Reverse Geocoding finds an address based on geographic coordinates or place IDs.

jolly briar Jan 15, 2020, 5:10 PM

#

looking at geocoding atm

#

gah, lookup of italian postcodes just returns american stuff 😦

worn stratus Jan 15, 2020, 5:14 PM

#

My guess is you can add a param for country into the address info
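
A minimal sketch of the country restriction guessed at above, assuming the Geocoding API's `components` filter (`postal_code:` and `country:`) is the parameter in question. The helper name `build_geocode_url`, the sample postcode and `YOUR_API_KEY` are placeholders, and the sketch only builds the request URL without calling the service.

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(postcode, country_code, api_key):
    """Build a Geocoding API request URL restricted to one country."""
    params = {
        # Component filters are joined with "|"; country is a two-letter code.
        "components": f"postal_code:{postcode}|country:{country_code}",
        "key": api_key,
    }
    return f"{GEOCODE_ENDPOINT}?{urlencode(params)}"


# Placeholder postcode and key, for illustration only.
url = build_geocode_url("00184", "IT", "YOUR_API_KEY")
print(url)
```

Restricting by country this way should stop an Italian postcode from matching a US one with the same digits; the region would then have to be read from the address components of the JSON response.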

jolly briar Jan 15, 2020, 5:14 PM

#

Yeah just reading through the docs atm

uncut shadow Jan 15, 2020, 9:59 PM

#

Hey. I have another question. What skills do you think are required to master (or at least learn some) machine learning, apart from knowledge of the programming language you're going to use? Also, do you know any good book/course/tutorial etc. to learn the math required for machine learning?

#

Without any libs like Tensorflow, sklearn etc.

polar acorn Jan 15, 2020, 10:02 PM

#

Have you looked at the pinned messages in this channel? I think the r/LearnMachineLearning wiki might have what you're looking for.

oblique belfry Jan 15, 2020, 10:11 PM

#

Grit and perseverance.

jolly briar Jan 16, 2020, 12:05 AM

#

df['x'].astype <doesn't work
df.x.astype    <does work

why is that?

lapis sequoia Jan 16, 2020, 12:43 AM

#

why is what

#

as type what.. did you try

#

did you pass an argument and check the output

#

@jolly briar

jolly briar Jan 16, 2020, 12:51 AM

#

Yes

#

One worked, the other didn't. I thought both these indexing methods were analogous

lapis sequoia Jan 16, 2020, 1:00 AM

#

they are

#

please show code

#

you can check df['x'] == df.x
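
The original code is gone, so the following does not reproduce the failure. It is only a small sketch, on a made-up frame, of how the two access styles can be compared and of one known case where they differ: a column whose name clashes with a DataFrame attribute.

```python
import pandas as pd

# Made-up data; the column names are illustrative.
df = pd.DataFrame({"x": ["1", "2", "3"], "y": [0.5, 1.5, 2.5]})

# Bracket and attribute access return the same column here.
same_column = df["x"].equals(df.x)

# astype is a method: it must be called, and it returns a new Series.
as_int_bracket = df["x"].astype(int)
as_int_attr = df.x.astype(int)

# Attribute access stops working when the name clashes with an existing
# DataFrame attribute or method, e.g. a column called "shape".
clash = pd.DataFrame({"shape": ["1", "2"]})
bracket_ok = clash["shape"].astype(int)  # the column, converted
attr_is_tuple = clash.shape              # the frame's dimensions, not the column
```

Comparing with `.equals` gives a single True/False, whereas `df['x'] == df.x` returns an element-wise Series.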

jolly briar Jan 16, 2020, 1:05 AM

#

was just with int, code's gone now unfortunately

velvet thorn Jan 16, 2020, 1:50 AM

#

typo somewhere?

jolly briar Jan 16, 2020, 2:07 AM

#

hrm, maybe... i'm not sure now the code has gone... thought I'd bumped into some kind of df[x] df[[x]] thing ala R, all good

velvet thorn Jan 16, 2020, 2:10 AM

#

I'm like 95% certain it was a typo or something

jolly briar Jan 16, 2020, 2:48 AM

#

i think if it's between me and pandas being wrong i'm willing to raise my hand 😅

lapis sequoia Jan 16, 2020, 6:35 AM

#

it's very quiet in here today

#

Im bored.. someone ask something

drowsy ingot Jan 16, 2020, 9:27 AM

#

anyone using Gym-retro?

worn stratus Jan 16, 2020, 10:34 AM

#

What's a good way to get started with computer vision stuff?

jolly briar Jan 16, 2020, 11:32 AM

#

i have many csv files with different separators, i want to convert them to all be comma separated... is there a straightforward approach to this?
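
One possible approach, sketched with the standard library only: let `csv.Sniffer` guess each file's delimiter and rewrite the rows with commas. The function name `to_comma_separated` and the candidate delimiter list are assumptions for illustration, and the sniffer can guess wrong on small or irregular files, so the output is worth spot-checking.

```python
import csv
import io


def to_comma_separated(text, candidates=",;\t|"):
    """Rewrite delimited text with commas, guessing the input delimiter."""
    # Restricting the candidates makes the guess less likely to land on a letter.
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
    rows = csv.reader(io.StringIO(text), dialect)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Illustrative inputs standing in for file contents.
semicolon_text = "name;city\nAda;Rome\nBob;Oslo\n"
tab_text = "name\tcity\nAda\tRome\nBob\tOslo\n"

print(to_comma_separated(semicolon_text))
print(to_comma_separated(tab_text))
```

For real files the same function could be applied to the text read from each path, writing the result back out with a `.csv` suffix.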

misty lake Jan 16, 2020, 1:51 PM

#

Hi All,

Has anyone worked on NLP, information retrieval, and building a search engine?
Any YT links or other references for building an intelligent search engine based on data in a local DB are appreciated.

jagged raven Jan 16, 2020, 1:51 PM

#

Is there a solution for this thing..?

📎 unknown.png

#

The download is stuck here.

misty lake Jan 16, 2020, 1:51 PM

#

Pls tag me when you answer. Thanks

#

@jagged raven the screenshot seems to show it is still downloading. What is the issue? Maybe you can try with sudo and pip3

jagged raven Jan 16, 2020, 1:54 PM

#

It's stuck there for a couple of minutes now.

rain flare Jan 16, 2020, 6:20 PM

#

Any insights on how to extract melody from a song using python?

austere oar Jan 16, 2020, 8:34 PM

#

https://github.com/justinsalamon/audio_to_midi_melodia

GitHub

justinsalamon/audio_to_midi_melodia

Extract the melody from an audio file and export to MIDI - justinsalamon/audio_to_midi_melodia

#

Hello I have a question
How do you extract an image link:
https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars -> https://astrogeology.usgs.gov/search/map/Mars/Viking/cerberus_enhanced

Astropedia Search Results | USGS Astrogeology Science Center

USGS Astrogeology Science Center Astropedia search results.

Cerberus Hemisphere Enhanced | USGS Astrogeology Science Center

Mosaic of the Cerberus hemisphere of Mars projected into point perspective, a view similar to that which one would see from…

#

and that would also mean going to every other search result on the first link

frail flower Jan 16, 2020, 11:11 PM

#

Fun data science story.

This week I was at the 100th Annual Meeting of the American Meteorological Society. Whilst at one of the Python symposia, we had just been introduced to the SPORK project, which uses machine learning to track supercell thunderstorms (and can potentially be used to predict tornadoes). I mentioned I wanted to use it in conjunction with a library called Py-ART for operational forecasting purposes, and a friend next to me started complaining at length about how slow Py-ART is.

Anyway, the lead developer of Py-ART was sitting right next to him on the opposite side.

#

Awkward...

deft harbor Jan 16, 2020, 11:13 PM

#

Did he say anything?

frail flower Jan 16, 2020, 11:13 PM

#

Oh boy did he!

#

He was also leading the panel on which said friend was presenting his research!

#

After some serious backpedaling he invited us to a data science reception with the other representatives from Argonne.

#

Oh, and on my other side one of the matplotlib devs was sitting there giggling at Dave’s sudden realization that you do not trash talk popular python weather data science libraries at a python weather data science symposium.

#

https://ams.confex.com/ams/2020Annual/webprogram/10PYTHON.html link to the symposium for the curious

jolly briar Jan 16, 2020, 11:38 PM

#

has anyone ever done predictive modelling just using uplift?
idk if it's even classed as predictive modelling... perhaps forecast is a better term

jolly briar Jan 17, 2020, 12:32 AM

#

Something that I wish was possible -
an excel sheet with the dataframe I'm working on that updated live

jolly briar Jan 17, 2020, 12:48 AM

#

how to convert a group of values into the percentages

so if I have a dataframe and there are groups G=a,b,c,d, so I .groupby('G'). Within a I have a1 = 15, a2=80, a3=90, so that after the group by operation i want to have values of a1=0.08, a2 = 0.43, a3=0.48, and similarly for the other groups

#

i just did groupby().sum() then merged that output with the original, then computed percentage from there
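
The sum-then-merge approach works; a shorter alternative is `transform`, which returns each group's total aligned to the original rows. A minimal sketch, where the column names `value` and `share` and the rows for group `b` are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "G": ["a", "a", "a", "b", "b"],
    "value": [15, 80, 90, 10, 30],
})

# transform("sum") broadcasts the group total back to every row,
# so no separate merge step is needed.
df["share"] = df["value"] / df.groupby("G")["value"].transform("sum")
print(df.round(2))
```

Within each group the shares add up to 1, which is a quick sanity check on the result.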

oblique belfry Jan 17, 2020, 1:32 AM

#

@frail flower that is hilarious.

gilded dagger Jan 17, 2020, 7:13 AM

#

Hello everybody, I'm having some trouble with gspread_pandas

#

📎 Screenshot_2020-01-17_at_16.12.33.png

#

I can open the spread, I have read permissions, I have the right values for row 1 if I directly look at them, but trying to make into a DF... doesn't work

#

Any clue? Not finding anything on google or related.

#

It actually works on other spreadsheet, but for this one (where I have only read access) it doesn't work

urban silo Jan 17, 2020, 1:05 PM

#

I have a pandas question

I have 3 dataframes (channel, video, comment)
Column mapping is:
channel.channelId = video.channelId = comment.channelId
video.videoId = comment.videoId

I need to get a subset of each dataframe.

Only channels that have a video and a comment
Only videos that have a channel and a comment
Only comments that have a channel and a video

I tried it with a double merge + inner join like

total_channels = total_channels.merge(total_videos, on='channelId')
.merge(total_comments, left_on=['channelId', 'videoId'], right_on=['videoId', 'channelId'])

But that only gives an empty dataframe with all columns from all 3 dataframes instead of only a channel subset that matches the requirements (at least 1 video and 1 comment)

I can't set a PK/FK when writing to SQL in pandas so my SQL solution takes ages, that's why I need to do it directly in Python/pandas to speed stuff up.

How can I achieve that?

paper niche Jan 17, 2020, 2:13 PM

#

@urban silo the order that you specify the join columns matter. you set left to be channelId and videoId, but right to be videoId and channelId, so pandas will try to join left.channelId with right.videoId

#

and, if you just want the channels' columns, I'd probably just go for total_videos[['channelId']] and total_comments[['channelId','videoId']] in the merge arguments directly.
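
A sketch of one way to get the three subsets without widening any frame: build the set of valid (channelId, videoId) pairs once, then filter each frame against it. The tiny frames below are made up, and the variable names are illustrative rather than taken from the original code.

```python
import pandas as pd

channels = pd.DataFrame({"channelId": ["c1", "c2", "c3"]})
videos = pd.DataFrame({
    "channelId": ["c1", "c1", "c2"],
    "videoId": ["v1", "v2", "v3"],
})
comments = pd.DataFrame({
    "channelId": ["c1", "c1"],
    "videoId": ["v1", "v1"],
    "text": ["first", "second"],
})

keys = ["channelId", "videoId"]

# Pairs that have a video, at least one comment, and a known channel.
# The key lists are in the same order on both sides of each merge.
valid_pairs = (
    videos[keys].drop_duplicates()
    .merge(comments[keys].drop_duplicates(), on=keys)
    .merge(channels[["channelId"]].drop_duplicates(), on="channelId")
)

# Filter each frame; valid_pairs is unique, so no rows get duplicated.
channels_sub = channels[channels["channelId"].isin(valid_pairs["channelId"])]
videos_sub = videos.merge(valid_pairs, on=keys)
comments_sub = comments.merge(valid_pairs, on=keys)
```

Each result keeps only its own columns, which avoids the wide frame produced by chaining merges on the full tables.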

jolly briar Jan 17, 2020, 2:15 PM

#

does anyone work with python+R? Or maybe another mixture ( i just use python+R though).
I'm wondering if you have a set way of arranging / organising your projects, code/docs etc

lapis sequoia Jan 17, 2020, 6:52 PM

#

How in the world are those avatars you see if you go forward about 5 min made? https://youtu.be/UwsrzCVZAb8

YouTube

YouTube Originals

How Far is Too Far? | The Age of A.I.

Can A.I. make music? Can it feel excitement and fear? Is it alive? Will.i.am and Mark Sagar push the limits of what a machine can do. How far is too far, and how much further can we go?

The Age of A.I. is a 8 part documentary series hosted by Robert Downey Jr. covering the w...

#

It’s craaaazzyyy

#

It has to be some kind of 3D program but how do they take and interact and move?

#

Really cool stuff

unkempt delta Jan 17, 2020, 6:59 PM

#

when I open up Jupyter Notebook it shows all the files saved on my C drive. is there any way to clean this up a bit? If I partition my hard drive and have it open up in the partition, will it still be able to import python packages? I'm using anaconda btw

📎 unknown.png

thorny ocean Jan 17, 2020, 7:21 PM

#

hey

#

someone for little help in numpy?

worn stratus Jan 17, 2020, 7:23 PM

#

don't ask to ask

#

just ask the question

thorny ocean Jan 17, 2020, 7:34 PM

#

i have a 3d binary matrix (M). i want to create a function that, given an axis (x, y, z), reduces the matrix along that axis with a logical OR. for example, if i choose the x axis, my output would be a 2d matrix (m) where m[x,y] == True means that there exists an X value such that M[X,x,y] == True
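
This kind of reduction is what `np.any` with an `axis` argument does. A minimal sketch, where the wrapper name `or_reduce` and the sample array are illustrative:

```python
import numpy as np


def or_reduce(volume, axis):
    """Collapse a 3D boolean array along `axis` with a logical OR."""
    return np.any(volume, axis=axis)


# Illustrative volume with a single True cell.
M = np.zeros((2, 3, 4), dtype=bool)
M[1, 2, 0] = True

# Reducing axis 0 leaves a (3, 4) array; a cell is True if any
# slice along axis 0 had True at that position.
m = or_reduce(M, axis=0)
```

`np.logical_or.reduce(M, axis=0)` should give the same result for boolean input.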

unkempt delta Jan 17, 2020, 8:20 PM

#

nvm figured it out , just made a partition

thorny ocean Jan 17, 2020, 8:21 PM

#

another question in numpy:
i want to have all 3 digits numbers containing "0,1,2,3"

#

like "000, 001, 002,003,010...333"
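
A short sketch of two ways to enumerate these: `itertools.product` for the strings, and `np.indices` for an integer array with one row per combination. The variable names are illustrative.

```python
from itertools import product

import numpy as np

# All length-3 strings over the digits 0-3, in lexicographic order.
codes = ["".join(p) for p in product("0123", repeat=3)]

# The same combinations as a (64, 3) integer array, one row per number.
grid = np.indices((4, 4, 4)).reshape(3, -1).T

print(len(codes), codes[:5], codes[-1])
```

There are 4 ** 3 = 64 combinations in total.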

strange stag Jan 17, 2020, 9:58 PM

#

from this csv https://pastebin.com/RnF5rpXQ ive got this data in a pandas groupby object, and im trying to find the min/max price with the associated location, however .agg is giving me whacked results... trying to figure out why
(upon request, more csv data will be given, thus giving reason to groupby)

df2.agg({'price': ['max','min']}).reset_index()

https://pastebin.com/2xS53YTX
as you can see the very first upc is mismatched with max and min

Pastebin

013964765816,69.99,Amazon.com,https://www.amazon.com/gp/offer-list...

Pastebin

>>> df2.agg({'price': ['max','min']}) price ...

#

the example i have given is the fourth upc

#

ending in 816

#

min should be 59.99 and max should be 127.00

lapis sequoia Jan 17, 2020, 11:54 PM

#

it's difficult to see formatting here.. can you show the groupby dataframe another way

#

@strange stag

strange stag Jan 18, 2020, 12:06 AM

#

📎 unknown.png

#

@lapis sequoia

lapis sequoia Jan 18, 2020, 12:08 AM

#

what is the min max on

strange stag Jan 18, 2020, 12:08 AM

#

price, or so i hope

lapis sequoia Jan 18, 2020, 12:08 AM

#

I think you're applying it wrong

#

let me check

strange stag Jan 18, 2020, 12:08 AM

#

ye i think ur right, seems the upc is the min/max

#

cause upcs are ascending

lapis sequoia Jan 18, 2020, 12:09 AM

#

what is the group by on?

strange stag Jan 18, 2020, 12:10 AM

#

upc

lapis sequoia Jan 18, 2020, 12:10 AM

#

df.groupby('upc').agg({'price': ['min', 'max']}) then?

strange stag Jan 18, 2020, 12:11 AM

#

📎 unknown.png

#

same as i have now, yes

#

for w/e reason that seems to work slightly better

#

first result seems off

#

4th is still technically wrong, but idky

#

those values shouldnt be there at all

#

ill post full csv, sec

#

https://pastebin.com/dQRYRzy4

Pastebin

upc,price,location,url 008888359036,9.99,Best Buy,https://bestbuy...

#

@lapis sequoia

lapis sequoia Jan 18, 2020, 12:15 AM

#

what is the upc

strange stag Jan 18, 2020, 12:16 AM

#

wdym

lapis sequoia Jan 18, 2020, 12:16 AM

#

is it really common across these merchants

strange stag Jan 18, 2020, 12:16 AM

#

yes

lapis sequoia Jan 18, 2020, 12:16 AM

#

ok.. lemme think

strange stag Jan 18, 2020, 12:17 AM

#

would moving the upc to an index, or making it a string help?

lapis sequoia Jan 18, 2020, 12:17 AM

#

what dtype is it now

strange stag Jan 18, 2020, 12:18 AM

#

also, if you look at the upc 013964765816, the max is 127.00 and the min is 59.99, which is odd

#

sec

#

object

lapis sequoia Jan 18, 2020, 12:18 AM

#

I think you should set the dtypes for these columns.. then do the groupby and aggregation

#

it'll work better

#

set upc to int.. and the price to float
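
A sketch of why the dtypes matter, using made-up rows rather than the pastebin data. Prices held as text compare character by character, so min and max come out wrong until they are converted. One caveat, stated as an assumption about the data: if the upcs carry leading zeros, casting them to int drops those zeros, so this sketch keeps upc as text and converts only the price.

```python
import io

import pandas as pd

# Illustrative rows; the upcs and locations are placeholders.
csv_text = """upc,price,location
000111,69.99,Shop A
000111,127.00,Shop B
000111,59.99,Shop C
000222,9.99,Shop A
000222,14.50,Shop B
"""

# dtype=str mimics the all-text columns described above.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Text comparison is misleading: "1" sorts before "5".
text_order_is_misleading = "127.00" < "59.99"

# Convert price to numbers; upc stays text so leading zeros survive.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

summary = df.groupby("upc")["price"].agg(["min", "max"])
print(summary)
```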

strange stag Jan 18, 2020, 12:19 AM

#

ye, they're all objects

#

ight, ill try that

#

@lapis sequoia tyvm!!!! Been trying for hours to figure out what i was doing wrong!!! WOOO tyvm!!!! very nice to see that the data is in a working condition 😄

lapis sequoia Jan 18, 2020, 12:32 AM

#

np.. always here

strange stag Jan 18, 2020, 12:37 AM

#

@lapis sequoia do have one more operation that i hope you could help me with...

#

so i need to drop rows where amazon's price is lower than the other prices (associated with the same upc)

#

im using this to remove no margins, possibly something similar for this other operation?

counts = df['upc'].value_counts()
df = df[~df['upc'].isin(counts[counts < 2].index)]

#

also, i think this is kinda weird df.groupby('upc').agg({'price': ['min', 'max']})
giving me url min/max, and upc min/max

#

📎 unknown.png

lapis sequoia Jan 18, 2020, 12:48 AM

#

try: df.groupby('upc').price.agg(['min', 'max'])

#

I dont understand your other question

#

drop what now?

strange stag Jan 18, 2020, 12:48 AM

#

so with that last code you just posted, i still need those other columns

#

cause i need to drop the rows that price_min is associated with the location "Amazon.com"

#

min correlates to a location, and max may correlate to another location

#

if min correlates to amazon, i need to drop the row

#

or in other words, if amazons price for the upc is lower than the other suppliers, i need to drop the row/upc

#

lower than ALL other suppliers*

lapis sequoia Jan 18, 2020, 12:54 AM

#

hmm.. an easy way to do that would be, for each upc finding row indexes where the row meets your condition.. then dropping multiple rows together by index

strange stag Jan 18, 2020, 12:54 AM

#

tried df.groupby('upc')['price','location','url'].price.agg(['min', 'max'])
however, it says its already selected the columns, so im not sure how to keep the other columns when aggregating

lapis sequoia Jan 18, 2020, 12:56 AM

#

df.groupby(['upc','price','location', 'url'], as_index=False).price.agg(______

strange stag Jan 18, 2020, 12:56 AM

#

so with the above code (multiindex), i just need to convert to a regular index, and then iterate through the df, and if "Amazon.com" in min, then drop the row

lapis sequoia Jan 18, 2020, 12:57 AM

#

no iterating through dfs.. that's not efficient

#

find another way.. but you can do that as a last resort.. because I'm not able to think of a way right now

#

go through them by upc, check the condition, save the indices somewhere.. then drop by indices together
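
One non-iterating way to express this, as a sketch on made-up rows: compute Amazon's price and the cheapest non-Amazon price per upc, compare the two, and drop the upcs where Amazon undercuts everyone else. It assumes "lower than all other suppliers" means strictly lower than the cheapest other price.

```python
import pandas as pd

# Illustrative data; upcs and prices are placeholders.
df = pd.DataFrame({
    "upc": ["000111", "000111", "000222", "000222", "000222"],
    "location": ["Amazon.com", "Walmart", "Amazon.com", "Walmart", "Best Buy"],
    "price": [5.00, 8.00, 12.00, 9.50, 11.00],
})

is_amazon = df["location"] == "Amazon.com"

# Amazon's price per upc, and the cheapest non-Amazon price per upc.
amazon_price = df[is_amazon].groupby("upc")["price"].min()
best_other = df[~is_amazon].groupby("upc")["price"].min()

# upcs where Amazon is strictly cheaper than every other supplier.
# A upc with no other supplier compares against NaN and is kept.
undercut = amazon_price < best_other.reindex(amazon_price.index)
amazon_cheapest = amazon_price[undercut].index

kept = df[~df["upc"].isin(amazon_cheapest)]
```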

strange stag Jan 18, 2020, 12:58 AM

#

hehe kinda defeats the purpose :D

📎 unknown.png

lapis sequoia Jan 18, 2020, 1:00 AM

#

oops

#

you need to remove price

#

I made a mistake

#

df.groupby(['upc','location', 'url'], as_index=False).price.agg(__

#

which you should've caught btw.. lol

strange stag Jan 18, 2020, 1:01 AM

#

was kinda wondering why all were grouped, but well 😛

lapis sequoia Jan 18, 2020, 1:02 AM

#

it's early morning here.. still getting up.. if you have anything else feel free to ping here.. I'll respond later

strange stag Jan 18, 2020, 1:02 AM

#

nw 😛 im very grateful for your help, saved me so much time!

strange stag Jan 18, 2020, 1:48 AM

#

welllll nvm hehe, just shifted the data

strange stag Jan 18, 2020, 11:18 AM

#

@lapis sequoia you there?

lapis sequoia Jan 18, 2020, 11:37 AM

#

!ask

arctic wedgeBOT Jan 18, 2020, 11:37 AM

#

ask

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.

You can find a much more detailed explanation on our website.

strange stag Jan 18, 2020, 11:39 AM

#

okay, so with the data previously, i have (credited to you) a group for each upc, which is shown in the picture above, however, the min and max is each url, this would be fine if i could sort the high ~> low of each upc grouped by the location, and now that i wrote this out, i think i might have a better idea on what i need to do

#

that and some sleep

#

so this is closer, however, i would still like to see the lowest, and the highest of each upc rather than the lowest/highest for each location

📎 unknown.png

#

df.groupby(['upc','location', 'url'], as_index=False).price.agg(['min','max']).groupby('location', as_index=False).head(len(df))

#

@lapis sequoia

lapis sequoia Jan 18, 2020, 12:04 PM

#

reading your question

#

yeah no I dont get it.. lol

strange stag Jan 18, 2020, 12:05 PM

#

so using the first upc as an example
I would like to see amazon has a 14.23 price(max), and walmart has a 8.95 price (min)

lapis sequoia Jan 18, 2020, 12:06 PM

#

ok so you want to see min max for each upc, and the url

#

yeah?

strange stag Jan 18, 2020, 12:06 PM

#

yes

#

exactly

lapis sequoia Jan 18, 2020, 12:07 PM

#

why didnt you say that

strange stag Jan 18, 2020, 12:07 PM

#

thought i did 😄

#

have manually parsed like 200 lines so far lolz 😛

#

cba to parse 1k lines manually a day

lapis sequoia Jan 18, 2020, 12:09 PM

#

yeah I dont understand what you're saying.. but wait, let me write the code

#

df['max_val'] = df.groupby(['upc'])['price'].transform(max)

#

do you understand what's happening here

strange stag Jan 18, 2020, 12:10 PM

#

lemme try/think a bit, and ill brb

#

here's what i've come up with in a few mins

📎 unknown.png

#

📎 unknown.png

#

ahhh i forgot...

#

i need to see if any1 else has a lower price than amazon, not just min/max per upc....

#

sorry.........

#

ima see what i can do with this tho

#

oh now i need to drop

#

🙂

#

so to answer your question @lapis sequoia i think i understand what its doing

#

creating a new column, by grouping the upc and then performing a max transformation on the price column

#

or min
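
A sketch of that reading of `transform`, plus one way to see which location holds the min and max for each upc. The first upc reuses the 14.23 / 8.95 example from above; the second upc and all upc values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "upc": ["000111", "000111", "000222", "000222"],
    "location": ["Amazon.com", "Walmart", "Best Buy", "Amazon.com"],
    "price": [14.23, 8.95, 20.00, 25.00],
})

# transform broadcasts each group's result back onto the original rows.
df["min_val"] = df.groupby("upc")["price"].transform("min")
df["max_val"] = df.groupby("upc")["price"].transform("max")

# idxmin/idxmax return the row label of the extreme price in each group,
# so the matching location comes along with it.
cols = ["upc", "location", "price"]
cheapest = df.loc[df.groupby("upc")["price"].idxmin(), cols]
priciest = df.loc[df.groupby("upc")["price"].idxmax(), cols]
```

Grouping by upc alone gives one min and one max per product; adding location to the group keys gives them per product and location instead.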

#

this has gotten me a bit closer to what i need (the above) and this

# Drop upcs that aren't sold on amazon
amazon_upcs = df.loc[df['location'] == "Amazon.com", 'upc']
df = df[df['upc'].isin(amazon_upcs)]

#

df['max_val'] = df.groupby(['upc', 'location'])['price'].transform(max)
🙂

autumn night Jan 18, 2020, 1:23 PM

#

How are data science and AI related?

worn stratus Jan 18, 2020, 1:55 PM

#

The vast majority of AI is trained (the process of the AI learning) using data collected from the real world. In order to work with AI you need to be able to understand the data, and what it means and how to work with it.

#

It's worth noting that Data Science, AI, Machine Learning, Data Mining and probably more are all pretty ill-defined semi-buzzwords that sometimes get used interchangeably

lapis sequoia Jan 18, 2020, 5:31 PM

#

Hey guys! I'm trying to add a new column, or rather replace one with a mistake and therefore I'm trying to merge 2 datasets exactly like I have done dozens of times before in my project... however, this time, something is different and I just can't seem to figure out why.

#

Even though I tried both "left" and "inner" merge, I'm getting out more data in the merged set than in either of the two original sets
df1 1136388 rows × 31 columns
df2 1247995 rows × 8 columns
so with left join I should be getting 1,136,388 rows right?
however, what I'm getting is 1935106 rows × 32 columns (column number is correct, rows are waayyyyyyy off)

#

So in an attempt to find out what's going on, I used the indicator=True function of merge. And guess what... there is only one category [both] and no values that are only either from left or right data set.
How is this even possible? Any help would be much appreciated... this should only be a 2 minute problem, but it cost me 2 days already :[

#

I'm merging on 7 out of those 8 columns as they're identical in both sets, that's why 32 columns instead of 31 is the correct output for the merge.... but rows increased by almost 800,000 !?!?!? There are no NaNs and no duplicates... i absolutely cannot explain how this is even possible
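
A row count that grows under a left join, with every row marked `both`, is what a many-to-many match on the key columns looks like. That can happen even when no full row is duplicated, as long as the key combination repeats on both sides. The sketch below reproduces the effect on made-up frames and shows two checks; whether it explains the real data is an assumption.

```python
import pandas as pd

# Illustrative frames: no duplicate rows, but the key "a" repeats in both.
left = pd.DataFrame({"key": ["a", "a", "b"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "a", "b"], "right_val": [10, 20, 30]})

# Each "a" on the left matches both "a" rows on the right: 2 * 2 + 1 = 5 rows.
merged = left.merge(right, on="key", how="left", indicator=True)

# Count repeated key combinations on each side (subset = the merge columns).
left_dupes = left.duplicated(subset=["key"]).sum()
right_dupes = right.duplicated(subset=["key"]).sum()

# validate makes pandas raise if the right-hand keys are not unique.
try:
    left.merge(right, on="key", how="left", validate="many_to_one")
    keys_unique = True
except pd.errors.MergeError:
    keys_unique = False
```

Running `duplicated(subset=...)` with the seven merge columns on both frames would show whether this is the cause.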

coral yoke Jan 18, 2020, 7:45 PM

#

if anyone has any experience or understanding of RNNs i'd love to talk whenever you're free. currently doing a project for gun recognition in images and video. just trying to perfect my classifier before working too hard on the object detection factor

oblique belfry Jan 18, 2020, 7:53 PM

#

Well...why do you need RNNs? I mean...what are you trying to do? RNNs and CNNs can be used for similar problems. If you are doing gun recognition, seems like an object detection problem.

coral yoke Jan 18, 2020, 8:04 PM

#

unless all of the research papers are uninformed, you need an R-CNN or similar

#

CNN for the quick classification, RNN for the object detection

#

RNN is meant for object detection in my case. i've not seen any other network types used

#

unless you know something i'm missing

#

@oblique belfry

oblique belfry Jan 18, 2020, 8:07 PM

#

Yolo v3 for Object Detection....

coral yoke Jan 18, 2020, 8:07 PM

#

i'm not using a pre-trained network

#

and if i'm not wrong, yolo has an RNN

oblique belfry Jan 18, 2020, 8:08 PM

#

All CNNs. Faster inference than R-CNN.

#

I used it to train on a custom dataset.

#

Are you doing object detection or action recognition, or both?

coral yoke Jan 18, 2020, 8:11 PM

#

just object detection

#

did you use keras or?

oblique belfry Jan 18, 2020, 8:11 PM

#

https://pjreddie.com/darknet/yolo/

https://www.learnopencv.com/training-yolov3-deep-learning-based-custom-object-detector/

YOLO: Real-Time Object Detection

You only look once (YOLO) is a state-of-the-art, real-time object detection system.

Learn OpenCV

Sunita Nayak

Training YOLOv3 : Deep Learning based Custom Object Detector | Lea...

Tutorial for training a deep learning based custom object detector using YOLOv3. We provide step by step instructions for beginners and share scripts and data.

#

So...this guy wrote it in C. Trains very fast and inference time is fast. However, it is finicky to work with.

#

There are Keras, Tensorflow, and Pytorch ports. The Pytorch one was the most stable port.

coral yoke Jan 18, 2020, 8:14 PM

#

honestly looking to use it for reference and still do my own

oblique belfry Jan 18, 2020, 8:15 PM

#

The hardest part is the input data. Each object detection algorithm has different formats of input data.

coral yoke Jan 18, 2020, 8:15 PM

#

hence why i'm going to do my best for my own

#

i know that part though

oblique belfry Jan 18, 2020, 8:15 PM

#

And....I get you wanna do your own. But, it is a solved problem.

coral yoke Jan 18, 2020, 8:16 PM

#

i very well understand that

oblique belfry Jan 18, 2020, 8:16 PM

#

Okay.

#

You can use a RNN, but you don't have to.

Input data for object detection is tricky since you can either scale the width and height or just keep it as it is. You can do it all with CNNs. And, these model architectures are out there. I'd copy them.

Why reinvent the wheel if you do not have to?

coral yoke Jan 18, 2020, 8:21 PM

#

my end goal isn't to have some pre-trained model returning images with all of its former trained classes filling the image. this is also still a ridiculously new field and while i did not look too far into yolo's latest model i now know it's the latest reach. i'm still not going to use something just handed to me. i'm looking to make my own like i said

oblique belfry Jan 18, 2020, 8:27 PM

#

You can train it yourself, from scratch and not use other people's weights. I trained it to locate a tennis ball in real time. Tried doing it myself and tried other methods out there, Yolo was the best. Even still, you are going to need a large corpus of labeled data of bounding boxes around the objects in questions. I would spend my time there.

But, good luck.

coral yoke Jan 18, 2020, 8:30 PM

#

i know what i need data wise. i have 10k images self-collected and already 1k labeled by hand with labelimg. i'm not looking to locate tennis balls because somebody else did that already, i'm looking to do something myself from scratch to prove that i can to clients looking to hire me for this industry so i'm not going to just use something handed to me and say "look, i can use what anyone else can!"

i appreciate you pointing out that yolo wasn't what i thought it was but i feel like you're acting very high and mighty just because i don't want to use somebody else's work and you think i should. have a nice day

oblique belfry Jan 18, 2020, 8:43 PM

#

@coral yoke It's not high and mighty. Most people don't reinvent the wheel unless they have to. Unless you are trying to go into research, there are just many very good solutions to this problem out there.

And, you finally explained why doing it from scratch is so important to you. If I knew that before, I could have given you different answers.

coral yoke Jan 18, 2020, 8:44 PM

#

i said feels like. and honestly, especially in this field, please don't give the answer of "just use what exists" to somebody asking how to make their own thing

jolly briar Jan 18, 2020, 8:44 PM

#

not reinventing the wheel is pretty sound advice a lot of the time 🤔

coral yoke Jan 18, 2020, 8:45 PM

#

it is, but it isn't always relevant

oblique belfry Jan 18, 2020, 8:45 PM

#

I would read the papers behind Yolo, R-CNN, Faster R-CNN, etc. They make interesting points on why they chose the architecture.

coral yoke Jan 18, 2020, 8:45 PM

#

if you want to make a discord bot should i go tell you to use this server's bot instead of making your own?

jolly briar Jan 18, 2020, 8:45 PM

#

it isn't always relevant
in this context perhaps leading with your reasoning would have made more sense, but all good

coral yoke Jan 18, 2020, 8:45 PM

#

i've read some papers already tony

worn stratus Jan 18, 2020, 8:45 PM

#

Choosing to reinvent the wheel is a great way of understanding how the wheel works

coral yoke Jan 18, 2020, 8:45 PM

#

thank you charlie

oblique belfry Jan 18, 2020, 8:46 PM

#

That's not me telling you to copy them. Just the logic behind the choices might encourage you on your journey.

coral yoke Jan 18, 2020, 8:46 PM

#

i understand that tony. that's why i was asking for people familiar with RNNs

oblique belfry Jan 18, 2020, 8:48 PM

#

I know....I was grouping them in. Yolo is one of the few famous strategies that is all CNNs. The rest are a mix between the two.

#

Are you wanting to run this on a live video stream?

coral yoke Jan 18, 2020, 8:49 PM

#

no offense but i don't believe you're the person i'd be willing to give any more information to regarding this

#

again, thanks for pointing out my misunderstanding of yolo's architecture

oblique belfry Jan 18, 2020, 8:50 PM

#

Reinventing the wheel to learn is a great way to learn. But we didn't know you were trying to do that. Hence the miscommunication.

#

Okay. Well, good luck.

coral yoke Jan 18, 2020, 8:52 PM

#

even without the learning purpose, i would definitely still make my own. especially if the project was specialized enough i would want full control of what was going on.

#

and most of it isn't for learning. i'm having to piece together the last bit of the object detection myself but the rest i mostly understand. it's for showing clients i understand

jolly briar Jan 18, 2020, 8:53 PM

#

i'm trying to imagine billing someone and pricing in building everything from scratch lol

coral yoke Jan 18, 2020, 8:54 PM

#

it's not the kind of clients you're imagining

jolly briar Jan 18, 2020, 8:54 PM

#

cool

oblique belfry Jan 18, 2020, 8:55 PM

#

Got it. Next time, try to convey that up front. Not just when talking to me, but to other devs. There are gonna be others who will be confused at your request like I was.

I am upset that this convo got derailed so quickly. Because, this is the stuff that interests me.

#

I gotta ask....what kind of clients are you targeting?

coral yoke Jan 18, 2020, 8:56 PM

#

again no offense, but never when speaking to any other developer in any part of any industry have they told me "use what exists." especially not ones in this discord, they seem to like to help you from scratch regardless

#

and none of your business

jolly briar Jan 18, 2020, 8:56 PM

#

lol

oblique belfry Jan 18, 2020, 8:57 PM

#

Alright. Just curious.

#

If you wanna impress them a bit more, look into image segmentation as well. Don't know if that would be relevant to you, but it would def be cool to show you did that by scratch too.

coral yoke Jan 18, 2020, 8:59 PM

#

my timeframe doesn't allow any more than i have set

#

i've seen that already, thanks though

oblique belfry Jan 18, 2020, 8:59 PM

#

Gotcha. Wanted you to really impress them.

coral yoke Jan 18, 2020, 9:00 PM

#

👌

oblique belfry Jan 18, 2020, 9:19 PM

#

Has anyone had luck with graph neural networks?

lapis sequoia Jan 18, 2020, 9:44 PM

#

Hi! Sorry, not sure if this is right channel for my problem. Where can I ask about data preprocessing for text clusterization?

worn stratus Jan 18, 2020, 9:52 PM

#

Here probably

lapis sequoia Jan 18, 2020, 10:19 PM

#

Ok, I don't even understand my task properly...

#

I want to cluster different text to k different authors.(k-means clustering)
My data is: different files with text and other things from different authors in json format,
It looks like this:

{
  "author": "Tolstoy",
  "date": "unknown",
  "format": "unknown",
  "text": "here is some short text by Tolstoy",
  "title": "Anna Karenina",
  "year": "unknown",
  "lang": "ru"
}

Also there is already training data that consists of many dictionaries like this in json format too.

What do I need for k-means clustering? Do I only need "text" strings?

chilly geyser Jan 18, 2020, 10:22 PM

#

cluster different text to k different authors
Your task is to create k clusters of authors. Presumably this means that authors within each cluster are similar to each other in some way.
What do I need for k-means clustering? Do I only need "text" strings?
To cluster the text you'd probably need to make the 'text' into a format such that you can perform operations on them to talk about any kind of similarity or dissimilarity. There are different ways to do this, and I think you have been given raw book data, along with some meta data. It's honestly up to you to use just data and/or the metadata, as long as at the end of the clustering process, you have a good idea of what algorithms you used are doing

lapis sequoia Jan 18, 2020, 10:35 PM

#

So can I only take those "text" values from data and put them all in one big list of texts(is this even right?) and then preprocess this list?
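
A minimal sketch of that idea, assuming each JSON file parses to one dictionary shaped like the sample above. The `raw_files` strings and the choice of two clusters are made up for illustration; TF-IDF is only one of several ways to turn text into numbers.

```python
import json

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the contents of the JSON files (one dict per file).
raw_files = [
    '{"author": "A", "text": "the cat sat on the mat with another cat"}',
    '{"author": "A", "text": "a cat and a mat and a cat again"}',
    '{"author": "B", "text": "stock markets fell as bond yields rose"}',
    '{"author": "B", "text": "bond markets and stock yields moved today"}',
]
records = [json.loads(s) for s in raw_files]

# Keep only the "text" field: one string per document, all in one list.
texts = [r["text"] for r in records]

# Turn the list of strings into a numeric matrix, one row per document.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# n_clusters is k, the number of authors you expect.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X)
```

So yes, one flat list of strings is a reasonable input here; the `author` field is left out of the features so that it can be used afterwards to check the clusters.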

lapis sequoia Jan 18, 2020, 10:59 PM

#

Do you use the Anaconda environment?

#

that is like a software package

#

Me? No, I don't.

#

Anybody here

jolly briar Jan 18, 2020, 11:31 PM

#

@lapis sequoia yes

lapis sequoia Jan 18, 2020, 11:35 PM

#

@jolly briar do you activate it?

📎 unknown.png

jolly briar Jan 18, 2020, 11:35 PM

#

I've never used windows

jolly briar Jan 19, 2020, 12:45 AM

#

@velvet thorn

x = pd.DataFrame({'index' : [5,6], 'blah' : ['a', 'b']})
print(f"""x.index : {list(x.index)}, x['index'] : {list(x['index'])}""")

this seems like a reasonable example of .v and ['v'] not being exactly the same

velvet thorn Jan 19, 2020, 1:18 AM

#

@jolly briar yup

#

this applies also to every other attribute that is already bound

#

e.g. min, max, groupby

jolly briar Jan 19, 2020, 1:18 AM

#

yeah

velvet thorn Jan 19, 2020, 1:18 AM

#

I think I said "prefer __getitem__ access, because it works in more cases"

#

but if I didn't then I'm saying it now roothink

jolly briar Jan 19, 2020, 1:18 AM

#

so they're not exactly the same, like running code with ipython vs python, people often say they're the same but it's different

velvet thorn Jan 19, 2020, 1:19 AM

#

because it is most correct to say that __getitem__ works everywhere __getattr__ does, and some places it doesn't, for the purpose of Series access

jolly briar Jan 19, 2020, 1:19 AM

#

can't recall exactly what you said, just thought of it now though ( the index thing ), all good

stray spade Jan 19, 2020, 2:07 PM

#

can some one help me with this part of code

"from custom_layers.scale_layer import Scale"

i could not find document or installation guide for this library in python

i am trying to implement ResNet152 with the following repository
https://github.com/flyyufelix/cnn_finetune/blob/master/resnet_152.py

GitHub

flyyufelix/cnn_finetune

Fine-tune CNN in Keras. Contribute to flyyufelix/cnn_finetune development by creating an account on GitHub.

jovial river Jan 19, 2020, 3:13 PM

#

How does an algorithm like KNN handle duplicate data? Meaning we have a set of data objects with identical attributes and the distance between these data objects is 0. Does it make sense to remove these duplicate points here or include it?

jovial river Jan 19, 2020, 3:29 PM

#

If we were to include duplicates, would it make sense to treat duplicate data points as one observation? Like if k=3 and n1 has 3 duplicates, n1', n1'' and n1''', then n1 would only have 1 nearest neighbor instead of 3.

jolly briar Jan 19, 2020, 3:51 PM

#

I often get confused when making dataframes with rows, for some reason.

for example - pd.DataFrame( pd.factorize( data.var ) )
If i want this to create a dataframe with columns instead of rows how would I do that?
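
One possible way to get columns, sketched on made-up data. `pd.factorize` returns a pair `(codes, uniques)`, so handing the pair straight to `pd.DataFrame` lays the two arrays out as rows; unpacking the pair and naming each array in a dict gives columns instead. The frame and the column name `var` below are assumptions that mirror the snippet above.

```python
import pandas as pd

# Hypothetical data; the column is literally named "var".
data = pd.DataFrame({"var": ["b", "a", "b", "c"]})

# Bracket access on purpose: data.var is the DataFrame.var method, not the column.
codes, uniques = pd.factorize(data["var"])

# One column per array: name them explicitly in a dict.
encoded = pd.DataFrame({"var": data["var"], "code": codes})

# Lookup table from integer code back to the original value.
lookup = pd.DataFrame({"code": range(len(uniques)), "value": uniques})
```

The two outputs go in separate frames because `codes` has one entry per row while `uniques` has one entry per distinct value, so they generally differ in length.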

lapis sequoia Jan 19, 2020, 5:57 PM

#

Hi! Can I use Random Forest to evaluate k-means clustering? does this make sense?

cinder viper Jan 19, 2020, 7:25 PM

#

@lapis sequoia I don't understand what you mean when you say you are trying to "evaluate" k-means. I suspect the answer is no... Random Forest is similar to k-Means in that both are "supervised classification" algorithms, but they have differences in what they do and how they do it

lapis sequoia Jan 19, 2020, 7:46 PM

#

k-means is unsupervised so I wanted to check clusters I got with RF or something

strange stag Jan 19, 2020, 7:51 PM

#

hey was hoping someone could help me with pandas, im trying to keep the amazon price for each upc, and drop others that are a higher price than amazon (for each upc)

📎 unknown.png

#

if you need me to provide more information, in any way shape or form, please dont hesitate to ask!

chilly geyser Jan 19, 2020, 8:02 PM

#

@lapis sequoia As previously said, it doesn't make sense. Both k-means and RF clusters are fundamentally different.

You can evaluate the clustering quality of each algorithm using metrics such as cluster purity, or compute/speed requirements, etc. and then compare the results from RF or from k-Means. Indeed, k-Means is likely to be superior in both fitting and prediction, while RF depends on the number of trees, as well as tree parameters. If RF does not produce significantly better clusters, then I would use k-Means.

But there are probably many different ways of generalising each k-Means, RF, and there would be other algorithms. What works might typically depend on your use case.

#

@strange stag So you want to conditionally drop depending on the price column? Is there only one Amazon.com under location or would there be multiple? If there is only one, you can grab the Amazon.com price, store it as a constant, then do a conditional slice using .map

lapis sequoia Jan 19, 2020, 8:09 PM

#


from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)


from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
import numpy as np


n_clusters = len(np.unique(y_train))
clf = KMeans(n_clusters = n_clusters, random_state=42)
clf.fit(X_train)
y_labels_train = clf.labels_
y_labels_test = clf.predict(X_test)
X_train = y_labels_train[:, np.newaxis]
X_test = y_labels_test[:, np.newaxis]


from sklearn.ensemble import RandomForestClassifier

model=RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy is:',accuracy_score(y_test, y_pred))

#

Sorry

#

Can't I use RF for mapping labels like here?

#

I had my data labelled before but needed to use k means

chilly geyser Jan 19, 2020, 8:15 PM

#

@lapis sequoia You mean change clf to RF?

lapis sequoia Jan 19, 2020, 8:17 PM

#

clf was for k means, I mean there the last part is RF used on data that was "produced" by k means. or? maybe I don't understand the last part of this code, where RF comes

chilly geyser Jan 19, 2020, 8:18 PM

#

@lapis sequoia That doesn't make sense, why would you use K-means then sequentially run random-forest on it?

#

Why would you fit an RF model after your K-means clustering algorithm?

#

@lapis sequoia Ok I think I get what your script is doing

#

@lapis sequoia Have you read https://scikit-learn.org/stable/modules/model_evaluation.html
I think you should just use the common metrics for evaluating the quality of the clusters out of K-means.

It doesn't make sense to 'evaluate' how good K-means is via RF. RF is itself another classifier that can result in classification errors on its own. If you're trying to do a meta-analysis of algorithmic analysis of either K-means outputs or RF-inputs then it makes sense, but it wouldn't make sense for the implied original problem of 'given N-datapoints and K-possible labels, what is the best way to separate and give each datapoint one of the K-possible labels?'

lapis sequoia Jan 19, 2020, 8:30 PM

#

I wanted to make a confusion matrix in the end and I don't know how to make it without labels, the code is not mine, I just thought I found something similar, because in the end there is confusion matrix and classification report, and that's what I wanted from k-means. The initial data that I have already has labels and is actually more for classification tasks but I have to use it for k means

#

what is the script doing then?

strange stag Jan 19, 2020, 8:34 PM

#

there are multiple prices under the location amazon

#

@chilly geyser

chilly geyser Jan 19, 2020, 8:35 PM

#

@lapis sequoia
Um, how do you get the confusion matrix in the first place? In the first place, do you have a ground truth of classifiers?

#

Setting the K-means as a ground truth does not make sense

strange stag Jan 19, 2020, 8:35 PM

#

each upc should have an amazon price, if not multiple

chilly geyser Jan 19, 2020, 8:35 PM

#

To get a confusion matrix you need to say that a cluster X has common property related to its elements being members of X

strange stag Jan 19, 2020, 8:36 PM

#

not sure what u mean by multiple amazon.com under location tho

chilly geyser Jan 19, 2020, 8:36 PM

#

Unfortunately K-means only produces indices or rather centroids. You'd need to remap the centroids to get clusters of meaning
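
A rough sketch of that remapping, assuming ground-truth labels really are available for the same rows. The two arrays below are invented; in practice `cluster_ids` would come from `KMeans.labels_`. Mapping each cluster to its most common true label is just one simple convention, and it can send two clusters to the same label.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Hypothetical ground-truth labels and k-means cluster ids for 8 documents.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
cluster_ids = np.array([2, 2, 2, 0, 0, 1, 1, 1])

# This score ignores how clusters are numbered, so no remapping is needed.
ari = adjusted_rand_score(y_true, cluster_ids)

# Remap each cluster id to the most common true label among its members.
mapping = {}
for c in np.unique(cluster_ids):
    members = y_true[cluster_ids == c]
    mapping[c] = np.bincount(members).argmax()
y_pred = np.array([mapping[c] for c in cluster_ids])

# With cluster ids translated into label space, a confusion matrix makes sense.
cm = confusion_matrix(y_true, y_pred)
```

No second classifier is involved: the cluster assignments are compared with the known labels directly.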

#

@strange stag Brb I'll give you a fake table

strange stag Jan 19, 2020, 8:37 PM

#

location can have maceys, walmart, home-depot, office-depot, or a few others

#

@chilly geyser i can give u a real 1 if u want

chilly geyser Jan 19, 2020, 8:38 PM

#

@strange stag Is this table possible?

📎 unknown.png

#

@strange stag I'd avoid giving real data.

strange stag Jan 19, 2020, 8:39 PM

#

its fine idc

#

but yes, that is basically identical to the data i have now

#

id like to keep the 6th row and the 2nd

#

for upc==1

lapis sequoia Jan 19, 2020, 8:40 PM

#

thank you @chilly geyser but do you understand what script I posted is doing?

chilly geyser Jan 19, 2020, 8:41 PM

#

@lapis sequoia It's running K-means, then setting it as a ground truth for RF to classify

#

@strange stag I'd look into conditional slicing with pandas. A very naive (aka slow) way to do it is to take subsets of each UPC value, then do the conditional

#

As for faster/simultaneous checking I'm not too sure, I've not used pandas other than for general things and I've never exactly needed it to be speed-optimised

strange stag Jan 19, 2020, 8:43 PM

#

hmm

#

will possibly be doing millions of rows per day

#

however, shouldnt be a problem for now

#

so something like df.groupby(['upc'])

chilly geyser Jan 19, 2020, 8:45 PM

#

Yeah my googling seems to imply that too

strange stag Jan 19, 2020, 8:46 PM

#

i understand i can do something like this (this is what im using to drop single suppliers corresponding to 1 upc)

counts = df['upc'].value_counts()
df = df[~df['upc'].isin(counts[counts < 2].index)]

#

so this selects a column, but not subsets for column values

#

so groupby would render subsets?

chilly geyser Jan 19, 2020, 8:49 PM

#

I'd try it, I'm not a pd expert here :>

strange stag Jan 19, 2020, 8:49 PM

#

do also have something like this

#

df1 = df[ df['location'] == "Amazon.com" ].drop_duplicates(subset='upc', keep='first')

#

think i should be using != instead but w/e

chilly geyser Jan 19, 2020, 8:50 PM

#

That keeps the Amazon.com stuff right?

strange stag Jan 19, 2020, 8:50 PM

#

this assumes that the df has been sorted by price

#

should

chilly geyser Jan 19, 2020, 8:50 PM

#

lol TBH IDK what you're doing, but it seems you're doing ok

strange stag Jan 19, 2020, 8:50 PM

#

actually nvm it doesnt do anything

#

that was an attempt to drop the lower price amazon offers

chilly geyser Jan 19, 2020, 8:51 PM

#

@strange stag Are you doing this all in VSC or IDLE? I'd recommend a more interactive thing like Google Colab or at least your own localhost JuPyteR notebook if you think Google's snooping around your data.

strange stag Jan 19, 2020, 8:51 PM

#

going back to the beginning, just trying to get amazons high vs the lowest of others

#

im on a notebook atm

chilly geyser Jan 19, 2020, 8:52 PM

#

That way you can see how the pd dataframes are changing

#

Ah ok that's good

#

So you can quickly see stuff

strange stag Jan 19, 2020, 8:52 PM

#

yes

#

well, not really doing ok

#

still blind as a bat atm..

#

mind boggling me why i cant get amazons high price, and then the lowest price for each upc other than amazon

#

mk

#

this is better...

grouped = df.groupby(['upc'])
result = [g for g in list(grouped)]

#

might actually be able to work with this data 🙂

vast temple Jan 19, 2020, 9:08 PM

#

hey guys, are stackoverflow links allowed here?

strange stag Jan 19, 2020, 9:11 PM

#

@chilly geyser tyvm for suggesting subsets! 😄

chilly geyser Jan 19, 2020, 9:23 PM

#

@strange stag I got the code if you want it, it's ugly and IDK if it scales

strange stag Jan 19, 2020, 9:23 PM

#

o.O

#

wrote the code for me 😄 wooo i got the code too

#

perhaps we shall compare?

chilly geyser Jan 19, 2020, 9:24 PM

#

for _, y in df.groupby("upc"):
    amazon_min = y[y["location"] == "Amazon.com"]["price"].min()
    # print(y[y["location"] == "Amazon.com"]["price"].min())
    print(y[(y["location"] == "Amazon.com") | (y[y["location"] != "Amazon.com"]["price"] < amazon_min)])

strange stag Jan 19, 2020, 9:25 PM

#

grouped = df.groupby(['upc'])
result = [g for g in list(grouped)]
result_length = len(result)
new_df = result[0]
high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
for index in new_df.index:
    if new_df['price'][index] > amazons_price:
        new_df = new_df.drop(index)

chilly geyser Jan 19, 2020, 9:25 PM

#

Mine is still UPC-prints tho, I haven't done the dropping yet, mine is only a view

strange stag Jan 19, 2020, 9:25 PM

#

i like ur code is wayyyyy shorter tho..

chilly geyser Jan 19, 2020, 9:25 PM

#

Basically I get Amazon price minimum per UPC

strange stag Jan 19, 2020, 9:26 PM

#

erm, need the max

chilly geyser Jan 19, 2020, 9:26 PM

#

Then any other location (e.g. Walmart) with higher prices are dropped

#

Amazon max?

strange stag Jan 19, 2020, 9:26 PM

#

ah okay, thats good

chilly geyser Jan 19, 2020, 9:26 PM

#

I see

strange stag Jan 19, 2020, 9:26 PM

#

yes

chilly geyser Jan 19, 2020, 9:26 PM

#

You need the Amazon min rite?

strange stag Jan 19, 2020, 9:26 PM

#

no

#

max

#

could explain, but with regards to your earlier post of using real data

#

dw, i account for amazons lower price later

chilly geyser Jan 19, 2020, 9:29 PM

#

Umm I'm trying to make sure I can recreate df now

strange stag Jan 19, 2020, 9:29 PM

#

this is super sweet tho

chilly geyser Jan 19, 2020, 9:29 PM

#

Not sure how to get from the group-bys all the way back to the modified df

#

And I think df.append would be slow

strange stag Jan 19, 2020, 9:30 PM

#

however, i think my version is slightly better

#

e.g

📎 unknown.png

#

or still urs

📎 unknown.png

#

but this is alot closer than i have been the past week 😄

#

and ye, i just need to concat my new_df for each loop

#

i think im biased tho so

chilly geyser Jan 19, 2020, 9:35 PM

#

@strange stag Up to you, it's your project

strange stag Jan 19, 2020, 9:35 PM

#

🙂

#

think ill keep yours posted tho, incase i want the other amazon prices

chilly geyser Jan 19, 2020, 9:36 PM

#

my final code is this

keep_indices = []
for _, y in df.groupby("upc"):
    # highest Amazon.com price within this upc group
    amazon_max = y[y["location"] == "Amazon.com"]["price"].max()
    COND = (y["location"] == "Amazon.com") | (y[y["location"] != "Amazon.com"]["price"] < amazon_max)
    keep_indices += y[COND].index.tolist()

# to get the subset just use loc
df.loc[keep_indices]

#

I'm using .max() now
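
For comparison, the same selection can probably be written without the Python-level loop by broadcasting the per-UPC Amazon maximum back onto every row. This is only a sketch on made-up data, using the column names from the loop above; whether it is actually faster on the real data would need measuring.

```python
import pandas as pd

# Hypothetical data with the same column names as in the discussion.
df = pd.DataFrame({
    "upc": [1, 1, 1, 1, 2, 2],
    "location": ["Amazon.com", "Amazon.com", "Walmart", "HomeDepot",
                 "Amazon.com", "Walmart"],
    "price": [10.0, 20.0, 15.0, 25.0, 5.0, 9.0],
})

is_amazon = df["location"] == "Amazon.com"

# Blank out non-Amazon prices, then take the max per upc and
# broadcast it back so every row carries its group's Amazon max.
amazon_max = df["price"].where(is_amazon).groupby(df["upc"]).transform("max")

# Keep Amazon rows, plus any other row priced below the Amazon max.
kept = df[is_amazon | (df["price"] < amazon_max)]
```

A UPC with no Amazon row gets `NaN` as its maximum, every comparison against it is `False`, and so those rows drop out.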

strange stag Jan 19, 2020, 9:36 PM

#

wdym location?

chilly geyser Jan 19, 2020, 9:37 PM

#

?

strange stag Jan 19, 2020, 9:37 PM

#

df.loc

chilly geyser Jan 19, 2020, 9:37 PM

#

Basically I get a list of indices that match the condition

#

This index uses the original DF's index, so it will be fine

#

in fact I don't think I'm changing the original df

#

You only modify the original DF if you have to

strange stag Jan 19, 2020, 9:38 PM

#

so faster?

#

than mine by alot?

chilly geyser Jan 19, 2020, 9:38 PM

#

lol for that I recommend using %%timeit

#

Also, not just this part by itself solo.

#

You need to do a %%timeit on your fullscript if you can

vast temple Jan 19, 2020, 9:39 PM

#

if someone worked with pandas and fuzzywuzzy check this question please https://stackoverflow.com/questions/59813111/remake-dataframe-based-of-fuzzywuzzy-matches

Stack Overflow

Remake dataframe based of fuzzywuzzy matches

i have a dataframes now it have 5 rows(in future will have more). In column names there 5 values, if those 5 names the same(their fuzz.ratio close to each other) then ok no changes needed.
But the...

strange stag Jan 19, 2020, 9:39 PM

#

well, only got 1k lines atm so

chilly geyser Jan 19, 2020, 9:39 PM

#

Unless you are really really sure of your test-case and likely inputs and/or outputs

#

I see

#

The issue with %%timeit on just this portion is even if this part is faster, it might be because it's not evaluating certain parts

#

like a generator expression being created but never actually consumed

strange stag Jan 19, 2020, 9:40 PM

#

well, im concating dfs, for each upc...so

#

im sure thats probably not cheap

chilly geyser Jan 19, 2020, 9:41 PM

#

Ya, that's what I think too, but maybe pd has an internal magic for that too

#

I'm trying to grab just the indices, but TBH I'm not sure if it's faster

strange stag Jan 19, 2020, 9:42 PM

#

i think grabbing indices would be way faster, but im no expert

chilly geyser Jan 19, 2020, 9:42 PM

#

Anyway this is my result with play-data

📎 unknown.png

#

Carrefour because....well, why not :^)
prices are literally from random. upc is choice(range(10)).

#

basically 1000 rows -> 967 rows, cutting off via Amazon max per upc

strange stag Jan 19, 2020, 9:46 PM

#

think my biggest improvement would be switching how im saving data tho

#

cause loading jsonlines to a df is really slow

#

df = pd.DataFrame()
with jsonlines.open(filename, 'r') as reader:
    for obj in reader:
        df = df.append(obj, ignore_index=True)

#

its like 1 second per 100 rows or something...

#

how do i do a %%timeit?

chilly geyser Jan 19, 2020, 9:50 PM

#

%%timeit is a JuPyteR magic. You put it at the top of the cell

strange stag Jan 19, 2020, 9:51 PM

#

ah

#

that code above is...
617 ms ± 4.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

chilly geyser Jan 19, 2020, 9:53 PM

#

Something like this

📎 unknown.png

strange stag Jan 19, 2020, 9:54 PM

#

wow... 10m lines would take 12 hours....

chilly geyser Jan 19, 2020, 9:54 PM

#

The 1+2 is so that I don't have a single line. You can actually just %timeit [SINGLE_LINE_CODE]

#

While %%timeit is for whole cell execution
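
Outside a notebook, the standard-library `timeit` module gives roughly the same kind of measurement. A small sketch follows; `build_squares` is a throwaway function invented for the example.

```python
import statistics
import timeit


def build_squares(n):
    """Toy workload standing in for whatever code you want to time."""
    return [i * i for i in range(n)]


# repeat=7, number=1 means 7 separate runs of one call each,
# the same shape as "7 runs, 1 loop each" in the notebook output.
timings = timeit.repeat(lambda: build_squares(10_000), repeat=7, number=1)

mean_s = statistics.mean(timings)   # average seconds per run
std_s = statistics.stdev(timings)   # spread between the runs
```

In the notebook output, the number before `±` is the mean time for one whole run of the cell and the number after it is the standard deviation across runs.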

strange stag Jan 19, 2020, 9:54 PM

#

ye...

chilly geyser Jan 19, 2020, 9:55 PM

#

@strange stag Lol I don't think you can just linearly extrapolate so easily, just try for a slightly larger subset rather than a unittest

strange stag Jan 19, 2020, 9:55 PM

#

im assuming the 617ms was used to create the df, and the 4.36 is for each line that its appending

chilly geyser Jan 19, 2020, 9:55 PM

#

The fact is, unittests are unittests for a reason, and integration testing is required

strange stag Jan 19, 2020, 9:55 PM

#

no idea what that means

#

think the above is giving me a ballpark of what to expect tho

chilly geyser Jan 19, 2020, 9:56 PM

#

Unit tests are for single things by themselves, while integration tests means you have multiple different things working together

#

It's common testing terminology

strange stag Jan 19, 2020, 9:56 PM

#

tbh testing is outa my league atm

#

not necessary at all

chilly geyser Jan 19, 2020, 9:57 PM

#

Well TBH IDK how much production-level code you're doing, and honestly personally I've never been involved in production-level stuff

strange stag Jan 19, 2020, 9:57 PM

#

##autopilot

#

😄

#

got a LONG fkn ways to go tho

#

id say im 10% done

#

what would be better to save data than jsonlines?

#

for importing to pandas

#

well nvm

#

hmm

jolly briar Jan 19, 2020, 10:56 PM

#

i'm wondering how to know what coordinate system i'm in wrt geographic data

velvet thorn Jan 19, 2020, 11:29 PM

#

@strange stag

#

oh lord why

#

for loop + df.append = death
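
The usual way around this is to collect plain dicts first and build the DataFrame once, since every `df.append` call copies the whole frame (and `DataFrame.append` was removed in pandas 2.0). A sketch, with an in-memory `fake_file` standing in for the real JSON Lines file and the standard `json` module used in place of `jsonlines`:

```python
import io
import json

import pandas as pd

# Hypothetical JSON Lines content standing in for the real file.
fake_file = io.StringIO(
    '{"upc": 1, "location": "Amazon", "price": 10.0}\n'
    '{"upc": 1, "location": "Walmart", "price": 8.5}\n'
)

# Collect plain dicts first, then build the DataFrame in one go.
rows = [json.loads(line) for line in fake_file if line.strip()]
df = pd.DataFrame(rows)

# pandas can also parse JSON Lines directly.
fake_file.seek(0)
df_direct = pd.read_json(fake_file, lines=True)
```

With a real file, `pd.read_json(filename, lines=True)` would be the one-line version.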

velvet thorn Jan 19, 2020, 11:49 PM

#

@jolly briar in the general case?

#

or what

strange stag Jan 19, 2020, 11:49 PM

#

@chilly geyser you still there?

#

@velvet thorn what about this, atm im getting a blank df for total_df

total_df = pd.DataFrame()
for x in range(result_length):
    new_df = result[x]

    high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
    new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
    try:
        amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
    except IndexError:
        continue
    price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
    for index in new_df.index:
        if new_df['price'][index] > amazons_price:
            new_df = new_df.drop(index)
            pd.concat([new_df, total_df])
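
A likely reason `total_df` stays blank: `pd.concat` returns a new DataFrame and leaves its inputs untouched, so a bare `pd.concat([...])` line throws the result away. A small sketch with made-up frames, showing the discarded result and one common pattern (collect the pieces in a list, concatenate once at the end):

```python
import pandas as pd

total_df = pd.DataFrame()
piece = pd.DataFrame({"upc": [1], "price": [9.99]})

# The return value is not assigned, so total_df is unchanged.
pd.concat([total_df, piece])
still_empty = total_df.empty

# Collect per-group results in a list and concatenate a single time.
pieces = []
for upc in (1, 2):
    pieces.append(pd.DataFrame({"upc": [upc], "price": [9.99]}))
total_df = pd.concat(pieces, ignore_index=True)
```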

velvet thorn Jan 19, 2020, 11:51 PM

#

I feel a bit weak just looking at the loops

#

okay, maybe you can tell me what you want to do first?

strange stag Jan 19, 2020, 11:52 PM

#

mk, so i have a df with all the data and i am able to parse the data that i need with

grouped = df.groupby(['upc'])
result = [g for g in list(grouped)]
result_length = len(result)
new_df = result[0]
high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
for index in new_df.index:
    if new_df['price'][index] > amazons_price:
        new_df = new_df.drop(index)

#

however, having difficulties running this in a loop

#

because not all of my grouped subsets have the amazon bit

#

so amazon may not be a location when iterating through the df (group)

#

so that code does everything that i want except...

#

i cant figure out how to drop upcs that dont have an amazon location

#

here is some data
https://cdn.discordapp.com/attachments/366673247892275221/668571033632112710/unknown.png

#

so im grouping by upc, keeping amazons highest price, and dropping anything that is higher than that

jolly briar Jan 19, 2020, 11:55 PM

#

@velvet thorn i was just thinking generally... i've just been merging some shapey stuff but i'm not too sure how to check that i did it correctly

strange stag Jan 19, 2020, 11:57 PM

#

new_df is when im seperating each upc into a new dataframe, and parsing it from here, and now im trying to add it back into a master dataframe

velvet thorn Jan 20, 2020, 12:01 AM

#

hm

#

okay, so first you want to drop entire groups with values of upc that don't have 'Amazon.com' in location, correct?

strange stag Jan 20, 2020, 12:01 AM

#

yes

velvet thorn Jan 20, 2020, 12:03 AM

#

df.groupby('upc').filter(lambda g: 'Amazon.com' in set(g['location']))

#

or, actually

#

df.groupby('upc').filter(lambda g: 'Amazon.com' in g['location'].unique())
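
To see what that filter does, here is a tiny made-up frame. `filter` calls the function once per group and keeps every row of the groups for which it returns `True`. The `transform("any")` line is an alternative that builds a row-level mask instead; treat both as sketches.

```python
import pandas as pd

# Hypothetical data: upc 2 has no Amazon.com row.
df = pd.DataFrame({
    "upc": [1, 1, 2, 2, 3],
    "location": ["Amazon.com", "Walmart", "Walmart", "HomeDepot", "Amazon.com"],
    "price": [10.0, 8.0, 7.0, 6.0, 5.0],
})

# Keep whole groups that contain at least one Amazon.com row.
with_amazon = df.groupby("upc").filter(
    lambda g: (g["location"] == "Amazon.com").any()
)

# Same result via a boolean mask broadcast back to every row.
has_amazon = (df["location"] == "Amazon.com").groupby(df["upc"]).transform("any")
with_amazon_mask = df[has_amazon]
```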

strange stag Jan 20, 2020, 12:27 AM

#

ok, so now that i have amazon only upcs, how do i concat the dfs?

#

new_dataframe = df.groupby('upc').filter(lambda g: 'Amazon' in g['location'].unique())
grouped = new_dataframe.groupby('upc')

result = [g for g in list(grouped)]
result_length = len(result)

total_df = pd.DataFrame()

for x in range(result_length):
    new_df = result[x]
    high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
    new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
    amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
    price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
    for index in new_df.index:
        if new_df['price'][index] > amazons_price:
            new_df = new_df.drop(index)
            print(new_df)
            pd.concat([total_df, new_df])

#

📎 unknown.png

velvet thorn Jan 20, 2020, 12:37 AM

#

uh

#

so now

#

you want all the rows where prices are lower than the highest Amazon price for that group, right?

strange stag Jan 20, 2020, 12:37 AM

#

yes

#

all the upcs with that/those conditions, yes

#

rows include upcs, so yeah

#

basically the high of amazon and the low of anywhere else

velvet thorn Jan 20, 2020, 12:39 AM

#

wait

#

what?

#

the last line does not mean the same thing

#

as what I said

strange stag Jan 20, 2020, 12:40 AM

#

which line

velvet thorn Jan 20, 2020, 12:40 AM

#

basically the high of amazon and the low of anywhere else

strange stag Jan 20, 2020, 12:40 AM

#

df.groupby('upc').filter(lambda g: 'Amazon' in g['location'].unique())
this is getting all upcs that have an amazon price yes?

velvet thorn Jan 20, 2020, 12:40 AM

#

I would interpret "low" to mean "only the lowest value", not "everything lower than the highest Amazon value"

#

since it seems to me that there are multiple values of price for each value of location

strange stag Jan 20, 2020, 12:41 AM

#

low as the lowest value

#

yes, the lowest of anywhere besides amazon

#

and the high of amazon

velvet thorn Jan 20, 2020, 12:41 AM

#

you want all the rows where prices are lower than the highest Amazon price for that group, right?

#

so this is wrong

strange stag Jan 20, 2020, 12:42 AM

#

well, its right in the manner that it dropped the upcs that dont have an amazon price, or are you asking about the next step?

velvet thorn Jan 20, 2020, 12:42 AM

#

next step

#

probably

strange stag Jan 20, 2020, 12:42 AM

#

okay

velvet thorn Jan 20, 2020, 12:42 AM

#

you should come up with some sample data

strange stag Jan 20, 2020, 12:44 AM

#

7847    Amazon       11.53    806481288353    https://www.amazon.com/gp/offer-listing/B083CP...
7850    HomeDepot    28.99    806481288353    https://www.amazon.com/gp/offer-listing/B083CP...
7848    Walmart      24.97    806481288353    //goto.walmart.com/c/1914133/566719/9383?veh=a...
7851    Amazon      136.73    806481288353    https://www.amazon.com/gp/offer-listing/B01IBI...

#

should yield row 7851 and 7848

velvet thorn Jan 20, 2020, 12:46 AM

#

in other words

#

each group

#

should yield 2 rows

#

?

strange stag Jan 20, 2020, 12:46 AM

#

yes

velvet thorn Jan 20, 2020, 12:46 AM

#

okay

#

let me think about that for a moment

strange stag Jan 20, 2020, 12:47 AM

#

courtesy of another user (earlier)
this yields that, but all of amazon prices, not just the highest

keep_indices = list()
for _, y in df.groupby("upc"):
    # highest Amazon price within this upc group
    amazon_max = y[y["location"] == "Amazon"]["price"].max()
    COND = (y["location"] == "Amazon") | (y[y["location"] != "Amazon"]["price"] < amazon_max)
    keep_indices += y[COND].index.tolist()

df.loc[keep_indices]

id prefer to keep only the highest

velvet thorn Jan 20, 2020, 12:48 AM

#

sure

#

and it doesn't matter if, for example

#

the highest Amazon price is lower than the lowest non-Amazon price, right

#

in all cases you just want the highest Amazon price and the lowest non-Amazon price

strange stag Jan 20, 2020, 12:48 AM

#

yes

#

exactly

#

only 1 amazon price should be listed

#

for any given upc

velvet thorn Jan 20, 2020, 12:49 AM

#

and this is applied on the previous DataFrame

#

the one with UPCs without Amazon filtered out

strange stag Jan 20, 2020, 12:49 AM

#

with amazon upcs filtered

#

so applied to

new_dataframe = df.groupby('upc').filter(lambda g: 'Amazon' in g['location'].unique())

velvet thorn Jan 20, 2020, 12:49 AM

#

👍

strange stag Jan 20, 2020, 12:50 AM

#

😄

velvet thorn Jan 20, 2020, 12:57 AM

#

try aggs = df.groupby([(df['location'] == 'Amazon').rename('amazon'), 'upc']).agg(['min', 'max'])

#

and pd.concat([aggs.xs(c, level=0)[[('location', 'min'), ('price', 'min')]] for c in {False, True}]) to filter out

strange stag Jan 20, 2020, 12:59 AM

#

filter out?

velvet thorn Jan 20, 2020, 1:00 AM

#

yeah

#

try it and tell me if it's what you're looking for

strange stag Jan 20, 2020, 1:00 AM

#

so, filter seems to be almost what im looking for, cept 2 things

#

still need the price for amazon with the upc, and 2 if amazon is the lowest price, then i need to drop that row

velvet thorn Jan 20, 2020, 1:01 AM

#

huh.

strange stag Jan 20, 2020, 1:01 AM

#

but other than that the filter is perfect i think, checking now

velvet thorn Jan 20, 2020, 1:01 AM

#

you didn't say that

strange stag Jan 20, 2020, 1:01 AM

#

my apologies... 😦

velvet thorn Jan 20, 2020, 1:01 AM

#

oh wait, the second part is wrong though, ignore it

strange stag Jan 20, 2020, 1:02 AM

#

?

#

the filter?

velvet thorn Jan 20, 2020, 1:02 AM

#

it should be

#

pd.concat([
    aggs.xs(False, level=0)[[('location', 'min'), ('price', 'min')]], 
    aggs.xs(True, level=0)[[('location', 'max'), ('price', 'max')]]
])

#

because you want the max for Amazon, right

strange stag Jan 20, 2020, 1:03 AM

#

yes

velvet thorn Jan 20, 2020, 1:04 AM

#

okay, I need to go now

#

but basically

#

for the last step where you wanna drop the rows

strange stag Jan 20, 2020, 1:04 AM

#

same

velvet thorn Jan 20, 2020, 1:04 AM

#

you can just do another groupby and filter on that condition

strange stag Jan 20, 2020, 1:04 AM

#

ye, thought so as much 🙂

#

@velvet thorn anyways tyvm!!!!!

#

would elaborate how helpful u have been, but as u, i really have to go like right now!!

leaden bobcat Jan 20, 2020, 1:27 AM

#

Is anyone available to answer a couple questions regarding how to turn a JSON file into a pandas dataframe? I've got an API call from a sports data website, but I'm missing something obvious

velvet thorn Jan 20, 2020, 2:14 AM

#

@leaden bobcat do elaborate

jolly briar Jan 20, 2020, 2:29 AM

#

when merging two df's with

pd.merge(df1, df2, on='shared_column', how='left')

i expect there to be the same number of rows after as there are in df1, this isn't usually the case

#

📎 4-pandas-merge-inner-outer-left-right-1024x771.png

#

how is it possible to create more rows than the original df when doing a left join, i figured the max would be the number of rows in the original data

#

when i instead do df1.join(df2, how='left') i get the expected result so idk

jolly briar Jan 20, 2020, 2:50 AM

#

how to replace a section of a dataframe?

Say i have a df with columns A,B,C and where C == 4 i want to replace C with the value of B.

I'm not sure how to do this without a bunch of for loops

#

i just created a different vector and used that to overwrite

worn stratus Jan 20, 2020, 3:11 AM

#

select the section you want to replace with .loc or .iloc and just assign it

#

dataframe['column_to_change'] = new_col

#

I think should work

jolly briar Jan 20, 2020, 3:14 AM

#

yeah i actually did that - just new_col was a replace with np.where

#

cheers
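
A minimal sketch of the two approaches mentioned above (a masked .loc assignment and np.where), assuming a toy frame with columns A, B, C. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): replace C with B wherever C == 4.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30], "C": [4, 5, 4]})

# Option 1: boolean mask + .loc, assigning only the selected cells.
via_loc = df.copy()
mask = via_loc["C"] == 4
via_loc.loc[mask, "C"] = via_loc.loc[mask, "B"]

# Option 2: np.where builds the whole new column in one vectorised call.
via_where = df.copy()
via_where["C"] = np.where(df["C"] == 4, df["B"], df["C"])

print(via_loc)
print(via_where)
```

Both avoid explicit for loops; which one reads better is mostly a matter of taste.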

velvet thorn Jan 20, 2020, 3:40 AM

#

@jolly briar doesn't seem right, got example?

jolly briar Jan 20, 2020, 3:57 AM

#

@velvet thorn re what, the joins?

#

It's UK here so not now 🙃
But this seemed to be the case

#

As in, I used merge and got way more. Used join and got less

velvet thorn Jan 20, 2020, 4:03 AM

#

how do you know you got more?

coral yoke Jan 20, 2020, 4:14 AM

#

he compared his rows before and after. can confirm, when he posted before it showed some weird shit

velvet thorn Jan 20, 2020, 4:17 AM

#

hm.

#

shouldn't be the case

#

you did pass on= to join, right?

coral yoke Jan 20, 2020, 4:46 AM

#

backpropagation is a general thing for all NNs, what is your question?

#

wait, let me get this right, you're trying to make your own algorithm for backpropagation when the one used is used for a reason?

#

i'm not sure if any of us here honestly know enough about the deep math behind these algorithms that have been around for years for a reason. if you'd like to learn them i would definitely just suggest learning about what's there and how it works instead of trying to replace it

#

recreating the core of how any of our NNs work isn't exactly common as far as i'm aware. making your own network? sure yeah, but not recreating the essence

#

i support you totally btw, power to you if you can understand that stuff cause fuckin hell i'm not going through that much

#

i'm afraid i won't be able to help much though, past just understanding how backprop works 😛

velvet thorn Jan 20, 2020, 5:58 AM

#

@coral yoke I would disagree that this is “deep” math...

coral yoke Jan 20, 2020, 5:58 AM

#

👌

velvet thorn Jan 20, 2020, 5:58 AM

#

@keen geyser how do you intend to normalise the weights?

#

and which articles are you looking at?

coral yoke Jan 20, 2020, 5:59 AM

#

i honestly didn't need your ping just for a disagreement, but sure

chilly geyser Jan 20, 2020, 6:20 AM

#

@strange stag lol I didn't know you only wanted the highest Amazon. Your original said all amazons and every other lower than this Amazon

#

@strange stag Lol now I think I get what you want
You should have just said this at the very start

in all cases you just want the highest Amazon price and the lowest non-Amazon price
So basically all non-Amazons would be the same 🤦

#

@keen geyser Would help if you could share the articles you are using. CNN backprop should be ok-ish material

#

@velvet thorn btw looking at your thing. Why do you need to rename "Amazon" to "amazon"?

velvet thorn Jan 20, 2020, 6:59 AM

#

don’t need to

#

but if you want to look @ the intermediate result it’s slightly more comprehensible to have a name for that level of the index

strange stag Jan 20, 2020, 7:02 AM

#

@chilly geyser you still there?

#

ah, confused u with gm

#

ye... my apologies... i have difficulty explaining what i want...so

#

@chilly geyser

velvet thorn Jan 20, 2020, 7:08 AM

#

@strange stag in general for this kind of data wrangling question

strange stag Jan 20, 2020, 7:08 AM

#

@velvet thorn how do i merge the two location max / location min?

velvet thorn Jan 20, 2020, 7:08 AM

#

providing expected output helps everyone out a lot

strange stag Jan 20, 2020, 7:08 AM

#

i shall try to do so in the future

velvet thorn Jan 20, 2020, 7:09 AM

#

on phone so I can’t write code, but you want a groupby

strange stag Jan 20, 2020, 7:09 AM

#

aggs = df.groupby([(df['location'] == 'Amazon').rename('amazon'), 'upc']).agg(['min', 'max'])
df2k = pd.concat([
    aggs.xs(False, level=0)[[('location', 'min'), ('price', 'min')]], 
    aggs.xs(True, level=0)[[('location', 'max'), ('price', 'max')]]
])
df2k.groupby('upc').head(len(df2k)).sort_values(by='upc')

#

merging the upc 8888359036 for example

#

thought grouping by upc would do this however

#

i need an agg yeah?

#

expected output of the first two lines merged would be

8888359036, Amazon, BestBuy, 14.23, 9.99

#

third, fourth, fifth, sixth, would be dropped (later with df2k.dropna())

#

seventh upc merged would be

8421134096783, Amazon, Target, 15.24, 4.99

#

and for extra credit, dropping any rows where price max is not greater than or equal to twice the price min

#

i can probably figure this out tho 😛
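
One possible way to get the one-row-per-upc output described above, as a sketch rather than the approach used in the thread. It assumes columns named upc, location and price, and the sample rows are invented. Note that agg(['min', 'max']) reduces location and price independently, so ('location', 'min') is the alphabetically first store name, not necessarily the store with the lowest price; idxmax / idxmin keep each price paired with its own row.

```python
import pandas as pd

# Invented sample data (assumption): two UPCs, several sellers each.
df = pd.DataFrame({
    "upc": [1, 1, 1, 1, 2, 2],
    "location": ["Amazon", "Amazon", "BestBuy", "Target", "Amazon", "Target"],
    "price": [14.23, 12.00, 9.99, 5.00, 8.00, 4.99],
})

is_amazon = df["location"] == "Amazon"

# Row holding the highest Amazon price for each upc.
amazon = df.loc[df[is_amazon].groupby("upc")["price"].idxmax()].set_index("upc")
# Row holding the lowest non-Amazon price for each upc.
other = df.loc[df[~is_amazon].groupby("upc")["price"].idxmin()].set_index("upc")

# Inner join keeps only UPCs that have both an Amazon and a non-Amazon row.
wide = amazon.join(other, lsuffix="_amazon", rsuffix="_other", how="inner")

# "Extra credit": keep rows where the Amazon price is at least twice the other price.
wide = wide[wide["price_amazon"] >= 2 * wide["price_other"]]
print(wide)
```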

velvet thorn Jan 20, 2020, 7:16 AM

#

actually try groupby fillna

strange stag Jan 20, 2020, 7:17 AM

#

what would be the value?

#

basically perfect besides nan values

#

what is this doing in aggs? .rename('amazon'), 'upc']

#

.stack() o.O

#

how do i filter these though....

strange stag Jan 20, 2020, 7:59 AM

#

nvm on the stack...not what im looking for

strange stag Jan 20, 2020, 8:16 AM

#

ah, u were right

#

.fillna(method='ffill')

strange stag Jan 20, 2020, 8:31 AM

#

nvm

chilly geyser Jan 20, 2020, 9:08 AM

#

Uhm so did it work o,o

jolly briar Jan 20, 2020, 11:48 AM

#

@velvet thorn like i just inner joined two dfs with (1173, 14) and (17000,40) (ish) dimensions respectively and got a df with 2.5 million rows back

#

that just makes zero sense to me for an inner join

velvet thorn Jan 20, 2020, 11:49 AM

#

that seems like an outer join...

jolly briar Jan 20, 2020, 11:49 AM

#

right, but it's not

velvet thorn Jan 20, 2020, 11:49 AM

#

do you have the code?

jolly briar Jan 20, 2020, 11:49 AM

#

i do but i can't share anything

#

i mean, i can 100% say this has happened with an inner join

#

@velvet thorn

📎 unknown.png

#

this is a left

#

inner

📎 unknown.png

#

outers the same dim, so ive no idea 🤔

velvet thorn Jan 20, 2020, 11:56 AM

#

the reason

#

is duplicates.

jolly briar Jan 20, 2020, 11:56 AM

#

hrm, i'm not sure what to do there then

velvet thorn Jan 20, 2020, 11:57 AM

#

>>> import pandas as pd
>>> left = pd.DataFrame([[0, 'a'], [0, 'b']], columns=['a', 'b'])
>>> right = pd.DataFrame([[0, 'c'], [0, 'd']], columns=['a', 'b'])
>>> pd.merge(left, right, on='a')
   a b_x b_y
0  0   a   c
1  0   a   d
2  0   b   c
3  0   b   d

jolly briar Jan 20, 2020, 11:57 AM

#

because i think this duplicate information is valuable - it would be grouped by

velvet thorn Jan 20, 2020, 11:57 AM

#

left and right both have 2 rows

#

but the merged result has 4 (this is the default inner join; how='left' gives the same 4 rows here)

#

quite clear why, I think

jolly briar Jan 20, 2020, 11:59 AM

#

@velvet thorn yeah, it's giving all combinations

velvet thorn Jan 20, 2020, 11:59 AM

#

yeah, so that's why you have more rows in your case too

jolly briar Jan 20, 2020, 11:59 AM

#

yes, i'm confused about what to do with the data now :/

#

the duplicates are for geographic regions , eh

#

thanks tho - that explains it 👍
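
A small sketch of how duplicate keys can be spotted before merging, reusing the left / right frames from the example above. The validate argument of pd.merge makes pandas raise instead of silently multiplying rows; whether "many_to_one" is the right expectation for the real data is an assumption.

```python
import pandas as pd

left = pd.DataFrame([[0, "a"], [0, "b"]], columns=["a", "b"])
right = pd.DataFrame([[0, "c"], [0, "d"]], columns=["a", "b"])

# Count rows on the right whose key appears more than once.
dup_rows = int(right["a"].duplicated(keep=False).sum())

# A left join still returns every matching combination: 2 x 2 = 4 rows.
merged = pd.merge(left, right, on="a", how="left")

# validate="many_to_one" asserts that keys are unique on the right side.
try:
    pd.merge(left, right, on="a", how="left", validate="many_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(dup_rows, len(merged), raised)
```

If the duplicates are meaningful (e.g. one row per region), aggregating the right frame to one row per key before merging is one way to keep the left row count stable.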

jolly briar Jan 20, 2020, 2:52 PM

#

given a df with columns A,B where A are groups and B are count values, how to find the column B percentages per group?

so if i have

i would want to have column B_perc as [0.5, 0.5, 0.8, 0.2]

i get that in this case the data sums to 100, this can't be assumed ( so *0.01 isn't ok)

velvet thorn Jan 20, 2020, 2:57 PM

#

>>> df.groupby('A').transform(lambda g: g / g.sum())
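
The example table from the question did not survive in the log, so the sketch below uses invented counts chosen to give the B_perc values quoted above. It divides B by the per-group sum broadcast back to each row, which is the same idea as the transform call.

```python
import pandas as pd

# Invented counts (assumption); groups do not need to sum to 100.
df = pd.DataFrame({"A": ["x", "x", "y", "y"], "B": [10, 10, 40, 10]})

# transform("sum") returns the group total repeated on every row of the group.
df["B_perc"] = df["B"] / df.groupby("A")["B"].transform("sum")
print(df)
```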

lapis sequoia Jan 20, 2020, 3:09 PM

#

I always have difficulty understanding groupby

#

@velvet thorn you have shown the table, it got 2 columns and 4 rows. We can see how it looks. I always wondered how this looks:

df.groupby('A')

Because Python never shows how it looks in reality

velvet thorn Jan 20, 2020, 3:11 PM

#

it doesn't really make sense

#

to have a raw groupby

#

for reasons I can explain another time, since I'm going to bed soon

lapis sequoia Jan 20, 2020, 3:11 PM

#

oh..

velvet thorn Jan 20, 2020, 3:11 PM

#

have you read the pandas groupby docs?

lapis sequoia Jan 20, 2020, 3:11 PM

#

good night then 🙂

velvet thorn Jan 20, 2020, 3:11 PM

#

they might help

lapis sequoia Jan 20, 2020, 3:12 PM

#

Pandas groupby docs, been reading from last 4 days

#

I can read C++ technical definition from the ISO standard

#

But can't understand groupby >:-\

velvet thorn Jan 20, 2020, 3:14 PM

#

hm

#

okay real quick

#

imagine this

#

you have this, right

#

and say you want the mean of B for each unique value of A

#

you could do this:

#

for a in df['A'].unique():
    print(df.loc[df['A'] == a, 'B'].mean())

#

and this gets each subset of the DataFrame

#

for which A has a specific unique value

#

and then performs some transformation on it

#

this is equivalent to df.groupby('A')['B'].mean()

#

@lapis sequoia make sense?
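
To make "how df.groupby('A') looks" a bit more concrete, here is a sketch with an invented two-column frame. Iterating over the groupby yields (key, sub-DataFrame) pairs, and the explicit loop and the groupby call give the same means.

```python
import pandas as pd

# Invented data (assumption) with two groups in column A.
df = pd.DataFrame({"A": ["x", "x", "y", "y"], "B": [1, 3, 10, 20]})

# A groupby object is lazy; iterating it shows the pieces it will work on.
groups = {key: sub for key, sub in df.groupby("A")}
for key, sub in groups.items():
    print("group:", key)
    print(sub)

# The manual loop from the message above...
loop_means = {a: df.loc[df["A"] == a, "B"].mean() for a in df["A"].unique()}
# ...and the equivalent groupby one-liner.
grouped_means = df.groupby("A")["B"].mean()
print(grouped_means)
```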

lapis sequoia Jan 20, 2020, 3:18 PM

#

So far, no.

but I will try to understand while you sleep

jolly briar Jan 20, 2020, 3:38 PM

#

@velvet thorn thanks again - I didn't know about transform , i used apply with a lambda function, is there any reason to reach for one over the other?

#

ah i see it's late for you, no worries

#

https://stackoverflow.com/questions/27517425/apply-vs-transform-on-a-group-object

Stack Overflow

Apply vs transform on a group object

Consider the following dataframe:

 A      B         C         D

0 foo one 0.162003 0.087469
1 bar one -1.156319 -1.526272
2 foo two 0.833892 -1.666304
3 bar three -2.026673 -0.

#

👍

oblique belfry Jan 20, 2020, 4:58 PM

#

What kind of graph is this?

📎 iu.png

plain turret Jan 20, 2020, 5:38 PM

#

i can imagine a regular algo

#

These vacuums use a navigation algorithm called VSLAM (or visual simultaneous localization and mapping)

#

https://en.wikipedia.org/wiki/Simultaneous_localization_and_mapping

Simultaneous localization and mapping

In navigation, robotic mapping and odometry for virtual reality or augmented reality, simultaneous localization and mapping (SLAM) is the computational problem of constructing or updating a map of an unknown environment while simultaneously keeping track of an agent's locatio...

#

according to wikipedia there are some algorithms that are open source

#

you could get some inspiration from this

#

i don't suggest anything i just googled :p

#

i would guess you would need some camera system and the processing power to treat it in real time

oblique belfry Jan 20, 2020, 6:45 PM

#

I wonder how well Reinforcement Learning would work in this situation.

jolly briar Jan 20, 2020, 7:42 PM

#

df.isna() will give me true / false for each cell based on whether it's nan or not, how can i select only rows which have some NA though?

chilly geyser Jan 20, 2020, 7:48 PM

#

Does df[df.isna().any(axis=1)] work?
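
A quick check of that expression on an invented frame: isna() gives a boolean frame, any(axis=1) collapses it to one boolean per row, and that mask selects the rows.

```python
import numpy as np
import pandas as pd

# Invented data (assumption): rows 1 and 2 each contain one missing value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# True for every row that has at least one missing cell.
has_na = df.isna().any(axis=1)
rows_with_na = df[has_na]
print(rows_with_na)
```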

strange stag Jan 20, 2020, 8:17 PM

#

alright yall... how do i merge rows by upcs?
For example, I have 2 rows with NaN values. The first row's missing values are found in the second row, and vice versa. However, a simple .fillna(method='ffill') does not work because the data is not perfect: not all upcs have 2 rows to make up for the NaNs

📎 unknown.png

#

📎 unknown.png
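
The screenshots are not available, so this is only a sketch under assumed column names. GroupBy.first() takes the first non-null value per column within each group, which collapses complementary rows into one without forward-filling across different upcs.

```python
import numpy as np
import pandas as pd

# Assumed layout: each upc has up to two half-filled rows.
df2k = pd.DataFrame({
    "upc": [111, 111, 222],
    "location_max": ["Amazon", np.nan, "Amazon"],
    "price_max": [14.23, np.nan, 15.24],
    "location_min": [np.nan, "BestBuy", np.nan],
    "price_min": [np.nan, 9.99, np.nan],
})

# One row per upc, each column taking its first non-null value in the group.
merged = df2k.groupby("upc", as_index=False).first()

# UPCs that never had a second row still contain NaN and can be dropped.
complete = merged.dropna()
print(complete)
```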

plain turret Jan 20, 2020, 8:34 PM

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html

sand gyro Jan 20, 2020, 8:35 PM

#

I used dropna, which drops rows with empty values, and isnull, which keeps rows with empty columns, to filter the dataframe, and it works: I am able to print both. Then I want to append them to previously created xlsx files

wb = Workbook()
ws = wb.active
wb.title = 'Contacts'
wb2 = Workbook()
ws2 = wb2.active
wb2.title = 'Contacts'

r1 = df.dropna(subset=['Firstname', 'Lastname', ('work_phones' or 'mobile_phones') or (('Work_City','Work_Street','Work_State','Work_Zip') or ('Personal_Street','Personal_City','Personal_State','Personal_Zip')) or ('Work_email' or 'Personal_email')])

r2 = df.loc[(df['Firstname'].isnull()) | (df['Lastname'].isnull()) | (((df['work_phones'].isnull()) & (df['mobile_phones'].isnull())) | (((df['Work_Street'].isnull()) | (df['Work_City'].isnull()) | (df['Work_State'].isnull()) & (df['Work_Zip'].isnull())) | (df['Personal_Street'].isnull()) | (df['Personal_City'].isnull()) | (df['Personal_State'].isnull()) | (df['Personal_Zip'].isnull())) & (df['Work_email'].isnull()) & (df['Personal_email'].isnull()))]

for r in dataframe_to_rows(r1, index=False, header=False):
   ws.append([r])

for r in dataframe_to_rows(r2, index=False, header=False):
    ws.append([r])
  
   

wb.save("Accepted Contacts.xlsx")
wb2.save("Rejected Contacts.xlsx")

However, when I try to add them to the excel files I get this error for r1

raise ValueError("Cannot convert {0!r} to Excel".format(value))

ValueError: Cannot convert ['Doe', 'Jane', nan, nan, nan, nan, '5678743546', 'j@greenbriar.com', '54 George street', 'Ridge Springs', 'VA', '25678', nan, nan, nan, nan, '3245687907', nan, nan, nan] to Excel

plain turret Jan 20, 2020, 8:36 PM

#

hmm i don't really understand what you're trying to do, but nan is not an excel character no?

#

if you want an empty value in excel/csv it should be "Jane",,,,"56787453"

sand gyro Jan 20, 2020, 8:37 PM

#

It needs to be column specific

plain turret Jan 20, 2020, 8:37 PM

#

,, is one column

sand gyro Jan 20, 2020, 8:38 PM

#

instead of nan I make it an empty string?

plain turret Jan 20, 2020, 8:38 PM

#

it would work but then you would have an empty string in your excel

#

so , "",

#

it probably doesn't matter, but sometimes, some excel macro doesn't consider empty string as blank value

sand gyro Jan 20, 2020, 8:41 PM

#


  File "<ipython-input-2-de3603ab2d77>", line 1, in <module>
    runfile('C:/Users/mosta/.spyder-py3/CRMnew.py', wdir='C:/Users/mosta/.spyder-py3')

  File "C:\Users\mosta\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
    execfile(filename, namespace)

  File "C:\Users\mosta\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/mosta/.spyder-py3/CRMnew.py", line 1311, in <module>
    ws.append([r])

  File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\worksheet\worksheet.py", line 644, in append
    cell = Cell(self, row=row_idx, column=col_idx, value=content)

  File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\cell\cell.py", line 133, in __init__
    self.value = value

  File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\cell\cell.py", line 239, in value
    self._bind_value(value)

  File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\cell\cell.py", line 222, in _bind_value
    raise ValueError("Cannot convert {0!r} to Excel".format(value))

ValueError: Cannot convert ['Doe', 'Jane', '', '', '', '', '5678743546', 'j@greenbriar.com', '54 George street', 'Ridge Springs', 'VA', '25678', '', '', '', '', '3245687907', '', '', ''] to Excel

#

This is not the problem

#

I don't know what is {0!r}?

strange stag Jan 20, 2020, 8:44 PM

#

@velvet thorn @chilly geyser @gilded harness

#

this is what im looking for, but this data is inaccurate (due to fillna)

📎 unknown.png

plain turret Jan 20, 2020, 8:45 PM

#

what is r in your code

#

@sand gyro ws.append([r]) -> is r already a list maybe

sand gyro Jan 20, 2020, 8:46 PM

#

It is in the for loop:

for r in dataframe_to_rows(r2, index=False, header=False):
    ws.append([r])

plain turret Jan 20, 2020, 8:47 PM

#

what type is this?

sand gyro Jan 20, 2020, 8:47 PM

#

When I print it

   Lastname Firstname          Company  ...  Personal_email  Note  Note_Category
1   Malcoun       Joe  8/28/2019 14:29  ...             NaN   NaN            NaN
4      None    Jordan              NaN  ...             NaN   NaN            NaN
5      None       NaN              NaN  ...             NaN   NaN            NaN
6  Zachuani     Reemo              NaN  ...             NaN   NaN            NaN
7    Suarez   Geraldo              NaN  ...             NaN   NaN            NaN

[5 rows x 20 columns]
  Lastname Firstname Company  ...  Personal_email  Note  Note_Category
0      Doe      Jane     NaN  ...             NaN   NaN            NaN
2  Ramirez    Morgan     NaN  ...             NaN   NaN            NaN
3    Burki     Roman     NaN  ...             NaN   NaN            NaN

[3 rows x 20 columns]

plain turret Jan 20, 2020, 8:48 PM

#

can you print(type(r)) in your loop ? maybe you pass something like [[your_row]]

sand gyro Jan 20, 2020, 8:50 PM

#

I print r1 and r2 before the loop

plain turret Jan 20, 2020, 8:50 PM

#

the openpyxl docs write it as:

#

for r in dataframe_to_rows(df, index=True, header=True): ws.append(r)

#

here you ws.append([r]) so you put list in list probably ?

#

(looked there: https://openpyxl.readthedocs.io/en/stable/pandas.html )

sand gyro Jan 20, 2020, 8:53 PM

#

What a stupid mistake by me. It took me days

plain turret Jan 20, 2020, 8:53 PM

#

happens

sand gyro Jan 20, 2020, 8:53 PM

#

Thank you very much @plain turret

plain turret Jan 20, 2020, 8:54 PM

#

you're welcome, sometimes you just need fresh eyes
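
A trimmed-down sketch of the fix discussed above, using an invented four-column frame instead of the real 20-column one. Two further points are assumptions about intent: inside a list, 'work_phones' or 'mobile_phones' simply evaluates to 'work_phones', so the "either column" rule is written as an explicit mask here, and the sheet title is set on the worksheet (ws.title) rather than on the workbook.

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# Invented contacts (assumption); None marks a missing value.
df = pd.DataFrame({
    "Firstname": ["Jane", None, "Roman"],
    "Lastname": ["Doe", "Jordan", "Burki"],
    "work_phones": ["5678743546", None, None],
    "mobile_phones": [None, None, None],
})

# Accept a row if both names are present and at least one phone is present.
has_name = df["Firstname"].notna() & df["Lastname"].notna()
has_phone = df[["work_phones", "mobile_phones"]].notna().any(axis=1)
accepted = df[has_name & has_phone]
rejected = df[~(has_name & has_phone)]


def to_workbook(frame, title):
    """Build an in-memory workbook with one sheet holding the frame's rows."""
    wb = Workbook()
    ws = wb.active
    ws.title = title
    for row in dataframe_to_rows(frame, index=False, header=False):
        ws.append(row)  # row is already a list; ws.append([row]) would nest it
    return wb


wb_ok = to_workbook(accepted, "Contacts")
wb_bad = to_workbook(rejected, "Contacts")
```

Each workbook can then be written with its own save call, for example wb_ok.save("Accepted Contacts.xlsx").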

crystal sluice Jan 20, 2020, 9:07 PM

#

hey guys, to learn data science, what subject should i focus first?

#

i'm good with math and statistics, i understand usability of data very well, but idk what to learn to work with data science

#

anybody could give me a north?

strange stag Jan 20, 2020, 9:09 PM

#

udacity

plain turret Jan 20, 2020, 9:09 PM

#

i'd just pick a book on the subject i want to explore? data science is super big

coral yoke Jan 20, 2020, 9:09 PM

#

subject? get used to the python libraries that are used most in the area such as pandas, numpy, etc.

strange stag Jan 20, 2020, 9:10 PM

#

scikitlearn perhaps as well depending on your preference

crystal sluice Jan 20, 2020, 9:11 PM

#

pandas, numpy and which others are used? so i can focus on this first

strange stag Jan 20, 2020, 9:12 PM

#

what is your goal?

crystal sluice Jan 20, 2020, 9:13 PM

#

i want to be able to get dataframes and work data to information, create information for decision making

coral yoke Jan 20, 2020, 9:13 PM

#

definitely just pandas and numpy then for that

strange stag Jan 20, 2020, 9:13 PM

#

pandas > numpy in priority

crystal sluice Jan 20, 2020, 9:14 PM

#

and let's suppose i want to make a little dashboard

#

to show data

#

in real time, as the database is working

strange stag Jan 20, 2020, 9:14 PM

#

still pandas, but then flask, django or something else

coral yoke Jan 20, 2020, 9:14 PM

#

flask to handle the automatic population of your table

crystal sluice Jan 20, 2020, 9:14 PM

#

hmmm, nice

strange stag Jan 20, 2020, 9:14 PM

#

django depending on the scale of the site

crystal sluice Jan 20, 2020, 9:14 PM

#

nice, thanks guys, helped a lot, i'll start right now

strange stag Jan 20, 2020, 9:15 PM

#

flask for smaller projects

crystal sluice Jan 20, 2020, 9:15 PM

#

@strange stag this is something i would ask too

coral yoke Jan 20, 2020, 9:15 PM

#

i've seen flask used on large projects as well. preference 😛

crystal sluice Jan 20, 2020, 9:15 PM

#

what is a small project and a large project? is based on data or views?

strange stag Jan 20, 2020, 9:15 PM

#

well yes, can happen, but generally that is not done

coral yoke Jan 20, 2020, 9:15 PM

#

whatever you say yeah

#

both generally georg

strange stag Jan 20, 2020, 9:15 PM

#

id start out in flask

coral yoke Jan 20, 2020, 9:16 PM

#

your traffic and how much you're handling

crystal sluice Jan 20, 2020, 9:16 PM

#

django > flask?

coral yoke Jan 20, 2020, 9:16 PM

#

no

strange stag Jan 20, 2020, 9:16 PM

#

management is different

coral yoke Jan 20, 2020, 9:16 PM

#

neither's better than the other

crystal sluice Jan 20, 2020, 9:16 PM

#

i tried to start with django

plain turret Jan 20, 2020, 9:16 PM

#

flask is easier to set up / less stuff to learn imo

coral yoke Jan 20, 2020, 9:16 PM

#

^

crystal sluice Jan 20, 2020, 9:16 PM

#

but it was really difficult to me

strange stag Jan 20, 2020, 9:16 PM

#

flask has more flexibility, django has more structure

coral yoke Jan 20, 2020, 9:16 PM

#

and flask is generally preferred starting off, even in businesses, as you only add what you need

crystal sluice Jan 20, 2020, 9:16 PM

#

flask worked very well for me

plain turret Jan 20, 2020, 9:16 PM

#

so to advance fast and get result i would prefer flask

crystal sluice Jan 20, 2020, 9:17 PM

#

nice

plain turret Jan 20, 2020, 9:17 PM

#

most of the stuff you'll learn can be transferred to django since i think they both work with templates

crystal sluice Jan 20, 2020, 9:17 PM

#

i have an idea i'm developing, it can get some size someday, but i'll start with flask

coral yoke Jan 20, 2020, 9:17 PM

#

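they both work with very similar template engines, yes (Flask uses Jinja2; Django ships its own template language with close syntax and can also be configured to use Jinja2)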

plain turret Jan 20, 2020, 9:18 PM

#

@void anvil seaborn has nice heatmaps that work with the output of pandas' DataFrame.corr if you want to plot them easily

crystal sluice Jan 20, 2020, 9:18 PM

#

sorry for any misspelling or word order, english is not my main language

plain turret Jan 20, 2020, 9:19 PM

#

you can still print on top

#

i think

coral yoke Jan 20, 2020, 9:19 PM

#

your english is fine georg, no worries!

plain turret Jan 20, 2020, 9:19 PM

#

i did this two years ago so i can't say for sure

#

you can with hmm

#

the keyword annot

#

i made another df with the p-value significances as * and plotted them on top

#

since you have correlation with color anyway

#

but you can mess with it

crystal sluice Jan 20, 2020, 9:24 PM

#

@coral yoke thank you!!

jolly briar Jan 20, 2020, 9:27 PM

#

anyone made use of yellowbrick?
it seems to have changed the output of seaborn after inputting it, i don't just mean style wise, but the actual data looks a bit different as though there's some kinda transformation or something... just wondering if anyone's noticed anything similar

plain turret Jan 20, 2020, 9:29 PM

#

ah i didn't no

#

i see they have ranks that's cool

jolly briar Jan 20, 2020, 9:31 PM

#

i always thought R plots were nice from regression models, seems that this has diagnostics now at least

#

😬

📎 unknown.png

plain turret Jan 20, 2020, 9:32 PM

#

what am i watching

jolly briar Jan 20, 2020, 9:32 PM

#

a horror

plain turret Jan 20, 2020, 9:32 PM

#

why do you have some sort of regression line with columns lol

strange stag Jan 20, 2020, 9:33 PM

#

anyone able to help with my previous q?

jolly briar Jan 20, 2020, 9:33 PM

#

yeah it's an odd one - it wasn't like that earlier @plain turret , i don't think 🤔

plain turret Jan 20, 2020, 9:34 PM

#

kinda what i get after i try every tutorial tbh

jolly briar Jan 20, 2020, 9:35 PM

#

i'm also getting test R2 consistently higher than training 🙃

#

so there's clearly something very wrong somewhere lol

jolly briar Jan 20, 2020, 10:03 PM

#

am i being thick or is drawing a horizontal line on a seaborn plot a bit of a faff

velvet thorn Jan 20, 2020, 10:25 PM

#

get the Axes

#

ax.axhline

#

@crystal sluice you can consider Dash for that

#

also, another reason to use transform is that it better signals your intent
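
A short illustration of that point, with invented data: an aggregation returns one value per group, while transform returns one value per original row with the original index, so the result can be assigned straight back as a column. apply is more general, and the shape of its output depends on what the function returns.

```python
import pandas as pd

# Invented data (assumption).
df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1, 3, 10]})

# Aggregation: one row per group, indexed by the group key.
reduced = df.groupby("A")["B"].sum()

# Transform: same length and index as df, group total repeated per row.
broadcast = df.groupby("A")["B"].transform("sum")

print(reduced)
print(broadcast)
```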

crystal sluice Jan 20, 2020, 10:37 PM

#

@velvet thorn what is dash

velvet thorn Jan 20, 2020, 10:48 PM

#

it’s a framework meant for data analysis

#

integrates with pandas

#

Google “dash python”

halcyon venture Jan 20, 2020, 11:11 PM

#

do I have to use an old version (1.8) of Anaconda if I need to use python 2.6?

#

I don't want it to interfere with the current version installation

jolly briar Jan 20, 2020, 11:14 PM

#

for two models A,B, if mse( A ) < mse( B ) yet mae( A ) > mae ( B ), how to choose the model based on these metrics?

lapis sequoia Jan 20, 2020, 11:50 PM

#

could anyone help me translate a function from intention into code? it's probably a bit of text to explain, would appreciate a PN

jolly briar Jan 21, 2020, 12:03 AM

#

@lapis sequoia what's a PN

lapis sequoia Jan 21, 2020, 12:04 AM

#

it's supposed to be a private message, but i see the acronym doesn't make sense in English haha

jolly briar Jan 21, 2020, 12:05 AM

#

either PM or DM would be the english for that @lapis sequoia , and i think you're better off just putting your problem into the channel as best as you're able to

lapis sequoia Jan 21, 2020, 12:06 AM

#

i'd spam the whole room, because it's a lot to explain 😕

jolly briar Jan 21, 2020, 12:07 AM

#

well, not sure what to say then i guess

lapis sequoia Jan 21, 2020, 12:09 AM

#

ok so i don't know how to explain the problem w/o context

#

i have a huge data set, it's about delays and delay prediction... i still need to engineer some features

#

in the tidy dataset there are columns for delays, train stations, train-line, stop sequence number and so on... what i'm working on right now is a directional index for every train line, to have a dummy variable in the regression part

#

my plan is to get a list of station acronyms sorted by their sequence of occurrence within a line, let's say LINE 1

#

which would look like this:

#

[(0, 'TKT'), (1, 'TKTO'), (2, 'TWD'), (3, 'TWER'), ... (21, 'TSRO'), (22, 'TGOL'), (23, 'TBO'), (24, 'THUB'), (25, 'TEHN'), (26, 'TGT'), (27, 'TNUF'), (28, 'THE')]

#

now i would want to find any match of any train event for the given LINE 1 where the station is in that list, and write the corresponding number into a new column

#

I'd have to do that for every train-line

#

when that column is finished i'd be able to check for every starting and ending train whether it goes from higher number to lower number or vice versa

#

why so complicated? because the dataset is complex and not every train of one specific train-line goes all the way from 0 to XX. some start later and stop earlier etc.

#

do you get it? 🤔

#

The procedure would have to be done for each of the 8 train LINEs to fill the entire column. So I would like to write some function or pipeline that does the same for every LINE. I can't just give every station abbreviation one specific number, because while the station abbreviations are "general", the corresponding number would be LINE-specific.

velvet thorn Jan 21, 2020, 1:33 AM

#

what is a train event?

#

@lapis sequoia

#

@jolly briar which is more important to you...?
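
A tiny numeric illustration of how the two metrics can disagree, using made-up predictions. MSE squares the errors, so a single large miss dominates it, while MAE weights all misses linearly; which one matters more depends on how costly large errors are for the problem.

```python
import numpy as np

# Made-up example (assumption): the true values are all zero.
y_true = np.zeros(4)
pred_a = np.array([2.0, 2.0, 2.0, 2.0])  # consistently off by a moderate amount
pred_b = np.array([0.0, 0.0, 0.0, 5.0])  # mostly exact, one large miss


def mse(y, p):
    """Mean squared error."""
    return float(np.mean((y - p) ** 2))


def mae(y, p):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - p)))


print("A:", mse(y_true, pred_a), mae(y_true, pred_a))
print("B:", mse(y_true, pred_b), mae(y_true, pred_b))
```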

lapis sequoia Jan 21, 2020, 1:55 AM

#

@velvet thorn there are 5 different train events:

1) departure of a train from its start station
2) arrival of a train at a stopover
3) a passing train
4) departure of a train at a stopover, and
5) arrival at its final destination

#

those are coded for example with 1) = 10, 2) = 20, ... 5) = 50 so you can find the specific events for every train and every LINE etc in the dataset... every day has like thousands of logged events... every minute of the day at every station etc.

velvet thorn Jan 21, 2020, 2:08 AM

#

hm

#

I see

#

that doesn't sound too hard, if I get what you mean

#

basically a join

lapis sequoia Jan 21, 2020, 2:10 AM

#

i don't think you get me

jolly briar Jan 21, 2020, 2:10 AM

#

are you able to post the example data @lapis sequoia ?

velvet thorn Jan 21, 2020, 2:13 AM

#

in general, posting sample data and expected results helps a lot.

lapis sequoia Jan 21, 2020, 2:17 AM

#

I'm a total beginner and not very used to discord either, so I simply don't know how to post that stuff properly

#

can I msg you @velvet thorn to clarify things?

velvet thorn Jan 21, 2020, 2:19 AM

#

post here please

lapis sequoia Jan 21, 2020, 2:21 AM

#

can you load it like that?

#

{'SERVICE_ID': {0: 29664277470, 1: 29664277470, 2: 29664277470, 3: 29664277470, 4: 29664277470}, 'TRAIN_ID': {0: 7087, 1: 7087, 2: 7087, 3: 7087, 4: 7087}, 'STOPSEQUENCE_NO': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'DS100': {0: 'TP', 1: 'TACH', 2: 'TACH', 3: 'TEZL', 4: 'TEZL'}, 'EVENT_TYPE': {0: 10, 1: 20, 2: 40, 3: 20, 4: 40}, 'Actual_Time': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-09-16 13:50:00'), 2: Timestamp('2017-09-16 13:51:00'), 3: Timestamp('2017-09-16 13:52:00'), 4: Timestamp('2017-09-16 13:53:00')}, 'Sched_Time': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-09-16 13:50:00'), 2: Timestamp('2017-09-16 13:50:00'), 3: Timestamp('2017-09-16 13:52:00'), 4: Timestamp('2017-09-16 13:52:00')}, 'LINE': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}, 'START_TIME': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-09-16 13:48:00'), 2: Timestamp('2017-09-16 13:48:00'), 3: Timestamp('2017-09-16 13:48:00'), 4: Timestamp('2017-09-16 13:48:00')}}

velvet thorn Jan 21, 2020, 2:22 AM

#

with a bit of effort

#

yes

lapis sequoia Jan 21, 2020, 2:23 AM

#

thanks... I'd do better if I knew how to... I just made a dict and printed it

jolly briar Jan 21, 2020, 2:23 AM

#

In [111]: from pandas import Timestamp

In [112]: d = {'SERVICE_ID': {0: 29664277470, 1: 29664277470, 2: 29664277470, 3: 29664277470, 4: 29664277470}, 'TRAIN_ID': {0: 708
     ...: 7, 1: 7087, 2: 7087, 3: 7087, 4: 7087}, 'STOPSEQUENCE_NO': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'DS100': {0: 'TP', 1: 'TACH',
     ...:  2: 'TACH', 3: 'TEZL', 4: 'TEZL'}, 'EVENT_TYPE': {0: 10, 1: 20, 2: 40, 3: 20, 4: 40}, 'Actual_Time': {0: Timestamp('2017
     ...: -09-16 13:48:00'), 1: Timestamp('2017-09-16 13:50:00'), 2: Timestamp('2017-09-16 13:51:00'), 3: Timestamp('2017-09-16 13
     ...: :52:00'), 4: Timestamp('2017-09-16 13:53:00')}, 'Sched_Time': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-0
     ...: 9-16 13:50:00'), 2: Timestamp('2017-09-16 13:50:00'), 3: Timestamp('2017-09-16 13:52:00'), 4: Timestamp('2017-09-16 13:5
     ...: 2:00')}, 'LINE': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}, 'START_TIME': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-
     ...: 09-16 13:48:00'), 2: Timestamp('2017-09-16 13:48:00'), 3: Timestamp('2017-09-16 13:48:00'), 4: Timestamp('2017-09-16 13:
     ...: 48:00')}}
     ...:

In [113]: pd.DataFrame.from_dict(d)
Out[113]:
    SERVICE_ID  TRAIN_ID  STOPSEQUENCE_NO DS100  EVENT_TYPE         Actual_Time          Sched_Time  LINE          START_TIME
0  29664277470      7087                1    TP          10 2017-09-16 13:48:00 2017-09-16 13:48:00     1 2017-09-16 13:48:00
1  29664277470      7087                2  TACH          20 2017-09-16 13:50:00 2017-09-16 13:50:00     1 2017-09-16 13:48:00
2  29664277470      7087                3  TACH          40 2017-09-16 13:51:00 2017-09-16 13:50:00     1 2017-09-16 13:48:00
3  29664277470      7087                4  TEZL          20 2017-09-16 13:52:00 2017-09-16 13:52:00     1 2017-09-16 13:48:00
4  29664277470      7087                5  TEZL          40 2017-09-16 13:53:00 2017-09-16 13:52:00     1 2017-09-16 13:48:00

lapis sequoia Jan 21, 2020, 2:23 AM

#

that looks good

#

thanks mate

#

so basically that's a very reduced dataset... usually there are like 30 more columns and millions of rows

velvet thorn Jan 21, 2020, 2:25 AM

#

when I said a bit I really meant a very tiny bit

#

like what @jolly briar did

#

that's perfectly fine, don't worry about it

lapis sequoia Jan 21, 2020, 2:25 AM

#

DS100 column is the abbreviation code for each station, so each "event" is at some station, at some point in time, on a specific LINE etc

jolly briar Jan 21, 2020, 2:25 AM

#

i only posted that for noobs future reference

lapis sequoia Jan 21, 2020, 2:36 AM

#

unfortunately the "STOPSEQUENCE_NO" column is not usable to make the directional index, as one and the same line can have a different number of stops e.g. one train goes the full way from A to Z, another only goes from C to K etc. depending on the time of day or weekday or whatever... and also it doesn't differentiate whether the train goes from A to Z or from Z to A (direction).

#

so my plan was to make a list for every LINE (1, 2, 3, ... , 8) that puts a number (1, 2, ...., 28) next to every station-abbreviation.
Like so:

#

[(0, 'TKT'), (1, 'TKTO'), (2, 'TWD'), (3, 'TWER'), ... (24, 'THUB'), (25, 'TEHN'), (26, 'TGT'), (27, 'TNUF'), (28, 'THE')]
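
A rough sketch of how such per-LINE lists could be attached to the events, assuming the station order of each line is known up front. The `line_stations` dict, the made-up order for line 2 and the column name `LINE_STATION_NO` are placeholders for illustration, not part of the real dataset:

```python
import pandas as pd

# Hypothetical station orders per LINE. Line 1 reuses the first few codes from
# the list above; line 2 is invented purely for illustration.
line_stations = {
    1: ["TKT", "TKTO", "TWD", "TWER"],
    2: ["TWD", "TKT"],
}

# Long lookup table: one row per (LINE, DS100) with its position on that line.
lookup = pd.DataFrame(
    [
        (line, code, pos)
        for line, codes in line_stations.items()
        for pos, code in enumerate(codes)
    ],
    columns=["LINE", "DS100", "LINE_STATION_NO"],
)

# Toy events; the real frame has many more columns.
events = pd.DataFrame({"LINE": [1, 1, 2], "DS100": ["TWD", "TKT", "TWD"]})

# A left merge keeps every event and adds the line-specific station number.
events = events.merge(lookup, on=["LINE", "DS100"], how="left")
print(events)
```

Because the merge key is (LINE, DS100), the same abbreviation can get a different number on each line.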

jolly briar Jan 21, 2020, 2:43 AM

#

it might be easier to manually edit a small section in excel as an example of what you want

lapis sequoia Jan 21, 2020, 2:45 AM

#

mh..

#

not easy to explain at all

#

to someone who isn't familiar with the data and the problems etc

jolly briar Jan 21, 2020, 2:46 AM

#

then make an example

lapis sequoia Jan 21, 2020, 2:47 AM

#

don't know how 🤔

jolly briar Jan 21, 2020, 2:48 AM

#

you put the data into excel and edit it by hand

lapis sequoia Jan 21, 2020, 2:48 AM

#

if i could program it in excel i could just google how to translate it to python, lol

jolly briar Jan 21, 2020, 2:48 AM

#

well if you can't do that you've very little hope of explaining it to someone else

timid vortex Jan 21, 2020, 2:51 AM

#

If I have a numpy.int64 object and I want to iterate over that specific column, how can I go about doing that?
I get an AttributeError when I try to do dataframe.apply(lambda x . . .)

#

📎 unknown.png

velvet thorn Jan 21, 2020, 2:56 AM

#

uh.

#

so basically

#

if I understand you correctly

#

you want to convert the values in the last column to 1 if the original value is 2, and 0 otherwise?

#

@timid vortex

timid vortex Jan 21, 2020, 2:58 AM

#

yea

velvet thorn Jan 21, 2020, 2:58 AM

#

hm

#

is there a reason

timid vortex Jan 21, 2020, 2:58 AM

#

the last column doesn't have a label

velvet thorn Jan 21, 2020, 2:58 AM

#

df.iloc[:, -1] = (df.iloc[:, -1] == 2).astype(int)

#

.columns accesses the column names

#

also, avoid apply if you can

#

IMO it promotes lazy (and inefficient) thinking

timid vortex Jan 21, 2020, 2:59 AM

#

How should I properly go about this

velvet thorn Jan 21, 2020, 2:59 AM

#

is that the sklearn cancer dataset?

#

it should have a label...

timid vortex Jan 21, 2020, 2:59 AM

#

it's the Wisconsin cancer dataset

#

doesn't have labels

velvet thorn Jan 21, 2020, 3:00 AM

#

breast cancer, yes?

timid vortex Jan 21, 2020, 3:00 AM

#

yeah

velvet thorn Jan 21, 2020, 3:00 AM

#

hm

#

that's not right

#

but anyway you can rename the column, so

#

anyway the code I provided should work for you

#

tell me if it doesn't

timid vortex Jan 21, 2020, 3:02 AM

#

I guess it did...wow

#

Don't understand iloc and astype(int)

#

thank you so much though

#

just for the future, instead of using apply, what should I do instead

#

if I want to change all elements in a column

#

additionally, if I want to change the labels from just being a list of numbers, how could I do that?

📎 unknown.png

jolly briar Jan 21, 2020, 3:05 AM

#

if it's a single column you can use replace( )

#

i think that's a done thing , maybe there's something better
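
A small sketch of both ideas, assuming a frame whose columns are just numbered (as you get when a CSV is read without a header row). The toy values and the new column names are made up for illustration:

```python
import pandas as pd

# Toy frame with integer column labels 0, 1, 2.
df = pd.DataFrame([[5, 1, 2], [3, 1, 4]])

# Replace values in a single column: 2 becomes 1 and 4 becomes 0.
df[2] = df[2].replace({2: 1, 4: 0})

# Swap the numeric labels for readable names (hypothetical names).
df = df.rename(columns={0: "clump_thickness", 1: "cell_size", 2: "label"})
print(df)
```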

velvet thorn Jan 21, 2020, 3:06 AM

#

.iloc is an indexer

#

basically, you can specify which rows and which columns you want, in that order

#

: means all

#

so basically I said - get me all the rows from the last column (because -1)

#

then I compared them elementwise to 2

lapis sequoia Jan 21, 2020, 3:07 AM

#

Do you get it now? @velvet thorn @jolly briar

📎 directionalindex.PNG

velvet thorn Jan 21, 2020, 3:07 AM

#

which returns results of either True or False

timid vortex Jan 21, 2020, 3:07 AM

#

yeah

velvet thorn Jan 21, 2020, 3:07 AM

#

the last part, .astype, converts True to 1 and False to 0

#

which is the same logic as yours
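
The same one-liner split into its three steps, as a sketch on a made-up frame (the values are not from the real dataset):

```python
import pandas as pd

# Toy frame with unlabeled (numbered) columns; the last one holds 2s and 4s.
df = pd.DataFrame([[5, 1, 2], [3, 1, 4], [8, 7, 2]])

last = df.iloc[:, -1]         # all rows (":") of the last column (-1)
is_two = last == 2            # elementwise comparison, gives True/False
as_int = is_two.astype(int)   # True becomes 1, False becomes 0

df.iloc[:, -1] = as_int       # write the result back into the last column
print(df)
```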

timid vortex Jan 21, 2020, 3:08 AM

#

ahhhhhh

#

that's amazing

velvet thorn Jan 21, 2020, 3:08 AM

#

the reason to avoid apply is that apply is generally just a big for loop, which means you iterate over each value in turn.

#

very quickly, but still one at a time

#

whereas if you do an == comparison, it's vectorised, which basically means that pandas (through numpy) runs the whole comparison in compiled code over the entire array, and can use certain special instructions in your CPU to perform multiple operations at once

#

tl;dr: apply is slower.
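
One way to check this on your own machine is a quick timing sketch like the one below. The series size and repeat count are arbitrary, and the exact numbers will vary by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# 100,000 random integers between 0 and 4 (arbitrary test data).
s = pd.Series(np.random.default_rng(0).integers(0, 5, size=100_000))

# Both approaches produce the same 0/1 values.
via_apply = s.apply(lambda x: 1 if x == 2 else 0)
via_vector = (s == 2).astype(int)

# Time each approach over a few repeats.
t_apply = timeit.timeit(lambda: s.apply(lambda x: 1 if x == 2 else 0), number=3)
t_vector = timeit.timeit(lambda: (s == 2).astype(int), number=3)
print(f"apply: {t_apply:.4f}s  vectorised: {t_vector:.4f}s")
```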

timid vortex Jan 21, 2020, 3:09 AM

#

I'll remember this

#

thank you!

velvet thorn Jan 21, 2020, 3:10 AM

#

lastly, if you have a finite number of source values

#

look into map.
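
A minimal sketch of `Series.map` with a dict. The 2/4 codes and their benign/malignant meaning are assumed here for illustration, so check them against your copy of the dataset:

```python
import pandas as pd

# Toy label column with a small, fixed set of source values.
labels = pd.Series([2, 4, 2, 4])

# Each source value is looked up in the dict; values missing from the dict
# would come out as NaN.
names = labels.map({2: "benign", 4: "malignant"})
print(names)
```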

jolly briar Jan 21, 2020, 3:12 AM

#

if i have

2015 : a = 40%
2016 : a = 45%
2018 : a = 44%

what would an uplift model look like for predicting this years percentage?

lapis sequoia Jan 21, 2020, 3:18 AM

#

@lapis sequoia so i want to do 2 things. First write that GREY column on the far right. I can't just simply give any DS100 abbreviation a unique number, it has to be line specific. LINE 1 can have a 1st station, and so can LINE 2, ..., LINE X. The 1st station will always have a "1" in that column for every LINE. But a train can also start at the 28th station and go to 5th or the 1st (backwards direction).

#

The excel screenshot should give an idea

#

the second problem would be to code the function right below the table in the screenshot.

#

if df.LINE_STATION_NO[df.EVENT_TYPE == 10] < df.LINE_STATION_NO[df.EVENT_TYPE == 50], then the train for example starts at station 5 of that LINE and maybe goes to station 20. Because 5 < 20, the direction is then defined as +1. However, if it was going from station 20 to station 5, the directional index would be -1, because the train is going backwards.
Why the numbers 5 and 20 in the example? Because not every train is serving all the stations from 1 to 28. Some only serve sections in between.
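
A possible sketch of that second step, assuming `LINE_STATION_NO` already exists, that each service has exactly one EVENT_TYPE 10 row (first stop) and one EVENT_TYPE 50 row (last stop), and that `SERVICE_ID` identifies one trip. The toy values and the `DIRECTION` column name are made up:

```python
import numpy as np
import pandas as pd

# Toy data: service 1 runs from station 5 to 20, service 2 from 20 to 5.
df = pd.DataFrame({
    "SERVICE_ID": [1, 1, 1, 2, 2],
    "EVENT_TYPE": [10, 20, 50, 10, 50],
    "LINE_STATION_NO": [5, 12, 20, 20, 5],
})

# Station number of the first and last stop, indexed by service.
start = df[df["EVENT_TYPE"] == 10].set_index("SERVICE_ID")["LINE_STATION_NO"]
end = df[df["EVENT_TYPE"] == 50].set_index("SERVICE_ID")["LINE_STATION_NO"]

# +1 if the station number increases along the trip, -1 if it decreases.
direction = np.sign(end - start).astype(int)

# Copy the per-service value onto every event row of that service.
df["DIRECTION"] = df["SERVICE_ID"].map(direction)
print(df)
```

Computing the direction once per service and mapping it back avoids looping over the rows, which matters with millions of rows.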

crystal sluice Jan 21, 2020, 4:40 AM

#

guys, is it really that hard to configure git on vscode?

#

i've been struggling for like 2 hours

#

i have my github account, installed 3 hundred thousand extensions on vscode and i'm not having success

lapis sequoia Jan 21, 2020, 4:54 AM

#

Hi everyone
fairly simple question here
I'm trying to create a graph to show the univariate distribution of my training data (the target values)
how can I do this effectively?
I've tried doing sns.distplot(y, hist=False, rug=True), but the graphs before and after oversampling+undersampling remain the same. In other words, it doesn't seem to properly represent my dataset
also, the target values are continuous

shadow quiver Jan 21, 2020, 6:55 AM

#

Does anyone have a simple explanation of what a graph in Tensorflow means?

lapis sequoia Jan 21, 2020, 6:58 AM

#

if you dont need tensorflow as a hard requirement.. I would suggest you drop it and move on..

#

really hard to accept.. but I wish I had done that a year ago.. it's really a waste of time because you can't iterate and scale as fast as you can on other frameworks

velvet thorn Jan 21, 2020, 9:09 AM

#

@shadow quiver a graph is basically a way to represent the flow of data through mathematical operations.

lapis sequoia Jan 21, 2020, 11:03 AM

#

Pandas groupby example: df.groupby('points').points.count(). Here "df" has 17 columns. When you group by the "points" column using groupby(), what happens to the rest of the columns? Where do they exist?

#

I know groupby() does not change the original dataset, it is operating on a copy. What does that copy look like? A mashup of 2 columns while the other 15 do not change?

velvet thorn Jan 21, 2020, 11:09 AM

#

no

#

I think

#

you are focusing too much on the idea of the groupby being something concrete

#

think of it as an incomplete instruction.

#

okay, for example, if I tell you "go by car", the very natural question you would ask is "go where?"

#

what that groupby does, conceptually, is separate df into a number of dataframes, and in each dataframe the values of points are all the same.

#

however, because this is an expensive operation, when you just execute df.groupby('points'), all that happens is that pandas stores your instruction for later execution

#

because how exactly the groupby is performed will depend on what you want to do with it.
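
A small sketch of that idea on a made-up three-column frame (the real one has 17 columns). Iterating over the groupby object shows the conceptual split, and every piece still carries all of the columns:

```python
import pandas as pd

# Toy frame; column names other than "points" are invented.
df = pd.DataFrame({
    "points": [90, 90, 85],
    "country": ["US", "FR", "IT"],
    "price": [20.0, 35.0, 15.0],
})

grouped = df.groupby("points")   # nothing has been aggregated yet

# One sub-frame per distinct value of "points", each with all columns.
pieces = {key: sub for key, sub in grouped}
print(pieces[90])

# Only now does the "instruction" get completed with an actual operation.
counts = grouped["points"].count()
print(counts)
```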

lapis sequoia Jan 21, 2020, 12:20 PM

#

hmmm ... conceptually, is separate 'df' into a number of dataframes, and in each dataframe the values of points are all the same

#

this is good

lapis sequoia Jan 21, 2020, 1:45 PM

#

dataframe.groupby().count() returns -- "Count of values within each group"
dataframe.groupby().size() returns -- "Number of rows in each group"

What's the difference between these 2?

velvet thorn Jan 21, 2020, 1:54 PM

#

count ignores nulls, size doesn't @lapis sequoia
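
A tiny sketch of the difference on made-up data with one missing value:

```python
import numpy as np
import pandas as pd

# Group "a" has two rows, one of them with a missing value.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, np.nan, 3.0]})

sizes = df.groupby("group").size()             # rows per group, nulls included
counts = df.groupby("group")["value"].count()  # non-null values per group

print(sizes)
print(counts)
```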

lapis sequoia Jan 21, 2020, 3:33 PM

#

See you tomorrow @velvet thorn .. good night, will spend some time with Dale Carnegie's book

oblique belfry Jan 21, 2020, 4:13 PM

#

I dunno if this is the best place for this question, but....

How would you normalize an audio waveform? I am working on an audio classification problem. I know normalizing data is a good practice, but I am not sure if one should do it for waveforms.

plain turret Jan 21, 2020, 4:15 PM

#

https://en.m.wikipedia.org/wiki/Audio_normalization

Audio normalization

Audio normalization is the application of a constant amount of gain to an audio recording to bring the amplitude to a target level (the norm). Because the same amount of gain is applied across the entire recording, the signal-to-noise ratio and relative dynamics are unchanged...

#

This ?

#

Or removing noise ?

oblique belfry Jan 21, 2020, 4:36 PM

#

That.

#

I just want the amplitudes to be consistent among samples.
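
A minimal sketch of peak normalization with numpy, assuming each sample is already loaded as a 1-D array of floats. The function name `peak_normalize` and the target peak of 1.0 are choices made for this example, not a standard:

```python
import numpy as np

def peak_normalize(wave, target_peak=1.0, eps=1e-12):
    """Scale a waveform by one constant gain so its largest |amplitude| hits target_peak."""
    wave = np.asarray(wave, dtype=np.float64)
    peak = np.max(np.abs(wave))
    if peak < eps:
        # Silent clip: there is nothing to scale, return it unchanged.
        return wave.copy()
    return wave * (target_peak / peak)

# Toy example: a quiet sine wave with peak amplitude around 0.1.
quiet = 0.1 * np.sin(np.linspace(0, 2 * np.pi, 100))
loud = peak_normalize(quiet)
print(np.max(np.abs(quiet)), np.max(np.abs(loud)))
```

Since the gain is constant over the clip, the shape of the waveform is unchanged and only the overall level moves.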

#

https://github.com/google/gin-config

Also, what are your thoughts on this library by google?

GitHub

google/gin-config

Gin provides a lightweight configuration framework for Python - google/gin-config

alpine stream Jan 21, 2020, 5:28 PM

#

Hi guys! I have a question.
I have conversations between a customer and an agent (without punctuation). They contain phrases from several categories of promises that the agent gave to the customer (call back, make an appointment, etc.). So far the labelling has been done manually. Altogether there are 12 categories. Now I'm thinking of creating an algorithm for this, and of doing the task in two steps.

In the first step, I need to create an algorithm that can find an end and a beginning of all promises. This algorithm has to insert a start tag and an end tag.
The second step is to create a classifier that would label a promise to the necessary categories.

As I understand, the second step is well known and is called text classification. But for the first step, I could not find any articles or GitHub repositories. I think it is an important NLP task and there must be information on it. Maybe there are approaches that solve both steps at the same time?

proud iron Jan 21, 2020, 6:06 PM

#

@alpine stream here is a very detailed guide on speech recognition, there are some helpful APIs and documentation to them. Even if you don't want to use them it is useful to see how they function. https://realpython.com/python-speech-recognition/

The Ultimate Guide To Speech Recognition With Python – Real Python

An in-depth tutorial on speech recognition with Python. Learn which speech recognition library gives the best results and build a full-featured "Guess The Word" game with it.

#

@alpine stream in particular it seems to me that those guys are doing something very close to what you are describing. https://wit.ai/getting-started

#

https://wit.ai/

#

Guys, how can one make his own speech recognition model and train it well on multiple languages? The point of that is to avoid Google's API which has a file size limit. 🙂

oblique belfry Jan 21, 2020, 6:12 PM

#

@proud iron I read a few papers showing how transformer networks like BERT and GPT-2 worked well in translation scenarios. Might want to start there. This isn't my expertise though so...def want to read up more on that.

austere oar Jan 21, 2020, 6:36 PM

#

Question: How do I return a javascript object from a python function (after scraping some data from different websites) then putting them back together (in HTML)

oblique belfry Jan 21, 2020, 6:53 PM

#

I think returning JSON would be the easiest.

#

What’s the use case? Like...a Flask app and some JS front end?

austere oar Jan 21, 2020, 6:57 PM

#

yeah it's a Flask App

#

Use MongoDB with Flask templating to create a new HTML page that displays all of the information that was scraped from the URLs above.

Start by converting your Jupyter notebook into a Python script called scrape_mars.py with a function called scrape that will execute all of your scraping code from above and return one Python dictionary containing all of the scraped data.

Next, create a route called /scrape that will import your scrape_mars.py script and call your scrape function.
    Store the return value in Mongo as a Python dictionary.

Create a root route / that will query your Mongo database and pass the mars data into an HTML template to display the data.

Create a template HTML file called index.html that will take the mars data dictionary and display all of the data in the appropriate HTML elements. Use the following as a guide for what the final product should look like, but feel free to create your own design.

oblique belfry Jan 21, 2020, 7:22 PM

#

return one Python dictionary containing all of the scraped data
Just means a JSON object.
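
A rough sketch of the dictionary-to-JSON step using only the standard library. The keys and values inside `scrape` are placeholders, not the real scraped fields. In a Flask route, `flask.jsonify` would take care of the serialising part:

```python
import json

def scrape():
    # Placeholder data; the real function would fill this in from scraping.
    return {
        "news_title": "Example headline",
        "image_urls": ["https://example.com/a.jpg"],
    }

data = scrape()               # plain Python dictionary
payload = json.dumps(data)    # JSON text that a JS front end can parse
restored = json.loads(payload)
print(payload)
```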

upbeat jetty Jan 21, 2020, 8:02 PM

#

Semi-repost from career channel. What are the essential skills to break into healthcare/pharma data science? Data scientist positions I've seen usually revolve around economics: banking, marketing, etc.

austere oar Jan 21, 2020, 8:44 PM

#

Ah okay it's JSON, thankfully

oblique belfry Jan 21, 2020, 9:57 PM

#

https://arxiv.org/pdf/1905.11946.pdf
A decent paper discussing the depth, width, and resolution of ConvNets.

lapis sequoia Jan 22, 2020, 1:28 AM

#

Can somebody give me a quick tip on how to write columns by checking if-then conditions?
Like: If "HourOfDay" >= 6 and <= 9, then write "NewColumn"=1, otherwise 0.

#

maybe @velvet thorn ?

#

what are you trying to do

#

and what do you mean write columns..

lapis sequoia Jan 22, 2020, 1:47 AM

#

@lapis sequoia I'm working on a dataset, currently adding features for the predictive regression. I want to add multiple columns with dummy variables

velvet thorn Jan 22, 2020, 1:48 AM

#

hm

lapis sequoia Jan 22, 2020, 1:48 AM

#

in this case it's going to be a "morning peak" dummy variable (I'm working on delay prediction)

velvet thorn Jan 22, 2020, 1:48 AM

#

assuming the column is called HourOfDay (bad practice IMO, should be snake case)

#

the simplest way to do it is df['new_column'] = ((df['hour_of_day'] >= 6) & (df['hour_of_day'] <= 9)).astype(int)
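
An equivalent sketch using `Series.between`, which includes both endpoints by default. It keeps the original column name `HourOfDay`, and the `morning_peak` name and toy hours are just for illustration:

```python
import pandas as pd

# Toy hours around the 6 to 9 window.
df = pd.DataFrame({"HourOfDay": [5, 6, 8, 9, 10]})

# True where 6 <= HourOfDay <= 9, then converted to 1/0.
df["morning_peak"] = df["HourOfDay"].between(6, 9).astype(int)
print(df)
```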

lapis sequoia Jan 22, 2020, 1:49 AM

#

you can do that together..