#data-science-and-ml
@plain jungle AI/ML is a very broad and rapidly advancing field. Don't worry if you do not find a "clear point of beginning".
I started by initially doing some digital image processing which led to machine vision.
Thank you
Andrew Ng's course is good. It starts from the ground up. This means lots of mathematics which might put some people off. You can do machine learning even without comprehending all of the mathematics.
@plain jungle This video might help you at some point https://www.youtube.com/watch?v=FmpDIaiMIeA
Find the rest of the How Neural Networks Work video series in this free online course:
https://end-to-end-machine-learning.teachable.com/p/how-deep-neural-networks-work
A gentle guided tour of Convolutional Neural Networks. Come lift the curtain and see how the magic is don...
Hi guys
I need to run a test with ResNet152. As far as I know, it is too time-consuming to train, especially on a PC.
My question is: is there a pre-trained ResNet that I can run on my dataset? If yes, how long would it take?
I have a face dataset with 1999 portrait images
Are you using Tensorflow or Pytorch (or something else)?
@acoustic scaffold Tensorflow
If I recall correctly, there were pretrained ResNets for TensorFlow 1.12 a year ago
Do you have a link to download them?
And do you have any idea how long it would take to run on 1990 images?
I want to get results as fast as possible, to submit my thesis
Here might be clues https://github.com/tensorflow/models/tree/master/official/r1/resnet
@acoustic scaffold thank you
No problem
I've a vector of postcodes that I want to convert to the geographical regions they're within, so I want to lower their resolution.
What's the best way to go about this? not sure if there's a google maps approach or something
I'm sure there's a google phrase i'm missing to get info on this... given postcode I want to get region 🤔
Whereabouts are the postcodes? Global?
european
https://postcodes.io/ does exactly what you're looking for in the UK at least - theres probably something similar for europe as a whole
Free Postcode API for Addresses in Great Britain
https://getaddress.io/ this looks like it might do EU. But a web API like this is definitely what you're looking for. is uk
You might have to do something a bit awkward like getting lat/long from one api, then using another api to look up the region
getAddress.io is a simple JSON API to lookup UK postal addresses by postcode.
this seems to want a house number as well, i don't have that information, just postcode
this is uk as well i think 🤔
You can request on that site without a house number. But yeah it is uk https://developers.google.com/maps/documentation/geocoding/intro should work. Even if it doesn't, postcode lookup api comes up with a bunch of different stuff
looking at geocoding atm
gah, lookup of italian postcodes just returns american stuff 😦
My guess is you can add a param for country into the address info
Yeah just reading through the docs atm
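A minimal sketch of the "add a param for country" idea, assuming the Google Geocoding API linked above. The `components` filter with `country:` and `postal_code:` is part of that API; the helper names `build_geocode_url` and `extract_region`, the placeholder key, and the hand-written `sample` response are assumptions made for illustration only, and no request is actually sent here.

```python
from urllib.parse import urlencode

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(postcode, country_code, api_key):
    """Build a Geocoding API request restricted to one country (hypothetical helper)."""
    # Component filters are joined with "|", e.g. "postal_code:00184|country:IT"
    components = f"postal_code:{postcode}|country:{country_code}"
    query = urlencode({"components": components, "key": api_key})
    return f"{GEOCODE_URL}?{query}"


def extract_region(response):
    """Return the first-level administrative area from a parsed JSON response, or None."""
    for result in response.get("results", []):
        for component in result.get("address_components", []):
            if "administrative_area_level_1" in component.get("types", []):
                return component["long_name"]
    return None


# Hand-written response fragment, used only to exercise the parser.
sample = {
    "status": "OK",
    "results": [{
        "address_components": [
            {"long_name": "00184", "short_name": "00184", "types": ["postal_code"]},
            {"long_name": "Lazio", "short_name": "Lazio",
             "types": ["administrative_area_level_1", "political"]},
            {"long_name": "Italy", "short_name": "IT", "types": ["country", "political"]},
        ]
    }],
}

url = build_geocode_url("00184", "IT", "YOUR_API_KEY")
region = extract_region(sample)
```

Restricting by country is what should stop an Italian postcode from matching a US ZIP code with the same digits.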
Hey. I have another question. What skills do you think are required to master (or at least learn some) Machine learning? Except the knowledge about programming language Ur going to use. Also do you know any good book/course/tutorial etc. to learn math required for Machine learning?
Without any libs like Tensorflow, sklearn etc.
Have you looked at the pinned messages in this channel? I think the r/LearnMachineLearning wiki might have what you're looking for.
Grit and perseverance.
df['x'].astype <doesn't work
df.x.astype <does work
why is that?
why is what
as type what.. did you try
did you pass an argument and check the output
@jolly briar
Yes
One worked the other didn't,I thought both these indexing methods were analogous
was just with int, codes gone now unfortunately
typo somewhere?
hrm, maybe... i'm not sure now the code has gone... thought I'd bumped into some kind of df[x] df[[x]] thing ala R, all good
I'm like 95% certain it was a typo or something
i think if it's between me and pandas being wrong i'm willing to raise my hand
anyone using Gym-retro?
What's a good way to get started with computer vision stuff?
i have many csv files with different separators, i want to convert them to all be comma separated... is there a straightforward approach to this?
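One possible approach to the mixed-separator question, sketched with only the standard library: let `csv.Sniffer` guess the delimiter, then write the rows back out with commas. The function name `normalise_delimiter` and the list of candidate delimiters are assumptions, and sniffing can guess wrong on very short or irregular files, so treat this as a starting point rather than a guaranteed fix.

```python
import csv
import io


def normalise_delimiter(text, candidates=",;\t|"):
    """Rewrite delimited text as comma-separated, guessing the input delimiter."""
    # Sniff only the start of the file; restrict guesses to the candidate characters
    dialect = csv.Sniffer().sniff(text[:4096], delimiters=candidates)
    rows = csv.reader(io.StringIO(text), dialect)
    out = io.StringIO()
    # The writer quotes any field that itself contains a comma
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


semicolon_text = "a;b;c\n1;2;3\n4;5;6\n"
tab_text = "x\ty\n1\t2\n3\t4\n"
converted_semicolon = normalise_delimiter(semicolon_text)
converted_tab = normalise_delimiter(tab_text)
```

For real files, read each file's contents into a string, pass it through the function, and write the result back out.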
Hi All,
Has anyone worked on NLP , Information Retrieval and building a search engine
Any YT links or other references to build an intelligent search engine based on data in local DB is appreciated
Pls tag me when you answer. Thanks
@jagged raven the screenshot seems to show it is still downloading. What is the issue? Maybe you can try with sudo and pip3
It's stuck there for a couple of minutes now.
Any insights on how to extract melody from a song using python?
Hello I have a question
How do you extract an image link:
https://astrogeology.usgs.gov/search/results?q=hemisphere+enhanced&k1=target&v1=Mars -> https://astrogeology.usgs.gov/search/map/Mars/Viking/cerberus_enhanced
USGS Astrogeology Science Center Astropedia search results.
Mosaic of the Cerberus hemisphere of Mars projected into point perspective, a view similar to that which one would see from…
and that would also mean going to every other search result on the first link
Fun data science story.
This week I was at the 100th Annual Meeting of the American Meteorological Society. Whilst at one of the Python symposia, we had just been introduced to the SPORK project, which uses machine learning to track supercell thunderstorms (and can potentially be used to predict tornadoes). I mentioned I wanted to use it in conjunction with a library called Py-ART for operational forecasting purposes, and a friend next to me started complaining at length about how slow Py-ART is.
Anyway, the lead developer of Py-ART was sitting right next to him on the opposite side.
Awkward...
Did he say anything?
Oh boy did he!
He was also leading the panel on which said friend was presenting his research!
After some serious backpedaling he invited us to a data science reception with the other representatives from Argonne.
Oh, and on my other side one of the matplotlib devs was sitting there giggling at Dave’s sudden realization that you do not trash talk popular python weather data science libraries at a python weather data science symposium.
https://ams.confex.com/ams/2020Annual/webprogram/10PYTHON.html link to the symposium for the curious
has anyone ever done predictive modelling just using uplift?
idk if it's even classed as predictive modelling... perhaps forecast is a better term
Something that I wish was possible -
an excel sheet with the dataframe I'm working on that updated live
how to convert a group of values into the percentages
so if I have a dataframe and there are groups G=a,b,c,d, so I .groupby('G'). Within a I have a1 = 15, a2=80, a3=90, so that after the group by operation i want to have values of a1=0.08, a2 = 0.43, a3=0.49, and similarly for the other groups
i just did groupby().sum() then merged that output with the original, then computed percentage from there
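The sum-then-merge step can usually be collapsed into one line with `transform`, which returns the group total aligned to every original row. A small sketch with made-up data; the column names `G`, `value` and `share` are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "G": ["a", "a", "a", "b", "b"],
    "value": [15, 80, 90, 10, 30],
})

# transform("sum") broadcasts each group's total back onto its rows,
# so the division is row-aligned and no merge is needed
df["share"] = df["value"] / df.groupby("G")["value"].transform("sum")
```

Each group's `share` values then add up to 1.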
@frail flower that is hilarious.
Hello everybody, I'm having some trouble with gspread_pandas
I can open the spread, I have read permissions, I have the right values for row 1 if I directly look at them, but trying to make into a DF... doesn't work
Any clue? Not finding anything on google or related.
It actually works on other spreadsheet, but for this one (where I have only read access) it doesn't work
I have a pandas question
I have 3 dataframes (channel, video, comment)
Column mapping is:
channel.channelId = video.channelId = comment.channelId
video.videoId = comment.videoId
I need to get a subset of each dataframe.
- Only channels that have a video and a comment
- Only videos that have a channel and a comment
- Only comments that have a channel and a video
I tried it with a double merge + inner join like
total_channels = total_channels.merge(total_videos, on='channelId')
.merge(total_comments, left_on=['channelId', 'videoId'], right_on=['videoId', 'channelId'])
But that only gives an empty dataframe with all columns from all 3 dataframes instead of only a channel subset that matches the requirements (at least 1 video and 1 comment)
I can't set a PK/FK when writing to SQL in pandas so my SQL solution takes ages, that's why I need to do it directly in Python/pandas to speed stuff up.
How can I achieve that?
@urban silo the order that you specify the join columns matter. you set left to be channelId and videoId, but right to be videoId and channelId, so pandas will try to join left.channelId with right.videoId
and, if you just want the channels' columns, I'd probably just go for total_videos[['channelId']] and total_comments[['channelId','videoId']] in the merge arguments directly.
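A sketch of one way to get the three subsets, under the assumption that "has a video and a comment" means the same (channelId, videoId) pair appears in all three frames. The toy data and the `_sub` names are made up; the key point is that the join columns are listed in the same order on both sides.

```python
import pandas as pd

channels = pd.DataFrame({"channelId": ["c1", "c2", "c3"], "name": ["A", "B", "C"]})
videos = pd.DataFrame({"channelId": ["c1", "c1", "c2", "c9"],
                       "videoId": ["v1", "v2", "v3", "v4"]})
comments = pd.DataFrame({"channelId": ["c1", "c9", "c3"],
                         "videoId": ["v1", "v4", "v7"],
                         "text": ["hi", "yo", "?"]})

# (channelId, videoId) pairs that exist as a video, belong to a known channel,
# and have at least one comment
pairs = (
    videos[["channelId", "videoId"]]
    .merge(channels[["channelId"]], on="channelId")
    .merge(comments[["channelId", "videoId"]], on=["channelId", "videoId"])
    .drop_duplicates()
)

# Filter each original frame down to the surviving keys
channels_sub = channels[channels["channelId"].isin(pairs["channelId"])]
videos_sub = videos.merge(pairs, on=["channelId", "videoId"])
comments_sub = comments.merge(pairs, on=["channelId", "videoId"])
```

Merging only the key columns keeps the intermediate frame small and avoids dragging every column of all three frames into the result.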
does anyone work with python+R? Or maybe another mixture ( i just use python+R though).
I'm wondering if you have a set way of arranging / organising your projects, code/docs etc
How in the world are those avatars you see if you go forward about 5 min made? https://youtu.be/UwsrzCVZAb8
Can A.I. make music? Can it feel excitement and fear? Is it alive? Will.i.am and Mark Sagar push the limits of what a machine can do. How far is too far, and how much further can we go?
The Age of A.I. is a 8 part documentary series hosted by Robert Downey Jr. covering the w...
Itβs craaaazzyyy
It has to be some kind of 3D program but how do they take and interact and move?
Really cool stuff
when I open up Jupyter Notebook it shows all the files saved on my C drive. is there any way to clean this up a bit? If I partition my hard drive and have it open up in the partition will it still be able to import python packages ? I'm using anaconda btw
i have a 3d binary matrix (M). i want to create a function that, given an axis (x, y, z), reduces the matrix along that axis with a logical OR. for example, if i choose the x axis, my output would be a 2d matrix (m) where m[y,z] == True means that there exists an x such that M[x,y,z] == True
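A logical-OR reduction along one axis is what `numpy.any` does, so a sketch of the requested function can be very short. The name `collapse_or` and the example array are assumptions; axis 0 is taken here to be the "x" axis.

```python
import numpy as np


def collapse_or(volume, axis):
    """Logical-OR reduce a boolean array along one axis."""
    return np.any(volume, axis=axis)


M = np.zeros((2, 3, 4), dtype=bool)
M[1, 2, 3] = True

# m[y, z] is True when some x exists with M[x, y, z] == True
m = collapse_or(M, axis=0)
```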
nvm figured it out , just made a partition
another question in numpy:
i want to have all 3 digits numbers containing "0,1,2,3"
like "000, 001, 002,003,010...333"
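One way to enumerate these is `itertools.product`, which yields every length-3 combination of the given digits in the order listed above. Whether strings or a numeric array is wanted is not stated, so this sketch builds both; the variable names are made up.

```python
from itertools import product

import numpy as np

# All 3-character strings over the digits 0-3, in lexicographic order
codes = ["".join(digits) for digits in product("0123", repeat=3)]

# The same combinations as a (64, 3) integer array, one digit per column
grid = np.array(list(product(range(4), repeat=3)))
```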
from this csv https://pastebin.com/RnF5rpXQ ive got this data in a pandas groupby object, and im trying to find the min/max price with the associated location, however .agg is giving me whacked results... trying to figure out why
(upon request, more csv data will be given, thus giving reason to groupby)
df2.agg({'price': ['max','min']}).reset_index()
https://pastebin.com/2xS53YTX
as you can see the very first upc is mismatched with max and min
the example i have given is the fourth upc
ending in 816
min should be 59.99 and max should be 127.00
it's difficult to see formatting here.. can you show the groupby dataframe another way
@strange stag
what is the min max on
price, or so i hope
what is the group by on?
upc
df.groupby('upc').agg({'price': ['min', 'max']}) then?
same as i have now, yes
for w/e reason that seems to work slightly better
first result seems off
4th is still technically wrong, but idky
those values shouldnt be there at all
ill post full csv, sec
@lapis sequoia
what is the upc
wdym
is it really common across these merchants
yes
ok.. lemme think
would moving the upc to an index, or making it a string help?
what dtype is it now
also, if you look at the upc 013964765816, the max is 127.00 and the min is 59.99, which is odd
sec
object
I think you should set the dtypes for these columns.. then do the groupby and aggregation
it'll work better
set upc to int.. and the price to float
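Why the dtype matters: when prices are stored as strings (dtype object), min and max compare them character by character, so "127.00" sorts before "59.99". A small sketch with made-up rows; only the upc and the 59.99 and 127.00 prices come from the discussion above. The upc is deliberately left as a string here because converting it to int would drop the leading zero, which is a choice of this sketch rather than the advice given.

```python
import pandas as pd

df = pd.DataFrame({
    "upc": ["013964765816"] * 3,
    "location": ["Amazon.com", "Walmart", "Macys"],
    "price": ["127.00", "59.99", "99.50"],  # strings, i.e. dtype object
})

# Lexicographic comparison: '1' < '5' < '9', so the "min" is "127.00"
as_text = df.groupby("upc")["price"].agg(["min", "max"])

# After conversion the comparison is numeric
df["price"] = pd.to_numeric(df["price"])
as_number = df.groupby("upc")["price"].agg(["min", "max"])
```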
ye, they're all objects
ight, ill try that
@lapis sequoia tyvm!!!! Been trying for hours to figure out what i was doing wrong!!! WOOO tyvm!!!! very nice to see that the data is in a working condition
np.. always here
@lapis sequoia do have one more operation that i hope you could help me with...
so i need to drop rows that amazons price is lower than the other prices (associated with the same upc)
im using this to remove no margins, possibly something similar for this other operation?
counts = df['upc'].value_counts()
df = df[~df['upc'].isin(counts[counts < 2].index)]
also, i think this is kinda weird df.groupby('upc').agg({'price': ['min', 'max']})
giving me url min/max, and upc min/max
try: df.groupby('upc').price.agg(['min', 'max'])
I dont understand your other question
drop what now?
so with that last code you just posted, i still need those other columns
cause i need to drop the rows that price_min is associated with the location "Amazon.com"
min correlates to a location, and max may correlate to another location
if min correlates to amazon, i need to drop the row
or in other words, if amazons price for the upc is lower than the other suppliers, i need to drop the row/upc
lower than ALL other suppliers*
hmm.. an easy way to do that would be, for each upc finding row indexes where the row meets your condition.. then dropping multiple rows together by index
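A sketch of that idea without looping over rows, assuming the rule is "drop a upc when Amazon's price is lower than every other supplier's". It compares Amazon's cheapest price per upc with the cheapest non-Amazon price per upc. The sample data and the names `amazon_min`, `other_min` and `result` are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "upc": ["1", "1", "1", "2", "2"],
    "location": ["Amazon.com", "Walmart", "Macys", "Amazon.com", "Walmart"],
    "price": [14.23, 8.95, 12.00, 5.00, 7.50],
})

is_amazon = df["location"] == "Amazon.com"

# Cheapest Amazon price and cheapest non-Amazon price for each upc
amazon_min = df[is_amazon].groupby("upc")["price"].min()
other_min = df[~is_amazon].groupby("upc")["price"].min()

# Keep a upc only when some other supplier undercuts Amazon.
# A upc with no other supplier compares against NaN and is dropped.
undercut = amazon_min > other_min.reindex(amazon_min.index)
keep = amazon_min[undercut].index
result = df[df["upc"].isin(keep)]
```

Here upc 2 is removed because Amazon (5.00) is cheaper than the only other supplier (7.50).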
tried df.groupby('upc')['price','location','url'].price.agg(['min', 'max'])
however, it says its already selected the columns, so im not sure how to keep the other columns when aggregating
df.groupby(['upc','price','location', 'url'], as_index=False).price.agg(______
so with the above code (multiindex), i just need to convert to a regular index, and then iterate through the df, and if "Amazon.com" in min, then drop the row
no iterating through dfs.. that's not efficient
find another way.. but you can do that as a last resort.. because I'm not able to think of a way right now
go through them by upc, check the condition, save the indices somewhere.. then drop by indices together
hehe kinda defeats the purpose :D
oops
you need to remove price
I made a mistake
df.groupby(['upc','location', 'url'], as_index=False).price.agg(__
which you should've caught btw.. lol
was kinda wondering why all were grouped, but well
it's early morning here.. still getting up.. if you have anything else feel free to ping here.. I'll respond later
nw im very grateful for your help, saved me so much time!
welllll nvm hehe, just shifted the data
@lapis sequoia you there?
!ask
Asking good questions will yield a much higher chance of a quick response:
β’ Don't ask to ask your question, just go ahead and tell us your problem.
β’ Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
β’ Try to solve the problem on your own first, we're not going to write code for you.
β’ Show us the code you've tried and any errors or unexpected results it's giving.
β’ Be patient while we're helping you.
You can find a much more detailed explanation on our website.
okay, so with the data previously, i have (credited to you) a group for each upc, which is shown in the picture above, however, the min and max is each url, this would be fine if i could sort the high ~> low of each upc grouped by the location, and now that i wrote this out, i think i might have a better idea on what i need to do
that and some sleep
so this is closer, however, i would still like to see the lowest, and the highest of each upc rather than the lowest/highest for each location
df.groupby(['upc','location', 'url'], as_index=False).price.agg(['min','max']).groupby('location', as_index=False).head(len(df))
@lapis sequoia
so using the first upc as an example
I would like to see amazon has a 14.23 price(max), and walmart has a 8.95 price (min)
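One way to get "which location has the lowest and the highest price for each upc" is `idxmin` and `idxmax`, which return the row labels of the extreme values so the matching location can be looked up. A sketch with made-up data; only the 14.23 and 8.95 prices for the first upc come from the message above.

```python
import pandas as pd

df = pd.DataFrame({
    "upc": ["1", "1", "1", "2", "2"],
    "location": ["Amazon.com", "Walmart", "Macys", "Amazon.com", "Walmart"],
    "price": [14.23, 8.95, 12.00, 5.00, 7.50],
})

grouped = df.groupby("upc")["price"]

# idxmin/idxmax give the index label of the cheapest/dearest row per upc
cheapest = df.loc[grouped.idxmin(), ["upc", "location", "price"]].set_index("upc")
dearest = df.loc[grouped.idxmax(), ["upc", "location", "price"]].set_index("upc")

# One row per upc: location_min, price_min, location_max, price_max
summary = cheapest.join(dearest, lsuffix="_min", rsuffix="_max")
```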
why didnt you say that
thought i did
have manually parsed like 200 lines so far lolz
cba to parse 1k lines manually a day
yeah I dont understand what you're saying.. but wait, let me write the code
df['max_val'] = df.groupby(['upc'])['price'].transform('max')
do you understand what's happening here
lemme try/think a bit, and ill brb
heres what ive comeup in a few mins
ahhh i forgot...
i need to see if any1 else has a lower price than amazon, not just min/max per upc....
sorry.........
ima see what i can do with this tho
oh now i need to drop
so to answer your question @lapis sequoia i think i understand what its doing
creating a new column, by grouping the upc and then performing a max transformation on the price column
or min
this has gotten me a bit closer to what i need (the above) and this
# Drop upcs that arent sold on amazon
df = df[df['upc'].isin(df.loc[df['location'] == "Amazon.com", 'upc'])]
df['max_val'] = df.groupby(['upc', 'location'])['price'].transform('max')
How are data science and AI related?
The vast majority of AI is trained (the process of the AI learning) using data collected from the real world. In order to work with AI you need to be able to understand the data, and what it means and how to work with it.
It's worth noting that Data Science, AI, Machine Learning, Data Mining and probably more are all pretty ill-defined semi-buzzwords that sometimes get used interchangeably
Hey guys! I'm trying to add a new column, or rather replace one with a mistake and therefore I'm trying to merge 2 datasets exactly like I have done dozens of times before in my project... however, this time, something is different and I just can't seem to figure out why.
Even though I tried both "left" and "inner" merge, I'm getting out more data in the merged set than in either of the two original sets
df1 1136388 rows × 31 columns
df2 1247995 rows × 8 columns
so with left join I should be getting 1,136,388 rows right?
however, what I'm getting is 1935106 rows × 32 columns (column number is correct, rows are waayyyyyyy off)
So in an attempt to find out what's going on, I used the indicator=True function of merge. And guess what... there is only one category [both] and no values that are only either from left or right data set.
How is this even possible? Any help would be much appreciated... this should only be a 2 minute problem, but it cost me 2 days already :[
I'm merging on 7 out of those 8 columns as they're identical in both sets, that's why 32 columns instead of 31 is the correct output for the merge.... but rows increased by almost 800,000 !?!?!? There are no NaNs and no duplicates... i absolutely cannot explain how this is even possible
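One common cause of this symptom, offered as a guess rather than a diagnosis: the rows are unique as whole rows, but the combination of key columns repeats on both sides, so each repeated key produces left-count times right-count rows, all flagged "both". A sketch with a single made-up key column; `duplicated(subset=...)` and the `validate` argument of `merge` are two ways to check for it.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "b"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "b"], "y": [10, 20, 30]})

# 1 row for "a" plus 2 x 2 rows for "b": 5 rows from a 3-row left frame
merged = left.merge(right, on="key", how="left", indicator=True)

# Count right-hand rows whose key combination is not unique
dup_keys_right = int(right.duplicated(subset=["key"], keep=False).sum())

# validate raises MergeError if the right-hand keys are not unique
try:
    left.merge(right, on="key", how="left", validate="many_to_one")
    unique_on_right = True
except pd.errors.MergeError:
    unique_on_right = False
```

With seven key columns, pass all seven names to `subset` to run the same check.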
if anyone has any experience or understanding of RNNs i'd love to talk whenever you're free. currently doing a project for gun recognition in images and video. just trying to perfect my classifier before working too hard on the object detection factor
Well...why do you need RNNs? I mean...what are you trying to do? RNNs and CNNs can be used for similar problems. If you are doing gun recognition, seems like an object detection problem.
unless all of the research papers are uninformed, you need an R-CNN or similar
CNN for the quick classification, RNN for the object detection
RNN is meant for object detection in my case. i've not seen any other network types used
unless you know something i'm missing
@oblique belfry
Yolo v3 for Object Detection....
All CNNs. Faster inference than R-CNN.
I used it to train on a custom dataset.
Are you doing object detection or action recognition, or both?
https://pjreddie.com/darknet/yolo/
https://www.learnopencv.com/training-yolov3-deep-learning-based-custom-object-detector/
So...this guy wrote it in C. Trains very fast and inference time is fast. However, it is finicky to work with.
There are Keras, Tensorflow, and Pytorch ports. The Pytorch one was the most stable port.
honestly looking to use it for reference and still do my own
The hardest part is the input data. Each object detection algorithm has different formats of input data.
And....I get you wanna do your own. But, it is a solved problem.
i very well understand that
Okay.
You can use a RNN, but you don't have to.
Input data for object detection is tricky since you can either scale the width and height or just keep it as it is. You can do it all with RNNs. And, these model architectures are out there. I'd copy them.
Why reinvent the wheel if you do not have to?
my end goal isn't to have some pre-trained model returning images with all of its former trained classes filling the image. this is also still a ridiculously new field and while i did not look too far into yolo's latest model i now know it's the latest reach. i'm still not going to use something just handed to me. i'm looking to make my own like i said
You can train it yourself, from scratch and not use other people's weights. I trained it to locate a tennis ball in real time. Tried doing it myself and tried other methods out there, Yolo was the best. Even still, you are going to need a large corpus of labeled data of bounding boxes around the objects in questions. I would spend my time there.
But, good luck.
i know what i need data wise. i have 10k images self-collected and already 1k labeled by hand with labelimg. i'm not looking to locate tennis balls because somebody else did that already, i'm looking to do something myself from scratch to prove that i can to clients looking to hire me for this industry so i'm not going to just use something handed to me and say "look, i can use what anyone else can!"
i appreciate you pointing out that yolo wasn't what i thought it was but i feel like you're acting very high and mighty just because i don't want to use somebody else's work and you think i should. have a nice day
@coral yoke It's not high and mighty. Most people don't reinvent the wheel unless they have to. Unless you were trying to go into research, there are just many very good solutions to this problem out there.
And, you finally explained why doing it from scratch is so important to you. If I knew that before, I could have given you different answers.
i said feels like. and honestly, especially in this field, please don't give the answer of "just use what exists" to somebody asking how to make their own thing
not reinventing the wheel is pretty sound advice a lot of the time 🤔
it is, but it isn't always relevant
I would read the papers behind Yolo, R-CNN, Faster R-CNN, etc. They make interesting points on why they chose the architecture.
if you want to make a discord bot should i go tell you to use this server's bot instead of making your own?
it isn't always relevant
in this context perhaps leading with your reasoning would have made more sense, but all good
i've read some papers already tony
Choosing to reinvent the wheel is a great way of understanding how the wheel works
thank you charlie
That's not me telling you to copy them. Just the logic behind the choices might encourage you on your journey.
i understand that tony. that's why i was asking for people familiar with RNNs
I know....I was grouping them in. Yolo is one of the few famous strategies that is all CNNs. The rest are a mix between the two.
Are you wanting to run this on a live video stream?
no offense but i don't believe you're the person i'd be willing to give any more information to regarding this
again, thanks for pointing out my misunderstanding of yolo's architecture
Reinventing the wheel to learn is a great way to learn. But we didn't know you were trying to do that. Hence the miscommunication.
Okay. Well, good luck.
even without the learning purpose, i would definitely still make my own. especially if the project was specialized enough i would want full control of what was going on.
and most of it isn't for learning. i'm having to piece together the last bit of the object detection myself but the rest i mostly understand. it's for showing clients i understand
i'm trying to imagine billing someone and pricing in building everything from scratch lol
it's not the kind of clients you're imagining
cool
Got it. Next time, try to convey that up front. Not just when talking to me, but to other devs. There are gonna be others who will be confused at your request like I was.
I am upset that this convo got derailed so quickly. Because, this is the stuff that interests me.
I gotta ask....what kind of clients are you targeting?
again no offense, but never when speaking to any other developer in any part of any industry have they told me "use what exists." especially not ones in this discord, they seem to like to help you from scratch irregardless
and none of your business
lol
Alright. Just curious.
If you wanna impress them a bit more, look into image segmentation as well. Don't know if that would be relevant to you, but it would def be cool to show you did that by scratch too.
my timeframe doesn't allow any more than i have set
i've seen that already, thanks though
Gotcha. Wanted you to really impress them.
Has anyone had luck with graph neural networks?
Hi! Sorry, not sure if this is right channel for my problem. Where can I ask about data preprocessing for text clusterization?
Here probably
Ok, I don't even understand my task properly...
I want to cluster different text to k different authors.(k-means clustering)
My data is: different files with text and other things from different authors in json format,
It looks like this:
{
"author": "Tolstoy",
"date": "unknown",
"format": "unknown",
"text": "here is some short text by Tolstoy",
"title": "Anna Karenina",
"year": "unknown",
"lang": "ru"
}
Also there is already training data that consists of many dictionaries like this in json format too.
What do I need for k-means clustering? Do I only need "text" strings?
cluster different text to k different authors
Your task is to create k clusters of authors. Presumably this means that authors within each cluster are similar to each other in some way.
What do I need for k-means clustering? Do I only need "text" strings?
To cluster the text you'd probably need to make the 'text' into a format such that you can perform operations on them to talk about any kind of similarity or dissimilarity. There are different ways to do this, and I think you have been given raw book data, along with some meta data. It's honestly up to you to use just data and/or the metadata, as long as at the end of the clustering process, you have a good idea of what algorithms you used are doing
So can I only take those "text" values from data and put them all in one big list of texts(is this even right?) and then preprocess this list?
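A minimal sketch of that pipeline, assuming scikit-learn is available and that only the "text" field is used: parse the JSON records, collect the texts into a list, turn them into TF-IDF vectors, and cluster the vectors with k-means. The four records and `k = 2` are made up for illustration, and TF-IDF is one possible text representation, not the only one.

```python
import json

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up records shaped like the JSON shown above
raw = [
    '{"author": "A", "text": "the cat sat on the mat", "lang": "en"}',
    '{"author": "A", "text": "the cat chased the mouse", "lang": "en"}',
    '{"author": "B", "text": "stocks fell sharply on monday", "lang": "en"}',
    '{"author": "B", "text": "markets and stocks rallied on friday", "lang": "en"}',
]

records = [json.loads(line) for line in raw]
texts = [record["text"] for record in records]  # one big list of texts

# Each text becomes a sparse numeric vector that k-means can measure distances on
vectors = TfidfVectorizer().fit_transform(texts)

k = 2  # number of clusters, i.e. the assumed number of authors
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
```

The "author" field is not used for clustering, but it can be kept aside to check the clusters afterwards.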
Do you use the Anaconda environment?
that is like a software package
Me? No, I don't.
Anybody here
@lapis sequoia yes
@jolly briar do you activate it?
I've never used windows
@velvet thorn
x = pd.DataFrame({'index' : [5,6], 'blah' : ['a', 'b']})
print(f"""x.index : {list(x.index)}, x['index'] : {list(x['index'])}""")
this seems like a reasonable example of .v and ['v'] not being exactly the same
@jolly briar yup
this applies also to every other attribute that is already bound
e.g. min, max, groupby
yeah
I think I said "prefer __getitem__ access, because it works in more cases"
but if I didn't then I'm saying it now 
so they're not exactly the same, like running code with ipython vs python, people often say they're the same but it's different
because it is most correct to say that __getitem__ works everywhere __getattr__ does, and some places it doesn't, for the purpose of Series access
can't recall exactly what you said, just thought of it now though ( the index thing ), all good
can some one help me with this part of code
"from custom_layers.scale_layer import Scale"
i could not find documentation or an installation guide for this library in python
i am trying to implement ResNet152 using the following repository
https://github.com/flyyufelix/cnn_finetune/blob/master/resnet_152.py
How does an algorithm like KNN handle duplicate data? Meaning we have a set of data objects with identical attributes and the distance between these data objects is 0. Does it make sense to remove these duplicate points here or include it?
If we were to include duplicates, would it make sense to treat duplicate data points as one observation? Like if k=3 and n1 has 3 duplicates, n1', n1'' and n1''', then n1 would only have 1 nearest neighbor instead of 3.
I often get confused when making dataframes with rows, for some reason.
for example - pd.DataFrame( pd.factorize( data.var ) )
If i want this to create a dataframe with columns instead of rows how would I do that?
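`pd.factorize` returns a tuple of (codes, uniques), and passing a tuple to `pd.DataFrame` treats each element as a row. One option, sketched here with a made-up Series, is to unpack the tuple and pass a dict so that each piece becomes a column. For frames that are otherwise fine, `.T` transposes rows and columns.

```python
import pandas as pd

s = pd.Series(["red", "blue", "red", "green"], name="colour")

# codes: one integer per original value; uniques: the distinct values in first-seen order
codes, uniques = pd.factorize(s)

# Dict keys become column names, so this is column-oriented
by_row = pd.DataFrame({"colour": s, "code": codes})

# Lookup table from code back to the original value
lookup = pd.DataFrame({"code": range(len(uniques)), "colour": uniques})
```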
Hi! Can I use Random Forest to evaluate k-means clustering? does this make sense?
@lapis sequoia I don't understand what you mean when you say you are trying to "evaluate" k-means. I suspect the answer is no... Random Forest and k-Means both end up assigning a label to each data point, but Random Forest is a supervised classifier while k-Means is unsupervised clustering, so they differ in what they do and how they do it
k-means is unsupervised so I wanted to check clusters I got with RF or something
hey was hoping someone could help me with pandas, im trying to keep the amazon price for each upc, and drop others that are a higher price than amazon (for each upc)
if you need me to provide more information, in any way shape or form, please dont hesitate to ask!
@lapis sequoia As previously said, it doesn't make sense. k-means and RF are fundamentally different: RF does not produce clusters.
You can evaluate the clustering quality of each algorithm using metrics such as cluster purity, or compute/speed requirements, etc. and then compare the results from RF or from k-Means. Indeed, k-Means is likely to be superior in both fitting and prediction, while RF depends on the number of trees, as well as tree parameters. If RF does not produce significantly better clusters, then I would use k-Means.
But there are probably many different ways of generalising each k-Means, RF, and there would be other algorithms. What works might typically depend on your use case.
@strange stag So you want to conditionally drop depending on the price column? Is there only one Amazon.com under location or would there be multiple? If there is only one, you can grab the Amazon.com price, store it as a constant, then do a conditional slice using .map
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
import numpy as np
n_clusters = len(np.unique(y_train))
clf = KMeans(n_clusters = n_clusters, random_state=42)
clf.fit(X_train)
y_labels_train = clf.labels_
y_labels_test = clf.predict(X_test)
X_train = y_labels_train[:, np.newaxis]
X_test = y_labels_test[:, np.newaxis]
from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy is:',accuracy_score(y_test, y_pred))
Sorry
Can't I use RF for mapping labels like here?
I had my data labelled before but needed to use k means
@lapis sequoia You mean change clf to RF?
clf was for k means, I mean there the last part is RF used on data that was "produced" by k means. or? maybe I don't understand the last part of this code, where RF comes
@lapis sequoia That doesn't make sense, why would you use K-means then sequentially run random-forest on it?
Why would you fit an RF model after your K-means clustering algorithm?
@lapis sequoia Ok I think I get what your script is doing
@lapis sequoia Have you read https://scikit-learn.org/stable/modules/model_evaluation.html
I think you should just use the common metrics for evaluating the quality of the clusters out of K-means.
It doesn't make sense to 'evaluate' how good K-means is via RF. RF is itself another classifier that can result in classification errors on its own. If you're trying to do a meta-analysis of algorithmic analysis of either K-means outputs or RF-inputs then it makes sense, but it wouldn't make sense for the implied original problem of 'given N-datapoints and K-possible labels, what is the best way to separate and give each datapoint one of the K-possible labels?'
I wanted to make a confusion matrix in the end and I don't know how to make it without labels, the code is not mine, I just thought I found something similar, because in the end there is confusion matrix and classification report, and that's what I wanted from k-means. The initial data that I have already has labels and is actually more for classification tasks but I have to use it for k means
what is the script doing then?
@lapis sequoia
Um, how do you get the confusion matrix in the first place? In the first place, do you have a ground truth of classifiers?
Setting the K-means as a ground truth does not make sense
each upc should have an amazon price, if not multiple
To get a confusion matrix you need to say that a cluster X has common property related to its elements being members of X
not sure what u mean by multiple amazon.com under location tho
Unfortunately K-means only produces indices or rather centroids. You'd need to remap the centroids to get clusters of meaning
@strange stag Brb I'll give you a fake table
location can have maceys, walmart, home-depot, office-depot, or a few others
@chilly geyser i can give u a real 1 if u want
its fine idc
but yes, that is basically identical to the data i have now
id like to keep the 6th row and the 2nd
for upc==1
thank you @chilly geyser but do you understand what script I posted is doing?
@lapis sequoia It's running K-means, then setting it as a ground truth for RF to classify
@strange stag I'd look into conditional slicing with pandas. A very naive (aka slow) way to do it is to take subsets of each UPC value, then do the conditional
As for faster/simultaneous checking I'm not too sure, I've not used pandas other than for general things and I've never exactly needed it to be speed-optimised
hmm
will possibly be doing millions of rows per day
however, shouldnt be a problem for now
so something like df.groupby(['upc'])
Yeah my googling seems to imply that too
i understand i can do something like this (this is what im using to drop single suppliers corresponding to 1 upc)
counts = df['upc'].value_counts()
df = df[~df['upc'].isin(counts[counts < 2].index)]
so this selects a column, but not subsets for column values
so groupby would render subsets?
I'd try it, I'm not a pd expert here :>
do also have something like this
df1 = df[ df['location'] == "Amazon.com" ].drop_duplicates(subset='upc', keep='first')
think i should be using != instead but w/e
That keeps the Amazon.com stuff right?
lol TBH IDK what you're doing, but it seems you're doing ok
actually nvm it doesnt do anything
that was an attempt to drop the lower price amazon offers
@strange stag Are you doing this all in VSC or IDLE? I'd recommend something more interactive like Google Colab, or at least your own localhost Jupyter notebook if you think Google's snooping around your data.
going back to the beginning, just trying to get amazons high vs the lowest of others
im on a notebook atm
That way you can see how the pd dataframes are changing
Ah ok that's good
So you can quickly see stuff
yes
well, not really doing ok
still blind as a bat atm..
mind boggling me why i cant get amazons high price, and then the lowest price for each upc other than amazon
mk
this is better...
grouped = df.groupby(['upc'])
result = [g for g in list(grouped)]
might actually be able to work with this data
hey guys, are stackoverflow links allowed here?
@chilly geyser tyvm for suggesting subsets!
@strange stag I got the code if you want it, it's ugly and IDK if it scales
for _, y in df.groupby("upc"):
    amazon_min = y[y["location"] == "Amazon.com"]["price"].min()
    # print(y[y["location"] == "Amazon.com"]["price"].min())
    print(y[(y["location"] == "Amazon.com") | (y[y["location"] != "Amazon.com"]["price"] < amazon_min)])
grouped = df.groupby(['upc'])
result = [g for g in list(grouped)]
result_length = len(result)
new_df = result[0]
high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
for index in new_df.index:
    if new_df['price'][index] > amazons_price:
        new_df = new_df.drop(index)
Mine is still UPC-prints tho, I haven't done the dropping yet, mine is only a view
i like ur code is wayyyyy shorter tho..
Basically I get Amazon price minimum per UPC
erm, need the max
ah okay, thats good
I see
yes
You need the Amazon min rite?
no
max
could explain, but with regards to your earlier post of using real data
dw, i account for amazons lower price later
Umm I'm trying to make sure I can recreate df now
this is super sweet tho
Not sure how to get from the group-bys all the way back to the modified df
And I think df.append would be slow
however, i think my version is slightly better
e.g
or still urs
but this is a lot closer than i have been the past week
and ye, i just need to concat my new_df for each loop
i think im biased tho so
@strange stag Up to you, it's your project
my final code is this
keep_indices = []
for _, y in df.groupby("upc"):
    amazon_min = y[y["location"] == "Amazon.com"]["price"].max()
    COND = (y["location"] == "Amazon.com") | (y[y["location"] != "Amazon.com"]["price"] < amazon_min)
    keep_indices += y[COND].index.tolist()
# to get the subset just use loc
df.loc[keep_indices]
I'm using .max() now
wdym location?
?
df.loc
Basically I get a list of indices that match the condition
This index uses the original DF's index, so it will be fine
in fact I don't think I'm changing the original df
You only modify the original DF if you have to
lol for that I recommend using %%timeit
Also, not just this part by itself solo.
You need to do a %%timeit on your fullscript if you can
if someone worked with pandas and fuzzywuzzy check this question please https://stackoverflow.com/questions/59813111/remake-dataframe-based-of-fuzzywuzzy-matches
well, only got 1k lines atm so
Unless you are really really sure of your test-case and likely inputs and/or outputs
I see
The issue with %%timeit on just this portion is even if this part is faster, it might be because it's not evaluating certain parts
like list comprehension being stored as a generator, not being used
Ya, that's what I think too, but maybe pd has an internal magic for that too
I'm trying to grab just the indices, but TBH I'm not sure if it's faster
i think grabbing indices would be way faster, but im no expert
Anyway this is my result with play-data
Carrefour because....well, why not :^)
prices are literally from random. upc is choice(range(10)).
basically 1000 rows -> 967 rows, cutting off via Amazon max per upc
think my biggest improvement would be switching how im saving data tho
cause loading jsonlines to a df is really slow
df = pd.DataFrame()
with jsonlines.open(filename, 'r') as reader:
    for obj in reader:
        df = df.append(obj, ignore_index=True)
its like 1 second per 100 rows or something...
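A possible faster load, as a minimal sketch: `df.append` inside a loop copies the whole frame for every row (and `DataFrame.append` was removed in pandas 2.0), whereas `pd.read_json(..., lines=True)` parses a JSON Lines file in one call. The file name `offers.jsonl` and the sample rows are made up for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Made-up rows standing in for the scraped offers.
rows = [
    {"upc": 1, "location": "Amazon", "price": 9.99},
    {"upc": 1, "location": "Walmart", "price": 7.49},
    {"upc": 2, "location": "Amazon", "price": 15.24},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "offers.jsonl")  # hypothetical file name
    # Write one JSON object per line, i.e. the JSON Lines layout.
    with open(path, "w") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")

    # One call parses the whole file; no per-row append.
    df = pd.read_json(path, lines=True)

print(df)
```

If the records are already in memory, collecting them in a list and calling `pd.DataFrame(records)` once avoids the repeated copying in the same way.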
how do i do a %%timeit?
%%timeit is a JuPyteR magic. You put it at the top of the cell
ah
that code above is...
617 ms ± 4.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Something like this
wow... 10m lines would take 12 hours....
The 1+2 is so that I don't have a single line. You can actually just %timeit [SINGLE_LINE_CODE]
While %%timeit is for whole cell execution
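Outside a notebook, the standard-library `timeit` module gives a similar measurement. This is only a sketch; `build_squares` is a made-up stand-in for whatever code is being timed.

```python
import timeit


def build_squares(n=1000):
    """Made-up workload standing in for the code under test."""
    return [i * i for i in range(n)]


# Run the function 100 times per measurement and take 5 measurements.
runs = timeit.repeat(build_squares, number=100, repeat=5)

# The minimum is usually the least noisy estimate of the per-call cost.
best_per_call = min(runs) / 100
print(f"{best_per_call * 1e6:.1f} us per call (best of 5 runs)")
```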
ye...
@strange stag Lol I don't think you can just linearly extrapolate so easily, just try for a slightly larger subset rather than a unittest
im assuming the 617ms was used to create the df, and the 4.36 is for each line that its appending
The fact is, unittests are unittests for a reason, and integration testing is required
no idea what that means
think the above is giving me a ballpark of what to expect tho
Unit tests are for single things by themselves, while integration tests means you have multiple different things working together
It's common testing terminology
Well TBH IDK how much production-level code you're doing, and honestly personally I've never been involved in production-level stuff
##autopilot
got a LONG fkn ways to go tho
id say im 10% done
what would be better to save data than jsonlines?
for importing to pandas
well nvm
hmm
i'm wondering how to know what coordinate system i'm in wrt geographic data
@chilly geyser you still there?
@velvet thorn what about this, atm im getting a blank df for total_df
total_df = pd.DataFrame()
for x in range(result_length):
    new_df = result[x]
    high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
    new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
    try:
        amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
    except IndexError:
        continue
    price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
    for index in new_df.index:
        if new_df['price'][index] > amazons_price:
            new_df = new_df.drop(index)
    pd.concat([new_df, total_df])
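One likely reason `total_df` stays blank: `pd.concat` returns a new DataFrame and does not modify its inputs, so a bare `pd.concat(...)` call inside the loop throws the result away. A minimal sketch with made-up data, collecting the pieces in a list and concatenating once:

```python
import pandas as pd

# Made-up data, only to show the pattern.
df = pd.DataFrame({"upc": [1, 1, 2], "price": [3.0, 4.0, 5.0]})

# Pattern from the loop above: the concat result is never assigned.
total_df = pd.DataFrame()
for _, group in df.groupby("upc"):
    pd.concat([total_df, group])  # returns a new frame, which is discarded
still_empty = total_df.empty  # total_df was never reassigned

# Collect the per-UPC pieces first, then concatenate a single time.
pieces = [group for _, group in df.groupby("upc")]
total_df = pd.concat(pieces)
print(total_df)
```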
I feel a bit weak just looking at the loops
okay, maybe you can tell me what you want to do first?
mk, so i have a df with all the data and i am able to parse the data that i need with
grouped = df.groupby(['upc'])
result = [g for g in list(grouped)]
result_length = len(result)
new_df = result[0]
high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
for index in new_df.index:
    if new_df['price'][index] > amazons_price:
        new_df = new_df.drop(index)
however, having difficulties running this in a loop
because not all of my grouped subsets have the amazon bit
so amazon may not be a location when iterating through the df (group)
so that code does everything that i want except...
i cant figure out how to drop upcs that dont have an amazon location
so im grouping by upc, keeping amazons highest price, and dropping anything that is higher than that
@velvet thorn i was just thinking generally... i've just been merging some shapey stuff but i'm not too sure how to check that i did it correctly
new_df is when im seperating each upc into a new dataframe, and parsing it from here, and now im trying to add it back into a master dataframe
hm
okay, so first you want to drop entire groups with values of upc that don't have 'Amazon.com' in location, correct?
yes
df.groupby('upc').filter(lambda g: 'Amazon.com' in set(g['location']))
or, actually
df.groupby('upc').filter(lambda g: 'Amazon.com' in g['location'].unique())
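To see what that `filter` call does, here is a small sketch on made-up rows. It also shows an `isin`-based alternative that selects the same rows without calling a Python lambda per group; which one is faster on the real data is not something this example measures.

```python
import pandas as pd

# Made-up rows: UPC 2 has no Amazon.com offer and should disappear.
df = pd.DataFrame(
    {
        "upc": [1, 1, 2, 3, 3],
        "location": ["Amazon.com", "Walmart", "Walmart", "Amazon.com", "Target"],
        "price": [10.0, 8.0, 5.0, 20.0, 18.0],
    }
)

# Keep every row of each UPC group that contains at least one Amazon.com row.
kept = df.groupby("upc").filter(lambda g: "Amazon.com" in g["location"].unique())

# Alternative: find the UPCs that have an Amazon.com row, then select by membership.
amazon_upcs = df.loc[df["location"] == "Amazon.com", "upc"]
kept_alt = df[df["upc"].isin(amazon_upcs)]

print(kept)
```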
ok, so now that i have amazon only upcs, how do i concat the dfs?
new_dataframe = df.groupby('upc').filter(lambda g: 'Amazon' in g['location'].unique())
grouped = new_dataframe.groupby('upc')
result = [g for g in list(grouped)]
result_length = len(result)
total_df = pd.DataFrame()
for x in range(result_length):
    new_df = result[x]
    high_low_amazon_prices_index = new_df[1][new_df[1]['location'] == "Amazon"]['price'].sort_values(ascending=False).index
    new_df = new_df[1].drop(high_low_amazon_prices_index[1:])
    amazons_price = new_df['price'][high_low_amazon_prices_index[0]]
    price = new_df[new_df['location'] == "Amazon"]['price'].astype(float)
    for index in new_df.index:
        if new_df['price'][index] > amazons_price:
            new_df = new_df.drop(index)
    print(new_df)
    pd.concat([total_df, new_df])
uh
so now
you want all the rows where prices are lower than the highest Amazon price for that group, right?
yes
all the upcs with that/those conditions, yes
rows include upcs, so yeah
basically the high of amazon and the low of anywhere else
which line
basically the high of amazon and the low of anywhere else
df.groupby('upc').filter(lambda g: 'Amazon' in g['location'].unique())
this is getting all upcs that have an amazon price yes?
I would interpret "low" to mean "only the lowest value", not "everything lower than the highest Amazon value"
since it seems to me that there are multiple values of price for each value of location
low as the lowest value
yes, the lowest of anywhere besides amazon
and the high of amazon
you want all the rows where prices are lower than the highest Amazon price for that group, right?
so this is wrong
well, its right in the manner that it dropped the upcs that dont have an amazon price, or are you asking about the next step?
okay
you should come up with some sample data
7847 Amazon 11.53 806481288353 https://www.amazon.com/gp/offer-listing/B083CP...
7850 HomeDepot 28.99 806481288353 https://www.amazon.com/gp/offer-listing/B083CP...
7848 Walmart 24.97 806481288353 //goto.walmart.com/c/1914133/566719/9383?veh=a...
7851 Amazon 136.73 806481288353 https://www.amazon.com/gp/offer-listing/B01IBI...
should yield row 7851 and 7848
yes
courtesy of another user (earlier)
this yields that, but all of amazon prices, not just the highest
keep_indices = list()
for _, y in df.groupby("upc"):
    amazon_min = y[y["location"] == "Amazon"]["price"].max()
    COND = (y["location"] == "Amazon") | (y[y["location"] != "Amazon"]["price"] < amazon_min)
    keep_indices += y[COND].index.tolist()
df.loc[keep_indices]
id prefer to keep only the highest
sure
and it doesn't matter if, for example
the highest Amazon price is lower than the lowest non-Amazon price, right
in all cases you just want the highest Amazon price and the lowest non-Amazon price
and this is applied on the previous DataFrame
the one with UPCs without Amazon filtered out
with amazon upcs filtered
so applied to
new_dataframe = df.groupby('upc').filter(lambda g: 'Amazon' in g['location'].unique())
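One way to read the requirement as stated (highest Amazon row plus lowest non-Amazon row per UPC) is to pick row labels with `idxmax` and `idxmin`. This is a sketch, not the approach anyone in the chat settled on; it reuses the four sample rows posted above (the URL column is left out) and assumes a UPC should only be kept when it has both kinds of rows.

```python
import pandas as pd

# The four sample rows from the chat, keyed by their original index labels.
df = pd.DataFrame(
    {
        "location": ["Amazon", "HomeDepot", "Walmart", "Amazon"],
        "price": [11.53, 28.99, 24.97, 136.73],
        "upc": [806481288353] * 4,
    },
    index=[7847, 7850, 7848, 7851],
)

is_amazon = df["location"] == "Amazon"

# Row label of the highest Amazon price per UPC.
amazon_high = df[is_amazon].groupby("upc")["price"].idxmax()
# Row label of the lowest non-Amazon price per UPC.
other_low = df[~is_amazon].groupby("upc")["price"].idxmin()

# Assumption: only UPCs that appear on both sides are of interest.
common = amazon_high.index.intersection(other_low.index)
keep = pd.concat([amazon_high.loc[common], other_low.loc[common]]).tolist()

result = df.loc[keep].sort_values("upc")
print(result)
```

Because whole rows are selected by label, the location and price in each output row always belong together.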
try aggs = df.groupby([(df['location'] == 'Amazon').rename('amazon'), 'upc']).agg(['min', 'max'])
and pd.concat([aggs.xs(c, level=0)[[('location', 'min'), ('price', 'min')]] for c in {False, True}]) to filter out
filter out?
so, filter seems to be almost what im looking for, cept 2 things
still need the price for amazon with the upc, and 2 if amazon is the lowest price, then i need to drop that row
huh.
but other than that the filter is perfect i think, checking now
you didn't say that
my apologies...
oh wait, the second part is wrong though, ignore it
it should be
pd.concat([
    aggs.xs(False, level=0)[[('location', 'min'), ('price', 'min')]],
    aggs.xs(True, level=0)[[('location', 'max'), ('price', 'max')]]
])
because you want the max for Amazon, right
yes
okay, I need to go now
but basically
for the last step where you wanna drop the rows
same
you can just do another groupby and filter on that condition
ye, thought so as much
@velvet thorn anyways tyvm!!!!!
would elaborate how helpful u have been, but as u, i really have to go like right now!!
Is anyone available to answer a couple of questions about how to turn a JSON file into a pandas dataframe? I've got an API call from a sports data website, but I'm missing something obvious
@leaden bobcat do elaborate
when merging two df's with
pd.merge(df1, df2, on='shared_column', how='left')
i expect there to be the same number of rows after as there are in df1, this isn't usually the case
how is it possible to create more rows than the original df when doing a left join, i figured the max would be the number of rows in the original data
when i instead do df1.join(df2, how='left') i get the expected result so idk
how to replace a section of a dataframe?
Say i have a df with columns A,B,C and where C == 4 i want to replace C with the value of B.
I'm not sure how to do this without a bunch of for loops
i just created a different vector and used that to overwrite
select the section you want to replace with .loc or .iloc and just assign it
dataframe['column_to_change'] = new_col
I think should work
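For the "where C == 4, replace C with the value of B" case, a boolean mask with `.loc` avoids the for loops. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up frame with the column names from the question.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30], "C": [4, 5, 4]})

# Rows where C equals 4.
mask = df["C"] == 4

# Overwrite C with B on those rows only; other rows keep their C value.
df.loc[mask, "C"] = df.loc[mask, "B"]

print(df)
```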
@jolly briar doesn't seem right, got example?
@velvet thorn re what, the joins?
It's UK here so not now
But this seemed to be the case
As in, I used merge and got way more. Used join and got less
how do you know you got more?
he compared his rows before and after. can confirm, when he posted before it showed some weird shit
backpropagation is a general thing for all NNs, what is your question?
wait, let me get this right, you're trying to make your own algorithm for backpropagation when the one used is used for a reason?
i'm not sure if any of us here honestly know enough about the deep math behind these algorithms that have been around for years for reason. if you'd like to learn them i would definitely just suggest learning about what's there and how it works instead of trying to replace it
recreating the core of how any of our NNs work isn't exactly common as far as i'm aware. making your own network? sure yeah, but not recreating the essence
i support you totally btw, power to you if you can understand that stuff cause fuckin hell i'm not going through that much
i'm afraid i won't be able to help much though, past just understanding how backprop works
@coral yoke I would disagree that this is "deep" math...
@keen geyser how do you intend to normalise the weights?
and which articles are you looking at?
i honestly didn't need your ping just for a disagreement, but sure
@strange stag lol I didn't know you only wanted the highest Amazon. Your original said all amazons and every other lower than this Amazon
@strange stag Lol now I think I get what you want
You should have just said this at the very start
in all cases you just want the highest Amazon price and the lowest non-Amazon price
So basically all non-Amazons would be the same
@keen geyser Would help if you could share the articles you are using. CNN backprop should be ok-ish material
@velvet thorn btw looking at your thing. Why do you need to rename "Amazon" to "amazon"?
don't need to
but if you want to look @ the intermediate result it's slightly more comprehensible to have a name for that level of the index
@chilly geyser you still there?
ah, confused u with gm
ye... my apologies... i have difficulty explaining what i want...so
@chilly geyser
@strange stag in general for this kind of data wrangling question
@velvet thorn how do i merge the two location max / location min?
providing expected output helps everyone out a lot
i shall try to do so in the future
on phone so I can't write code, but you want a groupby
aggs = df.groupby([(df['location'] == 'Amazon').rename('amazon'), 'upc']).agg(['min', 'max'])
df2k = pd.concat([
    aggs.xs(False, level=0)[[('location', 'min'), ('price', 'min')]],
    aggs.xs(True, level=0)[[('location', 'max'), ('price', 'max')]]
])
df2k.groupby('upc').head(len(df2k)).sort_values(by='upc')
merging the upc 8888359036 for example
thought grouping by upc would do this however
i need an agg yeah?
expected output of the first two lines merged would be
8888359036, Amazon, BestBuy, 14.23, 9.99
third, fourth, fifth, sixth, would be dropped (later with df2k.dropna())
seventh upc merged would be
8421134096783, Amazon, Target, 15.24, 4.99
and for extra credit, dropping any rows price max is not greater than or equal to twice the price min
i can probably figure this out tho
actually try groupby fillna
what would be the value?
basically perfect besides nan values
what is this doing in aggs? .rename('amazon'), 'upc']
.stack() o.O
how do i filter these though....
nvm on the stack...not what im looking for
nvm
Uhm so did it work o,o
@velvet thorn like i just inner joined two dfs with (1173, 14) and (17000,40) (ish) dimensions respectively and got a df with 2.5 million rows back
that just makes zero sense to me for an inner join
that seems like an outer join...
right, but it's not
do you have the code?
i do but i can't share anything
i mean, i can 100% say this has happened with an inner join
@velvet thorn
this is a left
inner
outers the same dim, so ive no idea
hrm, i'm not sure what to do there then
>>> import pandas as pd
>>> left = pd.DataFrame([[0, 'a'], [0, 'b']], columns=['a', 'b'])
>>> right = pd.DataFrame([[0, 'c'], [0, 'd']], columns=['a', 'b'])
>>> pd.merge(left, right, on='a')
a b_x b_y
0 0 a c
1 0 a d
2 0 b c
3 0 b d
because i think this duplicate information is valuable - it would be grouped by
@velvet thorn yeah, it's giving all combinations
yeah, so that's why you have more rows in your case too
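A sketch of how the row blow-up can be detected up front: `pd.merge` accepts a `validate` argument that raises `pandas.errors.MergeError` when the keys are not unique in the way you expect. The frames are the same toy ones as in the example above. (`df1.join(df2)` behaves differently because `DataFrame.join` matches on the index by default rather than on a shared column.)

```python
import pandas as pd

left = pd.DataFrame([[0, "a"], [0, "b"]], columns=["a", "b"])
right = pd.DataFrame([[0, "c"], [0, "d"]], columns=["a", "b"])

# Every left row matches both right rows, so 2 x 2 = 4 rows come back.
merged = pd.merge(left, right, on="a", how="left")

# How often each key occurs on the right; counts above 1 multiply rows.
key_counts = right["a"].value_counts()

# Ask pandas to verify that the right-hand keys are unique.
raised = False
try:
    pd.merge(left, right, on="a", how="left", validate="many_to_one")
except pd.errors.MergeError as exc:
    raised = True
    print("merge rejected:", exc)
```

If the duplicates are unwanted, dropping or aggregating them on the right-hand side before merging keeps the left row count intact.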
yes, i'm confused about what to do with the data now :/
the duplicates are for geographic regions , eh
thanks tho - that explains it
given a df with columns A,B where A are groups and B are count values, how to find the column B percentages per group?
so if i have
A B
a1 50
a1 50
a2 80
a2 20
i would want to have column B_perc as [0.5, 0.5, 0.8, 0.2]
i get that in this case the data sums to 100, this can't be assumed ( so *0.01 isn't ok)
>>> df.groupby('A').transform(lambda g: g / g.sum())
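The same idea written out on the example table, as a sketch: `transform("sum")` returns the group total aligned to every original row, so a plain division gives the per-group share. `B_perc` is the column name from the question.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a1", "a1", "a2", "a2"], "B": [50, 50, 80, 20]})

# Group totals broadcast back to the original rows: [100, 100, 100, 100].
group_total = df.groupby("A")["B"].transform("sum")

# Share of each row within its own group.
df["B_perc"] = df["B"] / group_total

print(df)
```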
I always have difficulty understanding groupby
@velvet thorn you have shown the table, it got 2 columns and 4 rows. We can see how it looks. I always wondered how this looks:
df.groupby('A')
Because Python never shows how it looks in reality
it doesn't really make sense
to have a raw groupby
for reasons I can explain another time, since I'm going to bed soon
oh..
have you read the pandas groupby docs?
good night then
they might help
Pandas groupby docs, been reading them for the last 4 days
I can read C++ technical definition from the ISO standard
But can't understand groupby >:-\
hm
okay real quick
imagine this
A B
a1 50
a1 50
a2 80
a2 20
you have this, right
and say you want the mean of B for each unique value of A
you could do this:
for a in df['A'].unique():
    print(df.loc[df['A'] == a, 'B'].mean())
and this gets each subset of the DataFrame
for which A has a specific unique value
and then performs some transformation on it
this is equivalent to df.groupby('A')['B'].mean()
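A runnable sketch of that equivalence on the table above. It also iterates over the raw groupby object, which is one way to see what `df.groupby('A')` holds: a key plus the sub-DataFrame for that key.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a1", "a1", "a2", "a2"], "B": [50, 50, 80, 20]})

# Manual version: one subset per unique value of A, then the mean of B.
means_loop = {a: df.loc[df["A"] == a, "B"].mean() for a in df["A"].unique()}

# groupby version: same subsets, same aggregation, in one expression.
means_groupby = df.groupby("A")["B"].mean()

# A raw groupby can be iterated to inspect each (key, sub-frame) pair.
for key, sub_frame in df.groupby("A"):
    print(key)
    print(sub_frame)

print(means_groupby)
```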
@lapis sequoia make sense?
So far, no.
but I will try to understand while you sleep
@velvet thorn thanks again - I didn't know about transform , i used apply with a lambda function, is there any reason to reach for one over the other?
ah i see it's late for you, no worries
What kind of graph is this?
i can imagine a regular algo
These vacuums use a navigation algorithm called VSLAM (visual simultaneous localization and mapping)
according to wikipedia there are some algorithms that are open source
you could get some inspiration from this
i don't suggest anything i just googled :p
i would guess you would need some camera system and the processing power to treat it in real time
I wonder how well Reinforcement Learning would work in this situation.
df.isna() will give me true / false for each cell based on whether it's nan or not, how can i select only rows which have some NA though?
Does df[df.isna().any(axis=1)] work?
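A quick check of that expression on made-up data: `isna()` gives the cell-wise booleans and `.any(axis=1)` collapses them to one boolean per row.

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 is complete, rows 1 and 2 each miss one value.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

# True for rows that contain at least one missing value.
has_na = df.isna().any(axis=1)

rows_with_na = df[has_na]
print(rows_with_na)
```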
alright yall... how do i merge rows by upcs?
For example, I have 2 rows with missing NaN values. the First row's missing NaN values are found within the second row, and vice versa (however a simple .fillna(method='ffill') does not work, because the data is not perfect, and what i mean by that is, not all upcs have 2 rows to makeup for the NaNs
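One possible approach, sketched on made-up data: `GroupBy.first()` returns the first non-null value of each column within each group, so two rows whose gaps complement each other collapse into one, and UPCs with only a single row keep their NaNs. The column names `amazon_price` and `other_price` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows: UPC 1 has two complementary rows, UPC 2 has only one.
df = pd.DataFrame(
    {
        "upc": [1, 1, 2],
        "amazon_price": [14.23, np.nan, 15.24],
        "other_price": [np.nan, 9.99, np.nan],
    }
)

# First non-null value per column within each UPC group.
merged = df.groupby("upc", as_index=False).first()

print(merged)
```

Rows that remain incomplete can then be removed with `merged.dropna()`.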
I created the functions dropna ,which drops rows with empty values, and isnull ,which keeps rows with empty columns, to filter the dataframe and it works as I am able to print both. Then I would append them to previously created xlsx files
wb = Workbook()
ws = wb.active
wb.title = 'Contacts'
wb2 = Workbook()
ws2 = wb2.active
wb2.title = 'Contacts'
r1 = df.dropna(subset=['Firstname', 'Lastname', ('work_phones' or 'mobile_phones') or (('Work_City','Work_Street','Work_State','Work_Zip') or ('Personal_Street','Personal_City','Personal_State','Personal_Zip')) or ('Work_email' or 'Personal_email')])
r2 = df.loc[(df['Firstname'].isnull()) | (df['Lastname'].isnull()) | (((df['work_phones'].isnull()) & (df['mobile_phones'].isnull())) | (((df['Work_Street'].isnull()) | (df['Work_City'].isnull()) | (df['Work_State'].isnull()) & (df['Work_Zip'].isnull())) | (df['Personal_Street'].isnull()) | (df['Personal_City'].isnull()) | (df['Personal_State'].isnull()) | (df['Personal_Zip'].isnull())) & (df['Work_email'].isnull()) & (df['Personal_email'].isnull()))]
for r in dataframe_to_rows(r1, index=False, header=False):
    ws.append([r])
for r in dataframe_to_rows(r2, index=False, header=False):
    ws.append([r])
wb.save("Accepted Contacts.xlsx")
wb2.save("Rejected Contacts.xlsx")
However, when I try to add them to the excel files I get this error for r1
raise ValueError("Cannot convert {0!r} to Excel".format(value))
ValueError: Cannot convert ['Doe', 'Jane', nan, nan, nan, nan, '5678743546', 'j@greenbriar.com', '54 George street', 'Ridge Springs', 'VA', '25678', nan, nan, nan, nan, '3245687907', nan, nan, nan] to Excel
hmm i don't really understand what you're trying to do, but nan is not an excel character no?
if you want an empty value in excel/csv it should be "Jane",,,,"56787453"
It needs to be column specific
,, is one column
instead of nan I make it an empty string?
it would work but then you would have an empty string in your excel
so , "",
it probably doesn't matter, but sometimes, some excel macro doesn't consider empty string as blank value
File "<ipython-input-2-de3603ab2d77>", line 1, in <module>
runfile('C:/Users/mosta/.spyder-py3/CRMnew.py', wdir='C:/Users/mosta/.spyder-py3')
File "C:\Users\mosta\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\mosta\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/mosta/.spyder-py3/CRMnew.py", line 1311, in <module>
ws.append([r])
File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\worksheet\worksheet.py", line 644, in append
cell = Cell(self, row=row_idx, column=col_idx, value=content)
File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\cell\cell.py", line 133, in __init__
self.value = value
File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\cell\cell.py", line 239, in value
self._bind_value(value)
File "C:\Users\mosta\Anaconda3\lib\site-packages\openpyxl\cell\cell.py", line 222, in _bind_value
raise ValueError("Cannot convert {0!r} to Excel".format(value))
ValueError: Cannot convert ['Doe', 'Jane', '', '', '', '', '5678743546', 'j@greenbriar.com', '54 George street', 'Ridge Springs', 'VA', '25678', '', '', '', '', '3245687907', '', '', ''] to Excel
This is not the problem
I don't know what is {0!r}?
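On `{0!r}`: in `str.format`, `0` refers to the first positional argument and `!r` means "insert its `repr()`". So the message shows the exact Python value that could not be converted, here a whole list instead of a single cell value. A small sketch with shortened values:

```python
# A list, like the value in the traceback (shortened here).
value = ["Doe", "Jane"]
message = "Cannot convert {0!r} to Excel".format(value)
print(message)

# With a plain string, repr() keeps the quotes, which makes the type visible.
quoted = "{0!r}".format("Doe")
unquoted = "{0}".format("Doe")
print(quoted, unquoted)
```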
@velvet thorn @chilly geyser @gilded harness
this is what im looking for, but this data is inaccurate (due to fillna)
It is in the for loop:
for r in dataframe_to_rows(r2, index=False, header=False):
    ws.append([r])
what type is this?
When I print it
Lastname Firstname Company ... Personal_email Note Note_Category
1 Malcoun Joe 8/28/2019 14:29 ... NaN NaN NaN
4 None Jordan NaN ... NaN NaN NaN
5 None NaN NaN ... NaN NaN NaN
6 Zachuani Reemo NaN ... NaN NaN NaN
7 Suarez Geraldo NaN ... NaN NaN NaN
[5 rows x 20 columns]
Lastname Firstname Company ... Personal_email Note Note_Category
0 Doe Jane NaN ... NaN NaN NaN
2 Ramirez Morgan NaN ... NaN NaN NaN
3 Burki Roman NaN ... NaN NaN NaN
[3 rows x 20 columns]
can you print(type(r)) in your loop ? maybe you pass something like [[your_row]]
I print r1 and r2 before the loop
openpyxl write it as :
for r in dataframe_to_rows(df, index=True, header=True): ws.append(r)
here you ws.append([r]) so you put list in list probably ?
(looked there: https://openpyxl.readthedocs.io/en/stable/pandas.html )
What a stupid mistake by me. It took me days
happens
Thank you very much @plain turret
you're welcome, sometimes you just need fresh eyes
hey guys, to learn data science, what subject should i focus first?
i'm good with math and statistics, i understand usability of data very well, but idk what to learn to work with data science
anybody could give me a north?
udacity
i'd just pick a book on the subject i want to explore? data science is super big
subject? get used to the python libraries that are used most in the area such as pandas, numpy, etc.
scikitlearn perhaps as well depending on your preference
pandas, numpy and which others are used? so i can focus on those first
what is your goal?
i want to be able to get dataframes and work data to information, create information for decision making
definitely just pandas and numpy then for that
pandas > numpy in priority
and let's suppose i want to make a little dashboard
to show data
in real time, as the database is working
still pandas, but then flask, django or something else
flask to handle the automatic population of your table
hmmm, nice
django depending on the scale of the site
nice, thanks guys, helped a lot, i'll start right now
flask for smaller projects
@strange stag this is something i would ask too
i've seen flask used on large projects as well. preference
what is a small project and a large project? is based on data or views?
well yes, can happen, but generally that is not done
id start out in flask
your traffic and how much you're handling
django > flask?
no
management is different
neither's better than the other
i tried to start with django
flask is easier to set up / less stuff to learn imo
^
but it was really difficult to me
flask has more flexibility, django has more structure
and flask is generally preferred starting off, even in businesses, as you only add what you need
flask worked very well for me
so to advance fast and get result i would prefer flask
nice
most of the stuff you'll learn can be transferred to django since i think they both work with templates
i have an idea i'm developing, it can get some size someday, but i'll start with flask
they both work with the exact same template engine, yes
@void anvil seaborn has nice heatmaps that work with pandas' DataFrame.corr() if you want to plot them easily
sorry mispelling or word order, english is not my main language
your english is fine georg, no worries!
i did this two years ago so i can't say for sure
you can with hmm
the keyword annot
i had made another df with the p-value significances as * and plotted them on top
since you have corelation with color anyway
but you can mess with it
@coral yoke thank you!!
anyone made use of yellowbrick?
it seems to have changed the output of seaborn after inputting it, i don't just mean style wise, but the actual data looks a bit different as though there's some kinda transformation or something... just wondering if anyone's noticed anything similar
i always thought R plots were nice from regression models, seems that this has diagnostics now at least
what am i watching
a horror
why do you have some sort of regression line with columns lol
anyone able to help with my previous q?
yeah it's an odd one - it wasn't like that earlier @plain turret , i don't think
kinda what i get after i try every tutorial tbh
i'm also getting test R2 consistently higher than training
so there's clearly something very wrong somewhere lol
am i being thick or is drawing a horizontal line on a seaborn plot a bit of a faff
get the Axes
ax.axhline
@crystal sluice you can consider Dash for that
also, another reason to use transform is that it better signals your intent
@velvet thorn what is dash
it's a framework meant for data analysis
integrates with pandas
Google "dash python"
do I have to use an old version (1.8) of Anaconda if I need to use python 2.6?
I don't want it to interfere with the current version installation
for two models A,B, if mse( A ) < mse( B ) yet mae( A ) > mae ( B ), how to choose the model based on these metrics?
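A small numeric sketch of how that situation arises, with made-up residuals: MSE squares each error, so a single large miss dominates it, while MAE weights every error linearly. Model B below is exact on most points but has one big outlier error; model A is moderately off everywhere.

```python
import numpy as np

# Made-up absolute residuals for two hypothetical models.
errors_a = np.array([2.0, 2.0, 2.0, 2.0])  # consistently a bit off
errors_b = np.array([0.0, 0.0, 0.0, 5.0])  # mostly exact, one large miss


def mse(errors):
    """Mean squared error of a residual vector."""
    return float(np.mean(errors ** 2))


def mae(errors):
    """Mean absolute error of a residual vector."""
    return float(np.mean(np.abs(errors)))


print("A: mse", mse(errors_a), "mae", mae(errors_a))
print("B: mse", mse(errors_b), "mae", mae(errors_b))
```

Here A wins on MSE and B wins on MAE, so the choice comes down to whether occasional large errors are worse for the application than many small ones.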
could anyone help me translate a function from intention into code? it's probably a bit of text to explain, would appreciate a PM
@lapis sequoia what's a PN
it's supposed to be a private message, but i see the acronym doesn't make sense in English haha
either PM or DM would be the english for that @lapis sequoia , and i think you're better off just putting your problem into the channel as best as you're able too
i'd spam the whole room, because it's a lot to explain
well, not sure what to say then i guess
ok so i don't know how to explain the problem w/o context
i have a huge data set, it's about delays and delay prediction... i still need to engineer some features
in the tidy dataset there are columns for delays, train stations, train-line, stop sequence number and so on... what i'm working on right now is a directional index for every train line, to have a dummy variable in the regression part
my plan is to get a list of station acronyms sorted by their sequence of occurrence within a line, let's say LINE 1
which would look like this:
[(0, 'TKT'), (1, 'TKTO'), (2, 'TWD'), (3, 'TWER'), ... (21, 'TSRO'), (22, 'TGOL'), (23, 'TBO'), (24, 'THUB'), (25, 'TEHN'), (26, 'TGT'), (27, 'TNUF'), (28, 'THE')]
now i would want to find any match of any train event for the given LINE 1 where the station is in that list, and write the corresponding number into a new column
I'd have to do that for every train-line
when that column is finished i'd be able to check for every starting and ending train whether it goes from higher number to lower number or vice versa
why so complicated? because the dataset is complex and not every train of one specific train-line goes all the way from 0 to XX. some start later and stop earlier etc.
do you get it?
The procedure would have to be done for each of the 8 train LINEs to fill the entire column. So I would like to write some function or pipeline that does the same for all the LINEs. I can't just give every station-abbreviation one specific number, because while the station abbreviations are "general", the corresponding number would be LINE-specific.
what is a train event?
@lapis sequoia
@jolly briar which is more important to you...?
@velvet thorn there are 5 different train events:
- departure of a train from its start station
- arrival of a train at a stopover
- a passing train
- departure of a train at a stopover, and
- arrival at its final destination
those are coded for example with 1) = 10, 2) = 20, ... 5) = 50 so you can find the specific events for every train and every LINE etc in the dataset... every day has like thousands of logged events... every minute of the day at every station etc.
i don't think you get me
are you able to post the example data @lapis sequoia ?
in general, posting sample data and expected results helps a lot.
I'm a total beginner and not very used to discord either, so I simply don't know how to post that stuff properly
can I msg you @velvet thorn to clarify things?
post here please
can you load it like that?
{'SERVICE_ID': {0: 29664277470, 1: 29664277470, 2: 29664277470, 3: 29664277470, 4: 29664277470}, 'TRAIN_ID': {0: 7087, 1: 7087, 2: 7087, 3: 7087, 4: 7087}, 'STOPSEQUENCE_NO': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'DS100': {0: 'TP', 1: 'TACH', 2: 'TACH', 3: 'TEZL', 4: 'TEZL'}, 'EVENT_TYPE': {0: 10, 1: 20, 2: 40, 3: 20, 4: 40}, 'Actual_Time': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-09-16 13:50:00'), 2: Timestamp('2017-09-16 13:51:00'), 3: Timestamp('2017-09-16 13:52:00'), 4: Timestamp('2017-09-16 13:53:00')}, 'Sched_Time': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-09-16 13:50:00'), 2: Timestamp('2017-09-16 13:50:00'), 3: Timestamp('2017-09-16 13:52:00'), 4: Timestamp('2017-09-16 13:52:00')}, 'LINE': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}, 'START_TIME': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-09-16 13:48:00'), 2: Timestamp('2017-09-16 13:48:00'), 3: Timestamp('2017-09-16 13:48:00'), 4: Timestamp('2017-09-16 13:48:00')}}
thanks... I'd do better if I knew how to... I just made a dict and printed it
In [111]: import pandas as pd; from pandas import Timestamp
In [112]: d = {'SERVICE_ID': {0: 29664277470, 1: 29664277470, 2: 29664277470, 3: 29664277470, 4: 29664277470}, 'TRAIN_ID': {0: 708
...: 7, 1: 7087, 2: 7087, 3: 7087, 4: 7087}, 'STOPSEQUENCE_NO': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}, 'DS100': {0: 'TP', 1: 'TACH',
...: 2: 'TACH', 3: 'TEZL', 4: 'TEZL'}, 'EVENT_TYPE': {0: 10, 1: 20, 2: 40, 3: 20, 4: 40}, 'Actual_Time': {0: Timestamp('2017
...: -09-16 13:48:00'), 1: Timestamp('2017-09-16 13:50:00'), 2: Timestamp('2017-09-16 13:51:00'), 3: Timestamp('2017-09-16 13
...: :52:00'), 4: Timestamp('2017-09-16 13:53:00')}, 'Sched_Time': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-0
...: 9-16 13:50:00'), 2: Timestamp('2017-09-16 13:50:00'), 3: Timestamp('2017-09-16 13:52:00'), 4: Timestamp('2017-09-16 13:5
...: 2:00')}, 'LINE': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}, 'START_TIME': {0: Timestamp('2017-09-16 13:48:00'), 1: Timestamp('2017-
...: 09-16 13:48:00'), 2: Timestamp('2017-09-16 13:48:00'), 3: Timestamp('2017-09-16 13:48:00'), 4: Timestamp('2017-09-16 13:
...: 48:00')}}
...:
In [113]: pd.DataFrame.from_dict(d)
Out[113]:
    SERVICE_ID  TRAIN_ID  STOPSEQUENCE_NO  DS100  EVENT_TYPE          Actual_Time           Sched_Time  LINE           START_TIME
0  29664277470      7087                1     TP          10  2017-09-16 13:48:00  2017-09-16 13:48:00     1  2017-09-16 13:48:00
1  29664277470      7087                2   TACH          20  2017-09-16 13:50:00  2017-09-16 13:50:00     1  2017-09-16 13:48:00
2  29664277470      7087                3   TACH          40  2017-09-16 13:51:00  2017-09-16 13:50:00     1  2017-09-16 13:48:00
3  29664277470      7087                4   TEZL          20  2017-09-16 13:52:00  2017-09-16 13:52:00     1  2017-09-16 13:48:00
4  29664277470      7087                5   TEZL          40  2017-09-16 13:53:00  2017-09-16 13:52:00     1  2017-09-16 13:48:00
that looks good
thanks mate
so basically that's a very reduced dataset... usually there are like 30 more columns and millions of rows
when I said a bit I really meant a very tiny bit
like what @jolly briar did
that's perfectly fine, don't worry about it
DS100 column is the abbreviation code for each station, so each "event" is at some station, at some point in time, on a specific LINE etc
i only posted that for noobs future reference
unfortunately the "STOPSEQUENCE_NO" column is not usable to make the directional index, as one and the same line can have a different number of stops e.g. one train goes the full way from A to Z, another only goes from C to K etc. depending on the time of day or weekday or whatever... and also it doesn't differentiate whether the train goes from A to Z or from Z to A (direction).
so my plan was to make a list for every LINE (1, 2, 3, ... , 8) that puts a number (1, 2, ...., 28) next to every station-abbreviation.
Like so:
[(0, 'TKT'), (1, 'TKTO'), (2, 'TWD'), (3, 'TWER'), ... (24, 'THUB'), (25, 'TEHN'), (26, 'TGT'), (27, 'TNUF'), (28, 'THE')]
it might be easier to manually edit a small section in excel as an example of what you want
mh..
not easy to explain at all
to someone who isn't familiar with the data and the problems etc
then make an example
don't know how
you put the data into excel and edit it by hand
if i could program it in excel i could just google how to translate it to python, lol
well if you can't do that you've very little hope of explaining it to someone else
If I have a numpy.int64 object and I want to iterate over that specific column, how can I go about doing that?
I get an AttributeError when I try to do dataframe.apply(lambda x . . .)
uh.
so basically
if I understand you correctly
you want to convert the values in the last column to 1 if the original value is 2, and 0 otherwise?
@timid vortex
yea
the last column doesn't have a label
df.iloc[:, -1] = (df.iloc[:, -1] == 2).astype(int)
.columns accesses the column names
also, avoid apply if you can
IMO it promotes lazy (and inefficient) thinking
How should I properly go about this
breast cancer, yes?
yeah
hm
that's not right
but anyway you can rename the column, so
anyway the code I provided should work for you
tell me if it doesn't
I guess it did...wow
Don't understand iloc and astype(int)
thank you so much though
just for the future, instead of using apply, what should I do instead
if I want to change all elements in a column
additionally, if I want to change the labels from just being a list of numbers, how could I do that?
if it's a single column you can use replace( )
i think that's a done thing , maybe there's something better
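A rough sketch of both suggestions, assuming a tiny made-up frame whose columns are just numbers and whose last column only ever contains 2 or 4. The column names are hypothetical and only there for illustration.

```python
import pandas as pd

# Made-up stand-in data; columns are labelled 0, 1, 2 by default.
df = pd.DataFrame([[5, 1, 2], [3, 7, 4], [1, 1, 2]])

# Replace the numeric labels with names (hypothetical names).
df.columns = ["clump_thickness", "cell_size", "label"]

# Map values in a single column: 2 becomes 1, 4 becomes 0.
# This assumes 2 and 4 are the only values present.
df["label"] = df["label"].replace({2: 1, 4: 0})

print(df)
```

If only some columns need new names, `df.rename(columns={old: new})` does the same job without listing every column.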
.iloc is an indexer
basically, you can specify which rows and which columns you want, in that order
: means all
so basically I said - get me all the rows from the last column (because -1)
then I compared them elementwise to 2
Do you get it now? @velvet thorn @jolly briar
which returns results of either True or False
yeah
the last part, .astype, converts True to 1 and False to 0
which is the same logic as yours
the reason to avoid apply is that apply is generally just a big for loop, which means you iterate over each value in turn.
very quickly, but still one at a time
whereas if you do an == comparison, it's vectorised, which basically means that pandas (through numpy) uses certain special instructions in your CPU to perform multiple operations at once
tl;dr: apply is slower.
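The steps described above, broken out one at a time on a small made-up frame, next to the apply version for comparison. Both give the same values; the difference is only in how the work is done.

```python
import pandas as pd

# Made-up data; the last column plays the role of the unlabelled class column.
df = pd.DataFrame({"feature": [5, 3, 1, 8], "cls": [2, 4, 2, 4]})

last = df.iloc[:, -1]          # all rows (:), last column (-1)
mask = last == 2               # elementwise comparison, gives True/False
vectorised = mask.astype(int)  # True becomes 1, False becomes 0

# Same logic with apply: a Python-level function call per value.
looped = last.apply(lambda v: 1 if v == 2 else 0)

print(mask.tolist())
print(vectorised.tolist())
print(looped.tolist())
```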
if i have
2015 : a = 40%
2016 : a = 45%
2018 : a = 44%
what would an uplift model look like for predicting this years percentage?
@lapis sequoia so i want to do 2 things. First write that GREY column on the far right. I can't just simply give any DS100 abbreviation a unique number, it has to be line specific. LINE 1 can have a 1st station, and so can LINE 2, ..., LINE X. The 1st station will always have a "1" in that column for every LINE. But a train can also start at the 28th station and go to 5th or the 1st (backwards direction).
The excel screenshot should give an idea
the second problem would be to code the function right below the table in the screenshot.
if df.LINE_STATION_NO[df.EVENT_TYPE == 10] < df.LINE_STATION_NO[df.EVENT_TYPE == 50], then the train starts at, for example, station 5 of that LINE and maybe goes to station 20. Because 5 < 20, the direction is then defined as +1. However, if it was going from station 20 to station 5, the directional index would be -1, because the train is going backwards.
Why the numbers 5 and 20 in the example? Because not every train is serving all the stations from 1 to 28. Some only serve sections in between.
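One possible way to sketch both steps in pandas, under a few assumptions: the station order per LINE is available as a dict of lists (the orders and events below are made up), every service has exactly one event 10 and one event 50, and the column names LINE_STATION_NO and DIRECTION are just the ones used in this discussion. A merge on (LINE, DS100) makes the station number LINE-specific, and the direction is the sign of end minus start.

```python
import numpy as np
import pandas as pd

# Hypothetical station order per LINE (made up for illustration).
station_order = {
    1: ["TKT", "TKTO", "TWD", "TWER"],
    2: ["TWD", "TGT", "THE"],
}

# Long lookup table: one row per (LINE, station) with its position on that LINE.
lookup = pd.DataFrame(
    [(line, code, no)
     for line, codes in station_order.items()
     for no, code in enumerate(codes, start=1)],
    columns=["LINE", "DS100", "LINE_STATION_NO"],
)

# Made-up events: service 1 runs forwards, service 2 runs backwards.
events = pd.DataFrame({
    "SERVICE_ID": [1, 1, 1, 2, 2],
    "LINE":       [1, 1, 1, 1, 1],
    "DS100":      ["TKTO", "TWD", "TWER", "TWER", "TKT"],
    "EVENT_TYPE": [10, 20, 50, 10, 50],
})

# Step 1: attach the LINE-specific station number to every event.
events = events.merge(lookup, on=["LINE", "DS100"], how="left")

# Step 2: compare station number at departure (10) and final arrival (50).
start = events[events["EVENT_TYPE"] == 10].set_index("SERVICE_ID")["LINE_STATION_NO"]
end = events[events["EVENT_TYPE"] == 50].set_index("SERVICE_ID")["LINE_STATION_NO"]
direction = np.sign(end - start).rename("DIRECTION")  # +1 forwards, -1 backwards

events = events.merge(direction.reset_index(), on="SERVICE_ID", how="left")
print(events)
```

Because the lookup covers all LINEs at once, nothing has to be repeated per LINE. Trains that only serve a section in between are handled the same way, since only the first and last station numbers are compared.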
guys, is it really that hard to configure git on vscode?
i've been struggling for like 2 hours
i have my github account, installed 3 hundred thousand extensions on vscode and i'm not having success
Hi everyone
fairly simple question here
I'm trying to create a graph to show the univariate distribution of my training data (the target values)
how can I do this effectively?
I've tried doing sns.distplot(y, hist=False, rug=True), but the graphs before and after oversampling+undersampling remain the same. In other words, it doesn't seem to properly represent my dataset
also, the target values are continuous
Does anyone have a simple explanation of what a graph in Tensorflow means?
if you dont need tensorflow as a hard requirement.. I would suggest you drop it and move on..
really hard to accept.. but I wish I had done that a year ago.. it's really a waste of time because you can't iterate and scale as fast as you can on other frameworks
@shadow quiver a graph is basically a way to represent the flow of data through mathematical operations.
Pandas groupby example: df.groupby('points').points.count() In this "df" has 17 columns. Now when you combine the "points" column using groupby(), what happens to the rest of the columns, where do they exist?
I know groupby() does not change the original dataset, it is operating on a copy. What does that look like, a mashup of 2 columns while the other 15 do not change?
no
I think
you are focusing too much on the idea of the groupby being something concrete
think of it as an incomplete instruction.
okay, for example, if I tell you "go by car", the very natural question you would ask is "go where?"
what that groupby does, conceptually, is separate df into a number of dataframes, and in each dataframe the values of points are all the same.
however, because this is an expensive operation, when you just execute df.groupby('points'), all that happens is that pandas stores your instruction for later execution
because how exactly the groupby is performed will depend on what you want to do with it.
hmmm ... conceptually, is separate 'df' into a number of dataframes, and in each dataframe the values of points are all the same
this is good
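A small sketch of that mental model on a made-up three-column frame (standing in for the 17-column one). Calling groupby on its own only produces a GroupBy object; iterating over it shows the conceptual pieces, and each piece still carries all of the original columns.

```python
import pandas as pd

# Made-up data for illustration.
df = pd.DataFrame({
    "points": [90, 85, 90, 85, 88],
    "country": ["FR", "IT", "FR", "US", "IT"],
    "price": [20.0, 15.0, None, 30.0, 12.0],
})

grouped = df.groupby("points")   # nothing is aggregated yet
print(type(grouped).__name__)

# Each group is a sub-DataFrame in which "points" has a single value;
# the other columns are all still there.
pieces = {key: part for key, part in grouped}
print(pieces[90])

# Only now is the instruction completed: count the "points" values per group.
counts = grouped["points"].count()
print(counts)
```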
dataframe.groupby().count() returns -- "Count of values within each group"
dataframe.groupby().size() returns -- "Number of rows in each group"
What's the difference between these 2?
count ignores nulls, size doesn't @lapis sequoia
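A quick check of that on made-up data with some missing prices: size counts rows per group regardless of content, count only counts the non-null values in the selected column.

```python
import numpy as np
import pandas as pd

# Made-up data; three of the five prices are missing.
df = pd.DataFrame({
    "points": [90, 90, 85, 85, 85],
    "price": [20.0, np.nan, 15.0, np.nan, np.nan],
})

sizes = df.groupby("points").size()             # rows per group
counts = df.groupby("points")["price"].count()  # non-null prices per group

print(sizes)
print(counts)
```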
See you tomorrow @velvet thorn .. good night, will spend some time with Dale Carnegie's book
I dunno if this is the best place for this question, but....
How would you normalize an audio waveform? I am working on an audio classification problem. I know normalizing data is a good practice, but I am not sure if one should do it for waveforms.
Audio normalization is the application of a constant amount of gain to an audio recording to bring the amplitude to a target level (the norm). Because the same amount of gain is applied across the entire recording, the signal-to-noise ratio and relative dynamics are unchanged...
This ?
Or removing noise ?
That.
I just want the amplitudes to be consistent among samples.
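For consistent amplitudes across samples, one common option is peak normalization: scale each waveform by a single constant so its largest absolute value hits a target. A minimal numpy sketch, assuming the waveform is already loaded as a 1-D array; the function name and the target of 1.0 are illustrative choices, not a fixed convention.

```python
import numpy as np


def peak_normalize(waveform, target_peak=1.0):
    """Scale a waveform by one constant gain so that max(|x|) equals target_peak."""
    samples = waveform.astype(np.float64)
    peak = np.max(np.abs(samples))
    if peak == 0:
        return samples  # pure silence: nothing to scale
    return samples * (target_peak / peak)


# Two made-up recordings at different loudness levels.
quiet = np.array([0.0, 0.1, -0.2, 0.05])
loud = np.array([0.0, 0.5, -0.9, 0.3])

print(peak_normalize(quiet))
print(peak_normalize(loud))
```

Because the gain is a single constant per recording, the relative dynamics inside each waveform stay the same, in line with the definition quoted above.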
https://github.com/google/gin-config
Also, what are your thoughts on this library by google?
Hi guys! I have a question.
I have conversations between a customer and an agent (without punctuation). There are phrases of several categories of promises that an agent gave to a customer (call back, make an appointment, etc.). The labelling has been done manually. Altogether 12 categories. Now I'm thinking of creating an algorithm for this. I am thinking of doing this task in two steps.
- In the first step, I need to create an algorithm that can find an end and a beginning of all promises. This algorithm has to insert a start tag and an end tag.
- The second step is to create a classifier that would label a promise to the necessary categories.
As I understand, the second step is well known and this is called text classification. But for the first step, I could not find any articles and github repositories. But I think it is an important NLP task and there must be information on this. Maybe there are approaches that solve both steps at the same time?
@alpine stream here is a very detailed guide on speech recognition, there are some helpful APIs and documentation to them. Even if you don't want to use them it is useful to see how they function. https://realpython.com/python-speech-recognition/
@alpine stream in particular it seems to me that those guys are doing something very close to what you are describing. https://wit.ai/getting-started
Guys, how can one make his own speech recognition model and train it well on multiple languages? The point of that is to avoid Google's API which has a file size limit.
@proud iron I read a few papers showing how transformer networks like BERT and GPT-2 worked well in translation scenarios. Might want to start there. This isn't my expertise though so...def want to read up more on that.
Question: How do I return a javascript object from a python function (after scraping some data from different websites) then putting them back together (in HTML)
I think returning JSON would be the easiest.
What's the use case? Like...a Flask app and some JS front end?
yeah it's a Flask App
Use MongoDB with Flask templating to create a new HTML page that displays all of the information that was scraped from the URLs above.
Start by converting your Jupyter notebook into a Python script called scrape_mars.py with a function called scrape that will execute all of your scraping code from above and return one Python dictionary containing all of the scraped data.
Next, create a route called /scrape that will import your scrape_mars.py script and call your scrape function.
Store the return value in Mongo as a Python dictionary.
Create a root route / that will query your Mongo database and pass the mars data into an HTML template to display the data.
Create a template HTML file called index.html that will take the mars data dictionary and display all of the data in the appropriate HTML elements. Use the following as a guide for what the final product should look like, but feel free to create your own design.
"return one Python dictionary containing all of the scraped data"
Just means something that maps directly onto a JSON object.
Semi-repost from career channel. What are the essential skills to break into healthcare/pharma data science? Data scientist positions i've seen usually revolve around economics - banking, marketing, etc.
Ah okay it's JSON, thankfully
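A tiny illustration of the dict to JSON step using only the standard library. The scrape function here is a hypothetical stand-in with made-up keys and values, not the real scraping code.

```python
import json


def scrape():
    """Hypothetical stand-in for the real scraper; returns one plain dict."""
    return {
        "news_title": "Example headline",
        "image_count": 3,
        "facts": {"source": "example"},
    }


payload = json.dumps(scrape())   # dict -> JSON text a JS front end can parse
restored = json.loads(payload)   # JSON text -> dict again

print(payload)
```

In a Flask route, `flask.jsonify(scrape())` would wrap the same dictionary in a JSON response, so the front end can read it as a regular JavaScript object.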
https://arxiv.org/pdf/1905.11946.pdf
A decent paper discussing the depth, width, and resolution of ConvNets.
Can somebody give me a quick tip on how to write columns by checking if-then conditions?
Like: If "HourOfDay" >= 6 and <= 9, then write "NewColumn"=1, otherwise 0.
maybe @velvet thorn ?
what are you trying to do
and what do you mean write columns..
@lapis sequoia I'm working on a dataset, currently adding features for the predictive regression. I want to add multiple columns with dummy variables
hm
in this case it's going to be a "morning peak" dummy variable (I'm working on delay prediction)
assuming the column is called HourOfDay (bad practice IMO, should be snake case)
the simplest way to do it is df['new_column'] = ((df['hour_of_day'] >= 6) & (df['hour_of_day'] <= 9)).astype(int)
you can do that together..
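The one-liner above on a few made-up hours, with the column named morning_peak here for illustration. Series.between is shown as an equivalent spelling; it is inclusive on both ends by default.

```python
import pandas as pd

# Made-up hours for illustration.
df = pd.DataFrame({"hour_of_day": [5, 6, 8, 9, 10, 17]})

# 1 if 6 <= hour_of_day <= 9, otherwise 0.
df["morning_peak"] = (
    (df["hour_of_day"] >= 6) & (df["hour_of_day"] <= 9)
).astype(int)

# Same condition written with between().
df["morning_peak_alt"] = df["hour_of_day"].between(6, 9).astype(int)

print(df)
```

The parentheses around each comparison matter, because & binds more tightly than >= and <= in Python.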