#data-science-and-ml
1 messages ยท Page 415 of 1
here's a MWE
import numpy as np
def find_intersection(a, b, c, d):
r'''
function that finds the intersection between two line segments. one segment
is defined by the points a, b; and the other, by c, d.
'''
y = c - a
A = np.zeros((2,2))
A[:,0] = b - a
A[:,1] = c - d
detA = A[0,0]*A[1,1] - A[0,1]*A[1,0]
if detA == 0:
return np.ones((2))*np.inf #intersection at infinity
else:
x = np.linalg.inv(A).dot(y)
if (0 <= x[0] <= 1) and (0 <= x[1] <= 1):
return a + x[0]*(b-a) #valid intersection
else:
return np.ones((2))*np.inf #intersection out of segment
a = np.array([0,0])
b = np.array([1,0])
c = np.array([0,-1])
d = np.array([1,1])
p = find_intersection(a,b,c,d)
print(p)
the result is
[0.5 0. ]
as expected
i made it so it returns [inf, inf] if the matrix is not invertible (that means the lines are parallel. either they never touch, or they are the same line. it's a degenerate case)
i guess i forgot to check whether the entries of x are in the valid interval [0,1]
edit* there we go
here is my daily complaint about aws
Hereโs my daily I donโt use cloud computing

here's my daily day
its the only way i can get my model deployed at work
otherwise itd be useless

edd can you learn aws and then teach me please
tomorrow might be a good day to help, i have some dead time in between meetings
Iโm too stuck to advance
Would appreciate, itโs more math coding

Pure numpy no functions
absolutely lovely
Theyโre gona make me translate the panda
check out what i shared above. nice way to find the intersection of line segments by inverting a matrix
In python
good morning. i'm wondering if someone can help me with a question regarding dask
i turned a csv into many parquet files using pandas, and when i read these parquet files using Dask i can do basic operations such as .head() and .tail(), but when i try to do other things like operations on a column i'm getting a ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.. When I do .index on the dataframe it shows that i do index have an index, but known_divisions is False. I'm not really sure how divisions plays into Dask or its indexing structure, or why i'm receiving this error. Any thoughts?
The index on the files are just a basic 0-N index created by Pandas (I didn't specify a column)
A server?
20 million rows isn't much on the scale that a lot of data science is done these days. are you trying to pick a particular flavor of SQL?
Thank you! I came back to report that this: https://www.w3schools.com/python/ref_set_intersection.asp won't work. But as I re-read the other persons's post, they indicated that I would have to include each and every point. So one, I did it wrong, and two, I won't be calculating each and every point. At least, that's not desirable for this process. ๐
Csv
Works perfectly of course. BTW, what does 'MWE' mean?
what are you doing with said data
is it more read heavy or write heavy
analytical vs. transactional
Rex asking the big questions over here
something something ACID vs. CAP

ehh you could probably get away with most things as long as it can store that many rows
since youre not trying to put anything into prod
so whatever youre most familiar with/most interested in learning
i recommend mongo but thats me
do it
you are a student right? you can get access to mongodb atlas too
through the student dev pack
if you're planning to do sentiment analysis, and you don't need to do any complicated queries or transactions (ie, the data is just there for you to feed into a model as-is), a text file should be fine
but then you cant do RDD Stel!
resume-driven development

I would be more concerned about how easy it is to feed the data into the model (a heavily nested JSON is not that), and not losing the data.
also mongo has easy ways to query json stuff btw
and its aggregation pipeline is pretty powerful
but yeah you can go old school with txt files ig

so it just depends on if you want to learn a new tool or not. up to you
You probably don't need a database. If you design your system correctly with a separate IO layer, then you can swap that out later to use a database without having to change anything else in the system.
And I would then start with a simple IO layer and only make an IO layer for databases when you actually need one.
Do you control the file format? JSON / what is in it?
Looks pretty straight forward. If you want something more simple / you get to decide the format, then I would recommend a even more simple flat file format.
JSON is often overkill.
But yeah, if this works, just go with it for now. Databases later when actually needed.
What does this mean?
NumPy uses C-order indexing. That means that the last index usually represents the most rapidly changing memory location, unlike Fortran or IDL, where the first index represents the most rapidly changing location in memory. This difference represents a great potential for confusion.
It's from the numpy docs. https://numpy.org/doc/stable/user/basics.indexing.html
what do they mean by 'rapidly changing memory location'
Well, nothing can be done about how bad Twitter's API gets.
Databases are not really the thing for solving overly complex JSON, they can do it, but there is so much else that they do and add as overhead because of it. There are libraries for wrangling more complex JSON on their own.
Hi guys, I don't know if this is the right section, in pandas I'm trying to figure out how to check with a "relative position" index without iterate through the whole dataframe.
A wrong code solution would be:
`
def foo_func(df):
index = df.tail(1).index[0]
dfCheck = df[index-3:index]
mask1 = dfCheck["A"].head(1) == True
mask2 = dfCheck["B"] > 0
if not dfCheck[mask1].empty and dfCheck[mask2].empty:
df.iloc[index] = True
df.rolling(3).apply(foo_func)
`
So the goal here is to check if there is a true in the -3 relative position and at least a value>0 in the -3:index portion. Any idea how can I translate this? Thank you all very much for any help
Hi guys, anyone have experience with parsing SEC data (XML) using this library/API (https://arelle.org/arelle/documentation/xbrl-database/open-database/) ? I am trying to recover the XML form back into the table format of https://www.sec.gov/ix?doc=/Archives/edgar/data/0000315189/000155837021001774/de-20210131x10q.htm# and store into my database. Anyone knows how can I find the 'reference' or 'calculation' section using this package? Thx
I parse xml using pythons parser
Hello, helloworld
You can try asking your questions in #databases channel as well.
code: https://github.com/krishnaik06/Car-Price-Prediction
Dataset: https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho
โญ Kite is a free AI-powered coding assistant that will help you code faster and smarter. The Kite plugin integrates with all the top editors and IDEs to give you smart completions and documentation while youโre typin...
nice example of an end to end project
i wanted@to learn ai
do you know what AI is?
yeah
what
artificial inelegance
yes, but what is that
basically a human mind inside a computer
you canโt learn that stuff
because thatโs stupid saying you want to learn ai
oh wait
ohhhhh
OHHHHH
๐
that's not what it is at all.
<@&831776746206265384> spam across multiple channels
unrepentant and warned, too: #community-meta message
#community-meta message
in general, AI is when you have programs that solve a knowledge problem. or in other words, they emulate the application of knowledge in some way
in practice, it's usually understood to be a body of programming techniques that one would use when an exact sequence of steps can't arrive at an exact solution for a problem

minimum working example
Hey.
We are working on a machine Learning model. we use xgboost and want to try a blended model between xgboost and lightgbm. I have no idea how i should start on this any tips?
that
u wana learn how to code a computers brain?
Hey, already posted in general discussion but I guess I could receive more help here:
Hey guys, I'm an engineer that has some Python knowledge (I can easily follow documentation, find my way through a repository etc) and I need some help finding the best resources to help me build the following small project for a university course:
- User takes a photo with the phone's camera and uploads it (or is automatically uploaded to the pc that the phone is connected to with a cable / wifi)
- The photo goes through a Python API/ or is processed by a script that reads the photo, extracts the answers to a quizz and then offers the final score of the user
- The file the user will take a photo of looks similar to the one below (it might change just a little bit)
I would really appreciate some help, I want to learn this my self. I know some JS, I know some python, linux, etc. And I understand AI/Machine learning, so I guess I will have to use OpenCV with some other libraries to have this pipeline. Looking forward for some constructive words ๐
hello, I want to build an open domain chatbot. What's the best python framework for that in your opinion?
Whatโs open domain
#1 Recommend the Gmail API.
Quick, easy way to have users upload pictures is to simply have them email them to you.
Lots of ways to do that though.
#2 First build your "user uploads picture to you, picture arrives, user gets a score" workload.
Do the machine learning last.
Yeah, thing is those are the least important parts of the code ๐ I can also do it by manually uploading the images to a directory because users will have to pass the sheet manually to me anyway. And the machine part thus becomes the most important and hard part ๐ฆ
Have you uplpaded the xlsx file in your jupyter notebook
You can either upload in jupyter's home page or link up the path in your PC for that
Is this where people might know about pandas?
Hi guys how can I plot a heatmap from a data frame pandas?
I have a data frame with 120k rows and 3 columns: customer, expenditure-type and ranking, ranking goes from 1 to 5 and indicates the amount spent from each customer (1=very little etc..) per expenditure type, i want to plot these data into a heatmap how can I do?
I've never plotted a heatmap before. does this help? https://stackoverflow.com/questions/12286607/making-heatmap-from-pandas-dataframe
what do you want the heatmap to show?
Should we create a data analysis channel I see we get daily how to pandas
Seaborn ?
seaborn was also gonna be my suggestion, but more details can be given based on what info they want the heatmap to show
Hi everyone, I am trying to save an output of applying a function to a dataframe to a new column in that dataframe, and when I don't save it as a new column, I am seeing the correct response, but when I go to save it in a new column, it is filling it with NaNs
can you show a code snippet
so this is the correct output, all 1's
then I do this: python df['newcol']= df.loc["Beg-4"].apply(foos.Beg_3)
but it's filled with NaNs instead of the 1's
Beg-4 is a column?
a row
then the issue is that you're not applying the function to all rows, most likely?
just 1 row, the row named Beg-4
and what do you want pandas to put into the other rows you didn't specify?
oh, I just want it to take all the outputs, and put them in a new column...oh I guess I should put them in a new row?
I just want to save them somewhere in a df
actually idealy a new dataframe
that I would then add more stuff to once I apply different functions to the other rows
that would make more sense, since it seems you're applying the function to the elements of a row. the number of outputs is equal to the length of a row. if you have more columns than rows, putting this into a column will end up with many unspecified values
each row has it's own function
ah so maybe if I view the whole dataframe some of it would have saved as not NaNs?
le tme look
hmmm, no, they're all Nans
so it didn't fill some with nans, but all
i'm not sure what pandas' default behavior is, but in any case you tried to put a collection of values somewhere it doesn't fit ๐
different functions will handle that error in different ways
so this is how I create the new dataframe, right?
wasn't newdf already a new dataframe? (i'm asking, i've never used pandas)
the code looks ok, just making sure the newdff line isn't redundant
so it was a series?
yeah
and now it's a dataframe?
whatever that is ๐ yep
ty so much โค๏ธ
sure
Do computers have brain ? :)
Damn that sucks u gona have to learn how
and pretty bad in normal python
Get better then
Well u asked how to and the answer is first be able to code
What can u do with python
question
how would you make use of an ontological model in a business setting
these tend to be represented with knowledge graphs
okay, now what?

Hi, I have a question: how to showing all number in x axis in plotly?
plt.xticks(range(1, 13)) @bold timber ?
i think when you create the fig, you can give the parameter x = [some_list_of_tick_values]
doesn't works
so maybe x = range(1,13)
they say plotly there, or do you shorten pyplot and plotly the same way?
should be something like fig = some_plotly_function(some_datafram, x = range(1,13), other_params)
ok thank you
how do I find the win rate of my algorithm?
what's the algorithm
man i fkin love plotly so beautiful
my professor wants me to find the win rate of this algorithm how do I do that?
jesus that paper was a facefull of math
generate a huge amount of scenarios with different starting conditions and see how many times it wins
i cannot read it
could you elaborate?
https://github.com/quantumiracle/Popular-RL-Algorithms
here is the qmix code
not really
same here, I couldn't understand any of it
skimming thru it its 50% symbols
you're training this yourself?
and proof
that's how papers should be
I trained it on my local machine
maybe put proof in the appendix?
or at least I ran this algoritm on my local machine and I got an output pickle file
how do I interpret this?
the training procedure itself already includes some form of training error and validation error
the validation error of the final epoch is what you want
what exactly that looks like, i can't say
am I allowed to send files here?
i wonder if i just pin a massive 20x20 table on my wall with every single math symbol i ll be able to decipher papers
after a few weeks
Hey @brave sand!
It looks like you tried to attach file type(s) that we do not allow (). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
Reading this paper ive noticed their use of the triple equals sign, what is that?
is that just equating functions?
so I have the model results, where would the validation error of the final epoch be?
the usual procedure is that lots of cool, cutting edge, and largely useless results are produced in academia and research. then the results slowly trickle down as either the authors make the code open-source or people make open-source implementations of the results. then people who know nothing about them start using them more widely through APIs
i have absolutely no idea ๐ you can read how the output is stored in their repo
for my thesis ill be forced to paste the code at the end
not exactly deep coding though anyone can do it
that's rather uncommon
โก means identical to
umm
interesting
1/2 doenst = 2/4
but rather is identical to
why didnt they teach this in school
because it's a useless distinction
We consider a partially observable scenario in which each agent draws individual observations z โ Z according to observation function O(s, a) : S ร A โ Z. Each agent has an action-observation history ฯa โ T โก (Z ร U)โ, on which it conditions a stochastic policy ฯa(ua|ฯa) : T ร U โ [0, 1]. The joint policy ฯ has a joint action-value function: Qฯ(st,ut) = Est+1:โ,ut+1:โ [Rt|st,ut], where Rt = ๔ฐโ ฮณirt+i is the discounted return.
they represent the same number, so there is no case in standard arithmetic where it matters
so reading that i saw that symbol and thought
wtf?
what is the : before infiinte for?
isnt that usually a arrow?
where?
how do you guys usually interpret pickle files?
is there any particular reason why the left side of the equation uses () and the right side uses [] ?
yeah seems like some sort of interval, but it's difficult to say. unbeknownst to most, symbols don't have universal meanings. you'll have to hope the authors explained their notation near the beginning of the paper, or work your way from the top as you figure out how they use the symbols
as a non math person this pisses me off
i often read things and have no clue what tehyre saying
hows 1:inf normally written?
isntit like
(1,inf]
or somehow
Expected value often uses [].
(1,inf] doesn't make it clear whether the numbers are taken from N, Z, or I
can you explain why in that equation they use expected values and what in maths that means? I thought equations didnt do expected
and yeah, expectation often uses [] or {}
What does that mean
.latex $\mathbb{N}$ are natural numbers, $\mathbb{Z}$ are the integers, and $\mathbb{I}$ are the reals
how can expected value be a thing in normal equations? ive only ever ran into that in statistics making synthetic datasets
you're doing statistics there
i need to read up more on exactly how expected values work on maths
"stochastic policy" means "random policy"
Z vs I?
You can think of it as a weighted sum for simplicity often.
(You have probably already seen a sum in an equation)
that was a typo, sorry. Z are whole numbers. I are the reals.
after i finish my linalg instead of calculus i will take a course on notation and definitions..
.latex $\mathbb{N}$ are natural numbers, $\mathbb{Z}$ are the integers, and $\mathbb{I}$ are the reals
sadly i can't get the bot to delete nor update the tex, so here it is again
soooo sick of not knowing definitions
i'm surprised these things were not mentioned in your linalg course
these things were certainly not mentioned
one usually defines vector spaces as "a vector space over a field", so one has to at least briefly mention fields
is there any specific content that explains all these brackets and meanings?
no, because they're not universal, as i said
you pick up math-reading abilities as part of your mathematical maturity
uhh, i mean so far its like /A/ as in length or watever
by doing maths
"mathematical maturity" - Huh, so i'm not the only one that uses that term.
i just realized i still mixed up the symbols, i meant R, not I
starting form 0 means im years away from that sorta stuff
sorry, i'm tired
isn't this a well accepted term? you can find it on papers and documents on pedagogy
I'm not sure. Seems like it would be.
i'm sure, i was just being polite lol
I don't think it was not polite. I just have not seen to term used in a chat room before.
im never gona have the time on my hands to practise enough from 0 to being able to pass highschool or early degree level papers
that shit requires spamming it over and over
that's how everything is learned
that's also fair. consider that these people do this stuff for a living
high school here starts at pretty easy level maths, like straight lines, surds and basic probability
sure
but after 1 year
it gets quite tricky
the trig is confusing, the calculus requires spamming and they dont even teach linalg
its like 99% trig
I'm in linalg right now
they dont teach it here
sadness
Linalg is so widely applicable, especially for many programs (computers compute it really well).
yo
Consecutive terms of a sequence are related by unรพ1 1โ4 3 ๔ฐ (un)2
dammit
whats the strategy to finding the 50th term?
like a one liner?
sure i could go thru them one by one but that wud take ages
you could use a for loop, sure. if you don't want to, though, you have a problem ๐
its a pen and paper math exam
the formula is recursive. you have to do the math yourself on paper
aha
well
see if you can find a pattern
its a small question they expect u do it in 2 lines
maybe it telescopes nicely
i thought hey thats easy when they asked to find third term
then the next q is 50th
compute a few terms and look for the pattern
srlsy?
that's the whole point
ur meant to deduce the 50th term by just doign the first 3 or 4 terms and guaging it?
yes
ngl that thing grows pretty nastily lol (if you start from 0)
aha, that's the trick
If you write it out you will see it.
write out like 4 or 5 terms and you're done
you really shouldn't need more than 4
you either missed the pattern or forgot the parentheses
A lesson on how the base case can completely change things.
ill do it after food
yeah i was 3 terms in and was like "those highschoolers are dead and buried by now"
I recommend trying u_1 is 3 or 4 to see what else happens in those cases.
u shud have a look at UK A2 maths core3/4 papers or advanced maths
its so hard i dropped out
import pandas as pd
file_name = "/Documents/Python Virtual Environments/Popular-RL-Algorithms/model/qmix_agent (1)/archive"
objects = pd.read_pickle(file_name)```
why does this not work?
Traceback (most recent call last): File "qmix_pickle_reader.py", line 4, in <module> objects = pd.read_pickle(file_name) File "/home/ethan/Documents/Python Virtual Environments/marl-test-env/lib/python3.8/site-packages/pandas/io/pickle.py", line 187, in read_pickle with get_handle( File "/home/ethan/Documents/Python Virtual Environments/marl-test-env/lib/python3.8/site-packages/pandas/io/common.py", line 795, in get_handle handle = open(handle, ioargs.mode) FileNotFoundError: [Errno 2] No such file or directory: '/Documents/Python Virtual Environments/Popular-RL-Algorithms/model/qmix_agent (1)/archive'
iirc to read pickled files you need to import all of the libraries that were involved in the object that got pickled
so try taking all of the imports you used on the file that generated the pickle, and put them also in this one that reads the pickle
oh but there it's also telling you you're reading from the wrong directory
same error
yeah
I didn't think loading in libraries would resolve my file not found error
i'm retty sure paths don't like spaces in them
try encasing the part of the path with a space in ''
'qmix_agent (1)'
otherwise, rename the folder ๐
best place to learn deep learning with basic python programming?
What are you currently at?
You know about stuff like linear regression, and perceptron, multi-layer perceptron etc?
@upper spindle
deep learning and basic python programming don't really go together lol
anyone know the best way to get into ai for beginners
quite much replying to both delta and yourdad: there's Andrew Ng's Machine Learning Specialisation on Coursera, but I cannot say for sure if it's the best option out there
i understand the theory behind ai, i just find the code part hard
any advice on this?
Is it possible to practice python with a mobile?
i would say andrew ng is pretty aight. you need some background knowledge though, and iirc it doesn't go much into code. still it's a great place to start and i encourage learning the math before trying the code
the new version goes a little bit more into code than the previous I think
so say i learned the background knowledge where do i go from there and learn the code part
you can try diving head first into the documentation of whichever library you want to use, or look for a course / tutorial series
Hi people! I'm trying to understand the amount of parameters in a CNN. Well, I am classifying black and white images into four classes. Firstly I processed the images as RGB and later as 'grayscale'. I expected an exponential decrease in the amount of parameters after the Flatten layer, but actually they remained the same. What do the parameters actually depend on?
are you able to recommend any good libraries?
depends on what you want to do, which models you want to use etc
well for example i wanted the ai to tell the difference between two things i.e A dog and A cat
ey people, I do not mean to be rude. Just in case you didn't know there is a pedagogy channel
sklearn is fine for non-deep learning
pytorch or tensorflow are used for deep learning, something via a higher level API / package such as fast.ai, huggingface or keras
please,
at that point i would look at big hitters like pytorch, tensorflow, and lower level stuff like numpy and jax. then i'd decide which one to stick to for a while based on personal interest
#pedagogy is used for discussions on how to teach, not for asking for resources
why would there be a decrease in parameters after a flatten layer?
all it does is change the shape. it doesn't apply any function whatsoever
it's akin to the "vectorization" operation you can apply to an m x n matrix in order to obtain a length m*n vector
same number of parameters, just reshaped (and generalized to more dimensions)
that was my question. I expected so, because the size of the input is reduced
lemme make an example for you
In [11]: import numpy as np
In [12]: x = np.array([[2,3],[5,6]])
In [13]: print(x)
[[2 3]
[5 6]]
In [14]: print(x.flatten())
[2 3 5 6]
In [15]:
they're exactly the same thing, just in a different shape
ok, let's change the point of view
when adding a convolutional layer it extracts the feature maps accordingly to a number of filters
if the size of the input is smaller I expected somehow a decrease on the amount of feature maps and also in the number of parameters
but there is something else... that is my question
oh
that's determined by the shape of the convolution layers, the pooling layers, and the dense layers
from which layer to which layer did you expect a large change?
flatten does nothing, dropout doesn't change the number of parameters, only deactivates them randomly at each iteration. the dense layer is a linear mapping from R^n to R^m, here with m << n, so that's the layer that has a ton of parameters, but the feature vectors are quite small after it
in the convolutional layers, you specify an input 2D shape and a number of filters. the output is of size ~ N - kernel_length x N - kernel_length x num_filters.
for all of the layers, the number of parameters is related to the underlying (multi-)linear transformation from something isomorphic to R^N to something isomorphic to R^M, having N*M parameters
so... what you mean is that an image of size (256,256,3) is isomorphic to another image of size (256,256,1)?
not at all
what i'm saying is that an image of size 256 x 256 x 3 is isomorphic to another of size 196608
and you put that into the network and get another size
and that output vector of some size is isomorphic to some other n-dimensional array
so one easy way to think about the number of parameters at a given layer is to vectorize the input and output
then the number of parameters is something like N * M... plus another M, if there are biases
since the effect of a layer (before applying the activation function) is that of an affine transformation y = Ax + b, and A and b are the parameters
nice. In that case I would have expected a decrease in the output shape of the first convolutional layer as well as it happened to its parameters
and that happened indeed. 2D convolutional layers shrink the 2D axis of the image by roughly their own size
though the number of output slices in the image depends on the number of filters you use
any resources for embedding categorical variables in LSTM?
Heyo, i want to calculate e.g. 2 diffrent average values for the watchtime ("Duration")with pandas (from 2022-05-02 until the 2022-06-02) but i want it to be exact2 values... one for the first month( or the rest of the watcdata avaible for that month) and teh secodn should be all teh watchdata avaible in the second month
thanks a lot, you really helped me understand it better. But there is still something else doesn't match... if you take a look at the screenshots: the output shape of the first conv2d layer does not change, but the parameters do. Why that reduction on the parameters does not affect to the first dense layer too?
Hey, any ideas what format this data is in?
https://aoe2.net/api/player/ratinghistory?game=aoe2de&leaderboard_id=3&steam_id=76561199003184910&count=5
it's a JSON. while the outermost structure of a JSON is usually dict-like, it can be list-like.
!e
import json, pprint as pp
result = json.loads("""[{"rating":2351,"num_wins":2587,"num_losses":1916,"streak":-3,"drops":52,"timestamp":1656402092},{"rating":2357,"num_wins":2587,"num_losses":1915,"streak":-2,"drops":52,"timestamp":1656400992},{"rating":2363,"num_wins":2587,"num_losses":1914,"streak":-1,"drops":52,"timestamp":1656400031},{"rating":2369,"num_wins":2587,"num_losses":1913,"streak":4,"drops":52,"timestamp":1655825614},{"rating":2359,"num_wins":2586,"num_losses":1913,"streak":3,"drops":52,"timestamp":1655825282}]""")
pp.pprint(result)
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
001 | [{'drops': 52,
002 | 'num_losses': 1916,
003 | 'num_wins': 2587,
004 | 'rating': 2351,
005 | 'streak': -3,
006 | 'timestamp': 1656402092},
007 | {'drops': 52,
008 | 'num_losses': 1915,
009 | 'num_wins': 2587,
010 | 'rating': 2357,
011 | 'streak': -2,
... (truncated - too many lines)
Full output: https://paste.pythondiscord.com/uyipusetir.txt?noredirect
@lapis sequoia see?
I've been trying to get values out of the dictionary for so long
but good to know for sure it's json
What's the best way to convert this into a pandas dataframe?
!docs pandas.load_json
No documentation found for the requested symbol.
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=None, convert_axes=None, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, ...)```
Convert a JSON string to pandas object.
that one
Ok, I'll give that a try. I'm trying it out by calling the API directly in the meantime.
if the API gives you that data as a string, then this will work
I think
otherwise you can do pd.DataFrame(json.loads(...))
I have a list of top 10k players which I've changed into a dataframe, I want to create a new column in the dataframe('max_rating') which has the highest rating each player has ever had. I'm using the following code:
API = aoe.API()
df = pd.read_csv('AoE2_list_of_top_10000_players.csv')
for s in df['profile_id'].head():
ratings_of_player = API.get_rating_history(profile_id=s)
a = []
for i in ratings_of_player:
a.append(i.get('rating'))
print(a)
df['max_rating'] = max(a)
But whenever I check my dataframe after running this, every player seems to have the same exact highest rating(2560) which is obviously not correct. What am I doing wrong here? I want to append the max rating I've found separately to each player.
Ok, I tried it and it seems to be working pretty nicely. Probably won't need to use all the code I've written above if it works how I think it works.
I need help turning my file into tenserflow
Like I need help installing it, then turning it from python to tiffle
Ive tried all the videos, I just cant figure it out
Csv? Into a tensor?
I want to turn
I python file
into tiffle
I dont know how to do it
its easier to explain in vc
๐
I'm trying to get a list of all AoE2 players(from https://aoe2.net/#api using this wrapper: https://github.com/sixP-NaraKa/aoe2net-api-wrapper/blob/main/docs/docs.md) and their highest/lowest ratings in the previous 2 years. I've tried so many things but I keep failing when I try to convert the Json data to a Pandas dataframe. It converts it into a dataframe where the first few columns have index and some other information which isn't important to me, and then one column inserts a dictionary which has all the important information I need but it's impossible to access because it's all in one column.
Any help would be awesome
Play Age of Empires II: HD (AoE2:HD) and Age of Empires II: Definitive Edition (AoE2:DE) online! Lobby Browser and Leaderboards
open domain means that you can talk about it for any topic
I've got a bunch of data that I'm collecting from various sources in a tabular format. The data is all similar but the tables don't always have the same columns. For example, one source may provide a column with a start value, and an end value and nothing else, other sources may provide some interim values. The ordering of the columns may also be different between tables. The rows are almost always in sorted order.
I'm wondering if there is a way to train ML model to determine column headings. Currently I need to manually open the data in excel look at it, and assign the correct heading and then enter the table into my system such that it can be processed, this is really annoying and time consuming work. The tables can have as few as 200 rows, and up to the tens of thousands, and there are typically about 6 to 12 columns. How would I go about structuring the data to train such a model?
are the columns labelled in the files?
Yes, mostly but the labels are not consistent. Data from different sources can have different labels for the same data.
aight you could look at several examples from your data to see if you can learn something about the statistical distribution of the data. the annoying part is that the files have different row sizes. you can either pad the rows or extract statistical params yourself. then train the network on randomly generated examples based on what you observed in the data.
The statistical distribution of the like columns should be similar regardless of the number of rows.
that was exactly my point ๐
for i in df.head()['profile_id']:
all_ratings = []
max_rating = 0
min_rating = 0
list_of_ratings = API.get_rating_history(profile_id=i)
for i in list_of_ratings:
all_ratings.append(i.get('rating'))
max_rating = max(all_ratings)
print(max_rating)
df.loc[df['profile_id'] == i, 'max_rating'] = max_rating
What mistake am I making here? When I print df.head() I get max_rating values as NaN.
I want them to show the max ratings of the players
pls help ๐ฆ
Hi..so i got this cosine similarity matrix output from a python program. may i know how to do data classsificaiton on this like finding accuracy and all in terms of a specific data classfier which is SVM classifier
the output is like this..im putting link hia since i cant copy and paste my own output hia
Hi everyone! I am working on my first data science project and i am facing some trouble with identifying and dealing with outliers in my dataset
would love to learn how to deal with outliers!!
Don't ask for an expert. Ask your actual question.
https://github.com/Sparsh-mahajan/House-Price-Prediction/blob/main/data_cleaning.ipynb here is the what i have been working with, i have a dataset with around 2.9k rows and 80 columns, I have dealt with missing values in the dataset and have plotted out boxplot, histplot and a scatterplot for each column vs the label ('SalePrice')
Predicting house prices using the dataset from https://www.kaggle.com/datasets/prevek18/ames-housing - House-Price-Prediction/data_cleaning.ipynb at main ยท Sparsh-mahajan/House-Price-Prediction
so now do I manually find out outliers in each of the 80 columns and then remove those rows from the dataset? also have i been following the correct method in finding out the outliers?
Whatโs a tiffle
Iโve google tensorflow tiffle canโt see anything
guys am i tripping did i forget that tensorflow has a file type
I know thi s https://pypi.org/project/tifffile/
We will also build the profile of the analyst profession more broadly across national policymakers and central government. This will include accreditation, training, career opportunities, status and pay to match. no fucking shot (UK, NHS)
how exactly would one obtain accredation
hi
is there a way to convert a unicode to utf-8 character for example ’ to '
in python
if we have in string
should be possible to use something like my_string.encode('utf8')
nope
Anyone can explain to me why I get a plot like this?
resample will create the bins based on the cyclic data
ok thank youu
Im so Fucking sorry
I meant TF2 File
Stupid me
I need help turning my python file into a TF2 file
ah i had a similar problem with dask but it is solvable, but i cant remember how i did it without looking it up and i have work rn - all i can say is it looks like youre close
Lmfaooo
@lapis sequoia can u explain what u mean because afaik tf2 is not a script file type? Do u mean like save model ?
You can save a .py file as a text file easily but Iโm not sure what a tf2 file is
Tensorflow library file?
!pastebin
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
i don't understand why the model isn't pickled
unless i have to create the pickle file first and then run the code?
@mild dirge ; i know linear regression from econometrics but all these perceptron stuff, i have no idea
Itโs a simple way to calculate output you can use to introduce urself to neural networks
yes
im trying to optimise my python file to get more fps
so im trying to turn it into a Tensorflow file
so that I can put it into deci playform
FPS?
Frame per second?
Are you making a game?
Are u trolling?
I think ur attempting to save a model not convert a python script to a tensorflow โfileโ
its a script
that you run on a game
when I run it it doesnt give good FPS
so im trying to optimize the script
the only way I can figure out how to optimize it is by making it into a diffrent file one that is capatible with that one website
Are you doing computer vision in a game?
No but I can read
U need to explain what the script is
it is a FOV hack for my game
Why do you run it on a game, how does that work?
Oh okay
And this has what to do with tensorflow?
Tensorflow is a library that creates functions for u to do ML
itโs not a file type
I just want to make it one of theese files, onnz, tensoerflow, keras
because thats when i can optimize it
Those are for storing models
Is ur pc good
What game is it
Increasing ur fov in games is not the fov script that lags you but the game itself having to render more
my pc is good
its a fps game
sorta like fornite
it has the same applications as it
runs on unity engine
I just need help optimizing it
Making your fov script not python u wonโt be able to run it
it injects itself
And what good is converting even language? Itโs not the script itโs your game
Frames per second is not an accuracy
il figure it out on my own
Just to make it clear this isnโt a ML task right?

None of this will help you ur being trolled
Dude
Ur friend is trying to make u fail
Yikes
Improve your games efficiency at rendering
This task has nothing to do with ai
Ur friend scammed u heโs not getting accuracy scores for this lmao
matplotlib is for plotting, and pandas can help you read files and check out what properties the data has. other than that, not much else. the actual ML is done with other tools and can be done entirely without those 2 libs
having an issue with pandas read_csv function, I'm trying to use certain columns of a csv, but I get an error that they are expected but not found.
attached is a screenshot of said csv and the columns I want to extract
here is the code I am running to try to accomplish this:
LORD3DM_100128XY = pd.read_csv(str(PATH100128XY) + "3DM.csv", skiprows=15, usecols=['X Accel [x8004]', 'Y Accel [x8004]', 'Z Accel [x8004]'])
I typically don't have any problems with doing this kind of thing, not sure what's happening here. figured this might be a good place to ask
Anyone know what this tensorflow error means? This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
I dont think it has to do with my code
uhh, is it an error?
it's just saying your tensorflow binary is compiled with AVX and AVX2 support
Would anyone here wana teach me how to binary tree in python? Just the basics such as checking nodes and traversing
Oh
No matter how many tutorials I watch and can memorize key inputs i still donโt full grasp it
Same for linked lists I get the theory but not oop
Nevermind, my model was in the wrong directory ๐คฆ
@untold bloom you still here?
yes, kind of :p
OHHHHH
how are you
:\
oh, sorry i didn't respond to that...
i want to calculate e.g. 2 diffrent average values for the watchtime ("Duration")with pandas (from 2022-05-02 until the 2022-06-02) but i want it to be exact2 values... one for the first month( or the rest of the watcdata avaible for that month) and teh secodn should be all teh watchdata avaible in the second month
i usually hang out in other channels and this has a lot of text in between and i tend to forget and not answer then
if possible, can you give me some sample input and expected output?
hmm yes
result=df_vd_E.groupby(df_vd_E["Start Time"].dt.date)["Duration"].sum()
result.index = pd.to_datetime(result.index)
b=(result.loc["2021-04-15": "2021-07-15"].dt.total_seconds()/60/60)
Month= b.mean()
print(Month)```
So, rn, it gives me the average of the time from 4-15 until 7-15 (one value) what i want is diffrent.... i wantg it to be the average of one month each, so it should be one average value( duration) for the rest of month 4 and then full average watchtime duration of the month 5 and so on
i see, thanks
yw
for my first pandas project, ig i really need some help
didn't help yet, though :p
but before you did
so as you said, the .mean() gives you a single number: the "global" mean
but you want it per month
whenever "per" shows up, we tend to go for .groupby
what will we group the data by in this case?
you want it per month so, month of the data
then we take action:
yesyes
b.groupby(b.index.month).mean()
since the month information is at the index (right?), we reach it from there
the month thing refers to the datatype of it right?
uh, not quite
so .month automatticly knows that the -04- is a month?
yes
please observe what print(b.index.month) shows
b.index is a DateTime index; it has convenient attributes attached to it
.year, month, dayofweek, dayofyear...
this will show numbers 0, 1, ..., 11 in general
for 12 months
THIS IS AN AMAZING TOOL
:p
one caveat about the code above though:
we grouped by the month information only
so the year is ignored: any February day will be accounted for the mean of February
be it year 1998 data or year 2921 data
in your case, i guess this is fine
because you have only 2021 data in b
but in general...
you can do b.groupby([b.index.year, b.index.month]).mean()
groupby both year and month
so 1988's February and 2012's February are now signaling different groups.
before, they were falling into the same, February, group.
okay, ```
"2022-02-01": "2022-04-01"
Start Time
2 2.563807
3 2.003324
4 2.275278
so, how do i now do, that teh rest of the month, is counte din aswell
it is counted in yes
ah okay
i mean, however many days are in that month, they will be counted
be it 1 day or 30, 31
Can anyone help me? Why do I get the same color? How to use different colors in that plot?
i wouldn't say they're the same color, but they're pretty close. how about you remove the color parameter?
doesn't works
I just to try with a simple model for time series like this
remove the 'r' and 'g' too? just to see what happens
@untold bloom ```
Rapha= df_vd_R['Duration'].dt.total_seconds()/60/60
over here i want to print an Integer of teh whole watchtime dsuration of one User, but the output prints every duration for that user... how do i define that i want the duration of some rows added together
like this. But, it makes me wondering why I can't choose my color itself
i wonder too tbh lol
Hello,
I have a piece of code but each time I run it it takes a long time. I'm guessing it takes a long time because it calls the API every time it runs and gets information from it.
Is there any way to speed it up so that it doesn't take a minute or two each time I run it?
https://stackoverflow.com/questions/45175916/why-are-colors-not-working-in-matplotlib-for-this-example apparently there are some color glitches around
Here is my code:
API = aoe.API()
df = pd.read_csv('AoE2_list_of_top_10000_players.csv')
df = df[['profile_id', 'name']]
df['max_rating'] = 0
df['min_rating'] = 0
df['date_of_max_rating'] = 0
df['date_of_min_rating'] = 0
df['difference_in_rating'] = 0
for i in df.head(100)['profile_id']:
all_ratings = []
max_rating = 0
min_rating = 0
list_of_ratings = API.get_rating_history(profile_id=i, count=5000)
for j in list_of_ratings:
if j.get('timestamp') > 1591036131:
all_ratings.append(j.get('rating'))
else:
pass
max_rating = max(all_ratings)
min_rating = min(all_ratings)
df.loc[df['profile_id'] == i, 'max_rating'] = max_rating
df.loc[df['profile_id'] == i, 'min_rating'] = min_rating
df.loc[df['profile_id'] == i, 'date_of_max_rating'] = dt.datetime.fromtimestamp(list_of_ratings[all_ratings.index(max_rating)].get('timestamp'))
df.loc[df['profile_id'] == i, 'date_of_min_rating'] = dt.datetime.fromtimestamp(list_of_ratings[all_ratings.index(min_rating)].get('timestamp'))
df.loc[df['profile_id'] == i, 'difference_in_rating'] = max_rating - min_rating
print(df.head(100))
thank you so much
Ladies and gentleman
I am proud to announce
I have inverted a binary tree
Thanks for all ur support
Iโm ready to face interview
cant wait to apply data structures to uhh... dataframes..
IIUC, again we groupby
not sure what the username column is called but let's say it's called "username"
let's first convert the durations to hours and then groupby
df.Duration.dt.total_seconds.div(3600).groupby(df["username"]).sum()
this gives you the total duration per username (in seconds)
which specific username you want, you can index into this to select it, e.g., above_thing.loc["user_1"]
this gives the total durations per user for all users
if you only want a specific user, first filter the frame and then sum; no groupby is needed then
df.loc[df["username"].eq("user_1"), "Duration"].dt.total_seconds().div(3600).
ahhhhhhhhhh
where is stel
hes probs busy
anyway my model doesnt fit even with aws lambda layers + s3

rip
so the alternative will have to be probably be putting the model and inference code into a docker container

then probs deploy using ECS or Fargate or something
more AWS services i do not know

@untold bloom it often tells me that 'function' object has no attribute .div
raceback (most recent call last):
23118 1.025278
23118 1.025278
23119 0.719444
23120 0.000556
23121 0.019444
Name: Duration, Length: 10293, dtype: float64
this ios output btw
somehow it prints more....
hmm
df_vd.loc[df_vd["Profile Name"].eq("Rapha"), "Duration"].dt.total_seconds()/60/60
ahhjh, i see you are also dependet from one dc user to help....
!pastebin
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
not really. if you read my message, i dont expect anything from stel. i just complain to him
lmao
welp, i fully get carried by nahita somehow
its my first data project you gotta know
never used pandas before

this is a fun error message
federal-gov
what the fuck is federal-gov
oh i'm being stupiid
i never one hot encoded anything
that's why
yeah you also have 'private'
yep that will do it

@untold bloom ?
don't ping ppl asking for help
ehm

try a help channel
this IS a help channel
ehh sometimes its topical chat
we just get too many newbies
ye, so it isnt prohobited to ask for help
um guys I currently want to undertake a project that involves an ai classifying images shown to it and it getting better at doing so over time
how would i go about doing that?
all ik is show the ai data
train it
over time
and results
its not prohibited but you will probably get faster response if youre in a rush with a help channel
otherwise you will have to wait till peeps are online
you think? but problem is people ghave to get familiar with my project in order to understand ig
hmm you should try to break down your problem into someone with no context can understand then
sometimes i've asked in help channels and then people who don't know pandas try to hop in and help
have you tried a machine learning, specifically Computer Vision, course or something similar
nope :D
they are the real cough
i have to go now
peace
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
results = []
names = []
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y,
test_size=validation_size, random_state=seed)
import category_encoders as ce
encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital-status', 'occupation', 'relationship',
'race', 'sex', 'native-country'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test).reshape(14)
for name, model in models:
kfold = KFold(n_splits=10, random_state=seed, shuffle = True)
cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring="accuracy")
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-88-bb069de3b56c> in <module>()
22 X_train = encoder.fit_transform(X_train)
23
---> 24 X_test = encoder.transform(X_test).reshape(14)
25
26 for name, model in models:
1 frames
/usr/local/lib/python3.7/dist-packages/category_encoders/utils.py in _check_transform_inputs(self, X)
321 # then make sure that it is the right size
322 if X.shape[1] != self._dim:
--> 323 raise ValueError(f'Unexpected input dimension {X.shape[1]}, expected {self._dim}')
324
325 def _drop_invariants(self, X: pd.DataFrame, override_return_df: bool) -> Union[np.ndarray, pd.DataFrame]:
ValueError: Unexpected input dimension 108, expected 14
sigh
i have no clue what to do here
@eternal trench
so this doesn't work?
if not, what's the error?
can you print X_train.shape and X_test.shape right before X_train = encoder.fit_transform(X_train) line?
(26048, 14)
(9769, 108)
so something bad happened after train_test_split and that point
because train_test_split won't mess up with the number of features (i.e., number of columns)
what's the shape of the original X?
my guess is: you already transformed X_test sometime before; now it's as if you're trying to transform again
are you working with JupyterLab/Notebook?
like running that cell again would give that error
yeah
oh i see
yeah i noticed on github people were using .ipynb
so i decided to use google colab
hmm my impression is that x train transform does the transformation implicitly, but does not change the variable in place
so that it might not be necessary to encode x test at all
but if i don't do that i get a weirder error
no, X_train and X_test are different entities
i'm aware
if a transformation happened to X_train, that should happen to X_test as well
what error do you get if you don't transform x test?
!pastebin
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
an advice: in these Jupyter-like enviorenments, it helps to make new variables whenever possible (and when it makes sense) instead of assigning to the same name
bc these are categorical features
e.g., you could do X_test_encoded = encoder.transform(X_test) above and that error wouldn't have happened
similar for X_train_encoded = ...
i see
i still get that same error after making X_train_encoded and X_test_encoded
possible; X_test has already been transformed to have got 108 features; trying again to transform it will error
perhaps restart the kernel
i find it weird that the number of examples in x train and x test doesnt add up to the total in x, too
yeah, restart the kernel first
then show again the original sizes of x, xtest, and x train before any transformation is applied
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
results = []
names = []
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y,
test_size=validation_size, random_state=seed)
import category_encoders as ce
encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital-status', 'occupation', 'relationship',
'race', 'sex', 'native-country'])
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test).reshape(14)
for name, model in models:
kfold = KFold(n_splits=10, random_state=seed, shuffle = True)
cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring="accuracy")
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)
i don't see an X_test here
apparently nowhere
maybe that's why it wasn't working
my best guess is they meant validation
(26048, 14)
i tried to print X_validation but nothing showed up
actually hold on
[6513 rows x 14 columns]
so then what's wrong with the one hot encoding
well, swap out the nonexistent variable with one that exists and see what error we get (i.e. x test <- x validation)
however, as nahita suggested, i strongly suggest you don't modify x train and x validation in place, but make another variable instead and use those
because as it turns out, jupyter is terrible for debugging
if you make changes in place, you might need to rerun the whole code from the beginning
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
results = []
names = []
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y,
test_size=validation_size, random_state=seed)
print(X_train.shape)
print(X_validation)
import category_encoders as ce
encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital-status', 'occupation', 'relationship',
'race', 'sex', 'native-country'])
X_train_encoded = encoder.fit_transform(X_train)
X_validation_encoded = encoder.transform(X_validation)
for name, model in models:
kfold = KFold(n_splits=10, random_state=seed, shuffle = True)
cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring="accuracy")
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)
!pastebin
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
needed to use x_train_encoded below
evrythiong works now
i recieved help
btw @untold bloom it worked... your thing ( i was just missing some () somewhere)
omg it's WORKING
cool
why do people use .ipynbs on github
then the issue was wrong variable names (test ~ val) and running the code out of order (jupyter)
because github can render the notebooks and they might be "pretty to look at"
especially when some nice latex typesetting is used
but you'd never use jupyter notebooks for real work ๐
it's a nice display tool, not good for development nor deployment though
i like thonny
but i thought my github looked weird with everything as a .py file
right
i did see people showcase their eda projects with notebooks
jupyter notebook actually broke on my mac and i can't even reopen it anymore
yeah, so ideally you'd put all your nice modules into .py, and then make a slick demo in a jupyter notebook
congrats
i am slowly learning this stuff
project based learning works
even if it's just regression and classification
idk if it's enough to turn heads for a portfolio yet, but it's a start
i can't speak about portfolios, but yeah, motivation comes from within. that means if you find a nice thing you're interested in, you'll have the motivation to see it through. that's a big factor in learning: actually practicing what you're learning, and you won't practice it if you're not interested/motivated
i find it hard to come up with portfolio projects
but i'll get there
one step at a time
can anyone help me with a problem involving SpaCy?
@hoary breach ask your question, don't ask to ask ๐
don't ask to ask, just ask.
... ๐
in spacy you can use similarity for some data
by the mouth of two or three shall all be established.
I caught a snag... (AttributeError: 'str' object has no attribute 'similarity')
please show the code and the error message as text
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
My data for SpaCy
what you've shown us is just the last line of a larger error message. the last line isn't very useful in itself.
Hey @hoary breach!
It looks like you tried to attach file type(s) that we do not allow (.ipynb). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
no one is going to download and run your notebook. please just copy and paste the relevant code.
train_data = nlp(data)
print(data['parsed_doc'][0].similarity(data['parsed_doc'][1]))
so, data['parsed_doc'][0] is a string
train_data = nlp(data)
print(data['parsed_doc'][0].similarity(data['parsed_doc'][1]))
what type of object do you expect to have a similarity method?
or is it a function from some spacy module?
is is a Spacy object
import pandas as pd
data = pd.read_csv('panel_discussion.csv')
do print(type(data['parsed_doc'][0])) and you'll see.
spacy is the name of the library. "spacy" is not a type of object.
it is a str class
Yup
what you mean is "it's an instance of str" or "it's a str". it's not a "str class". these distinctions matter.
So you cannot compare strings in terms of similarity
similarity is a built in function in spacy
Yes, that operates on vectors, not strings
parsed doc refers to the column of the data
based on this notebook https://www.kaggle.com/code/caractacus/thematic-text-analysis-using-spacy-networkx/notebook
Why is it then that they use the columns and can in fact calculate it.
Is it that once you print a dataframe you cannot print it and import the csv code and then use similarity?
relevant: tokens = []
lemma = []
pos = []
parsed_doc = []
col_to_parse = 'Q1'
for doc in nlp.pipe(data[col5_to_parse].astype('unicode').values, batch_size=1,
n_process=1):
if doc.has_annotation("DEP"):
parsed_doc.append(doc)
tokens.append([n.text for n in doc])
lemma.append([n.lemma_ for n in doc])
pos.append([n.pos_ for n in doc])
else:
# We want to make sure that the lists of parsed results have the
# same number of entries of the original Dataframe, so add some blanks in case the parse fails
tokens.append(None)
lemma.append(None)
pos.append(None)
data['parsed_doc'] = parsed_doc
data['comment_tokens'] = tokens
data['comment_lemma'] = lemma
data['pos_pos'] = pos
relevant: tokens = []
lemma = []
pos = []
parsed_doc = []
col_to_parse = 'Q1'
col2_to_parse = 'Q2'
col3_to_parse = 'Q3'
col4_to_parse = 'Q4'
col5_to_parse = 'AddQ'
col6_to_parse = 'LastQ'
for doc in nlp.pipe(data[col5_to_parse].astype('unicode').values, batch_size=1,
n_process=1):
if doc.has_annotation("DEP"):
parseddoc.append(doc)
tokens.append([n.text for n in doc])
lemma.append([n.lemma for n in doc])
pos.append([n.pos_ for n in doc])
else:
# We want to make sure that the lists of parsed results have the
# same number of entries of the original Dataframe, so add some blanks in case the parse fails
tokens.append(None)
lemma.append(None)
pos.append(None)
data['parsed_doc'] = parsed_doc
data['comment_tokens'] = tokens
data['comment_lemma'] = lemma
data['pos_pos'] = pos
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
sweet
i am in voice chat if someone cares to help more
so from my understanding the columns (like parsed doc) get appended to a pandas dataframe
I printed the data out and imported a csv.
is it that pandas dataframe presents data as a vector so that you can use similarity?
the example they provide is this
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
doc2 = nlp("How do I obtain a pet?")
doc1.similarity(doc2)```
what would you like to say about this?
It's a recent blog post I wrote about scalable training
we're not a platform for self-promotion, so if you post your own content, please do so in the context of a conversation about the topic.
Happy?
I'm never happy 
when I run the code through the 'doc' class (posted above) I get <class 'spacy.tokens.doc.Doc'>
which is good! however my issue is i wanted to get multiple columns involved
so I parsed each part individually... but that results in printing a 'str'
why does this result in a vector? It clearly appends values to my csv.
I made a workaround instead but thanks for the help
๐
where can i get help w selenium?
#web-development, I guess
shit, i think i just scraped a website i wasn't supposed to
and proceeded to get banned
same. especially with aws

whoscored
no they use something called
imperva
which sounds like a harry potter spell but it's this thing that blocks scraperss
im surprised espn doesnt have an api
or do they
oh hey they do
you can just grab the data from here
the only thing is you need the requests library https://pypi.org/project/requests/
fun data engineering times

its okay you can usually google those and its a good skill to have

like if you added the ability to work with APIs to your projects, that would def go a long way imo
since more and more places require calling APIs for collecting data
nowadays
๐ฏ FREE Courses (100+ hours) - https://calcur.tech/all-in-ones
๐ Python Course - https://calcur.tech/python-courses
โ Data Structures & Algorithms - https://calcur.tech/dsa-youtube
โ๏ธ Newsletter - https://calcur.tech/newsletter
๐ธ Instagram - https://www.instagram.com/CalebCurry
๐ฆ Twitter - https://twitte...
that looks pretty comprehensive




