#data-science-and-ml
hey! do you know how many samples is considered "enough" or "not enough" to do cross-validation?
because i think it is used when we dont have enough samples
i have 2100 features and 300 samples, is it "not enough" to split into train and test sets?
please ping me if you answer
I had shown the wrong values in the explanation, I've edited them now
thanks for clarifying
i can't think of an obvious tidy way to do this. maybe ask on SO and link the question here. i can ponder and post a longer answer on SO if nobody else does first
idxs = []
idx = (df[2].notnull()==False) & (df[2].notnull().shift(-1)==True)
idx[len(idx)-1] = True
start=0
for i,end in enumerate(map(lambda x: x+1, sorted(idx.index[idx]))):
    print(i,start,end)
    idxs.append((i,start,end))
    start = end
idxs
result = pd.concat({i:df[start:end].reset_index() for i,start,end in idxs}, axis=1)
result.columns.names = ['combo','stock']
result
that's one way to do it
you're basically resorting to parsing a stream of 1s and 0s and grouping them
the absolute last resort technique
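A shorter variation on the same grouping idea, offered only as a sketch: take a cumulative sum over the "block start" marker and group on it. This is a different technique from the loop above, and the sample data below is hypothetical, assuming a non-null value in column 2 marks the first row of each block.

```python
import io
import pandas as pd

# Hypothetical sample: a value in the third column marks the start of a block
raw = """0.7,AAPL,0.5
0.1,MSFT
0.1,NVDA
0.1,GOOG
0.7,INTC,0.5
0.15,AMD
0.15,SKYT"""
df = pd.read_csv(io.StringIO(raw), header=None)

# Each non-null marker bumps the running count, so rows of one block share an id
block_id = df[2].notnull().cumsum()
blocks = [block.reset_index(drop=True) for _, block in df.groupby(block_id)]

for block in blocks:
    print(block)
```

The resulting list of frames can be passed to pd.concat with axis=1 in the same way as the loop-based version.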
Yeah I can't seem to think of an easy way either!
ask on SO anyway and show your current attempt (and make sure to include copy/paste-able sample data), maybe you'll get some other interesting variations
people like doing performance benchmarks on SO too
sparse_ts = (~df[2].notnull()).astype(pd.SparseDtype('bool'))
# we need to use .values.sp_index.to_block_index() in this version of pandas
block_locs = zip(
    sparse_ts.values.sp_index.to_block_index().blocs - 1,
    sparse_ts.values.sp_index.to_block_index().blengths + 2,
)
# Map the sparse blocks back to the dense timeseries
blocks = [
    df.iloc[start : (start + length - 1)]
    for (start, length) in block_locs
]
for block in blocks:
    print(block)
welp
that's an interesting one
very clever
i'm gonna file that one away in my brain if i ever need to work on boolean stuff like this
How do you get the depth of a multi index dataframe ?
what do you mean by the depth? the number of levels of a given index?
!docs pandas.MultiIndex.nlevels
property MultiIndex.nlevels
Integer number of levels in this MultiIndex.
Examples
```py
>>> mi = pd.MultiIndex.from_arrays([['a'], ['b'], ['c']])
>>> mi
MultiIndex([('a', 'b', 'c')],
           )
>>> mi.nlevels
3
```
Hello, does anyone have a couple of minutes to help me with pandas and merging dataframes?
I think I might have asked for the wrong thing! After doing a pd.concat, you can index into each 'level', how do you get the length of those?
can you show an example?
import io
import pandas as pd
df = pd.read_csv(io.StringIO('''0.7, AAPL, 0.5
0.1, MSFT
0.1, NVDA
0.1, GOOG
0.7, INTC, 0.5
0.15, AMD
0.15, SKYT'''), header=None)
df
sparse_ts = (~df[2].notnull()).astype(pd.SparseDtype('bool'))
block_locs = zip(
    sparse_ts.values.sp_index.to_block_index().blocs - 1,
    sparse_ts.values.sp_index.to_block_index().blengths + 2,
)
df2 = pd.concat({i:df[start:(start + length - 1)].reset_index() for i, (start,length) in enumerate(block_locs)}, axis=1)
df2.columns.names = ['combo','stock']
df2
Now I can index df2 like this:
df2[0]
df2[1]
How do I get this max depth (== length of depth)
I was doing df2.index.nlevels, it is:
df2.columns.nlevels
I was in a meeting. you figured it out, then?
I got confused as I thought it would be df2.index.nlevels
Hence the name multi index
when multiple indices are called multiindex and multiple columns are called multiindex
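To make the index/columns distinction concrete, here is a small sketch with made-up frames: after pd.concat(..., axis=1) with dict keys, the extra level lands on the columns, so that is where nlevels grows.

```python
import pandas as pd

# Hypothetical frames standing in for the per-block slices
left = pd.DataFrame({"w": [0.7, 0.1], "stock": ["AAPL", "MSFT"]})
right = pd.DataFrame({"w": [0.7, 0.15], "stock": ["INTC", "AMD"]})

# Dict keys become the outer column level when concatenating along axis=1
df2 = pd.concat({0: left, 1: right}, axis=1)

column_levels = df2.columns.nlevels  # outer key + original column name
row_levels = df2.index.nlevels       # the row index is still a plain index
n_groups = df2.columns.get_level_values(0).nunique()  # how many df2[i] exist

print(column_levels, row_levels, n_groups)
```

The last line is probably what "how many of these can I index into" was asking for: the number of distinct keys in the outer column level.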
Hello, I have multivariate time series data with the shape of (3916, 4). I am trying to use a CNN-LSTM model with the TimeDistributed layer. I need to reshape my data into [samples, subsequences, timesteps, features]. I want time steps to be 5
Does anyone know how I can reshape the data into this
what would the dimensions be, in that case?
when you reshape the data, the products of lengths of each dimension have to be the same
subsequences and 5 are new, as compared to (3916, 4) -- where is that data now?
are these in arrays, tensors, dataframes, or what?
arrays
what array has the subsequences?
a TimeDistributed layer needs it
steps and provide a time series of interpretations of the subsequences to the LSTM model to process as input. We can parameterize this and define the number of subsequences as n_seq and the number of time steps per subsequence as n_steps.
do you have more than one array of shape (3916, 4)?
no
whatever number subsequences is, if you have that many (3916, 4)-shaped arrays, you can stack them and get one that is (subsequences, 3916, 4)-shaped, and then you can transpose it.
I'm not sure about the 5.
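One common way to get the 4D shape from a single (3916, 4) array is to cut overlapping windows of n_seq * n_steps consecutive rows and then split each window into subsequences. The sketch below assumes that reading of the question; n_seq = 2 is an arbitrary choice, n_steps = 5 comes from the question, and the data is a stand-in.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Stand-in for the real multivariate series of shape (3916, 4)
data = np.arange(3916 * 4, dtype=float).reshape(3916, 4)

n_seq, n_steps = 2, 5      # n_seq is assumed; n_steps = 5 is from the question
window = n_seq * n_steps   # rows consumed by one sample

# One window per start row: shape (3907, 4, 10), window axis comes last
windows = sliding_window_view(data, window_shape=window, axis=0)
# Put time before features: shape (3907, 10, 4)
windows = windows.transpose(0, 2, 1)
# Split each window into n_seq subsequences of n_steps rows: (3907, 2, 5, 4)
X = windows.reshape(-1, n_seq, n_steps, data.shape[1])

print(X.shape)
```

This is why the element counts do not have to match a plain reshape: overlapping windows reuse rows, so the sample count becomes 3916 - 10 + 1 rather than 3916 / 10.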
df.select_dtypes(include="object")
I would like to strip all these columns
anyone know how to fix this error: Tensor.op is meaningless when eager execution is enabled
df.drop(columns=df.select_dtypes(include='object'), inplace=True)
By strip, I mean str.strip()
use .apply?
obj_cols = df.select_dtypes(include='object').columns
df[obj_cols] = df[obj_cols].apply(lambda y: y.str.strip())
does apply do things in parallel?
not by default but i think there's a patch to do it
you could also loop over column names and parallelize it yourself w/ joblib
Is there a way to do this inplace?
i was just about to look into that
you might need to manually memmap it or something
how big is this data exactly?
maybe you should be using dask if something as simple as str.strip is slow enough to want to parallelize
or a real database
unless you have 10k object columns i don't see much value in parallelizing something like this
Yeah, just writing the prototype in python, going to rewrite this in rust. But would like to see how far python will go, might try numba
numba won't do much for you w/ strings
what are you actually trying to do?
this seems like a total non-useful thing to try to optimize
at least use dtype='string' instead of dtype='object'
I used pd.StringDtype()
same thing
should be a bit faster than object, i believe it's backed by an arrow array
i'm not sure if parallel in-place operations on a dataframe are possible without some serious care and fuckery
at which point, like i said, you're probably outside the realm of what you should be doing with pandas and should be considering dask or even spark
or, again, an rdbms
however this does exist https://github.com/nalepae/pandarallel, and i think i used another library like this a couple years ago
if you actually explain what you're trying to do maybe i can be more helpful
this sounds like an x-y problem
e.g. if you have an ETL pipeline that needs to run in 500ms and it currently takes 750ms, you need to roll up your sleeves and start profiling and/or thinking about how you can store/handle this data differently
often the biggest gains are algorithmic
reducing number of passes over the data, etc.
will probably use clickhouse as the db
if it really does turn out that you are bottlenecked by string processing over a list, you might want to consider switching to native python lists, and using nuitka, mypy, or cython to build a c extension
or switch to pypy, which can be significantly faster for "plain python" operations than cpython
This is where I see julia being useful
but again, you need to have a goal in mind and start benchmarking
yep, i don't know specifically how much faster it will be for strings though
for all i know, dataframes.jl is backed by arrow anyway and the underlying performance characteristics might be similar
a lot of the slowness of python has to do with the overhead of the bytecode interpreter, so if you stay "inside" a C function for a long period of time then you're going to get significant speedups
yet another option is R, vectorized operations in R are very well-optimized, despite how ungodly slow things like loops and function calls are
i've processed 1bn+ records in data.table without thinking twice about it
for that much data i'd actually sooner pick up r w/ data.table than pandas
tldr you probably dont need to parallelize your str.strip operation
The reason why I'm using pandas is 2 fold, 1. documentation is vast (including SO), 2. learnt python as my first language... Ideally I would switch to Julia, but due to a lack of filled SO pages, the coding process is slower for me atm.
i'm not saying not to use pandas! pandas is a great tool
100% right tool for the right job
strip = re.compile(r'(^\s+|\s+$)')
df.select_dtypes(include=pd.StringDtype()).replace(strip, '', regex=True, inplace=True)
This is the only way I've found inplace
But then I get this error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
does this still trigger the warning?
string_columns = df.select_dtypes(include=pd.StringDtype()).columns
df[string_columns].replace(
    re.compile(r'(^\s+|\s+$)'), '', regex=True, inplace=True
)
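For what it's worth, the warning appears because df[string_columns] is a new frame, so inplace=True edits that temporary copy and not df. A sketch of the usual workaround is to assign the result back. The frame below is made up, and picking columns with pd.api.types.is_string_dtype is just one option, not something from the thread.

```python
import pandas as pd

# Hypothetical frame with padded strings in one column
df = pd.DataFrame({"name": ["  a ", "b  "], "n": [1, 2]})

# Collect the names of string-like columns (object or string dtype)
string_columns = [c for c in df.columns if pd.api.types.is_string_dtype(df[c])]

# Assign back to df so the original frame is updated, not a temporary copy
df[string_columns] = df[string_columns].apply(lambda col: col.str.strip())

print(df)
```

This is not truly in place in the memory sense, since new string data is still allocated, but it updates df without the warning.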
678,679,aaa,male,0.96,NG
679,680,bbb,male,0.99,IN
690,691,ccc,male,0.99,IN
691,692,ddd,male,0.99,IN
in the first column you can see the missing data(index 680 to 689 are missing)
I have another data frame
688,689,a4353
689,690,somename
690,691,dfsfds
691,692,sdfsdf
692,693,retret
693,694,ertret
I want to query the data in second data frame based on the missing index in first dataframe
wdym "missing"?
and you want to fill in those missing rows with rows from the 2nd df?
nope
or do you just want to get the rows from the 2nd df that fit that criterion?
I want the rows from 2nd dataframe.
yesss
680 to 689
from 2nd dataframe
there are few other rows missing in 1st dataframe
!e ```python
from io import StringIO
import pandas as pd
df1 = pd.read_csv(StringIO('''
678,679,aaa,male,0.96,NG
679,680,bbb,male,0.99,IN
690,691,ccc,male,0.99,IN
691,692,ddd,male,0.99,IN
'''),
    index_col=0,
    names=['idx', *(f'y{i}' for i in range(1, 6))]
)
df2 = pd.read_csv(StringIO('''
688,689,a4353
689,690,somename
690,691,dfsfds
691,692,sdfsdf
692,693,retret
693,694,ertret
'''), index_col=0, names=['idx', 'x1', 'x2'])
df1_idx_range = pd.RangeIndex(
    df1.index.min(),
    df1.index.max(),
)
df1_idx_missing = df1_idx_range.difference(df1.index)
df2_missing_df1 = df2.loc[df1_idx_missing.intersection(df2.index)]
print(df2_missing_df1)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
x1 x2
688 689 a4353
689 690 somename
the key operations are RangeIndex, Index.difference, and Index.intersection
Wow
Thanks a lot
happy to help. i keep dm's closed because otherwise i get a lot of them, more than i want
Any way I can repay you
understandable
you can repay me by reading the docs and making sure you understand how these functions work
then you can teach someone else one day
i'd like to plot continuous wavelets (i'd use pyplot.matshow(coef, fignum=1, aspect="auto") in jupyter) but in pyqtgraph instead, anyone know how to do that?
Has anyone worked with CONVLSTM2d layers in keras for time series data?
If so, can you explain the rules for reshaping data to become 5d?
Every time I try to reshape, I run into this problem where it says I cannot reshape it
Is it possible to save data in some variable in some other currently being executed file, when captured by the main file where SIGINT handler was attached? for the interruptions when program is being halted or stopped
Directory: /home/hadoop/
module.py
def incr(value):
    return int(value + 1)
main.py
import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
import sys
sys.path.append('/home/hadoop/')
import module
if __name__ == '__main__':
    df = spark.createDataFrame([['a', 1], ['b', 2]], schema=['id', 'value'])
    df.show()
    print(module.incr(5)) #this works
    incr_udf = F.udf(lambda val: module.incr(val), T.IntegerType()) # this throws module not found error
    df = df.withColumn('new_value', incr_udf('value'))
    df.show()
I think spark task nodes do not have access to /home/hadoop/
How do I access module.py from spark task nodes?
Hey, this question is about SVD,
Say once i have U, sigma and V, how do i decrease the dimensions?
what i did is
U = u[:,:no_of_parameters] # M, K # we can treat U as our X for smaller dimension
Sigma = sigma[:no_of_parameters, :no_of_parameters] # K, K
V = vh[:no_of_parameters] # K, N
Now say I have to classify this, so How would I get X which i could fit in and would have dimensions of (M, K)?
moreover, now to classify with more dimensions, what i did was first slice them and then get the X back, but it would have the same dimensions, since (M,K) * (K,K) * (K,N) = (M,N)
which works fine but its like converting to K and then again M,N which is big seems stupid,
is there any way by which i could put M,K sized new vector and fit it
and somehow get training data in K columns as well? (please ping me if you reply)
I am pretty sure you need to physically put the file on every node, not sure how though. Hadoop admin is 
SVD is matrix factorization: U D V*, so just multiply the factors back together
That said I haven't ever seen someone actually do this by hand like you describe
Usually I see people doing "truncated SVD" and working with the truncated U and V matrices
There is also an algorithm that directly computes a low rank approximation without doing full SVD and truncating
In a previous post we introduced the Singular Value Decomposition (SVD) and its many advantages and applications. In this post, we'll discuss one of my favorite applications of SVD: data compression using low-rank matrix approximation (LRA). We'll start off with a quick introduction to LRA and how it relates to data compression. Then we'll demon...
So yes this looks like a totally valid approach to low rank approximation
Maybe use the U matrix and not the reconstruction? Or do PCA and use the first K columns, if you are just looking for dimension reduction (PCA can be obtained from SVD on the covariance matrix on standardized data)
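A rough sketch of what "use the U matrix and not the reconstruction" could look like with NumPy: keep the first K columns of U scaled by the singular values as the (M, K) training features, and project any new rows with the same truncated V. The random data and K = 3 are placeholders, and note that np.linalg.svd returns the singular values as a 1-D array, not a matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 20))  # placeholder (M, N) training matrix
X_new = rng.normal(size=(5, 20))     # placeholder rows to classify later
k = 3                                # number of components to keep

u, s, vh = np.linalg.svd(X_train, full_matrices=False)

# (M, K) features: truncated U scaled by the singular values
X_train_k = u[:, :k] * s[:k]
# The same coordinates can be obtained by projecting onto the first K right vectors
X_train_k_alt = X_train @ vh[:k].T
# New rows must go through the same projection to land in the same K-dim space
X_new_k = X_new @ vh[:k].T

print(X_train_k.shape, X_new_k.shape)
```

A classifier would then be fit on X_train_k and asked to predict on X_new_k, so both sides live in K columns and nothing is multiplied back to (M, N).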
did you see a post in r/math about matrix multiplication
No, did i say something stupid?
no, it seemed like an interesting post, but i didn't look at it
That was meant to be a reply to their question about how to use only the first K columns
Link the post?
I don't follow Reddit usually
unrelated to your previous conversation, just something you might be interested in
a machine learning model of matrix multiplication i think ... : https://www.reddit.com/r/math/comments/pfioih/210610860_multiplying_matrices_without_multiplying/
i didn't look into it either, could just be fool's gold, dunno
i can think of several uses already without reading the paper
i mean, if you're already doing monte-carlo or statistical things, why not
didn't someone post this somewhere recently
I remember seeing it
Thanks for answering, and yep I'm aware of that, but if i multiply them even after, say, cutting off some edges like i mentioned earlier, I do get the same sized matrix.
there was one answer which said I can get a new matrix by U x Sigma, but giving that to the classifier did not help.
So by changing into lower-rank matrices, I do get the new X, yes, but its dimensions stay the same regardless.
It's not clear what you're trying to do, but that's exactly what that operation would do. Create a lower rank approximation to the original matrix
I'll show you what I'm trying to do in code hold on.
Wdym "did not help"?
Because yes that would be the right solution
the classification was kinda poor, it classified everything into one class.
You are doing clustering?
knn
from sklearn.neighbors import KNeighborsClassifier

def get_classifier_model(u, sigma, vh, no_of_parameters, k, y, X):
    U = u[:, :no_of_parameters]  # (M, K); we can treat U as our X in the smaller dimension
    Sigma = sigma[:no_of_parameters, :no_of_parameters]  # (K, K)
    V = vh[:no_of_parameters]  # (K, N)
    # Rank-K reconstruction of X: same (M, N) shape, but the values change
    X_reduced = U @ Sigma @ V
    # To check how close this is to the original X, use np.linalg.norm(X - X_reduced)
    neigh = KNeighborsClassifier(n_neighbors=k)
    neigh.fit(X_reduced, y)
    return neigh, U, Sigma, V
so thats what i did.
Plot your data
but it's very high-dimensional.
i have docs with, say, 61188 words, so what i did above was:
reduced the dimensions, got X back at, say, rows*61188, and put it into the classifier.
now say i just do U @ Sigma, which gives me a matrix of, say, rows*100; the classifier's result is very poor.
I'm sorry if I'm sounding confusing.
hello, i would like to ask: is it fine to include a Titanic dataset machine learning project in my resume if i build the machine learning algorithm from scratch? I googled and saw lots of advice from professionals that it is not good to include it, but i'm not sure why, or whether that applies to my circumstance, since i'm building linear regression from scratch without any library.
titanic data set is from kaggle*
Consider that 100 is still a very high dimension for computing useful distance metrics due to the curse of dimensionality
What does "from scratch" mean? Personally I don't think it's that bad to put it on a resume anyway, but it's certainly a low bar and a relatively easy/uncreative project
I don't think knowing about QR decomposition and applying formulas from your stats class is all that much better than anything else
oh okay, and by from scratch i mean i don't use the sklearn library for linear regression or k-means; i implement all the algorithms using the math formulas, like the cost function and gradient, so from scratch
It's not bad, it shows you know more than nothing, which is good
I would suggest that implementing SGD by hand is also kind of a low bar easy project
Moreover I think it reflects a shallow understanding of these algorithms, because I don't think SGD is a sensible choice for the titanic data
So if you're going to use the wrong algorithm then yes I would not suggest putting it on the resume
But it also depends on what kind of job you're applying for
Instead of putting it "on your resume", I would build a project around it
Demonstrate that you can write, that you can come up with a project plan, explain the plan, and reason about your decisions for choosing that plan
Demonstrate that you can generate useful data visualizations and reflect on the pros and cons of different approaches
sorry, the algorithms i'm implementing from scratch are for datasets that support them or work well with them
Demonstrate that you can write good quality code that other people will be able to use
I don't think it's a good idea to just put your class homework assignments on your resume
If you did some kind of capstone project or thesis, that's a great thing to include
But I doubt your capstone project was to run SGD linear regression on the titanic data
oh okay. So it's not great to include as a beginner?
There's nothing wrong with simple tools and familiar data sets, but nowadays the bar is too high
indeed, but for SVD the minimum i can go is k=min(m,n) so well.
It's not enough I think. It's literally a class assignment, which means that everyone who took your class can do the same thing
You can put the class on your resume, I think that's a great idea
Oh okay, and so like for email spam classification, you think i should create some visualization for it and deploy it using flask or just a terminal command line for this app is good enough?
Just not sure which projects are "bad" to hr
It depends on the kind of job you're applying for
Nothing is "bad", what is bad is putting some thing on your resume that doesn't reflect that you know what you're doing
10 years ago, SGD was hot shit
But today it's baseline entry level knowledge
They will ask you about SGD not to test how good you are, but to make sure that you aren't lying about having basic fundamental skills
oh okay.
So if you put fundamental skills on your resume and try to advertise them as something special, it's a red flag that you don't know what you're talking about
As for your spam classification project, it certainly can't hurt to spin up a web application and make nice data visualizations. However there will be a point of diminishing returns with respect to your time and energy
diminishing returns?
I would personally rather see that you have invested energy into making your work reproducible
Setting random seeds, using version control, etc
I would want to see that the code quality is good
I would want to see that you have handled the data correctly
Diminishing returns meaning, the benefit for every additional hour of work gets lower as you put more hours into the project
So what I would do is focus on making sure that the core classification project is solid, before moving onto "extra" stuff like a web application
oh okay, so instead i should just put it on github, write documentation in the README file, show visualizations, communicate how i cleaned the data and approached the problem, and keep good code quality, instead of making a web app out of it and deploying it?
However, using data visualizations to demonstrate that your work is successful is a critical skill for a data scientist, so I do strongly recommend working on data viz
oh okay
Yes, I think that's more productive. I also recommend writing a one page "executive summary". It should explain the purpose of the project, the approach you took, and your findings. Include one or two small data visualizations
oh okay thanks, so it's totally fine if the app is just a "run file" and the code executes, or should i build a web interface?
https://conference.nber.org/confer/2011/GFC11/Executive.pdf here is a longer executive summary from a sophisticated economics paper. Yours should be shorter, but they should give you a sense of the kind of content that belongs in an executive summary
Focus on writing documentation and making sure that your code actually runs from a "clean" environment. Can I follow your instructions and obtain the same results that you obtained?
And emphasize that you did this work aggressively when you interview
oh okay thanks
Sloppy data science destroys business value. Demonstrate that you understand this, and demonstrate that you are not a sloppy data scientist
Your code is readable, your scripts can be run without too much difficulty. You have produced sufficient justification that your model performs well. You can communicate to non-technical stakeholders, business management, etc. and demonstrate the value of your work.
Then yeah, throw a web app on top of that if you have time
And of course you have demonstrated that you are not just copying and pasting shit from TDS and Stackoverflow and Kaggle
oh okay thanks, i'll probably not do a web app since that requires more time like you said, with flask or django and bug testing if bugs happen
Note that these are the opinions and biases of exactly one person and should not be treated as fact or even general principle
oh okay. thanks for the help! Lastly i also saw this website called iNeuron which give free internship projects
from beginner to advanced
not sure if you've heard of it, but should i include that under internships
or under projects in my resume
if i do work on a project there
each project there gives a data set, and instructions on what they want "work with x cloud, use this model, " and some other requirements, but you are implementing it there is no source code to copy
Probably can't hurt
oh okay thanks
Note that if you did specifically want to mention the Titanic SGD thing, you can put it as a bullet point when describing the course you are taking
Good excuse to hit some HR buzzwords
oh okay, i'm not taking any course that does Titanic, but i thought it was interesting just to show...
thanks for help
I see, maybe you can put it as a single bullet point, but I don't think just programming SGD from scratch is enough to justify it as a "project"
Like I said, if you build more of a project around it it can stand on its own
It's like 20 lines of python, it's just too trivial
Do some data visualizations showing how SGD works maybe
Or compare different approaches to fitting linear models
Embedding vectors are different from regular vector representations in that they have the property of being closer in terms of distance to similar inputs, right?
are you talking about nlp?
Is anyone familiar with a library for using pretrained fastText embeddings in NNs (PyTorch)? Online I can only find code to encode text data to embedding ids, but everyone is doing it with their own code. I thought that by now there would be a library for that.
Does it matter? Are word embeddings special in some manner
Although now that I think about it I've only seen embeddings in NLP contexts
I just wasn't sure if "embedding vectors" means something outside of NLP.
Based on how word embeddings are trained, words whose embeddings have a closer cosine distance tend to be more semantically similar.
Could you elaborate. What is regular vector representation and what is embedding vectors
From my understanding embeddings are regular vectors with the additional property that their cosine distance from embeddings of similar inputs are relatively low
regular vector representation is just an abstract representation of some input
Oh. Okay. Then sure yes. But they "end up" with this property specifically because the vectors were trained and fine-tuned with an objective that makes similar things get similar vectors.
Like, a poorly trained embedding vector would be no different than just a regular vector.
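As a toy illustration of the cosine comparison mentioned above, here is a minimal sketch. The three vectors are invented by hand and do not come from any trained model; they only show how "similar inputs end up closer" would be measured.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-made stand-ins for trained embeddings (hypothetical values)
cat = np.array([0.9, 0.8, 0.1])
kitten = np.array([0.85, 0.75, 0.2])
car = np.array([0.1, 0.2, 0.95])

sim_cat_kitten = cosine_similarity(cat, kitten)
sim_cat_car = cosine_similarity(cat, car)

print(sim_cat_kitten, sim_cat_car)
```

With a well-trained embedding the first similarity would be expected to be the larger one; with an untrained one there is no reason for any such pattern.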
I found other options
--py-files in spark-submit and sc.addPyFile(path)
But with both of them, the spark-submit command gets stuck forever. Any idea about this?
hm, both seem right
stuck forever?
yo!
i have a problem. I'd like to plot the top 5 countries to see if they change over time. After some time working on it i've created a table with {year: [top_5_countries]}. How do I plot it?
so like top 5 countries are the ones with highest year?
oh
one way is dataframe looping
(nvm sorry)
one way to plot this is to plot the rankings as a line over time
one line per country
can you post this data as code/text and not as a screenshot?
then i can copy and paste it to help
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
well i mean. If country occured -> plot else -> dont plot but dunno how to do this.
{1896: ['AUS', 'BLR', 'DEN', 'HUN', 'ITA']}
{1900: ['EUN', 'LTU', 'NZL', 'PAK', 'URS']}
{1904: ['CUB', 'EUN', 'GDR', 'SUI', 'USA']}
{1906: ['CRO', 'IND', 'LTU', 'NZL', 'SRB']}
{1908: ['ARM', 'BOT', 'GUY', 'PAK', 'URS']}
{1912: ['EUN', 'GDR', 'GEO', 'GER', 'URS']}
{1920: ['ANZ', 'EUN', 'MNE', 'PAR', 'SCG']}
{1924: ['ANZ', 'CRO', 'FIN', 'IOA', 'URS']}
{1928: ['GDR', 'KOR', 'SCG', 'SRB', 'URS']}
{1932: ['CMR', 'EUN', 'FIN', 'IND', 'SCG']}
{1936: ['ANZ', 'EUN', 'GRE', 'NEP', 'URS']}
{1948: ['CIV', 'LTU', 'SCG', 'TJK', 'URS']}
{1952: ['AZE', 'ETH', 'GDR', 'PAK', 'SRB']}
{1956: ['ANZ', 'DEN', 'EUN', 'GDR', 'URS']}
{1960: ['EUN', 'GDR', 'SGP', 'URS', 'USA']}
{1964: ['GDR', 'NAM', 'SCG', 'URS', 'WIF']}
{1968: ['ANZ', 'GDR', 'SCG', 'URS', 'USA']}
{1972: ['EUN', 'MKD', 'PAK', 'SCG', 'UZB']}
{1976: ['GDR', 'GRN', 'RUS', 'URS', 'USA']}
{1980: ['ANZ', 'CUB', 'ESP', 'KOS', 'WIF']}
{1984: ['FIJ', 'GDR', 'JAM', 'PAN', 'URS']}
{1988: ['EUN', 'GDR', 'SRB', 'URS', 'USA']}
{1992: ['ANZ', 'EUN', 'GDR', 'HAI', 'URS']}
{1994: ['CUB', 'GDR', 'URS', 'URU', 'YUG']}
{1996: ['EUN', 'GDR', 'JAM', 'PAK', 'URS']}
{1998: ['ARM', 'GDR', 'SRB', 'SWE', 'URS']}
{2000: ['EUN', 'GDR', 'SRB', 'URS', 'ZIM']}
{2002: ['ANZ', 'EUN', 'HAI', 'IND', 'URS']}
{2004: ['AZE', 'BOH', 'GDR', 'SCG', 'URS']}
{2006: ['AZE', 'CRO', 'GDR', 'GEO', 'NGR']}
{2008: ['EUN', 'GDR', 'URS', 'USA', 'WIF']}
{2010: ['EUN', 'GEO', 'RUS', 'URS', 'UZB']}
{2012: ['EUN', 'GDR', 'PAR', 'UAE', 'URS']}
{2014: ['ETH', 'KOR', 'SGP', 'TTO', 'URS']}
{2016: ['ANZ', 'GDR', 'TGA', 'URS', 'WIF']}
yeah
for each year every country got a point. I've sorted them by points and categoried them by year
hm, that might get tricky with the lines thing
how many countries total? you don't want like 10 different colored lines if you can avoid it
is the first item in the list the highest or lowest ranked?
also why is each record a separate dict? i would imagine this should be one dict, where each key is a year and each value is a list
{
    2006: ['AZE', 'CRO', 'GDR', 'GEO', 'NGR'],
    2008: ['EUN', 'GDR', 'URS', 'USA', 'WIF'],
    2010: ['EUN', 'GEO', 'RUS', 'URS', 'UZB'],
}
yeah so 58 different lines is out of the question, even if only 5 of them are going to be visible at a given time
huh
then could you give me advice how to see if these countries change across timespan?
you could narrow it down a lot by getting the list of countries that appear in the top 5
i assume some of the 58 countries never do?
or are there 58 unique countries across this top 5 data?
these are all across this top 5 data
are you interested in 1 country specifically? or are you trying to get a sense of how the top 5 has changed over time?
second one
unfortunately i'm not sure. hadoop and spark can be fussy, the last time i used spark it was on databricks because it was too much work to keep it running well
you could plot a subset of countries' rankings still, maybe 10 or so
maybe take the countries with the 10 best average rankings over time
another option is a grid of some kind, where the y axis is the country and the x axis is the year
and each cell in the grid would have the ranking in it
you can try to use something called "seriation" to find interesting arrangements of the countries https://pypi.org/project/seriate/
or again do something naive like sort by average ranking over time
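The grid idea can be sketched with pandas alone. Since the lists posted above look alphabetically sorted, this version only records whether a country was in the top 5 in a given year, not its rank; the three years are copied from the merged dict earlier in the conversation.

```python
import pandas as pd

top5 = {
    2006: ['AZE', 'CRO', 'GDR', 'GEO', 'NGR'],
    2008: ['EUN', 'GDR', 'URS', 'USA', 'WIF'],
    2010: ['EUN', 'GEO', 'RUS', 'URS', 'UZB'],
}

# One row per (year, country) pair
long = pd.DataFrame(
    [(year, country) for year, countries in top5.items() for country in countries],
    columns=["year", "country"],
)

# Countries on the rows, years on the columns, 1 where the country made the top 5
grid = pd.crosstab(long["country"], long["year"])

# Naive ordering: countries that appear most often first
appearances = grid.sum(axis=1).sort_values(ascending=False)
grid = grid.loc[appearances.index]

print(grid)
```

The resulting frame can be handed to a heatmap-style plot; if real ranks are available, the 1s could be replaced by the rank values.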
I am appending multiple dataframes to CSV file using mode=a in to_csv method now how can I get 1 dataframes from this file??
read it with pd.read_csv?
Yeah
it might work! Ima try both options and see how it'll look. Thank you
Or normal file reading whatever works
i don't understand the question
The things is I am appending multiple dataframes to same CSV
So now how can I get one dataframe out of this csv
do you understand what csv is?
CSV file yeah
if you append them with header=False you can just read the file and it will already be one "data frame"
there's nothing magical here
They are different dataframes with different structure
?
well that's important information
the answer is no, you can't really do that
why not just write them to separate files?
you'd have to put some kind of delimiter between the two dataframes, then read the file line by line, partitioning or splitting on the delimiter
seems like a lot of trouble instead of just writing 1 dataframe per file
if you need to store multiple tables in a single file, consider using sqlite or hdf5
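A minimal sketch of the sqlite route, assuming two differently shaped frames (both invented here): each frame goes into its own table in one database, and each can be read back separately. An in-memory database is used so the example leaves no file behind; a file path would be used in practice.

```python
import sqlite3
import pandas as pd

# Two hypothetical frames with different structure
prices = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "price": [150.0, 300.0]})
users = pd.DataFrame({"name": ["ann", "bob", "cy"], "age": [31, 45, 27]})

con = sqlite3.connect(":memory:")  # pass a file path here to persist the data

# if_exists="append" mirrors the mode="a" habit: repeated calls add rows
prices.to_sql("prices", con, index=False, if_exists="append")
users.to_sql("users", con, index=False, if_exists="append")

# Each table comes back as its own DataFrame
prices_back = pd.read_sql("SELECT * FROM prices", con)
users_back = pd.read_sql("SELECT * FROM users", con)
con.close()

print(prices_back)
print(users_back)
```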
Sqlite sounds good ....
Thanks
guys can anyone share a data science and machine learning course link or a proper video
!resources
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
I'm trying to create a new column in this DataFrame with the name of the day of the week, but I can't. Does anyone have any tips?
from datetime import date
days = (
    'Segunda-feira',
    'Terça-feira',
    'Quarta-feira',
    'Quinta-feira',
    'Sexta-feira',
    'Sábado',
    'Domingo'
)
import pandas as pd
dados_vacinacao = pd.read_excel("vacinacao_br.xlsx")
dados_vacinacao.head()
UF Data Vacinacao quantidade
0 AC 2021-01-18 00:00:00 1
1 AC 2021-01-19 00:00:00 46
2 AC 2021-01-20 00:00:00 1021
3 AC 2021-01-21 00:00:00 1609
4 AC 2021-01-22 00:00:00 1105
dados_vacinacao["dia_semana"] = days[dados_vacinacao["Data Vacinacao"]].weekday()
TypeError: list indices must be integers or slices, not Series
days[dados_vacinacao["Data Vacinacao"].weekday()]
This error appeared
AttributeError: 'Series' object has no attribute 'weekday'
No
SyntaxError: invalid syntax
ok, you have to convert the column to datetime format first
Do you know how I can do this?
Hello, I wanted to learn to make a chatbot (preferably retrieval-based, but generative too if possible). I know Python with Pandas, Numpy and some regex.
Could you suggest some resources to learn how to do that?
Also, between the below two courses, which one seems better?
https://learn.datacamp.com/courses/building-chatbots-in-python
https://www.codecademy.com/learn/paths/build-chatbots-with-python
What can we learn from Artificial Intelligence in video games?
What do you mean?
You are doing AI in school?
guys, what is epsilon in svr model in sklearn.svm ? is it learning rate
pd.to_datetime(df[colname])
Did not work
import pandas as pd
pd.to_datetime(dados_vacinacao["Data Vacinacao"])
assuming dados_vacinacao is the dataframe and Data Vacinacao is the column
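Putting the pieces together, a sketch of the whole day-of-week column. The small frame below stands in for the Excel file (the values are copied from the `head()` output above); the `.dt.weekday` accessor returns 0 for Monday through 6 for Sunday, which lines up with the `days` tuple:

```python
import pandas as pd

days = ("Segunda-feira", "Terça-feira", "Quarta-feira", "Quinta-feira",
        "Sexta-feira", "Sábado", "Domingo")

# Stand-in for pd.read_excel("vacinacao_br.xlsx")
dados_vacinacao = pd.DataFrame({
    "UF": ["AC", "AC", "AC"],
    "Data Vacinacao": ["2021-01-18 00:00:00", "2021-01-19 00:00:00",
                       "2021-01-20 00:00:00"],
    "quantidade": [1, 46, 1021],
})

# Convert the column to datetime, then use the .dt accessor on the Series
dados_vacinacao["Data Vacinacao"] = pd.to_datetime(dados_vacinacao["Data Vacinacao"])
weekday_numbers = dados_vacinacao["Data Vacinacao"].dt.weekday  # 0 = Monday
dados_vacinacao["dia_semana"] = weekday_numbers.map(lambda i: days[i])
```

The original attempts failed because `weekday()` is a method of a single date, not of a whole Series; the `.dt` accessor applies it to every row.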
sup, could you send a good python video thats focused on data science
thx
does anyone know how to implement a machine learning api into python code
are you calling the api or coding one
those are 2 different topics

calling an api is relatively easy for most common api's. look into their documentation, usually very helpful
no. support vector regression is like linear regression, but errors smaller than ε are ignored. so ε controls the tolerance of the model to errors; bigger ε means greater tolerance. this is analogous to the interpretation of C in support vector classification as the tolerance of the model to misclassification.
see:
https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR
https://scikit-learn.org/stable/modules/svm.html#linearsvr
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.114.4288
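To make the "errors smaller than ε are ignored" part concrete, here is a small sketch of the ε-insensitive loss in plain Python. The function name and the sample numbers are made up for illustration; this is only the per-sample loss, not how sklearn fits the model:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Return max(0, |y_true - y_pred| - epsilon) for each pair."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]


y_true = [1.0, 2.0, 3.0]
y_pred = [1.05, 2.5, 1.0]

# The first error (0.05) is inside the epsilon tube, so it costs nothing
losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
```

Raising `epsilon` widens the tube of ignored errors, which is why it behaves like a tolerance and not like a learning rate.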
thanks
hi
im using jupyter notebooks
this happens when i try to load an image (.tif)
I dont know why
"that initially simple mathematical policies can lead to somewhat intelligent and complex behaviours in AI agents, very reminiscent of humans". watch this for an illustration https://www.youtube.com/watch?v=n6nF9WfpPrA
anyone know why i keep getting this error
## using Health Care as BaseLine
exog_vars = ['BOARD_SIZE','CEO_DUALITY','BOARD_AVERAGE_TENURE','BOARD_MEETING_ATTENDANCE_PCT',
'PCT_WOMEN_ON_BOARD','ROE','CUR_MKT_CAP','ROA','D/E',
'Consumer Discretionary','Consumer Staples', 'Energy', 'Financials',
'Industrials', 'Information Technology', 'Materials',
'Utilities']
exog = sm.tools.tools.add_constant(panel_data[exog_vars])
endog = panel_data['ESG_DISCLOSURE_SCORE']
mod = PanelOLS(endog, exog, entity_effects=True)
PanelOLS4_res = mod.fit(cov_type='clustered', cluster_entity=True)
# Store values for checking homoskedasticity graphically
fittedvals_fe4_OLS = PanelOLS4_res.predict().fitted_values
residuals_fe4_OLS = PanelOLS4_res.resids
AbsorbingEffectError Traceback (most recent call last)
<ipython-input-50-a32677c6977b> in <module>
8 endog = panel_data['ESG_DISCLOSURE_SCORE']
9 mod = PanelOLS(endog, exog, entity_effects=True)
---> 10 PanelOLS4_res = mod.fit(cov_type='clustered', cluster_entity=True)
11 # Store values for checking homoskedasticity graphically
12 fittedvals_fe4_OLS = PanelOLS4_res.predict().fitted_values
C:\ProgramData\Anaconda3\lib\site-packages\linearmodels\panel\model.py in fit(self, use_lsdv, use_lsmr, low_memory, cov_type, debiased, auto_df, count_effects, **cov_config)
1779 if self.entity_effects or self.time_effects or self.other_effects:
1780 if not self._drop_absorbed:
-> 1781 check_absorbed(x, [str(var) for var in self.exog.vars])
1782 else:
1783 # TODO: Need to special case the constant here when determining which to retain
C:\ProgramData\Anaconda3\lib\site-packages\linearmodels\panel\utility.py in check_absorbed(x, variables, x_orig)
441 absorbed_variables = "\n".join(rows)
442 msg = absorbing_error_msg.format(absorbed_variables=absorbed_variables)
--> 443 raise AbsorbingEffectError(msg)
444 if x_orig is None:
445 return
AbsorbingEffectError:
The model cannot be estimated. The included effects have fully absorbed
one or more of the variables. This occurs when one or more of the dependent
variable is perfectly explained using the effects included in the model.
The following variables or variable combinations have been fully absorbed
or have become perfectly collinear after effects are removed:
const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, ROE, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, PCT_WOMEN_ON_BOARD, ROE, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
const, BOARD_SIZE, CEO_DUALITY, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
const, BOARD_SIZE, CEO_DUALITY, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, BOARD_MEETING_ATTENDANCE_PCT, PCT_WOMEN_ON_BOARD, ROE, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, ROE, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, BOARD_MEETING_ATTENDANCE_PCT, PCT_WOMEN_ON_BOARD, ROE, CUR_MKT_CAP, ROA, D/E, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
Set drop_absorbed=True to automatically drop absorbed variables.
@glad mulch it looks like one of your variables is fully explained by one of your effects, as per the error
It's possible that you forgot to drop a level in a dummy variable, or you have some combinations of variables that are redundant
@glad mulch https://github.com/statsmodels/statsmodels/issues/2568 https://stackoverflow.com/q/55071706 https://github.com/bashtage/linearmodels/issues/193 you need to drop levels of dummy variables
It's a bit hard to read on my phone, but it looks like they are suggesting several combinations that are collinear
It's possible that your data is messed up
In fact, if you can share the data, or a sample of it that reproduces the problem, that might help
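For reference, a tiny sketch of what "dropping a level" of a dummy variable looks like in pandas. The `sector` values are invented; with every level kept, the dummy columns sum to 1 in each row, which duplicates the constant:

```python
import pandas as pd

# Hypothetical sector column
sector = pd.Series(["Energy", "Financials", "Health Care", "Energy"], name="sector")

all_levels = pd.get_dummies(sector)                # one column per level
dropped = pd.get_dummies(sector, drop_first=True)  # first level becomes the baseline

# With all levels kept, each row sums to 1, i.e. the columns add up to the constant
row_sums = all_levels.sum(axis=1)
```

Note that `drop_first=True` drops the first level in sorted order (here `Energy`); to use a different baseline such as Health Care, drop that column by name instead.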
Hey @glad mulch!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .csv attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
I'm studying data science at a low level and need help trying to create a multi objective optimization using goal programming to generate a pareto frontier.
I managed to make it in excel and mathematica but couldn't do so in python
Any help at all would be appreciated
I did make a post on help-apple
woah, that was quite a "technical" answer, thanks
What do you call the classification categories in a formal way? Like if the dataset has three classification categories like -1,0,1 - what are -1,0,1 called?
labels
targets
Gotcha
Unless, you're talking about the set of defined labels for your dataset?
I'm not sure if there's a separate term but it seems like there should be
Nah I think labels is the correct word here
hello. I've df with two columns, first - 'Class' (one of 8 skills) and second - 'isOver30' (True/False)
how to make stacked bar chart with Classes as x,
total(isOver30) for each class as y and total(true) under total(False)
Hi guys
Has anyone worked with BERT previously? If so, have you used the transformers library by HuggingFace? I need help with running the BERT model offline, because the system I will be running this model on will not have access to an internet connection, and so a pre-trained model is the way to go. I'm stuck at this point and not sure how to proceed, please help. Thank you!
hi :) i put a really basic question about BCELoss in #help-honey if nayone could help i would appreciate it thanks
Hi i need some help how can i dump rows in another df
Hi can someone help me with databases
If you're implementing your model using tensorflow backend, checkout tf serving,
else checkout flask if you're looking for a simple api it's pretty easy to learn
Anyone here doing projects on NLP? Am looking for someone who's is learning AI by building some projects?
Yes
Are there any resources for stats stuff in regards to checking pre-processing image data for a CNN?
Yeah i have a couple NLP projects i'm doing for learning
whenever i wanna learn a new kind of model or something I try to replicate it from the paper so I can see its functions and such
it gives me a good understanding of the inner workings of the model by hard coding them on my own
Dataframe:
CellLine,39,0.541,Animal,red,#FF0000,0.59
GroupName,939,0.399,Animal,red,#FF7F00,0.569
GroupSize,367,0.625,Animal,red,#FFD400,0.343
SampleSize,43,0.584,Animal,red,#FFFF00,0.791
Sex,574,0.83,Animal,red,#BFFF00,0.099
Species,1585,0.859,Animal,red,#6AFF00,0.047
Strain,372,0.62,Animal,red,#00EAFF,0.387
Dose,637,0.618,Dose,green,#0095FF,0.232
DoseDuration,210,0.545,Dose,green,#0040FF,0.262
DoseDurationUnits,198,0.536,Dose,green,#AA00FF,0.177
DoseFrequency,92,0.513,Dose,green,#FF00AA,0.522
DoseRoute,558,0.563,Dose,green,#EDB9B9,0.317
DoseUnits,472,0.592,Dose,green,#E7E9B9,0.252
TimeAtDose,115,0.276,Dose,green,#B9EDE0,0.617
TimeAtFirstDose,45,0.037,Dose,green,#B9D7ED,0.644
TimeAtLastDose,21,0.0,Dose,green,#DCB9ED,0.571
TimeEndpointAssessed,659,0.5,Dose,green,#8F2323,0.382
TimeUnits,594,0.653,Dose,green,#8F6A23,0.12
TestArticle,1831,0.485,Exposure,orange,#4F8F23,0.281
TestArticlePurity,26,0.213,Exposure,orange,#23628F,0.855
Vehicle,417,0.44,Exposure,orange,#6B238F,0.283
TestArticleVerification,5,0.0,Exposure,orange,#000000,1.0
Endpoint,4316,0.308,Endpoint,blue,#737373,0.844
EndpointUnitOfMeasure,682,0.35,Endpoint,blue,#CCCCCC,0.499
Code:
df.plot.scatter(x='unique_percent', y='b_f1', xlabel='Percent Unique Mentions', ylabel='BERT $F_1$ Score', color=df['unique_colors'])
How do I add a color legend on the side?
This is the result:
My advisor told me to do it like this. I'm not complaining because it's gay.
Dumb question, I have a really big csv file, like 134000 rows of data (in json file). how can I speed up processing the data
It depends on what data processing you are trying to do and how much of it can be done independently.
so far I have spent 2 days with it. In the data processing code, I'm trying to convert a string data which include [range, azimuth, doppler] into [x,y,z]
how is it stored? a csv?
I did the similar process with other data 12900 row before. Yeahh csv
let's just forget how many rows there are. It doesn't matter for figuring out what transformation you need to do
Can you provide a few example rows from the CSV as text (no screenshot)?
Hey @rigid zodiac!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
It's just a snipit of it
I can't dive into this at the moment, but if transforming each row can be done independently, you can figure out what the transformation is and do them in parallel
Can i ask a question about Natural Language Processing here?
I really can't find the definition of Shallow NLP
yeah you can ask anything related to machine learning or data science (which nlp is)
if the rows are independent of each other and it's compute-bound, you can use multiprocessing and create a process pool to compute them in parallel
it would be something along the lines of:
import multiprocessing as mp
with mp.Pool() as pool:
    results = pool.map(transform_function, df.iterrows())
you will have to mess with results a bit to get it back to a df because it will return a list of the outputs from the transform_function
So.. can anyone explain shallow NLP, or where can i find a source that explains it?
https://nlp.stanford.edu/projects/shallow-parsing.shtml I found this on shallow parsing, i'm not sure if it's what you're talking about though
So, it's written "In NLP for text-retrieval, need robust approach and efficient (Shallow NLP)"
i dunno what it means, so i try to find out
import matplotlib.patches as mpatches
ax = df.plot.scatter(x='unique_percent', y='b_f1', xlabel='Percent Unique Mentions', ylabel='BERT $F_1$ Score', color=df['unique_colors'])
colorlist = zip(df['0'], df['unique_colors'])
handles = [mpatches.Patch(color=colour, label=label) for label, colour in colorlist]
labels = df['0']
ax.legend(handles, labels, loc='center right', bbox_to_anchor=(1.5, 0.5));
Any resources?
@mortal dove seems like an unnecessarily complicated solution, but I suppose matplotlib might not support a better one
Yea, can't find a easier way, seaborn might have a simpler solution
@mortal dove you can add it to the so question I posted, but the question is posed differently https://stackoverflow.com/questions/69046713/create-a-color-coded-key-for-a-matplotlib-scatter-plot-with-specific-colors
Also if you post it there, please don't include the real labels.
Right, will do
seaborn is indeed a lot easier if you can use it
ax = sns.scatterplot(x=df['unique_percent'], y=df['b_f1'], palette=df['unique_colors'].tolist(), hue=df['0'])
5 Steps to Machine Learning with Google(Completely Free)
https://medium.com/@TheUndergrad/5-steps-to-machine-learning-with-google-completely-free-ea5cddf39ddd
That's so cool
can we collab on some projects related to NLP?
Nah I don't really collab with people, sorry
Guys how can I start learning ai and data science
start with hitting the book or coursera
@rigid zodiac well I wanna learn ai and machine learning
Do you have any specific thing to learn in mind? Like you wanna know more about text processing or image? or you just wanna learn more about them?
if you're new to ML check out Kaggle courses, maybe you can start with intro to machine learning course.
@untold parrot I am making a discord bot but I want it to learn about the way users talk and make it talk to them
Basically a chat bott
i think plotly could do it better. itll also allow you to have hover text for the labels which i think would be helpful in this case
Btw what's the download size of tensorflow
It's going in a PDF
youll need a database to store that information if its a certain server
U mean I don't need ML or u mean I will require a db too
ah gotcha nvm. i was thinking with a deployment mindset
the latter
Ok so I need a db too
And I have thaat
Now I wanna learn ml foor my bot but how @misty flint
plenty of tutorials. you dont need to learn all of ML first to do what you need
you need to learn the bare minimum. how is your chatbot going to learn from your dataset? what do you want it to say? etc.
I want it to learn from the chat of the users
yes but what is it going to say in response
The thing it learned
theres different approaches to your problem so think about what youre looking for and then search through some of the tutorials. come back when you have a more specific problem, yeah?
google discord bot and you can find many tutorials, see which one is close to what you want and follow those
anyone got good links for deciding on NN archs? e.g. how many conv layers, what kernel sizes, pooling types, etc
Is there some good project to start with numpy and pandas? 
These both are in my school syllabus and I'm thinking to learn them with some project, rather than just trying out each method one by one.. coz that's very boring 
shot in the dark. does anyone here have experience with matplotlib for graphing data?
while using pandas i came across statements df.sort_values('tip',ascending=False)[0:2] and df.sort_values('tip',ascending=False).iloc[0:2] both the statements gave me same result how?
i asked this in help channel but no reply
pandas tends to give you multiple ways of doing the same thing.
the first uses a slice. the second slices an iloc. that's about it really, both do the same thing here.
(unless you;re saying that they didnt give the same result. that would be a different question entirely)
when i used iloc it gave me data in some type of format
but here the results were in tables
could you rephrase your question? are you getting same results from both or different
same but iloc give rows
if we slice using iloc shouldn't we get row exluding header row
for example ['name','age']['abc',123]..... so in this name and age are column names
so if i use iloc they should be exluded
oh, i see where you're coming from. when you select a row using iloc, you get the row. but when you slice, you still get the dataframe, so column names will be there.
ya
ie. what you're saying would only be true if you were talking about .iloc[2]. .iloc[0: 2] gives the slice of* df, so no, the column names should be there, as intended
This will only give you the same results if df has integer indexes or RangeIndex
Actually maybe there is a special case in there for slices of integers
I try to avoid using implicit row slicing/subsetting whenever possible
in .iloc[2] there is a type format its not in table form
yes, it's a series
I always use loc/iloc for rows
this will only apply on selection, but not on slices.
so if we select multiple series it should be same
Calling iloc with a slice or something "array like" returns a DataFrame. Calling it with a "scalar" returns a Series
so is there any way that i can specify that its a header row and it should not be included
I think you're confusing the content of the table with its display on the screen
whats the difference b/w loc and iloc
That or you need to pass different options to read_csv
this might be an XY problem. a dataframe always has column names. maybe your goal can be achieved regardless depending on what you're actually facing
The former uses the row labels, the "index", the latter always uses the numerical position of the row in the table
oh
It's a common newbie trap, because the default row labels are also integers
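A small sketch of that trap, using a made-up `tip` column. After sorting, the row labels are no longer in positional order, so `loc` and `iloc` with the same number pick different rows:

```python
import pandas as pd

df = pd.DataFrame({"tip": [1.0, 5.0, 3.0, 2.0]})  # default labels 0..3
by_tip = df.sort_values("tip", ascending=False)   # labels now ordered 1, 2, 3, 0

row_by_label = by_tip.loc[1]      # the row whose label is 1 (tip 5.0)
row_by_position = by_tip.iloc[1]  # the second row in sorted order (tip 3.0)

# A positional slice is still a DataFrame, so the column names come along
top_two = by_tip.iloc[0:2]
```

On a freshly created frame the two agree, because the default labels happen to equal the positions.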
As for your actual question, are you just trying to print the data without the column names? Or did you somehow accidentally get the column names into a row in the dataframe?
thing is that when i did for single iloc it gave me ans in series
but i want multiple rows
not columes
you can range it
Then use a slice or a list of numbers, not a single number
like df.iloc[: , : ]
but can i fix my header row
sure
what's this business about the header row?
that it should not be included in iloc
@wraith lagoon can you show us the output you're getting and explain what you want to be different?
I'm pretty sure when you slice like that, it will have the header row
it will display the header row because that's part of the dataframe. but the column names are not "part of" the data, they're kept in a separate place
ok
Btw Have any one work on the LSTM model? or CNN model before? If yes can I have your code and use it as sample
i started learing pandas
!e ```py
import os
os.system("ls")
@heady yarrow :warning: Your eval job has completed with return code 0.
[No output]
bruh
!e ```python
import pandas as pd
df = pd.DataFrame([
['a', 11, 12],
['b', 21, 22],
['c', 32, 32],
], columns=['i', 'y', 'z']).set_index('i')
print(df)
print( df.iloc[1] )
print( df.iloc[[1]] )
print( df.iloc[:2] )
print( df.iloc[[0, 2]] )
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | y z
002 | i
003 | a 11 12
004 | b 21 22
005 | c 32 32
006 | y 21
007 | z 22
008 | Name: b, dtype: int64
009 | y z
010 | i
011 | b 21 22
... (truncated - too many lines)
Full output: https://paste.pythondiscord.com/ewezobaveh.txt?noredirect
dont know much about
@wraith lagoon https://pandas.pydata.org/docs/user_guide/indexing.html
@heady yarrow #bot-commands
so when i do df[df.describe().transpose()] this should also work
but this dont work
...what did you expect that to do?
!d pandas.DataFrame.describe
DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
Generate descriptive statistics.
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding `NaN` values.
Analyzes both numeric and object series, as well as `DataFrame` column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
!d pandas.DataFrame.transpose
DataFrame.transpose(*args, copy=False)
Transpose index and columns.
Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. The property [`T`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.T.html#pandas.DataFrame.T "pandas.DataFrame.T") is an accessor to the method [`transpose()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html#pandas.DataFrame.transpose "pandas.DataFrame.transpose").
Me totaly confussed
Also why np.vectorize is faster then apply command in pandas . i searched but didnt get what happen in backend of np.vectorize
There was a good time difference
In apply time diff was 3.something and in vectorize it was in 0.something
So why there is a vast diff?
I can't figure out how to calculate micro averages with only the scores to be averaged and the percent share.
is there a way for python (i.e using jupyter notebook) to acess files directly on the USB i.e. without downloading the file locallly first?
or other IDEs?
I am working on some assignment in networkx and need little help of understanding math and graph of it if someone could explain me :
The 2-dimensional lattice generated by vectors u,v consists of the points au+bv for all integers a,b. Considering vectors u,v, and an integer k>0 as input parameters, write a program creating a planar graph of a lattice generated by u,v, whose vertices are the origin (0,0) and only its k closest neighbors among the lattice points, while any vertices related by a translation by only u, v, u+v or u-v should be connected by a straight-line edge.
I have to impleement it in networkx
but could not understand how to start
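One possible starting point, sketched in plain Python so the graph logic is visible before networkx comes in. Everything here is an assumption about how to read the task: nodes are stored as integer coefficient pairs `(a, b)` meaning the point `a*u + b*v`, `u` and `v` are assumed linearly independent, and the search window of `|a|, |b| <= k` is a guess that can be too small for a strongly skewed basis:

```python
from itertools import combinations


def lattice_graph(u, v, k):
    """Return (nodes, edges) for the origin plus its k closest lattice points."""
    # Assumed search window; widen it if u and v are far from orthogonal
    coeffs = [(a, b) for a in range(-k, k + 1) for b in range(-k, k + 1)]

    def sq_dist(c):
        x = c[0] * u[0] + c[1] * v[0]
        y = c[0] * u[1] + c[1] * v[1]
        return x * x + y * y

    # The origin sorts first (distance 0); ties are broken by coefficient order
    nodes = sorted(coeffs, key=lambda c: (sq_dist(c), c))[: k + 1]

    # Coefficient differences for translations by u, v, u+v, u-v, either direction
    steps = {(1, 0), (0, 1), (1, 1), (1, -1)}
    steps |= {(-a, -b) for a, b in steps}
    edges = [(p, q) for p, q in combinations(nodes, 2)
             if (q[0] - p[0], q[1] - p[1]) in steps]
    return nodes, edges


nodes, edges = lattice_graph(u=(1, 0), v=(0, 1), k=4)
```

From there, `nx.Graph()` with `add_nodes_from(nodes)` and `add_edges_from(edges)` would give the networkx object, and the actual coordinates `a*u + b*v` can be passed as the `pos` dictionary when drawing.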
https://flourish.studio/2019/03/21/bar-chart-race/ this comes to mind, check it out!
The ide that you use doesn't matter, as you provide a path and python requests the data from the operating system
Would this be the right chat to ask about how to label items in bar charts in matplotlib python library?
Show the code you used
i want to be senpai in ML how long it will take me?
if you're plotting this using sns.distplot() you can pass a sequence as your bins in bins= karg
this is for dropping random rows in a df
np.random.seed(np.random.randint(1000))
rand_values=[]
for i in range(2000):
    rand_values.append(np.random.randint(5999))
for i in range(2000):
    data=data.drop(rand_values[i])
is your dataframe indexed using integers? .drop() requires the label matches what's in your dataframe. If you want to use iloc[] for this, you could do data.drop(data.iloc[rand_values[i]].name) or sth like that
perhaps even better solution than what you're trying to get working would be something like
remove_n = 1
drop_indices = np.random.choice(df.index, remove_n, replace=False)
df_subset = df.drop(drop_indices)
the indexes are all integers and sorted , i just want to drop random rows .
this would take a random choice of the actual indices that exist in your dataframe and pass them into .drop() for removing
.sample
np.random.seed(np.random.randint(1000))
this part doesn't make sense btw
works just fine thanks a lot
i have no idea why i did , was little confused , thanks anyways
by updating frac? yeah i could do that
yeah, I'm guessing you want 1/3 of values
or you could just pass 2000
works too
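For anyone reading along, a sketch of the `.sample` version. The frame is a stand-in with 6000 rows, and `random_state=0` is only there to make the demo repeatable:

```python
import pandas as pd

data = pd.DataFrame({"value": range(6000)})  # stand-in for the real frame

# Pick 2000 distinct rows at random, then drop them by index label
to_drop = data.sample(n=2000, random_state=0).index
remaining = data.drop(to_drop)
```

Because `sample` draws without replacement by default, no label is picked twice, which is what can make the loop with `np.random.randint` fail on a repeated value.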
thank you so much been trying so hard and it was this ez
yw
This is something you'd spend a career on
from openslide import OpenSlide
import numpy as np

def get_patches(imgfolder, lvl = 0, size = 512):
    """
    This function takes in an image of certain resolution
    and extracts patches of size - (width, height) and saves them in
    certain folder via a directory
    parameters:
    imgfolder - contains imgs of file type NDPI
    lvl - resolution of the img
    size - square patch so int - (int, int) - (width height)
    """
    # we get the number of the image with the file name
    for img_num, imgfile in enumerate(imgfolder, start = 1):
        # using openslide to open the img file
        os_img = OpenSlide(imgfile)
        # dimensions of the img at a particular lvl & calculating the interval
        # (integer division, because range() does not accept floats)
        os_img_w_interval = os_img.level_dimensions[lvl][0] // size
        os_img_h_interval = os_img.level_dimensions[lvl][1] // size
        # row of the overall img
        for i in range(os_img_w_interval):
            # column of the overall img
            for j in range(os_img_h_interval):
                # getting a patch of the img
                # note: read_region expects the location in level-0 coordinates,
                # so these offsets are only exact for lvl = 0
                ptch = os_img.read_region((i * size, j * size), lvl,
                                          (size, size))
                arr = np.array(ptch)
                # exclude the last channel column RGBA - as all values are 255 so opaque
                RGB = arr[:, :, :-1]
                # trying to get rid of all background pxs using a min px threshold
                # and proportion of green px vals
                flat_RGB = RGB.reshape(-1, 3)
                green = flat_RGB[:, 1] # just green pxs
                # proportion value
                prop = len([px for px in green if px > 180]) / len(green)
                # min px in the patch
                min_px = np.min(RGB)
                if (min_px < 130) or (prop < 0.8):
                    # saving the image PNG format into the specified directory
                    ptch.save('/content/drive/MyDrive/Data/older_retics_patches/img'+
                              str(img_num) + '/' + 'patch_' + str(i) + "_" + str(j)+ ".png")
Guys is there a way I can speed up this code? This function extracts patches from images of type NDPI, which range from 120 MB to 3 GB in my data set
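One spot that is probably worth checking first is the green-pixel proportion, since it loops over every pixel in Python. A sketch of the same calculation done with NumPy, which is usually much faster for this kind of per-pixel test. The patch below is random data standing in for a real slide region, and it is kept small so the demo runs quickly:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one RGB patch (random values, not a real slide)
RGB = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)

green = RGB[:, :, 1]

# Original approach: Python-level loop over every pixel
prop_loop = len([px for px in green.ravel() if px > 180]) / green.size

# Vectorized approach: the mean of a boolean mask is the proportion of True values
prop_vectorized = np.mean(green > 180)
```

Both give the same number. Whether this is the real bottleneck depends on how long `read_region` and the PNG saving take, so it is worth profiling before changing anything else.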
Stopwords are just those filler words we use in English that may not be necessary to convey meaning for a machine.
So for example, articles "he is a man" would still retain the meaning more or less if you changed it to "he is man". It may sound weird but for a machine it's learning whatever you send it anyways. The benefit is that you get to reduce the number of words needed
So it can lead to better performance for simple models
"man" would be the only non stopword in a lot of systems.
Yeap yep. But I think it makes explaining it harder, so I'm perfectly happy to fudge the specifics a little to convey the concept easier.
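A toy sketch of the idea, with a tiny hand-picked stopword list. The list is an assumption for illustration only; real libraries ship much longer ones:

```python
# Tiny hand-picked stopword list, for illustration only
STOPWORDS = {"he", "is", "a", "the", "an", "of"}


def remove_stopwords(text):
    """Lowercase, split on whitespace, and keep only non-stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]


tokens = remove_stopwords("he is a man")
```

With this list only `man` survives, which matches the point above that it would be the only non-stopword in a lot of systems.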
what are you trying to identify?
Need help on this
Hi I wonder how to grab values starting with '#'?
I've tried using it like this...but it returns error
@serene scaffold I want to identify images
Please be more specific
Like I want to make a anti NSFW system in my discord bot but how will it identify any NSFW content that's where I thought of taking help of ai, my bot will work like if any NSFW content is posted then it will immediately delete it
@serene scaffold
I imagine that's going to be exceptionally difficult as "nsfw content" could be a lot of things. I guess I would see what literature is out there about identifying nudity in images and go from there.
Well is it possible with ai to make a anti NSFW
?
it is possible yes.
How?
I don't personally know, though I already made a suggestion about what you could do next.
@serene scaffold what lib shall I use for this???
idk
Aah
I have nothing else to suggest that I haven't already suggested.
I just started with machine learning, so sorry if I made a mistake or something.
So I watched a few tutorials on deep reinforcement learning with tensorflow and keras. On the build model step, they added different amounts of Flatten, Dense, and Convolution2D. In the CartPole-v0, it was Flatten, Dense, Dense, Dense(actions). In one where the states were a number from 0 to 100, there was no Flatten. And on the one with Space Invaders, they added 3 Convolution2D ones.
How do I know which layers to add and which arguments to give? I'm currently trying to build one with an input of an 11 x 15 2D array.
You can create a simple model using tensorflow, using hentai images as your dataset. I recommend taking at least 1000 nsfw images and 1000 non-nsfw images, labeling them as [nsfw and non-nsfw], and then training your model using any neural network architecture.
The idea is to create a model that can detect whether an image is nsfw or not. It's possible to improve it
'cuz here we are just classifying, but it's possible to create a model to detect and classify. To detect and classify there are some specific models, like YOLO, SSD....
If im not mistaken, in keras doc it explains most of the parameters in every possible layer, and what layer do. I recommend reading keras docs, it helped me. And also, keep studying about Deep Learning and its architectures.
is there a way in python to check the file size of an image before saving it?
what data type is representing the image file?
so i've changed my array to a pil.image
Please always provide code as text. But let me see
How th I am going to get 1000 images and how will I install them in it lol
I'm not sure how to know what the size of the file will be before you save it, but there's os.path.getsize
Just automate your browser use an API to get them
To install: https://www.tensorflow.org/install, and to get the images, you can go to kaggle (web site with a lot of databases) or you can do it yourself manually or maybe Web Scraping can do the job (i've never done this with image)
You could also save it to an in-memory file and use that.
image: PIL.Image
from io import BytesIO
img_file = BytesIO()
image.save(img_file, 'png')
print(img_file.tell())
ok thanks will try this
https://pics.hori.ovh:8036/docs/ Many people use this api in their discord bots, you can try it @quick kestrel 
Its a method to get data from web, literally scrap data from web sites
Like getting all the html content of the site and parse it to get some data out of it
Well it is really done with selenium?
Or just make requests...? easier and faster
Http request?
Yes
Ah me nub in python currently started working with keras
import requests
data = requests.get('https://github.com/AkshuAgarwal').text  # .text is an attribute, not a method
# now use bs4
I want images not text
You'll get the whole html of the site
Do some reverse engineering to find the ids of image tags and parse them
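A sketch of the parsing step, using the standard library's `html.parser` in place of bs4 so it runs without extra installs. The HTML string and the class name are made up; in practice the string would be the page text fetched with requests:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Hypothetical page content standing in for a downloaded page
html = '<html><body><img src="/a.png" alt="a"><p>text</p><img src="/b.jpg"></body></html>'

parser = ImgSrcCollector()
parser.feed(html)
image_urls = parser.sources
```

The collected values are often relative paths, so they still need to be joined with the site's base URL before downloading.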
Ah I see, so it is faster and can be used to get many images
Right?
Best is actually to use an API
^
Ok thanks
@dire crane need some more help bud
what you need?
What are the things I need to learn to make this like tensorflow
And what's the use of neural networks in ai, I mean does it work like neurones of human to send data and responses from one location to another
If you really want to go straight to hands-on I suggest you read the tensorflow and keras docs. Keras is like a high-level tensorflow, with all the deep learning layers already created. But I highly suggest you study the basics of neural network architecture, because knowing the basics you can create simple architectures to predict values.
And do I need maths in it??
Basically these architectures were created based on human neurons. I don't know many details, but feel free to study it, it must be very interesting. Well, with Machine Learning or Deep Learning (actually Deep Learning is a part of Machine Learning) we can do a lot of things, like image classification, predicting stock values and other things
It's good to know the basics, but you don't need to know everything, unless you want to go into research
So no need of maths in it?
but knowing simple math and statistics is fine
Ok
So to do those calculations of ai do I need numpy or can do it without any module, and are neurons used to transfer data like api does?
tensorflow does it all
Wdym?
Tensorflow has a lot of methods that do the math
So whats the use of keras and what is the use of neural networks
U mean I need keras to work with tensorflow to do the neural network stuff?
Hey I've been working on this imbalanced dataset for awhile, however, can't seem to find a good enough structure to it.
Process:
- Imported the csv (99:1, target) and fixed the skew
- Created 50-50 under sample as a train set
- Scaled the train
- LogisticRegression
- GridSearchCV to find best f1 score for the minority target (fitted the train under sampled df)
- scaled the original dataset (X_val, y_val)
- classification report and confusion matrix of predict(X_val, y_val)
The results are horrible, the maximum I am getting f1_score is 6%, could someone please share your experience and approach on imbalanced datasets?
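One thing that may explain part of the gap: precision (and so F1) for the minority class depends on the class balance of the data you evaluate on, so a model tuned on a 50:50 sample can score far lower on the 99:1 data. A back-of-the-envelope sketch, where the 90% recall and 10% false positive rate are made-up numbers and `precision_and_f1` is a hypothetical helper:

```python
def precision_and_f1(recall, false_positive_rate, prevalence):
    """Precision and F1 of the minority class for a given class balance."""
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    precision = true_pos / (true_pos + false_pos)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, f1


# Hypothetical classifier: catches 90% of positives, flags 10% of negatives.
balanced = precision_and_f1(recall=0.9, false_positive_rate=0.1, prevalence=0.5)
original = precision_and_f1(recall=0.9, false_positive_rate=0.1, prevalence=0.01)

print(f"50:50 sample  precision={balanced[0]:.3f}  f1={balanced[1]:.3f}")
print(f"99:1 data     precision={original[0]:.3f}  f1={original[1]:.3f}")
```

The same classifier goes from an F1 of about 0.90 on the balanced sample to about 0.15 on the original balance. It may be worth scoring the grid search on folds that keep the original ratio, and making sure the scaler is fitted once on the training data and only applied (not refitted) to the validation data.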
yes and no, you can do everything with just tensorflow, but that's too hard in practice for a beginner, so Keras helps us do the job
There is a book called Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition. It's a very good book, I recommend it
Well scikit learn does the ai stuff too right?
yes, thats another library
So my last question which is the best lib to do all this ai and ml stuff
That depends on your problem. If you want to solve simple problems like linear problems, scikit-learn may be enough. But if it's more complex, I prefer using Keras + tensorflow
So tbh I have a lot of work to do with ai and ml and many of them are just a bit complex too, so shall I use tensorflow and keras
take it slow, studying AI is a long journey. I've been studying it for 2-3 years and there is still so much to learn. I hope you enjoy this journey of AI ;D
2-3 years that's a pretty long time
Well thanks for the help bud
you're welcome! ;3
Anyone have any resources for some stats analysis on a feature map for a potential CNN? I have a lot of data and I'm trying to check for skew/normality/etc etc
bet you forget about stuff like transformers ehehe
but if you did, big balls to you man
Hey guys, I'm trying to read an image to use as a training set. However, for some reason, instead of getting a 3-channel RGB array, I'm getting a 4-channel array with shape (26, 25, 4): ```
[[[186 26 51 255]
[194 51 68 255]
[205 82 90 255]
...
[221 132 134 255]
[215 151 156 255]
[208 167 177 255]]
I've tried using `np.delete(array, 3, axis=1)` to remove this `255` from the array, but it didn't work as I expected. Can someone give me a hint on how to remove this `255` column?
@hasty mountain I believe opencv2 has some options to get down to 3 channels
Hm... How about keras.layers? (I want to use this set in a neural network)
I'm not sure how you pre-processed your data. Can you elaborate on how you got that array?
I just loaded an image using imageio (but it could be matplotlib.pyplot, it gives the same result).
I could be wrong here, but you'll want to get rid of that (I think it's an alpha channel) before you convert to the feature map
Well, I don't know how that 4th dimension appeared
Maybe I messed up in Paint 3D...
I see. I'll take a look, thanks!
I did actually, but for that one I had to get some help from PyTorch's seq2seq tutorial to understand some stuff
chad
I basically tried it on my own, didn't work very well, then went to that to see what I did wrong
so im trying to create an ai and this error is popping up:
Would anyone be able to help me make sure my implementation of multinomial naive bayes is correct?
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'pyttsx3.drivers.sap15' ```
thats my code
engine = pyttsx3.init("sap15")
voices = engine.getProperty("voices")
print(voices[0].id)
engine.setProperty("voices", voices[0].id)```
ahh, if you didn't know, you need to wait for the first person to finish with their problem, because otherwise you're blocking them from getting help
people won't see their message like that
oh sorry
k
ill notify you when i get my problem solved
ok thanks
i think i solved it
yea you can post now
Hope it's ok if I do some self promotion here for my unmonetized blog, I just released an article about python scripting for big data.
https://www.babbling.fish/elt-cookbook-python/
Common design patterns used by data engineers to write python scripts that can be horizontally scaled on a stateless task runner.
I need help reorganizing a pandas data frame. It's a bunch of stock market data about a bunch of companies. Right now, it is ordered so that every row is a point in time describing a single stock. There is a row for each of the stocks for every 5 minutes over the course of some time. I need it so the index is the date and time and it has many columns such as ALKT.US_OPEN, ALKT.US_HIGH, etc., for all of the stocks. That way, each row is a snapshot of the whole dataset at a given time. Can anyone give me an idea about how to do this? Here are some example rows.
        <TICKER>    <DATE>  <TIME>   <OPEN>  <HIGH>   <LOW>  <CLOSE>  <VOL>
319763   ALKT.US  20210729  185000  31.8400   31.96  31.840    31.85   1275
 24615   ABCL.US  20210730  191500  15.3000   15.31  15.295    15.31   1595
 55906    ACB.US  20210804  165500   7.0782    7.09   7.055     7.09  12275
843228   BCRX.US  20210826  203000  15.4100   15.43  15.395    15.41   5699
453707   ANTE.US  20210820  160000   2.5000    2.50   2.500     2.50    100
check out the pivot table method
thanks, i just looked at that. Although its similar, its not really what I intend
Could you also provide an example of what you'd want the dataframe to look like after transforming it?
sure one sec
                 <AMZN.US_OPEN>  <AMZN.US_HIGH>  ...  <APPL.US_OPEN>  <APPL.US_HIGH>  ...  (for every stock)
2021-07-1 3:00             1000            1010  ...             150             155
2021-07-1 3:05             1001            1011  ...             144             155
...
like this
the index of the row is the date and time and there would be a column for each stock's data like AMZN_OPEN or APPL_HIGH
i dont know if this is an appropriate question for this channel but
i have a dataset of jobs and salaries and the job titles are kinda weird? like some of them repeat and have roman numerals at the end of them and im kinda confused
like for example theres several "Accountant/Auditor"s but each has a different roman numeral at the end of it
and i was just wondering if anyone knew what this meant
https://catalog.data.gov/dataset/average-salary-by-job-classification this is the dataset btw
Pay and seniority grades probably
HR departments at large organizations have things like that
Find answers to 'What does the number following the job title mean? For example, manufacturing associate I,II,III,IV' from Biogen employees. Get answers to your biggest company questions on Indeed.
oh!! thank you so much :)
Hi all. What do the reports submitted by data scientists to higher management generally look like in major companies? In what format do they submit them? Are those Word docs with code attached? Or PPTs? Or something else?
Anyone non technical cannot digest codes. It's usually ppts all the way
For the next few days I'll be explaining some data science concepts through 1 minute videos so that everyone has a basic intuition and clears their basic concepts. Here's today's video: https://youtu.be/gDqeW2t2dV0
This will will give you an intuition about what data collection is in Data Science, its necessity, requirement and the different ways to do it with a simple and easy example.
Join this telegram group is you are serious about learning data science and want to avail free organised resources that are added and updated everyday: https://t.me/analyt...
thank
thanks
Welcome!
Thank you
which algo can be used to classify these data ? any idea ?
probably a weird sort of polynomial or smthing?
yeah
that kinda looks like the sequence of prime numbers or something in 3 blue 1 brown's video
I thought I saw that one can use SVMs in such a way that the two spirals become closer together.
see if there's a kernel function for that.
regular FNN should do fine
what's FNN?
is that a new NN type
man I've been out of touch for so long
I don't even recognise the techniques that are current any more
Quite the opposite haha
this sounds like it?
feed forward neural network
it looks exactly like the kind of thing a projection into a higher dimensional space would solve
hahahaha
and here I was thinking I was onto something. Also what exactly makes a NN feed forward?
I know for sure because that screenshot looks like it's from Tensorflow's neural network playground and on Google's ML Crash Course they trained an FNN on exactly that pattern
the input does one pass through the network
as opposed to having multiple epochs or what?
direction of information flow
for example
well, I guess RNNs are a great example
๐
Try one of these maybe? https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
haha oh my god so many acronyms
I think it's called FFN in the lingo
I have an issue with numpy, when I pass an image through np.asarray.
It seems to change the shape of the image? I have used a work around but is there any reason to why it does this?
Code
def extract_frames(stream: io.BytesIO):
    bt_array = bytearray(stream.read())
    actual = io.BytesIO(bt_array)
    try:
        gif = mimread(actual, memtest='1MB', pilmode="RGB")
    except ValueError:
        gif = [np.asarray(bt_array, dtype=np.uint8)]
    frames = gif[:MAX_FRAMES]
The sizes that I got were:
(671, 75) - Original Size
[(13696,)] - After passing through np.asarray
[(1, 13696)] - Before saving to memory
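A possible explanation, as a sketch rather than a diagnosis: in the `except ValueError` branch, `np.asarray(bt_array, dtype=np.uint8)` wraps the raw bytes of the encoded file without decoding them, so the result is a flat array with one entry per byte (13696 would then be the file size, not a pixel count). The bytes below are a made-up stand-in, not a real image.

```python
import numpy as np

# Stand-in for the bytes of an encoded image file (header-like prefix + padding).
encoded = bytearray(b"\x89PNG\r\n\x1a\n" + bytes(100))

as_bytes = np.asarray(encoded, dtype=np.uint8)

# One entry per byte of the file, regardless of the picture's width and height.
print(len(encoded), as_bytes.shape)
```

To get a (height, width[, channels]) array the bytes have to be decoded first, for example with PIL's `Image.open(io.BytesIO(data))` followed by `np.asarray(image)`.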
For the same img saved in different versions i.e. jpg, png, tiff, etc. would all the values in the array (RGB values) be the same?
for the current image yes but for others no
it's just a simple black and white image
Hey, I'm doing some research on algorithms that help detect outliers (anomalies) in dataset. Does anybody have advice or experiences on how to automate it?
What do you want to automate ? Outlier detection ?
probably to identify data that doesnt suit the pattern
Yes
I've been given a task to research some options, that could help detect anomalies once fed with data from specific timeframe once in a while.
Well there are a bunch of packages and approaches. How comfortable are you with the math ?
Do you want to use a ready made software tool that comes with ready made models ? Or code a model ?
This example shows you how to code a model and train it
https://keras.io/examples/timeseries/timeseries_anomaly_detection/
I think, well enough
I prefer to code the model myself. I know a library pycaret that seems to do all the code work itself, and I would like to avoid that
Hi can someone help me with this
All right, then the approach that example takes is the level of code you're looking for
Do you think that I can use this approach when I am dealing with data containing transactions?
I'd start with that one, just to get a feel. Or look for an even simpler model, density estimation, to have a real simple baseline
I was going to ask, what's your data? Timeline, or other?
So far i've researched models like LOF, Isolation Forest, knn, SVM
Transactions, sounds like fraud detection ?
I would consider them fraud yes
There is a nice paper on a similar topic https://www.diva-portal.org/smash/get/diva2:897808/FULLTEXT01.pdf
Right, well fraud detection is one of the most used applications of anomaly detection, you should be able to find tons of resources
I personally don't know much about it, like what models/tools are the right ones to try
Hope someone will recommend some resources to get started
Nice, did you try those with your data ?
Most of the papers that I encountered seem to research the models' implementation and their benchmarks.
That keras anomaly example works with timeline data, so I don't expect it to work with transaction data
Well, I personally did not, but a colleague did a while ago
Well sounds like if you're reading papers
hi guys is writing custom loss functions a must know and working with tensors a must-know for AI practitioners?
I would appreciate if I could have an eye on helping me debug a weighted-loss multiclass custom loss function I'm attempting
Is it better to save the image (given that we have thousands of images to save) as a numpy array, or as an image file i.e. as a jpg, png or tiff file etc? (this would be regarding the runtime, disk space (storage)) As after this I would be feeding these images into a CNN
@uncut barn definitely image files because you can make use of compression
I don't believe these are as relevant to statistical approaches to AI.
Loss functions are important for neural network-based AI. And then I think "tensor" means something different in Python's data science ecosystem than it does in pure mathematics.
Anyone know of any cool datasets that aren't something boring like simple image recognition or house prices/stuff like that?
and non-NLP, maybe something scientific
why not nlp?
they said non boring!
perhaps take the dataset from this tutorial. https://www.tensorflow.org/tutorials/images/classification
Just finished a degree in NLP and spent a long time doing only NLP haha
thanks @ripe forge
img = cv.imread(cv.samples.findFile(img_name))
I'm having trouble finding out what `samples` and `findFile` are.
Its with respect to import argparse
I can assure u its not from opencv, any advice where to search in the code to decipher that. Or is this information provided by me way to sPARSE (pun intended) ?
I dont like argsparse
I'm just a beginner in python and ML, can someone please tell what apply(lambda x: x) does in df['y'] = df['y'].apply(lambda x: x)
you can assure us it's not from opencv, but your code tells us otherwise. perhaps just run print(cv.samples) and see what it says.
Hey guys! I'm a python enthusiast and my uncle runs a car upholstery business. I want to help him to come up with new designs for car seat covers using pictures of previous works.
Is it doable? I have only a few notions of ML
I can also get a set of 100 vectorial designs with the same angle
I don't need complex material imitation and other fancy stuffy, I only need "lines". Basically black and white
I've read some stuff about GAN but it seems overpowered for what I need
to me, it seems like that should do nothing. specifically apply takes a function and runs it on every item. but lambda x: x is a function that takes a parameter x and returns x so i don't see the point
Thank you!
my bad, I have just never seen those methods before I guess it is opencv.
Lack of sleep can get to u
you're fine! (though yes, sleep is definitely worth it regardless)
it was worth being wrong just to add the pun though
I think I should have added this part
for img_name in args.img:
    img = cv.imread(cv.samples.findFile(img_name))```
please don't tell me to read a guide on args, too tired
```python
args = parser.parse_args()```
ill see if print(args...) works
hey, what's your question here? args.img looks like some kind of iterable that sounds like it should have image names. nothing special as such.
args.img
comes from args = parser.parse_args() , I have no clue what parser.parse_args() is supposed to be.
And how to do get the background like what you did for "args.img"
perhaps my question is still to sParse?
argparse is making an arse out of me I tell you what
args = parser.parse_args()
print(args)
# read input images
imgs = []
for img_name in args.img:
    img = cv.imread(cv.samples.findFile(img_name))
I just need to know where args.img is searching
oh. so, that looks like command line args. (and for displaying code you can use backticks. single ` for inline)
so args is just the argparse object. it just "holds" all the arguments. argparse itself is a way to hold parameters when the file is run via command line with extra arguments. so, say python thisfile.py arg1 arg2 arg3 for example is one way to run python files with extra commands
that's probably what info you're missing. this code wont show you what img is unless you see what command it was run with.
so img is just whatever is specified in the parser arguments in the code somewhere. its actual value is decided at runtime based on the command the file is run with.
for what it's worth, it's easy to assume it's basically like a list of strings, file names.
parser.add_argument('img', nargs='+', help = 'input images')
im really not a cli guy
not yet anyways
yep, so if you wanted to see how exactly that works, you'd want to read up on parser. but if you want a tldr, img is a list that's being given values during the cli run of this file
img is a list that's being given values during the cli run of this file
exactly what I'm trying to look for
the code of the file won't have it.
it would depend exactly on what command args were written when this file is being run
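A tiny sketch of what that means in practice. The `add_argument` call is the one from the script; the file names are made up, and passing a list to `parse_args` just simulates what would normally be typed on the command line.

```python
import argparse

parser = argparse.ArgumentParser(description="toy version of the script's CLI")
# Same declaration as in the script: one or more positional image paths.
parser.add_argument('img', nargs='+', help='input images')

# Passing a list here simulates running:
#   python thisfile.py left.jpg right.jpg
args = parser.parse_args(['left.jpg', 'right.jpg'])

# args.img is a plain list of the strings given on the command line.
print(args.img)
```

Running the script with no file names at all makes argparse exit with "the following arguments are required: img", because `nargs='+'` needs at least one value.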
can I not just use my relative path?
https://github.com/opencv/opencv/blob/master/samples/python/stitching.py
better not though
I think the entire code would need rewriting then
you can do whatever with your own codes. all i'll say is, you can't look at the code to see what input the user passed basically
to a certain extent only
it's like reading a code that says input("blah") and trying to figure out what the user wrote.
me looking at the code for the last one hour lol
cli args are basically like the cli version of input
Thanks very much
Cant help not looking at the code, guess I gotta read the docs lol
hold on u said its like a input
in terms of the general concept, input is a separate thing ofcourse.
yeah
i only meant both are similar in the sense that the code won't know what is passed to it
Im trying to run on cmd
stitching.py: error: the following arguments are required: img same error but I think I may be close now unless u know ?
wait
it's working. Instructions were sort of there, albeit not that clear
Thank you Darr!
you mentioning `input` resulted in an output
np!
Hey there!
I have this bunch of code right here:
(python!)
for adj_t in adj_matrices:
    dissimilarity_row = []
    for adj_tdt in adj_matrices:
        dissimilarity_measure = ((adj_t - adj_tdt)**2).sum()
        dissimilarity_row.append(dissimilarity_measure)
    dissimilarity_matrix.append(dissimilarity_row)
And as the runtime grows quadratically with the size of the dataset, it becomes notoriously slow really fast.
Right now this is all default python, no numpy or anything
adj_matrices is a list
Up to 90k+ in length
adj_t is a matrix (in form of 2 dimensional python list)
looks like this
The code is calculating the squared Euclidean distance between each pair of the 90k+ points.
Any idea how and if this could be optimised with numpy?
And if so, with which numpy functions?
How did using scipy.pdist go?
Hmm, I ditched that at some point, I'm not sure why.
not sure if pdist would work all that well for me as I need to add another factor later on to the dissimilarity_measure var
this seems to work fine for me, say:
import numpy as np
import scipy.spatial
#random data in that format (1000 10x10 matrices of bools)
arrs = np.random.randint(0,2,(1000,10,10), dtype=np.uint8)
#use pdist, after flattening each matrix:
dists = scipy.spatial.distance.pdist(arrs.reshape(arrs.shape[0],-1))
# this is a "compressed" distance array
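Two details that may matter here, sketched on tiny made-up data: `pdist` defaults to the plain Euclidean metric, so matching the squared distance from the loop needs `metric="sqeuclidean"`, and `squareform` expands the condensed vector into the full (n, n) matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
arrs = rng.integers(0, 2, size=(5, 3, 3), dtype=np.uint8)  # 5 tiny 0/1 matrices
flat = arrs.reshape(arrs.shape[0], -1)

condensed = pdist(flat, metric="sqeuclidean")  # length n*(n-1)/2, upper triangle only
square = squareform(condensed)                 # full (n, n) matrix, zeros on the diagonal

# Same numbers as the nested loop from the question, computed on integer copies.
loop = np.array([[((a.astype(int) - b.astype(int)) ** 2).sum() for b in arrs]
                 for a in arrs])
print(np.allclose(square, loop))
```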
Ah, now that would indeed be a problem
you can instead straightforwardly vectorize your loop, then, like, uhh...
yeah, I'd later like to factor in values like velocity, number of neighbouring points, etc. Data that isn't included in the Euclidean distance
hmm
I mean, if it really can't be optimized, so be it. The dataset shouldn't exceed 100k entries. But still a shame - I'm also scared it will overload the ram
but that's probably another story
def pairwise_dists(arrs):
    # start by flattening the matrices, since we are taking diffs only anyway
    arrs = arrs.reshape(arrs.shape[0],-1) # shape: n_points, N*M
    # We need to subtract each row from all others.
    # That can be done by using some clever broadcasting
    arrs1 = arrs.reshape(arrs.shape[0], 1, -1) # "rotate" it so that the rows are pointing depthwise
    arrs2 = arrs.reshape(1,arrs.shape[0], -1)
    diffs = arrs1 - arrs2 # shape: n_points, n_points, N*M
    distances = (diffs**2).sum(axis=2) # sum depthwise, so over each original row
    return distances # n_points, n_points
# short version that might be a bit faster, not sure:
def pairwise_dists(arrs):
    n = arrs.shape[0]
    return ((arrs.reshape(n,1,-1) - arrs.reshape(1,n,-1))**2).sum(axis=2)
like this, I think that should work
yeah, seems to not crash and to produce a sane-looking result at least.
But if you're adding stuff like this, then this approach might cease to work. In that case, you'll have one final solution left - using numba, or FFI with a compiled language of your choice, to do the exact same stuff you were doing but without the slowness of Python.
Oh wow, that's great. I'll try this - thank you!
Hey - this works wonders!
It now takes 1.5 seconds instead of 24 seconds to calculate for 2k entries ... amazing
Here's another solution (I've tested that it produces the same results as pairwise_dists above):
import numpy as np
from numba import njit
@njit
def pairwise_numba(arrs):
    n = arrs.shape[0]
    results = np.empty(shape=(n,n), dtype = np.uint32)
    for i in range(n):
        for j in range(i,n):
            dist = ((arrs[i]-arrs[j])**2).sum()
            results[i,j] = dist
            results[j,i] = dist
    return results
This is the "same thing you are doing but in numba" way
It's 2x faster:
>>> %timeit pairwise_dists(arrs)
354 ms ± 6.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pairwise_numba(arrs)
150 ms ± 7.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
which is likely because it avoids doing half the work by using the symmetry of your distance function (for j in range(i,n)).
There is also:
def pairwise_dists(x, y):
    """ Computing pairwise distances using memory-efficient
    vectorization.

    Parameters
    ----------
    x : numpy.ndarray, shape=(M, D)
    y : numpy.ndarray, shape=(N, D)

    Returns
    -------
    numpy.ndarray, shape=(M, N)
        The Euclidean distance between each pair of
        rows between `x` and `y`."""
    sqr_dists = -2 * np.matmul(x, y.T)
    sqr_dists += np.sum(x**2, axis=1)[:, np.newaxis]
    sqr_dists += np.sum(y**2, axis=1)
    return np.sqrt(np.clip(sqr_dists, a_min=0, a_max=None))
minus the sqrt
Using this:
may i ask what the above functions are for?
Computing the pairwise distances between many matrices, with the caveat that the distance function might get more complex later
The clip is for numeric precision issues.
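The trick in that function is the expansion ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2, which only ever needs (n, n)-sized temporaries. A small check on random made-up float data that it agrees with the brute-force broadcast version, and why the clip sits before the square root:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))

# Brute force: subtract every row from every other row (needs an (n, n, d) temporary).
brute = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)

# Expansion: ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2 (only (n, n) temporaries).
sq_norms = (x ** 2).sum(axis=1)
expanded = sq_norms[:, None] - 2 * (x @ x.T) + sq_norms[None, :]

print(np.allclose(brute, expanded))
# The diagonal should be exactly 0, but rounding can leave tiny values of either sign,
# so clip at 0 before taking the square root to avoid NaNs.
safe = np.sqrt(np.clip(expanded, 0, None))
```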
hmm, another problem I'm running into right now is this
numpy.core._exceptions.MemoryError: Unable to allocate 180. GiB for an array with shape (22001, 22001, 100) and data type int32

180 gb
nice, so basically:
def pairwise_dists_2(arrs):
    n = arrs.shape[0]
    # cast up first: small unsigned dtypes (e.g. uint8) can overflow in the products below
    x = arrs.reshape(n, -1).astype(np.int64)
    sqr_dists = -2 * np.matmul(x, x.T)
    sqr_dists += np.sum(x**2, axis=1)[:, np.newaxis]
    sqr_dists += np.sum(x**2, axis=1)
    return sqr_dists # we are using integers, so no sqrt or clip needed
Lemme compare it...
oh well
yeah, that's another problem of my vectorized solution
Without the clipping you may get negative values that are very close to being zero.
So just clip them to 0
>>> res_2 = pairwise_dists_2(arrs)
>>> np.array_equal(res,res_2)
True
>>> %timeit pairwise_dists_2(arrs)
183 ms ± 1.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Result is correct, and this is a bit slower than the numba solution but better than my vectorized one
Yea numba will be the best, minus the JIT time
In this case the original arr is of int dtype, I believe, and so are the distances - so no clipping needed, as floats aren't involved.
hmm, so you think this approach would also solve the 180gb problem?
You can do the solution I gave in numba too
As the endresult always will be an array that big - no?
Neither pairwise_numba nor pairwise_dists_2 shares pairwise_dists's problem of allocating a giant array.
No, the final array is (n,n), so only 1.8GB for you - pairwise_dists allocates an intermediate array of shape (n,n,<number of elements in each matrix>), that's what kills the memory.
oh, I see
The large memory issue comes from a very large intermediate value.
(Which the expression re-write avoids)
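The arithmetic behind those two numbers, using the shape and dtype from the error message (int32 is 4 bytes per element):

```python
n, elems_per_matrix, bytes_per_int32 = 22001, 100, 4
GIB = 2 ** 30

# The (n, n, 100) array of differences that the broadcast version builds
intermediate = n * n * elems_per_matrix * bytes_per_int32
# The (n, n) result that every version has to return
final = n * n * bytes_per_int32

print(f"intermediate: {intermediate / GIB:.1f} GiB")
print(f"final:        {final / GIB:.1f} GiB")
```

So the result itself fits in under 2 GiB; only the intermediate is 100 times larger.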
Basically, I'd suggest using the numba solution, not just because it's currently the fastest one, but also because it's the only one you'll easily be able to extend when you change your distance function. With vectorized ones, you'll have to work hard to figure out how to vectorize the new more complicated function (if it starts depending on other values than just the two, I mean), and it might even be impossible at all depending on the distance function
(well, I guess it would only be truly impossible if the dissimilarity won't actually be a distance function, since good distance functions should, at the very least, only depend on the two points)
yeah ... I think I'm gonna need a minute to process all that
we'll try it out
When I read a multi index table, how do I specify the dtype of the header?
import io
import pandas as pd
df = pd.read_csv(io.StringIO('''combo,0,0,1,1,2,2
stock,weights,tickers,weights,tickers,weights,tickers
0,0.5,ADBE,0.7,AAPL,0.7,INTC
1,0.25,MSFT,0.1,MSFT,0.15,AMD
2,0.25,NVDA,0.1,NVDA,0.15,SKYT
3,,,0.1,GOOG,,
'''), header=[0, 1], index_col=0)
df.columns.get_level_values(0)
Index(['0', '0', '1', '1', '2', '2'], dtype='object', name='combo')
Here I would like to specify the dtype is int (not object)
I indeed don't see any mentions of being able to specify the header dtype for read_csv, but have you tried converting it to int afterwards?
Just found this solution: https://stackoverflow.com/questions/49199530/pandas-multi-index-column-header-change-type-when-reading-a-csv-file
It doesn't seem elegant tho
I believe it'll at least be efficient (retyping the header affects only the header and doesn't copy the entire DF, hopefully), but yeah, annoying
I tried using this:
df.columns = df.columns.set_levels(
df.columns.get_level_values(0).astype(int), level=0
)
Gives me this error:
ValueError: Level values must be unique: [0, 0, 1, 1, 2, 2] on level 0
that solution was for a multiindex, you probably need, uhhh...
actually, it's pretty fair
you indeed have duplicate values in your first row.
But that is intentional
Hmm, what should duplicate index values mean?
hm the numba solution seems to work best
it is a bit slower (but definitely still manageable) than the numpy functions but it doesn't use that much ram so is actually usable.
thanks
@tidal bough I think it could be related to this:
df.columns.levels
gives:
FrozenList([['0', '1', '2'], ['tickers', 'weights']])
i wish we didnt have to learn feature eng in matlab
im pretty sure you can do the same things with python + opencv
like 90% sure
theres a lot of the same functions
After reading this: https://github.com/pandas-dev/pandas/issues/28294
I found that they interpret the new indexes as a mapping for unique values
df.columns = df.columns.set_levels(set(df.columns.get_level_values(0).astype(int)), level=0)
EDIT:
Actually a better way is:
df.columns = df.columns.set_levels(df.columns.levels[0].astype(int), level=0)
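Put together as a runnable sketch, with a shortened copy of the CSV from the question: `levels[0]` holds each label only once, so converting it passes the uniqueness check in `set_levels`, while the per-column codes that point at those labels stay as they were.

```python
import io
import pandas as pd

csv_text = """combo,0,0,1,1,2,2
stock,weights,tickers,weights,tickers,weights,tickers
0,0.5,ADBE,0.7,AAPL,0.7,INTC
1,0.25,MSFT,0.1,MSFT,0.15,AMD
"""
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)

# Convert only the unique labels of level 0 ('0', '1', '2') to integers.
df.columns = df.columns.set_levels(df.columns.levels[0].astype(int), level=0)

level0 = df.columns.get_level_values(0)
print(list(level0), level0.dtype)
```

After this, columns can be selected with integer keys such as `df[1]` instead of `df['1']`.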
anyone know why my residual vs fitted vals are bounded near the bottom
What's a good set of modules to use for financial modeling? I want to run some simulations but there must be tools out there for modeling stochastic trend series and financial instruments.
I could reinvent the wheel but it would be pretty crude I think.
mplfinance
mlfinlab
Is there a natural upper or lower bound on the data?
can someone explain a concept to do with gradients for me real quick?
well, nobody can answer you if you don't post the question
there is a bound on the dependent variable where it cant be less than 0
i got it answered actually.
and cant be greater than 100
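That bound is one common reason for a sharp edge in a residual vs fitted plot: since residual = y - fitted, a response that can't go below 0 forces residual >= -fitted, and a cap at 100 forces residual <= 100 - fitted. A sketch on made-up data (the slope, noise level and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
# Made-up response that is clipped to the range [0, 100].
y = np.clip(8 * x + rng.normal(0, 15, size=200), 0, 100)

# Ordinary least-squares line through the data
slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept
residuals = y - fitted

# y >= 0 implies residual >= -fitted; y <= 100 implies residual <= 100 - fitted.
lower_edge = -fitted
upper_edge = 100 - fitted
print(np.all(residuals >= lower_edge), np.all(residuals <= upper_edge))
```

Plotting `residuals` against `fitted` together with the two edge lines shows the points stacking up along a diagonal wherever the response hits a bound.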
it was about gradient descent but I got it figured out now.