#data-science-and-ml


trim latch
#

Please anyone

stuck karma
#

hey! do you know how many samples are considered "enough" or "not enough" to do cross-validation?

#

because i think it is used when we dont have enough samples

#

i have 2100 features and 300 samples, is that "not enough" to split into train and test sets?

#

please ping me if you answer

livid kiln
#

I had shown the wrong values in the explanation, I've edited them now

desert oar
#

thanks for clarifying

#

i can't think of an obvious tidy way to do this. maybe ask on SO and link the question here. i can ponder and post a longer answer on SO if nobody else does first

livid kiln
# desert oar thanks for clarifying
idxs = []
idx = (df[2].notnull()==False) & (df[2].notnull().shift(-1)==True)
idx[len(idx)-1] = True

start=0
for i,end in enumerate(map(lambda x: x+1, sorted(idx.index[idx]))):
    print(i,start,end)
    idxs.append((i,start,end))
    start = end
idxs

result = pd.concat({i:df[start:end].reset_index() for i,start,end in idxs}, axis=1)
result.columns.names = ['combo','stock']
result
desert oar
#

that's one way to do it

#

you're basically resorting to parsing a stream of 1s and 0s and grouping them

#

the absolute last resort technique 😛
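A more conventional alternative to hand-parsing the mask, offered as a sketch rather than what anyone in the thread used: mark the row where each block starts and take a cumulative sum, which gives every row a block id that `groupby` can split on. The sample data, and the assumption that a non-null third column marks the start of a block, are illustrative.

```python
import io

import pandas as pd

# Illustrative data: a row with a third field starts a new block.
raw = """0.7, AAPL, 0.5
0.1, MSFT
0.1, NVDA
0.1, GOOG
0.7, INTC, 0.5
0.15, AMD
0.15, SKYT"""
df = pd.read_csv(io.StringIO(raw), header=None)

# True on the first row of each block, False elsewhere.
starts = df[2].notna()

# Cumulative sum of the start flags: 1,1,1,1,2,2,2 for the data above.
block_id = starts.cumsum()

# One DataFrame per block, each with a fresh 0-based index.
blocks = [block.reset_index(drop=True) for _, block in df.groupby(block_id)]

for block in blocks:
    print(block)
```

The same `block_id` series can also be passed straight to `groupby` for aggregation, without materializing the list of blocks.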

livid kiln
desert oar
#

ask on SO anyway and show your current attempt (and make sure to include copy/paste-able sample data), maybe you'll get some other interesting variations

#

people like doing performance benchmarks on SO too

livid kiln
# desert oar ask on SO anyway and show your current attempt (and make sure to include copy/pa...

Found it!!!
https://stackoverflow.com/questions/21402384/how-to-split-a-pandas-time-series-by-nan-values/66015224#66015224

sparse_ts = (~df[2].notnull()).astype(pd.SparseDtype('bool'))
# we need to use .values.sp_index.to_block_index() in this version of pandas
block_locs = zip(
    sparse_ts.values.sp_index.to_block_index().blocs - 1,
    sparse_ts.values.sp_index.to_block_index().blengths + 2,
)
# Map the sparse blocks back to the dense timeseries
blocks = [
    df.iloc[start : (start + length - 1)]
    for (start, length) in block_locs
]
for block in blocks:
    print(block)
desert oar
#

welp

#

that's an interesting one

#

very clever

#

i'm gonna file that one away in my brain if i ever need to work on boolean stuff like this

livid kiln
#

How do you get the depth of a multi index dataframe ?

serene scaffold
#

!docs pandas.MultiIndex.nlevels

arctic wedgeBOT
#

property MultiIndex.nlevels
Integer number of levels in this MultiIndex.

Examples

```py
>>> mi = pd.MultiIndex.from_arrays([['a'], ['b'], ['c']])
>>> mi
MultiIndex([('a', 'b', 'c')],
           )
>>> mi.nlevels
3
```
tranquil drift
#

Hello , any1 maybe have couple minutes to help me with pandas and merge df ?

livid kiln
livid kiln
#
import io
import pandas as pd

df = pd.read_csv(io.StringIO('''0.7, AAPL, 0.5
0.1, MSFT
0.1, NVDA
0.1, GOOG

0.7, INTC, 0.5
0.15, AMD
0.15, SKYT'''), header=None)

df

sparse_ts = (~df[2].notnull()).astype(pd.SparseDtype('bool'))

block_locs = zip(
    sparse_ts.values.sp_index.to_block_index().blocs - 1,
    sparse_ts.values.sp_index.to_block_index().blengths + 2,
)

df2 = pd.concat({i:df[start:(start + length - 1)].reset_index() for i, (start,length) in enumerate(block_locs)}, axis=1)
df2.columns.names = ['combo','stock']

df2
#

Now I can index df2 like this:
df2[0]
df2[1]
How do I get this max depth (== length of depth)
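A small sketch of the two things "depth" could mean here, assuming a frame built with `pd.concat({...}, axis=1)` as above. The frames `a` and `b` and their column names are made up for illustration. `nlevels` gives the number of levels in the column MultiIndex, while counting the unique values of the outer level gives how many `df2[0]`, `df2[1]`, ... groups exist.

```python
import pandas as pd

# Hypothetical stand-ins for the blocks produced earlier in the thread.
a = pd.DataFrame({"weight": [0.7, 0.1], "ticker": ["AAPL", "MSFT"]})
b = pd.DataFrame({"weight": [0.7, 0.15, 0.15], "ticker": ["INTC", "AMD", "SKYT"]})

# Dict keys become the outer column level.
df2 = pd.concat({0: a, 1: b}, axis=1)
df2.columns.names = ["combo", "stock"]

# Number of levels in the column MultiIndex.
depth = df2.columns.nlevels

# Number of distinct outer keys, i.e. how many df2[i] groups there are.
n_combos = df2.columns.get_level_values("combo").nunique()

print(depth, n_combos)
```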

livid kiln
serene scaffold
livid kiln
flat hollow
#

when multiple indices are called multiindex and multiple columns are called multiindex 😦

quiet vault
#

Hello, I have multivariate time series data with the shape of (3916, 4). I am trying to use a CNN-LSTM model with the TimeDistributed layer. I need to reshape my data into [samples, subsequences, timesteps, features]. I want time steps to be 5

#

Does anyone know how I can reshape the data into this

serene scaffold
quiet vault
#

3916, subsequences, 5, 4

#

i tried using the reshape function but it wont work

serene scaffold
#

when you reshape the data, the products of lengths of each dimension have to be the same

#

subsequences and 5 are new, as compared to (3916, 4) -- where is that data now?

#

are these in arrays, tensors, dataframes, or what?

quiet vault
#

arrays

serene scaffold
#

what array has the subsequences?

quiet vault
#

a TimeDistributed layer needs it

#
steps and provide a time series of interpretations of the subsequences to the LSTM model
to process as input. We can parameterize this and define the number of subsequences as
n_seq and the number of time steps per subsequence as n_steps.
serene scaffold
#

do you have more than one array of shape (3916, 4)?

quiet vault
#

no

serene scaffold
#

whatever number subsequences is, if you have that many (3916, 4)-shaped arrays, you can stack them and get one that is (subsequences, 3916, 4)-shaped, and then you can transpose it.

#

I'm not sure about the 5.
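One way the single (3916, 4) array could become [samples, subsequences, timesteps, features], sketched under assumptions: each sample is a sliding window of `n_seq * n_steps` consecutive rows, and that window is then split into `n_seq` subsequences of `n_steps` rows. `n_steps = 5` comes from the question; `n_seq = 2` is a made-up value. This respects the product rule above because the windows are built first, so the array being reshaped already has the right number of elements.

```python
import numpy as np

# Stand-in for the real data: 3916 time steps, 4 features.
data = np.arange(3916 * 4, dtype=float).reshape(3916, 4)

n_steps = 5            # time steps per subsequence (from the question)
n_seq = 2              # subsequences per sample (hypothetical choice)
window = n_seq * n_steps

# Row indices of every overlapping window: shape (n_samples, window).
starts = np.arange(len(data) - window + 1)
idx = starts[:, None] + np.arange(window)[None, :]

# Fancy indexing gathers the rows: shape (n_samples, window, n_features).
windows = data[idx]

# Split each window into n_seq chunks of n_steps rows.
X = windows.reshape(len(starts), n_seq, n_steps, data.shape[1])

print(X.shape)
```

Overlapping windows are only one option; non-overlapping windows would give fewer samples but no shared rows between them.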

quiet vault
#

alright

#

thanks

livid kiln
#

df.select_dtypes(include="object")
I would like to strip all these columns

white venture
#

anyone know how to fix this error: Tensor.op is meaningless when eager execution is enabled

desert oar
livid kiln
desert oar
#

use .apply?

#
obj_cols = df.select_dtypes(include='object').columns  # column labels, not the sub-frame
df[obj_cols] = df[obj_cols].apply(lambda y: y.str.strip())
livid kiln
#

does apply do things in parallel?

desert oar
#

not by default but i think there's a patch to do it

#

you could also loop over column names and parallelize it yourself w/ joblib

livid kiln
#

Is there a way to do this inplace?

desert oar
#

i was just about to look into that

#

you might need to manually memmap it or something

#

how big is this data exactly?

#

maybe you should be using dask if something as simple as str.strip is slow enough to want to parallelize

#

or a real database

#

unless you have 10k object columns i don't see much value in parallelizing something like this

livid kiln
#

Yeah, just writing the prototype in python, going to rewrite this in rust. But would like to see how far python will go 😃 , might try numba

desert oar
#

numba won't do much for you w/ strings

#

what are you actually trying to do?

#

this seems like a total non-useful thing to try to optimize

#

at least use dtype='string' instead of dtype='object'

livid kiln
#

I used pd.StringDtype()

desert oar
#

same thing

#

should be a bit faster than object, i believe it's backed by an arrow array

#

i'm not sure if parallel in-place operations on a dataframe are possible without some serious care and fuckery

#

at which point, like i said, you're probably outside the realm of what you should be doing with pandas and should be considering dask or even spark

#

or, again, an rdbms

#

if you actually explain what you're trying to do maybe i can be more helpful

#

this sounds like an x-y problem

#

e.g. if you have an ETL pipeline that needs to run in 500ms and it currently takes 750ms, you need to roll up your sleeves and start profiling and/or thinking about how you can store/handle this data differently

#

often the biggest gains are algorithmic

#

reducing number of passes over the data, etc.

livid kiln
#

will probably use clickhouse as the db

desert oar
#

if it really does turn out that you are bottlenecked by string processing over a list, you might want to consider switching to native python lists, and using nuitka, mypyc, or cython to build a c extension

#

or switch to pypy, which can be significantly faster for "plain python" operations than cpython

livid kiln
desert oar
#

but again, you need to have a goal in mind and start benchmarking

#

yep, i don't know specifically how much faster it will be for strings though

#

for all i know, dataframes.jl is backed by arrow anyway and the underlying performance characteristics might be similar

#

a lot of the slowness of python has to do with the overhead of the bytecode interpreter, so if you stay "inside" a C function for a long period of time then you're going to get significant speedups

#

yet another option is R, vectorized operations in R are very well-optimized, despite how ungodly slow things like loops and function calls are

#

i've processed 1bn+ records in data.table without thinking twice about it

#

for that much data i'd actually sooner pick up r w/ data.table than pandas

#

tldr you probably dont need to parallelize your str.strip operation

livid kiln
#

The reason why I'm using pandas is 2 fold, 1. documentation is vast (including SO), 2. learnt python as my first language... Ideally I would switch to Julia, but due to a lack of filled SO pages, the coding process is slower for me atm.

desert oar
#

i'm not saying not to use pandas! pandas is a great tool

livid kiln
#

100% right tool for the right job

desert oar
#

julia has pretty good docs though

#

including dataframes.jl

livid kiln
#
strip = re.compile(r'(^\s+|\s+$)')
df.select_dtypes(include=pd.StringDtype()).replace(strip, '', regex=True, inplace=True)
#

This is the only way I've found inplace

#

But then I get this error

desert oar
#

does this still trigger the warning?

string_columns = df.select_dtypes(include=pd.StringDtype()).columns
df[string_columns].replace(
    re.compile(r'(^\s+|\s+$)'), '', regex=True, inplace=True
)
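Both snippets above call `replace(..., inplace=True)` on the result of a column selection, which is a new object, so the original `df` is most likely left untouched. A hedged sketch of the usual workaround is to assign the stripped columns back. The sample frame is invented, and `is_string_dtype` is used here only to pick text columns in a way that should not depend on whether they are `object` or `string` dtype.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

# Invented sample data with stray whitespace in the text columns.
df = pd.DataFrame(
    {
        "ticker": ["  AAPL", "MSFT  ", " NVDA "],
        "sector": [" tech", "tech ", " tech "],
        "weight": [0.7, 0.1, 0.1],
    }
)

# Labels of the columns that hold text.
text_cols = [col for col in df.columns if is_string_dtype(df[col])]

# Strip each text column and write the result back into df.
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())

print(df)
```

This is not truly in place in the memory sense, since new columns are built and then assigned, but it does update `df` itself and avoids operating on a temporary copy.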
upper scarab
#

678,679,aaa,male,0.96,NG
679,680,bbb,male,0.99,IN
690,691,ccc,male,0.99,IN
691,692,ddd,male,0.99,IN
in the first column you can see the missing data(index 680 to 689 are missing)
I have another data frame
688,689,a4353
689,690,somename
690,691,dfsfds
691,692,sdfsdf
692,693,retret
693,694,ertret
I want to query the data in second data frame based on the missing index in first dataframe

upper scarab
#

678,679...690

#

index is missing

#

ignore the second column

desert oar
#

and you want to fill in those missing rows with rows from the 2nd df?

upper scarab
#

nope

desert oar
#

or do you just want to get the rows from the 2nd df that fit that criterion?

upper scarab
#

I want the rows from 2nd dataframe.

upper scarab
#

680 to 689

#

from 2nd dataframe

#

there are few other rows missing in 1st dataframe

desert oar
#

!e ```python
from io import StringIO

import pandas as pd

df1 = pd.read_csv(
    StringIO('''
678,679,aaa,male,0.96,NG
679,680,bbb,male,0.99,IN
690,691,ccc,male,0.99,IN
691,692,ddd,male,0.99,IN
'''),
    index_col=0,
    names=['idx', *(f'y{i}' for i in range(1, 6))],
)

df2 = pd.read_csv(
    StringIO('''
688,689,a4353
689,690,somename
690,691,dfsfds
691,692,sdfsdf
692,693,retret
693,694,ertret
'''),
    index_col=0,
    names=['idx', 'x1', 'x2'],
)

df1_idx_range = pd.RangeIndex(
    df1.index.min(),
    df1.index.max(),
)

df1_idx_missing = df1_idx_range.difference(df1.index)

df2_missing_df1 = df2.loc[df1_idx_missing.intersection(df2.index)]

print(df2_missing_df1)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 |       x1        x2
002 | 688  689     a4353
003 | 689  690  somename
desert oar
#

the key operations are RangeIndex, Index.difference, and Index.intersection

upper scarab
#

Wow

desert oar
#

👍

upper scarab
#

cant dm you

#

This helped me a lot

#

i cant see your name

desert oar
#

happy to help. i keep dm's closed because otherwise i get a lot of them, more than i want

upper scarab
#

Any way I can repay you

desert oar
#

you can repay me by reading the docs and making sure you understand how these functions work 🙂

#

then you can teach someone else one day

upper scarab
#

Wow sure

#

I just got started. gonna read and learn the docs after this project

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @lapis sequoia until <t:1630539287:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

eager imp
#

i'd like to plot continuous wavelets (i'd use pyplot.matshow(coef, fignum=1, aspect="auto") in jupyter) but in pyqtgraph instead, anyone know how to do that?

quiet vault
#

Has anyone worked with ConvLSTM2D layers in keras for time series data?

#

If so, can you explain the rules for reshaping data to become 5d?

#

Every time I try to reshape, I run into this problem where it says I cannot reshape it
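A likely cause of the "cannot reshape" error is that the total number of elements does not match the target shape. With the default channels-last layout, Keras `ConvLSTM2D` takes 5D input of the form (samples, time, rows, cols, channels). The sketch below assumes the (3916, 4) array from earlier in the channel, treats each sample as `n_seq` subsequences of `n_steps` rows with one "image" row, and trims the leftover rows so the sizes agree. `n_seq = 2` and the non-overlapping windows are illustrative choices, not the only valid layout.

```python
import numpy as np

# Stand-in for the real data: 3916 time steps, 4 features.
data = np.arange(3916 * 4, dtype=float).reshape(3916, 4)

n_seq, n_steps = 2, 5          # n_seq is a hypothetical choice
n_features = data.shape[1]
window = n_seq * n_steps       # rows consumed by one sample

# Reshaping directly fails: 3916 * 4 elements is not a multiple of 2 * 1 * 5 * 4.
try:
    data.reshape(-1, n_seq, 1, n_steps, n_features)
    direct_reshape_ok = True
except ValueError:
    direct_reshape_ok = False

# Drop the trailing rows that do not fill a whole window, then reshape.
n_samples = len(data) // window
trimmed = data[: n_samples * window]
X = trimmed.reshape(n_samples, n_seq, 1, n_steps, n_features)

print(direct_reshape_ok, X.shape)
```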

dusty cloud
#

Is it possible to save data held in a variable in another file that is currently being executed, when the signal is caught by the main file where the SIGINT handler was attached? This is for interruptions when the program is being halted or stopped

proven sigil
#

Directory: /home/hadoop/
module.py

def incr(value):
    return int(value + 1)

main.py

import pyspark.sql.functions as F
import pyspark.sql.types as T
from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

import sys
sys.path.append('/home/hadoop/')
import module

if __name__ == '__main__':
    df = spark.createDataFrame([['a', 1], ['b', 2]], schema=['id', 'value'])
    df.show()

    print(module.incr(5)) #this works

    incr_udf = F.udf(lambda val: module.incr(val), T.IntegerType()) # this throws module not found error
    df = df.withColumn('new_value', incr_udf('value'))
    df.show()

I think spark task nodes do not have access to /home/hadoop/
How do I access module.py from spark task nodes?

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @waxen thorn until <t:1630563396:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

lapis sequoia
#

Hey, this question is about SVD,
Say once i have U, sigma and V, how do i decrease the dimensions?

what i did is

  U = u[:,:no_of_parameters] # M, K # we can treat U as our X for smaller dimension
  Sigma = sigma[:no_of_parameters, :no_of_parameters] # K, K
  V = vh[:no_of_parameters] # K, N

Now say I have to classify this, so How would I get X which i could fit in and would have dimensions of (M, K)?

lapis sequoia
#

moreover, now to classify with more dimensions, what i did was first slice them and then get the X back but it would have same dimensions as M,K * K,K * K,N = M,N

which works fine but its like converting to K and then again M,N which is big seems stupid,
is there any way by which i could put M,K sized new vector and fit it
and somehow get training data in K columns as well? (please ping me if you reply)

desert oar
desert oar
#

That said I haven't ever seen someone actually do this by hand like you describe

#

Usually I see people doing "truncated SVD" and working with the truncated U and V matrices

#

There is also an algorithm that directly computes a low rank approximation without doing full SVD and truncating

#
#

So yes this looks like a totally valid approach to low rank approximation

#

Maybe use the U matrix and not the reconstruction? Or do PCA and use the first K columns, if you are just looking for dimension reduction (PCA can be obtained from SVD on the covariance matrix on standardized data)

pine wolf
desert oar
pine wolf
#

no, it seemed like an interesting post, but i didn't look at it

desert oar
#

That was meant to be a reply to their question about how to use only the first K columns

#

Link the post?

#

I don't follow Reddit usually

pine wolf
#

unrelated to your previous conversation, just something you might be interested in

#

i didn't look into it either, could just be fool's gold, dunno

desert oar
#

Huh!

#

Interesting that it has a practical application

pine wolf
#

i can think of several uses already without reading the paper

#

i mean, if you're already doing monte-carlo or statistical things, why not

velvet thorn
#

I remember seeing it

pine wolf
#

i don't know, possibly

#

i sort of earmarked it for later

lapis sequoia
#

there was one answer which said, I can get new matrix by U x Sigma but giving that to classifier did not help.

#

So by changing into low-rank matrices, I do get the new X, yes but its dimensions stay the same regardless.

desert oar
lapis sequoia
#

I'll show you what I'm trying to do in code hold on.

desert oar
#

Because yes that would be the right solution

lapis sequoia
desert oar
#

You are doing clustering?

lapis sequoia
#

knn

#
from sklearn.neighbors import KNeighborsClassifier

def get_classifier_model(u, sigma, vh, no_of_parameters, k, y, X):
  U = u[:,:no_of_parameters] # M, K # we can treat U as our X for smaller dimension
  Sigma = sigma[:no_of_parameters, :no_of_parameters] # K, K (sigma is expected as a 2-D diagonal matrix)
  V = vh[:no_of_parameters] # K, N
  X_reduced = U @ Sigma @ V # Thats how we can get the X back(same dimensions but data will be changed)
  # you can verify how close we are to the original X with np.linalg.norm(X - X_reduced)
  neigh = KNeighborsClassifier(n_neighbors=k)
  neigh.fit(X_reduced,y)
  return neigh, U, Sigma, V

so thats what i did.

desert oar
#

Plot your data

lapis sequoia
#

but its on very higher degree.

#

i have docs of say 61188 words, so what i did above was,

reduced dimensions, got X back of say rows*61188 and put it into classifier.

now say if i just do U @ Sigma, which will give me matrix of say rows*100, the result is very poor.

#

of classifier.
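A sketch of how the (M, K) representation is usually obtained and reused, assuming plain NumPy and random stand-in data. The key identity is that `U_k @ Sigma_k` equals `X @ V_k.T`, so any new rows (test data) can be mapped into the same K columns by multiplying with `V_k.T`. Whether a classifier does well on the reduced data depends on the data and on K; one thing worth checking is that training and test rows are projected with the same `V_k`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the document-term matrices (rows = documents, columns = words).
X_train = rng.normal(size=(50, 20))
X_test = rng.normal(size=(5, 20))

k = 4  # number of components to keep

# Thin SVD: u is (M, r), s is (r,), vh is (r, N) with r = min(M, N).
u, s, vh = np.linalg.svd(X_train, full_matrices=False)
U_k, s_k, Vh_k = u[:, :k], s[:k], vh[:k]

# Reduced training data, shape (M, K); same as U_k @ np.diag(s_k).
X_train_k = U_k * s_k

# The same reduced data obtained by projecting onto the top-k right singular vectors.
X_train_proj = X_train @ Vh_k.T

# New rows are reduced with the same projection, shape (n_test, K).
X_test_k = X_test @ Vh_k.T

print(X_train_k.shape, X_test_k.shape)
```

`X_train_k` and `X_test_k` can then be passed to `fit` and `predict` of a classifier in place of the full matrices. scikit-learn's `TruncatedSVD` wraps the same idea behind `fit_transform` and `transform`.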

#

I'm sorry if I'm sounding confusing.

prime hearth
#

hello, i would like to ask: is it fine to include a titanic dataset machine learning project in my resume if i build the machine learning algorithm from scratch? I googled and saw lots of advice from professionals that it is not good to include it, but i'm not sure why, or whether that applies to my circumstance, since i'm building linear regression from scratch without any library?

#

titanic data set is from kaggle*

desert oar
desert oar
#

I don't think knowing about QR decomposition and applying formulas from your stats class is all that much better than anything else

prime hearth
#

oh okay, and when i say from scratch i mean i don't use the sklearn library for linear regression or k-means; i implement all the algorithms using the math formulas like the cost function and gradient, so from scratch

desert oar
#

It's not bad, it shows you know more than nothing, which is good

#

I would suggest that implementing SGD by hand is also kind of a low bar easy project

#

Moreover I think it reflects a shallow understanding of these algorithms, because I don't think SGD is a sensible choice for the titanic data

#

So if you're going to use the wrong algorithm then yes I would not suggest putting it on the resume

#

But it also depends on what kind of job you're applying for

#

Instead of putting it "on your resume", I would build a project around it

#

Demonstrate that you can write, that you can come up with a project plan, explain the plan, and reason about your decisions for choosing that plan

#

Demonstrate that you can generate useful data visualizations and reflect on the pros and cons of different approaches

prime hearth
#

sorry, the algorithms i'm implementing from scratch are for datasets that support them or work well with them

desert oar
#

Demonstrate that you can write good quality code that other people will be able to use

#

I don't think it's a good idea to just put your class homework assignments on your resume

#

If you did some kind of capstone project or thesis, that's a great thing to include

#

But I doubt your capstone project was to run SGD linear regression on the titanic data

prime hearth
#

oh okay. So it not great to include as a beginner?

desert oar
#

There's nothing wrong with simple tools and familiar data sets, but nowadays the bar is too high

lapis sequoia
desert oar
#

You can put the class on your resume, I think that's a great idea

prime hearth
#

Oh okay, and so like for email spam classification, you think i should create some visualization for it and deploy it using flask or just a terminal command line for this app is good enough?

#

Just not sure which projects are "bad" to hr

desert oar
#

It depends on the kind of job you're applying for

prime hearth
#

oh okay, i guess entry level

#

or internships

#

for machine learning

desert oar
#

Nothing is "bad", what is bad is putting some thing on your resume that doesn't reflect that you know what you're doing

#

10 years ago, SGD was hot shit

#

But today it's baseline entry level knowledge

#

They will ask you about SGD not to test how good you are, but to make sure that you aren't lying about having basic fundamental skills

prime hearth
#

oh okay.

desert oar
#

So if you put fundamental skills on your resume and try to advertise them as something special, it's a red flag that you don't know what you're talking about

#

As for your spam classification project, it certainly can't hurt to spin up a web application and make nice data visualizations. However there will be a point of diminishing returns with respect to your time and energy

prime hearth
#

diminishing returns?

desert oar
#

I would personally rather see that you have invested energy into making your work reproducible

#

Setting random seeds, using version control, etc

#

I would want to see that the code quality is good

#

I would want to see that you have handled the data correctly

#

Diminishing returns meaning, the benefit for every additional hour of work gets lower as you put more hours into the project

#

So what I would do is focus on making sure that the core classification project is solid, before moving onto "extra" stuff like a web application

prime hearth
#

oh okay, so instead i should just put it on github and write documentation, like in the readme file, and show visualizations and communicate how i cleaned the data and approached the problem, with good code quality, instead of making a web app out of it and deploying it?

desert oar
#

However, using data visualizations to demonstrate that your work is successful is a critical skill for a data scientist, so I do strongly recommend working on data viz

prime hearth
#

oh okay

desert oar
prime hearth
#

oh okay thanks, so it's totally fine if the app is just a "run file" and the code executes, or should i build a web interface?

desert oar
desert oar
#

And emphasize that you did this work aggressively when you interview

prime hearth
#

oh okay thanks

desert oar
#

Sloppy data science destroys business value. Demonstrate that you understand this, and demonstrate that you are not a sloppy data scientist

#

Your code is readable, your scripts can be run without too much difficulty. You have produced sufficient justification that your model performs well. You can communicate to non-technical stakeholders, business management, etc. and demonstrate the value of your work.

#

Then yeah, throw a web app on top of that if you have time

#

And of course you have demonstrated that you are not just copying and pasting shit from TDS and Stackoverflow and Kaggle

prime hearth
#

oh okay thanks, i'll prob not do a web app since that requires more time like you said, like flask or django, and bug testing if that happens

desert oar
#

Note that these are the opinions and biases of exactly one person and should not be treated as fact or even general principle

prime hearth
#

oh okay. thanks for the help! Lastly i also saw this website called iNeuron which give free internship projects

#

from beginners to advance

#

not sure if you heard of it but should i include that under internship

#

or under projects in my resume

#

if i do work on a project there

#

each project there gives a data set, and instructions on what they want "work with x cloud, use this model, " and some other requirements, but you are implementing it there is no source code to copy

desert oar
#

Probably can't hurt

prime hearth
#

oh okay thanks

desert oar
#

Note that if you did specifically want to mention the Titanic SGD thing, you can put it as a bullet point when describing the course you are taking

#

Good excuse to hit some HR buzzwords

prime hearth
#

oh okay, i'm not taking any course that does titanic, but i thought it was interesting just to show...

#

thanks for help

desert oar
#

Like I said, if you build more of a project around it it can stand on its own

#

It's like 20 lines of python, it's just too trivial

#

Do some data visualizations showing how SGD works maybe

#

Or compare different approaches to fitting linear models
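As one possible starting point for that kind of comparison, here is a minimal sketch that fits the same linear model two ways: the closed-form least-squares solution via `np.linalg.lstsq`, and mini-batch SGD on the squared error. The synthetic data, learning rate, batch size, and epoch count are all made-up values chosen so the example converges; they are not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem with known coefficients.
n, d = 200, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
true_b = 0.3
y = X @ true_w + true_b + rng.normal(scale=0.1, size=n)

# Append a column of ones so the intercept is learned as the last weight.
Xb = np.column_stack([X, np.ones(n)])

# Approach 1: direct least-squares solve.
w_ls, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Approach 2: mini-batch SGD on the mean squared error.
w_sgd = np.zeros(d + 1)
lr, batch_size, epochs = 0.05, 20, 200
for _ in range(epochs):
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        batch = order[start:start + batch_size]
        residual = Xb[batch] @ w_sgd - y[batch]
        grad = 2 * Xb[batch].T @ residual / len(batch)
        w_sgd -= lr * grad

# Largest coefficient-wise disagreement between the two fits.
max_gap = np.max(np.abs(w_sgd - w_ls))
print(w_ls, w_sgd, max_gap)
```

Recording `w_sgd` after each epoch and plotting its distance to `w_ls` would be one way to visualize how SGD approaches the direct solution.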

tender hearth
#

Embedding vectors are different from regular vector representations in that they have the property of being closer in terms of distance to similar inputs, right?

umbral raptor
#

Is anyone familiar with a library for using fastText (pretrained) embeddings in NNs (PyTorch)? Online I can only find code to encode text data to embedding ids, but everyone is doing it with their own code. I thought that by now there would be a library for that.

tender hearth
#

Although now that I think about it I've only seen embeddings in NLP contexts

serene scaffold
#

Based on how word embeddings are trained, words whose embeddings have a closer cosine distance tend to be more semantically similar.

ripe forge
tender hearth
#

From my understanding embeddings are regular vectors with the additional property that their cosine distance from embeddings of similar inputs are relatively low

#

regular vector representation is just an abstract representation of some input

ripe forge
#

Oh. Okay. Then sure yes. But they "end up" with this property specifically because the vectors were trained and fine-tuned with an objective that makes similar things get similar vectors.

#

Like, a poorly trained embedding vector would be no different than just a regular vector.
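To make the "similar things get similar vectors" point concrete, here is a toy sketch of cosine similarity. The three vectors are hand-written for illustration, not taken from any trained model; in a real embedding the closeness would come from training, as described above.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical 3-dimensional "embeddings", invented for this example.
toy_embeddings = {
    "king": np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.9, 0.15]),
    "banana": np.array([0.1, 0.05, 0.95]),
}

sim_king_queen = cosine_similarity(toy_embeddings["king"], toy_embeddings["queen"])
sim_king_banana = cosine_similarity(toy_embeddings["king"], toy_embeddings["banana"])

print(sim_king_queen, sim_king_banana)
```

Cosine distance is one minus this similarity, so the related pair ends up with the smaller distance.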

proven sigil
desert oar
#

stuck forever?

proven sigil
#

the run doesn't complete

#

It's just waiting without even printing anything

cold pine
#

yo!
i have a problem. I'd like to plot top 5 countries to see if they change across timespan. After some time working on it i've created table with {year:[top_5_countries]}. How to plot it? 😮‍💨

prime hearth
#

so like top 5 countries are the ones with highest year?

#

oh

#

one way is dataframe looping

desert oar
#

(nvm sorry)

cold pine
#

did it

#

but i've just a bunch of 0's and 1's

desert oar
#

one way to plot this is to plot the rankings as a line over time

#

one line per country

cold pine
#

thats what i want to do

#

but dunno how

desert oar
#

can you post this data as code/text and not as a screenshot?

#

then i can copy and paste it to help

#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

cold pine
#

well i mean. If country occured -> plot else -> dont plot but dunno how to do this.
{1896: ['AUS', 'BLR', 'DEN', 'HUN', 'ITA']}
{1900: ['EUN', 'LTU', 'NZL', 'PAK', 'URS']}
{1904: ['CUB', 'EUN', 'GDR', 'SUI', 'USA']}
{1906: ['CRO', 'IND', 'LTU', 'NZL', 'SRB']}
{1908: ['ARM', 'BOT', 'GUY', 'PAK', 'URS']}
{1912: ['EUN', 'GDR', 'GEO', 'GER', 'URS']}
{1920: ['ANZ', 'EUN', 'MNE', 'PAR', 'SCG']}
{1924: ['ANZ', 'CRO', 'FIN', 'IOA', 'URS']}
{1928: ['GDR', 'KOR', 'SCG', 'SRB', 'URS']}
{1932: ['CMR', 'EUN', 'FIN', 'IND', 'SCG']}
{1936: ['ANZ', 'EUN', 'GRE', 'NEP', 'URS']}
{1948: ['CIV', 'LTU', 'SCG', 'TJK', 'URS']}
{1952: ['AZE', 'ETH', 'GDR', 'PAK', 'SRB']}
{1956: ['ANZ', 'DEN', 'EUN', 'GDR', 'URS']}
{1960: ['EUN', 'GDR', 'SGP', 'URS', 'USA']}
{1964: ['GDR', 'NAM', 'SCG', 'URS', 'WIF']}
{1968: ['ANZ', 'GDR', 'SCG', 'URS', 'USA']}
{1972: ['EUN', 'MKD', 'PAK', 'SCG', 'UZB']}
{1976: ['GDR', 'GRN', 'RUS', 'URS', 'USA']}
{1980: ['ANZ', 'CUB', 'ESP', 'KOS', 'WIF']}
{1984: ['FIJ', 'GDR', 'JAM', 'PAN', 'URS']}
{1988: ['EUN', 'GDR', 'SRB', 'URS', 'USA']}
{1992: ['ANZ', 'EUN', 'GDR', 'HAI', 'URS']}
{1994: ['CUB', 'GDR', 'URS', 'URU', 'YUG']}
{1996: ['EUN', 'GDR', 'JAM', 'PAK', 'URS']}
{1998: ['ARM', 'GDR', 'SRB', 'SWE', 'URS']}
{2000: ['EUN', 'GDR', 'SRB', 'URS', 'ZIM']}
{2002: ['ANZ', 'EUN', 'HAI', 'IND', 'URS']}
{2004: ['AZE', 'BOH', 'GDR', 'SCG', 'URS']}
{2006: ['AZE', 'CRO', 'GDR', 'GEO', 'NGR']}
{2008: ['EUN', 'GDR', 'URS', 'USA', 'WIF']}
{2010: ['EUN', 'GEO', 'RUS', 'URS', 'UZB']}
{2012: ['EUN', 'GDR', 'PAR', 'UAE', 'URS']}
{2014: ['ETH', 'KOR', 'SGP', 'TTO', 'URS']}
{2016: ['ANZ', 'GDR', 'TGA', 'URS', 'WIF']}

desert oar
#

thanks

#

so it's more than 5 countries total

#

it's the top 5 from a larger list?

cold pine
#

yeah

#

for each year every country got a point. I've sorted them by points and categorized them by year

desert oar
#

hm, that might get tricky with the lines thing

#

how many countries total? you don't want like 10 different colored lines if you can avoid it

#

is the first item in the list the highest or lowest ranked?

#

also why is each record a separate dict? i would imagine this should be one dict, where each key is a year and each value is a list

#
{
    2006: ['AZE', 'CRO', 'GDR', 'GEO', 'NGR'],
    2008: ['EUN', 'GDR', 'URS', 'USA', 'WIF'],
    2010: ['EUN', 'GEO', 'RUS', 'URS', 'UZB'],
}
cold pine
#

holy

#

58 exclusive countries

#

different ^

desert oar
#

yeah so 58 different lines is out of the question, even if only 5 of them are going to be visible at a given time

cold pine
#

huh

#

then could you give me advice how to see if these countries change across timespan?

desert oar
#

you could narrow it down a lot by getting the list of countries that appear in the top 5

#

i assume some of the 58 countries never do?

#

or are there 58 unique countries across this top 5 data?

cold pine
#

these are all across this top 5 data

desert oar
#

are you interested in 1 country specifically? or are you trying to get a sense of how the top 5 has changed over time?

cold pine
#

second one

desert oar
desert oar
#

maybe take the countries with the 10 best average rankings over time

#

another option is a grid of some kind, where the y axis is the country and the x axis is the year

#

and each cell in the grid would have the ranking in it

#

or again do something naive like sort by average ranking over time
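A sketch of the grid idea, assuming the data is held as one dict of year to list. Since the lists in the thread look alphabetically sorted rather than ranked, this version records only whether a country appears in a year's top 5, not its rank. Only three of the years are used here to keep the example short.

```python
import pandas as pd

# A subset of the data from the thread, as a single dict.
top5 = {
    2006: ["AZE", "CRO", "GDR", "GEO", "NGR"],
    2008: ["EUN", "GDR", "URS", "USA", "WIF"],
    2010: ["EUN", "GEO", "RUS", "URS", "UZB"],
}

# Long format: one (year, country) row per top-5 appearance.
rows = [(year, country) for year, countries in top5.items() for country in countries]
long = pd.DataFrame(rows, columns=["year", "country"])

# Grid with countries on the rows and years on the columns; 1 = in the top 5.
presence = pd.crosstab(long["country"], long["year"])

# How often each country appears, most frequent first.
appearances = presence.sum(axis=1).sort_values(ascending=False)

print(presence)
print(appearances.head())
```

The `presence` grid can be handed to a heatmap-style plot such as matplotlib's `imshow`, and `appearances` gives a simple way to keep only the most frequent countries before plotting.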

valid pebble
#

I am appending multiple dataframes to a CSV file using mode='a' in the to_csv method. Now how can I get 1 dataframe back from this file?

desert oar
#

read it with pd.read_csv?

valid pebble
cold pine
#

it might work! Ima try both options and see how it'll look. Thank you ❤️

valid pebble
desert oar
#

i don't understand the question

valid pebble
#

The things is I am appending multiple dataframes to same CSV

#

So now how can I get one dataframe out of this csv

desert oar
#

do you understand what csv is?

valid pebble
desert oar
#

if you append them with header=False you can just read the file and it will already be one "data frame"

#

there's nothing magical here

valid pebble
desert oar
#

?

#

well that's important information

#

the answer is no, you can't really do that

#

why not just write them to separate files?

#

you'd have to put some kind of delimiter between the two dataframes, then read the file line by line, partitioning or splitting on the delimiter

#

seems like a lot of trouble instead of just writing 1 dataframe per file

#

if you need to store multiple tables in a single file, consider using sqlite or hdf5
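If a single CSV is still wanted, one workaround is to write the header only once and tag every row with the frame it came from, so the pieces can be separated again after reading. This is a sketch under assumptions: all frames share the same columns, and the `source` column name and the frame names are invented.

```python
import os
import tempfile

import pandas as pd

# Invented frames that share the same columns.
frames = {
    "jan": pd.DataFrame({"x": [1, 2]}),
    "feb": pd.DataFrame({"x": [3, 4, 5]}),
}

path = os.path.join(tempfile.mkdtemp(), "combined.csv")

# Append each frame; only the first write emits the header row.
for i, (name, frame) in enumerate(frames.items()):
    frame.assign(source=name).to_csv(path, mode="a", header=(i == 0), index=False)

# Reading the file gives one combined frame.
combined = pd.read_csv(path)

# The "source" column lets the original frames be recovered.
recovered = {
    name: group.drop(columns="source").reset_index(drop=True)
    for name, group in combined.groupby("source")
}

print(combined)
```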

valid pebble
#

Thanks

vapid sentinel
#

guys can anyone share a data science and machine learning course link or a proper video

arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

serene scaffold
#

There's data science stuff in here.

#

And there will be more pretty soon

umbral skiff
#

I'm trying to create a new column in this DataFrame with the name of the day of the week, but I can't. Does anyone have any tips?


from datetime import date

days = (
  'Segunda-feira',
  'Terça-feira',
  'Quarta-feira',
  'Quinta-feira',
  'Sexta-feira',
  'Sábado',
  'Domingo'
  )

import pandas as pd

dados_vacinacao = pd.read_excel("vacinacao_br.xlsx")
dados_vacinacao.head()

UF    Data Vacinacao    quantidade
0    AC    2021-01-18 00:00:00    1
1    AC    2021-01-19 00:00:00    46
2    AC    2021-01-20 00:00:00    1021
3    AC    2021-01-21 00:00:00    1609
4    AC    2021-01-22 00:00:00    1105

dados_vacinacao["dia_semana"] = days[dados_vacinacao["Data Vacinacao"]].weekday()

TypeError: list indices must be integers or slices, not Series

proven sigil
umbral skiff
proven sigil
#
days[dados_vacinacao["Data Vacinacao"].dt.weekday()]
#

this works?

umbral skiff
proven sigil
#

ok, you have to convert the column to datetime format first

umbral skiff
flat fiber
bleak trout
#

What can we learn from Artificial Intelligence in video games?

quiet vault
bleak trout
#

skool projects are like that

quiet vault
#

You are doing AI in school?

somber prism
#

guys, what is epsilon in svr model in sklearn.svm ? is it learning rate

somber prism
umbral skiff
somber prism
#
import pandas as pd

pd.to_datetime(dados_vacinacao["Data Vacinacao"])

assuming dados_vacinacao is the dataframe and "Data Vacinacao" is the column
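
Putting the two suggestions together, a sketch that should work: convert the column, then map the weekday number to a name. The small frame below is a stand-in for the spreadsheet, and note that `.dt.weekday` is a property, so it takes no parentheses.

```python
import pandas as pd

days = (
    'Segunda-feira',
    'Terça-feira',
    'Quarta-feira',
    'Quinta-feira',
    'Sexta-feira',
    'Sábado',
    'Domingo',
)

# Stand-in for the data read from vacinacao_br.xlsx (first rows only).
dados_vacinacao = pd.DataFrame({
    "UF": ["AC", "AC", "AC"],
    "Data Vacinacao": ["2021-01-18 00:00:00", "2021-01-19 00:00:00", "2021-01-20 00:00:00"],
    "quantidade": [1, 46, 1021],
})

# Make sure the column really is datetime before using the .dt accessor.
dados_vacinacao["Data Vacinacao"] = pd.to_datetime(dados_vacinacao["Data Vacinacao"])

# .dt.weekday gives Monday=0 ... Sunday=6; look each number up in the tuple.
dados_vacinacao["dia_semana"] = dados_vacinacao["Data Vacinacao"].dt.weekday.map(lambda d: days[d])
print(dados_vacinacao)
```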
opal mango
#

sup, could you send a good python video thats focused on data science

opal mango
#

thx

lapis sequoia
#

does anyone know how to implement a machine learning api into python code

misty flint
#

are you calling the api or coding one

#

those are 2 different topics

#

calling an api is relatively easy for most common api's. look into their documentation, usually very helpful

desert oar
# somber prism guys, what is epsilon in svr model in sklearn.svm ? is it learning rate

no. support vector regression is like linear regression, but errors smaller than ε are ignored. so ε controls the tolerance of the model to errors; bigger ε means greater tolerance. this is analogous to the interpretation of C in support vector classification as the tolerance of the model to misclassification.

see:
https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html#sklearn.svm.LinearSVR
https://scikit-learn.org/stable/modules/svm.html#linearsvr
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.114.4288
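
To make "errors smaller than ε are ignored" concrete, here is a small NumPy sketch of the ε-insensitive loss. The function name and the numbers are made up for illustration; this is the textbook formula, not scikit-learn's internal code.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    residual = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return np.maximum(0.0, residual - epsilon)

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 1.0])   # absolute errors: 0.05, 0.5, 2.0

# A small epsilon ignores only the tiniest error.
small = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)

# A larger epsilon tolerates more: the 0.5 error now costs nothing.
large = epsilon_insensitive_loss(y_true, y_pred, epsilon=1.0)
```

So ε sets the width of a "don't care" band around the prediction, which is a different thing from a learning rate.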

lapis sequoia
#

hi

#

im using jupyter notebooks

#

this happens when i try to load an image (.tif)

#

I dont know why

grave frost
glad mulch
#

anyone know why i keep getting this error

#
## using Health Care as BaseLine
exog_vars = ['BOARD_SIZE','CEO_DUALITY','BOARD_AVERAGE_TENURE','BOARD_MEETING_ATTENDANCE_PCT',
             'PCT_WOMEN_ON_BOARD','ROE','CUR_MKT_CAP','ROA','D/E',
             'Consumer Discretionary','Consumer Staples', 'Energy', 'Financials',
             'Industrials', 'Information Technology', 'Materials',
             'Utilities']
exog = sm.tools.tools.add_constant(panel_data[exog_vars])
endog = panel_data['ESG_DISCLOSURE_SCORE']
mod = PanelOLS(endog, exog, entity_effects=True)
PanelOLS4_res = mod.fit(cov_type='clustered', cluster_entity=True)
# Store values for checking homoskedasticity graphically
fittedvals_fe4_OLS = PanelOLS4_res.predict().fitted_values
residuals_fe4_OLS = PanelOLS4_res.resids
#
AbsorbingEffectError                      Traceback (most recent call last)
<ipython-input-50-a32677c6977b> in <module>
      8 endog = panel_data['ESG_DISCLOSURE_SCORE']
      9 mod = PanelOLS(endog, exog, entity_effects=True)
---> 10 PanelOLS4_res = mod.fit(cov_type='clustered', cluster_entity=True)
     11 # Store values for checking homoskedasticity graphically
     12 fittedvals_fe4_OLS = PanelOLS4_res.predict().fitted_values

C:\ProgramData\Anaconda3\lib\site-packages\linearmodels\panel\model.py in fit(self, use_lsdv, use_lsmr, low_memory, cov_type, debiased, auto_df, count_effects, **cov_config)
   1779         if self.entity_effects or self.time_effects or self.other_effects:
   1780             if not self._drop_absorbed:
-> 1781                 check_absorbed(x, [str(var) for var in self.exog.vars])
   1782             else:
   1783                 # TODO: Need to special case the constant here when determining which to retain

C:\ProgramData\Anaconda3\lib\site-packages\linearmodels\panel\utility.py in check_absorbed(x, variables, x_orig)
    441         absorbed_variables = "\n".join(rows)
    442         msg = absorbing_error_msg.format(absorbed_variables=absorbed_variables)
--> 443         raise AbsorbingEffectError(msg)
    444     if x_orig is None:
    445         return

AbsorbingEffectError: 
The model cannot be estimated. The included effects have fully absorbed
one or more of the variables. This occurs when one or more of the dependent
variable is perfectly explained using the effects included in the model.

The following variables or variable combinations have been fully absorbed
or have become perfectly collinear after effects are removed:

          const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, ROE, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
          const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, PCT_WOMEN_ON_BOARD, ROE, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
          const, BOARD_SIZE, CEO_DUALITY, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
          const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
          const, BOARD_SIZE, CEO_DUALITY, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
          const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, BOARD_MEETING_ATTENDANCE_PCT, PCT_WOMEN_ON_BOARD, ROE, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
          const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, ROE, ROA, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities
          const, BOARD_SIZE, CEO_DUALITY, BOARD_AVERAGE_TENURE, BOARD_MEETING_ATTENDANCE_PCT, PCT_WOMEN_ON_BOARD, ROE, CUR_MKT_CAP, ROA, D/E, Consumer Discretionary, Consumer Staples, Energy, Financials, Industrials, Information Technology, Materials, Utilities

Set drop_absorbed=True to automatically drop absorbed variables.
desert oar
#

@glad mulch it looks like one of your variables is fully explained by one of your effects, as per the error

#

It's possible that you forgot to drop a level in a dummy variable, or you have some combinations of variables that are redundant
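
One way this can happen, sketched on made-up data: entity effects work like subtracting each firm's own mean, so any column that never changes within a firm (a sector dummy, for instance) becomes all zeros. Whether your sector columns are constant per firm is an assumption on my part.

```python
import pandas as pd

# Hypothetical panel: two firms observed over three years.
panel = pd.DataFrame({
    "firm": ["A", "A", "A", "B", "B", "B"],
    "year": [2018, 2019, 2020, 2018, 2019, 2020],
    "BOARD_SIZE": [9, 10, 11, 7, 7, 8],   # changes over time
    "Energy": [1, 1, 1, 0, 0, 0],         # sector dummy, fixed per firm
})

# Entity effects amount to subtracting each firm's own mean from every column.
cols = ["BOARD_SIZE", "Energy"]
demeaned = panel[cols] - panel.groupby("firm")[cols].transform("mean")

# A column that is all zeros after demeaning carries no information
# once entity effects are included, so it gets reported as absorbed.
absorbed = [c for c in cols if (demeaned[c].abs() < 1e-12).all()]
print(absorbed)
```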

glad mulch
#

i thought i did drop a variable

#

i dropped health care

#

@desert oar

desert oar
#

It's a bit hard to read on my phone, but it looks like they are suggesting several combinations that are collinear

#

It's possible that your data is messed up

#

In fact, if you can share the data, or a sample of it that reproduces the problem, that might help

glad mulch
#

i can only do partial amounts of data

#

to share

arctic wedgeBOT
#

Hey @glad mulch!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .csv attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

glad mulch
#

can i dm u

#

@desert oar

orchid lance
#

I'm studying data science at a low level and need help trying to create a multi objective optimization using goal programming to generate a pareto frontier.

#

I managed to make it in excel and mathematica but couldn't do so in python

#

Any help at all would be appreciated

#

I did make a post on help-apple

bleak trout
sinful gale
#

What do you call the classification categories in a formal way? Like if the dataset has three classification categories like -1,0,1 - what are -1,0,1 called?

sinful gale
#

Gotcha

tender hearth
#

I'm not sure if there's a separate term but it seems like there should be

sinful gale
cold pine
#

hello. I've df with two columns, first - 'Class' (one of 8 skills) and second - 'isOver30' (True/False)
how to make stacked bar chart with Classes as x,
total(isOver30) for each class as y and total(true) under total(False)

shut tapir
#

Hi guys
Has anyone worked with BERT previously? If so, have you used the transformers library by HuggingFace? I need help with running the BERT model offline, because the system I will be running this model on will not have access to an internet connection, and so a pre-trained model is the way to go. I'm stuck at this point and not sure how to proceed, please help. Thank you! 🙂

warm wharf
#

hi :) i put a really basic question about BCELoss in #help-honey if anyone could help i would appreciate it thanks

livid pine
#

Hi i need some help how can i dump rows in another df

lapis sequoia
#

Hi can someone help me with databases

modest mulch
rustic scarab
#

Anyone here doing projects on NLP? Am looking for someone who's is learning AI by building some projects?

royal crest
#

Yes

hoary wigeon
#

Hello

#

Julia vs Javascript which are good in these two for ML : )

royal crest
#

Julia

#

(Neither personally)

median cliff
#

Are there any resources for stats stuff in regards to checking pre-processing image data for a CNN?

austere swift
#

whenever i wanna learn a new kind of model or something I try to replicate it from the paper so I can see its functions and such

#

it gives me a good understanding of the inner workings of the model by hard coding them on my own

serene scaffold
#

Dataframe:

0,count,b_f1,group,colors,unique_colors,unique_percent
CellLine,39,0.541,Animal,red,#FF0000,0.59
GroupName,939,0.399,Animal,red,#FF7F00,0.569
GroupSize,367,0.625,Animal,red,#FFD400,0.343
SampleSize,43,0.584,Animal,red,#FFFF00,0.791
Sex,574,0.83,Animal,red,#BFFF00,0.099
Species,1585,0.859,Animal,red,#6AFF00,0.047
Strain,372,0.62,Animal,red,#00EAFF,0.387
Dose,637,0.618,Dose,green,#0095FF,0.232
DoseDuration,210,0.545,Dose,green,#0040FF,0.262
DoseDurationUnits,198,0.536,Dose,green,#AA00FF,0.177
DoseFrequency,92,0.513,Dose,green,#FF00AA,0.522
DoseRoute,558,0.563,Dose,green,#EDB9B9,0.317
DoseUnits,472,0.592,Dose,green,#E7E9B9,0.252
TimeAtDose,115,0.276,Dose,green,#B9EDE0,0.617
TimeAtFirstDose,45,0.037,Dose,green,#B9D7ED,0.644
TimeAtLastDose,21,0.0,Dose,green,#DCB9ED,0.571
TimeEndpointAssessed,659,0.5,Dose,green,#8F2323,0.382
TimeUnits,594,0.653,Dose,green,#8F6A23,0.12
TestArticle,1831,0.485,Exposure,orange,#4F8F23,0.281
TestArticlePurity,26,0.213,Exposure,orange,#23628F,0.855
Vehicle,417,0.44,Exposure,orange,#6B238F,0.283
TestArticleVerification,5,0.0,Exposure,orange,#000000,1.0
Endpoint,4316,0.308,Endpoint,blue,#737373,0.844
EndpointUnitOfMeasure,682,0.35,Endpoint,blue,#CCCCCC,0.499

Code:

df.plot.scatter(x='unique_percent', y='b_f1', xlabel='Percent Unique Mentions', ylabel='BERT $F_1$ Score', color=df['unique_colors'])

How do I add a color legend on the side?

#

This is the result:

#

My advisor told me to do it like this. I'm not complaining because it's gay.

rigid zodiac
#

Dumb question, I have a really big csv file, like 134000 rows of data (in json file). How can I speed up processing the data?

serene scaffold
rigid zodiac
rigid zodiac
#

I did the similar process with other data 12900 row before. Yeahh csv

serene scaffold
#

let's just forget how many rows there are. It doesn't matter for figuring out what transformation you need to do

#

Can you provide a few example rows from the CSV as text (no screenshot)?

arctic wedgeBOT
#

Hey @rigid zodiac!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

serene scaffold
austere lion
#

Can i ask a question about Natural Language Processing here?

#

I really can't find the definition of Shallow NLP

austere swift
austere swift
#

it would be something along the lines of:

import multiprocessing as mp
with mp.Pool() as pool:
    results = pool.map(transform_function, df.iterrows())
#

you will have to mess with results a bit to get it back to a df because it will return a list of the outputs from the transform_function

austere lion
#

So.. can anyone explain shallow NLP, or where can I find a source that explains it?

austere swift
austere lion
#

i dunno what it means, so i try to find out

mortal dove
# serene scaffold Dataframe: ```0,count,b_f1,group,colors,unique_colors,unique_percent CellLine,39...
import matplotlib.patches as mpatches

ax = df.plot.scatter(x='unique_percent', y='b_f1', xlabel='Percent Unique Mentions', ylabel='BERT $F_1$ Score', color=df['unique_colors'])

colorlist = zip(df['0'], df['unique_colors'])
handles = [mpatches.Patch(color=colour, label=label) for label, colour in colorlist]
labels = df['0']

ax.legend(handles, labels, loc='center right', bbox_to_anchor=(1.5, 0.5));
gray verge
#

Any resources?

serene scaffold
#

@mortal dove seems like an unnecessarily complicated solution, but I suppose matplotlib might not support a better one

mortal dove
#

Yea, can't find a easier way, seaborn might have a simpler solution

serene scaffold
#

@mortal dove you can add it to the so question I posted, but the question is posed differently https://stackoverflow.com/questions/69046713/create-a-color-coded-key-for-a-matplotlib-scatter-plot-with-specific-colors

#

Also if you post it there, please don't include the real labels.

mortal dove
#

Right, will do

#

seaborn is indeed a lot easier if you can use it

#
ax = sns.scatterplot(x=df['unique_percent'], y=df['b_f1'], palette=df['unique_colors'].tolist(), hue=df['0'])
thorn bobcat
rustic scarab
austere swift
#

Nah I don't really collab with people, sorry

quick kestrel
#

Guys how can I start learning ai and data science

rigid zodiac
quick kestrel
#

@rigid zodiac well I wanna learn ai and machine learning

untold parrot
#

if you're new to ML check out Kaggle courses, maybe you can start with intro to machine learning course.

quick kestrel
#

@untold parrot I am making a discord bot but I want it to learn about the way users talk and make it talk to them

#

Basically a chat bott

misty flint
quick kestrel
#

Btw what's the download size of tensorflow

misty flint
quick kestrel
misty flint
quick kestrel
#

Ok so I need a db too

#

And I have that

#

Now I wanna learn ml for my bot but how @misty flint

misty flint
#

plenty of tutorials. you dont need to learn all of ML first to do what you need

quick kestrel
#

U mean I need to learn just a bit of ml

#

?

misty flint
#

you need to learn the bare minimum. how is your chatbot going to learn from your dataset? what do you want it to say? etc.

quick kestrel
misty flint
#

yes but what is it going to say in response

quick kestrel
misty flint
#

theres different approaches to your problem so think about what youre looking for and then search through some of the tutorials. come back when you have a more specific problem, yeah?

untold parrot
sand timber
#

anyone got good links for deciding on NN archs? e.g. how many conv layers, what kernel sizes, pooling types, etc

lapis sequoia
#

Is there some good project to start with numpy and pandas? rooThink
These both are in my school syllabus and I'm thinking to learn them with some project, rather than just trying out each method one by one.. coz that's very boring blobpain

soft kite
#

shot in the dark. does anyone here have experience with matplotlib for graphing data?

wraith lagoon
#

while using pandas i came across statements df.sort_values('tip',ascending=False)[0:2] and df.sort_values('tip',ascending=False).iloc[0:2] both the statements gave me same result how?

#

i asked this in help channel but no reply

ripe forge
#

pandas tends to give you multiple ways of doing the same thing.

#

the first uses a slice. the second slices an iloc. that's about it really, both do the same thing here.

#

(unless you're saying that they didn't give the same result. that would be a different question entirely)

wraith lagoon
#

but here the results were in tables

ripe forge
#

could you rephrase your question? are you getting same results from both or different

wraith lagoon
#

same but iloc give rows

#

if we slice using iloc shouldn't we get rows excluding the header row

#

for example ['name','age']['abc',123]..... so in this name and age are column names

#

so if i use iloc they should be excluded

ripe forge
#

oh, i see where you're coming from. when you select a row using iloc, you get the row. but when you slice, you still get the dataframe, so column names will be there.

ripe forge
#

ie. what you're saying would only be true if you were talking about .iloc[2]. .iloc[0: 2] gives the slice of* df, so no, the column names should be there, as intended

desert oar
#

Actually maybe there is a special case in there for slices of integers

#

I try to avoid using implicit row slicing/subsetting whenever possible

wraith lagoon
ripe forge
#

yes, it's a series

desert oar
#

I always use loc/iloc for rows

ripe forge
#

this will only apply on selection, but not on slices.

wraith lagoon
#

so if we select multiple series it should be same

desert oar
#

Calling iloc with a slice or something "array like" returns a DataFrame. Calling it with a "scalar" returns a Series
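
A quick sketch of the difference, using a made-up stand-in for the tips data from the original question:

```python
import pandas as pd

# Hypothetical stand-in for the data in the question.
df = pd.DataFrame({"name": ["abc", "def", "ghi"], "tip": [1.5, 4.0, 2.5]})
ranked = df.sort_values("tip", ascending=False)

by_slice = ranked[0:2]       # plain integer slice: first two rows by position
by_iloc = ranked.iloc[0:2]   # same rows, but explicitly positional
single = ranked.iloc[0]      # scalar position: one row, returned as a Series

print(type(by_iloc).__name__)   # DataFrame, so the column names are still there
print(type(single).__name__)    # Series, indexed by the column names
```

Either way the column names are metadata on the result, not a row that slicing could skip.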

wraith lagoon
#

so is there any way that i can specify that its a header row and it should not be included

desert oar
#

I think you're confusing the content of the table with its display on the screen

wraith lagoon
desert oar
#

That or you need to pass different options to read_csv

ripe forge
#

this might be an XY problem. a dataframe always has column names. maybe your goal can be achieved regardless depending on what you're actually facing

desert oar
wraith lagoon
#

oh

desert oar
#

It's a common newbie trap, because the default row labels are also integers

#

As for your actual question, are you just trying to print the data without the column names? Or did you somehow accidentally get the column names into a row in the dataframe?

wraith lagoon
#

thing is that when i did for single iloc it gave me ans in series

#

but i want multiple rows

#

not columns

rigid zodiac
#

you can range it

desert oar
#

Then use a slice or a list of numbers, not a single number

rigid zodiac
#

like df.iloc[: , : ]

wraith lagoon
rigid zodiac
#

sure

desert oar
#

what's this business about the header row?

wraith lagoon
#

that it should not be included in iloc

desert oar
#

@wraith lagoon can you show us the output you're getting and explain what you want to be different?

rigid zodiac
desert oar
#

it will display the header row because that's part of the dataframe. but the column names are not "part of" the data, they're kept in a separate place

rigid zodiac
#

Btw Have any one work on the LSTM model? or CNN model before? If yes can I have your code and use it as sample

heady yarrow
#

!e ```py
import os
os.system("ls")
```

arctic wedgeBOT
#

@heady yarrow :warning: Your eval job has completed with return code 0.

[No output]
heady yarrow
#

bruh

desert oar
#

!e ```python
import pandas as pd

df = pd.DataFrame([
['a', 11, 12],
['b', 21, 22],
['c', 32, 32],
], columns=['i', 'y', 'z']).set_index('i')

print(df)
print( df.iloc[1] )
print( df.iloc[[1]] )
print( df.iloc[:2] )
print( df.iloc[[0, 2]] )
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 |     y   z
002 | i        
003 | a  11  12
004 | b  21  22
005 | c  32  32
006 | y    21
007 | z    22
008 | Name: b, dtype: int64
009 |     y   z
010 | i        
011 | b  21  22
... (truncated - too many lines)

Full output: https://paste.pythondiscord.com/ewezobaveh.txt?noredirect

wraith lagoon
desert oar
#

@heady yarrow #bot-commands

wraith lagoon
#

but this dont work

desert oar
#

...what did you expect that to do?

wraith lagoon
#

just to transpose the dataframe

#

and to overwrite the previous one

desert oar
#

i suggest reading the docs 🙂

#
df = df.transpose()
#

don't guess at syntax

desert oar
#

!d pandas.DataFrame.describe

arctic wedgeBOT
#

DataFrame.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
Generate descriptive statistics.

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding `NaN` values.

Analyzes both numeric and object series, as well as `DataFrame` column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
desert oar
#

!d pandas.DataFrame.transpose

arctic wedgeBOT
#

DataFrame.transpose(*args, copy=False)
Transpose index and columns.

Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. The property [`T`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.T.html#pandas.DataFrame.T "pandas.DataFrame.T") is an accessor to the method [`transpose()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html#pandas.DataFrame.transpose "pandas.DataFrame.transpose").
wraith lagoon
#

Me totally confused

wraith lagoon
#

Also why is np.vectorize faster than the apply command in pandas? I searched but didn't get what happens in the backend of np.vectorize

#

There was a good time difference

#

In apply time diff was 3.something and in vectorize it was in 0.something

#

So why there is a vast diff?

serene scaffold
#

I can't figure out how to calculate micro averages with only the scores to be averaged and the percent share.

uncut barn
#

is there a way for python (i.e. using jupyter notebook) to access files directly on the USB, i.e. without downloading the file locally first?

#

or other IDEs?

mint matrix
#

I am working on some assignment in networkx and need little help of understanding math and graph of it if someone could explain me :
The 2-dimensional lattice generated by vectors u,v consists of the points au+bv for all integers a,b. Considering vectors u,v, and an integer k>0 as input parameters, write a program creating a planar graph of a lattice generated by u,v, whose vertices are the origin (0,0) and only its k closest neighbors among the lattice points, while any vertices related by a translation by only u, v, u+v or u-v should be connected by a straight-line edge.
I have to implement it in networkx

#

but could not understand how to start

livid kiln
serene scaffold
soft kite
#

Would this be the right chat to ask about how to label items in bar charts in matplotlib python library?

wanton spear
#

guys i cant find a way to add individual range on bins...google didnt help much

crystal harness
#

i want to be senpai in ML how long it will take me?

robust cosmos
#

guys could anyone help in this code?

#

throws this error: '[2927] not found in axis'

flat hollow
# wanton spear

if you're plotting this using sns.distplot() you can pass a sequence of bin edges via the bins= kwarg

robust cosmos
#

this is for dropping random rows in a df

robust cosmos
flat hollow
#

perhaps even better solution than what you're trying to get working would be something like

remove_n = 1
drop_indices = np.random.choice(df.index, remove_n, replace=False)
df_subset = df.drop(drop_indices)
robust cosmos
#

the indexes are all integers and sorted , i just want to drop random rows .

flat hollow
#

this would take a random choice of the actual indices that exist in your dataframe and pass them into .drop() for removing
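
Here is the same idea as a self-contained sketch. The frame is made up, with gaps in its index on purpose, and the seed is only there to make the run repeatable.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose index has gaps, as happens after earlier drops.
df = pd.DataFrame({"x": range(6)}, index=[0, 1, 2, 5, 8, 13])

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
remove_n = 2

# Choose from labels that actually exist, so drop() cannot raise
# "not found in axis" the way a random integer from a range might.
drop_indices = rng.choice(df.index, size=remove_n, replace=False)
df_subset = df.drop(drop_indices)
print(df_subset)
```

`df.drop(df.sample(n=remove_n).index)` should do the same job without NumPy, if you prefer to stay in pandas.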

velvet thorn
robust cosmos
robust cosmos
robust cosmos
velvet thorn
#

or you could just pass 2000

#

works too

robust cosmos
velvet thorn
crystal harness
#

Guys I am using matplotlib on Android

serene scaffold
uncut barn
#
def get_patches(imgfolder, lvl = 0, size = 512):
  """
  This function takes in an image of certain resolution
  and extracts patches of size - (width, height) and saves them in
  certain folder via a directory

  parameters:
  imgfolder - contains imgs of file type NDPI
  lvl - resolution of the img
  size - square patch so int - (int, int) - (width height)
  """
  
  # we get the number of the image with the file name
  for img_num, imgfile in enumerate(imgfolder, start = 1):
    
    # using openslide to open the img file
    os_img = OpenSlide(imgfile)

    # dimensions of the img at a particular lvl & calculating the interval
    # (integer division, so the counts can be passed to range() below)
    os_img_w_interval = os_img.level_dimensions[lvl][0] // size
    os_img_h_interval = os_img.level_dimensions[lvl][1] // size

    # row of the overall img
    for i in range(os_img_w_interval):
      # column of the overall img
      for j in range(os_img_h_interval):
        # getting a patch of the img
        ptch = os_img.read_region((i * size, j * size), lvl, 
                                          (size, size))
        arr = np.array(ptch)
        # exlcude the last channel column RGBA - as all values are 255 so opaque
        RGB = arr[:, :, :-1] 
        # trying to get rid of all background pxs using a min px threshold
        # and proportion of green px vals
        flat_RGB = RGB.reshape(-1, 3)
        green = flat_RGB[:, 1] # just green pxs
        # proportion value
        prop = len([px for px in green if px > 180]) / len(green)
        # min px in the patch
        min_px = np.min(RGB)
        if (min_px < 130) or (prop < 0.8):
          # saving the image PNG format into the specified directory
          ptch.save('/content/drive/MyDrive/Data/older_retics_patches/img'+ 
                   str(img_num) + '/' + 'patch_' + str(i) + "_" + str(j)+ ".png")

Guys is there a way I can speed up this code? This function extracts patches from images that are of the type NDPI, and range from 120 MB to 3 GB in my data set
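
One part that can probably be made faster is the per-pixel list comprehension used for the green proportion. Below is a hedged sketch of the same keep/skip test done with NumPy array operations; `keep_patch` and its parameter names are hypothetical, and the thresholds are copied from the function above. It does not touch the file reading or PNG saving, which may well dominate the runtime.

```python
import numpy as np

def keep_patch(rgba, green_threshold=180, min_px_threshold=130, max_green_prop=0.8):
    """Return True when a patch should be saved, mirroring the checks above."""
    rgb = rgba[:, :, :3]                                   # drop the alpha channel
    green_prop = np.mean(rgb[:, :, 1] > green_threshold)   # share of bright green pixels
    return bool(rgb.min() < min_px_threshold or green_prop < max_green_prop)

# Synthetic patches standing in for np.array(ptch).
background = np.full((8, 8, 4), 255, dtype=np.uint8)   # blank, bright patch
tissue = background.copy()
tissue[2:6, 2:6, :3] = 60                               # a dark region

print(keep_patch(background), keep_patch(tissue))
```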

boreal wasp
#

hi may I know what is nltk?

#

stopwords

ripe forge
#

Stopwords are just those filler words we use in English that may not be necessary to convey meaning for a machine.

#

So for example, articles "he is a man" would still retain the meaning more or less if you changed it to "he is man". It may sound weird but for a machine it's learning whatever you send it anyways. The benefit is that you get to reduce the number of words needed

#

So it can lead to better performance for simple models
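
A toy sketch of the idea in plain Python. The stopword set here holds only the articles, to match the example above; real lists, such as the English one that ships with NLTK, are much longer.

```python
# A tiny hand-picked stopword set, purely for illustration.
STOPWORDS = {"a", "an", "the"}

def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop the stopwords."""
    return [word for word in text.lower().split() if word not in STOPWORDS]

tokens = remove_stopwords("He is a man")
print(tokens)
```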

boreal wasp
#

Ahh I see

#

Thank you

serene scaffold
ripe forge
#

Yeap yep. But I think it makes explaining it harder, so I'm perfectly happy to fudge the specifics a little to convey the concept easier.

quick kestrel
#

Guys how can I make a image identifier

#

Using ai

serene scaffold
boreal wasp
#

Need help on this

#

Hi I wonder how to grab values starting with '#'?
I've tried using it like this...but it returns error

quick kestrel
#

@serene scaffold I want to identify images

serene scaffold
quick kestrel
#

Like I want to make a anti NSFW system in my discord bot but how will it identify any NSFW content that's where I thought of taking help of ai, my bot will work like if any NSFW content is posted then it will immediately delete it

serene scaffold
quick kestrel
#

?

serene scaffold
quick kestrel
serene scaffold
# quick kestrel How?

I don't personally know, though I already made a suggestion about what you could do next.

quick kestrel
#

@serene scaffold what lib shall I use for this???

quick kestrel
#

Aah

serene scaffold
#

I have nothing else to suggest that I haven't already suggested.

errant flame
#

I just started with machine learning, so sorry if I made a mistake or something.

So I watched a few tutorials on deep reinforced learning with tensorflow and keras. On the build model step, they added different amounts of Flatten, Dense, and Convolution2D. In the CartPole-v0, it was Flatten, Dense, Dense, Dense(actions). In one where the the states were a number from 0 to 100, there was no Flatten. And on the one with Space Invaders, they added 3 Convolution2D ones.

How do I know which layers to add and which arguments to give? I'm currently trying to build one with an input of an 11 x 15 2D array.

dire crane
# quick kestrel How?

You can create a simple model using tensorflow, using hentai images as your dataset. I recommend taking at least 1000 nsfw images and 1000 non-nsfw images, labelling them as [nsfw and non-nsfw], and then training your model using any neural network architecture.

#

The idea is to create a model that can detect whether an image is nsfw or not. It's possible to improve it

dire crane
dire crane
uncut barn
#

is there a way in python to check the file size of an image before saving it?

serene scaffold
uncut barn
#

so i've changed my array to a pil.image

serene scaffold
quick kestrel
serene scaffold
lapis sequoia
dire crane
serene scaffold
uncut barn
#

ok thanks will try this

lapis sequoia
quick kestrel
#

@dire crane Well what's web scraping

#

??

#

Never tried it

dire crane
#

Its a method to get data from web, literally scrap data from web sites

lapis sequoia
#

Like getting all the html content of the site and parse it to get some data out of it

dire crane
#

There is a library called Beautiful Soup that does this for you

#

yes

quick kestrel
#

Well it is really done with selenium?

lapis sequoia
#

Or just make requests...? easier and faster

quick kestrel
lapis sequoia
#

Yes

quick kestrel
#

Ah me nub in python currently started working with keras

lapis sequoia
#
import requests
data = requests.get('https://github.com/AkshuAgarwal').text  # .text is a property, not a method
# now use bs4
lapis sequoia
#

You'll get the whole html of the site

#

Do some reverse engineering to find the ids of image tags and parse them
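
A rough sketch of the parsing step, using the standard library's `html.parser` as a stand-in for bs4 so it runs without extra installs. The HTML string is made up; in practice it would be the text returned by the request, and `ImageSrcCollector` is just a name I picked.

```python
from html.parser import HTMLParser

class ImageSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

# Stand-in for the HTML that a request would return (made-up markup).
html = '<html><body><img src="/a.png" alt="a"><p>text</p><img src="/b.jpg"></body></html>'

collector = ImageSrcCollector()
collector.feed(html)
print(collector.sources)
```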

quick kestrel
#

Right?

lapis sequoia
#

Best is actually to use an API

quick kestrel
#

@dire crane need some more help bud

dire crane
#

what you need?

quick kestrel
#

What are the things I need to learn to make this like tensorflow

#

And what's the use of neural networks in ai, I mean does it work like neurones of human to send data and responses from one location to another

dire crane
#

If you really want to go straight to hands-on I suggest you read the tensorflow and keras docs. Keras is like a high-level tensorflow, with all the deep learning layers already created. But I highly suggest you study the basics of neural network architecture, because knowing the basics you can create simple architectures to predict values.

dire crane
dire crane
quick kestrel
#

So no need of maths in it?

dire crane
#

but knowing simple math and statistics is fine

quick kestrel
#

So to do those calculations of ai do I need numpy or can do it without any module, and are neurons used to transfer data like api does?

quick kestrel
dire crane
#

Tensorflow has a lot of methods that do the math

quick kestrel
#

So whats the use of keras and what is the use of neural networks

dire crane
#

Keras is an API that facilitates the use of tensorflow

quick kestrel
lapis sequoia
#

Hey I've been working on this imbalanced dataset for awhile, however, can't seem to find a good enough structure to it.
Process:

  1. Imported the csv (99:1, target) and fixed the skew
  2. Created 50-50 under sample as a train set
  3. Scaled the train
  4. LogisticRegression
  5. GridSearchCV to find best f1 score for the minority target (fitted the train under sampled df)
  6. scaled the original dataset (X_val, y_val)
  7. classification report and confusion matrix of predict(X_val, y_val)
    The results are horrible, the maximum I am getting f1_score is 6%, could someone please share your experience and approach on imbalanced datasets?
dire crane
#

There is a book called: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, its a very good book, i recommend

quick kestrel
dire crane
#

yes, thats another library

quick kestrel
dire crane
#

That depends on your problem, if you want to solve simple problems like linear problems scikit-learn may solve your problems. But if it is more complex I prefer using Keras + TensorFlow

quick kestrel
dire crane
#

take it slow, study AI is a long journey. I've been studying it for 2-3 years and still there is so much to learn. I hope you enjoy this journey of AI ;D

quick kestrel
#

Well thanks for the help bud

dire crane
#

you're welcome! ;3

median cliff
#

Anyone have any resources for some stats analysis on a feature map for a potential CNN? I have a lot of data and I'm trying to check for skew/normality/etc etc

grave frost
#

but if you did, big balls to you man

hasty mountain
#

Hey guys, I'm trying to read an image to use as training set. However, for some reason, instead of getting a 3D RGB array, I'm getting a 4D RGB array with shape (26, 25, 4):
```
[[[186 26 51 255]
[194 51 68 255]
[205 82 90 255]
...
[221 132 134 255]
[215 151 156 255]
[208 167 177 255]]
```

I've tried using `np.delete(array, 3, axis=1)` to remove this `255` from the array, but it didn't work as I expected. Can someone give me a hint on how to remove this `255` column?
median cliff
#

@hasty mountain I believe opencv2 has some options to get down to 3 channels

hasty mountain
#

Hm... How about keras.layers? (I want to use this set in a neural network)

median cliff
#

I'm not sure how you pre-processed your data. Can you elaborate on how you got that array?

hasty mountain
#

I just loaded an image using imageio (but it could be matplotlib.pyplot; it gives the same result).

median cliff
#

I could be wrong here, but you'll want to get rid of that (I think its an alpha channel) before you convert to the feature map
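
A shape of (26, 25, 4) is still a 3D array; the last axis just holds four channels, which suggests RGBA with the `255` values being a fully opaque alpha channel. A minimal sketch of dropping that channel with NumPy slicing, assuming the layout is (height, width, channels). The array below is synthetic stand-in data, not the original image:

```python
import numpy as np

# Synthetic stand-in for the loaded image: 26x25 pixels, 4 channels (RGBA).
rgba = np.zeros((26, 25, 4), dtype=np.uint8)
rgba[..., 3] = 255  # fully opaque alpha channel

# Keep only the first three channels (R, G, B) along the last axis.
rgb = rgba[..., :3]
print(rgb.shape)  # (26, 25, 3)

# Equivalent with np.delete: the channel axis is the last one (axis=2),
# not axis=1, which would remove a column of pixels instead.
rgb_alt = np.delete(rgba, 3, axis=2)
```

The slice returns a view rather than a copy, so it costs almost nothing even for large images.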

hasty mountain
#

Well, I don't know how that 4th dimension appeared

#

Maybe I messed up in Paint 3D...

hasty mountain
austere swift
austere swift
#

I basically tried it on my own, didn't work very well, then went to that to see what I did wrong

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @topaz yew until <t:1630782168:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

lapis sequoia
#

Who are you?

#

Oh. Well, no

plucky oar
#

so im trying to create an ai and this error is popping up:

edgy gulch
#

Would anyone be able to help me make sure my implementation of multinomial naive bayes is correct?

plucky oar
#
```
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'pyttsx3.drivers.sap15'
```
#

thats my code

#

```
engine = pyttsx3.init("sap15")
voices = engine.getProperty("voices")
print(voices[0].id)
engine.setProperty("voices", voices[0].id)
```
#

ahh if you didnt know you need to wait for the first one to finish their problem because like that your blocking them from getting help

#

people wont see their message like that

slow jewel
#

oh sorry

plucky oar
#

in

#

t

#

its not a problem

slow jewel
#

k

plucky oar
#

ill notify you when i get my problem solved

slow jewel
#

ok thanks

plucky oar
#

i think i solved it

plucky oar
chilly torrent
slow jewel
#

I need help reorganizing a pandas data frame. It's a bunch of stock market data about a bunch of companies. Right now, it is ordered so that every row is a point in time describing a single stock. There is a row for each of the stocks for every 5 minutes over the course of some time. I need it so the index is the date and time and it would have many columns such as ALKT.US_OPEN, ALKT.US_HIGH, etc., for all of the stocks. That way, each row is a snapshot of the whole dataset at a given time. Can anyone give me an idea about how to do this? Here are some example rows.
        <TICKER>  <DATE>    <TIME>  <OPEN>   <HIGH>  <LOW>   <CLOSE>  <VOL>
319763  ALKT.US   20210729  185000  31.8400  31.96   31.840  31.85    1275
24615   ABCL.US   20210730  191500  15.3000  15.31   15.295  15.31    1595
55906   ACB.US    20210804  165500  7.0782   7.09    7.055   7.09     12275
843228  BCRX.US   20210826  203000  15.4100  15.43   15.395  15.41    5699
453707  ANTE.US   20210820  160000  2.5000   2.50    2.500   2.50     100

chilly torrent
#

check out the pivot table method

slow jewel
#

thanks, i just looked at that. Although its similar, its not really what I intend

mortal dove
#

Could you also provide an example of what you'd want the dataframe to look like after transforming it?

slow jewel
#

sure one sec

#

                <AMZN.US_OPEN>  <AMZN.US_HIGH>  ...  <APPL.US_OPEN>  <APPLE.US_HIGH>  ...  for every stock
2021-07-1 3:00  1000            1010                 150             155
2021-07-1 3:05  1001            1011                 144             155
...

#

like this

#

the index of the row is the date and time and there would be a column for each stock's data like AMZN_OPEN or APPL_HIGH
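
A rough sketch of how that reshape could look with `DataFrame.pivot`, assuming the column names have been cleaned to `TICKER`, `DATE`, `TIME`, `OPEN`, `HIGH` (no angle brackets) and that each (timestamp, ticker) pair appears only once. The sample values are made up:

```python
import pandas as pd

# Small made-up sample in the same long format (one row per ticker per timestamp).
long_df = pd.DataFrame({
    "TICKER": ["AMZN.US", "AAPL.US", "AMZN.US", "AAPL.US"],
    "DATE": [20210701, 20210701, 20210701, 20210701],
    "TIME": [30000, 30000, 30500, 30500],
    "OPEN": [1000.0, 150.0, 1001.0, 144.0],
    "HIGH": [1010.0, 155.0, 1011.0, 155.0],
})

# Combine DATE and TIME into a single timestamp to use as the index.
long_df["DATETIME"] = pd.to_datetime(
    long_df["DATE"].astype(str) + long_df["TIME"].astype(str).str.zfill(6),
    format="%Y%m%d%H%M%S",
)

# Pivot so each (field, ticker) pair becomes a column.
wide = long_df.pivot(index="DATETIME", columns="TICKER", values=["OPEN", "HIGH"])

# Flatten the two-level columns into names like AMZN.US_OPEN.
wide.columns = [f"{ticker}_{field}" for field, ticker in wide.columns]
wide = wide.sort_index(axis=1)
print(wide)
```

`pivot` raises an error if a (timestamp, ticker) pair is duplicated; in that case `pivot_table` with an aggregation function is the usual alternative.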

old axle
#

i dont know if this is an appropriate question for this channel but

#

i have a dataset of jobs and salaries and the job titles are kinda weird? like some of them repeat and have roman numerals at the end of them and im kinda confused

#

like for example theres several "Accountant/Auditor"s but each has a different roman numeral at the end of it

#

and i was just wondering if anyone knew what this meant

desert oar
old axle
#

huh ok

#

thank you!

desert oar
#

HR departments at large organizations have things like that

old axle
#

weird

#

eugh jobs are so confusing

desert oar
old axle
#

oh!! thank you so much :)

tardy jungle
#

As we are on stock market

#

And also value of higj

#

High**

dusty cloud
#

Hi all. What do the reports that data scientists submit to higher management generally look like in major companies? In what format do they submit them? Are they Word docs with code attached? Or PPTs? Or something else?

ripe forge
#

Anyone non technical cannot digest codes. It's usually ppts all the way

buoyant adder
#

For the next few days I'll be explaining some data science concepts through 1 minute videos so that everyone has a basic intuition and clears their basic concepts. Here's today's video: https://youtu.be/gDqeW2t2dV0

This will will give you an intuition about what data collection is in Data Science, its necessity, requirement and the different ways to do it with a simple and easy example.
Join this telegram group is you are serious about learning data science and want to avail free organised resources that are added and updated everyday: https://t.me/analyt...

buoyant adder
#

Welcome!

hoary wigeon
#

which algo can be used to classify these data ? any idea ?

grave frost
hoary wigeon
#

yeah

grave frost
#

that kinda looks like the sequence of prime numbers or something in 3 blue 1 brown's video

serene scaffold
#

see if there's a kernel function for that.

tender hearth
velvet thorn
#

is that a new NN type

#

man I've been out of touch for so long

#

I don't even recognise the techniques that are current any more

#

😔

tender hearth
#

Quite the opposite haha

tender hearth
#

feed forward neural network

velvet thorn
#

it looks exactly like the kind of thing a projection into a higher dimensional space would solve

velvet thorn
#

😔

tender hearth
#

hahahaha

serene scaffold
#

and here I was thinking I was onto something. Also what exactly makes a NN feed forward?

tender hearth
#

I know for sure because that screenshot looks like it's from Tensorflow's neural network playground and on Google's ML Crash Course they trained an FNN on exactly that pattern

tender hearth
serene scaffold
tender hearth
#

like it doesn't loop compared to RNNs

#

i'm talking about MLPs

velvet thorn
#

for example

#

well, I guess RNNs are a great example

#

🙂

frosty sigil
abstract sinew
#

haha oh my god so many acronyms

grave frost
keen shadow
#

I have an issue with numpy, when I pass an image through np.asarray.

It seems to change the shape of the image? I have used a work around but is there any reason to why it does this?
Code

def extract_frames(stream: io.BytesIO):

    bt_array = bytearray(stream.read())
    actual = io.BytesIO(bt_array)
    
    try:
        gif = mimread(actual, memtest='1MB', pilmode="RGB")
    except ValueError:
        gif = [np.asarray(bt_array, dtype=np.uint8)]

    frames = gif[:MAX_FRAMES]

The sizes that I got were:

(671, 75) - Original Size
[(13696,)] - After passing through np.asarray
[(1, 13696)] - Before saving to memory
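
One possible explanation, assuming the stream holds an encoded image file: `np.asarray` on the raw `bytearray` does not decode anything, it just returns the file's bytes as a flat array, so its length is the file size rather than width times height. A small sketch of the difference, with `zlib` used purely as a stand-in for a real image codec:

```python
import zlib
import numpy as np

# Stand-in for a decoded 75x671 grayscale image (rows x columns).
pixels = np.zeros((75, 671), dtype=np.uint8)

# zlib is used here only as a stand-in for a real image codec (PNG, GIF, ...):
# the "file" holds compressed bytes, not one byte per pixel.
file_bytes = bytearray(zlib.compress(pixels.tobytes()))

# np.asarray on the raw bytes gives a flat array of the encoded file contents.
raw = np.asarray(file_bytes, dtype=np.uint8)
print(raw.shape)  # 1-D, length = size of the file in bytes

# Only after decoding do the bytes map back to the image shape.
decoded = np.frombuffer(zlib.decompress(bytes(file_bytes)), dtype=np.uint8)
decoded = decoded.reshape(75, 671)
print(decoded.shape)  # (75, 671)
```

If that is what is happening, the `except ValueError` branch would need an actual image decoder rather than `np.asarray`.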
lapis sequoia
#

can anyone tell me list of AI modules used ?

#

nm

uncut barn
#

For the same img saved in different versions i.e. jpg, png, tiff, etc. would all the values in the array (RGB values) be the same?

keen shadow
#

its just a simple black and white im

desert bear
#

Hey, I'm doing some research on algorithms that help detect outliers (anomalies) in dataset. Does anybody have advice or experiences on how to automate it?

true estuary
#

What do you want to automate ? Outlier detection ?

keen shadow
#

probably to identify data that doesnt suit the pattern

desert bear
#

I've been given a task to research some options, that could help detect anomalies once fed with data from specific timeframe once in a while.

true estuary
#

Well there are a bunch of packages and approaches. How comfortable are you with the math ?

#

Do you want to use a ready made software tool that comes with ready made models ? Or code a model ?

desert bear
#

I prefer to code the model myself. I know a library pycaret that seems to do all the code work itself, and I would like to avoid that

dusky dome
#

Hi can someone help me with this

true estuary
#

All right, then the approach that example takes is the level of code you're looking for

dusky dome
desert bear
true estuary
#

I'd start with that one, just to get a feel. Or look for an even simpler model, density estimation, to have a real simple baseline
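
As an illustration of what such a density-estimation baseline could look like, here is a minimal sketch that fits a single Gaussian and flags the lowest-density points. The data, the single-Gaussian assumption and the 99% threshold are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 500 "normal" 2-D points plus 5 obvious outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
X = np.vstack([normal, outliers])

# Fit a single Gaussian: mean vector and covariance matrix.
mean = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
cov_inv = np.linalg.inv(cov)

# Squared Mahalanobis distance; larger means lower density under the Gaussian.
diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Flag the points with the largest distances (the threshold is a free choice).
threshold = np.quantile(d2, 0.99)
flagged = np.where(d2 > threshold)[0]
print(flagged)
```

Models like LOF or Isolation Forest can then be compared against this baseline on the same data.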

true estuary
desert bear
#

So far i've researched models like LOF, Isolation Forest, knn, SVM

true estuary
#

Transactions, sounds like fraud detection ?

desert bear
true estuary
#

Right, well fraud detection is one of the most used applications of anomaly detection, you should be able to find tons of resources

#

I personally don't know much about it, like what models/tools are the right ones to try

#

Hope someone will recommend some resources to get started

true estuary
desert bear
#

Most of the papers that I encountered seem to research the models' implementation and their benchmarks.

true estuary
desert bear
true estuary
#

Well sounds like if you're reading papers

zenith slate
#

hi guys is writing custom loss functions a must know and working with tensors a must-know for AI practitioners?

#

I would appreciate if I could have an eye on helping me debug a weighted-loss multiclass custom loss function I'm attempting

uncut barn
#

Is it better to save the image (given that we have thousands of images to save) as a numpy array, or as an image file i.e. as a jpg, png or tiff file etc? (this would be regarding the runtime, disk space (storage)) As after this I would be feeding these images into a CNN

zenith slate
#

@uncut barn definitely image files because you can make use of compression

serene scaffold
#

Loss functions are important for neural network-based AI. And then I think "tensor" means something different in Python's data science ecosystem than it does in pure mathematics.

shell berry
#

Anyone know of any cool datasets that aren't something boring like simple image recognition or house prices/stuff like that?

#

and non-NLP 😛 maybe something scientific

serene scaffold
ripe forge
#

they said non boring! 😛

shell berry
#

thanks @ripe forge

raven cloud
#
img = cv.imread(cv.samples.findFile(img_name))

Im having trouble finding out what samples. && findFile.
Its with respect to import argparse
I can assure u its not from opencv, any advice where to search in the code to decipher that. Or is this information provided by me way to sPARSE (pun intended) ?
I dont like argsparse

wide meadow
#

I'm just a beginner in python and ML, can someone please tell what apply(lambda x: x) does in df['y'] = df['y'].apply(lambda x: x)
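
For reference, a tiny sketch of what that line does, using a made-up column: `apply` calls the function on every element, and `lambda x: x` returns its input unchanged, so the column ends up with the same values.

```python
import pandas as pd

df = pd.DataFrame({"y": [1, 2, 3]})

# apply calls the function once per element and collects the results.
# lambda x: x returns its input unchanged, so the column stays the same.
same = df["y"].apply(lambda x: x)
print(same.equals(df["y"]))  # True

# A lambda that actually transforms each value, for comparison.
doubled = df["y"].apply(lambda x: x * 2)
print(doubled.tolist())  # [2, 4, 6]
```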

ripe forge
true elk
#

Hey guys! I'm a python enthusiast and my uncle runs a car upholstery business. I want to help him to come up with new designs for car seat covers using pictures of previous works.
Is it doable? I have only a few notions of ML

#

I can also get a set of 100 vectorial designs with the same angle

#

I don't need complex material imitation and other fancy stuffy, I only need "lines". Basically black and white

#

I've read some stuff about GAN but it seems overpowered for what I need

ripe forge
raven cloud
#

my bad, I have just never seen those methods before I guess it is opencv.
Lack of sleep can get to u

ripe forge
#

you're fine! (though yes, sleep is definitely worth it regardless)

raven cloud
#

it was worth being wrong just to add the pun though

raven cloud
#

ill see if print(args...) works

ripe forge
raven cloud
#

👋 ,
args.img
comes from args = parser.parse_args() , I have no clue what parser.parse_args() is supposed to be.

#

And how to do get the background like what you did for "args.img"

#

perhaps my question is still to sParse?

#

argparse is making an arse out of me I tell you what

#
    args = parser.parse_args()
    print(args)
    # read input images
    imgs = []
    for img_name in args.img:
        img = cv.imread(cv.samples.findFile(img_name))

I just need to know where args.img is searching

ripe forge
#

oh. so, that looks like command line args. (and for displaying code you can use backticks. single ` for inline)

#

so args is just the argparse object. it just "holds" all the arguments. argparse itself is a way to hold parameters when the file is run via command line with extra arguments. so, say `python thisfile.py arg1 arg2 arg3` for example is one way to run python files with extra arguments

#

that's probably what info you're missing. this code wont show you what img is unless you see what command it was run with.

#

so img is just whatever is specified in the parser arguments in the code somewhere. its actual value is decided at runtime based on the command the file is run with.

#

for what it's worth, it's easy to assume it's basically like a list of strings, file names.

raven cloud
#

parser.add_argument('img', nargs='+', help = 'input images')

#

😵‍💫

#

im really not a cli guy

#

not yet anyways

ripe forge
#

yep, so if you wanted to see how exactly that works, you'd want to read up on parser. but if you want a tldr, img is a list that's being given values during the cli run of this file
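
To make that concrete, a small sketch using the same `add_argument` call. Passing a list to `parse_args` simulates what would otherwise be typed on the command line; the file names are made up:

```python
import argparse

parser = argparse.ArgumentParser(description="demo of a positional list argument")
# nargs='+' collects one or more positional values into a list.
parser.add_argument("img", nargs="+", help="input images")

# Passing a list to parse_args simulates running, for example:
#   python stitching.py left.jpg right.jpg
# (the file names are made up for illustration)
args = parser.parse_args(["left.jpg", "right.jpg"])
print(args.img)  # ['left.jpg', 'right.jpg']
```

Called with no positional values at all, the parser exits with "the following arguments are required: img".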

raven cloud
#

img is a list that's being given values during the cli run of this file
exactly what im trying to look for 😅

ripe forge
#

the code of the file won't have it.

#

it would depend exactly on what command args were written when this file is being run

raven cloud
#

better not though

#

I think the entire code would need rewriting then

ripe forge
#

you can do whatever with your own codes. all i'll say is, you can't look at the code to see what input the user passed basically

raven cloud
#

to a certain extent only

ripe forge
#

it's like reading a code that says input("blah") and trying to figure out what the user wrote.

raven cloud
ripe forge
#

cli args are basically like the cli version of input

raven cloud
#

Thanks very much 🙂

#

Cant help not looking at the code, guess I gotta read the docs lol

#

hold on u said its like a input

ripe forge
#

in terms of the general concept, input is a separate thing ofcourse.

raven cloud
#

yeah

ripe forge
#

i only meant both are similar in the sense that the code won't know what is passed to it

raven cloud
#

Im trying to run on cmd

#

stitching.py: error: the following arguments are required: img same error but I think I may be close now unless u know ?

#

wait

#

its working 🤦‍♂️ 🤦‍♂️ 🤦‍♂️ Instructions were sort of there, albeit not that clear

#

Thank you Darr!

#

you mentioning `input` resulted in an output 😄

ripe forge
#

np! 🙂

tough bolt
#

Hey there!

I have this bunch of code right here:
(python!)

dissimilarity_matrix = []
for adj_t in adj_matrices:
    dissimilarity_row = []
    for adj_tdt in adj_matrices:
        # sum of squared element-wise differences between the two matrices
        dissimilarity_measure = ((adj_t - adj_tdt)**2).sum()
        dissimilarity_row.append(dissimilarity_measure)
    dissimilarity_matrix.append(dissimilarity_row)

And as the running time grows quadratically with the size of the dataset, it becomes really slow really fast.
Right now this is all default python, no numpy or anything

#

adj_matrices is a list

#

Up to 90k+ in length

#

adj_t is a matrix (in form of 2 dimensional python list)

#

looks like this

#

The code is calculating the squared Euclidean distance between each of the 90k+ points.

#

Any idea how and if this could be optimised with numpy.

And if so - with which numpy functions

tidal bough
#

How did using scipy.pdist go?

tough bolt
#

Hmm, I ditched that at some point, I'm not sure why.

#

not sure if pdist would work all that well for me as I need to add another factor later on to the dissimilarity_measure var

tidal bough
#

this seems to work fine for me, say:

import numpy as np
import scipy.spatial

#random data in that format (1000 10x10 matrices of bools)
arrs = np.random.randint(0,2,(1000,10,10), dtype=np.uint8)

#use pdist, after flattening each matrix:
dists = scipy.spatial.distance.pdist(arrs.reshape(arrs.shape[0],-1))
# this is a "compressed" distance array
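
A possible follow-up sketch, assuming the squared distance from the original loop is wanted: `pdist` defaults to the plain Euclidean metric, so `metric="sqeuclidean"` is passed here, and `squareform` expands the condensed result into the full n x n matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Same kind of random data: 1000 matrices of shape 10x10 with 0/1 entries.
arrs = np.random.randint(0, 2, (1000, 10, 10), dtype=np.uint8)
flat = arrs.reshape(arrs.shape[0], -1)

# metric="sqeuclidean" gives squared Euclidean distances; the default is "euclidean".
condensed = pdist(flat, metric="sqeuclidean")  # length n*(n-1)/2

# squareform expands the condensed vector into the full symmetric n x n matrix.
full = squareform(condensed)
print(full.shape)  # (1000, 1000)
```

The condensed form stores each pair only once, so it needs about half the memory of the full matrix.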
tidal bough
#

you can instead straightforwardly vectorize your loop, then, like, uhh...

tough bolt
tough bolt
#

I mean, if it really can't be optimized, so be it. The dataset shouldn't exceed 100k entries. But still a shame - I'm also scared it will overload the ram

#

but that's probably another story

tidal bough
# tough bolt hmm
def pairwise_dists(arrs):
    # start by flattening the matrices, since we are taking diffs only anyway
    arrs = arrs.reshape(arrs.shape[0],-1) # shape: n_points, N*M
    # We need to subtract each row from all others.
    # That can be done by using some clever broadcasting
    arrs1 = arrs.reshape(arrs.shape[0], 1, -1) # "rotate" it so that the rows are pointing depthwise
    arrs2 = arrs.reshape(1,arrs.shape[0], -1)
    diffs = arrs1 - arrs2 # shape: n_points, n_points, N*M
    distances = (diffs**2).sum(axis=2) # sum depthwise, so over each original row
    return distances # n_points, n_points
# short version that might be a bit faster, not sure:
def pairwise_dists(arrs):
    n = arrs.shape[0]
    return ((arrs.reshape(n,1,-1) - arrs.reshape(1,n,-1))**2).sum(axis=2)
#

like this, I think that should work

#

yeah, seems to not crash and to produce a sane-looking result at least.

tidal bough
tough bolt
#

Oh wow, that's great. I'll try this - thank you!

#

Hey - this works wonders!

It now takes 1.5 seconds instead of 24 seconds to calculate for 2k entries ... amazing

tidal bough
# tough bolt Hey - this works wonders! It now takes 1.5 seconds instead of 24 seconds to ca...

Here's another solution (I've tested that it produces the same results as pairwise_dists above):

from numba import njit
@njit
def pairwise_numba(arrs):
    n = arrs.shape[0]
    results = np.empty(shape=(n,n), dtype = np.uint32)
    for i in range(n):
        for j in range(i,n):
            dist = ((arrs[i]-arrs[j])**2).sum()
            results[i,j] = dist
            results[j,i] = dist
    return results

This is the "same thing you are doing but in numba" way

#

It's 2x faster:

>>> %timeit pairwise_dists(arrs)
354 ms ± 6.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit pairwise_numba(arrs)
150 ms ± 7.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

which is likely because it avoids doing half the work by using the symmetry of your distance function (for j in range(i,n)).

iron basalt
#

There is also:

#
def pairwise_dists(x, y):
    """ Computing pairwise distances using memory-efficient
    vectorization.

    Parameters
    ----------
    x : numpy.ndarray, shape=(M, D)
    y : numpy.ndarray, shape=(N, D)

    Returns
    -------
    numpy.ndarray, shape=(M, N)
        The Euclidean distance between each pair of
        rows between `x` and `y`."""
    sqr_dists = -2 * np.matmul(x, y.T)
    sqr_dists +=  np.sum(x**2, axis=1)[:, np.newaxis]
    sqr_dists += np.sum(y**2, axis=1)
    return  np.sqrt(np.clip(sqr_dists, a_min=0, a_max=None))
#

minus the sqrt

#

Using this:

abstract sinew
#

may i ask what the above functions are for?

tidal bough
iron basalt
#

The clip is for numeric precision issues.

tough bolt
#

hmm, another problem I'm running into right now is this

numpy.core._exceptions.MemoryError: Unable to allocate 180. GiB for an array with shape (22001, 22001, 100) and data type int32

#

180 gb

tidal bough
tidal bough
#

yeah, that's another problem of my vectorized solution

iron basalt
#

Without the clipping you may get negative values that are very close to being zero.

#

So just clip them to 0
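
A quick sketch to check that rewrite on made-up data. It relies on the expansion ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, compares it against the explicit broadcasting version, and shows where the clip comes in:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 8))
y = rng.normal(size=(40, 8))

# Expansion: ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * x.y
sqr = -2 * x @ y.T
sqr += np.sum(x**2, axis=1)[:, np.newaxis]
sqr += np.sum(y**2, axis=1)

# Reference: explicit broadcasting (builds a 50 x 40 x 8 intermediate).
ref = ((x[:, np.newaxis, :] - y[np.newaxis, :, :]) ** 2).sum(axis=2)
print(np.allclose(sqr, ref))  # True

# Distance of the points to themselves: exactly 0 in theory, but the
# expansion can leave tiny rounding residue of either sign, hence the clip.
self_sq = -2 * x @ x.T + np.sum(x**2, axis=1)[:, None] + np.sum(x**2, axis=1)
print(np.diag(self_sq).min())  # may be slightly below 0
clipped = np.clip(self_sq, a_min=0, a_max=None)
```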

tidal bough
iron basalt
#

Yea numba will be the best, minus the JIT time

tidal bough
tough bolt
iron basalt
#

You can do the solution I gave in numba too

tough bolt
#

As the endresult always will be an array that big - no?

tidal bough
tidal bough
tough bolt
#

oh, I see

iron basalt
#

The large memory issue comes from a very large intermediate value.

#

(Which the expression re-write avoids)
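
The numbers from the error message can be reproduced with some quick arithmetic, taking the shape (22001, 22001, 100) and the 4-byte int32 dtype straight from the traceback:

```python
# Back-of-the-envelope memory estimate for the broadcasted difference array
# from the error message: shape (22001, 22001, 100), int32 (4 bytes each).
n, d, itemsize = 22001, 100, 4

intermediate_bytes = n * n * d * itemsize
print(intermediate_bytes / 2**30)  # about 180 GiB, matching the error

# The final n x n result is d times smaller.
result_bytes = n * n * itemsize
print(result_bytes / 2**30)  # about 1.8 GiB
```

So the result itself would still fit in memory on many machines; it is only the n x n x d intermediate that does not.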

tidal bough
#

Basically, I'd suggest using the numba solution, not just because it's currently the fastest one, but also because it's the only one you'll easily be able to extend when you change your distance function. With vectorized ones, you'll have to work hard to figure out how to vectorize the new more complicated function (if it starts depending on other values than just the two, I mean), and it might even be impossible at all depending on the distance function

(well, I guess it would only be truly impossible if the dissimilarity won't actually be a distance function, since good distance functions should, at the very least, only depend on the two points)

tough bolt
#

yeah ... I think I'm gonna need a minute to process all that 😅

livid kiln
#

When I read a multi index table, how do I specify the dtype of the header?

import io
import pandas as pd

df = pd.read_csv(io.StringIO('''combo,0,0,1,1,2,2
stock,weights,tickers,weights,tickers,weights,tickers
0,0.5,ADBE,0.7,AAPL,0.7,INTC
1,0.25,MSFT,0.1,MSFT,0.15,AMD
2,0.25,NVDA,0.1,NVDA,0.15,SKYT
3,,,0.1,GOOG,,

'''), header=[0, 1], index_col=0)

df.columns.get_level_values(0)
Index(['0', '0', '1', '1', '2', '2'], dtype='object', name='combo')

Here I would like to specify the dtype is int (not object)

tidal bough
livid kiln
# tidal bough I indeed don't see any mentions of being able to specify the header dtype for `r...
#

It doesn't seem elegant tho 😅

tidal bough
#

I believe it'll at least be efficient (redtyping the header affects only the header column and doesn't copy the entire DF, hopefully), but yeah, annoying

livid kiln
#

I tried using this:

#
df.columns = df.columns.set_levels(
      df.columns.get_level_values(0).astype(int), level=0
)
#

Gives me this error:

#
ValueError: Level values must be unique: [0, 0, 1, 1, 2, 2] on level 0
tidal bough
#

that solution was for a multiindex, you probably need, uhhh...

#

actually, it's pretty fair

#

you indeed have duplicate values in your first row.

livid kiln
#

But that is intentional

tidal bough
#

Hmm, what should duplicate index values mean?

livid kiln
#

Same group

#

e.g. I can already use this

df['0']
#

But would like to use this:

df[0]
mossy owl
livid kiln
misty flint
#

i wish we didnt have to learn feature eng in matlab

#

im pretty sure you can do the same things with python + opencv

#

like 90% sure

#

theres a lot of the same functions

livid kiln
# tidal bough Hmm, what should duplicate index values mean?

After reading this: https://github.com/pandas-dev/pandas/issues/28294
I found that they interpret the new indexes as a mapping for unique values

df.columns = df.columns.set_levels(set(df.columns.get_level_values(0).astype(int)), level=0)

EDIT:
Actually a better way is:

df.columns = df.columns.set_levels(df.columns.levels[0].astype(int), level=0)

REF: https://stackoverflow.com/questions/49199530/pandas-multi-index-column-header-change-type-when-reading-a-csv-file/49199682#comment85952840_49199682
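
Put together with the CSV from the original question, the whole thing can be sketched like this (same data, with the `levels[0]` variant of the fix):

```python
import io
import pandas as pd

csv_text = """combo,0,0,1,1,2,2
stock,weights,tickers,weights,tickers,weights,tickers
0,0.5,ADBE,0.7,AAPL,0.7,INTC
1,0.25,MSFT,0.1,MSFT,0.15,AMD
2,0.25,NVDA,0.1,NVDA,0.15,SKYT
3,,,0.1,GOOG,,
"""
df = pd.read_csv(io.StringIO(csv_text), header=[0, 1], index_col=0)

# Header labels are read as strings, so the top level is '0', '1', '2'.
# levels[0] holds only the unique labels, which is what set_levels expects.
df.columns = df.columns.set_levels(df.columns.levels[0].astype(int), level=0)

print(df.columns.get_level_values(0).tolist())  # [0, 0, 1, 1, 2, 2]
print(df[0])  # the 'weights' and 'tickers' columns of combo 0
```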


glad mulch
#

anyone know why my residual vs fitted vals are bounded near the bottom

plush pier
#

What's a good set of modules to use for financial modeling? I want to run some simulations but there must be tools out there for modeling stochastic trend series and financial instruments.

#

I could reinvent the wheel but it would be pretty crude I think.

desert oar
thorn bobcat
#

can someone explain a concept to do with gradients for me real quick?

velvet thorn
glad mulch
thorn bobcat
glad mulch
#

and cant be greater than 100

thorn bobcat
#

it was about gradient descent but I got it figured out now.