#data-science-and-ml

olive prairie
#

I know how I could write javascript to fix this (draw a circle, then draw a number, through the set), but I'm using a Bokeh widget to filter the data that's working pretty nicely...

#

I've been reading through their code and it doesn't look like there's anything to do this, but just started looking into using their "Scatter" figure, and creating a second dataframe; one for the circles, one for the numbers, and interleaving the two together. But I'm also open to other libraries if there's one that does this better

lapis sequoia
#

looks like peanuts

#

your goal is to visualize and you say you're open to things other than bokeh.. but you have to state what you intend to visualize better

olive prairie
#

Haha - I've added borders to those circles since then

lapis sequoia
#

I see different colors.. so I'm guessing those are some categories?

#

and it's a scatter plot

olive prairie
#

Yeah, those are different categories, and the numbers I'm putting on top of them are the category numbers; because there are so many different ones, I don't want to have to refer to a long color legend

#

I have a hover tooltip which could include the categories, but I'd like people to be able to get a sense of how close or far away the different categories are from one another just by looking at it

lapis sequoia
#

what's the number in the circles

#

do you need it

olive prairie
#

The number is the category. This is a TSNE visualization of an LDA topic model - the numbers and colors are the topics

lapis sequoia
#

you can't use colors and numbers to represent the same thing

olive prairie
#

I think it could look good with the numbers

lapis sequoia
#

use plotly

#

and plot graphs side by side.. one for each category

#

and label the graph with the number

olive prairie
#

I get the idea of having different graphs for each category, but there are 18 categories

#

And I'm trying to visualize their relationship/distance to one another

lapis sequoia
#

distance together or distance one by one

#

just reduce the size

#

and add a filter

somber hamlet
#

just remove the numbers?

lapis sequoia
#

yeah I told him that..

plain turret
#

Since it has colors, those numbers can be in the legend, yeah?

lapis sequoia
#

that's what I said

plain turret
#

I dunno if they represent something but a color gradient might be cool to use too if they make sense to use

#

If 1 is red and 14 is blue, you just need to put the gradient next to the graph. But this only works if the numbers are a measure of the same thing

lapis sequoia
#

they're different categories

plain turret
#

Ah yeah nono then

chilly geyser
#

@lapis sequoia Pre-training the whole thing is too expensive obviously though

lapis sequoia
#

you keep missing the point here

#
  1. your problem/application
  2. representation that suits your application
  3. metric that fits your application best
#
  1. is where the type of model comes in.. representing your word vectors in a suitable space..
#

there's lighter frameworks that are language specific.. including ones from BERT that'll help you do that

supple ferry
#

Hey there! Does anyone know the reason for this behavior? I have this dataframe (just several rows of it for reproduction purposes):

id    dpt    price    minutes
9710556    0    180.82    140
9710556    0    180.82    140
9710556    0    202.32    145
9710556    1    218.32    145
9710556    1    250.82    140

I am trying to count, for each (price, minutes) combo, how many other combos it is strictly less than in both values. And I try to find it for all of them. My data will be grouped by id and dpt after all.
This is the function I came up with, which gives me the correct output if I forget about grouping by dpt for now:

def ranker(df):
  values = df[["price", "minutes"]].values
  result = values[:, None] < values
  return np.logical_and.reduce(result, axis = 2).sum(axis = 1)

And if I apply it to my data now, I get this:

small.groupby("id").apply(ranker)

Out[144]: 
id
9710556    [2, 2, 0, 0, 0]
dtype: object

Which means that the first (price, minutes) combination is strictly less (in both values) than 2 other rows within this dataset, and so on.
When I try to assign it back to the dataframe, I get NaNs everywhere:

small["a"] = small.groupby("id").apply(ranker)

small.a
Out[147]: 
102    NaN
103    NaN
104    NaN
105    NaN
106    NaN
Name: a, dtype: object

How can I solve this? My overall goal is to run this function grouping by id and dpt in the end
EDIT: code
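
The NaNs are consistent with index alignment: the groupby result is indexed by id (one array per group), while small is indexed by the row labels 102 to 106, so nothing lines up on assignment. A minimal sketch of one way around it, assuming the sample rows and row labels shown above; rank_within is a hypothetical helper, and returning a Series that carries each group's own index is just one possible choice:

```python
import numpy as np
import pandas as pd

# Sample rows from the message above; the index labels 102..106 are taken
# from the printed output and are otherwise arbitrary.
small = pd.DataFrame(
    {
        "id": [9710556] * 5,
        "dpt": [0, 0, 0, 1, 1],
        "price": [180.82, 180.82, 202.32, 218.32, 250.82],
        "minutes": [140, 140, 145, 145, 140],
    },
    index=[102, 103, 104, 105, 106],
)


def ranker(df):
    values = df[["price", "minutes"]].to_numpy()
    # result[i, j, k] is True when row i is smaller than row j in column k
    result = values[:, None] < values
    counts = np.logical_and.reduce(result, axis=2).sum(axis=1)
    # Keep the group's own row labels so the result can align with `small`
    return pd.Series(counts, index=df.index)


def rank_within(frame, keys):
    # Hypothetical helper: run ranker per group and stitch the pieces together
    pieces = [ranker(group) for _, group in frame.groupby(keys)]
    return pd.concat(pieces)


small["a"] = rank_within(small, ["id"])          # grouped by id only
small["b"] = rank_within(small, ["id", "dpt"])   # grouped by id and dpt
print(small)
```

Because the concatenated Series shares the row labels of small, the assignment aligns row by row instead of producing NaN.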

lapis sequoia
#

what's small

supple ferry
#

the name of the dataframe i gave it

#

as far as i know, groupby applies the function to every group separately, and each group is itself a dataframe

chilly geyser
#

@lapis sequoia I don't see your point

lapis sequoia
#

exactly

barren bluff
#

Hey, I'm just starting off with CNNs, working with the basic Fashion-MNIST dataset using tensorflow and keras. I am a bit stuck with two things, hoping someone can help me out! If the data is 2D, do I have to flatten it to a 1D array? Also, how does the layering work exactly and how do you set it up?

acoustic mural
#

tf.keras has a flatten layer

#

but you could also get there through pooling

chilly geyser
#

@lapis sequoia No, I mean that you are talking about different parts of solving the problem. I don't see your point in stating it. I was never interested in solving the problem.

  1. I'm certainly not going to pretrain because I don't have that kind of data, nor is it my main project to do so.
  2. My problem is primarily input: text, output: classification. It's as simple as that and details like sentiment, text type, etc. generally don't matter except that they are in English sentences.
  3. Metric - I don't even see your point with this, most business applications would just put a dollar sign to everything that they can or care about. In either case, any and all categorical losses would be relevant to me, and I'm not particularly using or focussing on any

My problem (was) is very simple, the tokenizer from TF2Hub's Albert doesn't seem to produce expected things, and the scripts provided seem tricky (with stuff like FLAGS and TF2.0 migration in the way). That is about it.

#

@acoustic mural Pooling is probably better, or at least, in the standard help I see online

acoustic mural
#

it depends on what you're doing with it, sometimes pooling all the way down to 1D loses too much information
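
To make the trade-off concrete, here is a small NumPy sketch of what the two options do to array shapes. It is not Keras code (tf.keras.layers.Flatten and tf.keras.layers.GlobalAveragePooling2D are the corresponding layers), and the 7x7x64 feature-map size is an assumption chosen only for illustration:

```python
import numpy as np

# Hypothetical output of a conv layer: batch of 2 images, 7x7 spatial grid, 64 channels
feature_maps = np.random.default_rng(0).normal(size=(2, 7, 7, 64))

# Flatten: keep every value, one long vector per image
flattened = feature_maps.reshape(feature_maps.shape[0], -1)

# Global average pooling: one number per channel, the spatial layout is discarded
pooled = feature_maps.mean(axis=(1, 2))

print(flattened.shape)  # (2, 3136)
print(pooled.shape)     # (2, 64)
```

Flattening keeps all 3136 values per image for the dense layers, while pooling keeps only 64, which is where the information loss mentioned above comes from.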

silent swan
#

lol just use roberta+pytorch

#

albert is new enough that I think you'll get less help around it

#

or roberta+tf if you really want to use tf

chilly geyser
#

Yeah I noticed the different levels of how involved the coder is for each different package. Anyway I think I accomplished my goal with benchmarks, it does seem that roBERTa works quite out-of-the-box and it's not really quite clear what hyperparameters are really good/important to change for any given problem

silent swan
#

learning rate X num_epochs from my experience

#

how big is your dataset

lapis sequoia
#

In Python multithreading, if you run 2 threads on one class, then the variables within that class won't be changed by the other thread? Like duck = 0 in thread 1, and in thread 2 duck gets changed to duck = 1; now in thread 1 duck = 0 still, right?

#

Also, does the same apply for calling another function within that same function with 2 threads?
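
For reference, threads in one Python process share the same objects, so an attribute changed by one thread is visible to the other; only local variables inside a function call are private to that call. A minimal sketch, where the Pond class and the duck attribute are made up to mirror the question:

```python
import threading


class Pond:
    def __init__(self):
        self.duck = 0


pond = Pond()
changed = threading.Event()
seen_by_reader = []


def writer():
    pond.duck = 1      # "thread 2" changes the attribute
    changed.set()      # signal that the change has happened


def reader():
    changed.wait()     # "thread 1" waits for the signal
    seen_by_reader.append(pond.duck)


threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(seen_by_reader)  # [1]
```

The Event is only there to make the ordering deterministic; without some form of synchronization, which value the reader sees depends on timing.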

distant inlet
twin hinge
#

What Python version is that? What version was the as keyword added in?

distant inlet
#

Python 3

#

Its working now

twin hinge
#

@distant inlet: :)

distant inlet
lapis sequoia
#

this is pretty cool

#

covers basics for DS

quartz monolith
#

Is there a lib to extract from a photo (Document) to text, tables and photo? Just saw photo to text with cv2 and pil

quartz stream
olive willow
#

hey guys, do you have any good resources where you can find messy datasets to train data cleaning skills?

compact bluff
#

i have a matrix in tensorflow and I want to create a heatmap of it using seaborn and log it using tf.summary.image. pretty much, I need to get pixel data from a seaborn plot. does anyone know how to do this?

#

I've only found tf.image.decode_png but I think it would be more efficient and accurate to directly get the pixel array from the seaborn plot itself
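
One way to get the pixels without a PNG round trip is to render the Matplotlib figure off-screen and read its canvas buffer. A minimal sketch, assuming the Agg backend; figure_to_rgb_array is a hypothetical helper name, and plain imshow stands in for the seaborn call, which draws onto the same kind of Axes:

```python
import matplotlib

matplotlib.use("Agg")  # off-screen backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np


def figure_to_rgb_array(fig):
    """Render a Matplotlib figure and return its pixels as (height, width, 3) uint8."""
    fig.canvas.draw()
    rgba = np.asarray(fig.canvas.buffer_rgba())
    return rgba[..., :3].copy()


matrix = np.arange(12, dtype=float).reshape(3, 4)  # stand-in for the TensorFlow matrix
fig, ax = plt.subplots(figsize=(4, 3), dpi=100)
ax.imshow(matrix, cmap="viridis")  # seaborn.heatmap(matrix, ax=ax) would draw on the same Axes
pixels = figure_to_rgb_array(fig)
plt.close(fig)

# tf.summary.image expects a leading batch dimension: pixels[np.newaxis, ...]
print(pixels.shape, pixels.dtype)
```

The array size follows from figsize times dpi, so the logged image resolution can be controlled there.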

native stag
#

love this resource

#

amazin

lapis sequoia
#

if you want the summary, as in the shape.. why dont you use np @compact bluff

wraith basin
#

@olive willow http://www.kdnuggets.com/datasets/index.html

Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/

Google Public Data Directory :http://www.google.com/publicdata/directory

Natural Earth Data : http://www.naturalearthdata.com/downloads/

Geocomm : http://data.geocomm.com/drg/index.html

Geonames data: http://www.geonames.org/

US GIS Data: Available from http://libremap.org/

olive willow
#

@wraith basin thanks!!!

pale thunder
#

Is there a way to find the focal points of an ellipse based on its contour? numpy+mpl

maiden void
#

hope you dont mind me doubleposting, just figured this was a better place to ask: im looking to change the structure of a dataset to this:

Country Year Debt Unemployment GDP
Afghanistan 1986 13 7 3456
Afghanistan 1987 12 8 3487
Afghanistan 1988 13 4 2356

#

anyone know how i could go about changing that?

olive willow
#

it's a method that changes columns into individual rows

#

you would have to look around how to get it into your desired format tho, I've no clue
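
The method referred to above is not preserved in the log; assuming it was pandas melt, a sketch of getting to the Country / Year / Debt / Unemployment / GDP layout could look like this. The wide input layout and the column names Indicator and Value are assumptions, since the original dataset is not shown:

```python
import pandas as pd

# Hypothetical wide layout: one row per country and indicator, one column per year
wide = pd.DataFrame(
    {
        "Country": ["Afghanistan"] * 3,
        "Indicator": ["Debt", "Unemployment", "GDP"],
        "1986": [13, 7, 3456],
        "1987": [12, 8, 3487],
        "1988": [13, 4, 2356],
    }
)

# Step 1: the year columns become rows
long = wide.melt(id_vars=["Country", "Indicator"], var_name="Year", value_name="Value")

# Step 2: the indicator names become columns
tidy = (
    long.pivot(index=["Country", "Year"], columns="Indicator", values="Value")
    .reset_index()
    .rename_axis(columns=None)
)
tidy = tidy[["Country", "Year", "Debt", "Unemployment", "GDP"]]
print(tidy)
```

Doing it in two steps keeps each intermediate table easy to inspect, which helps when the real file has extra identifier columns.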

lapis sequoia
#

Can anybody recommend me a website that mainly host contest for ml frequently.

split temple
errant venture
#

Hey I'm trying to train a weather data set to tell me if it will rain tomorrow, but the column is "yes/no" and not binary like it's asking for

#

Anyone know a way around this? I can provide pictures if anyone is interested thanks!

true fiber
#

Can you use pandas?

errant venture
#

df['RainTomorrow'] = df['RainTomorrow'].map({'Yes': 1, 'No': 0}) Worked for me, now I have to fix the other errors 😦

#

I know this means very little without knowing other data, but is anyone able to translate what these errors are saying?

maiden void
#

thanks @olive willow it did work to some degree, but unfortunately not to do what i wanted

#

going borderline mental here after spending the entire day on this single dataset

olive willow
#

Hahaha, maybe you can get only the columns with the years, transpose them and then add them to the others?

#

@maiden void

maiden void
#

so all the Japan columns are actually variables

#

that go on for like 100 variables and then they continue for the next country

#

so what i should do is move the next country under this one

#

unfortunately, i dont know how to do that effectively

#

(could always copy paste, but that will take forever and also its better to do it in python or R so i can recreate it)

true fiber
#

@errant venture this error is very simple - "could not convert string to float" - the date string cannot be converted to a number. In fact, why should the probability that it rains tomorrow depend on the date today? It is possible that rain depends on pressure or temperature for several days earlier. If this is the case, then the table with the dates must be transformed so as to enter data for 1,2,3, ... the last few days.
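
The idea of entering data for the last few days can be sketched with shift, which copies a column down by a number of rows. The column names and values here are made up, and the rows are assumed to be sorted by date for a single location:

```python
import pandas as pd

# Hypothetical daily readings for one location, already sorted by date
weather = pd.DataFrame(
    {
        "Pressure": [1012.0, 1010.0, 1007.0, 1003.0, 1001.0],
        "RainTomorrow": [0, 0, 1, 1, 0],
    }
)

# Add the pressure from 1, 2 and 3 days earlier as extra feature columns
for lag in (1, 2, 3):
    weather[f"Pressure_lag{lag}"] = weather["Pressure"].shift(lag)

# The first rows have no history, so they are dropped before training
features = weather.dropna().reset_index(drop=True)
print(features)
```

With several locations in one table, the shift would need to be done per location (for example through a groupby on a location column) so that one city's history does not leak into another's.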

errant venture
#

@true fiber that makes sense, so if I were to remove the date column that would likely fix it?

true fiber
#

@errant venture Yes, the error will disappear, but this does not mean that the prediction will make sense, can it still take into account the pressure over the past few days?

errant venture
#

@true fiber yeah it has pressure/Wind etc

#

And location, will I have to remove all data sets that contain strings?

#

It throws an error for Wind direction and location since they aren't floats, does this mean I can't use them to train a data set

true fiber
#

All strings need to be converted to numbers, but if there are several values, then they need to be binarized. If you simply delete the strings, important information is lost, but you can check how something works.

errant venture
#

So say wind direction, that has say N W E S values, would converting them to 1, 2, 3, 4 be smart?

#

Thank you so much for answering by the way, you've already helped me understand this a lot
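
On the 1, 2, 3, 4 idea: integer codes imply an order and a distance between directions that does not exist, which is why binarizing (one-hot encoding) was suggested. A minimal sketch with pandas get_dummies; the column names WindDir and Pressure are assumptions:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "WindDir": ["N", "W", "E", "S", "N"],
        "Pressure": [1012.0, 1010.0, 1007.0, 1003.0, 1001.0],
    }
)

# One 0/1 column per direction instead of a single 1..4 code
encoded = pd.get_dummies(df, columns=["WindDir"], dtype=int)
print(encoded)
```

Each row has exactly one of the WindDir_* columns set to 1, so the model sees the directions as unrelated categories.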

true fiber
errant venture
#

@true fiber Ah thanks for that, will look into it now!

true fiber
#

But it is more interesting what to do with the cities. First you need to try to remove the city and look at the accuracy of the forecast. Then it may be necessary to take into account the existence of rainy cities and consider them separately. But probably for a simple task, they can be deleted.

errant venture
#

@true fiber Yeah I'll try and use them, but for now I just want to train the rest of the data set just for now

#

I'm just working past an "Input contains NaN, infinity or a value too large for dtype('float64')." error

true fiber
#

Yes, you must clean the table by deleting or filling all missing data, for example with one of:
df.fillna(0), df.dropna(axis=1, thresh=3), or df.fillna(df.mean())

warped tangle
#

pandas people, whats the difference between df[0] and df[df.columns[0]]

#

i normally do the first one to get the targets of a df but for some reason its not working on the df im working on rn

errant venture
#

@true fiber I've already prepped the data and removed or replaced all NaN values, is there a check for infinite values?

silent swan
#

basically df[blah] can be ambiguous. df[column_name] generally works to get a column

warped tangle
#

wdym by ambiguous?

silent swan
#

it can have different behavior depending on what "blah" is

#

actually don't worry about that

#

in any case, if you want to get a column, do df[column_name]
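
To spell out the difference: df[0] looks for a column whose label is 0, while df[df.columns[0]] first looks up the label of the first column and then selects by that label. A minimal sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"target": [1, 0, 1], "x": [0.5, 0.7, 0.1]})

by_label = df[df.columns[0]]   # look up the first column's label, then select by label
by_position = df.iloc[:, 0]    # select the first column by position

try:
    df[0]                      # looks for a column *named* 0
    has_column_zero = True
except KeyError:
    has_column_zero = False

print(has_column_zero)  # False
```

df[0] does work when the columns are labeled with integers, for instance after reading a file without a header row, which may be why it worked on earlier dataframes.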

warped tangle
#

k

#

thx

errant venture
#

np.isfinite(df.any()) returns true for every column ,what does that mean?
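
In that expression, df.any() first collapses each column to a single True/False, and np.isfinite of a boolean is always True, so the result says nothing about the data. A sketch of a check that does locate the problem cells, assuming an all-numeric frame; the sample values are made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.inf, 3.0],
        "b": [1.0, 2.0, np.nan],
        "c": [1.0, 2.0, 3.0],
    }
)

# True where the value is a regular number (not NaN, not +/-inf)
finite = pd.DataFrame(
    np.isfinite(df.to_numpy(dtype=float)), index=df.index, columns=df.columns
)

bad_columns = finite.columns[~finite.all(axis=0).to_numpy()].tolist()
bad_rows = finite.index[~finite.all(axis=1).to_numpy()].tolist()

# One possible cleanup: treat infinities as missing, then drop incomplete rows
cleaned = df.replace([np.inf, -np.inf], np.nan).dropna()
print(bad_columns, bad_rows)
```

Listing the offending rows and columns first makes it possible to decide whether to drop, fill, or fix them at the source.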

haughty vale
#

Hey is anyone here open to giving me some guidance on this ML project I'm doing?

#

Oops sorry if I interrupted something

jovial river
#

I have a dataframe that has the batting performance of players who played in the World Baseball Classic (WBC). This is only played in certain years. The columns this dataframe has are playerID, yearID, BattingPerformance and Name. I want to calculate the average of their batting performance in the year they played in the WBC, the previous non-WBC year and the following non-WBC year to see if the WBC had any effect on their performance. For example, player X in year 2006 had a batting performance of 0.3, -0.2 in 2005, and 0.01 in 2007. The average would be (0.3 + (-0.2) + 0.01) / 3. The yearID goes from 2005 to 2018. The WBC years are 2006, 2009, 2013, 2017.

The final output of this new dataframe should have the following columns

[Name] | [Average calculated from 2005,2006,2007] | [Average calculated from 2008,2009,2010] | [Average calculated from 2012,2013,2014] | [Average calculated from 2016,2017,2018]

What methods does pandas have that will help me achieve this?

lapis sequoia
#

using groupby and mean should do it

jovial river
#

@lapis sequoia

So this is what I have so far.

def calculate_impact_score(group, batting_df):
    # With groupby(...).apply, each call receives one group (a DataFrame), not a single row
    print(group)
    return 0

batting_impact_score = people_WBC_batting.groupby(['playerID', 'yearID']).apply(lambda group: calculate_impact_score(group, batting))

people_WBC_batting is a dataframe that contains all the players that played in a WBC year(yearID are 2006, 2009, 2013, 2017).

batting is the one that has the batting performance for a player in a particular year.

I can't really do mean() on the batting dataframe because it includes player that didn't play in WBC.

#

So I have to find a way to use the playerID and yearID in the WBC dataframe, associate them with the playerID and yearID in batting, and get their batting performance.

lapis sequoia
#

could you join the tables

#

merge them

#

or filter

#
bad_id_list = []
filtered_frame = batting[~batting['playerID'].isin(bad_id_list)]
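
The merge suggestion can be sketched as follows: build the three-year window around each WBC appearance, merge it with the batting table, and average per window. The table contents are made up, and the window is assumed to be simply the year before, the WBC year, and the year after:

```python
import pandas as pd

# Hypothetical batting table: one row per player and year
batting = pd.DataFrame(
    {
        "playerID": ["x", "x", "x", "y", "y", "y"],
        "yearID": [2005, 2006, 2007, 2005, 2006, 2007],
        "BattingPerformance": [-0.2, 0.3, 0.01, 0.1, 0.2, 0.3],
    }
)
# Hypothetical WBC table: only player x took part in 2006
wbc = pd.DataFrame({"playerID": ["x"], "yearID": [2006]})

# For every (player, WBC year), list the three years in the window
base = wbc.rename(columns={"yearID": "wbcYear"})
windows = pd.concat(
    [base.assign(yearID=base["wbcYear"] + offset) for offset in (-1, 0, 1)],
    ignore_index=True,
)

# Pull in the batting numbers for exactly those years, then average per window
merged = windows.merge(batting, on=["playerID", "yearID"], how="left")
averages = (
    merged.groupby(["playerID", "wbcYear"])["BattingPerformance"]
    .mean()
    .unstack("wbcYear")
)
print(averages)
```

Players who never appear in the WBC table drop out automatically, and the result has one column per WBC year, which matches the requested output shape apart from the Name column.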
errant venture
#

I keep getting:

Input contains NaN, infinity or a value too large for dtype('float64').

But my table has none of these issues

#

grrr

lapis sequoia
#

are you converting types?

#

try pd.to_numeric

errant venture
#

@lapis sequoia There are only floats in the data set, I've removed any other data types

lapis sequoia
#

still, i've had that issue before, and pandas' to_numeric helped resolve it

#

its good for number type conversions

jovial river
#

has an example of doing previous row and following row but not both.

paper niche
#

@jovial river not tested, but probably do a diff, then shift

#

diff of 2, then shift by -1

rancid slate
quartz stream
#

@rancid slate Yes

#

Here is a simple example to show that

true fiber
#

@errant venture No, I don’t think it’s a good idea to use this, by advice @lapis sequoia . You must understand what you are doing, so you need to find the erroneous element, for example, by eliminating it by deleting columns or rows. You don’t have any infinite numbers, almost surely you just missed some empty spurious element.

df.dropna()
olive willow
#

Hey guys can somebody help me with this

#

I've a dataframe that I want to modify

#
Country Name                   Afghanistan  ...                               Zimbabwe
Country Code                           AFG  ...                                    ZWE
Indicator Name  5-bank asset concentration  ...  Working capital financed by banks (%)
Indicator Code                  GFDD.OI.06  ...                             GFDD.AI.35
1960                                   NaN  ...                                    NaN
...                                    ...  ...                                    ...
2013                               79.6688  ...                                    NaN
2014                               86.6035  ...                                    NaN
2015                               72.1549  ...                                    NaN
2016                               71.9406  ...                                    5.8
2017                               73.6723  ...                                    NaN
#

this is a preview of it

#

I want the first 4 rows to become columns and their data points to become vertical, not horizontal, if that makes sense

tranquil rose
#

how about the other rows for the years?

olive willow
#

you know like with transpose, I've applied it to the dataset and the years should be the rows like the index rows

#

but the other columns like Country code, Name etc should become the columns

#

and all the countries should become the rows not the columns

#

like from top to bottom

serene plume
#

So I'm trying the senet model (https://github.com/moskomule/senet.pytorch) which is supposed to yield better results than ResNet, but I keep getting 0% accuracy on my train set (throughout 10 epochs)
I'm using it as such:

model_senets = se_resnet20(num_classes=len(classes), reduction=16)
cuda = torch.device('cuda')
model = model_senets.to(cuda)

optimizer = homura_optim.SGD(lr=hp["lr"], momentum=0.9, weight_decay=1e-4)
scheduler = homura_lr_scheduler.StepLR(80, 0.1)
tqdm_rep = reporters.TQDMReporter(range(hp["epochs"]), callbacks=[callbacks.AccuracyCallback()])

trainer = homura_Trainer(model, optimizer, F.cross_entropy, scheduler=scheduler, callbacks=[tqdm
for _ in tqdm_rep:
    trainer.train(train_loader)
    trainer.test(test_loader)
    trainer.update_scheduler(scheduler)

Am I implementing it wrong? Why am I always getting 0 accuracy? 😕 (please tag me if you reply)

orchid geode
#

anyone here can tutor R? Willing to pay, please slide into my dm pls. Thanks.

fallen anchor
#

Hello

#
```csv
name,age,weight
mike,22,180.2
alexa,28,133.30
terry,56,
jordan,,
joey,82,138.90
```
#

I got a csv like that

#

I want to specify dtypes on import

#

but the missing data is screwing it up, I get an error ValueError: Integer column has NA values in column 1

#

how do I avoid that?

lapis sequoia
#

use fillna

fallen anchor
#

But what am I gonna fill it with?

#

I don't want to add random numbers

#

And filling with None doesn't work
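
An alternative to filling with a placeholder is pandas' nullable integer dtype, which lets an integer column hold missing values. A minimal sketch, assuming a pandas version that supports the "Int64" dtype string in read_csv; the CSV text is the sample from above:

```python
import io

import pandas as pd

csv_text = """name,age,weight
mike,22,180.2
alexa,28,133.30
terry,56,
jordan,,
joey,82,138.90
"""

# "Int64" (capital I) is the nullable integer dtype; plain "int64" cannot hold NA
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"age": "Int64", "weight": "float64"},
)
print(df.dtypes)
print(df)
```

The missing age stays missing (shown as <NA>) rather than being replaced by a number that could be mistaken for real data.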

sullen wing
#

@fallen anchor df = pd.read_csv('test.csv').fillna(value=0) worked for me

fallen anchor
#

hmm

#

but 0 is appropriate data

#

can I use something like -1000?

#

my actual data has temp and wind speed etc, so 0 makes sense

sullen wing
#

You can use anything, sure

#

-1000 works as well

#
```py
     name     age  weight
0    mike    22.0   180.2
1   alexa    28.0   133.3
2   terry    56.0 -1000.0
3  jordan -1000.0 -1000.0
4    joey    82.0   138.9
```
this is what it will look like with -1000
#

You can also do float('inf')

#
```py
     name   age  weight
0    mike  22.0   180.2
1   alexa  28.0   133.3
2   terry  56.0     inf
3  jordan   inf     inf
4    joey  82.0   138.9
```
#

Infinite age and weight yes

fallen anchor
#

interesting

#

next question

#
```csv
time,temp
2019-11-20 00:56,5
2019-11-20 01:56,
2019-11-20 02:56,8
2019-11-20 03:56,
2019-11-20 04:56,4
2019-11-20 05:56,
2019-11-20 06:56,
2019-11-20 07:56,
2019-11-20 08:56,0
```
#

I want to interpolate the missing data

sullen wing
#

Do you mean, filling data?

#
```py
df = pd.read_csv('test.csv').fillna(method='ffill')
```
Will give
```py
               time  temp
0  2019-11-20 00:56   5.0
1  2019-11-20 01:56   5.0
2  2019-11-20 02:56   8.0
3  2019-11-20 03:56   8.0
4  2019-11-20 04:56   4.0
5  2019-11-20 05:56   4.0
6  2019-11-20 06:56   4.0
7  2019-11-20 07:56   4.0
8  2019-11-20 08:56   0.0```
fallen anchor
#

interpolate

#

so lets say at 0 hours temp was 2c at 3 hour it was 7c, we can assume at 2hour it was around 4c

#

does that make sense?

sullen wing
#

Yes

#
```py
df = pd.read_csv('test.csv')
df = df.interpolate(method='linear', limit_direction='forward')
```
#
```py
               time  temp
0  2019-11-20 00:56   5.0
1  2019-11-20 01:56   6.5
2  2019-11-20 02:56   8.0
3  2019-11-20 03:56   6.0
4  2019-11-20 04:56   4.0
5  2019-11-20 05:56   3.0
6  2019-11-20 06:56   2.0
7  2019-11-20 07:56   1.0
8  2019-11-20 08:56   0.0
```
#

Like this?

fallen anchor
#

woah

#

that is built-in?

sullen wing
#

It is, you can shorten it to df = pd.read_csv('test.csv').interpolate(method='linear', limit_direction='forward')

#

well pandas has a lot of cool stuff

fallen anchor
#

that's awesome. thank you

#

what if I only want to do it for the temp column (even though this example only has temp I know)

#

Oh, I can use axis

sullen wing
#

You can do this

#
```py
df['temp'] = df['temp'].interpolate(method='linear', limit_direction='forward')
```
#

Test data
```csv
time,temp,test
2019-11-20 00:56,5,1
2019-11-20 01:56,2
2019-11-20 02:56,8,
2019-11-20 03:56,
2019-11-20 04:56,4,
2019-11-20 05:56,,
2019-11-20 06:56,,
2019-11-20 07:56,,
2019-11-20 08:56,0,3
```
Output
```py
               time  temp  test
0  2019-11-20 00:56   5.0   1.0
1  2019-11-20 01:56   2.0   NaN
2  2019-11-20 02:56   8.0   NaN
3  2019-11-20 03:56   6.0   NaN
4  2019-11-20 04:56   4.0   NaN
5  2019-11-20 05:56   3.0   NaN
6  2019-11-20 06:56   2.0   NaN
7  2019-11-20 07:56   1.0   NaN
8  2019-11-20 08:56   0.0   3.0
```

fallen anchor
#

but how can I do it within the interpolate() call? wouldn't that be cleaner

#

df = df.interpolate(method='linear', limit_direction='forward', axis='temp') throws an error

#

UnboundLocalError: local variable 'ax' referenced before assignment

sullen wing
#

You cannot, axis only accepts 0, 1 or none

#
```
axis : {0 or ‘index’, 1 or ‘columns’, None}, default None
    Axis to interpolate along.
```
#

So you will need to split into 2

fallen anchor
#

ahh, this works

#

df = df.interpolate(method='linear', limit_direction='forward', columns='temp')

sullen wing
#

hmm it doesnt work for me

#

It still interpolates the extra column

fallen anchor
#

huh

#

weird, same for me

#

what does it even do then

sullen wing
#

Nothing haha

#

Well this is clean enough

#
```py
df = pd.read_csv('test.csv')
df['temp'] = df['temp'].interpolate()
```
fallen anchor
#

I will use that

#

ValueError: time-weighted interpolation only works on Series or DataFrames with a DatetimeIndex

#

I get that error when I try to use the time method

#

df['temp'] = df['temp'].interpolate(method='time')

#

even though column 0 is time

#

I even added this df = df.set_index('time')

#

Fixed it

#
```py
df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time')
df['temp'] = df['temp'].interpolate(method='time')
```
supple ferry
#

@fallen anchor all such time-related operations require a DatetimeIndex, which can be set as you did with pd.to_datetime, or you can specify the time column during read_csv

fallen anchor
#

how do I do the latter?

#

it would be a 2 in 1, convert to dt and set as index

supple ferry
#

@fallen anchor look at the parse_dates argument in read_csv. Also, date_parser can be of help

#

They are both arguments of read_csv
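
A sketch of the "2 in 1" version using the parse_dates and index_col arguments of read_csv, with the sample data from above inlined so the snippet is self-contained (a real file path would go where io.StringIO is used):

```python
import io

import pandas as pd

csv_text = """time,temp
2019-11-20 00:56,5
2019-11-20 01:56,
2019-11-20 02:56,8
2019-11-20 03:56,
2019-11-20 04:56,4
2019-11-20 05:56,
2019-11-20 06:56,
2019-11-20 07:56,
2019-11-20 08:56,0
"""

# parse_dates converts the column to datetimes, index_col makes it the index
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"], index_col="time")
df["temp"] = df["temp"].interpolate(method="time")
print(df)
```

With the DatetimeIndex in place from the start, the time-weighted interpolation runs without the ValueError.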

ruby vortex
#

Hi everyone I am looking for some person whom I can work with on some data science/Machine learning project. If anyone is working on some data science project I would happy to be part of team. I am an undergrad student and want to gain skills and experience in deep learning. DM me if you have some project. Thanks

acoustic mural
#

just saw a picture of the Two Minute Papers guy... not at all what I expected based on his voice

#

no other takeaways, just that i pictured him EXTREMELY different

fallen anchor
#

what is a common thing to fillna with? np.nan ?

fallen anchor
#

nevermind, np.nan is the default anyway, no need to set it

lime cradle
#

Hey guys, I kinda need help with making a simple supervised machine learning program for my science project, where inputting an integer will respond with a win or a loss based on given data. If someone could help me out that would be awesome, dm me.

tranquil rose
#

@lime cradle it would be better to ask in the channel instead

#

ask a specific question

fallen anchor
#

that is some confusing wording ^ @tranquil rose

lime cradle
#

Well I am very new to coding and I was doing some research, and it seems like I am going to be using logistic regression, and I was wondering if there was an algorithm that is already made that I could use for my project

fallen anchor
#

TF and keras probably have them too

lime cradle
#

So would i use the decision_function method for the link that you sent?
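
The linked page is not preserved in the log; assuming it was scikit-learn's LogisticRegression, a minimal sketch of the integer-in, win/loss-out setup could look like this. The training numbers are made up:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: one integer per game, label 1 = win, 0 = loss
X = [[1], [2], [3], [4], [6], [7], [8], [9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

new_inputs = [[0], [10]]
labels = model.predict(new_inputs)                       # hard win/loss labels
win_probability = model.predict_proba(new_inputs)[:, 1]  # probability of a win
scores = model.decision_function(new_inputs)             # raw score; positive means class 1

print(labels, win_probability, scores)
```

For a plain win/loss answer, predict is usually the method to call; decision_function and predict_proba are useful when the confidence of the answer matters.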

fallen anchor
#

I've never actually used scikit

#

I just know it is widely popular

lime cradle
#

ok are you more familiar with tensor flow or keras

fallen anchor
#

TF

lime cradle
#

ok

fallen anchor
#

but really I don't know TF well either

lime cradle
#

Is there any way if you could help me set up the program if I find the coding to use

#

well if you arent I just looked up a tutorial so wish me luck

fallen anchor
#

I wish I could, but my knowledge is limited

lime cradle
#

ok thank you for the help

storm gate
#

How does one get the mode of a groupby in pandas?

deft harbor
#

.mode()?
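
Series.mode exists, and a common pattern is to call it once per group through agg. A minimal sketch with made-up data; taking iloc[0] picks the first value when several are tied:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "team": ["a", "a", "a", "b", "b"],
        "color": ["red", "red", "blue", "green", "green"],
    }
)

# Series.mode returns all tied values in sorted order; iloc[0] keeps the first one
modes = df.groupby("team")["color"].agg(lambda s: s.mode().iloc[0])
print(modes)
```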

lapis sequoia
#

lol

#

dammit Gir

fading cloak
#

does any one know what queue model applies to a queue that has a single queue with multiple processors but where each process requires N processors?

#

I figured the first part would be M/M/c but that only applies when each process use 1 processor

#

I'm trying to make a simulation

obtuse skiff
#

For anyone familiar with pyspark: I'm getting the error "Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to scala.collection.Seq"

First, using ALS, I'm turning my prediction values from a dataframe into an RDD, then putting that into RankingMetrics. Then I call that evaluator's .meanAveragePrecision

the error occurs when calling .meanAveragePrecision

Any idea why this could be happening?

#

it's weird because I'm using python not scala, so idk why it's trying to use scala

deft harbor
#

have you looked at something like this

obtuse skiff
#

@deft harbor Whats confusing me, is that Im using python, not scala

deft harbor
#

this is beyond me, but my guess is that spark itself is trying to call something

#

perhaps there is an issue with configuration or a particular package it is trying to call

#

sorry im not more of a help

obtuse skiff
#

hmm, let me check the parameters for the rdd. looks like you're right, it calls a scala method

urban shore
#

a friend and i are trying to get into the realm of data science/ml, any recommended videos to watch or simple projects to try?

obtuse skiff
#

Is anyone familiar with RankingMetrics from mllib pyspark?

I have my results from ALS, which are in dataframe form, but idk what values I'm supposed to pass into RankingMetrics when I create the rdd for it
I see something called predictionAndLabels but Im not understanding what those values are in relation to the values I get for transforming the test data in the ALS model
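
For context, RankingMetrics expects an RDD of pairs, one per user, where each pair is (ranked list of predicted items, list of truly relevant items). Passing single numbers instead of lists would be consistent with the "Long cannot be cast to Seq" error above, though that is a guess from the message alone. The plain-Python sketch below only illustrates the data shape and the usual definition of mean average precision; it is not Spark code, the item ids are made up, and Spark's exact implementation may differ in edge cases:

```python
def average_precision(predicted, relevant):
    """Average precision for one user: predicted is a ranked list, relevant a list of true items."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(predicted, start=1):
        if item in relevant_set:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_set)


# One (predicted ranking, relevant items) pair per user; the item ids are made up
prediction_and_labels = [
    ([1, 2, 3], [1, 3]),
    ([4, 5], [5]),
]

mean_average_precision = sum(
    average_precision(pred, rel) for pred, rel in prediction_and_labels
) / len(prediction_and_labels)
print(mean_average_precision)
```

In the ALS case this would mean collecting, per user, the recommended item ids in ranked order and the item ids the user actually interacted with in the test set.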

#

@urban shore look up k nearest neighbor (KNN) for movie reviews. also tfidf is used for it. It's one of the starting projects you can do. if you can get that to work with sklearn, you can do most other similar classification models

barren bluff
#

Hey im working on a final project for school where I have to build a CNN and plot some interesting results. Im using keras on the fashion mnist dataset, but right now I only have a plot showing training loss and accuracy on my data set with and without regularization. So I was wondering do any of you have any good ideas on stuff I could plot to show interesting results?

lapis sequoia
#
1. final project for school - somewhat relevant.. ok
2. build a cnn - for what?
3. need to know what you're applying it for, to tell you what an interesting result is
barren bluff
#

Im trying to classify images in the Fashion-MNIST dataset by zalando using a convolutional neural network.

#

Dont know what else to say

lapis sequoia
#

classifying images.. there you go

#

why do you think it's interesting to show training loss

barren bluff
#

pr epoch

lapis sequoia
#

if your model was what you're showcasing, you can show how your metrics improve for different methods

barren bluff
#

Well, I just wanted to show if the model is over or underfitting

#

oh so you mean like log the results after tweaking parameters?

lapis sequoia
#

yes... compare the metrics

#

you could try things like data augmentation..

#

also can try sequencing the data (sequence of images) and trying to make a prediction on that..

barren bluff
#

Not sure I understand the last two points 🙂

polar acorn
#

@barren bluff If you are mostly interested in nice ways to present your results, you could showcase a confusion matrix of your results. Or visualise both training and validation error when training. Or if you have some time and some curiosity you can look into something like shap and try to interpret why your model predicts a certain class for a certain picture. https://github.com/slundberg/shap

barren bluff
#

yeah I actually just did that @polar acorn but it looks a bit funky for some reason

#

looks like this

#

this is the code that generated the plot:

import itertools

import matplotlib.pyplot as plt
import numpy as np


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Greens):
    # Normalize each row first so the image and the cell labels show the same values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    # White text on dark cells, black text on light cells
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
#

Any of you know how I can fix the block at the top and bottom?

#

row***

polar acorn
#

Idk, try to set plt.ylim(cm.shape[0] - 0.5, -0.5)?

barren bluff
#

yeah let me try that

#

inside of the for loop right?

#

that fixed it gosh darn!

#

Why would I even need that block @polar acorn ?

#

doesnt plt.yticks(tick_marks, classes) define that value already?

polar acorn
#

It defines where to put the ticks, I think, not necessarily the axis limits. Anyhow, I think that example used to work without that line in a previous matplotlib version, and they changed something in an update without updating the example.
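
A minimal sketch of that fix, assuming a made-up 3x3 matrix and the headless Agg backend (neither comes from the original plot). The `set_ylim` call goes once, after the ticks are set, not inside the for loop:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up counts, for illustration only.
cm = np.array([[5, 1, 0],
               [2, 6, 1],
               [0, 2, 7]])

fig, ax = plt.subplots()
ax.imshow(cm, interpolation="nearest", cmap=plt.cm.Greens)

ticks = np.arange(cm.shape[0])
ax.set_xticks(ticks)
ax.set_yticks(ticks)

# Each cell spans half a unit on either side of its tick, so the full
# image runs from -0.5 to n - 0.5. Listing the larger value first keeps
# row 0 at the top, as imshow draws it.
ax.set_ylim(cm.shape[0] - 0.5, -0.5)

ylim = tuple(float(v) for v in ax.get_ylim())
```

Setting the limits explicitly should be harmless on matplotlib versions that already draw the full rows.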

mystic ravine
#

I'm still learning about nlp and ml, sorry for the newbie question, maybe wrong channel
I have a classified dataset with
Title (string), is_drama (bool), is_scifi (bool)

I'm looking for something that could suggest if it's drama or scifi, based on a Title input, but I have no idea of algorithm or method, it could be a link, video, or anything that could help, thanks in advance

sand lark
#

@mystic ravine you can use an LSTM

mystic ravine
#

Thanks, I'll try it

oblique belfry
#

Do you all use something like Luigi, Airflow, or some other DAG orchestrator for your data science/machine learning pipelines? Is there a reason to use one of these frameworks versus not using one? I am trying to evaluate the usefulness of them, and I am looking at potentially implementing this at work. I like standardization of processes, but I do not want to over-engineer.

silent swan
#

better: use BERT/RoBERTa

lapis sequoia
#

you use one of those.. usually airflow..

#

just depends how mature it is.. and how much you think it'll be maintained as you take the risk of deploying it over one framework..

fallen anchor
#

Any recommendations for an online course of series of courses for scienc

#

I was thinking edX or Coursera

#

Maybe even a stats course

#

Hopefully something comprehensive

trail island
#

Is it normal to get 0.0 as a pvalue?

#

just seems a little unlikely to get it in 4 hypothesis tests in a row

plain badger
#

there's nothing inherently unreasonable about it

trail island
#

even from a large real world data set?

plain badger
#

sure

trail island
#

ok

plain badger
#

i mean depending what you're testing, the fact that it's a large dataset might make it way more likely to have a v small p value

#

like testing for normality when youve got a big dataset that's very not normal

trail island
#
import scipy.stats as st

mean_elo_n_project_team = assigned_team_df['elo_n'].mean()
print("Mean Relative Skill of the assigned team in the years 1996 to 1998 =", round(mean_elo_n_project_team,2))

mean_elo_n_your_team = your_team_df['elo_n'].mean()
print("Mean Relative Skill of your team in the years 2013 to 2015  =", round(mean_elo_n_your_team,2))


# Hypothesis Test
# ---- TODO: make your edits here ----
test_statistic, p_value = st.ttest_ind(assigned_team_df['elo_n'], your_team_df['elo_n'])

print("Hypothesis Test for the Difference Between Two Population Means")
print("Test Statistic =", round(test_statistic,2)) 
print("P-value =", round(p_value,4))
#

im comparing two nba teams from two different time periods

#

so not testing for normalcy i think

plain badger
#

the same 2 nba teams?

trail island
#

yes

plain badger
#

you should be doing a paired differences test

trail island
#

oh

#

how do you know that?

plain badger
#

ttest_ind is a 2 sample t-test which is for 2 independent samples. paired differences is for testing a difference between 2 dependent samples

#

like the weight of your family in 2018 vs. their weights in 2019 is two dependent samples because it's the same family members
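
A rough sketch of why the distinction matters, with made-up family weights and both t statistics computed by hand from the standard library. The helper names are hypothetical; in scipy the corresponding calls are `scipy.stats.ttest_ind` and `scipy.stats.ttest_rel`.

```python
import math
import statistics


def independent_t(a, b):
    """Two-sample t statistic with pooled variance (treats a and b as unrelated)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def paired_t(a, b):
    """Paired t statistic: a one-sample test on the per-member differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


# Made-up weights (kg) for the same five family members in two years.
weights_2018 = [60.0, 72.0, 85.0, 95.0, 110.0]
weights_2019 = [61.0, 73.5, 86.0, 96.5, 111.0]

t_ind = independent_t(weights_2018, weights_2019)
t_paired = paired_t(weights_2018, weights_2019)
print(f"independent t = {t_ind:.2f}, paired t = {t_paired:.2f}")
```

Everyone gained about a kilo, but the spread between family members is huge, so the independent test sees almost nothing while the paired test sees a very consistent shift.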

chilly geyser
#

0.0 as a pvalue
Are you 'allowed' to state this p-value? Because if not, it's probably safer to state that the p-value is below machine epsilon (usually 10^-16)

plain badger
#

i dunno i've never seen 10^-16 lol. usually something like < 0.001
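
For reference, a small sketch of what the float limits look like in Python, assuming a made-up p-value of 1e-50. Machine epsilon is the spacing of floats near 1.0, not the smallest number a float can hold, so a p-value far below it is still stored as a nonzero value. Whether the library computed it accurately at that magnitude is a separate question.

```python
import sys

p_value = 1e-50  # made-up value, illustration only

eps = sys.float_info.epsilon   # gap between 1.0 and the next float, about 2.2e-16
tiny = sys.float_info.min      # smallest positive normal float, about 2.2e-308

representable = tiny < p_value < eps
print(f"epsilon = {eps:.3e}, smallest normal = {tiny:.3e}")
print("p_value is stored as a nonzero float:", p_value != 0.0)
```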

chilly geyser
#

If values are assumed to follow either t or norm-dist, a p-value of zero indicates an unbounded difference.

#

Alternatively, find out the actual t or z-score, and use logarithmic scale

#

If the logarithmic scale also breaks, your problem's precision is really problematic (since we're talking in orders of magnitudes when talking logarithmic scales)

trail island
#

oh im sorry Naarkie I meant they are 2 different teams, i misunderstood your question.

#

Denver Nuggets 2013-2015
Chicago Bulls 1996-1998

chilly geyser
#

Oh I realise you have this
print("P-value =", round(p_value,4))
Yes you should state p-value<0.00005 instead (That's with respect to your round function)

#
>>> round (0.00004,4)
0.0
plain badger
#

yeah then that's fine

trail island
#

oh

chilly geyser
#

FWIW, physics uses 5-sigma, i.e. a one-sided p-value of about 2.9 * 10^-7 (roughly 1 in 3.5 million)

trail island
#
print("P-value =", round(p_value, 10))
#

no wait

#

no matter what i round to, it comes out as 0.0

chilly geyser
#

What happens when you try just
print(p_value)

trail island
#

ooooooo

#

P-value = 1.604719099435058e-51

#

strange, idk why it would round to 0.0

chilly geyser
#

Um bro, e-51 is extremely small

#

It means 0.[fifty or so zeros]1604719 ....
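
A quick sketch of the rounding behaviour and two ways to print such a value. The `report_p` helper and its 0.0001 floor are hypothetical, chosen for illustration:

```python
p_value = 1.604719099435058e-51

# round() keeps a fixed number of decimal places, so anything this small becomes 0.0.
rounded = round(p_value, 10)

# Scientific notation keeps the significant digits instead.
scientific = f"{p_value:.3e}"


def report_p(p, floor=1e-4):
    """Report tiny p-values as an inequality rather than a misleading 0.0."""
    return f"p < {floor:g}" if p < floor else f"p = {p:.4f}"


print(rounded)
print(scientific)
print(report_p(p_value))
print(report_p(0.0312))
```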

trail island
#

omg

chilly geyser
#

Your choices are

  1. claim p-value is that number - I'm too skeptical of machine floating point to do so
  2. claim the p-value is below any usual significance level, which it definitely is
trail island
#

ok

#

i dont think i understand p-value like i thought

#

p-value is the probability that the hypothesis is true right?

chilly geyser
#

I will quote wiki because it's very specific
probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct

#

Your null hypo is that the means are the same

trail island
#

i see

chilly geyser
#

The at least as extreme as the results refers to how different your results are from this assumption - with an underlying assumption that the means follow gaussian distributions (if not, just central limit theorem)

#

So at least as extreme refers to mean differences very far from 0

#

The at least part means inequalities - and basically it means you can integrate from negative infinity up to that point, OR on the other side, from positive infinity down towards some positive-side number (2-sided)

#

Either way, choosing 2-sided or 1-sided only changes the integral by a factor of 2 (since the Gaussian is symmetric) and/or shifts the comparison values, which doesn't really matter with your p-value at that kind of magnitude
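
The tail integral can be sketched with the standard library, assuming the test statistic is a z-score under a standard normal. The function name is hypothetical:

```python
import math


def gaussian_p_value(z, two_sided=True):
    """Probability of a result at least as extreme as z under a standard normal."""
    # erfc gives the upper tail directly, which avoids computing 1 - cdf
    # and losing everything to rounding when the tail is tiny.
    one_tail = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return 2 * one_tail if two_sided else one_tail


p_two_sided = gaussian_p_value(1.96)               # the familiar 5% cutoff
p_five_sigma = gaussian_p_value(5.0, two_sided=False)
print(f"z = 1.96, two-sided: {p_two_sided:.4f}")
print(f"z = 5.00, one-sided: {p_five_sigma:.3e}")
```

Going from one-sided to two-sided only doubles the number, which is irrelevant next to something like e-51.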

trail island
#

holy shit

#

thats a ton of information, i have more research to do i can see!
thanks so much for the help 😄

deft harbor
#

take a basic stats class

native stag
#

read The Elements of Statistical Learning on stanfords website

subtle glade
#

Is there an R discord someone could send me a discord link for

polar acorn
#

Suspect you'll have better luck with either slack or irc

rocky maple
#

Is there anyone particularly experienced in Keras?

I'm trying to build essentially a deep learning hashing algorithm. I have a Keras model, and I'll feed it an image, and another version of the same image with noise/rotations/crops, whatever else I want it to be invariant to. I run both through the same autoencoder, and I train on the similarity between the two vectors, trying to get them as close as possible.

But, there's a problem with this approach. If all that you do is nudge similarity closer together, then all your vectors will end up looking the same no matter what. So, I'm also running the original through an autodecoder and training both models on that too.

I have two loss functions. One trains the autoencoder by comparing the Euclidean distance between the vectors of the original and the scrambled image, and another trains both the autoencoder and the autodecoder on how well it can reconstruct the original image using the vector. Hopefully this combination of loss functions will yield a well trained model.

The issue comes in implementation. This is actually my first project, and I'm not very familiar with setting up branching networks like this in Keras. If I was doing something sequential it would be easy, but I have some questions.

  1. The docs say that you can use Models like Layers, which are really just tf Tensors. How do I get that to work with multiple outputs? Furthermore, if I incorporate one model into another and train it, does it train both?

  2. Right now how I have it set up is I'm passing it two images. In my autoencoder Model I define convolutional and max pooling layers, then some dense layers, and apply them all on both images in the correct order. My model does the same thing twice. But in "production," I only want to give it one and have it tell me what the autoencoder says. How would I rewrite it to do so, and link up the loss functions correctly?

arctic wedgeBOT
#

Hey @rocky maple!

It looks like you tried to attach a file type that we do not allow. We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv.

Feel free to ask in #community-meta if you think this is a mistake.

rocky maple
#

So I can't attach notebooks. Alright.

acoustic mural
#

@rocky maple
1)
a) each output is one input into the following layers
b) you can freeze layers so they aren't adjusted during training
2) huh? i don't follow sorry

rocky maple
#

But Keras allows you to branch and share layers, correct?

#

Is isinstance(Model, Layer) true?

#

Also, I do absolutely want to train every single value, but only on specific things

silent swan
#

I'm gonna be annoying and recommend pytorch as usual

polar acorn
#

Deep self insight at least 😉

oblique belfry
#

I wasn't a fan of Pytorch at first, but I really like the design concepts behind it now. My team uses Keras extensively, but I am going to reimplement a work project in Pytorch to show the team the difference.

The only thing that worries me about Pytorch is deployment in production. But, I am just waiting on more blog posts about that.

polar acorn
#

I keep wanting to try PyTorch, but I feel I should properly learn tf 2.0 first to evaluate my options better.

wraith bluff
#

anyone here worked with speech to text models/libraries before?

silent swan
#

fair enough, if you have business needs

#

but if you need to learn just one library for your own use

#

learn pytorch

oblique belfry
#

I really like Keras Callbacks. Pytorch doesn't have a native way to do that. There are some helper libraries like Pytorch-Lightning and Poutyne that give it a more "Keras-like" api.

#

Pytorch documentation is pretty dope though.

#

And, I seemingly never have weird Cuda issues.

rocky maple
#

Holy shit, why did nobody ever tell me that PyTorch had a Java API?

#

Sign me up

rocky maple
#

In Keras, do I have to output something to train on it for loss?

lapis sequoia
#

ModuleNotFoundError: No module named 'modin.backends.pandas.parsers'

#

modin is a pain in the ass

lapis sequoia
#

p value

lapis sequoia
#

can you call a classification task as predictive modelling?

chilly shuttle
#

yes

#

you're predicting the class

ruby vortex
#

import tensorflow_datasets as tfds
dataset, info = tfds.load("imdb_reviews/subwords8k", with_info=True, as_supervised=True)

#

Can anyone execute these 2 lines and tell if he/she is getting any error. Thanks

deft harbor
#

ModuleNotFoundError: No module named 'tensorflow_datasets'

native stag
#

so if you had limited data for something would you use an oversampling technique to resample the data? if not what would you do

vital cipher
#

@wraith bluff was reading about it, as Qualcomm's new chipset has it as one of its features, so it was pretty interesting for me 🙂

ruby vortex
#

@deft harbor thanks

vital cipher
#

hi guys, just wanted to share ...trying out sentiment analysis with the twitter api...any suggestions or ideas for a specific data analysis algorithm to try, for more learning...open to all suggestions 🙂

grand breach
#

When creating a virtual env in anaconda, is it possible to import packages from the base dir into the virtual env, or do they need to be installed with pip every time?

grand breach
#

Will copy pasting work?

silent swan
#

yes, but depends on your configuration

#

why not just pip install as well

vital cipher
#

i would suggest using pip to install them in your environment whenever you want!!

jolly briar
#

@native stag ESL isn't appropriate advice to all surely? there's quite a lot of math

native stag
#

i'm reading ISL right now, i would start with that. i have almost no math background and it's completely understandable to me and is fantastic, every data scientist should read it. i'm going to move to ESL after, and i may have to learn some calc/LA in between to fully understand ESL

jolly briar
#

@native stag ISLR is a more suitable start yes

#

ESL, no

#

ISLR wouldn't be suitable to most without any maths either though

#

I don't know why you'd recommend a resource like that - but hey ho

native stag
#

i have no maths and i'm doing fine it explains things well but ya whatever you wanna do

jolly briar
#

good for you - i'm talking for most

native stag
#

idk i wanted to incase people haven't heard of it sorry that it bothered you so much but gl to ya mate

jolly briar
#

it's just not helpful to someone beginning to be recommended texts that aren't at a suitable level imo

deft harbor
#

I started with ISLR and enjoyed it, found it really helpful

#

Pretty sure I wouldn't have been able to take as much away from it if it spent most of its time bashing me over the head with only linear algebra notation.

flint nest
#

how hard is it to build an ai that can play a game like tic-tac-toe
in highschool

oblique belfry
#

Anyone willing to help me port a Keras model to Pytorch? I am doing a lot with Conv3d stuff.

oblique belfry
#

Input shape is (1, 240, 320, 3)

model = Sequential()

    # Define model
    model.add(Conv3D(32, kernel_size=(3, 3, 3), input_shape=input_shape, padding="same", kernel_regularizer=l2(opt.l2), bias_regularizer=l2(opt.l2)))
    model.add(Activation('relu'))
    model.add(Conv3D(32, padding="same", kernel_size=(3, 3, 3),kernel_regularizer=l2(opt.l2), bias_regularizer=l2(opt.l2)))
    model.add(Activation('relu'))
    model.add(MaxPooling3D(pool_size=(3, 3, 3), padding="same"))
    model.add(Dropout(0.7))

    model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3)))
    model.add(Activation('relu'))
    model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling3D(pool_size=(3, 3, 3), padding="same"))
    model.add(Dropout(0.25))
    
    model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3)))
    model.add(Activation('relu'))
    model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3)))
    model.add(Activation('relu'))
    model.add(MaxPooling3D(pool_size=(3, 3, 3), padding="same"))
    model.add(Dropout(0.25))

    model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3),  kernel_regularizer=l2(opt.l2), bias_regularizer=l2(opt.l2)))
    model.add(Activation('relu'))
    model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3),  kernel_regularizer=l2(opt.l2), bias_regularizer=l2(opt.l2)))
    model.add(Activation('relu'))
    model.add(MaxPooling3D(pool_size=(3, 3, 3), padding="same"))
    model.add(Dropout(0.7))

    model.add(Flatten())
    model.add(Dense(1024, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.7))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(
        optimizer=RMSprop(lr=opt.learning_rate),
        loss='binary_crossentropy',
        metrics=['accuracy'])

    return model
#

Forget the regularizers and what not. I cannot seem to translate the Flatten layer to Pytorch correctly. I cannot seem to translate the same padding options.

silent swan
#

flatten should be doable

#

but padding is not, without manually calling a padding layer

#

pytorch and TF for some reason have conv operations with fundamentally different padding

#

you might be able to find some code that auto computes the correct manual padding for you

#

that said, whether this is worth it depends on whether you're training from scratch or reusing a trained model

#

if it's the former, you can skip the exact replication
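
For the exact-replication case, a sketch of computing the manual padding along one axis, assuming the usual TensorFlow "same" rule (output size is ceil(size / stride), any odd leftover pixel goes at the end) and no dilation. The function name is hypothetical:

```python
import math


def tf_same_padding(size, kernel, stride=1):
    """Return (pad_before, pad_after) along one axis for TF-style 'same' padding."""
    out = math.ceil(size / stride)
    total = max((out - 1) * stride + kernel - size, 0)
    before = total // 2
    return before, total - before  # the extra pixel, if any, goes after


# Shapes from the Keras model above: input (1, 240, 320, 3), kernels of 3.
conv_pad = tf_same_padding(240, kernel=3, stride=1)        # stride-1 conv
pool_pad_depth = tf_same_padding(1, kernel=3, stride=3)    # pooling on the depth axis
pool_pad_width = tf_same_padding(320, kernel=3, stride=3)  # pooling on the width axis

print(conv_pad, pool_pad_depth, pool_pad_width)
```

The width case comes out asymmetric, which a single symmetric `padding=` value can't express, so it needs an explicit pad step (for example `torch.nn.functional.pad`) before the layer. For max pooling, TF ignores the padded positions, so zero padding only matches when the activations are non-negative, as they are after a ReLU.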

oblique belfry
#

Yeah....it kinda needs to be exact.

#

That is so frustrating.

silent swan
#

if you're replicating then yea

#

worst case just go layer by layer and ensure that they match

#

also make sure to configure cudnn to be deterministic

oblique belfry
#

I am trying to replicate the model.

silent swan
#

the flip side is that padding is about the only hard thing to translate between the two frameworks, in terms of straightforward models

oblique belfry
#

I am not sure how to configure the Conv3d layers properly. Keras lets you be pretty lazy, but pytorch makes you be explicit. I've tried 2 different ways to create a flatten layer, but each time I get a tensor that is so big, it doesn't fit in memory.

silent swan
#

explain what shapes you're trying to go from/to

#

you'll come to love how explicit pytorch is

oblique belfry
#

I don't think I understand your question. What about shapes?

#

Oh, I am loving pytorch, until this project. lol

#

Nah, I like it.

silent swan
#

you're trying to flatten right? that's manipulating the shape of hidden activations

#

put another way, you should be able to find out the shape of your outputs at every stage of that model

#

that's important for translating between tf and pytorch
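
A sketch of tracing those shapes by hand for the Keras model posted above, assuming every conv uses "same" padding with stride 1 (spatial size unchanged) and every pooling layer uses pool size 3 with the Keras default stride equal to the pool size. The function name is hypothetical:

```python
import math


def trace_shapes(input_shape, filters_per_block, pool=3):
    """Channels-last output shape (D, H, W, C) after each conv/conv/pool block."""
    d, h, w, _ = input_shape
    shapes = []
    for filters in filters_per_block:
        # 'same' pooling with stride == pool size yields ceil(size / pool) per axis.
        d, h, w = (math.ceil(s / pool) for s in (d, h, w))
        shapes.append((d, h, w, filters))
    return shapes


shapes = trace_shapes((1, 240, 320, 3), filters_per_block=[32, 64, 64, 64])
for shape in shapes:
    print(shape)

d, h, w, c = shapes[-1]
flat_features = d * h * w * c  # what the first Dense/Linear layer should expect
print("flattened features:", flat_features)
```

If the PyTorch version reports a very different flattened size, the pooling or padding is not matching somewhere upstream of the flatten.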

oblique belfry
#

I am not sure exactly what the dimensions were, but out of the convolution block, I ran x.view() to flatten the layer. This was an example I found online to flatten the layer.

silent swan
#

that should work

oblique belfry
#

2019-12-08 22:17:20,646 - ERROR - root: [example.py:95] Given input size: (512x1x29x39). Calculated output size: (512x0x14x19). Output size is too small

#

Model file

silent swan
#

which line? (your log is pointing to the line number in the full file I guess?)

#

it doesn't look like Flatten is the issue?

supple ferry
#

hey ! anyone tried new version of Spyder? 4.0 i presume it is.
any feedbacks ??

idle oracle
#

Yo anyone familiar with pytorch, because i have a problem and cant seem to figure out what it is.
ahh

#
#Choose device for training
if torch.cuda.is_available():  # call the function; the bare attribute is always truthy
    device = torch.device("cuda:0")
    print("Running on GPU")
else:
    device = torch.device("cpu")
    print("running on CPU")

net.to(device)
net = Net().to(device)


# Print out training information, set epoch range to train

for epoch in range(50): # (n) full passes over the data # set to ridiculous amount if using accepted value
    for data in testset:  # `data` is a batch of data
        X, y = data  # X is the batch of features, y is the batch of targets.
        X, y = X.to(device), y.to(device)
        net.zero_grad()  # sets gradients to 0 before loss calc. You will do this likely every step.
        output = net(X.view(-1,784))  # pass in the reshaped batch (recall they are 28x28 atm)
        loss = F.nll_loss(output, y)  # calculate and grab the loss value
        loss.backward()  # apply this loss backwards thru the network's parameters
        optimizer.step()  # attempt to optimize weights to account for loss/gradients
    print(loss)  # print loss
    # Adding accepted value, Comment out if need be
    if loss <= (accploss):
        break
#

anyway, the loss i get remains constant and I'm either blind to the problem, or did something really dumb and don't know how to code for the life of me.

#
tensor(2.3093, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3093, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3093, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3093, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3093, device='cuda:0', grad_fn=<NllLossBackward>)
#

sample output
I'd appreciate any help whatsoever, thx in advance

oblique belfry
#

maybe set net.train()

#

I'd also set the gradients on the optimizer to zero instead of the model.

#
# From the MNIST Pytorch example

def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))


lapis sequoia
#

guys i was looking into the functional API of keras and came across a problem [a doubt] why do they use the input shape as (64, 64, 1) when the input is just a 64x64 image
this is the code

# Convolutional Neural Network
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.pooling import MaxPooling2D
visible = Input(shape=(64,64,1))
conv1 = Conv2D(32, kernel_size=4, activation='relu')(visible)
pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)
conv2 = Conv2D(16, kernel_size=4, activation='relu')(pool1)
pool2 = MaxPooling2D(pool_size=(2, 2))(conv2)
flat = Flatten()(pool2)
hidden1 = Dense(10, activation='relu')(flat)
output = Dense(1, activation='sigmoid')(hidden1)
model = Model(inputs=visible, outputs=output)
# summarize layers
print(model.summary())
# plot graph
plot_model(model, to_file='convolutional_neural_network.png')
#

i think i got it !!!
is it for mentioning the number of channels?

polar acorn
#

Yep 👍

lapis sequoia
#

im a genius

#

xD

oblique belfry
#

Yep...channels last. Setting that properly is extremely important...speaking from experience
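
A tiny sketch of what that trailing 1 means in array terms, using a blank stand-in image (not the tutorial's data):

```python
import numpy as np

gray = np.zeros((64, 64), dtype=np.float32)  # stand-in for a 64x64 grayscale image

# Channels last: (height, width, channels), matching Input(shape=(64, 64, 1)).
channels_last = gray[..., np.newaxis]

# Channels first: (channels, height, width), the layout PyTorch conv layers expect.
channels_first = np.moveaxis(channels_last, -1, 0)

# Models take batches, so one more axis goes in front.
batch = channels_last[np.newaxis, ...]

print(channels_last.shape, channels_first.shape, batch.shape)
```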

silent swan
#

@idle oracle yes check the gradients after one backward pass

#

@oblique belfry did you figure it out?

oblique belfry
#

nope

hardy crag
#

@idle oracle as @oblique belfry already said, you need to call optimizer.zero_grad() instead of net.zero_grad()

silent swan
#

the two are generally interchangeable unless you have a very esoteric training scheme

#

optimizer.zero_grad is recommended, but net.zero_grad will still zero out the gradients for the model

#

@oblique belfry well lemme know if I can help. I've had to port models back and forth between TF and pytorch multiple times, so I'm well aware of the pain points

lapis sequoia
#

hey

#

trying to figure out how to parse OCR'd text that has arbitrary yet somewhat similar formatting

#

is there some kind of matching system that works well for messed up OCR

#

like for instance i might want to match "FURCHASE ORDER NO." with "PURCHASE ORDER NO." as my search criteria
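
One possible starting point, sketched with the standard library's `difflib`. The list of expected labels and the 0.8 cutoff are made up for illustration; dedicated fuzzy-matching libraries may handle heavier OCR damage better.

```python
import difflib

# Hypothetical field labels we expect to find in the documents.
known_labels = ["PURCHASE ORDER NO.", "INVOICE DATE", "SHIP TO"]

ocr_text = "FURCHASE ORDER NO."

# Similarity ratio between 0 and 1 for a single pair of strings.
ratio = difflib.SequenceMatcher(None, ocr_text, "PURCHASE ORDER NO.").ratio()

# Best label above the cutoff, or an empty list if nothing is close enough.
best = difflib.get_close_matches(ocr_text, known_labels, n=1, cutoff=0.8)

print(f"ratio = {ratio:.3f}, best match = {best}")
```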

idle oracle
#

@hardy crag nah i tried, didn't work

#

it's most likely an issue with throwing things with .to(device), because an older version works.

silent swan
#

check the gradients! thx

idle oracle
#

how would you go about doing that

#

i believe my grads are None

silent swan
#

if you call net.parameters(), you should get a list of parameters

#

check the .grad on each one. It should be none or 0 at the start, and after a forward pass and loss.backward, the .grad should be tensors

#

let me know if you do or do not see that

#

actually I have another guess. Can you show me the code you're using including where you initialize the optimizer?

idle oracle
#

yea

#

one sec

arctic wedgeBOT
#

Hey @idle oracle!

It looks like you tried to attach a file type that we do not allow. We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv.

Feel free to ask in #community-meta if you think this is a mistake.

idle oracle
#

ok....

#

@silent swan so now the loss values change, but...

#
tensor(2.2650, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.2917, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3413, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.2748, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3047, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.2935, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.2991, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3095, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.2907, device='cuda:0', grad_fn=<NllLossBackward>)
#

they're pretty stagnant

#

Ok, I think I have the problem identified, it must be a .to(device) issue

#

but i don't know how to fix it

silent swan
#

try creating the optimizer after all the to(device) stuff
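
A sketch of one way this can go wrong. It is a guess based on the posted snippet, where `net = Net().to(device)` rebuilds the model after an optimizer presumably already existed. The tiny linear models and random data here are stand-ins, not the original code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# The optimizer is bound to whatever parameters it was given at creation time.
stale_net = nn.Linear(4, 2)
optimizer = torch.optim.SGD(stale_net.parameters(), lr=0.1)

# Rebuilding the model afterwards leaves the optimizer pointing at the old one.
net = nn.Linear(4, 2).to(device)

x = torch.randn(8, 4, device=device)
y = torch.randint(0, 2, (8,), device=device)

before = net.weight.detach().clone()
loss = nn.functional.cross_entropy(net(x), y)
loss.backward()
optimizer.step()  # updates stale_net's parameters, which have no gradients
unchanged = torch.equal(before, net.weight.detach())

# Creating the optimizer from the final, already-moved model fixes it.
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
optimizer.step()  # reuses the gradients already stored on net
changed = not torch.equal(before, net.weight.detach())

print("weights untouched by stale optimizer:", unchanged)
print("weights updated by rebuilt optimizer:", changed)
```

That would show up as exactly the same loss printed every epoch, since nothing the forward pass uses is ever updated.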

idle oracle
#

yea imma stick to cpu

#

small problem anyway

#

stick a xeon to it and she'll be fine

silent swan
#

lol we could just fix the issue

oblique belfry
#

Unless your model is tiny, use the GPU.

idle oracle
#

Honestly i now have a massive problem with weights

#

for some weird reason it seems like every time i reset the runtime and create a new model (code creates a fresh one) there is always a single weight unaccounted for, with a value of 0.000, whereas the others have massive negative values.

#
tensor([-23179.8730, -17778.3848, -14537.1084, -31701.2402, -27408.4082,
        -20759.7539, -24848.2812,      0.0000, -38601.3164, -40405.9219],
       grad_fn=<SelectBackward>)
tensor(7)
#

so now i have 7 with nothing in it

#

it goes , 0,1,2,3,4... etc

#

and the rest have massive neg values

#

i am going to sleep now cuz it's almost 12:00AM, so i will chk back in the morning, about 7 hrs from now

#

suggestions on how to change format?

oblique belfry
#

You could grayscale the training data.

#

However.....a good neural net should be able to handle that.

#

Also, matplotlib's plots aren't always the most indicative of the data. The first image almost looks like a heatmap.

lapis sequoia
#

I need some help

#

is it normal for a vm instance to idle after a heavy computation..

#

my cpu utilization near maxed out during training for a few hours.. now it's done but in the model exporting phase it's not doing anything and I barely see a blip on the utilization..

lapis sequoia
#

how can I understand the unit vector (r_a) on the right side and the orange block U_s

#

I tried googling but I do not know how this works. I want to understand it

silent swan
#

is your input 28x28x3 or 28x28x1?

#

in any case, no, if the color scale is modified, you should have no prior expectation that the model should still work

#

would you mind posting your whole code again?

idle oracle
#

yea ok

#

@silent swan what if my training set it inverted form what im testing

#

from*

#

i think it was sensitive to color

#

@silent swan it works now,

#
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from PIL import Image
import PIL.ImageOps 

import numpy
im = Image.open("number.png")
img = PIL.ImageOps.invert(im)
num_img = numpy.array(img)
lum_img = num_img[:, :, 0]
plt.imshow(lum_img)
#

used this to apply the 'LUT' like the data set, inverted it, then it works with 99.6% accuracy

oblique belfry
#

I use cv2 for image transformations. Found it to beat Pillow and scikit-image in terms of speed.

upper ginkgo
#

I'm trying to make a family tree image like MarriageBot's for my own bot
Here's a family tree from MarriageBot:

#

I'd like to achieve something similar, except I'm using PIL to make a family tree with this design:

#

I've never made something like this, how would I do it while keeping the design?

silent swan
#

inversion of inputs can break a model

oblique belfry
#

My coworkers worked on this massive project for weeks and they were getting nowhere. Ends up they had the inputs reversed. They wasted weeks. "channels_first" and "channels_last" are probably the most important words in image recognition.

Harder to mess up in Pytorch though since you have to be so explicit. Keras lets you be a bit too lazy at times.

silent swan
#

exactly!

#

too much automatic inference and things become hard to debug

oblique belfry
#

For the most part, it is pretty cool.

#

I just am not a fan of Keras/TF exceptions. Curse of the static graph....

#

That alone might move me to Pytorch.

Also projects like Poutyne are pretty great. It gives a Keras-like interface to Pytorch. I can re-implement all my custom Keras callbacks and it will behave in a similar way as the original. I really didn't want to really configure all that again.

silent swan
#

I've never gotten into the whole callback style of programming. For Keras, it really feels more like a hack because Keras owns all of your control flow, so you have to play by its rules

#

PyTorch is a little more verbose, but in return you basically control everything

#

but yes, there are definitely similar keras-like wrappers for pytorch

oblique belfry
#

Saves me on boilerplate. I just like running a set of functions at the end. I really don't care about controlling all that. (For one project, I have a Socket.IO callback.) Just let that run at the end, I don't really care. I think the Callbacks are a nice abstraction. I think it makes the training code cleaner. But, that is more for my readability than true functionality.

#

I like how TF 2.0 essentially took everything people like about Pytorch and tried to implement that in their stuff.

silent swan
#

well, I think they tried to at least

oblique belfry
#

Instead of trying to bring in dynamic graphs (okay...technically it is called eager execution) in TF, they should have done an AngularJS and Angular 2 thing. Just do a rewrite and do it the right way.

#

Right now it is still a mess.

silent swan
#

I think TF fundamentally targets a different goal though

oblique belfry
#

I always have Cuda errors with TF. Seems like Pytorch always works with whatever version of Cuda I got. That is so nice.

silent swan
#

I think that's mostly because of the TF/Google attitude

#

"you have to do things our way"

#

that's why the TF library contains everything, to force you into their eco system

#

PyTorch feels like it's trying to serve the users

#

TF feels like you need to follow their way

oblique belfry
#

The next ML project I get, I might do the project in Pytorch.

The only thing about Pytorch is its "deployment strategy." TF/Keras does a good job at deploying models. Pytorch, to me, is a bit lagging in this area. But, this will improve with time and more people writing up articles.

silent swan
#

fair enough. I'm on the research side so I don't do much with deployment

#

TF has a lot of good tools for that

#

but if you just want to train models, PyTorch is generally the far better option

oblique belfry
#

Unless you have to do Conv3D...

silent swan
#

how so

oblique belfry
#

Still can't configure the model architecture correctly. lol

silent swan
#

that's because of the incompatibility of TF/PyTorch conv padding. Not that one or the other is more correct

#

if anything, PyTorch allows you to drop a debugger into wherever you're running into an error, TF throws you into magic C error space

oblique belfry
#

I haven't touched a debugger since I was learning Visual Basic in undergrad. I am not the biggest fan of them. I need to get into them more. Might could help.

#

The padding is what is killing me.

silent swan
#

show me your stack trace and the line that it's throwing an error on

oblique belfry
#

I will tonight

idle oracle
idle oracle
native rivet
#

any machine learning engineer here?

oblique belfry
#

@native rivet I’ll do my best. What’s up?

silent swan
#

you can do a thing like find the bounding box for non-white pixels, find center of the bounding box, and then shift
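
A rough sketch of that idea on a grayscale array, assuming white is 255 and the content fits inside the frame. The function name and the 7x7 test image are made up:

```python
import numpy as np


def center_content(img, background=255):
    """Shift the non-background pixels so their bounding box sits mid-image."""
    mask = img != background
    if not mask.any():
        return img.copy()  # nothing to center

    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    box_cy = (rows[0] + rows[-1]) / 2
    box_cx = (cols[0] + cols[-1]) / 2

    # Offset from the bounding-box center to the image center.
    dy = int(round((img.shape[0] - 1) / 2 - box_cy))
    dx = int(round((img.shape[1] - 1) / 2 - box_cx))

    # np.roll wraps around, but a box moved to the center never crosses an edge.
    return np.roll(img, shift=(dy, dx), axis=(0, 1))


img = np.full((7, 7), 255, dtype=np.uint8)
img[0:3, 0:3] = 0  # dark blob stuck in the top-left corner
centered = center_content(img)
print(centered)
```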

pulsar stag
#

Dash-Bootstrap-Components How to Build Layered Dashboards with Python https://youtu.be/P-XYio7G_Dg

oblique belfry
#

https://arxiv.org/abs/1912.04316
An interesting approach to action recognition. They use graph CNNs to solve the problem. Very novel. I have not read an approach like this before.

We don't have an ML channel, so sorry if it doesn't truly fit data science.

oblique belfry
#

I am having to refactor our custom ML platform, and I just cannot seem to get started. It was built to be very project-specific. Trying to abstract that is going to be fun. My boss is also a bit opinionated about things, so some of my changes may not even matter.

For the record: Comments are not a waste of time.

silent swan
#

"eh, I'll just use kwargs for this"

oblique belfry
#

Honestly, I would take that.

Had to convince him we should use classes for this one thing because it holds state, and we were just passing things through a million functions. I get that some people go a bit too far with polymorphism and inheritance, but there is a time and place. You know?

#

Currently working on an action recognition project, but we are trying to make this platform accessible for sequence data, tabular data, audio, etc.

lapis sequoia
#

sounds fun

#

what do you mean platform though

oblique belfry
#

It's composed of several parts. Mainly data labelling and deep learning "best practices" our company has found to be uberly successful. Essentially, it standardizes things and allows an ML engineer to focus on building models, not software development.

#

If anything, it has been a killer in-house tool.

lapis sequoia
#

sounds nice

#

how did you build it

#

I'm familiar with building models but not so much on serving..

oblique belfry
#

Can't dive too deep into it. But, the Python side is pretty simple. We found caching our data in a fast binary format gave us a 5x to 10x improvement in training time. So, after the traditional ETL process, prep your data into a high-performance format, then load that into a Keras generator or PyTorch DataLoader. Kind of like ahead-of-time compiling code. Same principle.
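
The chat doesn't say which binary format was used, so the following is only a sketch of the general pattern, assuming NumPy's `.npy` format as a stand-in: write the preprocessed array once, then memory-map it at training time so each batch is read straight from disk without re-parsing. The array contents and the file name `features.npy` are invented for the example.

```python
import os
import tempfile

import numpy as np

# Pretend this array is the output of the ETL / preprocessing step
features = np.random.default_rng(0).random((1000, 16), dtype=np.float32)

# One-time cost: cache the prepared data in a binary format
cache_path = os.path.join(tempfile.mkdtemp(), "features.npy")
np.save(cache_path, features)

# Training time: memory-map the file instead of loading it all up front
cached = np.load(cache_path, mmap_mode="r")

# A data loader / generator would slice batches like this
batch = np.asarray(cached[:32])
```

A Keras generator or PyTorch `Dataset` would then wrap the slicing step; that wiring is left out here.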

Used React and Express to create a SPA for labeling and viewing model stats. We also found TensorBoard annoying as hell to work with.

lapis sequoia
#

labelling?

oblique belfry
#

data labelling.

lapis sequoia
#

I know what it is..

#

I'm wondering how you're doing it over the app

oblique belfry
#

There weren't any "simple" data labelling solutions. And, we weren't interested in Sagemaker.

lapis sequoia
#

how do you save and version models

oblique belfry
#

Labeling videos for Action Recognition isn't the easiest. Gotta search per frame.

lapis sequoia
#

ok then

oblique belfry
#

Save it locally. Though it isn't hard to save stuff to S3 or Azure. For each run, we save weights, logs, performance stats, and other stuff.

lapis sequoia
#

hmm ok

vital cipher
#

has anyone set up metabase on ubuntu or any other distro?

#

and used?

lapis sequoia
#

I was reading about decision trees on Medium. Can anyone tell me what this paragraph means? 👇

#

I don't understand what it means by "pure"

#

i even found the same text in the decision tree documentation

chilly geyser
#

Pure means the leaf has only 1 category

#

To give an example, suppose you had a dataset of 6 things, of category "Blue" and "Red"

#

Suppose in your dataset you have some predictors

#

What the tree does is calculate the entropy change ('information gain') from splitting on each predictor (if predictor <= some_value, classify one way, else classify the other way)

#

I have terrible drawing skills but this should illustrate my point

#

Suppose that basically of your 6 data points, 4 were "Red" and 2 were "Blue"

#

We assume for simplicity that your data is very good, so it just splits once, and then all the 2 "Blue" get together - it's now pure
The same happens for the 4 "Red" - the leaf is now pure.
Since the tree has split into pure leaves, the algorithm ends

#

The splitting and deciding of predictors is the core to the algorithm - use of Entropy/Information Gain isn't the only way - although there are reasons for doing so
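
The 4 "Red" / 2 "Blue" example above can be worked through numerically. This is a small sketch using only the standard library; the helper names `entropy` and `information_gain` are just illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# The dataset from the example: 4 "Red" and 2 "Blue"
parent = ["Red"] * 4 + ["Blue"] * 2

# One perfect split: each leaf contains a single category, so both are pure
left, right = ["Red"] * 4, ["Blue"] * 2

gain = information_gain(parent, [left, right])
print(f"parent entropy: {entropy(parent):.4f} bits")
print(f"information gain of the split: {gain:.4f} bits")
```

A pure leaf has entropy 0, so a split into pure leaves recovers all of the parent's entropy as gain, and there is nothing left to split on.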

fast rain
#

hey guys

#

can anyone help me with excel stuff?

lapis sequoia
#

Ok i get it now..

#

thanks a lot

#

I have one more question

#

If we oversplit the data in decision tree what will happen?

#

Will we overfit the data then?

chilly geyser
#

Yes you will overfit your training set, so you need a model selector/parameters/algorithm to decide if you really need a tree that always ends in pure leaves

#

What happens is that with the fully grown tree, you can prune it somewhere in the middle, and this is something you can do after you grow the tree

#

It's going to be a hard problem to dynamically check for overfitting while growing the tree - so that's why that is done after

#

With pruning the tree becomes something that ends in 'impure' leaves, but what's important is if the tree is really useful and generalises to actual data or use cases you want to present it towards
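
The chat doesn't name a specific pruning method. As one concrete illustration, scikit-learn's `DecisionTreeClassifier` supports minimal cost-complexity pruning through its `ccp_alpha` parameter. The dataset and the value `0.01` below are arbitrary choices for the sketch, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Fully grown tree: keeps splitting until the leaves are pure
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Pruned tree: ccp_alpha > 0 removes branches whose gain does not justify their size
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

for name, tree in [("full", full), ("pruned", pruned)]:
    print(name, "leaves:", tree.get_n_leaves(),
          "train acc:", round(tree.score(X_tr, y_tr), 3),
          "test acc:", round(tree.score(X_te, y_te), 3))
```

Comparing train and test accuracy for the two trees is the quick check for whether the extra leaves were actually buying generalisation.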

lapis sequoia
#

Ok got it.. thanks a lot

deft harbor
#

thats a pretty tree you have there

#

@lapis sequoia look at bagging and boosting as well

twilit ore
#

Is this the appropriate channel to discuss things related to Scrapy?

deft harbor
#

I'm not against it, but I don't know the rules

chilly shuttle
#

if you keep it general and not 'im scraping some site that has TOS saying not to' its probably fine

#

although it's also not data science

lapis sequoia
#

@deft harbor Sure will

#

Also in Random forests if we used a large number of n_estimators or trees will we overfit the data?

silent swan
#

eventually in some way yes

haughty pawn
#

hi there

#

anyone heard of ai dungeon2? question related:
can you help me out by modifying that model in a way to shove the entire save in, and not just the prompt + the last 8 phrases?

#

here's an example of a desired outcome:
you save your conversation, you load it via id and it shoves the contents of the save into the model, you continue your conversation intact and the context is not broken

deft harbor
#

I don't think I've seen a lot of people do the work for others here

haughty pawn
#

sorry, but i'm not an ai(TF) programmer, so i can't really figure it out myself

twilit ore
#

Anyone working on a project that requires scraping the web with Scrapy? I'm up to join you to learn it hands-on and eventually help you in exchange for some knowledge. I've some experience scraping and wrangling data using requests, selenium webdriver, beautifulsoup, regex, lxml, json...

If you've worked with Scrapy before but are not using it in any of your projects and have some free time, I'd be interested in partnering up to tackle some whatever-payment freelance jobs using it if that pleases you.

Not sure if this is appropriate channel but seems like it's the most appropriate one.

idle oracle
#

hey is it possible to create an AI that plays a game against you and gets better every time?

oblique belfry
#

That is the field of Reinforcement Learning.

native moon
#

i have merged two csv files into a single dataframe, is there any way i can check if the row from the new merged dataframe is present in the second csv file or not?

paper niche
#

plenty of ways. why not add a column called “source”, whose value is “first” for the first csv and “second” for the second csv, before merging them?
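
A small sketch of two ways to track where a row came from, using made-up frames in place of the two CSV files. The first tags each frame with a `source` column before combining, as suggested above; the second relies on the `indicator=True` option of pandas `merge`, which adds a `_merge` column.

```python
import pandas as pd

# Stand-ins for the two CSV files (invented data)
first = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
second = pd.DataFrame({"id": [2, 3], "name": ["b", "c"]})

# Option 1: tag each frame with its origin before stacking them
combined = pd.concat(
    [first.assign(source="first"), second.assign(source="second")],
    ignore_index=True,
)

# Option 2: outer merge on the shared columns; _merge records the origin of each row
merged = first.merge(second, how="outer", indicator=True)
in_second = merged["_merge"].isin(["right_only", "both"])

print(combined)
print(merged[in_second])
```

Which option fits depends on whether the frames were stacked (`concat`) or joined on keys (`merge`); the question doesn't say which kind of "merged" was meant.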

haughty pawn
#

alrighty then, since you either don't know the answer to the previous question or just don't want to answer, here's another three:

  1. is ML available to the general populace yet (has it been dumbed down enough for anyone to take it on)?
  2. what is the least resource (GPU/CPU) intensive but just as performant model (compared to GPT-2)?
  3. what model is better than GPT-2 and supports i18n?
    (pardon if i mix my terms up)
lyric kernel
#
df1 = data[data.MESS_DATUM >= 19600101]
df1 = df1[df1.MESS_DATUM <= 19981231]

How do I get this in one line?
Neither "and" nor "&" worked

paper niche
#
```
data[data.MESS_DATUM.between(19600101, 19981231)]
```
@lyric kernel
lyric kernel
#

crisp!

paper niche
#

if you want to use &, you're gonna need to surround your conditions in parentheses, like

data[(data.MESS_DATUM >= 19600101) & (data.MESS_DATUM <= 19981231)]

I'm guessing that's why your attempt didn't work
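
For anyone who wants to check that the two forms agree, here is a short sketch on invented data; only the column name `MESS_DATUM` comes from the chat. The parentheses matter because `&` binds more tightly than `>=` and `<=` in Python, and plain `and` fails because a whole Series has no single truth value.

```python
import pandas as pd

# Invented sample values around the two boundary dates
data = pd.DataFrame(
    {"MESS_DATUM": [19591231, 19600101, 19750615, 19981231, 19990101]}
)

# between() is inclusive on both ends by default
by_between = data[data.MESS_DATUM.between(19600101, 19981231)]

# Same filter with element-wise &, each comparison wrapped in parentheses
by_mask = data[(data.MESS_DATUM >= 19600101) & (data.MESS_DATUM <= 19981231)]

print(by_between.equals(by_mask))
```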

worn stratus
#

pipenv, venv, or conda for a machine learning project?

silent swan
#

conda, always conda

quartz monolith
#

has anybody used a graph database to recommend users based on keywords? The keywords are extracted from a text and combined with the users

formal storm
#

Hi, Is this the right place to ask for some Pandas help?

quartz monolith
#

zes

#

yes

formal storm
#

awesome

worn stratus
#

@silent swan whats the advantage to conda?

slim fox
#

imo venv would give you the most flexibility. Some packages might not be at the latest version in conda

silent swan
#

conda installs all the scientific computing libraries properly

#

you get both the package installer and environment manager in one

oblique belfry
#

I've always found conda to be annoying. And the only issues I have with any scientific computing packages is Tensorflow due to it not playing with certain versions of Cuda.

silent swan
#

what issue have you had with conda

#

also tensorflow plays nicely with NO ONE lol

uneven harbor
#

If anyone has extra time and would like to help with the creation of my Jojo bot you can help create stands here https://docs.google.com/document/d/1o4gkz4jmROzNSp79LvOwggQBW2sUI5gZ0Ft6MF2WEPo/edit?usp=sharing

lapis sequoia
#

what does this mean

#

1:06:14<2412:56:12 this part

silent swan
#

time so far < expected time to completion probably?

lapis sequoia
#

2000 hrs?

#

wut

oblique belfry
#

I can’t remember exactly. But. I remember it’s just simpler to setup up a virtual env and install what I need. I also don’t mind pipenv either.

I also don’t need all those dependencies in a project.

deft harbor
#

But it comes with spyder 😶

oblique belfry
#

....I don’t use Spyder 😬

silent swan
#

That's what it says. 900 seconds per iteration, and you have 10000 iterations.
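
The readout looks like a progress bar's "elapsed<remaining" pair (that reading is an assumption, since the screenshot isn't in the log). Taking the 900 seconds per iteration quoted above at face value, the arithmetic can be checked like this; `parse_hms` is a throwaway helper name.

```python
def parse_hms(text):
    """Convert an 'H:MM:SS' string (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds

elapsed_s = parse_hms("1:06:14")        # time spent so far
remaining_s = parse_hms("2412:56:12")   # estimated time still to go

seconds_per_iter = 900                  # figure quoted in the chat
remaining_iters = remaining_s / seconds_per_iter
remaining_days = remaining_s / 86400

print(f"elapsed: {elapsed_s} s")
print(f"remaining: about {remaining_iters:.0f} iterations, {remaining_days:.1f} days")
```

So the "2000 hrs?" reaction is about right: the estimate is on the order of a hundred days at that speed.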

spare arch
#

Does anyone know

#

where I can get started

#

learning how to make

#

AI learn how to play games?

spare arch
#

dope

#

thanks @oblique belfry

jolly briar
#

@oblique belfry i found conda confusing at first as it dumps stuff into your bashrc or something, but after using it for a bit it's fine really, haven't had issues since, perhaps i don't do a fat lot with it though

oblique belfry
#

If I was on Windows, I'd take a second look at it. But I just haven't had a need good ole pip couldn't solve.

jolly briar
#

@oblique belfry fair, i'm only using it because the team uses it, and if it wasn't for that it's unlikely i'd have got past the initial hiccup i expect

oblique belfry
#

I get that.

silent swan
#

if I'm not wrong conda also installs optimized versions of e.g. numpy libraries

oblique belfry
#

I’ve read that before. Like I said, I don’t look down on conda. But there just hasn’t been a situation where I needed something else than pip to get all my ml packages going.

deft harbor
#

You know you want spyder

twilit ore
#

@oblique belfry Do you only use one version of python or you create venvs to replace conda?

oblique belfry
#

Just venvs. I like the separation between projects.

twilit ore
#

Then that's just fine.
Conda is a huge help when you need to run several python versions and libs versions without having to store python and libs versions over and over within each project folder

#

But if you're satisfied with storing libs and python for each project, that's a way to do it

paper niche
#

it’s been a while since i last used conda, but in my experience it also takes really long for conda to resolve dependencies and when installing packages; whereas if i just want to spin up a quick virtualenv, venv+pip is usually much faster and i can get going quickly

#

i see the merits of conda as well though, but just saying it’s not for everybody and not necessarily a de facto standard for all ds projects

gleaming thorn
#

how to speed up model training using tensorflow for object detection with YOLOv3 Darknet? it takes 5 days for 1000 iterations

silent swan
#

faster GPU, larger batch sizes, or tweak the learning rate if you're okay with having slightly worse performance

gleaming thorn
#

i have already used a GPU and also ran it on a google colab GPU, but it takes the same time

deft harbor
#

That was a useful article sheemp

gleaming thorn
#

@deft harbor thanks

lapis sequoia
deft harbor
#

What did you go with

oblique belfry
#

I'd use the original YOLO Darknet that is written in C. It is more finicky to work with, but the performance is impressive.

worn stratus
#

Does anyone happen to have a neat example of calculating information gain in Python?

waxen topaz
#

Hello everyone, I'm trying to write an essay on optical character recognition as implemented with machine learning. do you guys have any interesting or useful sources explaining the topic? looking for youtube videos or articles or academic papers.
As it's not for a CS course I want to explain what OCR and Machine learning are as well as what it's useful for.

acoustic mural
#

what libraries do you all use to automate testing different neural architectures? i'm not going to be able to work interactively

#

not just regular hyperparameter tuning but also things like number, type, and size of layers

#

on top of keras*

#

although i'm sure more general solutions exist

native moon
#
def fill_data(data_frame):
    for i in np.setdiff1d(unique_full_folder_path, data_frame["Full Folder Path"].values):
        data_frame.loc[data_frame.shape[0]] = [i, 0, *data_frame.iloc[0,2:].values]
    return data_frame

can anyone tell me what this does?

#

unique_full_folder_path = report_1_df["Full Folder Path"].append(report_2_df["Full Folder Path"]).unique()

silent swan
#

testing in what sense

#

looks like it gets all the unique "Full Folder Path"s across both data frames

lapis sequoia
#

anyone have experience with Rstudio and SQL?

#

I know this a python disc

lapis sequoia
#

sql, you can ask in databases

vital cipher
#

@lapis sequoia yeah working with rstudio too so can dm me 🙂

#

hope i can help you

native stag
#

worth a read

worn stratus
#

Does anyone know of an example of a decision tree (preferably id3) implemented from scratch? I can find a couple on random githubs, but they seem to have issues

vital cipher
#

@worn stratus do you want to implement a decision tree from scratch, or do you want an example of one?
i do have an example and can share it with you... so do let me know

worn stratus
#

Yeah, an example would be useful

#

my end goal is random forest from scratch

#

but the first step is a decision tree

vague merlin
#

Hey I need to create a program that can check similarity between two images. Is machine learning the best way to solve this?

deft harbor
#

are you looking for the SAME image?

#

if all you need to do is match the same picture to itself, then you dont need machine learning

#

you could just see if the pixels match

#

@vague merlin

wraith basin
#

@worn stratus for information gain in Python check this out https://machinelearningmastery.com/information-gain-and-mutual-information/

#

Has worked examples

deft harbor
#

Thanks for the read

lapis sequoia
#

how is this different from stacking models

#

say I use catboost or something that works well with categories.. and then append predictions to support the next model on top of this

silent swan
#

@worn stratus if you're good at reading code you can read the sklearn cython

#

actually maybe not, I don't think it's a good learning experience

worn stratus
#

@silent swan at this point I'd happily look at it in the sklearn source, but I can't find it

#

nvm - got it

#

sorry for the ping

vague merlin
#

@deft harbor It needs to be able to check similarity between two different pictures, it could be almost the same image but from a different angle or zoom etc

deft harbor
#

Ah, then yeah, you will most likely need some sort of NN for that. @vague merlin

#

How many different image classes will you have? For example, you know you will want to match pictures of a certain statue downtown, a bus stop and a specific building, that would be three.

#

If you want to match all possible similar images of anything, that's going to be a pretty large undertaking.

vague merlin
#

@deft harbor ah okey,
it needs to check similarity between different rooms within a house, so if there are two images of the same kitchen but from a different angle it would still give it a high similarity score, the same goes for bedrooms, bathrooms and so on, preferably the outside of the house as well. Just enough to determine if it's the same house or not.

I assume that it would be quite a large project to make something like that work?

silent swan
#

still, try some silly heuristic like pixel-level difference, maybe with image registration first

oblique belfry
#

Maybe try running an edge detector on the image first, and then running correlate2d.

vague merlin
uncut shadow
#

Hey. What is the problem here? I visited tf errors page but I couldn't find this error.

#
Traceback (most recent call last):
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "C:\Users\PC\Anaconda3\envs\TF\lib\imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "C:\Users\PC\Anaconda3\envs\TF\lib\imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: DLL load failed with error code 3221225501
#
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/PC/PycharmProjects/TF/test.py", line 1, in <module>
    import tensorflow
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow\__init__.py", line 98, in <module>
    from tensorflow_core import *
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\__init__.py", line 40, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow\__init__.py", line 50, in __getattr__
    module = self._load()
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow\__init__.py", line 44, in _load
    module = _importlib.import_module(self.__name__)
  File "C:\Users\PC\Anaconda3\envs\TF\lib\importlib\__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
above this error message when asking for help.
#
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "C:\Users\PC\Anaconda3\envs\TF\lib\imp.py", line 242, in load_module
    return load_dynamic(name, filename, file)
  File "C:\Users\PC\Anaconda3\envs\TF\lib\imp.py", line 342, in load_dynamic
    return _load(spec)
ImportError: DLL load failed with error code 3221225501


Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
#

This error occurs even when I want to import it

#

I'm using windows 7 if that's needed to know

lapis sequoia
#

@uncut shadow what do you want to run locally with TF

#

google colab can run TF without separately installing it

rigid summit
#

Hey, anyone here use Anaconda? I can't seem to update my Spyder...

#

the "conda install spyder-4.0.0" doesn't work in the Anaconda Prompt, it says it's not a legit command

silent swan
#

what's the error message

#

windows/unix?

lapis sequoia
#

try spyder==4.0.0

#

@rigid summit

primal ravine
#

Hey, I'm trying to give my matrix categories by first turning the words in a column into numbers, and then setting those numbers as categories. However, I keep getting an error message and I don't know what to do; my code is identical to the professor's

#

Professor's Code:

#

Any help would be greatly appreciated, im just starting out learning python for data science!

lapis sequoia
#

you need to install those packages

#

are you using spyder

#

open anaconda prompt and run conda install numpy matplotlib

fallen anchor
#

Hey

#
```
valid,value
2004-07-21 20:00:00,280
2004-07-21 21:00:00,020
```
#

df.iloc[1]['value2'] = 555

#

I am trying to create a new column value2 for the last row, but the above code doesn't do anything

#

I want it to look like

#
```
valid,value,value2
2004-07-21 20:00:00,280,
2004-07-21 21:00:00,020,555
```
#

how can I do that?
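
One way this is commonly done, sketched on the two rows from the question: `df.iloc[1]` hands back a separate row object, so assigning into it never reaches `df`. A single `.loc` call with a row label and a column name writes into the frame itself and creates the column if it is missing. Reading `value` as a string is an extra assumption here, made only to keep the leading zero in `020`.

```python
import io

import pandas as pd

csv_text = (
    "valid,value\n"
    "2004-07-21 20:00:00,280\n"
    "2004-07-21 21:00:00,020\n"
)
# dtype=str keeps "020" from being parsed as the integer 20
df = pd.read_csv(io.StringIO(csv_text), dtype={"value": str})

# One .loc call: last row label, new column name. Other rows get NaN.
df.loc[df.index[-1], "value2"] = 555

print(df.to_csv(index=False))
```

Because the other rows hold NaN, the new column ends up as floats; writing it back out with `to_csv` leaves the empty cell blank as in the desired output.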

lapis sequoia
#

this is a csv?

oblique wyvern
#

Hey everyone, I'm trying to figure out how to do the following with numpy:
There's a array a = [1, 2, 3, 4, 5] with data and an array b = [0, 1, 2, 0, 1] which are row numbers
How would I create a matrix using the data from a, placing the elements in the rows specified by b but keeping their column position from a?
Like:

[[1, 0, 0, 4, 0],
 [0, 2, 0, 0, 5],
 [0, 0, 3, 0, 0]]
paper niche
#

with scipy you can use coo_matrix to construct a sparse matrix using row,col,data then call .todense() after
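
A sketch of that suggestion on the arrays from the question, with a plain NumPy fancy-indexing version alongside for comparison. Note that `.toarray()` is used here instead of `.todense()` so the result is a regular `ndarray` rather than a `numpy.matrix`.

```python
import numpy as np
from scipy.sparse import coo_matrix

a = np.array([1, 2, 3, 4, 5])   # data
b = np.array([0, 1, 2, 0, 1])   # target row for each element of a
cols = np.arange(len(a))        # each element keeps its own column position
shape = (int(b.max()) + 1, len(a))

# SciPy route: build from (data, (row, col)) triplets, then densify
dense = coo_matrix((a, (b, cols)), shape=shape).toarray()

# NumPy-only alternative: assign through paired row/column index arrays
plain = np.zeros(shape, dtype=a.dtype)
plain[b, cols] = a

print(dense)
```

The sparse route pays off when the matrix is large and mostly zeros; for something this small the direct assignment is simpler.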

#

@oblique wyvern

rigid summit
#

Is anyone willing to help me troubleshoot my Anaconda installation? I can't update it OR my Spyder program using either: conda update anaconda OR conda install spyder=4.0.0

#

It's really frustrating, especially because uninstalling and installing seems to take so long

fallen anchor
#

@lapis sequoia yes, a csv

fierce ravine
#

Would using PCA for dimensionality reduction be a good method for a dataset with 600 features?

#

I’m going to use clustering on the dataset, but I feel as though I am getting really weird results

oblique belfry
#

It won’t hurt. Try and see what happens.

fierce ravine
#

I did, but I’m having some issues with plotting it. If I want to keep my variance at greater than 0.85, I still have over 100 features

#

If I reduce it to 0.59 i have 3 features which is nice to work with, but wouldn’t the data be extremely muddy?
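
The trade-off between retained variance and component count can be inspected directly. The sketch below uses synthetic data (5 hidden factors spread over 60 noisy features), not the 600-feature dataset from the question. It relies on scikit-learn's `PCA` accepting a float `n_components`, which keeps the fewest components whose cumulative explained variance exceeds that fraction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 5 latent factors mixed into 60 features, plus a little noise
latent = rng.normal(size=(300, 5))
X = latent @ rng.normal(size=(5, 60)) + 0.1 * rng.normal(size=(300, 60))

# A float in (0, 1) asks PCA for enough components to pass that variance fraction
pca = PCA(n_components=0.85, svd_solver="full").fit(X)

kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
print(f"components kept: {kept}, variance explained: {explained:.3f}")

# Per-component ratios show where the variance curve flattens out
print(np.round(pca.explained_variance_ratio_, 3))
```

If a real dataset needs 100+ components to reach 0.85, that suggests the variance is spread thinly across many directions, and a 3-component projection is better treated as a plotting aid than as the input to clustering.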

primal ravine
#

Hey, I'm new to machine learning. I'm learning about multiple linear regression, decision trees, support vector machines, and random forest classification.
It seems the course is directed slightly towards business, but I want to know if all this information is applicable to AI in robotics

#

for example programming a robot to avoid obstacles

#

or pick up certain objects

#

any help would be appreciated

oblique belfry
#

Look into reinforcement learning.

primal ravine
#

Okay

#

but

#

i would like to know if multiple linear regression, decision trees, support vector machines, and random forest classification, this sort of thing, will help me in applying machine learning to AI in relation to, say, robotics

fierce ravine
#

I think I’m just gonna run with 0.59 variance. It works RenShrugGif

primal ravine
#

Can anyone confirm for me that content such as multiple linear regression, decision trees, support vector machines, and random forest classification will help me in applying machine learning to AI in relation to, say, robotics?

silent swan
#

you've got a long way to go to get to robotics

oblique belfry
#

@primal ravine Potentially...but I don’t know of people currently doing that.

Those are good techniques to know about. But, neural networks are the new norm in that field. You need to learn a super complex non-Linear function. Only neural nets can capture that.

#

@silent swan Agreed.

silent swan
#

those are the correct starting points, but it'll still be far away from your goal

#

but if this is like something you want to do over the course of like, 3 years, yes, that's where to start

#

@rigid summit post your error message, and are you using windows?

rigid summit
#

I don't know if I'll get a response, there are 1000+ open issues apparently

#

I am using windows

lapis sequoia
#

is anyone alive

#

I need help understanding this graph..

#

why does the % change keep decreasing

#

does that mean media is only part of digital ad spending?

lapis sequoia
#

nvm I figured it out.. but I'm having trouble understanding difference between ARIMA and ARMA

primal ravine
#

Hey, can someone explain to me how deep learning or neural networks are used in robotics? By that I mean, how do you actually allow the robot to make its own decisions using your intended code? Do they all require an Arduino? Is there a more powerful alternative?

oblique belfry
#

Reinforcement learning. Lol

#

Not all of them have an Arduino. But they have some type of sensors that feed into some type of computer. It depends on what the task is. Object detection requires a different architecture than movement.

#

Robotics is a VERY big field. And each robotics problem can be subdivided into many miniature problems that might use multiple neural networks.

oblique belfry
#

My coworker is doing some data transformation on Temple's eeg corpus.

Man... that is some gnarly code, which is all done in an IPython terminal. I am not a fan of running long-running jobs in IPython/Jupyter.

jolly briar
#

@oblique belfry looks like it's been copy pasted from a script no?

#

i use this workflow quite often

oblique belfry
#

Eh.....knowing him....doubt it.

#

He trains all of his neural nets (ones that take over 5 hours to complete an epoch) through Jupyter. I have had too many kernels crap out on long computations. The above script will take a day and a half to complete. He must have better luck with Jupyter than I do.

fierce ravine
#

I’ve been looking online and just want to make sure I’m right about this

#

For unsupervised, clustering methods we don’t need to split between training and testing data?

silent swan
#

I'm a fan of doing whatever you want in notebooks, but once things move into code, unless it's explicitly a script, you gotta start cleaning things up

#

@fierce ravine that intuition is somewhat correct, but there can be more subtlety around it

#

certainly generally you don't really care about evaluating against anything

#

or in the case of visualization methods, you just want any decent representation of your data so feel free to run over and over on the whole dataset

fierce ravine
alpine stream
#

Hi! I have an NLP task. There is a text (telephone conversations). Voice is already converted into text and is divided into agent and customer paragraphs. I need to understand which approach is best for the following tasks:

  1. Who is the customer and who is the agent?
  2. Customer Name
  3. The topic of conversation
  4. Promises made by the operator to the customer (for example, "I call back tomorrow")
  5. Negative Sentiment (if there is something in the conversation that the subscriber is not happy with)
    I am just trying to understand how to handle it. Is it possible to create some kind of general approach for this? If yes, which packages (maybe BERT), publications, or books should I look at?
silent swan
#

best to think in terms of "what is the output" for each task

#

e.g. 3 is "document" -> "topic" classification

#

1/5 are per-sentence classification

#

(some of these can be reframed but this is a starting point)

#

2 is sort of span prediction, potentially 4 can be framed the same way as well

#

after that, go look for which models are suitable for each

#

fwiw, BERT (with additional modules) is suitable for all of them

#

but BERT is also much more computationally intensive than simpler models

lapis sequoia
#

I am running a GPT-2 text generation model that looks something like this

```
gpt2.generate(sess,
              model_name=model_name,
              prefix=pbuffer[0],
              return_as_list=True,
              length=120,
              temperature=flavorslider.value,
              top_p=0.9,
              truncate='<|e',
              nsamples=1,
              batch_size=1,
              )
```
#

where sess = tensorflow.compat.v1.Session

#

what I want to do is clear the session graph and variables each run to prevent memory leaks

#

so basically after this runs, collect the output, then close the session

#

and just before it runs the next time, open a fresh session

#

anyone know how to do this?

silent swan
#

does session.close() not suffice? or using a context manager

lapis sequoia
#

it does work, but it returns an error about attempting to reuse a closed tf session

#

not sure why, since its at the end of the code

silent swan
#

do you create a new session each time?

lapis sequoia
#

yeah thats what im trying to do

silent swan
#

I guess I'm confused about the issue here

#

if you could post a code snippet?

lapis sequoia
#

its rather large

#

i can send a link to the colab

silent swan
#

sure

lapis sequoia
#
```
def reset_session(sess, threads=-1, server=None):
    """Resets the current TensorFlow session, to clear memory
    or load another model.
    """

    tf.compat.v1.reset_default_graph()
    sess.close()
    sess = start_tf_sess(threads, server)
    return sess
```
#

this might be working

silent swan
#

oh it's ai dungeon

lapis sequoia
#

if you double click the "Enter Dungeon" header

#

its my own dungeon 👅

silent swan
#

yea anyway you can do the above which sounds like it should work

#

all you want to do is either 1) close and create a new session each time

lapis sequoia
#

yeah it wasnt working before but now im not getting an error message

silent swan
#

or 2) use a context manager, which does that for you

lapis sequoia
#

ok, that was what i thought originally, but the error messages had me confused

#

glad to know i was right

#

thank you

lapis sequoia
#

always this

#
```
gpt2.start_tf_sess(threads=-1, server=None)
  with sess:
    message = gpt2.generate(sess,
                model_name=model_name,
                prefix=pbuffer[0],
                return_as_list=True,
                length=120,
                temperature=flavorslider.value,
                top_p=0.9,
                truncate='<|e',
                nsamples=1,
                batch_size=1,
                )
    return
```
silent swan
#

instead do

#
with gpt2.start_tf_sess(threads=-1, server=None) as sess:
    blahblah
lapis sequoia
#

yes i have switched to that

#

was getting initialization errors so had to add this:

#
```
init_op = tf.global_variables_initializer()
```
silent swan
#

yep, you need to reinitialize the global variables for a new session

lapis sequoia
#

sess.run(init_op)

#

it runs, but i get crazy garbled output

#

trying a slightly different arrangement

oblique belfry
#

That’s why I don’t like TF.

fallen pendant
#

new to this. i just finished my first python course and i have been practicing on these two apps. which one do you think is better?

granite marsh
#

@fallen pendant First of all, congrats on completing your first Python course 👍
My ranking (by level of difficulty/complexity) would be:
HackerRank (improves your basics)

LeetCode (builds your knowledge through medium-complexity problems)

HackerEarth (gives you deeper knowledge, and you can also find internships or jobs)

SPOJ (very tough problems)

Many others exist, but practicing on these is more than enough.
If you still have any doubts, you can contact me at:
https://www.linkedin.com/in/tejas-s-401ab4185

fallen pendant
#

@granite marsh thanks so much

jolly briar
#

anyone done parallel processing with R? I'm wondering whether it's more straightforward than in Python

#

seems fairly straightforward.

drifting hemlock
#

I think this is more of an operational question than a data science one, but I'm a bit curious: do you guys implement a workflow in your work as data scientists?

#

I'm not talking about methodologies (OSEMN, CRISP, ASUM, etc.), more like a team workflow. I've heard of Agile, but that's complicated to implement in a data science team.

rigid summit
#

Anybody a seasoned user of Anaconda? The tech support is terrible to non-existent so far for me. I have an issue described here:

https://stackoverflow.com/questions/59419880/how-do-i-change-the-directories-in-anaconda-having-issues-updating

lapis sequoia
#

Try deleting the sitepackages -> %USERPROFILE%/AppData/Roaming/ -> Python/../site-packages

drifting hemlock
#

In which environment are you trying to update your packages? The base environment?

lapis sequoia
#

Check with "conda info -e"

drifting hemlock
#

Anaconda is not that great at handling path variables at install/uninstall. You should check them out and make sure your user path is not referenced for Anaconda:

rigid summit
#

Oh awesome, thanks let me check

#

It would be amazing if I can finally get this to work...

#

the base is c:\Anaconda

#

@drifting hemlock How do I pull that path up?

#

I mean, that screen...

drifting hemlock
#

Then click in PATH and click the Edit... button

rigid summit
#

Alright, looks like the "Path" does reference my account/user (with the space) for Python

#

What should I do?

drifting hemlock
#

Just change it to reference the folder in which you installed Anaconda, for example change:

C:\Users\Your Username\Anaconda3\Library\usr\bin

to

C:\Anaconda3\Library\usr\bin
#

Oh, and remember to close and reopen the console so it can refresh PATH
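
#

If clicking through the PATH editor gets tedious, here's a minimal sketch that lists the entries worth a second look. The function name `suspicious_path_entries` is made up, and "under the user profile or contains a space" is just my assumed rule of thumb for what might trip Anaconda up, not an official check.

```python
import os


def suspicious_path_entries(path_value, user_profile, sep=os.pathsep):
    """Return PATH entries that live under the user profile or contain a space."""
    flagged = []
    for entry in path_value.split(sep):
        entry = entry.strip()
        if not entry:
            continue
        # Windows paths are case-insensitive, so compare in lowercase.
        under_profile = bool(user_profile) and entry.lower().startswith(user_profile.lower())
        if under_profile or " " in entry:
            flagged.append(entry)
    return flagged


if __name__ == "__main__":
    # USERPROFILE is set on Windows; it may be missing elsewhere.
    profile = os.environ.get("USERPROFILE", "")
    for entry in suspicious_path_entries(os.environ.get("PATH", ""), profile):
        print(entry)
```

Expect some harmless hits such as `C:\Program Files\...`; the ones to care about are the Python/Anaconda entries.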

rigid summit
#

Alright 👍 checking to see if that worked

#

shoot, no dice... I might have not done it properly - the Paths didn't say Anaconda, they said Python, so that might be one issue... also the user config file, populated config files, package caches, and envs directories all still point through my user name

drifting hemlock
#

Then the issue is definitely your path. Unfortunately Anaconda sucks at updating the path environment. You have two options, I believe:

  1. Updating PATH manually. This means removing the entries referencing Anaconda in your Path variable and then re-creating them. That can be a pain in the ass.
  2. You can remove all the entries referencing Anaconda in your PATH variable and then reinstall Anaconda, making sure to select "Add Anaconda to my PATH environment variable".

Just full disclosure: playing with Path can lead to undesirable results, so have a backup of your path just in case.

rigid summit
#

Thanks 🙂 ... how undesirable?

drifting hemlock
#

Well, it depends on what you have there. Anyway, you can easily make a backup and restore it if anything goes wrong: open regedit, go to Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Environment, click on Path, and save the contents in a text file.
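
#

Another way to keep a copy around, as a minimal sketch in Python. The helper names `backup_path` and `restore_value` and the file name are made up. Note that this saves the PATH the current process sees (system and user values merged, with things like %USERPROFILE% already expanded), so it is not a byte-for-byte copy of the registry value.

```python
import os
from pathlib import Path


def backup_path(dest, path_value=None, sep=os.pathsep):
    """Write each PATH entry on its own line and return how many were saved."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    entries = [e for e in path_value.split(sep) if e]
    Path(dest).write_text("\n".join(entries) + "\n", encoding="utf-8")
    return len(entries)


def restore_value(src, sep=os.pathsep):
    """Rebuild the single PATH string from a backup file."""
    lines = Path(src).read_text(encoding="utf-8").splitlines()
    return sep.join(line for line in lines if line)
```

Calling `backup_path("path_backup.txt")` before editing anything gives you a readable list you can paste back by hand if something goes missing.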

rigid summit
#

Alright, I'll give it a shot. Thanks very much for your help.

#

Oh, one thing before I get started - I accidentally deleted "Path" that included an entry with this %USERPROFILE% in it... the only other two were for python... I'm hoping if I restart my computer it will come back...

#

Should have made a backup, haha

drifting hemlock
#

I really hope so too hahaha

#

Know what? Let me show you how to back it up easily without getting into the registry.

#

Hold on

rigid summit
#

Perfect!! Thanks. Crossing my fingers that this works

rigid summit
#

Damnit. I'm still getting the same error when I try to install updates. According to conda info, the directories are all still the same too (with my user name in them)

#

The Paths are all correct now though.

drifting hemlock
#

what environment are you using? Base? do conda env list

#

You should get something like:

PS C:\Users\Franccesco> conda env list
# conda environments:
#
base                  *  C:\Users\username\Anaconda3
getaltname               C:\Users\username\Anaconda3\envs\getaltname
rigid summit
#

Just: C:\Anaconda

lusty trellis
#

hey guys, has anyone here worked with Scrapy for scraping? i am trying to scrape an infinite scrolling page but i don't know how to do it

lost sinew
#

does anyone know how to get the historical 1-minute data from january 2018 for bitmex without getting api banned?

oblique belfry
#

"Code submission is one of the elements I’m most impressed with. A year ago, 50% of accepted NeurIPS papers contained a link to code; this year, we’re at 75%."
Granted, this is just for NeurIPS, but I hope this trend continues.

lapis sequoia
#

I need help creating an SVM model in pure numpy and python

oblique belfry
#

bumpy? I hope you mean numpy.

#

I get that.

#

Our company tried to reproduce results from EEG papers with the given data (Temple's corpus) and we still couldn't achieve the results. We followed the specs of the papers (had the same LSTM/Convolution layers/hyperparameters), but the results didn't match.

#

Showing the algorithm can get us one step closer.

#

The way in which people manipulate their data before training can significantly alter the data, to the point that it isn't generalizable.

#

hmm....I never thought of doing it with different random seeds. I think that should be a requirement as well.

#

Yeah. I get why they might not be able to publish their dataset. But publishing their code can help mitigate issues like this.

#

Ha...use a random number generator to pick seeds for other random generators.

I love it.
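
#

A minimal sketch of that idea: one master seed drives a generator that hands out the seeds for each repeated run, so the whole set of runs stays reproducible. `make_seeds` and `run_experiment` are made-up names, and the "experiment" is just a noisy number standing in for a real training run.

```python
import random


def make_seeds(master_seed, n):
    """Derive n experiment seeds from one master seed, reproducibly."""
    master = random.Random(master_seed)
    return [master.randrange(2**32) for _ in range(n)]


def run_experiment(seed):
    """Hypothetical stand-in for a training run; returns a noisy 'score'."""
    rng = random.Random(seed)
    return 0.8 + rng.uniform(-0.05, 0.05)


seeds = make_seeds(master_seed=2019, n=5)
scores = [run_experiment(s) for s in seeds]
mean = sum(scores) / len(scores)
print(f"seeds={seeds}")
print(f"mean score={mean:.3f}, spread={max(scores) - min(scores):.3f}")
```

Reporting the mean and spread over several seeds, plus the master seed itself, would let someone else rerun exactly the same set.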

rigid summit
#

Sorry for the repeat for those who have scanned over this before: Anybody a seasoned user of Anaconda? The tech support is terrible to non-existent so far for me. I have an issue described here:

https://stackoverflow.com/questions/59419880/how-do-i-change-the-directories-in-anaconda-having-issues-updating

silent swan
#

in the case of medical data, a lot of it comes down to preprocessing

#

that's why they're especially hard to reproduce (in addition to all the big datasets being unsharable)

#

if you're open to reinstalling

#

and you're still running into issues

#

you need to go track down wherever the bad paths are coming from

jolly briar
#

@oblique belfry if there's code and no data though is it reproducible ?

oblique belfry
#

It’s better than nothing.

You can at least check out their methods and validate the logic behind it.

#

I’ve also tried to replicate papers that used public datasets.

I think you need both data and code to replicate results. Data will be the hardest to get, code is pretty simple. I’d rather have something than nothing.

deft harbor
#

What database should I study first?

#

SQL?

oblique belfry
#

What type of job do you want?

#

I know the basics of SQL, but if I have a project that really utilizes it, I’ll use SQLAlchemy or Orator to query data. I mostly use Mongo at work. But, we do a lot of machine learning and AI.

#

Data science is a big field that honestly should become more separated. A data scientist at one company may look completely different at another company.

acoustic mural
#

as someone in industry who sees some hiring decisions, SQL on the resume never hurts

#

(unless it's a lie)

deft harbor
#

No specific job, more looking to expand my skill set for my own projects that might be useful later down the road. I was thinking SQL, and most of the tasks would be machine learning. At least at first, until I get a better handle on things.

oblique belfry
#

A lot of positions I applied for wanted SQL experience. But, those were jobs that I wasn’t very interested in.

bright heron
#

Can anyone recommend any guides on how to interact with an API using Python?

deft harbor
#

The api documentation?

fallen anchor
#

which API?

jolly briar
#

@oblique belfry what does SQL experience mean though? I'm never sure how much is typically expected to satisfy that, i guess it's "how long is a piece of string".

#

personally i've never needed anything beyond a join

#

no subtables, views, or whatever else
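
#

For anyone wondering what "just a join" looks like in practice, here is a minimal sketch using Python's built-in `sqlite3` with an in-memory database. The `topics` and `docs` tables and their rows are made up purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE topics (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, topic_id INTEGER);
    INSERT INTO topics VALUES (1, 'sports'), (2, 'politics');
    INSERT INTO docs VALUES
        (1, 'Match report', 1),
        (2, 'Election recap', 2),
        (3, 'Transfer news', 1);
""")

# Pair every document with the label of the topic it points to.
rows = conn.execute("""
    SELECT docs.title, topics.label
    FROM docs
    JOIN topics ON topics.id = docs.topic_id
    ORDER BY docs.id
""").fetchall()

for title, label in rows:
    print(title, "->", label)
conn.close()
```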

oblique belfry
#

I don’t know either

drifting hemlock
#

Has anyone noticed that in Data Science / Analysis there's a ton of information available to improve your skills, but not much information on the "operational" side of the industry? For example, how to work with teams, how to implement a scalable workflow in a corporate environment, how the different methodologies fit into the way a data science team is organized.

#

At least for me it's been difficult to get this kind of information on the internet.

deft harbor
#

Management, Dev ops and "big data" covers a lot of that

drifting hemlock
#

Yeah, but I've been feeling like we're kinda lost in that scenario. We know the process that comes with gathering data, scrubbing, feature engineering, modeling and deployment, but we tend to forget that all of that has to happen somewhere toolsets and teams in a corporate environment can co-exist.

#

I think that's harder for small teams though.

#

In my work environment we're still trying to figure this out, so for example we have a binary classification task:

  • Where do we document it? Let's say Jira.
  • Where do we perform the EDA/Feature Engineering? Jupyter notebooks, right.
  • Where and with what do we build the model? sklearn.
  • Deployment? IBM Watson or a simple API in a cluster.

All of that comes with a price: depending on what tools you use, you're going to take on a lot of technical debt, or sacrifice collaboration if you do the development offline, or even sacrifice reproducibility.

#

I know there's no tried-and-true workflow in which all of these can be implemented, because the industry is still very young, but it would be neat to have more direction. * rant over lol *