#data-science-and-ml


misty flint
#

google's crash course on GANs is pretty good

grave frost
#

yeah, but "crash courses" remain what they are - a quick overview. I would discourage people from actually using them, and instead encourage deep dives in the topic

misty flint
#

eh if you just want a quick overview then thats what they are for

#

you cant deep dive into everything

#

you should deep dive into what youre actually interested in tbh

steady basalt
#

How the hell did a pro get 0.86 accuracy. I can only manage about 0.73 after doing all the steps

#

0.86 is on training data

thin palm
#

Hi Python gang, I have a take home assessment for this job interview and was wondering if I can get some help on your thoughts of what to look for when doing data exploration? What's everyone's top 5 things they look for when examining data? Cheers!

misty flint
thin palm
#

So I could make a model that says how likely it was to succeed

thin palm
misty flint
#

(if the data is linear)

thin palm
misty flint
#

best of luck bud. some of those take homes can eat up a lot of time

#

🕯️

thin palm
misty flint
#

its good to have a standard approach

thin palm
misty flint
#

as that can help with this

pine flare
#

Hey, I just joined this server, I'm 17 and want to get started in data science and AI, how would y'all go about doing this? I'm learning the matplotlib and pandas libraries right now, I only started learning Python two and a half months ago.

serene scaffold
hollow flare
serene scaffold
hollow flare
#

Ok, thanks

gusty forge
#

Hey

#

OpenCV lags so much when I run an ipynb notebook. Most of the time, the frame gives a "not responding" message

#

What do I do

mint palm
#

is this good^

#

?

mint palm
#

initially val accuracy is higher than train but it gets better later... is it ok?

spark sonnet
#

umm vids to learn python?

lapis sequoia
arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

steady basalt
#

would anyone know why my model performs worse on unseen data after tuning than before?

#

initial score is say 0.7 on cv, final test scores are more like 0.68

#

cv is performed on training set

#

@mild dirge I feel as though holding back test data is making my model seem WORSE than it should be

#

especially KNN

#

dropped to 0.58

#

auc

#

altho Random forest is pretty good

#

worse still, I am getting like 0.52 for precision on target feature being the 'positive' value

#

which is really bad in cases like detecting disease

blissful fulcrum
#

Anyone can help me in this ?

steady basalt
#

@mild dirge another thing is the hold out testing set is using very imbalanced data, so of course precision is high for predicting the class which has like 4x more values than the other class

blissful fulcrum
lapis sequoia
blissful fulcrum
#

Thanks @lapis sequoia got the answer 👍

blissful fulcrum
#

One More Question, I have a data frame and I wanted to generate a new column for colour codes which starts from red for the least value of Opportunity and moves toward green for the highest value of Opportunity

#

dataframe ☝️

#

Interpolation of colors basically

steady basalt
#

You want to colour code the opportunity column?

#

For plotting or ?

blissful fulcrum
#

yes

#

I need color codes for using in front end of my application so i can minimize the time for rendering cause data set so big

steady basalt
#

You mean you’re going to categorise the column into a lower number of colour code?

blissful fulcrum
#

yes you can say like that

steady basalt
#

Do u have the colour codes you want?

#

You need how many?

blissful fulcrum
#

I want values between red and green depends on Opportunity

#

top least val is dark red and top large val is dark green
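A minimal sketch of the red-to-green interpolation being asked about, assuming the column is numeric and that hex strings are what the front end wants. The helper name `red_to_green_hex` and the two RGB endpoints are illustrative choices, not something given in the question.

```python
import pandas as pd

# Endpoint colours as (R, G, B); these particular shades are an assumption.
DARK_RED = (139, 0, 0)
DARK_GREEN = (0, 100, 0)

def red_to_green_hex(values: pd.Series) -> pd.Series:
    """Map each value to a hex colour between dark red (min) and dark green (max)."""
    lo, hi = values.min(), values.max()
    span = hi - lo
    # Normalise to [0, 1]; a constant column maps everything to the midpoint.
    if span == 0:
        t = pd.Series(0.5, index=values.index)
    else:
        t = (values - lo) / span

    def blend(frac) -> str:
        frac = float(frac)
        channels = [
            int(round(c0 + (c1 - c0) * frac))
            for c0, c1 in zip(DARK_RED, DARK_GREEN)
        ]
        return "#{:02x}{:02x}{:02x}".format(*channels)

    return t.map(blend)

df = pd.DataFrame({"Opportunity": [10, 55, 100]})
df["colour"] = red_to_green_hex(df["Opportunity"])
print(df)
```

Blending straight through RGB gives a brownish midpoint. If the front end only needs a handful of distinct colours, binning the normalised value first would keep the number of colour codes small.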

wicked grove
#

Can i penalize one of the classes of my model with this function? tf.nn.softmax_cross_entropy_with_logits

potent plank
#

hello

#

!voiceverify

indigo cove
#

Hello

#

Anyone willing to help with coding a special K-means algorithm?

mild dirge
#

special?

indigo cove
#

I am trying to code this

#

But when using

#

dist = cdist(X,np.array([D[i,:]]).T,axis=1)

#

I get an error

#

ValueError: XA and XB must have the same number of columns (i.e. feature dimension.)

mild dirge
#

XA and XB?

#

there are no XA and XB in that snippet

indigo cove
#

It is a common error

#

when using cdist

mild dirge
#

alright, so print the shape of X and np.array([D[i,:]]).T and check if they are what you expect

indigo cove
#

I am just thinking, are there other ways to calculate the minimum distance between data points and clusters

#

since I only find errors

#

Using different methods

mild dirge
#

Well it seems that there is a shape mismatch between the two points

#

so it can't calculate the distance

#

Some simple distance functions are e.g. Euclidean or Manhattan
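A small sketch of the shape check suggested above, assuming `X` holds one data point per row and `D` holds one cluster centre per row (the random data here is made up). `cdist` needs both inputs to be 2-D with the same number of columns. In the snippet from the question, the trailing `.T` likely turns a `(1, d)` row into a `(d, 1)` column, which would explain the error.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))   # 6 data points, 2 features each
D = rng.normal(size=(3, 2))   # 3 hypothetical cluster centres, 2 features each

print(X.shape, D.shape)       # both must have the same number of columns

# Distance from every point to every centre: shape (6, 3).
dist = cdist(X, D, metric="euclidean")

# Index of the nearest centre for each point.
labels = dist.argmin(axis=1)

# Distance to a single centre i: keep it 2-D with D[[i], :], no transpose.
i = 0
dist_to_one = cdist(X, D[[i], :])   # shape (6, 1)

# Manhattan distance is available through the same call.
dist_l1 = cdist(X, D, metric="cityblock")
```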

indigo cove
#

Anyone that can help me with some k-means coding

#

I am stuck again

serene scaffold
hardy prism
#

Long shot - anyone here use JS & PYTHON to script/automate in Microsoft Excel?

serene scaffold
#

you can read excel data into python code with pandas, do all the transformations, and save the result (and any intermediary parts, if you want) back to excel.

steady basalt
#

thats why no point using R 😆

hardy prism
serene scaffold
steady basalt
haughty ibex
#

I have strings that are formatted in different ways and I want them all to have the same format. For example, I have a string like "7 MY STRING" but I want it to be "7th My String", using ordinal suffixes. I'm thinking the best way to do this is to split the column after the integers, add the ordinal suffixes (1st, 2nd, 3rd, etc.), then use the title() method on the strings and rejoin the column. I'm just not sure how I would implement it in pandas.
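One way the plan above could look in pandas, as a sketch. It assumes every string starts with an integer followed by a space; the helper `ordinal` and the sample values are made up for illustration.

```python
import pandas as pd

def ordinal(n: int) -> str:
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"            # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

s = pd.Series(["7 MY STRING", "1 first item", "22 ANOTHER one", "113 big"])

# Split each string into the leading integer and the remaining text.
parts = s.str.extract(r"^\s*(?P<num>\d+)\s+(?P<text>.*)$")

# Add the suffix to the number, title-case the text, and rejoin.
result = parts["num"].astype(int).map(ordinal) + " " + parts["text"].str.title()
print(result.tolist())
```

Rows that do not match the pattern come out of `str.extract` as NaN, so `astype(int)` would raise on them. Real data probably needs those rows filtered or handled first.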

wicked grove
#

Hello,i have a 3 class classification problem

#

I want to increase the loss for one of the classes

#

Can i penalize one of the classes of my model with this function? tf.nn.softmax_cross_entropy_with_logits

#

I can't understand what the logits argument does
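For context: logits are the raw, unnormalised scores a model outputs for each class before softmax turns them into probabilities. The sketch below is plain NumPy, not the TensorFlow function itself. It only illustrates the idea of computing the per-example softmax cross-entropy and then scaling it by a weight looked up from the true class. The `class_weights` values are hypothetical.

```python
import numpy as np

def weighted_softmax_cross_entropy(logits, labels, class_weights):
    """Per-example softmax cross-entropy, scaled by the weight of the true class.

    logits: (n_examples, n_classes) raw scores
    labels: (n_examples,) integer class ids
    class_weights: (n_classes,) multiplier per class
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability assigned to the true class of each example.
    per_example = -log_probs[np.arange(len(labels)), labels]
    weights = np.asarray(class_weights, dtype=float)[labels]
    return per_example * weights

logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.1, 3.0]])
labels = np.array([0, 2])

plain = weighted_softmax_cross_entropy(logits, labels, [1.0, 1.0, 1.0])
weighted = weighted_softmax_cross_entropy(logits, labels, [1.0, 1.0, 3.0])
print(plain, weighted)   # mistakes on class 2 now cost three times as much
```

The same pattern should carry over to a framework loss: compute the unweighted per-example loss, multiply by a per-example weight derived from the label, then average.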

inland mantle
#

I feel like I asked this before but what is the difference between machine learning and deep learning

misty flint
#

the difference between deep learning and machine learning can be blurry but you should consider that deep learning is a special type of machine learning where typically the algorithm does its own "feature extraction" vs. doing it yourself

quick eagle
#

trying to get help: parsing data from a .fit file into pandas DF - is that something for this forum or elsewhere?

misty flint
inland mantle
#

Which would be faster when trying to train data sets

haughty ibex
#

if i understand correctly an example of machine learning would be like a streaming platform recommending you something based on your interests and then deep learning would be like how google developed AlphaGo to beat the best Go player in the world.

quick eagle
#

I'm in help-coconut, but do people just show up there, or do I 'recruit help; from here?

serene scaffold
#

your question does not appear to be data science related, however

quick eagle
#

yeah, its more about basic tuples/list/for loops.... what's the better place to start?

serene scaffold
#

nevermind, I see that it's about pandas

wicked grove
#

Hello, can i use a pytorch loss function with keras model.compile?

desert oar
#

very unlikely

wicked grove
#

i have a model for 3 class classification and want to penalise one class i thought of using tf.nn.softmax_cross_entropy_with_logits

steady basalt
hardy prism
hardy prism
agile cobalt
#

and what do you mean by script?

steady basalt
#

The only benefit to using JavaScript for this is if you are good with it and can't use Python

#

Unless I’m misinterpreting what you want to do

misty flint
mint palm
#

when making a NN for classification of lets say A, B, C. is it neccessary to have number of example in ratio 1:1:1 ?

agile cobalt
#

it doesn't have to be 1:1:1, but if it's too unbalanced, you'll have to pay extra attention to that when evaluating the model

mint palm
#

I checked its 350000:120000:200000

mild dirge
agile cobalt
#

converting to percentages 52% ; 18% ; 30%

mild dirge
#

I had a dataset with a distribution somewhat like this, it had 400 classes, most having around 120 images, but some a bit more

#

And not balancing the data gave better results

#

Than simple down-sampling to 120 images

mint palm
mild dirge
#

If you are looking at an average accuracy over the test set then that would be good imo yes

#

or another averaged performance measure

mint palm
mild dirge
#

Otherwise if your data has a 90000:2 ratio and your test set also has a similar ratio, then obviously it can get a very high average accuracy without being able to separate the classes

#

As it would just guess the most common class every time, or at least a lot more often

mint palm
#

I want it to be exceptional with no redundancies

mint palm
#

May be thats why test train split wasnt good enough

agile cobalt
#

you might want to use balanced-accuracy instead of accuracy or something like that if your scoring function supports such

agile cobalt
#

a scoring metric

mild dirge
#

yeah, macro average vs micro average performance is something to look into

#

macro computes the performance per class and then averages those per-class scores to get a final score
micro average takes the average over the entire set, whether or not it has been balanced

#

iirc*

#

macro treats all classes as equally important
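A toy illustration of the macro vs micro distinction, in plain Python with made-up labels. It uses recall as the metric. For single-label classification the micro average works out to plain accuracy.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}

# Imbalanced toy data: 8 examples of A, 2 of B.
y_true = ["A"] * 8 + ["B"] * 2
# A lazy model that always predicts the majority class.
y_pred = ["A"] * 10

recalls = per_class_recall(y_true, y_pred)

# Macro: average the per-class scores, so every class counts equally.
macro = sum(recalls.values()) / len(recalls)

# Micro: pool all examples first, so large classes dominate.
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(recalls, macro, micro)
```

Here the micro score looks respectable while the macro score exposes that class B is never found.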

mint palm
#

Actually a simple project has taught me a lot, no theory class can teach..

mild dirge
#

Yeah for sure, google is a dang good professor 😛

steady basalt
mild dirge
#

No I study AI

steady basalt
#

I have come to conclude your strategy of taking a test set isn’t really useful in my work because the dataset doesn’t represent a population

#

I did take the set anyway to see how it acts on balanced vs unbalanced

steady basalt
#

Because it’s labelled data

#

And it’s showing many people with a condition and not as many without

#

So maybe in this case I should not split? Well if I had enough data I’d split and rebalance towards reality

#

But that takes a lot more data than I usb

#

Have

#

Also my friends lecturer said you shouldn’t do this at all

#

U shud balance test data

mild dirge
#

Yeah, but the performance of your model should not be judged on the accuracy on the real population distribution.
Say you are a doctor and people come to you for diagnosis, as they might have a disease.
99% of the time the people have no harmful disease, so saying no to all the people would lead to an accuracy of a whopping 99%!
But you want to know how many of the patients that were sick could have been diagnosed before it got bad

steady basalt
#

So you’re saying to not balance at all? That would mean it’s scored purely against the balance of data done in the original testing which we don’t know what it was

#

Why not judge it on real population? You can see it’s use as a screening tool

mild dirge
#

No, I am telling you what problems you run into when judging the performance of your model on the unbalanced population data

steady basalt
#

In this case, as you say, it’s only going to be an accurate predictor of performance on people who probably have the disease

mild dirge
#

You can use unbalanced data to test your performance on, but you should consider giving all classes equal weights towards the performance measure

steady basalt
#

It’s binary

mild dirge
#

then both classes, the point still holds

steady basalt
#

That’s why I’ve balanced the data for model design

mild dirge
#

right, thats good

steady basalt
#

But testing it against an untouched holdout

#

That will only show performance on a group of people who mostly have the disease

#

Why not show how it does on a balanced population or better still a population likely to come to testing where more do not have disease

#

As per your advice it’s only compared against unseen data from original data which is a group that mainly has disease

#

How’s that more useful

#

Wouldn’t it be wise to give precision scores for different balances

mild dirge
#

Because your goal is not per se to optimize the accuracy of the model on the population, but to optimize recall

steady basalt
#

Which I do prefer in health data over accuracy anyway

mild dirge
#

Making sure not too many people who have a disease are told they're healthy

steady basalt
#

The precision is quite good on diagnosing but it’s really bad on detecting those without disease

#

I’d say it’s more acceptable than other way around

mild dirge
#

precision would be seeing how many of the people that you diagnosed as having cancer, actually have cancer

#

Which I say is less important

steady basalt
#

Exactly

#

No though you want to make sure as many WITH cancer get treated

mild dirge
#

I think making sure someone who is sick, will actually be diagnosed as sick and get themselves checked out

#

which is recall

steady basalt
#

Which is why precision for positives is more important than negative

#

That’s precision ?

mild dirge
#

precision is true positives / (true positives + false positives)

stone marlin
#

Precision is: "I said all these people had cancer. How many of them actually did have cancer?"

#

Recall is: "Out of all of the people who had cancer, how many did I say had cancer?"
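To make the two definitions concrete, here is a small plain-Python sketch with invented numbers: 1000 patients of whom 10 are sick, a model that calls everyone healthy, and a screening model that flags 30 people and catches 9 of the sick ones.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn

def summarise(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Of everyone flagged as sick, how many really were?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of everyone who really was sick, how many were flagged?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = [1] * 10 + [0] * 990          # 10 sick, 990 healthy
always_healthy = [0] * 1000            # says "no disease" to everyone
screening = [1] * 9 + [0] * 1 + [1] * 21 + [0] * 969

lazy = summarise(y_true, always_healthy)
useful = summarise(y_true, screening)
print(lazy)
print(useful)
```

The "always healthy" model wins on accuracy but has zero recall, while the screening model trades some precision for catching most of the sick patients.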

mild dirge
#

yeah that

#

that's more intuitive than showing the formulas haha

stone marlin
#

This graphic is the one every DS has taped to their wall, pret much:

steady basalt
#

Ah ha

#

I should replace my precision with recall then

sand cedar
stone marlin
#

Right, you're probably going to want Recall.

steady basalt
#

But still, I have a holdout set

#

Is it worth balancing that at all

#

Desperately

#

Seperately

#

I have tested it unbalanced

#

As a first step

mild dirge
#

well it's not that bad when you look at recall

steady basalt
#

And that’s showing performance on a biased population

mild dirge
#

and the class is heavily under represented in the training data

steady basalt
#

Over represented

mild dirge
#

Talking about the disease class

steady basalt
#

Yeah

#

It’s over represented

stone marlin
#

I only skimmed this, but if you're talking about doing SMITE/SMOTE with your holdout set, that's not good. You don't want to affect "real world data" by artificially inflating.

steady basalt
#

Compare to real life

mild dirge
#

Right, I just mean compared to balanced data

#

But you balanced it

steady basalt
stone marlin
#

You're literally just scoring the model on the holdout, so it's saying, "Given that I feed this 1 row, how accurately would that be classified?" If you modified your holdout in some way, you're giving your scoring (not your model) an advantage.

mild dirge
#

I disagree, and melatonin does as well

stone marlin
#

Which makes zero sense, because your model is unaffected.

steady basalt
#

Ye

stone marlin
#

You should never SM[I/O]TE the test/holdout set.

sand cedar
#

I agree, I don't understand why SMOTE would be useful here either.

stone marlin
#

It makes zero sense to do so. It will artificially inflate your metrics.

#

Like, you're basically saying, "I think my model is good at detecting the thing... lemme give it a lot of easy cases it can correctly classify, which will inflate my metric."

steady basalt
#

I don’t understand why there’s no merit to rebalancing data to get a more accurate view of real populations so you can see how good it is as a screening diagnostic

mild dirge
#

because you are just giving it "the same" (or very similar) cases over and over

stone marlin
#

To sum this up in a very concise way: Your holdout set should, as closely as possible, represent the distribution of the real data you will be feeding it.

steady basalt
#

So instead of 50 with disease and 10 without, select as a holdout the other way around

steady basalt
#

My point

#

Real data in real life

#

Wouldn’t be every 7/10 people have the disease

mild dirge
#

You either balance it and take the averaged performance measure, or look at macro averaged performance measure

#

Which treats all classes as equally important

steady basalt
#

Balance?

mild dirge
#

yes, downsample

steady basalt
#

I oversampled because I have a tiny set

mild dirge
#

not upsample

steady basalt
#

It’s going to really mess with my reliability

mild dirge
#

reliability?

steady basalt
#

But are you saying to undersample the test set

#

U said not to touch

stone marlin
#

Okay, we've got a few things going on here. The data that you're given, in general, should represent the data that you expect to collect. If this is violated, nothing else matters.

steady basalt
#

It doesn’t represent a population

#

At all

#

Nothing I can do about that have to just use the cards dealt to me

#

Unless we’re talking about a literal alcoholic hospital ward

stone marlin
#

If your current data that you're training on does not represent the data that you expect to collect, then --- you can do some things synthetically to it, as we've noted, like SMITE or SMOTE, to TRY and make it similar to the real data. This isn't great, but it does work sometimes. You'd, then, do this before you do anything else. Then you'd split into train-test/holdout. Do not do this, see below.

#

In the past, this has worked like... 25% of the time for me, for standard datasets, but I tend to use this more for imbalance than anything.

steady basalt
#

I have done that

#

Well not exactly no

stone marlin
#

Actually --- hm. Actually, someone else check me here --- I don't think SMITE/SMOTE before everything works nicely because there's gonna be data-leakage.

steady basalt
#

I took pccamels advice and did the split before smote

#

As to have a non touched test set

stone marlin
#

Yeah, I'd do exactly that, and then test on a non-SMOTE'd test set.
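A sketch of the order of operations being agreed on here: split first, then rebalance only the training part, and score on the untouched test part. To stay dependency-free it uses plain random oversampling (duplicating minority rows) as a stand-in for SMOTE, and the helper names and the 50/10 class counts are illustrative.

```python
import random
from collections import Counter

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def oversample(indices, labels, seed=0):
    """Duplicate minority-class indices at random until every class is the same size."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(idx) for idx in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced += idx + [rng.choice(idx) for _ in range(target - len(idx))]
    return balanced

labels = [1] * 50 + [0] * 10            # 50 with the condition, 10 without

# 1) Split first, so the test rows never influence the resampling.
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

# 2) Rebalance the training indices only.
train_balanced = oversample(train_idx, labels)

print(Counter(labels[i] for i in train_balanced))   # balanced
print(Counter(labels[i] for i in test_idx))         # original proportions
```

Because resampling happens after the split, no copy or synthetic neighbour of a test row can end up in the training data, which is the leakage concern raised above.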

steady basalt
#

But then we only have performance on a really unrealistic population

#

Why not inverse the balance and get a read on how it would be irl

mild dirge
#

But again, you aren't aiming for a high accuracy on the population, you want to be able to see how many of the people with diseases you can diagnose, and how many you deem healthy that actually have a disease.

stone marlin
#

You're attempting to classify something, and you should be able to still do so with your test set. Recall / Precision / whatever. It sucks that you don't have a lot of data, but that's how it goes.

#

If you're introducing more positive elements, then "missing" one of these elements won't be as big of a deal for your model's score.

steady basalt
stone marlin
#

But I kind of get what you're saying here. The dataset in general isn't representative.

#

You don't want accuracy, you prob want prec, recall, and f1, and look at those.

mild dirge
#

Sure, there's some balance between false positives and false negatives you want to consider

#

Rather false positive than false negative

stone marlin
#

You can prob find some beta for F_beta and adjust accordingly, but F_1 is usually a good inbetween.
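For reference, the F_beta score mentioned here combines precision and recall as (1 + beta^2) * P * R / (beta^2 * P + R), where beta > 1 leans towards recall and beta < 1 towards precision. A minimal sketch, with made-up precision and recall values:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.3, 0.9   # illustrative values only

f1 = f_beta(precision, recall)             # balanced
f2 = f_beta(precision, recall, beta=2.0)   # recall counts more
f_half = f_beta(precision, recall, beta=0.5)  # precision counts more
print(f1, f2, f_half)
```

For a screening problem where a missed case is worse than a false alarm, a beta above 1 would reward the high-recall model more than F_1 does.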

steady basalt
#

Btw, I took an accuracy read earlier on the training data before doing anything to it so it’s basically the same as test data in terms of distribution

#

My final model tuned on such performs 3% worse

stone marlin
#

Yeah, it should be stratified.

#

What is the size of your whole dataset?

steady basalt
#

That was with k=5

#

500+

stone marlin
#

Like... 500 - 1000?

steady basalt
#

Perhaps

#

Maybe just 500

#

Actually

stone marlin
#

I wouldn't worry too much about +/-3% to whatever metric you're using there.

steady basalt
#

So, my final report says that I literally lost accuracy after doing all this work

#

To perfect a model

#

Looks like time wasted

stone marlin
#

Sure, but how are prec / recall? Accuracy is rarely a good metric to use.

mild dirge
#

you didn't "lose accuracy", the metric wasn't correct when you artificially inflated your test data

#

it had little meaning

steady basalt
#

Should you test that and auroc before training too as a comparator benchmark

stone marlin
#

To compare what? If your problem doesn't lend itself to use the accuracy metric, there is no point.

mild dirge
#

ehh, before training the weights are randomized, so the results will probably be as good as random

steady basalt
#

To see how much improvement came from tuning etc

#

?

stone marlin
#

You should try a baseline (maybe just guessing the most frequent class, or something like that) but you need to know what metrics you'll be using before scoring.

steady basalt
#

Else we can just assume it did nothing lol

mild dirge
#

Yeah using a baseline classifier (like a small or simple algo) tells you more about how well the model performs on the problem than using a random guesser

stone marlin
#

For example, in this case, you care a bit about precision and you care about recall. So you can do, you know, two models that either always choose 0 or always choose 1. Or you can do a simple linear model. That'll be an okay baseline.

#

My baseline is usually a linear model or a random forest, and I go from there.
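The simplest baseline described above, always guessing the most frequent class, can be sketched in a few lines of plain Python. The label counts are invented and `majority_baseline` is just an illustrative helper name.

```python
from collections import Counter

def majority_baseline(train_labels):
    """Return a predict function that always outputs the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]

    def predict(rows):
        return [most_common] * len(rows)

    return predict

train_labels = [1] * 40 + [0] * 8
test_labels = [1] * 10 + [0] * 2

predict = majority_baseline(train_labels)
preds = predict(test_labels)   # only the number of rows matters here

accuracy = sum(t == p for t, p in zip(test_labels, preds)) / len(test_labels)
positives = [p for t, p in zip(test_labels, preds) if t == 1]
recall_pos = sum(p == 1 for p in positives) / len(positives)
negatives = [p for t, p in zip(test_labels, preds) if t == 0]
recall_neg = sum(p == 0 for p in negatives) / len(negatives)

print(accuracy, recall_pos, recall_neg)
```

Any tuned model should beat these numbers on the metrics chosen up front. If it does not beat them on the recall of the rare class, the extra complexity has not helped yet.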

mild dirge
#

It also tells you how complex the problem might be

stone marlin
#

But to emphasize: you need to choose your metrics before you compare anything to anything.

#

Most "real world" problems do very well with recall, precision, and [their harmonic mean] F_1.

steady basalt
#

Oh I just tried now getting a bench mark I fit the random forest to the training data and evaluated it on predicting hold out

#

Scored 0.3

#

Weird

stone marlin
#

Scored 0.3 for what? Accuracy?

steady basalt
#

Might be because I forgot to reset kernel

#

Sec

#

Probably one of them got scaled and one didn’t

mild dirge
#

make it guess the opposite and you get 0.7 ^^

steady basalt
#

I’ll do without scaling

#

0.74

#

Accuracy

#

Recall 0.94

#

Lol

#

My model got wrecked by default

mild dirge
#

yeah those seem like good metrics

steady basalt
#

Yeah but

stone marlin
#

Pret good recall.

steady basalt
#

Then I go on to tune and do feature selection and scale

#

And the model then performs much worse

#

On the same holdout

mild dirge
#

what kinda model?

steady basalt
#

Random forest and KNN

stone marlin
#

Here's my DS secret. Many models that I make work "just fine" out of the box. Most are like, "80% good" without too much fuss. It's the iterative optimization that's the extremely difficult part.

steady basalt
#

So what I conclude is that essentially my entire processing stage as well as parameter optimisation and scaling and over sampling made my performance much worse

#

Tf can I fix this? Looks really bad as a conclusion lol

#

I want improvement

stone marlin
#

Especially RFs.

#

Uh. You could try out xgboost and see if that does anythin' for you as opposed to RF.

mild dirge
steady basalt
#

I shud use cv instead of .score right

stone marlin
#

But honestly RF works really well right outt'a the box.

#

Random Forests, sorry Camel.

mild dirge
#

oh random forest

misty flint
#

RF praise

stone marlin
#

I'm lookin' at a model right now for evaluation for work, and it's 90% feature engineering and then at the end it's like two lines of a grid search on a random forest. Works really well.

misty flint
#

still performs well on non-random missing data

stone marlin
#

It's not mine, but, you know.

misty flint
#

was for time series healthcare data too kekHands

steady basalt
#

Anyone wana try fix my model

#

Maybe the problem is tuning isn’t wide enough

#

I only did about 500 searches

stone marlin
#

Once you get time series data in the right form, it's a delight to work with. :'''] But before that? It's a gd nightmare.

steady basalt
#

So it got beat by default

#

Should the benchmark be done after oversampling the training data

stone marlin
#

It happens to the best of us. I get beat by my baseline model a bunch during hyperparam sweeps.

#

"I can't believe I lost to linear regression!"

mild dirge
#

Maybe it is over-fitting on your training data if your model is complex

steady basalt
#

It’s not really complex

mild dirge
#

the bench mark should probably use the same data to train on, and the test data to test on

steady basalt
#

I saw SMOTE as just a part of the process rather than having to be done before

#

Then you’d also say to benchmark on scaled data too

#

The only thing changing is the parameter of model

mild dirge
#

smote is part of the process, but just the training process

steady basalt
#

The metrics function which gives things like recall on a table only works for a single predict

#

How do u cross validate and use the same table as averages
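One possible answer, assuming scikit-learn is what is being used here: `cross_validate` accepts several scoring names at once and returns one array of per-fold scores for each, which can then be averaged. The synthetic data from `make_classification` and the forest settings are placeholders for the real dataset and model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Placeholder data: 500 rows, roughly 70% positive, standing in for the real set.
X, y = make_classification(n_samples=500, weights=[0.3, 0.7], random_state=0)

metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X,
    y,
    cv=cv,
    scoring=metrics,
)

# Each metric comes back as an array with one score per fold; average them.
summary = {name: scores[f"test_{name}"].mean() for name in metrics}
for name, value in summary.items():
    print(f"{name:>10}: {value:.3f}")
```

If oversampling is part of training, it would need to happen inside each fold rather than once up front, otherwise the validation folds contain resampled rows.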

bold timber
#

Hi, I have a question: What the meaning of 1 in LogSoftmax?

tall blaze
#

What is the purpose of this model.

thin palm
#

what's the best plot for when I'm comparing countries and the top occupations in each country?

thin palm
# tall blaze histogram

I was thinking of making multiple pie charts, each representing a country and then its occupations

#

it's for astronaut data

tall blaze
#

But I could see the pie thing if you had a user selection to select each country

quick eagle
#

I'm trying to slice data in pandas to look at different areas of a data frame, eg:

df['field1'][4500:9000]

however, I'm doing graphs, etc, which means if I want to look at 5000:7000, I need to change it in a lot of places.
Is there a way to define a variable " slice = '4500:9000', and then use something like df['field1'][slice] ?

serene scaffold
#

!docs pandas.DataFrame.loc

arctic wedgeBOT
#

`property DataFrame.loc`
Access a group of rows and columns by label(s) or a boolean array.

`.loc[]` is primarily label based, but may also be used with a boolean array.

Allowed inputs are:
thin palm
serene scaffold
#

df.loc[4500:9000, 'field1'] is probably what you need, since it indexes by row and then by column.

tall blaze
quick eagle
#

something like this: ?

tzero = combined_dive_df.index[4600]
start = 4500
stop = 9000
fig = go.Figure()

fig.add_trace(go.Scatter(x=combined_dive_df.iloc['start':'stop'], y=combined_dive_df["SAC Rate (2 minute avg)_Shearwater"].loc['start':'stop'].interpolate(method='time'),
                    mode='lines', name='Shearwater'))

fig.add_trace(go.Scatter(x=combined_dive_df.iloc['start':'stop'], y=(14.7/100)*combined_dive_df["pressure_sac_Garmin"].loc['start':'stop'].interpolate(method='time'),
                    mode='lines', name='Garmin'))
serene scaffold
#

combined_dive_df["SAC Rate (2 minute avg)_Shearwater"].loc['start':'stop'] this is wrong. the dataframe is one thing. this is treating it as two things.

#

if "SAC Rate (2 minute avg)_Shearwater" is the name of a column, it goes in the loc call after the row indexers.

#

are you picking both rows and columns, or just columns?

quick eagle
#

just columns for the y axis, and trying to slice from row 'start' to row 'stop'

serene scaffold
quick eagle
#

ie, select the column to graph, and then only slice for the interesting bits

serene scaffold
#

@quick eagle these are the names of your columns. start and stop are none of them

Index(['distance_Garmin', 'enhanced_altitude_Garmin',
       'absolute_pressure_Garmin', 'depth_Garmin', 'ascent_rate_mm_s_Garmin',
       'heart_rate_Garmin', 'temperature_Garmin', 'unknown_135_Garmin',
       'unknown_136_Garmin', 'next_stop_depth_Garmin', 'next_stop_time_Garmin',
       'time_to_surface_Garmin', 'ndl_time_Garmin', 'n2_load_Garmin',
       'cns_load_Garmin', 'air_time_remaining_s_Garmin', 'pressure_sac_Garmin',
       'unknown_108_Garmin', 'timer_trigger_Garmin', 'event_Garmin',
       'event_type_Garmin', 'event_group_Garmin', 'unknown_19_Garmin',
       'unknown_20_Garmin', 'data_Garmin', 'transmitterID_Garmin',
       'pressure_100_Garmin', 'Heartrate_Garmin', 'ElapsedTime_Garmin',
       'Time (ms)_Shearwater', 'Depth_Shearwater',
       'First Stop Depth_Shearwater', 'Time To Surface (min)_Shearwater',
       'Average PPO2_Shearwater', 'Fraction O2_Shearwater',
       'Fraction He_Shearwater', 'First Stop Time_Shearwater',
       'Current NDL_Shearwater', 'Current Circuit Mode_Shearwater',
       'Current CCR Mode_Shearwater', 'Water Temp_Shearwater',
       'Gas Switch Needed_Shearwater', 'External PPO2_Shearwater',
       'Set Point Type_Shearwater', 'Circuit Switch Type_Shearwater',
       'External O2 Sensor 1 (mV)_Shearwater',
       'External O2 Sensor 2 (mV)_Shearwater',
       'External O2 Sensor 3 (mV)_Shearwater', 'Battery Voltage_Shearwater',
       'Tank 1 pressure (PSI)_Shearwater', 'Tank 2 pressure (PSI)_Shearwater',
       'Tank 3 pressure (PSI)_Shearwater', 'Tank 4 pressure (PSI)_Shearwater',
       'Gas Time Remaining_Shearwater', 'SAC Rate (2 minute avg)_Shearwater',
       'Ascent Rate_Shearwater', 'Safe Ascent Depth_Shearwater',
       'CO2mbar_Shearwater', 'moles_tank_Ideal_Garmin',
       'moles_tank_interpolate_Ideal_Garmin',
       'moles_tank_diff_interp_Ideal_Garmin',
       'liters_ambient_used_interp_Ideal_Garmin', 'moles_tank_Ideal_SW',
       'moles_tank_interpolate_Ideal_SW', 'moles_tank_diff_interp_Ideal_SW',
       'liters_ambient_used_interp_Ideal_SW'],
      dtype='object')
#

and then your rows are indexed by timestamps.

quick eagle
#

correct - I'm trying to use 'start' and 'stop' as shortcuts for a slice, not as columns

serene scaffold
#

as shortcuts for a slice?

quick eagle
thin palm
#

in my Data there's 'Pilot' and 'pilot', thus my pandas is recognizing them as 2 unique values. Is this a good way to make them the same?

    return x.replace('P','p') ```
my code works, but want to see if this is like "okay cool", or if it's "why do that?"
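An alternative sketch, assuming the column is called `occupation`: lower-casing the whole string with the pandas `.str` accessor instead of replacing a single letter. `replace('P', 'p')` changes every capital P, so a value like 'PSP' would likely come out as 'pSp'.

```python
import pandas as pd

# Made-up values showing the kinds of variants that can appear.
df = pd.DataFrame({"occupation": ["Pilot", "pilot", "PSP", "commander "]})

# Strip stray whitespace and lower-case everything in one vectorised step.
df["occupation"] = df["occupation"].str.strip().str.lower()

print(df["occupation"].unique())
```

This handles any capitalisation, not just a leading P, and avoids writing a helper function per letter.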
quick eagle
#

see the top - works with numbers, but when trying to 'centralize' the slice indexes into variables (so I can change it in one location, not 4), it doesn't work

serene scaffold
#

@quick eagle I'm not following. .loc has one or two parts. the first (required) part is the row indexer, which in your case has to be a timestamp or a slice of timestamps. the second (which is optional) is the column indexer. they both go in the .loc[ ], separated by commas. any syntax that looks like df[ ][ ] or df[ ].loc[ ] is likely to be wrong.

.iloc is similar except that it's by position, regardless of how the DF is indexed.
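A sketch of the "define the slice once" idea, under the assumption that the frame is indexed by timestamps as described. The frame here is synthetic and `datafield` is a placeholder column name. The key points are that `start` and `stop` are used as variables (no quotes), and that a Python `slice` object can be stored and reused.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one row per second, indexed by timestamp.
index = pd.date_range("2021-01-01 10:00", periods=10_000, freq=pd.Timedelta(seconds=1))
df = pd.DataFrame({"datafield": np.arange(10_000, dtype=float)}, index=index)

# Define the window once; change it here and every plot follows.
start, stop = 4500, 9000
window = slice(start, stop)

# Positional slicing: .iloc and plain index slicing both accept the slice object.
x = df.index[window]
y = df["datafield"].iloc[window]

# Label-based equivalent: .loc takes timestamps, and its end point is inclusive.
t0, t1 = df.index[start], df.index[stop - 1]
y_by_label = df.loc[t0:t1, "datafield"]
```

Writing `'start':'stop'` with quotes asks pandas to slice by the literal strings "start" and "stop", which is why the variable version did not behave like the hard-coded numbers.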

tall blaze
quick eagle
#

rows

#

from row start to row stop

tall blaze
#

change it to loc

serene scaffold
serene scaffold
quick eagle
#

x=df.index[4500:9000], y=df["datafield"][4500:9000]

#

I'm trying to replace the above with

tall blaze
quick eagle
#

x=df.index[4500:9000], y=df["datafield"][4500:9000]

a = 4500
b = 9000
x=df.index[a:b], y=df["datafield"][a:b]

serene scaffold
quick eagle
#

explicitly putting the slice indexes works, but trying to reference variables to slice with doesn't

tall blaze
quick eagle
#

basically, I have several hours of data, and just want to look at a specific time period - although plotly has some built in slicing on graphs, etc; I'm trying to do it by slicing the data frame

#

(but the specific time period is not known apriori)

serene scaffold
quick eagle
#

yep, already sorted and indexed by timestamp

#

the x axis is datetime

#

which is also the index

tall blaze
serene scaffold
tall blaze
#

ty

quick eagle
#

that one is later on, where there's stuff actually happening

thin palm
#

In my data there's 'NAME' and 'NATIONALITY', but sometimes the names appear twice or more because they've competed in spaceflight, my pandas will count the same person twice, how do I prevent this?

serene scaffold
thin palm
#

or is this okay?

thin palm
serene scaffold
tall blaze
thin palm
# serene scaffold yes, as long as it's followed by `.to_dict('list')` in the code.
{'id': [1, 2, 3, 4, 5], 'number': [1, 2, 3, 3, 4], 'nationwide_number': [1, 2, 1, 1, 2], 'name': ['Gagarin, Yuri', 'Titov, Gherman', 'Glenn, John H., Jr.', 'Glenn, John H., Jr.', 'Carpenter, M. Scott'], 'original_name': ['ГАГАРИН Юрий Алексеевич', 'ТИТОВ Герман Степанович', 'Glenn, John H., Jr.', 'Glenn, John H., Jr.', 'Carpenter, M. Scott'], 'sex': ['male', 'male', 'male', 'male', 'male'], 'year_of_birth': [1934, 1935, 1921, 1921, 1925], 'nationality': ['U.S.S.R/Russia', 'U.S.S.R/Russia', 'U.S.', 'U.S.', 'U.S.'], 'military_civilian': ['military', 'military', 'military', 'military', 'military'], 'selection': ['TsPK-1', 'TsPK-1', 'NASA Astronaut Group 1', 'NASA Astronaut Group 2', 'NASA- 1'], 'year_of_selection': [1960, 1960, 1959, 1959, 1959], 'mission_number': [1, 1, 1, 2, 1], 'total_number_of_missions': [1, 1, 2, 2, 1], 'occupation': ['pilot', 'pilot', 'pilot', 'pSp', 'pilot'], 'year_of_mission': [1961, 1961, 1962, 1998, 1962], 'mission_title': ['Vostok 1', 'Vostok 2', 'MA-6', 'STS-95', 'Mercury-Atlas 7'], 'ascend_shuttle': ['Vostok 1', 'Vostok 2', 'MA-6', 'STS-95', 'Mercury-Atlas 7'], 'in_orbit': ['Vostok 2', 'Vostok 2', 'MA-6', 'STS-95', 'Mercury-Atlas 7'], 'descend_shuttle': ['Vostok 3', 'Vostok 2', 'MA-6', 'STS-95', 'Mercury-Atlas 7'], 'hours_mission': [1.77, 25.0, 5.0, 213.0, 5.0], 'total_hrs_sum': [1.77, 25.3, 218.0, 218.0, 5.0], 'field21': [0, 0, 0, 0, 0], 'eva_hrs_mission': [0.0, 0.0, 0.0, 0.0, 0.0], 'total_eva_hrs': [0.0, 0.0, 0.0, 0.0, 0.0]}```
serene scaffold
#

keep in mind @thin palm that if you had only done print(df.head()), most of the columns would have been omitted, and it would be useless.

thin palm
# serene scaffold thank you; one moment

because if I want to find out how much time each person spent in space, it'll count duplicates. If Neil Armstrong went to space twice his first time in space was lets say 3 hours, then his next mission was 27 hours. it'll count 30 + 30 = 60. Even though it is only 30 hours in space.
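
One way to avoid the double counting, sketched with a few columns from the sample rows posted above. It assumes that a name identifies exactly one person and that `total_hrs_sum` is a per-person total repeated on every mission row (which is how the sample rows look):

```python
import pandas as pd

# A few columns from the sample rows posted above (one row per mission).
df = pd.DataFrame({
    "name": ["Gagarin, Yuri", "Titov, Gherman", "Glenn, John H., Jr.",
             "Glenn, John H., Jr.", "Carpenter, M. Scott"],
    "nationality": ["U.S.S.R/Russia", "U.S.S.R/Russia", "U.S.", "U.S.", "U.S."],
    "hours_mission": [1.77, 25.0, 5.0, 213.0, 5.0],
    "total_hrs_sum": [1.77, 25.3, 218.0, 218.0, 5.0],
})

# Option 1: add up the per-mission hours for each person.
per_person = df.groupby("name")["hours_mission"].sum()

# Option 2: keep one row per person before aggregating the repeated totals.
one_row_each = df.drop_duplicates(subset="name")
per_country = one_row_each.groupby("nationality")["total_hrs_sum"].sum()
```

Summing `total_hrs_sum` without the `drop_duplicates` step would count the 218 hours twice for the person who flew two missions.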

quick eagle
tall blaze
#

@quick eagle I think it is because you are calling the loc function vs the index function but I am not 100%. I would create a new numerical index in place of the current one:

df = df.reset_index(drop=False)
serene scaffold
tall blaze
bold timber
tall blaze
#

Then you can use numerical ranges for the loc function. Make sure that your data is sorted correctly first though

thin palm
tall blaze
quick eagle
tall blaze
#

if the model was to predict like type of diabetes you would most likely use softmax with the number of nodes equal to the number of diabetes types

bold timber
tall blaze
#

sigmoid is the same as logistic regression and is usually the go-to choice for binary outputs

#

and I looked into this further, I am pretty sure you would need to have 2 output nodes with softmax regression even if you are doing a binary classifier
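
A small numeric check of how the two options relate, in plain Python with an arbitrary example logit. A two-node softmax over the logits `[z, 0]` gives the same probability as a single sigmoid node with logit `z`, so both describe the same binary model:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.3  # arbitrary example logit
p_sigmoid = sigmoid(z)
p_softmax = softmax([z, 0.0])[0]

# More generally, softmax([a, b])[0] equals sigmoid(a - b).
p_general = softmax([2.0, 0.5])[0]
```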

quick eagle
bold timber
bold timber
tall blaze
quick eagle
# tall blaze amazing!

looks like the 'timestamp' column is preserved as a datetime (so I can do x = df[timestamp][a:b] - df[timestamp][time_zero]), but the x axis labels are '0, 0.2T, 0.4T' etc ... never seen that before. what does that 'T' notation mean? (I'm trying to basically set a start time, and then show elapsed time (in min:sec) since the time_zero point.)

tall blaze
tall blaze
tall blaze
tall blaze
# bold timber Ok, thank you so much

of course, this stuff is very complex. Even with a masters in it, it can seem like a blackbox that just spits out numbers. I would say really dive into the math for each of the layers. And if you want to be an actual expert get a math PhD lol.

thin palm
#

Does this boxplot make sense to everybody? the amount of time each country has spent in space.

desert oar
#

also what is the unit here? the total time per country is one value per country

quick eagle
desert oar
#

a boxplot represents a distribution of values

desert oar
thin palm
desert oar
#

i would at least recommend sorting by median astronaut hours or by total number of astronauts

thin palm
#

What plots would you use?

quick eagle
#

then make the x axis "hours per astronaut" or similar

#

otherwise it's misleading

desert oar
#

the title should be "Distribution of astronaut total times in space, by country"

#

and the x axis could be "Total hours spent by astronaut in space"

thin palm
#

so the box plot is fine yes? Just my labels?

desert oar
#

it's fine in that it shows something that isn't nothing. but are you trying to show something specific? or just "something"

quick eagle
#

I would emphasize by the title "Distribution of individual astronaut total times in space, by country"

desert oar
#

that's better

#

"individual" makes it clearer

thin palm
thin palm
quick eagle
#

also, you could try a violin plot, etc. the problem you have is with the US data set - it's extremely bimodal (~2week shuttle flights and 6month ISS expeditions), which makes the boxplot have a ton of outliers

desert oar
#

+1 for violins, great observation

quick eagle
#

did you try just plotting individual points?

thin palm
desert oar
#

it might be interesting to plot number of astronauts vs total flight hours across all astronauts on a scatterplot

thin palm
quick eagle
#

just a dot per astronaut total time per country. no boxes

thin palm
quick eagle
#

I usually start with scatterplot, and if the data distribution permits, then summarize with something else (eg boxplot, etc)

thin palm
wicked grove
#

Hello, i have a 3 class classification problem and i want to penalise 1 class
Does this work for that tf.nn.softmax_cross_entropy_with_logits

#

I can't understand what i should put in the place of logits and labels

quick eagle
misty flint
#

data viz is def something to focus on especially in industry setting

quick eagle
#

Strongly suggest the Edward Tufte series !!!

misty flint
#

how you persuade stakeholders is important

quick eagle
#

(4 books)

misty flint
#

aka half of your job sometimes kekHands

thin palm
#

It's for a job hopefully I make some cool stuff here

misty flint
#

storytelling with data by cole knaflic is another recommendation

quick eagle
#

*4 bins

thin palm
misty flint
#

yeah so part of it is really understanding the data

#

and that comes with experience or domain knowledge

quick eagle
#

and the obvious story that ought to pop up right away is the soviet mostly long duration, the US has mostly short due to shuttle (10-14 days, but each flight had 7 crew, whereas ISS are crews of 3 per expedition)

thin palm
# quick eagle average is not great for spaceflight times, as they are are either minutes, ~2 w...

how do you feel about what I have?
1.)The amount of times each country has sent someone on a space mission
2.)Plot of the amount of times only women have gone on missions (only showing which countries have sent them)
3.)Histplot showing what year each astronaut was selected for space missions (showing the year that had most declared missions)
4.)When the year of mission was actually initiated (another histplot) from 1960 - 2020

quick eagle
quick eagle
#

your initial plot is 1B (with the corrections discussed)

#

"amount of times each country has sent someone on a space mission" -> "number of times a citizen of that country has gone to space" ; international partners go on US, or Russia/USSR vehicles

thin palm
quick eagle
thin palm
quick eagle
#

also, commander vs pilot is a bit tricky (the 'lead pilot' was the commander, but 2 were trained as pilots for shuttle; the rest were all mission specialists, with the occasional payload specialist for shuttle). apollo was 3 pilots - lunar, command module, and commander...

mystic cloud
#

Can someone help me with pycharm + virtualenv + jupyter notebook?
I have the venv created with inherit global options (for jupyter in global python) and tensorflow in my new virtualenv
I open the jupyter notebook, try to import tensorflow and it shows as it is not installed but it is installed ._.
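
A quick check that can be run in a notebook cell, assuming the cause is that the notebook kernel runs on a different interpreter than the virtualenv (a common reason, but only a guess here). The helper name `is_installed` is made up for this sketch:

```python
import sys
import importlib.util

# Path of the Python interpreter this kernel is actually running on.
print(sys.executable)

def is_installed(package_name):
    """Return True if the current interpreter can find the package."""
    return importlib.util.find_spec(package_name) is not None

print(is_installed("tensorflow"))
```

If the printed path is the global Python rather than the venv, the package is installed in a place this kernel does not look.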

desert oar
quick eagle
thin palm
quick eagle
#

I'm trying to calculate a difference over a time period (eg 2 min), but my dataset has samples at a non-constant sample rate. is it possible to do a .diff(period=X) where X is '2min', not a set number of steps? my index is timestamp

#

(this is kind of like using .interpolate(method='time'), but with diff)
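
As far as I know `.diff()` only takes a number of rows, so one possible workaround is to interpolate the value at "now minus 2 min" and subtract it. This is a sketch on made-up data, and `time_diff` is a hypothetical helper name, not a pandas function:

```python
import pandas as pd

def time_diff(s, period):
    """Each sample minus the time-interpolated value `period` earlier."""
    delta = pd.Timedelta(period)
    past_times = s.index - delta
    # Add the shifted timestamps to the index, then interpolate by time.
    full = s.reindex(s.index.union(past_times)).interpolate(method="time")
    past = full.reindex(past_times)
    return pd.Series(s.to_numpy() - past.to_numpy(), index=s.index)

# Made-up, irregularly sampled ramp: the value equals elapsed seconds.
t0 = pd.Timestamp("2022-01-01")
seconds = [0, 30, 100, 130, 250, 300]
s = pd.Series([float(x) for x in seconds],
              index=[t0 + pd.Timedelta(seconds=x) for x in seconds])
d = time_diff(s, "2min")
```

Samples that are less than `period` after the first timestamp come out as NaN, because there is nothing earlier to interpolate from.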

mint palm
#

Is it ok for validation accuracy to initially be higher than the train set's....cuz they really come from the same distribution...

worn bough
dusty ivy
#

why does the line not fit the dots?

#

    vector<double> X = { 38, 50, 15, 30, 50, 38, 50, 20, 45, 50, 20, 35, 30, 43, 35, 37.5, 37, 35, 30, 45, 4, 37.5, 25, 46, 30, 200, 200, 30
    };
    
    // variable Y
    vector<double> Y = { 8000, 6400, 2500, 3000, 6000, 5000, 8000, 4000, 11000, 25000, 4000, 8800, 5000, 7000, 8000, 1800, 5400, 15000, 3500, 2400, 1000, 8000, 2100, 8000, 4000, 1000, 2000, 4800
    };
    
    
    double alpha = 0.0001; // learning rate
    int epoch = 1000;// number of epochs
    SimpleLinearRegression *slr = new SimpleLinearRegression(X, Y, alpha, epoch, true);
    slr->train();
    slr->print_yhat();

    
    vector<double> Y_c = slr->predict(X);

    // denormalize Y_c
    vector<double> Y_c_denormalize;
    double Y_MAX = *max_element(Y.begin(), Y.end());
    double Y_MIN = *min_element(Y.begin(), Y.end());
    double X_MAX = *max_element(X.begin(), X.end());
    double X_MIN = *min_element(X.begin(), X.end()); 

    for(int i = 0; i < Y_c.size(); i++){
        Y_c_denormalize.push_back(Y_c[i] * ((Y_MAX - Y_MIN) + Y_MIN));
    }

    
    double Y_c_MAX = *max_element(Y_c_denormalize.begin(), Y_c_denormalize.end());
    double Y_c_MIN = *min_element(Y_c_denormalize.begin(), Y_c_denormalize.end());

    // Scatter plot
    matplotlibcpp::figure_size(700, 500);
    matplotlibcpp::scatter(X, Y, 25);

    double x = 45;
    double y = slr->predict(x);
    double y_denorm = y * (Y_MAX - Y_MIN) + Y_MIN;

    cout << "Prediction of " << x << " Hours Per week is " << y_denorm << " Income" << endl;
    
    matplotlibcpp::plot({X_MIN, X_MAX}, {Y_c_MIN, Y_c_MAX}, "r");
    matplotlibcpp::xlabel("Hours per Week (x)");
    matplotlibcpp::xlim(0, 80);
    matplotlibcpp::ylabel("Income (y)");
    matplotlibcpp::title("Scatter Plot");
    matplotlibcpp::show();
#

the problem here is when I want to denormalize the values of Y_c

#

I just want to hardcode this one in C++ rather than using libraries in python
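
One likely culprit, sketched in Python for brevity and assuming the class uses min-max scaling internally (its source is not shown). The single prediction in the post is denormalized as `y * (Y_MAX - Y_MIN) + Y_MIN`, but the loop over `Y_c` has the parentheses in a different place:

```python
def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def denormalize(normalized, lo, hi):
    # Inverse of min-max scaling: scale first, then shift.
    return [n * (hi - lo) + lo for n in normalized]

y = [8000, 6400, 2500, 3000, 6000]  # first few Y values from the post
y_norm = normalize(y)
y_back = denormalize(y_norm, min(y), max(y))

# The loop in the post multiplies by ((max - min) + min), which is just max,
# so the smallest value maps to 0 instead of back to the minimum.
y_wrong = [n * ((max(y) - min(y)) + min(y)) for n in y_norm]
```

The plotted line also connects `(X_MIN, Y_c_MIN)` to `(X_MAX, Y_c_MAX)`, which only matches the fit when the slope is positive.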

lone drum
#

Hello, I have a DataFrame with a column named marks whose values are 21100, 23000, 25650, 78550, 36100, 22600, 22700, 34550. I want to get the rows that are multiples of 100, e.g. my expected output is 21100, 23000, 36100, 22600, 22700. Ping me when you reply.
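
A short sketch of one way to do this with a boolean mask, using the values from the question:

```python
import pandas as pd

df = pd.DataFrame({"marks": [21100, 23000, 25650, 78550,
                             36100, 22600, 22700, 34550]})

# Keep rows whose value leaves no remainder when divided by 100.
multiples = df[df["marks"] % 100 == 0]
```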

inland zephyr
#

i need suggestion about image classification or related field about image processing in ML

#

Is it common to consider the original image resolution and depth (e.g. the dpi of the image) before feeding it to a NN for training? I'm aware that the image sizes used are generally small (between 120x120 px and 200x200 px).

#

Also, is it worth processing each channel separately and combining them at the end of the NN? My intuition says that for a colored image, the R, G and B channels have different values and could each tell the NN a different story about the image.

steady basalt
maiden pelican
#

Where can I find code for back propagation ?

dusty ivy
maiden pelican
#

Can somebody help me with bp neural network algorithm ?

strange stump
#

@misty flint i need help 😄

misty flint
#

lol just ask. others can help too

#

it is 6am over here kekHands

mellow vapor
#

If a single layer MLPClassifier gives me an accuracy of like 94%
does adding more layers guarantee any more precision or scope of improvement
or is it just black box testing
may or may not work?

woeful falcon
#

!e Why is it not rounding the decimal places in the array

import numpy as np

w = np.array([9.79810329e+209,
 2.01077594e+210,
 1.57202605e+210,
 2.53363565e+210])

print(np.round(w,10))
arctic wedgeBOT
#

@woeful falcon :white_check_mark: Your eval job has completed with return code 0.

[9.79810329e+209 2.01077594e+210 1.57202605e+210 2.53363565e+210]
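
What is probably going on: the values are around 1e209, so rounding to 10 decimal places changes nothing, and the output only looks long because of scientific notation. A plain-Python sketch with the same numbers (`np.round` should behave the same way):

```python
w = [9.79810329e+209, 2.01077594e+210, 1.57202605e+210, 2.53363565e+210]

# Rounding to 10 decimal places cannot change numbers this large.
unchanged = [round(x, 10) == x for x in w]

# To shorten what is displayed, limit the significant digits when formatting.
shortened = [f"{x:.3e}" for x in w]
```

For NumPy arrays, `np.set_printoptions(precision=3)` controls how many digits are printed without altering the stored values.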
strange stump
#

rip

misty flint
#

oof

#

🕯️

#

what happened

strange stump
#

bro

#

so

#

the data set was like

#

containership sizes

#

ok

#

i tried to see if there is some correlation between

#

the age of the ship and the size

#

sht like that

#

heatmap

#

i couldnt

#

it said there are strings in the data

#

i check on excel for strings

#

literally nothing

#

one question was to split the size of the containers into 5000 bands

#

and plot and find the distribution

#

wtf

#

"cant split strings"

#

what can i do man

#

never began i gotta work as some cleaner forever now

mild dirge
#

You can write sentences without an enter every 3 words 😛

strange stump
#

im sorry im just mad 😦

misty flint
mild dirge
#

And if you are using pandas, you should try convert the columns to floats instead

#

if they are supposed to be floats

strange stump
#

yeah they are

misty flint
#

its ok bro, i failed my first takehomes as well

#

it gets better with experience

strange stump
#

well this was the analysis

#

i wrote down what i would have done

#

hopefully i can talk my way into it lol

#

tmro is the interview

#

wait do you think i can get away with trying something now after the allotted time and presenting it in the interview would it be appropriate

misty flint
#

idk your constraints so maybe

strange stump
#

it was 30 mins

#

4 questions

misty flint
#

thats rough

#

idk what they expect for 30 mins tbh

#

so i feel like whatever you say is fine

#

lolo

strange stump
#

bruh i dont get it

#

30 mins

#

like

#

not enough time omg

#

its for a junior role as well

#

do they expect me to be some pro at 20 yrs old 0 experience

misty flint
#

yeah thats def not enough time

#

for much

#

only if youre experienced would you maybe get anything valuable

strange stump
#

oh i found out my problem

#

the fking added commas for the 10000

#

like

#

13,000

#

so the code wont see this as a number

#

i gotta split(",") like this right? or something

steady basalt
#

Pulling the line up?

misty flint
#

i think you should be able to convert datatypes

#

even with the commas
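
A minimal sketch of the conversion, assuming the numbers arrived as text like "13,000" (the column name and values are made up):

```python
import pandas as pd

# Made-up column of numbers stored as text with thousands separators.
df = pd.DataFrame({"size": ["13,000", "4,500", "800"]})

# Strip the commas, then convert; errors="coerce" turns leftovers into NaN.
df["size"] = pd.to_numeric(df["size"].str.replace(",", "", regex=False),
                           errors="coerce")
```

The result has to be assigned back to the column, otherwise the DataFrame keeps the text version. When loading from CSV, `pd.read_csv(..., thousands=",")` handles the separators at read time.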

strange stump
#

nah i tried man

#

😦

misty flint
#

which function did you use

strange stump
#

no i tried to make stuff like

#

heatmaps

#

to find correlations

#

.corr()

#

uh

#

describe()

misty flint
#

you cant do that stuff

#

without converting datatypes first

#

you basically have dirty data

#

gotta clean it first

strange stump
#

so they want me to clean it and find trends within 30 mins

#

nice

misty flint
#

yeah its still much for a junior

#

i wouldve probably done it all in excel tbh

steady basalt
#

Did u remove missing values

#

And encode

misty flint
#

since you didnt really need python

steady basalt
#

I usually at least remove nan before I do heatmap tbh

#

@strange stump yes literally everywhere is expecting 20 year olds to be pro rn

#

It’s the new meta

strange stump
#

😦

#

i have more experience in python rex

#

i used python to analyse data for uni work

steady basalt
#

At least u got that job dude, some of us data scientists have to claw our way up from junior roles and internships doing Excel

strange stump
#

since it looks better on reports

steady basalt
#

And have training in ML

strange stump
steady basalt
#

Ohh

#

U applied to a pretty hard job for ur skills then haha

strange stump
#

yea i should probably just look elsewhere

steady basalt
#

Good luck

#

Nah if u can get this done u have what it takes

strange stump
#

probably

steady basalt
#

But I’d not want to have to learn pandas in one day lol

#

Ok just check the data for NAN values

#

And if there’s any consider filling them with column medians or modes

#

The data is integer?

misty flint
strange stump
#

i just converted the data into integers

#

by removing all the commas they had

misty flint
#

ok gtg bye

strange stump
#

cya boss

#

i checked for null values

#

isnull().sum()

#

all gave 0

#

now i try to use corr() and the output is "__"

steady basalt
#

Cause commas?

#

Did u check data type

strange stump
#

there arent any commas now

#

yeah

#

says int 64

steady basalt
#

Umm

strange stump
#

do i need them as floats

#

wait what

steady basalt
#

Dm me a screenshot of the df and the matrix code

strange stump
#

matrix code?

steady basalt
#

If in doubt google ur question

strange stump
#

do you want my python stuff?

steady basalt
#

Did u try to google it

#

Chances are someone’s posted on stack overflow this question

strange stump
#

"code featured in the movie matrix"

#

....

steady basalt
#

Google pandas corr giving ___

strange stump
#

OMG IT IS FLOATS

#

bro i kid you not i searched how to convert it to float, copied the code and it converts it into int64

steady basalt
#

What?

#

Corr require float?

strange stump
#

yea

steady basalt
#

Haah TIL

strange stump
#

i converted all the data into floats

#

which i thought thats what my code did when i searched for " convert column into float"

steady basalt
#

Yeah learning process is literally how good are you at googling

strange stump
#

nah this is bs ima complain about the time in the interview

#

but i think they will like that im trying again

#

hopefully anyway

steady basalt
#

Well I could prob do this in under 10 mins

strange stump
#

i feel like thats how it works for interviews

#

oh ok sorry mr pro

steady basalt
#

😅

strange stump
#

this is my heatmap i wanted

steady basalt
#

Nice

strange stump
#

but

steady basalt
#

They gave u three columns?

strange stump
#

the range is 0.992 - 1

#

they all have a strong correlation then?

steady basalt
#

They’re highly correlated

#

Yes

strange stump
#

just some stronger

#

nah i was just testing these columns

steady basalt
#

Do it with all columns

#

Trust

strange stump
#

i send a screenshot of the head of the dataset

steady basalt
#

U can also mask half of that and save eyesore

strange stump
#

some are worded answers

mild dirge
#

If you haven't carefully tested something but have just "played around with a lot of different configurations", how can you neatly put this in a report?
Like we empirically found that this type of model gave the best results so we used this.

steady basalt
#

Hmm by that I mean make it a triangle
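
A sketch of the triangle idea with NumPy, using a made-up 3x3 correlation matrix. The boolean mask marks the cells to hide:

```python
import numpy as np

# Made-up 3x3 correlation matrix for illustration.
corr = np.array([[1.000, 0.995, 0.992],
                 [0.995, 1.000, 0.997],
                 [0.992, 0.997, 1.000]])

# True marks cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))
```

The mask can then be passed to seaborn as `sns.heatmap(corr, mask=mask)` so only the lower triangle is drawn.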

mild dirge
#

Or something of that nature

steady basalt
#

I’d say that ahha

strange stump
#

i know what youre saying yeah ill look into that later im happy i know how to do this now

steady basalt
#

I’d just say an initial test proved certain models stronger

strange stump
#

oh so supermoon

#

i wanna make a new column

#

Age of the ship

steady basalt
#

Show us the matrix with all columns

strange stump
#

2022 - the year built

#

how would i do that

steady basalt
#

You have year built column

#

U can create a new column that passes exactly that formula

#

Ur gona need to google syntax
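
A sketch of the new column, plus the 5000-wide size bands mentioned earlier. The column names and values are hypothetical, since the real dataset was not posted:

```python
import pandas as pd

# Hypothetical column names and values; the real dataset was not posted.
df = pd.DataFrame({"year_built": [2001, 2010, 2019],
                   "size": [4200, 9800, 14500]})

# Age of the ship relative to 2022.
df["age"] = 2022 - df["year_built"]

# Bands 5000 wide: [0, 5000), [5000, 10000), [10000, 15000)
edges = list(range(0, 20000, 5000))
df["size_band"] = pd.cut(df["size"], bins=edges, right=False)
counts = df["size_band"].value_counts(sort=False)
```

`counts.plot(kind="bar")` would then show the distribution across bands.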

#

Are u applying to data analyst?

strange stump
#

yes

#

ok i created it nvm

steady basalt
#

USA?

strange stump
#

UK

steady basalt
#

Me too

#

I was under the impression most of those jobs are rly hard but maybe it’s regional

#

I’m prob gona have to do this type of work in my first year

#

No one wants a data scientist without analyst experience anymore

#

FMl

strange stump
#

idk man i dont think im smart enough for this

steady basalt
#

U already done half of it

#

It doesn’t get much harder

#

Now just do some plots

#

My hint is use pairplot by seaborn to scout out areas of interest

#

I do that

#

And just use matplotlib or pandas to plot bars

#

Or distributions

rich olive
#

Guys I'm tryna self-teach python

#

As my first language and dip into programming

strange stump
#

oh wtf

steady basalt
#

U are in the data science room

rich olive
#

And I have some super basic code that's not working.

steady basalt
#

Ahhh

#

U found the hack for graphs

rich olive
#

You guys don't do data science in python

steady basalt
#

I do

rich olive
#

I am building linear regression. That is data sciency

steady basalt
#

Nice

strange stump
#

none of these graphs are useful imo

steady basalt
#

They are

#

Isn’t it a key part of analysing a data set

#

U now see where all the ships were built

#

Which years got more contracts

#

Now u can plot these individually and later remove the pairplot

strange stump
#

i got rid of the unique identifier column

#

hm

steady basalt
#

What exactly is the task

#

If it’s general analysis what’s bad about plotting to find which years were best

#

Or correlations

#

I mean some of those are just a literal extension of ur matrix

strange stump
#

one of the questions was to find some trends

steady basalt
#

U can plot the trends on a graph from ur matrix

#

As u can see those highly correlated dots

strange stump
#

ok 😄

urban marlin
#

so i was trying to train a model with tensorflow object detection module and this problem came up , can anybody tell me how to change checkpoint version to V2 ?

gusty forge
#

Is it possible to convert an opencv model to tensorflow model?

#

Ultimately I just want to use the model to run in an Android app

next phoenix
tacit grail
#

Thanks @modest mulch for response.
my application is following:
There will be an online examination system.
The question will show the attached image (the image may vary) and ask the student to recreate it in MS Word.

Our program collects every student's Word document and compares it with our reference document.
I want to write a program that gives a score based on how similar the submitted document is to the provided one.

  • this will compare template
  • font size
  • color
  • font-face
steady basalt
#

Can’t u just use ur eyes and look if it’s the same

#

To check for typos just use a text filter

#

Otherwise you’re going to need some state of the art computer vision

mild dirge
#

What info are you planning on feeding the model?

steady basalt
#

This is a good example of using AI for a simple task that would require some really advanced AI or no AI

lone drum
#
Traceback (most recent call last):
  File "D:\college_project\modules\model_train.py", line 21, in <module>
    model.add(Convolution2D(16, 3, 3, activation = 'relu'))

  File "C:\Users\shubh\anaconda3\lib\site-packages\tensorflow\python\training\tracking\base.py", line 629, in _method_wrapper
    result = method(self, *args, **kwargs)

  File "C:\Users\shubh\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None

  File "C:\Users\shubh\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2013, in _create_c_op
    raise ValueError(e.message)

ValueError: Exception encountered when calling layer "conv2d_1" (type Conv2D).

Negative dimension size caused by subtracting 3 from 1 for '{{node conv2d_1/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], explicit_paddings=[], padding="VALID", strides=[1, 3, 3, 1], use_cudnn_on_gpu=true](Placeholder, conv2d_1/Conv2D/ReadVariableOp)' with input shapes: [?,1,1,8], [3,3,8,16].

Call arguments received:
  • inputs=tf.Tensor(shape=(None, 1, 1, 8), dtype=float3

How do I fix this error?
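
A possible reading of the traceback, hedged because the model code is not shown: the message reports `strides=[1, 3, 3, 1]` and an input of shape `[?,1,1,8]`, which suggests `Convolution2D(16, 3, 3, ...)` is being read as 16 filters, kernel size 3 and stride 3. With a stride of 3 and no padding the feature map shrinks very fast. The sketch below only does the size arithmetic, with a hypothetical 28-pixel input:

```python
def conv_output_size(size, kernel, stride):
    """Spatial output size of a 'valid' (unpadded) convolution."""
    return (size - kernel) // stride + 1

def shrink(size, kernel, stride, layers):
    """Sizes after each conv layer, stopping once the kernel no longer fits."""
    sizes = [size]
    for _ in range(layers):
        if sizes[-1] < kernel:
            break  # this is the point where TensorFlow raises the error
        sizes.append(conv_output_size(sizes[-1], kernel, stride))
    return sizes

# Hypothetical 28-pixel input; the real input size is not shown in the post.
with_stride_3 = shrink(28, kernel=3, stride=3, layers=4)
with_stride_1 = shrink(28, kernel=3, stride=1, layers=4)
```

If that reading is right, writing the layer as `Conv2D(16, (3, 3), activation='relu')` keeps the default stride of 1. Pooling layers in the model would shrink the sizes further.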
desert oar
#

you could just look at some image similarity metric

#

but i have a feeling that will not be easy to tune and will not reliably give good results

#

you'd need to segment the image and compute similarities on various parts + some kind of graph similarity for the overall structure

desert oar
#

you'd spend 5x as long building the model as you would grading by hand

#

and yeah that too lol

mild dirge
#

You have to use more data to make sure it's not just that

desert oar
#

well you'd have to make it low res in the exam, too low to scale up properly to the document size

misty flint
#

bruh

#

first rule of google's ML

#

solve the problem without ML if you can

wicked grove
# desert oar no

Oh okay thank you, could you please tell me what i should put for logits in this function

desert oar
#

the logits are the outputs of your model

#

basically the stuff that comes out of the final output layer, before applying softmax

#

they are called "logits" because conceptually they are the result of applying the logit function to the predicted probabilities

desert oar
desert oar
arctic blade
#

What would happen if somebody made self aware ai?

serene scaffold
arctic blade
#

I was wondering if it would be like the matrix smh😂

serene scaffold
#

in either case, a "self-aware AI" is a long way out. the way AI is depicted in the media is just wrong.

arctic blade
wicked grove
serene scaffold
wicked grove
#

Labels,logits and pos_weight

arctic blade
#

Idk

#

Whats the most impressive ai then

mild dirge
#

stuff like alexa and google home is pretty impressive

serene scaffold
agile cobalt
#

Alpha Zero (or AlphaGo) has some nice advancements in complex-ish games
Nvidia has some crazy image manipulation stuff

desert oar
#

yeah i'd say that the alpha-stuff is probably the most-developed for general-purpose problem solving, at least that the public knows about

arctic blade
mild dirge
#

AI is not machine learning per se, it's just something "intelligent"

#

very broad

desert oar
mild dirge
#

But it definitely uses machine learning too

serene scaffold
desert oar
#

the definition of "AI" is fuzzy and has been co-opted by marketing teams to sell machine learning products

arctic blade
#

I see

wicked grove
#

           0       0.96      0.97      0.97       100
           1       0.93      0.74      0.82       100
           2       0.78      0.93      0.85       100

    accuracy                           0.88       300
   macro avg       0.89      0.88      0.88       300
weighted avg       0.89      0.88      0.88       300
wicked grove
serene scaffold
#

@arctic blade Google (the search engine) is an AI: it's a document retrieval system that figures out what documents (web pages) are relevant to your query.

arctic blade
desert oar
#

so you can try adjusting pos_weight above 1

#

the docs are pretty clear, they even have formulas

#

for one specific class, you might need to specifically assign class weights

wicked grove
#

Yupp i saw this,but should i pass the fully connected layer as the logits

desert oar
#

ah, it looks like pos_weights can be a vector

#

so you can assign different weights to different classes

#

A coefficient to use on the positive examples, typically a scalar but otherwise broadcastable to the shape of logits. Its value should be non-negative.

wicked grove
wicked grove
desert oar
#

what do you mean?

#

how can you access the values without applying softmax?

#

you could just not put the softmax layer on the nn, and apply it manually when generating predictions. but i'm not sure what actual tf users do, let me see if i can figure it out

wicked grove
desert oar
#

yes, do not apply softmax to the logits. weighted_cross_entropy_with_logits does that internally

#

ah wait

#

yeah nvm

wicked grove
desert oar
#

weighted_cross_entropy_with_logits applies sigmoid, not softmax

#

the docs say that

#

sorry i misread

wicked grove
#

Ohh okayy, it's alrightt

wicked grove
modest mulch
#

@desert oar yo man, do you have any idea on using GANS for generating object on images (the output of GANS could directly be fed into an object detector)

misty flint
#

your question doesnt make any sense. you can just do one and then the other afterwards

desert oar
#

sigmoid is for independent binary classes, softmax is for 1 mutually exclusive set of classes

wicked grove
wicked grove
mint palm
#

which is better?

grave frost
mint palm
#

both were trained on something like this

grave frost
#

on the surface, it often looks like a "stochastic parrot" (I certainly thought so too) but its really from some digging that one actually understands how much it can do as compared to previous methods

#

as much as people hate calling it "intelligent" on the internet - those are usually ones posting blogs who live in an extreme, expecting GPT3 to be skynet-like AGI

#

while in the academic community, it's mostly GPT3's meta-learning capabilities that really astound. It's completely unexpected, was never thought to be emergent, yet the model managed to do it a bit... just by being pre-trained on next-token prediction 🤔

mild dirge
# mint palm

This one gives better test accuracy, so if this test data represent new data well, then this one is better

#

Also using too many epochs can cause overfitting

#

The model converged way before 5 epochs, let alone 30

mint palm
#

yup

mint palm
mild dirge
#

why try 100?

#

30 is too many

mint palm
#

i wanna see....

mild dirge
#

you want to see it overfitting? 😛

mint palm
#

i want to add in a report....isnt lower epoch graph very noisy

mild dirge
#

you can average it over multiple runs

zinc sparrow
#

Hey all! Old timer AI guy here - used to run my own C++ libraries - how are y'all running performant python code?

mint palm
zinc sparrow
#

Gotcha... So precompiled code, eh?

#

Cool, thanks

mild dirge
#

You can run it on your cpu and gpu

#

multiple gpu's / machines even

zinc sparrow
#

Yeah, that's a given

mild dirge
#

yeah but it has a nice api to do that, you don't have to figure that all out yourself from scratch

zinc sparrow
#

Just needed to confirm my suspicion and an argument I've had with a colleague that debated that python was as performant as C -eyeroll-

mild dirge
#

well pytorch is mostly written in c++ iirc

zinc sparrow
zinc sparrow
#

Sorry about off topic, carry on!

mild dirge
#

ah right haha. Well it'll run pretty fast but it's not the python code making it happen 😛

lone drum
#
Traceback (most recent call last):

  File "D:\college_project\modules\untitled0.py", line 56, in <module>
    model.fit(train_generator, epochs=5, validation_data=validation_generator)

  File "C:\Users\shubh\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None

  File "C:\Users\shubh\anaconda3\lib\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,

InvalidArgumentError: Graph execution error:
mint palm
#

this is 10 epochs....sound good?

lone drum
mild dirge
#

btw how does your model have 94% accuracy on epoch 0?

agile cobalt
mild dirge
#

Assuming epoch 0 is not trained

wicked grove
#
loss = tf.nn.softmax_cross_entropy_with_logits(labels=[[1. 0. 0.] [0. 0. 1.] [0. 1. 0.]], logits=output, axis=-1, name=None)
#

i am getting an invalid syntax error

#

i cant understand it

mint palm
#

i have 200,000 of category A
120,000 of B
126,000 of c

agile cobalt
#

maybe double check if there's any data leakage?

mild dirge
#

you should probably just stop after epoch 2

#

Well with that much data it can be possible as long as the function is not too complex

mint palm
#

this is my code:

agile cobalt
#

maybe see how well it works with other forms of train/test splitting

#

the simplest would be just leaving a slightly larger group out, and not shuffling the data

mint palm
#

ok

wicked grove
agile cobalt
#

you cannot simply copy paste a numpy array like that
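
To spell that out: NumPy prints arrays without commas, but a Python list literal needs them between elements and between rows. The sketch below writes the same labels as valid Python and computes softmax cross-entropy by hand, so the roles of `labels` and `logits` are visible. The logits and the class weights are made up, and weighting each example by its true class is only one way to penalise a class:

```python
import math

# One-hot labels written as a valid Python literal (note the commas).
labels = [[1.0, 0.0, 0.0],
          [0.0, 0.0, 1.0],
          [0.0, 1.0, 0.0]]

# Made-up raw model outputs (logits), one row per example.
logits = [[2.0, 0.5, 0.1],
          [0.3, 0.2, 1.5],
          [1.0, 1.0, 1.0]]

def softmax_cross_entropy(label_row, logit_row):
    m = max(logit_row)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logit_row))
    # Cross-entropy: -sum over classes of y_k * log_softmax_k
    return -sum(y * (z - log_sum) for y, z in zip(label_row, logit_row))

losses = [softmax_cross_entropy(y, z) for y, z in zip(labels, logits)]

# Hypothetical per-class weights: mistakes on class 1 cost twice as much.
class_weights = [1.0, 2.0, 1.0]
weighted = [loss * sum(w * y for w, y in zip(class_weights, y_row))
            for loss, y_row in zip(losses, labels)]
```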

mint palm
#

same

#

but i checked the dataset

#

its quite shuffled....like category a, b, c is present in chunks of 100

wicked grove
#

i tried with floating points but that gave errors as well, so idk what i should do for the labels

bold timber
#

my friend wants to install Anaconda, but he gets this. how do I fix it?

mild dirge
#

whatever place they try to install it in has 2 spaces in the name which apparently anaconda does not like

misty flint
#

look @serene scaffold

misty flint
#

i wonder who makes that call thats like

#

this model is way too big we need to train on multiple gpu's

#

maybe its more like, this is never converging

#

lets try multiple gpu's

serene river
#

Hi, sorry for bothering, I'm having trouble plotting streamlines in matplotlib. I plotted the vectors, but now I need to use Euler's method (or Runge-Kutta) to trace the streamlines. I have no idea how to start or what result I should get
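
A minimal sketch of the Euler version, to show the shape of the loop. The field in `velocity` is a made-up example (a rigid rotation), and `euler_streamline`, `step` and `n_steps` are placeholder names; swap in whatever u(x, y), v(x, y) you plotted as vectors:

```python
def velocity(x, y):
    """Hypothetical vector field (rigid rotation); replace with your own u, v."""
    return -y, x


def euler_streamline(x0, y0, step=0.01, n_steps=500):
    """Trace one streamline from (x0, y0) using forward Euler steps."""
    xs, ys = [x0], [y0]
    for _ in range(n_steps):
        u, v = velocity(xs[-1], ys[-1])
        # Move a small distance along the local velocity vector.
        xs.append(xs[-1] + step * u)
        ys.append(ys[-1] + step * v)
    return xs, ys


xs, ys = euler_streamline(1.0, 0.0)
```

The result is a list of points that follows the arrows from the seed point, and `plt.plot(xs, ys)` would draw it over the vector plot. For this particular field the exact streamlines are circles, so the slow outward drift of the Euler curve is the method's error; a smaller `step` or Runge-Kutta reduces it.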

inland mantle
#

I’m still learning about a career I want to do, so I am thinking of choosing a career in deep learning with a specialty in computer vision

#

Does computer graphics include computer vision?

elfin merlin
#

Hey guys, I have a computer vision problem. I am using openCV but since there is no computer vision chat I figure data science is the closest thing to the problem that I am having. I am using OpenCV in python. I have a color image and a binary mask image (0 to 255). I want to instead have a color image with the mask applied.

```
full_mask_bgr = cv2.cvtColor(full_mask,cv2.COLOR_GRAY2BGR)
full_mask_bgr[full_mask_bgr==255]=1.0
img2 = np.multiply(img, full_mask_bgr)
```

I am able to do this by doing these functions.  First: convert from grayscale to bgr, then convert all the 255 (white) values of mask to 1 then multiply the original bgr image by the 0,1 mask.
The only problem with this solution is that its slow as hell.  Is there a better way to do this?
misty flint
#

maybe try doing a project in both and seeing how you feel about it

inland mantle
#

Yeah I’ll see I’m just exploring rn

misty flint
#

same im interested in 3 things atm

#

hopefully i decide on 1 before i graduate kekHands

#

which is soon

elfin merlin
#

I love computer vision but Im probably going to go into app dev instead

misty flint
#

mobile or web

elfin merlin
#

Mobile (probably)

misty flint
#

gotcha. theres still opportunities to apply CV in that space

#

i also dont know the answer to your question since im not really a CV guy kekHands

#

we also did our stuff with matlab, which has tons of image processing functions DoggoKek

elfin merlin
#

I think theres got to be a way to do it in numpy or opencv but the way I did it is so roundabout and my camera is now like 1 fps
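
One numpy route worth timing, sketched under assumptions: `img` is a BGR `uint8` frame and `full_mask` is a single-channel `uint8` mask of 0 and 255 with the same height and width. The tiny arrays below are made-up stand-ins for the camera data. It skips the grayscale-to-BGR conversion and the multiply:

```python
import numpy as np

# Hypothetical stand-ins for the camera frame and the 0/255 mask.
img = np.full((4, 4, 3), 200, dtype=np.uint8)
full_mask = np.zeros((4, 4), dtype=np.uint8)
full_mask[1:3, 1:3] = 255

# Copy the frame, then zero every pixel where the mask is off.
# Indexing with a 2-D boolean mask selects whole BGR pixels at once.
img2 = img.copy()
img2[full_mask == 0] = 0
```

The OpenCV equivalent should be `cv2.bitwise_and(img, img, mask=full_mask)`. Whether either is fast enough for the robot is something to measure on the real frames.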

misty flint
#

yeah someone who knows opencv well could probs answer your question

elfin merlin
#

I got to figure this out. Our robotics competition is this friday and those 3 lines of code are slowing down the robot

inland mantle
misty flint
misty flint
#

one is more i believe analyzing the data, while the other is generating it

#

but maybe

inland mantle
#

Ah I see

steady basalt
misty flint
steady basalt
#

In my opinion if we built a neural network that’s a 1:1 replica of the brain and raised it like a child would it know if it’s inside a computer ? If so it wud wana kill itself

mild dirge
#

supermoon, this is going to be hard to break it to you but...

misty flint
#

but...

tight flare
acoustic peak
misty flint
#

ah here are the peeps that know opencv

gilded bobcat
#

Hi all, I had a question on using "feature importance" with sklearn.

#

Namely, I ran a tree and took the most important features to determine if an animal would be adopted (so 0 or 1). I get these results.

#

However my question is this: How do I know if these features are important to classify the observation as adopted (1) or not adopted (0)?

#

Like "Sex upon Intake Unknown" is def important, but is it important for classifying an obs as 1 or as 0?!

agile cobalt
#

that depends on:

  • which model are you using
  • what is the scale of the variables
  • what is the intercept (assuming that it has one)
#

if it's a LogisticRegression with the default parameters, check the model's intercept_
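
importances alone don't say which class a feature pushes toward. A quick sanity check that works for any model, sketched with made-up rows (the flags and outcomes below are hypothetical, not your shelter data): compare the adoption rate when the feature is on against when it is off.

```python
# Hypothetical rows: (sex_unknown_flag, adopted), both 0 or 1.
rows = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0), (0, 1), (1, 0)]


def adoption_rate(rows, flag_value):
    """Share of adopted animals among rows whose flag equals flag_value."""
    outcomes = [adopted for flag, adopted in rows if flag == flag_value]
    return sum(outcomes) / len(outcomes)


rate_flag_on = adoption_rate(rows, 1)   # feature present
rate_flag_off = adoption_rate(rows, 0)  # feature absent
points_toward_adopted = rate_flag_on > rate_flag_off
```

in this toy data the flag goes with a lower adoption rate, so it would be pointing toward 0. This is only the marginal relationship, so the model may use the feature differently in combination with others.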

gilded bobcat
#

Good points, I am having a hard time thinking of how to use that info to determine this. If it helps:

  1. Using Adaboost over my data
  2. Nothing is scaled in any particular fashion, so no change from raw (I read that scaling data for trees does nothing, and maybe even harms the predictive power, not too sure)
  3. None, it's a tree (?)
#

Very much learning

#

So sorry for obvious stupid replies lol