#data-science-and-ml
yeah, but "crash courses" remain what they are - a quick overview. I would discourage people from actually using them, and instead encourage deep dives in the topic
eh if you just want a quick overview then thats what they are for
you cant deep dive into everything

you should deep dive into what youre actually interested in tbh
How the hell did a pro get 0.86 accuracy. I can only manage about 0.73 after doing all the steps
0.86 is on training data
Hi Python gang, I have a take home assessment for this job interview and was wondering if I can get some help on your thoughts of what to look for when doing data exploration? What's everyone's top 5 things they look for when examining data? Cheers!
- Errors/dirty data (i.e. things that seem erroneous or might need data cleaning for further analysis)
- Summary statistics (mean, distribution, etc.)
- Outliers (special instances, etc.)
- Visualize in multiple ways (may see something unexpected)
- Basic models (to see any relationships)
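A rough pandas sketch of the first few items on that checklist, assuming a tiny made-up table (the column names and values are hypothetical, not the real astronaut file). Plots and basic models are left out to keep it short.

```python
import pandas as pd

# Hypothetical toy data standing in for the real file.
df = pd.DataFrame({
    "name": ["A", "B", "B", "C", "D"],
    "occupation": ["pilot", "Pilot", "Pilot", "commander", None],
    "hours_mission": [1.8, 25.0, 25.0, 5.0, 4000.0],
})

# 1. Errors / dirty data: missing values and exact duplicate rows.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())

# 2. Summary statistics for the numeric columns.
summary = df.describe()

# 3. Outliers: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = df["hours_mission"].quantile([0.25, 0.75])
iqr = q3 - q1
too_low = df["hours_mission"] < q1 - 1.5 * iqr
too_high = df["hours_mission"] > q3 + 1.5 * iqr
outliers = df[too_low | too_high]

# 4. Inconsistent categories ('pilot' vs 'Pilot') show up in value counts.
category_counts = df["occupation"].value_counts(dropna=False)
```

The 1.5 * IQR rule is only one common convention for flagging outliers; whether a flagged row is an error or a genuinely special case still needs a human look.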
Man awesome! But what do you mean by basic models? For more context I'm working on astronauts data and in another file their missions information
So I could make a model that says how likely it was to succeed
Data leak
uhhh data leakage may happen if you don't remove duplicates or don't do the train/test split correctly
anything that doesnt take too much time. if its a take home, you just want the low hanging fruit first. (i.e. youre not going to be building a neural network, but probably should look at a simple linear regression model)
(if the data is linear)

Man love it, thank you very much for this!
I'll do: Cleaning Data (duplicates, missing values, scaling), Data Viz, Feature engineering (encoding), feature selection (feature correlation, modeling)
factsssss
its good to have a standard approach
for sure
as that can help with this
Hey, I just joined this server, I'm 17 and wanting to get started in data science and AI, how would y'all go about doing this? I'm learning the matplotlib and pandas libraries right now, I only started learning Python two and a half months ago.
are you going to go to college/university, and if so, will you be pursuing a degree related to DS/AI?
This book is great for starters https://allendowney.github.io/ElementsOfDataScience/README.html
For Data science, where should I start from
Looks like the message above yours attempts to answer the same question.
Ok, thanks
Hey
OpenCV lags so much when I run an ipynb notebook. Most of the time, the frame window gives a "not responding" message
What do I do
initially val accuracy is higher than train accuracy but it gets better later... is that ok?
umm vids to learn python?
!resources
please keep other non-data science questions in the help channels
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
okay
would anyone know why my model performs worse on unseen data after tuning than before?
initial score is say 0.7 on cv, final test scores are more like 0.68
cv is performed on training set
@mild dirge I feel as though holding back test data is making my model seem WORSE than it should be
especially KNN
dropped to 0.58
auc
altho Random forest is pretty good
worse still, I am getting like 0.52 for precision on target feature being the 'positive' value
which is really bad in cases like detecting disease
Anyone can help me in this ?
@mild dirge another thing is the hold out testing set is using very imbalanced data, so of course precision is high for predicting the class which has like 4x more values than the other class
you could combine rank with a groupby https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html
One More Question, I have a data frame and I wanted to generate a new column for colour codes which starts from red for the least value of Opportunity and moves toward green for the highest value of Opportunity
dataframe
Interpolation of colors basically
yes
I need color codes for using in front end of my application so i can minimize the time for rendering cause data set so big
You mean you're going to categorise the column into a smaller number of colour codes?
yes you can say like that
I want values between red and green depends on Opportunity
top least val is dark red and top large val is dark green
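One possible way to do that interpolation, as a plain-Python sketch. The function name, the exact dark red and dark green RGB values, and the sample numbers are all assumptions for illustration.

```python
def red_to_green_hex(value, lowest, highest):
    """Map value in [lowest, highest] to a hex colour between dark red and dark green."""
    dark_red = (139, 0, 0)
    dark_green = (0, 100, 0)
    # Position of the value between the two ends: 0.0 for lowest, 1.0 for highest.
    t = 0.0 if highest == lowest else (value - lowest) / (highest - lowest)
    # Linear interpolation of each RGB channel.
    channels = [round(r + t * (g - r)) for r, g in zip(dark_red, dark_green)]
    return "#{:02x}{:02x}{:02x}".format(*channels)

# Made-up Opportunity values.
opportunity = [12, 40, 7, 25]
colours = [red_to_green_hex(v, min(opportunity), max(opportunity)) for v in opportunity]
```

With a DataFrame, the same function could be applied to the column, for example with `df["Opportunity"].apply(...)` after computing the column's min and max once. Straight RGB interpolation passes through brownish tones in the middle, so a ready-made red-to-green colormap may look nicer if that matters.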
Can i penalize one of the classes of my model with this function? tf.nn.softmax_cross_entropy_with_logits
special?
I am trying to code this
But when using
dist = cdist(X,np.array([D[i,:]]).T,axis=1)
I get an error
ValueError: XA and XB must have the same number of columns (i.e. feature dimension.)
alright, so print the shape of X and np.array([D[i,:]]).T and check if they are what you expect
I am just wondering whether there are other ways to calculate the minimum distance between data points and clusters
since I only find errors
Using different methods
Well it seems that there is a shape mismatch between the two points
so it can't calculate the distance
Some simple distance functions are, for example, Euclidean or Manhattan
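A minimal NumPy sketch of the point-to-cluster distance calculation, assuming `X` has shape (n_points, n_features) and `D` has shape (n_clusters, n_features). The arrays here are made up. The shape error above is what you would expect when one of the two inputs is transposed into a column, so the feature dimensions no longer line up.

```python
import numpy as np

# Assumed shapes: X is (n_points, n_features), D is (n_clusters, n_features).
X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
D = np.array([[0.0, 1.0], [4.0, 5.0]])

# Broadcasting gives one difference vector per (point, cluster) pair.
diffs = X[:, None, :] - D[None, :, :]            # shape (3, 2, 2)
distances = np.sqrt((diffs ** 2).sum(axis=2))    # Euclidean distances, shape (3, 2)

nearest_cluster = distances.argmin(axis=1)       # index of the closest cluster per point
min_distance = distances.min(axis=1)             # distance to that cluster
```

`scipy.spatial.distance.cdist(X, D)` should give the same distance matrix. For a single cluster row, keeping it two-dimensional with `D[i:i+1, :]` avoids the transpose that triggers the column mismatch.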
Try giving some more information. What code have you written so far, and what part are you stuck with?
Long shot - anyone here use JS & PYTHON to script/automate in Microsoft Excel?
python data science people usually use pandas for that kind of thing.
you can read excel data into python code with pandas, do all the transformations, and save the result (and any intermediary parts, if you want) back to excel.
thats why no point using R
Thanks. I figured. Hoping to find a unicorn who can help explain pros & cons
Do you know of a better place I could ask this question?
the pros and cons of pandas vs what?
I find pandas simpler than Excel but that's just cause I never used Excel
i have strings that are formatted in different ways and i want them all to have the same format. for example i have a string that is like "7 MY STRING" however i want it to be "7th My String" using ordinal suffixes. I'm thinking the best way to do this is to split the column after the integers, add the ordinal suffixes (1st, 2nd, 3rd, etc.), then use the title() method on the strings and then rejoin the column. I'm just not sure how i would implement it in pandas.
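A sketch of one way to do this with a plain function that can then be mapped over the column. The names `ordinal` and `normalise` are made up for illustration, and it assumes the number always comes first, followed by whitespace.

```python
import re

def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 1 -> '1st', 12 -> '12th'."""
    if 10 <= n % 100 <= 20:
        suffix = "th"            # 11th, 12th, 13th and the rest of the teens
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"

def normalise(text):
    """Turn '7 MY STRING' into '7th My String'; title-case strings with no leading number."""
    match = re.match(r"\s*(\d+)\s+(.*)", text)
    if match is None:
        return text.title()
    number, rest = match.groups()
    return f"{ordinal(int(number))} {rest.title()}"
```

In pandas this could be applied with something like `df["col"] = df["col"].map(normalise)`, where `"col"` stands in for the real column name.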
Hello,i have a 3 class classification problem
I want to increase the loss for one of the classes
Can i penalize one of the classes of my model with this function? tf.nn.softmax_cross_entropy_with_logits
I can't understand what the logits argument does
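On the logits question: logits are the raw, unnormalised scores from the last layer, before softmax turns them into probabilities. Below is a plain-Python sketch of the arithmetic behind a class-weighted cross-entropy. It is not TensorFlow code, and the weights are made-up numbers.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    shifted = [z - max(logits) for z in logits]   # subtract the max for numerical stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(logits, true_class, class_weights):
    """Cross-entropy for one example, scaled by the weight of its true class."""
    probs = softmax(logits)
    return -class_weights[true_class] * math.log(probs[true_class])

# Hypothetical weights: mistakes on class 2 cost three times as much.
class_weights = [1.0, 1.0, 3.0]
logits = [2.0, 0.5, 0.1]
loss_class0 = weighted_cross_entropy(logits, 0, class_weights)
loss_class2 = weighted_cross_entropy(logits, 2, class_weights)
```

In TensorFlow the same idea is usually expressed by multiplying the per-example loss by a weight looked up from the label, or by passing `class_weight` to Keras `model.fit`.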
I feel like I asked this before but what is the difference between machine learning and deep learning
the difference between deep learning and machine learning can be blurry but you should consider that deep learning is a special type of machine learning where typically the algorithm does its own "feature extraction" vs. doing it yourself
trying to get help: parsing data from a .fit file into pandas DF - is that something for this forum or elsewhere?
typically deep learning involves a type of neural network structure which you dont usually see in traditional ML
Which would be faster when trying to train data sets
if i understand correctly an example of machine learning would be like a streaming platform recommending you something based on your interests and then deep learning would be like how google developed ALPHAGO to beat the best GO player in the world.
I'm in help-coconut, but do people just show up there, or do I 'recruit help; from here?
you wait for someone to volunteer themselves to help, though you can crosspost in whichever topical channel relates to your question.
your question does not appear to be data science related, however
yeah, its more about basic tuples/list/for loops.... what's the better place to start?
nevermind, I see that it's about pandas
Hello, can i use a pytorch loss function with keras model.compile?
very unlikely
i want to use the adaptive loss function available in pytorch but idk how i can do that
i have a model for 3 class classification and want to penalise one class i thought of using tf.nn.softmax_cross_entropy_with_logits
Deep learning is a subset of machine learning
Compared to using office.JS or node JS packages
Do you use JavaScript to automate or script?
I'm pretty sure that nobody here uses JavaScript to automate stuff
and what do you mean by script?
I use python because pandas is python
The only benefit to using JavaScript for this is if you are good with it and can't use Python
Unless I'm misinterpreting what you want to do

when making a NN for classification of, let's say, A, B, C, is it necessary to have the number of examples in a 1:1:1 ratio?
it doesn't have to be 1:1:1, but if it's too unbalanced, you'll have to pay extra attention to that when evaluating the model
Tensorflow has a tutorial about it: https://www.tensorflow.org/tutorials/structured_data/imbalanced_data
I checked its 350000:120000:200000
Just make sure you have a separate (possibly balanced) test set on which you can see the accuracy of the model
converting to percentages 52% ; 18% ; 30%
I had a dataset with a distribution somewhat like this, it had 400 classes, most having around 120 images, but some a bit more
And not balancing the data gave better results
Than simple down-sampling to 120 images
So just test needs to be balanced?
If you are looking at an average accuracy over the test set then that would be good imo yes
or another averaged performance measure
Otherwise if your data has a 90000:2 ratio and your test set also has a similar ratio, then obviously it can get a very high average accuracy without being able to separate the classes
As it would just guess the most common class every time, or at least a lot more often
I want it to be exceptional with no redundancies
Yes,
May be thats why test train split wasnt good enough
you might want to use balanced-accuracy instead of accuracy or something like that if your scoring function supports such
Dont know what that is
a scoring metric
yeah, macro average vs micro average performance is something to look into
macro takes the average performance per class and then averages it to get a final score
micro average takes the average over the entire set, whether or not it has been balanced
iirc*
macro treats all classes as equally important
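A small plain-Python sketch of the macro vs micro difference on made-up imbalanced labels. Recall is used as the per-class measure here, so the macro average is what is usually called balanced accuracy.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions of the class / examples of the class."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Made-up imbalanced labels: 8 of class "a", 2 of class "b".
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]      # one "b" is misclassified as "a"

recalls = per_class_recall(y_true, y_pred)

# Macro: average the per-class scores, so every class counts equally.
macro_recall = sum(recalls.values()) / len(recalls)

# Micro: pool all examples, so every example counts equally (plain accuracy here).
micro_recall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

The micro score looks better simply because the majority class dominates the pool, which is the effect being described above.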
Actually a simple project has taught me a lot, no theory class can teach..
Yeah for sure, google is a dang good professor
Do you work as a data scientist?
No I study AI
I have come to conclude your strategy of taking a test set isn't really useful in my work because the dataset doesn't represent a population
I did take the set anyway to see how it acts on balanced vs unbalanced
Why do you think this?
Because it's labelled data
And it's showing many people with a condition and not as many without
So maybe in this case I should not split? Well if I had enough data I'd split and rebalance towards reality
But that takes a lot more data than I usb
Have
Also my friends lecturer said you shouldn't do this at all
U shud balance test data
Yeah but the performance of your model should not be judged on the accuracy on the real population distribution.
Say you are a doctor and people come to you for diagnosis, as they might have a disease.
99% of the time the people have no harmful disease, so saying no to all the people would lead to an accuracy of a whopping 99%!
But you want to know how many of the patients that were sick could have been diagnosed before it got bad
So you're saying to not balance at all? That would mean it's scored purely against the balance of data done in the original testing which we don't know what it was
Why not judge it on real population? You can see its use as a screening tool
No, I am telling you what problems you run into when judging the performance of your model on the unbalanced population data
In this case as you say it's only going to be an accurate predictor of performance on people who probably have the disease
You can use unbalanced data to test your performance on, but you should consider giving all classes equal weights towards the performance measure
It's binary
then both classes, the point still holds
That's why I've balanced the data for model design
right, thats good
But testing it against an untouched holdout
That will only show performance on a group of people who mostly have the disease
Why not show how it does on a balanced population or better still a population likely to come to testing where more do not have disease
As per your advice it's only compared against unseen data from original data which is a group that mainly has disease
How's that more useful
Wouldn't it be wise to give precision scores for different balances
Because your goal is not per se to optimize the accuracy of the model on the population, but to optimize recall
Which I do prefer in health data over accuracy anyway
Making sure not too many people who have a disease will be told they're healthy
The precision is quite good on diagnosing but it's really bad on detecting those without disease
I'd say it's more acceptable than other way around
precision would be seeing how many of the people that you diagnosed as having cancer, actually have cancer
Which I say is less important
I think making sure someone who is sick, will actually be diagnosed as sick and get themselves checked out
which is recall
Which is why precision for positives is more important than negative
That's precision?
precision is true positives / (true positives + false positives)
Precision is: "I said all these people had cancer. How many of them actually did have cancer?"
Recall is: "Out of all of the people who had cancer, how many did I say had cancer?"
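Those two definitions as a small plain-Python sketch, with made-up screening results (1 = has the disease, 0 = healthy):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    # Precision: of everyone flagged as positive, how many really were positive?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everyone who really was positive, how many did we flag?
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up results: 4 sick people, 6 healthy people.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Here three of the five people flagged were actually sick (precision 0.6), and three of the four sick people were caught (recall 0.75).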
This graphic is the one every DS has taped to their wall, pret much:
Hell, I had that made into a magnet and put on my fridge LOL.
Right, you're probably going to want Recall.
But still, I have a holdout set
Is it worth balancing that at all
Desperately
Seperately
I have tested it unbalanced
As a first step
well it's not that bad when you look at recall
And that's showing performance on a biased population
and the class is heavily under represented in the training data
Over represented
Talking about the disease class
I only skimmed this, but if you're talking about doing SMITE/SMOTE with your holdout set, that's not good. You don't want to affect "real world data" by artificially inflating.
Compare to real life
The thing is my friends professor said you should smote the test set too
You're literally just scoring the model on the holdout, so it's saying, "Given that I feed this 1 row, how accurately would that be classified?" If you modified your holdout in some way, you're giving your scoring (not your model) an advantage.
I disagree, and melatonin does as well
Which makes zero sense, because your model is unaffected.
Ye
You should never SM[I/O]TE the test/holdout set.
I agree, I don't understand why SMOTE would be useful here either.
It makes zero sense to do so. It will artificially inflate your metrics.
Like, you're basically saying, "I think my model is good at detecting the thing... lemme give it a lot of easy cases it can correctly classify, which will inflate my metric."
I don't understand why there's no merit to rebalancing data to get a more accurate view of real populations so you can see how good it is as a screening diagnostic
because you are just giving it "the same" (or very similar) cases over and over
To sum this up in a very concise way: Your holdout set should, as closely as possible, represent the distribution of the real data you will be feeding it.
So instead of 50 with disease and 10 without, select as a holdout the other way around
Exactly
My point
Real data in real life
Wouldn't be every 7/10 people have the disease
You either balance it and take the averaged performance measure, or look at macro averaged performance measure
Which treats all classes as equally important
Balance?
yes, downsample
I oversampled because I have a tiny set
not upsample
It's going to really mess with my reliability
reliability?
Okay, we've got a few things going on here. The data that you're given, in general, should represent the data that you expect to collect. If this is violated, nothing else matters.
It doesn't represent a population
At all
Nothing I can do about that have to just use the cards dealt to me
Unless we're talking about a literal alcoholic hospital ward
If your current data that you're training on does not represent the data that you expect to collect, then --- you can do some things synthetically to it, as we've noted, like SMITE or SMOTE, to TRY and make it similar to the real data. This isn't great, but it does work sometimes. You'd, then, do this before you do anything else. Then you'd split into train-test/holdout. Do not do this, see below.
In the past, this has worked like... 25% of the time for me, for standard datasets, but I tend to use this more for imbalance than anything.
Actually --- hm. Actually, someone else check me here --- I don't think SMITE/SMOTE before everything works nicely because there's gonna be data-leakage.
I took pccamels advice and did the split before smote
As to have a non touched test set
Yeah, I'd do exactly that, and then test on a non-SMOTE'd test set.
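A plain-Python sketch of that order of operations: split first, then oversample only the training part. To stay dependency-free it uses simple random oversampling by duplication as a stand-in for SMOTE. The function name and parameters are made up for illustration.

```python
import random

def split_then_oversample(rows, labels, test_fraction=0.25, seed=0):
    """Split first, then oversample only the training part; the test part is left untouched."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Group the training indices by class.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # Duplicate random minority examples until every class matches the largest one.
    largest = max(len(members) for members in by_class.values())
    balanced_train = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(largest - len(members))]
        balanced_train.extend(members + extra)

    train = [(rows[i], labels[i]) for i in balanced_train]
    test = [(rows[i], labels[i]) for i in test_idx]
    return train, test
```

Because the duplicates are drawn only from training indices, no copy of a test row can end up in the training set, which is the leakage concern raised above. The split here is not stratified; with a tiny dataset a stratified split is probably the safer choice.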
But then we only have performance on a really unrealistic population
Why not inverse the balance and get a read on how it would be irl
But again, you aren't aiming for a high accuracy on the population, you want to be able to see how many of the people with diseases you can diagnose, and how many you deem healthy that actually have a disease.
You're attempting to classify something, and you should be able to still do so with your test set. Recall / Precision / whatever. It sucks that you don't have a lot of data, but that's how it goes.
If you're introducing more positive elements, then "missing" one of these elements won't be as big of a deal for your model's score.
Accuracy also kinda matters so you don't misdiagnose and waste resources
But I kind of get what you're saying here. The dataset in general isn't representative.
You don't want accuracy, you prob want prec, recall, and f1, and look at those.
Sure, there's some balance between false positives and false negatives you want to consider
Rather false positive than false negative
You can prob find some beta for F_beta and adjust accordingly, but F_1 is usually a good inbetween.
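For reference, a tiny sketch of the F_beta formula, (1 + beta^2) * P * R / (beta^2 * P + R). The precision and recall numbers are made up.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily, beta < 1 weights precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(0.6, 0.75)            # harmonic mean of precision and recall
f2 = f_beta(0.6, 0.75, beta=2)    # leans towards recall
```

If a false negative is worse than a false positive, as in the disease example, a beta above 1 is the direction to adjust in.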
Btw, I took an accuracy read earlier on the training data before doing anything to it so it's basically the same as test data in terms of distribution
My final model tuned on such performs 3% worse
Like... 500 - 1000?
I wouldn't worry too much about +/-3% to whatever metric you're using there.
So, my final report says that I literally lost accuracy after doing all this work
To perfect a model
Looks like time wasted
Sure, but how are prec / recall? Accuracy is rarely a good metric to use.
you didn't "lose accuracy", the metric wasn't correct when you artificially inflated your test data
it had little meaning
Should you test that and auroc before training too as a comparator benchmark
To compare what? If your problem doesn't lend itself to use the accuracy metric, there is no point.
ehh, before training the weights are randomized, so the results will probably be as good as random
You should try a baseline (maybe just guessing the most frequent class, or something like that) but you need to know what metrics you'll be using before scoring.
Else we can just assume it did nothing lol
Yeah using a baseline classifier (like a small or simple algo) tells you more about how well the model performs on the problem than using a random guesser
For example, in this case, you care a bit about precision and you care about recall. So you can do, you know, two models that choose either always choose zero or always choose 1 or whatever. Or you can do a simple linear model. That'll be an okay baseline.
My baseline is usually a linear model or a random forest, and I go from there.
It also tells you how complex the problem might be
But to emphasize: you need to choose your metrics before you compare anything to anything.
Most "real world" problems do very well with recall, precision, and [their harmonic mean] F_1.
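A plain-Python sketch of the "always guess the most frequent class" baseline, on made-up labels where 99% of people are healthy. It shows why the metric has to be picked first: the baseline's accuracy looks great while its recall is zero.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels, positive=1):
    """Always predict the most common training label and score that on the test labels."""
    guess = Counter(train_labels).most_common(1)[0][0]
    predictions = [guess] * len(test_labels)
    accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
    positives = sum(t == positive for t in test_labels)
    caught = sum(p == positive and t == positive for p, t in zip(predictions, test_labels))
    recall = caught / positives if positives else 0.0
    return guess, accuracy, recall

# Made-up labels: 99 healthy (0) and 1 sick (1).
labels = [0] * 99 + [1]
guess, accuracy, recall = majority_baseline(labels, labels)
```

scikit-learn ships a similar baseline as `DummyClassifier(strategy="most_frequent")`, which slots into the same scoring code as a real model.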
Oh I just tried now getting a bench mark I fit the random forest to the training data and evaluated it on predicting hold out
Scored 0.3
Weird
Scored 0.3 for what? Accuracy?
Might be because I forgot to reset kernel
Sec
Probably one of them got scaled and one didn't
make it guess the opposite and you get 0.7 ^^
I'll do without scaling
0.74
Accuracy
Recall 0.94
Lol
My model got wrecked by default
yeah those seem like good metrics
Yeah but
Pret good recall.
Then I go on to tune and do feature selection and scale
And the model then performs much worse
On the same holdout
what kinda model?
Random forest and KNN
Here's my DS secret. Many models that I make work "just fine" out of the box. Most are like, "80% good" without too much fuss. It's the iterative optimization that's the extremely difficult part.
So what I conclude is that essentially my entire processing stage as well as parameter optimisation and scaling and over sampling made my performance much worse
Tf can I fix this? Looks really bad as a conclusion lol
I want improvement
Especially RFs.
Uh. You could try out xgboost and see if that does anythin' for you as opposed to RF.
rf?
I shud use cv instead of .score right
But honestly RF works really well right outt'a the box.
Random Forests, sorry Camel.
oh random forest
RF 
I'm lookin' at a model right now for evaluation for work, and it's 90% feature engineering and then at the end it's like two lines of a grid search on a random forest. Works really well.
still performs well on non-random missing data
It's not mine, but, you know.
Anyone wana try fix my model
Maybe the problem is tuning isn't wide enough
I only did about 500 searches
Once you get time series data in the right form, it's a delight to work with. :'''] But before that? It's a gd nightmare.
So it got beat by default
Should the benchmark be done after oversampling the training data
It happens to the best of us. I get beat by my baseline model a bunch during hyperparam sweeps.
"I can't believe I lost to linear regression!"
Maybe it is over-fitting on your training data if your model is complex
It's not really complex
the bench mark should probably use the same data to train on, and the test data to test on
I saw SMOTE as just a part of the process rather than having to be done before
Then you'd also say to benchmark on scaled data too
The only thing changing is the parameter of model
smote is part of the process, but just the training process
The metrics function which gives things like recall on a table only works for a single predict
How do u cross validate and use the same table as averages
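One way to get averaged precision/recall/F1 out of cross-validation is scikit-learn's `cross_validate` with a list of scorers. This is a sketch on synthetic stand-in data, so the dataset and the model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in data; swap in your own training features and labels.
X, y = make_classification(n_samples=300, n_features=8, weights=[0.7, 0.3], random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(model, X, y, cv=5, scoring=metrics)

# One score per fold for each metric; average them to get a single table row.
mean_scores = {name: scores[f"test_{name}"].mean() for name in metrics}
```

If oversampling is part of the process, it has to happen inside each fold on the training part only. The imbalanced-learn package has a pipeline class meant for that, though it is not shown here.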
Hi, I have a question: What the meaning of 1 in LogSoftmax?
That's weird, I haven't seen softmax being used with a single output node. Typically softmax is used for classification outputs, while sigmoid is used for binary outputs
What is the purpose of this model.
what's the best plot for when I'm comparing countries and the top occupations in each country?
histogram
I was thinking of making mulitple pie charts and each representing the country and then occuptaion
it's for astronaut data
If you can fit it, I would personally start with a stacked histogram, where each country was assigned a color.
But I could see the pie thing if you had a user selection to select each country
I'm trying to slice data in pandas to look at different areas of a data frame, eg:
df['field1'][4500:9000]
however, I'm doing graphs, etc, which means if I want to look at 5000:7000, I need to change it in a lot of places.
Is there a way to define a variable " slice = '4500:9000', and then use something like df['field1'][slice] ?
with pandas, the whole dataframe is "one thing". it's not like looking up something in a list that's in a dict, where the list is a completely separate thing from the dict that it's in.
you need to use loc
!docs pandas.DataFrame.loc
property DataFrame.loc
Access a group of rows and columns by label(s) or a boolean array.
`.loc[]` is primarily label based, but may also be used with a boolean array.
Allowed inputs are:
ahhh makes sense, I only have 40 countries but that may take a lot of space good call mate. Maybe even filter it with occupation and then countries filled inside
df.loc[4500:9000, 'field1'] is probably what you need, since it indexes by row and then by column.
Yea with 40 youll have to find a way to allow the user to select. Otherwise I cannot think of a way to make it not look like a mess
something like this: ?
tzero = combined_dive_df.index[4600]
start = 4500
stop = 9000
fig = go.Figure()
fig.add_trace(go.Scatter(x=combined_dive_df.iloc['start':'stop'], y=combined_dive_df["SAC Rate (2 minute avg)_Shearwater"].loc['start':'stop'].interpolate(method='time'),
mode='lines', name='Shearwater'))
fig.add_trace(go.Scatter(x=combined_dive_df.iloc['start':'stop'], y=(14.7/100)*combined_dive_df["pressure_sac_Garmin"].loc['start':'stop'].interpolate(method='time'),
mode='lines', name='Garmin'))
combined_dive_df["SAC Rate (2 minute avg)_Shearwater"].loc['start':'stop'] this is wrong. the dataframe is one thing. this is treating it as two things.
if "SAC Rate (2 minute avg)_Shearwater" is the name of a column, it goes in the loc call after the row indexers.
are you picking both rows and columns, or just columns?
just columns for the y axis, and trying to slice from row 'start' to row 'stop'
can you do print(combined_dive_df.head().to_dict('list'), combined_dive_df.head().index) and show the text (no screenshots)?
@quick eagle these are the names of your columns. start and stop are none of them
Index(['distance_Garmin', 'enhanced_altitude_Garmin',
'absolute_pressure_Garmin', 'depth_Garmin', 'ascent_rate_mm_s_Garmin',
'heart_rate_Garmin', 'temperature_Garmin', 'unknown_135_Garmin',
'unknown_136_Garmin', 'next_stop_depth_Garmin', 'next_stop_time_Garmin',
'time_to_surface_Garmin', 'ndl_time_Garmin', 'n2_load_Garmin',
'cns_load_Garmin', 'air_time_remaining_s_Garmin', 'pressure_sac_Garmin',
'unknown_108_Garmin', 'timer_trigger_Garmin', 'event_Garmin',
'event_type_Garmin', 'event_group_Garmin', 'unknown_19_Garmin',
'unknown_20_Garmin', 'data_Garmin', 'transmitterID_Garmin',
'pressure_100_Garmin', 'Heartrate_Garmin', 'ElapsedTime_Garmin',
'Time (ms)_Shearwater', 'Depth_Shearwater',
'First Stop Depth_Shearwater', 'Time To Surface (min)_Shearwater',
'Average PPO2_Shearwater', 'Fraction O2_Shearwater',
'Fraction He_Shearwater', 'First Stop Time_Shearwater',
'Current NDL_Shearwater', 'Current Circuit Mode_Shearwater',
'Current CCR Mode_Shearwater', 'Water Temp_Shearwater',
'Gas Switch Needed_Shearwater', 'External PPO2_Shearwater',
'Set Point Type_Shearwater', 'Circuit Switch Type_Shearwater',
'External O2 Sensor 1 (mV)_Shearwater',
'External O2 Sensor 2 (mV)_Shearwater',
'External O2 Sensor 3 (mV)_Shearwater', 'Battery Voltage_Shearwater',
'Tank 1 pressure (PSI)_Shearwater', 'Tank 2 pressure (PSI)_Shearwater',
'Tank 3 pressure (PSI)_Shearwater', 'Tank 4 pressure (PSI)_Shearwater',
'Gas Time Remaining_Shearwater', 'SAC Rate (2 minute avg)_Shearwater',
'Ascent Rate_Shearwater', 'Safe Ascent Depth_Shearwater',
'CO2mbar_Shearwater', 'moles_tank_Ideal_Garmin',
'moles_tank_interpolate_Ideal_Garmin',
'moles_tank_diff_interp_Ideal_Garmin',
'liters_ambient_used_interp_Ideal_Garmin', 'moles_tank_Ideal_SW',
'moles_tank_interpolate_Ideal_SW', 'moles_tank_diff_interp_Ideal_SW',
'liters_ambient_used_interp_Ideal_SW'],
dtype='object')
and then your rows are indexed by timestamps.
correct - I'm trying to use 'start' and 'stop' as shortcuts for a slice, not as columns
as shortcuts for a slice?
in my Data there's 'Pilot' and 'pilot', thus my pandas is recognizing them as 2 unique values. Is this a good way to make them the same?
return x.replace('P','p')
my code works, but want to see if this is like "okay cool", or if it's "why do that?"
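A short pandas sketch of why lower-casing the whole value tends to be safer than replacing a single letter. The sample values are made up.

```python
import pandas as pd

occupation = pd.Series(["Pilot", "pilot", "PSP", "Space tourist "])

# Replacing one letter fixes 'Pilot' but also mangles 'PSP' into 'pSp'.
replaced = occupation.apply(lambda x: x.replace("P", "p"))

# Lower-casing (and stripping stray whitespace) normalises every value the same way.
cleaned = occupation.str.strip().str.lower()
```

After `str.lower()`, `cleaned.nunique()` counts 'Pilot' and 'pilot' as one value without touching any other capital P in a surprising way.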
see the top - works with numbers, but when trying to 'centralize' the slice indexes into variables (so I can change it in one location, not 4), it doesn't work
@quick eagle I'm not following. .loc has one or two parts. the first (required) part is the row indexer, which in your case has to be a timestamp or a slice of timestamps. the second (which is optional) is the column indexer. they both go in the .loc[ ], separated by commas. any syntax that looks like df[ ][ ] or df[ ].loc[ ] is likely to be wrong.
.iloc is similar except that it's by position, regardless of how the DF is indexed.
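A sketch of slicing with variables on a timestamp-indexed frame. The frame is a small made-up stand-in and `datafield` is a placeholder column name. Note that in the snippet further up, `'start':'stop'` are quoted strings, not the `start` and `stop` variables.

```python
import pandas as pd

# Small stand-in frame indexed by timestamp, like the dive log described above.
index = pd.date_range("2021-01-01 10:00", periods=10, freq="min")
df = pd.DataFrame({"datafield": range(10)}, index=index)

start, stop = 2, 6                       # plain integers, no quotes

# Positional slicing with .iloc: rows 2, 3, 4, 5 (stop is exclusive).
by_position = df["datafield"].iloc[start:stop]

# A slice object can be defined once and reused everywhere.
window = slice(start, stop)
x = df.index[window]
y = df["datafield"].iloc[window]

# Label slicing with .loc takes timestamps, and both ends are inclusive.
by_label = df.loc[index[start]:index[stop], "datafield"]
```

Keeping `window` in one place means the plotted range only has to be changed once, and interpolation can still be applied to the sliced series afterwards.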
Yea for: "x=combined_dive_df.iloc[start:stop]" are you trying to return rows or columns?
change it to loc
I don't think this question makes sense. a dataframe always has rows and columns.
I try to use simple language
how would you have asked the question if you weren't trying to hide any complexity?
x=df.index[4500:9000], y=df["datafield"][4500:9000]
I'm trying to replace the above with
Look at my profile and we can skip the check if I know what I am talking about
x=df.index[4500:9000], y=df["datafield"][4500:9000]
a = 4500
b = 9000
x=df.index[a:b], y=df["datafield"][a:b]
I'm not trying to call that into question. I'm trying to understand what you meant.
explicitly putting the slice indexes works, but trying to reference variables to slice with doesn;t
Oh sorry, have had some interesting people here. I am asking if he is trying to grab a specific subset of the index with the code line he pasted
basically, I have several hours of data, and just want to look at a specific time period - although plotly has some built in slicing on graphs, etc; I'm trying to do it by slicing the data frame
(but the specific time period is not known apriori)
taking this statement on its own, it sounds like you need to first make sure that the rows are sorted by the index (so that they're in order by timestamp), and then use loc to pick a slice for the first and last timestamp that you want.
yep, already sorted and indexed by timestamp
the x axis is datetime
which is also the index
Is the timestamp the index or have you created a new numerical one?
there's a print of the index in this pastebin: https://paste.pythondiscord.com/usigohoxik.py
ty
In my data there's 'NAME' and 'NATIONALITY', but sometimes the names appear twice or more because they've competed in spaceflight, my pandas will count the same person twice, how do I prevent this?
please do print(df.head().to_dict('list')) and show the text (no screenshots) so we know what you're working with.
or is this okay?
it's nasty, can I print out just the head?
yes, as long as it's followed by .to_dict('list') in the code.
So unless I am wrong here @serene scaffold I think he is attempting to insert numerical index values into the loc function when he should be putting in datetime values
{'id': [1, 2, 3, 4, 5], 'number': [1, 2, 3, 3, 4], 'nationwide_number': [1, 2, 1, 1, 2], 'name': ['Gagarin, Yuri', 'Titov, Gherman', 'Glenn, John H., Jr.', 'Glenn, John H., Jr.', 'Carpenter, M. Scott'], 'original_name': ['ГАГАРИН Юрий Алексеевич', 'ТИТОВ Герман Степанович', 'Glenn, John H., Jr.', 'Glenn, John H., Jr.', 'Carpenter, M. Scott'], 'sex': ['male', 'male', 'male', 'male', 'male'], 'year_of_birth': [1934, 1935, 1921, 1921, 1925], 'nationality': ['U.S.S.R/Russia', 'U.S.S.R/Russia', 'U.S.', 'U.S.', 'U.S.'], 'military_civilian': ['military', 'military', 'military', 'military', 'military'], 'selection': ['TsPK-1', 'TsPK-1', 'NASA Astronaut Group 1', 'NASA Astronaut Group 2', 'NASA- 1'], 'year_of_selection': [1960, 1960, 1959, 1959, 1959], 'mission_number': [1, 1, 1, 2, 1], 'total_number_of_missions': [1, 1, 2, 2, 1], 'occupation': ['pilot', 'pilot', 'pilot', 'pSp', 'pilot'], 'year_of_mission': [1961, 1961, 1962, 1998, 1962], 'mission_title': ['Vostok 1', 'Vostok 2', 'MA-6', 'STS-95', 'Mercury-Atlas 7'], 'ascend_shuttle': ['Vostok 1', 'Vostok 2', 'MA-6', 'STS-95', 'Mercury-Atlas 7'], 'in_orbit': ['Vostok 2', 'Vostok 2', 'MA-6', 'STS-95', 'Mercury-Atlas 7'], 'descend_shuttle': ['Vostok 3', 'Vostok 2', 'MA-6', 'STS-95', 'Mercury-Atlas 7'], 'hours_mission': [1.77, 25.0, 5.0, 213.0, 5.0], 'total_hrs_sum': [1.77, 25.3, 218.0, 218.0, 5.0], 'field21': [0, 0, 0, 0, 0], 'eva_hrs_mission': [0.0, 0.0, 0.0, 0.0, 0.0], 'total_eva_hrs': [0.0, 0.0, 0.0, 0.0, 0.0]}
thank you; one moment
keep in mind @thin palm that if you had only done print(df.head()), most of the columns would have been omitted, and it would be useless.
because if I want to find out how much time each person spent in space, it'll count duplicates. If Neil Armstrong went to space twice his first time in space was lets say 3 hours, then his next mission was 27 hours. it'll count 30 + 30 = 60. Even though it is only 30 hours in space.
But note that when I explicitly use numbers ([int:int], it works just fine; the top half of https://paste.pythondiscord.com/naranaxiba shows this and it works just fine
@quick eagle I think it is because you are calling the loc function vs the index function but I am not 100%. I would create a new numerical index in place of the current one:
df=df.replace_index(drop=False)
I think you need something like df.groupby('name')['column_to_sum'].sum(), but where you replace 'column_to_sum' with a column name
Hang on this is the wrong function. I just got back from vacation lol one sec
ahh okay, this makes sense
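A minimal sketch of two ways to avoid the double counting, using a few columns copied from the head posted above. It assumes `hours_mission` is per flight and `total_hrs_sum` is a per-person total repeated on every row, which is how the sample looks; that is worth checking against the full file.

```python
import pandas as pd

# A few columns from the posted head: John Glenn appears once per mission.
df = pd.DataFrame({
    "name": ["Gagarin, Yuri", "Titov, Gherman", "Glenn, John H., Jr.",
             "Glenn, John H., Jr.", "Carpenter, M. Scott"],
    "nationality": ["U.S.S.R/Russia", "U.S.S.R/Russia", "U.S.", "U.S.", "U.S."],
    "hours_mission": [1.77, 25.0, 5.0, 213.0, 5.0],
    "total_hrs_sum": [1.77, 25.3, 218.0, 218.0, 5.0],
})

# Option 1: add up the per-mission hours for each person.
per_person = df.groupby("name")["hours_mission"].sum()

# Option 2: keep one row per person and read the precomputed total.
one_row_each = df.drop_duplicates(subset="name")

# Country totals built from one row per person, so nobody is counted twice.
per_country = one_row_each.groupby("nationality")["total_hrs_sum"].sum()

# For comparison: summing the repeated totals directly counts Glenn twice.
naive_country = df.groupby("nationality")["total_hrs_sum"].sum()
```

Which option fits depends on whether the per-mission hours or the precomputed totals are the more trustworthy column in the real data.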
the purpose of the model is to predict whether diabetes or not
@quick eagle its reset_index(drop=False)
Then you can use numerical ranges for the loc function. Make sure that your data is sorted correctly first though
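A small sketch of the difference being discussed, on a made-up daily series (the real data in the paste is not reproduced here). It only illustrates label-based versus position-based slicing and what `reset_index(drop=False)` does to the index.

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=6, freq="D")
df = pd.DataFrame({"value": range(6)}, index=idx)

# .loc is label-based: on a DatetimeIndex it expects dates, and both ends are included.
by_label = df.loc["2021-01-02":"2021-01-04"]

# .iloc is position-based and takes plain integers whatever the index is.
by_position = df.iloc[1:4]

# reset_index(drop=False) moves the dates into a column named "index"
# and installs a 0..n-1 integer index, so .loc accepts integers again.
flat = df.reset_index(drop=False)
by_int_label = flat.loc[1:3]  # still label-based, so 3 is included
```

If the datetime index is still needed for things like time-based interpolation, `.iloc` avoids having to reset it at all.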
for plotting purposes does a box plot sound like a good idea? to see average, mean, and outliers?
For binary classification you should be using a sigmoid output layer then
hmm.. problem is that I'm merging 4 different sensors at different sampling rates, so I have to interpolate the data when graphing (otherwise nothing appears), and it needs to be interpolated by method=time
if the model was to predict like type of diabetes you would most likely use softmax with the number of nodes equal to the number of diabetes types
I design the architecture as a multiclass classification
so you can do this but it is unusual and depending on the dataset it wont yield as effective of a model. How well do you know the math for the two regression models?
sigmoid is the same as logistic regression and is always the go to choice for binary outputs
and I looked into this further, I am pretty sure you would need to have 2 output nodes with softmax regression even if you are doing a binary classifier
so I hard-coded the interpolation, and then reset the index (and used x = df['timestamp'] [a:b] ) and that works!
amazing!
I don't really know about the two regression models. I'm a beginner with neural networks and this is my first case study to learn from. Can you explain it?
In this case I use a LogSoftmax to get an integer value to predict yes or no. If I use binary classification the value is a floating point number in the range 0 to 1
I think we can use both
without getting way too complicated: with softmax and 2 output nodes, the first node gives a value in 0-1 that represents the model's prediction for the first class (diabetes) and the second node gives a value in 0-1 that represents the prediction for the second class (not having diabetes). For logistic it predicts two mutually exclusive outcomes with one value, so 0 will be no diabetes while 1 means that the model predicts diabetes
looks like the 'timestamp' column is preserved as a datetime (so I can do x = df['timestamp'][a:b] - df['timestamp'][time_zero] ), but the x axis labels are '0, 0.2T, 0.4T' etc ... never seen that before. what does that 'T' notation mean? (I'm trying to basically set a start time, and then elapsed time (in min:sec) since the time_zero point.....
Additionally my thought is that if you dont use sigmoid it wont "let" the model know when it backward propagates that the two values are in fact mutually exclusive. So there is a possibility, in the way the math is written, for the model to predict that an individual both has diabetes and doesnt have diabetes at the same time
And although this is unlikely as the dataset prolly doesnt have those values it will lead to less efficient training and a less accurate model
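For reference, the two output conventions can be compared numerically. This is a sketch in plain NumPy with made-up logits, not the poster's model: a softmax over two logits always sums to 1, and its second entry equals a sigmoid applied to the difference of the two logits.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

logits = np.array([0.3, 1.5])  # hypothetical scores for [no diabetes, diabetes]

p_two_nodes = softmax(logits)                # two probabilities that sum to 1
p_one_node = sigmoid(logits[1] - logits[0])  # single probability of "diabetes"
```

So a two-node softmax and a one-node sigmoid describe the same family of predictions; the practical difference is mostly in which loss function each one is paired with.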
one of the best ways I really learned this in my classes was writing a simple neural network from scratch: this is a fun tutorial I dug up. Try inputting both value and look at the results https://towardsdatascience.com/how-to-build-a-simple-neural-network-from-scratch-with-python-9f011896d2f3
Ok, thank you so much
of course, this stuff is very complex. Even with a masters in it, it can seem like a blackbox that just spits out numbers. I would say really dive into the math for each of the layers. And if you want to be an actual expert get a math PhD lol.
Does this boxplot make sense to everybody? the amount of time each country has spent in space.
that x axis label makes no sense
also what is the unit here? the total time per country is one value per country
I would sort from most to least. also, it should be a total - not a range?? or are you doing it by astronaut?
a boxplot represents a distribution of values
this, exactly. what is each data point? per astronaut?
per astronaut per country
i would at least recommend sorting by median astronaut hours or by total number of astronauts
What plots would you use?
I like that better tbh
the title should be "Distribution of astronaut total times in space, by country"
and the x axis could be "Total hours spent by astronaut in space"
perfect!
so the box plot is fine yes? Just my labels?
it's fine in that it shows something that isn't nothing. but are you trying to show something specific? or just "something"
I would emphasize by the title "Distribution of individual astronaut total times in space, by country"
my assignment is to take this data and create something out of it. I have data on Astronauts -> so I'm figuring out which country sent the most humans, what year they were being sent, men v women being sent, etc
It's my job to tell a story with this data
also, you could try a violin plot, etc. the problem you have is with the US data set - it's extremely bimodal (~2week shuttle flights and 6month ISS expeditions), which makes the boxplot have a ton of outliers
+1 for violins, great observation
did you try just plotting individual points?
what does the violin plot tell?
it might be interesting to plot number of astronauts vs total flight hours across all astronauts on a scatterplot
individual points in what?
just a dot per astronaut total time per country. no boxes
so scatter plot maybe?
I usually start with scatterplot, and if the data distribution permits, then summarize with something else (eg boxplot, etc)
ok ok, data viz not my thing but it's so valuable
Hello, i have a 3 class classification problem and i want to penalise 1 class
Does this work for that tf.nn.softmax_cross_entropy_with_logits
I can't understand what i should put in the place of logits and labels
boxplots work best for normally distributed data. they can be used for non-normal distributions, but they are less useful
data viz is def something to focus on especially in industry setting
Strongly suggest the Edward Tufte series !!!
how you persuade stakeholders is important
(4 books)
aka half of your job sometimes 
my approach was to see what the average time spent in space per astronaut per country. But I'll do a scatter instead
It's for a job hopefully I make some cool stuff here
average is not great for spaceflight times, as they are either minutes, ~2 weeks, 6 months, or 1 year. You could try those as 3 bins
*4 bins
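A rough sketch of the binning idea with `pd.cut`. The bin edges and label names here are assumptions picked for illustration, and the last two durations are made up; only the first four values come from the posted sample.

```python
import pandas as pd

hours = pd.Series([1.77, 25.0, 5.0, 213.0, 4000.0, 8000.0])

# Hypothetical edges in hours: 1 day, 30 days, 270 days.
edges = [0, 24, 24 * 30, 24 * 270, float("inf")]
labels = ["under a day", "days to weeks", "months", "about a year"]

# Each interval includes its right edge by default.
duration_class = pd.cut(hours, bins=edges, labels=labels)
counts = duration_class.value_counts()
```

A bar chart of `counts` per country then shows the short-flight versus long-duration split directly, without an average smearing the groups together.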
great call, I didn't process much about it. But this makes sense thank you so much
yeah so part of it is really understanding the data
and that comes with experience or domain knowledge
and the obvious story that ought to pop up right away is the soviet mostly long duration, the US has mostly short due to shuttle (10-14 days, but each flight had 7 crew, whereas ISS are crews of 3 per expedition)
how do you feel about what I have?
1.)The amount of times each country has sent someone on a space mission
2.)Plot of the amount of times only women have gone on missions (only showing which countries have sent them)
3.)Histplot showing what year each astronaut was selected for space missions (showing the year that had most declared missions)
4.)When the year of mission was actually initiated (another histplot) from 1960 - 2020
yes - knowing about the history of spaceflight lets you shortcut the wonky distributions
1 could be panel A: # times by citizenship, B: total by citizen
2 - yes, although it would be 'women', not 'only women', as there haven't been any all-female crews (yet!)
3 - may be interesting when they were selected vs first flew
4- yes
your initial plot is 1B (with the corrections discussed)
"amount of times each country has sent someone on a space mission" -> "number of times a citizen of that country has gone to space" ; international partners go on US, or Russia/USSR vehicles
I think for a final graph I will include which occupation was in space the most, we have pilot, PSP, commander, space tourist. Would be cool to see which occupation was most trusted and which went to space just for fun (space tourist)
you may have to tweak that - I think you mean crew role (pilot ,commander, mission specialist, payload specialist for NASA, commander, flight engineer, spaceflight participant for RSA), not occupation/training (pilot, geologist, doctor, electrical engineer, etc)
sorry by occupation I am referring to my data! Yes role in the mission
also, commander vs pilot is a bit tricky (the 'lead pilot' was the commander, but 2 were trained as pilots for shuttle; the rest were all mission specialists, with the occasional payload specialist for shuttle). apollo was 3 pilots - lunar, command module, and commander...
Can someone help me with pycharm + virtualenv + jupyter notebook?
I have the venv created with inherit global options (for jupyter in global python) and tensorflow in my new virtualenv
I open the jupyter notebook, try to import tensorflow and it shows as it is not installed but it is installed ._.
on the other hand, plotting the wonky distributions correctly actually helps you see the pattern
how's this puppy?
nice! looks like you may need to try log x scale? everything is piled up on the left
yeah I moved my nationality over to the y axis because it was getting crunched up on the x axis and I couldn't figure out how to resize
I'm trying to calculate a difference over a time period (eg 2 min), but my dataset has samples at a non-constant sample rate. is it possible to do a .diff(period=X) where X is '2min', not a set number of steps? my index is timestamp
(this is kind of like using .interpolate(method='time'), but with diff)
Is it ok to initially have higher accuracy than the train set....cuz they really come from same distribution...
I think you could use .rolling using a timestamp and then apply the .diff
(the documentation links to here: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases)
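One way the rolling suggestion could look, on made-up irregular samples. Note the assumption: this takes the last sample minus the first sample inside each trailing 2-minute window, which is not the same as interpolating the value exactly 2 minutes earlier.

```python
import pandas as pd

idx = pd.to_datetime([
    "2021-01-01 00:00:00",
    "2021-01-01 00:00:40",
    "2021-01-01 00:01:30",
    "2021-01-01 00:02:10",
    "2021-01-01 00:04:00",
])
s = pd.Series([0.0, 4.0, 9.0, 13.0, 24.0], index=idx)

# A time-based window covers (t - 2min, t] for each timestamp t.
delta = s.rolling("2min").apply(lambda w: w.iloc[-1] - w.iloc[0], raw=False)
```

The window string uses the offset aliases from the linked page, so `"30s"` or `"5min"` work the same way.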
why does the line not fits to the dots?
vector<double> X = { 38, 50, 15, 30, 50, 38, 50, 20, 45, 50, 20, 35, 30, 43, 35, 37.5, 37, 35, 30, 45, 4, 37.5, 25, 46, 30, 200, 200, 30
};
// variable X
vector<double> Y = { 8000, 6400, 2500, 3000, 6000, 5000, 8000, 4000, 11000, 25000, 4000, 8800, 5000, 7000, 8000, 1800, 5400, 15000, 3500, 2400, 1000, 8000, 2100, 8000, 4000, 1000, 2000, 4800
};
double alpha = 0.0001; // learning rate
int epoch = 1000;// number of epochs
SimpleLinearRegression *slr = new SimpleLinearRegression(X, Y, alpha, epoch, true);
slr->train();
slr->print_yhat();
vector<double> Y_c = slr->predict(X);
// denormalize Y_c
vector<double> Y_c_denormalize;
double Y_MAX = *max_element(Y.begin(), Y.end());
double Y_MIN = *min_element(Y.begin(), Y.end());
double X_MAX = *max_element(X.begin(), X.end());
double X_MIN = *min_element(X.begin(), X.end());
for(int i = 0; i < Y_c.size(); i++){
Y_c_denormalize.push_back(Y_c[i] * ((Y_MAX - Y_MIN) + Y_MIN));
}
double Y_c_MAX = *max_element(Y_c_denormalize.begin(), Y_c_denormalize.end());
double Y_c_MIN = *min_element(Y_c_denormalize.begin(), Y_c_denormalize.end());
// Scatter plot
matplotlibcpp::figure_size(700, 500);
matplotlibcpp::scatter(X, Y, 25);
double x = 45;
double y = slr->predict(x);
double y_denorm = y * (Y_MAX - Y_MIN) + Y_MIN;
cout << "Prediction of " << x << " Hours Per week is " << y_denorm << " Income" << endl;
matplotlibcpp::plot({X_MIN, X_MAX}, {Y_c_MIN, Y_c_MAX}, "r");
matplotlibcpp::xlabel("Hours per Week (x)");
matplotlibcpp::xlim(0, 80);
matplotlibcpp::ylabel("Income (y)");
matplotlibcpp::title("Scatter Plot");
matplotlibcpp::show();
the problem here is when I want to denormalize the value of Y_c
I just want to hardcode this one in C++ rather than using libraries in python
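One thing that stands out in the posted code is the parenthesisation in the loop: `Y_c[i] * ((Y_MAX - Y_MIN) + Y_MIN)` simplifies to `Y_c[i] * Y_MAX`, while the single-point version further down uses `y * (Y_MAX - Y_MIN) + Y_MIN`. Assuming the class min-max scales Y internally (its source is not shown), the inverse would look like the sketch below. It is written in Python for brevity; the arithmetic is the same in C++.

```python
def denormalize(y_scaled, y_min, y_max):
    """Invert min-max scaling: y_scaled = (y - y_min) / (y_max - y_min)."""
    return y_scaled * (y_max - y_min) + y_min

def as_posted(y_scaled, y_min, y_max):
    # Parenthesised as in the loop above: this collapses to y_scaled * y_max.
    return y_scaled * ((y_max - y_min) + y_min)

y_min, y_max = 1000.0, 25000.0  # min and max of the posted Y values

low = denormalize(0.0, y_min, y_max)        # maps back to y_min
mid = denormalize(0.5, y_min, y_max)        # halfway between min and max
wrong_low = as_posted(0.0, y_min, y_max)    # lands on 0 instead of y_min
```

It may also be worth checking the plotted line: pairing `{X_MIN, X_MAX}` with `{Y_c_MIN, Y_c_MAX}` only draws the fitted line if the slope is positive, so plotting the predictions at `X_MIN` and `X_MAX` directly is safer.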
hello, i have a dataframe with a column named marks whose values are 21100 23000 25650 78550 36100 22600 22700 34550. i want to get the rows which are multiples of 100, e.g. my expected output is 21100 23000 36100 22600 22700. ping me when you reply
i need suggestion about image classification or related field about image processing in ML
is it common to consider the original image resolution and depth (eg: dpi of the image) before feeding it to the NN for training? since i'm aware that the image sizes used are generally small (ranging between 120x120 px and 200x200 px) to feed the NN.
Also, is it worth considering processing each channel separately and combining them at the end of the NN? since my intuition says that for a colored image, the R, G and B channels must have different values and could each have a different story to tell the NN about the image.
Remove outlier
Where can I find code for back propagation ?
what outlier?
Can somebody help me with bp neural network algorithm ?
Use modulo 100
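A short sketch of the modulo suggestion, using the values from the question. The column name `marks` is taken from the question; everything else is plain boolean indexing.

```python
import pandas as pd

df = pd.DataFrame({"marks": [21100, 23000, 25650, 78550, 36100, 22600, 22700, 34550]})

# Keep the rows whose remainder after dividing by 100 is zero.
multiples = df[df["marks"] % 100 == 0]
```

This assumes the column holds integers; if it was read in as strings it would need converting first.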
@misty flint i need help
If a single layer MLPClassifier gives me an accuracy of like 94%
does adding more layers guarantee any more precision or scope of improvement
or is it just black box testing
may or may not work?
!e Why is it not rounding the decimal places in the array
import numpy as np
w = np.array([9.79810329e+209,
2.01077594e+210,
1.57202605e+210,
2.53363565e+210])
print(np.round(w,10))
@woeful falcon :white_check_mark: Your eval job has completed with return code 0.
[9.79810329e+209 2.01077594e+210 1.57202605e+210 2.53363565e+210]
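A likely explanation, offered as a sketch: the values are around 1e209, so rounding to 10 decimal places has nothing visible to change, and the printout is scientific notation of the same numbers. If the goal is a shorter display, limiting the printed precision is one option.

```python
import numpy as np

w = np.array([9.79810329e+209,
              2.01077594e+210,
              1.57202605e+210,
              2.53363565e+210])

# Rounding at the 10th decimal place cannot visibly change numbers this large.
rounded = np.round(w, 10)

# To shorten what is printed, limit the digits shown instead.
short = np.array2string(w, precision=3)
```

`np.set_printoptions(precision=3)` applies the same display limit globally instead of per call.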
bro
so
the data set was like
containership sizes
ok
i tried to see if there is some correlation between
the age of the ship and the size
sht like that
heatmap
i couldnt
it said there are strings in the data
i check on excel for strings
literally nothing
one question was to split the size of the containers into 5000 bands
and plot and find the distribution
wtf
"cant split strings"
what can i do man
never began i gotta work as some cleaner forever now
You can write sentences without an enter every 3 words
im sorry im just mad

And if you are using pandas, you should try convert the columns to floats instead
if they are supposed to be floats
yeah they are
well this was the analysis
i wrote down what i would have done
hopefully i can talk my way into it lol
tmro is the interview
wait do you think i can get away with trying something now after the allotted time and presenting it in the interview would it be appropriate
thats rough
idk what they expect for 30 mins tbh
so i feel like whatever you say is fine
lolo
bruh i dont get it
30 mins
like
not enough time omg
its for a junior role as well
do they expect me to be some pro at 20 yrs old 0 experience
yeah thats def not enough time
for much
only if youre experienced would you maybe get anything valuable
oh i found out my problem
the fking added commas for the 10000
like
13,000
so the code wont see this as a number
i gotta split(",") like this right? or something
which function did you use
no i tried to make stuff like
heatmaps
to find correlations
.corr()
uh
describe()
you cant do that stuff
without converting datatypes first
you basically have dirty data
gotta clean it first
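A minimal cleaning sketch for the "13,000" problem, with made-up values and a hypothetical column name `size`. It strips the thousands separators, converts to numbers, and then builds the 5000-wide bands from the question.

```python
import pandas as pd

df = pd.DataFrame({"size": ["13,000", "4,500", "9,800", "21,000"]})  # made-up values

# Strip the thousands separators, then convert; errors="coerce" turns anything
# still unparseable into NaN instead of raising.
df["size"] = pd.to_numeric(df["size"].str.replace(",", "", regex=False), errors="coerce")

# 5000-wide bands for the distribution question.
edges = list(range(0, 25001, 5000))
df["band"] = pd.cut(df["size"], bins=edges)
band_counts = df["band"].value_counts().sort_index()
```

If the file is loaded with `pd.read_csv`, passing `thousands=","` handles the separators at read time. Checking `df.dtypes` before calling `.corr()` or `.describe()` shows which columns are still strings.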
since you didnt really need python
I usually at least remove nan before I do heatmap tbh
@strange stump yes literally everywhere is expecting 20 year olds to be pro rn
It's the new meta
i have more experience in python rex
i used python to analyse data for uni work
At least u got that job dude, some of us data scientists have to claw our way up from junior roles and internships doing Excel
since it looks better on reports
And have training in ML
i dont have the job this was the first half of the interview
yea i should probably just look elsewhere
probably
But I'd not want to have to learn pandas in one day lol
Ok just check the data for NAN values
And if there's any consider filling them with column medians or modes
The data is integer?
i see. i think you just got unlucky bud
ok gtg bye
cya boss
i checked for null values
isnull().sum()
all gave 0
now i try to use corr() and the output is "__"
Umm
Dm me a screenshot of the df and the matrix code
matrix code?
If in doubt google ur question
do you want my python stuff?
Did u try to google it
Chances are someone's posted on stack overflow this question
Google pandas corr giving ___
OMG IT IS FLOATS
bro i kid you not i searched how to convert it to float, copied the code and it converts it into int64
yea
Haah TIL
i converted all the data into floats
which i thought thats what my code did when i searched for " convert column into float"
Yeah learning process is literally how good are you at googling
nah this is bs ima complain about the time in the interview
but i think they will like that im trying again
hopefully anyway
Well I could prob do this in under 10 mins
this is my heatmap i wanted
Nice
but
They gave u three columns?
i send a screenshot of the head of the dataset
U can also mask half of that and save eyesore
some are worded answers
If you haven't carefully tested something but just "played around with a lot of different configurations", how can you neatly put this in a report?
Like we empirically found that this type of model gave the best results so we used this.
Hmm by that I mean make it a triangle
Or something of that nature
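A sketch of the "make it a triangle" idea for a correlation heatmap, on random made-up data with hypothetical column names. The boolean mask marks the diagonal and everything above it, which only repeats the lower triangle.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 3)), columns=["size", "year_built", "speed"])  # made-up data

corr = df.corr()

# True on the diagonal and above it; those cells duplicate the lower triangle.
mask = np.triu(np.ones_like(corr, dtype=bool))

lower_only = corr.mask(mask)  # masked cells become NaN
```

With seaborn, the same `mask` can be passed as `sns.heatmap(corr, mask=mask)` so the hidden cells are simply not drawn.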
I'd say that ahha
i know what youre saying yeah ill look into that later im happy i know how to do this now
I'd just say an initial test proved certain models stronger
Show us the matrix with all columns
You have year built column
U can create a new column that passes exactly that formula
Ur gona need to google syntax
Are u applying to data analyst?
USA?
UK
Me too
I was under the impression most of those jobs are rly hard but maybe it's regional
I'm prob gona have to do this type of work in my first year
No one wants a data scientist without analyst experience anymore
FMl
idk man i dont think im smart enough for this
U already done half of it
It doesn't get much harder
Now just do some plots
My hint is use pairplot by seaborn to scout out areas of interest
I do that
And just use matplotlib or pandas to plot bars
Or distributions
oh wtf
U are in the data science room
And I have some super basic code that's not working.
You guys don't do data science in python
I do
I am building linear regression. That is data sciency
Nice
none of these graphs are useful imo
They are
Isn't it a key part of analysing a data set
U now see where all the ships were built
Which years got more contracts
Now u can plot these individually and later remove the pairplot
What exactly is the task
If it's general analysis what's bad about plotting to find which years were best
Or correlations
I mean some of those are just a literal extension of ur matrix
one of the questions was to find some trends
U can plot the trends on a graph from ur matrix
As u can see those highly correlated dots
ok
so i was trying to train a model with tensorflow object detection module and this problem came up , can anybody tell me how to change checkpoint version to V2 ?
Is it possible to convert an opencv model to tensorflow model?
Ultimately I just want to use the model to run in an Android app
Found this. For advanced Python crash course : https://medium.com/coders-mojo/python-crash-course-part-2-78acc9694997?sk=ebf1a04a790fe88f6483e0d56f12fbf2
Thanks @modest mulch for response.
my application is following:
There will be an online examination system.
In the question, the attached image will be shown. (image may be vary) and asks student to create the same in ms word.
our program collects all student's created word doc and compare with our word document.
I want to make a program that give scores based how created document is similar to provided one.
- this will compare template
- font size
- color
- font-face
Can't u just use ur eyes and look if it's the same
To check for typos just use a text filter
Otherwise you're going to need some state of the art computer vision
Wouldn't students just be able to copy the image and paste it in word?
What info are you planning on feeding the model?
lol
This is a good example of using AI for a simple task that would require some really advanced AI or no AI
Traceback (most recent call last):
File "D:\college_project\modules\model_train.py", line 21, in <module>
model.add(Convolution2D(16, 3, 3, activation = 'relu'))
File "C:\Users\shubh\anaconda3\lib\site-packages\tensorflow\python\training\tracking\base.py", line 629, in _method_wrapper
result = method(self, *args, **kwargs)
File "C:\Users\shubh\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\shubh\anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2013, in _create_c_op
raise ValueError(e.message)
ValueError: Exception encountered when calling layer "conv2d_1" (type Conv2D).
Negative dimension size caused by subtracting 3 from 1 for '{{node conv2d_1/Conv2D}} = Conv2D[T=DT_FLOAT, data_format="NHWC", dilations=[1, 1, 1, 1], explicit_paddings=[], padding="VALID", strides=[1, 3, 3, 1], use_cudnn_on_gpu=true](Placeholder, conv2d_1/Conv2D/ReadVariableOp)' with input shapes: [?,1,1,8], [3,3,8,16].
Call arguments received:
⢠inputs=tf.Tensor(shape=(None, 1, 1, 8), dtype=float3``` how to fix this error
the old classic. so many tasks are still more or less intractable or unsolved for "AI" and machine learning, but are trivial for humans (if perhaps slow/tedious)
you could just look at some image similarity metric
but i have a feeling that will not be easy to tune and will not reliably give good results
you'd need to segment the image and compute similarities on various parts + some kind of graph similarity for the overall structure
yeah but this
you'd spend 5x as long building the model as you would grading by hand
and yeah that too lol
You have to use more data to make sure it's not just that
well you'd have to make it low res in the exam, too low to scale up properly to the document size
Hello, is this only for binary classification https://www.tensorflow.org/api_docs/python/tf/nn/weighted_cross_entropy_with_logits
Computes a weighted cross entropy.

bruh
first rule of google's ML
solve the problem without ML if you can

no
Oh okay thank you, could you please tell me what i should put for logits in this function
can you clarify this question?
the logits are the outputs of your model
basically the stuff that comes out of the final output layer, before applying softmax
they are called "logits" because conceptually they are the result of applying the logit function to the predicted probabilities
Ohhh
Is that the y_pred?
no, it's the "raw" values that come out of the final layer
What would happen if somebody made self aware ai?
it would realize how horrible the world is and delete itself.
Lol
I was wondering if it would be like the matrix smh
in either case, a "self-aware AI" is a long way out. the way AI is depicted in the media is just wrong.
Whats the most developed ai we have developed as of now?
Ohhh okayy,thank you so much!!
So i have a question, what do i put in the argument.. this tf.nn.weighted_cross_entropy_with_logits takes 3 arguments
most developed ai for what? AIs are designed to solve specific problems.
Labels,logits and pos_weight
Oh, i know nothing about ai lol, just a general curiosity
Idk
Whats the most impressive ai then
stuff like alexa and google home is pretty impressive
uhh, GPT-3 is a model that's able to generate long realistic-sounding texts, but that doesn't mean that the model actually "knows" what the text means.
Alpha Zero (or AlphaGo) has some nice advancements in complex-ish games
Nvidia has some crazy image manipulation stuff
yeah i'd say that the alpha-stuff is probably the most-developed for general-purpose problem solving, at least that the public knows about
Thats ai? I thought it was just a box thats told "heres 4000 odd different ways to ask what the weather is, if ur asked this, talk about weather"
That sounds interesting
that's literally what people thought AI would be for ~50 years. look up "expert systems" and "symbolic AI"
But it definitely uses machine learning too
yes, there's a few AI components for those products. the intent classifier figures out what you're asking it to do. the automated question/answerer takes a question and searches for text that answers it.
the definition of "AI" is fuzzy and has been co-opted by marketing teams to sell machine learning products
I see
0 0.96 0.97 0.97 100
1 0.93 0.74 0.82 100
2 0.78 0.93 0.85 100
accuracy 0.88 300
macro avg 0.89 0.88 0.88 300
weighted avg 0.89 0.88 0.88 300
This is my report,And i wanted to improve class1's recall
@arctic blade Google (the search engine) is an AI: it's a document retrieval system that figures out what documents (web pages) are relevant to your query.
I never thought about that, thats pretty cool
pos_weight allows one to trade off recall and precision by up- or down-weighting the cost of a positive error relative to a negative error.
this is what the docs say
so you can try adjusting pos_weight above 1
the docs are pretty clear, they even have formulas
for one specific class, you might need to specifically assign class weights
Yupp i saw this,but should i pass the fully connected layer as the logits
Ohhh
ah, it looks like pos_weights can be a vector
so you can assign different weights to different classes
A coefficient to use on the positive examples, typically a scalar but otherwise broadcastable to the shape of
logits. Its value should be non-negative.
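To make the `pos_weight` behaviour concrete, here is a NumPy restatement of the formula given in the TensorFlow docs for `weighted_cross_entropy_with_logits`. It is a sketch for understanding, not a replacement for the TF op, and the logits are made up.

```python
import numpy as np

def weighted_sigmoid_ce(labels, logits, pos_weight):
    """Element-wise weighted sigmoid cross entropy, following the documented formula."""
    neg_log_p = np.logaddexp(0.0, -logits)     # -log(sigmoid(x))
    neg_log_not_p = np.logaddexp(0.0, logits)  # -log(1 - sigmoid(x))
    return pos_weight * labels * neg_log_p + (1.0 - labels) * neg_log_not_p

labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
logits = np.array([[2.0, -1.0, 0.5], [0.3, -0.2, -1.0]])  # made-up raw outputs
pos_weight = np.array([1.0, 3.0, 1.0])  # up-weight positives of class 1 only

loss = weighted_sigmoid_ce(labels, logits, pos_weight)
unweighted = weighted_sigmoid_ce(labels, logits, 1.0)
```

Only the entries where the label is 1 for class 1 get scaled, which is why raising that weight pushes the model toward higher recall on that class.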
I wanted to do this, I'll look at that
Ohh okayy,thank youu!! Just the last question, how can i pass the raw outputs
what do you mean?
how can you access the values without applying softmax?
you could just not put the softmax layer on the nn, and apply it manually when generating predictions. but i'm not sure what actual tf users do, let me see if i can figure it out
Yeahh
tf.nn.weighted_cross_entropy_with_logits(
  labels, logits, pos_weight, name=None
) like for this, i can put y for labels ,i gotta put the raw outputs for logits right?
yes, do not apply softmax to the logits. weighted_cross_entropy_with_logits does that internally
this is described in the page for https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits softmax_cross_entropy_with_logits
Computes softmax cross entropy between logits and labels.
ah wait
yeah nvm
Ohhhh,yes i came across this but is it the same as weighted cross entropy loss?
weighted_cross_entropy_with_logits applies sigmoid, not softmax
the docs say that
sorry i misread
Ohh okayy, it's alrightt
Should i use this instead?but it isnt similar weighted cross entropy loss is it?
@desert oar yo man, do you have any idea on using GANS for generating object on images (the output of GANS could directly be fed into an object detector)
your question doesnt make any sense. you can just do one and then the other afterwards
sigmoid applies the sigmoid function to each component individually, softmax applies softmax to all components
sigmoid is for independent binary classes, softmax is for 1 mutually exclusive set of classes
Okayy yess correct, thank you so much ,i understood
But this doesn't have the weights options so that i can penalize one class
https://gist.github.com/wassname/ce364fddfc8a025bfab4348cf5de852d do you think this is incorrect
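A common way to penalise one class with a softmax loss is to scale each sample's loss by a weight looked up from its true class. The sketch below shows that idea in NumPy with made-up logits; it is not a review of the linked gist and not a TensorFlow API.

```python
import numpy as np

def weighted_softmax_ce(one_hot, logits, class_weights):
    """Per-sample softmax cross entropy, scaled by the weight of the true class."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -(one_hot * log_probs).sum(axis=1)
    sample_weights = (one_hot * class_weights).sum(axis=1)
    return per_sample * sample_weights

one_hot = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
logits = np.array([[1.2, 0.1, -0.3], [0.2, 0.4, 0.9], [0.5, 0.3, 0.1]])  # made-up
class_weights = np.array([1.0, 2.0, 1.0])  # mistakes on class 1 cost twice as much

weighted = weighted_softmax_ce(one_hot, logits, class_weights)
plain = weighted_softmax_ce(one_hot, logits, np.ones(3))
```

When training through Keras, the `class_weight` argument of `model.fit` applies the same kind of per-class scaling without a custom loss.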
GPT3 - the fact that it can learn things is pretty insane
both were trained on something like this
on the surface, it often looks like a "stochastic parrot" (I certainly thought so too) but its really from some digging that one actually understands how much it can do as compared to previous methods
as much as people hate calling it "intelligent" on the internet - those are usually ones posting blogs who live in an extreme, expecting GPT3 to be skynet-like AGI
while in the academic community, its mostly GPT3's meta-learning capabilities that really astound. Its completely unexpected, was never thought to be emergent yet the model managed to do it a bit... just by being pre-trained on language modeling (next-token prediction)
This one gives better test accuracy, so if this test data represent new data well, then this one is better
Also using too many epochs can cause overfitting
The model converged way before 5 epochs, let alone 30
yup
i will try one 100 epoch....then see how it fits.....for the report i will adjust epochs accordingly
i wanna see....
you want to see it overfitting?
i want to add in a report....isnt lower epoch graph very noisy
you can average it over multiple runs
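A tiny sketch of averaging curves over runs, with made-up accuracy numbers for three runs of four epochs each.

```python
import numpy as np

# Made-up validation accuracy per epoch, one row per run (different seeds).
runs = np.array([
    [0.90, 0.93, 0.94, 0.94],
    [0.88, 0.94, 0.95, 0.94],
    [0.92, 0.92, 0.93, 0.95],
])

mean_curve = runs.mean(axis=0)  # average accuracy at each epoch
spread = runs.std(axis=0)       # run-to-run variation at each epoch
```

Plotting `mean_curve` with a band of plus or minus `spread` gives a smoother graph for a report than any single run.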
Hey all! Old timer AI guy here - used to run my own C++ libraries - how are y'all running performant python code?
and what do you guys think of this, i tried like 30 models before i settled on this
Using pytorch
Yeah, that's a given
yeah but it has a nice api to do that, you don't have to figure that all out yourself from scratch
Just needed to confirm my suspicion and an argument I've had with a colleague that debated that python was as performant as C -eyeroll-
well pytorch is mostly written in c++ iirc
Yeah... I'll check the APIs, Thanks!
Yeah, but it's not python. The argument has another background... Just a young fellow trying to flex his python skills on me saying he could make better performance code with pure python...
Sorry about off topic, carry on!
ah right haha. Well it'll run pretty fast but it's not the python code making it happen
my error https://paste.pythondiscord.com/citaneweho my code
https://paste.pythondiscord.com/abohecepas here
Traceback (most recent call last):
File "D:\college_project\modules\untitled0.py", line 56, in <module>
model.fit(train_generator, epochs=5, validation_data=validation_generator)
File "C:\Users\shubh\anaconda3\lib\site-packages\keras\utils\traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "C:\Users\shubh\anaconda3\lib\site-packages\tensorflow\python\eager\execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
InvalidArgumentError: Graph execution error:
i tried searching SO and tuning the parameters but unable to fix this error ping me when u reply
btw how does your model have 94% accuracy on epoch 0?
yeah suspiciously high
if that's for a classification task, how imbalanced are your categories?
Assuming epoch 0 is not trained
loss = tf.nn.softmax_cross_entropy_with_logits(labels=[[1. 0. 0.] [0. 0. 1.] [0. 1. 0.]], logits=output, axis=-1, name=None)
i am getting an invalid syntax error
i cant understand it
yes classification
3 category
2 : 1.2 : 1.25
i have 200,000 of category A
120,000 of B
126,000 of c
maybe double check if there's any data leakage?
you should probably just stop after epoch 2
Well with that much data it can be possible as long as the function is not too complex
i have checked many times...i just think i overlook....its becoming nightmare....
this is my code:
maybe see how well it works with other forms of train/test splitting
the simplest would be just leaving a slightly larger group out, and not shuffling the data
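A sketch of two quick leakage checks on a made-up frame: an unshuffled holdout split, and a test for rows that appear in both halves. Column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 1, 6],
    "x2": [10, 20, 30, 40, 10, 60],
    "label": ["A", "B", "C", "A", "A", "C"],
})

# Unshuffled split: the first two thirds train, the rest is held out.
cut = int(len(df) * 2 / 3)
train, test = df.iloc[:cut], df.iloc[cut:]

# Rows that are identical in train and test are a common source of leakage.
overlap = train.merge(test, how="inner", on=list(df.columns)).drop_duplicates()

# Dropping exact duplicates before splitting removes that particular leak.
deduped = df.drop_duplicates()
```

If `overlap` is non-empty on the real data, the suspiciously high early accuracy may simply be the model seeing the same rows twice.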
ok
Hello,could you please help me with this error
loss = tf.nn.softmax_cross_entropy_with_logits(labels=[[1. 0. 0.] [0. 0. 1.] [0. 1. 0.]], logits=output, axis=-1, name=None)
you cannot simply copy paste a numpy array like that
same
but i checked the dataset
its quite shuffled....like category a, b, c is present in chunks of 100
it's one hot encoded, but what should i do?
i tried with floating points but that gave errors as well, so idk what i should do for the labels
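The pasted text is how NumPy prints an array, with spaces between the numbers, and that isn't valid Python syntax. A small sketch of writing the same one-hot labels as a literal with commas; `output` is the logits tensor from the question and isn't defined here, so the TensorFlow call is left as a comment:

```python
import numpy as np

# Same one-hot rows as in the pasted snippet, written with commas
# between the elements and between the rows.
labels = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])

# Printing shows the comma-free display format that was copied.
print(labels)

# With a logits tensor named `output` of shape (3, 3), the call would be:
# loss = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=output)
```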
my friend wants to install anaconda, but he gets this. how to fix it?
whatever place they try to install it in has 2 spaces in the name which apparently anaconda does not like
did someone say multiple gpu's 
look @serene scaffold

ok thank you
i wonder who makes that call thats like
this model is way too big we need to train on multiple gpu's

maybe its more like, this is never converging
lets try multiple gpu's

Hi, sorry for bothering, I'm having trouble plotting streamlines in matplotlib. I plotted the vectors, but now I need to use Euler's method (or Runge-Kutta) to trace the streamlines. I have no idea how to start or what result I should get
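A minimal sketch of tracing one streamline with Euler's method: start at a seed point and repeatedly step along the local velocity. The `velocity` function here is a made-up example field (a rigid rotation, u = -y, v = x), and `trace_streamline`, the step size and the step count are all assumptions, so swap in your own field:

```python
import math


def velocity(x, y):
    """Hypothetical example field: rotation about the origin."""
    return -y, x


def trace_streamline(x0, y0, step=0.01, n_steps=100):
    """Follow the field from (x0, y0) using forward Euler steps."""
    points = [(x0, y0)]
    x, y = x0, y0
    for _ in range(n_steps):
        u, v = velocity(x, y)
        # Euler update: new position = old position + step * velocity.
        x, y = x + step * u, y + step * v
        points.append((x, y))
    return points


points = trace_streamline(1.0, 0.0)
xs, ys = zip(*points)  # ready for plt.plot(xs, ys)
```

For this field the exact streamlines are circles, so the traced points should stay close to radius 1; Euler drifts outward slowly, which is the usual reason to move to Runge-Kutta. matplotlib also has `plt.streamplot` if the field is sampled on a regular grid.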
I'm still learning about a career I want to do, so I am thinking of choosing a career in deep learning with a specialty in computer vision
Does computer graphics include computer vision?
Hey guys, I have a computer vision problem. I am using openCV but since there is no computer vision chat I figure data science is the closest thing to the problem that I am having. I am using OpenCV in python. I have a color image and a binary mask image (0 to 255). I want to instead have a color image with the mask applied.
```
full_mask_bgr = cv2.cvtColor(full_mask,cv2.COLOR_GRAY2BGR)
full_mask_bgr[full_mask_bgr==255]=1.0
img2 = np.multiply(img, full_mask_bgr)
```
I am able to do this by doing these functions. First: convert from grayscale to bgr, then convert all the 255 (white) values of mask to 1 then multiply the original bgr image by the 0,1 mask.
The only problem with this solution is that its slow as hell. Is there a better way to do this?
hmm you usually see the two going separate routes but im sure theres ways to combine them (i.e. GANs + 3D modeling, etc.)
maybe try doing a project in both and seeing how you feel about it
Yeah I'll see, I'm just exploring rn
same im interested in 3 things atm
hopefully i decide on 1 before i graduate 
which is soon
I love computer vision but Im probably going to go into app dev instead
mobile or web
Mobile (probably)
gotcha. theres still opportunities to apply CV in that space
i also dont know the answer to your question since im not really a CV guy 
we also did our stuff with matlab, which has tons of image processing functions 
I think theres got to be a way to do it in numpy or opencv but the way I did it is so roundabout and my camera is now like 1 fps
yeah someone who knows opencv well could probs answer your question
I got to figure this out. Our robotics competition is this friday and those 3 lines of code are slowing down the robot
Would CV and augmented reality work hand in hand
oh shoot...i would try asking in ML-specific servers or at different times since some peeps live in dif timezones
hmm idk
one is more i believe analyzing the data, while the other is generating it

but maybe
Ah I see
Depends if it's a human
i think if you get really good at maybe GANs, you could produce better AR stuff

In my opinion if we built a neural network that's a 1:1 replica of the brain and raised it like a child would it know if it's inside a computer? If so it wud wana kill itself
supermoon, this is going to be hard to break it to you but...
But?
but...
is full_mask_bgr the binary image? The binary image that you're converting to BGR and setting all the 255 values to 1 and then multiplying it with another BGR image?
Instead of np.multiply in your third line, can you delete your second line and use cv2.bitwise_and(img, full_mask_bgr) instead? It should work if full_mask_bgr only contains 0 and 255.
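A small sketch of why the bitwise AND gives the same result as the multiply, using NumPy stand-ins so it runs without a camera. The random `img` and the rectangular `full_mask` are made-up test data, `np.repeat` imitates the `cvtColor` step, and `np.bitwise_and` stands in for `cv2.bitwise_and`:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(4, 5, 3), dtype=np.uint8)  # fake BGR frame
full_mask = np.zeros((4, 5), dtype=np.uint8)                # fake binary mask
full_mask[1:3, 1:4] = 255

# Original approach: 3-channel mask, 255 -> 1, then multiply.
full_mask_bgr = np.repeat(full_mask[:, :, None], 3, axis=2)
mask01 = full_mask_bgr.copy()
mask01[mask01 == 255] = 1
multiplied = np.multiply(img, mask01)

# Bitwise approach: x & 255 == x and x & 0 == 0 for uint8 values,
# so the 0/255 mask can be used directly with no rescaling pass.
anded = np.bitwise_and(img, full_mask_bgr)

print(np.array_equal(multiplied, anded))
```

With OpenCV itself, `cv2.bitwise_and(img, img, mask=full_mask)` should also work and skips the `cvtColor` step, since the `mask` argument takes the single-channel image directly.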
Hi all I had a question on using "feature importance" with sklearn?
Namely, I ran a tree and took the most important features to determine if an animal would be adopted (so 0 or 1). I get these results.
However my question is this: How do I know if these features are important to classify the observation as adopted (1) or not adopted (0)?
Like "Sex upon Intake Unknown" is def important, but is it important for classifying an obs as 1 or as 0?!
that depends on:
- which model are you using
- what is the scale of the variables
- what is the intercept (assuming that it has one)
if it's a LogisticRegression with the default parameters, check the model's intercept_
Good points, I am having a hard time on thinking of how to use that info to determine this. If it helps:
- Using Adaboost over my data
- Nothing is scaled in any particular fashion, so no change from raw (I read that scaling data for trees does nothing, but maybe actually harms the predictive power, not too sure)
- None, it's a tree (?)
Very much learning
So sorry for obvious stupid replies lol
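One thing that may help: the `feature_importances_` from sklearn's tree ensembles (AdaBoost included) are non-negative scores for how much a feature gets used in splits, so they don't say which class a feature pushes toward. A rough way to see the direction is to compare the adoption rate with and without the feature. A sketch on made-up toy data (the arrays below are hypothetical, not the shelter dataset):

```python
import numpy as np

# Hypothetical toy data: 1 = "Sex upon Intake Unknown" is set, 0 = not set.
sex_unknown = np.array([1, 1, 1, 0, 0, 0, 0, 0])
# Hypothetical outcomes: 1 = adopted, 0 = not adopted.
adopted = np.array([0, 0, 1, 1, 1, 0, 1, 1])

# Share of adopted animals within each group.
rate_when_set = adopted[sex_unknown == 1].mean()
rate_when_unset = adopted[sex_unknown == 0].mean()

print(f"adopted when flag set:   {rate_when_set:.2f}")
print(f"adopted when flag unset: {rate_when_unset:.2f}")
```

If the rate is clearly lower when the flag is set, the feature mostly points toward 0, and the other way round. It only looks at one feature at a time, so treat it as a first look rather than a full explanation of the model.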
