#data-science-and-ml
You'll probably do the data validation and then, after you get the results, it'll take an existing template and fill the stuff. In Flask / Jinja, you need to make this template yourself. In Streamlit, it sort'a makes it for you. I don't remember Dash's thing, but I think it's similar.
there's also Mode (and many others), if you just want reports instead of actual dashboards
in Dash you sort of define the HTML outline then feed plotly graphs into a div
(and use callbacks to update it based on user interactions)
Ahhh, that sounds familiar. I haven't seen Mode, dang, there's a lot of these.
I only know Streamlit because Emyrs told me about it here. :']
I'm not sure if Mode was free or not though
Lol yeah in a bus. I thought that's what everyone calls it
Here's a table example in Streamlit, for Tsar's reference. I'm not sure how much customization you can do on tables, tho. https://share.streamlit.io/streamlit/example-app-interactive-table/main
streamlit seems fine as well - if it doesn't lock anything behind a paywall (cough dash enterprise...), maybe go for it
Secretly, I love streamlit too because it integrates well with Altair, my beloved underrated plotting library. :'''']
i have no idea, you will have to investigate what is different between the validation and training sets. it might be background objects, or it might be some other problem
Let me represent this a bit more clearly. So I don't need to create any charts etc. The thing that I have is a dataset which contains 'n' columns, and each column has some sort of value in it. For each column I have an observed and an expected value. I have to create a validation script in Python with the help of Pandas and some other validation scripts that I am able to find, to run checks such as whether a value is negative, etc. After that validation is completed I need to somehow visualize it in a report style, showing with flags (importance) each column that did not pass the check. It has to be sort of interactive, to be able to filter and so on
So I am not sure where to really categorize my problem if its web related or not
Yeah, that seems like something you could do in a table, but I'm not sure how things like icons of flags or highlighting work in Streamlit.
this sounds like it might be easier to just roll your own flask app or something
the requirement seems pretty straightforward, just a yes/no indicator next to each column name, and an expandable <details> element w/ specific information about what failed if anything
maybe a way to export some report as a text document or json or whatever
In the flask app, they'd still have to use some js framework like data.table or something. I think Streamlit has this built-in if the results are in pandas.
Either way is probably fine, though.
would they? you could render it statically in an html table
To get an interactive table with filters?
oh i missed that they wanted it to be interactive
this is gonna sound stupid but... have you considered generating an excel workbook?
Yeah, that's the only reason I'd recommend SL instead of just rollin' their own. https://datatables.net/ is very powerful, but also --- can be frustrating to work with.
Haha, that's not a bad idea either. And pandas, also, has a default exporter for excel.
yeah ive generated pretty sophisticated reports that way
Yeah, it's actually really cool. A lot of people require excel, so it's a pretty nice thing they put in. :']
So basically extract the results of the data validation into an excel spreadsheet and just structure it there?
im a bit surprised there arent convenient and light-weight "off the shelf" libraries for sortable/filterable tables
yeah pretty much, instead of making a webpage
that said, i feel like everyone's first web app is a table
so it can't be that hard
not that there's anything wrong with streamlit either
I am just not good with web dev, not much experience and I have no idea how to make a template and later feed data into that template and so on
i'd go with excel then personally, if that meets the requirements
If you want to get better at webdev, try that out. Otherwise, there's some other good options here. :']
https://dash.plotly.com/datatable/interactivity exists... but even I admit it doesn't really meet these criteria all that well
weirdly i dont even see a "table" widget here https://streamlit.io/components?category=widget
I researched dash and I dont think it meets the requirements that are necessary. Streamlit sort of does. Excel seems like a good idea, but so does the web populating. I am down to learn new things but for the sake of graduating I am not sure which is the correct course of action
excel is fine tbh
you can even use xlsxwriter / openpyxl to format it nicely for reports
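For reference, a minimal sketch of the Excel route: run the checks in pandas, collect one row per column with a PASS/FAIL flag, then write that table out with a filter row. The column names, the two checks, and the `report.xlsx` path are made up for illustration, and `write_report` assumes the xlsxwriter package is installed.

```python
import pandas as pd


def validate(observed: pd.DataFrame, expected: pd.DataFrame) -> pd.DataFrame:
    """Return one report row per column with a PASS/FAIL flag and a reason."""
    rows = []
    for col in observed.columns:
        problems = []
        if (observed[col] < 0).any():
            problems.append("negative values")
        if not observed[col].equals(expected[col]):
            problems.append("differs from expected")
        rows.append({
            "column": col,
            "status": "FAIL" if problems else "PASS",
            "details": "; ".join(problems),
        })
    return pd.DataFrame(rows)


def write_report(report: pd.DataFrame, path: str = "report.xlsx") -> None:
    """Write the report with a filterable header and FAIL cells highlighted."""
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        report.to_excel(writer, sheet_name="report", index=False)
        sheet = writer.sheets["report"]
        last_row, last_col = len(report), len(report.columns) - 1
        # Filter dropdowns on the header row give the "interactive" part.
        sheet.autofilter(0, 0, last_row, last_col)
        red = writer.book.add_format({"bg_color": "#FFC7CE"})
        status_col = report.columns.get_loc("status")
        sheet.conditional_format(
            1, status_col, last_row, status_col,
            {"type": "cell", "criteria": "==", "value": '"FAIL"', "format": red},
        )


# Hypothetical data: "weight" has a negative value and differs from expected.
observed = pd.DataFrame({"height": [1.0, 2.0], "weight": [-3.0, 4.0]})
expected = pd.DataFrame({"height": [1.0, 2.0], "weight": [3.0, 4.0]})
report = validate(observed, expected)
print(report)
```

Calling `write_report(report)` afterwards should produce a sheet you can sort and filter in Excel without any web code.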
Labels consist of 3 items, together them 3 makes around 12k datapoints. The graph yall see above is correct,
1st item has 4.2k points
2nd item has 5.3k points( As yall can see, they added 5.3k on the 4.2k graph) How can I avoid this?
3rd item has 2.5k points which is again added over 1 and 2, what to do to make their bar plots separately?
i do appreciate the ideas, I think excel might be a good option as well at this point
Thank you for the ideas, people! If anybody has any other input that they would like to share, I'd love to discuss it!
I'm not sure why they call some stuff widgets and some stuff not, but it's done with
st.dataframe(my_dataframe)
st.table(data.iloc[0:10])
https://docs.streamlit.io/library/cheatsheet Here's most of the stuff they have.
for the sake of graduating I am not sure which is the correct course of action
"whatever is easiest for you" is the best option here imo
Hi! If anyone has experience with Federated Learning implementations on custom image dataset, or any experience with TFF or even FLOWER or sth else, we need some advice/help to get started. Please dm me.
https://www.tensorflow.org/federated/federated_learning
is it correct or is it incorrect? who is "they"? can you provide more context for this question?
So, New York(0), London(1) and Paris(2) has 4723, 5341, 2510 points respectively, and these are together merged in label(which is my x-axis here), together these make around 12k points.
I wanted to plot a bar chart for each label individually, the bar chart for New York is correct(as you could see in the graph) .
London(1) has 5.3k data points and it is supposed to show 5.3k in the graph above, but it is the addition of NewYork + London and addition of all 3 (which is 12k as shown in the graph) in the Paris barplot.
How can I plot them individually?
post the code where you defined your data
import matplotlib.pyplot as plt
import pandas as pd

cities = pd.Series({
    'New York': 4723,
    'London': 5341,
    'Paris': 2510,
})
cities.plot.bar()
plt.show()
should be as easy as that
Hey @sterile rivet!
It looks like you tried to attach file type(s) that we do not allow (.ipynb). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
Here
https://github.com/avonis3/Twitter-classification-project/blob/main/Twitter project part 2.ipynb
labels = [0] * len(new_york_text) + [1] * len(london_text) + [2] * len(paris_text)
what did you expect?
matplotlib bar just takes counts and positions
you're way overthinking this
why are you converting these to lists?
Bc these are all actual tweets
so?
this code makes no sense to me
you just need to get the length of each dataframe and put it in a bar chart
it looks like you're trying to do a bunch of complicated stuff that you don't need to do
uh, this is actually a project with some assigned tasks, plotting a bar chart isn't a task but I am still trying to plot one for practicing matplotlib.
3 different datasets are given according to the areas, and I am supposed to make a system which predicts whether a tweet was sent from any of the 3 cities.
even so. i think you are way overthinking this plot here
I was getting a Value Error
what is the simplest possible way this could work?
just put the 3 lengths in a list...
import matplotlib.pyplot as plt

sizes = [
    len(new_york_tweets),
    len(london_tweets),
    len(paris_tweets),
]
labels = ["NY", "London", "Paris"]
plt.bar(range(len(sizes)), sizes, tick_label=labels)
plt.show()
Yep, that's what I did, I converted all the tweets and put it into 1 big list.
but that isn't what i am saying to do
i'm saying to just get the length of each group of tweets individually
and just plot them
look at my code
it couldn't get any simpler
you're trying to do something much fancier and more complicated than you need to
simple is good
Yep! I got it now, ty!
If you want to output stuff to Excel from Python, the xlsxwriter library is incredible. I use it extensively for professional workflows to dynamically generate templated spreadsheets. It's pretty easy; I actually find it easier to make a complex spreadsheet using xlsxwriter than to make it using Excel.
I'm not sure exactly how these libraries work, because as I said I am rather new to Python and still figuring my way through it. So I would need to look into that library and what exactly does
But thank you for the tip
Do I basically create the format of the excel file through this library or how exactly?
how can I save the result of a df.groupby to a new dataframe?
It allows you to write excel files from python. Put data or formulas into cells, create filters, lock down certain sheets, graphs , everything. So you can use python to hit your apis or database or whatever, then make a dope excel out of the data you've collected. Once you write the code right once it’s automated and you can just run the python code every week or whatever to generate the dope spreadsheet with no work
But if your dataset is easily imported directly into excel, it may be kind of pointless to do anything in python
Except as a learning exercise for yourself
Let's say this is your DataFrame
df = pd.DataFrame({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26.]})
reset_index() can be used to return a DataFrame based on your grouping
df.groupby(["Animal"]).size().reset_index()
so df.groupby returns a "grouped dataframe", which is like a bag of dataframes where each dataframe is one group. you have to do some operation on the "bag" to reduce them back to one dataframe
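A small sketch of the "bag of dataframes" idea, reusing the Animal example from above. Iterating over the grouped object shows the individual groups, and any reducing operation (here `max`, picked just for illustration) turns the bag back into one regular DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
    "Max Speed": [380.0, 370.0, 24.0, 26.0],
})

grouped = df.groupby("Animal")

# Iterating shows the "bag": one (name, sub-DataFrame) pair per group.
for name, group in grouped:
    print(name, len(group))

# Reducing each group to one row gives back a regular DataFrame.
# as_index=False keeps "Animal" as a normal column instead of the index.
fastest = df.groupby("Animal", as_index=False)["Max Speed"].max()
print(fastest)
```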
I don't wanted to save the groups as a new dataframe
I might have what I need now 🙏
the code doesn't look nice, but if it works it works
show code
been struggling with this for way too long
I'd rather not, it's embarrassing 😅
if you're willing to swallow your pride, I can suggest improvements. up to you.
I might be doing something else by then. we'll see
No worries, I have time. There is no need for a quick reply
@serene scaffold actually here is what I'm going right now 😅
The writing and reading csv part does exactly what I want. Of course it's not very efficient
df = pd.DataFrame(df).reset_index()
df.to_csv("chunk_processed_csv.csv", index=False, encoding='utf-8-sig')
df = pd.read_csv("chunk_processed_csv.csv")
df = df.iloc[1: , :]
what is this supposed to do? you're trying to "get rid of the index"?
because you can't--every dataframe always has an index no matter what
if you just don't want to look at the index, that is doable.
Hey everyone, I’m trying to read and store H5 file data in pandas dataframes. I have 8 H5 files each around 3GB. So, it’s a lot of data. I can do this successfully, but it freezes my computer and takes a very long time. I’m wondering, is there a more efficient and less memory-taxing way of doing this? Should I convert from H5 to another format like CSV or Parquet or pickle?
What no
I'll show ya step by step
the dataframe always has an index. there's no way around that. you can just choose to not print it.
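To make "choose to not print it" concrete, here is a tiny sketch with made-up data. The index is still there on the DataFrame; `index=False` only leaves it out of the printed or written output.

```python
import io

import pandas as pd

# Made-up example frame.
df = pd.DataFrame({"city": ["NY", "London"], "tweets": [4723, 5341]})

# The index (0, 1) still exists; these calls just leave it out of the output.
print(df.to_string(index=False))

buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```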
we're not talking about any index
it seems that the whole point of all of this is just to get away from there being an index
I'll show you what I'm doing
Sure. I have about five minutes.
my program won't finish in that time
Alright, good luck!
it works anyways
not very fast, but it works
actually
This is what I have
and this is what I want
@serene scaffold
does that explain what I'm trying to do 🤔
I wanna save the groups I made as a new dataframe (and not ungroup them)
you could make the group name a new column, I guess?
I can't really tell what your data model is.
once my code has finished running, I'll go through what I'm doing
today was the first time I used groupby so I know next to nothing about it
this is what the columns look like after grouping
(the data is sensitive so I'm not showing this)
then I reset the index with
df = pd.DataFrame(df).reset_index()
when I now write the file to a csv, and read it back in
the lambda, min max and sum level somehow becomes part of the data 🤷♂️
note that the data is still grouped in this state
Then I drop the first row with this state
drop the "drived_tstamp" and rename the other 2 to "min" and "max" respectively
and then I got the data exactly the way I wanted, still grouped but without the lambda, min max and sum level
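If the extra "lambda / min / max / sum" header level is the only reason for the CSV round trip, named aggregation might get there directly. This is a sketch on invented data (the `device` and `value` columns are placeholders, only `drived_tstamp` comes from the description above), so the real aggregations may differ.

```python
import pandas as pd

# Hypothetical stand-in for the sensitive data.
df = pd.DataFrame({
    "device": ["a", "a", "b", "b"],
    "drived_tstamp": [10, 30, 5, 25],
    "value": [1.0, 2.0, 3.0, 4.0],
})

# Named aggregation: each keyword becomes a flat output column name,
# so there is no second header level to strip afterwards.
result = (
    df.groupby("device", as_index=False)
      .agg(
          min=("drived_tstamp", "min"),
          max=("drived_tstamp", "max"),
          total=("value", "sum"),
      )
)
print(result)
```

The result is an ordinary DataFrame with one row per group, so it can be saved or processed further without writing it to disk and reading it back.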
# Apply standardized scaling to the training and test data, but only fit the training set
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# SVM model with parameters adjusted for maximum optimization
rand_list = {"C": stats.uniform(1, 100),
"gamma": stats.uniform(0.1, 1)}
svm_model = SVC(max_iter = 5000,kernel='rbf',C=97.366, gamma=0.4834)
# Perform Randomized Search for hyper parameterization
clf = RandomizedSearchCV(svm_model, param_distributions = rand_list, random_state = 0)
search = clf.fit(X_train, y_train)
params = search.best_params_
print(params)
print(search.best_score_)  # mean cross-validated score of the best parameters
# Fit model with data and perform prediction
svm_fit = svm_model.fit(X_train, y_train)
prediction = svm_model.predict(X_test) # Model prediction using testing set
# Use the score metric for evaluation of the model accuracy
score_train = svm_model.score(X_train,y_train)
score_test = svm_model.score(X_test,y_test)
print(score_train)
print(score_test)
# Perform k-fold cross validation to optimize the model and reduce bias/variance
# Number of folds
k = 5
kf = StratifiedKFold(n_splits=k, shuffle = False, random_state = None)
# Model Prediction with k-fold cross-validation using testing set
prediction_kf = cross_val_predict(svm_fit,X_test,y_test,cv = k)
# K-fold cross validation on the training/validation set
k_score_train = cross_val_score(svm_fit,X_train,y_train,cv = k)
# K-fold cross validation on the testing set
k_score_test = cross_val_score(svm_fit,X_test,y_test,cv = k)
mean_accuracy_train = np.average(k_score_train)
mean_accuracy_test = np.average(k_score_test)
print(mean_accuracy_train)
print(mean_accuracy_test)
someone please help I have bad overfitting. note that my model is highly non-linear
I can not figure out why this group is splitting like this -
can someone help me 1 on 1? i can explain it easily through voice
have you tried looking at the data columns individually. if i saw this, i would slice only IndustrySubsector just to double check
Never mind it looked to be some sort of error in the string field - possibly spacing issue, I fixed it with this -
you're not actually using search.best_params_ in your "final" svm_model
Yea I know i just input the values to do a quick check
it didnt change
svm_fit = svm_model.fit(X_train, y_train, **params)
okay well can you understand that pasting code in here that's not the code you actually used is not very helpful
that is the code I used
the C and gamma values I changed manually
in the code I posted. those were the results from the random search
**
ah. That's confusing.
O woops
yeah it'd be svm_fit = SVC(max_iter = 5000,kernel='rbf',**params).fit(X_train, y_train)
ok yea it worked now
but it still has bad overfitting
not sure why test set has terrible score. Could I just do private chat with u, im sure u could help easily if u understood the data
@twin hound there's parameters you're not tuning in the hyperparameter search- the kernel and max_iter. Try adding those to the search.
(from #python-discussion )hello, new here
is there any way to do Shape From shading in python, if so, how do i do it?
i want to make DEMs for many of the solar system's moons with the image data available
rand_list = {
"C": stats.uniform(1, 100),
"gamma": stats.uniform(0.1, 1),
"max_iter": stats.uniform(1,5000),
"kernel": ["rbf", "opt2", "opt3"],
}
@twin hound
like turn this image here into a usable height map/DEM that can be used in space programs, 3d modeling, etc
There's no way to judge rbf is best in isolation from changes in the other hyperparameters. maybe when max_iter is 5000 rbf is best, but if max_iter is a different value, some other kernel may be best.
ok well anyway to summarize my issue, even after applying all of this tuning, the score on the training set is really high (0.96-0.997) but the testing set doesnt change (0.5-0.7) and when I apply kfold cross validation the training set ranges from 0.6-0.7 and the test set ranges from 0.4-0.5
my issue is it seems the model only works well with the training set
for 2 days Ive played around with the parameters. literally have changed everything tried many combinations, grid search, rand search, etc. my main issue is just what I said above. Wondering if you know why this occurs usually
Rather low number. What algorithm is that?
if you're not optimizing all the parameters in a properly set-up cross-validated search, that's my #1 guess as to why you're getting unexpectedly bad performance on the test set. #2 guess is the data in the test set is just too different from the data in the train set.
how do I set up a good cross validated search?
I'm not sure it's #2 because I performed the same analysis using train_test_split to just verify if the test data was bad. but train_test_split gave same results
like given the code, what would u do
with your experience
What is your data? Does it have time factor that you can't randomly select train test split
its already provided training data, and already provided testing data from excel file. 8 inputs and 1 output with [0,1,2,3,4] classifiers
that's what I'm trying to tell you to do. Use all the parameters with reasonable search ranges in the RandomizedSearchCV. Make sure the number of cross-validation folds makes sense
could it be because the training set is small?
its only [750,8]
test set is [150,8]
you're not properly doing the cross-validated hyperparameter search, so your model is overfit to the Training data.
I missed it. Anyway does it have time related factor?
I dont understand how I am not doing it correctly
am I supposed to literally just randomly try every single parameter and hope for the best. I feel that's inefficient
Great. Can you send link to your data again?
you need to run the hyperparameter search over as many different parameters as you can at once. Like this
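A rough sketch of what a joint search could look like, assuming scikit-learn and scipy. The data is synthetic (8 features, 5 classes, sized like the sets mentioned above), the ranges are guesses rather than recommendations, and `max_iter` is left at its default to keep the example short. Putting the scaler inside the pipeline is my own choice here, so that each CV fold is scaled using only its own training part.

```python
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data: 8 inputs, classes 0..4.
X, y = make_classification(
    n_samples=900, n_features=8, n_informative=6, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=150, stratify=y, random_state=0
)

pipe = make_pipeline(StandardScaler(), SVC())

# All searched parameters are sampled together, not tuned one at a time.
param_distributions = {
    "svc__C": stats.loguniform(1e-2, 1e2),
    "svc__gamma": stats.loguniform(1e-4, 1e1),
    "svc__kernel": ["rbf", "linear"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=20, cv=cv, random_state=0
)
search.fit(X_train, y_train)

print(search.best_params_)
print("mean CV accuracy:", search.best_score_)
# best_estimator_ is already refit on all of X_train with the best parameters.
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```

Comparing `best_score_` with the test accuracy is the useful part: if the two are close, the search is behaving; if the CV score itself is low, more tuning probably won't rescue it.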
Don't know. Can be just couple of rows
this is an astute observation. That's why more recent versions of SKLearn added hyperparameter search functions that learn as they go: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingRandomSearchCV.html#sklearn.model_selection.HalvingRandomSearchCV
Random search has its benefits over grid search. But you can try both
you can also try the hyperopt library, which uses different learning algorithms than sklearn for hyperparameter optimization
But random search is quite powerful. Your model and data are both small, so randomly searching a few hundred options and picking the best is likely to find a very good solution in a relatively short amount of computation time.
probably not all 30. that's where expertise comes in. You need to understand how SVMs work, how the training algorithm works, and how your data is interacting with those things. Then you can pick reasonable choices for which parameters to search and reasonable ranges for them.
just 2 parameters is clearly not enough, since you are reporting such a large difference b/w train and test performance
If your data is small training probably doesn't take long. So you search a lot :). You can use some auto ml library as well.
well I know for svm regularization, kernel and gamma have high impact
what if I change my input features
instead of just putting the x data as is, make a relationship between them to reduce dimensions
this is the data fyi
so for example reduce dimensions by going water/cement and coarse/fine aggregate. than
i feel my input data is really bad and has bad bias
What is age?
its the setting time of the concrete
different concretes have different setting times because it highly affects compressive strength
its classified as concrete age
Do you scale the data before fitting the model?
yea its all scaled
What are zeros in some rows?
In the excel is not scaled right?
the excel is not
0 just means it has no value for that certain input
like for example no fly ash for the third sample
Is that missing value?
no it means there is no "amount" of that parameter in the concrete mix
essentially what you see is 8 inputs which are different materials for concrete mix and 1 output which is the strength class
some concretes have no fly ash, plasticizer, etc. so it has a value of 0
I see.
Regarding the y class. Is it balanced?
Or do we have more samples with certain class?
heres also an example of what the data looks like if we plot the first and 2nd inputs against eachother. most of them look random like this
the colors are just the different classes
I have to use ANN and SVC
its for a project thats why : (
I am getting the same issue with ANN if you are wondering
MLP to be specific
Ah
Ok. All seems fine what you showed me.
I hate machine learning 🤪
I know man... thats why im here 😭
So this seems like overfitting. What can cause overfitting in SVC?
Having your product revolve around one big model is a fundamental strategy for "unicorn" companies. The idea is to find some niche which has yet to be automated (low-hanging pre-computerization fruit) and then automate it with a website + maybe an ML model. They often call themselves "tech" companies (using tech does not make the company a tech company) and mostly pop up on the west coast of the US.
The goal is then to hype it up to infinite and sell when it's highly valued to a bigger "tech" company or a bank. And it works for some, and when it does it's very profitable so they keep trying.
Sorry, but we don't allow recruitment in this server.
I'm not really sure either. There's a Python job board on python.org.
as nobody will help me, and its vital I have someone who knows what they are doing to take on this task.
thanks
this channel would pertain to phyphox correct?
spectrum analysis etc.
idk what phyphox is, but this is the channel to discuss scientific computing in Python
it handles sensors in mobile devices, ranging across all types of the likes. mostly, I am looking for someone who knows a little bit about auditory and frequency spectrum analysis
here is an example
anyone see this?
anyone know how to make stats.uniform select integers only?
@neat anvil How can I plot all of my predictions vs. actual (for example x1 with x_test1 vs. y for all inputs)
ty
!d scipy.stats.randint
scipy.stats.randint = <scipy.stats._discrete_distns.randint_gen object>
A uniform discrete random variable.
As an instance of the [`rv_discrete`](https://scipy.github.io/devdocs/reference/generated/scipy.stats.rv_discrete.html#scipy.stats.rv_discrete "scipy.stats.rv_discrete") class, [`randint`](https://scipy.github.io/devdocs/reference/generated/scipy.stats.randint.html#scipy.stats.randint "scipy.stats.randint") object inherits from it a collection of generic methods (see below for the full list), and completes them with details specific for this particular distribution.
Notes
The probability mass function for [`randint`](https://scipy.github.io/devdocs/reference/generated/scipy.stats.randint.html#scipy.stats.randint "scipy.stats.randint") is:
\[f(k) = \frac{1}{\texttt{high} - \texttt{low}}\] for \(k \in \{\texttt{low}, \dots, \texttt{high} - 1\}\).
[`randint`](https://scipy.github.io/devdocs/reference/generated/scipy.stats.randint.html#scipy.stats.randint "scipy.stats.randint") takes \(\texttt{low}\) and \(\texttt{high}\) as shape parameters...
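A quick sketch of the difference, assuming scipy is available. The range 1 to 5000 mirrors the `max_iter` range posted earlier; note that `randint` excludes the upper bound, so `high` is 5001.

```python
from scipy import stats

# uniform(loc, scale) draws floats from [loc, loc + scale].
float_draws = stats.uniform(1, 5000).rvs(size=5, random_state=0)

# randint(low, high) draws integers from low up to high - 1.
int_draws = stats.randint(1, 5001).rvs(size=5, random_state=0)

print(float_draws)
print(int_draws)
```

In a search space this would look like `"max_iter": stats.randint(1, 5001)` in place of the `stats.uniform` entry.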
!d warnings.catch_warnings
class warnings.catch_warnings(*, record=False, module=None)
A context manager that copies and, upon exit, restores the warnings filter and the [`showwarning()`](https://docs.python.org/3/library/warnings.html#warnings.showwarning "warnings.showwarning") function. If the *record* argument is [`False`](https://docs.python.org/3/library/constants.html#False "False") (the default) the context manager returns [`None`](https://docs.python.org/3/library/constants.html#None "None") on entry. If *record* is [`True`](https://docs.python.org/3/library/constants.html#True "True"), a list is returned that is progressively populated with objects as seen by a custom [`showwarning()`](https://docs.python.org/3/library/warnings.html#warnings.showwarning "warnings.showwarning") function (which also suppresses output to `sys.stdout`). Each object in the list has attributes with the same names as the arguments to [`showwarning()`](https://docs.python.org/3/library/warnings.html#warnings.showwarning "warnings.showwarning").
The *module* argument takes a module that will be used instead of the module returned when you import [`warnings`](https://docs.python.org/3/library/warnings.html#module-warnings "warnings: Issue warning messages and control their disposition.") whose filter will be protected. This argument exists primarily for testing the [`warnings`](https://docs.python.org/3/library/warnings.html#module-warnings "warnings: Issue warning messages and control their disposition.") module itself.
ok got it thanks
how do I make the randomized search select the best parameters based on the score?
because everytime I run it it keeps changing
@neat anvil
i have this data set i want to check what are the survival chances of people with same tickets
could someone help
It is selecting the best parameters based on cross-validation score
if it's changing every time, that could mean a few things: your search space has many roughly equivalent optima (if the CV scores of many of the random models are around a similar reasonable value), OR you've selected the validation splits in a way that makes it difficult to get a reliable score (if the CV scores of many of the random models are near 100%), OR the training data is so messy there is no way to achieve a good model with this type of model (if the CV scores of many of the random models are low), OR your scoring metric is ill-defined, OR the training data is so messy it's not much better than training on random noise, so you just get random parameters out (they're different each time you run it b/c it randomizes how it splits the data and the params)
those (if whatever) conditions are kind of hand-wavey, not for certain
but those are some signals and possible explanations
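One way to check which of those cases applies is to pin the randomness and then look at the whole table of CV scores instead of only the winner. A sketch on synthetic data, assuming scikit-learn, scipy and pandas; the parameter ranges and the `run_search` helper are illustrative only.

```python
import pandas as pd
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)


def run_search(seed):
    """Run one randomized search with both sources of randomness pinned."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = RandomizedSearchCV(
        SVC(),
        {"C": stats.loguniform(1e-2, 1e2), "gamma": stats.loguniform(1e-3, 1e1)},
        n_iter=15, cv=cv, random_state=seed,
    )
    return search.fit(X, y)


# Same seed for the splits and the sampling gives the same winner each run.
first, second = run_search(0), run_search(0)
print(first.best_params_ == second.best_params_)

# The spread of the top scores hints at whether many settings are equivalent.
results = pd.DataFrame(first.cv_results_)
top = results.sort_values("rank_test_score")[
    ["param_C", "param_gamma", "mean_test_score", "std_test_score"]
].head(5)
print(top)
```

If the top few rows have nearly identical `mean_test_score`, different winners between runs are probably just ties being broken differently.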
I'd recommend trying a much, much simpler model. Like just a basic logistic regression.
If you can't fit it with decent accuracy on data that simple
I would but the problem is this is for a project where SVM and MLP needs to be used
more complex models aren't going to do much better.
well, it can give you a baseline expectation of what is reasonable
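A sketch of such a baseline check, on synthetic data shaped like the problem described (8 inputs, 5 classes). The majority-class `DummyClassifier` is my addition as a floor to compare against; it was not part of the suggestion above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real training set.
X, y = make_classification(
    n_samples=750, n_features=8, n_informative=6, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)

majority = DummyClassifier(strategy="most_frequent")
logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validated accuracy for both models.
majority_scores = cross_val_score(majority, X, y, cv=5)
logreg_scores = cross_val_score(logreg, X, y, cv=5)

print("majority class:", majority_scores.mean())
print("logistic regression:", logreg_scores.mean())
```

Whatever the SVM or MLP ends up scoring under cross-validation can then be read against these two numbers.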
its all good I appreciate your help. Im meeting with my prof today to help my sorry ass
always a good idea
yea thanks man
search = []
for values in df['data']:
    search.append(re.search(r'\d{7}[N]\d{7}[E]', values).group(0).rstrip())
print(search)
Hello everybody, I have this regex. I'm trying to search through one of the columns in my dataframe and return the string, not the match object. I know I need to use group() to achieve this, however on some rows re.search returns None, and group() then crashes with 'NoneType' object has no attribute 'group'. I saw somewhere that group(0) should get rid of the Nones but it didn't work. I know I can fix this with a try: except: block but I'm trying to find a different solution.
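For what it's worth, the crash comes from calling `.group()` on `None`, so one plain-Python way around it (without try/except) is to check the match first. A sketch with made-up sample strings:

```python
import re

PATTERN = re.compile(r"\d{7}N\d{7}E")

# Made-up sample values; the second one has no match.
values = ["pos 1234567N7654321E ok", "no coordinates here", "0000001N0000002E"]

matches = []
for value in values:
    match = PATTERN.search(value)
    # search() returns None when nothing matches, so check before calling group().
    matches.append(match.group(0) if match else None)

print(matches)
```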
@haughty ibex did you try Series.str.find?
!docs pandas.Series.str.find
oh that's the wrong one. must be extract
!docs pandas.Series.str.extract
Series.str.extract(pat, flags=0, expand=True)
Extract capture groups in the regex pat as columns in a DataFrame.
For each subject string in the Series, extract groups from the first match of regular expression pat.
you'll also have to put parentheses around the part of the pattern you want to keep. which I guess will be all of it.
try to figure it out, and if you can't, I will show you the solution @haughty ibex
df['report text'].str.extract(r'\d{7}[N]\d{7}[E]')
getting ValueError: pattern contains no capture groups
@haughty ibex it extracts a capture group, so you have to put the whole thing in parentheses, if you want that
though it looks like there are two parts to this pattern, \d{7}[N] and \d{7}[E]
so you could get that information in two columns automatically, if you wanted.
>>> s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
letter digit
0 a 1
1 b 2
2 NaN NaN
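Applied to the pattern in question, the two-column version could look roughly like this. The group names `north` and `east` are my guess at what the N and E parts mean, and the sample rows are invented.

```python
import pandas as pd

# Made-up sample rows; the middle one has no match.
df = pd.DataFrame({"data": ["x 1234567N7654321E y", "nothing", "0000001N0000002E"]})

# Each named group becomes its own column; rows without a match get NaN.
parts = df["data"].str.extract(r"(?P<north>\d{7}N)(?P<east>\d{7}E)")
print(parts)
```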
oh ok i think i got it to work but now im getting Length of values (0) does not match length of index (11)
in my test csv file i have 3 rows that would contain no matches for my re to test it out.
please show what you changed the code to and the whole error message starting from Traceback.
ok sorry the traceback was something i forgot to comment out while testing out the changes
df['pattern match'] = df['data'].str.extract(r'(\d{7}[N]\d{7}[E])')
yay
is there a flag to not get NaN values and just have and empty cell
No, because NaN is an empty cell, basically
They're also the best way to represent missing data.
@haughty ibex make sense?
@serene scaffold yes. i appreciate the help. can i use multiple regex patterns.
i have two other regex patterns that im using to find some data in csv files
regex1 = r'\d{1,3}[thrd]([a-zA-Z]+( [a-zA-Z]+)+)[e]\s'
regex2 = r'\d{1,3}([a-zA-Z]+( [a-zA-Z]+)+)\d\s+([a-zA-Z])+\b'
could i do something like:
df['pattern match'] = df['data'].str.extract(regex1,regex2)
im guessing its not that simple lol
regex1 = r'\d{1,3}[thrd]([a-zA-Z]+( [a-zA-Z]+)+)[e]\s'
regex2 = r'\d{1,3}([a-zA-Z]+( [a-zA-Z]+)+)\d\s+([a-zA-Z])+\b'
regex_list = [regex1, regex2]
regex_search = []
for x in df['data']:
    for regex in regex_list:
        try:
            regex_search.append(re.search(regex, x).group().rstrip())
        except:
            pass
i am currently doing this and it seems to be working just looking for a more optimized solution.
banish this for loop from your life
you can do df['data'].str.extract more than once and make more than one column, yes.
Hi, how would you plot a 3D linear regression model from a dataframe?
@misty flint I actually don't know how I'd do that off the top of my head ^
@serene scaffold i want the matches to be in the same column so thats why i did the double for loop.
how does one plot 3d data in general? I only did that once for a homework assignment two years ago.
and then forgot
@serene scaffold I've been trying to look for solutions for it online but I genuinely don't understand it. Thank you tho.
is the dataframe that you currently have multi-indexed or what?
why do you want that? what do the matches even represent?
@serene scaffold This is what it looks like
df['pattern_match'] = ''
for pattern in [regex1, regex2]:
    # wrap the pattern in one outer group so column 0 is the whole match
    df['pattern_match'] += df['data'].str.extract(f'({pattern})')[0].fillna('')
you could do this, I guess @haughty ibex
why are some of them NaNs?
@serene scaffold Oh the dataframe is from a practical
is practical a thing? anyway, do you want to ignore the rows with NaNs?
and what three columns are going to be the axes on the plot?
@serene scaffold 1) Yes
- The dependent variable will be the flipper length, bill length and depth will be the other 2 variables
tbh i wouldnt either lol
oh wait
i think i did it before in MATLAB
import matplotlib.pyplot as plt
x, y, z = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']].dropna().to_numpy().T
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x, y, z)
plt.show()
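For the regression part of the question, here's a rough sketch of one way to do it: fit a plane with ordinary least squares and draw it over the scatter. The data is synthetic and the variable names are made up for illustration; with the penguins dataframe you'd use bill length, bill depth, and flipper length as x, y, z instead.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumption): z depends linearly on x and y plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(30, 60, 50)
y = rng.uniform(13, 22, 50)
z = 120 + 1.5 * x + 0.5 * y + rng.normal(0, 1, 50)

# Least-squares fit of z = a*x + b*y + c.
A = np.column_stack([x, y, np.ones_like(x)])
(a, b, c), *_ = np.linalg.lstsq(A, z, rcond=None)

# Evaluate the fitted plane on a grid covering the data range.
gx, gy = np.meshgrid(
    np.linspace(x.min(), x.max(), 10),
    np.linspace(y.min(), y.max(), 10),
)
gz = a * gx + b * gy + c

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.scatter(x, y, z)                      # observed points
ax.plot_surface(gx, gy, gz, alpha=0.3)   # fitted regression plane
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("z")
```

In a notebook you'd finish with plt.show(); a scikit-learn LinearRegression would give the same plane if you prefer that API.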
but thats MATLAB 
try something like this
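from functools import reduce
from operator import add
# concatenate the whole-match strings from each pattern, row by row
df['pattern_match'] = reduce(
    add,
    (df['data'].str.extract(f'({pattern})')[0].fillna('') for pattern in (regex1, regex2))
)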
I made it slightly more lispy
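Another option, if the matches should all land in one column: join the patterns with | and call str.extract once. This is only a sketch with two simplified, made-up patterns (p1 and p2 are not the regexes from above), so check it against the real data.

```python
import pandas as pd

# Hypothetical patterns and data, for illustration only.
p1 = r"\d+th [a-z]+"
p2 = r"[A-Z]{2}\d+"
df = pd.DataFrame({"data": ["see 12th street", "code AB42 here", "nothing"]})

# One outer capture group around the alternation, so column 0 is the whole match.
combined = f"({p1}|{p2})"
df["pattern_match"] = df["data"].str.extract(combined)[0].fillna("")
```

If the individual patterns contain their own capture groups, extract returns extra columns, which is why the sketch only keeps column 0.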
Looking for a library that can analyze a game that is being played on my stream. I’m a twitch streamer and want to be able to track say the number of kills I get in a particular game. What libraries would I look into to do that?
what's the simplest way to know that you got a kill? is there a kill count on screen?
this could end up being a relatively simple problem or something much more involved lol
yes. if the solution isn't something happening to a static UI element, any solution we come up with will probably need so much compute power that you won't be able to play your game.
video data is not something ive worked with personally but like you mentioned its def a hassle
but at least you have a GPU 😄
the groups ive seen work with it also complain about how much data is generated as well
so you def dont want to use all that data, just certain stills/frames if possible
Ya there is a kill count on screen.
so, you need something that watches the pixels on that part of the screen, and any time they change, it needs to detect if the change is the number going up.
Is that going to take up a lot of computing power? In my mind I was thinking of it taking a screen shot and then analyze it, see if it changed from last time then delete the screenshot.
my first instinct is to look into opencv and pytesseract
you'd need to constantly be taking screenshots and analyzing them
at least thats off the top of my head
anyway, I don't do anything with images except maybe optical character recognition. so I don't even know if there are libraries that watch parts of a screen.
hmm i think ive seen an article about it once
there was a twitch streamer that did something similar
Ok. I’m familiar with OpenCV. Not so much pytesseract.
I figured someone somewhere has done it. Not trying to copy someone’s code or work but just wanted to see what libraries they used to do it.
ah i remember now, this was a high schooler on a podcast i listen to
Listen to this episode from Ken's Nearest Neighbors on Spotify. Will is a junior in high school, he has been super involved with data science. He is innovating the data collected on twitch and esports. He self taught himself how to code from a young age and is now using what he has learned to create tools for esports and analyze data from twitch.
maybe you can find something by googling him

i think he ended up getting into a really good school bc of this
its been a while, i dont remember
Thank you! I definitely will. Just looking at his podcasts I’m probably going to listen to all of his episodes.
lol the host is the data scientist, while the high school kid is the twitch streamer that was a guest on that episode

but you can still listen
he has some interesting guests from all across DS
some people work in all sorts of domains and fields
i think its most interesting hearing their background/journey
one was an olympic medalist before going into DS
another one was an ex-cultist
💀
anyway interesting stories tbh
Ahh. Ok. I see. Well I appreciate it.
good luck bud. let me know if you end up getting it to work
i still think opencv will let you do something
OpenCV with Pytesseract will probably just work.
PIL's image grab will work for getting the image: https://pillow.readthedocs.io/en/stable/reference/ImageGrab.html
If you generally know where the text is you probably want to only grab that region or it will be slow on larger resolutions.
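A minimal sketch of the "only react when that region changes" idea, assuming the kill counter sits in a fixed box on screen. The helper names region_changed and parse_kill_count are made up for illustration. In a real loop the frames would come from something like ImageGrab.grab(bbox=...) converted with np.asarray, and the text from pytesseract.image_to_string; here they are faked so the logic can run anywhere.

```python
import numpy as np

def region_changed(previous, current, threshold=10.0):
    """Return True when the mean absolute pixel difference exceeds threshold."""
    # Cast to a wider signed type so the subtraction cannot wrap around.
    diff = np.abs(current.astype(np.int16) - previous.astype(np.int16))
    return float(diff.mean()) > threshold

def parse_kill_count(ocr_text):
    """Pull the digits out of OCR output; None if there are none."""
    digits = "".join(ch for ch in ocr_text if ch.isdigit())
    return int(digits) if digits else None

# Fake frames standing in for screenshots of the counter region.
frame_a = np.zeros((20, 40), dtype=np.uint8)
frame_b = frame_a.copy()
frame_b[5:15, 10:30] = 255  # pretend the number was redrawn

same = region_changed(frame_a, frame_a)
different = region_changed(frame_a, frame_b)
count = parse_kill_count("Kills: 12\n")
```

Running OCR only when region_changed fires keeps the expensive step rare, which is the main point of grabbing a small region.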
Ok. Thank you!
from torch import nn

class AutoEncoder(nn.Module):
    def __init__(self):
        super(AutoEncoder, self).__init__()
        # shape comments assume an input of (b, 55, 46, 46)
        self.encoder = nn.Sequential(
            nn.Conv2d(55, 16, 3, stride=1, padding=1),  # b, 16, 46, 46
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=1),  # b, 16, 45, 45
            nn.Conv2d(16, 8, 3, stride=1, padding=1),  # b, 8, 45, 45
            nn.ReLU(True),
            nn.MaxPool2d(2, stride=1)  # b, 8, 44, 44
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 16, 3, stride=1),  # b, 16, 46, 46
            nn.ReLU(True),
            nn.ConvTranspose2d(16, 8, 5, stride=1, padding=1),  # b, 8, 48, 48
            nn.ReLU(True),
            nn.ConvTranspose2d(8, 55, 2, stride=1, padding=1),  # b, 55, 47, 47
            nn.Tanh()
        )

    def forward(self, x):
        x = self.encoder(x)
        print(x.shape)
        x = self.decoder(x)
        print(x.shape)
        return x
``` I am getting error `/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:47: UserWarning: Using a target size (torch.Size([1, 55, 46, 46])) that is different to the input size (torch.Size([1, 55, 47, 47])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.` . My input size is `(1,55,46,46)` but i dont know why i am getting `[1, 55, 47, 47]` ?
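The 47 comes straight out of the layer arithmetic. Here's a small sketch that traces the spatial size through the layers using the standard Conv2d / ConvTranspose2d output-size formulas (assuming dilation 1 and no output_padding, which is what the code above uses). The helper names are made up for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of Conv2d / MaxPool2d along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride=1, padding=0):
    """Output size of ConvTranspose2d along one dimension."""
    return (size - 1) * stride - 2 * padding + kernel

# Encoder on a 46x46 input.
size = 46
size = conv_out(size, kernel=3, padding=1)   # Conv2d(55, 16, 3, padding=1)
size = conv_out(size, kernel=2)              # MaxPool2d(2, stride=1)
size = conv_out(size, kernel=3, padding=1)   # Conv2d(16, 8, 3, padding=1)
size = conv_out(size, kernel=2)              # MaxPool2d(2, stride=1)
encoded = size

# Decoder as written.
size = conv_transpose_out(encoded, kernel=3)
size = conv_transpose_out(size, kernel=5, padding=1)
decoded = conv_transpose_out(size, kernel=2, padding=1)

# One possible adjustment (an assumption, not the only fix): kernel 4 in the middle layer.
size = conv_transpose_out(encoded, kernel=3)
size = conv_transpose_out(size, kernel=4, padding=1)
fixed = conv_transpose_out(size, kernel=2, padding=1)

print(encoded, decoded, fixed)
```

So the decoder overshoots the input by one pixel per side length, and the loss then compares a 47x47 output against a 46x46 target. Any change that makes the decoder land on 46 again removes the warning.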
Is there any way to get only last 3 months data?
The first row is latest month so what I did is: made a list of months
l1=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
month=df['Date'].iloc[0]
curr_month=month[:3]
curr_index=l1.index(curr_month)
prev_month=l1[curr_index-1]
last_second_month=l1[curr_index-2]
month_list=[curr_month,prev_month,last_second_month]
so month_list gives me last 3 months including current, then I tried to find list elements in df column using df[df['Date'].str.contains('|'.join(month_list))]
but as you can see in the picture the last rows from df it contains last year Mar data. so it returning that data also. so How can I get the only latest last 3 months data
u can use slice operator with dates, so assuming u have the date in variable x you can go df.loc[x:,:]
date indexs can get a bit intricate with pandas
u could resample the index to monthly frequency, take the 3rd last index of that, then make x the "yyyy-mm" string of that with strftime(), then use x with the slice index
in other news this is an interesting review from openai re gpt https://openai.com/blog/language-model-safety-and-misuse/
The deployment of powerful AI systems has enriched our understanding of safety and misuse far more than would have been possible through research alone. Notably: API-based language model misuse often comes in different forms than we feared most. We have identified limitations in existing language model evaluations that we are
How can I use Levenshtein.ratio to compared strings between 2 different columns in a dataframe? I have a dataframe with a few ten million rows and can't figure out how to get it to do the ratio of the strings in each row of the dataframe.
I don't have year in data, as you can see the attached picture
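Without a year, one workaround is to rely on the row order instead of the month names. This is a sketch that assumes the rows are sorted newest first and that the Date strings start with a three-letter month, as described above; the sample data is made up.

```python
import pandas as pd

# Hypothetical data, newest row first; the last row is March of the previous year.
df = pd.DataFrame({
    "Date": ["Mar 14", "Mar 02", "Feb 20", "Jan 11", "Dec 30", "Mar 25"],
    "value": [1, 2, 3, 4, 5, 6],
})

month = df["Date"].str[:3]
# Number each run of consecutive rows that share a month: 1 for the newest month, 2 for the next, ...
block = (month != month.shift()).cumsum()
# Keep only the first three month blocks.
last_three = df[block <= 3]
```

Because it counts month changes from the top, last year's March ends up in a later block and is dropped. It does assume no month is missing entirely from the data.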
base_model = Sequential()
resnet50_model = tf.keras.applications.ResNet50(include_top=False,
                                                input_shape=(144,144,3),
                                                pooling='max',classes=6,
                                                weights='imagenet')
for layer in resnet50_model.layers:
    layer.trainable=False
base_model.add(resnet50_model)
base_model.add(Flatten())
base_model.add(Dense(1024, activation='relu'))
base_model.add(Dense(512, activation='relu'))
base_model.add(Dense(256, activation='relu'))
base_model.add(Dense(6, activation='softmax'))
in this snippet there shows transfer learning right?
but what if i just want to just use the architecture of resnet50 and i want to train it myself?
non trainable parameters are units that are unchangable? isnt that bad?
how to prevent it?
I wouldn't worry too much, but I'm interested in the answer anyways
how to diagnose this kind of thing on keras?
maybe those non trainables are from the resnet50?
The number of non-trainable weights of the model comes from the BatchNormalization layers, whose mean and variance vectors are updated via layer updates instead of backpropagation and are therefore counted as non-trainable parameters.
https://github.com/experiencor/keras-yolo2/issues/167
oh its normal and its form bn of resnet i see nice nice thank you
btw in this code
base_model = Sequential()
resnet50_model = tf.keras.applications.ResNet50(include_top=False,
input_shape=(144,144,3),
pooling='max',classes=6,
weights=None)
base_model.add(resnet50_model)
base_model.add(Flatten())
base_model.add(Dense(1024, activation='relu'))
base_model.add(Dense(512, activation='relu'))
base_model.add(Dense(256, activation='relu'))
base_model.add(Dense(6, activation='softmax'))
base_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=METRICS)
i want to just use the architecture of the resnet50 and train it myself with my own data and classes
i changed the input shape and added dense layer to the end is this it? did i implement what i wanted to do right?
yo?
wew since i am training resnet from scratch there are alot of computations needed right? this will take alot of time
I want to see if these two columns are related or not like if particular Branch always maps to particular city?
How can I check it?
using pandas
or any python lib
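One way to check this in pandas is to count the distinct cities per branch. A minimal sketch, assuming the columns are literally named Branch and City (the sample data is made up):

```python
import pandas as pd

# Hypothetical data: branch B shows up in two different cities.
df = pd.DataFrame({
    "Branch": ["A", "A", "B", "B", "C"],
    "City": ["X", "X", "Y", "Z", "X"],
})

# Number of distinct cities each branch appears with.
cities_per_branch = df.groupby("Branch")["City"].nunique()

# True only if every branch maps to exactly one city.
always_same_city = bool((cities_per_branch == 1).all())

# Branches that break the rule.
inconsistent = cities_per_branch[cities_per_branch > 1]
```

pd.crosstab(df["Branch"], df["City"]) gives a fuller picture if you want to see the counts for each pair.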
Hello all, Im trying to build a ordinal classification model (basically ranking prediction). Can someone help me out in choosing model? Thanks 😊
644 is the number of minibatches in your dataset. it's the same regardless if you train from scratch or using transfer learning. It depends on your minibatch size and number of training images. training from scratch you probably will need more than 10 epochs.
will 100 be a good number for epochs? btw it will take forever if i train this on my laptop so i tried this google collab
but my dataset is from may local drive do i need to upload it to collab?
yes you would need to upload data to gdrive
nice nice thank you
btw is this natural ?
never trained from scratch. but early epochs that may be right since almost nothing is correct yet.
btw why don't you want to train using transfer learning?
i am doing experiment on image augmentations i want to compare if there will be performance boost or what and to compare them fairly i think using the same architecture and exactly the same initial weights will be good so i first created my own cnn architecture but i realized that doing this experiment on my own simple architecture is non sense because noone will ever use it so i decided to use a popular or one of the best architectures
using the architecture ill create 2 identical models and train them on classifying the same classes but with different data
like datasetA is with etc and datasetB with etc like that
does it make sense?
btw how you use pre trained models? did i do it right? but in this i just dont copy the weights learned from the imagenet dataset so in short i just copied the architecture?
i want to read in a csv file from a directory using this code eth = pd.read_csv("../EC331/combined_posts_comments_final.csv") but it doesnt seem to work
@tacit basin
btw how about this ?
what does it mean its learning but maybe it needs more epochs to get better validation?]
honestly @pastel valley these questions about data augmentation, transfer learning, and deep learning model architecture are quite complicated to answer and get to the root of a lot of fundamentals of deep learning. You'd probably be best served taking some courses and building up your fundamentals in math and stats IMO.
and I mean sounds like you're curious enough about the topic that you'd probably enjoy the courses
can someone explain to my how this "sum" param works exaclty? I'm having some strange results
df.groupby(["user",pd.Grouper(key="timestamp", freq="W")]).agg({
"col1": "sum"
})```
I have a column with true and false values exclusively, I'm trying to count the true values within a certain interval
but some results are negative
I really don't understand why it does that
it should just be aggregating col1 with sum, or something like that. if you want additional help, show a reproducible example with df.head(10).to_dict('list'). Screenshots are useless, in this context.
it appears as though groupby tries to save an int16 value in an int8 🤔
Good Afternoon everyone ,
i am trying to use "Word2Vec" package in pycharm
from gensim.models import Word2Vec
but it shows an error Unresolved reference 'Word2Vec'
can anybody support me on this
You can evaluate image augmentations with transfer learning as well. You will see results faster.
Yes, if you pass None to weights it will initialize 'random' weights. If you specify, say, imagenet, then it will use pretrained weights. You can still train from there with different augmentations.
Yes, when accuracy on the validation set is improving, it's learning. You can continue training as long as your validation metric keeps improving. If it doesn't improve or gets worse, then it's overfitting.
why are you using such low-bit integers?
What error you get?
!traceback
Please provide the full traceback for your exception in order to help us identify your issue.
While the last line of the error message tells us what kind of error you got,
the full traceback will tell us which line, and other critical information to solve your problem.
Please avoid screenshots so we can copy and paste parts of the message.
A full traceback could look like:
Traceback (most recent call last):
File "my_file.py", line 5, in <module>
add_three("6")
File "my_file.py", line 2, in add_three
a = num + 3
TypeError: can only concatenate str (not "int") to str
If the traceback is long, use our pastebin.
Because they read as objects and I turned them into int8 because that's what they were before processing the datasets 🤦♂️
And also my dataset was HUGE and it caused memory issues
ah
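If the narrow dtype really is the cause (that's an assumption; it wasn't confirmed here), a sketch of the workaround is to keep int8 for storage and widen the column only for the aggregation. The first line just shows what wraparound looks like when an int8 accumulator is forced in NumPy; the dataframe is made up.

```python
import numpy as np
import pandas as pd

# Forcing an int8 accumulator: 100 + 100 does not fit in int8 and wraps around.
wrapped = int(np.array([100, 100], dtype=np.int8).sum(dtype=np.int8))

# Hypothetical data shaped like the question: a 0/1 flag stored as int8.
df = pd.DataFrame({
    "user": ["a", "a", "a"],
    "timestamp": pd.to_datetime(["2022-03-01", "2022-03-02", "2022-03-09"]),
    "col1": np.array([1, 1, 1], dtype=np.int8),
})

# Widen only for the aggregation, so the sum has room to grow.
weekly = (
    df.assign(col1=df["col1"].astype("int64"))
      .groupby(["user", pd.Grouper(key="timestamp", freq="W")])
      .agg({"col1": "sum"})
)
```

The stored column stays small for memory purposes; only the temporary copy used for the sum is int64.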
im getting this error FileNotFoundError: [Errno 2] No such file or directory: '../EC331/combined_posts_comments_final.csv'
you need to know your current working directory, and then see if there is an EC331 directory in the one above it, and if it has a combined_posts_comments_final.csv file
though we already know from the error message that you don't.
this screenshot cuts off the error message. but I'll only look at error messages that are given as text.
do you know what the .. at the beginning of the path do? if not, you should probably delete them.
FileNotFoundError: [Errno 2] No such file or directory: '../EC331/Ethereum Data/combined_posts_comments_final.csv'
this is the error sorry
deleted but the error is still the same as i just sent FileNotFoundError: [Errno 2] No such file or directory: '../EC331/Ethereum Data/combined_posts_comments_final.csv'
you have to figure out the working directory that Python is using and give the path relative to that location
okay, ill give that a try
this is jupyterlab?
you can figure out what the working directory is by doing this in a python code cell: import os; os.getcwd()
.. is always relative to the working directory, not to the current file/script being executed
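A quick sketch of how to see what that relative path actually points at. It uses the path from the question; nothing here depends on the file existing.

```python
import os
from pathlib import Path

# The directory Python resolves relative paths against.
cwd = Path(os.getcwd())

# The relative path from the question, resolved against the working directory.
relative = Path("../EC331/combined_posts_comments_final.csv")
resolved = (cwd / relative).resolve()

print("working directory:", cwd)
print("path being opened:", resolved)
print("exists:", resolved.exists())
```

If "exists" prints False, either change the relative path or pass the absolute path to pd.read_csv.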
it worked thanks @serene scaffold and @desert oar
I'm planning to make a tag for this kind of issue, and I should mention this caveat. thanks!
good idea. this comes up a lot in the context of "scripts" as well as notebooks
So I'm planning to make a program that can identify the original 151 Pokemon when you upload a picture. I got a dataset from Kaggle, and I was going to use Google Teachable Machine to upload it and make the model, but I was wondering if it would be a bad idea to have 151 different things in one TensorFlow model?
what you use to create the model (tensorflow, pytorch, etc) doesn't actually matter as far as its potential capability
what matters is the training data that you have and the model architecture
that said, how many separate images do you have for each of the 151 pokemon?
because if you only have one image per pokemon, that's not going to be enough
There are around 50 - 70 each
In addition to what salt rock lamp said about os.getcwd() in jupyterlab in a cell you could use bash pwd like that
!pwd
It's a useful way to execute bash commands in a notebook
okay, so you can make a neural network for that. 151 might be a lot of classes--I'm not sure.
i generally recommend not using shell commands in notebooks, because it's convenient but you very quickly run into problems where your notebook is dependent on specifics of the user's environment, e.g. it no longer works on windows
151 is fine as long as you have enough data points per class
50-70 seems good
I'm actually learning about image processing for the first time, for text recognition 😄
That's possible. But for quick pwd is perfect
I think it's a standard practice in image classification problems to synthetically generate a lot more training samples by algorithmically distorting or otherwise modifying the images
thanks
stretching, skewing, altering colors, rotating/mirroring, adding noise, etc.
there are lots of articles about data augmentation for image classification problems
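To make that concrete, here's a minimal NumPy-only sketch of a few of those distortions (mirroring, 90-degree rotations, additive noise). The augment function and the random "image" are made up for illustration; real pipelines usually use the augmentation tools that ship with the deep learning framework.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly flipped, rotated, and noised copy of an HxWxC image in [0, 1]."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                          # mirror left-right
    out = np.rot90(out, k=int(rng.integers(0, 4)))    # rotate by 0, 90, 180 or 270 degrees
    noise = rng.normal(0.0, 0.02, size=out.shape)     # small Gaussian noise
    return np.clip(out + noise, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))  # stand-in for one training image

# Eight synthetic variants of the same picture.
batch = [augment(image, rng) for _ in range(8)]
```

With 50 to 70 photos per class, generating a handful of variants per photo like this is a cheap way to stretch the dataset.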
that's a very broad question that could take months to explain
they are no help:(
my post was in response to some thing above, not you
what you need is a machine learning course starting at the basics
oversimplified then
fast.ai is a good option
you're basically asking somebody to type out a textbook chapter for you
copy and paste
no but rlly could someone at least show me an ai program so i could see how it works?
each one works differently. but if AI is something that interests you, you will probably enjoy a course about the fundamentals
then a random ai program?
how would someone from an econ background specialise/learn about ML/DL/NN
Self driving cars
Same way anyone would, start with linear algebra and statistics
for your own edification, or to pivot careers?
Depends
...........You know where to find code for a tesla?
At Tesla i guess
the source code for tesla is probably very proprietary
but self-driving cars are going to have numerous components
true
they probably need cameras to see whats going on, and models to identify what each thing is
so cv?
and then it needs some formula to decide how fast or slow to go based on those conditions, as well as incline, speed limits, etc.
depending on how advanced the econ background is, you should have more than enough math and statistics foundational knowledge to jump in "math first". Fast.ai can't hurt as an easy "first course in modern deep learning". for books, check out Probabilistic Machine Learning by Murphy and/or Deep Learning by Goodfellow. what the econ background lets you do is skip all the statistics basics and go right for the fun stuff
well elon musk must have good employees
well, of course
lol
however you will probably want to revisit statistics from outside the perspective of econometrics, because in my experience econometricians tend to use different techniques and think about problems differently @upper spindle . so it depends on your background. the general recommendations are more or less the same as for someone who knows very little or nothing, but the benefit of having a quantitative background is that you can move a lot faster through the intro material and don't need to spend time learning how to program a computer, how to read equations, how to reason statistically/probabilistically, etc.
i wanna move into data science
what is your background @upper spindle, specifically?
so, to pivot careers? salt rock lamp just gave you some great advice, so I'll respond once you've been able to read that.
thanks, that was my issue with programming and with the statistics/probability
im a current university student but about to graduate in a few months
but im lacking on the programming side
anyway, my advice would be to apply to graduate programs in something more closely related to data science. I've worked with data scientists with an economics background, so it's probably one of the better non-CS avenues into DS/AI.
did you do any programming in R?
i have done some, but my department here in the uk used stata
yeah you basically should treat yourself like an advanced beginner
start where everyone else starts
you probably can read equations, and do calculus, and know some linear algebra
you know what regression is, you know about model bias and variance, you know about statistical inference at least on a basic level, you know how to reason about model building
yeh, my maths background from a-level was pretty strong so im not too worried about that too much, other than when equations get horrible
so start at the basics but you can move quickly through it
i very strongly suggest the Murphy book
the beginning material should all be familiar to you from econometrics, but it might be expressed somewhat differently from what you are used to
that + the fast.ai course should be a great start imo
no need to rush through it
thanks
i also strongly suggest learning python, since this is a python forum 🙂
R isn't that useful for "machine learning" as such
yeh, ive been developing my python ability over the year
good, that will be useful in industry
a lot of jobs will place high value on your ability to write code independently
that is the toughest skill ive found over the year, especially for NNs specifically
I only asked about R because I thought economists usually use that
a lot of them still use stata, but yeah social scientists and statisticians often use R
are you writing them "from scratch" somehow?
pytorch is pretty easy to use
especially when you already know the underlying math
i also wouldn't spend too much energy on learning how to implement things "from scratch"
numerical computing is its own field
learn about how the models work mathematically and how to use them, don't worry about implementing them
yeye, i am, but ive been using youtube and github projects to just get a sense of what im trying to do
yeh, my whole department uses stata and are slowly transitioning to R
okay thanks
ive been using tensorflow to implement lstm's so far
that is, you're actually implementing LSTMs "from scratch" (ie with no constructs more abstract than individual tensors)?
I ask because we've discussed lately how overused the word "implement" tends to be. but yeah, implementing things like that "from scratch" isn't something I'd do at your stage, though you'll get to a point where you could if you wanted to.
ohh okay, sorry haha, ive been using code from githubs, youtube and combining them into a univariate lstm
yeah dont waste your time with the youtube tutorials
work through fast.ai
go in with a beginner's mind imo
you'll make progress quickly
you won't struggle like a real beginner would
so do public health folks and many in pharmaceuticals/biostats. CDC uses R here.
but tbh
i also recommended python
since if you want to do advanced data science in R, you end up calling the Reticulate package anyway
aka using python through R

even the R podcasters i listen to end up having to use python sometimes
and theyre trained as biostatisticians too
even if you have to do bioinformatics, theres biopython
but the documentation for some of that stuff can be terrible sometimes so good luck
is there cuda-enabled deep learning in R?
because it's easier to just have all of scientific computing under one roof, and if they're missing that, it's going to become impossible to compete.
okay, will do, thanks
is jupyter labs the best tool for data science/programming in python
seen a few people use spyder
or what are your go to tools for data science
think of which two quadrants the data points are in, and which quadrant the options are in
oh yeah, i have a problem with jupyter, tried installing and keep getting this, how do i fix this and install jupyter
the first two options are on an axis, but not lined up with data points. look at where the other two are, if you treat them as points.
i would uninstall
and try to install again
you mean jupyter or python?
are you following a tutorial? this looks like a misguided use of Python OOP
return self.df # there is no self.df attribute of myDataframe
x.dataframe() # This value isn't used, so nothing happens--did you want to return it?
If you want the d variable in the myDataframe.dataframe method to be exposed, you have to store it as self.d = ...
the best tool is the one that you like the best. try both jupyterlab and spyder, you can also try just using plain python files + ipython on the side, etc.
even as pseudocode, this is very weird code that seems to have been written by someone who was confused about how to use classes
i don't mean to be offensive, but i think that was what stelercus was commenting on
@upper spindle tried this and still getting same errors
its jupyterlab i am trying to install
myDataframe just looks like a wrapper around a single dataframe with no particular purpose, and it refers to instance variables that aren't defined.
def make_df(ticker):
    return pd.DataFrame({'Ticker': [ticker]})
What you have appears to be an over-engineered version of this.
jupyter
if you need to go back to having {'Ticker': [ticker]}, you can just do df.to_dict() on the dataframe. the wrapper class just adds a layer of potential complexity.
im not too sure, tbh, maybe check you have the right requirements
where are the requirements on the jupyter site?
I find it is easier to define an init function with self.var in case I decide to add functions or alter the code down the line.
one usually wants to avoid having lots of mutable state
you're also creating an additional API on top of pandas that people who use your code would have to learn.
Aw, I was trying to do a little refactor of the code, but they deleted it. :'[
They just announced new 'old' course for this April. Similar to last one but with Timm and transformers integration
VS Code is an option too. It supports notebooks as well as py files
On windows you can always install anaconda. You will get a lot of libraries for DS. And visual UI launcher. It's a bit heavy download though.
i want to do these tutorials and they use jupyter or collab, and i want to use jupyter https://www.planethunters.coffee/tutorials
Where do they say what 'tools' they use?
Ok. They don't specify install method . But you can install graphical anaconda. It comes with jupyter, a lot of libraries preinstalled, conda virtual env manager. https://www.anaconda.com/products/individual#Downloads
It's a commonly used tool in DS.
Another option is standalone jupyter app https://github.com/jupyterlab/jupyterlab-desktop
will i still be able to import Astropy and other stuff needed to do what i want to do?
I would think so but i didn't use app version. With conda / anaconda yes you can install packages with conda install or pip install
Colab is a good option as well
i may go with colab than
may install the desktop app in the future, but not now
but how do you even import astropy and other modules on colab?
You would need to pip install the package every time you start session if it's not a part of preinstalled packages on colab (a lot of packages are preinstalled)
will it keep it installed if i save a session to my drive and reopen it (and just modify the code so i can look at a different TIC star)
It needs to be installed every time but it wouldn't take long
alright
now that i know how do do this, can someone help me with doing shape from shading with python
i want to know if its possible, and if so, how do i do it?
i want to make some DEMs for moons in the solar system
Not sure what that means but you could do most things with images in python for example with opencv library
shape from shading is where you turn a still image into a DEM
here is a example from unmannedspaceflight.com (astronomy forum)
http://www.unmannedspaceflight.com/index.php?s=9c8c48a9b3359b9a1a68a71293b12207&showtopic=6543
whole topic for those interested; the program they're using is for Linux only, not Windows
Not sure if this is what you want? http://geologyandpython.com/dem-processing.html
This tutorial shows how to automate downloading and processing DEM files. It shows a one-liner code to download SRTM (30 or 90 m) data and how to use rasterio to reproject the downloaded data into a desired CRS, spatial resolution or bounds.
what does it mean when a arima problem is unconstrained?
do you mean unconstrained optimization?
The error when I run SARIMAX is "this problem is unconstrained"
It outputs, but not entirely
try showing the whole error message from Traceback.
heres the error if you want to see
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
how do you get the inferred frequency?
i would persume it guesses off the data but how do you pass that in
I don't know--I don't even know what the problem is
I'm just following my usual debugging steps
I don't think I can do a deep dive right now, but someone else might.
dont worry ill just have it on here if anyone can help, havnt had much progress by myself and need to do 3 models lol
its not a large program but trying to get it working

i need april to come now
yeah, will be fun: https://twitter.com/jeremyphoward/status/1499600211714674688
I am over the moon to announce:
- I'm now a professor at University of Queensland (UQ), the top institute in my home state!
- I'll be teaching a brand new deep learning course at UQ from April, which will form the basis of a new @fastdotai course! 🧵
https://t.co/RAMaHb7eZ2
you can pass it with the freq argument, MS is monthly data where the index -dd part is the start of the month
for your SARIMA thing i would suggest using a simpler model like dropping the seasonal order and trend and see if that works then you can add in the extra stuff to find out exactly whats causing the issue
im not sure the 'this problem is unconstrained' is an error
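On the inferred frequency question: pandas can report what it would guess from a DatetimeIndex, and you can set it explicitly so nothing has to be guessed. A small sketch with made-up monthly data; the SARIMAX call itself is left as a comment because it needs statsmodels and the real series.

```python
import pandas as pd

# Hypothetical monthly series whose dates fall on the first of each month.
index = pd.to_datetime(["2021-01-01", "2021-02-01", "2021-03-01",
                        "2021-04-01", "2021-05-01", "2021-06-01"])
series = pd.Series([10, 12, 13, 15, 14, 16], index=index)

# What pandas infers from the spacing of the dates.
inferred = pd.infer_freq(series.index)

# Attach the frequency to the index explicitly.
series = series.asfreq("MS")

# With statsmodels installed, the frequency can also be passed directly, e.g.
# SARIMAX(series, order=(1, 0, 0), freq="MS")
```

If infer_freq returns None, the index has gaps or irregular spacing, which is worth fixing before fitting anything.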
Anyone here good with pytesseract and pyautogui dm me im tryna create a bot for something
It's not likely that anyone will DM you. you should say what you want help with in this channel.
Ight, so I am trying to make a bot answer these questions rlly fast, so I am trying to use ocr to get the questions and answer it, and then I will try to correspond the answer to one of the choices and press 1,2,3,4 to get the correct answer
is it always two integers and one of the four basic operations?
➜ wbanalysis git:(gcp) ✗ make upload_data [🐍 warren-buffet]
CommandException: Destination URL must name a directory, bucket, or bucket
subdirectory for the multiple source form of the cp command.
why is GCP saying the destination url must name a directory even though I give it a directory?
I am having trouble converting api data into a csv and also manipulating said data
for example I am doing a project on murder rates as reported by different news publications
I was able to pull the API
but every time I try to convert it to a csv, I get an error message
i've looked at several stack overflow message boards but cannot find a solution that works for me
import json
import csv
import urllib.request
with urllib.request.urlopen('https://content.guardianapis.com/search?api-key=47b0057b-d60d-4a3d-a6cf-c1f79aeedaa4') as url:
    s = url.read()
    print(s)```
this is the code for right now
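A sketch of the missing step, going from the raw bytes to CSV rows. The payload below is a hand-written stand-in, and its shape (a response object holding a results list with fields like webTitle) is an assumption about what the API returns, so print the real response and adjust the keys to match.

```python
import csv
import io
import json

# Hypothetical sample standing in for the bytes returned by url.read().
raw = (
    b'{"response": {"status": "ok", "results": ['
    b'{"webTitle": "Example headline one", "webPublicationDate": "2022-03-01T10:00:00Z", "sectionName": "News"},'
    b'{"webTitle": "Example headline two", "webPublicationDate": "2022-03-02T11:30:00Z", "sectionName": "World"}'
    b']}}'
)

payload = json.loads(raw)
rows = payload["response"]["results"]  # one dict per article

# Write only the chosen columns; extra keys in a row are ignored.
fields = ["webPublicationDate", "sectionName", "webTitle"]
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

To write a file instead of a string, swap the StringIO buffer for open("articles.csv", "w", newline="").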
Hey guys can someone tell me why my scores are so low for my testing set when clearly the model predicts the test data very well:
These plots are test vs prediction from my model
blue is test, red is prediction
What I mean is they predict the data relatively well
So a score less than 0.6 makes zero sense
I've seen it plotted using another ML method with higher score and it doesn't look nearly that clean
I am following a tutorial on pix2pix generation. The output for the shapes of each of the target arrays and the source arrays are
Loaded: (1096, 256, 256, 3) (1096, 256, 256, 3)
but for me the arrays are
Loaded: (256,256,3) (256,256,3)
Does this mean that not all of the images are loaded into the arrays?
I plotted the contents on matplotlib and this is what I get, but I am supposed to get a picture of the images.
Could someone please help. Thx
So I'm making a custom detection model where you can upload an image and it will put it through the detection, and display the top 3 closest results (like on a bar graph), but im somewhat new to python so i dont know which library to use
Hello all! I'm trying to use the fastdtw module to align time-series data that is slightly off.... but the fastdtw alignment makes it WAAY worse!! Any suggestions on whether I'm using that module incorrectly, or a better way to synchronize data?
I'm following this writeup:
https://towardsdatascience.com/how-to-synchronize-time-series-datasets-in-python-f2ae51bee212
But I start with that (orange trace is a few seconds ahead), and fastdtw makes a mess of it!!!
WHAT
what
so there's only two options?
I don't mean to be the bearer of bad news, but for binary classification, 50% is the worst possible accuracy
so 75% is kind of like 50%
If there's only two classes, and your model was completely random, then it would get 50% accuracy
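A quick simulation of that baseline, assuming two roughly balanced classes (the data here is random and purely illustrative). With imbalanced classes the number to beat is the majority-class rate rather than 50%.

```python
import random

random.seed(0)

# Hypothetical balanced binary labels and a "model" that guesses at random
labels = [random.randint(0, 1) for _ in range(10_000)]
guesses = [random.randint(0, 1) for _ in range(10_000)]

# Fraction of guesses that happen to match the labels
accuracy = sum(g == y for g, y in zip(guesses, labels)) / len(labels)

# Accuracy of always predicting the more common class
majority_rate = max(labels.count(0), labels.count(1)) / len(labels)

print(f"random guessing: {accuracy:.3f}, majority class: {majority_rate:.3f}")
```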
I suppose
im bad at the maths hahah
yo its been already 10 epochs and the validation metric are the same as the 1st epoch is this normal? i am training it for 100 epochs
btw i am using google collab is there an option to use more computational power?
@pastel valley there is if you're willing to pay them for it.
will there be any other platform which is free?
Not one that will give you more compute power than colab
Colab is already generous.
Did you remember to set it to use the gpu
Yes. But idk how to do it off the top of my head
Also, let me reiterate that I think you would benefit a lot from and very much enjoy a formal data science course.
You need to change runtime to GPU
Runtime - change runtime type
You might also need to move the model to the GPU
I'm making a simple neural network to find the relationship between 2 numbers ```py
from tensorflow import keras
import numpy as np
model = keras.Sequential(keras.layers.Dense(units=1, input_shape=[1]))
model.compile(optimizer='sgd', loss='mean_squared_error')
def calulate_trangular_numbers(n):
    for i in range(1, n+1):
        yield int(i*(i+1)/2)
n = 20
x = np.array(list(range(1, n+1)))
y = np.array(list(calulate_trangular_numbers(n)))
model.fit(x, y, epochs=500)``` (I want it to find the relationship between the x values and y y = x*(x+1)/2)
But for some reason when I fit the model the loss is nan
```
Epoch 1/500
1/1 [==============================] - 0s 9ms/step - loss: nan
Epoch 2/500
1/1 [==============================] - 0s 8ms/step - loss: nan
Epoch 3/500
1/1 [==============================] - 0s 7ms/step - loss: nan
Epoch 4/500
1/1 [==============================] - 0s 12ms/step - loss: nan
Epoch 5/500
1/1 [==============================] - 0s 12ms/step - loss: nan``` any reason for why this could happen?
There are couple of free GPU options: colab, paperspace, kaggle, AWS sagemaker studio lab
Make sure that y doesn't have any nans in it
I checked, it doesn't have any nan values
probably but i need the maths first i think hahaha
oh nice nice i got it
weird the loss isn't nan when the input arrays contain 10 elements but the prediction is far from the expected value
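One plausible explanation, sketched in plain Python rather than Keras: with unscaled inputs up to 20, gradient descent at a learning rate of 0.01 (the default for Keras's `'sgd'`) overshoots on every step and the weights blow up, while inputs up to 10 stay inside the stable range. The helper `fit_line` is hypothetical and only mimics a single `Dense(1)` unit trained with full-batch MSE; it is not the original model.

```python
import math


def fit_line(n, lr=0.01, steps=500):
    """Full-batch gradient descent on y ~ w*x + b with an MSE loss."""
    xs = [float(i) for i in range(1, n + 1)]
    ys = [i * (i + 1) / 2 for i in range(1, n + 1)]  # triangular numbers
    w, b = 0.0, 0.0
    for _ in range(steps):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errs, xs)) / n
        grad_b = 2 * sum(errs) / n
        w -= lr * grad_w
        b -= lr * grad_b
    errs = [w * x + b - y for x, y in zip(xs, ys)]
    return sum(e * e for e in errs) / n


loss_10 = fit_line(10)  # stays finite
loss_20 = fit_line(20)  # grows without bound; float32 would overflow sooner
print(loss_10, loss_20)
```

Even when it converges, a single linear unit can only fit a straight line, so predictions for y = x*(x+1)/2 will stay far off. Scaling the inputs, lowering the learning rate, or adding a hidden layer with a nonlinearity are the usual things to try.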
Is this good
https://youtu.be/tPYj3fFJGjk
Learn how to use TensorFlow 2.0 in this full tutorial course for beginners. This course is designed for Python programmers looking to enhance their knowledge and skills in machine learning and artificial intelligence.
Throughout the 8 modules in this course you will learn about fundamental concepts and methods in ML & AI like core learning alg...
I am getting into ml and ai field
Look up a youtube video on machine-learning image analysis
I would recommend getting more confident with python before starting something like this...
Hi guys, i have a question regarding machine learning. Which algorithm will be the best if the data set generated will be based on the graphical location of the mouse cursor (numerical data)? The objective is to allow the machine to learn the mouse movements
I don't know this course. I can recommend Fastai courses they are suitable for beginners in AI with some python coding experience. course.fast.ai
What would input and output for the model?
mouse movement. graphical data( numerical)
Images?
Example?
x and y axis
And output?
from what i belive, the output will be based on the input
since the machine will have to predict what the next input might look llike
I mean also x,y coordinates?
It's a multi-output regression. For example, a deep neural network: https://machinelearningmastery.com/deep-learning-models-for-multi-output-regression/
Alr ty
Or these algorithms support multi-output regression in scikit-learn:
LinearRegression (and related)
KNeighborsRegressor
DecisionTreeRegressor
RandomForestRegressor (and related)
https://machinelearningmastery.com/multi-output-regression-models-with-python/
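A rough sketch of how the mouse data could be framed for any of those regressors, assuming the task is "predict the next (x, y) from the last few positions". The helper `make_windows`, the `history` length and the sample track are all made up for illustration.

```python
def make_windows(points, history=3):
    """Turn a list of (x, y) positions into (inputs, targets) pairs.

    Each input row is the previous `history` positions flattened into
    [x1, y1, x2, y2, ...]; each target is the position that came next.
    """
    inputs, targets = [], []
    for i in range(history, len(points)):
        window = points[i - history:i]
        inputs.append([coord for point in window for coord in point])
        targets.append(list(points[i]))
    return inputs, targets


# Hypothetical cursor track
track = [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
X, Y = make_windows(track, history=3)
print(X[0], "->", Y[0])
```

`X` and `Y` can then be passed to a multi-output regressor's `fit(X, Y)`, since each target row has two values.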
They will have a live course starting in April, in person and online https://mobile.twitter.com/jeremyphoward/status/1499600211714674688
I am over the moon to announce:
- I'm now a professor at University of Queensland (UQ), the top institute in my home state!
- I'll be teaching a brand new deep learning course at UQ from April, which will form the basis of a new @fastdotai course! 🧵
https://t.co/RAMaHb7eZ2
Interesting
But their courses and book are available for free, course above, book: GitHub.com/fastai/fastbook
Tysm
Live course may be paid, but they release as free MOOC soon after live course finishes.
Any of yall are experienced with big data projects? I want to start with one and would love to know your dataset preferences.
guys i have a doubt , here https://fractaldle.medium.com/brief-overview-on-object-detection-algorithms-ec516929be93
what does it mean by For each object class, train a SVM (one versus other) classifier. You can use hard negative mining to improve the classification accuracy. , does it take the output of last fc hidden layer and feed it to svm for classification or take the softmax fc layer and feed it to svm?
https://youtu.be/GVsUOuSjvcg For anyone who hasn't seen it yet. Very interesting bit about flashable analogue chips running pretrained models with significantly reduced power consumption vs banks of gpus.
Visit https://brilliant.org/Veritasium/ to get started learning STEM for free, and the first 200 people will get 20% off their annual premium subscription. Digital computers have served us well for decades, but the rise of artificial intelligence demands a totally new kind of computer: analog.
Thanks to Mike Henry and everyone at Mythic for the...
Cool bit of ML history, too.
Does anyone know how to use Tf-Idf with a CNN for texts (NLP)
any article or something u can refer me to or tutorial?
Does anyone know of any data-science projects which I can join?
FALL 2018 - Harvard University, Institute for Applied Computational Science.
wow this is nice
it has application and bases itself off a good textbook
this is probably my best results so far the distance of train test is not like the other ones
but those spikes on loss and accuracy is it normal? or there are common knowledge on why those happens?
its a pretty bad vid IMO, Derek doesn't look like he researched enough. Ever since his vids got out of physics, their quality has been steadily degrading
anyone who has worked on ML, do people put training and testing processes on same .py file or create different modules for each?
I like tests to always go in a separate ‘tests/‘ directory. Source and tests being together makes things confusing IMO
wdym by some python coding experience? I am doing the automatetheboringstuff book, would that be enough or would a beginner MOOC be better
like the University of Helsinki one
Both, when the model is new and still very buggy I just want fast development iteration times and just keep it all in one file. Then when it seems not so buggy anymore I start creating separate "official" tests that need to be passed before it's ready for use.
(I do this for not just ML but all new algorithms)
In addition, I like to have at least 1 test made by someone else to make sure that i'm not just making tests I know it will pass.
Hi there, I am looking for a data-science community to work with on interesting projects.
This is a perfect answer to the question.
Everyone should read this and apply it to their development cycle, for real.
I like to imagine programming like crystallization or annealing. At first it's hot and I want to strike it often, but eventually I want it to cool off and harden / crystallize.
(Pro tip, check the commit rate of a piece of code, if it starts slowing down, it's time to add some tests and let it harden, but if it's updated a lot even after a long time, maybe it's the wrong approach / design and therefore is causing a lot of bugs)
(If someone asks you to fix their code base, look for what is being changed a lot and find out why)
I was given that last advice by a former manager, and we had a tool to look at file-commit-rates. Many of them were just adding business logic (or false positives --- typos someone forgot to squash) and so it was easy to pull that out so that the business logic could be more easily changed and updated and then "plugged in" to the microservice. Great advice.
Hello I have a conda project with a typer cli app in it located in libraryassignment/__main__.py file and I'm currently running typer commands like so: python -m libraryassignment <command_name>. It works fine but I want to be able to execute without -m flag like so: python libraryassignment <command_name> but I get ModuleNotFoundError: No module named 'libraryassignment'.
As far as I know, I have to either include it to the path or create a python package. I'm relatively new to conda and I wonder how can I tackle this issue creating all the required configuration to build a package so that python detects it as a package allowing me to keep developing on the project.
I used poetry in the past and it's pretty intuitive and easy to use especially regarding the building process of python packages with pyproject.toml and poetry.lock files but I don't have much experience with conda and I wonder if you can help me with some guidelines that I can put into practice to build a package from a conda project.
Thank you very much in advance.
any GCP experts out there who can tell me why I keep getting this error? Just trying to upload a folder to my GCP Bucket
➜ wbanalysis git:(gcp) make upload_data
CommandException: Destination URL must name a directory, bucket, or bucket
subdirectory for the multiple source form of the cp command.
make: *** [upload_data] Error 1
maybe ask in #tools-and-devops
speaking of, i need to learn a cloud tool

A year of coding (preferably Python) and high school math is the recommended pre-requisite. The best way to get up to speed is to start taking the current course new, and work to fill in any knowledge/expertise gaps you come across as you go.
https://mobile.twitter.com/jeremyphoward/status/1499600223920074754
highschool math means 12 math right
like you need calculus and advanced functions
for the course
Hm. Okay. Good to know another view.
how long is the cooldown with this?
I'm not sure, but going forward, you should probably experiment with a smaller amount of data. you might also consider paying.
The more you use it the longer you have to wait I've read. It's like hours or days.
You could try transfer learning. Kaggle will give you around 30-40 hrs of GPU usage a week guaranteed. For now AWS sagemaker studio lab doesn't have limits other than 4hrs session, similar to paperspace but here GPU may be not available at times due to demand.
you at least need to understand/be excited to learn about probability and statistics. if you want to do ML as well, you'd need to have the same relationship with linear algebra and calculus.
You can start with using high level APIs like Sci-kit learn and fill the gaps as you go.
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
hi i have a question
model = Sequential()
model.add(Conv1D(filters=3, kernel_size=1, activation='relu', input_shape=(None, 3, 10, 1)))
# model.add(MaxPool1D(pool_size=3, strides=1))
# model.add(GlobalMaxPooling1D())
# model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
# model.add(MaxPool1D(pool_size=2, strides=2))
# model.add(GlobalMaxPooling1D())
model.add(Flatten())
model.add(Dense(units=128,activation='relu'))
model.add(Dense(units=1,activation='sigmoid'))
# For a binary classification problem
model.compile(loss='binary_crossentropy', optimizer='adam')
here is a cnn model code
im getting this error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-70-4023ec9e66ce> in <module>
11 model.add(Flatten())
12
---> 13 model.add(Dense(units=128,activation='relu'))
14
15 model.add(Dense(units=1,activation='sigmoid'))
ValueError: The last dimension of the inputs to `Dense` should be defined. Found `None`.
does anayone know y?
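A likely cause, sketched as shape arithmetic in plain Python: in Keras, `input_shape` describes one sample and leaves the batch dimension out, so the leading `None` in `(None, 3, 10, 1)` becomes an unknown axis inside every sample, and `Flatten` cannot compute a size for `Dense`. The helpers below are hypothetical and only mimic the bookkeeping for `Conv1D` with 'valid' padding and stride 1. The `(10, 1)` shape is an assumption about the data (10 steps, 1 channel).

```python
from math import prod


def conv1d_output_shape(shape, filters, kernel_size):
    """Per-sample shape after Conv1D ('valid' padding, stride 1)."""
    *leading, steps, _channels = shape
    return (*leading, steps - kernel_size + 1, filters)


def flattened_size(shape):
    """Size after Flatten, or None if any axis is unknown."""
    if any(dim is None for dim in shape):
        return None
    return prod(shape)


# Shape as written in the question: the None makes the flat size unknown
bad = flattened_size(conv1d_output_shape((None, 3, 10, 1), filters=3, kernel_size=1))

# Assumed per-sample shape of (steps=10, channels=1)
good = flattened_size(conv1d_output_shape((10, 1), filters=3, kernel_size=1))

print(bad, good)
```

If each sample really is 10 steps of 1 feature, `input_shape=(10, 1)` would be the thing to try; the right value depends on how the data is laid out.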
you only has 1 available prediction
what?
or also maybe the input shape?
so how can i fix?
isnt the input shape set on the first layer, which is the first Conv1D layer, and i have done that
am also noob so i dont know if what i am saying is correct hahaha
but the shape should be square i think
and the units after your flatten() should be =< the units of flattened
also your final layer should be more than one unit because if it is only one then its predicting only a single class in your case i think sigmoid is for binary class
correct me if am wrong 😅
i am prediciting one value, a binary value, so its binary prediction
i love data & specially probabilities but im not the algebra type of person
i have used it before
https://paste.pythondiscord.com/utaqajurat can someone verify that the algorithms are working and its working as expected?
oh my bad
also is there a certain matrix size inside the neural network that is most efficient
ex.
121 data points --> x --> y --> z --> 2
1x1.
= 1
what?
A matrix of size 1x1 is most efficient.
Are we? What do you mean then?
so i input an array of size 121 * 121 right
and then i go through 3 layers of matrix multiplication to get the 121 into a 1x2 or a 2x1 i forgor
"get the 121 into a 1x2 or a 2x1 i forgor" - I don't understand what this means.
can someone help answer my question
know where to get started with data science and what is basically is
is it legit just anyalization shit tons of data for a motive
like facebook with their ad systems?
you can check pins snow.
it helps in various ways. it can be used for analytics, predictions, classification, reinforcement learning, problem solving and well.. hella hella stuff.
icic
Crazy idea: Neovim but like jupyter Notebook. So Neovim Notebooks! Possible?
Hello,
rect = win32gui.GetWindowRect(hwnd)
I grab my screen for object detection but i want grab specific section of my screen, How can i do that?
hello y'all! I created a SARIMAX model and need some help evaluating the Results:
I mean this looks quite good at first glance, right? But is it? The RMSE is 0.024718 when comparing actual vs. prediction
I posted my code and my approach to: https://www.reddit.com/r/learnmachinelearning/comments/t7yznq/i_need_help_evaluating_my_results_interpreting_my/
Could you maybe have a look at it?
what is the actual scale of the variable?
rmse of 0.025 on values on the order of ~2 seems good to me!
however it looks like your model testing procedure is probably not valid
you don't want to just check a bunch of one-step-ahead forecasts, obviously those will always be good
you need a train/test split
or better yet cross validation
https://otexts.com/fpp3/tscv.html @gloomy anvil
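A minimal sketch of the time-series cross-validation idea from that link: an expanding training window where every test fold lies strictly after its training data. The generator `expanding_window_splits` and its parameter names are illustrative, not from any library.

```python
def expanding_window_splits(n_samples, initial_train, horizon):
    """Yield (train_indices, test_indices) with the test block always
    coming after the training block in time."""
    start = initial_train
    while start + horizon <= n_samples:
        yield range(0, start), range(start, start + horizon)
        start += horizon


splits = list(expanding_window_splits(n_samples=10, initial_train=4, horizon=2))
for train_idx, test_idx in splits:
    print(list(train_idx), "->", list(test_idx))
```

For each fold the model is refit on the training indices and scored on the test indices, then the scores are averaged. scikit-learn's `TimeSeriesSplit` does a similar split if a library version is preferred.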
Any body in the data science/ analytics field?
I wanna ask how much more do i need to know to get a basic/ junior data analyst position
So this is the closing price that I am trying to predict:
count 171.000000
mean 1.906868
std 0.505193
min 1.056412
25% 1.393226
50% 1.988028
75% 2.233953
max 2.968611
Name: close, dtype: float64
this is the description of my test dataset. I split it into 1000 rows for training and 171 rows for the test.
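To put the reported RMSE in context, here is a small sketch that relates it to the scale of the series, using only the numbers quoted above (RMSE 0.024718 and the `describe()` output). The variable names are illustrative.

```python
# Values copied from the messages above
rmse = 0.024718
mean_close = 1.906868
std_close = 0.505193
min_close, max_close = 1.056412, 2.968611

# RMSE relative to the level, spread and range of the test series
rmse_vs_mean = rmse / mean_close
rmse_vs_std = rmse / std_close
rmse_vs_range = rmse / (max_close - min_close)

print(f"{rmse_vs_mean:.1%} of the mean, "
      f"{rmse_vs_std:.1%} of the std, "
      f"{rmse_vs_range:.1%} of the range")
```

A small ratio only says the error is small relative to the scale. It does not say whether the forecasts were made honestly out of sample, which is the separate concern raised about the testing procedure.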
This is my code:
#load dataset
df = pd.read_csv('ADA_1440.csv', index_col = 'date', parse_dates = True)
#split the closing price into train and test data
train = df.iloc[:1000,4]
test = df.iloc[1000:,4]
#select exogenous variables
exo = df.iloc[:,6:61]
#split exogenuous variables into train and test data
exo_train = exo.iloc[:1000]
exo_test = exo.iloc[1000:]
#run auto_arima to find the best configuration (I selected m=7 and D=1 by running seasonal_decompose and acf and pacf plots)
auto_arima(df['close'], exogenous=exo, m=7, trace=True, D=1).summary()
#set the best configuration from auto_arima for the SARIMAX model
Model = SARIMAX(train, exog = exo_train, order=(1,0,2), seasonal_order = (0,1,1,7))
#train model
Model = Model.fit()
#get prediction
prediction = Model.predict(len(train), len(train)+len(test)-1, exog = exo_test, typ = 'levels')
#plot the prediction
plt.plot(test, color ='red', label = 'Actual')
plt.plot(prediction, color ='blue', label = 'Prediction')
plt.xlabel('Time')
plt.ylabel('Price')
plt.legend()
plt.show()
#calculate rmse
rmse = math.sqrt(mean_squared_error(test, prediction))
thanks for the crossvalidation. I didnt think about this, because I am working with continuous timeseries. My assumption was that crossval is not possible in timeseries
don't say that you're "not an algebra type of person". algebra is pretty much the most basic level of math, and if you let a prior experience dictate whether or not you like it, you're setting yourself up for disappointment.
best way to find out is to apply to a few jobs tbh
for entry-level data analyst positions, the amount of knowledge needed isnt usually too much
but the issue comes with how competitive those positions are, especially from people making career changes
sometimes you are competing with people with graduate degrees, work experience in certain domain, etc.
so you usually need something special to stand out
please give more information; you haven't said enough for anyone to know what you need
sorry about that im still thinking through it
sounds good. I might be able to check this channel again when your question is ready.
Hey, I am trying to change my project to be able to detect multiple objects in a single instance. All my images are annotated using pascal, but I am unsure where to go from here. Previously I had a "default" class filled with many random images, but I realized this is very incorrect (in my mind) and I would rather use general object detection and maybe add some bounding boxes if time allows
My file breakdown looks like this:
i like the profile picture tony
thanks 🤣
🤔 what you mean by general obj detection ? you mean you want your target label to have xmin, ymin, xmax, ymax ? if thats the case then pascal voc format is exactly that
Hmm yeah I guess so. Would that allow me to detect multiple objects in the same image?
This is my first time trying something of this nature
These are some results of the older project but I realised that I was treating an ordinary image as a class instead of it being the default setting, if that makes sense
btw does anyone know how to plot the normalized image? I mean, after normalizing the image using albumentations and converting the image range to -1 to 1, matplotlib displays a black image. How can I avoid that? I even tried plt.imshow((image * 255).to(torch.uint8))
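A sketch of the usual fix, assuming the image was normalized with mean 0.5 and std 0.5 (which maps [0, 1] to [-1, 1]): undo the normalization before plotting, because `plt.imshow` expects floats in [0, 1] or integers in [0, 255]. Multiplying by 255 first leaves the negative values negative, and they do not survive the cast to `uint8`. The helper `denormalize` is hypothetical and works on plain floats to show the arithmetic.

```python
def denormalize(value, mean=0.5, std=0.5):
    """Map a normalized pixel back to [0, 1]; mean/std are assumptions
    and must match whatever the Normalize transform used."""
    restored = value * std + mean
    return min(1.0, max(0.0, restored))  # clamp rounding overshoot


row = [-1.0, -0.5, 0.0, 0.5, 1.0]
restored_row = [denormalize(v) for v in row]
print(restored_row)
```

On a tensor the same thing is `image * 0.5 + 0.5`, and a channels-first tensor would also need its axes reordered to height, width, channels before it is passed to `imshow`.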
rn almost every object detection model can detect multiple objs in an image
Oh? This is my model currently. Maybe I'm finding this is because I'm finding the argmax of the prediction
what you are trying is image classification
Ohhhhhhh right. Okay time to look into object detection approaches
Thanks for the help
https://paste.pythondiscord.com/ekavewuzaf how fix and be less suck?
i ran it for an hour and it stopped increasing at about a score of 165
Any tips to plot this, all blocks on the same plot? I have it inside a DataFrame. My line of thought was iterating through it each column but I guess iterating and dataframe shouldnt work together, right?
avoid thinking about iterating when you're working with dataframes.
what kind of plot are we talking about?
do you want line plots, where each block is a line?
Yea
I was thinking of making it dynamically, since the amount of blocks can change based on user input
I would first do df.index = df.index.str.extract(r'(\d+)').astype(int) so that the index is ints instead of strings
and then you can use df.plot.line(). it might even work just like that, without any additional work
you might have to transpose it. but then that's just df.T.plot.line()
I thought of that too, I've used that Block as column format because it's easier to visualize on excel
I don't know what you mean by this. did df.plot.line() look like something other than what you expected?
I would need a code representation of your dataframe that I can c/p to experiment.
I mean, the DF is the blocks as columns because my original project only had an excel file as output, I thought of transposing now before plotting
you'll have to do print(df.head().to_dict('list')) and show the text for us to continue.
{'Bloco 1': [6000.0, 6000.0, 6000.0, 6000.0, 5996.913420966637], 'Bloco 2': [6000.0, 6000.0, 6000.0, 5986.342797261716, 5963.890663247039], 'Bloco 3': [6000.0, 6000.0, 5939.570902083334, 5873.3415172031355, 5809.641970812106], 'Bloco 4': [6000.0, 5732.619047619048, 5586.096291071429, 5478.48851392744, 5386.497501391264], 'Bloco 5': [6000.0, 6000.0, 5939.570902083334, 5859.684314464852, 5773.532634059145]}
so what's the problem? didn't I basically give you the solution?
I was getting an error on the df.index =.... line
okay, so show the error
I guess it's because my index names have "Day" on the string
saying that you "got an error" is uninformative. copy and paste the error from Traceback
also, you can label the x axis as "day" with xlabel='Day'
ValueError: Index data must be 1-dimensional
Please provide the full traceback for your exception in order to help us identify your issue.
While the last line of the error message tells us what kind of error you got,
the full traceback will tell us which line, and other critical information to solve your problem.
Please avoid screenshots so we can copy and paste parts of the message.
A full traceback could look like:
Traceback (most recent call last):
File "my_file.py", line 5, in <module>
add_three("6")
File "my_file.py", line 2, in add_three
a = num + 3
TypeError: can only concatenate str (not "int") to str
If the traceback is long, use our pastebin.
Traceback (most recent call last):
File "C:\Users\joao_\Desktop\Projetos Python\Simulador de pressão\simuladorexplicito.py", line 89, in <module>
relatorio.index = relatorio.index.str.extract(r'(\d+)').astype(int)
File "C:\Users\joao_\Desktop\Projetos Python\Simulador de pressão\.venv\lib\site-packages\pandas\core\generic.py", line 5596, in __setattr__
return object.__setattr__(self, name, value)
File "pandas\_libs\properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
File "C:\Users\joao_\Desktop\Projetos Python\Simulador de pressão\.venv\lib\site-packages\pandas\core\generic.py", line 768, in _set_axis
w__
return Index(np.asarray(data), dtype=dtype, copy=copy, name=name, **kwargs)
File "C:\Users\joao_\Desktop\Projetos Python\Simulador de pressão\.venv\lib\site-packages\pandas\core\indexes\base.py", line 503, in __new__
arr = klass._ensure_array(arr, dtype, copy)
File "C:\Users\joao_\Desktop\Projetos Python\Simulador de pressão\.venv\lib\site-packages\pandas\core\indexes\numeric.py", line 183, in _ensure_array
raise ValueError("Index data must be 1-dimensional")
ValueError: Index data must be 1-dimensional
okay, can you do print(df.index)?
Index(['Dia 0', 'Dia 15', 'Dia 30', 'Dia 45', 'Dia 60', 'Dia 75', 'Dia 90',
'Dia 105', 'Dia 120', 'Dia 135', 'Dia 150', 'Dia 165', 'Dia 180',
'Dia 195', 'Dia 210', 'Dia 225', 'Dia 240', 'Dia 255', 'Dia 270',
'Dia 285', 'Dia 300', 'Dia 315', 'Dia 330', 'Dia 345', 'Dia 360'],
dtype='object')
Btw, I need an extension to plot on the VS Code? The plot line runs fine, but shows nothing
I don't use vs code
try
df.index = df.index.str.extract(r'(\d+)').astype(int).squeeze().tolist()
there's probably a better way to do it. somewhere.
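For reference, the error comes from `.str.extract` returning a two-dimensional DataFrame by default (`expand=True`), which cannot be assigned as an index. Passing `expand=False` with a single capture group returns a one-dimensional result instead. The same extraction can also be sketched with the standard `re` module; the label list below is a shortened copy of the printed index.

```python
import re

# Shortened copy of the index labels shown above
labels = ['Dia 0', 'Dia 15', 'Dia 30', 'Dia 45']

# Pull the first run of digits out of each label and convert it to int
days = [int(re.search(r'\d+', label).group()) for label in labels]
print(days)

# With a DataFrame this list could be assigned directly: df.index = days
```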
looks like you need to transpose it.
That was with transpose already, gonna try without it
Worked, thank you very much mate
🔥
https://paste.pythondiscord.com/ekavewuzaf how fix and be less suck?
i ran it for an hour and it stopped increasing at about a score of 165
165 out of what?
what
you got a score of 165. idk what that means.
there is a data set of 500 squares and 500 not squares
for every data thing it correctly identifies it gets a point
and for every one it does wrong it loses a point
so 165 means it got 417.5 wrong and 582.5 right i think?
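The arithmetic behind that, as a small sketch: with +1 for each right answer and -1 for each wrong one, `score = right - wrong` and `right + wrong = total`, so `right = (total + score) / 2`. The function name is made up for illustration.

```python
def counts_from_score(score, total):
    """Recover (right, wrong) from a +1/-1 score over `total` examples."""
    right = (total + score) / 2
    wrong = total - right
    return right, wrong


avg_right, avg_wrong = counts_from_score(165, 1000)
print(avg_right, avg_wrong)  # fractional because 165 was an average
```

With 1000 examples every single run gives an even score, so half-counts like these can only come from averaging several runs.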
you should use an actual metric, like precision or recall. do you know what true positive, false positives, true negatives, and false negatives are?
saying that you "got 417.5 wrong and 582.5 right" is vague, whereas reporting the score for a performance metric is specific.
also how did you get some partially correct?
it didnt
it never actually outputted 165
it only outputs even numbers
165 was just an average i saw
how do i make a performance metric
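A minimal sketch of precision and recall built from the four counts mentioned above, assuming `True` means "is a square". The function names and the tiny example lists are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, true negatives, false negatives."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred and truth:
            tp += 1
        elif pred and not truth:
            fp += 1
        elif not pred and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def precision_recall(y_true, y_pred):
    tp, fp, _tn, fn = confusion_counts(y_true, y_pred)
    # Precision: of everything predicted positive, how much was right
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything actually positive, how much was found
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels (True = square) and predictions
y_true = [True, True, True, False, False, False]
y_pred = [True, True, False, True, False, False]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```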
well, let's go over a few issues with the code first.
notSquare = square # This **does not** make a copy, it just makes another reference
if self.classify(square) == True: # Never do comparisons to True or False. If `self.classify(square)` is already True, you're just writing `if True == True`
You're also using lowerCamelCase for everything, when you should be using UpperCamelCase for class names and snake_case for everything else.
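A short illustration of the first point, using a made-up 2x2 grid: plain assignment gives a second name for the same list, while `copy.deepcopy` gives an independent one.

```python
import copy

square = [[1, 1], [1, 1]]

# Assignment only creates another reference to the same nested list
alias = square
alias[0][0] = 0  # this also changes `square`

# deepcopy creates a separate object, including the inner lists
independent = copy.deepcopy(square)
independent[1][1] = 0  # `square` is not affected

print(square, independent)
```

A shallow copy such as `square[:]` would not be enough here, because the inner row lists would still be shared.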
Wouldn't the code still technically work anyway?
^
If you were using camel case
I don't think I can dive into what it would take to improve the performance, as you've written a lot of it in "pure python" and I'm more used to reading code that uses the numpy/torch style more extensively.
not the part about notSquare = square. but I was just making suggestions as I was reading through the code, regardless of whether they were logic or style errors.
I'm not sure what it's intended to do.
which part
all of it. what is the model supposed to predict, for what inputs?
its suppose to take an array like ```
test = [[
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
]]
```
like, if the region of 1s is square?
yes
you don't need ML for that?
well i want to
also you have [[ and ]] but each row isn't its own list
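A sketch covering both remarks, assuming the input is a flat list of 121 values meant as an 11x11 grid: split it into rows, then check without any ML whether the 1s fill a square block. Both helper names are hypothetical.

```python
def to_rows(flat, width):
    """Split a flat list into rows of `width` items each."""
    return [flat[i:i + width] for i in range(0, len(flat), width)]


def ones_form_square(grid):
    """True if the 1s exactly fill a square bounding box."""
    cells = [(r, c) for r, row in enumerate(grid)
             for c, value in enumerate(row) if value == 1]
    if not cells:
        return False
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    # Square bounding box, and every cell inside it is a 1
    return height == width and len(cells) == height * width


# Same layout as the pasted example: a 6x6 block of 1s in an 11x11 grid
flat = [0] * 11 + ([0] * 4 + [1] * 6 + [0]) * 6 + [0] * 44
grid = to_rows(flat, 11)
print(ones_form_square(grid))
```

A rule like this can also serve as a source of ground-truth labels when generating training data for the learned classifier.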
for square in squares:
    if self.classify(square) == True:
        score += 1
    else:
        score -= 1
you don't subtract when a model makes the wrong prediction.
yes i do




