#data-science-and-ml
@proper fable just explore other EDA notebooks there in Kaggle, by using the search bar in the notebooks section
@lapis sequoia The math required for ML/AI depends heavily on the task you are doing: simple tasks need simple math, complex tasks need complex math. I think the basics of calculus and algebra should be enough for general machine learning, and knowledge of vectors and matrices (usually taught in C.S. in schools) would be very helpful too.
@grave frost Thank youuu, that helps me a lot. I didn't know I could do such a thing before
np
word_vecs = KeyedVectors.load_word2vec_format("./glove.txt") how do I get the "glove.txt" file, or how do I generate it?
I am using gensim.models
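A possible starting point, not a confirmed answer: pretrained GloVe vectors are distributed by the Stanford NLP group as plain text files, one word per line followed by its numbers. That layout differs from the word2vec text format only in that it lacks the first header line with the vector count and dimension. The sketch below adds that header using only the standard library; the function name `add_word2vec_header` and the tiny sample file are made up for illustration, and the whole file is read into memory, which may not suit very large files.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file, prepending the '<count> <dim>' header line."""
    with open(glove_path, encoding="utf-8") as src:
        lines = [line for line in src if line.strip()]
    dim = len(lines[0].split()) - 1  # first token is the word itself
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{len(lines)} {dim}\n")
        dst.writelines(lines)
    return len(lines), dim


# Tiny made-up file in GloVe layout, just to exercise the function.
tmp_dir = tempfile.mkdtemp()
glove_file = os.path.join(tmp_dir, "glove.txt")
w2v_file = os.path.join(tmp_dir, "glove.w2v.txt")
with open(glove_file, "w", encoding="utf-8") as f:
    f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

count, dim = add_word2vec_header(glove_file, w2v_file)
with open(w2v_file, encoding="utf-8") as f:
    header = f.readline().strip()
```

The converted file should then be loadable with `KeyedVectors.load_word2vec_format`. Recent gensim releases (4.x) also accept `no_header=True` on that call, which would make the conversion unnecessary; check the installed version before relying on it.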
spaCy: Is a vocabulary just the set of words from all analyzed documents, or does it go beyond that?
```
2020-10-17 18:32:05,249 findDocumentType1 MainThread : test!
2020-10-17 18:32:11,981 findDocumentType1 Thread-19 : Exception on /findDocumentType1 [POST]
Traceback (most recent call last):
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 468, in wrapper
    resp = resource(*args, **kwargs)
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask\views.py", line 89, in view
    return self.dispatch_request(*args, **kwargs)
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 583, in dispatch_request
    resp = meth(*args, **kwargs)
TypeError: post() takes 0 positional arguments but 1 was given
```
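The resource class itself was not posted, so this is only a guess at the cause: Python raises this exact TypeError when a method is defined without `self` and then called on an instance. A plain-Python sketch with no Flask involved (the class names `BrokenResource` and `FixedResource` are made up):

```python
class BrokenResource:
    # Bug: no `self`, so the implicitly passed instance has nowhere to go.
    def post():
        return {"status": "ok"}


class FixedResource:
    # Fix: accept the instance as the first positional argument.
    def post(self):
        return {"status": "ok"}


try:
    BrokenResource().post()
except TypeError as exc:
    error_text = str(exc)

result = FixedResource().post()
```

If the `post` method of the flask_restful resource is written like the first class, adding `self` as its first parameter would be the thing to try.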
Guys, i'm a beginner in python. Do you guys have reference code for random forest algorithm without using scikit learn (sklearn) in jupyter notebook. Thank you.
Hey guys, I ended up getting a gig on DS in freelancing. I have a dataset of users location and activity location and time data from which I have to find how much time the users spends in a specific location.
Is there a way to do it?
id, time, user_x, user_y, act_x, act_y, activity are the features
ids repeat and activity coordinates repeat sometimes.
What are the features like?
@grave frost thank you for your help. wish you the best things
Hello wonderful people,
asking for advice here. I'm doing a semantic search using roberta embeddings, but it's trained on a max length of 512.
But the text data I'm working with are double that. Should I truncate the text data?
My end goal is get the embeddings to compare.
Or should I not go with fancy approach and go with something simpler like tfidf due to the text length?
I think I was able to solve the previous problem just taking a naive approach.
Now I have a new question. Does the batch size matter when we're encoding for the embeddings?
ie getting embeddings at batch size 32 vs 256. Using the embeddings only for comparison.
I'm aware batch size makes a difference when doing downstream tasks, but what about encoding the actual embedding?
self.df["64gb"] = np.where("64" in self.df["title"], True, False)
returns False but it should work
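A likely explanation, sketched on a made-up frame (the titles below are invented; `df` stands in for `self.df`): `"64" in df["title"]` is a single membership test against the Series index labels, not against the text in each row, so it gives one `False` for the whole column. An element-wise substring check with `.str.contains` is probably what was intended.

```python
import pandas as pd

# Hypothetical stand-in for self.df
df = pd.DataFrame({"title": ["iPhone 11 64GB", "iPhone 11 128GB", "Pixel 4 64 gb"]})

# `in` on a Series tests the index labels (0, 1, 2), not the strings.
in_index = "64" in df["title"]

# Element-wise substring test; the result is already boolean,
# so wrapping it in np.where(..., True, False) is not needed.
df["64gb"] = df["title"].str.contains("64", regex=False)
```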
I don't understand why i is a str and not an integer. Also, how could I iterate over this list?:
`lst = [('someting1'), ('something2')]
for i in lst:
    first_lst = lst[i].split('|')
`
@limpid raft i is not always an integer. A for loop gives you the elements of the list, not their positions, so in this case i is 'someting1' and then 'something2'
lst = ['someting1', 'something2']
first_lst = lst[0]
@lapis sequoia Does it take then the type lst? and what if lst is a list of integers and strings, what does i become in that case? And is it possible to not iterate over this list manually?
@limpid raft lst[0] will always be the first element, it doesn't matter whether it is an int or a str
to get all of them there are two options: you can use a while loop or a for loop
lst = ['someting1', 'something2']
for lsts in lst:
    print(lsts)
lst = ['someting1', 'something2']
i = 0
while i < len(lst):
    print(lst[i])
    i += 1
ahh, so lsts[0] would then be something1. But, does the 'in' statement create the variable lsts such that it has the same type as lst?
From my understanding it's purpose is to check if a value is present in a sequence (range, list,etc). Is the 'for' loop forcing the type lst onto lsts?
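A short sketch that may help with the two meanings of `in` (the sample strings are made up). Inside a `for` statement, `in` is part of the loop syntax and binds each element to the loop variable, so the variable has the type of the element, not of the list. Outside a `for` statement, `in` is the membership test that returns a bool.

```python
lst = ["something1|a", "something2|b"]

# The loop variable is each *element*, so `item` is a str here.
parts = [item.split("|") for item in lst]

# When the position is needed too, enumerate yields (index, element) pairs.
indexed = [(i, item) for i, item in enumerate(lst)]

# The loop variable takes the type of each element, not of the list.
mixed_types = [type(item).__name__ for item in [1, "two", 3.0]]

# Outside a for statement, `in` is a membership test that returns a bool.
has_item = "something1|a" in lst
```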
Hi, does anyone here worked on graph neural networks?
I am looking for efficient implementations of SOTAs in graph representation learning. I need to deploy a model that works on a huge number of small, relatively sparse graphs (<100k nodes). Wondering which package would be best etc.
Can someone please help me understand the X and y inputs to scikit-learn's linear regression? I have a list of X points and a list of corresponding Y points.
X is the features, y is the labels
thats the most basic way of understanding it
or in the case of linear regression you can think of it like regressing on a graph with x and y variables
@austere swift thanks, but when I try it says the sizes of the lists are wrong even though they're both 1x5000
whats the exact error message?
so sklearn doesnt like lists that look like [a, b, c, d], it wants lists like [[a], [b], [c], [d]]
Ah I see
so thats why its asking you to do the array.reshape(-1, 1) thing
so you can just reshape it like that
```
x = np.reshape(mapping_x, (-1,1))
y = np.reshape(mapping_y, (-1,1))
reg = LinearRegression().fit(x, y)
```
Same error with this
try only reshaping the x variable, not y
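The exact error message was never posted, so this is a sketch of the usual shape convention rather than a diagnosis: scikit-learn expects `X` as a 2-D array of shape `(n_samples, n_features)` and `y` as a 1-D array with one target per sample. The numbers below are made up and stand in for `mapping_x` and `mapping_y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up stand-ins for mapping_x / mapping_y: points on y = 2x + 1.
mapping_x = [0.0, 1.0, 2.0, 3.0, 4.0]
mapping_y = [1.0, 3.0, 5.0, 7.0, 9.0]

X = np.reshape(mapping_x, (-1, 1))  # shape (5, 1): 5 samples, 1 feature
y = np.asarray(mapping_y)           # shape (5,): one target per sample

reg = LinearRegression().fit(X, y)
slope, intercept = reg.coef_[0], reg.intercept_
```

If the original lists are nested (for example a list holding one list of 5000 values), printing `np.shape(mapping_x)` before reshaping would show it.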
is web scraping data science
if web scraping isn't data science can someone tell me where to ask a beautifulsoup question
Whats the best way to do a column level compare between two dataframes in pandas?
@regal belfry what do you mean by column level compare
need help over at #help-kiwi
@velvet thorn if df1.column == df2.column then show all matching rows
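One way this could be done, assuming "matching" means the value in a shared column appears in both frames regardless of row position (the frames and the `id` column below are made up): filter with `isin`, or use an inner merge when the columns from both frames are wanted.

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# Rows of df1 whose id also appears somewhere in df2.
matching = df1[df1["id"].isin(df2["id"])]

# Inner merge keeps the matching rows and brings in df2's columns too.
merged = df1.merge(df2, on="id", how="inner")
```

If the comparison should instead be row by row (row 0 against row 0), `df1["id"] == df2["id"]` works only when both frames share the same index.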
why do I get this error
UndefinedMetricWarning: R^2 score is not well-defined with less than two samples.
warnings.warn(msg, UndefinedMetricWarning)
im trying to predict data from a csv file
ping
r is undefined i think
no
r is not even there in my code
sry im newbie
idk
getting bar graph behind catplot
Hey guys, I am getting a barplot behind the catplot, I only want the catplot. How do I remove the bar graphs and the lines from the chart?
@gray sedge You can ask it here too and try in Web Dev channel.
@sweet ember give the code that you are using to generate the plot.
or Read the documentation here.
https://seaborn.pydata.org/generated/seaborn.catplot.html
You can try different parameters in kind to fix it.
Is ROOT well known/respected/w.e in the data science community? I'm doing a physics masters using it and might be interested in going into data science after
hey, I've made a scraper that will monitor ads posted to Craigslist for certain categories and compare them against the average price in order to identify bargains
are there any other rules you guys would suggest? I was thinking: if an item is 30% cheaper than the average, notify me
but maybe average is not the best metric to use?
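A small sketch of why the median might be a safer reference than the mean for this kind of rule: a single overpriced listing pulls the mean up and makes ordinary prices look like bargains. The prices and the helper `is_bargain` are made up for illustration; the 30% threshold is the one mentioned above.

```python
from statistics import mean, median

# Made-up prices for one category; the 5000 listing is an outlier.
prices = [100, 110, 120, 130, 5000]


def is_bargain(price, reference, discount=0.30):
    """True when price is at least `discount` below the reference price."""
    return price <= reference * (1 - discount)


avg = mean(prices)    # dragged up by the outlier
mid = median(prices)  # unaffected by the outlier

flag_by_mean = is_bargain(115, avg)
flag_by_median = is_bargain(115, mid)
```

With these numbers a listing at 115 is flagged against the mean but not against the median, which is closer to what a buyer would expect.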
@pure pond BTW What is ROOT?
can someone help?
ERROR: Could not install packages due to an EnvironmentError: [Errno 2] No such file or directory: 'C:\\Users\\HP\\AppData\\Local\\Packages\\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\\LocalCache\\local-packages\\Python37\\site-packages\\sklearn\\datasets\\tests\\data\\openml\\292\\api-v1-json-data-list-data_name-australian-limit-2-data_version-1-status-deactivated.json.gz'
i get this error when trying to install sklearn
i upgraded pip and it got fixed
ugh now i get this
ImportError: cannot import name '__check_build' from 'sklearn' (C:\Users\HP\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\sklearn\__init__.py)
this is the code:
# make predictions
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
# Make predictions on validation dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
me doing a udemy course on data science
can anyone suggest some resources for learning NLTK sentiment analysis
if i plot error like this, it is overfit right?
so i just need to stop iteration early to make it not overfit?
x-axis = iteration
y-axis = rmse
wow ya looks like after 2 iterations it's there
hey guys, i'm having some trouble printing zero values from my dataframe/panda code
this is my code and output, i just want it to ALSO print the data for the ones that have a zero value, any ideas?
I have an imbalanced dataset and I've done undersampling with a decision tree classifier, which gives me a score of f1=1. It looks too good to be true, then I looked at the confusion matrix and it shows that FN and FP are both 0...
is it a good thing? I'm very new at this. I've also tried over- and undersampling with SMOTE combined with an XGBoost classifier and the best f1 score is 0.46
so i wanted to ask the math behind test_size
```
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X,y, test_size=)
```
how can we decide test_size
just want to know the math behind it
@quiet whale you probably have data leakage
@lapis sequoia there's no hard and fast rule but generally 20% or so
hm
@dusky carbon what do you mean? what are you trying to do?
what is data science?
Please help, I wanna use plt.imshow in Flask for research data visualisations. Creating/displaying .jpg or .png isn't helpful as they cannot be updated on the go. Please suggest a way.
plt.imshow because I wanna use colormap and clim.
@lilac minnow what do you mean "use it in flask"?
I wanna create a server for visualising outputs from TF Models, as numpy arrays. plt.imshow works well with Jupyter. But I'm not able to get them to work in flask.
@lilac minnow they're two different things...
if you want that kind of behaviour, you need JS
@velvet thorn thank you. Can you provide me with any example/template for me to get started?
you're basically saying you want an interactive interface
alternatively, you can consider Dash
nope, I can't
Google should help
@velvet thorn ah I did! Thankyou, need to be more careful next time :/
hey guys
I want to do a neural network model for sentiment analysis on tweets
but I don't have much spare time, so I was considering using MTurk or Fiverr to have people manually label a training set
thoughts?
@mossy dragon what topic are the tweets for? Also, were you planning to have two people label everything independently and compare the results?
i dont have a specific topic yet
im willing to be flexible on that tbh
I wasn't planning on getting different people but that sounds like a good idea
one of my coworkers does sentiment analysis. It tends to be difficult because of sarcasm and especially nuanced texts.
no
I would see if someone has already made a set of tweets and associated sentiment data
hmm
i actually had a similar idea to that
i know there is an IMDB dataset containing movie reviews labeled as positive/negative
that sounds pretty good
so i was considering maybe using that to train a model and then classifying tweets about a new movie trailer that was released or a movie that recently came out
i haven't really done any neural net models though
do you know a sample size that i should aim for?
unfortunately I don't
hmm
I work in an NLP lab and I'm the worst one
probably because I spend too much time on discord
what class?
NLP
nice
but still we're all either busy with other classes or working full time, but I'm curious if we could get a decent sized training set if we spent ~1 hour manually labeling data
our annotators are always complaining about how long it takes
so my guess is no
but if you're just assigning labels to entire documents (rather than individual tokens) I guess that's faster
I don't know the exact specification of your assignment but I would be very surprised if your professor wanted you to create your own data set.
oh lol
we're not required to
but i personally would like to
we dont even have to do sentiment analysis, we could do a different method to analyze the text
does it have to involve ML?
or can you do some other kind of analysis
like maybe something that could be interesting is analysis of document structure?
hm it says a single data set so I guess my idea is out
but yeah it seems like you're intended to find your own dataset
yea
as opposed to creating one
however, I'm like 99% sure there are existing tweet datasets out there
for sentiment analysis
it's a very common task
so you could use that as a baseline and find something interesting to add your own spin on things
for example, comparing across geographical regions?
I'd like to modify this and put this on my github for future job searches
so i figured it would be more impressive to extract that data myself
but i guess i dont have to do that now
@mossy dragon it would be!
but yeah, if it's a group project
probably not.
thanks for the help 
yw
hello
```python
print("hello")
try:
    model = load_model(r"E://demo3//albania_100_model.p")
    #model = load_model(r"{path}//{country}_100_model.p")
    print("model loaded...")
except OSError:
    logger.debug({
        "Status" : "failed",
        "message" : "model not available"})
    return{
        "Status" : "failed",
        "message" : "model not available"}
```
in the output i am getting { "Status" : "failed", "message" : "model not available"}
i am not able to load the model
Hello everyone,
I am currently looking for a dataset on cholera, do any of you know where to download a dataset about cholera? or do you guys know where I can find the source dataset like this one? https://github.com/soujanyajoshi/Cholera/blob/master/data.xlsx .Because I have searched for the dataset in Kaggle, but the features on the dataset are different.
@grave frost https://root.cern.ch/
is this correct place to talk about stock market analysis?
Just use an rng stock picker you'll probably outperform other attempts xd
a monkey outperformed most
how long does it take to train a single-thread NLTK classifier model with 8000 training points and 2000 test points?
I'm running on a i7-10750H @4.5ghz
or is there an easy way to run it with CUDA?
I need advice. What machine learning course should I take?
@earnest forge i would highly recommend the complete zero to mastery machine learning course by Andrei Neagoie on Udemy . Its very affordable for its quality and content in my opinion
I got two dictionaries that contain several pandas dataframes on it. The columns and the rows are all the same names however i would like to iterate through the dataframes from each dictionary and run df1.compare(df2) one at the time.
is there a way to write a function that will make this quicker instead of writing df1[key1].compare(df2[key1]) for each key in these dictionaries
Hello ! How can i use raw sql queries in flask_sqlalchemy ?
Hi there!
I'm really new to Python but I want to invite people to take interest in a ML/NLP project. I want us to figure out how to digitize The Turing Digital Archive (http://www.turingarchive.org/) into easy-to-read text.
I'm not sure what the best tool is for the project, so I'm posting this to make interested friends who want to help.
To begin, I was looking at EasyOCR (https://github.com/JaidedAI/EasyOCR) but I don't know if it's the right tool for the job.
We'll be working in conda with Python for this; I personally will be using Windows 10; apart from the experience itself, I think creating one document containing all of Alan Turing's writings will be its own reward.
@foggy tundra One option is to ignore the ORM and interact with the database directly with something like pymysql https://pypi.org/project/PyMySQL/
@marsh tartan Well, it will depend on the configuration of the model and not just on the data. A complex model with a higher number of parameters will take more time than a simple one.
And to train with a GPU for free, you can try Google Colab, which is free for 12 hours in a single run.
Anyway, if you are just looking for a simple text classifier then it should not take more than a few minutes, unless your model architecture is very complex.
@real geode You can convert each dataframe into a numpy array and compare.
(A==B).all()
test if all values of array (A==B) are True.
Note: maybe you also want to test A and B shape, such as A.shape == B.shape
Special cases and alternatives:
It should be noted that:
this solution can have a strange behaviour in a particular case:
if either A or B is empty and the other one contains a single element, then it return True.
For some reason, the comparison A==B returns an empty array, for which the all operator returns True.
Another risk is if A and B don't have the same shape and aren't broadcast-able, then this approach will raise an error.
Source: https://stackoverflow.com/questions/10580676/comparing-two-numpy-arrays-for-equality-element-wise
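The points quoted above can be checked with a few small arrays (all values made up). `np.array_equal` is included as an alternative because it compares shapes first and returns `False` on a mismatch instead of raising or broadcasting.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[1, 2], [3, 4]])
C = np.array([[1, 2], [3, 5]])

same_ab = bool((A == B).all())
same_ac = bool((A == C).all())

# np.array_equal checks the shapes first, so a mismatch gives False.
safe = np.array_equal(A, np.array([1, 2, 3]))

# Edge case from the note above: empty vs single element broadcasts
# to an empty comparison result, and all() of an empty array is True.
edge = bool((np.array([]) == np.array([1])).all())
```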
Thanks for the tip but i managed to find a workaround while still keeping dataframes
#Call this function to create crosstab tables
def crosstab_compare(df1cross, df2cross, df1original):
    """
    df1cross = dictionary of pandas dataframe where crosstabs have been performed, the self.
    df2cross = specifies another dictionary of pandas dataframe where crosstab has been performed, the other
    df1original = pandas dataframe non crosstabulated that will be used to extract the list of labels
    The end result is a dictionary
    The tables shown will appear only if results are different from each other
    The function will attempt to compare all dataframes with equal shape. If one dataframe doesnt match with the other, the function will
    continue to work but skip the mismatching dataframe
    """
    question_list = list(df1original.columns)[1:]
    print("Self: Refers to the table that was called first in the arguments")
    comparedf = {}
    for k in question_list:
        try:
            comparedf['{}'.format(k)] = df1cross[k].compare(df2cross[k], align_axis='rows')
        except ValueError:
            continue
    return comparedf
i had the problem where some DFs didn't have the same shape which is why i added the try block
Is there a lighter-weight alternative to jupyter notebooks?
@lapis sequoia lighter in what sense ?
You can just use VS Code editor as a notebook instead of installing anaconda and everything for jupyter if you want.
Also you can use cloud notebook providers like Google Colab which are hosted on VMs. So your system will not have any load and you get decent Machines.
yea I use Visual Studio Code jupyter notebooks for work and is pretty light overall
oof
Hi guys, is anyone familiar with a Python script that aligns DNA sequences? I have an assignment and no idea where to start
they want you to code a BLAST from scratch?
That's a hell of a school project lol
Can you use the NCBI API (if it has to use python)?
BLAST Developer Information
just use BLAST directly lol i dont know why they would want you to use python just to get there. No need to reinvent the wheel
Yeah, I agree. Was just suggesting in case it was a project that required Python scripts. Depends on if it is a bio or computer class. No way in hell a biologist would write their own BLAST scripts.
Here is the alignment tool: https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch&BLAST_SPEC=blast2seq&LINK_LOC=align2seq
is there a reason my tensor which is [173, 173] is getting resized to [231, 231] when plotted using plt.imshow() ?
Okay sorry it was a matplotlib issue.
Hello, I was directed here. I have a live graph being plotted from incoming ECG data, the two line plots (heart rate and moving average (called rolling mean in the code)) are updating successfully and moving across the screen, while the scatter plot data is not. The initial set of scatter points gets plotted, but remains static, unlike the line plots. I have to set up line. and scatter. plots a bit differently, so that is probably where the problem lies.
using funcAnimation
I basically need to do this: Convert align_seqs.py to a Python program that takes the DNA sequences as an input from a single external file and saves the best alignment along with its corresponding score in a single text file (your choice of format and file type) to an appropriate location. No external input should be required; that is, you should still only need to use python align_seq.py to run it. For example, the input file can be a single .csv file with the two example sequences given at the top of the original script.
Gn
hello I'm trying to access google maps using API Key
but i'm getting this
```
"error_message" : "You must enable Billing on the Google Cloud Project at https://console.cloud.google.com/project/_/billing/enable Learn more a
"results" : [],
"status" : "REQUEST_DENIED"
```
any welp for me?
Thank You
Hello! I have a problem with pandas and the read_excel feature.
I can't read one of the columns in my Excel sheet. The console returns this error:
File "path\to\pandas\core\indexing.py", line 1177, in _validate_read_indexer
key=key, axis=self.obj._get_axis_name(axis)
KeyError: "None of [Index(['S2007-02', 'S2007-02', 'S2007-02', 'S2007-02', 'S2007-02', 'S2007-02',\n 'S2007-02', 'S2007-02', 'S2007-02', 'S2007-02',\n ...\n '1 - New', '1 - New', '3 - Approved', '3 - Approved', '1 - New',\n '3 - Approved', '3 - Approved', '3 - Approved', '3 - Approved',\n '1 - New'],\n dtype='object', length=1043)] are in the [columns]"
But I don't understand what those "\n" are. Moreover, they aren't in the string values.
I checked the column format but I didn't see any line breaks or spaces in the data. Does anyone have a clue how to fix this?
Thanks !
@tight sparrow You need to enable billing. Go into Google Cloud console and inside Billing you should be able to see if there is any active billing account.
Also check Account Management and enable the billing if you have closed it in the past. You will need a debit/credit card to do that.
hi
import sklearn
from sklearn import datasets
from sklearn import svm
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
cancer = datasets.load_breast_cancer()
#print(cancer.feature_names)
#print(cancer.target_names)
x = cancer.data
y = cancer.target
x_train,x_test,y_train,y_test = sklearn.model_selection.train_test_split(x,x,test_size=0.2)
print(x_train,y_train)
classes = ['malignant' 'benign']
clf = svm.SVC()
clf.fit(x_train,y_train)
y_pred = clf.predict(x_test)
acc = metrics.accuracy_score(y_test,y_pred)
print(acc)
"got an array of shape {} instead.".format(shape))
ValueError: y should be a 1d array, got an array of shape (455, 30) instead. error
@uneven wind \n is the newline character, so it is possible that it is causing the problem. Also, are you passing any other parameters while reading the file? First try to read it without any index and columns, then choose the column and index properly.
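Another possibility worth ruling out, offered as a guess since the code was not posted: the KeyError lists cell values ('S2007-02', '1 - New', ...) where column names would be expected, which is what pandas reports when a column's values are passed as the key. The `\n` in the message would then just be line breaks in how the long Index is printed, not characters in the data. A made-up frame that reproduces the same kind of error ("Ref" and "Status" are invented column names):

```python
import pandas as pd

# Made-up frame; "Ref" and "Status" stand in for the real sheet's columns.
df = pd.DataFrame({
    "Ref": ["S2007-02", "S2007-02"],
    "Status": ["1 - New", "3 - Approved"],
})

status = df["Status"]  # selecting by column *name* works

try:
    # Passing the column's values as the key makes pandas look for
    # columns named "1 - New" and "3 - Approved", which do not exist.
    df[df["Status"]]
    raised = False
except KeyError:
    raised = True
```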
@lapis sequoia x_train,x_test,y_train,y_test = sklearn.model_selection.train_test_split(x,y,test_size=0.2)
You made one error here. The input for split should be x and y but you have only passed x and x.
OHHH
@lapis sequoia thankssss
@lapis sequoia do I need to pay some money to gain access?
@tight sparrow You get some free quota, after that you need to pay. The free quota should be more than enough if it is for a personal project.
Also you get free $300 credits when you register. So you have to use them if you want to get access.
Hi, i'm trying to learn about machine learning in these few weeks, is there any youtube or website that can help with oversample,logistic regression,etc? thanks
@weary heart andrew ng?
Thanks i'll look it up
hello
i have code which creates an image from a base64 string, now i want to resize this image to a desired size in pixels. how can i do this? can anyone help me with this?
it is pink line drawn. what does it stand for? what is its meaning?
@mild topaz I'm not sure what tool you are using for creating string to image but when you save the file you can change its dpi and figure size.
If you are using matplotlib then you can resize with the help of matplotlib.pyplot.figure and choose the appropriate parameter values for dpi and figsize.
https://paste.pythondiscord.com/ibudajodix.py my code @lapis sequoia plz check
Am i allowed to ask a question in this channel?
@earnest forge Correlation?
the line between two variables on a scatter plot is supposed to represent the relation between them
isn't this some high school level math?
Can someone help me with pandas regression
what exactly?
@mild topaz I'm not sure what is the problem in the code.
You are resizing the image in the code so it should take care of your needs.
@earnest forge that is the best linear fit for your data. If you have to approximate your data with some function then that line gives the best result. And it also tells about how x and y are correlated.
@earnest forge So I have a bunch of dummy variables right
I groupedby/summed by a certain column
But the dummy variables got messed up and now show numbers that aren't either 0 or 1.
How can I either fix that or make it where all the dummy variable columns greater than 1 get turned into a 1
guys what do you like using
seaborn
or matplotlib
for graphs
which one is actually worth my time bc i used matplotlib in my last project
seaborn has way prettier graphs imo
I like seaborn cus its a lot prettier
seaborn seems easier to use for me
yeah that too
i've been doing a udemy course on data science & machine learning
that's why i've been so quiet
Jose Portilla is a beast
matplotlib is more basic and allows you to do alot custom things. Seaborn is built on top of Matplotlib.
ohh
yeah seaborn is just a wrapper for matplotlib that makes it easier to use and has a lot better looking default themes
yeah i think i'll be using seaborn more often now
Any help for me
what did you ask @lapis sequoia
If the graphs you want are available in seaborn or plotly then you can just use them. The idea of matplotlib is to let any Python programmer build complex graphs.
i
@hollow sentinel I have a bunch of dummy variables
I groupedby/summed by a certain column
But the dummy variables got messed up and now show numbers that aren't either 0 or 1.
How can I either fix that or make it where all the dummy variable columns greater than 1 get turned into a 1
i'm traumatized by plotly
choropleth
idk i remember with pandas you can conditionally select within the dataframe
sorry i'm new to this lmao
same lol
@lapis sequoia I'm not able to understand your problem. But yeah, if you just want to cap a column at a max value of 1 then it is possible. You can use map or applymap to fix it
the only thing is that I'm worried I'm not actually learning anything
i don't learn from basic udemy videos I learn from projects
built different
@lapis sequoia I created dummy variables for 4 columns
Then grouped the rows by a certain column
Doing so aggregated all the dummy variables as well, instead of the only column I wanted (as far as I know, there is no way around this)
But the dummy variables must be either 0 or 1, some of them have numbers such as 200, 300, 450 etc. So I need all the ones with those numbers to be a 1 so I can perform regression correctly
you're doing linear regression?
yeah
idek how to do that in python lol
lmao do you want me to email the udemy course notes
i wanted to do a linear regression on a dataset
and it's good to use seaborn for the graph
afaik you can't
never heard of statsmodels
didnt know that
is that another module in python?
@tall aurora Why do you want to know?
@lapis sequoia could you provide a bit of your code?
@hollow sentinel I combine both seaborn and matplotlib. seaborn ain't capable of everything matplotlib can provide you
@mild topaz If you don't mind me asking, how did you get an Image as base64 string?
@earnest forge yeah when I look at Kaggle they use both seaborn and matplotlib
Kaggle is really good
The only thing I like about Kaggle notebooks is that their kernels are reproducible. Apart from that, Kaggle is just a time-waste
i think i understand linear regression w two variables but i don't understand multiple linear regression
linear regression is just a relationship between two variables right
The groupby code?
@lapis sequoia yes
@hollow sentinel no
Why are you doing Linear Regression if YOU don't fully understand it?
i thought i would pick it up as I go
either you did something wrong when grouping or values initially were 'bad'
df = df.groupby(by='Tool').sum()
@hollow sentinel Linear regression is just a simple method to find the relationship between data points using (as the name implies) a linear function as the basis of the relationship. If the data does not exhibit a linear relation, then it is a useless method and you are better off using other approaches like polynomial regression, etc.
"Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables"
oh
allows you to study the relationship between two variables
Yeah, that def seems a bit off because that implies that the data points are like coordinates (with x and y value) and you find the linear relationship between those 2 variables, but that's not actually the most fundamental one
@lapis sequoia
1st: you'd better not overwrite the initial dataframe; try saving the result to a variable named something like df_grouped
2nd: make sure the data in Tool is in a convenient data type
thank you @grave frost
is it int or object?
um they were probably trying to simplify it so the layman (me) can understand it
you can find linear relationship in 3D space too
what
@earnest forge I was using new dataframes initially yeah but then I'd have a bunch of cells which was making me confused
What do you mean convenient data type?
No clue what that is tbh.
The tool column is a product serial # if that helps
convenient for the df.groupby method to work with values
in Tool column what type is it?
@hollow sentinel Imagine the line connecting you to your ceiling fan - that is a line in 3D space. I doubt much data exhibit linear relationship in 3 dimensions, but that doesn't mean it's impossible to do
you can check it out using df.dtypes
@hollow sentinel Two distinct but related variables is how I look at it
@earnest forge Let me try that thanks
Why is my SMTP code not working ( the emails are fake but I use real ones for the errors shown below):
File "scratch.py", line 10, in <module>
server.login(sender_email, password)
File "C:\Users\dhruv_\AppData\Local\Programs\Python\Python38-32\lib\smtplib.py", line 734, in login
raise last_exception
File "C:\Users\dhruv_\AppData\Local\Programs\Python\Python38-32\lib\smtplib.py", line 723, in login
(code, resp) = self.auth(
File "C:\Users\dhruv_\AppData\Local\Programs\Python\Python38-32\lib\smtplib.py", line 646, in auth
raise SMTPAuthenticationError(code, resp)
smtplib.SMTPAuthenticationError: (534, b'5.7.9 Application-specific password required. Learn more at\n5.7.9 https://support.google.com/mail/?p=InvalidSecondFactor x23sm2799418pfc.47 - gsmtp')
Tip: App Passwords aren't recommended and are unnecessary in most cases. To help keep your account secure, use "Sign in with Google" to connect apps to your Google Account.
An App Password is
@earnest forge Not working
Tool column is the only one that isn't showing up
Is that because I grouped it already?
My dummy variables all say float64
@main pelican i like your profile pic of Sokka
yes. it may be
@hollow sentinel lol
reload the data and group it one more time
I'll rename the groupby df
@lapis sequoia good
Says it is an object
and my dummy variables are now uint8
My dependent variable still says float64
Is it just me or does anybody else have problems ssh'ing into a google VM instance?
@lapis sequoia can you show df.head() of the data?
on the same code?
told you bro
on the same dataframe, yes
yeah but it looks gross
yes
uhh
Sure give me a second
Need to block some info out
Everything beginning with LH is a dummy variable
I gave them that prefix cause I was trying to fix the aggregation problem
@earnest forge
Hey
Does anyone know how to plot a pandas window when you run a file.py in a linux terminal?
Oh
I got what's wrong
You count all values in tool and it exceeds space in the memory
What do you mean?
The group is by the tool column but the sum is for the quantity
if that makes sense
Oh
You need to convert the values in the other columns to an int data type. pandas is treating them as object dtype, that's the reason you get these unexpected results
So all the dummy variables?
yes
@earnest forge How can I change the dtype
The dummy variables are showing as float64
check df.dtypes one more time. look at the columns which should be int (if the dtype is float, then leave it as is, no need to change)
after you decide which columns' data type values to change use the following:
df = df.astype({'column_name':'int32'})
I have 100+ dummy variable columns
is there a way to not set them manually one by one lol
Why does the code have to sum the dummy variables i
oh, then make it all int, except particular columns:
cols = df.columns
df[cols[your_slice]] = df[cols[your_slice]].apply(pd.to_numeric, errors='coerce')
in df[cols[your_slice]] you must specify all columns except those you do not want to convert to numeric type.
For instance, if you want to keep first and fourth columns as they are, you may apply the following slice: df[cols[[1:4]] = that code above
df[cols[[4::]] = that code above
sum method can't summarize values that are not represented as numeric types. so it thinks of it as summarizing string. in the end, it gives you weirdly computed result
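The slice written as `cols[[1:4]]` in the messages above is not valid Python, because a slice cannot sit inside a list literal. A minimal sketch of the same idea, assuming a small made-up frame where the first and fourth columns should be left alone (all column names here are hypothetical):

```python
import pandas as pd

# Hypothetical frame: columns 0 and 3 stay as they are,
# the other columns hold numbers that were read in as text.
df = pd.DataFrame({
    "Tool": ["A1", "B2"],
    "LH_x": ["1", "0"],
    "LH_y": ["0", "1"],
    "Note": ["ok", "late"],
    "LH_z": ["1", "1"],
})

cols = df.columns
# Every column except the first (index 0) and the fourth (index 3).
to_convert = list(cols[1:3]) + list(cols[4:])

# errors="coerce" turns anything unparseable into NaN instead of raising.
df[to_convert] = df[to_convert].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
```

Building the column list first and printing `df.dtypes` afterwards makes it easy to confirm that only the intended columns changed type.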
Do most people going into data science have a masters or can you get in if you have a bachelors (physics)? Been studying machine learning lately so I figured I might apply for some jobs.
it must fix it
Okay let me try it
@earnest forge Wait, I'm confused sorry. Should the dummy variables be numeric
tool (serial number I want to group by), quantity (dependent variable, what I want to sum), dummy variables (independent variables)
are my columns
heya! I'm trying to extend pyannote to build a fun NLP app for podcasters. anyone familiar with that lib?
trying to make sure it can do the thing i think it can do
idea being: running the same set of data through a bunch of different ML algos, and having all the results for the same data tagged. once it gets manually okayed by the EU, the data is marked for each ML set to use for more training data.
so, a "master" pyannote annotation with: segments to cut up the source audio, speaker, transcription, sentiment, etc. then once they're all corrected, they can then be cut up by the segment defs to feed the various ML algos.
@lapis sequoia when you group by tool and aggregate with sum, your grouped dataframe holds the sum of the values in every other column for each Tool value.
Is there a function to turn a list of labels [cat, dog, dog, rat, rat, rat, cat] into a list of class labels, such as [0, 1, 1, 2, 2, 2, 0]?
I can't seem to find it on google so apologies if this is trivial
in scikit-learn*
hmm, this can be coded manually, but I think scikit-learn has one
Yeah I can use a dict and do it manually but I want to learn the built ins to scikit learn
Thanks
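For the label question above, scikit-learn's `LabelEncoder` does this mapping. A short sketch using the same example list; note that it assigns integers in sorted order of the class names, which happens to match the numbering asked for:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["cat", "dog", "dog", "rat", "rat", "rat", "cat"]

encoder = LabelEncoder()
# Classes are sorted before numbering, so cat=0, dog=1, rat=2.
encoded = encoder.fit_transform(labels)

print(encoded.tolist())                             # [0, 1, 1, 2, 2, 2, 0]
print(encoder.classes_.tolist())                    # ['cat', 'dog', 'rat']
print(encoder.inverse_transform([2, 0]).tolist())   # ['rat', 'cat']
```

scikit-learn classifiers generally accept string class labels for `y` directly and handle the encoding internally, so this step is mostly needed when another library expects integer targets.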
Follow up question: Why is my naive bayes model working in scikit learn with just plain text labels for the classes?
"dog", "cat", etc. Don't they have to be in an int/vector representation?
You might be using a high-level enough feature that it handles all the encoding and prediction for you.
Oh weird, thanks
@earnest forge Ya so it would aggregate all of them regardless
So which columns am I changing to numeric
All of them?
@earnest forge Code you gave me isnt working bro
syntaxerrro
F
dummy variables are your qualitative variables turned into numbers. if you have 1 qualitative variable with multiple categories (for stock markets, Industry could be one), let's say industry can be either financial, tech, or industrials, you will have 3-1 = 2 dummy variables.
@glad mulch I know what a dummy variable is lmao I have that all set up, I was just asking about the data type for it inside pandas
@lapis sequoia what other data type would you use?
They're float run
@lapis sequoia generally some integer type is appropriate, but honestly it doesn't really matter
@velvet thorn Any insight as to why his code didn't work?
@lapis sequoia honestly I only skimmed the discussion
but if you still need help maybe you can summarise the problem?
@velvet thorn I have like 90+ dummy variables in my data that I created using pandas. I grouped my data using a product serial # to sum the quantity of hours. Doing this also aggregated the dummy variables , so they show numbers like 500, 294, 348, etc etc instead of just the 0 or 1 like they are supposed to
So I am trying to find a way to either fix this or to find a way to just make all the ones > 0 turn to 1
@lapis sequoia how can you identify the dummy variable columns?
What do you mean? They all have names
And I put a prefix to all of them
Cause I was trying to see if I can apply the >0 make it a 1 thing but couldnt figure it out
@lapis sequoia like what's the filter you can apply on them
okay, I think you said they all start with LH, right?
yeah that's the prefix I gave them
someone said I should give them a common prefix to be able to edit them all at once or something
dummy_cols = [col for col in df.columns if col.startswith('LH')]
df[dummy_cols] = df[dummy_cols].clip(0, 1)
should work
I tried something similar to that and it didn't work, let me try yours I probably had my code fucked up lol
@velvet thorn That worked. You're a lifesaver
Thank you so much
yw!
Doing it that way by the replacing doesn't mess up any regression results right?
I'd assume not but just making sure ofc
what do you mean?
Like it will still see it as a regular dummy variable
yeah
long story short, yes
I mean, not in a bad way
in the sense that each dummy variable now represents "for this group of results (since you said they're aggregated, right), is <condition> true for at least one of the source rows"
when originally it meant "how many source rows was <condition> true for"
you get what I mean?
that's the effect of the clipping, right
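An alternative to clipping after the fact is to tell `groupby` how to aggregate each column, so the hours are summed while the dummies take their maximum and never leave 0/1. A minimal sketch, assuming made-up data; the `Quantity` column name and the rows are hypothetical, only the `LH` prefix and the `Tool` column come from the discussion:

```python
import pandas as pd

# Hypothetical rows: one row per labor charge, with 0/1 dummy columns prefixed "LH".
df = pd.DataFrame({
    "Tool": ["T1", "T1", "T2", "T2", "T2"],
    "Quantity": [2.0, 3.5, 1.0, 4.0, 0.5],
    "LH_custA": [1, 1, 0, 0, 0],
    "LH_custB": [0, 0, 1, 1, 1],
})

dummy_cols = [col for col in df.columns if col.startswith("LH")]

# Sum the hours, but take the max of each dummy so it stays 0 or 1.
agg_spec = {"Quantity": "sum", **{col: "max" for col in dummy_cols}}
df_grouped = df.groupby("Tool").agg(agg_spec)

print(df_grouped)
```

The result carries the same meaning as the clipped version: a 1 says the condition held for at least one source row in the group.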
adj = adjusted?
yeah
since there are multiple independent variables gotta use adj.
the jarque-bera is 25541 lol
hmm
@lapis sequoia why does this matter?
why do you think so?
isnt that how the test works?
I mean, yes
but what are you running the test on
and why do you think the data must be normally distributed?
when I did the regression without fixing the dummy variable 0 or 1s I got an adj r square of 0.996 and jarque bera of like 1350
I don't think it must be, just seems high
presumably
@velvet thorn my dependent variable is labor hours. independent variables are product, product config, customer, and build type
trying to model our labor hours and DL costs
nothing wrong with non-normality though
to help the ops guys get a better target
oh also
going back to your source row thing
the reason I grouped them is because the data is set up in the way that each row is labor hours being charged to a certain assembly process
but I wanted the total hours for the corresponding product they all went to
unless I misunderstood you
my independent variables which are all non-numeric values
so the product, configuration, customer, and build type
certain products and customers for example drive the labor hours more
Non-numeric**
@velvet thorn Do you know if its possible to see which column is driving it more than others
Or are you not familiar with statsmodels
Hi everyone. Nice to meet you!
So, I'm doing a project at my college and I need data about social inequality. Is there any data about it?
hi, i'm new to machine learning, i'm curious .. how do you know if the data is overfitting or underfitting? is it through test and train result? and if so how do you find test and train result? f1 score ? or else? thanks
@weary heart yeah mainly its through the training and testing accuracy, if the training accuracy is high but the testing accuracy is low that's overfitting and if the training accuracy is low and the testing accuracy is low too its underfitting
Hey guys
is there a way to have [(1, 1, 1, 1, 1) (1, 0, 0, 0, 1) (1, 0, 0, 0, 1) (1, 0, 0, 0, 1) (1, 1, 1, 1, 1)] in one line?
here's my code
a = np.ones((5,1), dtype=[('a', 'i4'), ('b', 'i4'),('c', 'i4'),('d', 'i4'),('e', 'i4')])
print(a)
it's numpy array
in python btw
ah okay, so if i use SMOTE and i got this result
how do you know if it's overfitting , normal, or underfitting?
```
              precision    recall  f1-score   support

           0       0.97      0.70      0.81     66699
           1       0.28      0.83      0.42      9523

    accuracy                           0.71     76222
   macro avg       0.62      0.76      0.62     76222
weighted avg       0.88      0.71      0.76     76222
```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
realEstate = pd.read_csv("realEstate.csv")
realEstate.head(5)
sns.pairplot("realEstate")
I'm getting an error saying TypeError: 'data' must be pandas DataFrame object, not: <class 'str'>
dont put quotes around it
lol
Hey people, learning how to work with images today.
How do I convert images in numpy to binary string?
from a quick search I was able to get to
array.tobytes() #or array.tostring()
but
np.fromstring(array.tobytes())
doesn't give me the original numpy array back.
Any suggestions? or thoughts on what I'm doing wrong?
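On the round-trip question above: `tobytes()` stores only the raw values, not the shape or dtype, so both have to be supplied again when reading the bytes back. A small sketch with a made-up array standing in for the image:

```python
import numpy as np

# Stand-in for a small image: height 2, width 3, 3 channels.
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

raw = image.tobytes()  # raw values only; shape and dtype are not stored

# Rebuild with the same dtype, then restore the shape.
restored = np.frombuffer(raw, dtype=image.dtype).reshape(image.shape)

print(np.array_equal(image, restored))  # True
```

`np.fromstring` defaults to a float dtype and is deprecated for binary data, which is why the values looked wrong. The array returned by `np.frombuffer` is read-only when built from `bytes`; call `.copy()` if it needs to be modified.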
idk what's going on
have you guys ever seen that next to a jupyter notebook
does that mean it's loading?
It means it's running.
oh it's probably bc it's a gigantic dataset
should i pick a smaller one i don't wanna deal with this
I can't find good numbery datasets everything I find on Kaggle is words
When you say numbery datasets do you mean tabular (in table formats)?
no i mean under each column it's a number not a word
like if the columns were price, age, weight, gender, height
so like in a table format?
if that's what it's called yes
where each row is a record and column are features?
yes
also the reason why it was taking so long to load was bc the dataset shape was 511, 14
I think there are quite a few of those on Kaggle. Here's one famous one.
is there any way you can pick a smaller sized data set on kaggle
You can set a filter.
Or you can just grab a subsample of the dataset.
Where you only grab a certain number of rows.
Let's say you have 511 rows, you only grab 100 of those.
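Grabbing a subsample in pandas can be sketched like this, assuming a stand-in frame of the same 511 x 14 shape rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Stand-in for the real dataset: 511 rows, 14 columns.
df = pd.DataFrame(np.zeros((511, 14)))

first_100 = df.head(100)                        # the first 100 rows in file order
random_100 = df.sample(n=100, random_state=0)   # 100 random rows, reproducible

print(first_100.shape, random_100.shape)
# When reading from disk, pd.read_csv(path, nrows=100) loads only the first 100 rows.
```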
oh man I need luck to select a couple rows?
F
that or I might move to another dataset
anyways it's also 11 at night so like bedtime
gn guys
Noob here... for the life of me I can not find out how to get the output plot from sklearn's metrics.plot_confusion_matrix into a tkinter gui - can anyone point me to a reference?
I work for a discrete mathematics journal, but my area of research is in math finance and it only has a minor intersection with ML
got a link for that?
Anyone know any good code/video examples of rainbow deep q learning?
There seems to be quite little information on this for whatever reason, it seems like it should be pretty popular
is it not rainbow deep reinforcement learning?
Yeah I might've got the name wrong
that'd do it lol
It's using deep q networks tho right
I did use the correct name when searching for stuff
But most of the result I found are just people reading the paper
These two are the only examples I've seen
Hey guys I have a quick question. I'm using TFRecord to store my numpy array in bytes, then reading the tfrecord.
But after I parse the tfrecord and convert it back to numpy array, the values aren't the same.
ie.
It's an image.
-> Image in numpy array
-> Convert numpy to bytes
-> tfrecord features
-> read tfrecord
-> Turn into tf dataset
-> convert bytes back to numpy
When I took a look at the image, the values have negatives in them. Any advice?
I've also made sure to convert it back from bytes using the original dtype.
Can you provide your code?
@lapis sequoia I could be wrong I'm a bit rusty on this but the difference between deep q/deep reinforcement learning is that q learning doesn't use transition probability distribution (or the reward function) associated with the MDP
q learning is considered a model-free reinforcement learning algorithm
def convert_to_example(image: Dict) -> tf.train.Example:
    """Convert Image to TFRecord ready format"""
    feature = {
        'height': _int64_feature(32),
        'width': _int64_feature(32),
        'channels': _int64_feature(3),
        'label': _int64_feature(image['label']),
        'filename': _bytes_feature(image['filename']),
        'image_raw': _bytes_feature(image['data'].tobytes()),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

train_record_file = 'train.tfrecords'
with tf.io.TFRecordWriter(train_record_file) as writer:
    for image in tqdm(train_data):
        tf_example = convert_to_example(image)
        writer.write(tf_example.SerializeToString())

raw_train_dataset = tf.data.TFRecordDataset('train.tfrecords')
I broke it apart into two parts, one to write into TFRecord, one to read from it.
def parse_image_function(ex_proto):
    image_feature_desc = {
        'height': tf.io.FixedLenFeature([], tf.int64),
        'width': tf.io.FixedLenFeature([], tf.int64),
        'channels': tf.io.FixedLenFeature([], tf.int64),
        'label': tf.io.FixedLenFeature([], tf.int64),
        'filename': tf.io.FixedLenFeature([], tf.string),
        'image_raw': tf.io.FixedLenFeature([], tf.string),
    }
    example = tf.io.parse_single_example(ex_proto, image_feature_desc)
    img_raw = example['image_raw']
    return img_raw

for img in raw_train_dataset.map(parse_image_function).take(1):
    print(tf.io.decode_raw(img, np.int8))
Please let me know if you need more information.
I am trolling. @hasty grail
Thank you so much for your help.
I accidentally converted it into np.int8 instead of np.uint8.
If you don't mind me asking, how did you get an Image as base64 string?
@grave frost i am getting a base64 string which i have to decode to make an image from it
Problem solved I guess xD
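The negative values come from reading unsigned bytes back as signed ones. A small NumPy sketch of the effect, independent of TFRecord; the pixel values are made up:

```python
import numpy as np

pixels = np.array([0, 127, 128, 200, 255], dtype=np.uint8)
raw = pixels.tobytes()

# The same bytes interpreted with the wrong and the right dtype.
wrong = np.frombuffer(raw, dtype=np.int8)
right = np.frombuffer(raw, dtype=np.uint8)

print(wrong.tolist())  # [0, 127, -128, -56, -1]
print(right.tolist())  # [0, 127, 128, 200, 255]
```

Anything above 127 wraps around to a negative number under `int8`, so the decode step needs the same dtype the array had when `tobytes()` was called.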
i am not able to resize image to desired pixels i want
my code here https://paste.pythondiscord.com/aboyomupij.py
@hasty grail sorry to ping u , can u plz look into it ?
i am saving an image but not in desired pixels i want
@ripe crane hello
Have you done what I asked yesterday?
about what bro ?
I think that you should take some time to brush up on Python basics
sure bro, but right now i need to finish this bro , i want to submit this project
as soon as i resize my image then further code i know how to deal with it
i need a small help in resizing an image
i am decoding an base64 string which creates image from it
Do what I have asked first, it will save you a lot of time with the remaining part
but not in desired pixels
Especially the part about functions
i agree with u bro , but plz try to understand i need to finish this asap
at least can u look in this why image is not getting resized
which line are you at right now?
line 174 @hasty grail
```
im <PIL.Image.Image image mode=RGB size=200x99 at 0x24D0005C248>
done
wrong here1
```
from your understanding of Python, what would cause the statement at line 174 to be executed?
wait , i need to comment that part of code from 160to 174
bcoz i am again reopening file image file
correct @hasty grail ?
yeah you don't need that code
now see it has created an image but not in correct pixels i want @hasty grail
can you display the problem?
see @hasty grail
where is the code for saving the image?
which variable are you passing into the save function?
check carefully
is this
```python
with open("imageToSave.jpg", "wb") as test_img:
    test_img.write(image_data)
try:
    test_img = image.load_img("imageToSave.jpg", target_size=(200, 99))
except OSError:
    logger.debug({"Status": "failed",
                  "message": "provide valid base64 string"})
    return ({"Status": "failed",
             "message": "provide valid base64 string"})
```
@hasty grail
Can you identify what data are you writing to the file?
image_data i guess ? @hasty grail
no
is it the resized image?
@hasty grail
well there's your problem
fix it so that you're actually passing in the resized image
@hasty grail means bro ?
instead of image_data (the original image) you need to give the function the data that corresponds to the resized image
you need to write the resized image to the file
not the original image
you are currently writing image_data (the original image) to the file, of course the image size is unchanged
@hasty grail means how way u are saying here bro ?
it means what I said, I don't know how to simplify that
ok , can u show in code how way u are saying . so i can get clear idea what u are saying ? @hasty grail
with open("output.jpg", "wb") as f:
    # Don't do this
    f.write(incorrect_image)
    # Do this
    f.write(correct_image)
so in my case
```python
with open("output.jpg", "wb") as f:
    # Don't do this
    f.write(image_data)
    # Do this
    f.write(im)
```
is this correct ? @hasty grail
yes
what data type is im?
<class 'PIL.Image.Image'> @hasty grail
shouldn't you be using im.save instead of file.write then?
that's what I gathered from the documentation of PIL
on which line bro ?
@hasty grail
on the line where you write to the file
with open("imageToSave.jpg", "wb") as test_img: test_img.write(im) @hasty grail here u mean ?
yes
with open("imageToSave.jpg", "wb") as test_img: im.save(im) @hasty grail this way ?
read the documentation of PIL to see how to use Image.save
```python
with open("imageToSave.jpg", "wb") as test_img:
    test_img.write("im.jpg")
```
@hasty grail
```python
with open("imageToSave.jpg", "wb") as test_img:
    im.save("im.jpg")
```
@hasty grail
you didn't answer my question
why do you have to open an image file when you are saving to a different file?
look at the example they have given
do you need to use open at all?
then delete it
open ?
@hasty grail
yes
see i am using this code
with ("im.jpg", "wb") as test_img:
    im.save("im.jpg")
@hasty grail
image not creted
do you know what the with statement even does?
(if you don't please review your Python basics)
sure bro , but at this moment i am really messed up with different things also
@hasty grail can u plz help in this ?
as soon the resized image creates i know how to deal with it
no, you have to understand what it means, it's so basic
yes i can understand bro
but right now i am messed up with different things bro ? plz
just help me to solve this issue @hasty grail
lets finish this issue now only
are u thier bro ? @hasty grail
Sorry, I won't finish your code for you, you have to demonstrate your understanding first
i know bro, can u help in this issue @hasty grail ?
so i can go further and try to solve issues by myself @hasty grail
If you can answer me what the line with ("im.jpg", "wb") as test_img: is supposed to do, then sure
with makes code compact @hasty grail
what about the line as a whole though?
it takes img.jpg and in write mode @hasty grail
is it needed in this case?
mhm
@hasty grail hello
yes
if it's not needed, what do you do with that line?
(I mean you can use your own common sense)
so how i can make changes here then ,? should i remove it? @hasty grail
can u be more specific here bro plz @hasty grail
so i need to remove this line of code , correct? @hasty grail
you can judge that for yourself
I don't think I have to answer that question since it's really obvious
when the textbook says 'open terminal', does it mean cmd or python shell?
@hasty grail bro plz , i got confused here , lets finish this ?
Usually that can be inferred from the context
Just delete that line
You shouldn't have to ask for help for every single thing you do
see i have deleted taht line but image is not created here ? @hasty grail https://paste.pythondiscord.com/ewoyetojuh.py
ok then how it should be ? @hasty grail
undelete im.save
ok then ? @hasty grail
test the code?
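Put together, the decode, resize and save steps discussed above might look like the sketch below. It assumes Pillow is used for all three steps; the base64 string is generated in memory as a stand-in for the real request payload, and the output path is a temporary file chosen for the example:

```python
import base64
import io
import os
import tempfile

from PIL import Image

# Stand-in for the incoming base64 string: a 400x300 red JPEG built in memory.
buffer = io.BytesIO()
Image.new("RGB", (400, 300), "red").save(buffer, format="JPEG")
b64_string = base64.b64encode(buffer.getvalue()).decode("ascii")

# Decode the string and open the image straight from the bytes.
image_data = base64.b64decode(b64_string)
im = Image.open(io.BytesIO(image_data))

# resize returns a new image; PIL sizes are (width, height).
resized = im.resize((200, 99))

# Image.save writes the file itself, so no surrounding open() is needed.
out_path = os.path.join(tempfile.gettempdir(), "imageToSave.jpg")
resized.save(out_path)

print(Image.open(out_path).size)  # (200, 99)
```

The key point from the thread is that the object being saved must be the resized image, not the original `image_data` bytes.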
no wait see this
Traceback (most recent call last):
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1935, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 468, in wrapper
    resp = resource(*args, **kwargs)
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask\views.py", line 89, in view
    return self.dispatch_request(*args, **kwargs)
  File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 583, in dispatch_request
    resp = meth(*args, **kwargs)
  File "E:\demo3\findDocumentType1.py", line 126, in post
    self.resize_im(image_data)
  File "E:\demo3\findDocumentType1.py", line 202, in resize_im
    predictions = model.predict(samples_to_predict)
NameError: name 'model' is not defined
@hasty grail
The error literally tells you what is wrong, please tell me you can fix this by yourself
Hi
line 119 i have defined it @hasty grail
ok so i have changed to this
def resize_im(self, image_data):
    print("test_img1")
    model = load_model(pathlib.Path('E:/', 'demo3', 'united_kingdom_50.h5'))
@hasty grail
now that error is no more
Traceback (most recent call last):
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 468, in wrapper
resp = resource(*args, **kwargs)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\views.py", line 89, in view
return self.dispatch_request(*args, **kwargs)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 583, in dispatch_request
resp = meth(*args, **kwargs)
File "E:\demo3\findDocumentType1.py", line 126, in post
self.resize_im(image_data)
File "E:\demo3\findDocumentType1.py", line 202, in resize_im
predictions = model.predict(samples_to_predict)
File "C:\Users\Admin\anaconda3\lib\site-packages\keras\engine\training.py", line 1441, in predict
x, _, _ = self._standardize_user_data(x)
File "C:\Users\Admin\anaconda3\lib\site-packages\keras\engine\training.py", line 579, in _standardize_user_data
exception_prefix='input')
File "C:\Users\Admin\anaconda3\lib\site-packages\keras\engine\training_utils.py", line 145, in standardize_input_data
str(data_shape))
ValueError: Error when checking input: expected conv2d_1_input to have shape (99, 200, 1) but got array with shape (200, 99, 3)
@hasty grail
Have you checked the shape before input
which shape @twilit wind
the shape of your input '
like before input to the conv layer you need to flatten it or do some resizing
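The error above says the model wants arrays of shape (height 99, width 200, 1 channel) but received (200, 99, 3), so the height and width are swapped and the image is still in colour. In Keras, `target_size` for `load_img` is given as (height, width), so `target_size=(99, 200)` is the likely fix for the first part. The remaining steps can be sketched with NumPy alone; the random array is a stand-in for a loaded image, and the unweighted channel mean is a simplification, not the weighting OpenCV uses:

```python
import numpy as np

# Stand-in for an RGB image already loaded with height 99 and width 200.
rgb = np.random.randint(0, 256, size=(99, 200, 3), dtype=np.uint8)

# Collapse the three colour channels into one; keepdims keeps the channel axis.
gray = rgb.astype(np.float32).mean(axis=-1, keepdims=True)

# Keras models predict on batches, so add a leading batch axis.
batch = np.expand_dims(gray, axis=0)

print(gray.shape, batch.shape)  # (99, 200, 1) (1, 99, 200, 1)
```

Printing the shape right before `model.predict` is a quick way to confirm it matches what the first layer expects.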
Thanks @lapis sequoia , It turns out that I was not accessing the data properly. It works fine now
Traceback (most recent call last):
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 468, in wrapper
resp = resource(*args, **kwargs)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\views.py", line 89, in view
return self.dispatch_request(*args, **kwargs)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 583, in dispatch_request
resp = meth(*args, **kwargs)
File "E:\demo3\findDocumentType1.py", line 126, in post
self.resize_im(image_data)
File "E:\demo3\findDocumentType1.py", line 219, in resize_im
img = preprocessing(img)
File "E:\demo3\findDocumentType1.py", line 215, in preprocessing
img = grayscale(img)
File "E:\demo3\findDocumentType1.py", line 207, in grayscale
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cv2.error: OpenCV(4.2.0) c:\projects\opencv-python\opencv\modules\imgproc\src\color.simd_helpers.hpp:94: error: (-2:Unspecified error) in function '__cdecl cv::impl::`anonymous-namespace'::CvtHelper<struct cv::impl::`anonymous namespace'::Set<3,4,-1>,struct cv::impl::A0xe227985e::Set<1,-1,-1>,struct cv::impl::A0xe227985e::Set<0,2,5>,2>::CvtHelper(const class cv::_InputArray &,const class cv::_OutputArray &,int)'
> Unsupported depth of input image:
> 'VDepth::contains(depth)'
> where
> 'depth' is 6 (CV_64F)
@twilit wind @hasty grail
can you share the code
my code here https://paste.pythondiscord.com/ohebolimuj.py @twilit wind
I've some text that contain fraction in text - "one-third", "one-half"......
How do I convert these into their relevant fractions? 1/3, 1/2 etc...
@verbal sand how many unique fractions do you have
It can be any.... this is contained in a text sentence like - "Take one-half of the tablet daily".
Doctor's prescription data.
@twilit wind do u get my code?
it is for prediction @twilit wind
@verbal sand create a mapping of fractions to numbers
and apply it
my updated code here https://paste.pythondiscord.com/ficexumiha.py @twilit wind
plz check
Traceback (most recent call last):
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 468, in wrapper
resp = resource(*args, **kwargs)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\views.py", line 89, in view
return self.dispatch_request(*args, **kwargs)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 583, in dispatch_request
resp = meth(*args, **kwargs)
File "E:\demo3\findDocumentType1.py", line 126, in post
self.resize_im(image_data)
File "E:\demo3\findDocumentType1.py", line 219, in resize_im
im = preprocessing(im)
File "E:\demo3\findDocumentType1.py", line 215, in preprocessing
im = grayscale(im)
File "E:\demo3\findDocumentType1.py", line 207, in grayscale
im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)
TypeError: Expected Ptr<cv::UMat> for argument 'src'
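Both tracebacks point at the input passed to `cv2.cvtColor`. The first says the array is float64 (`CV_64F`), a depth that `cvtColor` rejects. The second kind of `TypeError` usually appears when the argument is not a NumPy array at all. The pasted code is not shown here, so the helper below is only a sketch: `to_cv_image` is a hypothetical name, and the rule that floats in [0, 1] are scaled intensities is an assumption.

```python
import numpy as np

def to_cv_image(image):
    """Return `image` as a uint8 NumPy array, a dtype cv2.cvtColor accepts."""
    array = np.asarray(image)  # also converts PIL images and nested lists
    if array.dtype == np.uint8:
        return array
    is_float = np.issubdtype(array.dtype, np.floating)
    if is_float and array.size and array.max() <= 1.0:
        # Assumption: float images in [0, 1] hold scaled pixel intensities.
        array = array * 255.0
    return np.clip(array, 0, 255).astype(np.uint8)

# Intended use (requires OpenCV, not imported here):
#   gray = cv2.cvtColor(to_cv_image(img), cv2.COLOR_BGR2GRAY)
```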
my updated code here https://paste.pythondiscord.com/ficexumiha.py @twilit wind
plz check this is my updated code for testing a model
It says a type error
some typo there I think
@twilit wind means bro?
Bro the code seems ok
Do you have any other file where you are running the code @mild topaz
any file for flask
i have a model file @twilit wind
no @twilit wind
You are predicting the country name by its image I guess @mild topaz
yes @twilit wind
ok np
@velvet thorn isn't there any library?
For the string "one-third" - I thought of mapping one with 1 and third with 3 and it becomes 1-3. How do I give it the meaning that the hyphen (-) in "1-3" should be considered as a division and not like "one-three days"?
@verbal sand beats me
what do you mean?
like do you want to convert it into a number?
I suggest a regex
I mean that since the text is a doctor's prescription, there can be texts like "one-third of tablet", "one-three days".
The first one mean 1/3 of the tablet while the other means 1 to 3 days.
If I map one with 1 and third with 3 and three with 3 then after replacing with their corresponding texts, it becomes "1-3 of tablet" and "1-3 days". Now, how do I distinguish whether the 3 in both sentences is to be understood as dividing the 1 or just the upper range(1 to 3 days of range).
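One way around the ambiguity is to decide before the words are turned into digits: "third" is a fraction word while "three" is a counting word, and that difference is lost once both become 3. The sketch below assumes the word lists shown, which are hypothetical and incomplete, and writes ranges as "1 to 3".

```python
import re

# Hypothetical, incomplete word lists for illustration.
CARDINALS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
FRACTION_WORDS = {"half": 2, "third": 3, "quarter": 4, "fourth": 4, "fifth": 5}

PAIR_RE = re.compile(r"\b([a-z]+)-([a-z]+)\b", re.IGNORECASE)

def normalize(text):
    """Rewrite 'one-third' as '1/3' and 'one-three' as '1 to 3'."""
    def convert(match):
        first, second = match.group(1).lower(), match.group(2).lower()
        if first not in CARDINALS:
            return match.group(0)  # not a number phrase, leave untouched
        if second in FRACTION_WORDS:
            return f"{CARDINALS[first]}/{FRACTION_WORDS[second]}"
        if second in CARDINALS:
            return f"{CARDINALS[first]} to {CARDINALS[second]}"
        return match.group(0)
    return PAIR_RE.sub(convert, text)

print(normalize("one-third of tablet"))
print(normalize("one-three days"))
```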
@velvet thorn
what do you mean?
@velvet thorn
yes I do want to convert it into number. Later the amount of medicine can be converted into some fractional value. I wanted that to know how much dose a patient takes.
updated code https://paste.pythondiscord.com/olisidijub.py and my error
Traceback (most recent call last):
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1949, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\app.py", line 1935, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 468, in wrapper
resp = resource(*args, **kwargs)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask\views.py", line 89, in view
return self.dispatch_request(*args, **kwargs)
File "C:\Users\Admin\anaconda3\lib\site-packages\flask_restful\__init__.py", line 583, in dispatch_request
resp = meth(*args, **kwargs)
File "E:\demo3\findDocumentType1.py", line 126, in post
self.resize_im(image_data)
File "E:\demo3\findDocumentType1.py", line 231, in resize_im
self.getclassname(classNo)
NameError: name 'classNo' is not defined
@mild topaz the variable classNo is not defined
the error message is pretty clear
https://paste.pythondiscord.com/kasadoxiyo.py line 233 is not printing
is 0.873 adj R square good enough
Hi guys. Can anyone explain what a cost function is for a non-math person like me please? The lesson I'm watching introduced us to this equation and said "for simplicity, half of this value is considered the cost function through the derivative process"
I have absolutely no clue what that means
What are the next steps after I finish my regression in statsmodels
If you guys are interested in Natural Language Processing. Here,
@vague bear that's squared error
https://en.m.wikipedia.org/wiki/Mean_squared_error
It's used to find how good the model is
The lower the value, the better the model, relatively speaking
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the act...
@mild topaz are you trying to make a rest API which takes a base64 image as input and do some prediction with it?
yes
@mild topaz I'm not very familiar with Flask-RESTful, but where are you going wrong?
@raw mortar give me some time , as soon as get free i will ping u
@raw mortar thanks, i'll do some readings now. I searched cost function on YT and didn't find anything
how can you identify it as mean squared error? The equation looks different in the wiki
this one?
@vague bear it's not MSE, it's squared error; in the wiki, look for the loss function part
MSE is when you divide it by the number of data points; with just the 1/2 it's squared error
How can I get my regression equation in statsmodels
@vague bear https://datascience.stackexchange.com/questions/10188/why-do-cost-functions-use-the-square-error
here this has a better explanation
ya, it is used interchangeably, but some prefer to say its a loss when its a single data point and cost when all the points are considered
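The lesson's equation is not shown, so the sketch below assumes it is the half squared error discussed above. It illustrates why the 1/2 is there: the 2 that comes from differentiating the square cancels it, so the slope is simply the error. The function names are made up for illustration.

```python
def half_squared_error(prediction, target):
    """Loss for a single data point: 1/2 * (prediction - target)^2."""
    return 0.5 * (prediction - target) ** 2

def half_squared_error_slope(prediction, target):
    """Derivative of the loss with respect to the prediction.

    The 2 from the power rule cancels the 1/2, leaving just the error.
    """
    return prediction - target

def cost(predictions, targets):
    """Cost over the whole data set: the average of the per-point losses."""
    losses = [half_squared_error(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

print(half_squared_error(3.0, 1.0), half_squared_error_slope(3.0, 1.0))
```

A prediction of 3 against a target of 1 is off by 2, so the loss is 1/2 * 2² = 2 and the slope is 2: the further off the prediction, the steeper the slope and the larger the correction.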
and there is no consistency in the expressions 🤦
I looked up some tutorial in my language and it uses pi to represent probability. is that normal
nope have not seen that one
ic
How can I get my regression equation in statsmodels
@lapis sequoia i don't quite understand your question, someone else might answer it
@raw mortar I ran OLS regression and got the output table. Is there a way to look at the equation for it
@lapis sequoia the equations would remain the same i think, probably just google for the implementation docs to get the exact equation
@raw mortar What do you mean?
Like is there a way to export it to excel and then plug in the item in each variable to get the output of my dependent
oh you want to make predictions from the model ?
my dependent var is labor hrs. independent variables (using dummies) are product, customer, build type,product config. Want to be able to plug in certain customers and products for example to get my labor hours output
yes exactly
let me look it up, have not used ols in statsmodels before
thank you
@lapis sequoia https://realpython.com/linear-regression-in-python/
initialize, fit and predict
import statsmodels.api as sm
model = sm.OLS(y, x)
results = model.fit()
results.predict(x)
is confusion matrix used a lot?
@raw mortar yeah I have the results. so i just use results.predict(x) to predict the y?
@vague bear yep, usually in classification problems
I see. The correct ones are churn 1,1 and churn 0,0 right
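A small sketch of that reading of the matrix, with made-up churn labels: each cell counts one (actual, predicted) pair, and the correct predictions are the cells where the two agree.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

# Made-up labels for illustration: 1 = churned, 0 = stayed.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
correct = counts[(1, 1)] + counts[(0, 0)]  # the diagonal of the matrix
accuracy = correct / len(actual)
print(dict(counts), accuracy)
```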
i'm looking at this https://www.kaggle.com/sudalairajkumar/chennai-water-management
is there something wrong with the data
is it the color scheme?
@raw mortar do you know if there is a way to export it into excel and just make a dropdown to choose the x variables I want to include to predict the y
@lapis sequoia not sure about that one though, might be possible
@raw mortar I'll try to figure it out
Anything else I should do in the meantime after my regression
so does this mean it's a good or bad model
i'm gonna go out on a limb and say it's bad
I've seen worse acc
how can I check which variables are most important/drive the dependent variable the most
Do I just use the std coeff
hahah I like your username @bitter harbor
Why are some of my independent variables showing twice in my summary table
Can anyone recommend the best way to start with reinforcement learning and I am good with most of the deep learning concepts
My only knowledge of reinforcement learning is the library Gym.
Maybe you can start with their docs.
or if someone else has a better source of information.
guys what does it mean when your data set does that
@hollow sentinel The faded and unfaded areas just show the density. Where many points overlap at a single location, the plot becomes darker. Check the alpha or transparency value when you plot.
Hey guys question on validation data.
Epoch 1/10
1250/1250 [==============================] - 361s 289ms/step - loss: 5.0271 - accuracy: 0.3601 - val_loss: 1.1977 - val_accuracy: 0.5984
Epoch 2/10
1250/1250 [==============================] - 360s 288ms/step - loss: 1.3753 - accuracy: 0.5232 - val_loss: 0.7962 - val_accuracy: 0.7531
Epoch 3/10
1250/1250 [==============================] - 359s 287ms/step - loss: 1.0479 - accuracy: 0.6364 - val_loss: 0.5072 - val_accuracy: 0.8499
Epoch 4/10
1250/1250 [==============================] - 363s 291ms/step - loss: 0.7664 - accuracy: 0.7330 - val_loss: 0.2894 - val_accuracy: 0.9197
Epoch 5/10
1250/1250 [==============================] - 360s 288ms/step - loss: 0.5792 - accuracy: 0.7965 - val_loss: 0.1755 - val_accuracy: 0.9532
Epoch 6/10
1221/1250 [============================>.] - ETA: 7s - loss: 0.4574 - accuracy: 0.8416
The epochs haven't finished yet, but it feels like I'm heavily overfitting on the validation data.
@raw mortar results.predict(x) doesn't work
Guys
In tutorials anywhere, I can see only basics of ml
I can't find anything that goes deeper
Like what
They don't go deep into things like the pandas iloc functions etc...
I need to learn that
Where can I find it?
I know pandas basics
I know numpy basics
I know ml basics
What's your definition of ml basics?
But I want to learn deeper in numpy, pandas
Ml basics mean I know main algorithms like regression, Knn, etc..
In scikit learn
Any tutorial I can learn deep??
deep learning?
For deeper in NumPy and Pandas, here are some exercises.
https://github.com/rougier/numpy-100/blob/master/100_Numpy_exercises.md
Look at the exercises down there.
@heady hatch you ever used statsmodels
Kk, thanks
Is there a way to export my linear regression model into Excel
And use in Excel to predict my y var
I don't really use excel so I can't give any advice on that.
I can learn by exercises?
but
ty
If you can somehow read those files into excel.
I can learn by exercises??
Then I would probably google an advanced numpy or pandas tutorial.
Hey guys,
question on validation.
My validation accuracy is significantly higher, I was wondering what I could be doing wrong.
It's on Cifar 10 dataset.
I realized I forgot to check for dataset imbalance.
I'm trying to do polynomial regression and I have an array of what the coefficients should be. The number of terms varies. Once I know how far off the prediction was along the y axis, what adjustment am I supposed to make to the coefficients?
I'm not super familiar with stats.
but how come you're manually adjusting the coefficients?
how else would I make sure that the curve is correct?
Are you not able to base that off of your error?
I'm not sure what to do with the error once I have it.
Oh are you manually calculating the regression?
yes, I have to show that I understand how it works.
the sample code is in perl
Is a 0.873 adjusted r-squared good?
Oh man. I'm currently looking up how to calculate regression to give you further thoughts.
I appreciate it
@serene scaffold You using statsmodels?
@lapis sequoia no, I'm using numpy. I can't use anything that eliminates the need to show how the math works.
Damn idk bro
that's okay. thank you.
I'm new to this
@heady hatch Is there a way to have my equation show in statsmodels
for my linear model
What do you mean by equation?
my linear formula
@serene scaffold
I don't know if this is relevant.
http://polynomialregression.drque.net/math.html
From how they're calculating the coefficients, they're using a system of equations to solve for them. And I guess in your case, do you have the data points?
If so, you might be able to do the same.
@lapis sequoia I'm still unsure of what you mean. Like you want the coefficients?
@heady hatch let me look at this. Thanks!
I think there's a coefficient method to get it from the models.
So after you fit it, you can get the coefficients via the methods.
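In statsmodels the fitted coefficients are exposed as the `params` attribute of the results object (an attribute rather than a method call). With dummy variables the regression equation is the intercept plus the coefficient of every dummy that is switched on, which is the same arithmetic a spreadsheet would do. The coefficient names and values below are invented for illustration; real ones would come from the fitted model.

```python
# Hypothetical coefficients. When the model was fitted on a pandas DataFrame,
# something like results.params.to_dict() gives a dict of this shape.
COEFFICIENTS = {
    "const": 12.0,         # intercept: the baseline product/customer/build
    "product_B": 3.5,
    "customer_5": -1.25,
    "build_type_A": 2.0,
}

def predict_labor_hours(active_dummies, coefficients=COEFFICIENTS):
    """Intercept plus the coefficient of each dummy variable that equals 1."""
    return coefficients["const"] + sum(coefficients[name] for name in active_dummies)

print(predict_labor_hours(["product_B", "customer_5"]))
```

The category left out of each set of dummies acts as the baseline, so a row with no active dummies predicts the intercept alone.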
Let me try that thank you
let me see if I understand correctly
basically given my training data, which is a list of (x, y) points, if I want to find the best-fit curve, I should start with a polynomial function y = a * (x ** 1) + b * (x ** 2) + ...
and if I have an array of [a, b, ...] then I'll have the answer
so the goal is to solve for [a, b, ...] for each instance of (x, y), multiply that array by the alpha, and add that to the weights?
does that sound right @heady hatch?
That's from my understanding of how regressions work.
That's not to take into consideration of regularization or anything.
I don't think I have to do that
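A sketch of the update being discussed, assuming plain batch gradient descent on half squared error. Each coefficient moves against its own gradient, which is the prediction error times the matching power of x, averaged over the points and scaled by alpha. Unlike the formula above, this version also fits a constant term (power 0). Pure Python is used so every step stays visible, and the learning rate and step count are arbitrary choices.

```python
def predict(coeffs, x):
    """Evaluate coeffs[0] + coeffs[1]*x + coeffs[2]*x**2 + ..."""
    return sum(c * x ** power for power, c in enumerate(coeffs))

def gradient_step(coeffs, points, alpha):
    """One batch gradient-descent update of all coefficients."""
    grads = [0.0] * len(coeffs)
    for x, y in points:
        error = predict(coeffs, x) - y      # how far off along the y axis
        for power in range(len(coeffs)):
            grads[power] += error * x ** power
    n = len(points)
    return [c - alpha * g / n for c, g in zip(coeffs, grads)]

def fit(points, degree, alpha=0.1, steps=5000):
    """Start from all-zero coefficients and repeat the update `steps` times."""
    coeffs = [0.0] * (degree + 1)
    for _ in range(steps):
        coeffs = gradient_step(coeffs, points, alpha)
    return coeffs

# Points sampled from y = 1 + 2x - 3x^2, so the fit should recover [1, 2, -3].
points = [(x, 1 + 2 * x - 3 * x ** 2) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
print(fit(points, degree=2))
```

Keeping x roughly within [-1, 1] matters here: with larger x values the higher powers grow quickly and a fixed alpha can diverge.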
@heady hatch Check if you are splitting the data properly, and that there is no data leakage. It is very rare to have a situation like the one shown above.
I'm not splitting the data myself. It's presplit.
Good point about data leakage.
So it's the cifar 10 dataset.
They've split the data into train and test already.
40000 training images
10000 test images
I did add a couple of things to the training dataset pipeline that I didn't add to the testing dataset pipeline, such as shuffling the data and repeating it.
Though I was under the impression that I'm not supposed to shuffle the test data.
@heady hatch That worked thanks
I coulda just used it from the table too
Is 0.873 adj r squared good enough
I want to use the model to be able to use the dependent variable (labor hours) as a benchmark based on the independent variables (product, customer, config, build type)
Depending on your problem. Is adjusted r squared the metric you want to look at?
yeah
since I have multiple independent variables
@heady hatch
Split the 40K into 35K and 5K.
Well, shuffling should not change anything here. Also do a stratified split.
Also check if you are plotting the right legends. Maybe you are confusing train and val while plotting.
@lapis sequoia
So should I leave the test set as a holdout?
There's no simple way to do a stratified split with Tensorflow, is there? I would have to redo the data pipeline and make the test dataset myself.
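One option that does not depend on TensorFlow is to choose the split indices in plain Python before building the pipeline. The sketch assumes the labels fit in a list; how the chosen indices are fed back into the dataset is left out. `stratified_split` is a hypothetical helper, not a library function.

```python
import random
from collections import defaultdict

def stratified_split(labels, holdout_fraction, seed=0):
    """Return (train_indices, holdout_indices) keeping per-class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, holdout = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(len(indices) * holdout_fraction)
        holdout.extend(indices[:cut])
        train.extend(indices[cut:])
    return sorted(train), sorted(holdout)

# Made-up imbalanced labels: 80 of class 0 and 20 of class 1.
labels = [0] * 80 + [1] * 20
train_idx, val_idx = stratified_split(labels, holdout_fraction=0.25)
print(len(train_idx), len(val_idx))
```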
Thank you for bringing the labelling up, I double checked and they are the correct labels.
What should my p values be
This is how I'm constructing the data pipeline.
train_ds = (raw_train_dataset.map(parse_image_function)
.map(process_image)
.repeat()
.shuffle(buffer_size=20000)
.batch(batch_size=32)
.prefetch(buffer_size=100)
)
test_ds = (raw_test_dataset.map(parse_image_function)
.map(process_image)
.shuffle(buffer_size=5000)
.batch(batch_size=32)
.prefetch(buffer_size=100)
)
process_image standardizes the images and resizes them.
@heady hatch my model isnt predicting certain mixes well it seems
certain mixes?
ya like
product 2 to customer 5 with build type A
etc
looks different than historical avg
Ahh well now here's something to consider.
Is r squared the metric you want to look at?
adjusted r sq
What I mean by this is you don't necessarily need to move away from r squared if that's one of the few metrics you can get.
Because think about the definition of what r squared means.
R squared means the goodness of fit.
yea
But it doesn't necessarily talk about the actual problem itself.
It's just a proxy metric for something else you care about.
Because yea .81 r squared could be good.
But not if it's constantly making mistakes on a particular group of people or product.
I don't know your actual problem so you'd have to determine that yourself.
yeah the errors are high for some
Maybe it's okay for it to keep making mistakes on certain things.
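A sketch of the point about proxy metrics, with made-up numbers: the overall R squared can look strong while every error sits in one group. The group labels and values are invented, and mean absolute error per group is just one possible breakdown.

```python
from collections import defaultdict

def r_squared(actual, predicted):
    """1 minus residual sum of squares over total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def adjusted_r_squared(r2, n_samples, n_predictors):
    """Penalize R squared for the number of predictors used."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

def mean_abs_error_by_group(groups, actual, predicted):
    """Average absolute error separately for each group label."""
    errors = defaultdict(list)
    for group, a, p in zip(groups, actual, predicted):
        errors[group].append(abs(a - p))
    return {group: sum(v) / len(v) for group, v in errors.items()}

# Made-up data: group A is predicted perfectly, group B is always 5 too high.
groups = ["A", "A", "A", "B", "B", "B"]
actual = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
predicted = [10.0, 20.0, 30.0, 45.0, 55.0, 65.0]

r2 = r_squared(actual, predicted)
print(r2, adjusted_r_squared(r2, n_samples=6, n_predictors=1))
print(mean_abs_error_by_group(groups, actual, predicted))
```

Here R squared is above 0.95, yet group B is off by 5 on every row, which is the kind of pattern a single overall score hides.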
