#data-science-and-ml
Latter one. Point taken, will go directly for the question
The reason is, even if people know about the topic of your question, it's unlikely that they'll want to interview you to dig down to the actual question. It's best to save everyone a step and put the actual question out.
Good luck!
Got it. Thanks for the feedback!
Yes I know what a dot product is and I can do it on paper
I know a lot about NumPy, how a neural network works, and so on. I have already built parts of the model.
Yes, I don't think I'm an expert in AI and ML, and I think it's hard.
Also, it's not for my homework; I am learning Deep Learning.
Can you share the code that you have made already? I'm having trouble understanding what you mean by "Where is the Deep Learning"
Ok
yeah, tell
What's the difference between using a statistical program to fit a linear regression model and doing it in Python via sklearn? How is the latter machine learning when it's just generating predictions? In this specific case, can all "automated" least-squares calculators be considered a rough form of machine learning?
The former uses an algorithm to directly generate a line. The latter initializes a bunch of random weights and adjusts them over time based on loss. Both (may) lead to the same outcome, but they arrive at that outcome differently. If the data you have is linearly separable such that least squares suffices, there is no need for ML, which is suited for modeling non-linear relationships.
(Just realized I did not answer your question. No, least squares is not a form of machine learning because no learning is done)
Hi, in Machine Learning, do we select the model first and then tune its hyperparameters, or tune the hyperparameters for all the models first and then select the model?
In ordinary least squares, what is there to randomly generate?
I realize there are multiple ways to calculate the predictions and MSE in linear regression, but I'm not seeing what you mean by "adjusting the random weights over time". The only thing I can think of is gradient descent, because that is an algorithm that iteratively adjusts the weights to reduce the MSE. Do you mean neural networks?
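To make the two routes concrete, here is a minimal sketch on made-up data (NumPy only; the variable names, learning rate and iteration count are my own choices, not anything from sklearn). It fits the same line once with closed-form least squares and once with gradient descent from random starting weights:

```py
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 1, size=200)  # noisy line, made up

# Design matrix with a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# 1) Closed-form least squares: solved in one shot, nothing random.
w_closed, *_ = np.linalg.lstsq(A, y, rcond=None)

# 2) Gradient descent on the same MSE loss, starting from random weights.
w_gd = rng.normal(size=2)
learning_rate = 0.01
for _ in range(20000):
    residual = A @ w_gd - y
    grad = 2.0 * A.T @ residual / len(y)  # gradient of the MSE w.r.t. weights
    w_gd -= learning_rate * grad

print(w_closed, w_gd)
```

Both end up at essentially the same slope and intercept; only the way they get there differs.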
So I have just set up the folders, took some images from Google for training, and then labelled those images. You could say I have only worked with ipython until now. So there is no code yet, but I know about NumPy (learned from Sentdex), though I haven't seen any use for it so far.
I was aligning images and I wanted to load the images in a folder that have a specific name. How can I do that?
Anyone into NLP here,
is nltk a good choice ?
hello
How can I improve this histogram made with Plotly? As you can see, the markings are not clear: one bar is so high that the other bars are too low to be visible, and the y-axis scale is too large as well. How do I make the scale smaller, with a frequency step of 100?
Right, that is what I meant. Sorry, I should've been specific
I really strongly disagree with this
Like fundamentally I think that's wrong
I'm on mobile so I can't effectively plead my case, but my basic opinion is that fitting a line is definitely "learning", albeit learning something relatively simple
Arguably "least squares" itself is just an algorithm, it's not inherently machine learning. But fitting a line with least squares is just as much machine learning as fitting a deep network with SGD
cc @jade acorn
Hm, upon further reflection I see your point
My mind Ctrl+Shift+F's "machine learning" to "deep learning" since that's usually what people are referring to
Hi, random question:
I'm interested in data science/AI stuff; can you guys recommend some certifications or the like? Maybe I can convince my employer to pay for some courses/certificates;
I just don't know where to start
This book provides an accessible overview of the field of statistical learning, with applications in R programming....
I'm working in a consulting setup, so it would be nice if there were some certifications that would convince the higher-ups they are marketable and improve my value, so they can hire me out for more money. But thanks for that link.
Hello everyone. I want to create a blank screen with the plt.figure() function, but I can't. Can you help me?
I am building object detection with TensorFlow, but I am installing a lot of things.
Do professional programmers also download so many files, like labelImg and more?
Can we not just install all the modules with pip and use them from code?
what is labelimg?
are you talking about downloading software that isn't Python libraries? different operating systems have different packages managers. what OS are you on?
labelimg is image annotation software. Ironically it is in pip, though some level tools aren't. Many packages have many ways of installing
Yes, I am asking whether we can install all modules with pip install.
Do all professionals also download so many repositories and other things?
I am following a tutorial in which I have gone through so many steps and installed so many things.
https://youtu.be/yqkISICHH-U This is the tutorial. Do you know any good place where I can learn like a professional?
In reality, even professionals deal with heterogeneous and painful environment setups
You can try looking for Docker images with everything set up
But I am installing many libraries in different ways even when I could use pip
Is there any benefit to installing libraries differently?
When will I get to the big coding part?
if it's a standalone package I don't think there's anything wrong with simply installing with pip. Some more sophisticated packages also come with software that you need to manually download and install - but if pip install works you should use that
Depends on the tutorial you're following. I imagine once you have all the required packages it'll be coding time
Is there not one requirements file?
I am just using ipython till now
Like?
pip install -r requirements.txt
Never did that
Several pip packages can be installed that way from a list
Tutorials differ in the amount of effort they put into the setup
If they are having you install a bunch of different pip packages separately, they could've made it one line with a file like that
No that guy is making me install all these differently
Also, do all programmers do all these creepy things in AI and ML?
I haven't written a single line of code in PyCharm; I've only used ipython to install and set things up
And the guy in the video says now time to train the model
https://www.tensorflow.org/hub/tutorials/object_detection If you look at this, you can see that we are not installing anything apart from libraries
i see
that doesn't sound like a great way to learn ML
start from the basics instead of some random guy on youtube
Does anyone know how to find the points whose coordinates extend outside a certain domain?
In this example, my domain is (-5, 5), and one point extends outside of that.
I'm having difficulty understanding a concept. .... true or false, ... when using a threshold other than .5 for classification, does that essentially turn it in to a regression problem? are classification and regression the same, just one has a binary output (in the case of 2 classes)?
Sounds like you just want to filter the points by a condition like not (-5<=x<=5 and -5<=y<=5)
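A minimal NumPy sketch of that filter, assuming the points sit in an (n, 2) array; the sample points and the names `low`, `high`, `inside` are made up for illustration:

```py
import numpy as np

# Hypothetical points; each row is (x, y).
points = np.array([[-2.0, 1.0], [4.5, -4.9], [6.2, 0.3], [0.0, -7.5]])
low, high = -5.0, 5.0

# True for rows where both coordinates lie within [low, high].
inside = np.all((points >= low) & (points <= high), axis=1)

# The points that leave the domain are simply the negation of the mask.
outside_points = points[~inside]
print(outside_points)
```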
Hi guys do ya got some good tutorials and explanations for basics in data science with python or c language?
I wanted to adjust the plot when an iteration of my algorithm produces points that go past my domain
but that would be too much of work honestly
is there a function in numpy, sklearn or scipy that can detect whether a dataset follows a straight line, a polynomial or an exponential?
Well, you can try fitting the dataset to a line, a polynomial of some degree or to an exponent. I don't think there is (or can be, really) an automatic function to determine that
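Something like this is what I mean by trying each shape; a rough sketch with made-up data that happens to be exactly exponential. The helper `sse` and the other names are mine, and the log trick for the exponential fit assumes every y is positive:

```py
import numpy as np

x = np.linspace(0.0, 4.0, 50)
y = 2.0 * np.exp(0.8 * x)  # made-up data, exactly exponential

def sse(y_true, y_pred):
    """Sum of squared errors between observed and fitted values."""
    return float(np.sum((y_true - y_pred) ** 2))

# Straight line and quadratic, fitted by least squares.
line = np.polyfit(x, y, deg=1)
quad = np.polyfit(x, y, deg=2)

# Exponential y = a * exp(b * x): fit a line to log(y) instead.
b, log_a = np.polyfit(x, np.log(y), deg=1)

errors = {
    "linear": sse(y, np.polyval(line, x)),
    "quadratic": sse(y, np.polyval(quad, x)),
    "exponential": sse(y, np.exp(log_a) * np.exp(b * x)),
}
best = min(errors, key=errors.get)
print(errors, best)
```

Keep in mind that a higher-degree polynomial never fits the training points worse than a lower-degree one, so on real, noisy data it is safer to compare the candidates on held-out points.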
Ordinarily, the sigmoid function takes care of that in a classification problem. It is what works behind the scenes to determine which class each predicted value belongs to.
As long as you understand the concept of the sigmoid function, you honestly shouldn't bother about having a threshold other than 0.5 (given you're interested in building an unbiased model).
The sigmoid function is an activation function used in Logistic Regression (a classification problem) to squash the output to between 0 and 1 (the range -1 to 1 belongs to tanh, not sigmoid).
So if the predicted value is above 0.5, it'll be assigned to class 1, and otherwise to class 0.
As long as there is an activation function (in our case the sigmoid function) in your Logistic Regression model, its output is a probability between 0 and 1 that gets thresholded into a class, not an arbitrary continuous target.
So in essence, even if the set threshold is 0.7, that doesn't turn your classification problem into a regression problem.
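A tiny sketch of that point, with made-up raw scores and NumPy only (the names are mine): the sigmoid output is a probability, and the threshold only decides where the cut into classes happens.

```py
import numpy as np

def sigmoid(z):
    """Map any real-valued score to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-2.0, -0.5, 0.3, 1.0, 3.0])  # hypothetical raw model outputs
probs = sigmoid(scores)

# Same probabilities, two different cut-offs: the output is still a class label.
labels_default = (probs >= 0.5).astype(int)
labels_strict = (probs >= 0.7).astype(int)
print(probs, labels_default, labels_strict)
```

Moving the cut-off from 0.5 to 0.7 flips the borderline example to class 0, but the result is still a set of class labels.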
Thanks for the response. My thought on moving the threshold is due to running an imbalanced dataset. I've run standard, undersampling, oversampling and SMOTE against my dataset, and in all cases the recall is not satisfactory unless I lower my threshold.
I just want to make sure that I'm considering the right value: that a regression prediction is the continuous value for what would otherwise be a classification when the threshold is .5
If you have imbalanced classes, try one of the available resampling techniques like SMOTE (if you want to oversample the minority class).
You can also pass the parameter stratify=y when splitting your data with train_test_split.
Then also set the class_weight parameter in your model to handle the class imbalance.
XGBoost uses scale_pos_weight.
Then use StratifiedKFold for your cross-validation.
Try googling more ways to handle class imbalance. Don't touch the default threshold, which is 0.5.
in the case of xgbregressor, how do i know what activation function is being used? is sigmoid default, i hear relu is more preferred
Hey everyone, I seem to be having a problem with the feature importance on my models. I'm using the built-in sklearn feature_importances_ for gradient boosting, random forest, and extreme gradient boosting, but my extreme gradient boosting feature importance is very different from the gradient boosting and random forest ones, to the point where a low-correlating binary feature is my most important variable.
Is the built-in feature_importances_ for random forest and gradient boosting different from the one for extreme gradient boosting? I'm struggling to find an explanation for this. Or is there a better way to find feature importance for my models?
I just recently started learning Deep Learning, but to the best of my knowledge, ReLU is an activation function used in neural networks. It's not used in regression problems. Activation functions aren't used in regression.
so i am trying to learn ai and machine learning and i was watching a video from tech with tim and i did what he did and understood a part of it, i was hoping someone could explain to me this:
```py
import numpy as np
import pandas as pd
import sklearn.model_selection  # plain `import sklearn` does not guarantee the submodule is loaded
from sklearn import linear_model
from sklearn.utils import shuffle

# Load the semicolon-separated dataset and keep only the columns of interest.
data = pd.read_csv("student-mat.csv", sep=";")
data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]

predict = "G3"  # column to predict

# Features are every column except the target.
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)

linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test)  # R^2 on the held-out data
print(acc)

print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

predictions = linear.predict(x_test)
for x in range(len(predictions)):
    print(predictions[x], x_test[x], y_test[x])
```
thanks for the chat, .. i think you're right with relu being a neural network function.
Activation functions are used for hidden layers in regression problems
Just not for the output layer
There is no activation function in xgboost. It's worth reading about how it works
That shouldn't worry you. Of course, different models built with different ML algorithms won't exactly have the same feature importance. I think you should focus more on the model that best minimizes your loss function.
About other ways to do Feature Selection, you could use RFE or better still, combine feature selectors using Voting Method.
In Deep Learning right?
Basically, this code fits a Linear Regression model to your dataset. You store the dataset in the dataframe "data" and predict the value of G3 from the other features. First you drop the column to be predicted from the dataset and split the data into training and testing data. The training data is used to fit the model so it can learn the best-fit regression line, and the test data is used to measure how well the fitted model does: acc = linear.score(x_test, y_test), which for LinearRegression is the R² score rather than a classification accuracy. You initialize the model in "linear" and train it with "fit". Linear regression works by estimating parameters, i.e. the gradients (coefficients) and the y-intercept, that minimise the Mean Squared Error. After fitting, you predict the values for the test set.
I'm probably wrong in some places as I'm also new to ML, so you should probably seek confirmation from more experienced people.
what about SHAP values? the logic behind them seems sound and easily explainable
Importing necessary Libraries
import pandas as pd
import numpy as np
import sklearn.model_selection
from sklearn import linear_model
from sklearn.utils import shuffle
Read Dataset Into Pandas DataFrame
data = pd.read_csv("student-mat.csv", sep=";")
Selected The Features and Label he's interested in using to build a model
data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]
Specified the label (Response Variable)
target_col = "G3"
Declared The Input & Target Variable
He dropped the target column from the dataframe so it contains only the input features (X), then converted both X (DataFrame) and y (Series) into numpy arrays.
X = np.array(data.drop(columns=[target_col]))
y = np.array(data[target_col])
Split Dataset into train and holdout set
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)
Instantiated the Linear Regression object
linear = linear_model.LinearRegression()
Train a Linear Regression Model
linear.fit(x_train, y_train)
Get The Coefficient of Determination (R2)
acc = linear.score(x_test, y_test)
print(acc)
Print Your Model Parameters (The weight and intercept)
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
Make Predictions with the holdout set
predictions = linear.predict(x_test)
Print The Predicted_y, X_test, and Actual_y
for x in range(len(predictions)):
    print(predictions[x], x_test[x], y_test[x])
We're all learning...
I know sigmoid function is the activation function used in Logistic Regression. I'm just starting to learn DL, however, I'm yet to come across where an activation function is used for a Regression problem.
Aren't activation functions primarily used in neural nets, though?
Yeah, but activation function was also used in Logistic Regression (sigmoid) that's what works behind the scene to designate values to the class they belong to during prediction
I don't know enough about SHAP values to comment on it at this time.
Do you mind telling me what it means/where it's applied?
it's a method of evaluating model feature importance, down to the observation if needed, based on game theory. .. it's able to work on multiple classifiers by evaluating the results in a similar manner of leave one out cross validation. ... that's about the closest parallel i can think of off the top of my head. ... look up shapley values on youtube, there are some great resources. ... data robot has one that's around 5 minutes that explains it very well at a high level
Okay, that's cool, I'll look it up. I mostly use the feature_importances_ attribute, RFE, and sometimes the Voting Method for combining feature selectors (when I want to combine multiple models)
Yes
Hi guys, I think I'm in the right place. I have a question for someone into mapping.
I want to build interactive maps. I want my python coordinate points to show up on this map and have lines between them. What would you guys recommend?
Is 83% accuracy good, fellas?
I have already learned how a neural network works
I think more than 85 is good
Oh thanks mate, but my output is absurdly restricted to 2 out of 11 possible outputs
No
You should use it for better input if you are doing image detection
It depends entirely on the prediction task and the data you have available
I think the numpy, matplotlib and seaborn libraries would be enough for your task
!rule 7
7. Keep discussions relevant to the channel topic. Each channel's description tells you the topic.
Which is the best: Keypoints, Boxes or Masks in the TensorFlow 2 Zoo?
I just started learning neural Networks but I don't understand the role of Activation and softmax functions and how the neurons work with them
Can anyone help
Or recommend any books/videos for it
Yes, there is a very good video series by Sentdex on neural networks and NumPy
Oh OK
It covers mostly everything from scratch
Oh thank you, I will definitely check that out
I will give you the link
Building neural networks from scratch in Python introduction.
Neural Networks from Scratch book: https://nnfs.io
Playlist for this series: https://www.youtube.com/playlist?list=PLQVvvaa0QuDcjD5BAw2DxE6OF2tius3V3
Python 3 basics: https://pythonprogramming.net/introduction-learn-python-3-tutorials/
Intermediate Python (w/ OOP): https://pythonpr...
Oh ok thank you so much
also check out the playlist
Happy to help
hey guys, I want to start working on my portfolio.
is dash and plotly recommended to build first basic dashboards?
then why the difficulty?
I want to start with AI
Can anyone tell me what should I do first ?
I'm reading the Universal Sentence Encoder paper, and I'm having trouble understanding this part of the paper:
The context aware word representations are converted to a fixed length sentence encoding vector by computing the element-wise sum of the representations at each word position.
Here's the rest of the text if you need context:
The transformer based sentence encoding model constructs sentence embeddings using the encoding sub-graph of the transformer architecture (Vaswani et al., 2017). This sub-graph uses attention to compute context aware representations of words in a sentence that take into account both the ordering and identity of all the other words. The context aware word representations are converted to a fixed length sentence encoding vector by computing the element-wise sum of the representations at each word position. The encoder takes as input a lowercased PTB tokenized string and outputs a 512 dimensional vector as the sentence embedding.
"Element-wise sum"?
Ah. I believe it's taking a sum of axis=0
That would make the most sense. Since axis=1 would still produce a variable length tensor
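Quick NumPy sketch of what I mean, assuming the per-word representations are stacked as a (num_words, 512) array. The arrays here are dummy ones, not real encoder outputs:

```py
import numpy as np

embedding_dim = 512
short_sentence = np.ones((3, embedding_dim))  # 3 word positions (dummy values)
long_sentence = np.ones((7, embedding_dim))   # 7 word positions (dummy values)

# Summing over the word axis gives one fixed-length vector per sentence.
short_vec = short_sentence.sum(axis=0)
long_vec = long_sentence.sum(axis=0)
print(short_vec.shape, long_vec.shape)

# Summing over axis=1 instead leaves one number per word: variable length.
print(short_sentence.sum(axis=1).shape, long_sentence.sum(axis=1).shape)
```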
I've been practising some of the concepts I've learnt in this data science course that i'm doing and I've been wondering (and this might come across as a stupid question but) how do you know what to do with the data you have at hand?
There isn't an easy answer for that question, as it's largely case-by-case
you might start by asking yourself "what are insights that a human could extract from this data if they had time to go through all of it by hand"
I guess that's sort of a drawback of just practising code? Picking up a dataset from nowhere without having a plan kinda made me blank.
what was the most recent dataset you got ahold of? maybe we could spitball some ideas. (once I get back from making coffee.)
Can i post links here?
I'll be back in a few
@reef dock did you look at the names of the columns?
['Permit Number', 'Permit Type', 'Permit Type Definition',
'Permit Creation Date', 'Block', 'Lot', 'Street Number',
'Street Number Suffix', 'Street Name', 'Street Suffix', 'Unit',
'Unit Suffix', 'Description', 'Current Status', 'Current Status Date',
'Filed Date', 'Issued Date', 'Completed Date',
'First Construction Document Date', 'Structural Notification',
'Number of Existing Stories', 'Number of Proposed Stories',
'Voluntary Soft-Story Retrofit', 'Fire Only Permit',
'Permit Expiration Date', 'Estimated Cost', 'Revised Cost',
'Existing Use', 'Existing Units', 'Proposed Use', 'Proposed Units',
'Plansets', 'TIDF Compliance', 'Existing Construction Type',
'Existing Construction Type Description', 'Proposed Construction Type',
'Proposed Construction Type Description', 'Site Permit',
'Supervisor District', 'Neighborhoods - Analysis Boundaries', 'Zipcode',
'Location', 'Record ID']
which ones did you find not useful?
Street Number Suffix, Proposed Construction Type, Site Permit, TIDF Compliance, Unit
@reef dock what do you think would be interesting to learn from this data?
I'm not entirely sure since I picked the dataset at random. Though I did try to look at the neighborhoods that have the most permit applications.
ok
help
<@&831776746206265384>
can you help meeeee
plssssssssssss
Please don't ping moderators asking for help.
so who can i ping for help
@digital shard
No one
why
so can you help meeeee
We don't provide on-call help.
There's help channels in the server if that's what you're looking for.
I don't even know what you want help with, whereas keN has been asking clear questions.
i want pyaudio in py
it not working
You should look at #how-to-get-help
So I went to the unofficial Python site, but PowerShell gives a "wheel is not supported" error
Try opening a help channel; see #how-to-get-help
ok
done
lol
@serene scaffold if you had to work with this data, what would you do?
I do human language stuff, so I don't really know. It might be interesting to see which of these features predicts the estimated cost
what do you mean human language stuff?
or when the estimated cost and revised cost are different
information extraction, text-to-speech engines, machine translation
So kinda like the predicted impact of a change in those costs?
Oh wow, that sounds interesting.
maybe? sounds like you know about this kind of thing
I tried to work with predictive models in the past. Though my knowledge in that is very brief and theoretical.
Anyone have any suggestions for producing a speaker embedding from an audio waveform
Reading WaveNet's paper rn to see what they've done
can anyone help me out ?
here is the code
```py
import pandas as pd

data = pd.read_csv('hollowen_costumes.csv')
print(data.describe())
```
soo like i can print this
so you're trying to go from a wav file to an array?
but its not clean
Depends what you mean
is there any func to do that ?
"unclean data" varies wildly. you have to know what you have and what you want it to be.
If by array, you mean the audio samples, I already have that. if by array, you mean fixed-length embedding vector, yes that's what I'm looking for
uh huh
can I make it like in a straight line?
So I'm looking at Wavenet autoencoders because audio is extremely high resolution and using recurrent methods would be really compute intensive
This isn't something I know about, evidently.
@edgy hearth can you do print(data.head().to_csv()) and paste the text in this chat?
sure
```
1,Clown,100
2,Vampire,97
3,Harley Quinn,96
4,Casa De Papel,95
```
this is the output
yeah ig it worked
: )))
did .head() do the thing ?
head just shows you the first few rows. describe does calculations and shows you the result.
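Here's a small sketch of the difference, using a hand-built frame shaped like the rows pasted above. The column names "costume" and "score" are my guess, since the pasted output has no header:

```py
import pandas as pd

# Hypothetical column names; the values mirror the rows pasted above.
data = pd.DataFrame({
    "costume": ["Clown", "Vampire", "Harley Quinn", "Casa De Papel"],
    "score": [100, 97, 96, 95],
})

first_rows = data.head(2)   # just the first 2 rows, unchanged
summary = data.describe()   # count/mean/std/min/quartiles/max of numeric columns

print(first_rows)
print(summary)
```

So `head` is for eyeballing the raw rows, while `describe` is for summary statistics; by default it skips the text column here.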
hello, has anyone used bookstoscrape in the past?
Your best bet is to ask the question you would be asking if someone said yes, as even those who have heard of bookstoscrape need to know what you need help with.
Okay, I managed to extract all the info on a book page, and now I'm struggling to get the same info for every book in the same genre.
Well, I don't know if that made sense, but basically I have the script to get all the info for one book and now I just need to repeat it for every book in a genre.
Anybody had any experience with donkey cars?
well I want to create a fully fledged AI assistant that can talk as well as work as a utility software
As I don't want it to be based on if else statements
which modules should I learn to make such a AI in a easy way
not very complex
P.S. - I am a 9th grade student with good python skills so pls list down the prerequisites(basically math required)
I'm concerned that your expectations aren't realistic.
why tho ?
Making a "fully fledged AI assistant" would be challenging even for people with advanced degrees in math and computer science.
though this does not mean that you can't do things with AI that will be interesting and gratifying for you.
I didn't mean as fully fledged as Google Assistant and Siri
one that can do some basic functionality and basically talk in hindi and english languages if possible
by basic functionality I mean
opening a software
playing music
opening different folders
searching stuff on web
that can easily be done with python
hmm
so can you suggest the module/modules I should learn to make it a little AI
You might try reading about intent classification, as your assistant would need that.
Also, I would suggest against "learning libraries/modules"
hmm which module should I specifically target
tensorflow or spacy or some frameworks like rasa
umm so that I learn to train models from scratch ?
Again, you don't want to "learn libraries", as these libraries are intended to be building blocks for solving lots of different problems. Understanding what problems are out there and what the potential approaches are will serve you better.
hmm so from where should I begin
Reading about intent classification.
hmm after that
You can't skip this step. Good luck!
I won't and my last and most important question
what math I would need to learn for this
AI often involves knowing probability, statistics, combinatorics, and linear algebra.
the first three are inter-related though.
thnx !!
Does differentiation get included in LA too? It's not heavily needed, but at the same time it's better if we know it.
That falls under calculus. My list was not exhaustive, however
Makes sense.
:incoming_envelope: :ok_hand: applied mute to @lapis sequoia until <t:1634491907:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
gotcha
I have a question with huge loads of NLP
Can libraries like PySpark process them chunk by chunk? Because I really want to try something with the arXiv dataset snapshot.
And it's unfortunate that I can't fully exploit it because of RAM capacity
I've been trying Spark and Spark-NLP but idk if I should bounce back to a simple NN on Torch once the preprocessing is done
The first production grade versions of the latest deep learning NLP research
Because so far it looks neat
Do I really have to set up a machine on the cloud ?
Has anybody ever used spark-nlp? Ever?
There's some towardsdatascience articles on it
wav2vec2 if you don't wanna mess around with SOTA stuff. otherwise, waveNet is pretty good too
Noted, thanks for the suggestion
What steps or learning paths should I take if I want to become an ML engineer?
Tree Depth: 37
yuck
Hey guys, I want to build a speaking assistant which learns day by day. I am thinking of using reinforcement learning. What do you guys suggest?
Hey folks, a small qsn.
Should I use feature selection on the test data too?
I.e., I used feature selection to clean the data and fit a decision tree, but when I used it to predict on the test data,
it gives
Feature names unseen at fit time.
Hi everyone https://paste.pythondiscord.com/sicunewomu.yaml this is the csv
this is the pivot table code
```py
weighted_matrix = combineframes.explode('product_ids').pivot_table(index='Customer_ID', columns='product_ids', aggfunc='sum', values='res')
weighted_matrix
```
the above code is only returning 174 columns but it should be 999 rows
in this weighted_matrix
hi would like help in #help-dumpling for pandas
Guys, I would like to know if cloud platforms (AWS, GCP, Azure, etc.) can have access to my data. If yes, which types of data?
I am trying to resample my data to yearly, but the problem is that my price column is an integer while country code and organization are strings. After resampling I used groupby to group by country and organization and then used sum, but it shows a price of 0 for every row. What should I do?
in test data, use the same exact features as in training. remember: test data is meant to simulate new data that your model has not seen yet.
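In practice that usually means fixing the feature list once and reusing it. A minimal pandas sketch with made-up data (all column names here are hypothetical):

```py
import pandas as pd

# Hypothetical toy data; column names are made up for illustration.
train = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40, 55, 80, 62],
    "noise": [1, 0, 1, 0],
    "label": [0, 1, 1, 0],
})
test = pd.DataFrame({
    "age": [29, 44],
    "income": [48, 75],
    "noise": [0, 1],
})

# Decide on the feature list once, using the training data only...
selected_features = ["age", "income"]

# ...then apply that exact list, in the same order, to both sets.
X_train = train[selected_features]
y_train = train["label"]
X_test = test[selected_features]
print(list(X_train.columns) == list(X_test.columns))
```

If the test frame still carries columns the model never saw, or is missing ones it did see, sklearn complains about the feature names at predict time.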
Thanks much.
Also, I got 0.75 AUC; is that considered good, exam-wise?
we can't give exam help here.
!rules 8
8. Do not help with ongoing exams. When helping with homework, help people learn how to do the assignment without doing it for them.
but this is a general enough question that i am comfortable answering: it depends entirely on the data. often something like 0.75 is not good enough for "business" purposes, but it depends on the cost of misclassification in your particular problem
recall the interpretation of AUROC on binary classification
in some medical studies or other research settings (e.g. social science), 0.75 might be great
but in those contexts you are often interested in probability modeling and not just accurate point predictions
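For reference, the interpretation I have in mind: AUROC is the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (ties count half). A small NumPy sketch with made-up labels and scores:

```py
import numpy as np

# Made-up labels and predicted scores, purely for illustration.
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score against every negative score.
diff = pos[:, None] - neg[None, :]
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
print(auc)
```

Read that way, 0.75 means a random positive outranks a random negative three times out of four. Whether that is good enough depends on what a misranked pair costs you.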
can you share a sample of data that reproduces the problem, and can you repost your code as text, using a code block?
also, you might want to pay attention to that warning: some of your Date values might be ints.
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
Hey @uneven thistle!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .csv attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
It says that CSV attachments are currently not allowed
yes, use the paste site i linked
How to generate unique random numbers in numpy array?
Can someone pls help?
what is this @desert oar
i want to resample the data not sample
did you understand my question @desert oar
that was in response to the other user. i understand the question, but i can't help without sample data. i don't have any way to reproduce your problem
how can i upload csv
paste it into the paste site
i only need enough rows to see the problem, i don't need the whole file
there is no option to paste it in the paste site
How exactly?
plzz help @desert oar
that isn't enough, sorry. the best i can offer is that a lot of your Price values are nan in this screenshot https://cdn.discordapp.com/attachments/366673247892275221/899628086654545931/Screenshot_141.png, which would result in a lot of 0s upon grouping and summing
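To show what I mean, a minimal pandas sketch with made-up rows (the column names are assumptions, not your actual schema). By default `sum` skips NaN, so a group that contains only NaN sums to 0:

```py
import numpy as np
import pandas as pd

# Made-up rows; "country" and "price" are placeholder column names.
df = pd.DataFrame({
    "country": ["IN", "IN", "US", "US"],
    "price": [np.nan, np.nan, 10.0, np.nan],
})

# Default behaviour: NaN values are skipped, so an all-NaN group gives 0.0.
totals = df.groupby("country")["price"].sum()

# min_count=1 requires at least one real value, otherwise the result is NaN.
totals_strict = df.groupby("country")["price"].sum(min_count=1)

print(totals)
print(totals_strict)
```

So it's worth checking `df["price"].isna().sum()` before the groupby to see how many prices are actually missing.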
anyone good with pyspark?
I have a binary file that is very large. Trying to figure out how to use pyspark to work with it.
Yes, you can store JSON, CSV, TXT, XLS, Parquet, any type of file in AWS S3
Google Object Storage in gcp
in azure Azure Blob
@silver summit it's important that you always state your actual question, as even people who know about a given topic won't volunteer themselves until they know what you're really asking. I assume you're asking this question: #python-discussion message
I'm not. The other question was regarding using custom UDFs. This question is: I just have a very large binary file that is slow to work with. I'm interested in finding out if I can use pyspark to work with it in a distributed way, possibly streaming so I additionally don't need to load the whole thing into memory. I think spark.readStream could be an option but it's not currently working.
"not currently working" is a great place to start by asking a specific question about the unexpected or error output you are getting
No, I mean, is it secure to use these services? Can they access private data?
No. I just want to know if these companies can have access to the data that I use on these platforms.
no they won't, there is a strict policy that your data won't be shared or used by anyone else apart from you
That's what I wanted to know. Thanks!
I am thinking about getting into the Cloud Industry, but I was afraid about privacy concerns
Yes
There's public, private and hybrid cloud, that's what you mean right?
no no
so in s3 we have this policy where we can allow the file to be publicly accessible, then anyone can download your file
i do suggest s3 with private policy its really good
Sure
Good to know that these services are safe
When I mean safe I am saying about data protection
yes you can use these services what is your exact functionality?
I don't know, I will start to learn them now. I just wanted to know if they were Safe
wasn't around my computer when responded, daughter is asleep now so I can add more context
when I use pyspark to read a binary file as spark.readStream.format('binaryfile').load('filename.bin') I get a schema error
```
IllegalArgumentException:
Schema must be specified when creating a streaming source DataFrame. If some
files already exist in the directory, then depending on the file format you
may be able to create a static DataFrame on that directory with
'spark.read.load(directory)' and infer schema from it.'
```
how to handle this error?
any ideas how can I use seaborn boxplot to describe this database?
can you copy and paste the whole error message as text? Please ping me if you do this.
!traceback
Please provide the full traceback for your exception in order to help us identify your issue.
A full traceback could look like:
Traceback (most recent call last):
  File "tiny", line 3, in <module>
    do_something()
  File "tiny", line 2, in do_something
    a = 6 / b
ZeroDivisionError: division by zero
The best way to read your traceback is bottom to top.
• Identify the exception raised (in this case ZeroDivisionError)
• Make note of the line number (in this case 2), and navigate there in your program.
• Try to understand why the error occurred (in this case because b is 0).
To read more about exceptions and errors, please refer to the PyDis Wiki or the official Python tutorial.
^ this is what is meant by "whole error message"
ImportError: cannot import name 'MaskedArray' from 'sklearn.utils.fixes'
This is only the very end of the error message. I asked to see the whole thing.
0.23.2
try installing 0.18.2
ok, I gave up on readStream... trying to figure out how to use just read with a binary file... so I currently have the data loaded in with spark.read.format('binaryfile').load(fn) and have the following
what is actually in the file? is it supposed to be a dataframe? what is the file format?
the content field is a list of values, the first 4 are just checks, the next 3 are header info then the rest is the data I need, trying to figure out how to strip out the first 4, then the next 3 then the remaining values
the header info basically tells me how much data there is and the type etc
yeah the data is a list of numbers that will eventually need to be shaped into a dataframe
file format is binary
What r req for ai and how to start?
@silver summit i think readStream is only for dataframes, i don't know if there's an API to parse an arbitrary huge blob of bytes into a dataframe. you might have to fall back to reading each line into an RDD, and parsing each line into a Row, and then using createDataFrame to collect all those rows into a dataframe
it'd still be "streaming" but it's not quite as elegant as declaratively specifying a schema
How can I find the output node names of an xception model
what's the correct way to do that?
@desert oar well I have the data in a dataframe now, I just need to transform it to a series instead of a list of values in a single row of a column
(Keras)
Depends on what you are trying to do
idk machine learning?
lol bro...
are they all the same lengths? i think there's a way to explode a single array-valued column into separate columns, but it's ugly. i know that you can "explode" an array-valued column to rows, and then re-collect them into columns with a lot of expensive joins. i'd personally go back to the RDD.map version w/ a function that returns a Row
absolutely don't do that
you need to understand fundamentals before you just jump into stuff like that
take an online course about machine learning, get a sense for statistics, look at sklearn and kaggle
something like that? But still the error
just load data into pandas and try to figure out what youre looking at, means, stds, sizes, shapes, plots etc...
daughter is awake... be on later
i agree that you should definitely learn some data viz and data manipulation in pandas along w/ tensorflow
but i don't think it's that bad to start poking around in TF with image classification or whatever
but you will very quickly start wishing you knew more principled approaches to problem solving, and then you should start focusing on stats and more machine learning fundamentals
mhm
it might be more satisfying to do a bit of hands-on playing around with tensorflow, then spend some time on fundamentals, then spend some more time playing around, etc.
so what do u mean by "playing around", im not sure I can build anything
AttributeError: module 'matplotlib.pyplot' has no attribute 'set_xlabel'
what
It has
what I am missing?
@median fulcrum plt.xlabel, set_xlabel is for the underlying Axes object, e.g. plt.gca().set_xlabel
but, for example, in this code I can't put the set_xlabel at the end
How I would write in that?
btw I tried with ax
plt.hist(x='loan', data=credit_risk)
plt.xlabel('Loan ($)')
maybe like that? i don't use the plt api much
did it work with ax? that should work
Ok thank you
don't work
as you can see
how to fix this?
:/
if someone know ping me pls
got it!
yeah this is what i had in mind. i thought plt.xlabel would do it but i guess i was wrong
I think is outdated that
idk
but when I tried just got an error
plt.xlabel is a function, so it has to be called, e.g. plt.xlabel('Loan ($)'), not assigned to. like i said, i don't use the pyplot api much anymore
It worked so I'm not gonna change that
that's how i'd do it 99% of the time
you can also just do ax = plt.gca() to get the current axis object instead of plt.subplots, that's up to you
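A small sketch of the two labelling styles discussed above, assuming a made-up `credit_risk` frame with a `loan` column (the values are invented). `plt.xlabel` is a pyplot function that labels the current Axes, while `set_xlabel` is a method on the Axes object itself:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the credit_risk data from the conversation.
credit_risk = pd.DataFrame({"loan": [1000, 2500, 2500, 4000, 8000]})

# pyplot style: plt.xlabel(...) labels whatever the current Axes is.
plt.figure()
plt.hist(credit_risk["loan"])
plt.xlabel("Loan ($)")
pyplot_label = plt.gca().get_xlabel()

# object-oriented style: create the Axes explicitly and call its methods.
fig, ax = plt.subplots()
ax.hist(credit_risk["loan"])
ax.set_xlabel("Loan ($)")
oo_label = ax.get_xlabel()

plt.close("all")
```

Both variants end up setting the same label; the object-oriented form tends to be easier to follow once a figure has more than one Axes.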
oh
true
hi everyone
I was learning pandas
and I wonder if there is any way to use DataFrame.loc with both a list of columns and a slice at the same time
like I want to have something like df.loc[:, ["Name", "LastName", "Age":"School", "Interest":]]
however this does not work
How can I do something like this?
isn't iloc for int indexes of columns?
not really, like I use it to select a few column only
wait my mistakes
it's just df[{'a','b'}]
oh wait yeah
that works
but like then what's the purpose of loc?
also can I do slices with this?
like every column from Name to Age?
iloc is for when you want to pick rows and columns by integer position (loc picks them by label)
look
I wanna get these columns in this order
Name, LastName, Age, .... , School, Interest, ....
where those ... means anything in between
how can I do it in one line?
there is this syntax
df.loc[:, ["Name", "LastName", "Age"]]
and also there is this:
df.loc[:, "Age":"School"]
how can I use this both at the same time?
for that it will be better to choose iloc
df.iloc[:, :3]
where the second slice picks columns by position, here the first 3 (ie columns 0 -> 2)
what if the data is changing in column positions?
then I need some way to get column indexes first
you're using a set here; that isn't normally necessary, and a list is the usual way to write it: df[['a', 'b']]
what do you use instead? curious
i don't think you can combine those, but you can try writing slice("Age", "School") instead of "Age":"School" - the : syntax is special syntax that expands to slice()
I did, didn't work
the object oriented api, so fig, ax = plt.subplots() ; ax.foo() ; ax.bar() ; plt.show()
oh. gotcha
maybe it's just not supported by pandas. it would be convenient, i agree
you are trying to do something like this, right? df.loc[:, ["Name", slice("Age", "School")]]
Hi, I have a question.
Suppose we want to use face recognition in a mobile application using Python language, what library do we use or how do we link these both together?
yes exactly
yeah you'd have to write your own function to "expand" that
what if I do something like
df = pd.concat([df.loc[:, ["Name", "Age"]], df.loc[:, "Age":"School"]])
but then it doubles the rows each having some columns
oh wait wrong
this I mean
that's a good idea, you can generalize it like this:
def slice_columns(df, *column_specs):
    return pd.concat([df.loc[:, spec] for spec in column_specs], axis=1)
cause you do age : school
yeah something like this
I actually found something cool
I did
pd.merge(df.loc[:, ["Name", "Age"]], df.loc[:, "Age":"School"])
however, the downside is that you can only do 1 slice or you need to have another merge inside
any better way to merge multiple objects instead of just 2 will fix this problem
yeah you still have to de-duplicate columns after, i think the pd.concat version lets you do that more easily
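A sketch of the `pd.concat` idea with the de-duplication step included. The frame below is invented (the `City` and `Extra` columns are assumptions, the other names come from the conversation), and `slice(...)` stands in for the `"Age":"School"` syntax, which cannot be written inside a list:

```python
import pandas as pd

def slice_columns(df, *column_specs):
    """Concatenate several .loc column selections side by side."""
    parts = [df.loc[:, spec] for spec in column_specs]
    combined = pd.concat(parts, axis=1)
    # Keep only the first occurrence of any column that was selected twice.
    return combined.loc[:, ~combined.columns.duplicated()]

# Made-up single-row frame for illustration.
df = pd.DataFrame({
    "Name": ["Someone"], "LastName": ["SomeLast"], "Age": [30],
    "City": ["X"], "School": ["SomeSchool"], "Interest": ["chess"], "Extra": [1],
})

# A list of labels, a bounded label slice, and an open-ended label slice.
result = slice_columns(
    df, ["Name", "LastName"], slice("Age", "School"), slice("Interest", None)
)

# "Age" is requested twice here but appears once in the output.
deduped = slice_columns(df, ["Name", "Age"], slice("Age", "School"))
```

Label slices with `.loc` include both endpoints, which is why `School` shows up in the result.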
concat produces wrong set
the bad part about defining a new function is that you can't use : syntax anymore
wrong how?
it is like
Name LastName Age School
Someone SomeLast None None
None None SomeAge SomeSchool
where there is actually only 1 entry
Someone SomeLast SomeAge SomeSchool
that seems odd, did you forget axis=1?
maybe
lemme try
oh wait now it works
I am like so confused about those axis things
like I have a software engineering kind of background
and I see all these multi-dimensional arrays as nested arrays
and it makes understanding axes hard for me
I need to research more and get comfortable with it
it's good enough to think of a DataFrame as a collection of Series in a trenchcoat
the columns and index are labels for columns (1 column = 1 series) and rows, respectively
a Series has an index, those are element/row labels
all Series in a DataFrame share an index
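The "collection of Series" picture can be poked at directly. A minimal sketch with an invented frame (the labels `x`, `y`, `z` and the values are assumptions):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}, index=["x", "y", "z"])

col_a = df["a"]   # selecting one column gives a Series
col_b = df["b"]

row_labels = list(df.index)        # labels for the rows
column_labels = list(df.columns)   # labels for the columns, one per Series

# Every column Series carries the same row labels as the frame.
shared = col_a.index.equals(df.index) and col_b.index.equals(df.index)

# Each Series keeps its own dtype.
dtypes = df.dtypes.astype(str).to_dict()
```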
the pandas docs don't have a clear explanation of this data model, so it's understandable if you don't get it right away
the index/columns thing itself is an interesting beast.. that's an instance of the Index class, which is array-like, but also acts like keys in a lookup table
(i'm not sure if internally it uses a b-tree index or hash index or something else, maybe it varies)
you can also have a multiindex, where each element is conceptually a tuple, but it also acts like a collection of individual Index array things
i've been wanting to write a guide to this stuff for months, but every time i try i find it very difficult to explain clearly, and very difficult to design a sensible learning path through it
i think everyone learns pandas by just stumbling around until things make sense
my confusion kinda comes from numpy
anyway I am feeling pretty sleepy rn so I probably won't understand anything rn XD
so like I guess I should continue tomorrow
I'll also read your explanation and continue this discussion tomorrow
thanks for your help tho
you're welcome
everything is basically like that
since this chat is not that active I guess I can find your message easily after even a day XD
true, although things with better docs tend to reduce the amount of "stumbling" needed
pandas has a lot of it
yeah
it can be very active sometimes. i tend to write things down in a note file and copy the discord message link
by the time I was talkin my chrome didn't load google for some reason so i couldn't google stuff so I could find concat and merge easier lol
hmm ok I'll do that
alright I'll come back tomorrow bye
@desert oar hey salt, did you know how to unpack a list from a single row in a pyspark dataframe and return a series?
what do you mean by return a "series"? do you want to turn every array element into a separate column, or a separate row?
the latter has a built-in method, explode
the former i definitely had to do in the past but i don't remember how and i do remember it was ugly
separate row
@desert oar ayyyy, nice!! tyty
I'm not clear on when spark actually runs computation. Transformations are just added to the execution graph but not actually run until results need to be returned to the driver. In this case, ignoring the show, does the explode count as a transformation?
this is the eternally shitty part about spark. you kind of just have to know, and the docs don't really state it clearly. i don't believe explode triggers computation, but you might want to repartition if the results are "unbalanced" in size
I have so many jobs at work that just take days b/c I wrote them shitty... really need to figure this all out.
using a lot of pandas udfs b/c I just dunno how to do stuff
if you post some specific examples i might be able to help
neither, currently. but ds in the past
oh, swe then?
yep, less work for more money!
haha, yeah I'm thinking that too
swapping over to MLE in the next 6mo or so, similar pay range as swe
coinbase throwing 400k at senior mle's right now
i didn't know the numbers were that high. i also don't know how much i want to work at coinbase
haha yeah for sure, but any big tech company will pay very well for MLE in general
anyone knowledgeable on scipy stats cdf and pdf? for calculating p value of F statistic
@jade acorn most of these functions should return a tuple of the statistic and p
can you give an example?
I need to calculate the bottom part manually (Prob > F) and i already have the F(2,247) value which i also calculated manually, The F-value is the Mean Square Model divided by the Mean Square Residual yielding F=3.48, The p-value associated with this F value is 0.0325
how did you define F?
F = (ssreg/modeldf)/(ssres/resdf) , ssreg being SS model , modeldf being models degrees of freedom,ssres being SS residuals, and resdf is the residuals degrees of freedom
and SS is Sum of Squares
sure, this is from scipy?
the above picture is from the program called Stata
oh
i just want to know how to do it in python
this is from some python code i wrote, as u can see its almost the same but i just cant figure out the p value of the F
check scipy docs and sklearn docs a bit more, I'm certain it's there, my daughter just woke up from her second nap... gotta go get her, will be on much later today if you still need help
@jade acorn it's better to ask your entire question, don't wait for someone to interview you in order to figure out what you want
you know this by now, i think
and to answer the question, you would do something like this to get an object representing the F(2, 247) distribution:
import scipy.stats as sstats
dist = sstats.f(2, 247)
on which you can then use any of the methods of rv_continuous as defined in https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.rv_continuous.html
typically for a hypothesis test you want the upper tail probability of the observed statistic, i.e. 1 - cdf
scipy calls it the "survival function", sf (the "percentage point function", ppf, is the inverse cdf and goes the other way, from a probability to a quantile)
so you might write sstats.f(2, 247).sf(3.475101) to get the p-value associated with the test statistic 3.475101
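A short sketch of the calculation for the numbers quoted in the conversation (F = 3.475101 with 2 and 247 degrees of freedom). It uses the frozen distribution's `sf` method for the upper tail probability and shows `ppf` only for contrast; variable names are illustrative:

```python
from scipy import stats

f_stat = 3.475101            # observed F statistic from the conversation
df_model, df_resid = 2, 247  # model and residual degrees of freedom

dist = stats.f(df_model, df_resid)   # frozen F(2, 247) distribution

# Upper tail probability of the observed statistic: Prob > F.
p_value = dist.sf(f_stat)            # same as 1 - dist.cdf(f_stat)

# ppf is the inverse cdf: it maps a probability to an F value,
# e.g. the critical value for a test at the 5% level.
critical_5pct = dist.ppf(0.95)

print(round(p_value, 4))
```

The result should land close to the 0.0325 that Stata reports for the same F value.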
Before feeding data to an algorithm, is it necessary to transform the features to normal distribution?
Hi, I have question about inputs in LSTM . What should be this if I have 20 data for inputs in model.fit
Could it be input_shape(20), and later model.predict(array of 20)? ( sorry for my english)
no, it isn't
this looks useful, i haven't tried it yet. nice collection of things you'd normally want to use R for
Good evening, I built an algorithm that compares two methods of finding local minimum of given functions.
Gradient Descent and Newton's method.
I run the algorithms with the same parameters and the following results were achieved.
In the same number of iterations Gradient descent completed it faster (~4s) and got closer to the local minimum.
Newton's method computed in ~7.25s and got further from the optimum.
I thought that Newton's method would achieve better results in the same number of iterations. I mean, it is more computationally expensive, but still....
Does anyone have any thoughts on that?
I suspect it depends on the step_size parameter. Both runs used the same value for it, but comparing the two methods at an equal step size seems wrong to me; I just don't know how to explain that to my teacher
@desert bear i think your intuition is correct, but can you show the code? i never thought of the newton-raphson method as having a configurable step size
is it not x1 = x0 - f'(x0) / f''(x0)?
Basically, I'm using algorithms given by my teacher
This Beta parameter is my step_size
normally i would set B_t to 1
that's the "theoretical" version
i think the point of using the hessian is that you can take fewer and bigger steps
Okay, I set that in one of my tests, and It found local minimum in one iteration
yep! it is actually the "optimal" step size for a quadratic function
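A toy sketch of why a step size of 1 is special for Newton's method, using an invented one-dimensional quadratic f(x) = (x - 3)^2 + 1 rather than the functions from the assignment. The helper names and the `beta` parameter are assumptions that mirror the step-size parameter described above:

```python
def f_prime(x):
    # first derivative of f(x) = (x - 3)**2 + 1
    return 2.0 * (x - 3.0)

def f_second(x):
    # second derivative, constant for a quadratic
    return 2.0

def newton_step(x, beta=1.0):
    # damped Newton update: beta = 1 is the classic Newton step
    return x - beta * f_prime(x) / f_second(x)

def gradient_step(x, beta):
    # plain gradient descent update with step size beta
    return x - beta * f_prime(x)

x0 = 10.0

# Newton with beta = 1 lands exactly on the minimum x = 3 in one step.
newton_x = newton_step(x0, beta=1.0)

# Gradient descent with a small step converges gradually.
gd_small = x0
for _ in range(50):
    gd_small = gradient_step(gd_small, beta=0.1)

# Gradient descent with the same step size as Newton overshoots:
# on this function it bounces between 10 and -4 forever.
gd_unit = x0
for _ in range(50):
    gd_unit = gradient_step(gd_unit, beta=1.0)
```

On less friendly functions the same overshoot can grow without bound, so one value of the step size rarely suits both methods.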
Okay, let me read about it, thanks
https://sites.stat.washington.edu/adobra/classes/536/Files/week1/newtonfull.pdf see also an example of its use for fitting maximum likelihood
how about my question?
i think it was the developers' core motive! they essentially went "R is great for all these, so why not having something equivalent in Python?"
so they built it upon pandas iirc, so it's cut down a lot of time for me doing stats
yep that appears to be an explicit goal, scipy stats + pandas + a lot of r-like convenience functions
validated against R equivalents too so that's one for reliability
Anyone know how to get the output node names from an xception model?
im trying to figure out how to make data pipelines but everytime I do this I get the error that "income" is not in the dataset, can someone tell why?
@modest timber i'm not sure if i understand. can you be more specific? input_size is the number of features at each time step, not the length of the input sequence
https://stackoverflow.com/a/45023288/2954547
https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
!paste this is too small to read, can you re-post as code, either using our paste site or a code block?
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
include the error output as well
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
^ read the box
same with the box under my post, !paste just generates the box with instructions
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
c_names = ["age", "workclass", "fnlwgt", "education", "education-num", "maritalstatus", "occupation", "relationship", "race",
"sex", "capital-gain", "capital-loss", "hours-per-week", "nativecountry", "income"]
df = pd.read_csv(url, names=c_names)
column_trans = make_column_transformer((
OneHotEncoder(sparse=False), ["workclass", "occupation", "nativecountry"]),
(LabelEncoder(), ["income"]),
(OrdinalEncoder(categories=[' Preschool',' 1st-4th',' 5th-6th',' 7th-8th',' 9th',' 10th',' 11th',' 12th',' HS-grad',
' Prof-school', ' Assoc-acdm', ' Assoc-voc', ' Some-college',' Bachelors',' Masters', ' Doctorate']),
["education"]), remainder="passthrough")
X = df.drop(["maritalstatus", "relationship", "race", "sex", "income"], axis="columns")
y = df.income
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
svmm = svm.SVC()
pipe = make_pipeline(column_trans, svmm)
scores = cross_val_score(pipe, X, y, scoring="accuracy",cv=5)
idk if thats the right way
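A likely cause of the "income" error, offered as a guess rather than a confirmed diagnosis: the column transformer is told to encode `income`, but `income` has already been dropped from `X` before the pipeline sees it. `LabelEncoder` is also meant for the target `y`, and `OrdinalEncoder` expects `categories` as a list with one ordered list per column. A sketch on a tiny invented frame (the rows are made up and only a few of the real columns are kept):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.svm import SVC

# Tiny made-up frame borrowing a few adult-dataset column names.
df = pd.DataFrame({
    "age": [25, 38, 28, 44, 18, 34, 29, 63],
    "workclass": ["Private", "Private", "Local-gov", "Private",
                  "Private", "Self-emp", "Private", "Self-emp"],
    "education": ["HS-grad", "Bachelors", "Masters", "HS-grad",
                  "Bachelors", "Masters", "HS-grad", "Bachelors"],
    "income": ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K", "<=50K", ">50K"],
})

X = df.drop(columns=["income"])
# Encode the target outside the pipeline; the transformer only sees X.
y = LabelEncoder().fit_transform(df["income"])

column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    # categories is a list containing one ordered list per encoded column
    (OrdinalEncoder(categories=[["HS-grad", "Bachelors", "Masters"]]), ["education"]),
    remainder="passthrough",
)

pipe = make_pipeline(column_trans, SVC())
scores = cross_val_score(pipe, X, y, scoring="accuracy", cv=2)
print(scores)
```

With the real data, the education list would need every level that occurs in the column, in order, and values in that file appear to carry a leading space.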
Thanks a lot, you are extremely knowledgeable. These links are very useful. I decided to test both methods on different function (Rosenbrock function). Firstly I was nailed down, because Newton's method was jumping significantly, but then I found that it is correct behaviour (https://www.numerical-tours.com/matlab/optim_2_newton/).
One thing is not clear for me. How did the teacher want me to compare the results of both algorithms for the same parameters.
Newton's gives best result for step_size=1, but when gradient descent is fed with this parameter it produces points of coordinates' values 1e+20, basically it makes too big steps. It seems incomparable.
Maybe that's part of the exercise?
See how instructive it was to try the different sizes?
I bet if you tried gradient descent with the same step size, it would go all over the place
Yea, maybe, I will sum up all these observations in my report. The teacher probably won't be happy, since he seems like a "do simple - simple means less reading for me" type. Thanks a lot for helping me understand it better
@desert oar i come here because I am kinda confused - I use a list with 20 input signals - input_shape=(20,1)
but i got error
ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 1)
predicted_stock_price = model.predict(X_predict[0:20])
can you show the code for the model? what is the shape of X_predict?
Hi, could anyone help me with gradient descent for a summation function like the one in the png? I only see explanations for the x^2 function, but I cannot find any information on how to deal with this one. Any ideas?
for i in range(0,10):
a= X_predict[i:20+i]
print(a.shape)
predicted_stock_price = model.predict(a)
shape = 20,1
model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1],1)))
shape = 20,1
ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 1)
I'm thinking of splitting the waveform into 800ms windows, running a dilated causal convolution on each independently, and then averaging the outputs
What do you think?
it is expecting a 3-dimensional input (batch, time steps, features), but you gave it a 2-dimensional one
that is, each time step should be [x1, ..., x20]
https://stats.stackexchange.com/a/277169/36229 @modest timber
I dont understand, you show me list with 1-dimension, but i need 3,
Maybe i need to add simply the 1 value, and should have shape 20,1,1
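A numpy-only sketch of the shape issue, assuming the usual Keras convention that an LSTM with `input_shape=(20, 1)` wants batches shaped (samples, 20, 1). `X_predict` below is a made-up stand-in; the point is that a single window of shape (20, 1) needs a leading batch axis, giving (1, 20, 1) rather than (20, 1, 1):

```python
import numpy as np

# Stand-in for X_predict: 30 values with one feature each.
X_predict = np.arange(30, dtype=float).reshape(-1, 1)

window = X_predict[0:20]           # shape (20, 1): 20 time steps, 1 feature
batch = window[np.newaxis, ...]    # shape (1, 20, 1): one sample for model.predict

# The ten sliding windows from the loop, stacked into one batch (10, 20, 1).
windows = np.stack([X_predict[i:20 + i] for i in range(10)])

print(window.shape, batch.shape, windows.shape)
```

Passing `batch` (or `windows`) to `model.predict` instead of the bare window is the change this suggests.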
hi, can sequential API handle inputs of nonlinear relationship in my case i have input variables as flow rate and temperature?
How do I write a huge data list (say with 10 million data points) into a .txt file? When I do it for a few hundred points, it's fine, but for millions, it stores it in the shortened form "[1.2 3.4 1.6 ... 1.4 1.2]".
I tried writing it element by element, but the for loop makes it slow. Any way to directly write the whole list?
!d numpy.savetxt
numpy.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ', encoding=None)
Save an array to a text file.
Aha, thanks!
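The shortened "..." form comes from writing the array's string representation, which numpy abbreviates for large arrays. A small sketch of the `numpy.savetxt` route, with a smaller random array standing in for the 10 million points and a temporary file path that is only illustrative:

```python
import os
import tempfile

import numpy as np

# Stand-in for the large data list (kept small so the sketch runs quickly).
data = np.random.default_rng(0).random(100_000)

path = os.path.join(tempfile.mkdtemp(), "values.txt")

# Writes every value, one per line, with six decimal places.
np.savetxt(path, data, fmt="%.6f")

# Reading it back confirms nothing was abbreviated.
loaded = np.loadtxt(path)
```

A plain Python list can be passed to `np.savetxt` directly; it is converted to an array first.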
i have created a recommendation engine can anyone go through it please
and let me know if its correct or not
Hello, can i combine 2 image datasets together for multi class classification and train them with a cnn model?
yeah
the thing as much as someone explains, for some stuff you'll understand better when u know how it works
pandas is a complex library, but I guess it's possible to figure out how it works in a high level
that way it helps me understand better
like now I know that those are series, but I don't know where that axis goes and what it does
in the docs it only says the axis of operation and nothing else
I'll figure it out by stumbling around easier than to try to find some form of article describing it
Hi all- really dumb question:
I am using matplotlib and I have a 3x14 table. For column 3, my title is significantly longer than for columns 1 and 2, and unfortunately if I leave it, the text inside the table becomes unreadable. Changing the text size simply scales the column titles too, so the problem remains. If I shorten the column 3 title, the issue resolves. Is there any way to resolve this without shortening my header?
Hi! I'm trying to make a speech recognition script for a personal project, and I've decided on mozilla deepspeech. My problem is, I don't really understand the audio handling, and I want to remove the VAD feature from this script so I can manually control when it records:
https://github.com/mozilla/DeepSpeech-examples/tree/r0.9/mic_vad_streaming
Hello I have a dataframe in which one column has 'CE' and 'PE' values in that column
I have to separate this column based on these values
For eg
'CE' values are saved in different data frame and
'PE' values saved in another data frame
Ping me when replying
My code
df_chunks = pd.read_csv(f'{input_path}{input_file}{extension}' , engine='python', chunksize=500000, names=['Msgtype', 'Activity Type', 'Transaction Time', 'script_name', 'expiry', 'strike_price', 'call/put', 'Exchange', 'Token', 'Buy/Sell', 'Buy Order Number', 'Sell Order Number', 'Price', 'Qty', 'price_in_rupees', 'lot'])
i=0
# iterate over the reader itself; each chunk is a DataFrame
for chunk in df_chunks:
    print('chunk')
    print(chunk)
    # for i in chunk['call/put']:
    #     put_val = chunk.loc[chunk['call/put'] == 'PE']
    #     call_val = chunk.loc[chunk['call/put'] == 'CE']
    #     print(put_val)
    #     print(call_val)
    #     put_val.to_csv(f'{output_path}{output_file_put}{extension}', index =False, header = None, mode = 'a')
    #     call_val.to_csv(f'{output_path}{output_file_call}{extension}', index =False, header = None, mode = 'a')
    #     break
    # break
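One way the CE/PE split could look, sketched on a tiny in-memory CSV so it runs anywhere. The three column names are a subset of the real ones, the rows are invented, and the `StringIO` buffers stand in for the two output files (with real paths, `mode='a'` as in the commented-out lines would append each chunk):

```python
import io

import pandas as pd

# Small in-memory stand-in for the real 32 GB file.
csv_text = "A,100,CE\nB,200,PE\nC,300,CE\nD,400,PE\nE,500,CE\n"
names = ["script_name", "strike_price", "call/put"]

# These buffers play the role of the two output CSV files.
call_out, put_out = io.StringIO(), io.StringIO()

# Each chunk is an ordinary DataFrame, so a boolean mask per chunk is enough.
for chunk in pd.read_csv(io.StringIO(csv_text), names=names, chunksize=2):
    chunk[chunk["call/put"] == "CE"].to_csv(call_out, index=False, header=False)
    chunk[chunk["call/put"] == "PE"].to_csv(put_out, index=False, header=False)

call_rows = call_out.getvalue().splitlines()
put_rows = put_out.getvalue().splitlines()
```

No inner loop over the column values is needed; the comparison `chunk["call/put"] == "CE"` already works on the whole chunk at once.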
def data_type_format(data, indexes):
    "remove the header row and convert all the columns to type float"
    headless = data[1:, :]
    # sinker = headless[:, indexes]
    # floater = np.delete(headless, indexes, 1)
    # sunk = sinker.astype('<U30')
    # floated = floater.astype(float)
    indices = np.arange(9)
    mask = np.delete(indices, indexes, 0)
    mask_list = list(mask)
    a = headless[:, indexes].astype('<U30')
    b = headless[:, mask_list].astype(float)
    headless.astype(object)
    # answer = np.concatenate((sunk, floated), axis=1)
    # np.sort(answer)
    # for index in data:
    # answer.append(tuple(index))
    return headless
I'm not sure how to get two different dtypes in the same array
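A plain numpy array has a single dtype, so one hedged option is an `object` array whose elements are ordinary Python strings and floats (a structured array or a pandas DataFrame are alternatives). The sketch below uses a made-up 3-column table instead of the real 9-column data, and `str_cols` plays the role of `indexes`:

```python
import numpy as np

# Made-up raw table: header row plus two data rows, everything as strings.
raw = np.array([
    ["name", "x", "y"],
    ["a", "1.5", "2.0"],
    ["b", "3.0", "4.5"],
])

headless = raw[1:, :]   # drop the header row
str_cols = [0]          # columns that should stay as text
num_cols = [i for i in range(headless.shape[1]) if i not in str_cols]

# An object array can hold a different Python type in each cell.
mixed = np.empty(headless.shape, dtype=object)
mixed[:, str_cols] = headless[:, str_cols]
mixed[:, num_cols] = headless[:, num_cols].astype(float)

print(mixed)
```

Object arrays give up most of numpy's speed, so for heavier numeric work keeping the float columns in their own array may be the better trade.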
Hey,
In pandas, I have time, value and mark_upcoming_change columns, I want to calculate the amount of time a column was on a specific value, as seen here
right column is the one I want to calculate
heya
i wanted to integrate an alarm clock system in a virtual assistant
how do i do
?
this isn't really a data science question
then tell me where i should ask, i'm a beginner
idk
what is the difference between a and data science
Hey stelercus
When I write in CSV file using pandas some of columns are not completely filled
the highlighted part is getting empty
Can u please help me to understand this?
Why I am getting this way
In my original data file i have complete data
@lone drum I would need to see the original CSVs (no screenshots) and the code you used to create this table (no screenshots). Please ping me if you provide that.
Send a screenshot of how it turned out on Pandas. So people can easily understand the error message or what went wrong
Please don't ask for screenshots of pandas stuff as it is prohibitively difficult for people to replicate data in screenshots.
the "axis of operation" is a numpy concept. you can think of it as the "the axis that is consumed" when performing an operation.
this is easy to see with an "aggregation" operation like DataFrame.sum:
>>> df = DataFrame({'a': [1,2,3], 'b': [4,5,6]})
>>> df.sum(axis=0) # axis='index'
a     6
b    15
dtype: int64
this means "apply the .sum operation by iterating over the 0th axis (the index)".
DataFrame.apply is a bit trickier:
>>> df = DataFrame({'a': [1,2,3], 'b': [4,5,6]})
>>> add_one = lambda y: y + 1
>>> df.apply(add_one, axis=0) # axis='index'
   a  b
0  2  5
1  3  6
2  4  7
the add_one function is applied to each column, thereby "consuming" the entire index for each column
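For comparison, a sketch of the same frame with `axis=1`, where the columns are the axis being consumed and there is one result per row:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

col_totals = df.sum(axis=0)   # consumes the index: one result per column
row_totals = df.sum(axis=1)   # consumes the columns: one result per row

# apply follows the same rule: with axis=1 the function gets one row at a time.
row_ranges = df.apply(lambda row: row.max() - row.min(), axis=1)

print(row_totals)
```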
How I can show u original CSV?
It is 32 gb in size
just grab the first 10 rows?
yeah actually I found it in the docs, it said concat over axis 0 will concat indexes but axis 1 will concat columns
I guess that also applies to any other function
thanks for the info
pandas might be the wrong tool if you're dealing with that much data.
Hey @lone drum!
It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
How I can share you sample data for same
hello I have a question if my model is overfitting or not
I am doing CAE
does this seem like its overfitting? I see gap its "big" in a way but it is only 0.0020 difference between them
should I support that is overfitting or not?
axis X is epochs
!paste use this site
hi everyone
I have some weird data that loads very inefficiently when I import it into pandas using json_normalize
simple format of the json is
[
{
"id": 12342,
"type": "node",
"tags": [ "amenity":"table", "size":2 ]
}
]
id and type are always there
however tags can be empty or have any number of k/v pairs, and there are a total of 110,000 possible keys (eg. amenity)
How can I properly import this to a pandas DataFrame with efficient access to tags?
[ "amenity":"table", "size":2 ] isn't valid syntax in either python or json. did you mean { "amenity":"table", "size":2 }?
i would load it like this to start:
data = [
{
"id": 12342,
"type": "node",
"tags": { "amenity":"table", "size":2 }
},
{
"id": 93823,
"type": "node",
"tags": {}
}
]
df = pd.DataFrame(data)
>>> df
      id  type                             tags
0  12342  node  {'amenity': 'table', 'size': 2}
1  93823  node                               {}
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      2 non-null      int64
 1   type    2 non-null      object
 2   tags    2 non-null      object
dtypes: int64(1), object(2)
memory usage: 176.0+ bytes
yeah that was a typo
hmm
but then tags are dictionary
oh wait
right, efficient depends heavily on what you're trying to do
so basically it will be a dataframe inside another dataframe right?
no, it's literally a dict in each element of the tags column
what if I convert that dict to a dataframe first?
maybe since dataframe is faster than dict?
faster for what?
I need to run a lot of search stuff on the tags part
so I may as well need to use some dataframe features other than performance
correct me if i am wrong, can't we use the flatten_json method here @desert oar
how can I do that fast?
like I don't wanna run a for loop doing df["tags"] = pd.DataFrame(js[i]["tags"])
there are a lot of ways to manipulate data like this! but without some specific use case, it's all guesswork
it sounds like you want to "explode" these into key-value pairs or something?
I basically wanna store them as dataframe instead of dict
how can I make it do it without running a for loop
is there like anything builtin?
ok
the dataframe would be just the normal thing
like we have this parent dataframe called pf
and pf["tags"] is another dataframe with simple format like Columns: key and value
for example for my example there will be 2 rows:
key      value
amenity  table
size     2
I want something like this
anyone know how to explode a bytearray? I have a bytearray in a single row of a pyspark dataframe and I need to get each value of the bytearray into a new row, the error I'm getting is
AnalysisException: cannot resolve 'explode(content)' due to data type mismatch: input to function explode should be array or map type, not binary;
'Project [explode(content#131) AS List()]
+- Relation [path#128,modificationTime#129,length#130L,content#131] binaryFile
command I'm using is just df.select(F.explode('content'))
can you give a complete dataframe example please
ok
id type key value
1 node amenity table
1 node size 2
2 node size large
like this?
this will work, but isn't it bad to have a row duplicated several times?
I was thinking of this format
id type tags
0 1234 node <pd.DataFrame object>
1 8897 way <pd.DataFrame object>
where that object has this format
key value
0 amenity table
1 size 2
it's not really bad, don't over-optimize. if the id is the index, it won't be duplicated. you can keep the "metadata" separate from the "tags" if you want
id key value
1 amenity table
1 size 2
2 size large
id type
1 node
2 node
what about this?
you can do that, but i don't recommend it
I guess I should try these
the thing is I have over 100 thousand of rows, which I want to run some analysis on, fast, without using that much memory
so I am trying to optimize it as much as possible
also I guess I should finish my pandas tutorial vid before continuing on
what's the reason? just curious
a dataframe inside each element is not really better than a dict in each element, in that pandas has to loop slowly over the series of dataframes
don't fall into the trap of "pandas fast, more pandas more fast"
dict lookups will be faster than dataframe lookups in most cases anyway
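to illustrate, a minimal sketch of searching the dicts directly, with no nested dataframes. it reuses the two example rows from earlier; the names `has_amenity` and `size_is_2` are made up for the example, and `.map` is still a plain Python loop under the hood, so treat this as a baseline rather than an optimization:

```python
import pandas as pd

data = [
    {"id": 12342, "type": "node", "tags": {"amenity": "table", "size": 2}},
    {"id": 93823, "type": "node", "tags": {}},
]
df = pd.DataFrame(data)

# Each element of df["tags"] is a plain dict, so ordinary dict operations work.
has_amenity = df["tags"].map(lambda tags: "amenity" in tags)
size_is_2 = df["tags"].map(lambda tags: tags.get("size") == 2)

print(df[has_amenity])
print(df.loc[size_is_2, "id"].tolist())
```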
so I am trying to optimize it as much as possible
have you heard the quote, "premature optimization is the root of all evil"?
that said, exploding this to a key-value format like i described would probably make it easier to work with
it really depends on what kinds of operations you're trying to do
yeah, I am done with the code, which has no numpy/pandas in it
decided to learn pandas to optimize it
I also saw this a lot on google
I'll try it
it might be instructive to post the non-pandas version
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
I can use conv to convert the bytearray to a value, but need to map the convert to each element then explode somehow
I have it on github
but it's not as simple
I'm also trying to implement a data engine, which turned out to be super slow when I wrote it on my own
I used that in most of my code
others are just for loops basically
so
I can tell you more details
I wanna be able to:
get all rows that have a specific key <and any other>
get all rows with a specific k/v pair
get all rows that have at least one key
search by regular expression on key names and get all that match
nothing else
my strategy for this stuff is: write a naive udf to do the conversion, then figure out a way to use spark-isms later. so df.withColumn('parsed_content', parse(F.col('content'))) where parse is your udf that returns a list/array, which can be exploded
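a rough sketch of that strategy, assuming the `content` column is Spark's binary type (which reaches a Python udf as a `bytearray`). `parse_bytes` and `explode_bytes` are hypothetical names, and the udf is the naive version, not a tuned one:

```python
def parse_bytes(content):
    """Turn a bytes/bytearray value into a list of ints (one per byte)."""
    if content is None:
        return None
    return [int(b) for b in content]


def explode_bytes(df, column="content"):
    """Return one row per byte of `column`; needs pyspark installed."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, IntegerType

    # Wrap the plain function as a udf that returns an array, which explode accepts.
    parse = F.udf(parse_bytes, ArrayType(IntegerType()))
    return (
        df.withColumn("parsed_content", parse(F.col(column)))
          .select(F.explode("parsed_content").alias("byte"))
    )


# The parsing step can be checked without a Spark session.
print(parse_bytes(bytearray(b"\x01\x02\xff")))
```

`explode_bytes(df)` would be called on the dataframe loaded with the binaryFile reader; only the pure-Python part is exercised here.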
what's spark isms?
that was a response to the other user. they are asking about Apache Spark
@desert oar yeah that sounds reasonable, be back in a couple hours to review this
oh I didn't realize bruh
so how can I actually do this?
what's the function/way?
is it okay that the val loss is the same as the loss?
I know this is not overfitting
but it is another problem?
!eval ```python
import pandas as pd
data = [
{ "id": 12342, "type": "node", "tags": { "amenity":"table", "size":2 } },
{ "id": 93823, "type": "node", "tags": { "color":"blue" } },
]
id_column = 'id'
tag_column = 'tags'
meta_columns = ['type']
df = pd.DataFrame(data).set_index('id')
# tag_kv_pairs is a Series of tuples: (tagName, tagValue)
tag_kv_pairs = (
df[tag_column]
.map(lambda kv: list(kv.items()))
.explode()
)
df_tags = pd.DataFrame(
tag_kv_pairs.tolist(),
index=tag_kv_pairs.index,
columns=['key', 'value'],
)
df_meta = df[meta_columns].copy()
del df
print(df_meta)
print(df_tags)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
       type
id
12342  node
93823  node
           key  value
id
12342  amenity  table
12342     size      2
93823    color   blue
oh so there is explode() function
thanks for your help I really appreciate it
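for the four searches listed earlier (specific key, specific k/v pair, at least one key, regex on key names), here is a minimal sketch on the long key/value format. it rebuilds the same `df_tags` so it runs on its own; `ids_where` is a made-up helper, and the regex is just an example pattern:

```python
import pandas as pd

# Same long-format table as df_tags, rebuilt here so the snippet is self-contained.
df_tags = pd.DataFrame(
    {"key": ["amenity", "size", "color"], "value": ["table", 2, "blue"]},
    index=pd.Index([12342, 12342, 93823], name="id"),
)


def ids_where(mask):
    """Sorted unique ids of the rows where the boolean mask is True."""
    return sorted({int(i) for i in df_tags.index[mask.to_numpy()]})


# 1. rows that have a specific key (other keys don't matter)
ids_with_size = ids_where(df_tags["key"] == "size")

# 2. rows with a specific key/value pair
ids_amenity_table = ids_where(
    (df_tags["key"] == "amenity") & (df_tags["value"] == "table")
)

# 3. rows that have at least one key: every id that appears in df_tags
ids_with_any_key = sorted({int(i) for i in df_tags.index})

# 4. regular expression on key names
ids_key_regex = ids_where(df_tags["key"].str.contains(r"^(?:col|ame)", regex=True))

print(ids_with_size, ids_amenity_table, ids_with_any_key, ids_key_regex)
```

the resulting ids can then be used to look rows up in `df_meta` with `df_meta.loc[...]`.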
sorry again, I wanted to add: is this a scenario of perfect fitting?
i dont know how to handle customer_id with null
one moment
let me share the data
https://paste.pythondiscord.com/omixaxecay.apache this the data
i dont know how to handle customer_id with null or 0
can anyone help me with that
how do you want to handle it? @quasi parcel
this would make me suspicious. how big are the validation and training sets? what kind of ML task is this? what is the data like? is it highly imbalanced?
you might want to manually input some made-up data to see what the predictions are, make sure they make sense
EMNIST balanced dataset
so i am building a recommendation engine so is it okay to create separate df to which i keep all the customer_ids of null there
?
I created noise input
oh, i'd be less worried then
i don't know what SotA numbers are, but you should always be suspicious if your DIY thing is beating or coming close to SotA
otherwise it's probably okay? i'm not much of an image classification expert
ahh okayy thank you for your help!!!!!
can you suggest me a way or its too much to ask @desert oar
i'm not sure what you mean by that either, sorry
you'd be losing too much contextual information, and overlapped samples won't help. what's your use-case?
are you doing user-user or user-item collaborative filtering? you might need to have 2 separate recommendation models: one that uses customer ids, and one that doesnt. then you ensemble them together when customer id is present, and you use the non-id model otherwise. i've done things like that before (although not specifically in the case of customer ids and recommendations)
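as a starting point for the two-model idea, a small sketch of splitting the rows by whether a usable customer id is present. the data is made up, and treating both null and 0 as "no id" is an assumption taken from the question:

```python
import pandas as pd

# Made-up interaction data; customer_id is missing (NaN) or 0 for anonymous rows.
df = pd.DataFrame(
    {
        "customer_id": [101, None, 0, 102],
        "item_id": ["a", "b", "a", "c"],
    }
)

is_anonymous = df["customer_id"].isna() | (df["customer_id"] == 0)

df_known = df[~is_anonymous]      # input for the model that uses customer ids
df_anonymous = df[is_anonymous]   # input for the model that does not

print(df_known)
print(df_anonymous)
```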
@quasi parcel
ohh okay
i think i got some clarity thanks
if i have any doubts
i will ask again
does anyone know how I can get output node names from a keras model?
there is the mysql.connector thing for ur responses in python to mysql is there something like that for csv files
hi everyone
i am new to python and i have developed a program to spot differences between 2 images, but it reports too many differences even when there is no difference between the images. kindly help me with that
what's your distance metric?
okay i am sending give me 2 min
If you're comparing the pixel values in the image you'll have a tough time. You will need some function that says these are "close enough" to be considered the same.
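One possible shape for such a "close enough" function, as a sketch only: it assumes both images are already loaded as same-sized NumPy arrays, and the two thresholds (`pixel_tol`, `max_changed_fraction`) are made-up defaults that would need tuning on real scans.

```python
import numpy as np


def images_close(img_a, img_b, pixel_tol=10, max_changed_fraction=0.01):
    """True if only a small fraction of pixels differ by more than pixel_tol."""
    if img_a.shape != img_b.shape:
        raise ValueError("images must have the same dimensions")
    # Widen the dtype first so uint8 subtraction cannot wrap around.
    diff = np.abs(img_a.astype(np.int16) - img_b.astype(np.int16))
    changed_fraction = float((diff > pixel_tol).mean())
    return changed_fraction <= max_changed_fraction


rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(100, 100), dtype=np.uint8)

# Small scanner-like noise: every pixel shifts by at most 3 grey levels.
noise = rng.integers(-3, 4, size=original.shape)
noisy = np.clip(original.astype(np.int16) + noise, 0, 255).astype(np.uint8)

# A real difference: a 30x30 block painted white.
spotted = original.copy()
spotted[10:40, 10:40] = 255

print(images_close(original, noisy))
print(images_close(original, spotted))
```

The noisy copy counts as the same image, while the painted block pushes the changed fraction past the threshold.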
well i think the program is comparing pixel-wise because the images have to be the same size
is it true??
you're talking about 2 different things, size != pixel value
sure, you should probably make sure the dimensions line up, but the real challenge is how you compare the images
should i send you the code?
okay
I can spend like 10min looking at it now or can review it later. Lot of smart ppl here to help however.
okay but just one thing
Hi, I'm trying to:
@heavy sail reduce the equation, you have x on top and bottom, same with y, what do you have?
the main task i have to perform is: "i have to compare two images and the program should be able to tell the differences between them, e.g. if some text is missing or there is a spot in either image"
could you help me on #help-ramen @silver summit ?
does it have to be able to tell what about the image is different
or that it is different at all
@fluid pebble ok, keep in mind that the wording sounds straightforward but this can be very complicated... if you're saying you have to pick out words embedded in the image... well
need like ocr for that, but I'm willing to bet your task is much simpler than that
thanks
the program can tell if there is any difference in one or both of the image
take this with a grain of salt:
find synonyms for height
search for those
find a number following it
search for what unit it's using
do you have to modify something preexisting, or make your own?
or are they both options
no no its not like that
just comparison of 2 images
yes, im talking about the program to do it
one is the original and the other is the new one
are you being given a program and being told to modify it in some way
or
are you being given a task as above and being told to make something to do it
well i got a program from youtube and made change according to result
alright
yes i have been assigned a task by my boss
at office
im a newbie to this so I haven't tried many things, but I know that one way it could work is with keras
since I have a program that can do something similar to that in theory
my idea:
use transfer learning
can you share
data augmentation if you wanted to ignore something
basically
use that image you have as a dataset to train it off of
input the 2nd image
get output
might work might not
ill test it gimme a second
i have started working on python only a week ago
okay
this isn't a training and testing problem
it's just a math problem
if you need to pick out parts of the image, this is called semantic segmentation
if this is the case just use some off the shelf image models and maybe ocr to pick out text, you should not have to train anything or build any models
neat
i have used ocr also but the accuracy of ocr was really bad
You need to define the problem in much more detail for us to help
okay i will define deeply
how do i learn ai without going to youtube
for example, if you say the images need to be the same do you mean exactly? like pixel for pixel? or can one be stretched a bit, rotated or flipped and still be the same? can it have a bit of noise on it (like some small percentage of the pixels are different) and still be the same? can it have a bit of text on it but the image is still the same etc
also what is the context? this is for work but how will it be used? business context, timelines etc etc
"bag of words", not "word of bags". the idea is that the order of the words in the document is ignored, so it's like you took all the words, dumped them into a bag, and shook the bag around.
as for your actual question: you might want to read this book chapter https://web.stanford.edu/~jurafsky/slp3/23.pdf from Speech and Language Processing, a currently-in-progress book (homepage is here https://web.stanford.edu/~jurafsky/slp3/)
i am given scans of two images: one is an 'ideal image' and the other is a 'difference image'. i need to develop a program to compare the 'ideal image' with the 'difference image'. the result i want is for the program to report differences in text, in color, and whether text is missing
its for business purpose
OCR sounds like a good start for the "text" part
yup
you can probably solve that problem without heavy-duty "machine learning"
e.g. something like levenshtein distance on the OCR'ed text (maybe something word-based and not character-based),
lets omit color
compare the histogram distributions, this should be pretty straight forward
you might need to put in some kind of adjustments to account for the scanning (?) process
just missing of text and if there is a spot in difference image against ideal image
i wonder how this works, it seems kind of like what you're asking for?
Image Similarity compares two images and returns a value that tells you how visually similar they are. The lower the score, the more contextually similar the two images are with a score of '0' being identical. Sifting through datasets looking for duplicates or finding a visually similar set of images can be painful - so let computer vision d...
are you trying to match up receipts or something?
omg receipts fucking suck....
heh, apparently expensify just has people manually enter receipts
i am trying to match medicine packaging components
I've tried this before at work... so bad
ok, so you're dealing with mostly text on labels on bottles/boxes
have you ever used polyfax?
i have not, it looks like some kind of antibiotic
the small boxes of medication cream
yeah that sounds reasonable, ocr out the text, compare the word sets with edit distance (mentioned above)
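a minimal sketch of the word-based edit distance idea, in plain Python. the two label strings are made up, and lowercasing plus splitting on whitespace is an assumed normalization step, not something the OCR output is known to need:

```python
def word_edit_distance(expected_words, actual_words):
    """Levenshtein distance where the units are whole words, not characters."""
    # previous[j] = distance between the expected prefix so far and actual_words[:j]
    previous = list(range(len(actual_words) + 1))
    for i, expected in enumerate(expected_words, start=1):
        current = [i]
        for j, actual in enumerate(actual_words, start=1):
            substitution = previous[j - 1] + (expected != actual)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


ideal_text = "Apply a thin layer twice daily"
scanned_text = "Apply a layer twice daily"

distance = word_edit_distance(ideal_text.lower().split(), scanned_text.lower().split())
print(distance)
```

a distance of 0 means the OCR'ed words match exactly; a small threshold above 0 would leave room for the occasional OCR slip.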
but ocr accuracy is really bad
are you sure? what have you tried?
OCR is really good nowadays, maybe the scanning process is really noisy?
I've done ocr on products on grocery store shelves... it's pretty good
is the text not english? maybe non-english ocr is a lot worse
are you doing something like detecting counterfeit products?
no no the text is in english
well that's not a correct conclusion, if you tried one ocr you cannot say ocr sucks... there are many ocr models
no thats too advanced
i also tried pytesseract, the same problem exists
I gotta run, I think this problem is very doable.
okay
quick question: how can I get output node names from a keras Xception model?
specifically I want to freeze it
in order to use it with opencv
so i was watching a tutorial by tech with tim about saving models and visualizing data and it was fine, but the code was supposed to train the models and use the best one
and what it is doing is using the last model trained
```python
# Import libraries
import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import style
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

style.use("ggplot")

data = pd.read_csv("student-mat.csv", sep=";")
predict = "G3"
data = data[["G1", "G2", "absences", "failures", "studytime", "G3"]]
data = shuffle(data)  # Optional - shuffle the data

x = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)

# TRAIN MODEL MULTIPLE TIMES FOR BEST SCORE
best = 0
for _ in range(20):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
    linear = linear_model.LinearRegression()
    linear.fit(x_train, y_train)
    acc = linear.score(x_test, y_test)
    print("Accuracy: " + str(acc))
    # Only overwrite the saved model when this run beats the best score so far
    if acc > best:
        best = acc
        with open("studentgrades.pickle", "wb") as f:
            pickle.dump(linear, f)

# LOAD MODEL
with open("studentgrades.pickle", "rb") as pickle_in:
    linear = pickle.load(pickle_in)

print("-------------------------")
print("Coefficient: \n", linear.coef_)
print("Intercept: \n", linear.intercept_)
print("-------------------------")

predicted = linear.predict(x_test)
for i in range(len(predicted)):
    print(predicted[i], x_test[i], y_test[i])

# Drawing and plotting model
plot = "studytime"
plt.scatter(data[plot], data["G3"])
plt.legend(loc=4)
plt.xlabel(plot)
plt.ylabel("Final Grade")
plt.show()
```
The idea of a generative adversarial network can exist in the human brain, but the specific thing referred to as a GAN in Deep Learning can't, it's not biologically plausible. @surreal elm
Have you seen the recent suggestion that dendrites are logic gates?
XOR / NOR /NAND / etc?
That's a 1940s thing and was the first thing suggested.
They built several logic gates out of real neurons.
Yes neurons are far more complex than anything currently used in code.
hmm that link is not loading now
When talk of artificial neural networks began some fifty years ago the idea was to mimic the behaviour and function of the neurons in human brains, a premise that has more or less survived to this day. But new research now suggests scientists may have severely underestimated the power and potential of our neurons.
this apparently adds about 20x power to the estimated potential
and there is likely more
The thing that makes Deep Learning not biologically plausible is things such as backpropagation (through multiple layers), and convolutions (shared weights specifically, there is no sliding window in the human brain for obvious reasons, but a similar thing, multiple receptive fields, can do the trick).
the neurons can share info too locally
the capsid viral shell thing
that packages packets of data
I have to look up its name again
"ARC proteins"
so there is a second and even third channel
THC is backpropagating
(cannabinoids)
so it's not easy to even imagine how data is flowing
Real neural networks do not work well on Von Neumann machines. They require special hardware, specifically https://en.wikipedia.org/wiki/Reservoir_computing. The gains from this are not just 20x, it's much larger, but also can't really be directly compared.
can someone help me to scrap a website
i've been stuck here for about 13 hours
please if any expert in web scraping can help
hi everyone
why is this so hard
I have a geopandas data frame containing some polygons
some polygons are inside each other, for example there may be a big park containing a couple small playgrounds inside it
I wanna get rid of the polygons that are inside another polygon, in my geodataframe
how can I do it?
When referring to backpropagation I mean like in Deep Learning, which is most definitely not biologically plausible. It's why the original learning rules for NNs did not use backpropagation either.
Hello, I've been learning python for a bit more than 1 year now and am currently learning JS (following a web development path which will lead to going through Django).
However I am starting to think that I don't like web development - so I was thinking maybe going for what python's most popular for - data science and machine learning.
Could someone give some tips on whether it's a good idea for someone who isn't good at math to start on this long path?
I really like OOP as a concept and I am not sure if I can apply my knowledge in this new field.
Any courses for newbies?
I don't think it's possible to not even read the other columns (with CSV files - it's possible with some formats like FWF), but it probably just discarded all other columns of each row right after reading that row.
oh, I see what you're asking though; that'd require some sort of stream decoding. Is that possible for ZIP?..
yeah it just discards the unused fields and i think doesn't even parse them
i think zlib does have some kind of streaming support
99% py
the spikes are when I am generating the buildings
I ported like 1/2 my game to the new streaming system
the old file was uber bloated too,
cut out a bunch of fluff*
don't work like that. parsing csv is line-by-line
unless the lines themselves were really really long
Howdy - does anybody know how I can hide this error message in Jupyter Notebook for QQ Plot:
I needed to convert 50000 images in numpy arrays and it took so long I had to do multiprocessing and split it up into individual files and it still took a crazy long time. I used a csv to organize everything.
I'm looking into learning openpyxl and use excel spreadsheets instead.
convert
[convert]
VERB
cause to change in form, character, or function.
as in, from a .png image to an array of floats representing rgb values.
I'm trying to capture voice characteristics, so I don't think contextual information is all that important
Voice cloning is the use case
Hi, what if I want to predict the stock market from the last 20 days' close prices, should I use batch_size = 20 in the LSTM? Could anyone explain the batch size idea to me? Because I couldn't get it
You could run the network on the whole dataset, and then compute the loss and adjust the networks' weights
Or you could compute the loss and adjust the networks' weights on each sample in the dataset
The first option is generally slow
The second option generally leads to unstable learning
A 3rd option would be to compute loss and adjust network weights in "batches" of 16, 32, 50, or however big you want your batch to be
Batch size is just another hyperparameter
If you want to make a prediction based on close prices of the last 20 days you don't need to specify any parameter to the LSTM
Since by design it accepts variable-length sequences
Sorry, I'm stuck on making the proper input shape and x_test shape
What should the input shape look like in that case
it's just 1 price series, like the s&p 500? or it's 20 stocks?
One price series
If it's one price the input shape is (batch_size, 20, 1)
Well actually I believe it's (20, batch_size, 1) but if you set the batch_first=True keyword argument to your LSTM it would be (batch_size, 20, 1)
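To make the shape concrete, here is a small NumPy sketch of cutting one price series into 20-day windows. The prices are made up, and predicting the 21st day from the previous 20 is the setup assumed from the question; nothing here is specific to any one LSTM library.

```python
import numpy as np

prices = np.linspace(100.0, 150.0, num=100)  # made-up closing prices
window = 20

# Each row holds 21 consecutive prices: 20 inputs followed by the target.
rows = np.lib.stride_tricks.sliding_window_view(prices, window + 1)

X = rows[:, :window, np.newaxis]  # (n_samples, 20, 1)
y = rows[:, window]               # (n_samples,)

print(X.shape, y.shape)

# A batch is just some number of those samples stacked together.
batch = X[:32]
print(batch.shape)
```

With the batch dimension first, one batch of 32 samples has shape (32, 20, 1), which is the (batch_size, 20, 1) layout.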
Ok thank you. Let me ask you, why do we need this batch size (I think of it as blocks of data) if my network uses only 20 inputs at once?
I don't get it
Well batch size is just the number of samples your network will look at before adjusting its weights
If you have batch_size = len(dataset), then your network will look at the entire dataset, and then adjust its weights
if you have batch_size = 1, then your network will look at one sample at a time, adjust its weights, and then move on to the next sample
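A toy illustration of that, assuming a plain linear model trained with mini-batch gradient descent instead of an LSTM (the batching logic is the same idea). `train_one_epoch` is a made-up helper and the data is random:

```python
import numpy as np


def train_one_epoch(X, y, weights, batch_size, learning_rate=0.01):
    """One pass over the data; returns the new weights and the number of updates."""
    n_updates = 0
    for start in range(0, len(X), batch_size):
        x_batch = X[start:start + batch_size]
        y_batch = y[start:start + batch_size]
        # Gradient of the mean squared error over this batch only.
        gradient = 2 * x_batch.T @ (x_batch @ weights - y_batch) / len(x_batch)
        weights = weights - learning_rate * gradient
        n_updates += 1
    return weights, n_updates


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

for batch_size in (1, 4, len(X)):
    _, n_updates = train_one_epoch(X, y, np.zeros(3), batch_size)
    print(batch_size, n_updates)
```

With 100 samples, batch_size = 1 gives 100 weight updates per epoch, batch_size = 4 gives 25, and batch_size = len(dataset) gives a single update.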
I got it. :)
Batch size is just another hyperparameter
You can see in batch=4 the loss jumps up and down
That's because it's adjusting its weights every 4 samples, which might be too few samples
So it tries 4 weights and chooses one or drops some
Or something like that
4 different weights
No it's just the number of samples your network looks at a time
Let's take the stock market example
Say you're trying to predict the price on the 21st day given 20 days of data
If you have batch_size = 1, then your network will look at one sample, give a prediction, compute how wrong it was from that prediction, and then adjust its weights
If you have batch_size = 4, then your network will look at 4 samples, which in this case would be 4 samples of 20 days of data and 4 predictions
Ok i understand i think :)
So why would batch_first=True make the input shape begin with batch_size, when normally it doesn't
But comes second
I don't know, that's an internal design decision
Aha, ok .
Function f({stats}) takes in your 3 inputs and outputs 1 or 0 depending on if you win. Find the stat point allocation that maximizes wins```
How would you guys approach this problem?
This is not a data science question
Hello, i have question
Should i convert this list to a string datatype before applying stemming and lemmatizer or leave the data type as is?
it depends on what the stemmer/lemmatizer expects.
by creating the list of strings, I assume you're tokenizing
Going forward, please always copy and paste the actual text into the chat.
Yes i used regexptokenizer
Okay
tokenizer=RegexpTokenizer(r'\w+')
dataset['text']=dataset['text'].apply(tokenizer.tokenize)
dataset['text']=dataset['text'].astype.str()
print(dataset['text'].head())
remember to put a space on either side of binary operators.
tokenizer = RegexpTokenizer(r'\w+')
dataset['text'] = dataset['text'].apply(tokenizer.tokenize)
dataset['text'] = dataset['text'].astype(str)
print(dataset['text'].head())
However, it is unlikely that this does what you expected.
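A small sketch of what happens to the token lists, assuming the intent was to stem each word. It uses `re.findall` with the same `\w+` pattern in place of RegexpTokenizer, and `toy_stem` is a hypothetical stand-in for a real stemmer, so only the data flow carries over:

```python
import re

import pandas as pd

dataset = pd.DataFrame({"text": ["running dogs", "jumped quickly"]})

# Tokenize with the same \w+ pattern, using re instead of RegexpTokenizer.
tokens = dataset["text"].apply(lambda text: re.findall(r"\w+", text))

# astype(str) calls str() on each list, giving its printed form, brackets and all.
as_string = tokens.astype(str)
print(repr(as_string.iloc[0]))


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Keeping the lists lets a per-word function run on each token.
stemmed = tokens.apply(lambda words: [toy_stem(word) for word in words])
print(stemmed.tolist())
```

Converting to a string first would hand the stemmer one long string per row, including the brackets and quotes, instead of individual words.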