#data-science-and-ml


serene scaffold
#

In the other three, did you expose your question, or did you ask if anyone has certain knowledge?

oblique ridge
serene scaffold
#

Good luck!

oblique ridge
next lance
#

Yes I know what a dot product is and I can do it on paper
I know a lot about NumPy, how neural networks work and all. I have built a few parts of the model.
Yes, I don't think I'm an expert in AI and ML, but I think that it's hard

Also it's not for my homework, I am learning Deep Learning 😅

tender hearth
next lance
#

Ok

#

After 2 hours

peak halo
#

@lapis sequoia and @granite vine

#

here

peak halo
#

yeah, tell

jade acorn
#

whats the difference between me using a statistical program to make a linear regression model versus doing it in Python via sklearn? How is the latter machine learning when its just generating predictions? Can you say that in this specific case with linear regression models that all "automated" calculators of least squares are considered a rough form of machine learning?

tender hearth
#

(Just realized I did not answer your question. No, least squares is not a form of machine learning because no learning is done)

wide meadow
#

Hi, in Machine Learning, do we select the model first and then tune it's hyperparameters or tune the hyperparameters for all the models first and then select the model?

jade acorn
jade acorn
next lance
verbal cape
#

I was aligning images and I wanted to load images in a folder with a specific name. How can I do that?

woeful falcon
#

Anyone into NLP here,
is nltk a good choice ?

lapis sequoia
#

hello

uneven thistle
#

how can i improve this histogram made in plotly? as you can see, the markings are not clear: one bar is so high that the other bars are too low to be visible, and the y axis scale is too large as well. how do i make the scale smaller, with a frequency of 100?

tender hearth
desert oar
#

Like fundamentally I think that's wrong

#

I'm on mobile so I can't effectively plead my case, but my basic opinion is that fitting a line is definitely "learning", albeit learning something relatively simple

#

Arguably "least squares" itself is just an algorithm, it's not inherently machine learning. But fitting a line with least squares is just as much machine learning as fitting a deep network with SGD
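
To make the "fitting a line is learning" argument concrete, here is a minimal sketch, assuming NumPy is available and using made-up data: the parameters are estimated from examples, then applied to an input the fit never saw. The variable names are illustrative only.

```python
import numpy as np

# Synthetic data: y = 2x + 1 plus a little noise (values are made up for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.size)

# "Training": least squares estimates the slope and intercept from the data.
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# "Inference": the fitted parameters are applied to an input outside the training set.
x_new = 12.0
y_pred = slope * x_new + intercept
print(slope, intercept, y_pred)
```

Whether you call this machine learning or statistics is a matter of framing; the computation is the same either way.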

#

cc @jade acorn

tender hearth
#

My mind Ctrl+Shift+F's "machine learning" to "deep learning" since that's usually what people are referring to

sage latch
#

Hi, random question:
I'm interested in datascience/Ai stuff; can you guys recommend some certifications or so, maybe I can convince my employer to pay me some courses/certificates;
I just don't know where to start

neon imp
sage latch
#

I'm working in a consulting setup; it would be nice if there were some certifications with which I can convince the higher-ups that they are marketable and improve my value, so they can hire me out for more money. But thx for that link

short chasm
#

Hello everyone. I want to create a blank screen with the plt.figure() function, but I can't. Can you help me?

next lance
#

I am making an object detection model using TensorFlow but I am installing a lot of things
Do professional programmers also download so many files like labelImg and more?
Can we not just install all the modules and use them as code, like install by doing pip

serene scaffold
#

Can we not just install all the modules and use them as code, like install by doing pip
are you talking about downloading software that isn't Python libraries? different operating systems have different packages managers. what OS are you on?

azure marsh
#

labelimg is image annotation software. Ironically it is in pip, though some level tools aren't. Many packages have many ways of installing

next lance
#

Do all professionals also download so many repositories and all other things

#

I am following a tutorial in which I have made so many steps, installed so many things and all this

azure marsh
#

In reality, even professionals deal with heterogeneous and painful environment setups

#

You can try looking for Docker images with everything set up

next lance
#

When will I be doing that big code part

chrome lintel
#

if it's a standalone package I don't think there's anything wrong with simply installing with pip. Some more sophisticated packages also come with software that you need to manually download and install - but if pip install works you should use that

next lance
#

Yes I am thinking the same

#

So when will I be doing a lot of code 🤔

chrome lintel
#

Depends on the tutorial you're following. I imagine once you have all the required packages it'll be coding time

azure marsh
next lance
#

I am just using ipython till now

next lance
azure marsh
#

pip install -r requirements.txt

next lance
#

Never did that

azure marsh
#

Several pip packages can be installed that way from a list

#

Tutorials differ in the amount of effort they put into the setup

#

If they are having you install a bunch of different pip packages separately, they could've made it one line with a file like that

next lance
#

No that guy is making me install all these differently

#

Also, do all programmers do all these creepy things in AI and ML? 😳

#

I haven't written a single line of code in Pycharm, only using ipython to install and setup

#

And the guy in the video says now time to train the model

jade acorn
grave frost
#

start from the basics instead of some random guy on youtube

desert bear
#

Does anyone know how to find two points that have extending coordinates in certain domain.
In this example, my domain is (-5, 5), and one point extends outside of that.

frigid elk
#

I'm having difficulty understanding a concept. .... true or false, ... when using a threshold other than .5 for classification, does that essentially turn it in to a regression problem? are classification and regression the same, just one has a binary output (in the case of 2 classes)?

tidal bough
viscid siren
#

Hi guys do ya got some good tutorials and explanations for basics in data science with python or c language?

desert bear
#

but that would be too much of work honestly

jade acorn
#

is there a function in numpy, sklearn or scipy that can detect if a dataset goes in a linear line, polynomial or exponential?

tidal bough
#

Well, you can try fitting the dataset to a line, a polynomial of some degree or to an exponent. I don't think there is (or can be, really) an automatic function to determine that
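
A rough sketch of that "try each shape and compare" idea, assuming NumPy and made-up data. The helper names `sse` and `compare_fits` are hypothetical, and the exponential is fitted as a straight line in log space, which only works when every y is positive.

```python
import numpy as np

def sse(y_true, y_fit):
    """Sum of squared errors between the data and a fitted curve."""
    return float(np.sum((y_true - y_fit) ** 2))

def compare_fits(x, y, degree=2):
    """Fit a line, a polynomial, and an exponential; return the SSE of each."""
    results = {}
    # Straight line: y = a*x + b
    results["linear"] = sse(y, np.polyval(np.polyfit(x, y, 1), x))
    # Polynomial of the chosen degree
    results["polynomial"] = sse(y, np.polyval(np.polyfit(x, y, degree), x))
    # Exponential y = a*exp(b*x), fitted as a line in log space (needs y > 0)
    if np.all(y > 0):
        b, log_a = np.polyfit(x, np.log(y), 1)
        results["exponential"] = sse(y, np.exp(log_a) * np.exp(b * x))
    return results

x = np.arange(10, dtype=float)
y = 2.0 * np.exp(0.8 * x)  # made-up exponential data
scores = compare_fits(x, y)
best = min(scores, key=scores.get)
print(scores, best)
```

Keep in mind that a higher-degree polynomial always fits the training points at least as well as a line, so on real, noisy data it is safer to compare the candidates on held-out points rather than on the data they were fitted to.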

odd meteor
# frigid elk I'm having difficulty understanding a concept. .... true or false, ... when usin...

Ordinarily, sigmoid function takes care of that in classification problem. This is what works behind the scene to determine the class each predicted value rightly belongs to.

So long as you understand the concept of sigmoid function, then you honestly shouldn't bother about having a threshold other than 0.5 (given you're interested in building an unbiased model)

Sigmoid function is an activation function used in Logistic Regression (classification problem) to make our output be between 0 and 1.

So if the predicted value is above 0.5, it'll be assigned to class 1, and if otherwise class 0.

So long as there is an activation function ( in our case sigmoid function) present in your Logistic Regression model, you cannot predict a continuous value.

So in essence, even if the set threshold is 0.7, that doesn't turn your classification problem into a regression problem.
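
A small sketch of the point about thresholds, using only the standard library and made-up scores: whatever cut-off is chosen, the output is still a class label, and only the cut-off moves. The function names are illustrative.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(probabilities, threshold=0.5):
    """Turn predicted probabilities into class labels 0 or 1."""
    return [1 if p >= threshold else 0 for p in probabilities]

scores = [-2.0, -0.5, 0.3, 0.9, 2.5]   # made-up raw model outputs
probs = [sigmoid(z) for z in scores]

labels_05 = classify(probs, threshold=0.5)
labels_07 = classify(probs, threshold=0.7)
print(labels_05)  # [0, 0, 1, 1, 1]
print(labels_07)  # [0, 0, 0, 1, 1]
```

The probabilities themselves are continuous, which is probably where the confusion with regression comes from; the prediction task is still classification.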

frigid elk
#

just want to make sure that i'm considering the right value, .. that a regression prediction is the continuous value for what would be a classification otherwise when the threshold is .5

odd meteor
# frigid elk Thanks for the response. .. my thought on moving the threshold is due to running...

If you have an imbalanced class, try to use one of the available resampling techniques like SMOTE (if you want to oversample the minority class)

You can even add the parameter stratify=y when splitting your data with train_test_split.

Then endeavour to also add the class_weight parameter in your model to handle imbalance class.

XGBoost uses scale_pos_weight

Then use StratifiedKFold for your cross-validation.

Try to google more ways to handle imbalance class. Don't touch the default set threshold which is 0.5
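
Three of the suggestions above (stratify, class_weight, StratifiedKFold) could be combined roughly like this. It is a sketch assuming scikit-learn is installed and uses a synthetic dataset; LogisticRegression and the F1 metric are arbitrary choices for illustration, and SMOTE is left out because it lives in the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic imbalanced data: roughly 90% class 0, 10% class 1 (illustrative only).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0)

# stratify=y keeps the class ratio the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# class_weight="balanced" reweights errors inversely to class frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# StratifiedKFold preserves the class ratio inside every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
print(f1_scores.mean())
```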

frigid elk
cold cloud
#

hey everyone, i seem to be having a problem with the feature importance on my models. im using the built in sklearn feature_importances_ for gradient boosting, random forest, and extreme gradient boosting. but my extreme gradient boosting feature importance seems to be a very different than the gradient boosting or random forest feature importances to the point where a low correlating, binary feature is my most important variable.

is the built in feature_importance for random forest and gradient boosting different than for extreme gradient boosting? im struggling to find an explanation for this. or is there a better way to find feature importance for my models?

odd meteor
lapis sequoia
#

so i am trying to learn ai and machine learning and i was watching a video from tech with tim and i did what he did and understood a part of it, i was hoping someone could explain to me this:

#
```py
import pandas as pd
import numpy as np
import sklearn
from sklearn import linear_model
from sklearn.utils import shuffle

data = pd.read_csv("student-mat.csv", sep=";")

data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]

predict = "G3"

# axis has to be passed by keyword in recent pandas versions
X = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)

linear = linear_model.LinearRegression()

linear.fit(x_train, y_train)
acc = linear.score(x_test, y_test)
print(acc)

print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

predictions = linear.predict(x_test)

for x in range(len(predictions)):
    print(predictions[x], x_test[x], y_test[x])
```
frigid elk
desert oar
#

Just not for the output layer

desert oar
odd meteor
odd meteor
rain temple
# lapis sequoia so i am trying to learn ai and machine learning and i was watching a video from ...

Basically, what this code does is fit a Linear Regression model to your dataset. You are storing the dataset in the dataframe "data" and you are predicting the value of G3 given the other features in the dataset. First you drop the value to be predicted from the dataset and split the data into training and testing data. Training data is used to train the model so that it can learn the best fit for the linear regression line, and test data is used to measure how well the fitted model does --> acc = linear.score(x_test, y_test). To apply linear regression, you initialize the model in "linear" and train it on the data using "fit". Linear regression works by estimating parameters, i.e. the gradient and the y-intercept, that minimise the loss, here the Mean Squared Error. After fitting the model, you predict the test values.

#

im probably wrong in some places as im also new to ML, so you should probs seek confirmation from more experienced people

frigid elk
odd meteor
# lapis sequoia ``` import pandas as pd import numpy as np import sklearn from sklearn import li...

Importing necessary Libraries

import pandas as pd
import numpy as np
import sklearn
from sklearn import linear_model
from sklearn.utils import shuffle

Read Dataset Into Pandas DataFrame
data = pd.read_csv("student-mat.csv", sep=";")

Selected The Features and Label he's interested in using to build a model

data = data[["G1", "G2", "G3", "studytime", "failures", "absences"]]

Specified the label (Response Variable)

target_col = "G3"

Declared The Input & Target Variable

He dropped the target column from dataframe so it'll contain only the input features (X), then converted both X (DataFrame) and y (Series) into a numpy array.

X = np.array(data.drop([target_col], axis=1))
y = np.array(data[target_col])

Splitted Dataset into train and holdout set

x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1)

Instantiated the Linear Regression object

linear = linear_model.LinearRegression()

Train a Linear Regression Model

linear.fit(x_train, y_train)

Get The Coefficient of Determination (R2)

acc = linear.score(x_test, y_test)
print(acc)

Print Your Model Parameters (The weight and intercept)

print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)

Make Predictions with the holdout set

predictions = linear.predict(x_test)

Print The Predicted_y, X_test, and Actual_y

for x in range(len(predictions)):
    print(predictions[x], x_test[x], y_test[x])
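
Since the walkthrough notes that `linear.score` gives the coefficient of determination rather than a classification accuracy, here is a minimal sketch of that quantity computed by hand, assuming NumPy and made-up grades. The helper name `r_squared` is hypothetical.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

y_actual = [10.0, 12.0, 15.0, 9.0]         # made-up grades
print(r_squared(y_actual, y_actual))        # perfect predictions give 1.0
print(r_squared(y_actual, [11.5] * 4))      # always predicting the mean gives 0.0
```

A value near 1 means the predictions track the actual grades closely; a value near 0 means the model does no better than always guessing the average.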
odd meteor
rain temple
#

Arent activation functions primarly used in Neural Nets tho?

odd meteor
odd meteor
frigid elk
# odd meteor I don't know enough about SHAP values to comment on it at this time. Do you min...

it's a method of evaluating model feature importance, down to the observation if needed, based on game theory. .. it's able to work on multiple classifiers by evaluating the results in a similar manner of leave one out cross validation. ... that's about the closest parallel i can think of off the top of my head. ... look up shapley values on youtube, there are some great resources. ... data robot has one that's around 5 minutes that explains it very well at a high level

odd meteor
desert oar
severe hazel
#

Hi guys, I think I'm in the right place. I have a question for someone into mapping

#

I want to build interactive maps. I want my python coordinate points to show up on this map and have lines between them. What would you guys recommend?

lapis sequoia
#

Is 83% accuracy good, fellas?

next lance
next lance
lapis sequoia
next lance
#

Oh bad

#

Are you using OpenCV?

lapis sequoia
next lance
desert oar
blazing pawn
twin valve
#

uk what u dont hv enough of

#

water bottles

royal crest
arctic wedgeBOT
#

7. Keep discussions relevant to the channel topic. Each channel's description tells you the topic.

next lance
#

Which is the best: Keypoints, Boxes or Masks in the TensorFlow 2 Zoo?

spice moss
#

I just started learning neural Networks but I don't understand the role of Activation and softmax functions and how the neurons work with them

#

Can anyone help

#

Or recommend any books/videos for it

next lance
spice moss
#

Oh OK

next lance
#

It covers mostly everything from scratch

spice moss
#

Oh thank you, I will definitely check that out

next lance
#

I will give you the link

#

Building neural networks from scratch in Python introduction.

Neural Networks from Scratch book: https://nnfs.io

Playlist for this series: https://www.youtube.com/playlist?list=PLQVvvaa0QuDcjD5BAw2DxE6OF2tius3V3

Python 3 basics: https://pythonprogramming.net/introduction-learn-python-3-tutorials/
Intermediate Python (w/ OOP): https://pythonpr...

spice moss
next lance
#

This is the book

spice moss
#

Oh ok thank you so much

next lance
#

also check out the playlist

next lance
lyric ermine
#

hey guys, i wanna start working on my portfolio.

is dash and plotly recommended to build first basic dashboards?

grave frost
lapis sequoia
#

I want to start with AI
Can anyone tell me what should I do first ?

tender hearth
#

I'm reading the Universal Sentence Encoder paper, and I'm having trouble understanding this part of the paper:

The context aware word representations are converted to a fixed length sentence encoding vector by computing the element-wise sum of the representations at each word position.

Here's the rest of the text if you need context:

The transformer based sentence encoding model constructs sentence embeddings using the encoding sub-graph of the transformer architecture (Vaswani et al., 2017). This sub-graph uses attention to compute context aware representations of words in a sentence that take into account both the ordering and identity of all the other words. The context aware word representations are converted to a fixed length sentence encoding vector by computing the element-wise sum of the representations at each word position. The encoder takes as input a lowercased PTB tokenized string and outputs a 512 dimensional vector as the sentence embedding.

#

"Element-wise sum"?

#

Ah. I believe it's taking a sum of axis=0

#

That would make the most sense. Since axis=1 would still produce a variable length tensor
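
A quick shape check of that reading, as a sketch assuming NumPy and a hypothetical encoder output of one 512-dimensional vector per word (random numbers stand in for the real representations):

```python
import numpy as np

# Hypothetical encoder output: one 512-dimensional vector per word position.
n_words, dim = 7, 512
word_representations = np.random.default_rng(0).normal(size=(n_words, dim))

# Summing over axis 0 (the word axis) adds the vectors element by element,
# so the result has a fixed length no matter how many words there are.
sentence_embedding = word_representations.sum(axis=0)
print(sentence_embedding.shape)  # (512,)

# Summing over axis 1 instead leaves one number per word: the length varies with the sentence.
per_word_totals = word_representations.sum(axis=1)
print(per_word_totals.shape)     # (7,)
```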

reef dock
#

I've been practising some of the concepts I've learnt in this data science course that i'm doing and I've been wondering (and this might come across as a stupid question but) how do you know what to do with the data you have at hand?

serene scaffold
#

you might start by asking yourself "what are insights that a human could extract from this data if they had time to go through all of it by hand"

reef dock
#

I guess that's sort of a drawback of just practising code? Picking up a dataset from nowhere without having a plan kinda made me blank.

serene scaffold
reef dock
#

Can i post links here?

serene scaffold
#

I'll be back in a few

#

@reef dock did you look at the names of the columns?

['Permit Number', 'Permit Type', 'Permit Type Definition',
       'Permit Creation Date', 'Block', 'Lot', 'Street Number',
       'Street Number Suffix', 'Street Name', 'Street Suffix', 'Unit',
       'Unit Suffix', 'Description', 'Current Status', 'Current Status Date',
       'Filed Date', 'Issued Date', 'Completed Date',
       'First Construction Document Date', 'Structural Notification',
       'Number of Existing Stories', 'Number of Proposed Stories',
       'Voluntary Soft-Story Retrofit', 'Fire Only Permit',
       'Permit Expiration Date', 'Estimated Cost', 'Revised Cost',
       'Existing Use', 'Existing Units', 'Proposed Use', 'Proposed Units',
       'Plansets', 'TIDF Compliance', 'Existing Construction Type',
       'Existing Construction Type Description', 'Proposed Construction Type',
       'Proposed Construction Type Description', 'Site Permit',
       'Supervisor District', 'Neighborhoods - Analysis Boundaries', 'Zipcode',
       'Location', 'Record ID']
reef dock
#

Yes

#

I ended up removing some of the ones that didn't seem too useful too.

serene scaffold
#

which ones did you find not useful?

reef dock
#

Street Number Suffix, Proposed Construction Type, Site Permit, TIDF Compliance, Unit

serene scaffold
#

@reef dock what do you think would be interesting to learn from this data?

granite vine
#

hello

#

lol

reef dock
#

I'm not entirely sure since I picked the dataset at random. Though I did try to look at the neighborhoods that have the most permit applications.

granite vine
#

help

#

<@&831776746206265384>

#

can you help meeeee

#

plssssssssssss

serene scaffold
#

Please don't ping moderators asking for help.

granite vine
#

@digital shard

serene scaffold
granite vine
granite vine
serene scaffold
granite vine
#

sir

granite vine
#

a short help

reef dock
#

There's help channels in the server if that's what you're looking for.

serene scaffold
#

I don't even know what you want help with, whereas keN has been asking clear questions.

granite vine
#

it not working

reef dock
granite vine
granite vine
#

done

#

lol

reef dock
#

@serene scaffold if you had to work with this data, what would you do?

serene scaffold
reef dock
#

what do you mean human language stuff?

serene scaffold
#

or when the estimated cost and revised cost are different

serene scaffold
reef dock
reef dock
serene scaffold
reef dock
tender hearth
#

Anyone have any suggestions for producing a speaker embedding from an audio waveform

#

Reading WaveNet's paper rn to see what they've done

edgy hearth
#

can anyone help me out ?

#

here is the code

#

```py
import pandas as pd

data = pd.read_csv('hollowen_costumes.csv')
print(data.describe())
```
#

soo like i can print this

serene scaffold
edgy hearth
#

but its not clean

tender hearth
#

Depends what you mean

edgy hearth
#

is there any func to do that ?

serene scaffold
tender hearth
#

If by array, you mean the audio samples, I already have that. if by array, you mean fixed-length embedding vector, yes that's what I'm looking for

edgy hearth
#
```
unique                                                238                                                                                                                                                                   50
top                                                  Name                                                                                                                                                                    1
freq                                                    1                                                                                                                                                                   28
```
#

this is the output

edgy hearth
#

can i make it like in a straight line ?

tender hearth
#

So I'm looking at Wavenet autoencoders because audio is extremely high resolution and using recurrent methods would be really compute intensive

serene scaffold
#

@edgy hearth can you do print(data.head().to_csv()) and paste the text in this chat?

edgy hearth
#

sure

#
```
1,Clown,100
2,Vampire,97
3,Harley Quinn,96
4,Casa De Papel,95
```
#

this is the output

#

yeah ig it worked

#

: )))

#

did .head() do the thing ?

serene scaffold
abstract frost
#

hello, has anyone used bookstoscrape in the past?

serene scaffold
abstract frost
#

okay i managed to extract every info on a book page and now im struggling to get the same info for every book in the same genre.

#

well idk if that made sense, but basically i have the script to get all infos of a book and now i just need to repeat it for every book of a genre

lapis sequoia
#

Anybody had any experience with donkey cars?

#

well I want to create a fully fledged AI assistant that can talk as well as work as a utility software
As I don't want it to be based on if else statements
which modules should I learn to make such a AI in a easy way
not very complex
P.S. - I am a 9th grade student with good python skills so pls list down the prerequisites(basically math required)

serene scaffold
serene scaffold
# lapis sequoia why tho ?

Making a "fully fledged AI assistant" would be challenging even for people with advanced degrees in math and computer science.

#

though this does not mean that you can't do things with AI that will be interesting and gratifying for you.

lapis sequoia
#

by basic functionality I mean
opening a software
playing music
opening different folders
searching stuff on web
that can easily be done with python

lapis sequoia
serene scaffold
#

Also, I would suggest against "learning libraries/modules"

lapis sequoia
lapis sequoia
serene scaffold
#

Again, you don't want to "learn libraries" as these libraries are intended to be building blocks for solving lots of different problems, and understanding what problems are out there and what the potential approaches are will serve you better.

lapis sequoia
serene scaffold
lapis sequoia
serene scaffold
#

You can't skip this step. Good luck!

lapis sequoia
serene scaffold
#

the first three are inter-related though.

lapis sequoia
serene scaffold
lapis sequoia
#

Makes sense.

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @lapis sequoia until <t:1634491907:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

coral kindle
#

I have a question with huge loads of NLP

#

Can libraries like PySpark process them chunk by chunk? Because I really want to try something with the ArXiV dataset snapshot.

#

And it's unfortunate that I can't fully exploit it because of RAM capacity

#

I've been trying Spark and Spark-NLP but idk if I should bounce back to a simple NN on Torch once the preprocessing is done

#

Because so far it looks neat

#

Do I really have to set up a machine on the cloud ?

coral kindle
#

Has anybody ever used spark-nlp? Ever?

#

There's some towardsdatascience articles on it

grave frost
tender hearth
lapis sequoia
#

What steps or learning paths should I take if I want to become an ML engineer?

robust rampart
#

Tree Depth: 37
yuck

ashen sable
#

hey guys i want to build a speaking assistant which learns day by day..i am thinking of using reinforcment learning..so what do u guys suggest

warm valley
#

Hey folks, a small qsn.
Should I use feature selection in test data too?

I.e., I used feature selection to clean the data and fit a decision tree model, but when I used it to predict on the test data,

it gives
Feature names unseen at fit time.

quasi parcel
#

Hi everyone https://paste.pythondiscord.com/sicunewomu.yaml this is the csv
this is the pivot table code

```py
weighted_matrix = combineframes.explode('product_ids').pivot_table(index='Customer_ID', columns='product_ids', aggfunc='sum', values='res')

weighted_matrix
```
#

the above code is only returning 174 columns but it should be 999 rows

#

in this weighted_matrix

zinc rock
rose cipher
#

Guys, I would like to know if Cloud Platforms (AWS, GCP, Azure, etc.) can have access to my data? If yes, which types of data?

uneven thistle
#

i am trying to resample my data to yearly, but the problem is that my price column is integer while country code and organization are strings. so after resampling i used groupby to group by country and org and then used sum, but it is showing the price as 0 for every row. what should i do?

zinc rock
#

anyone know where is this from

#

lol

desert oar
warm valley
desert oar
#

!rules 8

arctic wedgeBOT
#

8. Do not help with ongoing exams. When helping with homework, help people learn how to do the assignment without doing it for them.

desert oar
#

but this is a general enough question that i am comfortable answering: it depends entirely on the data. often something like 0.75 is not good enough for "business" purposes, but it depends on the cost of misclassification in your particular problem

#

recall the interpretation of AUROC on binary classification

#

in some medical studies or other research settings (e.g. social science), 0.75 might be great

#

but in those contexts you are often interested in probability modeling and not just accurate point predictions
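
For anyone who wants that interpretation spelled out: AUROC can be read as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. A standard-library sketch with made-up scores follows; the function name `auroc` is illustrative and the pairwise loop is only meant for small inputs.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs where the positive has the higher score.

    Ties count as half a win. This pairwise form is O(n_pos * n_neg), fine for a sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # made-up classifier scores
print(auroc(labels, scores))     # 0.75
```

So a value of 0.75 means that in three out of four such pairs the model ranks the positive example above the negative one; whether that is good enough depends on the cost of the remaining mistakes.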

desert oar
#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

desert oar
#

i can't copy and paste data from a screenshot. can you post CSV?

#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

uneven thistle
#

this is the data i am getting after resampling

#

ok

arctic wedgeBOT
#

Hey @uneven thistle!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .csv attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

uneven thistle
#

it is showing that csv attachments are currently not allowed

desert oar
#

yes, use the paste site i linked

jolly cliff
#

How to generate unique random numbers in numpy array?

jolly cliff
uneven thistle
#

here it is @desert oar but i didnt find a way to copy the csv file

uneven thistle
#

what is this @desert oar

#

i want to resample the data not sample

#

did you understand my question @desert oar

desert oar
uneven thistle
#

how can i upload csv

desert oar
#

paste it into the paste site

#

i only need enough rows to see the problem, i don't need the whole file

uneven thistle
#

there is no option to paste it in the paste site

jolly cliff
uneven thistle
#

is this enough for you?

#

@desert oar

uneven thistle
#

plzz help @desert oar

desert oar
silver summit
#

anyone good with pyspark?

#

I have a binary file that is very large. Trying to figure out how to use pyspark to work with it.

quasi parcel
#

Google Object Storage in gcp

#

in azure Azure Blob

serene scaffold
#

@silver summit it's important that you always state your actual question, as even people who know about a given topic won't volunteer themselves until they know what you're really asking. I assume you're asking this question: #python-discussion message

silver summit
desert oar
#

"not currently working" is a great place to start by asking a specific question about the unexpected or error output you are getting

rose cipher
quasi parcel
#

like?

#

credentials

#

@rose cipher

rose cipher
#

No. I just want to know if these companies can have access to the data that I use on these platforms.

quasi parcel
#

no they wont there is a strict policy that your data wont be shared or used by anyone else apart from you

rose cipher
#

I am thinking about getting into the Cloud Industry, but I was worried about privacy concerns

quasi parcel
#

unless you keep open to public

#

so try to keep it private

rose cipher
#

There's public, private and hybrid cloud, that's what you mean right?

quasi parcel
#

no no

#

so in s3 we have this policy were we can allow the file to be publicly accessible then anyone can download your file

#

i do suggest s3 with private policy its really good

rose cipher
#

Sure

#

Good to know that these services are safe

#

When I mean safe I am saying about data protection

quasi parcel
#

yes you can use these services what is your exact functionality?

rose cipher
#

I don't know, I will start to learn them now. I just wanted to know if they were Safe

silver summit
#

when I use pyspark to read a binary file as spark.readStream.format('binaryfile').load('filename.bin') I get a schema error
```
IllegalArgumentException:
Schema must be specified when creating a streaming source DataFrame. If some
files already exist in the directory, then depending on the file format you
may be able to create a static DataFrame on that directory with
'spark.read.load(directory)' and infer schema from it.'
```

bold timber
#

how to handle this error?

median fulcrum
#

any ideas how can I use seaborn boxplot to describe this database?

serene scaffold
#

!traceback

arctic wedgeBOT
#

Please provide the full traceback for your exception in order to help us identify your issue.

A full traceback could look like:

Traceback (most recent call last):
    File "tiny", line 3, in
        do_something()
    File "tiny", line 2, in do_something
        a = 6 / b
ZeroDivisionError: division by zero

The best way to read your traceback is bottom to top.

• Identify the exception raised (in this case ZeroDivisionError)
• Make note of the line number (in this case 2), and navigate there in your program.
• Try to understand why the error occurred (in this case because b is 0).

To read more about exceptions and errors, please refer to the PyDis Wiki or the official Python tutorial.

serene scaffold
#

^ this is what is meant by "whole error message"

bold timber
serene scaffold
quasi parcel
#

what is your sklearn version

#

@bold timber

bold timber
quasi parcel
#

try installing 0.18.2

silver summit
#

ok, I gave up on readStream... trying to figure out how to use just read with a binary file... so I currently have the data loaded in with spark.read.format('binaryfile').load(fn) and have the following

desert oar
silver summit
#

the content field is a list of values, the first 4 are just checks, the next 3 are header info then the rest is the data I need, trying to figure out how to strip out the first 4, then the next 3 then the remaining values

#

the header info basically tells me how much data there is and the type etc

#

yeah the data is a list of numbers that will eventually need to be shaped into a dataframe

#

file format is binary

formal lava
#

What r req for ai and how to start?

desert oar
#

@silver summit i think readStream is only for dataframes, i don't know if there's an API to parse an arbitrary huge blob of bytes into a dataframe. you might have to fall back to reading each line into an RDD, and parsing each line into a Row, and then using createDataFrame to collect all those rows into a dataframe

#

it'd still be "streaming" but it's not quite as elegant as declaratively specifying a schema

robust jungle
#

How can I find the output node names of an xception model

median fulcrum
#

what's the correct way to do that?

silver summit
#

@desert oar well I have the data in a dataframe now, I just need to transform it to a series instead of a list of values in a single row of a column

robust jungle
#

(Keras)

robust jungle
formal lava
#

idk machine learning?

silver summit
#

lol bro...

robust jungle
#

I recommend starting with tensorflow

#

There are good guides online

silver summit
#

wtf

#

that's terrible advice

desert oar
robust jungle
#

Mb

#

I'm kinda new too, so I'm just saying what I did

silver summit
#

absolutely don't do that

#

you need to understand fundamentals before you just jump into stuff like that

#

take an online course about machine learning, get a sense for statistics, look at sklearn and kaggle

median fulcrum
silver summit
#

just load data into pandas and try to figure out what youre looking at, means, stds, sizes, shapes, plots etc...

#

daughter is awake... be on later

desert oar
#

i agree that you should definitely learn some data viz and data manipulation in pandas along w/ tensorflow

#

but i don't think it's that bad to start poking around in TF with image classification or whatever

#

but you will very quickly start wishing you knew more principled approaches to problem solving, and then you should start focusing on stats and more machine learning fundamentals

formal lava
#

mhm

desert oar
#

it might be more satisfying to do a bit of hands-on playing around with tensorflow, then spend some time on fundamentals, then spend some more time playing around, etc.

formal lava
#

so what do u mean by "playing around", im not sure I can build anything

median fulcrum
#

AttributeError: module 'matplotlib.pyplot' has no attribute 'set_xlabel'

#

what

#

It has

#

what I am missing?

desert oar
#

@median fulcrum plt.xlabel, set_xlabel is for the underlying Axes object, e.g. plt.gca().set_xlabel

median fulcrum
#

How I would write in that?

desert oar
#
plt.hist(x='loan', data=credit_risk)
plt.xlabel('Loan ($)')

maybe like that? i don't use the plt api much

#

did it work with ax? that should work

bold timber
median fulcrum
median fulcrum
desert oar
#

what doesn't work about it?

#

i don't see the error or any output

bold timber
#

how to fixed this?

median fulcrum
#

:/

#

if someone know ping me pls

#

got it!

desert oar
median fulcrum
#

idk

desert oar
#

no i think i was just wrong

#

plt.xlabel appears to get the current label

median fulcrum
#

but when I tried just got an error

desert oar
#

maybe you can do plt.xlabel = ...? like i said, i don't use the pyplot api much anymore

median fulcrum
#

It worked so I'm not gonna change that 😂

desert oar
#

that's how i'd do it 99% of the time

#

you can also just do ax = plt.gca() to get the current axis object instead of plt.subplots, that's up to you
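
A minimal sketch of the two labelling styles discussed above. As far as the matplotlib API goes, `plt.xlabel(...)` is a function that sets the label on the current Axes, so it should be called rather than assigned to, and `set_xlabel` is the equivalent method on an Axes object. The `loans` values are made up for illustration, and the `Agg` backend is only selected so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

loans = [1200, 3400, 560, 7800, 2300]  # made-up values for illustration

# pyplot style: plt.xlabel is a function that sets the label on the current Axes
plt.figure()
plt.hist(loans)
plt.xlabel("Loan ($)")
pyplot_label = plt.gca().get_xlabel()

# object-oriented style: set_xlabel lives on the Axes object
fig, ax = plt.subplots()
ax.hist(loans)
ax.set_xlabel("Loan ($)")
oo_label = ax.get_xlabel()
```

Both styles end up with the same label; the Axes-based one tends to be easier to follow once a figure has more than one subplot.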

plain verge
#

hi everyone

#

I was learning pandas

#

and I wonder if there is anyway to use DataFrame.loc with both list of columns and slice at the same time

#

like I want to have something like df.loc[:, ["Name", "LastName", "Age":"School", "Interest":]]

#

however this does not work

#

How can I do something like this?

rigid zodiac
#

df.iloc[{'a','b'}]

#

that's how you select columns

plain verge
#

isn't iloc for int indexes of columns?

rigid zodiac
#

not really, like I use it to select a few column only

#

wait my mistakes

#

it's just df[['a','b']]

plain verge
#

oh wait yeah
that works
but like then what's the purpose of loc?

#

also can I do slices with this?

#

like every column from Name to Age?

rigid zodiac
#

iloc is when you want to choose a specific row to specific column

plain verge
#

look

#

I wanna get these columns in this order
Name, LastName, Age, .... , School, Interest, ....

#

where those ... means anything in between

#

how can I do it in one line?

#

there is this syntax
df.loc[:, ["Name", "LastName", "Age"]]
and also there is this:
df.loc[:, "Age":"School"]
how can I use this both at the same time?

rigid zodiac
#

for that it will be better to choose iloc
df.iloc[:, :3]
where the second slice selects columns by position, up to but not including 3 (ie column 0 -> 2)

plain verge
#

what if the data is changing in column positions?

#

then I need some way to get column indexes first

rigid zodiac
#

you can use { }

#

so like

df.iloc[ {   } , { } ]
#

or just google

desert oar
ripe forge
desert oar
desert oar
ripe forge
#

oh. gotcha

desert oar
#

you are trying to do something like this, right? df.loc[:, ["Name", slice("Age", "School")]]

steady cargo
#

Hi, I have a question.
Suppose we want to use face recognition in a mobile application using Python language, what library do we use or how do we link these both together?

desert oar
#

yeah you'd have to write your own function to "expand" that

plain verge
#

what if I do something like

df = pd.concat([df.loc[:, ["Name", "Age"]], df.loc[:, "Age":"School"]])

but then it doubles the rows each having some columns

#

oh wait wrong

#

this I mean

desert oar
#

that's a good idea, you can generalize it like this:

def slice_columns(df, *column_specs):
    return pd.concat([df.loc[:, spec] for spec in column_specs], axis=1)
plain verge
#

I actually found something cool
I did

pd.merge(df.loc[:, ["Name", "Age"]], df.loc[:, "Age":"School"])

however, the downside is that you can only do 1 slice or you need to have another merge inside
any better way to merge multiple objects instead of just 2 will fix this problem

desert oar
#

yeah you still have to de-duplicate columns after, i think the pd.concat version lets you do that more easily
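
A rough sketch of the `pd.concat` idea with the de-duplication step included. The column names and the one-row frame are invented for illustration, and keeping the first copy of a repeated column is an assumption about the desired behaviour, not something settled in the discussion.

```python
import pandas as pd

def slice_columns(df, *column_specs):
    """Concatenate several .loc column selections side by side,
    keeping only the first copy of any repeated column."""
    parts = [df.loc[:, spec] for spec in column_specs]
    combined = pd.concat(parts, axis=1)  # axis=1 glues columns, not rows
    return combined.loc[:, ~combined.columns.duplicated()]

# Made-up frame for illustration
df = pd.DataFrame({
    "Name": ["Ada"], "LastName": ["Lovelace"], "Age": [36],
    "City": ["London"], "School": ["Home"], "Interest": ["Math"], "Hobby": ["Music"],
})

# A list of labels, a closed label slice, and an open-ended label slice
result = slice_columns(
    df,
    ["Name", "LastName"],
    slice("Age", "School"),
    slice("Interest", None),
)
```

Since the `:` syntax is not available inside a function call, `slice("Age", "School")` stands in for `"Age":"School"` and `slice("Interest", None)` for `"Interest":`.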

plain verge
#

concat produces wrong set

desert oar
#

the bad part about defining a new function is that you can't use : syntax anymore

#

wrong how?

plain verge
#

it is like

Name    LastName    Age     School
Someone SomeLast    None    None
None    None        SomeAge SomeSchool
#

where there is actually only 1 entry
Someone SomeLast SomeAge SomeSchool

desert oar
#

that seems odd, did you forget axis=1?

plain verge
#

maybe

#

lemme try

#

oh wait now it works

#

I am like so confused about those axis things
like I have a software engineer kind of background
and I see all these multi-dimensional arrays as nested arrays
and it makes understanding axes hard for me
I need to research more and get comfortable with it

desert oar
#

it's good enough to think of a DataFrame as a collection of Series in a trenchcoat

#

the columns and index are labels for columns (1 column = 1 series) and rows, respectively

#

a Series has an index, those are element/row labels

#

all Series in a DataFrame share an index
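
The "collection of Series" picture can be checked directly. This is only a small illustration with a made-up frame, not a description of pandas internals.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

col_a = df["a"]  # selecting one column gives one Series
col_b = df["b"]

# Every column Series carries the same row labels as the frame itself
shares_index = col_a.index.equals(df.index) and col_b.index.equals(df.index)

column_labels = list(df.columns)  # labels for the columns
row_labels = list(df.index)       # labels for the rows
```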

#

the pandas docs don't have a clear explanation of this data model, so it's understandable if you don't get it right away

#

the index/columns thing itself is an interesting beast.. that's an instance of the Index class, which is array-like, but also acts like keys in a lookup table

#

(i'm not sure if internally it uses a b-tree index or hash index or something else, maybe it varies)

#

you can also have a multiindex, where each element is conceptually a tuple, but it also acts like a collection of individual Index array things

#

i've been wanting to write a guide to this stuff for months, but every time i try i find it very difficult to explain clearly, and very difficult to design a sensible learning path through it

#

i think everyone learns pandas by just stumbling around until things make sense 😆

plain verge
#

my confusion kinda comes from numpy
anyway I am feeling pretty sleepy rn so I probably won't understand anything rn XD
so like I guess I should continue tomorrow
I'll also read your explanation and continue this discussion tomorrow
thanks for your help tho

desert oar
#

you're welcome

plain verge
#

since this chat is not that active I guess I can find your message easily after even a day XD

desert oar
#

true, although things with better docs tend to reduce the amount of "stumbling" needed

#

pandas has a lot of it

plain verge
#

yeah

desert oar
#

it can be very active sometimes. i tend to write things down in a note file and copy the discord message link

plain verge
#

by the time I was talkin my chrome didn't load google for some reason so i couldn't google stuff so I could find concat and merge easier lol

plain verge
#

alright I'll come back tomorrow bye

silver summit
#

@desert oar hey salt, did you know how to unpack a list from a single row in a pyspark dataframe and return a series?

desert oar
#

the latter has a built-in method, explode

#

the former i definitely had to do in the past but i don't remember how and i do remember it was ugly

silver summit
#

separate row

#

@desert oar ayyyy, nice!! tyty

#

I'm not clear on when spark actually runs computation. Transformations are just added to the execution graph but not actually run until results need to be returned to the driver. In this case, ignoring the show, does the explode count as a transformation?

desert oar
silver summit
#

I have so many jobs at work that just take days b/c I wrote them shitty... really need to figure this all out.

#

using a lot of pandas udfs b/c I just dunno how to do stuff

desert oar
#

if you post some specific examples i might be able to help

silver summit
#

cool, ty!

#

do you work as ds or mle?

desert oar
#

neither, currently. but ds in the past

silver summit
#

oh, swe then?

desert oar
#

yep, less work for more money!

silver summit
#

haha, yeah I'm thinking that too

#

swapping over to MLE in the next 6mo or so, similar pay range as swe

#

coinbase throwing 400k at senior mle's right now 😮

desert oar
#

i didn't know the numbers were that high. i also don't know how much i want to work at coinbase 😆

silver summit
#

haha yeah for sure, but any big tech company will pay very well for MLE in general

jade acorn
#

anyone knowledgeable on scipy stats cdf and pdf? for calculating p value of F statistic

silver summit
#

@jade acorn most of these functions should return a tuple of the statistic and p

#

can you give an example?

jade acorn
#

I need to calculate the bottom part manually (Prob > F) and i already have the F(2,247) value which i also calculated manually, The F-value is the Mean Square Model divided by the Mean Square Residual yielding F=3.48, The p-value associated with this F value is 0.0325

silver summit
#

how did you define F?

jade acorn
#

F = (ssreg/modeldf)/(ssres/resdf) , ssreg being SS model , modeldf being models degrees of freedom,ssres being SS residuals, and resdf is the residuals degrees of freedom

#

and SS is Sum of Squares

silver summit
#

sure, this is from scipy?

jade acorn
#

the above picture is from the program called Stata

silver summit
#

oh

jade acorn
#

i just want to know how to do it in python

#

this is from some python code i wrote, as u can see its almost the same but i just cant figure out the p value of the F

silver summit
#

check scipy docs and sklearn docs a bit more, I'm certain it's there, my daughter just woke up from her second nap... gotta go get her, will be on much later today if you still need help

desert oar
#

@jade acorn it's better to ask your entire question, don't wait for someone to interview you in order to figure out what you want

#

you know this by now, i think

#

and to answer the question, you would do something like this to get an object representing the F(2, 247) distribution:

import scipy.stats as sstats

dist = sstats.f(2, 247)
#

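typically for a hypothesis test you want the upper-tail probability, 1 - cdf, which gives P(F > f)

#

scipy calls it the "survival function", sf (the "percentage point function", ppf, is the inverse cdf: it goes the other way, from a probability to a critical value)

#

so you might write sstats.f(2, 247).sf(3.475101) to get the p-value associated with the test statistic 3.475101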
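
A short sketch of the p-value calculation for the F statistic quoted above, using the F value and degrees of freedom from the Stata output. The p-value of an F test is the upper-tail probability, which `scipy.stats` exposes as the survival function `sf`; `ppf` is the inverse of the cdf and is shown only for the 5% critical value.

```python
from scipy import stats

f_stat = 3.475101            # test statistic from the discussion
df_model, df_resid = 2, 247  # model and residual degrees of freedom

dist = stats.f(df_model, df_resid)  # frozen F(2, 247) distribution
p_value = dist.sf(f_stat)           # P(F > f_stat), same as 1 - dist.cdf(f_stat)
critical_5pct = dist.ppf(0.95)      # ppf maps a probability back to a critical value
```

The result should agree with the 0.0325 reported by Stata to the digits shown there.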

wide meadow
#

Before feeding data to an algorithm, is it necessary to transform the features to normal distribution?

royal crest
#

There's also pingouin which is also quite comprehensive

#

!pypi pingouin

arctic wedgeBOT
modest timber
#

Hi, I have question about inputs in LSTM . What should be this if I have 20 data for inputs in model.fit

#

Could it be input_shape(20), and later model.predict(array of 20)? ( sorry for my english)

desert oar
desert bear
#

Good evening, I built an algorithm that compares two methods of finding local minimum of given functions.
Gradient Descent and Newton's method.
I run the algorithms with the same parameters and the following results were achieved.
In the same number of iterations Gradient descent completed it faster (~4s) and got closer to the local minimum.
Newton's method computed in ~7.25s and got further from the optimum.

I thought that newton's method would achieve better results in the same number of iterations. I mean, it is more computationally expensive, but still....
Does anyone have any thought on that?

#

I suspect that it depends on the step_size parameter. Both runs used the same value for it, but I think comparing the two methods with an equal step size is not a fair comparison. How do I explain that to my teacher?

desert oar
#

@desert bear i think your intuition is correct, but can you show the code? i never thought of the newton-raphson method as having a configurable step size

#

is it not x1 = x0 - f'(x0) / f''(x0)?

desert bear
#

This Beta parameter is my step_size

desert oar
#

normally i would set B_t to 1

#

that's the "theoretical" version

#

i think the point of using the hessian is that you can take fewer and bigger steps

desert bear
#

Okay, I set that in one of my tests, and It found local minimum in one iteration 😮

desert oar
#

yep! it is actually the "optimal" step size for a quadratic function
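
A small numerical sketch of why a step size of 1 is special for Newton's method on a quadratic. The matrix `A`, vector `b`, and starting point are made up for illustration. For f(x) = 0.5 xᵀAx - bᵀx the Hessian is `A`, so a single full Newton step lands on the minimizer, while a plain gradient step of the same size overshoots here.

```python
import numpy as np

# f(x) = 0.5 * x^T A x - b^T x, a convex quadratic chosen for illustration
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(x):
    return A @ x - b

x_star = np.linalg.solve(A, b)  # exact minimizer, where the gradient is zero
x0 = np.array([5.0, -4.0])

# Newton step with step size 1: x1 = x0 - H^{-1} grad(x0), with H = A
x_newton = x0 - 1.0 * np.linalg.solve(A, grad(x0))

# One gradient descent step with the same step size of 1
x_gd = x0 - 1.0 * grad(x0)

newton_error = np.linalg.norm(x_newton - x_star)
gd_error = np.linalg.norm(x_gd - x_star)
start_error = np.linalg.norm(x0 - x_star)
```

This is one way to argue that "same step size" does not mean "same conditions" for the two methods: the Newton step is already rescaled by the curvature, the gradient step is not.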

desert bear
#

Okay, let me read about it, thanks

desert oar
modest timber
#

how about my question? 😄

royal crest
#

so they built it upon pandas iirc, so it's cut down a lot of time for me doing stats

desert oar
#

yep that appears to be an explicit goal, scipy stats + pandas + a lot of r-like convenience functions

royal crest
#

validated against R equivalents too so that's one for reliability

robust jungle
#

Anyone know how to get the output node names from an xception model?

green phoenix
#

im trying to figure out how to make datapipe lines but everytime I do this I get the error that "income" is not in the dataset, can someone tell why?

desert oar
#

@modest timber i'm not sure if i understand. can you be more specific? input_size is the number of features at each time step, not the length of the input sequence

desert oar
arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

desert oar
#

include the error output as well

green phoenix
#

!paste

#

how do i make a code block

royal crest
#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

desert oar
#

☝️ read the box

#

same with the box under my post, !paste just generates the box with instructions

green phoenix
#

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

c_names = ["age", "workclass", "fnlwgt", "education", "education-num", "maritalstatus", "occupation", "relationship", "race", 
          "sex", "capital-gain", "capital-loss", "hours-per-week", "nativecountry", "income"]

df = pd.read_csv(url, names=c_names)

column_trans = make_column_transformer((
    OneHotEncoder(sparse=False), ["workclass", "occupation", "nativecountry"]),
    (LabelEncoder(), ["income"]),
    (OrdinalEncoder(categories=[' Preschool',' 1st-4th',' 5th-6th',' 7th-8th',' 9th',' 10th',' 11th',' 12th',' HS-grad', 
     ' Prof-school', ' Assoc-acdm', ' Assoc-voc', ' Some-college',' Bachelors',' Masters', ' Doctorate']),
     ["education"]), remainder="passthrough")

X = df.drop(["maritalstatus", "relationship", "race", "sex", "income"], axis="columns")
y = df.income

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

svmm = svm.SVC()

pipe = make_pipeline(column_trans, svmm)

scores = cross_val_score(pipe, X, y, scoring="accuracy",cv=5)



#

idk if thats the right way
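
A hedged sketch of one way the pipeline above could be restructured. The likely cause of the "income is not in the dataset" error is that `income` is listed in the column transformer but dropped from `X` before the pipeline sees it. `LabelEncoder` is meant for the target and is applied to `y` outside the transformer here, and `OrdinalEncoder` expects one list of categories per column. The tiny frame below is made up and only stands in for the adult dataset.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
from sklearn.svm import SVC

# Tiny made-up frame standing in for the adult dataset
df = pd.DataFrame({
    "age": [25, 38, 52, 41, 29, 60],
    "workclass": ["Private", "Private", "Self-emp", "Gov", "Private", "Gov"],
    "education": ["HS-grad", "Bachelors", "Masters", "HS-grad", "Bachelors", "Masters"],
    "income": ["<=50K", "<=50K", ">50K", "<=50K", ">50K", ">50K"],
})

X = df.drop(columns=["income"])
y = LabelEncoder().fit_transform(df["income"])  # encode the target outside the pipeline

education_order = ["HS-grad", "Bachelors", "Masters"]
column_trans = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    (OrdinalEncoder(categories=[education_order]), ["education"]),  # one list per column
    remainder="passthrough",
)

pipe = make_pipeline(column_trans, SVC())
pipe.fit(X, y)
predictions = pipe.predict(X)
```

With the real data the category strings would need to match the file exactly, including the leading spaces that appear in the original list.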

desert bear
# desert oar yep! it is actually the "optimal" step size for a quadratic function

Thanks a lot, you are extremely knowledgeable. These links are very useful. I decided to test both methods on different function (Rosenbrock function). Firstly I was nailed down, because Newton's method was jumping significantly, but then I found that it is correct behaviour (https://www.numerical-tours.com/matlab/optim_2_newton/).

One thing is not clear for me. How did the teacher want me to compare the results of both algorithms for the same parameters.
Newton's gives best result for step_size=1, but when gradient descent is fed with this parameter it produces points of coordinates' values 1e+20, basically it makes too big steps. It seems incomparable.

desert oar
#

Maybe that's part of the exercise?

#

See how instructive it was to try the different sizes?

#

I bet if you tried gradient descent with the same step size, it would go all over the place

desert bear
# desert oar Maybe that's part of the exercise?

Yea, maybe, I will sum all this observations in my report. The teacher probably won't be happy, since he seems like a "do simple - simple means less reading for me". Thanks a lot for making me understand it more

modest timber
#

@desert oar I came here because I am kind of confused - I use a list with 20 input signals - input_shape=(20,1)

#

but i got error

#
ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 1)
#

predicted_stock_price = model.predict(X_predict[0:20])

desert oar
#

can you show the code for the model? what is the shape of X_predict?

modest timber
#

I could send it to you as a file, because it's somewhat complex

#

with data i operate on

#

ok?

distant trout
#

Hi, could anyone help me with gradient descent for a summation function like in the png? I only see explanations for the x^2 function, but I cannot find any information on how to deal with this one. Any ideas?

modest timber
#
for i in range(0, 10):
    a = X_predict[i:20+i]
    print(a.shape)
    predicted_stock_price = model.predict(a)
#

shape = 20,1

#

model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train.shape[1],1)))

#

shape = 20,1

#
 ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 1)
tender hearth
#

What do you think?

desert oar
#

that is, each time step should be [x1, ..., x20]

modest timber
#

Maybe i need to add simply the 1 value, and should have shape 20,1,1
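
A NumPy-only sketch of the shape issue. Keras recurrent layers expect 3-D input of shape (batch, timesteps, features), so a single window of shape (20, 1) is read as 20 samples of shape (1,), which is consistent with the "Full shape received: (None, 1)" message. Adding a leading batch axis gives (1, 20, 1). `X_predict` here is a made-up stand-in for the real series.

```python
import numpy as np

X_predict = np.arange(30, dtype=float).reshape(30, 1)  # stand-in for the (n, 1) series

window = X_predict[0:20]         # shape (20, 1): timesteps x features
batch = window[np.newaxis, ...]  # shape (1, 20, 1): one sample for the model
# window.reshape(1, 20, 1) would be an equivalent way to add the batch axis
```

Under that assumption the call in the loop would become `model.predict(a[np.newaxis, ...])`, with the batch axis first rather than last.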

granite flame
#

hi, can sequential API handle inputs of nonlinear relationship in my case i have input variables as flow rate and temperature?

onyx drum
#

How do I write a huge data list (say with 10 million data points) into a .txt file? When I do it for a few hundred points, it's fine, but for millions, it stores it as "[1.2 3.4 1.6 ... 1.4 1.2]" in the shortened form.

I tried writing it element by element, but the for loop makes it slow. Any way to directly write the whole list?

royal crest
#

!d numpy.savetxt

arctic wedgeBOT
#

numpy.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ', encoding=None)
Save an array to a text file.
royal crest
#

this could be useful

#

though i have not benchmarked it myself

onyx drum
#

Aha, thanks!
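
A minimal sketch of the `numpy.savetxt` route. The shortened "..." form most likely comes from writing the string form of the array, which NumPy summarizes for large arrays; `savetxt` writes every element instead. The array size, file name, and `fmt` below are arbitrary choices for illustration.

```python
import os
import tempfile
import numpy as np

values = np.random.default_rng(0).random(100_000)  # stand-in for the large list

path = os.path.join(tempfile.mkdtemp(), "values.txt")
np.savetxt(path, values, fmt="%.6g")  # one value per line, nothing elided

loaded = np.loadtxt(path)  # read it back to check that nothing was dropped
```

A shorter `fmt` than the default `'%.18e'` keeps the file smaller at the cost of precision.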

quasi parcel
#

i have created a recommendation engine can anyone go through it please

#

and let me know if its correct or not

wicked grove
#

Hello, can i combine 2 image datasets together for multi class classification and train them with a cnn model?

plain verge
# desert oar i've been wanting to write a guide to this stuff for months, but every time i tr...

yeah
the thing as much as someone explains, for some stuff you'll understand better when u know how it works
pandas is a complex library, but I guess it's possible to figure out how it works in a high level
that way it helps me understand better
like now I know that those are series, but I don't know where that axis goes and what it does
in the docs it only says the axis of operation and nothing else
I'll figure it out by stumbling around easier than to try to find some form of article describing it

glossy moth
#

Hi all- really dumb question:
I am using matplot and I have a 3x14 table. For column 3, my title is significantly longer than columns 1 and 2, and unfortunately if I leave it the text inside the table becomes unreadable. Changing text size simply scales the column titles too so the problem remains. If I shorten the column 3 title, the issue resolves. Is there any way to resolve this without shortening my header?

pearl beacon
#

Hi! I'm trying to make a speech recognition script for a personal project, and I've decided on mozilla deepspeech. My problem is, I don't really understand the audio handling, and I want to remove the VAD feature from this script so I can manually control when it records:
https://github.com/mozilla/DeepSpeech-examples/tree/r0.9/mic_vad_streaming

brave sparrow
#

hi guys

#

how do i make not 1 output but two for this code

lone drum
#

Hello I have a dataframe in which one column has 'CE' and 'PE' values in that column
I have to separate this column based on these values
For eg
'CE' values are saved in different data frame and
'PE' values saved in another data frame
Ping me when replying

#

My code

df_chunks = pd.read_csv(f'{input_path}{input_file}{extension}' , engine='python',  chunksize=500000, names=['Msgtype', 'Activity Type', 'Transaction Time', 'script_name', 'expiry', 'strike_price', 'call/put', 'Exchange', 'Token', 'Buy/Sell', 'Buy Order Number', 'Sell Order Number', 'Price', 'Qty', 'price_in_rupees', 'lot'])
i=0
for chunk in df_chunks['call/put']:
    print('chunk')
    print(chunk)
    # for i in chunk['call/put']:
    #     put_val = chunk.loc[chunk['call/put'] == 'PE']
    #     call_val = chunk.loc[chunk['call/put'] == 'CE']
    #     print(put_val)            
    #     print(call_val)
    #     put_val.to_csv(f'{output_path}{output_file_put}{extension}', index =False, header = None, mode = 'a')
    #     call_val.to_csv(f'{output_path}{output_file_call}{extension}', index =False, header = None, mode = 'a')
    #     break
    # break
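
A rough sketch of splitting a chunked read by the 'call/put' column. With `chunksize` set, `pd.read_csv` returns an iterator of DataFrames, so the column has to be selected on each chunk rather than on the reader itself. The in-memory CSV and the two-column layout are made up to keep the example short.

```python
import io
import pandas as pd

# Small in-memory CSV standing in for the real file; only two columns for brevity
raw = io.StringIO("NIFTY,CE\nNIFTY,PE\nBANKNIFTY,CE\nBANKNIFTY,PE\nNIFTY,CE\n")

call_parts, put_parts = [], []
for chunk in pd.read_csv(raw, names=["script_name", "call/put"], chunksize=2):
    # each chunk is an ordinary DataFrame, so boolean masks work as usual
    call_parts.append(chunk[chunk["call/put"] == "CE"])
    put_parts.append(chunk[chunk["call/put"] == "PE"])

call_df = pd.concat(call_parts, ignore_index=True)
put_df = pd.concat(put_parts, ignore_index=True)
```

For a file too large to hold in memory, each filtered piece could instead be appended to its own CSV with `to_csv(..., mode='a', header=False, index=False)`, as in the commented-out lines above.
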
exotic coral
#
def data_type_format(data, indexes):
    "remove the header row and convert all the columns to type float"
    headless = data[1:, :]
    # sinker = headless[:, indexes]
    # floater = np.delete(headless, indexes, 1)
    # sunk = sinker.astype('<U30')
    # floated = floater.astype(float)
    indices = np.arange(9)
    mask = np.delete(indices, indexes, 0)
    mask_list = list(mask)
    a =  headless[:, indexes].astype('<U30')
    b = headless[:, mask_list].astype(float)
    headless.astype(object)
    # answer = np.concatenate((sunk, floated), axis=1)
    # np.sort(answer)
    # for index in data:
    #     answer.append(tuple(index))
    return headless
#

I'm not sure how to get two different dtypes in the same array
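
One possible approach, sketched under the assumption that a single 2-D array with mixed cell types is acceptable: an array with `dtype=object` lets each cell hold its own Python type. Note that `astype` returns a new array, so the `headless.astype(object)` line in the function above has no effect unless its result is assigned. The sample data and column positions are invented.

```python
import numpy as np

data = np.array([["name", "height", "weight"],
                 ["ann", "1.6", "55"],
                 ["bob", "1.8", "80"]])

headless = data[1:, :]  # drop the header row
text_cols = [0]         # columns to keep as strings
num_cols = [1, 2]       # columns to convert to float

# object dtype lets each cell hold its own Python type
mixed = np.empty(headless.shape, dtype=object)
mixed[:, text_cols] = headless[:, text_cols]
mixed[:, num_cols] = headless[:, num_cols].astype(float)
```

Object arrays lose most of NumPy's speed, so a structured array or a pandas DataFrame may be a better fit if the data is going to be processed further.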

hard pelican
#

Hey,
In pandas, I have time, value and mark_upcoming_change columns, I want to calculate the amount of time a column was on a specific value, as seen here

#

right column is the one I want to calculate

obsidian bramble
#

heya

#

i wanted to integrate an alarm clock system in a virtual assistant

#

how do i do

#

?

serene scaffold
obsidian bramble
#

idk

#

what is the difference between a and data science

lone drum
#

Hey stelercus
When I write in CSV file using pandas some of columns are not completely filled

#

the highlighted part is getting empty

#

Can u please help me to understand this?

#

Why I am getting this way

#

In my original data file i have complete data

serene scaffold
#

@lone drum I would need to see the original CSVs (no screenshots) and the code you used to create this table (no screenshots). Please ping me if you provide that.

odd meteor
serene scaffold
desert oar
# plain verge yeah the thing as much as someone explains, for some stuff you'll understand bet...

the "axis of operation" is a numpy concept. you can think of it as the "the axis that is consumed" when performing an operation.

this is easy to see with an "aggregation" operation like DataFrame.sum:

>>> df = DataFrame({'a': [1,2,3], 'b': [4,5,6]})
>>> df.sum(axis=0)  # axis='index'
a     6
b    15
dtype: int64

this means "apply the .sum operation by iterating over the 0th axis (the index)".

DataFrame.apply is a bit trickier:

>>> df = DataFrame({'a': [1,2,3], 'b': [4,5,6]})
>>> add_one = lambda y: y + 1
>>> df.apply(add_one, axis=0)  # axis='index'
   a  b
0  2  5
1  3  6
2  4  7

the add_one function is applied to each column, thereby "consuming" the entire index for each column
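
For comparison, a small sketch of the same aggregation along the other axis, using the frame from the examples above.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

col_sums = df.sum(axis=0)  # consumes the index: one result per column
row_sums = df.sum(axis=1)  # consumes the columns: one result per row
```

Here `col_sums` is labelled by the column names and `row_sums` by the row index, which matches the "axis that is consumed" reading.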

lone drum
royal crest
#

just grab the first 10 rows?

plain verge
serene scaffold
arctic wedgeBOT
#

Hey @lone drum!

It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

lone drum
worthy crystal
#

hello I have a question if my model is overfitting or not

#

I am doing CAE

#

does this seem like its overfitting? I see gap its "big" in a way but it is only 0.0020 difference between them

#

should I support that is overfitting or not?

#

axis X is epochs

desert oar
plain verge
#

hi everyone

#

I have some weird data that loads very inefficiently when I import it to pandas using json_normalize
simple format of the json is

[
 {
  "id": 12342,
  "type": "node",
  "tags": [ "amenity":"table", "size":2 ]
 }
]

id and type are always there
however tags can be empty, have any amount of k/v pairs, and there are total of 110,000 possible keys (eg. amenity)

#

How can I properly import this to a pandas DataFrame with efficient access to tags?

desert oar
# plain verge I have a weird data that when I import to pandas using json_normalize loads very...

[ "amenity":"table", "size":2 ] isn't valid syntax in either python or json. did you mean { "amenity":"table", "size":2 }?
i would load it like this to start:

data = [
 {
  "id": 12342,
  "type": "node",
  "tags": { "amenity":"table", "size":2 }
 },
 {
  "id": 93823,
  "type": "node",
  "tags": {}
 }
]

df = pd.DataFrame(data)
      id  type                             tags
0  12342  node  {'amenity': 'table', 'size': 2}
1  93823  node                               {}
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   id      2 non-null      int64
 1   type    2 non-null      object
 2   tags    2 non-null      object
dtypes: int64(1), object(2)
memory usage: 176.0+ bytes
plain verge
#

hmm

#

but then tags are dictionary

#

oh wait

desert oar
#

right, efficient depends heavily on what you're trying to do

plain verge
#

so basically it will be a dataframe inside another dataframe right?

desert oar
#

no, it's literally a dict in each element of the tags column

plain verge
#

what if I convert that dict to a dataframe first?

desert oar
#

you can, but why would you?

#

that'd be way more confusing imo

plain verge
#

maybe since dataframe is faster than dict?

desert oar
#

faster for what?

plain verge
#

I need to run a lot of search stuff on the tags part
so I may as well need to use some dataframe features other than performance

quasi parcel
#

correct me if i am wrong, can't we use the flatten_json method here @desert oar

plain verge
#

how can I do that fast?
like I don't wanna run a for loop doing df["tags"] = pd.DataFrame(js[i]["tags"])

desert oar
desert oar
plain verge
desert oar
#

and i am asking you how the dataframe needs to look

#

give an example

plain verge
#

ok

serene scaffold
#

print(df.head().to_csv())

#

@lone drum ^

plain verge
#

the dataframe would be just the normal thing
like we have this parent dataframe called pf
and pf["tags"] is another dataframe with simple format like Columns: key and value
for example for my example there will be 2 rows:

key       value
amenity   table
size      2
#

I want something like this

silver summit
#

anyone know how to explode a bytearray? I have a bytearray in a single row of a pyspark dataframe and I need to get each value of the bytearray into a new row, the error I'm getting is

AnalysisException: cannot resolve 'explode(content)' due to data type mismatch: input to function explode should be array or map type, not binary;
'Project [explode(content#131) AS List()]
+- Relation [path#128,modificationTime#129,length#130L,content#131] binaryFile
#

command I'm using is just df.select(F.explode('content'))

quasi parcel
desert oar
#
id  type  key      value
 1  node  amenity  table
 1  node  size     2
 2  node  size     large

like this?

plain verge
#

this will work, but isn't it bad to have a row duplicated several times?

#

I was thinking of this format

#
   id   type         tags
0  1234 node         <pd.DataFrame object>
1  8897 way          <pd.DataFrame object>

where that object has this format

   key     value
0  amenity table
1  size    2
desert oar
#

it's not really bad, don't over-optimize. if the id is the index, it won't be duplicated. you can keep the "metadata" separate from the "tags" if you want

id  key      value
 1  amenity  table
 1  size     2
 2  size     large

id  type
 1  node
 2  node
plain verge
desert oar
#

you can do that, but i don't recommend it

plain verge
plain verge
desert oar
#

a dataframe inside each element is not really better than a dict in each element, in that pandas has to loop slowly over the series of dataframes

#

don't fall into the trap of "pandas fast, more pandas more fast"

plain verge
#

hmmm I see

#

I almost fell for that

desert oar
#

dict lookups will be faster than dataframe lookups in most cases anyway

#

so I am trying to optimize it as much as possible
have you heard the quote, "premature optimization is the root of all evil"?

#

that said, exploding this to a key-value format like i described would probably make it easier to work with

#

it really depends on what kinds of operations you're trying to do

plain verge
plain verge
desert oar
#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

silver summit
#

I can use conv to convert the bytearray to a value, but need to map the convert to each element then explode somehow

plain verge
#

so

#

I can tell you more details

#

I wanna be able to:

#

get all rows that have a specific key <and any other>

#

get all rows with a specific k/v pair

#

get all rows that have at least one key

#

search by regular expression on key names and get all that match

#

nothing else
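
A minimal sketch of those four lookups, assuming each record keeps its tags as a plain dict. The function names and the sample records are made up for illustration.

```python
import re

# Hypothetical sample records; the ids and tags are made up.
records = [
    {"id": 1, "type": "node", "tags": {"amenity": "table", "size": 2}},
    {"id": 2, "type": "node", "tags": {"size": "large"}},
    {"id": 3, "type": "way", "tags": {}},
]

def with_key(rows, key):
    """Rows whose tags contain `key` (other keys may be present too)."""
    return [r for r in rows if key in r["tags"]]

def with_pair(rows, key, value):
    """Rows whose tags contain this exact key/value pair."""
    return [r for r in rows if key in r["tags"] and r["tags"][key] == value]

def with_any_key(rows):
    """Rows that have at least one tag."""
    return [r for r in rows if r["tags"]]

def with_key_matching(rows, pattern):
    """Rows where at least one tag name matches the regular expression."""
    rx = re.compile(pattern)
    return [r for r in rows if any(rx.search(k) for k in r["tags"])]

print([r["id"] for r in with_key(records, "size")])
print([r["id"] for r in with_pair(records, "amenity", "table")])
print([r["id"] for r in with_any_key(records)])
print([r["id"] for r in with_key_matching(records, r"^am")])
```

Each lookup is a single pass over the records, which is usually fine until the data gets large.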

desert oar
desert oar
silver summit
#

@desert oar yeah that sounds reasonable, be back in a couple hours to review this

plain verge
#

oh I didn't realize bruh

plain verge
worthy crystal
#

is it okay that the val loss is the same as the loss?

#

I know this is not overfitting

#

but it is another problem?

desert oar
# plain verge I wanna be able to:

!eval ```python
import pandas as pd

data = [
    {"id": 12342, "type": "node", "tags": {"amenity": "table", "size": 2}},
    {"id": 93823, "type": "node", "tags": {"color": "blue"}},
]

id_column = 'id'
tag_column = 'tags'
meta_columns = ['type']

df = pd.DataFrame(data).set_index('id')

# tag_kv_pairs is a Series of tuples: (tagName, tagValue)
tag_kv_pairs = (
    df[tag_column]
    .map(lambda kv: list(kv.items()))
    .explode()
)
df_tags = pd.DataFrame(
    tag_kv_pairs.tolist(),
    index=tag_kv_pairs.index,
    columns=['key', 'value'],
)

df_meta = df[meta_columns].copy()
del df

print(df_meta)
print(df_tags)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 |        type
002 | id         
003 | 12342  node
004 | 93823  node
005 |            key  value
006 | id                   
007 | 12342  amenity  table
008 | 12342     size      2
009 | 93823    color   blue
plain verge
#

oh so there is explode() function
thanks for your help I really appreciate it

worthy crystal
quasi parcel
#

i dont know how to handle customer_id with null

#

one moment

#

let me share the data

#

i dont know how to handle customer_id with null or 0

#

can anyone help me with that

desert oar
#

how do you want to handle it? @quasi parcel

desert oar
#

you might want to manually input some made-up data to see what the predictions are, make sure they make sense

quasi parcel
#

so i am building a recommendation engine. is it okay to create a separate df where i keep all the rows with a null customer_id

#

?

worthy crystal
#

I created noise input

desert oar
worthy crystal
#

and it clear it

desert oar
#

i don't know what SotA numbers are, but you should always be suspicious if your DIY thing is beating or coming close to SotA

#

otherwise it's probably okay? i'm not much of an image classification expert

worthy crystal
#

ahh okayy thank you for your help!!!!!

quasi parcel
#

can you suggest me a way or its too much to ask @desert oar

desert oar
grave frost
desert oar
#

are you doing user-user or user-item collaborative filtering? you might need to have 2 separate recommendation models: one that uses customer ids, and one that doesnt. then you ensemble them together when customer id is present, and you use the non-id model otherwise. i've done things like that before (although not specifically in the case of customer ids and recommendations)

#

@quasi parcel ☝️
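
A rough sketch of that routing idea in plain Python. The two model functions are hypothetical stand-ins that return a score per item, and the weighted average is just one simple way to ensemble them, not necessarily the one described above.

```python
def popularity_model(context):
    """Hypothetical model that does not need a customer id."""
    return {"a": 1.0, "b": 0.5}

def personal_model(customer_id, context):
    """Hypothetical model that uses the customer id."""
    return {"b": 1.0, "c": 0.5}

def recommend(customer_id, context, id_model, no_id_model, weight=0.5):
    """Blend both models when a customer id is present, else use the non-id model."""
    base_scores = no_id_model(context)
    if not customer_id:  # treats None and 0 as "no customer id"
        return base_scores
    personal_scores = id_model(customer_id, context)
    items = set(base_scores) | set(personal_scores)
    return {
        item: weight * personal_scores.get(item, 0.0)
        + (1 - weight) * base_scores.get(item, 0.0)
        for item in items
    }

print(recommend(None, {}, personal_model, popularity_model))
print(recommend(42, {}, personal_model, popularity_model))
```

This way the null-id rows never need to be dropped; they just take the branch that ignores the id.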

quasi parcel
#

ohh okay

#

i think i got some clarity thanks

#

if i have any doubts

#

i will ask again

robust jungle
#

does anyone know how I can get output node names from a keras model?

delicate tree
#

there is the mysql.connector thing for sending ur responses from python to mysql, is there something like that for csv files
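
For CSV files the standard library's `csv` module plays a similar role. A minimal sketch, assuming a made-up file name `responses.csv` and made-up column names:

```python
import csv
import os
import tempfile

# Hypothetical file name and columns, chosen only for illustration.
path = os.path.join(tempfile.mkdtemp(), "responses.csv")
rows = [
    {"name": "alice", "answer": "yes"},
    {"name": "bob", "answer": "no"},
]

# newline="" lets the csv module control line endings itself.
with open(path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "answer"])
    writer.writeheader()
    writer.writerows(rows)

# Read the rows back; every value comes back as a string.
with open(path, newline="") as f:
    loaded = list(csv.DictReader(f))

print(loaded)
```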

sacred narwhal
#

is this a good tutorial to learn pytorch

fluid pebble
#

hi everyone

#

i am new to python and i have developed a program to spot differences between 2 images, but it reports too many differences even if there is no difference between the images. kindly help me with that

robust jungle
#

sure

#

can you send

#

the code

#

the example images

#

and the output?

silver summit
#

what's your distance metric?

fluid pebble
silver summit
#

If you're comparing the pixel values in the image you'll have a tough time. You will need some function that says these are "close enough" to be considered the same.
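
One way to express "close enough", sketched in plain Python on tiny grayscale images stored as nested lists. The tolerance values are assumptions and would need tuning on real images, and a real program would more likely work on NumPy or OpenCV arrays.

```python
def fraction_different(img_a, img_b, pixel_tol=10):
    """Fraction of pixels whose values differ by more than pixel_tol."""
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    total = 0
    changed = 0
    for row_a, row_b in zip(img_a, img_b):
        for a, b in zip(row_a, row_b):
            total += 1
            if abs(a - b) > pixel_tol:
                changed += 1
    return changed / total if total else 0.0

def roughly_same(img_a, img_b, pixel_tol=10, max_fraction=0.01):
    """True when at most max_fraction of the pixels changed noticeably."""
    return fraction_different(img_a, img_b, pixel_tol) <= max_fraction

# Made-up 2x3 images: one with slight noise, one with a dark spot.
original = [[100, 100, 100], [100, 100, 100]]
noisy = [[101, 99, 103], [100, 98, 100]]
spotted = [[100, 100, 100], [100, 0, 100]]

print(roughly_same(original, noisy))
print(roughly_same(original, spotted))
```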

fluid pebble
#

well i think the program is comparing pixel wise because the images have to be the same size

#

is it true??

silver summit
#

you're talking about 2 different things, size != pixel value

#

sure, you should probably make sure the dimensions line up, but the real challenge is how you compare the images

fluid pebble
#

should i send you the code?

silver summit
#

post it on github

#

link here

fluid pebble
#

okay

silver summit
#

I can spend like 10min looking at it now or can review it later. Lot of smart ppl here to help however.

fluid pebble
#

okay but just one thing

stiff inlet
#

hey guys

#

can i use some help?

heavy sail
#

Hi, I'm trying to:

silver summit
#

@heavy sail reduce the equation, you have x on top and bottom, same with y, what do you have?

fluid pebble
heavy sail
robust jungle
#

or that it is different at all

silver summit
#

@fluid pebble ok, keep in mind that the wording sounds straight forward but this can be very complicated... if you're saying you have to pick out words embedded in the image... well

#

need like ocr for that, but I'm willing to bet your task is much simpler than that

heavy sail
#

thanks

fluid pebble
robust jungle
#

take this with a grain of salt:
find synonyms for height
search for those
find a number following it
search for what unit it's using
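
The steps above could be sketched with a regular expression. The synonym list and unit list here are made-up examples, not an exhaustive vocabulary.

```python
import re

# Hypothetical synonyms and units; extend these for real data.
SYNONYMS = ["height", "tall", "stature"]
UNITS = ["cm", "m", "in", "ft"]

synonym_part = "|".join(SYNONYMS)
unit_part = "|".join(UNITS)

# A synonym, then up to 20 non-digit characters, then a number, then a unit.
pattern = re.compile(
    rf"\b(?:{synonym_part})\b\D{{0,20}}?(\d+(?:\.\d+)?)\s*({unit_part})\b",
    re.IGNORECASE,
)

def find_height(text):
    """Return (value, unit) for the first height mention, or None."""
    match = pattern.search(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()

print(find_height("Height: 180 cm"))
print(find_height("no numbers here"))
```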

robust jungle
#

or are they both options

fluid pebble
#

just comparison of 2 images

robust jungle
#

yes, im talking about the program to do it

fluid pebble
#

one is original and the other is new one

robust jungle
#

are you being given a program and being told to modify it in some way

#

or

#

are you being given a task as above and being told to make something to do it

fluid pebble
robust jungle
#

alright

fluid pebble
#

at office

robust jungle
#

im a newbie to this so I haven't tried many things, but I know that one way it could work is with keras

#

since I have a program that can do something similar to that in theory

#

my idea:

#

use transfer learning

fluid pebble
#

can you share

robust jungle
#

data augmentation if you wanted to ignore something

#

basically

#

use that image you have as a dataset to train it off of

#

input the 2nd image

#

get output

#

might work might not

#

ill test it gimme a second

fluid pebble
#

i have started working on python only a week ago

robust jungle
#

og

#

oh

#

brb

fluid pebble
#

okay

silver summit
#

this isn't a training and testing problem

#

it's just a math problem

#

if you need to pick out parts of the image, this is called semantic segmentation

#

if this is the case just use some off the shelf image models and maybe ocr to pick out text, you should not have to train anything or build any models

robust jungle
#

neat

fluid pebble
#

i have used ocr also but the accuracy of ocr was really bad

silver summit
#

You need to define the problem in much more detail for us to help

fluid pebble
#

okay i will define deeply

lapis sequoia
#

how do i learn ai without going to youtube

silver summit
#

for example, if you say the images need to be the same do you mean exactly? like pixel for pixel? or can one be stretched a bit, rotated or flipped and still be the same? can it have a bit of noise on it (like some small percentage of the pixels are different) and it still be the same? can it have a bit of text on it but the image is still the same etc

#

also what is the context? this is for work but how will it be used? business context, timelines etc etc

desert oar
#

"bag of words", not "word of bags" 🙂 the idea is that the order of the words in the document is ignored, so it's like you took all the words, dumped them into a bag, and shook the bag around.

as for your actual question: you might want to read this book chapter https://web.stanford.edu/~jurafsky/slp3/23.pdf from Speech and Language Processing, a currently-in-progress book (homepage is here https://web.stanford.edu/~jurafsky/slp3/)
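
A tiny sketch of the idea using `collections.Counter`. The lowercasing and whitespace splitting are simplifying assumptions; real tokenization is usually more careful.

```python
from collections import Counter

def bag_of_words(document):
    """Count words, ignoring the order they appeared in."""
    return Counter(document.lower().split())

a = bag_of_words("the dog chased the cat")
b = bag_of_words("the cat chased the dog")

# Different sentences, same bag: word order is thrown away.
print(a == b)
print(a["the"])
```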

fluid pebble
desert oar
#

OCR sounds like a good start for the "text" part

silver summit
#

yup

desert oar
#

you can probably solve that problem without heavy-duty "machine learning"

silver summit
#

color is harder to figure out

#

you may just use a histogram of rgb values

desert oar
#

e.g. something like levenshtein distance on the OCR'ed text (maybe something word-based and not character-based),

fluid pebble
#

lets omit color

silver summit
#

compare the histogram distributions, this should be pretty straight forward

desert oar
#

you might need to put in some kind of adjustments to account for the scanning (?) process

fluid pebble
#

just missing text, and whether there is a spot in the new image that isn't in the ideal image

desert oar
#

i wonder how this works, it seems kind of like what you're asking for?

#
#

are you trying to match up receipts or something?

silver summit
#

omg receipts fucking suck....

desert oar
#

heh, apparently expensify just has people manually enter receipts

fluid pebble
silver summit
#

I've tried this before at work... so bad

desert oar
fluid pebble
#

have you ever used polyfax?

desert oar
#

i have not, it looks like some kind of antibiotic

fluid pebble
#

the small boxes of medication cream

silver summit
#

yeah that sounds reasonable, ocr out the text, compare the word sets with edit distance (mentioned above)
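
A minimal sketch of word-level edit distance in plain Python. The two sample strings are made up, standing in for the expected label text and the OCR output.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of words)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # delete item_a
                curr[j - 1] + 1,     # insert item_b
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]

# Made-up example: one misspelled word and one missing word.
expected = "apply twice daily to the affected area".split()
scanned = "apply twice daly to affected area".split()

print(edit_distance(expected, scanned))
```

Splitting on words rather than characters means a single OCR typo counts as one difference regardless of how long the word is.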

fluid pebble
silver summit
#

are you sure? what have you tried?

desert oar
#

OCR is really good nowadays, maybe the scanning process is really noisy?

silver summit
#

I've done ocr on products on grocery store shelves... it's pretty good

fluid pebble
#

it misses some text and it also misspells some words

#

i have used easyocr

desert oar
#

is the text not english? maybe non-english ocr is a lot worse

#

are you doing something like detecting counterfeit products?

fluid pebble
silver summit
#

well that's not a correct conclusion, if you tried one ocr you cannot say ocr sucks... there are many ocr models

fluid pebble
#

i also tried pytesseract, the same problem exists

silver summit
#

I gotta run, I think this problem is very doable.

fluid pebble
#

okay

robust jungle
#

quick question: how can I get output node names from a keras Xception model?

#

specifically I want to freeze it

#

in order to use it with opencv

lapis sequoia
#

so i was watching a tutorial of tech with tim about saving models and visualizing data and it was fine but the code was supposed to train the models and use the best one

#

and what it is doing is using the last model trained

#
```python
# Import libraries
import pickle

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import style
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

style.use("ggplot")

data = pd.read_csv("student-mat.csv", sep=";")

predict = "G3"

data = data[["G1", "G2", "absences", "failures", "studytime", "G3"]]
data = shuffle(data)  # Optional - shuffle the data

x = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)


# TRAIN MODEL MULTIPLE TIMES FOR BEST SCORE
best = 0
for _ in range(20):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)

    linear = linear_model.LinearRegression()

    linear.fit(x_train, y_train)
    acc = linear.score(x_test, y_test)  # R^2 on this split's test set
    print("Accuracy: " + str(acc))

    # Only overwrite the saved model when this one scores higher
    if acc > best:
        best = acc
        with open("studentgrades.pickle", "wb") as f:
            pickle.dump(linear, f)

# LOAD MODEL
with open("studentgrades.pickle", "rb") as pickle_in:
    linear = pickle.load(pickle_in)


print("-------------------------")
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
print("-------------------------")

# x_test and y_test here come from the last split made in the loop
predicted = linear.predict(x_test)
for i in range(len(predicted)):
    print(predicted[i], x_test[i], y_test[i])

# Drawing and plotting model
plot = "studytime"
plt.scatter(data[plot], data["G3"])
plt.xlabel(plot)
plt.ylabel("Final Grade")
plt.show()
```
iron basalt
#

The idea of a generative adversarial network can exist in the human brain, but the specific thing referred to as a GAN in Deep Learning can't, it's not biologically plausible. @surreal elm

surreal elm
#

Have you seen the recent suggestion that dendrites are logic gates?

#

XOR / NOR /NAND / etc?

iron basalt
#

That's a 1940s thing and was the first thing suggested.

#

They built several logic gates out of real neurons.

surreal elm
#

originally they assumed it was all on / off

#

this is much more complex

iron basalt
#

Yes neurons are far more complex than anything currently used in code.

surreal elm
#

hmm that link is not loading now

#
#

this apparently adds about 20x power to the estimated potential

#

and there is likely more

iron basalt
#

What makes Deep Learning not biologically plausible is things such as backpropagation (through multiple layers) and convolutions (shared weights specifically; there is no sliding window in the human brain for obvious reasons, but a similar thing, multiple receptive fields, can do the trick).

surreal elm
#

the neurons can share info too locally

#

the capsid viral shell thing

#

that packages packets of data

#

I have to look up its name again

#

"ARC proteins"

#

so there is a second and even third channel

#

THC is backpropagating

#

(cannabinoids)

#

so it's not easy to even imagine how data is flowing

iron basalt
#

Real neural networks do not work well on Von Neumann machines. They require special hardware, specifically https://en.wikipedia.org/wiki/Reservoir_computing. The gains from this are not just 20x, it's much larger, but also can't really be directly compared.

torpid raptor
#

can someone help me to scrape a website

#

i've been stuck here for about 13 hours

#

please if any expert in web scraping can help

quasi parcel
#

YES

#

where are you stuck at

#

@torpid raptor

plain verge
#

hi everyone
why is this so hard
I have a geopandas data frame containing some polygons
some polygons are inside each other, for example there may be a big park containing a couple small playgrounds inside it
I wanna get rid of the polygons that are inside another polygon, in my geodataframe
how can I do it?

iron basalt
# surreal elm THC is backpropagating

When referring to backpropagation I mean like in Deep Learning, which is most definitely not biologically plausible. It's why the original learning rules for NNs did not use backpropagation either.

thorn crag
#

Hello, I've been learning python for a bit more than 1 year now and am currently learning JS (following a web development path which will lead to going through Django).
However I am starting to think that I don't like web development - so I was thinking maybe going for what python's most popular for - data science and machine learning.
Could someone give some tips on whether it's a good idea for someone who isn't good at math to start this long path?
I really like OOP as a concept and I am not sure if I can apply my knowledge in this new field.
Any courses for newbies?

tidal bough
#

I don't think it's possible to not even read the other columns (with CSV files - it's possible with some formats like FWF), but it probably just discarded all other columns of each row right after reading that row.

#

oh, I see what you're asking though; that'd require some sort of stream decoding. Is that possible for ZIP?..

desert oar
#

yeah it just discards the unused fields and i think doesn't even parse them

#

i think zlib does have some kind of streaming support
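
A small sketch of reading a CSV straight out of a ZIP archive with only the standard library. The archive is built in memory so the example is self-contained, and the member name and columns are made up. Note that with the `csv` module every field on a line is still parsed and the unused ones are simply dropped afterwards.

```python
import csv
import io
import zipfile

# Build a small ZIP archive in memory (hypothetical member name and columns).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("data.csv", "id,name,score\n1,ann,0.5\n2,bob,0.9\n")

wanted = ["id", "score"]
selected = []
with zipfile.ZipFile(buffer) as archive:
    # ZipFile.open returns a file-like object that decompresses as it is read.
    with archive.open("data.csv") as raw:
        text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
        for row in csv.DictReader(text):
            # Keep only the wanted columns from each parsed line.
            selected.append({name: row[name] for name in wanted})

print(selected)
```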

surreal elm
#

99% py

#

the spikes are when I am generating the buildings

#

I ported like 1/2 my game to the new streaming system

#

the old file was uber bloated too,
cut out a bunch of fluff*

desert oar
#

don't work like that. parsing csv is line-by-line

#

unless the lines themselves were really really long

nova gate
#

Howdy - does anybody know how I can hide this error message in Jupyter Notebook for QQ Plot:

stable umbra
#

I needed to convert 50000 images into numpy arrays and it took so long I had to do multiprocessing and split it up into individual files and it still took a crazy long time. I used a csv to organize everything.

#

I'm looking into learning openpyxl and use excel spreadsheets instead.

stable umbra
#

convert
[convert]
VERB
cause to change in form, character, or function.

#

as in, from a .png image to an array of floats representing rgb values.

tender hearth
#

Voice cloning is the use case

modest timber
#

Hi, what if I want to predict the stock market from the last 20 days' close prices, should I use batch_size = 20 in the LSTM? Could anyone explain the batch size idea to me? I couldn't get it

tender hearth
#

Or you could compute the loss and adjust the networks' weights on each sample in the dataset

#

The first option is generally slow

#

The second option generally leads to unstable learning

#

A 3rd option would be to compute loss and adjust network weights in "batches" of 16, 32, 50, or however big you want your batch to be

#

Batch size is just another hyperparameter

#

If you want to make a prediction based on close prices of the last 20 days you don't need to specify any parameter to the LSTM

#

Since by design it accepts variable-length sequences

modest timber
#

Sorry, I'm stuck on making the proper input shape and x_test shape

#

What should the input shape look like in that case

desert oar
#

it's just 1 price series, like the s&p 500? or it's 20 stocks?

modest timber
#

One price series

tender hearth
#

If it's one price the input shape is (batch_size, 20, 1)

#

Well actually I believe it's (20, batch_size, 1) but if you set the batch_first=True keyword argument to your LSTM it would be (batch_size, 20, 1)

modest timber
#

Ok thank you. Let me ask you, why do we need this batch size (I think of it like blocks of data) if my network uses only 20 inputs at once.

#

I don't get it

tender hearth
#

Well batch size is just the number of samples your network will look at before adjusting its weights

#

If you have batch_size = len(dataset), then your network will look at the entire dataset, and then adjust its weights

#

if you have batch_size = 1, then your network will look at one sample at a time, adjust its weights, and then move on to the next sample

modest timber
#

I got it. :)

tender hearth
#

Batch size is just another hyperparameter

#

You can see in batch=4 the loss jumps up and down

#

That's because it's adjusting its weights every 4 samples, which might be too few samples

modest timber
#

So it tries 4 weights and chooses one or drops some

#

Or something like that

#

4 different weights

tender hearth
#

No it's just the number of samples your network looks at a time

#

Let's take the stock market example

#

Say you're trying to predict the price on the 21st day given 20 days of data

#

If you have batch_size = 1, then your network will look at one sample, give a prediction, compute how wrong it was from that prediction, and then adjust its weights

#

If you have batch_size = 4, then your network will look at 4 samples, which in this case would be 4 samples of 20 days of data and 4 predictions

desert oar
#

yeah maybe the confusion is what a "sample" is

#

a "sample" is one "window" of 20 days
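
To make "sample" and "batch" concrete, a rough sketch in plain Python with made-up prices. The helper names `make_windows` and `make_batches` are hypothetical, not part of any library.

```python
def make_windows(prices, window=20):
    """Each sample is `window` consecutive prices; the target is the next price."""
    samples, targets = [], []
    for start in range(len(prices) - window):
        samples.append(prices[start:start + window])
        targets.append(prices[start + window])
    return samples, targets

def make_batches(samples, targets, batch_size):
    """Group samples so the weights would be adjusted once per group."""
    for start in range(0, len(samples), batch_size):
        end = start + batch_size
        yield samples[start:end], targets[start:end]

prices = [100.0 + day for day in range(30)]  # made-up closing prices
samples, targets = make_windows(prices, window=20)
batches = list(make_batches(samples, targets, batch_size=4))

print(len(samples))   # how many 20-day samples the series gives
print(len(batches))   # how many weight updates one pass over the data makes
print(len(batches[0][0]), len(batches[0][0][0]))  # samples per batch, days per sample
```

The window length (20) and the batch size (4) are independent: the first fixes what one sample looks like, the second only decides how many samples are seen between weight updates.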

modest timber
#

Ok i understand i think :)

#

So why does batch_first = True make the input shape begin with batch_size, when normally it doesn't

#

But comes second

tender hearth
#

I don't know that's an internal design decision

modest timber
#

Aha, ok .

shrewd pewter
#
```
Function f({stats}) takes in your 3 inputs and outputs 1 or 0 depending on if you win. Find the stat point allocation that maximizes wins
```
How would you guys approach this problem?
serene scaffold
wicked grove
serene scaffold
#

by creating the list of strings, I assume you're tokenizing

wicked grove
serene scaffold
#

Going forward, please always copy and paste the actual text into the chat.

wicked grove
wicked grove
#
tokenizer=RegexpTokenizer(r'\w+')
dataset['text']=dataset['text'].apply(tokenizer.tokenize)
dataset['text']=dataset['text'].astype.str()
print(dataset['text'].head())
serene scaffold