#data-science-and-ml
yeah
but if it’s not
then the brain has the same limitation
of course we know @ very small scales
I previously said the brain's structure is kinda dynamic.
the world is not deterministic
its extremely complicated
what does “dynamic” mean in this context
it can't be mapped as a static function
honestly, even I don't know neuroscience fully. I assume it's due to the voting mechanism that aggregates dreams and live experiences + memories to make different predictions each time
in the sense that the structure always changes
some connections break
some do not
its kinda contested BTW, and the research is pretty new. but we do apparently modify the brain in any case over time
yesyes
of course this happens
but
the point is that such changes
if they are deterministic, they can be modelled through the history of state, which the hypothetical function takes
and if they’re not, then those aspects are just random, which can also be theoretically modelled
your argument is such that as long as humans do actions, it can be considered a function. if that's the case, then even behavior of quarks is a definite function, its output being the set of coordinates?
not everything can be modelled, and a deterministic view of the universe is pretty incorrect. Our brain is far from a function, an idea that neuroscientists have often laughed at.
HTM hasn't achieved AGI mostly because its breakthrough ideas are very new and it's only slowly picking up steam and being researched fully. Maybe it won't lead us to AGI, but it's the closest thing we have got - the path with the least error, as compared to DL
guys, what are the ways to increase a model acc?
from the most newbie ones to the most advanced
what does the model do?
Guys I have a question. What kind of statistical test can I use to determine a categorical feature's importance on a regression task?
I've been reading a bit about it and a one-way ANOVA (after turning the categorical features into dummies) seems like the most viable approach, but I'd like to be sure.
the TL;DR -> What statistic is best to match: Categorical Input -> Numerical Output
I tried using sklearn's f_regression and mutual_info_regression, but i'm not confident in the significance of these results
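A minimal sketch of the one-way ANOVA idea from the question above, using `scipy.stats.f_oneway` on the numeric target grouped by category level. The column names `region` and `income` and the toy numbers are made up for illustration. This tests the categorical feature as a whole, whereas running `f_regression` on dummy columns scores each dummy separately.

```python
import pandas as pd
from scipy.stats import f_oneway

# Hypothetical toy data: one categorical input, one numeric output.
df = pd.DataFrame({
    "region": ["north"] * 4 + ["south"] * 4 + ["east"] * 4,
    "income": [10, 11, 12, 10, 20, 21, 19, 22, 30, 29, 31, 30],
})

# One array of target values per category level.
groups = [g["income"].to_numpy() for _, g in df.groupby("region")]

# Null hypothesis: every level has the same mean target value.
f_stat, p_value = f_oneway(*groups)
print(f_stat, p_value)
```

A small p-value only says the group means differ. It assumes roughly normal residuals and similar variances across groups, so treat it as a screening step rather than a final importance score.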
Hi, I am currently working a deep learning model for image colorization and have a pretty big dataset as well
Even though i switched the dataset, the results aren't very good at all
I am not sure how I can improve them
this is the dataset I am using
if anyone know how I can achieve good results please ping me
from sqlalchemy import create_engine
from cubes.tutorial.sql import create_table_from_csv

engine = create_engine('sqlite:///data.sqlite')
create_table_from_csv(engine,
                      "country-income.csv",          # name of file
                      table_name="country_income",  # give a name to the table
                      fields=[                      # all the columns in the csv file
                          ("region", "string"),
                          ("age", "integer"),
                          ("online_shopper", "string")],
                      create_id=True
                      )
How do I load the CSV file using Cubes, and create a JSON file for the data cube model, and create a data cube for the data?
ive done this part but dont know where to go from here
hi, anyone tried GPT with graph database ?
How to implement an ANN with python for image recognition (not letters)
Classification
Tabular dataset? try TAPAS
you can try naive ablation experiments to confirm their results
@grave frost can you please give some references?
Hello All, I thought I would share another tool you can use to visualize your data, let me know what you think
https://github.com/codemation/easycharts
A lot of models do, so that isn't specific enough.
i'm thinking of moving my anaconda dir from C to a different drive. if i create a symlink to the old directory after moving which is there in the PATH variable will everything work as expected?
or should i backup my base and other env and restore them later after re-installing?
Wait for someone else to answer or confirm this, but a symlink should work without any issues. I haven't done it, that's why I'm not sure.
or maybe just move only my environments to get back some space?
You would definitely be able to do it in Linux. If Windows is not acting weird, in theory it all should be fine as well.
If you want to do it manually, yeah.
i'm on windows
I was clearing up my C and saw my anaconda installation was occupying ~12gb so thought why not move it to other place..i did a conda clean --all to remove some unnecessary files
It's been a while for me on windows, but I'm skeptical this will work because windows aliases usually aren't invisible to programs in the way symlinks are (at least ~10 years ago)
You might try just moving the cache
You can always count that Windows will be weird and quirky, then.
hey guys anyone know how to subtract sine with cosine waves and plot the resultant wave in numpy python?
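One possible way to do this, as a sketch: NumPy subtracts arrays element-wise, so the resultant wave is just `np.sin(x) - np.cos(x)` on a shared `x` grid. The grid (one period, 500 points) and the helper name `plot_waves` are arbitrary choices for illustration.

```python
import numpy as np

# Sample one full period; the number of points is an arbitrary choice.
x = np.linspace(0, 2 * np.pi, 500)
resultant = np.sin(x) - np.cos(x)  # element-wise subtraction of the two waves


def plot_waves():
    """Plot both input waves and their difference (requires matplotlib)."""
    import matplotlib.pyplot as plt

    plt.plot(x, np.sin(x), label="sin(x)")
    plt.plot(x, np.cos(x), label="cos(x)")
    plt.plot(x, resultant, label="sin(x) - cos(x)")
    plt.legend()
    plt.show()
```

Calling `plot_waves()` draws all three curves. The difference is itself a sine wave, `sqrt(2) * sin(x - pi/4)`.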
then what are u asking exactly?
Hello guys, i have a question. I'm programming a simple application with the mnist_784 dataset. The application recognizes drawn digits. I use SVC, and here is my problem: fitting the model takes a long time. I can wait 1 hour and nothing happens. I tried other models like KNN or a DummyClassifier, but the effect is the same.
Here is my code
Are you asking how to improve model performance in general? Because if you're asking how to improve the performance of a specific model, I can't guess without knowing what that model is designed to do. What classes does it classify?
Don't forget the py
!code
from sklearn.datasets import fetch_openml
import pandas
from sklearn.model_selection import GridSearchCV, cross_val_score, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import py  # presumably a local module from the poster's project (not shown)
from inne import jes  # presumably a local module from the poster's project (not shown)

mnist = fetch_openml('mnist_784', version=1)
X, y = mnist["data"], mnist["target"]
some_digit = X.iloc[999]
some_digit_image = some_digit.values.reshape(28, 28)
plt.imshow(some_digit_image, cmap="binary")
plt.axis("off")
y = y.astype(np.uint8)
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
sgd_clf = SVC(gamma='auto', probability=False)
sgd_clf.fit(X_train, y_train)
while True:
    py.Paint()
    obj = jes.init()
    print(sgd_clf.predict([obj]))
    some_digit_image = obj.values.reshape(28, 28)
    plt.imshow(some_digit_image, cmap="binary")
    plt.axis("off")
    plt.show()
so, i was asking general techniques, but for my particular case, im the guy doing the pokemon classification
I don't think it's possible to answer that question in general.
okey let's start from the beginning
The comment you're replying to was not directed at you.
oh sorry
Which part is taking a long time? Because you have an infinite while loop at the end.
The problem is with training the model. Here:
```py
sgd_clf.fit(X_train, y_train)
```
I checked it with a debugger
if you have a lot of data, that's the expected bottleneck
This dataset has 70k records. Interestingly, this problem did not occur earlier.
can anybody help me with tensorflow?
Don't you have to batch your data, unless you can hold everything in memory
How can I vectorize this:
images = ...  # an ndarray of 3-channel images, with dimensions (batch, height, width, rgb)
images2 = np.zeros(images.shape)
for i, image in enumerate(images):
    images2[i] = skimage.color.rgb2hed(image)
I know there's a way to run a transformation of each subdimension of an ndarray
but I can't recall what it's called
scipy.color?..
skimage*
I can't find any docs on it, huh
my bad
ah
by the docs, it only needs the last dimension to be colors - there's no requirement of it being 3d
nice catch!
so you can cast it on the entire array
Does tensorflow work with python 3.9? In their official page it says it has been tested with python 3.8
but doesn't mention py3.9?
In that case, I just wouldn't use it with 3.9 until they have a release that they say is 3.9-compatible.
so what do I do then? just reinstall python?
what OS are you on? You can have more than one python version installed at a time.
i am on windows
it does have wheels for python 3.9
install a 3.8 Python, yes. If you're using something like pyenv (IIRC) you can even manage them automatically
so it should work
I don't know how pyenv works tho. should I just download anaconda?
definitely not
if you install 3.8 from the Python website, you can use the py command to pick which version you use
so I can have 2 installations at once and switch between them with the py command?
yes
so I am guessing it changes the default python system-wide?
no
thanks for the help guys. I guess I'll be looking into it.
hello. I want to write a program to detect vehicles in traffic.
Does anyone know any good books on this subject? The field of computer vision in general is also relevant
so it will be detecting which parts of an image are of a vehicle? or will it be a live video feed? something else?
That isn't something that I know about, but "dashcam footage detect vehicles machine learning" might be a good Google query. But hopefully someone else who knows about that topic will show up here.
it doesnt have to be specifically that
im looking for books about image classification in general
Hey there everybody! I have a little question: do you absolutely need to be an advanced Python programmer (knowing A-Z of beginner-level programming like lambda functions, sort() etc., OOP, socket programming, concurrent programming, data structures & algorithms) to learn AI, ML & DL, or do you only need to learn the core concepts like the basics and OOP?
look up Yolo models, use them for inference, count the number of vehicles. bingo!
nah, moderate python is more than enough for starters. you might have to upgrade a bit later on, but if you would be in the industry then chances are you would use simple stuff which wouldn't require extreme knowledge of python
These are some advanced topics you're mentioning. Most of them don't relate to ML&DS in obvious ways. If you master the basics (I'd say the content of Automate The Boring Stuff) then you're ready to learn numpy and pandas and after that sklearn and pytorch/tensorflow. This might be an unpopular opinion among thoroughly trained people but the libraries keep getting easier to use and you learn most by just applying it in real life
its not an unpopular opinion, just a wrong one. shallow knowledge won't get you anywhere significant if you don't put in the effort yourself
If you want to be the person that creates the libraries used by others to do ML/AI/DL/DS stuff (e.g. pytorch, numpy, pandas, etc) then yes.
Hello guys, I am having problems with pivoting a table with pandas in the Kiwi help room. Can someone help please?
Is it worth it to learn excel for data analysis?
You should also have a solid grasp of the computational complexity of various data structures / algorithms to make an informed decision on whether something is a viable option or requires too much compute (even if you never use the specific structures / algorithms studied, just get used to estimating how fast things are).
anyone heard of background matting?
A little, but it can't deal with large datasets.
you need to be able to work with tabular data in general, so if you were to learn excel, you'd probably learn a lot of the terminology surrounding data manipulation. But I only use excel (well, google sheets in this case) to put data on my work's google drive for my coworkers to look at.
https://arxiv.org/pdf/2011.11961.pdf this is a paper on an algo called grabcut
thinking of introducing a gan to cut human interference since the original paper required human guidance.
also object detection to highlight the object rather than having a user create a box
Hi, I have been reading for the past 3 weeks now but i'm still at a loss. Can anyone please guide ?
eeeehm guys one thing, when using ImageDataGenerator
How many new imgs are returned?
or how can i control it?
i'm trying to use available libraries to generate texts from existing keywords. I have a few texts I want to feed to the learning process but I'm a bit at a loss on where to start
use gpt3
i'm told GPT3 is not accessible to the public yet.
Use GPT2 then
There's an API for working with it, but it's not open source yet, no.
what's the url pls? how is the delivery compared to GPT2?
i mean the content quality
I'm not sure. Sentence generation isn't part of my area within nlp.
what's the url for the api?
stelercus, do u have by chance any snippet using albumentations?
Hello
terrific
idk what that is, sorry
dw
from here
what augmentations do u think will be useful?
hello guys
i am stuck (again)
suppose i have this
the "prediction" column is to determine which cluster each object belongs to
so 0 means cluster 1
1 means cluster 2, and so on
now i want to see whether the object's classification has anything to do with how it is clustered
but i have no idea how to proceed
my intial strat is to do this https://matplotlib.org/stable/gallery/lines_bars_and_markers/barh.html
but it seems a bit weird
are you sure you don't have ingredient and classification backwards? it seems like there's tons of unique values on either side.
I guess that's fine, actually
you see the annoying thing about drugs is
one ingredient can treat different things lmao
and there are sub divisions of classes
like there is an umbrella class, a sub class, an even subber class
@visual violet can you do print(df.sample(axis=0, n=7).to_csv())?
remember that I can't do anything with screenshots
very true
,Ingredient,predictions,classification
1397,NABUMETONE,0,Nonsteroidal Anti-inflammatory Drugs (NSAIDs)
1738,PROPRANOLOL HCL,0,propranolol hydrochloride
1801,LEVETIRACETAM,0,Seizure Disorders
733,DULOXETINE HYDROCHLORIDE,0,"Analgesics, Centrally-acting Nonopioid; Anxiolytics, Non-benzodiazepines; Fibromyalgia; Neuropathy/Neuralgia; Serotonin-Norepinephrine Reuptake Inhibitors (SNRIs)"
1773,OXYBUTYNIN CHLORIDE,0,"Antispasmodics, Urinary"
1229,LABETALOL HYDROCHLORIDE,0,Beta Blockers
16,ZOLMITRIPTAN,2,Headache/Migraine
i hope this works
df.groupby('classification')['predictions'].value_counts().unstack(fill_value=0)
this way you can see the frequency of each class in each cluster.
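To make the one-liner above concrete, here is a small sketch on made-up rows that reuse the column names from the sample (`classification`, `predictions`). The result has one row per class and one column per cluster.

```python
import pandas as pd

# Toy rows in the same shape as the sample above (values are illustrative).
df = pd.DataFrame({
    "classification": [
        "Beta Blockers", "Beta Blockers", "Seizure Disorders",
        "Headache/Migraine", "Headache/Migraine",
    ],
    "predictions": [0, 1, 0, 2, 2],
})

# Count cluster labels within each class, then pivot clusters into columns.
table = (
    df.groupby("classification")["predictions"]
    .value_counts()
    .unstack(fill_value=0)
)
print(table)
```

`pd.crosstab(df["classification"], df["predictions"])` should give the same counts if that reads more naturally.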
sheesh did you just do everything in one line of code
but it's your job as the human to speculate as to why your feature selection resulted in which class ending up in which cluster.
yeah i used selenium to get the classification hehe
even though there are 84 None values
i just asked you about it in the morning lol
yeah but all I understood out of that was that you wanted to handle exceptions
oh right
the ultimate goal is to find subclasses for each ingredient
which i have finally done
good job!
will this work for none value?
Are those manifest as nans?
i tried it it does work
so i took a look at your code
i mean one line of code
it does put the number of classification for each cluster
but it doesn't sort for each cluster
for example
i know you can't read screenshot but i can't do anything else :((
how can i put the trycylic on top of acne
do you just want to arbitrarily put that one on top, or is there a reason?
df.rename(axis=1, mapper=str).sort_values('0 1 2'.split(), ascending=False)
it's intended to act upon the dataframe in the screenshot I replied to
what is the meaning of life? idk
suppose i am reading a research paper and i like a piece of evidence
burn it
but that evidence is linked to another research paper
do i cite the research paper i am reading
or the original source
depends on what claim made in the paper you're citing
how about the evidence is just a fact
for example, on the research paper, Americans eat 3 cheeseburgers a day (python committe)
do i cite python committee?
was it the "python committee" that conducted the survey that determined that?
yes
then yes
oh man
i like how in the background info section
the author links to many other research papers
@visual violet you didn't save the result of the first statement to a variable
So it got thrown away. Pandas rarely modifies the source data.
Nope. You should save it to a variable with a different name.
Ya
can you sort over all three?
It is.
i mean it won't be a complete beautiful table like that
It sorts on the first column, then the second, then the third
Looks right to me.
i am glad that each cluster doesn't have the same classification popularity
i know what i just wrote doesn't make much sense for anybody
So, you're glad that the clusters are largely disjoint with respect to what classes the instances have.
oh yes
english has rejoined the chat
this is potentially very meaningful
so if your code does what you say
then the biggest count in column '2' is 18?
My code always does what I say
there is no bigger number
No
That row is there because of the 40 in the zero column
If you want it to sort by the maximum value of each row, that's different
is there a way to just break the column '2' away along with the ingredient and sort it by itself
You can select only those rows where the two column is not 0
With loc
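A sketch of the three options discussed above, on a made-up count table (the class names and numbers are placeholders). Each result is saved to its own variable, since pandas returns a new object rather than modifying the original.

```python
import pandas as pd

# Hypothetical class-by-cluster counts.
counts = pd.DataFrame(
    {0: [40, 3, 0], 1: [0, 5, 1], 2: [18, 0, 7]},
    index=["class A", "class B", "class C"],
)
counts = counts.rename(axis=1, mapper=str)  # column labels become '0', '1', '2'

# Sort on column '0', break ties with '1', then '2'.
by_all = counts.sort_values(["0", "1", "2"], ascending=False)

# Only column '2', rows with a non-zero count, largest first.
only_two = counts.loc[counts["2"] != 0, "2"].sort_values(ascending=False)

# Order rows by the largest value found anywhere in the row.
by_row_max = counts.loc[counts.max(axis=1).sort_values(ascending=False).index]
```

In this toy table "class A" stays on top of `by_all` because of the 40 in column '0', even though it also has the largest value in column '2'.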
Anyway I must go to sleep
good night!
Bye!
@grave frost what do you mean by "Moderate Python"? Is it OOP and the basics?
how do i feed in my custom text model for training for GPT and BERT ?
guys i installed tensorflow using pip and then when i imported it it gave me this error
```
ImportError: DLL load failed while importing _pywrap_tensorflow_internal: A dynamic link library (DLL) initialization routine failed.
```
any explanation ?
well, the first problem you have is that that’s one column
try sep=';'
i think you have delimiter issues in the csv file
try pd.read_csv('patient_data.csv', delimiter=';')
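A small sketch of why the delimiter matters, using an in-memory string instead of the real `patient_data.csv` (the column names and values here are invented):

```python
import io

import pandas as pd

# Stand-in for a semicolon-delimited file.
raw = "patient_id;age;diagnosis\n1;54;flu\n2;37;cold\n"

wrong = pd.read_csv(io.StringIO(raw))           # default sep=',' -> one wide column
right = pd.read_csv(io.StringIO(raw), sep=";")  # split on semicolons

print(wrong.shape, right.shape)
```

`sep` and `delimiter` are aliases in `read_csv`, so either spelling from the messages above should work.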
@winged stratus thx a lot!
why does my bot say 'goodbye' to everything I say?. how can i fix that ?
Hey @nova tapir!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
i'm training him but it just says 'goodbye' , 'bye' or something to what i say
{"intents": [
    {"tag": "greetings",
     "patterns": ["hello", "hey", "hi", "good day", "greetings", "what's up?", "how is it going?"],
     "responses": ["Hello", "Hey!", "What can I do for you?","Hi", "Good day", "Greetings", "what's up?"]
    },
    {"tag": "goodbye",
     "patterns": ["cya", "See you later", "Goodbye", "I am leaving", "Have a good day", "bye", "cao", "see ya"],
     "responses": ["Sad to see you go :(", "Talk to you later", "Goodbye!","Bye","cao","cya","see ya","bye bye"]
    },
    {"tag": "age",
     "patterns": ["how old", "how old is trojan", "what is yor age", "how old are you", "age?"],
     "responses": ["My owner Trojan is 17 years old!", "17 years!", "I am 1"]
    },
    {"tag": "name",
     "patterns": ["what is your name", "what should I call you", "whats your name?", "who are you?"],
     "responses": ["You can call me Jane!", "I'm Jane!", "I'm AI Assistant of trojan!", "My name is Jane"]
    },
    {"tag": "hours",
     "patterns": ["When are you guys open", "what are your hours", "hours of operation"],
     "responses": ["24/7"]
    }
]}
here is the json file
Hey guys
Can you suggest me a walkthrough for PyTorch
I have experience in Keras and some Tensorflow 2.0 for deep learning
But I have not done any convolutional unsupervised systems
Only classification and regression supervised
theres the tutorial walkthrough on pytorch's site
if you want a video form, sentdex has a pytorch tutorial playlist as well https://www.youtube.com/playlist?list=PLQVvvaa0QuDdeMyHEYc0gxFpYwHY2Qfdh
@nova tapir how do you custom-train your models?
what do you mean ?
i'm trying to do something similar, but i'm still new in the field. reading few resources but still confused
I have custom text which i want to train in order to spin them to provide a new text in the same context
@nova tapir do you have few resources i can follow ?
Hello, does anyone here know of any packages that will allow me to categorise words?
For example, I can search for all prepositions in a list and it will return this
you can use spaCy, which has a part-of-speech tagger
otherwise, you need to know what word categories you have in mind.
(which happens to be my area of expertise, inasfaras I can be considered an expert on anything)
Hi,
Is it possible to use a scatterplot on a pandas dataframe with 256 columns on the x-axis and having a 5 column identifier on the y-axis?
Here is what the dataframe looks like
if you've ever heard of matplotlib and pyplot and used them, what would you prefer on the basis of user friendliness ?
or even pls state the reason for superiority of one over other if more factors other than user-friendliness do exist and are important
Hello
I'd just like to let you all know i'm a lying, racist, and sexist scammer. I give no regard to anyone else, I'm just a idiotic kid who doesn't know crap and acts like a big man. I constantly spam slurs, and I love scamming people. I also violated multiple discord's terms of service. I'm a piece of garbage.
Please spread the word. I'm a scammer.
This is my ID: 841420280425611265
This account was hijacked by someone I scammed. They're the ones posting this. Deal with caution when dealing with me. Have a good day
@lapis sequoia
So I've been trying to use matplotlib, but when i try running this on jupyter noteboook, the cell just freezes
for x, col in enumerate(phoneme_df2.columns):
    for y, ind in enumerate(phoneme_df2['g'].index):
        if phoneme_df2.loc[ind, col]:
            plt.plot(x, y, 'o', color='red')
plt.xticks(range(len(phoneme_df2.columns)), labels=phoneme_df2.columns)
plt.yticks(range(len(phoneme_df2)), labels=phoneme_df2['g'].index)
plt.show()
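A likely reason the cell seems frozen: the loop calls `plt.plot` once per cell, which for a frame with a few hundred columns and a few thousand rows means on the order of a million separate plot calls. A sketch of a vectorized alternative, assuming the cells are 0/1 or boolean flags; `demo` is a tiny hypothetical stand-in for `phoneme_df2` and `plot_points` is an illustrative helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for phoneme_df2: a small frame of 0/1 flags.
demo = pd.DataFrame(
    [[1, 0, 1], [0, 0, 1]],
    columns=["f1", "f2", "f3"],
    index=["aa", "sh"],
)

# Row and column positions of every truthy cell, found in one pass.
ys, xs = np.nonzero(demo.to_numpy())


def plot_points():
    """Draw all points with a single plot call (requires matplotlib)."""
    import matplotlib.pyplot as plt

    plt.plot(xs, ys, "o", color="red")
    plt.xticks(range(len(demo.columns)), labels=demo.columns)
    plt.yticks(range(len(demo)), labels=demo.index)
    plt.show()
```

With thousands of rows, labelling every y tick will still be slow and unreadable, so it may be worth dropping the `plt.yticks` call on the real data.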
ohhh woops thought you were speaking to me

hello everyone, i have a case and my time is very limited can someone help me? I would be really happy if you do this, if anyone thinks about it, can you write it privately?
Hello everyone, if someone can help me understanding a bit more a piece of code, I'd be extremely grateful. 
I'm short on time, I have a thesis to write...
I'm dealing with CNN, I already have the architecture and the code, I would like some clarifications so I can make correspondences between the two (thanks in advance)
any insight?
I actually have a dataset with 3197 features, here are the 5 first rows. I found a code that I would like to understand. The model structure comes as :
- Input layer;
- 1D convolutional layer, consisting of 10 2x2 filters, L2 regularization and ReLU activation function;
- 1D max pooling layer, window size - 2x2, stride - 2;
- Dropout with 20% probability;
- Fully connected layer with 48 neurons and ReLU activation function;
- Dropout with 40% probability;
- Fully connected layer with 18 neurons and ReLU activation function;
- Output layer with sigmoid function.
I'm having troubles understanding counting the number of layers..
The code :
# Architecture
model = Sequential()
model.add(Reshape((3197, 1), input_shape=(3197,)))
model.add(Conv1D(filters=10, kernel_size=2, activation='relu', input_shape=(n_features, 1), kernel_regularizer='l2'))
model.add(MaxPooling1D(pool_size=2, strides=2))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(48, activation="relu"))
model.add(Dropout(0.4))
model.add(Dense(18, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
Anyone worked with background_matting v2?
My question is: what do the lines model = Sequential() and
model.add(Reshape((3197, 1), input_shape=(3197,))) mean, and how can I calculate the number of outputs given the structure of the filters (the Conv1D for example)
Is there any good course for mathematics for machine learning?
Model sequential just initializes the model, where you then can start defining the layers in it.
Model.add adds whatever layer is given to form the architecture layer by layer
Reshape must simply be a layer for reshaping the input received
yeah, there is a book named mathematics for machine learning. its free, you can start with that
do model.summary to see all details about your model, including each individual layer
from code tho, you have 9 layers if you count Reshape and Flatten. parameters would be printed out by model.summary()
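The output sizes can also be worked out by hand. A sketch of the arithmetic for the code above, assuming Keras defaults for `Conv1D` ('valid' padding, stride 1); it needs no TensorFlow install, and `model.summary()` is the way to confirm the numbers.

```python
n_features = 3197

# Reshape turns each sample from (3197,) into (3197, 1): 3197 steps, 1 channel.

# Conv1D with kernel_size=2, 'valid' padding, stride 1, 10 filters.
conv_len = n_features - 2 + 1      # 3196 steps, each with 10 filter outputs
conv_params = (2 * 1 + 1) * 10     # (kernel * in_channels + bias) per filter

# MaxPooling1D with pool_size=2, strides=2 halves the length.
pool_len = (conv_len - 2) // 2 + 1

# Flatten: steps * filters. Dropout keeps the shape and has no weights.
flat = pool_len * 10

dense48_params = flat * 48 + 48    # weights + biases
dense18_params = 48 * 18 + 18
out_params = 18 * 1 + 1

total_params = conv_params + dense48_params + dense18_params + out_params
print(conv_len, pool_len, flat, total_params)
```

Almost all of the weights sit in the first Dense layer, because it is connected to every one of the flattened values.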
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
that being said, \mathrel{+} is not valid Python code, so something else must have been intended.
my life is a lie
the methods to find k values for k-means are all different
i am so confused
Hello, I wanted to know what regularization methods were.
Methods to prevent overfitting like LASSO and Ridge?
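Roughly, yes: regularization adds a penalty on the size of the weights to the loss, with Ridge penalising squared weights (L2) and LASSO penalising absolute values (L1). A minimal NumPy sketch of the Ridge case on synthetic data; the data, the helper name `ridge_weights` and the penalty strength 100 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(scale=0.1, size=50)


def ridge_weights(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)


w_plain = ridge_weights(X, y, lam=0.0)    # ordinary least squares
w_ridge = ridge_weights(X, y, lam=100.0)  # heavily penalised

print(np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

The penalised weights come out smaller in norm, which is the shrinkage that makes the model less able to chase noise.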
How do I disable the cudart64_110.dll not found errors in Tensorflow?
btw I know my GPU isn't cuda-enabled. I just want to suppress these warnings
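One commonly used approach, sketched here on the assumption that the cudart message is emitted by TensorFlow's C++ logging at warning level: set the `TF_CPP_MIN_LOG_LEVEL` environment variable before TensorFlow is imported.

```python
import os

# Must be set before TensorFlow is imported. "2" hides INFO and WARNING messages.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"

# import tensorflow as tf  # import only after the variable is set;
#                          # commented out so this sketch runs without TensorFlow
```

This hides other warnings too, so it may be worth turning it back down when debugging.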
do u know where can i apply cutmix augmentation?
Hello everyone , I'm learning image treatment with python and I would like to know if I can change the color of a specific pixel.
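Yes, if the image is held as a NumPy array (which is what most Python image libraries give you or can convert to), a pixel is changed by plain index assignment. A sketch on a made-up 4x4 image:

```python
import numpy as np

# Hypothetical 4x4 black RGB image; a real one would come from an image loader.
image = np.zeros((4, 4, 3), dtype=np.uint8)

row, col = 1, 2
image[row, col] = [255, 0, 0]  # indexing is [row, column], i.e. [y, x]

print(image[row, col])
```

Note that the channel order depends on the library: OpenCV arrays, for example, are BGR rather than RGB.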
Guys, I have a question, can anyone help me out please?
I am loading a dataset from a csv file into a pandas DataFrame but the columns get loaded as int64 and I need int32 for my model
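A sketch of the usual fix with `astype`, on a made-up frame (the column names are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# astype returns a new frame, so keep the result.
df32 = df.astype("int32")

print(df32.dtypes)
```

To convert only some columns, pass a mapping instead, e.g. `df.astype({"a": "int32"})`. The dtype can also be requested at load time with the `dtype=` argument of `pd.read_csv`.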
How can I change the color for the numbers 2?
train a classifier for numbers 2.
to detect pixel values in number 2
looking for second opinion on this.
A sequence composed of a series of nominal symbols from a particular alphabet is usually called a temporal sequence, and a sequence of continuous, real-valued elements is known as a time-series.
i have 0 idea what this means
isn't it just y value over time?
i am very confused
Wo' ooh bo' ooh
yes?
Can someone here help me with trying to implement an LDA model on my dataset for a scatter plot?
I was following this Youtube video that used np.where to show separation between two classifications but the current data that Im working with has 5. The part of the code that's commented out, was the original code in the video.
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train.ravel())
y_prob_lda = lda.predict_proba(X_test)  # one probability column per class
# with 5 classes, predict the label directly instead of thresholding a single column
y_pred_lda = lda.predict(X_test)
#np.where(y_prob_lda > .5, 1, 0)
The dataset itself has 256 features, about 4509 instances, and 5 classifications
your code looks complicated
check this C:\Program Files\NVIDIA GPU Computing Toolkit\if its cuda in there or no....if not then install it
i will try changing that
these are the dimensions
`n_h, n_w, c` means horizontal, vertical, channel (for RGB and all)
m means no. of examples
What are some interesting data sources with regularly updated data?
Like yahoo finance
import numpy as np

def nonlin(x, deriv=False):
    if(deriv == True):
        return (x * (1-x))
    return 1 / (1+np.exp(-x))

X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
y = np.array([[1],
              [0],
              [1],
              [1]])
np.random.seed(1)
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1
for j in range(60000):
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))
    l2 = nonlin(np.dot(l1, syn1))
    l2_error = y - l2
    if (j % 10000) == 0:
        print("Error: " + str(np.mean(np.abs(l2_error))))
    l2_delta = l2_error * nonlin(l2, deriv = True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin (l1, deriv = True)
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)
print("Output after training")
print (l2)
which is the correct visualization of this neural network code?
How to convert type of string to integer?
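Two common cases, as a sketch: a single Python string goes through `int()`, and a whole pandas column goes through `pd.to_numeric`. The sample values are made up.

```python
import pandas as pd

# A single string.
number = int("42")

# A column of strings; errors="coerce" turns unparseable entries into NaN
# instead of raising.
s = pd.Series(["1", "2", "x"])
converted = pd.to_numeric(s, errors="coerce")

print(number, converted.tolist())
```

Because of the NaN, `converted` ends up as a float column. If every entry is a clean integer string, `s.astype(int)` is enough.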
Check for the ML and DL courses from Andrew Ng in Coursera.org. It's more of a practical approach. There's a course that I did called "Mathematics for programmers" or something along those lines, but I can't find it.
Check these also. https://github.com/ertsiger/coursera-mathematics-for-ml
has anyone worked with TF-hub before? I tried 2-3 models but I have no idea what is the output they give me
Hi Guys is there any dedicated channel for opencv python
Hi I have 15 columns but describe give me just 1 column and other techniques too(graph correlation etc) how can i solve this problem?
anyone can help me?
I think describe works with numeric columns only
Hi, i need guidance. I want to generate text from a series of keywords. I want to train the model on a bunch of texts I already have. I came across Hugging Face but i don't know which procedure to follow. Are there any writeups that can help set me on the path? How do i use the text to train models?
hello i want to #ask
how to deal with he's and has for text preprocessing?
hey i want to do some machine learning using tensor linear regression. but my problem is that i dont know how i can convert my string files into int
Not all of your columns contain numerical data. Try calling .describe for individual columns.
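A sketch of what is probably happening, on a made-up frame: by default `describe()` only summarises numeric columns when the frame has mixed types, and `include="all"` brings the text columns back.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["India", "Brazil", "India"],
    "age": [25, 31, 40],
    "online_shopper": ["yes", "no", "yes"],
})

numeric_only = df.describe()             # just the numeric column
everything = df.describe(include="all")  # text columns get count/unique/top/freq

print(df.dtypes)
print(everything)
```

If a column that should be numeric shows up with a text dtype in `df.dtypes`, it likely contains stray strings and needs converting before correlations or plots will pick it up.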
I can't look at both of these screenshots at the same time, so please provide the data as text.
well this is my date format
another day another stuggle to cluster
and this is my code
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
ValueError: time data '5.03.2021 00:00' does not match format '%d-%m-%Y %H:%M' (match) and value error
This screenshot has hyphens in it rather than periods. Note that if you need any additional help with this, I will not read any more screenshots of text.
df['DateDim[Day]'] = pd.to_datetime(df['DateDim[Day]'], format='%d-%m-%Y %H:%M:%S')
df['DateDim[Day]'] = pd.to_datetime(df['DateDim[Day]'], format='%m-%d-%Y %H:%M') or this
'%d-%m-%Y %H:%M:%S' # This has hyphens where there should be dots
'5.03.2021 00:00' # The actual strings you're trying to match have dots
I can't guess if 5 is a day or a month, as 5/3 and 3/5 are both valid day-month pairs.
5 is day and 03 is month
Here's the mini-language for date formatting and parsing: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
See if you can solve the problem with the hyphens.
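For reference, a sketch of a format string that matches the dotted value from the error message, taking 5 as the day and 03 as the month as stated above. It uses the standard library's `datetime.strptime`.

```python
from datetime import datetime

# Dots between day, month and year; no seconds in the input.
parsed = datetime.strptime("5.03.2021 00:00", "%d.%m.%Y %H:%M")
print(parsed)
```

The same format string can be passed to pandas as `pd.to_datetime(df['DateDim[Day]'], format='%d.%m.%Y %H:%M')`.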
yess you are great
I thought you were talking with yourself
i think i graphed it wrong
can somebody please help?
0 0.0 0.350547 -0.389165 0.031171 0.131560 0.101988 0.012384 0.118384 -0.326644 0.159515 0.641205 -0.287578 0.295131 0.049982 -0.453871 -0.058566 0.084067 0.033252 -0.087150 -0.026975
1 0.0 -0.063362 0.228691 -0.177655 -0.035891 -0.194385 -0.225461 0.085287 -0.112722 -0.190376 -0.319231 0.316168 0.297476 -0.222511 -0.161768 -0.022497 -0.107356 0.343189 -0.142414 0.157067
2 0.0 -0.525495 0.044349 0.259054 0.032564 0.017787 0.109994 0.617328 1.539279 -0.704796 -0.155155 0.132843 -0.039865 -0.213152 0.298412 -0.391566 -0.107134 -0.313010 -0.238712 -0.138868
3 0.0 0.294616 -0.146452 -0.010603 -0.289189 0.518459 -0.348416 0.174120 0.197173 -0.207225 -0.202068 -0.067731 -0.098195 0.377949 -0.284327 0.140845 0.179972 -0.269980 -0.163283 0.055986
4 0.0 -0.286758 0.176156 -0.045746 -0.031385 -0.361086 0.691218 -0.348555 0.612737 -0.376248 0.030953 -0.105264 0.176193 -0.208051 0.025628 -0.079569 0.342263 -0.220916 0.133213 -0.003057
uhh
as you can see there are negative values
somehow the graph doesn't show negative values?
please also share the code that made the graph
colors = ["b", "g", "r", "c", "m", "y", "k", "pink"]  # one color per cluster label 0..7
figure(figsize=(15, 10), dpi=80)
for counter, (index, row) in enumerate(percentage_difference.iterrows()):
    #line, = plt.plot(row, marker='o')
    line, = plt.plot(row)
    line.set_color(colors[predictions_pct[counter]])
plt.xlabel('Quarter')
plt.ylabel('Percentage difference')
plt.autoscale(enable=True, axis='x', tight=True)
is predictions_pct the dataframe from before?
predictions_pct shows what cluster
IF I DO np.multiply(3d_matrix_1, 3d_matrix_2)(same dimensions lets say (m, n, l)), will i get matrix of (m, n, 1)
<class 'numpy.ndarray'>
please share the data so that we can replicate this
model_pct = TimeSeriesKMeans(n_clusters=3, metric="dtw", max_iter=10)
predictions_pct = model_pct.fit_predict(percentage_difference)
It will just do element-wise multiplication, so the result has the same shape as the inputs: (m, n, l)
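A quick way to check the shape question is to try it on small arrays. This is only a sketch; the sizes and values are made up for illustration.

```python
import numpy as np

m, n, l = 2, 3, 4  # illustrative sizes, not from the chat
a = np.arange(m * n * l).reshape(m, n, l)
b = np.full((m, n, l), 2)

product = np.multiply(a, b)  # same as a * b
print(product.shape)         # (2, 3, 4)

# Getting a last axis of length 1 needs an explicit reduction
summed = product.sum(axis=2, keepdims=True)
print(summed.shape)          # (2, 3, 1)
```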
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
lol
i graph each row of the percentage_difference
then i set color accordingly
where do you actually put this data in the plot?
nevermind, I see now
one moment
the problem is there is no negative value in the y axis in the graph when there should be
alright let me see
@visual violet
In [21]: df.plot.line(xlabel='Quarter', ylabel='Percentage Difference')
Out[21]: <AxesSubplot:xlabel='Quarter', ylabel='Percentage Difference'>
Not sure why the key is in the middle though
idk, I just fumbled my way to this by testing stuff
are you coloring it according to the date?
well at least you have the negative direction and i don't lol
please teach me your way
In [23]: df.T.plot.line(xlabel='Quarter', ylabel='Percentage Difference')
Out[23]: <AxesSubplot:xlabel='Quarter', ylabel='Percentage Difference'>
yes. the method call in line 23 takes color= as a kwarg
do you have an array/series of which cluster each row belongs to?
this
it is called predictions_pct
so it predicted that each row is the same cluster...?
yes! each row is an object
so one element in the array represent which cluster the corresponding row belongs to
But it's all zero
there are some 1 and 2 lol
oh there are a few 1s
that is why i complain the clustering doesn't work
Do you know what color you want for 0, 1, and 2?
but now i am already fucked, so i gotta keep going with the idea
excuse my language please
uhh i don't mind, make it red, blue, green
i got error like that when there is a logic error in my code
one time i merge dataframe with identical rows
python tried to do every single combinations
@visual violet
# This is your array from before
In [32]: predictions = pd.Series([0, 1, 2, 1, 1])
In [33]: predictions.replace(dict(enumerate(['red', 'green', 'blue'])))
Out[33]:
0 red
1 green
2 blue
3 green
4 green
dtype: object
In [41]: df.T.plot.line(xlabel='Quarter', ylabel='Percentage Difference', color=colors)
Out[41]: <AxesSubplot:xlabel='Quarter', ylabel='Percentage Difference'>
line 33 won't change the array it self right?
No, pandas methods like replace return a new object and leave the original alone, unless you assign the result back or pass inplace=True
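A small check of that, reusing the toy series from the session above (the explicit dict is just the expanded form of the `enumerate` call):

```python
import pandas as pd

predictions = pd.Series([0, 1, 2, 1, 1])
colors = predictions.replace({0: "red", 1: "green", 2: "blue"})

print(predictions.tolist())  # [0, 1, 2, 1, 1]  (original is untouched)
print(colors.tolist())       # ['red', 'green', 'blue', 'green', 'green']
```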
I don't understand how it works.
Unable to allocate 65.6 GiB for an array with shape (8803915165,) and data type float64
I'm getting an error like this. I think it could be solved with virtual memory, but do I really need to allocate 65 GB of virtual memory on disk? I already have 12 GB of RAM, so would I only need to top it up?
Well, 65 GiB is more than 65 GB. Is it a sparse array?
yep probably it is.
I should get an output like this
plt.figure(figsize=(20,12))
mergings = linkage(rfm_scaled, method="complete", metric='euclidean')
dendrogram(mergings)
plt.show()
and this is my code it is rfm analyze
If I change the method, will the problem be solved?
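The shape in the error message happens to equal n(n-1)/2 for n = 132,695, which is the length of a condensed pairwise distance matrix. That suggests (this is an inference from the numbers, not something stated in the chat) that `linkage` is building all pairwise distances between the rows of `rfm_scaled`. If so, switching `method` probably would not shrink the allocation, and clustering a sample of the rows may be the more realistic route. The arithmetic, as a sketch:

```python
n_rows = 132_695  # inferred from the error message, not stated in the chat
n_pairs = n_rows * (n_rows - 1) // 2  # entries in a condensed distance matrix
n_bytes = n_pairs * 8                 # float64 takes 8 bytes per entry

print(n_pairs)           # 8803915165
print(n_bytes / 2**30)   # roughly 65.6 (GiB)
print(n_bytes / 10**9)   # roughly 70.4 (GB)
```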
bruh i possibly messed up both graphs
and yet my professor didn't telll me that
even though i sent him my code
i am sad
this won't work if i have 5 clusters right?
because it assumes that it must have 3 clusters
you just have to give as many colors in the enumeration as there are clusters.
if you give more colors than there are clusters, however many extra will just never be used.
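For example, with five clusters the same idea might look like this. The palette and the label list are made up, and `Series.map` is used here in place of `replace` (either should work for this lookup).

```python
import pandas as pd

palette = ["red", "blue", "green", "cyan", "magenta"]  # one color per cluster
predictions_pct = [0, 3, 1, 4, 2, 0]  # stand-in for the real cluster labels

color_map = dict(enumerate(palette))  # {0: 'red', 1: 'blue', ...}
colors = pd.Series(predictions_pct).map(color_map)
print(colors.tolist())
# ['red', 'cyan', 'blue', 'magenta', 'green', 'red']
```

The resulting `colors` can then be passed as `color=` to the plot call, as in the 3-cluster version.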
'numpy.ndarray' object has no attribute 'replace'
you have to make it a series
yup i got it
(a series is just the pandas version of a 1d array)
wait where did you get the colors
you assigned colors = array.replaced... right?
colors = pd.Series(predictions_pct).replace(dict(enumerate(['red', 'green', 'blue'])))
percentage_difference.T.plot.line(xlabel='Quarter', ylabel='Percentage Difference', color=colors)
@serene scaffold sorry for ping
@visual violet do you have matplotlib installed?
yup i plot multiple things before
what version of pandas do you have?
you might need to switch the scale so that values on the y axis aren't evenly spaced
1.0.1
that's an older version.
matplotlib : 3.1.3
that version is old as well
in either case, look into how you can change the scale of the y axis
in particular, you probably want to make it logarithmic.
here's what I'm talking about: https://en.wikipedia.org/wiki/Semi-log_plot
In science and engineering, a semi-log plot/graph or semi-logarithmic plot/graph has one axis on a logarithmic scale, the other on a linear scale. It is useful for data with exponential relationships, where one variable covers a large range of values, or to zoom in and visualize that - what seems to be a straight line in the beginning - is in fa...
if you try again with updated versions of pandas and matplotlib, you just need to add logy=True
!docs pandas.DataFrame.plot
DataFrame.plot(*args, **kwargs)```
Make plots of Series or DataFrame.
Uses the backend specified by the option `plotting.backend`. By default, matplotlib is used.
the main focus is to make it looks nice lol
Yes, but you need an updated version of pandas and matplotlib to do what I said
which will fix the scale of the y axis so your bottom lines aren't all scrunched together
i should probably resetart my computer
That won't have the effect of updating the pandas and matplotlib versions
did you do pip install -U pandas matplotlib?
And what was the error message?
Im trying to implement an LDA model on my dataset of 256 columns/features and 4509 rows. The problem Im facing with now is that the dataset used in the tutorial is using only 2 classifications and I have 5.
-I commented out the original statement from the tutorial and have been trying to work on it myself but haven't had any luck. Any ideas on how I can modify this?
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train.ravel())
y_prob_lda = lda.predict_proba(X_test)[:,1]
y_pred_lda = np.where(y_prob_lda == 0, 0, 0 | y_prob_lda == 1, 1, 0) #np.where(y_prob_lda > .5, 1, 0)
I'm worried that y_prob_lda == 0, 0, 0 | y_prob_lda == 1, 1, 0 does something other than what you expected
Yeahh I've been trying a few things. One thing was following how he did it, and since I read that np.where uses bitwise operators I tried using or. But that didn't work. Is there a way to fix this? Im using this dataset https://web.stanford.edu/~hastie/ElemStatLearn/datasets/phoneme.data
Also its Im finding it kinda hard to derive insights from a dataset where I don't even know the names of the columns
So any tips on that would be great too! right now Im learning the LDA model reduces the number of classifications to the important ones which will hopefully aid in figuring out how to Analyze this
sounds good. is speaker the class?
Ahh I was told to ignore the speaker column. Column g is the class of different phonemes
I took the speaker and row column out of my dataframe
for y_pred_lda = np.where(y_prob_lda == 0, 0, 0 | y_prob_lda == 1, 1, 0), are you really just trying to figure out what the most probable class is for each row?
In [21]: lda = LDA().fit(X, y)
In [23]: lda.predict(X)
Out[23]: array(['sh', 'iy', 'dcl', 'dcl'], dtype='<U3')
In [27]: lda.predict_proba(X)
Out[27]:
array([[1.28273626e-01, 5.59261688e-03, 8.66133757e-01],
[6.77497944e-07, 9.93583762e-01, 6.41556027e-03],
[9.92678238e-01, 3.56884038e-09, 7.32175869e-03],
[8.43266553e-01, 6.81605776e-06, 1.56726631e-01]])
In [30]: lda.predict_proba(X).argmax(axis=0)
Out[30]: array([2, 1, 0])
I was using a subset of the data with only three classes.
Thats one thing I was thinking about. Honestly Im having a hard time figuring out how to go about analyzing this because Im not even sure what the columns mean
each column is a class and each row is an instance. The value represents the probability that that instance belongs to the class for that column.
Ahhh I see
In [33]: lda.classes_
Out[33]: array(['dcl', 'iy', 'sh'], dtype='<U3')
I assume the columns follow this scheme
however this just seems like a roundabout way of doing lda.predict
I should have done .argmax(axis=1)
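To see the difference between the two axes, here is a sketch that reuses the probabilities and `classes_` printed in the session above, so no model is fitted:

```python
import numpy as np

classes = np.array(["dcl", "iy", "sh"])  # lda.classes_ from the session
proba = np.array([
    [1.28273626e-01, 5.59261688e-03, 8.66133757e-01],
    [6.77497944e-07, 9.93583762e-01, 6.41556027e-03],
    [9.92678238e-01, 3.56884038e-09, 7.32175869e-03],
    [8.43266553e-01, 6.81605776e-06, 1.56726631e-01],
])

best = proba.argmax(axis=1)  # most probable class index for each row
print(best)                  # [2 1 0 0]
print(classes[best])         # ['sh' 'iy' 'dcl' 'dcl']
```

That matches the output of `lda.predict(X)` shown earlier, which is why `predict` is the simpler call when only the most probable class is needed.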
So basically for each of these 4509 instances, we're trying to see how it matches to the corresponding g classification?
it appears that you're trying to learn what audio frequencies represent which phoneme. Are you familiar with phonemes?
Phonemes are sounds specific to letters/words? right?
they're the speech sounds in a natural language. They're somewhat related to letters, as spelling systems can be arbitrary.
it looks like the end goal is to be able to transcribe audio, though.
My background is in linguistics
bruh is there anything that you can't do?
spelling systems, and more broadly, writing systems in general.
Ohhh awesome! Guess I asked the right person about this haha
anyway, the number of classes shouldn't matter, as LinearDiscriminantAnalysis can generalize to an arbitrary number of classes. I think.
Also if your y is just the g column of the dataframe, you don't need to .ravel() it.
I actually don't know how LDAs work 🤷♂️
Here is where Im at right now. This is where I left off in the video so just trying to figure out how to move forward.
le = LabelEncoder()
phoneme_df2['g'] = le.fit_transform(phoneme_df2['g'])
encoded_data = pd.get_dummies(phoneme_df2)
y = phoneme_df2['g'].values.reshape(-1, 1)
X= encoded_data.drop(['g'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 42)
lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train.ravel())
y_prob_lda = lda.predict_proba(X_test)[:,1]
y_pred_lda = np.where(y_prob_lda == 0, 0, 0 | y_prob_lda == 1, 1, 0) #np.where(y_prob_lda > .5, 1, 0)
I actually have to head out for a bit
And my understanding is LDA's are supposed to reduce the number of dimensions to the most relevant features depending on the data
Ahh gotcha np thanks for the break down though
What does ravel do exactly?
!docs numpy.ndarray.ravel
ndarray.ravel([order])```
Return a flattened array.
Refer to [`numpy.ravel`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html#numpy.ravel "numpy.ravel") for full documentation.
See also
[`numpy.ravel`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ravel.html#numpy.ravel "numpy.ravel")equivalent function
[`ndarray.flat`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flat.html#numpy.ndarray.flat "numpy.ndarray.flat")a flat iterator on the array.
@haughty wharf that
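In the context of the snippet above, `ravel` undoes the `reshape(-1, 1)`. A tiny sketch with made-up labels:

```python
import numpy as np

y = np.array([3, 0, 4, 1]).reshape(-1, 1)  # column vector, like y in the snippet
print(y.shape)        # (4, 1)

flat = y.ravel()      # back to one dimension
print(flat.shape)     # (4,)
print(flat.tolist())  # [3, 0, 4, 1]
```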
i have finally updated the version lol
thx god
why does this thing label every single row lmao
i didn't even command it to do that
colors = pd.Series(predictions_pct).replace(dict(enumerate(['red', 'green', 'blue'])))
percentage_difference.T.plot.line(xlabel='Quarter', ylabel='Percentage Difference', color=colors)
@visual violet you forgot the logy part
Also I'm at the gym so I may not may not respond between sets
Ahhh got it thanks
I'm already back from that
oh
'1.2.4'
there is really no error
when i put percentage_difference.T.plot.line(xlabel='Quarter', ylabel='Percentage Difference', color=colors, logy=True)
it shows
when i change to
percentage_difference.T.plot.line(xlabel='Quarter', ylabel='Percentage Difference', color=colors, logy=False)
it shows
@serene scaffold
Set logy to true
can you put the whole CSV in a pastebin?
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
when training a model, should u augment the validation data too?
Hey @visual violet!
It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.
Feel free to ask in #community-meta if you think this is a mistake.
lol
Even for the paste bin?
lol yes
Just do fewer rows, I guess. How many are there?
724 🙂
Try 400
hmm still too big
tried recreating this but got LDA is not defined
I imported that long class name as LDA. LinearDiscriminatorwahtever
Having long class names is a waste of milliseconds.
300?
ohhh gotcha
can i make a model checkpoint to save model every 5 epochs, for example?
ah with period
Sure, go ahead and send that link over
you have to do it in a print(...) statement or it won't do line breaks.
which are necessary in this case.
so i print the csv file?
I just need this, but with linebreaks where a row ends.
got it
lol it looks so cool in pastebin
yet so useless when it comes to being clustered
@visual violet I tried setting it to a symlog scale and got this
so i tried out some graphing
the weird labeling is not because of the colors
hmm
how do you scale it so nicely though
i am starting to think it is because of my jupyter skin
the graph doesn't show fully
In [73]: df.T.plot.line(xlabel='Quarter', ylabel='Percentage Difference')
Out[73]: <AxesSubplot:xlabel='Quarter', ylabel='Percentage Difference'>
In [74]: matplotlib.pyplot.yscale('symlog')
In [75]: matplotlib.pyplot.show()
should i augment data in validation?
depends on the type of data
I was writing a paper on data augmentation for nlp but it hasn't gone anywhere
should i augment data in validation?
can i call you Steele ?
You can call me Stelercus
right
no
i still need to figure out how to remove the labels lol
but now it looks much like a cluster
before it is quite weird
actually surprised how the price behaves so similarly
thanks
@serene scaffold Im trying to follow this https://www.python-course.eu/linear_discriminant_analysis.php for LDA.
Im seeing the first step looks like:
Would my target feature be 256 columns and my descriptive feature would column g?
# 0. Load in the data and split the descriptive and the target feature
df = pd.read_csv('data/Wine.txt',sep=',',names=['target','Alcohol','Malic_acid','Ash','Akcakinity','Magnesium','Total_pheonols','Flavanoids','Nonflavanoids','Proanthocyanins','Color_intensity','Hue','OD280','Proline'])
X = df.iloc[:,1:].copy()
target = df['target'].copy()
I don't actually have experience using LDAs, so I might be able to get back to you
There might be a point at which there are just too many lines to effectively plot
gotcha no worries. i didnt think this would be so confusing. Most of the tutorials I've been looking at are working with two classes
you are not wrong
but it does show the cluster prety well though, so i am pretty happy about it
yeah, I might have been wrong about the number of classes not mattering. if I figure it out I'll let you know
you can see one giant clump
Hi me again
df2=df[['DateDim[Day]', '[NetSales]']]
df2['DateDim[Day]'] = pd.to_datetime(df2['DateDim[Day]'])
plt.figure(figsize=(16,8))
plt.title('Sale History')
plt.plot(df2['[NetSales]'])
plt.xlabel('DateDim[Day]')
plt.ylabel('[NetSales]', fontsize=25)
plt.show()
thats my code and output like this
Appreciate it man!
and you know my date data looking like this
why the program still doesn't see it in date format
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
do like data["Date"]= pd.to_datetime(data["Date"])
What does this mean? like, i though period was used to specify number of epochs
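The warning says `save_freq` counts batches rather than epochs, so saving every N epochs means converting epochs to batches. A rough sketch of that conversion; the helper name and the sample sizes are hypothetical, and it assumes every epoch runs over the whole training set:

```python
import math

def save_freq_for_epochs(n_samples, batch_size, every_n_epochs):
    """Number of batches seen in `every_n_epochs` epochs."""
    steps_per_epoch = math.ceil(n_samples / batch_size)
    return steps_per_epoch * every_n_epochs

# Hypothetical sizes: 1000 training samples, batches of 32, save every 5 epochs
print(save_freq_for_epochs(1000, 32, 5))  # 160
```

The result would then go in as the `save_freq=` argument of the checkpoint callback.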
right now it doesn't know that those are dates
same output like 500000 not date on xlabel
how about plt.plot(df2['DateDim[Day]'],df2['[NetSales]'])
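The reason that helps: `plt.plot(df2['[NetSales]'])` only gets the sales values, so the x axis falls back to the row index. A sketch with made-up rows (column names taken from the code above) that converts the dates and uses them as the index:

```python
import pandas as pd

# Made-up rows standing in for the real sales data
df = pd.DataFrame({
    "DateDim[Day]": ["2021-01-03", "2021-01-01", "2021-01-02"],
    "[NetSales]": [300.0, 100.0, 200.0],
})

df2 = df[["DateDim[Day]", "[NetSales]"]].copy()  # copy avoids chained-assignment warnings
df2["DateDim[Day]"] = pd.to_datetime(df2["DateDim[Day]"])
sales = df2.set_index("DateDim[Day]")["[NetSales]"].sort_index()

print(sales)
# sales.plot() or plt.plot(sales.index, sales.values) then puts dates on the x axis
```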
now you can extend the y axis like i did lol
figure(figsize=(10, 15), dpi=80)
put ^ code first then everything else after for it to work
thankss
How do I know if theres a good classification distribution in my data?
@haughty wharf having a confusion matrix that's just the diagonal means your model got everything right
Though you would want to evaluate that using different data than you used for training
Like an entirely new dataset?
@haughty wharf you usually train on like 70% and evaluate on 30%, or something like that
Unless you want to cross validate, which is nice if you can afford that computationally.
Ahhh I see
Okayy. I guess Ill pause on the LDA analysis for now. I also wanted to do some exploratory analysis. But Im having a hard time figuring out how to. The only thing I've done so far in that regard was get a pie chart displaying the phoneme distribution in the data
@haughty wharf exploratory analysis with the training data? Remind me what the rows and columns mean?
WIth the whole pandas dataframe, minus the speaker and row columns. I actually have some notes let me see.
A dataset was formed by selecting five phonemes for classification based on digitized speech from this database. The phonemes are transcribed as follows: "sh" as in "she", "dcl" as in "dark", "iy" as the vowel in "she", "aa" as the vowel in "dark", and "ao" as the first vowel in "water". From continuous speech of 50 male speakers, 4509 speech frames of 32 msec duration were selected, approximately 2 examples of each phoneme from each speaker. Each speech frame is represented by 512 samples at a 16kHz sampling rate, and each frame represents one of the above five phonemes. The breakdown of the 4509 speech frames into phoneme frequencies is as follows:
From each speech frame, we computed a log-periodogram, which is one of several widely used methods for casting speech data in a form suitable for speech recognition. Thus the data used in what follows consist of 4509 log-periodograms of length 256, with known class (phoneme) memberships.
The data contain 256 columns labelled "x.1" - "x.256", a response column labelled "g", and a column labelled "speaker" identifying the different speakers.
g- is labeled the phoneme```
the rows are 4509 instances, and there are 259 columns (x.1 - x.256 are the frequency measurements/predictor features, column g is the phoneme response, and column speaker identifies different speakers but I was told to ignore this column)
anybody knows how to explain dataframe in words?
like anybody has experience with describing a dataframe in word in a research paper
a pandas dataframe?
you could have a look around the docs, they have to explain it somehow
the docs for the class say:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
there will probably be a more lengthy description elsewhere, but its difficult to navigate the docs on mobile 😬
oh the way pandas doc describe their dataframe confuses me a lot lol
"Two-dimensional, size-mutable, potentially heterogeneous tabular data."
lol
wat does this even mean
it means that you have rows and columns (2 dimensions), the size can change (you can add and remove rows and columns), and can hold mixed types (ints, floats, timestamps, ...) in a single dataframe
hey I'm really hoping someone can help me, struggling to get correct(?) output from confusion matrix
I am building a model to infer sentiment from reviews. The model accuracy is listed at 97% and I am now trying to calculate the confusion matrix however it doesn't seem to output the correct information unless i'm misinterpreting it
this is what the matrix is outputting, can this be right with a 97% accurate model?
why don’t you think so?
how do I split with multiple delimiters
in pandas?
anything
what do you mean anything
it depends
re.split for the general case
Got it, thanks
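A minimal `re.split` sketch for the general case; the sample string and the set of delimiters are made up:

```python
import re

text = "a,b; c|d  e"
# Split on any run of commas, semicolons, pipes, or whitespace
parts = re.split(r"[,;|\s]+", text)
print(parts)  # ['a', 'b', 'c', 'd', 'e']
```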
just curious because it says there are 4100 false positives
out of 5000, unless i'm reading it wrong
it just seems high because the model is meant to be 97% accurate
not sure why the display is in scientific notation? but I see 10000 TN, 410 FN, and 9600 TP
which seems about right
or FP, I forgot which axis is predicted
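Scientific notation is easy to misread: 4.1e+02 is 410, not 4100. A sketch of checking accuracy straight from the matrix, using the three counts read off above. The fourth cell (200) is a made-up placeholder, and the layout assumes rows are true labels and columns are predictions.

```python
import numpy as np

# Three counts as read in the chat; the 200 is a made-up placeholder
cm = np.array([[1.0e+04, 2.0e+02],
               [4.1e+02, 9.6e+03]])

print(cm.astype(int))               # plain integers instead of scientific notation
accuracy = np.trace(cm) / cm.sum()  # correct predictions sit on the diagonal
print(round(accuracy, 3))           # 0.97
```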
yw 👋
do u have any mixup implementation for keras?
what representation method should i use
for a time-series data
i only have 20 columns and 725 rows
but i wanna know how to reduce the 'noise' 🙂
numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)```
Return evenly spaced numbers over a specified interval.
Returns *num* evenly spaced samples, calculated over the interval [*start*, *stop*].
The endpoint of the interval can optionally be excluded.
Changed in version 1.16.0: Non-scalar *start* and *stop* are now supported.
Changed in version 1.20.0: Values are rounded towards `-inf` instead of `0` when an integer `dtype` is specified. The old behavior can still be obtained with `np.linspace(start, stop, num).astype(int)`
@austere swift starts from 0 - 360 ( as pi =180) of 400 different samples right?
If you’re in the context of radians, yes
Thanks
can someone recommend an ebook for me to get started with deep learning with pytorch?
i need help with a bit of my code in which i'm training a model
and it is a bit urgent
IF you need help. post the questions here (:
I have my neural network made and stuff, now how would I Utilize it to actually create an AI? please help.
why can't i see the ipynb file in github?
gives Sorry, something went wrong. Reload?
thats a github problem, sometimes is not very reliable to check notebooks
you can use https://nbviewer.jupyter.org/ and paste the notebook url
yah just read the problem and found the site
thanks dude
@novel elbow is fastai better to learn than pytorch?
it's built on top of pytorch
too many abstractions though right?
if I wanted to something manual it would be hard is what I heard
depends on what you want to learn
then check the fastai course
in part 2 of the course they teach you how the library is built
so you can see all the inner and manual parts
oh ok thanks
does anybody here participate in kaggle competitions regularly?
Does anyone in here know the proper way to feed tensorflow/keras conv1d network with pandas dataframe? I always have problem with 1D datastructure from pandas to conv1d
let say i have n x 1000 features data for train or test, i always troubled with the input_shape with [n,n_feature] or batch_input_shape with(n,n_feature)
i using this code line dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values)) and put my 1st convo1D layer like this model.add(layers.Conv1D(filters=64,kernel_size=9,activation='relu',batch_input_shape = [None,15360, 1])) it stuck on ValueError: Input 0 of layer sequential_3 is incompatible with the layer: : expected min_ndim=3, found ndim=2. Full shape received: (15360, 1)
and this is my model structure:
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv1d_12 (Conv1D) (None, 15352, 64) 640
_________________________________________________________________
conv1d_13 (Conv1D) (None, 15344, 64) 36928
_________________________________________________________________
conv1d_14 (Conv1D) (None, 15336, 64) 36928
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 7668, 64) 0
_________________________________________________________________
dropout_4 (Dropout) (None, 7668, 64) 0
_________________________________________________________________
conv1d_15 (Conv1D) (None, 7660, 64) 36928
_________________________________________________________________
conv1d_16 (Conv1D) (None, 7652, 64) 36928
_________________________________________________________________
max_pooling1d_5 (MaxPooling1 (None, 3826, 64) 0
_________________________________________________________________
dropout_5 (Dropout) (None, 3826, 64) 0
_________________________________________________________________
conv1d_17 (Conv1D) (None, 3818, 64) 36928
_________________________________________________________________
flatten_2 (Flatten) (None, 244352) 0
_________________________________________________________________
dense_2 (Dense) (None, 1) 244353
=================================================================
Total params: 429,633
Trainable params: 429,633
Non-trainable params: 0
_________________________________________________________________
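On the shape error: Conv1D wants input shaped (batch, steps, channels), and the "found ndim=2" message suggests single unbatched samples are reaching the layer. A sketch of the likely fix, shown with NumPy only; the 4 rows are an arbitrary stand-in and the batch size in the comment is an assumption:

```python
import numpy as np

n_samples, n_features = 4, 15360            # 4 rows is an arbitrary stand-in
values = np.zeros((n_samples, n_features))  # plays the role of df.values

x = values[..., np.newaxis]  # add a channel axis
print(x.shape)               # (4, 15360, 1)

# Each element produced by from_tensor_slices is ONE sample, so the dataset
# still needs batching before model.fit, for example (assumed batch size):
# dataset = tf.data.Dataset.from_tensor_slices((x, target.values)).batch(32)
```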
!d numpy
hey guys, i'm trying to load a pre-trained gensim Word2Vec model and i am experiencing this error:
UnpicklingError: invalid load key, '6'.
i got the model like this :
'blabla-10-300.w2v.model.bz2'
i tried multiple ways: i loaded it directly, i loaded it unzipped, i tried multiple methods to load it into gensim
and nothing seems to work
do you have a fix?
@wooden cosmos try copying the error message into the chat
thx, i already fixed it
Yay!
yeah, cool
but i ran into an other problem - the model is trained on non-stemmed and non-lemmatized french words
that seems weird to me, what do you think about it?
I don't know enough about french to know why they would have made that decision
@eager heath, I choose you!
Yes!
French. Help.
Tell me about it
You know how there's like the normal form of a word, like the version that's used to look it up in a dictionary?
I don't know what the format is, but now it exists, yes
It's called the lemma.
It's usually the singular form of a word when it's the subject of a sentence.
the question is : is it better to train w2v on lemmatized and stemmed tokens or to use just plain words like in the text, without any preproc
and if i have a model, which is not trained on lemms and stemms - should i try and retrain that thing or i just stick with it?
that would be a bit weird, we do have plural and gendered forms for most words
My blind guess is that you would get less accurate results because of all the possible spelling of a word
yeah
and also for the verbs -> "aller" could be spelled as "allez","allons","allait","allaient"
Yeah
Are there rules you can apply to certain types of words to get their base form?
Yes, there should be
Although they would get really complicated quickly
We like exceptions of exceptions
romance languages are generally easier to reduce to a base form than english, as far as i know
i know spanish you could probably do it in prolog, not many exceptions and even the exceptions are pretty "regular"
English is actually pretty easy. The exceptions to our pluralization rules are when we retain the plural forms of other languages
there are two conjugations of verbs, those where you append -ed and those where a vowel changes internally. Like swim vs swam. But the latter category is shrinking over time.
Can someone help me with an error I'm having running a BERT model on Colab? Colab's RAM is getting depleted, what am I missing here?
https://www.reddit.com/r/MachineLearning/comments/o4uxc5/p_help_error_due_to_colab_ram_depletion_when/
fair enough, although the "standard" nltk stemmers do miss a lot of specific cases
and those have had a lot of research behind them
you won't get any replies on r/machinelearning, and did you try googling your error?
yeah but nothing relevant that is helping 😦
what other subreddits do you suggest i check out?
how can i get a random vector [x, y, z] from a 3d numpy array with shape (255, 255, 3)
you want a random slice of the array?
yes
the 3d array is conceptually a 2d array of (r, g, b) triples
i want a random triple
i looked at at np.random.choice and some others but i can't see an easy way to do this even though it should be easy
from random import randrange
i, j = randrange(255), randrange(255)
my_random_rgb = array[i, j, :]
no?
lol sure
for some reason i thought there'd be some numpy function that did it directly
random.choice specifically says it's for 1D
from numpy.random import default_rng
rng = default_rng()
i, j = rng.integers(0, 255, size=2, dtype=int, endpoint=False)
i think this is equivalent using the numpy rng
which you probably should do if you want to use the same random seed as your other numpy code
maybe you can "partially ravel" the array and then use choices
@ember sapphire ```python
from numpy.random import default_rng
rng = default_rng()
image = ... # 255 x 255 x 3
random_triple = rng.choice(image.reshape((-1,3)))
```
the i,j version might be faster for what it's worth
In [76]: b = np.arange(255*255*3).reshape((255,255,3))
In [77]: rng = np.random.default_rng()
In [78]: def rand1(rng, array):
    ...:     i, j = rng.integers(0, 255, size=2, dtype=int, endpoint=False)
    ...:     return array[i, j, :]
    ...:
In [79]: def rand2(rng, array):
    ...:     return rng.choice(array.reshape((-1,3)))
    ...:
In [80]: %timeit rand1(rng, b)
23.6 µs ± 3.03 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [81]: %timeit rand2(rng, b)
21.3 µs ± 2.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
not much different
!e ```python
from math import sqrt
import scipy.stats
n1 = 7
n2 = 7
m1 = 23.6
s1 = 3.03
m2 = 21.3
s2 = 2.10
welch_t = (m1 - m2) / sqrt(s1**2 + s2**2)
welch_df = ((n1-1)*s1**4 + (n2-1)*s2**4) / sqrt(s1**4 + s2**4)
welch_p = scipy.stats.t(df=welch_df).ppf(welch_t)
print(welch_p)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
0.3171214613223502
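For comparison, the textbook Welch statistic divides each variance by its sample size and uses the Welch-Satterthwaite degrees of freedom. A sketch with the same numbers, assuming the `%timeit` spread can be treated as a sample standard deviation over 7 runs:

```python
from math import sqrt

n1 = n2 = 7          # runs reported by %timeit
m1, s1 = 23.6, 3.03  # mean and std. dev. in microseconds
m2, s2 = 21.3, 2.10

v1, v2 = s1**2 / n1, s2**2 / n2  # squared standard errors
welch_t = (m1 - m2) / sqrt(v1 + v2)
welch_df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(round(welch_t, 2), round(welch_df, 1))
# A two-sided p-value would then be 2 * scipy.stats.t(df=welch_df).sf(abs(welch_t))
```

With only 7 runs each and overlapping spreads, this still points to "not much different".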
another simple numpy question
so i have my 255x255 image
and i want to augment it so that it's 255x255x5 instead of 255x255x3
i want image[x, y] to become [x/255, y/255, *image[x, y]]
is there a way to do that in a vectorized fashion?
i want the index in the array
as in, image[10, 30] should be (10/255, 30/255, r, g, b)?
yes
dare i ask, why?
because that is the space in which i want to compute distances
im clustering pixels based on their location and their color
so my centroids for k-means need to be vectors in that 5d space
cluster[y, x] = np.argmin(np.linalg.norm(centroids - img[y, x], axis=1))
the goal is to be able to write that
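Once the image is in that 5-d form, the per-pixel expression can probably be replaced by one broadcasted computation over all pixels. A sketch with small random stand-ins for the image and the centroids:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((4, 6, 5))     # small stand-in for the 255x255x5 array
centroids = rng.random((3, 5))  # k = 3 centroids in the same 5-d space

# (h, w, 1, 5) - (1, 1, k, 5) broadcasts to (h, w, k, 5)
diffs = img[:, :, np.newaxis, :] - centroids[np.newaxis, np.newaxis, :, :]
cluster = np.linalg.norm(diffs, axis=-1).argmin(axis=-1)  # shape (h, w)

# Same result as the per-pixel expression from the chat
y, x = 2, 3
assert cluster[y, x] == np.argmin(np.linalg.norm(centroids - img[y, x], axis=1))
```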
!eval there might be a nicer way to do it, but this appears to work
import numpy as np
# rgb 255x255 image
b = np.arange(255*255*3).reshape((255,255,3))
m, n = b.shape[:2]
i_broadcast = np.repeat(np.arange(m), n).reshape((m, n, -1))
# tile the n column indices m times (same result here since m == n)
j_broadcast = np.tile(np.arange(n), m).reshape((m, n, -1))
b_aug = np.concatenate((i_broadcast, j_broadcast, b), axis=2)
print(b_aug)
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | [[[ 0 0 0 1 2]
002 | [ 0 1 3 4 5]
003 | [ 0 2 6 7 8]
004 | ...
005 | [ 0 252 756 757 758]
006 | [ 0 253 759 760 761]
007 | [ 0 254 762 763 764]]
008 |
009 | [[ 1 0 765 766 767]
010 | [ 1 1 768 769 770]
011 | [ 1 2 771 772 773]
... (truncated - too many lines)
Full output: https://paste.pythondiscord.com/jenuwoguja.txt?noredirect
if you have to do this for lots of images you could of course re-use the i_broadcast and j_broadcast over and over in a tight loop
i forgot to /255 but you get the idea
is it just me or is numpy miserable to work with
x = np.repeat(np.arange(m), n).reshape((m, n, -1)) / 255.0
y = np.tile(np.arange(n), m).reshape((m, n, -1)) / 255.0
b_aug = np.concatenate((x, y, b), axis=2)
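another way to build the same thing, as a sketch: `np.indices` returns the row and column index grids directly, so there is no repeat/tile argument order to keep straight and non-square images work too. The function name and the `scale` parameter are made up for illustration; the divisor of 255 just mirrors the request above.

```python
import numpy as np

def augment_with_coords(img, scale=255.0):
    """Return an (h, w, 5) float array holding (row/scale, col/scale, r, g, b)."""
    h, w = img.shape[:2]
    rows, cols = np.indices((h, w))                    # each has shape (h, w)
    coords = np.stack((rows, cols), axis=2) / scale    # shape (h, w, 2)
    return np.concatenate((coords, img), axis=2)

# Small non-square stand-in for an RGB image
img = np.arange(4 * 6 * 3).reshape((4, 6, 3))
aug = augment_with_coords(img)
print(aug.shape)
```

here `aug[i, j]` is `(i/255, j/255, r, g, b)` for the pixel at row `i`, column `j`.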
it's just a reality when working within a language like python
you need to do as much in C as possible
which means you need custom C functions and lots of custom functionality that in a "fast" language you might just do in a for loop
kind of defeats the purpose of using a high level language
it's also a bit of a learning curve and an acquired taste
sort of, you the developer don't need to worry about allocating memory and strided array lookups and bytes and stuff
also if you really do need to write a for loop over a numpy array, numba can be magical
import numba
import numpy as np
def augment_with_coords_np(array):
    m, n = array.shape[:2]
    x = np.repeat(np.arange(m), n).reshape((m, n, -1)) / 255.0
    y = np.tile(np.arange(n), m).reshape((m, n, -1)) / 255.0
    return np.concatenate((x, y, array), axis=2)
@numba.njit
def augment_with_coords_nb(array_in):
    array_out = np.zeros((255, 255, 5))
    for i in range(255):
        x = i / 255.0
        for j in range(255):
            y = j / 255.0
            array_out[i, j, 0] = x
            array_out[i, j, 1] = y
            array_out[i, j, 2] = array_in[i, j, 0]
            array_out[i, j, 3] = array_in[i, j, 1]
            array_out[i, j, 4] = array_in[i, j, 2]
    return array_out
In [144]: %timeit augment_with_coords_np(b)
1.06 ms ± 93.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [146]: %timeit augment_with_coords_nb(b)
255 µs ± 53.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
where b = np.arange(255*255*3).reshape((255,255,3))
numba is a lot faster here because it can be more algorithmically efficient, only a single nested loop and a single allocation, instead of lots and lots of looping and allocation + python function call overhead
if you use np.empty instead of np.zeros , the numba version is even faster
In [149]: %timeit augment_with_coords_nb(b)
155 µs ± 4.62 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
wow
isn't having all the iteration abstracted away more high-level than writing loops?
am I reading that right? the numba one was faster? what the fuck?
how does that even happen?
a lot faster
like i said, fewer passes over the data + no python function call overhead
at least, that is my understanding of why
does numba only work with cpython?
i think so, i've read some blog posts about using it with pypy but i think you need to build pypy from source with some patches, maybe?
maybe it's better now in 2021
def augment_with_coords_np(array):
    m, n = array.shape[:2]
    x = np.repeat(np.arange(m), n).reshape((m, n, -1)) / 255.0  # arange, repeat, reshape, divide; 4
    y = np.tile(np.arange(n), m).reshape((m, n, -1)) / 255.0  # same, basically. 4
    return np.concatenate((x, y, array), axis=2)  # 1
If I'm reading this right, this involves creating 9 arrays, only one of which gets returned. But I assume that within a numba-decorated function, the semantic requirement that intermediary arrays are created isn't there, yes?
that and you don't create intermediary arrays anyway
reducing batch size is literally the first thing that comes up
def augment_with_coords_np(array):
    m, n = array.shape[:2]
    return np.concatenate((
        np.repeat(np.arange(m), n).reshape((m, n, -1)),
        np.tile(np.arange(n), m).reshape((m, n, -1)),
        array), axis=2
    ) / 255.0
I think this is the same?
already tried that
this is the same use case as numexpr in pandas
😦
if you can't get it to work with BS 1, then you have no other choice than to reduce model parameters
or buy/obtain a better GPU
even in numba? or can numba figure out the intended semantics?
(when it's only having to deal with arrays)
that i don't know, i'll try it
it doesn't work with nopython mode
In [156]: %timeit augment_with_coords_nb_np(b)
1.17 ms ± 79.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
this is the numpy version with numba.jit slapped on top of it
so it's actually slower than just doing it in plain python
or at least not any faster
note: this is all with numpy 1.20.2 under cpython 3.9.4 x86_x64
using the pypi wheel, not conda
properties might be different in different situations of course
so to benefit from numba, you can't be creating lots of extra arrays?
to benefit from numba you need to be writing for loops over numpy arrays
not using high-level numpy functions
@numba.njit
def augment_with_coords_nb_pre(array_in, array_out):
    for i in range(255):
        x = i / 255.0
        for j in range(255):
            y = j / 255.0
            array_out[i, j, 0] = x
            array_out[i, j, 1] = y
            array_out[i, j, 2] = array_in[i, j, 0]
            array_out[i, j, 3] = array_in[i, j, 1]
            array_out[i, j, 4] = array_in[i, j, 2]
i think you could even require a pre-allocated array_out parameter to be filled
this is one possible optimization if you're writing a loop, you can allocate the memory once for the entire loop
I'd rather it be something like
@numba.njit(array_out=np.zeros((5, 5)))
def augment_with_coords_nb_pre(array_in):
    for i in range(255):
        x = i / 255.0
        for j in range(255):
            y = j / 255.0
            array_out[i, j, 0] = x
            array_out[i, j, 1] = y
            array_out[i, j, 2] = array_in[i, j, 0]
            array_out[i, j, 3] = array_in[i, j, 1]
            array_out[i, j, 4] = array_in[i, j, 2]
interestingly it doesn't actually seem faster if you use the pre-allocated array
but I guess that would fuck with the namespacing
In [162]: @numba.njit
     ...: def augment_with_coords_nb_pre(array_in, array_out):
     ...:     for i in range(255):
     ...:         x = i / 255.0
     ...:         for j in range(255):
     ...:             y = j / 255.0
     ...:             array_out[i, j, 0] = x
     ...:             array_out[i, j, 1] = y
     ...:             array_out[i, j, 2] = array_in[i, j, 0]
     ...:             array_out[i, j, 3] = array_in[i, j, 1]
     ...:             array_out[i, j, 4] = array_in[i, j, 2]
     ...:
In [163]: c = np.empty((255, 255, 5))
In [164]: %timeit augment_with_coords_nb_pre(b, c)
179 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [165]: %timeit c = np.empty((255, 255, 5)); augment_with_coords_nb_pre(b, c)
193 µs ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
so the time savings is probably just from making the array in advance?
what do you mean?
it's still significantly faster than the np.concatenate version
actually wait
it is faster than the numba version using np.zeros, but not faster than the numba version using np.empty
these differences aren't really statistically significant though
i'm surprised that pre-allocating isn't significantly faster, maybe there's additional overhead somehow, or i need to use the numba signature
@desert oar On an unrelated note, I've been wanting to write an article on transition from general Python program design to data science Python design, and it's based on the idea that where general Python is mostly OOP and imperative (there are lots of data types and you use loops to read and write to different data structures), data science Python is more functional and less OOP (you mostly work with "rectangular" data structures, pretty much everything is a function that doesn't modify the underlying data (ie there are usually no (gasp) side effects)). Do you think I'm on the right track?
In [167]: c = np.empty((255, 255, 5)); augment_with_coords_nb_pre(b, c)
In [168]: np.testing.assert_array_almost_equal(augment_with_coords_nb(b), c)
In [169]: %timeit augment_with_coords_nb_pre(b, c)
158 µs ± 20.4 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [170]: %timeit augment_with_coords_nb_pre(b, c)
149 µs ± 1.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [171]: %timeit augment_with_coords_nb(b)
150 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [172]: %timeit augment_with_coords_nb(b)
158 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
kind of. pandas itself is quite object-oriented. the main difference is transitioning from imperative (for loops) to declarative (vectorized operations)
pandas also isn't really that "functional"
in fact i'd say it's not really functional at all other than its use of higher-order functions in some places (map, apply, etc)
I'm thinking of both pandas and numpy
same goes for numpy as for pandas
but pandas and numpy are both fairly object-oriented and only somewhat "functional"
numpy and pandas do both support some mix of views and copy-on-write, but it's mostly exposed directly to the user, rather than hidden away as optimization behind an immutable interface
the "no side effects" aspect of functional programming is mostly incidental by virtue of what people usually want to do with numpy and pandas: math
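a tiny illustration of the imperative versus vectorized distinction mentioned above, as a sketch with made-up numbers (the `prices` and `qty` arrays are hypothetical):

```python
import numpy as np

prices = np.array([3.0, 4.5, 10.0])
qty = np.array([2, 1, 4])

# Imperative: spell out how to walk the data and accumulate the result
total_loop = 0.0
for p, q in zip(prices, qty):
    total_loop += p * q

# Vectorized: state what is wanted and let numpy do the looping in C
total_vec = float((prices * qty).sum())
```

both compute the same total; the second form says nothing about iteration order or accumulation, which is what makes it read as declarative.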
Declarative is definitely more along the right lines. It wasn't part of my CS education, apparently.
I should get a refund.
heh, sql is declarative
my database class was weird
i'm sure they talked more about database implementation than about programming language design though
in my database class? the first half of the class was ACID, relational algebra and all those normal forms, and the second half was SQL and making a website
we didn't talk about the time complexity of different queries, which is kind of annoying
if True:
    tokens = [t for t in tokens if t not in set(stopwords.words('english'))]
can anyone help me understand what this code means
i mean the first part if True
what has to be True, and what makes this statement false?
it returns a copy of tokens that doesn't have any stopwords. Do you know what a stopword is?
if True is pointless
ok so its just extra code
I guess so. if True blocks will always get entered
ok i must be doing something seriously wrong
yes, `if` checks whether a condition is true and only executes the block when it is
it executes the block when the condition is evaluated to true
so if True will always be executed, since the condition is True
and inversely, if False will never be executed, since the condition is False
import numpy as np
from matplotlib import pyplot as plt
from matplotlib import image
rng = np.random.default_rng()
img = image.imread('fruits_small.jpg')
h, w = img.shape[:2]
x = np.repeat(np.arange(h), w).reshape((h, w, -1)) / w
y = np.tile(np.arange(h), w).reshape((h, w, -1)) / h
augmented_image = np.concatenate((x, y, img), axis=2)
augmented_image = np.array(img)
plt.subplot(3, 3, 1)
plt.title('Original')
plt.imshow(img)
for plot, k in enumerate([4, 8, 16, 32, 64]):
    centroids = rng.choice(augmented_image.reshape((-1, 3)), size=k, replace=False)
    clusters = np.empty((h, w))
    print(centroids)
    while True:
        for y, x in np.ndindex(img.shape[:2]):
            v = augmented_image[y, x]
            clusters[y, x] = np.argmin(np.linalg.norm(centroids - v, axis=1))
        d = 0
        for i in range(k):
            c = augmented_image[clusters == i]
            new_centroid = c.mean(axis=0)
            d += np.linalg.norm(centroids[i] - new_centroid)
            centroids[i] = new_centroid
        if d == 0:
            break
    cluster_colors = [np.random.rand(3) for _ in range(k)]
    for i in range(k):
        img[clusters == i] = centroids[i]
    plt.subplot(3, 3, plot + 1)
    plt.title(f'k = {k}')
    plt.imshow(img)
plt.show()
is there anything obviously terrible here? running it on a 750x500 image is taking hours
it looks like it isn't even converging
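a few things in the posted script look like likely culprits, though this is a guess from reading it rather than running it: the second assignment to `augmented_image` replaces the 5-channel array with the plain image; the assignment step calls `np.linalg.norm` once per pixel in a Python loop (375,000 calls per pass on a 750x500 image); if the image loads as uint8, `centroids - v` wraps around instead of going negative; and stopping only on `d == 0` never triggers if an empty cluster makes the mean NaN. Below is a minimal sketch of plain Lloyd's k-means that avoids those, with a random array standing in for the real image. The function name, `max_iter` and `tol` are assumptions for illustration.

```python
import numpy as np

def kmeans(points, k, rng, max_iter=100, tol=1e-6):
    """Plain Lloyd's k-means on an (N, d) array; returns (centroids, labels)."""
    points = np.asarray(points, dtype=np.float64)  # floats avoid uint8 wrap-around
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared distances via ||p - c||^2 = ||p||^2 - 2 p.c + ||c||^2, shape (N, k)
        d2 = (
            (points ** 2).sum(axis=1)[:, None]
            - 2.0 * points @ centroids.T
            + (centroids ** 2).sum(axis=1)[None, :]
        )
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for i in range(k):
            members = points[labels == i]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[i] = members.mean(axis=0)
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # tolerance instead of an exact == 0 test
            break
    return centroids, labels

rng = np.random.default_rng(0)
h, w = 40, 60
img = rng.integers(0, 256, size=(h, w, 3), dtype=np.uint8)  # stand-in image
rows, cols = np.indices((h, w))
# 5-d features per pixel: (row/h, col/w, r, g, b), colors scaled to [0, 1]
features = np.dstack((rows / h, cols / w, img / 255.0)).reshape(-1, 5)
centroids, labels = kmeans(features, k=4, rng=rng)
clusters = labels.reshape(h, w)
```

scaling the colors to [0, 1] is a choice made here so that position and color carry comparable weight in the distance; the original script leaves colors in their loaded range.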
this also wastefully re-computes set on the stopwords every iteration
stopwords_set = set(stopwords.words('english'))
tokens = [t for t in tokens if t not in stopwords_set]
ah so this way saves time?
yes, and if you're doing text processing on a lot of data and/or have a big stopwords list, the time savings could add up
ahh so this is why my runtime was taking too long, thanks
Does anyone know how to write an ML paper?
For background, I've trained an image-based biostatistical model that can identify COVID-19 at >99% accuracy. I extended the model to three other diseases. It trained on CT scans and x-rays. It successfully identified Coronavirus, Tuberculosis, Carcinoma, & Pneumonia at above 93% accuracy, specificity, sensitivity, and precision when tested through one-vs-all adaptations of 2x2 confusion matrices in cross validation.
The model is deep-learning based, in a semi-supervised setup. It used a convolutional neural network, a deep multilayer perceptron, an isolation forest, and a support vector machine to make the diagnosis. It shows promising results, and I want to write the paper. Idk how these types of papers are written though.
not to be discouraging, but there are literally 100s of papers out there already reporting 99% accuracy
if its just for CV, not for actual conferences then I guess that doesn't really matter
Yeah ik, but theyre all written differently
so Im so confused
btw, Im just writing this paper as a mf school science fair project, so Im not trynna get compensated or anything
I also wanted to write a paper too 😁 but writing even a decent paper requires a ton of knowledge and studying of previous methods - not to mention all the formality
Thing is that Im required to
I already did the actual experiment with the network and stuff
Now I just gotta put it into word form by summer's end
if its only for a school project
then you just need to formalize whatever you have done - no need to write a full fledged paper
Yeah Im thinking I want to try and get the paper to ISEF, but idk if it's good enough though
do they specifically ask for research papers? if not, then a document would be enough
In the past 2 years, at least one paper at my regionals fair that advanced to ISEF was a ML classifier for cancer. The aforementioned model can do cancer, as well as other diseases like tuberculosis and stuff
they want a paper in the APA format
seems like that's just a document formatting style
so if you put decent amount of formalism in it, you would be fine
ebic
I have a question for professional data scientists: in which contexts do you use maths, and in which do you use coding? thank you 🙂
my cost function is increasing from one iteration to the next
does that mean i have a problem in my implementation?
So I am running into this dilemma that I have not found an answer for and its weird...
When concatenating the values of 2 columns at a row level, you have to do something ugly like
df['combined']=df['one'].astype(str)+' stuff ' + df['two'].astype(str)
It works and all but I can't help but feel like its a code smell
indeed it is, why are you concatenating stringified versions of your data at all?
and what are the actual datatypes here?
sometimes it is the best way to do something, but it's rare that this is actually what you want to do
Can't say particularly. Basically generating an instruction based on values in two columns.
the only other way to perform this particular task is with .apply over rows
if the column is already a string column, why astype(str)?
They are strings.
if there are nulls, you need to handle those differently
astype(str) will do the wrong thing for the most part
I was getting weird errors.
what errors
!e ```python
import pandas as pd
s1 = pd.Series(['a', 'b', None])
s2 = pd.Series(['x', None, 'z'])
print( s1.astype(str) + ' -> ' + s2.astype(str) )
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | 0 a -> x
002 | 1 b -> None
003 | 2 None -> z
004 | dtype: object
Sry actually not weird.
that probably isn't what you want
Float > string coercion error
so you have mixed data types
i.e. not strings
are there np.nan's in there?
!e ```python
import pandas as pd
import numpy as np
s1 = pd.Series(['a', 'b', np.nan])
s2 = pd.Series(['x', np.nan, 'z'])
print( s1.astype(str) + ' -> ' + s2.astype(str) )
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | 0 a -> x
002 | 1 b -> nan
003 | 2 nan -> z
004 | dtype: object
again, probably not what you want
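one way to make the missing-value handling explicit, as a sketch using the same two toy series: `Series.str.cat` keeps a row missing when either side is missing, unless you pass `na_rep` to choose a placeholder yourself. The `'?'` placeholder is just an example value.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['a', 'b', np.nan])
s2 = pd.Series(['x', np.nan, 'z'])

# Default: a missing value on either side leaves the whole row missing
propagated = s1.str.cat(s2, sep=' -> ')

# Explicit placeholder for missing values
filled = s1.str.cat(s2, sep=' -> ', na_rep='?')

print(propagated)
print(filled)
```

whether missing rows should stay missing or get a placeholder depends on what the combined column is used for downstream.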

