#data-science-and-ml
As you can see in the form [normal, postprocessing1, postprocessing2] accuracy
the postprocessing1 also makes the training data get into the 80's in accuracy
Hello. I am trying to compare two dataframes. They have the same number of columns and same column names. Error is below. How do I adjust this so the compare works?
ValueError: Can only compare identically-labeled DataFrame objects
Here are my functions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:secret@10.0.10.125:5432/sheepdb')

def jdb_sheep_df():
    table_name = 'jdb_sheep'
    table_df = pd.read_sql_table(
        table_name,
        con=engine,
        schema='sheepdb'
    )
    return table_df

def sql_csv_compare(path='/home/steven/projects/repos/sheepdb-cli/sheepdb-cli/test.csv'):
    csv_df = pd.read_csv(path)
    sql_df = jdb_sheep_df()
    #ne = sql_df.compare(csv_df)
    #ne = (csv_df != sql_df).any(1)
    ne = csv_df.compare(sql_df, keep_equal=True, keep_shape=True)
    return ne

print(sql_csv_compare())
@forest canyon did you confirm that the rows are indexed the same way?
Confirm in what way?
@forest canyon looks like you're reading the DataFrames from an SQL database. Once you have the DataFrames, make sure that the .index of each are the same sets.
could you look at my problem if you can?
How?
The one from earlier? I don't look at screenshots of text.
i can send a log if you want
I'm about to drive but I might be able to look later. I would ask questions to the whole channel, not individuals.
.index is an attribute of DataFrames. Look at them and see.
I know it's an attribute but I dont know what you mean by "the same"
It gives both dataframes an index column at the front starting at 0
can anyone recommend the easiest single-variable ml linear regression tutorial they've come across? I figure I'd start with something like that.
If the values in both indices are the same, not considering the order. If they're both range indices of the same length, then they're the same.
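A minimal sketch of the check being described, using two small made-up frames (`left` and `right` are placeholders, not the sheep data). It assumes pandas 1.1 or newer, where `DataFrame.compare` is available, and only calls `compare` once both the index and the columns are confirmed to match:

```python
import pandas as pd

# Two hypothetical frames with the same columns and the same RangeIndex.
left = pd.DataFrame({"name": ["a", "b", "c"], "weight": [10, 20, 30]})
right = pd.DataFrame({"name": ["a", "b", "c"], "weight": [10, 25, 30]})

# compare() needs identical labels, so check the index and columns first.
same_labels = left.index.equals(right.index) and left.columns.equals(right.columns)

# Only rows and columns that differ are kept in the result.
diff = left.compare(right) if same_labels else None
print(same_labels)
print(diff)
```

If `same_labels` comes back `False`, the row counts or index values differ and `compare` would raise the ValueError from the question.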
Oh.. So then what is the use of a compare if you always have to have the same number of rows?
Maybe make an if statement that will say if they are equal… for example: if set1 == set2: print("They are equal to each other.")
just my idiotic brain thinking
I need to identify differences
here is the log of the same ss
2021
[0.5121227115289461, 0.8090054428500743, 0.5314200890648194]
2020
[0.5096486887679367, 0.8169223156853043, 0.5319148936170213]
2021
2020
[0.5091538842157348, 0.8213755566551212, 0.5329045027214251]
2021
2020
[0.5091538842157348, 0.830282038594755, 0.5338941118258288]
yea but there could be no differences
I'm not quite there yet.
I need to get them comparing first
So the compare function only works on data with same row count on both sides?
that could be it… maybe ur comparing a 2d list to a 3D list
That's what was said above anyway.
My dataframes are same columns just one has more rows than the other
can you delete a row and test it?
here is the question reposted with the log instead of the ss:
Hey i am making a stock predictor and i am wondering if it is ok to take the output from testing and run it through a post processing function which makes the testing accuracy go from .53 to .79?
I already made the function
It seems to make the testing accuracy much better
As you can see in the form [normal, postprocessing1, postprocessing2] accuracy
2021
[0.5121227115289461, 0.8090054428500743, 0.5314200890648194]
2020
[0.5096486887679367, 0.8169223156853043, 0.5319148936170213]
2021
2020
[0.5091538842157348, 0.8213755566551212, 0.5329045027214251]
2021
2020
[0.5091538842157348, 0.830282038594755, 0.5338941118258288]
the postprocessing1 also makes the training data get into the 80's in accuracy
I added a row and tested. it sets mismatches as NaN. Is there a way to compare when rows count is different?
Like compare two columns?
And see if items in the sql_df exist in the csv_df?
would it change the value if you added 0’s to the extra row? Cause then it will allow you to compare them
what do you mean like to see if they are the same?
Basically - take a column in sql_df, compare it to the same column csv_df, then give me any value in sql_df column that doesn't exist in csv_df column
ok are they pandas dataframes already?
Yep. They are in my code above.
sql_df comes from a DB table and csv_df comes from a csv file
u might want to convert to a list with .tolist() and then use .index(False)
It looks like that will be the same issue - row count must match, but I'll try it and Google if it doesn't work. Main question is answered on compare. The compare has to have the same row count on both sides, and you'd want to sort by the same column so it is an accurate compare. Thanks all.
alright
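For the "values in sql_df that don't exist in csv_df" question, row counts do not have to match if the lookup is done with `isin` or an outer merge instead of `compare`. A minimal sketch with made-up data; the `tag` and `breed` column names are assumptions, not the real table schema:

```python
import pandas as pd

# Hypothetical stand-ins for the database table and the CSV file.
sql_df = pd.DataFrame({"tag": ["A1", "A2", "A3", "A4"], "breed": ["x", "y", "z", "w"]})
csv_df = pd.DataFrame({"tag": ["A1", "A3"], "breed": ["x", "z"]})

# Single-column check: rows of sql_df whose tag never appears in csv_df.
missing = sql_df[~sql_df["tag"].isin(csv_df["tag"])]

# Whole-row check: outer merge on all shared columns, keep rows found only in sql_df.
merged = sql_df.merge(csv_df, how="outer", indicator=True)
only_sql = merged[merged["_merge"] == "left_only"].drop(columns="_merge")

print(missing)
print(only_sql)
```

The `isin` version answers the single-column question directly; the merge version also catches rows where the tag exists on both sides but another column differs.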
:ok_hand: applied mute to @junior robin until <t:1641019694:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
hey, i want to make a virtual assistant like alexa or siri but it works offline and on mobile devices with basic functionality like play music open application etc.
i have hit a roadblock in the offline voice recognition part. i have tried sphinx and vosk but the accuracy is not that great. i am thinking of learning ML and designing my own voice recognition, but getting a huge amount of voice data is a problem.. or should i just try to modify existing voice recognition models?
have you tried searching in places like kaggle for datasets?
https://www.kaggle.com/mozillaorg/common-voice a quick search found me this
SOS
I need a project topic on Machine Learning, not something like price prediction, churn, or cancer classification
i'm afraid of breaking the advertisement rule but I have a free api which hosts a bunch of datasets https://mldrive.io/documentation
an API for Machine Learning
:incoming_envelope: :ok_hand: applied mute to @deep gyro until <t:1641035996:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
best place to learn tensorflow???
i know data science basics, nn models, everything: language, procedure
just library is what i had left
so tell me the best place to learn tf
hello can anyone help me how to output a binary in a KNN.predict value
Hi there, excuse me for posting this question here, but I wouldn't know what other channel to post it in. Is there anyone who can help me understand how an encoder, ROM decoder,PLA works?
I tried to read from many parts but I don't understand especially the OR matrix and AND matrix associated
how to keep theano cache file/folder names consistent over different systems?
If you know, a lot of the other concepts, you can just apply it in a project with using TF. That's what I would do.
Anybody has experience with nbdiff-web from nbdime? https://nbdime.readthedocs.io/en/latest/
I’m trying to generate diffs for multiple files and I can’t get it to work
hello, i have a doubt in tensorflow. when i use Conv2D with kernel_size=1, does it mean a 1x1 convolution?
Is it possible to replace nan with null in a dataframe and then still turn it into a dict after? I can totally replace nan with null using fillna, but I can't turn it into a dict after. Anyway to convert nan to null and then still be able to turn it into a dict after?
csv_df = pd.read_csv(path)
dict = csv_df.to_dict('records')
This is the answer
.replace([np.nan], [None])
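A minimal sketch of the same goal done after the conversion, assuming a small made-up frame (the `name` and `weight` columns are hypothetical). It swaps NaN for `None` inside the records themselves, so it does not depend on how a given pandas version handles `None` in `replace`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values in both a text and a numeric column.
csv_df = pd.DataFrame({"name": ["Dolly", np.nan], "weight": [72.5, np.nan]})

# Convert first, then replace every missing value with None in plain Python.
records = [
    {key: (None if pd.isna(value) else value) for key, value in row.items()}
    for row in csv_df.to_dict("records")
]
print(records)
```

This is slower than a vectorised `replace` on very large frames, but the result is ready for `json.dumps`, which writes `None` as `null`.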
Hi is plt.loglog(x,y) the same as plt.plot(np.log(x),np.log(y))?
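Both are matplotlib calls. plt.loglog(x, y) plots the original values on log-scaled axes, while plt.plot(np.log(x), np.log(y)) plots the natural-log-transformed values on linear axes
so the curve has the same shape, but the axis scales, ticks and labels are different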
Yes I believe so
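A small sketch to see the difference side by side, assuming matplotlib and NumPy are installed. The data is made up, and the `Agg` backend is chosen only so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.logspace(0, 3, 50)  # 1 .. 1000
y = x ** 2                 # a power law is a straight line in log-log space

fig, (ax_loglog, ax_logged) = plt.subplots(1, 2, figsize=(8, 3))
ax_loglog.loglog(x, y)                 # original values, log-scaled axes
ax_logged.plot(np.log(x), np.log(y))   # transformed values, linear axes

scales = (ax_loglog.get_xscale(), ax_logged.get_xscale())
print(scales)
```

Both panels show a straight line, but the left one is labelled in the original units while the right one is labelled in natural-log units.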
my neigh is the same as the bottom clf_KNN pic, just different variables
how do i make this predict value
to this
Im guessing you could get the index of the 1 and then do classes[index] to get the value of the class
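A minimal sketch of that idea, assuming the prediction comes back as one-hot rows and that the class labels are available in a separate array. The `classes` array here is hypothetical; for a fitted scikit-learn classifier the `classes_` attribute would play that role:

```python
import numpy as np

# Hypothetical class labels and one-hot style predictions (one row per sample).
classes = np.array(["cat", "dog", "horse"])
one_hot_pred = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
])

# argmax gives the position of the 1 in each row; use it to index the labels.
labels = classes[one_hot_pred.argmax(axis=1)]
print(labels)
```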
Hey does anyone have experience working with TF-IDF vectors, especially regarding using them along with non-NLP features? I managed to get it into a dataframe, however, it is too large to concat with the other features (2 additional columns of equal length). I'd love to hear how other people solved this!
Error thrown explains I need 5.4GiBs extra to process, however, i managed to navigate that error already once before while creating the Dataframe
I could perhaps do my test-train split early and do this part individually for both, cutting down the training set by ~20%
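One common way around the memory blow-up is to keep the TF-IDF matrix sparse and stack the extra columns onto it, rather than converting everything to a dense DataFrame. A minimal sketch, assuming SciPy is available; the random sparse matrix stands in for real TF-IDF output, and the sizes are made up:

```python
import numpy as np
from scipy import sparse

n_docs, vocab_size = 1000, 5000

# Stand-in for the output of a TF-IDF vectorizer (already a sparse matrix).
tfidf = sparse.random(n_docs, vocab_size, density=0.001, format="csr")

# Two hypothetical non-text features, one row per document.
extra = np.random.default_rng(0).random((n_docs, 2))

# Stack column-wise without ever building the dense (n_docs x vocab_size) array.
combined = sparse.hstack([tfidf, sparse.csr_matrix(extra)], format="csr")
print(combined.shape)
```

Many scikit-learn estimators accept a sparse matrix like `combined` directly, so the concat step in pandas can often be skipped entirely.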
Does anyone know why I get this error, my GPU is 8GB, not 6GB? (Windows, 3060 Ti, 8GB VRAM)
RuntimeError: CUDA out of memory. Tried to allocate 90.00 MiB (GPU 0; 8.00 GiB total capacity; 5.49 GiB already allocated; 0 bytes free; 7.60 GiB allowed; 5.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 497.29       Driver Version: 497.29       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0 Off |                  N/A |
|  0%   32C    P8     4W / 200W |    142MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Ooofles
I tried to follow a tutorial but I'm having weird CUDA memory issues or maybe classes idk
def train(dataloader: DataLoader, model: nn.Module, optimizer: optim.Optimizer, criterion: nn.Module, epoch: int):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()
    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        # error happens here
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(epoch, idx, len(dataloader), total_acc / total_count)
            )
            total_acc, total_count = 0, 0
            start_time = time.time()
I tried to follow the tut's steps with another dataset: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
And that one worked on both Colab and local
According to Stackoverflow it's happening because my label > n_labels which is kinda weird
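A quick way to test the "label > n_labels" theory before touching the GPU is to scan the labels on the CPU. A minimal sketch in plain Python; `out_of_range_labels` is a hypothetical helper, and the sample labels are made up. It assumes the loss expects class indices from 0 to `num_classes - 1`, which is how `nn.CrossEntropyLoss` treats integer targets:

```python
def out_of_range_labels(labels, num_classes):
    """Return the sorted distinct labels that fall outside [0, num_classes)."""
    return sorted({int(label) for label in labels if not 0 <= int(label) < num_classes})


# Hypothetical dataset whose raw labels start at 1 instead of 0.
raw_labels = [1, 2, 3, 4, 2, 1]
num_classes = 4

bad = out_of_range_labels(raw_labels, num_classes)
print(bad)

# Shifting the labels down by one brings them into the expected range.
shifted = [label - 1 for label in raw_labels]
still_bad = out_of_range_labels(shifted, num_classes)
print(still_bad)
```

If `bad` is not empty, the fix belongs in the label preprocessing, not in the batch size.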
This is what happened when I'm on CPU
Yes, it says 8 GB. However, PyTorch has already reserved some VRAM for your model and the batches you're sending to your GPU.
how do I fix it?
If you're in a notebook, the only way to get rid of this error is to restart your kernel. Then lower your batch size.
batch size?
Rinse and repeat till you find a suitable batch size
how?
When you configure a DataLoader, you have to pass in a batch size. It's the number of samples you feed into your model at each step.
it's just a plain python file
Ideally we aim for bigger batch sizes but due to GPU constraints we lower it.
Ok so no need to restart anything. Just change the batch size where you define your DataLoader
does it improve quality? (higher batch size)
I can make the result image smaller
but the quality seems to be the same from 128 to 400
Usually yes, your training goes faster
so it's speed, not quality?
If you're doing a CNN the only variables you can touch upon are the batch size and maybe the number of neurons in a convolution layer.
A mix of both I'd say
But a higher batch size means you do fewer iterations per epoch
on my GPU, it does iterations really fast, on my CPU, it has far more memory, but does slow iterations
didn't work
still get this:
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 8.00 GiB total capacity; 5.55 GiB already allocated; 0 bytes free; 7.60 GiB allowed; 5.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I doubt the batch size affects 2.5 GB
I have some problems. They should be simple, but they are not.
These questions require a certain expertise in order to provide any useful answers.
They involve the correct data normalization boundaries for certain kinds of data, variance calculations, bayesian algorithms, and audio sampling.
I could use some help.
ideally, i could use some help from a systems developer, perhaps a python or numPY project developer
send it in the server I’m sure we can all work together for an answer!
I've asked about it before. I'm reluctant to expend tendons.
https://github.com/falseywinchnet/fabada/blob/master/examples/streamclean_rx_buffer.py
take a look over this. Note where the fabada function enters and what is done initially. Note what kind of data I am working with.
This is a simple, linear, single file, open source project. It is not complex from that perspective. It is complex in terms of the computation, but we can hold off on that.
I've coded things like variance arbitrarily because I don't know what should work best.
One problem i have with it is that it "clicks" at the end and start of each sample frame. This is unrelated to the sampling or buffers, simply bypassing the noise reduction function and passing the data frame back to the output is proof of this.
So the problem is internal to how I have adapted this formula from another person, who is a co-author on it, to this particular application.
https://github.com/PabloMSanAla/fabada/blob/master/fabada/__init__.py this is their original work. I have made a lot of changes, but i have verified they return identical results in superior timeframes for all of the stuff i've separated out of the main function, like the chi2pdf calculation.
I could use some help optimizing what the constants should be for normalization and for variance calculation. I am positive these are the only things remaining that need some love and affection.
I am a newbie in machine learning
I use python and I am intermediate in python
But I need some projects in machine learning
Beginner projects?
How far are you into ml?
like, to study, or like, something to do?
just beginning
does anyone know how to use pandas read_parquet() and make it include the partition column in the table it returns?
there is something in spark I think where you can provide a 'base path' argument and it will adjust the schema, but I can't find anything for pandas
What is data science
What is life
Can anyone please provide some interesting resources on text summarization using k means clustering
I need help with PCA,
I am using pca on image array data
I got 268 components (out of 2304, image shape 48x48) explaining 95% of the variance
but my doubt is how can i plot 268 element as image... shape issue and all
So you want to probably mark off which part of the 48x48 accounts for the variance...what about color coding the 268 elements in the 48×48 grid . Most PCA examples use scatter plots but image data has its own built in coordinate system eg x and y and as such marking the pixels is a scatter plot too
i have no idea regarding the reconstruction of image with missing pixels
Guys
I already asked
I am a newbie in ml
are any beginner projects available to learn ml with python?
any websites?
https://machinelearningmastery.com/machine-learning-in-python-step-by-step/ nice intro into Scipy, training, algorithms etc.
https://towardsdatascience.com/image-compression-using-principal-component-analysis-pca-253f26740a9f
Maybe inverse transform can recover the image
pca_10 = PCA(n_components=10)
mnist_pca_10_reduced = pca_10.fit_transform(mnist)
mnist_pca_10_recovered = pca_10.inverse_transform(mnist_pca_10_reduced)
The inverse_transform() method of the pca object is used to decompress the reduced dataset back to 784 dimensions. This is very useful for visualizing the compressed image!
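To make the shapes concrete for the 48x48 case, here is a minimal NumPy-only sketch of what the reduce and recover steps do. It uses an SVD in place of scikit-learn's `PCA`, random data in place of real images, and a made-up component count of 50, so it illustrates the idea without claiming to reproduce the article's code:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((200, 48 * 48))   # 200 fake images, flattened to 2304 values each

# Centre the data and take the top-k principal directions from the SVD.
mean = images.mean(axis=0)
centered = images - mean
_, _, vt = np.linalg.svd(centered, full_matrices=False)

k = 50
components = vt[:k]                    # shape (k, 2304)

reduced = centered @ components.T      # shape (200, k): the compressed representation
recovered = reduced @ components + mean  # shape (200, 2304): back in pixel space

# Each recovered row has 2304 values again, so it reshapes to an image.
first_image = recovered[0].reshape(48, 48)
print(reduced.shape, recovered.shape, first_image.shape)
```

The reduced vector itself cannot be reshaped to 48x48; only the recovered row, which has 2304 values again, can be shown with something like `plt.imshow`.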
Yeah it can be helpful according to the article
@odd patio
How about generating 50 random x and y values with numpy and graphing them with matplotlib
Thank you 🤗
What are the major libraries I should learn for machine learning ? 🤔
Numpy, pandas, and csv are probably the first three you should learn. Then tensorflow
first, i'd suggest np pd and csv then scipy and tensor flow....
"learning libraries" isn't the way to go in the first place.
I have a message in the pins about what the major libraries are, but you shouldn't try to learn them in a box-check sort of way.
Where is the best place to ask for help cleaning up a CSV data source for use in pandas? Is there a special channel?
@lapis sequoia this one
I agree with this, start learning the concepts and the math behind it before getting into libraries. once you know what's being done in the backend, you can adapt that to any library by just reading documentation
The problem is the following: the values that should go in the last column are preceded by 3 empty columns in the source CSV. This seems to shift the headers 4 places to the right when parsing it with pandas' read_csv function. How can I resolve that? I can't modify the data manually since it's coming from a remote server and changes frequently.
so the 12, 24, and 12 should be in the "casecount" column?
Yes
And the id header should be above the first column (that contains 7, 32 etc.) in the 2nd screenshot
I suspect it has something to do with how the new_subcategory column is being parsed. In the CSV, does each value for new_subcategory have quotes around them?
Only when it contains spaces, not for single word categories
I'm not sure what to do, unfortunately
if I had a copy of an inputted CSV, I might be able to figure it out, but I see that you can't share that.
Okay, thanks. I just censored it because it contains adult content/words. I could share it
I'll allow it.
just drag/drop the file into this chat.
Well maybe not publicly share on such a big server lol
Don't want to get into hot water, since it's client work
Ping me if you decide that you're willing to upload it. Otherwise I'll have to move on to something else.
The only other thing I can think to suggest is that you try changing the delimiter for read_csv
Can't share publicly without modifying it, so that would defeat the purpose. Delimiter is ; and I specified that
The first numeric value here is the new_subcatid from the screenshot: 32869;Dildos;;;;;12 and the last one is casecount
So there are just too many delimiters in the data source
Yes otherwise it can become mechanical. Maybe this checklist thinking is driven by the job market where they employ checklist driven hiring
who are you working for? 
I guess just do df['casecount'] = df['???'] to correct it
if there are cases where the casecount value is actually in the casecount column, you could use fillna instead. assuming that missing values in the column are NaN
Maybe I can get you an employee discount 😉 jk lol. It's just an adult drop shipping business.
casecount is in the wrong column on all rows
= df['???'] is that actual syntax?
@lapis sequoia if you're talking about the string, no, that's there because idk what the name of the column the values ended up in is.
@lapis sequoia I'm not completely sure that this would work but try making a df that reads headers only, then making a df that doesnt read the headers (skip the first row to make sure that the headers don't get counted as data), then you can drop the empty rows from the data df and assign the columns from header_df to data_df.columns
Thanks I'll give that a try. I already spent hours on this, one thing I tried was dropping the last 4 columns including the casecount column (since it's not important) and re-assigning the headers. For some reason that didn't work out
it would be something like this
header_df = pd.read_csv(filename, nrows=0, delimiter=';') # nrows 0 so that it only reads headers
data_df = pd.read_csv(filename, header=None, skiprows=1, delimiter=';')
data_df.dropna(axis=1, inplace=True)
data_df.columns = header_df.columns
I created the headers list manually for now and it works, next step would be to try your solution. Thanks! (the settings var is just coming from a loaded in config file)
df = pd.read_csv(
    settings['pandas_csv_options']['source_file'],
    sep=settings['pandas_csv_options']['separator'],
    header=None,
    skiprows=1,
    names=headers
)
i wouldn't think that assigning the headers directly in the read_csv would work since you would have to drop the empty columns before assigning the headers, but if it works then it's good anyways
You got me there, I have only tried it with a shortened CSV so far, that doesn't contain the problematic last columns
I could use a little help. I was trying to make an environment to practice on the MNIST dataset, but I was getting an error thrown trying to make the environment. Online, I found you can solve the issue by uninstalling and reinstalling Anaconda, so I did that, but now Anaconda installs without python or the powershell prompt, idk why.
Any time you get an error message, and you want help with that error message, always immediately show the error message.
i don't want help with that error message, that's why i didn't include it
i don't even know what that error message is anymore
what OS are you on?
I'm asking for help with why my anaconda is installing without python or the powershell, I am on Windows 10
and what library/ies are you trying to use to experiment with the MNIST dataset?
I just need keras to import the dataset, I was going to go about getting the dataset another way but I found a way to import the dataset through keras
besides that I don't know off the top of my head
I would avoid Anaconda entirely unless you're sure it's the best way to install one of your dependencies. It should be simple to install tensorflow (which contains Keras) without Anaconda.
but what if I am wanting to install Anaconda
Okay so that almost worked after adding how='all' to dropna BUT there's one additional completely empty column in the CSV that also gets dropped, so I end up with 59 columns and 60 headers
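One way around the "59 columns and 60 headers" mismatch is to select columns by position instead of dropping empty ones, so a genuinely empty column survives. A minimal sketch under stated assumptions: the sample text is made up, the `notes` column is hypothetical, the padding always sits right before the last field, and the header line has no quoted separators:

```python
import io

import pandas as pd

# Made-up sample: 4 headers, but every data row carries 3 extra empty fields
# before the last value. "notes" is a real column that happens to be empty.
raw = (
    "new_subcatid;new_subcategory;notes;casecount\n"
    "7;Widgets;;;;;12\n"
    "32;Gadgets;;;;;24\n"
    "45;Gizmos;;;;;12\n"
)

header = raw.splitlines()[0].split(";")

# Read the data rows without a header so nothing gets shifted.
data = pd.read_csv(io.StringIO(raw), sep=";", header=None, skiprows=1)

# Keep the leading columns by position, then append the last field as casecount.
n_lead = len(header) - 1
fixed = pd.concat([data.iloc[:, :n_lead], data.iloc[:, -1]], axis=1)
fixed.columns = header
print(fixed)
```

Because nothing is dropped based on content, the empty `notes` column keeps its place and the header count always matches.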
Why do you want to install Anaconda if it's giving you issues and you can accomplish your end goal without it?
Because I want to, I would like to figure out what is going wrong for the future instead of never fixing it
I would like to ask another question unrelated to anaconda if that's ok 😛
then shoot
number_of_breeds = df.apply(pd.value_counts)
i have this line of code
which returns
Crossbred Canine/dog 164
Retriever - Labrador 136
Domestic Shorthair 63
Retriever - Golden 59
Dog (unknown) 56
... ...
[Retriever - Labrador, Catahoula Leopard Dog] 1
[Cattle Dog - Australian (blue heeler, red heel... 1
Spaniel - Tibetan 1
[Retriever - Labrador, Deutsche Dogge, Great Dane] 1
Mixed (Horse)
please show df with print(df.head().to_dict('list'))
oh
Please ping me when you have provided this.
In case you don't plan to provide that, refer to this:
!docs pandas.DataFrame.nunique
DataFrame.nunique(axis=0, dropna=True)
Count number of distinct elements in specified axis.
Return Series with number of distinct elements. Can ignore NaN values.
{0: [138, 126, 81, 71, 62]}
this one?
i did it this way
number_of_breeds.head().to_dict('list')
it returns this :
{0: ['Retriever - Labrador', 'Shih Tzu', 'Hound - Basset', 'Alaskan Malamute', 'Shepherd Dog - German']}
so, if you want the number of unique values in the 0 column, it's just df[0].nunique()
i just want to plot it
using dash
and i am not sure how to access each column i guess?
hmm, I'm not sure about that.
can't happen?
there's only one column?
if there's secretly a dataframe of interest with multiple columns, I haven't seen that one yet.
wait one moment, let's say we use the .value_counts method
it returns the thing above
can we somehow access each column
so we can plot it?
is that even 2 columns?
well, in that case, you should just use df[0].value_counts(), without using .apply
gr8 advice getting into it now
you'd only have one value (a frequency) for each item (a breed)
if you want to plot it in two dimensions, you probably need a second value.
otherwise you're not demonstrating any kind of relationship
I think we're mixing up terms here
in my usage, the frequency is a value
it sounds like you're using value to refer to an item (ie a dog breed)
Crossbred Canine/dog 164
on Y axis
it should be 164
and on the X the "Crossbred Canine"
names of breeds aren't numbers
so any way that you order them on the y axis would be arbitrary
they have to be?
what library are you using to create the figure?
Dash
is plotly part of dash?
i think so
one moment
it looks like it is
import plotly.express as px
df = px.data.gapminder().query("continent == 'Europe' and year == 2007 and pop > 2.e6")
fig = px.bar(df, y='pop', x='country', text='pop')
fig.show()
This code from the docs creates a bar chart where the labels are country names (and thus don't have a numerical value)
I'm not sure that their df variable is actually a pandas DataFrame though, or pandas DFs are compatible in this context.
They are compatible, no worries about that
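To get from `value_counts()` to something a bar chart can use, the counts can be turned into a two-column frame: breed names for x, frequencies for y. A minimal sketch with made-up breeds; the `breed` and `count` column names are assumptions, and the plotly call is left as a comment so the snippet runs without plotly installed:

```python
import pandas as pd

# Hypothetical single-column frame, like the one shown in the chat.
df = pd.DataFrame({0: [
    "Retriever - Labrador", "Shih Tzu", "Retriever - Labrador",
    "Hound - Basset", "Retriever - Labrador", "Shih Tzu",
]})

# value_counts() returns a Series: breed names in the index, counts as values.
counts = df[0].value_counts()

# Build an explicit two-column frame so each axis has a named column.
plot_df = pd.DataFrame({"breed": counts.index, "count": counts.to_numpy()})
print(plot_df)

# With plotly installed, this frame could be passed along the lines of:
# fig = px.bar(plot_df, x="breed", y="count")
```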
yay!
the issue remains tho :<
this one returns
number_of_breeds = df[0](pd.value_counts())
TypeError: value_counts() missing 1 required positional argument: 'values'
oh wait
ignore that
ignore what
the code you gave me above
it doesn't return what I wrote
it returns this
C 305
D 281
R 236
S 230
T 160
...
Pug 1
Great Pyrenees 1
Collie - Border 1
Schnauzer - Miniature 1
Shepherd Dog (unspecified) 1
which is kind of weird, but
In [1]: df = pd.DataFrame({0: ['Retriever - Labrador', 'Shih Tzu', 'Hound - Basset', 'Alaskan Malamute', 'Shepherd Dog - German']})
In [2]: df
Out[2]:
0
0 Retriever - Labrador
1 Shih Tzu
2 Hound - Basset
3 Alaskan Malamute
4 Shepherd Dog - German
In [3]: df[0].value_counts()
Out[3]:
Retriever - Labrador 1
Shih Tzu 1
Hound - Basset 1
Alaskan Malamute 1
Shepherd Dog - German 1
Name: 0, dtype: int64
It worked when I did it.
ye that's probably because that's not the whole data
one sec
and that's what it returns for me
Retriever - Labrador 158
Crossbred Canine/dog 154
Domestic Shorthair 72
Dog (unknown) 62
Domestic (unspecified) 52
...
[Crossbred Canine/dog, Collie - Border] 1
[Shepherd Dog - German, Retriever - Golden] 1
Horse (other) 1
Goat (unknown) 1
Coonhound (unspecified) 1
ok ok take a look
Well I did it, by typing out 62 headers manually, including headers for the empty columns. Work hard not smart. Actually not completely wasted time since I maybe wanted to rename the headers anyway. But pandas is awesome for what I'm trying to do. I hate thinking about doing that kind of data manipulation in NodeJS
yeah pandas is pretty awesome
although you should probably tell your client that they should fix their remote server data lol
it would likely be a lot easier to fix it on that end
The data is actually coming from a big wholesaler which claims that they serve 1000s of online stores. The data is trash tho. They even have two columns with the exact same name in the CSV. Could still be a problem on my end somehow, who knows
I need a team of people to help me on a secret project. This seemed like a good place to look for some. DM me if your curiosity is piqued and you can be trusted.
I'd give you guys more info if I thought it would be morally fine to do so
Any advice on working with large data in python? Currently (I believe) python is creating temporary stores for training and test data which then overload my ram before I get to putting it into the model
Or is 16GB simply a bit too small to work with
Online people have suggested using dictionaries, however, Im not quite sure how that can help
how much RAM do you have and what are you trying to do?
16GB and im just running a NLP with both text and non-text features
what algorithm? what library?
how big is your dataset?
800,000,000 values or so
get a lack of memory error when trying to put it into naive bayes
what Spacecraft wants to know is how much memory the dataset takes. Also, I still need to know what algorithm you're using.
I assume you're using one of these? https://scikit-learn.org/stable/modules/naive_bayes.html
There's a partial_fit method that you can use to train the model in batches
yeah naive bayes through sklearn thats the one
yeah i think that sounds like a good approach
thank you
yw
is there a general best practice when using large data?
that essentially is the best practice, load/fit incrementally so that you don't have to have all the data in memory at once
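A minimal sketch of the incremental approach with scikit-learn's `MultinomialNB.partial_fit`. It assumes scikit-learn is installed; the data is random, and the batches are sliced from an in-memory array purely for illustration, whereas in practice each batch would be loaded from disk:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1000, 20))   # fake count features
y = rng.integers(0, 2, size=1000)         # fake binary labels

# partial_fit must be told every possible class on the first call.
classes = np.unique(y)
model = MultinomialNB()

batch_size = 100
for start in range(0, len(X), batch_size):
    stop = start + batch_size
    model.partial_fit(X[start:stop], y[start:stop], classes=classes)

predictions = model.predict(X[:5])
print(predictions)
```

Only one batch needs to be in memory at a time, which is what keeps the peak RAM usage bounded.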
Basically what Spacecraft just said. Though every institution I've worked at has had one or more high-performance computers so that it isn't necessary.
yeah in most cases theres a balance between how much memory the computer should have and how much data you need to use, since loading it in batches also makes the training quite a bit slower
most people write code on their personal computer and then have it run on a remote, more powerful computer that can run it more easily
ah yeah that makes perfect sense, thanks again
Happy New Year guys 😀
Whoo, new year.
I'm workin' on portfolio fodder, and I keep seeing a lot of love for Plotly. When I was workin' with it a while back you needed like your own local server for it and it was kind'a gross. Has it gotten better? What viz library are you all into? (I've been big into Altair, but will default to matplotlib if I just wanna see something quick.)
Hi I am currently working on a Kaggle project and I'd like to understand why they used y = 'twp' to count the number of calls made per month on line 22. Shouldn't it be based off of timeStamp?
https://www.kaggle.com/vahidehdashti/911-call-data-eda
Depends on the use case. I like seaborn, matplotlib, lux and pandas+matplotlib.pyplot. I also know that bokeh is loved by many out there, but I haven’t used it very much.
I also use streamlit a lot, it’s primarily for rapid prototyping of web apps in data science, and it has the most basic visualizations, which is easy to use 🙂
I've tried bokeh before and it's pretty good! I've not heard of streamlit or lux before, so I'll check those out.
Oh, dang, Streamlit looks really cool.
Streamlit is really cool and powerful. You can write a simple web app in 2 min 🎉
oh wow streamlit does look really cool
can it produce interactive dashboards that are anything like the ones you can create in BI or Tableau?
So I take it no one cares then
I don't DM anyone for "secret projects". Very few people are going to put themselves out and express interest in a project where they don't know the person at all and they don't know the project at all.
Perhaps you could give some hints as to what the project is, what you expect us to do, and what your experience is? That might make more people interested.
hmm.... let me think... I have to figure out what I can tell you...
I'm making an AGI. sort of.
Her name is AIRI.
It's not that it HAS to be secret, it's just that I'm worried too many people, and maybe even the wrong people, will know too much about it. Most projects that were publicly available that I did were copied before I could patent. Among other things.
And then what're you looking for in a team? And what would your role be?
I'm looking for people with experience in machine learning, know python, maybe even a little bit of SQL.
What do you mean by my role?
Like, what would you be doing in the team.
Essentially, you'd look through the list of things needed to be done, and if you have a moment where you're like, "Ah yes I know how to do that" then you'd do that. So you'd help out where you could.
So, you'd be the project manager? Or main coder?
I'd be project manager, but I'd also help where I could.
(Just getting all this stuff out so people here might be more interested in working on it.)
Yah maybe. Ima be honest with you: it's not like hierarchy level organized. It's more of a thing where it's like you have nothing to do so you pitch in your 2 cents and if it contributes then yay
That's fine, lot'sa projects work like that. People here will prob be more interested in asking to work for it if they knew a little about what was gonna be involved, that's why I wanted to ask you all that stuff above. More likely to get interested peepz.
hmm what might make peeps interested....
Oh, I just meant the stuff you said already was fine. It's unlikely to get people if it's just like "hey dm me for a secret thing." but if you explain like you did above, it's more likely people will be like, "Okay, I can get into this."
No problemo, I can't promise anyone will be into it, but I hope that it piques someone's interest now!
OH! I know something that might interest a few weebs peeps: It involves an anime girl
all the best things involve anime girls.
hehe lmao
it's also against our rules to recruit for "secret projects".
really? where?
I suppose it doesn't fall under any specific rule, but we only allow people to recruit for open-source projects, because if the project isn't completely transparent, we can't verify that it isn't a business venture, or that it's ethical, or that contributors will get any value for their contribution, etc.
I literally banned a dude for trying to sell the project, so no not a business venture. I personally believe that if treated right it's perfectly ethical. I mean, this isn't going to be public even when it gets done, so even if we did give them credit (which we probably will anyways in a book I'm writing about it), they won't exactly become famous or anything.
I'm not sure what you're referring to (I haven't read all the way back in the conversation), but in either case, we're not going to budge on the requirement that all recruited-for projects be open-source.
ah, this
If you're not willing to open-source the project, please permanently stop recruiting for it. Thanks!
What about an ambassadorial agreement?
I don't know what that is.
Essentially, we'd invite a highly trusted member of the server to the project. Then, they'd look it up and down through and through and if they think it's ok, they'd come back and say it's ok verifying it.
There is no way we'll allow you to recruit for a closed-source project. If you have any further questions or comments about this, please contact us via @sonic vapor.
No.
great. big help.
If you're gonna get mad at people for doing it, you should probably put it in #rules.
I'll bring that up with the other moderators.
Yeah i suggest building a relationship first to build trust. I have been approached in LinkedIn and burned so be careful
yea I know I just don't know many coders that actually know enough to help instead of hurt
I get why the rule is in place: it's easy to exploit workers from this area. Same deal in a lot of the game dev discords. Makes sense.
Though I agree, if I can't point to a rule (or see a rule) it's hard to deter people from doing it. So, thank you for bringing it up with the other mods!
anyone familiar with dynamic mode decomposition?
trying to do a little project and im getting a bit stuck
Guys can you help me out, I am in VSCode right now. I inserted a .dot file and installed the Graphviz extension for it to run, but when I press on the three dots in the top right corner in VSCode there should be an option saying ‘Open Preview to the side’ and it’s not showing. Please help me out, I’ve been searching for an hour
it’s supposed to show a visual decision tree
I usually use Pycharm so i am not familiar with VSCode
sounds like an #editors-ides question.
Does anyone have any tips on how to stop class imbalance on a Tensorflow model? I've looked at the Tensorflow guide using Credit Card Fraud Detection though my model has a few more classes (24,000 more to be exact)
it just outputs "yes" which is likely to be correct around 25% of the time
what do you mean by "class imbalance"? this is usually a statement about the training data, not the predictions of a model.
I have a majority class of 25% of all of my training data inputs which my model is unable to circumvent and instead just uses the "most common" answer when making future predictions
am I to understand that you're training a binary classifier where each item can either be "fraud" or "not fraud"?
it's the way it trains
No, I was a bit misleading by using the credit card fraud data I apologise, I'm using 24,000 different classes so I'm trying to use a Categorical Cross Entropy Loss
I just mentioned that because tensorflows only actual guide on class imbalance uses binary classifiers
well I have 248,000 bits of data to use
24,000 is a huge over-use anyway, but I wanted to include a larger vocabulary (output is one-hot)
by "248,000 bits of data" do you actually mean 248,000 training instances? because "bits" are a specific thing.
yeah i just didnt know how better to say
I have 248,000 triplets of input1, input2, output
I see. And what algorithm are you using to classify them?
because I've never heard of any classifier having to learn 24k different classes.
I use a Categorical Cross Entropy for loss, an Adam for Optimizing and output as a softmax
so it's a neural network of some kind?
the output is 24,000 large so that the softmax gives me a vector that I can turn into a one-hot encoded vector
yeah it's an MLP
MLP?
like I said, I've never heard of a classifier for that many classes, NN or otherwise
fair enough, thanks anyways
hey I had a ques. So if you have a column of a bunch of tokens and then a list of words (outside the data frame). How can you check if a row contains any of the words?
this is the token from the df
and I have a list of words.. whose elements I wanna match with
I assume this is a column of a DataFrame (i.e., a Series). About the only built-in operation pandas has specifically for a Series of lists is .explode(); anything else means applying a Python function to each row.
If you want to check if a list contains at least one of a certain number of values, the best way is to put the values of interest in a set
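A minimal sketch of the set idea, assuming the column is called `tokens` and each row holds a list of strings. The column name, the data, and the word list are all made up for illustration:

```python
import pandas as pd

# Hypothetical data: a column of token lists, plus a word list kept outside the frame.
df = pd.DataFrame({"tokens": [["the", "cat", "sat"], ["hello", "world"], []]})
words_of_interest = {"cat", "dog"}  # a set makes each membership check cheap

# isdisjoint() is True when the two collections share no element,
# so negating it flags rows that contain at least one word of interest.
df["has_match"] = df["tokens"].apply(
    lambda toks: not words_of_interest.isdisjoint(toks)
)
print(df)
```

This still runs a Python function per row, which is usually fine for a few hundred thousand rows; exploding the column first is another option if the matching words themselves are needed.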
Is the topic of data science in general much harder to learn than web dev?
Depends what you're in to, passion is a large factor in learning these sorts of things and a mix of both probably won't even hurt that much (there's a few websites that use many Data Science applications)
24k classes? That is, by magnitudes, more than I've ever seen, haha.
What the heck are these classes where you need so many?
words
a lot of the classes I'm making are pointless because they're only used once in comparison to a majority class that's used for 25% of all answers
(1 class = 25% of all answers, 2 classes = 38% of all answers, 12,998 classes make up remaining 62%)
^ pain
Dang, maybe it does require that, I don't do a lot of CV/NLP stuff. That still seems really, really high to me, and seems like, if it's that imbalanced, you could ensemble on another model which does the "Other" thing for labels who are predicted less than some number of times.
Yeah i was planning on learning both to see which one id like more, but i wanted to go through the hardest one first lol i guess you’re right depends on the passion too
yeah I propose an idea like that in the documentation I'm making for it, though was hoping I'd be able to complete my own program that allows processing of it
Also, if it's predicting one label, what label is that where it's so popular?
I managed to get it to consider different classes but I think it worked only because of a Tensorflow Seed giving me a good value for training
every other training hasn't worked
so my training is non deterministic
You don't even need another model, you could just post-process and send everything under a certain frequency to an "Other" class.
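A rough sketch of that post-processing step, assuming the labels are plain strings. The function name `lump_rare_labels`, the threshold, and the sample labels are hypothetical:

```python
from collections import Counter

def lump_rare_labels(labels, min_count, other="Other"):
    """Replace every label seen fewer than min_count times with a catch-all label."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]

# Made-up answers: "yes" and "no" are frequent, the rest appear once.
labels = ["yes", "no", "yes", "2", "red", "yes", "no"]
lumped = lump_rare_labels(labels, min_count=2)
print(lumped)
```

The same function could be applied to the training labels before one-hot encoding, which would also shrink the output layer, or to the predictions afterwards.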
although now it's deterministic because it only outputs one thing
the popular label is "yes"
Second most popular is "no", I guess?
Okay, and then what's an example of another one that's sort'a popular?
something around that lol
the next most popular one is about 1000 (as opposed to no which is ~40,000)
think thats "2"
Yeah, Data Science is probably harder but if you plan to just put it on websites you may as well learn websites first, imo DS is as hard as you let it be, which in my case is painfully hard
Ha. Okay, shot in the dark here, but what about this: an initial model to say if a question has a "Yes/No" answer --- which will feed into another model that gets that yes/no answer --- and the rest of the data goes to yet another model that excludes the yes/no.
If only to see the distribution of the other answers, and how to group those.
ah probably dont have time for something that advanced
I wanna get something out of the way soon and then come back to it in the future
It's all good, just thinking aloud.
upload an "alpha"
but yeah that is a good idea haha
my whole idea was to focus on the question more than the image so a culmination of that would use that yes/no thing
Depending on what the DS job is, it might require some heavy maths --- which is, for some, hard to learn and might take around a year or more to learn well. That's why I think that path for DS might be a bit harder than web dev, since you've got to have the math AND coding background. But as you note, it depends strongly on what you wanna do with it.
If someone just wants to plug-and-chug things into sklearn models, prob doesn't need to know a whole lot of linear algebra / stats. Same for making simple webpages.
TBH, TensorFlow has done all the maths for me, some of the only complex stuff I've been doing is figuring out input shapes/sizes, TF offers a pretty easy and basic package for Data Science that you can use Flask with to create web-apps with a TF (Data Science) model running in the server
This guy does a pretty decent flask VQA model with semi-clear walkthrough
he has a supporting webpage writeup on it too that explains it further
though VQA probably isn't for first timers in Data Science as I found out x-x
There are some people out there that can do DS without knowing much of the maths, but in terms of careers I've not seen too many DS people without a good grasp of the math after entry-level. [My biases here: live in the US, work mostly in small-to-mid startups, mostly timeseries + fin + IIoT work.]
Guys which is the best free source on internet to learn machine learning and data analysis
Thanks ill check it out! TensorFlow sounds interesting ill have to learn that for sure
they have lots of online tutorials on their official website
definitely worth looking at
What type of tasks do you usually do at work as a data scientist
tensorflow is used to build models?
yeah
and then the models can be used to predict things
and as such, a flask application can provide a front end to the model
damn that's really interesting! are you a data scientist as well?
I wish, I'm nothing special but trying to create a VQA model
nice I am trying to learn data science by myself and see if I could land a job later on, but if not I'd back to school to get a degree to solidify my credentials
Yeah, hopefully you'll manage, lots of current people in the field are all self taught and lots of places offer apprenticeships (best of both worlds)
It can vary by the type of project, but I'd say it's like, 40% pipeline work, 50% EDA + making dashboards + talking to SMEs + going back and forth with the data-holders, and then like 10% research. Depends also on the DS person; I really dig the pipelining and devops side of things, so I usually will work with those teams, hence the more pipelining. Some colleagues will do more research, some more eda, etc., as the project requires. In a previous job it was more like 20% eda, 30% pipelining, and 50% building in-house engines.
I went to uni and paid £40k to wait for my supervisor to email back, 13 months on still waiting
It's not hard to get a job in the DS field --- since the title is pretty vague and can be anything from data analysis to data engineering to data science --- but it's hard to get a good job in DS where you're actually doing a lot of EDA + research, which is usually what people think of when they think of going into DS.
I see that's what I was thinking too when I was on kaggle, most of the works seems to be lots of research + EDA. Doesn't data analysis or data engineering need different type of skills though?
I am going through a Udemy course on data science right now to get a feel, but I honestly don't know what to do afterwards. Someone told me to look into hackathons or build my own projects through kaggle. What do you think?
It's also the case that, at least in the US, there is a huge, huge, huge market saturation for DS at the entry level. We had an entry-level position open up at my previous company [travel company w/ a big ds focus] and we had 1.2k people apply to it. But, after parsing it down, we had around 200 people who actually were in any way qualified for it, something like 50 or so who could pass the hackercode challenge, I think 12 made it past the data analysis challenge (logistic regression, classifying problem), and we interviewed all of them. I think there were six good ones from that.
This is not to intimidate if y'all are going into entry level, but to note not to worry too much if people say, "Oh, we got 400 applicants!" because the vast majority are very much not good.
DA + DE definitely require other skillsets, and you'll be able to point them out on resumes. Like, "Oh, this is data science, but I'm gonna mostly be doing DE tasks."
Damn lol what would you recommend someone to learn to put on their resume to even be qualified in the beginning?
I'd say that, for better or worse, the research + eda things are mostly given to the math-strong people, since, in my field at least, there's no way to just plop something into a NN and be done with it. You need to be able to explain results. So it's a lot of feature engineering + time series analysis, and that requires some pretty hefty stats know-how.
Note again that this is ONLY in my experience, and my experience is only a few small- to mid-sized startups in a city in the US.
I'd say that a common thread in who we hired and considered was:
- Knowledge of Python/R (or some other language, we weren't too picky, and you can pick up either of those pretty quick if you know another).
- Knowledge of SQL and roughly how DBs work and how to get data and clean data.
- Knowledge of general ML methods and general supervised and unsupervised algorithms.
That's the bare minimum, IMO, but others may disagree. And it def depends on the industry. In a Computer Vision job, this would not be nearly enough. In a business job, it might not even need the first or third.
Thanks
Yeah I like hearing each individual's experience haha I graduated in mechanical engineering recently but I want to switch so maybe the math part should be a bit easier for me since I can recycle what I've learned to help me out.
If you're starting out, I think focusing on Python / R (or some other language) is gonna be the best bang for one's buck. If the DS isn't working out, you can fall back on software engineering for a bit.
Yeah, ME will be totally fine as a degree. We've had people who are even like, history or english or whatever majors, that's fine --- so long as you knew the math and could demonstrate it, it was all good.
Thanks so it all depends on which role you want to fill in later on to know what to learn right now
Or, roughly, what type of role you think you would want. You could always change. But if your passion is computer vision, or if your passion is self-driving cars, you might wanna look at those job descriptions and see what they want and build towards that.
That's exactly what I was thinking haha I just finished a Python course and just finishing learning how to use Numpy, Pandas and Seaborn. Moving onto ML soon in my Udemy course
Good, numpy + pandas are the backbone for this stuff --- but remember also that writing scripts in python is also useful, so that's good to touch on too.
Like, "Vanilla Python" writing.
Gotcha I'd have to review that as well. Is OOP important in data science?
It was kind of a hard concept for me to grasp in the beginning but i think i got the hang of it now
That's gonna strongly depend on where you go. My last place, it was very important. My second-to-last place, not at all.
I'd say keep trying with it, if only to strengthen your understanding of languages, design patterns, and architecture in general.
Yeah there are still lots to learn for me haha
For example, it may be the case that you will have to work with something like Airflow or PySpark or whatever, and you'll need to know how to maneuver around in those fairly-different-looking spheres. Knowing general python / general programming will help immensely, IMO.
Id recommend just using one or two tensorflow tutorials. Some of what they teach you need to have a eureka moment to understand as they don't always have comments on every line, but they're rather high level
Image captioning with visual attention or something is really good imo
But very complex
I know it's wildly unpopular to say this here, but I always strongly recommend against doing NN stuff at first when learning DS (unless that's a popular tool in the area you're going into). Having said that, it may not be bad to look at to see if something clicks.
You'd have to learn about custom encoder, decoder, attention and training functions/classes, which is super confusing to go straight into
I'm not sure what else there is in ds but I trust this man so yeah do that hahaha
Even though everyone in this room seems to LOVE NNs, I've rarely had to use them for actual day-to-day work. Additionally, they're mostly their own API, so knowing them does not really translate to knowing Python better.
All I can think of besides nn is ensuring you understand the task at hand. If you can't figure out what you want to do it won't matter what you try. Understanding datasets is a huge help too. Always make sure you preprocess it into a form understandable to you
Making a model in general (linear reg, logistic, trees, whatever) is also pretty much just a two-line ordeal to fit, but it requires the coder to do more data cleaning and feature engineering, which do translate to learning Python better.
Anyways its 4.45am and I haven't been away from a screen for 17 hours today (i just cannot figure out my model haha)
So I must go to bed, gn lads/ladesses/ladems
Yeah, I think that preprocessing is also a fine way to go, if you're gonna focus on NNs. That's v important to learn.
Gn!
As an alternative to NNs, I'd recommend some of the more basic Sklearn tutorials and focus on things like logistic regression, linear regression, tree models (random forests, boosted trees, etc.). I feel like this stuff is easier to "look into" and see what's happening, whereas NNs are often a big bulky black box.
This'll prob be covered in your ML class, though.
Yeah I honestly don't know what my focus will be at this moment! I'll take what you said into consideration for sure. I still have a long way to go lol self driving car is really interesting now that you've said it haha i'd have to do more research. Yeah I see these topics are included in my ML class. Hopefully I can start working on a project by myself after this class
GN
Haha, don't worry about it too much. I'd focus on getting through the classes, seeing where you are, and keeping on codin' things (even non-ML/DS things!) in Python or R. That'll put you in a good spot.
Thanks man I really appreciate you taking the time to talk to me!
from me, learn statistics and programming first, and then spend a couple weeks plunking around with deep learning
or plunk around with deep learning while you cool off between stats theory sessions 😉
Yeah, I can't imagine going into deep learning without a solid foundation. I still get confused by the stuff and I've been doing this nonsense for years, haha.
i do actually think it's a useful toolset nowadays, if only because data scientists (in the "good" jobs) tend to be very close to management still, and need to make informed decisions about what tools and methods to use
data science is already too big of a field for any one person to be an expert in all of it. but much like programming, you would do well to develop "T-shaped" knowledge
broad but cursory knowledge of many things, deep knowledge of one thing
and be good at math. it will just make you faster and more efficient and better at everything data-related
Yeah, I'm not sure how much time beginners should spend with it, but certainly after reaching mid-level or so, one should know their way around some of the more useful NNs.
right
heck even within traditional stats you can probably never know even close to all of it
I feel like I am strongly biased against them in places like this because I've had to interview so many entry-level / intern DS people that ONLY know how to set up a TF thing or have done a tutorial on dog-recognition or something.
Oh, yeah, pretty much undergrad-level stats is usually good enough. Rarely do I need anything more than that!
NNs are def popular because of the cool things they can do, but it's one of those things where it's like --- yeah, you might be a carpenter who can REALLY use a saw well, but if you can't use nails or hammers or measuring tapes or... then you're basically going to be useless on a construction team.
what kind of organization do you work for, if you don't mind me asking?
I've got a v specific gig that'll prob dox me, haha, but I'll say in general I work around the Industrial IoT space. I've worked in travel and medical as well.
Hence my emphasis on methods with explainability and my love for time series analysis junk, haha.
What're you in, if you don't mind my asking?
fair enough
i am currently a software engineer, taking a break from data science basically
although very much wanting to get back into DS
That's great, though. I feel like there are so many DS projects that can benefit from SE knowledge and best practices.
i was at a large P&C insurance company before this, but was having a difficult time for a variety of reasons, so i burned out and quit
yeah it seems to be a desirable skillset
Ha! I was in lending at one point, that nonsense is so easy to burn-out in. I definitely get it.
in hindsight i was probably a productivity multiplier
not that useful on my own, but made everyone around me more productive by being able to pick up all their programming slack
Which def could lead to that burnout.
for sure, i was also a very susceptible person at the time
i'm very lucky i was able to quit (thanks covid!) and reset
I did a very similar thing, it was super good. Gives time to learn new stuff and reposition.
plus now i have a spouse who keeps me from becoming a degenerate again
yep, i'm going through this now. i am self-studying some stuff that otherwise i'd never have had time to study
my only fear when i quit was that the industry would pass me by, and my skills would atrophy, and i'd be unemployable in DS
that seems to not be happening, so it does seem good
when did you take your break / how long?
Prob around half-a-year or eight months, and I felt the same way.
Lots of new stuff out, and new stuff in Python to get good at, but otherwise it looks pretty much the same. PySpark looks to have gotten a bit better, too, haha, but I haven't started on that yet.
heh yep i noticed the same about pyspark
Lol I have heard of T shape too
Issue is they go for the sexy new thing rather than foundational math or stats
Yeah, that's to be expected, I guess. :'] It's all bright and shiny, v alluring.
You can blame youtube tutorials.. ask them if they can go beyond mechanical coding and come up with new algo and new solutions
Lol been in the burned out camp but recovered
Yes Covid has been great for some of us lol
Doing same
Speaking of learning new things, someone mentioned Streamlit to me today in here and I started using it --- it's pretty fantastic, I really dig it.
hello, this is my brute-force linear regression code
#imports
import matplotlib.pyplot as plt
import numpy as np

def euclidean_distance_calc(y, data):
    """Sum the signed errors (predicted y minus observed y); not a true euclidean distance"""
    euclidean_l = []
    data_y = data[:, 1]
    for i in range(len(data_y)):
        e = y[i] - data_y[i]
        euclidean_l.append(e)
    score = np.sum(np.array(euclidean_l))
    return score

def ploting(x, y, data, Loss):
    """This function plots one candidate line together with the data points"""
    color = '#1C2833'
    plt.plot(x, y, label=f"Score is {Loss}")
    plt.xlabel('x', color=color)
    plt.ylabel('y', color=color)
    plt.scatter(data[:, 0], data[:, 1])
    plt.legend(loc='upper left')
    plt.axis("equal")
    plt.grid()

def main():
    #Variables
    L = 100  # number of random (m, b) guesses to try
    data = np.array([[1, 1], [2, 2], [3, 3]])
    for i in range(L):
        m = np.random.randint(1, 4)
        b = np.random.randint(1, 10)
        x = data[:, 0]
        y = m * x + b
        Loss = euclidean_distance_calc(y, data)
        ploting(x, y, data, Loss)

if __name__ == "__main__":
    main()
    plt.show()
It works; the problem is I want a better way to fit m and b to the data instead of just guessing.
Yes, I do know I can use calculus, but I would rather not because this is meant to be a brute-force version
If I'm understanding correctly, this is exactly the problem that https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html solves.
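A rough sketch of what gradient descent looks like for this problem, using the same three points as the script above. The function name `fit_line_gd`, the learning rate, and the step count are assumptions picked for illustration, and the loss here is mean squared error rather than the signed sum from the original script:

```python
import numpy as np

def fit_line_gd(x, y, lr=0.05, steps=2000):
    """Fit y ~ m*x + b by gradient descent on mean squared error."""
    m, b = 0.0, 0.0
    for _ in range(steps):
        residual = (m * x + b) - y            # prediction error per point
        grad_m = 2.0 * np.mean(residual * x)  # d(MSE)/dm
        grad_b = 2.0 * np.mean(residual)      # d(MSE)/db
        m -= lr * grad_m                      # step downhill in m
        b -= lr * grad_b                      # step downhill in b
    return m, b

data = np.array([[1, 1], [2, 2], [3, 3]], dtype=float)
m, b = fit_line_gd(data[:, 0], data[:, 1])
print(m, b)
```

On these three points the loop should settle very close to m = 1 and b = 0. Instead of drawing random guesses, each step nudges m and b in the direction that lowers the error.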
thanks for the source
you might also be interested in other numerical optimization techniques like newton's method / newton-raphson method, coordinate descent, et al
I remember those along with runge kutta ... some of them were taught in a Numerical Analysis Course at my Uni.. grab a used Applied Numerical Analysis book at a second hand bookstore. Better than Googling
indeed, although i usually file those things away as "things i mostly don't need to know the details of"
i've used lbfgs before but not for any strong technical reason, it's just what gave good results quickly
Yep the libraries...but best to know how they work
The prof, though, made the students implement some of those algorithms from scratch in C or Pascal ..
probably useful if you want to be a developer of scientific software or a machine learning engineer
maybe not that useful if you want to be a statistician
but yeah i agree you should at least know how they work in general
even if you haven't worked through the underlying theory
Done a bit of scientific software and ML myself but bulk of my career was in Traditional Web and Desktop software dev
I enjoy the maths though and like understanding how things work in general
I always forget about newton, and that'd prob work fine here. I remember RK-4! That's --- well, something with ODEs. I guess I don't remember it as well as I should.
guys, I have a question
I have two datetime columns, each starting at a different date.
neither of them has null values.
the first runs from 2016 to 2021 and the second from 2014 to 2018, and they overlap at some points.
The question is: how can I combine those two columns into one while preserving the rest of the columns? Obviously I will get more rows after this operation.
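One possible reading of this is a wide-to-long reshape, where each row is repeated once per datetime column. A minimal sketch under that assumption; the column names `id`, `start_a`, `start_b` and the dates are hypothetical:

```python
import pandas as pd

# Hypothetical frame: one ordinary column plus two datetime columns.
df = pd.DataFrame({
    "id": [1, 2],
    "start_a": pd.to_datetime(["2016-01-01", "2017-06-01"]),
    "start_b": pd.to_datetime(["2014-03-01", "2016-01-01"]),
})

# melt() stacks the two datetime columns into a single "date" column,
# repeating the other columns and recording which column each date came from.
long_df = df.melt(
    id_vars=["id"],
    value_vars=["start_a", "start_b"],
    var_name="source",
    value_name="date",
).sort_values("date", ignore_index=True)
print(long_df)
```

If the goal is instead a single shared timeline with the other columns lined up against it, an outer merge on the date would be the thing to look at.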
"Environment variable $DATABASE_URL not set, and no connect string given."
i have a sql error
Data science and AI are on my list of things to learn in 2022. Can someone please recommend where to start?
Hi, I have a question: do we need polynomial features in classification?
x_train = numpy.array(read_x).reshape(-1,1)
y_train = []
for i in y_read[::-1]:
    y_train.append(i)
model.fit(x_train,numpy.array(y_train))
prediction = model.predict(np.array([453]).reshape(-1,1)) # trained upto 452 what it predicts now is correct but even if i increase the value to like 460 or even more its the same but if i reduce it from 452 it predicts as expected
print(prediction)
whenever I increase the input value, the prediction does not change: anything beyond what is in the training data gives the same output, but when I reduce it to a value that is in the training data it predicts as expected
I'm very new to sklearn and it's one of my first projects. I haven't watched many tutorials
also im using RandomForestRegressor
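A likely explanation, sketched with made-up data since `read_x` and `y_read` aren't shown: a random forest predicts by averaging training targets inside tree leaves, so it can't extrapolate past the range it was trained on. Every input above the largest training value lands in the same rightmost leaves and gets the same prediction:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up stand-in for the real data: inputs 0..452 with a linear target.
x_train = np.arange(453).reshape(-1, 1)
y_train = 2.0 * x_train.ravel()

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(x_train, y_train)

# Inputs beyond the training range all fall into the same leaves.
beyond = model.predict(np.array([[453], [460], [1000]]))
# Inputs inside the training range still get distinct predictions.
inside = model.predict(np.array([[100], [200]]))
print(beyond)
print(inside)
```

If the real target keeps growing past 452, a model that can extrapolate, such as a linear regression, may be a better fit for this kind of forecast.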
Can anybody help me with machine learning problems?
I was trying to fill missing values of a categorical data with mode. I used following code where "Embarked" is the categorical variable. But it's not working..
df["Embarked"].fillna(df["Embarked"].mode())
hi all! I've been playing with pandas a bit lately trying to sort through some scientific data. It's been a blast but there are some simple things that I can't seem to be able to google for.
I have a table with different columns. Some of the values are the same for all columns, some instead appear only in certain columns. I'd like to sort all the columns so that values that match are all next to each other.
For example, if ColumnA has value "10", and ColumnB and C also have a value "10", then all the "10"s are put next to each other. If any of the columns is missing "10", then the field is left blank.
for v in sample['ColumnA']:
    print(v in sample['ColumnB'].values)
I can check if values exist in different columns like this. I'm unsure of how to do this for multiple columns and how to implement the logic of sorting the values accordingly or leaving blank
resist the temptation to loop. you'll learn more if you look for an idiomatic way to do it.
I just thought that maybe I could pivot. I'm not sure if I'm looking in the right direction.
Maybe I can have values as indexes and have column names just show the presence (true or false)
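A minimal sketch of that idea, with values as the index and one True/False column per original column. The frame below is invented to match the ColumnA/ColumnB description, and it assumes each value appears at most once per column:

```python
import pandas as pd

# Hypothetical data shaped like the description above.
sample = pd.DataFrame({
    "ColumnA": [10, 20, 30],
    "ColumnB": [20, 10, 40],
    "ColumnC": [30, 40, 10],
})

# Reshape to long form: one row per (column, value) pair.
long_form = sample.melt(var_name="column", value_name="value").dropna()

# Count each value per column, then turn the counts into presence flags.
presence = pd.crosstab(long_form["value"], long_form["column"]) > 0
print(presence)
```

Each row of `presence` is one value and each column says whether that value occurs in the corresponding original column, which avoids looping over the values by hand.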
can you give a copy/pastable sample of the data?
df.head().to_dict('list')
also, are you trying to do this for some practical reason, or are you just experimenting?
it doesn't seem like a very useful thing to do
I'm not sure if pivot is what I'm looking for though
I need to do something else for a few minutes, but I will only be able to continue helping if you provide the result of print(df.head().to_dict('list'))
anyone?
I was about to start working, but if you can do print(df.head().to_dict('list')) immediately, I will take a look.
{'Survived': [0, 1, 1, 1, 0], 'Pclass': [3, 1, 3, 1, 3], 'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'Heikkinen, Miss. Laina', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'Allen, Mr. William Henry'], 'Sex': ['male', 'female', 'female', 'female', 'male'], 'Age': [22.0, 38.0, 26.0, 35.0, 35.0], 'SibSp': [1, 1, 0, 1, 0], 'Parch': [0, 0, 0, 0, 0], 'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450'], 'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05], 'Embarked': ['S', 'C', 'S', 'S', 'S']}
Thank you
In [5]: df["Embarked"].fillna(df["Embarked"].mode())
Out[5]:
0 S
1 C
2 S
3 S
4 S
Name: Embarked, dtype: object
It works when I do it. In what way is it not working?
Tensorflow, anyone?
Try giving enough information that someone who knows about tensorflow could read your question and answer it.
It is a long dataset with 891 rows. In two rows "Embarked" values are missing. So tried filling missing values with mode using following code.
df["Embarked"].fillna(df["Embarked"].mode())
But after that when I try to print sum of na, it is still showing 2.
df.isna().sum()
show what df["Embarked"].mode() is
import tensorflow as tf
sess = tf.Session()
op = sess.graph.get_operations()
print([m.name for m in op])
print([m.values() for m in op])
I'm doing it right or..?
S
You're "doing it right" if this code relates to your question, but you have not asked a question yet.
Actually its printing
0 S
oh thanks I just need some corrections from professional
Why does it have to be from a professional? And why is this code wrong?
so, the problem is that mode is returning a Series rather than a stand-alone value. Try .mode().iat[0] to pick the first element.
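A small sketch of the fix on made-up data. Two things are going on: `fillna` given a Series only fills positions whose index labels match, and `fillna` returns a new Series, so the result has to be assigned back:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic column, with two missing values.
df = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "S", np.nan]})

# mode() returns a Series indexed from 0, so fillna only matches index 0
# and the missing rows at index 3 and 5 stay missing.
still_missing = df["Embarked"].fillna(df["Embarked"].mode()).isna().sum()

# Taking the first element gives a scalar, which fills every missing row.
mode_value = df["Embarked"].mode().iat[0]
df["Embarked"] = df["Embarked"].fillna(mode_value)  # assign the result back
print(still_missing, df["Embarked"].isna().sum())
```

After the assignment, `df.isna().sum()` should report zero for this column.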
for anyone interested in Quantum Computing: https://quantumzeitgeist.com/a-quantum-year-in-review-what-happened-in-2021-our-picks-of-news-and-views/
Thanks a lot
Seaborn, Matplotlib, and Plotly
I'm trying to do a simple outer join to solve my problem. However, I have to do a join among a df rows, not among different dataframes. It seems to not be possible.
I tried with concat(), but the result of the join is not what I expect
sorry, I had missed this
Alright, let me see.
the practical purpose is to find out if a specific metabolites shows in different mediums (and which)
so it's an experiment repeated using different mediums, I need to find out how the different medium affects the production of certain metabolites. For which I need to visualize what metabalites are present in what mediums at a glance
hi, may i ask some question? right now im working for my final year project & now planning to implement data visualization for my project. does anyone know to suggest me on how to do it?
for further reference, this is what I have now
this is what I want to end up with
this is a spreadsheet, but it shows what I want
alternatively, this works
I feel like this should be very simple. Maybe I'm overthinking it
While TensorFlow might appear alluring you might wanna consider learning OOP in Python as well just so you can easily get yourself acquainted with PyTorch.
Knowing 2 Deep Learning frameworks can easily increase your worth, make you super flexible, and perhaps a little bit indispensable (with the right attitude) 😀
You can imagine someone in Software dev domain who knows React + Vue.Js + Laravel
Moral of the story: try not to be over-dependent on one framework. Know at least two.
basically right now my web based already deploy machine learning model and now i'm planning to do visualization on my website
anyone have any suggestion for me?
@keen storm unfortunately I haven't come up with a solution yet, but I have to do something else. I'll leave it running on my computer in case I get a chance to try again later.
thanks a lot for looking at it!
With this, do you think getting a Master Degree does help in breaking into Research domain?
Sometimes I do imagine, what if those people that discovered Attention and Transformer models didn't deem it fit to go into Research, we might probably still be stuck with RNN and LSTM 😂
Create a new column entirely, then merge (you could use concatenate or join method as well) both date columns row-wise. Finally, sort the values of the new column to remove any overlapping points.
Once mission has been accomplished, you can then get rid of the two date columns.
Since the goal is to drive the loss function toward its minimum, engineering a new polynomial feature that helps your model learn more of the underlying patterns in your data won't hurt.
However, remember the response variable being predicted in any classification task is always a discrete value.
-
Check the pinned message
-
Do you want free resources or a paid one with more structured courses?
Free Resources
a) Kaggle, FreeCodeCamp, YouTube, Andrew Ng's ML and DL courses on Coursera etc
Paid
a) Udemy, DataQuest, DataCamp, Coursera
b) Bootcamp: check FourthBrain.ai
c) Graduate Studies in University
i have used as you tell me but another problem occurred, it is a Memory Error, 28.7Gb
You were able to merge it successfully, right? And the memory error is from another operation, yeah?
what metrics or method i can use to compare 2 identical cnn models but trained with different augmentation techniques and checks which model performs best?
or should say more reliable and accurate
the memory problem is from the merge operation
and yes i was able to merge successfully when using a sample of 10 rows from that data
Look at the two CNN performances on unseen data (Validation set and/or other Holdout set)
- Check the networks validation loss and validation accuracy.
You can also plot their respective learning curves to visualise the performance.
Hmmm what woulda been the shape of the new dataframe given the new rows due to the merge?
actually the dataframe has around 2800 rows and 20 columns, but I am doing the merge 10 times
Hmm but why 10 times though?
10 date and price columns to be merged
very messy data
OK but assuming you're doing 2800 x 10 that woulda been 28,000 rows yeah?
But 28,000 isn't that many rows to give a memory error warning. 🤔
yeah that's really weird
Are you working on colab or on your local machine?
i have built 10 new dataframes from the big one, every df have two columns, and then i remerged again
on local machine
You're not even supposed to repeat that 10 times. Although I don't exactly have a picture of the initial dataset.
You sure you're stacking them row-wise not column wise right?
-
Did you use merge, concatenate, or join?
-
Can you write the code you used as well?
i will send you my code
import pandas as pd
import numpy as np
df=pd.read_excel('sample_question2022-01.xlsx')
columns=df.columns.tolist()
for column in columns:
    if (df[column].isnull().sum()>2300):
        df.drop(column,axis=1,inplace=True)
columns=df.columns.tolist()
import itertools
count_date=itertools.count(1)
count_price=itertools.count(1)
for column in columns:
    if(df[column].dtypes=='datetime64[ns]'):
        df.rename(columns={column:f'date{next(count_date)}'},inplace=True)
    else:
        df.rename(columns={column:f'Price{next(count_price)}'},inplace=True)
columns=df.columns.tolist()
merged=df[[columns[0],columns[1]]].set_index('date1')
k=2
for i in range(2,len(columns)-1,2):
    merged=pd.merge(merged,df[[columns[i],columns[i+1]]].set_index(f'date{k}'),how='outer',left_index=True,right_index=True)
    k+=1
i have used merge, and on rows
Still giving Memory warning?
I do have a new observation. What happens to other column(s) if the date column successfully gets merged?
This obviously will increase the shape of final dataframe and even populate every other non date column with NAN.
Do you see the angle I'm coming from yet?
yes i see, the other columns return nan values while date column is the
If you couldn't care less about the NaN then no problem... Otherwise, I'd suggest you write a custom function instead. In your function, give the condition for date_col1 and date_col2 to be merged, then apply it to the new column.
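One possible cause of the memory error (an assumption, since the real spreadsheet isn't shown): when a merge key repeats, every left row pairs with every matching right row, so each of the 10 outer merges can multiply the row count. Missing dates that repeat in the index could behave the same way. The sketch below first shows that blow-up on made-up data, then stacks the (date, price) pairs row-wise instead. The date1/Price1 names follow the code above; the values are hypothetical.

```python
import pandas as pd

# Repeated keys multiply rows: 3 x 3 matching rows become 9.
left = pd.DataFrame({"key": [1, 1, 1], "a": [1, 2, 3]})
right = pd.DataFrame({"key": [1, 1, 1], "b": [4, 5, 6]})
blown_up = left.merge(right, on="key", how="outer")

# Hypothetical wide table: each (date, price) pair is a separate series,
# padded with missing rows where a series is shorter.
df = pd.DataFrame({
    "date1": pd.to_datetime(["2022-01-03", "2022-01-04", None]),
    "Price1": [10.0, 11.0, None],
    "date2": pd.to_datetime(["2022-01-04", None, None]),
    "Price2": [20.0, None, None],
})

# Stack each pair as rows of one long table instead of merging repeatedly.
pieces = []
for k in (1, 2):
    pair = (
        df[[f"date{k}", f"Price{k}"]]
        .dropna()  # drop the padding rows before stacking
        .rename(columns={f"date{k}": "date", f"Price{k}": "price"})
        .assign(series=k)
    )
    pieces.append(pair)
long_df = pd.concat(pieces, ignore_index=True)

# Pivot back to one row per date and one price column per series.
wide = long_df.pivot(index="date", columns="series", values="price")
print(len(blown_up), wide, sep="\n")
```

The pivot step assumes each date appears at most once per series; if that doesn't hold for the real data, pivot_table with an aggregation would be needed instead.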
i will try this later, Thank you very much :)
You're welcome
Thanks! Do you have any other recommendations as to what to learn? So far I plan to learn numpy, pandas, seaborn, and scikit-learn. I'll have to add tensorflow and pytorch to that list as well. What do people usually use for data cleaning for big data?
For data cleaning, pandas is your go to for most applications
For 100Gb+ datasets pandas starts to struggle. Then a distributed analytics library like PySpark or Dask is required
I see thanks!
keep in mind that most machines don't have 100 GB of ram, regardless of what pandas can technically handle 🙂
oh so that is what the validation set is about
thank you sir
In general, the objective of any train/test split, cross validation, bootstrapping, etc. is to emulate "out of sample" data: data that your model has not yet seen and that was not involved in training. This is important if you want to estimate how well the model generalizes beyond the training sample.
Where would I start with ai?
I have some decent knowledge with python. I am just wondering if there is an article to read off of or a book to read or something?
God i wish they did. My whole project is about 170gb. 70gb storage then about 100gb of files to run through
how often do you run into big projects like that?
Depends what you do and what data type you have
I have images that take about 169GB (cached image feature files and images themselves)
The text files alone are less than 1gb
im curious what you're doing with all those images, i assume you don't need to keep them all in memory at once
Its unlikely you'll face such large amounts of it all
I'm destroying my hard drives and ram probably. You are correct
at work we had a server with 256 gb of ram, it was nice until someone left a notebook running with 100 gb of ram used, on thursday, and didn't clean it up until the following tuesday
Annoyingly tensorflow is slower to duplicate the same file 3 times than it is to load the same file 3 times
But I stand by my choice of loading image once and duplicating its values 3 times
Dont wanna kill my hard drive any more x-x
I had a friend who worked in security camera detection stuff and his company had this experimental model to train so on a weekend when everyone was gone they let this guy use all the resources he could and I think he ended up blue screening on about 750GB of ram
Well, running out of ram, probably not blue screening these days
i wouldn't worry about your ssd dying
unless this is on a personal machine and not a work machine
Why not follow a structured course? Learning each data science tool/ library one after the other can slow your progression.
Have you checked out Udemy and Coursera courses on Data Science yet?
If you can afford about $50, use it to buy a well-structured course on Udemy (Udemy is usually quite a bit cheaper than courses on other platforms)
Of course, you can always use YouTube as a second resource to augment what you've learned from Udemy
yeah it's personal, I'm using an HDD for longevity purposes as I bought this PC specially to ensure it can survive a decade or so (last one managed 8 years and still works fine, able to play VR and most games still)
My current PC performs quicker than google colab (except for times I have to download data due to bad Wi-Fi) - especially as Google Colab RAM/Storage even on Pro is too small for my task
Yeah im currently following a Udemy course from Jose Portilla, it’s teaching numpy pandas seaborn and scikit. Just wondering what to do after
Awesome. Just keep at it. You're in the right hands. Jose Portilla's and Andrei Neagoie's Data Science & Machine Learning courses are 5 🌟 material.
Alright thanks man
Hello guys, is data science still a relevant field to get into/ start learning rn? if yes, what learning path do you recommend?
Hello
Can someone Explain what are activation functions and what are they really used for in simple terms?
Data Science has always been and will continue to be a relevant tech field to get into 😂
If you read the last couple of messages here, I'm sure you'll see the answer to your question
Neural networks basically work by, in each layer, multiplying a vector by a matrix and applying an activation function to the result, until you reach the end.
Without activation functions, any number of such multiplications would be equivalent to one (because any number of linear operators applied in order is a linear operator). And almost no data is a linear mapping of inputs to outputs, so without activation functions, it'd be impossible for NNs to work.
So we introduce nonlinearities of some kind - that's what activation functions are, pretty much any nonlinear function. In practice, you need one that has a derivative almost everywhere (because you need it for backpropagation), is cheap to calculate, and (that's the most advanced requirement) ideally behaves nicely under backpropagation - has nonzero derivatives everywhere, etc.
*have nonzero derivatives everywhere
though as ReLUs show, even that is sometimes not required
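To make the "any number of linear layers equals one" point concrete, here is a small numpy sketch. The weights and input are made up and chosen so the arithmetic is easy to follow by hand.

```python
import numpy as np

W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])   # first layer weights (hypothetical)
W2 = np.array([[1.0, 1.0]])    # second layer weights (hypothetical)
x = np.array([2.0, 1.0])       # one input vector

# Two linear layers in a row equal one linear layer with weights W2 @ W1.
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
same_without_activation = np.allclose(two_layers, one_layer)

def relu(z):
    """Rectified linear unit: negative entries are set to zero."""
    return np.maximum(z, 0.0)

# With a ReLU between the layers the collapse no longer holds here:
# W1 @ x is [1, -1], the ReLU zeroes the -1, and the output changes.
with_relu = W2 @ relu(W1 @ x)
same_with_activation = np.allclose(with_relu, one_layer)

print(same_without_activation, same_with_activation)
```

The ReLU also has a zero derivative for negative inputs, which is the exception mentioned above.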
Let me break it down as much as I can
Activation functions help our neural network make more accurate predictions. The main reason we use them is this:
Not all relationships in data are linear. A purely linear model can only fit lines, planes, and hyperplanes (remember those from geometry?), and many datasets don't follow those shapes.
By using activation functions, we help our neural nets capture the non-linearities in our dataset, which leads to more accurate predictions.
Thank You Very much for explaining!!
Hello everyone, I can help you if you have any questions or concerns about the software. or you can contact me from this account instag @ai.engineer1
Hello! I was wondering if someone might be able to kindly help me out: I'm working with a pandas dataframe, and I'm trying to iterate through only columns that have int values, and ignore the columns with other datatypes.
numerical_columns = [column for columns in identity_survey.columns if identity_survey[column].dtypes() == int]
print(numericalcolumns)
I get a TypeError: 'numpy.dtype[object]' object is not callable
I've tried (and unfortunately already deleted) multiple other approaches with no luck
identity_survey is the name of the dataframe
.dtypes isn't a function, you probably also meant .dtype anyway
"not callable" means "you can't use it like a function", i.e. you can't use ()
also you should just use select_dtypes https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html
integer_columns = identity_survey.select_dtypes(include=[int, np.integer])
or something like that
not sure if you can use np.integer alone or if you need int as well
the DataFrame has a .dtypes attribute, but it's still not a function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html
each individual Series has a .dtype attribute: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dtype.html
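A quick sketch of both approaches on a made-up frame (identity_survey here is a hypothetical stand-in, not the real survey). In this example np.integer alone is enough; whether that holds for every integer-like column type is an assumption worth checking on the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for identity_survey.
identity_survey = pd.DataFrame({
    "age": [31, 45, 27],
    "score": [0.5, 0.9, 0.1],
    "name": ["a", "b", "c"],
    "visits": [3, 0, 7],
})

# select_dtypes returns a DataFrame holding only the matching columns.
integer_df = identity_survey.select_dtypes(include=[np.integer])
integer_columns = integer_df.columns.tolist()

# Equivalent list comprehension, using the .dtype attribute (no parentheses).
by_hand = [
    column for column in identity_survey.columns
    if pd.api.types.is_integer_dtype(identity_survey[column].dtype)
]

print(integer_columns, by_hand)
```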
So does the activation function help in deciding the weights and biases cuz that is what makes up the output of a neuron right?
thanks @desert oar
the entire network "helps in deciding", because if you change anything in the network, it will change how information propagates through the network
the reason we use nonlinear activation functions is to help the network find more interesting non-linear relationships in the data
imagine that the network is a child building stuff out of blocks, and they only have rectangular blocks
that would be like if you only had linear activation functions, everything would still end up kind of rectangular
by relations do you mean correlations?
no, i mean generally the relationship between inputs and outputs
the function that the NN is learning to approximate
if the child is building stuff out of various funky-shaped blocks, they could build more sophisticated shapes in the end
Right
same reason your network can learn more if you have more nodes and layers: more blocks to build out of
note that this is a very very highly non-mathematical analogy
So does the activation function help in shaping the network and nodes in such a way to make it more efficient in predicting?
not necessarily more efficient, but it is necessary in order for a network to learn arbitrarily sophisticated functions
i see melat0nin is typing, they know what they are talking about so you might want to stay online for a few more minutes 🙂
Alright
Haha, I wasn't gonna say anything mathy here. :'] One of the exercises that made me really "get" the steps of NNs was trying to do ezpz perceptron exercises.
You got all of it, I didn't need to add anything, haha.
The math part is pretty much that you get linear combos of linear combos without activation functions, which is just kind of boring. You pretty much just get new features that look like 3.0 * petal_length + 4.1 * sepal_width - ... stuff like that. Which is fine for some things, but for other things they may not have a clean linear relationship.
I'm having a ton of fun with streamlit, y'all, if you haven't checked it out yet, check it out.
You might deputise Streamlit if you try Gradio 😂 Well, I love both of them
Oh, interesting! It seems like Streamlit is made for like, dashboarding EDA / Results, but Gradio is more for like, showing off models?
These both seem really cool, I'll try to make something cool in Gradio this week. :'']
Mainly streamlit is used in the Data Science and Machine Learning community for deploying model as a web app.
Gradio can also be used for model deployment. The beauty of Gradio is in building and sharing a demo of Machine Learning model so people can interact or play with it.
Check this out https://huggingface.co/spaces/valhalla/glide-text2im
Oh, this is neat. Yeah, Streamlit seems like the "Python Version of Shiny" to me, but I'm still learning it. This is kind'a cool, and I wonder how it compares with or complements things like H2o and MLFlow. I'll have to dig in. It looks really slick tho.
i was wondering about mlflow here too
I haven't touched MLFlow in like, two years --- but when we were using it, I was like, heavy into it. We had a fork of it we added junk to and it was great. H2o was also really neat --- our modelers really liked the charts and stuff, but idk how that is now. I gott'a look back into this stuff.
I don't have experience with MLOps yet so I can't say for now
It's prob the case that MLFlow works fine with this, since MLFlow (at least the tracking part) just kind of sits in the code and reports stuff to the DB.
I'll check it out tho, I'm so excited about the new stuff that's come up recently (or not-so-recently) that I get to learn about, haha.
@tidal bough @odd meteor thanks for your comments on activation functions. I somehow didn't know a lot of that 
what is that?
i can just use accuracy and loss as the basis of their performance on a validation data?
for the most part yes
You can measure it with pretty much any metric you want. F1, accuracy, recall, precision, etc.
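For reference, this is roughly how those metrics fall out of the validation predictions for one binary model. The labels and predictions are made up; for the two CNNs you would compute the same numbers on the same held-out set and compare.

```python
import numpy as np

# Hypothetical validation labels and predictions for one binary model.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
tn = int(np.sum((y_pred == 0) & (y_true == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```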
How to determine which columns to drop in knn?
I know which column is categorical , continuous and is class attribute.
I assume:
Continuous data is for training set
Categorical is meant to be dropped
Class attribute is a thing to be predicted?
Example:
I tried to make a heatmap (picture). On left is whole dataset, on right head(20). ||11th column had 0.5 value on first 20 records after normalisation||
drop the bad ones, keep the good ones 😉
that heatmap is the distance matrix between data points?
It's made from data.corr(), so strictly it's the correlation matrix between columns, not a distance matrix between data points.
If i have a df that contains user ratings for movies (columns=[userID, movieID, rating,....]), what would be the most efficient way to create a df that contains the count of users that rated both movies for all combinations of movies. Right now I'm doing this iteratively for every combination of movie ids like this, but I'm looking for a way to speed it up. Any suggestions?
def foo(id1, id2):
    id1_users = set(df[df["movieID"] == id1]["userID"].to_list())
    id2_users = set(df[df["movieID"] == id2]["userID"].to_list())
    combined = len(id1_users & id2_users)
    return combined
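One way to avoid the pairwise loop, sketched on made-up ratings: build a 0/1 user-by-movie matrix once and multiply it by its own transpose. The column names come from the question; the data is hypothetical.

```python
import pandas as pd

# Hypothetical ratings table with the columns named in the question.
df = pd.DataFrame({
    "userID":  [1, 1, 2, 2, 3, 3, 3],
    "movieID": [10, 20, 10, 20, 10, 20, 30],
    "rating":  [4, 5, 3, 4, 5, 2, 1],
})

# 0/1 indicator matrix: rows are users, columns are movies.
# clip guards against a user rating the same movie more than once.
rated = pd.crosstab(df["userID"], df["movieID"]).clip(upper=1)

# Entry (a, b) counts the users who rated both movie a and movie b.
co_counts = rated.T.dot(rated)

print(co_counts)
```

The diagonal holds the number of users per movie. With many users and movies the dense matrix may not fit in memory, in which case a sparse matrix would be the usual next step.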
does anyone know why this error occur?
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
this is the error stated
result is a list array, so you cant check for equality with an int
Your result variable is a list array
So you cant check if a list array is equal to an int
array, not list. And the problem isn't quite that you can't compare it to an int - you can - but that the result isn't a boolean, but an array of booleans
and you can't use an array of booleans as a condition
Presumably, you expect result to be a single-element array. If that's the case, extract its only element and do this stuff on it, not on the entire array.
yes my bad array not list
hmm i still try to fix this one
how about Use a.any() or a.all()
`# Let's predict!
import numpy as np
newData = np.array([
    2, 2, 3, 3, 3, 2, 2, 4, 4, 4,
    4, 3, 5, 2, 3, 4, 4, 3, 4, 4,
    1, 4, 3, 5, 1, 5, 2, 5, 5, 4,
    3, 4, 4, 4, 4, 4, 4, 2, 5, 4,
    3, 2, 4, 2, 4, 2, 4, 2, 4, 4]).reshape(-1,1)
result = gnb.predict(newData)
print(result)
if (result == 0):
    print("Your Personality Is Agreeableness")
elif (result == 1):
    print("Your Personality Is Conscientiousness")
elif (result == 2):
    print('Your Personality Is Extrovert')
elif (result == 3):
    print('Your Personality Is Neuroticism')
elif (result == 4):
    print('Your Personality Is Openness')
elif (result == 5):
    print('Your Personality Is Tie')`
basically, this is the code for predict
is there any code that need to be adjusted?
@errant shore @tidal bough
What does print(result) print?
ah, I see on the screenshot
well, that's the reason then. What do you expect to happen when you compare result to ints? It is after all an array of many values. So which if branch, if any, should execute?
hm i expect when the outcome is between 0-5 and will print the personality
hmm basically, what should i do right now to fix this one?
hopefully it might help me
...which of the ton of outcomes you have? 😛
basically, when user input, then the outcomes will be 0 or 1 or 2 or 3 or 4 or 5
so every outcome will define different personality
it will be 6 outcome
basically, i cant compare array to int?
Here's what result is.
what do you mean by trying to compare all that to a single int? For example, is this equal to 4, or not?
owh i see, supposedly the result only display one number only
not the all 50 outcomes like that
i want to predict the outcome that will display the predicted single number outcome
not the 50 like this
so what can i do to make the outcome only display one number only? @tidal bough
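A likely explanation, assuming the model was trained with one row per person and one column per answer: reshape(-1,1) turns the 50 answers into 50 separate one-feature samples, so predict returns 50 labels. reshape(1,-1) makes them a single sample. The sketch below uses a shortened answer list and a made-up fake_predict in place of gnb.predict, purely to show the shapes.

```python
import numpy as np

answers = [2, 2, 3, 3, 3]  # shortened hypothetical questionnaire answers

as_rows = np.array(answers).reshape(-1, 1)  # 5 samples, 1 feature each
as_one = np.array(answers).reshape(1, -1)   # 1 sample, 5 features

def fake_predict(X):
    """Hypothetical stand-in for gnb.predict: one label per input row."""
    return X.sum(axis=1) % 6

labels = {0: "Agreeableness", 1: "Conscientiousness", 2: "Extrovert",
          3: "Neuroticism", 4: "Openness", 5: "Tie"}

many = fake_predict(as_rows)   # one prediction per answer
single = fake_predict(as_one)  # a single prediction for the whole form

# A one-element array can be turned into a plain int with .item().
message = f"Your Personality Is {labels[int(single.item())]}"

print(many.shape, single.shape, message)
```

If the real model was fitted on a single feature column, reshaping alone won't help and the training data layout would need another look. The dictionary lookup also replaces the if/elif chain.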
is this channel also for web scraping??
Nope
is there one??
@proper sable not really. I guess you could ask in #web-development, but be prepared to confirm that the website you're trying to scrape is cool with that.
Hi, can anyone help me with creating a custom dataset. I cannot seem to find my error here.
Hey @calm bison!
Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:
• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)
• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:
@terse frigate try giving enough information so that people can start making suggestions
so i need to make a search string based on a job description. i have some example to make supervised
but i also wanna match the accuracy of that result by comparing it with the example results
A search string based on a job description. Can you explain that some more?
so my job is to search for resumes on job boards
we are given a job description of the Job Role.
my job is to make a search string and use it to look for resumes
now i want to make a model which reads the description and constructs a search string like how i do - using keywords and important skills mentioned and understanding the role
for example the clients they request for a cloud architect and provide requirements in skills etc
and i make a string -
(cloud architect) OR (Solutions architect) AND (AWS OR Google Cloud OR Azure) AND Agile
something like that
So the real goal is to detect resumes that relate to a job opening. But due to some limitation, you have to enter keywords into a search API.
yeah i have to manually construct the search string
Weird

so i was thinking if i fed enough job descriptions and also some strings to learn on
it should give me proper string right?
I have to go to sleep but I'll probably think about it some more. In the mean time, do you know about term frequency inverse document frequency?
It might give you some ideas for extracting keywords.
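A bare-bones sketch of the idea in plain Python, on three made-up job descriptions. Words that appear in every description score zero, while words specific to one description float to the top. Real text would need proper tokenization and stop-word handling, which this skips.

```python
import math
from collections import Counter

# Hypothetical, tiny corpus of job descriptions.
docs = [
    "cloud architect with aws and azure experience",
    "java developer with spring experience",
    "data analyst with sql and excel experience",
]
tokenized = [doc.split() for doc in docs]

# Document frequency: in how many descriptions each term appears.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Score each term by term frequency times log inverse document frequency."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(len(tokenized) / doc_freq[term])
        for term, count in counts.items()
    }

scores = tfidf(tokenized[0])
keywords = sorted(scores, key=scores.get, reverse=True)[:4]

print(keywords)
```

The top terms for the first description could then be joined into a search string with OR/AND by hand or by rule.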
Hi, does anyone know a way to easily use files located in Google public cloud (the Landsat data for example https://cloud.google.com/storage/docs/public-datasets/landsat) inside Google Colab? Similar to how we can mount our drive for example.
❤️ thanks
Hi can anyone help me in understanding this complex inheritance automl code. I just need little help. after i will able to understand it. Please let me know guys if anybody ready to explore this attached code with me. I really want to understand this code. Please help me.
Hey @tidal edge!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
What parts of this aren't clear to you? How to use it, how to modify the code, etc.?
I think I have an idea on this. Folks at my company's AI & Research department developed a product called CV Ranker which performs exactly whatchu asking for.
I can only refer you to delve into this
https://deepnote.com/@abid/spaCy-Resume-Analysis-gboeS3-oRf6segt789p4Jg
I want to make plant detection app for android.. So how can I make a model and how? I am beginner.. As there is no problem for me in Android App.. I just need some resources and support regards model.. Either ML or DL? And what is some tutorial for brainstorming.. Thanks..
just train out a simple CNN
You can train a model using YoloV5 algorithm. Then you can host it on a vps and make an API to connect to that vps and detect stuff.
You can use Roboflow to label images, Colab to train your YoloV5 model and Amazon AWS to host your VPS.
i have a problem in code flow understanding
Hi, I'm trying to do logarithmic binning for a PSD graph, but it's just not working: the binned graph on the right still looks linearly binned. Can anyone help? My code for binning:
x4,y4=zip(*sorted(zip(faxis,Sxx.real)))
logbin = pd.DataFrame({'X' : x4,'Y' : y4})
bins5=np.arange(4e-08, 6e-06, step=2.5e-07)
categorical_object2 = pd.cut(x4, bins5)
count=pd.value_counts(categorical_object2)
grp2 = logbin.groupby(by = categorical_object2) #we group the data by the cut
av = grp2.aggregate(np.mean)
plt.figure()
plt.plot(av.X,av.Y, 'x')
plt.title('binned PSD plot')
plt.xlabel('mean Frequency [Hz]') # Label the axes
plt.ylabel('mean Power')
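The likely culprit is np.arange, which produces equally wide (linear) bin edges. A sketch with log-spaced edges from np.geomspace follows. The frequency axis and power values are synthetic stand-ins for faxis and Sxx.real, and the number of bins is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for faxis and Sxx.real from the question.
faxis = np.linspace(4e-08, 6e-06, 500)
power = 1.0 / faxis

# np.geomspace gives edges that grow by a constant factor,
# i.e. bins of equal width on a logarithmic axis.
log_edges = np.geomspace(4e-08, 6e-06, num=12)

logbin = pd.DataFrame({"X": faxis, "Y": power})
groups = pd.cut(logbin["X"], log_edges, include_lowest=True).rename("bin")

# Mean frequency and mean power inside each logarithmic bin.
av = logbin.groupby(groups, observed=True).mean()

ratios = log_edges[1:] / log_edges[:-1]
print(av.head(), ratios[:3])
```

Plotting av.X against av.Y with plt.loglog instead of plt.plot should then show the points evenly spaced along the frequency axis.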
is this channel also for computer vision?
like cascade classifiers
can you train a cascade classifier that can count number of stars in the sky with normal camera
does someone know how to fix this problem?
I have a plot here of a column from my panda dataframe, I'd like to make subplots for the other columns but I I can't seem to get it to work, any help appreciated
Why can hyperparameters not be 'autotuned' the same way PID controllers can be?
what would autotuning consist of?
i don't know how PID controllers work
(although i should probably learn, because i want to learn about modding my espresso machine)
This one is probably the easiest to understand
While I'm aware there's way more than 3 hyperparameters, it's interesting that a similar process isn't used (or at least I haven't come across it)
this sounds more or less like what we do in machine learning
set up cross validation and pick the set of hyperparameters that performs best
the hyperparameter space is huge even with a small number of parameters (and usually real-valued so uncountably infinite), so various strategies like random search, halving random search, bayesian optimization, etc. exist that improve on the traditional "grid search"
the reason you need cross validation is that you need to emulate predicting on "out of sample" data
whereas with PID controllers you don't have this problem of in-sample vs out-of-sample
so yes, it is used
and especially in cases with one single hyperparameter (eg the shrinkage parameter in lasso or ridge regression), you can literally just look at the plot of model performance vs parameter value, and pick the one right before model performance starts to decline (indicating overfitting)
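As a rough illustration of "set up cross validation and pick the best hyperparameter", here is a numpy-only sketch for the single shrinkage parameter of ridge regression. The data is synthetic, and the helper names ridge_fit and cv_error are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.5, size=60)

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression weights."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

def cv_error(X, y, alpha, k=5):
    """Mean squared error averaged over k held-out folds."""
    folds = np.array_split(np.arange(len(y)), k)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        w = ridge_fit(X[train], y[train], alpha)
        errors.append(np.mean((X[held_out] @ w - y[held_out]) ** 2))
    return float(np.mean(errors))

# A small grid of candidate shrinkage values; each is scored only on
# data the corresponding fit has not seen.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {alpha: cv_error(X, y, alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)

print(best_alpha, scores)
```

Random search or Bayesian optimization replace the fixed grid with a smarter way of proposing candidates, but the held-out scoring step stays the same.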
Hmm ok that makes sense
I have a doubt. In the above code why some method name starts with _ and some method name starts with directly name in the same class?
method name starting with _
method name directly starting with name
Anyone use shap/lime for fb prophet?
yes
hello everyone, how can I filter a pandas dataset for 2 conditions in the same column?
For example, I want to filter the length column for values equal to 48 and to 40
filtro = (all_years_numeric_filtered["Length"] == 48) & (all_years_numeric_filtered["Length"] == 40)
all_years_numeric_filtered[filtro]
Tried doing that but I'm getting an empty df
Already got it, created a list with the values and used .isin() for the Length column
Hello, I have a question about One Hot Encoding. I'm working on a simple Heart Disease ML model, and following many tutorials they seem to use OHE on columns that numerical? I thought we only use OHE when we have text that needs to be converted into numbers or when we have plenty of options in one category. Can anyone explain this to me?
Can you take something that is non-deterministic and cant be run in parallel and use AI to parallelize it?
@gritty bough what do you mean "can't be run in parallel"?
@thin palm one hot encoding is a way of representing nominal data (i.e. not quantifiable or orderable). So if you have an Animal feature, and your animals are pigs, goats, and snakes, you don't want to assign them 1, 2, and 3, because that would mean that snakes are three times as much as pigs, whatever that means.
Though it sounds like you might already understand that much.
@novel acorn & represents logical AND and something can't be both 48 and 40.
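A small sketch of the difference on a made-up Length column: the AND filter can never match, while OR and .isin() give the same rows.

```python
import pandas as pd

# Hypothetical stand-in for all_years_numeric_filtered.
df = pd.DataFrame({"Length": [40, 48, 53, 40, 20]})

# AND: a single value can't equal 48 and 40 at once, so this is empty.
both = df[(df["Length"] == 48) & (df["Length"] == 40)]

# OR: keep rows where either condition holds.
either = df[(df["Length"] == 48) | (df["Length"] == 40)]

# isin: the same filter, easier to extend to more values.
same = df[df["Length"].isin([48, 40])]

print(len(both), either["Length"].tolist(), same["Length"].tolist())
```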
Hey guys quick question. Let's say I have a dataframe of stock prices. The index is the dates and the column is the daily closing price of each stock.
bank_stocks.xs(key='Close',axis=1,level=1).plot()
Does this line format automatically plot the Close column against the index (date) whenever I don't chose which data to plot against?
@rose pasture if you show the dataframe in a copy-and-pastable way, I will try
namely print(bank_stocks.head().to_dict('list'))
Please ping me if you decide to do that.
{('BAC', 'High'): [47.18000030517578, 47.2400016784668, 46.83000183105469, 46.90999984741211, 46.970001220703125], ('BAC', 'Low'): [46.150001525878906, 46.45000076293945, 46.31999969482422, 46.349998474121094, 46.36000061035156], ('BAC', 'Open'): [46.91999816894531, 47.0, 46.58000183105469, 46.79999923706055, 46.720001220703125], ('BAC', 'Close'): [47.08000183105469, 46.58000183105469, 46.63999938964844, 46.56999969482422, 46.599998474121094], ('BAC', 'Volume'): [16296700.0, 17757900.0, 14970700.0, 12599800.0, 15619400.0], ('BAC', 'Adj Close'): [33.942649841308594, 33.582183837890625, 33.62542724609375, 33.57497024536133, 33.59661102294922], ('C', 'High'): [493.79998779296875, 491.0, 487.79998779296875, 489.0, 487.3999938964844], ('C', 'Low'): [481.1000061035156, 483.5, 484.0, 482.0, 483.0], ('C', 'Open'): [490.0, 488.6000061035156, 484.3999938964844, 488.79998779296875, 486.0], ('C', 'Close'): [492.8999938964844, 483.79998779296875, 486.20001220703125, 486.20001220703125, 483.8999938964844], ('C', 'Volume'): [1537600.0, 1870960.0, 1143160.0, 1370210.0, 1680740.0], ('C', 'Adj Close'): [368.26544189453125, 361.4664611816406, 363.2597351074219, 363.2597351074219, 361.5412902832031], ('GS', 'High'): [129.44000244140625, 128.91000366210938, 127.31999969482422, 129.25, 130.6199951171875], ('GS', 'Low'): [124.2300033569336, 126.37999725341797, 125.61000061035156, 127.29000091552734, 128.0], ('GS', 'Open'): [126.69999694824219, 127.3499984741211, 126.0, 127.29000091552734, 128.5], ('GS', 'Close'): [128.8699951171875, 127.08999633789062, 127.04000091552734, 128.83999633789062, 130.38999938964844], ('GS', 'Volume'): [6188700.0, 4861600.0, 3717400.0, 4319600.0, 4723500.0], ('GS', 'Adj Close'): [103.86396026611328, 102.42938232421875, 102.38907623291016, 103.83979034423828, 105.0890121459961],
('JPM', 'High'): [40.36000061035156, 40.13999938964844, 39.810001373291016, 40.2400016784668, 40.720001220703125], ('JPM', 'Low'): [39.29999923706055, 39.41999816894531, 39.5, 39.54999923706055, 39.880001068115234], ('JPM', 'Open'): [39.83000183105469, 39.779998779296875, 39.61000061035156, 39.91999816894531, 39.880001068115234], ('JPM', 'Close'): [40.189998626708984, 39.619998931884766, 39.7400016784668, 40.02000045776367, 40.66999816894531], ('JPM', 'Volume'): [12838600.0, 13491500.0, 8109400.0, 7966900.0, 16575200.0], ('JPM', 'Adj Close'): [26.503398895263672, 26.350433349609375, 26.430240631103516, 26.616453170776367, 27.048765182495117], ('MS', 'High'): [58.4900016784668, 59.279998779296875, 58.59000015258789, 58.849998474121094, 59.290000915527344], ('MS', 'Low'): [56.7400016784668, 58.349998474121094, 58.02000045776367, 58.04999923706055, 58.619998931884766], ('MS', 'Open'): [57.16999816894531, 58.70000076293945, 58.54999923706055, 58.77000045776367, 58.630001068115234], ('MS', 'Close'): [58.310001373291016, 58.349998474121094, 58.5099983215332, 58.56999969482422, 59.189998626708984], ('MS', 'Volume'): [5377000.0, 7977800.0, 5778000.0, 6889800.0, 4144500.0], ('MS', 'Adj Close'): [36.114253997802734, 36.13903045654297, 36.23814010620117, 36.27529525756836, 36.65928649902344], ('WFC', 'High'): [31.975000381469727, 31.81999969482422, 31.55500030517578, 31.774999618530273, 31.825000762939453], ('WFC', 'Low'): [31.19499969482422, 31.364999771118164, 31.309999465942383, 31.385000228881836, 31.55500030517578], ('WFC', 'Open'): [31.600000381469727, 31.799999237060547, 31.5, 31.579999923706055, 31.674999237060547], ('WFC', 'Close'): [31.899999618530273, 31.530000686645508, 31.4950008392334, 31.68000030517578, 31.674999237060547], ('WFC', 'Volume'): [11016400.0, 10870000.0, 10158000.0, 8403800.0, 5619600.0], ('WFC', 'Adj Close'): [20.444866180419922, 20.207735061645508, 20.1853084564209, 20.303871154785156, 20.300668716430664]}
was it like that?
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
Next time use this if it's too long
Ok I will next time
Is this different from what you wanted?
Can I contact you?
The index is the dates and the column is the daily closing price of each stock.
For your reference, this description is incomplete. Your columns are a multiindex of company names by (high, low, open, close, etc).
Also your rows are probably indexed by date, so my example will look a bit different.
if you do print(bank_stocks.head().index), I can correct my version.
Yes it is different, here's what I got. I was just curious as to how the plot() chose the index as my x axis even though I didn't specify anything.
sorry I should've given more details
it looks like the default behavior for DataFrame.plot is to treat the index as the x axis, columns as items, and values as observations about said items (y axis).
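A small sketch of that default, in case it helps. The dates are made up for illustration and the prices are rounded from the dump above; `closes` is a hypothetical name, not the original `bank_stocks` frame.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import pandas as pd

# Hypothetical frame: dates as the index, one column per ticker.
closes = pd.DataFrame(
    {"BAC": [47.08, 46.58, 46.64], "JPM": [40.19, 39.62, 39.74]},
    index=pd.date_range("2006-01-03", periods=3, freq="D"),  # placeholder dates
)

# No x or y given: the index goes on the x axis, each column becomes a line.
ax = closes.plot()
line_labels = [line.get_label() for line in ax.get_lines()]
print(line_labels)
```

Passing `x=` or `y=` to `plot()` overrides that default if you want a column on the x axis instead.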
I see thank you very much! Appreciate the help
@tidal edge you there?
Yes. Shall we connect a little later? I'm on a conf call.
Okay okay
Like you cannot send 2 parts away to be computed independent of each other.
For example. Making a checkerboard.
You can check whether the neighbors in the N, S, E, W directions are black or white, or you could do if ((row + col) % 2) { black } else { white }.
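Rough sketch of the second approach, assuming the parity is taken over row + column (the function name `cell_color` is made up). Each cell only needs its own coordinates, so nothing depends on a neighbor having been computed first.

```python
def cell_color(row, col):
    # Color depends only on the cell's own coordinates, not on its neighbors.
    return "black" if (row + col) % 2 else "white"

# Every cell can be computed independently, in any order.
board = [[cell_color(r, c) for c in range(4)] for r in range(4)]

for line in board:
    print(" ".join(color[0] for color in line))
```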
Mmmm
I'm thinking
Maybe problems like what I'm asking don't exist.
What if we have a column that is “male” and “female”? Why split those into 2 columns, like this tutorial did?
One-hot encoding will create 2 columns, ['male', 'female']; using either one of them is enough
if male is 0, it means female = 1, and vice versa
anyone using cv2 ?
because at some point you need to figure out a way to encode your data as numbers. splitting the data into two 1/0-valued columns is one straightforward way to do that
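Quick sketch of both options with `pd.get_dummies`, on a made-up four-row "Sex" column (not the tutorial's data). With two categories the two columns always sum to 1, so one of them carries all the information.

```python
import pandas as pd

# Hypothetical toy column standing in for the tutorial's data.
df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# Full one-hot encoding: one 0/1 column per category.
both = pd.get_dummies(df["Sex"], dtype=int)

# drop_first=True keeps a single column, since the other is redundant.
one = pd.get_dummies(df["Sex"], drop_first=True, dtype=int)

print(both)
print(one)
```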
```py
row_vals = '\n'.join([val for val in df['Date:']])
```
i want to only view the values in the specified row in groups of 10, how would i be able to do that?
that line of code displays all the values
I am using pandas to help work with excel files
can you clarify what "groups of 10" means?
do you mean "in the specified column"?
columns go "up and down" -- like columns in a building
rows go "across", like rows of crops in a field
i only want columns 1-10 to be printed, then 11-20, etc
"Date: " is the column im referencing
because the column name starts at 1, i want to be able to print the data in the cells from columns 2-11
are you talking about rows or columns?
ok, and what do you want to print?
I want to print the data in groups of 10, such as 1-10 and 11-20
yes, sorry for the confusion
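size = 10
# step through the rows in chunks of `size`
for lo in range(0, len(df), size):
    hi = lo + size
    date_values = df['Date:'].iloc[lo:hi]
    # astype(str) so the join also works if the cells are not strings
    print(' '.join(date_values.astype(str).tolist()))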
i'd just write a loop for that
!d range
```py
class range(stop)
class range(start, stop[, step])
```
The arguments to the range constructor must be integers (either built-in [`int`](https://docs.python.org/3/library/functions.html#int "int") or any object that implements the [`__index__()`](https://docs.python.org/3/reference/datamodel.html#object.__index__ "object.__index__") special method). If the *step* argument is omitted, it defaults to `1`. If the *start* argument is omitted, it defaults to `0`. If *step* is zero, [`ValueError`](https://docs.python.org/3/library/exceptions.html#ValueError "ValueError") is raised.
For a positive *step*, the contents of a range `r` are determined by the formula `r[i] = start + step*i` where `i >= 0` and `r[i] < stop`.
For a negative *step*, the contents of the range are still determined by the formula `r[i] = start + step*i`, but the constraints are `i >= 0` and `r[i] > stop`.
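For example, here is how those formulas play out. The numbers are arbitrary, picked to match the "groups of 10" loop.

```python
# Positive step: start + step*i while the value stays below stop.
chunk_starts = list(range(0, 25, 10))

# Negative step: start + step*i while the value stays above stop.
countdown = list(range(10, 0, -3))

print(chunk_starts)  # [0, 10, 20]
print(countdown)     # [10, 7, 4, 1]
```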
thank you
need some help with cspdarknet architecture.. anyone?
Is the spp block a part of the cspdarknet backbone, like the final block in the backbone? Or lies in the neck?
Hello what would be the most efficient way of adding 150k rows to a dataframe? When using .append it starts with a pretty decent 30mins remaining and then it increases when it gets closer to the end. When using .loc it starts with more than an hour and then increases but not as much as using .append. It looks like .append performs better with small dataframes and .loc with large dataframes.
30 minutes might be a bit slow, but I have another df with 57k rows loaded in memory to perform some conditionals and decide which rows should be added to the final df I want to create
I see how it dies little by little...😅
Hey, I have a question related to building a multi-class classification model. In my datasets I have some sequence of vectors that are unique for a specific class. Do you think that throwing this UNIQUE vector into an unsupervised model is a waste of resources?
If I'm not mistaken, appending to a df is expensive. Try one thing? Add the rows to a Python data structure, like a dict, and then convert that to a df. See how it goes?
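Something along these lines, as a sketch only. The frame, the `value > 3` condition and the doubling are all made up, since the real logic wasn't posted. The idea is to collect plain dicts in a list and build the DataFrame once at the end, instead of growing it row by row.

```python
import pandas as pd

# Hypothetical source data standing in for the real 57k-row frame.
source = pd.DataFrame({"id": range(6), "value": [3, 8, 1, 9, 4, 7]})

rows = []  # cheap to append to, unlike a DataFrame
for record in source.to_dict("records"):
    if record["value"] > 3:  # placeholder condition, not the real one
        rows.append({"id": record["id"], "value": record["value"] * 2})

# Build the final frame once, after the loop.
final_df = pd.DataFrame(rows)
print(final_df)
```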
Sure I will give that a try
Hey, I'm really interested in AI image generation with GANs because the results are really amazing, so I'm looking to learn more about it. I followed this course https://livecodestream.dev/post/generating-images-with-deep-learning/ and adapted the code to work with RGB images and such, and learnt more about the parameters, but I still have a long way to go. I'm looking for any papers or articles on GAN image generation that you would all recommend so I can learn further about this topic
I was able to get it to generate images that started to resemble the dataset and, as expected, a dataset with images from the same perspective also worked a lot better
but it's still not optimized, and over time it loses stability and the images drop in quality. So I definitely need to read more articles on image generation
Worked well for fashion-mnist though
It proved to be better at the beginning, but then the remaining time also increases
I'll paste the code I'm using in case you see something that can be improved
hello
i am trying to create an ai
or self learning stuff
can any one help
or have experience?
Is it possible to write to an xml in chunks instead of appending to a giant list and then dumping that list to a df and then to xml?
Uhm, maybe you can reduce the internal complexity? You are getting filtered_df only to check whether it's empty or not, right?
One way would be not to retrieve the whole thing.
Another way would be, knowing your logic, to use a dict up front instead of that df, to make the search in the while loop faster
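A sketch of that idea, since the actual loop wasn't posted here: `lookup_df`, `key` and `candidates` are made-up names. Filtering a DataFrame on every iteration just to test `.empty` scans the whole column each time, while a set (or dict) built once gives constant-time membership checks.

```python
import pandas as pd

# Hypothetical lookup frame and candidate keys.
lookup_df = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
candidates = ["a", "x", "c", "y"]

# Slow pattern: filter the whole frame on each iteration.
slow_matches = [
    k for k in candidates if not lookup_df[lookup_df["key"] == k].empty
]

# Faster pattern: build the set once, then do O(1) membership tests.
known_keys = set(lookup_df["key"])
fast_matches = [k for k in candidates if k in known_keys]

print(fast_matches)
```

If the loop also needs the matching row's values, a dict such as `dict(zip(lookup_df["key"], lookup_df["value"]))` works the same way.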
Yeah there are plenty of experienced people over here. Feel free to ask questions.
i have installed jupyter but i wanna know which location is best to install so that i can easily access python libraries
should i install in ....../python37/lib/
or is there something else i should do
Just to check if it's empty or not
I'll be playing around with different data structures and hopefully one performs relatively quick. Thank you very much
To clarify, the column is already 0 / 1. The column is "Sex" and it has 0 and 1 already. The tutorial still one-hot encodes it. Why?
Hello, i have two npz files which i wanna load into a jupyter notebook
I used this x_train=load('outfile.npz')
Is there any way i can check the contents of this x_train
If the categorical feature sex already has a numeric value (1/0) as you mentioned, then the reason sex was one-hot encoded again is probably that the person behind the tutorial video does not want the model to be biased towards either gender.
The person wouldn't want the model to assume that the gender encoded as 1 is more important than the gender encoded as 0.
I have not personally uhm loaded npz files so lemme just check out.
Alrightt
If the file is a .npz file, then a dictionary-like object is returned, containing {filename: array} key-value pairs, one for each file in the archive.
If the file is a .npz file, the returned value supports the context manager protocol in a similar fashion to the open function:
with load('foo.npz') as data:
    a = data['a']
Via:
https://numpy.org/doc/stable/reference/generated/numpy.load.html
@wicked grove
Try just printing x_train for now? The doc suggests it must be a dict-like object of files
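Here's a minimal sketch of inspecting an archive. It writes a throwaway `outfile.npz` into a temp directory first so it runs on its own; the array names `features` and `labels` are made up, yours will be whatever the file was saved with.

```python
import os
import tempfile

import numpy as np

# Create a small example archive (stand-in for the real outfile.npz).
path = os.path.join(tempfile.mkdtemp(), "outfile.npz")
np.savez(path, features=np.arange(6).reshape(2, 3), labels=np.array([0, 1]))

with np.load(path) as data:
    names = list(data.files)  # names of the arrays stored in the archive
    shapes = {name: data[name].shape for name in names}
    features = data["features"]  # index by name to get the actual array

print(names)
print(shapes)
```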
Ohhh okayy, i will do that now
Thank you so much
Man best answer I've gotten. Seriously thank you for this. Makes sense now because 1 > 0 so I see where this is coming from. Cheers mate!
You're welcome 😊
When we get a cross-validated score, is our final model's score supposed to be higher than the CV score? For example, when I cross-validate with 10 folds, my mean score is .81%. When I take my model and fit it on our features and target, my score jumps to .86%. Does this make sense?
so you did 10-fold CV, which means you have 10 separate scores, the mean of which is .81 (not .81%). Can you explain again what you did that received the .86 score?
If you trained (fit) the model on the same instances that you used to evaluate, that would explain why the score went up.
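To illustrate, here's a toy sketch (not your model or data): a hand-rolled 1-nearest-neighbor classifier on synthetic data with deliberately noisy labels, using only NumPy. Scored on its own training set it looks perfect, because every point is its own nearest neighbor, while 10-fold CV gives the more honest, lower number.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_test):
    # Give each test point the label of its closest training point.
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    return y_train[np.argmin(dists, axis=1)]

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(200) < 0.2  # flip roughly 20% of labels to add noise
y = np.where(flip, 1 - y, y)

# 10-fold CV: every score is computed on rows the model never saw.
folds = np.array_split(rng.permutation(200), 10)
cv_scores = []
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    pred = predict_1nn(X[train_idx], y[train_idx], X[test_idx])
    cv_scores.append(accuracy(y[test_idx], pred))
cv_mean = float(np.mean(cv_scores))

# Scoring on the same rows used for fitting is optimistic.
train_score = accuracy(y, predict_1nn(X, y, X))
print(cv_mean, train_score)
```

The gap is exaggerated here because 1-NN memorizes its training set, but the same effect, usually smaller, shows up with most models.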

