#data-science-and-ml

sour spindle
#

I already made the function

#

It seems to make the testing accuracy much better

sour spindle
#

As you can see in the form [normal, postprocessing1, postprocessing2] accuracy

sour spindle
#

the postprocessing1 also makes the training data get into the 80's in accuracy

desert oar
#

Right, exactly

#

That is the value of abstraction

forest canyon
#

Hello. I am trying to compare two dataframes. They have the same number of columns and same column names. Error is below. How do I adjust this so the compare works?

ValueError: Can only compare identically-labeled DataFrame objects

Here are my functions.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://postgres:secret@10.0.10.125:5432/sheepdb')

def jdb_sheep_df():
    table_name = 'jdb_sheep'
    table_df = pd.read_sql_table(
        table_name,
        con=engine,
        schema='sheepdb'
    )
    return table_df

def sql_csv_compare(path='/home/steven/projects/repos/sheepdb-cli/sheepdb-cli/test.csv'):
    csv_df = pd.read_csv(path)
    sql_df = jdb_sheep_df()
    #ne = sql_df.compare(csv_df)
    #ne = (csv_df != sql_df).any(1)
    ne = csv_df.compare(sql_df, keep_equal=True, keep_shape=True) 
    return ne

print(sql_csv_compare())
serene scaffold
#

@forest canyon did you confirm that the rows are indexed the same way?

forest canyon
#

Confirm in what way?

serene scaffold
#

@forest canyon looks like you're reading the DataFrames from an SQL database. Once you have the DataFrames, make sure that the .index of each are the same sets.
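
A minimal sketch of the index check suggested here, assuming two small hypothetical frames standing in for sql_df and csv_df (the column names and values are made up for illustration). `DataFrame.compare` only accepts frames whose row and column labels match, so the sketch checks both and then compares only the rows the two frames share:

```python
import pandas as pd

# Hypothetical stand-ins for sql_df and csv_df: same columns, different row counts.
sql_df = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
csv_df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Both checks must be True before DataFrame.compare will accept the pair.
same_index = sql_df.index.equals(csv_df.index)
same_columns = sql_df.columns.equals(csv_df.columns)
print(same_index, same_columns)

# Restrict both frames to the row labels they share, then compare.
shared = sql_df.index.intersection(csv_df.index)
diff = sql_df.loc[shared].compare(csv_df.loc[shared])
print(diff.empty)
```

Restricting to shared labels only makes sense if row N of one frame really corresponds to row N of the other; otherwise both frames would need a meaningful key set as the index first.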

sour spindle
forest canyon
#

How?

serene scaffold
sour spindle
serene scaffold
serene scaffold
forest canyon
#

I know it's an attribute but I dont know what you mean by "the same"

#

It gives both dataframes an index column at the front starting at 0

undone fiber
#

anyone recommend the easiest single variable ml linear regression tutorial they've come across.. I figure I'd start with something like that..

serene scaffold
forest canyon
#

Oh.. So then what is the use of a compare if you always have to have the same number of rows?

sour spindle
#

how to send code again

#

in the format

tidal patrol
#

Maybe make an if statement that will say if they are equal… for example if set1 == set2: print("They are equal to each other.")

#

just my idiotic brain thinking

forest canyon
#

I need to identify differences

sour spindle
tidal patrol
#

yea but there could be no differences

forest canyon
#

I'm not quite there yet.

#

I need to get them comparing first

#

So the compare function only works on data with same row count on both sides?

tidal patrol
#

that could be it… maybe ur comparing a 2d list to a 3D list

forest canyon
#

That's what was said above anyway.

#

My dataframes are same columns just one has more rows than the other

tidal patrol
#

can you delete a row and test it?

sour spindle
#

here is the question reposted with the log instead of the ss:

Hey i am making a stock predictor and i am wondering if it was ok to use the output from testing and use a post processing function which make the testing accuracy go from .53 to .79?
I already made the function
It seems to make the testing accuracy much better
As you can see in the form [normal, postprocessing1, postprocessing2] accuracy

2021
[0.5121227115289461, 0.8090054428500743, 0.5314200890648194]
2020
[0.5096486887679367, 0.8169223156853043, 0.5319148936170213]
2021
2020
[0.5091538842157348, 0.8213755566551212, 0.5329045027214251]
2021
2020
[0.5091538842157348, 0.830282038594755, 0.5338941118258288] 

the postprocessing1 also makes the training data get into the 80's in accuracy

forest canyon
#

Like compare two columns?

#

And see if items in the sql_df exist in the csv_df?

tidal patrol
#

um…

#

I’m an idiot trying to learn. I got no clue.

forest canyon
#

Kind of like you can in Excel

#

Okay

#

At least I know this won't work now

tidal patrol
#

would it change the value if you added 0’s to the extra row? Cause then it will allow you to compare them

sour spindle
forest canyon
#

Basically - take a column in sql_df, compare it to the same column csv_df, then give me any value in sql_df column that doesn't exist in csv_df column

sour spindle
forest canyon
#

Yep. They are in my code above.

#

sql_df comes from a DB table and csv_df comes from a csv file

sour spindle
#

u might want to convert to a list with .tolist() and then use .index(False)

forest canyon
#

It looks like that will be the same issue - row count must match, but I'll try and Google if it doesn't work. Main question is answered on compare. The compare has to be the same row count and you'd want to sort by the same column so it was an accurate compare. Thanks all.
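
For the case described above (same columns, different row counts), one possible sketch that avoids `compare` altogether. The `tag` and `weight` columns and their values are hypothetical; `isin` answers "which values of one column are missing from the other frame", and an outer merge with `indicator=True` does the same for whole rows:

```python
import pandas as pd

# Hypothetical data: sql_df has more rows than csv_df.
sql_df = pd.DataFrame({"tag": ["A1", "A2", "A3", "A4"], "weight": [50, 61, 58, 70]})
csv_df = pd.DataFrame({"tag": ["A1", "A3"], "weight": [50, 59]})

# Values of one column that are in sql_df but not in csv_df.
missing_tags = sql_df.loc[~sql_df["tag"].isin(csv_df["tag"]), "tag"].tolist()
print(missing_tags)

# Whole-row comparison: the _merge column marks each row as 'both',
# 'left_only' (only in sql_df) or 'right_only' (only in csv_df).
merged = sql_df.merge(csv_df, how="outer", indicator=True)
only_in_sql = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_sql)
```

Neither approach depends on row order or row count, so no sorting or padding is needed.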

arctic wedgeBOT
#

failmail :ok_hand: applied mute to @junior robin until <t:1641019694:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

devout trench
#

hey, i want to make a virtual assistant like alexa or siri but it works offline and on mobile devices with basic functionality like play music open application etc.
i have hit a roadblock in the offline voice recognition part. i have tried sphinx and vosk but the accuracy is not that great. i am thinking of learning ML and designing my own voice recognition, but getting a huge amount of voice data is a problem.. or should i just try to modify existing voice recognition models?

austere swift
hoary wigeon
#

SOS

#

I need a project topic on Machine Learning, not something like price prediction or churn, cancer classification

ruby crown
hoary wigeon
#

lemme check

#

this isn't helping me

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @deep gyro until <t:1641035996:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

mint palm
#

best place to learn tensorflow???

#

i know data science basics nn models every thing, language, procedure

#

just library is what i had left

#

so tell me the best place to learn tf

marsh yacht
#

hello can anyone help me how to output a binary in a KNN.predict value

stable dove
#

Hi there, excuse me for posting this question here, but I wouldn't know what other channel to post it in. Is there anyone who can help me understand how an encoder, ROM decoder,PLA works?
I tried to read from many parts but I don't understand especially the OR matrix and AND matrix associated

mortal silo
#

how to keep theano cache file/folder names consistent over different systems?

earnest widget
nova timber
#

I’m trying to generate diffs for multiple files and I can’t get it to work

wicked grove
#

hello, i have a doubt in tensorflow.when i use conv2D with kernel size=1..does it mean 1X1 convolution

forest canyon
#

Is it possible to replace nan with null in a dataframe and then still turn it into a dict after? I can totally replace nan with null using fillna, but I can't turn it into a dict after. Anyway to convert nan to null and then still be able to turn it into a dict after?

csv_df = pd.read_csv(path)
dict = csv_df.to_dict('records')
forest canyon
#

This is the answer

.replace([np.nan], [None])
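
A version-independent sketch of the same idea, in case the `.replace` call behaves differently across pandas releases: convert to records first, then swap missing values for `None` in plain Python. The sample frame is hypothetical, and the sketch assumes every cell holds a scalar (not a list or array):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame read from the CSV.
csv_df = pd.DataFrame({"name": ["Dolly", "Shaun"], "weight": [61.5, np.nan]})

# Build the list of dicts, replacing any missing value with None.
records = [
    {key: (None if pd.isna(value) else value) for key, value in row.items()}
    for row in csv_df.to_dict("records")
]
print(records)
```

`None` serializes to `null` in JSON and is passed as SQL NULL by most database drivers, which is usually the reason for wanting it instead of NaN.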
mighty spoke
#

Hi is plt.loglog(x,y) the same as plt.plot(np.log(x),np.log(y))?

tidal patrol
#

I believe that the first one is for matplot and the second is for numpy
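
A small sketch of the difference between the two calls, using made-up data. `plt.loglog` keeps the raw values and puts both axes on a logarithmic scale; `plt.plot(np.log(x), np.log(y))` transforms the values (natural log) and draws them on ordinary linear axes. The curve has the same shape, but the axes and tick labels differ:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.logspace(0, 3, 50)  # 1 .. 1000
y = x ** 2

fig, (ax_loglog, ax_manual) = plt.subplots(1, 2)
ax_loglog.loglog(x, y)                 # raw data, log-scaled axes
ax_manual.plot(np.log(x), np.log(y))   # log-transformed data, linear axes

print(ax_loglog.get_xscale(), ax_manual.get_xscale())
```

With `loglog` the ticks still read in the original units (1, 10, 100, ...), which is usually easier to interpret than the logarithm values on the second plot.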

tidal patrol
marsh yacht
marsh yacht
# marsh yacht

my neigh is the same as the bottom clf_KNN pic, just different variables

marsh yacht
marsh yacht
quiet vault
#

Im guessing you could get the index of the 1 and then do classes[index] to get the value of the class
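
A minimal sketch of that suggestion, assuming the prediction is a one-hot row and `classes` is an array of label names (both hypothetical here; the screenshots are not part of this log). Fitted scikit-learn classifiers expose their labels as the `classes_` attribute, which could play the role of `classes`:

```python
import numpy as np

classes = np.array(["cat", "dog", "horse"])  # hypothetical label names
one_hot_prediction = np.array([0, 1, 0])     # hypothetical one-hot output

# Position of the 1 in the one-hot row, then look up the label at that position.
index = int(np.argmax(one_hot_prediction))
predicted_class = classes[index]
print(predicted_class)
```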

solar yew
#

Hey does anyone have experience working with TF-IDF vectors, especially regarding using them along with non-NLP features? I managed to get it into a dataframe, however, it is too large to concat with the other features (2 additional columns of equal length). I'd love to hear how other people solved this!

#

Error thrown explains I need 5.4GiBs extra to process, however, i managed to navigate that error already once before while creating the Dataframe

#

I could perhaps do my test-train split early and do this part individually for both, cutting down the training set by ~20%
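
One possible way around the memory error, sketched under the assumption that the downstream model accepts scipy sparse input (many scikit-learn estimators do): keep the TF-IDF output sparse and stack the extra columns onto it with `scipy.sparse.hstack`, instead of converting everything to a dense DataFrame. The texts and the two extra feature columns are made up:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["cheap flights today", "meeting moved to friday", "cheap meds online"]
extra = np.array([[3, 0.5], [4, 0.1], [3, 0.9]])  # hypothetical non-text features

tfidf = TfidfVectorizer().fit_transform(texts)  # scipy sparse matrix, mostly zeros
features = hstack([tfidf, csr_matrix(extra)]).tocsr()
print(tfidf.shape, features.shape)
```

A sparse matrix stores only the non-zero entries, so the combined feature matrix stays close to the size of the TF-IDF output rather than rows times vocabulary.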

mossy stratus
#

Does anyone know why I get this error, my GPU is 8GB, not 6GB? (Windows, 3060 Ti, 8GB VRAM)

RuntimeError: CUDA out of memory. Tried to allocate 90.00 MiB (GPU 0; 8.00 GiB total capacity; 5.49 GiB already allocated; 0 bytes free; 7.60 GiB allowed; 5.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 497.29 Driver Version: 497.29 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... WDDM | 00000000:01:00.0 Off | N/A |
| 0% 32C P8 4W / 200W | 142MiB / 8192MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

coral kindle
#

Ooofles

#

I tried to follow a tutorial but I'm having weird CUDA memory issues or maybe classes idk

#
import time

import torch
from torch import nn, optim
from torch.utils.data import DataLoader


def train(dataloader: DataLoader, model: nn.Module, optimizer: optim.Optimizer, criterion: nn.Module, epoch: int):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        # error happens here
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(epoch, idx, len(dataloader), total_acc / total_count)
            )
            total_acc, total_count = 0, 0
            start_time = time.time()
#

And that one worked on both Colab and local

#

According to Stackoverflow it's happening because my label > n_labels which is kinda weird
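
A quick sanity check for that explanation, as a sketch. Class-index losses such as `nn.CrossEntropyLoss` expect integer targets in the range 0 to num_classes - 1, so labels that start at 1 are one common way to end up out of range. The helper name `labels_in_range` is hypothetical:

```python
import numpy as np

def labels_in_range(labels, num_classes):
    """Return True if every label is a valid class index in [0, num_classes - 1]."""
    labels = np.asarray(labels)
    return bool(labels.min() >= 0 and labels.max() < num_classes)

print(labels_in_range([1, 2, 3, 4], num_classes=4))  # labels start at 1: out of range
print(labels_in_range([0, 1, 2, 3], num_classes=4))  # shifted down by one: valid
```

Running the same batch on the CPU, as done here, is a reasonable way to get a readable error message instead of an opaque CUDA failure.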

#

This is what happened when I'm on CPU

coral kindle
coral kindle
#

If you're on a notebook, the only way to get rid of this error is to restart your kernel. Then lower down your batch size.

mossy stratus
#

batch size?

coral kindle
#

Rinse and repeat till you find a suitable batch size

coral kindle
# mossy stratus batch size?

When you configure a DataLoader, you have to pass in a batch size. It's the number of samples you feed into your model at a time.

mossy stratus
coral kindle
#

Ideally we aim for bigger batch sizes but due to GPU constraints we lower it.

coral kindle
mossy stratus
#

does it improve quality? (higher batch size)

#

I can make the result image smaller

#

but the quality seems to be the same from 128 to 400

coral kindle
mossy stratus
#

so it's speed, not quality?

coral kindle
coral kindle
#

But higher batch sizes mean you do fewer iterations
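
The relationship between batch size and iteration count can be sketched with plain arithmetic. The number of batches per epoch is the dataset size divided by the batch size, rounded up, which matches `len()` of a PyTorch DataLoader when `drop_last` is left at its default of False. The dataset size of 10,000 is made up:

```python
import math

def batches_per_epoch(num_samples, batch_size):
    """Number of iterations needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

for batch_size in (16, 64, 256):
    print(batch_size, batches_per_epoch(10_000, batch_size))
```

Memory use per step grows with the batch size, which is why lowering it is the usual first response to an out-of-memory error during training.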

mossy stratus
#

on my GPU, it does iterations really fast, on my CPU, it has far more memory, but does slow iterations

#

didn't work

#

still get this:

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 8.00 GiB total capacity; 5.55 GiB already allocated; 0 bytes free; 7.60 GiB allowed; 5.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
#

I doubt the batch size affects 2.5 GB

lapis sequoia
#

I have some problems. They should be simple, but they are not.

#

These questions require a certain expertise in order to provide any useful answers.

#

They involve the correct data normalization boundaries for certain kinds of data, variance calculations, bayesian algorithms, and audio sampling.

#

I could use some help.

#

ideally, i could use some help from a systems developer, perhaps a python or NumPy project developer

tidal patrol
#

send it in the server I’m sure we can all work together for an answer!

lapis sequoia
#

I've asked about it before. I'm reluctant to expend tendons.
https://github.com/falseywinchnet/fabada/blob/master/examples/streamclean_rx_buffer.py
take a look over this. Note where the fabada function enters and what is done initially. Note what kind of data I am working with.
This is a simple, linear, single file, open source project. It is not complex from that perspective. It is complex in terms of the computation, but we can hold off on that.

GitHub

Fully Adaptive Bayesian Algorithm for Data Analysis (FABADA) is a new approach of noise reduction methods. In this repository is shown the package developed for this new method based on \citepaper....

#

I've coded things like variance arbitrarily because I don't know what should work best.

#

One problem i have with it is that it "clicks" at the end and start of each sample frame. This is unrelated to the sampling or buffers, simply bypassing the noise reduction function and passing the data frame back to the output is proof of this.

#

So the problem is internal to how I have adapted this formula from another person, who is a co-author on it, to this particular application.
https://github.com/PabloMSanAla/fabada/blob/master/fabada/__init__.py this is their original work. I have made a lot of changes, but i have verified they return identical results in superior timeframes for all of the stuff i've separated out of the main function, like the chi2pdf calculation.

#

I could use some help optimizing what the constants should be for normalization and for variance calculation. I am positive these are the only things remaining that need some love and affection.

odd patio
#

I am a newbie in machine learning

#

I use python and I am intermediate in python

#

But I need some projects in machine learning

#

Beginner projects?

tidal patrol
lapis sequoia
#

like, to study, or like, something to do?

odd patio
slow vigil
#

does anyone know how to use pandas read_parquet() and make it include the partition column in the table it returns?
there is something in spark I think where you can provide a 'base path' argument and it will adjust the schema, but I can't find anything for pandas

lapis sequoia
#

What is data science

steel fox
#

What is life

outer bay
#

Can anyone please provide some interesting resources on text summarization using k means clustering

hoary wigeon
#

I need help with PCA,
I'm using PCA on image array data

I got 268 components (out of 2304, img shape 48x48) explaining 95% of variance

but my doubt is how can I plot 268 elements as an image... shape issue and all

safe elk
#

So you want to probably mark off which part of the 48x48 accounts for the variance...what about color coding the 268 elements in the 48×48 grid . Most PCA examples use scatter plots but image data has its own built in coordinate system eg x and y and as such marking the pixels is a scatter plot too

hoary wigeon
odd patio
#

I already asked

#

I am a newbie in ml

#

are any beginner projects available to learn ml with python?

#

any websites?

lapis sequoia
safe elk
#

Maybe inverse transform can recover the image

#

pca_10 = PCA(n_components=10)
mnist_pca_10_reduced = pca_10.fit_transform(mnist)
mnist_pca_10_recovered = pca_10.inverse_transform(mnist_pca_10_reduced)

safe elk
#

The inverse_transform() method of the pca object is used to decompress the reduced dataset back to 784 dimensions. This is very useful for visualizing the compressed image!
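
Applied to the 48x48 case asked about above, a sketch might look like this. Random numbers stand in for the real images (an assumption, since the dataset is not shown), so the number of components kept will differ from the 268 reported:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
images = rng.random((200, 48 * 48))  # hypothetical stand-in for flattened 48x48 images

pca = PCA(n_components=0.95)  # keep enough components to explain 95% of the variance
reduced = pca.fit_transform(images)

# Map the reduced vectors back to 2304 values each, then reshape for display.
recovered = pca.inverse_transform(reduced)
first_image = recovered[0].reshape(48, 48)

# Each principal component is itself a 2304-vector and can be viewed as an image.
component_image = pca.components_[0].reshape(48, 48)
print(reduced.shape, recovered.shape, first_image.shape)
```

The reduced vector itself cannot be reshaped to 48x48; it is the reconstruction (or an individual component) that has the right shape for something like `plt.imshow`.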

#

Yeah it can be helpful according to the article

tidal patrol
#

@odd patio
How about generate 50 random x and y axis numbers and graph them in numpy

odd patio
#

okay

#

Thank you

odd patio
#

What are the major libraries I should learn for machine learning ? 🤔

tidal patrol
#

Numpy, pandas, and csv are probably the first three you should learn. Then tensorflow

late vault
serene scaffold
#

"learning libraries" isn't the way to go in the first place.

#

I have a message in the pins about what the major libraries are, but you shouldn't try to learn them in a box-check sort of way.

lapis sequoia
#

Where is the best place to ask for help cleaning up a CSV data source for use in pandas? Is there a special channel?

serene scaffold
#

@lapis sequoia this one

austere swift
lapis sequoia
#

The problem is the following: the values that should go in the last column are preceded by 3 empty columns in the source CSV. This seems to shift the headers 4 places to the right when parsing it with pandas' read_csv function. How can I resolve that? I can't modify the data manually since it's coming from a remote server and changes frequently.

serene scaffold
lapis sequoia
#

And the id header should be above the first column (that contains 7, 32 etc.) in the 2nd screenshot

serene scaffold
#

I suspect it has something to do with how the new_subcategory column is being parsed. In the CSV, does each value for new_subcategory have quotes around them?

lapis sequoia
serene scaffold
#

if I had a copy of an inputted CSV, I might be able to figure it out, but I see that you can't share that.

lapis sequoia
serene scaffold
#

just drag/drop the file into this chat.

lapis sequoia
#

Well maybe not publicly share on such a big server lol

#

Don't want to get into hot water, since it's client work

serene scaffold
#

Ping me if you decide that you're willing to upload it. Otherwise I'll have to move on to something else.

#

The only other thing I can think to suggest is that you try changing the delimiter for read_csv

lapis sequoia
#

Can't share publicly without modifying it, so that would defeat the purpose. Delimiter is ; and I specified that

lapis sequoia
safe elk
serene scaffold
#

if there are cases where the casecount value is actually in the casecount column, you could use fillna instead. assuming that missing values in the column are NaN

lapis sequoia
serene scaffold
#

@lapis sequoia if you're talking about the string, no, that's there because idk what the name of the column the values ended up in is.

austere swift
#

@lapis sequoia I'm not completely sure that this would work but try making a df that reads headers only, then making a df that doesnt read the headers (skip the first row to make sure that the headers don't get counted as data), then you can drop the empty rows from the data df and assign the columns from header_df to data_df.columns

lapis sequoia
austere swift
#

it would be something like this

header_df = pd.read_csv(filename, nrows=0, delimiter=';') # nrows 0 so that it only reads headers
data_df = pd.read_csv(filename, header=None, skiprows=1, delimiter=';')
data_df.dropna(axis=1, how='all', inplace=True) # how='all' drops only columns that are entirely empty
data_df.columns = header_df.columns
lapis sequoia
austere swift
#

i wouldn't think that assigning the headers directly in the read_csv would work since you would have to drop the empty columns before assigning the headers, but if it works then it's good anyways
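
A self-contained sketch of the header/data split described above, run on a small made-up CSV with three header fields and three empty fields before the last value. The column names and values are hypothetical, and `index_col=False` is added here as an assumption to stop pandas from treating leading columns as an index:

```python
import io

import pandas as pd

# Hypothetical CSV: 3 header fields, but each data row has 3 extra empty fields.
raw = "id;name;casecount\n7;alpha;;;;5\n32;beta;;;;9\n"

# Read only the header row.
header_df = pd.read_csv(io.StringIO(raw), nrows=0, sep=";", index_col=False)

# Read only the data rows, without interpreting the first line as a header.
data_df = pd.read_csv(io.StringIO(raw), header=None, skiprows=1, sep=";")

# Drop columns that are empty in every row, then attach the real header.
data_df = data_df.dropna(axis=1, how="all")
data_df.columns = header_df.columns
print(data_df)
```

This only works when the number of columns left after dropping matches the number of header fields; if a legitimate column happens to be entirely empty in one download, the assignment raises a length mismatch error instead of silently mislabeling data.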

lapis sequoia
vague moon
#

I could use a little help. I was trying to make an environment to practice on the MNIST dataset, but I was getting an error thrown trying to make the environment. Online, I found you can solve the issue by uninstalling and reinstalling Anaconda, so I did that, but now Anaconda installs without python or the powershell prompt, idk why.

serene scaffold
vague moon
#

i don't want help with that error message, that's why i didn't include it

#

i don't even know what that error message is anymore

serene scaffold
#

what OS are you on?

vague moon
#

I'm asking for help with why my anaconda is installing without python or the powershell, I am on Windows 10

serene scaffold
#

and what library/ies are you trying to use to experiment with the MNIST dataset?

vague moon
#

I just need keras to import the dataset, I was going to go about getting the dataset another way but I found a way to import the dataset through keras

#

besides that I don't know off the top of my head

serene scaffold
vague moon
#

but what if I am wanting to install Anaconda

lapis sequoia
serene scaffold
vague moon
#

Because I want to, I would like to figure out what is going wrong for the future instead of never fixing it

crystal jewel
#

I would like to ask another question unrelated to anaconda if that's ok 😛

vague moon
#

then shoot

crystal jewel
#
number_of_breeds = df.apply(pd.value_counts)
#

i have this line of code

#

which returns

Crossbred Canine/dog                                164
Retriever - Labrador                                136
Domestic Shorthair                                   63
Retriever - Golden                                   59
Dog (unknown)                                        56
...                                                 ...
[Retriever - Labrador, Catahoula Leopard Dog]         1
[Cattle Dog - Australian (blue heeler, red heel...    1
Spaniel - Tibetan                                     1
[Retriever - Labrador, Deutsche Dogge, Great Dane]    1
Mixed (Horse)    
serene scaffold
crystal jewel
#

oh

serene scaffold
#

In case you don't plan to provide that, refer to this:

#

!docs pandas.DataFrame.nunique

arctic wedgeBOT
#

DataFrame.nunique(axis=0, dropna=True)
Count number of distinct elements in specified axis.

Return Series with number of distinct elements. Can ignore NaN values.
crystal jewel
#

{0: [138, 126, 81, 71, 62]}

#

this one?

#

i did it this way

number_of_breeds.head().to_dict('list')
serene scaffold
#

I asked for df, not number_of_breeds.

#

I'll be back in a few minutes.

crystal jewel
#

it returns this :

{0: ['Retriever - Labrador', 'Shih Tzu', 'Hound - Basset', 'Alaskan Malamute', 'Shepherd Dog - German']}
serene scaffold
crystal jewel
#

i just want to plot it

#

using dash

#

and i am not sure how to access each column i guess?

serene scaffold
#

hmm, I'm not sure about that.

crystal jewel
#

can't happen?

serene scaffold
#

if there's secretly a dataframe of interest with multiple columns, I haven't seen that one yet.

crystal jewel
#

wait one moment, let's say we use the .value_counts method

#

it returns the thing above

#

can we somehow access each column

#

so we can plot it?

#

is that even 2 columns?

serene scaffold
#

well, in that case, you should just use df[0].value_counts(), without using .apply

serene scaffold
#

if you want to plot it in two dimensions, you probably need a second value.

#

otherwise you're not demonstrating any kind of relationship

crystal jewel
#

The idea is to plot the frequency of the values

#

with the values

serene scaffold
#

I think we're mixing up terms here

crystal jewel
#

we do?

#

one moment

serene scaffold
#

in my usage, the frequency is a value

#

it sounds like you're using value to refer to an item (ie a dog breed)

crystal jewel
#
Crossbred Canine/dog                                164
#

on Y axis

#

it should be 164

#

and on the X the "Crossbred Canine"

serene scaffold
#

names of breeds aren't numbers

#

so any way that you order them on the y axis would be arbitrary

crystal jewel
#

they have to be?

serene scaffold
#

what library are you using to create the figure?

crystal jewel
#

something like that

serene scaffold
crystal jewel
#

i think so

#

one moment

#

The Dash platform empowers data science teams to focus on the data and models, while producing and sharing enterprise-ready analytic apps that sit on top of Python and R models.

#

it looks like it is

serene scaffold
#
import plotly.express as px

df = px.data.gapminder().query("continent == 'Europe' and year == 2007 and pop > 2.e6")
fig = px.bar(df, y='pop', x='country', text='pop')
fig.show()

This code from the docs creates a bar chart where the labels are country names (and thus don't have a numerical value)

https://plotly.com/python/bar-charts/

How to make Bar Charts in Python with Plotly.

#

I'm not sure that their df variable is actually a pandas DataFrame though, or whether pandas DFs are compatible in this context.
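
To get from `value_counts()` to something a bar chart function can use, one option is to turn the resulting Series into a two-column DataFrame. A sketch with a few made-up rows; the column names `breed` and `count` are chosen here for illustration:

```python
import pandas as pd

# Hypothetical single-column frame like the one discussed above.
df = pd.DataFrame({0: ["Retriever - Labrador", "Shih Tzu", "Retriever - Labrador",
                       "Hound - Basset", "Retriever - Labrador", "Shih Tzu"]})

# value_counts gives a Series indexed by breed; name the index, then
# move it into a regular column next to the counts.
counts = (
    df[0].value_counts()
    .rename_axis("breed")
    .reset_index(name="count")
)
print(counts)
```

A frame shaped like this can be passed to a call such as `px.bar(counts, x="breed", y="count")`, with the breed names on the x axis and the frequencies on the y axis.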

crystal jewel
#

They are compatible, no worries about that

serene scaffold
#

yay!

crystal jewel
#

the issue remains tho :<

crystal jewel
#

oh wait

#

ignore that

serene scaffold
#

ignore what

crystal jewel
#

the code you gave me above

#

it doesn't return what I wrote

#

it returns this
C 305
D 281
R 236
S 230
T 160
...
Pug 1
Great Pyrenees 1
Collie - Border 1
Schnauzer - Miniature 1
Shepherd Dog (unspecified) 1

#

which is kind of weird, but

serene scaffold
#
In [1]: df = pd.DataFrame({0: ['Retriever - Labrador', 'Shih Tzu', 'Hound - Basset', 'Alaskan Malamute', 'Shepherd Dog - German']})

In [2]: df
Out[2]:
                       0
0   Retriever - Labrador
1               Shih Tzu
2         Hound - Basset
3       Alaskan Malamute
4  Shepherd Dog - German

In [3]: df[0].value_counts()
Out[3]:
Retriever - Labrador     1
Shih Tzu                 1
Hound - Basset           1
Alaskan Malamute         1
Shepherd Dog - German    1
Name: 0, dtype: int64

It worked when I did it.

crystal jewel
#

ye that's probably because that's not the whole data

#

one sec

#

and that's what it returns for me

Retriever - Labrador                           158
Crossbred Canine/dog                           154
Domestic Shorthair                              72
Dog (unknown)                                   62
Domestic (unspecified)                          52
                                              ... 
[Crossbred Canine/dog, Collie - Border]          1
[Shepherd Dog - German, Retriever - Golden]      1
Horse (other)                                    1
Goat (unknown)                                   1
Coonhound (unspecified)                          1
#

ok ok take a look

lapis sequoia
austere swift
#

yeah pandas is pretty awesome

#

although you should probably tell your client that they should fix their remote server data lol

#

it would likely be a lot easier to fix it on that end

lapis sequoia
twilit imp
#

I need a team of people to help me on a secret project. This seemed like a good place to look for some. DM me if your curiosity is piqued and you can be trusted.

#

I'd give you guys more info if I thought it would be morally fine to do so

solar yew
#

Any advice on working with large data in python? Currently (I believe) python is creating temporary stores for training and test data which then overload my ram before I get to putting it into the model

#

Or is 16GB simply a bit too small to work with

#

Online people have suggested using dictionaries, however, Im not quite sure how that can help

serene scaffold
solar yew
#

16GB and im just running a NLP with both text and non-text features

serene scaffold
solar yew
#

using sklearn

#

but most of the ram used up in the preparation

austere swift
#

how big is your dataset?

solar yew
#

800,000,000 values or so

#

get a lack of memory error when trying to put it into naive bayes

serene scaffold
solar yew
#

oh sorry, when i run it it takes c13gb ram

#

or 11mb of disk

serene scaffold
#

There's a partial_fit method that you can use to train the model in batches

solar yew
#

yeah naive bayes through sklearn thats the one

#

yeah i think that sounds like a good approach

#

thank you

serene scaffold
#

yw

solar yew
#

Is there a general best practice when using large data?

austere swift
#

that essentially is the best practice, load/fit incrementally so that you don't have to have all the data in memory at once
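
A sketch of incremental fitting with scikit-learn's naive Bayes, assuming the text arrives in batches (for example, chunks read from disk). The batches and labels are made up. `HashingVectorizer` is used because it is stateless and needs no pass over the full dataset, and `alternate_sign=False` keeps the features non-negative as `MultinomialNB` requires:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical batches of (texts, labels); in practice these would be read in chunks.
batches = [
    (["cheap pills now", "see you at lunch"], [1, 0]),
    (["win cheap prizes", "lunch meeting moved"], [1, 0]),
]

vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False)
model = MultinomialNB()
all_classes = np.array([0, 1])  # partial_fit must be told every class on the first call

for texts, labels in batches:
    X = vectorizer.transform(texts)  # only this batch is held in memory
    model.partial_fit(X, labels, classes=all_classes)

prediction = model.predict(vectorizer.transform(["cheap prizes"]))
print(prediction)
```

Only one batch of features exists in memory at a time, so peak memory depends on the batch size rather than on the size of the whole dataset.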

serene scaffold
austere swift
#

yeah in most cases theres a balance between how much memory the computer should have and how much data you need to use, since loading it in batches also makes the training quite a bit slower

#

most people write code on their personal computer and then have it run on a remote, more powerful computer that can run it more easily

solar yew
#

ah yeah that makes perfect sense, thanks again

odd meteor
#

Happy New Year guys 😀

stone marlin
#

Whoo, new year.

I'm workin' on portfolio fodder, and I keep seeing a lot of love for Plotly. When I was workin' with it a while back you needed like your own local server for it and it was kind'a gross. Has it gotten better? What viz library are you all into? (I've been big into Altair, but will default to matplotlib if I just wanna see something quick.)

rose pasture
#

Hi I am currently working on a Kaggle project and I'd like to understand why they used y = 'twp' to count the number of calls made per month on line 22. Shouldn't it be based off of timeStamp?
https://www.kaggle.com/vahidehdashti/911-call-data-eda

odd gust
stone marlin
#

I've tried bokeh before and it's pretty good! I've not heard of streamlit or lux before, so I'll check those out.

odd gust
stone marlin
#

Oh, dang, Streamlit looks really cool.

odd gust
grim lintel
#

oh wow streamlit does look really cool

#

can it produce interactive dashboards that are anything like the ones you can create in BI or Tableau?

twilit imp
stone marlin
#

I don't DM anyone for "secret projects". Very few people are going to put themselves out and express interest in a project where they don't know the person at all and they don't know the project at all.

#

Perhaps you could give some hints as to what the project is, what you expect us to do, and what your experience is? That might make more people interested.

twilit imp
#

hmm.... let me think... I have to figure out what I can tell you...

#

I'm making an AGI. sort of.

#

Her name is AIRI.

#

It's not that it HAS to be secret, it's just that I'm worried too many people, and maybe even the wrong people, will know too much about it. Most projects that were publicly available that I did were copied before I could patent. Among other things.

stone marlin
#

And then what're you looking for in a team? And what would your role be?

twilit imp
#

I'm looking for people with experience in machine learning, know python, maybe even a little bit of SQL.

twilit imp
stone marlin
#

Like, what would you be doing in the team.

twilit imp
#

Essentially, you'd look through the list of things needed to be done, and if you have a moment where you're like, "Ah yes I know how to do that" then you'd do that. So you'd help out where you could.

stone marlin
#

So, you'd be the project manager? Or main coder?

twilit imp
#

I'd be project manager, but I'd also help where I could.

stone marlin
#

(Just getting all this stuff out so people here might be more interested in working on it.)

twilit imp
#

Yah maybe. Ima be honest with you: it's not like hierarchy level organized. It's more of a thing where it's like you have nothing to do so you pitch in your 2 cents and if it contributes then yay

stone marlin
#

That's fine, lot'sa projects work like that. People here will prob be more interested in asking to work for it if they knew a little about what was gonna be involved, that's why I wanted to ask you all that stuff above. More likely to get interested peepz.

twilit imp
#

hmm what might make peeps interested....

stone marlin
#

Oh, I just meant the stuff you said already was fine. It's unlikely to get people if it's just like "hey dm me for a secret thing." but if you explain like you did above, it's more likely people will be like, "Okay, I can get into this."

twilit imp
#

Right right I see what you were saying

#

Yah you've been pretty helpful

stone marlin
#

No problemo, I can't promise anyone will be into it, but I hope that it piques someone's interest now!

twilit imp
#

OH! I know something that might interest a few weebs peeps: It involves an anime girl

serene scaffold
#

all the best things involve anime girls.

twilit imp
#

hehe lmao

serene scaffold
twilit imp
#

really? where?

serene scaffold
#

I suppose it doesn't fall under any specific rule, but we only allow people to recruit for open-source projects, because if the project isn't completely transparent, we can't verify that it isn't a business venture, or that it's ethical, or that contributors will get any value for their contribution, etc.

twilit imp
#

I literally banned a dude for trying to sell the project, so no not a business venture. I personally believe that if treated right it's perfectly ethical. I mean, this isn't going to be public even when it gets done, so even if we did give them credit (which we probably will anyways in a book I'm writing about it), they won't exactly become famous or anything.

serene scaffold
#

I'm not sure what you're referring to (I haven't read all the way back in the conversation), but in either case, we're not going to budge on the requirement that all recruited-for projects be open-source.

serene scaffold
#

If you're not willing to open-source the project, please permanently stop recruiting for it. Thanks!

twilit imp
#

What about an ambassadorial agreement?

serene scaffold
#

I don't know what that is.

twilit imp
#

Essentially, we'd invite a highly trusted member of the server to the project. Then, they'd look it up and down through and through and if they think it's ok, they'd come back and say it's ok verifying it.

serene scaffold
#

There is no way we'll allow you to recruit for a closed-source project. If you have any further questions or comments about this, please contact us via @sonic vapor.

twilit imp
#

sad.

#

Do you personally know of any other places where I can find peeps?

serene scaffold
#

No.

twilit imp
#

great. big help.

#

If you're gonna get mad at people for doing it, you should probably put it in #rules.

serene scaffold
#

I'll bring that up with the other moderators.

safe elk
twilit imp
#

yea I know I just don't know many coders that actually know enough to help instead of hurt

stone marlin
#

I get why the rule is in place: it's easy to exploit workers from this area. Same deal in a lot of the game dev discords. Makes sense.

#

Though I agree, if I can't point to a rule (or see a rule) it's hard to deter people from doing it. So, thank you for bringing it up with the other mods!

crisp sluice
#

anyone familiar with dynamic mode decomposition?

#

trying to do a little project and im getting a bit stuck

night quartz
#

Guys, can you help me out? I'm in VSCode right now. I added a .dot file and installed the Graphviz extension to render it, but when I click the three dots in the top-right corner there should be an option saying 'Open Preview to the Side', and it's not showing. Please help me out, I've been searching for an hour.

#

it’s supposed to show a visual decision tree

#

I usually use Pycharm so i am not familiar with VSCode

delicate sphinx
#

Does anyone have any tips on how to stop class imbalance on a Tensorflow model? I've looked at the Tensorflow guide using Credit Card Fraud Detection though my model has a few more classes (24,000 more to be exact)

#

it just outputs "yes" which is likely to be correct around 25% of the time

serene scaffold
delicate sphinx
#

I have a majority class of 25% of all of my training data inputs which my model is unable to circumvent and instead just uses the "most common" answer when making future predictions

serene scaffold
#

am I to understand that you're training a binary classifier where each item can either be "fraud" or "not fraud"?

delicate sphinx
#

it's the way it trains

#

No, I was a bit misleading by using the credit card fraud data I apologise, I'm using 24,000 different classes so I'm trying to use a Categorical Cross Entropy Loss

serene scaffold
#

you have 24k classes? lemon_exploding_head

#

how many training instances are there?

delicate sphinx
#

I just mentioned that because tensorflows only actual guide on class imbalance uses binary classifiers

#

well I have 248,000 bits of data to use

#

24,000 is a huge over-use anyway, but I wanted to include a larger vocabulary (output is one-hot)

serene scaffold
delicate sphinx
#

yeah i just didnt know how better to say

#

I have 248,000 triplets of input1, input2, output

serene scaffold
#

I see. And what algorithm are you using to classify them?

#

because I've never heard of any classifier having to learn 24k different classes.

delicate sphinx
#

I use a Categorical Cross Entropy for loss, an Adam for Optimizing and output as a softmax

serene scaffold
#

so it's a neural network of some kind?

delicate sphinx
#

the output is 24,000 large so that the softmax gives me a vector that I can turn into a one-hot encoded vector

#

yeah it's an MLP

serene scaffold
#

MLP?

delicate sphinx
#

Multi-Layer Perceptron

#

Basically a feed-forward NN made of several fully connected layers

serene scaffold
#

like I said, I've never heard of a classifier for that many classes, NN or otherwise

delicate sphinx
#

fair enough, thanks anyways

ashen umbra
#

hey I had a ques. So if you have a column of a bunch of tokens and then a list of words (outside the data frame). How can you check if a row contains any of the words?

#

this is the token from the df

#

and I have a list of words.. whose elements I wanna match with

serene scaffold
# ashen umbra

I assume this is a column of a DataFrame (ie, a Series). The only operation that a Series of lists supports is explosion.

If you want to check if a list contains at least one of a certain number of values, the best way is to put the values of interest in a set
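
A minimal sketch of that idea, assuming the column is called `tokens` and holds a Python list per row (the column name, the word list, and the sample data are all made up for illustration):

```python
import pandas as pd

# Hypothetical data: each row holds a list of tokens.
df = pd.DataFrame({"tokens": [["the", "cat", "sat"], ["hello", "world"], []]})

# Words of interest, stored in a set for fast membership checks.
words = {"cat", "dog"}

# isdisjoint() is True when the two collections share nothing,
# so its negation flags rows that contain at least one of the words.
df["has_word"] = df["tokens"].apply(lambda toks: not words.isdisjoint(toks))
print(df)
```

`apply` still loops row by row in Python, so this is about clarity rather than speed.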

rose pasture
#

Is data science in general much harder to learn than web dev?

delicate sphinx
stone marlin
#

24k classes? That is, by magnitudes, more than I've ever seen, haha.

delicate sphinx
#

👀

#

If I cut out some data I can get to 13k classes

stone marlin
#

What the heck are these classes where you need so many?

delicate sphinx
#

words

#

a lot of the classes I'm making are pointless because they're only used once in comparison to a majority class that's used for 25% of all answers

#

(1 class = 25% of all answers, 2 classes = 38% of all answers, 12,998 classes make up remaining 62%)

stone marlin
#

Dang, maybe it does require that, I don't do a lot of CV/NLP stuff. That still seems really, really high to me, and seems like, if it's that imbalanced, you could ensemble on another model which does the "Other" thing for labels who are predicted less than some number of times.

rose pasture
delicate sphinx
stone marlin
#

Also, if it's predicting one label, what label is that where it's so popular?

delicate sphinx
#

I managed to get it to consider different classes but I think it worked only because of a Tensorflow Seed giving me a good value for training

#

every other training hasn't worked

#

so my training is non deterministic

stone marlin
#

You don't even need another model, you could just post-process and send everything under a certain frequency to an "Other" class.
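
A rough sketch of that bucketing step, applied to the labels with pandas. The answers and the `min_count` threshold are invented for illustration; the right cutoff depends on the real label distribution:

```python
import pandas as pd

# Hypothetical answer labels; "yes" and "no" dominate, the rest are rare.
answers = pd.Series(["yes", "no", "yes", "2", "red", "yes", "no", "blue"])

min_count = 2  # assumed threshold: labels seen fewer times become "Other"

counts = answers.value_counts()
rare_labels = counts[counts < min_count].index

# Keep frequent labels as they are and replace the rare ones.
collapsed = answers.where(~answers.isin(rare_labels), "Other")
print(collapsed.value_counts())
```

This shrinks the number of output classes, at the cost of no longer being able to predict any of the rare answers individually.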

delicate sphinx
#

although now it's deterministic because it only outputs one thing

#

the popular label is "yes"

stone marlin
#

Second most popular is "no", I guess?

delicate sphinx
#

which accounts for about 23% of all answers

#

yeah thats like 17%

stone marlin
#

Okay, and then what's an example of another one that's sort'a popular?

delicate sphinx
#

something around that lol

#

the next most popular one is about 1000 (as opposed to no which is ~40,000)

#

think thats "2"

delicate sphinx
stone marlin
#

Ha. Okay, shot in the dark here, but what about this: an initial model to say if a question has a "Yes/No" answer --- which will feed into another model that gets that yes/no answer --- and the rest of the data goes to yet another model that excludes the yes/no.

#

If only to see the distribution of the other answers, and how to group those.

delicate sphinx
#

I wanna get something out of the way soon and then come back to it in the future

stone marlin
#

It's all good, just thinking aloud.

delicate sphinx
#

upload an "alpha"

#

but yeah that is a good idea haha

#

my whole idea was to focus on the question more than the image so a culmination of that would use that yes/no thing

stone marlin
#

Depending on what the DS job is, it might require some heavy maths --- which is, for some, hard to learn and might take around a year or more to learn well. That's why I think that path for DS might be a bit harder than web dev, since you've got to have the math AND coding background. But as you note, it depends strongly on what you wanna do with it.

#

If someone just wants to plug-and-chug things into sklearn models, prob doesn't need to know a whole lot of linear algebra / stats. Same for making simple webpages.

delicate sphinx
#

This guy does a pretty decent flask VQA model with semi-clear walkthrough

#

he has a supporting webpage writeup on it too that explains it further

#

though VQA probably isn't for first timers in Data Science as I found out x-x

stone marlin
#

There are some people out there that can do DS without knowing much of the maths, but in terms of careers I've not seen too many DS people without a good grasp of the math after entry-level. [My biases here: live in the US, work mostly in small-to-mid startups, mostly timeseries + fin + IIoT work.]

supple prism
#

Guys which is the best free source on internet to learn machine learning and data analysis

rose pasture
delicate sphinx
#

they have lots of online tutorials on their official website

#

definitely worth looking at

rose pasture
rose pasture
delicate sphinx
#

yeah

#

and then the models can be used to predict things

#

and as such, a flask application can provide a front end to the model

rose pasture
delicate sphinx
#

I wish, I'm nothing special but trying to create a VQA model

rose pasture
delicate sphinx
#

Yeah, hopefully you'll manage, lots of current people in the field are all self taught and lots of places offer apprenticeships (best of both worlds)

stone marlin
#

It can vary by the type of project, but I'd say it's like, 40% pipeline work, 50% EDA + making dashboards + talking to SMEs + going back and forth with the data-holders, and then like 10% research. Depends also on the DS person; I really dig the pipelining and devops side of things, so I usually will work with those teams, hence the more pipelining. Some colleagues will do more research, some more eda, etc., as the project requires. In a previous job it was more like 20% eda, 30% pipelining, and 50% building in-house engines.

delicate sphinx
#

I went to uni and paid £40k to wait for my supervisor to email back, 13 months on still waiting

stone marlin
#

It's not hard to get a job in the DS field --- since the title is pretty vague and can be anything from data analysis to data engineering to data science --- but it's hard to get a good job in DS where you're actually doing a lot of EDA + research, which is usually what people think of when they think of going into DS.

rose pasture
rose pasture
stone marlin
#

It's also the case that, at least in the US, there is a huge, huge, huge market saturation for DS at the entry level. We had an entry-level position open up at my previous company [travel company w/ a big ds focus] and we had 1.2k people apply to it. But, after parsing it down, we had around 200 people who actually were in any way qualified for it, something like 50 or so who could pass the hackercode challenge, I think 12 made it past the data analysis challenge (logistic regression, classifying problem), and we interviewed all of them. I think there were six good ones from that.

#

This is not to intimidate if y'all are going into entry level, but to note not to worry too much if people say, "Oh, we got 400 applicants!" because the vast majority are very much not good.

stone marlin
rose pasture
#

Damn lol what would you recommend someone to learn to put on their resume to even be qualified in the beginning?

stone marlin
#

I'd say that, for better or worse, the research + eda things are mostly given to the math-strong people, since, in my field at least, there's no way to just plop something into a NN and be done with it. You need to be able to explain results. So it's a lot of feature engineering + time series analysis, and that requires some pretty hefty stats know-how.

#

Note again that this is ONLY in my experience, and my experience is only a few small- to mid-sized startups in a city in the US.

#

I'd say that a common thread in who we hired and considered was:

  • Knowledge of Python/R (or some other language, we weren't too picky, and you can pick up either of those pretty quick if you know another).
  • Knowledge of SQL and roughly how DBs work and how to get data and clean data.
  • Knowledge of general ML methods and general supervised and unsupervised algorithms.

That's the bare minimum, IMO, but others may disagree. And it def depends on the industry. In a Computer Vision job, this would not be nearly enough. In a business job, it might not even need the first or third.

rose pasture
stone marlin
#

If you're starting out, I think focusing on Python / R (or some other language) is gonna be the best bang for one's buck. If the DS isn't working out, you can fall back on software engineering for a bit.

#

Yeah, ME will be totally fine as a degree. We've had people who are even like, history or english or whatever majors, that's fine --- so long as you knew the math and could demonstrate it, it was all good.

rose pasture
stone marlin
#

Or, roughly, what type of role you think you would want. You could always change. But if your passion is computer vision, or if your passion is self-driving cars, you might wanna look at those job descriptions and see what they want and build towards that.

rose pasture
stone marlin
#

Good, numpy + pandas are the backbone for this stuff --- but remember also that writing scripts in python is also useful, so that's good to touch on too.

#

Like, "Vanilla Python" writing.

rose pasture
#

Gotcha I'd have to review that as well. Is OOP important in data science?

#

It was kind of a hard concept for me to grasp in the beginning but i think i got the hang of it now

stone marlin
#

That's gonna strongly depend on where you go. My last place, it was very important. My second-to-last place, not at all.

#

I'd say keep trying with it, if only to strengthen your understanding of languages, design patterns, and architecture in general.

rose pasture
#

Yeah there are still lots to learn for me haha

stone marlin
#

For example, it may be the case that you will have to work with something like Airflow or PySpark or whatever, and you'll need to know how to maneuver around in those fairly-different-looking spheres. Knowing general python / general programming will help immensely, IMO.

delicate sphinx
#

Image captioning with visual attention or something is really good imo

#

But very complex

stone marlin
#

I know it's wildly unpopular to say this here, but I always strongly recommend against doing NN stuff at first when learning DS (unless that's a popular tool in the area you're going into). Having said that, it may not be bad to look at to see if something clicks.

delicate sphinx
#

You'd have to learn about custom encoder, decoder, attention, and training functions/classes, which is super confusing to go straight into

delicate sphinx
stone marlin
#

Even though everyone in this room seems to LOVE NNs, I've rarely had to use them for actual day-to-day work. Additionally, they're mostly their own API, so knowing them does not really translate to knowing Python better.

delicate sphinx
#

All I can think of besides nn is ensuring you understand the task at hand. If you can't figure out what you want to do it won't matter what you try. Understanding datasets is a huge help too. Always make sure you preprocess it into a form understandable to you

stone marlin
#

Making a model in general (linear reg, logistic, trees, whatever) is also pretty much just a two-line ordeal to fit, but it requires the coder to do more data cleaning and feature engineering, which do translate to learning Python better.

delicate sphinx
#

Anyways its 4.45am and I haven't been away from a screen for 17 hours today (i just cannot figure out my model haha)

#

So I must go to bed, gn lads/ladesses/ladems

stone marlin
#

Yeah, I think that preprocessing is also a fine way to go, if you're gonna focus on NNs. That's v important to learn.

#

Gn!

#

As an alternative to NNs, I'd recommend some of the more basic Sklearn tutorials and focus on things like logistic regression, linear regression, tree models (random forests, boosted trees, etc.). I feel like this stuff is easier to "look into" and see what's happening, whereas NNs are often a big bulky black box.

#

This'll prob be covered in your ML class, though.

rose pasture
#

Yeah I honestly don't know what my focus will be at this moment! I'll take what you said into consideration for sure. I still have a long way to go lol self driving car is really interesting now that you've said it haha i'd have to do more research. Yeah I see these topics are included in my ML class. Hopefully I can start working on a project by myself after this class

stone marlin
#

Haha, don't worry about it too much. I'd focus on getting through the classes, seeing where you are, and keeping on codin' things (even non-ML/DS things!) in Python or R. That'll put you in a good spot.

rose pasture
#

Thanks man I really appreciate you taking the time to talk to me!

desert oar
#

or plunk around with deep learning while you cool off between stats theory sessions 😉

stone marlin
#

Yeah, I can't imagine going into deep learning without a solid foundation. I still get confused by the stuff and I've been doing this nonsense for years, haha.

desert oar
#

i do actually think it's a useful toolset nowadays, if only because data scientists (in the "good" jobs) tend to be very close to management still, and need to make informed decisions about what tools and methods to use

#

data science is already too big of a field for any one person to be an expert in all of it. but much like programming, you would do well to develop "T-shaped" knowledge

#

broad but cursory knowledge of many things, deep knowledge of one thing

#

and be good at math. it will just make you faster and more efficient and better at everything data-related

stone marlin
#

Yeah, I'm not sure how much time beginners should spend with it, but certainly after reaching mid-level or so, one should know their way around some of the more useful NNs.

desert oar
#

right

#

heck even within traditional stats you can probably never know even close to all of it

stone marlin
#

I feel like I am strongly biased against them in places like this because I've had to interview so many entry-level / intern DS people that ONLY know how to set up a TF thing or have done a tutorial on dog-recognition or something.

#

Oh, yeah, pretty much undergrad-level stats is usually good enough. Rarely do I need anything more than that!

#

NNs are def popular because of the cool things they can do, but it's one of those things where it's like --- yeah, you might be a carpenter who can REALLY use a saw well, but if you can't use nails or hammers or measuring tapes or... then you're basically going to be useless on a construction team.

desert oar
stone marlin
#

I've got a v specific gig that'll prob dox me, haha, but I'll say in general I work around the Industrial IoT space. I've worked in travel and medical as well.

#

Hence my emphasis on methods with explainability and my love for time series analysis junk, haha.

#

What're you in, if you don't mind my asking?

desert oar
#

fair enough

#

i am currently a software engineer, taking a break from data science basically

#

although very much wanting to get back into DS

stone marlin
#

That's great, though. I feel like there are so many DS projects that can benefit from SE knowledge and best practices.

desert oar
#

i was at a large P&C insurance company before this, but was having a difficult time for a variety of reasons, so i burned out and quit

#

yeah it seems to be a desirable skillset

stone marlin
#

Ha! I was in lending at one point, that nonsense is so easy to burn-out in. I definitely get it.

desert oar
#

in hindsight i was probably a productivity multiplier

#

not that useful on my own, but made everyone around me more productive by being able to pick up all their programming slack

stone marlin
#

Which def could lead to that burnout.

desert oar
#

for sure, i was also a very susceptible person at the time

#

i'm very lucky i was able to quit (thanks covid!) and reset

stone marlin
#

I did a very similar thing, it was super good. Gives time to learn new stuff and reposition.

desert oar
#

plus now i have a spouse who keeps me from becoming a degenerate again

desert oar
#

my only fear when i quit was that the industry would pass me by, and my skills would atrophy, and i'd be unemployable in DS

#

that seems to not be happening, so it does seem good

#

when did you take your break / how long?

stone marlin
#

Prob around half-a-year or eight months, and I felt the same way.

#

Lots of new stuff out, and new stuff in Python to get good at, but otherwise it looks pretty much the same. PySpark looks to have gotten a bit better, too, haha, but I haven't started on that yet.

desert oar
#

heh yep i noticed the same about pyspark

safe elk
stone marlin
#

Yeah, that's to be expected, I guess. :'] It's all bright and shiny, v alluring.

safe elk
safe elk
safe elk
stone marlin
#

Speaking of learning new things, someone mentioned Streamlit to me today in here and I started using it --- it's pretty fantastic, I really dig it.

magic dune
#

hello, this is my brute-force linear regression code

# imports
import matplotlib.pyplot as plt
import numpy as np


def euclidean_distance_calc(y, data):
    """Euclidean distance between the predicted y values and the observed ones."""
    data_y = data[:, 1]
    # Square the residuals before summing so that positive and negative
    # errors cannot cancel each other out.
    return np.sqrt(np.sum((y - data_y) ** 2))


def plotting(x, y, data, loss):
    """Plot one candidate line on top of the data points."""
    color = '#1C2833'
    plt.plot(x, y, label=f"Score is {loss:.2f}")
    plt.xlabel('x', color=color)
    plt.ylabel('y', color=color)
    plt.scatter(data[:, 0], data[:, 1])
    plt.legend(loc='upper left')
    plt.axis("equal")
    plt.grid()


def main():
    # Variables
    L = 100  # number of random guesses
    data = np.array([[1, 1], [2, 2], [3, 3]])
    x = data[:, 0]
    for i in range(L):
        # Guess a slope and an intercept at random.
        m = np.random.randint(1, 4)
        b = np.random.randint(1, 10)
        y = m * x + b
        loss = euclidean_distance_calc(y, data)
        plotting(x, y, data, loss)


if __name__ == "__main__":
    main()
    plt.show()

It works. The problem is I want a better way to scale m and b to the data instead of just guessing.

Yes, I do know I can use calculus, but I would rather not, because this is meant to be a brute-force version.

stone marlin
desert oar
# magic dune thanks for the source

you might also be interested in other numerical optimization techniques like newton's method / newton-raphson method, coordinate descent, et al
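
As one illustration of those techniques, here is a small coordinate descent sketch for the same line-fitting problem. It is only a sketch: the function name `fit_line_coordinate_descent` and the iteration count are made up, and it assumes a squared-error loss and that `x` is not all zeros.

```python
import numpy as np


def fit_line_coordinate_descent(x, y, n_iter=500):
    """Least-squares line fit that updates m and b one at a time."""
    m, b = 0.0, 0.0
    for _ in range(n_iter):
        # Best slope for the current intercept (minimizes squared error in m alone).
        m = np.sum(x * (y - b)) / np.sum(x * x)
        # Best intercept for the current slope.
        b = np.mean(y - m * x)
    return m, b


data = np.array([[1, 1], [2, 2], [3, 3]], dtype=float)
m, b = fit_line_coordinate_descent(data[:, 0], data[:, 1])
print(f"m={m:.3f}, b={b:.3f}")
```

Each step solves a one-variable problem exactly, so there is no random guessing and no step size to tune.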

safe elk
desert oar
#

indeed, although i usually file those things away as "things i mostly don't need to know the details of"

#

i've used lbfgs before but not for any strong technical reason, it's just what gave good results quickly

safe elk
#

Yep the libraries...but best to know how they work

#

The prof though made the students implement some of those algos from scratch in C or Pascal ..

desert oar
#

probably useful if you want to be a developer of scientific software or a machine learning engineer

#

maybe not that useful if you want to be a statistician

#

but yeah i agree you should at least know how they work in general

#

even if you haven't worked through the underlying theory

safe elk
#

Done a bit of scientific software and ML myself but bulk of my career was in Traditional Web and Desktop software dev

#

I enjoy the maths though and like understanding how things work in general

stone marlin
#

I always forget about newton, and that'd prob work fine here. I remember RK-4! That's --- well, something with ODEs. I guess I don't remember it as well as I should.

lapis sequoia
#

guys, i have question

#

i have two datetime columns, each one starting at a specific date.

neither of them has null values.

the first runs from 2016 to 2021, and the second runs from 2014 to 2018, and they have some overlapping points.

The question is: how can i combine those two columns into one while preserving the rest of the columns? obviously i will get more rows after this operation.
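
One way to read this is as an outer join on the date. The sketch below assumes each date column has its own value column; the names `date1`, `price1`, `date2`, `price2` and the values are invented for illustration:

```python
import pandas as pd

# Hypothetical frame with two (date, value) pairs side by side.
df = pd.DataFrame({
    "date1": pd.to_datetime(["2016-01-01", "2017-01-01", "2018-01-01"]),
    "price1": [10, 11, 12],
    "date2": pd.to_datetime(["2014-01-01", "2016-01-01", "2018-01-01"]),
    "price2": [20, 21, 22],
})

# Split into one frame per date column and give the key a common name.
left = df[["date1", "price1"]].rename(columns={"date1": "date"})
right = df[["date2", "price2"]].rename(columns={"date2": "date"})

# An outer merge keeps every date from both sides; gaps become NaN.
combined = (
    left.merge(right, on="date", how="outer")
    .sort_values("date")
    .reset_index(drop=True)
)
print(combined)
```

Dates that appear on only one side get `NaN` in the other side's columns, which is where the extra rows come from.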

marsh yacht
#

"Environment variable $DATABASE_URL not set, and no connect string given."

#

i have a sql error

prisma lake
#

Data science and AI are on my list of things to learn in 2022. Can anyone recommend where to start?

bold timber
#

Hi, I have a question: Do we need polynomial features in classification?

novel raven
#
```py
import numpy as np

# read_x, y_read and model are defined earlier in the script (not shown).
x_train = np.array(read_x).reshape(-1, 1)
y_train = np.array(y_read[::-1])  # reverse the target order

model.fit(x_train, y_train)
# Trained up to 452. The prediction for 453 is correct, but raising the input
# to 460 or more gives the same value; inputs below 452 predict as expected.
prediction = model.predict(np.array([453]).reshape(-1, 1))
print(prediction)
```
Whenever I increase the input beyond the training range, the prediction stops changing: any value that is not covered by the training data gives the same output. When I lower it back into the training range, it predicts as expected.
#

I'm very new to sklearn and this is one of my first projects. I haven't watched many tutorials

#

also im using RandomForestRegressor
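
A likely explanation, though nobody in the channel confirmed it: tree-based regressors predict by averaging training targets within leaves, so they cannot extrapolate past the range they were trained on. The sketch below reproduces the effect on made-up data (the linear target and the `n_estimators` value are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: inputs 1..452 with a simple linear target.
x_train = np.arange(1, 453).reshape(-1, 1)
y_train = 2.0 * np.arange(1, 453)

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(x_train, y_train)

# Inside the training range the predictions follow the trend.
inside = model.predict(np.array([[100], [300]]))
# Beyond the largest training input every tree lands in its last leaf,
# so all of these inputs receive the same prediction.
outside = model.predict(np.array([[453], [460], [1000]]))
print(inside)
print(outside)
```

If the target is expected to keep growing past the training range, a model that can extrapolate, such as a linear regression, may be a better fit for this task.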

jovial junco
#

anyone here know neat?

#

;-; i need help

vapid sentinel
#

Can anybody help me with machine learning problems?

night gorge
#

I was trying to fill the missing values of a categorical column with its mode. I used the following code, where "Embarked" is the categorical variable. But it's not working..
df["Embarked"].fillna(df["Embarked"].mode())

keen storm
#

hi all! I've been playing with pandas a bit lately trying to sort through some scientific data. It's been a blast but there are some simple things that I can't seem to be able to google for.
I have a table with different columns. Some of the values are the same for all columns, some instead appear only in certain columns. I'd like to sort all the columns so that values that match are all next to each other.
For example, if ColumnA has value "10", and ColumnB and C also have a value "10", then all the "10"s are put next to each other. If any of the columns is missing "10", then the field is left blank.

#

for v in sample['ColumnA']:
    print(v in sample['ColumnB'].values)

#

I can check if values exist in different columns like this. I'm unsure of how to do this for multiple columns and how to implement the logic of sorting the values accordingly or leaving a field blank

serene scaffold
keen storm
#

I just thought that maybe I could pivot. I'm not sure if I'm looking in the right direction.

#

Maybe I can have values as indexes and have column names just show the presence (true or false)

serene scaffold
#

can you give a copy/pastable sample of the data?

#

df.head().to_dict('list')

#

also, are you trying to do this for some practical reason, or are you just experimenting?

#

it doesn't seem like a very useful thing to do

keen storm
#

I'm not sure if pivot is what I'm looking for though

serene scaffold
#

I need to do something else for a few minutes, but I will only be able to continue helping if you provide the result of print(df.head().to_dict('list'))

serene scaffold
night gorge
#

{'Survived': [0, 1, 1, 1, 0], 'Pclass': [3, 1, 3, 1, 3], 'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'Heikkinen, Miss. Laina', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'Allen, Mr. William Henry'], 'Sex': ['male', 'female', 'female', 'female', 'male'], 'Age': [22.0, 38.0, 26.0, 35.0, 35.0], 'SibSp': [1, 1, 0, 1, 0], 'Parch': [0, 0, 0, 0, 0], 'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450'], 'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05], 'Embarked': ['S', 'C', 'S', 'S', 'S']}

serene scaffold
#

Thank you

serene scaffold
lapis sequoia
#

Tensorflow, anyone?

serene scaffold
night gorge
#

It is a long dataset with 891 rows. In two rows the "Embarked" values are missing. So I tried filling the missing values with the mode using the following code.
df["Embarked"].fillna(df["Embarked"].mode())
But after that when I try to print sum of na, it is still showing 2.
df.isna().sum()
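
Two things are probably going on here: `fillna` returns a new Series instead of changing `df` in place, and `mode()` returns a Series (there can be ties), which `fillna` aligns by index rather than using as a single fill value. A minimal sketch of a fix, on a small made-up frame rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data, with two missing values.
df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", np.nan]})

# Take the first mode as a plain scalar, then assign the result back.
most_common = df["Embarked"].mode()[0]
df["Embarked"] = df["Embarked"].fillna(most_common)

print(df["Embarked"].isna().sum())
```

When several values tie for most frequent, `mode()[0]` picks the first one in sorted order, which may or may not be the choice you want.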

serene scaffold
lapis sequoia
night gorge
serene scaffold
night gorge
lapis sequoia
serene scaffold
serene scaffold
vapid shoal
odd meteor
keen storm
#

I'm trying to do a simple outer join to solve my problem. However, I have to do a join among a df rows, not among different dataframes. It seems to not be possible.

#

I tried with concat(), but the result of the join is not what I expect

keen storm
serene scaffold
#

Alright, let me see.

keen storm
#

the practical purpose is to find out whether specific metabolites show up in different mediums (and which)

#

so it's an experiment repeated using different mediums, and I need to find out how the medium affects the production of certain metabolites. For that I need to visualize at a glance which metabolites are present in which mediums
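
One possible shape for that at-a-glance view is a presence table with metabolites as the index and mediums as columns. This is only a sketch on invented data; the column names and values stand in for the real measurements:

```python
import pandas as pd

# Hypothetical data: one column per medium, values are metabolite IDs.
sample = pd.DataFrame({
    "ColumnA": [10, 20, 30],
    "ColumnB": [10, 30, 40],
    "ColumnC": [20, 10, 50],
})

# Long format: one row per (medium, metabolite) pair.
tidy = sample.melt(var_name="medium", value_name="metabolite").dropna()

# Count each pair, then turn the counts into True/False presence flags.
presence = pd.crosstab(tidy["metabolite"], tidy["medium"]) > 0
print(presence)
```

Each row then answers "in which mediums does this metabolite appear?", which is close to the values-as-index idea mentioned earlier.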

vale wedge
#

hi, may i ask a question? right now i'm working on my final year project and planning to add data visualization to it. does anyone have suggestions on how to do it?

keen storm
#

for further reference, this is what I have now

#

this is what I want to end up with

#

this is a spreadsheet, but it shows what I want

#

alternatively, this works

#

I feel like this should be very simple. Maybe I'm overthinking it

odd meteor
# rose pasture Thanks ill check it out! TensorFlow sounds interesting ill have to learn that fo...

While TensorFlow might appear alluring you might wanna consider learning OOP in Python as well just so you can easily get yourself acquainted with PyTorch.

Knowing 2 Deep Learning frameworks can easily increase your worth, make you super flexible, and perhaps a little bit indispensable (with the right attitude) 😀

Think of someone in the software dev domain who knows React + Vue.js + Laravel

Moral of the story: try not to be overly dependent on one framework. Know at least two.

vale wedge
#

basically my web app already has a deployed machine learning model, and now i'm planning to add visualization to the website

#

anyone have any suggestion for me?

serene scaffold
#

@keen storm unfortunately I haven't come up with a solution yet, but I have to do something else. I'll leave it running on my computer in case I get a chance to try again later.

odd meteor
odd meteor
odd meteor
# bold timber Hi, I have a question: Do we need polynomial features in classification?

Since the goal is to drive the loss function as low as possible, engineering polynomial features can help your model learn patterns in the data that a linear decision boundary would miss, at the cost of a higher risk of overfitting.

However, remember the response variable being predicted in any classification task is always a discrete value.

odd meteor
lapis sequoia
odd meteor
pastel valley
#

what metrics or method i can use to compare 2 identical cnn models but trained with different augmentation techniques and checks which model performs best?

#

or should say more reliable and accurate

lapis sequoia
#

and yes i was able to merge successfully when using a sample of 10 rows from that data

odd meteor
odd meteor
lapis sequoia
lapis sequoia
#

very messy data

odd meteor
# lapis sequoia very messy data

OK, but assuming you're doing 2800 x 10, that would have been 28,000 rows, yeah?

But 28,000 rows isn't enough to trigger a memory error. 🤔

odd meteor
lapis sequoia
#

i have built 10 new dataframes from the big one, every df have two columns, and then i remerged again

lapis sequoia
odd meteor
lapis sequoia
#
import itertools

import numpy as np
import pandas as pd

df = pd.read_excel('sample_question2022-01.xlsx')

# Drop columns that are mostly empty (more than 2300 missing values).
for column in df.columns.tolist():
    if df[column].isnull().sum() > 2300:
        df.drop(column, axis=1, inplace=True)

# Rename the remaining columns to date1, Price1, date2, Price2, ...
count_date = itertools.count(1)
count_price = itertools.count(1)
for column in df.columns.tolist():
    if df[column].dtypes == 'datetime64[ns]':
        df.rename(columns={column: f'date{next(count_date)}'}, inplace=True)
    else:
        df.rename(columns={column: f'Price{next(count_price)}'}, inplace=True)

# Outer-merge each (date, price) pair on its date index.
columns = df.columns.tolist()
merged = df[[columns[0], columns[1]]].set_index('date1')
k = 2
for i in range(2, len(columns) - 1, 2):
    merged = pd.merge(merged, df[[columns[i], columns[i + 1]]].set_index(f'date{k}'), how='outer', left_index=True, right_index=True)
    k += 1
lapis sequoia
odd meteor
lapis sequoia
#

yes it still

odd meteor
#

I do have a new observation. What happens to other column(s) if the date column successfully gets merged?

This obviously will increase the shape of final dataframe and even populate every other non date column with NAN.

Do you see the angle I'm coming from yet?
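
A minimal sketch of the point about the merged shape, using made-up frames in place of the real spreadsheet (all values here are assumptions). It shows that an outer merge on the index keeps the union of dates and fills the gaps with NaN, and that repeated keys on both sides multiply rows. pandas also matches missing keys against each other, so rows with an empty date could be one contributor to the memory error in very messy data; that is a guess, not something confirmed in the conversation.

```python
import pandas as pd

# Hypothetical stand-ins for two of the (date, price) column pairs.
a = pd.DataFrame(
    {"Price1": [1.0, 2.0]},
    index=pd.to_datetime(["2022-01-01", "2022-01-02"]),
)
b = pd.DataFrame(
    {"Price2": [10.0, 20.0]},
    index=pd.to_datetime(["2022-01-02", "2022-01-03"]),
)

# Outer join on the index: union of dates, NaN where a side has no value.
merged = pd.merge(a, b, how="outer", left_index=True, right_index=True)
print(merged.shape)  # (3, 2)

# If the same date repeats on both sides, every left row pairs with every right row.
dup_a = pd.DataFrame({"Price1": [1.0, 2.0, 3.0]},
                     index=pd.to_datetime(["2022-01-01"] * 3))
dup_b = pd.DataFrame({"Price2": [10.0, 20.0, 30.0]},
                     index=pd.to_datetime(["2022-01-01"] * 3))
blown_up = pd.merge(dup_a, dup_b, how="outer", left_index=True, right_index=True)
print(len(blown_up))  # 9
```

Checking `df['date1'].duplicated().sum()` and `df['date1'].isna().sum()` for each date column before merging would show whether either effect applies to the real data.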

lapis sequoia
odd meteor
lapis sequoia
#

i will try this later, Thank you very much :)

odd meteor
rose pasture
whole birch
#

For data cleaning, pandas is your go to for most applications

#

For 100Gb+ datasets pandas starts to struggle. Then a distributed analytics library like PySpark or Dask is required

desert oar
#

keep in mind that most machines don't have 100 GB of ram, regardless of what pandas can technically handle 🙂

pastel valley
#

thank you sir

desert oar
# pastel valley oh so that is what the validation set is about

In general, the objective of any train/test split, cross validation, bootstrapping, etc. is to emulate "out of sample" data: data that your model has not yet seen and that was not involved in training. This is important if you want to estimate how well the model generalizes beyond the training sample.
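
A small sketch of that idea with NumPy only, on synthetic data (the data, the 150/50 split, and the polynomial degrees are all assumptions chosen for illustration). The error on the held-out points is the estimate of out-of-sample performance; the training error alone always looks better as the model gets more flexible.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=200)

# Shuffle once, then hold out 50 points the model never sees during fitting.
idx = rng.permutation(len(x))
train, test = idx[:150], idx[150:]

def train_test_mse(deg):
    """Fit a polynomial on the training rows only; return (train MSE, test MSE)."""
    coefs = np.polyfit(x[train], y[train], deg)
    pred = np.polyval(coefs, x)
    train_mse = np.mean((pred[train] - y[train]) ** 2)
    test_mse = np.mean((pred[test] - y[test]) ** 2)
    return train_mse, test_mse

for deg in (1, 3, 9):
    tr, te = train_test_mse(deg)
    print(f"degree {deg}: train MSE {tr:.4f}, test MSE {te:.4f}")
```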

plain sundial
#

Where would I start with ai?

#

I have some decent knowledge with python. I am just wondering if there is an article to read off of or a book to read or something?

delicate sphinx
rose pasture
#

how often do you run into big projects like that?

delicate sphinx
#

Depends what you do and what data type you have

#

I have images that take about 169GB (cached image feature files and images themselves)
The text files alone are less than 1gb

desert oar
#

im curious what you're doing with all those images, i assume you don't need to keep them all in memory at once

delicate sphinx
#

Its unlikely you'll face such large amounts of it all

delicate sphinx
desert oar
#

at work we had a server with 256 gb of ram, it was nice until someone left a notebook running with 100 gb of ram used, on thursday, and didn't clean it up until the following tuesday

delicate sphinx
#

Annoyingly tensorflow is slower to duplicate the same file 3 times than it is to load the same file 3 times

#

But I stand by my choice of loading image once and duplicating its values 3 times

#

Dont wanna kill my hard drive any more x-x

#

I had a friend who worked in security camera detection stuff and his company had this experimental model to train so on a weekend when everyone was gone they let this guy use all the resources he could and I think he ended up blue screening on about 750GB of ram

#

Well, running out of ram, probably not blue screening these days

desert oar
#

i wouldn't worry about your ssd dying

#

unless this is on a personal machine and not a work machine

odd meteor
# rose pasture Thanks! Do you have any other recommendations as to what to learn? So I far i pl...

Why not follow a structured course? Learning each data science tool/ library one after the other can slow your progression.

Have you checked out Udemy and Coursera courses on Data Science yet?

If you can afford about $50 use it to buy a well-structured course on Udemy (at least, Udemy is quite cheaper than most courses on other platforms)

Of course, you can always use YouTube as a second resource to augment what you've learned from Udemy

delicate sphinx
#

My current PC performs quicker than google colab (except for times I have to download data due to bad Wi-Fi) - especially as Google Colab RAM/Storage even on Pro is too small for my task

rose pasture
odd meteor
warm wedge
#

Hello guys, is data science still a relevant field to get into/ start learning rn? if yes, what learning path do you recommend?

spare junco
#

Hello

#

Can someone Explain what are activation functions and what are they really used for in simple terms?

odd meteor
tidal bough
# spare junco Can someone Explain what are activation functions and what are they really used ...

Neural networks basically work by, at each layer, multiplying a vector by a matrix and applying an activation function to it, until you reach the end.

Without activation functions, any number of such multiplications would be equivalent to one (because any number of linear operators applied in order is a linear operator). And almost no data is a linear mapping of inputs to outputs, so without activation functions, it'd be impossible for NNs to work.

So we introduce nonlinearities of some kind - that's what activation functions are, pretty much any nonlinear function. In practice, you need one that has a derivative almost everywhere (because you need it for backpropagation), is cheap to calculate, and (that's the most advanced requirement) ideally behaves nicely under backpropagation - has nonzero derivatives everywhere, etc.

#

*have nonzero derivatives everywhere
though as RELUs show, even that is sometimes not required
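
The collapse of stacked linear layers can be checked numerically. A minimal sketch with hand-picked 2x2 weights (the numbers are arbitrary assumptions): two linear layers equal one matrix, while putting a ReLU between them breaks the linear property `f(x) + f(-x) = 0`.

```python
import numpy as np

W1 = np.array([[1.0, -2.0],
               [0.5,  1.0]])
W2 = np.array([[1.0, 1.0]])
x = np.array([1.0, 1.0])

# Two linear layers in a row are the same as one layer with weights W2 @ W1.
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layers, one_layer))  # True

def relu(z):
    return np.maximum(z, 0.0)

def net(v):
    """Same two layers, but with a ReLU between them."""
    return W2 @ relu(W1 @ v)

# A linear map always satisfies f(x) + f(-x) = 0; the ReLU network does not.
print(two_layers + W2 @ (W1 @ -x))  # [0.]
print(net(x) + net(-x))             # [2.5]
```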

odd meteor
# spare junco Can someone Explain what are activation functions and what are they really used ...

Let me break it down as much as I can

An activation function helps our neural network make more accurate predictions. The main reason we use one is this:

Not all relationships in a dataset are linear; the boundary or surface the model has to learn is often curved rather than a flat plane or hyperplane... Remember planes and hyperplanes from geometry, yeah? Cool.

By using an activation function, our neural net can capture non-linearities in the dataset, which leads to more accurate predictions.

spare junco
#

Thank You Very much for explaining!!

velvet pivot
#

Hello everyone, I can help you if you have any questions or concerns about the software. or you can contact me from this account instag @ai.engineer1

fast glacier
#

Hello! I was wondering if someone might be able to kindly help me out: I'm working with a pandas dataframe, and I'm trying to iterate through only columns that have int values, and ignore the columns with other datatypes.
numerical_columns = [column for columns in identity_survey.columns if identity_survey[column].dtypes() == int]

print(numericalcolumns)

I get a TypeError: 'numpy.dtype[object]' object is not callable
I've tried (and unfortunately already deleted) multiple other approaches with no luck
identity_survey is the name of the dataframe

desert oar
#

"not callable" means "you can't use it like a function", i.e. you can't use ()

#

not sure if you can use np.integer alone or if you need int as well
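
A minimal sketch of one way to pick the integer columns, assuming a small made-up frame in place of `identity_survey`. `dtype` is an attribute, so it takes no parentheses, and `pandas.api.types.is_integer_dtype` covers all integer widths without comparing against `int` directly.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# Hypothetical stand-in for the real identity_survey dataframe.
identity_survey = pd.DataFrame({
    "age": [31, 45, 27],
    "name": ["a", "b", "c"],
    "score": [0.5, 0.7, 0.9],
    "children": [0, 2, 1],
})

numerical_columns = [
    column
    for column in identity_survey.columns
    if is_integer_dtype(identity_survey[column].dtype)
]
print(numerical_columns)  # ['age', 'children']
```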

fast glacier
#

mmm weird, my notes say dtypes...

#

checking now

desert oar
spare junco
fast glacier
#

thanks @desert oar

desert oar
#

the reason we use nonlinear activation functions is to help the network find more interesting non-linear relationships in the data

#

imagine that the network is a child building stuff out of blocks, and they only have rectangular blocks

#

that would be like if you only had linear activation functions, everything would still end up kind of rectangular

spare junco
desert oar
#

no, i mean generally the relationship between inputs and outputs

#

the function that the NN is learning to approximate

#

if the child is building stuff out of various funky-shaped blocks, they could build more sophisticated shapes in the end

spare junco
#

Right

desert oar
#

same reason your network can learn more if you have more nodes and layers: more blocks to build out of

#

note that this is a very very highly non-mathematical analogy

spare junco
#

So does the activation function help in shaping the network and nodes in such a way to make it more efficient in predicting?

desert oar
#

not necessarily more efficient, but it is necessary in order for a network to learn arbitrarily sophisticated functions

spare junco
#

Right

#

Got it!!

#

Now i will have a good sleep

#

lol

desert oar
#

i see melat0nin is typing, they know what they are talking about so you might want to stay online for a few more minutes 🙂

spare junco
#

Alright

stone marlin
#

Haha, I wasn't gonna say anything mathy here. :'] One of the exercises that made me really "get" the steps of NNs was trying to do ezpz perceptron exercises.

#

You got all of it, I didn't need to add anything, haha.

#

The math part is pretty much that you get linear combos of linear combos without activation functions, which is just kind of boring. You pretty much just get new features that look like 3.0 * petal_length + 4.1 * sepal_width - ... stuff like that. Which is fine for some things, but for other things they may not have a clean linear relationship.

spare junco
#

Thank You @desert oar

stone marlin
#

I'm having a ton of fun with streamlit, y'all, if you haven't checked it out yet, check it out.

odd meteor
desert oar
#

interesting

#

i always thought streamlit was some kind of zoomer streaming tool 😆

stone marlin
#

Oh, interesting! It seems like Streamlit is made for like, dashboarding EDA / Results, but Gradio is more for like, showing off models?

#

These both seem really cool, I'll try to make something cool in Gradio this week. :'']

odd meteor
stone marlin
#

Oh, this is neat. Yeah, Streamlit seems like the "Python Version of Shiny" to me, but I'm still learning it. This is kind'a cool, and I wonder how it compares with or complements things like H2o and MLFlow. I'll have to dig in. It looks really slick tho.

desert oar
#

i was wondering about mlflow here too

stone marlin
#

I haven't touched MLFlow in like, two years --- but when we were using it, I was like, heavy into it. We had a fork of it we added junk to and it was great. H2o was also really neat --- our modelers really liked the charts and stuff, but idk how that is now. I gott'a look back into this stuff.

odd meteor
#

I don't have experience with MLOps yet so I can't say for now

stone marlin
#

It's prob the case that MLFlow works fine with this, since MLFlow (at least the tracking part) just kind of sits in the code and reports stuff to the DB.

#

I'll check it out tho, I'm so excited about the new stuff that's come up recently (or not-so-recently) that I get to learn about, haha.

serene scaffold
#

@tidal bough @odd meteor thanks for your comments on activation functions. I somehow didn't know a lot of that lemon_hyperpleased

pastel valley
stone marlin
#

You can measure it with pretty much any metric you want. F1, accuracy, recall, precision, etc.
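
For reference, a plain-Python sketch of how those four metrics come out of the same set of predictions (the labels below are made up). To compare two models fairly, both would be scored this way on the same held-out test set.

```python
# Hypothetical binary labels and predictions on a shared test set.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```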

light hemlock
#

How to determine which columns to drop in knn?
I know which column is categorical , continuous and is class attribute.
I assume:
Continuous data is for training set
Categorical is meant to be dropped
Class attribute is a thing to be predicted?
Example:
I tried to make a heatmap (picture). On left is whole dataset, on right head(20). ||11th column had 0.5 value on first 20 records after normalisation||

desert oar
#

drop the bad ones, keep the good ones 😉

#

that heatmap is the distance matrix between data points?

light hemlock
errant shore
#

If i have a df that contains user ratings for movies (columns=[userID, movieID, rating,....]), what would be the most efficient way to create a df that contains the count of users that rated both movies for all combinations of movies. Right now I'm doing this iteratively for every combination of movie ids like this, but I'm looking for a way to speed it up. Any suggestions?

def foo(id1, id2):
  id1_users = set(df[df["movieID"] == id1]["userID"].to_list())
  id2_users = set(df[df["movieID"] == id2]["userID"].to_list())
  combined = len(id1_users & id2_users)
  return combined
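
One possible way to get all pair counts at once, sketched on a tiny made-up ratings frame: build a 0/1 user-by-movie table and multiply it by its own transpose. Entry `(i, j)` of the product is the number of users who rated both movie `i` and movie `j`; the diagonal holds the number of raters per movie.

```python
import pandas as pd

# Hypothetical ratings; only userID and movieID matter for the counts.
df = pd.DataFrame({
    "userID":  [1, 1, 2, 2, 3, 3, 3],
    "movieID": [10, 20, 10, 30, 10, 20, 30],
    "rating":  [4, 5, 3, 2, 5, 4, 1],
})

# Rows are users, columns are movies, 1 if the user rated the movie.
watched = pd.crosstab(df["userID"], df["movieID"]).clip(upper=1)

# (movies x users) @ (users x movies) gives a movies x movies count table.
co_counts = watched.T @ watched
print(co_counts.loc[10, 20])  # 2
print(co_counts.loc[20, 30])  # 1
```

The dense table needs users x movies cells, so for a large catalogue a sparse matrix would probably be the safer choice.
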
vale wedge
#

does anyone know why this error occur?

#

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

errant shore
#

result is a list array, so you cant check for equality with an int

vale wedge
#

sorry, but i dont clearly understand

#

@errant shore

errant shore
#

Your result variable is a list array

#

So you cant check if a list array is equal to an int

tidal bough
#

array, not list. And the problem isn't quite that you can't compare it to an int - you can - but that the result isn't a boolean, but an array of booleans

#

and you can't use an array of booleans as a condition

#

Presumably, you expect result to be a single-element array. If that's the case, extract its only element and do this stuff on it, not on the entire array.
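
A short sketch of what is going on, with a hand-written array standing in for the real prediction output:

```python
import numpy as np

result = np.array([2, 4, 4])   # stand-in for a multi-element prediction array
mask = result == 4             # elementwise comparison, not a single bool
print(mask)                    # [False  True  True]

raised = False
try:
    if mask:                   # Python cannot reduce three booleans to one
        pass
except ValueError as err:
    raised = True
    print("ValueError:", err)

# With a single-element array, pull the value out before comparing.
single = np.array([4])
value = single.item()
if value == 4:
    print("matched")
```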

errant shore
#

yes my bad array not list

vale wedge
#

hmm i still try to fix this one

#

how about Use a.any() or a.all()

#

`# Let's predict!
import numpy as np
newData = np.array([
    2, 2, 3, 3, 3, 2, 2, 4, 4, 4,
    4, 3, 5, 2, 3, 4, 4, 3, 4, 4,
    1, 4, 3, 5, 1, 5, 2, 5, 5, 4,
    3, 4, 4, 4, 4, 4, 4, 2, 5, 4,
    3, 2, 4, 2, 4, 2, 4, 2, 4, 4]).reshape(-1,1)

result = gnb.predict(newData)
print(result)
if (result == 0):
    print("Your Personality Is Agreeableness")
elif (result == 1):
    print("Your Personality Is Conscientiousness")
elif (result == 2):
    print('Your Personality Is Extrovert')
elif (result == 3):
    print('Your Personality Is Neuroticism')
elif (result == 4):
    print('Your Personality Is Openness')
elif (result == 5):
    print('Your Personality Is Tie')`

vale wedge
#

is there any code that need to be adjusted?

#

@errant shore @tidal bough

tidal bough
#

What does print(result) print?

#

ah, I see on the screenshot

#

well, that's the reason then. What do you expect to happen when you compare result to ints? It is after all an array of many values. So which `if`, if any, should execute?

vale wedge
#

hm i expect when the outcome is between 0-5 and will print the personality

vale wedge
#

hopefully it might help me

tidal bough
vale wedge
#

basically, when user input, then the outcomes will be 0 or 1 or 2 or 3 or 4 or 5

#

so every outcome will define different personality

vale wedge
brazen spire
#

is ML actually good for solving PDEs?

#

or ODEs

vale wedge
tidal bough
#

what do you mean by trying to compare all that to a single int? For example, is this equal to 4, or not?

vale wedge
#

owh i see, supposedly the result only display one number only

#

not the all 50 outcomes like that

#

i want to predict the outcome that will display the predicted single number outcome

vale wedge
#

so what can i do to make the outcome only display one number only? @tidal bough
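
A hedged sketch of the likely fix, assuming the 50 answers are meant to be one respondent with 50 features. `reshape(-1, 1)` turns them into 50 samples of one feature each, which is why 50 predictions come back; `reshape(1, -1)` makes one sample. The model itself would also need to have been fitted on rows of 50 features for this to work. The `result` below is a hard-coded stand-in, not real model output.

```python
import numpy as np

answers = np.arange(50) % 5 + 1          # hypothetical 50 survey answers
print(answers.reshape(-1, 1).shape)      # (50, 1): 50 samples, 1 feature
print(answers.reshape(1, -1).shape)      # (1, 50): 1 sample, 50 features

labels = {
    0: "Agreeableness",
    1: "Conscientiousness",
    2: "Extrovert",
    3: "Neuroticism",
    4: "Openness",
    5: "Tie",
}

# Stand-in for gnb.predict(answers.reshape(1, -1)), which returns one value.
result = np.array([2])
prediction = int(result[0])              # take the single element out
print("Your Personality Is", labels[prediction])
```

A dictionary lookup also replaces the whole `if`/`elif` chain.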

proper sable
#

is this channel also for web scraping??

serene scaffold
proper sable
#

is there one??

serene scaffold
#

@proper sable not really. I guess you could ask in #web-development, but be prepared to confirm that the website you're trying to scrape is cool with that.

proper sable
#

Yes im just learning now

#

Thx

calm bison
#

Hi, can anyone help me with creating a custom dataset. I cannot seem to find my error here.

arctic wedgeBOT
#

Hey @calm bison!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

terse frigate
#

hey guys I needed some help figuring out a model

#

and ab approach

#

an*

serene scaffold
#

@terse frigate try giving enough information so that people can start making suggestions

terse frigate
#

but i also wanna match the accuracy of that result by comparing it with the example results

serene scaffold
terse frigate
#

so my job is to search for resumes on job boards
we are given a job description of the Job Role.

#

my job is to make a search string and use it to look for resumes

#

now i want to make a model which reads the description and constructs a search string like how i do - using keywords and important skills mentioned and understanding the role

#

for example the clients they request for a cloud architect and provide requirements in skills etc

#

and i make a string -
(cloud architect) OR (Solutions architect) AND (AWS OR Google Cloud OR Azure) AND Agile

#

something like that

serene scaffold
#

So the real goal is to detect resumes that relate to a job opening. But due to some limitation, you have to enter keywords into a search API.

terse frigate
serene scaffold
#

Weird

terse frigate
#

so i was thinking if i fed enough job descriptions and also some strings to learn on

#

it should give me proper string right?

serene scaffold
#

I have to go to sleep but I'll probably think about it some more. In the mean time, do you know about term frequency inverse document frequency?

terse frigate
#

yes

#

ive read about it

serene scaffold
#

It might give you some ideas for extracting keywords.
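
A rough sketch of that idea in plain Python, on three made-up job descriptions (the texts and the `top_keywords` helper are illustrative assumptions, not a finished keyword extractor). Words that appear in every description score zero, while words specific to one description float to the top.

```python
import math
from collections import Counter

# Hypothetical, already lower-cased job descriptions.
docs = [
    "cloud architect with aws and azure experience agile team",
    "java developer with spring experience agile team",
    "data analyst with sql and excel experience",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many descriptions does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Term frequency times log inverse document frequency for one document."""
    tf = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in tf.items()
    }

def top_keywords(doc, k=3):
    scores = tfidf(doc)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]

print(top_keywords(tokenized[0]))  # ['architect', 'aws', 'azure']
```

The extracted terms could then be joined with `OR`/`AND` into a search string, though deciding which operator goes where is a separate problem.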

shrewd saddle
terse frigate
#

❤️ thanks

tidal edge
#

Hi can anyone help me in understanding this complex inheritance automl code. I just need little help. after i will able to understand it. Please let me know guys if anybody ready to explore this attached code with me. I really want to understand this code. Please help me.

arctic wedgeBOT
tidal edge
stone marlin
#

What parts of this aren't clear to you? How to use it, how to modify the code, etc.?

odd meteor
sly turtle
#

I want to make plant detection app for android.. So how can I make a model and how? I am beginner.. As there is no problem for me in Android App.. I just need some resources and support regards model.. Either ML or DL? And what is some tutorial for brainstorming.. Thanks..

unkempt monolith
#

You can use Roboflow to label images, Colab to train your YoloV5 model and Amazon AWS to host your VPS.

tidal edge
mighty spoke
#

Hi, I'm trying to do logarithmic binning for a PSD graph, but it's just not working, as the binned graph on the right still looks linearly binned. Can anyone help? My code for binning:
x4, y4 = zip(*sorted(zip(faxis, Sxx.real)))
logbin = pd.DataFrame({'X': x4, 'Y': y4})
bins5 = np.arange(4e-08, 6e-06, step=2.5e-07)
categorical_object2 = pd.cut(x4, bins5)
count = pd.value_counts(categorical_object2)
grp2 = logbin.groupby(by=categorical_object2)  # we group the data by the cut
av = grp2.aggregate(np.mean)
plt.figure()
plt.plot(av.X, av.Y, 'x')
plt.title('binned PSD plot')
plt.xlabel('mean Frequency [Hz]')  # Label the axes
plt.ylabel('mean Power')
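
A likely cause is that `np.arange` produces evenly spaced (linear) bin edges. A minimal sketch with `np.geomspace`, which spaces the edges by a constant ratio instead; the frequency axis and power values below are synthetic stand-ins for `faxis` and `Sxx.real`, and the choice of 15 edges is arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
faxis = np.linspace(4e-08, 6e-06, 500)                  # hypothetical frequencies
power = 1.0 / faxis + rng.normal(scale=1e3, size=500)   # hypothetical PSD values

# Log-spaced edges: each bin is wider than the previous one by the same factor.
edges = np.geomspace(faxis.min(), faxis.max(), num=15)

logbin = pd.DataFrame({"X": faxis, "Y": power})
bins = pd.cut(faxis, edges, include_lowest=True)

# Mean frequency and mean power per bin; empty bins are dropped.
av = logbin.groupby(bins, observed=True).mean()
print(av.head())

# Plotting av.X against av.Y on log axes (for example plt.loglog) makes the
# even spacing visible; on linear axes the points still bunch up on the left.
```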

novel raven
#

is this channel also for computer vision?

#

like cascade classifiers

#

can you train a cascade classifier that can count number of stars in the sky with normal camera

amber lark
#

does someone know how to fix this problem?

sick wedge
#

I have a plot here of a column from my panda dataframe, I'd like to make subplots for the other columns but I I can't seem to get it to work, any help appreciated

warm verge
#

Why can hyperparameters not be 'autotuned' the same way PID controllers can be?

desert oar
#

i don't know how PID controllers work

#

(although i should probably learn, because i want to learn about modding my espresso machine)

warm verge
#

This one is probably the easiest to understand

#

While I'm aware there's way more than 3 hyperparameters, it's interesting that a similar process isn't used (or at least I haven't come across it)

desert oar
#

this sounds more or less like what we do in machine learning

#

set up cross validation and pick the set of hyperparameters that performs best

#

the hyperparameter space is huge even with a small number of parameters (and usually real-valued so uncountably infinite), so various strategies like random search, halving random search, bayesian optimization, etc. exist that improve on the traditional "grid search"

#

the reason you need cross validation is that you need to emulate predicting on "out of sample" data

#

whereas PID controllers you don't have this problem of in-sample vs out-of-sample

#

so yes, it is used

#

and especially in cases with one single hyperparameter (eg the shrinkage parameter in lasso or ridge regression), you can literally just look at the plot of model performance vs parameter value, and pick the one right before model performance starts to decline (indicating overfitting)
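
A sketch of that single-hyperparameter case with NumPy only: ridge regression on synthetic data, scored by 5-fold cross validation over a small grid of shrinkage values (the data, the grid, and the fold count are all assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 120, 10
X = rng.normal(size=(n, p))
true_w = np.zeros(p)
true_w[:3] = [2.0, -1.0, 0.5]
y = X @ true_w + rng.normal(scale=1.0, size=n)

def ridge_fit(X_train, y_train, alpha):
    """Closed-form ridge solution: (X'X + alpha * I)^-1 X'y."""
    gram = X_train.T @ X_train + alpha * np.eye(X_train.shape[1])
    return np.linalg.solve(gram, X_train.T @ y_train)

# Fix the folds once so every alpha is scored on the same splits.
folds = np.array_split(np.random.default_rng(1).permutation(n), 5)

def cv_mse(alpha):
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        w = ridge_fit(X[train], y[train], alpha)
        errors.append(np.mean((X[held_out] @ w - y[held_out]) ** 2))
    return float(np.mean(errors))

alphas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
scores = {alpha: cv_mse(alpha) for alpha in alphas}
best_alpha = min(scores, key=scores.get)

for alpha in alphas:
    print(f"alpha={alpha:<7} cv mse={scores[alpha]:.3f}")
print("best alpha:", best_alpha)
```

Random search or Bayesian optimization replaces the fixed `alphas` grid with a smarter way of proposing candidates; the cross-validated score stays the same.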

warm verge
#

Hmm ok that makes sense

tidal edge
#

I have a doubt. In the above code why some method name starts with _ and some method name starts with directly name in the same class?

#

method name starting with _

#

method name directly starting with name
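
For context, this is a general Python naming convention rather than anything specific to that code: a leading underscore marks a method as an internal helper, and names without it are the intended public interface. Python does not enforce it. A tiny sketch with a made-up class:

```python
class TinyModel:
    """Hypothetical class used only to illustrate the naming convention."""

    def fit(self, data):
        # Public method: what users of the class are expected to call.
        cleaned = self._clean(data)
        self.mean_ = sum(cleaned) / len(cleaned)
        return self

    def _clean(self, data):
        # Leading underscore: internal helper, may change without notice.
        return [x for x in data if x is not None]


model = TinyModel().fit([1, None, 3])
print(model.mean_)                # 2.0
print(model._clean([None, 5]))    # [5] -- still callable, just discouraged
```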

lapis sequoia
#

Anyone use shap/lime for fb prophet?

tidal edge
novel acorn
#

hello everyone, how can I filter a pandas dataset for 2 conditions in the same column?
For example, I want to filter the length column for values equal to 48 and to 40

#
filtro = (all_years_numeric_filtered["Length"] == 48) & (all_years_numeric_filtered["Length"] == 40)

all_years_numeric_filtered[filtro]
#

Tried doing that but I'm getting an empty df

#

Already got it, created a list with the values and used .isin() for the Length column

thin palm
#

Hello, I have a question about One Hot Encoding. I'm working on a simple Heart Disease ML model, and following many tutorials they seem to use OHE on columns that numerical? I thought we only use OHE when we have text that needs to be converted into numbers or when we have plenty of options in one category. Can anyone explain this to me?

gritty bough
#

Can you take something that is non-deterministic and cant be run in parallel and use AI to parallelize it?

serene scaffold
#

@gritty bough what do you mean "can't be run in parallel"?

#

@thin palm one hot encoding is a way of representing nominal data (i.e. not quantifiable or orderable). So if you have an Animal feature, and your animals are pigs, goats, and snakes, you don't want to assign them 1, 2, and 3, because that would mean that snakes are three times as much as pigs, whatever that means.

#

Though it sounds like you might already understand that much.
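
A small sketch of the animal example with `pd.get_dummies` (the data is made up). The same reasoning applies to columns that are stored as numbers but are really category codes, which may be why tutorials one-hot encode some numeric-looking columns.

```python
import pandas as pd

animals = pd.DataFrame({"animal": ["pig", "goat", "snake", "goat"]})

# One column per category, 1 where the row belongs to that category.
one_hot = pd.get_dummies(animals["animal"], dtype=int)
print(one_hot.columns.tolist())  # ['goat', 'pig', 'snake']
print(one_hot.iloc[0].tolist())  # [0, 1, 0] for the first row, a pig
```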

#

@novel acorn & represents logical AND and something can't be both 48 and 40.
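
A quick sketch of the difference on a made-up `Length` column: `&` keeps rows where both conditions hold, `|` keeps rows where either holds, and `.isin()` is the shorter spelling of the latter.

```python
import pandas as pd

# Hypothetical stand-in for the real dataframe.
all_years_numeric_filtered = pd.DataFrame({"Length": [40, 48, 53, 40, 20]})
length = all_years_numeric_filtered["Length"]

with_and = all_years_numeric_filtered[(length == 48) & (length == 40)]
with_or = all_years_numeric_filtered[(length == 48) | (length == 40)]
with_isin = all_years_numeric_filtered[length.isin([48, 40])]

print(len(with_and))   # 0
print(len(with_or))    # 3
print(len(with_isin))  # 3
```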

rose pasture
#

Hey guys quick question. Let's say I have a dataframe of stock prices. The index is the dates and the column is the daily closing price of each stock.

bank_stocks.xs(key='Close',axis=1,level=1).plot() 

Does this line format automatically plot the Close column against the index (date) whenever I don't chose which data to plot against?

serene scaffold
#

@rose pasture if you show the dataframe in a copy-and-pastable way, I will try

#

namely print(bank_stocks.head().to_dict('list'))

#

Please ping me if you decide to do that.

rose pasture
# serene scaffold namely `print(bank_stocks.head().to_dict('list'))`
{('BAC', 'High'): [47.18000030517578, 47.2400016784668, 46.83000183105469, 46.90999984741211, 46.970001220703125], ('BAC', 'Low'): [46.150001525878906, 46.45000076293945, 46.31999969482422, 46.349998474121094, 46.36000061035156], ('BAC', 'Open'): [46.91999816894531, 47.0, 46.58000183105469, 46.79999923706055, 46.720001220703125], ('BAC', 'Close'): [47.08000183105469, 46.58000183105469, 46.63999938964844, 46.56999969482422, 46.599998474121094], ('BAC', 'Volume'): [16296700.0, 17757900.0, 14970700.0, 12599800.0, 15619400.0], ('BAC', 'Adj Close'): [33.942649841308594, 33.582183837890625, 33.62542724609375, 33.57497024536133, 33.59661102294922], ('C', 'High'): [493.79998779296875, 491.0, 487.79998779296875, 489.0, 487.3999938964844], ('C', 'Low'): [481.1000061035156, 483.5, 484.0, 482.0, 483.0], ('C', 'Open'): [490.0, 488.6000061035156, 484.3999938964844, 488.79998779296875, 486.0], ('C', 'Close'): [492.8999938964844, 483.79998779296875, 486.20001220703125, 486.20001220703125, 483.8999938964844], ('C', 'Volume'): [1537600.0, 1870960.0, 1143160.0, 1370210.0, 1680740.0], ('C', 'Adj Close'): [368.26544189453125, 361.4664611816406, 363.2597351074219, 363.2597351074219, 361.5412902832031], ('GS', 'High'): [129.44000244140625, 128.91000366210938, 127.31999969482422, 129.25, 130.6199951171875], ('GS', 'Low'): [124.2300033569336, 126.37999725341797, 125.61000061035156, 127.29000091552734, 128.0], ('GS', 'Open'): [126.69999694824219, 127.3499984741211, 126.0, 127.29000091552734, 128.5], ('GS', 'Close'): [128.8699951171875, 127.08999633789062, 127.04000091552734, 128.83999633789062, 130.38999938964844], ('GS', 'Volume'): [6188700.0, 4861600.0, 3717400.0, 4319600.0, 4723500.0], ('GS', 'Adj Close'): [103.86396026611328, 102.42938232421875, 102.38907623291016, 103.83979034423828, 105.0890121459961], 
#
('JPM', 'High'): [40.36000061035156, 40.13999938964844, 39.810001373291016, 40.2400016784668, 40.720001220703125], ('JPM', 'Low'): [39.29999923706055, 39.41999816894531, 39.5, 39.54999923706055, 39.880001068115234], ('JPM', 'Open'): [39.83000183105469, 39.779998779296875, 39.61000061035156, 39.91999816894531, 39.880001068115234], ('JPM', 'Close'): [40.189998626708984, 39.619998931884766, 39.7400016784668, 40.02000045776367, 40.66999816894531], ('JPM', 'Volume'): [12838600.0, 13491500.0, 8109400.0, 7966900.0, 16575200.0], ('JPM', 'Adj Close'): [26.503398895263672, 26.350433349609375, 26.430240631103516, 26.616453170776367, 27.048765182495117], ('MS', 'High'): [58.4900016784668, 59.279998779296875, 58.59000015258789, 58.849998474121094, 59.290000915527344], ('MS', 'Low'): [56.7400016784668, 58.349998474121094, 58.02000045776367, 58.04999923706055, 58.619998931884766], ('MS', 'Open'): [57.16999816894531, 58.70000076293945, 58.54999923706055, 58.77000045776367, 58.630001068115234], ('MS', 'Close'): [58.310001373291016, 58.349998474121094, 58.5099983215332, 58.56999969482422, 59.189998626708984], ('MS', 'Volume'): [5377000.0, 7977800.0, 5778000.0, 6889800.0, 4144500.0], ('MS', 'Adj Close'): [36.114253997802734, 36.13903045654297, 36.23814010620117, 36.27529525756836, 36.65928649902344], ('WFC', 'High'): [31.975000381469727, 31.81999969482422, 31.55500030517578, 31.774999618530273, 31.825000762939453], ('WFC', 'Low'): [31.19499969482422, 31.364999771118164, 31.309999465942383, 31.385000228881836, 31.55500030517578], ('WFC', 'Open'): [31.600000381469727, 31.799999237060547, 31.5, 31.579999923706055, 31.674999237060547], ('WFC', 'Close'): [31.899999618530273, 31.530000686645508, 31.4950008392334, 31.68000030517578, 31.674999237060547], ('WFC', 'Volume'): [11016400.0, 10870000.0, 10158000.0, 8403800.0, 5619600.0], ('WFC', 'Adj Close'): [20.444866180419922, 20.207735061645508, 20.1853084564209, 20.303871154785156, 20.300668716430664]}
#

was it like that?

serene scaffold
#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

serene scaffold
#

Next time use this if it's too long

rose pasture
#

Ok I will next time

serene scaffold
#

Is this different from what you wanted?

lapis sequoia
serene scaffold
#

The index is the dates and the column is the daily closing price of each stock.
For your reference, this description is incomplete. Your columns are a multiindex of company names by (high, low, open, close, etc).

Also your rows are probably indexed by date, so my example will look a bit different.

#

if you do print(bank_stocks.head().index), I can correct my version.

rose pasture
#

Yes it is different, here's what I got. I was just curious as to how the plot() chose the index as my x axis even though I didn't specify anything.
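
A small sketch of what `.xs` hands to `.plot()`, using a cut-down frame with rounded prices and made-up dates. The cross-section keeps the row index and leaves one column per ticker; `DataFrame.plot()` uses the index as the x axis when no `x` is given and draws one line per column.

```python
import pandas as pd

dates = pd.to_datetime(["2006-01-03", "2006-01-04", "2006-01-05"])  # hypothetical
columns = pd.MultiIndex.from_product([["BAC", "C"], ["Open", "Close"]])
bank_stocks = pd.DataFrame(
    [[46.92, 47.08, 490.0, 492.9],
     [47.00, 46.58, 488.6, 483.8],
     [46.58, 46.64, 484.4, 486.2]],
    index=dates,
    columns=columns,
)

# Select the 'Close' entry from the second column level for every ticker.
close = bank_stocks.xs(key="Close", axis=1, level=1)
print(close.columns.tolist())  # ['BAC', 'C']
print(close.index[0])          # 2006-01-03 00:00:00

# close.plot() would put this DatetimeIndex on the x axis, one line per ticker.
```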

rose pasture
serene scaffold
rose pasture
lapis sequoia
#

@tidal edge you there?

tidal edge
#

Yes. Shall we connect little later. I'm on one conf call.

lapis sequoia
#

Okay okau

gritty bough
#

For example. Making a checkerboard.

You can check the neighbors to the NSEW directions if they are black or white or you could do if(mod % 2) { black} else white.

#

Mmmm

#

I'm thinking

#

Maybe problems like what I'm asking dont exist.

thin palm
hoary wigeon
#

if male is 0, it means female = 1, vice versa

#

anyone using cv2 ?

desert oar
copper ridge
#
```py
row_vals = '\n'.join([val for val in df['Date:']])
```
I want to view only the values in the specified row in groups of 10. How would I be able to do that?
#

that line of code displays all the values

#

I am using pandas to help work with excel files

desert oar
#

do you mean "in the specified column"?

#

columns go "up and down" -- like columns in a building

#

rows go "across", like rows of crops in a field

copper ridge
copper ridge
copper ridge
desert oar
#

are you talking about rows or columns?

copper ridge
#

columns, up and down

#

the data in the Date column

desert oar
#

ok, and what do you want to print?

copper ridge
#

I want to print the data in groups of 10, such as 1-10 and 11-20

desert oar
#

so you want to print the first 10 rows?

#

then the next 10 rows, etc.

copper ridge
#

yes, sorry for the confusion

desert oar
#
size = 10
for lo in range(0, len(df), size):
    hi = lo + size
    date_values = df['Date:'].iloc[lo : hi]
    # Convert to str first so dates or numbers can be joined as well.
    print(' '.join(map(str, date_values.tolist())))
#

i'd just write a loop for that

#

!d range

arctic wedgeBOT
#

```py
class range(stop)
class range(start, stop[, step])
```
The arguments to the range constructor must be integers (either built-in [`int`](https://docs.python.org/3/library/functions.html#int "int") or any object that implements the [`__index__()`](https://docs.python.org/3/reference/datamodel.html#object.__index__ "object.__index__") special method). If the *step* argument is omitted, it defaults to `1`. If the *start* argument is omitted, it defaults to `0`. If *step* is zero, [`ValueError`](https://docs.python.org/3/library/exceptions.html#ValueError "ValueError") is raised.

For a positive *step*, the contents of a range `r` are determined by the formula `r[i] = start + step*i` where `i >= 0` and `r[i] < stop`.

For a negative *step*, the contents of the range are still determined by the formula `r[i] = start + step*i`, but the constraints are `i >= 0` and `r[i] > stop`.
copper ridge
#

thank you

grand breach
#

need some help with cspdarknet architecture.. anyone?

#

Is the SPP block part of the CSPDarknet backbone, i.e. the final block in the backbone? Or does it sit in the neck?

lapis sequoia
#

Hello what would be the most efficient way of adding 150k rows to a dataframe? When using .append it starts with a pretty decent 30mins remaining and then it increases when it gets closer to the end. When using .loc it starts with more than an hour and then increases but not as much as using .append. It looks like .append performs better with small dataframes and .loc with large dataframes.

#

30 minutes might be a bit slow, but I have another df with 57k rows loaded in memory to perform some conditionals and decide which rows should be added to the final df I want to create

#

I see how it dies little by little...😅

desert bear
#

Hey, I have a question related to building a multi-class classification model. In my datasets I have some sequence of vectors that are unique for a specific class. Do you think that throwing this UNIQUE vector into an unsupervised model is a waste of resources?

lapis sequoia
snow tangle
#

Hey, I'm really interested in AI image generation with GANs because the results are really amazing, so I'm looking to learn more about it. I followed this course https://livecodestream.dev/post/generating-images-with-deep-learning/ and adapted the code to work with RGB images and such, and learnt more about the parameters, but I still have a long way to go. I'm looking for any papers or articles on GAN image generation that you would all recommend so I can learn more about this topic

#

I was able to get it to generate images that started to resemble the dataset and as expected, a dataset with images from the same perspective also worked a lot better

#

but it's still not optimized, and over time it loses stability and the images drop in quality. So I definitely need to read more articles on image generation

#

Worked well for fashion-mnist though

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @lapis sequoia until <t:1641386303:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

lapis sequoia
#

I'll paste the code I'm using just in case you see something that can be improved

fleet meteor
#

hello

#

i am trying to create an ai

#

or self learning stuff

#

can any one help

#

or have experience?

lapis sequoia
lapis sequoia
#

One way would be not to retrieve the whole thing.
Another way, depending on your logic, would be to use a dict instead of that df, so the lookups inside the while loop are faster
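
A rough sketch of that idea, assuming the 57k-row frame is only used for lookups by some key: build a dict from it once, collect the rows to keep in a plain list, and construct the final DataFrame in a single call instead of growing it row by row. The column names (`id`, `value`, `score`) and the filter condition are hypothetical stand-ins for the real logic.

```python
import pandas as pd

# Hypothetical stand-ins for the real data.
lookup_df = pd.DataFrame({"id": range(1000), "score": [i % 7 for i in range(1000)]})
candidates = [{"id": i, "value": i * 2} for i in range(0, 3000, 3)]

# Build the lookup once: dict access is O(1) on average, whereas filtering
# the DataFrame inside the loop scans it on every iteration.
score_by_id = dict(zip(lookup_df["id"], lookup_df["score"]))

rows = []
for cand in candidates:
    score = score_by_id.get(cand["id"])
    if score is not None and score > 3:  # placeholder condition
        rows.append({**cand, "score": score})

# One construction at the end avoids copying the frame on every append.
result = pd.DataFrame(rows)
print(len(result))
```

Growing a DataFrame with `.append` or `.loc` copies data on each step, which is why the estimate keeps getting worse as the frame grows. `DataFrame.append` was also removed in pandas 2.0, so list-then-construct (or `pd.concat`) is the forward-compatible route.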

lapis sequoia
mint palm
#

i have installed jupyter but i wanna know which location is best to install so that i can easily access python libraries

#

should i install in ....../python37/lib/

#

or is there something else i should do
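
Jupyter normally doesn't need to live in any particular folder. What matters is which Python interpreter the notebook kernel runs, since that decides which installed libraries can be imported. A quick check to run in a notebook cell (a sketch, nothing in it is specific to this setup):

```python
import sys

# The interpreter this kernel (or script) is running on.
print(sys.executable)

# Directories searched on import; a library must be installed into one of these.
for path in sys.path:
    print(path)
```

If `sys.executable` is the Python you expect, installing packages with that same interpreter (for example `python -m pip install <package>`) should keep them visible to the notebook.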

lapis sequoia
lapis sequoia
thin palm
wicked grove
#

I used this x_train=load('outfile.npz')

#

Is there any way i can check the contents of this x_train

odd meteor
lapis sequoia
wicked grove
#

Alrightt

lapis sequoia
#

If the file is a .npz file, then a dictionary-like object is returned, containing {filename: array} key-value pairs, one for each file in the archive.

If the file is a .npz file, the returned value supports the context manager protocol in a similar fashion to the open function:

with load('foo.npz') as data:
    a = data['a']

Via:
https://numpy.org/doc/stable/reference/generated/numpy.load.html

#

@wicked grove

#

Try just printing x_train for now? The doc suggests it should be a dict-like object, with one array per file in the archive
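
A minimal sketch of inspecting such an archive. It first writes a small `.npz` so the example is self-contained; the array names `a` and `b` are made up, and a real `outfile.npz` will have whatever names it was saved with.

```python
import os
import tempfile

import numpy as np

# Create a small archive so the example runs on its own.
path = os.path.join(tempfile.mkdtemp(), "outfile.npz")
np.savez(path, a=np.arange(6).reshape(2, 3), b=np.zeros(4))

with np.load(path) as data:
    names = list(data.files)  # names of the arrays stored in the archive
    shapes = {name: data[name].shape for name in names}
    for name in names:
        print(name, data[name].dtype, shapes[name])
```

Printing the loaded object directly mostly shows its type, so listing `.files` and then indexing by name is usually the more informative route.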

wicked grove
thin palm
thin palm
#

Is the final model's score supposed to be higher than the cross-validated score? For example, when I cross-validate with 10 folds, my mean score is 0.81. When I take my model and fit it on our features and target, my score jumps to 0.86. Does this make sense?

serene scaffold
#

If you trained (fit) the model on the same instances that you used to evaluate, that would explain why the score went up.
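
A small sketch of that effect, assuming scikit-learn (the original question doesn't say which library was used). The synthetic dataset and the decision tree are arbitrary choices for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# Each fold is scored on rows the model did not see while fitting.
cv_mean = cross_val_score(model, X, y, cv=10).mean()

# Fitting and scoring on the same rows rewards memorization.
train_score = model.fit(X, y).score(X, y)

print(f"CV mean: {cv_mean:.2f}, score on training data: {train_score:.2f}")
```

The cross-validated mean is the better estimate of how the model does on new data. The score on the training rows is expected to be at least as high, and a large gap between the two suggests overfitting.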