#data-science-and-ml | Python | Page 364

sour spindle Dec 31, 2021, 6:10 PM

#

I already made the function

#

It seems to make the testing accuracy much better

sour spindle Dec 31, 2021, 7:41 PM

#

As you can see in the form [normal, postprocessing1, postprocessing2] accuracy

sour spindle Dec 31, 2021, 8:25 PM

#

the postprocessing1 also makes the training data get into the 80's in accuracy

desert oar Dec 31, 2021, 9:13 PM

#

Right, exactly

#

That is the value of abstraction

forest canyon Jan 1, 2022, 1:22 AM

#

Hello. I am trying to compare two dataframes. They have the same number of columns and same column names. Error is below. How do I adjust this so the compare works?

ValueError: Can only compare identically-labeled DataFrame objects

Here are my functions.

engine = create_engine('postgresql://postgres:secret@10.0.10.125:5432/sheepdb')

def jdb_sheep_df():
    table_name = 'jdb_sheep'
    table_df = pd.read_sql_table(
        table_name,
        con=engine,
        schema='sheepdb'
    )
    return table_df

def sql_csv_compare(path='/home/steven/projects/repos/sheepdb-cli/sheepdb-cli/test.csv'):
    csv_df = pd.read_csv(path)
    sql_df = jdb_sheep_df()
    #ne = sql_df.compare(csv_df)
    #ne = (csv_df != sql_df).any(1)
    ne = csv_df.compare(sql_df, keep_equal=True, keep_shape=True) 
    return ne

print(sql_csv_compare())

serene scaffold Jan 1, 2022, 1:27 AM

#

@forest canyon did you confirm that the rows are indexed the same way?

forest canyon Jan 1, 2022, 1:27 AM

#

Confirm in what way?

serene scaffold Jan 1, 2022, 1:29 AM

#

@forest canyon looks like you're reading the DataFrames from an SQL database. Once you have the DataFrames, make sure that the .index of each are the same sets.

sour spindle Jan 1, 2022, 1:29 AM

#

serene scaffold @forest canyon did you confirm that the rows are indexed the same way?

could you look at my problem if you can?

forest canyon Jan 1, 2022, 1:30 AM

#

How?

serene scaffold Jan 1, 2022, 1:30 AM

#

sour spindle could you look at my problem if you can?

The one from earlier? I don't look at screenshots of text.

sour spindle Jan 1, 2022, 1:30 AM

#

serene scaffold The one from earlier? I don't look at screenshots of text.

i can send a log if you want

serene scaffold Jan 1, 2022, 1:31 AM

#

sour spindle i can send a log if you want

I'm about to drive but I might be able to look later. I would ask questions to the whole channel, not individuals.

serene scaffold Jan 1, 2022, 1:31 AM

#

forest canyon How?

.index is an attribute of DataFrames. Look at them and see.

forest canyon Jan 1, 2022, 1:32 AM

#

I know it's an attribute but I dont know what you mean by "the same"

#

It gives both dataframes an index column at the front starting at 0

undone fiber Jan 1, 2022, 1:33 AM

#

can anyone recommend the easiest single-variable ml linear regression tutorial they've come across? I figure I'd start with something like that.

serene scaffold Jan 1, 2022, 1:33 AM

#

forest canyon I know it's an attribute but I dont know what you mean by "the same"

If the values in both indices are the same, not considering the order. If they're both range indices of the same length, then they're the same.
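
A minimal sketch of that check, using two small made-up frames rather than the real `sql_df` and `csv_df` (the column names and values are assumptions). `DataFrame.compare` only accepts frames whose row and column labels match, so it can help to inspect both labels first and then compare only the rows the two frames share:

```python
import pandas as pd

# Hypothetical stand-ins for sql_df and csv_df; the real data is not shown in the chat.
sql_df = pd.DataFrame({"tag": ["a1", "b2", "c3"], "weight": [50, 61, 47]})
csv_df = pd.DataFrame({"tag": ["a1", "b2"], "weight": [50, 60]})

# Same column labels, but the row labels differ (0..2 versus 0..1).
same_columns = sql_df.columns.equals(csv_df.columns)
same_index = sql_df.index.equals(csv_df.index)
print(same_columns, same_index)

# Restricting both frames to the row labels they share makes compare() usable.
shared = sql_df.index.intersection(csv_df.index)
diff = sql_df.loc[shared].compare(csv_df.loc[shared])
print(diff)
```

Rows that exist in only one frame are left out of `diff` here, so they would need to be reported separately.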

forest canyon Jan 1, 2022, 1:33 AM

#

Oh.. So then what is the use of a compare if you always have to have the same number of rows?

sour spindle Jan 1, 2022, 1:33 AM

#

how to send code again

#

in the format

tidal patrol Jan 1, 2022, 1:34 AM

#

Maybe make an if statement that will say if they are equal… for example `if set1 == set2: print("They are equal to each other.")`

#

just my idiotic brain thinking

forest canyon Jan 1, 2022, 1:34 AM

#

I need to identify differences

sour spindle Jan 1, 2022, 1:35 AM

#

sour spindle As you can see in the form [normal, postprocessing1, postprocessing2] accuracy

here is the log of the same ss
2021
[0.5121227115289461, 0.8090054428500743, 0.5314200890648194]
2020
[0.5096486887679367, 0.8169223156853043, 0.5319148936170213]
2021
2020
[0.5091538842157348, 0.8213755566551212, 0.5329045027214251]
2021
2020
[0.5091538842157348, 0.830282038594755, 0.5338941118258288]

tidal patrol Jan 1, 2022, 1:35 AM

#

yea but there could be no differences

forest canyon Jan 1, 2022, 1:35 AM

#

I'm not quite there yet.

#

I need to get them comparing first

#

So the compare function only works on data with same row count on both sides?

tidal patrol Jan 1, 2022, 1:36 AM

#

that could be it… maybe ur comparing a 2d list to a 3D list

forest canyon Jan 1, 2022, 1:37 AM

#

That's what was said above anyway.

#

My dataframes are same columns just one has more rows than the other

tidal patrol Jan 1, 2022, 1:37 AM

#

can you delete a row and test it?

sour spindle Jan 1, 2022, 1:39 AM

#

here is the question reposted with the log instead of the ss:

Hey i am making a stock predictor and i am wondering if it is ok to take the output from testing and run it through a post processing function which makes the testing accuracy go from .53 to .79?
I already made the function
It seems to make the testing accuracy much better
As you can see in the form [normal, postprocessing1, postprocessing2] accuracy

2021
[0.5121227115289461, 0.8090054428500743, 0.5314200890648194]
2020
[0.5096486887679367, 0.8169223156853043, 0.5319148936170213]
2021
2020
[0.5091538842157348, 0.8213755566551212, 0.5329045027214251]
2021
2020
[0.5091538842157348, 0.830282038594755, 0.5338941118258288]

the postprocessing1 also makes the training data get into the 80's in accuracy

forest canyon Jan 1, 2022, 1:39 AM

#

tidal patrol can you delete a row and test it?

I added a row and tested. it sets mismatches as NaN. Is there a way to compare when rows count is different?

#

Like compare two columns?

#

And see if items in the sql_df exist in the csv_df?

tidal patrol Jan 1, 2022, 1:40 AM

#

um…

#

I’m an idiot trying to learn. I got no clue.

forest canyon Jan 1, 2022, 1:41 AM

#

Kind of like you can in Excel

#

Okay

#

At least I know this won't work now

tidal patrol Jan 1, 2022, 1:41 AM

#

would it change the value if you added 0’s to the extra row? Cause then it will allow you to compare them

sour spindle Jan 1, 2022, 1:42 AM

#

forest canyon Like compare two columns?

what do you mean like to see if they are the same?

forest canyon Jan 1, 2022, 1:43 AM

#

Basically - take a column in sql_df, compare it to the same column csv_df, then give me any value in sql_df column that doesn't exist in csv_df column
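
One way to express that request is `Series.isin`, which does not care about row counts or row order. This is a sketch on made-up data; `tag` is a hypothetical column name standing in for whichever column is being checked:

```python
import pandas as pd

# Hypothetical data; "tag" stands in for whichever column is being checked.
sql_df = pd.DataFrame({"tag": ["a1", "b2", "c3", "d4"]})
csv_df = pd.DataFrame({"tag": ["a1", "c3"]})

# Boolean mask: True where the sql_df value also appears somewhere in csv_df["tag"].
present = sql_df["tag"].isin(csv_df["tag"])

# Values of sql_df["tag"] that are missing from csv_df, regardless of row count or order.
missing = sql_df.loc[~present, "tag"]
print(missing.tolist())
```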

sour spindle Jan 1, 2022, 1:44 AM

#

forest canyon Basically - take a column in sql_df, compare it to the same column csv_df, then ...

ok are they pandas dataframes already?

forest canyon Jan 1, 2022, 1:44 AM

#

Yep. They are in my code above.

#

sql_df comes from a DB table and csv_df comes from a csv file

sour spindle Jan 1, 2022, 1:44 AM

#

try this
https://www.kite.com/python/answers/how-to-compare-two-pandas-dataframe-columns-in-python


#

u might want to convert to a list with .tolist() and then use .index(False)

forest canyon Jan 1, 2022, 1:46 AM

#

It looks like that will be same issue - row count must match but I'll try and Google if it doesn't work. Main question is answered on compare. The compare has to be the same row count and you'd want to sort by the same column so it was an accurate compare Thanks all.
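
For frames with different row counts, one option that was not suggested in the chat is an outer merge with `indicator=True`, which labels each row as present in the left frame, the right frame, or both. A sketch on made-up data, assuming both frames have the same columns:

```python
import pandas as pd

# Hypothetical frames with identical columns but different row counts.
sql_df = pd.DataFrame({"tag": ["a1", "b2", "c3"], "weight": [50, 61, 47]})
csv_df = pd.DataFrame({"tag": ["a1", "b2"], "weight": [50, 60]})

# With no `on=` given, merge joins on all shared columns, so a row only
# counts as "both" when every value matches.
merged = sql_df.merge(csv_df, how="outer", indicator=True)

# Keep the rows that were found on one side only.
differences = merged[merged["_merge"] != "both"]
print(differences)
```

Unlike `compare`, this needs neither matching indexes nor sorted rows, but a changed row shows up twice: once as `left_only` and once as `right_only`.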

sour spindle Jan 1, 2022, 1:48 AM

#

forest canyon It looks like that will be same issue - row count must match but I'll try and Go...

alright

arctic wedgeBOT Jan 1, 2022, 6:38 AM

#

failmail :ok_hand: applied mute to @junior robin until <t:1641019694:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

devout trench Jan 1, 2022, 7:15 AM

#

hey, i want to make a virtual assistant like alexa or siri but it works offline and on mobile devices with basic functionality like play music, open application, etc.
i have hit a roadblock in the offline voice recognition part. i have tried sphinx and vosk but the accuracy is not that great. i am thinking of learning ML and designing my own voice recognition, but getting a huge amount of voice data is a problem. or should i just try to modify existing voice recognition models?

austere swift Jan 1, 2022, 10:35 AM

#

devout trench hey, i want to make a virtual assistant like alexa or siri but it works offline ...

have you tried searching in places like kaggle for datasets?

#

https://www.kaggle.com/mozillaorg/common-voice a quick search found me this

Common Voice

500 hours of speech recordings, with speaker demographics

hoary wigeon Jan 1, 2022, 10:42 AM

#

SOS

#

I need project topic on Machine Learning not something like price prediction or churn, cancer classification

ruby crown Jan 1, 2022, 10:52 AM

#

hoary wigeon I need project topic on Machine Learning not something like price prediction or ...

i'm afraid of breaking the advertisement rule but I have a free api which hosts a bunch of datasets https://mldrive.io/documentation

mlDrive Documentation

an API for Machine Learning

hoary wigeon Jan 1, 2022, 10:53 AM

#

lemme check

#

this isn't helping me

arctic wedgeBOT Jan 1, 2022, 11:09 AM

#

:incoming_envelope: :ok_hand: applied mute to @deep gyro until <t:1641035996:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

mint palm Jan 1, 2022, 11:54 AM

#

best place to learn tensorflow???

#

i know the data science basics, nn models, everything: language, procedure

#

just library is what i had left

#

so tell me the best place to learn tf

marsh yacht Jan 1, 2022, 12:10 PM

#

hello can anyone help me how to output a binary in a KNN.predict value

stable dove Jan 1, 2022, 1:00 PM

#

Hi there, excuse me for posting this question here, but I wouldn't know what other channel to post it in. Is there anyone who can help me understand how an encoder, a ROM decoder, and a PLA work?
I tried to read from many sources but I don't understand it, especially the associated OR matrix and AND matrix

mortal silo Jan 1, 2022, 1:03 PM

#

how to keep theano cache file/folder names consistent over different systems?

earnest widget Jan 1, 2022, 3:20 PM

#

mint palm so tell me the best place to learn tf

If you know a lot of the other concepts, you can just apply them in a project using TF. That's what I would do.

nova timber Jan 1, 2022, 4:24 PM

#

Does anybody have experience with nbdiff-web from nbdime? https://nbdime.readthedocs.io/en/latest/

#

I’m trying to generate diffs for multiple files and I can’t get it to work

wicked grove Jan 1, 2022, 5:01 PM

#

hello, i have a doubt in tensorflow. when i use Conv2D with kernel_size=1, does it mean a 1x1 convolution?

forest canyon Jan 1, 2022, 5:55 PM

#

Is it possible to replace nan with null in a dataframe and then still turn it into a dict after? I can totally replace nan with null using fillna, but I can't turn it into a dict after. Anyway to convert nan to null and then still be able to turn it into a dict after?

csv_df = pd.read_csv(path)
records = csv_df.to_dict('records')  # named `records` so the built-in `dict` is not shadowed

forest canyon Jan 1, 2022, 7:12 PM

#

This is the answer

.replace([np.nan], [None])
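
How `replace` treats `None` has differed between pandas versions, so a version-independent alternative is to swap the missing values after `to_dict`, on the plain Python records. This is a sketch on made-up data, not the CSV from the chat:

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for csv_df.
csv_df = pd.DataFrame({"name": ["Dolly", None], "weight": [50.0, np.nan]})

# Convert first, then swap every missing value (NaN or None) for None.
records = [
    {key: (None if pd.isna(value) else value) for key, value in row.items()}
    for row in csv_df.to_dict("records")
]
print(records)
```

This assumes every cell holds a scalar; `pd.isna` returns an array for list-like cells, which would break the conditional.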

mighty spoke Jan 1, 2022, 8:26 PM

#

Hi is plt.loglog(x,y) the same as plt.plot(np.log(x),np.log(y))?

tidal patrol Jan 1, 2022, 8:34 PM

#

I believe that the first one is for matplot and the second is for numpy

mighty spoke Jan 1, 2022, 9:03 PM

#

tidal patrol I believe that the first one is for matplot and the second is for numpy

so they are different

tidal patrol Jan 1, 2022, 9:25 PM

#

mighty spoke so they are different

Yes I believe so
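
For what it's worth, both calls draw with Matplotlib; NumPy only computes the logarithms. A small sketch of the actual difference, assuming positive data and a non-interactive backend:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no window is needed for this check
import matplotlib.pyplot as plt
import numpy as np

x = np.logspace(0, 3, 50)  # 50 values from 1 to 1000
y = x ** 2

fig, (ax_loglog, ax_plot) = plt.subplots(1, 2)

# loglog keeps the original values and switches both axes to a log scale.
ax_loglog.loglog(x, y)

# plot(log x, log y) transforms the data and draws it on linear axes.
ax_plot.plot(np.log(x), np.log(y))

print(ax_loglog.get_xscale(), ax_plot.get_xscale())
```

The two curves have the same shape, but the tick labels differ: the first panel shows the data values themselves, the second shows their natural logarithms.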

marsh yacht Jan 1, 2022, 10:21 PM

#

#

marsh yacht Jan 1, 2022, 10:22 PM

#

marsh yacht

my neigh is the same as the bottom clf_KNN pic, just different variables

marsh yacht Jan 1, 2022, 10:23 PM

#

marsh yacht

how do i make this predict value

marsh yacht Jan 1, 2022, 10:23 PM

#

marsh yacht

to this

quiet vault Jan 1, 2022, 10:57 PM

#

Im guessing you could get the index of the 1 and then do classes[index] to get the value of the class
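
A sketch of that idea with NumPy, assuming the prediction is a one-hot style row per sample. The class names and arrays are made up, since the originals were only posted as images:

```python
import numpy as np

# Hypothetical class names and one-hot style predictions.
classes = np.array(["cat", "dog", "bird"])
predictions = np.array([[0, 1, 0],
                        [1, 0, 0],
                        [0, 0, 1]])

# argmax along each row gives the position of the 1.
indices = predictions.argmax(axis=1)

# Indexing the class array with those positions gives the label per sample.
labels = classes[indices]
print(labels.tolist())
```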

solar yew Jan 1, 2022, 11:07 PM

#

Hey does anyone have experience working with TF-IDF vectors, especially regarding using them along with non-NLP features? I managed to get it into a dataframe, however, it is too large to concat with the other features (2 additional columns of equal length). I'd love to hear how other people solved this!

#

Error thrown explains I need 5.4GiBs extra to process, however, i managed to navigate that error already once before while creating the Dataframe

#

#

I could perhaps do my test-train split early and do this part individually for both, cutting down the training set by ~20%
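
One common way around the memory problem is to skip the DataFrame and keep the TF-IDF output as a sparse matrix, appending the extra columns with `scipy.sparse.hstack`. A minimal sketch; the corpus and the two numeric features are made up:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus and two numeric features per document.
texts = ["cheap flights to rome", "rome hotel deals", "learn python fast"]
extra = np.array([[3.0, 0.0], [1.0, 1.0], [7.0, 0.0]])

# fit_transform returns a SciPy sparse matrix; only non-zero entries are stored.
tfidf = TfidfVectorizer().fit_transform(texts)

# Stack the extra columns on the right without converting TF-IDF to dense.
features = hstack([tfidf, csr_matrix(extra)]).tocsr()
print(features.shape)
```

Many scikit-learn estimators accept a sparse `features` matrix directly, so the dense conversion may never be needed.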

mossy stratus Jan 1, 2022, 11:17 PM

#

Does anyone know why I get this error, my GPU is 8GB, not 6GB? (Windows, 3060 Ti, 8GB VRAM)

RuntimeError: CUDA out of memory. Tried to allocate 90.00 MiB (GPU 0; 8.00 GiB total capacity; 5.49 GiB already allocated; 0 bytes free; 7.60 GiB allowed; 5.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

coral kindle Jan 1, 2022, 11:22 PM

#

Ooofles

#

I tried to follow a tutorial but I'm having weird CUDA memory issues or maybe classes idk

#

def train(dataloader: DataLoader, model: nn.Module, optimizer: optim.Optimizer, criterion: nn.Module, epoch: int):
    model.train()
    total_acc, total_count = 0, 0
    log_interval = 500
    start_time = time.time()

    for idx, (label, text, offsets) in enumerate(dataloader):
        optimizer.zero_grad()
        predicted_label = model(text, offsets)
        loss = criterion(predicted_label, label)
        loss.backward()
        # error happens here
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            elapsed = time.time() - start_time
            print(
                "| epoch {:3d} | {:5d}/{:5d} batches "
                "| accuracy {:8.3f}".format(epoch, idx, len(dataloader), total_acc / total_count)
            )
            total_acc, total_count = 0, 0
            start_time = time.time()

#

I tried to follow the tut's steps with another dataset: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html

#

And that one worked on both Colab and local

#

According to Stackoverflow it's happening because my label > n_labels which is kinda weird
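
A hedged sketch of the check that the Stack Overflow answer points at. As far as I know, `nn.CrossEntropyLoss` expects class indices from 0 to `num_classes - 1`, and an out-of-range label tends to surface on the GPU as an unrelated-looking CUDA error. `check_labels` is a hypothetical helper, not part of PyTorch or the tutorial:

```python
import torch


def check_labels(labels: torch.Tensor, num_classes: int) -> None:
    """Raise ValueError if any label falls outside [0, num_classes)."""
    low, high = int(labels.min()), int(labels.max())
    if low < 0 or high >= num_classes:
        raise ValueError(
            f"labels span {low}..{high}, but the model has {num_classes} outputs"
        )


# Hypothetical batch: labels 1..4 for a 4-class model, i.e. off by one.
labels = torch.tensor([1, 2, 3, 4])
try:
    check_labels(labels, num_classes=4)
except ValueError as err:
    print(err)

# Shifting to 0-based labels passes the check.
check_labels(labels - 1, num_classes=4)
```

Running this over one batch from the dataloader before training would confirm or rule out the label explanation.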

#

This is what happened when I'm on CPU

coral kindle Jan 1, 2022, 11:36 PM

#

mossy stratus Does anyone know why I get this error, my GPU is 8GB, not 6GB? (Windows, 3060 Ti...

Yes, 8 GB is written. However, PyTorch already kept some VRAM for your model and the dataloaders you're sending to your GPU.

mossy stratus Jan 1, 2022, 11:36 PM

#

coral kindle Yes, 8 GB is written. However, PyTorch already kept some VRAM for your model an...

how do I fix it?

coral kindle Jan 1, 2022, 11:36 PM

#

If you're on a notebook, the only way to get rid of this error is to restart your kernel. Then lower your batch size.

mossy stratus Jan 1, 2022, 11:36 PM

#

batch size?

coral kindle Jan 1, 2022, 11:37 PM

#

Rinse and repeat till you find a suitable batch size

mossy stratus Jan 1, 2022, 11:37 PM

#

coral kindle Rinse and repeat till you find a suitable batch size

how?

coral kindle Jan 1, 2022, 11:37 PM

#

mossy stratus batch size?

When you configure a DataLoader, you have to pass in a batch size. It's the number of samples you feed into your model at once.

mossy stratus Jan 1, 2022, 11:37 PM

#

coral kindle If you're on a notebook, the only way to get rid of this error is to restart you...

it's just a plain python file

coral kindle Jan 1, 2022, 11:38 PM

#

Ideally we aim for bigger batch sizes but due to GPU constraints we lower it.

coral kindle Jan 1, 2022, 11:38 PM

#

mossy stratus it's just a plain python file

Ok so no need to restart anything. Just change the batch size where you define your DataLoader
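
A minimal sketch of where the batch size lives, assuming a plain `TensorDataset`; the dataset shape and the two batch sizes are made up for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical dataset: 1,000 samples with 16 features each.
dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))

# batch_size is the number of samples moved through the model per step.
large = DataLoader(dataset, batch_size=128, shuffle=True)
small = DataLoader(dataset, batch_size=32, shuffle=True)

# A smaller batch needs less memory per step but more steps per epoch.
print(len(large), len(small))

# Each batch is a (features, labels) pair with batch_size rows.
features, targets = next(iter(small))
print(features.shape, targets.shape)
```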

mossy stratus Jan 1, 2022, 11:39 PM

#

does it improve quality? (higher batch size)

#

I can make the result image smaller

#

but the quality seems to be the same from 128 to 400

coral kindle Jan 1, 2022, 11:39 PM

#

mossy stratus does it improve quality? (higher batch size)

Usually yes, your training goes faster

mossy stratus Jan 1, 2022, 11:39 PM

#

so it's speed, not quality?

coral kindle Jan 1, 2022, 11:40 PM

#

mossy stratus but the quality seems to be the same from 128 to 400

If you're doing a CNN the only variables you can touch upon are the batch size and maybe the number of neurons in a convolution layer.

coral kindle Jan 1, 2022, 11:41 PM

#

mossy stratus so it's speed, not quality?

A mix of both I'd say

#

But higher batch sizes mean you do fewer iterations

mossy stratus Jan 1, 2022, 11:42 PM

#

on my GPU, it does iterations really fast, on my CPU, it has far more memory, but does slow iterations

#

didn't work

#

still get this:

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 8.00 GiB total capacity; 5.55 GiB already allocated; 0 bytes free; 7.60 GiB allowed; 5.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

#

I doubt the batch size affects 2.5 GB
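
The error text itself mentions `max_split_size_mb`. A sketch of how that option is usually set, through the `PYTORCH_CUDA_ALLOC_CONF` environment variable; the value 128 is an arbitrary illustration, not a recommendation from the chat:

```python
import os

# Must be set before PyTorch initialises CUDA, so do it before `import torch`.
# 128 is an arbitrary illustrative value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This only targets fragmentation. In the message above, reserved memory (5.62 GiB) is close to allocated memory (5.55 GiB), so fragmentation may not be the cause here.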

lapis sequoia Jan 2, 2022, 2:47 AM

#

I have some problems. They should be simple, but they are not.

#

These questions require a certain expertise in order to provide any useful answers.

#

They involve the correct data normalization boundaries for certain kinds of data, variance calculations, bayesian algorithms, and audio sampling.

#

I could use some help.

#

ideally, i could use some help from a systems developer, perhaps a Python or NumPy project developer

tidal patrol Jan 2, 2022, 3:01 AM

#

send it in the server I’m sure we can all work together for an answer!

lapis sequoia Jan 2, 2022, 3:10 AM

#

I've asked about it before. I'm reluctant to expend tendons.
https://github.com/falseywinchnet/fabada/blob/master/examples/streamclean_rx_buffer.py
take a look over this. Note where the fabada function enters and what is done initially. Note what kind of data I am working with.
This is a simple, linear, single file, open source project. It is not complex from that perspective. It is complex in terms of the computation, but we can hold off on that.

GitHub

fabada/streamclean_rx_buffer.py at master · falseywinchnet/fabada

Fully Adaptive Bayesian Algorithm for Data Analysis (FABADA) is a new approach of noise reduction methods. In this repository is shown the package developed for this new method based on \citepaper....

#

I've coded things like variance arbitrarily because I don't know what should work best.

#

One problem i have with it is that it "clicks" at the end and start of each sample frame. This is unrelated to the sampling or buffers, simply bypassing the noise reduction function and passing the data frame back to the output is proof of this.

#

So the problem is internal to how I have adapted this formula from another person, who is a co-author on it, to this particular application.
https://github.com/PabloMSanAla/fabada/blob/master/fabada/__init__.py this is their original work. I have made a lot of changes, but i have verified they return identical results in superior timeframes for all of the stuff i've separated out of the main function, like the chi2pdf calculation.

GitHub

fabada/__init__.py at master · PabloMSanAla/fabada

Fully Adaptive Bayesian Algorithm for Data Analysis (FABADA) is a new approach of noise reduction methods. In this repository is shown the package developed for this new method based on \citepaper....

#

I could use some help optimizing what the constants should be for normalization and for variance calculation. I am positive these are the only things remaining that need some love and affection.

odd patio Jan 2, 2022, 3:24 AM

#

I am a newbie in machine learning

#

I use python and I am intermediate in python

#

But I need some projects in machine learning

#

Beginner projects?

tidal patrol Jan 2, 2022, 3:25 AM

#

odd patio Beginner projects?

How far are you into ml?

lapis sequoia Jan 2, 2022, 3:26 AM

#

like, to study, or like, something to do?

odd patio Jan 2, 2022, 4:31 AM

#

tidal patrol How far are you into ml?

just beginning

slow vigil Jan 2, 2022, 5:15 AM

#

does anyone know how to use pandas read_parquet() and make it include the partition column in the table it returns?
there is something in spark I think where you can provide a 'base path' argument and it will adjust the schema, but I can't find anything for pandas

lapis sequoia Jan 2, 2022, 6:09 AM

#

What is data science

steel fox Jan 2, 2022, 6:46 AM

#

What is life

outer bay Jan 2, 2022, 7:04 AM

#

Can anyone please provide some interesting resources on text summarization using k means clustering

hoary wigeon Jan 2, 2022, 7:33 AM

#

I need help with PCA.
I am using PCA on image array data.

I got 268 components (out of 2304; image shape 48x48) explaining 95% of the variance,

but my doubt is how can I plot 268 elements as an image... shape issues and all

safe elk Jan 2, 2022, 8:12 AM

#

So you want to probably mark off which part of the 48x48 accounts for the variance... what about color coding the 268 elements in the 48×48 grid? Most PCA examples use scatter plots, but image data has its own built-in coordinate system (x and y), so marking the pixels is a scatter plot too.

hoary wigeon Jan 2, 2022, 8:56 AM

#

safe elk So you want to probably mark off which part of the 48x48 accounts for the varian...

i have no idea regarding the reconstruction of image with missing pixels

odd patio Jan 2, 2022, 9:18 AM

#

tidal patrol How far are you into ml?

Guys

#

I already asked

#

I am a newbie in ml

#

are any beginner projects available to learn ml with python?

#

any websites?

lapis sequoia Jan 2, 2022, 9:50 AM

#

https://machinelearningmastery.com/machine-learning-in-python-step-by-step/ nice intro into Scipy, training, algorithms etc.

Machine Learning Mastery

Your First Machine Learning Project in Python Step-By-Step

Do you want to do machine learning using Python, but you’re having trouble getting started? In this post, you will […]

safe elk Jan 2, 2022, 9:54 AM

#

hoary wigeon i have no idea regarding the reconstruction of image with missing pixels

https://towardsdatascience.com/image-compression-using-principal-component-analysis-pca-253f26740a9f

Medium

Image Compression Using Principal Component Analysis (PCA)

Dimensionality Reduction in Action

#

https://stackoverflow.com/questions/55533116/pca-inverse-transform-in-sklearn

Stack Overflow

pca.inverse_transform in sklearn

after fitting my data into
X = my data

pca = PCA(n_components=1)
pca.fit(X)
X_pca = pca.fit_transform(X)
now X_pca has one dimension.

When I perform inverse transformation by definition isn't it

#

Maybe inverse transform can recover the image

#

pca_10 = PCA(n_components=10)
mnist_pca_10_reduced = pca_10.fit_transform(mnist)
mnist_pca_10_recovered = pca_10.inverse_transform(mnist_pca_10_reduced)

safe elk Jan 2, 2022, 10:32 AM

#

The inverse_transform() method of the pca object is used to decompress the reduced dataset back to 784 dimensions. This is very useful for visualizing the compressed image!

#

Yeah it can be helpful according to the article
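
Applied to the 48x48 case from the question, the same idea might look like the sketch below. It assumes scikit-learn's `PCA` and uses random arrays as a stand-in for the real images, so the number of components kept will not be 268:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the image data: 300 random 48x48 images, flattened to 2304 values.
rng = np.random.default_rng(0)
images = rng.random((300, 48 * 48))

# A float n_components keeps enough components to explain that share of the variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(images)

# inverse_transform maps back to 2304 values per image, which reshape to 48x48 for plotting.
recovered = pca.inverse_transform(reduced)
first_image = recovered[0].reshape(48, 48)

# Each principal component is itself a 2304-vector and can be viewed the same way.
first_component = pca.components_[0].reshape(48, 48)
print(reduced.shape, first_image.shape, first_component.shape)
```

Either 48x48 array can then be shown with something like `plt.imshow(first_image, cmap="gray")`. The reduced vector itself has no image shape; only the reconstruction and the components do.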

tidal patrol Jan 2, 2022, 1:54 PM

#

@odd patio
How about generate 50 random x and y axis numbers and graph them with matplotlib

odd patio Jan 2, 2022, 1:54 PM

#

okay

#

Thank you

odd patio Jan 2, 2022, 1:55 PM

#

tidal patrol @odd patio How about generate 50 random x and y axis numbers and grap...

Thank you 🤗

#

What are the major libraries I should learn for machine learning ? 🤔

tidal patrol Jan 2, 2022, 2:36 PM

#

Numpy, pandas, and csv are probably the first three you should learn. Then tensorflow

late vault Jan 2, 2022, 2:39 PM

#

odd patio What are the major libraries I should learn for machine learning ? 🤔

first, i'd suggest np pd and csv then scipy and tensor flow....

serene scaffold Jan 2, 2022, 2:39 PM

#

"learning libraries" isn't the way to go in the first place.

#

I have a message in the pins about what the major libraries are, but you shouldn't try to learn them in a box-check sort of way.

lapis sequoia Jan 2, 2022, 4:02 PM

#

Where is the best place to ask for help cleaning up a CSV data source for use in pandas? Is there a special channel?

serene scaffold Jan 2, 2022, 4:04 PM

#

@lapis sequoia this one

austere swift Jan 2, 2022, 4:17 PM

#

serene scaffold "learning libraries" isn't the way to go in the first place.

I agree with this, start learning the concepts and the math behind it before getting into libraries. once you know what's being done in the backend, you can adapt that to any library by just reading documentation

lapis sequoia Jan 2, 2022, 4:19 PM

#

The problem is the following, the values that should go in the last column are preceded by 3 empty columns in the source CSV. This seems to shift the headers 4 places to the right when parsing it with pandas' read_csv function. How can I resolve that? I can't modify the data manually since it's coming from a remote server and changes frequently.

serene scaffold Jan 2, 2022, 4:20 PM

#

lapis sequoia The problem is the following, the values that should go in the last column are p...

so the 12, 24, and 12 should be in the "casecount" column?

lapis sequoia Jan 2, 2022, 4:20 PM

#

serene scaffold so the 12, 24, and 12 should be in the "casecount" column?

Yes

#

And the id header should be above the first column (that contains 7, 32 etc.) in the 2nd screenshot

serene scaffold Jan 2, 2022, 4:21 PM

#

I suspect it has something to do with how the new_subcategory column is being parsed. In the CSV, does each value for new_subcategory have quotes around them?

lapis sequoia Jan 2, 2022, 4:22 PM

#

serene scaffold I suspect it has something to do with how the `new_subcategory` column is being ...

Only when it contains spaces, not for single word categories

serene scaffold Jan 2, 2022, 4:23 PM

#

lapis sequoia Only when it contains spaces, not for single word categories

I'm not sure what to do, unfortunately

#

if I had a copy of an inputted CSV, I might be able to figure it out, but I see that you can't share that.

lapis sequoia Jan 2, 2022, 4:24 PM

#

serene scaffold I'm not sure what to do, unfortunately

Okay, thanks. I just censored it because it contains adult content/words. I could share it

serene scaffold Jan 2, 2022, 4:24 PM

#

lapis sequoia Okay, thanks. I just censored it because it contains adult content/words. I coul...

I'll allow it.

#

just drag/drop the file into this chat.

lapis sequoia Jan 2, 2022, 4:25 PM

#

Well maybe not publicly share on such a big server lol

#

Don't want to get into hot water, since it's client work

serene scaffold Jan 2, 2022, 4:26 PM

#

Ping me if you decide that you're willing to upload it. Otherwise I'll have to move on to something else.

#

The only other thing I can think to suggest is that you try changing the delimiter for read_csv

lapis sequoia Jan 2, 2022, 4:27 PM

#

Can't share publicly without modifying it, so that would defeat the purpose. Delimiter is ; and I specified that

lapis sequoia Jan 2, 2022, 4:35 PM

#

serene scaffold so the 12, 24, and 12 should be in the "casecount" column?

The first numeric value here is the new_subcatid from the screenshot: 32869;Dildos;;;;;12 and the last one is casecount

So there are just too many delimiters in the data source

safe elk Jan 2, 2022, 4:36 PM

#

austere swift I agree with this, start learning the concepts and the math behind it before get...

Yes otherwise it can become mechanical. Maybe this checklist thinking is driven by the job market where they employ checklist driven hiring

serene scaffold Jan 2, 2022, 4:36 PM

#

lapis sequoia The first numeric value here is the `new_subcatid` from the screenshot: `32869;D...

who are you working for? tangerine_think

I guess just do df['casecount'] = df['???'] to correct it

#

if there are cases where the casecount value is actually in the casecount column, you could use fillna instead. assuming that missing values in the column are NaN
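
A sketch of the `fillna` variant on made-up data. `misplaced` is a hypothetical name for whichever column the values actually ended up in:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: "misplaced" stands in for the column the values landed in.
df = pd.DataFrame({
    "casecount": [np.nan, 24.0, np.nan],
    "misplaced": [12.0, np.nan, 12.0],
})

# Keep existing casecount values and fill only the gaps from the other column.
df["casecount"] = df["casecount"].fillna(df["misplaced"])
print(df["casecount"].tolist())
```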

lapis sequoia Jan 2, 2022, 4:40 PM

#

serene scaffold who are you working for? :tangerine_think: I guess just do...

Maybe I can get you an employee discount 😉 jk lol. It's just an adult drop shipping business.

casecount is in the wrong column on all rows

= df['???'] is that actual syntax?

serene scaffold Jan 2, 2022, 4:41 PM

#

@lapis sequoia if you're talking about the string, no, that's there because idk what the name of the column the values ended up in is.

austere swift Jan 2, 2022, 4:42 PM

#

@lapis sequoia I'm not completely sure that this would work but try making a df that reads headers only, then making a df that doesn't read the headers (skip the first row to make sure that the headers don't get counted as data), then you can drop the empty columns from the data df and assign the columns from header_df to data_df.columns

lapis sequoia Jan 2, 2022, 4:52 PM

#

austere swift @lapis sequoia I'm not completely sure that this would work but try makin...

Thanks I'll give that a try. I already spent hours on this, one thing I tried was dropping the last 4 columns including the casecount column (since it's not important) and re-assigning the headers. For some reason that didn't work out

austere swift Jan 2, 2022, 4:57 PM

#

it would be something like this

header_df = pd.read_csv(filename, nrows=0, delimiter=';') # nrows 0 so that it only reads headers
data_df = pd.read_csv(filename, header=None, skiprows=1, delimiter=';')
data_df.dropna(axis=1, inplace=True)
data_df.columns = header_df.columns

lapis sequoia Jan 2, 2022, 5:03 PM

#

austere swift it would be something like this ```py header_df = pd.read_csv(filename, nrows=0,...

I created the headers list manually for now and it works, next step would be to try your solution. Thanks! (the settings var is just coming from a loaded in config file)

df = pd.read_csv(
    settings['pandas_csv_options']['source_file'],
    sep=settings['pandas_csv_options']['separator'],
    header=None,
    skiprows=1,
    names=headers
)

austere swift Jan 2, 2022, 5:05 PM

#

i wouldn't think that assigning the headers directly in the read_csv would work since you would have to drop the empty columns before assigning the headers, but if it works then it's good anyways

lapis sequoia Jan 2, 2022, 5:06 PM

#

austere swift i wouldn't think that assigning the headers directly in the read_csv would work ...

You got me there, I have only tried it with a shortened CSV so far, that doesn't contain the problematic last columns

vague moon Jan 2, 2022, 5:18 PM

#

I could use a little help. I was trying to make an environment to practice on the MNIST dataset, but I was getting an error thrown trying to make the environment. Online, I found you can solve the issue by uninstalling and reinstalling Anaconda, so I did that, but now Anaconda installs without python or the powershell prompt, idk why.

serene scaffold Jan 2, 2022, 5:19 PM

#

vague moon I could use a little help. I was trying to make an environment to practice on th...

Any time you get an error message, and you want help with that error message, always immediately show the error message.

vague moon Jan 2, 2022, 5:19 PM

#

i don't want help with that error message, that's why i didn't include it

#

i don't even know what that error message is anymore

serene scaffold Jan 2, 2022, 5:20 PM

#

what OS are you on?

vague moon Jan 2, 2022, 5:20 PM

#

I'm asking for help with why my anaconda is installing without python or the powershell, I am on Windows 10

serene scaffold Jan 2, 2022, 5:20 PM

#

and what library/ies are you trying to use to experiment with the MNIST dataset?

vague moon Jan 2, 2022, 5:21 PM

#

I just need keras to import the dataset, I was going to go about getting the dataset another way but I found a way to import the dataset through keras

#

besides that I don't know off the top of my head

serene scaffold Jan 2, 2022, 5:23 PM

#

vague moon besides that I don't know off the top of my head

I would avoid Anaconda entirely unless you're sure it's the best way to install one of your dependencies. It should be simple to install tensorflow (which contains Keras) without Anaconda.

vague moon Jan 2, 2022, 5:29 PM

#

but what if I am wanting to install Anaconda

lapis sequoia Jan 2, 2022, 5:30 PM

#

austere swift it would be something like this ```py header_df = pd.read_csv(filename, nrows=0,...

Okay so that almost worked after adding how='all' to dropna BUT there's one additional completely empty ~~row~~ column in the CSV that also gets dropped, so I end up with 59 columns and 60 headers

serene scaffold Jan 2, 2022, 5:30 PM

#

vague moon but what if I am wanting to install Anaconda

Why do you want to install Anaconda if it's giving you issues and you can accomplish your end goal without it?

vague moon Jan 2, 2022, 5:31 PM

#

Because I want to, I would like to figure out what is going wrong for the future instead of never fixing it

crystal jewel Jan 2, 2022, 5:33 PM

#

I would like to ask another question unrelated to anaconda if that's ok 😛

vague moon Jan 2, 2022, 5:34 PM

#

then shoot

crystal jewel Jan 2, 2022, 5:34 PM

#

number_of_breeds = df.apply(pd.value_counts)

#

i have this line of code

#

which returns

Crossbred Canine/dog                                164
Retriever - Labrador                                136
Domestic Shorthair                                   63
Retriever - Golden                                   59
Dog (unknown)                                        56
...                                                 ...
[Retriever - Labrador, Catahoula Leopard Dog]         1
[Cattle Dog - Australian (blue heeler, red heel...    1
Spaniel - Tibetan                                     1
[Retriever - Labrador, Deutsche Dogge, Great Dane]    1
Mixed (Horse)

serene scaffold Jan 2, 2022, 5:34 PM

#

crystal jewel ```py number_of_breeds = df.apply(pd.value_counts) ```

please show df with print(df.head().to_dict('list'))

crystal jewel Jan 2, 2022, 5:35 PM

#

oh

serene scaffold Jan 2, 2022, 5:35 PM

#

serene scaffold please show `df` with `print(df.head().to_dict('list'))`

Please ping me when you have provided this.

#

In case you don't plan to provide that, refer to this:

#

!docs pandas.DataFrame.nunique

arctic wedgeBOT Jan 2, 2022, 5:37 PM

#

pandas.DataFrame.nunique


DataFrame.nunique(axis=0, dropna=True)
Count number of distinct elements in specified axis.

Return Series with number of distinct elements. Can ignore NaN values.

crystal jewel Jan 2, 2022, 5:37 PM

#

{0: [138, 126, 81, 71, 62]}

#

this one?

#

i did it this way

number_of_breeds.head().to_dict('list')

serene scaffold Jan 2, 2022, 5:38 PM

#

I asked for df, not number_of_breeds.

#

I'll be back in a few minutes.

crystal jewel Jan 2, 2022, 5:44 PM

#

it returns this :

{0: ['Retriever - Labrador', 'Shih Tzu', 'Hound - Basset', 'Alaskan Malamute', 'Shepherd Dog - German']}

serene scaffold Jan 2, 2022, 5:47 PM

#

crystal jewel it returns this : ``` {0: ['Retriever - Labrador', 'Shih Tzu', 'Hound - Basset',...

so, if you want the number of unique values in the 0 column, it's just df[0].nunique()

crystal jewel Jan 2, 2022, 5:49 PM

#

i just want to plot it

#

using dash

#

and i am not sure how to access each column i guess?

serene scaffold Jan 2, 2022, 5:49 PM

#

hmm, I'm not sure about that.

crystal jewel Jan 2, 2022, 5:49 PM

#

can't happen?

serene scaffold Jan 2, 2022, 5:49 PM

#

crystal jewel and i am not sure how to access each column i guess?

there's only one column?

#

if there's secretly a dataframe of interest with multiple columns, I haven't seen that one yet.

crystal jewel Jan 2, 2022, 5:50 PM

#

wait one moment, let's say we use the .value_counts method

#

it returns the thing above

#

can we somehow access each column

#

so we can plot it?

#

is that even 2 columns?

serene scaffold Jan 2, 2022, 5:51 PM

#

well, in that case, you should just use df[0].value_counts(), without using .apply
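A minimal sketch of that, using the five breeds from the sample above plus one made-up repeat so the counts differ. The `breed` and `count` column names are my own choice, not anything pandas requires.

```python
import pandas as pd

df = pd.DataFrame({0: ['Retriever - Labrador', 'Shih Tzu', 'Hound - Basset',
                       'Alaskan Malamute', 'Shepherd Dog - German',
                       'Retriever - Labrador']})

# value_counts() returns a Series: the index holds the breeds,
# the values hold how often each breed occurs
counts = df[0].value_counts()

# Turn that Series into a frame with two ordinary columns
breed_counts = pd.DataFrame({"breed": counts.index, "count": counts.to_numpy()})
print(breed_counts.head())
```

A bar chart function that takes a DataFrame plus x and y column names (plotly express `px.bar`, for example) can then use `breed` for x and `count` for y.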

mint palm Jan 2, 2022, 5:52 PM

#

earnest widget If you know, a lot of the other concepts, you can just apply it in a project wit...

gr8 advice getting into it now

serene scaffold Jan 2, 2022, 5:52 PM

#

crystal jewel is that even 2 columns?

you'd only have one value (a frequency) for each item (a breed)

#

if you want to plot it in two dimensions, you probably need a second value.

#

otherwise you're not demonstrating any kind of relationship

crystal jewel Jan 2, 2022, 5:53 PM

#

The idea is to plot the frequency of the values

#

with the values

serene scaffold Jan 2, 2022, 5:53 PM

#

I think we're mixing up terms here

crystal jewel Jan 2, 2022, 5:53 PM

#

we do?

#

one moment

serene scaffold Jan 2, 2022, 5:53 PM

#

in my usage, the frequency is a value

#

it sounds like you're using value to refer to an item (ie a dog breed)

crystal jewel Jan 2, 2022, 5:54 PM

#

Crossbred Canine/dog                                164

#

on Y axis

#

it should be 164

#

and on the X the "Crossbred Canine"

serene scaffold Jan 2, 2022, 5:54 PM

#

names of breeds aren't numbers

#

so any way that you order them on the y axis would be arbitrary

crystal jewel Jan 2, 2022, 5:54 PM

#

they have to be?

serene scaffold Jan 2, 2022, 5:56 PM

#

what library are you using to create the figure?

crystal jewel Jan 2, 2022, 5:57 PM

#

something like that

crystal jewel Jan 2, 2022, 5:57 PM

#

serene scaffold what library are you using to create the figure?

Dash

serene scaffold Jan 2, 2022, 6:01 PM

#

crystal jewel Dash

is plotly part of dash?

crystal jewel Jan 2, 2022, 6:02 PM

#

i think so

#

one moment

#

https://plotly.com/dash/

Dash Overview

The Dash platform empowers data science teams to focus on the data and models, while producing and sharing enterprise-ready analytic apps that sit on top of Python and R models.

#

it looks like it is

serene scaffold Jan 2, 2022, 6:03 PM

#

import plotly.express as px

df = px.data.gapminder().query("continent == 'Europe' and year == 2007 and pop > 2.e6")
fig = px.bar(df, y='pop', x='country', text='pop')
fig.show()

This code from the docs creates a bar chart where the labels are country names (and thus don't have a numerical value)

https://plotly.com/python/bar-charts/

Bar Charts

How to make Bar Charts in Python with Plotly.

#

I'm not sure that their df variable is actually a pandas DataFrame though, or pandas DFs are compatible in this context.

crystal jewel Jan 2, 2022, 6:05 PM

#

They are compatible, no worries about that

serene scaffold Jan 2, 2022, 6:05 PM

#

yay!

crystal jewel Jan 2, 2022, 6:05 PM

#

the issue remains tho :<

crystal jewel Jan 2, 2022, 6:06 PM

#

serene scaffold well, in that case, you should just use `df[0].value_counts()`, without using `....

this one returns

    number_of_breeds = df[0](pd.value_counts())
TypeError: value_counts() missing 1 required positional argument: 'values'

#

oh wait

#

ignore that

serene scaffold Jan 2, 2022, 6:07 PM

#

ignore what

crystal jewel Jan 2, 2022, 6:08 PM

#

the code you gave me above

#

it doesn't return what I wrote

#

it returns this
C 305
D 281
R 236
S 230
T 160
...
Pug 1
Great Pyrenees 1
Collie - Border 1
Schnauzer - Miniature 1
Shepherd Dog (unspecified) 1

#

which is kind of weird, but

serene scaffold Jan 2, 2022, 6:09 PM

#

In [1]: df = pd.DataFrame({0: ['Retriever - Labrador', 'Shih Tzu', 'Hound - Basset', 'Alaskan Malamute', 'Shepherd Dog - German']})

In [2]: df
Out[2]:
                       0
0   Retriever - Labrador
1               Shih Tzu
2         Hound - Basset
3       Alaskan Malamute
4  Shepherd Dog - German

In [3]: df[0].value_counts()
Out[3]:
Retriever - Labrador     1
Shih Tzu                 1
Hound - Basset           1
Alaskan Malamute         1
Shepherd Dog - German    1
Name: 0, dtype: int64

It worked when I did it.

crystal jewel Jan 2, 2022, 6:11 PM

#

ye that's probably because that's not the whole data

#

one sec

#

https://pastebin.com/TWjPbubX

Pastebin

DogData - Pastebin.com

Pastebin.com is the number one paste tool since 2002. Pastebin is a website where you can store text online for a set period of time.

#

and that's what it returns for me

Retriever - Labrador                           158
Crossbred Canine/dog                           154
Domestic Shorthair                              72
Dog (unknown)                                   62
Domestic (unspecified)                          52
                                              ... 
[Crossbred Canine/dog, Collie - Border]          1
[Shepherd Dog - German, Retriever - Golden]      1
Horse (other)                                    1
Goat (unknown)                                   1
Coonhound (unspecified)                          1

#

ok ok take a look

lapis sequoia Jan 2, 2022, 6:58 PM

#

austere swift i wouldn't think that assigning the headers directly in the read_csv would work ...

Well I did it, by typing out 62 headers manually, including headers for the empty columns. Work hard not smart. Actually not completely wasted time since I maybe wanted to rename the headers anyway. But pandas is awesome for what I'm trying to do. I hate thinking about doing that kind of data manipulation in NodeJS

austere swift Jan 2, 2022, 6:59 PM

#

yeah pandas is pretty awesome

#

although you should probably tell your client that they should fix their remote server data lol

#

it would likely be a lot easier to fix it on that end

lapis sequoia Jan 2, 2022, 7:16 PM

#

austere swift although you should probably tell your client that they should fix their remote ...

The data is actually coming from a big wholesaler which claims that they serve 1000s of online stores. The data is trash tho. They even have two columns with the exact same name in the CSV. Could still be a problem on my end somehow, who knows

twilit imp Jan 2, 2022, 7:49 PM

#

I need a team of people to help me on a secret project. This seemed like a good place to look for some. DM me if your curiosity is piqued and you can be trusted.

#

I'd give you guys more info if I thought it would be morally fine to do so

solar yew Jan 2, 2022, 8:07 PM

#

Any advice on working with large data in python? Currently (I believe) python is creating temporary stores for training and test data which then overload my ram before I get to putting it into the model

#

Or is 16GB simply a bit too small to work with

#

Online, people have suggested using dictionaries; however, I'm not quite sure how that can help

serene scaffold Jan 2, 2022, 8:20 PM

#

solar yew Any advice on working with large data in python? Currently (I believe) python is...

how much RAM do you have and what are you trying to do?

solar yew Jan 2, 2022, 8:21 PM

#

16GB and im just running a NLP with both text and non-text features

serene scaffold Jan 2, 2022, 8:21 PM

#

solar yew 16GB and im just running a NLP with both text and non-text features

what algorithm? what library?

solar yew Jan 2, 2022, 8:21 PM

#

using sklearn

#

but most of the ram used up in the preparation

austere swift Jan 2, 2022, 8:22 PM

#

how big is your dataset?

solar yew Jan 2, 2022, 8:22 PM

#

800,000,000 values or so

#

get a lack of memory error when trying to put it into naive bayes

serene scaffold Jan 2, 2022, 8:23 PM

#

solar yew 800,000,000 values or so

what austere swift wants to know is how much memory the dataset takes. Also, I still need to know what algorithm you're using.

#

I assume you're using one of these? https://scikit-learn.org/stable/modules/naive_bayes.html

scikit-learn

1.9. Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the val...

solar yew Jan 2, 2022, 8:24 PM

#

oh sorry, when i run it it takes about 13 GB of RAM

#

or 11mb of disk

serene scaffold Jan 2, 2022, 8:24 PM

#

There's a partial_fit method that you can use to train the model in batches
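A rough sketch of what batch training with partial_fit could look like. It assumes MultinomialNB (the thread hasn't confirmed which naive Bayes variant is in use) and uses random count data as a stand-in for the real features; in practice each batch would be loaded from disk instead of sliced from an array that is already in memory.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(1000, 20))   # stand-in for count features
y = rng.integers(0, 3, size=1000)         # stand-in for labels

# partial_fit needs the full list of classes on the first call,
# because a single batch may not contain every label
all_classes = np.array([0, 1, 2])

model = MultinomialNB()
batch_size = 100
for start in range(0, X.shape[0], batch_size):
    stop = start + batch_size
    model.partial_fit(X[start:stop], y[start:stop], classes=all_classes)

print(model.class_count_)  # samples seen per class, summed over all batches
```

Only one batch has to be held in memory at a time, which is the point of training this way.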

solar yew Jan 2, 2022, 8:24 PM

#

yeah naive bayes through sklearn thats the one

#

yeah i think that sounds like a good approach

#

thank you

serene scaffold Jan 2, 2022, 8:25 PM

#

yw

solar yew Jan 2, 2022, 8:26 PM

#

Is there a general best practice when using large data?

austere swift Jan 2, 2022, 8:26 PM

#

that essentially is the best practice, load/fit incrementally so that you don't have to have all the data in memory at once

serene scaffold Jan 2, 2022, 8:27 PM

#

solar yew Is there a general best practice when using large data?

Basically what austere swift just said. Though every institution I've worked at has had one or more high-performance computers so that it isn't necessary.

austere swift Jan 2, 2022, 8:29 PM

#

yeah in most cases theres a balance between how much memory the computer should have and how much data you need to use, since loading it in batches also makes the training quite a bit slower

#

most people write code on their personal computer and then have it run on a remote, more powerful computer that can run it more easily

solar yew Jan 2, 2022, 8:32 PM

#

ah yeah that makes perfect sense, thanks again

odd meteor Jan 2, 2022, 9:08 PM

#

Happy New Year guys 😀

stone marlin Jan 2, 2022, 9:53 PM

#

Whoo, new year.

I'm workin' on portfolio fodder, and I keep seeing a lot of love for Plotly. When I was workin' with it a while back you needed like your own local server for it and it was kind'a gross. Has it gotten better? What viz library are you all into? (I've been big into Altair, but will default to matplotlib if I just wanna see something quick.)

rose pasture Jan 2, 2022, 10:05 PM

#

Hi I am currently working on a Kaggle project and I'd like to understand why they used y = 'twp' to count the number of calls made per month on line 22. Shouldn't it be based off of timeStamp?
https://www.kaggle.com/vahidehdashti/911-call-data-eda

911 call data_EDA

Explore and run machine learning code with Kaggle Notebooks | Using data from Emergency - 911 Calls

odd gust Jan 2, 2022, 10:25 PM

#

stone marlin Whoo, new year. I'm workin' on portfolio fodder, and I keep seeing a lot of l...

Depends on the use case. I like seaborn, matplotlib, lux and pandas+matplotlib.pyplot. I also know that bokeh is loved by many out there, but I haven’t used it very much.

I also use streamlit a lot, it’s primarily for rapid prototyping of web apps in data science, and it has the most basic visualizations, which is easy to use 🙂

stone marlin Jan 2, 2022, 10:25 PM

#

I've tried bokeh before and it's pretty good! I've not heard of streamlit or lux before, so I'll check those out.

odd gust Jan 2, 2022, 10:28 PM

#

stone marlin I've tried bokeh before and it's pretty good! I've not heard of streamlit or lu...

https://docs.streamlit.io/library/api-reference/charts

https://github.com/lux-org/lux

Chart elements - Streamlit Docs

GitHub

GitHub - lux-org/lux: Automatically visualize your pandas dataframe...

Automatically visualize your pandas dataframe via a single print! 📊 💡 - GitHub - lux-org/lux: Automatically visualize your pandas dataframe via a single print! 📊 💡

stone marlin Jan 2, 2022, 10:29 PM

#

Oh, dang, Streamlit looks really cool.

odd gust Jan 2, 2022, 10:30 PM

#

odd gust https://docs.streamlit.io/library/api-reference/charts https://github.com/lux-o...

Streamlit is really cool and powerful. You can write a simple web app in 2 min 🎉

grim lintel Jan 2, 2022, 10:56 PM

#

oh wow streamlit does look really cool

#

can it produce interactive dashboards that are anything like the ones you can create in BI or Tableau?

twilit imp Jan 2, 2022, 11:17 PM

#

twilit imp I need a team of people to help me on a secret project. This seemed like a good ...

So I take it no one cares then

stone marlin Jan 2, 2022, 11:18 PM

#

I don't DM anyone for "secret projects". Very few people are going to put themselves out and express interest in a project where they don't know the person at all and they don't know the project at all.

#

Perhaps you could give some hints as to what the project is, what you expect us to do, and what your experience is? That might make more people interested.

twilit imp Jan 2, 2022, 11:19 PM

#

hmm.... let me think... I have to figure out what I can tell you...

#

I'm making an AGI. sort of.

#

Her name is AIRI.

#

It's not that it HAS to be secret, it's just that I'm worried too many people, and maybe even the wrong people, will know too much about it. Most projects that were publicly available that I did were copied before I could patent. Among other things.

stone marlin Jan 2, 2022, 11:24 PM

#

And then what're you looking for in a team? And what would your role be?

twilit imp Jan 2, 2022, 11:25 PM

#

I'm looking for people with experience in machine learning, know python, maybe even a little bit of SQL.

twilit imp Jan 2, 2022, 11:25 PM

#

stone marlin And then what're you looking for in a team? And what would your role be?

What do you mean by my role?

stone marlin Jan 2, 2022, 11:30 PM

#

Like, what would you be doing in the team.

twilit imp Jan 2, 2022, 11:33 PM

#

Essentially, you'd look through the list of things needed to be done, and if you have a moment where you're like, "Ah yes I know how to do that" then you'd do that. So you'd help out where you could.

stone marlin Jan 2, 2022, 11:36 PM

#

So, you'd be the project manager? Or main coder?

twilit imp Jan 2, 2022, 11:37 PM

#

I'd be project manager, but I'd also help where I could.

stone marlin Jan 2, 2022, 11:37 PM

#

(Just getting all this stuff out so people here might be more interested in working on it.)

twilit imp Jan 2, 2022, 11:38 PM

#

Yah maybe. Ima be honest with you: it's not like hierarchy level organized. It's more of a thing where it's like you have nothing to do so you pitch in your 2 cents and if it contributes then yay

stone marlin Jan 2, 2022, 11:39 PM

#

That's fine, lot'sa projects work like that. People here will prob be more interested in asking to work for it if they knew a little about what was gonna be involved, that's why I wanted to ask you all that stuff above. More likely to get interested peepz.

twilit imp Jan 2, 2022, 11:40 PM

#

hmm what might make peeps interested....

stone marlin Jan 2, 2022, 11:41 PM

#

Oh, I just meant the stuff you said already was fine. It's unlikely to get people if it's just like "hey dm me for a secret thing." but if you explain like you did above, it's more likely people will be like, "Okay, I can get into this."

twilit imp Jan 2, 2022, 11:41 PM

#

Right right I see what you were saying

#

Yah you've been pretty helpful

stone marlin Jan 2, 2022, 11:42 PM

#

No problemo, I can't promise anyone will be into it, but I hope that it piques someone's interest now!

twilit imp Jan 2, 2022, 11:43 PM

#

OH! I know something that might interest a few ~~weebs~~ peeps: It involves an anime girl

serene scaffold Jan 2, 2022, 11:45 PM

#

all the best things involve anime girls.

twilit imp Jan 2, 2022, 11:46 PM

#

hehe lmao

serene scaffold Jan 2, 2022, 11:46 PM

#

stone marlin I don't DM anyone for "secret projects". Very few people are going to put thems...

it's also against our rules to recruit for "secret projects".

twilit imp Jan 2, 2022, 11:46 PM

#

really? where?

serene scaffold Jan 2, 2022, 11:48 PM

#

I suppose it doesn't fall under any specific rule, but we only allow people to recruit for open-source projects, because if the project isn't completely transparent, we can't verify that it isn't a business venture, or that it's ethical, or that contributors will get any value for their contribution, etc.

twilit imp Jan 2, 2022, 11:52 PM

#

I literally banned a dude for trying to sell the project, so no not a business venture. I personally believe that if treated right it's perfectly ethical. I mean, this isn't going to be public even when it gets done, so even if we did give them credit (which we probably will anyways in a book I'm writing about it), they won't exactly become famous or anything.

serene scaffold Jan 2, 2022, 11:53 PM

#

I'm not sure what you're referring to (I haven't read all the way back in the conversation), but in either case, we're not going to budge on the requirement that all recruited-for projects be open-source.

serene scaffold Jan 2, 2022, 11:53 PM

#

twilit imp I need a team of people to help me on a secret project. This seemed like a good ...

ah, this

#

If you're not willing to open-source the project, please permanently stop recruiting for it. Thanks!

twilit imp Jan 2, 2022, 11:54 PM

#

What about an ambassadorial agreement

serene scaffold Jan 2, 2022, 11:54 PM

#

I don't know what that is.

twilit imp Jan 2, 2022, 11:55 PM

#

Essentially, we'd invite a highly trusted member of the server to the project. Then, they'd look it up and down through and through and if they think it's ok, they'd come back and say it's ok verifying it.

serene scaffold Jan 2, 2022, 11:56 PM

#

There is no way we'll allow you to recruit for a closed-source project. If you have any further questions or comments about this, please contact us via @sonic vapor.

twilit imp Jan 2, 2022, 11:56 PM

#

sad.

#

Do you personally know of any other places where I can find peeps?

serene scaffold Jan 2, 2022, 11:56 PM

#

No.

twilit imp Jan 2, 2022, 11:56 PM

#

great. big help.

#

If you're gonna get mad at people for doing it, you should probably put it in #rules .

serene scaffold Jan 2, 2022, 11:58 PM

#

I'll bring that up with the other moderators.

safe elk Jan 3, 2022, 12:28 AM

#

stone marlin I don't DM anyone for "secret projects". Very few people are going to put thems...

Yeah i suggest building a relationship first to build trust. I have been approached on LinkedIn and burned so be careful

twilit imp Jan 3, 2022, 12:33 AM

#

yea I know I just don't know many coders that actually know enough to help instead of hurt

stone marlin Jan 3, 2022, 12:40 AM

#

I get why the rule is in place: it's easy to exploit workers from this area. Same deal in a lot of the game dev discords. Makes sense.

#

Though I agree, if I can't point to a rule (or see a rule) it's hard to deter people from doing it. So, thank you for bringing it up with the other mods!

crisp sluice Jan 3, 2022, 1:03 AM

#

anyone familiar with dynamic mode decomposition?

#

trying to do a little project and im getting a bit stuck

night quartz Jan 3, 2022, 1:10 AM

#

Guys can you help me out, i am in VSCode right now, i inserted a .dot file and installed the Graphviz extension for it to run, but when i press on the three dots on the top right corner in VSCode there should be an option saying ‘Open Preview to the side’ but it’s not showing, please help me out i’ve been searching for an hour

#

it’s supposed to show a visual decision tree

#

I usually use Pycharm so i am not familiar with VSCode

serene scaffold Jan 3, 2022, 1:23 AM

#

night quartz Guys can you help me out, i am in VSCode right now, i inserted a .dot file and in...

sounds like an #editors-ides question.

delicate sphinx Jan 3, 2022, 1:25 AM

#

Does anyone have any tips on how to stop class imbalance on a Tensorflow model? I've looked at the Tensorflow guide using Credit Card Fraud Detection though my model has a few more classes (24,000 more to be exact)

#

it just outputs "yes" which is likely to be correct around 25% of the time

serene scaffold Jan 3, 2022, 1:29 AM

#

delicate sphinx Does anyone have any tips on how to stop class imbalance on a Tensorflow model? ...

what do you mean by "class imbalance"? this is usually a statement about the training data, not the predictions of a model.

delicate sphinx Jan 3, 2022, 1:30 AM

#

I have a majority class of 25% of all of my training data inputs which my model is unable to circumvent and instead just uses the "most common" answer when making future predictions

serene scaffold Jan 3, 2022, 1:30 AM

#

am I to understand that you're training a binary classifier where each item can either be "fraud" or "not fraud"?

delicate sphinx Jan 3, 2022, 1:30 AM

#

it's the way it trains

#

No, I was a bit misleading by using the credit card fraud data I apologise, I'm using 24,000 different classes so I'm trying to use a Categorical Cross Entropy Loss

serene scaffold Jan 3, 2022, 1:30 AM

#

you have 24k classes? 🤯

#

how many training instances are there?

delicate sphinx Jan 3, 2022, 1:31 AM

#

I just mentioned that because tensorflows only actual guide on class imbalance uses binary classifiers

#

well I have 248,000 bits of data to use

#

24,000 is a huge over-use anyway, but I wanted to include a larger vocabulary (output is one-hot)

serene scaffold Jan 3, 2022, 1:32 AM

#

delicate sphinx well I have 248,000 bits of data to use

by "248,000 bits of data" do you actually mean 248,000 training instances? because "bits" are a specific thing.

delicate sphinx Jan 3, 2022, 1:32 AM

#

yeah i just didnt know how better to say

#

I have 248,000 triplets of input1, input2, output

serene scaffold Jan 3, 2022, 1:32 AM

#

I see. And what algorithm are you using to classify them?

#

because I've never heard of any classifier having to learn 24k different classes.

delicate sphinx Jan 3, 2022, 1:33 AM

#

I use a Categorical Cross Entropy for loss, an Adam for Optimizing and output as a softmax

serene scaffold Jan 3, 2022, 1:33 AM

#

so it's a neural network of some kind?

delicate sphinx Jan 3, 2022, 1:33 AM

#

the output is 24,000 large so that the softmax gives me a vector that I can turn into a one-hot encoded vector

#

yeah it's an MLP

serene scaffold Jan 3, 2022, 1:33 AM

#

MLP?

delicate sphinx Jan 3, 2022, 1:34 AM

#

Multi-Layer Perceptron

#

Basically a feed-forward NN built from several fully connected layers

serene scaffold Jan 3, 2022, 1:35 AM

#

like I said, I've never heard of a classifier for that many classes, NN or otherwise

delicate sphinx Jan 3, 2022, 1:35 AM

#

fair enough, thanks anyways

ashen umbra Jan 3, 2022, 1:41 AM

#

hey I had a ques. So if you have a column of a bunch of tokens and then a list of words (outside the data frame). How can you check if a row contains any of the words?

#

this is the token from the df

#

and I have a list of words.. whose elements I wanna match with

serene scaffold Jan 3, 2022, 1:49 AM

#

ashen umbra

I assume this is a column of a DataFrame (ie, a Series). The only operation that a Series of lists supports is explosion.

If you want to check if a list contains at least one of a certain number of values, the best way is to put the values of interest in a set
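Something along these lines, as a sketch. The `tokens` column name and the word list are made up, and it assumes each cell holds a Python list of strings.

```python
import pandas as pd

df = pd.DataFrame({"tokens": [["the", "cat", "sat"], ["hello", "world"], []]})
words_of_interest = {"cat", "dog"}  # a set makes each membership test cheap

# True where the row's token list shares at least one word with the set
df["has_match"] = df["tokens"].map(
    lambda toks: not words_of_interest.isdisjoint(toks)
)

# Same answer via explode: one row per token, test each, then regroup by original row
exploded_match = (
    df["tokens"].explode().isin(words_of_interest).groupby(level=0).any()
)

print(df)
```

Either result can be used as a boolean mask, e.g. `df[df["has_match"]]`, to keep only the matching rows.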

rose pasture Jan 3, 2022, 2:31 AM

#

Is the topic of data science in general much harder to learn than web dev?

delicate sphinx Jan 3, 2022, 2:48 AM

#

rose pasture Is the topic of data science in general much harder to learn than web dev?

Depends what you're into, passion is a large factor in learning these sorts of things and a mix of both probably won't even hurt that much (there are a few websites that use many Data Science applications)

stone marlin Jan 3, 2022, 2:51 AM

#

24k classes? That is, by orders of magnitude, more than I've ever seen, haha.

delicate sphinx Jan 3, 2022, 3:00 AM

#

👀

#

If I cut out some data I can get to 13k classes

stone marlin Jan 3, 2022, 3:07 AM

#

What the heck are these classes where you need so many?

delicate sphinx Jan 3, 2022, 3:10 AM

#

words

#

a lot of the classes I'm making are pointless because they're only used once in comparison to a majority class that's used for 25% of all answers

#

(1 class = 25% of all answers, 2 classes = 38% of all answers, 12,998 classes make up remaining 62%)

#

https://visualqa.org/vqa_v1_download.html

delicate sphinx Jan 3, 2022, 3:12 AM

#

stone marlin What the heck are these classes where you need so many?

^ pain

stone marlin Jan 3, 2022, 3:17 AM

#

Dang, maybe it does require that, I don't do a lot of CV/NLP stuff. That still seems really, really high to me, and seems like, if it's that imbalanced, you could ensemble on another model which does the "Other" thing for labels who are predicted less than some number of times.

rose pasture Jan 3, 2022, 3:17 AM

#

delicate sphinx Depends what you're in to, passion is a large factor in learning these sorts of ...

Yeah i was planning on learning both to see which one id like more, but i wanted to go through the hardest one first lol i guess you’re right depends on the passion too

delicate sphinx Jan 3, 2022, 3:18 AM

#

stone marlin Dang, maybe it does require that, I don't do a lot of CV/NLP stuff. That still ...

yeah I propose an idea like that in the documentation I'm making for it, though was hoping I'd be able to complete my own program that allows processing of it

stone marlin Jan 3, 2022, 3:18 AM

#

Also, if it's predicting one label, what label is that where it's so popular?

delicate sphinx Jan 3, 2022, 3:18 AM

#

I managed to get it to consider different classes but I think it worked only because of a Tensorflow Seed giving me a good value for training

#

every other training hasn't worked

#

so my training is non deterministic

stone marlin Jan 3, 2022, 3:19 AM

#

You don't even need another model, you could just post-process and send everything under a certain frequency to an "Other" class.
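As a sketch of that kind of bucketing: count how often each answer label occurs and map anything below a threshold to a single "Other" label. The labels, the threshold of 2, and the name `collapse_rare` are all made up here; the same mapping could be applied to training targets or to predictions.

```python
from collections import Counter


def collapse_rare(labels, min_count, other="Other"):
    """Replace every label seen fewer than min_count times with `other`."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]


answers = ["yes", "no", "yes", "2", "yes", "no", "frisbee", "red"]
collapsed = collapse_rare(answers, min_count=2)
print(Counter(collapsed))
```

Where to set the threshold is a judgment call: a higher value shrinks the number of classes but also throws away more of the rare answers.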

delicate sphinx Jan 3, 2022, 3:19 AM

#

although now it's deterministic because it only outputs one thing

#

the popular label is "yes"

stone marlin Jan 3, 2022, 3:19 AM

#

Second most popular is "no", I guess?

delicate sphinx Jan 3, 2022, 3:19 AM

#

which accounts for about 23% of all answers

#

yeah thats like 17%

stone marlin Jan 3, 2022, 3:19 AM

#

Okay, and then what's an example of another one that's sort'a popular?

delicate sphinx Jan 3, 2022, 3:20 AM

#

something around that lol

#

the next most popular one is about 1000 (as opposed to no which is ~40,000)

#

think thats "2"

delicate sphinx Jan 3, 2022, 3:21 AM

#

rose pasture Yeah i was planning on learning both to see which one id like more, but i wanted...

Yeah, Data Science is probably harder but if you plan to just put it on websites you may as well learn websites first, imo DS is as hard as you let it be, which in my case is painfully hard

stone marlin Jan 3, 2022, 3:21 AM

#

Ha. Okay, shot in the dark here, but what about this: an initial model to say if a question has a "Yes/No" answer --- which will feed into another model that gets that yes/no answer --- and the rest of the data goes to yet another model that excludes the yes/no.

#

If only to see the distribution of the other answers, and how to group those.

delicate sphinx Jan 3, 2022, 3:21 AM

#

stone marlin Ha. Okay, shot in the dark here, but what about this: an initial model to say i...

ah probably dont have time for something that advanced

#

I wanna get something out of the way soon and then come back to it in the future

stone marlin Jan 3, 2022, 3:22 AM

#

It's all good, just thinking aloud.

delicate sphinx Jan 3, 2022, 3:22 AM

#

upload an "alpha"

#

but yeah that is a good idea haha

#

my whole idea was to focus on the question more than the image so a culmination of that would use that yes/no thing

stone marlin Jan 3, 2022, 3:23 AM

#

Depending on what the DS job is, it might require some heavy maths --- which is, for some, hard to learn and might take around a year or more to learn well. That's why I think that path for DS might be a bit harder than web dev, since you've got to have the math AND coding background. But as you note, it depends strongly on what you wanna do with it.

#

If someone just wants to plug-and-chug things into sklearn models, prob doesn't need to know a whole lot of linear algebra / stats. Same for making simple webpages.

delicate sphinx Jan 3, 2022, 3:24 AM

#

rose pasture Yeah i was planning on learning both to see which one id like more, but i wanted...

TBH, TensorFlow has done all the maths for me, some of the only complex stuff I've been doing is figuring out input shapes/sizes, TF offers a pretty easy and basic package for Data Science that you can use Flask with to create web-apps with a TF (Data Science) model running in the server

#

https://github.com/sominw/vqamd_floyd

GitHub - sominw/vqamd_floyd: Visual Question Answering through modal dialogue (B.Tech Project) + API

#

This guy does a pretty decent flask VQA model with semi-clear walkthrough

#

he has a supporting webpage writeup on it too that explains it further

#

though VQA probably isn't for first timers in Data Science as I found out x-x

stone marlin Jan 3, 2022, 3:26 AM

#

There are some people out there that can do DS without knowing much of the maths, but in terms of careers I've not seen too many DS people without a good grasp of the math after entry-level. [My biases here: live in the US, work mostly in small-to-mid startups, mostly timeseries + fin + IIoT work.]

supple prism Jan 3, 2022, 3:48 AM

#

Guys which is the best free source on internet to learn machine learning and data analysis

rose pasture Jan 3, 2022, 4:19 AM

#

delicate sphinx TBH, TensorFlow has done all the maths for me, some of the only complex stuff I'...

Thanks ill check it out! TensorFlow sounds interesting ill have to learn that for sure

delicate sphinx Jan 3, 2022, 4:19 AM

#

they have lots of online tutorials on their official website

#

definitely worth looking at

rose pasture Jan 3, 2022, 4:19 AM

#

stone marlin There are some people out there that can do DS without knowing much of the maths...

What type of tasks do you usually do at work as a data scientist

rose pasture Jan 3, 2022, 4:20 AM

#

delicate sphinx they have lots of online tutorials on their official website

tensorflow is used to build models?

delicate sphinx Jan 3, 2022, 4:20 AM

#

yeah

#

and then the models can be used to predict things

#

and as such, a flask application can provide a front end to the model

rose pasture Jan 3, 2022, 4:21 AM

#

delicate sphinx and then the models can be used to predict things

damn that's really interesting! are you a data scientist as well?

delicate sphinx Jan 3, 2022, 4:22 AM

#

I wish, I'm nothing special but trying to create a VQA model

rose pasture Jan 3, 2022, 4:23 AM

#

delicate sphinx I wish, I'm nothing special but trying to create a VQA model

nice I am trying to learn data science by myself and see if I could land a job later on, but if not I'd go back to school to get a degree to solidify my credentials

delicate sphinx Jan 3, 2022, 4:24 AM

#

Yeah, hopefully you'll manage, lots of current people in the field are all self taught and lots of places offer apprenticeships (best of both worlds)

stone marlin Jan 3, 2022, 4:24 AM

#

It can vary by the type of project, but I'd say it's like, 40% pipeline work, 50% EDA + making dashboards + talking to SMEs + going back and forth with the data-holders, and then like 10% research. Depends also on the DS person; I really dig the pipelining and devops side of things, so I usually will work with those teams, hence the more pipelining. Some colleagues will do more research, some more eda, etc., as the project requires. In a previous job it was more like 20% eda, 30% pipelining, and 50% building in-house engines.

delicate sphinx Jan 3, 2022, 4:24 AM

#

I went to uni and paid £40k to wait for my supervisor to email back, 13 months on still waiting

stone marlin Jan 3, 2022, 4:26 AM

#

It's not hard to get a job in the DS field --- since the title is pretty vague and can be anything from data analysis to data engineering to data science --- but it's hard to get a good job in DS where you're actually doing a lot of EDA + research, which is usually what people think of when they think of going into DS.

rose pasture Jan 3, 2022, 4:29 AM

#

stone marlin It's not hard to get a job in the DS field --- since the title is pretty vague a...

I see, that's what I was thinking too when I was on kaggle, most of the work seems to be lots of research + EDA. Doesn't data analysis or data engineering need different types of skills though?

rose pasture Jan 3, 2022, 4:29 AM

#

delicate sphinx I went to uni and paid £40k to wait for my supervisor to email back, 13 months o...

I am going through a Udemy course on data science right now to get a feel, but I honestly don't know what to do afterwards. Someone told me to look into hackathons or build my own projects through kaggle. What do you think?

stone marlin Jan 3, 2022, 4:30 AM

#

It's also the case that, at least in the US, there is a huge, huge, huge market saturation for DS at the entry level. We had an entry-level position open up at my previous company [travel company w/ a big ds focus] and we had 1.2k people apply to it. But, after parsing it down, we had around 200 people who actually were in any way qualified for it, something like 50 or so who could pass the hackercode challenge, I think 12 made it past the data analysis challenge (logistic regression, classifying problem), and we interviewed all of them. I think there were six good ones from that.

#

This is not to intimidate if y'all are going into entry level, but to note not to worry too much if people say, "Oh, we got 400 applicants!" because the vast majority are very much not good.

stone marlin Jan 3, 2022, 4:31 AM

#

rose pasture I see that's what I was thinking too when I was on kaggle, most of the works see...

DA + DE definitely require other skillsets, and you'll be able to point them out on resumes. Like, "Oh, this is data science, but I'm gonna mostly be doing DE tasks."

rose pasture Jan 3, 2022, 4:32 AM

#

Damn lol what would you recommend someone to learn to put on their resume to even be qualified in the beginning?

stone marlin Jan 3, 2022, 4:32 AM

#

I'd say that, for better or worse, the research + eda things are mostly given to the math-strong people, since, in my field at least, there's no way to just plop something into a NN and be done with it. You need to be able to explain results. So it's a lot of feature engineering + time series analysis, and that requires some pretty hefty stats know-how.

#

Note again that this is ONLY in my experience, and my experience is only a few small- to mid-sized startups in a city in the US.

#

I'd say that a common thread in who we hired and considered was:

Knowledge of Python/R (or some other language, we weren't too picky, and you can pick up either of those pretty quick if you know another).
Knowledge of SQL and roughly how DBs work and how to get data and clean data.
Knowledge of general ML methods and general supervised and unsupervised algorithms.

That's the bare minimum, IMO, but others may disagree. And it def depends on the industry. In a Computer Vision job, this would not be nearly enough. In a business job, it might not even need the first or third.

supple prism Jan 3, 2022, 4:36 AM

#

stone marlin I'd say that a common thread in who we hired and considered was: - Knowledge of...

Thanks

rose pasture Jan 3, 2022, 4:37 AM

#

stone marlin I'd say that, for better or worse, the research + eda things are mostly given to...

Yeah I like hearing each individual's experience haha I graduated in mechanical engineering recently but I want to switch so maybe the math part should be a bit easier for me since I can recycle what I've learned to help me out.

stone marlin Jan 3, 2022, 4:37 AM

#

If you're starting out, I think focusing on Python / R (or some other language) is gonna be the best bang for one's buck. If the DS isn't working out, you can fall back on software engineering for a bit.

#

Yeah, ME will be totally fine as a degree. We've had people who are even like, history or english or whatever majors, that's fine --- so long as you knew the math and could demonstrate it, it was all good.

rose pasture Jan 3, 2022, 4:37 AM

#

stone marlin I'd say that a common thread in who we hired and considered was: - Knowledge of...

Thanks so it all depends on which role you want to fill in later on to know what to learn right now

stone marlin Jan 3, 2022, 4:38 AM

#

Or, roughly, what type of role you think you would want. You could always change. But if your passion is computer vision, or if your passion is self-driving cars, you might wanna look at those job descriptions and see what they want and build towards that.

rose pasture Jan 3, 2022, 4:39 AM

#

stone marlin If you're starting out, I think focusing on Python / R (or some other language) ...

That's exactly what I was thinking haha I just finished a Python course and just finishing learning how to use Numpy, Pandas and Seaborn. Moving onto ML soon in my Udemy course

stone marlin Jan 3, 2022, 4:40 AM

#

Good, numpy + pandas are the backbone for this stuff --- but remember also that writing scripts in python is also useful, so that's good to touch on too.

#

Like, "Vanilla Python" writing.

rose pasture Jan 3, 2022, 4:41 AM

#

Gotcha I'd have to review that as well. Is OOP important in data science?

#

It was kind of a hard concept for me to grasp in the beginning but i think i got the hang of it now

stone marlin Jan 3, 2022, 4:41 AM

#

That's gonna strongly depend on where you go. My last place, it was very important. My second-to-last place, not at all.

#

I'd say keep trying with it, if only to strengthen your understanding of languages, design patterns, and architecture in general.

rose pasture Jan 3, 2022, 4:43 AM

#

Yeah there are still lots to learn for me haha

stone marlin Jan 3, 2022, 4:43 AM

#

For example, it may be the case that you will have to work with something like Airflow or PySpark or whatever, and you'll need to know how to maneuver around in those fairly-different-looking spheres. Knowing general python / general programming will help immensely, IMO.

delicate sphinx Jan 3, 2022, 4:43 AM

#

rose pasture I am going through a Udemy course on data science right now to get a feel, but I...

I'd recommend just using one or two TensorFlow tutorials. Some of what they teach takes a eureka moment to understand, as they don't always have comments on every line, but they're rather high level

#

Image captioning with visual attention or something is really good imo

#

But very complex

stone marlin Jan 3, 2022, 4:44 AM

#

I know it's wildly unpopular to say this here, but I always strongly recommend against doing NN stuff at first when learning DS (unless that's a popular tool in the area you're going into). Having said that, it may not be bad to look at to see if something clicks.

delicate sphinx Jan 3, 2022, 4:44 AM

#

You'd have to learn about custom encoding, decoding, attention, and training functions/classes, which is super confusing to go straight into

delicate sphinx Jan 3, 2022, 4:45 AM

#

stone marlin I know it's wildly unpopular to say this here, but I always strongly recommend a...

I'm not sure what else there is in ds but I trust this man so yeah do that hahaha

stone marlin Jan 3, 2022, 4:46 AM

#

Even though everyone in this room seems to LOVE NNs, I've rarely had to use them for actual day-to-day work. Additionally, they're mostly their own API, so knowing them does not really translate to knowing Python better.

delicate sphinx Jan 3, 2022, 4:46 AM

#

All I can think of besides nn is ensuring you understand the task at hand. If you can't figure out what you want to do it won't matter what you try. Understanding datasets is a huge help too. Always make sure you preprocess it into a form understandable to you

stone marlin Jan 3, 2022, 4:46 AM

#

Making a model in general (linear reg, logistic, trees, whatever) is also pretty much just a two-line ordeal to fit, but it requires the coder to do more data cleaning and feature engineering, which do translate to learning Python better.
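As a rough illustration of the "two-line" point, here is a sketch using scikit-learn's `LinearRegression`. The data is made up and already clean, which is exactly the part that real projects spend most of their time on:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# The actual modelling step really is just these two lines
model = LinearRegression()
model.fit(X, y)

# Linear models are easy to inspect after fitting
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```

Swapping in `LogisticRegression` or a tree model follows the same construct-then-fit pattern.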

delicate sphinx Jan 3, 2022, 4:47 AM

#

Anyways its 4.45am and I haven't been away from a screen for 17 hours today (i just cannot figure out my model haha)

#

So I must go to bed, gn lads/ladesses/ladems

stone marlin Jan 3, 2022, 4:47 AM

#

Yeah, I think that preprocessing is also a fine way to go, if you're gonna focus on NNs. That's v important to learn.

#

Gn!

#

As an alternative to NNs, I'd recommend some of the more basic Sklearn tutorials and focus on things like logistic regression, linear regression, tree models (random forests, boosted trees, etc.). I feel like this stuff is easier to "look into" and see what's happening, whereas NNs are often a big bulky black box.

#

This'll prob be covered in your ML class, though.

rose pasture Jan 3, 2022, 4:50 AM

#

Yeah I honestly don't know what my focus will be at this moment! I'll take what you said into consideration for sure. I still have a long way to go lol self driving car is really interesting now that you've said it haha i'd have to do more research. Yeah I see these topics are included in my ML class. Hopefully I can start working on a project by myself after this class

rose pasture Jan 3, 2022, 4:50 AM

#

delicate sphinx So I must go to bed, gn lads/ladesses/ladems

GN

stone marlin Jan 3, 2022, 4:51 AM

#

Haha, don't worry about it too much. I'd focus on getting through the classes, seeing where you are, and keeping on codin' things (even non-ML/DS things!) in Python or R. That'll put you in a good spot.

rose pasture Jan 3, 2022, 4:51 AM

#

Thanks man I really appreciate you taking the time to talk to me!

desert oar Jan 3, 2022, 4:55 AM

#

stone marlin I know it's wildly unpopular to say this here, but I always strongly recommend a...

plus1 from me, learn statistics and programming first, and then spend a couple weeks plunking around with deep learning

#

or plunk around with deep learning while you cool off between stats theory sessions 😉

stone marlin Jan 3, 2022, 4:55 AM

#

Yeah, I can't imagine going into deep learning without a solid foundation. I still get confused by the stuff and I've been doing this nonsense for years, haha.

desert oar Jan 3, 2022, 4:56 AM

#

i do actually think it's a useful toolset nowadays, if only because data scientists (in the "good" jobs) tend to be very close to management still, and need to make informed decisions about what tools and methods to use

#

data science is already too big of a field for any one person to be an expert in all of it. but much like programming, you would do well to develop "T-shaped" knowledge

#

broad but cursory knowledge of many things, deep knowledge of one thing

#

and be good at math. it will just make you faster and more efficient and better at everything data-related

stone marlin Jan 3, 2022, 4:57 AM

#

Yeah, I'm not sure how much time beginners should spend with it, but certainly after reaching mid-level or so, one should know their way around some of the more useful NNs.

desert oar Jan 3, 2022, 4:58 AM

#

right

#

heck even within traditional stats you can probably never know even close to all of it

stone marlin Jan 3, 2022, 4:58 AM

#

I feel like I am strongly biased against them in places like this because I've had to interview so many entry-level / intern DS people that ONLY know how to set up a TF thing or have done a tutorial on dog-recognition or something.

#

Oh, yeah, pretty much undergrad-level stats is usually good enough. Rarely do I need anything more than that!

#

NNs are def popular because of the cool things they can do, but it's one of those things where it's like --- yeah, you might be a carpenter who can REALLY use a saw well, but if you can't use nails or hammers or measuring tapes or... then you're basically going to be useless on a construction team.

desert oar Jan 3, 2022, 5:00 AM

#

stone marlin I feel like I am strongly biased against them in places like this because I've h...

what kind of organization do you work for, if you don't mind me asking?

stone marlin Jan 3, 2022, 5:00 AM

#

I've got a v specific gig that'll prob dox me, haha, but I'll say in general I work around the Industrial IoT space. I've worked in travel and medical as well.

#

Hence my emphasis on methods with explainability and my love for time series analysis junk, haha.

#

What're you in, if you don't mind my asking?

desert oar Jan 3, 2022, 5:02 AM

#

fair enough

#

i am currently a software engineer, taking a break from data science basically

#

although very much wanting to get back into DS

stone marlin Jan 3, 2022, 5:03 AM

#

That's great, though. I feel like there are so many DS projects that can benefit from SE knowledge and best practices.

desert oar Jan 3, 2022, 5:03 AM

#

i was at a large P&C insurance company before this, but was having a difficult time for a variety of reasons, so i burned out and quit

#

yeah it seems to be a desirable skillset

stone marlin Jan 3, 2022, 5:03 AM

#

Ha! I was in lending at one point, that nonsense is so easy to burn-out in. I definitely get it.

desert oar Jan 3, 2022, 5:03 AM

#

in hindsight i was probably a productivity multiplier

#

not that useful on my own, but made everyone around me more productive by being able to pick up all their programming slack

stone marlin Jan 3, 2022, 5:04 AM

#

Which def could lead to that burnout.

desert oar Jan 3, 2022, 5:04 AM

#

for sure, i was also a very susceptible person at the time

#

i'm very lucky i was able to quit (thanks covid!) and reset

stone marlin Jan 3, 2022, 5:05 AM

#

I did a very similar thing, it was super good. Gives time to learn new stuff and reposition.

desert oar Jan 3, 2022, 5:05 AM

#

plus now i have a spouse who keeps me from becoming a degenerate again

desert oar Jan 3, 2022, 5:06 AM

#

stone marlin I did a very similar thing, it was super good. Gives time to learn new stuff an...

yep, i'm going through this now. i am self-studying some stuff that otherwise i'd never have had time to study

#

my only fear when i quit was that the industry would pass me by, and my skills would atrophy, and i'd be unemployable in DS

#

that seems to not be happening, so it does seem good

#

when did you take your break / how long?

stone marlin Jan 3, 2022, 5:07 AM

#

Prob around half-a-year or eight months, and I felt the same way.

#

Lots of new stuff out, and new stuff in Python to get good at, but otherwise it looks pretty much the same. PySpark looks to have gotten a bit better, too, haha, but I haven't started on that yet.

desert oar Jan 3, 2022, 5:26 AM

#

heh yep i noticed the same about pyspark

safe elk Jan 3, 2022, 5:37 AM

#

desert oar data science is already too big of a field for any one person to be an expert in...

Lol I have heard of T shape too

safe elk Jan 3, 2022, 5:39 AM

#

stone marlin Yeah, I'm not sure how much time beginners should spend with it, but certainly a...

Issue is they go for the sexy new thing rather than foundational math or stats

stone marlin Jan 3, 2022, 5:39 AM

#

Yeah, that's to be expected, I guess. :'] It's all bright and shiny, v alluring.

safe elk Jan 3, 2022, 5:40 AM

#

stone marlin NNs are def popular because of the cool things they can do, but it's one of thos...

You can blame youtube tutorials.. ask them if they can go beyond mechanical coding and come up with new algo and new solutions

safe elk Jan 3, 2022, 5:41 AM

#

stone marlin Which def could lead to that burnout.

Lol been in the burned out camp but recovered

safe elk Jan 3, 2022, 5:42 AM

#

desert oar i'm very lucky i was able to quit (thanks covid!) and reset

Yes Covid has been great for some of us lol

safe elk Jan 3, 2022, 5:42 AM

#

stone marlin I did a very similar thing, it was super good. Gives time to learn new stuff an...

Doing same

stone marlin Jan 3, 2022, 5:43 AM

#

Speaking of learning new things, someone mentioned Streamlit to me today in here and I started using it --- it's pretty fantastic, I really dig it.

magic dune Jan 3, 2022, 5:45 AM

#

hello, this is my linear regression brute force formula

#imports
import matplotlib.pyplot as plt
import numpy as np

def euclidean_distance_calc(y, data):
    """Calculate the euclidean distance between the predicted and observed y values"""
    data_y = data[:, 1]
    # Square the residuals so positive and negative errors cannot cancel out
    score = np.sqrt(np.sum((y - data_y) ** 2))
    return score


def ploting(x,y,data,Loss):
    """This Function plots all the points"""
    color = '#1C2833'
    plt.plot(x, y,label=f"Score is {Loss}" )
    plt.xlabel('x', color=color)
    plt.ylabel('y', color=color)
    plt.scatter(data[:,0], data[:,1])
    plt.legend(loc='upper left')
    plt.axis("equal")
    plt.grid()

def main():
    #Variables
    L = 100
    data = np.array([[1, 1], [2, 2], [3, 3]])
    for i in range(L):
        m = np.random.randint(1,4)
        b = np.random.randint(1,10)
        x = data[:,0]
        y = m * x + b
        Loss = euclidean_distance_calc(y,data)
        ploting(x,y,data,Loss)


if __name__ == "__main__":
    main()
    plt.show()

It works. The problem is I want a better way to scale m and b to the data instead of just guessing.

Yes, I do know I can use calculus, but I'd rather not because this is a brute force version

stone marlin Jan 3, 2022, 5:46 AM

#

If I'm understanding correctly, this is exactly the problem that https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html solves.
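A minimal sketch of gradient descent on the same three points, assuming a mean-squared-error loss. The function name `fit_line_gd`, the learning rate, and the step count are illustrative choices, not anything taken from the linked page:

```python
import numpy as np

def fit_line_gd(x, y, lr=0.05, steps=2000):
    """Fit y ~ m*x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        residual = (m * x + b) - y
        # Partial derivatives of mean(residual**2) with respect to m and b
        grad_m = (2.0 / n) * np.sum(residual * x)
        grad_b = (2.0 / n) * np.sum(residual)
        # Step against the gradient instead of guessing m and b at random
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

data = np.array([[1, 1], [2, 2], [3, 3]], dtype=float)
m, b = fit_line_gd(data[:, 0], data[:, 1])
print(m, b)
```

Each update moves m and b in the direction that lowers the loss, so the search is guided by the data rather than by random draws.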

magic dune Jan 3, 2022, 5:48 AM

#

stone marlin If I'm understanding correctly, this is _exactly_ the problem that https://ml-ch...

thanks for the source

odd patio Jan 3, 2022, 6:10 AM

#

late vault first, i'd suggest np pd and csv then scipy and tensor flow....

Okay

desert oar Jan 3, 2022, 6:32 AM

#

magic dune thanks for the source

you might also be interested in other numerical optimization techniques like newton's method / newton-raphson method, coordinate descent, et al
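For a feel of how Newton-Raphson works, here is a small root-finding sketch in plain Python. The helper `newton_raphson` and the example function are made up for illustration; to use it for optimization, one would apply the same update to the derivative of the loss:

```python
def newton_raphson(f, df, x0, tol=1e-12, max_iter=50):
    """Find a root of f starting from x0, using its derivative df."""
    x = x0
    for _ in range(max_iter):
        # Follow the tangent line at x down to where it crosses zero
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Example: the positive root of x^2 - 2, i.e. the square root of 2
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```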

safe elk Jan 3, 2022, 6:41 AM

#

desert oar you might also be interested in other numerical optimization techniques like new...

I remember those along with runge kutta ... some of them were taught in a Numerical Analysis Course at my Uni.. grab a used Applied Numerical Analysis book at a second hand bookstore. Better than Googling

desert oar Jan 3, 2022, 6:42 AM

#

indeed, although i usually file those things away as "things i mostly don't need to know the details of"

#

i've used lbfgs before but not for any strong technical reason, it's just what gave good results quickly

safe elk Jan 3, 2022, 6:43 AM

#

Yep the libraries...but best to know how they work

#

The prof though made the students implement some of those algos from scratch in C or Pascal ..

desert oar Jan 3, 2022, 6:45 AM

#

probably useful if you want to be a developer of scientific software or a machine learning engineer

#

maybe not that useful if you want to be a statistician

#

but yeah i agree you should at least know how they work in general

#

even if you haven't worked through the underlying theory

safe elk Jan 3, 2022, 6:48 AM

#

Done a bit of scientific software and ML myself but bulk of my career was in Traditional Web and Desktop software dev

#

I enjoy the maths though and like understanding how things work in general

stone marlin Jan 3, 2022, 7:25 AM

#

I always forget about newton, and that'd prob work fine here. I remember RK-4! That's --- well, something with ODEs. I guess I don't remember it as well as I should.

lapis sequoia Jan 3, 2022, 7:40 AM

#

guys, i have question

#

i have two datetime columns, each one starting at a specific date.

neither of them has null values.

the first runs from 2016 to 2021, and the second runs from 2014 to 2018, and they have some overlapping points.

The Question is: How can i combine those two columns into one while preserving the rest of the columns? obviously i will get more rows after this operation.
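One possible reading of this question is a wide-to-long reshape, which pandas does with `DataFrame.melt`. The sketch below assumes that reading; the column names `id`, `start_a`, and `start_b` and the dates are hypothetical:

```python
import pandas as pd

# Hypothetical frame with two datetime columns plus one other column to preserve
df = pd.DataFrame({
    "id": [1, 2],
    "start_a": pd.to_datetime(["2016-01-01", "2017-06-01"]),
    "start_b": pd.to_datetime(["2014-03-01", "2017-06-01"]),
})

# Stack both datetime columns into a single "date" column.
# The remaining columns are repeated for each stacked row, so the row count doubles.
long = df.melt(id_vars=["id"], value_vars=["start_a", "start_b"],
               var_name="source", value_name="date")
long = long.sort_values("date").reset_index(drop=True)
print(long)
```

The `source` column records which original column each date came from, so overlapping dates can still be told apart or dropped afterwards.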

marsh yacht Jan 3, 2022, 9:13 AM

#

"Environment variable $DATABASE_URL not set, and no connect string given."

#

i have a sql error

prisma lake Jan 3, 2022, 9:27 AM

#

Data science and AI are on my 2022 list of things I want to learn. Can someone please recommend where to start?

bold timber Jan 3, 2022, 9:37 AM

#

Hi, I have a question: do we need polynomial features in classification?

novel raven Jan 3, 2022, 9:58 AM

#

x_train = numpy.array(read_x).reshape(-1,1)
y_train = []

for i in y_read[::-1]:
    y_train.append(i)

model.fit(x_train,numpy.array(y_train))
prediction = model.predict(np.array([453]).reshape(-1,1)) # trained up to 452; the prediction here looks right, but it stays the same for 460 or higher, while values below 452 predict as expected
print(prediction)
whenever I increase the input value beyond what is in the training data, the prediction does not change, but when I reduce it to a value that is in the training data it predicts as expected

#

I'm very new to sklearn and it's one of my first projects, I haven't watched many tutorials

#

also im using RandomForestRegressor
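This behaviour is expected from tree-based models: a random forest predicts by averaging values stored in leaves that were built from the training data, so it cannot extrapolate beyond the range it saw. A small sketch on made-up data (a straight line, not the poster's dataset) shows the flat predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up training data: x runs from 1 to 452 and y = 3x
x_train = np.arange(1, 453).reshape(-1, 1)
y_train = 3.0 * x_train.ravel()

model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(x_train, y_train)

# Inside the training range the predictions follow the data
inside = model.predict(np.array([[100], [200]]))
# Beyond the largest training x, every input lands in the same leaves,
# so the prediction stops changing
outside = model.predict(np.array([[453], [460], [1000]]))
print(inside, outside)
```

If the target is expected to keep growing past the training range, a model that can extrapolate a trend, such as linear regression, may be a better fit.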

jovial junco Jan 3, 2022, 10:08 AM

#

anyone here know neat?

#

;-; i need help

vapid sentinel Jan 3, 2022, 10:08 AM

#

Can anybody help me with machine learning problems?

night gorge Jan 3, 2022, 10:53 AM

#

I was trying to fill missing values of a categorical data with mode. I used following code where "Embarked" is the categorical variable. But it's not working..
df["Embarked"].fillna(df["Embarked"].mode())

keen storm Jan 3, 2022, 10:55 AM

#

hi all! I've been playing with pandas a bit lately trying to sort through some scientific data. It's been a blast but there are some simple things that I can't seem to be able to google for.
I have a table with different columns. Some of the values are the same for all columns, some instead appear only in certain columns. I'd like to sort all the columns so that values that match are all next to each other.
For example, if ColumnA has value "10", and ColumnB and C also have a value "10", then all the "10"s are put next to each other. If any of the columns is missing "10", then the field is left blank.

#

for v in sample['ColumnA']:
    print(v in sample['ColumnB'].values)

#

I can check if values exist in different columns like this. I'm unsure of how to do this for multiple columns and how to implement the logic of sorting the values accordingly or leaving blank

serene scaffold Jan 3, 2022, 11:13 AM

#

keen storm for v in sample['ColumnA']: print(v in sample['ColumnB'].values)

resist the temptation to loop. you'll learn more if you look for an idiomatic way to do it.

keen storm Jan 3, 2022, 11:13 AM

#

I just thought that maybe I could pivot. I'm not sure if I'm looking in the right direction.

#

Maybe I can have values as indexes and have column names just show the presence (true or false)

serene scaffold Jan 3, 2022, 11:13 AM

#

can you give a copy/pastable sample of the data?

#

df.head().to_dict('list')

#

also, are you trying to do this for some practical reason, or are you just experimenting?

#

it doesn't seem like a very useful thing to do

keen storm Jan 3, 2022, 11:18 AM

#

I'm not sure if pivot is what I'm looking for though

serene scaffold Jan 3, 2022, 11:19 AM

#

I need to do something else for a few minutes, but I will only be able to continue helping if you provide the result of print(df.head().to_dict('list'))

night gorge Jan 3, 2022, 11:27 AM

#

night gorge I was trying to fill missing values of a categorical data with mode. I used foll...

anyone?

serene scaffold Jan 3, 2022, 11:28 AM

#

night gorge I was trying to fill missing values of a categorical data with mode. I used foll...

I was about to start working, but if you can do print(df.head().to_dict('list')) immediately, I will take a look.

night gorge Jan 3, 2022, 11:29 AM

#

{'Survived': [0, 1, 1, 1, 0], 'Pclass': [3, 1, 3, 1, 3], 'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'Heikkinen, Miss. Laina', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'Allen, Mr. William Henry'], 'Sex': ['male', 'female', 'female', 'female', 'male'], 'Age': [22.0, 38.0, 26.0, 35.0, 35.0], 'SibSp': [1, 1, 0, 1, 0], 'Parch': [0, 0, 0, 0, 0], 'Ticket': ['A/5 21171', 'PC 17599', 'STON/O2. 3101282', '113803', '373450'], 'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05], 'Embarked': ['S', 'C', 'S', 'S', 'S']}

serene scaffold Jan 3, 2022, 11:29 AM

#

Thank you

serene scaffold Jan 3, 2022, 11:31 AM

#

night gorge I was trying to fill missing values of a categorical data with mode. I used foll...

In [5]:  df["Embarked"].fillna(df["Embarked"].mode())
Out[5]:
0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object

It works when I do it. In what way is it not working?

lapis sequoia Jan 3, 2022, 11:34 AM

#

Tensorflow, anyone?

serene scaffold Jan 3, 2022, 11:34 AM

#

lapis sequoia Tensorflow, anyone?

Try giving enough information that someone who knows about tensorflow could read your question and answer it.

night gorge Jan 3, 2022, 11:35 AM

#

It is a long dataset with 891 rows. In two rows "Embarked" values are missing. So tried filling missing values with mode using following code.
df["Embarked"].fillna(df["Embarked"].mode())
But after that when I try to print sum of na, it is still showing 2.
df.isna().sum()

serene scaffold Jan 3, 2022, 11:35 AM

#

night gorge It is a long dataset with 891 rows. In two rows "Embarked" values are missing. ...

show what df["Embarked"].mode() is

lapis sequoia Jan 3, 2022, 11:36 AM

#

serene scaffold Try giving enough information that someone who knows about tensorflow could read...

import tensorflow as tf
sess = tf.Session()
op = sess.graph.get_operations()
print([m.name for m in op])
print([m.values() for m in op])

I'm doing it right or..?

night gorge Jan 3, 2022, 11:36 AM

#

serene scaffold show what `df["Embarked"].mode()` is

S

serene scaffold Jan 3, 2022, 11:36 AM

#

lapis sequoia ```py import tensorflow as tf sess = tf.Session() op = sess.graph.get_operations...

You're "doing it right" if this code relates to your question, but you have not asked a question yet.

night gorge Jan 3, 2022, 11:37 AM

#

serene scaffold show what `df["Embarked"].mode()` is

Actually its printing
0 S

lapis sequoia Jan 3, 2022, 11:37 AM

#

serene scaffold You're "doing it right" if this code relates to your question, but you have not ...

oh thanks I just need some corrections from professional

serene scaffold Jan 3, 2022, 11:38 AM

#

lapis sequoia oh thanks I just need some corrections from professional

Why does it have to be from a professional? And why is this code wrong?

serene scaffold Jan 3, 2022, 11:38 AM

#

night gorge Actually its printing 0 S

so, the problem is that mode is returning a Series rather than a stand-alone value. Try .mode().iat[0] to pick the first element.
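A short sketch of why the original call leaves the gaps in place and how the suggested fix behaves. The tiny frame is made up. `fillna` with a Series aligns on index labels, and `mode()` returns a Series whose only label is 0, so only a missing value at label 0 could ever be filled:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the Titanic column, with two missing values
df = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S", np.nan]})

mode_series = df["Embarked"].mode()

# Passing the Series aligns on the index, so rows 2 and 4 stay missing
partly_filled = df["Embarked"].fillna(mode_series)

# Passing the scalar fills every gap; fillna returns a new Series,
# so the result has to be assigned back to the column
most_common = mode_series.iat[0]
df["Embarked"] = df["Embarked"].fillna(most_common)
print(df["Embarked"].isna().sum())
```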

vapid shoal Jan 3, 2022, 11:40 AM

#

for anyone interested in Quantum Computing: https://quantumzeitgeist.com/a-quantum-year-in-review-what-happened-in-2021-our-picks-of-news-and-views/

night gorge Jan 3, 2022, 11:42 AM

#

serene scaffold so, the problem is that `mode` is returning a Series rather than a stand-alone v...

Thanks a lot

odd meteor Jan 3, 2022, 12:48 PM

#

stone marlin Whoo, new year. I'm workin' on portfolio fodder, and I keep seeing a lot of l...

Seaborn, Matplotlib, and Plotly

keen storm Jan 3, 2022, 1:04 PM

#

I'm trying to do a simple outer join to solve my problem. However, I have to do a join among a df rows, not among different dataframes. It seems to not be possible.

#

I tried with concat(), but the result of the join is not what I expect

keen storm Jan 3, 2022, 1:05 PM

#

serene scaffold I need to do something else for a few minutes, but I will only be able to contin...

sorry, I had missed this

keen storm Jan 3, 2022, 1:06 PM

#

serene scaffold I need to do something else for a few minutes, but I will only be able to contin...

https://pastebin.com/NqvdafhS

{'PDB': [nan, '(2-Aziridinylethyl)amine', '(S)-(+)-1-Cyclohexylethy...

serene scaffold Jan 3, 2022, 1:06 PM

#

Alright, let me see.

keen storm Jan 3, 2022, 1:07 PM

#

the practical purpose is to find out whether a specific metabolite shows up in different mediums (and which ones)

#

so it's an experiment repeated using different mediums, and I need to find out how the medium affects the production of certain metabolites. For that I need to visualize which metabolites are present in which mediums at a glance

vale wedge Jan 3, 2022, 1:08 PM

#

hi, may I ask a question? Right now I'm working on my final year project and planning to implement data visualization for it. Can anyone suggest how to do it?

keen storm Jan 3, 2022, 1:08 PM

#

for further reference, this is what I have now

#

this is what I want to end up with

#

this is a spreadsheet, but it shows what I want

#

alternatively, this works

#

I feel like this should be very simple. Maybe I'm overthinking it

odd meteor Jan 3, 2022, 1:10 PM

#

rose pasture Thanks ill check it out! TensorFlow sounds interesting ill have to learn that fo...

While TensorFlow might appear alluring you might wanna consider learning OOP in Python as well just so you can easily get yourself acquainted with PyTorch.

Knowing 2 Deep Learning frameworks can easily increase your worth, make you super flexible, and perhaps a little bit indispensable (with the right attitude) 😀

You can imagine someone in Software dev domain who knows React + Vue.Js + Laravel

Moral of the story: try not to be over-dependent on one framework. Know at least two.

vale wedge Jan 3, 2022, 1:10 PM

#

basically, my web app already deploys a machine learning model, and now I'm planning to add visualization to my website

#

anyone have any suggestion for me?

serene scaffold Jan 3, 2022, 1:15 PM

#

@keen storm unfortunately I haven't come up with a solution yet, but I have to do something else. I'll leave it running on my computer in case I get a chance to try again later.

keen storm Jan 3, 2022, 1:15 PM

#

serene scaffold <@!88986626268622848> unfortunately I haven't come up with a solution yet, but I...

thanks a lot for looking at it!

odd meteor Jan 3, 2022, 1:16 PM

#

stone marlin It's not hard to get a job in the DS field --- since the title is pretty vague a...

With this, do you think getting a Master's degree helps in breaking into the research domain?

Sometimes I imagine: what if the people who discovered attention and Transformer models hadn't seen fit to go into research? We'd probably still be stuck with RNNs and LSTMs 😂

odd meteor Jan 3, 2022, 1:44 PM

#

lapis sequoia i have two datetime columns, each one start at a specific date. both of them go...

Create a new column entirely, then merge (you could use concatenate or join method as well) both date columns row-wise. Finally, sort the values of the new column to remove any overlapping points.

Once mission has been accomplished, you can then get rid of the two date columns.
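
One way to sketch the row-wise stacking described above, assuming the columns come in (dateN, PriceN) pairs. The frame `wide`, the list `pairs`, and the sample values are made up for illustration. Each pair is renamed to a shared `date`/`price` layout, stacked with `concat`, and then pivoted so every date appears once. Stacking also avoids repeated outer merges, which can multiply rows when the join keys contain duplicates (pandas matches missing keys with each other as well).

```python
import pandas as pd

# Hypothetical wide frame: each (dateN, PriceN) pair is its own time series.
wide = pd.DataFrame({
    "date1": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "Price1": [10.0, 11.0, 12.0],
    "date2": pd.to_datetime(["2021-01-02", "2021-01-03", "2021-01-04"]),
    "Price2": [20.0, 21.0, 22.0],
})
pairs = [("date1", "Price1"), ("date2", "Price2")]

# Stack every pair row-wise into one long frame with a shared "date" column.
long = pd.concat(
    [
        wide[[d, p]].rename(columns={d: "date", p: "price"}).assign(series=p)
        for d, p in pairs
    ],
    ignore_index=True,
).dropna(subset=["date"])  # rows without a date cannot be placed on the timeline

# One row per date, one column per series; a date absent from a series gives NaN.
# aggfunc="first" keeps the first price if a series repeats a date.
merged = long.pivot_table(
    index="date", columns="series", values="price", aggfunc="first"
).sort_index()

print(merged)
```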

odd meteor Jan 3, 2022, 1:54 PM

#

bold timber Hi, I have a questions: Do we need polynomial feature in classfication?

Since the intent really is to get the loss function as low as possible, engineering new polynomial features that help your model learn more of the underlying patterns in your data won't hurt.

However, remember the response variable being predicted in any classification task is always a discrete value.

odd meteor Jan 3, 2022, 2:06 PM

#

prisma lake In my 2022 list of stuffs I want to learn data science and ai is one of it. Some...

Check the pinned message
Do you want free resources or a paid one with more structured courses?

Free Resources

a) Kaggle, FreeCodeCamp, YouTube, Andrew Ng's ML and DL courses on Coursera etc

Paid

a) Udemy, DataQuest, DataCamp, Coursera

b) Bootcamp: check FourthBrain.ai

c) Graduate Studies in University

lapis sequoia Jan 3, 2022, 2:06 PM

#

odd meteor Create a new column entirely, then merge (you could use concatenate or join meth...

I did as you suggested, but another problem occurred: a MemoryError, 28.7 GB

odd meteor Jan 3, 2022, 2:12 PM

#

lapis sequoia i have used as you tell me but another problem occurred, it is a Memory Error, 2...

You were able to merge it successfully, right? And the memory error is from another operation, yeah?

pastel valley Jan 3, 2022, 2:14 PM

#

what metrics or methods can I use to compare 2 identical CNN models trained with different augmentation techniques and check which model performs best?

#

or should say more reliable and accurate

lapis sequoia Jan 3, 2022, 2:18 PM

#

odd meteor You successfully was able to merge it right? And the memory error is from anothe...

the memory problem is from the merge operation

#

and yes i was able to merge successfully when using a sample of 10 rows from that data

odd meteor Jan 3, 2022, 2:23 PM

#

pastel valley what metrics or method i can use to compare 2 identical cnn models but trained w...

Look at the two CNN performances on unseen data (Validation set and/or other Holdout set)

Check the networks' validation loss and validation accuracy.

You can also plot their respective learning curves to visualise the performance.

odd meteor Jan 3, 2022, 2:25 PM

#

lapis sequoia and yes i was able to merge successfully when using a sample of 10 rows from tha...

Hmmm what woulda been the shape of the new dataframe given the new rows due to the merge?

lapis sequoia Jan 3, 2022, 2:28 PM

#

odd meteor Hmmm what woulda been the shape of the new dataframe given the new rows due to t...

actually the dataframe has around 2800 rows and 20 columns, but I am doing the merge 10 times

odd meteor Jan 3, 2022, 2:28 PM

#

lapis sequoia actually the datafrmae has around 2800 rows and 20 columns, but iam doing the me...

Hmm but why 10 times though?

lapis sequoia Jan 3, 2022, 2:29 PM

#

odd meteor Hmm but why 10 times though?

10 date and price columns to be merged

#

very messy data

odd meteor Jan 3, 2022, 2:33 PM

#

lapis sequoia very messy data

OK but assuming you're doing 2800 x 10 that woulda been 28,000 rows yeah?

But 28,000 isn't that many rows to give a memory error warning. 🤔

lapis sequoia Jan 3, 2022, 2:36 PM

#

odd meteor OK but assuming you're doing 2800 x 10 that woulda been 28,000 rows yeah? But 2...

yeah that's really weird

odd meteor Jan 3, 2022, 2:36 PM

#

lapis sequoia yeah that really wierd

Are you working on colab or on your local machine?

lapis sequoia Jan 3, 2022, 2:37 PM

#

I have built 10 new dataframes from the big one, each df has two columns, and then I merged them back together

lapis sequoia Jan 3, 2022, 2:37 PM

#

odd meteor Are you working on colab or on your local machine?

on local machine

odd meteor Jan 3, 2022, 2:44 PM

#

lapis sequoia i have built 10 new dataframes from the big one, every df have two columns, and ...

You're not even supposed to repeat that 10 times. Although I don't exactly have a picture of the initial dataset.

You sure you're stacking them row-wise not column wise right?

Did you use merge, concatenate, or join?
Can you write the code you used as well?

lapis sequoia Jan 3, 2022, 2:48 PM

#

odd meteor You're not even supposed to repeat that 10 times. Although I don't exactly have ...

i will send you my code

#

import pandas as pd 
import numpy as np

df=pd.read_excel('sample_question2022-01.xlsx')
columns=df.columns.tolist()
for column in columns:
    if (df[column].isnull().sum()>2300):
        df.drop(column,axis=1,inplace=True) 
columns=df.columns.tolist()
import itertools
count_date=itertools.count(1)
count_price=itertools.count(1)
for column in columns:

    if(df[column].dtypes=='datetime64[ns]'):
        df.rename(columns={column:f'date{next(count_date)}'},inplace=True)
    else:
        df.rename(columns={column:f'Price{next(count_price)}'},inplace=True)
columns=df.columns.tolist()
merged=df[[columns[0],columns[1]]].set_index('date1')
k=2
for i in range(2,len(columns)-1,2):
    merged=pd.merge(merged,df[[columns[i],columns[i+1]]].set_index(f'date{k}'),how='outer',left_index=True,right_index=True)
    k+=1

lapis sequoia Jan 3, 2022, 2:53 PM

#

odd meteor You're not even supposed to repeat that 10 times. Although I don't exactly have ...

i have used merge, and on rows

odd meteor Jan 3, 2022, 3:00 PM

#

lapis sequoia i have used merge, and on rows

Still giving Memory warning?

lapis sequoia Jan 3, 2022, 3:01 PM

#

yes it still

#

github.com/txd2x/datastore

this is data sample

odd meteor Jan 3, 2022, 3:07 PM

#

I do have a new observation. What happens to other column(s) if the date column successfully gets merged?

This obviously will increase the shape of final dataframe and even populate every other non date column with NAN.

Do you see the angle I'm coming from yet?

lapis sequoia Jan 3, 2022, 3:09 PM

#

odd meteor I do have a new observation. What happens to other column(s) if the date column ...

yes i see, the other columns return nan values while date column is the

odd meteor Jan 3, 2022, 3:15 PM

#

lapis sequoia yes i see, the other columns return nan values while date column is the

If you couldn't care less about the NaN then no problem... Otherwise, I'd suggest you write a custom function instead. In that function, give the condition for date_col1 and date_col2 to be merged, then apply it to the new column.

lapis sequoia Jan 3, 2022, 3:21 PM

#

i will try this later, Thank you very much :)

odd meteor Jan 3, 2022, 3:36 PM

#

lapis sequoia i will try this later, Thank you very much :)

You're welcome

rose pasture Jan 3, 2022, 4:13 PM

#

odd meteor While TensorFlow might appear alluring you might wanna consider learning OOP in ...

Thanks! Do you have any other recommendations as to what to learn? So far I plan to learn NumPy, pandas, seaborn and scikit-learn. I'll have to add TensorFlow and PyTorch to that list as well. What do people usually use for data cleaning for big data?

whole birch Jan 3, 2022, 4:18 PM

#

For data cleaning, pandas is your go to for most applications

#

For 100Gb+ datasets pandas starts to struggle. Then a distributed analytics library like PySpark or Dask is required

rose pasture Jan 3, 2022, 4:40 PM

#

whole birch For 100Gb+ datasets pandas starts to struggle. Then a distributed analytics libr...

I see thanks!

desert oar Jan 3, 2022, 4:44 PM

#

keep in mind that most machines don't have 100 GB of ram, regardless of what pandas can technically handle 🙂

pastel valley Jan 3, 2022, 4:45 PM

#

odd meteor Look at the two CNN performances on unseen data (Validation set and/or other Hol...

oh so that is what the validation set is about

#

thank you sir

desert oar Jan 3, 2022, 4:47 PM

#

pastel valley oh so that is what the validation set is about

In general, the objective of any train/test split, cross validation, bootstrapping, etc. is to emulate "out of sample" data: data that your model has not yet seen and that was not involved in training. This is important if you want to estimate how well the model generalizes beyond the training sample.
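
A small NumPy sketch of that idea, under made-up assumptions: synthetic noisy data and a hand-written 1-nearest-neighbour classifier (`predict_1nn` is a hypothetical helper, not a library function). The model memorizes its training rows, so training accuracy looks perfect, while the held-out rows give a more honest estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the label depends on the first feature plus noise.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# Hold out 50 rows that play no part in "training".
idx = rng.permutation(200)
train, val = idx[:150], idx[150:]

def predict_1nn(X_train, y_train, X_query):
    """Label each query row with the label of its nearest training row."""
    dists = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[dists.argmin(axis=1)]

# Scoring on the training rows: every row is its own nearest neighbour.
train_acc = (predict_1nn(X[train], y[train], X[train]) == y[train]).mean()
# Scoring on held-out rows: the estimate of out-of-sample performance.
val_acc = (predict_1nn(X[train], y[train], X[val]) == y[val]).mean()

print(train_acc, val_acc)
```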

plain sundial Jan 3, 2022, 4:50 PM

#

Where would I start with ai?

#

I have some decent knowledge with python. I am just wondering if there is an article to read off of or a book to read or something?

delicate sphinx Jan 3, 2022, 4:57 PM

#

desert oar keep in mind that most machines don't have 100 GB of ram, regardless of what pan...

God i wish they did. My whole project is about 170gb. 70gb storage then about 100gb of files to run through

rose pasture Jan 3, 2022, 4:59 PM

#

how often do you run into big projects like that?

delicate sphinx Jan 3, 2022, 5:00 PM

#

Depends what you do and what data type you have

#

I have images that take about 169GB (cached image feature files and images themselves)
The text files alone are less than 1gb

desert oar Jan 3, 2022, 5:01 PM

#

im curious what you're doing with all those images, i assume you don't need to keep them all in memory at once

delicate sphinx Jan 3, 2022, 5:01 PM

#

Its unlikely you'll face such large amounts of it all

delicate sphinx Jan 3, 2022, 5:01 PM

#

desert oar im curious what you're doing with all those images, i assume you don't need to k...

I'm destroying my hard drives and ram probably. You are correct

desert oar Jan 3, 2022, 5:01 PM

#

at work we had a server with 256 gb of ram, it was nice until someone left a notebook running with 100 gb of ram used, on thursday, and didn't clean it up until the following tuesday

delicate sphinx Jan 3, 2022, 5:02 PM

#

Annoyingly tensorflow is slower to duplicate the same file 3 times than it is to load the same file 3 times

#

But I stand by my choice of loading image once and duplicating its values 3 times

#

Dont wanna kill my hard drive any more x-x

#

I had a friend who worked in security camera detection stuff and his company had this experimental model to train so on a weekend when everyone was gone they let this guy use all the resources he could and I think he ended up blue screening on about 750GB of ram

#

Well, running out of ram, probably not blue screening these days

desert oar Jan 3, 2022, 5:12 PM

#

i wouldn't worry about your ssd dying

#

unless this is on a personal machine and not a work machine

odd meteor Jan 3, 2022, 5:20 PM

#

rose pasture Thanks! Do you have any other recommendations as to what to learn? So I far i pl...

Why not follow a structured course? Learning each data science tool/ library one after the other can slow your progression.

Have you checked out Udemy and Coursera courses on Data Science yet?

If you can afford about $50 use it to buy a well-structured course on Udemy (at least, Udemy is quite cheaper than most courses on other platforms)

Of course, you can always use YouTube as a second resource to augment what you've learned from Udemy

delicate sphinx Jan 3, 2022, 5:44 PM

#

desert oar unless this is on a personal machine and not a work machine

yeah it's personal, I'm using a HDD for longevity purposes as I bought this PC specially to ensure it can survive a decade or so (last one managed 8 years and still works fine, able to play VR and most games still)

#

My current PC performs quicker than google colab (except for times I have to download data due to bad Wi-Fi) - especially as Google Colab RAM/Storage even on Pro is too small for my task

rose pasture Jan 3, 2022, 6:03 PM

#

odd meteor Why not follow a structured course? Learning each data science tool/ library one...

Yeah im currently following a Udemy course from Jose Portilla, it’s teaching numpy pandas seaborn and scikit. Just wondering what to do after

odd meteor Jan 3, 2022, 6:08 PM

#

rose pasture Yeah im currently following a Udemy course from Jose Portilla, it’s teaching num...

Awesome. Just keep at it. You're in the right hands. Jose Portilla's and Andrei Neagoie's Data Science & Machine Learning courses are 5 🌟 materials.

rose pasture Jan 3, 2022, 6:21 PM

#

odd meteor Awesome. Just keep at it. You're in the right hands. Jose Portilla and Andrei Na...

Alright thanks man

warm wedge Jan 3, 2022, 6:35 PM

#

Hello guys, is data science still a relevant field to get into/ start learning rn? if yes, what learning path do you recommend?

rose pasture Jan 3, 2022, 6:57 PM

#

warm wedge Hello guys, is data science still a relevant field to get into/ start learning r...

#data-science-and-ml message

spare junco Jan 3, 2022, 6:57 PM

#

Hello

#

Can someone Explain what are activation functions and what are they really used for in simple terms?

odd meteor Jan 3, 2022, 7:05 PM

#

warm wedge Hello guys, is data science still a relevant field to get into/ start learning r...

Data Science has always been and will continue to be a relevant tech field to get into 😂

If you read the last couple of messages here, I'm sure you'll see the answer to your question

tidal bough Jan 3, 2022, 7:09 PM

#

spare junco Can someone Explain what are activation functions and what are they really used ...

Neural networks basically work by, at each layer, multiplying a vector by a matrix and applying an activation function to the result, until you reach the end.

Without activation functions, any number of such multiplications would be equivalent to one (because any number of linear operators applied in order is a linear operator). And almost no data is a linear mapping of inputs to outputs, so without activation functions, it'd be impossible for NNs to work.

So we introduce nonlinearities of some kind - that's what activation functions are, pretty much any nonlinear function. In practice, you need one that has a derivative almost everywhere (because you need it for backpropagation), is cheap to calculate, and (that's the most advanced requirement) ideally behaves nicely under backpropagation - has nonzero derivatives everywhere, etc.

#

*have nonzero derivatives everywhere
though as RELUs show, even that is sometimes not required
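
The collapse of stacked linear layers can be checked numerically. This is a sketch with random made-up weights (`W1`, `W2`) and inputs, not any particular network: two matrix multiplications in a row equal one multiplication by the product matrix, and putting a ReLU in between breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))   # 5 samples, 3 features (made-up data)
W1 = rng.normal(size=(3, 4))  # first "layer"
W2 = rng.normal(size=(4, 2))  # second "layer"

# Two linear layers with no activation in between...
two_layers = (x @ W1) @ W2
# ...are the same as one linear layer with the combined weight matrix.
one_layer = x @ (W1 @ W2)
collapses = np.allclose(two_layers, one_layer)

def relu(z):
    """Elementwise max(z, 0)."""
    return np.maximum(z, 0.0)

# With a ReLU in between, the same weights no longer reduce to W1 @ W2.
with_relu = relu(x @ W1) @ W2
still_linear = np.allclose(with_relu, one_layer)

print(collapses, still_linear)
```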

odd meteor Jan 3, 2022, 7:14 PM

#

spare junco Can someone Explain what are activation functions and what are they really used ...

Let me break it down as much as I can

Activation functions help a neural network make more accurate predictions. The main reason we use them is this:

Not every relationship in a dataset is linear, so a line, plane, or hyperplane can't always fit it... Remember planes and hyperplanes from geometry, yeah? Cool.

An activation function lets the network capture those non-linearities in the dataset, which leads to more accurate predictions.

spare junco Jan 3, 2022, 7:15 PM

#

Thank You Very much for explaining!!

velvet pivot Jan 3, 2022, 7:54 PM

#

Hello everyone, I can help you if you have any questions or concerns about the software. or you can contact me from this account instag @ai.engineer1

fast glacier Jan 3, 2022, 8:30 PM

#

Hello! I was wondering if someone might be able to kindly help me out: I'm working with a pandas dataframe, and I'm trying to iterate through only columns that have int values, and ignore the columns with other datatypes.
numerical_columns = [column for columns in identity_survey.columns if identity_survey[column].dtypes() == int]

print(numericalcolumns)

I get a TypeError: 'numpy.dtype[object]' object is not callable
I've tried (and unfortunately already deleted) multiple other approaches with no luck
identity_survey is the name of the dataframe

desert oar Jan 3, 2022, 8:35 PM

#

fast glacier Hello! I was wondering if someone might be able to kindly help me out: I'm work...

.dtypes isn't a function, you probably also meant .dtype anyway

#

"not callable" means "you can't use it like a function", i.e. you can't use ()

#

also you should just use select_dtypes https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.select_dtypes.html

integer_columns = identity_survey.select_dtypes(include=[int, np.integer])

or something like that

#

not sure if you can use np.integer alone or if you need int as well
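
A quick sketch of both points, on a made-up three-column frame (the column names and values are hypothetical; only the `identity_survey` name comes from the question). `.dtype` is read as an attribute, with no parentheses, and `select_dtypes` with `np.integer` picks up the integer column here without listing `int` separately.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real survey data.
identity_survey = pd.DataFrame({
    "age": [25, 31, 47],
    "score": [3.5, 4.0, 2.5],
    "name": ["a", "b", "c"],
})

# Manual version: .dtype is an attribute of each Series, so no () after it.
int_cols_manual = [
    col
    for col in identity_survey.columns
    if pd.api.types.is_integer_dtype(identity_survey[col].dtype)
]

# Built-in version: returns a DataFrame holding only the integer columns.
int_frame = identity_survey.select_dtypes(include=[np.integer])

print(int_cols_manual, list(int_frame.columns))
```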

fast glacier Jan 3, 2022, 8:37 PM

#

mmm weird, my notes say dtypes...

#

checking now

desert oar Jan 3, 2022, 8:38 PM

#

the DataFrame has a .dtypes attribute, but it's still not a function: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html

#

each individual Series has a .dtype attribute: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dtype.html

spare junco Jan 3, 2022, 8:38 PM

#

odd meteor Let me break it down as much as I can Activation function helps our neural netw...

So does the activation function help in deciding the weights and biases cuz that is what makes up the output of a neuron right?

fast glacier Jan 3, 2022, 8:39 PM

#

thanks @desert oar

desert oar Jan 3, 2022, 8:39 PM

#

spare junco So does the activation function help in deciding the weights and biases cuz that...

the entire network "helps in deciding", because if you change anything in the network, it will change how information propagates through the network

#

the reason we use nonlinear activation functions is to help the network find more interesting non-linear relationships in the data

#

imagine that the network is a child building stuff out of blocks, and they only have rectangular blocks

#

that would be like if you only had linear activation functions, everything would still end up kind of rectangular

spare junco Jan 3, 2022, 8:40 PM

#

desert oar the reason we use nonlinear activation functions is to help the network find mor...

by relations do you mean correlations?

desert oar Jan 3, 2022, 8:40 PM

#

no, i mean generally the relationship between inputs and outputs

#

the function that the NN is learning to approximate

#

if the child is building stuff out of various funky-shaped blocks, they could build more sophisticated shapes in the end

spare junco Jan 3, 2022, 8:41 PM

#

Right

desert oar Jan 3, 2022, 8:42 PM

#

same reason your network can learn more if you have more nodes and layers: more blocks to build out of

#

note that this is a very very highly non-mathematical analogy

spare junco Jan 3, 2022, 8:42 PM

#

So does the activation function help in shaping the network and nodes in such a way to make it more efficient in predicting?

desert oar Jan 3, 2022, 8:43 PM

#

not necessarily more efficient, but it is necessary in order for a network to learn arbitrarily sophisticated functions

spare junco Jan 3, 2022, 8:43 PM

#

Right

#

Got it!!

#

Now i will have a good sleep

#

lol

desert oar Jan 3, 2022, 8:44 PM

#

i see melat0nin is typing, they know what they are talking about so you might want to stay online for a few more minutes 🙂

spare junco Jan 3, 2022, 8:44 PM

#

Alright

stone marlin Jan 3, 2022, 8:44 PM

#

Haha, I wasn't gonna say anything mathy here. :'] One of the exercises that made me really "get" the steps of NNs was trying to do ezpz perceptron exercises.

#

You got all of it, I didn't need to add anything, haha.

#

The math part is pretty much that you get linear combos of linear combos without activation functions, which is just kind of boring. You pretty much just get new features that look like 3.0 * petal_length + 4.1 * sepal_width - ... stuff like that. Which is fine for some things, but for other things they may not have a clean linear relationship.

spare junco Jan 3, 2022, 8:46 PM

#

PepeThumbsUp

#

Thank You @desert oar

stone marlin Jan 3, 2022, 8:48 PM

#

I'm having a ton of fun with streamlit, y'all, if you haven't checked it out yet, check it out.

odd meteor Jan 3, 2022, 8:50 PM

#

stone marlin I'm having a ton of fun with streamlit, y'all, if you haven't checked it out yet...

You might deputise Streamlit if you try Gradio 😂 Well, I love both of them

desert oar Jan 3, 2022, 8:51 PM

#

interesting

#

i always thought streamlit was some kind of zoomer streaming tool 😆

stone marlin Jan 3, 2022, 8:51 PM

#

Oh, interesting! It seems like Streamlit is made for like, dashboarding EDA / Results, but Gradio is more for like, showing off models?

#

These both seem really cool, I'll try to make something cool in Gradio this week. :'']

odd meteor Jan 3, 2022, 9:06 PM

#

stone marlin Oh, interesting! It seems like Streamlit is made for like, dashboarding EDA / R...

Mainly streamlit is used in the Data Science and Machine Learning community for deploying model as a web app.

Gradio can also be used for model deployment. The beauty of Gradio is in building and sharing a demo of Machine Learning model so people can interact or play with it.

Check this out https://huggingface.co/spaces/valhalla/glide-text2im

Glide Text2im - a Hugging Face Space by valhalla

stone marlin Jan 3, 2022, 9:12 PM

#

Oh, this is neat. Yeah, Streamlit seems like the "Python Version of Shiny" to me, but I'm still learning it. This is kind'a cool, and I wonder how it compares with or complements things like H2o and MLFlow. I'll have to dig in. It looks really slick tho.

desert oar Jan 3, 2022, 9:14 PM

#

i was wondering about mlflow here too

stone marlin Jan 3, 2022, 9:16 PM

#

I haven't touched MLFlow in like, two years --- but when we were using it, I was like, heavy into it. We had a fork of it we added junk to and it was great. H2o was also really neat --- our modelers really liked the charts and stuff, but idk how that is now. I gott'a look back into this stuff.

odd meteor Jan 3, 2022, 9:16 PM

#

I don't have experience with MLOps yet so I can't say for now

stone marlin Jan 3, 2022, 9:16 PM

#

It's prob the case that MLFlow works fine with this, since MLFlow (at least the tracking part) just kind of sits in the code and reports stuff to the DB.

#

I'll check it out tho, I'm so excited about the new stuff that's come up recently (or not-so-recently) that I get to learn about, haha.

serene scaffold Jan 3, 2022, 9:49 PM

#

@tidal bough @odd meteor thanks for your comments on activation functions. I somehow didn't know a lot of that lemon_hyperpleased

magic dune Jan 3, 2022, 10:59 PM

#

desert oar you might also be interested in other numerical optimization techniques like new...

what is that?

pastel valley Jan 3, 2022, 11:06 PM

#

desert oar In general, the objective of any train/test split, cross validation, bootstrappi...

i can just use accuracy and loss as the basis of their performance on a validation data?

desert oar Jan 3, 2022, 11:06 PM

#

pastel valley i can just use accuracy and loss as the basis of their performance on a validati...

for the most part yes

stone marlin Jan 3, 2022, 11:07 PM

#

You can measure it with pretty much any metric you want. F1, accuracy, recall, precision, etc.
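
For reference, a sketch of how those metrics come out of the same four counts for a binary problem. The labels and predictions below are made up; in practice they would be the validation labels and each model's predictions on them.

```python
import numpy as np

# Hypothetical validation labels and one model's predictions.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])

tp = int(((y_pred == 1) & (y_true == 1)).sum())  # predicted 1, actually 1
tn = int(((y_pred == 0) & (y_true == 0)).sum())  # predicted 0, actually 0
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # predicted 1, actually 0
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # predicted 0, actually 1

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)
```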

light hemlock Jan 3, 2022, 11:14 PM

#

How to determine which columns to drop in knn?
I know which column is categorical , continuous and is class attribute.
I assume:
Continuous data is for training set
Categorical is meant to be dropped
Class attribute is a thing to be predicted?
Example:
I tried to make a heatmap (picture). On left is whole dataset, on right head(20). ||11th column had 0.5 value on first 20 records after normalisation||

desert oar Jan 3, 2022, 11:18 PM

#

drop the bad ones, keep the good ones 😉

#

that heatmap is the distance matrix between data points?

light hemlock Jan 3, 2022, 11:31 PM

#

desert oar that heatmap is the distance matrix between data points?

Yeah, its made from data.corr().

errant shore Jan 4, 2022, 12:00 AM

#

If i have a df that contains user ratings for movies (columns=[userID, movieID, rating,....]), what would be the most efficient way to create a df that contains the count of users that rated both movies for all combinations of movies. Right now I'm doing this iteratively for every combination of movie ids like this, but I'm looking for a way to speed it up. Any suggestions?

def foo(id1, id2):
  id1_users = set(df[df["movieID"] == id1]["userID"].to_list())
  id2_users = set(df[df["movieID"] == id2]["userID"].to_list())
  combined = len(id1_users & id2_users)
  return combined
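
For reference, one possible vectorized version of the same count, sketched on a tiny made-up ratings table. It builds a user-by-movie 0/1 matrix with `pd.crosstab` and multiplies it by its own transpose, so entry `[a, b]` is the number of users who rated both `a` and `b`. The result is a dense movie-by-movie frame, so for a very large catalogue a sparse matrix would probably be needed instead.

```python
import pandas as pd

# Hypothetical ratings in the same layout as the question.
df = pd.DataFrame({
    "userID":  [1, 1, 2, 2, 3, 3, 3],
    "movieID": [10, 20, 10, 20, 10, 20, 30],
    "rating":  [5, 4, 3, 5, 2, 4, 1],
})

# One row per (user, movie) so a repeated rating is not counted twice.
pairs = df[["userID", "movieID"]].drop_duplicates()

# User-by-movie indicator matrix: 1 if the user rated the movie, else 0.
rated = pd.crosstab(pairs["userID"], pairs["movieID"])

# (movies x users) @ (users x movies): shared-rater counts for every movie pair.
co_counts = rated.T @ rated

print(co_counts)
```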

vale wedge Jan 4, 2022, 12:17 AM

#

does anyone know why this error occur?

#

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

vale wedge Jan 4, 2022, 12:18 AM

#

vale wedge ValueError: The truth value of an array with more than one element is ambiguous....

this is the error stated

errant shore Jan 4, 2022, 12:19 AM

#

result is a ~~list~~ array, so you cant check for equality with an int

vale wedge Jan 4, 2022, 12:21 AM

#

sorry, but i dont clearly understand

#

@errant shore

errant shore Jan 4, 2022, 12:21 AM

#

Your result variable is a ~~list~~ array

#

So you cant check if a ~~list~~ array is equal to an int

tidal bough Jan 4, 2022, 12:23 AM

#

array, not list. And the problem isn't quite that you can't compare it to an int - you can - but that the result isn't a boolean, but an array of booleans

#

and you can't use an array of booleans as a condition

#

Presumably, you expect result to be a single-element array. If that's the case, extract its only element and do this stuff on it, not on the entire array.
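
To make that concrete, here is a small sketch with a made-up array standing in for `result` (the values are hypothetical). The comparison is elementwise and yields an array of booleans; `if` needs a single True or False, which is where the ValueError comes from. With a one-element array, `.item()` pulls out the plain Python value.

```python
import numpy as np

result = np.array([4, 4, 2])  # stand-in for an array of predictions

mask = result == 4            # elementwise: one boolean per prediction

try:
    if mask:                  # ambiguous: some entries are True, some False
        pass
    raised = False
except ValueError:
    raised = True

# .any() / .all() reduce the array to one boolean, but they answer
# "is any / every prediction equal to 4?", which may not be the question.
any_four, all_four = bool(mask.any()), bool(mask.all())

# If exactly one prediction is expected, extract it and compare that.
single = np.array([4])
label = single.item()

print(raised, any_four, all_four, label)
```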

errant shore Jan 4, 2022, 12:24 AM

#

yes my bad array not list

vale wedge Jan 4, 2022, 12:29 AM

#

hmm i still try to fix this one

#

how about Use a.any() or a.all()

#

`# Let's predict!
import numpy as np
newData = np.array([
    2, 2, 3, 3, 3, 2, 2, 4, 4, 4,
    4, 3, 5, 2, 3, 4, 4, 3, 4, 4,
    1, 4, 3, 5, 1, 5, 2, 5, 5, 4,
    3, 4, 4, 4, 4, 4, 4, 2, 5, 4,
    3, 2, 4, 2, 4, 2, 4, 2, 4, 4]).reshape(-1,1)

result = gnb.predict(newData)
print(result)
if (result == 0):
    print("Your Personality Is Agreeableness")
elif (result == 1):
    print("Your Personality Is Conscientiousness")
elif (result == 2):
    print('Your Personality Is Extrovert')
elif (result == 3):
    print('Your Personality Is Neuroticism')
elif (result == 4):
    print('Your Personality Is Openness')
elif (result == 5):
    print('Your Personality Is Tie')`

vale wedge Jan 4, 2022, 12:33 AM

#

vale wedge `# Let's predict! import numpy as np newData = np.array([ 2, 2, 3, 3, 3, 2, 2, 4...

basically, this is the code for predict

#

is there any code that need to be adjusted?

#

@errant shore @tidal bough

tidal bough Jan 4, 2022, 12:37 AM

#

What does print(result) print?

#

ah, I see on the screenshot

#

well, that's the reason then. What do you expect to happen when you compare result to ints? It is, after all, an array of many values. So which if, if any, should execute?

vale wedge Jan 4, 2022, 12:41 AM

#

hm, I expect the outcome to be between 0 and 5, and the matching personality to be printed

vale wedge Jan 4, 2022, 12:42 AM

#

tidal bough well, that's the reason then. What do you expect to happen, when you compare `re...

hmm basically, what should i do right now to fix this one?

#

hopefully it might help me

tidal bough Jan 4, 2022, 12:43 AM

#

vale wedge hm i expect when the outcome is between 0-5 and will print the personality

...which of the ton of outcomes you have? 😛

vale wedge Jan 4, 2022, 12:44 AM

#

basically, when the user gives input, the outcome will be 0 or 1 or 2 or 3 or 4 or 5

#

so every outcome will define different personality

vale wedge Jan 4, 2022, 12:45 AM

#

tidal bough ...which of the ton of outcomes you have? 😛

it will be 6 outcome

brazen spire Jan 4, 2022, 12:46 AM

#

is ML actually good for solving PDEs?

#

or ODEs

vale wedge Jan 4, 2022, 12:46 AM

#

tidal bough well, that's the reason then. What do you expect to happen, when you compare `re...

basically, i cant compare array to int?

tidal bough Jan 4, 2022, 12:54 AM

#

vale wedge it will be 6 outcome

Here's what result is.

#

what do you mean by trying to compare all that to a single int? For example, is this equal to 4, or not?
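
A sketch of why 50 values come back, assuming `gnb` is a scikit-learn style estimator whose `predict` returns one prediction per input row (the chat does not show how it was trained, so treat that as an assumption). The array `answers` is a made-up stand-in for the 50 inputs. `reshape(-1, 1)` produces 50 rows of one feature each, hence 50 predictions; `reshape(1, -1)` produces a single row of 50 features.

```python
import numpy as np

# Hypothetical stand-in for the 50 questionnaire answers.
answers = np.arange(50) % 5 + 1

# 50 rows x 1 column: treated as 50 separate samples, so 50 predictions.
as_samples = answers.reshape(-1, 1)

# 1 row x 50 columns: treated as one sample with 50 features, so 1 prediction.
as_one_row = answers.reshape(1, -1)

print(as_samples.shape, as_one_row.shape)
```

If the 50 numbers are one respondent's answers, the single-row layout is the one that yields a single prediction, but only if the model was also trained with 50 feature columns.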

vale wedge Jan 4, 2022, 12:56 AM

#

oh I see, the result is supposed to display one number only

#

not the all 50 outcomes like that

#

I want the prediction to display a single predicted number

vale wedge Jan 4, 2022, 1:00 AM

#

tidal bough Here's what `result` is.

not the 50 like this

#

so what can i do to make the outcome only display one number only? @tidal bough
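
A minimal sketch of the point tidal bough is making, under stated assumptions: the real model and input are not shown in the log, so `predict` below is a hypothetical stand-in for `model.predict`, and the answer values and personality names are made up. It shows that a model returns one prediction per input row, so passing all of a user's answers as a single row yields a single number.

```python
import numpy as np


def predict(X):
    """Hypothetical stand-in for model.predict: returns one label (0-5) per input row."""
    return np.asarray(X).sum(axis=1) % 6


answers = np.array([2, 2, 3, 3, 3, 2, 2, 4])  # made-up answers from one user

# If every answer is passed as its own row, there is one prediction per answer.
many = predict(answers.reshape(-1, 1))
print(many.shape)   # one entry per row
print(many == 4)    # element-wise comparison: an array of booleans, not one bool

# If all answers are passed as a single row, there is a single prediction.
single = predict(answers.reshape(1, -1))
label = int(single[0])  # plain Python int, safe to use in if/elif
personalities = ["type A", "type B", "type C", "type D", "type E", "type F"]  # placeholders
print(label, personalities[label])
```

Writing `if many == 4:` directly raises a `ValueError`, because NumPy cannot decide the truth value of an array with more than one element. Whether the 50 values in the screenshot come from an input shaped as 50 rows is an assumption; checking `newData.shape` before calling predict would confirm it.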

proper sable Jan 4, 2022, 4:01 AM

#

is this channel also for web scraping??

serene scaffold Jan 4, 2022, 4:06 AM

#

proper sable is this channel also for web scraping??

Nope

proper sable Jan 4, 2022, 4:07 AM

#

is there one??

serene scaffold Jan 4, 2022, 4:24 AM

#

@proper sable not really. I guess you could ask in #web-development, but be prepared to confirm that the website you're trying to scrape is cool with that.

proper sable Jan 4, 2022, 4:30 AM

#

Yes im just learning now

#

Thx

calm bison Jan 4, 2022, 4:46 AM

#

Hi, can anyone help me with creating a custom dataset. I cannot seem to find my error here.

arctic wedgeBOT Jan 4, 2022, 4:48 AM

#

Hey @calm bison!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

terse frigate Jan 4, 2022, 5:48 AM

#

hey guys I needed some help figuring out a model

#

and ab approach

#

an*

serene scaffold Jan 4, 2022, 5:51 AM

#

@terse frigate try giving enough information so that people can start making suggestions

terse frigate Jan 4, 2022, 5:52 AM

#

serene scaffold <@310860262624460801> try giving enough information so that people can start mak...

so i need to make a search string based on a job description. i have some example to make supervised

#

but i also wanna match the accuracy of that result by comparing it with the example results

serene scaffold Jan 4, 2022, 5:54 AM

#

terse frigate so i need to make a search string based on a job description. i have some exampl...

A search string based on a job description. Can you explain that some more?

terse frigate Jan 4, 2022, 5:54 AM

#

so my job is to search for resumes on job boards
we are given a job description of the Job Role.

#

my job is to make a search string and use it to look for resumes

#

now i want to make a model which reads the description and constructs a search string like how i do - using keywords and important skills mentioned and understanding the role

#

for example the clients they request for a cloud architect and provide requirements in skills etc

#

and i make a string -
(cloud architect) OR (Solutions architect) AND (AWS OR Google Cloud OR Azure) AND Agile

#

something like that

serene scaffold Jan 4, 2022, 5:57 AM

#

So the real goal is to detect resumes that relate to a job opening. But due to some limitation, you have to enter keywords into a search API.

terse frigate Jan 4, 2022, 5:57 AM

#

serene scaffold So the real goal is to detect resumes that relate to a job opening. But due to s...

yeah i have to manually construct the search string

serene scaffold Jan 4, 2022, 5:57 AM

#

Weird

terse frigate Jan 4, 2022, 5:58 AM

#

4Shrug

#

so i was thinking if i fed enough job descriptions and also some strings to learn on

#

it should give me proper string right?

serene scaffold Jan 4, 2022, 5:59 AM

#

I have to go to sleep but I'll probably think about it some more. In the mean time, do you know about term frequency inverse document frequency?

terse frigate Jan 4, 2022, 5:59 AM

#

yes

#

ive read about it

serene scaffold Jan 4, 2022, 5:59 AM

#

It might give you some ideas for extracting keywords.
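
A rough sketch of the TF-IDF idea for keyword extraction, using only the standard library. The three job descriptions are invented, and the weighting is the plain `tf * log(N / df)` form; libraries such as scikit-learn use smoothed variants, so real scores would differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def top_keywords(docs, doc_index, k=4):
    """Return the k terms of docs[doc_index] with the highest TF-IDF score."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    term_freq = Counter(tokenized[doc_index])
    total = len(tokenized[doc_index])
    scores = {
        term: (count / total) * math.log(n_docs / doc_freq[term])
        for term, count in term_freq.items()
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


docs = [
    "cloud architect with aws and azure experience in agile teams",
    "java developer with spring experience in agile teams",
    "data analyst with sql and excel experience",
]
keywords = top_keywords(docs, 0)
print(" OR ".join(keywords))
```

Words that appear in every description ("with", "experience") score zero, while terms specific to one description float to the top. Turning the keywords into a full boolean search string with AND/OR groups would still need rules or examples of the kind terse frigate describes.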

shrewd saddle Jan 4, 2022, 6:00 AM

#

Hi, does anyone know a way to easily use files located in Google public cloud (the Landsat data for example https://cloud.google.com/storage/docs/public-datasets/landsat) inside Google Colab? Similar to how we can mount our drive for example.

terse frigate Jan 4, 2022, 6:00 AM

#

❤️ thanks

tidal edge Jan 4, 2022, 6:39 AM

#

Hi can anyone help me in understanding this complex inheritance automl code. I just need little help. after i will able to understand it. Please let me know guys if anybody ready to explore this attached code with me. I really want to understand this code. Please help me.

arctic wedgeBOT Jan 4, 2022, 6:40 AM

#

Hey @tidal edge!

It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com

tidal edge Jan 4, 2022, 6:41 AM

#

https://paste.pythondiscord.com/idikacokol.py

stone marlin Jan 4, 2022, 6:47 AM

#

What parts of this aren't clear to you? How to use it, how to modify the code, etc.?

odd meteor Jan 4, 2022, 7:49 AM

#

terse frigate now i want to make a model which reads the description and constructs a search str...

I think I have an idea on this. Folks at my company's AI & Research department developed a product called CV Ranker which performs exactly whatchu asking for.

I can only refer you to delve into this
https://deepnote.com/@abid/spaCy-Resume-Analysis-gboeS3-oRf6segt789p4Jg

Deepnote

spaCy Resume Analysis

Managed notebooks for data scientists and researchers.

sly turtle Jan 4, 2022, 10:26 AM

#

I want to make plant detection app for android.. So how can I make a model and how? I am beginner.. As there is no problem for me in Android App.. I just need some resources and support regards model.. Either ML or DL? And what is some tutorial for brainstorming.. Thanks..

lapis sequoia Jan 4, 2022, 1:00 PM

#

sly turtle I want to make plant detection app for android.. So how can I make a model and h...

just train out a simple CNN

unkempt monolith Jan 4, 2022, 1:25 PM

#

sly turtle I want to make plant detection app for android.. So how can I make a model and h...

You can train a model using YoloV5 algorithm. Then you can host it on a vps and make an API to connect to that vps and detect stuff.

#

You can use Roboflow to label images, Colab to train your YoloV5 model and Amazon AWS to host your VPS.

tidal edge Jan 4, 2022, 1:31 PM

#

stone marlin What parts of this aren't clear to you? How to use it, how to modify the code, ...

i have a problem in code flow understanding

mighty spoke Jan 4, 2022, 5:09 PM

#

Hi, I'm trying to do logarithmic binning for a PSD graph, but it's just not working: the binned graph on the right still looks linearly binned. Can anyone help? My code for binning:

x4, y4 = zip(*sorted(zip(faxis, Sxx.real)))
logbin = pd.DataFrame({'X': x4, 'Y': y4})
bins5 = np.arange(4e-08, 6e-06, step=2.5e-07)
categorical_object2 = pd.cut(x4, bins5)
count = pd.value_counts(categorical_object2)
grp2 = logbin.groupby(by=categorical_object2)  # we group the data by the cut
av = grp2.aggregate(np.mean)
plt.figure()
plt.plot(av.X, av.Y, 'x')
plt.title('binned PSD plot')
plt.xlabel('mean Frequency [Hz]')  # Label the axes
plt.ylabel('mean Power')
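
One likely cause, offered as a sketch rather than a confirmed diagnosis: `np.arange` with a fixed `step` produces equally spaced (linear) bin edges. Log-spaced edges can be built with `np.geomspace` instead. The frequency and power arrays below are synthetic stand-ins for `faxis` and `Sxx.real`, and the bin count of 20 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
freq = np.sort(rng.uniform(4e-8, 6e-6, size=2000))  # stand-in for faxis
power = 1.0 / freq                                   # stand-in for Sxx.real

n_bins = 20
edges = np.geomspace(4e-8, 6e-6, num=n_bins + 1)     # equal width in log10(f)
bin_index = np.digitize(freq, edges[1:-1])           # bin number 0..n_bins-1 per sample

counts = np.bincount(bin_index, minlength=n_bins)
keep = counts > 0                                    # skip bins that received no samples
mean_freq = np.bincount(bin_index, weights=freq, minlength=n_bins)[keep] / counts[keep]
mean_power = np.bincount(bin_index, weights=power, minlength=n_bins)[keep] / counts[keep]
```

The same edges can be passed to `pd.cut` in place of `bins5`. The binned points only look evenly spaced if the x axis is also logarithmic, for example with `plt.xscale('log')`.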

novel raven Jan 4, 2022, 5:34 PM

#

is this channel also for computer vision?

#

like cascade classifiers

#

can you train a cascade classifier that can count number of stars in the sky with normal camera

amber lark Jan 4, 2022, 5:48 PM

#

does someone know how to fix this problem?

sick wedge Jan 4, 2022, 5:50 PM

#

I have a plot here of a column from my panda dataframe, I'd like to make subplots for the other columns but I I can't seem to get it to work, any help appreciated

warm verge Jan 4, 2022, 6:16 PM

#

Why can hyperparameters not be 'autotuned' the same way PID controllers can be?

desert oar Jan 4, 2022, 6:17 PM

#

warm verge Why can hyperparameters not be 'autotuned' the same way PID controllers can be?

what would autotuning consist of?

#

i don't know how PID controllers work

#

(although i should probably learn, because i want to learn about modding my espresso machine)

warm verge Jan 4, 2022, 6:20 PM

#

https://www.sciencedirect.com/topics/computer-science/ziegler-nichols-method

#

This one is probably the easiest to understand

#

While I'm aware there's way more than 3 hyperparameters, it's interesting that a similar process isn't used (or at least I haven't come across it)

desert oar Jan 4, 2022, 6:27 PM

#

this sounds more or less like what we do in machine learning

#

set up cross validation and pick the set of hyperparameters that performs best

#

the hyperparameter space is huge even with a small number of parameters (and usually real-valued so uncountably infinite), so various strategies like random search, halving random search, bayesian optimization, etc. exist that improve on the traditional "grid search"

#

the reason you need cross validation is that you need to emulate predicting on "out of sample" data

#

whereas PID controllers you don't have this problem of in-sample vs out-of-sample

#

so yes, it is used

#

and especially in cases with one single hyperparameter (eg the shrinkage parameter in lasso or ridge regression), you can literally just look at the plot of model performance vs parameter value, and pick the one right before model performance starts to decline (indicating overfitting)
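
A small sketch of what desert oar describes, assuming scikit-learn is available: cross-validate a ridge regression over a grid of shrinkage values and keep the one with the best out-of-sample score. The dataset is synthetic and the grid of 13 `alpha` values is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression problem standing in for real data.
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 13)  # candidate shrinkage strengths
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=5)  # 5-fold cross validation
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
mean_scores = search.cv_results_["mean_test_score"]  # one mean R^2 per alpha
print(best_alpha)
```

Plotting `mean_scores` against `alphas` gives the performance-versus-parameter curve mentioned above. For larger search spaces, `RandomizedSearchCV` follows the same pattern with sampled rather than exhaustive candidates.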

warm verge Jan 4, 2022, 6:54 PM

#

Hmm ok that makes sense

tidal edge Jan 4, 2022, 6:58 PM

#

https://paste.pythondiscord.com/idikacokol.py

#

I have a doubt. In the above code why some method name starts with _ and some method name starts with directly name in the same class?

#

method name starting with _

#

method name directly starting with name

lapis sequoia Jan 4, 2022, 7:05 PM

#

Anyone use shap/lime for fb prophet?

tidal edge Jan 4, 2022, 7:18 PM

#

lapis sequoia Anyone use shap/lime for fb prophet?

yes

novel acorn Jan 5, 2022, 1:57 AM

#

hello everyone, how can I filter a pandas dataset for 2 conditions in the same column?
For example, I want to filter the length column for values equal to 48 and to 40

#

filtro = (all_years_numeric_filtered["Length"] == 48) & (all_years_numeric_filtered["Length"] == 40)

all_years_numeric_filtered[filtro]

#

Tried doing that but I'm getting an empty df

#

Already got it, created a list with the values and used .isin() for the Length column

thin palm Jan 5, 2022, 2:16 AM

#

Hello, I have a question about One Hot Encoding. I'm working on a simple Heart Disease ML model, and following many tutorials they seem to use OHE on columns that numerical? I thought we only use OHE when we have text that needs to be converted into numbers or when we have plenty of options in one category. Can anyone explain this to me?

gritty bough Jan 5, 2022, 2:26 AM

#

Can you take something that is non-deterministic and cant be run in parallel and use AI to parallelize it?

serene scaffold Jan 5, 2022, 2:55 AM

#

@gritty bough what do you mean "can't be run in parallel"?

#

@thin palm one hot encoding is a way of representing nominal data (ie not quantifiable or orderable). So if you have an Animal feature, and your animals are pigs, goats, and snakes, you don't want to assign them 1, 2, and 3, because that would mean that snakes are three times as much as pigs, whatever that means.

#

Though it sounds like you might already understand that much.

#

@novel acorn & represents logical AND and something can't be both 48 and 40.
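
To illustrate the difference, here is a short sketch with a made-up `Length` column: `&` keeps rows where both conditions hold at once (none), while `|` or `.isin()` keeps rows matching either value.

```python
import pandas as pd

df = pd.DataFrame({"Length": [40, 48, 53, 40, 20]})  # made-up values

# AND: a single value can never equal 48 and 40 at the same time.
both = df[(df["Length"] == 48) & (df["Length"] == 40)]

# OR: rows where the value is 48 or 40.
either = df[(df["Length"] == 48) | (df["Length"] == 40)]

# Equivalent and shorter when there are several allowed values.
allowed = df[df["Length"].isin([40, 48])]

print(len(both), len(either), len(allowed))
```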

rose pasture Jan 5, 2022, 3:05 AM

#

Hey guys quick question. Let's say I have a dataframe of stock prices. The index is the dates and the column is the daily closing price of each stock.

bank_stocks.xs(key='Close',axis=1,level=1).plot()

Does this line format automatically plot the Close column against the index (date) whenever I don't chose which data to plot against?

serene scaffold Jan 5, 2022, 3:08 AM

#

@rose pasture if you show the dataframe in a copy-and-pastable way, I will try

#

namely print(bank_stocks.head().to_dict('list'))

#

Please ping me if you decide to do that.

rose pasture Jan 5, 2022, 3:16 AM

#

serene scaffold namely `print(bank_stocks.head().to_dict('list'))`

{('BAC', 'High'): [47.18000030517578, 47.2400016784668, 46.83000183105469, 46.90999984741211, 46.970001220703125], ('BAC', 'Low'): [46.150001525878906, 46.45000076293945, 46.31999969482422, 46.349998474121094, 46.36000061035156], ('BAC', 'Open'): [46.91999816894531, 47.0, 46.58000183105469, 46.79999923706055, 46.720001220703125], ('BAC', 'Close'): [47.08000183105469, 46.58000183105469, 46.63999938964844, 46.56999969482422, 46.599998474121094], ('BAC', 'Volume'): [16296700.0, 17757900.0, 14970700.0, 12599800.0, 15619400.0], ('BAC', 'Adj Close'): [33.942649841308594, 33.582183837890625, 33.62542724609375, 33.57497024536133, 33.59661102294922], ('C', 'High'): [493.79998779296875, 491.0, 487.79998779296875, 489.0, 487.3999938964844], ('C', 'Low'): [481.1000061035156, 483.5, 484.0, 482.0, 483.0], ('C', 'Open'): [490.0, 488.6000061035156, 484.3999938964844, 488.79998779296875, 486.0], ('C', 'Close'): [492.8999938964844, 483.79998779296875, 486.20001220703125, 486.20001220703125, 483.8999938964844], ('C', 'Volume'): [1537600.0, 1870960.0, 1143160.0, 1370210.0, 1680740.0], ('C', 'Adj Close'): [368.26544189453125, 361.4664611816406, 363.2597351074219, 363.2597351074219, 361.5412902832031], ('GS', 'High'): [129.44000244140625, 128.91000366210938, 127.31999969482422, 129.25, 130.6199951171875], ('GS', 'Low'): [124.2300033569336, 126.37999725341797, 125.61000061035156, 127.29000091552734, 128.0], ('GS', 'Open'): [126.69999694824219, 127.3499984741211, 126.0, 127.29000091552734, 128.5], ('GS', 'Close'): [128.8699951171875, 127.08999633789062, 127.04000091552734, 128.83999633789062, 130.38999938964844], ('GS', 'Volume'): [6188700.0, 4861600.0, 3717400.0, 4319600.0, 4723500.0], ('GS', 'Adj Close'): [103.86396026611328, 102.42938232421875, 102.38907623291016, 103.83979034423828, 105.0890121459961],

#

('JPM', 'High'): [40.36000061035156, 40.13999938964844, 39.810001373291016, 40.2400016784668, 40.720001220703125], ('JPM', 'Low'): [39.29999923706055, 39.41999816894531, 39.5, 39.54999923706055, 39.880001068115234], ('JPM', 'Open'): [39.83000183105469, 39.779998779296875, 39.61000061035156, 39.91999816894531, 39.880001068115234], ('JPM', 'Close'): [40.189998626708984, 39.619998931884766, 39.7400016784668, 40.02000045776367, 40.66999816894531], ('JPM', 'Volume'): [12838600.0, 13491500.0, 8109400.0, 7966900.0, 16575200.0], ('JPM', 'Adj Close'): [26.503398895263672, 26.350433349609375, 26.430240631103516, 26.616453170776367, 27.048765182495117], ('MS', 'High'): [58.4900016784668, 59.279998779296875, 58.59000015258789, 58.849998474121094, 59.290000915527344], ('MS', 'Low'): [56.7400016784668, 58.349998474121094, 58.02000045776367, 58.04999923706055, 58.619998931884766], ('MS', 'Open'): [57.16999816894531, 58.70000076293945, 58.54999923706055, 58.77000045776367, 58.630001068115234], ('MS', 'Close'): [58.310001373291016, 58.349998474121094, 58.5099983215332, 58.56999969482422, 59.189998626708984], ('MS', 'Volume'): [5377000.0, 7977800.0, 5778000.0, 6889800.0, 4144500.0], ('MS', 'Adj Close'): [36.114253997802734, 36.13903045654297, 36.23814010620117, 36.27529525756836, 36.65928649902344], ('WFC', 'High'): [31.975000381469727, 31.81999969482422, 31.55500030517578, 31.774999618530273, 31.825000762939453], ('WFC', 'Low'): [31.19499969482422, 31.364999771118164, 31.309999465942383, 31.385000228881836, 31.55500030517578], ('WFC', 'Open'): [31.600000381469727, 31.799999237060547, 31.5, 31.579999923706055, 31.674999237060547], ('WFC', 'Close'): [31.899999618530273, 31.530000686645508, 31.4950008392334, 31.68000030517578, 31.674999237060547], ('WFC', 'Volume'): [11016400.0, 10870000.0, 10158000.0, 8403800.0, 5619600.0], ('WFC', 'Adj Close'): [20.444866180419922, 20.207735061645508, 20.1853084564209, 20.303871154785156, 20.300668716430664]}

#

was it like that?

serene scaffold Jan 5, 2022, 3:17 AM

#

!paste

arctic wedgeBOT Jan 5, 2022, 3:17 AM

#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

serene scaffold Jan 5, 2022, 3:17 AM

#

Next time use this if it's too long

rose pasture Jan 5, 2022, 3:17 AM

#

Ok I will next time

serene scaffold Jan 5, 2022, 3:18 AM

#

Is this different from what you wanted?

lapis sequoia Jan 5, 2022, 3:19 AM

#

tidal edge yes

Can I contact you?

serene scaffold Jan 5, 2022, 3:20 AM

#

> The index is the dates and the column is the daily closing price of each stock.
For your reference, this description is incomplete. Your columns are a multiindex of company names by (high, low, open, close, etc).

Also your rows are probably indexed by date, so my example will look a bit different.

#

if you do print(bank_stocks.head().index), I can correct my version.

rose pasture Jan 5, 2022, 3:24 AM

#

Yes it is different, here's what I got. I was just curious as to how the plot() chose the index as my x axis even though I didn't specify anything.

rose pasture Jan 5, 2022, 3:25 AM

#

serene scaffold > The index is the dates and the column is the daily closing price of each stock...

sorry I should've given more details

serene scaffold Jan 5, 2022, 3:27 AM

#

rose pasture Yes it is different, here's what I got. I was just curious as to how the plot() ...

it looks like the default behavior for DataFrame.plot is to treat the index as the x axis, columns as items, and values as observations about said items (y axis).
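
A small sketch of that default, assuming pandas and matplotlib are installed. The frame below is a cut-down, hypothetical version of `bank_stocks` with two tickers and invented dates; the `Agg` backend is only there so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd

dates = pd.date_range("2006-01-03", periods=3, freq="D")  # hypothetical dates
columns = pd.MultiIndex.from_product([["BAC", "C"], ["Open", "Close"]])
bank_stocks = pd.DataFrame(
    [
        [46.92, 47.08, 490.0, 492.9],
        [47.00, 46.58, 488.6, 483.8],
        [46.58, 46.64, 484.4, 486.2],
    ],
    index=dates,
    columns=columns,
)

# Select the "Close" entry of the second column level: one column per ticker.
close = bank_stocks.xs(key="Close", axis=1, level=1)

# No x or y given: the index goes on the x axis, each column becomes one line.
ax = close.plot()
print(list(close.columns), len(ax.get_lines()))
```

Passing `x=` and `y=` to `plot` overrides this default when a column, rather than the index, should be on the x axis.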

rose pasture Jan 5, 2022, 3:28 AM

#

serene scaffold it looks like the default behavior for DataFrame.plot is to treat the index as t...

I see thank you very much! Appreciate the help

lapis sequoia Jan 5, 2022, 3:31 AM

#

@tidal edge you there?

tidal edge Jan 5, 2022, 3:32 AM

#

Yes. Shall we connect little later. I'm on one conf call.

lapis sequoia Jan 5, 2022, 3:32 AM

#

Okay okay

gritty bough Jan 5, 2022, 4:07 AM

#

serene scaffold <@506925999770828826> what do you mean "can't be run in parallel"?

Like you cannot send 2 parts away to be computed independent of each other.

#

For example. Making a checkerboard.

You can check the neighbors to the NSEW directions if they are black or white or you could do if(mod % 2) { black} else white.

#

Mmmm

#

I'm thinking

#

Maybe problems like what I'm asking dont exist.
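
A tiny sketch of the two formulations in the checkerboard example above: the parity rule computes every cell from its own coordinates, so cells do not depend on each other and could be computed in any order or in parallel. The function name and the 4x4 size are illustrative only.

```python
def cell_color(row, col):
    """Color of one cell from its coordinates alone, no neighbors needed."""
    return "black" if (row + col) % 2 else "white"


# Each cell is independent, so this comprehension could be split across workers.
board = [[cell_color(r, c) for c in range(4)] for r in range(4)]

for line in board:
    print(" ".join(color[0] for color in line))
```

The neighbor-checking version gives the same board but forces a sequential order, which shows that a dependency can sometimes be a property of the chosen algorithm rather than of the problem.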

thin palm Jan 5, 2022, 4:15 AM

#

serene scaffold <@401209043550732289> one hot encoding is a way of representing nominal data (ie...

What if we have a column that is “male” and “female” why split those into 2 columns, such as this tutorial did?

hoary wigeon Jan 5, 2022, 5:01 AM

#

thin palm What if we have a column that is “male” and “female” why split those into 2 colu...

onehotencoding will create 2 column ['male', 'female'] using either of one is enough

#

if male is 0, it means female = 1, vice versa

#

anyone using cv2 ?

desert oar Jan 5, 2022, 5:08 AM

#

thin palm What if we have a column that is “male” and “female” why split those into 2 colu...

because at some point you need to figure out a way to encode your data as numbers. splitting the data into two 1/0-valued columns is one straightforward way to do that
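
A short sketch of both cases discussed above, using `pd.get_dummies` on made-up data: a three-valued nominal column becomes three 1/0 columns, and a two-valued column becomes two columns that mirror each other, so one of them can be dropped with `drop_first=True`.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "animal": ["pig", "goat", "snake", "pig"],    # nominal, no order
        "sex": ["male", "female", "female", "male"],  # two categories
    }
)

# One column per category; astype(int) gives 1/0 regardless of pandas version.
animal_ohe = pd.get_dummies(df["animal"]).astype(int)
sex_ohe = pd.get_dummies(df["sex"]).astype(int)

# For two categories the columns are redundant: male == 1 - female.
sex_single = pd.get_dummies(df["sex"], drop_first=True).astype(int)

print(animal_ohe)
print(sex_single)
```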

copper ridge Jan 5, 2022, 5:11 AM

#

```py
row_vals = '\n'.join([val for val in df['Date:']])
```
i want to only view the values in the specified row in groups of 10, how would i be able to do that?

#

that line of code displays all the values

#

I am using pandas to help work with excel files

desert oar Jan 5, 2022, 5:16 AM

#

copper ridge ```py row_vals = '\n'.join([val for val in df['Date:']]) ``` i want to o...

can you clarify what "groups of 10" means?

#

do you mean "in the specified column"?

#

columns go "up and down" -- like columns in a building

#

rows go "across", like rows of crops in a field

copper ridge Jan 5, 2022, 5:18 AM

#

desert oar can you clarify what "groups of 10" means?

i only want columns 1-10 to be printed, then 11-20, etc

copper ridge Jan 5, 2022, 5:18 AM

#

desert oar do you mean "in the specified _column_"?

"Date: " is the column im referencing

#

copper ridge Jan 5, 2022, 5:20 AM

#

copper ridge i only want columns 1-10 to be printed, then 11-20, etc

because the column name starts at 1, i want to be able to print the data in the cells from columns 2-11

desert oar Jan 5, 2022, 5:20 AM

#

are you talking about rows or columns?

copper ridge Jan 5, 2022, 5:20 AM

#

columns, up and down

#

the data in the Date column

desert oar Jan 5, 2022, 5:20 AM

#

ok, and what do you want to print?

copper ridge Jan 5, 2022, 5:21 AM

#

I want to print the data in groups of 10, such as 1-10 and 11-20

desert oar Jan 5, 2022, 5:21 AM

#

so you want to print the first 10 rows?

#

then the next 10 rows, etc.

copper ridge Jan 5, 2022, 5:21 AM

#

yes, sorry for the confusion

desert oar Jan 5, 2022, 5:23 AM

#

size = 10
for lo in range(0, len(df), size):
    hi = lo + size
    date_values = df['Date:'].iloc[lo:hi]
    # str() each value so join also works for non-string cells (e.g. timestamps)
    print(' '.join(str(v) for v in date_values.tolist()))

#

i'd just write a loop for that

#

!d range

arctic wedgeBOT Jan 5, 2022, 5:23 AM

#

range


```py
class range(stop)
class range(start, stop[, step])
```
The arguments to the range constructor must be integers (either built-in [`int`](https://docs.python.org/3/library/functions.html#int "int") or any object that implements the [`__index__()`](https://docs.python.org/3/reference/datamodel.html#object.__index__ "object.__index__") special method). If the *step* argument is omitted, it defaults to `1`. If the *start* argument is omitted, it defaults to `0`. If *step* is zero, [`ValueError`](https://docs.python.org/3/library/exceptions.html#ValueError "ValueError") is raised.

For a positive *step*, the contents of a range `r` are determined by the formula `r[i] = start + step*i` where `i >= 0` and `r[i] < stop`.

For a negative *step*, the contents of the range are still determined by the formula `r[i] = start + step*i`, but the constraints are `i >= 0` and `r[i] > stop`.

copper ridge Jan 5, 2022, 5:24 AM

#

thank you

grand breach Jan 5, 2022, 9:43 AM

#

need some help with cspdarknet architecture.. anyone?

#

Is the spp block a part of the cspdarknet backbone, like the final block in the backbone? Or lies in the neck?

lapis sequoia Jan 5, 2022, 11:31 AM

#

Hello what would be the most efficient way of adding 150k rows to a dataframe? When using .append it starts with a pretty decent 30mins remaining and then it increases when it gets closer to the end. When using .loc it starts with more than an hour and then increases but not as much as using .append. It looks like .append performs better with small dataframes and .loc with large dataframes.

#

30 minutes might be a bit slow but I have another df with 57k rows load in memory to perform some conditionals and decide which rows should be added to the final df I want to create

#

I see how it dies little by little...😅

desert bear Jan 5, 2022, 11:36 AM

#

Hey, I have a question related to building a multi-class classification model. In my datasets I have some sequence of vectors that are unique for a specific class. Do you think that throwing this UNIQUE vector into an unsupervised model is a waste of resources?

lapis sequoia Jan 5, 2022, 12:06 PM

#

lapis sequoia Hello what would be the most efficient way of adding 150k rows to a dataframe? W...

If I'm not mistaken appending on df is ... expensive. Try one thing? Add in python data structure. Like dict and then convert to df. See how it goes?
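
A minimal sketch of that suggestion: collect the rows in a plain Python list and build the DataFrame once at the end. The loop condition and column names are placeholders, since the real logic is not shown here.

```python
import pandas as pd

rows = []
for i in range(1000):
    if i % 2 == 0:  # placeholder for the real "should this row be added?" check
        rows.append({"id": i, "value": i * 0.5})  # appending to a list is cheap

# One DataFrame construction instead of one copy per appended row.
result = pd.DataFrame(rows)
print(result.shape)
```

Each `DataFrame.append` call copied the whole frame, which is why the remaining time kept growing; the method was deprecated in pandas 1.4 and removed in 2.0. If the pieces are already DataFrames, collecting them in a list and calling `pd.concat` once follows the same idea.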

lapis sequoia Jan 5, 2022, 12:08 PM

#

lapis sequoia If I'm not mistaken appending on df is ... expensive. Try one thing? Add in pyth...

Sure I will give that a try

snow tangle Jan 5, 2022, 12:09 PM

#

Hey, I'm really interested in AI image generation with GANs because the results are really amazing, so I'm looking to learn more about it. I followed this course https://livecodestream.dev/post/generating-images-with-deep-learning/ and adapted the code to work with RGB images and such, and learnt more about the parameters but I still have a long way to go. I'm looking for any papers or articles on GAN image generation that you would all recommend so I can learn further about this topic

Generating images with Deep Learning

Learn how to use AI to generate new images such as faces or art.

#

https://cdn.discordapp.com/attachments/853692276470317100/927980769303998514/unknown.png

#

I was able to get it to generate images that started to resemble the dataset and as expected, a dataset with images from the same perspective also worked a lot better

#

but it's still not optimized and over time loses stability and the images drop in quality. So i definitely need to read more articles on image generation

#

https://cdn.discordapp.com/attachments/927109339636981791/927575744828284948/unknown.png

#

Worked well for fashion-mnist though

arctic wedgeBOT Jan 5, 2022, 12:28 PM

#

:incoming_envelope: :ok_hand: applied mute to @lapis sequoia until <t:1641386303:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

lapis sequoia Jan 5, 2022, 12:29 PM

#

lapis sequoia If I'm not mistaken appending on df is ... expensive. Try one thing? Add in pyth...

Proved to be better at the beginning but then it also increases remaining time

#

I'll paste the code I'm using just in case you saw smth that can be improved

fleet meteor Jan 5, 2022, 12:35 PM

#

hello

#

i am trying to create an ai

#

or self learning stuff

#

can any one help

#

or hav expeiriance?

lapis sequoia Jan 5, 2022, 12:37 PM

#

lapis sequoia If I'm not mistaken appending on df is ... expensive. Try one thing? Add in pyth...

https://www.toptal.com/developers/hastebin/matexadupe.py

lapis sequoia Jan 5, 2022, 1:00 PM

#

lapis sequoia https://www.toptal.com/developers/hastebin/matexadupe.py

Is it posible to write to an xml by chunks instead of appending to a giant list and then dumping that list to df and then to xml?

lapis sequoia Jan 5, 2022, 1:18 PM

#

lapis sequoia https://www.toptal.com/developers/hastebin/matexadupe.py

Uhm you can may be reduce the internal complexity? You are getting filtered_df for some reason only to check if it's empty or not right?

#

One way would be to not to retrieve the whole thing.
Another way would be getting to know your logic and using some dict initially instead of that df to... make search in it faster in while loop
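
A sketch of the second idea, under the assumption that the loop only needs to know whether a key exists in the other DataFrame. The pasted script is not reproduced in this log, so the frame, column name and candidate keys below are hypothetical.

```python
import pandas as pd

reference = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})  # hypothetical lookup frame
candidates = ["a", "d", "c", "e"]                                    # hypothetical keys to check

# Slow pattern: filter the whole frame on every iteration just to test emptiness.
slow = [k for k in candidates if not reference[reference["key"] == k].empty]

# Faster pattern: build a set once, then each membership test is a hash lookup.
known = set(reference["key"])
fast = [k for k in candidates if k in known]

print(fast)
```

If the candidates themselves live in a DataFrame column, `candidates_df["key"].isin(known)` gives the same answer as a boolean mask without an explicit loop.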

lapis sequoia Jan 5, 2022, 1:22 PM

#

fleet meteor or hav expeiriance?

Yeah there are plenty of experienced people over here. Feel free to ask questions.

mint palm Jan 5, 2022, 2:24 PM

#

i have installed jupyter but i wanna know which location is best to install so that i can easily access python libraries

#

should i install in ....../python37/lib/

#

or is there something else i should do

lapis sequoia Jan 5, 2022, 2:41 PM

#

lapis sequoia Uhm you can may be reduce the internal complexity? You are getting filtered_df f...

Just to check if it's empty or not

lapis sequoia Jan 5, 2022, 2:46 PM

#

lapis sequoia One way would be to not to retrieve the whole thing. Another way would be getti...

I'll be playing around with different data structures and hopefully one performs relatively quick. Thank you very much

thin palm Jan 5, 2022, 3:24 PM

#

hoary wigeon onehotencoding will create 2 column ['male', 'female'] using either of one is en...

To clarify, in the column it's already 0 / 1. the column is "Sex" and it has 0 and 1 already. The tutorial still OHE this? why?

wicked grove Jan 5, 2022, 3:38 PM

#

lapis sequoia Yeah there are plenty of experienced people over here. Feel free to ask question...

Hello, i have two npz files which i wanna load into a jupyter notebook

#

I used this x_train=load('outfile.npz')

#

Is there any way i can check the contents of this x_train

odd meteor Jan 5, 2022, 3:41 PM

#

thin palm To clarify, in the column it's already 0 / 1. the column is "Sex" and it has 0 a...

If the categorical feature sex already has a numeric value (1/0) as you mentioned, then the reason why sex was OHE again was that the person behind the tutorial video does not want the model to be biased towards any gender.

The person wouldn't want the model to assume that any gender encoded as 1 is more important than the other gender encoded as 0.

lapis sequoia Jan 5, 2022, 3:51 PM

#

wicked grove Hello, i have two npz files which i wanna load into a jupyter notebook

I have not personally uhm loaded npz files so lemme just check out.

wicked grove Jan 5, 2022, 3:52 PM

#

Alrightt

lapis sequoia Jan 5, 2022, 3:53 PM

#

If the file is a .npz file, then a dictionary-like object is returned, containing {filename: array} key-value pairs, one for each file in the archive.

If the file is a .npz file, the returned value supports the context manager protocol in a similar fashion to the open function:

with load('foo.npz') as data: a = data['a']

Via:
https://numpy.org/doc/stable/reference/generated/numpy.load.html

#

@wicked grove

#

Try just printing x_train for now? Doc suggests it must be a dict of files
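
A short sketch of inspecting an `.npz` archive with NumPy. The file is created in a temporary directory first so the example is self-contained; the array names `x` and `y` are made up.

```python
import os
import tempfile

import numpy as np

# Create a small archive so there is something to load.
path = os.path.join(tempfile.mkdtemp(), "outfile.npz")
np.savez(path, x=np.arange(6).reshape(2, 3), y=np.array([0, 1]))

# np.load returns a dict-like object; .files lists the array names inside.
with np.load(path) as data:
    names = list(data.files)
    shapes = {name: data[name].shape for name in names}
    x_train = data["x"]  # arrays are read from disk when indexed

print(names, shapes)
```

Printing the loaded object itself mostly shows that it is an archive handle, so listing `.files` and indexing by name is usually more informative.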

wicked grove Jan 5, 2022, 3:54 PM

#

lapis sequoia Try just printing x_train for now? Doc suggests it must a dict of files

Ohhh okayy, i will do that now
Thank you so much

thin palm Jan 5, 2022, 4:04 PM

#

odd meteor If the categorical feature `sex` already has a numeric value (1/0) as you mentio...

Man best answer I've gotten. Seriously thank you for this. Makes sense now because 1 > 0 so I see where this is coming from. Cheers mate!

odd meteor Jan 5, 2022, 4:06 PM

#

thin palm Man best answer I've gotten. Seriously thank you for this. Makes sense now becau...

You're welcome 😊

thin palm Jan 5, 2022, 4:50 PM

#

When receiving a cross validated score is our final model output score supposed to be higher than our CV score? For example when I Cross Validate with a K fold of 10 my score mean is .81%. When I take my model and fit it with our features and target my score jumps to .86%. does this make sense?

serene scaffold Jan 5, 2022, 4:54 PM

#

thin palm When receiving a cross validated score is our final model output score supposed ...

so you did 10-fold CV, which means you have 10 separate scores, the mean of which is .81 (not .81%). Can you explain again what you did that received the .86 score?

#

If you trained (fit) the model on the same instances that you used to evaluate, that would explain why the score went up.
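
A sketch of that effect, assuming scikit-learn is available; the synthetic dataset and the decision tree are arbitrary choices, not the model from the question. The 10 cross-validated scores come from data each fold's model never saw, while the final score is measured on the same rows the model was fit on.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise, so a perfect out-of-sample score is unrealistic.
X, y = make_classification(n_samples=300, n_features=10, flip_y=0.2, random_state=0)

model = DecisionTreeClassifier(random_state=0)

# 10-fold CV: ten scores, each computed on a held-out fold.
cv_scores = cross_val_score(model, X, y, cv=10)

# Fit on everything, then score on the very same rows.
model.fit(X, y)
train_score = model.score(X, y)

print(cv_scores.mean(), train_score)
```

The gap between the two numbers reflects how much the model memorizes its training rows, so the cross-validated mean is the more honest estimate of performance on new data.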