#data-science-and-ml | Python | Page 4

floral nova · 2022-07-31T04:05:47.652Z

``` df = data.DataReader('APPL', 'yahoo', '2010-01-01', '2022-01-01') df.head() ```

floral nova Jul 31, 2022, 4:05 AM

#

df.head() ```

#

when i try running this code it returns keyerror date

#

what does that mean and how do i fix it? I use google colab btw

shut bear Jul 31, 2022, 4:10 AM

#

hey ... so I made this NN model in Pytorch like so:

#

#

#

and I'm trying to graph it like this.

#

this uses the defined forward() func.

#

#

However, apparently: -1 <= dim <= 0 is expected... idk why

untold bloom Jul 31, 2022, 8:33 AM

#

shut bear However, apparently: `-1 <= dim <= 0` is expected... idk why

hi, torch.cat concatanates along an existing axis, and white and black are 1D tensors so they have only 1 dimension so dim=0 should do

#

i'm assuming you want to concatanate those two 1D tensors such that you end up with again a 1D tensor but the size doubled as 81920

shut bear Jul 31, 2022, 8:36 AM

#

Yeah I figured that out. And that assumption is correct.

#

Thank you though ❤️

wheat ivy Jul 31, 2022, 10:25 AM

#

How do I remove 1e6 from pandas bar plot

dataframe print:-

Random External Data
crash color
0 2080101 red
1 1789736 blue
2 760134 green
3 1225782 orange

#

comes off as this at y axis

#

i want it to show complete numbers the few working solutions for pandas as well didn't work

wooden sail Jul 31, 2022, 10:27 AM

#

what's your issue with it? one thing you can do is to divide by 1e6 before plotting

wheat ivy Jul 31, 2022, 10:27 AM

#

divide by 1e6?

#

my issue is y axis isn't showing complete numbers
2080101 like this
would be fine even if it just showed 200000

wooden sail Jul 31, 2022, 10:28 AM

#

the 1e6 is telling you that the numbers at the left of the bar are in millions

#

why would you want to display 6 zeros? the tick marks would be super cluttered

#

the way it is now is standard and readable, it's scientific notation

wheat ivy Jul 31, 2022, 10:29 AM

#

well its a use case thats why

#

i tried removing scientific notation and also tried removing offsets

#

neither worked unfortunately

#

ax.get_yaxis().get_major_formatter().set_useOffset(False)

#

this I tried for offset

wooden sail Jul 31, 2022, 10:30 AM

#

how did you attempt that? there's pd.set_option and pd.options.display.float_format

#

or you can set the tick marks manually

wheat ivy Jul 31, 2022, 10:30 AM

#

ax was returned from pd.plot()

#

also tried using
pd.set_option('display.float_format', lambda x: '%.3f' % x)

#

copied that off from stackoverflow tho, didn't work for my usecase

wooden sail Jul 31, 2022, 10:37 AM

#

try setting the tick marks explicitly, then

void sail Jul 31, 2022, 11:14 AM

#

You need to unpack the list or use a vector operation and you are doing neither

#

If you are trying to get the mode of those lists where the value is <0.6 and >= 0.5 you could do this but not very efficient

#

stats.mode([value for value in x if value < 0.6 and value >= 0.5])

lapis sequoia Jul 31, 2022, 12:10 PM

#

Hey guys, any servers for R?

wheat ivy Jul 31, 2022, 12:23 PM

#

nothing seems to work for it, very annoying

serene scaffold Jul 31, 2022, 12:41 PM

#

lapis sequoia Hey guys, any servers for R?

I don't know of an R-specific server, but you might be able to ask here: https://discord.gg/kfxk9nbg

compact valley Jul 31, 2022, 3:02 PM

#

Hey this might seem like weird question but what is the industry standard IDE for Machine Learning Engineers, like what code editors/IDE do they use in big tech companies
pls tell me if u know / or are employed at big tech company - this bugs me a lot

wooden sail Jul 31, 2022, 3:04 PM

#

it usually doesn't matter

grand canyon Jul 31, 2022, 3:04 PM

#

i had a question about a CNN im working with. over a couple epochs, my loss is gradually decreasing, but my validation accuracy is also decreasing.

#

what could be a potential cause for this?

wooden sail Jul 31, 2022, 3:05 PM

#

sounds like overfitting

grand canyon Jul 31, 2022, 3:09 PM

#

so should i do stuff like drop out

mild dirge Jul 31, 2022, 3:09 PM

#

Yeah, or maybe just less layers

grand canyon Jul 31, 2022, 3:27 PM

#

mild dirge Yeah, or maybe just less layers

alright

#

when will i know if im underfitting the data

mild dirge Jul 31, 2022, 3:28 PM

#

If your training loss is also not decreasing a lot

grand canyon Jul 31, 2022, 3:31 PM

#

ok cool thank you

potent flame Jul 31, 2022, 6:28 PM

#

compact valley Hey this might seem like weird question but what is the industry standard IDE fo...

Most data scientists I know use vs code

finite kayak Jul 31, 2022, 6:45 PM

#

Hi everyone

Can someone with “data science and artificial intelligence” degree work in software engineering field? You can also list a few fields that can be worked in.

Thanks in advance

serene scaffold Jul 31, 2022, 6:47 PM

#

finite kayak Hi everyone Can someone with “data science and artificial intelligence” degree...

there aren't a lot of DS/AI degree programs. the main degree for DS/AI is computer science.

finite kayak Jul 31, 2022, 6:48 PM

#

If you get one, can you work as a software engineer?

serene scaffold Jul 31, 2022, 6:49 PM

#

finite kayak If you get one, can you work as a software engineer?

there's a lot of overlap in all these job titles. I work as an AI developer, but I still do "software engineering" like stuff.

finite kayak Jul 31, 2022, 6:50 PM

#

Technically, there is programming in the curriculum of “data science and artificial intelligence”. Therefore, I think it would work. Thanks for your answer.

iron basalt Jul 31, 2022, 6:51 PM

#

finite kayak Technically, there is programming in the curriculum of “data science and artific...

Can't really do artificial intelligence without software engineering to some extent.

wooden sail Jul 31, 2022, 6:51 PM

#

well... that depends on what you mean by "do artificial intelligence"

iron basalt Jul 31, 2022, 6:53 PM

#

wooden sail well... that depends on what you mean by "do artificial intelligence"

While having ideas and doing math and all that surrounding the implementation is fine, it still needs to be implemented to demonstrate that it actually works. I would rather have someone that can do both.

#

A lot of AI is kind of just tweaking / messing around and for that a feedback loop where you can implement it and test it is required.

finite kayak Jul 31, 2022, 6:55 PM

#

wooden sail well... that depends on what you mean by "do artificial intelligence"

Is there any specific thing that you ask? I mean the name of the degree is “data science and artificial intelligence” and it contains mathematics, programming and artificial intelligence.

iron basalt Jul 31, 2022, 6:55 PM

#

And that will often involve some software engineering.

finite kayak Jul 31, 2022, 6:55 PM

#

Also, C, Java and another programming language will be learned in the programming lessons.

wooden sail Jul 31, 2022, 6:56 PM

#

finite kayak Is there any specific thing that you ask? I mean the name of the degree is “data...

squiggle addressed it in their post. you can by and large do AI by doing only math research and no coding

#

this isn't what most people think of when they say or hear "i do AI", but it's certainly an important part of it

iron basalt Jul 31, 2022, 6:56 PM

#

finite kayak Is there any specific thing that you ask? I mean the name of the degree is “data...

There is more outside of just the programming.

#

(Although you will benefit highly from being able program effectively, because feedback loops are great for building ideas / knowledge / understanding (fast iteration / one of the main things computers have given us the ability to do))

finite kayak Jul 31, 2022, 7:00 PM

#

Then, can it be said that an AI engineer can also work as a software developer?

#

Since “Data science and artificial intelligence” will cover the topics of “software engineering” to some extend?

wooden sail Jul 31, 2022, 7:01 PM

#

that will really depend on the program tbh

#

AI engineering can range from very math heavy with little coding, to lots of coding and not as much math depth. and in that spectrum, what the coding focuses on may also vary

#

some programs would surely go into so-called "MLOps" or talk about massive parallelization and how to set up large scale systems, and others might focus on other stuff instead, for instance

iron basalt Jul 31, 2022, 7:05 PM

#

finite kayak Also, C, Java and another programming language will be learned in the programmin...

Learning a programming language is good, but software engineering is more than learning a programming language. Software engineering deals with concepts beyond any specific programming language and is mostly learned through experience. You need to program a lot to really get a feel for what works and what does not (lots of projects, both small and large). It also has to do with working in teams and that is pretty complicated (and kind of an unsolved problem still).

#

Could you get a job in software engineering? Probably. As long as you can show that you can do it, but that is somewhat independent of degree.

steady basalt Jul 31, 2022, 7:09 PM

#

compact valley Hey this might seem like weird question but what is the industry standard IDE fo...

Pycharm, vscode is trash

#

And jupyter for notebooks

finite kayak Jul 31, 2022, 7:10 PM

#

Okay I got it. Thank you for your time and answers.

lethal swallow Jul 31, 2022, 7:19 PM

#

Hi guys what approach do I need to take, when I have a dataset with following info: consultants training clients at a specific location and date. Usually someone has to manually define which consultant teaches the client and has to look up where he usually teaches. Now I want to automate this approch a little bit. My goal is to set a specific date and my machine learning model predicts me the location and which consultant can do it. But I also want to able to set a client and the predictive model suggests me the date and consultant and so on.

coral cradle Jul 31, 2022, 7:20 PM

#

guys how do I even start with ai, I'm a final year student in university level I've learn ai for 2 years but never applied it T_T

mild dirge Jul 31, 2022, 7:20 PM

#

never applied it?

#

I'm sure you've had projects for uni

coral cradle Jul 31, 2022, 7:20 PM

#

I have never coded an ai

#

it was all theory

mild dirge Jul 31, 2022, 7:21 PM

#

Really?

#

Have you coded before then?

#

If it was all theoretical

coral cradle Jul 31, 2022, 7:22 PM

#

not really, they told us what a neural network is, how weight bias work and for my second year we learn about different type of optimizers

mild dirge Jul 31, 2022, 7:22 PM

#

You probably want to start learning a programming language then, like Python

#

afterwards you can learn how to implement AI

coral cradle Jul 31, 2022, 7:22 PM

#

mild dirge You probably want to start learning a programming language then, like Python

oh I know coding

#

I just don't understand how to make an ai

wooden sail Jul 31, 2022, 7:23 PM

#

if you've learned the maths and know about optimizers, i would recommend you become part of the new generation and look at JAX

#

recommend is too strong, let's say suggest instead

coral cradle Jul 31, 2022, 7:24 PM

#

hmmm

wooden sail Jul 31, 2022, 7:25 PM

#

you'll have nice, low level control of your "networks" as composition of affine transformations and nonlinear transformations

coral cradle Jul 31, 2022, 7:25 PM

#

I have my final year project coming up and I want it to have ai, I want to understand more about ai but I can't find the correct resources for my level.

#

I want to integrate computer vision with ai actually.

#

I've heard about tensorflow for ai as well.

wooden sail Jul 31, 2022, 7:29 PM

#

tensorflow and pytorch are the common ones, yeah

lethal swallow Jul 31, 2022, 7:37 PM

#

lethal swallow Hi guys what approach do I need to take, when I have a dataset with following in...

anyone got an idea?

wooden sail Jul 31, 2022, 7:41 PM

#

tbh this sounds to me more like a database lookup problem than an AI one, unless you want nondeterministic schedules. though you can maybe make a case for preferred hours for the consultants and possibly the clients if they are recurring

steady basalt Jul 31, 2022, 7:43 PM

#

coral cradle I have my final year project coming up and I want it to have ai, I want to under...

How long do u have? This takes ages to learn

quaint leaf Jul 31, 2022, 7:47 PM

#

coral cradle I've heard about tensorflow for ai as well.

Computer vision is a field of AI, Tensorflow is one of the (Python) packages that help you work with it. Here is an example series from FreeCodeCamp to get a taste: https://www.youtube.com/watch?v=cPmjQ9V6Hbk

coral cradle Jul 31, 2022, 7:48 PM

#

steady basalt How long do u have? This takes ages to learn

Starting this October 😭

iron basalt Jul 31, 2022, 7:49 PM

#

coral cradle I want to integrate computer vision with ai actually.

Have a specific goal you want the AI to do. Then we can help.

#

(A specific project in mind)

coral cradle Jul 31, 2022, 7:49 PM

#

What would be a good project in your opinion?

#

What course shall I buy off udemy that might help me in your opinion

iron basalt Jul 31, 2022, 7:50 PM

#

For a beginner I would first recommend smaller projects. But this is your senior year project so it's suppose to be a bit more involved. You would have to do multiple more simple projects before then.

#

It's not so much a problem of courses, you need experience.

coral cradle Jul 31, 2022, 7:51 PM

#

What's a good starting point in your opinion?

wooden sail Jul 31, 2022, 7:52 PM

#

the classic mnist and fashion mnist ones are a good place to start

iron basalt Jul 31, 2022, 7:53 PM

#

I'm not sure how much you know / can do, so at the very most basic level: implement simple linear regression from scratch.

wooden sail Jul 31, 2022, 7:53 PM

#

or doing a polynomial regression the traditional pseudo-inverse way and comparing that to a deep learning solution

iron basalt Jul 31, 2022, 7:53 PM

#

Gain the ability to read and implement pages like this: https://en.wikipedia.org/wiki/Simple_linear_regression

Simple linear regression

In statistics, simple linear regression is a linear regression model with a single explanatory variable. That is, it concerns two-dimensional sample points with one independent variable and one dependent variable (conventionally, the x and y coordinates in a Cartesian coordinate system) and finds a linear function (a non-vertical straight line) ...

#

(Sometimes Wikipedia goes into too much unrelated detail, but it's a starting point)

coral cradle Jul 31, 2022, 7:54 PM

#

Does this use back propagation?

iron basalt Jul 31, 2022, 7:54 PM

#

No.

#

But it will help you understand it (partially).

steady basalt Jul 31, 2022, 7:54 PM

#

coral cradle What's a good starting point in your opinion?

Start with sklearn on tabular data

#

Try and use a few methods to predict classes

coral cradle Jul 31, 2022, 7:55 PM

#

I see

iron basalt Jul 31, 2022, 7:55 PM

#

One step at a time. And after each step, feel free to just use someone else's implementation after you understood how it works.

iron basalt Jul 31, 2022, 7:58 PM

#

wooden sail the classic mnist and fashion mnist ones are a good place to start

Yeah, if you want to do something involving vision later, you will have mnist as one of your steps.

#

If you want neural networks, one step will probably be implementing a Perceptron from scratch.

#

After enough basic steps, you should have a more clear picture of where you can take it.

wooden sail Jul 31, 2022, 8:01 PM

#

if you're good with your linalg and multivar calc, you can skip the perceptron and backprop, but only you know where you're at

coral cradle Jul 31, 2022, 8:03 PM

#

alr

#

thanks guys

tropic matrix Jul 31, 2022, 8:21 PM

#

what would be the best way to normalize a dataset that's too large to keep in ram?
i'm currently using a data generator based off of keras.Sequence, but when I try to blindly input the generator into the keras.layers.normalization().adapt() function, i get the following error:

ValueError: in user code:

    File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_preprocessing_layer.py", line 117, in adapt_step  *
        self._adapt_maybe_build(data)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_preprocessing_layer.py", line 285, in _adapt_maybe_build  **
        self.build(data_shape)
    File "/usr/local/lib/python3.8/dist-packages/keras/layers/preprocessing/normalization.py", line 137, in build
        input_shape = tf.TensorShape(input_shape).as_list()

    ValueError: as_list() is not defined on an unknown TensorShape.

steady basalt Jul 31, 2022, 8:24 PM

#

tropic matrix what would be the best way to normalize a dataset that's too large to keep in ra...

How large

#

How much ram u got

#

My ram is 16gb and when I load in a 100m row dataset it says it’s using 40gb ram, but still works

tropic matrix Jul 31, 2022, 8:26 PM

#

steady basalt How large

well i got 128gb of ram but apparently that isn't enough to load a 6.6gb csv file

#

well to be honest

#

if i can use something similar to a minmaxscaler for example in sklearn (where i have an encoder i need to fit just once on the dataset) then i can use 128gb

#

if it needs to be done on the fly (as in throughout training) i use 4 gpus and each get assigned 32gb of usable ram

#

@steady basalt

#

i'll see if i'm able to load the entire dataset using 128gb of ram rn using pd.read_csv just to make sure

#

ok @steady basalt here's the situation:
i'm able to load the csv file, however when i preprocess it it ends up running out of memory and crashing

#

to deal with this i use a sequence class and preprocess whatever portion i need on the fly

timid kiln Jul 31, 2022, 8:41 PM

#

void sail You need to unpack the list or use a vector operation and you are doing neither

Thanks for that advice. Between my post and yours, I found the function np.histogram which, for 0 to 100% bounds, evenly spaced, works quite nicely for me. Thanks for your help!

steady basalt Jul 31, 2022, 9:20 PM

#

tropic matrix well i got 128gb of ram but apparently that isn't enough to load a 6.6gb csv fil...

That’s totally not real

tropic matrix Jul 31, 2022, 9:20 PM

#

steady basalt That’s totally not real

read what i said after that

#

it was my misrememberance on that statement

steady basalt Jul 31, 2022, 9:20 PM

#

No even still, sounds totally like a false problem

#

6gb is nothing

#

Shudnt crash a laptop even

#

Must be issue elsewhere

tropic matrix Jul 31, 2022, 9:21 PM

#

20.7m rows + 106 columns says otherwise

#

that may be a possibility

#

it always crashes when running pd.get_dummies

steady basalt Jul 31, 2022, 9:21 PM

#

Nah, I just did similar work on multiple million rows

tropic matrix Jul 31, 2022, 9:21 PM

#

so maybe when running that function it explodes in memory usage?

steady basalt Jul 31, 2022, 9:21 PM

#

With dozens of columns on a 16gb laptop

tropic matrix Jul 31, 2022, 9:21 PM

#

the dataset does contain a lot of strings

steady basalt Jul 31, 2022, 9:21 PM

#

Then count@how many unique strings are there it’s trying to add

tropic matrix Jul 31, 2022, 9:22 PM

#

steady basalt Then count@how many unique strings are there it’s trying to add

alright, what could i do with that number though?

#

if there's a solution to make it not run out of memory when running pd.get_dummies then i'd love to find it

olive shore Jul 31, 2022, 9:44 PM

#

hey so I have a collection of pdfs with some that I would like to extract from it

#

how would I go about extracting the data

serene scaffold Jul 31, 2022, 9:50 PM

#

olive shore how would I go about extracting the data

Have you looked for pdf-to-text libraries?

#

You'll have varying degrees of success using them, depending on how complicated the pages are.

olive shore Jul 31, 2022, 9:55 PM

#

serene scaffold Have you looked for pdf-to-text libraries?

do you mind if i pm you i have looked into those libraries but there is something specific that i would like to capture

serene scaffold Jul 31, 2022, 9:56 PM

#

olive shore do you mind if i pm you i have looked into those libraries but there is somethin...

I won't answer messages over DMs that could have been sent in this channel.

olive shore Jul 31, 2022, 9:56 PM

#

#

this pdf has other text

serene scaffold Jul 31, 2022, 9:57 PM

#

Ah. I'm not sure how successful those systems are with Arabic letters.

olive shore Jul 31, 2022, 9:57 PM

#

including this and i just want to capture the text within these blocks and place information of each block with each other

#

yeah well ill try to get around that issue

#

but the thing is how could i just capture the information in those blocks

serene scaffold Jul 31, 2022, 9:58 PM

#

Information extraction systems deal with "raw text". Ie stuff that could be represented as strings in code

#

So you have to extract the text before you can extract any information.

#

You might look into OCR for Arabic

#

Which is optical character recognition.

tropic matrix Jul 31, 2022, 10:40 PM

#

I'm struggling to figure out why this error occurs when inputting a keras.utils.Sequence into keras.layers.Normalization:

norm = keras.layers.Normalization()
norm.adapt(SEQUENCE_VARIABLE)

ValueError: in user code:

    File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_preprocessing_layer.py", line 117, in adapt_step  *
        self._adapt_maybe_build(data)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_preprocessing_layer.py", line 285, in _adapt_maybe_build  **
        self.build(data_shape)
    File "/usr/local/lib/python3.8/dist-packages/keras/layers/preprocessing/normalization.py", line 150, in build
        raise ValueError(

    ValueError: All `axis` values to be kept must have known shape. Got axis: (-1,), input shape: [None, None], with unknown axis at index: 1

#

the shape of each batch is (4096, 2899)

#

(4096 = batch_size)

stiff belfry Jul 31, 2022, 11:13 PM

#

yo can someone help me save my model so that i can just run a command in another program that uses the model to predict the end product

#

i cant find something like this. I have found the .save() and .load() functions but they are both returning errors

steady basalt Jul 31, 2022, 11:19 PM

#

#

would anyone know what would be the next step to boost class 0 recall? I had a hugely imbalanced dataset with something like 17,000 1's and 390,000 0's - I had resampled this using smote to balance them out tho. Still unable to achieve greater than 0.5 recall for both classes. Other parameters result in an effective switch - giving 1.00 recal lfor 0's and 1.00 precision for 1's.....

olive shore Aug 1, 2022, 12:06 AM

#

#

import arabic_reshaper
from bidi.algorithm import get_display
import pdfplumber
with pdfplumber.open(r'C:\Users\ahmad\PycharmProjects\pythonProject4\c6fe28d10be56fafc39a65ec189f6259.pdf') as pdf:
    my_page = pdf.pages[12]
    thepages=my_page.extract_text()
    reshaped_text = arabic_reshaper.reshape(thepages)
    bidi_text = get_display(reshaped_text)
    print(bidi_text)

#

im using this code to get the pdf move to text

#

and im getting these weird symbols

#

is there any way to get around this

serene scaffold Aug 1, 2022, 12:29 AM

#

@olive shore you can filter them out of the result, but if there aren't settings you can adjust for the extraction algorithm, you can't make the algorithm give you the correct result.

olive shore Aug 1, 2022, 12:32 AM

#

it seems that the letters arent connected properly

#

it seems that the arabic reshaper isnt functioning properly

serene scaffold Aug 1, 2022, 12:49 AM

#

olive shore it seems that the arabic reshaper isnt functioning properly

I would check to see if it works correctly for a PDF that is just one block of text.

#

No borders, no separate blocks of text. Nothing that could possibly confuse it. And see if it works in the most ideal circumstance.

olive shore Aug 1, 2022, 12:49 AM

#

ok

#

ok the pdf has

#

اللغة العربية رائعة

#

and when i use the code it outputs

#

 ﻼﻠﻏﺓ ﻼﻋﺮﺒﻳﺓ ﺭﺎﺌﻋ ﺓ

serene scaffold Aug 1, 2022, 12:51 AM

#

Well fuck.

olive shore Aug 1, 2022, 12:51 AM

#

yeah

#

the letters are in the correct positions but they arent connected

serene scaffold Aug 1, 2022, 12:51 AM

#

But did you do what I said. Or is this just an example from a PDF you already have

olive shore Aug 1, 2022, 12:52 AM

#

i did exactly what you said

serene scaffold Aug 1, 2022, 12:52 AM

#

Great

olive shore Aug 1, 2022, 12:52 AM

#

i made a pdf with just the first part

#

and ran it through the code and it gave that output

serene scaffold Aug 1, 2022, 12:52 AM

#

How is it even non-connecting them. Is it adding spaces

#

I can read Arabic btw.

olive shore Aug 1, 2022, 12:53 AM

#

oh shit thats nice

serene scaffold Aug 1, 2022, 12:53 AM

#

I'm a """"computational linguist""""

olive shore Aug 1, 2022, 12:53 AM

#

thats so cool

#

that makes stuff easier for both of us

#

yeah for some reason the arabic reshaper isnt functioning properly

serene scaffold Aug 1, 2022, 12:54 AM

#

Anyway, i wonder if there's a way to replace each character with the correct version automatically

#

You can just do it based on whether the next character is a space and whether the previous character connects. Right?

olive shore Aug 1, 2022, 12:55 AM

#

wait im sorry i dont understand what should I do?

serene scaffold Aug 1, 2022, 12:57 AM

#

So each character has two or four forms. As you know. And you can know which is the right one based on the previous and next character

olive shore Aug 1, 2022, 12:57 AM

#

yeah

serene scaffold Aug 1, 2022, 12:57 AM

#

So you can just write something to iterate through the string and figure out which character is right, and fix it.

olive shore Aug 1, 2022, 1:09 AM

#

ok wait so i did this instead

#

import arabic_reshaper
from bidi.algorithm import get_display
import pdfplumber
import re
with pdfplumber.open(r'C:/Users/ahmad/OneDrive/Desktop/arabic.pdf') as pdf:
    my_page = pdf.pages[0]
    thepages=my_page.extract_text()

    reshaped_text = arabic_reshaper.reshape(thepages)
    print(reshaped_text)

#

and it outputs\

#

ﺓ ﻋﺌﺎﺭ ﺓﻳﺒﺮﻋﻼ ﺓﻏﻠﻼ

#

so its all flipped and the first and last letters are flipped

#

import arabic_reshaper

text_to_be_reshaped =  'اللغة العربية رائعة'

reshaped_text = arabic_reshaper.reshape(text_to_be_reshaped)


print(reshaped_text)

#

but for some reason this code up here just works

olive shore Aug 1, 2022, 1:14 AM

#

serene scaffold So you can just write something to iterate through the string and figure out whi...

^

#

the pdf is causing issues

#

yeah bruh idk

#

weird shit

tropic matrix Aug 1, 2022, 3:31 AM

#

I'm struggling to figure out why this error occurs when inputting a custom keras.utils.Sequence into keras.layers.Normalization:

norm = keras.layers.Normalization()
norm.adapt(SEQUENCE_VARIABLE)

ValueError: in user code:

    File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_preprocessing_layer.py", line 117, in adapt_step  *
        self._adapt_maybe_build(data)
    File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_preprocessing_layer.py", line 285, in _adapt_maybe_build  **
        self.build(data_shape)
    File "/usr/local/lib/python3.8/dist-packages/keras/layers/preprocessing/normalization.py", line 150, in build
        raise ValueError(

    ValueError: All `axis` values to be kept must have known shape. Got axis: (-1,), input shape: [None, None], with unknown axis at index: 1

the shape of each batch is (4096, 2899)
(4096 = batch_size)

lapis sequoia Aug 1, 2022, 5:11 AM

#

#

@olive shore omg it's Albert Einstein, a×b=ab → 2×3=23 XD

thick marlin Aug 1, 2022, 6:31 AM

#

Hello,
I have a pre-trained model on the original pix2pixHD. The saved model dir has 4 files

iter.txt
latest_net_D.pth
latest_net_G.pth
loss_log.txt
opt.txt

How can I convert the model to utilize it with Imaginaire's pix2pixHD? Imaginaire uses 1 .pt file for loading the models.
It's the same model but they have integrated the library with their other models to create a model zoo.

unborn crow Aug 1, 2022, 6:52 AM

#

Someone here that has time for a urgent call regarding SKlearn ?

steady basalt Aug 1, 2022, 8:28 AM

#

Deadline tomorrow? 😂

unborn crow Aug 1, 2022, 8:43 AM

#

steady basalt Deadline tomorrow? 😂

more like in 4 Hours ...

solar tiger Aug 1, 2022, 10:10 AM

#

can anyone help me to solve this error

ValueError: invalid literal for int() with base 10: ''

#

I got this error, while im converting string values into integer values in data frame

#

here is my function code

#

python```
def mrp(x):
MRP = int(x)
return MRP

wooden sail Aug 1, 2022, 10:20 AM

#

what did you pass to your func? this works for strings that represent numbers in base 10

solar tiger Aug 1, 2022, 10:24 AM

#

column values from a dataframe

wooden sail Aug 1, 2022, 10:24 AM

#

sure, but what do the values in that column look like

solar tiger Aug 1, 2022, 10:29 AM

#

#

look at the MRP column

#

here is the info of my dataframe

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30758 entries, 0 to 30757
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   BrandName  30758 non-null  object
 1   Deatils    30758 non-null  object
 2   Sizes      30758 non-null  object
 3   MRP        30758 non-null  object
 4   SellPrice  30758 non-null  object
 5   Discount   30758 non-null  object
 6   Category   30758 non-null  object
dtypes: object(7)
memory usage: 1.6+ MB

wooden sail Aug 1, 2022, 10:37 AM

#

one thing you can do is put a try except inside that function call, and write None or NaN in the except block. then find these Nones or NaNs by index, and check in the original dataframe what their value is

#

some of the values in your dataframe appear not to be numbers

solar tiger Aug 1, 2022, 10:38 AM

#

ok got it, thank you @wooden sail

#

hey @wooden sail , I see no NaN values in MRP column

wooden sail Aug 1, 2022, 10:50 AM

#

can you show your code

solar tiger Aug 1, 2022, 10:51 AM

#

yeah

steady basalt Aug 1, 2022, 10:51 AM

#

Anyone wana try help me getting higher than 57% accuracy with xgboost on a poor dataset

#

Well, not poor. Just not powerful for this use

amber swan Aug 1, 2022, 10:52 AM

#

@solar tiger just a question, is all youre trying to do is to convert the column from type String to Integer?

#

because there are easier methods to do that within pandas (https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html)

amber swan Aug 1, 2022, 10:52 AM

#

solar tiger I got this error, while im converting string values into integer values in data ...

considering this

#

you could probably just write d['MRP'] = pd.to_numeric(d['MRP'])

wooden sail Aug 1, 2022, 10:53 AM

#

the default pandas.to_numeric will also raise an error or set a nan, just as above 😛 but yeah, built-in functions are nice

solar tiger Aug 1, 2022, 10:56 AM

#

here are the screen shorts of my code

wooden sail Aug 1, 2022, 10:58 AM

#

you didn't catch the exception. let's make this easier. use dingleyz suggestion. apply the to_numeric function to the appropriate column, and pass the parameter errors='coerce' to turn the errored entries into NaN

lapis sequoia Aug 1, 2022, 10:58 AM

#

Can someone help me understand what's happening here. It's showing empty when I retrieve whole df. But values are in there when I retrieve just that column

steady basalt Aug 1, 2022, 10:59 AM

#

Recently made jupyter notebook dark mode , so much better

lapis sequoia Aug 1, 2022, 11:00 AM

#

steady basalt Recently made jupyter notebook dark mode , so much better

Thanks

steady basalt Aug 1, 2022, 11:00 AM

#

For?

#

It’s jupyter themes oceans16

lapis sequoia Aug 1, 2022, 11:02 AM

#

telling me that it exists

solar tiger Aug 1, 2022, 11:03 AM

#

wooden sail you didn't catch the exception. let's make this easier. use dingleyz suggestion....

pd.to_numeric is working

#

thanq @amber swan @wooden sail

amber swan Aug 1, 2022, 11:04 AM

#

wait, even without the coerce?

solar tiger Aug 1, 2022, 11:04 AM

#

i used errors=coerce as parameter in apply method

wooden sail Aug 1, 2022, 11:04 AM

#

ok

#

cool. now if you want to satisfy your scientific curiosity, you should look for the indices where this result has NaNs, and use those indices in the ORIGINAL column to see what the problem was

#

something like df_original['MRP'].loc[casted_to_numeric.isna()], i think that's the syntax? someone correct me if not, my pandas is rough

versed gulch Aug 1, 2022, 11:43 AM

#

Hi is there a way I can fill in sparse arrays into a zero array in Python?

wooden sail Aug 1, 2022, 11:44 AM

#

what format is your sparse array in? scipy and numpy have sparse arrays that have something like to_dense() to turn into dense arrays, but this is usually not something you want to do

somber prism Aug 1, 2022, 11:46 AM

#

anyone here tried to install ontonote 5.0 ???

versed gulch Aug 1, 2022, 11:53 AM

#

wooden sail what format is your sparse array in? scipy and numpy have sparse arrays that hav...

format? if you mean how I created them, then i converted my 3D images into sparse arrays which where of shape (242, 512, 512) and are of uint8

wooden sail Aug 1, 2022, 11:54 AM

#

and how exactly did you do that?

versed gulch Aug 1, 2022, 11:55 AM

#

using the module sparse

wooden sail Aug 1, 2022, 11:56 AM

#

all righty. and what do you want to make it dense for?

versed gulch Aug 1, 2022, 11:58 AM

#

dont want to make it dense as I'm running out of memeory with numpy arrays as my zero which I want to fill in is of size (39, 39, 242, 512, 512)

wooden sail Aug 1, 2022, 11:59 AM

#

i think i'm just not understanding what you're trying to say. you have a dense array and want to make it sparse? or backwards? or?

versed gulch Aug 1, 2022, 12:00 PM

#

no, so what I did first was filling by large 3D arrays (imgs) of size (242, 512, 512) into a zero array of size (39, 39, 242, 512, 512), but Python says that numpy array runs out of memory

#

therefore I converted my 3D arrays to sparse arrays as it has lots of zeros anyways and trying to fill my zero array like this

wooden sail Aug 1, 2022, 12:03 PM

#

are the images sparse?

versed gulch Aug 1, 2022, 12:05 PM

#

yes as in they have a lot of zeros

wooden sail Aug 1, 2022, 12:05 PM

#

all right

versed gulch Aug 1, 2022, 12:05 PM

#

and pixels of low value

#

this is my code:

arr = np.zeros((39, 39, 242, 512, 512), dtype='uint8') # temp zero array to fill in

# path for the MIP tiles and sort them
tiles3D_path =  "my_path/raw_data/confocal/raw_tiles/*.CZI"
list_of_tiles = sorted(glob.glob(tiles3D_path))
list_of_tiles.sort(key=lambda f: int(''.join(filter(str.isdigit, f))))
# place the tiles inside the array 'arr' based on the grid coordinates from the dataframe
# like placing oieces of a puzzle but with images as arrays into a zero array

for tile_path, xy in zip(list_of_tiles, df0["grid_coord"].values.tolist()):
    # open the MIP tile and convert to an array
    raw_tile = czifile.imread(tile_path)
    tile_arr = raw_tile.reshape(242, 512, 512)
    sparse_tile_arr = sparse.COO(tile_arr)
    # unpack the (x, y) coordinates
    x, y = xy
    # fill in the zero based on the coordinates
    arr[int(y), int(x), :, :, :] = sparse_tile_arr

wooden sail Aug 1, 2022, 12:07 PM

#

arr is dense there, so that won't work

#

you should be able to use the usual slice notation, but arr needs to be a sparse array itself

versed gulch Aug 1, 2022, 12:08 PM

#

wooden sail arr is dense there, so that won't work

so does that mean I have to make my arr sparse too?

wooden sail Aug 1, 2022, 12:08 PM

#

yeah

versed gulch Aug 1, 2022, 12:08 PM

#

wooden sail yeah

okay so only change the arr defining line?

wooden sail Aug 1, 2022, 12:10 PM

#

that should do, and just leave its data field empty

#

i think you might have to use a DOK sparse matrix instead of COO for that

versed gulch Aug 1, 2022, 12:12 PM

#

DOK?

wooden sail Aug 1, 2022, 12:13 PM

#

dictionary of keys instead of coordinate sparse array

versed gulch Aug 1, 2022, 12:14 PM

#

TypeError: Expected rank <=2 dense array or matrix.

wooden sail Aug 1, 2022, 12:14 PM

#

show the code

versed gulch Aug 1, 2022, 12:14 PM

#

got this when i did this
from scipy.sparse import dok_matrix
arr = dok_matrix((grid_size, grid_size, 242, 512, 512), dtype=np.uint8)

#

grid_size is 39

wooden sail Aug 1, 2022, 12:15 PM

#

nono, DOK sparse array from the same library you are using

#

scipy's sparse arrays are only 2D

#

alternatively try making a list or an array of each of the sparse COO arrays, and then using those to create a new COO array that contains all the data

versed gulch Aug 1, 2022, 12:18 PM

#

wooden sail nono, DOK sparse array from the same library you are using

for some reason sparse doesnt have this?

wooden sail Aug 1, 2022, 12:18 PM

#

it does, but i just gave you an alternative

#

https://sparse.pydata.org/en/stable/generated/sparse.DOK.html

versed gulch Aug 1, 2022, 12:20 PM

#

wooden sail https://sparse.pydata.org/en/stable/generated/sparse.DOK.html

I did this but it gave my this:

#

RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.

wooden sail Aug 1, 2022, 12:20 PM

#

show the full code and full error message

versed gulch Aug 1, 2022, 12:20 PM

#

this was my code
arr = sparse.DOK((grid_size, grid_size, 242, 512, 512), dtype=np.uint8)

#

arr = sparse.DOK((grid_size, grid_size, 242, 512, 512), dtype=np.uint8)

# path for the MIP tiles and sort them
tiles3D_path =  "/home/si22/data/raw_data/confocal/raw_tiles/*.CZI"
list_of_tiles = sorted(glob.glob(tiles3D_path))
list_of_tiles.sort(key=lambda f: int(''.join(filter(str.isdigit, f))))
# place the tiles inside the array 'arr' based on the grid coordinates from the dataframe
# like placing oieces of a puzzle but with images as arrays into a zero array

for tile_path, xy in zip(list_of_tiles, df0["grid_coord"].values.tolist()):
    # open the MIP tile and convert to an array
    raw_tile = czifile.imread(tile_path)
    tile_arr = raw_tile.reshape(242, 512, 512)
    sparse_tile_arr = sparse.COO(tile_arr)
    # unpack the (x, y) coordinates
    x, y = xy
    # fill in the zero based on the coordinates
    arr[int(y), int(x), :, :, :] = sparse_tile_arr
    break

wooden sail Aug 1, 2022, 12:24 PM

#

all right. and it doesn't seem there's COO to DOK, or at least i couldn't find it. gimme a second to test the other thing i suggested

versed gulch Aug 1, 2022, 12:24 PM

#

wooden sail all right. and it doesn't seem there's COO to DOK, or at least i couldn't find i...

kl

steady basalt Aug 1, 2022, 12:43 PM

#

Anyone have any strategy to boost recall ?

serene scaffold Aug 1, 2022, 12:44 PM

#

steady basalt Anyone have any strategy to boost recall ?

that's generally applicable to any possible model?

somber prism Aug 1, 2022, 12:49 PM

#

guys

#

anyone know how to install ontonote 5.0

grand blaze Aug 1, 2022, 1:03 PM

#

Guys its my x-th time posting the same problem (although with more details now), which it doesnt seem many can solve. Just wondered if someone could help med at #help-orange ?

unborn crow Aug 1, 2022, 1:09 PM

#

versed gulch ```py arr = sparse.DOK((grid_size, grid_size, 242, 512, 512), dtype=np.uint8) #...

What kind of psycho uses a red terminal ?

wooden sail Aug 1, 2022, 1:10 PM

#

versed gulch kl

holy crap, the functions for this thing are SUPER limited! there seems to be no easy way to make a large sparse array from smaller ones other than concatenating them yourself. passing lists of sparse or dense arrays doesn't work

#

this is my artistic representation of what you're looking for

import sparse
import numpy as np

size_x = 2
size_y = 2
size_i1 = size_i2 = size_i3 = 3

for row in range(size_x):
    for col in range(size_y):
        x = np.random.binomial(1, 0.1, size=(1,1,size_i1, size_i2, size_i3))
        if col == 0:
            temp = sparse.COO.from_numpy(x)
        else:
            temp = sparse.concatenate((temp, sparse.COO(x)), axis = 1)
    if row == 0:
        big_sparse = temp
    else:
        big_sparse = sparse.concatenate((big_sparse, temp), axis = 0)
    
print(big_sparse.shape)

it's pretty shabby because i got frustrated with the lib and need to go back to work, but this should get the job done

#

make sure you massage the images into size 1,1,242,512,512 before sparsifying them

#

you can alternatively find the support of each of the images and do index gymnastics along the image dimensions or concatenation dimensions. but yeah, you kinda have to piece the stuff together yourself if the images don't fit in memory. this lib is not friendly with arrays it makes itself lol

versed gulch Aug 1, 2022, 1:22 PM

#

unborn crow What kind of psycho uses a red terminal ?

me 😉

wooden sail Aug 1, 2022, 1:22 PM

#

i use a purple one

versed gulch Aug 1, 2022, 1:25 PM

#

wooden sail make sure you massage the images into size 1,1,242,512,512 before sparsifying th...

There of this shape or something similar to this as the images are czi files

wooden sail Aug 1, 2022, 1:25 PM

#

just make sure those dummy extra dimensions in front are added, that's all i mean

#

if those 1's in front are missing, you have to add them in

#

on one hand i'm frustrated the library is so limited, but on the other, larger-than-memory-arrays are never easy to deal with. at least i can immediately see cases where i could use this library myself, so thanks for showing me something new

versed gulch Aug 1, 2022, 1:28 PM

#

wooden sail this is my artistic representation of what you're looking for ```py import spars...

from your code I'm assuming x is the zero array but not sure how to fill this in with my images

versed gulch Aug 1, 2022, 1:29 PM

#

wooden sail on one hand i'm frustrated the library is so limited, but on the other, larger-t...

thanks for giving your input and helping

wooden sail Aug 1, 2022, 1:29 PM

#

remove the x = np.random.binom, since i used that only to generate random data, and replace it with raw_tile.reshape(1,1,242,512,512)

lapis sequoia Aug 1, 2022, 1:29 PM

#

Doesn't decison tree work with string columns?

#

Categorical I mean

wooden sail Aug 1, 2022, 1:30 PM

#

you have to attach some value to the strings so that you can do math on them. the common way is with "one-hot encoding" since it makes labels equidistant

lapis sequoia Aug 1, 2022, 1:31 PM

#

wooden sail you have to attach some value to the strings so that you can do math on them. th...

Oh

#

I actually removed those values so that It becomes purely categorical

#

and not treat it as int

#

Changed 0,1 to "yes", "no"

wooden sail Aug 1, 2022, 1:32 PM

#

you can do boolean arithmetic on that, but not continuous operations

lapis sequoia Aug 1, 2022, 1:32 PM

#

what does that mean

#

I have a gender column

#

Will it work with "male", "female"?

wooden sail Aug 1, 2022, 1:33 PM

#

it means, what is the distance between male and female? and what mathematical operation do you do to move x % of the distance from male to fomale?

lapis sequoia Aug 1, 2022, 1:33 PM

#

huh

#

There's no distance between male and female. They are categories

#

How do I find distance

wooden sail Aug 1, 2022, 1:34 PM

#

by embedding the categories in a space where distance is defined

lapis sequoia Aug 1, 2022, 1:34 PM

#

like 0-male, 1-female?

wooden sail Aug 1, 2022, 1:34 PM

#

mhm

lapis sequoia Aug 1, 2022, 1:35 PM

#

so then I need to remember that map for each category

#

Reeeee

wooden sail Aug 1, 2022, 1:35 PM

#

that's the whole point of one-hot, yeah?

steady basalt Aug 1, 2022, 1:35 PM

#

serene scaffold that's generally applicable to any possible model?

I mean I’m using random forest and xgboost but if you could help me using any model I’d appreciate it

#

Pulling my hair out because clearly the data is holding me back

lapis sequoia Aug 1, 2022, 1:35 PM

#

wooden sail that's the whole point of one-hot, yeah?

so many categories

lapis sequoia Aug 1, 2022, 1:35 PM

#

wooden sail that's the whole point of one-hot, yeah?

what's one-hot

wooden sail Aug 1, 2022, 1:36 PM

#

one hot encoding...

#

we mentioned it above

steady basalt Aug 1, 2022, 1:36 PM

#

Use pd get dummies

#

You will create columns for all possible categories

lapis sequoia Aug 1, 2022, 1:37 PM

#

wooden sail one hot encoding...

is there a function for it that I can use?

wooden sail Aug 1, 2022, 1:37 PM

#

the one supermoon just gave you

lapis sequoia Aug 1, 2022, 1:37 PM

#

pd.get_dummies(df)

#

like this?

unborn crow Aug 1, 2022, 1:38 PM

#

versed gulch me 😉

i hope i never meet you at night

steady basalt Aug 1, 2022, 1:39 PM

#

Of any columns u want

#

Is there a problem with using get dummies before test train split so that I don’t encounter unknowns and have to swap to using sklearn encoder

lapis sequoia Aug 1, 2022, 1:41 PM

#

Pog it worked

steady basalt Aug 1, 2022, 1:41 PM

#

@wooden sail

lapis sequoia Aug 1, 2022, 1:42 PM

#

How do I get original values mapped back?

steady basalt Aug 1, 2022, 1:42 PM

#

Cause it’s what I’ve done

#

Google that I can’t recall

wooden sail Aug 1, 2022, 1:42 PM

#

i don't see any problem with that off the top of my head

#

i need to get going, busy busy

steady basalt Aug 1, 2022, 1:42 PM

#

So why do people cry and say u shud use onehotencoder to avoid such an issue

wooden sail Aug 1, 2022, 1:43 PM

#

aren't they the same thing

hollow sentinel Aug 1, 2022, 1:48 PM

#

they are i think

steady basalt Aug 1, 2022, 1:51 PM

#

They aren’t

#

According to stackoverflow

#

People say u shud use encoder

hollow sentinel Aug 1, 2022, 1:51 PM

#

https://stackoverflow.com/questions/68702705/one-hot-encoding-vs-pd-get-dummies

Stack Overflow

One Hot Encoding vs pd.get_dummies

What is the difference between one_hot_encoder and pd.get_dummies? Because sometimes, the get_dummies function gives the same results as the one hot encoding thing, but people tend to use the one hot

steady basalt Aug 1, 2022, 1:55 PM

#

So what’s the advantage of using sklearn

unborn crow Aug 1, 2022, 1:57 PM

#

unborn crow more like in 4 Hours ...

I just presented a completely overfitted Model because i am an Idiot

misty flint Aug 1, 2022, 1:57 PM

#

where is the lie

versed gulch Aug 1, 2022, 1:57 PM

#

unborn crow i hope i never meet you at night

hmmm

misty flint Aug 1, 2022, 1:57 PM

#

kekHands

unborn crow Aug 1, 2022, 1:57 PM

#

versed gulch hmmm

maybe it would be better to get gutted by a psycho like you ...

misty flint Aug 1, 2022, 1:58 PM

#

misty flint where is the lie

@hollow sentinel bro look at this

#

💀

versed gulch Aug 1, 2022, 1:59 PM

#

unborn crow maybe it would be better to get gutted by a psycho like you ...

ok you're just digger deeper into what you think my personality is like

hollow sentinel Aug 1, 2022, 2:01 PM

#

misty flint <@567030124306759710> bro look at this

hmmm i don't know if i agree with this

worthy phoenix Aug 1, 2022, 2:02 PM

#

!resources

arctic wedgeBOT Aug 1, 2022, 2:02 PM

#

Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

worthy phoenix Aug 1, 2022, 2:02 PM

#

i dont see any resources for learning AI in there

unborn crow Aug 1, 2022, 2:02 PM

#

versed gulch ok you're just digger deeper into what you think my personality is like

clearly you choose the Color of your terminal because you enjoy to look at it - red the color of Blood, so it is not far of that you like to touch red things like blood and gore

worthy phoenix Aug 1, 2022, 2:03 PM

#

any suggestions for beginner resources to learn about ai and deep learning?

unborn crow Aug 1, 2022, 2:03 PM

#

worthy phoenix any suggestions for beginner resources to learn about ai and deep learning?

are you based in germany ?

versed gulch Aug 1, 2022, 2:03 PM

#

unborn crow clearly you choose the Color of your terminal because you enjoy to look at it - ...

those are not the only connotations for red, there are some positive connotations for this colour, and anyway, the connotations are subjective

worthy phoenix Aug 1, 2022, 2:03 PM

#

nope

unborn crow Aug 1, 2022, 2:04 PM

#

versed gulch those are not the only connotations for red, there are some positive connotation...

no my opinion is fact

worthy phoenix Aug 1, 2022, 2:05 PM

#

worthy phoenix any suggestions for beginner resources to learn about ai and deep learning?

.

misty flint Aug 1, 2022, 2:05 PM

#

hollow sentinel hmmm i don't know if i agree with this

haha i think it depends on who you meet tbh

unborn crow Aug 1, 2022, 2:05 PM

#

worthy phoenix .

Depends on your current skills

misty flint Aug 1, 2022, 2:05 PM

#

some peeps are better at teamwork than others...some... kekHands

worthy phoenix Aug 1, 2022, 2:05 PM

#

unborn crow Depends on your current skills

intermediate with python, and got a good understanding of maths

unborn crow Aug 1, 2022, 2:06 PM

#

worthy phoenix intermediate with python, and got a good understanding of maths

How are you on probability and stochastics

versed gulch Aug 1, 2022, 2:06 PM

#

unborn crow no my opinion is fact

hmmm

hollow sentinel Aug 1, 2022, 2:06 PM

#

misty flint some peeps are better at teamwork than others...some...<:kekHands:94869794071158...

meanwhile my friend has been texting me nonstop to do a group project with him and i keep saying no ☠️

worthy phoenix Aug 1, 2022, 2:07 PM

#

unborn crow How are you on probability and stochastics

probability is one of my fav topics of mathematics well statistics i kinda know basics

unborn crow Aug 1, 2022, 2:07 PM

#

worthy phoenix probability is one of my fav topics of mathematics well statistics i kinda know ...

good starting point

worthy phoenix Aug 1, 2022, 2:07 PM

#

............

unborn crow Aug 1, 2022, 2:08 PM

#

worthy phoenix ............

Go find a project that you would like to do and try it

#

if you need help come back here

worthy phoenix Aug 1, 2022, 2:08 PM

#

aight

misty flint Aug 1, 2022, 2:14 PM

#

hollow sentinel meanwhile my friend has been texting me nonstop to do a group project with him a...

ehh he should get a hint lol

coral cradle Aug 1, 2022, 2:28 PM

#

is sklearn worth learning rather than pytorch?

serene scaffold Aug 1, 2022, 2:28 PM

#

coral cradle is sklearn worth learning rather than pytorch?

sklearn and pytorch are for completely different things.

coral cradle Aug 1, 2022, 2:29 PM

#

serene scaffold sklearn and pytorch are for completely different things.

pytorch is for images is it?

serene scaffold Aug 1, 2022, 2:30 PM

#

coral cradle pytorch is for images is it?

pytorch is for neural networks. and image processing often involves neural networks, because there are so many inputs (each pixel is an input)

#

but you can use neural networks for other things, too

coral cradle Aug 1, 2022, 2:30 PM

#

rn I'm learning sklearn, will it help me use pytorch in the future?

serene scaffold Aug 1, 2022, 2:31 PM

#

coral cradle rn I'm learning sklearn, will it help me use pytorch in the future?

I would rethink your whole learning strategy. learning how to use these libraries as a goal unto itself isn't really a good idea. you should focus on solving certain kinds of problems, and learn how to use the libraries as they pertain to accomplishing that.

#

I have a message in the pins about what the main libraries are for. none of them are intended to be an end-to-end solution for a class of DS/AI problems.

coral cradle Aug 1, 2022, 2:33 PM

#

thanks

river escarp Aug 1, 2022, 2:34 PM

#

I need to make a model which forecasts whether a product will go out of stock based on sales history. What would be a cool / interesting way of implementing this?

civic ivy Aug 1, 2022, 3:08 PM

#

does anyone have a good tutorial i can look at for make a neural network to simulate life of an creature?

#

i do hope this is the right spot to ask

lapis sequoia Aug 1, 2022, 3:32 PM

#

How should sell my model cuz I know how to make model , it’s pre-processing, developing model (eg: car price predicting) . What you guys suggest to you know amazed it

steady basalt Aug 1, 2022, 3:32 PM

#

unborn crow How are you on probability and stochastics

Hot garbage and still able to be a data scientist

lapis sequoia Aug 1, 2022, 3:33 PM

#

And I had another question how can I give brain to my model xD to u know get smarter than human . Let’s say I input a company name that not even exist , may be I can use exceptional handling??!wdus

ocean swallow Aug 1, 2022, 3:59 PM

#

is there any rule based text generation models using grammar?

#

I am trying to POS tag the words in my data set and like markov chains, it will create sentences based on probabilities etc

serene scaffold Aug 1, 2022, 4:00 PM

#

ocean swallow is there any rule based text generation models using grammar?

I'm sure someone has made one, but it's very difficult to account for all grammar rules.

ocean swallow Aug 1, 2022, 4:00 PM

#

If there is I simply couldn't find really

#

i am actually okay with kinda working models

serene scaffold Aug 1, 2022, 4:02 PM

#

@ocean swallow there isn't really incentive for people to invest more development in grammar-aware text generation, because there are already excellent text generation models that don't need that.

ocean swallow Aug 1, 2022, 4:04 PM

#

suppose you have 20 sentences as input and you want to generate text solely from them.

serene scaffold Aug 1, 2022, 4:04 PM

#

why would you even be in that circumstance? there's tons of text out there.

ocean swallow Aug 1, 2022, 4:04 PM

#

All ML will do is generate next word based on the previous words etc

#

Well I want to create Hot Topic titles based on the small trending niche

#

And that small trend doesn't have many titles also, creativity is a big priority

#

it would be great if it was able to understand rules of grammar and generate from them too actually

#

but I kinda lowered my expectations lol

iron basalt Aug 1, 2022, 4:17 PM

#

ocean swallow suppose you have 20 sentences as input and you want to generate text solely from...

That is the only data you are given? No other data allowed?

ocean swallow Aug 1, 2022, 4:18 PM

#

iron basalt That is the only data you are given? No other data allowed?

yep. the numbers of course can change somewhat like maybe 100 etc

#

training on general data pollutes are scope e.g.: old news titles

iron basalt Aug 1, 2022, 4:20 PM

#

ocean swallow yep. the numbers of course can change somewhat like maybe 100 etc

Well, that won't get you anything. To see why, imagine a human that has never seen, felt or heard anything, they have never spoken, written, or read anything, except for those 100 sentences. What could they possibly generate? Not much.

#

Only remixes of those 100.

ocean swallow Aug 1, 2022, 4:20 PM

#

Markov chains does somewhat good job

iron basalt Aug 1, 2022, 4:20 PM

#

Yeah, that is what they do.

#

But if you want creativity / introducing new factors. They have to come from somewhere.

ocean swallow Aug 1, 2022, 4:22 PM

#

no, some words and or phrases actually combine correctly into each other. but since they didn't in any of the samples, markov chain doesn't produce it

iron basalt Aug 1, 2022, 4:22 PM

#

Yes, but only from those 100 correct?

#

You are missing out on the tons of other text out there.

#

Or even non-text.

#

So it's going to be limited.

#

Best case those 100 sentences have a ton of variety to allow for a decent amount.

ocean swallow Aug 1, 2022, 4:24 PM

#

yeah

iron basalt Aug 1, 2022, 4:24 PM

#

Even if you get the grammar correct, it still can't choose from words not included in any of those 100.

ocean swallow Aug 1, 2022, 4:25 PM

#

I mean, I kinda don't want it to.

#

I want it to revolve around the trending topic titles

iron basalt Aug 1, 2022, 4:25 PM

#

But you also wanted creativity right?

ocean swallow Aug 1, 2022, 4:26 PM

#

it is easy to just put in 100 million news title and make it generate text

#

I don't think it will be creative though

#

because it will be using texts generated earlier

#

but just mix-and-matching them mostly

#

what I want is to disregard mix and match and just create from grammar rules

ocean swallow Aug 1, 2022, 4:29 PM

#

ocean swallow but just mix-and-matching them mostly

it really is what GPT-2 is doing now too. Context awareness is recognizing the order of words. nothing more

iron basalt Aug 1, 2022, 4:29 PM

#

How do you create from grammar rules without mix and match, is that not mix and matching words?

#

With some dictionary, which may not include all words.

#

Are you trying to generate based on more than just text?

ocean swallow Aug 1, 2022, 4:33 PM

#

Because model will not know what words are. There is just a bag called NP.

#

Okay I am gonna make it really simple.

I have dog.
You are a table.

dog here is NP.
table is also a NP.

I have a table. is a valid sentence based on the examples. A neural network won't do this.

#

Unless you provided the labeled POS tags with it.

unborn crow Aug 1, 2022, 4:37 PM

#

steady basalt Hot garbage and still able to be a data scientist

I am not only Garbage in Math - i am garbage in general

unborn crow Aug 1, 2022, 4:38 PM

#

unborn crow I am not only Garbage in Math - i am garbage in general

Top that my Nword

wooden sail Aug 1, 2022, 4:39 PM

#

what is even going on here, please keep the discussion on topic

unborn crow Aug 1, 2022, 4:40 PM

#

ocean swallow Aug 1, 2022, 4:42 PM

#

wooden sail what is even going on here, please keep the discussion on topic

huh?

wooden sail Aug 1, 2022, 4:43 PM

#

not you, i mean masterofthemüllermilch

ocean swallow Aug 1, 2022, 4:44 PM

#

oh yay. man had to decompress there a bit

iron basalt Aug 1, 2022, 4:45 PM

#

ocean swallow Okay I am gonna make it really simple. I have dog. You are a table. dog here i...

A neural network can do that, it just depends on what kind neural network. Neural networks such are as transformers are not forced to do things like this. But with enough data they tend to figure it out. You can create neural networks with forced / hard rules (e.g. for grammar). Adding POS tags makes it much easier when there is not much data, but with a lot of data a complex neural network may end up learning to do that itself. Transformers and other sequence models have been shown to learn a lot of unexpected things.

#

Their approach is tons of data + they can kind of do whatever they want / not the hard coded rules approach.

#

But not all NN models follow that approach.

ocean swallow Aug 1, 2022, 4:47 PM

#

iron basalt A neural network can do that, it just depends on what kind neural network. Neura...

we come back to the initial problem I told you. It will deviate from the topic. I may get a 200 new cool titles but it is very possible that they will be buried in 20 million other titles that doesn't interest me.

#

So yes it will be creative but it will not stay in the topic.

#

which I did btw. GPT2 bert XLNet all of them deviate either too little or too much.

#

fine tuning is not really achieved right now with those models is as I saw

ocean swallow Aug 1, 2022, 4:54 PM

#

iron basalt A neural network can do that, it just depends on what kind neural network. Neura...

No it won't do that. Why would it?

iron basalt Aug 1, 2022, 4:56 PM

#

ocean swallow No it won't do that. Why would it?

It depends on what the rest of the data is, but one way to tell is that it can learn that table and dog happen around the same positions. It has positional input.

ocean swallow Aug 1, 2022, 4:56 PM

#

The feature vectors should spell out some form of grammar rule in some way.

#

or the model itself

iron basalt Aug 1, 2022, 4:58 PM

#

ocean swallow The feature vectors should spell out some form of grammar rule in some way.

GPT-2 and such decided to not go the route of explicit rules like that, so without a bunch of fine-tuning and more data, it will probably fail.

#

You can make another model, that uses the hidden state (or output) of those big language models as extra information. But that model itself has more strict rules and such.

ocean swallow Aug 1, 2022, 5:01 PM

#

hmmm. Yeah I have heard people doing it.

#

How can I approach doing it is idk?

iron basalt Aug 1, 2022, 5:03 PM

#

How you have the model ignore or make use of what the big language model is telling it is up to you. One way would be to have a much more agile smaller model that uses both those sentences that you want and whatever the big language model thinks as input (the big language model has all that tons of text data baked into its weights). The smaller model being easier to understand and hand-crafted for the specific task.

#

This is similar to how some robots are now making use of a language model paired with its visual system to get better results (e.g. sees cup, language model says cup is related to table, so object below cup has higher probability of being table). Although in that case it's two different types of input.

#

(text contains a lot of information / relational stuff and so it makes for a great way to get a lot more information into your system to make use of because of the access to all of the text / contains humans knowledge baked into it)

iron basalt Aug 1, 2022, 5:46 PM

#

ocean swallow How can I approach doing it is idk?

Maybe you could even combine it with the Markov Chain method and grammar checks.

#

See what each generates, check grammar, and decide.

unborn crow Aug 1, 2022, 6:59 PM

#

wooden sail not you, i mean masterofthemüllermilch

i am sorry, will you ever forgive me ?

lapis sequoia Aug 1, 2022, 7:05 PM

#

What function can I use to map a set of categorical values to a bunch of integers. Like "a":1, "b":2, etc

#

I don't care about a specific value being attached

wooden sail Aug 1, 2022, 7:06 PM

#

you probably don't wanna map to integers, this will cluster the categories because they're not equidistant

lapis sequoia Aug 1, 2022, 7:07 PM

#

hmm'

#

what to do then

wooden sail Aug 1, 2022, 7:07 PM

#

one hot encoding is popular for a reason

lapis sequoia Aug 1, 2022, 7:07 PM

#

It doesn't work for more than 2 categories

wooden sail Aug 1, 2022, 7:07 PM

#

yes it does, that's the whole point of it

#

it maps a set of categories with cardinality N to R^N

lapis sequoia Aug 1, 2022, 7:08 PM

#

It just gives 3 columns for 3 categories

wooden sail Aug 1, 2022, 7:08 PM

#

that's exactly the point

#

that's what you want to do

lapis sequoia Aug 1, 2022, 7:08 PM

#

hmm

#

Really?

wooden sail Aug 1, 2022, 7:08 PM

#

those vectors are equidistant (and orthonormal, which is sexy)

lapis sequoia Aug 1, 2022, 7:08 PM

#

increase my number of features?

steady basalt Aug 1, 2022, 7:08 PM

#

Welcome to data analysis

wooden sail Aug 1, 2022, 7:09 PM

#

for all the reasons i mentioned, yes. if the number of categories is too large, you can anyway employ sparse regularization

steady basalt Aug 1, 2022, 7:09 PM

#

Nothing inherently wrong with many features and I’m kinda sick of the rhetoric that there’s always an issue unless you remove features until there’s 4 left

lapis sequoia Aug 1, 2022, 7:09 PM

#

Well, last time I didn't do it. I did 0 for no, yes for 1 and -1 for no plan. Through manual mapping by hand

steady basalt Aug 1, 2022, 7:09 PM

#

Features contain information

#

Models can also decide for themselves how much so

#

I’d never drop features unless they’re totally useless

lapis sequoia Aug 1, 2022, 7:10 PM

#

lapis sequoia Well, last time I didn't do it. I did 0 for no, yes for 1 and -1 for no plan. Th...

Bad idea?

steady basalt Aug 1, 2022, 7:10 PM

#

Which they rarely are in data I work with

wooden sail Aug 1, 2022, 7:10 PM

#

lapis sequoia Well, last time I didn't do it. I did 0 for no, yes for 1 and -1 for no plan. Th...

note that 1 is 2 units of distance away from -1, and so this means categorizing something as -1 when it should be 1 is "more wrong" that misclassifying it as 0. this makes the algorithm biased

lapis sequoia Aug 1, 2022, 7:10 PM

#

ooooo

wooden sail Aug 1, 2022, 7:10 PM

#

it yields a larger gradient and skews your parameters

lapis sequoia Aug 1, 2022, 7:10 PM

#

Looks like I wasted a couple of hours of my life

wooden sail Aug 1, 2022, 7:11 PM

#

or in other words, it clusters 1 with 0, and -1 with 0, but does not allow grouping 1 and -1 together

#

you're pretty much telling it that it should allow some errors and fix others

lapis sequoia Aug 1, 2022, 7:11 PM

#

What about in boolean catgories

#

Should i let it keep 2 columns

wooden sail Aug 1, 2022, 7:11 PM

#

then you can use just 0 and 1, it's not an issue

lapis sequoia Aug 1, 2022, 7:12 PM

#

Yeah but hot encoding gives me back 2 columns. Can I keep both?

wooden sail Aug 1, 2022, 7:12 PM

#

the problem only becomes evident with 3+ categories

steady basalt Aug 1, 2022, 7:12 PM

#

No

#

As type integer

wooden sail Aug 1, 2022, 7:12 PM

#

you can, yes. doesn't make a difference

steady basalt Aug 1, 2022, 7:12 PM

#

Python things 😅😅

lapis sequoia Aug 1, 2022, 7:12 PM

#

wooden sail you can, yes. doesn't make a difference

hmmmmmm

#

Reallyyy

wooden sail Aug 1, 2022, 7:12 PM

#

it might require some renormalization though, since sometimes you'll get distances of root 2

lapis sequoia Aug 1, 2022, 7:13 PM

#

That extra feature is providing no value?

wooden sail Aug 1, 2022, 7:13 PM

#

keeping the hessian nice and spherical is important

lapis sequoia Aug 1, 2022, 7:13 PM

#

lol

steady basalt Aug 1, 2022, 7:13 PM

#

@wooden sail how badly do u think not normalising test set will degrade model accuracy

#

When it’s just been min maxed

wooden sail Aug 1, 2022, 7:13 PM

#

lapis sequoia That extra feature is providing no value?

pretty much, because the data is in a low dimensional manifold. if you wanted you could represent a bool with 1000 columns

lapis sequoia Aug 1, 2022, 7:14 PM

#

I don't understand more than half the shit. Why don't they teach me statistics

steady basalt Aug 1, 2022, 7:14 PM

#

Not much right

wooden sail Aug 1, 2022, 7:14 PM

#

it's useless, but it can be done with good results

wooden sail Aug 1, 2022, 7:14 PM

#

lapis sequoia I don't understand more than half the shit. Why don't they teach me statistics

this is linalg

wooden sail Aug 1, 2022, 7:14 PM

#

steady basalt Not much right

i have no idea what you're talking about from the words you used

iron basalt Aug 1, 2022, 7:14 PM

#

lapis sequoia Really?

The integer version can be treated the same as the one-hot, but it's the compressed version (each integer is the index of the 1 in the vector, rest are 0s). Same thing, but a lot of systems are not designed to take in the compressed form. Either way, it's so you can have a bunch of nice orthonormal vectors.

steady basalt Aug 1, 2022, 7:14 PM

#

Fit transforming columns between 0-1

#

For ONLY training data

#

The pattern remains but without it for testing data would the model struggle

wooden sail Aug 1, 2022, 7:15 PM

#

steady basalt Fit transforming columns between 0-1

then you should do it for the test data and the data you'll do inference on

steady basalt Aug 1, 2022, 7:15 PM

#

Of course but if you didn’t how serious really is the impact

wooden sail Aug 1, 2022, 7:15 PM

#

the answer, as always, is "it depends"

#

you're telling the network that the data dimensions are nice and spherical, but the real data is a weird high dimensional ellipsoid

#

then if one column has large values, your alg will prioritize that one and think everything else is 0

#

loosely

#

smoothness and condition numbers are very unforgiving things

steady basalt Aug 1, 2022, 7:18 PM

#

Interesting

wooden sail Aug 1, 2022, 7:19 PM

#

linalg, multivar calc, and convex opt, if you wanna look into it

steady basalt Aug 1, 2022, 7:20 PM

#

Never heard of convex opt

#

Is it hard ?

#

I’m still on calc1

wooden sail Aug 1, 2022, 7:20 PM

#

linalg and multivar calc are prerequisites

#

some basic real analysis helps a lot, too

iron basalt Aug 1, 2022, 7:20 PM

#

lapis sequoia Bad idea?

The idea is that all the categories are treated equally in a sense. The higher number of dimensions also makes them easier to suss apart for the system.

wooden sail Aug 1, 2022, 7:21 PM

#

and if you work with complex valued quantities, you'll need to be at least familiar enough with complex numbers and complex differentiability to understand that you can relate C^n to R^2n and work with easier conditions to satisfy than being complex differentiable

lapis sequoia Aug 1, 2022, 7:22 PM

#

wooden sail this is linalg

Maths then

#

They should teach me maths

wooden sail Aug 1, 2022, 7:22 PM

#

ofc 🙂 what do you think ML is

lapis sequoia Aug 1, 2022, 7:22 PM

#

But all they make me do is write silly reports

wooden sail Aug 1, 2022, 7:23 PM

#

are you in undergrad? that's not uncommon (not saying it's good either)

lapis sequoia Aug 1, 2022, 7:23 PM

#

I got lower accuracy with your method btw

#

compared to the one I used

#

But I just used a simple train test split to test

#

You would say your model is more generalised?

wooden sail Aug 1, 2022, 7:25 PM

#

that'll also depend on how many samples you have of each class

#

i would say one hot is the standard if you have no additional information or good reason to introduce bias in the distances

#

if you have good reason to make one misclassification worse than another, then it's fine. if you don't, then don't

lapis sequoia Aug 1, 2022, 7:27 PM

#

hmm

#

How does this look?

wooden sail Aug 1, 2022, 7:28 PM

#

unchurned

lapis sequoia Aug 1, 2022, 7:29 PM

#

My churn predictions really bad. What might be wromg

#

is it dataset or me

#

Am I the bad guy

#

The feature importance is really hard to describe with one hot encoding too

mild dirge Aug 1, 2022, 7:35 PM

#

If you one-hot encode categorical data, you get new features that can each have their own importance

lapis sequoia Aug 1, 2022, 7:35 PM

#

Yeah but it's hard to describe those features' importaces in report

mild dirge Aug 1, 2022, 7:35 PM

#

F.e. if you want to predict the salary of a person, and you have a feature "company" that tells what company they work at, it might be useless for 99% of the data. But if someone works at Google, and the category is one-hot encoded, the new column for works at google is very important

lapis sequoia Aug 1, 2022, 7:36 PM

#

Like I don't understant that old_yes has an importance too and old_no has an importance too. When infact old yes told everything there was to be told in old_no

mild dirge Aug 1, 2022, 7:37 PM

#

For binary categories you can just convert it to 0 and 1 pretty sure

#

Don't think it is necessary to one-hot encode those

mild dirge Aug 1, 2022, 7:38 PM

#

lapis sequoia Like I don't understant that old_yes has an importance too and old_no has an imp...

For this reason

iron basalt Aug 1, 2022, 7:38 PM

#

lapis sequoia Like I don't understant that old_yes has an importance too and old_no has an imp...

For binary it's one or the other. But for 3 you can already see that you have one of them and not the other two (it's different).

#

So for binary just leave it as is.

lapis sequoia Aug 1, 2022, 7:44 PM

#

how to decide for the depth in decision trees

#

When I let it run on default params, it gives very high accuracy

#

But wasn't that supposed to not be good at generalisation

wooden sail Aug 1, 2022, 7:48 PM

#

how many no churn and churn were in the original data set?

#

in the training one, i mean

lapis sequoia Aug 1, 2022, 7:59 PM

#

Hmm

#

The total values are 30%

#

The one in my confusion matrix

#

Do you need bifracation? @wooden sail

lapis sequoia Aug 1, 2022, 8:01 PM

#

mild dirge For this reason

Okay. I just used the same function for simplicity. I will drop those extra columns then ig

wooden sail Aug 1, 2022, 8:02 PM

#

what is bifracation and what do you mean by "the total values are 30%'"

lapis sequoia Aug 1, 2022, 8:02 PM

#

Bifurcation

wooden sail Aug 1, 2022, 8:03 PM

#

i just want to know how many were churn category and how many were no churn category in the training set

lapis sequoia Aug 1, 2022, 8:03 PM

#

Like test values were 30%

#

It was 70-30 split

#

Oh

wooden sail Aug 1, 2022, 8:03 PM

#

in the 70% you used for training, how many were of each category

lapis sequoia Aug 1, 2022, 8:03 PM

#

Hmm

#

Will have to turn computer on for that. Are you onto something from that information?

stuck schooner Aug 1, 2022, 8:04 PM

#

Hey, do you have any idea why that behavior happen ?

#

wooden sail Aug 1, 2022, 8:04 PM

#

lapis sequoia Will have to turn computer on for that. Are you onto something from that informa...

just letting oyu know that if it's not close to 50/50 that'll also affect your results

lapis sequoia Aug 1, 2022, 8:05 PM

#

Hmm. I should get out of bed and check then

#

https://c.tenor.com/fXSPEQ_Xi9UAAAAM/sleep-tired.gif

wooden sail Aug 1, 2022, 8:05 PM

#

i'm gonna go sleep, good luck with that

lapis sequoia Aug 1, 2022, 8:05 PM

#

Don't go

#

Wait 2 minutes

stuck schooner Aug 1, 2022, 8:06 PM

#

anyone :c ?

lapis sequoia Aug 1, 2022, 8:06 PM

#

I will uncover

#

Then we will sleep together

#

Uh oh

#

Doesn't sound right

#

@wooden sail

wooden sail Aug 1, 2022, 8:08 PM

#

that'll do a number on you

lapis sequoia Aug 1, 2022, 8:09 PM

#

What's the solution

wooden sail Aug 1, 2022, 8:09 PM

#

you can read up on imbalanced classification

steady basalt Aug 1, 2022, 8:19 PM

#

lapis sequoia It was 70-30 split

Try 80/20

lapis sequoia Aug 1, 2022, 8:19 PM

#

And check what?

steady basalt Aug 1, 2022, 8:20 PM

#

Results

#

80/20 is the standard these days

#

Unless u don’t have much data

#

So results look biased

steady basalt Aug 1, 2022, 9:14 PM

#

@wooden sail normalising test set to 0-1 like train boosted ac by 20%.

#

how?

ocean swallow Aug 1, 2022, 9:14 PM

#

iron basalt See what each generates, check grammar, and decide.

okay thanks for the ton of hlep but i couldn't see myself actually coming up with a concrete idea. NLTK has generation tool builtin from grammar rules but I actually couldn't find english grammar rule set. Not even a very simple one. lol

iron basalt Aug 1, 2022, 9:16 PM

#

ocean swallow okay thanks for the ton of hlep but i couldn't see myself actually coming up wit...

Yea, IDK if you will even find any nice pre-made tools for this. It will take some work.

#

And finding some grammar rules that are in a nice format is not as common as one would think.

mild dirge Aug 1, 2022, 9:27 PM

#

steady basalt <@467435887236612106> normalising test set to 0-1 like train boosted ac by 20%.

Was your training set normalized?

#

Because the model expects your data to be in the same format as your training data

#

If you optimize your model for values between 0 and 1, then your model would be very incorrect if those values are scaled times 100 f.e.

#

So you should process your test set the same as you do for your training set

#

Where normalization should be based on the min and max of the training set

#

@steady basalt

proper ingot Aug 1, 2022, 9:53 PM

#

Hey does anyone provide tutoring here? I need help specifically with pandas

steady basalt Aug 1, 2022, 10:02 PM

#

mild dirge Where normalization should be based on the min and max of the **training set**

Oh...

#

fucks sakes, i normalised test set on itself

#

but isnt that still fair?

#

because you get new data from a source, normalise it on its own values and predict it?

mild dirge Aug 1, 2022, 10:04 PM

#

No

steady basalt Aug 1, 2022, 10:04 PM

#

im normalising test set on itself and then training the model on the trainig set and predicting test y values

mild dirge Aug 1, 2022, 10:04 PM

#

Because let's say you want to predict 1 sample

#

how do you normalize 1 sample

steady basalt Aug 1, 2022, 10:04 PM

#

good point

mild dirge Aug 1, 2022, 10:05 PM

#

😛

steady basalt Aug 1, 2022, 10:05 PM

#

gona have to redo it

mild dirge Aug 1, 2022, 10:05 PM

#

And the model is based on the scaling and offset of the training data

#

So you need to use the same scaling

steady agate Aug 1, 2022, 10:05 PM

#

i am making a wordle bot can somebody help me

steady basalt Aug 1, 2022, 10:05 PM

#

omg i hope it doenst kill my score again

mild dirge Aug 1, 2022, 10:05 PM

#

But if your test set and training set have approximately the same min and max, it probably won't matter a lot in this case

steady basalt Aug 1, 2022, 10:05 PM

#

X_train['age'] = scaler.fit_transform(X_train[['age']])```

#

X_test['age']

mild dirge Aug 1, 2022, 10:06 PM

#

Not sure how that scaler works, but there probably is a .fit() method by itself

steady basalt Aug 1, 2022, 10:06 PM

#

we do scaler.transform only?

#

for test

mild dirge Aug 1, 2022, 10:06 PM

#

so fit on the training data, and then transform both the train and test

mild dirge Aug 1, 2022, 10:06 PM

#

steady basalt we do scaler.transform only?

yes

steady basalt Aug 1, 2022, 10:07 PM

#

so is it this

#

test[col] = scaler.fit_transform[train[col]]

#

so the exact same as before, just declaring test

mild dirge Aug 1, 2022, 10:07 PM

#

you need to transform the test data

steady basalt Aug 1, 2022, 10:08 PM

#

X_train['age'] = scaler.fit_transform(X_train[['age']]) and X_test['age'] = scaler.fit_transform(X_train[['age']])

#

is this what u meant

#

X_test['age'] = scaler.fit_transform(X_test[['age']]) i was errenously doing this

mild dirge Aug 1, 2022, 10:09 PM

#

norm_train_data = scaler.fit_transform(train_data)
norm_test_data = scaler.transform(test_data)

steady basalt Aug 1, 2022, 10:09 PM

#

o hi see

#

this inst going to work

#

with how ive coded it

#

as i coded all the trains first with fit transform

#

so id have to do them one at a time train and test

mild dirge Aug 1, 2022, 10:10 PM

#

So you have

norm_train = (train_data - min_train) / (max_train - min_train)
norm_test = (test_data - min_train) / (max_train - min_train)

#

What do you mean "all" the trains?

steady basalt Aug 1, 2022, 10:11 PM

#

X_test['age'] = scaler.transform(X_test[['age']])
X_train['bmi'] = scaler.fit_transform(X_train[['bmi']])
X_test['bmi'] = scaler.transform(X_test[['bmi']])```

#

there i made them intermittant

mild dirge Aug 1, 2022, 10:11 PM

#

Yeah that's it, maybe you can do it with multiple columns at once

#

Seems like it from the docs

steady basalt Aug 1, 2022, 10:12 PM

#

interestingly my model tuned to max out AUCROC is garbage compared to random params

#

when it does so it lowers recall for important class

#

so is it best to try to optimise accuracy for this issue

#

?

mild dirge Aug 1, 2022, 10:13 PM

#

What you try to optimzie depends on what you want out of this model

#

If accuracy is important to you, use a loss function that maximizes it

steady basalt Aug 1, 2022, 10:14 PM

#

I mean its tabular data so im using random forest and xgb

#

so its a grid search parameter

#

default is accuracy but all i care about is making sure recall is decent for both classes

#

and optimising roc ruins that

mild dirge Aug 1, 2022, 10:15 PM

#

So maybe instead of accuracy use the sum of the recall of both classes

steady basalt Aug 1, 2022, 10:15 PM

#

it makes great precision and good recall for 1 class but not the other

#

idk how to do that, does halvinggridsearch allow custom ones?

mild dirge Aug 1, 2022, 10:16 PM

#

Not sure, I don't use sklearn that often

steady basalt Aug 1, 2022, 10:16 PM

#

lets say I prefer accuracy of 65, recall from both classes at 65 and roc of 70 than 70,20,65 and 80 respectively

#

@mild dirge about to find out how much acc ill lose by reducing this bias

#

any experts in classification metrics here?

#

kappa score etc

steady basalt Aug 1, 2022, 11:49 PM

#

mild dirge Not sure, I don't use sklearn that often

after doing as you said i suddenly get 0.98 auc wtf?

#

but thats only with a certain set of parameters i chose, otherwise its 0.82

#

what on earth

steady basalt Aug 2, 2022, 12:10 AM

#

This can be due to oversamlljng before splitting?

river escarp Aug 2, 2022, 12:20 AM

#

I need to make a model which predicts whether a product will go out of stock based on historical sales data. What would be a unique/interesting way to implement this?

serene scaffold Aug 2, 2022, 12:51 AM

#

river escarp I need to make a model which predicts whether a product will go out of stock bas...

do you know about time series forecasting?

timid narwhal Aug 2, 2022, 1:56 AM

#

has anyone done extensive work with geocoding such as using the Photon geocode in order to get information about addresses?

harsh crow Aug 2, 2022, 2:05 AM

#

Hello, do you know any library that I can use to make a similar graphic as in the picture?

serene scaffold Aug 2, 2022, 2:08 AM

#

harsh crow Hello, do you know any library that I can use to make a similar graphic as in th...

what do the dots mean? how is this different from finding a plane outline and putting dots on it in MS paint?

harsh crow Aug 2, 2022, 2:17 AM

#

serene scaffold what do the dots mean? how is this different from finding a plane outline and pu...

The dots means the hits that the plane had. This figure is from the survival bias from a case in the WII.

quaint leaf Aug 2, 2022, 3:58 AM

#

harsh crow The dots means the hits that the plane had. This figure is from the survival bia...

you could combine matplotlib and opencv for example
https://www.researchgate.net/post/Can_I_plot_some_points_over_an_image

final aurora Aug 2, 2022, 4:02 AM

#

Guys Im a mechanical engineering student taking AI as an additional course and its in Python. If there are lists or information i have to plug in, Ill look for them and provide them. I just have no basic in python so im asking for help for the following questions:

#

def survivorSelection(population, eliteSize):

# Replace the dummy survival selection function below with  
# either Fitness Based Selection or Merge, Sort & Truncate.
  

elites = []

# Replacement starts here  #this is dummy/ Use the merge, sort, truncate and take the elite size list from the top
def merge (population,eliteSize):


for i in range(eliteSize):
    elites.append(population[i])
# Replacement ends here

return elites

#

I need to change the code between #replacement starts here and #replacement ends here into a merge,sort and truncate python code

#And for the 2nd question: Performance Evaluation. You will present performance evaluation for the different options created in this lab, either: a) Fitness function; or b) Parent Selection function.

filename = 'cities8.txt'
popSize = 20
eliteSize = 5
mutationProbability = 0.01
iteration_limit = 100

cityList = genCityList(filename)

population = initialPopulation(popSize, cityList)
distances = [Fitness(p).routeDistance() for p in population]
min_dist = min(distances)
print("Best distance for initial population: " + str(min_dist))

for i in range(iteration_limit):
population = oneGeneration(population, eliteSize, mutationProbability)
distances = [Fitness(p).routeDistance() for p in population]
index = np.argmin(distances)
best_route = population[index]
min_dist = min(distances)
print("Best distance for population in iteration " + str(i) +
": " + str(min_dist))

print("Optimal path is " + str(best_route))

# TO DO (10 marks) - Performance Evaluation. You will present the performance achieved 
# by different options created in this lab. You can choose to investigate either
# a) Fitness function; or b) Parent Selection function. For fitness function, you compare 
# the performance achieved by Fitness function 1 (Simple division) and 
# Fitness function 2 (Maximum difference). For parent selection function, you compare the 
# performance achieved by Random Selection, Tournament Selection, and Proportional Selection.

wooden sail Aug 2, 2022, 4:06 AM

#

steady basalt <@467435887236612106> normalising test set to 0-1 like train boosted ac by 20%.

yozu can scroll up to yesterday's explanation. for now it's most important you keep in mind normalization is done before test/train splitting

final aurora Aug 2, 2022, 4:22 AM

#

final aurora Guys Im a mechanical engineering student taking AI as an additional course and i...

These 2 different questions are part of 1 big Genetic Algorithm code

#

its under the travelling salesman problem

#

my group just split up the work in messy orders but we were tasked to just complete our individual parts

woven coral Aug 2, 2022, 8:15 AM

#

#

how to fix this???

#

anyone knows???

stuck schooner Aug 2, 2022, 8:45 AM

#

Hello, what would be a proper way to merge two series together while creating a clue for plotting purpose (ie. a boolean column with 0 for the first series, 1 for the second) ?

unique flame Aug 2, 2022, 9:12 AM

#

I would create a dataframe for the first series with columns [name, bool] and also a second dataframe, for the second series, with [name, bool] and than merge on "name"

steady basalt Aug 2, 2022, 9:50 AM

#

wooden sail yozu can scroll up to yesterday's explanation. for now it's most important you k...

according to pccamel ur not meant to do this

#

ive changed my project such that x_test columns are transformed based on the fit of x_train

#

additionally, im going to have to go back and oversample only only training data after splitting,

untold bloom Aug 2, 2022, 9:51 AM

#

stuck schooner Hello, what would be a proper way to merge two series together while creating a ...

hi, can pass indicator=True to pd.merge which will put a new column named "_merge" in the result indicating from which series the data come from: "left_only", "right_only" or "both" are the values. If you pass a string instead of True to indicator, the default name "_merge" will be replaced with that

#

In [35]: s_1
Out[35]:
0    1
1    2
2    3
3    2
Name: month, dtype: int64

In [36]: s_2
Out[36]:
3    A
4    D
5    Z
Name: item, dtype: object

In [37]: pd.merge(s_1, s_2, how="outer", left_index=True, right_index=True, indicator="flag")
Out[37]:
   month item        flag
0    1.0  NaN   left_only
1    2.0  NaN   left_only
2    3.0  NaN   left_only
3    2.0    A        both
4    NaN    D  right_only
5    NaN    Z  right_only

#

you can map the resultant column values to 0, 1...

stuck schooner Aug 2, 2022, 9:52 AM

#

But pd.merge doesn't take only Series as argument

untold bloom Aug 2, 2022, 9:53 AM

#

then how did the above code work? :p

#

it's a lie :p

stuck schooner Aug 2, 2022, 9:53 AM

#

Oh

#

You are right ahah

#

I thought I looked at it

#

Thank you then !

wooden sail Aug 2, 2022, 10:35 AM

#

steady basalt according to pccamel ur not meant to do this

the explanation camel gave you was the same i did, except he went on to explain in more in detail. the main idea is to do the same scaling to all the data

wraith goblet Aug 2, 2022, 10:41 AM

#

versed gulch ```py arr = sparse.DOK((grid_size, grid_size, 242, 512, 512), dtype=np.uint8) #...

PLEASE

#

tell me what font and what theme you are using in what text editor or IDE

stuck schooner Aug 2, 2022, 10:56 AM

#

wraith goblet tell me what font and what theme you are using in what text editor or IDE

comic sans ms, clown theme is nice

wraith goblet Aug 2, 2022, 10:57 AM

#

this is for sure not comic sans

#

thanks though

stuck schooner Aug 2, 2022, 10:57 AM

#

no regular just light theme and default vs code font

wraith goblet Aug 2, 2022, 10:58 AM

#

thx

steady basalt Aug 2, 2022, 11:06 AM

#

wooden sail the explanation camel gave you was the same i did, except he went on to explain ...

I scaled testing data on the FIT of the training data, rather than all together

#

that was his point, so that if you have a single testing sample to predict you can scale it on trainig data

wooden sail Aug 2, 2022, 11:08 AM

#

that makes no difference, as long as you apply the same scaling

steady basalt Aug 2, 2022, 11:09 AM

#

wooden sail that makes no difference, as long as you apply the same scaling

also, do u know whats the best metric to optimise for making sure neither class has low recall or precision

wooden sail Aug 2, 2022, 11:09 AM

#

the difference is that in practice you don't have as many samples to test on, and they may not be know ahead of time. other than this, the takeaway is just apply the same scaling to everything

steady basalt Aug 2, 2022, 11:09 AM

#

rather than getting massively high auc but just from one class

#

such as 0.06 for one and 0.98 for anoither

#

even after rebalancing

wooden sail Aug 2, 2022, 11:10 AM

#

probably a wasserstein-like metric that considers support/cardinality mismatch

steady basalt Aug 2, 2022, 11:10 AM

#

does sklearn have

wooden sail Aug 2, 2022, 11:10 AM

#

probably not

steady basalt Aug 2, 2022, 11:10 AM

#


           0       0.97      0.95      0.96    145304
           1       0.07      0.10      0.08      5420

    accuracy                           0.92    150724
   macro avg       0.52      0.53      0.52    150724
weighted avg       0.93      0.92      0.93    150724

#

this is unnacceptable

#

id take a 20% loss in accuracy to get class1 over 0.6 recall

#

can anyone help with that?

#

my test data is not balanced but trainig has been balanced

#

and yet its still guessing 0 for all

versed gulch Aug 2, 2022, 12:19 PM

#

Hi is there a way I can project my 2D flat surface back to a 3D surface (i.e. a sphere) in Python as the image originally in real life was like 3D spherical structure but was then flattened when taking the image?

wooden sail Aug 2, 2022, 12:20 PM

#

if you have only a single 2D surface and no extra info, no

#

if you have several 2D images from different angles, you can use epipolar geometry for this. and if you stored all of the coordinates on the sphere that the 2D image corresponds to, it's easy

versed gulch Aug 2, 2022, 12:21 PM

#

wooden sail if you have only a single 2D surface and no extra info, no

extra info as in the approximate radius?

wooden sail Aug 2, 2022, 12:22 PM

#

radius, distance from the sphere to the 2D imaging plane, and which pixels correspond to which part of the sphere

versed gulch Aug 2, 2022, 12:25 PM

#

wooden sail if you have several 2D images from different angles, you can use epipolar geomet...

I have several but all at the same angle as the specimen in question was flattened before imaged (taken tile by tile)

#

so I stitched it all together to make one big 2d Image (these images were originally 3D but i took their Maximum intensity projections)

wooden sail Aug 2, 2022, 12:27 PM

#

there's no easy way to undo that kind of projection. you'd have to look in the original data to see which coordinates contained the max and use that info

#

this image probably won't be very useful though

#

it'll be sparse with some points scattered here and here, and 0 everywhere else

versed gulch Aug 2, 2022, 12:28 PM

#

wooden sail there's no easy way to undo that kind of projection. you'd have to look in the o...

I know how they're arranged as I've got their coordinates but this is obvious when it was flattened

wooden sail Aug 2, 2022, 12:29 PM

#

then what's the question?

versed gulch Aug 2, 2022, 12:29 PM

#

to project it onto a 3D spherical surface

wooden sail Aug 2, 2022, 12:30 PM

#

you have the coordinates

versed gulch Aug 2, 2022, 12:30 PM

#

yes but obviously when joining them it makes a rectangular shape

wooden sail Aug 2, 2022, 12:31 PM

#

what do you mean by "joining them"

versed gulch Aug 2, 2022, 12:33 PM

#

so my images are 3D tiles and what I did to each tile was take the maximum intensity projection of them making them 2D. after this using the coordinates of how they're arranged i glued the tiles together to make 1 large 2D image

#

using hstack and vstack

wooden sail Aug 2, 2022, 12:34 PM

#

mhm

versed gulch Aug 2, 2022, 12:34 PM

#

so now i got this 2d large image and want to project it onto a 3D sphere-like (hemishphere etc) shape, and I'm wondering how to do that?

wooden sail Aug 2, 2022, 12:38 PM

#

ok, if all you want is to map the plane to a sphere, regardless of where the samples came from, you want to follow the same procedure of a stereographic projection. i think the common python libraries only do it in the opposite direction (the usual direction), mapping a sphere to a plane (a hemisphere to a disc)

#

alternatively you can use a vertical projection if your images can be considered to be captures from "infinitely far away" from the sphere

#

i really don't know any libs that do this automatically, but it's not so difficult to do by hand

#

here's a nice stackoverflow on it https://math.stackexchange.com/questions/1729012/mapping-the-unit-disc-to-the-hemisphere

#

and here on wiki https://en.wikipedia.org/wiki/Stereographic_projection

#

the bottom one here

#

if you have numpy-like arrays, it should be doable in a couple of broadcasted operations

north condor Aug 2, 2022, 12:43 PM

#

I know this question is dumb, but anyways

#

I have this data

#

#

from did a moving average to get

#

#

when applying a SARIMA model, should I apply it to the first or second?

wooden sail Aug 2, 2022, 12:51 PM

#

that depends on whether you know if the high frequency components you filtered out have any useful info

#

if you know anything about the spectral content of the target signal, that'd be your answer

north condor Aug 2, 2022, 12:52 PM

#

It's sales data so I doubt it. Assuming that the average across the window is a good representation of each of the individual days.

wooden sail Aug 2, 2022, 12:53 PM

#

then the second should be fine as long as the averaging window was reasonable

north condor Aug 2, 2022, 12:53 PM

#

30 days

wooden sail Aug 2, 2022, 12:53 PM

#

do you know what a fourier transform is?

north condor Aug 2, 2022, 12:54 PM

#

Yeah, why?

wooden sail Aug 2, 2022, 12:55 PM

#

try looking at the amplitude or power spectral density of the first plot. if the noise is white~ish in distribution, we should see the spectrum of the signal of interest + a rather constant offset added to all of the spectrum. you can use this info to decide what a reasonable window length for filtering is

#

you can also plot the ASD or PSD of the filtered signal and see if you're happy with how it came out compared to the original

north condor Aug 2, 2022, 12:58 PM

#

Oh cool, thanks

frigid creek Aug 2, 2022, 1:00 PM

#

hi, umm can anybody tell me what is average precision and what precision and recall has to do with it like a 5 y/o? thx

north condor Aug 2, 2022, 1:07 PM

#

wooden sail try looking at the amplitude or power spectral density of the first plot. if the...

And the moving average part of the SARIMA model will not interfere with the rolling average I've already done (assuming no useful info has been filtered out)?

wooden sail Aug 2, 2022, 1:12 PM

#

it shouldn't if the noise is uncorrelated to the quantity you are trying to fit with the (S)ARIMA model

north condor Aug 2, 2022, 1:13 PM

#

Ok, thanks

wooden sail Aug 2, 2022, 1:13 PM

#

the assumption is that the error between the fit of the previous samples and the model for those samples is a random quantity that has some direct predictive power over a given window of time

#

that's only true if the error is related to the model, it need not necessarily be the case

#

but do try both and see. doing a moving average filter can be incorporated explicitly into the (S)ARIMA model, if you wanna look at that

steady basalt Aug 2, 2022, 1:40 PM

#

is it macro f1 that will balance both clases scores?

#

might use thatr as my scorer

#

to stop one sided high scores

serene scaffold Aug 2, 2022, 1:41 PM

#

steady basalt is it macro f1 that will balance both clases scores?

macro f1 takes the average for each class. whereas micro f1 is the average of all the instances.

mild dirge Aug 2, 2022, 1:41 PM

#

macro treats every class as equal basically

steady basalt Aug 2, 2022, 2:10 PM

#

serene scaffold macro f1 takes the average for each class. whereas micro f1 is the average of al...

so whats best in fixing my problem?

#

#

i nee recall for 1 to be high too

#

im willing to sacrifice other stuff

serene scaffold Aug 2, 2022, 2:18 PM

#

steady basalt so whats best in fixing my problem?

I'm not caught up on what your model is intended to do, or what the training data is, etc.

steady basalt Aug 2, 2022, 2:33 PM

#

its from a databank of peoples info and medical stuffs

#

i want high recall for 1, which i oversampled

#

but its rly bad

north condor Aug 2, 2022, 3:04 PM

#

Does anyone know have to fix this error?

#

serene scaffold Aug 2, 2022, 3:06 PM

#

north condor Does anyone know have to fix this error?

shouldn't you be passing something to fit?

north condor Aug 2, 2022, 3:09 PM

#

What do you mean?

#

Here is the example I'm follwing

#

#

The data is passed into the model ARIMA(train_data, order=(7,1,2))

serene scaffold Aug 2, 2022, 3:10 PM

#

what is type(model)? by the way, I won't look at any more screenshots--please use text.

#

and if it's just SARIMAX, where can I find the docs for that?

north condor Aug 2, 2022, 3:11 PM

#

Yeah

#

type(model) = statsmodels.tsa.arima.model.ARIMA

#

https://www.statsmodels.org/devel/generated/statsmodels.tsa.arima.model.ARIMA.html

#

I tried passing in freq="D" and missing="drop" but that didn't work and the implied frequency should be correct already.

steady basalt Aug 2, 2022, 3:17 PM

#

anyone know a good balanced metrics scorer?

#

https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

scikit-learn

3.3. Metrics and scoring: quantifying the quality of predictions

There are 3 different APIs for evaluating the quality of a model’s predictions: Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they ...

untold bloom Aug 2, 2022, 3:48 PM

#

@north condor That's not an error but a warning saying that the underlying procedure did not converge. SARIMAX of statsmodels uses maximum likelihood to figure out the (S)AR/MA parameters. It's an iterative process (e.g., using L-BFGS), so there's a max_iter parameter controlling it. It turned out that although max_iter was reached, maximum likelihood value (or around) was not, hence the ConvergenceWarning.

#

you can pass maxiter to .fit.

amber swan Aug 2, 2022, 3:51 PM

#

hi, got a time series question regarding feature engineering due to a course im following on it (and the lecturer is on vacation 😅)
when creating features such as lag-features, moving averages, ... what is the best practice to determine the ideal value for my dataset (here it is btw: https://www.kaggle.com/datasets/robervalt/sunspots) note: this dataset has seasonality that reoccurs every 11 years (every 132 datapoints)
from my understanding starting with a lag feature based on the seasonality is a decent one to start with but what about others... are there any optimizers (hyperparameter tuning?) or stuff like moving average (which my lecturer said doesnt work that well for seasonal data which im not sure about) that i could take a look at? im quite new to all of this so if im incorrect about any assumptions please correct me 🙂

north condor Aug 2, 2022, 3:52 PM

#

untold bloom you can pass `maxiter` to `.fit`.

Thanks

#

What would be a sensible value?

untold bloom Aug 2, 2022, 3:53 PM

#

i wouldn't know; but i know 100_000_000 isn't, and 10 isn't :p

#

1000 maybe?

#

if your data size isn't very large, you can try & see perhaps; otherwise i don't have a strong heuristic in general

north condor Aug 2, 2022, 3:54 PM

#

Ok

wooden sail Aug 2, 2022, 3:55 PM

#

that depends very strongly on the method it's using. if it's really L-BFGS, it depends on how well conditioned the estimate of the hessian is at every iteration, so a good thing to play with is the initial hessian

#

shouldn't be all that difficult to compute the fisher information matrix, for instance

steady basalt Aug 2, 2022, 5:05 PM

#

wow, changing test train split to 90:10 instead of 70:30 literally made my results actually viable

#

0 435802
1 435802

#

training

#

0 48435
1 1807 testing

odd marsh Aug 2, 2022, 5:10 PM

#

Running model.summary() to see the layers of my model drain all my 16GB of ram, freeze VS Code and destroy Jupyter, any help?

#

Also this happens

compact valley Aug 2, 2022, 5:16 PM

#

guys, has anyone finished projects on dataquest.io?
I am about to start some and I was wondering If I can have a link to share with HR so I can apply to some Data positions with this projects

north condor Aug 2, 2022, 5:37 PM

#

If I have a pandas time series containing daily sales data, how can turn it into a pandas time series containing monthly sales data summing up all of the sales of the whole month?

desert oar Aug 2, 2022, 5:37 PM

#

north condor If I have a pandas time series containing daily sales data, how can turn it into...

use .resample

#

!d pandas.DataFrame.resample

arctic wedgeBOT Aug 2, 2022, 5:37 PM

#

pandas.DataFrame.resample


DataFrame.resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, loffset=None, base=None, on=None, level=None, origin='start_day', offset=None)```
Resample time-series data.

Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the `on`/`level` keyword parameter.

desert oar Aug 2, 2022, 5:37 PM

#

!d pandas.Series.resample

arctic wedgeBOT Aug 2, 2022, 5:37 PM

#

pandas.Series.resample


Series.resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, loffset=None, base=None, on=None, level=None, origin='start_day', offset=None)```
Resample time-series data.

Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the `on`/`level` keyword parameter.

desert oar Aug 2, 2022, 5:37 PM

#

it's like groupby but for time ranges

north condor Aug 2, 2022, 5:38 PM

#

Perfect, tysm

steady basalt Aug 2, 2022, 5:38 PM

#

anyone know why my accuracy goes from 56% to 70% when changing from 80/20 to 90/10 traintest split

#

is this due to the variance and bad data

#

if so, is it cheating to just go with the 90/10

#

ftr, i have 500k samples, of which 18k are the target class, I oversampled thje training data too, so the final test data was left with 40k/1.8k class balance

desert oar Aug 2, 2022, 5:40 PM

#

any number of reasons. natural variance in the data (remember, splitting should be random) is a likely suspect. i suggest doing the same exact training process for several train/test splits (maybe 5-10) to see how much the accuracy varies solely due to randomness in splitting

#

there are also results showing that cross-validation can perform better than repeated train/test splits in terms of needing fewer iterations to get the same amount of variation coverage

steady basalt Aug 2, 2022, 5:40 PM

#

yeah time to thoroughly investiage this... do you mean with a different random seed?

#

or different split values

desert oar Aug 2, 2022, 5:41 PM

#

same seed, do 10 different splits, run the same exact process on each split

steady basalt Aug 2, 2022, 5:41 PM

#

can you explain what you mean by cross validation in this case?

desert oar Aug 2, 2022, 5:41 PM

#

steady basalt can you explain what you mean by cross validation in this case?

cross-validation is a specific technique

steady basalt Aug 2, 2022, 5:41 PM

#

Yeah ik what it is

desert oar Aug 2, 2022, 5:41 PM

#

https://en.wikipedia.org/wiki/Cross-validation_(statistics)

Cross-validation (statistics)

Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set.
Cross-validation is a resampling method that uses different portions of the data to test and train a model on di...

steady basalt Aug 2, 2022, 5:41 PM

#

are you saying to stick with the well performing test train split?

#

and do cv?

#

i didnt do any cv on my test set

desert oar Aug 2, 2022, 5:41 PM

#

i am saying that in the future you might want to use cross val instead of a train/test split as long as you have enough data

#

oh, i see

steady basalt Aug 2, 2022, 5:42 PM

#

I dont really understand what u mean by that

#

how can you replace test train split by cv?

desert oar Aug 2, 2022, 5:42 PM

#

you have two completely separate datasets?

steady basalt Aug 2, 2022, 5:42 PM

#

arent they meant to be used in tandem? youd always want a test holdout

desert oar Aug 2, 2022, 5:42 PM

#

it's rare in real world scenarios that you have a "train set" and "test set" -- usually you have a big pile of data and you need to figure out what to do with it

steady basalt Aug 2, 2022, 5:42 PM

#

I have 1 set

desert oar Aug 2, 2022, 5:42 PM

#

so often you construct your own train/test split, and if you have enough data in the train set you can do further splitting

steady basalt Aug 2, 2022, 5:42 PM

#

so is it enough to skip splitting it and just to 10fold cv instead to get a more reliable understanding

#

and take the average scores

desert oar Aug 2, 2022, 5:43 PM

#

if this is a true holdout set then no i don't recommend re-splitting 10 times. you'll burn all of your holdout data

steady basalt Aug 2, 2022, 5:43 PM

#

doing as you advice, id skip holdout entirely

desert oar Aug 2, 2022, 5:43 PM

#

no, you definitely do want one

#

but i do suggest doing cross val on the training set, or repeated train/test splits

#

it depends on how much data you have and how nervous you are about missing sections of it

steady basalt Aug 2, 2022, 5:44 PM

#

quite nervous

#

its bad data but big

#

and imbalanced

desert oar Aug 2, 2022, 5:44 PM

#

i see the numbers you posted above. yeah that's pretty typical. it's also hard to know if 56 -> 70 is within reason or if it's a sign of a problem

steady basalt Aug 2, 2022, 5:44 PM

#

so as you advice: test train split 80/20 or 90/10 and on the training set do cv to get additional insight, while also doing analysis just on the test set too?

desert oar Aug 2, 2022, 5:45 PM

#

i would avoid doing analysis on the test set for now, and do the cv or repeated-splitting analysis on the train set only. that way you can get a general sense of how much variation you can expect just from splitting

steady basalt Aug 2, 2022, 5:45 PM

#

the shocking class1 scores?

#

i need it fixed seriously, even if it means sacrifices

#

in the final results, do i use the training cv or test results?

desert oar Aug 2, 2022, 5:46 PM

#

in the final results you'd re-fit on the whole training set and then test on the test set

steady basalt Aug 2, 2022, 5:46 PM

#

I did that already is what im saying, i fit my model to all traiing set

desert oar Aug 2, 2022, 5:46 PM

#

right. but now you have reason to question your process (unexpectedly big accuracy increase), and you need to gather more info

#

so instead of burning up your test data to do so, i'm suggesting going back to just the training set for that analysis

steady basalt Aug 2, 2022, 5:47 PM

#

okay... good idea i guess

desert oar Aug 2, 2022, 5:47 PM

#

i mean, you could do it on the full dataset too. that's fine if you're really disciplined and you aren't tempted to tune hyperparameters based on the results

steady basalt Aug 2, 2022, 5:47 PM

#

no im not cheating

#

this is a serious project

desert oar Aug 2, 2022, 5:48 PM

#

so if you do the repeated train/test split analysis on the full dataset, that will definitely give a better result. but then you're risking information leakage via your brain.

steady basalt Aug 2, 2022, 5:48 PM

#

funnily, im getting 96% scores for my gridsearch best results, weird

#

thats 3cv

#

thisis data science channel

desert oar Aug 2, 2022, 5:48 PM

#

oh ok so you're already doing 3-fold cv on the train set to tune parameters

steady basalt Aug 2, 2022, 5:48 PM

#

right, but thats not really anything to do with analysing why its so sensitive to differnet splits

lapis sequoia Aug 2, 2022, 5:49 PM

#

steady basalt thisis data science channel

ai

steady basalt Aug 2, 2022, 5:49 PM

#

yeah thats ot an ai task

lapis sequoia Aug 2, 2022, 5:49 PM

#

eh

steady basalt Aug 2, 2022, 5:49 PM

#

thaqt isnt really ai

lapis sequoia Aug 2, 2022, 5:49 PM

#

sure i'll use another channel

desert oar Aug 2, 2022, 5:49 PM

#

lapis sequoia ai

i'm not sure this kind of task is considered ai. check out #user-interfaces perhaps, or #algos-and-data-structs or even even #game-development. object collision detection itself isn't an AI task

lapis sequoia Aug 2, 2022, 5:49 PM

#

thanks :D

steady basalt Aug 2, 2022, 5:50 PM

#

@desert oar I mean no1 wud notice if i just didnt mention anything and went with the 90/10 split and called it a day with good scores...

desert oar Aug 2, 2022, 5:50 PM

#

heh that's always the danger 🙂

#

i think you're probably ok doing repeated (~10?) train/test splits on the entire dataset but just to see how much variation you get when splitting.

steady basalt Aug 2, 2022, 5:50 PM

#

I think its likely the issue is the more useful target class samples were left out of the training set in 8/2 split

desert oar Aug 2, 2022, 5:50 PM

#

that's possible. make sure you are doing a stratified split

#

oversample first, then do a stratified split

steady basalt Aug 2, 2022, 5:50 PM

#

oh no

desert oar Aug 2, 2022, 5:50 PM

#

the other risk with splitting in general is splitting up rare values of features

steady basalt Aug 2, 2022, 5:51 PM

#

i oversampled on my training data only, as per the manual

desert oar Aug 2, 2022, 5:51 PM

#

ok, that's good too

#

but keep in mind that accuracy isn't a great metric to use on unbalanced data

steady basalt Aug 2, 2022, 5:51 PM

#

ive basically all but eliminated my bias mistakes

desert oar Aug 2, 2022, 5:51 PM

#

accuracy can be very sensitive to the class imbalance

steady basalt Aug 2, 2022, 5:51 PM

#

i used smote to boost from 18k to 330k class1 samples

#

i normalised based on training set

#

then i analyised based on test set

#

im not rly using acc

#

more so i care about having a good balance of BOTH classes recall/precision

desert oar Aug 2, 2022, 5:52 PM

#

okay, so you're using f1 score to evaluate then?

steady basalt Aug 2, 2022, 5:52 PM

#

basically the main issue was as seen earlier, i was getting 0.2 recall for positive class and still high roc

#

im just waiting until i get >0.6 recall AND precision for BOTHclasses

#

rather than getting 0.90 for 1 and 0.1 for another

#

make sense/

desert oar Aug 2, 2022, 5:53 PM

#

oof yeah. i think roc is sensitive to class imbalance too

steady basalt Aug 2, 2022, 5:53 PM

#

well it rebalanced w smote

#

training set

desert oar Aug 2, 2022, 5:53 PM

#

indeed that's no good. when doing grid search what scoring criterion are you using?

steady basalt Aug 2, 2022, 5:53 PM

#

test set is extremely unbalanced but that doenst mater

desert oar Aug 2, 2022, 5:53 PM

#

it sounds like you should be using f1

steady basalt Aug 2, 2022, 5:53 PM

#

ive tried

#

a lot ... including f1 and auroc

#

currently im doing auroc

#

score=(train=0.996, test=0.947)

#

my question is. why tf is this happenign in my search

#

when it was much lower for 80/20

desert oar Aug 2, 2022, 5:54 PM

#

i'd stick with f1 tbh. auroc is a bit funky https://stats.stackexchange.com/a/468497/36229

Cross Validated

AUC and class imbalance in training/test dataset

I just start to learn the Area under the ROC curve (AUC). I am told that AUC is not reflected by data imbalance. I think it means that AUC is insensitive to imbalance in test data, rather than imba...

steady basalt Aug 2, 2022, 5:55 PM

#

I think if anything this is a good sign because its training data, and getting the best out of that is all that matters, surely its not cheating to do whatever boosts it

#

if that means 99/1 split so be it

desert oar Aug 2, 2022, 5:55 PM

#

well that's my point: you don't know what's causing it. likely suspects include incorrect stratification of classes during splitting (you probably ruled this out), or rare values of features that are also being split and maybe need to be stratified

steady basalt Aug 2, 2022, 5:56 PM

#

wdym by stratified in this context

desert oar Aug 2, 2022, 5:56 PM

#

enforcing that the distribution is similar on both sides of the split

steady basalt Aug 2, 2022, 5:56 PM

#

[CV 1/3; 1/2] END bootstrap=True, criterion=entropy, max_depth=None, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=1000, n_jobs=-1;, score=(train=0.999, test=0.968) total time=15.9min

#

this cant be natural

steady basalt Aug 2, 2022, 5:57 PM

#

desert oar enforcing that the distribution is similar on both sides of the split

isnt this cheating? i want test to be torally random

desert oar Aug 2, 2022, 6:00 PM

#

steady basalt isnt this cheating? i want test to be torally random

it's not cheating at all. it's ensuring that the test and train sets are both reflective of the underlying data

steady basalt Aug 2, 2022, 6:00 PM

#


           0       0.97      0.95      0.96     48435
           1       0.07      0.10      0.08      1807

    accuracy                           0.92     50242
   macro avg       0.52      0.53      0.52     50242
weighted avg       0.93      0.92      0.92     50242

#

it hapepend again dude

#

blanket guessing 1 class

desert oar Aug 2, 2022, 6:01 PM

#

i see. ok so that's an obvious problem there

#

there might be very few of the rare class in the test set