#data-science-and-ml

wooden sail
#

so what you see there as the cost of np where is the difference between np where and str contains

#

np where was almost instant, str contains was the bottleneck

#

do you mean to use regex here? if not, try passing regex=False to str contains

#

otherwise the comparison isn't very good

tidal bough
#

by using some dtype stuff and then numba, I managed to make it 2 and 8 times slower respectively 🥴

errant lake
wooden sail
#

can you show?

#

but anyway yeah, str contains is a super general function with regex, nan replacement, etc by default, so the comparison will be bad unless you turn all of that off

#

you're seeing the numpy and pandas tax: a fixed overhead to prepare for large data with weird properties

errant lake
#

^ that's np.where() with "in"; str.contains() doesn't have regex=False yet

#

clear winner lol

#

str.contains() takes the same time with regex=False

tidal bough
#

str.contains does have regex=False

wooden sail
#

can you show the code?

tidal bough
#

I don't believe this np.where result is right.

errant lake
#

Ah, let me inspect the df. I used:

np.where('char' in df['column'], 'thing', 'some')
tidal bough
#

'char' in df['column'] is just a single bool, not a per-row mask (and on a Series, "in" checks the index labels, not the values)

wooden sail
#

yeah i was afraid you'd do something like that

#

i can't think of a sensible way of using np where for this case

#

short of ==, you kinda have to use str contains or one of the other methods you listed to generate the bools
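
To make the point above concrete, here is a minimal sketch on a made-up four-row frame (the data and variable names are assumptions, not the benchmark being discussed). It contrasts the single bool that "in" returns with the per-row boolean masks that np.where actually needs:

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only.
df = pd.DataFrame({"column": ["a char here", "nothing", "char", "other"]})

# `in` on a Series tests the index labels and yields one bool, not a mask.
single = "char" in df["column"]

# Per-row masks: plain substring match (regex turned off) and a list comprehension.
mask_contains = df["column"].str.contains("char", regex=False).to_numpy()
mask_listcomp = np.array(["char" in s for s in df["column"]])

# np.where then picks a label per row from the mask.
labels = np.where(mask_contains, "thing", "some")
```

Both masks agree here; which one is faster depends on the data, so it is worth timing them on the real frame.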

errant lake
#

Ah yep, thanks for that 🤦‍♂️

wooden sail
#

if we do this, for example

```py
    # np.where
    start_time = time.time()
    np.where(df['column'] == 'char', 'thing', 'some')
    # index row and column in a single .loc call to avoid chained assignment
    results.loc['np.where', f'Test {i+1}'] = time.time() - start_time
```
we get this plot
#

which would kinda be the ideal case

errant lake
#

Nice thanks. I'll try to refine that to better match some use-cases

wooden sail
#

this also makes sense

```py
    # np.where
    start_time = time.time()
    np.where(df['column'].apply(lambda x: 'char' in x), 'thing', 'some')
    # index row and column in a single .loc call to avoid chained assignment
    results.loc['np.where', f'Test {i+1}'] = time.time() - start_time
```
and yields the plot:
tidal bough
#

I think I might have finally got an overengineered solution that's a bit faster

#
%%cython
import pandas as pd, numpy as np
cdef int cont(str a, str b):
    return a in b
def str_contains_cython(arr: np.ndarray, sub: str):
    n = len(arr)
    result = np.zeros(n,dtype=np.bool_)
    for i in range(n):
        result[i] = cont(sub, arr[i])
    return result
#

200ms on my set, ~10% faster than the listcomp

merry wadi
#

Is anyone familiar with building input output hidden Markovs?

tidal bough
#

a-ha, I got <100ms by cythoning it entirely:

cpdef void str_contains_cython(np.ndarray arr, str sub, np.ndarray res):
    n = len(arr)
    for i in range(n):
        res[i] = cont(sub, arr[i])
%timeit str_contains_cython(df["column"].values,"char", np.zeros(len(df["column"]),dtype=bool)) # 82ms
wooden sail
#

ah yes, python't

tidal bough
#

my least favorite part of cython is the C type syntax

#

why, oh god why

#

last time I tried using cython I realised it would be faster and less painful to rewrite it all in rust

wooden sail
#

that's probably not true either

tidal bough
#

that's what I ended up doing though :p

wooden sail
#

sunken cost fallacy

errant lake
#

I'm using a regular jupyter notebook, is there anything I need to install to run your function definitions in cython? I'm not familiar with it sorry

#

nvm ill read the docs x)

tidal bough
#

You'd need to install cython, then do %load_ext cython, and then a cell with %%cython should compile the contents as cython.

wooden sail
#

then you'll have to rewrite the python as cython 😛

errant lake
#

ah yeah... not convenient

tidal bough
#

like, cpdef means it's callable from python.

wooden sail
#

yeah but does that look like python? that was my point

tidal bough
#

ah, that's true. it's only useful in situations where you are okay with writing a weird function to speed up a process 2x

#

occasionally it's a lot more useful when you're doing something really weird and the weird function speeds it up dozens of times (e.g. calculating some complicated cumulative function over the df, so you need to iterate)

errant lake
#

Thanks, I'll try to use that. I've seen worse boilerplate for the time gain.

brave sand
#

has anyone used a jetson nano for object detection?

somber pollen
#

also if you don't want to do Cython but are dealing with relatively common types, you can use Numba

#

YMMV tho

hasty mountain
#

How common is it to try to implement an algorithm from a paper and have it not work?
I feel like every image generation algorithm I try to implement solely based on the original paper ends up failing...then I have to make adaptations until it works grumpchib

#

Except for DCGAN. DCGAN is cool joe_salute

#

But those papers that use MSE Loss for Variational AutoEncoders...I really don't get their trick. My VAEs only work with Gaussian Log Likelihood

serene scaffold
#

But it's probably a more common problem than we'd like to think

#

I've only implemented the algorithm from a paper once, and it performed well on my data, which was different.

hasty mountain
#

Aw, that's sad...
I wouldn't want to reproduce the algorithm using the exact same data, though(most VAEs, GANs and Diffusion Models papers use CelebA dataset, which I find meh).

#

But I suppose that changing the data from CelebA to a CIFAR100 might be catastrophic...which is also sad, since a good model should be able to overcome such things...I guess... pithink

iron basalt
#

(And huge waste of my time)

hasty mountain
#

Yeah, I've wasted quite some time in GANs because of that...
Strange... I thought math was supposed to be exact sciences yert

iron basalt
serene scaffold
#

and python is not perl.

hasty mountain
iron basalt
#
Math - You know what you are doing because you made up the rules.
Engineering - You know what you are doing and are trying to figure out the best way to do it.
Science - You don't know what is going on and are trying to figure it out.
#

ML tends to bounce around all three.

hasty mountain
#

Now it makes sense

serene scaffold
#

Which one is "You don't know what is going on, but you're happy to be here"?

agile cobalt
serene scaffold
# iron basalt Art?

I thought some of the most profound and impactful art is made by the depressed and mentally ill

iron basalt
serene scaffold
#

!otn a squiggle the philosopher

arctic wedgeBOT
#

:ok_hand: Added squiggle-the-philosopher to the names list.

potent sky
potent sky
sterile heath
past meteor
#

If for my research I'd implement everything that looks remotely good I'd be busy forever. Half of the implementations are in Matlab 💀💀💀, way more than half doesn't share code/data and all of them seem to crush the existing benchmarks

tulip wyvern
#

Can somebody explain why my test accuracy is so bad (8.7%) and my loss isn't decreasing at all

#

I get that I should be using a CNN (idk how to implement CNNs though) but this test accuracy is just crazily bad

tulip wyvern
#

one sec let me make it public

#

okay its public now my bad

agile cobalt
#

0.5 in everything?

tulip wyvern
#

yea

#

was that a bad play

agile cobalt
#

0.5 dropout as well?

#

also not 100% sure if it is fine or not to reuse the same layer, it would probably give a warning if it was a problem though

tulip wyvern
#

yea i was just doing the medium

#

i dont think the dropout helps me anyways because im far from overfitting 💀

agile cobalt
#

you are dropping way too much

tulip wyvern
#

okay i think ill lower it to 0.3 then

agile cobalt
#

set it to like 0.1

#

and do look into Conv2d layers

tulip wyvern
#

yea im doing some research on them

agile cobalt
#

if you feel like it is overfitting, you can increase it back later

tulip wyvern
#

do you think my low test accuracy is just because im using an rnn (i think thats what its called??)

agile cobalt
#

not sure if it has a name besides just being a neural network, but a CNN would likely work much better

#

maybe also try using Adam instead of just SGD

tulip wyvern
#

i heard about adam before

#

all i know is that it's the combination of momentum and rmsprop

#

but ill read more into it on the math and its function

#

thank you so much !! 🙏 🙏
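
For reference, the "momentum plus RMSProp" description can be sketched as a single scalar update step. This is a simplified, pure-Python illustration of the standard Adam update; the function name adam_step and the toy objective are assumptions made for this example, and in practice you would use the optimizer your framework provides:

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative helper)."""
    m = beta1 * m + (1 - beta1) * grad        # momentum-style running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSProp-style running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy objective (assumed): minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    grad = 2 * (x - 3)
    x, m, v = adam_step(x, grad, m, v, t, lr=0.01)
```

The division by the running root-mean-square of the gradient is what makes the step size roughly independent of the gradient's scale, which is often why Adam is more forgiving than plain SGD about the learning rate.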

past meteor
#

Dropout, batchnorm, ... are things you should add after you've got a network that works but is overfitting

quaint loom
#

Does anyone have a quick way to turn this model into text/latex form? Or something similar?

worldly dawn
quaint loom
past meteor
#

Depends on how beautiful it needs to be, draw.io could work

#

It's web based but you can also install it

quaint loom
#

It doesnt have to be beautiful, As long as GPT understand it 😛

past meteor
quaint loom
#

Ehm. It looks like this page can only create the outlook of the model, not in textform

past meteor
#

Oooh, you want to give this diagram as text. I don't think GPT will understand that

quaint loom
#

I remember I got a suggestion here earlier but I don't remember the name of it. Some kind of latex for a two-dimensional matrix

bold timber
#

anyone can explain to me what is difference between sequence_length and vocab_size from this code?

quaint loom
#

In your code, the , sequence_length and vocab_size are two parameters used in the Embeddings class

bold timber
quaint loom
#

Sorry for late reply @bold timber

past meteor
#

I have an array alphas = alpha ** np.arange(1, X.shape[1]+1) that I multiply with X as such: y_pred = np.dot(alphas, X.T). Now my question is, how would you guys optimize alpha wrt. the MSE? This is a non-convex problem. Right now I used simulated annealing because that's what I'm most conversant with from school. Can I just throw BFGS on such a problem instead? It's been a while but afaik 2nd order methods will just go to a saddle point.

tidal bough
#

So it's a 1d optimization task, you are only tweaking α? Maybe differential evolution would work well, too.

past meteor
#

I hadn't heard of differential evolution yet! I'll try it on my data

wooden sail
#

ok, was just making sure it was a matrix and not something else

tidal bough
#

actually, wait... isn't this exactly quadratic by α? ah, nevermind, **

wooden sail
#

i would point out that no method has guarantees of finding the global optimum btw, so a heuristic like sim annealing also has no guarantee of finding it

past meteor
#

I'm OK with not finding the global optimum. I just defaulted to sim annealing because I was unsure how my go-to (BFGS) would behave here

wooden sail
#

is there any constraint on alpha?

#

probably alpha > 0, or?

past meteor
#

0 < alpha < 1

wooden sail
#

ok

past meteor
#

I'm reading about differential evolution and at a glance it looks like a special case of genetic algorithms

#

Both being population-based metaheuristics that use crossover, mutation and selection. Am I missing anything @tidal bough ?

#

Like, you can express genetic algos in terms of real-valued vectors and pick a specific mutation operator and then it's differential evolution?

tidal bough
#

Yup, it's a fancy genetic algorithm with local searches for refining the results. I've had good results with it in high-dimensional optimization, but plausibly it's less interesting in 1d.

past meteor
#

Genetic algorithms were some of my favourite coursework 🙂 Good to know scipy has a very easy-to-use implementation
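
A rough sketch of how the 1-D fit discussed above could look with SciPy, assuming synthetic positive lags X and a known "true" alpha of 0.6 so the result can be checked. The data, the mse helper and the choice of solvers are all assumptions for illustration, not the setup actually used in this thread:

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize_scalar

# Synthetic stand-in data (assumed): 200 rows of 5 positive lags.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
true_alpha = 0.6
y = X @ (true_alpha ** np.arange(1, X.shape[1] + 1))

def mse(alpha):
    """Mean squared error of the exponentially weighted prediction."""
    alpha = float(np.atleast_1d(alpha)[0])  # accept a scalar or a length-1 array
    alphas = alpha ** np.arange(1, X.shape[1] + 1)
    return float(np.mean((np.dot(alphas, X.T) - y) ** 2))

# Bounded scalar search respecting the 0 < alpha < 1 constraint.
bounded = minimize_scalar(mse, bounds=(0.0, 1.0), method="bounded")

# Differential evolution over the same interval, as suggested above.
evolved = differential_evolution(mse, bounds=[(0.0, 1.0)])
```

On this toy problem both solvers should land close to 0.6. With only one parameter, evaluating mse on a dense grid of alpha values and plotting it is also a cheap way to see whether there is more than one minimum.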

night kernel
#

anyone know of a good text summarizer model on github?

#

kind of an ed-techy tool that can summarize large texts
not cgpt
in the interest of saving money not using their api

wooden sail
#

the problem is not convex in general, but one could plot the second derivative vs alpha and see if anything special happens

past meteor
#

It's just an exponentially weighted moving average so X are lags

wooden sail
#

so all the entries of the matrix are positive?

past meteor
#

Yes

wooden sail
#

if the matrix and y are fixed, there's a good chance the problem is convex after all

past meteor
#

But they might not be in the future - I might swap the task from predicting y_t to predicting delta_y_t

wooden sail
#

aha

brave sand
#

once I label my dataset, are there any good tutorials for object detection?

#

i want to draw a bounding box

past meteor
brave sand
past meteor
#

You'll have to look for that yourself 🙂

#

you can consult the docs of YOLO, they have a tutorial there on how to train their model

patent ocean
#

Guys, I got a trained model from github but I wanna run it using my test dataset.

#

it is an image steganography project

#

but the code just keeps giving me results from the training data

#

If there are any kind souls present who would be willing to spend some time to read this project, I would deeply appreciate your effort. It is getting kind of desperate. I need help. please.

frail flicker
#

hey everyone, im trying to remove the outliers from my data using pandas
Im using the following code rn:

```py
    lower_limit = df.column.mean() - 3* df.column.std()
    upper_limit = df.column.mean() + 3* df.column.std()
    df = df[df[column] > lower_limit]
    df = df[df[column] < upper_limit]
```
The error being raised is that the database doesn't have a column object
#

How can I resolve this? Im kinda a beginner with pandas and this is the easiest method to remove outliers I found

spare briar
#

try df[column].mean()
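
A minimal sketch of that suggestion, assuming the column name is held in a variable called column and using made-up data with one obvious outlier. Attribute access like df.column looks for a column literally named "column", whereas df[column] uses the variable's value:

```python
import pandas as pd

# Toy data (assumed): twenty ordinary readings and one extreme value.
df = pd.DataFrame({"value": [10, 11, 9, 10, 12] * 4 + [500]})
column = "value"  # hypothetical column name held in a variable

# Compute the 3-sigma limits once, using bracket indexing throughout.
mean, std = df[column].mean(), df[column].std()
lower_limit, upper_limit = mean - 3 * std, mean + 3 * std

# Keep only the rows strictly inside the limits.
filtered = df[(df[column] > lower_limit) & (df[column] < upper_limit)]
```

Note that a single large outlier also inflates the standard deviation, so on very small datasets a 3-sigma rule may fail to remove it.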

plain jungle
serene scaffold
plain jungle
#

Sorry about that, will remove the post

past meteor
potent sky
#

I think JTexpo meant to make a general comment on GAs, in that they can be used to train neural networks

versed gulch
#

what would be the best way to measure a plant's biomass with ai, accurately measuring a plant from the soil up in a single pot? guesstimating the root biomass is also important, but if i could get an accurate reading of the upper plant i could possibly guesstimate the roots. the goal is to measure how much the plant is growing and has grown, and to estimate how much water it has taken up and possibly transpired. if the system could be embedded into a cheap microcontroller like esp32 or arduino it could help with a watering system im working on

agile cobalt
#

what?...

versed gulch
# agile cobalt what?...

ai biomass estimation. i was thinking white dots around plant like in green screen movie animation

agile cobalt
#

sounds a bit like a regression problem, but you would need of a lot of data to train a model for that

#

after the edit: do you understand how models actually work?

versed gulch
brave sand
#

hey guys

agile cobalt
#
  • you must have well defined inputs and outputs
  • the model itself is used to approximate a function, but you must model your problem in a way such that you can "teach" it how to get better at approximating that function
versed gulch
#

ive been recomended open cv and plant cv from gpt

brave sand
#

does this matter?

#

can I use the mobile net v2 320x320 for this?

agile cobalt
#

forget about using GPT if you do not understand the subject of the discussion well enough to verify whether or not the output makes sense, seriously

brave sand
versed gulch
agile cobalt
versed gulch
agile cobalt
#

the biggest issue you would have to tackle first is finding a dataset containing the biomass of the plant, with the features you want to use to estimate the biomass

versed gulch
agile cobalt
#

I don't know much about plants, but I highly doubt that data about one species would work for other species unless they are extremely similar/close

versed gulch
#

or any valued plant

#

i water plants based on weight so the ai would be a compliment to my setup

#

watering can be a challenge for a lot of people, under watering over watering ect. if the ai can do it all then can help humans with labor costs

potent sky
#

yeah interesting problem to solve but you need to define it as an ML problem first
along the lines of what etrotta said, identify your inputs, why those inputs will give you valuable modelling information, identify your outputs, study whether such a system even is possible, identify what kind of statistical modelling task it fits into (regression maybe), etc.
Good luck!

analog schooner
#

anyone knows a reliable Midjourney API provider?

tulip wyvern
#

Why is my test accuracy so bad and not increasing at all? (5.6%) for my multiclass task?

#

😦

agile cobalt
# analog schooner anyone knows a reliable Midjourney API provider?

There are a lot of Stable Diffusion API providers and OpenAI offers DALL-E 2 via API, but Midjourney does not offer an official API and you should not trust anything that claims to offer an API for it - It would be either a scam using a different model or violate ToS

agile cobalt
tulip wyvern
#

yeah i think ill chagne that then

#

what do you mean the backpropagation step is in the wrong order

agile cobalt
#

the order of these lines

#

I would also recommend taking a look at the images yourself after your code resizes them to make sure that they remain identifiable at all

tulip wyvern
#

okay let me try putting the zero_grad first and then check the images after resize

agile cobalt
#

(like, if the original images are wallpaper sized, you might want to keep them a bit larger or crop the faces first)

versed gulch
#

@agile cobalt thanks for that awesome site and your help

potent sky
potent sky
potent sky
#

ah okay

agile cobalt
#

the model is also fairly large so you might need to let it train for a bit longer than 2 epochs, especially with the learning rate you set

potent sky
#

Both would be equivalent

tulip wyvern
#

i trained for 5 epochs and the accuracy didnt change at all

agile cobalt
#

looking at the dataset... a lot of these images are not like each other at all
I would recommend testing your model on a more conventional dataset first

#

including inverted images in the dataset is also weird af imo

#

specially considering that it looks like they were not even consistent on it? some of the characters have inverted images while others don't

tulip wyvern
#

yeah you're right ill use a different dataset this one is way too hard

#

ill find one with more consistent images

agile cobalt
#

it might work well enough if you fine-tune an existing model, but I don't think that training one from scratch on that data set is a good idea

tulip wyvern
#

yea i probably took too hard of a dataset for a starting learning project 😭

#

thank you very much!! 🙏

potent sky
# tulip wyvern i trained for 5 epochs and the accuracy didnt change at all

A few things. Check if 64x64 image size contains enough information for you to do the classification you're trying to do, look at the dataset for this
Though these probably aren't significantly responsible for keeping it at 5%:
Try a lower kernel size. Your image is 64x64, a kernel size of 5 is probably too much
Try more filters, instead of 3->6 try 16,32,64. More filters give the model the ability to capture/model more features
Report train accuracy as well and compare train and test accuracy to get a sense of bias/variance

#

Check the data distribution and balance across classes
Lots of things are possible, you just keep eliminating what isn't the problem to keep making your method better and eventually find out a solution

#

Albumentations could be useful too once your model is learning better

#

So much successive pooling with such a small image-size to start with is rapidly reducing the size of your feature map, and combined with you having few filters could mean a lot of relevant information is available for very few parameters to learn

frozen girder
#

Hi! I am trying to use TargetEncoder from sklearn so as to apply mean encoding to some of my features. But i cant find any info about it, especially examples. Any one knows how to use it?

analog schooner
# agile cobalt There are a lot of Stable Diffusion API providers and OpenAI offers Dalle-2 via ...

well, they really have a reason for delaying the release of their own API, so people like me that rely on their bot are pretty much stuck if they want to create bigger projects. I, for example, need it for a video generation tool, but without an API it's useless. others like DALL-E 2 don't work as well. I've found mjapi.io, thenextleg.io and others, might just try them, but I was curious if there's anyone that had already tried them and has anything to share. ofc, this is just a proof of concept, and I'm well aware that it's safer to use a burner account

agile cobalt
#

if something requires using proxies, burner accounts or such, we will not assist it

brave sand
#

once I label my images, what do I do now?

#

i'm trying to make an object detector

timid grove
#

this is my device map for "HuggingFaceH4/starchat-alpha" model , can anyone tell what values should i pass inside no_split_module_classes to make the same layers in one device.

tidal fog
#

Im hoping I could get some help on a problem that I have been having. I have been trying to implement the paper HiPPO https://arxiv.org/abs/2008.07669 and accomplished this here: https://github.com/Dana-Farber-AIOS/HiPPO-Jax. However since then I have been trying to re-implement it with different design choices. The big difference is that HiPPO is basically a layer that behaves recurrently. It's common practice, at least with RNNs, that you implement the RNN Cell and then call the RNN Cell within a skeleton RNN that accepts arbitrary RNN Cells, e.g. LSTM, GRU, etc. I wanted to do the same with HiPPO where the two different cells are HiPPOLSICell and HiPPOLTICell. I have found that when I try to initialize my parameters such that the flax module knows the learnable weights, the way I am doing it doesn't work and will fail when I try to set the weight matrices as self.param. I will provide the code for all of this in a moment.

hasty mountain
#

My experience with PPO(not HiPPO) says that initializing the network weights the way they suggest(which is, if I remember correctly, through an invertible matrix) makes the model rubbish

and that's a standard for most algorithms I've tried so far

#

At least my model tends to produce the same outputs and get stuck on local optima more easily with such initialization

tidal fog
#

oh ok I cant upload files

#

damn

digital quartz
#

hi everyone, i have a problem: i have to generate a dataset from documents to fine-tune gpt, but i have no idea how to do it 🥲

serene scaffold
digital quartz
#

my instructor and he want to fine tune chat gpt to question answering about the document

hasty mountain
#

I suppose it's to decrease the chance of misinformation pithink

serene scaffold
# digital quartz my instructor and he want to fine tune chat gpt to question answering about the ...

you can't fine-tune ChatGPT, because the actual model is not available. you can only interact with it over their website.

You can download GPT-2 (in a certain sense, ChatGPT is GPT-3.5), but I think that's as close as you can get. And GPT-2 isn't conversational the way ChatGPT is.

If your instructor doesn't know this, but they gave you this assignment anyway, you should see if you can get a refund for the course.

#

if I'm reading this other article right, it looks like you can fine-tune GPT-3 via OpenAI's API, but that the actual model is never turned over to you.

agile cobalt
#

if they did not specify it has to be exactly chatgpt, there are some open source models you can use - overall they are significantly worse than openai's gpt, but should work well enough with fine tuning

agile cobalt
#

I'm thinking more like the ones released earlier this year than however long ago gpt 2 was
edit; 2019

serene scaffold
#

everyone and their grandma wants their own LLM

tulip wyvern
#

How come my model test accuracy is just the same accuracy as pure guessing and is not improving at all? (10% with 10 classes)

#

I've used three different datasets and my test accuracy has been the same as guessing each time

#

😦

agile cobalt
#

you are still using that?

transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])

tulip wyvern
#

yea i havent changed it yet

#

im just trying to get my model to learn 😭

bold timber
#

I have a question about this code:

def compute_mask(inputs, mask=None):
    mask = tf.math.not_equal(inputs, 0)
    mask1 = mask[:, :, tf.newaxis] # column vector
    mask2 = mask[:, tf.newaxis, :] # row vector
    attention_mask = mask1 & mask2
    return attention_mask

That code is the code for masking in Transformers. But I'm confused about why do we need to set mask= None ?

agile cobalt
bold timber
arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied timeout to @tulip wyvern until <t:1685075660:f> (10 minutes) (reason: newlines spam - sent 157 newlines).

The <@&831776746206265384> have been alerted for review.

merry oak
#

!unmute 494283724373098529

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: pardoned infraction timeout for @tulip wyvern.

agile cobalt
merry oak
#

use the pastebin

tulip wyvern
#

yeah sorry 😭

tulip wyvern
agile cobalt
#

try shuffling your input data

tulip wyvern
#

for the train or test, or both?

agile cobalt
#

both

tulip wyvern
#

okay i will try that 🙏

agile cobalt
#

just remember to keep the images and their corresponding labels together

#

either bundle them in a 2-D array and only shuffle along the first axis, or just generate a random order to grab indexes from
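
The second option could look roughly like this, assuming the images and labels are NumPy arrays of the same length (the toy arrays below are placeholders, not the dataset from this conversation):

```python
import numpy as np

# Placeholder data (assumed): 5 fake "images" of 4 values each; label i belongs to image i.
images = np.arange(5 * 4).reshape(5, 4)
labels = np.arange(5)

# Draw one random order and apply it to both arrays so pairs stay aligned.
rng = np.random.default_rng(0)
order = rng.permutation(len(images))
shuffled_images, shuffled_labels = images[order], labels[order]
```

Because the same index array is applied to both, every image still lines up with its own label after shuffling.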

tulip wyvern
#

i have the label and the image as attributes in my dataset class

#

as lists

#

and for getitem i return the object at the selected index for the image and label list

#

okay i shuffled my training and test set and i got 37% test accuracy

Do you think this is just because I have a bad model architecture or is it a product of another dumb mistake?

agile cobalt
#

are you still using 0.5 for all in the normalisation?

tulip wyvern
#

yep

agile cobalt
tulip wyvern
#

yeah I can't believe i missed that 😭

#

okay it looks my model can finally learn

#

ill just work on finding better normalization parameters and architecture

#

thank you very much i really appreciate it

agile cobalt
#

any more than that might require some more delicate tweaking, but even 95% should still be reachable if you put time into it

tulip wyvern
#

yeah im definitely gonna put in some time to play around with it

mild lichen
#

Guys, who can help me with Python simple thing, im new in it.

tulip wyvern
#

Can somebody explain why the RNN performs so much better than the CNN (which is not improving at all)

#

I don't know why I keep getting this issue 😭 😭

agile cobalt
#

recall = True Positive / Total Positive

you have 516 false negatives and 79 true positives, so you correctly identified 79 / (516 + 79) of the examples labeled 1 = 0.13277310924369748

#

your classes are extremely unbalanced, so it predicts false positives more often than false negatives

#

*I might have messed|mixed up the names ; do double check it
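
As a quick check of the arithmetic, here is the recall formula as a tiny function, assuming the 79 and 516 counts quoted above are the true positives and false negatives respectively (the function name is just for illustration):

```python
def recall(true_positives, false_negatives):
    """Fraction of actual positives that the model identified."""
    return true_positives / (true_positives + false_negatives)

value = recall(79, 516)
```

With those counts the result is roughly 0.133, meaning the model finds only about 13% of the examples that are actually labeled 1.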

#

I assume so

#

(so that it supports cases in which you have three or more categories instead of just binary classification)

agile cobalt
# tulip wyvern I don't know why I keep getting this issue 😭 😭

tbh I would recommend trying to use a slightly more high level library like fastai

I would guess underfitting, since your CNN network has wayyy more params than the other one, but I feel like even then it should improve at least a little more than it is right now
also you left the other one with 10 outputs?

tulip wyvern
#

Fastai is higher level?

#

I was considering learning it but I didn't know people actually use it

tulip wyvern
agile cobalt
#

not enough data to adjust all the params

#

but never mind, that does not make much sense

#

I'm kinda sleepy derp

tulip wyvern
#

i think i will try to grind and figure this out

#

even though it is extremely painful

#

hopefully im learning something 💀

#

thank you for all the help youve given me 🙏

agile cobalt
#

maybe play a bit with the learning rate (both increasing and decreasing)

#

the loss going up in the last epoch for the 'rnn' is also a bit suspicious

tulip wyvern
#

im going to do everything imaginable

fickle crescent
#

hi guys, i recently got interested in data science..... can you share some really good resources to study....also how much time would it take before i could make a decent project and land an intership?

wraith walrus
#

error: mp_drawing = mp.solutions.drawing_utils
AttributeError: module 'mediapipe' has no attribute 'solutions'

lapis sequoia
#

Hello, so i have a question about an onnx models, im new to this and i'm not finding enough documentation on the internet.
So i have two models i exported as onnx and they work fine. One is called splitpost and the other nosplitpost. They take as input a mask of shape [1,88,44,60]
i wanted to create an onnx model where i have an if else statement and a split boolean value.
If split is true, we use the first model on the image else we use the second one.
This is my code: https://codeshare.io/Lw4en6
Obviously this isn't working but i don't know why. i'm getting this error on the check model:
onnx.onnx_cpp2py_export.checker.ValidationError: Unrecognized attribute: axes for operator ReduceSum
==> Context: Bad node spec for node. Name: OpType: If
Can anyone please help me solve this?

dawn hazel
#

.

errant bison
#

can anyone tell which libraries can i use for automatic license plate recognition

potent sky
potent sky
#

Basically, if you're using a file named mediapipe.py, change the name

potent sky
potent sky
pulsar needle
#

Hello, I am trying to make a CNN to predict whether an image of an eye is open or closed, and it seemed to train very well over 50 epochs:
Epoch 50/50
9/9 [==============================] - 7s 816ms/step - loss: 0.0044 - accuracy: 1.0000 - val_loss: 0.0119 - val_accuracy: 1.0000

When I predict images using the model, it always looks something like this:
array([[0.00796265, 0.9920373 ]], dtype=float32)
where the second class is much larger than the first.

Also, I tried predicting Images that are in the training set and it still outputted the same, incorrect prediction.

I'm not sure how I could fix this. If anyone could help that would be great.

cold osprey
#

u sure something isnt swapped around?

lapis sequoia
potent sky
#

For example if you had 990 images - class A, 10 images - class B.
In this case even if your model learns to just predict A every time, it gets 99% accuracy, even tho it's not learning anything valuable.
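
The same point in a few lines of plain Python, using the 990/10 split from the example and a "model" that always answers A (the variable names are assumptions for illustration):

```python
# 990 examples of class A and 10 of class B, as in the example above.
labels = ["A"] * 990 + ["B"] * 10
predictions = ["A"] * len(labels)  # a model that always predicts A

# Accuracy looks excellent even though nothing was learned.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall on class B exposes the problem: none of the B examples are found.
recall_b = sum(p == "B" and y == "B" for p, y in zip(predictions, labels)) / labels.count("B")
```

This is why per-class metrics such as recall, or a confusion matrix, are more informative than accuracy when the classes are unbalanced.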

plain jungle
pulsar needle
plain jungle
#

Just to double check, are you making sure to shuffle your data and not train eye open first then eye closed?

lapis sequoia
pulsar needle
plain jungle
#

Something like this ^

keen gust
#

anyone here familiar w/ pygsheets or google api auth? I had a project I was working on a while ago, I was using a service account to connect and could access my spreadsheet fine. I went to run it today but it just keeps giving me a timeout error. Nothing has changed in the code since or on my console account

spiral inlet
#

Anyone here competing in Kaggle?

potent sky
potent sky
spiral inlet
potent sky
#

The last one was on RL

spiral inlet
#

I'm competing in the "playgrounds", to learn more

potent sky
#

oh great! That should be useful
Past competitions are also useful to learn, you can also observe the discussions by the top submissions and even their code if they've made it public

#

Good luck!

spiral inlet
#

I'm trying these competitions because there are a lot of people in there!

potent sky
#

Oh yeah community is very important too, lots to learn from our peers

spiral inlet
#

But I feel 'alone' in the competitions. I think that talking with others about the competitions can accelerate learning

#

For that reason, I decided to be more participative here

keen gust
keen gust
boreal gale
#

do you know how to create a column for the start-of-week date for all the values in the Date column?
that's your first order of business.
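
A minimal sketch of that first step in pandas, assuming the `Date` column is already a datetime column and that weeks start on Monday (the sample dates and the `WeekStart` name are made up):

```python
import pandas as pd

# Hypothetical frame; only the "Date" column name comes from the question.
df = pd.DataFrame({"Date": pd.to_datetime(["2023-06-14", "2023-06-18", "2023-06-19"])})

# dt.weekday is 0 for Monday, so subtracting that many days lands on the week's Monday.
df["WeekStart"] = df["Date"] - pd.to_timedelta(df["Date"].dt.weekday, unit="D")

print(df)
```

With that column in place, grouping by `WeekStart` gives one group per week.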

obtuse locust
#

In Pandas, is it possible to take a list of the data from rows like [A, B, C]

and have it output like

0 | Header A | Header B | Header C|
1 |    A1    |    B1    |    C1
1 |    A2    |    B2    |    C2
1 |    A3    |    B3    |    C3
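
One way this could be done, as a sketch with placeholder values: pass the list of rows to `pd.DataFrame` together with the column names.

```python
import pandas as pd

# Hypothetical row data; each inner list is one row of the desired table.
rows = [
    ["A1", "B1", "C1"],
    ["A2", "B2", "C2"],
    ["A3", "B3", "C3"],
]

# Passing a list of rows plus column names builds one DataFrame row per inner list.
df = pd.DataFrame(rows, columns=["Header A", "Header B", "Header C"])

print(df)
```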
obtuse locust
#

Got it! Thanks 🙂

serene scaffold
limber grove
#

Hello! I've been learning Python basics for almost 2 months and I want to improve my Python skills for data analytics. Does anyone know of any course that could be helpful?

coral field
#

In PyTorch, what is the difference between nn.Conv2d() and nn.MaxPool2d()? Both take the "kernel_size" and "stride" arguments, so won't they perform the same function of decreasing the analysis area of the data?

agile cobalt
#

where did you even read about them?

coral field
agile cobalt
#

their inputs and outputs are similar, but the operation they perform on the data is completely different

coral field
#

could you explain?

agile cobalt
#

MaxPool, as the name suggests, grabs the maximum value present in the kernel region, and ignores the rest
a Convolution has weights for each of the kernel positions, and will take all values into consideration at least until an activation layer like relu negates some of them
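
A small pure-Python sketch of that difference, not the PyTorch implementation: both operations slide a 2x2 window with stride 2 over the same grid, but pooling keeps the maximum while the convolution takes a weighted sum. The input values and weights are made up, and the bias term and channels that `nn.Conv2d` also has are left out.

```python
def slide(image, size, stride, reduce_window):
    """Apply reduce_window to every size x size window of a 2D list."""
    out = []
    for r in range(0, len(image) - size + 1, stride):
        row = []
        for c in range(0, len(image[0]) - size + 1, stride):
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(reduce_window(window))
        out.append(row)
    return out

image = [
    [1, 2, 0, 1],
    [3, 4, 1, 0],
    [0, 1, 5, 6],
    [2, 0, 7, 8],
]

# Max pooling: no parameters, keep only the largest value in each window.
pooled = slide(image, size=2, stride=2, reduce_window=max)

# Convolution: a weighted sum of every value in the window.
# These weights are made up; in a real layer they are learned during training.
weights = [0.5, -1.0, 0.25, 1.0]
convolved = slide(
    image, size=2, stride=2,
    reduce_window=lambda w: sum(x * k for x, k in zip(w, weights)),
)

print(pooled)     # [[4, 1], [2, 8]]
print(convolved)  # [[3.25, -0.75], [-0.5, 6.25]]
```

Both outputs are 2x2, which is why the two layers look alike from the outside; only the convolution has parameters to train.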

potent sky
# limber grove Hello! I´ve been learning python basics for almost 2 months and I want to impro...

There's this course by freecodecamp that's..well, free
https://www.freecodecamp.org/learn/data-analysis-with-python/

This is very code-orientated
There's also a few courses by mit ocw if you want to get into the math behind it (recommended)
https://ocw.mit.edu/courses/15-075j-statistical-thinking-and-data-analysis-fall-2011/

Datacamp is considered a very good resource so here's a datacamp course:
https://www.datacamp.com/tracks/data-analyst-with-python
But datacamp is paid. I think you get a free 3 month trial so use it wisely

There are some other great courses as well on udemy, Coursera and ofc YouTube

Good luck!

coral field
potent sky
#

They perform different operations

serene scaffold
#

maxpool and convolution both involve operations on a sliding region of the array, yes? is there a term that encompasses both?

potent sky
#

maxpool as the name suggests, activates the maximum from the kernel region
A convolution operation activates each element, with some weight, from the kernel region and adds them up

agile cobalt
limber grove
agile cobalt
#

but specifically for CNNs, do take a look at the https://poloclub.github.io/cnn-explainer/ page that they link in their website if you haven't yet
it also explains convolutions and pooling layers

An interactive visualization system designed to help non-experts learn about Convolutional Neural Networks (CNNs).

potent sky
potent sky
potent sky
primal jay
#

Hello, I'm trying to install mediapipe, does anyone know why I'm getting this error?

kind loom
agile cobalt
primal jay
#

Aight, i was using conda, maybe that's it

tulip wyvern
#

Anybody have any ideas on how I can reduce overfitting (Train: 100%, test: ~65%)

#

On a multiclass image task

#

I've done:
Normalization
Batchnorm
Random horizontal flip (0.5 chance)
Random grayscale (0.2 chance)
Dropout regularization (0.4 chance on each layer)

#

What other techniques are there?

tulip wyvern
#

not enough train data?

hidden sigil
tulip wyvern
#

840 * 3 (for each class)

agile cobalt
#

did you figure out why it was stuck on 33% the other day?

tulip wyvern
agile cobalt
#

wow

#

wasn't it just like learning rate too high?

tulip wyvern
#

oh yeah that too

#

i lowered it and it improved a lot

agile cobalt
#

that explains it

#

it was probably overshooting in the gradient

tulip wyvern
#

yeah

agile cobalt
#

0.4 dropout sounds a bit high to me, not sure what is the standard though, and you are applying it pretty late in the process?
using Norm layers like BatchNorm comes with a few things you have to pay attention to, did you look into that?

hidden sigil
#

idk how you set up your code but I'll explain the basics of undersampling if you want to try it. Basically, you make the number of samples for each target class approximately the same, by randomly removing rows from whichever class has more. So let's say you have 1000 rows of data where the target is class a and 200 rows for class b: you randomly remove 800 of the class a rows so that you end up with 200 rows for each class. If you implement that properly, you can probably significantly reduce the number of convolutions, normalizations and dropouts you do
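
A minimal sketch of random undersampling as described above, using only the standard library. The `undersample` helper and the `(id, label)` row layout are hypothetical, chosen just for illustration.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, label_of, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    keep = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, keep))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced

# Hypothetical data matching the numbers above: 1000 rows of class a, 200 of class b.
rows = [(i, "a") for i in range(1000)] + [(i, "b") for i in range(200)]
balanced = undersample(rows, label_of=lambda row: row[1])

print(Counter(label for _, label in balanced))
```

Undersampling should only be applied to the training split, so the test set still reflects the real class balance.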

agile cobalt
#

oh wow

tulip wyvern
# agile cobalt 0.4 dropout sounds a bit high to me, not sure what is the standard though, and y...

The default in pytorch is 0.5 so I just used 0.4
What do you mean I'm applying it late in the process? I thought you were supposed to apply it right after each linear layer
I haven't looked into the things that batchnorm brings along with it, but upon a quick google search it says that too small of a batch size could negatively affect it, and that batchnorm can also affect the learning rate?

tulip wyvern
agile cobalt
#

never mind about dropout, from looking it up it sounds like that is about right (batch norm between conv layers and dropout between linears)

hidden sigil
tulip wyvern
#

I've never heard of classification_report, I will look into it

hidden sigil
#
    
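```py
import xgboost as xgb
from sklearn.metrics import classification_report
# X_train, y_train, X_test, y_test are assumed to come from your own train/test split
xgboost_classifier = xgb.XGBClassifier()
xgboost_classifier.fit(X_train, y_train)
y_pred = xgboost_classifier.predict(X_test)
# per-class precision, recall and F1: once as a dict, once as a printable table
report = classification_report(y_test, y_pred, output_dict=True)
print(classification_report(y_test, y_pred))
```
here is a simple example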
tulip wyvern
#

ohh it's something scikit-learn already has

#

okay let me try that out rq then

median leaf
#

guys i keep trying to do knn classifier

#

but i keep getting an error about "ValueError: Unknown label type: 'continuous'"

#

anyone know the problem?

kind loom
# median leaf anyone know the problem?

A KNN classifier is for classification tasks, where the target variable is discrete. Your dataset might require a regression algorithm. Try using linear regression or SVR

kind loom
median leaf
#

It is and everything seems discrete so im quite confused

sleek harbor
#

could anyone throw me a link to a good guide on feature engineering? Feels like it's a big deal yet I see little coverage for it in guides and tutorials.. I get that domain knowledge is often key here, but what are some ways you could get an idea of what to do without any domain knowledge? The kaggle tutorial on PCA shows that it can be used for feature extraction, but it's not very well explained. Anyone know where I could read about PCA, not about how it's calculated (there's plenty info on that), but on how to interpret and use the results? And/or other popular feature engineering techniques?

past meteor
#

There are also well-documented things you can try for canonical problems, like creating lags in time series. Specifically for linear models: interaction terms, B-splines, ...
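
A minimal sketch of lag features for a time series, in plain Python. The `add_lags` helper and the sales numbers are made up; in pandas the same idea is usually written with `Series.shift`.

```python
def add_lags(series, lags):
    """Return one row per time step with the current value and the requested lagged values."""
    rows = []
    for t, value in enumerate(series):
        row = {"y": value}
        for lag in lags:
            # None marks steps where the lagged value does not exist yet.
            row[f"lag_{lag}"] = series[t - lag] if t - lag >= 0 else None
        rows.append(row)
    return rows

# Hypothetical daily sales figures.
sales = [10, 12, 15, 11, 18]
for row in add_lags(sales, lags=[1, 2]):
    print(row)
```

Each row now carries yesterday's and the day before's value as extra features, which is what lets a tabular model see the recent history.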

sleek harbor
past meteor
sleek harbor
past meteor
#

I think someone having code I can gloss over is great. It's nice if they've actively worked on things so I'd see experience on Kaggle as a plus. I also value experience in internships, student jobs as well but that's not mutually exclusive with sideprojects like Kaggle

#

For me the big thing is that none of these are mutually exclusive and most discussions treat them as if they were lol. You can get a degree, do Kaggle/sideprojects and get internships etc. Personally that's what I did

sleek harbor
# past meteor For me the big thing is that none of these are mutually exclusive and most discu...

but when ur time is limited and u r in a big hurry to get a job quick - u gotta choose 😅
I don't have a degree (not CS at least, I have a bachelors in Economics), and an unpaid internship is.. not really something I'd want to spend time on (if its a paid internship then sign me up lol). I have a few ideas for sideprojects, but one is massive and I'm unsure if I'm qualified to even approach it, and I haven't even checked if I can get the appropriate data for the second one I had in mind.. 💀 If everything works out as planned, I'll put one kaggle and 2 side projects on my resume, and hope for the best. I just hope my antisocial introvertness doesn't kick in during interviews.. 🗿 First step is to get the interview tho.. Storytelling is a big deal in DS, it seems.. and my storytelling skills are.. lacking

past meteor
#

For what it's worth, my bachelors was from the economics faculty as well, but not pure econ

quick bay
#

Hello, does anyone know how to do classification on a univariate, multiclass, imbalanced time series dataset? I would like to know whether what I'm doing is right or not.

potent sky
# sleek harbor oh I've seen it everywhere, on youtube, reddit, random chat groups. Some say rec...

I think Kaggle NBs are useful in the sense that they can demonstrate your ability to work with code like a lot of the industry does, much like GitHub but toned down
However I think there could be a case for recruiters being tired of seeing the same things again and again. Like house price prediction is an overdone project, everyone does it, it's available easily, you could've just copy pasted it onto your resume.
Don't do something like that.
Do something new. Use Data Science to actually solve an interesting problem. It might be useful, or it might be silly, doesn't matter; you can do analysis of Pokemon for all they care
As long as it's something new and interesting

#

I have heard this from recruiters at some top companies. The same overdone kaggle projects don't help you.
Demonstrate that you can identify a problem and then use Data Science to bring it to a solution

burnt island
#

Well said

#

Whenever you get to see this, do know that I am a data scientist willing to participate in machine learning / DS projects. My main goal is to build my experience in solving real world problems using machine learning algorithms.

Whether it is a virtual internship, unpaid role, paid role, hackathon I am willing to contribute as much as my current resources would allow me to

tepid parcel
#

Good morning!

I plan to be a machine learning engineer, but I don't know whether I need to learn machine learning from scratch, without software development tools, or whether it's better to use libraries and frameworks in the process. What should I analyze and consider in making this choice?

If I learn from scratch, what should I learn: only the logic, or more structured and complex things like algorithms and the like?

grand minnow
spice whale
#

Hey @grand minnow

grand minnow
#

Heyyyy

narrow crane
#

What do you guys think of this channel, and what guidance could you give someone who's trying to make it in machine learning engineering or data science?

serene scaffold
#

@narrow crane I watched the first one. I think he's mostly right, but might be over-stating a few things for dramatic effect. like with portfolio projects--they aren't useless necessarily, but no number of portfolio projects can make up for a lack of "real" experience.

he's also right that "learning Python or R" isn't a strategy for getting an AI role. No one gets hired into AI because they "know python". They get hired because of demonstrated knowledge and application of AI.

narrow crane
#

My plan was to just go through kaggle courses, then try to do the machine learning bootcamp, and hopefully do some projects while learning NLP and SQL.

serene scaffold
#

unless you have industry experience in another STEM field.

past meteor
serene scaffold
#

I'm watching the second video now. He seems weirdly preoccupied with SQL.

past meteor
#

Because for old school data professionals data = tabular data

serene scaffold
#

I agree that SQL is very important. it's just weird that he keeps saying "3 years of SQL"

past meteor
#

If you start your career making Power BI dashboards good luck pivoting to anything ML heavy in the future I think...

serene scaffold
#

he just said "excel is not a big boy data tool"--is this the guy who keeps posting that starbucks barista pic on quora??

#

done with the second video. does he ever talk about getting a masters in CS? because that's probably the most straightforward way to get a job in ML.

past meteor
#

Getting a bit off-topic on my part but I'ma be honest and say that what matters the most is just a solid university degree.

serene scaffold
#

I think he's mostly right when he says "there are no entry level jobs in ML" in the sense that the only ML jobs that will take you based only on your degree, require that you did something tangible with ML during that degree

narrow crane
#

Should I consider changing my study plans?

past meteor
#

Are you based in the US?

narrow crane
#

Yeah I am.

serene scaffold
#

(I'm assuming you're a young person who doesn't have professional experience. tell me if I'm wrong.)

narrow crane
#

You’re not wrong. But my current degree is in cyber security.

#

Which is..different yeah. I was trying to learn two skill sets in one.

serene scaffold
#

one that you're currently pursuing, or that you've finished?

narrow crane
past meteor
#

Then I don't know what your job market looks like. In our case, you're not getting hired to do ML with a bachelors. Amongst masters level candidates there's tons of degrees (CS, bio-informatics, quant business (my first one), statistics, ...) that all can/want to work in ML/AI so you need to be able to "compete"

#

If you're interested early you have a ton of time to do relevant internships, projects, ... I think

serene scaffold
narrow crane
#

Yeah I think I may have made a few mistakes with my goals.

queen cradle
serene scaffold
past meteor
#

I did a lot of internships in data, learnt Python, ... while I was in my 2nd year I think

queen cradle
#

Really I would say that the important thing is to do well in your courses. Any student with a 3.9 GPA is more appealing than any student with a 2.5 GPA.

narrow crane
#

Yeah I was doing courses and studying data science on my own whilst being in class for cyber security.

past meteor
#

If you can pick good electives you should

#

Idk how math / statistics heavy your program is but for us those 2 were the cornerstone of the degree which made grokking "data science" concepts easier

serene scaffold
#

I have to go, so Kyle and zestar can fight over who gets to be in charge.

narrow crane
#

So is it realistic for me to continue as I am and be able to land a good career or do I need to change a few things?

#

What courses, practices, or anything do I need to learn practical data science skills?

narrow crane
narrow crane
past meteor
sleek harbor
#

the way I see it - you don't get a degree to learn anything. U just get it so you would attract recruiters, so you could get hired. Once you are hired, then you actually start to learn (obv u should learn on ur own before that, else you won't get hired even with a degree, cus u won't have any knowledge, but what I mean is that a degree won't give you much.. didn't give me much at least)

past meteor
#

Everything is learnable without a degree sure but if your uni / profs are good it helps. Massively.

narrow crane
#

Like maybe I could do a boot camp for a summer to get the data skills

#

There’s a lot of programs for data science and cyber security

past meteor
narrow crane
#

Or is it just better for people to hire me with

past meteor
#

Jobs taught me a lot that I didn't learn in school or didn't pay enough attention to

#

Reading and doing something are different

queen cradle
past meteor
#

They're complementary

queen cradle
#

Boot camps are fine if you want to just use ML techniques as a black box.

sleek harbor
past meteor
#

I took DB courses in school and also spent the summer doing data engineering a few years back

#

The DE jobs taught me so much about working in data environments, what the pitfalls were, and just practical SQL and more

narrow crane
queen cradle
#

Let me put it this way. Given your situation, planning on a boot camp seems unwise. You would probably be better served by taking relevant courses. After all, you're already paying to go to school; you may as well learn some things while you're there!

past meteor
#

The uni perspective was way broader and explained a lot of concepts that people take for granted in industry

queen cradle
#

What you don't usually get in university courses are practical skills, like how to use SQL or various ML packages.

narrow crane
#

I’ll have to consider a lot after this conversation

queen cradle
#

You can get those in a boot camp, but you get them faster and better with a job.

past meteor
#

I think bootcamps are a no-go tbh

#

They cover stuff you could just do by yourself

queen cradle
#

I think they're reasonable in some situations. They give you something specific you can put on a resume. Like, maybe you start to do data stuff at some job, but it's not an official part of your job duties, and you decide you want to do it full-time. A boot camp might smooth that transition for some people.

#

That said, I think they are often oversold.

#

I think they have some value, but not as much as the boot camps' marketing would have you believe.

narrow crane
#

Don’t mean to interrupt but I have one final question. Are the Kaggle and Colab courses good enough or are they irrelevant or bad or whatever

#

I really do appreciate you guys being patient enough to talk to me about these things btw

queen cradle
#

I don't know; I've never looked at them myself.

narrow crane
narrow crane
past meteor
narrow crane
#

I think right now I may have to focus on just being able to use these things like how Kyle said

past meteor
#

Yes, there's a series of competitions called "tabular playground", make your own notebook. Submit it and then look at how other people solved theirs

narrow crane
#

Part of me is scared but at the same time the capability of learning and growing over time is interesting.

#

Imagine I’m 30 or something and just a cyber wizard.

queen cradle
#

When you're 30, you'll probably not feel like a wizard no matter what you do. Either you won't be that good; or you really will be that good but you'll know how much you don't know!

past meteor
#

If you like an adventure you could just do it in Europe for a fraction of the price

sleek harbor
somber panther
#

determinant came up in my study, ad - bc is kind of abstract for me, is this something i should spend much time studying if i want to market myself in DS fields?
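
One way to make ad - bc less abstract, as a small sketch: for a 2x2 matrix it is the signed area of the parallelogram spanned by the matrix's columns. The `det2` helper and the example matrices are made up for illustration.

```python
def det2(a, b, c, d):
    """Determinant of the 2x2 matrix [[a, b], [c, d]]."""
    return a * d - b * c

# The columns (3, 0) and (1, 2) span a parallelogram with base 3 and height 2.
area = det2(3, 1, 0, 2)
print(area)      # 6: the area of that parallelogram

# Swapping the two columns flips the orientation, so the sign flips.
flipped = det2(1, 3, 2, 0)
print(flipped)   # -6

# Parallel columns span no area, so the determinant is 0 and the matrix is not invertible.
flat = det2(1, 2, 2, 4)
print(flat)      # 0
```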

narrow crane
potent sky
potent sky
past meteor
serene scaffold
#

If you have enough industry experience, that can eventually eliminate the need for a degree. But then, that experience is hard to get if you didn't have the degree.

narrow crane
past meteor
#

Tbh here you either do bachelors + masters at a research univ or a bachelors in applied science. The latter has 0 math in their applied comp sci programs.

#

Correlation != causation but employers do think so, so if you don't have a masters you're looked at unfavourably (at entry level). This doesn't apply to the US so I'd say my most important piece of advice is actually not to give too much weight to strangers' advice on the internet (not me, Reddit, Youtube or otherwise) and to understand your local job market tbh

potent sky
potent sky
night prawn
serene scaffold
#

Are you trying to use pytorch, or what?

night prawn
#

tensorflow

serene scaffold
#

So you need to look into how to install tensorflow with CUDA. That you're using VSC isn't relevant.

night prawn
#

So I must install an anaconda framework?

serene scaffold
#

Also there is no "anaconda framework". It's just anaconda.

potent sky
#

WSL2 runs a real Linux kernel in a lightweight VM on top of Hyper-V, which is a type-1 hypervisor. Just install tensorflow with cuda and cudnn for Linux in the subsystem and it should ideally work

#

I have never tried it myself tho

past meteor
#

Yes, TF and Pytorch support WSL2 better than actual Windows 🤣

serene scaffold
#

Is that surprising?

night prawn
#

I installed CUDA like in the tutorial, but I don't know how to install cuDNN, and I tried to install tensorflow with pip but it doesn't work

serene scaffold
potent sky
#

How so? Or dym only for cuda?

night prawn
#

I did this

past meteor
# potent sky How so? Or dym only for cuda?

"GPU support on native-Windows is only available for 2.10 or earlier versions, starting in TF 2.11, CUDA build is not supported for Windows. For using TensorFlow GPU on Windows, you will need to build/install TensorFlow in WSL2 or use tensorflow-cpu with TensorFlow-DirectML-Plugin"

potent sky
past meteor
#

Afaik torch.compile doesn't work on Windows either

potent sky
#

I meant cpu yeah

potent sky
past meteor
#

But for the rest, you're good to go

potent sky
# night prawn I did this

Can you describe what you've done and what error you're facing exactly?
Unfortunately the only thing I can make out here is a pip install keras that returns requirement already satisfied

past meteor
#

Getting anything to run on WSL2 is easier anyway I think

potent sky
night prawn
potent sky
#

Also why not just dual-boot but let's not get off topic

past meteor
potent sky
night prawn
potent sky
timid grove
#

hey folks,
anyone please help me out with this issue 🙏

I am making a data science text-to-code generating bot by fine-tuning the https://huggingface.co/HuggingFaceH4/starchat-alpha model on my own dataset, which has about 3000 text/clean-code conversations.
I have loaded the model checkpoint shards successfully on my Colab with the help of Hugging Face Accelerate and BitsAndBytesConfig nested quantization for memory efficiency.
This is how my data science text-to-code data is structured:
DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2384
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 666
    })
})

I am facing an error while training this model with this code:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)
trainer.train()
and getting this error: NotImplementedError: Cannot copy out of meta tensor; no data!

Please someone help me out in resolving this error, it will be a great help to me.
this is the full error message

hoary jay
#

hey, does anyone here have experience with the bert word tokenizer? Basically it generates word vectors for each word in a sentence, so if bert doesn't have an embedding for a word, say "overweight", it subdivides it into "over" and "##weight" and produces word embeddings for these two sub-words instead. Would it be okay to define the embedding of the complete word "overweight" as the average of these two?

agile cobalt
serene scaffold
#

"over" and "weight" both have discrete meanings that contribute to the overall meaning of "overweight".

timid grove
median leaf
#

what does this mean

#

i keep trying to do knn but i get nothing

potent sky
# timid grove hey folks, anyone please help me out with this issue 🙏 I am making a data sci...

For large models hf accelerate loads your model and weights in shards on each device.
For this it first instantiates a meta tensor, which contains only shape and dtype information and doesn't consume any memory.
Then it moves different layers of this model to different devices
And then loads the corresponding weights from the state dict in each part of the model

Your error seems like you're trying to move the meta tensor to some device directly without any data, which can't be done since there's no data to move

I think there's an empty_like() method you can try using to do that and later load the model weights in

Or, more simply. You can just load the corresponding shard of weights into the meta tensor and then move it to the device of your choice

#

See if that helps. Meta tensors are still a relatively new feature last I checked and support is still maturing

magic dune
#

anyone have any good decision Tree tutorials?

agile cobalt
#

the sklearn documentation should explain it fairly well
they also have a free course/mooc with publicly available materials you can take a look at

potent sky
hoary jay
hoary jay
agile cobalt
#

iirc the way bert encodes it is not even a vector at all, it is just one number per word segment

#

why do you want to "represent them as a single word vector"?

#

and how would you even define what is and isn't one word?

hoary jay
hoary jay
# agile cobalt and how would you even define what is and isn't _one_ word?

i meant that the cases where it sub-tokenizes one word into many are bad for me. For example, consider the statement

"All men are overweight these days" then if i pass it to Bert then it will generate 768 dimensional vectors for every word from "all" to "days".... except for overweight, in this case I'll get two word vectors one for "over" and one for "##weight" and that's the issue i need to resolve

#

see, by finding the word vector for "overweight" i can account for biases in my data: whether overweight is more associated with words like He, Him or male, or She, Her or female
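
One common workaround for the split-word case is to average the word-piece vectors back into one vector per word. A minimal sketch, using made-up 3-dimensional vectors in place of the 768-dimensional ones and assuming that pieces starting with "##" continue the previous word; the `merge_word_pieces` helper is hypothetical, and whether a plain average is good enough for a bias analysis is a separate question.

```python
def merge_word_pieces(tokens, vectors):
    """Average the vectors of '##' continuation pieces into one vector per word."""
    words, groups = [], []
    for token, vector in zip(tokens, vectors):
        if token.startswith("##") and words:
            words[-1] += token[2:]      # glue the piece back onto the word
            groups[-1].append(vector)
        else:
            words.append(token)
            groups.append([vector])
    # Element-wise mean over the pieces that make up each word.
    merged = [[sum(dim) / len(group) for dim in zip(*group)] for group in groups]
    return words, merged

# Made-up vectors for illustration; real ones would come from the model's output.
tokens = ["are", "over", "##weight", "these"]
vectors = [[1.0, 0.0, 0.0], [2.0, 4.0, 6.0], [4.0, 0.0, 2.0], [0.0, 1.0, 0.0]]

words, merged = merge_word_pieces(tokens, vectors)
print(words)
print(merged)
```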

agile cobalt
#

(I know that Bert and GPT are not the exact same thing, but as far as I know, the way they tokenize is more or less the same)

past meteor
#

Computerphile has a few great videos on byte-pair encoding and it's worth looking at if you want to understand tokenization

glacial kiln
#

is this the right place to talk about scipy?

grand minnow
glacial kiln
#

can I do integration in scipy?

grand minnow
glacial kiln
#

i have to see

#

one thing: can I read about how to use scipy from the help section of the Python IDE?

grand minnow
#

You could, or you can go to its official docs

#

both work

#

Also depends on your IDE

#

whatever that is

glacial kiln
grand minnow
#

And I assume help section of it is just help(scipy)?

tepid parcel
glacial kiln
tepid parcel
past meteor
#

The masters program I did (AI) had multiple courses in computational neuroscience. I didn't do them but I assume that's the way to go.

tepid parcel
glacial kiln
dense crane
#

can someone confirm if i want to use azure for training the models i have to create the virtual machine there?

past meteor
dense crane
grand minnow
pseudo spire
tepid parcel
pseudo spire
#

are those university subjects or what?

#

actually gradient descent is Calculus 3, I believe

scarlet ingot
#

how can i do some basic machine learning to make a snake ai

#

like where would i start

pseudo spire
#

so I wouldn't be so categorical in stating it is only Calculus 2

#

Anyways this is just subject names. When Newton invented Calculus, he didn't have 1 2 3 and 4

tepid parcel
# pseudo spire are those university subjects or what?

Of course bro, they are essential for everything. Complex systems studies apply not only to neuroscience, psychology or social areas, but to anything that can be represented as a system, like biology, telecommunications, and even ANNs (artificial neural networks) themselves. This area could, for example, help explain and solve AI's extremely low explainability, and more

scarlet ingot
#

i think its to wean you onto the concepts of it

#

im gonna go ask chatgpt my question

tepid parcel
pseudo spire
#

ChatGPT is also prediction. Prediction of which of the many answers is the most appropriate.

tepid parcel
#

It's not only a matter of difficulty. Knowing deeper calculus is good and is where the field is heading, but I'm talking about complexity too: complex systems

#

It's slightly different

pseudo spire
#

The more subjects you know well the better. You know, they are not isolated from each other. Knowing what is positive feedback and negative feedback might help you in your AI work.

tepid parcel
#

How does Calculus 1, or simple things in general, connect with the more complex ones?

tepid parcel
pseudo spire
#

There is no Math 1 2 3 4, so there is no Calculus 1 2 3 4. This is just conventional dividing into smaller parts. In math the more complex things are always built upon more simple things. As in many other subjects.

#

So the AI of Elon Musk's neurointerface is also either prediction or classification. E.g. classification of what particular signal could mean.

tepid parcel
tepid parcel
night prawn
serene scaffold
hoary jay
serene scaffold
grand minnow
hoary jay
tepid parcel
# pseudo spire So the AI of Elon Musk's neurointerface is also either prediction or classificat...

Maybe I wasn't clear and there was a misunderstanding. Complex systems are not tied to hardware and software, although there are applications built on them. An example of a complex system is the climate: it has nothing to do with hardware, software or IT, but, as is the main purpose of IT itself, I can use IT as a technology or tool to understand climate systems (and I believe that's the next step for research after discovering how the human brain functions: understanding its systems).

I didn't mean it merely in terms of software or hardware. I'm talking about more fundamental things, closer to the core, more structural and "symbolic": how Elon's neural chips work as a system (a type of complex system) in the sense of interactions and communication, like how a neural network works. Unfortunately, neuroscience's current focus and fame isn't on what I described, where I think it should be; it's only on health solutions (that isn't a bad thing, but it may exclude other important scientific areas, like sociology, which I think is very underrated and poorly explored to this day)

pseudo spire
#

Your initial question was about AI, right?

tepid parcel
pseudo spire
#

So you've got answers about AI, don't you

tepid parcel
potent sky
#

specific to wsl2

plain drift
#

have been wondering how valuable it might be to move my python install to wsl

potent sky
tepid parcel
potent sky
plain drift
#

would throw out a lot of OS familiarity

potent sky
potent sky
wooden sail
#

tf gpu and jax are only available on windows through wsl, and also docker requires it

#

it's worthwhile to look into

#

short of using a proper vm, it's the easiest way of getting started with linux too

potent sky
#

jax as well? damn I've really been blissfully unaware using linux

tepid parcel
tepid parcel
#

I can find courses, but again, so superficial I think

#

Not a Degree for it

potent sky
#

wait so are you looking to choose what courses to take at your uni or
what courses to take online or what to refer to?

tepid parcel
cold osprey
tepid parcel
tepid parcel
pseudo spire
past meteor
#

How to create a neural network via pure math

pseudo spire
past meteor
#

Super super questionable advice you're getting here but OK

wooden sail
#

after exactly 10k hours, your skin immediately turns gold and music chimes from within you, heralding the unlocking of an achievement

lapis sequoia
tepid parcel
# pseudo spire You have to study: - Math statistics (probability and statistics). - Linear al...

I know that it isn't your fault or responsibility to instruct me in this case, on how to find out which subjects this scientific area covers, but I actually don't have any method for that...

How would you guys go about studying for a professional career that doesn't have any courses yet? It's in an "experimental phase" and isn't well structured, educationally speaking, but it is full of concepts (they emerged in the 1950s, around the same time as the AI concept). How do you know whether it's viable to study and work in it? Say it's a little-explored scientific area with no learning materials, but you know some core areas are involved, without knowing exactly which ones: can you teach yourself?

wooden sail
tepid parcel
cold osprey
#

do a phd

#

get into research

wooden sail
#

what's complex systems science? AI/ML degrees are widely available at all education levels

hoary jay
#

hey guys, I had a question: if a word reappears again and again in a sentence, then how do I modify its word embedding? Say it appears in two completely different sentences but the word has pretty much the same meaning (I'm only talking about pronouns in my case). For a big dataset, how can i calculate the embeddings of these words in Bert?

tepid parcel
tepid parcel
wooden sail
#

sounds very wishy-washy

#

but in the direction you're pointing lies physics-informed AI, so maybe read up on papers related to the topic

tepid parcel
pseudo spire
potent sky
#

there's intersection but I think it deserves to be mentioned separately

pseudo spire
#

"Complex systems". Is it something from the 80s? You said it's something new. I don't think it's new. Maybe it has developed more rapidly in recent years (like almost any subject), but it is definitely something from the 1980s. Or even the 70s.

queen cradle
#

I have encountered people who bill themselves as experts on "complex systems" and I have not been impressed.

#

My advice is that the fundamentals are the same regardless of the subject area: Math, statistics, computer science, physics, and so on are useful everywhere.

#

If you understand those well, then you can apply them to whatever problem interests you.

tepid parcel
pseudo spire
#

Like we have complex systems even in generic electronics (not microelectronics).

#

Social science systems... What the heck is this?

tepid parcel
pseudo spire
#

You are speaking some bird language I don't understand.

queen cradle
#

I don't think he's a very fluent English speaker. But I also believe that's not the biggest problem here.

pseudo spire
#

I'm not a native speaker either. And subject names can differ from country to country. Anyway, I have no clue what Social Science Systems is.

past meteor
#

If only I had 5x more time than I do

pseudo spire
#

You have a pen (AI). You have an apple (domain area). Boom! PPAP
Although this journey is not always that linear.

#

And that's why science exists, and why scientific research exists.

#

You could investigate the history of text-to-speech technologies to see how this particular domain area evolved

tepid parcel
#

It's just systems found around the world, viewed with a social science approach: friendship and family are examples

#

Remember your school biology class about the levels of biological organization, from atom to biosphere: between them we have systems

hasty mountain
#

Guys, is there a method to calculate how complex an operation is and how to use a method which may provide similar performance?

I'm currently making some experiments on GANs and I've noticed that, from previous experiments I've made, the best one (contrary to what is recommended) was the one where the Generator used no Transposed Convolutions for upsampling, but simply applied bi-linear upsampling operations followed by a residual block with 3 convolution layers + a single convolution layer to get an output with 3 channels.
I've done this to generate, in a single generator, images 4x4, 8x8, 16x16 and 32x32. However, for the 32x32 images, I've also applied, after the convolution, another residual block with the idea that the model could, then, "apply small enhancements" to the image.

The results: 4x4, 8x8 and 16x16 images are fine, but 32x32 have a tendency of showing model collapse or even collapsing the discriminator (there's one discriminator for each dimension, so the collapse of the 32x32 discriminator doesn't affect the others). I was thinking that, of course, since 32x32x3 images are more complex than 16x16x3, this could be expected. But now I'm thinking: is this instability really caused just by the complexity of the data, or is it because I added too many layers? Or maybe too few for such complex data? Is it possible to estimate whether I added too many parameters or too few for a given data sample?

#

(PS: Transposed Convolutions in Pytorch are simply the gradient of a normal Convolution with respect to the convolution input. So I suppose Transposed Convolutions and Convolutions in Pytorch should be the same operations, I guess the only difference is that TransConv might add way more padding)

#

Wait...then I suppose that...if I apply a Bilinear Upsampling...the model will have a better idea on which transformations it should apply to the data than in the case of a Transposed Convolution, since this one is equivalent to padding the input...so the model will just receive 0s and have no idea on what it should do...relying solely on the bias...
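
For the stride-2, no-padding case, a transposed convolution can be written as inserting zeros between the input samples and then running an ordinary convolution, which is roughly the "0s" intuition above. A small NumPy check in 1-D (a sketch only: the function names are made up here, and no PyTorch call is involved):

```python
import numpy as np

def transposed_conv1d(x, kernel, stride):
    """Scatter-style 1-D transposed convolution (no padding, no bias)."""
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        # Each input sample stamps a scaled copy of the kernel into the output.
        out[i * stride : i * stride + len(kernel)] += value * kernel
    return out

def zero_insert_then_conv1d(x, kernel, stride):
    """Insert (stride - 1) zeros between samples, then do a full convolution."""
    x = np.asarray(x, dtype=float)
    stuffed = np.zeros((len(x) - 1) * stride + 1)
    stuffed[::stride] = x
    return np.convolve(stuffed, np.asarray(kernel, dtype=float), mode="full")

x = [1.0, 2.0, 3.0]
kernel = [1.0, 2.0, 3.0]  # arbitrary toy kernel
scattered = transposed_conv1d(x, kernel, stride=2)
stuffed_conv = zero_insert_then_conv1d(x, kernel, stride=2)
print(np.allclose(scattered, stuffed_conv))
```

With bilinear upsampling the following convolution sees interpolated values instead of those inserted zeros, which is one way to read the difference being discussed.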

tepid parcel
# pseudo spire Like we have complex systems even in generic electronics (not microelectronics).

Yes, that's true, we have complex systems on electronics too...

https://en.wikipedia.org/wiki/Complex_system

A complex system is a system composed of many components which may interact with each other. Examples of complex systems are Earth's global climate, organisms, the human brain, infrastructure such as power grid, transportation or communication systems, complex software and electronic systems, social and economic organizations (like cities), an e...

tepid parcel
tepid parcel
cold osprey
#

tbh i dont even follow the initial question

pseudo spire
serene scaffold
#

@tepid parcel please do not ping people to ask for help. If they're online and have time to answer questions, they'll look at this channel. Otherwise, no one is on-call to answer questions.

tepid parcel
tepid parcel
raw compass
tepid parcel
tepid parcel
raw compass
tepid parcel
raw compass
#

I mean it looks like that you already got an answer.

errant bison
potent sky
# errant bison first i am trying to do with stationary. Then with moving cars, like extracting ...

For videos you could add a detection module with model chaining so that one model will detect whether there is a vehicle in the frame and only then the rest of the models will run to extract the car and license num. This will save a lot of processing.
You could also add an object tracking module to ensure each vehicle is processed only once
For static images YOLO + OCR is quite a solid and standard approach
An E2E model would not work as well for this even if you could somehow get a suitable dataset.
Same for a ViLT model fine-tuned for VQA, it wouldn't work as well imo
You can use some traditional image processing techniques to preprocess the images so that the license plate numbers become more prominent and then run OCR on that
You could also try segmentation but that'd be overkill since detection is the right task here

errant bison
potent sky
#

One simpler way could be to detect contours and extract the probable license plate regions from there, tho I don't think this would perform very well

#

Since it has to be in cars and cars will probably be upright you could also design specific kernels or structuring elements to detect these license plates

potent sky
# errant bison thanks for the explanation, can u pls provide any documentation

I mean, yolo, tensorflow, pytorch docs are available online but here you go:
https://docs.ultralytics.com/
If you're asking for a tutorial then maybe this'll help
https://medium.com/saarthi-ai/how-to-build-your-own-ocr-a5bb91b622ba

potent sky
errant bison
#

YOLO means opencv?

#

or am i mixing things

plain drift
#

i explored pre-trained OCR tools for the first time in a while recently. before that, i'd been so disappointed with tesseract that i figured paying for google's vision module was my only hope for any ocr-reliant project. but this time i found paddleocr pretty impressive -- though even then I had to preprocess the input images a little to get useful outputs

#

there's a HF space that lets you try it out somewhere. will look to find it unless you're looking to train your own tool?

hasty mountain
pseudo spire
# tepid parcel What is the problem? My English is so weird at that point?

The problem is... Judging from your questions, it is very likely that you are trying to find a ready-to-use recipe in a particular domain area (or areas). This judgment could be wrong, but since you haven't been able to come up with more practice-oriented questions, it very much seems so.

Also, it might be that in that particular domain area (or areas) a ready-to-use recipe either doesn't exist or hasn't been made public. So... yeah.

Also, if there were a publicly available ready-to-use recipe, then you (as an AI professional, let's imagine that) would not be needed at all. Anyone who knows how to operate a computer to some extent would be able to follow a ready-to-use recipe and achieve the needed result.

So... step-by-step guides are not an option when you want to create something new / a breakthrough. However, step-by-step guides can be used at some stages of learning... It's important, though, that the learning path includes something else besides following step-by-step guides, otherwise the learning would not be complete.

pseudo spire
# tepid parcel What is the problem? My English is so weird at that point?

So I previously mentioned AI is the best in solving the following practical tasks:

  1. Classification
  2. Math regression

Actually I was wrong (I am learning as well), and there are more tasks/computational methods which can be well executed with AI:

  1. Classification
  2. Math regression
  3. Clustering
  4. Dimensionality reduction
  5. seq2seq

So as an AI professional, you start to think about whether you can apply any of the mentioned methods to a particular domain area, and if yes, how exactly. E.g., the "clustering" task/method is very relevant to social connections, I believe.

tepid parcel
# raw compass I mean it looks like that you already got an answer.

Yes, AI projects with ML are experimental, with testing and failing and all that. Maybe I was misunderstood. I know that nobody knows exactly which data science and AI subjects cover complex systems science. I just haven't dived into this area before, so I want to understand it better before I start studying it. I thought I would hear something like "Dude, if you want to be not only a professional but a real researcher, study subjects x, w, y, z", or maybe "in subject x you can start with this or that; in w together with y you will learn something".
In addition to that, I need to know about the career too: the experiences, opportunities, specializations, future prospects, etc.

pseudo spire
#

One must not only "know", but also be able to apply this knowledge. This is usually achieved via extensive practice in a particular domain area and gaining practical experience.

raw compass
pseudo spire
#

This is what we do in universities, isn't it? It's not only reading about x, w, y, z, but also some practical work. Or more precisely, scientific work (especially for PhD candidates). There is a distinct difference between practical work and scientific work. And this difference is novelty, pushing forward the borders of science.

raw compass
pseudo spire
potent sky
potent sky
#

Yolo is more like a set of techniques by now rather than one single model, but fair enough
We're at YOLOv8 now

plain drift
#

my domain was reading timestamps from screenshots of the game league of legends. surprisingly, even though these timestamps were way more well-structured than the canonical challenge for an OCR system (e.g. handwriting), this was too hard for tesseract and the failure rate was at least 10% (didn't actually measure)

potent sky
plain drift
#

yeah i can't help but think i was doing something wrong. but either way, paddleocr proved easier to get working

potent sky
#

Did you do preprocessing on the images? Game screenshots can be kindof flashy

plain drift
#

yeah! i think i did some transformations to increase contrast and whatnot. what ended up being important for getting paddleocr to work was to basically make sure the parts of the image i snipped from these screenshots to read were superimposed on larger images -- e.g. a 150x150 image of black pixels. My guess is that if the ratio of "text" to background is unusual relative to its training set, the model can be thrown off.
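
A minimal sketch of the "superimpose the snippet on a larger black image" step described above, using only NumPy. The function name and the centering choice are assumptions made for illustration; the 150x150 size comes from the message, and nothing here calls paddleocr itself:

```python
import numpy as np

def paste_on_canvas(crop, canvas_size=(150, 150)):
    """Center a small image crop on a larger all-black canvas."""
    crop = np.asarray(crop)
    crop_h, crop_w = crop.shape[:2]
    canvas_h, canvas_w = canvas_size
    if crop_h > canvas_h or crop_w > canvas_w:
        raise ValueError("crop is larger than the canvas")
    # Keep any channel dimension (e.g. RGB) and the dtype of the crop.
    canvas = np.zeros((canvas_h, canvas_w) + crop.shape[2:], dtype=crop.dtype)
    top = (canvas_h - crop_h) // 2
    left = (canvas_w - crop_w) // 2
    canvas[top : top + crop_h, left : left + crop_w] = crop
    return canvas

# Hypothetical 20x60 RGB snippet standing in for a cropped timestamp.
snippet = np.full((20, 60, 3), 255, dtype=np.uint8)
padded = paste_on_canvas(snippet)
print(padded.shape)
```

The padded array can then be handed to whichever OCR tool is in use.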

tepid parcel
# raw compass you are getting answers arent you?

Not really. How will this advice get me a PhD in Complex Systems?
The advice isn't targeted enough, I think. I know this server doesn't have channels for the different areas where AI is applied, but that's what I'm asking about. To be honest, clear and direct, I already know most of what was said here. But I'm happy to know that I am on the right path!

pseudo spire
#

You will not get a PhD from advice. Full stop.

tepid parcel
potent sky
raw compass
plain drift
#

it's really interesting how janky these AI models are. Felt like negotiating with an alien

pseudo spire
potent sky
# tepid parcel Not really PhD, I exaggerated, but apply AI knowledges

You mentioned I think social sciences and complex systems. These are pretty broad fields and like you said most of the advice we can give in this setting are things you already know: explore relevant courses, understand the ML math deeply, consult books on the topic etc.
What other advice are you looking for? /gen

#

I hope this doesn't come off as harsh, I'm genuinely trying to understand what other advice you're expecting

#

Maybe I missed something in between, quite possible

tepid parcel
pseudo spire
#

Which year student?

tepid parcel
tepid parcel
potent sky
#

Fair enough. I don't think I specialise in those fields so there's not much more I can say.
Except that combining ai-ml with any other field will demand a deep understanding, so it is really recommended that you dive deep into the math behind it and how things work under the hood (esp for complex systems ig)
GL!

tepid parcel
potent sky
#

Search up courses on this online by other universities, see the professors who're taking the class, maybe contact them and ask for advice? Or the students enrolled in those courses?

#

Or people in the industry working in related positions, can get that through a LinkedIn search

brave zenith
#

Is there any way to use opencv to render footage live

#

from an api or a server hosted

#

using google colab

hasty mountain
#

Ugh... GANs are already annoying enough, and then when I search for a way to make a Variational AutoEncoder that doesn't generate blurry images...the papers on this tell me to use a VAE with a GAN configuration

#

I guess I'll just turn to huggingface and use a pretrained VAE, then...

iron basalt
hasty mountain
#

But I'm having some trouble in getting a trained VAE

#

And I'm having a hard time trying to find the architecture of the VAE they used for Stable Diffusion.

hasty mountain
#

I also see folks commenting about MSELoss... Really, I don't know how they manage to make a VAE work with colored images using a MSE Loss. At least my VAEs only work with colored images when I use Gaussian Log Likelihood

#

Interesting... I knew this for GANs, the Progressive Grow, but I didn't expect that this would be necessary for VAEs, too.

#

Thanks, Squiggle! I'll run some tests.

#

But first, just a single VAE in 8x8 or 16x16 images...just to make sure

hasty mountain
#

Hm... Isn't the target for the encoding loss just a normal distribution for each encoder? Could it be a distribution over more complex features?

Now this got even more interesting...I'm already thinking about information entropy. Too bad it tends to make things too computationally expensive

potent sky
hasty mountain
#

Even when the loss appears to have stabilized, after 3 or 4 epochs it decreases some 3 or 4 points, then stabilizes again, and so on...

#

Too bad I still don't know enough about how to implement Genetic Algorithms...it may be interesting to use stochastic gradient descent for a VAE and, when the model begins to take too much time to get better, maybe perform some evolutionary optimization in parallel...

mint palm
#

does gpu parallelism in pytorch just mean:
send model to each gpu
divide input into n parts

nothing else?
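
One piece the two steps above leave out, at least for data-parallel training as it is usually described, is that the gradients computed on each shard are combined before the weights are updated. A NumPy-only sketch of that idea (no GPUs and no PyTorch calls; the linear model, sizes and names are all made up for illustration):

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of the mean squared error of a linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # one batch of 8 samples
y = rng.normal(size=8)
w = rng.normal(size=3)        # every "worker" holds the same copy of w

n_workers = 4
shard_grads = [
    mse_gradient(w, X_shard, y_shard)
    for X_shard, y_shard in zip(np.split(X, n_workers), np.split(y, n_workers))
]
# Averaging equal-sized shard gradients recovers the full-batch gradient.
averaged_grad = np.mean(shard_grads, axis=0)
full_grad = mse_gradient(w, X, y)
print(np.allclose(averaged_grad, full_grad))
```

So "copy the model, split the input" is the forward half; the gradient averaging is what keeps the copies identical after each step.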

modest mulch
mint palm
dusk tide
#

I am working on the movies dataset on Kaggle (link here: https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset) and I am practicing data cleaning. There is a column named `production_countries` which lists all the countries where a movie was produced. There are a lot of NaN values in this column, so I decided to replace all of them with the most frequently occurring country codes (like US, FR, GB). I cannot do this with `fillna()` because it wants a scalar or dictionary value, but the values that I have are lists. So I found another way, using the `random` module: `prod_countries_nans = movies4['production_countries'].isna()` then `prod_countries_nans_length = sum(prod_countries_nans)` then `replacement = random.choices([['US','DE'],['GB','FR']], weights=[.5 , .5], k=prod_countries_nans_length)` then `movies4.loc[prod_countries_nans,'production_countries'] = replacement`. This tries to distribute the countries evenly and stores them in `replacement`. But I am receiving the error `Must have equal len keys and value when setting with an ndarray`. `prod_countries_nans_length` and the length of `replacement` are the same, 6206, and the same approach worked fine for genre and production companies. Can anyone tell me what's wrong?

errant lake
mellow quarry
cold osprey
#

!code

arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

errant lake
#

This will replace all NaNs in your column with a random choice from list choices
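
The snippet this message refers to did not survive in the log. As a stand-in, here is one way it could be done (a sketch, not necessarily what was posted): build the filled column as a plain Python list and wrap it in an object-dtype Series, which sidesteps assigning a list of lists through `.loc`. The function name and the `seed` parameter are made up for this example.

```python
import random

import numpy as np
import pandas as pd

def fill_nan_with_random_lists(series, choices, weights=None, seed=None):
    """Replace NaN entries of a list-valued column with randomly drawn lists."""
    rng = random.Random(seed)
    is_missing = series.isna().to_numpy()
    picks = iter(rng.choices(choices, weights=weights, k=int(is_missing.sum())))
    # Copy each drawn list so rows do not share one mutable object.
    values = [
        list(next(picks)) if missing else value
        for value, missing in zip(series.tolist(), is_missing)
    ]
    return pd.Series(values, index=series.index, dtype=object, name=series.name)

# Toy column standing in for movies4['production_countries'].
column = pd.Series([["US"], np.nan, ["FR", "GB"], np.nan], name="production_countries")
choices = [["US", "DE"], ["GB", "FR"]]
filled = fill_nan_with_random_lists(column, choices, weights=[0.5, 0.5], seed=0)
print(filled.isna().sum())
```

Assigning the result back would then be `movies4['production_countries'] = filled`, assuming the same index.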

raw compass
#

I was thinking about something interesting after seeing AUTO-GPT. How is it possible to use an AI agent that can control your entire operating system, open application files, and perform various tasks?

oblique quarry
#

Hello, could I ask for help? It's about a question concerning an MLP. I have just started my journey into the wide land of ml, so I'm fairly new, which is why I'd like to ask some more advanced programmers if they could help me spot the error in my code

plain drift
# raw compass I was thinking about something interesting after seeing AUTO-GPT. How is it poss...

This is maybe not exactly how it's being implemented. But when you have an AI agent like ChatGPT write a script for you, you are already having it specify a sequence of system operations that achieve a desired result. So from there, the only additional step you haven't automated is the execution of that script on your system. To use an AI agent that can control your entire operating system, you just automate that step too.

raw compass
past meteor
plain drift
#

Sure. So one scripting language for controlling operating systems is bash. ChatGPT can write bash scripts to execute system operations. You can make an app that the user grants system access, prompts ChatGPT to create scripts that suit a specified need (e.g., through an API call), and then executes that script.

raw compass
#

so like extend python with c.

plain drift
#

I guess it depends on your project-specific needs. But you can mostly stick to Python for this. Python will talk to ChatGPT through an API, and then initiate system commands based on ChatGPT's responses. Python's os library supports issuing system commands.

#

chatgpt is an idiot though. could do crazy irreversible things. careful!
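
Given that warning, a cautious version of "Python initiates system commands based on the model's responses" might only run commands whose program is on an allowlist. The sketch below uses the standard-library `subprocess` module (a substitute for the `os` calls mentioned above) and leaves the API call out entirely; `run_suggested_command` and the allowlist idea are assumptions for illustration, not part of any existing tool.

```python
import shlex
import subprocess
import sys

def run_suggested_command(command, allowed_programs, timeout=10):
    """Run a model-suggested command only if its program is allowlisted."""
    args = shlex.split(command)
    if not args or args[0] not in allowed_programs:
        raise PermissionError(f"refusing to run: {command!r}")
    # No shell=True: the command is passed as a list, not interpreted by a shell.
    result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    return result.stdout

# Stand-in for a command that would come back from the model.
suggested = shlex.join([sys.executable, "-c", "print('ok')"])
output = run_suggested_command(suggested, allowed_programs={sys.executable})
print(output)
```

An allowlist does not make this safe on its own, but it keeps the irreversible commands out by default.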

oblique quarry
oblique quarry
past meteor
#

I also don't think writing a MLP from scratch and manually computing gradients is a good exercise.

#

If you're going to implement neural nets from scratch you should implement automatic differentiation (aka autograd) https://en.wikipedia.org/wiki/Automatic_differentiation by yourself and then use that to make your neural network library

In mathematics and computer algebra, automatic differentiation (auto-differentiation, autodiff, or AD), also called algorithmic differentiation, computational differentiation, is a set of techniques to evaluate the partial derivative of a function specified by a computer program.
Automatic differentiation exploits the fact that every computer pr...
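
To give a feel for what "implement autograd yourself" involves, here is a very small reverse-mode sketch for scalars. It is only an illustration under simplifying assumptions (scalars only, three operations, a made-up `Value` class), not a full library:

```python
import math

class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents
        self._backward_fn = None  # pushes this node's grad to its parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad
        out._backward_fn = backward_fn
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward_fn = backward_fn
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def backward_fn():
            self.grad += (1.0 - t * t) * out.grad
        out._backward_fn = backward_fn
        return out

    def backward(self):
        # Topological order guarantees a node's grad is complete before it is used.
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            if node._backward_fn is not None:
                node._backward_fn()

x, y = Value(2.0), Value(3.0)
z = x * y + x          # dz/dx = y + 1, dz/dy = x
z.backward()
print(x.grad, y.grad)
```

A neural network layer is then just many of these operations chained together, and the weight update reads `.grad` from each parameter.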

raw compass
oblique quarry
past meteor
oblique quarry
#

Appreciate it, will read through it

past meteor
#

Just as you wrote

night prawn
placid cedar
#

hi

#

currently doing my project, and these are the project suggested steps

#

shouldn't missing value imputation be before train and test split?

#

and the transformation of numerical variables using proper transformation methods be after the train test split?

#

would appreciate anyone's help here 🙂

past meteor
placid cedar
#

its an assignment apparently

#

the lecturers suggested these steps

#

and i felt that the arrangement was kinda flawd

#

flawed

#

unless im wrong

past meteor
#

What's flawed in your opinion?

placid cedar
#

shouldnt missing values imputation be before the train and test split?

past meteor
#

No

#

It's flawed because the first thing you should do is the test train split

placid cedar
#

o?

past meteor
#

You want to estimate the performance of your method on truly unseen data. If you explore it with .describe() and remove outliers from it, it is not truly unseen

potent sky
past meteor
#

You're just introducing bias into your results

potent sky
past meteor
#

Measure the performance of your methods like you would if you deploy them into production

placid cedar
#

it also stated that the transformation methods are done before the train test split

past meteor
#

That's bogus

placid cedar
#

shld it be after split?

potent sky
#

Hmm that should be done after

past meteor
#

Ideally everything is done after. In reality this is not the case but you should postpone as much as possible unless you can't

placid cedar
#

mmm alright

past meteor
#

Imagine your split makes it such that all the outliers are in your test set? If you split after removing outliers you've manipulated your test set

#

Which means your test set is not truly unseen. You saw it and you touched it.

wooden sail
#

such a good example, oof

placid cedar
#

makes sense

#

then what about the missing value imputation?

#

what happens if i perform it before the train test split

past meteor
#

The same

potent sky
#

Imputation will also have similar problems. The values filled in will be representative of the whole dataset
So the imputed values in your train set will have been influenced by the test set, rendering it not unseen

#

Only dropping records or features is fine before split I think

past meteor
#

You're most likely doing some mean value imputation right? You're using information from unseen data to determine the mean and using that to impute the values

past meteor
placid cedar
#

so it means i should do the transformation methods such as logarithm transformation before the splitting?

#

so sorry for pestering too much yeah, im pretty new to this mb

past meteor
#

No no, this is a good discussion tbh because this is something some of my colleagues struggle with 💀

placid cedar
#

oh damn

#

this really isn't easy at the start ngl

past meteor
#

Another thing to notice is that the transformations your lecturer wants you to do are linear regression specific

potent sky
past meteor
#

You can either 1) Eyeball the data and determine you need a log / box-cox / ... transform beforehand OR 2) make your linear regression model and study the residuals. If the residuals have any structure then you apply a transform afterwards (easier)

potent sky
past meteor
#

Removing outliers is something you should do immediately though yeah

#

Because linear regression is very sensitive to outliers. I'll look for an image, sec

placid cedar
#

so after i performed my train test split, lets say i want to use a mean value imputation. since im going to split it 70% and 30%, so the missing values in the 70% are going to be replaced by the mean of the values of that 70%?

potent sky
#

Yes

#

Which is good

placid cedar
#

ohhh ok i see

#

how would i explain why its good?

potent sky
#

If it was based on all the 100% then some information is "leaking" from the test set to the train set
Which means the test set isn't exactly "unseen" anymore.
Obviously we want the test set to be unseen
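
A small pandas sketch of the split-then-impute order discussed above: the mean is computed from the 70% only and reused for the 30%. The helper name, the toy column and the shuffling scheme are assumptions for illustration:

```python
import numpy as np
import pandas as pd

def split_then_impute(df, test_fraction=0.3, seed=0):
    """Split first, then fill missing values using training-set means only."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(df))
    n_test = int(round(len(df) * test_fraction))
    test = df.iloc[shuffled[:n_test]].copy()
    train = df.iloc[shuffled[n_test:]].copy()
    # The test rows never contribute to these means.
    train_means = train.mean(numeric_only=True)
    return train.fillna(train_means), test.fillna(train_means), train_means

data = pd.DataFrame({"x": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0, 9.0, 10.0]})
train, test, train_means = split_then_impute(data)
print(len(train), len(test))
```

The test set is filled with the training mean too, since that is all a deployed model would have available.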

past meteor
past meteor
potent sky
wooden sail
#

if you wanted to do a fair imputation, you'd do it based only on the 70%, which would introduce some amount of bias unless you have a fair amount of knowledge on the statistics of the problem you're dealing with

past meteor
potent sky
#

This is a good clear example of leakage btw. Ngl a lot of times it's very difficult to spot leakage in complex data pipelines

past meteor
#

Like, the whole point of this is having a high fidelity estimate of how your model will perform in the wild

potent sky
#

I was referring to performance in the wild. That it will probably have lower performance since I've removed rows that I know would occur in reality

past meteor
potent sky
#

My point is only that it's different to bias

placid cedar
#

thanks so much for ur support and answers anyway guys, hope to see you all next time when i have more queries. really nice learning more from y'all!

placid cedar
#

yesh

potent sky
past meteor
#

Do the box-cox or log transforms after step 8

#

So load in the data, split immediately, explore it, remove outliers, impute missing values, one hot encoding, make features and then make a model
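
The same "learn it on train, apply it to test" rule carries over to the one hot encoding step in that list. A sketch with pandas, where the column name and categories are invented for the example:

```python
import pandas as pd

def one_hot_from_train(train, test, column):
    """One-hot encode using only the categories seen in the training set."""
    train_encoded = pd.get_dummies(train[column], dtype=int)
    # Align the test columns to the training columns: categories unseen in
    # training are dropped, categories missing from test become all-zero columns.
    test_encoded = pd.get_dummies(test[column], dtype=int).reindex(
        columns=train_encoded.columns, fill_value=0
    )
    return train_encoded, test_encoded

train_df = pd.DataFrame({"color": ["red", "blue", "red"]})
test_df = pd.DataFrame({"color": ["blue", "green"]})
train_oh, test_oh = one_hot_from_train(train_df, test_df, "color")
print(list(test_oh.columns))
```

An unseen category such as "green" ends up as a row of zeros, which mirrors what would happen in deployment.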

placid cedar
#

yep gotcha!

past meteor
#

Step 9 is investigate the residuals (the error you're making) with respect to all of your variables. Afterwards if you see "structure" in these residuals then you go for box-cox, log transform or binning
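
As an illustration of "structure in the residuals", the toy example below fits a straight line to data that is really quadratic, then repeats the fit after a log transform. The data and the way structure is measured (correlation with a curvature term) are assumptions chosen to keep the sketch short:

```python
import numpy as np

def fit_line(x, y):
    """Least-squares straight line; returns slope, intercept and residuals."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return slope, intercept, residuals

x = np.linspace(1.0, 10.0, 50)
y = 2.0 * x**2  # toy relationship, deliberately not linear in x

_, _, raw_residuals = fit_line(x, y)
_, _, log_residuals = fit_line(np.log(x), np.log(y))

# If residuals track a curvature term, the straight line is missing something.
curvature = (x - x.mean()) ** 2
raw_structure = abs(np.corrcoef(curvature, raw_residuals)[0, 1])
print(raw_structure, np.abs(log_residuals).max())
```

In practice the check is usually a residual plot against each variable; the correlation here just stands in for eyeballing that plot.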

placid cedar
#

yes sirrrr

past meteor
# potent sky My point is only that it's different to bias

So bias in statistics is just that the estimate you have is different to that of the actual quantity. So even if your sample is infinite you do not converge to y_true. The quantity you're trying to estimate in model evaluation is "how well is my model performing in the wild". If you remove a certain % of the data it may see in the wild you will converge to an estimate that is "biased", it will not accurately reflect the performance it should have had

#

I think you're mixing it up a bit with "if I drop valid rows my model will get less data and be worse"

potent sky
past meteor
#

I think in the limit the problem I described is worse. Why? Companies like Zillow lost billions of dollars because they deployed bad ML models. This can likely be ascribed to not having high fidelity estimates of how well the model is performing.

#

I don't think I can explain it any differently or better than I have now 😅 . It's a bit unintuitive and poorly taught in school so it might take a while for this to make sense idk

potent sky
#

Ensuring fidelity in model estimates has multiple aspects. One is to have test data representative of real data. Another is to ensure the model is not biased towards the test data.
Different things

past meteor
#

What does "biased towards the test data" mean?

potent sky
#

I was referring to the second aspect and how dropping records or features will not lead to this.
It might lead to the first one

potent sky
past meteor
#

Dropping records will not lead to the latter but imo that's a moot point

potent sky
#

So learning in the training data biases model towards the train data distribution. This is why we need a test data to have a good fidelity estimate.
Hence we must ensure the training data (and the model by extension) isn't biased towards the test data

past meteor
#

Scenario #2 is a lesser of both evils yes

#

It'll lead to your estimates not necessarily being higher than they're supposed to be (which happens if you leak from test -> train) but they'll still be off

#

Unless maybe you dropped the rows at random

potent sky
past meteor
#

Okay so a concrete example from my work 😄

#

We have multiple time series. For now our independent variables (X) are lags. We use the last n lags to predict n+1. If there is an interruption in our time series longer than for example 45 minutes between any of the lags we can drop that observation

#

This is a rule we made before looking at the data. The 45 min etc. is arbitrary, it could've been 2 hours or something else

#

In the real world if we deploy our system I don't think we'd make any prediction in these cases, we don't have enough information. In that sense dropping them does not lead to any bias
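
A sketch of the lag setup described above, with the "drop the observation if any gap is too long" rule applied. The column names `timestamp` and `value`, the helper name and the toy data are assumptions; the 45-minute threshold is the one from the example:

```python
import pandas as pd

def make_lag_frame(df, n_lags, max_gap):
    """Build lag features and drop rows whose lag window spans a long gap."""
    df = df.sort_values("timestamp").reset_index(drop=True)
    frame = pd.DataFrame({"timestamp": df["timestamp"], "target": df["value"]})
    for k in range(1, n_lags + 1):
        frame[f"lag_{k}"] = df["value"].shift(k)
    # 1.0 where the step from the previous observation is longer than allowed.
    gap_too_long = (df["timestamp"].diff() > max_gap).astype(float)
    # A row is affected if any of the n_lags steps leading up to it is too long.
    window_has_gap = gap_too_long.rolling(n_lags).max() > 0
    return frame[~window_has_gap].dropna().reset_index(drop=True)

minutes = [0, 10, 20, 30, 120, 130, 140, 150]  # note the 90-minute interruption
series = pd.DataFrame({
    "timestamp": pd.Timestamp("2023-01-01") + pd.to_timedelta(minutes, unit="min"),
    "value": list(range(8)),
})
lagged = make_lag_frame(series, n_lags=2, max_gap=pd.Timedelta(minutes=45))
print(lagged)
```

Because the rule depends only on timestamps, it can be applied identically at prediction time, which is the point made above about not introducing bias.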

#

I'm a bit "annoying" in this respect, before we start modelling I always want to have a meeting where everyone sits together (tech/non-tech) where we discuss how we evaluate and how we'll drop rows, ...

potent sky
#

Precisely. I agree.
My point was that if you had set it to 10 secs, it would introduce the first kind of bias, in that you would have to make those preds on deployment but you're dropping those records from your data
However it wouldn't introduce the second kind of bias, i.e. biasing the model towards the test data
We would still have low fidelity estimates tho. This becomes the problem of our test data distribution not being representative of the deployed data distribution

past meteor
#

Maybe I exaggerate as well, idk

#

I'm jealous that other "engineering" disciplines like mechanical engineering etc. take the evaluation of their products so much more seriously.

potent sky
#

Ahaha it's difficult to convey the entire sense over text, that too without math

potent sky
past meteor
#

It's normal there to have a meeting on product / material evaluation before you even start building the pieces. You don't want to test the brakes of a car in different situations than it will be used on the road etc. in data science it's less common to be very very rigorous about this

potent sky
#

Exactly

#

Some more preparation than "throw a black-box at the problem and pray it works" lol

#

Anyw, fantastic discussion! gtg now tho

median leaf
#

anyone understand how to interpret this

narrow crane
#

is there a discord server related to data science?

pseudo spire
placid cedar
#

anyone online now?

#

quick question, are you able to check for r square and mse if there are columns that contain categorical data in your dataset

past meteor
#

Huh

past meteor
placid cedar
#

oh, because the lecturer asked us to like, for instance, the transformation methods

#

he asked us to like for each numerical variable

#

try out the different methods

#

and see which one gives the best score or smth

#

but i still have some categorical data, so shld i js encode it first, and then do some transformation, and then use the r square and mse test?

past meteor
#

You try out the different methods, make predictions and compare the rmse they were making on the test set

#

So method 1 -> predictions_1 -> rmse_1 vs method 2 -> predictions_2 -> rmse_2
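
The comparison can be sketched in a few lines of NumPy. The method names and numbers below are placeholders, not results from the assignment:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def compare_methods(y_test, predictions_by_method):
    """Score each method's test-set predictions and report the lowest RMSE."""
    scores = {name: rmse(y_test, preds) for name, preds in predictions_by_method.items()}
    best = min(scores, key=scores.get)
    return best, scores

y_test = [1.0, 2.0, 3.0]
best_method, scores = compare_methods(
    y_test,
    {"method_1": [1.0, 2.0, 3.0], "method_2": [2.0, 3.0, 4.0]},
)
print(best_method, scores)
```

If a transform was applied to the target itself, the predictions need to be mapped back to the original scale before computing the RMSE, otherwise the scores are not comparable.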

placid cedar
#

aite

half bramble
#

Hello fellas

#

I’m new in Data science

#

What ur suggestions for better understanding..?🙏🏽

past meteor
#

Very hard question to answer because it depends on what you already know

placid cedar
#

what numerical imputation should i use for this case?

#

i feel that mean wld be good here as the data is quite evenly distributed without any form of extreme values

#

or maybe median

half bramble
past meteor
#

If it's rusty I'd start there tbh

#

And then pick up a canonical data science book and proceed from there

half bramble
half bramble
past meteor
#

Being not good

half bramble
past meteor
half bramble
#

Thanks❤️

pseudo spire
#

math statistics is a vital skill, so if you are in university pay attention to it

sleek harbor
#

is target encoding often used in the field, and is it something one should necessarily know? Every way I look at it looks like the whole idea is a target leakage hazard.. I don't like it...

half bramble
half bramble
past meteor
#

Like if a few of your columns have very high cardinality if you'd one hot them

agile cobalt
past meteor
#

It's encoding a level of your feature (a categorical variable) by, say, the mean of the target when that level of the categorical variable appears

agile cobalt
#

oh

#

...isn't that still just (the) one-hot encoding (case listed there)?

past meteor
#

No

agile cobalt
#

oh wait I see

past meteor
#

`[{"colA": "foo", "target": 1}, {"colA": "bar", "target": 2}, {"colA": "foo", "target": 7}]` colA gets replaced with 4 in case it's foo and 2 in case it's bar

#

Bit handwavy but one hot "retains" information better but it sucks, so much, for high cardinality stuff sometimes
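
The foo/bar example above, written out with pandas. To address the leakage worry, the mapping is learned from the training rows only and unseen levels fall back to the overall training mean; the fallback choice and the function names are assumptions of this sketch:

```python
import pandas as pd

def fit_target_encoding(train, column, target):
    """Learn the mean target per category from the training rows only."""
    mapping = train.groupby(column)[target].mean()
    fallback = train[target].mean()  # used for categories unseen in training
    return mapping, fallback

def apply_target_encoding(frame, column, mapping, fallback):
    """Replace each category with its learned mean target."""
    return frame[column].map(mapping).fillna(fallback)

train_rows = pd.DataFrame({"colA": ["foo", "bar", "foo"], "target": [1, 2, 7]})
new_rows = pd.DataFrame({"colA": ["foo", "bar", "baz"]})

mapping, fallback = fit_target_encoding(train_rows, "colA", "target")
encoded = apply_target_encoding(new_rows, "colA", mapping, fallback)
print(encoded.tolist())
```

Even done this way, each training row's own target contributes to its encoding, which is why smoothed or out-of-fold variants are often suggested for small categories.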

merry ore
#

hello folks! hope you are having a great day! We are working on writing an ML pipeline for a bunch of models, in Rust and Python. So, we have been looking at various inference engines. So far we have tried OnnxRuntime and tflite. While working with Onnx we noticed that it also has a bunch of ExecutionProviders. It has support for CoreML on Apple devices. We also noticed another package, coremltools, by Apple. Both can give us inference results. Can anybody tell me the differences between the two? Do they call the same backing code, or are there any significant differences?