serene scaffold Nov 1, 2023, 6:13 PM

#

ML is applied math. so you'd be learning stats, calculus, and linear algebra. there are resources in the pins.

timid dune Nov 1, 2023, 6:16 PM

#

serene scaffold ML is applied math. so you'd be learning stats, calculus, and linear algebra. th...

nice, thank you

abstract wasp Nov 1, 2023, 6:23 PM

#

Help, I'm making an autoencoder, my error happens when I call autoencoder.fit:
ValueError: y argument is not supported when using dataset as input.
This is my code:
#IMPORTS
#LOADING DATASET
(x_train, x_test), ds_info = tfds.load('celeb_a', split=('train', 'test'), shuffle_files=True, as_supervised=False, with_info=True)
x_train = x_train.map(lambda x: {'image':(x['image']/255)})
x_test = x_test.map(lambda x: {'image':(x['image']/255)})
print(x_train)
print(x_test)

class Autoencoder(Model):
    def __init__(self, latent_dim, shape):
        super(Autoencoder, self).__init__()
        self.latent_dim = latent_dim
        self.shape = shape
        self.encoder = tf.keras.Sequential([
            layers.Input(shape=shape),
            layers.Conv2D(16, (3, 3), activation='relu', padding='same', strides=2),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(32, (3, 3), activation='relu', padding='same', strides=2),
            layers.MaxPooling2D((2, 2)),
            layers.Flatten(),
            layers.Dense(latent_dim, activation='relu')])
        self.decoder = tf.keras.Sequential([
            layers.Input(shape=(latent_dim,)),
            layers.Dense(tf.math.reduce_prod(shape), activation='relu'),
            layers.Reshape(shape),
            layers.Conv2D(32, (3, 3), activation='relu', padding='same', strides=2),
            layers.UpSampling2D((2, 2)),
            layers.Conv2D(16, (3, 3), activation='relu', padding='same', strides=2),
            layers.UpSampling2D((2, 2)),
            layers.Conv2D(8, (3, 3), activation='relu', padding='same', strides=2)])

    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

#

shape = (218, 178, 3)
latent_dim = 64
autoencoder = Autoencoder(latent_dim, shape)
autoencoder.compile(optimizer='adam', loss=losses.MeanSquaredError())
autoencoder.fit(
    x_train, x_train,
    epochs=10,
    shuffle=True,
    validation_data=(x_test, x_test))

desert oar Nov 1, 2023, 7:07 PM

#

maybe not quite what you're looking for, but you can call R using rpy2 https://rpy2.github.io/ and use this https://search.r-project.org/CRAN/refmans/misty/html/na.test.html

blazing oxide Nov 1, 2023, 7:19 PM

#

abstract wasp Help, I'm making an autoencoder, my error happens when I call `autoencoder.fit`:...

Hey there! 👋 It seems like you're having trouble with your autoencoder model. The error "ValueError: y argument is not supported when using dataset as input" usually happens when you're trying to fit a model with a dataset as input, but you're also providing a target y argument.

In your case, you're using tf.data.Dataset objects for training and testing data. When you call fit on your autoencoder model, you're providing x_train as both the input data and the target data. However, when using a tf.data.Dataset, you should only provide it as the first argument to fit, and not as the target data.

Here's how you can modify your fit call:

autoencoder.fit(
    x_train,
    epochs=10,
    shuffle=True,
    validation_data=x_test)

In this case, your x_train and x_test datasets should yield pairs (input_batch, target_batch). But since you're working with an autoencoder, your input data is the same as your target data. So, you might need to modify how your datasets are created to yield pairs of the same data.

I hope this helps! If you have any more questions or need further clarification, feel free to ask! 😊

cerulean kayak Nov 1, 2023, 8:06 PM

#

so I tried to use Optuna and it's still taking quite a while. I realized the main problem is I don't know what range of values I should use for my hyperparameters, which are:

max_features (sqrt or log2 so not applicable)
n_estimators
max_depth
I've tried looking up on "how to know what range of values to test for in random forest hyperparameter tuning", but I just get a bunch of results explaining how to tune them, while each one just chooses a random value for the start and another random value for the end.

Is there a way to know what range of values I should use based on the amount of data I have?

desert oar Nov 1, 2023, 8:42 PM

#

blazing oxide Hey there! 👋 It seems like you're having trouble with your autoencoder model. T...

do we have chatgpt now?

blazing oxide Nov 1, 2023, 8:43 PM

#

desert oar do we have chatgpt now?

Why you think that XC

desert oar Nov 1, 2023, 8:43 PM

#

oh damn your username is just "discord"

#

i'm so old 🤣

blazing oxide Nov 1, 2023, 8:44 PM

#

desert oar oh damn your username is just "discord"

Cause I am

desert oar Nov 1, 2023, 8:44 PM

#

cerulean kayak so I tried to use Optuna and it's still taking quite a while. I realized the mai...

what values are you using?

#

you can save a lot of time here by using "warm starts" and gradually increasing n_estimators. or halving random search with the same

#

max features i don't think is worth cross validating over

#

so that's just max depth, at which point you're searching over a single parameter

blazing oxide Nov 1, 2023, 8:45 PM

#

desert oar do we have chatgpt now?

I also did a post in #1051603408597024828 so don't think that if I write well it means I use AI

cerulean kayak Nov 1, 2023, 8:45 PM

#

desert oar so that's just max depth, at which point you're searching over a single paramete...

...also n_estimators...

cerulean kayak Nov 1, 2023, 8:48 PM

#

desert oar what values are you using?

my input is a series of images processed from tensorflow into numpy arrays (5,712 training and 1311 testing)
atm I'm doing

    n_estimators=trial.suggest_int("n_estimators",100,400,step=20)
    max_depth=trial.suggest_int(log=True,name="max_depth",step=3,low=2,high=32)

desert oar Nov 1, 2023, 8:51 PM

#

cerulean kayak ...also n_estimators...

kind of, but that's what i mean about the warm starts thing. you can treat n_estimators as a special case. conceptually, you would do something like this (assuming scikit-learn):

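from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score as score  # assumption: any metric works here

model_base = RandomForestClassifier()
scores = {}
for max_depth in [1, 2, 4, 6, 8]:  # max_depth must be >= 1 (or None)
    model = clone(model_base)
    model.set_params(max_depth=max_depth, warm_start=True)
    # n_estimators must be >= 1; with warm_start, each fit only trains the newly added trees
    for n_estimators in [1, 50, 100, 150, 200]:
        model.set_params(n_estimators=n_estimators)
        model.fit(x_train, y_train)
        score_train = score(y_train, model.predict(x_train))
        score_eval = score(y_eval, model.predict(x_eval))
        scores[(max_depth, n_estimators)] = (score_train, score_eval)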

#

that should save you a fair amount of individual decision tree fits

#

also i believe sklearn supports transparently parallelizing RandomForestClassifier

#

(are you training on image embeddings? why not just keep them going through a fully connected hidden layer at the end of whatever NN you're already using?)

#

so if you want to use optuna for max_depth you can, but you might not need it

#

also i don't know that incrementing n_estimators from 100 to 400 in steps of 20 is a good use of cpu cycles. seems excessive

#

the other option is halving search, which works well on tree ensembles and NNs because it's easy to identify what the "resource" should be (number of trees or number of epochs, respectively) https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV

cerulean kayak Nov 1, 2023, 9:00 PM

#

desert oar (are you training on image embeddings? why not just keep them going through a fu...

I'm not making a Neural network. I'm trying to see how accurate a Random Forest can get compared to a CNN.

cerulean kayak Nov 1, 2023, 9:02 PM

#

desert oar kind of, but that's what i mean about the warm starts thing. you can treat `n_es...

and you think those 5 values for max_depth and n_estimators will be enough?

desert oar Nov 1, 2023, 9:12 PM

#

cerulean kayak I'm not making a Neural network. I'm trying to see how accurate a Random Forest ...

ahh, i see

desert oar Nov 1, 2023, 9:12 PM

#

cerulean kayak and you think those 5 values for max_depth and n_estimators will be enough?

well it was more to demonstrate the point, but yeah i think more than 8 max depth is probably overkill

past meteor Nov 1, 2023, 9:12 PM

#

cerulean kayak and you think those 5 values for max_depth and n_estimators will be enough?

I'd say for random forest the most important parameter to optimise for is cost complexity. N_estimators isn't worth it, it doesn't do a lot for random forest (I'll leave this as an exercise for you to figure out)

desert oar Nov 1, 2023, 9:13 PM

#

likewise with 200+ estimators, although breiman advocates for 1000+ trees, my experience is that you hit diminishing returns pretty hard. but you should see that in the score curve

#

fortunately your parameter space is small enough to visualize

past meteor Nov 1, 2023, 9:13 PM

#

On the other hand, max_depth is implicitly taken into account when you compute the cost complexity (cc_alpha), look at the equation to understand why 😁.

For xgboost it's the opposite, there you should properly reduce the number of estimators

desert oar Nov 1, 2023, 9:14 PM

#

past meteor On the other hand, max_depth is implicitly taken into account when you compute t...

i'm actually not familiar with that parameter, that's ccp_alpha in sklearn?

past meteor Nov 1, 2023, 9:14 PM

#

desert oar i'm actually not familiar with that parameter, that's `ccp_alpha` in sklearn?

Yes

desert oar Nov 1, 2023, 9:15 PM

#

ohh i see, this is for pruning individual trees?

#

so you'd set a generous max depth and then optimize over the pruning parameter instead?

past meteor Nov 1, 2023, 9:15 PM

#

I'd roll with the default max_depth, which is very high afaik and then prune

#

That's what the docs also want you to do afaik; the user guide says something about this for all trees

desert oar Nov 1, 2023, 9:16 PM

#

cool, i think that guide was probably added after i'd already ossified in my mind that tuning max depth was the way to go

#

this is what i get for not actually ever reading breiman's book. thanks, i learned something new!

#

hm... i think the docs only talk about pruning of individual decision trees, which i'm definitely familiar with. but i never heard of doing it in an ensemble like that

past meteor Nov 1, 2023, 9:19 PM

#

Np. Generally I don't like tuning more than 2 hyperparameters unless it's a neural network, so for each model I use I only touch the ones that give the most bang for the buck

desert oar Nov 1, 2023, 9:19 PM

#

agreed on that

past meteor Nov 1, 2023, 9:20 PM

#

There's some other tricks for random forest like using the OOB score instead of cross validation etc.

#

But that I'm sure you know

desert oar Nov 1, 2023, 9:21 PM

#

yeah, i totally forgot to mention OOB

#

although i've always been a little nervous about it

#

i've used it, but i was always mildly skeptical even though it's supposed to be fine

past meteor Nov 1, 2023, 9:23 PM

#

I barely use it because I'm usually trying 10 models with the same generic code

desert oar Nov 1, 2023, 9:26 PM

#

yeah, that's part of it. even the thing i showed with n_estimators involves a whole extra training setup

hollow sentinel Nov 1, 2023, 9:30 PM

#

import json
import pandas as pd
import plotly.express as px

print(px.data.carshare())

def plot_choropleth_from_df_and_geojson(df, geojson_path, show_empty_zips=True, zip_range=[]):
    with open(geojson_path, 'r') as file:
        zipcodes = json.load(file)

    df['zip_code'] = df['zip_code'].astype(str)

    for val in zipcodes:
      print(f"{val['zip_code']} is in {val['state']}")

    color_scale = px.colors.sequential.Plasma

    # Create the choropleth map
    fig = px.choropleth(
        df,
        geojson=zipcodes,
        locations="zip_code",  # Change this to match the column name in df
        featureidkey="properties.zip_code",  # Update to match the geojson properties
        color="zip_count",
        hover_name="zip_code",  # Change this to match the column name in df
        hover_data=["zip_count"],
        scope="usa",
        color_continuous_scale=color_scale,
        title="ZIP Code Choropleth based on Count"
    )
    fig.update_geos(fitbounds="locations", visible=False)
    fig.show()

df = pd.read_csv('zip_codes.csv')
print(df.columns)
print(df.head(5))
geojson_path = "/content/USCities.json"  # Update this path to the location of your GeoJSON file
plot_choropleth_from_df_and_geojson(df, geojson_path, show_empty_zips=True)

#

still stuck with figuring out this choropleth map

#

i need help with understanding how to parse the json data into a pandas dataframe

cerulean kayak Nov 1, 2023, 9:31 PM

#

past meteor I'd say for random forest the most important parameter to optimise for is cost c...

here's the thing. I've never seen a tutorial that's used this hyperparameter, but to be fair all the tutorials I've seen, except two, have been for very finite datasets (meaning they weren't as computationally expensive as my situation is). Heck, this is the first time I'm seeing this hyperparameter.
Also I'd have no idea where to start looking for values for that hyperparameter, especially since it's a float.

desert oar Nov 1, 2023, 9:32 PM

#

cerulean kayak here's the thing. I've never seen a tutorial that's used this hyperparameter, bu...

90% of data science / ML tutorials are all just rehashing the same content over and over

mild dirge Nov 1, 2023, 9:33 PM

#

and over

desert oar Nov 1, 2023, 9:33 PM

#

read a couple tutorials? write your own, post it on TDS, claim to be an educator on your linkedin

#

so i wouldn't take absence of any technique from generic tutorials as strong evidence against the usefulness of that technique

cerulean kayak Nov 1, 2023, 9:38 PM

#

well so there's also an implied principle of "I should only try solving problems that there are resources for".
If I am having such a hard time tuning the hyperparameter for which there are lots of tutorials on how to tune, it would be even harder to tune hyperparameters that don't have tutorials on how to do so online.
and yes. I know just because teachers exist, doesn't mean they're good. I go to college.

past meteor Nov 1, 2023, 9:43 PM

#

cerulean kayak well so there's also an implied principle of "I should only try solving problems...

If you don't believe me I can recommend a chapter in a book you can read at your own time 🙂

#

There's also these links on sklearn's docs:

https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning

https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html

#

From my experience the "tutorials" that cover sklearn are from people that have never read the documentation. It's quite well done so I suggest you read it once or twice front to back, it takes 2 days tops

cerulean kayak Nov 1, 2023, 10:46 PM

#

past meteor If you don't believe me I can recommend a chapter in a book you can read at your...

I'd appreciate that if you wouldn't mind.

timid dune Nov 1, 2023, 11:14 PM

#

is there a way to export discord chat data? 😅

echo mesa Nov 1, 2023, 11:45 PM

#

@past meteor Would you mind answering this? Thanks 🙂

past meteor Nov 2, 2023, 12:53 AM

#

Introduction to statistical learning, there's a chapter on it

past meteor Nov 2, 2023, 12:54 AM

#

echo mesa @past meteor Would you mind answering this? Thanks 🙂

I think it's fine, at least the way it was covered in my classes there were no big dependencies between each other, at least for the basic stuff

left tartan Nov 2, 2023, 1:45 AM

#

Arguably you might retain the information better by spreading the learning out over a longer time and reinforcing it.

hollow cloud Nov 2, 2023, 1:55 AM

#

#

Sharing some code I use to trade the markets

serene scaffold Nov 2, 2023, 2:06 AM

#

hollow cloud

!code

arctic wedgeBOT Nov 2, 2023, 2:06 AM

#

Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

serene scaffold Nov 2, 2023, 2:06 AM

#

Be sure to never ask people to read screenshots of text.

pallid urchin Nov 2, 2023, 2:07 AM

#

!code

serene scaffold Nov 2, 2023, 2:07 AM

#

pallid urchin !code

the !code command brings up a message from the bot. Be sure to follow the instructions in the message.

vestal spruce Nov 2, 2023, 3:07 AM

#

Is speaker diarization and speaker turn detection the same thing? I just need to split an hour long audio based on a change of speaker without the need to identify them.

shut girder Nov 2, 2023, 4:02 AM

#

Hello, I am a beginner to statistics and interpreting graphs like histograms. I can't really tell if this graph is skewed or not. The y axis is the amount and the x axis is the age.

wooden sail Nov 2, 2023, 4:08 AM

#

you could numerically compute the skewness either with scipy https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skew.html or by coding the formula for skewness yourself

#

the main idea behind skewness is checking whether more of the area of a pdf (or here, histogram) is to the left, the right, or if it's centered

#

at least at a glance, this data looks skewed to the right*, since skewness describes which side the tail is on
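
For the "coding the formula yourself" route, a minimal sketch follows. It assumes the population (biased) form of skewness, the third central moment divided by the variance to the power 1.5, which should match the default of `scipy.stats.skew`. The function name `skewness` and the list of ages are made up for illustration.

```python
def skewness(values):
    """Population skewness: third central moment / variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical ages with a long tail on the right (a few much older people).
ages = [21, 22, 22, 23, 24, 25, 26, 48, 65]

print(skewness(ages))             # positive value: the tail is on the right
print(skewness([1, 2, 3, 4, 5]))  # 0.0 for perfectly symmetric data
```

A positive result means the tail is on the right, a negative one means it is on the left, and values near zero suggest a roughly symmetric histogram.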

shut girder Nov 2, 2023, 4:23 AM

#

I see, thanks. So if I wanted the typical value of this data, should I calculate the median instead of the mean since it seems to be skewed?

wooden sail Nov 2, 2023, 4:24 AM

#

depends what you mean by "typical value" 😛

shut girder Nov 2, 2023, 4:34 AM

#

wooden sail depends what you mean by "typical value" 😛

Ah sorry, I mean the average value when I say typical value. I'm trying to find the average age in this case but I'm kind of stumped

vestal spruce Nov 2, 2023, 4:45 AM

#

shut girder Ah sorry, I mean the average value when I say typical value. I'm trying to find ...

Mean is the average; median is the center value of the sorted data ~~regardless of ascending or descending order~~. That means the mean is much more sensitive to outliers, while the median isn't.

#

is my explanation clear to you? I hope so. ¯_(ツ)_/¯

abstract wasp Nov 2, 2023, 5:21 AM

#

blazing oxide Hey there! 👋 It seems like you're having trouble with your autoencoder model. T...

Hi, can you explain to me what you mean with that last part "So, you might need to modify how your datasets are created to yield pairs of the same data."
Also, I tried the suggested code and I got a value error:
Call arguments received by layer 'autoencoder_3' (type Autoencoder): • x={'image': 'tf.Tensor(shape=(218, 178, 3), dtype=float32)'}

limber mesa Nov 2, 2023, 5:33 AM

#

shut girder I see, thanks. So if I wanted the typical value of this data, should I calculate...

median will most likely be a better representation if data is skewed.
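
A small sketch of why the median tends to describe skewed data better, using only the standard library. The ages below are invented for illustration, not taken from the histogram in question.

```python
import statistics

# Hypothetical ages: mostly people in their twenties plus two much older ones.
ages = [21, 22, 22, 23, 24, 25, 26, 48, 65]

mean_age = statistics.mean(ages)      # pulled upward by 48 and 65
median_age = statistics.median(ages)  # middle value of the sorted data

# Dropping the two oldest people moves the mean a lot and the median very little.
trimmed = ages[:-2]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(mean_age, median_age)
print(mean_trimmed, median_trimmed)
```

Here the two outliers shift the mean by several years while the median moves by one, which is the sense in which the median is the more robust "typical value".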

blazing oxide Nov 2, 2023, 10:09 AM

#

abstract wasp Hi, can you explain to me what you mean with that last part "So, you might need ...

Sure, I'd be happy to clarify!

When you're using a tf.data.Dataset object with the fit method in Keras, the dataset is expected to yield a tuple of two elements. The first element is the input data and the second element is the target data. In other words, it should yield (input_batch, target_batch) pairs.

In an autoencoder, the target data is the same as the input data because the model is trying to reconstruct its input. So, you need your dataset to yield pairs where the input batch and target batch are the same.

Here's how you can modify your dataset creation:

x_train = x_train.map(lambda x: (x['image']/255, x['image']/255))
x_test = x_test.map(lambda x: (x['image']/255, x['image']/255))

Regarding the error you're seeing, it seems like your model is receiving a dictionary as input ({'image': 'tf.Tensor(shape=(218, 178, 3), dtype=float32)'}), but it's expecting a tensor. The modification above should also fix this issue because it will directly pass the tensor to your model.

I hope this helps! Let me know if you have any more questions. 😊

quick mason Nov 2, 2023, 11:04 AM

#

Currently getting ```Traceback (most recent call last):
File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\utils.py", line 85, in _init_ffmpeg
_load_lib("libtorchaudio_ffmpeg")
File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\utils.py", line 61, in _load_lib
torch.ops.load_library(path)
File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torch\_ops.py", line 643, in load_library
ctypes.CDLL(path)
File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\ctypes\__init__.py", line 374, in __init__
self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\Lib\site-packages\torchaudio\lib\libtorchaudio_ffmpeg.pyd' (or one of its dependencies). Try using the full path with constructor syntax.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\__init__.py", line 67, in <module>
_init_ffmpeg()
File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\utils.py", line 87, in _init_ffmpeg
raise ImportError("FFmpeg libraries are not found. Please install FFmpeg.") from err
ImportError: FFmpeg libraries are not found. Please install FFmpeg.```
Returned when I attempt training with this.
https://github.com/CBeast25/Applio-RVC-Fork/blob/main/lib/infer/modules/train/train.py

which doesn't happen with this https://github.com/CBeast25/Applio-RVC-Fork/blob/main/lib/infer/modules/train/train_new.py

Any ideas?

vestal spruce Nov 2, 2023, 11:11 AM

#

quick mason Currently getting ```Traceback (most recent call last): File "C:\Users\Carson\...

Just to make sure, but do you have the FFmpeg installed?

quick mason Nov 2, 2023, 11:14 AM

#

vestal spruce Just to make sure, but do you have the FFmpeg installed?

Yup

vestal spruce Nov 2, 2023, 11:19 AM

#

Hol up

vestal spruce Nov 2, 2023, 11:20 AM

#

quick mason Yup

idk if that's the actual ffmpeg or the package for ffmpeg but I have to assume that is the former, so how about the python package do you also have it installed?

quick mason Nov 2, 2023, 11:22 AM

#

vestal spruce idk if that's the actual ffmpeg or the package for ffmpeg but I have to assume t...

got that one too

vestal spruce Nov 2, 2023, 11:24 AM

#

ok lemme try it

vestal spruce Nov 2, 2023, 11:27 AM

#

quick mason got that one too

um idk but could you try installing the one I have?

quick mason Nov 2, 2023, 11:28 AM

#

lmao that's stupid of me lets hope that works

#

damn no dice

vestal spruce Nov 2, 2023, 11:30 AM

#

well that sucks

vestal spruce Nov 2, 2023, 11:31 AM

#

quick mason damn no dice

I don't suppose the problem would be due to pathing?
I quoted this part of the error raised:

FileNotFoundError: Could not find module 'C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\Lib\site-packages\torchaudio\lib\libtorchaudio_ffmpeg.pyd' (or one of its dependencies). Try using the full path with constructor syntax.

quick mason Nov 2, 2023, 11:32 AM

#

I don't think so. Might be the version of ffmpeg I have.

outer tapir Nov 2, 2023, 11:35 AM

#

i have to train an object detection dataset on yolo model
i have downloaded a dataset called indian vehicle dataset, it has 5 classes of trucks, cars, tempos, tractor, tractor, each has over 130 images, i also have annotation folder for the same classes provided in xml format which looks like following

<annotation>
<folder>20210427</folder>
<filename>20210427_12_52_47_000_pgmtGnyv84hd7BiNoFr7ALIRzeZ2_F_4160_3120.jpg</filename>
<source>
<database>Unknown</database>
<annotation>Unknown</annotation>
<image>Unknown</image>
</source>
<size>
<width>3120</width>
<height>4160</height>
<depth/>
</size>
<segmented>0</segmented>
<object>
<name>tempo</name>
<truncated>0</truncated>
<occluded>0</occluded>
<difficult>0</difficult>
<bndbox>
<xmin>7.5</xmin>
<ymin>1089.34</ymin>
<xmax>522.95</xmax>
<ymax>2326.43</ymax>
</bndbox>
<attributes>
<attribute>
<name>rotation</name>
<value>0.0</value>
</attribute>
</attributes>
</object>
<object>
<name>tempo</name>
<truncated>0</truncated>
<occluded>0</occluded>
<difficult>0</difficult>
<bndbox>
<xmin>383.47</xmin>
<ymin>622.4</ymin>
<xmax>2645.4000000000005</xmax>
<ymax>3151.1</ymax>
</bndbox>
<attributes>
<attribute>
<name>rotation</name>
<value>0.0</value>
</attribute>
</attributes>
</object>
</annotation>

so how do i convert it into a yolo-accepted format? i am doing this type of thing for the first time

vestal spruce Nov 2, 2023, 11:37 AM

#

quick mason I don't think so. Might be the version of ffmpeg I have.

lastly how about the environment variables?

quick mason Nov 2, 2023, 11:38 AM

#

ffmpeg is on my path

#

Really dont want to install conda but it looks like that's what I'll have to try now

vestal spruce Nov 2, 2023, 11:43 AM

#

I'm afraid so... sorry I couldn't be of any help :/

blazing oxide Nov 2, 2023, 12:59 PM

#

outer tapir i have to train an object detection dataset on yolo model i have downloaded a da...

To train a YOLO model, you need to convert your annotations into a format that the YOLO model can understand. The YOLO model expects a .txt file for each image in the dataset with the same name as the image file. Each line in this file represents one bounding box, in the format <object-class> <x_center> <y_center> <width> <height>, where:

<object-class> is an integer representing the index of the class of the object from 0 to (classes-1).
<x_center>, <y_center>, <width>, and <height> are all relative to the size of the image. They are in float format and range from 0.0 to 1.0.

Here's a Python code snippet that can help you convert your XML annotations to YOLO format:
https://paste.pythondiscord.com/G37Q

In this code, replace 'path_to_annotations' with the path to your XML annotations and 'path_to_yolo_annotations' with the path where you want to save your YOLO annotations. Also, replace 'path_to_images' with the path to your images.

Please note that this code assumes that your XML files have the same name as your images and are in the same directory. If this is not the case, you may need to adjust the code accordingly.
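
As a rough illustration of the conversion itself (not the contents of the paste above), here is a minimal sketch that turns one annotation of the kind shown in the question into YOLO label lines. The class list `CLASS_NAMES`, the function name `voc_xml_to_yolo_lines`, and the small sample annotation are assumptions made up for the example; the real class names and their order must match your dataset.

```python
import xml.etree.ElementTree as ET

# Hypothetical class order; the index of each name becomes the YOLO class id.
CLASS_NAMES = ["truck", "car", "tempo", "tractor"]


def voc_xml_to_yolo_lines(xml_text, class_names=CLASS_NAMES):
    """Convert one <annotation> document into a list of YOLO label lines."""
    root = ET.fromstring(xml_text)
    img_w = float(root.findtext("size/width"))
    img_h = float(root.findtext("size/height"))
    lines = []
    for obj in root.iter("object"):
        # "name" here is the direct child of <object>, not the nested attribute name.
        class_id = class_names.index(obj.findtext("name"))
        box = obj.find("bndbox")
        xmin = float(box.findtext("xmin"))
        ymin = float(box.findtext("ymin"))
        xmax = float(box.findtext("xmax"))
        ymax = float(box.findtext("ymax"))
        # YOLO wants the box center and size, all divided by the image size.
        x_center = (xmin + xmax) / 2 / img_w
        y_center = (ymin + ymax) / 2 / img_h
        width = (xmax - xmin) / img_w
        height = (ymax - ymin) / img_h
        lines.append(f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}")
    return lines


# Small made-up annotation with the same structure as the one in the question.
SAMPLE_XML = """
<annotation>
  <size><width>100</width><height>200</height><depth/></size>
  <object>
    <name>tempo</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>50</xmax><ymax>120</ymax></bndbox>
    <attributes><attribute><name>rotation</name><value>0.0</value></attribute></attributes>
  </object>
</annotation>
"""

yolo_lines = voc_xml_to_yolo_lines(SAMPLE_XML)
print("\n".join(yolo_lines))  # 2 0.300000 0.350000 0.400000 0.500000
```

Each returned line would then be written to a `.txt` file that shares its base name with the image.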

I hope this helps! Let me know if you have any more questions. 😊

pulsar lily Nov 2, 2023, 3:03 PM

#

Hey everyone
I am new to ML and NN.
I just tried to learn pytorch with some datasets from kaggle but I quickly ran into a problem.

I have a Dataframe with different data types in it. Boolean, Strings and Floats
Most of them are relevant so now I'm wondering how to handle this. Cause for converting it into a tensor it needs to be a Float or int or binary I think.
How would one handle a data input with mixed data?

And also, if I convert the boolean values to a BoolTensor. How do I merge them with the FloatTensors... Cause it should be still one input right?

meager ridge Nov 2, 2023, 3:13 PM

#

hey anyone have any idea why it takes forever to build the wheel for Pandas?

serene scaffold Nov 2, 2023, 3:13 PM

#

pulsar lily Hey everyone I am new to ML and NN. I just tried to learn pytorch with some data...

you can represent booleans as 0 or 1. but for strings, it depends on what they represent. is there a finite number of unique strings, where they represent categories? or what?

pulsar lily Nov 2, 2023, 3:26 PM

#

serene scaffold you can represent booleans as 0 or 1. but for strings, it depends on what they r...

Thank you! I have both. I have columns with finite possibilities but also columns where I have a lot of different string. (Probably also finite but not just 5 different possibilities)
But isn't it better to use the proper boolean type than converting it to 0 and 1? After all there is a BoolTensor type...

rare osprey Nov 2, 2023, 3:27 PM

#

I am an ai researcher. I know 10+ coding languages and around 7 frameworks. I know everything from cnns to rnns to transformers and large language models. I have made my own finetuned multi modal large language model and am researching about augmented lstms. Im fullstack too and im here to discuss about anything CS or math and looking for friends and connections. I own two startups currently and working on a lot of projects.

#

Dm me if you want to be a friend, collab on smth

serene scaffold Nov 2, 2023, 3:28 PM

#

pulsar lily Thank you! I have both. I have columns with finite possibilities but also column...

for the features that have a finite set of possibilities (ie, they represent a fixed number of categories), you can one-hot encode them

#

for strings that could be "anything" (like a person's name, or a description of an art piece), the question becomes even more complicated.

pulsar lily Nov 2, 2023, 3:29 PM

#

serene scaffold for strings that could be "anything" (like a person's name, or a description of ...

hmm ok what do you mean by that?

serene scaffold Nov 2, 2023, 3:31 PM

#

pulsar lily hmm ok what do you mean by that?

there are different approaches to representing natural language numerically, but that's more advanced, and you should probably focus on one-hot encoding for the moment.
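
A minimal sketch of both ideas in plain Python, assuming a made-up table with one boolean, one float, and one categorical string column. The helper names `one_hot` and `encode_row` are hypothetical. The point is only that everything ends up as floats in a single row, which is the shape you would then hand to a tensor constructor.

```python
def one_hot(values):
    """Return the sorted category list and one 0/1 vector per value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0.0] * len(categories)
        row[index[value]] = 1.0
        rows.append(row)
    return categories, rows


def encode_row(flag, number, category_vector):
    """Booleans become 0.0 or 1.0 and sit next to the other float features."""
    return [float(flag), number] + category_vector


# Hypothetical columns of a small dataframe.
is_member = [True, False, True]
price = [9.5, 3.0, 7.25]
colour = ["red", "blue", "red"]

categories, colour_vectors = one_hot(colour)
features = [encode_row(f, p, c) for f, p, c in zip(is_member, price, colour_vectors)]

print(categories)  # ['blue', 'red']
print(features)    # one flat list of floats per row
```

Casting the boolean to a float loses nothing and avoids having to merge tensors of different dtypes into one input.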

pulsar lily Nov 2, 2023, 3:32 PM

#

ok

#

thank you so much

#

I have one last question

serene scaffold Nov 2, 2023, 3:33 PM

#

no you don't.

pulsar lily Nov 2, 2023, 3:33 PM

#

I have an ID and I want to map the ID to my prediction.
This is no relevant data for the training part.
How would you handle this

pulsar lily Nov 2, 2023, 3:33 PM

#

serene scaffold no you don't.

aight guess I don't

serene scaffold Nov 2, 2023, 3:34 PM

#

I'm just being silly, if that wasn't clear.

pulsar lily Nov 2, 2023, 3:34 PM

#

sure I caught that 😄

serene scaffold Nov 2, 2023, 3:34 PM

#

anyway, you don't encode the ID. but if you feed a tensor with 10 rows into the network, then you'll get ten rows out. and the nth row of the output represents the nth row of the input.

#

(or, if you're working with more than two dimensions, whatever the leftmost dimension is)
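
A minimal sketch of that bookkeeping, with a made-up `predict_batch` function standing in for the network and invented records. The only assumption that matters is that the model returns one output per input row, in the same order.

```python
def predict_batch(rows):
    """Stand-in for a trained model: one prediction per input row, in order."""
    return [sum(row) / len(row) for row in rows]


# Hypothetical records: the ID is kept aside and never fed to the model.
records = [
    {"id": "a17", "features": [0.0, 1.0, 3.5]},
    {"id": "b42", "features": [1.0, 0.0, 2.0]},
    {"id": "c03", "features": [1.0, 1.0, 4.0]},
]

ids = [record["id"] for record in records]
inputs = [record["features"] for record in records]

predictions = predict_batch(inputs)

# Row n of the output belongs to row n of the input, so zip restores the mapping.
prediction_by_id = dict(zip(ids, predictions))
print(prediction_by_id)
```

This only works if nothing reorders the rows between splitting off the IDs and predicting, so shuffling should be turned off for the prediction pass.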

pulsar lily Nov 2, 2023, 3:36 PM

#

emm ok...
I think I just keep coding and ask again when I get there... I think I won't understand it right now.
Is it ok if I ping you later?

serene scaffold Nov 2, 2023, 3:37 PM

#

I check this channel regularly, but don't ping me since I have a variable schedule.

pulsar lily Nov 2, 2023, 3:37 PM

#

ok

eager jasper Nov 2, 2023, 3:53 PM

#

How can I make a simple basic

#

With no module

#

Only built in modules

#

Is there a way

cunning agate Nov 2, 2023, 6:25 PM

#

hello guys i want to create ai video

#

where i want to integrate my body and voice in the video

#

is there any free website or something like that

dire iron Nov 2, 2023, 6:33 PM

#

Hello. I hope you are well. Could you tell me about your industry experience? How did you get into the professional world?

left tartan Nov 2, 2023, 7:27 PM

#

dire iron Hello. I hope you are well. Could you tell me about your industry experience? Ho...

#databases message

cunning agate Nov 2, 2023, 7:30 PM

#

are there any apps or websites to modify and add things to a video with ai?

eager jasper Nov 2, 2023, 8:17 PM

#

Hey

odd meteor Nov 2, 2023, 9:04 PM

#

cunning agate are there any apps or websites to modify and add things to a video with ai?

I think capcut. Idk how it works but I've seen some Tik-Tok videos using it do some cool stuff

mighty bridge Nov 2, 2023, 11:30 PM

#

What is your guys' overall opinion about tensorflow and pytorch? which is most used? is one easier than the other?

tidal bough Nov 2, 2023, 11:32 PM

#

Recently TF dropped GPU support on native windows (2.10 was the last release that had it), which may be an issue for some people. Besides that, I think they are nearly equivalent these days.

mighty bridge Nov 2, 2023, 11:33 PM

#

is there any reason why?

#

which OSes do they still support?

tidal bough Nov 2, 2023, 11:35 PM

#

there's probably a github issue explaining why or something.

#

GPU on windows is only supported via WSL2 now (CPU-only builds still run natively). and unix systems are still supported.

mighty bridge Nov 2, 2023, 11:39 PM

#

I didn't even know what WSL was until now lol

winter drift Nov 3, 2023, 12:57 AM

#

Hello, I built a level in unreal engine and want to do behavioral cloning. can anyone point me in the correct direction?

#

I have a little underwater level with a pilotable submarine, and I want to teach a simple model to drive the submarine and perhaps go to targets.

#

fierce kiln Nov 3, 2023, 3:46 AM

#

Hello everyone. I was required to write a python script to deduce the height of a medical object. Is there any sort of computer vision model to do that directly? Unfortunately, I couldn't find any in the context of my use case.
Therefore I came up with this idea: segment the area of interest and draw a bounding box around the segmented polygon. Calculate the height and width of the bounding box with simple geometry, which gives them in pixels. Convert pixels into the required scale if the focal length is constant, or use something like an ArUco marker for relative measurement.

However, I don't know how reliable this thing will turn out to be. Is there a better way? Or a vision model that I don't know of?
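
A minimal sketch of the bounding-box step, assuming the segmentation is already available as a 2D binary mask and that the scale (`mm_per_pixel`, a made-up value here) is constant across images. The mask and both helper names are invented for illustration.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right) of the nonzero pixels in a 2D mask."""
    rows = [r for r, line in enumerate(mask) if any(line)]
    cols = [c for c in range(len(mask[0])) if any(line[c] for line in mask)]
    if not rows:
        raise ValueError("mask contains no segmented pixels")
    return rows[0], cols[0], rows[-1], cols[-1]


def box_size_mm(mask, mm_per_pixel):
    """Height and width of the bounding box, converted to millimetres."""
    top, left, bottom, right = bounding_box(mask)
    height_px = bottom - top + 1               # inclusive pixel count
    width_px = right - left + 1
    return height_px * mm_per_pixel, width_px * mm_per_pixel


# Toy 5x6 segmentation mask (1 = object), invented for illustration.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
height_mm, width_mm = box_size_mm(mask, mm_per_pixel=0.5)
```

An axis-aligned box like this overestimates the height of a tilted object, so the reliability depends on the "always perpendicular" assumption actually holding.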

vestal spruce Nov 3, 2023, 4:06 AM

#

fierce kiln Hello everyone. I was required to write a python script to deduce height of a me...

it depends on the constraints defined for your project, because things like which camera you use, the angle you take the picture from, and the relative distance to the object all affect the accuracy of your system. if you don't mind sharing a bit more detail, please do.

fierce kiln Nov 3, 2023, 4:25 AM

#

Thank you for your interest. It is a segment of an OCT image. Let's assume the angle is always perpendicular and the relative distance always remains constant.

vestal spruce Nov 3, 2023, 4:35 AM

#

fierce kiln Thank you for your interest. It is a segment of an OCT image. Let's assume the a...

oh well then, I'm confident the result would be reliable enough, especially given how well you understood the assignment. but when in doubt, remember that you must run a system evaluation to determine your system's error rate. from there you can decide whether to make drastic changes or just simple tweaks.

#

And one more thing: I think it would help to have another object as a reference in every image, for example a piece of A4 paper as the background behind your main object. you can then sample the paper's pixel-to-real-world size ratio and use it to convert the main object's pixel size to real-world size. this is just one way to do it, you might find other methods much easier to implement/understand, so I'd suggest taking a look on Google Scholar.
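
The reference-object arithmetic can be sketched as follows. The pixel counts are invented, and the only real-world figure used is the 297 mm long edge of an A4 sheet; both function names are hypothetical.

```python
def mm_per_pixel(reference_mm, reference_px):
    """Scale factor derived from an object of known physical size."""
    if reference_px <= 0:
        raise ValueError("reference must span at least one pixel")
    return reference_mm / reference_px


def measure_mm(object_px, reference_mm, reference_px):
    """Convert a pixel measurement using the reference object's scale."""
    return object_px * mm_per_pixel(reference_mm, reference_px)


# A4 long edge is 297 mm; the pixel counts below are made up.
height_mm = measure_mm(object_px=450, reference_mm=297.0, reference_px=1485)
```

This assumes the reference and the measured object sit at the same distance from the camera; if they don't, the ratio no longer transfers.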

lapis sequoia Nov 3, 2023, 7:35 AM

#

hello

#

I have trained a model on a dataset with imbalance using xgboost

#

#

it gives f1_score_weighted of 85%

#

per-label it gives f1 score [0.8855230715040509, 0.8542857142857143, 0.7451428571428571, 0.743142144638404, 0.853898561695685, 0.8672331386086033]

#

what else could be done in order to make better model

#

to increase the accuracy/f1_score

lapis sequoia Nov 3, 2023, 7:54 AM

#

Should I use CatBoost instead, or try to solve the imbalance in the dataset using SMOTE or other methods

#

Maybe both

vestal spruce Nov 3, 2023, 8:39 AM

#

lapis sequoia I have trained a model on a dataset with imbalance using xgboost

could you give us more context about your model? if it's a CV model, you can augment the data by duplicating samples from the minority category and varying the duplicates (rotate them, invert them) so there's more variety. if it's a time-series model like ASR, reversed-audio duplicates could work too.

#

another thing you can do is weight your categories: give the minority categories a bigger weight to balance things out, i.e. push the model to be more critical when dealing with rarely occurring data.
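
One common way to pick those weights is the "balanced" heuristic, n_samples / (n_classes * class_count). A minimal sketch with invented label counts; the resulting per-sample weights are the kind of thing that can be passed as sample weights when fitting a gradient-boosted model.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy, deliberately imbalanced labels (counts invented for illustration).
labels = ["joy"] * 6 + ["sadness"] * 3 + ["surprise"] * 1
weights = balanced_class_weights(labels)
sample_weights = [weights[label] for label in labels]   # one weight per row
```

With this scheme every class contributes the same total weight, so the rare class is no longer drowned out by the common one.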

#

lastly, proportional distribution, a.k.a. stratified splitting by category, might also affect its performance.
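
A small sketch of what stratified splitting means, written in plain Python with a hypothetical `stratified_split` helper: each class is split separately so the test set keeps roughly the same class proportions as the full data.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split at the same rate."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # At least one test sample per class; a class with a single sample
        # therefore ends up in the test set only.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels, invented for illustration: 80% one class, 20% another.
labels = ["joy"] * 80 + ["surprise"] * 20
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
```

Without stratification, a random split of a small imbalanced dataset can leave the rare class almost absent from one side, which makes per-label scores noisy.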

#

I assume you haven't already applied the ideas I'm telling you about, @lapis sequoia ?

lapis sequoia Nov 3, 2023, 8:56 AM

#

vestal spruce could you perhaps provide us with more context of your model? is it a CV model, ...

It is a cv model
Emotion model

#

dair-ai/emotion

#

It's basically a model for nlp classification

vestal spruce Nov 3, 2023, 8:59 AM

#

lapis sequoia It is a cv model Emotion model

though the ideas I've stated could still work regardless of model since it is more to do with how the data is being presented, btw you haven't answered my question :).

lapis sequoia Nov 3, 2023, 9:00 AM

#

No I've not applied the techniques you've mentioned

vestal spruce Nov 3, 2023, 9:03 AM

#

Well you might want to try them out first because, from my past experience, these are pretty common ways to deal with an imbalanced dataset.

olive pecan Nov 3, 2023, 9:04 AM

#

Are there any good courses on Data analytics and Big data using jupyter for data understanding, cleaning, preprocessing, modeling etc anywhere?

vestal spruce Nov 3, 2023, 9:04 AM

#

I do apologize as I'm unfamiliar with CatBoost, perhaps it's way more streamlined than the basic ideas I mentioned

lapis sequoia Nov 3, 2023, 9:05 AM

#

vestal spruce I do apologize as I'm unfamiliar with CatBoost, perhaps it's way more streamline...

Nah
I heard it for the first time 😅

#

I'll do some fixing to tackle the imbalance

vestal spruce Nov 3, 2023, 9:06 AM

#

lapis sequoia Nah I heard it for the first time 😅

well I guess then it's ok to stick with the basics. haha

vestal spruce Nov 3, 2023, 9:09 AM

#

lapis sequoia Should I use CatBoost instead or try to solve imbalance in dataset using smote o...

you can combine oversampling/SMOTE with undersampling too. try different combinations, who knows what might and might not work together

#

one thing is certain is that you should always try and experiment with different scenarios ¯_(ツ)_/¯

blazing oxide Nov 3, 2023, 9:10 AM

#

lapis sequoia what else could be done in order to make better model

Why not also use a neural network that looks at a CSV file of the winners of all the recent Grands Prix, so that it can see a pattern in the trend of the winners?

lapis sequoia Nov 3, 2023, 9:11 AM

#

vestal spruce one thing is certain is that you should always try and experiment with different...

Tbh this is the first time I'm experimenting with the nlp model (maybe ML in general)

vestal spruce Nov 3, 2023, 9:13 AM

#

lapis sequoia Tbh this is the first time I'm experimenting with the nlp model (maybe ML in gen...

I'm curious, how could your visual emotion model also use nlp? afaik nlp is mostly used for text-based scenarios.

lapis sequoia Nov 3, 2023, 9:13 AM

#

blazing oxide Why not use also a Neural Network that looking to a file csv of the last winners...

I've never worked with NN
Ik transformers could give a whopping accuracy of maybe >94
But rn I'd like to stick to ml algo

blazing oxide Nov 3, 2023, 9:14 AM

#

lapis sequoia I've never worked with NN Ik transformers could give a whooping accuracy of may...

Ok👌

lapis sequoia Nov 3, 2023, 9:14 AM

#

vestal spruce I'm curious how could your visual emotion model also use nlp? afaik nlp is mostl...

No
It's a text based dataset

vestal spruce Nov 3, 2023, 9:15 AM

#

lapis sequoia No It's a text based dataset

oh so it wasn't computer vision then ?🤦

lapis sequoia Nov 3, 2023, 9:15 AM

#

No

#

When did I mention that ?

vestal spruce Nov 3, 2023, 9:15 AM

#

I'm getting thrown off by this revelation

vestal spruce Nov 3, 2023, 9:15 AM

#

lapis sequoia It is a cv model Emotion model

^

lapis sequoia Nov 3, 2023, 9:15 AM

#

vestal spruce ^

CSV*

#

Sorry man

vestal spruce Nov 3, 2023, 9:16 AM

#

oh my goodness

#

it's alright, no harm done

lapis sequoia Nov 3, 2023, 9:17 AM

#

Thanks man

vestal spruce Nov 3, 2023, 9:23 AM

#

lapis sequoia Thanks man

ok I'm not really versed in NLP, but I believe the ideas I mentioned are still applicable, aside from making varied duplicates, for obvious reasons.

lapis sequoia Nov 3, 2023, 9:23 AM

#

Adding duplicate would not be a great idea

#

In the preprocessing
I literally removed the duplicates

vestal spruce Nov 3, 2023, 9:24 AM

#

lapis sequoia Adding duplicate would not be a great idea

then we've reached the same conclusion.

odd meteor Nov 3, 2023, 9:25 AM

#

lapis sequoia

Use TextAttack library (the same library used in performing adversarial text attack on NLP tasks) to perform data augmentation on the minority class.

That would most definitely improve the current model's performance.

vestal spruce Nov 3, 2023, 9:26 AM

#

but don't think of them (duplicates) as an obstacle to your model, because oversampling and undersampling are basically just that: adding and removing duplicates as a means of balancing the dataset.
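
Random oversampling in its simplest form can be sketched like this, with invented example texts and a hypothetical `random_oversample` helper. It duplicates minority-class rows until every class matches the largest one, and is normally applied to the training split only, so that copies of the same row never land in both train and test.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate minority-class rows until all classes have equal counts."""
    counts = Counter(labels)
    target = max(counts.values())
    rng = random.Random(seed)
    out_texts, out_labels = list(texts), list(labels)
    for cls, count in counts.items():
        pool = [t for t, lab in zip(texts, labels) if lab == cls]
        for _ in range(target - count):
            out_texts.append(rng.choice(pool))   # sampled with replacement
            out_labels.append(cls)
    return out_texts, out_labels


# Toy training split, invented for illustration.
texts = ["so happy", "great day", "love this", "feeling low"]
labels = ["joy", "joy", "joy", "sadness"]
balanced_texts, balanced_labels = random_oversample(texts, labels)
```

Deduplicating the raw data first and then oversampling the training split are not in conflict: the first removes accidental repeats, the second adds deliberate ones.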

lapis sequoia Nov 3, 2023, 9:28 AM

#

vestal spruce but don't think of them (duplicates) as an obstacle to your model, because overs...

Actually I'm doing an assignment
They've mentioned to remove the dups

#

It's a standard procedure in the nlp ig

lapis sequoia Nov 3, 2023, 9:28 AM

#

odd meteor Use TextAttack library (the same library used in performing adversarial text at...

Will check
Thanks 👍

clever lake Nov 3, 2023, 10:11 AM

#

guys I've been looking for tutorials to create a chatbot (AI) but most of them are outdated (videos too old and don't work) could you recommend me something? I haven't found anything else.

My idea was to have a json (something simple) so: tags,patterns,responses

{
    "intents": [
        {
            "tag": "greetings",
            "patterns": [
                "hello",
                "hey",
                "hi"
            ],
            "responses": [
                "Hello",
                "hey!",
                "what can i do for you?"
            ]
        }
    ]
}
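
A rough sketch of how a file shaped like this could drive a purely rule-based bot, using only the standard library. The `build_lookup` and `respond` helpers are hypothetical, matching is exact-string only, and note that `json.loads` rejects a trailing comma after the last item of a list.

```python
import json
import random

INTENTS_JSON = """
{
    "intents": [
        {
            "tag": "greetings",
            "patterns": ["hello", "hey", "hi"],
            "responses": ["Hello", "hey!", "what can i do for you?"]
        }
    ]
}
"""


def build_lookup(intents):
    """Map each lowercased pattern to the intent that owns it."""
    lookup = {}
    for intent in intents["intents"]:
        for pattern in intent["patterns"]:
            lookup[pattern.lower()] = intent
    return lookup


def respond(message, lookup, rng=random):
    """Exact-match lookup only; returns None when no pattern matches."""
    intent = lookup.get(message.strip().lower())
    if intent is None:
        return None
    return rng.choice(intent["responses"])


lookup = build_lookup(json.loads(INTENTS_JSON))
reply = respond("  Hello ", lookup)
```

Anything beyond exact matches (typos, paraphrases) is where a trained intent classifier would replace the dictionary lookup.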

olive pecan Nov 3, 2023, 10:13 AM

#

Are there any good courses on Data analytics and Big data using jupyter for data understanding, cleaning, preprocessing, modeling etc anywhere? I'm struggling at uni to understand the terms and techniques

blazing oxide Nov 3, 2023, 10:36 AM

#

clever lake guys I've been looking for tutorials to create a chatbot (AI) but most of them a...

Sure, I found some recent resources that might help you:

"How to Build Chatbots | Complete AI Chatbot Tutorial for Beginners (https://www.youtube.com/watch?v=jCoH82LPgdk)" by Liam Ottley. This is a comprehensive video tutorial that covers everything you need to know to build custom no-code AI chatbots.
"Learn how to create your own chatbot easily (https://www.youtube.com/watch?v=Pj00e6lq9Cg)" by TECH WITH SACH. This tutorial teaches you how to create your own chatbot quickly and easily using Dialogflow and different tools and features.
"How to Build a ChatBot using the GPT-4 API – Full Project-Based Tutorial (https://www.freecodecamp.org/news/build-gpt-4-api-chatbot-turorial/)". This is a full project-based tutorial that teaches you how to build your own chatbot using the GPT-4 API.
"A Complete Guide on How to Build a Chatbot (Easy to Hard) (https://www.g2.com/articles/how-to-build-a-chatbot)". This guide provides different ways to build a chatbot, each requiring a varying level of technical skills.
"ChatterBot: Build a Chatbot With Python – Real Python (https://realpython.com/build-a-chatbot-python-chatterbot/)". This tutorial guides you through creating a chatbot using Python ChatterBot.

As for your idea of using a JSON structure for intents, patterns, and responses, it's a common approach in rule-based chatbots. However, for more advanced AI chatbots, you might need to use more sophisticated techniques such as machine learning and natural language processing. I hope this helps! 😊

#

Finally I got the voice verified role! 😁

timid dune Nov 3, 2023, 10:42 AM

#

is it complicated to integrate your PyTorch model into a frontend application?

clever lake Nov 3, 2023, 10:44 AM

#

blazing oxide Sure, I found some recent resources that might help you: 1. **"How to Build Cha...

I will take a look at all of them, but using chatterbot doesn't appeal to me, I would like to be able to make something of my own without using something already made. For working with intents do you have any tips?

blazing oxide Nov 3, 2023, 10:44 AM

#

timid dune is it complicated to integrate your PyTorch model into an frontend application?

Hey there! 😊 Sure, integrating a PyTorch model into a frontend application can be a bit tricky, but there are several ways to do it:

Flask: You can use Flask to deploy your PyTorch model and create a REST API for your model. This is a common way to integrate ML models into web apps.
PyTorch C++ Frontend: If you're working with C++, you can use the PyTorch C++ frontend. It provides a pure C++ interface to PyTorch.
Flutter: For mobile apps, you can use PyTorch Mobile with Flutter. This lets you run your PyTorch model directly on a mobile device.
Windows ML API: If you're developing a Windows app, you can use the Windows ML API, which consumes ONNX models, so the PyTorch model would first need to be exported to ONNX.

Remember, the complexity can vary depending on your model and application. It's always a good idea to start with a simple prototype and add complexity as needed. Hope this helps! 😊

timid dune Nov 3, 2023, 10:45 AM

#

blazing oxide Hey there! 😊 Sure, integrating a PyTorch model into a frontend application can ...

can I use graphql or whatever? my app is built with react

blazing oxide Nov 3, 2023, 10:48 AM

#

clever lake I will take a look at all of them, but using chatterbot doesn't appeal to me, I ...

Sure thing! 😊 Building your own chatbot from scratch can be a great learning experience and also an amazing challenge. Here are some tips for working with intents:

Define the Purpose: Start by figuring out what you want your chatbot to do. This will help you determine the types of intents your chatbot needs to handle.
Identify Common Questions: Consider the types of questions your chatbot is likely to get. These can guide you in creating your intents.
Create Sample Conversations: Try to come up with possible scenarios or conversations your chatbot might have. This can help you understand how to structure your intents and responses.
Annotation: Mark up your training data to identify intents and any entities. This is crucial for training your chatbot to understand user inputs.
Continuous Training: Train your chatbot with the marked-up data, and keep fine-tuning and adjusting the chatbot’s responses based on feedback.
Add and Refine Intents: As your chatbot interacts with users, you'll likely discover new intents that you hadn't thought of. Add these to your chatbot and refine existing ones based on user interactions.

Remember, building a chatbot from scratch can be complex, especially if you want it to understand natural language. You might need to learn about Natural Language Processing (NLP) techniques, which can help your chatbot understand and interpret user intents in real-time.

Hope this helps! 😊

blazing oxide Nov 3, 2023, 10:49 AM

#

timid dune can I use graphql or whatever? my app is built with react

Absolutely, you can use GraphQL with your React application and PyTorch model! Here's a general approach:

Create a Backend Service: This service will host your PyTorch model and expose an API endpoint for making predictions.
GraphQL Server: Set up a GraphQL server that communicates with your backend service. This server will receive requests from your frontend, forward them to the backend service, and return the results.
Apollo Client: In your React application, use the Apollo Client to communicate with your GraphQL server. Apollo Client is a comprehensive state management library for JavaScript that enables you to manage both local and remote data with GraphQL.
Fetch Data: Use the useQuery hook provided by Apollo Client to fetch data from your GraphQL server. This data can then be displayed in your React components.
Update Data: If your application allows for updating data (like re-training your model), you can use the useMutation hook.

Remember, integrating these technologies can be complex and requires a good understanding of each component. But with some patience and practice, it's definitely achievable. Good luck! 😊

clever lake Nov 3, 2023, 10:56 AM

#

blazing oxide Sure thing! 😊 Building your own chatbot from scratch can be a great learning ex...

thank you for your help, I noticed that the various videos or tutorials offered are mostly very basic, chatterbot gpt 4 API etc, do you have a tutorial focused on learning based on intent? this would be very helpful for me to study how it works and do something more

#

of course I searched but they were all very old (2 to 4 years ago) many libraries have changed and do not work as they used to which makes them useless

#

https://paste.pythondiscord.com/XGDQ

#

gives an error when creating the array (line 60) and I have no idea how to fix it

Traceback (most recent call last):
  File "/home/saohy/documents/bots/mei-chan-bot/g.ignore/chatbot/training.py", line 60, in <module>
    training = np.array(training)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (5, 2) + inhomogeneous part

blazing oxide Nov 3, 2023, 11:04 AM

#

clever lake thank you for your help, I noticed that the various videos or tutorials offered ...

Absolutely👌, here are some resources that focus on intent-based chatbot development:

"Chatbot Development Tutorial: Introduction of Intent, Stories, Actions in Rasa (https://www.pragnakalp.com/chatbot-development-tutorial-rasa/)": This tutorial provides a comprehensive guide on how to use Rasa, an open-source machine learning framework for building chatbots. It covers the concepts of intents, stories, and actions in detail.
"Creating Chatbots on Google Cloud (https://developers.google.com/learn/topics/chatbots)": This resource offers step-by-step guides to learn about Dialogflow and the corresponding Google Cloud services, which facilitate building chatbots. Dialogflow is a powerful tool for intent recognition and handling.
"How to Build Your AI Chatbot with NLP in Python? (https://www.analyticsvidhya.com/blog/2021/10/complete-guide-to-build-your-ai-chatbot-with-nlp-in-python)": This guide provides a complete walkthrough on how to build an AI chatbot with Natural Language Processing (NLP) in Python. It covers various aspects of NLP, including intent recognition and handling.

Remember, building an intent-based chatbot can be complex and requires a good understanding of NLP and machine learning concepts. But with some patience and practice, it's definitely achievable. Good luck with your studies! 😊

Also for the error you've encountered in your code I think I know where is the problem, but I'll tell you about it in another message because I don't have Nitro🥲.

blazing oxide Nov 3, 2023, 11:06 AM

#

clever lake gives an error when creating the array (line 60) and I have no idea how to fix i...

The error you're encountering, ValueError: setting an array element with a sequence, typically occurs when you're trying to create a NumPy array from a list of sequences that aren't all the same length. In your case, it seems like the issue is with the training variable.

Here's a possible solution:

Instead of using np.array(training), you can convert training into an array of objects:

training = np.array(training, dtype=object)

This tells NumPy to create an array of Python objects, which can be of varying lengths or types.

This is what I can do for now, I hope I've been helpful! 😊
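
An alternative that avoids object arrays, sketched under an assumption: the detected shape (5, 2) in the error suggests each row of training is a pair of two lists with different lengths (for example a feature vector and a label vector), but the pasted script isn't reproduced here, so that is a guess. If it holds, the pairs can be split into two rectangular lists before any array conversion. `split_pairs` and the toy data are hypothetical.

```python
def split_pairs(training):
    """Split [features, labels] pairs into two lists and check each is rectangular."""
    train_x = [pair[0] for pair in training]
    train_y = [pair[1] for pair in training]
    for name, rows in (("train_x", train_x), ("train_y", train_y)):
        lengths = {len(row) for row in rows}
        if len(lengths) > 1:
            raise ValueError(f"{name} has rows of differing lengths: {sorted(lengths)}")
    return train_x, train_y


# Toy data shaped like the assumed structure: 4 features, 2 labels per row.
training = [
    [[1, 0, 1, 0], [1, 0]],
    [[0, 1, 0, 0], [0, 1]],
    [[0, 0, 1, 1], [1, 0]],
]
train_x, train_y = split_pairs(training)
# Each list is now rectangular, so converting train_x and train_y to arrays
# separately would give ordinary 2-D numeric arrays rather than object arrays.
```

If the check raises, the inhomogeneity is inside the features or labels themselves, and that is the thing to fix rather than the dtype.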

#

As for recent resources on chatbot development, here are some that might be helpful:

"Future directions for chatbot research: an interdisciplinary research agenda (https://link.springer.com/article/10.1007/s00607-021-01016-7)": This article provides a research agenda for chatbot development, discussing future directions and challenges in the field.
"7 Chatbot Trends for 2022 & Beyond (https://manychat.com/blog/chatbot-trends/)": This article discusses the latest trends in chatbot development, which could be useful for your project.
"How To Develop a Chatbot From Scratch (https://chatbotsmagazine.com/how-to-develop-a-chatbot-from-scratch-62bed1adab8c)": This guide provides a comprehensive walkthrough on how to build a chatbot from scratch.
"9 Major Chatbot Trends for 2021 (https://botcore.ai/blog/chatbot-trends-2021/)": Although this article is from 2021, it discusses some major trends in chatbot development that are still relevant today.

Good luck with your project! (Also if you need any more help feel free to ask also in DMs) 😊

#

Now I gtg, cya👋 (I'll be back in 1 hour)

clever lake Nov 3, 2023, 11:16 AM

#

blazing oxide As for recent resources on chatbot development, here are some that might be help...

ty

midnight harbor Nov 3, 2023, 12:34 PM

#

I'm new at this and I want to know how to use langchain to create a simple application.
I was using the openai library, which was kind of easy, but for a scalable project I think langchain would be great. BUT I found langchain very confusing: I've failed to find the equivalent of
role: system
in langchain, i.e. how to pass a system prompt that tells the model what it is and so on,

after which the user/assistant chat starts, right?
I need to find this concept in langchain.

I've found that with ConversationBufferMemory and the summary buffer I can manage a lot of stuff, but what I want right now is to give system context to the LLM. kindly help

blazing oxide Nov 3, 2023, 12:47 PM

#

clever lake ty

You are welcome. I'm always happy to help

blazing oxide Nov 3, 2023, 12:50 PM

#

midnight harbor I am new at this and i wanna know how to use langchain to create simple applicat...

Hey there!👋 It's great to hear that you're diving into data science and AI. Langchain is indeed a powerful tool for building scalable AI applications.

To provide system context in Langchain, you can use the ConversationBufferMemory as you mentioned. This is where you can store the system context or any other information that you want to persist across different turns of the conversation.

Here's a simple example:

from langchain import ConversationBufferMemory

# Initialize the memory buffer
memory = ConversationBufferMemory()

# Add system context
memory.add_system_context("I am an AI developed by Langchain.")

# Now you can pass this memory to the LLM
llm = LangchainLLM(memory=memory)

In this example, the system context "I am an AI developed by Langchain." is stored in the memory and can be accessed by the LLM during the conversation.

Remember, the system context is just like any other piece of information in the conversation. It's up to you how you want to use it in your application.

I hope this helps! If you have any more questions, feel free to ask. Happy coding! 🚀

midnight harbor Nov 3, 2023, 12:52 PM

#

blazing oxide Hey there!👋 It's great to hear that you're diving into data science and AI. La...

AttributeError: 'ConversationBufferMemory' object has no attribute 'add_system_context'

blazing oxide Nov 3, 2023, 12:54 PM

#

midnight harbor AttributeError: 'ConversationBufferMemory' object has no attribute 'add_system_c...

I apologize for the error I made😓 . It seems there was a misunderstanding in my previous message. The ConversationBufferMemory class does not have a method called add_system_context.

In Langchain, the system context is typically set at the beginning of the conversation and is passed to the model as part of the conversation history. Here's a simplified example:

from langchain import ConversationBufferMemory, LangchainLLM

# Initialize the memory buffer
memory = ConversationBufferMemory()

# Add system context
memory.add_message("system", "I am an AI developed by Langchain.")

# Now you can pass this memory to the LLM
llm = LangchainLLM(memory=memory)

In this example, the system context "I am an AI developed by Langchain." is added as a system message in the conversation history. The LLM can then use this context during the conversation.

I hope this clears up the confusion. Let me know if you have any other questions! Happy coding! 🚀

midnight harbor Nov 3, 2023, 12:56 PM

#

blazing oxide I apologize for the error I made😓 . It seems there was a misunderstanding in my...

AttributeError: 'ConversationBufferMemory' object has no attribute 'add_message'

blazing oxide Nov 3, 2023, 12:59 PM

#

midnight harbor AttributeError: 'ConversationBufferMemory' object has no attribute 'add_message'

Oh, I'm really disappointed in myself... I don't know why they work on my machine. try this last version, I'm really sorry for the confusion:

from langchain import ConversationBufferMemory, LangchainLLM

# Initialize the memory buffer
memory = ConversationBufferMemory()

# Add system context
memory.add_turn({"role": "system", "content": "I am an AI developed by Langchain."})

# Now you can pass this memory to the LLM
llm = LangchainLLM(memory=memory)
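
Library aside, the concept being asked about (a system prompt first, then alternating user and assistant turns) is the role-tagged message list used by chat-completion style APIs. A minimal sketch in plain Python; the `Conversation` class is a hypothetical helper, not part of LangChain or the openai library, and the prompt text is invented.

```python
class Conversation:
    """Hypothetical helper: keeps the system prompt first, then the turns."""

    def __init__(self, system_prompt):
        # The system message sets the model's role and is sent with every call.
        self.messages = [{"role": "system", "content": system_prompt}]

    def add_user(self, text):
        self.messages.append({"role": "user", "content": text})

    def add_assistant(self, text):
        self.messages.append({"role": "assistant", "content": text})


chat = Conversation("You are a concise assistant for a cooking site.")
chat.add_user("How long do I boil an egg?")
chat.add_assistant("About 7 minutes for a firm yolk.")
chat.add_user("And for a soft one?")
roles = [message["role"] for message in chat.messages]
```

Whatever memory abstraction a framework offers, what finally reaches a chat model is a list of this shape, so checking that list is a quick way to confirm the system prompt is actually being sent.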

odd meteor Nov 3, 2023, 2:31 PM

#

lapis sequoia Will check Thanks 👍

If you're having any challenge in using the library let me know.

long canopy Nov 3, 2023, 2:36 PM

#

any tools or libraries of interest for voice recognition?

serene scaffold Nov 3, 2023, 2:57 PM

#

blazing oxide I apologize for the error I made😓 . It seems there was a misunderstanding in my...

!mute 857925971004882975 "1 day" Using ChatGPT to answer questions is not allowed. If you are not using a generative AI to produce content, as you allege, stop deliberately writing in a way that resembles generative AI writing.

arctic wedgeBOT Nov 3, 2023, 2:57 PM

#

:incoming_envelope: :ok_hand: applied timeout to @blazing oxide until <t:1699109861:f> (1 day).

ancient cairn Nov 3, 2023, 4:53 PM

#

how to transcribe audio to text on arduino for chat gpt, so that chat gpt then replies in audio?

left tartan Nov 3, 2023, 5:34 PM

#

ancient cairn how to transcribe audio to text arduino for chat gpt. then chat gpt replies in a...

I think you're just asking: How to do Text to Speech, and Speech to Text? Doesn't matter if it's chatgpt or not, right?

desert oar Nov 3, 2023, 6:23 PM

#

serene scaffold !mute 857925971004882975 "1 day" Using ChatGPT to answer questions is not allowe...

i was just about to flag this, thank you

ancient cairn Nov 3, 2023, 6:41 PM

#

left tartan I think you just are asking: How to do Text to Speech, and Speech to Text? Doesn...

no like i talk to chat gpt in speech like siri

#

and it replies to me as audio

#

i know what i need but i need code

#

i just need an idea

echo mesa Nov 3, 2023, 9:22 PM

#

Hi Guys, I've been learning pre-calculus from james stewart: "Precalculus: Mathematics for Calculus".
here is a review from this guy:
https://www.youtube.com/watch?v=N-JXs_n2JhI

I'm wondering whether I should go thru "all" of them, the guy said that there is much more math in this book than in a college course. My main goal is to learn calculus for machine learning, so I wonder whether it's worth spending years just to go thru pre-calculus as a 16 year old? Or how should I do it?

left tartan Nov 3, 2023, 9:25 PM

#

echo mesa Hi Guys, I've been learning pre-calculus from james stewart: "Precalculus: Mathe...

All of what? All of the chapters?

echo mesa Nov 3, 2023, 9:26 PM

#

left tartan All of what? All of the chapters?

yeah all of the content

left tartan Nov 3, 2023, 9:26 PM

#

echo mesa yeah all of the content

You're still in high school right? Maybe ask the pre-calculus teacher for a copy of their syllabus, so you can see what information is taught?

echo mesa Nov 3, 2023, 9:29 PM

#

left tartan You're still in high school right? Maybe ask the pre-calculus teacher for a copy...

In Ireland its different afaik, but I think we have a similar structure, we are going thru these kind of books

#

This is 5th year, which is the book that we are going thru atm

#

This is the next version of that book

#

This is the last one's table of contents

#

This is the first book(the green) one's content

#

This is literally the math that is required in leaving cert

#

I don't know if the content is enough for pre-calc

#

I know the table of contents might not be enough for you to assume, but what do you think?

echo mesa Nov 3, 2023, 10:09 PM

#

or perhaps I can take one of these courses? https://www.khanacademy.org/math/precalculus

#

Because I've seen a lot of people recommending khan academy for maths for machine learning

primal iris Nov 3, 2023, 10:14 PM

#

https://discord.com/channels/267624335836053506/1170122069556613140

#

can someone help pls

past meteor Nov 3, 2023, 10:37 PM

#

primal iris https://discord.com/channels/267624335836053506/1170122069556613140

I can't commit to helping you in the thread so I'll answer here. You can use small multiples or colour coding. Seaborn and plotly let you do both quite easily.

primal iris Nov 3, 2023, 10:44 PM

#

past meteor I can't commit to helping you in the thread so I'll answer here. You can use sma...

i am already doing that

#

but i don't think they would be the best for multilabel, as it might have values that have more than one label

echo mesa Nov 3, 2023, 11:00 PM

#

past meteor I can't commit to helping you in the thread so I'll answer here. You can use sma...

Since BillyBobby is not active anymore can you please take a look at my question, if you wouldn't mind 🙂

past meteor Nov 3, 2023, 11:01 PM

#

echo mesa Since BillyBobby is not active anymore can you please take a look at my question...

I'm not from the US, we follow a different system of math. I'm abroad rn actually, can you ping me in a few days? I'll write out the topics of my undergrad math course. That was the bare minimum of what you need.

echo mesa Nov 3, 2023, 11:02 PM

#

past meteor I'm not from the US, we follow a different system of math. I'm abroad rn actuall...

sure that's fine, thanks. I've sent the topics that are included in Ireland in my school for someone who's pursuing higher level maths.

past meteor Nov 3, 2023, 11:03 PM

#

It looks similar to what I had

echo mesa Nov 3, 2023, 11:22 PM

#

past meteor It looks similar to what I had

Does it? I asked in the Mathematics server and they said that I should go with them because, judging by the table of contents, they are very good books, and since I'm learning this in school it's even better

acoustic prawn Nov 3, 2023, 11:36 PM

#

hi, I'm not sure this is the right channel. I'm working on a small project utilizing some image tag/recognition models, gpt and flask, and I'm quite inexperienced

loading the models takes a few minutes. every time I make a change in some file, flask is reloaded (which is good) and with it the models.

currently, I'm calling load_Models() directly before the line that runs the app.
what is common practice to keep such models "loaded" and only reload the rest?

serene scaffold Nov 3, 2023, 11:40 PM

#

acoustic prawn hi, I'm not sure this is the right channel. I'm working on a small project utili...

If the models are part of the same python process as the flask app that you're editing, there is no way to prevent that from happening.

An alternative design is to have the models in a separate flask app that only has the absolute minimal functionality that needs direct access to the models, and then have the main app interact with that other app via requests.

#

It's not just that "flask" is reloading. It's the whole python process.
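
A minimal sketch of that two-process layout, under some loud assumptions: it uses the standard library's `http.server` and `urllib` in place of Flask and `requests` so that it runs without dependencies, and `DummyModel`, `ModelHandler`, `start_model_server`, `predict`, and the `/predict` path are all hypothetical names. For the demo the server runs in a background thread; in practice it would be its own script that you leave running while the main app reloads.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DummyModel:
    """Hypothetical stand-in for a model that is slow to load."""

    def predict(self, text):
        return {"tags": sorted(set(text.lower().split()))}


class ModelHandler(BaseHTTPRequestHandler):
    model = None  # set once at server start-up, reused by every request

    def do_POST(self):
        # Read the JSON body, run the already-loaded model, answer with JSON.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(self.model.predict(payload["text"])).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sketch quiet


def start_model_server(port=0):
    # Model process: load once, then serve until stopped.
    ModelHandler.model = DummyModel()
    server = HTTPServer(("127.0.0.1", port), ModelHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def predict(port, text):
    # What the main (auto-reloading) app calls instead of touching the model.
    request = urllib.request.Request(
        f"http://127.0.0.1:{port}/predict",
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


server = start_model_server()
result = predict(server.server_address[1], "a cat and a dog")
server.shutdown()
server.server_close()
print(result)
```

The point of the split is that only the small client function lives in the app you keep editing, so a reload there never re-runs the expensive load.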

acoustic prawn Nov 3, 2023, 11:44 PM

#

oh no, that does not sound good...
so I'd have two flask projects/servers running, communicating with each other?

desert oar Nov 4, 2023, 12:02 AM

#

echo mesa Hi Guys, I've been learning pre-calculus from james stewart: "Precalculus: Mathe...

i suggest maybe just picking a starting point and taking your time

#

iirc i was about 16 when i did pre-calc in school. do you not have access to a pre-calc course there?

#

i believe the sequence was something like trigonometry and basic proofs in 10th grade, then logarithms, derivatives, and a cursory look at antiderivatives + integrals in 11th

#

we probably did other things in 10th too but i can't remember what they were

#

what i do remember is that we took it pretty slow

#

there was no rushing through the material. i had very good teachers overall, and we spent a while building things up carefully and developing intuition

#

for example, i remember one day the teacher had us step through angles and measure the sides of the resulting triangles inscribed in a unit circle

#

and then she had us plot the ratios, and magically the sine and cosine functions appeared on our graph paper

#

i loved that, it was the first time in math i felt like there was actually a purpose to what we were doing and things started coming together

#

likewise the next year, the teacher gave everyone some construction paper and tape and assigned us all various dimensions of boxes to build. then we computed the volumes of all our boxes and showed that there was a unique optimum volume, which i believe we then derived. that was also a fun and memorable demo

#

the point being: it's worth taking your time with these things. they will form the foundation for everything to follow

echo mesa Nov 4, 2023, 9:08 AM

#

desert oar iirc i was about 16 when i did pre-calc in school. do you not have access to a p...

I've sent you the two books that we are going to go thru and I've sent the table of contents if you scroll up a bit.

#

would you think that that would be enough?

echo mesa Nov 4, 2023, 9:09 AM

#

echo mesa This is the first book(the green) one's content

.

echo mesa Nov 4, 2023, 9:09 AM

#

echo mesa This is the last one's table of contents

.

wooden sail Nov 4, 2023, 9:34 AM

#

echo mesa I've sent you the two books that we are going to go thru and I've sent the table...

that's a pretty good loadout

#

precalc is basically algebra and trig, both of which you cover there. differential and integral calculus are proper calculus, so what you learn after precalc

echo mesa Nov 4, 2023, 9:34 AM

#

wooden sail that's a pretty good loadout

Do you think that it would be enough to prepare myself for calculus?

wooden sail Nov 4, 2023, 9:34 AM

#

that already includes calculus lol

echo mesa Nov 4, 2023, 9:34 AM

#

wooden sail that already includes calculus lol

I mean yeah 😄

#

so I guess it should be enough then 😄

wooden sail Nov 4, 2023, 9:35 AM

#

yep

tame path Nov 4, 2023, 12:14 PM

#

guys, can someone guide me on a Python learning roadmap?

fading scaffold Nov 4, 2023, 2:06 PM

#

do you guys know the best course to emphasize numpy and pandas?

serene scaffold Nov 4, 2023, 2:10 PM

#

fading scaffold do you guys know the best course to emphasize numpy and pandas?

Not sure about numpy, but I recommend the kaggle pandas tutorial

fading scaffold Nov 4, 2023, 2:13 PM

#

serene scaffold Not sure about numpy, but I recommend the kaggle pandas tutorial

is this?

serene scaffold Nov 4, 2023, 2:14 PM

#

I don't think so.

#

!resources data science

arctic wedgeBOT Nov 4, 2023, 2:14 PM

#

Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

serene scaffold Nov 4, 2023, 2:14 PM

#

It's one of the items here ^

fading scaffold Nov 4, 2023, 2:17 PM

#

serene scaffold It's one of the items here ^

ah i figured it out, thanks a bunch!

left tartan Nov 4, 2023, 2:22 PM

#

serene scaffold It's one of the items here ^

I've been thinking about this a bit: how to teach data manipulation in a way that shows plain Python vs Pandas vs SQL side by side.

#

I think a lot of people don't connect SQL or Pandas operations with the equivalent, i.e. how you'd do it in a simple Python loop. When I teach people SQL (or Pandas), I express things in terms of the primitives (how you'd do the same thing in code).
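
As a rough illustration of that side-by-side idea, here is the same "total per group" question written as a plain Python loop and as SQL (run through the standard library's `sqlite3`). The `orders` rows and the column names are made up for the example, and the pandas equivalent is only noted in a comment rather than executed.

```python
import sqlite3

# Made-up toy data: (customer, amount)
orders = [("ann", 10), ("bob", 5), ("ann", 7), ("cat", 3), ("bob", 1)]

# Plain Python: the loop that GROUP BY / groupby does for you.
totals_loop = {}
for customer, amount in orders:
    totals_loop[customer] = totals_loop.get(customer, 0) + amount

# SQL: the same question, stated declaratively.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
totals_sql = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
conn.close()

# pandas (not run here): df.groupby("customer")["amount"].sum()
print(totals_loop == totals_sql)
```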

hushed crater Nov 4, 2023, 3:00 PM

#

Hey guys, I'm trying to color my points on this regplot by their y value. for example i want all where dropout == 1 to be orange and where 0 to be blue. but I can't seem to figure out how

#

```py
g = sns.JointGrid(height=6, space=0)
x, y = df["Application_order"], df["Dropout"]
sns.regplot(x=x, y=y, scatter_kws=dict(s=1), x_jitter=.3, y_jitter=.125, logistic=True, truncate=False, ax=g.ax_joint)
sns.histplot(x=x, discrete=True, stat='frequency', ax=g.ax_marg_x)
g.ax_marg_y.remove()
g.ax_marg_x.set_facecolor('white')

g.ax_joint.set_xlabel('Application Order')
plt.title('Dropout Likelihood Across Application Orders')
```

#

Any help? I can manage to make it work with an lmplot but then I don't get my marginal histogram which i really want to keep...

left tartan Nov 4, 2023, 4:25 PM

#

hushed crater

Draw the scatter directly, and set "scatter=False" in the regplot. I think you'll have to add the jitter manually to the df, like: df_x_jittered = df["Application_order"] + np.random.uniform(-0.3, 0.3, size=len(df))

#

Something like: ```py
jittered_x = df["Application_order"] + np.random.uniform(-0.3, 0.3, size=len(df))
jittered_y = df["Dropout"] + np.random.uniform(-0.125, 0.125, size=len(df))
sns.scatterplot(x=jittered_x, y=jittered_y, hue=df["Dropout"] > 0, palette=['blue', 'orange'], legend=False, ax=g.ax_joint, s=1)
```

hushed crater Nov 4, 2023, 4:43 PM

#

left tartan Draw the scatter directly, and set "scatter=False" in the regplot. I think you'l...

Thanks for the answer! The colors are now working, however setting scatter=False seems to mess up regplot...

#

I'm guessing I should just draw the line manually instead of using regplot?

left tartan Nov 4, 2023, 4:44 PM

#

That’s weird, I did a test and my regplot worked

hushed crater Nov 4, 2023, 4:44 PM

#

```py
g = sns.JointGrid(height=6, space=0)
x, y = df["Application_order"], df["Dropout"]
sns.regplot(x=x, y=y, scatter_kws=dict(s=1), scatter=False, x_jitter=.3, y_jitter=.125, logistic=True, truncate=False, ax=g.ax_joint)
jittered_x = df["Application_order"] + np.random.uniform(-0.3, 0.3, size=len(df))
jittered_y = df["Dropout"] + np.random.uniform(-0.125, 0.125, size=len(df))
sns.scatterplot(x=jittered_x, y=jittered_y, hue=df["Dropout"] > 0, palette=['blue', 'orange'], legend=False, ax=g.ax_joint, s=1)
sns.histplot(x=x, discrete=True, stat='frequency', ax=g.ax_marg_x)
g.ax_marg_y.remove()
g.ax_marg_x.set_facecolor('white')

g.ax_joint.set_xlabel('Application Order')
plt.title('Dropout Likelihood Across Application Orders')
```

left tartan Nov 4, 2023, 4:44 PM

#

Try regplot -after- scatter?

hushed crater Nov 4, 2023, 4:45 PM

#

Working!

#

thanks a bunch

past meteor Nov 4, 2023, 5:23 PM

#

left tartan I've been thinking about this a bit, how to teach data manipulation in a way whi...

When I get students to do work for us I give them a pandas crash course and I always put it next to SQL

#

I always tell them the SQL idioms mostly carry over, and I have a bunch of SQL queries on a toy dataset where I show the equivalent Pandas code

abstract wasp Nov 4, 2023, 5:54 PM

#

Hi, I don't know why I get module_wrapper when I grab the summary of this code:
`model = Sequential()
vgg16 = tf.keras.applications.vgg16.VGG16(
    include_top=False,
    input_shape=(224, 224, 3),
    pooling='avg',
    classes=91,
    weights='imagenet'
)

for layer in vgg16.layers:
    layer.trainable = False

model.add(vgg16)
model.add(normalizing)
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(91, activation='softmax'))`

#

Screenshot_2023-11-04_at_10.54.42_AM.png

#

This is what shows up when I train:

Screenshot_2023-11-04_at_10.59.39_AM.png

lapis sequoia Nov 4, 2023, 6:12 PM

#

So I'm getting some errors and can't figure out why it isn't printing that last statement

```py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
import tensorflow as tf
from sklearn import preprocessing

df = pd.read_csv('Google_Stock_Price.csv')
df['Date'] = pd.to_datetime(df['Date'],format='mixed')
df['Time'] = df.apply(lambda row: len(df) - row.name, axis=1)
df['CloseFuture'] = df['Close'].shift(30)

df_test = df[:185]
df_train = df[185:]

X = np.array(df_train['Time'])
X = X.reshape(-1 ,1)
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)
y = np.array(df_train['CloseFuture'])

model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(1,)),
        tf.keras.layers.Dense(10, activation='sigmoid'),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(1)])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),loss='mse',metrics=['mae'])
model.fit(X_scaled, y, epochs = 30, batch_size = 10)
ennuste_train = model.predict(X_scaled)
df_train['Ennuste'] = ennuste_train

X_test = np.array(df_test['Time'])
X_test = X_test.reshape(-1,1)
X_testscaled = scaler.transform(X_test)
ennuste_test = model.predict(X_testscaled)
df_test['Ennuste'] = ennuste_test


plt.scatter(df['Date'].values, df['Close'].values, color='black')
plt.plot((df_train['Date'] + pd.DateOffset(days=30)).values, df_train['Ennuste'].values, color='blue')
plt.plot((df_test['Date'] + pd.DateOffset(days=30)).values, df_test['Ennuste'].values, color='red')
plt.show()

df_validation = df.test.dropna()

print("Predicted data average error %.f" % mean_absolute_error(df_validation['CloseFuture'], df_validation['Ennuste']))
```

#


D:\Archives\Coder\Python\Roboticas\Tehtävä10.py:31: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['Ennuste'] = ennuste_train
6/6 [==============================] - 0s 601us/step
D:\Archives\Coder\Python\Roboticas\Tehtävä10.py:37: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test['Ennuste'] = ennuste_test
Traceback (most recent call last):
  File "D:\Archives\Coder\Python\Roboticas\Tehtävä10.py", line 45, in <module>
    df_validation = df.test.dropna()
                    ^^^^^^^
  File "C:\Users\Jani\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\generic.py", line 6204, in __getattr__
    return object.__getattribute__(self, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DataFrame' object has no attribute 'test'

#

help appreciated..
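
Going only by the traceback above, the failing line is `df.test.dropna()`: pandas looks for an attribute or column called `test` on `df`, while the script defines a separate variable `df_test`, so `df_test.dropna()` was presumably intended. The two `SettingWithCopyWarning`s are a separate issue that usually goes away when the slices are copied. A small sketch with made-up numbers (not the real stock data), assuming that reading is right:

```python
import pandas as pd

# Made-up stand-in for the stock data
df = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0]})
df["CloseFuture"] = df["Close"].shift(1)  # first row becomes NaN

# .copy() makes each split an independent frame, so adding a column
# to it does not trigger SettingWithCopyWarning.
df_test = df[:3].copy()
df_train = df[3:].copy()
df_test["Ennuste"] = [9.0, 10.5, 11.5]  # placeholder predictions

# df_test is a variable; df.test would be an attribute lookup on df and fail.
df_validation = df_test.dropna()
print(len(df_validation))
```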

quaint gorge Nov 4, 2023, 6:20 PM

#

is there an NLP package in python or something that categorizes sentences based on the content of the sentence?

brave sand Nov 4, 2023, 6:25 PM

#

how would I implement q learning on a racing game I made?

lapis sequoia Nov 4, 2023, 6:26 PM

#

quaint gorge is there an NLP package in python or something that categorizes sentences based ...

Nope, I haven't used one in this project; there shouldn't be any string data for the AI to emulate, it's just column names for the dataframe. Though I'm new to machine learning and still discovering stuff, so there could be some inaccuracy.

stray void Nov 4, 2023, 8:13 PM

#

hey so is this the place to ask about apis

#

i need an api for llm prompts where u can automate the responses u get from it in a way

left tartan Nov 4, 2023, 8:59 PM

#

stray void i need an api for llm prompts where u can automate the responses u get from it i...

Unclear what you’re asking, but you can start from OpenAI’s API and do whatever you want with the response

storm kelp Nov 4, 2023, 11:13 PM

#

I keep getting a serialisation error with my Pyspark function; any insight would be greatly appreciated
https://discord.com/channels/267624335836053506/1170499876660973679

polar zodiac Nov 5, 2023, 8:15 AM

#

hello,
I'm working on a project and there seem to be many null values in the dataset.
Would you advise me to go with fillna or dropna?
Also, if I use fillna and fill in average or random values, wouldn't it affect the dataset?

royal crest Nov 5, 2023, 8:32 AM

#

you can choose to drop, or impute so long as you report your decision

#

typically any imputation method is effective when <5% of the data is missing, mean imputation drops out at around 10% missing, regression drops out at 15% but multiple imputation and ML imputation methods are more accurate even when larger proportions of the data are missing
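
A minimal sketch of the drop-versus-impute choice in pandas, on a made-up column (the data and the name `x` are assumptions, and the thresholds quoted above are not encoded here). It checks the share of missing values first, then shows both options.

```python
import pandas as pd

# Made-up column with 2 of 5 values missing
df = pd.DataFrame({"x": [1.0, None, 3.0, None, 5.0]})

missing_share = df["x"].isna().mean()    # fraction of rows that are null
dropped = df.dropna(subset=["x"])        # option 1: drop incomplete rows
filled = df["x"].fillna(df["x"].mean())  # option 2: mean imputation

print(missing_share, len(dropped), filled.tolist())
```

On the "wouldn't it affect the dataset" question: mean imputation keeps the column mean but shrinks its spread (here the standard deviation of `filled` is smaller than that of the observed values), which is one reason it degrades as the missing share grows.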

mild dirge Nov 5, 2023, 10:32 AM

#

I have just been reading some papers to get some inspiration for a model I want to design. My task is to design a model that handles data of which a large part (~90%) is unlabeled, and my model should classify the labels. I read a paper (https://arxiv.org/abs/1911.04252) that starts by training a relatively small model on the 10% labeled data, and then classifying maybe 10% of the data that is unlabeled, such that we have 10% true labels and 10% predicted labels. Then a new model is instantiated that is slightly bigger than the first and trained on the 20% labeled data. That model is used to label another 10%, so the next model has 30% labeled data, etc.

My problem is: why would this improve performance at all? In the end all labeled data is based on the information from the initial 10%; it doesn't learn anything "new" from the predicted labels, does it?

arXiv.org

Self-training with Noisy Student improves ImageNet classification

We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1...

#

The paper discusses a modification of the algorithm explained above, but I was wondering why the original idea works at all.

mild dirge Nov 5, 2023, 11:08 AM

#

I guess the idea is called "self-training", but the confusion still remains. If the initial classifier predicts correctly, the data point probably doesn't bring much new information, whereas if it is predicted incorrectly, the newer iteration model will be learning incorrectly.

past meteor Nov 5, 2023, 11:20 AM

#

mild dirge I have just been reading some papers to get some inspiration for a model I want ...

Have you looked at general methods from semi supervised learning?

#

Are you able to label post-hoc or not at all?

mild dirge Nov 5, 2023, 11:21 AM

#

past meteor Are you able to label post-hoc or not at all?

What do you mean with this?

past meteor Nov 5, 2023, 11:21 AM

#

What is canonically done is having some model output predictions and uncertainties and you label the ones with the highest uncertainties

#

This only makes sense if you can still label ofc

mild dirge Nov 5, 2023, 11:22 AM

#

Highest certainties right?

past meteor Nov 5, 2023, 11:22 AM

#

The ones you're most uncertain about

mild dirge Nov 5, 2023, 11:22 AM

#

Or do you mean, manually labeling?

#

I thought you meant using the labels that the model was most uncertain about for the next iteration

#

I get it now

#

And I think that would be quite difficult

past meteor Nov 5, 2023, 11:23 AM

#

Imagine you can still label the cats and dogs in your dataset but you don't want to do it upfront. Your model is also able to output uncalibrated probabilities. You let it do predictions and afterwards you label the ones with a ton of uncertainty
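
A rough sketch of that "label the most uncertain ones" step, assuming the model gives a probability per class for each sample. The entropy score, the `most_uncertain` helper, and the sample ids are illustrative choices, not something from the conversation; other uncertainty scores would work the same way.

```python
import math


def entropy(probs):
    # Shannon entropy of one predicted class distribution; higher = less certain.
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(predictions, k):
    # Return the ids of the k samples the model is least sure about.
    ranked = sorted(predictions, key=lambda sid: entropy(predictions[sid]), reverse=True)
    return ranked[:k]


# Hypothetical model outputs: sample id -> [P(cat), P(dog)]
predictions = {
    "img_a": [0.98, 0.02],
    "img_b": [0.55, 0.45],
    "img_c": [0.80, 0.20],
    "img_d": [0.50, 0.50],
}
to_label = most_uncertain(predictions, k=2)
print(to_label)
```

The selected samples are the ones you would hand to a human labeler before retraining.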

mild dirge Nov 5, 2023, 11:24 AM

#

The data is thousands or tens of thousands tree point clouds from many different forests

#

The labeling is mostly done by knowing what trees are in which forest, and labeling entire sections

#

So post-hoc would be difficult and requires expert knowledge, which I don't have

past meteor Nov 5, 2023, 11:24 AM

#

I think this is an ideal use case for this though, that is if you can get a hold of the experts to do it

mild dirge Nov 5, 2023, 11:25 AM

#

Yeah, I understand why it would be great if we can label post-hoc, but even then the labels are mostly from known forests from looking at it in the field.

#

So not from looking at the data itself, which are point clouds (sometimes it is easy to see, sometimes not)

past meteor Nov 5, 2023, 11:25 AM

#

Iirc this is what one of my lectures was about cause my prof did this for energy load anomalies. I can send you the slides when I get back maybe there's something else of value to you there 🤷

mild dirge Nov 5, 2023, 11:26 AM

#

Would be great. From some papers it seems that it even works without labeling post-hoc, but that seems a bit counterintuitive, as the actual information you have is just the labeled data and some pseudo-labels.

#

In the OG paper they only take pseudo-labels from samples that the model is very certain about
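
A minimal sketch of that filtering step, assuming per-class probabilities per sample. The 0.9 threshold, the `select_pseudo_labels` helper, and the sample ids are made up for illustration and are not taken from the paper.

```python
def select_pseudo_labels(predictions, threshold=0.9):
    # Keep only samples whose top class probability clears the threshold.
    selected = {}
    for sample_id, probs in predictions.items():
        confidence = max(probs)
        if confidence >= threshold:
            selected[sample_id] = probs.index(confidence)  # predicted class index
    return selected


# Hypothetical teacher outputs: sample id -> probability per tree species
predictions = {
    "tree_1": [0.97, 0.02, 0.01],
    "tree_2": [0.40, 0.35, 0.25],
    "tree_3": [0.05, 0.93, 0.02],
}
pseudo = select_pseudo_labels(predictions)
print(pseudo)
```

Only `tree_1` and `tree_3` would be added to the next model's training set; the ambiguous `tree_2` stays unlabeled.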

past meteor Nov 5, 2023, 11:27 AM

#

A large part of his research is semi supervised learning in general so afterwards it might be a good idea to check out the papers from his lab 🙂

mild dirge Nov 5, 2023, 11:30 AM

#

Appreciate the input, the post-hoc can be something to look into if that's possible at my company (basically need to convince one of my coworkers to do it 😛 )

harsh minnow Nov 5, 2023, 1:08 PM

#

Hey guys, I am pretty new to working with LLMs. I want to build an AI agent; how can I build it? Are there any useful videos you can share that go pretty in-depth on building AI agents? Your suggestions will help me, thanks.

tacit salmon Nov 5, 2023, 1:27 PM

#

hey you guys, I am interested in AI and machine learning, that's mainly why I learned Python

#

but as beginner-Junior developer, I don't know how could it be applied

#

Because I only saw text based simple programs

#

can you tell me how does it work?

#

I mean, what does one need to start making AI stuffs

left tartan Nov 5, 2023, 1:32 PM

#

tacit salmon I mean, what does one need to start making AI stuffs

First, you need to learn Python to a fairly good level... it's hard to do anything meaningful until you're able to build intermediate level projects, of any type. Focus on learning the fundamentals and practicing by writing code often. #python-discussion is a good place to ask for help.

left tartan Nov 5, 2023, 1:33 PM

#

tacit salmon I mean, what does one need to start making AI stuffs

Second, you should learn a bit of the theory and ideas behind AI. A nice place to start is: https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi. This is just a taste of the subject, there's a lot more to learn after this.

#

Third, math. Strong math skills, if you want to do anything more complicated than simple projects with existing models. Once you understand the theory, you need strong math skills to understand the details. See pinned messages for some reading material.

#

One place to start with using AI/ML is: https://keras.io/getting_started/

long canopy Nov 5, 2023, 2:23 PM

#

What part of the data science process is the one going from data collection to database insertion?

tacit salmon Nov 5, 2023, 2:51 PM

#

left tartan Second, you should learn a bit of the theory and ideas behind AI. A nice place t...

but math as theory on its own, or just working with problem solving and algorithms?

#

I mean, I used some Trigonometry to make my game characters rotate, but It's not like a kind of hard Math

#

Wow 3Blue1Brown, idk if it's gonna be kinda hard

#

I'll watch the series

left tartan Nov 5, 2023, 2:57 PM

#

tacit salmon but Maths on its own theory or just working with problem solving and algorithms

Both. It's a long journey of constant learning.

tacit salmon Nov 5, 2023, 2:57 PM

#

So do I need Calculus? I never tried to do that

left tartan Nov 5, 2023, 2:58 PM

#

Calculus is considered a fundamental college course for any STEM major, just like algebra is in high school.

#

You can certainly do stuff with AI/ML libraries without calculus, but it's just one part of becoming a complete engineer or scientist: understanding how everything works.

tacit salmon Nov 5, 2023, 3:01 PM

#

left tartan You can certainly do stuff with AI/ML libraries without calculus, but it's just ...

I get it, on the deep levels.

long canopy Nov 5, 2023, 7:38 PM

#

Any book recommendation that clearly establishes differences between data architect, data engineer, and data scientist?

#

I'm having a hard time organizing data intake vs. validation vs. database insertion in my code, I feel like having a higher-level conceptual understanding would help me out

modest dome Nov 5, 2023, 7:52 PM

#

Question. I am working with a pandas dataframe and the csv file date shows up with month and year only (Jan 2000). How can I separate the month and year and create their own columns?

young granite Nov 5, 2023, 7:52 PM

#

modest dome Question. I am working with a pandas dataframe and the csv file date shows up w...

if it's in datetime format, use that; otherwise use the split function and pandas apply

modest dome Nov 5, 2023, 7:55 PM

#

@young granite I'll try that thank you

echo mesa Nov 5, 2023, 7:58 PM

#

Guys, I'm confused about vectors and their magnitude. I'm reading a book (Data Science from Scratch: First Principles with Python), and there a vector's magnitude is defined as the square root of the sum of the squares of the vector's values. However, I don't think that's necessarily correct. When we calculate a vector's magnitude we consider its components, which from my understanding are Δx and Δy (I'm talking about R^2 specifically): we square them, add them together, and take the square root of the whole expression. When we have a vector [3,4] and assume it starts from the origin, then yeah, we can sum the squares and take the square root, which gives 5. But what we're really doing is considering Δx and Δy, which, since we started from the origin, are just the coordinates (3,4) of the vector. When we don't start from the origin, we have to consider the initial point and the terminal point, determine Δx (x2-x1) and Δy (y2-y1), and then calculate the magnitude. The method described in the book only works if the vector starts from the origin; if not, Δx and Δy will not be equal to the vector's coordinates. I think the distinction between a vector that starts from the origin and one that starts somewhere else is really important because it affects the magnitude of the vector. Perhaps I'm overcomplicating it, but I think the book should have been more specific here.

barren fable Nov 5, 2023, 8:00 PM

#

Does anyone have a good recommendation for whatever books, documents, or even pretty nice videos explaining the polynomial regression, SVR, and kernel?

modest dome Nov 5, 2023, 8:38 PM

#

young granite if its in datetime format use that otherwise use the split function and pandas a...

I tried to split but it gives me this `0 [Jan 2000]`. I want to separate the month and year into a datetime format but I am having a hard time doing that

modest dome Nov 5, 2023, 8:38 PM

#

modest dome I tried to split but it gives me this `0 [Jan 2000]`. I want to separate the...

The beginning 0 is an index position on the dataframe

agile cobalt Nov 5, 2023, 8:46 PM

#

modest dome Question. I am working with a pandas dataframe and the csv file date shows up w...

!e you can use to_datetime to turn into proper datetime objects, then extract the properties like ```py
import pandas as pd
df = pd.DataFrame({'string': ['Jan 2020', 'Feb 2021']})
df['date'] = pd.to_datetime(df['string'])
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year

# optionally do not even store `df['date']` as a column, just
# (same thing as before but without storing as a column)
# dates = pd.to_datetime(df['string'])
# df['month'] = dates.dt.month [...]

print(df)
```

arctic wedgeBOT Nov 5, 2023, 8:46 PM

#

@agile cobalt :white_check_mark: Your 3.12 eval job has completed with return code 0.

001 | /home/main.py:3: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
002 |   df['date'] = pd.to_datetime(df['string'])
003 |      string       date  month  year
004 | 0  Jan 2020 2020-01-01      1  2020
005 | 1  Feb 2021 2021-02-01      2  2021

agile cobalt Nov 5, 2023, 8:47 PM

#

the format for this case is "%b %Y", so ideally you should explicitly specify like ```py
df['date'] = pd.to_datetime(df['string'], format="%b %Y")
```

Python strftime reference cheatsheet

A quick reference for Python's strftime formatting directives.

wooden sail Nov 5, 2023, 9:49 PM

#

echo mesa Guys, I have a confusion with vectors and their magnitude. Now I'm reading a boo...

there's no difference because vectors canonically do not have a "location in space"

#

if you specify a vector simply as coordinates, it does not matter where a vector is located in space

#

the object you're describing requires extra parameters. a second vector, to define both the "tail" and "head" of the vector (more properly, an affine transformation of a vector). but if you then compute the corresponding vector as head - tail, you again get a proper vector with no location in space attached to it, which we can always represent as having its tail at the origin
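
A small sketch of the head - tail point above, using only the standard library. The helper names `magnitude` and `displacement` and the example points are made up for illustration.

```python
import math


def magnitude(v):
    # Euclidean norm: square root of the sum of squared components.
    return math.sqrt(sum(c * c for c in v))


def displacement(tail, head):
    # The vector from tail to head, componentwise head - tail.
    return [h - t for t, h in zip(tail, head)]


from_origin = magnitude([3, 4])
moved = magnitude(displacement(tail=[1, 1], head=[4, 5]))
print(from_origin, moved)
```

Both calls go through the same formula: once the arrow from (1, 1) to (4, 5) is reduced to its components [3, 4], where it was drawn no longer matters.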

haughty heart Nov 6, 2023, 12:18 AM

#

https://venturebeat.com/ai/meet-nightshade-the-new-tool-allowing-artists-to-poison-ai-models-with-corrupted-training-data/

VentureBeat

Carl Franzen

Meet Nightshade, the new tool allowing artists to ‘poison’ AI model...

Nightshade was developed by University of Chicago researchers under computer science professor Ben Zhao and will be added as an option to...

#

That’s so cool

#

W for the artists

serene scaffold Nov 6, 2023, 12:25 AM

#

haughty heart https://venturebeat.com/ai/meet-nightshade-the-new-tool-allowing-artists-to-pois...

Does it basically hide another image in the least significant bits of the pixels?

haughty heart Nov 6, 2023, 12:39 AM

#

serene scaffold Does it basically hide another image in the least significant bits of the pixels...

So it basically hides tiny pixels in the actual art, and when the AI tries to use this art to train itself it thinks the art is something else

#

So if there’s a picture of a dog and it has that poison pixel in it. It makes the picture look like a cat instead of a dog

#

So it’s basically a deterrent against these big AI tech companies, like the one behind ChatGPT, using artists' art without their knowledge

left tartan Nov 6, 2023, 12:45 AM

#

This will be the arms race for next 20(+) years... counter-AI measures and counter-counter measures.

#

10 years from now, the first "AI Defense" majors...

serene scaffold Nov 6, 2023, 12:52 AM

#

left tartan 10 years from now, the first "AI Defense" majors...

15 years, interactive ai defense entertainment

left tartan Nov 6, 2023, 12:53 AM

#

serene scaffold 15 years, interactive ai defense entertainment

All those tower defense games have prepared us well.

haughty heart Nov 6, 2023, 12:54 AM

#

left tartan This will be the arms race for next 20(+) years... counter-AI measures and count...

Yes

#

Scary

lapis sequoia Nov 6, 2023, 1:46 AM

#

!e what you are

iron peak Nov 6, 2023, 3:21 AM

#

Has anyone heard of polars?

desert oar Nov 6, 2023, 3:38 AM

#

echo mesa Guys, I have a confusion with vectors and their magnitude. Now I'm reading a boo...

however when we don't start from the origin

in linear algebra, all vectors start at the origin. the notion of a vector as an arrow floating around in space is a convenience used in physics, but it has little to do with how it's actually modeled in linear algebra.

if you want to live in a world where the origin is "free floating", the magnitude is defined as the distance between the vector and its own origin, not strictly the "global" (0, 0) point.

but yes, your line of thinking is reasonable and it's a good question.

serene scaffold Nov 6, 2023, 4:13 AM

#

iron peak Has anyone heard of polars?

Yes. Are you just interested to know if people have heard of it?

iron peak Nov 6, 2023, 4:14 AM

#

Yeah. I saw two YouTube shorts about it and was like “this is weird. Is it something that’s up and coming, or did YouTube hook onto something I didn’t notice?”

serene scaffold Nov 6, 2023, 4:15 AM

#

My impression is that Polars is on the rise, but I'm not inclined to switch to it.

iron peak Nov 6, 2023, 4:17 AM

#

I don’t know pandas yet. Maybe I should just get ahead with polars.

serene scaffold Nov 6, 2023, 4:17 AM

#

Maybe. Pandas is very established and integrated with the data science stack

desert oar Nov 6, 2023, 4:25 AM

#

iron peak Yeah. I saw two YouTube shorts about it and was like “this is weird. Is it somet...

it is in fact up and coming

desert oar Nov 6, 2023, 4:25 AM

#

iron peak I don’t know pandas yet. Maybe I should just get ahead with polars.

i'd suggest starting with pandas. it's much more popular and it's better suited for interactive use. polars is lower-level in some respects, less flexible, and more verbose

iron peak Nov 6, 2023, 4:26 AM

#

serene scaffold Maybe. Pandas is very established and integrated with the data science stack

The video I saw was showing polars to be 4x faster than pandas.

iron peak Nov 6, 2023, 4:26 AM

#

desert oar i'd suggest starting with pandas. it's much more popular and it's better suited ...

Alright. Thanks.

desert oar Nov 6, 2023, 4:26 AM

#

iron peak The video I saw was showing polars to be 4x faster than pandas.

polars can be significantly faster than pandas on bigger datasets, yes. however pandas is basically instant on any operations of < 1 million rows, and in the 1 - 10 million range it will be fast enough. in the 10 - 100 million range it starts to feel pretty slow.

#

pandas is not particularly highly optimized, being more aimed at general-purpose use than high performance

#

(it actually is fairly well optimized for what it is, but the developers have a range of priorities, optimization being only one of them)

#

polars meanwhile is specifically designed to be high performance

#

the main advantage of polars over pandas is that polars has a "query engine" much like what you find in a database, that can compile and optimize what you ask it to do. whereas pandas just does whatever you tell it, no query optimization or compilation phase.

left tartan Nov 6, 2023, 4:30 AM

#

there are some interesting benchmark results here that compare Polars and Pandas, if you're interested. This is a duckdb publication (I'm not shilling...), but you can ignore everything else and just compare the Polars vs Pandas lines. Full blog post: https://duckdb.org/2023/11/03/db-benchmark-update.html

past meteor Nov 6, 2023, 8:07 AM

#

iron peak Had anyone heard of polars?

I'd always recommend people to get started with Pandas, it integrates best with the data science stack indeed

#

To me, aside from the speed aspect, Polars has much better syntax and many constructs like over and groupby_dynamic which make it significantly less verbose than Pandas.

#

I'll always make the argument I've used Spark, SQL, Dplyr, data.table, Polars, Pandas, ... and honestly Pandas has always been the one that makes me want to pull my hair out. Even if it were slower I'd prefer Polars' syntax.

spark nimbus Nov 6, 2023, 11:13 AM

#

Does anyone know how to easily debug where in a chain of computations dask is failing? It says it failed to infer types on mul but with the amount of multiplications we have that means nothing. The tracebacks contain no call site information and the code ran fine in pandas.

grand minnow Nov 6, 2023, 11:59 AM

#

Just spotted this in the wild. A Taipy. https://www.taipy.io/ looks interesting for data scientists. But I don't know how well this compares to streamlit

Taipy

The Taipy Team

Welcome to Taipy | Taipy

Taipy is an open-source Python library for building your web application front-end & back-end in no time. No knowledge of web development is required!

desert oar Nov 6, 2023, 1:59 PM

#

spark nimbus Does anyone know how to easily debug where in a chain of computations dask is fa...

i don't have an answer, but maybe can you reproduce on a smaller dataset or a subset of the overall computations?

spark nimbus Nov 6, 2023, 2:00 PM

#

Unfortunately I can't make a subset as that requires loading the original file

desert oar Nov 6, 2023, 4:26 PM

#

spark nimbus Unfortunately I can't make a subset as that requires loading the original file

you can't create a small sample from the original file for debugging purposes?

cold osprey Nov 6, 2023, 4:39 PM

#

pandas can read in top n rows iirc

#

or polars

boreal gale Nov 6, 2023, 4:45 PM

#

try specifying the dtypes explicitly when reading your data (if you are working with a schemaless format like CSV)?

spark nimbus Nov 6, 2023, 4:51 PM

#

desert oar you can't create a small sample from the original file for debugging purposes?

I haven't found a way to do partial reads from parquet files unfortunately, but even if I did I'd need to select by certain criteria to perform all tests which would likely require loading the majority of the data anyway

past meteor Nov 6, 2023, 4:53 PM

#

spark nimbus I haven't found a way to do partial reads from parquet files unfortunately, but ...

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html you can specify n_rows and instantly turn it into a pandas dataframe with to_pandas() assuming you're on Pandas 2.X

spark nimbus Nov 6, 2023, 4:54 PM

#

Well yeah but then I'd need to loop over all rows and discard them if they don't fit a given criteria and if the resulting dataset is too small I'd need to filter on different criteria to get a good dataset

#

And with over 1.5m rows I'm not sure if that's feasible

past meteor Nov 6, 2023, 4:55 PM

#

In that case you can scan_parquet and apply all the filters you want using polars, collect(streaming=True) and then to_pandas()

#

scan_parquet is lazily executed

#

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_parquet.html

past meteor Nov 6, 2023, 4:57 PM

#

spark nimbus Well yeah but then I'd need to loop over all rows and discard them if they don't...

https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.collect.html (streaming results is still in alpha)

spark nimbus Nov 6, 2023, 4:58 PM

#

Does it allow for groupby queries or only simple filtering?

lapis sequoia Nov 6, 2023, 4:58 PM

#

can someone help me adding my flask app to github and have to add .gitignore file and config.json as well

spark nimbus Nov 6, 2023, 4:58 PM

#

@lapis sequoia #web-development

past meteor Nov 6, 2023, 4:59 PM

#

spark nimbus Does it allow for groupby queries or only simple filtering?

Good question! You can definitely groupby on a LazyFrame but I don't know exactly what queries you can do in streaming mode

#

You might not even need streaming mode tbf

#

1.5m rows isn't that much if your machine has like 8-16GB ram, considering Polars' memory footprint is substantially smaller to begin with

spark nimbus Nov 6, 2023, 5:00 PM

#

I've tried both pandas and dask so far but both can't allocate enough memory (dask says I'd need about 80GB)

past meteor Nov 6, 2023, 5:02 PM

#

spark nimbus I've tried both pandas and dask so far but both can't allocate enough memory (da...

I'm this room's #1 Polars shill so obviously take what I'm saying with a grain of salt: Pandas' memory footprint is so large, especially pre 2.X that if you move it to Polars and drop Dask you might be fine already

#

Polars does a lot of multithreading already to begin with. Dask adds so much complexity and opaqueness.

spark nimbus Nov 6, 2023, 5:03 PM

#

Unfortunately it's management who told me to use dask over pandas/polars so I'm afraid my hands are mostly tied

past meteor Nov 6, 2023, 5:03 PM

#

I remember writing Numba + Numpy + Dask stuff that released the GIL and the code would silently fail without errors if 1 thread or so ran into trouble.

#

I wouldn't wish this onto my worst enemies

spark nimbus Nov 6, 2023, 5:04 PM

#

Trust me I'd switch to polars if I could, I've had to abuse some #esoteric-python tech to patch dask to even be compatible with our old code

past meteor Nov 6, 2023, 5:04 PM

#

Have you at least upgraded your Pandas version?

spark nimbus Nov 6, 2023, 5:04 PM

#

On the latest version compatible with py3.11.5 afaik

past meteor Nov 6, 2023, 5:05 PM

#

You definitely need to check if it starts with a 2

spark nimbus Nov 6, 2023, 5:05 PM

#

2.1.1 it looks like

past meteor Nov 6, 2023, 5:06 PM

#

If mgmt is set on using Pandas, which is fine if you have tons of existing code, I'd still consider doing the bulk of the preprocessing / subsetting in Polars before moving to Pandas.

#

That's what I do, I have some legacy Pandas code that works fine. There's literally no sense in refactoring it. I keep it as is, do some Polars stuff beforehand and call df.to_pandas()

#

Could that work for you? (you can do the same in DuckDB, it doesn't need to be Polars)

rugged comet Nov 6, 2023, 6:01 PM

#

I found the RMSE and the average of the absolute value of the residuals for three regression models. I also found the mean and the standard deviation of the response. I'm being asked to 1, determine if the models are accurate, 3, how I determined their accuracy, and 2, whether I compared predictions to other numbers.

1. For me, accuracy is hard to quantify for regression models. I typically think of classification models when I hear 'accuracy'. However, I think that we can use RMSE as a measure to evaluate the models. Though I'm not sure how to put into words my findings.
2. I compared the predictions to the actual values. That part was easy.
3. Are the predictions accurate? Again, this notion of accuracy for a regression model is hard for me to conceptualize. I think I have the pieces of information needed to get to the answer; I'm just not sure how to get there.

past meteor Nov 6, 2023, 6:03 PM

#

rugged comet I found the RMSE and the average of the absolute value of the residuals for thre...

1. Always compare against a baseline, even with classification. For instance sklearn offers DummyClassifier and DummyRegressor. I always prefer speaking of performance relative to other things.
2. GJ!
3. It's the same point as 1.
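
A rough sketch of what comparing against a baseline can look like, using only the standard library. The numbers are made up for illustration, and the baseline here simply predicts the mean of the response, which is the idea behind `DummyRegressor` with its default "mean" strategy.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical responses and model predictions, for illustration only.
y_true = [100.0, 150.0, 200.0, 250.0, 300.0]
y_model = [110.0, 140.0, 190.0, 260.0, 310.0]

# Baseline: always predict the mean of the response.
# (On held-out data you would use the mean of the training response.)
y_baseline = [statistics.fmean(y_true)] * len(y_true)

model_rmse = rmse(y_true, y_model)
baseline_rmse = rmse(y_true, y_baseline)

print(f"model RMSE:    {model_rmse:.2f}")
print(f"baseline RMSE: {baseline_rmse:.2f}")
print(f"ratio:         {model_rmse / baseline_rmse:.2f}")
```

A ratio well below 1 says the model beats the naive guess. A ratio near 1 says it has learned very little, whatever the raw RMSE looks like.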

#

Accuracy also depends on the business problem. Being 3 % off of sales numbers is huge if you're walmart compared to a corner store. Being 3 % off of detecting a life threatening disease is worse than either of those.

#

Finally, under or overshooting isn't as bad. If you're doing sales forecasting there's typically a cost for overstocking and understocking and it's asymmetric meaning that the same RMSE (say your predictions were always respectively 10 above or 10 under the target) can have different "business" costs. Same goes for detecting a life threatening disease, sending a few more people to a second screening isn't as bad as telling them they flat out have nothing.
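
A small sketch of that asymmetry, under made-up assumptions: the demand figures and the per-unit costs (`over_cost`, `under_cost`) are hypothetical, and `business_cost` is an illustrative helper rather than any standard metric. Both sets of predictions have the same RMSE, but their costs differ.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def business_cost(y_true, y_pred, over_cost=1.0, under_cost=5.0):
    """Total cost when over- and under-prediction are priced differently."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        error = p - t
        if error > 0:
            total += over_cost * error      # e.g. overstocking
        else:
            total += under_cost * -error    # e.g. lost sales
    return total


demand = [100, 120, 90, 110]              # hypothetical actual sales
always_over = [d + 10 for d in demand]    # always 10 above the target
always_under = [d - 10 for d in demand]   # always 10 under the target

print(rmse(demand, always_over), rmse(demand, always_under))
print(business_cost(demand, always_over), business_cost(demand, always_under))
```

With these assumed costs, undershooting is five times as expensive as overshooting even though RMSE cannot tell the two apart.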

desert oar Nov 6, 2023, 6:16 PM

#

spark nimbus I haven't found a way to do partial reads from parquet files unfortunately, but ...

i'm not saying to stop using dask, i'm just saying to write some script that takes an extract from your data, just so you have a smaller set to debug with

#

that or generate one randomly

abstract wasp Nov 6, 2023, 6:19 PM

#

Hi, does anyone here know about a really good research paper where they talk about cGANs and using that for age progression and regression? Feeding an image of a person to the cGAN and then generating an image of the person but younger/older.
I want to use the paper as a guide so I can build one myself.

desert oar Nov 6, 2023, 6:19 PM

#

rugged comet I found the RMSE and the average of the absolute value of the residuals for thre...

"accuracy" in this sense just means "correctness of predictions". in classification we have a very natural way to compute that: score each prediction 1 if correct and 0 if wrong, then average. that average is what's usually just called "accuracy" (the "0-1" loss is its complement). for continuous labels, you're right that we have to use something else. RMSE is a great choice overall.

> Are the predictions accurate?
think about what RMSE is: it's the square root of the mean squared error of the model predictions, so it's in the same units as the response. so are the predictions accurate? you tell me: look at the thing you computed. if i asked you the same about a classification model, the reasoning would be the same: do the model predictions match the labels in the data? if not, how large is the error? if we chose a sensible error/loss function, one would hope that lower loss means better predictions.

rugged comet Nov 6, 2023, 6:26 PM

#

desert oar "accuracy" in this sense just means "correctness of predictions". in classificat...

Say the RMSE was 56000. The mean of the response was 180000. The standard deviation of the response was 80000. Is the model accurate because the RMSE is lower than the standard deviation?

desert oar Nov 6, 2023, 6:26 PM

#

rugged comet Say the RMSE was 56000. The mean of the response was 180000. The standard deviat...

accuracy is not a yes-no thing

#

that's true for classification as well

past meteor Nov 6, 2023, 6:27 PM

#

You can't make any claims about accuracy without benchmarks and baselines imo

desert oar Nov 6, 2023, 6:27 PM

#

that ☝️

#

accuracy is only meaningful when compared to other accuracies

#

or when compared to some benchmark as required for your business
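
a back-of-the-envelope sketch with the numbers from the question. it assumes the standard deviation was computed on the same data as the RMSE, in which case always predicting the mean would give an RMSE roughly equal to that standard deviation, so the standard deviation can serve as the baseline.

```python
rmse_model = 56_000     # RMSE reported for the model
std_response = 80_000   # std of the response, i.e. roughly the RMSE of
                        # always predicting the mean (assumption: same data)

ratio = rmse_model / std_response
# 1 - MSE / variance is the usual R^2; here it is only approximate.
r_squared = 1 - ratio ** 2

print(f"RMSE / std  = {ratio:.2f}")
print(f"approx. R^2 = {r_squared:.2f}")
```

so the model's error is about 30% lower than the mean-only baseline and it explains roughly half of the variance. whether that counts as "accurate" still depends on what the predictions are for.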

past meteor Nov 6, 2023, 6:27 PM

#

past meteor 1. Always compare against a baseline, even with classification. For instance skl...

point 1 is the key here and this is how to "catch" people 😄

desert oar Nov 6, 2023, 6:27 PM

#

the real question is, is it accurate enough for whatever you're trying to accomplish?

rugged comet Nov 6, 2023, 6:28 PM

#

I think I'm just hung up on the phrasing of the question.

desert oar Nov 6, 2023, 6:28 PM

#

rugged comet I think I'm just hung up on the phrasing of the question.

it's either badly phrased or a trick

past meteor Nov 6, 2023, 6:28 PM

#

I'd compare them against each other and add a baseline

desert oar Nov 6, 2023, 6:28 PM

#

i think you're supposed to explain whether it's accurate enough for whatever the context of the question is

past meteor Nov 6, 2023, 6:28 PM

#

past meteor Finally, under or overshooting isn't as bad. If you're doing sales forecasting t...

And then try to apply this

#

Think about asymmetric costs (by looking at predicted vs. actual plots) and qualitatively argue for the accuracy of your model compared to your business problem

#

Don't underestimate how qualitative data science is! 😄

desert oar Nov 6, 2023, 6:33 PM

#

past meteor Think about asymmetric costs (by looking at predicted vs. actual plots) and qual...

fwiw i wouldn't worry about this on a homework assignment at this level. it's important in real world practice, but if they're not asking about it at all, you probably don't yet have good tools for reasoning about it

past meteor Nov 6, 2023, 6:34 PM

#

I think it's my coursework where I picked up this line of thinking in the first place

#

But not in an intro to DS course yes I'll give you that

serene scaffold Nov 6, 2023, 7:57 PM

#

Something that's increasingly causing miscommunication these days is that when people hear "LLM", a lot of people immediately think of instruction-tuned LLMs. but I've been using LLMs for a few years, and I still just think of them as a probability distribution for word co-occurrence.

sterile wyvern Nov 6, 2023, 11:13 PM

#

How come GridSearchCV doesn't need to call the original function that made the time series data?

#

Ordinarily I should use the same function to pass the parameters to. How could this be possible tbh?

sour coral Nov 7, 2023, 12:19 AM

#

I'm making an image recognition for Science Fair and im using Teachable Machine for the processing. I'm trying to run it in Python and the export guide has a code snippet to load a Keras model. The model it is trying to load is in a zip file in my downloads. Here is the code that is trying to load the model:

#

model = load_model("keras_Model.h5", compile=False)

#

I'm getting an error saying no file or directory found at keras_Model.h5. Any ideas on how to get the code to find it?

#

Thanks!

acoustic prawn Nov 7, 2023, 12:38 AM

#

sour coral I'm making an image recognition for Science Fair and im using Teachable Machine ...

it should probably be in the same directory as the file that's executing load_model.
But just go to the definition of the function load_model to check exactly what it does and which path it's using to load the file

sour coral Nov 7, 2023, 12:53 AM

#

Directory as in the same folder?

#

Sorry I'm kinda new to this

acoustic prawn Nov 7, 2023, 1:04 AM

#

sour coral Directory as in the same folder?

yup.. but that's just an assumption, try to look at the code, where load_model is defined to see what's actually happening and in which path the file is supposed to be
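
a minimal sketch of one way to make the path explicit, using only the standard library. the `extract_model` helper and the zip name in the usage comment are hypothetical. note that a relative path such as "keras_Model.h5" is resolved against the current working directory, which is not always the folder the script lives in, and that the .h5 file has to be extracted from the zip before it can be loaded.

```python
import zipfile
from pathlib import Path


def extract_model(zip_path, out_dir, model_name="keras_Model.h5"):
    """Unpack `zip_path` into `out_dir` and return the model's absolute path."""
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    model_path = (out_dir / model_name).resolve()
    if not model_path.is_file():
        raise FileNotFoundError(f"{model_name} not found in {zip_path}")
    return model_path


# Hypothetical usage; the zip name and its location are assumptions:
# zip_path = Path.home() / "Downloads" / "converted_keras.zip"
# model_path = extract_model(zip_path, Path(__file__).parent)
# model = load_model(str(model_path), compile=False)
```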

sour coral Nov 7, 2023, 1:27 AM

#

Okay, it found the file, but now it's saying 'Unable to open file (file signature not found)'
Is it because it's trying to access a zip file?

#

Never mind it works!

#

Thank you so much for your help!

shut girder Nov 7, 2023, 1:49 AM

#

Hello, how should I approach learning how to do exploratory analysis? Any resource suggestions?

empty pond Nov 7, 2023, 1:49 AM

#

Hey

left tartan Nov 7, 2023, 1:51 AM

#

shut girder Hello, how should I approach learning how to do exploratory analysis? Any resour...

My favorite EDA resource is https://www.itl.nist.gov/div898/handbook/

shut girder Nov 7, 2023, 1:52 AM

#

Thanks, I'll check it out

spark nimbus Nov 7, 2023, 8:46 AM

#

Running into an issue where df[col] * some_float throws an error because it can't multiply a sequence by a non-int, but at the time of running the dtype of the column should be float, not str, and df[col].compute() even gives a float64 series as output

sterile wyvern Nov 7, 2023, 8:48 AM

#

I'm just wondering how it works.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

It doesn't take the original function that made the time series.

So sorry, my question is so basic.


buoyant vine Nov 7, 2023, 1:10 PM

#

Has anyone had issues using Poetry and PyTorch 2.1?
For some reason Poetry seems unable to find the right packages even though they exist on PyPI and also in PyTorch's own index, which is the fallback.

#

And then as a second question, does anyone know if it is possible to update the CUDA version a CML docker container is using (using 11 atm, need 12)?

serene scaffold Nov 7, 2023, 1:29 PM

#

spark nimbus Running into an issue where `df[col] * some_float` throws an error because it ca...

are you using dask? if you're asking about dataframes, the default assumption is that you're using pandas.

wooden sail Nov 7, 2023, 1:30 PM

#

buoyant vine And then as a second question, does anyone know ...

you could make your own docker image and install cml on it

#

there's probably a pytorch image with your desired cuda version, then you install cml on top

wooden sail Nov 7, 2023, 1:32 PM

#

buoyant vine Has anyone had issues using Poetry and PyTorch 2.1? For some reason Poetry ...

regarding this, if you're using this inside the docker container, you could also consider a solution that is more low tech than poetry, since containerizing more or less achieves the same effect as making venvs

buoyant vine Nov 7, 2023, 1:32 PM

#

wooden sail regarding this, if you're using this inside the docker container, you could also...

Not everything is using docker though, if you are running locally for example, it's more convenient and easier to manage deps with poetry and the lockfile

wooden sail Nov 7, 2023, 1:33 PM

#

ah you're doing both, ok

buoyant vine Nov 7, 2023, 1:33 PM

#

yeah

wooden sail Nov 7, 2023, 1:33 PM

#

fair enough

#

i can't help with that :x

buoyant vine Nov 7, 2023, 1:34 PM

#

i will work on the docker stuff first, since that is currently breaking ci

wooden sail Nov 7, 2023, 1:34 PM

#

https://hub.docker.com/layers/pytorch/pytorch/2.1.0-cuda12.1-cudnn8-runtime/images/sha256-e4aaefef0c96318759160ff971b527ae61ee306a1204c5f6e907c4b45f05b8a3?context=explore this might be a good place to look

#

https://cml.dev/doc/install#docker and here

#

should be able to use the pytorch image as a base and slap those cml install commands on top... in theory

boreal gale Nov 7, 2023, 1:43 PM

#

spark nimbus Running into an issue where `df[col] * some_float` throws an error because it ca...

from your chat history, this is when using dask right?
can you:

- confirm how you are reading your dask dataframe?
- confirm how many parquet files you are reading?
- if you are reading more than one parquet file, confirm that all the files' schemas are identical?
- confirm that you don't have an interactive python interpreter session that is capable of reading all the parquets with dask?

desert oar Nov 7, 2023, 1:48 PM

#

that error about multiplying a sequence almost sounds like trying to multiply a native python list

#

!e ```py
[1,2,3] * 5.1
```

arctic wedgeBOT Nov 7, 2023, 1:48 PM

#

@desert oar :x: Your 3.12 eval job has completed with return code 1.

001 | Traceback (most recent call last):
002 |   File "/home/main.py", line 1, in <module>
003 |     [1,2,3] * 5.1
004 |     ~~~~~~~~^~~~~
005 | TypeError: can't multiply sequence by non-int of type 'float'

desert oar Nov 7, 2023, 1:49 PM

#

@spark nimbus ☝️

#

somehow you ended up with a native python list somewhere. maybe in an "object" dtype column.

spark nimbus Nov 7, 2023, 2:02 PM

#

Using .astype(float) ended up solving it

#

Turns out it got confused by the type remapping from a .replace call
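
A plain-Python sketch of the same failure and fix, not dask itself. It assumes the column ended up holding strings after the `.replace` call, which is only a guess at what happened; the sample values are made up.

```python
values = ["1.5", "2.0", "3.25"]  # numbers stored as strings (hypothetical)
factor = 2.0

try:
    scaled = [v * factor for v in values]
except TypeError as exc:
    # A str is a sequence too, so this is the same error as with a list.
    message = str(exc)
    print(message)

# Converting first is the plain-Python analogue of .astype(float).
scaled = [float(v) * factor for v in values]
print(scaled)
```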

spice mountain Nov 7, 2023, 2:19 PM

#

I am trying to pass an image into a pretrained VQGAN model. The model takes floats as inputs/uses floats for its biases.

The original image is the one in the middle. The one on the left (the blue one that is missing most of its pixels) is what happens when I convert it to floats. The one that looks a bit demonic is what I get when I just load it into a PyTorch tensor.


```python
# Import necessary libraries
import torch
from PIL import Image
import torchvision.transforms as transforms

# Read a PIL image
image = Image.open(segmentation_path)

# Define a transform to convert PIL
# image to a Torch tensor
transform = transforms.Compose([
    transforms.PILToTensor()
])

# transform = transforms.PILToTensor()
# Convert the PIL image to Torch tensor
img_tensor = transform(image).permute(1,2,0)

# print the converted Torch tensor
print(img_tensor)
```

My code for loading it.

In order to turn it into a float, I just do

```python
img_tensor.float()
```

buoyant vine Nov 7, 2023, 2:23 PM

#

We love CI

#

So trying to setup the action CI, we now use the pytorch image

#

but now setup-cml doesn't seem to want to work 😢

river maple Nov 7, 2023, 4:10 PM

#

I'm trying to make a driver drowsiness detection program using a pretrained model.
Here's the code:

import cv2
import os
import tensorflow as tf
from tensorflow.keras.models import load_model
import numpy as np
import pygame.mixer as mixer

mixer.init()
sound = mixer.Sound('alarm.wav')

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
model = load_model(os.path.join("models", "cnnCat2.h5"))
lbl = ['Close', 'Open']

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    height, width = frame.shape[0:2]
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    faces = face_cascade.detectMultiScale(gray, minNeighbors=3, scaleFactor=1.2, minSize=(25, 25))
    eyes = eye_cascade.detectMultiScale(gray, minNeighbors=3, scaleFactor=1.2)

    for (x, y, w, h) in faces:
        cv2.rectangle(frame, pt1=(x, y), pt2=(x + w, y + h), color=(255, 0, 0), thickness=3)

    for (ex, ey, ew, eh) in eyes:
        cv2.rectangle(frame, pt1=(ex, ey), pt2=(ex + ew, ey + eh), color=(255, 0, 0), thickness=3)
        eye = frame[ey:ey + eh, ex:ex + ew]
        eye = cv2.resize(eye, (80, 80))
        eye = eye / 255
        eye = eye.reshape(80,80,3) 
        eye = np.expand_dims(eye, axis=0)
        prediction = model.predict(eye)
        print(prediction)

I get this error when i run the code.

mild dirge Nov 7, 2023, 4:52 PM

#

@river maple Should the image not be grayscale?

#

It seems to expect 1 channel, but you have 3

river maple Nov 7, 2023, 5:31 PM

#

@mild dirge i changed the image to grayscale and reshaped it to (80,80,1)

#

Now it's giving me this

vale swallow Nov 7, 2023, 7:30 PM

#

Hi, is there such a thing as a deep learning engineer? Or is it just an ML engineer?
I’m most interested in deep learning so I was wondering if there are careers that only focus on that.

agile cobalt Nov 7, 2023, 7:32 PM

#

~~99.99% of machine learning jobs are focused on deep learning~~

past meteor Nov 7, 2023, 7:33 PM

#

agile cobalt ~~99.99% of machine learning jobs are focused on deep learning~~

Is that so? 🤔

#

I think DL gets the most "airtime" but I still see a lot of shops doing ML on tabular data.

agile cobalt Nov 7, 2023, 7:35 PM

#

you think that those shops will have a professional specialized on machine learning?

past meteor Nov 7, 2023, 7:36 PM

#

Yes

#

I was at a data science / ML consultancy in the past and they did virtually no DL. Most of it was ML on tabular data. Typical use cases like fraud detection, forecasting, predictive maintenance, ...

agile cobalt Nov 7, 2023, 7:38 PM

#

how long ago 'in the past'?

past meteor Nov 7, 2023, 7:38 PM

#

I think this was 2 years ago

agile cobalt Nov 7, 2023, 7:39 PM

#

not sure then

past meteor Nov 7, 2023, 7:39 PM

#

I'm not sure why you'd use DL if you don't need it 🤷 . There are, imo, many cons especially surrounding deployment of it.

#

So when it's tabular data I'd very much prefer not to.

#

Also the time it takes to pick a bespoke architecture, tuning which is harder but 100 % necessary etc.

past meteor Nov 7, 2023, 7:42 PM

#

agile cobalt not sure then

How is it in your experience / location? 😮

agile cobalt Nov 7, 2023, 7:43 PM

#

stakeholders don't understand the tradeoffs and just want the best thing money can buy, but for cheap

past meteor Nov 7, 2023, 7:44 PM

#

What stakeholders though? If they're business idt they'd understand the difference between xgboost and DL

#

Sorry if I'm sounding contentious, really just curious about your experience since for me it's different 😄

agile cobalt Nov 7, 2023, 7:46 PM

#

just managers not very exposed to statistics

left tartan Nov 7, 2023, 7:47 PM

#

My clients are either: "I must gpt everything", or "I don't trust ai to recommend my lunch, much less a trading decision/forecast"

past meteor Nov 7, 2023, 7:48 PM

#

The thing with my time at $ML/AIConsultingCompany is that if they knock on your door they want something already

#

And the business model was really to have different titles and teams do BI / data infra first and plan ML in the long run after the first set of projects have delivered value

#

idk how stuff looks like in that scene after GPT dropped 🤷

left tartan Nov 7, 2023, 8:08 PM

#

so much snakeoil.

vale swallow Nov 7, 2023, 8:20 PM

#

Ok, so, I’m guessing it depends on the company you work for?

#

Is anyone here an AI/deep learning researcher? What was your experience and is the income decent?

rugged comet Nov 7, 2023, 8:28 PM

#

I'm trying to understand low and high p-values and how they relate to normality and homoscedasticity of data.

Use the Shapiro-Wilk test to determine if the data is normal. Shapiro-Wilk gives us a p-value. We would like the number to be above .05.
This sounds like if the p-value is greater than 0.05, the data is normal.

A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true. The lower the p-value, the greater the statistical significance of the observed difference. A p-value of 0.05 or lower is generally considered statistically significant.
This part I'm not getting. I think it contradicts the first part. This makes it sound like a low p-value is good.
But perhaps "good" has opposite meanings in different contexts.
How would you interpret a low p-value?

agile cobalt Nov 7, 2023, 8:30 PM

#

what do you understand that the first part means by "determine if the data is normal"?

rugged comet Nov 7, 2023, 8:33 PM

#

I understand the first part to mean compare the p-value we got from the Shapiro-Wilk test on the residuals to 0.05.

agile cobalt Nov 7, 2023, 8:34 PM

#

and do you understand what it means (regarding the data) if it passes or fails the test?

rugged comet Nov 7, 2023, 8:35 PM

#

I think that if the data fails the test, p < 0.05, then the data is not normal. If it passes the test, p > 0.05, the data is normal.

#

Is that what you're asking?

past meteor Nov 7, 2023, 8:36 PM

#

That's not really how to interpret p-values

#

But interpreting them correctly is a science in and of itself so I get the struggle

tidal bough Nov 7, 2023, 8:37 PM

#

The "null hypothesis" of the S-W test is that the underlying distribution is normal. Hence, the p-value the Shapiro-Wilk test gives you is ||the probability of obtaining this data (well, or rather data that's at-least-this-weird by the metrics the S-W test tracks) if the underlying distribution is normal||.

#

So if you get a low enough p-value, then (as always) you can consider your null hypothesis disproven - this data probably didn't come from a normal distribution.
So if you want to check that the data is from a normal distribution, you're looking for a high p-value on the S-W test.

rugged comet Nov 7, 2023, 8:41 PM

#

tidal bough The "null hypothesis" of the S-W test is that the underlying distribution is nor...

(I haven't clicked on the spoiler yet)

> A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true
So if we assume the null hypothesis is true, that is the data is normally distributed, a low p-value means a low chance that the observed data came from a normal distribution?

tidal bough Nov 7, 2023, 8:41 PM

#

rugged comet (I haven't clicked on the spoiler yet) > A p-value measures the probability of o...

Yup. Hence, if you got data with a low p-value on the S-W test, the distribution it came from probably isn't normal (otherwise obtaining this data would be unlikely).

buoyant vine Nov 7, 2023, 8:42 PM

#

Does anyone know how PyTorch lightning distributed backend works with datamodules?
Currently it appears to be creating two instances which I haven't configured, and I'm not sure how to turn that off, because it seems to be breaking things...

I see the python file being run as __main__ twice, once for each process, and in the logs:

Log: https://paste.pythondiscord.com/7UYA

The problem is PT doesn't seem to then call prepare_dataset again, so the validation dataset is not generated.
But I don't really want it trying to do two distributed systems right now 😅

This is the training code

        early_stop = pl.callbacks.early_stopping.EarlyStopping(
            monitor="val_loss",
            min_delta=0.00001,
            patience=5,
            verbose=False,
            mode="max",
        )

        trainer = pl.Trainer(
            callbacks=[early_stop],
            max_epochs=self.model_config.n_epochs,
            num_nodes=1,
            log_every_n_steps=32,
            accelerator="auto",
            devices="auto",
        )

        trainer.fit(self.model, self.data_module)

num_nodes suggests to me that this shouldn't spawn multiple independent processes?

tidal bough Nov 7, 2023, 8:43 PM

#

tidal bough Yup. Hence, if you got data with a low p-value on the S-W test, the distribution...

That said, I don't like this definition you quoted:

> A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true
The probability of obtaining any particular results is usually 0 or close to it. The definition of the p-value usually involves a clause like "results as extreme as the observed".

rugged comet Nov 7, 2023, 8:46 PM

#

tidal bough That said, I don't like this definition you quoted: > A p-value measures the pro...

What is meant by extreme in "results as extreme as the observed"? I think perhaps it means that it would be unlikely to select from a normal distribution and get this result.

tidal bough Nov 7, 2023, 8:50 PM

#

rugged comet What is meant by extreme in "results as extreme as the observed"? I think perhap...

Basically, to do statistics this way, you start by selecting the metric you will track. Suppose you're flipping a coin 20 times, say, and you're checking the null-hypothesis of the coin being fair. You choose the number of heads as the metric.
You get 15. What's the p-value of this result? It's not the probability of obtaining exactly 15 heads (1.48%) - that'll be quite low even for 10 heads (worse, if we were doing an experiment with continuous measurements, the probability of obtaining any particular result would just be exactly 0). It's the probability of obtaining >=15 heads, about 2%.

(Or, alternatively, we could choose to track the 2-tailed p-value here - that is, we consider deviations from 10 to both sides to be the same, and so our p-value will be the sum of the probabilities of obtaining 15,...,20 and also 0,...,5.)
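
The coin example can be checked directly. This is a small sketch using only the standard library; `binom_pmf` is a helper name made up here, and the fair coin (p = 0.5) is the null hypothesis from the message above.

```python
from math import comb

def binom_pmf(k: int, n: int, p: float = 0.5) -> float:
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, observed = 20, 15

# Probability of exactly 15 heads (not the p-value).
p_exact = binom_pmf(observed, n)

# One-sided p-value: 15 or more heads.
p_one_sided = sum(binom_pmf(k, n) for k in range(observed, n + 1))

# Two-sided p-value: at least as far from 10 as the observed count, in either direction.
distance = abs(observed - n / 2)
p_two_sided = sum(binom_pmf(k, n) for k in range(n + 1) if abs(k - n / 2) >= distance)

print(f"{p_exact:.4f} {p_one_sided:.4f} {p_two_sided:.4f}")
# prints: 0.0148 0.0207 0.0414
```

The two-sided value is exactly twice the one-sided one here because the fair-coin distribution is symmetric around 10.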

#

(If it seems kinda arbitrary that you need to precommit to doing only specific measurements (total number of heads, not some other metric) and how you'll be calculating the results (one-sided or two-sided) before the experiment for it to be valid - yeah, it is, that's a common critique of this entire approach to statistics as opposed to e.g. the Bayesian one.)

tidal bough Nov 7, 2023, 8:55 PM

#

tidal bough Basically, to do statistics this way, you start by selecting the metric you will...

As a result of that, what does it mean if you get a p-value of 0.03 from the S-W test? Well, it means that they're measuring some property W of the data (you'd have to look up the definition of the test to know what), and the value of W on the test data is such that data drawn from a normal distribution would only produce a W-value at least that extreme 3% of the time (for S-W, "extreme" means small: W close to 1 is what normal data tends to give).

#

If you get a high p-value here, what does it mean? Well, it means your data is such that this statistic, calculated on this data, looks like what you'd get from a normal distribution.
(It would be a mistake, though, to assume that guarantees it actually came from a normal distribution - maybe there's some distribution that's not normal but produces a similar distribution of this statistic, and the data was drawn from it.)

past meteor Nov 7, 2023, 9:00 PM

#

I avoid using p-values in practice because they require a lot of baggage to interpret correctly

#

From my old slides:

tidal bough Nov 7, 2023, 9:02 PM

#

past meteor I avoid using p-values in practice because they require a lot of baggage to inte...

Ah, that's interesting. What do you do in practice, then? Are you calculating likelihood factors for the hypotheses instead?

rugged comet Nov 7, 2023, 9:05 PM

#

I also have this qq plot and this displot of the residuals. The dots should hug the line on the left, I believe, and the residuals should look like a bell curve on the right. Neither of these is true, which is more evidence in support of the residuals not being normal or homoscedastic.

past meteor Nov 7, 2023, 9:06 PM

#

tidal bough Ah, that's interesting. What do you do in practice, then? Are you calculating li...

I'm not often in situations where I need them. Summary statistics and counts are what I use, without trying to make strong claims unless I have access to an actual statistician.

When I was, and was "forced" to use them, they didn't express anything interesting for my domain. I had a cross section with demographic data. When n is that large, any difference becomes significant. Sure, then you need to fix an effect size, but it was hard to come up with an effect size before looking at the data. Doing it after means the entire thing is biased anyway.

serene scaffold Nov 7, 2023, 9:07 PM

#

is the right one supposed to look like the Burj Khalifa?

rugged comet Nov 7, 2023, 9:08 PM

#

Does the Burj Khalifa look like a bell? (rhetorical)

tidal bough Nov 7, 2023, 9:09 PM

#

tbh if you showed me the right plot i'd go 'yeah sure looks normal to me'

#

the left plot is more worrying, with the right tail especially

rugged comet Nov 7, 2023, 9:11 PM

#

Perhaps I should compare these with a normal displot and a normal qq plot to get a better idea.

rugged comet Nov 7, 2023, 9:26 PM

#

Would you agree with this? I'm uncertain about the null hypothesis for the White test.

cunning agate Nov 7, 2023, 11:28 PM

#

does any one work with knowledge graph ?

#

i've some question

pine void Nov 8, 2023, 12:42 AM

#

!paste

arctic wedgeBOT Nov 8, 2023, 12:42 AM

#

Pasting large amounts of code

If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.

pine void Nov 8, 2023, 12:43 AM

#

https://paste.pythondiscord.com/GFRQ

This is my code for a minimax on tic tac toe. I need to be able to get the score when running it, but in the end i need it to return the best move. How can I do that?
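
One common pattern is to have the recursive function return both things as a tuple, `(score, best_move)`, and let the recursive calls ignore the move. The sketch below is not based on the pasted code (which isn't reproduced here); the board layout, the `winner` helper, and the scoring convention (+1 X wins, -1 O wins, 0 draw) are all assumptions for illustration.

```python
from typing import List, Optional, Tuple

Board = List[str]  # 9 cells, each "X", "O" or " "

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

def winner(board: Board) -> Optional[str]:
    """Return "X" or "O" if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board: Board, player: str) -> Tuple[int, Optional[int]]:
    """Return (score, best_move) with the score seen from X's point of view."""
    won = winner(board)
    if won is not None:
        return (1 if won == "X" else -1), None
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0, None  # draw

    maximizing = player == "X"
    best_score, best_move = (-2, None) if maximizing else (2, None)
    other = "O" if maximizing else "X"
    for move in moves:
        board[move] = player                 # try the move
        score, _ = minimax(board, other)     # only the score matters deeper down
        board[move] = " "                    # undo it
        if (maximizing and score > best_score) or (not maximizing and score < best_score):
            best_score, best_move = score, move
    return best_score, best_move

board = list("XX OO    ")
score, move = minimax(board, "X")
print(score, move)
# prints: 1 2
```

At the top level the caller uses `move`; inside the recursion only `score` is read, so the same function serves both needs.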

desert oar Nov 8, 2023, 12:59 AM

#

past meteor From my old slides:

it's unfortunate because the underlying idea is actually pretty elegant

#

find a statistic with a known distribution under the null. if the value of that statistic in your sample is highly improbable, you reject the null.

lapis sequoia Nov 8, 2023, 7:04 AM

#

interesting

potent sky Nov 8, 2023, 9:56 AM

#

cunning agate does any one work with knowledge graph ?

It's generally better to just ask your question if possible, so whoever is familiar with the concept and free can pick it up and help you with it.
People usually don't want to engage in back-and-forth just to get to the question and then discover whether they're familiar / free enough to help with it or not

lapis sequoia Nov 8, 2023, 2:11 PM

#

Salut

spark nimbus Nov 8, 2023, 2:13 PM

#

With dask, how would I do df['x'] = df['x'].mask(df['x'] < 0, 0.0) or something along those lines? no matter what I do, it complains about the index not being aligned.

agile cobalt Nov 8, 2023, 2:42 PM

#

spark nimbus With dask, how would I do `df['x'] = df['x'].mask(df['x'] < 0, 0.0)` or somethin...

maybe https://docs.dask.org/en/latest/generated/dask.dataframe.Series.clip.html ?

spark nimbus Nov 8, 2023, 2:51 PM

#

oh I should've given a better example

#

tl;dr I need to use mask since I'm doing complex filtering

jaunty helm Nov 8, 2023, 3:47 PM

#

spark nimbus With dask, how would I do `df['x'] = df['x'].mask(df['x'] < 0, 0.0)` or somethin...

Works fine for me?

import pandas as pd
import seaborn as sns

iris = sns.load_dataset("iris")
iris: pd.DataFrame

iris["sepal_length"] = iris["sepal_length"].mask(iris["sepal_length"] < 5, 0.0)
print(iris.head())
#   sepal_length  sepal_width  petal_length  petal_width species
#0           5.1          3.5           1.4          0.2  setosa
#1           0.0          3.0           1.4          0.2  setosa
#2           0.0          3.2           1.3          0.2  setosa
#3           0.0          3.1           1.5          0.2  setosa
#4           5.0          3.6           1.4          0.2  setosa

agile cobalt Nov 8, 2023, 3:59 PM

#

jaunty helm Works fine for me? ```py import pandas as pd import seaborn as sns iris = sns.l...

dask

spark nimbus Nov 8, 2023, 4:05 PM

#

jaunty helm Works fine for me? ```py import pandas as pd import seaborn as sns iris = sns.l...

Unfortunately I'm dealing with Dask, not Pandas

jaunty helm Nov 8, 2023, 4:06 PM

#

Ah whoops

lapis sequoia Nov 8, 2023, 4:22 PM

#

spark nimbus With dask, how would I do `df['x'] = df['x'].mask(df['x'] < 0, 0.0)` or somethin...

what exactly are you doing here? what does mask do essentially?

#

also its not really that hard in dask

#

this is how you can do conditional assignment

spark nimbus Nov 8, 2023, 4:37 PM

#

I see

short heart Nov 8, 2023, 6:00 PM

#

Why does TfidfVectorizer return an all zero array (basically all zero, but there is 1 non-zero value among some 20,000 zeros)?

agile cobalt Nov 8, 2023, 6:05 PM

#

short heart Why does TfIdfVectorizer returns an all zero array (basically all zero but there...

did you read the documentation to see what it is doing in first place?

#

see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html and https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer if not

if you did see, specify which part exactly you do not understand

short heart Nov 8, 2023, 6:11 PM

#

agile cobalt did you read the documentation to see what it is doing in first place?

yes, I do realize what it does, and it shouldn't return an all zero array in my case, so I'm assuming there could be an implementation problem or something of this sort here

agile cobalt Nov 8, 2023, 6:14 PM

#

show a reproducible example then

short heart Nov 8, 2023, 6:16 PM

#

that would be hard without sharing initial data

plush jungle Nov 8, 2023, 8:04 PM

#

has anyone here done reinforcement learning? I've been training a DQN on a simple pygame game i made but after a ton of experiments I'm really disappointed at how badly/slowly it learns. I've seen successful projects with more advanced algorithms like PPO that make use of massive parallelization so I figured I'd give that a try

#

but I can't find a simple code example of PPO so I don't really understand how to implement it

#

so I figured I'd first try to parallelize the DQN and move on to PPO if that doesn't work

#

the problem is, I'm not actually sure how to model it

#

right now I have observations [o_1, ... o_n] and I pass a vector of length n to the neural net

#

but if I have N observations and M games running in parallel, do I change my neural net to have the first layer be NxM observations? cause that doesn't seem like what I want. I want the agent to be able to run on a single game, and it certainly shouldn't correlate across games

#

so I'm thinking I should have M copies of the neural net, store the results of the games in my replay memory, and then backpropagate on only one of the neural nets

#

but this is all way more complicated than I think I can manage without help

#

and I'm worried I'm overthinking it
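
A sketch of the usual way to think about this, under the assumption that each game produces one observation vector of length N per step: the network keeps its N inputs, and the M parallel games go into the batch dimension. One set of weights is applied to each row independently, so nothing is mixed across games and the same network still works on a single game. The layer sizes and names below are invented for illustration, and plain NumPy stands in for whatever framework is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_hidden, n_actions = 4, 16, 3   # sizes made up for illustration

# One set of weights shared by every game.
W1 = rng.normal(size=(n_obs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_actions))
b2 = np.zeros(n_actions)

def q_values(obs: np.ndarray) -> np.ndarray:
    """obs has shape (batch, n_obs); returns Q-values of shape (batch, n_actions)."""
    hidden = np.maximum(obs @ W1 + b1, 0.0)  # ReLU, applied row by row
    return hidden @ W2 + b2

m_games = 8
batch = rng.normal(size=(m_games, n_obs))    # one observation per parallel game
q_batch = q_values(batch)                    # shape (8, 3)
actions = q_batch.argmax(axis=1)             # one greedy action per game

# Running a single game is just a batch of size 1 and gives the same numbers.
q_single = q_values(batch[:1])
print(q_batch.shape, actions.shape, np.allclose(q_single[0], q_batch[0]))
```

With this layout there is no need for M copies of the network: all games push transitions into the same replay memory and one network is trained on sampled mini-batches.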

astral hazel Nov 8, 2023, 8:11 PM

#

What's the best layer layout for a lightweight NLP model that puts a single word in the right form, tense, case, etc.?

plush jungle Nov 8, 2023, 8:13 PM

#

astral hazel What's the best layer layout for lightweight NLP model, that puts single word in...

what do you mean by layer layout

astral hazel Nov 8, 2023, 8:14 PM

#

What layers to use and where to put them

plush jungle Nov 8, 2023, 8:19 PM

#

astral hazel What layers to use and where to put them

are you using an rnn?

rotund lark Nov 8, 2023, 9:13 PM

#

Hello. If I want to have Selenium enter login information on a website, how would I do that?

https://member.restaurantdepot.com/customer/account/login

I'm getting this error after trying to run Selenium

There has been an error processing your request
Front controller reached 100 router match iterations
Error log record number: a41644797e68843c6e965cd114df2fce18138d5b6efff34f07c54a26d3378e2f

left tartan Nov 8, 2023, 9:39 PM

#

rotund lark Hello. If i want to have Selanium enter login information on a website. How woul...

This is the wrong channel, ask in #python-discussion. But the first question is going to be: are you violating the site's ToS?

rotund lark Nov 8, 2023, 9:40 PM

#

No, i have permission from their corporate office

brave sand Nov 8, 2023, 10:48 PM

#

in q learning, how would I convert a state into an int?

lapis sequoia Nov 8, 2023, 10:50 PM

#

hi all,

i am trying to learn kmeans clustering with sklearn. so far i have imported my data, dropped rows containing NaN, and visualized the fit data with matplotlib. i wanted to break the set of columns into two separate clusters (k=2). however, it seems to be trying to cluster all of the rows into clusters instead.

#

might i need to transpose the dataframe? or is there another way

lapis sequoia Nov 9, 2023, 1:44 AM

#

wut

brave sand Nov 9, 2023, 2:11 AM

#

why is my agent so dumb

#

am I doing something wrong?

#

the state never changes

#

it's always (1, 1, 1, 1)

#

https://github.com/emerso2000/AI_RacingSim/blob/main/q_learning.py

#

i think it is the discretize bins function

#

since the next state is always (1, 1, 1, 1)

#

btw this is my first time doing anything AI related so bear with me
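
A minimal sketch of one way to turn a continuous observation into a tuple of bin indices and then into a single int for a Q-table row. It is not taken from the linked repository: the sensor range of 0 to 200, the bin count, and the function names are all assumptions. One thing it illustrates is that if the bin edges span a much wider range than the values actually observed, every reading lands in the same bin and the state never changes, which may or may not be what is happening here.

```python
import numpy as np

# Hypothetical setup: 4 features, each expected to lie in [0, 200].
n_bins = 5
lows = np.array([0.0, 0.0, 0.0, 0.0])
highs = np.array([200.0, 200.0, 200.0, 200.0])

# Interior bin edges per feature: n_bins - 1 edges give indices 0..n_bins-1.
edges = [np.linspace(lo, hi, n_bins + 1)[1:-1] for lo, hi in zip(lows, highs)]

def discretize(observation):
    """Map each continuous feature to a bin index in 0..n_bins-1."""
    return tuple(int(np.digitize(x, e)) for x, e in zip(observation, edges))

def state_to_int(state):
    """Flatten a tuple of bin indices into one row index for a Q-table."""
    return int(np.ravel_multi_index(state, (n_bins,) * len(state)))

q_table = np.zeros((n_bins ** 4, 3))   # 3 actions, also made up

state = discretize([10.0, 95.0, 130.0, 199.0])
row = state_to_int(state)
best_action = int(q_table[row].argmax())
print(state, row)
# prints: (0, 2, 3, 4) 69
```

Printing the raw observation next to the discretized state for a few steps is a quick way to see whether the bin edges fit the data.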

mild dirge Nov 9, 2023, 6:51 AM

#

brave sand

Is the reward based on driving without crashing, because it's doing a good job with that 😛

grave token Nov 9, 2023, 7:24 AM

#

Can you guys recommend some books for statistics maths and ai maths?
It could be a tutorial too

slender kestrel Nov 9, 2023, 8:49 AM

#

grave token Can you guys recommend some books for statistics maths and ai maths? It could be...

intro to statistical learning

#

anyone from india here who is doing good in data science ? need a bit of advice for indian data science job market

astral hazel Nov 9, 2023, 9:17 AM

#

plush jungle are you using an rnn?

Lstm, but accuracy is terrible

desert oar Nov 9, 2023, 10:54 AM

#

grave token Can you guys recommend some books for statistics maths and ai maths? It could be...

What's your background?

grave token Nov 9, 2023, 11:39 AM

#

desert oar What's your background?

I am pursuing ms in data science

#

Completed my bachelor's a year ago, but I did not focus much on statistics

bronze flint Nov 9, 2023, 11:53 AM

#

Anyone using Apache Spark for big data streaming?

past meteor Nov 9, 2023, 2:30 PM

#

grave token Can you guys recommend some books for statistics maths and ai maths? It could be...

#data-science-and-ml message I wrote this up a couple of weeks ago, you're pretty much the target audience for the resources 😄

#

Also consider reading practical statistics for data scientists

magic dune Nov 9, 2023, 3:37 PM

#

add bishop

#

https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf

past meteor Nov 9, 2023, 3:38 PM

#

magic dune add bishop

I'd definitely not recommend reading Bishop as a first read personally

magic dune Nov 9, 2023, 3:41 PM

#

past meteor I'd definitely *not* recommend reading Bishop as a first read personally

It really helps you understand the more complex topics imo

#

but maybe not first read

past meteor Nov 9, 2023, 3:42 PM

#

It covers topics that are also not really necessary for industry and there's a lot of diminishing return on those

magic dune Nov 9, 2023, 3:42 PM

#

past meteor It covers topics that are also not really necessary for industry and there's a l...

true

past meteor Nov 9, 2023, 3:43 PM

#

I didn't read PRML in full, but much of my coursework referenced it, so I read some chapters. Stuff like the decomposition of bias and variance is interesting to see but it doesn't really help you.

magic dune Nov 9, 2023, 3:43 PM

#

but there are topics that are interesting if you want to understand the behind the scenes for the algo

winter canyon Nov 9, 2023, 4:14 PM

#

How do I best handle a 2 player (turn based) game? My idea would be to have the following inputs: the environment, movement options of the active bot, location of the other bot, which bot is active (?)
Just imagine a grid based pathfinding game for simplicity.

#

concern being that the ai might get confused with constant switching between players

serene scaffold Nov 9, 2023, 4:23 PM

#

winter canyon concern beeing that the ai might get confused with constant switching beween pla...

a game-playing AI would not get confused by switching between players, because the "AI" can only "think" during its turn.

#

AIs aren't like human brains that are constantly thinking and perceiving. they only have "brain activity" in the limited context that they're being used.

#

ChatGPT doesn't "think" between user inputs.

winter canyon Nov 9, 2023, 4:28 PM

#

Makes sense. so which player is active is a useless info then 👍

serene scaffold Nov 9, 2023, 4:52 PM

#

winter canyon Makes sense. so which player is active is a useless info then 👍

in games, we usually refer to each player as an agent. and the AI is an automated agent.

when it's the automated agent's turn, it needs to be able to get all the same information about the game state that is available to human players.

#

for example, in chess, it needs to know where all the pieces are on the board. whereas for card games, it needs to be able to see its own cards.

bronze flint Nov 9, 2023, 5:02 PM

#

bronze flint Anyone using Apache Spark for big data streaming?

To add to this, does someone maybe know where I can get help with big data streaming?

short heart Nov 9, 2023, 6:53 PM

#

If I fit_transform TfidfVectorizer multiple times, will it rewrite past words or update the dictionary?

serene scaffold Nov 9, 2023, 7:49 PM

#

short heart If I fit_transform TfIdfVectorizer multiple times, will it rewrite past words or...

it rewrites them. every time you fit something, that basically resets it, including with fit_transform

#

this is a landmine for a lot of people.

#

you also shouldn't (for example) have separate vectorizers for train and test. because then your test data represents the same things differently than the training data, making everything meaningless.
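
A small sketch of the pattern described above, with toy documents made up for the example: fit once on the training text, then only `transform` everything else.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat on the mat", "the dog chased the cat"]
test_docs = ["a dog sat on a log"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learns vocabulary and IDF weights
X_test = vectorizer.transform(test_docs)         # reuses them, no refit

print(X_train.shape, X_test.shape)               # same number of columns
print(sorted(vectorizer.vocabulary_))

# Calling fit_transform again throws the old vocabulary away.
vectorizer.fit_transform(test_docs)
print(sorted(vectorizer.vocabulary_))
```

Separately, TF-IDF output is sparse by design: a row only has non-zero entries for vocabulary words that occur in that document, so a row with a single non-zero value among about 20,000 columns can simply mean that only one token of that document is in the learned vocabulary.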

young egret Nov 9, 2023, 8:21 PM

#

Hi I'm just wondering if we can compare 2 lists of different length by ID? How do you do it? I'm looking for the difference between multiple dates that are only > 0

serene scaffold Nov 9, 2023, 8:33 PM

#

young egret Hi I'm just wondering if we can compare 2 lists of different length by ID? How d...

Sorry, but I don't really understand what you mean

mild dirge Nov 9, 2023, 9:25 PM

#

Maybe a bit of a shot in the dark, but I'm working on making an encoder+decoder model which tries to recreate a point cloud. I am using the chamfer distance for the loss, but it seems that it doesn't get much further than the general shape. I was wondering if anyone had any idea how to change the loss function so that it tries to get the details more similar. This image shows the original point cloud in the bottom row, and my decoded point cloud at the top.

#

As can be seen, the general shape is pretty good, but when there are tiny details, it just doesn't do well.

past meteor Nov 9, 2023, 10:26 PM

#

mild dirge Maybe a bit of a shot in the dark, but I'm working on making a encoder+decoder m...

If you haven't done it already I suggest you sanity check your architecture / approach by fitting one point cloud and watching the loss go to 0

#

I always sanity check my approach by ensuring I can at least completely overfit 1 sample or a batch of samples, depending on the case

#

If that is already hard then there's likely something up with your architecture 🙂

mild dirge Nov 9, 2023, 10:28 PM

#

It should be able to overfit on a single point cloud since the decoder is just a bunch of dense layers, and the loss is a pretty standard loss that is 0 when the point clouds are equal
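
For reference, a NumPy sketch of one common form of the Chamfer distance (mean squared nearest-neighbour distance in both directions). Implementations differ in details such as squared versus plain distances and sum versus mean, so this is not necessarily the exact loss used in the model being discussed.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Mean nearest-neighbour distance from a to b, plus from b to a.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

rng = np.random.default_rng(0)
cloud = rng.normal(size=(256, 3))
noisy = cloud + 0.01 * rng.normal(size=cloud.shape)

print(chamfer_distance(cloud, cloud))   # 0.0 for identical clouds
print(chamfer_distance(cloud, noisy))   # small but non-zero
```

Because every term is an average over nearest neighbours, a handful of misplaced points in a fine detail changes the total very little, which fits the observation that the general shape is learned first.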

#

I can maybe look into different decoders, but I'm thinking the model is just stuck in a local minimum in which optimizing small details just doesn't decrease the overall loss enough for it to move in that direction.

past meteor Nov 9, 2023, 10:30 PM

#

You should still try it imo 😄 it's basically just debugging. It has saved me countless times.

mild dirge Nov 9, 2023, 10:31 PM

#

I'll give it a go, but theoretically it should definitely be able to exactly copy the point cloud

past meteor Nov 9, 2023, 11:46 PM

#

Depends. Are you trying to win the competition or do data science properly? 😄

acoustic skiff Nov 9, 2023, 11:47 PM

#

WIN

#

well maybe well

past meteor Nov 9, 2023, 11:47 PM

#

It's important to know that they're completely different things and that Kaggle competitions can, or at least used to, teach bad habits

#

If you're OK with that then the smartest thing to do if you want to win is to clean the entire dataset together.

#

If you care about being methodologically correct, the first thing you should do is split your data and only work with the training set. You should then make a preprocessing pipeline that can work with your training set but also works for totally unseen data.

acoustic skiff Nov 9, 2023, 11:53 PM

#

Right so I was thinking as a not cheating thing if I define my preprocessing pipeline using only train data that's probably the correct thing... But now I think I see what you mean I could cheat and include the test data too

#

Like MEDIAN_AGE just from train
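
A minimal pandas sketch of that idea, with a made-up `age` column: the median is computed from the training split only and then reused unchanged on any other data.

```python
import pandas as pd

train = pd.DataFrame({"age": [22.0, None, 35.0, 58.0, None]})
test = pd.DataFrame({"age": [None, 41.0, 90.0]})

# Learned from the training split only; missing values are skipped.
MEDIAN_AGE = train["age"].median()

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing ages with the training median; works on unseen data too."""
    out = df.copy()
    out["age"] = out["age"].fillna(MEDIAN_AGE)
    return out

train_clean = preprocess(train)
test_clean = preprocess(test)
print(MEDIAN_AGE)
print(test_clean["age"].tolist())
# prints: 35.0
# prints: [35.0, 41.0, 90.0]
```

The "cheating" variant would compute the median over train and test together, which lets information from the test set leak into the pipeline.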

past meteor Nov 10, 2023, 12:11 AM

#

acoustic skiff Right so I was thinking as a not cheating thing if I define my preprocessing pip...

Yes, I think you get it

glad skiff Nov 10, 2023, 12:20 AM

#

hey all, does anybody have experience designing RAG systems? I have a bunch of PDFs, and I'm working on a RAG pipeline, and have been researching different approaches (already went with the very simple LlamaIndex and Langchain thing).

I have been seeing a lot of folks recommending the "hybrid" approach of using both vector and keyword based search. Now the vector part is "ok": you read the docs, embed them, save them into the vector store, and then perform a search. The regular search by itself also: read the docs, store them in a db, and perform a regular search.

what I'm not following is how to mix those two approaches - does that mean I need to duplicate the data? Eg having it as regular text in elasticsearch for example, and also have it as vector? Any ideas?
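
One way the two result lists are often combined is reciprocal rank fusion: each retriever returns a ranked list of chunk IDs, and the lists are merged by rank rather than by raw score. Only the IDs have to agree between the two systems, so whether the text lives in one engine or two is a deployment choice. The sketch below uses made-up chunk IDs and the commonly used constant k = 60; it is an illustration, not a recommendation for a specific stack.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists of document IDs into one ranking."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)   # higher rank -> larger contribution
    return sorted(scores, key=lambda d: scores[d], reverse=True)

vector_hits = ["chunk_7", "chunk_2", "chunk_9"]    # from the embedding search
keyword_hits = ["chunk_2", "chunk_4", "chunk_7"]   # from the keyword search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
# prints: ['chunk_2', 'chunk_7', 'chunk_4', 'chunk_9']
```

Chunks that appear in both lists rise to the top, which is the main appeal of the hybrid setup.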

rugged comet Nov 10, 2023, 12:21 AM

#

Would anyone here be willing to review my Jupyter notebook for a school project? It's about LinearRegression, LassoCV, and RidgeCV.

abstract wasp Nov 10, 2023, 12:46 AM

#

blazing oxide Sure, I'd be happy to clarify! When you're using a `tf.data.Dataset` object wit...

Hi, sorry, it’s been a while since I’ve worked in this. I tried your suggestion but I still get the error ;-;
What else can I do? 😔 I need to batch the dataset?

serene scaffold Nov 10, 2023, 12:49 AM

#

abstract wasp Hi, sorry, it’s been a while since I’ve worked in this. I tried your suggestion ...

that person was just copying answers from ChatGPT, btw. so they probably don't even understand your question.

abstract wasp Nov 10, 2023, 12:49 AM

#

serene scaffold that person was just copying answers from ChatGPT, btw. so they probably don't e...

😔

verbal venture Nov 10, 2023, 4:01 AM

#

can someone explain in object detection the relevance of grids on the image, bounding boxes, and anchor boxes

winter drift Nov 10, 2023, 8:51 AM

#

I made a really cool gpt https://chat.openai.com/g/g-Ks09304lQ-philbo

ChatGPT

ChatGPT - Philbo

You think, therefore i am.

sterile wyvern Nov 10, 2023, 11:57 AM

#

@boreal gale is it possible to run an initial function with 1 set of parameters but then use gridsearch with the performance from the initial function and a group of parameters?

bronze flint Nov 10, 2023, 2:30 PM

#

sterile wyvern <@231160898872410123> is it possible to run an initial function with 1 set of pa...

Grid search exists to find the best combination of hyperparameters, you can give it a set of hyperparameters you initially used along with others you would like to test out.

#

If that's what you meant
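
A sketch of that reading of the question with scikit-learn: the hyperparameters used in the first run are simply included in the grid next to the other candidates. The dataset, the estimator, and the values are placeholders chosen for the example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

initial = {"n_estimators": 50, "max_depth": 3}   # the settings tried first
param_grid = {
    "n_estimators": [initial["n_estimators"], 100],
    "max_depth": [initial["max_depth"], 5, None],
}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`search.cv_results_` holds the score of every combination, including the initial one, so the first run can be compared against the rest.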

#

Anyways
Day 2 of me trying to find someone who has experience with Apache Spark and Structured data streaming
If you are that person please let me know, I would love to hear your opinion on how it handled big data 🙂
I will be dealing with a possibly infinite amount of data, so I have to choose the proper tool

sterile wyvern Nov 10, 2023, 4:03 PM

#

bronze flint Grid search exists to find the best combination of hyperparameters, you can give...

im realizing hyperparameters are different from parameters

young egret Nov 10, 2023, 4:10 PM

#

Is there a way to right outer join tables based on a condition? Any links would be greatly appreciated 🙂

serene scaffold Nov 10, 2023, 4:14 PM

#

young egret Is there a way to right outer join tables based on a condition? Any links would ...

what do you mean "based on a condition"?

young egret Nov 10, 2023, 4:15 PM

#

I'm looking to join 2 tables using the minimum difference between 2 columns. I don't really know what to do my 2 tables have different number of rows.

serene scaffold Nov 10, 2023, 4:16 PM

#

young egret I'm looking to join 2 tables using the minimum difference between 2 columns. I d...

regular join operations are on exactly matching values between the two tables. you can't do a regular join on the closest values.

#

so you'll have to come up with an approach that does not involve joining
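
For the specific case of matching each row to the closest date within the same ID, pandas does have a dedicated function, `pd.merge_asof`, which is separate from the ordinary `merge`. A sketch with invented tables and column names:

```python
import pandas as pd

visits = pd.DataFrame({
    "id": [1, 1, 2],
    "visit_date": pd.to_datetime(["2023-01-10", "2023-03-01", "2023-02-15"]),
})
tests = pd.DataFrame({
    "id": [1, 2, 2],
    "test_date": pd.to_datetime(["2023-01-12", "2023-01-01", "2023-02-20"]),
})

# merge_asof needs both frames sorted by the column being matched on.
visits = visits.sort_values("visit_date")
tests = tests.sort_values("test_date")

nearest = pd.merge_asof(
    visits, tests,
    left_on="visit_date", right_on="test_date",
    by="id",               # only match rows with the same id
    direction="nearest",   # closest date, before or after
)
nearest["days_apart"] = (nearest["test_date"] - nearest["visit_date"]).dt.days
print(nearest)
```

The result has one row per row of the left table. Using `direction="forward"` or `"backward"` instead restricts the match to later or earlier dates only.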

young egret Nov 10, 2023, 4:18 PM

#

My 2 tables have the same IDs. But I don't know if that'll work because they still have different numbers of rows. I've tried concat the 2 dataframes, but I really want them to be on the same row or some other way I can interact with one another.

#

I'm now out of ideas and looking for help, a way to solve this semi matching table thing.

agile cobalt Nov 10, 2023, 4:19 PM

#

sounds like a normal many-to-many outer join?

young egret Nov 10, 2023, 4:20 PM

#

Yes I'm looking for a many to many join

agile cobalt Nov 10, 2023, 4:21 PM

#

you can specify how= to tell pandas whether to do an inner/left/right/outer join
as for many to many... pretty sure that's the default and pretty much only supported option (excluding the validate)

#

!d pandas.merge

arctic wedgeBOT Nov 10, 2023, 4:21 PM

#

pandas.merge


pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)
Merge DataFrame or named Series objects with a database-style join.

A named Series object is treated as a DataFrame with a single named column.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

Warning

If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.

agile cobalt Nov 10, 2023, 4:22 PM

#

maybe take a look at the User Guides

young egret Nov 10, 2023, 4:22 PM

#

I'll try that. Can you explain further what it would do?

#

What will happen to the empty rows? Do I need to fill anything?

#

How many rows will the new table have?

agile cobalt Nov 10, 2023, 4:25 PM

#

if you use inner, it'll exclude rows that do not have a matching record on the other table
if you use left/right, it'll just throw in nulls/Nones/NAs for the right/left table where there were no matching records for the left/right, respectively
if you use outer, it'll throw in NAs in both sides where there is no match

agile cobalt Nov 10, 2023, 4:28 PM

#

young egret How many rows will the new table have?

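with inner: at minimum 0 if there are no matches, at most min(x, y) if all rows have matches (assuming the key is unique in each table)

with outer: at minimum max(x, y) if every row of the smaller table has a match, at most x + y if there are no matches (same assumption)

if keys repeat (many-to-many), every matching pair of rows produces its own row, so the result can be larger than these bounds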
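
A quick way to see what each `how=` does is to count rows on a tiny example. The tables below are made up; note that `id` 2 appears twice on each side, so its rows are paired up in every combination and the inner join ends up with more rows than the smaller table has.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 2, 3], "a": ["l1", "l2a", "l2b", "l3"]})
right = pd.DataFrame({"id": [2, 2, 4], "b": ["r2a", "r2b", "r4"]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="id", how=how)
    row_counts[how] = len(merged)
    print(how, len(merged))
# prints: inner 4
# prints: left 6
# prints: right 5
# prints: outer 7
```

Passing `validate="one_to_one"` (or `"one_to_many"`, `"many_to_one"`) makes `merge` raise an error when the keys are not as unique as expected.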

mystic birch Nov 10, 2023, 4:29 PM

#

does anyone know a platform to use random forest on python on windows? i tried tensorflow but it only works on macos and linux

agile cobalt Nov 10, 2023, 4:29 PM

#

mystic birch does anone know a platform to use random forest on python on windows? i tried te...

if you are not trying to use neural networks, iirc sklearn should suffice

mystic birch Nov 10, 2023, 4:30 PM

#

agile cobalt if you are not trying to use neural networks, iirc `sklearn` should suffice

thx

young egret Nov 10, 2023, 4:37 PM

#

agile cobalt with inner: at minimum `0` if there are no matches, at most `min(x, y)` if all h...

OMG IT WORKS THANK YOU SO MUCH

silk axle Nov 10, 2023, 5:05 PM

#

I'm trying to train an AI that will correctly classify an amazon review with the corresponding rating that was left (1/2/3/4/5 stars), but the AI's accuracy isn't improving at all in the epochs. What does this typically indicate? Fwiw the loss is decreasing (whatever that means Shrug )

In fact, the accuracy is 3.79%, which is worse than the "pick a number at random" (would give 20%)

#

    model = keras.Sequential([
        keras.layers.Embedding(vocab_size, embedding_dim),
        keras.layers.GlobalAveragePooling1D(),
        keras.layers.Dense(24, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid")
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

this is the model that I'm using (copy-pasted from an example)

bronze flint Nov 10, 2023, 5:06 PM

#

silk axle I'm trying to train an AI that will correctly classify an amazon review with the...

I mean you are trying to train a model without having a clue what hyperparameters you are setting
That is not the way to do it

#

It can indicate anything

silk axle Nov 10, 2023, 5:08 PM

#

bronze flint I mean you are trying to train a model without having clue what hyperparameters ...

What are "hyperparameters"?

#

The config?

bronze flint Nov 10, 2023, 5:09 PM

#

I suggest you get into ML properly from the start if you are interested

#

You could check the Stanford course for starters

#

I think it is free on youtube

silk axle Nov 10, 2023, 5:10 PM

#

I know some basics of ML, just not how to choose the right hyperparameters

bronze flint Nov 10, 2023, 5:11 PM

#

That's the entire point of engineering the model today, fine tuning hyperparameters

silk axle Nov 10, 2023, 5:18 PM

#

What does "loss" measure?

true geode Nov 10, 2023, 5:23 PM

#

The error essentially, i.e, how far the predicted value is from the actual value.

silk axle Nov 10, 2023, 5:24 PM

#

true geode The error essentially, i.e, how far the predicted value is from the actual value...

Right, so saying a 5 is a 1 would have a larger loss than saying a 4 is a 3?
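
Whether a 5 mistaken for a 1 costs more than a 4 mistaken for a 3 depends on the loss. A sketch with invented numbers: with a categorical cross-entropy the classes have no order, so only the probability given to the true class matters, while a squared-error loss on the numeric rating does penalize the distance. Note also that the model posted above ends in a single sigmoid unit with binary cross-entropy, which is a two-class setup; for five ratings a 5-way softmax output with a categorical cross-entropy is the more usual choice, and that mismatch may be part of why the accuracy looks stuck.

```python
import math

def cross_entropy(probs, true_class):
    """Categorical cross-entropy for one example: -log of the probability of the true class."""
    return -math.log(probs[true_class])

def squared_error(predicted, actual):
    return (predicted - actual) ** 2

# Classification view: stars 1..5 are classes 0..4 with no notion of distance.
# Both predictions put 70% on a wrong class and 10% on the true one.
true_five_pred_one = [0.7, 0.05, 0.05, 0.1, 0.1]     # true class index 4 (5 stars)
true_four_pred_three = [0.05, 0.05, 0.7, 0.1, 0.1]   # true class index 3 (4 stars)
ce_a = cross_entropy(true_five_pred_one, 4)
ce_b = cross_entropy(true_four_pred_three, 3)

# Regression view: the rating is a number, so being off by 4 costs more than being off by 1.
se_a = squared_error(1, 5)
se_b = squared_error(3, 4)

print(round(ce_a, 3), round(ce_b, 3), se_a, se_b)
# prints: 2.303 2.303 16 1
```

Accuracy, by contrast, only counts exact hits, so the loss can keep falling while accuracy stays flat.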

past meteor Nov 10, 2023, 5:24 PM

#

silk axle I'm trying to train an AI that will correctly classify an amazon review with the...

Can I just send you this? https://karpathy.github.io/2019/04/25/recipe/ This is mostly the methodology I use. I can give you a watered down version or I can refer you to the source 😄

A Recipe for Training Neural Networks

Musings of a Computer Scientist.

past meteor Nov 10, 2023, 5:26 PM

#

silk axle I know some basics of ML, just not how to choose the right hyperparameters

If it's too technical I can give you a higher level overview you can follow.

silk axle Nov 10, 2023, 5:26 PM

#

past meteor Can I just send you this? https://karpathy.github.io/2019/04/25/recipe/ This is ...

Will take a look at it, thanks :)

past meteor Nov 10, 2023, 5:27 PM

#

Np, ping me if you have any specific questions

true geode Nov 10, 2023, 5:28 PM

#

past meteor Np, ping me if you have any specific questions

Is this an open invitation. 😅 😅

past meteor Nov 10, 2023, 5:29 PM

#

Well, it's mostly for this specific case

#

I'm on frequently enough to notice if there's a question where I can help though 🙂

echo mesa Nov 10, 2023, 6:50 PM

#

@past meteor I've been going through a book called "data science first principles with python", which I found in the Python Discord server resources. It is a very good book and I've been enjoying it. My question: when the book introduces, for example, linear algebra, it talks about it for a while, and then at the end of the chapter it lists other books you can read if you want to know more. Should I pursue those books, and if so, how? Should I wait until I finish the whole book, or should I stop reading it and move on to them? Or should I just read the book, note everything down including the book recommendations, then perhaps do a very basic project, and read those books only if I decide to move into a field where the mathematics is needed?

echo mesa Nov 10, 2023, 7:25 PM

#

and other helpers perhaps?

quaint gorge Nov 10, 2023, 7:56 PM

#

what models are used to solve such a problem? Can someone just point me in the right direction and i will do some research.

I'm looking at customer feedback from surveys and I need to categorize the feedback based on some grouping logic, for example:

"I didn't buy the item because i have no money " ---> Budgeting

left tartan Nov 10, 2023, 9:30 PM

#

echo mesa @past meteor I've been going thru a book called "data science first pri...

I think you're probably thinking too "linearly" (no pun intended): there's no right or wrong answer. From your explanation, it sounds like the book gave you the minimum information you need for linear algebra... but that is an entirely separate (college-level) course, so don't stress it. You can learn it when you want to learn it.

past meteor Nov 10, 2023, 10:03 PM

#

echo mesa @past meteor I've been going thru a book called "data science first pri...

I'll let you in on a secret: I read 3+ books on the same topic 😄

#

I like having different opinions on the same topics. I also frequently read book 1, then 2 then 1 then 3 and then 2

#

As I mentioned: you need to get really comfortable with not understanding things in detail

crystal axle Nov 10, 2023, 10:06 PM

#

is there any way to get a head start in math and its implementation in programming before I start studying at university? For reference, I'll be studying an AI course (Bachelor's degree).

left tartan Nov 10, 2023, 10:18 PM

#

crystal axle is there any way to get a head start in math and its implementation in progr...

Of course. Ask your university for a curriculum map: a list of courses you'll need to take, usually by year. It's usually online, or you can send an email to the department.

crystal axle Nov 10, 2023, 10:24 PM

#

left tartan Of course. Ask your university for a curriculum map: a list of courses you'll ne...

I tried to look at their website, and I believe it's a new course they provide, so I can't really click on any of the modules they list. I only have the names and short explanations, and some of the explanations aren't even complete and end with "...". I will email them and ask what I can expect. I asked in this channel to see if anyone would recommend books, websites, etc. to help with the study.

left tartan Nov 10, 2023, 10:25 PM

#

crystal axle I tried to look at their website, and I believe it's a new course they provide, ...

What country are you in?

crystal axle Nov 10, 2023, 10:25 PM

#

left tartan What country are you in?

UK

left tartan Nov 10, 2023, 10:25 PM

#

crystal axle I tried to look at their website, and I believe it's a new course they provide, ...

You can look at other universities and see what comparable curriculums look like. In the US, freshman year computer science is two semesters of Calculus.

#

Then, either more calc, or linear algebra... and definitely some calculus-based stats.

#

Number one tho is: strong algebra / pre-calc skills. Calculus isn't inherently hard, it's the lack of algebra mastery that makes it hard.

crystal axle Nov 10, 2023, 10:30 PM

#

I'm not sure whether my university would be the same as the others, but I would try and look that up

#

But after that, what do I do? How do I study them?

left tartan Nov 10, 2023, 10:32 PM

#

crystal axle But after that, what do I do? How do I study them?

Same thing: find a syllabus for the class you want to take at a local university. Most calc courses, for instance, are largely the same.

#

For example, here's Stanford's summary of Calc 1, 2 and 3: https://drive.google.com/file/d/1ycBlN7LBm1v10gnEU0gg1JfD_v2ncenc/view

Google Docs

MATH 20 Series Topics List.pdf

crystal axle Nov 10, 2023, 10:34 PM

#

Hopefully I can get more knowledgeable in programming and math, I don't want the next year to be harder.

left tartan Nov 10, 2023, 10:39 PM

#

crystal axle Hopefully I can get more knowledgeable in programming and math, I don't want the...

Absolutely. The two things I’d advise any student going into CS are: study calculus and learn programming (whatever language your university teaches first… usually Java or Python). Do practice problems, watch online courses, etc. If you do a little every week, you’ll be ready. If you wait until the end, you won’t be able to cram it in.

#

(This is US-centric advice but I assume the UK curriculum is similar)

#

My advice is primarily about making freshman year as easy as possible, not that calculus is more important than other maths.

crystal axle Nov 10, 2023, 10:43 PM

#

I believe so, but the study curriculum is different it's mostly focused on search and independent learning, that's why I'm very focused on math since it would be my first time in this kind of studying, add to that I have no idea how math can be studied independently. Yes they would explain it in classes, but I think you'd still have to study harder on yourself.

#

As for programming, I'm trying to practice more and I can have an idea of how I can study independently.

left tartan Nov 10, 2023, 10:44 PM

#

crystal axle I believe so, but the study curriculum is different it's mostly focused on searc...

For calc, I have standard advice, one sec, let me find an old thread

#

#data-science-and-ml message and the message right after it

crystal axle Nov 10, 2023, 10:53 PM

#

This might be obvious, but I should focus on pure programming and Python right? After this I can take time to learn math and its implementation?

left tartan Nov 10, 2023, 11:08 PM

#

crystal axle This might be obvious, but I should focus on pure programming and Python right? ...

First: find out what the intro language is at your college. It might not be Python.

#

And, I would suggest studying both. You’ll never ‘finish’ learning Python, nor will you finish learning math. You’re just trying to get better, not be perfect

white coral Nov 10, 2023, 11:30 PM

#

guys i have a few doubts can i ask here?

#

in this code of mine:
`import numpy as np
import pandas as pd
import statsmodels.api as sm

# Generate dummy data

np.random.seed(123)
X = np.random.normal(size=(100, 5))
y1 = np.random.normal(size=100)
y2 = np.random.normal(size=100)
data = pd.DataFrame(X, columns=['x1', 'x2', 'x3', 'x4', 'x5'])
data['y1'] = y1
data['y2'] = y2

# Fit model for y1

X = sm.add_constant(data[['x1', 'x2', 'x3', 'x4', 'x5']])
model1 = sm.OLS(data['y1'], X).fit()

# Fit model for y2

model2 = sm.OLS(data['y2'], X).fit()

# Print model summaries

print(model1.summary())
print(model2.summary())

# Fit joint model

X = sm.add_constant(data[['x1', 'x2', 'x3', 'x4', 'x5']])
y = data[['y1', 'y2']]

model_joint = sm.OLS(y, X).fit()
results_df = pd.DataFrame()
results_df['Coefficients Y1'] = model_joint.params.iloc[:, 0]
results_df['Coefficients Y2'] = model_joint.params.iloc[:, 1]
print(results_df)
results_df['Std Errors Y1'] = model_joint.bse.iloc[:, 0].values
results_df['Std Errors Y2'] = model_joint.bse.iloc[:, 1].values
print(results_df)`

i am getting the following error :
ValueError                                Traceback (most recent call last)
<ipython-input-1-e65d97313ad8> in <cell line: 34>()
     32 results_df['Coefficients Y2'] = model_joint.params.iloc[:, 1]
     33 print(results_df)
---> 34 results_df['Std Errors Y1'] = model_joint.bse.iloc[:, 0].values
     35 results_df['Std Errors Y2'] = model_joint.bse.iloc[:, 1].values
     36 print(results_df)
/usr/local/lib/python3.10/dist-packages/numpy/core/overrides.py in dot(*args, **kwargs)
ValueError: shapes (100,2) and (100,2) not aligned: 2 (dim 1) != 100 (dim 0)

#

please can someone here help me??

desert oar Nov 11, 2023, 5:21 AM

#

@white coral if you look at the full "traceback" you'll see that this error comes from inside statsmodels and isn't related to your dataframe. i can reproduce your error but i'm not sure what causes it. you'll see the error if you just do print(model_joint.bse).

#

i'd argue that this is a bug in statsmodels. if you did something invalid, you should get a helpful error message, not this
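
One possible workaround, sketched here with plain NumPy rather than statsmodels: compute the coefficients and standard errors for each target column directly. This assumes ordinary (non-robust) OLS standard errors, which is what fitting each column separately would report, as `model1` and `model2` above already do. The helper name `ols_per_target` and the random data are made up for illustration.

```python
import numpy as np

def ols_per_target(X, Y):
    """Return (coefficients, standard errors), each shaped (n_params, n_targets)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n_obs, n_params = X.shape
    # Least-squares coefficients for every target column at once.
    coefs, *_ = np.linalg.lstsq(X, Y, rcond=None)
    residuals = Y - X @ coefs
    # Residual variance per target, using n_obs - n_params degrees of freedom.
    sigma2 = (residuals ** 2).sum(axis=0) / (n_obs - n_params)
    # The diagonal of (X'X)^-1 scales each residual variance into coefficient variances.
    xtx_inv_diag = np.diag(np.linalg.inv(X.T @ X))
    std_errors = np.sqrt(np.outer(xtx_inv_diag, sigma2))
    return coefs, std_errors

rng = np.random.default_rng(123)
features = rng.normal(size=(100, 5))
X = np.column_stack([np.ones(100), features])  # prepend the constant column
Y = rng.normal(size=(100, 2))                  # two targets, like y1 and y2
coefs, std_errors = ols_per_target(X, Y)
```

Column 0 of each result corresponds to the first target and column 1 to the second, so they can be copied into a results table one column at a time.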

crystal axle Nov 11, 2023, 5:26 AM

#

left tartan First: find out what the intro language is at your college. It might not be Pyth...

I believe it's Python, but I will look that up, but does it matter, programming is the same isn't it? And to keep in mind, I'm already making some projects and learning more programming using Python.

Yes, I know that programming isn't something you can master, it's always updating and you have to learn things that you didn't know before, but what I meant by "finish" is to get to a high level.

rotund turtle Nov 11, 2023, 8:46 AM

#

Hey

left tartan Nov 11, 2023, 4:28 PM

#

crystal axle I believe it's Python, but I will look that up, but does it matter, programming ...

I am just saying: be prepared and ask a lot of questions. Know what the program entails: especially details and example syllabi for first year courses. If it’s Java, then be prepared for Java. Being prepared will make the first year -much- easier

crystal phoenix Nov 11, 2023, 5:31 PM

#

I have a numpy array of 20000 values. How can I plot the standard deviation for sample size 1-20000? (Ox is sample size, Oy is standard deviation)

hallow cargo Nov 11, 2023, 10:34 PM

#

crystal phoenix I have a numpy array of 20000 values. How can I plot the standard deviation for ...

You can iterate over a for loop and calculate standard deviation for each array:

import numpy as np

# `array` is your existing 1-D NumPy array of 20,000 values
ox = np.arange(1, 20_001)  # sample sizes 1..20000
oy = np.empty(20_000)

for n in ox:
    oy[n - 1] = np.std(array[:n])  # std of the first n values

# then plot it, e.g. with matplotlib: plt.plot(ox, oy)

Haven't tested it, but give it a go and tag me if something is wrong. But hope it helps.
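
If the loop turns out to be slow, the same curve can be computed without a Python loop using cumulative sums. This is only a sketch: `running_std` is a made-up helper name, the random `data` stands in for the real array, and it computes the population standard deviation (the `np.std` default). The cumulative-sum formula can lose precision when values are very large relative to their spread.

```python
import numpy as np

def running_std(values):
    """Population std of values[:n] for every n from 1 to len(values)."""
    values = np.asarray(values, dtype=float)
    n = np.arange(1, values.size + 1)
    mean = np.cumsum(values) / n
    mean_of_squares = np.cumsum(values ** 2) / n
    # Var = E[x^2] - E[x]^2; clip tiny negative values caused by rounding.
    variance = np.maximum(mean_of_squares - mean ** 2, 0.0)
    return np.sqrt(variance)

rng = np.random.default_rng(0)
data = rng.normal(size=20_000)  # stand-in for the real 20,000 values
stds = running_std(data)        # stds[n - 1] is the std of the first n values
```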

#

Does anyone here have any experience with tf.data and tensorflow? I have windowed my data for timeseries with window size of 512 and batch size of 2048, and I have 16 features and a total of 161873 datapoints before windowing or batching. I am fairly new to tensorflow, being more experienced in the mathematics of ML. I keep getting this error, and I am unsure how to set up feature layers to make my data compatible. I am more or less randomly putting something into input_shape= trying to figure out how it works.

Error:

ValueError: Exception encountered when calling layer "dense_features" (type DenseFeatures).

We expected a dictionary here. Instead we got: 

Call arguments received by layer "dense_features" (type DenseFeatures):
  • features=tf.Tensor(shape=(None, 2048, 512, 16), dtype=float32)
  • cols_to_output_tensors=None
  • training=None

#

This is my model and feature layers:

data = Data('TData/train.csv')

count = 64
dataset = data.data_ds.take(int(count*0.8))
dataset_cv = data.data_ds.skip(int(count*0.8))

numeric_features = [tf.feature_column.numeric_column(feat) for feat in data.column_names[:-1]]
feature_layer = tf.keras.layers.DenseFeatures(numeric_features, input_shape=(2048, 512, 16))
# output_bias = tf.keras.initializers.Constant(init_bias)
model1 = Sequential([
    feature_layer,
    LSTM(units=64, stateful=True),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),
    Dense(units=8, activation='tanh', kernel_regularizer=tf.keras.regularizers.L2(0.16)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),
    Dense(units=1, activation='sigmoid') # , bias_initializer=output_bias)
])

# model1.build(input_shape=(64, 300, 16))

cp = ModelCheckpoint('ModelDatasetTesting/', save_best_only=True)

# model1 = tf.keras.models.load_model('ModelV1/')
model1.compile(loss=BinaryCrossentropy(), optimizer=Adam(learning_rate=0.0001), metrics=['accuracy']) # , tf.keras.metrics.Precision(), tf.keras.metrics.Recall()

model1.fit(dataset, validation_data=(dataset_cv), epochs=10, batch_size=2048, callbacks=[cp]) # , class_weight=class_weight

sterile wyvern Nov 11, 2023, 11:04 PM

#

@boreal gale does https://github.com/bayesian-optimization/BayesianOptimization have lag or get slow with large searches?

latent musk Nov 11, 2023, 11:06 PM

#

hi! does anyone have experience with vectorization (like changing words to vectors)?

#

i'm trying to use word2vec to vectorize a column of my dataframe, but tbh i've never done it before, so i'm rather confused on how to do it

lapis sequoia Nov 12, 2023, 12:58 AM

#

yo, can I share a model and get some opinions??

serene scaffold Nov 12, 2023, 1:00 AM

#

lapis sequoia yo, can I share a model and get some opinions??

It's easier for everyone if you show the thing you want feedback on when you ask. Don't wait for a commitment.

lapis sequoia Nov 12, 2023, 1:01 AM

#

https://github.com/nickkatsy/Trials_and_ect/blob/main/not_accurate_but_looks_cool.py

untold glade Nov 12, 2023, 2:55 AM

#

Hello!, can I install tensorflow/tfjs-node using python 3.x? I read that the best option is using Python 2.7

scenic parcel Nov 12, 2023, 6:50 AM

#

Have any of you guys ever scraped / crawled reddit and analyzed it

hallow light Nov 12, 2023, 6:50 AM

#

Hey guys what algorithm do you guys recommend for anomaly detection? I'm trying to build a model that checks gas meter rates and if it gets out of range flag the value.

odd meteor Nov 12, 2023, 7:59 AM

#

hallow light Hey guys what algorithm do you guys recommend for anomaly detection? I'm trying ...

You could use any of your preferred algorithms for this: XGBoost, CatBoost, VAE, etc.
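
Before reaching for a model, a plain range check may already cover the "flag it when it goes out of range" part. The sketch below uses the interquartile-range rule from the standard library, which is a different and simpler technique than the algorithms named above. The function names, the `k=1.5` multiplier, and the readings are all illustrative assumptions.

```python
import statistics

def iqr_bounds(readings, k=1.5):
    """Lower and upper limits derived from historical readings."""
    q1, _, q3 = statistics.quantiles(readings, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_out_of_range(readings, lower, upper):
    """Return the readings that fall outside [lower, upper]."""
    return [r for r in readings if r < lower or r > upper]

history = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 10.1, 9.7]  # made-up meter rates
lower, upper = iqr_bounds(history)
flagged = flag_out_of_range([10.0, 25.0, 9.9, -3.0], lower, upper)
```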

odd meteor Nov 12, 2023, 8:02 AM

#

scenic parcel Have any of you guys ever scraped / crawled reddit and analyzed it

Haven't crawled Reddit. Is that what you're currently doing? Are you having any challenge in doing that?

polar zodiac Nov 12, 2023, 8:08 AM

#

Hello,
How to work with Over-fitting model?

odd meteor Nov 12, 2023, 8:10 AM

#

latent musk i'm trying to use word2vec to vectorize a column of my dataframe, but tbh i've n...

You could use Gensim to achieve that after you've performed tokenization (you can use NLTK for this).

import gensim
from gensim.models import Word2Vec

latent musk Nov 12, 2023, 8:14 AM

#

odd meteor You could use Gensim to achieve that after you've performed tokenization (you ca...

sorry I'm rather new to programming lol
could you specify what you mean by tokenization/using gensim to achieve this?

#

i asked chatgpt about it too and it gave me some code, but requires that i download the full pre-trained word2vec model

is there another way to do this without downloading multiple gigabytes of word2vec?

odd meteor Nov 12, 2023, 8:30 AM

#

polar zodiac Hello, How to work with Over-fitting model?

You'd have to reduce your model complexity to enable the high variability in your model to drop.

Meanwhile there are several ways you could approach this. We can categorise them into two: data-centric and model-centric approaches.

Data-Centric

Data Augmentation
Resampling (up sampling / Down sampling)
SMOTE (I personally haven't seen the significant impact of this approach when compared to other approaches)

Model Centric

Use Regularisation
Early Stopping / Other callbacks
Hyperparameter Tuning
Adjusting the value of scale_pos_weight parameter in your model
Adjusting the Class Weight parameter
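
To make one of the items above concrete, here is a minimal, framework-free sketch of early stopping: training stops once the validation loss has not improved for `patience` epochs in a row. The function name, the `patience` and `min_delta` parameters, and the loss values are illustrative assumptions. Most training libraries ship their own version of this as a callback.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Validation loss improved: remember it and reset the counter.
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # Never triggered: training runs through the final epoch.
    return len(val_losses) - 1

# Validation loss falls, then creeps back up as the model starts to overfit.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.63, 0.66]
stop_epoch = early_stopping_epoch(val_losses, patience=3)
```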

odd meteor Nov 12, 2023, 8:56 AM

#

latent musk sorry I'm rather new to programming lol could you specify what you mean by token...

You didn't offend me so no need to apologise for doing nothing wrong 😊. Meanwhile, welcome, I hope you're enjoying your programming experience thus far?

So tokenization is just a fancy way of saying "breaking down the walls of Jericho into tiny stone pebbles".

Only that this time, in NLP, this 'wall' isn't an actual 'wall' but a document or textual data.

You can decide to break this 'wall' into different sizes of pebbles. So we have different types of tokenization, but I'll just highlight sentence tokenization and word tokenization.

Word tokenization: When you break down your text data into individual words. For example "I ate jollof rice this morning" becomes [ 'I', 'ate', 'jollof', 'rice', 'this' , 'morning' ]
Sentence Tokenization: this is where our document / text data is broken down into individual sentences.
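
The two kinds of tokenization above can be sketched with nothing but the standard library. This is a deliberately naive version (the regular expressions and function names are assumptions made for illustration) and real tokenizers handle many more edge cases such as abbreviations and punctuation inside words.

```python
import re

def word_tokenize(text):
    # Treat runs of letters, digits, or apostrophes as words.
    return re.findall(r"[A-Za-z0-9']+", text)

def sentence_tokenize(text):
    # Split on whitespace that follows ., ! or ?
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

words = word_tokenize("I ate jollof rice this morning")
sentences = sentence_tokenize("I ate jollof rice this morning. It was great! Want some?")
```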

NLTK & Gensim are both libraries used for doing some interesting stuff in NLP.

So, once you've installed Gensim on your machine you can easily leverage its models module to access Word2Vec embeddings.

I'll point you to a website for more clarity shortly

#

https://builtin.com/machine-learning/nlp-word2vec-python

Built In

How to Practice Word2Vec for NLP Using Python

In this tutorial, I’ll show you how to vectorize text using Gensim and Plotly.

polar zodiac Nov 12, 2023, 9:07 AM

#

odd meteor You'd have to reduce your model complexity to enable the high variability in you...

Thank you very much!

latent musk Nov 12, 2023, 9:14 AM

#

odd meteor You didn't offend me so no need to apologise for doing nothing wrong 😊. Meanwhi...

Thank you! this really helped 🙏🙏

#

Checking out the website right now

polar zodiac Nov 12, 2023, 9:16 AM

#

odd meteor You'd have to reduce your model complexity to enable the high variability in you...

Do you have any tutorials I can follow up?

odd meteor Nov 12, 2023, 9:35 AM

#

polar zodiac Do you have any tutorials I can follow up?

Unfortunately I have none. You might wanna check online on each individual approach. However, If you tell me the kind of data you're working with and the task you're performing, I could provide more useful resource / feedback

echo mesa Nov 12, 2023, 10:04 AM

#

Guys, what courses should I take on uni that would be really useful for ai and machine learning. I was thinking about doing just mathematics since I already know how to code and taking just computer science wouldn't make sense i think, or perhaps taking a joint course. I was also wondering if there are any courses just about machine learning because I didn't find anything but mathematics and computer science. Any thoughts?

odd meteor Nov 12, 2023, 10:52 AM

#

echo mesa Guys, what courses should I take on uni that would be really useful for ai and m...

Any of these 3 majors are cool in building solid foundational knowledge. Statistics, Computer Science, and Mathematics.

Nonetheless this field isn't discriminatory, so you might as well decide to study Fishery and still end up doing Data Science (so long as you are ready to put in the needed work)

Many universities now have a data science undergraduate program. So it depends on where you reside and the schools you're interested in.

drowsy viper Nov 12, 2023, 10:59 AM

#

Hi there, general question here: how popular is PyTorch Lightning?

past meteor Nov 12, 2023, 11:06 AM

#

echo mesa Guys, what courses should I take on uni that would be really useful for ai and m...

Also depends on your country! 🙂 When I was at a data science consultancy, half of the staff had an MSc in Business Engineering (I'm one of them) and the other half an MSc in Comp Sci.

#

What both have in common is they go very deep into statistics and math. We tended to have a bit more domain knowledge and knowledge of other methods that may be relevant (for instance operations research) for data adjacent problems and they knew (a lot) more programming and a bit more math.

#

There's also tons of people doing data science that come from any quantitative field such as Bio Engineering / Bioinformatics etc.

orchid agate Nov 12, 2023, 11:46 AM

#

Hi all,

I have been working on a proof-of-concept AI tool for my current job.
The purpose of the tool is to map an arbitrary string into a string from the list.
I have a dataset of arbitrary strings and corresponding strings from the target list. The tool is written in Python.
Here is the source code:
Using **LinearSVC** model: https://pastebin.com/R6LNrJ4w
Using **SGDClassifier** model: https://pastebin.com/bxcug3Vb

This code was written using a ChatGPT-like tool and scraps of information on the scikit-learn library that I could understand (I am very far from Python and ML itself).
The dataset consists of 3 million+ mappings (each of the two files is approximately 100 MB).
I would be very grateful if there are people here who can help me with at least some of the related questions:

The fit() method runs well and fast in both models if I cut the dataset down to around 10k mappings. However, with the original 3 million it takes a very long time (it started approximately 48 hours ago and is still running). My PC specs are 32GB of RAM and an i5-12400f CPU. Is this a reasonable time for such a dataset? Could the remaining time be estimated, at least roughly?
Is this a good way of using scikit-learn in this manner for this kind of tool? I would be grateful if you could recommend other more effective approaches or ready-made solutions.
Can it be determined which of the models (LinearSVC or SGDClassifier) has more pros than cons for this tool?
In the current implementation, when I turn off the program or PC, everything the model has learned is lost from RAM. How can I save the training progress to the hard drive and resume training from it?

Thank you!

echo mesa Nov 12, 2023, 12:17 PM

#

odd meteor Any of these 3 majors are cool in building solid foundational knowledge. Statist...

Hmm I'm planning to go to the UK since I'm in Ireland, I was just wondering that what should I go with. Going just math would be interesting as I might need to learn stuff that I won't ever use in machine learning and data science but if I would go with Mathematics then I could literally self learn almost everything in terms of data science and machine learning as I have a really strong foundation in mathematics. I just wondered whether that's a good decision

buoyant vine Nov 12, 2023, 1:10 PM

#

How do you guys go about testing multi-label classifiers.

Currently I have a classifier model which is trying to predict one label based on the content of the text, but realistically this should be multi-label, but I have a couple questions around testing and evaluating the model:

In the situation where the model were to produce multiple labels, how would you score that against a dataset of single labels?
How do you evaluate weight in this case? I.e. Does the score/confidence the model gives mean anything to the metrics? Is the order something I should care about?
If I shouldn't care about the order, do I just say `all(expected_label in returned_labels for expected_label in expected_labels)` to count as a 'correct' score?

sharp nimbus Nov 12, 2023, 1:24 PM

#

buoyant vine How do you guys go about testing multi-label classifiers. Currently I have a cl...

wdym by single labels? like only a binary thing or something else?

buoyant vine Nov 12, 2023, 1:27 PM

#

I.e. a piece of text being labelled as `News` rather than `[News, Business, Politics]` for example

sharp nimbus Nov 12, 2023, 1:29 PM

#

buoyant vine I.e. a piece of text being labelled as `News` rather than `[News, Business, Poli...

and what does the model output?

buoyant vine Nov 12, 2023, 1:29 PM

#

hmm?

#

Atm it only produces a single label, it should (and will) get moved over to a multi-label version though

#

since single-label doesn't really work

sharp nimbus Nov 12, 2023, 1:31 PM

#

ok im guessing you want your model to be like a hierarchal classification thing? orr am i confused

past meteor Nov 12, 2023, 1:31 PM

#

buoyant vine How do you guys go about testing multi-label classifiers. Currently I have a cl...

Precision and recall work in the multi-label case.

#

Which means you can compute F1-scores as well etc

buoyant vine Nov 12, 2023, 1:32 PM

#

sharp nimbus ok im guessing you want your model to be like a hierarchal classification thing?...

kinddddddd of, they are IAB categories if you are familiar with them.

So there are effectively 4 tiers of categories like Automotive > Sports cars > Luxury cars

#

The primary issue we have atm, as shown by the confusion matrix, is that the model generally runs into issues with single-label output where the topics/text can easily fall into other categories

[Attached image: confusion matrix]

past meteor Nov 12, 2023, 1:34 PM

#

past meteor Precision and recall work in the multi-label case.

You can also make a roc-curve per category (1 vs all) or even per combination (1vs1) if you choose

buoyant vine Nov 12, 2023, 1:35 PM

#

past meteor You can also make a roc-curve per category (1 vs all) or even per combination (1...

😅 What is a roc-curve

past meteor Nov 12, 2023, 1:36 PM

#

The absolute GOAT evaluation method, but it's really made for binary classification sadly, you can finesse it into working for more cases.
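
For anyone who hasn't met it: a ROC curve plots the true positive rate against the false positive rate as the decision threshold on the model's score is lowered. A minimal sketch for a single label treated one-vs-all follows. The function name and the sample scores are made up, and it assumes both positive and negative examples are present.

```python
def roc_points(labels, scores):
    """Return (false_positive_rate, true_positive_rate) pairs, one per threshold.

    Assumes `labels` contains at least one 1 and at least one 0.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Walk through the examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after every example tied at this score is counted.
        last_of_tie = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_tie:
            points.append((fp / negatives, tp / positives))
    return points

labels = [1, 1, 0, 1, 0]            # 1 = the row really has this label
scores = [0.9, 0.8, 0.7, 0.6, 0.2]  # model confidence for this label
curve = roc_points(labels, scores)
```

A curve that hugs the top-left corner means the model ranks the true examples of that label above the rest.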

buoyant vine Nov 12, 2023, 1:36 PM

#

partyparrot We are also going to ignore the fact that the current dataset is single label only atm

past meteor Nov 12, 2023, 1:37 PM

#

Imagine you make one of these per label (one versus all) and have a drill-down that business can use to have one per label-label combination

#

Make sense?

buoyant vine Nov 12, 2023, 1:38 PM

#

yeah

#

that is a lot of graphs 😅

past meteor Nov 12, 2023, 1:38 PM

#

Indeed, there's other ways. It largely depends on the amount of categories you have tbf

buoyant vine Nov 12, 2023, 1:38 PM

#

1400+

#

I think in total xD

past meteor Nov 12, 2023, 1:39 PM

#

If you want a single number I'd look towards precision and recall. Are you familiar with them?

buoyant vine Nov 12, 2023, 1:39 PM

#

not really, I do not typically do AI

past meteor Nov 12, 2023, 1:41 PM

#

buoyant vine not really, I do not typically do AI

Precision = (True Positives) / (True Positives + False Positives)

Recall = (True Positives) / (True Positives + False Negatives)

buoyant vine Nov 12, 2023, 1:41 PM

#

right

past meteor Nov 12, 2023, 1:42 PM

#

Note that TP + FP = the number of times your model said it was true, so precision means "the proportion of times it was really right when your model said it was right"

#

We can do the same for recall: TP + FN = The amount of times it was really label A, so recall means "the proportion of times your model said it was A when it should've been A"

buoyant vine Nov 12, 2023, 1:44 PM

#

Question, how does that get calculated when we say "Model says it is `[A, B, D, E]` but it should be `[A, B, C]`"

#

each label is just another hit/miss count?

past meteor Nov 12, 2023, 1:44 PM

#

Good question! You calculate the precision and recall for each category and then you have different strategies to combine it. The most obvious one is averaging.

sharp nimbus Nov 12, 2023, 1:45 PM

#

buoyant vine each label is just another hit/miss count?

yes, you'll have to calculate TP, FP and FN for all classes, iirc, a One v All thing

past meteor Nov 12, 2023, 1:45 PM

#

You should think of the computation of the precision and recall more per batch (set of rows) than per instance (set of columns)

past meteor Nov 12, 2023, 1:46 PM

#

buoyant vine Question, how does that get calculated when we say "Model says it is` [A, B, D, ...

[A, B, D, E] vs [A, B, C]
[A, B] vs [A, B]
[C, B, D] vs [A]

you compute the precision and recall for A, B, C, D and E over these rows and you take the average.
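
Written out for the three rows above, treating the left side of each "vs" as the model's output and the right side as the expected labels. This is a sketch under two assumptions: the per-label scores are combined with a plain (macro) average, and a ratio with a zero denominator counts as 0. The helper names are made up.

```python
def per_label_counts(predicted_rows, expected_rows):
    """Count true positives, false positives and false negatives per label."""
    labels = sorted(set().union(*predicted_rows, *expected_rows))
    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    for predicted, expected in zip(predicted_rows, expected_rows):
        predicted, expected = set(predicted), set(expected)
        for label in predicted & expected:
            counts[label]["tp"] += 1
        for label in predicted - expected:
            counts[label]["fp"] += 1
        for label in expected - predicted:
            counts[label]["fn"] += 1
    return counts

def safe_ratio(numerator, denominator):
    # Convention assumed here: an undefined ratio counts as 0.
    return numerator / denominator if denominator else 0.0

predicted_rows = [["A", "B", "D", "E"], ["A", "B"], ["C", "B", "D"]]
expected_rows = [["A", "B", "C"], ["A", "B"], ["A"]]

counts = per_label_counts(predicted_rows, expected_rows)
precision = {l: safe_ratio(c["tp"], c["tp"] + c["fp"]) for l, c in counts.items()}
recall = {l: safe_ratio(c["tp"], c["tp"] + c["fn"]) for l, c in counts.items()}
macro_precision = sum(precision.values()) / len(precision)
macro_recall = sum(recall.values()) / len(recall)
```

Label order inside each row never enters the calculation because every row is turned into a set first.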

#

I'm a bit lazy but I can write it out in full if you want 🤣

buoyant vine Nov 12, 2023, 1:47 PM

#

😅 I think it's fine

#

Second question, has anyone tried llama2 for text classification?

past meteor Nov 12, 2023, 1:48 PM

#

Notice how we can drill-down again, you can combine Precision and Recall into something called the F1 score (basically their harmonic mean), which you can drill-down into P and R for an aggregation over all classes and then drill-down again to P and R per class, apart
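
As a small illustration of the harmonic mean mentioned above, here is a sketch of the F1 computation. Returning 0 when both inputs are 0 is a convention assumed here, and the example numbers are arbitrary.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)

# A label with perfect precision but only two thirds recall.
f1_example = f1_score(1.0, 2 / 3)
```

The harmonic mean sits closer to the smaller of the two inputs than an ordinary average would, so a model cannot score well by maximising only one of them.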

buoyant vine Nov 12, 2023, 1:53 PM

#

past meteor Notice how we can drill-down again, you can combine Precision and Recall into so...

Does the order of the labels returned matter btw?

past meteor Nov 12, 2023, 1:53 PM

#

No

buoyant vine Nov 12, 2023, 1:53 PM

#

Thumbs_Up ty

lapis sequoia Nov 12, 2023, 3:06 PM

#

polar zodiac Hello, How to work with Over-fitting model?

How to avoid overfitting? Or trying to make it so the model overfits?

stoic garnet Nov 12, 2023, 3:36 PM

#

lapis sequoia How to avoid overfitting? Or trying to make it so the model overfits?

Early stopping

polar zodiac Nov 12, 2023, 5:00 PM

#

odd meteor Unfortunately I have none. You might wanna check online on each individual appro...

Its a dengue prediction project
https://www.kaggle.com/discussions/general/91461 I am currently using this dataset as I didn't find any other dataset online

Typhoid and Dengue Fever Symptoms Dataset | Kaggle

Typhoid and Dengue Fever Symptoms Dataset.

polar zodiac Nov 12, 2023, 5:01 PM

#

lapis sequoia How to avoid overfitting? Or trying to make it so the model overfits?

avoiding overfitting...
I mostly want accurate results

lapis sequoia Nov 12, 2023, 5:29 PM

#

What do you use the most? LogisticRegression? Random Forest overfits so often

#

You could also just use statsmodels, run it with all of the features (if there are not too many), and see what the R-squared score is if the target is not discrete.

#

https://github.com/nickkatsy/python_ml_ect_/blob/master/hmda_updated.py I put 200 hours into that dataset due to all of the NAs and having to read up on the HMDA act in Boston, and then I just used logistic regression. Maybe just drop columns that don’t have to be there as well.

misty flint Nov 12, 2023, 5:39 PM

#

buoyant vine Second question, has anyone tried llama2 for text classification?

you can try it. be prepared for a world of prompt engineering depending on your task in which you might as well have used classical ML, again depending on your task

blazing cape Nov 12, 2023, 5:40 PM

#

input size torch.Size([1, 4, 320, 320]) not equal to max model size (1, 3, 320, 320)

#

pls send help i am under the water

odd meteor Nov 12, 2023, 6:52 PM

#

polar zodiac Its a dengue prediction project https://www.kaggle.com/discussions/general/9146...

Is there a class imbalance issue in the data?

odd meteor Nov 12, 2023, 6:56 PM

#

echo mesa Hmm I'm planning to go to the UK since I'm in Ireland, I was just wondering that...

I would go with Statistics again (maybe because that's my 1st major), then Computer Science, and then Mathematics comes last for me.

So long as you've done your due diligence, weighed your options, and of course, happy with your decision, then go for it.

odd meteor Nov 12, 2023, 6:59 PM

#

blazing cape input size torch.Size([1, 4, 320, 320]) not equal to max model size (1, 3, 320, ...

The issue is in the second dimension where you have 4 and 3. What are you trying to do?
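
One common cause of a 4 vs 3 mismatch in that dimension is an image with an alpha channel (RGBA) being fed to a model that expects RGB. That is only a guess here, since the message doesn't say how the input was loaded. If it is the cause, dropping the fourth channel is enough. The sketch uses a NumPy array as a stand-in for the batch; the same slicing expression works on a PyTorch tensor.

```python
import numpy as np

# Stand-in for a batch shaped (batch, channels, height, width) with an alpha channel.
rgba_batch = np.zeros((1, 4, 320, 320), dtype=np.float32)

# Keep the first three channels (R, G, B) and drop the fourth (alpha).
rgb_batch = rgba_batch[:, :3, :, :]
```

If the image is loaded with Pillow, calling `.convert("RGB")` on it before building the tensor avoids the extra channel in the first place.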

echo mesa Nov 12, 2023, 7:08 PM

#

odd meteor I would go with Statistics again (maybe because that's my 1st major), then Compu...

isn't statistics is part of mathematics?

#

and computer science would be about learning how to code which I already know how to

left tartan Nov 12, 2023, 7:12 PM

#

echo mesa isn't statistics is part of mathematics?

They’re different majors though. Pick a university, and look at the degree offerings and course requirements

odd meteor Nov 12, 2023, 7:13 PM

#

echo mesa isn't statistics is part of mathematics?

They can be likened as two peas in same pod but they are quite distinct.

echo mesa Nov 12, 2023, 7:14 PM

#

odd meteor They can be likened as two peas in same pod but they are quite distinct.

well yeah but you said that you would do mathematics last and you'd go with statistics first so thats why I was confused

left tartan Nov 12, 2023, 7:14 PM

#

For example https://www.harvard.edu/programs/statistics/#undergraduate Vs. https://www.harvard.edu/programs/applied-mathematics/#undergraduate vs https://www.harvard.edu/programs/mathematics/#undergraduate

echo mesa Nov 12, 2023, 7:15 PM

#

odd meteor They can be likened as two peas in same pod but they are quite distinct.

all im trying to understand is that how would going with pure mathematics help me with machine learning and ai, because I already know how to code and if I'd be really good at math then it would be a lot easier to understand new topics in machine learning or even do something.

odd meteor Nov 12, 2023, 7:18 PM

#

echo mesa well yeah but you said that you would do mathematics last and you'd go with stat...

Statistics and Math are kinda related but they are two different fields. Mathematics is a broad discipline that deals with numbers, quantities, shapes, and abstract structures. It includes various branches such as algebra, calculus, geometry, and more, providing the theoretical foundation for many other fields, including statistics.

Statistics specifically focuses on data. It involves collecting, analyzing, interpreting, and presenting data to infer proportions in whole from those in a representative sample. While it utilizes mathematical tools and principles, statistics is often more applied, using mathematical theories to solve real-world problems related to data analysis. It's a discipline heavily grounded in real-world applications, including decision-making in business, health, politics, and other domains.

For more context and better explanation: https://www.quora.com/What-is-the-difference-between-mathematics-and-statistics

echo mesa Nov 12, 2023, 7:19 PM

#

odd meteor Statistics and Math are kinda related but they are two different fields. Mathema...

hmm, then i dont really know what to go with

lapis sequoia Nov 12, 2023, 7:23 PM

#

I got into DS with a economics degree so I do not know

odd meteor Nov 12, 2023, 7:25 PM

#

echo mesa hmm, then i dont really know what to go with

Read the attached Quora page for more clarity. The goal is not to confuse you though.

Meanwhile, there's a lot more you could learn from a Computer Science major other than coding (even if you already know how to code). Just do your own research and finding then go with the one that appears more interesting / fun to you.

At the end of the day, this field isn't discriminatory. So you could even study Actuarial Science, Zoology, Soil Science, or even Human Kinetics (a.k.a 'jumpology' 😀 ) and still end up in Data Science.

lapis sequoia Nov 12, 2023, 7:28 PM

#

lapis sequoia I got into DS with a economics degree so I do not know

I want to add for clarification that I added a bunch of math on top of it and went far out of my way to get better at a bunch of the aspects of DS and programming on the side and had to teach myself a lot of stuff. But it is my favorite thing ever.

echo mesa Nov 12, 2023, 7:29 PM

#

odd meteor Read the attached Quora page for more clarity. The goal is not to confuse you th...

all i know is that im very interested in ai and machine learning and I love mathematics

echo mesa Nov 12, 2023, 7:29 PM

#

lapis sequoia I want to add for clarification that I added a bunch of math on top of it and we...

Gotcha

left tartan Nov 12, 2023, 7:29 PM

#

Many schools are now offering DS programs, which are like a hybrid of CS and Stats.

#

But I don't know UK.

lapis sequoia Nov 12, 2023, 7:39 PM

#

echo mesa all i know is that im very interested in ai and machine learning and I love math...

All I will say about majoring in pure mathematics is that if you are using it for nothing, you are wasting your time. Just do what you like and you will be fine.

odd meteor Nov 12, 2023, 7:39 PM

#

More so, with a Statistics degree (so long as you ended up in a well-to-do school) you're most definitely gonna have math core courses and electives where you'll learn Calculus, Linear Algebra, Laplace Transform, etc. To me, it's like using 1 stone to kill two birds.

With a core Maths program, you probably won't go as far as covering Inferential Stats, Econometrics, Operations Research, Time Series, Gambler's Ruin, Stochastic & Maximum Likelihood Estimation, ANOVA & MANOVA, and all that interesting stuff in Stats.

But don't take my word for it. Do your own research and go with the program that you think will give you wings like red bull 😀

At the end of the day, whether you do Animal Husbandry, or Urban & Regional Planning as your 1st major, you can still get into Data Science.

echo mesa Nov 12, 2023, 7:40 PM

#

lapis sequoia All I will say about majoring in pure mathematics is that if you are using it fo...

Yeah that's what im trying to avoid, i wanna learn mathematics that is very useful for machine learning

echo mesa Nov 12, 2023, 7:42 PM

#

odd meteor More so, with a Statistics degree (so long as you ended up in a well-to-do schoo...

hmm, that's really good to know

echo mesa Nov 12, 2023, 7:43 PM

#

odd meteor More so, with a Statistics degree (so long as you ended up in a well-to-do schoo...

I mean yeah, this sounds like something that I can actually use for machine learning and data science, problem with pure mathematics is that you learn a bunch of other stuff that you wont ever use, unlike with statistics which afaik is the foundation of ml

mighty halo Nov 12, 2023, 7:46 PM

#

ohh really ? @echo mesa

echo mesa Nov 12, 2023, 7:46 PM

#

mighty halo ohh really ? @echo mesa

wdym?

mighty halo Nov 12, 2023, 7:47 PM

#

stats seems more useful than the crazy complex math

echo mesa Nov 12, 2023, 7:47 PM

#

mighty halo stats seems more useful than the crazy complex math

yeah

mighty halo Nov 12, 2023, 7:48 PM

#

im drowning in an ocean of math ... not knowledgeable enough to see the horizon

echo mesa Nov 12, 2023, 7:48 PM

#

mighty halo im drowning in an ocean of math ... not knowledgeable enough to see the horizon

are you in uni or smth?

mighty halo Nov 12, 2023, 7:49 PM

#

just messing with multicore mcus and python , wondering if mojo-python is workable

gloomy creek Nov 12, 2023, 8:16 PM

#

is pytorch good to learn as an intermediate

lapis sequoia Nov 12, 2023, 8:47 PM

#

gloomy creek is pytorch good to learn as an intermediate

It depends on what you are trying to do and what an intermediate is.

tender sparrow Nov 12, 2023, 10:50 PM

#

how do i

df.rolling(100).groupby
>> AttributeError: 'Rolling' object has no attribute 'groupby'

😦

serene scaffold Nov 12, 2023, 11:03 PM

#

tender sparrow how do i ```python df.rolling(100).groupby >> AttributeError: 'Rolling' object h...

let's "zoom out". what are you trying to do?

serene scaffold Nov 12, 2023, 11:05 PM

#

gloomy creek is pytorch good to learn as an intermediate

"learning libraries" is not a good approach to learning ML. None of them are end-to-end tools (pretty much anything will involve two or more of them), and they're designed from the assumption that you already understand what you're trying to do.

#

Don't use ChatGPT to answer questions.

tender sparrow Nov 12, 2023, 11:16 PM

#

hey i am trying to define my problem as simple as possible to understand

serene scaffold Nov 12, 2023, 11:18 PM

#

tender sparrow hey i am trying to define my problem as simple as possible to understand

that's okay.

if you try to simplify your question to only "how do I do groupby on a rolling operation?", the answer is "you can't", and that's it--there's no way around it. So we have to take a step back and understand what doing groupby on a rolling was intended to accomplish, so we can figure out what pandas functionality is available for that.

tender sparrow Nov 12, 2023, 11:19 PM

#

yea that's why now i am trying to define my problem as simple as possible again 😄 give me a second ha

#

it's the first thing that came to my mind, knowing how wonderful python and pandas api can be i thought that would be so straightforward, seems logical

tender sparrow Nov 12, 2023, 11:56 PM

#

import pandas as pd

# original dataframe
df = pd.DataFrame({
    'x': ['a',      'b', 'c', 'd', 'e', 'e', 'b', 'c', 'd', 'a'], 
    'y': [ 0,        1,   2,   3,   4,   4,   3,   2,   1,   9]}
)

# grouper
gr = df.groupby('x')['y']

# operations
assert gr.sum().idxmax() == 'a'
assert gr.sum().max() == 9

# desired output of the rolling 2 operations
# z = x where y is max
# s = sum of y where x is max
pd.DataFrame({
    'x': ['a',      'b',  'c', 'd', 'e', 'e', 'b', 'c', 'd', 'a'], 
    'z': [ pd.NA,   'b',  'c', 'd', 'e', 'e', 'e', 'b', 'c', 'a'],
    's': [ pd.NA,    1,    2,   3,   4,   8,   4,   3,   2,   9],
})

now that i think of it, it would be very interesting to see how to simplify that question; i am bad at that
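
One reading of the desired output above: for each trailing window of 2 rows, sum `y` per `x`, then report the winning group (`z`) and its sum (`s`). A plain-loop sketch under that assumption follows. The helper name `rolling_group_max` is hypothetical, and this is not a built-in pandas operation.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd', 'e', 'e', 'b', 'c', 'd', 'a'],
    'y': [0, 1, 2, 3, 4, 4, 3, 2, 1, 9],
})


def rolling_group_max(frame, window):
    """For each trailing window of rows, sum y per x and keep the largest group."""
    z, s = [], []
    for end in range(len(frame)):
        start = end - window + 1
        if start < 0:
            # Not enough rows yet to fill a whole window.
            z.append(pd.NA)
            s.append(pd.NA)
            continue
        sums = frame.iloc[start:end + 1].groupby('x')['y'].sum()
        z.append(sums.idxmax())  # group label with the largest sum
        s.append(sums.max())     # that largest sum
    return pd.DataFrame({'x': frame['x'], 'z': z, 's': s})


result = rolling_group_max(df, window=2)
print(result)
```

The loop does one small groupby per row, so it is fine for exploring the idea but will be slow on large frames. On ties, `idxmax` returns the first label in sorted group order.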

lapis sequoia Nov 13, 2023, 3:18 AM

#

serene scaffold Don't use ChatGPT to answer questions.

I feel ChatGPT is fine if you need to do minor things and need a solution.

serene scaffold Nov 13, 2023, 3:19 AM

#

lapis sequoia I feel ChatGPT is fine if you need to do minor things and need a solution.

sure, but if people want to use ChatGPT, they can do so, and they don't need someone on Discord to copy/paste answers for them.

And if the person doing the copy/pasting doesn't actually understand the question--let alone the answer--and can't verify if the answer is correct or help the asker understand it, then they've just made things worse.

lapis sequoia Nov 13, 2023, 3:20 AM

#

I agree

#

People who rely solely on ChatGPT are not better off. It cannot do an entire data set for you and make sense of it for you. I only use it to add little things that I cannot think of for a project. Even then, if I didn’t know exactly what to ask it, then ChatGPT would do nothing of any real significance

#

And openAI is wrong a lot as well

serene scaffold Nov 13, 2023, 3:23 AM

#

OpenAI is a company. Are you talking about them, or ChatGPT?

lapis sequoia Nov 13, 2023, 3:25 AM

#

Sorry, I mean ChatGPT

serene scaffold Nov 13, 2023, 3:26 AM

#

@tender sparrow did you try df.groupby('x')['y'].rolling(2).max()?

polar zodiac Nov 13, 2023, 6:08 AM

#

odd meteor Is there class imbalance issue in the data?

umm sorry what does that mean?

elfin stirrup Nov 13, 2023, 6:09 AM

#

can someone help me understand this problem i got the answer to be 2 data samples but it was incorrect.
i found id3 and id7 to be classified as c_1 based on the formula

polar zodiac Nov 13, 2023, 6:14 AM

#

lapis sequoia What do you use the most? LogisticRegression? Random Forest overfits so often

Logistic Regression

odd meteor Nov 13, 2023, 6:32 AM

#

polar zodiac umm sorry what does that mean?

I mean the response variable (y) in your dataset: does it have balanced class labels?

Presuming your dataset's target variable is y, and it contains 2 classes (1s and 0s), confirm if the number of times 1 and 0 appeared are somewhat equal.
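
A quick way to run that check, sketched with a made-up label list (the values in `y` are hypothetical, so swap in your own target column):

```python
from collections import Counter

# Hypothetical binary target; replace with your own label column.
y = [1, 0, 1, 1, 0, 1, 1, 1]

# Count how often each class appears, then turn counts into proportions.
counts = Counter(y)
total = sum(counts.values())
proportions = {label: count / total for label, count in counts.items()}

print(counts)
print(proportions)
```

With a pandas column, `df['y'].value_counts(normalize=True)` gives the same proportions in one call.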

polar zodiac Nov 13, 2023, 7:17 AM

#

No...
For 1's its 72
and for 0's its 32

scenic parcel Nov 13, 2023, 7:23 AM

#

Any of you ever use academictorrents.com ?

royal crest Nov 13, 2023, 7:30 AM

#

what is it for?

odd meteor Nov 13, 2023, 7:31 AM

#

polar zodiac No... For 1's its 72 and for 0's its 32

Since both class labels don't have the same proportion, there's class imbalance in your data.

So, I'd suggest using stratified random sampling first (use `stratify=y` when splitting your data with train_test_split) and see how the model performs before trying any of the approaches I suggested earlier
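
A sketch of that stratified split, using the 72/32 class counts reported above. The feature matrix `X` is a placeholder, and `test_size=0.25` and `random_state=0` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the class counts from the thread: 72 ones and 32 zeros.
y = np.array([1] * 72 + [0] * 32)
# Placeholder features (an assumption); use your real feature matrix here.
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the 72:32 class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print("train ones/zeros:", int(y_train.sum()), int((y_train == 0).sum()))
print("test ones/zeros:", int(y_test.sum()), int((y_test == 0).sum()))
```

Without `stratify`, a small test set like this can end up with noticeably more or fewer minority-class rows than the training set, which makes the scores harder to compare.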

scenic parcel Nov 13, 2023, 8:04 AM

#

royal crest what is it for?

Getting datasets for research/analysis. If a company doesnt want that dataset to be easily available, then you might have to torrent it

muted hollow Nov 13, 2023, 9:52 AM

#

Hey guys, can you recommend me some NLP project to make so I can better my resume, I have only tried the health care chatbot and study chatbot project only

jagged hedge Nov 13, 2023, 12:56 PM

#

Is there any pre-trained LLM that can process a document pdf without changing it into text? Once it gets changed to text or some image, the generated text will not be accurate for the images or table structures present in the pdf. If you know of any such LLM, or any NLP or document-processing model that I can ask questions about the document, please help.

serene scaffold Nov 13, 2023, 1:09 PM

#

jagged hedge Is there any pre-trained LLM that can process a document pdf without changing it...

No, there is no way to pass a PDF file to an LLM without first converting it to text. If you find a tool that purports to do that, it is guaranteed to be converting it to text under the hood.

#

what do you need to do with the images and tabular data? and are these images of text?

left tartan Nov 13, 2023, 1:11 PM

#

Seems like it's torrents of stuff you can download freely anyway (responding to #data-science-and-ml message)

jagged hedge Nov 13, 2023, 1:12 PM

#

serene scaffold what do you need to do with the images and tabular data? and are these images of...

there are tables like on what input i want to convey a particular image like that and also what structs i need to implement with their particular bits required

serene scaffold Nov 13, 2023, 1:13 PM

#

jagged hedge there are tables like on what input i want to convey a particular image like tha...

there are probably tools for serializing tabular data as text that can be passed to an LLM. And for images, there are probably models that can convert images to natural language descriptions (like the opposite of what Midjourney et al do).

polar zodiac Nov 13, 2023, 3:57 PM

#

odd meteor Since both class labels don't have same proportion, there's class imbalance in y...

Oh okay cool

scenic parcel Nov 13, 2023, 4:57 PM

#

left tartan Seems like it's torrents of stuff you can download freely anyway (responding to ...

idk, I'm trying to get archived reddit data, and in July reddit started cracking down on services like pushshift that used to give it out for free

gaunt pendant Nov 13, 2023, 5:14 PM

#

hey guys, anybody with reinforcement learning experience, especially in proximal policy optimization? I'm dealing with convergence issues for the cartpole v1 environment

gaunt pendant Nov 13, 2023, 5:48 PM

#

?

visual fossil Nov 13, 2023, 5:56 PM

#

Hey,

Has anyone here reviewed the following book ?

Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow.

abstract wasp Nov 13, 2023, 6:23 PM

#

visual fossil Hey, Has anyone here reviewed the following book ? Hands-On Machine Learning ...

It's a great book

visual fossil Nov 13, 2023, 6:32 PM

#

abstract wasp It's a great book

That’s nice to know, I’ll be starting it soon.

visual fossil Nov 13, 2023, 6:39 PM

#

abstract wasp It's a great book

In your opinion, will it be better to go over this book to learn about AI, or go for an online course ?!

abstract wasp Nov 13, 2023, 6:43 PM

#

visual fossil That’s nice to know, I’ll be starting it soon.

Yeah, you should have some Python background otherwise it might be really challenging.

abstract wasp Nov 13, 2023, 6:44 PM

#

visual fossil In your opinion, will it be better to go over this book to learn about AI, or go...

For me, I use the book alongside other resources I find online like YT vids. etc.

visual fossil Nov 13, 2023, 6:51 PM

#

abstract wasp For me, I use the book alongside other resources I find online like YT vids. etc...

Alright, thank you!

visual fossil Nov 13, 2023, 6:52 PM

#

abstract wasp For me, I use the book alongside other resources I find online like YT vids. etc...

If it’s possible for you, could you dm some useful resources whenever it’s easy for you!

gaunt pendant Nov 13, 2023, 7:44 PM

#

gaunt pendant hey guys, anybody with reinforcement learning experience, especially in proximal ...

nobody?

tender sparrow Nov 13, 2023, 7:49 PM

#

serene scaffold @tender sparrow did you try `df.groupby('x')['y'].rolling(2).max()`?

im trying

agile cobalt Nov 13, 2023, 8:37 PM

#

tender sparrow im trying

not gonna lie I don't get what your s is meant to be at all

unique crown Nov 13, 2023, 10:47 PM

#

hello folks! i have a question about webscraping with python, can i ask that here?

serene scaffold Nov 14, 2023, 12:40 AM

#

unique crown hello folks! i have a question about webscraping with python, can i ask that her...

this channel would be more about using the data once you have scraped it, so you'd want to use a general help thread. Be sure that the website you're scraping is okay with that in their terms of service

plush jungle Nov 14, 2023, 2:39 AM

#

I'm trying to run the ppo pytorch code from here https://pytorch.org/rl/tutorials/coding_ppo.html and I finally got it working, but now I want to retrofit it to some of the other gym mujoco environments. When I change this line

base_env = GymEnv("InvertedDoublePendulum-v4", device=device, frame_skip=frame_skip, render_mode="human")

to this

base_env = GymEnv("Humanoid-v4", device=device, frame_skip=frame_skip, render_mode="human")

I get

  File "C:\python\reinforcement_learning\ppo\pytorch_ppo.py", line 130, in <module>
    collector = SyncDataCollector(
                ^^^^^^^^^^^^^^^^^^
  File "C:\Users\mj\miniconda3\Lib\site-packages\torchrl\collectors\collectors.py", line 677, in __init__
    self._tensordict_out.unsqueeze(-1)
  File "C:\Users\mj\miniconda3\Lib\site-packages\tensordict\tensordict.py", line 2184, in expand
    d[key] = value.expand((*shape, *value.shape[-last_n_dims:]))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expand(): argument 'size' (position 1) must be tuple of ints, but found element of type float at pos 0

#

my full code is here: https://paste.pythondiscord.com/KBVA

#

obviously I need to change the shape of something like the observations or the action space or something, but I'm not sure what to change it to. any ideas?

shut girder Nov 14, 2023, 5:06 AM

#

Hello, I am a beginner to data analysis. Does data cleaning come before or during exploratory data analysis?

frigid dust Nov 14, 2023, 5:21 AM

#

shut girder Hello, I am a beginner to data analysis. Does data cleaning come before or durin...

You need to perform EDA in order to see how the data should be cleaned. Once you have created your cleaning steps, continue with your analysis
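
A small sketch of that order of work: inspect first, then clean based on what the inspection shows. The toy table and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with a missing age, a missing city,
# a duplicated row, and a suspicious age value.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 31, 120],
    "city": ["Cork", "Dublin", "Galway", "Galway", None],
})

# EDA step: look before changing anything.
missing_per_column = df.isna().sum()
duplicate_rows = int(df.duplicated().sum())
summary = df.describe()  # the max of 120 hints at an outlier worth checking

# Cleaning step: decisions informed by the checks above.
cleaned = df.drop_duplicates().dropna()

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
print(summary)
print(cleaned)
```

Dropping rows is only one possible cleaning choice. Whether to drop, impute, or keep a value depends on what the exploration reveals about the data.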

sleek palm Nov 14, 2023, 6:46 AM

#

Hello, i am trying to build a cnn based ml model. Anyone who has any experience in same, please dm me.

polar zodiac Nov 14, 2023, 7:28 AM

#

odd meteor Since both class labels don't have same proportion, there's class imbalance in y...

Just tried that out but idt there was any change

#

except for the graph it got changed to this

odd meteor Nov 14, 2023, 7:41 AM

#

polar zodiac except for the graph it got changed to this

Oh I see. I'm not seeing an overfitting problem here. After about +90 epochs, you can see the model converged quite well.

Do you mind sharing the old learning curve?

polar zodiac Nov 14, 2023, 7:51 AM

#

odd meteor Oh I see. I'm not seeing an overfitting problem here. After about +90 epochs, yo...

odd meteor Nov 14, 2023, 8:25 AM

#

polar zodiac

Evidently, the stratified random sampling you performed improved the model's performance. In this old curve, the model is overfitting.

Based on your last experiment, you can try to further improve the performance. Try hyperparameter tuning

#

It's usually not so common for the test to outperform the train, although it's not impossible. It could be due to the test set having a small sample size compared to the train set.

With more experimentation, I'm certain you can squeeze out more performance from your model
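
A sketch of the hyperparameter tuning mentioned above, assuming a logistic regression model like the one discussed earlier in the thread. The synthetic data, the grid of `C` values, and the `f1` scoring choice are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, mildly imbalanced binary data standing in for the real dataset.
X, y = make_classification(
    n_samples=200, n_features=8, weights=[0.3, 0.7], random_state=0
)

# Candidate values for the inverse regularization strength C.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,          # an integer cv uses stratified folds for classifiers
    scoring="f1",  # accuracy alone can mislead on imbalanced classes
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Run the search on the training split only and keep the test split for a final check, otherwise the reported score is optimistic.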

drifting summit Nov 14, 2023, 3:59 PM

#

hey can someone advise on how to start with ai, im an intermediate python programmer and im pretty good at mathematics. just want to learn about ai before college and not get left behind

storm kelp Nov 14, 2023, 4:07 PM

#

Working in Pyspark and have made a mess of code which runs but is slow/inefficient:

for row in collected_loci:
    studyLocusId = row['studyLocusId']
    study = row['studyId']
    
    sumstats = spark.read.csv(f'gs://finngen-public-data-r9/summary_stats/finngen_R9_{study}.gz', header=True, sep='\t')
    
    df_locus = (
        df
        .filter(f.col('studyLocusId') == studyLocusId)
        .select('studyId', 'fi.chromosome', 'fi.position', 'fi.referenceAllele', 'fi.alternateAllele', 'idx'))

    gnomad_count = df_locus.count()
    
    df_locus = (
        df_locus
            .join(
                sumstats,
                (df['fi.chromosome'] == sumstats['#chrom']) &
                (df['fi.position'] == sumstats['pos']) &
                (df['fi.referenceAllele'] == sumstats['ref']) &
                (df['fi.alternateAllele'] == sumstats['alt']),
                'inner')
            .sort('idx'))
    
    
    matched_count = df_locus.count()
    df_locus.write.csv(f'output/{studyLocusId}/locus.csv', mode='overwrite')
    
    count_df = spark.createDataFrame([Row(gnomad_variant_count = gnomad_count, matched_variant_count = matched_count)])
    count_df.coalesce(1).write.mode('overwrite').option('delimiter', '\t').csv(f'output/{studyLocusId}/metadata.tsv')
    
    idxs = [x['idx'] for x in df_locus.select('idx').collect()]
    bm_filtered = bm.filter(idxs, idxs)
    bm_numpy = bm_filtered.to_numpy()
    bm_mirror = bm_numpy + bm_numpy.T - np.diag(np.diag(bm_numpy))
    np.save(f'output/{studyLocusId}/ld', bm_mirror)

Looking to see if there are any ways to speed up the execution other than the obvious things like removing the count() calls. 
My initial attempt was to refactor it into functions so that it can be parallelized instead of running iteratively. But this gave me serialization errors due to calling sparkContext within the function.

agile cobalt Nov 14, 2023, 4:23 PM

#

how large is the original df, how large are each of the sumstats and do you have any duplicate studyId / studyLocusId? (for the last question, not duplicate pairs, but just non-unique values within a column)

#

also shouldn't you be using df_locus instead of df inside of the join? in the df['col'] == sumstats['col'] lines

polar zodiac Nov 14, 2023, 4:28 PM

#

odd meteor It's usually not so common for the test to outperform the train; although not im...

Okay cool
Thank You Veryyy Much!!
Any other ways I can try on for improvement on the model?

storm kelp Nov 14, 2023, 4:30 PM

#

agile cobalt how large is the original df, how large are each of the sumstats and do you have...

The original df is about 20,000 rows.
each sumstat file is ~6m rows but I only need a small subset of that.
Every studyLocusID is a unique hash, but there are approximately 1,000 unique studyIDs
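
With roughly 20,000 loci but only about 1,000 distinct studies, the loop above re-reads the same ~6M-row sumstats file many times. One possible restructuring (a suggestion not made in the thread, and untested on the real data) is to group the loci by `studyId` first so each file is read once. The rows below are hypothetical stand-ins for `collected_loci`, and the Spark calls are left as comments so the sketch runs without a cluster.

```python
from collections import defaultdict

# Hypothetical stand-in for `collected_loci`: one record per locus.
collected_loci = [
    {"studyLocusId": "locus_1", "studyId": "study_A"},
    {"studyLocusId": "locus_2", "studyId": "study_B"},
    {"studyLocusId": "locus_3", "studyId": "study_A"},
]

# Group locus ids by the study they belong to.
loci_by_study = defaultdict(list)
for row in collected_loci:
    loci_by_study[row["studyId"]].append(row["studyLocusId"])

for study, locus_ids in loci_by_study.items():
    # In the real job, the sumstats file for `study` would be read ONCE here
    # (the same spark.read.csv call as in the original loop), optionally
    # cached, and then joined against every locus in `locus_ids`.
    print(study, locus_ids)
```

Whether this helps in practice depends on how Spark plans the joins. Filtering `df` with `isin(locus_ids)` and doing one join per study, then splitting the output by `studyLocusId`, would be the next thing to try.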

storm kelp Nov 14, 2023, 4:31 PM

#

agile cobalt also shouldn't you be using df_locus instead of df inside of the join? in the `d...

Yeah that's a mistake. Although df_locus inherits column names from df, so I guess it doesn't make a difference? I'll change it anyway

#

Surprised it didn't throw an error from that tbh, but there you go

agile cobalt Nov 14, 2023, 4:33 PM

#

storm kelp Yeah that's a mistake. Although df_locus inherits column names from df, so I gue...

it could be making the comparisons based on the original dataframe instead of the filtered one? which would slow it down quite a lot ; not sure though, I don't really know how spark works, haven't used it (I'm only used to pandas)

#data-science-and-ml

optionally do not even store df['date'] as a column, just

(same thing as before but without storing as a column)

dates = pd.to_datetime(df['string'])

df['month'] = dates.dt.month [...]
