#data-science-and-ml

serene scaffold
#

ML is applied math. so you'd be learning stats, calculus, and linear algebra. there are resources in the pins.

abstract wasp
#

Help, I'm making an autoencoder, my error happens when I call autoencoder.fit:
ValueError: y argument is not supported when using dataset as input.
This is my code:
# IMPORTS
# LOADING DATASET
(x_train, x_test), ds_info = tfds.load(
    'celeb_a',
    split=('train', 'test'),
    shuffle_files=True,
    as_supervised=False,
    with_info=True)
x_train = x_train.map(lambda x: {'image': (x['image']/255)})
x_test = x_test.map(lambda x: {'image': (x['image']/255)})
print(x_train)
print(x_test)

class Autoencoder(Model):
    def __init__(self, latent_dim, shape):
        super(Autoencoder, self).__init__()
        self.latent_dim = latent_dim
        self.shape = shape
        self.encoder = tf.keras.Sequential([
            layers.Input(shape=shape),
            layers.Conv2D(16, (3, 3), activation='relu', padding='same', strides=2),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(32, (3, 3), activation='relu', padding='same', strides=2),
            layers.MaxPooling2D((2, 2)),
            layers.Flatten(),
            layers.Dense(latent_dim, activation='relu')])
        self.decoder = tf.keras.Sequential([
            layers.Input(shape=(latent_dim,)),
            layers.Dense(tf.math.reduce_prod(shape), activation='relu'),
            layers.Reshape(shape),
            layers.Conv2D(32, (3, 3), activation='relu', padding='same', strides=2),
            layers.UpSampling2D((2, 2)),
            layers.Conv2D(16, (3, 3), activation='relu', padding='same', strides=2),
            layers.UpSampling2D((2, 2)),
            layers.Conv2D(8, (3, 3), activation='relu', padding='same', strides=2)])

    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

#

shape = (218, 178, 3)
latent_dim = 64
autoencoder = Autoencoder(latent_dim, shape)
autoencoder.compile(optimizer='adam', loss=losses.MeanSquaredError())
autoencoder.fit(
    x_train, x_train,
    epochs=10,
    shuffle=True,
    validation_data=(x_test, x_test))

desert oar
blazing oxide
# abstract wasp Help, I'm making an autoencoder, my error happens when I call `autoencoder.fit`:...

Hey there! 👋 It seems like you're having trouble with your autoencoder model. The error "ValueError: y argument is not supported when using dataset as input" usually happens when you're trying to fit a model with a dataset as input, but you're also providing a target y argument.

In your case, you're using tf.data.Dataset objects for training and testing data. When you call fit on your autoencoder model, you're providing x_train as both the input data and the target data. However, when using a tf.data.Dataset, you should only provide it as the first argument to fit, and not as the target data.

Here's how you can modify your fit call:

autoencoder.fit(
    x_train,
    epochs=10,
    shuffle=True,
    validation_data=x_test)

In this case, your x_train and x_test datasets should yield pairs (input_batch, target_batch). But since you're working with an autoencoder, your input data is the same as your target data. So, you might need to modify how your datasets are created to yield pairs of the same data.

I hope this helps! If you have any more questions or need further clarification, feel free to ask! 😊

cerulean kayak
#

so I tried to use Optuna and it's still taking quite a while. I realized the main problem is I don't know what range of values I should use for my hyperparameters, which are:

  • max_features (sqrt or log2 so not applicable)
  • n_estimators
  • max_depth
    I've tried looking up on "how to know what range of values to test for in random forest hyperparameter tuning", but I just get a bunch of results explaining how to tune them, while each one just chooses a random value for the start and another random value for the end.

Is there a way to know what range of values I should use based on the amount of data I have?

blazing oxide
desert oar
#

oh damn your username is just "discord"

#

i'm so old 🤣

blazing oxide
desert oar
#

you can save a lot of time here by using "warm starts" and gradually increasing n_estimators. or halving random search with the same

#

max features i don't think is worth cross validating over

#

so that's just max depth, at which point you're searching over a single parameter

blazing oxide
cerulean kayak
# desert oar what values are you using?

my input is a series of images processed from tensorflow into numpy arrays (5,712 training and 1311 testing)
atm I'm doing

    n_estimators=trial.suggest_int("n_estimators",100,400,step=20)
    max_depth=trial.suggest_int(log=True,name="max_depth",step=3,low=2,high=32)
desert oar
# cerulean kayak ...also n\_estimators...

kind of, but that's what i mean about the warm starts thing. you can treat n_estimators as a special case. conceptually, you would do something like this (assuming scikit-learn):

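from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score as score  # stand-in metric, use whichever you prefer

model_base = RandomForestClassifier()
scores = {}
for max_depth in [2, 4, 6, 8]:
    # fresh model per depth: warm_start only reuses trees grown at the same depth
    model = clone(model_base)
    model.set_params(warm_start=True, max_depth=max_depth)
    for n_estimators in [50, 100, 150, 200]:
        # with warm_start, refitting after raising n_estimators only trains the new trees
        model.set_params(n_estimators=n_estimators)
        model.fit(x_train, y_train)
        score_train = score(y_train, model.predict(x_train))
        score_eval = score(y_eval, model.predict(x_eval))
        scores[(max_depth, n_estimators)] = (score_train, score_eval)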
#

that should save you a fair amount of individual decision tree fits

#

also i believe sklearn supports transparently parallelizing RandomForestClassifier

#

(are you training on image embeddings? why not just keep them going through a fully connected hidden layer at the end of whatever NN you're already using?)

#

so if you want to use optuna for max_depth you can, but you might not need it

#

also i don't know that incrementing n_estimators from 100 to 400 in steps of 20 is a good use of cpu cycles. seems excessive

#

the other option is halving search, which works well on tree ensembles and NNs because it's easy to identify what the "resource" should be (number of trees or number of epochs, respectively) https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV

cerulean kayak
cerulean kayak
desert oar
past meteor
desert oar
#

likewise with 200+ estimators, although breiman advocates for 1000+ trees, my experience is that you hit diminishing returns pretty hard. but you should see that in the score curve

#

fortunately your parameter space is small enough to visualize

past meteor
#

On the other hand, max_depth is implicitly taken into account when you compute the cost complexity (ccp_alpha), look at the equation to understand why 😁.

For xgboost it's the opposite, there you should properly reduce the number of estimators
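
For reference, the equation in question is the cost-complexity measure that the scikit-learn user guide gives for minimal cost-complexity pruning (the parameter is spelled `ccp_alpha` there). The notation follows that guide; the reading of it underneath is only a summary:

```latex
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|
```

Here R(T) is the total impurity of the terminal nodes of tree T, |T̃| is the number of terminal nodes, and α ≥ 0 is the pruning strength. Every extra leaf costs α, so a larger α prunes deep subtrees back unless they reduce impurity enough to pay for themselves, which limits depth indirectly without setting max_depth.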

desert oar
desert oar
#

ohh i see, this is for pruning individual trees?

#

so you'd set a generous max depth and then optimize over the pruning parameter instead?

past meteor
#

I'd roll with the default max_depth, which is very high afaik and then prune

#

That's what the docs also want you to do afaik, it says something about this in the user guide for all trees

desert oar
#

cool, i think that guide was probably added after i'd already ossified in my mind that tuning max depth was the way to go

#

this is what i get for not actually ever reading breiman's book. thanks, i learned something new!

#

hm... i think the docs only talk about pruning of individual decision trees, which i'm definitely familiar with. but i never heard of doing it in an ensemble like that

past meteor
#

Np. Generally I don't like tuning more than 2 hyperparameters unless it's a neural network, so for each model I use I only touch the ones that give the most bang for the buck

desert oar
#

agreed on that

past meteor
#

There's some other tricks for random forest like using the OOB score instead of cross validation etc.

#

But that I'm sure you know

desert oar
#

yeah, i totally forgot to mention OOB

#

although i've always been a little nervous about it

#

i've used it, but i was always mildly skeptical even though it's supposed to be fine

past meteor
#

I barely use it because I'm usually trying 10 models with the same generic code

desert oar
#

yeah, that's part of it. even the thing i showed with n_estimators involves a whole extra training setup

hollow sentinel
#
import json
import pandas as pd
import plotly.express as px

print(px.data.carshare())

def plot_choropleth_from_df_and_geojson(df, geojson_path, show_empty_zips=True, zip_range=[]):
    with open(geojson_path, 'r') as file:
        zipcodes = json.load(file)

    df['zip_code'] = df['zip_code'].astype(str)

    for val in zipcodes:
      print(f"{val['zip_code']} is in {val['state']}")

    color_scale = px.colors.sequential.Plasma

    # Create the choropleth map
    fig = px.choropleth(
        df,
        geojson=zipcodes,
        locations="zip_code",  # Change this to match the column name in df
        featureidkey="properties.zip_code",  # Update to match the geojson properties
        color="zip_count",
        hover_name="zip_code",  # Change this to match the column name in df
        hover_data=["zip_count"],
        scope="usa",
        color_continuous_scale=color_scale,
        title="ZIP Code Choropleth based on Count"
    )
    fig.update_geos(fitbounds="locations", visible=False)
    fig.show()

df = pd.read_csv('zip_codes.csv')
print(df.columns)
print(df.head(5))
geojson_path = "/content/USCities.json"  # Update this path to the location of your GeoJSON file
plot_choropleth_from_df_and_geojson(df, geojson_path, show_empty_zips=True)
#

still stuck with figuring out this choropleth map

#

i need help with understanding how to parse the json data into a pandas dataframe
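
One way to get from the JSON to something pandas can take is to flatten it into one dict per feature first. This is only a sketch: it assumes the file is a standard GeoJSON FeatureCollection whose features carry `zip_code` and `state` under `properties` (as the `featureidkey` above suggests), and the sample data and the `geojson_to_rows` helper are made up for illustration.

```python
import json


def geojson_to_rows(geojson):
    """Flatten a GeoJSON FeatureCollection into one dict per feature."""
    rows = []
    for feature in geojson.get("features", []):
        # Copy the properties so the original structure is left untouched.
        row = dict(feature.get("properties") or {})
        geometry = feature.get("geometry") or {}
        row["geometry_type"] = geometry.get("type")
        rows.append(row)
    return rows


# Tiny made-up FeatureCollection standing in for the real file.
SAMPLE_TEXT = """
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"zip_code": "10001", "state": "NY"},
     "geometry": {"type": "Polygon", "coordinates": []}},
    {"type": "Feature",
     "properties": {"zip_code": "94103", "state": "CA"},
     "geometry": {"type": "Polygon", "coordinates": []}}
  ]
}
"""

rows = geojson_to_rows(json.loads(SAMPLE_TEXT))
for row in rows:
    print(f"{row['zip_code']} is in {row['state']}")
```

With the real file you would pass `json.load(file)` instead of the sample, and `pd.DataFrame(rows)` then gives a dataframe with one row per feature. If the file is not a FeatureCollection (for example a plain list of records), the loop needs adjusting.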

cerulean kayak
# past meteor I'd say for random forest the most important parameter to optimise for is cost c...

here's the thing. I've never seen a tutorial that's used this hyperparameter, but to be fair all the tutorials, except two, I've seen have been for very finite datasets (meaning they weren't as computationally expensive as my situation is). Heck this is the first time I'm seeing this hyperparameter.
Also I'd have no idea of where to start looking for values within that hyperparameter, especially since it's a float.

desert oar
mild dirge
#

and over

desert oar
#

read a couple tutorials? write your own, post it on TDS, claim to be an educator on your linkedin

#

so i wouldn't take absence of any technique from generic tutorials as strong evidence against the usefulness of that technique

cerulean kayak
#

well so there's also an implied principle of "I should only try solving problems that there are resources for".
If I am having such a hard time tuning the hyperparameter for which there are lots of tutorials on how to tune, it would be even harder to tune hyperparameters that don't have tutorials on how to do so online.
and yes. I know just because teachers exist, doesn't mean they're good. I go to college.

past meteor
#
#

From my experience the "tutorials" that cover sklearn are from people that have never read the documentation. It's quite well done so I suggest you read it once or twice front to back, it takes 2 days tops

cerulean kayak
timid dune
#

is there a way to export discord chat data? 😅

echo mesa
#

@past meteor Would you mind answering this? Thanks 🙂

past meteor
#

Introduction to statistical learning, there's a chapter on it

past meteor
left tartan
#

Arguably you might retain the information better by spreading the learning out over a longer time and reinforcing it.

hollow cloud
#

Sharing some code I use to trade the markets

serene scaffold
arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

serene scaffold
#

Be sure to never ask people to read screenshots of text.

pallid urchin
#

!code

arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

serene scaffold
# pallid urchin !code

the !code command brings up a message from the bot. Be sure to follow the instructions in the message.

vestal spruce
#

Is speaker diarization and speaker turn detection the same thing? I just need to split an hour long audio based on a change of speaker without the need to identify them.

shut girder
#

Hello, I am a beginner to statistics and interpreting graphs like histograms. I can't really tell if this graph is skewed or not. The y axis is the amount and the x axis is the age.

wooden sail
#

the main idea behind skewness is checking whether more of the area of a pdf (or here, histogram) is to the left, the right, or if it's centered

#

at least at a glance, this data looks skewed to the right*, since skewness describes which side the tail is on
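
A quick numeric check of the same idea, as a sketch: the ages below are made up (not the data from the histogram), and `sample_skewness` is a hypothetical helper computing the usual moment-based skewness. A positive value means the tail is on the right, and in that case the mean is typically pulled above the median.

```python
import statistics


def sample_skewness(values):
    """Moment-based skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


# Made-up ages with a long tail towards older values.
ages = [21, 22, 22, 23, 23, 24, 25, 26, 28, 31, 35, 42, 58, 70]

mean_age = statistics.fmean(ages)
median_age = statistics.median(ages)
skew = sample_skewness(ages)

print(f"mean={mean_age:.1f} median={median_age} skewness={skew:.2f}")
```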

shut girder
#

I see, thanks. So if I wanted the typical value of this data, should I calculate the median instead of the mean since it seems to be skewed?

wooden sail
#

depends what you mean by "typical value" 😛

shut girder
vestal spruce
#

is my explanation clear to you? I hope so. ¯_(ツ)_/¯

abstract wasp
limber mesa
blazing oxide
# abstract wasp Hi, can you explain to me what you mean with that last part "So, you might need ...

Sure, I'd be happy to clarify!

When you're using a tf.data.Dataset object with the fit method in Keras, the dataset is expected to yield a tuple of two elements. The first element is the input data and the second element is the target data. In other words, it should yield (input_batch, target_batch) pairs.

In an autoencoder, the target data is the same as the input data because the model is trying to reconstruct its input. So, you need your dataset to yield pairs where the input batch and target batch are the same.

Here's how you can modify your dataset creation:

x_train = x_train.map(lambda x: (x['image']/255, x['image']/255))
x_test = x_test.map(lambda x: (x['image']/255, x['image']/255))

Regarding the error you're seeing, it seems like your model is receiving a dictionary as input ({'image': 'tf.Tensor(shape=(218, 178, 3), dtype=float32)'}), but it's expecting a tensor. The modification above should also fix this issue because it will directly pass the tensor to your model.

I hope this helps! Let me know if you have any more questions. 😊

quick mason
#

Currently getting
```
Traceback (most recent call last):
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\utils.py", line 85, in _init_ffmpeg
    _load_lib("libtorchaudio_ffmpeg")
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\utils.py", line 61, in _load_lib
    torch.ops.load_library(path)
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torch\_ops.py", line 643, in load_library
    ctypes.CDLL(path)
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\ctypes\__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\Lib\site-packages\torchaudio\lib\libtorchaudio_ffmpeg.pyd' (or one of its dependencies). Try using the full path with constructor syntax.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\__init__.py", line 67, in <module>
    _init_ffmpeg()
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\utils.py", line 87, in _init_ffmpeg
    raise ImportError("FFmpeg libraries are not found. Please install FFmpeg.") from err
ImportError: FFmpeg libraries are not found. Please install FFmpeg.
```
Returned when I attempt training with this.
https://github.com/CBeast25/Applio-RVC-Fork/blob/main/lib/infer/modules/train/train.py

which doesn't happen with this https://github.com/CBeast25/Applio-RVC-Fork/blob/main/lib/infer/modules/train/train_new.py

Any ideas?

vestal spruce
vestal spruce
#

Hol up

vestal spruce
# quick mason Yup

idk if that's the actual ffmpeg or the python package for ffmpeg, but I have to assume it's the former. so how about the python package, do you also have it installed?

vestal spruce
#

ok lemme try it

vestal spruce
quick mason
#

lmao that's stupid of me lets hope that works

#

damn no dice

vestal spruce
#

well that sucks

vestal spruce
# quick mason damn no dice

I don't suppose the problem would be due to pathing?
I quoted this part of the error raised:

FileNotFoundError: Could not find module 'C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\Lib\site-packages\torchaudio\lib\libtorchaudio_ffmpeg.pyd' (or one of its dependencies). Try using the full path with constructor syntax.
quick mason
#

I don't think so. Might be the version of ffmpeg I have.

outer tapir
#

i have to train an object detection dataset on yolo model
i have downloaded a dataset called indian vehicle dataset, it has 5 classes of trucks, cars, tempos, tractor, tractor, each has over 130 images, i also have annotation folder for the same classes provided in xml format which looks like following

<annotation>
<folder>20210427</folder>
<filename>20210427_12_52_47_000_pgmtGnyv84hd7BiNoFr7ALIRzeZ2_F_4160_3120.jpg</filename>
<source>
<database>Unknown</database>
<annotation>Unknown</annotation>
<image>Unknown</image>
</source>
<size>
<width>3120</width>
<height>4160</height>
<depth/>
</size>
<segmented>0</segmented>
<object>
<name>tempo</name>
<truncated>0</truncated>
<occluded>0</occluded>
<difficult>0</difficult>
<bndbox>
<xmin>7.5</xmin>
<ymin>1089.34</ymin>
<xmax>522.95</xmax>
<ymax>2326.43</ymax>
</bndbox>
<attributes>
<attribute>
<name>rotation</name>
<value>0.0</value>
</attribute>
</attributes>
</object>
<object>
<name>tempo</name>
<truncated>0</truncated>
<occluded>0</occluded>
<difficult>0</difficult>
<bndbox>
<xmin>383.47</xmin>
<ymin>622.4</ymin>
<xmax>2645.4000000000005</xmax>
<ymax>3151.1</ymax>
</bndbox>
<attributes>
<attribute>
<name>rotation</name>
<value>0.0</value>
</attribute>
</attributes>
</object>
</annotation>

so how do i convert it into the format yolo accepts? i am doing this type of thing for the first time

vestal spruce
quick mason
#

ffmpeg is on my path

#

Really don't want to install conda but it looks like that's what I'll have to try now

vestal spruce
#

I'm afraid so... sorry I couldn't be of any help :/

blazing oxide
# outer tapir i have to train an object detection dataset on yolo model i have downloaded a da...

To train a YOLO model, you need to convert your annotations into a format that the YOLO model can understand. The YOLO model expects a .txt file for each image in the dataset with the same name as the image file. Each line in this file represents one bounding box, in the format <object-class> <x_center> <y_center> <width> <height>, where:

  • <object-class> is an integer representing the index of the class of the object from 0 to (classes-1).
  • <x_center>, <y_center>, <width>, and <height> are all relative to the size of the image. They are in float format and range from 0.0 to 1.0.

Here's a Python code snippet that can help you convert your XML annotations to YOLO format:
https://paste.pythondiscord.com/G37Q

In this code, replace 'path_to_annotations' with the path to your XML annotations and 'path_to_yolo_annotations' with the path where you want to save your YOLO annotations. Also, replace 'path_to_images' with the path to your images.

Please note that this code assumes that your XML files have the same name as your images and are in the same directory. If this is not the case, you may need to adjust the code accordingly.
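
For readers who cannot open the paste, here is a minimal sketch of the conversion for a single annotation. It is not necessarily the same as the linked snippet. It uses only the standard library, reads the XML layout shown in the question, and assumes a class list (`CLASS_NAMES`) whose order you would replace with the one in your own training config.

```python
import xml.etree.ElementTree as ET

# Assumed class order (hypothetical); indices must match your YOLO config.
CLASS_NAMES = ["truck", "car", "tempo", "tractor"]


def voc_to_yolo_lines(xml_text, class_names=CLASS_NAMES):
    """Convert one VOC-style XML annotation into YOLO label lines."""
    root = ET.fromstring(xml_text)
    img_w = float(root.findtext("size/width"))
    img_h = float(root.findtext("size/height"))
    lines = []
    for obj in root.iter("object"):
        # Raises ValueError if the XML contains a class missing from the list.
        class_id = class_names.index(obj.findtext("name"))
        box = obj.find("bndbox")
        xmin = float(box.findtext("xmin"))
        ymin = float(box.findtext("ymin"))
        xmax = float(box.findtext("xmax"))
        ymax = float(box.findtext("ymax"))
        # YOLO wants the box centre and size, normalised by the image size.
        x_center = (xmin + xmax) / 2 / img_w
        y_center = (ymin + ymax) / 2 / img_h
        width = (xmax - xmin) / img_w
        height = (ymax - ymin) / img_h
        lines.append(
            f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"
        )
    return lines


# First object from the annotation in the question, trimmed for brevity.
SAMPLE_XML = """
<annotation>
  <size><width>3120</width><height>4160</height><depth/></size>
  <object>
    <name>tempo</name>
    <bndbox>
      <xmin>7.5</xmin><ymin>1089.34</ymin>
      <xmax>522.95</xmax><ymax>2326.43</ymax>
    </bndbox>
  </object>
</annotation>
"""

yolo_lines = voc_to_yolo_lines(SAMPLE_XML)
print("\n".join(yolo_lines))
```

To convert a whole folder you would read each `.xml` file, call the function, and write the returned lines to a `.txt` file with the same base name as the image.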

I hope this helps! Let me know if you have any more questions. 😊

pulsar lily
#

Hey everyone
I am new to ML and NN.
I just tried to learn pytorch with some datasets from kaggle but I quickly ran into a problem.

I have a Dataframe with different data types in it. Boolean, Strings and Floats
Most of them are relevant so now I'm wondering how to handle this. Cause for converting it into a tensor it needs to be a Float or int or binary I think.
How would one handle a data input with mixed data?

And also, if I convert the boolean values to a BoolTensor. How do I merge them with the FloatTensors... Cause it should be still one input right?

meager ridge
#

hey anyone have any idea why it takes forever to build the wheel for Pandas?

serene scaffold
pulsar lily
rare osprey
#

I am an ai researcher. I know 10+ coding languages and around 7 frameworks. I know everything from cnns to rnns to transformers and large language models. I have made my own finetuned multi modal large language model and am researching about augmented lstms. Im fullstack too and im here to discuss about anything CS or math and looking for friends and connections. I own two startups currently and working on a lot of projects.

#

Dm me if you want to be a friend, collab on smth

serene scaffold
#

for strings that could be "anything" (like a person's name, or a description of an art piece), the question becomes even more complicated.
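
For the simpler case in the question (booleans, floats, and strings that take one of a few known values), a common approach is to cast booleans to 0/1 and one-hot encode the categorical strings, so everything ends up in a single float matrix. A minimal sketch in plain Python; the column names, the category levels, and the `encode_rows` helper are all made up for illustration:

```python
def encode_rows(rows, category_levels):
    """Turn dict rows with bool/float/categorical values into float lists."""
    encoded = []
    for row in rows:
        features = []
        for column, value in row.items():
            if column in category_levels:
                # One-hot: one 0/1 slot per known level of this column.
                features.extend(
                    1.0 if value == level else 0.0
                    for level in category_levels[column]
                )
            else:
                # bool becomes 0.0/1.0, numbers are kept as floats.
                features.append(float(value))
        encoded.append(features)
    return encoded


rows = [
    {"price": 12.5, "in_stock": True, "colour": "red"},
    {"price": 7.0, "in_stock": False, "colour": "blue"},
]
category_levels = {"colour": ["red", "green", "blue"]}

matrix = encode_rows(rows, category_levels)
print(matrix)
```

The resulting list of lists can be passed to `torch.tensor(matrix, dtype=torch.float32)` to get one FloatTensor, so there is no need to merge a BoolTensor with a FloatTensor afterwards.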

serene scaffold
pulsar lily
#

ok

#

thank you so much

#

I have one last question

serene scaffold
#

no you don't.

pulsar lily
#

I have an ID and I want to map the ID to my prediction.
This is no relevant data for the training part.
How would you handle this

pulsar lily
serene scaffold
#

I'm just being silly, if that wasn't clear.

pulsar lily
#

sure I caught that 😄

serene scaffold
#

anyway, you don't encode the ID. but if you feed a tensor with 10 rows into the network, then you'll get ten rows out. and the nth row of the output represents the nth row of the input.

#

(or, if you're working with more than two dimensions, whatever the leftmost dimension is)
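
A small sketch of that idea in plain Python. `predict_batch` is a stand-in for a real model (it just averages each row), and the records are made up. The point is only that the IDs are split off beforehand and zipped back on afterwards, which works as long as the rows are not shuffled in between.

```python
def predict_batch(feature_rows):
    """Stand-in for a model: returns one output per input row, in order."""
    return [sum(row) / len(row) for row in feature_rows]


records = [
    {"id": "a17", "features": [0.2, 0.4]},
    {"id": "b03", "features": [1.0, 3.0]},
    {"id": "c42", "features": [5.0, 5.0]},
]

# Split the IDs off before the model sees anything.
ids = [record["id"] for record in records]
features = [record["features"] for record in records]

predictions = predict_batch(features)

# Row n of the output belongs to row n of the input, so zip restores the link.
predictions_by_id = dict(zip(ids, predictions))
print(predictions_by_id)
```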

pulsar lily
#

emm ok...
I think I just keep coding and ask again when I get there... I think I won't understand it right now.
Is it ok if I ping you later?

serene scaffold
#

I check this channel regularly, but don't ping me since I have a variable schedule.

pulsar lily
#

ok

eager jasper
#

How can I make a simple basic

#

With no modules

#

Only built in modules

#

Is there a way

cunning agate
#

hello guys i want to create ai video

#

where i want to integrate my body in the video and voice

#

is there any free website or something like that

dire iron
#

Hello. I hope you are well. Could you tell me about your industry experience? How did you get into the professional world?

cunning agate
#

are there any apps or websites to modify and add things to a video with ai?

eager jasper
#

Hey

odd meteor
mighty bridge
#

What is you guys' overall opinion about tensorflow and pytorch? which is most used? is one easier than the other?

tidal bough
#

Recently TF dropped native GPU support on windows (2.10 was the last release that had it), which may be an issue for some people. Besides that, I think they are nearly equivalent these days.

mighty bridge
#

is there any reason why?

#

which OSes do they still support?

tidal bough
#

there's probably a github issue explaining why or something.

#

they only support GPU on windows via WSL2 now. and unix systems are still supported.

mighty bridge
#

I didn't even know what WSL was until now lol

winter drift
#

Hello I built a level in unreal engine and want to do behavioral cloning can anyone point me in the correct direction

#

I have a little underwater level with a pilotable submarine, and I want to teach a simple model to drive the submarine and perhaps go to targets.

fierce kiln
#

Hello everyone. I was required to write a python script to deduce height of a medical object. Is there any sort of computer vision model to do that directly? Unfortunately, I couldn't find any in context of my use case.
Therefore I came up with this idea: segment the area of interest and draw a bounding box around the segmented polygon. Calculate the height and width of the bounding box with simple geometry, which gives both in pixels. Then convert pixels into the required scale if the focal length is constant, or use something like an ArUco marker for relative measurement.

However, I don't know how reliable this thing will turn out to be. Is there a better way? Or a vision model that I don't know of?

vestal spruce
fierce kiln
#

Thank you for your interest. It is a segment of an OCT image. Let's assume the angle is always perpendicular and the relative distance always remains constant.

vestal spruce
#

And one more thing: it can help if there's another object as a reference in every image, like a piece of A4 paper behind your main object. You can then sample the paper's pixel-to-real-world size ratio and use it to convert the main object's pixel size to real-world size. This is just one way to do it; you might find other methods easier to implement or understand, so I'd suggest taking a look on Google Scholar.
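
A rough sketch of the bounding-box plus reference-object idea discussed above. It assumes the camera looks straight at the object and the reference lies in the same plane. All numbers are invented, the helper names are hypothetical, and the segmentation step that would produce the polygon is not shown.

```python
def bounding_box(polygon):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of (x, y) points."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def mm_per_pixel(reference_length_mm, reference_length_px):
    """Scale factor from a reference object of known physical size."""
    return reference_length_mm / reference_length_px


def object_size_mm(polygon, scale_mm_per_px):
    """Width and height of the polygon's bounding box in millimetres."""
    xmin, ymin, xmax, ymax = bounding_box(polygon)
    return (xmax - xmin) * scale_mm_per_px, (ymax - ymin) * scale_mm_per_px


# Hypothetical measurement: the 297 mm long edge of an A4 sheet spans 1188 px.
scale = mm_per_pixel(297.0, 1188.0)

# Hypothetical segmentation result, as pixel coordinates.
segment = [(100, 200), (340, 180), (360, 500), (120, 520)]

width_mm, height_mm = object_size_mm(segment, scale)
print(f"width={width_mm} mm, height={height_mm} mm")
```

If the scanner already reports a physical pixel spacing, that value can replace the reference-object scale directly.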

lapis sequoia
#

hello

#

I have trained a model on a dataset with imbalance using xgboost

#

it gives f1_score_weighted of 85%

#

per-label it gives f1 score [0.8855230715040509, 0.8542857142857143, 0.7451428571428571, 0.743142144638404, 0.853898561695685, 0.8672331386086033]

#

what else could be done in order to make better model

#

to increase the accuracy/f1_score

lapis sequoia
#

Should I use CatBoost instead or try to solve imbalance in dataset using smote or other methods

#

Maybe both

vestal spruce
#

another thing you can do is add weights to your categories: give minority categories a bigger weighting to balance things out, i.e. push the model to be more critical when dealing with rarely occurring data.
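
As a sketch of the weighting idea: the snippet below computes inverse-frequency weights using the same heuristic as scikit-learn's `class_weight="balanced"`, namely n_samples / (n_classes * class_count). The label list is made up and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Made-up, imbalanced label list.
labels = ["joy"] * 6 + ["sadness"] * 3 + ["surprise"] * 1

weights = balanced_class_weights(labels)
# One weight per training example, in the same order as the labels.
sample_weights = [weights[label] for label in labels]
print(weights)
```

The per-example list can then be passed as `sample_weight` when fitting, for example to `XGBClassifier.fit`, so that mistakes on rare classes count for more.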

#

lastly, proportional distribution, a.k.a. stratified splitting of the categories, might also affect its performance.

#

I take it you haven't already applied the ideas I mentioned, @lapis sequoia ?

lapis sequoia
#

dair-ai/emotion

#

It's basically a model for nlp classification

vestal spruce
lapis sequoia
#

No I've not applied the techniques you've mentioned

vestal spruce
#

Well you might want to try them out first because, from my past experience, these are pretty common ways to deal with an imbalanced dataset.

olive pecan
#

Are there any good courses on Data analytics and Big data using jupyter for data understanding, cleaning, preprocessing, modeling etc anywhere?

vestal spruce
#

I do apologize as I'm unfamiliar with CatBoost, perhaps it's way more streamlined than the basic ideas I mentioned

lapis sequoia
#

I'll do some fixing to tackle the imbalance

vestal spruce
vestal spruce
#

one thing is certain is that you should always try and experiment with different scenarios ¯_(ツ)_/¯

blazing oxide
lapis sequoia
vestal spruce
lapis sequoia
vestal spruce
lapis sequoia
#

No

#

When did I mention that ?

vestal spruce
#

I'm getting thrown off by such revelation

vestal spruce
lapis sequoia
#

Sorry man

vestal spruce
#

oh my goodness

#

it's alright, no harm done

lapis sequoia
#

Thanks man

vestal spruce
# lapis sequoia Thanks man

ok I'm not really versed in NLP, but I believe the ideas I mentioned are still applicable, aside from making varied duplicates, for obvious reasons.

lapis sequoia
#

Adding duplicate would not be a great idea

#

In the preprocessing
I literally removed the duplicates

vestal spruce
odd meteor
# lapis sequoia

Use TextAttack library (the same library used in performing adversarial text attack on NLP tasks) to perform data augmentation on the minority class.

That would most definitely improve the current model's performance.

vestal spruce
#

but don't think of them (duplicates) as an obstacle to your model, because oversampling and undersampling are basically just that: adding and removing duplicates as a means to balance the dataset.

lapis sequoia
#

It's a standard procedure in the nlp ig

clever lake
#

guys I've been looking for tutorials to create a chatbot (AI) but most of them are outdated (videos too old and don't work) could you recommend me something? I haven't found anything else.

My idea was to have a json (something simple) so: tags,patterns,responses

{
    "intents": [
        {
            "tag": "greetings",
            "patterns": [
                "hello",
                "hey",
                "hi"
            ],
            "responses": [
                "Hello",
                "hey!",
                "what can i do for you?"
            ]
        }
    ]
}
olive pecan
#

Are there any good courses on Data analytics and Big data using jupyter for data understanding, cleaning, preprocessing, modeling etc anywhere? I'm struggling at uni to understand the terms and techniques

blazing oxide
# clever lake guys I've been looking for tutorials to create a chatbot (AI) but most of them a...

Sure, I found some recent resources that might help you:

  1. "How to Build Chatbots | Complete AI Chatbot Tutorial for Beginners (https://www.youtube.com/watch?v=jCoH82LPgdk)" by Liam Ottley. This is a comprehensive video tutorial that covers everything you need to know to build custom no-code AI chatbots.

  2. "Learn how to create your own chatbot easily (https://www.youtube.com/watch?v=Pj00e6lq9Cg)" by TECH WITH SACH. This tutorial teaches you how to create your own chatbot quickly and easily using Dialogflow and different tools and features.

  3. "How to Build a ChatBot using the GPT-4 API – Full Project-Based Tutorial (https://www.freecodecamp.org/news/build-gpt-4-api-chatbot-turorial/)". This is a full project-based tutorial that teaches you how to build your own chatbot using the GPT-4 API.

  4. "A Complete Guide on How to Build a Chatbot (Easy to Hard) (https://www.g2.com/articles/how-to-build-a-chatbot)". This guide provides different ways to build a chatbot, each requiring a varying level of technical skills.

  5. "ChatterBot: Build a Chatbot With Python – Real Python (https://realpython.com/build-a-chatbot-python-chatterbot/)". This tutorial guides you through creating a chatbot using Python ChatterBot.

As for your idea of using a JSON structure for intents, patterns, and responses, it's a common approach in rule-based chatbots. However, for more advanced AI chatbots, you might need to use more sophisticated techniques such as machine learning and natural language processing. I hope this helps! 😊
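
To make the rule-based approach concrete, here is a minimal sketch that loads an intents structure like the one above and replies by whole-word matching. It uses only the standard library. The `find_intent` and `reply` helpers and the fallback text are made up, and it only handles single-word patterns.

```python
import json
import random
import re

INTENTS_JSON = """
{
    "intents": [
        {
            "tag": "greetings",
            "patterns": ["hello", "hey", "hi"],
            "responses": ["Hello", "hey!", "what can i do for you?"]
        }
    ]
}
"""


def find_intent(message, intents):
    """Return the first intent with a pattern matching a whole word."""
    words = set(re.findall(r"[a-z']+", message.lower()))
    for intent in intents:
        if any(pattern.lower() in words for pattern in intent["patterns"]):
            return intent
    return None


def reply(message, intents, fallback="Sorry, I didn't understand that."):
    """Pick a random response for the matched intent, or the fallback."""
    intent = find_intent(message, intents)
    if intent is None:
        return fallback
    return random.choice(intent["responses"])


intents = json.loads(INTENTS_JSON)["intents"]
print(reply("Hey there!", intents))
```

Matching on whole words avoids false hits such as "hi" inside "this". Multi-word patterns or anything fuzzier would need a different matching step, which is where the ML-based approaches come in.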

#

Finally I got the voice verified role! 😁

timid dune
#

is it complicated to integrate your PyTorch model into a frontend application?

clever lake
blazing oxide
# timid dune is it complicated to integrate your PyTorch model into an frontend application?

Hey there! 😊 Sure, integrating a PyTorch model into a frontend application can be a bit tricky, but there are several ways to do it:

  1. Flask: You can use Flask to deploy your PyTorch model and create a REST API for your model. This is a common way to integrate ML models into web apps.

  2. PyTorch C++ Frontend: If you're working with C++, you can use the PyTorch C++ frontend. It provides a pure C++ interface to PyTorch.

  3. Flutter: For mobile apps, you can use PyTorch Mobile with Flutter. This lets you run your PyTorch model directly on a mobile device.

  4. Windows ML API: If you're developing a Windows app, you can use the Windows ML API to run your PyTorch model after exporting it to ONNX.

Remember, the complexity can vary depending on your model and application. It's always a good idea to start with a simple prototype and add complexity as needed. Hope this helps! 😊

timid dune
blazing oxide
# clever lake I will take a look at all of them, but using chatterbot doesn't appeal to me, I ...

Sure thing! 😊 Building your own chatbot from scratch can be a great learning experience and also an amazing challenge. Here are some tips for working with intents:

  1. Define the Purpose: Start by figuring out what you want your chatbot to do. This will help you determine the types of intents your chatbot needs to handle.

  2. Identify Common Questions: Consider the types of questions your chatbot is likely to get. These can guide you in creating your intents.

  3. Create Sample Conversations: Try to come up with possible scenarios or conversations your chatbot might have. This can help you understand how to structure your intents and responses.

  4. Annotation: Mark up your training data to identify intents and any entities. This is crucial for training your chatbot to understand user inputs.

  5. Continuous Training: Train your chatbot with the marked-up data, and keep fine-tuning and adjusting the chatbot’s responses based on feedback.

  6. Add and Refine Intents: As your chatbot interacts with users, you'll likely discover new intents that you hadn't thought of. Add these to your chatbot and refine existing ones based on user interactions.

Remember, building a chatbot from scratch can be complex, especially if you want it to understand natural language. You might need to learn about Natural Language Processing (NLP) techniques, which can help your chatbot understand and interpret user intents in real-time.

Hope this helps! 😊

blazing oxide
# timid dune can I use graphql or whatever? my app is built with react

Absolutely, you can use GraphQL with your React application and PyTorch model! Here's a general approach:

  1. Create a Backend Service: This service will host your PyTorch model and expose an API endpoint for making predictions.

  2. GraphQL Server: Set up a GraphQL server that communicates with your backend service. This server will receive requests from your frontend, forward them to the backend service, and return the results.

  3. Apollo Client: In your React application, use the Apollo Client to communicate with your GraphQL server. Apollo Client is a comprehensive state management library for JavaScript that enables you to manage both local and remote data with GraphQL.

  4. Fetch Data: Use the useQuery hook provided by Apollo Client to fetch data from your GraphQL server. This data can then be displayed in your React components.

  5. Update Data: If your application allows for updating data (like re-training your model), you can use the useMutation hook.

Remember, integrating these technologies can be complex and requires a good understanding of each component. But with some patience and practice, it's definitely achievable. Good luck! 😊

clever lake
#

of course I searched but they were all very old (2 to 4 years ago) many libraries have changed and do not work as they used to which makes them useless

#

gives an error when creating the array (line 60) and I have no idea how to fix it

Traceback (most recent call last):
  File "/home/saohy/documents/bots/mei-chan-bot/g.ignore/chatbot/training.py", line 60, in <module>
    training = np.array(training)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (5, 2) + inhomogeneous part
blazing oxide
# clever lake thank you for your help, I noticed that the various videos or tutorials offered ...

Absolutely👌, here are some resources that focus on intent-based chatbot development:

  1. "Chatbot Development Tutorial: Introduction of Intent, Stories, Actions in Rasa (https://www.pragnakalp.com/chatbot-development-tutorial-rasa/)": This tutorial provides a comprehensive guide on how to use Rasa, an open-source machine learning framework for building chatbots. It covers the concepts of intents, stories, and actions in detail.

  2. "Creating Chatbots on Google Cloud (https://developers.google.com/learn/topics/chatbots)": This resource offers step-by-step guides to learn about Dialogflow and the corresponding Google Cloud services, which facilitate building chatbots. Dialogflow is a powerful tool for intent recognition and handling.

  3. "How to Build Your AI Chatbot with NLP in Python? (https://www.analyticsvidhya.com/blog/2021/10/complete-guide-to-build-your-ai-chatbot-with-nlp-in-python)": This guide provides a complete walkthrough on how to build an AI chatbot with Natural Language Processing (NLP) in Python. It covers various aspects of NLP, including intent recognition and handling.

Remember, building an intent-based chatbot can be complex and requires a good understanding of NLP and machine learning concepts. But with some patience and practice, it's definitely achievable. Good luck with your studies! 😊

Also, for the error you've encountered in your code, I think I know where the problem is, but I'll tell you about it in another message because I don't have Nitro🥲.

blazing oxide
# clever lake gives an error when creating the array (line 60) and I have no idea how to fix i...

The error you're encountering, ValueError: setting an array element with a sequence, typically occurs when you're trying to create a NumPy array from a list of sequences that aren't all the same length. In your case, it seems like the issue is with the training variable.

Here's a possible solution:

Instead of using np.array(training), you can convert training into an array of objects:

training = np.array(training, dtype=object)

This tells NumPy to create an array of Python objects, which can be of varying lengths or types.
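
If the goal is a regular numeric array rather than an array of Python lists, one alternative sketch is to split the features and labels into two separate arrays. This assumes `training` is a list of `[bag, output_row]` pairs, as the `(5, 2)` shape in the error message suggests; the data below is made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for `training`: each row is [bag, output_row],
# and the two parts have different lengths (3 vs 2), hence the ragged shape.
training = [
    [[1, 0, 1], [1, 0]],
    [[0, 1, 1], [0, 1]],
    [[1, 1, 0], [1, 0]],
]

# Build one rectangular array per part instead of one ragged array.
train_x = np.array([bag for bag, _ in training])
train_y = np.array([output_row for _, output_row in training])

print(train_x.shape, train_y.shape)
```

Both arrays then have a plain numeric dtype, which is usually what a model's fit call expects.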

This is what I can do for now, I hope I've been helpful! 😊

#

As for recent resources on chatbot development, here are some that might be helpful:

  1. "Future directions for chatbot research: an interdisciplinary research agenda (https://link.springer.com/article/10.1007/s00607-021-01016-7)": This article provides a research agenda for chatbot development, discussing future directions and challenges in the field.

  2. "7 Chatbot Trends for 2022 & Beyond (https://manychat.com/blog/chatbot-trends/)": This article discusses the latest trends in chatbot development, which could be useful for your project.

  3. "How To Develop a Chatbot From Scratch (https://chatbotsmagazine.com/how-to-develop-a-chatbot-from-scratch-62bed1adab8c)": This guide provides a comprehensive walkthrough on how to build a chatbot from scratch.

  4. "9 Major Chatbot Trends for 2021 (https://botcore.ai/blog/chatbot-trends-2021/)": Although this article is from 2021, it discusses some major trends in chatbot development that are still relevant today.

Good luck with your project! (Also, if you need any more help, feel free to ask, in DMs too) 😊

#

Now Igtg, cya👋 (I'll be back in 1 hour)

midnight harbor
#

I am new at this and I want to know how to use LangChain to create a simple application.
I was using the openai library, which was fairly easy, but for a scalable project I think LangChain would be great. BUT I find LangChain very confusing; for example, I can't find the
role: system context in LangChain,
i.e. how to pass a system prompt that tells the model what it is and so on,

and then the user/assistant chat starts, right?
I need to find this concept in LangChain.

I have found that with ConversationBufferMemory and the summarizer buffer I can manage a lot of stuff, but what I want right now is to give system context to the LLM. Kindly help.

blazing oxide
blazing oxide
# midnight harbor I am new at this and i wanna know how to use langchain to create simple applicat...

Hey there!👋 It's great to hear that you're diving into data science and AI. Langchain is indeed a powerful tool for building scalable AI applications.

To provide system context in Langchain, you can use the ConversationBufferMemory as you mentioned. This is where you can store the system context or any other information that you want to persist across different turns of the conversation.

Here's a simple example:

from langchain import ConversationBufferMemory

# Initialize the memory buffer
memory = ConversationBufferMemory()

# Add system context
memory.add_system_context("I am an AI developed by Langchain.")

# Now you can pass this memory to the LLM
llm = LangchainLLM(memory=memory)

In this example, the system context "I am an AI developed by Langchain." is stored in the memory and can be accessed by the LLM during the conversation.

Remember, the system context is just like any other piece of information in the conversation. It's up to you how you want to use it in your application.

I hope this helps! If you have any more questions, feel free to ask. Happy coding! 🚀

midnight harbor
blazing oxide
# midnight harbor AttributeError: 'ConversationBufferMemory' object has no attribute 'add_system_c...

I apologize for the error I made😓 . It seems there was a misunderstanding in my previous message. The ConversationBufferMemory class does not have a method called add_system_context.

In Langchain, the system context is typically set at the beginning of the conversation and is passed to the model as part of the conversation history. Here's a simplified example:

from langchain import ConversationBufferMemory, LangchainLLM

# Initialize the memory buffer
memory = ConversationBufferMemory()

# Add system context
memory.add_message("system", "I am an AI developed by Langchain.")

# Now you can pass this memory to the LLM
llm = LangchainLLM(memory=memory)

In this example, the system context "I am an AI developed by Langchain." is added as a system message in the conversation history. The LLM can then use this context during the conversation.

I hope this clears up the confusion. Let me know if you have any other questions! Happy coding! 🚀

midnight harbor
blazing oxide
# midnight harbor AttributeError: 'ConversationBufferMemory' object has no attribute 'add_message'

Oh, I'm really disappointed in myself... I don't know why these work on my machine. Try this last version; I'm really sorry for the confusion:

from langchain import ConversationBufferMemory, LangchainLLM

# Initialize the memory buffer
memory = ConversationBufferMemory()

# Add system context
memory.add_turn({"role": "system", "content": "I am an AI developed by Langchain."})

# Now you can pass this memory to the LLM
llm = LangchainLLM(memory=memory)
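
For what it's worth, the underlying idea is library-independent: the system prompt is simply the first message in the list sent to the chat model, followed by the alternating user/assistant turns. Below is a minimal sketch in plain Python using the OpenAI-style role/content dicts mentioned in the question. `build_messages` and the example strings are hypothetical names for illustration, not LangChain API. LangChain wraps the same roles in its own message and prompt classes, whose exact names depend on the installed version, so the library docs are the place to confirm them.

```python
def build_messages(system_prompt, history, user_input):
    """Return the full message list for one chat-model call.

    history is a list of earlier {"role": ..., "content": ...} dicts
    (user and assistant turns), oldest first.
    """
    return (
        [{"role": "system", "content": system_prompt}]  # system context goes first
        + list(history)                                   # then the prior turns
        + [{"role": "user", "content": user_input}]       # then the new user turn
    )


# Illustrative data only.
history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
messages = build_messages(
    "You are a helpful cooking assistant.", history, "Suggest a dinner."
)
```

A memory component then only has to maintain `history`; the system prompt stays fixed and is prepended on every call.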
odd meteor
long canopy
#

any tools or libraries of interest for voice recognition?

serene scaffold
arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied timeout to @blazing oxide until <t:1699109861:f> (1 day).

ancient cairn
#

how to transcribe audio to text arduino for chat gpt. then chat gpt replies in audio?

left tartan
desert oar
ancient cairn
#

and it replies to me as audio

#

i know what i need but i need code

#

i just need an idea

echo mesa
#

Hi Guys, I've been learning pre-calculus from james stewart: "Precalculus: Mathematics for Calculus".
here is a review from this guy:
https://www.youtube.com/watch?v=N-JXs_n2JhI

I'm wondering whether I should go through all of it; the guy said that there is much more math in this book than in a college course. My main goal is to learn calculus for machine learning, so I wonder whether it's worth spending years just to get through pre-calculus as a 16-year-old. Or how should I do it?

left tartan
echo mesa
left tartan
# echo mesa yeah all of the content

You're still in high school right? Maybe ask the pre-calculus teacher for a copy of their syllabus, so you can see what information is taught?

echo mesa
#

This is 5th year, which is the book that we are going thru atm

#

This is the next version of that book

#

This is the last one's table of contents

#

This is the first book(the green) one's content

#

This is literally the math that is required in leaving cert

#

I don't know if the content is enough for pre-calc

#

I know the table of contents might not be enough for you to assume, but what do you think?

echo mesa
#

or perhaps I can take one of these courses? https://www.khanacademy.org/math/precalculus

Khan Academy

The Precalculus course covers complex numbers; composite functions; trigonometric functions; vectors; matrices; conic sections; and probability and combinatorics. It also has two optional units on series and limits and continuity. Khan Academy's Precalculus course is built to deliver a comprehensive, illuminating, engaging, and Common Core align...

#

Because I've seen a lot of people recommending khan academy for maths for machine learning

primal iris
#

can someone help pls

past meteor
primal iris
#

but i don't think they would be the best for multilabel as it might have values that have more than one label

echo mesa
past meteor
echo mesa
past meteor
#

It looks similar to what I had

echo mesa
# past meteor It looks similar to what I had

Does it? I asked in the Mathematics server and they said that I should go with them because, judging by the table of contents, they are very good books, and since I'm learning this in school it's even better

acoustic prawn
#

hi, I'm not sure this is the right channel. I'm working on a small project utilizing some image tag/recognition models, gpt and flask and quite inexperienced

loading the models takes a few minutes. everytime I make a change in some file, flask is reloaded (which is good) and with it the models.

currently, I'm calling load_Models() directly before run app line.
what is common practice to keep such models "loaded" and only reload the rest?

serene scaffold
#

It's not just that "flask" is reloading. It's the whole python process.

acoustic prawn
#

oh no, that does not sound good...
so I'd have two flask projects/servers running, communicating with each other?

desert oar
#

iirc i was about 16 when i did pre-calc in school. do you not have access to a pre-calc course there?

#

i believe the sequence was something like trigonometry and basic proofs in 10th grade, then logarithms, derivatives, and a cursory look at antiderivatives + integrals in 11th

#

we probably did other things in 10th too but i can't remember what they were

#

what i do remember is that we took it pretty slow

#

there was no rushing through the material. i had very good teachers overall, and we spent a while building things up carefully and developing intuition

#

for example, i remember one day the teacher had us step through angles and measure the sides of the resulting triangles inscribed in a unit circle

#

and then she had us plot the ratios, and magically the sine and cosine functions appeared on our graph paper

#

i loved that, it was the first time in math i felt like there was actually a purpose to what we were doing and things started coming together

#

likewise the next year, the teacher gave everyone some construction paper and tape and assigned us all various dimensions of boxes to build. then we computed the volumes of all our boxes and showed that there was a unique optimum volume, which i believe we then derived. that was also a fun and memorable demo

#

the point being: it's worth taking your time with these things. they will form the foundation for everything to follow

echo mesa
#

would you think that that would be enough?

wooden sail
#

precalc is basically algebra and trig, both of which you cover there. differential and integral calculus are proper calculus, so what you learn after precalc

echo mesa
wooden sail
#

that already includes calculus lol

echo mesa
#

so I guess it should be enough then 😄

wooden sail
#

yep

tame path
#

gyz can some one guide me , about learning python roadmap

fading scaffold
#

do you guys know the best course to emphasize numpy and pandas?

serene scaffold
serene scaffold
#

I don't think so.

#

!resources data science

arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

serene scaffold
#

It's one of the items here ^

fading scaffold
left tartan
#

I think a lot of people don't connect sql or pandas operations with the equivalent; how you'd do it in a simple Python loop. When I teach people SQL (or Pandas), I express things in terms of the primitives (how you'd do the same thing in code).
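
A minimal sketch of that idea, with made-up rows: the same aggregation written as SQL, as pandas (both only in comments), and as the plain Python loop they boil down to. The column names are hypothetical.

```python
from collections import defaultdict

# Illustrative data only.
rows = [
    {"city": "Oslo", "sales": 10},
    {"city": "Rome", "sales": 5},
    {"city": "Oslo", "sales": 7},
]

# SQL:    SELECT city, SUM(sales) FROM rows GROUP BY city
# pandas: df.groupby("city")["sales"].sum()
totals = defaultdict(int)
for row in rows:
    # GROUP BY picks the bucket, SUM accumulates into it.
    totals[row["city"]] += row["sales"]

print(dict(totals))
```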

hushed crater
#

Hey guys, I'm trying to color my points on this regplot by their y value. for example i want all where dropout == 1 to be orange and where 0 to be blue. but I can't seem to figure out how

#
```py
g = sns.JointGrid(height=6, space=0)
x, y = df["Application_order"], df["Dropout"]
sns.regplot(x=x, y=y, scatter_kws=dict(s=1), x_jitter=.3, y_jitter=.125, logistic=True, truncate=False, ax=g.ax_joint)
sns.histplot(x=x, discrete=True, stat='frequency', ax=g.ax_marg_x)
g.ax_marg_y.remove()
g.ax_marg_x.set_facecolor('white')

g.ax_joint.set_xlabel('Application Order')
plt.title('Dropout Likelihood Across Application Orders')```
#

Any help? I can manage to make it work with an lmplot but then I don't get my marginal histogram which i really want to keep...

left tartan
# hushed crater

Draw the scatter directly, and set "scatter=False" in the regplot. I think you'll have to add the jitter manually to the df, like: df_x_jittered = df["Application_order"] + np.random.uniform(-0.3, 0.3, size=len(df))

#

Something like: ```py
jittered_x = df["Application_order"] + np.random.uniform(-0.3, 0.3, size=len(df))
jittered_y = df["Dropout"] + np.random.uniform(-0.125, 0.125, size=len(df))
sns.scatterplot(x=jittered_x, y=jittered_y, hue=df["Dropout"] > 0, palette=['blue', 'orange'], legend=False, ax=g.ax_joint, s=1)
```

hushed crater
#

I'm guessing I should just draw the line manually instead of using regplot?

left tartan
#

That’s weird, I did a test and my regplot worked

hushed crater
#
```py
g = sns.JointGrid(height=6, space=0)
x, y = df["Application_order"], df["Dropout"]
sns.regplot(x=x, y=y, scatter_kws=dict(s=1), scatter=False, x_jitter=.3, y_jitter=.125, logistic=True, truncate=False, ax=g.ax_joint)
jittered_x = df["Application_order"] + np.random.uniform(-0.3, 0.3, size=len(df))
jittered_y = df["Dropout"] + np.random.uniform(-0.125, 0.125, size=len(df))
sns.scatterplot(x=jittered_x, y=jittered_y, hue=df["Dropout"] > 0, palette=['blue', 'orange'], legend=False, ax=g.ax_joint, s=1)
sns.histplot(x=x, discrete=True, stat='frequency', ax=g.ax_marg_x)
g.ax_marg_y.remove()
g.ax_marg_x.set_facecolor('white')

g.ax_joint.set_xlabel('Application Order')
plt.title('Dropout Likelihood Across Application Orders')```
left tartan
#

Try regplot -after- scatter?

hushed crater
#

Working!

#

thanks a bunch

past meteor
#

I always tell them the SQL idioms mostly carry over and I have a bunch of SQL queries and a toy dataset I have where I show the equivalent Pandas code

abstract wasp
#

Hi, I don't know why I get module_wrapper when I grab the summary of this code:
```py
model = Sequential()
vgg16 = tf.keras.applications.vgg16.VGG16(
    include_top=False,
    input_shape=(224, 224, 3),
    pooling='avg',
    classes=91,
    weights='imagenet'
)

for layer in vgg16.layers:
    layer.trainable = False

model.add(vgg16)
model.add(normalizing)
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(91, activation='softmax'))
```

#

This is what shows up when I train:

lapis sequoia
#

So I'm getting some errors and can't find out why it isn't printing that last statement

```py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
import tensorflow as tf
from sklearn import preprocessing

df = pd.read_csv('Google_Stock_Price.csv')
df['Date'] = pd.to_datetime(df['Date'],format='mixed')
df['Time'] = df.apply(lambda row: len(df) - row.name, axis=1)
df['CloseFuture'] = df['Close'].shift(30)

df_test = df[:185]
df_train = df[185:]

X = np.array(df_train['Time'])
X = X.reshape(-1 ,1)
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)
y = np.array(df_train['CloseFuture'])

model = tf.keras.Sequential([
        tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(1,)),
        tf.keras.layers.Dense(10, activation='sigmoid'),
        tf.keras.layers.Dense(10, activation='relu'),
        tf.keras.layers.Dense(1)])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),loss='mse',metrics=['mae'])
model.fit(X_scaled, y, epochs = 30, batch_size = 10)
ennuste_train = model.predict(X_scaled)
df_train['Ennuste'] = ennuste_train

X_test = np.array(df_test['Time'])
X_test = X_test.reshape(-1,1)
X_testscaled = scaler.transform(X_test)
ennuste_test = model.predict(X_testscaled)
df_test['Ennuste'] = ennuste_test


plt.scatter(df['Date'].values, df['Close'].values, color='black')
plt.plot((df_train['Date'] + pd.DateOffset(days=30)).values, df_train['Ennuste'].values, color='blue')
plt.plot((df_test['Date'] + pd.DateOffset(days=30)).values, df_test['Ennuste'].values, color='red')
plt.show()

df_validation = df.test.dropna()

print("Predicted data average error %.f" % mean_absolute_error(df_validation['CloseFuture'], df_validation['Ennuste']))```
#

D:\Archives\Coder\Python\Roboticas\Tehtävä10.py:31: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['Ennuste'] = ennuste_train
6/6 [==============================] - 0s 601us/step
D:\Archives\Coder\Python\Roboticas\Tehtävä10.py:37: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_test['Ennuste'] = ennuste_test
Traceback (most recent call last):
  File "D:\Archives\Coder\Python\Roboticas\Tehtävä10.py", line 45, in <module>
    df_validation = df.test.dropna()
                    ^^^^^^^
  File "C:\Users\Jani\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\generic.py", line 6204, in __getattr__
    return object.__getattribute__(self, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DataFrame' object has no attribute 'test'

#

help appreciated..
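
For reference, the traceback points at `df.test`, which pandas reads as attribute access for a column named `test`; the variable defined earlier in the script is `df_test`. A minimal sketch of the corrected tail of the script, using made-up numbers and placeholder predictions instead of the real CSV and model. It also takes explicit `.copy()`s of the two slices, which is one common way to avoid the SettingWithCopyWarning shown above.

```python
import pandas as pd

# Illustrative data only; the real script reads Google_Stock_Price.csv.
df = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
df["CloseFuture"] = df["Close"].shift(2)  # first two rows become NaN

# .copy() makes each split an independent frame, so adding a column
# later does not assign into a slice of df.
df_test = df[:3].copy()
df_train = df[3:].copy()

# Placeholder values standing in for model.predict(...) output.
df_test["Ennuste"] = [10.5, 11.5, 12.5]

# df_test (underscore), not df.test
df_validation = df_test.dropna()
mae = (df_validation["CloseFuture"] - df_validation["Ennuste"]).abs().mean()
print("Predicted data average error %.1f" % mae)
```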

quaint gorge
#

is there an NLP package in python or something that categorizes sentences based on the content of the sentence?

brave sand
#

how would I implement q learning on a racing game I made?

lapis sequoia
stray void
#

hey so is this the place to ask about apis

#

i need an api for llm prompts where u can automate the responses u get from it in a way

left tartan
storm kelp
polar zodiac
#

hello,
I'm working on a project and there seem to be many null values in the dataset.
Would you advise me to go with fillna or dropna?
Also, if I use fillna and fill in average or random values, wouldn't it affect the dataset?

royal crest
#

you can choose to drop, or impute so long as you report your decision

#

typically any imputation method is effective when <5% of the data is missing, mean imputation drops out at around 10% missing, regression drops out at 15% but multiple imputation and ML imputation methods are more accurate even when larger proportions of the data are missing
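
A minimal sketch of the two basic options on a made-up frame, assuming pandas; the column names are hypothetical. Checking the share of missing values per column first gives a number to compare against the rough thresholds above.

```python
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({
    "age": [20.0, None, 40.0, 30.0],
    "city": ["A", "B", None, "A"],
})

# Fraction of missing values per column, to decide how to proceed.
missing_share = df.isna().mean()

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: simple imputation (mean for numeric, most frequent for categorical).
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(missing_share)
print(len(dropped), "rows left after dropna")
print(filled)
```

Mean imputation keeps the row count but shrinks the column's variance, which is one reason to report whichever choice was made.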

mild dirge
#

I have just been reading some papers to get some inspiration for a model I want to design. My task is to design a model that handles data of which a large part (~90%) is unlabeled, and my model should classify the labels. I read a paper (https://arxiv.org/abs/1911.04252) that starts by training a relatively small model on the 10% labeled data, and then classifying maybe 10% of the data that is unlabeled such that we have 10% true labels, and 10% predicted labels. Then a new model is instantiated that is slightly bigger than the first, and trained on the 20% labeled data. Which is used to label another 10%, so the next model has 30% labeled data. etc.

My problem is, why would this improve performance at all. In the end all labeled data is based on the information from the initial 10%, it doesn't learn anything "new" from the predicted labels does it?

#

The paper discusses a modification of the algorithm explained above, but I was wondering why the original idea works at all.

mild dirge
#

I guess the idea is called "self-training", but the confusion still remains. If the initial classifier predicts correctly, the data point probably doesn't bring much new information, whereas if it is predicted incorrectly, the newer iteration model will be learning incorrectly.

past meteor
#

Are you able to label post-hoc or not at all?

mild dirge
past meteor
#

What is canonically done is having some model output predictions and uncertainties and you label the ones with the highest uncertainties

#

This only makes sense if you can still label ofc

mild dirge
#

Highest certainties right?

past meteor
#

The ones you're most uncertain about

mild dirge
#

Or do you mean, manually labeling?

#

I thought you meant using the labels that the model was most uncertain about for the next iteration

#

I get it now

#

And I think that would be quite difficult

past meteor
#

Imagine you can still label the cats and dogs in your dataset but you don't want to do it upfront. Your model is also able to output uncalibrated probabilities. You let it do predictions and afterwards you label the ones with a ton of uncertainty
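
A minimal sketch of that selection step, assuming the model returns one probability vector per sample. Entropy is used here as the uncertainty score, which is one common choice among several; `most_uncertain` and the numbers are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of one predicted distribution (higher = less certain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain(predictions, k):
    """Indices of the k samples whose predictions have the highest entropy."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda i: entropy(predictions[i]),
        reverse=True,
    )
    return ranked[:k]


# Illustrative cat/dog probabilities for four unlabeled samples.
predictions = [[0.98, 0.02], [0.55, 0.45], [0.70, 0.30], [0.51, 0.49]]
to_label = most_uncertain(predictions, k=2)
print(to_label)
```

The returned indices are the samples to hand to a human labeler before retraining.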

mild dirge
#

The data is thousands or tens of thousands tree point clouds from many different forests

#

The labeling is mostly done by knowing what trees are in which forest, and labeling entire sections

#

So post-hoc would be difficult and requires expert knowledge, which I don't have

past meteor
#

I think this is an ideal use case for this though, that is if you can get a hold of the experts to do it

mild dirge
#

Yeah, I understand why it would be great if we can label post-hoc, but even then the labels are mostly from known forests from looking at it in the field.

#

So not from looking at the data itself, which are point clouds (sometimes it is easy to see, sometimes not)

past meteor
#

Iirc this is what one of my lectures was about cause my prof did this for energy load anomalies. I can send you the slides when I get back maybe there's something else of value to you there 🤷

mild dirge
#

Would be great, It seems that it even works without labeling post-hoc from some papers, but it seems a bit counterintuitive as the actual information you have is just the labeled data and some pseudo-labels.

#

In the OG paper they only take pseudo-labels from samples that the model is very certain about
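
A minimal sketch of plain self-training with a confidence threshold, to make the loop concrete. This is not the method from the linked paper (which also adds noise and grows the model); the 1-D nearest-centroid "model", the threshold, and the data are all stand-ins chosen for illustration.

```python
import math


def fit_centroids(xs, ys):
    """Mean of each class's points (a deliberately tiny stand-in model)."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_proba(centroids, x):
    """Normalized exp(-distance) to each centroid, used as a confidence."""
    scores = {label: math.exp(-abs(x - c)) for label, c in centroids.items()}
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


def self_train(xs, ys, unlabeled, threshold=0.9, rounds=5):
    """Repeatedly pseudo-label confident points and refit on the larger set."""
    xs, ys, pool = list(xs), list(ys), list(unlabeled)
    for _ in range(rounds):
        centroids = fit_centroids(xs, ys)
        remaining = []
        for x in pool:
            probs = predict_proba(centroids, x)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                xs.append(x)          # accept the pseudo-label
                ys.append(label)
            else:
                remaining.append(x)   # too uncertain, keep for a later round
        if len(remaining) == len(pool):
            break                     # nothing new was added, stop early
        pool = remaining
    return fit_centroids(xs, ys), pool


centroids, leftover = self_train(
    xs=[0.0, 10.0], ys=["a", "b"], unlabeled=[1.0, 9.0, 5.0]
)
print(centroids, leftover)
```

In this toy run the accepted points pull each centroid toward where the unlabeled data actually lies, while the ambiguous midpoint is never pseudo-labeled. That shift is one commonly offered intuition for how unlabeled inputs can add information about the data distribution even though no new ground-truth labels arrive.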

past meteor
#

A large part of his research is semi supervised learning in general so afterwards it might be a good idea to check out the papers from his lab 🙂

mild dirge
#

Appreciate the input, the post-hoc can be something to look into if that's possible at my company (basically need to convince one of my coworkers to do it 😛 )

harsh minnow
#

Hey guys, I am pretty new to working with LLMs. I want to build an AI agent; how can I build it? Are there any useful videos you can share that go in depth on building AI agents? Your suggestions will help me, thanks.

tacit salmon
#

hey you guys, I am interested in AI and machine learning, that's principally why I learned Python

#

but as beginner-Junior developer, I don't know how could it be applied

#

Because I only saw text based simple programs

#

can you tell me how does it work?

#

I mean, what does one need to start making AI stuffs

left tartan
left tartan
#

Third, math. Strong math skills, if you want to do anything more complicated than simple projects with existing models. Once you understand the theory, you need strong math skills to understand the details. See pinned messages for some reading material.

long canopy
#

What part of the data science process is the one going from data collection to database insertion?

tacit salmon
#

I mean, I used some Trigonometry to make my game characters rotate, but It's not like a kind of hard Math

#

Wow 3Blue1Brown, idk if it's gonna be kinda hard

#

I'll watch the series

left tartan
tacit salmon
#

So do I need Calculus? I never tried to do that

left tartan
#

Calculus is considered a fundamental college course for any STEM major, just like algebra is in high school.

#

You can certainly do stuff with AI/ML libraries without calculus, but it's just one part of becoming a complete engineer or scientist: understanding how everything works.

long canopy
#

Any book recommendation that clearly establishes differences between data architect, data engineer, and data scientist?

#

I'm having a hard time organizing data intake vs. validation vs. database insertion in my code, I feel like having a higher-level conceptual understanding would help me out

modest dome
#

Question. I am working with a pandas dataframe and the csv file date shows up with month and year only (Jan 2000). How can I separate the month and year and create their own columns?

young granite
modest dome
#

@young granite I'll try that thank you

echo mesa
#

Guys, I'm confused about vectors and their magnitude. I'm reading a book (Data Science from Scratch: First Principles with Python) where a vector's magnitude is defined as the square root of the sum of the squares of the vector's values. However, I don't think that's necessarily correct. When we calculate a vector's magnitude we consider its components, which from my understanding are Δx and Δy (I'm talking about R^2 specifically): we square them, add them together, and take the square root of the whole expression. When we have a vector [3,4] and assume it starts from the origin, then yes, we can sum the squares and take the square root, which gives 5. But what we really do is consider Δx and Δy, which, since we started from the origin, are just the coordinates (3,4) of the vector. When we don't start from the origin, we have to consider the initial point and the terminal point, determine Δx (x2-x1) and Δy (y2-y1), and then calculate the magnitude. The method described in the book only works if the vector starts from the origin; if not, Δx and Δy will not be equal to the vector's coordinates. I think the distinction between a vector that starts from the origin and one that starts somewhere else is really important because it affects the magnitude of the vector. Perhaps I'm overcomplicating it, but I think the book should have been more specific here.

barren fable
#

Does anyone have a good recommendation for whatever books, documents, or even pretty nice videos explaining the polynomial regression, SVR, and kernel?

modest dome
modest dome
agile cobalt
# modest dome Question. I am working with a pandas dataframe and the csv file date shows up w...

!e you can use to_datetime to turn into proper datetime objects, then extract the properties like ```py
import pandas as pd
df = pd.DataFrame({'string': ['Jan 2020', 'Feb 2021']})
df['date'] = pd.to_datetime(df['string'])
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year

# optionally do not even store df['date'] as a column, just
# (same thing as before but without storing as a column)
# dates = pd.to_datetime(df['string'])
# df['month'] = dates.dt.month [...]

print(df)
```

arctic wedgeBOT
#

@agile cobalt :white_check_mark: Your 3.12 eval job has completed with return code 0.

001 | /home/main.py:3: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
002 |   df['date'] = pd.to_datetime(df['string'])
003 |      string       date  month  year
004 | 0  Jan 2020 2020-01-01      1  2020
005 | 1  Feb 2021 2021-02-01      2  2021
agile cobalt
#

the format for this case is "%b %Y", so ideally you should explicitly specify like ```py
df['date'] = pd.to_datetime(df['string'], format="%b %Y")
```
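For reference, `%b` is the abbreviated month name and `%Y` is the four-digit year. The same directives are understood by the standard library's `datetime.strptime`, so a minimal sketch like the one below can sanity-check a format string without pandas. It assumes an English or C locale, since `%b` is locale-dependent.

```python
from datetime import datetime

# %b = abbreviated month name (locale-dependent), %Y = four-digit year
parsed = [datetime.strptime(s, "%b %Y") for s in ["Jan 2020", "Feb 2021"]]

for d in parsed:
    print(d.month, d.year)
```

A string that does not match the format raises `ValueError`, which is usually preferable to a silent per-element guess.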

wooden sail
#

if you specify a vector simply as coordinates, it does not matter where a vector is located in space

#

the object you're describing requires extra parameters. a second vector, to define both the "tail" and "head" of the vector (more properly, an affine transformation of a vector). but if you then compute the corresponding vector as head - tail, you again get a proper vector with no location in space attached to it, which we can always represent as having its tail at the origin
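A small sketch of the head - tail idea in plain Python. The points `tail` and `head` are made-up values for illustration. Subtracting them gives a vector whose magnitude comes from the usual square-root-of-the-sum-of-squares formula, regardless of where the arrow was drawn.

```python
import math

def magnitude(v):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(c * c for c in v))

# hypothetical arrow drawn from (2, 1) to (5, 5)
tail = (2, 1)
head = (5, 5)

# head - tail gives the displacement vector, with no location attached
v = tuple(h - t for h, t in zip(head, tail))

print(v)             # (3, 4)
print(magnitude(v))  # 5.0
```

So the book's formula and the Δx, Δy calculation agree: the Δ values are exactly the components of the vector.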

haughty heart
#

That’s so cool

#

W for the artists

serene scaffold
haughty heart
#

So if there’s a picture of a dog and it has that poison pixel in it, it makes the picture look like a cat instead of a dog

#

So it’s basically a deterrent to stop big AI tech companies, like the one behind ChatGPT, from using artists’ art without their knowledge

left tartan
#

This will be the arms race for next 20(+) years... counter-AI measures and counter-counter measures.

#

10 years from now, the first "AI Defense" majors...

serene scaffold
left tartan
lapis sequoia
#

!e what you are

iron peak
#

Has anyone heard of Polars?

desert oar
# echo mesa Guys, I have a confusion with vectors and their magnitude. Now I'm reading a boo...

however when we don't start from the origin

in linear algebra, all vectors start at the origin. the notion of a vector as an arrow floating around in space is a convenience used in physics, but it has little to do with how it's actually modeled in linear algebra.

if you want to live in a world where the origin is "free floating", the magnitude is defined as the distance between the vector and its own origin, not strictly the "global" (0, 0) point.

but yes your line of thinking is reasonable and it's a good question.

serene scaffold
iron peak
#

Yeah. I saw two YouTube shorts about it and was like “this is weird. Is it something that’s up and coming, or did YouTube hook onto something I didn’t notice?”

serene scaffold
#

My impression is that Polars is on the rise, but I'm not inclined to switch to it.

iron peak
#

I don’t know pandas yet. Maybe I should just get ahead with polars.

serene scaffold
#

Maybe. Pandas is very established and integrated with the data science stack

desert oar
iron peak
desert oar
#

pandas is not particularly highly optimized, being more aimed at general-purpose use than high performance

#

(it actually is fairly well optimized for what it is, but the developers have a range of priorities, optimization being only one of them)

#

polars meanwhile is specifically designed to be high performance

#

the main advantage of polars over pandas is that polars has a "query engine" much like what you find in a database, that can compile and optimize what you ask it to do. whereas pandas just does whatever you tell it, no query optimization or compilation phase.

left tartan
past meteor
#

To me, aside from the speed aspect, Polars has much better syntax and many constructs like over and groupby_dynamic which make it significantly less verbose than Pandas.

#

I'll always make the argument I've used Spark, SQL, Dplyr, data.table, Polars, Pandas, ... and honestly Pandas has always been the one that makes me want to pull my hair out. Even if it were slower I'd prefer Polars' syntax.

spark nimbus
#

Does anyone know how to easily debug where in a chain of computations dask is failing? It says it failed to infer types on mul but with the amount of multiplications we have that means nothing. The tracebacks contain no call site information and the code ran fine in pandas.

grand minnow
#

Just spotted this in the wild. A Taipy. https://www.taipy.io/ looks interesting for data scientists. But I don't know how well this compares to streamlit

Taipy is an open-source Python library for building your web application front-end & back-end in no time. No knowledge of web development is required!

desert oar
spark nimbus
#

Unfortunately I can't make a subset as that requires loading the original file

desert oar
cold osprey
#

pandas can read in top n rows iirc

#

or polars

boreal gale
#

try specifying the dtypes explicitly when reading your data (if you are working with a schemaless format like CSV)?

spark nimbus
past meteor
spark nimbus
#

Well yeah but then I'd need to loop over all rows and discard them if they don't fit a given criteria and if the resulting dataset is too small I'd need to filter on different criteria to get a good dataset

#

And with over 1.5m rows I'm not sure if that's feasible

past meteor
#

In that case you can scan_parquet and apply all the filters you want using polars, collect(streaming=True) and then to_pandas()

#

scan_parquet is lazily executed

spark nimbus
#

Does it allow for groupby queries or only simple filtering?

lapis sequoia
#

can someone help me adding my flask app to github and have to add .gitignore file and config.json as well

spark nimbus
past meteor
#

You might not even need streaming mode tbf

#

1.5m rows isn't that much if your machine has like 8-16GB ram, considering Polars' memory footprint is substantially smaller to begin with

spark nimbus
#

I've tried both pandas and dask so far but both can't allocate enough memory (dask says I'd need about 80GB)

past meteor
#

Polars does a lot of multithreading already to begin with. Dask adds so much complexity and opaqueness.

spark nimbus
#

Unfortunately it's management who told me to use dask over pandas/polars so I'm afraid my hands are mostly tied

past meteor
#

I remember writing Numba + Numpy + Dask stuff that released the GIL and the code would silently fail without errors if 1 thread or so ran into trouble.

#

I wouldn't wish this onto my worst enemies

spark nimbus
#

Trust me I'd switch to polars if I could, I've had to abuse some #esoteric-python tech to patch dask to even be compatible with our old code

past meteor
#

Have you at least upgraded your Pandas version?

spark nimbus
#

On the latest version compatible with py3.11.5 afaik

past meteor
#

You definitely need to check if it starts with a 2

spark nimbus
#

2.1.1 it looks like

past meteor
#

If mgmt is set on using Pandas, which is fine if you have tons of existing code, I'd still consider doing the bulk of the preprocessing / subsetting in Polars before moving to Pandas.

#

That's what I do, I have some legacy Pandas code that works fine. There's literally no sense in refactoring it. I keep it as is, do some Polars stuff beforehand and call df.to_pandas()

#

Could that work for you? (you can do the same in DuckDB, it doesn't need to be Polars)

rugged comet
#

I found the RMSE and the average of the absolute value of the residuals for three regression models. I also found the mean and the standard deviation of the response. I'm being asked (1) how I determined the models' accuracy, (2) whether I compared predictions to other numbers, and (3) whether the models are accurate.

  1. For me, accuracy is hard to quantify for regression models. I typically think of classification models when I hear 'accuracy'. However, I think that we can use RMSE as a measure to evaluate the models. Though I'm not sure how to put into words my findings.
  2. I compared the predictions to the actual values. That part was easy.
  3. Are the predictions accurate? Again, this notion of accuracy for a regression model is hard for me to conceptualize. I think I have the pieces of information needed to get to the answer; I'm just not sure how to get there.
past meteor
#

Accuracy also depends on the business problem. Being 3 % off of sales numbers is huge if you're walmart compared to a corner store. Being 3 % off of detecting a life threatening disease is worse than either of those.

#

Finally, undershooting and overshooting aren't equally bad. If you're doing sales forecasting there's typically a cost for overstocking and a cost for understocking, and they're asymmetric, meaning that the same RMSE (say your predictions were always 10 above or always 10 below the target) can have different "business" costs. Same goes for detecting a life-threatening disease: sending a few more people to a second screening isn't as bad as telling them they flat out have nothing.
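To make the asymmetry concrete, here is a minimal sketch with made-up numbers. The sales figures and the per-unit costs `COST_OVER` and `COST_UNDER` are assumptions for illustration only. Both forecasts have the same RMSE, but the one that undershoots costs four times as much.

```python
import math

actual = [100, 120, 90, 110]
over = [a + 10 for a in actual]   # always 10 above the target
under = [a - 10 for a in actual]  # always 10 below the target

def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

# hypothetical per-unit costs: overstock is cheap, a lost sale is expensive
COST_OVER, COST_UNDER = 1.0, 4.0

def business_cost(pred, true):
    total = 0.0
    for p, t in zip(pred, true):
        diff = p - t
        total += COST_OVER * diff if diff > 0 else COST_UNDER * -diff
    return total

print(rmse(over, actual), rmse(under, actual))                    # 10.0 10.0
print(business_cost(over, actual), business_cost(under, actual))  # 40.0 160.0
```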

desert oar
#

that or generate one randomly

abstract wasp
#

Hi, does anyone here know about a really good research paper where they talk about cGANs and using that for age progression and regression? Feeding an image of a person to the cGAN and then generating an image of the person but younger/older.
I want to use the paper as a guide so I can build one myself.

desert oar
# rugged comet I found the RMSE and the average of the absolute value of the residuals for thre...

"accuracy" in this sense just means "correctness of predictions". in classification we have a very natural way to compute that: the "0-1" loss (0 if wrong, 1 if correct), also often just called "accuracy". for continuous labels, you're right that we have to use something else. RMSE is a great choice overall.

Are the predictions accurate?
think about what RMSE is: it's the square root of the mean squared error of the model predictions, so it's in the same units as the response. so are the predictions accurate? you tell me: look at the thing you computed. if i asked you the same about a classification model, it would be the same: do the model predictions match the labels in the data? if not, what's the error? if we chose a sensible error/loss function, one would hope that lower loss means better predictions.

rugged comet
desert oar
#

that's true for classification as well

past meteor
#

You can't make any claims about accuracy without benchmarks and baselines imo

desert oar
#

that ☝️

#

accuracy is only meaningful when compared to other accuracies

#

or when compared to some benchmark as required for your business

past meteor
desert oar
#

the real question is, is it accurate enough for whatever you're trying to accomplish?

rugged comet
#

I think I'm just hung up on the phrasing of the question.

desert oar
past meteor
#

I'd compare them against each other and add a baseline
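One simple baseline is "always predict the mean of the response". Its RMSE on the same data equals the population standard deviation of the response, which is why the standard deviation is a useful number to compare each model's RMSE against. A minimal sketch with made-up values (`y_true` and `y_pred` are not from any real model):

```python
import math
from statistics import mean, pstdev

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 6.0, 9.5]   # made-up model predictions

def rmse(pred, true):
    return math.sqrt(mean((p - t) ** 2 for p, t in zip(pred, true)))

def mae(pred, true):
    return mean(abs(p - t) for p, t in zip(pred, true))

# baseline: always predict the mean of the response
baseline = [mean(y_true)] * len(y_true)

print(f"model    RMSE={rmse(y_pred, y_true):.3f}  MAE={mae(y_pred, y_true):.3f}")
print(f"baseline RMSE={rmse(baseline, y_true):.3f}  MAE={mae(baseline, y_true):.3f}")
print(f"std of response={pstdev(y_true):.3f}")
```

A model whose RMSE is not clearly below the baseline's is doing little better than guessing the average. Whether "clearly below" is good enough still depends on the problem.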

desert oar
#

i think you're supposed to explain whether it's accurate enough for whatever the context of the question is

past meteor
#

Think about asymmetric costs (by looking at predicted vs. actual plots) and qualitatively argue for the accuracy of your model compared to your business problem

#

Don't underestimate how qualitative data science is! 😄

desert oar
past meteor
#

I think it's my coursework where I picked up this line of thinking in the first place

#

But not in an intro to DS course yes I'll give you that

serene scaffold
#

Something that's increasingly causing miscommunication these days is that when people hear "LLM", a lot of people immediately think of instruction-tuned LLMs. but I've been using LLMs for a few years, and I still just think of them as a probability distribution for word co-occurrence.

sterile wyvern
#

How come GridSearchCV doesn't need to call the original function that made the time series data?

#

Ordinarily I should use the same function to pass the parameters to. How could this be possible tbh?

sour coral
#

I'm making an image recognition project for Science Fair and I'm using Teachable Machine for the processing. I'm trying to run it in Python and the export guide has a code snippet to load a Keras model. The model it is trying to load is in a zip file in my downloads. Here is the code that is trying to fetch the download:

#

model = load_model("keras_Model.h5", compile=False)

#

I'm getting an error saying no file or directory found at keras_Model.h5. Any ideas on how to get the code to find it?

#

Thanks!

acoustic prawn
sour coral
#

Directory as in the same folder?

#

Sorry I'm kinda new to this

acoustic prawn
sour coral
#

Okay, it found the file, but now it's saying 'Unable to open file (file signature not found)'
Is it because it's trying to access a zip file?

#

Never mind it works!

#

Thank you so much for your help!

shut girder
#

Hello, how should I approach learning how to do exploratory analysis? Any resource suggestions?

empty pond
#

Hey

shut girder
#

Thanks, I'll check it out

spark nimbus
#

Running into an issue where df[col] * some_float throws an error because it can't multiply a sequence by a non-int, but at the time of running the dtype of the column should be float, not str, and df[col].compute() even gives a float64 series as output

sterile wyvern
buoyant vine
#

Has anyone had issues using Poetry and PyTorch 2.1?
For some reason Poetry seems to not be able to find the right packages even though they exist on PyPI and also exist in PyTorch's own index, which is the fallback.

#

And then as a second question, does anyone know if it is possible to update the CUDA version a CML Docker container is using? (Using 11 atm, need 12)

serene scaffold
wooden sail
#

there's probably a pytorch image with your desired cuda version, then you install cml on top

wooden sail
buoyant vine
wooden sail
#

ah you're doing both, ok

buoyant vine
#

yeah

wooden sail
#

fair enough

#

i can't help with that :x

buoyant vine
#

i will work on the docker stuff first, since that is currently breaking ci

wooden sail
#

should be able to use the pytorch image as a base and slap those cml install commands on top... in theory

boreal gale
desert oar
#

that error about multiplying a sequence almost sounds like trying to multiply a native python list

#

!e ```py
[1,2,3] * 5.1
```

arctic wedgeBOT
#

@desert oar :x: Your 3.12 eval job has completed with return code 1.

001 | Traceback (most recent call last):
002 |   File "/home/main.py", line 1, in <module>
003 |     [1,2,3] * 5.1
004 |     ~~~~~~~~^~~~~
005 | TypeError: can't multiply sequence by non-int of type 'float'
desert oar
#

@spark nimbus ☝️

#

somehow you ended up with a native python list somewhere. maybe in an "object" dtype column.

spark nimbus
#

Using .astype(float) ended up solving it

#

Turns out it got confused by the type remapping from a .replace call

spice mountain
#

I am trying to pass an image into a pretrained VQGAN model. The model takes floats as inputs/uses floats for its biases.

The original image is the one in the middle. The one on the left (the one that is blue and is missing most of the pixels) is what happens when I convert it to floats. The one that looks a bit demonic is what I get when I just load it into a PyTorch tensor.


```python
# Import necessary libraries
import torch
from PIL import Image
import torchvision.transforms as transforms

# Read a PIL image
image = Image.open(segmentation_path)

# Define a transform to convert PIL
# image to a Torch tensor
transform = transforms.Compose([
    transforms.PILToTensor()
])

# transform = transforms.PILToTensor()
# Convert the PIL image to Torch tensor
img_tensor = transform(image).permute(1,2,0)

# print the converted Torch tensor
print(img_tensor)
```

My code for loading it.

In order to turn it into a float, I just do

```python
img_tensor.float()
```

buoyant vine
#

We love CI

#

So trying to setup the action CI, we now use the pytorch image

#

but now setup-cml doesn't seem to want to work 😢

river maple
#

I'm trying to make a driver drowsiness detection program using a pretrained model.
Here's the code:

import cv2
import os
import tensorflow as tf
from tensorflow.keras.models import load_model
import numpy as np
import pygame.mixer as mixer

mixer.init()
sound = mixer.Sound('alarm.wav')

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
model = load_model(os.path.join("models", "cnnCat2.h5"))
lbl = ['Close', 'Open']

cap = cv2.VideoCapture(0)

while True:
    ret, frame = cap.read()
    height, width = frame.shape[0:2]
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    faces = face_cascade.detectMultiScale(gray, minNeighbors=3, scaleFactor=1.2, minSize=(25, 25))
    eyes = eye_cascade.detectMultiScale(gray, minNeighbors=3, scaleFactor=1.2)

    for (x, y, w, h) in faces:
        cv2.rectangle(frame, pt1=(x, y), pt2=(x + w, y + h), color=(255, 0, 0), thickness=3)

    for (ex, ey, ew, eh) in eyes:
        cv2.rectangle(frame, pt1=(ex, ey), pt2=(ex + ew, ey + eh), color=(255, 0, 0), thickness=3)
        eye = frame[ey:ey + eh, ex:ex + ew]
        eye = cv2.resize(eye, (80, 80))
        eye = eye / 255
        eye = eye.reshape(80,80,3) 
        eye = np.expand_dims(eye, axis=0)
        prediction = model.predict(eye)
        print(prediction)

I get this error when i run the code.

mild dirge
#

@river maple Should the image not be grayscale?

#

It seems to expect 1 channel, but you have 3

river maple
#

@mild dirge i changed the image to grayscale and reshaped it to (80,80,1)

#

Now its giving me this

vale swallow
#

Hi, is there such a thing as a deep learning engineer? Or is it just an ML engineer?
I’m most interested in deep learning so I was wondering if there are careers that only focus on that.

agile cobalt
#

99.99% of machine learning jobs are focused on deep learning

past meteor
#

I think DL gets the most "airtime" but I still see a lot of shops doing ML on tabular data.

agile cobalt
#

you think that those shops will have a professional specialized on machine learning?

past meteor
#

Yes

#

I was at a data science / ML consultancy in the past and they did virtually no DL. Most of it was ML on tabular data. Typical use cases like fraud detection, forecasting, predictive maintenance, ...

agile cobalt
#

how long ago 'in the past'?

past meteor
#

I think this was 2 years ago

agile cobalt
#

not sure then

past meteor
#

I'm not sure why you'd use DL if you don't need it 🤷 . There are, imo, many cons especially surrounding deployment of it.

#

So when it's tabular data I'd very much prefer not to.

#

Also the time it takes to pick a bespoke architecture, tuning which is harder but 100 % necessary etc.

past meteor
agile cobalt
#

stakeholders don't understand the tradeoffs and just want the best thing money can buy, but for cheap

past meteor
#

What stakeholders though? If they're business idt they'd understand the difference between xgboost and DL

#

Sorry if I'm sounding contentious, really just curious about your experience since for me it's different 😄

agile cobalt
#

just managers not very exposed to statistics

left tartan
#

My clients are either: "I must gpt everything", or "I don't trust ai to recommend my lunch, much less a trading decision/forecast"

past meteor
#

The thing with my time at $ML/AIConsultingCompany is that if they knock on your door they want something already

#

And the business model was really to have different titles and teams do BI / data infra first and plan ML in the long run after the first set of projects have delivered value

#

idk how stuff looks like in that scene after GPT dropped 🤷

left tartan
#

so much snakeoil.

vale swallow
#

Ok, so, I’m guessing it depends on the company you work for?

#

Is anyone here an AI/deep learning researcher? What was your experience and is the income decent?

rugged comet
#

I'm trying to understand low and high p-values and how they relate to normality and homoscedasticity of data.

Use the Shapiro-Wilk test to determine if the data is normal. Shapiro-Wilk gives us a p-value. We would like the number to be above .05.
This sounds like if the p-value is greater than 0.05, the data is normal.

A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true. The lower the p-value, the greater the statistical significance of the observed difference. A p-value of 0.05 or lower is generally considered statistically significant.
This part I'm not getting. I think it contradicts the first part. This makes it sound like a low p-value is good.
But perhaps "good" has opposite meanings in different contexts.
How would you interpret a low p-value?

agile cobalt
#

what do you understand that the first part means by "determine if the data is normal"?

rugged comet
#

I understand the first part to mean compare the p-value we got from the Shapiro-Wilk test on the residuals to 0.05.

agile cobalt
#

and do you understand what it means (regarding the data) if it passes or fails the test?

rugged comet
#

I think that if the data fails the test, p < 0.05, then the data is not normal. If it passes the test, p > 0.05, the data is normal.

#

Is that what you're asking?

past meteor
#

That's not really how to interpret p-values

#

But interpreting them correctly is a science in and of itself so I get the struggle

tidal bough
#

The "null hypothesis" of the S-W test is that the underlying distribution is normal. Hence, the p-value the Shapiro-Wilk test gives you is ||the probability of obtaining this data (well, or rather data that's at-least-this-weird by the metrics the S-W test tracks) if the underlying distribution is normal||.

#

So if you get a low enough p-value, then (as always) you can consider your null hypothesis disproven - this data probably didn't come from a normal distribution.
So if you want to check that the data is from a normal distribution, you're looking for a high p-value on the S-W test.

rugged comet
tidal bough
buoyant vine
#

Does anyone know how the PyTorch Lightning distributed backend works with datamodules?
Currently it appears to be creating two instances, which I haven't configured, and I'm not sure how to turn that off, because it seems to be breaking things...

I see the Python file being run as __main__ twice, once for each process, and in the logs:

Log: https://paste.pythondiscord.com/7UYA

The problem is PT doesn't seem to then call prepare_dataset again, so the validation dataset is not generated.
But I don't really want it trying to do two distributed systems right now 😅

This is the training code

        early_stop = pl.callbacks.early_stopping.EarlyStopping(
            monitor="val_loss",
            min_delta=0.00001,
            patience=5,
            verbose=False,
            mode="min",  # val_loss should decrease, so stop when it stops getting smaller
        )

        trainer = pl.Trainer(
            callbacks=[early_stop],
            max_epochs=self.model_config.n_epochs,
            num_nodes=1,
            log_every_n_steps=32,
            accelerator="auto",
            devices="auto",
        )

        trainer.fit(self.model, self.data_module)

num_nodes suggests to me that this shouldn't spawn multiple independent processes?

tidal bough
rugged comet
tidal bough
# rugged comet What is meant by extreme in "results as extreme as the observed"? I think perhap...

Basically, to do statistics this way, you start by selecting the metric you will track. Suppose you're flipping a coin 20 times, say, and you're checking the null-hypothesis of the coin being fair. You choose the number of heads as the metric.
You get 15. What's the p-value of this result? It's not the probability of obtaining exactly 15 heads (1.48%) - that'll be quite low even for 10 heads (worse, if we were doing an experiment with continuous measurements, the probability of obtaining any particular result would just be exactly 0). It's the probability of obtaining >=15 heads, about 2%.

(Or, alternatively, we could choose to track the 2-tailed p-value here - that is, we consider deviations from 10 to both sides to be the same, and so our p-value will be the sum of the probabilities of obtaining 15,...,20 and also 0,...,5.)
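The coin-flip numbers can be checked directly from the binomial distribution. This is a small sketch using only the standard library. The lower tail of the two-tailed version is taken as 0 through 5 heads, the mirror image of 15 through 20.

```python
from math import comb

n = 20  # flips of a fair coin

def prob_heads(k):
    """P(exactly k heads in n fair flips)."""
    return comb(n, k) / 2 ** n

exactly_15 = prob_heads(15)
one_tailed = sum(prob_heads(k) for k in range(15, n + 1))
two_tailed = one_tailed + sum(prob_heads(k) for k in range(0, 6))

print(f"{exactly_15:.4f}")  # 0.0148
print(f"{one_tailed:.4f}")  # 0.0207
print(f"{two_tailed:.4f}")  # 0.0414
```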

#

(If it seems kinda arbitrary that you need to precommit to doing only specific measurements (total number of heads, not some other metric) and how you'll be calculating the results (one-sided or two-sided) before the experiment for it to be valid - yeah, it is, that's a common critique of this entire approach to statistics as opposed to e.g. the Bayesian one.)

tidal bough
#

If you get a high p-value here, what does it mean? Well, it means your data is such that this statistic, calculated on this data, looks like what you'd get from a normal distribution.
(It would be a mistake, though, to assume that guarantees it actually came from a normal distribution - maybe there's some distribution that's not normal but produces a similar distribution of this statistic, and the data was drawn from it.)

past meteor
#

I avoid using p-values in practice because they require a lot of baggage to interpret correctly

#

From my old slides:

tidal bough
rugged comet
#

I also have this qq plot and this displot of the residuals. The dots should hug the line on the left, I believe, and the residuals should look like a bell curve on the right. Neither of these is true, which is more evidence in support of the residuals not being normal or homoscedastic.

past meteor
# tidal bough Ah, that's interesting. What do you do in practice, then? Are you calculating li...

I'm not often in situations where I need them. Summary statistics and counts is what I use without trying to make strong claims unless I have access to an actual statistician.

When I was and was "forced" to use them they didn't express anything interesting for my domain. I had a cross section with demographic data. When n is that large any difference becomes significant. Sure then you need to fix an effect size but it was hard to come up with an effect size before looking at the data. Doing it after means the entire thing is biased anyway.

serene scaffold
#

is the right one supposed to look like the Burj Khalifa?

rugged comet
#

Does the Burj Khalifa look like a bell? (rhetorical)

tidal bough
#

tbh if you showed me the right plot i'd go 'yeah sure looks normal to me'

#

the left plot is more worrying, with the right tail especially

rugged comet
#

Perhaps I should compare these with a normal displot and a normal qq plot to get a better idea.

rugged comet
#

Would you agree with this? I'm uncertain about the null hypothesis for the White test.

cunning agate
#

does any one work with knowledge graph ?

#

I have some questions

pine void
#

!paste

arctic wedgeBOT
#
Pasting large amounts of code

If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.

pine void
#

https://paste.pythondiscord.com/GFRQ

This is my code for a minimax on tic tac toe. I need to be able to get the score when running it, but in the end i need it to return the best move. How can I do that?
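A common pattern is to have the recursive function return both the score and the move that produced it, as a tuple. Inner calls use only the score, and the top-level caller reads the move. The sketch below is written without seeing the pasted code, so the board representation (a list of 9 cells holding "X", "O" or " ") and all function names are assumptions.

```python
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (score, move) from X's point of view: +1 X wins, -1 O wins, 0 draw."""
    won = winner(board)
    if won is not None:
        return (1 if won == "X" else -1), None
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0, None  # board full, nobody won

    best_score, best_move = None, None
    for move in moves:
        board[move] = player                      # try the move
        score, _ = minimax(board, "O" if player == "X" else "X")
        board[move] = " "                         # undo it
        if best_score is None:
            improved = True
        elif player == "X":
            improved = score > best_score         # X maximizes
        else:
            improved = score < best_score         # O minimizes
        if improved:
            best_score, best_move = score, move
    return best_score, best_move

# X to move with two in a row on top: the winning move is index 2
score, move = minimax(list("XX OO    "), "X")
print(score, move)  # 1 2
```

Ties keep the first move found, so the returned move is one of the best moves, not necessarily the only one.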

desert oar
#

find a statistic with a known distribution under the null. if the value of that statistic in your sample is highly improbable, you reject the null.

lapis sequoia
#

interesting

potent sky
# cunning agate does any one work with knowledge graph ?

It's generally better to just ask your question away if possible and whoever is familiar with the concept and free can pick it up and help you with it.
People usually don't want to engage in back and forth just to get to the question and then discover whether they're familiar / free enough to help with it or not

lapis sequoia
#

Salut

spark nimbus
#

With dask, how would I do df['x'] = df['x'].mask(df['x'] < 0, 0.0) or something along those lines? no matter what I do, it complains about the index not being aligned.

spark nimbus
#

oh I should've given a better example

#

tl;dr I need to use mask since I'm doing complex filtering

jaunty helm
# spark nimbus With dask, how would I do `df['x'] = df['x'].mask(df['x'] < 0, 0.0)` or somethin...

Works fine for me?

import pandas as pd
import seaborn as sns

iris = sns.load_dataset("iris")
iris: pd.DataFrame

iris["sepal_length"] = iris["sepal_length"].mask(iris["sepal_length"] < 5, 0.0)
print(iris.head())
#   sepal_length  sepal_width  petal_length  petal_width species
#0           5.1          3.5           1.4          0.2  setosa
#1           0.0          3.0           1.4          0.2  setosa
#2           0.0          3.2           1.3          0.2  setosa
#3           0.0          3.1           1.5          0.2  setosa
#4           5.0          3.6           1.4          0.2  setosa
spark nimbus
jaunty helm
#

Ah whoops

lapis sequoia
#

also its not really that hard in dask

#

this is how you can do conditional assignment

spark nimbus
#

I see

short heart
#

Why does TfidfVectorizer return an all-zero array (basically all zero, but there is 1 non-zero value for the other 20,000 zeroes)?

short heart
agile cobalt
#

show a reproducible example then

short heart
#

that would be hard without sharing initial data

plush jungle
#

has anyone here done reinforcement learning? I've been training a DQN on a simple pygame game i made but after a ton of experiments I'm really disappointed at how badly/slowly it learns. I've seen successful projects with more advanced algorithms like PPO that make use of massive parallelization so I figured I'd give that a try

#

but I can't find a simple code example of PPO so I don't really understand how to implement it

#

so I figured I'd first try to parallelize the DQN and move on to PPO if that doesn't work

#

the problem is, I'm not actually sure how to model it

#

right now I have observations [o_1, ... o_n] and I pass a vector of length n to the neural net

#

but if I have N observations and M games running in parallel, do I change my neural net to have the first layer be NxM observations? cause that doesn't seem like what I want. I want the agent to be able to run on a single game, and it certainly shouldn't correlate across games

#

so I'm thinking I should have M copies of the neural net, store the results of the games in my replay memory, and then backpropagate on only one of the neural nets

#

but this is all way more complicated than I think I can manage without help

#

and I'm worried I'm overthinking it
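One common arrangement, offered here as an assumption about what might suit this project and not as a diagnosis, is to keep a single network whose input size stays N and treat the M games as a batch. The same weights are applied to each game's observation independently, so nothing is correlated across games and the agent still runs on a single game. A pure-Python sketch with a made-up one-layer "network":

```python
import random

N_OBS, N_ACTIONS, M_GAMES = 4, 2, 3
random.seed(0)

# one shared set of weights: a single linear layer mapping N_OBS -> N_ACTIONS
weights = [[random.uniform(-1, 1) for _ in range(N_ACTIONS)] for _ in range(N_OBS)]

def q_values(observation):
    """Q-values for ONE game: the input length stays N_OBS."""
    return [sum(o * weights[i][a] for i, o in enumerate(observation))
            for a in range(N_ACTIONS)]

def q_values_batch(observations):
    """Apply the same network to each game independently (a batch of M rows)."""
    return [q_values(obs) for obs in observations]

# hypothetical observations from M parallel games, shape (M_GAMES, N_OBS)
batch = [[random.random() for _ in range(N_OBS)] for _ in range(M_GAMES)]
out = q_values_batch(batch)

print(len(out), len(out[0]))  # 3 2
```

Transitions from all M games can then go into one shared replay memory, and training updates the single set of weights.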

astral hazel
#

What's the best layer layout for lightweight NLP model, that puts single word in the right form, tense, case etc?

plush jungle
astral hazel
#

What layers to use and where to put them

plush jungle
rotund lark
#

Hello. If I want to have Selenium enter login information on a website, how would I do that?

https://member.restaurantdepot.com/customer/account/login

I'm getting this error after trying to run Selenium

There has been an error processing your request
Front controller reached 100 router match iterations
Error log record number: a41644797e68843c6e965cd114df2fce18138d5b6efff34f07c54a26d3378e2f

left tartan
rotund lark
#

No, i have permission from their corporate office

brave sand
#

in q learning, how would I convert a state into an int?
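If the state is a tuple of small discrete values, one way to get a single row index for a Q-table is a mixed-radix encoding: treat each feature as a digit in its own base. In this sketch the function name and the four-features-with-three-bins layout are assumptions for illustration.

```python
def state_to_index(state, sizes):
    """Flatten a tuple of discrete values into one int (mixed-radix encoding)."""
    index = 0
    for value, size in zip(state, sizes):
        if not 0 <= value < size:
            raise ValueError(f"{value} is outside range(0, {size})")
        index = index * size + value
    return index

# hypothetical state: four features, each already discretized into 3 bins
sizes = (3, 3, 3, 3)
print(state_to_index((0, 0, 0, 0), sizes))  # 0
print(state_to_index((1, 1, 1, 1), sizes))  # 40
print(state_to_index((2, 2, 2, 2), sizes))  # 80
```

With these sizes the Q-table needs 3 * 3 * 3 * 3 = 81 rows. Using the tuple itself as a dictionary key is an alternative that avoids the encoding entirely.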

lapis sequoia
#

hi all,

i am trying to learn kmeans clustering with sklearn. so far i have imported my data, dropped rows containing NaN, and visualized the fit data with matplotlib. i wanted to break the set of columns into two separate clusters (k=2). however, it seems to be trying to cluster all of the rows into clusters instead.

#

might i need to transpose the dataframe? or is there another way

lapis sequoia
#

wut

brave sand
#

why is my agent so dumb

#

am I doing something wrong?

#

the state never changes

#

it's always (1, 1, 1, 1)

#

i think it is the discretize bins function

#

since the next state is always (1, 1, 1, 1)
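One possible cause of a state that never changes (a guess, since the discretize function isn't shown) is bin edges chosen for a much wider range than the observations actually cover, so every value lands in the same middle bin. A minimal sketch with made-up observations and edges:

```python
from bisect import bisect_right

def discretize(value, edges):
    """Return the index of the bin that value falls into, given sorted inner edges."""
    return bisect_right(edges, value)

# hypothetical feature that really lives in about [-0.2, 0.2]
observations = [-0.15, -0.02, 0.03, 0.18]

wide_edges = [-10.0, 10.0]    # bins sized for a range the feature never reaches
tight_edges = [-0.05, 0.05]   # bins sized for the observed range

print([discretize(x, wide_edges) for x in observations])   # [1, 1, 1, 1]
print([discretize(x, tight_edges) for x in observations])  # [0, 1, 1, 2]
```

Printing the raw observations next to their bin indices for a few steps is a quick way to check whether this is what's happening.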

#

btw this is my first time doing anything AI related so bear with me

mild dirge
# brave sand

Is the reward based on driving without crashing, because it's doing a good job with that 😛

grave token
#

Can you guys recommend some books for statistics maths and ai maths?
It could be a tutorial too

slender kestrel
#

anyone from india here who is doing good in data science ? need a bit of advice for indian data science job market

astral hazel
grave token
#

Completed my bachelor's a year ago, but I did not focus much on statistics

bronze flint
#

Anyone using Apache Spark for big data streaming?

past meteor
#

Also consider reading practical statistics for data scientists

past meteor
magic dune
#

but maybe not first read

past meteor
#

It covers topics that are also not really necessary for industry and there's a lot of diminishing return on those

past meteor
#

I didn't read PRML in full but many of my coursework referenced it so I read some chapters. Stuff like the decomposition of bias and variance is interesting to see but it doesn't really help you.

magic dune
#

but there are topics that are interesting if you want to understand the behind the scenes for the algo

winter canyon
#

How do I best handle a 2 player (turn based) game? My idea would be to have the following inputs: the environment, movement options of the active bot, location of the other bot, which bot is active (?)
Just imagine a grid based pathfinding game for simplicity.

#

concern being that the ai might get confused with constant switching between players

serene scaffold
#

AIs aren't like human brains that are constantly thinking and perceiving. they only have "brain activity" in the limited context that they're being used.

#

ChatGPT doesn't "think" between user inputs.

winter canyon
#

Makes sense. so which player is active is a useless info then 👍

serene scaffold
#

for example, in chess, it needs to know where all the pieces are on the board. whereas for card games, it needs to be able to see its own cards.

bronze flint
short heart
#

If I call fit_transform on TfidfVectorizer multiple times, will it overwrite the past words or update the dictionary?

serene scaffold
#

this is a landmine for a lot of people.

#

you also shouldn't (for example) have separate vectorizers for train and test. because then your test data represents the same things differently than the training data, making everything meaningless.
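
A minimal sketch of that point, assuming scikit-learn's `TfidfVectorizer` and a few toy sentences invented for illustration: fit once on the training text and reuse the same vectorizer for the test text. Calling `fit_transform` a second time relearns the vocabulary from scratch rather than updating it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for illustration.
train_docs = ["the cat sat", "the dog barked"]
test_docs = ["the cat barked", "a bird sang"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # learns vocabulary + IDF from train only
X_test = vectorizer.transform(test_docs)        # same columns; words unseen in train are ignored

# A second fit_transform discards the old vocabulary and learns a new one.
refit = TfidfVectorizer()
refit.fit_transform(train_docs)
vocab_before = dict(refit.vocabulary_)
refit.fit_transform(test_docs)
vocab_after = dict(refit.vocabulary_)

print(X_train.shape, X_test.shape)   # same number of columns
print(vocab_before == vocab_after)   # False
```

Because `X_test` is built with `transform`, each column means the same word in both matrices, which is what makes train and test comparable.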

young egret
#

Hi I'm just wondering if we can compare 2 lists of different length by ID? How do you do it? I'm looking for the difference between multiple dates that are only > 0

serene scaffold
mild dirge
#

Maybe a bit of a shot in the dark, but I'm working on an encoder+decoder model which tries to recreate a point cloud. I am using the chamfer distance for the loss, but it doesn't seem to get much further than the general shape. I was wondering if anyone had an idea for changing the loss function so that it tries to match the details more closely. This image shows the original point cloud in the bottom row, and my decoded point cloud at the top.

#

As can be seen, the general shape is pretty good, but when there are tiny details, it just doesn't do well.

past meteor
#

I always sanity check my approach by ensuring I can at least completely overfit 1 sample or a batch of samples, depending on the case

#

If that is already hard then there's likely something up with your architecture 🙂

mild dirge
#

It should be able to overfit on a single point cloud since the decoder is just a bunch of dense layers, and the loss is a pretty standard loss that is 0 when the point clouds are equal
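
For reference, a minimal NumPy sketch of one common form of the Chamfer distance (squared distances, mean-reduced). The exact variant used above isn't stated, so the reduction and the `chamfer_distance` name are assumptions. It shows the property mentioned: the loss is 0 when both clouds contain the same points, regardless of order.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared distances, shape (N, M).
    diff = a[:, None, :] - b[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Mean nearest-neighbour distance in both directions.
    return float(sq_dist.min(axis=1).mean() + sq_dist.min(axis=0).mean())

rng = np.random.default_rng(0)
cloud = rng.normal(size=(128, 3))
shuffled = cloud[rng.permutation(len(cloud))]

print(chamfer_distance(cloud, shuffled))      # 0.0: same points, different order
print(chamfer_distance(cloud, cloud + 0.01))  # small but non-zero
```

Since both terms are means over all points, a handful of badly placed detail points only nudges the total, which fits the "stuck on the general shape" behaviour described here.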

#

I can maybe look into different decoders, but I'm thinking the model is just stuck in a local *minimum in which optimizing small details just doesn't decrease the overall loss enough for it to move in that direction.

past meteor
#

You should still try it imo 😄 it's basically just debugging. It has saved me countless times.

mild dirge
#

I'll give it a go, but theoretically it should definitely be able to exactly copy the point cloud

past meteor
#

Depends. Are you trying to win the competition or do data science properly? 😄

acoustic skiff
#

WIN

#

well maybe well

past meteor
#

It's important to know that they're completely different things and that Kaggle competitions can, or at least used to, teach bad habits

#

If you're OK with that then the smartest thing to do if you want to win is to clean the entire dataset together.

#

If you care about being methodologically correct, the first thing you should do is split your data and only work with the training set. You should then make a preprocessing pipeline that can work with your training set but also works for totally unseen data.

acoustic skiff
#

Right so I was thinking as a not cheating thing if I define my preprocessing pipeline using only train data that's probably the correct thing... But now I think I see what you mean I could cheat and include the test data too

#

Like MEDIAN_AGE just from train
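
A small sketch of the "statistics from train only" idea, assuming pandas and a hypothetical `age` column with made-up values: the median is computed on the training split and then reused unchanged on the test split.

```python
import pandas as pd

# Toy data, invented for illustration; "age" is a hypothetical column name.
train = pd.DataFrame({"age": [22.0, None, 35.0, 41.0]})
test = pd.DataFrame({"age": [None, 60.0]})

# Learn the statistic from the training split only.
MEDIAN_AGE = train["age"].median()

# Apply the same value to both splits; test never influences it.
train_clean = train.fillna({"age": MEDIAN_AGE})
test_clean = test.fillna({"age": MEDIAN_AGE})

print(MEDIAN_AGE)                   # 35.0
print(test_clean["age"].tolist())   # [35.0, 60.0]
```

The "cheating" variant would compute the median over train and test together, which leaks information about the test set into preprocessing.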

glad skiff
#

hey all, does anybody have experiences designing RAG systems? I have a bunch of PDFs, and I'm working on a RAG pipeline, and have been researching on different approaches (already went with the very simple LlamaIndex and Langchain thing).

I have been seeing a lot of folks recommending the "hybrid" approach of using both vector and keyword-based search. Now the vector part is "ok": you read the docs, embed them, save them into the vector store and then perform a search. The regular search by itself is also clear: read the docs, store them in a db, and perform a regular search.

what I'm not following is how to mix those two approaches - does that mean I need to duplicate the data? Eg having it as regular text in elasticsearch for example, and also have it as vector? Any ideas?

rugged comet
#

Would anyone here be willing to review my Jupyter notebook for a school project? It's about LinearRegression, LassoCV, and RidgeCV.

abstract wasp
serene scaffold
verbal venture
#

can someone explain in object detection the relevance of grids on the image, bounding boxes, and anchor boxes

winter drift
sterile wyvern
#

@boreal gale is it possible to run an initial function with 1 set of parameters but then use gridsearch with the performance from the initial function and a group of parameters?

bronze flint
#

If that's what you meant

#

Anyways
Day 2 of me trying to find someone who has experience with Apache spark and Structured data streaming
If you are that person please let me know, i would love to hear your opinion on how it handled big data 🙂
I will be dealing with possibly infinite amount of data so i have to choose proper tool

sterile wyvern
young egret
#

Is there a way to right outer join tables based on a condition? Any links would be greatly appreciated 🙂

serene scaffold
young egret
#

I'm looking to join 2 tables using the minimum difference between 2 columns. I don't really know what to do, since my 2 tables have different numbers of rows.

serene scaffold
#

so you'll have to come up with an approach that does not involve joining

young egret
#

My 2 tables have the same IDs. But I don't know if that'll work because they still have different numbers of rows. I've tried concat the 2 dataframes, but I really want them to be on the same row or some other way I can interact with one another.

#

I'm now out of ideas and looking for help, a way to solve this semi matching table thing.

agile cobalt
#

sounds like a normal many-to-many outer join?

young egret
#

Yes I'm looking for a many to many join

agile cobalt
#

you can specify how= to tell pandas whether to do an inner/left/right/outer join
as for many to many... pretty sure that's the default and pretty much the only supported option (excluding the validate)

#

!d pandas.merge

arctic wedgeBOT
#

```python
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)
```
Merge DataFrame or named Series objects with a database-style join.

A named Series object is treated as a DataFrame with a single named column.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

Warning

If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
agile cobalt
#

maybe take a look at the User Guides

young egret
#

I'll try that. Can you explain further what it would do?

#

What will happen to the empty rows? Do I need to fill anything?

#

How many rows will the new table have?

agile cobalt
#

if you use inner, it'll exclude rows that do not have a matching record on the other table
if you use left/right, it'll just throw in nulls/Nones/NAs for the right/left table where there were no matching records for the left/right, respectively
if you use outer, it'll throw in NAs in both sides where there is no match
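
To make the row counts concrete, here is a small sketch with two made-up tables that share an `id` column (the column names and values are invented for illustration). Duplicate ids on both sides give the many-to-many behaviour: every matching pair becomes its own row.

```python
import pandas as pd

# id 1 appears twice on each side, id 3 only on the left, id 4 only on the right.
left = pd.DataFrame({"id": [1, 1, 2, 3], "left_val": ["a", "b", "c", "d"]})
right = pd.DataFrame({"id": [1, 1, 2, 4], "right_val": ["w", "x", "y", "z"]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="id", how=how)
    row_counts[how] = len(merged)

print(row_counts)  # {'inner': 5, 'left': 6, 'right': 6, 'outer': 7}
```

id 1 contributes 2 x 2 = 4 rows and id 2 contributes 1, so the inner join has 5 rows; left and right each add one unmatched row filled with NaN, and outer adds both.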

agile cobalt
mystic birch
#

does anyone know a platform to use random forest in Python on Windows? I tried tensorflow but it only works on macOS and Linux

agile cobalt
young egret
silk axle
#

I'm trying to train an AI that will correctly classify an amazon review with the corresponding rating that was left (1/2/3/4/5 stars), but the AI's accuracy isn't improving at all in the epochs. What does this typically indicate? Fwiw the loss is decreasing (whatever that means Shrug)

In fact, the accuracy is 3.79%, which is worse than the "pick a number at random" (would give 20%)

#

```python
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(24, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
```
this is the model that I'm using (copy-pasted from an example)

bronze flint
#

It can indicate anything

silk axle
#

The config?

bronze flint
#

I suggest you get into ML properly from the start if you are interested

#

You could check Stanford course for the starters

#

I think it is free on youtube

silk axle
#

I know some basics of ML, just not how to choose the right hyperparameters

bronze flint
#

That's entire point of engineering the model today, fine tuning hyperparameters

silk axle
#

What does "loss" measure?

true geode
#

The error essentially, i.e, how far the predicted value is from the actual value.
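
A tiny plain-Python sketch of why loss can fall while accuracy stays flat, assuming a 5-class problem scored with cross-entropy (the probabilities are made up): loss measures how much probability the model puts on the right class, while accuracy only checks whether the top guess is right.

```python
import math

def cross_entropy(probs, true_class):
    """Loss for one example: negative log of the probability given to the true class."""
    return -math.log(probs[true_class])

def predicted_class(probs):
    """Index of the class with the highest probability."""
    return max(range(len(probs)), key=lambda i: probs[i])

true_class = 4  # e.g. a 5-star review, using 0-based class indices

# Hypothetical model outputs for the same example at two points in training.
early = [0.30, 0.25, 0.20, 0.15, 0.10]
later = [0.28, 0.22, 0.18, 0.12, 0.20]

print(round(cross_entropy(early, true_class), 3))  # 2.303
print(round(cross_entropy(later, true_class), 3))  # 1.609
print(predicted_class(early), predicted_class(later))  # 0 0 -> still wrong both times
```

The loss improved because the true class went from 10% to 20%, but the top guess is still class 0, so accuracy for this example is unchanged.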

silk axle
past meteor
past meteor
silk axle
past meteor
#

Np, ping me if you have any specific questions

true geode
past meteor
#

Well, it's mostly for this specific case

#

I'm on frequently enough to notice if there's a question where I can help though 🙂

echo mesa
#

@past meteor I've been going through a book called "data science first principles with python", which I found in the Python Discord server resources. It's a very good book and I've been enjoying it. My question: when the book introduces, say, linear algebra, it talks about it for a while and then at the end of the chapter lists other books you can read if you want to know more. Should I pursue those books, and if so, how? Should I wait until I finish the whole book, or stop reading it and move on to them? Or should I just read the book, note everything down including the recommended books, do a very basic project, and then read those books if I decide to move into a field where the mathematics is needed?

echo mesa
#

and other helpers perhaps?

quaint gorge
#

what models are used to solve such a problem? Can someone just point me in the right direction and i will do some research.

Looking at customer feedback from surveys and I need to categorize the feedback based on some grouping logic, for example:

"I didn't buy the item because i have no money " ---> Budgeting

left tartan
past meteor
#

I like having different opinions on the same topics. I also frequently read book 1, then 2 then 1 then 3 and then 2

#

As I mentioned: you need to get really comfortable with not understanding things in detail

crystal axle
#

is there any way to get a head start in math and its implementation in programming before my studies at university? For reference, I will be studying an AI course, Bachelor's degree.

left tartan
crystal axle
# left tartan Of course. Ask your university for a curriculum map: a list of courses you'll ne...

I tried to look at their website, and I believe it's a new course they provide, so I really can't click on any of the modules that they list. I only have the names and short explanations of them, and even some of the explanations aren't complete and finish with "...". I will email them and ask what I can expect. I asked in this channel to see if anyone would recommend books, websites, etc. to help with the study.

crystal axle
left tartan
#

Then, either more calc, or linear algebra... and definitely some calculus-based stats.

#

Number one tho is: strong algebra / pre-calc skills. Calculus isn't inherently hard, it's the lack of algebra mastery that makes it hard.

crystal axle
#

I'm not sure whether my university would be the same as the others, but I would try and look that up

#

But after that, what do I do? How do I study them?

left tartan
crystal axle
#

Hopefully I can get more knowledgeable in programming and math, I don't want the next year to be harder.

left tartan
#

(This is US-centric advice but I assume the UK curriculum is similar)

#

My advice is primarily about making freshman year as easy as possible, not that calculus is more important than other maths.

crystal axle
#

I believe so, but the study curriculum is different it's mostly focused on search and independent learning, that's why I'm very focused on math since it would be my first time in this kind of studying, add to that I have no idea how math can be studied independently. Yes they would explain it in classes, but I think you'd still have to study harder on yourself.

#

As for programming, I'm trying to practice more and I can have an idea of how I can study independently.

left tartan
crystal axle
#

This might be obvious, but I should focus on pure programming and Python, right? After this I can take time to learn math and its implementation?

left tartan
#

And, I would suggest studying both. You’ll never ‘finish’ learning Python, nor will you finish learning math. You’re just trying to get better, not be perfect

white coral
#

guys i have a few doubts can i ask here?

#

in this code of mine:
`import numpy as np
import pandas as pd
import statsmodels.api as sm

# Generate dummy data

np.random.seed(123)
X = np.random.normal(size=(100, 5))
y1 = np.random.normal(size=100)
y2 = np.random.normal(size=100)
data = pd.DataFrame(X, columns=['x1', 'x2', 'x3', 'x4', 'x5'])
data['y1'] = y1
data['y2'] = y2

# Fit model for y1

X = sm.add_constant(data[['x1', 'x2', 'x3', 'x4', 'x5']])
model1 = sm.OLS(data['y1'], X).fit()

# Fit model for y2

model2 = sm.OLS(data['y2'], X).fit()

# Print model summaries

print(model1.summary())
print(model2.summary())

# Fit joint model

X = sm.add_constant(data[['x1', 'x2', 'x3', 'x4', 'x5']])
y = data[['y1', 'y2']]

model_joint = sm.OLS(y, X).fit()
results_df = pd.DataFrame()
results_df['Coefficients Y1'] = model_joint.params.iloc[:, 0]
results_df['Coefficients Y2'] = model_joint.params.iloc[:, 1]
print(results_df)
results_df['Std Errors Y1'] = model_joint.bse.iloc[:, 0].values
results_df['Std Errors Y2'] = model_joint.bse.iloc[:, 1].values
print(results_df)`

i am getting the following error :
ValueError                                Traceback (most recent call last)
<ipython-input-1-e65d97313ad8> in <cell line: 34>()
     32 results_df['Coefficients Y2'] = model_joint.params.iloc[:, 1]
     33 print(results_df)
---> 34 results_df['Std Errors Y1'] = model_joint.bse.iloc[:, 0].values
     35 results_df['Std Errors Y2'] = model_joint.bse.iloc[:, 1].values
     36 print(results_df)
/usr/local/lib/python3.10/dist-packages/numpy/core/overrides.py in dot(*args, **kwargs)
ValueError: shapes (100,2) and (100,2) not aligned: 2 (dim 1) != 100 (dim 0)

#

please can someone here help me??

desert oar
#

@white coral if you look at the full "traceback" you'll see that this error comes from inside statsmodels and isn't related to your dataframe. i can reproduce your error but i'm not sure what causes it. you'll see the error if you just do print(model_joint.bse).

#

i'd argue that this is a bug in statsmodels. if you did something invalid, you should get a helpful error message, not this

crystal axle
# left tartan First: find out what the intro language is at your college. It might not be Pyth...

I believe it's Python, but I will look that up, but does it matter, programming is the same isn't it? And to keep in mind, I'm already making some projects and learning more programming using Python.

Yes, I know that programming isn't something you can master, it's always updating and you have to learn things that you didn't know before, but what I meant by "finish" is to get to a high level.

rotund turtle
#

Hey

left tartan
crystal phoenix
#

I have a numpy array of 20000 values. How can I plot the standard deviation for sample size 1-20000? (Ox is sample size, Oy is standard deviation)

hallow cargo
#

Does anyone here have any experience with tf.data and tensorflow? I have windowed my data for timeseries with window size of 512 and batch size of 2048, and I have 16 features and a total of 161873 datapoints before windowing or batching. I am fairly new to tensorflow, being more experienced in the mathematics of ML. I keep getting this error, and I am unsure how to set up feature layers to make my data compatible. I am more or less randomly putting something into input_shape= trying to figure out how it works.

Error:

ValueError: Exception encountered when calling layer "dense_features" (type DenseFeatures).

We expected a dictionary here. Instead we got: 

Call arguments received by layer "dense_features" (type DenseFeatures):
  • features=tf.Tensor(shape=(None, 2048, 512, 16), dtype=float32)
  • cols_to_output_tensors=None
  • training=None
#

This is my model and feature layers:

data = Data('TData/train.csv')

count = 64
dataset = data.data_ds.take(int(count*0.8))
dataset_cv = data.data_ds.skip(int(count*0.8))

numeric_features = [tf.feature_column.numeric_column(feat) for feat in data.column_names[:-1]]
feature_layer = tf.keras.layers.DenseFeatures(numeric_features, input_shape=(2048, 512, 16))
# output_bias = tf.keras.initializers.Constant(init_bias)
model1 = Sequential([
    feature_layer,
    LSTM(units=64, stateful=True),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),
    Dense(units=8, activation='tanh', kernel_regularizer=tf.keras.regularizers.L2(0.16)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dropout(0.3),
    Dense(units=1, activation='sigmoid') # , bias_initializer=output_bias)
])

# model1.build(input_shape=(64, 300, 16))

cp = ModelCheckpoint('ModelDatasetTesting/', save_best_only=True)

# model1 = tf.keras.models.load_model('ModelV1/')
model1.compile(loss=BinaryCrossentropy(), optimizer=Adam(learning_rate=0.0001), metrics=['accuracy']) # , tf.keras.metrics.Precision(), tf.keras.metrics.Recall()

model1.fit(dataset, validation_data=(dataset_cv), epochs=10, batch_size=2048, callbacks=[cp]) # , class_weight=class_weight
sterile wyvern
latent musk
#

hi! does anyone have experience with vectorization (like changing words to vectors)?

#

i'm trying to use word2vec to vectorize a column of my dataframe, but tbh i've never done it before, so i'm rather confused on how to do it

lapis sequoia
#

yo, can I share a model and get some opinions??

serene scaffold
lapis sequoia
untold glade
#

Hello!, can I install tensorflow/tfjs-node using python 3.x? I read that the best option is using Python 2.7

scenic parcel
#

Have any of you guys ever scraped / crawled reddit and analyzed it

hallow light
#

Hey guys what algorithm do you guys recommend for anomaly detection? I'm trying to build a model that checks gas meter rates and if it gets out of range flag the value.

odd meteor
odd meteor
polar zodiac
#

Hello,
How to work with Over-fitting model?

odd meteor
latent musk
#

i asked chatgpt about it too and it gave me some code, but requires that i download the full pre-trained word2vec model

is there another way to do this without downloading multiple gigabytes of word2vec?

odd meteor
# polar zodiac Hello, How to work with Over-fitting model?

You'd have to reduce your model's complexity so that its high variance drops.

Meanwhile there are several ways you could approach this. We can categorise them into 2: Data-Centric & Model-Centric approaches.

  1. Data-Centric
  • Data Augmentation
  • Resampling (upsampling / downsampling)
  • SMOTE (I personally haven't seen a significant impact from this approach when compared to other approaches)
  2. Model-Centric
  • Use Regularisation
  • Early Stopping / Other callbacks
  • Hyperparameter Tuning
  • Adjusting the value of the scale_pos_weight parameter in your model
  • Adjusting the Class Weight parameter
odd meteor
# latent musk sorry I'm rather new to programming lol could you specify what you mean by token...

You didn't offend me so no need to apologise for doing nothing wrong 😊. Meanwhile, welcome, I hope you're enjoying your programming experience thus far?

So tokenization is just a fancy way of saying "breaking down the walls of Jericho into tiny stone pebbles".

Only that this time, in NLP, this 'wall' isn't an actual 'wall' but a document or textual data.

You can decide to break this 'wall' into pebbles of different sizes. So we have different types of tokenization, but I'll just highlight sentence tokenization and word tokenization.

  • Word tokenization: When you break down your text data into individual words. For example "I ate jollof rice this morning" becomes [ 'I', 'ate', 'jollof', 'rice', 'this' , 'morning' ]

  • Sentence Tokenization: this is where our document / text data is broken down into individual sentences.
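
Both kinds can be sketched with nothing but Python's `re` module. This is a deliberately naive illustration (the second sentence is made up, and the regex patterns are assumptions); real tokenizers handle abbreviations, punctuation and other edge cases far better.

```python
import re

text = "I ate jollof rice this morning. It was delicious!"

# Naive sentence tokenization: split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Naive word tokenization: runs of letters, digits and apostrophes.
words = re.findall(r"[A-Za-z0-9']+", sentences[0])

print(sentences)  # ['I ate jollof rice this morning.', 'It was delicious!']
print(words)      # ['I', 'ate', 'jollof', 'rice', 'this', 'morning']
```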

NLTK & Gensim are both libraries used for doing some interesting stuff in NLP.

So, once you've installed Gensim on your machine you could easily leverage its models module to access Word2Vec embeddings.

I'll point you to a website for more clarity shortly

latent musk
#

Checking out the website right now

polar zodiac
odd meteor
echo mesa
#

Guys, what courses should I take on uni that would be really useful for ai and machine learning. I was thinking about doing just mathematics since I already know how to code and taking just computer science wouldn't make sense i think, or perhaps taking a joint course. I was also wondering if there are any courses just about machine learning because I didn't find anything but mathematics and computer science. Any thoughts?

odd meteor
# echo mesa Guys, what courses should I take on uni that would be really useful for ai and m...

Any of these 3 majors are cool in building solid foundational knowledge. Statistics, Computer Science, and Mathematics.

Nonetheless this field isn't discriminatory, so you might as well decide to study Fishery and still end up doing Data Science (so long as you are ready to put in the needed work)

Many universities now have a data science undergraduate program. So it depends on where you reside and the schools you're interested in.

drowsy viper
#

Hi there, general question here: how popular is PyTorch Lightning?

past meteor
#

What both have in common is they go very deep into statistics and math. We tended to have a bit more domain knowledge and knowledge of other methods that may be relevant (for instance operations research) for data adjacent problems and they knew (a lot) more programming and a bit more math.

#

There's also tons of people doing data science that come from any quantitative field such as Bio Engineering / Bioinformatics etc.

orchid agate
#

Hi all,

I have been working on a proof-of-concept AI tool for my current job.
The purpose of the tool is to map an arbitrary string into a string from the list.
I have a dataset of arbitrary strings and corresponding strings from the target list. The tool is written in Python.
Here is the source code:
Using **LinearSVC** model: https://pastebin.com/R6LNrJ4w
Using **SGDClassifier** model: https://pastebin.com/bxcug3Vb

This code was written using a ChatGPT-like tool and scraps of information on the scikit-learn library that I could understand (I am very far from Python and ML itself).
The dataset consists of 3 million+ mappings (each of the two files is approximately 100 MB).
I would be very grateful if there are people here who can help me with at least some of the related questions:

  1. The fit() method executes well and fast in both models if I cut off the dataset to be around 10k mappings. However, when I stay with the original 3 million, it takes a very long time (started approximately 48 hours ago and still running).
    My PC specs are 32GB of RAM and an i5-12400f CPU. Is this an adequate time for such a dataset? Could it be at least roughly estimated how much more time is left?
  2. Is this a good way of using scikit-learn in this manner for this kind of tool? I would be grateful if you could recommend other more effective approaches or ready-made solutions.
  3. Can it be determined which of the models (LinearSVC or SGDClassifier) has more pros than cons for this tool?
  4. In the current implementation, when I turn off the program or PC, all the trained data from RAM is vanishing. How can I save the trained progress to the hard drive and resume the training from it?

Thank you!

echo mesa
# odd meteor Any of these 3 majors are cool in building solid foundational knowledge. Statist...

Hmm I'm planning to go to the UK since I'm in Ireland, I was just wondering that what should I go with. Going just math would be interesting as I might need to learn stuff that I won't ever use in machine learning and data science but if I would go with Mathematics then I could literally self learn almost everything in terms of data science and machine learning as I have a really strong foundation in mathematics. I just wondered whether that's a good decision

buoyant vine
#

How do you guys go about testing multi-label classifiers.

Currently I have a classifier model which is trying to predict one label based on the content of the text, but realistically this should be multi-label, but I have a couple questions around testing and evaluating the model:

  • In the situation where the model were to produce multiple labels, how would you score that against a dataset of single labels?
  • How do you evaluate weight in this case? I.e. Does the score/confidence the model gives mean anything to the metrics? Is the order something I should care about?
  • If I shouldn't care about the order, do I just say all(expected_label in returned_labels for expected_label in expected_labels) to count as a 'correct' score?
sharp nimbus
buoyant vine
#

I.e. a piece of text being labelled as News rather than [News, Business, Politics] for example

sharp nimbus
buoyant vine
#

hmm?

#

Atm it only produces a single label, it should (and will) get moved over to a multi-label version though

#

since single-label doesn't really work

sharp nimbus
#

ok I'm guessing you want your model to be like a hierarchical classification thing? or am I confused

past meteor
#

Which means you can compute F1-scores as well etc

buoyant vine
#

The primary issue we have atm, as shown by the confusion matrix, is that the single-label model runs into trouble where the topics/text can easily fall into other categories

past meteor
past meteor
#

The absolute GOAT evaluation method, but it's really made for binary classification sadly, you can finesse it into working for more cases.

buoyant vine
#

partyparrot We are also going to ignore the fact that the current dataset is single label only atm

past meteor
#

Imagine you make one of these per label (one versus all) and have a drill-down that business can use to have one per label-label combination

#

Make sense?

buoyant vine
#

yeah

#

that is a lot of graphs 😅

past meteor
#

Indeed, there's other ways. It largely depends on the amount of categories you have tbf

buoyant vine
#

1400+

#

I think in total xD

past meteor
#

If you want a single number I'd look towards precision and recall. Are you familiar with them?

buoyant vine
#

not really, I do not typically do AI

past meteor
buoyant vine
#

right

past meteor
#

Note that TP + FP => the amount of times your model said it was true, so precision means "the proportion of times it was really right when your model said it was right"

#

We can do the same for recall: TP + FN = The amount of times it was really label A, so recall means "the proportion of times your model said it was A when it should've been A"
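
The two definitions above, as a small plain-Python sketch for a single label (the label lists are invented for illustration, and `precision_recall` is a hypothetical helper name):

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one label, treating it as 'positive' and everything else as 'negative'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # said A, was A
    fp = sum(t != label and p == label for t, p in pairs)  # said A, was not A
    fn = sum(t == label and p != label for t, p in pairs)  # was A, said something else
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "A", "B", "C"]

precision_a, recall_a = precision_recall(y_true, y_pred, "A")
print(precision_a, recall_a)  # both 2/3: TP=2, FP=1, FN=1
```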

buoyant vine
#

Question, how does that get calculated when we say "Model says it is [A, B, D, E] but it should be [A, B, C]"

#

each label is just another hit/miss count?

past meteor
#

Good question! You calculate the precision and recall for each category and then you have different strategies to combine it. The most obvious one is averaging.

sharp nimbus
past meteor
#

You should think of the computation of the precision and recall more per batch (set of rows) than per instance (set of columns)

past meteor
#

I'm a bit lazy but I can write it out in full if you want 🤣

buoyant vine
#

😅 I think it's fine

#

Second question, has anyone tried llama2 for text classification?

past meteor
#

Notice how we can drill-down again, you can combine Precision and Recall into something called the F1 score (basically their harmonic mean), which you can drill-down into P and R for an aggregation over all classes and then drill-down again to P and R per class, apart
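
A sketch of that drill-down for the multi-label case, in plain Python. The label sets are made up (the first row is the "[A, B, D, E] vs [A, B, C]" example from earlier), and scoring an undefined precision or recall as 0 is an assumption, not the only convention.

```python
def per_label_scores(true_sets, pred_sets, labels):
    """Return {label: (precision, recall, f1)}, counting each label as its own hit/miss."""
    scores = {}
    pairs = list(zip(true_sets, pred_sets))
    for label in labels:
        tp = sum(label in t and label in p for t, p in pairs)
        fp = sum(label not in t and label in p for t, p in pairs)
        fn = sum(label in t and label not in p for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

true_sets = [{"A", "B", "C"}, {"A"}, {"B", "D"}]
pred_sets = [{"A", "B", "D", "E"}, {"A"}, {"B"}]
labels = ["A", "B", "C", "D", "E"]

scores = per_label_scores(true_sets, pred_sets, labels)
# Macro average: plain mean of the per-label F1 scores.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(labels)

print(scores["A"])  # (1.0, 1.0, 1.0)
print(scores["D"])  # (0.0, 0.0, 0.0)
print(macro_f1)     # 0.4
```

Macro averaging treats every label equally, so with 1400+ categories rare labels weigh as much as common ones; other averaging strategies weight by how often each label occurs.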

buoyant vine
past meteor
#

No

buoyant vine
#

Thumbs_Up ty

lapis sequoia
polar zodiac
polar zodiac
lapis sequoia
#

What do you use the most? LogisticRegression? Random forest overfits so often

#

You could also just use statsmodels and run all of the features (if there are not too many) and see what the R-squared score is if it is not discrete.

misty flint
blazing cape
#

input size torch.Size([1, 4, 320, 320]) not equal to max model size (1, 3, 320, 320)

#

pls send help i am under the water

odd meteor
odd meteor
odd meteor
echo mesa
#

and computer science would be about learning how to code which I already know how to

left tartan
odd meteor
echo mesa
left tartan
#
Harvard University

Harvard University is devoted to excellence in teaching, learning, and research, and to developing leaders in many disciplines who make a difference globally.

echo mesa
odd meteor
# echo mesa well yeah but you said that you would do mathematics last and you'd go with stat...

Statistics and Math are kinda related but they are two different fields. Mathematics is a broad discipline that deals with numbers, quantities, shapes, and abstract structures. It includes various branches such as algebra, calculus, geometry, and more, providing the theoretical foundation for many other fields, including statistics.

Statistics specifically focuses on data. It involves collecting, analyzing, interpreting, and presenting data to infer proportions in whole from those in a representative sample. While it utilizes mathematical tools and principles, statistics is often more applied, using mathematical theories to solve real-world problems related to data analysis. It's a discipline heavily grounded in real-world applications, including decision-making in business, health, politics, and other domains.

For more context and better explanation: https://www.quora.com/What-is-the-difference-between-mathematics-and-statistics

Quora

Answer (1 of 62): Frequently, when people hear the word "statistics", they think of using formulas and spreadsheet programs to analyze data by measuring as many things as possible about the samples being studied. The reason why I don't quite fit in either category is that I don't work with statis...

echo mesa
lapis sequoia
#

I got into DS with an economics degree so I do not know

odd meteor
# echo mesa hmm, then i dont really know what to go with

Read the attached Quora page for more clarity. The goal is not to confuse you though.

Meanwhile, there's a lot more you could learn from a Computer Science major other than coding (even if you already know how to code). Just do your own research and finding then go with the one that appears more interesting / fun to you.

At the end of the day, this field isn't discriminatory. So you could even study Actuarial Science, Zoology, Soil Science, or even Human Kinetics (a.k.a 'jumpology' 😀 ) and still end up in Data Science.

lapis sequoia
echo mesa
left tartan
#

Many schools are now offering DS programs, which are like a hybrid of CS and Stats.

#

But I don't know UK.

lapis sequoia
odd meteor
#

More so, with a Statistics degree (so long as you end up in a well-to-do school) you're most definitely gonna have math core courses and electives where you'll learn Calculus, Linear Algebra, Laplace Transform, etc. To me, it's like using 1 stone to kill two birds.

With a core Maths program, you probably won't go as far as covering Inferential Stats, Econometrics, Operations Research, Time Series, Gambler's Ruin, Stochastic & Maximum Likelihood Estimation, ANOVA & MANOVA, and all that interesting stuff in Stats.

But don't take my word for it. Do your own research and go with the program that you think will give you wings like red bull 😀

At the end of the day, whether you do Animal Husbandry, or Urban & Regional Planning as your 1st major, you can still get into Data Science.

echo mesa
echo mesa
echo mesa
mighty halo
#

ohh really ? @echo mesa

echo mesa
mighty halo
#

stats seems more useful than the crazy complex math

mighty halo
#

im drowning in an ocean of math ... not knowledgeable enough to see the horizon

mighty halo
#

just messing with multicore mcus and python , wondering if mojo-python is workable

gloomy creek
#

is pytorch good to learn as an intermediate

lapis sequoia
tender sparrow
#

how do i

df.rolling(100).groupby
>> AttributeError: 'Rolling' object has no attribute 'groupby'

😦

serene scaffold
serene scaffold
# gloomy creek is pytorch good to learn as an intermediate

"learning libraries" is not a good approach to learning ML. None of them are end-to-end tools (pretty much anything will involve two or more of them), and they're designed from the assumption that you already understand what you're trying to do.

#

Don't use ChatGPT to answer questions.

tender sparrow
#

hey i am trying to define my problem as simple as possible to understand

serene scaffold
tender sparrow
#

yea that's why now i am trying to define my problem as simple as possible again 😄 give me a second ha

#

it's the first thing that came to my mind, knowing how wonderful the python and pandas api can be i thought that would be so straightforward, seems logical

tender sparrow
#
# original dataframe
df = pd.DataFrame({
    'x': ['a',      'b', 'c', 'd', 'e', 'e', 'b', 'c', 'd', 'a'], 
    'y': [ 0,        1,   2,   3,   4,   4,   3,   2,   1,   9]}
)

# grouper
gr = df.groupby('x')['y']

# operations
assert gr.sum().idxmax() == 'a'
assert gr.sum().max() == 9

# desired output of the two operations over a rolling window of 2 rows
# z = the x whose summed y is largest within the window
# s = that largest per-x sum of y within the window
pd.DataFrame({
    'x': ['a',      'b',  'c', 'd', 'e', 'e', 'b', 'c', 'd', 'a'], 
    'z': [ pd.NA,   'b',  'c', 'd', 'e', 'e', 'e', 'b', 'c', 'a'],
    's': [ pd.NA,    1,    2,   3,   4,   8,   4,   3,   2,   9],
})

now when i think of it it would be very interesting how to simplify that question i am bad at that
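One way to get the desired table, sketched as a plain loop rather than a built-in pandas feature: slide a 2-row window over the frame, group the rows inside each window by `x`, sum `y`, and record the winning label and its sum. The `window` variable and the `out` frame are names made up for this sketch, and a loop like this is only reasonable for small frames.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd', 'e', 'e', 'b', 'c', 'd', 'a'],
    'y': [0, 1, 2, 3, 4, 4, 3, 2, 1, 9],
})

window = 2  # number of rows per rolling window
z, s = [], []
for end in range(len(df)):
    if end + 1 < window:
        # Not enough rows yet for a full window.
        z.append(pd.NA)
        s.append(pd.NA)
        continue
    # Rows in the current window, grouped by x and summed.
    sums = df.iloc[end + 1 - window:end + 1].groupby('x')['y'].sum()
    z.append(sums.idxmax())  # label with the largest summed y
    s.append(sums.max())     # that largest sum

out = df[['x']].assign(z=z, s=s)
print(out)
```

Ties inside a window go to whichever label `idxmax` sees first, which after `groupby` is the alphabetically smaller one.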

lapis sequoia
serene scaffold
lapis sequoia
#

I agree

#

People who rely solely on ChatGPT are not better off. It cannot work through an entire dataset for you or make sense of it on your behalf. I only use it to add little things that I cannot think of for a project. Even then, if I didn't know exactly what to ask it, ChatGPT would do nothing of any real significance

#

And OpenAI is wrong a lot as well

serene scaffold
#

OpenAI is a company. Are you talking about them, or ChatGPT?

lapis sequoia
#

Sorry, I mean ChatGPT

serene scaffold
#

@tender sparrow did you try df.groupby('x')['y'].rolling(2).max()?
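For reference, a small sketch of what that call returns on the example frame from above. It rolls within each `x` group (consecutive rows that share a label), so it answers a different question than grouping inside each 2-row window. The name `per_group` is just for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd', 'e', 'e', 'b', 'c', 'd', 'a'],
    'y': [0, 1, 2, 3, 4, 4, 3, 2, 1, 9],
})

# Rolling max of y over the last 2 rows *of the same group*.
# The result is a Series indexed by (x, original row index).
per_group = df.groupby('x')['y'].rolling(2).max()
print(per_group)
```

The first row of every group comes out as NaN because a group needs two rows before the window is full.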

polar zodiac
elfin stirrup
#

can someone help me understand this problem i got the answer to be 2 data samples but it was incorrect.
i found id3 and id7 to be classified as c_1 based on the formula

odd meteor
# polar zodiac umm sorry what does that mean?

I mean the response variable (y) in your dataset: does it have balanced class labels?

Presuming your dataset's target variable is y, and it contains 2 classes (1s and 0s), check whether the number of times 1 and 0 appear is roughly equal.

polar zodiac
#

No...
For 1's its 72
and for 0's its 32

scenic parcel
royal crest
#

what is it for?

odd meteor
# polar zodiac No... For 1's its 72 and for 0's its 32

Since the two class labels don't have the same proportion, there's class imbalance in your data.

So, I'd suggest using stratified random sampling first (use stratify=y when splitting your data with train_test_split) and see how the model performs before trying any of the approaches I suggested earlier
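A minimal sketch of that split, assuming scikit-learn and using made-up data with the class counts mentioned above (72 ones, 32 zeros). The feature column, `test_size`, and `random_state` are placeholders, not values from the actual project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with the reported class counts: 72 ones, 32 zeros.
y = np.array([1] * 72 + [0] * 32)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the 1/0 ratio (about 69% ones) the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

for name, labels in [("all", y), ("train", y_train), ("test", y_test)]:
    print(name, np.bincount(labels), round(labels.mean(), 3))
```

Without `stratify`, a small test set can end up with a noticeably different class ratio than the training set just by chance.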

scenic parcel
# royal crest what is it for?

Getting datasets for research/analysis. If a company doesnt want that dataset to be easily available, then you might have to torrent it

muted hollow
#

Hey guys, can you recommend me some NLP project to make so I can better my resume, I have only tried the health care chatbot and study chatbot project only

jagged hedge
#

Is there any pre-trained LLM that can process a PDF document directly, without converting it to text? Once it is converted to text or to an image, the generated text is not accurate for the images or table structures in the PDF. If you know of any such LLM, or any NLP or document-processing model that I can ask questions of, please help

serene scaffold
#

what do you need to do with the images and tabular data? and are these images of text?

left tartan
jagged hedge
serene scaffold
scenic parcel
gaunt pendant
#

hey guys, anybody with reinforcement learning experience, especially in proximal policy optimization? I'm dealing with convergence issues for the cartpole v1 environment

gaunt pendant
#

?

visual fossil
#

Hey,

Has anyone here reviewed the following book ?

Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow.

visual fossil
visual fossil
abstract wasp
abstract wasp
visual fossil
agile cobalt
unique crown
#

hello folks! i have a question about webscraping with python, can i ask that here?

serene scaffold
plush jungle
#

I'm trying to run the ppo pytorch code from here https://pytorch.org/rl/tutorials/coding_ppo.html and I finally got it working, but now I want to retrofit it to some of the other gym mujoco environments. When I change this line

base_env = GymEnv("InvertedDoublePendulum-v4", device=device, frame_skip=frame_skip, render_mode="human")

to this

base_env = GymEnv("Humanoid-v4", device=device, frame_skip=frame_skip, render_mode="human")

I get

```
  File "C:\python\reinforcement_learning\ppo\pytorch_ppo.py", line 130, in <module>
    collector = SyncDataCollector(
                ^^^^^^^^^^^^^^^^^^
  File "C:\Users\mj\miniconda3\Lib\site-packages\torchrl\collectors\collectors.py", line 677, in __init__
    self._tensordict_out.unsqueeze(-1)
  File "C:\Users\mj\miniconda3\Lib\site-packages\tensordict\tensordict.py", line 2184, in expand
    d[key] = value.expand((*shape, *value.shape[-last_n_dims:]))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expand(): argument 'size' (position 1) must be tuple of ints, but found element of type float at pos 0
```
#

obviously I need to change the shape of something like the observations or the action space or something, but I'm not sure what to change it to. any ideas?

shut girder
#

Hello, I am a beginner to data analysis. Does data cleaning come before or during exploratory data analysis?

frigid dust
sleek palm
#

Hello, i am trying to build a cnn based ml model. Anyone who has any experience in same, please dm me.

polar zodiac
#

except for the graph it got changed to this

odd meteor
odd meteor
# polar zodiac

Evidently, the stratified random sampling you performed certainly improved the model's performance. The model is overfitting here.

Based on your last experiment, you can try to further improve the performance. Try hyperparameter tuning

#

It's usually not so common for the test to outperform the train; although not impossible. Could be due to having small sample size in the test compared to the train.

With more experimentation, I'm certain you can squeeze out more performance from your model

drifting summit
#

hey can someone advise on how to start with ai, im an intermediate python programmer and im pretty good at mathematics. just want to learn about ai before college and not get left behind

storm kelp
#

Working in Pyspark and have made a mess of code which runs but is slow/inefficient:

```python
for row in collected_loci:
    studyLocusId = row['studyLocusId']
    study = row['studyId']

    sumstats = spark.read.csv(f'gs://finngen-public-data-r9/summary_stats/finngen_R9_{study}.gz', header=True, sep='\t')

    df_locus = (
        df
        .filter(f.col('studyLocusId') == studyLocusId)
        .select('studyId', 'fi.chromosome', 'fi.position', 'fi.referenceAllele', 'fi.alternateAllele', 'idx'))

    gnomad_count = df_locus.count()

    df_locus = (
        df_locus
            .join(
                sumstats,
                (df['fi.chromosome'] == sumstats['#chrom']) &
                (df['fi.position'] == sumstats['pos']) &
                (df['fi.referenceAllele'] == sumstats['ref']) &
                (df['fi.alternateAllele'] == sumstats['alt']),
                'inner')
            .sort('idx'))

    matched_count = df_locus.count()
    df_locus.write.csv(f'output/{studyLocusId}/locus.csv', mode='overwrite')

    count_df = spark.createDataFrame([Row(gnomad_variant_count = gnomad_count, matched_variant_count = matched_count)])
    count_df.coalesce(1).write.mode('overwrite').option('delimiter', '\t').csv(f'output/{studyLocusId}/metadata.tsv')

    idxs = [x['idx'] for x in df_locus.select('idx').collect()]
    bm_filtered = bm.filter(idxs, idxs)
    bm_numpy = bm_filtered.to_numpy()
    bm_mirror = bm_numpy + bm_numpy.T - np.diag(np.diag(bm_numpy))
    np.save(f'output/{studyLocusId}/ld', bm_mirror)
```

Looking to see if there are any ways to speed up the execution other than the obvious things like removing the count() calls. 
My initial attempt was to refactor it into functions so that it can be parallelized instead of running iteratively. But this gave me serialization errors due to calling sparkContext within the function.
agile cobalt
#

how large is the original df, how large are each of the sumstats and do you have any duplicate studyId / studyLocusId? (for the last question, not duplicate pairs, but just non-unique values within a column)

#

also shouldn't you be using df_locus instead of df inside of the join? in the df['col'] == sumstats['col'] lines

polar zodiac
storm kelp
storm kelp
#

Surprised it didn't throw an error from that tbh, but there you go

agile cobalt