#data-science-and-ml

1 messages · Page 68 of 1

past meteor
#

They all show you something about the trade-off between FPs and FNs and I guess when you know what you want to optimize you'll find the right plot pretty easily

#

This to me is better than opaque stuff like downsampling, smote etc

warm copper
#

so all I need to do is insert this threshold metric into Random Forest right?

#

Best Threshold=0.230000, F-Score=0.457

#

Im going to use this
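
A note on what "using the threshold" usually means: the threshold is normally not a setting inside the forest itself. It is applied afterwards to the predicted probability of the positive class (in scikit-learn this would come from `predict_proba`). A minimal sketch with made-up probabilities, where `apply_threshold` is a hypothetical helper and not a library function:

```python
def apply_threshold(positive_probabilities, threshold=0.5):
    """Turn positive-class probabilities into 0/1 labels using a custom cut-off."""
    return [1 if p >= threshold else 0 for p in positive_probabilities]


# Made-up positive-class probabilities, for illustration only.
probs = [0.05, 0.22, 0.23, 0.40, 0.91]

default_labels = apply_threshold(probs)                # cut-off of 0.5
tuned_labels = apply_threshold(probs, threshold=0.23)  # cut-off taken from the PR curve

print(default_labels)
print(tuned_labels)
```

Lowering the cut-off flags more samples as positive, which trades precision for recall on the minority class.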

#
             precision    recall  f1-score   support

           0       0.96      0.97      0.96     14684
           1       0.50      0.42      0.46      1053

    accuracy                           0.93     15737
   macro avg       0.73      0.69      0.71     15737
weighted avg       0.93      0.93      0.93     15737
#

ta daaaa @past meteor

#

my f1 score got improved a lot

#
```
              precision    recall  f1-score   support

           0       0.94      1.00      0.97     14684
           1       0.97      0.07      0.13      1053

    accuracy                           0.94     15737
   macro avg       0.96      0.54      0.55     15737
weighted avg       0.94      0.94      0.91     15737```
#

this was before ^^

warm copper
#

this is for the ROC curve

#
              precision    recall  f1-score   support

           0       0.98      0.78      0.87     14684
           1       0.21      0.79      0.33      1053

    accuracy                           0.78     15737
   macro avg       0.59      0.79      0.60     15737
weighted avg       0.93      0.78      0.83     15737
hearty stratus
#

Hi anyone familiar with Pandas here?
I am stuck a bit with groupby and was wondering how to resolve the following problem.

https://stackoverflow.com/questions/76459817/pandas-groupby-year-week

Anything helps 😉

In the meantime I am toying around 😉

night kernel
#

i figured out how to locally install and use an llm model. how can i upload my own texts so that the model responds to me? which models allow this?

#

i used gpt4all vicuna 13b - can i do it with this one?

dusty valve
#

How can I render a pre-computed simulation of between 1000 and 10000 objects (single points of about 5×5 pixels)?

brave ivy
#

good evening everyone

#

I have a question

#

do data scientists still use python 2.7 for data science?

serene scaffold
warm copper
#

omg

#

have you seen anything like this? @serene scaffold

serene scaffold
warm copper
brave ivy
#

oh, I see

serene scaffold
# warm copper

I don't have anything insightful to say about this figure.

brave ivy
#

thank you very much

warm copper
#

I am trying to find the best f1 value

#

for the imbalanced predictor variable

#

Best Threshold=0.923983, F-Score=nan

#

how come the F score is nan

#
precision, recall, thresholds = precision_recall_curve(test_labels, pred_positive)

f1_score = (2 * precision * recall) / (precision + recall)
i_max = argmax(f1_score)
print('Best Threshold=%f, F-Score=%.3f' % (thresholds[i_max], f1_score[i_max]))

plt.figure(figsize=(16, 10))
no_skill = len(test_labels[test_labels == 1]) / len(test_labels)
plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
plt.plot(recall, precision, marker='.', label='Logistic Regression')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()
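
One possible cause of the `nan`, offered as a guess since the data isn't shown: wherever `precision + recall` is 0 the F-score becomes 0/0, and NumPy's `argmax` returns the index of the first NaN it meets. `precision_recall_curve` also returns one more precision/recall point than thresholds, so the last index has no matching threshold. A minimal NumPy sketch that guards both cases (the helper name `best_f1_threshold` is made up for illustration):

```python
import numpy as np


def best_f1_threshold(precision, recall, thresholds):
    """Return (threshold, f1) for the highest F1, treating 0/0 as an F1 of 0."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    denom = precision + recall
    # Divide only where the denominator is positive; other entries stay 0.
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    # Drop the extra final point that has no threshold of its own.
    f1 = f1[:len(thresholds)]
    i_max = int(np.argmax(f1))
    return thresholds[i_max], f1[i_max]


# Toy values: the middle point has precision = recall = 0, i.e. 0/0.
threshold, score = best_f1_threshold([0.5, 0.0, 1.0], [1.0, 0.0, 0.0], [0.2, 0.9])
print('Best Threshold=%f, F-Score=%.3f' % (threshold, score))
```

The same function should accept the three arrays returned by `precision_recall_curve` directly.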
brave ivy
#

are you using matplotlib?

brave sand
#

does anyone know why this is happening?

#

this is on Google Colab

#

I am training on my own images, using the ssd_resnet50_v1_fpn model

brave sand
#

does anyone know where shuffle buffer size is located?

potent sky
potent sky
lapis sequoia
#

what's an easy way to convert jupyter notebooks to output with just markdowns and outputs but no code cells

wooden sail
#

downloading as markdown 😛

#

if you open the notebook in your browser, you can download its current state as markdown. you can probably also do this from the terminal and from vscode, but i wouldn't know how

lapis sequoia
#

@wooden sail it gave me this weird looking file

#

which has code

wooden sail
#

ah you want no code at all?

#

idk if that can be done automatically, but it's very easy to write a script to do it for you

#

all of the code blocks are written in that format, wrapped with ```python ```

#

you could write a 10 liner or so that removes those blocks
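
Such a script might look like the sketch below. It assumes the exported markdown marks code cells with fences opened by three backticks followed by `python` (or `py`), as described above; `strip_python_blocks` is a made-up name.

```python
import re

# Three backticks, built this way so the sketch can itself sit inside a fence.
FENCE = "`" * 3

# A python-tagged opening fence, everything up to the next closing fence,
# and the closing fence line itself.
CODE_BLOCK = re.compile(
    rf"^{FENCE}(?:python|py)[^\n]*\n.*?^{FENCE}[ \t]*\n?",
    re.DOTALL | re.MULTILINE,
)


def strip_python_blocks(markdown: str) -> str:
    """Remove python code fences from exported notebook markdown."""
    return CODE_BLOCK.sub("", markdown)


sample = "# Title\n\n" + FENCE + "python\nprint(1)\n" + FENCE + "\n\nOutput: 1\n"
print(strip_python_blocks(sample))
```

To use it on an export, read the `.md` file into a string, pass it through the function and write the result back out. If nbconvert is installed, something like `jupyter nbconvert --to html --no-input notebook.ipynb` (with `notebook.ipynb` as a placeholder) may be a simpler route, since that flag is meant to leave out input cells.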

lapis sequoia
#

@wooden sail what should I write it for

#

html output?

wooden sail
#

what?

lapis sequoia
#

the loop to remove cide

#

code

#

i want the pdf to look just like the jupyter notebook looks, minus the code cells

#

I did something with nbconvert but it looks ugly

#

Okay I have it figured out

#

what does %%html do btw

waxen sonnet
#

Can someone explain why test accuracy remains constant at 68.782% even though the loss keeps varying?

lapis sequoia
mild dirge
#

If you do classification and your loss is cross entropy, for example, the loss can still decrease without argmax(logits) changing, so the accuracy doesn't change either.
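
A tiny illustration of that point, using made-up logits for two samples at two different epochs. The correct prediction gets more confident and the wrong one gets less wrong, so the mean loss drops while the predicted classes (and the accuracy) stay exactly the same:

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def predict(logits):
    """Index of the largest logit, i.e. argmax(logits)."""
    return max(range(len(logits)), key=logits.__getitem__)


def mean_loss_and_accuracy(batch_logits, labels):
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    correct = sum(predict(z) == y for z, y in zip(batch_logits, labels))
    return sum(losses) / len(losses), correct / len(labels)


labels = [0, 1]
early_logits = [[0.2, 0.0], [1.0, 0.0]]  # made-up outputs early in training
later_logits = [[2.0, 0.0], [0.1, 0.0]]  # made-up outputs a few epochs later

early_loss, early_acc = mean_loss_and_accuracy(early_logits, labels)
later_loss, later_acc = mean_loss_and_accuracy(later_logits, labels)
print(f"early: loss={early_loss:.4f} acc={early_acc:.2f}")
print(f"later: loss={later_loss:.4f} acc={later_acc:.2f}")
```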

waxen sonnet
# lapis sequoia give us more details, what is the loss, what are you predicting etc

It's a binary classifier that predicts whether the price of an asset will go up or down based on past 60 periods of data, the following is the entire training loop:

model = LSTM(input_dim=train_dataloader.dataset.sequences.shape[-1], hidden_dim=HIDDEN_DIM, output_dim=OUTPUT_DIM, num_layers=N_LAYERS, fc_dim=FC_DIM)
criterion = nn.CrossEntropyLoss()
optimiser_lr_e3 = torch.optim.Adam(model.parameters(), lr=0.001)
optimiser_lr_e2 = torch.optim.Adam(model.parameters(), lr=0.01)

train_hist = np.zeros(EPOCHS)
test_hist = np.zeros(EPOCHS)
start_time = time.time()
lstm = []

for t in range(EPOCHS):
    model.train()
    for i, (inputs, labels) in enumerate(train_dataloader):
        
        y_train_pred = model(inputs)
        loss = criterion(y_train_pred, labels)
        
        train_hist[t] = loss.item()
        if t < 15:
            optimiser_lr_e2.zero_grad()
            loss.backward()
            optimiser_lr_e2.step()
        else:
            optimiser_lr_e3.zero_grad()
            loss.backward()
            optimiser_lr_e3.step()
        
    correct = 0
    total = 0
    model.eval()
    for i, (inputs, labels) in enumerate(test_dataloader):
        
        y_test_pred = model(inputs)
        loss = criterion(y_test_pred, labels)
        test_hist[t] = loss.item()

        _, predicted_labels = torch.max(y_test_pred.data, 1)
        total += labels.size(0)
        correct += (predicted_labels == labels).sum().item()
    
    print(f'Epoch {t+1}\n\tTrain Loss: {train_hist[t]:.4f}\n\tTest Loss: {test_hist[t]:.4f}\n\tAccuracy: {(correct/total)*100:.3f}%')
        
training_time = time.time()-start_time
print("Training time: {}".format(training_time))
waxen sonnet
mild dirge
#

Why would you want to do that?

waxen sonnet
#

I'm confused why the accuracy stagnates right after the first epoch. Is there a fault with how I'm calculating it?

mild dirge
#

No probably not

#

It can just be that the loss does decrease whereas the accuracy does not

#

And the loss/accuracy staying constant could be for many reasons

#

Like too simple model, or plateau/local minima

waxen sonnet
#

I'm using the MinMaxScaler() that scikit-learn provides, weirdly enough, this does not happen if I use a different pre-processing technique

mild dirge
#

What is the min and max of your data?

#

Are there outliers?

waxen sonnet
mild dirge
#

minmax doesn't work well if there are very large outliers

#

You'd likely want to standardize it with mean 0 std 1 in that case
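
A small sketch of the effect, using made-up prices with one extreme value. Min-max scaling squeezes all the ordinary values into a sliver near 0 because the outlier defines the range. The z-score version is shown for comparison; it has no fixed range, but the mean and standard deviation are still pulled by the outlier, so it may help less than hoped when the outlier is very extreme.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Made-up data: five ordinary values and one large outlier at the end.
prices = [100.0, 101.0, 99.0, 102.0, 98.0, 5000.0]

scaled_minmax = min_max_scale(prices)
scaled_z = standardize(prices)

print("min-max:", [round(v, 4) for v in scaled_minmax])
print("z-score:", [round(v, 4) for v in scaled_z])
```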

waxen sonnet
#

I'll try that out

maiden geyser
#

Hi i am doing some neural network and i'd like to know if you have some good method to tune the hyperparameters ? (dont hesitate to ping me in your answer please)

#

i'm really a kid in this domain and i need to create a NN (which i've done) but to chose the right number of hidden layers and nb of neurons per layer is something obscure to me

hollow storm
#

Anyone have some knowledge in doing specification curve analysis?

hollow storm
#

In most situations, you want the number of neurons in your hidden layer to be greater than your output layer.

#

A basic nn is 3, 2, 1. With the input layer being three, middle being 2 and 1 being the output.

maiden geyser
brave sand
hollow storm
#

simplest thing I can offer without knowing much is a network that is maybe 16 > 8 > 1.

#

chuck the 16 columns in per neuron

potent sky
potent sky
brave sand
brave sand
brave sand
rare fog
#

This machine learning expert says "learn python it is the number one programming language for machine learning."

Does anyone here know what's different about python, that makes it better for machine learning than other languages?

mild dirge
#

It is just used for ML very often, so there is also a lot of support for it now

wooden sail
#

it's more or less a positive feedback loop. as pccamel says, tons of people use it. this makes people want to write more, better modules for it. which in turn brings in more people, and the cycle repeats

#

as a result, there are several powerful, rich modules for ML in python

rare fog
#

Ah so the big strength of Python for ML is its ML modules

wooden sail
#

as for the "better", probably that python has nice and simple syntax and interfaces very easily with other langs

#

maybe you wouldn't want to implement a matrix multiplication directly in python, but it doesn't matter because people have written amazing code for that in other langs, and you can very easily just call those functions from python

#

that's how you get numpy, tensorflow, pytorch, etc

potent sky
potent sky
pale hemlock
#

hey anyone interested in a 3d tensor model?

dusty valve
#

And what size is it

pale hemlock
#

it creates a 3d tensor model based on 3 axes. hold on, I'll show it to you; it does well on first start

#

!code

arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

pale hemlock
#
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Define the tensor dimensions
dims = ("dim1", "dim2", "dim3", "dim4")

# Create a random tensor
tensor = np.random.rand(*([10]*len(dims)))

# Map each parameter to a point in tensor space and self-label variables
x = []
y = []
z = []
for i, j, k, l in np.ndindex(tensor.shape):
    coordinates = [i/10, j/10, k/10]
    x.append(coordinates[0])
    y.append(coordinates[1])
    z.append(coordinates[2])

# Apply PCA to reduce the dimensionality to 3D
pca = PCA(n_components=3)
coords = np.column_stack((x, y, z))
coords_pca = pca.fit_transform(coords)

# Visualize the tensor coordinates as a scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(coords_pca[:, 0], coords_pca[:, 1], coords_pca[:, 2])
ax.set_xlabel('PCA 1')
ax.set_ylabel('PCA 2')
ax.set_zlabel('PCA 3')
plt.show()

print('Hello world!')
#

that's pretty much it: a 3d block of a tensor, each with its own namespace and a unique identifier for each tensor based on 3d space. figured it would be helpful to have an idea of how a block of tensor can be created

brave sand
past meteor
pale hemlock
#

well i modified lol it works but wow does it take a moment to process lol

#

its small but huge wow

#

if anyone wishes to view the result

#

!code

arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

pale hemlock
#
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

# Define the tensor dimensions
dims = ("dim1", "dim2", "dim3", "dim4")
tensor_shape = [10] * len(dims)

# Create a random tensor
tensor = np.random.rand(*tensor_shape)

# Map each parameter to a point in tensor space and self-label variables
x = []
y = []
z = []
for i, j, k, l in np.ndindex(tensor.shape):
    i_label = f"{dims[0]}-{i}"
    j_label = f"{dims[1]}-{j}"
    k_label = f"{dims[2]}-{k}"
    l_label = f"{dims[3]}-{l}"
    coordinates = [i/10, j/10, k/10]
    x.append((i_label, coordinates[0]))
    y.append((j_label, coordinates[1]))
    z.append((k_label, coordinates[2]))

# Apply PCA to reduce the dimensionality to 3D
pca = PCA(n_components=3)
coords = np.column_stack(([coord[1] for coord in x], [coord[1] for coord in y], [coord[1] for coord in z]))
coords_pca = pca.fit_transform(coords)

# Visualize the tensor coordinates as a scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(coords_pca[:, 0], coords_pca[:, 1], coords_pca[:, 2])
ax.set_xlabel('PCA 1')
ax.set_ylabel('PCA 2')
ax.set_zlabel('PCA 3')

# Label the coordinates
for i, coord in enumerate(coords_pca):
    coord_x = coord[0]
    coord_y = coord[1]
    coord_z = coord[2]
    ax.text(coord_x, coord_y, coord_z, f"{x[i][0]} {y[i][0]} {z[i][0]}")

plt.show()

print('Hello world!')
potent sky
brave sand
#

there's nothing there about buffer size unfortunately

potent sky
#

Yeah that

brave sand
#

man this sucks

potent sky
#

Alternatively you could try looking at the detailed stack trace

#

That could directly give you from within which function this shuffle op was executed and then you can find that in the repo and modify it

brave sand
#
```
2023-06-13 12:18:26.373528: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 1034 of 2048
```
#

i think that is the file?

boreal gale
#

@brave sand have you posted your code? i can't find any in discord history.

boreal gale
#

okay, and how are you running it?

#

i can't access that nor will i request access, could you post it in text form please

brave sand
#

this is the command I use

#

this is what I get:

```
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/object_detection/builders/optimizer_builder.py:124: The name tf.keras.optimizers.SGD is deprecated. Please use tf.keras.optimizers.legacy.SGD instead.

W0613 12:17:45.182555 140099376232256 module_wrapper.py:149] From /usr/local/lib/python3.10/dist-packages/object_detection/builders/optimizer_builder.py:124: The name tf.keras.optimizers.SGD is deprecated. Please use tf.keras.optimizers.legacy.SGD instead.

2023-06-13 12:17:46.277504: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_29' with dtype int64
     [[{{node Placeholder/_29}}]]
2023-06-13 12:17:46.278118: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_25' with dtype int64
     [[{{node Placeholder/_25}}]]
2023-06-13 12:17:56.785031: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 70 of 2048
2023-06-13 12:18:06.777774: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 782 of 2048
2023-06-13 12:18:26.373528: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 1034 of 2048
^C```
#

my ram usage is 12.3/12.7

#

so it's running out of ram

boreal gale
#

post content of models/my_ssd_resnet50_v1_fpn/pipeline.config

brave sand
#

is it because of shuffle = false?

boreal gale
#

disclaimer: i don't use tensorflow. take my advice with a grain of salt.

#

do you know how to take it from here to test this hypothesis? i.e. what changes you need to make to your pipeline config file?

brave sand
#

isn't that low already by default?

boreal gale
#

no that's not set to 11

#

it's saying it's the 11th field in protobuf, which is different.

brave sand
#

so i change the default field right

boreal gale
#

[default = 2048]; is the key point which matches to what you have in the logs

brave sand
#

like default = 1024

boreal gale
#

no don't change the protobuf file

#

change models/my_ssd_resnet50_v1_fpn/pipeline.config

brave sand
#

i don't understand

#

why?

boreal gale
#

e.g.

train_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "annotations/train.record"
  }
  shuffle_buffer_size: 10
}
brave sand
#

I can do that? I mean editing this file and adding that variable

boreal gale
#

pipeline.proto and input_reader.proto i showed you are protobuf specification

they are responsible for defining what models/my_ssd_resnet50_v1_fpn/pipeline.config should look like

i showed them to you merely to show you clues on how you can configure a tensorflow pipeline, this is probably documented in the docs but i didn't look there.

boreal gale
brave sand
#

so the 10 is just random number chosen?

boreal gale
#

yes that's a random number i have chosen, pick whatever you want if your RAM is okay with it

brave sand
#

gotcha, the buffer isn't filling up anymore

boreal gale
brave sand
#

hopefully it's working

#

thank you very much

#

i really appreciate it

past meteor
#

@boreal gale do you (still) use tensorflow?

#

Was always my go-to but I see a lot of people going to Torch. The SoTA time series stuff is also frequently in MXNet so I'm torn.

#

Last time I used torch was for vision and it was not nice because I had a big time constraint and spent most of my time thinking about "How do I do feature X of TF in torch"...

boreal gale
# brave sand i really appreciate it

for posterity's sake, this was how i discovered this while having close to 0 knowledge of actually using TF

  1. search for "shuffle" in TF repo
  2. https://github.com/tensorflow/models/blob/1a3b1cfefeb3171e73db6cefeb5059391223223b/official/core/input_reader.py#L466-L468 jumped out because i noticed that's very similar to the log you posted earlier (tensorflow/core/kernels/data/shuffle_dataset_op.cc:392 - i looked this up last time you posted this)
  3. track how _shuffle_buffer_size is defined.
  4. noticed this is InputReader
  5. looked at https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py as i want to see how InputReader relates to this
  6. couldn't figure out how InputReader is defined, but i guessed it must be pipeline_config_path
  7. noticed the train_input_reader at the end
  8. search train_input_reader in TF repo
  9. found the protobuf specs, and then found the shuffle_buffer_size which matches point 3 (ish)
boreal gale
brave sand
#

i think it's working, I have a loss rate and learning rate

#

wooohooo

boreal gale
past meteor
#

Well for someone that doesn't do a lot of ML at work you sure do know a lot about it.

boreal gale
boreal gale
storm kelp
#

Anyone having issues with VScode not picking up stubs correctly for some modules?

brave sand
#

i don't think the loss rate vs learning rate should be like this?

lapis sequoia
#

Where should I start to learn Data science ?

serene scaffold
arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

lapis sequoia
hollow storm
#

Regarding pandas and plotting using df.boxplot()
I have one extreme value up at 9000, while everything else falls between 500-1000. Without just deleting the row, is there an argument I could put in the () that would ignore values over 1000?

brave sand
potent sky
half hill
rare jetty
#

hi. i’ve got a couple of pdf reports spanning a few years, and i just want to extract one table from these pdfs and convert it into a csv file. does anyone have any tips on how i can go about this? (idk if this would be relevant but i’m using VSCode)

hasty mountain
crimson summit
#

is this formula to calculate the cost of the weights incorrect? I think the formula I circled is missing the sigmoid(1-sigmoid) part of the equation.

night kernel
#

i want to build my own ai chatbot for free based on my own data

how can i do this? from what i know the workflow is as follows:

get a model from huggingface
use langchain to host (is it free?)
fine-tune with lora

have also heard about runpod

what am i missing?

past meteor
crimson summit
#

cause when you do the derivatives I think you have to do 1 minus sigmoid as well

past meteor
#

The last line is what you have on top

#

If it's not clear from this I think Edd is the one to walk you through it step by step

past meteor
crimson summit
wooden sail
#

you rang

crimson summit
wooden sail
#

you have a question on backprop for logistic regression?

past meteor
#

Or rather, the code of automatic differentiation

wooden sail
#

we can just use the chain rule on a toy example and see what comes out

crimson summit
crimson summit
#

it looks like it is missing a part

wooden sail
#

let's do one layer. lemme grab my tablet because i'm too lazy to latex this up right now

past meteor
#

I'm too lazy to write it out as well hence why I just copy pasted my slides 🤣

wooden sail
#

you're using a sigmoid, but the activation function doesn't matter

#

let's just call it f

#

how comfortable are you with matrix calculus? do i do this for the scalar case for simplicity?

#

i think that's easiest

#

there are some things missing though. primarily, that it's probably a least squares loss, yeah?

crimson summit
wooden sail
#

lemme make a quick arrangement

#

so a derivative of the sigmoid seems to be missing, but maybe that's cuz that derivative has a special form? let's take a look

#

(that'd be the f'(wx + b) term)

#

so, turns out the derivative of the sigmoid is the same sigmoid times 1 - itself
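
Written out for the scalar, one-layer case discussed here, and assuming the least squares loss suggested above (the target symbol $y$ and the factor $\tfrac{1}{2}$ are notational assumptions, since the original slide isn't reproduced in the log):

```latex
\begin{aligned}
\sigma(z) &= \frac{1}{1 + e^{-z}}, \\
\sigma'(z) &= \frac{e^{-z}}{(1 + e^{-z})^{2}}
            = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}}
            = \sigma(z)\,\bigl(1 - \sigma(z)\bigr), \\[4pt]
L &= \tfrac{1}{2}\,\bigl(f(wx + b) - y\bigr)^{2}, \\
\frac{\partial L}{\partial w} &= \bigl(f(wx + b) - y\bigr)\, f'(wx + b)\, x, \qquad
\frac{\partial L}{\partial b} = \bigl(f(wx + b) - y\bigr)\, f'(wx + b).
\end{aligned}
```

One caveat worth keeping in mind: if the course's cost is the cross-entropy (log) loss $L = -\bigl[y \log a + (1 - y)\log(1 - a)\bigr]$ with $a = \sigma(wx + b)$ rather than least squares, then $\partial L / \partial a = \frac{a - y}{a(1 - a)}$ and the $\sigma'(z) = a(1 - a)$ factor cancels, leaving $\partial L / \partial w = (a - y)\,x$. In that setting a gradient without the $\sigma(1 - \sigma)$ term would be expected rather than missing.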

crimson summit
# wooden sail

yea this equation looks right but the equation in my picture is missing the f ' (wx+b) part

wooden sail
#

i think they forgot the gradient symbol

#

the equations you posted look like gradient updates

#

and the x_j term looks like the derivative of the argument of f

#

so it would follow that the f is differentiated

#

might be a typo

crimson summit
#

mmmmm okay so my understanding is correct theres just a typo in the course

crimson summit
# wooden sail

cause this picture makes total sense I understand it totally

#

guess the typo is messing me up since im still a beginner

hasty mountain
#

Damn Edd... Someday you gotta teach me how to deal with math like that.
Yesterday I was trying to calculate the integral of a normal Gaussian Distribution and after 30 minutes I got a headache and gave up yert

past meteor
#

It's also shocking how much I forgot and I haven't graduated that long ago 😢

potent sky
#

Hmm maybe I can try making a freehand to LaTeX converter

wooden sail
#

my guess is they wanted to be fancy by leaving out the last level of composition, the affine transformation. but in doing that they forgot the gradient of the rest of the cost function

serene scaffold
#

Nerd.

crimson summit
kindred forge
#

Got a question y'all. A few years ago I saw a tool on HN that was a flowchart style calculator that worked with probability density functions and inexact values. It looked like blender's node graphs. It was real slick and would report Q1-Q4 and stuff like that on the output node. I absolutely cannot find it again, no combination of google search terms has worked. Anyone know what I'm referring to?

kindred forge
rare jetty
#

hey everyone. i have a table that’s 24 pages long as a pdf (this table starts at page 110 of the pdf) i want to extract this table from the pdf and convert it into csv using python. but the problem is each page has a few header lines that i’m not interested in. any tips on how i can extract this huge table ? i’m trying to use tabula but it’s not really giving me what i’m looking for

potent sky
lapis sequoia
rare jetty
potent sky
#

Excel to csv should be trivial no?

hoary jay
#

so if you have two different populations and some sample data, and you run a one-sample t-test of the sample against each population and get p1 and p2 as p-values, does it make sense to compare them? Does this tell you which population the sample relates to the most?

solid basalt
#

Sorry to bother you guys, I have an issue transforming a dataframe, but I didn't find any help chat for data, only this one. Can someone lend me a hand? Or is this the wrong chat? Thanks!

boreal gale
rare jetty
solid basalt
# boreal gale it's the correct chat, go ahead and post your issue and people will chime in!

Thanks man! So I have a tricky one. I used some ChatGPT because I'm a junior; I have the logic but don't know how to apply it in code haha. I have a df like this example:

ID_STRO HECHO PRICE

4431 RC 2000
4431 RC 1000
4431 IT 3000
445 RC 2000
446 RP 1000

And i need this output:

ID_STRO HECHO PRICE FREQUENCY_RC FREQUENCY_IT FREQUENCY_RP PRICE_RC PRICE_IT PRICE_RP

4431 RC 2000 2 0 0 3000 0 0
4431 IT 3000 0 1 0 0 3000 0
4435 RC 2000 1 0 0 2000 0 0
4436 RP 1000 0 0 1 0 0 1000

Trying to keep it basic here: we are computing a price for each id_stro and each hecho inside the id_stro, then storing it in a new column along with the frequency of the stros. I wanted to know how to do that; I spent an hour trying to get the right result but it didn't work.

I need to keep the first row for each id_stro in each hecho so I don't get a duplicated frequency.

I was going to type the code that ChatGPT gave me, but it was wrong and would have needed a lot of explaining.

cold osprey
#

do each column one by one

#

then see if u can merge them together

solid basalt
#

Each column one by one? Sorry, I didn't understand

cold osprey
#

the result u want to achieve

#

try to do each of them on its own, rather than all at once

#

may be easier to figure out how to do it

#

worst case, if u cant merge them into 1 query, u can just join them on ID_STRO

solid basalt
#

yeah i just figured it out another thing, but thanks!

#

:d

potent sky
real scarab
#

anyone know how to make a saved matplotlib animation not have complete garbage font rendering? left is the video output, right is what it looks like in a notebook. I want the rendered video to look like the one on the right. my code is:

import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML
import numpy as np


def animate_test():
    fig = plt.figure()
    ax = fig.add_subplot(autoscale_on=False, xlim=[0, 1], ylim=[0, 1])

    ax.set_xlabel("The X Axis")
    ax.set_ylabel("The Y Axis")
    ax.set_title("A title that explains what the graph is about")
    
    points, = ax.plot([], [], ".")

    def animation_func(frame):
        points.set_data(np.random.random(100), np.random.random(100))
        return points,
    
    ani = animation.FuncAnimation(fig, animation_func, range(10), interval=100, blit=True)
    return ani

ani = animate_test()
video_writer = animation.FFMpegFileWriter(fps=10, bitrate=1000, codec="libx264")
ani.save(
    "animation_test.mp4",
    writer=video_writer
)
HTML(ani.to_jshtml())

(you probably have to run this in a jupyter notebook cell)

#

I have already looked on stackoverflow for this. they recommend changing the bitrate to some really high value. but I've tried that and it doesn't make a difference

cold osprey
#

change codec maybe?

real scarab
#

any suggestions for the codec

real scarab
#

okay I figured out how to at least export all the frames to individual png files, they turn out okay. I can stitch them together with another tool

def animate_test():
    fig = plt.figure()
    ax = fig.add_subplot(autoscale_on=False, xlim=[0, 1], ylim=[0, 1])

    ax.set_xlabel("The X Axis")
    ax.set_ylabel("The Y Axis")
    ax.set_title("A title that explains what the graph is about")
    
    points, = ax.plot([], [], ".")

    def animation_func(frame):
        points.set_data(np.random.random(100), np.random.random(100))
        fig.savefig(f"frames/animate_test_{frame}.png", format="png", transparent=False, facecolor="white")
        return points,
    
    ani = animation.FuncAnimation(fig, animation_func, range(10), interval=100, blit=True, repeat=False)
    return ani

ani = animate_test()
HTML(ani.to_jshtml())
cold osprey
wooden forge
#

Little question, has anyone ever made a neural network (regression or convolution or both) which gave the same output no matter the input, as it was not learning anything, simply minimising the error?

cold osprey
#

huh

#

if the output is always the same for a given input, the error will always be the same no?

#

i dont quite follow

hollow storm
#

can anyone point me towards a package or library that can generate synthetic data based on another dataset (probably through a model)?

cold osprey
#

e.g. if output is always 1,

for 2 samples,

  1. output should be 0
  2. output should be 1

the error will never change?

wooden forge
#

but the standard deviation is very high (angles are normalized by 2*pi) and I always have 0.1

#

A more acceptable standard deviation would be at least 0.001

#

for example with a CNN, but it's similar with a feed forward

cold osprey
#

oh

#

i think i get it

#

like theres some sort of restriction that the output can only be one value

wooden forge
#

I tried reducing the learning rate by a lot but it doesn't impact the training

cold osprey
#

would that be equivalent to calculating the mean or median or some statistical average ?

wooden forge
#

The network learns the average value basically
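
This matches a known property of squared-error loss: if a model can only output one constant, the constant that minimises the MSE is the mean of the targets, and the resulting RMSE equals the standard deviation of the targets. A quick check with made-up normalized angles (the values and the grid search are for illustration only):

```python
import math
from statistics import pstdev


def mse(prediction, targets):
    """Mean squared error of a single constant prediction."""
    return sum((prediction - t) ** 2 for t in targets) / len(targets)


# Made-up targets standing in for angles normalized to [0, 1].
targets = [0.10, 0.25, 0.40, 0.65, 0.90]
mean_target = sum(targets) / len(targets)

# Brute-force search over constants 0.000, 0.001, ..., 1.000.
candidates = [i / 1000 for i in range(1001)]
best_constant = min(candidates, key=lambda c: mse(c, targets))

rmse_at_mean = math.sqrt(mse(mean_target, targets))
print(f"best constant={best_constant:.3f}, mean of targets={mean_target:.3f}")
print(f"RMSE at mean={rmse_at_mean:.4f}, std of targets={pstdev(targets):.4f}")
```

So an error that sits at roughly the spread of the labels is a hint that the network has collapsed to predicting the mean.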

#

so it doesn't technically learn; it's sort of like rote learning, but not exactly

cold osprey
#

why this instead of letting each image have its own output?

wooden forge
#

wdym

cold osprey
#

if image 1's angle is 90

#

and image 2's angle is 30

#

the output u would want is 60 right?

#

for both

wooden forge
#

no I want 90 and 30

cold osprey
#

oh

wooden forge
#

I don't want the avg

#

I want my network to recognize angle lol

cold osprey
#

then isnt it just a normal network?

wooden forge
#

?

cold osprey
#

nvm

wooden forge
#

Normal network?

cold osprey
#

ur model is giving u the average now

wooden forge
#

Yes

cold osprey
#

which is what u dont want

#

okok, i thought u wanted to restrict it to one output value only

wooden forge
#

because it is not able to extract from the image the right characteristics of the line, so instead it minimizes the error and returns the same value for all input

cold osprey
#

have u tried some image augmentation?

#

rotation/ crop etc

wooden forge
#

Images are randomly generated, so it's pointless doing rotation, plus, on the experimental data, rotation modifies the pixels which is not what I want

#

images are 18x18 pixels

cold osprey
#

u got a sample image?

wooden forge
#

I can't show the experimental diagrams, they are under an NDA

#

These are synthetic, close to the real data

cold osprey
#

ah but can i just imagine 2 lines making an angle?

wooden forge
#

it's one line per image

cold osprey
#

and the angle is measured to the horizontal?

wooden forge
#

vertical axis

cold osprey
#

i see

#

hows ur current model looking like

#

architecture

wooden forge
#

I have two, one is a feedforward:

```python
model = nn.Sequential(
        nn.Linear(N, 24),
        nn.LeakyReLU(),
        nn.Linear(24, 12),
        nn.LeakyReLU(),
        nn.Linear(12, 6),
        nn.ReLU(),
        nn.Linear(6, 1)
    )```

One is a convolution:
```python
layers = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=kernel_size_conv, stride=1, padding=1),
            nn.ReLU(),
            nn.Conv2d(6, 12, kernel_size=kernel_size_conv, stride=1, padding=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(12 * (kernel_size_conv ** 2) * (kernel_size_conv ** 2), 200),
            nn.LeakyReLU(),
            nn.Linear(200, 100),
            nn.LeakyReLU(),
            nn.Linear(100, 1)
            )```
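
A side note on the convolutional model: the `in_features` of the first `nn.Linear` has to equal channels x height x width of the last feature map, which depends on the image size and not only on `kernel_size_conv`. A minimal sketch of that arithmetic, assuming 18x18 single-channel inputs (the size mentioned earlier), stride 1, padding 1 and no dilation. `conv2d_out_size` and `flattened_features` are hypothetical helper names, not part of the posted code:

```python
def conv2d_out_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a Conv2d layer (dilation 1) along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(image_size: int, kernel: int, channels: int = 12, n_convs: int = 2) -> int:
    """Number of values left after n_convs identical convolutions followed by a flatten."""
    size = image_size
    for _ in range(n_convs):
        size = conv2d_out_size(size, kernel, stride=1, padding=1)
    return channels * size * size


print(flattened_features(18, kernel=3))  # 12 * 18 * 18
print(flattened_features(18, kernel=5))  # 12 * 14 * 14
```

If the expression used in the posted model gives a different number for the chosen kernel size, the first `nn.Linear` will likely fail with a shape mismatch, so it may be worth comparing the two.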
cold osprey
#

hmm random idea

#

possible to do something to the contrast of the image?

wooden forge
#

The FF works really well on synthetic binary images (dont mind the orange I used cmap copper by accident)

cold osprey
#

ya seems like its the noise thats causing it to be less accurate

wooden forge
#

yes

#

It can't extract the features properly

cold osprey
#

if like val of pixel < 127, then multiply by 0.2 ?

#

like to make light lighter and dark darker

wooden forge
#

I have tried smaller network, bigger network, modifying the lr, adding blur, using antialias to make thicker lines, ...

cold osprey
#

127, *1.5 or max

#

hmm

wooden forge
#

changing the network size won't really help, it's not overfitting

#

I just checked, my experimental data

```python
>>> X_exp.max()
tensor(3.8361e-10)
>>> X_exp.min()
tensor(-4.1663e-10)
```
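
With values around 1e-10, a threshold like 127 cannot be applied directly. A minimal sketch of one option, assuming that rescaling each image to [0, 1] is acceptable for the project (that is a judgment call) and using random numbers as a stand-in for the experimental data. The function names are hypothetical, and the 0.2 / 1.5 factors are the ones suggested above:

```python
import numpy as np


def minmax_per_image(batch: np.ndarray, eps: float = 1e-30) -> np.ndarray:
    """Rescale every image of an (n, H, W) batch to the [0, 1] range."""
    lo = batch.min(axis=(1, 2), keepdims=True)
    hi = batch.max(axis=(1, 2), keepdims=True)
    # eps only guards against division by zero for a constant image
    return (batch - lo) / np.maximum(hi - lo, eps)


def stretch_contrast(img01: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Darken pixels below the threshold (x0.2) and brighten the rest (x1.5, clipped to 1)."""
    return np.where(img01 < threshold, img01 * 0.2, np.minimum(img01 * 1.5, 1.0))


rng = np.random.default_rng(0)
X_fake = rng.normal(0.0, 1e-10, size=(4, 18, 18))  # stand-in for the experimental data
X_scaled = minmax_per_image(X_fake)
X_contrast = stretch_contrast(X_scaled)
print(X_scaled.min(), X_scaled.max())
```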
cold osprey
#

i mean the pixel value

boreal gale
#

could you provide some code to generate the synthetic images? sounds interesting!

wooden forge
#

This is how I create the synthetic images:

```python
import numpy as np
from numpy import ndarray
from skimage.draw import line, line_aa
from scipy.ndimage import gaussian_filter  # only import if necessary
from typing import Tuple

from utils.angle_operations import calculate_angle, normalize_angle


def generate_image(size: tuple, sigma: float = 0, aa: bool = False) -> Tuple[ndarray, float]:
    """
    Generate a binary image with a random line
    :param size: Shape of the image
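    :param sigma: Standard deviation of the gaussian blur applied to the image (0 means no blur)
    :param aa: Anti-alias, creates AA line or not
    :return: the image (divided by 255) and the normalized angle of the line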
    """
    img = np.random.normal(10, 2, size) * 255
    min_length = 0.5 * size[0]

    # Select two random positions in the array
    index1 = np.random.choice(img.shape[0], 2, replace=False)
    x1, y1 = tuple(index1)
    # Set a minimum length for the line (at least half the size of the picture)
    length = 0
    while length <= min_length:  # while the length is not at least half the size of the picture it selects new endpoints
        index2 = np.random.choice(img.shape[0], 2, replace=False)
        x2, y2 = tuple(index2)
        length = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

    # Compute angle of the line with respect to the x-axis (horizontal)
    angle = calculate_angle(x1, y1, x2, y2)

    # Create line starting from (x1,y1) and ending at (x2,y2)
    if aa:
        rr, cc, val = line_aa(x1, y1, x2, y2)
        img[rr, cc] = 255 * val
    else:
        rr, cc = line(x1, y1, x2, y2)
        img[rr, cc] = 255

    img = gaussian_filter(img, sigma=sigma)

    return img/255, normalize_angle(angle)


def create_image_set(n: int, N: int, gaussian_blur: bool = False, aa: bool = False) -> Tuple[ndarray, ndarray]:
    """
    Generate a batch of arrays with various lines orientation

    :param n: number of image to generate
    :param N: side of each image
    :param gaussian_blur: Add a gaussian blur to the image if True (the value is passed to generate_image as sigma, so True means sigma=1)
    :param aa: Anti-alias, creates AA line or not
    :return: 3d numpy array, n x N x N
    """
    image_set = np.zeros((n, N, N))  # important for NN to have size n x N x N
    angle_list = []

    for k in range(n):
        image, angle = generate_image((N, N), gaussian_blur, aa)
        image_set[k, :, :] = image
        angle_list.append(angle)

    return image_set, np.array(angle_list)```
#

And then some utils functions used:

calculate_angle

```python
def calculate_angle(x1: float, y1: float, x2: float, y2: float) -> float:
    """
    Calculate the angle of a line with respect to the x-axis
    :param x1: x position of first point
    :param y1: y position of first point
    :param x2: x position of second point
    :param y2: y position of second point
    :return: angle of a line between (x1,y1) and (x2,y2) with respect to the x-axis
    """

    a, b, c, d = get_point_above_horizontal(x1, y1, x2, y2)

    dx = a - c
    dy = b - d

    if dx == 0:
        return np.pi/2
    else:
        slope = dy/dx
        angle = np.arctan(slope)
        if angle < 0:
            return angle + np.pi
        else:
            return angle
```

normalize_angle

```python
def normalize_angle(angle):
    """
    Normalize an angle in radians by dividing it by 2*pi
    Angles in [0, 2*pi) map to [0, 1); the [0, pi) angles returned by calculate_angle map to [0, 0.5)
    angle can be a float or a ndarray, it doesn't matter
    :param angle: angle of a line
    :return: normalized angle value
    """
    return angle / (2*np.pi)
```
timber sky
#

Hi guys, does anyone have experience with transforming logs from several components (nginx, django, ...) to vectors, so I can do anomaly detection based on them?

wooden forge
#

Then to plot everything I use the following function;

```python
def create_multiplots(image_set_input: ndarray, angles: ndarray, prediction_angles: ndarray = None, number_sample: int = None) -> Tuple[Figure, Axes]:
    """
    Generate figures with several plots to see different lines orientation

    :param image_set_input: set of images, either a torch.Tensor of shape (n, 1, p, p) or an array of shape (n, p, p)
    :param angles: array containing the angles for each image of the set
    :param prediction_angles: optional, value of predicted angles by a neural network (ndarray)
    :param number_sample: number of images to plot, None by default
    :return: a figure with subplots
    """
    if isinstance(image_set_input, torch.Tensor):  # if images are from load_diagrams.py
        image_set = image_set_input.squeeze(1)
        n, p, _ = image_set.shape
    else:  # for synthetic diagrams
        image_set = image_set_input
        n = len(image_set)
        p, _ = image_set[0].shape
    # n, p = image_set.shape  # change when using tensor
    # print(len(image_set))
    # n = len(image_set)  # change when using synthetic data

    if (number_sample is not None) and (number_sample < n):
        n = number_sample

    # Compute the number of rows and columns required to display n subplots
    number_rows = int(np.ceil(np.sqrt(n)))
    number_columns = int(np.ceil(n / number_rows))

    # Select a random sample of indices
    indices = sample(range(len(image_set)), k=n)  # k=n also works when number_sample is None

    # Create a figure and axis objects
    fig, axes = plt.subplots(nrows=number_rows, ncols=number_columns, figsize=(6 * number_columns, 6 * number_rows))

    for i, ax in enumerate(axes.flatten()):
        if i < n:
            index = indices[i]
            # image = np.reshape(image_set[index, :, :], (Settings.patch_size_x, Settings.patch_size_y))
            image = image_set[index, :, :]

            normalized_angle = float(angles[index])
            # print(normalized_angle)
            angle_radian = normalized_angle * (2 * np.pi)
            # print(angle_radian)
            angle_degree = angle_radian * 180 / np.pi
            ax.imshow(image * 255, cmap='copper')
            title = 'Angle: {:.3f} | {:.2f}° \n Normalized value: {:.4f}'.format(angle_radian, angle_degree, normalized_angle)
            if prediction_angles is not None:
                prediction_angle = prediction_angles[index][0]  # the angle is a ndarray type with one element only for index i
                title += '\n Predicted: {:.4f} ({:.2f}°)'.format(prediction_angle, prediction_angle*2*np.pi*180/np.pi)
            ax.set_title(title, fontsize=25)
            ax.axis('off')
            plt.tight_layout()
        else:
            fig.delaxes(ax)  # if not there, problem with range in the array and out of bound error

    return fig, axes```

Since you don't have the predicted angles you don't have to pass it as argument, it won't take it into account
boreal gale
#

wonderful. i am at work atm so i can't really look into this atm, but i will hopefully have a look tonight!

wooden forge
#

to run the last bit you'll need the following imports:

```python
from typing import Tuple

import matplotlib.pyplot as plt
from numpy import ndarray
import numpy as np
from matplotlib.figure import Figure
from matplotlib.axes import Axes
from random import sample
import torch```
wooden forge
#

This is part of my internship project for context

#

Here is another synthetic dataset, the angles predicted are very close to the actual one (feed forward)

pastel verge
#

Hey guys, how is it going? I don't know what happened, but all of sudden all of my streamlit apps started refreshing the page every 25/30 seconds. Yesterday everything was working fine. I started thinking it was some problem on my codes. But since this started happening to all the apps, I believe it is not something related to my codes
Does anyone know what is going on?

rose dagger
#

Choosing the same seed in these two ImageDataGenerators makes it so that the same transformations are applied to input and output picture, right? The reason i'm asking is: When i was manually pre-loading my data and training my neural net the performance was amazing, but now that i am trying to use DataGenerators the performance sucks (with the same model architecture!). What might be the reason for that? Is there something wrong with these DataGenerators?

wooden forge
#

oops

#
```python
def get_point_above_horizontal(x1: float, y1: float, x2: float, y2: float) -> Tuple[float, float, float, float]:
    """
    Get the point above the horizontal line passing through the center of the line between (x1,y1) and (x2,y2).
    :param x1: x position of first point
    :param y1: y position of first point
    :param x2: x position of second point
    :param y2: y position of second point
    :return: in order the point above the center and the point under the center
    """
    # Calculate the y-center of the line
    center_y = (y1 + y2) / 2

    if y1 >= center_y:
        return x1, y1, x2, y2
    else:
        return x2, y2, x1, y1```
boreal gale
#

🙏

boreal gale
#

in your real dataset, are all the lines also spanning across the image from edge to edge?

wooden forge
#

yes yes

#

imagine the line to not be too straight, a little dented so to speak

boreal gale
#

also are you trying to do this from scratch? are you open to using some pre-trained arch/weights found in the wild?

wooden forge
#

I am doing this from scratch

#

depends on the model haha

boreal gale
#

are you trying to keep the model minimal?

wooden forge
#

yes

#

it's supposed to be small to then be implemented in a physical circuit in a cryostat

#

So not too big

boreal gale
#

gotcha, well first we gotta make something that actually works first, then i think there are ways to trim it down

wooden forge
#

yup

boreal gale
# wooden forge yes yes

size 256 - synthetic data generator seems to generate lines that aren't going from edge to edge. expected? or did i use too big a size?

there is a min_length = 0.5 * size[0] so i guess it is expected..

wooden forge
#

yeah

#

you can increase it to have 0.9

#

Also I use the copper cmap, looks better

boreal gale
#

i will just keep it 0.5, the generator just burns my CPU trying to find valid candidates 💀

wooden forge
#

xd

#

yeah my laptop has an RTX 4050 lol

#

I don't even hear it during network training xd

#

For the experimental data, I'll try to progressively increase the size of the network

wooden forge
#

it's very small

wooden forge
#

:3

boreal gale
#

and you have tried upsampling it right?

wooden forge
#

the experimental data?

boreal gale
#

yep

wooden forge
#

The problem we have is the lack of data, also the distribution of data

#

you can see most angles are between 0.3 and 0.49 (multiply by 2pi to get radian)

boreal gale
#

my apologies, i meant upsampling in the image, not upsampling in the traditional data science sense.
i.e. 18x18 -> 256x256 for example

wooden forge
#

I can't do that

boreal gale
#

how come?

wooden forge
#

we are limited by the patch sample of the machine

#

it's supposed to be used live on quantum dots

#

you want to make small patch otherwise the measurements are too small

boreal gale
#

i meant something like this

```
Original array:
[[0 1 2]
 [3 4 5]
 [6 7 8]]
Resampled by a factor of 2 with nearest interpolation:
[[0 0 1 1 2 2]
 [0 0 1 1 2 2]
 [3 3 4 4 5 5]
 [3 3 4 4 5 5]
 [6 6 7 7 8 8]
 [6 6 7 7 8 8]]
```
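
The nearest-neighbour resampling shown above can be reproduced with plain numpy. A minimal sketch, where `upsample_nearest` is a hypothetical helper name:

```python
import numpy as np


def upsample_nearest(img: np.ndarray, factor: int) -> np.ndarray:
    """Repeat every pixel `factor` times along both axes (nearest-neighbour upsampling)."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)


small = np.arange(9).reshape(3, 3)
big = upsample_nearest(small, 2)
print(big)
```

No new information is created this way; every original pixel simply becomes a 2x2 block, so an 18x18 patch upsampled by 2 would become 36x36.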
wooden forge
#

Also modifying the data is not wanted because its supposed to be a general method

#

to be applied on any dataset

boreal gale
#

sure, i am just thinking how a NN can recognise edges if your pixel is that massive (relatively speaking).

wooden forge
#

mmh

#

well it works fine on synthetic diagrams lol

queen cradle
#

@wooden forge I think you should try using a Radon transform instead of a neural network. While this is not something I'm especially familiar with, I believe the technique is: (1) Pad the image on all sides with zeros; (2) Apply a two-dimensional Fourier transform; (3) For each angle theta, look at the maximum absolute value (of the Fourier transformed data) achieved along the line of angle theta through the origin. That is, for each theta, look at all the points whose polar coordinates are (r, theta) where r is allowed to be any real number (including negative), take absolute values, and say that the maximum absolute value along the line is some kind of score for how likely it is that this theta is the angle you're looking for; the theta with the highest score is your guess for the true angle of the line.
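
A rough sketch of a non-neural baseline in the same spirit. Note that this is not the Fourier-domain recipe described above: it is a brute-force projection search (a crude discrete Radon transform) that, for each candidate angle, sums pixel intensities along parallel lines and keeps the angle with the strongest single sum. It assumes one dominant straight line per image; the function name and the binning choices are illustrative assumptions. The returned angle is measured from the column (x) axis with y growing downwards, so it would need converting to the vertical-axis convention used in the project.

```python
import numpy as np


def estimate_line_angle(img: np.ndarray, n_angles: int = 180) -> float:
    """Estimate the orientation (radians, in [0, pi)) of the dominant straight line in img."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    xs -= (w - 1) / 2.0  # centre the coordinates on the image
    ys -= (h - 1) / 2.0
    weights = (img - img.mean()).ravel()  # mean-subtracted so flat background sums to ~0

    # Unit-width bins, shifted by 0.25 so that pixel centres of axis-aligned or
    # diagonal lines do not sit exactly on a bin edge.
    k = int(np.ceil(np.hypot(h, w) / 2.0)) + 1
    edges = np.arange(-k, k + 1) + 0.25

    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    scores = np.empty(n_angles)
    for i, theta in enumerate(thetas):
        # Signed distance of every pixel from the line through the centre with direction theta
        rho = (-xs * np.sin(theta) + ys * np.cos(theta)).ravel()
        profile, _ = np.histogram(rho, bins=edges, weights=weights)
        # abs() lets the search work for lines darker than the background too
        scores[i] = np.abs(profile).max()
    return float(thetas[int(np.argmax(scores))])


demo = np.zeros((18, 18))
demo[np.arange(18), np.arange(18)] = 1.0  # diagonal line
print(np.degrees(estimate_line_angle(demo)))
```

With the default 180 candidates the answer is quantized to 1 degree steps, which is probably finer than an 18x18 image can resolve anyway.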

wooden forge
#

pretty much sounds like a Bayesian neural network

wooden forge
#

> the theta with the highest score is your guess for the true angle of the line.
If you have lines with many different orientations, you will have many different theta

queen cradle
#

Yes.

wooden forge
#

Moreover, I have somewhat of a continuum. Yes it's a numerical continuum, but still, that's a lot of values, and I can't just discretize the range by setting a list of possible angles like [0, 1, 2, ..., 179°] (180 being excluded because I take into account the symmetry of the line 0 <=> 180)
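
The 0 <=> 180 symmetry also matters when comparing a predicted angle to the true one: a plain absolute difference would count 179° vs 1° as an error of 178° even though the two lines are almost identical. A minimal sketch of a symmetry-aware difference, in degrees, with a hypothetical function name:

```python
def line_angle_diff_deg(a: float, b: float) -> float:
    """Smallest difference between two line orientations in degrees, treating 0 and 180 as equal."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)


print(line_angle_diff_deg(179.0, 1.0))
print(line_angle_diff_deg(30.0, 90.0))
```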

queen cradle
#

Why can't you discretize the range? Your images are only 18x18, so you're unlikely to be able to observe angles to high precision.

#

If this is really a problem, you could also try an initial discretization, then a local search for a slightly better angle.

#

But I suspect that's unnecessary. I think the resolution you can observe is likely to be no more than 100 angles.

#

It depends on the amount of noise and the number of bits of precision in each pixel I guess.
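
A crude geometric way to put a number on that resolution, assuming an edge-to-edge line in an 18x18 image and ignoring noise and sub-pixel intensity information (both of which would change the real figure): moving one endpoint by a single pixel changes the angle by roughly atan(1 / 17).

```python
import math


def angular_step_deg(n_pixels: int) -> float:
    """Angle change (degrees) when one endpoint of a line spanning n_pixels moves by one pixel."""
    return math.degrees(math.atan(1.0 / (n_pixels - 1)))


step = angular_step_deg(18)
n_distinct = int(180.0 / step)
print(f"{step:.2f} degrees per step, about {n_distinct} distinguishable orientations")
```

Under these assumptions the estimate lands below 100 angles, which is in line with the guess above.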

visual garden
#

Hello people!
My school group and I have been working on classifiers
And our research question is: Does image quality improve a classifier?
And we ended up with 3 classifiers (Lr, Knn, DTC)
Trained on 3 different quality images (Good, bad and mixed) with 3 types of features selection (PCA, variance threshold, and none)
And tested on 3 types of data sample (again , good images bad and mixed)
So yeah that's 81 different results, anyone has a good idea on how to plot them for our presentation without being a giant number mess ?

tidal bough
#

I guess you could also do 9x9 instead of 27x3 if the cells are small enough.

visual garden
#

Yeah I was thinking about doing 3 subplots one for Knn one for LR and one for DTC
Columns being data trained on
Rows being data tested on
And in each of the graph the 3 different type of feature selection

toxic bone
#

@lone steppe Hi, you already are aware of the issue, kindly help

queen vector
#

guys, do you have any ideas of using AI or ML in API testing?

toxic bone
#

sure

queen vector
#

making an application that takes API endpoints as input and uses AI/ML to test those API endpoints (ofc we will have to send requests) and generate a report of some analysis

#

@cold osprey

cold osprey
#

interesting

glass anvil
#

So I am doing this project for text summarization. And I want to compare different normalization and other preprocessing performances. Right now I am trying to compare the performance of different stop words lists, but I don't know what metric should I use? Can you help me out? I am googling and everything but can't really find an answer

#

@warm bane

warm bane
#

I am not much experienced with NLP. I've only ever taken 1 lecture on it but spacy is useful. Maybe try to check if there are metric systems there or not. I cannot quite help further :p

azure granite
#

Do you guys know of any good library that one can use to visualize 3D mathematics?

Like I want to do math on vectors & see those vectors in 3D space - solely for my own experimentation

dusty valve
#

You can use matplotlib for 3d plots, but based on your messages you want something more of a game engine

#

Pyglet could work for a 3d game if thats what you need

#

Just make sure to not crosspost

#

You can be auto muted

tidal bough
#

afaik pyglet's support for 3d is literally just exposing opengl. which... works, but isn't very fun

azure granite
#

and visualizing it

wooden sail
#

quiver is what i was going to suggest as well

#

numpy plus quiver

azure granite
#

Lines are cool too

wooden sail
#

you can draw dots in space with a scatter plot

dense crane
#

can someone leave the paper or tell me what are the best ways to train resnet18 or resnet34 architecture on Imagenet?

verbal venture
#

can anyone explain how the deeper layers of a CNN are able to detect features from an image non-arbitrarily? What I mean is, say there's a house, deeper max-pooling + CNN layers will extract very small pixels from the house, that make the feature indiscernible from knowing it belongs to the house. How is the CNN able to know those are the higher representational features of a house?

warped leaf
#

deeper layers in a CNN learn to detect more complex and meaningful features by building upon simpler features learned in earlier layers

#

i think the network gradually combines and abstracts these features to recognize objects or patterns, even if the specific pixels aren't easily distinguishable in the deeper layers

verbal venture
#

okay. in the final fully connected layer, when things get flattened, it's to essentially combine the extracted features of a particular object?

warped leaf
#

yepp that's correct. Before the final fully connected layer of a CNN, the feature maps from the previous convolutional and pooling layers are flattened into a 1-dimensional vector. This flattening operation is performed to combine and represent the extracted features of a particular object or image.

verbal venture
#

okay, so the final layers (before the classification) just extract even finer representations of the object

warped leaf
#

By flattening the feature maps, the spatial arrangement is no longer explicit, and the network can then treat the features as one flat input vector

#

This allows the fully connected layer to receive and process the learned features as a fixed-length vector.

#

The fully connected layer then performs classification or other tasks based on these combined features, making predictions or generating output based on the learned representations.

verbal venture
#

okay. if you have time I have a few questions

#
  1. Why would you not want to add as many deep layers as possible for image classification, since your network can extract more and more relevant features? I understand that increases compute time, but in terms of accuracy, wouldn't deeper layers mean more accuracy?
wooden sail
#

only if you have enough data and time to train it

#

the more parameters you have, generally the more data and epochs you need to train it correctly

#

there's a sweetspot for it. past a certain point, if you don't increase the amount of training data, the performance will start to get worse

warped leaf
# verbal venture okay. if you have time I have a few questions

the thing here is that the optimal depth of a CNN depends on the complexity of the task, the available dataset size, and computational constraints. while deeper networks can improve accuracy to a certain extent, there is often a diminishing return on performance as the network becomes more complex.

wooden sail
#

that's the usual behavior whenever estimating parameters

#

more unknowns leads to higher variance in the estimate, which you offset by showing more data

#

a bigger network means more parameters need to be trained, i.e. more unknowns

#

there's a very trivial lower bound for this: you need at least as many data samples as parameters you want to estimate. if this is not met, the parameters cannot be found at all

#

if you keep the data set fixed and make the network arbitrarily big, you will hit this scenario eventually, and any network larger than this size won't work. in practice, this trivial bound almost never holds and the networks would break down even earlier due to properties of the data

#

you can see the easiest case by considering that you can solve a linear system of equations with 2 unknowns if the system has 2 linearly independent equations. if you have 2 linearly dependent equations, all of a sudden there are either infinitely many solutions with different properties (which one is the right one for your case?) or no solutions at all

#

if you only had 1 equation you'd run into similar problems

#

if we add more variables to the equations, the problem gets worse. if we add more equations though, we can do something about it
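
The 2-unknowns case above can be checked numerically. A minimal sketch with made-up equations (x + y = 3 and x - y = 1 are illustrative, not from the discussion):

```python
import numpy as np

# Two independent equations, two unknowns: x + y = 3 and x - y = 1
A_indep = np.array([[1.0, 1.0], [1.0, -1.0]])
b = np.array([3.0, 1.0])
solution = np.linalg.solve(A_indep, b)  # a unique solution exists

# Two dependent equations: the second row is just twice the first
A_dep = np.array([[1.0, 1.0], [2.0, 2.0]])
rank_indep = np.linalg.matrix_rank(A_indep)
rank_dep = np.linalg.matrix_rank(A_dep)

# One equation, two unknowns: infinitely many solutions, lstsq picks the minimum-norm one
A_under = np.array([[1.0, 1.0]])
min_norm, *_ = np.linalg.lstsq(A_under, np.array([3.0]), rcond=None)

print(solution, rank_indep, rank_dep, min_norm)
```

The dependent system has rank 1, so it carries the information of a single equation, which is the same situation as the one-equation case.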

halcyon hedge
#

How do I use the pd.to_datetime method on dates in "month-year" format, e.g. "Jun-91", "Jul- 91", ..., "Dec-00", "Jan-21"?
I was trying to analyze a csv in which dates were given in the aforementioned format and I am unable to convert the strings into datetime objects.

warped leaf
#

your dataframe section specify the dates column
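
A minimal sketch of one way to parse such strings, assuming English month abbreviations and that the stray space in values like "Jul- 91" is just noise. The Series below is made up from the examples in the question; in practice it would be the date column of the CSV.

```python
import pandas as pd

raw = pd.Series(["Jun-91", "Jul- 91", "Dec-00", "Jan-21"])

# Remove stray spaces, then parse abbreviated month (%b) and two-digit year (%y)
cleaned = raw.str.replace(" ", "", regex=False)
dates = pd.to_datetime(cleaned, format="%b-%y")
print(dates)
```

Two-digit years are ambiguous by nature: with `%y`, values 69 to 99 are read as 1969 to 1999 and 00 to 68 as 2000 to 2068, so it is worth checking that this matches the data.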

fast nova
#

hi!
there was a library that allows you to "fuse" several arithmetic operations into one object, and then you can apply it to some inputs (raw numbers or even numpy arrays)
like this (pseudocode):
```py
x = O('a * b + c')
x(a=1, b=2, c=3)  # 5
x(a=nparr1, b=nparr2, c=nparr3)  # another array
```

i cannot find this library, can you help me?
#

looks like numexpr, but im not sure

warped leaf
#

is it SymPy?

#

this library also allows you to define arithmetic expressions symbolically and then evaluate them with different inputs.

fast nova
#

its theano! i found it!

#

thank you guys

warped leaf
#

oh cool

#

you're welcome!

flint cosmos
#

I figure I should put this on here cause answers to help requests regarding AI generally dont seem to come through. I'm very new to using PyTorch and am trying to use ChatGPT to make a CNN that can play Pong (via Openai's Gym). Both ChatGPT and I have tried to fix this error but I (in my lack of experience) cant seem to figure it out. Any help would be appreciated.

mild dirge
#

You have 4 dimensions, so you need to specify all 0, 1, 2, and 3 for the permute
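
A minimal sketch of what that looks like, assuming the frames arrive as (batch, height, width, channels) and the network expects (batch, channels, height, width). The actual shapes in the screenshot are not known, so the 8 x 210 x 160 x 3 tensor below is only an illustrative stand-in:

```python
import torch

# Hypothetical batch of RGB frames: (batch, height, width, channels)
frames = torch.zeros(8, 210, 160, 3)

# nn.Conv2d expects (batch, channels, height, width), so all four axes are listed
frames_nchw = frames.permute(0, 3, 1, 2)
print(frames_nchw.shape)
```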

flint cosmos
#

When I tried that this error came up

#

This is the CNN model:

#

I figured it should have three input channels cause thats the number for pong (RGB, later greyscaled)

serene scaffold
#

@flint cosmos please always show code, error messages, and other text as actual text (not screenshots)

#

!code

arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

sweet crypt
#

Hi does anyone have experience working with MCTS here?

#

I am evaluating some game states with just the model, and with MCTS guided by the model itself. The model changes every 10 iterations, but it seems that the model-guided MCTS always has the same accuracy, which is weird.

night prawn
#

Can anyone help me please? I want to use tensorflow gpu but i tried more than 5 times with wsl but it never works.

plush garnet
#

Hello, I am new to machine learning and I wanted to know if there is anything I need to optimize or change, maybe a new model or something else. My project is a multi-label text classifier (I know it's maybe too advanced, but I have wanted to try making this for a while). I don't want to use any pretrained models or generally anything I need to download; I just want to use pytorch, numpy, sklearn and other things like that. Here is my code (I know it's very messy and badly written), and let me know if you need the other files. They don't have anything that I think should be changed; they are just some simple files with some functions and variables.
https://pastebin.com/STHwu2Dx

latent stump
#

The plot shows some horizontal lines, how do I remove these data quirks?

proud saddle
#

i have a json with all of the universities in the world and I'm trying to figure out a way to rank them all. there are about 100k data points, here is the link to download as well https://zenodo.org/record/7387951

#

i have no clue where to start?

plain jungle
#

Cheers! Thank you

pale hemlock
#

in color space

granite onyx
#

Hello, so I'm very new to machine learning, and would love if somebody could point me into the right direction. I'll do my research but little pointers could help a lot.

I have a dataset containing thousands of tuples with a set and a value, each set contains a few items of a total of around 100 different ones, represented in this example by numbers.
The items in the sets correlate to each other in an unknown way to get to the value.
I want to build a model using pytorch (but I could switch to anything) to predict the value of any set.
What kind of architecture / layers go well with that kind of data?

```python
example_data = [
    ({0,1,2}, 10),
    ({1,5,9}, 5000),
    ({20,28}, 400),
    ({89,95,98,99}, 1),
    ...
]
```
mild dirge
#

If there's only 100 possible set items, you may as well just have 100 input nodes

#

That are either on (1) or off (0) depending on whether the value is in the set

#

And the output is the other value, so it's a regression task
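
A minimal sketch of that encoding, using the example data from the question. `multi_hot` is a hypothetical helper name, and the resulting `X` and `y` could be fed to any regression model:

```python
import numpy as np

N_ITEMS = 100  # total number of distinct items, as stated in the question


def multi_hot(item_set, n_items: int = N_ITEMS) -> np.ndarray:
    """Vector with 1.0 at the index of every item in the set and 0.0 elsewhere."""
    vec = np.zeros(n_items, dtype=np.float32)
    vec[list(item_set)] = 1.0
    return vec


example_data = [({0, 1, 2}, 10), ({1, 5, 9}, 5000), ({20, 28}, 400), ({89, 95, 98, 99}, 1)]
X = np.stack([multi_hot(items) for items, _ in example_data])  # shape (n_samples, 100)
y = np.array([value for _, value in example_data], dtype=np.float32)
print(X.shape, y)
```

Since the example values span several orders of magnitude (1 to 5000), predicting the logarithm of the value may be easier than predicting the raw value, but that is an assumption about the data.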

sweet crypt
#

Hi does anyone have some experience with MCTS? I am doing search on game states, and it seems that my search doesn't improve on what the model gives me

snow folio
#

How do tools like keepa get their data from Amazon if Amazon does not allow their data to be scraped?

zenith epoch
#

when you make a neural network, what distribution should the biases and weights be initialized as?

#

i'm new to this and the example i'm working off of uses a gaussian distribution and i was wondering if there's a particular reason why

agile cobalt
#

things usually just work fairly well with a normal distribution

#

there are some things you have to pay attention to though

#

these are the defaults Pytorch uses for Linear layers for example:

zenith epoch
#

ok ty

#

i have no idea what the k and weird looking u stand for in that image lol

agile cobalt
#

I forgot the exact explanation, but iirc it was something like scale down the values & variance based on the number of features to avoid having the gradient go out of control as you stack more and more layers

#

don't trust me though, do look up it properly derp
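
For reference, in the PyTorch documentation for `nn.Linear` the U stands for a uniform distribution and k = 1 / in_features, so weights and biases are drawn from U(-sqrt(k), sqrt(k)). A numpy sketch of that rule, compared with an unscaled standard normal; the layer sizes (324 inputs, 24 outputs) are illustrative assumptions:

```python
import numpy as np


def linear_default_init(in_features: int, out_features: int, rng: np.random.Generator):
    """Draw weights and biases from U(-sqrt(k), sqrt(k)) with k = 1 / in_features."""
    bound = np.sqrt(1.0 / in_features)
    weight = rng.uniform(-bound, bound, size=(out_features, in_features))
    bias = rng.uniform(-bound, bound, size=out_features)
    return weight, bias


rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 324))      # e.g. flattened 18x18 inputs with unit variance
w_scaled, b_scaled = linear_default_init(324, 24, rng)
w_plain = rng.normal(size=(24, 324))  # unscaled standard normal, for comparison

out_scaled = x @ w_scaled.T + b_scaled
out_plain = x @ w_plain.T
print(out_scaled.std(), out_plain.std())
```

The scaled version keeps the output spread below that of the input, while the unscaled one grows with the square root of the number of inputs, which is the kind of blow-up the scaling is meant to avoid as layers are stacked.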

zenith epoch
#

makes sense lol, i'll try to look it up

#

i didn't really get any results on google when i tried before asking here tho, so are there any specific terms or anything I should search?

agile cobalt
#

from the linked timestamp up to 36-ish should be relevant to this

zenith epoch
#

ty

hasty mountain
#

Advice based on experience: when using normal distributions for initialization, keep track of the gradients' behavior... using a normal distribution without scaling tends to generate vanishing gradients... especially in Fully Connected Layers.

#

I had to lose the habit of using normal initialization that I had acquired while working with GANs because of that... it tends to make classifiers rubbish. Too much time to train, higher risk of local optima grumpchib

sly lake
#

Hey guys! Any of you know about any flood prediction model or something similar?

#

I got a hackathon problem statement on "Sea level rise and coastal flooding" and I am thinking of making an app for flood prediction using weather data and stuff

#

Any pointers on guides? Thanks 🙂

muted crypt
#

I have different models to test. I need to find which is the best one for my kind of data. However I'm confused on whether I should use cross validation with all the data cv_scores = cross_val_score(model, X, y) or should I only use the training data, cv_scores = cross_val_score(rf, X_train, y_train). My guess is that for the sake of finding which model is the best, i should do cv_scores = cross_val_score(rf, X, y) instead, is that correct? and when I know which one is better, train it with only the training data?

tidal bough
#

A sufficiently overfit model will perfectly learn the training set, so performance on the training set isn't a very useful metric. CV is for estimating how good it is in generalizing to data points it never saw during training.

muted crypt
#

based on this figure

little vector
#

@muted crypt with cross validation the algorithm loops over the data, and selects a portion of it to use as train and test

So yeah it's best to use cv_score = cross_val_score(rf, X, y)
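
One common arrangement, sketched here under assumptions (synthetic data, two arbitrary candidate models), is to hold out a test set first, compare models with cross validation on the training part only, and touch the test set once at the end. Whether that or cross validation on all the data is preferable depends on how much data there is and on whether a final untouched estimate is needed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "rf": RandomForestClassifier(n_estimators=20, random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}

# Model comparison uses only the training split
cv_means = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}
best_name = max(cv_means, key=cv_means.get)

# Refit the winner on the full training split and evaluate once on the held-out test set
best_model = candidates[best_name].fit(X_train, y_train)
test_score = best_model.score(X_test, y_test)
print(cv_means, best_name, test_score)
```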

sleek harbor
#

is there a way to make jupyter notebook sessions persistent? Like.. so I wouldn't have to rerun all the cells every time I reboot my PC? I know about dill, but.. is there no better way?

I know that with tmux you can make your terminal sessions persistent, and I know that you can edit and run notebooks from terminal editors, like Neovim and Emacs (after spending a few ages configuring them), so in theory, it should be perfectly possible to have a terminal code editor attached to a tmux session, which will make your notebooks persistent. But I haven't tested this theory, and anyhow, I use VS Code.. :/

Anyone have any experience with this? I'm pretty sure you can attach vs code to tmux (don't quote me tho, and I'm not sure how this works), but honestly, I'm not much of a terminal wiz and don't even use tmux, tho would gladly install and use it if there was a way to make jup notebooks persistent, and would be extremely grateful if someone could show me da wae (known to mortals as "the way")

tidal bough
#

But to make it survive a reboot of the kernel, you'd have to save a python interpreter's state somehow, which... seems really hard.

wooden sail
#

why not store the important results and load them when needed? kinda sounds like jupyter is holding you back here tbh

#

but yeah a workaround is to check whether certain files exist, and if not, compute their content and create them. if they do exist, just load them
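
A minimal sketch of that workaround using only the standard library. `load_or_compute` is a hypothetical helper name, and pickle is just one possible storage format:

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute):
    """Return the object pickled at `path`, computing and saving it first if the file is missing."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result
```

In a notebook this could be used as `features = load_or_compute("features.pkl", lambda: expensive_step(raw))`, where `expensive_step` and `raw` stand for whatever slow computation the cell performs. After a reboot, rerunning the cell loads the file instead of recomputing.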

umbral charm
#

Is there any open source ai that i can integrate into my project that has the capabilities of recognizing the constituents of stock images?

#

I kinda dont wanna train my own

#

I also dont know how even if i had the hardware

mild dirge
#

constituents?

boreal gale
# sleek harbor is there a way to make jupyter notebook sessions persistent? Like.. so I wouldn'...

> is there a way to make jupyter notebook sessions persistent?
what's the reason you want to do this?
> Like.. so I wouldn't have to rerun all the cells every time I reboot my PC?
just for this?
> I know about dill, but.. is there no better way?
what's the pain point about dill?

> I know that with tmux you can make your terminal sessions persistent with Tmux....
this would only make sense for running jupyter (and kernels) on a remote host, and this is out of the box?

there probably are ways to do this, just spitballing here -- perhaps wrapping your jupyter stack in vagrant and vagrant suspend whenever you need to switch off your PC, N.B. this is a VERY heavyhanded solution, i do not recommend this like at all.

obtuse olive
#

Is anyone here proficient in TensorFlow and computer vision and willing to collaborate on a project? DM if interested

white flint
#

here it seems the NASNetLarge pretrained imagenet model is the 'best' when plotted on these 2 variables

#

but is there any drawback I should be aware of when using it

sleek harbor
# boreal gale > is there a way to make jupyter notebook sessions persistent? what's the reason...

> just for this?
yeah, just for that, pretty much. Convenience
> what's the pain point about dill?
nothing really, it's just inconvenient. You have to write the code for dumping and loading, then you have to run the import and load cell on open. Like.. it's not the end of the world, but it's not very convenient either.
Besides that, I'm a bit of a.. idk how to say this, so lets just say weirdo.. I like it when the numbers next to my cells, the ones that show the order in which the cells were run.. are actually in the order I ran the cells. dill ruins that.. :/ kinda nitpicking, but.. but yeah.
> this would only make sense for running jupyter (and kernels) on a remote host, and this is out of the box?
I don't really get what is said here. Why would it only make sense for running on a remote host? And what is meant by "out of the box"?
> perhaps wrapping your jupyter stack in vagrant and vagrant suspend
whoosh, that went over my head :3

somber panther
#

I've started looking at index hierarchy, I guess I'm wondering what everyone's thoughts on it are. I prefer to use business data for my keys and this kind of popped out at me

timid grove
#

can anyone please guide me through the steps for making a english to marathi translation model and vice versa by finetuning multilingual language models.
Thank You in advance.

somber panther
#

The data i'm working on collects data from states every month so i'm thinking i can use Year/Date/State as my index, is there a reason not to do this?

umbral charm
#

Like the things that make up the image, ie this random image from my camera roll, it should recognize there is a person driving a vehicle in a field

#

Or in this one that it depicts a sunset

umbral charm
crimson summit
#

In a machine learning program I'm doing now, it said that you should use the linear activation function for something like a stock price predictor, since y can be positive or negative, but for a model that predicts the price of a house you should use the ReLU activation function, since y (the price of a house) can never be negative. My question is: what's the point of even using ReLU? ReLU turns anything below zero into zero, but the price of a house will never be negative anyway, so why not just use the linear activation function?

wooden sail
#

you forget that you will later want to do inference on values outside of the data set

#

if you model your prices with a linear function, past a certain parameter range, the values will all be negative no matter what you do

#

that means the model is only valid in a limited domain. you immediately lost generalizability by using the wrong activation

#

(you'd also never use a linear activation with deep learning for other reasons: no matter how many linear layers you use, they can be simplified into a single linear function. the power of neural networks comes from using nonlinear functions)
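
To make the "stacked linear layers collapse" point concrete, here is a small NumPy sketch. The weights and inputs are arbitrary numbers picked for illustration, not from any real model:

```python
import numpy as np

x = np.array([[1.0, -2.0],
              [0.5, 3.0]])            # 2 samples, 2 features

W1 = np.array([[1.0, -1.0],
               [2.0, 0.5]])
b1 = np.array([0.0, 1.0])
W2 = np.array([[1.0],
               [-1.0]])
b2 = np.array([0.5])

# Two stacked layers with a linear (identity) activation
two_layers = (x @ W1 + b1) @ W2 + b2

# The same map written as a single linear layer
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

collapses = np.allclose(two_layers, one_layer)

# With a ReLU between the layers, the single-layer rewrite no longer matches
hidden = np.maximum(x @ W1 + b1, 0.0)
with_relu = hidden @ W2 + b2
still_collapses = np.allclose(with_relu, one_layer)
```

Here `collapses` is True and `still_collapses` is False: only the nonlinearity stops the network from reducing to a single linear map.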

crimson summit
mild dirge
#

It makes sure that even for edge cases the output would be 0 instead of negative
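
A toy illustration of that edge case, assuming a hypothetical linear price model whose slope and intercept are invented for the example:

```python
def linear_price(area_m2, slope=2000.0, intercept=-50000.0):
    """Hypothetical linear price model; the coefficients are made up."""
    return slope * area_m2 + intercept


def relu(z):
    """ReLU: pass positive values through, clamp everything else to zero."""
    return max(z, 0.0)


areas = [10, 25, 40, 120]
raw = [linear_price(a) for a in areas]   # goes negative for small areas
clamped = [relu(p) for p in raw]         # never below zero
```

For the 10 m² input the raw linear output is -30000.0, while the ReLU output is 0.0; for larger areas both agree.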

potent sky
# crimson summit Okay so using the ReLU for the housing price preditor helps with more complex re...

Incorporating a non-linearity helps in modelling more complex relationships between the attributes, as you're no longer limited to a linear separator (a straight line in 2D, and the corresponding hyperplanes in higher dimensions)
Think of it like this: as long as you only have a straight line to separate your data points, you can only separate very cleanly divided data points, non linearities can give you "curvy" separators, so you can separate between finer patterns

timid ledge
#

im interested in a career in machine learning. i know python is the prefered language for machine learning, but what else do i pair it with?

#

yes, the right language is the language needed for the job, but what language can best prepare me for what the jobs in machine learning will most likely be?

#

should i try for knowledge in python and r?

#

or maybe python and c++?

magic dune
#

You can write a lot of your heavy lifting in C

#

And use python for other parts

potent sky
potent sky
#

To start off

#

That said, a lot of high performance code for ML is written in C++ and then called through python wrappers. Including for popular deep learning frameworks like tensorflow and pytorch

#

R is still popular among a significant number of data analysts

past meteor
#

R is also popular under legitimate data scientists

#

R as a language has many features I don't like (discussed this already) but there are many libraries in R that aren't as readily available in Python (and vice versa). Part of it is likely availability bias since one is used by more stats oriented people and the other by more comp sci oriented people.

#

The biggest one here is time series analysis. (Python) statsmodels' API is garbage, pmdarima is too slow, and the new kid on the block, Nixtla, has source code so horrible I don't want to use it, but it's the best bet I have

past meteor
night prawn
#

I've followed the TensorFlow tutorial to install tensorflow gpu on WSL, but when I run this command: python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))" it returns this

white flint
#

here it seems the NASNetLarge pretrained imagenet model is the 'best' when plotted on these 2 variables, but is there any drawback I should be aware of when using it

night prawn
#

Why i can't execute this code ?

past meteor
night prawn
#

this isnt install with conda ?

past meteor
#

No, conda has everything to do with Python itself (the programming language).

The Python extension in visual studio code lets it "understand" Python and gives you features that make python programming easier.

night prawn
#

ok thanks

#

and now i'm sure i've installed tensorflow with pip on conda but why it's not here

sleek harbor
sleek harbor
# night prawn

see that yellow box at the bottom right. I don't speak french, or whatever language that is, but you need to select the environment there for vs code to see what environment you're using, not just activate it in the terminal. Then vs code will automatically activate it for you in the terminal, btw

night prawn
#

What is the path i have to put for it to work ?

#

line 23

sleek harbor
# night prawn What is the path i have to put for it to work ?

If it's on your linux (Ubuntu) storage, then try just /home and everything afterwards. It seem like you're trying to access the linux storage through windows, but you don't need to do that, since you're already working in WSL. Just imagine that you're on linux, not windows

night prawn
#

ok thank you

night prawn
surreal verge
#

Hey, I need some help i want to get my images to train it in a model. I am using this code for the following task

data = tf.keras.utils.image_dataset_from_directory(
    "train",
    labels="int",
    label_mode="categorical",
    shuffle=True,
    image_size=(180, 180),
    color_mode='grayscale',
    class_names=None
)

but my directory structure is something like this

train
  - cat.01.jpg
  - dog.09.jpg
  - and so on.

How can I use this with that directory layout?

mild dirge
#

So a directory per class
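
One way to get there from the flat layout, as a rough sketch. It assumes every file is named `<class>.<id>.jpg` as in the listing above; the helper name `split_into_class_dirs` is made up, and the demo runs on empty placeholder files in a temp directory:

```python
import shutil
import tempfile
from pathlib import Path


def split_into_class_dirs(train_dir):
    """Move files named '<class>.<id>.jpg' into '<train_dir>/<class>/'."""
    train_dir = Path(train_dir)
    # sorted() materialises the listing before any file is moved
    for image in sorted(train_dir.glob("*.jpg")):
        class_name = image.name.split(".")[0]      # 'cat.01.jpg' -> 'cat'
        target = train_dir / class_name
        target.mkdir(exist_ok=True)
        shutil.move(str(image), str(target / image.name))


# Demo on a throwaway directory with empty placeholder files
demo = Path(tempfile.mkdtemp()) / "train"
demo.mkdir()
for name in ["cat.01.jpg", "cat.02.jpg", "dog.09.jpg"]:
    (demo / name).touch()

split_into_class_dirs(demo)
layout = sorted(p.relative_to(demo).as_posix() for p in demo.rglob("*.jpg"))
```

With the files under `train/cat/` and `train/dog/`, `image_dataset_from_directory` can infer the classes from the folder names. As far as I know, that means passing `labels="inferred"` rather than `labels="int"`, since `"int"` is a value for `label_mode`.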

timid ledge
#

I wanna do more programming than statistical work

#

Is this the wrong field for that? I don’t really enjoy statistics

#

Machine learning specifically, not data science

past meteor
# timid ledge I don’t really want to be on the businessey statistical side of things
  1. I think in practice (outside of research) ML is a last resort measure because it's "expensive". Either you stay in research or join a team that only deals with "the hard" problems after an analyst has found it's not solvable with bar charts and heuristics. In that case the scope of places where you can work is smaller but that's OK. These are the roles that interest me the most.

  2. ML is a synonym (or subset?) of statistics. I don't think you can enjoy ML without enjoying (large parts of) statistics. Historically ML is what comes out of a CS dept and stats from ... stats but also economics, psych, etc. The scope and interest of ML (read: CS folk) does not cover everything of statistics and that's fine because not all of it is relevant for ML and vice versa. There are a bunch of things from, say, traditional statistics that are worth learning though and that are not covered in traditional ML curricula/books/...

#

Like, knowing all the different types of t-tests and anovas are admittedly boring, I don't know / care about that as well. I'd say the hardcore statistical modelling is pretty much the same as ML and worth looking at.

rose dagger
#

Does anybody know a good guide for how to reduce GPU memory usage for neural networks in Keras? So far i've found this: https://stackoverflow.com/questions/53170149/memory-usage-of-neural-network-keras
I'm dealing with a model with roughly 2 million parameters, output size is 324x324 and i'm running out of memory (~16GB GPU RAM available). Any advice?

rose dagger
#

For training 16, though it still worked with 32, but barely. Apparently it just runs out of memory when predicting on the test set.

rose dagger
#

Oh that looks promising. Do you have any experience with that?

past meteor
#

Just on toy problems to test out the API

past meteor
past meteor
#

Iirc mixed precision uses bfloat16 or float16 for the gradients and fp32 for the weights. Concretely it means your resulting model takes as much memory.

I think you need to solve it by reducing how many examples get sent to your GPU at a time or quantize your model (don't recommend)

model.evaluate(X, y, batch_size = 16) (default is 32)

timid ledge
#

I'm really interested in cybersecurity, especially malware analysis

#

I’ve heard python and c++ are great for those fields especially

cold osprey
#

i heard cybersec is more about certifications

#

not sure

rose dagger
past meteor
#

Each time I used Kaggle it was the latter, just a format with the results, not the model itself

rose dagger
#

I have to submit a notebook containing my model and some function that grabs the test data from some path. But the full test data is not publicly available, so i can't "precompute" the predictions and submit only the results.

past meteor
rose dagger
#

Currently i just iterate over all files in the test path and do this: This should just predict each element one at a time, i think.

past meteor
rose dagger
#

run garbage collection after every prediction?

past meteor
#

I don't know how the Python garbage collector plays with things that live on GPU. Do you know what is causing the issue, is it your CPU or GPU going OOM?

rose dagger
#

Mmh, well they seem to recommend clear_session() as well, maybe i'll try that.

past meteor
#

I'd really consider making a tf.dataset and not looping

rose dagger
past meteor
#

At least here you can unambiguously specify your batch_size and you have more or less good guarantees it'll play well because this is what the vast majority of TF people use (as well as Torch folk, they use their own variety)

rose dagger
#

Ok, thank you so much for all of the recommendations! I'll try them, some of them must work i'd hope

past meteor
#

Yeah, start with tf.dataset 🙂 the mixed precision training is something you imo should remember exists but it will not help you here

agile anvil
#

Greetings fellow PythonistAIs! I have been experimenting with automatic prompt engineering, using Claude API to automatically attempt to improve prompts (in this case to produce a cover letter from a resume and job description) then test those candidate responses against the original prompt's and repeat in an artificial selection process. I had a huge problem when I found that Claude would always say it prefers the second of two resulting cover letters, even when their order was swapped, but I was able to overcome that by asking for a list of salient differences and then using a second API call to state a preference based on the list of differences. Is anyone else working on anything similar? I want to collaborate on a paper about this with someone experienced enough to legitimately critique my work so far.
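
For anyone wanting to reproduce the order-swap check described above, a rough sketch. The `judge(first, second)` callable is hypothetical: here it is a plain Python stand-in, and in practice it would wrap an LLM API call that returns 1 or 2 for the preferred position:

```python
def consistent_preference(judge, a, b):
    """Ask `judge` twice with the order swapped.

    Returns 'a' or 'b' if both orderings agree, or None if the verdict flips
    with position (a sign of position bias).
    """
    forward = judge(a, b)    # a shown first
    backward = judge(b, a)   # b shown first
    winner_forward = "a" if forward == 1 else "b"
    winner_backward = "b" if backward == 1 else "a"
    return winner_forward if winner_forward == winner_backward else None


# Stand-in judges for the demo (no API calls)
def always_second(first, second):
    return 2                                    # pure position bias


def prefers_longer(first, second):
    return 1 if len(first) >= len(second) else 2


biased = consistent_preference(always_second, "short letter", "a much longer letter")
stable = consistent_preference(prefers_longer, "short letter", "a much longer letter")
```

With the position-biased judge the result is None, while the content-based stand-in consistently picks "b".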

pale hemlock
#

Anyone think of data being stored in 3 dimentions?

#

dimensions rather

potent sky
#

Mixed precision training has been such a delight
And quantization

potent sky
pale hemlock
# potent sky What do you mean? Elaborate?

im messing around with a tensor model that utilizes dimensional space to create geometric parameters to store it, so its reference is based on telemetry within the model. using shapes as means to identify data structures.

potent sky
#

"store it": store what?

pale hemlock
#

data sets,

#

any data sets

#

just referenced in telemetry

#

self labeled modeling in that, you provide the basic math for lets say a triangle. you can create a data set on any parameter that defines a triangle in space,

pale hemlock
potent sky
#

I'm not sure I completely follow. Could you lay out the input, process, and output?

pale hemlock
potent sky
#

What would be the form of these multipoint references

#

Also do you have a link to this paper or smtg

pale hemlock
#

its something ive thought about for 7 years

#

and now having the means to implement..

potent sky
#

oh so you're developing this independently

pale hemlock
#

no, i haven't touched a computer in 34 years for the purpose of programming.

potent sky
#

Mm any link to help me understand the idea better? I'm still not exactly sure what this is meant to do or why it's supposed to work

pale hemlock
#

sure, hold on

potent sky
#

Thanks! I'll have a look

glossy aspen
pale hemlock
#

you can create so many points of reference in theoretical space

iron basalt
glossy aspen
#

I don’t understand the purpose of it. Maybe Fourier transform would help to model the input signal with some constants?

potent sky
tiny iris
#

Roadmap for AI?

potent sky
#

Also it's kinda outdated with the new stuff, since yk...2020

hasty mountain
#

Hey guys, I'm getting confused over calculations around True Positive Rate, False Positive Rate, True Negative Rate and False Negative Rate.

Given that my model made 88 valid predictions, where 87 were negative ones(0) and just 1 were positive(1), I'm being able to get the True Negatives and the False Negatives correctly by using the following code:

negative_predictions, positive_predictions = (task.argmax(-1) == 0), (task.argmax(-1) == 1)

negative_labels, positive_labels = (label.cpu().numpy() == 0), (label.cpu().numpy() == 1)

true_negative_rate, true_positive_rate = (negative_predictions == negative_labels).sum(), (positive_predictions == positive_labels).sum()

false_negative_rate, false_positive_rate = negative_predictions.sum() - true_negative_rate, positive_predictions.sum() - true_positive_rate

This provides me with 80 True Negatives and 7 False Negatives. However, I'm having some trouble with the Positive Predictions. Even though I have only 1 Positive Prediction, and my labels include 7 Positive Values, the code above provides me with the same values for True Positives and False Negatives.

Can someone give me a hint on how to fix it? I'm a bit out of ideas right now.

pale hemlock
potent sky
pale hemlock
#

right

#

you can set coordinates and create a plane those coordinates create. using that plane as a base line for data to be stored in a dynamic fashion so that information of the most basic types can create other facets of interpreted information conforming to the parameters within this tensor model

mild dirge
#

But these are just TP, FP, TN, FN, not rates

hasty mountain
#

Oh yes, indeed... Rate would be if I divide them by the total number of samples, right?

hasty mountain
pale hemlock
# potent sky So broadly, for visualization, interpretation of data?

with a understanding of creating a basic structure that defines its self with machine learning and AI, i figured i combine the gap by designing a model that has natural language constructed off of real world interpreted representation combined with a expanding data structure that based off of coordinate values in space and stored along a given plane that referenced within the original structure that modeled itself off a basic geometric equations and stored as such for data reference.

wooden sail
hasty mountain
#

Oh yes...Now I got it... Every value in the positive prediction that has a boolean True in the same place as the negative labels will actually be a false positive...
Yes, my problem is exactly with the Boolean Masks yert

#

Thanks, guys!

past meteor
#

Just add an and there and it makes a lot more sense

pale hemlock
#

for instance, a box, a cube, a square, 3, points of data stored in such a manner using the constructed of the data apoint of the original tensor

#

because box is defined, an AI can understand a box, but the data given can also be implemented along that line

mild dirge
#

Sensitivity and specificity mathematically describe the accuracy of a test which reports the presence or absence of a condition. If individuals who have the condition are considered "positive" and those who don't are considered "negative", then sensitivity is a measure of how well a test can identify true positives and specificity is a measure o...

#

Here you can see all the rates

potent sky
mild dirge
#

You divide them by either all ground truths, or ground false I think
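
Spelled out, assuming the counts that are consistent with the numbers quoted earlier in the thread (88 predictions, 1 predicted positive, 7 positive labels, 80 correct), which works out to TP=0, FP=1, TN=80, FN=7:

```python
# Counts inferred from the figures in the thread; treat them as an assumption
TP, FP, TN, FN = 0, 1, 80, 7

positives = TP + FN   # all samples whose label is 1
negatives = TN + FP   # all samples whose label is 0

TPR = TP / positives  # true positive rate (sensitivity / recall)
FNR = FN / positives  # false negative rate
TNR = TN / negatives  # true negative rate (specificity)
FPR = FP / negatives  # false positive rate
```

So each rate is divided by the number of ground-truth positives or ground-truth negatives, not by the total number of samples, and TPR + FNR and TNR + FPR each add up to 1.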

potent sky
#

Thanks for taking the time to try to explain tho. I'll have a look at it again sometime

past meteor
#

FN = ((y_pred != y_true) & (y_pred == 0)).sum()

hasty mountain
pale hemlock
#

the ai can then sure the reference box was sued in in various ways, if a box is presented in one side of a plot and the box use case was used like a box car

hasty mountain
past meteor
#

with true I mean y_true

pale hemlock
#

both instances are stored as references of data points and can be used for reference values such as a temperature

hasty mountain
#

Oh wait...yes... All predictions that are 0 and different from the labels(y_true)

#

Ugh... I suppose my head need some rest yert

past meteor
#

I changed it a bit but conceptually that's the easiest way to understand what a false negative is

mild dirge
#

True means it gets it right, False means it gets it wrong

potent sky
mild dirge
#

True means pred and true are the same, false vice versa, that's all you need to remember

past meteor
#

Not what I meant 😮

#

(y_pred != y_true) checks if there was a mistake made (the false part). (y_pred == 0) means it predicted a negative (the negative part)

potent sky
#

Let your prediction be x and ground truth be y for a binary classification problem
True positive --> x=1 y=1
True negative --> x=0 y=0
False positive --> x=1 y=0
False negative --> x=0 y=1
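
The same table as NumPy boolean masks, as a small sketch. The two arrays are illustrative; the key point is combining the prediction and label conditions with `&` instead of comparing the masks with `==`:

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0])   # ground truth
y_pred = np.array([0, 0, 1, 0, 0, 1, 1])   # model predictions

TP = int(((y_pred == 1) & (y_true == 1)).sum())
TN = int(((y_pred == 0) & (y_true == 0)).sum())
FP = int(((y_pred == 1) & (y_true == 0)).sum())
FN = int(((y_pred == 0) & (y_true == 1)).sum())

print(f"TP={TP} TN={TN} FP={FP} FN={FN}")
```

Every sample lands in exactly one of the four buckets, so the counts always sum to the number of samples.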

past meteor
mild dirge
#

Why not?

agile cobalt
#

with == you're just getting (1-)accuracy

mild dirge
#

!e

import numpy as np

negative_predictions = np.array([True, False, True])
positive_labels = np.array([False, False, True])

print((negative_predictions == positive_labels).sum())
arctic wedgeBOT
#

@mild dirge :white_check_mark: Your 3.11 eval job has completed with return code 0.

2
mild dirge
#

Oh right

past meteor
#

say you have this array [0, 0, 0, 1, 1] for negative predictions and [0, 0, 0, 1, 1] for positive predictions what are you comparing?

mild dirge
#

Blindly copied the example, guess the TP and TN need to be changed that way too then

past meteor
#

You'd take [0, 0, 0] vs [1,1]

#

Either way, I get what you mean.

past meteor
mild dirge
#

You'd need bitwise and then pretty sure

#

!e

import numpy as np

negative_predictions = np.array([True, False, True])
positive_labels = np.array([False, False, True])

print((negative_predictions & positive_labels).sum())
arctic wedgeBOT
#

@mild dirge :white_check_mark: Your 3.11 eval job has completed with return code 0.

1
mild dirge
#

Oh this works I guess

#

I remember using bitwise and before, because & didn't work, wonder why that was

past meteor
#

But it's expressed in positive_labels and negative_predictions already which is not what you get out of a model

#

You just get y_pred hence why it's easier to express it in function of that (subjective I guess)

pale hemlock
# potent sky What role will ai be playing here? Some predictive task?

well the whole point of making and training seems so odd, whereas, machine learning provides a context of information, and AI spends its time contextualizing information, i thought, why not make a machine learning model that can learn from itself with information received from its own instance. creating dictionaries to work with by simple mathematical parameters. such as, applying x y coordinates relating to coordinates that is a learned instance of any given input. discerning one x coordinate from another based off how the reference was made. was the operation happening because someone was talking about a car, or was it someone talking about a box, in both cases the math is defining which characteristics were presented to the tensor model input, keeping note of each instance and defining new variables and functions where they are needed, the actual data isn't important as long as data is being inputted or represented in some value the model should maintain data structure.

past meteor
#

!e

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 1, 1])

FN = ((y_pred != y_true) & (y_pred == 0)).sum() # or np.sum((y_true == 1) & (y_pred == 0))

print('False Negatives:', FN)
arctic wedgeBOT
#

@past meteor :white_check_mark: Your 3.11 eval job has completed with return code 0.

False Negatives: 2
potent sky
#

I don't see how it all ties into each other

pale hemlock
#

it depends on how you store those data classes right? and how would you use those data classes, well you can store the whole class as a data point, its a point that is represented somewhere in space based off the causation of that instance the data was being entered, the model then can keep track of that data in space, and part of that data is referenced as any data type you wish, as the learned instances grow, from inputs of that, any part can quickly be referenced by the model to provide the output wanted, storing information at a point would have a plane, a new X, Y coordinate system is created, x can now hold a value at (-.4),(2),(5),(-2) on the original construct, and if a box is ever referred to in conversation, it then can learn where the word box is framed in context, depending on how the operation has progressed in learned understanding of context, the word, box or any other reference to a geometric figure, triangle, rectangle, the context of reference can be learned from this natural implementation but used as a way to use a. not necessarily needed to instantiate the implementation but the implementation can be instantiated when called..

pale hemlock
# potent sky So you have certain representations of classes "saved" and you use these to gene...

say the value of (2) in this case contains coordinates value in the context of X as reference of an EX as in an ex fiance, the values on that plane can be evaluated among itself where X is served as X boss, or X wife, these contextual cases can be planed across itself and if the contexts plane runs into the original tensor the use case could be construed as an actual function but or the rest of the values don't cross those planes the use case could be extracted naturally due to requirements of the assumed use case.

#

hard to get the idea that X means an X coordinate if other values don't make sense to call it that, in the general slang case.

merry sundial
#

What computers are used to train large scale ML models like Tesla’s CV and Self Driving models? H100 Servers?

somber panther
#

am toying around with fbi background data and am a bit flustered by some results I'm getting,

my_df.groupby('state').describe().transpose()['Alabama']

returns:

permit                  count      295.000000
                        mean      9571.213559
                        std      12549.296862
                        min          0.000000
                        25%          0.000000
                                      ...
return_to_seller_other  min          0.000000
                        25%          0.000000
                        50%          0.000000
                        75%          0.000000
                        max          3.000000
Name: Alabama, Length: 192, dtype: float64

I'm a bit lost because i get a keyerror if i don't transpose it first

#

was just about to remove this, think i figured it out

tidal bough
#

I'd do, like, my_df[my_df["state"]=="Alabama"].describe().

somber panther
#

ya, thats what thought i was doing

#

ty

glossy aspen
# pale hemlock it depends on how you store those data classes right? and how would you use thos...

As I understand you still have a long data and you want to make it 3d or just a smaller vector. I would check autoencoders for this purpose. The model learns from data trying to make outputs similar as much as possible to the inputs. So you can get the latent space (just a vector between encoder and decoder parts) so you will end up with a vector which represents the input (bigger data). I tried to use it for a similar situation. I generally use statistics to interpret data and the difference between clusters etc.

glossy aspen
# pale hemlock say the value of (2) this case contains coordinates value in the context of X as...
final raft
#

I want to get a book on Machine Learning, ideally using Python for under £20. Does anybody know of some good books and a link to buy it please?

arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

serene scaffold
#

You can filter it by books

potent sky
#

I'll take some time to digest it

potent sky
lapis sequoia
#
```python
for index, row in ManagerGroup.get_group(manager).iterrows():
    if row['Late In Hrs'] != 0:
        i += 1
    elif row['Early Out Hrs'] != 0:
        j += 1
    elif row['Late In Hrs'] == 0 and row['Early Out Hrs'] == 0:
        k += 1
```
I want to recreate this in Tableau, but every time I try, the counts of each variable come out equal
#

This is the end result of the loop

glossy aspen
sharp wyvern
#

With data science can I go into crime and business fields?

lapis sequoia
glossy aspen
rose dagger
#

Can something like this actually happen, or do i probably just have a bug somewhere: I trained a model on a reduced training set, got pretty good results on test data, then trained it on the entire dataset using data augmented by Keras ImageDataGenerators and now i get horrible non-sensical results on test data. Did anybody ever experience something similar?

pine escarp
lapis sequoia
glossy aspen
near field
#

Hello, I am a beginner in AI and I would like to know which libraries are the most used and which are the best.

#

I'm currently using Tensorflow but a few PHDs told me that it is bad because it doesn't have a good compatibility with previous versions.

wooden sail
#

it's true that tensorflow has introduced many breaking changes from one version to the next

#

the latest ones drop gpu support for windows outside of wsl

#

many people like pytorch and it seems to have gained a lot of traction in academia too. i like jax for the stuff i work on

#

tensorflow and keras are good, but as with most software, you have to keep track of which versions your code works on

maiden geyser
#

Hi, I am looking for some help designing a neural network. I have to model a time- and space-dependent problem and predict temperature. I have tried using an LSTM as it is a time-dependent problem, but it feels like no matter what I do in training (I have managed to get train and val loss to around 5), test loss is still at 500

#

(please ping me in your answer)

night prawn
sleek harbor
#

quick question. When you're working in a jupyter notebook, do you generally list your observations before the code that leads to these observations, or after it? Like, do you plot a plot, and then add a markdown before the code like "the following plot shows...", or do you plot the plot, and then add a markdown after the plot saying smth like "from the plot we can conclude..."? What is the 'standard' (more popular, accepted) way of doing it, formatting your observations in a notebook?

cursive crown
#

Hi everyone. I need help in optimising a block of code I wrote.

cf_df = pd.DataFrame(columns=['player_id','player_name','country'...)

for player, df in t20_bat_df.groupby('player_id'):
    date_range = pd.date_range(start=df['start_date'].min(), end=df['start_date'].max(), freq='M')
    df = df.set_index('start_date')
    for month in date_range:
        games_played = df['match_id'].count()
        month_df = pd.DataFrame({'player_id': [player], 'month': [month]})
        month_df['runs'] = df['runs'].sum()
        month_df['date'] = month
        month_df['games_played'] = games_played
        .
        .
        .
        
        pd.concat((cf_df, month_df))
cf_df

This is taking forever to compute. Thanks!
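
A possible vectorised alternative, as a sketch only. It assumes the intent is per-player, per-month totals; the sample frame below is invented, and only the columns visible in the snippet (`player_id`, `match_id`, `start_date`, `runs`) are used. Note also that `pd.concat` returns a new frame, so the loop above never actually stores its result in `cf_df`:

```python
import pandas as pd

# Made-up stand-in for t20_bat_df with the columns used in the snippet
t20_bat_df = pd.DataFrame({
    "player_id": [1, 1, 1, 2, 2],
    "match_id": [10, 11, 12, 10, 13],
    "start_date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-03-02", "2023-01-05", "2023-02-11"]
    ),
    "runs": [30, 12, 55, 7, 41],
})

# One groupby over (player, calendar month) replaces the nested Python loops
monthly = (
    t20_bat_df
    .assign(month=t20_bat_df["start_date"].dt.to_period("M"))
    .groupby(["player_id", "month"], as_index=False)
    .agg(runs=("runs", "sum"), games_played=("match_id", "count"))
)
print(monthly)
```

Unlike the `date_range` loop, this only produces rows for months in which a player actually has matches, so months without games would need to be reindexed in if they are required.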

dusk tide
#

Hello everyone , I am practicing data cleaning and EDA on movies dataset https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
There are 2 columns, **Revenue and Budget**, which have 90% missing values. So I was exploring more techniques besides dropping the column or doing mean/median imputation.
As far as I searched these 2 columns lie in Missing completely at random category. Can anyone take a look on these 2 columns and tell whether am I correct or not .

My notebook here https://www.kaggle.com/code/nishchay331/datacleaning-practice1
Can we use the MICE technique in all 3 scenarios (MCAR, MAR, MNAR), as it is an advanced technique? Also, can anyone tell me what things to keep in mind while doing data imputation? One thing I came to know recently is that the standard deviation of the specific feature we are imputing should not change.

rose dagger
#

learning curve currently looks like this: When i run the model on a smaller dataset it works perfectly fine. This seems to suggest that it's stuck in a local minimum. What can i do to circumvent this? Currently using Adam optimizer with learning_rate=10^-3

potent sky
#

the goal is great readability and minimal confusion, so whatever suits that is good I suppose

left tartan
glossy aspen
rose dagger
potent sky
#

But yeah to improve it you can try varying the hparams

rose dagger
potent sky
#

Changing batch size affects how much data is considered for one gradient update. Lower batch size can increase noisy updates and you might be stuck in a minimum

rose dagger
#

I wasn't so focused on how high the accuracy was in the learning curve, i was just worried about the flat loss decrease

rose dagger
potent sky
glossy aspen
potent sky
#

If it's giving a higher performance on every sampling of the train data then we might have an interesting case

rose dagger
#

Yes, i have randomly sampled a small subset multiple times, it works (much!) better every time. Though another difference is that i was loading in the smaller dataset in memory and processing it "by hand", but for the larger dataset i used a Keras DataGenerator. Maybe i just messed the DataGenerator up?

potent sky
rose dagger
#

Haha, i meant i was coding custom methods to do this, not using any prebuilt modules like DataGenerators from Keras

potent sky
#

It's also possible, though not very likely, that the extra data might actually be causing you to overfit on the train set
Difficult to digest ik xd

rose dagger
#

Yeah, the issue with the first approach was the insane memory usage, which is why i was trying to use DataGenerators. Guess i'll load small batches into memory, process them like i did before, train, then discard the batch and load a new one.

potent sky
#

You can just use DataGenerators for the smaller ones too?
As long as rest of the pipeline is same for both, we can compare

rose dagger
potent sky
#

How's the train accuracy when training with larger dataset

#

As compared to small subset

rose dagger
#

Weirdly enough (as you can see in the screenshot) the accuracy when training on the large dataset is very high (~0.95) and iirc the accuracy was smaller on the small dataset (~0.9)

#

I'll have to look it up how they calculate the accuracy, because as i said the model trained on the larger data essentially only returns 0.5 for every pixel or something dumb like that

potent sky
#

So test accuracy was higher than train accuracy on the small dataset?

rose dagger
#

Yep, just not the loss

potent sky
#

Hmm

#

Try to reproduce with identical pipeline ig

#

Only then can we make any solid conclusions

rose dagger
#

Yes, will do that. It's the only way to compare them properly i guess

potent sky
#

Yep

rose dagger
#

Ok, will update once i'm done. Thank you so much for your help so far!

#

@potent sky While i have your attention: Is there a way to also return the magnitude or some other information about the gradients produced during training time? E.g. plot loss and norm of gradient in each epoch or something similar. Something like that might help me understand whether i'm stuck in some local minimum.

sleek harbor
#

Does correlation between a continuous variable and a categorical variable give any meaningful information?

potent sky
potent sky
sleek harbor
potent sky
#

Or a better example here would be type of contract ig
Small, medium, large whatev

#

Mm maybe ANOVA?

potent sky
#

GL!

glossy aspen
boreal gale
potent sky
sleek harbor
potent sky
boreal gale
# sleek harbor I'd like to know the answer for both ordinal and nominal, as I have both. Curren...

in your original post, by correlation, you probably meant pearson correlation?

i think correlation (of any kind, i will comment on this later) between continuous variable and categorical-and-not-ordinal variable does not give you meaningful info [the hand-wavy way of thinking about this is: since you can easily swap the "ordering" (since it's not ordinal) of the categorical variable, you can arrive at different correlation result, which means they probably don't make any sense!]

however, for ordinal data, it might.
but pearson correlation might not be the thing you want, i think by using pearson, you are effectively assuming the gap between ordinal 1 and ordinal 2 has the same "distance" as ordinal 2 and ordinal 3, when in reality this might not be true. (e.g. low income group, middle income group, high income group, distance between low and mid is not the same as mid and high)

spearman's rho and kendall's tau might make more sense, they are also correlation measures but mainly for ordinal data.
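
The "swap the ordering" argument can be checked with a few lines of NumPy. This is a sketch on made-up data (the `income` values and `contract` labels are hypothetical): the same nominal column is coded as integers in two equally arbitrary ways, and Pearson's r changes with the coding.

```python
import numpy as np

# Made-up data: a continuous value and a nominal category for six rows.
income = np.array([20.0, 22.0, 35.0, 37.0, 60.0, 64.0])
contract = np.array(["a", "a", "b", "b", "c", "c"])

def pearson_with_coding(coding):
    """Pearson r between income and the category under a given integer coding."""
    codes = np.array([coding[c] for c in contract], dtype=float)
    return np.corrcoef(income, codes)[0, 1]

# Two equally arbitrary ways to turn the labels into numbers.
r_first = pearson_with_coding({"a": 0, "b": 1, "c": 2})
r_second = pearson_with_coding({"a": 1, "b": 0, "c": 2})

print(r_first, r_second)
```

Since nothing about a nominal variable prefers one coding over the other, neither number says much on its own.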

potent sky
#

Hmm yeah spearman's or kendall's might be a better approach
btw, if you find correlation with nominal data, doesn't it in a sense become implicitly ordered, w.r.t. the correlation found

sleek harbor
glossy aspen
sleek harbor
boreal gale
potent sky
cerulean kayak
#

okay so I'm trying to select a column with iloc based on 2 examples I saw:

```py
df.iloc[10:20]
```
and this was the second example I used to base my code off of:
```py
df.iloc[:, [1,2,5]]
```
So I combined the two to make this:
```py
df.iloc[:, [0:5]]
```
My df has 7 columns but this gives me the error of "invalid syntax". Please help.
small wedge
#

maybe you're looking for a slice object?

#

afaik [0:5] is only valid after something that accepts slice objects as indices like a list

small wedge
#

you could try slice(0,5) instead

left tartan
#

Get rid of []’s around the 0:5

small wedge
#

oh that too lol

agile cobalt
cerulean kayak
left tartan
#

0:5 is a range. [1,2,5] is a list.

cerulean kayak
#

is a range a separate datatype from a list?

agile cobalt
#

when you use : within object[:] / object[:] = ... (getitem / setitem notation), python automatically creates a slice from that

however, you cannot use it when you are creating a normal list, despite it also using []
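
To see what Python actually hands to the object, here is a small sketch with a toy class (`ShowKey` is hypothetical, not part of pandas) whose `__getitem__` simply returns the key it receives.

```python
class ShowKey:
    """Toy object whose indexing just returns the key it was given."""

    def __getitem__(self, key):
        return key

show = ShowKey()

single_slice = show[0:5]            # the colon became a slice object
row_and_cols = show[:, 0:5]         # a tuple of two slice objects
rows_and_list = show[:, [1, 2, 5]]  # a slice plus an ordinary list

print(single_slice)
print(row_and_cols)
print(rows_and_list)

# [0:5] on its own is not indexing anything, so Python rejects it at compile time.
try:
    compile("cols = [0:5]", "<example>", "exec")
except SyntaxError:
    print("SyntaxError: a bare [0:5] is not a valid list literal")
```

`df.iloc[...]` receives its key the same way, which is why `0:5` works directly inside the brackets but not wrapped in another `[]`.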

cerulean kayak
#

because the 2 things that are getting me hung up on this function are:
1).why in the world does a method use braces?
2). the difference between lists and list-like objects seems arbitrary.

so please bear with me.

agile cobalt
#

: is not "list-like"

#

taking a step away from pandas and looking at normal python, you have:
```py
# list literals:
items = [0, 1, 2, 3, 4, 5]

# list indexing:
items[0] == 0
items[3] == 3

# list slicing:
items[0:3] == [0, 1, 2]
items[slice(0, 3, 1)] == [0, 1, 2]

# updating lists:
items[0] = 100
items[1:3] = [10, 20]
items[slice(1, 3, 1)] == [10, 20]
```

#

the : is pretty much syntax sugar for annotating a slice, usable when retrieving elements from a list or overwriting a slice of the list

#

which means: these pairs are equivalent
```py
df.iloc[10:20]
df.iloc[slice(10, 20)]

df.iloc[:, [1,2,5]]
df.iloc[slice(None), [1,2,5]]

df.iloc[:, 0:5]
df.iloc[slice(None), slice(0, 5)]
```

#

however, that 'syntax sugar' is only valid within object[...].
```py
# valid:
var = slice(1, 10, 3)

# syntax error:
var = [1:10:3]
```

still rivet
#

hi i have a question about my rl:

import numpy as np

grid = np.array([
    ['P', ' ', ' ', ' '],
    [' ', 'X', ' ', 'X'],
    [' ', ' ', ' ', ' '],
    ['X', ' ', ' ', 'G'],
])

print(grid.shape[0])
print(grid.shape[1])

num_states = grid.shape[0] * grid.shape[1]
num_actions = 4  # Up, Down, Left, Right
q_table = np.zeros((num_states, num_actions))

learning_rate = 0.1
discount_factor = 0.9
num_episodes = 10_000
max_steps_per_episode = 100

for episode in range(num_episodes):
    player_pos = np.where(grid == 'P')
    state = player_pos[0][0] * grid.shape[1] + player_pos[1][0]
    if episode % 1_000 == 0:
        print(episode / num_episodes * 100)
    for step in range(max_steps_per_episode):

        if np.random.uniform(0, 1) < 0.1:
            action = np.random.randint(num_actions)
        else:
            action = np.argmax(q_table[state])

        row, col = divmod(state, grid.shape[0])
        #row = state // grid.shape[1]
        #col = state % grid.shape[1]

        if action == 0:  # Up
            row -= 1
        elif action == 1:  # Down
            row += 1
        elif action == 2:  # Left
            col -= 1
        elif action == 3:  # Right
            col += 1

        if row < 0 or row >= grid.shape[0] or col < 0 or col >= grid.shape[1]:
            new_state = state  
        else:
            new_state = row * grid.shape[1] + col

        if grid.flat[new_state] == 'X':  
            reward = -10
        elif grid.flat[new_state] == 'G':  # Goal reached
            reward = 1
        else:  
            reward = 0

        q_table[state, action] += learning_rate * (reward + discount_factor * np.max(q_table[new_state]) - q_table[state, action])

        state = new_state

        if episode == num_episodes - 1:
            print("x: ", row, " | y: ", col)

        if grid.flat[state] == 'G':  # Goal reached
            break

print("Learned Q-table:")
print(q_table)
#

this reinforcement learning does work fine, but as soon as i add another row or column it would break and the results are completely off

#

the 4x4 grid (which works)
q_table:

[[ 5.31440892e-01  4.78286716e-01  5.31440417e-01  5.90490000e-01]
 [ 5.90489707e-01 -9.46856122e+00  5.31440737e-01  6.56100000e-01]
 [ 6.56099855e-01  7.29000000e-01  5.90488097e-01  5.90487408e-01]
 [ 1.68409465e-01 -9.19000000e+00  6.56099831e-01  1.97696497e-01]
 [ 5.31440488e-01  6.33422575e-02  1.29616717e-01 -9.88018350e+00]
 [ 5.90490000e-01  3.40817433e-01  1.64306823e-01  4.15189404e-01]
 [ 6.56099257e-01  8.10000000e-01 -9.46855932e+00 -9.18999923e+00]
 [ 2.39609557e-01  9.00000000e-01  4.15189404e-01 -8.08372366e+00]
 [ 2.49389245e-01 -4.09510000e+00  0.00000000e+00  7.28959493e-02]
 [-8.18826176e+00  1.24354675e-03  2.24665146e-02  8.09998283e-01]
 [ 7.28997072e-01  7.28984768e-01  7.28973041e-01  9.00000000e-01]
 [-9.18999846e+00  1.00000000e+00  8.09999965e-01  8.99998865e-01]
 [ 0.00000000e+00 -1.00000000e+00 -1.00000000e+00  0.00000000e+00]
 [ 2.07141416e-01  0.00000000e+00 -8.49905365e+00  0.00000000e+00]
 [ 8.09998301e-01  6.57927489e-02  2.03081236e-02  5.21703100e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00]]
#

with any change to the grid:

[[  0.           0.           0.           0.        ]
 [  0.         -10.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [  0.         -10.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [  0.           0.          -9.99997646  -9.99959516]
 [  0.           0.           0.           0.        ]
 [  0.           0.          -9.99373421  -9.99963565]
 [  0.           0.           0.           0.        ]
 [  0.           0.          -9.9293035    0.        ]
 [  0.           0.           0.           0.        ]
 [ -1.           0.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [ -1.           0.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [  0.           0.           0.           0.        ]
 [  0.           0.           0.           0.        ]]
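
One likely culprit, offered as a guess from reading the posted code rather than a confirmed fix: states are encoded as `row * grid.shape[1] + col` but decoded with `divmod(state, grid.shape[0])`. The two only agree when the grid is square. A small sketch with a hypothetical 5x4 grid shape:

```python
n_rows, n_cols = 5, 4  # hypothetical non-square grid shape

def encode(row, col):
    # Same encoding as in the training loop: row-major flat index.
    return row * n_cols + col

def decode_as_posted(state):
    # Divides by the number of rows, as the posted code does.
    return divmod(state, n_rows)

def decode_matching(state):
    # Divides by the number of columns, mirroring encode().
    return divmod(state, n_cols)

state = encode(1, 1)
print(decode_as_posted(state))
print(decode_matching(state))

# The matching decoder round-trips every cell of the grid.
round_trip_ok = all(
    decode_matching(encode(r, c)) == (r, c)
    for r in range(n_rows)
    for c in range(n_cols)
)
```

The commented-out `row = state // grid.shape[1]` and `col = state % grid.shape[1]` lines in the posted code already divide by the column count.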
wheat snow
#

hi i wanted to ask something about the behavior of stacked barplots. if i have different values, do the stacked bars just sort themselves so that the color we see on top has the actual highest value, up to the point where it ends?

lapis sequoia
#

I'm working on porting some MATLAB code to Python, and I'm running into an issue which I think may be due to the semantics of matmul being different than what I expected. I have a discretized vector field called m, which has shape (3, N, N), and can be thought of as an N by N grid of vectors in R3. I then construct a corresponding discretized matrix field called L, which has shape (3, 3, N, N), and can be thought of as an N by N grid of 3 by 3 matrices. I was hoping that computing L @ m would yield the same result as multiplying the matrix at each grid point by the corresponding vector at that grid point, but I'm not sure if those are the actual semantics. It's possible that this is not the actual issue in my code, but it seems like the most likely problem at the moment.

young granite
wheat snow
#

i show in a image what i mean in a second

#

i first have to finish preparing da data

hearty anchor
#

Hi, I was wondering for matplotlib is there a place that explains how to structure the fmt keyword? i took a look at the documentation but couldnt find anything

#

I was looking specficially for the bar_label function but ive seen it in other functions too

wooden sail
lapis sequoia
wooden sail
#

in your case, we would do

result = np.einsum("ijkl, jkl -> ikl", L, m)
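
A quick way to convince yourself, as a sketch on random data with an arbitrary small `N`: compare the einsum against an explicit loop over grid points, and look at what plain `L @ m` does instead. `@` treats the last two axes as the matrices, so here it multiplies N by N blocks rather than the 3 by 3 matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4  # arbitrary small grid size for the check
L = rng.normal(size=(3, 3, N, N))   # a 3x3 matrix at each grid point
m = rng.normal(size=(3, N, N))      # a 3-vector at each grid point

# Contract the matrix column index j with the vector index j at every (k, l).
result = np.einsum("ijkl,jkl->ikl", L, m)

# Reference: ordinary matrix-vector product, one grid point at a time.
expected = np.empty((3, N, N))
for k in range(N):
    for l in range(N):
        expected[:, k, l] = L[:, :, k, l] @ m[:, k, l]

matches_loop = np.allclose(result, expected)
matmul_shape = (L @ m).shape  # batched (N, N) @ (N, N) products, not what we want

print(matches_loop, result.shape, matmul_shape)
```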
lapis sequoia
# wooden sail yeah

Ah, Interesting. I was using (3, N, N) because if I want to normalize the vectors I can just norm them and divide, since NumPy will broadcast (N, N) to (3, N, N) no problem.

wooden sail
#

i always prefer and recommend being as explicit with numpy as possible, especially if you come from matlab

#

you'll realize that numpy prefers doing weird stuff instead of erroring out. stuff that would give an error in matlab will happily compute in numpy, and give you trash

#

e.g. 1D np arrays don't behave like actual vectors. you could multiply a vector from the left or right of a matrix without making any change

#

so do things like adding dummy axes (with np.newaxis or reshaping to add a dimension of 1) and use einsum
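
A small sketch of the 1D-array behaviour and of what an explicit axis buys you (the matrix and vector values are arbitrary):

```python
import numpy as np

A = np.arange(9.0).reshape(3, 3)
v = np.array([1.0, 2.0, 3.0])

# A 1D array is neither a row nor a column, so both products are accepted.
right = A @ v   # treated like A times a column vector
left = v @ A    # treated like a row vector times A

# Adding an explicit axis turns v into a 3x1 column, and the shapes are checked.
col = v[:, np.newaxis]
col_product = A @ col   # shape (3, 1)

try:
    col @ A             # (3, 1) @ (3, 3) does not line up
    column_on_left_fails = False
except ValueError:
    column_on_left_fails = True

print(right.shape, left.shape, col_product.shape, column_on_left_fails)
```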

#

einsum is perhaps the 1 thing numpy really beats matlab at 😛 matlab doesn't have a built-in equivalent for natural tensor contractions

lapis sequoia
iron basalt
wooden sail
#

i never remember which axis it'll favor tbh

wooden sail
#

a little bit cursed tbh. i come from the "matlab ordering of indices" and "all vectors are column vectors" world so this is exactly the opposite of what i'd like 😛

#

maybe that's why i never remember it

lapis sequoia
#

Indeed. At least now that I've got this sorted I am properly computing the Laplacian of my vector field.

iron basalt
#

Numpy has a row-major memory layout by default (C-style) and since they were in that mindset, this order would make sense as an arbitrary choice.

#

The last axis being the "fastest."

wooden sail
#

fair enough

#

i should probably change it to default to F ordering

iron basalt
#

That is, if you loop over it as a flat array, and had the multidimensional index printed, the last index is changing the fastest, and the first the slowest.
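
A short sketch of that on a 2x3 array: walking the flat buffer and printing the multidimensional index shows the last index moving fastest, and the strides tell the same story.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # C order (row-major) is NumPy's default

# Walk the flat buffer and recover the multidimensional index of each element.
indices = [
    tuple(int(i) for i in np.unravel_index(flat, a.shape))
    for flat in range(a.size)
]
print(indices)

# Strides are in bytes: in C order, stepping along the last axis moves one element,
# while stepping along the first axis skips a whole row.
c_strides = a.strides
f_strides = np.asfortranarray(a).strides   # same values, Fortran (column-major) layout
print(c_strides, f_strides)
```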

iron basalt
#

Fortran, Pascal, etc.

#

Although the internal memory layout only affects speed, not semantics, but since they were implementing it in C, it comes as the natural choice.

wooden sail
#

right. i did do a fair amount of C++ back in my day, so i'm familiar with all of this... but never any numerics in it. i have fast contiguous memory access and numerics in separate drawers in my head 😭

iron basalt
#

To match the internal layout more.

#

"Row-major" and "column-major" are not great names for it either, it does not generalize past two dimensions and "major" is not a common term in programming in general.

lapis sequoia
#

Blah, I'm not built for numerics. I miss proving that solutions exist. Finding them is no fun.

#

Weird, my Laplacian operator works, but my curl operator does not...

zenith epoch
#

I'm following Michael Nielsen's book on neural nets, and I'm trying to understand his code. I don't understand how he is calculating gradients. My understanding of calculus is shaky, so that might be part of it, but from what I see, he is taking the vector of differences between the actual and desired outputs and multiplying it by the derivative of z (which is a scalar?). I'm not sure what is happening in line 119. I'm kind of lost on what is happening tbh. Could anyone help explain? ty
https://github.com/unexploredtest/neural-networks-and-deep-learning/blob/master/src/network.py#L106 (and line 119)
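
A sketch that may help with the "is z a scalar?" part. It assumes a sigmoid layer with the quadratic cost, and the layer sizes and random values are made up. `z` holds one weighted input per output neuron, so it is a vector, `sigmoid_prime(z)` is applied elementwise, and `*` is an elementwise product of two vectors of the same shape. A finite-difference check confirms that the resulting weight gradient matches the cost's actual slope.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

rng = np.random.default_rng(0)

# Tiny hypothetical layer: 3 inputs feeding 2 output neurons.
a_prev = rng.normal(size=(3, 1))   # activations coming into the layer (column vector)
w = rng.normal(size=(2, 3))
b = rng.normal(size=(2, 1))
y = np.array([[0.0], [1.0]])       # desired output

z = w @ a_prev + b                 # one weighted input per output neuron: a vector
a = sigmoid(z)

# Output error: elementwise product of dC/da and da/dz, both of shape (2, 1).
delta = (a - y) * sigmoid_prime(z)
grad_b = delta
grad_w = delta @ a_prev.T          # shape (2, 3), same as w

def cost(weights):
    return 0.5 * np.sum((sigmoid(weights @ a_prev + b) - y) ** 2)

# Finite-difference check of one weight's gradient.
eps = 1e-6
w_bumped = w.copy()
w_bumped[0, 1] += eps
numeric = (cost(w_bumped) - cost(w)) / eps
print(grad_w[0, 1], numeric)
```

In the standard backpropagation equations, earlier layers repeat the same pattern: the error is pushed back through the transposed weights of the next layer and then multiplied elementwise by that layer's `sigmoid_prime(z)`.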

arctic wedgeBOT
#

src/network.py line 106

```py
delta = self.cost_derivative(activations[-1], y) * \
```
brittle storm
#

yo

#

can some one help me?

dusk tide
#

Hello everyone, I am practicing data cleaning and have a question. Take a numerical feature: if I draw its histogram before and after imputation on the same axes and the two overlap, does that mean the imputation was done correctly?
What are the things we should keep in mind while doing imputation?

sleek harbor
#

0. can KNeighborsClassifier be used for multilabel classification? How is it done? I can kinda guess for ordinal categories, but.. nominal? No idea.

1. can you use KNNImputer for imputing nominal categories? If yes, then how would you do it? You can't use OHE, cus your missing value will just be encoded as a dummy variable, just like the rest of the values, and nothing will be imputed. You could encode them as ordinal, but.......

2. is there any point in encoding nominal categories as ordinal ordered by frequency? Would that make sense? Maybe not the best solution, but using KNNImputer with nominals encoded as ordered by frequency ordinals would probably be better than just imputing with the univariate mode, right? Or does that not make sense?

strong bear
#

i have a list of stocks as dictionaries for example
{'symbol': 'AAPL', 'currency': 'USD', 'exchange': 'NYSE', 'isin': 'US0378331005', 'security_type': 'equity'}, {'symbol': 'TSLA', 'currency': 'CAD', 'exchange': 'V*SE', 'isin': 'yyyy', 'security_type': 'equity'}

and i have the same data in a database table but with duplicates and inconsistencies. for example,

symbol | currency | isin                | security_type
-------|----------|---------------------|--------------
AAPL   | AAPL     |                     | Crypto
AAPL   | USD      | US0378331005        | Equity
AAPL   | USD      | AAPL221125C00070000 | Derivative

as you can see, one database row has crypto as the security type for AAPL and the other two have varying ISINs.

i am looking for an algorithm that would return the best match for AAPL. in this case the 2nd row.

any suggestions or recommendations are welcome.

cold osprey
#

fix it from the source?

#

AAPL clearly isnt crypto so it shouldnt ever be a row

left tartan
#

Yah, I’m not sure what “algorithm” is going to help. Maybe get a list of valid equities from Edgar and cross check?

foggy harness
#

Hi

I am working on a project that requires an ai model to detect faded road markings and the percentage of marking faded (0% means not faded ,100% means completely faded). How should I accomplish this using object detection or image segmentation etc?

foggy harness
#

The markings are also irregular shaped instead of square shaped

strong bear
left tartan
#

I mean, if you're just looking to join your dictionary to a database table?

#

You could either load the dict to a table and join, or construct a param, or download the table and filter, a few options.

strong bear
#

if a match is found in the database then no further action is required, otherwise i need to add a new row to the database, and it's critical that this matching is as accurate as possible because we cannot allow duplicates

left tartan
#

Well, if you can't allow dupes, then make sure you have a unique index

strong bear
left tartan
#

oh sorry, I see you have isin's.

#

left outer join on isin, and insert if not exists.

#

but with a unique key, so you don't have some sort of race condition with another update.
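
A minimal sketch of the "unique key plus insert if not exists" idea. It uses SQLite from the Python standard library as a stand-in for Postgres, and the `securities` table and index names are made up. In Postgres the comparable statement is `INSERT ... ON CONFLICT DO NOTHING`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE securities (symbol TEXT, currency TEXT, isin TEXT, security_type TEXT)"
)
# The unique index is what actually prevents duplicates, even with concurrent writers.
conn.execute("CREATE UNIQUE INDEX idx_securities_isin ON securities (isin)")

incoming = [
    ("AAPL", "USD", "US0378331005", "equity"),
    ("AAPL", "USD", "US0378331005", "equity"),  # same ISIN again: skipped
]
for row in incoming:
    conn.execute("INSERT OR IGNORE INTO securities VALUES (?, ?, ?, ?)", row)

row_count = conn.execute("SELECT COUNT(*) FROM securities").fetchone()[0]
print(row_count)
```

Rows with a NULL ISIN are not deduplicated by an index like this, since NULLs count as distinct, so it only helps where the ISIN is actually present.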

strong bear
#

ISIN's are not reliable as there are entries in the api that have one symbol with a valid ISIN and another None

left tartan
#

What kind of garbage api are you dealing with?

strong bear
left tartan
#

Oh, I've been meaning to try their data out.

strong bear
#

i was hoping there was some classification algorithm that i could use to pick out a match, it would solve a lot of my problems

left tartan
#

I'm just not sure what you're after... sounds like a typical data cleansing problem:

#

Find duplicates, try to resolve known equities using ISIN or well known lists, etc

left tartan
#

Like, if you have a good master list of ISINs and tickers, you could at least whittle the list down

strong bear
#

i have something i am working on: symbol, currency, security type match, and if the db returns more than 1 match then check isin and then check exchange

#

its a lot of if else
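
One way to flatten a chain like that, as a sketch and not the approach described above: score every database row against the API record and keep the highest scorer. The `match_score` helper and its weights are hypothetical placeholders, and the rows are the example ones from earlier in the thread.

```python
def match_score(candidate, row):
    """Count how strongly a database row agrees with the API record."""
    if row["symbol"] != candidate["symbol"]:
        return -1  # symbol must match, otherwise the row is not a candidate at all
    score = 0
    # Weights are arbitrary placeholders: ISIN agreement counts most.
    if row.get("isin") and row["isin"] == candidate.get("isin"):
        score += 3
    if row.get("security_type", "").lower() == candidate.get("security_type", "").lower():
        score += 2
    if row.get("currency") == candidate.get("currency"):
        score += 1
    return score

candidate = {"symbol": "AAPL", "currency": "USD", "exchange": "NYSE",
             "isin": "US0378331005", "security_type": "equity"}

rows = [
    {"symbol": "AAPL", "currency": "AAPL", "isin": "", "security_type": "Crypto"},
    {"symbol": "AAPL", "currency": "USD", "isin": "US0378331005", "security_type": "Equity"},
    {"symbol": "AAPL", "currency": "USD", "isin": "AAPL221125C00070000", "security_type": "Derivative"},
]

best = max(rows, key=lambda row: match_score(candidate, row))
print(best)
```

A minimum score threshold could decide between "this is the existing row" and "insert a new one", though the right cutoff depends on how dirty the data is.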

hoary jay
#

@wooden sail @queen cradle Hey guys do u rem me? I have the abstract ready...do u guys mind giving it a read

left tartan
strong bear
#

postgres

left tartan
#

Yah, you could load it and do it all in sql too

#

I use duckdb for a lot of this stuff, so I'll just join a table to an in-memory dataframe.

simple tapir
#

has anyone ever heard of global ai hub?

sleek harbor
#

current code, which produces first picture:

for col in ["Pclass", "Sex", "SibSp", "Parch", "Embarked"]:
    sns.displot(
        train_data,
        x=col,
        hue="Survived",
        multiple="dodge",
        stat="percent",
        discrete=True,
        shrink=0.8,
    )
    plt.show()

how to make the xticks look like they do on the second picture?

tidal bough
sleek harbor
tidal bough
#

If you want a tick per each class and none others, you can instead explicitly set the ticks, via something like plt.xticks(np.unique(train_data[col]))

sleek harbor
simple tapir
#

How can it be possible that train accuracy is high and test accuracy is low? Isn't test dataset dependent on the train dataset?

left tartan
mild dirge
simple tapir
#

Imagine that you are making a model that predicts house prices based on the number of rooms and the distance to the city center. Price is test data and the rest are train. So the point is to set such values that it predicts the price well. So the better the train accuracy is, the better the test one, but not when overfitting comes into play. But why? I searched for that on the Internet and learned that the machine memorises the data so that it becomes unable to predict well. Yeah, it makes sense, but how is it possible theoretically? How would you visualize that?

left tartan
#

You pick such values more or less randomly... such that you can validate that the model you built (using the train data) is not overfit. So, the test data serves the opposite purpose: to remind you that a model that trains well on "train" data might be useless in the real world.

mild dirge
#

You've probably seen this image before @simple tapir

#

We see that overfitting gets an incredibly good accuracy on the training data (the points), with an error of 0 as it passes through all points. But when giving it new data from the same distribution, it will make poor predictions.

#

Whereas the good fit has a higher training error than the overfitted case, but it will be able to make better predictions on new data.
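
The same picture in numbers, as a sketch on a deliberately contrived example: eight points on a straight line are pushed alternately up and down, then fitted with a line and with a degree-7 polynomial. The data and the alternating "noise" are made up, and the test points are taken from the noise-free trend for simplicity.

```python
import numpy as np

# Eight training points on the line y = 2x, pushed alternately up and down by 0.3.
# The alternating "noise" is a deliberate worst case, not real data.
x_train = np.linspace(0.0, 1.0, 8)
y_train = 2.0 * x_train + 0.3 * (-1.0) ** np.arange(8)

# New points from the same underlying trend, halfway between the training points.
x_test = (x_train[:-1] + x_train[1:]) / 2.0
y_test = 2.0 * x_test

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

good_fit = np.polyfit(x_train, y_train, deg=1)   # straight line
overfit = np.polyfit(x_train, y_train, deg=7)    # passes through every training point

train_mse_good = mse(good_fit, x_train, y_train)
test_mse_good = mse(good_fit, x_test, y_test)
train_mse_over = mse(overfit, x_train, y_train)
test_mse_over = mse(overfit, x_test, y_test)

print(f"good fit: train {train_mse_good:.4f}, test {test_mse_good:.4f}")
print(f"overfit:  train {train_mse_over:.4f}, test {test_mse_over:.4f}")
```

The degree-7 curve wins on the training points and loses on the new ones, which is the train/test gap in miniature.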

left tartan
#

You didn't use the airplane?

#

(joke)

mild dirge
verbal venture
#

hey, how would I use a dataset after gathering images for one

#

do I train the model on the images and then if I want a random image to be classified, just upload that image and then call the model?

small wedge
#

yes

#

the training process is forward propagation -> back propagation

#

to use the model you only need the forward propagation part

verbal venture
#

ok, I know how to make a CNN. how should I call the model through a website

small wedge
# verbal venture hey, how would I use a dataset after gathering images for one

label all of the images so the model has the correct answers to use in the cost function, then you may consider data augmentation (rotation, coloring, translation, noise, etc.) if you don't have a good quantity of images. Finally you'll wanna normalize the data, which is easy for images: you usually just divide every pixel value by the max pixel value, 255
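
A small NumPy sketch of the last two steps. It assumes 8-bit RGB images already loaded into an array; the random pixels here are only a stand-in for a real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in batch (random pixels): 4 RGB images of 32x32, stored as uint8 in [0, 255].
images = rng.integers(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)

# Normalization: scale pixel values to [0, 1] as floats.
normalized = images.astype(np.float32) / 255.0

# One simple augmentation: mirror each image left to right (flip the width axis).
flipped = normalized[:, :, ::-1, :]

# Doubling the dataset; the labels of the flipped copies stay the same.
augmented = np.concatenate([normalized, flipped], axis=0)
print(augmented.shape, augmented.min(), augmented.max())
```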

small wedge
#

or do you want like implementation details?