#data-science-and-ml
so all I need to do is insert this threshold metric into Random Forest right?
Best Threshold=0.230000, F-Score=0.457
Im going to use this
              precision    recall  f1-score   support
           0       0.96      0.97      0.96     14684
           1       0.50      0.42      0.46      1053
    accuracy                           0.93     15737
   macro avg       0.73      0.69      0.71     15737
weighted avg       0.93      0.93      0.93     15737
ta daaaa @past meteor
my f1 score got improved a lot
```
              precision    recall  f1-score   support
           0       0.94      1.00      0.97     14684
           1       0.97      0.07      0.13      1053
    accuracy                           0.94     15737
   macro avg       0.96      0.54      0.55     15737
weighted avg       0.94      0.94      0.91     15737
```
this was before ^^
this is for the ROC curve
              precision    recall  f1-score   support
           0       0.98      0.78      0.87     14684
           1       0.21      0.79      0.33      1053
    accuracy                           0.78     15737
   macro avg       0.59      0.79      0.60     15737
weighted avg       0.93      0.78      0.83     15737
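On the question of plugging the tuned threshold into the Random Forest: one common approach is to leave the model untouched and apply the threshold to its predicted probabilities, instead of calling `predict` (which for a binary problem effectively cuts at 0.5). A minimal sketch, assuming `proba_positive` holds the positive-class probabilities; the numbers below are made up for illustration and `apply_threshold` is a hypothetical helper:

```python
import numpy as np

def apply_threshold(proba_positive, threshold):
    """Turn positive-class probabilities into 0/1 labels using a custom cut-off."""
    proba_positive = np.asarray(proba_positive, dtype=float)
    return (proba_positive >= threshold).astype(int)

# Hypothetical probabilities, e.g. what model.predict_proba(X_test)[:, 1] might return
proba_positive = np.array([0.05, 0.22, 0.23, 0.40, 0.91])

default_labels = apply_threshold(proba_positive, 0.5)   # the usual 0.5 cut-off
tuned_labels = apply_threshold(proba_positive, 0.23)    # the threshold reported above

print(default_labels)
print(tuned_labels)
```

It is probably worth picking the threshold on a validation split rather than on the test set, otherwise the reported F1 tends to be optimistic.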
Hi anyone familiar with Pandas here?
I am stuck a bit with groupby and was wondering how to resolve the following problem.
https://stackoverflow.com/questions/76459817/pandas-groupby-year-week
Anything helps 😉
In the meantime I am toying around 😉
i figured out how to locally install and use an llm model. how can i upload my own texts so that the model responds to me? which models allow this?
i used gpt4all vicuna 13b - can i do it with this one?
How can i render a pre computed simulation between 1000 to 10000 objects (single points about 5×5 pixels)
good evening everyone
I have a question
do data scientists still use python 2.7 for data science?
Not at all.
please don't ping me with an incomplete question. what is "this"?
oh, I see
I don't have anything insightful to say about this figure.
thank you very much
I am trying to find the best f1 value
for the imbalanced predictor variable
Best Threshold=0.923983, F-Score=nan
how come the F score is nan
precision, recall, thresholds = precision_recall_curve(test_labels, pred_positive)
f1_score = (2 * precision * recall) / (precision + recall)
i_max = argmax(f1_score)
print('Best Threshold=%f, F-Score=%.3f' % (thresholds[i_max], f1_score[i_max]))
plt.figure(figsize=(16, 10))
no_skill = len(test_labels[test_labels == 1]) / len(test_labels)
plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
plt.plot(recall, precision, marker='.', label='Logistic Regression')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.show()
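A likely cause of the `nan`: at some threshold both precision and recall are 0, so the F1 formula divides 0 by 0, and `argmax` then lands on that NaN entry. Note also that `precision_recall_curve` returns one more precision/recall pair than thresholds. A minimal sketch of a guarded version, using made-up curve values (`best_f1_threshold` is a hypothetical helper, not part of scikit-learn):

```python
import numpy as np

def best_f1_threshold(precision, recall, thresholds):
    """Return (threshold, f1) for the best F1, treating 0/0 as F1 = 0."""
    precision = np.asarray(precision, dtype=float)
    recall = np.asarray(recall, dtype=float)
    thresholds = np.asarray(thresholds, dtype=float)

    denom = precision + recall
    # Only divide where the denominator is non-zero; leave 0 elsewhere
    f1 = np.divide(2 * precision * recall, denom,
                   out=np.zeros_like(denom), where=denom > 0)
    # The final (precision, recall) pair has no matching threshold
    f1 = f1[:len(thresholds)]
    i_max = int(np.argmax(f1))
    return thresholds[i_max], f1[i_max]

# Made-up values; the second point has precision = recall = 0
precision = np.array([0.5, 0.0, 1.0])
recall = np.array([1.0, 0.0, 0.0])
thresholds = np.array([0.2, 0.9])

with np.errstate(invalid="ignore"):
    naive_f1 = (2 * precision * recall) / (precision + recall)
print(naive_f1)  # contains a nan from the 0/0 entry

best_threshold, best_f1 = best_f1_threshold(precision, recall, thresholds)
print('Best Threshold=%f, F-Score=%.3f' % (best_threshold, best_f1))
```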
are you using matplotlib?
does anyone know why this is happening?
this is on Google Colab
I am training on my own images, using the ssd_resnet50_v1_fpn model
does anyone know where shuffle buffer size is located?
What's the issue? I don't see anything wrong here
There's generally a buffer_size or similar arg to whatever shuffle method you're using, for example dataset.shuffle
what's an easy way to convert jupyter notebooks to output with just markdowns and outputs but no code cells
downloading as markdown 😛
if you open the notebook in your browser, you can download its current state as markdown. you can probably also do this from the terminal and from vscode, but i wouldn't know how
ah you want no code at all?
idk if that can be done automatically, but it's very easy to write a script to do it for you
all of the code blocks are written in that format, wrapped with ```python ```
you could write a 10 liner or so that removes those blocks
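The "10 liner" idea could look roughly like this: a sketch that removes fenced python blocks from a notebook that was exported as markdown, leaving the markdown text and outputs. It assumes code cells are exported as fences tagged `python`; `strip_code_blocks` is a hypothetical helper name.

```python
import re

FENCE = "`" * 3  # three backticks, built this way so the example nests cleanly
CODE_BLOCK = re.compile(
    rf"^{FENCE}python\n.*?^{FENCE}[ \t]*\n?",
    re.DOTALL | re.MULTILINE,
)

def strip_code_blocks(markdown_text):
    """Remove every fenced python block, keeping markdown text and outputs."""
    return CODE_BLOCK.sub("", markdown_text)

# Tiny stand-in for an exported notebook: a heading, a code cell, its output, a note
markdown_text = (
    "# Title\n\n"
    + FENCE + "python\nprint(1 + 1)\n" + FENCE + "\n\n"
    + "    2\n\n"
    + "Some notes.\n"
)

cleaned = strip_code_blocks(markdown_text)
print(cleaned)
```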
what?
the loop to remove code
i want the pdf to look just like the jupyter notebook looks. minus the code cells
I did something with nbconvert but it looks ugly
Okay I have it figured out
what does %%html do btw
Can someone explain why test accuracy remains constant at 68.782% even though the loss keeps varying?
give us more details, what is the loss, what are you predicting etc
If you do classification, and your loss is cross entropy f.e., this can still decrease without argmax(logits) being any different, thus accuracy not being different.
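A small illustration of that point, using made-up logits for a single sample: the cross-entropy drops between the two "epochs" while the predicted class, and therefore the accuracy, stays the same.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy loss for one sample with an integer class label."""
    return -math.log(softmax(logits)[label])

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

label = 1
early_logits = [1.0, 0.5]  # predicts class 0 (wrong)
later_logits = [1.0, 0.9]  # still predicts class 0, but less confidently

early_loss = cross_entropy(early_logits, label)
later_loss = cross_entropy(later_logits, label)

print(early_loss, argmax(early_logits))
print(later_loss, argmax(later_logits))
```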
It's a binary classifier that predicts whether the price of an asset will go up or down based on past 60 periods of data, the following is the entire training loop:
model = LSTM(input_dim=train_dataloader.dataset.sequences.shape[-1], hidden_dim=HIDDEN_DIM, output_dim=OUTPUT_DIM, num_layers=N_LAYERS, fc_dim=FC_DIM)
criterion = nn.CrossEntropyLoss()
optimiser_lr_e3 = torch.optim.Adam(model.parameters(), lr=0.001)
optimiser_lr_e2 = torch.optim.Adam(model.parameters(), lr=0.01)
train_hist = np.zeros(EPOCHS)
test_hist = np.zeros(EPOCHS)
start_time = time.time()
lstm = []
for t in range(EPOCHS):
    model.train()
    for i, (inputs, labels) in enumerate(train_dataloader):
        y_train_pred = model(inputs)
        loss = criterion(y_train_pred, labels)
        train_hist[t] = loss.item()
        if t < 15:
            optimiser_lr_e2.zero_grad()
            loss.backward()
            optimiser_lr_e2.step()
        else:
            optimiser_lr_e3.zero_grad()
            loss.backward()
            optimiser_lr_e3.step()
    correct = 0
    total = 0
    model.eval()
    for i, (inputs, labels) in enumerate(test_dataloader):
        y_test_pred = model(inputs)
        loss = criterion(y_test_pred, labels)
        test_hist[t] = loss.item()
        _, predicted_labels = torch.max(y_test_pred.data, 1)
        total += labels.size(0)
        correct += (predicted_labels == labels).sum().item()
    print(f'Epoch {t+1}\n\tTrain Loss: {train_hist[t]:.4f}\n\tTest Loss: {test_hist[t]:.4f}\n\tAccuracy: {(correct/total)*100:.3f}%')
training_time = time.time()-start_time
print("Training time: {}".format(training_time))
Should I consider using a different loss function, in that case?
Why would you want to do that?
I'm confused why the accuracy stagnates right after the first epoch. Is there a fault with how I'm calculating it?
No probably not
It can just be that the loss does decrease whereas the accuracy does not
And the loss/accuracy staying constant could have many causes
Like a model that is too simple, or a plateau/local minimum
I'm using the MinMaxScaler() that scikit-learn provides, weirdly enough, this does not happen if I use a different pre-processing technique
probably
minmax doesn't work well if there are very large outliers
You'd likely want to standardize it with mean 0 std 1 in that case
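To see what a single large outlier does to min-max scaling, here is a small sketch with made-up numbers. Standardizing is still influenced by the outlier (it shifts the mean and inflates the std), but it does not pin the whole range to it.

```python
import numpy as np

# Made-up feature: most values sit in 500-1000, one outlier at 9000
values = np.array([500.0, 600.0, 700.0, 800.0, 900.0, 1000.0, 9000.0])

# Min-max scaling: everything is squeezed relative to the outlier
minmax = (values - values.min()) / (values.max() - values.min())

# Standardization: mean 0, std 1
standardized = (values - values.mean()) / values.std()

print(minmax.round(3))
print(standardized.round(3))
```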
I'll try that out
Hi i am doing some neural network and i'd like to know if you have some good method to tune the hyperparameters ? (dont hesitate to ping me in your answer please)
i'm really a kid in this domain and i need to create a NN (which i've done) but to chose the right number of hidden layers and nb of neurons per layer is something obscure to me
Anyone have some knowledge in doing specification curve analysis?
Depends on what you're doing. For almost every simple situation, one hidden layer is enough. As for the number of neurons per layer, that can depend on your topology and the number of inputs and outputs. If your input layer has more than one neuron, a common starting point is a hidden layer whose size is equal to the input layer's, or somewhere between the input and output sizes.
In most situations, you want the number of neurons in your hidden layer to be greater than your output layer.
A basic nn is 3, 2, 1. With the input layer being three, middle being 2 and 1 being the output.
yeah ok but it is definitely not enough. because i have a dataset composed of 16 columns and 500k rows and one column for y. And with a simple 321 it does not predict some accurate values
It kills the program, it’s not running
I couldn’t find it though
That doesn't tell me much without knowing the data, what are you trying to do? What kind of outputs are the targets etc etc. You have 16 columns in x, but do you need 16? Have you run PCA to determine the most important columns? What are your dimensions on x? Are the 16 columns going in sequence or does each column get its own input neuron?
simplest thing I can offer without knowing much is a network that is maybe 16 > 8 > 1.
chuck the 16 columns in per neuron
Wdym you couldn't find it, it's not there in the docs?
Does the buffer fill up? What happens after it fills up?
It gets to 1024/2048 and then it says killed
I found it in the docs, I couldn’t find it in my code. I didn’t write this code, this is from the tensorflow object detection api
yeah the ^C kills the program, I’m not sure why it does that. Should I be using less images? I’m at 100 right now.
This machine learning expert says "learn python it is the number one programming language for machine learning."
Does anyone here know what's different about python, that makes it better for machine learning than other languages?
It is just used for ML very often, so there is also a lot of support for it now
it's more or less a positive feedback loop. as pccamel says, tons of people use it. this makes people want to write more, better modules for it. which in turn brings in more people, and the cycle repeats
as a result, there are several powerful, rich modules for ML in python
Ah so the big strength of Python for ML is its ML modules
as for the "better", probably that python has nice and simple syntax and interfaces very easily with other langs
maybe you wouldn't want to implement a matrix multiplication directly in python, but it doesn't matter because people have written amazing code for that in other langs, and you can very easily just call those functions from python
that's how you get numpy, tensorflow, pytorch, etc
So they went with the default. Just add that arg and set it to the value you want
Check your memory usage. It's probably terminated due to maxing out the RAM. Might be solved by reducing the buffer size
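For intuition on why the buffer size drives memory use, here is a rough pure-Python model of a shuffle buffer. This is only a sketch of the idea and not TensorFlow's actual implementation; `shuffle_with_buffer` is a hypothetical name. At most `buffer_size` items are held in memory at once, so with large items (images) a 2048-element buffer can get expensive.

```python
import random

def shuffle_with_buffer(items, buffer_size, seed=None):
    """Yield items in a partially shuffled order, holding at most buffer_size of them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Pick a random element from the buffer and emit it
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Input exhausted: drain whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer

in_order = list(shuffle_with_buffer(range(10), buffer_size=1))
mixed = list(shuffle_with_buffer(range(100), buffer_size=16, seed=0))
print(in_order)
print(mixed[:10])
```

With `buffer_size=1` nothing is shuffled at all, which is the trade-off when shrinking the buffer: less memory, weaker shuffling.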
Let me check
hey anyone interested in a 3d tensor model?
What does it do
And what size is it
creates a 3d tensor model based off 3 axes, hold on I'll show it to you. it does well on first start
!code
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Define the tensor dimensions
dims = ("dim1", "dim2", "dim3", "dim4")
# Create a random tensor
tensor = np.random.rand(*([10]*len(dims)))
# Map each parameter to a point in tensor space and self-label variables
x = []
y = []
z = []
for i, j, k, l in np.ndindex(tensor.shape):
    coordinates = [i/10, j/10, k/10]
    x.append(coordinates[0])
    y.append(coordinates[1])
    z.append(coordinates[2])
# Apply PCA to reduce the dimensionality to 3D
pca = PCA(n_components=3)
coords = np.column_stack((x, y, z))
coords_pca = pca.fit_transform(coords)
# Visualize the tensor coordinates as a scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(coords_pca[:, 0], coords_pca[:, 1], coords_pca[:, 2])
ax.set_xlabel('PCA 1')
ax.set_ylabel('PCA 2')
ax.set_zlabel('PCA 3')
plt.show()
print('Hello world!')
that's pretty much it, a 3d block of a tensor, each with its own name space and unique identifier for each tensor based off 3d space.. figured it would be helpful to have an idea how a block of tensor can be created
yep, you were correct
this file doesn't have a buffer size argument
https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py
There is a flag for nbconvert that exports it without the code cells. I believe it's --no-input
well i modified lol it works but wow does it take a moment to process lol
its small but huge wow
if anyone wishes to view the result
!code
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
# Define the tensor dimensions
dims = ("dim1", "dim2", "dim3", "dim4")
tensor_shape = [10] * len(dims)
# Create a random tensor
tensor = np.random.rand(*tensor_shape)
# Map each parameter to a point in tensor space and self-label variables
x = []
y = []
z = []
for i, j, k, l in np.ndindex(tensor.shape):
    i_label = f"{dims[0]}-{i}"
    j_label = f"{dims[1]}-{j}"
    k_label = f"{dims[2]}-{k}"
    l_label = f"{dims[3]}-{l}"
    coordinates = [i/10, j/10, k/10]
    x.append((i_label, coordinates[0]))
    y.append((j_label, coordinates[1]))
    z.append((k_label, coordinates[2]))
# Apply PCA to reduce the dimensionality to 3D
pca = PCA(n_components=3)
coords = np.column_stack(([coord[1] for coord in x], [coord[1] for coord in y], [coord[1] for coord in z]))
coords_pca = pca.fit_transform(coords)
# Visualize the tensor coordinates as a scatter plot
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(coords_pca[:, 0], coords_pca[:, 1], coords_pca[:, 2])
ax.set_xlabel('PCA 1')
ax.set_ylabel('PCA 2')
ax.set_zlabel('PCA 3')
# Label the coordinates
for i, coord in enumerate(coords_pca):
    coord_x = coord[0]
    coord_y = coord[1]
    coord_z = coord[2]
    ax.text(coord_x, coord_y, coord_z, f"{x[i][0]} {y[i][0]} {z[i][0]}")
plt.show()
print('Hello world!')
Oof looks like you'll have to do some digging
Look at what it's importing (trainer_loop_v2) check that file, identify where the dataset is being processed or shuffled and pass the arg there
You mean model_lib_v2 right?
there's nothing there about buffer size unfortunately
Yeah that
man this sucks
They've gone with the default so it won't be mentioned explicitly
You'll just have to find the function where they're doing the shuffle
Alternatively you could try looking at the detailed stack trace
That could directly give you from within which function this shuffle op was executed and then you can find that in the repo and modify it
That is the error message right?
```
2023-06-13 12:18:26.373528: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 1034 of 2048
```
i think that is the file?
@brave sand have you posted your code? i can't find any in discord history.
okay, and how are you running it?
i can't access that nor will i request access, could you post it in text form please
```
!python model_main_tf2.py --model_dir=models/my_ssd_resnet50_v1_fpn --pipeline_config_path=models/my_ssd_resnet50_v1_fpn/pipeline.config --alsologtostderr
```
this is the command I use
this is what I get:
```
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/object_detection/builders/optimizer_builder.py:124: The name tf.keras.optimizers.SGD is deprecated. Please use tf.keras.optimizers.legacy.SGD instead.
W0613 12:17:45.182555 140099376232256 module_wrapper.py:149] From /usr/local/lib/python3.10/dist-packages/object_detection/builders/optimizer_builder.py:124: The name tf.keras.optimizers.SGD is deprecated. Please use tf.keras.optimizers.legacy.SGD instead.
2023-06-13 12:17:46.277504: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_29' with dtype int64
[[{{node Placeholder/_29}}]]
2023-06-13 12:17:46.278118: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_25' with dtype int64
[[{{node Placeholder/_25}}]]
2023-06-13 12:17:56.785031: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 70 of 2048
2023-06-13 12:18:06.777774: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 782 of 2048
2023-06-13 12:18:26.373528: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:392] Filling up shuffle buffer (this may take a while): 1034 of 2048
^C```
my ram usage is 12.3/12.7
so it's running out of ram
post content of models/my_ssd_resnet50_v1_fpn/pipeline.config
is it because of shuffle = false?
disclaimer: i don't use tensorflow. take my advice with a grain of salt.
https://github.com/tensorflow/models/blob/1a3b1cfefeb3171e73db6cefeb5059391223223b/research/object_detection/protos/input_reader.proto#LL48C40-L48C61
according to this protobuf spec, you can specify shuffle_buffer_size. it defaults to 2048, which is more than what your RAM can take
https://github.com/tensorflow/models/blob/1a3b1cfefeb3171e73db6cefeb5059391223223b/research/object_detection/protos/pipeline.proto#L17
this is just showing you where it is used/referenced in the main pipeline protobuf spec
do you know how to take it from here to test this hypothesis? i.e. what changes you need to make to your pipeline config file?
thank you for the links. so it is set to 11, what does that mean?
isn't that low already by default?
no that's not set to 11
it's saying it's the 11th field in protobuf, which is different.
so i change the default field right
[default = 2048]; is the key point which matches to what you have in the logs
like default = 1024
no don't change the protobuf file
change models/my_ssd_resnet50_v1_fpn/pipeline.config
e.g.
train_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "annotations/train.record"
  }
  shuffle_buffer_size: 10
}
I can do that? I mean editing this file and adding that variable
pipeline.proto and input_reader.proto i showed you are protobuf specification
they are responsible for defining what models/my_ssd_resnet50_v1_fpn/pipeline.config should look like
i showed them to you merely to show you clues on how you can configure a tensorflow pipeline, this is probably documented in the docs but i didn't look there.
you are not "adding a variable" in a sense, you are setting a value allowed in the protobuf specification to something other than the default.
ohhh, that makes sense
so the 10 is just random number chosen?
yes that's a random number i have chosen, pick whatever you want if your RAM is okay with it
gotcha, the buffer isn't filling up anymore
https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/configuring_jobs.md
this is not very clear.. but it's something at least
@boreal gale do you (still) use tensorflow?
Was always my go-to but I see a lot of people going to Torch. The SoTA time series stuff is also frequently in MXNet so I'm torn.
Last time I used torch was for vision and it was not nice because I had a big time constraint and spent most of my time thinking about "How do I do feature X of TF in torch"...
for posterity's sake, this was how i discovered this while having close to 0 knowledge of actually using TF
- search for "shuffle" in TF repo
- https://github.com/tensorflow/models/blob/1a3b1cfefeb3171e73db6cefeb5059391223223b/official/core/input_reader.py#L466-L468 jumped out because i noticed that's very similar to the log you posted earlier (`tensorflow/core/kernels/data/shuffle_dataset_op.cc:392` - i looked this up last time you posted this)
- track how `_shuffle_buffer_size` is defined.
- noticed this is `InputReader`
- looked at https://github.com/tensorflow/models/blob/master/research/object_detection/model_main_tf2.py as i want to see how `InputReader` relates to this
- couldn't figure out how `InputReader` is defined, but i guessed it must be `pipeline_config_path`
- noticed the `train_input_reader` at the end
- search `train_input_reader` in TF repo
- found the protobuf specs, and then found the `shuffle_buffer_size` which matches point 3 (ish)
negative. 💥 i don't do much ML at work tbh.
have some RL work planned for my side hustle, for which i would need to do comprehensive research on which library is good
Oh, I got to step 4 but I couldn't make the connection after. Hopefully with more practice this intuition comes easier to me
i think it's working, I have a loss rate and learning rate
wooohooo
i also have this feeling that torch is superior from all the comments i've read online thus far
Well for someone that doesn't do a lot of ML at work you sure do know a lot about it.
🎉 🎉 🎉 🎉 🎉 🎉 🎉
😅 - i am originally a data scientist but i have since defected to the software engineering camp, even with that said - my ML knowledge is mostly from my university days as i was probably a glorified data analyst at best.
Anyone having issues with VScode not picking up stubs correctly for some modules?
Where should I start to learn Data science ?
!resources data science
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
Thanks
Regarding pandas and plotting using df.boxplot()
I have one extreme value up at 9000, while everything else falls between 500-1000. Without just deleting the row, is there an argument I could put in the () that would ignore values over 1000?
congrats!
should it be taking 3 hours?
depends on how much data you're training it on, how many epochs for, etc
hi. i’ve got a couple of pdf reports over the span of a few years, and i just want to extract one table from these pdfs and convert it into a csv file. does anyone have any tips on how i can go about this (idk if this would be relevant but i’m using VSCode)
There's a Python module called PyPDF2 which may be useful
is this formula to calculate the cost of the weights incorrect. I think that the formula i circled is missing the sigmoid(1-sigmoid) part of the equation ?
i want to build my own ai chatbot for free based on my own data
how can i do this? from what i know the workflow is as follows:
get a model from huggingface
use langchain to host (is it free?)
fine-tune with lora
have also heard about runpod
what am i missing?
It's right in the bottom of the slide
that's just sigmoid(outputs). I think when you are doing the chain rule with respect to the weights there is some multiplication by sigmoid(1-sigmoid); it only has the sigmoid part, not 1-sigmoid as well
cause when you do the derivatives I think you have to do 1 minus sigmoid as well
The last line is what you have on top
If it's not clear from this I think Edd is the one to walk you through it step by step
It's not stuff I actively need in my day-to-day but if you write it all out (like it's on the slide) it just makes a lot more sense. At least that's what I did when I was in uni 🤷♂️
I learned backprop with this notation so Im just trying to match it with the notation of the picture you sent. specifically the sigmoid part of the equation
you rang
?
you have a question on backprop for logistic regression?
oh yes
Speaking of backprop, the notation is terrible. It's a lot easier to understand by looking at the code: https://github.com/karpathy/micrograd/tree/master
Or rather, the code of automatic differentiation
we can just use the chain rule on a toy example and see what comes out
I understand this way of back propagation completly
I am just confused on this equation
it looks like it is missing a part
let's do one layer. lemme grab my tablet because i'm too lazy to latex this up right now
I'm too lazy to write it out as well hence why I just copy pasted my slides 🤣
you're using a sigmoid, but the activation function doesn't matter
let's just call it f
how comfortable are you with matrix calculus? do i do this for the scalar case for simplicity?
i think that's easiest
there are some things missing though. primarily, that it's probably a least squares loss, yeah?
where is the sigmoid part of the equation in the equation that the arrow is pointing to ?
yea
lemme make a quick arrangement
so a derivative of the sigmoid seems to be missing, but maybe that's cuz that derivative has a special form? let's take a look
(that'd be the f'(wx + b) term)
so, turns out the derivative of the sigmoid is the same sigmoid times 1 - itself
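Written out for the scalar case discussed here (one input x, weight w, bias b, activation f, target y), and assuming the least squares loss mentioned above, the chain rule gives:

```latex
% Scalar case: z is the pre-activation, \hat{y} the prediction
\begin{aligned}
z &= wx + b, \qquad \hat{y} = f(z), \qquad L = \tfrac{1}{2}(\hat{y} - y)^2 \\[4pt]
\frac{\partial L}{\partial w}
  &= \frac{\partial L}{\partial \hat{y}} \cdot
     \frac{\partial \hat{y}}{\partial z} \cdot
     \frac{\partial z}{\partial w}
   = (\hat{y} - y)\, f'(wx + b)\, x \\[4pt]
\sigma'(z) &= \sigma(z)\bigl(1 - \sigma(z)\bigr) \\[4pt]
\left.\frac{\partial L}{\partial w}\right|_{f = \sigma}
  &= (\hat{y} - y)\, \hat{y}\,(1 - \hat{y})\, x
\end{aligned}
```

One caveat: if the course actually uses a cross-entropy loss rather than least squares, the sigmoid(1 - sigmoid) factor cancels against the derivative of the loss and the gradient reduces to (ŷ - y)·x, which could also explain a formula without that term.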
yea this equation looks right but the equation in my picture is missing the f'(wx+b) part
i think they forgot the gradient symbol
the equations you posted look like gradient updates
and the x_j term looks like the derivative of the argument of f
so it would follow that the f is differentiated
might be a typo
mmmmm okay so my understanding is correct theres just a typo in the course
cause this picture makes total sense I understand it totally
guess the typo is messing me up since im still a beginner
Damn Edd... Someday you gotta teach me how to deal with math like that.
Yesterday I was trying to calculate the integral of a normal Gaussian Distribution and after 30 minutes I got a headache and gave up 
It's also shocking how much I forgot and I haven't graduated that long ago 😢
relatable lmao
Hmm maybe I can try making a freehand to LaTeX converter
these are my ramblings while doing HS-level calculus
my guess is they wanted to be fancy by leaving out the last level of composition, the affine transformation. but in doing that they forgot the gradient of the rest of the cost function
You still have your hs math notes?
Nerd.
okay that makes a bit more sense. You would think since it's a beginner course they would show all the work lol
Got a question y'all. A few years ago I saw a tool on HN that was a flowchart style calculator that worked with probability density functions and inexact values. It looked like blender's node graphs. It was real slick and would report Q1-Q4 and stuff like that on the output node. I absolutely cannot find it again, no combination of google search terms has worked. Anyone know what I'm referring to?
maybe plotly?
Naw, it was a no code required style thing
hey everyone. i have a table that’s 24 pages long as a pdf (this table starts at page 110 of the pdf) i want to extract this table from the pdf and convert it into csv using python. but the problem is each page has a few header lines that i’m not interested in. any tips on how i can extract this huge table ? i’m trying to use tabula but it’s not really giving me what i’m looking for
I think there's a get data->from pdf option in excel itself. maybe try that
i used regex to do something like this, where there were headers but always following the same style but it may be overkill for your usecase
so convert from pdf to excel to csv ?
Excel to csv should be trivial no?
so if you have two different population, and you have some data , now I take a 1 sample ttest with both the populations and get p1 and p2 as p-values, Soooo does it make sense to compare them? Does this tell you which population the sample relates to the most?
Sorry guys to bother, i have an issue transforming a dataframe but i didnt find any chat help for data, only this one, someone can lend me a hand? Or it is the wrong chat? Thanks!
it's the correct chat, go ahead and post your issue and people will chime in!
yeah it is
Thanks man! So i have a tricky one, i used some chatgpt because i'm a junior, i have the logic but don't know how to apply it to the code haha, i have a df like this example:
ID_STRO HECHO PRICE
4431 RC 2000
4431 RC 1000
4431 IT 3000
445 RC 2000
446 RP 1000
And i need this output:
ID_STRO HECHO PRICE FREQUENCY_RC FREQUENCY_IT FREQUENCY_RP PRICE_RC PRICE_IT PRICE_RP
4431 RC 2000 2 0 0 3000 0 0
4431 IT 3000 0 1 0 0 3000 0
4435 RC 2000 1 0 0 2000 0 0
4436 RP 1000 0 0 1 0 0 1000
Trying to be basic here, we are doing a price per each id_stro and each hecho inside the id_stro, then storing it in a new column with the frequency of the stros, so i wanted to know how to do that, and i spent an hour trying to get the best result but it didn't work.
I need to keep the first row for each id_stro in each hecho so i don't get a duplicated frequency
I was going to post the code that chatgpt gave me, but it's wrong and i'd need to explain a lot,
Each column one by one? Sorry i didn't understand
the result u want to achieve
try to do each of them on its own, rather than all at once
may be easier to figure out how to do it
worst case, if u cant merge them into 1 query, u can just join them on ID_STRO
great then, hope it works
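One possible way to get the table described above, as a sketch only: aggregate once per (ID_STRO, HECHO) pair, then spread the counts and totals into per-HECHO columns. The data is the example from the question; the intermediate names `summary`, `FREQUENCY` and `TOTAL` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID_STRO": [4431, 4431, 4431, 445, 446],
    "HECHO": ["RC", "RC", "IT", "RC", "RP"],
    "PRICE": [2000, 1000, 3000, 2000, 1000],
})

# One row per (ID_STRO, HECHO): keep the first price, count rows, sum prices
summary = (
    df.groupby(["ID_STRO", "HECHO"], sort=False)
      .agg(PRICE=("PRICE", "first"),
           FREQUENCY=("PRICE", "size"),
           TOTAL=("PRICE", "sum"))
      .reset_index()
)

# Spread the count and the total into one column per HECHO, 0 elsewhere
hechos = ["RC", "IT", "RP"]
for hecho in hechos:
    summary[f"FREQUENCY_{hecho}"] = summary["FREQUENCY"].where(summary["HECHO"] == hecho, 0)
for hecho in hechos:
    summary[f"PRICE_{hecho}"] = summary["TOTAL"].where(summary["HECHO"] == hecho, 0)

result = summary.drop(columns=["FREQUENCY", "TOTAL"])
print(result)
```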
anyone know how to make a saved matplotlib animation not have complete garbage font rendering? left is the video output, right is what it looks like in a notebook. I want the rendered video to look like the one on the right. my code is:
import matplotlib.pyplot as plt
import matplotlib.animation as animation
from IPython.display import HTML
import numpy as np
def animate_test():
    fig = plt.figure()
    ax = fig.add_subplot(autoscale_on=False, xlim=[0, 1], ylim=[0, 1])
    ax.set_xlabel("The X Axis")
    ax.set_ylabel("The Y Axis")
    ax.set_title("A title that explains what the graph is about")
    points, = ax.plot([], [], ".")

    def animation_func(frame):
        points.set_data(np.random.random(100), np.random.random(100))
        return points,

    ani = animation.FuncAnimation(fig, animation_func, range(10), interval=100, blit=True)
    return ani

ani = animate_test()
video_writer = animation.FFMpegFileWriter(fps=10, bitrate=1000, codec="libx264")
ani.save(
    "animation_test.mp4",
    writer=video_writer
)
HTML(ani.to_jshtml())
(you probably have to run this in a jupyter notebook cell)
I have already looked on stackoverflow for this. they recommend changing the bitrate to some really high value. but I've tried that and it doesn't make a difference
change codec maybe?
any suggestions for the codec
okay I figured out how to at least export all the frames to individual png files, they turn out okay. I can stitch them together with another tool
def animate_test():
    fig = plt.figure()
    ax = fig.add_subplot(autoscale_on=False, xlim=[0, 1], ylim=[0, 1])
    ax.set_xlabel("The X Axis")
    ax.set_ylabel("The Y Axis")
    ax.set_title("A title that explains what the graph is about")
    points, = ax.plot([], [], ".")

    def animation_func(frame):
        points.set_data(np.random.random(100), np.random.random(100))
        fig.savefig(f"frames/animate_test_{frame}.png", format="png", transparent=False, facecolor="white")
        return points,

    ani = animation.FuncAnimation(fig, animation_func, range(10), interval=100, blit=True, repeat=False)
    return ani
ani = animate_test()
HTML(ani.to_jshtml())
no idea hahah just guessing
Little question, has anyone ever made a neural network (regression or convolution or both) which gave the same output no matter the input, as it was not learning anything, simply minimising the error?
huh
if the output is always the same for a given input, the error will always be the same no?
i dont quite follow
can anyone point me towards a package or library that can generate synthetic data based on another dataset (probably through a model)?
e.g. if output is always 1,
for 2 samples,
- output should be 0
- output should be 1
the error will never change?
The network doesn't learn. It's not really rote learning, it basically can't find the characteristics of the images (I'm trying to do angle recognition on lines), so it just tries to find the closest value to everyone and minimise the error like that
but the standard deviation is very high (angles are normalized by 2*pi) and I always have 0.1
A more acceptable standard deviation would be at least 0.001
for example with a CNN, but it's similar with a feed forward
oh
i think i get it
like theres some sort of restriction that the output can only be one value
I tried reducing the learning rate by a lot but it doesn't impact the training
would that be equivalent to calculating the mean or median or some statistical average ?
The output is the angle of a line on an image, this is working well for binary images in black and white, but adding noise to be closer to experimental data makes it harder for the network to learn
The network learns the average value basically
so it doesn't technically learn, it's sort of rote learning but not exactly
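A quick way to check whether a regression network has collapsed to "predict the average" is to compare its error against the constant-mean baseline. A small sketch with made-up targets (the uniform angles below are an assumption, not the real data): the MSE of always predicting the mean equals the variance of the targets, and no other constant does better.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical normalized angles in [0, 1); stand-in for the real labels
targets = rng.uniform(0.0, 1.0, size=1000)

def mse(predictions, targets):
    return float(np.mean((predictions - targets) ** 2))

# Always predicting the mean: the best any constant output can do under MSE
baseline_mse = mse(np.full_like(targets, targets.mean()), targets)
# Any other constant is worse
other_constant_mse = mse(np.full_like(targets, 0.3), targets)

print(baseline_mse, targets.var(), other_constant_mse)
```

If the network's validation MSE sits right at `baseline_mse`, it is probably ignoring the image content.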
why this instead of letting each image have its own output?
wdym
if image 1's angle is 90
and image 2's angle is 30
the output u would want is 60 right?
for both
no I want 90 and 30
oh
then isnt it just a normal network?
?
nvm
Normal network?
ur model is giving u the average now
Yes
which is what u dont want
okok, i thought u wanted to restrict it to one output value only
because it is not able to extract from the image the right characteristics of the line, so instead it minimizes the error and returns the same value for all input
Images are randomly generated, so it's pointless doing rotation, plus, on the experimental data, rotation modifies the pixels which is not what I want
images are 18x18 pixels
u got a sample image?
I can't show synthetic diagrams they are under a NDA
These are synthetic, close to the real data
ah but can i just imagine 2 lines making an angle?
it's one line per image
and the angle is measured to the horizontal?
vertical axis
I have two, one is a feedforward:
```python
model = nn.Sequential(
    nn.Linear(N, 24),
    nn.LeakyReLU(),
    nn.Linear(24, 12),
    nn.LeakyReLU(),
    nn.Linear(12, 6),
    nn.ReLU(),
    nn.Linear(6, 1)
)
```
One is a convolution:
```python
layers = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=kernel_size_conv, stride=1, padding=1),
    nn.ReLU(),
    nn.Conv2d(6, 12, kernel_size=kernel_size_conv, stride=1, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(12 * (kernel_size_conv ** 2) * (kernel_size_conv ** 2), 200),
    nn.LeakyReLU(),
    nn.Linear(200, 100),
    nn.LeakyReLU(),
    nn.Linear(100, 1)
)
```
The FF works really well on synthetic binary images (dont mind the orange I used cmap copper by accident)
ya seems like its the noise thats causing it to be less accurate
if like val of pixel < 127, then multiply by 0.2 ?
like to make light lighter and dark darker
I have tried smaller network, bigger network, modifying the lr, adding blur, using antialias to make thicker lines, ...
changing the network size won't really help, it's not overfitting
I just checked, my experimental data
```python
>>> X_exp.max()
tensor(3.8361e-10)
>>> X_exp.min()
tensor(-4.1663e-10)
```
i mean the pixel value
could you provide some code to generate the synthetic images? sounds interesting!
give me a second !
This is how I create the synthetic images:
```python
import numpy as np
from numpy import ndarray
from skimage.draw import line, line_aa
from scipy.ndimage import gaussian_filter  # only import if necessary
from typing import Tuple
from utils.angle_operations import calculate_angle, normalize_angle


def generate_image(size: tuple, sigma: float = 0, aa: bool = False) -> Tuple[ndarray, float]:
    """
    Generate an image with a random line on a noisy background
    :param size: Shape of the image
    :param sigma: Standard deviation of the gaussian blur (0 means no blur)
    :param aa: Anti-alias, creates AA line or not
    :return: the image (divided by 255) and the normalized angle of the line
    """
    img = np.random.normal(10, 2, size) * 255
    min_length = 0.5 * size[0]

    # Select two random positions in the array
    index1 = np.random.choice(img.shape[0], 2, replace=False)
    x1, y1 = tuple(index1)

    # Set a minimum length for the line (at least half the size of the picture)
    length = 0
    while length <= min_length:  # while the length is not at least half the size of the picture it selects new endpoints
        index2 = np.random.choice(img.shape[0], 2, replace=False)
        x2, y2 = tuple(index2)
        length = np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

    # Compute angle of the line with respect to the x-axis (horizontal)
    angle = calculate_angle(x1, y1, x2, y2)

    # Create line starting from (x1,y1) and ending at (x2,y2)
    if aa:
        rr, cc, val = line_aa(x1, y1, x2, y2)
        img[rr, cc] = 255 * val
    else:
        rr, cc = line(x1, y1, x2, y2)
        img[rr, cc] = 255

    img = gaussian_filter(img, sigma=sigma)
    return img/255, normalize_angle(angle)


def create_image_set(n: int, N: int, gaussian_blur: bool = False, aa: bool = False) -> Tuple[ndarray, ndarray]:
    """
    Generate a batch of arrays with various lines orientation
    :param n: number of image to generate
    :param N: side of each image
    :param gaussian_blur: Add a gaussian blur to the image if True
    :param aa: Anti-alias, creates AA line or not
    :return: 3d numpy array, n x N x N, and the array of normalized angles
    """
    image_set = np.zeros((n, N, N))  # important for NN to have size n x N x N
    angle_list = []
    for k in range(n):
        # note: gaussian_blur is passed as sigma, so True is treated as sigma=1
        image, angle = generate_image((N, N), gaussian_blur, aa)
        image_set[k, :, :] = image
        angle_list.append(angle)
    return image_set, np.array(angle_list)
```
And then some utils functions used:
calculate_angle
def calculate_angle(x1: float, y1: float, x2: float, y2: float) -> float:
    """
    Calculate the angle of a line with respect to the x-axis
    :param x1: x position of first point
    :param y1: y position of first point
    :param x2: x position of second point
    :param y2: y position of second point
    :return: angle of a line between (x1,y1) and (x2,y2) with respect to the x-axis
    """
    a, b, c, d = get_point_above_horizontal(x1, y1, x2, y2)
    dx = a - c
    dy = b - d
    if dx == 0:
        return np.pi/2
    else:
        slope = dy/dx
        angle = np.arctan(slope)
        if angle < 0:
            return angle + np.pi
        else:
            return angle
normalize_angle
def normalize_angle(angle):
    """
    Normalize angle in radian to a value between 0 and 1
    angle can be a float or a ndarray, it doesn't matter
    :param angle: angle of a line
    :return: normalized angle value
    """
    return angle / (2*np.pi)
Hi guys, does anyone have experience with transforming logs from several components (nginx, django, ...) into vectors, so I can do anomaly detection based on them?
Then to plot everything I use the following function:
```python
def create_multiplots(image_set_input: ndarray, angles: ndarray, prediction_angles: ndarray = None,
                      number_sample: int = None) -> Tuple[Figure, Axes]:
    """
    Generate figures with several plots to see different lines orientation
    :param image_set_input: set of images (torch.Tensor or n x N x N array)
    :param angles: array containing the angles for each image of the set
    :param prediction_angles: optional, value of predicted angles by a neural network (ndarray)
    :param number_sample: number of images to plot, None by default
    :return: a figure with subplots
    """
    if isinstance(image_set_input, torch.Tensor):  # if images are from load_diagrams.py
        image_set = image_set_input.squeeze(1)
        n, p, _ = image_set.shape
    else:  # for synthetic diagrams
        image_set = image_set_input
        n = len(image_set)
        p, _ = image_set[0].shape
    # n, p = image_set.shape  # change when using tensor
    # print(len(image_set))
    # n = len(image_set)  # change when using synthetic data

    if (number_sample is not None) and (number_sample < n):
        n = number_sample

    # Compute the number of rows and columns required to display n subplots
    number_rows = int(np.ceil(np.sqrt(n)))
    number_columns = int(np.ceil(n / number_rows))

    # Select a random sample of n indices (k=n also works when number_sample is None)
    indices = sample(range(len(image_set)), k=n)

    # Create a figure and axis objects
    fig, axes = plt.subplots(nrows=number_rows, ncols=number_columns, figsize=(6 * number_columns, 6 * number_rows))

    for i, ax in enumerate(axes.flatten()):
        if i < n:
            index = indices[i]
            # image = np.reshape(image_set[index, :, :], (Settings.patch_size_x, Settings.patch_size_y))
            image = image_set[index, :, :]
            normalized_angle = float(angles[index])
            # print(normalized_angle)
            angle_radian = normalized_angle * (2 * np.pi)
            # print(angle_radian)
            angle_degree = angle_radian * 180 / np.pi
            ax.imshow(image * 255, cmap='copper')
            title = 'Angle: {:.3f} | {:.2f}° \n Normalized value: {:.4f}'.format(angle_radian, angle_degree, normalized_angle)
            if prediction_angles is not None:
                prediction_angle = prediction_angles[index][0]  # the angle is a ndarray type with one element only for index i
                title += '\n Predicted: {:.4f} ({:.2f}°)'.format(prediction_angle, prediction_angle*2*np.pi*180/np.pi)
            ax.set_title(title, fontsize=25)
            ax.axis('off')
        else:
            fig.delaxes(ax)  # if not there, problem with range in the array and out of bound error

    plt.tight_layout()
    return fig, axes
```
Since you don't have the predicted angles you don't have to pass it as argument, it won't take it into account
wonderful. i am at work atm so i can't really look into this atm, but i will hopefully have a look tonight!
to run the last bit you'll need the following libraries:
```python
from typing import Tuple
import matplotlib.pyplot as plt
from numpy import ndarray
import numpy as np
from matplotlib.figure import Figure
from matplotlib.axes import Axes
from random import sample
import torch
```
Heyy no problem, i'm also at work haha
This is part of my internship project for context
Here is another synthetic dataset, the angles predicted are very close to the actual one (feed forward)
Hey guys, how is it going? I don't know what happened, but all of sudden all of my streamlit apps started refreshing the page every 25/30 seconds. Yesterday everything was working fine. I started thinking it was some problem on my codes. But since this started happening to all the apps, I believe it is not something related to my codes
Does anyone know what is going on?
the angles predicted are very close to the actual one (feed forward)
oh! so for synthetic dataset it's fine but not so much for the actual dataset?
Yes exactly, additionally if you make synthetic data with some noise (Gaussian blur), it produces the same effect
It's a bit annoying because the repo of the lab team is private, it'd be easier to send the github link so you can just load it
that's no problem at all 😉 to be expected in these kind of things tbh.

@boreal gale @cold osprey I used another loss function (SmoothL1: https://pytorch.org/docs/stable/generated/torch.nn.SmoothL1Loss.html#torch.nn.SmoothL1Loss) and now I get correct prediction, not precise but at least I don't get always the same value.
Choosing the same seed in these two ImageDataGenerators makes it so that the same transformations are applied to input and output picture, right? The reason i'm asking is: When i was manually pre-loading my data and training my neural net the performance was amazing, but now that i am trying to use DataGenerators the performance sucks (with the same model architecture!). What might be the reason for that? Is there something wrong with these DataGenerators?
work well on synthetic data, not experimental one though
just starting to look at this - missing get_point_above_horizontal i think
oops
```python
def get_point_above_horizontal(x1: float, y1: float, x2: float, y2: float) -> Tuple[float, float, float, float]:
    """
    Get the point above the horizontal line passing through the center of the line between (x1,y1) and (x2,y2).
    :param x1: x position of first point
    :param y1: y position of first point
    :param x2: x position of second point
    :param y2: y position of second point
    :return: in order the point above the center and the point under the center
    """
    # Calculate the y-center of the line
    center_y = (y1 + y2) / 2
    if y1 >= center_y:
        return x1, y1, x2, y2
    else:
        return x2, y2, x1, y1
```
🙏

in your real dataset, are all the lines also spanning across the image from edge to edge?
also are you trying to do this from scratch? are you open to using some pre-trained arch/weights found in the wild?
are you trying to keep the model minimal?
yes
it's supposed to be small to then be implemented in a physical circuit in a cryostat
So not too big
gotcha, well first we gotta make something that actually works first, then i think there are ways to trim it down
yup
size 256 - synthetic data generator seems to generate lines that aren't going from edge to edge. expected? or did i use too big a size?
there is a min_length = 0.5 * size[0] so i guess it is expected..
i will just keep it 0.5, the generator just burns my CPU trying to find valid candidates 💀
xd
yeah my laptop has an RTX 4050 lol
I don't even hear it during network training xd
For the experimental data, I'll try to progressively increase the size of the network
I use patch of 18x18 btw
it's very small
oh right
:3
and you have tried upsampling it right?
the experimental data?
yep
The problem we have is the lack of data, also the distribution of data
you can see most angles are between 0.3 and 0.49 (multiply by 2pi to get radian)
my apologies, i meant upsampling in the image, not upsampling in the traditional data science sense.
i.e. 18x18 -> 256x256 for example
I can't do that
how come?
we are limited by the patch sample of the machine
it's supposed to be used live on quantum dots
you want to make small patch otherwise the measurements are too small
i meant something like this
Original array:
[[0 1 2]
[3 4 5]
[6 7 8]]
Resampled by a factor of 2 with nearest interpolation:
[[0 0 1 1 2 2]
[0 0 1 1 2 2]
[3 3 4 4 5 5]
[3 3 4 4 5 5]
[6 6 7 7 8 8]
[6 6 7 7 8 8]]
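The nearest-neighbour resampling in that example can be reproduced with plain NumPy. A minimal sketch; `upsample_nearest` is a hypothetical helper name, not part of the project code:

```python
import numpy as np

def upsample_nearest(img: np.ndarray, factor: int) -> np.ndarray:
    """Repeat every pixel `factor` times along both axes (nearest-neighbour upsampling)."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

original = np.arange(9).reshape(3, 3)
resampled = upsample_nearest(original, 2)
print(resampled)
```

This adds no information to the patch, it only gives the convolution kernels more pixels per edge to work with.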
Also modifying the data is not wanted because its supposed to be a general method
to be applied on any dataset
sure, i am just thinking how a NN can recognise edges if your pixel is that massive (relatively speaking).
@wooden forge I think you should try using a Radon transform instead of a neural network. While this is not something I'm especially familiar with, I believe the technique is: (1) Pad the image on all sides with zeros; (2) Apply a two-dimensional Fourier transform; (3) For each angle theta, look at the maximum absolute value (of the Fourier transformed data) achieved along the line of angle theta through the origin. That is, for each theta, look at all the points whose polar coordinates are (r, theta) where r is allowed to be any real number (including negative), take absolute values, and say that the maximum absolute value along the line is some kind of score for how likely it is that this theta is the angle you're looking for; the theta with the highest score is your guess for the true angle of the line.
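For anyone who wants to try the idea quickly, here is a rough sketch of a related but simpler variant. It is not the Fourier-based procedure described above: it scores each candidate angle directly in image space by summing intensities along parallel lines (a crude discrete Radon transform) and keeping the strongest sum. The function name, the one-pixel bin width and the test image are all assumptions for illustration; the angle is measured from the column axis towards the row axis, in radians in [0, pi).

```python
import numpy as np

def estimate_line_angle(img: np.ndarray, n_angles: int = 180) -> float:
    """Return the candidate angle (radians, in [0, pi)) whose parallel-line sums peak highest."""
    h, w = img.shape
    rows, cols = np.mgrid[0:h, 0:w]
    x = cols - (w - 1) / 2           # column offset from the image centre
    y = rows - (h - 1) / 2           # row offset from the image centre
    weights = img - img.mean()       # subtract the background level
    r_max = int(np.ceil(np.hypot(h, w) / 2))
    bins = np.arange(-r_max - 0.5, r_max + 1.5)   # one-pixel-wide offset bins
    thetas = np.linspace(0.0, np.pi, n_angles, endpoint=False)
    scores = np.empty(n_angles)
    for k, theta in enumerate(thetas):
        # Signed distance of every pixel from the centre line with direction theta
        r = -x * np.sin(theta) + y * np.cos(theta)
        projection, _ = np.histogram(r, bins=bins, weights=weights)
        scores[k] = projection.max()
    return float(thetas[np.argmax(scores)])

# Synthetic 18x18 test image: a diagonal line on a slightly noisy background
rng = np.random.default_rng(0)
img = rng.normal(0.0, 0.05, (18, 18))
img[np.arange(18), np.arange(18)] += 1.0
angle_deg = float(np.degrees(estimate_line_angle(img)))
print(f"estimated angle: {angle_deg:.1f} degrees")
```

On an 18x18 patch several neighbouring angles select exactly the same pixels, so the estimate is only good to a degree or two.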
so I will have a very large amount of theta, it doesn't seem right
pretty much sounds like a Bayesian neural network
I don't know what you mean by a large amount of theta.
the theta with the highest score is your guess for the true angle of the line.
If you have lines with many different orientations, you will have many different theta
Yes.
Moreover, I have somewhat of a continuum. Yes it's a numerical continuum, but still, that's a lot of value, and I can't just discretize the range by setting a list of possible angles like [0, 2, 3, ...179°] (180 being excluded because I take into account the symmetry of the line 0 <=> 180)
Why can't you discretize the range? Your images are only 18x18, so you're unlikely to be able to observe angles to high precision.
If this is really a problem, you could also try an initial discretization, then a local search for a slightly better angle.
But I suspect that's unnecessary. I think the resolution you can observe is likely to be no more than 100 angles.
It depends on the amount of noise and the number of bits of precision in each pixel I guess.
Hello people!
My school group and I have been working on classifiers
And our research question is: does image quality improve a classifier?
And we ended up with 3 classifiers (Lr, Knn, DTC)
Trained on 3 different quality images (Good, bad and mixed) with 3 types of features selection (PCA, variance threshold, and none)
And tested on 3 types of data sample (again , good images bad and mixed)
So yeah that's 81 different results, anyone has a good idea on how to plot them for our presentation without being a giant number mess ?
The basic way, I think, would be to choose one of the parameters (let's say classifier kind) to be columns, and the other 27 combinations will be rows (but sorted by quality, then feature, then sample). And in the cells you'll have results, ideally colored by some measure of "goodness".
I will try that! Tyvm
I guess you could also do 9x9 instead of 27x3 if the cells are small enough.
Yeah I was thinking about doing 3 subplots one for Knn one for LR and one for DTC
Columns being data trained on
Rows being data test on
And in each of the graph the 3 different type of feature selection
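A rough sketch of the 27x3 table layout suggested above, using pandas. The scores are random placeholders, not real results, and the column names are made up for the example:

```python
import itertools
import numpy as np
import pandas as pd

classifiers = ["LR", "KNN", "DTC"]
train_quality = ["good", "bad", "mixed"]
features = ["PCA", "variance threshold", "none"]
test_quality = ["good", "bad", "mixed"]

# Placeholder scores only: random numbers standing in for the real accuracies
rng = np.random.default_rng(0)
records = [
    {"classifier": c, "train": tr, "features": f, "test": te, "score": rng.uniform(0.5, 1.0)}
    for c, tr, f, te in itertools.product(classifiers, train_quality, features, test_quality)
]
results = pd.DataFrame(records)

# 27 rows (train quality, feature selection, test sample) x 3 columns (classifier)
table = results.pivot_table(index=["train", "features", "test"], columns="classifier", values="score")
print(table.round(2))
```

In a notebook, `table.style.background_gradient()` is one way to colour the cells by score (it needs matplotlib installed).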
@lone steppe Hi, you already are aware of the issue, kindly help
guys, do you have any ideas of using AI or ML in API testing?
you tagged the wrong person
#databases message
have you installed seaborn and matplotlib?
lel
sorry
yes, i have done it already
you have opened a help thread, let me follow up there.
sure
making an application that takes api endpoints as input and uses AI/ML to test those api endpoints (ofc we will have to send requests) and generate a report of some analysis
@cold osprey
interesting
So I am doing this project for text summarization. And I want to compare different normalization and other preprocessing performances. Right now I am trying to compare the performance of different stop words lists, but I don't know what metric should I use? Can you help me out? I am googling and everything but can't really find an answer
@warm bane
Have you heard of spacy? It may have everything for that.
I heard of it yes
I am not much experienced with NLP. I've only ever taken 1 lecture on it but spacy is useful. Maybe try to check if there are metric systems there or not. I cannot quite help further :p
ok thank you
Do you guys know of any good library that one can use to visualize 3D mathematics?
Like I want to do math on vectors & see those vectors in 3D space - solely for my own experimentation
You can use matplotlib for 3d plots, but based on your messages you want something more of a game engine
Pyglet could work for a 3d game if thats what you need
And if it is a game folks over at #game-development could be more of help
Just make sure to not crosspost
You can be auto muted
afaik pyglet's support for 3d is literally just exposing opengl. which... works, but isn't very fun
Yes well what I want is just simplest possible way of doing vector math
and visualizing it
I don't actually see what you mean by visualizing vectors. A 3d quiver plot? https://matplotlib.org/stable/gallery/mplot3d/quiver3d.html
Well I just mean drawing dots in space
Lines are cool too
you can draw dots in space with a scatter plot
can someone leave the paper or tell me what are the best ways to train resnet18 or resnet34 architecture on Imagenet?
I built a GNN from scratch, and taught it to play Jetpack joyride. Here’s the results
can anyone explain how the deeper layers of a CNN are able to detect features from an image non-arbitrarily? What I mean is, say there's a house, deeper max-pooling + CNN layers will extract very small pixels from the house, that make the feature indiscernible from knowing it belongs to the house. How is the CNN able to know those are the higher representational features of a house
deeper layers in a CNN learn to detect more complex and meaningful features by building upon simpler features learned in earlier layers
i think the network gradually combines and abstracts these features to recognize objects or patterns, even if the specific pixels aren't easily distinguishable in the deeper layers
okay. in the final fully connected layer, when things get flattened, it's to essentially combine the extract features of a particular object?
yepp that's correct In the final fully connected layer of a CNN, the feature maps from the previous convolutional and pooling layers are flattened into a 1-dimensional vector. This flattening operation is performed to combine and represent the extracted features of a particular object or image.
okay, so the final layers (before the classification) just extract even finer representations of the object
By flattening the feature maps, the spatial information is lost, and the network can then treat the features as a sequential input
This allows the fully connected layer to receive and process the learned features as a fixed-length vector.
The fully connected layer then performs classification or other tasks based on these combined features, making predictions or generating output based on the learned representations.
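To make the flattening step concrete, here is a small shape-only sketch in NumPy. The sizes are arbitrary and the weights are zeros, since only the shapes matter here:

```python
import numpy as np

batch, channels, height, width = 4, 12, 16, 16
feature_maps = np.zeros((batch, channels, height, width))

# Flatten everything except the batch axis: one fixed-length vector per image
flat = feature_maps.reshape(batch, -1)

# A fully connected layer is then a matrix product on that vector
n_outputs = 10
weights = np.zeros((flat.shape[1], n_outputs))
logits = flat @ weights
print(flat.shape, logits.shape)
```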
okay. if you have time I have a few questions
- Why would you not want to add as many deep layers as possible for image classification, since your network can extract more and more relevant features? I understand that increases compute time, but in terms of accuracy, wouldn't deeper layers mean more accuracy?
only if you have enough data and time to train it
the more parameters you have, generally the more data and epochs you need to train it correctly
there's a sweetspot for it. past a certain point, if you don't increase the amount of training data, the performance will start to get worse
the thing here is that the optimal depth of a CNN depends on the complexity of the task, the available dataset size, and computational constraints. while deeper networks can improve accuracy to a certain extent, there is often a diminishing return on performance as the network becomes more complex.
interesting how it would require more data to train at deeper networks
that's the usual behavior whenever estimating parameters
more unknowns leads to higher variance in the estimate, which you offset by showing more data
a bigger network means more parameters need to be trained, i.e. more unknowns
there's a very trivial lower bound for this: you need at least as many data samples as parameters you want to estimate. if this is not met, the parameters cannot be found at all
if you keep the data set fixed and make the network arbitrarily big, you will hit this scenario eventually, and any network larger than this size won't work. in practice, this trivial bound almost never holds and the networks would break down even earlier due to properties of the data
you can see the easiest case by considering that you can solve a linear system of equations with 2 unknowns if the system has 2 linearly independent equations. if you have 2 linearly dependent equations, all of a sudden there are either infinitely many solutions with different properties (which one is the right one for your case?) or no solutions at all
if you only had 1 equation you'd run into similar problems
if we add more variables to the equations, the problem gets worse. if we add more equations though, we can do something about it
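The linear-system analogy can be checked numerically. A small sketch with made-up equations: two independent equations give a unique solution, two dependent ones do not, and extra (noisy) equations let least squares pin the unknowns down:

```python
import numpy as np

# Two independent equations, two unknowns: x + y = 3 and x - y = 1
A = np.array([[1.0, 1.0], [1.0, -1.0]])
b = np.array([3.0, 1.0])
solution = np.linalg.solve(A, b)
print(solution)

# Two dependent equations (second row is twice the first): rank 1, no unique solution
A_dep = np.array([[1.0, 1.0], [2.0, 2.0]])
rank_dep = np.linalg.matrix_rank(A_dep)
print(rank_dep)

# Many noisy equations for the same two unknowns: least squares averages the noise out
rng = np.random.default_rng(0)
A_many = rng.normal(size=(50, 2))
b_many = A_many @ np.array([2.0, 1.0]) + rng.normal(scale=0.1, size=50)
estimate, *_ = np.linalg.lstsq(A_many, b_many, rcond=None)
print(estimate)
```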
How to use the pd.to_datetime method on dates following a "month-year" format, e.g. "Jun-91", "Jul- 91" -------- "Dec-00", "Jan-21".
I was trying to analyze a csv in which dates were given in the aforementioned format and I am unable to convert the strings into datetime objects.
try using
pd.to_datetime(df['Dates Column'], format="%b-%y")
replace df['Dates Column'] with the dates column of your dataframe
Thanks, it's working now
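For reference, a self-contained sketch of that conversion. The DataFrame and the column name `date` are made up; the space stripping is an assumption, in case values like "Jul- 91" really contain a stray space:

```python
import pandas as pd

df = pd.DataFrame({"date": ["Jun-91", "Jul- 91", "Dec-00", "Jan-21"]})

# Remove stray spaces (e.g. "Jul- 91") before parsing
cleaned = df["date"].str.replace(" ", "", regex=False)

# %b is the abbreviated month name, %y the two-digit year
df["parsed"] = pd.to_datetime(cleaned, format="%b-%y")
print(df)
```

With `%y`, two-digit years 69 to 99 are read as 19xx and 00 to 68 as 20xx, so it is worth checking that this matches the data.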
ur wlcm
hi!
there was a library that allows you to "fuse" several arithmetic operations into one object, and then you can apply it to some inputs (raw numbers or even numpy arrays)
like this (pseudocode):
```py
x = O('a * b + c')
x(a=1, b=2, c=3)  # 5
x(a=nparr1, b=nparr2, c=nparr3)  # another array
```
i cannot find this library, can you help me?
looks like numexpr, but im not sure
is it SymPy?
this library also allows you to define arithmetic expressions symbolically and then evaluate them with different inputs.
I figure I should put this on here cause answers to help requests regarding AI generally dont seem to come through. I'm very new to using PyTorch and am trying to use ChatGPT to make a CNN that can play Pong (via Openai's Gym). Both ChatGPT and I have tried to fix this error but I (in my lack of experience) cant seem to figure it out. Any help would be appreciated.
You have 4 dimensions, so you need to specify all 0, 1, 2, and 3 for the permute
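A minimal illustration of that point, written with NumPy so it runs without PyTorch. The frame sizes are only an example; `torch.Tensor.permute(0, 3, 1, 2)` takes the same kind of axis order:

```python
import numpy as np

# A batch of frames in (batch, height, width, channels) order; the sizes are only an example
frames = np.zeros((8, 210, 160, 3))

# All four axes must be listed: move channels to position 1 -> (batch, channels, height, width)
channels_first = np.transpose(frames, (0, 3, 1, 2))
print(channels_first.shape)
```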
When I tried that this error came up
This is the CNN model:
I figured it should have three input channels cause thats the number for pong (RGB, later greyscaled)
@flint cosmos please always show code, error messages, and other text as actual text (not screenshots)
!code
gotcha chief
The whole of the code: https://paste.pythondiscord.com/oxixacimil
Hi does anyone have experience working with MCTS here?
I am evaluating some states in game with just model, and with MCTS guided by the model itself. These model change in every 10 iterations, but it seems that model guided MCTS always has the same accuracy which is weird.
Can anyone help me please? I want to use tensorflow gpu but i tried more than 5 times with wsl but it never works.
Hello, I am new to machine learning and wanted to know if there is anything I need to optimize or change, maybe a new model or something else. My project is a multi-label text classifier (I know it's maybe too advanced, but I have wanted to try making this for a while). I don't want to use any pretrained models or generally anything I need to download; I just want to use pytorch, numpy, sklearn and other things like that. Here is my code (I know it's very messy and badly written). Let me know if you need the other files; they don't have anything that I think should be changed, they are just some simple files with some functions and variables.
https://pastebin.com/STHwu2Dx
The plot shows some horizontal lines, how do I remove these data quirks?
i have a json with all of the universities in the world and I'm trying to figure out a way to rank them all there about like 100k data points here is the link to download as well https://zenodo.org/record/7387951
Data dump from the Research Organization Registry (ROR), a community-led registry of open identifiers for research organizations. Release v1.15 contains ROR IDs and metadata for 103,047 research organizations in JSON format. This release reflects the introduction of new organization statuses in ROR and new functionality for changing an organizat...
i have no clue where to start?
that's impressive!
Cheers! Thank you
interesting output
extremely interesting output, you have a compounding function for inputs that alike, and every input matched you increase the value of the total output.
in color space
you could find the mean in this graph, average both sides of that mean to find values way outside of it and mute the resulting output because of its irrelevance
Hello, so I'm very new to machine learning, and would love if somebody could point me into the right direction. I'll do my research but little pointers could help a lot.
I have a dataset containing thousands of tuples with a set and a value, each set contains a few items of a total of around 100 different ones, represented in this example by numbers.
The items in the sets correlate to each other in an unknown way to get to the value.
I want to build a model using pytorch (but I could switch to anything) to predict the value of any set.
What kind of architecture / layers go well with that kind of data?
example_data = [
({0,1,2}, 10),
({1,5,9}, 5000),
({20,28}, 400),
({89,95,98,99}, 1),
...
]
That’s an interesting problem. Where did the data in the tuples come from?
It's something similar to a set of qualifications in a curriculum relating to a mean salary, I just simplified the best I could to explain the problem
If there's only 100 possible set items, you may as well just have 100 input nodes
That are either on (1) or off (0) depending on whether the value is in the set
And the output is the other value, so it's a regression task
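A minimal sketch of that encoding, assuming the item ids run from 0 to 99 as in the example data. `encode_set` is a hypothetical helper name:

```python
import numpy as np

N_ITEMS = 100  # total number of distinct items, per the description above

def encode_set(items: set, n_items: int = N_ITEMS) -> np.ndarray:
    """Multi-hot encode a set of item ids as a 0/1 vector of length n_items."""
    vec = np.zeros(n_items, dtype=np.float32)
    vec[list(items)] = 1.0
    return vec

example_data = [({0, 1, 2}, 10), ({1, 5, 9}, 5000), ({20, 28}, 400), ({89, 95, 98, 99}, 1)]
X = np.stack([encode_set(s) for s, _ in example_data])   # inputs, shape (n_samples, 100)
y = np.array([v for _, v in example_data], dtype=np.float32)  # regression targets
print(X.shape, y.shape)
```

`X` and `y` can then be fed to any regression model, e.g. a small feed-forward network with 100 inputs and 1 output.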
Hi does anyone have some experience with MCTS? I am doing search on game states, and it seems that my search doesnt improve rather than what the model gives me
How do tools like keepa get their data from Amazon if Amazon does not allow their data to be scraped?
when you make a neural network, what distribution should the biases and weights be initialized with?
i'm new to this and the example i'm working off of uses a gaussian distribution and i was wondering if there's a particular reason why
things usually just work fairly well with a normal distribution
there are some things you have to pay attention to though
these are the defaults Pytorch uses for Linear layers for example:
I forgot the exact explanation, but iirc it was something like scale down the values & variance based on the number of features to avoid having the gradient go out of control as you stack more and more layers
don't trust me though, do look it up properly
makes sense lol, i'll try to look it up
i didn't really get any results on google when i tried before asking here tho, so are there any specific terms or anything I should search?
(All lesson resources are available at http://course.fast.ai.) In this lesson, we discuss the importance of weight initialization in neural networks and explore various techniques to improve training. We start by introducing changes to the miniai library and demonstrate the use of HooksCallback and ActivationStats for better visualization. We th...
from the linked timestamp up to 36-ish should be relevant to this
ty
Advice based on experience: when using normal distributions for initialization, keep track of the gradients' behavior... using a normal distribution without scaling tends to generate vanishing gradients... especially in Fully Connected Layers.
I had to lose the habit of using normal initialization that I had acquired while working with GANs because of that... it tends to make classifiers rubbish. Too much time to train, higher risk of local optima 
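The effect of scaling by the number of input features can be seen in a small NumPy experiment. This is only a sketch under stated assumptions: a stack of bias-free ReLU layers, random inputs, and two illustrative choices of weight standard deviation (a fixed 0.01 versus He-style `sqrt(2 / fan_in)`). It tracks forward activations; the backward signal shrinks in a similar way.

```python
import numpy as np

def final_activation_std(weight_std_fn, n_layers=10, width=256, seed=0):
    """Push a random batch through a stack of ReLU layers and return the std of the last activations."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, width))
    for _ in range(n_layers):
        w = rng.normal(scale=weight_std_fn(width), size=(width, width))
        a = np.maximum(a @ w, 0.0)   # linear layer (no bias) followed by ReLU
    return float(a.std())

fixed = final_activation_std(lambda fan_in: 0.01)                    # same std regardless of layer width
scaled = final_activation_std(lambda fan_in: np.sqrt(2.0 / fan_in))  # std scaled down by fan-in
print(f"fixed std=0.01: {fixed:.3e}   scaled by fan-in: {scaled:.3e}")
```

With the fixed standard deviation the activations collapse towards zero after a few layers, while the fan-in scaling keeps them at a roughly constant magnitude.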
Can you write an equation or something? I have to find mean of x and y then do what?
The graph is from a dataset in the book hands on ml with scikit learn
no one can help me ?
Hey guys! Any of you know about any flood prediction model or something similar?
I got a hackathon problem statement on "Sea level rise and coastal flooding" and I am thinking of making an app for flood prediction using weather data and stuff
Any pointers on guides? Thanks 🙂
I have different models to test. I need to find which is the best one for my kind of data. However I'm confused on whether I should use cross validation with all the data cv_scores = cross_val_score(model, X, y) or should I only use the training data, cv_scores = cross_val_score(rf, X_train, y_train). My guess is that for the sake of finding which model is the best, i should do cv_scores = cross_val_score(rf, X, y) instead, is that correct? and when I know which one is better, train it with only the training data?
You don't use the training data at all during cross-validation.
A sufficiently overfit model will perfectly learn the training set, so performance on the training set isn't a very useful metric. CV is for estimating how good it is in generalizing to data points it never saw during training.
how not? doesn't it take all the data also including the training data with the respective fold of test data
based on this figure
What's the issue? There is a specific guide for tensorflow with GPU for WSL in the docs. Have you followed that?
@muted crypt with cross validation the algorithm loops over the data, and selects a portion of it to use as train and test
So yeah it's best to use cv_score = cross_val_score(rf, X, y)
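If you also want an honest final score after picking a model, one common pattern is to cross-validate on the training portion only and keep a test set untouched until the end. A sketch with synthetic stand-in data; the two candidate models are arbitrary examples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; replace with the real X, y
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
    "logreg": LogisticRegression(max_iter=1000),
}

# Compare models with 5-fold CV on the training portion only
cv_means = {name: cross_val_score(model, X_train, y_train, cv=5).mean()
            for name, model in candidates.items()}
best_name = max(cv_means, key=cv_means.get)

# Refit the winner on all training data, then score once on the untouched test set
best_model = candidates[best_name].fit(X_train, y_train)
test_score = best_model.score(X_test, y_test)
print(cv_means, best_name, round(test_score, 3))
```

Running `cross_val_score` on all of `X, y` is fine for comparing models; the held-out split only matters if the selected model's score should not be biased by the selection itself.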
is there a way to make jupyter notebook sessions persistent? Like.. so I wouldn't have to rerun all the cells every time I reboot my PC? I know about dill, but.. is there no better way?
I know that with tmux you can make your terminal sessions persistent with Tmux, and I know that you can edit and run notebooks from terminal editors, like Neovim and Emacs (after spending a few ages configuring them), so in theory, it should be perfectly possible to have a terminal code editor attached to a Tmux session, which will make your notebooks persistent. But I haven't tested this theory, and anyhow, I use VS Code.. :/
Anyone have any experience with this? I'm pretty sure you can attach vs code to tmux (don't quote me tho, and I'm not sure how this works), but honestly, I'm not much of a terminal wiz and don't even use tmux, tho would gladly install and use it if there was a way to make jup notebooks persistent, and would be extremely grateful if someone could show me da wae (known to mortals as "the way")
Running the jupyter kernel on another computer would work - that'd persist until that computer has to reboot.
that's not an option for me, I'm afraid 😅
But to make it survive a reboot of the kernel, you'd have to save a python interpreter's state somehow, which... seems really hard.
why not store the important results and load them when needed? kinda sounds like jupyter is holding you back here tbh
but yeah a workaround is to check whether certain files exist, and if not, compute their content and create them. if they do exist, just load them
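That workaround could look roughly like this. `cached` and `expensive_step` are hypothetical names, and the temp directory is only used to keep the demo self-contained; in a real notebook a project folder would make more sense:

```python
import pickle
import tempfile
from pathlib import Path

def cached(path, compute):
    """Load a pickled result from `path` if it exists; otherwise compute, save, and return it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result

def expensive_step():
    # Stand-in for a slow cell
    return sum(i * i for i in range(1000))

cache_dir = Path(tempfile.gettempdir()) / "notebook_cache_demo"
total = cached(cache_dir / "total.pkl", expensive_step)
again = cached(cache_dir / "total.pkl", lambda: -1)  # served from disk, the lambda is never called
print(total, again)
```

After a reboot, rerunning the cell only costs a file read instead of the full computation.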
yes, i've followed the official guide and other guides, but it never works
Is there any open source ai that i can integrate into my project that has the capabilities of recognizing the constituents of stock images?
I kinda dont wanna train my own
I also dont know how even if i had the hardware
constituents?
is there a way to make jupyter notebook sessions persistent?
what's the reason for making you want to do this?
Like.. so I wouldn't have to rerun all the cells every time I reboot my PC?
just for this?
I know about dill, but.. is there no better way?
what's the pain point about dill?
I know that with tmux you can make your terminal sessions persistent with Tmux....
this would only make sense for running jupyter (and kernels) on a remote host, and this is out of the box?
there probably are ways to do this, just spitballing here -- perhaps wrapping your jupyter stack in vagrant and vagrant suspend whenever you need to switch off your PC, N.B. this is a VERY heavyhanded solution, i do not recommend this like at all.
Is anyone here proficient in TensorFlow and Computer Vision willing to collaborate on a project? DM if interested
here it seems the NASNetLarge pretrained imagenet model is the 'best' when plotted on these 2 variables
but is there any drawback I should be aware of when using it
just for this?
yeah, just for that, pretty much. Convenience
what's the pain point about dill?
nothing really, it's just inconvenient. You have to write the code for dumping and loading, then you have to run the import and load cell on open. Like.. it's not the end of the world, but it's not very convenient either.
Besides that, I'm a bit of a.. idk how to say this, so lets just say weirdo.. I like it when the numbers next to my cells, the ones that show the order in which the cells were run.. are actually in the order I ran the cells. dill ruins that.. :/ kinda nitpicking, but.. but yeah.
this would only make sense for running jupyter (and kernels) on a remote host, and this is out of the box?
I don't really get what is said here. Why would it only make sense for running on a remote host? And what is meant by "out of the box"?
perhaps wrapping your jupyter stack in vagrant and vagrant suspend
whoosh, that went over my head :3
I've started looking at index hierarchy, I guess I'm wondering what everyone's thoughts on it are. I prefer to use business data for my keys and this kind of popped out at me
can anyone please guide me through the steps for making an English to Marathi translation model, and vice versa, by fine-tuning multilingual language models.
Thank You in advance.
The data i'm working on collects data from states every month so i'm thinking i can use Year/Date/State as my index, is there a reason not to do this?
Like the things that make up the image, ie this random image from my camera roll, it should recognize there is a person driving a vehicle in a field
Or in this one that it depicts a sunset
Thats what i mean by constituents
anyone plz ?
In a machine learning program I am doing now, it said that you should use the linear activation function for something like a stock price predictor, since y can be positive or negative, but for a model that predicts the price of a house you should use the ReLU activation function, since y (the price of a house) can never be negative. My question is: what's the point of even using ReLU? ReLU turns anything below zero into zero, but the price of a house will never be negative, so why not just use the linear activation function?
you forget that you will later want to do inference on values outside of the data set
if you model your prices with a linear function, past a certain parameter range, the values will all be negative no matter what you do
that means the model is only valid in a limited domain. you immediately lost generalizability by using the wrong activation
(you'd also never use a linear activation with deep learning for other reasons: no matter how many linear layers you use, they can be simplified into a single linear function. the power of neural networks comes from using nonlinear functions)
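A small NumPy check of that last point. The weights are random values picked for illustration; this is a sketch, not a real network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked layers with a linear (identity) activation.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

Putting a nonlinearity such as ReLU between the two layers is what stops them from folding into one matrix like this.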
Okay so using the ReLU for the housing price predictor helps with more complex relationships in the network ?
It makes sure that even for edge cases the output would be 0 instead of negative
Incorporating a non-linearity helps in modelling more complex relationships between the attributes, as you're no longer limited to a linear separator (a straight line in 2d, and corresponding hyperplanes in higher dimensions)
Think of it like this: as long as you only have a straight line to separate your data points, you can only separate very cleanly divided data points, non linearities can give you "curvy" separators, so you can separate between finer patterns
im interested in a career in machine learning. i know python is the prefered language for machine learning, but what else do i pair it with?
yes, the right language is the language needed for the job, but what language can best prepare me for what the jobs in machine learning will most likely be?
should i try for knowledge in python and r?
or maybe python and c++?
Python goes really well with c
U can write a lot of your heavy lifting in c
And use python for other parts
oof this isn't as readable as I'd like it to be ;-;
Math
Solid understanding of the math behind machine learning
In terms of programming languages, python is sufficient for now, and you can pick another up quick if you need to
What's insubstitutable is math
To start off
That said, a lot of high performance code for ML is written in C++ and then called through python wrappers. Including for popular deep learning frameworks like tensorflow and pytorch
R is still popular among a significant number of data analysts
R is also popular among legitimate data scientists
R as a language has many features I don't like (discussed this already) but there are many libraries in R that aren't as readily available in Python (and vice versa). Part of it is likely availability bias since one is used by more stats oriented people and the other by more comp sci oriented people.
The biggest one here is time series analysis. (Python) statsmodels' API is garbage, pmdarima is too slow and the new kid on the block Nixtla's source code is just so horrible I don't want to use it but it's the best bet I have
So yes, I think you should focus on Python but also learn the basics of R and treat it like a DSL (domain specific language) for statistics moreso than an actual programming language.
I've followed the TensorFlow tutorial to install TensorFlow GPU on WSL, but when I run this command: python3 -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))" it returns this
here it seems the NASNetLarge pretrained imagenet model is the 'best' when plotted on these 2 variables, but is there any drawback I should be aware of when using it
Why can't I execute this code?
Do you have the Python extension installed?
isn't this installed with conda?
No, conda has everything to do with Python itself (the programming language).
The Python extension in visual studio code lets it "understand" Python and gives you features that make python programming easier.
ok thanks
and now I'm sure I've installed tensorflow with pip in conda, but why is it not here?
try running the code. Does it work?
see that yellow box at the bottom right. I don't speak french, or whatever language that is, but you need to select the environment there for vs code to see what environment you're using, not just activate it in the terminal. Then vs code will automatically activate it for you in the terminal, btw
yes, it's French, and thank you, it works
What is the path i have to put for it to work ?
line 23
If it's on your linux (Ubuntu) storage, then try just /home and everything afterwards. It seems like you're trying to access the linux storage through windows, but you don't need to do that, since you're already working in WSL. Just imagine that you're on linux, not windows
ok thank you
i don't know why, but when i run this code https://paste.pythondiscord.com/pimolobawe it returns this error message https://paste.pythondiscord.com/inanabicat
Hey, I need some help i want to get my images to train it in a model. I am using this code for the following task
```python
data = tf.keras.utils.image_dataset_from_directory(
    "train",
    labels="int",
    label_mode="categorical",
    shuffle=True,
    image_size=(180, 180),
    color_mode='grayscale',
    class_names=None
)
```
but my directory structure is something like this
train
- cat.01.jpg
- dog.09.jpg
- and so on.
How can i use this for achieving the following task
Probably just format it in the way that keras expects it to be formatted.
So a directory per class
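A hedged sketch of that reshuffle, assuming every file is named like `cat.01.jpg` with the class before the first dot. `sort_into_class_dirs` is a made-up helper, not part of Keras:

```python
import shutil
from pathlib import Path


def sort_into_class_dirs(src_dir):
    """Move files named like 'cat.01.jpg' into src_dir/cat/cat.01.jpg."""
    src = Path(src_dir)
    # sorted() snapshots the listing first, so the new folders are not revisited.
    for path in sorted(src.iterdir()):
        if not path.is_file():
            continue
        label = path.name.split(".")[0]  # assumption: class name precedes the first dot
        target = src / label
        target.mkdir(exist_ok=True)
        shutil.move(str(path), str(target / path.name))
```

After `sort_into_class_dirs("train")` the folder has `train/cat/...` and `train/dog/...`, which is the layout `image_dataset_from_directory` reads class names from when called with `labels="inferred"`. Try it on a copy of the data first, since it moves files in place.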
I don’t really want to be on the businessey statistical side of things
I wanna do more programming than statistical work
Is this the wrong field for that? I don’t really enjoy statistics
Machine learning specifically, not data science
-
I think in practice (outside of research) ML is a last resort measure because it's "expensive". Either you stay in research or join a team that only deals with "the hard" problems after an analyst has found it's not solvable with bar charts and heuristics. In that case the scope of places where you can work is smaller but that's OK. These are the roles that interest me the most.
-
ML is a synonym (or subset?) of statistics. I don't think you can enjoy ML without enjoying (large parts of) statistics. Historically ML is what comes out of a CS dept and stats from ... stats but also economics, psych, etc. The scope and interest of ML (read: CS folk) does not cover everything of statistics and that's fine because not all of it is relevant for ML and vice versa. There are a bunch of things from say traditional statistics that are worth learning though that are not covered in traditional ML curricula/books/...
Like, knowing all the different types of t-tests and anovas are admittedly boring, I don't know / care about that as well. I'd say the hardcore statistical modelling is pretty much the same as ML and worth looking at.
Does anybody know a good guide for how to reduce GPU memory usage for neural networks in Keras? So far i've found this: https://stackoverflow.com/questions/53170149/memory-usage-of-neural-network-keras
I'm dealing with a model with roughly 2 million parameters, output size is 324x324 and i'm running out of memory (~16GB GPU RAM available). Any advice?
What is your batch size?
For training 16, though it still worked with 32, but barely. Apparently it just runs out of memory when predicting on the test set.
This is also something you could try: https://keras.io/api/mixed_precision/ or TF link: https://www.tensorflow.org/guide/mixed_precision
Oh that looks promising. Do you have any experience with that?
Just on toy problems to test out the API
If your GPU is recent enough to support it mixed_bfloat16 is what you want
Come to think of it - it might not solve your problem at all
Iirc mixed precision uses bfloat16 or float16 for the gradients and fp32 for the weights. Concretely it means your resulting model takes as much memory.
I think you need to solve it by reducing how many examples get sent to your GPU at a time or quantize your model (don't recommend)
model.evaluate(X, y, batch_size = 16) (default is 32)
So maybe python and c++ so that I can give myself some wiggle room for the future? I’m still in college so I’m not fully sure what I want to do in the future
I'm really interested in cybersecurity, especially malware analysis
I’ve heard python and c++ are great for those fields especially
Yeah, if i am not mistaken training is fine, but as soon as test data gets loaded (by Kaggle upon submission) it runs out of memory. (At least that's my suspicion of where the error lies, unfortunately Kaggle is not very precise about telling you the error of your submission)
Do you have to submit a model or just your results?
Each time I used Kaggle it was the latter, just a format with the results, not the model itself
I have to submit a notebook containing my model and some function that grabs the test data from some path. But the full test data is not publicly available, so i can't "precompute" the predictions and submit only the results.
Do you have control over model.predict() or so in that function? You need to find a way to lower the batch size at test time.
Currently i just iterate over all files in the test path and do this: This should just predict each element one at a time, i think.
Yes but I'm unsure if that tensor is removed from your GPU's memory after you made a prediction. Can you look that up?
This may be relevant: https://stackoverflow.com/questions/64199384/tf-keras-model-predict-results-in-memory-leak
run garbage collection after every prediction?
I don't know how the Python garbage collector plays with things that live on GPU. Do you know what is causing the issue, is it your CPU or GPU going OOM?
Mmh, well they seem to recommend clear_session() as well, maybe i'll try that.
I'd really consider making a tf.data.Dataset and not looping
Unfortunately i do not know that.
At least here you can unambiguously specify your batch_size and you have more or less good guarantees it'll play well because this is what the vast majority of TF people use (as well as Torch folk, they use their own variety)
Ok, thank you so much for all of the recommendations! I'll try them, some of them must work i'd hope
Yeah, start with tf.data.Dataset 🙂 the mixed precision training is something you imo should remember exists but it will not help you here
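To make the "fewer examples on the GPU at a time" idea concrete, here's a framework-agnostic sketch in plain NumPy. It is not tf.data itself; `predict_in_batches` and `predict_fn` are hypothetical names, with `predict_fn` standing in for something like a Keras model's `predict_on_batch`:

```python
import numpy as np


def predict_in_batches(predict_fn, inputs, batch_size=16):
    """Run predict_fn over `inputs` in chunks of at most `batch_size` examples."""
    outputs = []
    for start in range(0, len(inputs), batch_size):
        # Only this slice is stacked and handed to the model at once.
        batch = np.stack(inputs[start:start + batch_size])
        outputs.append(np.asarray(predict_fn(batch)))
    return np.concatenate(outputs, axis=0)
```

Peak memory then scales with `batch_size` rather than with the size of the test set. The sketch assumes `inputs` is a non-empty list of equally shaped arrays.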
Greetings fellow PythonistAIs! I have been experimenting with automatic prompt engineering, using Claude API to automatically attempt to improve prompts (in this case to produce a cover letter from a resume and job description) then test those candidate responses against the original prompt's and repeat in an artificial selection process. I had a huge problem when I found that Claude would always say it prefers the second of two resulting cover letters, even when their order was swapped, but I was able to overcome that by asking for a list of salient differences and then using a second API call to state a preference based on the list of differences. Is anyone else working on anything similar? I want to collaborate on a paper about this with someone experienced enough to legitimately critique my work so far.
Mixed precision training has been such a delight
And quantization
What do you mean? Elaborate?
im messing around with a tensor model that utilizes dimensional space to create geometric parameters to store it, so its reference is based on telemetry within the model. using shapes as means to identify data structures.
"store it": store what?
data sets,
any data sets
just referenced in telemetry
self labeled modeling in that, you provide the basic math for lets say a triangle. you can create a data set on any parameter that defines a triangle in space,
i have a 3d representation of this model if you're interested
I'm not sure I completely follow. Could you lay out the input, process, and output?
the input would be all telemetry, any parameters defined would be in the scope of maths of the initial instance of the tensor model, with multipoint references within it. each point being a its own tensor, of the same type, referencing telemetry within that tensor model. Initial instance hasn't been coded to that point yet but i have a start.
What would be the form of these multipoint references
Also do you have a link to this paper or smtg
not paper, code
its something ive thought about for 7 years
and now having the means to implement..
oh so you're developing this independently
no, i haven't touched a computer in 34 years for the purpose of programming.
Mm any link to help me understand the idea better? I'm still not exactly sure what this is meant to do or why it's supposed to work
sure, hold on
Thanks! I'll have a look
can some one help me ?
You have 10 data points and want to represent them with 3 points? Matrix factorization methods would be a solution or any other dimensionality reduction Method as you used pca or feature extraction would work
well i plan to implement math functions to identify and label as a variable reference of input,
you can create so many points of reference in theoretical space
You need CUDA installed, and you need to find where libcuda.so is and see where your code is looking for it.
I don’t understand the purpose of it. Maybe Fourier transform would help to model the input signal with some constants?
What's the point of the extrapolate function? What problem is this trying to solve
Roadmap for AI?
There's no single comprehensive roadmap
But this should give a start:
https://whimsical.com/machine-learning-roadmap-2020-CA7f3ykvXpnJ9Az32vYXva
Also it's kinda outdated with the new stuff, since yk...2020
OMG!!!!
Hey guys, I'm getting confused over calculations around True Positive Rate, False Positive Rate, True Negative Rate and False Negative Rate.
Given that my model made 88 valid predictions, where 87 were negative ones (0) and just 1 was positive (1), I'm able to get the True Negatives and the False Negatives correctly by using the following code:
```python
negative_predictions, positive_predictions = (task.argmax(-1) == 0), (task.argmax(-1) == 1)
negative_labels, positive_labels = (label.cpu().numpy() == 0), (label.cpu().numpy() == 1)
true_negative_rate, true_positive_rate = (negative_predictions == negative_labels).sum(), (positive_predictions == positive_labels).sum()
false_negative_rate, false_positive_rate = negative_predictions.sum() - true_negative_rate, positive_predictions.sum() - true_positive_rate
```
This provides me with 80 True Negatives and 7 False Negatives. However, I'm having some trouble with the Positive Predictions. Even though I have only 1 Positive Prediction, and my labels include 7 Positive Values, the code above provides me with the same values for True Positives and False Negatives.
Can someone give me a hint on how to fix it? I'm a bit out of ideas right now.
well i guess its not yet implemented enough to give a better understanding. the actual use case has nothing to do with 3d display, but a reference to 3d coordinates based off the data structure a user implements in it. i find use cases for understanding it this way to be useful in interpreting all types of data
So broadly, for visualization, interpretation of data?
right
you can set coordinates and create a plane those coordinates create. using that plane as a base line for data to be stored in a dynamic fashion so that information of the most basic types can create other facets of interpreted information conforming to the parameters within this tensor model
False negative is (negative_predictions == positive_labels).sum() and false positive is (positive_predictions == negative_labels).sum()
But these are just TP, FP, TN, FN, not rates
Oh yes, indeed... Rate would be if I divide them by the total number of samples, right?
False negative = positive labels? 
with a understanding of creating a basic structure that defines its self with machine learning and AI, i figured i combine the gap by designing a model that has natural language constructed off of real world interpreted representation combined with a expanding data structure that based off of coordinate values in space and stored along a given plane that referenced within the original structure that modeled itself off a basic geometric equations and stored as such for data reference.
you infer a negative label when the true one was positive
Oh yes...Now I got it... Every value in the positive prediction that has a boolean True in the same place as the negative labels will actually be a false positive...
Yes, my problem is exactly with the Boolean Masks 
Thanks, guys!
Just add an and there and it makes a lot more sense
for instance, a box, a cube, a square, 3 points of data stored in such a manner using the construct of the data point of the original tensor
because box is defined, an AI can understand a box, but the data given can also be implemented along that line
Sensitivity and specificity mathematically describe the accuracy of a test which reports the presence or absence of a condition. If individuals who have the condition are considered "positive" and those who don't are considered "negative", then sensitivity is a measure of how well a test can identify true positives and specificity is a measure o...
Here you can see all the rates
Maybe it's because you've been thinking about it for seven years so the idea is mature or because I'm tired but I really am not sure what you're going for ;-;
You divide them by either all ground truths, or ground false I think
Thanks for taking the time to try to explain tho. I'll have a look at it again sometime
FN = ((y_pred != y_true) & (y_pred == 0)).sum()
That's exactly the article that I was using for the formulas
the ai can then be sure the reference box was used in various ways, if a box is presented in one side of a plot and the box use case was used like a box car
False Negative = Predictions different from positive labels and predictions equal to 0?
with true I mean y_true
both instances are stored as references of data points and can be used for reference values such as a temperature
Oh wait...yes... All predictions that are 0 and different from the labels(y_true)
Ugh... I suppose my head needs some rest 
I changed it a bit but conceptually that's the easiest way to understand what a false negative is
True means it gets it right, False means it gets it wrong
What role will ai be playing here? Some predictive task?
True means pred and true are the same, false vice versa, that's all you need to remember
Not what I meant 😮
(y_pred != y_true) checks if there was a mistake made (the false part). (y_pred == 0) means it predicted a negative (the negative part)
Let your prediction be x and ground truth be y for a binary classification problem
True positive --> x=1 y=1
True negative --> x=0 y=0
False positive --> x=1 y=0
False negative --> x=0 y=1
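Those four cases as boolean masks in NumPy, plus the rates. `confusion_counts` and `rates` are illustrative names, and the arrays are a made-up example:

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels encoded as 0/1."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_pred & y_true))    # x=1, y=1
    tn = int(np.sum(~y_pred & ~y_true))  # x=0, y=0
    fp = int(np.sum(y_pred & ~y_true))   # x=1, y=0
    fn = int(np.sum(~y_pred & y_true))   # x=0, y=1
    return tp, tn, fp, fn


def rates(tp, tn, fp, fn):
    """Each rate divides by the number of actual positives or actual negatives."""
    return {
        "TPR": tp / (tp + fn),  # sensitivity / recall
        "TNR": tn / (tn + fp),  # specificity
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
    }


counts = confusion_counts([1, 0, 1, 1, 0, 1, 0], [0, 0, 1, 0, 0, 1, 1])
print(counts)  # (2, 2, 1, 2)
print(rates(*counts))
```

So the rates divide by the actual positives (TP + FN) or actual negatives (TN + FP), not by the total number of samples. The sketch assumes both classes occur in `y_true`; otherwise one of the divisions is by zero.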
Yeah this wouldn't run in numpy I think
Why not?
should be & instead of ==
with == you're just getting (1-)accuracy
!e
```python
import numpy as np
negative_predictions = np.array([True, False, True])
positive_labels = np.array([False, False, True])
print((negative_predictions == positive_labels).sum())
```
@mild dirge :white_check_mark: Your 3.11 eval job has completed with return code 0.
2
Oh right
say you have this array [0, 0, 0, 1, 1] for negative predictions and [0, 0, 0, 1, 1] for positive predictions what are you comparing?
Blindly copied the example, guess the TP and TN need to be changed that way too then
This one sticks closest to the term, false term on the left an & followed by negative on the right 🤷♂️
You'd need bitwise and then pretty sure
!e
```python
import numpy as np
negative_predictions = np.array([True, False, True])
positive_labels = np.array([False, False, True])
print((negative_predictions & positive_labels).sum())
```
@mild dirge :white_check_mark: Your 3.11 eval job has completed with return code 0.
1
Oh this works I guess
I remember using bitwise and before, because & didn't work, wonder why that was
But it's expressed in positive_labels and negative_predictions already which is not what you get out of a model
You just get y_pred hence why it's easier to express it in function of that (subjective I guess)
well the whole point of making and training seems so odd, whereas machine learning provides a context of information, and AI spends its time contextualizing information, i thought, why not make a machine learning model that can learn from itself with information received from its own instance. creating dictionaries to work with by simple mathematical parameters. such as, applying x y coordinates relating to coordinates that is a learned instance of any given input. discerning one x coordinate from another based off how the reference was made. was the operation happening because someone was talking about a car, or was it someone talking about a box, in both cases the math is defining which characteristics were presented to the tensor model input, keeping note of each instance and defining new variables and functions where they are needed, the actual data isn't important as long as data is being inputted or represented in some value the model should maintain data structure.
!e
```python
import numpy as np
y_true = np.array([1, 0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 1, 1])
FN = ((y_pred != y_true) & (y_pred == 0)).sum()  # or np.sum((y_true == 1) & (y_pred == 0))
print('False Negatives:', FN)
```
@past meteor :white_check_mark: Your 3.11 eval job has completed with return code 0.
False Negatives: 2
So you have certain representations of classes "saved" and you use these to generate new data points? By augmentation?
I don't see how it all ties into each other
it depends on how you store those data classes right? and how would you use those data classes, well you can store the whole class as a data point, its a point that is represented somewhere in space based off the causation of that instance the data was being entered, the model then can keep track of that data in space, and part of that data is referenced as any data type you wish, as the learned instances grow, from inputs of that, any part can quickly be referenced by the model to provide the output wanted, storing information at a point would have a plane, a new X, Y coordinate system is created, x can now hold a value at (-.4),(2),(5),(-2) on the original construct, and if a box is ever referred to in conversation, it then can learn where the word box is framed in context, depending on how the operation has progressed in learned understanding of context, the word, box or any other reference to a geometric figure, triangle, rectangle, the context of reference can be learned from this natural implementation but used as a way to use a. not necessarily needed to instantiate the implementation but the implementation can be instantiated when called..
say the value of (2) in this case contains coordinates value in the context of X as reference of an EX as in an ex fiance, the values on that plane can be evaluated among itself where X is served as X boss, or X wife, these contextual cases can be planed across itself and if the context's plane runs into the original tensor the use case could be construed as an actual function but if the rest of the values don't cross those planes the use case could be extracted naturally due to requirements of the assumed use case.
hard to get the idea that X means an X coordinate if other values don't make sense to call it that, in the general slang case.
What computers are used to train large scale ML models like Tesla’s CV and Self Driving models? H100 Servers?
am toying around with fbi background data and am a bit flustered by some results I'm getting,
```python
my_df.groupby('state').describe().transpose()['Alabama']
```
returns:
```python
permit                  count      295.000000
                        mean      9571.213559
                        std      12549.296862
                        min          0.000000
                        25%          0.000000
                                      ...
return_to_seller_other  min          0.000000
                        25%          0.000000
                        50%          0.000000
                        75%          0.000000
                        max          3.000000
Name: Alabama, Length: 192, dtype: float64
```
I'm a bit lost because i get a KeyError if i don't transpose it first
was just about to remove this, think i figured it out
I'd do, like, my_df[my_df["state"]=="Alabama"].describe().
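On the KeyError: after `groupby('state').describe()` the states are the row index and the columns are (column, statistic) pairs, so `[...]` looks for a column called 'Alabama'. A small sketch with a made-up frame (`df` and its values are not the real data):

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alaska"],
    "permit": [10, 30, 5],
})

summary = df.groupby("state").describe()

# States are row labels here, so select by row with .loc ...
alabama = summary.loc["Alabama"]
# ... which matches transposing first and then indexing a column.
same = summary.transpose()["Alabama"]

# Filtering before describing gives a regular frame with one column per variable.
filtered = df[df["state"] == "Alabama"].describe()

print(alabama[("permit", "mean")])  # 20.0
```

`summary["Alabama"]` without the transpose raises the KeyError because no column has that name.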
As I understand you still have a long data and you want to make it 3d or just a smaller vector. I would check autoencoders for this purpose. The model learns from data trying to make outputs similar as much as possible to the inputs. So you can get the latent space (just a vector between encoder and decoder parts) so you will end up with a vector which represents the input (bigger data). I tried to use it for a similar situation. I generally use statistics to interpret data and the difference between clusters etc.
Like this paper
We employ unsupervised machine learning techniques to learn latent parameters
which best describe states of the two-dimensional Ising model and the
three-dimensional XY model. These methods range from principal component
analysis to artificial neural network based variational autoencoders. The
states are sampled using a Monte-Carlo simulation ab...
I want to get a book on Machine Learning, ideally using Python for under £20. Does anybody know of some good books and a link to buy it please?
!resources data science
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
You can filter it by books
Wdym fbi background data
Hmm
I can't decide if this is something we already have lol and this would just be a different way to represent it
I'll take some time to digest it
This is partly what I meant by "we already have this"
```python
for index, row in ManagerGroup.get_group(manager).iterrows():
    if row['Late In Hrs'] != 0:
        i += 1
    elif row['Early Out Hrs'] != 0:
        j += 1
    elif row['Late In Hrs'] == 0 and row['Early Out Hrs'] == 0:
        k += 1
```
I want to recreate this in Tableau, but every time I try to, the counts of each variable come out equal
This is the end result of the loop
Don't understand the problem i, j, k have different values according to the pie plot
With data science can I go into crime and business fields?
Fixed the problem, thanks for looking into it
Good, just a reminder: I think you don't need a loop in this case, and I try to avoid loops in data manipulation projects in general. You can use something like np.count_nonzero(data == 0)
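For example, the three counters from that loop can be computed with boolean masks. A sketch with a made-up `group` frame standing in for `ManagerGroup.get_group(manager)`:

```python
import pandas as pd

# Hypothetical rows for one manager.
group = pd.DataFrame({
    "Late In Hrs": [0.5, 0, 0, 1.0],
    "Early Out Hrs": [0, 2.0, 0, 1.5],
})

late = group["Late In Hrs"] != 0
early = group["Early Out Hrs"] != 0

i = int(late.sum())              # late in (the first branch wins, as in the if/elif)
j = int((~late & early).sum())   # not late, but early out
k = int((~late & ~early).sum())  # neither

print(i, j, k)  # 2 1 1
```

The `~late &` part keeps the elif behaviour: a row that is both late and early is only counted in `i`.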
Can something like this actually happen, or do i probably just have a bug somewhere: I trained a model on a reduced training set, got pretty good results on test data, then trained it on the entire dataset using data augmented by Keras ImageDataGenerators and now i get horrible non-sensical results on test data. Did anybody ever experience something similar?
Data is everywhere.
I will try this, thank you so much! :D
You're welcome
Hello, i am a beginner in AI and i would like to know what are the most used libraries and which are the best.
I'm currently using Tensorflow but a few PhDs told me that it is bad because it doesn't have good compatibility with previous versions.
it's true that tensorflow has introduced many breaking changes from one version to the next
the latest ones drop gpu support for windows outside of wsl
many people like pytorch and it seems to have gained a lot of traction in academia too. i like jax for the stuff i work on
tensorflow and keras are good, but as with most software, you have to keep track of which versions your code works on
Hi, i am looking for some help designing a Neural Network. I have to model a time and space dependent problem and predict temperature. I have tried using LSTM as it is a time-dependent problem, but it feels like no matter what i do in training (i have managed to get train and val loss around 5), test loss is still at 500
(please ping me in your answer)
but i followed well the tutorial of tensorflow
quick question. When you're working in a jupyter notebook, do you generally list your observations before the code that leads to these observations, or after it? Like, do you plot a plot, and then add a markdown before the code like "the following plot shows...", or do you plot the plot, and then add a markdown after the plot saying smth like "from the plot we can conclude..."? What is the 'standard' (more popular, accepted) way of doing it, formatting your observations in a notebook?
Hi everyone. I need help in optimising a block of code I wrote.
```python
cf_df = pd.DataFrame(columns=['player_id','player_name','country'...)
for player, df in t20_bat_df.groupby('player_id'):
    date_range = pd.date_range(start=df['start_date'].min(), end=df['start_date'].max(), freq='M')
    df = df.set_index('start_date')
    for month in date_range:
        games_played = df['match_id'].count()
        month_df = pd.DataFrame({'player_id': [player], 'month': [month]})
        month_df['runs'] = df['runs'].sum()
        month_df['date'] = month
        month_df['games_played'] = games_played
        .
        .
        .
        pd.concat((cf_df, month_df))
cf_df
```
This is taking forever to compute. Thanks!
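One possible direction, assuming the goal is per-player, per-month totals (the elided part of the loop may do more). The small `t20_bat_df` below is made-up data with the column names from the snippet, and `cumulative_runs` is an invented column:

```python
import pandas as pd

t20_bat_df = pd.DataFrame({
    "player_id": [1, 1, 1, 2],
    "match_id": [10, 11, 12, 10],
    "start_date": pd.to_datetime(
        ["2023-01-05", "2023-01-20", "2023-03-02", "2023-01-05"]
    ),
    "runs": [30, 12, 50, 7],
})

# One grouped aggregation instead of a Python loop per player and month.
monthly = (
    t20_bat_df
    .assign(month=t20_bat_df["start_date"].dt.to_period("M"))
    .groupby(["player_id", "month"], as_index=False)
    .agg(runs=("runs", "sum"), games_played=("match_id", "count"))
)

# Running total per player, in case cumulative figures are what's wanted.
monthly["cumulative_runs"] = monthly.groupby("player_id")["runs"].cumsum()
print(monthly)
```

Unlike the `date_range` loop, months without a game do not get a row here. If the loop is kept, note that `pd.concat` returns a new frame rather than modifying `cf_df`, so it's usually faster to collect the pieces in a list and concat once at the end.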
Hello everyone , I am practicing data cleaning and EDA on movies dataset https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset
There are 2 columns, **Revenue** and **Budget**, which have 90% missing values. So I was exploring more techniques besides dropping the column or doing mean/median imputation.
As far as I searched, these 2 columns fall in the Missing Completely At Random category. Can anyone take a look at these 2 columns and tell me whether I am correct or not?
My notebook here https://www.kaggle.com/code/nishchay331/datacleaning-practice1
Can we use the MICE technique in all 3 scenarios (MCAR, MAR, MNAR), as it is an advanced technique? Also, can anyone tell me what to keep in mind while doing data imputation? One thing I came to know recently is that the standard deviation of the feature being imputed should not change.
learning curve currently looks like this: When i run the model on a smaller dataset it works perfectly fine. This seems to suggest that it's stuck in a local minimum. What can i do to circumvent this? Currently using Adam optimizer with learning_rate=10^-3
from what I've seen it's mostly before
but that's mostly for code
since plots can have a big code cell attached to them, the md cell explaining the plot can go way too up-and-out
so for plots it can be after the plot ig as that would be just next to the plot
the goal is great readability and minimal confusion, so whatever suits that is good I suppose
I prefer to title and annotate above any charts or diagrams, as many notebooks are more than a page long. In, say, latex, you have a little more page/layout control, and you can describe underneath (in a caption). Just my opinion
Have you also checked whether your full data is more noisy? Have you used normalization? You can also try different learning rates and different window sizes (batch size?) to see if it helps
I have not checked that, but i doubt that. My small subset of data i previously trained on was randomly selected and that worked multiple times. I've experimented with learning rates, but the learning curve you're seeing there is the best i could get by just varying the learning rates. I have not tried different batch sizes, but i can't necessarily see how that would help. Can you explain?
You expect higher accuracy?
Are you sure it's not just a more generalizable score?
Smaller datasets are easier to fit to, and would understandably give higher scores
But yeah to improve it you can try varying the hparams
Yes, i was kind of expecting worse learning curves on larger data, but: When i tested out the trained model on test data, the model trained on the larger training set performed way worse than the one trained on the smaller dataset.
Changing batch size affects how much data is considered for one gradient update. Lower batch size can increase noisy updates and you might be stuck in a minimum
I wasn't so focused on how high the accuracy was in the learning curve, i was just worried about the flat loss decrease
Oh, interesting! Then i'll try that. By the way, what's your experience on using Adam vs SGD for optimizers? Is there one you strictly prefer?
It's possible that the smaller data and test data just happen to have lower cross Entropy
Can you try sampling different "smaller datasets" from your train data and compare performance on this
Thanks for the clear explanation I was trying to find how to explain it 😂
If it's giving a higher performance on every sampling of the train data then we might have an interesting case
haha I get it, no problem!
Yes, i have randomly sampled a small subset multiple times, it works (much!) better every time. Though another difference is that i was loading in the smaller dataset in memory and processing it "by hand", but for the larger dataset i used a Keras DataGenerator. Maybe i just messed the DataGenerator up?
Depends on the case but Adam is generally a good bet for usual tasks
If you're looking to optimise to the teeth you might also want to look into grad scaling, lr schedulers, and experimenting with different momentum values
Processing it by hand?
Haha, i meant i was coding custom methods to do this, not using any prebuilt modules like DataGenerators from Keras
Ah okay
Try reproducing the results with the rest of the pipeline the same
Like ablation testing
It's also possible, though not very likely, that the extra data might actually be causing you to overfit on the train set
Difficult to digest ik xd
Yeah, the issue with the first approach was the insane memory usage, which is why i was trying to use DataGenerators. Guess i'll load small batches into memory, process them like i did before, train, then discard the batch and load a new one.
You can just use DataGenerators for the smaller ones too?
As long as rest of the pipeline is same for both, we can compare
That is an actual possibility, but the "quality" of the outputs is very different in both cases: The first model gave very sensible masks for the input images and was highly confident when generating the masks, and the second one just gives outputs where basically the entire image is masked with value .5 or something like that.
How's the train accuracy when training with larger dataset
As compared to small subset
Weirdly enough (as you can see in the screenshot) the accuracy when training on the large dataset is very high (~0.95) and iirc the accuracy was smaller on the small dataset (~0.9)
I'll have to look it up how they calculate the accuracy, because as i said the model trained on the larger data essentially only returns 0.5 for every pixel or something dumb like that
So test accuracy was higher than train accuracy on the small dataset?
Yep, just not the loss
Hmm
Try to reproduce with identical pipeline ig
Only then can we make any solid conclusions
Yes, will do that. It's the only way to compare them properly i guess
Yep
Ok, will update once i'm done. Thank you so much for your help so far!
@potent sky While i have your attention: Is there a way to also return the magnitude or some other information about the gradients produced during training time? E.g. plot loss and norm of gradient in each epoch or something similar. Something like that might help me understand whether i'm stuck in some local minimum.
Does correlation between a continuous variable and a categorical variable give any meaningful information?
Sure you can get the gradients, but plotting them in a helpful way or getting much meaningful information out of them would be difficult
Instead ig you can use loss for that
Why wouldn't it? The number of freelance contracts I receive in a year and the amount of money I spend on teddy bears being correlated can give meaningful conclusions don't you think xd
mk, but.. how do I calculate it?
Or a better example here would be type of contract ig
Small, medium, large whatev
Mm maybe ANOVA?
Sure but I've to go now and idk if I'll be able to check again soon. But I'm sure any of the others will be able to help, stel, zestar, pccamel, amer etc.
GL!
- post hoc to check if there is a significant difference between the groups
is your categorical variable ordinal?
And ry ofc, mb
Too many great helpers lol
I'd like to know the answer for both ordinal and nominal, as I have both. Currently I'm looking at ordinal, but would like to know both
Oh right, my example only stands for ordinal
in your original post, by correlation, you probably meant pearson correlation?
i think correlation (of any kind, i will comment on this later) between continuous variable and categorical-and-not-ordinal variable does not give you meaningful info [the hand-wavy way of thinking about this is: since you can easily swap the "ordering" (since it's not ordinal) of the categorical variable, you can arrive at different correlation result, which means they probably don't make any sense!]
however, for ordinal data, it might.
but pearson correlation might not be the thing you want, i think by using pearson, you are effectively assuming the gap between ordinal 1 and ordinal 2 has the same "distance" as ordinal 2 and ordinal 3, when in reality this might not be true. (e.g. low income group, middle income group, high income group, distance between low and mid is not the same as mid and high)
spearman's rho and kendall's tau might make more sense, they are also correlation measures but mainly for ordinal data.
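To make the encoding argument concrete, here is a small sketch on made-up data (the group means and the three encodings are assumptions for illustration). Spearman's rho is computed as Pearson correlation on average ranks, which is its standard definition, so only numpy and pandas are needed:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 3 income groups (0=low, 1=mid, 2=high) and a continuous variable.
group = rng.integers(0, 3, size=300)
spend = np.array([10.0, 12.0, 40.0])[group] + rng.normal(scale=2.0, size=300)

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    # Spearman's rho = Pearson correlation of the (average) ranks.
    return pearson(pd.Series(a).rank(), pd.Series(b).rank())

enc_equal = np.array([1, 2, 3])[group]    # equally spaced ordinal codes
enc_gap = np.array([1, 2, 10])[group]     # same order, different spacing
enc_shuffled = np.array([2, 3, 1])[group] # a relabelling that changes the order

# Pearson depends on the spacing of the codes; Spearman only on their order.
print(pearson(enc_equal, spend), pearson(enc_gap, spend))
print(spearman(enc_equal, spend), spearman(enc_gap, spend))
# For a nominal variable any order is as valid as another, so this is equally "correct":
print(spearman(enc_shuffled, spend))
```

The shuffled encoding is the nominal case: since no ordering is privileged, the coefficient changes with an arbitrary labelling choice, which is why something like ANOVA is usually suggested there instead.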
Hmm yeah spearman's or kendall's might be a better approach
btw, if you find correlation with nominal data, doesn't it in a sense become implicitly ordered, w.r.t. the correlation found
can I compare the results of pearson and spearman or kendall? And/or can I use spearman or kendall for correlation between continuous variables?
and yeah, I meant pearson up there
From the book Discovering statistics using R
won't the correlation found depend on the encoding of the nominal values, thus.. no?
hmm that i am not sure, my gut feel says no.
Yes so the correlation won't be meaningful across different ordering of nominal values
But why wouldn't it be meaningful for the ordering associated with that correlation
okay so I'm trying to select a column with iloc based on 2 examples I saw:
```py
df.iloc[10:20]
```
and this was the second example I used to base my code off of:
```py
df.iloc[:, [1,2,5]]
```
So I combined the two to make this:
```py
df.iloc[:, [0:5]]
```
My df has 7 columns but this gives me the error of "invalid syntax". Please help.
maybe you're looking for a slice object?
afaik [0:5] is only valid after something that accepts slice objects as indices like a list
Yes, this is invalid syntax
you could try slice(0,5) instead
Get rid of []’s around the 0:5
oh that too lol
use df.iloc[:, 0:5] or df.iloc[:, slice(0, 5)]
Sorry but why if the one where they want the column 1, 2 and 5 was .iloc[:, [1,2,5]]
0:5 is a range. [1,2,5] is a list.
is a range a separate datatype from a list?
yes.
0:5 is a slice to be more specific (range is even another data type)
when you use : within object[:] / object[:] = ... (getitem / setitem notation), python automatically creates a slice from that
however, you cannot use it when you are creating a normal list, despite it also using []
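A quick way to see what Python actually hands over when you write `obj[...]` is a tiny class whose `__getitem__` just returns its argument. `Show` is a made-up helper for illustration; `.iloc` receives its key the same way, which is also why it uses square brackets instead of call parentheses:

```python
class Show:
    """Hypothetical helper: returns whatever key Python passes to __getitem__."""

    def __getitem__(self, key):
        return key


show = Show()

print(show[0:5])           # slice(0, 5, None)
print(show[:, [1, 2, 5]])  # (slice(None, None, None), [1, 2, 5])
print(show[:, 0:5])        # (slice(None, None, None), slice(0, 5, None))

# Outside of obj[...] the colon is not slice syntax, so a slice has to be spelled out:
explicit = slice(0, 5)
print(show[:, explicit] == show[:, 0:5])  # True
```

So `[1, 2, 5]` is an ordinary list that gets built first and then passed in, while `0:5` only means "slice" because it sits directly inside the indexing brackets.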
so
1). is the problem as simple as: df.iloc[:,[1,2,5]] works because both : and [1,2,5] are list-like objects that it can accept, but 0:5 in df.iloc[:,[0:5]] doesn't work because the : denotes a list-like-object and therefore the second parameter is really a list-like inside a list-like
2). or is it something else entirely?
because the 2 things that are getting me hung up on this function are:
1). why in the world does a method use square brackets?
2). the difference between lists and list-like objects seems arbitrary.
so please bear with me.
: is not "list-like"
taking a step away from pandas and looking at normal python, you have:
```py
# list literals:
lst = [0, 1, 2, 3, 4, 5]
# list indexing:
lst[0] == 0
lst[3] == 3
# list slicing:
lst[0:3] == [0, 1, 2]
lst[slice(0, 3, 1)] == [0, 1, 2]
# updating lists:
lst[0] = 100
lst[1:3] = [10, 20]
lst[slice(1, 3, 1)] == [10, 20]
```
the : is pretty much syntax sugar for annotating a slice, usable when retrieving elements from a list or overwriting a slice of the list
which means these pairs are equivalent:
```py
df.iloc[10:20]
df.iloc[slice(10, 20)]

df.iloc[:, [1,2,5]]
df.iloc[slice(None), [1,2,5]]

df.iloc[:, 0:5]
df.iloc[slice(None), slice(0, 5)]
```
however, that 'syntax sugar' is only valid within object[...].
```py
# valid:
var = slice(1, 10, 3)
# syntax error:
var = [1:10:3]
```
hi i have a question about my rl:
import numpy as np

grid = np.array([
    ['P', ' ', ' ', ' '],
    [' ', 'X', ' ', 'X'],
    [' ', ' ', ' ', ' '],
    ['X', ' ', ' ', 'G'],
])
print(grid.shape[0])
print(grid.shape[1])
num_states = grid.shape[0] * grid.shape[1]
num_actions = 4  # Up, Down, Left, Right
q_table = np.zeros((num_states, num_actions))
learning_rate = 0.1
discount_factor = 0.9
num_episodes = 10_000
max_steps_per_episode = 100
for episode in range(num_episodes):
    player_pos = np.where(grid == 'P')
    state = player_pos[0][0] * grid.shape[1] + player_pos[1][0]
    if episode % 1_000 == 0:
        print(episode / num_episodes * 100)
    for step in range(max_steps_per_episode):
        if np.random.uniform(0, 1) < 0.1:
            action = np.random.randint(num_actions)
        else:
            action = np.argmax(q_table[state])
        row, col = divmod(state, grid.shape[0])
        #row = state // grid.shape[1]
        #col = state % grid.shape[1]
        if action == 0:  # Up
            row -= 1
        elif action == 1:  # Down
            row += 1
        elif action == 2:  # Left
            col -= 1
        elif action == 3:  # Right
            col += 1
        if row < 0 or row >= grid.shape[0] or col < 0 or col >= grid.shape[1]:
            new_state = state
        else:
            new_state = row * grid.shape[1] + col
        if grid.flat[new_state] == 'X':
            reward = -10
        elif grid.flat[new_state] == 'G':  # Goal reached
            reward = 1
        else:
            reward = 0
        q_table[state, action] += learning_rate * (reward + discount_factor * np.max(q_table[new_state]) - q_table[state, action])
        state = new_state
        if episode == num_episodes - 1:
            print("x: ", row, " | y: ", col)
        if grid.flat[state] == 'G':  # Goal reached
            break
print("Learned Q-table:")
print(q_table)
this reinforcement learning does work fine, but as soon as i add another row or column it would break and the results are completely off
the 4x4 grid (which works)
q_table:
[[ 5.31440892e-01 4.78286716e-01 5.31440417e-01 5.90490000e-01]
[ 5.90489707e-01 -9.46856122e+00 5.31440737e-01 6.56100000e-01]
[ 6.56099855e-01 7.29000000e-01 5.90488097e-01 5.90487408e-01]
[ 1.68409465e-01 -9.19000000e+00 6.56099831e-01 1.97696497e-01]
[ 5.31440488e-01 6.33422575e-02 1.29616717e-01 -9.88018350e+00]
[ 5.90490000e-01 3.40817433e-01 1.64306823e-01 4.15189404e-01]
[ 6.56099257e-01 8.10000000e-01 -9.46855932e+00 -9.18999923e+00]
[ 2.39609557e-01 9.00000000e-01 4.15189404e-01 -8.08372366e+00]
[ 2.49389245e-01 -4.09510000e+00 0.00000000e+00 7.28959493e-02]
[-8.18826176e+00 1.24354675e-03 2.24665146e-02 8.09998283e-01]
[ 7.28997072e-01 7.28984768e-01 7.28973041e-01 9.00000000e-01]
[-9.18999846e+00 1.00000000e+00 8.09999965e-01 8.99998865e-01]
[ 0.00000000e+00 -1.00000000e+00 -1.00000000e+00 0.00000000e+00]
[ 2.07141416e-01 0.00000000e+00 -8.49905365e+00 0.00000000e+00]
[ 8.09998301e-01 6.57927489e-02 2.03081236e-02 5.21703100e-01]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
with any change to the grid:
[[ 0. 0. 0. 0. ]
[ 0. -10. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. -10. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. -9.99997646 -9.99959516]
[ 0. 0. 0. 0. ]
[ 0. 0. -9.99373421 -9.99963565]
[ 0. 0. 0. 0. ]
[ 0. 0. -9.9293035 0. ]
[ 0. 0. 0. 0. ]
[ -1. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ -1. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]
[ 0. 0. 0. 0. ]]
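One likely culprit, judging only from the posted code (nobody confirmed it in the thread): the state is encoded as `row * grid.shape[1] + col`, but decoded with `divmod(state, grid.shape[0])`. On a square grid the two shapes are equal so it happens to work; on any other grid the decoded row and column are wrong. A small check, with a made-up 5x4 grid size:

```python
import numpy as np

n_rows, n_cols = 5, 4  # hypothetical non-square grid
states = range(n_rows * n_cols)

def roundtrip_ok(divisor):
    # Decode with divmod(state, divisor), re-encode as row * n_cols + col,
    # and check that every state maps back to itself.
    return all(
        row * n_cols + col == state
        for state in states
        for row, col in [divmod(state, divisor)]
    )

print(roundtrip_ok(n_cols))  # True: matches the encoding
print(roundtrip_ok(n_rows))  # False: what divmod(state, grid.shape[0]) does here

# np.unravel_index does the decoding from the shape directly:
row, col = np.unravel_index(7, (n_rows, n_cols))
print(row, col)  # 1 3
```

If that is the issue, dividing by `grid.shape[1]` (as the commented-out lines in the post already do) or using `np.unravel_index(state, grid.shape)` should make the decoding consistent for any grid shape.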
hi i wanted to ask something about the behavior of stacked barplots. if i have different values, do the stacked barplots just sort themselves in an order such that the color we see e.g. on top has the actual highest value, up to the point where it ends?
I'm working on porting some MATLAB code to Python, and I'm running into an issue which I think may be due to the semantics of matmul being different than what I expected. I have a discretized vector field called m, which has shape (3, N, N), and can be thought of as an N by N grid of vectors in R3. I then construct a corresponding discretized matrix field called L, which has shape (3, 3, N, N), and can be thought of as an N by N grid of 3 by 3 matrices. I was hoping that computing L @ m would yield the same result as multiplying the matrix at each grid point by the corresponding vector at that grid point, but I'm not sure if those are the actual semantics. It's possible that this is not the actual issue in my code, but it seems like the most likely problem at the moment.
in plotly they are getting sorted but i think it could vary depending on the package used
oh sorry, i should have mentioned i am working with matplotlib
i show in a image what i mean in a second
i first have to finish preparing da data
Hi, I was wondering for matplotlib is there a place that explains how to structure the fmt keyword? i took a look at the documentation but couldnt find anything
I was looking specficially for the bar_label function but ive seen it in other functions too
yeah, matmul expects the matrices to be in the last 2 indices!
if you want to be completely explicit and avoid trying to figure out what numpy does by default, i recommend checking out einsum https://numpy.org/doc/stable/reference/generated/numpy.einsum.html where you can explicitly state which dimensions the multiplication happens along using einstein notation
Hrm, so matmul would want (N, N, 3, 3) and (N, N, 3)?
in your case, we would do
result = np.einsum("ijkl, jkl -> ikl", L, m)
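A quick sanity check of that einsum against an explicit loop over grid points, plus the matmul route for comparison. The sizes and random data are made up. Note that for matmul the vector field needs a trailing axis of length 1, i.e. shape `(N, N, 3, 1)`, otherwise an `(N, N, 3)` array is treated as a stack of `(N, 3)` matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4  # hypothetical grid size
L = rng.normal(size=(3, 3, N, N))  # a 3x3 matrix at every grid point
m = rng.normal(size=(3, N, N))     # a 3-vector at every grid point

# Explicit: contract index j of the matrix with index j of the vector, per grid point (k, l).
via_einsum = np.einsum("ijkl,jkl->ikl", L, m)

# matmul works on the last two axes, so move the grid axes to the front first.
L_last = np.moveaxis(L, (0, 1), (-2, -1))  # (N, N, 3, 3)
m_last = np.moveaxis(m, 0, -1)[..., None]  # (N, N, 3, 1)
via_matmul = np.moveaxis((L_last @ m_last)[..., 0], -1, 0)  # back to (3, N, N)

# Reference: plain loop over the grid.
reference = np.empty_like(m)
for k in range(N):
    for l in range(N):
        reference[:, k, l] = L[:, :, k, l] @ m[:, k, l]

print(np.allclose(via_einsum, reference), np.allclose(via_matmul, reference))
```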
Ah, Interesting. I was using (3, N, N) because if I want to normalize the vectors I can just norm them and divide, since NumPy will broadcast (N, N) to (3, N, N) no problem.
i always prefer and recommend being as explicit with numpy as possible, especially if you come from matlab
you'll realize that numpy prefers doing weird stuff instead of erroring out. stuff that would give an error in matlab will happily compute in numpy, and give you trash
e.g. 1D np arrays don't behave like actual vectors. you could multiply a vector from the left or right of a matrix without making any change
so do things like adding dummy axes (with np.newaxis or reshaping to add a dimension of 1) and use einsum
einsum is perhaps the 1 thing numpy really beats matlab at 😛 matlab doesn't have a built-in equivalent for natural tensor contractions
Ah, okay! Good to know. Thank you very much for the help. I don't do numerics stuff very often, so this is great for me to know.
>>> x = np.arange(9).reshape((3, 3))
>>> x
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
>>> y = np.arange(3) + 1
>>> y
array([1, 2, 3])
>>> x / y
array([[0.        , 0.5       , 0.66666667],
       [3.        , 2.        , 1.66666667],
       [6.        , 3.5       , 2.66666667]])
>>> x.shape
(3, 3)
>>> y.shape
(3,)
i never remember which axis it'll favor tbh
a little bit cursed tbh. i come from the "matlab ordering of indices" and "all vectors are column vectors" world so this is exactly the opposite of what i'd like 😛
maybe that's why i never remember it
Indeed. At least now that I've got this sorted I am properly computing the Laplacian of my vector field.
Numpy has a row-major memory layout by default (C-style) and since they were in that mindset, this order would make sense as an arbitrary choice.
The last axis being the "fastest."
That is, if you loop over it as a flat array, and had the multidimensional index printed, the last index is changing the fastest, and the first the slowest.
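The "last axis is fastest" point can be seen directly from the index order and the strides. A small sketch; the dtype is pinned to int64 so the byte counts are the same on every platform:

```python
import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)

# Walking the flat array in memory order: the last index changes fastest.
order = list(np.ndindex(a.shape))
print(order)  # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]

# Strides are the bytes to step to move by one along each axis.
print(a.strides)                      # (24, 8): C / row-major layout
print(np.asfortranarray(a).strides)   # (8, 16): Fortran / column-major layout

# Same values and same indexing semantics either way, only the memory layout differs.
print(np.array_equal(a, np.asfortranarray(a)))  # True
```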
Yeah, column major is what all the scientific languages do, this is a C thing.
Fortran, MATLAB, R, etc.
Although the internal memory layout only affects speed, not semantics, but since they were implementing it in C, it comes as the natural choice.
right. i did do a fair amount of C++ back in my day, so i'm familiar with all of this... but never any numerics in it. i have fast contiguous memory access and numerics in separate drawers in my head 😭
To match the internal layout more.
"Row-major" and "column-major" are not great names for it either, it does not generalize past two dimensions and "major" is not a common term in programming in general.
Blah, I'm not built for numerics. I miss proving that solutions exist. Finding them is no fun.
Weird, my Laplacian operator works, but my curl operator does not...
I'm following Michael Nielsen's book on neural nets, and I'm trying to understand his code. I don't understand how he is calculating gradients. My understanding of calculus is shaky, so that might be part of it, but from what I see, he is taking the vector of differences between the actual and desired outputs and multiplying it by the derivative of z (which is a scalar?). I'm not sure what is happening in line 119. I'm kind of lost on what is happening tbh. Could anyone help explain? ty
https://github.com/unexploredtest/neural-networks-and-deep-learning/blob/master/src/network.py#L106 (and line 119)
src/network.py line 106
delta = self.cost_derivative(activations[-1], y) * \```
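A standalone sketch of what that output-layer step computes, assuming the book's setup of sigmoid neurons with a quadratic cost; the layer sizes and numbers below are made up. The key point is that `z` is a column vector with one entry per output neuron, not a scalar, and the `*` is elementwise:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1.0 - sigmoid(z))

# Hypothetical last layer: 3 activations coming in, 2 output neurons.
a_prev = np.array([[0.2], [0.7], [0.5]])  # activations of the previous layer, shape (3, 1)
W = np.array([[0.1, -0.3, 0.8],
              [0.4, 0.6, -0.2]])          # weights, shape (2, 3)
b = np.array([[0.05], [-0.1]])            # biases, shape (2, 1)
y = np.array([[1.0], [0.0]])              # desired output, shape (2, 1)

z = W @ a_prev + b  # weighted inputs: a (2, 1) vector, one entry per output neuron
a = sigmoid(z)      # output activations

# Quadratic cost C = 0.5 * sum((a - y)**2), so dC/da = (a - y).
# Chain rule per neuron: dC/dz = dC/da * da/dz, hence an elementwise product.
delta = (a - y) * sigmoid_prime(z)

nabla_b = delta             # dC/db, shape (2, 1)
nabla_w = delta @ a_prev.T  # dC/dW, shape (2, 3): one entry per weight
print(delta.shape, nabla_w.shape)
```

The earlier layers repeat the same pattern, with `delta` pushed backwards through the transposed weights before multiplying by that layer's `sigmoid_prime(z)`.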
Hello everyone , I am practicing data cleaning and I want to ask that Let's take a numerical feature and before imputation and after imputation if I draw a single histogram and both of them overlaps so that means that we did correct imputation?
What are the things we should keep in mind while doing imputation?
\0. can KNeighborsClassifier be used for multilabel classification? How is it done? I can kinda guess for ordinal categories, but.. nominal? No idea.
\1. can you use KNNImputer for imputing nominal categories? If yes, then how would you do it? You can't use OHE, cus your missing value will just be encoded as a dummy variable, just like the rest of the values, and nothing will be imputed. You could encode them as ordinal, but.......
\2. is there any point in encoding nominal categories as ordinal ordered by frequency? Would that make sense? Maybe not the best solution, but using KNNImputer with nominals encoded as ordered by frequency ordinals would probably be better than just imputing with the univariate mode, right? Or does that not make sense?
i have a list of stocks as dictionaries for example
{'symbol': 'AAPL', 'currency': 'USD', 'exchange': 'NYSE', 'isin': 'US0378331005', 'security_type': 'equity'}, {'symbol': 'TSLA', 'currency': 'CAD', 'exchange': 'V*SE', 'isin': 'yyyy', 'security_type': 'equity'}
and i have the same data in a database table but with duplicates and inconsistencies. for example,
symbol |currency |isin |security_type
-----------------------------------------------------
AAPL |AAPL | |Crypto
AAPL |USD |US0378331005 |Equity
AAPL |USD |AAPL221125C00070000 |Derivative
as you can see, one database row has crypto as the security type for AAPL and the other two have varying ISINs.
i am looking for an algorithm that would return the best match for AAPL. in this case the 2nd row.
any suggestions or recommendations are welcome.
Yah, I’m not sure what “algorithm” is going to help. Maybe get a list of valid equities from Edgar and cross check?
Hi
I am working on a project that requires an ai model to detect faded road markings and the percentage of marking faded (0% means not faded ,100% means completely faded). How should I accomplish this using object detection or image segmentation etc?
The markings are also irregular shaped instead of square shaped
this is not possible as there are layers of securities built on top of these.
I mean, if you're just looking to join your dictionary to a database table?
You could either load the dict to a table and join, or construct a param, or download the table and filter, a few options.
if a match is found in the database then no further action is required, otherwise i need to add a new row to the database, and it's critical that this matching is as accurate as possible because we cannot allow duplicates
Well, if you can't allow dupes, then make sure you have a unique index
yes i am working on a filtering logic but i was also exploring if there are some options that could make this easier. actually the two sources of the data, even though they have the same dictionary keys, do not have the same type of data in the values. for example one source has exchange as NYSE, another source has New York Stock Exchange, that sort of thing
I mean, the real answer is to resolve everything to sedol or cusip numbers, or something similar.
oh sorry, I see you have isin's.
left outer join on isin, and insert if not exists.
but with a unique key, so you don't have some sort of race condition with another update.
ISIN's are not reliable as there are entries in the api that have one symbol with a valid ISIN and another None
What kind of garbage api are you dealing with?
Oh, I've been meaning to try their data out.
i was hoping if there was some clasification algorithm that i could use to pick out a match it would solve a lot of my problems
I'm just not sure what you're after... sounds like a typical data cleansing problem:
Find duplicates, try to resolve known equities using ISIN or well known lists, etc
it is
Like, if you have a good master list of ISINs and tickers, you could at least whittle the list down
i have something i am working on., symbol, currency, security type match and if the db returns more then 1 match then check isin and then check exchange
its a lot of if else
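One way to flatten that if/else chain is to score every candidate row and take the best one. A minimal sketch using the example rows from above; `match_score` and the weights are made up for illustration and would need tuning against real data:

```python
def match_score(candidate, row, weights=None):
    """Hypothetical scoring: add a weight for every field that matches exactly
    after trimming whitespace and lowercasing. Empty values never count as a match."""
    weights = weights or {"isin": 4, "security_type": 2, "currency": 1}
    score = 0
    for field, weight in weights.items():
        left = (candidate.get(field) or "").strip().lower()
        right = (row.get(field) or "").strip().lower()
        if left and left == right:
            score += weight
    return score


candidate = {"symbol": "AAPL", "currency": "USD", "exchange": "NYSE",
             "isin": "US0378331005", "security_type": "equity"}
db_rows = [
    {"symbol": "AAPL", "currency": "AAPL", "isin": "", "security_type": "Crypto"},
    {"symbol": "AAPL", "currency": "USD", "isin": "US0378331005", "security_type": "Equity"},
    {"symbol": "AAPL", "currency": "USD", "isin": "AAPL221125C00070000", "security_type": "Derivative"},
]

# Only compare against rows with the same symbol, then keep the highest score.
same_symbol = [row for row in db_rows if row["symbol"] == candidate["symbol"]]
scores = [match_score(candidate, row) for row in same_symbol]
best = same_symbol[scores.index(max(scores))]
print(scores, best)
```

A minimum score below which nothing counts as a match (and a new row gets inserted instead) would still have to be chosen, and fields like exchange need a mapping between spellings such as NYSE and New York Stock Exchange before they can be compared this way.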
@wooden sail @queen cradle Hey guys do u rem me? I have the abstract ready...do u guys mind giving it a read
What database are you working in?
postgres
Yah, you could load it and do it all in sql too
I use duckdb for a lot of this stuff, so I'll just join a table to an in-memory dataframe.
has anyone ever heard of global ai hub?
current code, which produces first picture:
for col in ["Pclass", "Sex", "SibSp", "Parch", "Embarked"]:
    sns.displot(
        train_data,
        x=col,
        hue="Survived",
        multiple="dodge",
        stat="percent",
        discrete=True,
        shrink=0.8,
    )
    plt.show()
how to make the xticks look like they do on the second picture?
You can do, say,
import matplotlib.ticker as mplticker
ax.xaxis.set_major_locator(mplticker.MultipleLocator(base=1.0))
hmm.. Thanks! That works for the current case. But is there no way to make it automatically do that, the way sns.countplot() does? Because.. say the x numbers are 1, 3, 4, skipping 2. Then there'll be an empty space for 2, whereas there won't for countplot
If you want a tick per each class and none others, you can instead explicitly set the ticks, via something like plt.xticks(np.unique(train_data[col]))
while that does work, I was hoping that matplotlib (or seaborn) could handle it.. cus, say I want to use bins specified in the displot function. Setting the ticks separately will become a bit of a hassle :/
How can it be possible that train accuracy is high and test accuracy is low? Isn't test dataset dependent on the train dataset?
This is the entire point of a train/test split. Why do you think we have train and test data? Why not just train on the entire dataset?
The idea is that both come from the same distribution, but because you don't have infinite data, they will be slightly different from each other. And because you train the model to be optimized for just the training data, it may be a bit worse on the test data.
Imagine that you are making a model that predicts house prices based on the number of rooms and the distance to the city center. Price is test data and the rest are train. So the point is to set such values that it predicts the price well. So the better the train accuracy is, the better the test one, but not when overfitting comes into play. But why? I searched for that on the Internet and I learned that the machine memorises the data so that it becomes unable to predict well. Yeah, it makes sense, but how is it possible theoretically? How would you visualize that?
" So the point is to set such values that it predicts the price well.": This is, I think, where you go astray:
You pick such values more or less randomly... such that you can validate that the model you built (using the train data) is not overfit. So, the test data serves the opposite purpose: to remind you that a model that trains well on "train" data might be useless in the real world.
You've probably seen this image before @simple tapir
We see that overfitting gets an incredibly good accuracy on the **training** data (the points), with an error of 0 as it passes through all points. But when given new data from the same distribution, it will make poor predictions.
Whereas the good fit has a higher training error than the overfitted case, but it will be able to make better predictions on new data.
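The same picture can be reproduced numerically. A minimal sketch with made-up data (a noisy sine curve, 10 training points) and polynomial fits standing in for the model; the degree-9 polynomial has enough freedom to pass through all 10 training points:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Train and test are drawn from the same (made-up) distribution.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = sample(10)
x_test, y_test = sample(200)

def errors(degree):
    # Fit on the training points only, then measure mean squared error on both sets.
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

good_train, good_test = errors(3)  # modest model
over_train, over_test = errors(9)  # as many parameters as training points

print("degree 3: train", good_train, "test", good_test)
print("degree 9: train", over_train, "test", over_test)
```

The degree-9 fit drives the training error to essentially zero by also fitting the noise, so its training error says very little about how it behaves on fresh points; comparing the two printed test errors shows what that costs.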
Changed test to train^
hey, how would I use a dataset after gathering images for one
do I train the model on the images and then if I want a random image to be classified, just upload that image and then call the model?
yes
the training process is forward propagation -> back propagation
to use the model you only need the forward propagation part
ok, I know how to make a CNN. how should I call the model through a website
label all of the images so the model has the correct answers to use in the cost function, then you may consider data augmentation (rotation, coloring, translation, noise, etc)if you don't have a good quantity of images. Finally you'll wanna normalize the data which is easy for images as they usually just divide every pixel value by the max pixel value 255
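The normalization and a simple augmentation can be sketched in a few lines of numpy. The batch here is random data standing in for real images, and the layout `(batch, height, width, channels)` is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of 8 RGB images, 32x32, stored as 8-bit integers (0..255).
images = rng.integers(0, 256, size=(8, 32, 32, 3), dtype=np.uint8)
labels = rng.integers(0, 2, size=8)

# Normalize: divide by the max pixel value so every value lies in [0, 1].
x = images.astype(np.float32) / 255.0

# Simple augmentation: mirror each image left-right (reverse the width axis).
flipped = x[:, :, ::-1, :]

# The flipped copies keep the labels of the originals.
x_aug = np.concatenate([x, flipped])
y_aug = np.concatenate([labels, labels])
print(x_aug.shape, y_aug.shape)
```

Whether a flip is a valid augmentation depends on the task (it would not be for reading text, for example), and whatever normalization is used in training has to be applied to an uploaded image before calling the model on it.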
same way you'd serve any app, the web server interfaces with the program and gets the data from the model when its requested, then serves the output
or do you want like implementation details?