#data-science-and-ml

wooden sail
#

can you print the shape of h, k1_x, k1_y, etc all the way until the line where you assign x and y to x_sol and y_sol?

prime hearth
#

hello, i would like to please ask are there any beginner guide resources to naming variables in data science? For example this is what I have:

# load data
uploaded = files.upload()
yelp_business_category_correlation_dataframe = pd.read_csv(io.BytesIO(uploaded['yelp_bussiness_category_correlation.csv']))
yelp_business_category_correlation_dataframe.head(10)
def getBusinessCompetitorsByCategory(business_category):
#

the csv file im loading is a pearson correlation of businesses by categories

#

and my function will return the top correlated business categories related to the parameter business_category. i put the keyword Competitor in the name since these businesses can be potential competitors

#

i dont want to name variables like
df = pd.read.. where df stands for dataframe. df is very ambiguous and not readable in the long term

#

or what i have is okay?

timber spoke
#

say I have a 3 layer MLP neural network (2 hidden 1 output). I know the range for my input (0 to 1 for example), but i do not know the possible range for the layer 1 output for example. given i have a set of weights and biases and a defined range for layer 1 inputs, does anyone know if it is possible to determine the maximum possible output for layer 1 for example?
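
For a single layer, the range asked about here can be computed exactly with interval arithmetic: split the weights into positive and negative parts and pair each part with the matching end of the input range. The sketch below is only an illustration; the weights `W1`, `b1` and the ReLU activation are made-up assumptions, not taken from the question.

```python
import numpy as np

def affine_bounds(W, b, lo, hi):
    """Elementwise bounds of W @ x + b for any x with lo <= x <= hi."""
    W_pos = np.clip(W, 0, None)   # positive part of each weight
    W_neg = np.clip(W, None, 0)   # negative part of each weight
    lower = W_pos @ lo + W_neg @ hi + b
    upper = W_pos @ hi + W_neg @ lo + b
    return lower, upper

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))      # hypothetical layer-1 weights: 3 inputs, 4 units
b1 = rng.normal(size=4)           # hypothetical layer-1 biases
lo, hi = np.zeros(3), np.ones(3)  # inputs known to lie in [0, 1]

pre_lo, pre_hi = affine_bounds(W1, b1, lo, hi)
# A monotone activation (ReLU assumed here) maps bounds to bounds.
out_lo, out_hi = np.maximum(pre_lo, 0), np.maximum(pre_hi, 0)
print(out_lo, out_hi)
```

Feeding these bounds into the next layer the same way still gives valid bounds, but they generally become looser with depth because the units of a layer are not independent of each other.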

hasty mountain
#

From 0 to 1 you could use a sigmoid or softmax function (though I guess a sigmoid between hidden layers might not be recommended due to vanishing gradients)

timber spoke
#

for the hidden neurons that is

hasty mountain
#

Sigmoid seems to make things quite...unstable. At least I was testing a GAN here and, well...with sigmoid things got quite messy.

timber spoke
#

hmm, that's interesting

iron basalt
#

Notation in probability is a mess (especially expected value, it can be ambiguous without more information). I prefer p and Pr. Hats on top for approximation.

#

In ML they may or may not state what the notation means, or even be consistent. Things are left ambiguous and made unambiguous with the surrounding text (which they sometimes don't have and expect you to know based on some other papers that they are copying in notation (but mixed together, so it may require a lot of inference to decode its meaning (like solving a Sudoku puzzle at times)))

serene scaffold
#

@iron basalt thank you for your input 💚

iron basalt
#

*Or my favorite, you could only know what they mean by being on the same wavelength / predicting what they are trying to do in the paper because everyone working on similar stuff has convergent ideas.

#

("culture")

serene scaffold
#

there's a whole reference implementation for the paper I wrote two-ish years ago. but even with that, I'm already regretting some imprecision in how I explained a few points.

iron basalt
#

Yeah in ML they sometimes kind of give up, and write it for others that are on that same wavelength. In that case it's a very fast way to write it, but terrible for anyone trying to get in on it.

serene scaffold
#

(and it's not a shitty reference implementation. you could reproduce everything in the paper with one bash command, if you have the dataset.)

iron basalt
#

(in that case one hopes for a reference implementation, hopefully in Python, because it's unlikely that any C++ or other stuff makes any sense and is bug free)

iron basalt
#

@hasty mountain Try different loss functions.

#

The default 2014 one is not so great.

hasty mountain
#

A modified one, the non-saturating version for the generator loss

iron basalt
# hasty mountain Nah, this one https://github.com/tensorflow/tensorflow/blob/2007e1ba474030fcce84...

You also might want to look into https://en.wikipedia.org/wiki/Wasserstein_GAN if you did not already.

The Wasserstein Generative Adversarial Network (WGAN) is a variant of generative adversarial network (GAN) proposed in 2017 that aims to "improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches". Compared with the original GAN discriminator,...

hasty mountain
#

Ugh, I've read about WGAN, but I'm kind of lazy to try to implement it

iron basalt
hasty mountain
#

Look at that...it looks like a C++ loop grumpchib

#

lr = 5e-5 and RMSProp? Woah...

iron basalt
#
Compared to the original GAN algorithm, the WGAN undertakes the following changes:

    - After every gradient update on the critic function, clamp the weights to a small fixed range, usually [−c, c].
    - Use a new loss function derived from the Wasserstein distance.  The discriminator model does not play as a direct critic but rather a helper for estimating the Wasserstein metric between real and generated data distributions.

Empirically the authors recommended usage of RMSProp optimizer on the critic, rather than a momentum-based optimizer such as Adam which could cause instability in the model training.
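
The two changes listed above can be sketched in a few lines. This is a framework-free illustration with NumPy, not a training loop; the function names and the clip value `c=0.01` are assumptions for the example.

```python
import numpy as np

def critic_loss(scores_real, scores_fake):
    # The critic maximises E[f(real)] - E[f(fake)], so we minimise the negative.
    return np.mean(scores_fake) - np.mean(scores_real)

def generator_loss(scores_fake):
    # The generator tries to raise the critic's score on generated samples.
    return -np.mean(scores_fake)

def clip_weights(weights, c=0.01):
    # Clamp every critic weight into [-c, c] after each critic update.
    return [np.clip(w, -c, c) for w in weights]

scores_real = np.array([0.8, 1.1, 0.9])    # made-up critic outputs on real data
scores_fake = np.array([-0.2, 0.1, 0.0])   # made-up critic outputs on generated data
print(critic_loss(scores_real, scores_fake), generator_loss(scores_fake))
clipped = clip_weights([np.array([0.5, -0.003, -2.0])])
print(clipped)
```

Note that the critic outputs raw scores here, with no sigmoid on top, since the loss is a difference of means rather than a log-likelihood.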
hasty mountain
#

Oooh... I see... I didn't know that about Adam

iron basalt
hasty mountain
#

Now that you mentioned it... I think Goodfellow used a simple SGD with momentum, didn't he?

iron basalt
#

You are past the beginning here, so default choices need to be reevaluated.

hasty mountain
#

But now it makes sense... especially since some papers on GANs differ so much on values for beta1 and beta2 for Adam

#

And I don't have more than 9.000 GPUs to spend so much time in trial and error

#

Thanks!

iron basalt
#

Yeah, when you don't have ALL the compute, your choices need to be more careful / examined.

#

/ don't assume that because everyone does it that it's the best.

prime hearth
#

hello, i would like to please ask is it better to do recommendation system based on what others like or make a simple recommendation based on popular things

#

what i want is simply to recommend other popular businesses, but i'm not sure if this is better than recommending what other users like. Recommending popular businesses is just using some math to find the highest rating count and highest rating, compared to the second option where i need to use KNN or something similar
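
One common way to combine rating count and average rating into a single popularity score is a weighted (Bayesian-style) average that pulls businesses with few ratings toward the global mean. The sketch below assumes made-up data and an arbitrary `min_votes=50`; it is one possible scoring rule, not the only one.

```python
def weighted_rating(avg_rating, n_ratings, global_mean, min_votes):
    """Shrink a business's average toward the global mean when it has few ratings."""
    return (n_ratings * avg_rating + min_votes * global_mean) / (n_ratings + min_votes)

# Hypothetical data: name -> (average stars, number of ratings)
businesses = {
    "Cafe A": (4.9, 3),
    "Diner B": (4.4, 250),
    "Bar C": (3.8, 900),
}

total_ratings = sum(n for _, n in businesses.values())
global_mean = sum(avg * n for avg, n in businesses.values()) / total_ratings

ranked = sorted(
    businesses,
    key=lambda name: weighted_rating(*businesses[name], global_mean, min_votes=50),
    reverse=True,
)
print(ranked)
```

With this rule a 4.9-star business with three ratings no longer outranks a 4.4-star business with hundreds of ratings, which is usually the behaviour wanted from a "popular" list.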

hasty kiln
#

Are you using tensorflow?

wooden sail
#

nope

#

i've used keras a little, but i usually go for jax

hasty kiln
#

Wow cool

iron basalt
#

Taichi is there, but I have not used it, seems fine, assuming it's not buggy. It has limitations as usual though. We have our own stuff so I don't use these open source frameworks except when using other's stuff or when it happens to be a good fit for something. https://www.taichi-lang.org/

hasty kiln
#

I love your job, this is my dream

wooden sail
iron basalt
austere swift
wooden sail
austere swift
#

that looks like something i'll definitely start using

iron basalt
#

The way that it hijacks the Python type hinting syntax is interesting.

#

Building a language out of Python existing syntax, since Python kind of has everything you need now.

fresh tiger
#

Hi, I have a question regarding evaluating and retraining a model.

Assuming I have the current flow: 1) User can evaluate the existing deployed model, 2) Based on the evaluation, the user can retrain the model, 3) if the retrained model's accuracy is better than the evaluation accuracy from step 1, deploy the model.

My question here is, for step 3, should I be comparing the newly trained model's accuracy to the accuracy from step 1? Or should I be comparing it to the accuracy from the old model when it was initially trained?

iron basalt
#

*At least the docs seem to give warnings for limitations, which is nice:
```
WARNING

Taichi only supports fields of dimensions ≤ 8.
```

#

(I happen to have used 12 dimensional arrays so 😦 )

wooden sail
#

sounds nasty 😌

wooden sail
iron basalt
#

(I do like "fields" more than "tensors" actually, might yoink that naming)

#

(I just call them ndarrays in my stuff, because I like to call data structures exactly what they are)

wooden sail
#

hmm i don't think it's a clear nomenclature though, considering field is already used for the fields over which one defines vector spaces, or when working with vector fields, it kinda hints at there being other (possibly spatial) dimensions to which vectors are assigned

#

i prefer tensor if it's a multilinear transformation, n-way array otherwise

iron basalt
#

I think Taichi comes from graphics programming, so the name kind of makes sense given that background.

wooden sail
#

i see... but having to relearn nomenclature AND syntax/API makes it less appealing 😛

iron basalt
#

But I will just stick with ndarray, like numpy. The actual name is rarely typed since I have factory procedures (functions)

wooden sail
#

i like jax cuz it looks the same as numpy and the nomenclature is mathematically sound

fresh tiger
# wooden sail what's the difference between the two? how do you evaluate the model in step 1?

forgot to add that part, my bad - The user would add new data to the database - the user can then initiate an evaluation. I don't really have anything specifically defined for that evaluation, but I assume something like:
```python
score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
```

The model is already a .pb and I would load it in, and then probably run something like .evaluate() on it. Shouldn't this accuracy then be used for stage 3, as this accuracy is the one impacted by the new data?
wooden sail
#

i don't get your last point

#

this evaluation is not that different from the validation that was already done when the model was trained

fresh tiger
#

but if the evaluation is done with a lot of new data added to the dataset couldnt that impact the accuracy?

#

ie the model with the original dataset may have had an 80% accuracy

#

but then a large amount of new data is added, model is evaluated, It may have an accuracy of 60%?

wooden sail
#

ok, that's fair. but by retrain you don't mean from scratch, do you?

fresh tiger
#

I do

wooden sail
#

that doesn't seem very well thought out

#

but sure, you can do that

iron basalt
#

*Due to the graphics background Taichi seems to have spatial partitioning trees so one can do voxels, fluid simulations and such. Pretty neat.

wooden sail
#

you could also treat this as a batching strategy for the data, though in fairness the populations would have different statistical properties

fresh tiger
wooden sail
#

i see. anyway, yeah, that makes sense. evaluate using the new data

#

just make sure it's split properly so that you don't evaluate on data you will also train on

fresh tiger
#

Could that cause overfitting?

wooden sail
#

not overfitting, unfair evaluation

#

you wouldn't even be able to tell if there was overfitting in that case

wooden sail
#

maybe i'll force a student to look at it in some ultrasound simulations :x

fresh tiger
wooden sail
#

all i mean is that the data you use for evaluation cannot be part of the training data

#

so when you add new data to do this evaluation, that data can't be used for anything else
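
The rule described here (evaluation data must never be trained on, and old and new models are compared on the same held-out set) can be sketched as below. The helper names, the 20% split and the integer stand-in data are assumptions for illustration only.

```python
import random

def split_new_data(rows, eval_fraction=0.2, seed=0):
    """Shuffle newly added rows and reserve a fraction purely for evaluation."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_eval = max(1, int(len(rows) * eval_fraction))
    return rows[n_eval:], rows[:n_eval]   # (train_part, eval_part)

def should_deploy(old_accuracy, new_accuracy, min_gain=0.0):
    """Deploy only if the retrained model beats the old one on the SAME held-out set."""
    return new_accuracy > old_accuracy + min_gain

new_rows = list(range(100))               # stand-in for newly collected examples
train_part, eval_part = split_new_data(new_rows)
print(len(train_part), len(eval_part))
print(should_deploy(old_accuracy=0.60, new_accuracy=0.72))
```

Both accuracies passed to `should_deploy` would come from evaluating on `eval_part`, so the comparison in step 3 is like for like.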

fresh tiger
#

Ahh ok yes that makes sense

#

thank you very much for all of your help! 😄

patent lynx
#

hey is there a numpy function for the gram-schmidt for finding a set of basis vectors?

wooden sail
#

not directly, but you can use a QR decomp or SVD to achieve a similar effect

#

QR is probably faster

patent lynx
#

guess i'll look into it, otherwise I'll be relying on someone else's gsBasis function i found on github

wooden sail
#

scipy linalg orth produces an orthonormal basis, and the docs say it does it via SVD as i suggested
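
A minimal NumPy-only sketch of both routes mentioned above: an SVD-based basis that drops linearly dependent directions (similar in spirit to `scipy.linalg.orth`), and a plain QR factorisation. The tolerance `tol` and the example matrix are assumptions.

```python
import numpy as np

def orthonormal_basis(A, tol=1e-10):
    """Orthonormal basis for the column space of A, via SVD."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    rank = int(np.sum(s > tol * s.max()))   # count singular values that are not ~0
    return U[:, :rank]

A = np.array([[1.0, 1.0, 2.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])   # third column = first + second, so rank 2

Q_svd = orthonormal_basis(A)
Q_qr, R = np.linalg.qr(A)        # QR also orthonormalises, but keeps all columns
print(Q_svd.shape, Q_qr.shape)
```

QR is typically cheaper, while the SVD route makes the rank decision explicit, which matters when some input vectors are (nearly) linearly dependent.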

#

like so

patent lynx
#

yeah the github function shows similar method, but I am not sure if this is robust:

#
```python
import numpy as np
import numpy.linalg as la

verySmallNumber = 1e-14  # assumed tolerance; the posted snippet uses this name without defining it

def gsBasis(A):
    B = np.array(A, dtype=np.float64)  # Make B a copy of A, since we're going to alter its values.
    # Loop over all vectors, starting with zero, label them with i
    for i in range(B.shape[1]):
        # Inside that loop, loop over all previous vectors, j, to subtract.
        for j in range(i):
            # Subtract the overlap of the current vector B[:, i] with a previous vector B[:, j]
            B[:, i] = B[:, i] - B[:, i] @ B[:, j] * B[:, j]
        # Normalise B[:, i], or zero it out if it is (numerically) linearly dependent
        if la.norm(B[:, i]) > verySmallNumber:
            B[:, i] = B[:, i] / la.norm(B[:, i])
        else:
            B[:, i] = np.zeros_like(B[:, i])
    # Finally, we return the result:
    return B
```
wooden sail
#

yeah this looks like vanilla gram schmidt, might have some numerical stability issues

#

i'd just use the scipy one if you don't wanna make a robust one yourself

patent lynx
#

thank you

wide ibex
#

Hey everyone, I am trying to find answers to a strange problem in Tensorflow Keras where the exported weights don't seem to produce the same results, and I am trying to bring some attention to the problem to see if anyone can help understand why this is happening. There is a GitHub issue concerning the problem here: https://github.com/keras-team/keras/issues/17332

If anyone can help shed some light on this issue it would be greatly appreciated, thank you.

GitHub

For some reason when I export my weights from a Tensorflow Keras model, be it a simple Sequential FNN, and then load them into C to perform forward passes I get different output results, sometimes ...

odd dagger
#

hi, can someone give me an example of web crawling with parallel processing?

plain cobalt
#

would someone tell me where should i learn data science for free

serene scaffold
arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

plain cobalt
serene scaffold
ripe sapphire
#

what should I learn in AIML specifically?

#

what is required to be a professional

#

someone guide me please

serene scaffold
ripe sapphire
#

@serene scaffold did you complete your masters

serene scaffold
#

not that I'm some prodigy. I was very fortunate to be hired. but I also cultivated a very niche skillset during my undergrad that happened to be what they wanted.

ripe sapphire
#

silly question: Is good maths necessary for ML

serene scaffold
#

depends on what you mean by "good at maths". ML is math. but being "bad at math" is mostly a state of mind.

patent lynx
#

You need 3 stuff generally

#
  1. linear algebra
#
  2. multivariate calc
#
  3. statistics (PCA?) still working on this
#

I'd recommend starting with this

#

What might it feel like to invent calculus?
#

and this

serene scaffold
#

though bear in mind that even if you learn those things, potential AI/ML employers won't really take you seriously without a degree. whether that's fair or not can be debated.

wooden sail
#

very strong disagree there

#

due to how 3b1b presents content, it only makes sense if you have already covered the content in another form first

#

e.g. by reading in a book or in lectures in uni

#

on its own the channel does not provide nearly enough background nor detail of the right kind to be a standalone good math resource

serene scaffold
wooden sail
#

i agree, but the way historify and curry chicken recommended it is misleading if the other person has no prior knowledge

serene scaffold
#

and I think the guiding principle for what videos he decides to make is "what aspects of a given topic could be better explained with animations than with static visuals?" and goes from there

patent lynx
#

Sorry if i gave bad advice, but yeah, i should have put an asterisk that this should be supplemented with examples and exercises, like in books

wooden sail
#

not bad advice, just needs a little more oomph 😛

odd dagger
#

@wooden sail I somewhat agree with you. I felt it was the best channel for me for calculus stuff at least, I still watch his videos for fun...
Also like, everyone's personal opinion differs right? blobgrimacing

wooden sail
#

most certainly. my wording might have been harsh, but at the end of the day it's my opinion rather than fact 😛

serene scaffold
arctic wedgeBOT
#

:ok_hand: Added edd’s-opinion-is-fact to the names list.

wooden sail
#

lol

odd meteor
#

Is anyone here attending ICLR 2023 ?

#

We can do a small PythonDiscord Hangout / Dinner if we're up to 3 people that'll attend ICLR 😎

steady basalt
hasty mountain
#

@iron basalt ChatGPT might not have all the answers...but he's quite a Socrates.
I was talking to him about my GAN and that my discriminator had its loss stabilized around 0.34, while my Generator loss, in the best case, oscillated between 5 and 8, and he just told me "you could train the generator for more epochs with the discriminator's weights frozen"

#

Like...every code I've seen about GANs uses the approach 1 batch -> 1 step for discriminator -> 1 step for generator.
But then this reminded me that Goodfellow suggested that one could use more iterations per batch in both discriminator or generator. He just used 1 in the paper for convenience. Heh

#

Now I'll see if 5 more iterations in the generator works (it has 5 times more parameters than the discriminator)... using a content loss function to try to avoid mode collapse, of course

#

I was trying to compensate this by using a higher learning rate for the generator optimizer, but it also didn't work.

toxic vault
#

question

#

i was wondering if this code would work for a stock trading bot that uses machine learning and stock strategies

serene scaffold
#

I have a Bert-for-NER model, named m0, that does 9 classes, so the final layer is Linear(in_features=768, out_features=9, bias=True). And I'm trying to make a copy of that model, m1, that does all of those same classes and three additional ones. So I did this.

m1 = BertForTokenClassification.from_pretrained('./m0.pkl')
linear = nn.Linear(768, len(e1))  # len(e1) == 12
with torch.no_grad():
    linear.weight[:len(e0), :] = m0.classifier.weight.clone().detach()  # len(e0) == 9
    m1.classifier = linear
m1.train().to(cuda)

And if I print m1, I can see that indeed, the last layer is (classifier): Linear(in_features=768, out_features=12, bias=True).

But I still end up with this error

Traceback (most recent call last):
  File "/home/farnsworthsw/projects/cont_learning/replicate_addner.py", line 194, in <module>
    optimizer = AdamW(m1.parameters(), lr=1e-5, eps=1e-8)
  File "/home/farnsworthsw/projects/cont_learning/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/farnsworthsw/projects/cont_learning/venv/lib/python3.9/site-packages/transformers/models/bert/modeling_bert.py", line 1785, in forward
    loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
RuntimeError: shape '[-1, 9]' is invalid for input of size 7296
#

I checked all the layers of both m0 and m1, and I didn't see anything in either that depended on the number of classes except the classifier layer (which is the last one), so I'm not sure why this would suddenly become a problem.

#

I suppose I could try making m1 with nn.Sequential of all the layers of m0 except the last one, and then the new linear one.

#

I hope my question is sufficiently detailed without being too much.
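
The traceback points at `logits.view(-1, self.num_labels)`, so the model object apparently still carries the old label count even though the classifier layer now emits 12 logits. The arithmetic can be checked with NumPy alone; treating `m1.num_labels = len(e1)` (plus the matching config field) as the remedy is an assumption that has not been verified against this setup.

```python
import numpy as np

n_tokens = 7296 // 12            # 608 token positions if the classifier emits 12 logits each
logits = np.zeros((n_tokens, 12))

# Mirrors logits.view(-1, num_labels) with the new label count: works.
ok = logits.reshape(-1, 12)

# With the stale label count, 7296 elements cannot be arranged into rows of 9.
try:
    logits.reshape(-1, 9)
    stale_failed = False
except ValueError:
    stale_failed = True

print(ok.shape, stale_failed, 7296 % 9)
# Assumed fix (unverified): after swapping the layer, also set
#   m1.num_labels = len(e1)
```

Since 7296 is divisible by 12 but not by 9, the error message is consistent with 12-way logits being reshaped using a stored label count of 9.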

keen notch
#

shall i np.shape(h)

#
```python
import numpy as np
import matplotlib.pyplot as plt
# Define functions to compute the right-hand sides of the differential equations
def f_x(x, y, vx, vy):
    return -2 * y**2 * x * (1 - x**2) * np.exp(- (x**2 + y**2))
def f_y(x, y, vx, vy):
    return -2 * x**2 * y * (1 - y**2) * np.exp(- (x**2 + y**2))
def trajectory(impactpar, speed):
    maxtime = 10 / speed
    t = np.linspace(0, maxtime, 300)
    x = impactpar

    y = -2
    vx = 0
    vy = speed
    # Initialize arrays to store the solutions
    x_sol = np.empty(t.shape)
    y_sol = np.empty(t.shape)
    for i, _t in enumerate(t[:-1]):
        h = t[i+1] - _t
        k1_x, k1_y = h * vx, h * vy
        k2_x, k2_y = h * (vx + 0.5 * k1_x), h * (vy + 0.5 * k1_y)
        k3_x, k3_y = h * (vx + 0.5 * k2_x), h * (vy + 0.5 * k2_y)
        k4_x, k4_y = h * (vx + k3_x), h * (vy + k3_y)
        x += (k1_x + 2 * k2_x + 2 * k3_x + k4_x) / 6
        y += (k1_y + 2 * k2_y + 2 * k3_y + k4_y) / 6
        vx = f_x(x, y, vx, vy)
        vy = f_y(x, y, vx, vy)
        x_sol[i+1], y_sol[i+1] = x, y
    return x_sol, y_sol
x_sol,y_sol = trajectory(0.1, 0.1)
# Plot the resulting trajectory
plt.plot(x_sol, y_sol)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

# Solution to part (b)
def scatterangles(allb, speed):
    # Initialize an array to store the scatter angles
    angles = np.empty(allb.shape)

    # Loop over the impact parameter values
    for i, impactpar in enumerate(allb):
    # Solve the differential equations and store the final values of x and y
        _, vy = trajectory(impactpar, speed)
        # Compute the scatter angle
        angles[i] = np.arctan2(vy, 0)
        # Return the array of scatter angles
    return angles
allb = np.arange(-2, 2, 0.001)
angles = scatterangles(allb, 0.1)

# Plot the scatter angles as a function of impact parameter
plt.plot(allb, angles)
plt.xlabel("Impact parameter")
plt.ylabel("Scatter angle")
plt.show()
```
hasty mountain
#

I hope I'm just being too hasty

keen notch
#

so when printing the shape I get this

hasty mountain
# hasty mountain Eeeeh...maybe he was wrong here, too. Again, discriminator loss stabilized at 0....

In the end, the correct answer was "try using residual blocks". I was trying to use an architecture similar to SRGAN but without residual blocks, but, apparently, residual blocks are magical. Though I don't quite understand the justification for why they work so well...
I can understand when you concatenate a residual block to your output, but I don't quite get it why it works when you directly sum the residual blocks, element-wise.

idle sequoia
#

guys, I have this column in a PySpark data frame:

date
2022/1/1
2022/10/2
2022/2/4

and I really need to convert the dates to:

2022/01/01
2022/10/02
2022/02/04

how can i do this with pyspark?
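
The transformation itself is "parse the date, then format it with zero padding". The sketch below shows that logic in plain Python so it can be run anywhere; the PySpark line in the trailing comment is an assumed equivalent that has not been run here.

```python
from datetime import datetime

def pad_date(raw):
    """Turn '2022/1/1' into '2022/01/01' by parsing and re-formatting."""
    return datetime.strptime(raw, "%Y/%m/%d").strftime("%Y/%m/%d")

print([pad_date(d) for d in ["2022/1/1", "2022/10/2", "2022/2/4"]])

# Assumed PySpark equivalent (untested here), with pyspark.sql.functions imported as F:
#   df = df.withColumn("date", F.date_format(F.to_date("date", "yyyy/M/d"), "yyyy/MM/dd"))
```

Using built-in column functions rather than a Python UDF would keep the work inside Spark's engine, which is usually faster on large frames.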

craggy patio
#

How do I make an AI that tries making different chords and tests if the user likes them or not and keep making chords that the user likes?

strong sedge
#

can we create a auto encoder for dealing with names ?

#

if so can you link me to some paper/article

odd dagger
#

anyone good in xpath here?
//*[contains(concat( " ", @class, " " ), concat( " ", "organizationName", " " ))]

Can someone explain to me how this xpath works?

I am trying to crawl data from a website using scrapy
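
Reading the expression piece by piece: `//*` selects every element in the document, and the predicate keeps those whose `class` attribute contains `organizationName` as a whole, space-separated token. Padding both sides with spaces is what stops a class like `organizationNameLong` from matching. A small Python mirror of the predicate (the helper name is made up for illustration):

```python
def has_class_token(class_attr, token):
    """Mirror contains(concat(' ', @class, ' '), concat(' ', token, ' '))."""
    return f" {token} " in f" {class_attr} "

print(has_class_token("card organizationName big", "organizationName"))
print(has_class_token("organizationName", "organizationName"))
print(has_class_token("organizationNameLong", "organizationName"))
```

This pattern is what CSS-to-XPath converters usually generate for a `.organizationName` selector, so the same elements can often be selected with a CSS selector instead.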

ocean swallow
#

is there any service for labeling product images?

#

I used Google Vision and probably if I had an hour I would deploy a service better than that...

#

but mine wouldn't be enough as well

odd mason
#

Not sure if this is the right place to ask,
I've been a ML Engineer for 2+ years now in the same company (straight out of college)
I'm considering switching companies soon and was looking for potential project ideas to put on my resume.
Is there any place I can get ideas from? (Which aren't too generic)

#

My resume just has one project at present

patent lynx
#

join a kaggle competition, you might want to form teams with someone you know. Gravitate to a topic that is relevant to the company you are interested in or that satisfy the new job's skills

odd mason
patent lynx
amber lark
#

Does someone know how to change the colors that they won't look so similar?

strange igloo
#

You may try something similar to this:

```python
axes.scatter(high_action.suspense, high_action.action, high_action.comedy, c="red", marker="x", s=200)
axes.scatter(low_action.suspense, low_action.action, low_action.comedy, c="blue", marker="o", s=200)

axes.set_title("Sample Movies")
axes.set_xlabel('Suspense')
axes.set_ylabel('Action')
axes.set_zlabel('Comedy')
```
#

And the full article will be helpful if you are using matplotlib

hasty mountain
lethal spade
#

Am I the only one that thinks that Advanced indexing in numpy doesn't follow the principle of minimum astonishment?

for example

a = np.random.rand(100, 100)

a[(2,4)] #this yields the element at [2,4]
a[[2,4]] #this yields the rows at position 2 and 4
a[1, (2,4)] #this yields the 2nd and 4th elements of row 1. (So actually does advanced indexing)
a[1, [2,4]] # Works the same way as the previous one.

Worst of all, it's very easy for someone to make a mistake and not notice it: it seems to me that the first method, a[(2,4)], should not be allowed, and instead only a[*(2,4)] should work. I checked how it works in Julia (which has a similar syntax), and a[(2,4)] would yield an error, which makes sense to me. Could it be an idea to deprecate a[(2,4)]-like usages?
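
The difference between the cases is easiest to see from the result shapes. Python hands `a[2, 4]` and `a[(2, 4)]` to the array as the same tuple, so NumPy has no way to tell them apart, whereas a list is treated as an index array. A small check (using `np.arange` instead of random data so the values are predictable):

```python
import numpy as np

a = np.arange(100 * 100).reshape(100, 100)

print(a[(2, 4)].shape)     # a bare tuple is the same as a[2, 4]: a single element
print(a[[2, 4]].shape)     # a list triggers advanced indexing: two whole rows
print(a[1, (2, 4)].shape)  # a tuple nested inside the index is treated as an index array
```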

agile cobalt
arctic wedgeBOT
#

@agile cobalt :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | (1, 2, 3)
002 | (1, 2, 3)
agile cobalt
#

!e without the * (had it there for testing)
```py
class Foo:
    def __getitem__(self, item):
        print(repr(item))

arr = Foo()
arr[1, 2, 3]
arr[(1, 2, 3)]
arr[*(1, 2, 3)]
```

arctic wedgeBOT
#

@agile cobalt :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | (1, 2, 3)
002 | (1, 2, 3)
003 | (1, 2, 3)
strange igloo
#

I have a data frame that looks like this. I used a recursive function to create a lineage of dependencies. The problem is, some of the lineage routes are incomplete, though the data set will include the complete route at some point.

How can I remove the incomplete routes and preserve the complete routes?

Here the red line is incomplete, while the green line is complete.

lapis sequoia
#

JAI HIND. I AM FROM INDIA AND LOOKING FORWARD TO BE A DATA SCIENTIST

young granite
strange igloo
#

So it takes every row of the original data set, and then creates that rows chain.

#

but any particular row in the original data set can be in the middle of a chain

young granite
#

how do u define whether a row is complete or incomplete then

strange igloo
#

It's more like a series of rows

young granite
#

to cluster ur data u need to define rules for clustering

strange igloo
#

It's like this

row 1
row 2
row 3
row 4

is complete series

row 3
row 4

is incomplete series

young granite
#

so different sources ?

#

use something like this in a pandas df:

df.loc[df['column_name'] == some_value]
df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

or make a boolean df beforehand and filter afterwards.
unfortunately I do not understand your question logic

patent lynx
#

I think what he meant that the chain creation is created by the dependency and the dependent

#

Like delta cool 1 dependent is cool 1

#

Then the delta cool 2 dependency is cool 1

strange igloo
#

It’s hard to describe

#

It’s hard to think about lol

#

But that is right

patent lynx
#

Index chain 9, 10, 11 is incomplete because if you look at one of their dependency and the dependent for the next index are not equal.

#

Like delta cool 3 in index 9 has dependent cool report but delta cool 2 in index 10 has the dependency cool 1. This forms an incomplete chain.

strange igloo
#

Yeah, that's right.

Delta cool 3 in index 9 is the end of a chain, so there isn't anything that comes after it.

So there's a correction there on my part, the chain actually begins at index 10, with delta cool 2, but is incomplete

Another way to say it, is I want to only preserve a chain that begins at the root

patent lynx
#

Idk how would you define the root tho

#

Is it the proc name or the dependency?

strange igloo
#

the dependency

#

I realized that maybe this isn't ideal, though.

Basically, I'm pulling definitions of stored procedures from a SQL database and then parsing the from and into clauses to find dependencies.

And I envisioned this as a spreadsheet lineage.

Now I realize that preserving only the complete series might be a bad idea, because you might want to search a lineage when the "root" is actually in the middle of a series.

#

Thank you for the help @patent lynx - I'm going to go for a walk. Enough hacking for now.

serene scaffold
keen notch
#

I've a question: why do all my plots look the same? is my energy function wrong?

#
```python
# YOUR CODE HERE
import numpy as np
import math
import matplotlib.pyplot as plt

G=6.6738e-11
M=1.9891e30
m=5.9722e24

def verlet(x0,y0,vx0,vy0,N,paramters=()):
    t = 1/N #timestep
    G = paramters[0]
    M = paramters[1]
    m = paramters[2]
    x=np.zeros((N,2))
    v=np.zeros((N,2))
    x[0]=(x0,y0)
    v[0]=(vx0,vy0)
    for i in range(N-1):
        x[i+1] = x[i] + (v[i] * t)
        f = -G * M * x[i+1] / (np.linalg.norm(x[i+1])**3)
        v[i+1] = v[i] + (f * t)
    return x,v

def solve(par):
    xval,vval = verlet(1.521e11,0,0,2.9291e4,35040,paramters=par)
    return xval,vval

def potentialEnergy(r,par):
    energy = np.zeros(len(r))
    vals = r[:, 1]
    for i in range(len(r)):
        energy[i] = par[2] * par[0] * np.linalg.norm(r[i])
    return energy
    
def kineticEnergy(v,par):
    energy = np.zeros(len(v))
    vals = v[:, 0]
    for i in range(len(v)):
        energy[i] = 0.5 * par[2] * ((np.linalg.norm(v[i]))**2)
    return energy

xval,vval = verlet(1.521e11,0,0,2.9291e4,35040,paramters=(G,M,m))

pe = potentialEnergy(xval,(G,M,m))
ke = kineticEnergy(vval,(G,M,m))

total = pe+ke

plt.subplot(3, 1, 1)
plt.plot(pe)


plt.subplot(3, 1, 2)
plt.plot(ke)

plt.subplot(3, 1, 3)
plt.plot(total)

plt.show()
```
#

happy to show the question if needed

young granite
#

provide the question pls @keen notch

keen notch
#

yes of course,here!

#

can you read it?

young granite
#

nah please provide text

#

or paste the markdown in a pastebin

#

!paste

arctic wedgeBOT
#

Hey @keen notch!

You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.

keen notch
#

has a few equations so ss again

young granite
#

does this seem suited?

#

@keen notch

keen notch
#

omgg

#

looks better i thinkk

#

the start scale is squashed

#

tbh not sure how it's supposed to look just knew my graphs looked very wrong😂

young granite
#

😄

keen notch
#

how did you do it in terms of code

young granite
#

wait

#

somethings wrong

keen notch
#

ohh ok let me know what's up

young granite
keen notch
#

ooo thank you so what was the issue

#

so i can understand

#

DAMN YOUR GENIUS

#

is there no orange plot?

young granite
#

u see blue line is by 0

#

so there wont be many changes to the resulting curve

keen notch
#

ohhh makes sense

#

what was wrong with what i had previously

young granite
#

i would assume its ur np.zeros all time so u get a mismatch somewhere

#

but i didnt check ur "logic"

#

so the iterations u do

#

i would suggest to check em

keen notch
#

ahh fair enough so what does this do

keen notch
keen notch
young granite
keen notch
#

the appends

young granite
#

store out the values

#

u need to calculate for each step

keen notch
young granite
#

thats why u need to append later and why i think ur np.zeros are resulting in mismatch

keen notch
#

smart

#

and interesting

#

t[-1]?

young granite
#

u want to compare last t value with the tmax

keen notch
young granite
#

to iterate only in the range

#

!e

import numpy as np


test = np.array((1,2,3))
print(test[-1])
keen notch
#

makes sensee

#

thank you so so much

young granite
#

no worries

#

but [-1] is pretty basic stuff

#

u should check the basic first

#

before u attend such difficult questions?!

keen notch
#

i should ur right

young granite
#

u doing it for university?

keen notch
#

working on a course outside uni but yes

young granite
#

i would highly recommend to learn python tho

#

its the "soft-skill" to have

keen notch
keen notch
young granite
#

thats totally fine

#

but maybe ur current class is then too advanced for u

#

its no shame to start small

keen notch
#

i'm more a C girl

#

i'm just trying to do python questions I've done some basic ones tbf and didn't have problems

young granite
#

however can be frustrating and difficulty

#

im out for today have a great night

#

🦉

keen notch
#

one question

#

shouldn't the total remain constant?

robust jungle
#

im trying to train a model with labels in the format of integers ranging from 0-2 and am getting this error:

Received a label value of 2 which is outside the valid range of [0, 1).  Label values: 1 2 2 2 2 0 0 0 1 1 2 2 1 1 0 0
     [[{{node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits}}]] [Op:__inference_train_function_73146]

I know that sparse crossentropy is supposed to accept integer encoded labels as opposed to one - hot, so what am I doing wrong?

plush jungle
#

lol my q learning bot that I'm trying to teach to play a top down shooter game has learned that pygame has lag

#

I don't have a cap on how many bullets it can shoot, and I gave it a penalty for not putting its laser on the target and a reward for being on target, so naturally as it spun around trying to find the target, it realized it could minimize its penalty if it slowed down time

#

so in between every turn action it takes, it fires another bullet

#

now that it's found the target I think it's about to learn that not slowing down time is actually beneficial

dusk tide
#

So I was following a notebook on Kaggle for training a model on TPU, by Kaggle grandmaster Phil Culliton: https://www.kaggle.com/code/philculliton/a-simple-tf-2-1-notebook/notebook . In it he uses a VGG16 model, which takes an input shape of (224,224) during training, but he trains it with (192,192) input images?? On GPU or CPU it will throw an error. How is that possible??

keen notch
#

hey does anyone whether this means my bottom plot is wrong (should be constant)

fast schooner
#

Hi, I hope everyone is having a great christmas. I'm searching for some advice and think that this might be the right channel to ask. I'm about to finish my theoretical physics degree in barcelona and I've taken some computational courses+ I've done Machine Learning course and an internship ML related in stockholm (erasmus). I really enjiyed this topics and was thinking about pursuing a carreer in data science. I was hoping that there is someone with a similar background that could give some advice. Thanks in advance 🙂

fast schooner
# keen notch

Idk without seeing the rest of the plot but ur total energy should be higher than the kinetic energy too

#

sry without seeing the rest of the code

#

Maybe its not properly labelled

keen notch
#

I can pastebin the rest of the code

#

with the question?

keen notch
keen notch
#

the thing I'm unsure about is whether the total has to stay constant

fast schooner
#

yes it should

#

Energy is preserved

fast schooner
#

energy should be more in absolute value in this case

keen notch
#

not sure what I'm doing wrong

fast schooner
#

Oh

#

I think it is right

#

it oscillates because of the method u're using

#

just check b

#

It does oscillate around a constant mean value

#

That can happen when using numerical methods

keen notch
keen notch
fast schooner
#

Smthing like this

#

I think so yea

keen notch
#

ohh i see the red line is the mean!

#

how'd you do that

fast schooner
#

Paint

keen notch
#

haha ohh

fast schooner
#

hahahah

keen notch
#

i get youu

#

smart thank you so the mean is what they want constant

#

but it oscilates

fast schooner
#

but you can get the mean value with np.mean and then plot a constant line if u want

#

yess
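
A rough sketch of the "oscillates around a constant mean" idea, assuming a unit-mass harmonic oscillator integrated with a semi-implicit Euler step (the actual system and method in the exercise are not shown in the chat, so both are assumptions):

```python
import numpy as np

dt, n_steps = 0.01, 5000
x, v = 1.0, 0.0                      # unit mass, unit spring constant (assumed)
total_energy = np.empty(n_steps)

for i in range(n_steps):
    v -= x * dt                      # update velocity from the current position
    x += v * dt                      # then update position with the new velocity
    total_energy[i] = 0.5 * v**2 + 0.5 * x**2   # kinetic + potential

mean_energy = np.mean(total_energy)
spread = total_energy.max() - total_energy.min()
print(f"mean = {mean_energy:.4f}, peak-to-peak = {spread:.4f}")
```

With matplotlib, `plt.axhline(mean_energy)` would draw that mean as a horizontal line on top of the energy curve. If the peak-to-peak spread is large compared with the mean, or the energy drifts steadily, the equations or the step size are more likely at fault than the numerical method.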

keen notch
#

ofc i know that function hehe:)

#

yesss so my graphs are all good😎

fast schooner
#

Can u plot the potential and kinetic energy

keen notch
#

top graph

fast schooner
#

yes but i cant see the potential

keen notch
#

because it sum of both

#

so will cancel

fast schooner
#

mmm

#

I think in this case the potential is just on top of ur total energy

keen notch
#

is it because the blue line is by 0
so there wont be many changes to the resulting curve (orange)

mystic grotto
#

Hey guys, question regarding regression model.
I want to predict a salary based on categorical data such as experience level or job title.
What model would recommend?
Thx in advance 🙂

keen notch
fast schooner
#

Im not sure

#

hahahahahah

#

If u could plot only the potential

#

im assuming it will look just like the total energy

keen notch
#

hmm i'll think

#

but thank you for your help!:)

fast schooner
#

np

#

It still doesnt look quite okay

#

But think about it I can't help rn maybe at night

keen notch
#

ahhh no a is wrong total should be constant

keen notch
fast schooner
#

yes there is something wrong

#

If I have time tonight I'll take a look

keen notch
#

yeah there is I'll think about it when I'm back home as well

#

might be my equations

fading zealot
#

does anyone know feature engineering ?

charred light
fading zealot
#

I have 4 dataset

#

I want to apply feature engineering using these four dataset with an hypothesis question

charred light
fading zealot
#

apply a model and predict something

#

Do you mind if we can get on call and explain you everything @charred light

#

thinking about applying linear regression model

charred light
fading zealot
#

I do really need help @charred light though

charred light
#

And text is just fine to do that.

rancid sorrel
#

you should convert the data over to a common format using sklearn, unable it and then throw it in your models

#

first thing is to compair the types of data using pandas.

#

df1.dtypes

#

for example

fading zealot
#

can you explain using a zoom call

rancid sorrel
#

lol not zoom no

fading zealot
#

oops we do have discord

fading zealot
open storm
#

https://pastebin.com/LJaFfFsE
Can someone please help ? This is a deep learning problem. I trained a gesture learning model on 225x225 pixel images using keras and neural networks. I saved the model to an h5 file. Above is the code I want to use for detecting it. However when I show my hand in front of the camera it shuts down right away with error being that

ValueError: Input 0 of layer "sequential_4" is incompatible with the layer: expected shape=(None, 224, 224, 3), found shape=(None, 21, 2)

I am fine with detecting my hand within a region of interest. But what other things I can do to fix this problem and get it up and running?

rancid sorrel
#

also anyone know why this isnt working

l1.dtypes
bare_nuclei                    object
----> 1 l1["bare_nuclei"] = pd.to_numeric(l1["bare_nuclei"])

TypeError: 'method' object is not subscriptable
l1["bare_nuclei"] = pd.to_numeric(l1["bare_nuclei"])```
charred light
rancid sorrel
#

well i kinda need to shove it there for all the stuff i am doing later

#

? does pd.to_numeric actualy convert the data?

#

give that a read

charred light
#

Yes, it applies similarly to the inplace flag in other pandas functions

rancid sorrel
#

also @charred light just using pd.to_numeric(l1["bare_nuclei"]) has same error

charred light
fading zealot
#

@rancid sorrel Do you want help me later?

rancid sorrel
#
ValueError                                Traceback (most recent call last)
File ~/.local/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2363, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "?"

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 type(pd.to_numeric(l1["bare_nuclei"]))
      2 #pd.to_numeric(l1["bare_nuclei"])

File ~/.local/lib/python3.9/site-packages/pandas/core/tools/numeric.py:185, in to_numeric(arg, errors, downcast)
    183 coerce_numeric = errors not in ("ignore", "raise")
    184 try:
--> 185     values, _ = lib.maybe_convert_numeric(
    186         values, set(), coerce_numeric=coerce_numeric
    187     )
    188 except (ValueError, TypeError):
    189     if errors == "raise":

File ~/.local/lib/python3.9/site-packages/pandas/_libs/lib.pyx:2405, in pandas._libs.lib.maybe_convert_numeric()

ValueError: Unable to parse string "?" at position 23```
#

the pandas.core.series.Series is the type(l1["bare_nuclei"])

charred light
#

Ah, you'll need the errors flag

#

You have a "?" in one of your rows, which is causing the error.
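
A small sketch of the errors flag mentioned above, using made-up values for the `bare_nuclei` column. Note that `pd.to_numeric` returns a new Series, so the result has to be assigned back:

```python
import pandas as pd

# Made-up sample: one "?" placeholder among numeric strings
raw = pd.Series(["1", "10", "?", "3"], name="bare_nuclei")

# errors="coerce" turns anything unparseable into NaN instead of raising ValueError
numeric = pd.to_numeric(raw, errors="coerce")

print(numeric.isna().sum())  # how many values could not be parsed
print(numeric.dtype)         # numeric dtype now, since NaN forces float
```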

rancid sorrel
#

oh bugger yeah that would do it sorry this data is supposed to be sanitized already

#

time to go thow some chlorox at it

charred light
#

Welcome to DS, the data is never clean (No matter what the data team tells you).

rancid sorrel
#

yeah 16 ? in the data

#

thank god the dataset is small enough i can just open it in vs code

charred light
#

That's probably why it's being read in as an object too.

rancid sorrel
#

do we have a crap in crap out emoji?

charred light
#

It should have defaulted as a int/float.

rancid sorrel
#

yeah i would have expected that, honestly i am final year CS this is my first time dealing with Data Science

#

💩 in --> 💩 out

fading zealot
#

@charred light in order to predict something from a dataset what are the steps we need to take into consideration?

#

apply models and predict the accuracy ?

rancid sorrel
#

you need to do an EDA

#

first

#

the sweetviz library will do most of that for you

fading zealot
#

Explore data analysis and then apply the model to predict?

rancid sorrel
#
#without scaling function for the models
def models(X_train,Y_train):
    
    #Logistic Regression
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(random_state = 0)
    log.fit(X_train, Y_train)
    
    #Decision Tree
    from sklearn.tree import DecisionTreeClassifier
    dtc = DecisionTreeClassifier(criterion = 'entropy',random_state=0)
    dtc.fit(X_train, Y_train)  #must be fitted before it can be scored
    
    #Random Forest classifier
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    forest.fit(X_train, Y_train)

    #print model accuracy on the training data.

    print('[0]Logistic Regression Training Accuracy:', log.score(X_train, Y_train))
    print('[1]Decision Tree Classifier Training Accuracy:', dtc.score(X_train, Y_train))
    print('[2]Random Forest Classifier Training Accuracy:', forest.score(X_train, Y_train))

    return log, dtc, forest```
#

but thats a decent load of code to run all the models

charred light
fading zealot
#

@charred light any youtube video to help me ?

rancid sorrel
fading zealot
#

@rancid sorrel can we use Linear regression ?

rancid sorrel
#

Learn Data Analysis with Python in this comprehensive tutorial for beginners, with exercises included!
NOTE: Check description for updated Notebook links.

Data Analysis has been around for a long time, but up until a few years ago, it was practiced using closed, expensive and limited tools like Excel or Tableau. Python, SQL and other open libra...

#

@fading zealot you test the prediction accuracy after

#

so use sklearn to split the data into train and test, then you apply the metrics used to test each model

#

once you have your models scored you can then chose the correct one
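
The split-then-score workflow above can be sketched like this. The dataset is synthetic (`make_classification`) and the three candidate models are just an assumed shortlist, so treat it as an outline rather than a recipe:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a real dataset
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

test_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                      # learn from the training split only
    test_scores[name] = model.score(X_test, y_test)  # accuracy on unseen data

best = max(test_scores, key=test_scores.get)
print(test_scores, "-> best:", best)
```

Scoring on the held-out split is what makes the comparison meaningful; training accuracy alone tends to flatter tree-based models.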

fading zealot
#

Oh k

#

I wish to learn that slowly @rancid sorrel with focus using different data sets from kaggle

fading zealot
#

@charred light looking at the dataset I shared with you

#

what do you think what shall be a good hypothesis question

rancid sorrel
#

@charred light my uni skipped al the data sanitation and cleaning and just threw us at the , AAN,CNN and was like "lol enjoy the deep end here is a rock for you to help you float"

#

but yeah if you're learning i recommend you use Markdown for notes with something like obsidian.md to collate all your notes into a vault (it doesn't work well with Jupyter however)

#

at least not yet

charred light
fading zealot
#

Oh merge is so easy, it is just combining two dataframes and passing on the query using an attribute

charred light
fading zealot
#

The problem at the moment is the term period I have .. I need to finish off with this asap

rancid sorrel
#

yeah i got 10 days for mine to be in too and i am on holiday abroard 🙂

#

i feel you

charred light
#

You'll probably be using 1978 Year + for global temps, and limited to 2010 year + for EV.

fading zealot
#

@charred light you seem an expert in this field

charred light
#

No expert here, just working as a data scientist. There's a lot to this field.

rancid sorrel
#

honestly i hope to reach the levels of skill you have oneday skyglow

fading zealot
#

@charred light if you want me to be honest. All I want you to explain using a Jupyter notebook and coding at the moment

#

and then guide me how to be a good data scientist

#

I am willing to learn and make it happen

rancid sorrel
#

well not that you have the time for all of it but i think you need to go back to basics

#

freecodecamp > cs50
intro to github basics

fading zealot
#

just 9 hours to get done with the project

#

thats the problem

rancid sorrel
#

yeah this is like 2 weeks of time to watch

#

yeah i know what you mean, well bascialy your kinda boned

fading zealot
#

ya

#

If you or skyglow will help me with this project

#

that would be great

#

and then make a schedule or plan what to learn and how to be a good scientist

rancid sorrel
#

once you have followed, skyglows reccomendations, by merging the datasets

fading zealot
#

okay i will do that

#

@rancid sorrel can you stay online here

#

@charred light please do stay. I might need help

#

As a data Scientist, do let me know the road to be a good data scientist

#

@rancid sorrel do we need find the missing values as well

rancid sorrel
#

you can however assuming like me your not dealing with 8Pb datasets

#

you can just get the datasets into the correct shape, then deal with the sanitization

#

so shape > sanitize > merge

fading zealot
#

ok

#

@charred light this was my hypothesis question

#

‘Will the increased usage of electric vehicles aid in decreasing CO2 emissions, therefore leading to reduced global warming?’

rancid sorrel
#

no it wont

#

but thats a different issue

fading zealot
#

okay

rancid sorrel
#

your dealing with the demand side fo the equation not the supply side, and thats the main issue with EV

charred light
rancid sorrel
#

honestly we should all be using h2 from biofule with carbon capture at conversion . but thats just my opinion

charred light
#

Or have better public transportation (for the US).

rancid sorrel
#

my general point is you got some major survivor bias with that analysis with your dataset

fading zealot
#

what would a unique hypothesis question ?

rancid sorrel
#

honestly id compair the amount of EV adoption vs the adoption of renewable energy generation

#

and see how much energy your wasting, and analyse the supply side shortfall compared to gas

#

if you can chose your datasets your better off using non US data,

#

cause US emissions are a crap shoot (imo)

charred light
fading zealot
#

yes

rancid sorrel
#

fair enough just plow though it then 🙂

fading zealot
#

but was said we can use external data as well

rancid sorrel
#

sorry for going off on datascience no1 rule of cynicism

#

EU/UK has good data on this

#

as they use the europe emission standards, also EU has the satellite that tracks it

#

uk is good for EV adoption as its data is centraly avaible and accurate due to the regulations

fading zealot
#

@rancid sorrel how long does it take get done with the code ?

rancid sorrel
charred light
#

Yea, US has something like that from DOT (Department of Transportation). Although, most of it is null lol.

rancid sorrel
#

you have to have all your info up to date in the uk

#

or your going to jail, there are automatic number plate readers everwhere

#

so the DVLA (DMV) has a complete dataset

fading zealot
#

@charred light how long will it take for you get done with the feature engineering and the EDA with the data sets provided to you ?

rancid sorrel
#

@fading zealot EDA you use sweetviz

#

and it will do it for you in about 1 line

fading zealot
#

ok

rancid sorrel
#

analysis = sv.analyze(l1, target_feat='class1')
analysis.show_html('EDA-Sweetviz2.html', open_browser=False)

fading zealot
#

i am installing

arctic wedgeBOT
#

Hey @rancid sorrel!

It looks like you tried to attach file type(s) that we do not allow (.html). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.

Feel free to ask in #community-meta if you think this is a mistake.

rancid sorrel
#

well thats sensible

#

but yeah it creates a html report

fading zealot
#

@rancid sorrel @charred light please stay

#

@charred light what to do after merging

charred light
fading zealot
#

which datasets you want me use for the hypothesis question

#

co2 emission and ev datasets ?

charred light
#

If you merged correctly, you should have 1 dataset with all the columns.
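
A minimal sketch of that kind of merge, assuming both tables share a `Year` column. The column names and all values here are placeholders, not the real datasets:

```python
import pandas as pd

# Placeholder tables; real data would come from pd.read_csv
temps = pd.DataFrame({"Year": [2009, 2010, 2011, 2012],
                      "temp_anomaly": [0.1, 0.2, 0.3, 0.4]})
ev = pd.DataFrame({"Year": [2010, 2011, 2012],
                   "ev_sales": [10, 20, 30]})

# An inner merge keeps only the years present in both tables
merged = temps.merge(ev, on="Year", how="inner")
print(merged)
```

An inner join is why the usable range shrinks to the years the shortest dataset covers; `how="left"` or `how="outer"` would keep the extra years but fill the missing columns with NaN.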

fading zealot
#

I know. that

#

but which datasets would you recommend me to merge

#

thats what I am asking

rancid sorrel
#

All of them

tawny turtle
#

why packages not showing?

fading zealot
#

I might be annoying to both of you but all I am left is with 8 hours to finish it off @rancid sorrel @charred light Apologies.

#

@charred light can you do the feature engineering for me ?

#

in a jypter notebook

charred light
agile cobalt
fading zealot
serene scaffold
fading zealot
#

I know

agile cobalt
serene scaffold
fading zealot
#

Because I am stuck at the moment

serene scaffold
#

I don't know that we'll be able to help you become unstuck before the assignment is due.

fading zealot
#

I am not a frequent user asking for help

#

it is just bcoz I am stuck and with the christmas around I was not in a state to get done with everything

agile cobalt
#

we and your teacher(s) would rather have you ask for help often to learn things when you're supposed to than ask for help only when given homework, and ask for everything then

fading zealot
#

please don't preach me ..

#

we had four assignments in the span of 2 weeks

rancid sorrel
#

i am in same situation and litterly dealing with youtube rn, btw do we have any rep?

serene scaffold
fading zealot
#

Moreover i am project manager and all I am stuck is with this code thats all

rancid sorrel
#

i would like to thank @charred light for helping me with my particual problem

serene scaffold
fading zealot
#

@serene scaffold I know what you are trying to explain

#

I am not dumb to come up here to get done with my assignment

serene scaffold
fading zealot
#

@serene scaffold Seriously brother. It's not easy to understand other problems what they go through

serene scaffold
fading zealot
#

when family is around with 20 people in your house and one wish to study .. it is next to impossible for me get done with everything

#

Ok

serene scaffold
#

No more discussion of your personal situation in this channel. You can ask a question if you want. "Please do my homework for me", as we've discussed, does not count.

fading zealot
#

I never mentioned do my homework for me as a statement

#

Check if you want

#

Stop this crap and coding for a non-technical person is hard

#

Fine I wont ask anything

#

You need to look at me that I am eager to learn and understand the concepts

#

No more discussion please @serene scaffold

serene scaffold
#

that's what I've been asking for bing_shrug

fading zealot
#

Query - How to view multiple csv files in jypter notebook?

#

@charred light

serene scaffold
#

I was about to answer your question, too

fading zealot
#

ok

#

noted

serene scaffold
#

you can ping people if they've already engaged with your current question. not if they engaged with a similar question in the past.

you would have to load each csv with pd.read_csv and then have the name of each df be the last statement of a cell

#

so if there are 3 dataframes to display, you need 3 cells.

fading zealot
#

I mean all to say all the three datasets

serene scaffold
fading zealot
#

it is

#

but I have four datasets

#

I want to import all of them

serene scaffold
#

so, you would do pd.read_csv four times. one for each dataset. and then you'd need four cells to display them

#

because each cell displays the result of the last statement

fading zealot
#

it gave utf-8 error

serene scaffold
#

remember to always show the whole error message.

fading zealot
#

ok noted

#

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb2 in position 150: invalid start byte

serene scaffold
#

so it would look something like df1 = pd.read_csv('file/path.csv', encoding='ascii')

fading zealot
serene scaffold
#

you need a comma between the file name and the encoding= part

fading zealot
#

ok noted sir

serene scaffold
#

if you do df.head() four times, only the last one will be displayed, afaik
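
One way around the one-output-per-cell behaviour is to call `display()` explicitly. A small sketch, with in-memory CSV text standing in for the real files (the names `co2` and `ev` and their contents are made up):

```python
import io

import pandas as pd

try:
    from IPython.display import display  # available inside Jupyter
except ImportError:
    display = print  # plain-Python fallback so the sketch still runs

# In-memory stand-ins; in a notebook these would be file paths
csv_sources = {
    "co2": io.StringIO("Year,co2\n2010,1\n2011,2\n"),
    "ev": io.StringIO("Year,ev_sales\n2010,10\n2011,20\n"),
}

frames = {name: pd.read_csv(src) for name, src in csv_sources.items()}

for name, frame in frames.items():
    print(name)
    display(frame.head())  # explicit display works anywhere in the cell, not just on the last line
```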

rancid sorrel
fading zealot
#

UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 30: ordinal not in range(128)

serene scaffold
fading zealot
#

Can I share the datasets ?

serene scaffold
#

you can put them in the paste bin

#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

serene scaffold
#

but there isn't really any way for us to know what the encoding will be except to guess.

fading zealot
#

oky i will try my way to view the dataset

serene scaffold
#

try removing the encoding part and doing encoding_errors='ignore' and see if that works

fading zealot
#

ok

serene scaffold
#

it looks like the problem is whatever char is at the end of this
Afghanistan,AF,93,1752,0,41128771,652230,0.40%,63/km�

#

so, we can just ignore it.
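
A sketch of the two options, assuming the file was saved as Latin-1 (an assumption; byte 0xb2 is "²" in that encoding, which would fit "km²"). The sample row is shortened from the one quoted above:

```python
import io

import pandas as pd

# One header and one row, encoded the way the problem file is assumed to be
raw_bytes = "Country,Density\nAfghanistan,63/km²\n".encode("latin-1")

try:
    raw_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("utf-8 fails:", exc.reason)  # same failure pandas reports

# Option 1: name the encoding, which keeps the character
df_latin1 = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

# Option 2: keep utf-8 but silently drop bytes that cannot be decoded
df_ignored = pd.read_csv(io.BytesIO(raw_bytes), encoding_errors="ignore")

print(df_latin1.loc[0, "Density"], "|", df_ignored.loc[0, "Density"])
```

Naming the right encoding preserves the data; `encoding_errors='ignore'` is the quick fix when the dropped characters do not matter.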

fading zealot
#

yes

serene scaffold
fading zealot
#

oky

#

can you just tell me what feature engineering is

#

what are the steps need to be taken into consideration

serene scaffold
#

it's where you use the data to create additional features

fading zealot
#

with the existing data ..yes i know that

#

how to predict something we need to have an hypothesis question

#

i have that

serene scaffold
#

like these features: Year,BEV average price (USD),Global Sales Volume,Mileage (Km),Lithium Ion Battery Price (USD),,,Average price of new car

#

you might make another feature, battery price per milage

#

or something like that
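
For instance, a derived ratio column might look like this. The two source column names are taken from the list above; the numbers and the new column name are invented for illustration:

```python
import pandas as pd

# Invented values; only the column names come from the dataset header
ev = pd.DataFrame({
    "Year": [2020, 2021],
    "Mileage (Km)": [300.0, 400.0],
    "Lithium Ion Battery Price (USD)": [150.0, 120.0],
})

# New feature built from two existing ones
ev["Battery price per km (USD)"] = (
    ev["Lithium Ion Battery Price (USD)"] / ev["Mileage (Km)"]
)
print(ev)
```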

serene scaffold
#

Feature engineering (Merge datasets, clean data, apply transformations)
imo, only the "apply transformations" is feature engineering. data cleaning is its own thing.

rancid sorrel
#

can anyone explain why
`missing_values = ["NA","N/a",np.nan,"?"]

l1 = pd.read_csv("../DataSets/Breast cancer dataset/breast-cancer-wisconsin.data",header=None,na_values=missing_values)
l1.dropna()`

#

isnt working for l1.dropna()

#

is dropna() a predefinded funtion or does it take my missing_values when called this way

pliant fox
#

.

charred light
rancid sorrel
#

it appears i needed l1 = l1.dropna()
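
A short sketch of why the reassignment matters: `dropna()` returns a new DataFrame and leaves the original untouched unless the result is assigned back. The CSV text here is a made-up stand-in for the real file:

```python
import io

import pandas as pd

csv_text = "id,bare_nuclei\n1,5\n2,?\n3,10\n"  # made-up rows with one "?" placeholder

# na_values tells read_csv which strings to treat as missing while parsing
l1 = pd.read_csv(io.StringIO(csv_text), na_values=["NA", "N/a", "?"])

l1.dropna()            # result is thrown away; l1 still has the NaN row
rows_before = len(l1)

l1 = l1.dropna()       # reassign to keep the cleaned frame
rows_after = len(l1)

print(rows_before, rows_after)
```

So `dropna()` does not take `missing_values` itself; it only drops whatever is already NaN, which is why the `na_values` argument at load time does the flagging.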

fading hill
#

What does numpy use for the visualization of data in their documentation?

rancid sorrel
#

missing_values = ["NA","N/a",np.nan,"?"] << appears to flag the ? as a null value

#

i also tried l1['bare_nuclei'] = pd.to_numeric(l1['bare_nuclei'],errors='coerce') @charred light

rancid sorrel
#

no that swaps the errors to nan

#

at which point icould do a drop null easily

#

errors='coerce' << swaps errors to nan

charred light
#

Ok, good to hear.

rancid sorrel
#

unless i am making a mistake. but so far thats a fairly novel way to handle the problem. now to turn autopilot back on

charred light
#

It can be better to go in and manually fix errors (depending on scale of your data). Like mentioned earlier, if the data point is 16 ?. Then it could be better to clean this with apply and some function. But then again, if you have large rows of data, it doesn't really matter losing one data point or two.

charred light
# pliant fox .

You might want to clarify what you mean by flattening a pip list. I assume pip here is apart of the pip python package, which you can send to a txt file.

rancid sorrel
#

and i am responsible for creating the templates for the EDA

#

so the templates have to do the cleaning for them rather than me manipulating the data, even if the data is small

#

also honestly the data forensic training i had really makes me not want to change the original data>

#

in band data manipulation is just a habit that's been forced into me so you can see the manipulation clearly. also the academics prefer it

charred light
#

Yea, academia tends to have perfect data. Real world data is mostly nulls lol

rancid sorrel
#

honestly my background is sysadmin and cybersecurity (15 is years) going back to accedmia was a bit of a mindfuck

#

at work id fix it with sed and damn the data or pull this from a SQL server and fix it there

charred light
#

I would sure hope the data is cleaned before heading into a database.

rancid sorrel
#

honestly thats usally where you get the crap show

charred light
#

Yea, it's always a fight with our digital team that handles the databases lol

rancid sorrel
#

honestly its usally the frountends fault 😉

#

they are not doing the sanitisation at the java script 😉

charred light
#

that's why I add a ; DROP TABLE usernames every time a wifi connection asks me for info

rancid sorrel
#

haha

#

i need to get into more microservice so i can do this full stack 😉

toxic vault
#
import time  # needed for time.sleep() in the loop below

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load and preprocess stock data
data = load_stock_data()
X = data[['past_performance']]
y = data['future_performance']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train a linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)

# Use the model to make predictions on the test data
predictions = model.predict(X_test)

# Evaluate the model's performance (for a regressor, score() returns R^2, not accuracy)
score = model.score(X_test, y_test)
print("Model R^2:", score)

# Use the model to make decisions about which stocks to buy or sell
while True:
  current_stocks = get_current_stocks()
  current_prices = get_current_prices(current_stocks)
  for stock, price in current_prices.items():
    prediction = model.predict([[price]])
    if prediction > price:
      # Buy the stock
      buy_stock(stock, price)
    elif prediction < price:
      # Sell the stock
      sell_stock(stock, price)
  time.sleep(3600) # Wait an hour before making new predictions```
#

Would this work as a stock trading bot?

#

What add ons or tweaks would need to be made for it to work efficiently

charred light
keen notch
hasty mountain
#

If I make a GAN and, instead of passing the Generator's output directly to the Discriminator, I pass it to a SuperResolution model and only then I pass it to the Discriminator...would this make things too messy?

wooden sail
#

it should be fine, it'd make the two more capable of working independently. ofc there's now way more parameters so the training would be slower (unless you train and freeze the super res part ahead of time)

hasty mountain
#

Yeah, I was thinking about using a pretrained SRGAN in order to make my Generator(in my current GAN) to produce images with a better resolution

#

I mean...even images in 64x64 are so blurry without superresolution models...
(At least if I'm not making anything wrong...which is quite likely)

wooden sail
#

do keep in mind though that doing SR does not result in any extra info... at least normally. using a network that might be different, but the added info is bias that the network picked up from the training data

#

the original architecture without SR should perform about as well as with SR if everything is working ideally

hasty mountain
#

Wouldn't the SR stimulate the generator to make better images? I mean, it would remark the generator's mistakes for the discriminator, wouldn't it?

toxic vault
#

@charred light like a minimum Range and a maximum Range?

wooden sail
#

but try it out and see!

hasty mountain
#

I'll need an AWS entire building just to learn about GANs...pithink

charred light
sudden ermine
rapid pasture
#

Hello guys, I have this error : sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted
at the code below :

def predict(message : str) : 
    tfidf_vectorizer=TfidfVectorizer(stop_words='english')
    pac = pickle.load(open("model.pkl", "rb"))
    vectMessage = tfidf_vectorizer.transform(pd.Series([message]))
    prediction = pac.predict(vectMessage)
    return prediction[0]

I don't know why I have this error because I am transforming the series so it can fit. Thanks in advance for the help.

#

Before saving the model into a pkl it worked fine, but now it keeps raising this error

#

if somebody could help it would be good, thanks in advance

sudden ermine
# rapid pasture Hello guys, I have this error : `sklearn.exceptions.NotFittedError: The TF-IDF ...

The error sklearn.exceptions.NotFittedError: The TF-IDF vectorizer is not fitted is raised when you try to use a scikit-learn estimator that has not been fitted yet.

In your case, it seems that you are trying to transform a message into a vector using the tfidf_vectorizer object, but this object has not been fitted to any data. To fit the vectorizer, you need to pass a list of documents to the fit method, like this:

tfidf_vectorizer=TfidfVectorizer(stop_words='english')
tfidf_vectorizer.fit(list_of_documents)

Where list_of_documents is a list of strings representing the documents you want to fit the vectorizer to. Once the vectorizer is fitted, you can use the transform method to transform new documents into vectors.

In your case, you might want to consider fitting the vectorizer to the training data that you used to train the classifier, so that the vectorizer is able to transform new messages in the same way as it did for the training data.

I hope this helps!
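
A sketch of one way to avoid the unfitted-vectorizer problem: pickle the fitted vectorizer together with the model and reload both at prediction time. The training texts, labels, and the choice of `LogisticRegression` are all placeholders, and the pickle is kept in memory here instead of `model.pkl` so the example is self-contained:

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder training data
train_texts = ["win a free prize now", "free money click here",
               "meeting at noon tomorrow", "see you at lunch"]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_texts)   # fit learns the vocabulary and IDF weights
clf = LogisticRegression().fit(X_train, train_labels)

# Save BOTH fitted objects; a fresh TfidfVectorizer() has no vocabulary
blob = pickle.dumps({"vectorizer": vectorizer, "model": clf})


def predict(message: str, saved_blob: bytes = blob) -> str:
    saved = pickle.loads(saved_blob)
    features = saved["vectorizer"].transform([message])  # transform only, never re-fit
    return saved["model"].predict(features)[0]


print(predict("free prize inside"))
```

Re-fitting a new vectorizer on other data at prediction time would produce a different vocabulary, so the feature columns would no longer line up with what the model was trained on. Wrapping both steps in a scikit-learn `Pipeline` and pickling that is another common option.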

young granite
#

how can i find multi-input models from scikit learn and compare them on a given dataset?

sudden ermine
# young granite how can i find multi-input models from scikit learn and compare them on a given ...

Here is an example of how you could use the model_selection module to compare different models on a given dataset:

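from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Load the dataset
X = ...  # Input features (2-D, one column per feature)
y = ...  # Target variable (continuous, since these are all regressors)

# Define a list of models to compare
# (LogisticRegression is a classifier, so it is left out of a list of regressors)
models = [LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor(), SVR()]

# Iterate over the models and print their mean cross-validation score
for model in models:
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(f"{model.__class__.__name__}: {scores.mean():.2f}")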

This code will train each model on the training folds of a 5-fold cross-validation, and then evaluate its performance on the corresponding test folds. The mean cross-validation score for each model will be printed at the end.

I hope this helps. Let me know about it.

young granite
rapid pasture
#

so I will always need to insert the new document I want to predict inside a serie of documents right?

sudden ermine
# rapid pasture so I will always need to insert the new document I want to predict inside a seri...

Yes, that's correct. When you use a scikit-learn vectorizer to transform a document into a numerical representation (also known as a feature vector), you need to pass the document as a list of strings to the transform method.

For example, if you want to transform a single document message, you can do it like this:

vectMessage = tfidf_vectorizer.transform([message])

If you want to transform multiple documents at once, you can pass them as a list of strings:

vectMessages = tfidf_vectorizer.transform(list_of_messages)

Where list_of_messages is a list of strings representing the documents you want to transform.
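
A minimal sketch of the fit-then-transform flow described above, assuming a made-up three-message training set (`train_messages` and `new_message` are illustrative and not from the conversation). The vectorizer is fitted once on the training documents, and a single new message is wrapped in a one-element list for `transform`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training documents, invented for this example.
train_messages = [
    "win a free prize now",
    "are we still meeting for lunch today",
    "free entry to the prize draw",
]

tfidf_vectorizer = TfidfVectorizer(stop_words="english")
tfidf_vectorizer.fit(train_messages)  # learn vocabulary and IDF weights once

new_message = "claim your free prize"
# transform expects an iterable of documents, hence the one-element list
vect_message = tfidf_vectorizer.transform([new_message])

print(vect_message.shape)  # one row, one column per vocabulary term
```

The new message does not have to be part of the training documents; it only has to be passed in a list so that the vectorizer sees "a collection containing one document".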

rapid pasture
sudden ermine
#

Disclaimer: those are the responses of chatgpt.

odd meteor
young granite
#

does all work for multiple features?

mighty meteor
#

can someone help me?

odd meteor
charred light
cerulean lodge
#

Is anyone aware of Lark grammar libraries? Not even necessarily actual libs but even just published grammar sets. I searched the discord, doesn't seem to be much here. I found: https://pypi.org/project/lark-grammars/0.3.0/ but I'm wondering if I can get an even larger set of grammar use cases/solutions.

sudden ermine
cerulean lodge
#

I saw you posted that same link on the data science discord @sudden ermine. Your own app?

sudden ermine
#

Yes Its my first app

young granite
#

@odd meteor it seems like this works only for 1 feature inputs?

odd meteor
serene scaffold
odd meteor
young granite
odd meteor
young granite
#

currently i use df as input and it returns an empty result for clf.fit

#

do i need to convert my features into arrays?

#

normally scikit works with dfs aswell?

#

@odd meteor any idea?

#

i converted the df now to array with .to_numpy() but still empty result

odd meteor
# young granite currently i use df as input and it returns an empty result for clf.fit

Hi Greenleek, If you were able to split your data with Train-Test-Split, all you need to do next is to just follow the snapshot I sent. If it's still not super clear, I'll advise you try replicating this on your PC to see how it works. Once you are able to replicate the results, then using lazypredict in your own project will become easier to grab.

Just follow the screenshot I sent, or better still check the documentation for more clarity. https://lazypredict.readthedocs.io/en/latest/usage.html#classification

young granite
#

if i use lazy regressor it gives different results than my previously determined ones so i guess something isnt working with the input

#

i dunno why tho cause the input is exactly the same as in the example, with the only difference that both X and y got multiple features

odd meteor
odd meteor
# young granite i can

Awesome.

The reason you're getting different result could be because of a couple of things...

  1. Random State used
  2. The hyperparameter tuned/involved etc
#

So long as you can replicate the result on the documentation page, you can just pick, say, the top 3 algorithms and try to do more hyperparameter tuning to improve the model performance.

young granite
#

but i do get an empty "models" after running the lazyclassifier 🗿

#

so it seems no model works for the offered data

#

even tho the data is in the correct format

odd meteor
young granite
#

regression but there it results in errors as well

#

"y should be a 1d array, got an array of shape (20,20) instead"

#

so they arent capable to perform on multiple features iguess

#

and the LinearRegressor offers waaaaay different results than my previously run scikit
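
The "y should be a 1d array" message suggests that at least one estimator only accepts a single target column. A possible workaround in plain scikit-learn (not lazypredict, whose handling of multi-column targets is not shown here) is to wrap such estimators in `MultiOutputRegressor`. The data below is random and only illustrates the shapes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))  # made-up data: 20 samples, 5 input features
Y = rng.normal(size=(20, 3))  # 3 target columns

# LinearRegression accepts a 2-D target directly.
linear = LinearRegression().fit(X, Y)

# SVR expects a 1-D target, so the wrapper fits one SVR per target column.
svr = MultiOutputRegressor(SVR()).fit(X, Y)

print(linear.predict(X).shape, svr.predict(X).shape)
```

Multiple input features are never the problem here; it is the number of target columns that some estimators restrict.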

odd meteor
# young granite "y should be a 1d array, got an array of shape (20,20) instead"

I've not had any issue with the library (the couple of times I used it). It worked perfectly. If you believe this is a serious issue, perhaps you can raise this issue on the library's GitHub page. https://github.com/shankarpandala/lazypredict

I'll try to use the library once I get home today to confirm if it's still working properly.

Also, can you share the error message / your code?


young granite
hasty mountain
#

Guys, I want to use a model in Pytorch which outputs 2 classes using softmax(softmax, not sigmoid). However, I don't know really which Loss Function I should use, as Pytorch's Cross Entropy includes a LogSoftmax implemented, and NLLLoss expects an output generated by a LogSoftmax function.
Any suggestion?
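
A small numeric check of how these pieces relate, written in NumPy rather than PyTorch so the arithmetic is visible (the helper names `log_softmax` and `nll` are made up for this sketch). Cross-entropy on raw logits equals negative log-likelihood on log-softmax outputs, so a model that already ends in softmax would need its probabilities passed through a log before an NLL-style loss:

```python
import numpy as np

def log_softmax(logits):
    # Subtract the row maximum first for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def nll(log_probs, targets):
    # Mean negative log-probability of the correct class.
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, -1.0], [0.5, 1.5]])  # two samples, two classes
targets = np.array([0, 1])

# Route 1: raw logits -> log-softmax -> NLL (what a cross-entropy loss does).
loss_from_logits = nll(log_softmax(logits), targets)

# Route 2: the model outputs softmax probabilities; take the log, then NLL.
probs = np.exp(log_softmax(logits))
loss_from_probs = nll(np.log(probs), targets)

print(loss_from_logits, loss_from_probs)
```

Both routes give the same number. Feeding softmax outputs into a loss that applies log-softmax itself would normalize twice, which is usually not what is intended.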

dense crane
#

are there association rules for non-binary variables?

#

like for iris dataset for example

hasty mountain
#

Since there's some folks here that are quite mathmaniacs, can someone tell me if this madness I made makes sense?
The idea here is to adapt the Dot-Product Attention from Transformer(NLP) into a Element-Wise Attention layer to extract features from images. I want to avoid Matrix Multiplications because they're too computationally expensive.

class AttentionBlock(nn.Module):

    def __init__(self, in_channels, n_attention_weights):

        super(AttentionBlock, self).__init__()

        self.create_x_weights = nn.Conv2d(in_channels, n_attention_weights, kernel_size=1, stride=1, bias=False)
        self.create_y_weights = nn.Conv2d(in_channels, n_attention_weights, 1, 1, bias=False)
        self.conclude_attention = nn.Conv2d(n_attention_weights, in_channels, 1, 1, bias=False)

        self.Xsoftmax = nn.Softmax(-2) # Computes softmax over the X axis in a feature map
        self.Ysoftmax = nn.Softmax(-1) # Computes softmax over the Y axis in a feature map

    def forward(self, input):

        x_weights = self.create_x_weights(input)
        y_weights = self.create_y_weights(input)

        x_weights = self.Xsoftmax(x_weights)
        y_weights = self.Ysoftmax(y_weights)

        attention_weights = x_weights * y_weights

        attention_weights = self.conclude_attention(attention_weights)

        attention_output = attention_weights * input

        return attention_output

I've noticed that Transformer uses a "similarity matrix", that is the dot-product between queries and keys, then applies softmax to this product. But I don't see exactly how I could use something like this here, so I just applied softmax over the rows of some feature maps(which would be the row weights) and softmax over the columns of other feature maps(column weights) and then apply element-wise product to the input. The higher the X and Y weights, higher the final product, higher the relevancy...or so this is what I want.

#

I should simply test this...but I'm also crazy enough to have this idea while trying to make a GAN, so even if this works, it might not appear so, since...well...GANs things

#

I also don't know if maybe wouldn't it be better to just stick with a single Conv2D instead of doing all this

worthy hollow
#

hey guys so i have this code that makes an error at the division of 2 different dataframes that i want to use for a new data frame... It makes NaN everywhere.. Here's the code:
```py
# Load the financial data for pharmaceutical-sector companies
df = pd.read_csv(r'data//income_statement.csv')

# Select the columns to include in the analysis
cols = ['entreprise', 'date', 'chiffre_affaires', 'resultat_operationnel', 'resultat_net']
df = df[cols]
df['date'] = pd.to_datetime(df['date'], format='%Y', errors='ignore')

# Split the data into the years before covid-19 (2019 and earlier) and the years including covid-19 (2020 and later)
df_avant_covid = df[df['date'].dt.year < 2020]
df_apres_covid = df[df['date'].dt.year >= 2020]

# Compute the yearly mean revenue and operating income for each company, before and after covid-19
df_avant_covid = df_avant_covid.groupby(['entreprise', df_avant_covid['date'].dt.year]).mean()
df_apres_covid = df_apres_covid.groupby(['entreprise', df_apres_covid['date'].dt.year]).mean()

# Compute the change in revenue and operating income between the pre- and post-covid-19 periods
df_variation = pd.DataFrame()
df_variation['variation_ca'] = df_apres_covid['chiffre_affaires'] / df_avant_covid['chiffre_affaires'] - 1
df_variation['variation_op'] = df_apres_covid['resultat_operationnel'] / df_avant_covid['resultat_operationnel'] - 1

# Print the changes in revenue and operating income for each company
print(df_variation)

# Plot a chart comparing the changes in revenue and operating income for each company
plt.bar(df_variation.index, df_variation['variation_ca'], label="variation du chiffre d'affaires")
plt.bar(df_variation.index, df_variation['variation_op'], label="variation du résultat opérationnel")
plt.legend()
plt.show()
```

#

here's the output of the df_variation dataframe containing the NaNs from the operation:

                       variation_ca  variation_op
entreprise       date
Roche Holding AG 2018           NaN           NaN
                 2019           NaN           NaN
                 2020           NaN           NaN
                 2021           NaN           NaN

#

and here's the error when it tries to generate the plot: ```py

TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_8504/54936595.py in <module>
24
25 # Plot a chart comparing the changes in revenue and operating income for each company
---> 26 plt.bar(df_variation.index, df_variation['variation_ca'], label="variation du chiffre d'affaires")
27 plt.bar(df_variation.index, df_variation['variation_op'], label='variation du résultat opérationnel')
28 plt.legend()

c:\Users\PEGON\anaconda3\lib\site-packages\matplotlib\pyplot.py in bar(x, height, width, bottom, align, data, **kwargs)
2649 x, height, width=0.8, bottom=None, *, align='center',
2650 data=None, **kwargs):
-> 2651 return gca().bar(
2652 x, height, width=width, bottom=bottom, align=align,
2653 **({"data": data} if data is not None else {}), **kwargs)

c:\Users\PEGON\anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, data, *args, **kwargs)
1359 def inner(ax, *args, data=None, **kwargs):
1360 if data is None:
-> 1361 return func(ax, *map(sanitize_sequence, args), **kwargs)
1362
1363 bound = new_sig.bind(ax, *args, **kwargs)

c:\Users\PEGON\anaconda3\lib\site-packages\matplotlib\axes\_axes.py in bar(self, x, height, width, bottom, align, **kwargs)
2277
2278 if orientation == 'vertical':
-> 2279 self._process_unit_info(
2280 [("x", x), ("y", height)], kwargs, convert=False)
2281 if log:

c:\Users\PEGON\anaconda3\lib\site-packages\matplotlib\axes\_base.py in _process_unit_info(self, datasets, kwargs, convert)
2339 # Update from data if axis is already set but no unit is set yet.
2340 if axis is not None and data is not None and not axis.have_units():
-> 2341 axis.update_units(data)
2342 for axis_name, axis in axis_map.items():
2343 # Return if no axis is set.

c:\Users\PEGON\anaconda3\lib\site-packages\matplotlib\axis.py in update_units(self, data)
1446 neednew = self.converter != converter
1447 self.converter = converter
-> 1448 default = self.converter.default_units(data, self)
1449 if default is not None and self.units is None:
1450 self.set_units(default)

c:\Users\PEGON\anaconda3\lib\site-packages\matplotlib\category.py in default_units(data, axis)
...
---> 92 raise TypeError(
93 "{!r} must be an instance of {}, not a {}".format(
94 k,

TypeError: 'value' must be an instance of str or bytes, not a tuple```

#

any one has a clue?

queen cradle
worthy hollow
queen cradle
#

What about df_apres_covid?

#

Do you already have NaNs there?

worthy hollow
#

i have no nan there too

#

the nan are made from here

#

df variation

#

bcuz i try to do an operation from there using the two above df (df_avant_covid & df_apres_covid)

#

but they are not in the same dimension i think thats why
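
The "not the same dimension" hunch is close: pandas divides Series by matching index labels, not by position. A small sketch with invented numbers (two fictional companies and an integer `year` column instead of a parsed date, to keep it short) shows both the all-NaN result and one way to get a per-company variation:

```python
import pandas as pd

# Invented numbers for two companies; column names follow the posted code.
df = pd.DataFrame({
    "entreprise": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "year": [2018, 2019, 2020, 2021] * 2,
    "chiffre_affaires": [100.0, 110.0, 120.0, 132.0, 50.0, 50.0, 40.0, 45.0],
})
pre = df[df["year"] < 2020]
post = df[df["year"] >= 2020]

# Grouping by company AND year gives disjoint index labels
# (2018/2019 vs 2020/2021), so nothing lines up and every result is NaN.
by_year_before = pre.groupby(["entreprise", "year"])["chiffre_affaires"].mean()
by_year_after = post.groupby(["entreprise", "year"])["chiffre_affaires"].mean()
misaligned = by_year_after / by_year_before

# Grouping by company only gives both Series the same index.
before = pre.groupby("entreprise")["chiffre_affaires"].mean()
after = post.groupby("entreprise")["chiffre_affaires"].mean()
variation_ca = after / before - 1

print(misaligned)
print(variation_ca)
```

The `(entreprise, date)` MultiIndex is probably also what `plt.bar` objects to in the traceback, since each x label is then a tuple rather than a string.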

queen cradle
#

Oh, it could also be happening because when you groupby you have no data. Then when you call .mean() you get NaN.

worthy hollow
#

nah its not

serene scaffold
#

@queen cradle welcome to our wonderful data science chat waveboye

serene scaffold
#

how did you find this channel immediately after joining, anyway? thinkPeepo

queen cradle
#

Um, I scrolled down?

serene scaffold
#

good to know. (some people complain about the findability of our channels.)

worthy hollow
worthy hollow
#

" I want to divide: (mean of 2018-2019) / (mean of 2020-2021) from the "chiffre_affaires" column... But it's complicated bcuz it's inside the same dataframe and i want to store the result at the "variation_ca" column "

queen cradle
serene scaffold
lapis sequoia
#

can someone help me why this is not working

#

this is how br looks

#

br.set_index(["area_name","dat"]).stack("area_name")
@serene scaffold
I think it's related to the area name column. Because a normal set index("area).stack() is also not working for that.

queen cradle
#

I recommend restarting your analysis and inserting print statements.

worthy hollow
#

ok i found something but got a new error now

#

here's the dataframe i have

#
```
      variation_ca  variation_op  chiffre_affaires  resultat_operationnel  \
date                                                                        
2018           0.0           0.0          60829.73               15099.83   
2019           0.0           0.0          64165.38               17662.06   
2020           0.0           0.0          64361.84               19777.96   
2021           0.0           0.0          72046.48               19863.39   

      resultat_net  
date                
2018      10735.20  
2019      13584.73  
2020      15247.05  
2021      15240.81
```
#

and the code with the error is:

#
```py
# Load the financial data for pharmaceutical-sector companies
df = pd.read_csv(r'data//income_statement.csv')

# Select the columns to include in the analysis
cols = ['entreprise', 'date', 'chiffre_affaires', 'resultat_operationnel', 'resultat_net']
df = df[cols]
df['date'] = pd.to_datetime(df['date'], format='%Y', errors='ignore')

# Split the data into the years before covid-19 (2019 and earlier) and the years including covid-19 (2020 and later)
df_avant_covid = df[df['date'].dt.year < 2020]
df_apres_covid = df[df['date'].dt.year >= 2020]

# Compute the yearly mean revenue and operating income, before and after covid-19
df_avant_covid = df_avant_covid.groupby([df_avant_covid['date'].dt.year]).mean()
df_apres_covid = df_apres_covid.groupby([df_apres_covid['date'].dt.year]).mean()

# Compute the change in revenue and operating income between the pre- and post-covid-19 periods
df_variation = pd.DataFrame()
df_variation['variation_ca'] = df_apres_covid['chiffre_affaires'] / df_avant_covid['chiffre_affaires'] - 1
df_variation['variation_op'] = df_apres_covid['resultat_operationnel'] / df_avant_covid['resultat_operationnel'] - 1

data = [df_variation, df_avant_covid, df_apres_covid]
df4 = pd.concat(data)
df4 = df4.iloc[4:, :]
df4 = df4.fillna(0)

# Select the rows to use for the division
df4 = df4.set_index("date")
row_2018_2019 = df4.loc[['2018', '2019'], 'chiffre_affaires']
row_2021_2020 = df4.loc[['2021', '2020'], 'chiffre_affaires']

# Divide the selected rows
df4["variation_ca"] = row_2021_2020 / row_2018_2019

df4
```
#

brings me this error idk why: ```py

KeyError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_12880/950567047.py in <module>
26
27 # Select the rows to use for the division
---> 28 df4 = df4.set_index("date")
29 row_2018_2019 = df4.loc[['2018', '2019'], 'chiffre_affaires']
30 row_2021_2020 = df4.loc[['2021', '2020'], 'chiffre_affaires']

c:\Users\PEGON\anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper

c:\Users\PEGON\anaconda3\lib\site-packages\pandas\core\frame.py in set_index(self, keys, drop, append, inplace, verify_integrity)
5449
5450 if missing:
-> 5451 raise KeyError(f"None of {missing} are in the columns")
5452
5453 if inplace:

KeyError: "None of ['date'] are in the columns"```
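
A likely reading of this KeyError: after `groupby(...dt.year)`, `date` is the name of the index, not a column, so `set_index("date")` has nothing to find. The index labels are also integers, so `'2018'` as a string would not match either. A sketch under those assumptions, using the revenue figures from the table posted above:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2018", "2019", "2020", "2021"], format="%Y"),
    "chiffre_affaires": [60829.73, 64165.38, 64361.84, 72046.48],
})

# Grouping by the year turns it into the index (named 'date').
yearly = df.groupby(df["date"].dt.year)["chiffre_affaires"].mean()

print(yearly.index.name)                    # the index carries the name
print("date" in yearly.to_frame().columns)  # so it is not a column

# Select by integer labels and compare the two period means.
before = yearly.loc[[2018, 2019]].mean()
after = yearly.loc[[2020, 2021]].mean()
variation_ca = after / before - 1
print(round(variation_ca, 4))
```

If a real `date` column is needed later, `yearly.reset_index()` moves the index back into the columns.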

verbal venture
#

how much data is usually needed for very effective CNNs? and does the accuracy of an algo fall down to just how much data you have, or the type of algorithms you're creating?

hasty mountain
#

I've read that Relativistic Discriminators tend to perform better in GANs, and now that I've implemented it, my discriminator simply won't learn anything...nice.

hasty mountain
verbal venture
#

would the same algorithm perform way better on 1M images?

hasty mountain
#

1M images = more features to learn, more ways to generalize

#

A person who only learned around 100 words will have way more difficulty in communicating and developing social skills than someone who has more than 25,000 words in his vocabulary

hasty mountain
silk knot
#

Does anyone know how to only get 1 Line as the output for this?

s1 = """
Apples Oranges Grapes
White Black Red Green"""
s2 = "Apples"

print(s1[s1.index(s2) + len(s2):])

Output:

Oranges Grapes
White Black Red Green

I want the output to be Oranges Grapes (Which is only the one line after the word)

#

nvm i got it
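
For anyone landing here with the same question, one way to do it (not necessarily what was used above) is to cut the remainder at the first newline:

```python
s1 = """
Apples Oranges Grapes
White Black Red Green"""
s2 = "Apples"

# Everything after the first occurrence of the word...
rest = s1[s1.index(s2) + len(s2):]
# ...cut at the first newline, with surrounding spaces removed.
first_line = rest.split("\n", 1)[0].strip()

print(first_line)  # Oranges Grapes
```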

dull carbon
#

How i can solve this

fading zealot
#

any data scientist here

#

@charred light can you suggest the dataset

#

the dataset i have used the accuracy is -0.44

#

are you there ?

steady basalt
#

Ur predictions in a black hole or smtn

fading zealot
#

ok

#

@steady basalt can you help with the datasets

#

these are the three datasets

#

I am finding it difficult to find the features

#

EV dataset has only 10 rows

#

atleast need a big dataset to calculate r2

#

rsquare

keen notch
#

does anyone know why this error

#
```py
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

def acc_func(t, vals):
    x, y, vx,vy = vals
    acc_x = -2.0 * y**2 * x * (1 - (x**2)) * np.exp(- (x**2 + y**2))
    acc_y = -2.0 * x**2 * y * (1 - (y**2)) * np.exp(- (x**2 + y**2))
    return np.array([vx, vy,acc_x,acc_y],dtype=object)


def trajectory(impactpar, speed):
    maxtime = 10.0 / speed
    t = np.linspace(0, maxtime, 300)
    x0 = impactpar
    y0 = -2.0
    vx0 = 0.0
    vy0 = speed
    vals =  np.array([x0, y0,vx0,vy0],dtype=object)
    acc = solve_ivp(acc_func, (0.,300.), vals, t_eval=t)
    x = acc.y[0]
    y = acc.y[1]
    return x, y

x, y = trajectory(0.15, 0.1)
# Plot the trajectory
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.show()

# # Solution to part (b)
def scatterangles(allb, speed):
    angles = []
    for impactpar in allb:
        x, y = trajectory(impactpar, speed)
        vx = x[-1]
        vy = y[-1]
        angle = np.arctan2(vy, vx)
        angles.append(angle)
    return angles
allb = np.arange(-2, 2, 0.001)
angles = scatterangles(allb, 0.1)
plt.plot(allb, angles)
plt.xlabel("Impact parameter")
plt.ylabel("Scatter angle")
plt.show()
```
dusky finch
keen notch
#

I'm happy to provide the question to make more sense

keen notch
#

my graph looks like this👀 😂

#

sorry for the spam but looks better now just the range error still:(

#

I think it might be i need to use solve_ivp
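
The exact error is not shown, so the following is only a guess at what might be going on. `solve_ivp` requires every value in `t_eval` to lie inside `t_span`, so a hard-coded span of `(0., 300.)` breaks as soon as `10.0 / speed` exceeds 300. The posted `scatterangles` also takes the last positions rather than the last velocities. A sketch that keeps the same equations of motion but ties the span to `maxtime`, uses plain float arrays, and returns the velocities as well (function names are chosen for this example):

```python
import numpy as np
from scipy.integrate import solve_ivp

def deriv(t, state):
    # state = [x, y, vx, vy]; same accelerations as in the posted code
    x, y, vx, vy = state
    damp = np.exp(-(x**2 + y**2))
    acc_x = -2.0 * y**2 * x * (1 - x**2) * damp
    acc_y = -2.0 * x**2 * y * (1 - y**2) * damp
    return [vx, vy, acc_x, acc_y]

def trajectory(impactpar, speed, n_points=300):
    maxtime = 10.0 / speed
    t = np.linspace(0.0, maxtime, n_points)
    state0 = [impactpar, -2.0, 0.0, speed]
    # t_span matches the range of t_eval, whatever the speed is
    sol = solve_ivp(deriv, (0.0, maxtime), state0, t_eval=t, rtol=1e-8, atol=1e-10)
    return sol.y  # rows: x, y, vx, vy

def scatter_angle(impactpar, speed):
    x, y, vx, vy = trajectory(impactpar, speed)
    return np.arctan2(vy[-1], vx[-1])  # direction of the final velocity

print(trajectory(0.15, 0.1).shape)
print(scatter_angle(0.0, 0.1))
```

With an impact parameter of exactly 0 both accelerations vanish, so the particle should go straight up and the angle should come out as pi/2, which makes a convenient sanity check.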

arctic wedgeBOT
#

Hey @keen notch!

You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.

unique ridge
#

Hey, I think this is the perfect place to ask my question, so let's give it a shot. I am building a predictive model based on temperature data from my greenhouse. I have data spanning 10 months, from January till October. Every hour, the sensors register attributes such as: AVG temperature inside and outside the greenhouse, AVG relative humidity, ABS humidity and AVG moisture deficit. I have plotted my data and I see some points that are quite high. Because of this, I want to detect outliers and remove them, but I don't know how best to approach it in this use case, mainly due to the fact that the data represent real stuff and all the attributes depend a bit on each other. Any advice on how I can approach this the best way?

runic patrol
#

hey guys, i started learning data science in sep 2022. so far i'm done with python, pandas, numpy, gui, databases, graphs and charts. data science is a very vast topic, so i'm thinking of learning up to machine learning, which includes statistics, EDA & feature engineering, ML, PCA, NLP and time series analysis, and after that doing end-to-end projects on ML. Will this be enough to get an entry level job in the ML field? i will study data science alongside my current job. my background is civil engineering

opaque bay
#

Hi

#

I need source code for an Instagram scraper which scrapes an account's followers and gets their emails

opaque bay
#

Web scraping

runic patrol
#

ohk

unique ridge
steady basalt
fading zealot
#

yes

#

can you just recommend me

#

which features i can take into consideration and do feature engineering

steady basalt
#

What is useful

#

And also, you have 10 rows, this isn’t good enough unless you’re joining datasets and have a foreign key

fading zealot
#

ya

#

what if we take co2 dataset and air pollution dataset

#

what features would you suggest

steady basalt
#

What features are there how many

#

Is this how much co2 a car produces?

fading zealot
steady basalt
#

Given car engine, model, price etc?

fading zealot
#

no

#

how the increase in co2 and no2 causing global warming

steady basalt
#

your target variable is global temperature?

#

What is each sampler

#

A sample of what, factory output?

#

Where do you get this data, what is the actual data im on mobile

fading zealot
#

oh

#

we have 4 data sets

steady basalt
#

U can just say what is your sample on the data set u working on

fading zealot
#
  1. global temperature
steady basalt
#

Ok predicting global temp based on what

#

It’s time series?

fading zealot
#

i want to calculate r square and for that I need huge data

#

in all the datasets i have

steady basalt
#

you need to properly describe your data

fading zealot
#

global temperature data set doesnt have countries

#

wait let me send you a pic

steady basalt
#

Just one row

fading zealot
steady basalt
#

Globe is what ur predicting?

fading zealot
#

no

#

i have to use two datasets it is mandatory

steady basalt
#

Ok I see

fading zealot
#

to predict my hypothesis

steady basalt
#

Global temp is time series monthly

fading zealot
#

i tried but i got -0.44 accuracy

steady basalt
#

Time series regression isn’t accuracy based

#

It’s error

#

Did you take the square of the error what is ur metric
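
For context, a score of -0.44 is plausibly an R² value rather than an accuracy (scikit-learn regressors report R² from `.score`). R² is 1 minus the ratio of the squared prediction error to the squared deviation from the mean, so it goes negative whenever the model does worse than always predicting the mean. A plain-Python sketch with toy numbers:

```python
def r_squared(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model error
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # baseline error
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)                      # 1.0
mean_only = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])      # 0.0
reversed_fit = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])   # -3.0

print(perfect, mean_only, reversed_fit)
```

A negative value therefore points to features that carry little signal for the target (or a mismatch between train and test data), not to a shortage of rows as such.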

fading zealot
#

i just want you to recommend what features i can use from two datasets and then will apply linear regression

steady basalt
#

U need to understand better what predictions are and how you measure them

fading zealot
#

i dont wanna use global temperature

steady basalt
#

There’s no point until you understand how we measure regression

#

It’s more important than features

fading zealot
#

ok

unique ridge
steady basalt
#

If you must merge data sets maybe you can show over time the largest co2 producers increasing co2

fading zealot
#

if we will take co2 dataset and air pollution

#

which features would be perfect

steady basalt
#

Maybe just a box plot

fading zealot
#

merge co2 and air pollution?

unique ridge
steady basalt
#

What do you set out to predict

#

Decide that first

steady basalt
keen notch
#

does anyone know how to use solve_ivp

steady basalt
#

If there’s a distribution of points and a single point miles out

keen notch
fading zealot
#

yes I am unable to get a good dataset

unique ridge
steady basalt
#

It matters then whether you think it’s relevant or not to your model

unique ridge
fading zealot
#

i did but didnt get relevant dataset

unique ridge
#

You can maybe combine datasets to get the wanted result.

steady basalt
#

Shits random and has been for millions of years

#

All u can do is say line go up because humans

fading zealot
#

lets say we can use co2 data set and air pollution dataset

#

can we say with the increase in temperature of co2 an no2

#

it is causing global warming

unique ridge
#

Supermoon, want to see some charts?

steady basalt
#

Controversial question

steady basalt
fading zealot
steady basalt
#

Not something I’d personally work on

fading zealot
#

so features like country year can be taken into consideration

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @unique ridge until <t:1672409033:f> (10 minutes) (reason: attachments rule: sent 7 attachments in 10s).

The <@&831776746206265384> have been alerted for review.

steady basalt
#

Ouch

brisk vapor
#

!unmute 360683248151429131

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: pardoned infraction mute for @unique ridge.

steady basalt
unique ridge
#

🤣

brisk vapor
#

Could you please upload them to some image host instead perhaps? Otherwise our bot will not like it

unique ridge
#

I can send them 1 for 1?

brisk vapor
#

sure but it might take a little bit of time, since the bot looks at 10s window

steady basalt
#

Outliers where?

unique ridge
#

Hold on, this is the general chart

steady basalt
#

Might not want to delete anything based on that

unique ridge
#

I can send boxplots too? 🤣

steady basalt
#

More useful than time series

unique ridge
#

What do you mean?

steady basalt
#

How can u decide to delete points based on the series

#

If it can be legit and caused by something at that time

#

I’d imagine none of those points will appear as outliers on ur box plot

unique ridge
#

Since it is real data. It is all legit indeed, but the sensors can do a fuckywucky ofcourse.

steady basalt
#

If you believe this is the case try linear interpolation ?

#

Doesn’t look too bad but I don’t know much about greenhouses
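
A rough sketch of what "flag the glitch, then interpolate" could look like in pandas. The readings, the window size and the 5 °C threshold are all invented for illustration; real thresholds would have to come from knowing the greenhouse:

```python
import pandas as pd

# Invented hourly greenhouse temperatures (°C) with one suspicious spike.
temps = pd.Series([20.1, 20.4, 20.3, 45.0, 20.6, 20.8, 21.0])

# Compare each reading with the median of its neighbourhood.
local_median = temps.rolling(window=5, center=True, min_periods=1).median()
is_spike = (temps - local_median).abs() > 5.0  # hypothetical threshold in °C

# Blank out flagged readings and fill them by linear interpolation.
cleaned = temps.mask(is_spike).interpolate(method="linear")

print(is_spike.tolist())
print(cleaned.round(2).tolist())
```

Because the comparison is local, a warm afternoon where all readings rise together is left alone; only a value that jumps away from its neighbours gets replaced.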

unique ridge
#

Since I don't know whether the sensor did an oopsie, I want to find possible outliers. Well, just keep in mind that if it is warm outside, the temperature and humidity in the greenhouse increase as well. Only if the variables change too fast, preventive actions get taken.

steady basalt
#

What are you modelling for

unique ridge
#

I basically want to predict the temperature in the greenhouse based on the other 5 attributes.

steady basalt
#

Incase ur thermostat breaks?

unique ridge
#

Iam following crisp-dm, so i am now at prepping.

#

No, so you can predict the upcoming temperature 😛

steady basalt
#

Interesting

unique ridge
#

Its a small school project so nothing too serious, but if i do something, i want it done right (or atleast the best i can do)

#

Would you suggest i dont need outliers removal?

steady basalt
#

I wouldn’t…

#

But you know glasshouses better than me

#

And it could be a good idea if ur usecase aligns with it

#

Maybe ur recorders weren’t faulty

unique ridge
#

What i maybe can do is select from each attribute 10 highest values and have a look if the values from other attributes match with the others?

steady basalt
#

Is it possible to test ur model now without doing so see how close it is

#

Predicting the next day 15 readings

#

Say

#

Maybe give it to a bi lstm and then predict windows

unique ridge
#

Yeah i can iterate multiple times through it. One with 'outlier removal' and 1 without.

steady basalt
#

Try and see what u get after all pre processing

#

At predicting next 5 readings

#

Look at absolute error maybe to get an idea

#

Are you using all readings at equal intervals to predict temp

#

If only you have sunlight too

raw tulip
#

hey!

slate hollow
#

the existence of a forward feeding neural network implies the existence of a mysterious, unseen backward feeding neural network

hasty mountain
#

Does initializing my weights with a very low standard deviation(2e-5) might lead to vanishing gradients, or was this just a delirium from ChatGPT?
I've tested it and it seemed to actually make a difference, but I was sleepy back then and I might've changed my residual scaling factor...idk...I don't remember... pithink
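
A toy NumPy experiment that may help reason about this. It pushes random inputs through a few plain linear + ReLU layers (no residual connections, no normalization, so not the actual network) and compares the average activation size for a standard deviation of 2e-5 against He-style initialization. The function name and layer sizes are arbitrary choices for this sketch:

```python
import numpy as np

def forward_scale(weight_std, n_layers=5, width=256, seed=0):
    """Mean absolute activation after n_layers of linear + ReLU."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(64, width))
    for _ in range(n_layers):
        w = rng.normal(scale=weight_std, size=(width, width))
        h = np.maximum(h @ w, 0.0)  # linear layer followed by ReLU
    return float(np.abs(h).mean())

tiny = forward_scale(2e-5)              # the std from the question
he = forward_scale(np.sqrt(2.0 / 256))  # He initialization for width 256

print(f"std=2e-5: {tiny:.3e}   He init: {he:.3e}")
```

In a plain stack like this the signal shrinks by orders of magnitude per layer, and backpropagated gradients pass through the same small weight matrices, so very small gradients in the early layers are at least plausible. Residual connections and normalization layers change the picture considerably.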

raw tulip
#

I was hoping to see more discussions about GPT and AI here, maybe anyone could provide a recommendation to other channel?

serene scaffold
#

The message before yours even mentions chatgpt

digital hazel
#

When bringing my tensorflow model code into an API, do I have to save the model and load it in the API file? I understand that saving the model saves its weights and accuracy, but assuming I keep the same parameters and code in the API code, shouldn't the model be around the same accuracy after fitting it?

#

In other words, why can't I just copy and paste the same code that trains/fits the model into the API code file, as it only runs one time when the server is setting up?

#

Nvm realize now that the model takes long to train so training it one time and saving it saves you a lot of time.

serene scaffold
digital hazel
#

Gotcha thank you

#

Idk why I didnt realize that

serene scaffold
digital hazel
#

No I understand no worries. Just took me one google search of why do I need to save models🤦‍♂️

rancid sorrel
#

anyone got a good guide for how to hook up an ANN to input/output that's dynamic?

wintry geode
#

Hello, I am using a module called chatterbot and I am trying to see what the confidence of the chatbot’s response is, does anyone know how to?

serene scaffold
hasty mountain
#

Curious...I thought pretraining my Generator with a L1 Loss would mess the adversarial training with the Discriminator, but it seems to actually do no harm at all... so far, at least
EDIT: during adversarial training, the Discriminator simply messes up the generator pretrained weights py_guido

#

I'm almost becoming a GAN researcher...too bad I still couldn't get any decent result

steady basalt
rancid sorrel
#

we get a lot of guides about how to parse data into an ANN, we dont get many guides on how to hook them up as say a control system

#

like for example a self driving car

steady basalt
#

Oh right

#

Well that’s just engineering

rancid sorrel
#

well the coding part

steady basalt
#

Once you have the model you can deploy it

#

For instance, you can use the model you’ve trained in an app

rancid sorrel
#

like for example
input A ->>> ANN ->>> output B

hasty mountain
rancid sorrel
#

i want more examples of parts A and B

steady basalt
#

Data will come from somewhere, for me it’s cloud based. For your self driving car, I suppose an app will stream images

#

That’s probably very complex software

#

So I can only explain on a more basic level

#

You can do a batch or a real time app

#

Once you built a model it’s all app building, and retraining to account for drift

rancid sorrel
#

my dissertation is ML with Cybersecurity. honestly all the stuff so far is about importing static data

steady basalt
#

There is some data engineering and possibly ml ops you will need to learn then

rancid sorrel
steady basalt
#

Data engineering will be making the data get to model to retrain and ml ops you will need to work out how to deploy it to produce results

#

I mean I started on the job with Azure pipelines

#

What is your objective?

#

Don’t do something random as it’s harder

rancid sorrel
#

honestly ive got like 90% of a comp sci degree in me so i can code whatever needed,

steady basalt
#

In which case

rancid sorrel
#

basically making a ML powered bot

steady basalt
#

You can try simulate a data stream

#

So generate shit with a script
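
A bare-bones sketch of the "input A -> model -> output B" loop with a simulated stream. Everything here is a placeholder: the event fields, the scoring rule standing in for a trained model, and the block/allow action are invented to show the wiring only:

```python
import random
from typing import Iterator

def simulated_stream(n_events: int, seed: int = 0) -> Iterator[dict]:
    """Part A: a script that generates fake events instead of real traffic."""
    rng = random.Random(seed)
    for i in range(n_events):
        yield {
            "id": i,
            "packet_size": rng.randint(40, 1500),
            "failed_logins": rng.randint(0, 5),
        }

def model_predict(event: dict) -> float:
    """Stand-in for a trained model: returns a made-up risk score in [0, 1]."""
    return min(1.0, event["failed_logins"] / 5)

def act_on(score: float) -> str:
    """Part B: turn the model output into an action."""
    return "block" if score >= 0.8 else "allow"

decisions = []
for event in simulated_stream(10):
    decision = act_on(model_predict(event))
    decisions.append(decision)
    print(event["id"], decision)
```

Swapping the generator for a real source (a socket, a message queue, a log tail) and `model_predict` for a loaded model leaves the loop itself unchanged.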

#

Can you build apps