#data-science-and-ml
ye
ye
?
Wow
not everything
full stacked? wat that mean for ai?
Ye
no like. wdym full stacked. like just a working neural network?
Kinda.
then ye i can make any neural net i want and train it
U can actually build stuff and become rich perhaps
neural nets r not that complicated
Like ai nowadays is going in a big boom
its gonna plateau soon
Rly?
ye
the models havent changed mathematically for decades. the experts r going in to big companies which are run by old ppl so we are having no technical advancements for it.
It actually makes sense
Sad
Who cares. Python is a great language though ain't it?
ai is not intelligent. the biggest models dont even think or reason, they just predict the next word in a sentence based on probability
ye u dont need to predict anything. its just how limits work. ai will plateau soon because the training pool will be less effective over time and the predicting ability will stagnate
o-0
ye so the next step is either big ai companies financing research into new models or throwing endless money for very little gain
Many people r learning for the sole reason of making better ai models. Like my friends r learning python for the same reason
Yo just wait a min I will come back
ye they will all unanimously fall into the trap of doing exactly what they were taught to do, then if they make it to a company the company will tell them to copy the big companies and they will get stuck making another chatgpt. which is just a flawed concept
Its just like that
When people in 2017 thought ai was gonna plateau. 💥 Chatgpt claude and gemini
Its just like that. Limits can come in some fields. Who knows perhaps we get an AI that thinks for itself instead of predictions
U can't rly predict
hundreds of millions of dollars and the ais cant even make reliable video slop yet...
I understand limits have occurred in some fields, but some other fields still r in the beginning phase
so idk what you believe, but its very obvious that these ais are going nowhere
k bye
Gotta have breakfast
I am back
wb
so i made this ai
from random import choice, choices, randint, random, randrange

populationamount = 100
mutationrate = 0.15
goalnum = randint(1, 1000) / 10
print(f"Goal: {goalnum}")
# "*" restored here: the mutation weights below have 14 entries and badpats mentions "*"
chars = "0123456789+-*/"
operators = "+-*/"
randpopamount = 10
addcharrate = 0.05
delcharrate = 0.05

def randexpr(length = 12):
    return "".join(choice(chars) for x in range(length))

def safeeval(expr):
    # invalid or out-of-range expressions get an infinite value (worst fitness)
    try:
        val = eval(expr)
        if val > 1e9 or val < 1e-9:
            return float("inf")
        else:
            return val
    except Exception:
        return float("inf")

def mutateexpr(expr):
    expr = list(expr)
    newexpr = []
    for char in expr:
        if random() < mutationrate:
            # digits are twice as likely as operators
            newexpr.append(choices(chars, weights = [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4], k = 1)[0])
        else:
            newexpr.append(char)
    if random() < addcharrate and len(expr) < 20:
        pos = randrange(len(newexpr))
        newexpr.insert(pos, choice(chars))
    if random() < delcharrate and len(expr) > 4:
        pos = randrange(len(newexpr))
        del newexpr[pos]
    return "".join(newexpr)

def crossover(a, b):
    cut = randrange(len(a))
    return a[:cut] + b[cut:]

def tourney(pop, k = 5):
    group = [choice(pop) for x in range(k)]
    return min(group, key = getfitness)

fitnesscache = {}
badpats = [
    ([f"{i}/{i}" for i in range(1, 10)], 50),
    ([str(goalnum + i) for i in range(-9, 9)], 100),
    (["0+", "+0", "-0", "0/", "0*", "0", "0*"], 100),
    (["1*", "*1", "/1"], 75),
    (["++", "+-", "-+", "/+", "/-"], 50),
    (["//"], 25)
]

def getfitness(expr):
    if expr in fitnesscache:
        return fitnesscache[expr]
    diff = abs(safeeval(expr) - goalnum)
    # punishments: each group of bad patterns is penalised at most once
    for outerpat in badpats:
        for innerpat in outerpat[0]:
            if innerpat in expr:
                diff += outerpat[1]
                break
    fitnesscache[expr] = diff
    return diff

def properformat(expr):
    newexpr = ""
    for char in expr:
        if char in operators:
            newexpr += f" {char} "
        else:
            newexpr += char
    return newexpr

population = [randexpr() for _ in range(populationamount)]
generation = 0
while True:
    generation += 1
    best = min(population, key = getfitness)
    if safeeval(best) == goalnum:
        print(f"Achieved target number on generation {generation}")
        print(f"{properformat(best)} = {safeeval(best)}")
        break
    else:
        print(f"Best of generation {generation}: {best} = {safeeval(best)}")
    newpopulation = []
    for _ in range(populationamount - randpopamount):
        parent1 = tourney(population)
        parent2 = tourney(population)
        child = mutateexpr(crossover(parent1, parent2))
        newpopulation.append(child)
    for _ in range(randpopamount):
        newpopulation.append(randexpr())
    population = newpopulation
ugh dc is formatting my stuff ;-;
thats cool bacon
but ya its not technically an ai model, but a random guessing program
its still cool
btw u have commited a syntax error at line 86 i think. in is invalid syntax
Hi, I'm trying to make a predicting model which uses the opening price of a stock and returns the volume (how many people buy it.)
This image above is the relation between Opening price (X-axis) and volume (Y-Axis). I was wondering what regression model I should use in order to get the most accurate result.
seems non-correlated, at least below ~500
I'd say try the log transform on your volume data. It's the easiest thing to do and might just work. If that makes the relationship linear, run a linear regression and see what you get
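A minimal sketch of that suggestion, in case it helps. The data here is synthetic (the real open/volume columns aren't in the chat), and the names `open_price`, `volume` and `predict_volume` are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: volume decays roughly exponentially with price.
open_price = rng.uniform(10, 500, size=200)
volume = np.exp(12 - 0.005 * open_price + rng.normal(0, 0.3, size=200))

# Log-transform the target, then fit an ordinary least-squares line.
log_volume = np.log(volume)
slope, intercept = np.polyfit(open_price, log_volume, deg=1)

# R^2 on the log scale shows whether the transform made the relation linear.
pred = slope * open_price + intercept
ss_res = np.sum((log_volume - pred) ** 2)
ss_tot = np.sum((log_volume - log_volume.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

def predict_volume(price):
    """Map a prediction on the log scale back to the original volume scale."""
    return np.exp(slope * price + intercept)
```

If `r2` stays near zero on the real data, the transform didn't help and the "non-correlated" reading is probably the right one.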
Quick question, I'm trying to interpret my results for cosine similarity. I know that it is a measure how similar e.g. a document is to other documents. Would most similar pairs be values that are greater than 0.5 and values that are 1?
A good score is completely relative to what you're analyzing. In some datasets, especially in text analysis where word frequencies can't be negative, the entire scale is just 0 to 1. In that case, 0.5 is halfway to being identical, which might be significant or might be total garbage, depending on the context
A value of 1 is the most similar
hoi
Anything less than that depends entirely on your specific use case and data
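To make the 0-to-1 scale concrete, here is a small sketch with made-up term-count vectors (the documents and the 4-word vocabulary are assumptions, not anyone's real data):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term counts over a 4-word vocabulary (never negative).
doc_a = [2, 1, 0, 0]
doc_b = [4, 2, 0, 0]  # same word proportions as doc_a, just a longer document
doc_c = [0, 0, 3, 1]  # shares no words with doc_a
doc_d = [1, 1, 1, 1]  # shares some words with doc_a

sim_ab = cosine_similarity(doc_a, doc_b)  # same direction, so ~1.0
sim_ac = cosine_similarity(doc_a, doc_c)  # orthogonal, so 0.0
sim_ad = cosine_similarity(doc_a, doc_d)  # somewhere in between
```

Because counts can't be negative, nothing here can go below 0, so whether an in-between score counts as "similar" is something to calibrate against pairs you already know are related or unrelated.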
noice
Does anyone have a neural mat that can generate useful structures?
Is neural mat not a typo?
It was a typo and had to keep my eyes on the road for a second or two
please ask again after driving
Setting aside how dangerous that is, there's no way we can have a productive conversation where you will learn something while you're driving
I turned onto a side street to ask the question
doesn't matter, please drive safely and will talk about that later
There's no way we can have a productive conversation ... While you're driving
i believe openai open sources weights
Has anyone worked with langgraph building an agent.
why do you ask?
I have question.
for AI learning, is python essential?
you can also use other languages to do Machine Learning
but python also okay?
yes
no you could use C/C++, java but python is easy syntax
yes of course
Can anyone send a whole roadmap of AI? I mean the full path from scikit-learn to TensorFlow to the latest LLM concepts
Try finding some course on Coursera or Udemy. They can be really helpful. I suggest Andrew ng's full ai with python course in Coursera
Is there a book for py torch for beginners?
Im trying to learn more about BERT. I wanna ask which model should i use for generating BERT embeddings for sentences all-MiniLM-L6-v2 or all-mpnet-base-v2 or something else?
In my opinion you should use all-mpnet-base-v2 if embedding quality matters most: it is the stronger of the two, but larger and slower. all-MiniLM-L6-v2 is a highly recommended choice when you want speed and a small memory footprint with still-good accuracy. bge-en-icl is noted as having top performance on various benchmarks, but with a much larger size.
Building LLMs with pytorch: step by step by Anand Trivedi is a good book for pytorch. You could find more books on Amazon
Oh ok thanks
Ollama with RAG is still non-deterministic even with temperature set to 0 right?
having some fun with the airline passengers prediction data set, and I got this with a naive LSTM implementation in pytorch
1 to 1 prediction, 0.67 train test split, 2000 epochs.
However, a single MinMax normalization line results in this
same hyperparameters
Now, the thing that I am finding interesting is this: the green line, which represents the test set, is nearly identical to the original dataset. The MinMax range was chosen to be (0, 0.1)
It is not as good if I normalize it as (0, 1), which is commonly done. So the absolute magnitude of the training errors matter in the end. The error is directly dependent on the absolute scale of the y-axis
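One thing that can be checked directly: MSE is not scale-invariant, so the same relative miss gives a loss that shrinks with the square of the target range. A rough sketch with made-up numbers (not the airline data) and a hypothetical helper `scaled_mse`:

```python
import numpy as np

# Hypothetical series and an imperfect "prediction" of it, in original units.
y_true = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0])
y_pred = y_true + np.array([3.0, -2.0, 4.0, -1.0, 2.0, -3.0, 1.0])

def scaled_mse(lo, hi):
    """MSE after min-max scaling both series with the parameters of y_true."""
    span = y_true.max() - y_true.min()
    t = lo + (y_true - y_true.min()) * (hi - lo) / span
    p = lo + (y_pred - y_true.min()) * (hi - lo) / span
    return float(np.mean((t - p) ** 2))

mse_unit = scaled_mse(0.0, 1.0)   # the usual (0, 1) range
mse_small = scaled_mse(0.0, 0.1)  # the (0, 0.1) range used above
ratio = mse_unit / mse_small      # range shrinks 10x, squared error shrinks ~100x
```

This doesn't by itself explain why the (0, 0.1) fit looks better, but smaller targets mean smaller losses and gradients for the same learning rate, so the two runs aren't really training with "the same hyperparameters" in effect.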
and yes, did this with Jupyter, the hated notebook.
What if you could turn ML artifacts into proof-carrying objects?
Like lets say, for any given dataset, model or run a small, deterministic and verifiable record gets produced that says, this is what i am, this is how i was computed and here how you can independently verify that I'm not lying.
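A very rough sketch of the simplest version of that idea, using plain content hashing. This only gives tamper-evidence, not a proof of how something was computed, and every name here (`make_record`, `verify`, the record fields) is made up for illustration:

```python
import hashlib
import json

def fingerprint(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def make_record(data: bytes, params: dict) -> dict:
    """Build a small deterministic record for a dataset plus run parameters."""
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    params_json = json.dumps(params, sort_keys=True, separators=(",", ":"))
    record = {
        "data_sha256": fingerprint(data),
        "params_sha256": fingerprint(params_json.encode("utf-8")),
    }
    record_json = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["record_sha256"] = fingerprint(record_json.encode("utf-8"))
    return record

def verify(record: dict, data: bytes, params: dict) -> bool:
    """Anyone holding the data and params can recompute and compare."""
    return make_record(data, params) == record

dataset = b"a,b\n1,2\n3,4\n"
run_params = {"lr": 0.01, "epochs": 10}
rec = make_record(dataset, run_params)
```

The hard part is the "how I was computed" claim: that needs a reproducible pipeline, which a hash of the inputs alone doesn't give you.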
Running a model on my Mac even with metal takes an extremely long time (13 sec or more), but with the webgpu demo on chrome, exact model it has sub-second latency… anyone know what might be causing this?
It shouldn’t take this long as it’s meant to be very fast even on iOS
this is just a silly question but whats happening here? why are those two not interleaving
what do you mean by not interleaving?
make_moons generates a dataset for classification, so what you see are the two classes. since noise=0, the two classes are clearly separated
Can someone please help me with my post in #1035199133436354600 ???
67
Hmm
So I'm interpolating values for NaN's and I'm not sure if this instructor is just saying, "Hey, you can deal with NaNs this way!" or "Hey, you can synthesize data for training this way!" Is it unheard of to train on available data to generate missing values to then train on?
not unheard of, but there are a bunch of downsides like
- can reinforce biases
- increase the risk of data drift
- many real datasets are made by concatenating different datasets ; the distribution in each of them would be different, so you may need to train a model per data source, or give up if a given source does not contains that field for any records at all
oftentimes it's better to just let the model figure it out instead of layering a model on top of another model
man, the precision that make up stats still perplexes me sometimes... I don't have a formal stats education and it would be so easy to look at two distributions and just say "eh, close enough"
By interpolation do you actually mean imputation?
I feel that using the word "interpolation" leads to a mental dead end. If you start thinking of filling gaps in the data as "imputation" you'll find a rich and extremely challenging literature on this theme
Because, for example, preserving the statistical properties of the data set is important, and this isn't a solved problem, and is an area of active research. For a time series, for instance, there are techniques that consist of identifying distributions characteristic to the region around the gap, and then taking a random sample.
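A toy sketch of the "sample from the region around the gap" idea for a time series. It is a simplification of what the literature does, and the function name, window size and example series are all assumptions:

```python
import numpy as np

def impute_from_neighbourhood(series, window=5, seed=0):
    """Fill NaNs by sampling from observed values within `window` steps of each gap.

    Assumes the series contains at least one observed value.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(series, dtype=float)
    filled = values.copy()
    observed = ~np.isnan(values)
    for i in np.flatnonzero(~observed):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        local = values[lo:hi][observed[lo:hi]]
        # Fall back to every observed value if the neighbourhood is empty.
        pool = local if local.size else values[observed]
        filled[i] = rng.choice(pool)
    return filled

series = [1.0, 1.2, np.nan, 1.1, 0.9, np.nan, np.nan, 5.0, 5.2, 5.1]
filled = impute_from_neighbourhood(series, window=2)
```

Sampling only from originally observed values (never from earlier imputations) keeps the filled points inside the local distribution instead of drifting, which is the "preserve the statistical properties" concern in miniature.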
Yeah I'm somewhat familiar with the different methods of filling in the gaps, I was just a bit surprised to have it presented as it was. I wouldn't intuitively consider training a model on synthetic data.
I don't see an issue with synthetic data
The question is really about the properties that synthetic data has
As long as the model that generated that synthetic data has some kind of correspondence with what you're trying to ultimately model
Take the stock market: a good question to ask about the model that generated the synthetic data set is how well it represents the actual market. Probably not very well, but I hope you get the idea I'm trying to present
Ideally, synthetic data inserted into real data should be invisible insofar the final result is concerned
It's just strange to treat a prediction like a feature for the first time I guess ¯_(ツ)_/¯
what do you mean - a prediction like a feature?
Quick question, i'm supposed to be making heatmap of my document similarity results and when I was calculating the similarity results, I used 2000 of the documents. When I was trying to make heatmap e.g. for 50 documents, it was not interpretable whatsoever. I just wanna ask, does the number of documents when making visualisation matters or not? I'm not sure what people do when they need to make visualisations and they are working with a lot of data. I've been using top N documents so far but I don't know why I still feel uncomfortable doing this way.
Anyone have any thoughts on this?
ask GPT first
gng are we serious and also I did
gng ??
gang
lol
they are not "interleaving" because the data set is designed that way. Those sklearn.datasets are meant as training tools, with which to test clustering algorithms. In this example you can see how it might challenge proximity based approaches, like KNN, but not density based such as DBSCAN
My main complaint about those scikit training datasets is that they don't have a good assortment of high dimensional datasets, which would be useful for understanding the dimensionality-dependence of hyperparameters
the usual question with clustering, which is: how much data do you need to resolve a cluster in N dimensions, as a function of N?
There is, but there are different types of "missingness". Depending on the type, you can potentially deal with it algorithmically using imputation methods like KNN imputation or MICE, but those make assumptions about the type of missingness present
Usually with extremely high cardinality like that, thats the easiest way to do it. Otherwise you can try clustering based on semantic similarities, but its computationally expensive with that much data. Are the documents all related?
Running llms and vlms on my Mac are very slow even with metal, even in webgpu it’s getting sub second response times but trying to run it manually it takes over 10 seconds - does anyone know why this might be?
<@&831776746206265384>
This guy is trolling in general as well
!mute 1435917124303589427
:incoming_envelope: :ok_hand: applied timeout to @glad vessel until <t:1762854500:f> (1 hour).
I like how even the last two words got flagged as "AI"
markov decision processes: the sum of a row within the transition probability matrix must be either 1 or 0 correct? when iterating over my transition matrix for a shape(22,22,4) matrix (its about the grid given in the assignment) i get some odd results:
T[s', s, action_space]
sum of T[0, 0, :]: 2.0
sum of T[0, 1, :]: 1.0
sum of T[0, 5, :]: 0.9999999999999999
sum of T[1, 0, :]: 0.9999999999999999
sum of T[1, 1, :]: 1.9999999999999998
sum of T[1, 2, :]: 1.0
sum of T[2, 1, :]: 0.9999999999999999#
the 0.99 is due to machine precision error i assume but i am more worried that i get 2.0 as the prob sum to transition from State 0 to state 0 given to try all a from the action space
the ruleset was this
@buoyant slate
i assume T[0, 0, :] represents the row of probabilities over actions of staying in place for the top-left corner
then it makes sense for it to be 2, while being in 0, 0 you only stay in place if you try to move north or west
then you can choose north or west, succeed and stay in place with probability p each or fail and go to west or north respectively with probability (1-p)/3
or you can go south or east, each has probability p of succeeding and probability (2 - 2p)/3 to instead go north or west and stay in place
giving you (6-6p)/3 + 2p = 2 - 2p + 2p = 2
like i said in math server the values in action rows don't have to make a distribution, the procedure is that you first choose an action, take the row corresponding to your chosen action and current state and this will be the distribution over next states (and hence has to sum to 1)
so a0= p + (1-p)/3 because 1. if agent chooses north he has p chance to actually "go" there (by staying in place) + the (1-p)/3 chance from choosing west but actually going north
which results in also staying
what does this mean now. that there is a 2 chance to stay in s0 when starting from s0 when we sum all actions. thats not really working in my brain, cause i knew probabilities being between 0.0 and 1.0
OHHH but ofc, the agent can only pick one action at a time
so i should rather see it like the agent has a x chance to stay in s0 when starting from s0 when picking that and that action
yeah the point is that these sums over actions are not probabilities, as you said actor can only pick one action at a time and the entries mean "T[s', s, a] is the probability that actor will move to s' GIVEN that he was in s and chose action a"
that means that the probabilities along our first dimension should add up to 1 tho ye?
as in, the individual cells are probabilities but they are probabilities over next states, not over actions
print(grid.T.sum(axis=0))
[[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]
[1. 1. 1. 1.]] works out
yeah i think so
wait holup what does axis=0 in a rank 22 tensor mean?
side note- rank means number of dimensions in a tensor so it's rank 3 tensor
3 dimenison ye mb
you can visualize the tensor as four 22x22 matrices stacked on top of each other forming a "cube" (cuboid to be exact)
no i think thats wrong. we have 22 22x4 matricies stacked on top
that will result in the same cuboid tho
yep, now imagine taking each of these matrices and stacking them in a separate dimension
now fixing one dimension will give you a slice, fixing two of them will give you a "line" in this cuboid, fixing three of them will give you one element
i mean this shape makes sense for me. the first chunk (layer) is an s'. this s' has 22 rows each representing s0-s21. the 4 columns represent the possible actions to take and the vals the % to move from that state s (which is a row) with an action to the s' given by the entire first layer
and the 2nd layer is s1 then again with 22 rows representing the starter states
that is what i have been presenting with the T[s', s, a] already. i struggle to understand your representation in a different shape that should still somehow result in the same representation of the transition matrix
yup that's correct
then summing over axis=0 will eliminate the 0th axis (corresponding to "next state") so it will give you probability to moving to "any of the next states" for a given state-action pair
which is 1, because you always have to move to some next state
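A small check of both sums on a toy grid, in case it helps anyone reading along. This is a 2x2 grid without obstacles rather than the assignment's grid, and the `neighbours` table and action order are assumptions made for the example:

```python
import numpy as np

P_SUCCESS = 0.7              # hypothetical success probability
N_STATES, N_ACTIONS = 4, 4   # 2x2 grid; actions: 0=N, 1=E, 2=S, 3=W

# neighbours[s][a] = state reached when actually moving in direction a from s
# (bumping into the border means staying in place).
# States are numbered row by row:  0 1
#                                  2 3
neighbours = np.array([
    [0, 1, 2, 0],
    [1, 1, 3, 0],
    [0, 3, 2, 2],
    [1, 3, 3, 2],
])

T = np.zeros((N_STATES, N_STATES, N_ACTIONS))  # T[s', s, a]
for s in range(N_STATES):
    for a in range(N_ACTIONS):
        for a_result in range(N_ACTIONS):
            p = P_SUCCESS if a_result == a else (1 - P_SUCCESS) / 3
            T[neighbours[s, a_result], s, a] += p

# Summing over next states: a proper distribution for every (s, a) pair.
over_next_states = T.sum(axis=0)   # shape (4, 4), every entry is 1
# Summing over actions: not a distribution, just four separate probabilities.
over_actions = T[0, 0, :].sum()    # P(stay in corner | each action) adds up to 2
```

The corner state reproduces the 2 from the derivation above: 0.8 for north and west, 0.2 for east and south.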
also i coded my solution with adding this comment:
Assumption: rewards get given to agent upon departing from an state, so arriving from S16 -> S21: +0 and going from S21 ->S16: +10
this is what we got taught in the lesson about MRPs. rewards given upon departing
but i think it doesnt make too much sense for this?
also states move from left to right
this 1 between 10 and 11 is supposed to be empty
rewards are usually given for "taking an action in a given state" - so they correspond to action-state pairs, in your case they are determined by the next state, which is in one-to-one correspondence with state-action pairs so it's okay
doesn't matter if you do it when leaving the state or if when arriving at the state, as long as it depends on the state you moved to
but wouldnt that change mess up our policy? if we give rewards upon departing the agent will not rest in one of the absorbing states, it would likely want to move into the last absorbing state and move out?
also does absorbing state mean that the agent cant take any further action once reaching a so-called "absorbing state"?
what is your algorithm for computing the policy tho
not yet written haha
thats for next week
"An absorbing state is a state that, once entered, cannot be left. A (finite) drunkard's walk is an example of an absorbing Markov chain. Like general Markov chains, there can be continuous-time absorbing Markov chains with an infinite state space." wikipedia
also i'm wrong here, it's not one-to-one correspondence with action-state pairs it's correspondence with outcomes of an action at a given state, it'll depend on algorithm which one matters
so the idea cannot be, for a successful policy/agent, to give rewards upon departing from a state, because absorbing states trap us.
that depends on context but the point of the policy learning is to maximize the rewards
so staying at the state with R=10 is actually the optimal policy here
also algorithms usually do offline learning - they don't change policy during simulating the agent, only after it has finished a "simulation episode"
yes but we can never collect that reward if we use the given assumption: rewards are given upon departing from a state because we are trapped in there
ah, i assumed staying at the state is also considered "departing from it", just to the same state
if that's not the case then yea it seems wrong at the first glance
ye i have no idea anymore, i sent an email to the course assistants asking them to clarify that
anyway. can u have a look over the code i wrote to build my matrix @buoyant slate ?
not right now cuz i have to leave in 10 minutes but i can later
k, i leave it here for ya to check if u got time
we have a big ah class GridWorld which contains all the info and a lot of helper functions:
def __init__(self,
             shape = (5,5),
             prob_success = 0.7,
             obstacle_locs = [(1,1),(2,1),(2,3)],
             absorbing_locs = [(4,0),(4,1),(4,2),(4,3),(4,4)],
             absorbing_rewards = [-10, -10, -10, -10, 10]
             ):
    """
    GridWorld initialisation
    input:
     - shape {tuple} -- GridWorld shape (height, width)
     - prob_success {float} -- probability of success when taking an action, used to fill the transition matrix
     - obstacle_locs {list of tuples} -- location of all obstacles of the grid: [(obstacle 1), (obstacle 2), ...]
     - absorbing_locs {list of tuples} -- location of all absorbing states of the grid: [(state 1), (state 2), ...]
     - absorbing_rewards {list of float} -- reward corresponding to each absorbing state of the grid: [reward 1, reward 2, ...]
    output: /
    """
helper functions:
a neighbour matrix 22x4 (bad name) containing what happens when taking a direction a from a state and where you end up:
and the function i wrote
import numpy as np

def fill_in_transition(self):
    """
    Compute the transition matrix of the grid
    input: /
    output: T {np.array} -- the transition matrix of the grid
    """
    # Empty tensor of dimension S*S*A: 22 stacks of 22 x 4 matrices.
    # each stack is exactly one s_prime. each s_prime contains the 22 states it could originate from. T[s', s, a]
    T = np.zeros((self.state_size, self.state_size, self.action_size))

    a_size = self.action_size
    state_size = self.state_size
    neighbours = self.neighbours
    prob_success = self.prob_success

    # T[s_prime, s, action] --> similar to P(s_prime | s, a)
    # ===> T[21, 17, 2]: the probability of moving from state 17 to state 21 by choosing west
    for a in range(a_size):  # the action the agent picks
        for s in range(state_size):
            for a_result in range(a_size):  # the direction the agent actually ends up moving in
                s_prime = neighbours[s][a_result]
                if a_result == a:  # success case
                    p = prob_success
                else:
                    p = (1 - prob_success) / 3
                # += because several directions can lead to the same s_prime (e.g. multiple walls)
                T[int(s_prime), s, a] += p
    return T
i think helper functions arent used here yet
they are for the reward matrix
this start for anyone reading and trying to help.
what are some free ml bootcamps or coursess?
arent like all harvard courses for free on youtube (at least the lectures idk about the execises)
Help wud be appreciated thanksss
why would you assume harvard courses are good?
there's no reason to assume that those courses are better than a comparable rando Udemy / Coursera course
you kind of have to tailor the course you choose to your goals & skill sets.
I never recommend learning a data science topic in a college degree style, which is bottom up. In other words, before you learn the topic itself, first we must take a few semesters of adjacent courses, etc.
you should do it top-down, which is you focus on the topic itself, whatever it might be, and you pick up the needed knowledge along the way. Tons of options for this, so choosing the right one for you depends on who you are
my experience with online MIT / Harvard / etc courses is that they tend to be bottom-up in how they approach the topic. They are rigorous, demanding, but ultimately a waste of time.
Uh idk, never listened to one
I noticed that aswell. I started my first pandas project knowing nothing worked from there and developed 2 amazing data dashboards about my spotify and netflix data. I also took a course a year back and some foundations are laid but there was so much missing
@buoyant slate got some.time now?
college courses teach you the implementation. they can be considered useless if you will never work on designing ml algorithms
Right. So do that if your goal is to design an ML algorithm.
and it takes a fair amount of mathematical literacy to do that. Reading journal articles should be something you can do on a regular basis
You guys been using polars more or still pandas
Pandas. I don't have a compelling reason to switch.
It's faster and it has declarative syntax. Only thing I can't figure out how to do in polars is the transpose function
The pandas query function is kinda tough for me to follow vs what polars does with filter/select
Pandas has the better transpose function doing it with just T
isn't pandas pretty declarative as well?
speed's not that relevant for most use cases probably
that said... 
Meh I get confused with the brackets a lot. And then sometimes you are forced to use loc or iloc. I found it better to stack multiple conditions with polars instead of using a complicated query
Only thing in polars I can't do properly is the transpose function
Polars transpose is more complicated
unpivot/pivot works better for many transpose-ish use cases
polars also has a transpose method though
what I mean is the difference in speed between pandas and polars
I know they have their own but pandas T is better from my experience
Yeah polars is faster too and I found their lazy evaluation useful
I know it's faster, I just don't think the speed difference is meaningful in the vast majority of use cases
but of course, that's a lame excuse
Like I had some code for Treasury data I had for time series and I couldn't transpose it to make yield curve using polars but the pandas T worked fine
Given if you use pandas to datetime
The documentation for polars on this was confusing
for some use cases it matters a lot, if doing things more efficiently can let you allocate fewer resources to achieve the same result (cost savings), or some niche cases in which ultra low latency matters
though for most cases I agree it's at best a convenience of having to wait a few less seconds
do you have a complete example?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import yfinance as yf
import scipy as sp

Treasury = pd.read_csv('daily-treasury-rates.csv')
Treasury['Date'] = pd.to_datetime(Treasury['Date'])
Treasury = Treasury.set_index('Date')
# transpose: dates become columns, maturities become the index
Treasury1 = Treasury.T
plt.plot(Treasury1['2025-10-10 00:00:00'], color='black', label='10/10/2025 Yield Curve')
plt.plot(Treasury1['2025-08-28 00:00:00'], color='red', label='8/28/2025 Yield Curve')
iirc they use this: https://docs.rs/chrono/latest/chrono/format/strftime/index.html
polars really doesn't document it well, in great part since it's just wrapping chrono, but I still wish they'd at least include a link or two
might open an issue for that later 
set_index('Date')
This was in pandas but how to convert to polars is difficult
The data I used came from the US treasury website
The result I would get this
from pandas but polars i cant
i know i can convert the pd dataframe to polars but i cant transpose cleanly like pandas
I think that df.unpivot(index='Date') works for that?
```pycon
import plotly.express as px
import polars as pl

df = pl.read_csv('daily-treasury-rates.csv')
t = df.unpivot(index='Date')
dates = ['10/10/2025', '01/02/2025']
test = t.filter(pl.col('Date').is_in(dates))
fig = px.line(test, x='variable', y='value', color='Date')
fig.write_html('test.html')
```
granted, you cannot index it later - polars explicitly avoids having an index like pandas's
I think this is a acceptable solution
Unorthodox but works
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head(20)

# keep the label out of the features, otherwise the model is handed the answer it should predict
X = df.drop(columns='target')
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

model = LogisticRegression(max_iter=100000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred

model_accuracy = accuracy_score(y_test, y_pred)
model_classification_report = classification_report(y_test, y_pred)
model_confusion_matrix = confusion_matrix(y_test, y_pred)
is there anything to improve or anything as a recommendation, I am a beginnner to data science
it's a breast cancer detection model
I mean the code will run. Do you understand the outputs? Do you want to improve accuracy of the model, improve interpretability of it, etc.
Technically a lot to improve, but what do you feel what help you most to learn how to do?
I know how the model works, but I would appreciate it if you explained to me how the confusion matrix works
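A hand-rolled sketch of what a 2-class confusion matrix counts, using the same layout as sklearn's `confusion_matrix` (rows are the true class, columns the predicted class). The labels below are made up, and `confusion_matrix_2x2` is just an illustrative helper:

```python
import numpy as np

def confusion_matrix_2x2(y_true, y_pred):
    """Rows are the true class, columns the predicted class (classes 0 and 1)."""
    cm = np.zeros((2, 2), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Hypothetical labels: three true 0s and five true 1s.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1, 1]

cm = confusion_matrix_2x2(y_true, y_pred)
# Treating class 1 as "positive":
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that were right
precision = tp / (tp + fp)        # of everything predicted 1, how much really was 1
recall = tp / (tp + fn)           # of everything really 1, how much was found
```

The diagonal holds the correct predictions and the off-diagonal cells show which kind of mistake was made. In `load_breast_cancer`, class 0 is malignant and class 1 is benign, so it is worth deciding which class counts as "positive" before reading precision and recall.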
Hello,
I am having issues with setting the GPU/CPU device in order to train my ResNet model. My Jupyter notebook is currently set up with a GPU. When I try to load my dataset from a directory, I need to explicitly tell TensorFlow to perform that operation on the CPU by doing
with tf.device('CPU:0'):
    # code to load datasets here
When I want to create my resnet model i currently do;
with tf.device('/CPU:0'):
    pretrained_model = tf.keras.applications.ResNet50V2(
        include_top = False,
        input_shape = (img_height, img_height, 3),
        weights = 'imagenet',
    )
    # other code here
    output = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=pretrained_model.input, outputs=output)
Finally, I want to compile and fit the model;
# compile code here
epochs = 50
with tf.device('/GPU:0'):
    history = model.fit(
        train_ds,
        validation_data=val_ds,
        epochs=epochs,
        callbacks=[checkpoint_cb]
    )
However when I try this, I get this error;
InvalidArgumentError: Graph execution error:
Detected at node StatefulPartitionedCall defined at (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
Trying to access resource conv1_conv/kernel/911 (defined @ /opt/conda/lib/python3.12/site-packages/keras/src/backend/tensorflow/core.py:38) located in device /job:localhost/replica:0/task:0/device:CPU:0 from device /job:localhost/replica:0/task:0/device:GPU:0
Cf. https://www.tensorflow.org/xla/known_issues#tfvariable_on_a_different_device
[[{{node StatefulPartitionedCall}}]] [Op:__inference_multi_step_on_iterator_38084]
How can this be solved? When I set the model creation to a GPU that code will then fail. So it becomes a big problem.
Like I know that it tells me how and where the model did wrong but how can I fix it or improve it's accuracy
Hey everyone! I’m preparing for Data Engineer roles (Python, SQL, ADF, ETL) for the next 6months — anyone up for consistent learning and project collaboration?
sorry i was quite busy yesterday and i forgot
the code looks alright i think, you iterate over action-state pairs and the innermost loop looks at all neighbours of the state and fills out probabilities in sliced matrix corresponding to s-a pair
although you might be misunderstanding what the tensor means - it is keeping transition probabilities, so probabilities that you will transition to a given state after you've chosen action in some state
so the innermost loop shouldn't in general loop over all possible actions, but rather over all possible states (or only the ones that are reachable from current state, varies from env to env) - in this case the actions correspond to possible next states but it doesn't have to be the case
hmm i am not quite getting the 2nd part. why wouldnt we want to have the state looping in the 2nd loop? what would be a "sentence" that describes what my version is doing ( i think that helps me understand the difference), which seems to be wrong?
okay so the tensor you're filling out tells you this - "If we are in state s, we take action a, then for any state s', T[s', s, a] is the probability we end up in this state in the next step"
And what you are doing is "If we are in state s, we take action a, then for any action a' we fill out T with probabilities so that T[s', s, a] is probability of ending up at s' where s' is the state associated with a' "
the problem is - the possible states s' you can end up in after action a from state s do not have to correspond to actions, they are kinda independent
You assumed that each action corresponds to some direction in which we are trying to move - this is correct since we have a grid, but what if we had only 2 actions - one has 1/3 probability to either go north, west or east and second 1/3 probability of going south, east or west - then your innermost loop will go over two actions but you have 3 probabilities to add - since you can move in 3 different directions after any action
thanks a lot. ima go over this with my code later, but i assume the correction would be to loop over the action space 2x, then identify and craft the probabilities (that represent actually attempting to move in the direction a we are currently looping over):
if a_result == a:  # success case
    p = prob_success
else:  # fail case
    p = (1 - prob_success) / 3
followed by looping over our states and within that loop determining s' and the value for T to fill in by:
s_prime = neighbours[s][a_result]  # a_result is the inner a loop
T[int(s_prime), s, a] += p
i mean your code is correct, I was just adding context since looping over actions second time seemed a bit odd to me
the usual way I did this is having some function that returns possible next states for state, action pair, e.g.
def possible_next_states(action, state):
    return neighbours[state]
Or just one that returns pairs (next_state, transition probability)
However you don't really need it in this case, just as a reference for future
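To make the T[s', s, a] convention above concrete, here is a minimal sketch in plain Python. It assumes a small grid where walls keep you in place, four directional actions, and a slip probability split evenly over the other three directions; `build_neighbours`, `build_transition_tensor` and the 2x2 grid are illustrative choices, not the original code. Nested lists stand in for the tensor, so `T[s_prime][s][a]` plays the role of `T[s_prime, s, a]`.

```python
N_ACTIONS = 4  # 0=north, 1=east, 2=south, 3=west (assumed encoding)


def build_neighbours(rows, cols):
    """neighbours[s][d] = state reached by moving in direction d from s.

    Moving into a wall leaves the agent where it is.
    """
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]
    neighbours = []
    for r in range(rows):
        for c in range(cols):
            reached = []
            for dr, dc in moves:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    reached.append(nr * cols + nc)
                else:
                    reached.append(r * cols + c)
            neighbours.append(reached)
    return neighbours


def possible_next_states(state, action, neighbours, prob_success):
    """Yield (next_state, probability) pairs for one (state, action) pair."""
    slip = (1.0 - prob_success) / (N_ACTIONS - 1)
    for direction in range(N_ACTIONS):
        p = prob_success if direction == action else slip
        yield neighbours[state][direction], p


def build_transition_tensor(neighbours, prob_success):
    """Return T with T[s_prime][s][a] = P(next = s_prime | state = s, action = a)."""
    n_states = len(neighbours)
    T = [[[0.0] * N_ACTIONS for _ in range(n_states)] for _ in range(n_states)]
    for s in range(n_states):
        for a in range(N_ACTIONS):
            # The inner loop runs over outcomes (next states), not over actions.
            for s_prime, p in possible_next_states(s, a, neighbours, prob_success):
                T[s_prime][s][a] += p  # += because several outcomes can share a state
    return T


neighbours = build_neighbours(2, 2)
T = build_transition_tensor(neighbours, prob_success=0.7)
```

The `+=` matters near walls: two different directions can lead to the same state, and their probabilities have to add up. For every (s, a) the entries over s' should sum to 1, which is a cheap sanity check for any environment.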
Hello all, I'm a beginner looking for ideas on how to approach creating a tool that pulls an identical subimage from a larger image using a template, then write what was found to a csv or json file. Specifically it is an inventory screen in a game UI, so the subimage would be an item icon in a fixed position and resolution for every screengrab and it would be an exact match. I've looked into opencv but was just wondering if there was any better suggested methods or tools to use?
Not exactly. Some are and some are not. I did try clustering based on semantic similarities but only for a small sample of documents e.g. 100 out of tens of thousands.
so heres some fun everyone can enjoy, I want to convert MNIST back into images in altair. I wrote a 2x2 heat plot that was like 20 lines of code so I'm a little bit intimidated by the prospect. I'm not writing this because it's the best way, I have been doing everything with altair whether it makes sense or not to level up as a DA.
I'm about to whip out an autoencoder and then I'll need to check my results, some alt.Chart() code will follow
so here's where I'm at ```py
import polars as pl
import torch
from torch import nn

def train_it(model):
    optimizer = torch.optim.Adam(params=model.parameters(), lr=0.01)
    loss_func = nn.MSELoss()
    n_epochs = 10000
    for i in range(n_epochs):
        # draw a fresh random mini-batch of 32 rows each step
        sample = df.sample(n=32, seed=i).lazy()
        X = sample.select(pl.exclude(['id', 'column_1'])).collect().cast(pl.Float32).to_torch()
        y_hat = model(X)
        loss = loss_func(y_hat, X)  # autoencoder: the input is its own target
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # reconstruct 10 sampled rows with the trained model
    return model(
        df.sample(n=10, seed=42)
        .select(pl.exclude(['id', 'column_1']))
        .cast(pl.Float32)
        .to_torch())
```
nice, am in business torch.Size([10, 784])
wait, why are you using X as your target ?
autoencoder
is that wrong?
no, didn't realise it was an autoencoder
I think you should use dataloaders and batches
reciting this from memory after a lecture a few hours ago, so ya feel free to correct mistakes
it will work better
yeah I have no doubt really I'm more after the charting experience, this was a pretty easy subject
huh... so [column for column in encoded_df] returns the columns but encoded_df[0] returns the first row (polars)
how'm I supposed to iterate over it, I was thinking .to_numpy().reshape((28,28)) was going to get it reshaped
in general you are not supposed to iterate over data frames
but to_numpy should work?
yeah... hmm I'm less comfortable with numpy but that makes sense
it's mostly the same as torch
in general you're not supposed to plot unlabeled data with altair heh
also, you can use df[row, col] like df[:, 0] or df[0, :]
I think that will try to get a column whose index is a tuple of (row, col)
without the loc
polars
we'll win you over eventually
reshaped_arr = encoded_arr.reshape((10,28,28))
reshaped_df = pl.concat([pl.DataFrame(reshaped_arr[i]) for i in range(10)])
this feels like a weak method for getting at the intended result
I'm still reaching for iteration
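One way to avoid the concat step is to go straight from a flat 784-value row to long-form records, one per pixel, which is the shape a heatmap-style chart usually wants. A minimal sketch in plain Python; `image_to_long_records` and the field names `image`, `x`, `y`, `value` are made up for illustration, and the tiny 2x2 input stands in for a 28x28 digit.

```python
def image_to_long_records(flat_pixels, width, image_id=0):
    """Turn a row-major flat pixel list into one record per pixel."""
    records = []
    for index, value in enumerate(flat_pixels):
        y, x = divmod(index, width)  # row-major: index = y * width + x
        records.append({"image": image_id, "x": x, "y": y, "value": value})
    return records


# A 2x2 "image" standing in for a 28x28 MNIST digit.
records = image_to_long_records([0.0, 0.5, 1.0, 0.25], width=2)
```

For MNIST this would be called with `width=28` on each reconstructed row, and the resulting records could be loaded into a data frame and encoded with x and y as ordinal fields and value as the color.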
oh well, break time here's where I'ma leave off ```py
reshaped_df = pl.concat([pl.DataFrame(reshaped_arr[i]) for i in range(10)])
reshaped_df = reshaped_df.with_row_index(name='Y')
reshaped_df.with_columns(pl.col('Y') % 28 + 1)
```
guys i need to check ai % detection for my word document do you know any tools?
Gptzero
none of those are actually good. unlike images or sound, there's no surefire way to detect that it's AI generated.
ipynb notebooks are notoriously incompatible with git. I'm not aware of a good solution except to switch to a different type of notebook like marimo.
Big teams must collaborate somehow?
I am looking at nbdime and jupytext rn
I think it's unusual for multiple people to collaborate on a notebook in a way that involves git versioning. If they do, they probably clear all the cell outputs before committing.
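Clearing outputs before a commit can be scripted, since an .ipynb file is plain JSON. A minimal sketch using only the standard library, assuming the nbformat 4 layout where each code cell carries `outputs` and `execution_count`; `strip_outputs` is a hypothetical helper name and the notebook dict below is a toy example.

```python
import json


def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs removed."""
    cleaned = json.loads(json.dumps(notebook))  # simple deep copy via JSON
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


example_notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["1 + 1"],
            "execution_count": 7,
            "outputs": [{"output_type": "execute_result", "data": {"text/plain": ["2"]}}],
        },
    ],
}

cleaned_notebook = strip_outputs(example_notebook)
```

In practice the dict would come from `json.load` on the .ipynb file and be written back with `json.dump`. Stripping outputs keeps diffs limited to the source cells, which is most of what makes notebooks painful under git.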
Hey folks, I know there are a lot of professional programmers here, so I’d like to ask you something.
I keep seeing a lot of negativity around coding with AI / AI-generated code.
Why do you think that is?
I’m genuinely curious — not trying to start drama or be ironic.
Just to clarify:
I’m not talking about “one-prompt copy-paste code”,
but about serious, iterative development with AI tools involved.
I think a lot of negativity stems from the reduced human input. Code written by AI will almost always be worse than code written by someone with proper experience, and that will only compound
I think there are multiple reasons. One is that the easy accessibility of AI tools and AI code generation has led to an influx of people with little (or even no) programming experience producing and contributing poorly written code in large quantities, which can add a lot of extra workload to people who have to review or deal with that code.
or people backing up their own "ideas" by saying "but chatgpt told me it was correct"
Another is that it's just kinda human nature to be skeptical of new and untested concepts or technologies.
that's a fair point, but we have pyright, pylint... aren't those tools created to make code more.... robust?
Lots of people will just kinda lean in the negative direction until it's been proven beyond reasonable doubt that something really works well.
it can be useful to someone who can pick out the 1/3 of the code that's decent from the 2/3 that tend to wander into hallucination.
That being said, I am a professional developer and I do use AI tools all the time in my daily work.
yeah, I'm not saying it can't be beneficial, but it still requires an experienced dev in the driver seat
a calculator is only useful if you know what numbers you're meant to be punching in
agree in 100% haha
oh yea, in my opinion the worst thing is that chatgpt can generate code and a newbie can think: wow, it's amazing, this code is surely fully optimised
We also get a lot of people coming into this server where they've used ChatGPT to vibe code something, and eventually they run into an issue they can't resolve just by re-prompting the AI, and ask us to solve it for them instead.
even code that works and is optimized will often be needlessly complex... is complexified a word? anyway, GPT likes to add libraries and variables inappropriately. Routinely in my studies I'll ask AI to make something happen and the learning process for me is to take what GPT gives me and do it the right way
Ask it to build something you already have deep knowledge of and you'll see how much unnecessary fluff it adds
Ohhh, didn't know that. I sometimes have the same issues, but in general I then just switch tools - claude / gemini / chatgpt / grok / perplexity.... 🙂
Anyway, coding with AI needs strong testing: many test plans, regression tests, cross-checks with other AIs. That's my opinion based on my little experience with python haha
but if you already have deep knowledge of something, why are you using AI for it....now apply this for things you're less confident about
additionally, its training is horribly outdated in many areas. If you ask it about something it should know, like how to write machine learning, it will give you advice on tensorflow. Tensorflow hasn't been relevant for years.
I think something that's counter-intuitive and a lot of people fail to realize is that using an LLM effectively (in such a way that the result is useful and that you actually save time and effort) to accomplish complex tasks like coding is itself a skill that needs to be learned. It's not a magic box that just any person off the street can use with no training.
And it's deceptive, because it LOOKS as if you can.
is it necessary to learn how to use LLMs?
You will get code, it just won't be good or useful code.
hah... is that a serious question?
it is
I mean, I've read that in china they have LLM class with Math, History, Literacy and whatever
"necessary" is a strong word. I genuinely think it can be a useful tool in various contexts. But you can probably get away with not using it in most situations as well.
totally agree, if someone doesn't know how to use AI properly it can generate so many hallucinations, so much false code... and in the end they wind up in places like this one, from what I see in this discussion
im just going to assume that i am in the "most situations"
Most people are in most situations most of the time. 😛
from what i hear from senior devs, it seems like it's not actually that useful anyway
or at least it doesn't really improve your productivity much
Learning how to use LLMs isn't really easy to answer, the models are constantly changing and you wont always be informed. What works today may not work tomorrow, but there're probably some concepts that can be applied across the subject
In my opinion, this technology is still in its infancy and most senior devs don't know much about it.
For me it feels like:
– 10 years ago: you had to learn Git
– later: you had to learn Docker / CI
– now: you have to learn how to use LLMs properly
let's see if LLMs go into the crypto bin
It'll be interesting to see what will be considered industry standard practice in 10 or 20 years.
for now, it seems i should be able to get away with not using LLMs
i don't even want to imagine, seriously, i don't even want to imagine haha
im just going to run with that until im forced to do otherwise
I strongly doubt that will happen, because even today people are flailing to even conceptualize use cases for crypto, whereas I think there are already many obvious uses for LLMs, that just maybe still haven't been fully refined.
It's just hard to imagine that we'll just shelve all of this.
I use it to write me snippets that I can then evaluate, but other than that, it's not a regular part of my workflow
It will boil down to "How much hallucination is acceptable?"
I've started learning c# and I tried using chatgpt just to see what it would throw at a beginner and I hated it
I use the Copilot autocomplete all the time. I never turn it off and I accept its suggestions all the time.
hey folks, slightly off-topic for a moment 🙂
I’m building a lightweight IDE in PyQt + QScintilla because I got a bit tired of VS Code / Sublime / PyCharm / Spyder.
i’m planning to release it soon and I’m really curious about your experience.
What are the biggest pain points or cons of the IDEs you currently use?
I like using Copilot or ChatGPT to generate test code. I don't use it wholesale and I review it, but it's nice not to have to manually type out all the boilerplate that tends to come with test code.
what's the problem you are trying to solve
I think it's so damaging to beginners though. I immediately turned it off
This is still #data-science-and-ml
That might be the case, I'm speaking from my own perspective as a senior dev.
yeah, it's fine if you can vet what it gives you
yea you probably should ask it in #python-discussion
hah, okay, but anyway guys, thank you for the interesting discussion about using AI at work!
I find in general that it's mentally easier for me to change something that already exists rather than create something new from scratch. Even long before LLMs came around I would often develop new features or tests by copying existing similar code and then changing it to my needs.
And I find that LLMs are useful for this, generating an initial draft that I can iterate on, for various tasks.
hey nothing personal but I don't really DM, you can usually find me here though
just wanted to say thanks and show you something
Task 3.4 – Discord UX Features (Idle Tabs, Variables, Beginner Mode)
Priority: P1 – IMPORTANT
Estimated time: 10–12 h
Scope:
Idle Tabs Manager
track the last_accessed timestamp for each tab,
configurable threshold in settings (default: 24 h),
“notify only” mode (status bar / lightweight popup) + logging,
absolute requirement: zero data loss
(before any future auto-closing behavior, there must be autosave + a “parking lot” recovery system — for 1.0, notification-only is enough).
Variables Panel (post-run snapshot)
after script execution, generate a simple snapshot of locals() (no debugger),
filter out private names (_), provide readable repr() with length limit,
separate panel / output tab, read-only.
Beginner Mode
toggle in Settings (e.g. ui/beginner_mode),
when enabled:
simplified menu (hide advanced options),
larger fonts, more tooltips,
Variables Panel enabled by default, improved error messages.
Acceptance criteria:
Idle Tabs Manager gently notifies the user about long-unused tabs (no auto-closing),
Variables Panel displays a clear, sensible list of variables after script execution,
Beginner Mode noticeably simplifies the UI (fewer options, more guidance).
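The Variables Panel item in the scope above (post-run snapshot, private names filtered, length-limited repr) could look roughly like this. A minimal sketch, not the IDE's actual code: `snapshot_variables` and `run_and_snapshot` are hypothetical names, the script is run with `exec` in a fresh namespace (so it captures the script's top-level names rather than a function's `locals()`), and the 80-character default limit is an assumption.

```python
def snapshot_variables(namespace, max_len=80):
    """Map public variable names to a length-limited repr of their values."""
    snapshot = {}
    for name, value in namespace.items():
        if name.startswith("_"):
            continue  # skips private names and exec's own __builtins__ entry
        try:
            text = repr(value)
        except Exception as exc:  # a broken __repr__ should not break the panel
            text = f"<repr failed: {type(exc).__name__}>"
        if len(text) > max_len:
            text = text[: max_len - 3] + "..."
        snapshot[name] = text
    return snapshot


def run_and_snapshot(source, max_len=80):
    """Execute a script in a fresh namespace and snapshot its top-level names."""
    namespace = {}
    exec(source, namespace)
    return snapshot_variables(namespace, max_len)


panel_rows = run_and_snapshot(
    "x = 1\n_hidden = 2\nbig = list(range(1000))\n",
    max_len=20,
)
```

Returning plain strings keeps the panel read-only by construction: the UI never holds references to the script's live objects.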
have a nice day ! 🙂
is there someone here who's a beginner and looking to learn together? if yes, dm me
define beginner
ive been messing with ai for 5 years now and i still feel like a beginner
nahh I mean are you still learning libraries such as scikit-learn, mathematics, stats etc
nope
im learning other stuff
and also dont try to learn a library
its useless to just learn all the functions defined in a library
thats why google exist
bro damn🤣🤣
just go along with what you are doing and search for stuff when its needed
what are you learning rn
decompositions, ode solvers, efficient ways to handle big data such as sparse matrices, comfyui, autoencoders,
ok now how're you gonna apply it
I get it you're doing the maths
ode solvers is used for image generation
decompositions are useful for handling the math of latent space
big matrice storage u can guess probably
comfyui is for hobby
autoencoders is what transforms inputs into latent
vue.js??????
bro, how're you gonna build the model, like actually apply it in an app etc
if i make the model, then applying it into an app is very easy
making the model is the hard part here
my guy how're you gonna make a model, when you don't know libraries such as tensorflow which are essential for deployment
if i know the math, programming it is easy
because i dont have to know the function names, as i know what to do and can find the name from there
google it
can I switch to dms, cause I wanna send you a file
sure
I just saw this. Did you still need help?
what is the purpose of data science
its adapting so fast, it does feel like that.
to make data work
I just learned about the Kuramoto model recently.
I've been dabbling in multi-agent cognitive swarms.
the way i see it is its purpose lies in organizing and arranging usually messy data to adapt to ML uses.
there's more to it, but that's the gist these days.
no not anymore
the purpose is to extract information from data and to interpret it. all of these buzzwords like ML and data science are kinda muddy, unfortunately. for example, data science can involve using AI/ML to process and interpret data. it can also be done with classical methods instead
what plunder mentioned is more like data preparation, which is a cleanup process you can do before either of ML and data science
Anyone tried python 3.14 with pytorch, Tensorflow OpenCV?
Hi
Hello guys, can someone help me by giving me an idea for my final project? I want it for a hackathon, so I want a great idea that's also easy, cuz I'm a beginner. I'm down to learn more for the project 😄
I recommend coming up with your own idea and being proud that it was your own concept, and that you carried out the project yourself from start to finish. Start by creating a mindmap of what interests you and what the possibilities are. It could give you insight into what truly excites you.
Also write a proper project plan once the idea and project goal starts to take shape. In my opinion, it's a good way to get started.
3.14 came out fairly recently. on pytorch's site, no support for 3.14 is listed yet
it's not uncommon for these libraries and others like numpy to not have support for a handful of months. your best bet is to install 3.13 in a separate environment and use that
Thank you @wooden sail
as a beginner which one is easier to learn, which one is easier to master? pandas/polars
@wooden sail What do you recommend vscode or py charm
it depends on how experienced you are. in my personal opinion, learning python, learning a ML module, and learning an IDE are 3 completely different tasks and doing all 3 at the same time will make you learn everything more slowly
so if you're new to all, i would probably just use a syntax-highlighting text editor
a disproportionate amount of beginner questions in this server have to do with people fighting against vscode and pycharm to get things just to run
on a separate note, i do use vscode myself
I've 3 years experience but I'm used to both
i have the basics of python, i can understand variables, functions, classes, dictionaries, tuple, list, for loop, while loop, if/else statements, data types (like str int bool float) and i recently learn some pandas
i also know a little about numpy and matplotlib but completely new to polars & sklearn
it kinda depends what you want to do. pandas has better integration with numpy at the moment, because it sits on top of numpy
polars is better for large queries because it's faster
at least in my head, polars is for handling, moving, and accessing data, but not for any complex processing of the data
doing the latter will have you leave polars, transforming the data into something like numpy, torch, etc
basically i have to learn these libraries at my school
numpy
seaborn
sklearn
pandas
matplotlib
and i am trying to master them so i can take the exam
Focus on pandas
now i know why we weren't taught polars, we only needed to work with smaller datasets
since you list all of these things, pandas plays way better with everything here than polars does. polars will require extra conversions into types that can be used in mpl, seaborn, numpy, sklearn, etc
Polars is great for processing data, but it has no ecosystem whatsoever.
how far does sklearn cover in machine learning? the python library
whats the maximum it can do
It covers traditional machine learning
Supervised, unsupervised, preprocessing, and pipelines and all.
a really big amount of the classical statistical optimization/estimation methods, and some basic deep learning methods
oh right now we're on only supervised learning, the lecturer asked us to finish the datacamp course "supervised learning with scikit-learn" and we have some kind of project but i haven't started to even learn the library yet
The majority of supervised learning algorithms in sklearn are hardly used in practice, but it builds the foundation.
I looked at the course. Dont recommend it.
How much do you know about just basic old linear regression
i disagree with this, look at the module
there is a large amount of papers being published on these topics and their applications still
This is scikit?
and they build the foundation for state of the art model-based neural networks
yes
i know nothing yet i only know theory (like reading) classification, regression, where to use each, confusion matrix, decision trees, test data train data like that
These aren't meant to be sequential
Supervised learning is easier to learn compared to unsupervised
But you need unsupervised to eventually optimize your supervised model
For feature engineering
Decision trees are the foundation of the more "cutting edge" tabular models
And they're used extensively in more modern tabular ML algorithms
I'd recommend going further into linear regression and its regularizations
So L1, L2, and L1+L2 - these become hyperparameters for many algorithms so it helps to know how they work
For classification, I'd do logistic before anything else and multinomial logistic.
After that try KNN, then flip a coin between SVMs and Naive Bayes - difficult for different reasons.
If youre more of a stats guy, try for naive bayes, if youre a math and CS guy, go for SVMs
KNN and SVMs can both be used for regression, but a lot less commonly. You can technically use naive bayes for regression too, but I wouldn't recommend it.
Then id move on to decision trees and ensemble methods which sort of rule tabular ML for complex data and relationships
So random forest after, then general gradient boosting, XGBoost, LightGBM, and CatBoost
And then stacking if you're feeling fancy
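For reference, the L1, L2 and L1+L2 penalties mentioned above are just functions of the weight vector that get added to the loss. A minimal sketch in plain Python; the function names are illustrative, and the elastic-net scaling (`alpha`, `l1_ratio`, and the 0.5 factor on the L2 term) follows the convention scikit-learn documents for its ElasticNet estimator, which is an assumption about which convention you care about.

```python
def l1_penalty(weights):
    """Lasso-style penalty: sum of absolute values (pushes weights to exactly 0)."""
    return sum(abs(w) for w in weights)


def l2_penalty(weights):
    """Ridge-style penalty: sum of squares (shrinks weights smoothly)."""
    return sum(w * w for w in weights)


def elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5):
    """Weighted mix of both; l1_ratio=1 is pure L1, l1_ratio=0 is pure L2."""
    l1_term = l1_ratio * l1_penalty(weights)
    l2_term = 0.5 * (1.0 - l1_ratio) * l2_penalty(weights)
    return alpha * (l1_term + l2_term)


weights = [3.0, -4.0]
mixed = elastic_net_penalty(weights, alpha=1.0, l1_ratio=0.5)
```

`alpha` and `l1_ratio` are the hyperparameters that show up again in many other estimators, which is why it helps to see what they scale.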
There's a lot of details that are off with this, we can go into the details if you're willing to 😄
The order in which to learn the algorithms seems off and that's partially because you don't (imo) appropriately highlight why any of them make sense to use in a given context
This is similar to how I learned it in my machine learning class last semester.
I don't think it's particularly helpful learning ML as a big box of different algorithms with different names
When it's more important to look at the properties of each method, and group them by property, which then maps to the kind of problems they're good at solving
It's more about knowing where they fall in the landscape of ML and how these algorithms set the foundation for many others
Along the way you should learn every model's assumptions/strengths/weaknesses
As an example, SVM is commonly used in many domains. Anything related to EEG/fMRI will likely use (kernel-based) SVMs because they really shine when you have high dimensional data with a limited set of data points
In part because the number of unknowns in the optimization problem they solve is the number of observations, not the number of features
Then it's also clear you can't use them on large datasets since you need to make the Gram matrix (size N x N) which may not fit in memory
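A back-of-the-envelope check of the memory argument above: a dense N x N Gram matrix grows quadratically with the number of observations. A minimal sketch, assuming float64 storage (8 bytes per entry) and a fully materialized matrix; real solvers may cache only parts of it, so treat this as an upper-bound style estimate. `gram_matrix_bytes` is an illustrative name.

```python
def gram_matrix_bytes(n_samples, bytes_per_value=8):
    """Memory needed for a dense n_samples x n_samples kernel matrix."""
    return n_samples * n_samples * bytes_per_value


for n in (1_000, 100_000, 1_000_000):
    gib = gram_matrix_bytes(n) / 2**30
    print(f"N = {n:>9,}: {gib:,.2f} GiB")
```

The number of features does not appear anywhere in the estimate, which is the flip side of why kernel SVMs cope well with high-dimensional data on small or medium sample counts.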
Works great for high dimensional data with medium sized data sets. It was used commonly for NLP tasks. It blew up in the 90s.
I know, but courses will typically just write this stuff without explaining why, but the why is so important :p
They're rather obsolete in the sense of using a tabular model instead of a deep learning model that excels with text based/high cardinality/high dimensionality datasets
LLMs alone make an SVM model more novelty than anything else.
You have much much much more hyperparameters! (In deep learning everything is a hyperparameter, in RBF SVMs you only have 2 hyperparameters)
Industries aren't using SVMs besides niche datasets where its too small for something more complicated but too complex for GLMs
And finally, the optimization problem for SVMs, both in the primal and dual formulation leads to a global optimum. With neural nets, well good luck fiddling with parameters and training it over and over
deep learning is often the wrong approach for problems. many problems have simple solutions, sometimes even in closed form, where you cannot get any more performance no matter how you try
No, they definitely are using them. Where they are appropriate. When they're appropriate, they're simply (one of) the best method you can apply
you're underestimating performance and optimality guarantees, which deep learning has very little of
It's up to you to know when simple(r) methods are appropriate and use them
Instead of trying to dice a tomato with a chainsaw
But realistically, organizations that need to deal with high cardinality like that
Theyre using LightGBM
Theyre using deep learning models
Not necessarily
Yes, you need more data, and companies have plenty of it. Too much of it.
this is 100% wrong
big AI companies have plenty of data. most companies do not have enough to train small models
And when you're working with anything related to bio / human stuff
You have so so so little data
And these are domains where the most money can be made
Im talking corporate level companies
for reference zestar works in bio applications with ML, and i work in industrial nondestructive testing, also with ML
with masters/phd
i have yet to collaborate with a company or university that says they have too much data
I work as a data scientist
training with less data is an active research field
ML rollout in industry is impeded by lack of data
Lack of publicly accessible data
No, even within companies
Even if they have data, a lot of it isn't labelled to be used in the context of supervised ML
Thats where my job comes in
Can I be blunt? 😅
There's typically dozens of ways models/approaches can be improved by knowing some more of the theory. I see this at work as well, and models have been demonstrably improved by this.
We can have a cool discussion here, but I feel like I'm talking to a wall haha.
I guess I just have a different perspective
It's not different, we can't say 1+1=3 and call that a different perspective imo
ML Algorithms that arent great out of the box are very costly and need rapid prototyping
You know you've just described deep learning?
And a lot of the modeling I do is Bayesian modeling
They are not great out of the box if you need a novel architecture because the design space is infinite
The whole point of this, is essentially that other methods, simpler ones are great out of the box
Simpler ones are great if you have simple data and simple needs.
Let's go back to fMRI data, that is definitely not simple data and not used for simple needs. SVMs will still outperform an exotic whatever architecture in most cases
On forecasting data exponential smoothing has been shown in large scale studies to be hyper competitive with whatever LSTM people were cooking up
You can try benchmarking it against an XGBoost or LightGBM
What survey paper
On forecasting methods
Yes
And I hope you know XGB and LGBM have serious issues for forecasting (another nice theory one)
The values in the leaves and nodes are from the training data, correct?
Theyre not as commonly used for time series modeling compared to cross sectional data
They absolutely are
Mostly because they can model seasonality without preprocessing
But due to this they cannot model trend
Most time series models fail horrendously regardless
But a lot of people do not know this, so they employ these models in different scenarios where the real world distribution shifts in a very predictable, constant way (e.g., trend) and they cannot capture this, linear regression can
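The trend point can be checked on a toy series. A tree predicts the average of the training targets in a leaf, so past the end of the training range its prediction stays flat, while a fitted line keeps extrapolating. A minimal sketch in plain Python, using a hand-rolled depth-1 regression tree as a stand-in for XGBoost/LightGBM (an assumption made to keep the example dependency-free; `fit_line`, `fit_stump` and `predict_stump` are illustrative names).

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def fit_stump(xs, ys):
    """Depth-1 regression tree: best single split by squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    """Return the mean of whichever leaf x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Training data with a pure linear trend: y = 2x + 1 for x = 0..9.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]

slope, intercept = fit_line(xs, ys)
stump = fit_stump(xs, ys)

line_forecast = slope * 20 + intercept   # follows the trend past the data
tree_forecast = predict_stump(stump, 20)  # stuck at a leaf mean
```

Deeper trees and boosted ensembles have more leaves, but their output is still built from values seen in training, so the flat extrapolation carries over. Detrending or differencing the series first is the usual workaround.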
They make us. so. much. money.
The time series has to be static. No sudden cuts to interest rates, no random politician shenanigans.
If you're using exponential smoothing you're invariant to this
If the series suddenly shifts your predictions will also shift
For serious time series modeling, theyre doing everything from scraping news articles and performing sentimental analyses on the
because the prediction is a moving average
Its not. One single "cancel culture" tweet would destroy your model and its predictions for next month's concert sales.
A single tweet, sudden changing trend, a natural disaster, a political whatever, a pandemic - that model is gone
A time series would have to be consistent and stable. These models learn from what they're given. They can't do anything about something none of us saw coming.
Okay, just gonna drop this here and move on 😄 https://otexts.com/fpp3/
Great book on forecasting. Definitely check out chapter 8 which may help on this topic (exponential smoothing)
Great for valuing your most recent data points more than your previous ones.
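For readers who have not met it, simple exponential smoothing fits in a few lines: each new level is a weighted average of the latest observation and the previous level, so older points fade geometrically. A minimal sketch in plain Python, initializing the level with the first observation (one common choice, assumed here); `simple_exponential_smoothing` is an illustrative name, and the toy series has a sudden jump from 10 to 20.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed level after each observation.

    The one-step-ahead forecast is the last level in the list.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    levels = [level]
    for observation in series[1:]:
        level = alpha * observation + (1 - alpha) * level
        levels.append(level)
    return levels


shifted_series = [10, 10, 10, 20, 20, 20]
levels = simple_exponential_smoothing(shifted_series, alpha=0.5)
forecast = levels[-1]
# levels is [10, 10.0, 10.0, 15.0, 17.5, 18.75]
```

After the jump the forecast moves most of the way to the new level within three steps, which is the sense in which the prediction follows a shift. It reacts to a shock after it happens; it does not anticipate one.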
Still cant do anything about that Mcdonalds E coli outbreak
Time series models overfit more than most others. The second anything significant changes in that environment, it's all out the window.
They learn from what they have, but they cant predict or forecast a sudden feature it was never trained on, without a ton of uncertainty.
if it's an LLM, you just start doing it and wait for the heat death of the universe
the code doesn't look any different than the same code running on a GPU. It just won't finish.
you could LoRA it 💀
i tried, doesn't work
want the code then ping me
It's important to never say that something "didn't work". That doesn't give anyone any useful information. You have to say what you did, what you expected it to do, and what actually happened.
Can anyone suggest me CV or NLP project ideas??
Hi, Is there anyone here, aged 14–25, who is interested in AI/ML?
You deleted the pastebin entry? I'm not available right now, but it would have been useful for other people
There are lots of people here who are interested in that. Why that age range?
just trying to find people my age to connect
Does that mean we can't be friends? 
of course we can 😄
Yes
@clever hollow I'm interested
Nice to meet you too
How was your experience studying CS
oh nice im into those too got any projects u are working on
Let's chat in DM
ok
Hey, when working on a spatio-temporal problem, let's say I take inputs t1 to t6 and predict t7 to t9, then take t2 to t7 and predict t8 to t10, and so on. Will this result in data leakage, since the samples overlap?
Or should I build samples like this: inputs t1 to t6 with targets t7 to t9, then the next sample is t10 to t15 with targets t16 to t18?
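One common way to reason about this (not the only one): overlap between samples is not leakage by itself; the risk is a time step showing up in both a training window and an evaluation window. A minimal sketch in plain Python with 0-based time indices; `make_windows` and `chronological_split` are hypothetical helper names, and dropping windows that straddle the split point is one conservative choice among several.

```python
def make_windows(n_steps, n_in, n_out, stride=1):
    """Return (input_steps, target_steps) index lists for a series of n_steps."""
    windows = []
    start = 0
    while start + n_in + n_out <= n_steps:
        inputs = list(range(start, start + n_in))
        targets = list(range(start + n_in, start + n_in + n_out))
        windows.append((inputs, targets))
        start += stride
    return windows


def chronological_split(windows, split_step):
    """Train on windows ending before split_step, test on windows starting at or after it.

    Windows that straddle split_step are dropped so no time step is shared.
    """
    train = [w for w in windows if w[1][-1] < split_step]
    test = [w for w in windows if w[0][0] >= split_step]
    return train, test


overlapping = make_windows(18, n_in=6, n_out=3, stride=1)  # t1-t6 -> t7-t9, t2-t7 -> t8-t10, ...
disjoint = make_windows(18, n_in=6, n_out=3, stride=9)     # t1-t6 -> t7-t9, t10-t15 -> t16-t18

train, test = chronological_split(make_windows(30, n_in=6, n_out=3), split_step=20)
```

With stride 1 you get many more training samples from the same series, at the cost of them being strongly correlated. Shuffling overlapping windows into random train/test folds is where the leakage would come from.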
hey, can anybody help me develop skills in data science and AI? I'm a 2nd year B.Tech student, done with Python, NumPy, pandas, matplotlib and seaborn, and about to start ML. Please give me a step-by-step process to learn it, and any suggestions
what model are you using
Would be happy to help.
Hi everyone! I'm currently engaged in deep analytical research focused on the Oslo Bysykkel Open Data . My main goal is to extract maximum educational and practical value from this data asset. I've developed (mete) several distinct concepts for structuring this work, and I'd love to share the vision and gather feedback. I'd love to connect with anyone interested in discussing these concepts, collaborating on content, or just exchanging insights on the Oslo data.Feel free to send me a DM or comment below! Thanks
I like making money. like. a lot. That book you recommended, would it be a valuable read? Are there others you can recommend?
Spent about 3-400 hours studying ML math and inner workings in the last year, for context. As a professional DA I recognize that deep learning isn't usually the right tool for the job, but I figure the math will be valuable as I try to reenter the job market.
Hi everyone! I’m currently exploring Data Science through a course and have just started a GitHub repo to map out learning paths. It’s a collaborative space, and I’d love to include others who are passionate about DS — whether you're experienced or just starting out. If you'd like to contribute helpful resources or insights, feel free to DM me for an invite. Let’s learn and grow together
Convlstm
heard data science is just statistics is that ryt?
Not exactly, DS includes statistics, but it’s much more than that. It also involves programming, machine learning, data wrangling, and storytelling through visualizations. Statistics helps us understand patterns, but data science uses that understanding to build models, solve problems, and make smart decisions with data.
Hi @grizzled tartan . It actually takes an exceptional amount of computing power to create LLMs. Like it literally costs millions of dollars each time. You should start by learning how to train simpler models.
thanks for understanding, I am new to this field
Hi everyone, I have created a fuzzer to fuzz test MCP servers. It's mostly helpful if you're using a compiled language to create an MCP server, as it would help detect crashes and other probable resource issues, and also if you're implementing your own custom MCP protocol. It's not tested thoroughly, as you can see from the issue https://github.com/Agent-Hellboy/mcp-server-fuzzer/issues/108
please use this on your server and help me test it, could be a helpful project to the community.
Check out the pinned comments
Hello guys
Would love to know your opinion on this project
My question is - Can we make subgraphs inside the main graph sharing the same state of the main graph in langgraph ?
Guys.... Where to start?
With what? 
Im working on something a bit like that in nature, but im working toward a crypto economic provenance layer. Its got a unique complexity metric Ive been working on for a few years. But it also uses merkle anchored commitments on Polygon zkevm, basically ZK data availability layer.
Looks a bit AI generated. I'd be the last to knock you for that, but presentation. But I also dont see any code. Just your LICENSE and README. Its not even python related, it looks like AI slop honestly lol.
Hello my name is Taha, nice to meet you! -> likedin/in/tahayacine
If you are a strong CS/AI/Data Undergrad or Masters, and strongly interested in AI research and just starting or just started, DM me, I am starting an initiative together.
Hello everyone, I'm new to python, and now for the specific what i do is learn about data and be a data scientist, but now i'm really confused what should I learn next after playing with the python.. should I continue to learn about sql? or do I need to learn another thing that relate to a data? welp, no idea.. so i just wanna ask something, what the next thing should I learn to be a data scientist?
uhm
learn pandas and numpy. you could learn SQL if you are going to get and check data in a SQL database.
i already learned sql.. but about pandas and numpy, i just know the basics.. but, does pandas only make a graphic of the data? or can it do smth else?
I could be wrong but I don't think pandas does graphic charts for data. That's normally done with another library like seaborn or matplotlib
Even if it did charts, you can do more than that
like data cleanup or reshaping, resizing, additional columns, drop NAs, etc
or maybe i'm the one who wrong here 😅
oh yes, the seaborn one who made the chart of a data
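A small sketch of the non-charting work mentioned above (dropping NAs, adding columns, reshaping by group). The column names and values are made up for illustration. For what it's worth, pandas also has a `DataFrame.plot` method, which is a thin wrapper around matplotlib, so both views in the thread are partly right.

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", None],
    "temp_c": [3.0, None, 7.5, 6.0],
})

clean = df.dropna().copy()                        # data cleanup: drop rows with missing values
clean["temp_f"] = clean["temp_c"] * 9 / 5 + 32    # additional column derived from an existing one
summary = clean.groupby("city", as_index=False)["temp_c"].mean()  # reshape: one row per city
```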
AI slop?
yea it is
Data science is a lot of statistics and data interpretation, much of which isn't performed programmatically. As agent mentioned, a dataframe library is the primary tool in their stack, personally I prefer polars over pandas. Corey Schafer goes over pandas and is highly regarded by the community if you're looking for some educational material.
As for charting, multiple sources have recommended plotly express as the first library to reach for. Past that there are many options, you may also benefit from Marimo, a notebook interface for *.py files with integrated visualization.
ohh I see.. thanks for the information, I'll start to find out about it soon 😄
hello guys, im currently looking for data science github repositories made by seniors, i want to know how seniors make projects
how can i find one?
charting: plotly or matplotlib. Roughly equivalent, but I do not like how plotly has a sales component to it. Makes me feel sullied.
also note that you can use gnuplot from the command line if you want to do something really quick and dirty. I feel that gnuplot is often overlooked
Nearly 40 year old plotting tool.
http://www.gnuplot.info/
it has a simple and easy to learn scripting language
what kind of zesty finetuning where they doing for grok 4.1
Which is the best and most scalable tool for complex workflows?
what is it
take a sentencetransformer that has this built in complexity sense, thats driven by my UCF tool. But the idea is to put a small head that projects embeddings into a 5D "UCF" space (N,A,ϵ,cosθ,sinθ) then.. maybe.. train it so that this subspace matches the analytic UCF while still doing normal semantic embeddings. something like that
one embedding, but two views
so what next
hi guys, im kind of obsessed with reproducibility (nix user 😔 ) and i was wondering if there is a library i can use for downloading datasets that:
- caches the data on disk, so it's downloaded only once
- asserts that the checksum of the file matches a given hash, so it's clear if the source data ever changes
- returns a path to the downloaded/cached file
basically im looking for something similar to fetcher derivations in nix, but as a python library
fetchurl {
url = "https://www.kaggle.com/api/v1/datasets/download/hojjatk/mnist-dataset";
hash = "sha256-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";
}
hell if i know lol still thinking about it
matplotlib has an interesting API that I can't stand honestly
I'm in the minority here but I like hvplot a bit (which is another abstraction layer on top of the plotting libs like matplotlib, plotly, bokeh)
what are the odds. you're saying, have an origin which you use theta to cluster, and then use phi/theta/distance from 0,0,0 to find semantic clusters. Noice.
That's what I'm doing.
hvplot looks really cool, do you find yourself using its interactive functionality? I recently switched to vega-altair and marimo but haven't ever put any real dashboards together. I guess I'm just wondering how many of these bells and whistles I'd benefit from with a more holistic familiarity.
if you use a 30k vocab, that mapping reduces search space down to 7k or lower. You can do a lot better than that though. It's not that helpful, but just enough to leverage it.
Does anyone have the experience of using langgraph
what would you ask that person? it's faster for everyone if you ask your actual question.
Yes will do it
Well, do it...
yeah pretty often
I mean I just find it usually nice to be able to drag around, zoom in/out, hover and get tooltips, etc
the interactivity just depends on the backend you're using
(I'd say I'm more of a hobbyist than a good data scientist, so take the above with big grains of salt)
with marimo, you can also kind of use plots as inputs
e.g. select a region of a scatter plot then retrieve it as a dataframe
some examples: official docs, notebook I made a while ago
it is nice for data exploration and could be useful for some dashboards
i found what i was looking for: https://pypi.org/project/pooch/
import pooch
file_path = pooch.retrieve(
# URL to my data
url="https://github.com/org/project/raw/v1.0.0/data/test_image.jpg",
known_hash="sha256:50ef9a52c621b7c0c506ad1fe1b8ee8a158a4d7c8e50ddfce1e273a422dca3f9",
)
apparently it's pretty widely used (in packages like scipy, scikit, histolab, etc)
Hi . I have recently started a master's course in machine learning and data science. I was hoping to find out if there was anyone on this channel that does or might be interested in tutoring/having a ML concepts discussion...basically to help with discussing doubts on basic ML concepts. Feel free to DM if anyone is interested.
I don't know if you'll find someone, but in case you don't, feel free to ask your questions and doubts regarding ML and DS in this channel
Isn't it pretty much just:
- prediction by pre-training
- prediction by reward
using things like Q-learning, where you have a state, and after each action you adjust the value of that action by how close it got to the correct route, so the weights on the best action to choose change and it can make a different decision on the next run. It keeps trying the possible actions within the state constraints, adjusting each run, until it converges on what's good and gets the right answer each time?
And if you take all the possible states and scope them to a smaller set of values, splitting up the states assessed for the learning process, you have a network/framework of these Q-learning instances each working within its own state space, so you don't iterate over something like 800 points, and it cuts it down to like 60 (60 is a lot, idk of an example right now, but I try to get it to 10 or less) per learning instance.
There ya go, ML.
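For reference, the "adjust the value of the action you took" idea is usually written as the tabular Q-learning update, Q[s, a] += alpha * (reward + gamma * max(Q[s']) - Q[s, a]). Below is a minimal sketch on a made-up 5-state corridor. The environment, reward and hyperparameters are all assumptions chosen for illustration, not anything from the discussion above.

```python
import numpy as np

n_states, n_actions = 5, 2            # toy corridor; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.5, 0.9, 0.2  # learning rate, discount, exploration rate
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))

def step(state, action):
    """Toy environment: reward 1.0 only when the right-most state is reached."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == n_states - 1
    return next_state, (1.0 if done else 0.0), done

for episode in range(500):
    state = 0
    for t in range(100):  # cap episode length so every episode terminates
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))  # explore
        else:
            best = np.flatnonzero(Q[state] == Q[state].max())
            action = int(rng.choice(best))         # exploit, breaking ties at random
        next_state, reward, done = step(state, action)
        # Q-learning update: move Q[s, a] toward reward + discounted best next value
        target = reward + (0.0 if done else gamma * Q[next_state].max())
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
        if done:
            break

greedy_policy = Q[:-1].argmax(axis=1)  # best action in each non-terminal state
```

After training, `greedy_policy` picks "right" in every non-terminal state, which is the shortest route to the reward in this toy setup.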
ty
What do I do to get experience that has an effect on the outside world so that I can turn python into something that makes me money. I have thought about process automation but I do not know where to reach people.
honestly, there are 2 main paths: freelancing or actually getting a job
doesn't even have to be an IT job. anything Excel-related you can use Python to automate and impress your boss works too
Build a trading bot.
What do you guys think about the longevity of data science?
What bot should i make
Whatever you need
Hello, anyone familiar with Roboflow here pls...I have a folder with my images and labels from txt file, anyone knows how I can upload that in roboflow? I can only upload a single folder at a time, do I need a json file or something linking each image to a label or something like that?
damn my trading bot is destroying it. its up $1800 bucks in 40 hours. pretty proud of myself though, between the model and the trading bot itself it's taken me at least a year and half to make it this far.
We will probably see it evolve with new tools and that kind of stuff, but i dont see it going anywhere. Not as long as AI is around.
But you wont need nearly as many as we currently probably do.
I know nothing about trading
why not start now ?
Maybe in 3 or 4 years
Too busy with school to start now
Nice work! Though 40 hours is still a pretty short sample, so let’s see where this stands in 10 years. As Taleb would remind us, this could easily be a Black Swan event 😉
I restarted it last night and started it with 5k this time. Seemed like a perfect time to test its grit. Financial markets are on fire.
Can yall help me with a data leakage problem
Just ask the whole question, so people can answer if they can.
Ok, I have a data leakage problem in this code where I'm getting a 1 for the score (it may be overfitting, but I doubt it). Here's my code:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
ts = pd.read_csv("/Users/arhaann/Documents/code/Python/Titanic Survival.csv")
s = pd.read_csv("/Users/arhaann/Documents/code/Python/Survive.csv")
ts['Age'] = ts['Age'].fillna(ts['Age'].median())
ts['Fare'] = ts['Fare'].fillna(ts['Fare'].median())
le_sex = LabelEncoder()
le_embarked = LabelEncoder()
#Female is 0, Male is 1
ts['Sex'] = le_sex.fit_transform(ts['Sex'])
#C is 0, Q is 1, S is 2
ts['Embarked'] = le_embarked.fit_transform(ts['Embarked'])
ts['Family_Size'] = ts['SibSp'] + ts['Parch'] + 1
ts['Family_Size'] = ts['Family_Size'].fillna(ts['Family_Size'].median())
ts['Survived'] = s['Survived']
ts = ts.sample(frac=1, random_state=41).reset_index(drop=True)
x = ts[['Pclass', 'Sex', 'Age', 'Embarked', 'Family_Size', 'Fare']]
y = ts['Survived']
print(ts.head(10))
print(s.head(10))
gbr = GradientBoostingClassifier()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1000)
gbr.fit(x_train, y_train)
print(cross_val_score(gbr, x_train, y_train, cv = 3, n_jobs=-1).mean())
param_grid = {
'n_estimators': [100, 200, 300, 500],
'learning_rate': [0.01, 0.05, 0.1],
'max_depth': [3, 5, 7]
}
gbr2 = GridSearchCV(gbr, param_grid, cv = 3, n_jobs=-1)
gbr2.fit(x_train, y_train)
y_pred = gbr.predict(x_test)
print("Base Model Accuracy:", accuracy_score(y_test, y_pred))
best_model = gbr2.best_estimator_
y_pred_best = best_model.predict(x_test)
print("Best Model Accuracy:", accuracy_score(y_test, y_pred_best))
!code
Does anyone have any kids stories they can type to my ai? (Chatgpt coded it and I have been playing with it)
would something like https://github.com/galaxykate/tracery help?
Hey guys it's nice to have you all here , today i joined this discord community and it's really exciting for me to be here .
Any body amongst you guys familiar with the libraries that one needs to learn to be a data analyst I'm actually pretty confused though.
As a Data Scientist I hope I can answer this for a Data Analyst.
The learning never stops.
If you are starter then:
- Pandas
- NumPy
- Matplotlib
- SciPy
- statsmodels
- Scikit-learn
hey guys can you gimme advice i know nothing bout AI (i know python. c. java tho) will this book help me like build at least small langauge models. and what videos and media you recommend other than this. rlly appreciated.
building a useful language model from scratch requires an absurd amount of data and compute, even a relatively "small" one
fine tuning existing models is not as bad though, take a look at Hugging Face or Unsloth's documentation
label leak from the second csv, wrong gridsearch attribute, evaluating the wrong model. All humming together like a quiet bureaucratic nightmare.
Alright thanks dude
Hello guys can you give ideas for data science project for a hackathon
Some guidance for data science please
I'd start here forsure https://pythonprogramming.net/
maybe you can start with LORA model if you have a decent gaming rig.
I saw this recently. https://github.com/unslothai/unsloth
Has anyone seen unsloth before? Looks really awesome
Cool project, but FYI this isn’t an LLM. It’s basically a rule-based NLP engine, not a neural language model. Also, real question: is there even a single line of code in here that you actually wrote yourself?
Eager but completely uneducated "developers" are killing all open source work. My time was wasted again for 10 minutes. Open source is doomed.
Not the place to shitpost
you will be better off in one of the off topic channels like #ot0-psvm’s-eternal-disapproval
<@&831776746206265384> shitposting, aggressive
!mute 1426730370665152683
:incoming_envelope: :ok_hand: applied timeout to @devout pivot until <t:1763803386:f> (1 hour).
A cli tool to search for models and datasets on the hf hub.
Features
- Search Models: Find models by keywords, author, tags, or task
- Search Datasets: Find datasets by keywords, author, or tags
- Export Results: Export search results to CSV or TXT files
- Beautiful Output: Formatted terminal output with Rich
- Python API: Use as a library in your Python projects
pip install hfsearch
drop a star🌟 if you like it https://github.com/HenokB/hfsearch
Hi everyone,
I’m working on a sign-language classification project in TensorFlow and I need some advice because my accuracy is very low. I have used WLASL100 and WLASL1000, and I also tried using only the top 10 most frequently recorded words, but that didn’t improve accuracy much. I excluded the face keypoints and only used body, hands, and arms (with MediaPipe), which helped a little but didn’t solve the problem. My model is a small BiLSTM network with two layers (64 and 32 units), followed by a dense layer and a softmax output. For training, I used class weighting, early stopping, and learning rate reduction. Sequences are padded to the maximum length in the dataset, and I do a stratified train/validation split.
I’m wondering what I could do to improve accuracy, as my deadline is tomorrow. Should I switch to a different architecture or dataset? Any advice would be very helpful!
Thanks a lot!
maybe try some bigger network ?
Hello, does anyone know if there is some kind of object detection model that detects letters, trained on a high volume of data?
I want to make a program that will identify letters and based on that identify a whole word and perform some other logic
are you sure you're not asking about optical character recognition?
they usually use BiLSTMs for error correction, since those can take the left and right context into account.
oh interesting, I will read a bit about that and come back
by the way, do you have any suggestion about any open source library that I can use to perform OCR in python pls
the most popular was Tesseract for years, nowadays visual language models are also being used in some cases though
e.g. https://github.com/deepseek-ai/DeepSeek-OCR
yeahh, I have made use of tesseract, was very helpful, but if there are newer libraries, would really love to use them
The deepSeek OCR is a model that can be downloaded?
yes, the readme contains all commands and code necessary to install the dependencies, download the model and run it locally
yep noted, ty !
note that deepseek is much larger and more compute intensive than traditional methods like tesseract though
yeah I guess😭 , for my use case it's really for minimalistic thing, I think I will switch to a lighter thing
I noticed there is a library called EasyOCR, have you used it?
iirc it's just a wrapper around other libraries. never mind, I must be confusing it with another one
no
I tried it, though I would not recommend it if your background is colored. I had to do a lot of rule based amendments to the EasyOCR results
I am trying to learn how to make an Ai for a project I am working on. The ai will take in content from the user and reply in a kind comforting way almost like a therapist. I have no idea where I should start. I would appreciate any advice or suggestions!
you can look into Eliza, which is something that is possible for someone to implement on their own. It is not possible to train an LLM on one's own computer, or on cloud infrastructure, without significant costs.
if you've only started learning about AI in the last few years, pretty much everything that you think of as "AI" is unobtainable.
Thank you! I will look into Eliza.
https://d2l.ai/chapter_introduction/index.html
is this a good resource for learning deep learning?
Since the D2L library used there isn't compatible with Google Colab, I need to rearrange the programs.
I want to learn deep reinforcement learning, but I can't find good educational resources for it.
I found the problem. It was a lot simpler actually. For some reason all the males died and all the females lived. No data leakage, just a bad dataset. Thank you though.
h
oh ok, the problem with tesseract is there's a lot to configure, no? Like adding it to the PATH environment variable etc
yes but thats a one time thing.
once you set it up correctly everything works
Can someone please help me with this....😅 ?
https://discord.com/channels/267624335836053506/1442034333526261770
Replied
yep noted, ty !
the thing is, can it be used on a cloud platform like google colab?
I wanted to try something but I'm unsure if I can use it there
This might help you OCRusingTesseract on Google Colab
If it was the Titanic dataset, then it’s not a bad dataset at all, but a completely classic one, and you won’t get a figure like that if the analysis is done correctly. Although in the Titanic dataset there is no need for a second CSV file, so this may have been a different case.
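One quick way to check for the situation described above (a single feature that fully determines the label) is a cross-tabulation. This is a sketch on made-up rows, not the poster's actual files. For context, the sample submission file shipped with Kaggle's Titanic competition assumes that all and only female passengers survive, so attaching it as the label column would produce exactly this pattern. Whether the second CSV here was that file is only a guess.

```python
import pandas as pd

# Made-up rows for illustration; in practice use the real training frame
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Count label values per category of the suspicious feature
table = pd.crosstab(df["Sex"], df["Survived"])

# If every category maps to exactly one label value, the feature alone
# reproduces the label, and a score of 1.0 is expected rather than earned.
perfectly_predictive = bool(((table > 0).sum(axis=1) == 1).all())
```

If `perfectly_predictive` comes out true on a real dataset, it is worth checking where the label column came from before trusting the score.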
explain me roadmap of ai and data science
AI and Data Scientist Roadmap might help you.
you're a hero
does anyone have a good resource on implementing a custom matrix in python? i cant really find much online
what kind of matrix? "matrix" just means "2d array".
basically i need to implement a matrix that has 4 blocks of with dimensions n*n
what is a block?
idk how to really explain, like a mini matrix inside of a matrix
its like 4 matrices inside one matrix
how big is a block?
so if the blocks are n * n, then the size of the whole matrix will be 2n * 2n
!docs numpy.zeros
numpy.zeros(shape, dtype=None, order='C', *, device=None, like=None)
Return a new array of given shape and type, filled with zeros.
you can create a matrix of size 2n by 2n with the diagonal filled with the value
then take the second half of the matrix in both dimensions and fill it with a new value
something like:
import numpy
bigmat = numpy.eye(2*n)*value
bigmat[n:, n:] = value  # lower-right n-by-n block (rows and columns n..2n-1)
# or if the last line isn't working
bigmat[n:, n:] = numpy.ones([n, n])*value
thank you
alright ive figured out a nicer way
im using scipy's linear operator class
so i can avoid constructing 2 large arrays filled with 0s
so if i want to perform a matvec i can just do it on the two non-zero sections of my matrix
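A rough sketch of what that could look like with `scipy.sparse.linalg.LinearOperator`. It assumes the two non-zero n-by-n blocks sit on the diagonal (top-left `A`, bottom-right `B`), which the messages above don't actually specify, and `block_diag_operator` is a made-up helper name.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator

def block_diag_operator(A, B):
    """Hypothetical helper: acts like [[A, 0], [0, B]] without storing the zero blocks."""
    n = A.shape[0]

    def matvec(x):
        x = np.asarray(x).ravel()
        # Apply each non-zero block to its own half of the vector
        return np.concatenate([A @ x[:n], B @ x[n:]])

    return LinearOperator((2 * n, 2 * n), matvec=matvec, dtype=np.result_type(A, B))

n = 3
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
x = rng.normal(size=2 * n)

op = block_diag_operator(A, B)
y = op.matvec(x)  # same result as the dense 2n-by-2n product, using only A and B
```

If the non-zero blocks are the off-diagonal ones instead, only the slicing inside `matvec` would need to change.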
Hi everyone! I’m Stella from Zagreb. I recently finished a Python Developer course and have been diving deep into AI, experimenting and practicing to really understand how it works. I’m super excited to start my first projects and learn by doing, especially in Python + AI.
I’d love any advice, tips, or pointers on where to find opportunities or projects — or just general guidance on how to get started. Any help would mean a lot!
AI is a very broad term. Is there something in particular?
There's machine learning, LLMs, deep learning, natural language processing, computer vision, chatbots, and even robotics can be considered AI.
In all cases, I'd highly recommend diving into stats for anything in the machine learning route. I'd also recommend learning how to wrangle and visualize data.
What's your skillset atm?
You’re right, AI is extremely broad.
My main focus is working hands-on with large language models — experimenting with them, building structured interactions, testing their behavior, and understanding how they reason.
I’ve spent a lot of time doing deep practical work with LLMs: creating prompt systems, running simulations, analyzing responses, and pushing models to understand complex patterns. So even though I’m new to the Python job market, I already have strong practical intuition in how AI models think, learn through feedback loops, and how to guide them effectively.
If anyone here works with Python + LLMs or is building small AI tools, I’d love to learn, contribute, and help wherever I can. Happy to be here!
My skillset right now is a mix of early-stage Python development and deep hands-on experience with LLMs.
Python (beginner):
basics: variables, loops, functions, OOP
working with files, APIs
simple scripts and automation
currently learning best practices and looking for small real projects to improve
AI / LLM practical experience:
prompt engineering
designing structured conversations
building and iterating “AI agent” personalities
studying model behavior, consistency and memory
running small simulations with an LLM to test reasoning and interaction patterns
I’m still early in Python professionally, but I learn fast and I’m very active in experimenting with AI behavior.
If you have suggestions for small projects or beginner-friendly tasks, I’d appreciate it.
Can someone explain the self attention formula steps for Q dot product with K, the final computation with k and the argmax step after all of the attention values are accumulated? I can't seem to find a resource that explains it step by step so I'm mixing up which step happens where.
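A step-by-step sketch of single-head scaled dot-product self-attention in NumPy, with random weights standing in for learned ones (the shapes and names here are illustrative assumptions). Two details that often get mixed up: the normalisation over the scores is a softmax, not an argmax, and the final product is with V, not K.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Step 1: project the same input into queries, keys and values
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Step 2: dot product of every query with every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    # Step 3: softmax over each row, so each token's weights sum to 1
    weights = softmax(scores, axis=-1)
    # Step 4: weighted sum of the value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 tokens, embedding size 8
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
```

An argmax only shows up later, if at all, when a model picks the most likely output token from its final logits.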
getting into NLP would help I believe
NLP is just really intense classification
What makes you say that?
a lot of NLP tasks revolve around classification
Would you consider machine translation in that?
Hey guys, I got recommended by a kind lad to figure out a solution to my problem in python help, no one responded and my post got locked
its about rocket thrusters
you can just post it, we do a bit more with numpy so you might get a bite
these channels are slower though
One of my fav books 🫠 I've tried to send pdf here but server blocked me
hmmm okok
ill just post the full thing here
I am given a rocket and I need to figure out the values of Fmax, F0 and mass of the rocket by varying my thrust values and according to the a(F) function given in the screenshot, the acceleration varying in such a way.
I have attached the text file for the code for the right thruster (code is pretty much the same for the left thruster) and the F0 for both thrusters differ while the Fmax is the same.
The problem I get when plotting my results is crazy oscillatory (at least that is what I think it is) behaviour as I go through the thrust values. I have tried to instead use np.gradient but I am not very sure if that is a good approach for this.
This in turn will not allow me to obtain the values at a high enough accuracy and precision.
I have attached some images giving a clearer picture as to what my issue is.
looks like numerical noise coming from np.diff
What are the recommended methods and resources for studying mathematics relevant to data science?
ah okok i thought that would be the problem
now here’s my follow up, how do i fix this, should i just use np.gradient?
Savitzky-Golay Filter and the Kalman Filter might be relevant
ooo new concepts for me
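A sketch of the Savitzky-Golay suggestion, using synthetic data since the original thruster code isn't shown here. The signal, noise level, window length and polynomial order are all assumptions. `np.diff` and `np.gradient` both use finite differences, so they amplify sample-to-sample noise. `scipy.signal.savgol_filter` with `deriv=1` instead fits a local polynomial and differentiates the fit.

```python
import numpy as np
from scipy.signal import savgol_filter

dt = 0.01
t = np.arange(0, 2, dt)
rng = np.random.default_rng(0)

# Synthetic "velocity": constant acceleration of 3.0 plus measurement noise
v = 3.0 * t + rng.normal(scale=0.01, size=t.size)

# Naive finite difference: noise gets divided by dt and blows up
a_diff = np.diff(v) / dt

# Savitzky-Golay: fit a quadratic over 31 samples and take its first derivative
a_sg = savgol_filter(v, window_length=31, polyorder=2, deriv=1, delta=dt)
```

The window length trades noise suppression against smearing out real changes in acceleration, so it would need tuning against the actual sampling rate.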
what about true positives and false negatives?
could be due to imbalance, could be due to small sample size
506 true positive, 2 false negative
It's based on 80/20 split, there's 2922 data total
I feel likes it's something to do with classification
looks a bit imbalanced / positive case is way easier to classify
did you do stratified split? also class_weight / sample_weight?
Hmm, I think you are right, because the population proportions should be 77/23 for positive/negative, but the sample is around 14%,
for this, you should use stratified split
which would ensure that your train set and test set have similar class distributions
It is time based, as the closer the data is to the present, the better it is (i guess) to predict future outbreaks
Ah thanks a lot, I think I'll put more weight to 0 so the code think twice before labelling everything as outbreaks
I will try using it
Thanks a lot Purplys, now I will be sleeping
well then that complicates it a bit
usually you don't do stratified for time series, you just split based on time
speaking of which, you should e.g. order your data based on time then turn shuffle=False when splitting, otherwise you leak future information into the training set
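A small sketch of the time-ordered split described above, with made-up column names (`date`, `feature`, `outbreak`) and values. Sorting by time and passing `shuffle=False` makes `train_test_split` take the earliest rows as the training set and the most recent rows as the test set.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data, deliberately stored newest-first to show that sorting matters
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D")[::-1],
    "feature": range(10),
    "outbreak": [0, 1, 0, 0, 1, 0, 1, 1, 0, 1],
})

# Order by time, then split without shuffling: first 80% train, last 20% test
df = df.sort_values("date").reset_index(drop=True)
train, test = train_test_split(df, test_size=0.2, shuffle=False)
```

Class imbalance can then be handled separately, for example through `class_weight` on the model, since stratifying would break the time ordering.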
Oh god yes thanks Purplys and Nahita
Hi.
I've always enjoyed coding and I'm already comfortable with python and building small things. But I've recently realized that in order to get hired as a developer, you need to specialize in a field. And after a bit of research, I find myself drawn towards data science in python. However, I think it's worth mentioning that my math skills are not really good. The potential is there. But I've never really studied math as I'm a high school dropout. So, I'm seeking advice as to whether I should dive into data science or not. I have a few questions:
• do you need to be good at math?
• what kind of background do you need ?
• what is the best way to learn ?
• what are the best resources to learn ?
I would deeply appreciate any advice and thank you all.
Feel free to ping me any time
do you need to be good at math?
Yes, But it's something you can learn to be good at.
what kind of background do you need ?
Data scientists come from all sorts of background, but most share some computer science, math and stats background.
what is the best way to learn ?
That depends on what works best for you.
what are the best resources to learn ?
Depends on what areas you want to concentrate in. Paid options include DataCamp, which has very high quality courses. Then there are more formal approaches like college/uni. Free approaches include financial aid on Coursera, watching YouTube videos, or reading the documentation online.
Why exactly do I need to be good at math ? I’m sorry I’m asking too much but I just wanna be certain before I commit to it. So, why do you need to be good at math ? What do you do daily that requires math
Most of data science is writing code to compute statistics and plot things. This requires a deep understanding of the theory.
For instance, if you're an entry level data analyst you might need to work with dataframes, pivot tables, and compute various statistics, which means knowing what formulas to apply and when. If you're a data scientist then it's even more math-heavy and usually requires lots of rigorous experimentation, bias/variance tests, hypothesis testing and so on.
More advanced data science positions, like senior level data scientists or machine learning engineers also work with lots of differential geometry, calculus and information theory.
To be honest, that truly doesn’t sound like something I would be good at. I might have to take some time to consider it. However, if I don’t get into data science, what other fields do you recommend I specialize in ?
Is automation/scripting a field you can get a job with ? Because I really like that. I do build small stuff for myself sometimes
It's not a matter of whether you'd be good at it, it's a matter of whether you want to be good at it. The tech industry is currently saturated and jobs like those are harder to come by these days.
I see. Well, thanks for your advice.
If you like automation and aren't particularly good at maths, you could consider a Data Engineer role.
Thanks for replying. What is data engineering specifically and how is it different from data science ?
data engineering is when you manage the data infrastructure for a team, like how it gets stored in databases and made available for other team members
How is the job market for it ? Now that I’m looking into it, I really like it
the job market right now in the US is generally not great. but if you don't have a degree, you'll very likely need to get one, and the market might have improved by then
I’m not in the US.
idk what the market is like in your country, whatever it might be. you can ask in #career-advice
Thanks. I’m gonna also look into data engineering. Any resource to start with ?
I'm not sure. you should probably be comfortable with SQL and MongoDB
Thanks
You can check out this video to see how different roles on a data team work together. Some of those roles might overlap if you're working at a small company.
https://www.youtube.com/watch?v=tyJ476aNCYU
Watch this visual, animated breakdown of how modern data teams really work — including data engineers, analysts, scientists, architects, and ML experts collaborating on real projects.
I see that. Thank you.
thanks
Hi! Im curious on what some of your guys' favorite deep learning packages/tools are.
My current stack is pretty standard. PyTorch, Polars/Pandas, numpy, Sklearn, Scipy, etc..
I've had ~2 years of hands-on experience with both designing and training models (mostly time series and NLP related), and I'm wondering if there are any underrated or super useful tools that you recommend checking out.
for some things I prefer jax over pytorch, other than that just whatever solves specific (frequently niche) problems
e.g. docker image annoyingly large? use onnx over torch
and not really specific to machine learning, but I like markitdown to ingest any files and marimo for prototyping
What's onnx?
thank you! these seem pretty cool
ive always used jupyter notebook, marimo looks crazy
I have 360 wedding photos, all ready and edited. I wondered why I couldn't create an AI model to edit them. I came up with a few ideas and approaches:
First, I thought about it and asked Giminai to do it. He suggested training the model with the photos one by one. However, the result was messed up because he was editing pixel by pixel. For example, one half of the face would be over-lit while the other half was under-lit. The training would take about four hours. (I didn't like that idea; it wasn't what I wanted.)
The second idea was to find algorithms that adjust lighting and colors, along with algorithms that measure lighting and color levels. So I wrote code that extracted these measurements from all 360 edited photos into a table and trained on it ("unsupervised learning"), so that if I fed it data from an unedited image, it would predict ideal lighting and colors and apply them to the image. (The idea wasn't the best; the editing was weak.)
The third idea involved importing the images into Lightroom and changing the settings so they reverted to their original state. I then extracted the data, resulting in two files: x = unedited data, y = edited data. I tried training them using Random Forest Regressor, but the result was worse, especially in terms of lighting. The colors were somewhat good. (Here, I felt the problem was with the data itself, as there was a small amount of incorrect data, but I didn't think it would significantly affect the results.) So, the questions I want to understand are:
What's the best training method?
Does even a small amount of incorrect data affect the results?
Is this small amount of data the cause?
Is my approach to these steps sound? In your opinion, how would you rate my thinking of these alternative plans out of 10 (regardless of the project not yet being successful)?
And these are questions for those with experience 👇
Is it possible to train the model and, if the training is insufficient, create an interface where the model predicts and modifies an image and then displays the modified result? I would then have three options to click: the first, "No," means the image is corrupted; the second, "Maybe"; and the third, "Yes," means it's modified perfectly. If I click "Yes," it saves the data to the table and trains on it?
I would appreciate any helpful information, and if you have any ideas, please leave a comment.
Giminai
I just discovered notebooklm . What an incredible learning tool.
Taichi is a domain-specific language embedded in Python that helps you easily write portable, high-performance parallel programs.
(Numba killer)
Thank you!
this thing sounds insane
How do you get to the open source part? The link requires you to submit your business email.
I'm developing an algorithmic trading bot, so I'm wondering what you guys use in VS Code to visualize/analyse large datasets
A lot of people use Jupyter Notebooks inside VS Code. Personally I prefer doing all the data work in JupyterLab, and then I build "production-ready" Python packages in VS Code when needed.
that's true, because Jupyter Notebook is very popular right now
I'd say they're less popular now than they have been in the last five years, now that there's competitors like marimo, and people seem generally more aware of their limitations.
!warn @waxen crag your message was removed for advertising. And it's not really an open-source project if you have to sign up for something to be able to access the code.
:incoming_envelope: :ok_hand: applied warning to @waxen crag.
Hi, first time hearing about 'marimo'. Apart from Jupyter notebook, I've used Kaggle notebook and Google colab, but they all feel the same in usage with colab having AI integration in it.
I am curious, in what area(s) does marimo excel over the platforms mentioned above? Thanks.
the biggest upside is having no hidden state; the execution order depends only on which cells reference variables defined in which cells. you cannot run things out of order, so it's much harder to end up with results different from what you'd get running it fresh after restarting the kernel
about half of that comes from their reactive code, half from preventing you from doing things that are generally considered a bad idea though
(you cannot re-assign variables in different cells, and are generally discouraged from mutating things)
it also comes with some built-in UI elements and you can toggle between the code and a dashboard/webapp-ish view when using them, kinda like having streamlit built into the notebook
Thanks a bunch. Lots of features, I'd definitely give it a try.
I've only been doing ml and deep learning for a few months, but one library I found to be useful is skorch because it lets pytorch models interface well with scikit-learn functions. 🙂
Ooh cool thank you
hello, does anyone know about the ipywidgets library? I recently came across it. can anyone explain what it is and how it's used pls? is it just a UI that lets us set up some settings?
pretty much - though not necessarily "settings", just user interface inputs.
You could use it for experiments/simulations/training parameters, plots/tables filters, or even just something like a calculator
the documentation explains it fairly well imo, is there some specific thing you feel that it is missing? https://ipywidgets.readthedocs.io/en/stable/
will have a look at the docs, haven't dived into it yet, :c, ty !
guys do you know why the data looks so ugly? wasn't it supposed to be in clean, separated columns?
I just want it to look like this
it says that I got a total of 1 column, why is that ?
Use sep="\t" for tab-separated values in the read_csv arguments (that's the pandas name; Polars calls it separator). By default it's a comma separator because CSV is "comma-separated values"
thanks man
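A minimal sketch of the fix described above, assuming pandas and a tab-separated file. The sample data is made up, and `io.StringIO` stands in for the real file path:

```python
import io

import pandas as pd

# Hypothetical tab-separated data standing in for the file from the question.
raw = "name\tage\tcity\nAna\t31\tLisbon\nBen\t27\tOslo\n"

# With the default comma separator, each whole line lands in a single column.
wrong = pd.read_csv(io.StringIO(raw))
print(wrong.shape)

# Telling pandas the separator is a tab gives the expected columns.
right = pd.read_csv(io.StringIO(raw), sep="\t")
print(right.shape)
print(list(right.columns))
```

If the reported column count is 1, the separator is usually the first thing worth checking.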
Did anyone do anything using conformal prediction before?
just found this
better and faster than autogluon
Cloud data-platform notebooks are also often used, such as those in Databricks or Fabric, and within the notebook the languages used can be Python, PySpark, SQL, or Scala.
But are those actually different notebook types, or are they just reskins of the ipython engine?
They look visually similar to Jupyter notebooks, but the execution engine is Spark (and other runtimes depending on the cell type). Since Apache Spark is built in Scala, Scala is fully supported in Spark environments like Databricks, though not all platforms. Fabric has PySpark, SQL and Python. No ipython under the hood at all.
I've been working on one for a while, too. It's actually trading on a testnet right now. What kind of setup did you go with? As for VS Code extensions, maybe look at SandDance in the extensions.
Can someone who knows a lot about these things help me with an AI/vision problem?
Just ask
I have an AI model that can detect eggs and tell if they are clean or dirty. I want to be able to see how much dirt is on the dirty eggs, so I would like to make a mask of only the dirt. I have tried a few things but the results are not that great.
Does anyone know how I could make a precise mask of the dirt on the egg? Or have any ideas?
It's also important that the stamps on the eggs are ignored in the mask. I also have no idea how to do this or if it's even possible.
I feel like some simple thresholding should work?
yeah even just asking AI to write some OpenCV (cv2) code it seems like it should do the job
(code)
This is not rlly what I'm looking for. I'm looking for a mask of only the dirt on the egg, if you get what I mean
And it needs to be specific, and detect everything that's dirty and should not be on the egg
that was meant to be at most a starting point, not me doing your entire job for you
you can test a few different strategies and see what works - it'll probably involve a ton of trial and error one way or the other
I have tried some things myself as well and I've gotten this far (see picture). It looks really doable to filter it out from here. I tried to filter out the green with the inRange function, but I don't get anything out of it. The img is in HSV
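One common reason an HSV range mask comes back empty is the hue scale: OpenCV's 8-bit HSV images store hue as degrees divided by 2 (0 to 179), so green sits around 60 rather than 120. The sketch below is NumPy-only and just mimics what `cv2.inRange(hsv, lower, upper)` computes. The pixel values and the bounds are made-up assumptions that would need tuning on real egg images:

```python
import numpy as np

# Tiny hand-made HSV image (OpenCV 8-bit convention: H in 0-179, S and V in 0-255).
# Pixel values are invented for illustration.
hsv = np.array(
    [[[60, 200, 200], [10, 180, 150]],
     [[65, 40, 220], [58, 255, 90]]],
    dtype=np.uint8,
)

# Assumed bounds for "green"; tune these on real images.
lower = np.array([40, 60, 60], dtype=np.uint8)
upper = np.array([80, 255, 255], dtype=np.uint8)

# Same values cv2.inRange(hsv, lower, upper) would return:
# 255 where every channel is within bounds, 0 elsewhere.
in_bounds = np.all((hsv >= lower) & (hsv <= upper), axis=-1)
mask = in_bounds.astype(np.uint8) * 255
print(mask)
```

Here the second pixel fails on hue and the third fails on saturation, so only two pixels end up in the mask. Printing the min and max of each HSV channel inside the region of interest is a quick way to pick realistic bounds.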
Yeah i noticed😅
Ty tho! But was not completely what i was looking for
guys what is going on ??
You're trying to look at a file that isn't text in the text editor
why are you working with wheel files in first place?
I dont know why, but it popped out suddenly, maybe I touched something that I should not have
if you are just trying to download/install packages, you should use tools like uv or pip that manage downloading automatically for you instead of downloading these files yourself
python environments -> uses a default python env (which could be global or a venv)
existing jupyter server -> you must run the jupyter server separately, for example in another terminal. good for centralizing workflows in a single environment
For most cases, i recommend Python environments. It's simpler, less hassle.
I'm really new to Python. I have created many bots in Pine Script. Now I'm wondering which one I should use: Backtrader, Backtesting.py, or some other tool?
!ban 719846291332661259 spam
:incoming_envelope: :ok_hand: applied ban to @waxen crag permanently.
Hi all, I have been trying to use dask and a notebook to process 560 gzip files of data, each 100 MB compressed and 524 MB uncompressed, on my computer, which usually has 10-13 GB of RAM available. Is there a way to split the aggregates at the middle layer so they get written to their own files? I'm sick and tired of constantly having to cancel runs because of OOM
I know of repartition, maybe I'm just misreading this graph??
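One way to picture "write the intermediate aggregates to their own files" is the pandas-only sketch below. It is not Dask API, and it assumes things the question does not state: that the files are gzip-compressed CSVs, that the columns are named `key` and `value`, and that the aggregation is a sum (which can be combined from partial results). The helper name `aggregate_in_batches` is hypothetical:

```python
import glob
import os
import tempfile

import pandas as pd


def aggregate_in_batches(paths, out_dir, batch_size=20):
    """Write one partial-aggregate file per batch, then combine the partials."""
    partial_paths = []
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        # Each raw file is reduced to a small per-key sum right after it is read.
        partials = [
            pd.read_csv(p, compression="gzip").groupby("key")["value"].sum()
            for p in batch
        ]
        combined = pd.concat(partials).groupby(level=0).sum()
        out_path = os.path.join(out_dir, f"partial_{start // batch_size:04d}.csv")
        combined.to_csv(out_path)
        partial_paths.append(out_path)
    # The partial files are small, so the final combine is cheap.
    final = pd.concat(
        [pd.read_csv(p, index_col="key")["value"] for p in partial_paths]
    )
    return final.groupby(level=0).sum()


# Demo on three tiny made-up files.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        pd.DataFrame({"key": ["a", "b"], "value": [i, 10]}).to_csv(
            os.path.join(tmp, f"part_{i}.csv.gz"), index=False, compression="gzip"
        )
    files = sorted(glob.glob(os.path.join(tmp, "part_*.csv.gz")))
    result = aggregate_in_batches(files, tmp, batch_size=2)
    print(result.to_dict())
```

Because each batch's result is already on disk, a run that dies partway through can resume from the partial files instead of starting over.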
“Hi, can anyone tell me which project I should put on my resume?”
in Data analysis
the project you're most proud of or most relevant to what job you want
Data analysis
Could someone recommend some books covering data analysis for beginners
guys I don't know why but it's taking too long to connect with python
it might not work well with 3.14 yet, try using 3.13 or even 3.12
should i learn neural networks from scratch or should i directly learn tensorflow
I’d say learn the math and the theory first, but I’d recommend just implementing with PyTorch/TensorFlow after you understand how the structures work
You can technically get away with just using a python library and not know how everything works for some basic projects, but diagnosing your problems and improving your models will be very hard
For coding from scratch, there are things like gradient calculation and optimizers that might make it a little harder and more confusing for beginners to implement, which is why I suggest just jumping to the library
Those libraries take care of it for you
i know maths , i'm reading a book called neural networks from scratch in python , it's using numpy to create neural networks
yea ok i'll do that
I’d say from scratch is good if you want to dive super deep into the functionality and implementation, but if you just want to learn about architecture components and specific applications, libraries will probably help you a little more there
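To make the "gradient calculation and optimizers" point concrete, here is a minimal from-scratch sketch in NumPy: a single linear neuron trained with hand-derived gradients and plain gradient descent. The data, learning rate, and step count are arbitrary choices for illustration. Libraries like PyTorch replace the two gradient lines with automatic differentiation and the update lines with an optimizer object:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for y = 3x + 2 (values chosen for illustration).
x = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * x + 2.0

w, b = 0.0, 0.0  # parameters of a single linear neuron
lr = 0.1         # learning rate

for _ in range(500):
    pred = w * x + b                 # forward pass
    err = pred - y
    loss = np.mean(err ** 2)         # mean squared error
    grad_w = 2.0 * np.mean(err * x)  # dLoss/dw, derived by hand
    grad_b = 2.0 * np.mean(err)      # dLoss/db, derived by hand
    w -= lr * grad_w                 # plain gradient descent step
    b -= lr * grad_b

print(round(float(w), 2), round(float(b), 2))
```

With more layers, the hand-derived gradients turn into backpropagation through every layer, which is the part that gets tedious and error-prone to write yourself.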
guys is this where the data analyst chat is?
!warn 1401906940866465848 Your messages where you ask for work have been removed, as this is against the rules.
:incoming_envelope: :ok_hand: applied warning to @craggy creek.
Does anyone know a good course for pandas?
I recommend the kaggle pandas tutorial, which is interactive
Thanks i'll check it !
also read the official user guide if you haven't yet
guys anybody knows what should i master in python to become a data analyst
I’d say the standard libraries you must learn for any data related job are numpy, pandas, scipy, scikit learn, and matplotlib
These libraries were more than enough for me to get my first work experience at least
thanks bro
what is easier, data analysis or data science
The one you prefer
i mean i want the easiest one only
can anyone help me with this error ?
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pyarrow
Failed to build pyarrow
error: failed-wheel-build-for-install
× Failed to build installable wheels for some pyproject.toml based projects
╰─> pyarrow
The easiest one is the one you'll be the most at ease with, which is the one you prefer
learn pandas
Lately I’ve been noticing something funny:
the same intuition I use when I read people in real life seems to work surprisingly well when I work with AI systems.
It’s like… every model has a ‘personality rhythm’.
Every prompt has an emotional temperature.
Every conversation has a hidden structure.
I don’t approach AI academically — I feel it first.
Behaviour, resonance, stability, the way the system “breathes”…
and only later I figure out the technical name behind it.
It’s a strange skill to describe, but it lets me spot patterns fast, tune prompts intuitively, and stabilise behaviour before it breaks.
Does anyone else work with AI more through intuition than textbooks?
Curious to hear how you translate your human instincts into AI work.
hi
What is the best model in Ollama for coding?
But at the same time not too big, because I only have 1 TB of storage and 16 GB of DDR5 RAM
Hello everyone, is anybody ready to collaborate on any projects that I can get hands-on learning from?
Please let me know, I'm open.
I like to learn from people, guys.
i dont think storage matters a lot, its more of ur gpu, vram, ram and cpu
let's take DeepSeek Coder for example. looking at your storage and RAM, I'd guess you have an 8 or 12 GB VRAM GPU, maybe more (I'm not familiar with the latest GPUs). you can for sure run an 8B or even 12B parameter model very easily, but if you go for bigger ones like the 33B you'll have to offload some of the work to your RAM and CPU too, which could be slow, but it would still run. again, I'm no expert
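A rough back-of-the-envelope sketch of why model size matters here: the memory needed for the weights alone is about the parameter count times the bytes per weight. The quantization levels below (16-bit and 4-bit) are assumptions for illustration, and the estimate ignores context/KV cache and runtime overhead, so real usage is higher:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


for size in (8, 12, 33):
    print(
        f"{size}B params:",
        weight_memory_gb(size, 16), "GB at 16-bit,",
        weight_memory_gb(size, 4), "GB at 4-bit",
    )
```

By this estimate a 4-bit 8B model needs roughly 4 GB for weights, while a 4-bit 33B model needs roughly 16.5 GB, which is why the larger model spills out of a typical consumer GPU into system RAM.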
Hi,
I am interested in learning AI. checked out youtube courses & udemy etc to find an course.
looked into so many suggestions from reddit, however, could not find something explaining in simple way with pictures & sample programs.
if you know any, pls let me know.
@toxic palm a specific part of the AI field or everything that came in the last decade ? I'm a newcomer too but all I can say is that I watched sebastian raschka llm videos
just basics for now..
seems like raschka video tutorials are split in easy chunks, and his book has a bit more details. but maybe there's better. how solid on math are you
learning "ai" in the sense how to build them?
kind of what is AI, then writing some small AI programs etc
mhm that's cool. if you just want the basics then I'd say watch 3Blue1Brown's video (YouTube) on large language models, it's pretty simple to understand for beginners. there is also a person named Andrew Ng, who is really good at this. he has courses on Coursera; although they are paid, you can still watch all the videos of the course. and if you want to understand it more visually, there is this website called mlu-explain github io, which explains all the ways we train these "AI"
Hey gum, just checked raschka video tutorials. they are good. he is explaining step by step starting from environment setup. Thank you.
Regarding math: any specific things you mean?
thank you. Also, one basic qn,
when I ask about AI, why is everyone pointing to LLMs? Is it one concept within AI? What is it?
Got it:
AI (artificial intelligence) is a broad field that aims to simulate human intelligence and behavior. Under its umbrella are machine learning, deep learning, and generative AI. All three concepts share a common foundation: learning from data.
@toxic palm it's the mainstream trend these days. then there's machine learning (various kinds of neural networks, deep learning) ... long ago there was GOFAI (expert systems, prolog)
I find the semantic vector embedding idea nice
ok. which branch is better for starters?
Is anyone in here familiar with deep reinforcement learning ?
I'm trying to solve highway_env using DQN and am struggling a lot.
I don't know yet. Also, has anyone made a simulation with life in it?
"with life in it"? what exactly do you mean by that
A network that can pass traits on like a genetic algorithm but goes through the same processes similar to animals or humans in any regard
How is what you're saying different from a genetic algorithm?
Good point
elaborate more
I wanna build a tool compatible with Next.js... Can someone guide me on how to build an AI that can check 10 PDF files, each file having one page, either fetched from some database or uploaded by the user?
What should I go for: an agent or some fine-tuned AI model?
Use a Python backend (FastAPI) with Next.js frontend to upload or fetch 10 PDF files, then use OCR + structured parsing, and only call an LLM when needed. Start with a pipeline approach, not a finetuned model, unless your PDFs follow a consistent format.
hallo.
does this channel count as scientific computing?
this channel is where we talk about that
epic.
gang NumPy slicing is SOO hard to get in my head.
wdym the last index is 'exclusive'😭🙏
it's the same as list slicing, in that regard
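A short sketch of what "exclusive" means in practice, using made-up arrays. The slice `a[start:stop]` includes index `start` and stops just before index `stop`, so its length is `stop - start`, and plain Python lists behave the same way:

```python
import numpy as np

a = np.arange(10)        # values 0 through 9
print(a[2:5])            # indices 2, 3, 4; index 5 is excluded
print(len(a[2:5]))       # length is stop - start = 3

lst = list(range(10))
print(lst[2:5])          # plain lists slice the same way

# A handy consequence: a[:k] and a[k:] split the array with no overlap.
k = 4
print(a[:k], a[k:])

m = np.arange(12).reshape(3, 4)
print(m[0:2, 1:3])       # rows 0-1, columns 1-2
```

The half-open convention is what makes `a[:k]` and `a[k:]` fit together without repeating or dropping an element.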
