#data-science-and-ml
1 messages Β· Page 79 of 1
i should run this?
hi
Seems like all you want is: ```py
from io import StringIO
import pandas as pd
s = StringIO("""Team,Player,Tournament,Matches,Batting Innings,Not Out,Runds Scored,Highest Score,Batting Average,Balls Faced,Batting Strike Rate,100,50,0,4s,6s,Bowling Innings,Overs Bowled,Maidens Bowled,Runs Conceded,Wickets Taken,Best Bowling Figures,Bowling Average,Bowling Economy Rate,Bowling Strike Rate,4+ Innings Wickets,5+ Innings Wickets,Catches Taken,Stumpings Made
Delhi Daredevils,CH Morris,IPL 2016,12,7,4,195,82*,65,109,178.89,0,1,1,15,12,12,44,0,308,13,Feb-30,23.69,7,20.3,0,0,8,0""")
data = pd.read_csv(s)
team_2019 = data[data["Tournament"] == "IPL 2019"]
team_2019_newdf = team_2019[["Player", "Team", "Matches", "Batting Average", "Batting Strike Rate", "Bowling Innings", "Bowling Average", "Bowling Economy Rate"]]
team_2019_newdf.to_csv("newfile.csv")
yes yes, these columns are the ones which i need. but how do i add the data?
Change the "s" in the read_csv to your file name
The problem with your original code was you had brackets around the series... ```py
d = {
"Player" : player_2019,
"Team" : player_2019,
"Batting Innings" : player_batting_innnings,
"Batting Average" : player_batting_avg,
"Batting Strike Rate" : player_strikerate,
"Bowling Innings" : player_bowling_innings,
"Bowling Averagw" : player_bowling_average,
"Bowling Economy Rate" : player_bowling_eco
}
aighttttt
the course which taught me, said to use dictionary to create the new csv file
ohh i see the difference
also, in jupyter, use "display(df)" instead of "print(df)", it'll look nicer.
okk
can i ask what stringIO does?
it lets me treat a string like a file/io object
greatt, thank you soo much brother!, i was stuck on this for the past two days.
also, can you recommend some books or courses for this area of python
For data stuff? https://www.kaggle.com/learn is pretty good
aight thankss
Although I dislike Pandas' user guide I think it's also good to create a habit of reading the documentation of packages you use π
9 times out of 10 I learn stuff by just reading their official documentation. If it's sparse, shady or something odds are that I won't touch the package
Thank you so much! π
is 8gb macbook pro m1 enough for data science?
I heard that i need more ram for data..?
If you want to run large models, and within a certain time limit, you probably want a desktop, or just a simple laptop and use servers to run the models @compact valley
I wouldn't recommend a laptop to run any big models, but you can use services like google collab to run it on their servers
I have a desktop with 32gb ram
my company would also pay for cloud i guess thats an option here
32GB is enough, and most important is gpu for most models
And cpu, but that is often not the bottleneck
More than enough
My work laptop has 8GB ram
can i get help for an ARIMA model here
99 % of my development is through SSH. I cycle to work, I don't want to carry a GPU monstosity uphill on my bike π
Also, my dev VM only had 4GB ram in the beginning until I annoyed IT enough to make it 8 and subsequently 64. I work with several tens of millions of rows of data. It's stupid but being constrained memory wise teaches you how to do things "properly"
Which matters a lot if/when you scale to terabyte size datasets
shoot! π
start=len(train)
end=len(train)+len(test)-1
pred=model.predict(start=start,end=end,typ='levels').rename('ARIMA Predictions')
pred.plot(legend=True)
test['AvgTemp'].plot(legend=True)
When I run this code I get this error
TypeError: Model.predict() missing 1 required positional argument: 'params'
idk how to fix it i need it for like an essay for my highschool
Are you using statsmodels?
yeah
Have you fit your model already?
Yeah, you've fit it π now you can do model_fit.forecast()
I assume you want do out-of-sample forecasts? (note: what people mean with out-of-sample is that you fit your model with weather data until 2023 and then you use the model to see what 2024 is like)
hello every one
I'm just adapting an example I had in a notebook already: ```py
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
n_points = 100
x = np.linspace(0, 20 * np.pi, n_points)
noise = np.random.normal(0, 0.5, n_points)
y = 5 * np.sin(x / 2) + noise
p, d, q = 1, 1, 1
model = ARIMA(y, order=(p, d, q))
fit_model = model.fit()
forecast_steps = 20
forecast = fit_model.get_forecast(steps=forecast_steps)
conf_int = forecast.conf_int()
plt.figure(figsize=(12, 6))
plt.plot(y, label="Data")
plt.plot(np.arange(n_points, n_points + forecast_steps), forecast.predicted_mean, color="red", label="Forecast")
plt.fill_between(np.arange(n_points, n_points + forecast_steps), conf_int[:, 0], conf_int[:, 1], color="pink", alpha=0.3)
plt.legend()
plt.show()
print(fit_model.summary())
So you have your screenshot right? Make a new cell in your notebook and then simply write model_fit.forecast()
You need to put the amount of steps you want to forecast for in the method. Otherwise it'll default to 1.
Are you familiar with the "statistics" behind ARIMA by the way, how did you pick your order parameters?
(#data-science-and-ml message for a link to the original thread for this)
If it's weather data I actually think you need a SARIMA model
is it like this one
not that much
im just trying to get this math essay done for school
i hate ib
Wild they have you doing ARIMA in high school
nah its a self thing
That's crazy
i get to pick my topic
and i couldnt find anything for math
so i wanted to create a short term weather forecast
and it led me here
What's your "window"? How many days are you considering?
7 days
Okay then you don't need SARIMA you still might
What's the frequency of your measurements
wdym
Do you have a measurement every hour? Every day, every minute?
oh every day
Just one per day?
I think you're just plotting your test data as-is now, no?
i think so
When is your assignment due?
is there anyway i can show u my full code
6 days :/
If you have the time I'd review basic Python and also watch some videos on AR, MA and ARIMA
i mean after i finish the model i need to write like 2500 words abt it and stuff
i did
Maybe I'm not the best at this but there's too much for me to unpack
ive been watching this tutorial
and i had to watch like so many other tutorials in the middle to get where he was
Basically they tell you how correlated each measurement at time n is with n-1, n-2, n-3, ...
They directly inform you what the p and q of the (so the AR and the MA) of arima should be
In your case the I should be 0, you shouldn't detrend
i just need to like predict and show the next forecast
Your data being stationary or not is only related to the I parameter of ARIMA
If you just want to predict then you should save the result into a new variable so predictions = model_fit.forecast(steps=7)
this just shows me like
the data i already have tho
hey guys, could I ask a question about this field?
please just ask instead of asking if you can ask π
Oh sorry for that. I'm just curious about data science. Actually, I'm learning IT at my university and now I'm interested in data science. What things I have to learn first and what websites could help me to cover a whole range of topics in a simple way to kick off this journey :))
https://Kaggle.com/learn combine that with YouTube + courses from Udemy or DataCamp or DataQuest etc (if you're interested in making a financial commitment), and you'll be well on your way
Practical data skills you can apply immediately: that's what you'll learn in these no-cost courses. They're the fastest (and most fun) way to become a data scientist or improve your current skills.
Check our pinned message for additional resource
this book has been pretty good for getting a grasp of data science and libraries associated with it like numpy and pandas
https://jakevdp.github.io/PythonDataScienceHandbook/
This channel really needs a sticky (pinned?)
What's a sticky? Sticker?
!resources
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
I think there was a data science specific one? idk
!resources data science
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
oh hey that worked
I mean something in the channel description. Look at discord bots for a good example.
i remember stel did a similar command
The data science resources there are pretty meek though
they are. we were talking about adding also some data eng resources on there
Yah, I have a bunch Iβd like to contribute for the data science stuff
but i kinda dropped the ball on that
oops
@serene scaffold billybobby has volunteered. let me compile something for DE/MLE
Lol
Excel data engineering
(I'm joking)
No
python in excel
It's in a limited rollout
rip
so even though I have beta channel, they haven't pushed it to all beta users
im seeing it more and more on my newsfeed
i think even fireship did a video
Microsoft Excel just added support for Python, allowing developers to code and run custom functions directly inside a spreadsheet. Let's take a first look at how Python support will work in Excel.
#python #ai #thecodereport
π¬ Chat with Me on Discord
π Resources
Python Excel Announcement https://techcommunity.mi...
it's interesting... the execution is a sandboxed anaconda install that runs in amazon cloud with about 400 pre-selected packages. Can't run whatever you want, though.
400 pre-selected packages. i dont remember if thats more or less than what comes with regular anaconda
but, this is a good thing, when compared to the security nightmare of vba.
necessary evil, perhaps.
probs
Compiling a data engineering list is hard(er) because it can mean literally anything
it does but we will start somewhere
good ol' SQL fundamentals
I'm throwing my stuff togehter, one sec
I think the issue with DE moreso than with DS is that it's a really tools-driven field
You can make an argument that on some level you're learning tools and not concepts
Or the old heads that reduce all of the domain to dimensional modelling and dashboards 
I think I've got more somewhere, but here's stuff I had in Notion:
Prerequisite: Get good at Python (see resources/etc)
Overviews / Intros
https://shivaga9esh.medium.com/data-architecture-engineering-22164c8dbd43
Fundamentals Tools:
Pandas: https://pandas.pydata.org/docs/user_guide/10min.html
SQL: https://selectstarsql.com/
Analytical Databases: Snowflake, duckdb, clickhouse, etc
Introductory:
3B1B: https://www.youtube.com/c/3blue1brown
https://www.kaggle.com/learn
CS50 for AI: https://cs50.harvard.edu/ai/2023/
https://jakevdp.github.io/PythonDataScienceHandbook/
Math:
- Highlights of Calculus (Strang): https://ocw.mit.edu/courses/res-18-005-highlights-of-calculus-spring-2010/video_galleries/highlights_of_calculus/
Advanced:
- Mathematics for Machine Learning: https://mml-book.github.io/book/mml-book.pdf
- https://www.manning.com/books/deep-learning-with-pytorch
- https://scikit-learn.org/stable/user_guide.html
o and https://pythonprogramming.net/machine-learning-python-sklearn-intro/ - Fundamentals of Data Engineering: https://www.amazon.com/s?k=fundamentals+of+data+engineering
3Blue1Brown, by Grant Sanderson, is some combination of math and entertainment, depending on your disposition. The goal is for explanations to be driven by animations and for difficult problems to be made simple with changes in perspective.
For more information, other projects, FAQs, and inquiries see the website: https://www.3blue1brown.com
Practical data skills you can apply immediately: that's what you'll learn in these no-cost courses. They're the fastest (and most fun) way to become a data scientist or improve your current skills.
thats why joe reis and matt housley offer this framework. so peeps can start thinking in concepts instead of tools since they say tools come and go:
met them in person. real cool folks π
and ofc the diagram is from the first chapter of FoDE. i really sound like a broken record at this point

hey, I put fode in my list
and thats all that matters
I guess I should throw duckdb in my list then
I have never heard of that book interesting
You should add a Kimball / Inmon book to the list
That's how it started for me at least π€·
brand new. joe reis likes to describe it as the prequel to designing data intensive applications
data warehouse toolkit most def
Add polars to your list
hmm
At least for my work I prefer it much more than DuckDB or SQL in general. My transformations are awkward to express as SQL queries. It's unnecessary pain.
the killer is that there's never going to be a nice way to auth against internal databases
I'm still a big believer in the fact that people should YEET Pandas out of their data engineering stack
It's inferior in all respects compared to Polars except it integrating better with data viz and ML tools
The memory footprint being a lot smaller, better multi threading and the lazy API are just big selling points
even this is slowly going away - seaborn and Plotly both play nicely with polars
oh yeah that is killer
the main thing pandas has going for it is the first movers advantage a ridiculous amount of tooling layered on top of it
i see it more as a dif type of workflow
if that makes sense
Sure, it makes sense
I still use Pandas when I'm getting very close to my "final" application
E.g., when I'm getting close to sklearn and whatnot I use Pandas
I'm completely off of pandas at work at this point, which might screw me when I come to interview
I think Spark / Databricks are (sadly) data engineering essentials
Why read a small CSV in Python when you can wait several seconds for the JVM to start!! π
the saddest thing here is that the dataset api is great, but it's scala only. pyspark is just a bit rubbish
I'd prefer doing Spark in Scala if not only for the fact that UDFs are a lot less painful and they don't tank your performance as much
You'll be fine, just learn duckdb π
DuckDB is just SQL, there's nothing to learn?
it's more that the kind of roles I'll end up applying for will often have a pandas coding interview - and "trust me bro, let's just use duckdb/polars" isn't a great line
I know, I agree π
the response will be "OK nerd, but we have these pandas codebases"
Like all things, they have a lot of tricks up their sleeves - #python-discussion message
I don't fully understand the DuckDB hype
I get it for people that don't know Python
The people that are going for the whole DuckDB + DBT set-up
Oh, dbt is heaven... but:
For the rest DuckDB and Polars are quite similar in terms of capabilities. The question becomes "do you want to write SQL or have a dataframe API"
I want a dataframe API any day of the week because you get all of software's best practices for free
i get the duckdb hype. SQL is already the lingua franca of the data world, and it let's you avoid pandas which is basically a dsl largely separate from actual python
the duckdb team is just cranking out features and it's just so much better than what's out there. For example: columns with regular expressions or lambdas or Python UDFs or pivot/unpivot
TLDR; DuckDB continues to push the boundaries of SQL syntax to both simplify queries and make more advanced analyses possible. Highlights include dynamic column selection, queries that start with the FROM clause, function chaining, and list comprehensions. We boldly go where no SQL engine has gone before! Who says that SQL should stay frozen in ...
its okay. your next role can be in DE 
(I'm not affiliated, but just what they've shipped in the past year is amazing)
SQL overuse can be terrible as well
as_of joins, positional joins, etc.
It's declarative but when you need something that doesn't exist natively you end up tying together all of those declarative things into a procedural monster
i saw legit sql queries stored as database values this quarter
when you run the sql query in order to get more sql queries 
doesn't surprise me hahaha
What I end up doing is just collapsing dozens or hundreds of lines of code into a duckdb transformation... but I could use them inline:
Note the inline use of the global dataframe: ```py
import duckdb
import math
import pandas as pd
somedf = pd.DataFrame({"i": [i for i in range(22)]})
with duckdb.connect() as con:
con.create_function('factorial_method_native', lambda x: math.factorial(x), [duckdb.typing.HUGEINT], duckdb.typing.HUGEINT, type='native')
df = con.sql('select factorial_method_native(i) from somedf tbl(i)').df()
print(df)
yeah others like moyen have seen this as well
Do me a favour and try Polars π¦
Writing strings in Python also just sucks
it's a hack
Oh, I've used polars extensively.
Not trying to sell you, but there is a pythonic API if that's your cup of tea.
I've looked at DuckDB extensively tbh. I've even "sold" it to colleagues using worse tools (laughs in R)
but for me, I can join a polars df, with hive partitioned parquet files, with a csv file, in a single operation.
that's just black magic for me.
this is dope
Went to talks ran by elite data engineering teams on DuckDB etc as well
But I just still don't see why I'd squeeze it into my current workflow
It looks inferior to my current setup, just my 2 cents
the main thing putting me off of duckdb is that it's kinda redundant for analysis once you're data is already in some SQL database.
why pull data from snowflake just to change the SQL flavor?
i think it has a dif use case imo
I mean, if you're in snowflake, you can stay in snowflake, I prob wouldn't change that.
There's edge cases where I would use duckDB though, I think its streaming and larger-than-memory support is better than Polars
If I were really really memory constrained I'd move to DuckDB
But realistically, at that point they should hire someone else. I'm not a data engineer π’
Everywhere I've been everyone just sucked at it so I do it, with pleasure
Otherwise the data pipeline would be csv files on teams
I do, but I think that's the place where polars eats DuckDB's lunch - it can do most of the ducky things, but stays relevant later in the data lifecycle . you can pull data from a database, then you can write nice composable polars transforms over the top of it
Yah, I'd mix it in, personally... duckdb works over polars /arrow data too, so you can use either or both
chroma, vector db, is built on top of duckdb so theres that
i need that one meme where you lift the mask and its just duckdb underneath

I don't understand vector db's either
so you got some embeddings right?
I get that part
these are lists of floating point numbers
and you need to read/write them
super duper fast
What I don't get is why the tech is so overhyped
current db systems suck at this
oh yeah it is a bit overhyped atm
Just serializing it normally and doing a dot product works unless you have a really really large set of embeddings
the issue comes down to the model
People have been doing algebraic topic modelling for ages without vector dbs
some models generate embeddings that are only 300ish dimensions per "row" (looks at SBERT)
some are wildly large embeddings
with huge dimensions
Just benchmark matrix vector multiplication with different sizes in numpy and then ask yourself if the hype is warranted. I have and for me it was a "no"
i mean my work mentor built his own vector db doing just what you described
however
just look at how much money pinecone is raking in lol
there are some use cases where some stuff just doesnt scale
someone in this server was telling about it
and how they were waiting for a good vector db to come out
thats equivalent to cassandra
This exists but it's not proportional to the vector db propaganda
agreed
I did a course on information retrieval in uni and I guess this is why they reminded us that the triangle inequality is a thing
There's also extensive research on approximate nearest neighbour search etc
@past meteor i sent u a dm
Hi π
new to this topic, I have a huge trading data set of about 420k rows each day
what will be the best way to find similar patterns for all the data sets I have ?
And after that of course convert it into a model
That's a giant topic. If you want to dig in, I enjoyed this book: https://www.amazon.com/Advances-Financial-Machine-Learning-Marcos/dp/1119482089
I want to learn only the specific things for my project, im not planning on working in that field, im pretty bad with math lol
Maybe narrow down your question then?
im not sure how π
I thought maybe an AI could go through the data and find something
Can I DM u about it ?
What are good laptops out there rn for performing deep learning tasks and processing other ml models
I donβt dmβ¦ just ask here, someone can surely answer it
because hitting the data warehouse every time is expensive and slow
save your 10 GB subset locally and go to town with duckdb, polars, pandas, whatever
@serene scaffold i ended up compiling a basic list aimed towards beginners. ill send you a link to the notion doc if youre still interested in updating pydis resources
I heard snowflake is expensive as well though isn't it?
it adds up because the warehouse always stays on for 1 minute minimum, so even select 1 costs 1 minute of execution time
It really is beautiful technology though. Fully separate compute and storage. Predicate pushdown. S3-like files with a SQL interface that make it act like a RDBMS, ...
I'm wary of doing R&D in the cloud though
Does it have a cold start if the cluster is off?
Hey can someone assist me with a code pls. it's rlly URGENT. For volunteers ping me or dm
I'm trying to understand the point of max-pooling, at least on an intuitive level.
The book I'm reading is trying to explain this but I can't understand the explanation at all, thus I can't really formulate a concrete question besides "Why do we actually use max-pooling?"
For the sake of clarity: I know what it does, just not why it's used.
Thanks in advance
Edit: Trying to research this a little further - and it seems like there's no definitive answer. Am I digging too much into things? Because indeed, most of my learning experience with a lot of subjects related to ML could sort of be summed up with "X works, because someone thought X could work, so they implemented X into some models and noticed an improvement"
So perhaps there's a meta question here: Should I even bother trying to understand some of these concepts?
I've attached the explanation I'm having trouble with
Would this be the correct subchannel to discuss hdf5 files?
I'm on vacation and that's why I haven't been responding. Let's chat next week
stel enjoy your time off. dont mind me bud and sounds good 
def OHLCV(list_tickers, start, end):
ohlcv = {}
for t in list_tickers:
try:
data = yf.download(t, start=start, end=end, interval="1d", repair=True).dropna()
if not data.empty:
ohlcv[t] = data
except:
pass
return ohlcv```
Hey guys I'm working with a big ticker sample that includes delisted stocks but as you can see there's tons of price data series behaving incorrectly.
How can I identify these bad time series data so that I can remove them from my analysis?
I'm thinking something along the lines of "based on how the other prices reacted during time interval, remove these tickers from ohlcv"
Can someone help me with this code: ```py
def initialize_model(N,V, random_seed=1):
'''
Inputs:
N: dimension of hidden vector
V: dimension of vocabulary
random_seed: random seed for consistent results in the unit tests
Outputs:
W1, W2, b1, b2: initialized weights and biases
'''
### START CODE HERE (Replace instances of 'None' with your code) ###
np.random.seed(random_seed)
# W1 has shape (N,V)
W1 = np.random.randn(N, V)
# W2 has shape (V,N)
W2 = np.random.randn(V, N)
# b1 has shape (N,1)
b1 = np.zeros((N, 1))
# b2 has shape (V,1)
b2 = np.zeros((V, 1))
### END CODE HERE ###
return W1, W2, b1, b2```
I guess you'd have to look at the data to see what's going on. Are there a bunch of zero points in between? Are these after trading hours blips? In short: look at the data to see what the issue is
the thing is my sample size is over 300 stocks so checking individually would take me a whole day
could I simply .apply() something?
maybe for example: if stock volume is under x don't add it to ohlcv_dict?
Help in what way?
it's not passing tests
@slim flicker see #voice-verification
What tests...
Wrong initialization for b2 vector. Check the use of the random seed.
Expected: [[0.77951459]
[0.02293309]
[0.57766286]
[0.00164217]
[0.51547261]]
Got: [[0.]
[0.]
[0.]
[0.]
[0.]].
24 Tests passed
12 Tests failed
I don't know what I could change about the func. I did torch.randn and regular np.zeros() but got the same test output
When you ask for help, this is the kind of information you should give in the first message for your question.
Look at how you define b2
ah
You first need to inspect the data to understand what the problem is. Youβre asking for the medicine without a diagnosis.
you're saying don't ue np.zeros yeah?
I got the same output
Show
change b1 & b2 to randn
Perhaps you have spurious zeros, or perhaps gaps in data thatβs been filled incorrectly, or perhaps your charting code is rendering gaps as zeros, or perhaps youβve made a mistake and group disparate series under the same ticker, etc
Wrong initialization for b2 vector. Check the use of the random seed.
Expected: [[0.77951459]
[0.02293309]
[0.57766286]
[0.00164217]
[0.51547261]]
Got: [[-0.10106761]
[-0.05230815]
[ 0.24921766]
[ 0.19766009]
[ 1.33484857]].
@verbal venture the n in randn stands for normal. It does something different than the general purpose random array generator
ah ok, how long have you been doing AI for
A few years
all NLP?
Pretty much
figured
can you help with this as well? ```py
def back_prop(x, yhat, y, h, W1, W2, b1, b2, batch_size):
'''
Inputs:
x: average one hot vector for the context
yhat: prediction (estimate of y)
y: target vector
h: hidden vector (see eq. 1)
W1, W2, b1, b2: matrices and biases
batch_size: batch size
Outputs:
grad_W1, grad_W2, grad_b1, grad_b2: gradients of matrices and biases
'''
# Compute z1 as "W1β
x + b1"
z1 = np.dot(W1, x) + b1
### START CODE HERE (Replace instanes of 'None' with your code) ###
# Compute l1 as W2^T (Yhat - Y)
l1 = (yhat - y)
# if z1 < 0, then l1 = 0
# otherwise l1 = l1
# (this is already implemented for you)
l1[z1 < 0] = 0 # use "l1" to compute gradients below
# compute the gradient for W1
grad_W1 = np.dot(l1, x.T) / batch_size
# Compute gradient of W2
grad_W2 = np.dot(l1, h.T) / batch_size
# compute gradient for b1
grad_b1 = np.sum(l1, axis=1, keepdims=True) / batch_size
# compute gradient for b2
grad_b2 = np.sum(yhat - y, axis=1, keepdims=True) / batch_size
### END CODE HERE ###
return grad_W1, grad_W2, grad_b1, grad_b2
error is: boolean index did not match indexed array along dimension 0; dimension is 5778 but corresponding boolean dimension is 50
!resources ai
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
Why pooling? Because it's helpful to scale CNNs.
Why maxpooling? It introduces nonlinearity and the intuition is pretty natural: you're letting a group of neurons fire if at least one of them fires.
It also makes the whole computation more robust (for example if you detect an object only on one corner of your pooling box, the signal still carries over when it would have been 'obfuscated' by an avg pooling)
There are alternatives (most notably using a convolutional layer with stride 2 https://arxiv.org/pdf/1412.6806.pdf) but it didn't really improve the performances of visual models, so it's not popular outside of academics
It all feels so... "handwavy" I think is the terminology in English
When I thought about it further I managed to conjure up some random explanation where "The most dominant pixels on the feature map are carried over, and thus it's perhaps easier to concentrate on the relevant features of the image" but honestly it seems like any effort at trying to explain this phenomena on an intuitive level without the appropriate background is futile
i think this might be a nice read for you https://www.di.ens.fr/willow/pdfs/icml2010b.pdf
at the end of the day, max pooling is a heuristic. it doesn't always work nor make sense. the paper i linked explores the regime in which max pooling is beneficial. keep in mind downsampling is a common operation in signal processing that can be justified in many ways, e.g. the spectrum of the data is relatively low frequency, allowing downsampling without loss of information, or the data admitting a sparse representation in some domain. max pooling is one choice of irregular downsampling that can be beneficial depending on the properties of the data. there are also cases in which you do lose information by downsampling, but the high frequency information benefits from using the low frequency info as a "prior" or "initial guess", where it makes sense to consider both the downsampled data and the full data at the same time (think e.g. u-nets with "skip connections". but yeah, it doesn't always make sense and there is also no guarantee of optimality for it in general
but as you correctly point out, someone tried stuff and this worked in many cases π people looked for explanations and interpretations after
helo everyone
how can you check for top 5 states with top 10 most demanding services each in the us?
how exactly can i get this data?
for the most part, the cost is small enough to not matter for analysis. create a temp table with some level of aggregation, then it's just ~1-5s round trip times
this is purely vs duckdb that I'm talking about
Thank you both for the detailed explanation
Admittedly I don't entirely understand the terminology behind "High/low frequency data" (Not sure if signal processing is necessarily even taught in my degree) but I think I got the idea of what you're trying to say, something like "Transformations of the type F^n -> F^m where m < n can still be "rich" with information and often help the computer process it" or something along those lines? ^^;
I really just wanted to hear that there isn't a single concrete explanation for this and that I'm not missing some fundamental theory. So indeed, I think you put my mind at ease
sure, that's one way of looking at it. if we have a vector in F^n and it satisfies special properties, then a projection F^n -> F^m, with m < n, is invertible. you know how functions can be invertible from the left without being overall invertible (having no right inverse)
if you wanna think about it that way, it'd be that the original signal in F^n is in a vector space of dimension d <= m
then T: F^n -> F^m can be invertible from the left if you apply it to an input that is not in the null space of T
the case of "signal frequency" is a particular choice of T where we have a unitary matrix U in F^nxn, and we construct T by keeping m columns of U and transposing. the particular U of orthogonal complex exponentials is the "fourier basis" used to look at the frequency domain of signals, but other choices are possible
I think I sort of missed the point - Why do we care if it's invertible from the left?
My Linear Algebra is a little rusty, maybe that's why I'm not getting it
because that means you can recover the full info even if you downsampled
that's not necessary depending on what you do after, but it can be nice to have
Right but how does that matter in the context of Computer VIsion?
Ah
for example with u-nets that i mentioned earlier, you can use a network to denoise an image
you can do this by slowly downsampling an image down to very few samples, and then using those few samples to reconstruct the full image without noise
similar things are done all the time in image processing more generally under the guise of "auto-encoders"
That sounds insane
Completely out of context but is "Image processing" its own field?
Like ML or whatever
yesn't π
in all of signal processing, the general maths are transferrable
but in each application within it you can go arbitrarily deep & cursed
at some point you start seeing stuff that is only done in image processing and virtually no other signal processing/ML application
ayo edd hows you mate
hi
hows you ! and hows life been ?
been ok, i'm not sure i remember you though
glad to know you are ok and lol its fine if you dont remember me
you helped me once on a problem i was stuck on
i see, hopefully that worked out well π
well that did lol
someone help with the shape of the images pls
does this method have a name
the ML people would call this something like "latent space representation" or something like that
Principal component analysis'
PCA is one way of doing it, but not the only
Definitely .
technically any time you use a parametric representation with fewer parameters than there are data points, this happens
oh interesting i know about pca
i.e. fitting the 3 parameters of a sinusoid to 100 time domain samples
It's more general than PCA, I guess any compression and reconstruction pair can work
In uni I always wondered why PCA specifically was always mentioned but in the end, the properties of the method just make it ultra suitable
PCA is mentioned because it ties to covariance matrices and the fundamental theorem of linear algebra in one shot
super clean whenever your data is contained in a comparatively low dimensional linear subspace of the overall space the data is in
The orthogonality of the eigenvectors is a large selling point
i would say that's also not even necessary since you can always gauss jordan after finding a basis
Would that apply to say the bottleneck of an autoencoder?
how do you mean?
Conceptually they're similar things as your principal components, they "compress" the data
ah yeah
But that property does not hold for autoencoders
which property?
Orthogonality of the neurons in the bottleneck
it's kinda moot since it's nonlinear anyway
Tbh people have been playing with this, see beta-autoencoders
beta encoder means something related to ADCs to me, lemme see if i can find what it means in ML
Or beta VAE's, I'm not at my computer right now so I can't look up the specifics. The idea is just that people want representations that are independent
you'd have to define what you mean by independent. if it's the neurons you wanna make indep, you need to define some sort of inner product or distance metric
if it's an architecture that ends up generating a matrix of atoms, then the final result is some sort of frame and ideas similar to those of PCA apply
Hello everyone can anybody tell me this kind of loss graph indicates that my learning rate is too high or is there another issue?
I'm a bit rusty in the specifics as you can see, I haven't read the paper in a while
i'm also not savvy on VAEs, so i can only discuss this very superficially
but a cursory glance at a beta VAE paper makes me think this is indeed the case. some regularized learning of "latent factors", but i can't tell if these are used to decompose the data linearly or nonlinearly
in either case, it's more or less the same idea: if you can construct a decent model for your data, you can then focus on learning the parameters of that model instead of learning everything from scratch. those parameters are usually few
Personally I'm a big sucker for kernel methods (when I can use them) so I'd just use KPCA over these
i would also lump those in the same category
Of course, they're all conceptually the same
Hi π hobbyist playing with some dataframes, backtesting and data vizualisation here. Looking for suggestion for some cool library or software that i can use to further visualize the data. Heatmaps and matplotlib are fine but my dataframe with results will be 3 or maybe more axis, thinking i could find some more tools to assist me here?
I use all sorts of libraries. I work in notebooks a lot and have moved away from matplotlib, preferring Plotly. Seaborn has some interesting plots (violin plots) as well. Currently, my main is Plotly. Altair is pretty good too... altair uses vega as the js charting engine, and plotly has plotly.js, so these can produce some nice charts with interactivity.
Good, the backtesting.py lib i use plots price graph and other stuff with bokeh, but it is very heavy on the computer to load. Okay i used a bit seaborn for heatmaps will look into its other features and plotly. Would you visualize it in 3d if you had 3 axis?
so I wanted to build a feature that allows the user to say input into their microphone, then this input gets translated to english (if it isn't already in english) and then gets printed out in the form of text. Basically a speech to text model with translation
Now, the requirement of this project is that it needs to be done using an API (school mandated it π )
I rarely have liked 3d visualizations, it tends to be noisy and hard to visually interpret.
I tried doing this as my code, it records the input but then for some reason can't transcribe it
used whisper api from openai
What granularity is your data? Hourly or 5 min candles or something?
I'd appreciate any help I could get
I have 1minute data but rn im resampling it to 5min to save time and also bokeh plots evrry single candle so it lags alot
Maybe open a #1035199133436354600 thread and share some details? It's fairly specific, but maybe someone will jump on?
The lib should be able to resample before sending data to bokeh but it didnt work for me
I haven't used bokeh in a while, so not sure if it's slow, but I'd try a few libraries... and make sure it's the library. For complex (lots of points) charts, rendering statically might be better.
Yeah maybe i will just plot linecharts of equity and close prices with matplotlib instead. Its not that important, more all the metrics of the results i look at
fwiw, plotly.express is kinda nice:
import plotly.express as px
fig = px.line(df, x="date", y="price", color="equity")
fig.show()
Okay looks alot like matplotlib syntax iirc π ill have a shot at them all today. Like to just play around then eventually keep what i see works well
yah, it's intentionally matplotlibby
I see many use jupyter, but it never really appealed to me
I just append all the results into a html page
I primarily work in notebooks, but that's my business. But, we continually refactor anything complex into separate modules.
It's a nice environment for doing this type of work, sort of a all in one (repl, visualization, stateful kernel, etc).
I can see how its probably faster and more flexible than when im modifying the htm with .replace(), tag by tag π«£π
You could also use something like dash
Plotly Dash User Guide & Documentation
it looks actually quite nice
i think i would be better off using that to make some interactive functionality in the visualized data
and make stuff look pretty
I have a small issue with graphs and I dont exactly know which library would be best.
The data is:
x length = 4000
Y length = 10,000
Z length = 255
Drawing up the actual plot with matplotlib takes so long
i was actually thinking of using a third party software just for plotting but I was wondering what would be the best way to show this data
Remember that in computer science, a graph is nodes and edges. Data visualizations are plots.
If you have a lot of things that you're trying to plot, you should probably do some kind of down selection
you could aggregate it before plotting, but a third party software won't be much more faster than matplotlib if it were just trying to do the same thing as you're trying to do
(it could do it in a way smarter than what you're trying to do though)
the thing is currently I need to be able to visualise all data points for a better observation
so I cant get rid of any of the data at the moment, all individual nodes are important
If you actually plotted everything, would it actually produce a visualization that was easy to interpret?
spaced out, yes
ive tried it with a smaller Y length and its perfect, but I want to do the full 10k plots
Spaced out? Does that mean you'd need a giant monitor to view it?
you can probably aggregate it in buckets of 40x100x1 down to a 100 x 100 x 255 grid without any real loss
Ive been increasing with this ax.yaxis.set_major_locator(ticker.MultipleLocator(base=5))
If the plot that you produce can't fit on someone's screen, it's not useful.
Just needs to fit on my screen
Is your screen the size of a television?
I need a highres image that I can zoom in on
i wish
What would you do once you zoom in? Because you can make more than one plot
I have to go all of the sudden.
I just need to plot all of it so I can see which parts need to be edited and changed
It takes an hour with only 5000 plots and its killing me
if I chunk it, is their any way to kind of string them together?
im still trying to find ways to space out the whole scatter and increase the amount i can plot
Actually, Im gonna look into VisPy
ah I ended up answering my own issue, it was an issue of fig size haha
I'm imagining this: https://en.wikipedia.org/wiki/The_Great_Picture#/media/File:GP_Hanging_In_Camera_RJ.jpg
As of 2011, The Great Picture (111 feet (34 m) wide and 32 feet (9.8 m) high) holds the Guinness World Record for the largest print photograph, and the camera with which it was made holds a record for being the world's largest. The photograph was taken in 2006 as part of the Legacy Project, a photographic compilation and record of the history of...
fr if i had a place with a big enough screen, i would do that
like give me a 12k projector
rent a movie theater
anyway this actually works atm so im cool with it
I'll be sat with popcorn staring at a graph for hours
How'd you generate that?
Its 10 seconds of sensor data
I mean, what library?
Oh, interesting, came out pretty nice... assumed it was something else
its a 10k image res tho
so it takes up like 150 mbs per image
but i get good stuff like this
What kind of sensor data?
I believe vibration data?
yeah its vibrations over an array of 4000 locations
this is part of my thesis atm
except i cant get enough of making this lil plot spin
it spin
can someone please explain to me what %matplotlib inline does in Jypyter?

and then increasing it to
more bigger spin
That just says render the matplotlibs inside the notebook, rather than as a separate image.
I think it's enabled by default though, at least in my environment
I was going to ask that.
Yah, doesn't do anything for me. I think it used to create a separate image or something
I'm trying to use an implementation of FIt-SNE, but it runs out of memory on an A10 at this line:
dY = torch.sum(
(PQ * num.to(device)).unsqueeze(1).repeat(1, no_dims, 1).transpose(2, 1) * (Y.unsqueeze(1) - Y),
dim=1
)
Unfortunately, these variables are poorly named and undocumented. However, the function preceeding the line where the function runs out of memory is
sum_Y = torch.sum(Y * Y, dim=1)
num = -2. * torch.mm(Y, Y.t()) # (N, N)
num = 1. / (1. + (num.to('cpu') + sum_Y.to('cpu')).t().to('cpu') + sum_Y.to('cpu'))
num.fill_diagonal_(0)
Q = num / torch.sum(num)
Q = torch.max(Q, torch.tensor(1e-12, device=Q.device))
Q = Q.to(device)
# Compute gradient
PQ = P - Q
Q.to('cpu')
P.to('cpu')
Y is initialized to a tensor of zeros equal to the length of the dataset and the code given as well as the line that breaks is within a loop that runs for a given number of times. P is some tensor based on the dataset (I think it's the dataset after undergoing some dimension reduction).
Essentially: How can I change this expression to calcualte dY with less memory?
Ended up just changing matplot settings in the end with cython to execute it and be a tiny bit faster, but im finally getting the stuff i wanted
the actual file is about 150 mbs
hey
data_id_mean = data.ID.mean()
data.ID.map(lambda p: p - data_id_mean)
What does this actually do?
We take the mean of ID column but what's p here? What do we minus from mean of IDs?
p is any element the map function passes through to the lambda, i.e. each element of the ID attribute. I believe this would be a type of data normalization
hmm, I see thanks
While i always liked numbers, i think i should acquire some knowledge about statistics. I could ask a dozen questions about variance, standardization and quantiles, but maybe i should read a book about it. Not that i like books alot though. Any suggestions? And also, good morning
(Non professional data scientist, algotrader thats asking)
Statistics by David Freedman is pretty accessible and can be found online for free
If you prefer videos, statquest has a lot of short and accessible videos
Even if i only end up reading half of it like all other books i try and read.. still a half book wiser
While it makes sense to just loom at the bulk of the distribution (quantiles), the outliers do also have relevance π€ i think the big challenge is to narrow down the metrics to the most useful ones to assist an informed decision
I've narrowed it down to the following fields: Statistics, Linear Algebra, Probability theory and Calculus.
linear algebra, calculus first in that order
though calculus will probably be taught before linear algebra in college
Hello everyone,
After building Pytorch from source, CUDA is not available. Here is what I did:
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git submodule sync
git submodule update --init --recursive
conda install cmake ninja
pip install -r requirements.txt
conda install mkl mkl-include
conda install -c pytorch magma-cuda115
make triton
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py develop
Can anyone help?
Many thanks
probability theory is a nice extension of linear algebra and calculus
so you can do it after lin alg/calc
Oh nice good to know!
It will depend on your text book but we had mini proofs in probability theory that required a working knowledge of say integrals
Basically my situation right now is this: I'm starting a 1 year conversion course soon as some of you already know going from a Bsc in Nat Sci to an Msc in AI and ML.
I know a bit of statistical testing from my Nat Sci course but not too much.
I've been studying python for like 10 hours a day for the last month and now I'm trying to brush up on my maths. What would you all reccomend I focus on? I have roughly 20 days until the start of my course.
Just assume I'll be studying 10 hours a day up until the course starts
Because I will be
the calculus you need for ML i don't think is too deep, at least not to start; not that i'm that familiar with ML -- but linalg is everywhere
So like calculus 1?
maybe a lil bit of calculus 2?
yeah, like calc 1 to understand back propagation
Rodget that
Many thanks mate
I'm actually watching an udemy course on the fundamentals of math right now. Just to refresh my math knowledge and have a solid foundation going forward.
I'm thinking: Math fundementals (refresher) --> Algebra (refresher) --> Linear Algebra --> Calculus 1
hey guys,I had a question.
WHen I type print( in my jupyter lab, it doesnt complete the other bracket at the end ,and nor does it automatically add the ending quotation mark if I type the opening one,
Any fix?
are you typing in a markdown cell?
no in a code cell
I'm not sure then mate sorry. I'm new too.
me and my friends have been debating.
For just general AI and data science, higher ram speeds or more ram volume?
like we where talking about how you could chunk it to make more out of the speeds, but then more volume means you can just load it at once
which is better?
hi, did you try the "Settings" in the menu, then "Auto Close Brackets for Text Editor"
yeah,ty
below a certain threshold, lower RAM can make some things pretty much impossible.
I guess slower RAM will only ever make things slower.
yes it's applied math
Like an engineering discipline the real world matters (the problem you're solving) and also your constraints (stuff related to comp sci)
So its just applying maths to a problem within the constraints of computer science
Do you lot think that trying to code math problems into python is good practice for a career in ML?
math problems and formulas and equations and such
that's unavoidably a large part of what you'll do
along with the practical implications of handling huge amounts of data and doing math with stuff that doesn't fit in memory
fwiw, these videos are wonderful intros for linear and calc. Just a high level intro, but I'd suggest watching it before and after you take either course... it's just really well done: https://www.youtube.com/@3blue1brown/courses. And for calculus, this is a gentle intro to Calculus taught by one of the great professors of the subject: https://ocw.mit.edu/courses/res-18-005-highlights-of-calculus-spring-2010/. There are probably other options for linear & calculus deep dives, but his explanations are probably my favorite. I absolutely love his proof of the derivative of e^x.
Hi guys, is this the right place to ask questions about Jupyter?
yes, ask away
I think I solved it. The problem was with the fact that jupyter closed the cell for editing right at the moment when I typed something and stopped even for a short moment. Presumably it was related to autosave setting of vs code, I'm running notebooks inside it.
a friend of mine suggested the following approach:
from uniqed.runners.tof_run import detect_outlier
df = detect_outlier(ohlcv_dict["UPRO"]["Adj Close"], cutoff_n=80)
Cell In[4], line 1
df = detect_outlier(ohlcv_dict["UPRO"]["Adj Close"], cutoff_n=80)
File ~\AppData\Roaming\Python\Python310\site-packages\uniqed\runners\tof_run.py:34 in detect_outlier
np_time_series = time_series.values[:, 0]
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
but my test run was unsuccessfull, I've tried reset_index() on the "Adj Close" series to create a "Date" column, and .reshape(). Any ideas?
I'm just saying, you have to understand the mechanisms behind the actions before applying filters and outliers. It's very likely there's an easier way to filter the data.
For example, if there's an erroneous 0 after trading hours /etc, perhaps you want to remove them. Or, perhaps you just want to restrict your data to the trading window.
yfinance registers prices on close so it's probably not after hour data
You seem to not want to look at the data? I don't really get it. Pick a spot where a particular equity has a lot of zeros and look at it.
The dataframe, is what I'm talking about.
sure but saying "it's after hour data" without even looking if yahoo finance registers it
kinda 
should i install packages like tensorflow globally or in virtual environment? I am beginner.
in a virtual environment
hey so im writing my college application and one of the supplementals is why i want to do the honors program, im saying that i want to do it for the mentorship so i can learn a ton ab ai and using it (which is true) but i dont know what there is in the real world with ai that isnt taught in the basic courses i screenshotted below. any thoughts on some things that college doesnt teach you in ai that a mentor from the industry could?
I'd assume AI fundamentals will not go over things like deploying trained models and industry concerns when distributing ML services to users such as alignment and adversarial attacks. Depends on exactly what your mentor would do though, there are applications of AI/ML that aren't distributed to users ofc.
industry probably isnt the way to go with this i dont think, because i have internships for that. stuff in the actual theory is more aligned i think
i have a bug in my backpropagation function i have a feeling it's with calculating the derivative. can you tell me if i have a obvious flaw in my code?
the code is in c++ but i think the its understandable
double derivative_outputlayer(double current_weight, double output, double prev_output, double target)
{
double f1 = output - target;
double f2 = output * (1 - output);
double f3 = prev_output * current_weight;
double derivative = f1 * f2 * f3;
return derivative;
}
double derivative_hiddenlayer(double current_weight, double output, double prev_output, double sum_gradients)
{
double f1 = output * (1 - output);
double f2 = prev_output;
double f3 = sum_gradients;
double derivative = f1 * f2 * f3;
return derivative;
}
Hi. Anaconda keeps telling me to update to latest version.
If i update, will it update all installed packages across my envs? Will my old projects break?
Can I comment on the code itself a bit?
-
I would give more descriptive names to f1, f2 and f3 especially considering f1 in function 1 == f1 in function2. To me this detracts from the readability.
-
I'd rename the function to make clear that you're dealing with a sigmoid. If I see "derivative_outputlayer" I then check the body of the function to see what loss you're using. You can help the reader a bit out here π
thanks, ill try to impplement it!
double derivative_outputlayer_sigmoid(double current_weight, double output, double prev_output, double target)
{
double error = output - target;
double activation_derivative = output * (1 -output);
double derivative = error * activation_derivative * prev_output
return derivative;
}```
it's also strange that you're doing the derivative w.r.t. a single neuron in the previous layer
im kinda confused anyways with the howl the backprop thing. i watched a few videos and everytime when i watched them i had a million question which were left unanswerd
Okay can I give you a protip then?
Start smaller. Just write out gradient descent in any language you want. Then turn it into stochastic gradient descent. Then add regularization.
Write your own data generating function. Play around with the amount of regularization, play with the batch size. See what it's doing and try and find out why.
Once you've done SGD for a simple linear model implementing backprop will be a bit easier. Also, it's important to know that nowadays people don't do backprop like this. They use automatic differentiation
ok, thanks! i think its the bes advice ive ever gotten in my howl Journey.
ik that backprop isnt as relevant nowadays but i find it quite interesting and im only gonna use it to apply for a apprenticeship
bumping my post from yesterday
Automatic differentiation is still backprop, it's just done differently. You can kind of conceptualise it as a class/type that wraps a scalar, vector, tensor, ... basically each time you do any mathematical operation with that type it "acts" like a regular math object but it stores the gradient internally
https://arxiv.org/abs/2106.11342 <= this book also uses the method of starting with linear regression and moving to neural nets if you want more "guidance"
This open-source book represents our attempt to make deep learning
approachable, teaching readers the concepts, the context, and the code. The
entire book is drafted in Jupyter notebooks, seamlessly integrating exposition
figures, math, and interactive examples with self-contained code. Our goal is
to offer a resource that could (i) be freely av...
ill look into it tommorow, thanks!
How big is your dataset?
42000 elements
If I recall correctly, but you'll have to fact check me on this, t-sne does form some kind of pairwise similarity matrix, basically an NxN thing
Sorry, I forgot to copy the post with the dimensions. PQ and num are (42000 x 42000) while, Y is (42000 x 2)
42000x42000 x floating point precision
Yeah pretty much. I'm not well versed in how t-sne works, I'm just trying to visualize datasets so that I can display coreset coverage, and there aren't any good torch implementations of t-sne, so I'm trying to get the only not-completely-garbage one I could find to work
If it's double precision you're looking at idk 14.1GB ram?
How much memory do you have?
Does it need to be from Torch?
That step is trying to create 13.2GiB, so yeah that seems accurate.
It does in fact need to work with torch. Keras will not work for my implementation. I have 2 A10s, although it seems that this can only utilize one at any given time.
I should really look into the algorithm in detail but sklearn's docs strongly imply they have a method that defers forming that NxN matrix but is slower https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html
Examples using sklearn.manifold.TSNE: Comparison of Manifold Learning methods Manifold Learning methods on a severed sphere Manifold learning on handwritten digits: Locally Linear Embedding, Isomap...
Look for the bare necessities, The simple bare necessities, Forget about your worries and your strife, I mean the bare necessities, Old Mother Natureβs recipes, That bring the bare necessities of l...
I don't believe that the sklearn method works with tensors
nope
It also seems to be a CPU-compute method, so it won't function well with such a large dataset, let alone what I need to eventually do with IMAGENET
Plus it'd be a moot point if it goes ahead and forms that pairwise distance matrix anyway.
Are you open to other ways to visualise your dataset?
Such as?
It seems that t-SNE is the standard now in domains that require a representation of "distance"
Heads up though, I've had a month long holiday and I'm a bit rusty ML-wise so maybe I'm not seeing something.
If you have images you can just ram them through any pretrained model and compute whatever distance you want. Just remove the top of the network, flatten or pool, whichever you want and then compute the distance
Also, at that point you have a vector. You can PCA this vector if you want and plot that way. This is what I did in the past sometimes.
Not really what I need. I'm working on coreset selection, so I need to display the positions of selected coreset vs redundant elements within datasets within the domain space.
Let me read your original post and think for a bit. EDIT: not much to go on for your problem π
What exactly do you mean with "the positions of selected coreset"
I can't say much on the topic, but I'm attempting to use coreset selection to guess the bounding manifolds of the individual classes within a space. I want to display the dataset in a way where I can highlight certain points based on said coreset selection.
Hello! I have a question. I have an algorithm that gathers data on financial performance, keeps it and yields it to an AI. However, im facing problems feeding the ai bc the amount of data is too large. To solve this, I am trying with langchain models, but i dont know which one and how to use it. Does anyone know about it ? If not, do you know any alternatives?
langchain agents*
I see! I hadn't heard the term coreset selection specifically yet. Another niche I just discovered. I guess this is very specific so I'm probably not of much help.
My profs loved kernel methods so we covered Determinantal point processes which are probably coreset selection now that I know what the term is
You'll run into the same issue there cause as you probably know Kernel methods need that NxN matrix as well πΏ
It's a niche that's gaining ground due to the growth of massive unlabeled datasets with things like LLM training, so I'd suggest looking into them. CRAIG is a good algorithm to start by looking at.
The issue isn't the NxN matrix, so much as that the line I posted is creating a 2nxn matrix for some ungodly reason.
Yes but the NxN alone means you'll never scale. Having it twice sucks yes but just the one is already a death sentence, no?
Or am I missing something, especially since you mentioned imagenet
I guess you're correct there. Hmm well since I only need to calculate it once for each dataset, I guess a CPU-bound option works. How fast is the sklearn option? Even if I need to run over the course of a few days, there shouldn't be too much of an issue.
The docs say it can take hours π€£
Hi, I need to process 200000 images. I want to store them into an hdf5 file, 3d array. First dim is the image number. I track the image number. I create a hdf5 dataset. Can I populate the array only by keeping track of the image number in the parallel processing?
dset[image_number]=converted_image
Then I relay on my image_number to sort this out in the dataset the position?
H5py puts the converted data via the image_number into the correct location of the dataset?
what do you mean by "correct location"? when you iterate over the dataset, the images should be returned in the order of image_number, yes.
sure, although analytics queries can be pretty slow even against temporary tables, depending on what you're doing. i think the actual use case for duckdb is embedded real-time analytics inside applications. personally i'm happy with pandas and polars for analysis work.
Hi, sorry not see. I have to fill a new dataset in a hdf5 file.
I am trying parallel processing, the images are distributed and come back now converted. They have to be placed back into the dataset (3D np.ndarry).
I use zmq
I am trying to convert a GZ File to a CSV File however I am running into errors, this is my current code:
import gzip
import csv
import io
with gzip.open(r'C:\code\un-general-debates-blueprint.csv.gz', 'rt', encoding="utf-8") as vFile:
file_content = vFile.read()
#file_content = csv.reader(vFile)
nlpFile = open(r"C:\code\un-general-debates-blueprint.csv", "w")
nlpFile.write(file_content)
nlpFile.close()
print(file_content)
Does anyone know how to possible transform this GZ File to a saved CSV File?
What errors?
This is the error:
Can anyone help me get into neuroevolution??
Any NN journey? im new to ML started with sound analyzation with logistic regression, planning to move on naive bayes and neural networks or even more of them in time.
i dont know much about NNs i want to gain experience with data management first.
post as text, not image please
there's no one journey that works best for everyone. imo you'll do well by following your interests. if you like audio analysis, keep doing that. you'll eventually find excuses to learn all the things you need to know. for example data management: try storing metadata in various formats, figure out how to store binary data, figure out how to store your fitted model, etc.
the only caveat here is that audio data can be a little complicated, and it tends to be easier to work with "tabular" data (like what you might find in a spreadsheet)
but you can work with audio metadata in tabular form, like a database of songs, year released, etc.
like you could try to predict genre using artist and title or something
I actually planned audio as something thats fun to develop and make me learn AI better i will look to these.
at the same time, plenty of people do machine learning with audio and image data just using a big folder of files & a text file of labels
yeah keep doing that. lots of opportunities to explore the various tools used in machine learning
Yeah im planning to learn them after i master data management etc in basic models probably will start learning NN,random forest,decision trees in few months. theres too much to search.
Thats a cool option too but i already started trying to classify signals.
the field is huge, take your time
Okay thanks for all advices.
i don't think you need to master data management. data management is a means to an end. start simple, and if your needs expand you will learn more things.
Im just trying to have practical skills when i develop a ML model i dont go into things that i dont need in that modal.
Not able to install lazypredict library
def log_retornos(n):
df = pd.DataFrame()
for t in clean_price.keys():
df[t] = clean_price[t]["Adj Close"]
df = df.dropna()
sec_returns = np.log(1 + df.pct_change()).dropna() * 100
sec_outliers = pd.DataFrame()
for t in sec_returns.columns:
sec_outliers[t] = detect_outlier(sec_returns[[t]], cutoff_n=80)["TOF"]
for t in clean_price.keys():
try:
plt.figure(figsize=(22, 10))
plt.rcParams.update({"font.size": 19})
plt.plot(sec_returns[t])
outlier_positions = sec_outliers.index[sec_outliers[t] == 1]
if not outlier_positions.empty:
valid_positions = outlier_positions.intersection(sec_returns.index)
plt.scatter(valid_positions, sec_returns[t].loc[valid_positions], marker='x', color='red', label='Outlier')
plt.title(f"Log Retornos em % - {t}")
plt.legend()
plt.show()
except:
pass
log_retornos(int(input("Defina o nΓΊmeros de preΓ§os vizinhos: ")))```
Traceback (most recent call last):
Cell In[3], line 30
log_retornos(int(input("Defina o nΓΊmeros de preΓ§os vizinhos: ")))
Cell In[3], line 10 in log_retornos
sec_outliers[t] = detect_outlier(sec_returns[[t]], cutoff_n=80)["TOF"]
File ~\AppData\Roaming\Python\Python310\site-packages\uniqed\runners\tof_run.py:39 in detect_outlier
).fit_transform(np_time_series)
File C:\ProgramData\anaconda3\lib\site-packages\sklearn\utils\_set_output.py:142 in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
File ~\AppData\Roaming\Python\Python310\site-packages\uniqed\transformers\transformers.py:25 in fit_transform
return self.fit(x).transform(x)
File C:\ProgramData\anaconda3\lib\site-packages\sklearn\utils\_set_output.py:142 in wrapped
data_to_wrap = f(self, X, *args, **kwargs)
File ~\AppData\Roaming\Python\Python310\site-packages\uniqed\transformers\transformers.py:21 in transform
X = self._embedding(x, self.d, self.tau)
File ~\AppData\Roaming\Python\Python310\site-packages\uniqed\transformers\transformers.py:37 in _embedding
X = np.zeros((embedded_length, d))
ValueError: negative dimensions are not allowed
While trying to plot log_returns with markers on outlier data, I've been getting the following error
Correct location: The images have been measure in a sequence. To keep this sequence intact, it is important that images from parallel processing come back into this order when saved.
Another question in this context is how ZMQ and multiprocessing ensure that the workers return the data back into the corect order
Hi everyone. In a classification problem, is it correct to one hot encode date like this? 'day_of_week_0', 'day_of_week_1', 'day_of_week_2',
'day_of_week_3', 'day_of_week_4', 'day_of_week_5', 'day_of_week_6',
'month_1', 'month_2', 'month_3', 'month_4', 'month_5', 'month_6',
'month_7', 'month_8', 'month_9', 'month_10', 'month_11', 'month_12'
Any algorithm engineers lurking about?
If you have a question please just ask it π
guys ik this is outa context. I need pc experts opinions on this. I when i play a game like gta v i get around 300 fps low that was about 2 years ago. I do the same thing now i cant even reach 150. I have updated Graphics Drvrs and the exact same settings on everything. whats the issue
In a convolution neural network does a convolution layer reduce the dimentions of the input picture
it depends on the exact configurations of the convolutional layer, if it only has one filter and the dim of the kernel/length of the stride are greater than one then the output of the convolution will be smaller than the original image. However in practice almost all convolutional layers are going to have multiple filters which ends up increasing the number of parameters output by the layer. Generally we chain convolutional layers with pooling layers to reduce dimensionality and introduce other kinda of spatial invariance.
Im taking two courses in AI and neural networks at my uni. Is it normal to not get what biases and weights do to my dataset at the beginning? I'm having problems figuring out what it does to my results as im not understanding the lecture jargon.
that depends on what were the prerequisites/how early on in your study program you take the course
well were two weeks in and i kinda feel like were getting thrown in the deep end really quickly, but thats me. Our first task was to make a perceptron classifier, from which understand only works if i can make a linear line from the dataset? We then had to adopt a SVM classifier from lecture notes that would give me a different result if its not linear from what im understanding. And im still blank at what my biases and weights do to these classifiers
were using an iris dataset, much like this https://scikit-learn.org/stable/auto_examples/svm/plot_iris_svc.html
and to answer the prerequisites. We had an introductory course last Fall semester, but it was more about soring algos and a use of Kaman filter to help a pygame get 100% hits
Bias and weights are kinda fundamental. Maybe watch 3b1b, first vid covers it: https://m.youtube.com/watch?v=aircAruvnKk
What are the neurons, why are there layers, and what is the math underlying it?
Help fund future projects: https://www.patreon.com/3blue1brown
Written/interactive form of this series: https://www.3blue1brown.com/topics/neural-networks
Additional funding for this project provided by Amplify Partners
Typo correction: At 14 minutes 45 seconds, th...
Only matrixes
all right. the weights form a matrix
it gets multiplied to the input data vector. then you add another vector, the "biases", to shift the result
this is called an "affine transformation". it transforms one line into another, and then moves it around in space
Hi. a little issue with a DataTable in dash / plotly. as it was convenient for me to transpose the dataframe used, the DataTable is not displaying the row header
im assuming the weights move it in another direction?
that, and stretch as well
so what should i be look at when it comes to tuning my data assuming my classifiers are built corretly?
All i got now is a plot with some dots either side of the median line
That question sounded amateurish, but that is my current level π
i'd start by checking that 3b1b video, and then reading in your coursebook about support vector machines
Implementet and SVM into my code, just ended up with the same accuracy as they are both linear
But yes, ill save the video for the morning.
Sadly, no course books, just power points... :/
i find that hard to believe, uni always expects you to spend at least as much time as you do in class, doing self studies. the slides and syllabus have no references?
I'm not sure, sec_returns contains negative values, and I'm not sure you want it to. Other than that, I can use the uniqed package the way you try to. I don't know what's in clean_price.
It would probably help if you explain what you're trying to do. E.g. what types/ sizes do you expect every input and output to be?
The reference is usually documentation, like svm turtorials and https://medium.com/@zxr.nju/what-is-the-kernel-trick-why-is-it-important-98a98db0961d and those lines
When talking about kernels in machine learning, most likely the first thing that comes into your mind is the support vector machines (SVM)β¦
oof
I'm abit frustrated as the lecturers are above my classes head, they are really good at this. Just not teaching. Never worked with weights and biases before even in the introductory course, just sorting algos and implementing kalman filter to a pygame :/
You can ask me anything about SVMs @nocturne hornet, do you have any specific questions?
To me this question wasn't worded super clearly though. You mean you're unsure of the impact of the weights and biases before training?
Hi, has anyone ever made an AI that geotags photos? If so, which dataset did you use?
I was going to make a joke about Trevor rainbolt
Curious if any ai can beat that guy
The best way to guarantee ordering in any concurrent or parallel processing is to include the sequence number in the input and output, that way the data can be handled in any order, and can be re-ordered as needed later
oh I was trying to do .dropna() but since sec_returns has columns of delisted stock price data, it was removing all rows
this was one of the goals, removing the time interval where stock return == 0%, because it translates to a stock delist
Log Returns in % - BGP
ps: plots in portuguese, I know 
problem is, what to do with these illiquid stocks with huge price movements @glacial rampart
Oh, you have a wide formatted dataset (one column per security)? dropna, by default, drops any row with an NA value (default axis = 0)... you can also drop columns with NA with axis = 1: the risk is that you'd drop any security with a single NA (such as an equity that listed recently)
You could also melt (unpivot) the data back to narrow form, which might be easier to work with.
So you don't have any data sets that can help me? π
Nah, sorry
Did you check Kaggle?
Yes, but I haven't been able to find anything good.
I circumvented the situation with .dropna() by taking the NaN values out of each series without affecting sec_returns and plotting them instead
for t in sec_outliers.columns:
plt.figure(figsize=(22, 10))
plt.rcParams.update({"font.size": 19})
sec_returns_t = sec_returns[t].replace(0, np.nan).dropna()
if not sec_returns_t.empty:
plt.plot(sec_returns_t)```
By replacing the 0s (when you calculate variation of NaN values result = 0%) with NaN and removing them, I'm removing the period where a stock has been delisted
You are right. I distribute the images via the following range() over all workers.
Some code snippets
#in worker
for i in range(worker_id, nimages, self._nworkers):
img = images[i]
push_sock.send_pyobj((i, result))
def _unordered_recv(self, sock):
while True:
img_number, result = sock.recv_pyobj()
yield (img_number, result)
#in collector later
generator = self._unordered_recv(pull_sock)
The previous programmed wrote an ordered_rec function
def _ordered_recv(self, sock):
cache = {}
next_img_number = 0
while True:
img_number, result = sock.recv_pyobj()
if img_number == next_img_number:
yield (img_number, result)
next_img_number += 1
while next_img_number in cache:
yield (img_number, cache.pop(next_img_number))
next_img_number += 1
else:
cache[img_number] = result
But strangely, it seems not to work. When I use ordered_recv I don't get an error message, but some entries in the converted image stack remain empyt. That I don't understand. Unordered seems to work.
dope project
I can also set the length of bad returns allowed to be less or more lenient
0.1 meaning I except at least 90% of time series to behave normally
neat, I'll have to try it out someday. I still don't like the application of outlier removal without an understanding of the mechanism / reason for the outliers, for the record.
How do you handle outlier removal @left tartan ?
Hi! Thank you for your answer. I ran the script with the ordered_recv and unordered_recv. Indeed, only in the unordered_recv version output file seem to have all images inside. In the ordered version some arrays show only 0 that should not be the case.
It would be nice to understand why.
I got this snippet of ordered_recv by another person, I thought he would know what he is doing? For me, it is very important to be in control because I aim to give my code to other people.
Is there any better way to control the input, the processing by ZMQ, and the output, please? Currently for me this is a blackbox.
The only way I can come up with is as follows: I run without the parallel processing and do it in series. Then I redo it, I compare the two final files. Thank you for the discussion π
good morning π i am converting a dataframe to dict to use as data in a dash_table.DataTable. Unfortunately dash seem to require that i use the method 'records' for converting to dict, which results in all row titles being deleted. if i use 'index' the dict contains the row titles, but dash will not generate the table.
I think you have a mistake in the code. You yield img_number with cache.pop(next_img_number). Shouldnβt this be next.img_number?
But regardless, I donβt think this is a good approach. If img number 0 comes last, all the other requests are cached. Me? I would either receive them out if order and sort them. Or even better, if worried about memory, I would have each worker write the output to a local file and process them in the order I want
hello, does anyone know how to implement a delay for each request langchain makes? I have the starter freeplan in Openai of 5$ and it limits the amount of requests per minute so i get blocked... Here 's the code: data = yaml.load(f, Loader=yaml.FullLoader)
json_spec = JsonSpec(dict_=data, max_value_length=4000)
json_toolkit = JsonToolkit(spec=json_spec)
json_agent_executor = create_json_agent(
llm=OpenAI(temperature=0, openai_api_key=OPEN_API_KEY),
toolkit=json_toolkit,
verbose=True,
)
I'm unsure what it does before training, I'm also unsure how to read the result after training. I have this dataset which im told is a classic illustration of a classification problem.
Can you tell me what you specifically mean wwith BEFORE training?
If i initialize bias and weights to something else than zero, how does it impact the training. I'm also abit unsure on how SVM works compared to perceptron, so i cant really tell what the differerence is in the training and predictions.
Maybe this can help, I wrote this in the past: Maybe this helps? Idk. I wrote this in the past
Your perceptron's decision boundary is pretty bad compared to that of the support vector machine. You see the line is to close too the purple class. There's probably points just beyond that line that are still supposed to be in the purple class but your perceptron will say they're in the yellow
That's a direct consequence of the margin term you add to a SVM, which is related to L2 regularization. If you add that to your perceptron the results will be better. L2 also wants to constrain a magnitude of a vector if you recall π
The reason why you have points inside of the dotted lines are because of the slack variables SVMs are because of the slack variables (eta in my screenshot)
As far as I know SVMs don't do any kind of fancy initialisation unlike neural nets. If you start with a random weight vector before training and use that you converge to the same result. It's a convex optimization problem with a global optimum.
/info dump over
def _init_weights_bias(self, X):
n_features = X.shape[1]
self.w = np.zeros(n_features)
self.b = 0``` I had this function in my class so i just assumed it did
You can try it with different values that aren't 0. At least for the SVM it should converge.
class SVM:
def __init__(self, learning_rate=LEARNING_RATE_SVM, lambda_param=LAMBDA_PARAM, n_iters=N_ITERS):
self.lr = learning_rate
self.lambda_param = lambda_param
self.n_iters = n_iters
self.w = None
self.b = None``` By L2, do you mean lambda parameter in this case?
Did you code the SVM from scratch 
No, we got handed this code with the task of implenting it to our dataset. With the end goal of comparing perceptron vs SVM in both time and accuracy.
Tried coding perceptron from scratch, probably why its bad π¦
Learning rate in an SVM? Can you show me the implementation
Is it stochastic gradient descent?
class SVM:
def __init__(self, learning_rate=LEARNING_RATE_SVM, lambda_param=LAMBDA_PARAM, n_iters=N_ITERS):
self.lr = learning_rate
self.lambda_param = lambda_param
self.n_iters = n_iters
self.w = None
self.b = None
def _init_weights_bias(self, X):
n_features = X.shape[1] # The number of features is the number of columns in the dataset
self.w = np.zeros(n_features) # The weights are the coefficients of the input variables. They are multiplied by the inputs and summed to arrive at an output. They are updated in the learning process.
self.b = 0 # The bias is a constant value that is added to the weighted sum of inputs to determine the output of a neuron. It is a constant value that is learned during training.
def _get_cls_map(self, y):
return np.where(y <= 0, -1, 1)
def _satisfy_constraint(self, x, idx):
linear_model = np.dot(x, self.w) + self.b
return self.cls_map[idx] * linear_model >= 1
def _get_gradients(self, constrain, x, idx):
if constrain:
dw = self.lambda_param * self.w
db = 0
return dw, db
dw = self.lambda_param * self.w - np.dot(self.cls_map[idx], x)
db = - self.cls_map[idx]
return dw, db
def _update_weights_bias(self, dw, db):
self.w -= self.lr * dw
self.b -= self.lr * db
def fit(self, X, y):
self._init_weights_bias(X)
self.cls_map = self._get_cls_map(y)
for _ in range(self.n_iters):
for idx, x in enumerate(X):
constrain = self._satisfy_constraint(x, idx)
dw, db = self._get_gradients(constrain, x, idx)
self._update_weights_bias(dw, db)
def predict(self, X):
estimate = np.dot(X, self.w) + self.b
prediction = np.sign(estimate)
return np.where(prediction == -1, 0, 1)```
Your class is very weird
The class is from the course github page which were told to implement, but I dont know much about this classifier to judge.
class Perceptron: # Perceptron classifier class
def __init__(self, learning_rate=LEARNING_RATE, n_iters=N_ITERS):
self.lr = learning_rate
self.n_iters = n_iters
self.weights = None # The weights are the coefficients of the input variables. They are multiplied by the inputs and summed to arrive at an output. They are updated in the learning process.
self.bias = None # The bias is a constant value that is added to the weighted sum of inputs to determine the output of a neuron. It is a constant value that is learned during training.
def fit(self, X, y): # The fit method is used to train the model
n_samples, n_features = X.shape
self.weights = np.zeros(n_features)
self.bias = 0
for _ in range(self.n_iters):
for idx, x_i in enumerate(X): #
linear_output = np.dot(x_i, self.weights) + self.bias
y_predicted = np.where(linear_output > 0, 1, 0)
update = self.lr * (y[idx] - y_predicted)
self.weights += update * x_i
self.bias += update
def predict(self, X): # The predict method is used to predict the class of a sample
linear_output = np.dot(X, self.weights) + self.bias # The linear output is the weighted sum of the inputs
return np.where(linear_output > 0, 1, 0)``` this is the perceptron class i ended up with
Did you write this or your professor?
This implementation is wrong
class SVM:
def __init__(self, learning_rate=1e-3, lambda_param=1e-2, n_iters=1000):
self.lr = learning_rate
self.lambda_param = lambda_param
self.n_iters = n_iters
self.w = None
self.b = None
def _init_weights_bias(self, X):
n_features = X.shape[1]
self.w = np.zeros(n_features)
self.b = 0
def _get_cls_map(self, y):
return np.where(y <= 0, -1, 1)
def _satisfy_constraint(self, x, idx):
linear_model = np.dot(x, self.w) + self.b
return self.cls_map[idx] * linear_model >= 1
def _get_gradients(self, constrain, x, idx):
if constrain:
dw = self.lambda_param * self.w
db = 0
return dw, db
dw = self.lambda_param * self.w - np.dot(self.cls_map[idx], x)
db = - self.cls_map[idx]
return dw, db
def _update_weights_bias(self, dw, db):
self.w -= self.lr * dw
self.b -= self.lr * db
def fit(self, X, y):
self._init_weights_bias(X)
self.cls_map = self._get_cls_map(y)
for _ in range(self.n_iters):
for idx, x in enumerate(X):
constrain = self._satisfy_constraint(x, idx)
dw, db = self._get_gradients(constrain, x, idx)
self._update_weights_bias(dw, db)
def predict(self, X):
estimate = np.dot(X, self.w) + self.b
prediction = np.sign(estimate)
return np.where(prediction == -1, 0, 1)``` this is the original one that is in our course repo, which I assume is written by our professor
What is the difference between language module and an actual ai
a language model is a model trained on corpora of text, it is typically a neural network, which 'falls' under AI
AI is an umbrella term for automated machinery to complete a task autonomously
Looking at this again, I assume lambda_param is the canonical C parameter in SVMs
π
That makes the implementation not strictly wrong but just "strange". Idk why your professor would code it up from scratch, why they'd use gradient descent in the primal etc
is it possible to increase the amount of iteration steps with tensorflow when predicting an answer?
This is essentially a https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html with loss="hinge"
Examples using sklearn.linear_model.SGDClassifier: Model Complexity Influence Out-of-core classification of text documents Comparing various online solvers Early stopping of Stochastic Gradient Des...
Will AI replace 90% of programmers in 5 years it very good it will replace us right and if not then WHY
Still having issues adjusting the perceptron tho, added a L2 parameter, but its not really doing anything :/
?
Can anyone provide the resources to learn how to make chatbot from scratch?
chatterbot or nltk is your guy
Thank you
I wouldn't use chatterbot, because i've had troubles getting to work properly on 3.11.4
OK π
no
Right now, Generative Pre-trained Transformer (π€) GPT 4 is the most advanced model, and it has hallucination rates of 20%, so every 5 messages will make false information
It is incapable of doing arithmatics, and its coding knowledge is completely based from StackOverflow
Up until AI can make new information based on what it knows proefficiently, we'll still be their overlords
I found my result similar to this one tho... https://www.bogotobogo.com/python/scikit-learn/Perceptron_Model_with_Iris_DataSet.php
Perceptron model on the Iris dataset
What loss is your perceptron algo using, binary cross entropy?
simple difference between the predicted output and the actual target to update the weights.
y_predicted = np.where(linear_output > 0, 1, 0)
update = self.lr * (y[idx] - y_predicted)```
Hmm, I guess the decision boundary being where it is is down to that term but honestly I haven't looked at perceptron in detail
In 5 years it will right
It's very close to SVMs, especially if you added L2. I'd encourage you to just write down the equations
Right
If what my understanding from the article is correct, covergience is a problem with perceptron. So im just gonna assume i cant do anything else with that kind of classifier
But in your image it did OK didn't it?
I've not used Perceptron in the past mainly because I can't think of a situation where I would favour it above a generic SGDclassifier/regressor.
?
Please replay
I am scared if ai will replace us in 5 years
chill, it won't
Still need to understand my results better i guess. Never sure if im getting what i need
Ok even in 5 years
People should stop asking this to engineers go ask it to philosophers
Virtually no engineer cares about discussing this topic
It's tiresome and pointless
No, find a philosophy discord server and ask them. Thanks.
You can talk about how perceptron "hugs" the decision boundary and SVM doesn't and how that's a good thing
No I mean why it won't
I know it won't replace us but why
Thanks βοΈ
Write out the equations and it'll make more sense π
ai has already replaced us. in reality, zestar is a bot we have linked to chatgpt, the true overlord
What
Zestar will you replace us
zestar already replaced me by giving better answers in this channel 
Better answers dosnt mean replace you
Zestar how do you feel as a chat gpt and will you replace us
It's definitely better than its 74 predecessors
Maybe zestar76 won't take so much time to remember why exactly SVMs are a maximum margin classifier.
What
Another ai
NOo I don't want it to replace us
if we let it
this is a Machine learning tensor model defines values in a coordinate system and catagorgized via mathematics in dimensions based off the orginal model
I eventually want to create a machine learning model that becomes AI
The graphic 3d aspect is not important the the system as a whole, merely an in intended consequence to model i desgned.
the fact that can do provides deeper insite in to my goal
i don't think this code even works as-written, you'll need the nonlocal declaration for tensor
the enthusiasm is appreciated, but i also don't think this does anything with machine learning
its defining a tensor off of real world values like circle triangle, data along theses parameteres can help with massive data
i see. this is actually a well-established technique, you're basically treating each coordinate of each shape here as a separate feature. there are some advantages (e.g. simplicity of the code) but there are also problems with it. for example you can take a (small) image and flatten the pixels, so that each pixel is a separate input feature. the problem is that you lose the proper sense of the 2d locality of the data, so the model has a lot more work in order to construct a good internal representation of the data, compared to using a CNN.
however one thing you can do is use something like an autoencoder to obtain a fixed-size vector embedding for each "object" and feed those into a model. that's basically what all of deep learning for NLP is based on. the nice thing there is that you can tailor the embedding model to accommodate all kinds of weird objects (text, shapes, images, whatever) but the output is very uniform and can be put into a very generic model.
You could also define the picture as a rectangle and store the info about along that particular dimension
right, that's exactly the technique. i'm saying that simply flattening the coordinates might not be the most effective way to obtain a vector representation of a polygon.
although it probably wouldn't be bad either in the case specifically of k-gons, since k is fixed
would be interesting to see if a model can distinguish self-intersecting, convex, and regular polygons just by flattening the coordinates
the whole point those flattned coordinates are unique to that particular information. the whole pupose of my endevour is to formlize context.. say you want it to identify its self. it can refer to its current shape dude to its parameters that changed over time..
i'm not sure what you mean by that
right im trying
yes, flattening the coordinates is an invertible transformation between the original polygon and the flat representation
say you run this thing and learns the system dymentions, it would soon learn that in contenxt its a rectangle shaped box (gathered from the internet based on hardware information, dimensions, type, cpu then contextualize the information as hardware, it could read x and y values a see it as information from this source as a application, perhaps these values are window dimensions, they can be stored on the dimension avaible and given the cordinates of this newly created understand the model conforms to this information holding values that can be stored as a variable, Ie the computer its on.
in other words with each complete growth factor that conforms to lets say a triangle the orginal modle stores the information coordinately, those corrdinates are input references of stored dictionaries
these references can be combined to show in theory a 3d object once it learns enought about "its self"
what i find intersting is that you could intheory rotate the tensor object to store information anew.
i don't follow, what does CPU have to do with this?
nothing but if you ask the model what type of cpu it has, because when the tensor recieves input and creates a new storage dimension, any value learned in this dimension can later be refernece when tyring to talk a machine learning about itself.
what type of CPU you have because i haven't yet programed that far, i had a hard drive malfunction a week ago and starting new i have to retrace my steps, however this is a new approach of what i was trying to do
because a CPU is generally square, it can store the dimensions, name and stuff on the fly in said dimensions, say it managed to learn square thing and stored in the square dimension, it can refence instantaneously along those parameters and anything else that references square or square objects. perhaps what it revived threw videos, just stored in a space as a refence defined by the orignal tensor.
basically right now, if you run a model againsts it it would know its a square, with circle blach blach but an object that receives and learns as outputs
Hi, I am trying the ai cells magics from https://blog.jupyter.org/generative-ai-in-jupyter-3f7174824862 in jupyter lab. They work great, but I would like to streamline my process. Is there a way to avoid having to %load_ext jupyter_ai_magics in each notebook before using these magics? I looked around and advice on this was old/nonworking. Ideally I would like to change a config or add an argument to my jupyter lab startup alias.
assuming you are just using a stock default jupyter lab installation:-
- run
ipython profile locate defaultin your terminal - add
c = get_config()
c.InteractiveShellApp.extensions.append('jupyter_ai_magics')
to
<the-path-revealed-in-step-1>/ipython_config.py
3. restart jupyterlab
unfortunately i think you might be very confused about how machine learning works
hold on
I suggest you don't run it, it bogs hard, i working with an ssd again the VISUAL context is only something to help it won't remain in the final product,
yes, in 1 year
@boreal gale Amazing! That worked.
I was previously trying to modify the config indicated by jupyter-lab --generate-config, but it looks like I truly needed the ipython config even though i am using jupyter lab.
Which documentation did you look at to find the answer regarding which config needs to be modified and which value needed to be set?
https://ipython.org/ipython-doc/3/config/intro.html
i was gonna go with ipython startup scripts as i knew that's definitely possible to run python code when a kernel starts up,
but i forgot where to put start up scripts, so i google ipython profile since i knew it's a profile-based thiing and that led me to that page and the top bit immediately showed how to enable cython magic by default so i adapted that instead.
i am actually unsure why your previous approach didn't work.
edit: that seems to be the config for LabApp not ServerApp which is what we needed to change
edit2: yeah no. jupyter-lab --generate-config is for both ServerApp and LabApp actually, and that's probably not to do with the kernel that's being spawned
edit3: i guess the keypoint is just realising jupyter is just using ipykernel - hence you need to alter ipython profile , not jupyter's profile.
Agree on edit3. I am not a huge magics user, but it is a good reminder that magics are kernel feature language specific feature, not a general juyter feature.
Hi Iβm new here. Just signed up for Nucampβs DevOps course with Python starting in September. In the meantime Iβm trying to figure out how to work with list in real world situations but I got stuck somewhere. I was able to import my list (.csv) into pandas but donβt know what to do after that. How do I sort the list or find some item in the list. I learned to do this with just simple small list but not in real situations. Any suggestions? Please dm me. Thanks
Hey guys, any good known datasets of pictures of trash and debris?
yeah, the selfie folder on my phone
im planning on making a simple machine learning / ai algorithm to play chess. are there any python libraries i should investigate, aside from pandas / numpy / tensorflow. looking for a starting point in the project, i guess
do you want it to use machine learning, or a heuristic?
@left tartan Thank you so much. You were absolutely right. There is the mistake, I did not see it.
Corrected version is now:
def _ordered_recv(self, sock):
cache = {}
next_img_number = 0
while True:
current_img_number, current_result = sock.recv_pyobj()
if current_img_number == next_img_number:
yield (current_img_number, current_result)
next_img_number += 1
while next_img_number in cache:
next_result = cache.pop(next_img_number)
yield (next_img_number, next_result)
next_img_number += 1
else:
cache[current_img_number] = current_result
This one makes sense in case you have a live queue, where images from a detector come in and should be displayed in the correct order. It could also happen that a scan stops and we don't want the final file to have empty places.
If order does not matter, the unordered version is ok, the target file gets filled because (img_number, result) travel together and I can identify the image and the location in the final file with img_number.
Maybe this is not possible. If we scan, we get for example 300.000 images ore more. Not easy to keep in memory and maybe one would not like to have too many files flying around. I have to admit I don't know well the process detector --> outputfile. I always get the output file which I have to convert to something else. My current converted file is 300GB large
machine learning, not hard-coded
@left tartan I check now ordered and ordered_fix and let you know, but it looked good on the first glance
then I wouldn't recommend this as a first project.
not my first project, im just looking for resources
what other projects have you done?
do python have a library that can match similar lookin charts?
This isn't really my area of expertise, but you can probably find something that somebody has put together using reinforcement learning if you search around, maybe based on AlphaZero. however you should keep in mind that there is a long history of chess playing algorithms that don't use "machine learning" in the modern sense, which you also might want to look into
yeah it seems like AlphaZero has been used for chess since 2017, you should be able to find at least something about using it in python
Minmax (sometimes Minimax, MM or saddle point) is a decision rule used in artificial intelligence, decision theory, game theory, statistics, and philosophy for minimizing the possible loss for a worst case (maximum loss) scenario. When dealing with gains, it is referred to as "maximin" β to maximize the minimum gain. Originally formulated for se...
this is what stockfish uses if i remember correctly
Hmm a really nice project would be a voice activated chess
No board just voice haha
Woah that would be cool time to learn RNNs
thanks π
Could Chatgpt3 interact with this model by its self?
ChatGPT
Yes, GPT-3 can interact with your tensor-based model by itself. GPT-3 is capable of processing and generating text, which means it can send text inputs to your model and receive responses from it. This interaction would involve GPT-3 generating prompts that are formatted in a way that your tensor-based model can understand and interpret.
@desert oar ive gotten somewhere i hope this changes your preception on what i may or may not understand
the neuro network model
If anyone is good with python selenium here and you have a moment please check my post on python help page, just posted it there β€οΈ
anyone using a mac? my jupyter notebook kernel is using the /usr/bin/python3 where it should be the anaconda python. I don't know how to fix this
@brittle radishplease don't post advertisements like that in this server
need advice on NLP books
so I'm eyeing some books as a handbook to guide me in learning NLP with libraries like spacy, nltk etc (let's assume i know nothing of NLP). right now i have these 3 books listed. but i wonder if there are better books for complete beginners?
I just finished learning Python. I want to start learning how to train ai agents to play a game. Can anyone link me somewhere and explain how should I start doing that?
π£ I've just published an in-depth article on Stable Diffusion, an AI technology that's transforming the way we deal with noisy data. π
π What's Stable Diffusion?
π§ π¬ The Science Behind It
π Get Hands-On
The tutorial includes a step-by-step guide on setting up Stable Diffusion, so you can get started on your own projects.
Let me know what you think!
Beyond the Hype: Practical Tutorial to Stable Diffusion and Its Impact on Tech
does DuckDB have query parameters?
Hi guys. I have a few issues with pandas.style.background_color function. Which topic will you suggest me to ask in?
or is it better i open a thread in python help?
here or thread in python help are both fine
my problem is that i am coloring my table with heatmap, but i only want to apply it to specific rows. you can add subset="column name" to target specific columns, but it will always target columns, not rows, despite setting axis to 0 or 1.
code:
metrics_html = combined_metrics.style.background_gradient(cmap='Greens',axis=0).to_html()
as im very familiar with css, i tried overwriting the styles of the cells in the html, but due to the nature of how the colors are applied it is not possible to do
have a read here: https://pandas.pydata.org/docs/reference/api/pandas.io.formats.style.Styler.background_gradient.html
in particular:
subset: label, array-like, IndexSlice, optional
A valid 2d input to DataFrame.loc[<subset>], or, in the case of a 1d input or single key, to DataFrame.loc[:, <subset>] where the columns are prioritised, to limit data to before applying the function.
here is an example to get what you wanted:
import pandas as pd
import numpy as np
arr = np.random.randint(1, 100, size=(10, 10))
df = pd.DataFrame(arr)
df.style.background_gradient(cmap='Greens', subset=([3,4], slice(0, None)), axis=0)
okay thanks let me try that right away. i tried with the slice thing and apply and applymap last night, but the slice stuff was confusing for me.
you don't even need slice actually..
df.style.background_gradient(cmap='Greens', subset=([3,4], ), axis=0) will do just fine
thats basically what i was already doing. it still only searches the columns for 3 and 4 here, it cannot search the rows
n.b.
subset: label, array-like, IndexSlice, optional
A valid 2d input to DataFrame.loc[<subset>], or, in the case of a 1d input or single key, to DataFrame.loc[:, <subset>] where the columns are prioritised, to limit data to before applying the function.
how you make it work on rows though, it says it looks through columns π€·
your example works. just figuring out how to make it also work on my df
hmm? i though the documentation is quite clear already π€
unless you weren't aware there is a difference between [3,4] and ([3,4], )?
yes i was totally unaware of that π
i see that you thereby are targeting rows , i just didnt go further with that because it always said it was searching cols
ah! okay, you can experiment with df.loc[<whatever-you-put-as-subset>] to see what it will target (in the 2D case according to the docs)
but if it's 1D then it's df.loc[:, <subset>]
what a weird API.. π€·ββοΈ
okay case solved. thank you very much!
combined_metrics.style.background_gradient(cmap='Greens', subset=(["opti","bt1","bt2","bt3","bt4"], ), axis=0).to_html()
yesterday i was so close, litereally just had to put ", " after the ]
anyone familiar with Power Bi?
IF anyone is truly interested, patient, willing to listen to a reasonable implementation of working on a novel concept of Machine learning and AI.. please by all means message me
be sure to always ask a complete question that someone who knows the answer could start answering
Can someone please answer my question?
import numpy as np
import math
import pandas as pd
from pandas.plotting import scatter_matrix
import scipy
import matplotlib.pyplot as plt
from scipy.stats import *
pd.set_option('display.max_rows', None, 'display.max_columns', None, 'display.width', None)
housing = pd.read_csv(filepath_or_buffer = 'filepath')
print(housing)
why does this not print all of it, it prints from like 3000 - 13000
this is on pychamr btw maybe thats the issue
and i do have 13000 data points
Terminal has a maximum length @umbral charm
Yea
was thinking this, anyway to solve?
or do i just have to use like IDLE
I don't think you ever need to print over 10k lines probably.
Could print to a text file
That is true, its just useful to see if it works
But probably better to only print the useful info
yea would be but i like to visualise everything just incase
Maybe there is a setting to change the maximum line count
But you'd need to look around in the settings
I have 'override conscle cycle buffer size'
would that be it?
printing out 10k lines of text is one of the worst ways of visualizing anything
make a plot of the stuff you care about
Probably. But is it possible to make a plot instead?
You're not going to read through 10k lines in les than a few hours π
i feel personally attacked
that explains why you're always confused
That is ture
Yea its fine i should be good with 10k lines
i was just curiuos, dealing with large datasts is a pain
oh
it worked
thanks!
It was edds suggestion ππ½
What is the most important mathematical formula in ML?
is it the Quadratic formula?
Really subjective. There is also the chain rule for backwards propagation
Lol probably not
Lots of activation functions like tanh, ReLU, sigmoid etc.
Also pretty big
But yeah, not really a single answer π
I don't think there is a single most important formula, multiple statistical formulas are used
Cheers everyone
housing['date'] = pd.to_datetime(housing['date'], format = '%d/%m/%Y')
housing2 = housing[(housing['date'] > '2016-12-01') and (housing['date'] < '2018-01-01')]
print(housing2)
why doesnt this work
' raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().'
can u provide full traceback as code
like u want my full code and full error?
only the full error i assume it results from line (2)?
the traceback mate
Put parens around each side of the equation
(β¦) and (β¦)
Traceback (most recent call last):
File "D:\Users\FatBoy\PycharmProjects\Coursera\Gali.py", line 11, in <module>
housing2 = housing[(housing['date'] > '2016-12-01') and (housing['date'] < '2018-01-01')]
File "C:\Users\FatBoy\bottle\lib\site-packages\pandas\core\generic.py", line 1466, in __nonzero__
raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Oh sorry, youβre comparing to a string.
You want to compare the series to a date obj
When i put the '&' instead of and
it works
but i dont want to use '&' coz idk what it actual means i only know its a bitwise operator
Oh and that too:
Each side of that AND is a boolean series.
You want the bitwise AND of the series⦠not the boolean AND which makes no sense: what is [10101010] and [00011010]?
A boolean AND operation would give you a True or a False, which makes no sense
Does that make any sense?
i see
so if [10101010] and [00011010]? makes no sense
what would [10101010] & [00011010] do
What do you think itβll do? Do you know anything about bitwise operations or boolean arithmetic?
nope nothing about bitwise operator, thats why i didnt want to use it
i also know that there is a pipe for OR bitwise
doesn't & return everything that is in both lists?
Let me demonstrate... one sec
import pandas as pd
s1 = pd.Series([0, 1, 1, 0, 1])
s2 = pd.Series([1, 0, 1, 1, 0])
print(s1 & s2)
Yes
So, lookinga tyour original code:
You have series A which is : (housing['date'] > '2016-12-01') and series B which is: (housing['date'] < '2018-01-01')
this would print 11111 right?
import pandas as pd
s1 = pd.Series([0, 1, 1, 0, 1])
s2 = pd.Series([1, 0, 1, 1, 0])
print(s1 | s2)
So, you want all rows where A is True and B is True
