#data-science-and-ml
this may or may not be useful: https://en.wikipedia.org/wiki/Chebyshev's_inequality#Proof
always ask an actual question that someone who knows the answer can start answering--don't ask to ask.
@boreal gale @agile cobalt thanks!
model : resnet 152 initially loaded with imagenet weights
dataset : freihand and multiview hand pose dataset (approx total of 180k images)
optimizer : Adam with .01 lr
task : keypoints detection
hnet : trained for 10 epochs
hnet2 : +5 epochs (15 ep)
hnet3 : +5 epochs (20 ep)
hnet4 : +5 epochs (25 ep)
I've been training my resnet model for days and I've found that there's a gradual decrease in MSE loss. My goal is to get the model to work well at test time (let's say with my own hand, outside of this dataset). Can someone suggest a way to improve this model? It would be really helpful. Should I just train the model a bit more, like up to 50 epochs? So far I've cropped each image to the hand's bounding box so the whole image focuses on the hand and background noise is removed.
this insight was the key
@boreal gale @agile cobalt any ideas for what to make of this? I'm not sure what to do with u (without a subscript; u has up to this point been the random variable) now that we have a sequence of u values.
yeah weird, it just looks like u is undefined here..
what I thought that could be useful from the proof was mainly the y = (X - u) ** 2
we were already asked to prove this, and one can just substitute that for t.
please help🙏
please help🙏
please help 🙏
sorry I just wanted to be included.
shoot, im ok with any suggestions
yeah you can take (u - mu)^2 as a new random variable, substitute into Chebyshev's, and compute its expectation
@somber prism >?
OH
Jaabir
would you mind phone call, i may interest you in a new way to work with this?
what is u here?
nothing really, i just wanted to know how others would take approach in this
that's the question 😛
I just asked the prof, so we'll see when he replies.
I guess it's an arbitrary element from the sequence of u values, but I don't think that goes without saying.
is the question to figure out what is u 😛
the question is "what is u intended to represent in the context of the homework question"?
and the homework question is to prove that inequality.
u is probably the mean of all the u_i, since that reduces the variance by a factor of n
you can already go ahead and try that out
my guess it's probably supposed to be \bar{u}
but yeah, it's not written clearly
you can use the linearity of the expectation operator to show this one, as well as once again plugging into Chebyshev's ineq
as a general recommendation, it's a good idea to keep an eye out for the (central) moments of random variables
they tend to have nice properties and give you intuition about their distribution's properties
what is u-bar, and if it's the mean of all u_i, how is that different from mu?
the bar is a common notation for the mean
and i guess i should say "sample mean" for clarity
.latex the sample mean is [ \frac{1}{N} \sum_n u_n ]
this is only equal to the true mean if N goes to infinity
people usually call the "true mean" the "expected value"
.latex that'd be
[
\mathbb{E}(U) = \int_{-\infty}^{\infty} u f(u) \, du
]
oops
where f(u) is the pdf of U
these two being equivalent as N goes to infinity is the "law of large numbers", which btw doesn't hold in general
but the takeaway for you is that the average and the expected value are not the same thing
what averaging DOES do is reduce the variance
this is what the problem is asking you to show
in signal processing terms, averaging is the same as applying a rolling average window, which is the same as lowpass filtering
another way to think of it is that the sample mean or average is a function of random variables, and so its output is also another random variable with a new distribution: one with the same mean as the original variables, but a lower variance. the expected value, by contrast, is a constant
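To make the variance claim concrete, here is a short derivation sketch. It assumes the u_n are independent (uncorrelated is enough) with a common mean \mu and variance \sigma^2; that assumption is mine, since the homework statement is not quoted in the thread.

```latex
\[
\bar{u} = \frac{1}{N} \sum_n u_n, \qquad
\mathbb{E}(\bar{u}) = \frac{1}{N} \sum_n \mathbb{E}(u_n) = \mu, \qquad
\operatorname{Var}(\bar{u}) = \frac{1}{N^2} \sum_n \operatorname{Var}(u_n) = \frac{\sigma^2}{N}
\]
\[
P\bigl(|\bar{u} - \mu| \ge t\bigr) \le \frac{\operatorname{Var}(\bar{u})}{t^2} = \frac{\sigma^2}{N t^2}
\quad \text{for any } t > 0
\]
```

The first line uses linearity of expectation and the fact that variances of independent variables add; the second is Chebyshev's inequality applied to \bar{u} instead of a single u_n.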
@wooden sail
Oh wow, sorry, I just left that out of the question, and no one asked yet! Yes, u in part c is the sample mean of all the us.
it's due in 10 hours

😌
mystery solved. i left you 10 pages of lore explaining the background in the meantime
I like lore
I think they meant u_{i} no?
the prof since clarified that it's the sample mean for all u_i
ooof
In this context \bar{u} would've been appropriate like Ry said then
Typos can be made ofc 🙂
yes
I always feel "guilty" when I realize that I used to know these things but I forgot 🤣 . Guess that's what mostly doing applied stuff does to you
i have 0 guilt 🤣 because i know i wouldn't even know how to do 20% of the things i easily whip out today back then
I'm making a python script that extracts the sudoku grid from the image. all numbers should be extracted into a 2d array matching the sudoku image
this is my colab notebook link:
https://colab.research.google.com/drive/1ykMxMtiPX0SVph6bQpzQTLCZGkPJFJuT?usp=sharing
there's some error in the text extraction, or it might be in the perspective correction, and I'm not sure what exactly to do
kindly have a look at the code🙏
error:
AttributeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/PIL/ImageFile.py in _save(im, fp, tile, bufsize)
517 try:
--> 518 fh = fp.fileno()
519 fp.flush()
AttributeError: '_idat' object has no attribute 'fileno'
During handling of the above exception, another exception occurred:
SystemError Traceback (most recent call last)
10 frames
/usr/local/lib/python3.10/dist-packages/PIL/ImageFile.py in _encode_tile(im, fp, tile, bufsize, fh, exc)
531 encoder = Image._getencoder(im.mode, e, a, im.encoderconfig)
532 try:
--> 533 encoder.setimage(im.im, b)
534 if encoder.pushes_fd:
535 encoder.setfd(fp)
SystemError: tile cannot extend outside image
i put an exception handling code there and now whole extracted grid is 0000...
please help!!!!
Can someone help me with pandas in python? I have some issues in implementing some of the functions of pandas and numpy. I also need some help in creating a script to get some insights from a dataset.
be sure to always give enough information that someone can start helping--don't wait for a commitment.
whatever dataframes you need help with, show them as text (not a screenshot) with print(df.head().to_dict('list'))
Based on that message about '_idat' not having a 'fileno' attribute, I’m guessing your fp parameter is not a file handle.
Check your params carefully. I didn’t look at your code tho.
if possible can u look ?😓
I can’t right now, I’m half debugging something, just doing light discord 🙂
hi @serene scaffold thanks for your reply, actually i want help in generating a scripts from specific condition to implement in dataset,
though the datasets are
-
power data of equipments -
{'Time stamp': ['2023-06-01 00:00', '2023-06-01 01:00', '2023-06-01 02:00', '2023-06-01 03:00', '2023-06-01 04:00'], 'HVAC 1 (kW)': [0.0, 0.0, 0.0, 0.0, 0.0], 'HVAC 2 (kW)': [0.0, 0.0, 0.0, 0.0, 0.0], 'HVAC 3 (kW)': [0.0, 0.0, 0.0, 0.0, 0.0], 'HVAC 4 (kW)': [0.6772, 0.5796, 0.4976, 0.6235, 0.5637], 'Kitchen Bar lights (kW)': [0.0, 0.0, 0.0, 0.0, 0.0], 'LCC Oxford Circus - Total (kW)': [5.39, 5.25, 5.17, 5.0, 4.42], 'Main 1 (kW)': [5.39, 5.25, 5.17, 5.0, 4.42], 'Main 1 L1 (kW)': [3.81, 3.85, 3.77, 3.72, 3.22], 'Main 1 L2 (kW)': [0.9682, 0.9715, 0.9935, 0.919, 0.9206], 'Main 1 L3 (kW)': [0.611, 0.4309, 0.4057, 0.3594, 0.2783]} -
Working hours of the site -
{'WeekDay': ['Monday', 'Monday', 'Monday', 'Monday', 'Monday'], 'Type': ['Non Trading', 'Non Trading', 'Non Trading', 'Non Trading', 'Non Trading'], 'Hour': [0, 1, 2, 3, 4]}
df_temp = df.query('1970<Year<1981')
plt.pyplot.subplot(1, 5, 1)
df_temp.value_counts("Method").plot(kind = 'bar', title="1970-1980", figsize=(24,4))
df_temp = df.query('1980<Year<1991')
plt.pyplot.subplot(1, 5, 2)
df_temp.value_counts("Method").plot(kind = 'bar', title="1980-1990")
df_temp = df.query('1990<Year<2001')
plt.pyplot.subplot(1, 5, 3)
df_temp.value_counts("Method").plot(kind = 'bar', title="1990-2000")
df_temp = df.query('2000<Year<2010')
plt.pyplot.subplot(1, 5, 4)
df_temp.value_counts("Method").plot(kind = 'bar', title="2000-2010")
df_temp = df.query('2010<Year<2020')
plt.pyplot.subplot(1, 5, 5)
df_temp.value_counts("Method").plot(kind = 'bar', title="2010-2020");
plt.pyplot.suptitle("Main Title", fontsize=15)
plt.pyplot.subplots_adjust(hspace=4, top=4)
plt.pyplot.subplots_adjust(left=0.1,
bottom=0.1,
right=0.9,
top=0.9,
wspace=0.4,
hspace=0.4)
plt.pyplot.show()
Spacing and padding are not working for the heading ("plt.pyplot.suptitle").
The text "Main Title" ("plt.pyplot.suptitle") just overlaps the titles of the individual graphs, despite adding "plt.pyplot.subplots_adjust(left=0.1,
bottom=0.1,
right=0.9,
top=0.9,
wspace=0.4,
hspace=0.4)"
How do I fix this?
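One way to keep the main title clear of the panel titles is to build the figure with `plt.subplots` and let constrained layout reserve the space. This is only a sketch: the sample `df` is made up (only the `Year` and `Method` columns are assumed from the question), and it imports `matplotlib.pyplot as plt` rather than the `plt.pyplot` spelling used above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the real DataFrame; only Year and Method are assumed.
df = pd.DataFrame({
    "Year": [1975, 1978, 1985, 1995, 1999, 2005, 2015, 2018],
    "Method": ["A", "B", "A", "B", "B", "A", "A", "B"],
})

# (lower bound, upper bound, panel title), same bounds as the original queries.
decades = [
    (1970, 1981, "1970-1980"),
    (1980, 1991, "1980-1990"),
    (1990, 2001, "1990-2000"),
    (2000, 2010, "2000-2010"),
    (2010, 2020, "2010-2020"),
]

# Constrained layout reserves space for the suptitle above the per-panel titles.
fig, axes = plt.subplots(1, 5, figsize=(24, 4), constrained_layout=True)
for ax, (low, high, title) in zip(axes, decades):
    subset = df[(df["Year"] > low) & (df["Year"] < high)]
    subset["Method"].value_counts().plot(kind="bar", ax=ax, title=title)

fig.suptitle("Main Title", fontsize=15)
fig.canvas.draw()  # forces the layout pass; use plt.show() in an interactive session
```

If you would rather keep the `subplot`/`subplots_adjust` style, passing `y=1.05` to `suptitle`, or calling `fig.tight_layout(rect=(0, 0, 1, 0.95))`, are other options worth trying. As far as I know, `subplots_adjust` is not applied when constrained layout is active, so pick one approach.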
to be clear, I wasn't making a commitment to help. I was just telling you that you need to ask your question if you want to get help.
generating a scripts from specific condition to implement in dataset,
I don't know what this means. it sounds like you don't have the vocabulary to convey what you're trying to do.
Try showing what result you want given the two dataframes that you have shown.
[PyGraft is looking for open-source contributors]
Hi there,
I recently open-sourced PyGraft, a configurable Python tool to generate synthetic knowledge graphs easily!
It can be used in any AI tasks (Machine Learning, Deep Learning, Reasoning, etc.) provided that you work with graphs.
The repo is gaining a lot of visibility, and I am looking for motivated contributors to support me in implementing new features and unit tests. Ideally, you should (or would like to) have a general understanding of knowledge graphs, semantic web, RDF/RDFS, and OWL vocabularies. In addition, strong Python programming skills are required. Experience in Software Engineering is a plus 🙂
DM me if you would like to contribute!
Otherwise, you can still take a look and star and fork the repo if you find the project interesting!
Can someone help me find a good article or blog that discusses static evaluation for non-terminal game states?
If we’re talking deep learning and games, are you familiar with two minute papers? I’d probably go through the papers he cites on his yt channel
Yeah, I am subscribed to that channel
(That’s about all I know about games tho)
I'm just using a simple approach in my game, minimax up to a certain depth, and the channel talks about deep learning
(cross post cause wrong channel) Does anyone know how to use matplotlib or another library to make a graph something like the one below? The issue I have is that I want it to show the count of events over a period of time (edits on a wiki via the MediaWiki API, to be exact), and the events don't have a 'y' value, just a timestamp. I guess I'd use the count of X in a range as y. https://cdn.discordapp.com/attachments/421605166198816769/1152370948582932541/channel_chart.png
That kind of graph is called a histogram.
There is a plt.hist() function that will give you a bar plot, but there is also a histogram function in numpy which you can use in combination with matplotlib to make a line plot like shown there
Oh? Thank you for that information, I'll look into documentation and such for that!
while a histogram also have x and y axis and can display the same data, i'd say its a linechart with smoothing applied to the line
wouldnt cost too much and see here if you like that type of plot. thats just one out of many other libs. https://plotly.com/python/line-charts/
to make it smooth you want to look for what they call "spline" .. somewhere . the documentation is rather ok while not amazing
That's what I thought to use at first as well. But I only have 1D points on the X axis and am not entirely sure how I would go about calculating Y data from them, so I was looking for a solution that already does that.
okay. you have time on X and what you plan for Y?
or i dont know if you have time on X, thats how i interpreted your msg
I have a collection of objects with a timestamp, I’m looking to make a graph like this(https://cdn.discordapp.com/attachments/421605166198816769/1152370948582932541/channel_chart.png) where the concentration of points forms the Y. This image shows discord message activity, my data will be logs of edit to a wiki, so you could see trends in increases and decreases in editor activity.
Show what your data looks like plz
Aight. When I get back home I’ll generate some sample data.
(srry didnt think to make any yet)
As I’ve read; you have an event stream where each event has a timestamp. You want to show events per day. Right?
Yeah, although I may tweak it to show over a year instead, depending on the timescale I want to observe
Its an object with a UNIX EPOCH Timestamp iirc
The function that gets you the y values from a list of single events is numpy.histogram()
Yah, so you just want to group the data by date, take a sum per group, and plot that.
You -can- do some of this stuff in the charting library, but the normal process is just to calculate what you want, then plot it.
(I don’t think of this as a histogram case, but if you added a date column to each object, I guess it’d work)
All that grouping stuff is really Pandas or SQL language, by the way
is this technique you are describing also referred to as "bins" ?
Bins is what you’d call it in a histogram… but grouping is what you’d call the data transformation in data libraries and dbs.
ah cool. yea i see the histogram 2d and other 2d maps have nbins functions where scatter only has a few different named something with group
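The two routes discussed above (bins via `numpy.histogram`, or grouping by date in pandas) can be sketched side by side. The timestamps below are made-up UNIX epoch seconds, and day-wide bins in UTC are an assumption; swap the bin width for weeks or months as needed.

```python
import numpy as np
import pandas as pd

# Hypothetical edit timestamps as UNIX epoch seconds (UTC).
timestamps = np.array([
    1694563200,  # 2023-09-13
    1694570000,  # 2023-09-13
    1694600000,  # 2023-09-13
    1694650000,  # 2023-09-14
    1694800000,  # 2023-09-15
    1694810000,  # 2023-09-15
])

# Route 1: numpy.histogram with one bin per day (bin edges at midnight UTC).
day = 86400
start = timestamps.min() // day * day
stop = (timestamps.max() // day + 1) * day
edges = np.arange(start, stop + day, day)
counts, edges = np.histogram(timestamps, bins=edges)

# Route 2: pandas, one row per event, then resample to daily totals.
events = pd.Series(1, index=pd.to_datetime(timestamps, unit="s"))
per_day = events.resample("D").sum()

# Either result can go straight into a line plot, e.g.
# plt.plot(per_day.index, per_day.values)
```

Both routes give the same y values; the pandas version also fills in days with zero events, which keeps the line from skipping gaps.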
In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?
hello, I am not even able to understand how the ols function works?
Is data science just about grabbing data and making sense of it? Like if I wanted to create some analytics for rugby, I just grab the important data make sense of it via libraries and whatnot
Yes, although the added value arises by finding correlations that are not immediately apparent.
So if I was to make the same statement you just did I would not include the word just
Do you have an example of that if you dont mind me asking
I am an X-ray scientist so probably not the best person to ask for examples, but you can look into past kaggle competitions
I remember there was one for predicting supermarket sales (to predict how much produce should be stocked by the supermarket), and the sales of ice cream would increase on public holidays but only if the weather was good.
whats the best stats course i can take for data science? your recommendations?
What stats have you taken? It’s all relative to you, right?
uhm, i have an over the top understanding of some concepts but ive taken high school math and engineering math aswell so yeah
I think this also functions as a sort of meta-example. Amusing
i have an over the topic very shallow understanding of some stats concepts
but what would be a good course to take teaches both stats and implementing that in python
I made the mistake in grad school of taking a stats course that was primarily for math majors (but they allowed CS). I thought I knew stats. Boy was I wrong.
I don’t know of one but would love if someone had one!
I think you can make bank with a maths and stats degree. Quants require like some msc from what I see and you can make millions if you're good enough
Perhaps, but I don’t think you can ‘will yourself into’ loving math and a quant role if you don’t have a passion (and ability) for the subject.
oh yeah i do look at some stats things and im so lost but im confident i could pick it up
since im not really super idk
whats the word im looking for?
isolated from the math concepts?
I dunno, but three things you could/should work on: 1. Data engineering stuff- do projects that work with data. Kaggle has lots of ideas/projects. 2. stats: there’s so many things to learn but just pick one thing at a time that you know little about and dig deep. 3. CS50 for AI: not stats, but it’s a survey of ML and Python, the hands on stuff you were talking about
And three books zestar75 recommended: #data-science-and-ml message
thanks for the pointers
how do you guys follow a book when its not prescribed as part of a course
thats dedication
pls
I don’t. I pick a chapter or topic I’m interested in.
ahh
thats far more convenient and freeing tbh
my brain is so "all or nothing" tho
i cant just pick a chapter and not like perfectly do the whole book seems so wrong
Yah, I have a few books I’ll never finish, but I can’t put them away
there's always finishing them with audio
that doesn't work if it's technical content
Lol, I’d love to hear a stochastic calc text in audio
You have to be more specific in what you do and don't know from stats imo
i know
hey
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
def logistic_regression(data, max_iter=100, random_state=42):
x = data.drop(["happiness_rank"], axis=1)
y = data["happiness_rank"]
x_train, x_test, y_train, y_test = train_test_split(x,y,random_state=random_state)
model = LogisticRegression(max_iter=max_iter, random_state=random_state)
model.fit(x_train,y_train)
preds = model.predict(x_test)
score = accuracy_score(y_test, preds)
return score
it shows that accuracy is 0.0
What is happiness rank exactly? How does the data look like?
How many levels? (E.g., is it 0/1 or from 1 to 10 or so?)
Hello there, I want to train YOLO with my own dataset but I’d have to do the annotations manually and it would take forever to do so 😭 I was thinking of using edge detection to automatically create the bounding boxes. Do you guys agree or do you guys have a better idea?
Anyone interested in optimization of LMs?
Greetings, I am working on the kaggle medical transcript dataset and I am trying to re-balance the dataset because the mean word count is at 18 words, max words is 76. I am wondering what is the best move here? should I cut the top 25% highest word count?
Hi, does anyone know of a resource on designing an evaluation function that takes a state (mostly non-terminal) and the player and returns a score (how likely that player is to win)?
An example of such a function in any game would also be fine. I just want to get the gist of the core concept: extracting features and evaluating them
I am making a connect4 game which is using some algorithmic approach for AI player. I'm using minimax under the hood.
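A sketch of what such a function could look like for Connect 4. The heuristic here (score every run of four cells by how many pieces each side has in it) is a common approach but it is my assumption, not something from this thread, and the board encoding and weights are placeholders to tune.

```python
ROWS, COLS = 6, 7
EMPTY = 0

def windows(board):
    """Yield every horizontal, vertical, and diagonal run of four cells."""
    for r in range(ROWS):
        for c in range(COLS):
            for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):
                end_r, end_c = r + 3 * dr, c + 3 * dc
                if 0 <= end_r < ROWS and 0 <= end_c < COLS:
                    yield [board[r + i * dr][c + i * dc] for i in range(4)]

# Hypothetical weights: more pieces in a still-open window is better.
WEIGHTS = {2: 2, 3: 10, 4: 1000}

def evaluate(board, player, opponent):
    """Score a (possibly non-terminal) position from `player`'s point of view."""
    score = 0
    for window in windows(board):
        mine, theirs = window.count(player), window.count(opponent)
        if mine and theirs:
            continue  # blocked window: neither side can complete it
        score += WEIGHTS.get(mine, 0) - WEIGHTS.get(theirs, 0)
    return score

# Example: player 1 has three in a row on the bottom row.
board = [[EMPTY] * COLS for _ in range(ROWS)]
board[5][0] = board[5][1] = board[5][2] = 1
print(evaluate(board, player=1, opponent=2))
```

In minimax you would call `evaluate` whenever the depth limit is reached on a non-terminal state, and return a large fixed value for actual wins and losses.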
I was doing chapters of Elements of Statistical Learning completed 3 chapters, but then started Deep L. For coders 2022 course. Watched 4 lec.
Now I just can't figure out whether I should first learn ML in depth or do DL. I know ML basics.
I have 2 years left of my clg (undergrad)
ESL is aimed at masters (graduate) students or advanced bachelors students that have had several math/stats courses imo.
Unless you specifically want to go into computer vision or NLP, you should know ML before DL, because deep learning offers solutions where "regular" ML fails.
i was thinking that this is the best time when i can read this book with no pressure... but i also think that is it really required? because this time i can invest in any other thing like more on implementing than on theory
These 3 books are what I always recommend
You need a mix. In interviews for ML/AI positions they ask about what you've done but "theoretical" questions are also present.
ML is a very very leaky abstraction. Every model nowadays has a .fit() and .predict() method but they're all subtly different and for me at least knowing what most of them do gives me confidence that my work is correct.
Hi all, does anyone know of any good online resources where I can find data science sample projects (and preferably solutions)? . I've gone through McKinney's book for Pandas but I feel like without some actual hands-on projects none of it will stick
I'm assuming you're aware of kaggle and looking for something else?
Someone 📍 pin this message... really helpful
Agree, though I would also add https://www.deeplearningbook.org to that.
I wasn't actually. Do you recommend just going through the exercises here https://www.kaggle.com/learn/pandas?
Is it ok to tune CatBoost parameters on GPU to then use these parameters for CPU training?
yes
Thanks! This looks good
Yep that might be useful.
But I was also referring to projects part, since you asked about projects.
Kaggle has tons and tons of projects/notebooks by other people.
You can find the right projects as per your interest.
Though do be careful to not pick up bad practices as these projects are not curated. Maybe keep an eye on the upvotes, user-level and discussion section to get some sort of idea.
worth pursuing ai/ml if I am trash at these specific mathematics? (nearly failed half these courses in uni, and have to retake them for improvement)
So I have this project I'm working on, and I need to capture the text on a screen and identify what number the text is. However, the area surrounding the text is transparent for the most part, therefore the background is subject to change, which may directly affect how the text is captured.
To capture the text, I'm using adaptive thresholding. I've attached a short video on the text being captured in different background environments.
My question is, what's the best approach to identify if the text equals "0%" or "1%" or "2%" and etc.
I need the solution to be relatively efficient. So far, the capturing the text and adaptive thresholding is done in 0.03s roughly.
As of right now, I'm thinking of template matching.
You should get more comfortable with those branches of math, then. If that isn't something you're able or willing to do, then you'd want to look elsewhere, yeah
Is data mining possible with LLMs ? Like passing whole bunch of pdfs and say it to generate a csv from them ..
I searched for some papers but couldn't find anything interesting.
I want to generate a medical dataset for medicines and symptoms
So thinking of this instead of doing text scraping
from medical books
it might be, but you shouldn't trust it unless you've demonstrated that it produces the expected output on a given document set
and you can't just pass PDFs--you'd need to convert them to text first.
Yeah..that i will surely do if im doing it .... i tried for few paragraph.. it works fine but can't trust it for whole text
if this is something that matters, you'd need to be pretty rigorous in how you measure its accuracy
but it might be that it doesn't matter.
But I think it's just impossible to extract the data the other way... scraping the whole text
Analyzing PDFs is a difficult problem, here’s a talk / project about it that combines text and ocr: https://youtu.be/vB-C7dBoxc8?list=PL8uoeex94UhEGxPOetT3bpg8ibcxflh44&t=9122
Even after i got a text...like "use paracetamol tablets if suffering from cold" then i want to generate a symptoms - medicine dataset from it...
So for this only option i can see is using LLM
This isn’t my space, but I recall a lot of work around medical ai and expert systems. There are certainly many papers in this, ie: https://dl.acm.org/subject/ai?AllField=Medical&sortBy=relevancy
If you want to learn a specific package, using its official docs is always a good idea imo. https://docs.opencv.org/4.x/index.html If a package doesn't have good docs, I'm personally always hesitant to use it.
YouTube + Documentation = 🔥
Good afternoon everybody! Does anyone here use seaborn?
Why do you want to know if someone uses seaborn? Because you should always ask your actual question. Not if someone knows about the topic of a question you haven't asked.
Hi guys, I'm working with a dataset of cars in CSV, and some of the data has the word 'Turbo' after something like a line break
so what I want to do is delete the CC and Turbo. I tried this code:
data['Power in cubic capacity (CC)'] = data['Power in cubic capacity (CC)'].str.replace('CC\nTurbo', '')
But Turbo isn't deleted, can somebody help me on how to delete it?
me bro
i've fixed it hehe
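The actual fix wasn't posted. One guess at why the literal `'CC\nTurbo'` may not match is that the cell holds a space before `CC`, or a `\r\n` instead of a bare `\n`. A regex that tolerates any whitespace is one way around that. The sample values below are invented, since the real column contents aren't shown.

```python
import pandas as pd

col = "Power in cubic capacity (CC)"

# Hypothetical cell values: plain, "\n" line break, and Windows-style "\r\n".
data = pd.DataFrame({col: ["1498 CC", "1197 CC\nTurbo", "999 CC\r\nTurbo"]})

# Strip "CC", an optional "Turbo" after any whitespace, and surrounding spaces.
data[col] = data[col].str.replace(r"\s*CC(\s*Turbo)?\s*$", "", regex=True)

print(data[col].tolist())
```

Passing `regex=True` explicitly matters here, because newer pandas versions treat the pattern as a literal string by default.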
Hi guys, I need help finding a regex pattern. I have a list of strings, e.g. ['TR ITA14TRK010 FF BROOKLYN STRAIGHT Beige 44','MB ITA14BLT016 35MM NA Olive Green 32','MB ITA15BLT004 40MM NA Reddish Brown 38 / 112CM'], and I want to split each string on the token that starts with IT (like ITA14TRK010, ITA14BLT016, ITA15BLT004) so that I can grab the brand name at the start of the string (TR, MB, etc.). How can I write a good regex for this? I have written r'IT[A-Z0-9]+'. Can someone validate it?
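The posted pattern works on the three sample strings, but without word boundaries it would also match "IT" in the middle of another uppercase word (for example "WHITE"). A sketch with `\b` added, where `split_on_code` is just an illustrative helper name:

```python
import re

items = [
    "TR ITA14TRK010 FF BROOKLYN STRAIGHT Beige 44",
    "MB ITA14BLT016 35MM NA Olive Green 32",
    "MB ITA15BLT004 40MM NA Reddish Brown 38 / 112CM",
]

# \b keeps the pattern from matching "IT" inside another word (e.g. "WHITE").
CODE = re.compile(r"\bIT[A-Z0-9]+\b")

def split_on_code(text):
    """Return (brand, code, rest), or None when no IT... code is present."""
    match = CODE.search(text)
    if match is None:
        return None
    brand = text[:match.start()].strip()
    rest = text[match.end():].strip()
    return brand, match.group(), rest

parsed = [split_on_code(s) for s in items]
for brand, code, rest in parsed:
    print(brand, "|", code, "|", rest)
```

If the code always sits right after the brand, `text.split(maxsplit=2)` would do the same job without a regex, but that relies on the brand never containing a space.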
Is BertGeneration or gpt-3 better suited for text generation tasks
Sorry for the late response, I was travelling to another city
Can you do a .unique() on happiness rank?
let me try
Okay, now this makes sense. You need linear regression for this. There are too many levels in your target.
There's models that are (slightly) more adequate than linear regression for your problem (because you have integers) but I don't want to send you down a rabbit hole 🙂
What I can say is that your code isn't doing any feature scaling, and it should. Logistic regression by default (and the new model you should use, which is RidgeCV) applies regularisation automatically, and that doesn't work properly if your features are on different scales. I can explain why if you want, but for now my advice is to always rescale your data. Doing it where it's not necessary isn't as bad as not doing it
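A sketch of that advice as a scikit-learn pipeline, using `RidgeCV` as the regularised linear model. The data is synthetic (the real happiness dataset isn't shown in the thread), and the alpha grid is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3)) * [1.0, 100.0, 0.01]
y = X @ [2.0, 0.03, 150.0] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The scaler is fitted on the training split only, then reused on the test split.
model = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 13)))
model.fit(X_train, y_train)

# R^2 on held-out data; accuracy is not a meaningful metric for a numeric target.
r2 = model.score(X_test, y_test)
print(round(r2, 3))
```

Putting the scaler inside the pipeline means the same scaling is applied at predict time, which is easy to forget when scaling by hand.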
how to do hyperparameter tuning?
my supervisor was mentioning something similar to doit search (is sounded like **oit search, * means i am not sure about start lol ), I didn't catch the pronunciation correctly.
What ML package are you using?
pytorch
That's relatively important to mention because it influences what kind of hyperparameter tuning you can do 🙂
https://optuna.org/ is a good place to start if you're using Pytorch
Typically for neural nets you'd run a hyperparameter tuning algorithm that does the search "sequentially" because you may not have enough VRAM to do it in parallel
Hello.
I'm new to this field. #langchain
Can anyone give advice on how to ask chatGPT for a structured response with a validation model?
The official docs show document extraction for that. But I'm trying to build something like:
Question: Hello chat, what is the State of the Union?
Structured answer:
['Question',
'Alternative rephrased question',
'Second alternative rephrased question']
Kind of hard to build. Please give some advice.
I know it's somehow possible to get structured answers from chatGPT from just one-line questions.
If you have multiple nodes optuna supports syncing hyperparameter search on multiple nodes w a database
personally I use mlflow for this but same idea 🙂
I mostly mentioned this because in the past I never had the ability to train multiple NNs concurrently so instead of doing my default strategy (random search) I'd go with a variant of bayes opt
is there any function to get the index value of a given element from any column?
can you give an example of the input/output of that would be like?
edit; brb in like half an hour
I have 2 dataframes, A and B. A's columns are vegetable and type. B's columns are vegetable and amount. I want to iterate through B's vegetable column, check if that vegetable is in A, and assign the amount in B as a new column in A.
A might have vegetables that B doesn't, so I'll put a 0 in those cases.
I tried using concat inner.
I think it'd be something like A.join(B.set_index("vegetable"), on="vegetable", how="left"), yeah.
you could do it a bit more "manually" like ```py
# either of
veg_amount = B.set_index("vegetable")["amount"]
veg_amount = pd.Series(index=B["vegetable"], data=B["amount"].array)
# then
A["amount"] = A["vegetable"].map(veg_amount).fillna(0)
``` but you probably should prefer df.join or pd.merge
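For completeness, the merge route might look like this. The two frames are made-up examples matching the description (A with vegetable/type, B with vegetable/amount).

```python
import pandas as pd

# Hypothetical frames matching the description in the question.
A = pd.DataFrame({"vegetable": ["carrot", "leek", "kale"],
                  "type": ["root", "allium", "leaf"]})
B = pd.DataFrame({"vegetable": ["carrot", "kale", "okra"],
                  "amount": [5, 2, 7]})

# A left merge keeps every row of A; vegetables missing from B get NaN, then 0.
merged = A.merge(B, on="vegetable", how="left")
merged["amount"] = merged["amount"].fillna(0).astype(int)

print(merged)
```

This assumes each vegetable appears at most once in B; with duplicates, the merge would repeat the matching rows of A.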
i need to tune
3 weights of 3 losses
one temperature
learning rate
what hyperparameter tuning method should i use and why? when to prefer one over other? i am using torch
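As a baseline, here is a sketch of random search (the "default strategy" mentioned earlier) over exactly this search space. Everything concrete in it is a placeholder: the ranges are guesses, and `validation_loss` is a toy function standing in for "train the torch model with this config and return a validation metric".

```python
import math
import random

def sample_config(rng):
    """Draw one configuration; the ranges are placeholder assumptions."""
    return {
        "loss_weights": [rng.uniform(0.0, 1.0) for _ in range(3)],
        "temperature": rng.uniform(0.5, 5.0),
        # Learning rates are usually searched on a log scale.
        "lr": 10 ** rng.uniform(-5, -2),
    }

def validation_loss(config):
    """Toy stand-in for a real training run (lower is better)."""
    w_err = sum((w - 0.5) ** 2 for w in config["loss_weights"])
    t_err = (config["temperature"] - 2.0) ** 2
    lr_err = (math.log10(config["lr"]) + 3.0) ** 2
    return w_err + t_err + lr_err

def random_search(n_trials, seed=0):
    """Sample n_trials configs independently and keep the best one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return min(trials, key=validation_loss)

best = random_search(n_trials=50)
print(best, validation_loss(best))
```

Random search is easy to run in parallel because trials don't depend on each other. When each trial is expensive and has to run one after another, a sequential method such as the Bayesian-style sampler in Optuna is usually the better fit, since it uses earlier results to pick the next config.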
is there any diff between the 4060 lp and the regular one ??
You'll want to compare attributes like its memory and FLOPS
do you have any thoughts on this @serene scaffold I want to make a web application that uses different LLMs from different organizations. I'm worried earlier inputs won't be used in the attention mechanism, as they are different LLMs. Is it possible to retain the different attention histories across each LLM? In other words, use the previous attention from one LLM in another
I was thinking concatenating attention might work but not sure if the math behind that does what I want it to
I don't think it's guaranteed that every LLM uses attention (though it might be unusual for one not to), let alone in a way that can be used as-is in a different LLM
Hi everyone, this question isn't particularly related to AI, more so to the philosophy of AI, so if there is a more suitable channel feel free to point it out and I will move it there.
I want to know what exactly the robot argument and the system argument to the Chinese Room Argument mean.
ChatGPT won't replace developers within ten years, if it's that
So far, my understanding of the Chinese room argument is as follows.
haha dw i want nothing to do with AI or datascience
this is for some content i'm learning for an algorithmics class
with the system argument, the general gist of it is that even tho the human doesn't understand Chinese, the room as a whole does. But what does understand Chinese, in that context mean?
"What does it mean to understand" that's the crux of the question.
That summary also seems terribly written, no offense to the author.
It's the sentence about "Searle claims that if a human-like computer..."
It sounds like you might not fully understand what the Chinese room analogy is intended to convey.
to me it seems like the analogy is there to argue that there is no way for strong AI to exist, to act as a human mind does.
is that wrong?
or is it to argue the Turing Test?
I don't think the analogy is intended to argue for one position or another, but to provide a basis for discussion
fwiw, my take on it is that it's a very narrow statement: that merely imitating intelligence != intelligence. Perhaps as a rebuttal of the turing test.
"if a robot participates in a conversation only by looking up responses from a table of inputs and outputs, can it be considered to understand what is being said?"
isn't his basis that it can't, though? because he does rebut a lot of arguments that try to say that it can be intelligent?
actually wait basically, i just want to confirm this statement:
I haven't read the Searle paper that you're referring to
I'm looking at https://plato.stanford.edu/entries/chinese-room/ now
Alan Turing believes that if a computer can simulate a human being well enough, it is intelligent. Searle argues that no matter how well a computer is programmed, it is still only simulating understanding, and is not intelligent
this is where i found some info: https://iep.utm.edu/chinese-room-argument/#SH2b
i haven't read any papers, i just want to understand enough to get what the argument argues
What's interesting... as I read through this, is not the argument, but the replies to the argument
i like the brain simulator reply the most
i meant brain simulator
just cause the analogy is really well thought out,
just got into it. Interesting read, enjoyed it.
Searle does make a conclusion, and it's a non sequitur. It's another form of Vitalism, but for modern day. In addition, whether it can understand the tasks given depends on the tasks, specifically how much they rely on knowing about the real world. If the task is for example, math (symbol manipulation / algebra parts of it), then sure it can do those and "understand" them, just like a human would in that situation. Fundamentally it's arguing that computers lack symbol grounding, but they can have symbol grounding, this is an arbitrary assumption / premise made by Searle that makes their argument work at all.
This argument was also probably made in the context of understanding AI as it was back when everyone did symbolic ("good old fashioned") AI. Which is why it has the heavy focus on symbols and symbol grounding.
From what I have read, it seems Searle does actually admit that it could be possible, but that it has not been done yet by machines, only biology. So basically "not yet."
Seems like a bit of a walk back.
Something like "brains have the magic sauce" (again, Vitalism).
(Vitalism used to be huge when cells / humans being made of cells first started becoming accepted)
(we like to feel special)
i feel like (based on his reply on the other minds reply) that he wants to say the machine can't ever do it, because a machine can never become a biological.
well not quite.
The Many Mansions Reply suggests that even if Searle is right in his suggestion that programming cannot suffice to cause computers to have intentionality and cognitive states, other means besides programming might be devised such that computers may be imbued with whatever does suffice for intentionality by these other means.
This too, Searle says, misses the point: it “trivializes the project of Strong AI by redefining it as whatever artificially produces and explains cognition” abandoning “the original claim made on behalf of artificial intelligence” that “mental processes are computational processes over formally defined elements.” If AI is not identified with that “precise, well defined thesis,” Searle says, “my objections no longer apply because there is no longer a testable hypothesis for them to apply to” (1980a, p. 422).
he wants to say that a machine is defined as something programmed, with instructions. If it doesn't work like that, the argument fails because the argument is not meant to target biological machines
That is what I read, but then I read that Searle basically said that brains can do it, machines can't, and that it would need to be demonstrated. Which is not the same as "not possible."
I think the opinion and message has changed over time.
Basically this. Someone probably said something else that did not make sense and this was in response.
The messaging is not really clear enough, so i'm just going to leave it at what I wrote.
so someone posted this over in the excel discord, what is your impression?
We explore the recently announced Excel feature: Python embedded within Excel
excel and python are kind of my thing, wonder if i should milk this
🐄
it looks nice, but the subscription at the end does seem like something Microsoft will milk.
Not sure if this is the right chat but I'm at a stage in which I need to extend some of the built in stats file for an open source library that I'm currently using. Upon checking in with chatgpt to help me do so, it suggests that it's best to subclass it rather than modify the original file.
In theory, if I'm subclassing would I just need to add the pertinent code to a whole new file and include this in the same directory as the rest of the library files? Or is there something else I'm missing? Thanks.
What’s the open source library?
And it sounds like you’re not very familiar with subclassing, right?
That's right. This is my first time doing this.
Generally, subclassing lets you override or add to the functionality of the base class. It requires a few examples to explain, but perhaps you should start with the inheritance section here: https://python.swaroopch.com/oop.html
In terms of file placement: you’d usually put your code in your own directory. It doesn’t matter where the library files are. You’d import the library, then any modules you wrote
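A minimal sketch of what that can look like. `BaseStats` here is a made-up stand-in for whatever class the library actually provides (the library was never named), and the file name is hypothetical too; in real code you would import the library's class instead of defining it.

```python
# my_project/extended_stats.py  (your own file, outside the library's directory)

# Stand-in for the library's class so this sketch runs on its own.
# In real code you would import it from the library instead
# (the class name and its methods are assumptions for illustration).
class BaseStats:
    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return sum(self.values) / len(self.values)

    def summary(self):
        return {"mean": self.mean()}


class ExtendedStats(BaseStats):
    """Adds a statistic and extends a method without touching the library."""

    def value_range(self):
        # New functionality that the base class does not have.
        return max(self.values) - min(self.values)

    def summary(self):
        # Override: reuse the parent's result, then add to it.
        result = super().summary()
        result["range"] = self.value_range()
        return result


stats = ExtendedStats([2, 4, 9])
print(stats.summary())
```

The library's own files stay untouched, so upgrading the library later does not wipe out your changes.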
Thank you
@latent ibex @left tartan this would go in #software-architecture btw
dipping my toes in ai / ml, hoping someone else can suggest some reading for what im trying to accomplish
my goal is to first determine whether a piece of text is code, and if it is classify which language it was written in. i understand this is very accomplishable without any ML but i just want a lil project 🤠
I just dont even know where to start cause ive never rly touched this
is there a way to check if a variable is truthy and equals a value at the same time? I have to use this and it's annoying. I have a js background:
if meta_tag_property != None:
    if meta_tag_property.startswith('og:'):
wrong channel im assuming haha but I'd just do if meta_tag_property and meta_tag_property.startswith("og:"):
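A small runnable version of that one-liner, wrapped in a helper (the name `is_open_graph` is made up for illustration):

```python
def is_open_graph(meta_tag_property):
    # `and` short-circuits: if the left side is falsy (None or ""),
    # .startswith() is never called, so None cannot raise AttributeError.
    return bool(meta_tag_property) and meta_tag_property.startswith("og:")


print(is_open_graph("og:title"))
print(is_open_graph(None))
print(is_open_graph("twitter:card"))
```

The empty string is also treated as falsy here, which gives the same answer as the nested version since `""` does not start with `"og:"` anyway.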
Looks like a fun project. On the top of my head I know a few ways how I would solve this. I'd say what matters most for you is what part of AI you want to delve into:
- Do you want to do traditional ML and generate "features" (input variables) and then train a model?
- Do you want to pass it off to something like a neural network?
- Do you want to use an API to generate "features" to generate your input variables and then train a model.
thanks. Which channel is better?
i was leaning more towards processing the text myself to generate features and then train a model
I think general python stuff just goes in #1035199133436354600
That's a fun one to do! I would suggest you pick a few programming languages but make it sufficiently hard for yourself (have C# and Java be in there together) and then for each language you read a bunch of code and ask yourself "what makes Python code Python"
... and then you'll need a lot of regex to make rules
That's just what I would do on the top of my head. I'm also not sure how far it will get you 🤔 . You can always move on to 2) and 3) if it's not working well
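A rough sketch of the "lots of regex" approach, assuming three languages and a handful of hand-picked patterns. The patterns are illustrative guesses, not a vetted rule set, and the function names are made up.

```python
import re

# Hand-written patterns per language (illustrative only).
PATTERNS = {
    "python": [r"^\s*def \w+\(.*\):", r"^\s*import \w+", r"\bself\b", r"^\s*elif\b"],
    "java": [r"\bpublic (?:static )?\w+", r"System\.out\.println", r"\bimport java\."],
    "csharp": [r"\busing System\b", r"Console\.WriteLine", r"\bnamespace \w+"],
}


def extract_features(text):
    """Count how many times each language's patterns match the text."""
    return {
        lang: sum(len(re.findall(p, text, flags=re.MULTILINE)) for p in patterns)
        for lang, patterns in PATTERNS.items()
    }


def guess_language(text):
    features = extract_features(text)
    best = max(features, key=features.get)
    # No pattern matched at all: treat the text as "not code".
    return best if features[best] > 0 else None


print(guess_language("def foo(x):\n    return self.x\n"))
print(guess_language('using System;\nConsole.WriteLine("hi");'))
print(guess_language("hello there, how are you?"))
```

The per-language counts from `extract_features` are exactly the kind of hand-made features you could later feed into a trained model instead of the `max` rule.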
so a neural network solution would look like what?
I know basically nothing about AI so most of these concepts r very foreign haha
A neural network solution would process the raw strings into some sort of vector and then it would use that to make predictions. The difference is that you are no longer generating features yourself
mm okay, I assume for training that I would feed it the vector + label of each document?
Exactly, you feed it the vector and then you compare with the label to determine the error/calculate the loss during training
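To make the "vector + label, then compute the loss" loop concrete, here is a toy sketch in plain Python. It uses a single linear layer with softmax rather than a real neural network, and the character list, labels, and sample snippets are all assumptions chosen for illustration.

```python
import math

LABELS = ["python", "java"]
# Tiny fixed "vocabulary" of characters used to build the input vector.
CHARS = [":", ";", "{", "}", "#"]


def vectorize(text):
    """Raw string -> fixed-length vector of character counts."""
    return [float(text.count(c)) for c in CHARS]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, x):
    # One linear layer: a score per label, turned into probabilities.
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return softmax(scores)


def cross_entropy(probs, label_index):
    # The loss compares the prediction with the true label.
    return -math.log(probs[label_index])


def train_step(weights, x, label_index, lr=0.1):
    probs = predict(weights, x)
    for k, row in enumerate(weights):
        # Gradient of cross-entropy w.r.t. the scores is (probs - one_hot).
        grad = probs[k] - (1.0 if k == label_index else 0.0)
        for i in range(len(row)):
            row[i] -= lr * grad * x[i]
    return cross_entropy(probs, label_index)


data = [
    ("def f(x):\n    return x  # comment", 0),
    ("int f(int x) { return x; }", 1),
]
weights = [[0.0] * len(CHARS) for _ in LABELS]
losses = []
for epoch in range(50):
    losses.append(sum(train_step(weights, vectorize(t), y) for t, y in data))

print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

A real solution would swap the character counts for a learned representation and the single layer for a deeper model, but the feed, compare, update cycle is the same.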
well that doesn't sound incredibly difficult 
You're very handsome and I appreciate you
thank you!
you're basically saying can't/unlikely to be done yeah
hey hey yall, anyone got any good data sources to practice machine learning models with python? Ive tried searching in kaggle but i dont think i have a good trained eye to select good data sets. im tryna practice my xgboost, random forest parameter setting and optimization skills. also am pretty new to python coding. i was told to browse thru the pins but i cant find anything too specific
It does not really matter what you start with, as long as it is more complex than Iris... Titanic is interesting, as it benefits from cleanups in the same way many real-world datasets do
I don't really have any idea. haven't tried it.
Kaggle's tabular playground series
Specifically for xgboost just tune the number of trees, for random forest the cost_complexity
does anyone know of site or video that could teach me how I can build something like this for options , data supplied by yfinance and CBOE
Do you know black scholes?
Like, are you looking to write this yourself? GEX curves are a little annoying to calculate/draw out
but hang on a sec... I know a blog that posted this...
The method is solid, although slow. It can be vectorized, if youre up to the task
nope can't say I'm familiar and yes I do want to write this myself if I have to, I just want my personal app website or whatever that can give me those levels , for now I'm getting them from another server by the name of investors haven , I don't fully understand how they do it but up for the challenge and definitely willing to learn more
thanks for this bruv
Ya that last sentence of your first paragraph is garbage
Whoever authored that thought is an idiot (the chinese room author in this case)
If you simulate understanding, you have understanding, which is intelligence
hey can someone whos done machine learning with both pytorch and tensorflow help me out. I can't decide which i wanna learn first.
i thought intelligence was coming up with things by your own self
Pytorch
why pytorch over tensorflow?
Simpler to use
No I think intelligence can be reduced to problem solving in x domain
Basically if you can manipulate data to solve your problem
a calculator can do that... is it intelligent?
i think im too dumb to talk abt this
Hello everyone!
I have to do a project using machine learning but I would like to know if anyone has an interesting dataset to share with me?
thank you 😄
oh thats super interesting ima check that out thx
Hi, which library do you guys think is better for building decision trees?
Which libraries are you talking abt
?
Decision trees is a pretty simple algorithm tho
Tensorflow or Scikit-Learn?
My idea for this is that once I get this other model working, I will use the output of that model to help me decide an approximate date, like month, of when an image was taken. I think I’ll include a CNN to help me identify the season and then do the decision tree portion. Do you think this is a good idea?
scikit-learn is easier
but tensorflow is more made for nn
Try openbb terminal
scikit-learn and Tensorflow/Pytorch are different use cases imo.
I would only use Tensorflow's decision trees if for some reason I had to do the predictions on edge with TF lite.
xgboost
Hi guys, do any of you know how to display a neural network? It is a simple Neural Network (3 in, 2 hid, 3 out). I am using neat-python for if it helps, but you can use any package if you want. If you can help me out, that would be great!
Does anyone know a good deep speech yt tutorial?
results = all_months_data.groupby('Month').sum()
months = range(1,13)
plt.bar(months, results['Sales'])
plt.xticks(months)
plt.ylabel("Sales in USD($)")
plt.xlabel("Month")
plt.show();
Month contains datetime objects. Code runs perfectly fine on Jupyter but I get a "Cannot sum datetime object" error on kaggle, how to fix this.
check which version of pandas you have running locally and which version you have running on Kaggle
you can specify which columns you want to operate on after using groupby like df.groupby(groupby_col)[target_col].function()
Will the new data set contain the month column? I need the month column in the new dataset as well
the month will become the index
if you need it as a column you could reset the index after running the aggregation
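A small sketch of that suggestion with made-up data. `Month` and `Sales` come from the snippet above; the `Order Date` column is an assumption standing in for whichever datetime column is tripping up `sum()`. The likely reason for the difference between environments is that newer pandas no longer silently drops non-numeric columns in groupby aggregations.

```python
import pandas as pd

# Made-up data standing in for all_months_data.
all_months_data = pd.DataFrame({
    "Month": [1, 1, 2],
    "Sales": [10.0, 5.0, 7.5],
    "Order Date": pd.to_datetime(["2019-01-03", "2019-01-20", "2019-02-11"]),
})

# Select the column to aggregate before calling sum(), so the datetime
# column is never summed. 'Month' becomes the index of the result...
results = all_months_data.groupby("Month")["Sales"].sum()

# ...and reset_index() turns it back into a regular column.
results = results.reset_index()
print(results)
```

With `Month` back as a column, the plotting code can use `results['Month']` and `results['Sales']` directly.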
I have 1.4.2 running locally and 2.0.3 on kaggle
Okay I will try that
Thanks
definitely update your local version
Alright
Hello, has anyone here used TurboODBC (although it could just be a typical ODBC driver issue as well...) and dealing with a converted datatype from Pandas for BIT, MONEY, and TEXT to SQL Server (in Azure)? Getting various issues about cannot convert: Numeric Error.
anyone here very familiar with stochastic matrices?
they're used in some RL right? like a matrix of probabilities for each state an agent can have?
transition probability matrices yes, with markov chains
unless Edd is referring to a different type of Stochastic matrices ;-;
I have a question about annotating data for training. I would like to train LayoutLM with my own dataset of scanned forms. I plan on annotating the data using the same method used for the Funsd dataset. I have used pyTesseract to extract the data from the images. Unfortunately pyTesseract, isn't perfect! even after pre-processing the images (removing lines, noise, and binarizing).
Does the annotation need be based on extracted data from pyTesseract or the data as it should appear?
For example, do the bounding box coordinates need to match those in the pytesseract data? If there is a missing word, do I add it into the annotated data?
oops, i disappeared all of a sudden. yeah, stochastic matrices like in markov chains. say right stochastic matrices, more concretely (rows adding up to 1). do you know of any interesting properties of the product A^T A? maybe some bounds on the off diagonal elements 👀
if I want to try make some 3d model with AI what should I use instead of NeRf and pytorch3d?
and also rasterization based not rt
assuming I have dataset of images
Hi, looking for help : How to proceed Fine-tuning with LlamaIndex for any models (for example with finBERT model) ?
So currently I am working on a project which consist of fine-tuning our model FinBERT with the LlamaIndex method (https://gpt-index.readthedocs.io/en/latest/examples/finetuning/embeddings/finetune_embedding_adapter.html) in order to have better result in the context of Sentiment analysis. I am actually a beginner so I would appreciate any kind of help for a better understanding of this process.
Looking forward hearing from you 🙂
Does anyone have some good resources on feature selection? I have a dataset with several combinations of features. One way is to make a model for each combination and test these models against each other, another way (per a blog I read) is to train the model with the full features then test it on the different combinations of features (when it performs worse it means the missing feature was important). Is there any authoritative source on this?
Specifically feature selection and not creation? (Just to be sure)
the idea of even doing feature selection is controversial. for example in predictive modeling, you often just apply regularization and don't bother trying to remove features
i don't know of a single authoritative source for feature selection, there might be something about it in elements of statistical learning
there is a lot of old bad advice out there about "stepwise" regression, but there are a lot of problems with that
@past meteor Yes, as in selecting a set of existing features not inferring new ones.
aha, interesting @desert oar
This is the complete answer, I have nothing to add! 😄
I might be overthinking my use case, I am making plots, plotting several dimensions of time series data (line showing the position, colour showing speed etc etc) so the thought is to find what features actually help the model
Even if your model is capable of finding non-linear patterns explicitly making them can help
But using plots to remove features? idk, I would probably not do that.
Regularization is the answer
yeah, it's useful to know which features are important, but using that importance to actually remove features from the model is what's questionable
Knowing what features are important is so dangerous that I wouldn't touch it unless you know what you're doing
the business people will want to know 🙂
@desert oar good point. As in - there is no reason to actually remove features. Thx @past meteor - I guess that might move me into overfitting landscape and such
that said, i have run into cases where i did really want to remove "irrelevant" features, but it's a case-by-case situation. if you can explain what you're actually doing maybe we can provide more detailed advice
I'd nuance it to the very very very maximum that it means nothing
"Under this particular instance of the model and our data the most important features appear to be ..., different instances may drastically find different importances."
Business cannot expect me to give them more unless they give me the € to do an experiment 🤣. I "fight" this every other day.
Their statistical literacy can be low, if you give them what they think they want they'll make decisions that hurt the business
haha, good point 🙂 kind of a mvp product, just minimally budgeted product
Hi guys, need some help, i have created n clusters, a cluster contains m docs which are embeddings of 2048 dims (1 doc = 2048 dim of vector, 1 cluster = m docs), now i have a query string, i want to get the most relevant/similar cluster that it can fall under, so i'm thinking of calculating an average embedding of a cluster, and finding cosine sim b/w the query embedding and the cluster embeddings to find the most relevant cluster it can belong to? Any other efficient approach?
If you have 2 highly correlated features your regularizer will kill 1, that doesn't mean the feature is irrelevant to the problem etc etc
then what do you suggest?
Just use regularisation and call it day imo
And if someone asks you "what is important and what isn't" you can answer a variation of
lol, thanks
Oh you were a different person
Indeed
Thought u were referring to my question …
Is this latent semantic analysis you did? (LSA or LSI)
Maybe explain what you want to achieve in non technical terms
Because it might be just a standard case of LSA
I have a question about annotating data for training that I asked yesterday but has not been answered. I'm hoping someone can provide some insight. #data-science-and-ml message
let's say i've created 7 clusters, a cluster basically have n # of docs, now a user can input his/her query, now i want to recommend him/her the most relevant cluster of docs based on the query inputed by him/her...
Before that, without using clusters etc
It's a retrieval problem? Someone gives an input, what do you want to give them, the most relevant document?
or the most relevant topic?
document
basically, a user uploads a csv file
each row is a doc let's say
now i take this csv and create clusters out of it, now the user also enters a query, now based on this query, i wanna recommend him a cluster
the most "similar/relevant" cluster
Your case is exactly latent semantic indexing
From my old slides, that's what you want to do right?
yea i wanna give him the index of the cluster which contains relevant docs based on his query
You actually don't need to cluster
i know, i can just show him most similar docs
i've done that
but the cluster part is a different feature
basically with clusters, the user can look at other options...
the ability to explore more...
You can just take the cluster centres and do the same as before
That's indeed what you proposed originally
yea i guess its a KNN problem...?
(I think you're making this a lot harder than it should but...) take the mean of all the docs in the cluster, compute the cosine sim, take the most similar one, show the top N in that cluster
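A plain-Python sketch of those steps (mean per cluster, cosine similarity to the query, then top N inside the winning cluster). The 3-dimensional vectors and cluster names are toy values; the real embeddings would have 2048 dimensions, and the function names are made up.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of the document embeddings in one cluster."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def recommend(query, clusters, top_n=2):
    """Pick the cluster whose centroid is closest to the query, then rank its docs."""
    best = max(clusters, key=lambda name: cosine_similarity(query, centroid(clusters[name])))
    ranked = sorted(clusters[best], key=lambda doc: cosine_similarity(query, doc), reverse=True)
    return best, ranked[:top_n]


# Toy embeddings grouped by cluster.
clusters = {
    "cluster_0": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "cluster_1": [[0.0, 1.0, 0.2], [0.1, 0.8, 0.3]],
}
best, docs = recommend([1.0, 0.0, 0.0], clusters)
print(best, docs)
```

Because the documents inside the chosen cluster are re-ranked individually, the averaging only affects which cluster is picked, not the order of what is shown from it.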
yea but lets say in a cluster i have 2 docs, 1 doc has high similarity but the other one not so much, but they still have some similarity that is why they r in the same cluster, now mean operation will average out the embeddings, so info may get lost...?
Yes it will but c'est la vie
wait let me translate that...
that's life*
This is what my course had to say about it:
NLP from 1979 though 👀
Their suggestion is just to take the mean of the documents and return all in the cluster, which has drawbacks ofc as you mentioned.
hmm... yeah
anyway, thanks
ill implement what we discussed and will keep researching to find a better approach to solve this...
hi What tools would be helpful to make a visualization like this: https://www.linkedin.com/search/results/content/?fromMember=["ACoAAAJIlxABE10sM8zEE7MSsBmy06lUQODDj_U"]&heroEntityKey=urn%3Ali%3Afsd_profile%3AACoAAAJIlxABE10sM8zEE7MSsBmy06lUQODDj_U&keywords=james eagle&position=0&searchId=24e66560-e6f4-40ff-817b-64cced149b69&sid=9(F&update=urn%3Ali%3Afs_updateV2%3A(urn%3Ali%3Aactivity%3A7109400669119242241%2CBLENDED_SEARCH_FEED%2CEMPTY%2CDEFAULT%2Cfalse)
Are these two expressions equivalent?
what's that fancy 1
1 if the underset equality is true, else 0
I think (because the assignment doesn't say)
(the first expression is given in the assignment and the second is my attempt at rewriting it to be easier to reason about)
yeah, you factored out the -1 exponent from the log. looks equivalent
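The two expressions themselves are not reproduced in this log, so the following only spells out the two facts used in the replies: the indicator notation as it was guessed above, and the log identity behind "factoring out the -1 exponent" (the symbols a, b, x are placeholders, not the assignment's).

```latex
\mathbb{1}[a = b] =
\begin{cases}
1 & \text{if } a = b \\
0 & \text{otherwise}
\end{cases}
\qquad\qquad
\log\!\left(x^{-1}\right) = -\log x \quad (x > 0)
```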
ty math wizard
Why am I getting such an error? I can't use Cv.Imshow directive
how did you install opencv? just pip install?
Yes, but after I got this error, I deleted it and reinstalled it, it said so on the internet, but it didn't work.
from https://github.com/opencv/opencv-python/issues/18 it sounds like opencv-python-headless may be causing the problem, do you have that installed?
yes
maybe try uninstalling it then reinstalling opencv
or just nuke your current venv and create a new one
I deleted and reinstalled opencv many times but it didn't work. I will delete my project and open a new one
@agile cobalt It didn't work, I still get the same error, it's ridiculous.
try reinstalling python, this time from python.org instead of the windows store, then create a virtual environment before using pip
I already installed Python from Python org, but I couldn't understand what you mean by virtual environment.
tl;dr keep things tidy instead of ending up with messes that can cause all sorts of problems like what you just had
@agile cobalt What I am about to say may seem strange to you, but when I run the code using a different IDE, there is no problem, but I get this error in Visual Studio Code. I really couldn't understand
that other IDE is PyCharm?
or something inside of Anaconda
both of these manage virtual environments for you to some extent
No, normal Python Idle
it might be just pointing to a different python interpreter then
in VSCode, do you see the python version in the bottom right corner? Click it and select a different interpreter
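One quick way to check which interpreter a given editor is actually using is to run the same few lines in both places and compare the output. This is just a sketch; the helper name is made up.

```python
import sys


def interpreter_info():
    # sys.executable is the path of the Python binary running this script;
    # packages installed for a different interpreter are not importable here.
    return {"executable": sys.executable, "version": sys.version.split()[0]}


print(interpreter_info())
```

If the paths differ between the two editors, then `pip install` most likely put the package into one environment while the other editor is running a different one.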
Wowwww I chose anaconda and it worked, very strange
So the problem is that simple
I've been trying to solve this problem for 2 hours
managing dependencies in python can be a pain in the ass sometimes
Thank you so much bro
hi I was wondering if anyone wanted to work on a ai project with me 😄 here is a blue print for the ai
I am building a CNN with data regarding most popular cities. If I train my cnn with the cities I have right now and gather more data of other cities later, will it remember the previous cities or will it forget and just remember the new ones?
Should I just wait until all my data and train it all together?
@abstract wasp training a CNN to do what?
To give an estimated location of an image.
so it's classifying images of locations within cities according to their city?
would we expect it to classify an image of the empire state building as "NEW YORK"?
Yes
if you trained your classifier on cities {a, b, c}, and then continued training it on {d, e}, you would run the risk of the classifier forgetting {a, b, c}. and there are strategies for mitigating this, but unless the training you did on {a, b, c} can't wait and would be expensive to replicate, you'll get better results if you train once on {a, b, c, d, e}.
Ok, thank you!!
Wait, I have another question. If I train a, b, c, d, etc., and I later gather additional data for each other those same classes, is the risk still the same or will it be okay?
suppose you have two training sets A and B that both contain instances of the same classes. I suspect that if the distributions of the classes is the same in both sets, then training on A, and then training on B, wouldn't be much different from training once on the union of A and B.
but that's probably a question that dissertations are written about.
Ok thanks!!
@abstract wasp why do you ask? do you already have a trained model that was expensive to train?
No, I haven’t trained it yet. I’m still gathering data but rn, I’m just gathering data for the 100 most popular cities. The overall goal of this is to have data for most of the world—I just wanted to see if I should build diff. CNNs and fuse them together or if I should just train the CNN with what I have, save it, and later retrain it with additional data.
But yeah, training a CNN with this amount of data would be very expensive, that’s also another factor.
can I have a simple decision tree code review?
whenever you ask for something online, give people everything they would need to fulfill your request.
k
sorry abt that
!paste
something to keep in mind as you approach this is: if someone with internet access looked at one of the images in the dataset, would they be able to figure out what city they're from? if the images are so non-descript (no famous buildings, no city-specific architecture, etc.) that there's nothing about them that could be tied to a particular city, a neural network won't be able to magically solve that for you.
https://paste.pythondiscord.com/UHHA
My decision tree code I challenged myself by not using entropy or information gain
I think I did an ok job but might be able to improve
if anyone can review the code and tell me what I should improve on I would be happy to hear
Yeah, that makes sense. Thank you for the support 😄
Is there a specific book or vid series for ml that focus on how to achive accuracy for different data behaviours. And techniques to win competitions.
As most of the books focus on teaching algos...
Like kaggle competitions?
Yeah
Maybe the kaggle book?
How is ^VIX being calculated as having only a 0.0272 correlation with VIXCLSx
Should I be normalizing first? Using pandas corr function, pearson correlation
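On the normalizing part of the question: Pearson correlation does not change under rescaling or shifting of either series, so normalizing first should not move the number. The sketch below checks that with made-up values. It also shows that pairing the values in a different order changes the result a lot; whether misaligned dates are actually the cause for ^VIX vs VIXCLSx is not established here, it is only one thing worth checking.

```python
import math


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


a = [12.0, 15.5, 14.0, 20.0, 18.5]  # made-up values
b = [13.0, 15.0, 14.5, 21.0, 18.0]  # made-up values

r_raw = pearson(a, b)
r_scaled = pearson(min_max(a), min_max(b))

# Same values, but paired in a different order (as can happen when rows
# are matched by position rather than by date).
r_misaligned = pearson(a, b[::-1])

print(r_raw, r_scaled, r_misaligned)
```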
Hi Folks,
During my free time, I was doing a personal project: basically I created a chatbot which can answer your questions from documents. I used Langchain (framework), ChromaDB (vector database), Streamlit (UI) and either a local LLM (Llama2-based model) or the OpenAI API for the LLM. You can use PDF, TXT, CSV, and DOCX files for question answering. Any contributions to this project will be highly welcome. Thanks!
Github link: https://github.com/himanshu662000/InfoGPT
Hi I'm a complete beginner to ml and need to train a model to automatically find coordinates in an image, can someone please point me to some resources and libraries that can help me accomplish this, thanks.
chat
gpt
Dude this is actually fucking sick I've been meaning to build the exact same thing but haven't had the time lmao. Literally everything the same I wanted to use chroma and use my own local llm and use langchain to make it talk to itself
Does it get slow with large amounts of pdf, essentially if you gave it an entire bookshelf to search through? I'm definitely downloading this and trying it out though. How long did it take?
I already have a correction, it is supposed to be requirements.txt not requirement.txt just a tiny thing lol
Thanks! No actually it will not be slow irrespective of size of pdf but while ingesting data to chromadb(which u will do definitely before querying) it will take little more time for large pdf. But while querying and getting your answer u will not notice any difference irrespective of pdf size.
Thanks I will change it
Cool!
So do you have to use GGUF llms? Or can you use any llm like GPTQ?
any AI tools available for CLI prompt validation? i.e. to check whether a string answer to a command line prompt has an appropriate format
Do any of you guys know some high quality libraries for making maps in python
I have worked with plotly and folium
Plotly is pretty good but runs into some limitations from time to time
does anybody know the maximum length a SQL query can be with Athena DB ?
I basically need to do: SELECT ... WHERE col IN <very-long-list-of-values> the list/tuple can have upwards of 100k strings with at least 50 chars each
not sure if I pass it as query parameter if it will still matter or not...
worked with geopandas before seems pretty clean
I'm at a loss for how to proceed with this question. It appears that we have function K as R^d x R^d -> R, and Phi as R^d -> R^d, but I don't understand the relationship between K and Phi.
This is my zone, sec! 😄
This one does have me thinking tho
. The relationship between K and phi is called the kernel trick. Phi maps x to an infinite dimensional space for an RBF kernel (you can't compute this). Basically K allows you to do a dot product of two vectors that were mapped in a (possibly) infinite dimensional space without explicitly going there.
You definitely have to look at it in terms of K and not phi, that's the property they want you to exploit
if x_i = x_j then the term is exp(0) = 1, and if they are different, ||x_i - x_j||² is a positive number which you multiply by -1/2, giving a negative exponent, so the term is strictly between 0 and 1.
I don't understand the <= 2
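The assignment itself is not shown in this log, so this is a guess at where the 2 comes from: if the bound is on the squared distance between the mapped points, then ||phi(x_i) - phi(x_j)||² = K(x_i, x_i) - 2 K(x_i, x_j) + K(x_j, x_j) = 2 - 2 K(x_i, x_j), and since K lies in (0, 1] that is at most 2. A small numeric check, assuming the kernel form exp(-||x - y||² / 2) described above:

```python
import math


def rbf_kernel(x, y):
    """K(x, y) = exp(-||x - y||^2 / 2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * sq_dist)


def feature_space_sq_distance(x, y):
    # ||phi(x) - phi(y)||^2 expanded into kernel evaluations only,
    # so phi itself never has to be computed.
    return rbf_kernel(x, x) - 2 * rbf_kernel(x, y) + rbf_kernel(y, y)


x_i, x_j = [0.0, 1.0], [3.0, -2.0]
print(rbf_kernel(x_i, x_i))                 # identical points
print(rbf_kernel(x_i, x_j))                 # distinct points
print(feature_space_sq_distance(x_i, x_j))
```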
Can you expand on what "automatically find coordinates in an image" means? Like latitude longitude coordinates? Coordinates of some object you're detecting? Either way this doesn't really sound like a beginner project.
Please never recommend chatGPT as a source of information, especially to beginners.
I have an image and I'd have to find out what part of the image should stay and what should be removed...
I need to remove the unnecessary parts (in my current task, I have to remove all the parts that are white & black, ie. texts, etc in a picture which is otherwise full of color?)...
anyway, not even gonna try to sugarcoat this, I got the code from chatGPT
from PIL import Image
import numpy as np
from sklearn.cluster import KMeans

def get_dominant_colors(image, num_colors):
    # Convert the image to a numpy array
    img_array = np.array(image)
    # Reshape the image array to a list of pixels
    pixels = img_array.reshape(-1, 3)
    # Initialize K-Means with the desired number of clusters (colors)
    kmeans = KMeans(n_clusters=num_colors, random_state=0).fit(pixels)
    # Get the RGB values of the cluster centers (dominant colors)
    dominant_colors = kmeans.cluster_centers_.astype(int)
    return dominant_colors

# Open an example image (replace with your box)
box = Image.open('./doggo.jpeg')
# Specify the number of dominant colors you want to extract (5 in this case)
num_colors = 5
# Get the 5 dominant colors within the box
dominant_colors = get_dominant_colors(box, num_colors)
# Print the dominant colors (RGB values)
print("Dominant Colors:")
for color in dominant_colors:
    print(f"RGB: {color[0]}, {color[1]}, {color[2]}")
can someone tell me what's going wrong, I'm trying to get the 5 dominant colors in a image
I do use geopandas regularly. But rather limited in making high quality maps I think.
We need the full traceback in order to know what went wrong. Also please send it as text instead of a screenshot 🙂
hey everyone, i need a favor, can someone run code for me? it doesnt seem to work on my laptop and i want to see if the problem is me (maybe i didnt install all the right packages) or the code. it has to do with webscraping from cboe and gamma exposure for options https://github.com/Matteo-Ferrara/gex-tracker/tree/e4a5cd508268673004e7dcd2f73ce7f74bf251c5
Hi, help, I get an data_iterator = data.as_numpy_iterator() AttributeError: 'DirectoryIterator' object has no attribute 'as_numpy_iterator'
This is my code:
```
import pandas as pd
import numpy as np
import os
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.image import ImageDataGenerator

#GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

#DATA LOADING
train_images = '/Users/avatarvaleria/Projects/colabs/Lys/time/data/images'
classes = os.listdir(train_images)
print(classes)

#DATA AUGMENTATION
data_aug = ImageDataGenerator(
    rotation_range=25,
    fill_mode='nearest',
    horizontal_flip=True,
    brightness_range=(.5, .5),
    zoom_range=.5
)

#APPLYING AUG
batch_size = 32
data = data_aug.flow_from_directory(
    train_images,
    target_size=(256, 256),
    batch_size=batch_size,
    class_mode='sparse'
)
data_iterator = data.as_numpy_iterator()
batch = data_iterator.next()
fig, ax = plt.subplots(ncols=5, figsize=(20,20))
for idx, img in enumerate(batch[0][:5]):
    ax[idx].imshow(img.astype(int))
    ax[idx].title.set_text(batch[1][idx])

#DATA PREPROCESSING
data = data.map(lambda x, y: (x/255, y))
data.as_numpy_iterator().next()
scaled = batch[0]/255
scaled.max()
scaled_iterator = data.as_numpy_iterator()
batch = scaled_iterator.next()
fig, ax = plt.subplots(ncols=4, figsize=(20,20))
for idx, img in enumerate(batch[0][:4]):
    ax[idx].imshow(img)
    ax[idx].title.set_text(batch[1][idx])
data.as_numpy_iterator().next()[0].max()

#SPLITTING
len(data)
train_size = int(len(data)*.7)
val_size = int(len(data)*.2)
test_size = int(len(data)*.1)
print(f"Train size: {train_size}")
print(f"Validation size: {val_size}")
print(f"Test size: {test_size}")
train = data.take(train_size)
val = data.skip(train_size).take(val_size)
test = data.skip(train_size+val_size).take(test_size)
```
What error do you get?
also im a beginner , i just saw source code and thought that it would work
is it operational on your end ?
All that’s saying is the request is returning no data. Presumably there’s no error handling, so it’s probably just hiding an error
why would it do that lol
This code is a year old. Websites change.
Scraping is very fragile. I’ve worked with cboe to do this exact thing before (gex), but manually downloaded the data.
thats the tedious task i am trying to avoid , id like the data to update itself 😔
was it difficult? this is the only reason i am learning python. it really helped with my trading, but my license expired with the site where i was getting the levels from
i think you have sent me this once before , i skimmed through but i will read it now , mind if i add you in case i run into some problems i may need help with ?
what kind of maps do you require?
Like choropleth, or polygon plots. With annotations
is folium/plotly lacking in any way for these?
Sort of like this
Plotly does quite well, but sometimes lacks. So was wondering if there's other tools available that are good in other dimensions
Like for example, plotly wasn't handling overlap of text very well. And I had to custom code a clustering logic which avoided the overlap
gpt did mention that
I mainly wish to make static maps btw
i use cartopy which is what geopandas uses internally
no interactivity but relatively detailed control over output
i get map tiles using contextily
it's good enough for the static images in presentations and docs that i need
I definitely am gonna continue recommending chatgpt to beginners
Sos 😭
It's your life. As long as you are aware that it is a less useful recommendation than "Google it" because of the unreliable nature of modern LLMs.
gpt4 is highly reliable and most beginner programmers just want to do something simple that gpt 3.5 won't be hallucinating anything up for
It's instant responses, like chatting with an industry expert, when the alternative is maybe getting a response every few hours from some people on discord/reddit or poring over documentation. It's how I got my start a few months ago and I found it invaluable
It's absolutely not reliable as it can and will give you contradictory answers to the same logical question when worded differently. Not to mention hallucination is still a problem for gpt4 even if it's not as much of a problem as it was for 3.5. Again, not to mention you didn't recommend it for a simple task, you recommended it to someone who had a fairly complex task. Talking to gpt4 is absolutely not akin to talking to an industry expert.
What is the relationship between K and the inner product? How does || ... ||^2 relate to that?
It’s highly unreliable, and more importantly: prevents new programmers from developing the problem solving skills they need.
Even worse is when it gives a working answer that’s a bad practice
Is this not a helpful response? (3.5 btw)
Hi I'm a complete beginner to ml and need to train a model to automatically find coordinates in an image, can someone please point me to some resources and libraries that can help me accomplish this, thanks.
ChatGPT
Certainly! If you're a beginner in machine learning and want to train a model to automatically find coordinates in an image, you'll likely be working on an object detection task. Object detection involves identifying and locating objects in an image, which can be thought of as finding the coordinates of objects within the image. Here are some resources and libraries to get you started:
- Python: Most machine learning and computer vision tasks in the context of object detection are done in Python.
- Libraries/Frameworks:
  - TensorFlow Object Detection API: This is a popular framework for object detection. It provides pre-trained models and tools to train your own models. Here's the official GitHub repository.
  - PyTorch: PyTorch is another popular deep learning framework that can be used for object detection. You can find tutorials and pre-trained models in the PyTorch Hub.
  - OpenCV: OpenCV is a computer vision library that can be used for various tasks, including object detection. It has pre-trained models and tutorials for object detection. Check the OpenCV documentation.
  - YOLO (You Only Look Once): YOLO is a popular real-time object detection framework. You can find implementations and pre-trained models like YOLOv3 and YOLOv4 in various repositories, such as YOLO GitHub.
- Datasets: You'll need a dataset of images with labeled coordinates to train your model. Some popular object detection datasets include COCO (Common Objects in Context), Pascal VOC, and custom datasets you can create.
- Tutorials and Courses:
  - Coursera and Udacity offer machine learning and computer vision courses that cover object detection.
  - YouTube has numerous tutorials on object detection using different frameworks.
  - Blogs and tutorials on Medium and Towards Data Science often provide step-by-step guides for object detection tasks.
- Books: Books like "Deep Learning" by Goodfellow, Bengio, and Courville or "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurélien Géron can provide a solid foundation in machine learning and deep learning concepts.
- Forums and Communities: Websites like Stack Overflow and Reddit (e.g., r/MachineLearning) are great places to ask questions and seek guidance from the machine learning community.
- Online Coding Platforms: Platforms like Kaggle provide datasets, kernels (code notebooks), and competitions related to object detection. It's a great way to learn and practice.
Remember that object detection can be a complex task, especially for a beginner, but with dedication and practice, you can make progress. Start with the basics of machine learning and gradually delve into object detection techniques as you become more comfortable with the concepts and tools.
Without even reading it; that content is no better than what the same google search would yield.
Note I didn't say it can't provide helpful responses. I said it is unreliable
That’s incredibly generic, overly verbose, and not particularly helpful advice to someone just starting.
In fact, I’d argue that is worse advice than googling and reading a few articles that explain -why- and put things in context
Did people have the same opinion about Google and stackoverflow in the past?
is this GIS?
good day everyone,
please does anyone know how i can edit a particular cell in powerBI?
I think it’s the difference between spoon feeding and researching. It’s one thing to develop the skills to research and answer questions
Hey Billy, how are you doing?
I'm making use of power bi and I have a column filled with null values and I want to edit one particular cell in the column to something else.
The replace function keeps replacing the whole column filled with null instead of the particular cell I want to replace.
How do I do this please?
I don’t know powerbi, sorry
Does anybody else get this "UserWarning: The NumPy module was reloaded (imported a second time). This can in some cases result in small but subtle issues and is discouraged." its like unfixable
!paste Could you show the code that creates the warning?
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.
Seems to be any time I import pandas, or if a library that I import uses pandas
import gdown
import zipfile
import os
def download_data_from_drive(zip_url, output_path):
    # Download the zip file from Google Drive
    gdown.download(zip_url, output_path, quiet=False)
    # Extract the zip file
    with zipfile.ZipFile(output_path, 'r') as zip_ref:
        zip_ref.extractall(os.path.dirname(output_path))
    os.remove(output_path) # Remove the zip file after extraction
# Using the direct download link format provided by gdown's warning
url = r'https://drive.google.com/example_link'
output = str(DATA_DIR / 'Stock_data/dl_folder/example_data-2.zip')
download_data_from_drive(url, output)
Most recent time I've run into it, has happened a lot of other times. This is one of the smaller scripts that causes it
Stackoverflow said uninstall and reinstall pandas/numpy, have tried that. It happened with miniconda, tried using full anaconda instead, still happens. Uninstalled everything on conda and reinstalled everything with pip only, still happens
At this point I think it's a problem with python 3.10.12 and have left it because it hasn't caused any noticeable effects, but the warning is annoying
Does anyone have any advice on how I can begin learning AI. I already know intermediate python and data structures.
Hey i am having a trouble installing OpenCV CUDA, I am done with all the steps in CMAKE-gui, but when i try to build the files, it just throws an error:
MSBUILD : error MSB1009: Project file does not exist.
Switch: INSTALL.vcxproj
Hi, can someone pls help me. Can someone give me an example code of how to split my data. For example, I have a directory named “main_dir” and in this directory I have three directories, each for the three classes I have named “1”, “2”, “3” (with just images of each class). How can I split my data into train, val, test?
I’m seeing different ways using Tensorflow, Sklearn, and other ways so I’m confused on how I should do it.
SOLVED <3
hey I'd suggest sklearn if you're going for complete basics! sklearn has a direct train test split function where you can mention the ratio in which it needs to be split into! check out some sklearn tutorials!
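A rough sketch of that sklearn suggestion, assuming the folder layout from the question (`main_dir/1`, `main_dir/2`, `main_dir/3`). The `split_paths` helper, the 70/20/10 ratios and the made-up file list are assumptions for illustration only; `train_test_split` is called twice because it only produces two parts per call.

```python
from pathlib import Path
from sklearn.model_selection import train_test_split

def split_paths(paths, labels, val_frac=0.2, test_frac=0.1, seed=0):
    """Split file paths into train/val/test, keeping class proportions."""
    # First carve off the test set...
    rest_p, test_p, rest_y, test_y = train_test_split(
        paths, labels, test_size=test_frac, stratify=labels, random_state=seed)
    # ...then split what is left into train and validation.
    rel_val = val_frac / (1.0 - test_frac)
    train_p, val_p, train_y, val_y = train_test_split(
        rest_p, rest_y, test_size=rel_val, stratify=rest_y, random_state=seed)
    return (train_p, train_y), (val_p, val_y), (test_p, test_y)

# Hypothetical file list; on disk you could collect it with
# sorted(Path("main_dir").glob("*/*")) and use p.parent.name as the label.
paths = [Path("main_dir") / c / f"img_{i}.jpg" for c in "123" for i in range(20)]
labels = [p.parent.name for p in paths]

train, val, test = split_paths(paths, labels)
print(len(train[0]), len(val[0]), len(test[0]))
```

Splitting the list of file paths (rather than loaded images) keeps memory use small, and the three lists can then be fed to whatever loader you prefer.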
Ok, thanks!
Also, do you know how to implement data augmentation? I saw you can apply the augmentation to the actual model but then there’s another option with apply that augmentation to the actual dataset. Which one do you think is best?
yep it worked thankss
hi yeah, you need to apply augmentation to the training split after splitting, not to the original dataset, because:
if you augment the original dataset, all data entries will be changed, and upon splitting into test and train sets, your test set will also be affected.
on the other hand, splitting the data first and then augmenting/changing only the train set preserves the original test set, which gives a more honest measure of test performance. (scaling like x/255 still has to be applied to every split the same way, with any fitted statistics taken from the train set only)
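Here is a small sketch of "split first, then augment only the train set". The random arrays stand in for real images, and the horizontal flip is just one assumed example of an augmentation; none of this comes from the original poster's code.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((10, 8, 8, 3))   # stand-in data: 10 tiny RGB images
labels = np.arange(10) % 2

# 1) Split first, so the test images are set aside untouched.
idx = rng.permutation(len(images))
train_idx, test_idx = idx[:8], idx[8:]
x_train, y_train = images[train_idx], labels[train_idx]
x_test, y_test = images[test_idx], labels[test_idx]

# 2) Augment only the training split (here: add horizontally flipped copies).
flipped = x_train[:, :, ::-1, :]
x_train_aug = np.concatenate([x_train, flipped])
y_train_aug = np.concatenate([y_train, y_train])

print(x_train_aug.shape, x_test.shape)
```

The test arrays are exactly the original images, so the evaluation is not contaminated by augmented variants of images the model trained on.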
am new to this too, so correct me if am wrong, anyone
Hm I can't reproduce the issue so it must be something with your package manager/environment
Ok thank you!
AI is a pretty broad term, what kinda AI are you interested in making?
Can you give me a requirements.txt file of the packages that you're using
what are the best algos to try for stock trading futures/indices?
I'm not using a venv and I do the thing you're not supposed to do with global installs :)) I just installed the modules in your code on top of my existing environment with pip
so if I made a requirements.txt it'd be like 100 lines long
if you just want my versions of these packages I could send that
Yes I'd appreciate that thank you
Also python version
gdown==4.7.1
├── beautifulsoup4 [required: Any, installed: 4.10.0]
├── filelock [required: Any, installed: 3.12.4]
├── requests [required: Any, installed: 2.25.1]
├── six [required: Any, installed: 1.16.0]
└── tqdm [required: Any, installed: 4.66.1]
config==0.5.1
did any of these even use numpy 
tona@albedo:~$ pip --version
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)
zipfile ?
isn't that a built-in module
mhm
Gis as in? It is plotly in python.
I'm currently developing a module to detect small bubble events (ebullition), calculating the CH4 ebullition flux (eFCH4) by assuming a constant diffusion rate. To mitigate diffusion flux inhibition due to high CH4 concentration in a floating chamber, I select previous data within the observation period to calculate the diffusion flux, represented by the U-M line (prototype: https://paste.pythondiscord.com/K6RA). I employ the least squares method to fit the slope of the U-M line, obtaining the CH4 diffusion rate within the period. Additionally, I calculate the CH4 diffusion concentration at the observation's end (point E) based on the U-M line's slope. The change in CH4 ebullition concentration (Δc) results from subtracting the concentration at point E from point T during the observation period.
I want a module that can extract relevant time periods from raw data (an xls file) for analysis (e.g., 10:02:59 - 10:12:59, 11:23:59 - 11:26:59). This targeted approach eliminates the need to analyze the entire raw data range. Ebullition events occur when CH4 bubbles disrupt the linear increase in CH4 concentration.
While I've created a prototype for significant bubble events (https://paste.pythondiscord.com/K6RA), I'm seeking guidance on developing one for small bubbles. Additionally, I'm working on determining an appropriate threshold value (https://paste.pythondiscord.com/H7RQ). Any assistance or advice to enhance the module would be greatly appreciated.
For the threshold value: I have heard that, no matter what, I should never convert it to a str, but I am not familiar with how I should do it
Hi, I had a question with pytorch. Below is my model
from torch import nn
# create a two layer FCNN, avoid ValueError: optimizer got an empty parameter list
class img2latent(nn.Module):
    def __int__(self):
        super(img2latent,self).__init__()
        self.neuralDim=len(X_train[0])
        self.latentDim=len(Y_train[0])
        self.hiddenDim=self.neuralDim
        self.fc1=nn.Linear(self.neuralDim,self.hiddenDim)
        self.fc2=nn.Linear(self.hiddenDim,self.latentDim)
        # INTITIALISE THE WEIGHTS, FC1 WITH ONES, FC2 WITH PARAMETERS OF RIDGE
        self.fc1.weight.data.fill_(1)
        self.fc1.bias.data.fill_(0)
        self.fc2.weight.data=ridge.coef_
        self.fc2.bias.data=ridge.intercept_
    def forward(self,x):
        x=self.fc1(x)
        # add reLU
        x=torch.relu(x)
        x=self.fc2(x)
        return x
def train_loop(model,loss_fn, optimizer):
    model.train()
    # do full batch gradient descent
    pred=model(X_train)
    loss=loss_fn(pred,Y_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
fullmodel=img2latent()
# SEND TO GPU
fullmodel=fullmodel.to(device)
# take mse loss + l2 regularaisation on the weights of the second layer
loss=nn.MSELoss()
optimizer=torch.optim.Adam(fullmodel.parameters(),lr=0.01,weight_decay=0.01)
for t in range(1000):
    loss=train_loop(fullmodel,loss,optimizer)
    if t%100==0:
        print(t,f"{loss:0.2f}",end='\t')
However I get
ValueError: optimizer got an empty parameter list
How do I solve this issue
You wrote __int__ instead of __init__
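To see why that typo gives an empty parameter list, here is a toy sketch in plain Python. `Module` below is a made-up stand-in for a framework base class, not the real `torch.nn.Module`; the point is only that a method named `__int__` is never called when the object is constructed, so nothing gets registered.

```python
class Module:
    """Tiny stand-in for a framework base class (not the real torch one)."""
    def __init__(self):
        self._params = []

    def parameters(self):
        return list(self._params)


class Misspelled(Module):
    def __int__(self):              # typo: this is the int() conversion hook
        super().__init__()
        self._params.append("fc1.weight")


class Correct(Module):
    def __init__(self):             # runs when the object is constructed
        super().__init__()
        self._params.append("fc1.weight")


print(Misspelled().parameters())    # []
print(Correct().parameters())       # ['fc1.weight']
```

In the misspelled class the inherited `__init__` runs instead, so the object is created without any layers and the optimizer has nothing to optimize.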
👆🏽
when moving into predicting events that are hard to distinguish from normal variation, you usually end up having to trade off between probability of true detection and probability of false detection. but ultimately just about every statistical technique revolves around doing something very similar to what you are already doing: proposing a baseline model, and then looking for deviations from that baseline model. where it gets more complicated is when you want to analyze the situation probabilistically, but that's often necessary in cases where it's not straightforward to distinguish the baseline and deviated scenarios. a probability model gives you a principled framework for distinguishing the two, and allows you to make trade-offs in terms of the probabilities of false positives, true positives, etc.
One thing that did not really occur to me when I was helping you previously is that, because you assume a constant rate of increase in the baseline scenario, when you remove the trend by taking first differences, you should get a flat line
That is, you can transform your data in terms of deviations from the expected trend line
Then instead of modeling the slope directly, you can just look for unexpectedly large positive deviations from trend
This is convenient because you can think in terms of level values rather than rates, which i think makes it all a little bit easier
It also simplifies the problem I think, because it reduces now to figuring out what is the normal baseline distribution of CH4 increases at any time step
In your case, it seems like maybe the variation in the data is constant over time? If that's true, then you can get pretty far using standard statistical hypothesis testing
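A quick sketch of that idea on made-up numbers: a series with a constant rate of increase plus one sudden step. The rate, noise level and step size are all assumptions picked for illustration, not values from the real CH4 data.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(600)                       # one sample per second, 10 minutes
rate = 0.02                              # assumed constant increase per second
conc = 2.0 + rate * t + rng.normal(0, 0.01, t.size)
conc[300:] += 1.5                        # assumed bubble: sudden step at t=300

# First differences remove the linear trend: a steady slope becomes a flat
# line at roughly `rate`, and a step shows up as a single large spike.
diffs = np.diff(conc)

print("typical increase per step:", np.median(diffs))
print("largest jump is between samples", np.argmax(diffs), "and", np.argmax(diffs) + 1)
```

After differencing you only need to ask "is this one-step increase unusually large?", which is a question about levels rather than slopes.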
Thanks for explaining that, it makes a lot more sense now! I agree that focusing on deviations from the expected trend seems like a crucial first step in developing the module. It's true that many published reports on methane ebullition events are based on predictions rather than actual observations, which can introduce uncertainty.
Understanding the composition of bubbles and how other substances inside them change over time can significantly reduce uncertainties and improve predictions. There is ongoing research into the content of these bubbles. It seems like we're working towards making our predictions about methane bubbles more accurate and reliable, moving away from just educated guesses. I also want to mention that I think this is the first step for me in developing the module, so the module itself will of course have to be improved.
my understanding was that the expectation of a linear trend was derived from some theoretical knowledge about the underlying chemical process involved in whatever you're working on, is that not true?
it gets fuzzier and harder to distinguish events from non-events when you need to estimate the baseline distribution directly from the data without a theoretical model
I am familiar with this, and yes, it may. But it doesn't lead to the development of the module itself, which can be improved over time. As we're both saying, the module I am trying to create now and the standard statistical hypothesis testing will both have huge uncertainties.
To be honest, not always, but in some cases, the expectation of a linear trend is based on theoretical knowledge of the underlying chemical processes. This theoretical foundation provides us with a starting point for making predictions. However, remember that while theory guides us, real-world data can sometimes behave differently due to various factors. So, while we start with a theoretical basis, we also need to be prepared to adapt our models when necessary to account for deviations from the expected linear trend.
right, you're thinking along the right lines then
remind me again: are you able to analyze the whole time series at once? or do you need to be able to detect events when they occur, using only past data?
That is also why I think this module can be helpful for me and for the people who are working with GHG and need more accurate data.
I am using a GHG instrument that measures real-time data every second at the water-air interface.
right, but is this something that is going to be running continuously and sending out an alert when an event happens? or are you running it for a set period of time and then analyzing the entire sequence later?
So no, always fresh data that is being analyzed here
okay, so if you are looking for methods that only use past data without seeing the full sequence, your keyword is "online" (although it's not very useful on its own given its other meanings)
i think last time you asked about this, changepoint detection was brought up, and ultimately i think that does describe what you are trying to do
Well, this is where the time-consuming part comes in. The instrument logs data every second. So, in the field where I use the instrument for 10 minutes at each site, I have to physically analyze the data afterward in Excel, empty the distributed dataset before placing it in the water, etc. And this is just raw data; there are a few other modules that have to be used afterward to obtain the actual flux data.
right, but it sounds like this data is intended for a post-hoc analysis rather than continuous monitoring, correct?
Well, I'm not particularly interested in determining whether the bubbles are occurring on the spot, as you can often spot them with your eyes, but you can't make that determination when you're analyzing the data afterward.
makes sense
you're right. the concept of changepoint detection aligns well with what I'm trying to achieve.
Typically changepoint detection, especially Bayesian online changepoint detection (BOCD) assumes you have 2 distinct distributions and there's a specific point you go from P1(y|X) to P2(y|X). Is that the case for you or do you just have anomalous points?
If you can mathematically define what you want to do finding a method that does it comes out the other end sometimes 😄
In my data, small and significant bubble events don't always fit the traditional definition of clear anomalous data points. They occur at somewhat random rates, making it a bit more challenging to pinpoint them as distinct anomalies.
Do they change the trajectory of your overall curve?
Well, I am not that good with math or programming so that is also why I seek guidance and help here,
I'd say the size of the bubble doesn't determine if it's an anomaly or not, you can choose what an anomaly is as the practitioner
If the bubble moves your line permanently it's a changepoint, if it doesn't I'd say it's an anomaly
Yes, these bubble events do indeed affect the overall trajectory of the data curve. When they occur, they introduce deviations from the expected pattern, causing temporary fluctuations in the data.
When you say temporary, does it mean it shifts back?
As the instrument I am using flushes the system at a constant time, the line will after X time smoothly decrease
Is the entire "lifespan" of your data impacted by an "event"?
Or just a "zone" around the "event"?
I meant that the data exhibits deviations from the expected pattern for a certain duration or period of time while the bubble event is happening. These fluctuations are not permanent shifts; they are only observed during the occurrence of the bubble event.
Okay I'd say they're anomalies then. How fast do you need to spot them? You can be vague about this like "very fast", "medium" etc
And do you know the expected pattern before starting your process?
Additionally, do you have series that don't have any "bubble events"?
It all depends on how long I measure. I often sample for 10-15 minutes, so when I see a significant increase while analysing the data, it stays for the rest of the "lifespan" of the observation window I'm looking at. But when I look at the real-time data out in the field, it will of course drop down as the system flushes
I don't know your domain so it's in both our best interest if you abstract away some of the details 🤣
I don't understand your question. I am not trying to catch them when they occur; I would like to "catch them" when I analyse the data.
So not in real-time? After the fact?
Sorry, I am just afraid my explanations are too limited and are causing confusion
I'd look at the variance of N points that don't have any "bubble events" and then take "N" points that contain a bubble event in the beginning and the end
There should be some sort of difference in variance
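One possible reading of that variance comparison, sketched on synthetic data. The window length, positions and bubble size are assumptions, and comparing the variance of first differences (rather than of raw values) is a choice made here so the steady upward trend does not dominate.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(400)
conc = 2.0 + 0.02 * t + rng.normal(0, 0.01, t.size)
conc[300:] += 1.0                       # assumed bubble event at t=300

n = 100                                 # assumed window length N
quiet = np.diff(conc[50:50 + n])        # N points with no bubble event
eventful = np.diff(conc[250:250 + n])   # N points that contain the event

ratio = eventful.var() / quiet.var()
print(f"variance ratio (event window / quiet window): {ratio:.1f}")
```

A ratio far above 1 suggests the second window holds something the baseline noise does not explain. How far above 1 counts as "far" is the threshold you would have to tune.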
Well, when I look at the data in the field, the graph itself drops if I keep measuring continuously. But once I have decided which time span is the data I will use, you may not always see the decrease unless it is a small bubble event.
Well, there will always be a constant change as the measurement is being done every second. That is also why I need to have a threshold value.
Data
Data Variability
as i understand, they are specifically looking for step detection, but i was thinking you could reduce that to just looking for large positive deviations from trend
that is, they are assuming there is some steady state constant rate of increase, and these bubbles lead to large "steps", effectively positive y-intercept shifts
I agree with all you said hence why I think we're going in circles 😄
yeah that's why I was trying to get out whether this was online or not
The best I can gather is that this is not really an online problem, which allows you to produce a decent estimate of baseline variation around trend, so you can retroactively look and find large deviations that might be bubbles
basically what I'm proposing is a shortcut to find change points, using the specific assumptions of the problem, rather than a fully general algorithm
No, what you're suggesting is enough
however I suspect that most of the off-line change point detection algorithms that work by recursively partitioning the time series would also work very well to detect large mean shifts
where things get tricky is detecting smaller mean shifts, and that's where I got hung up before I had to go do something else
There's a few cases where it will fail that I can foresee but they should start here and solving those will be easy
if you just look at average deviation from trend, E.g. estimating sample standard deviation of first differences, that standard deviation estimate will include all of the shifts
and this is where I really regret dropping that nonparametric statistics class in grad school
because my hunch is that some kind of robust estimation would be appropriate here
essentially you have an extra distribution of mean shifts: baseline shifts, and shifts caused by bubbles
Something simpler can work, I'd only reach for those if the diff-in-variance method fails
so either you do something nonparametric to try to eliminate the bubbles from the baseline estimate, or you do something like a bayesian mixture model where you are being really meticulous about accounting for all sources of variation, but that might be harder to design
True, maybe it just works
@quaint loom if you have the opportunity to sit there and observe bubbles as they come up, I think that would help your analysis substantially
Most summary statistics have robust counterparts incl. variance if that's an issue
Literally just mark the time that a bubble occurs, then all of a sudden you have labeled data points and you can be much more confident about model/technique selection
I know they exist and that's about it
You probably also use Huber loss?
Yes, I know. I am sitting there and observing, but sometimes you cannot detect them, as some of the bubbles are very small even though the sensor is detecting them. One of the reasons I want to develop this module is that going through all this data would take ages.
Yea, that is the simple way out.
It's not the simple way out, it's the correct solution. This is the "shoe leather" part of "statistics and shoe leather" that goes back to the earliest days of statistical analysis. Actually collecting good useful data almost always involves toil and manual effort
As the chamber is covering the water surface, you cannot always detect with your eye that a bubble is appearing inside the chamber.
I see, you said before that they were observable and I didn't realize that was limited
you're right, it's not always at a constant rate.
I would still like to develop this module. Just have to figure out how the best way should be! I also missed so much in school about this...
anyone can help me in openpyxl?
I would start by focusing on the algorithm you intend to implement rather than the code
If you have some suggestions, please list them out to me when you have some time.
Please be more specific about your question. Maybe even i can help you if you can just ask directly
I was going to run my model but I got this error:
(DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_4' with dtype int32 and shape [5923]
[[{{node Placeholder/_4}}]]
2023-09-23 07:34:00.087774: I tensorflow/core/common_runtime/executor.cc:1197] [/device:CPU:0] (DEBUG INFO) Executor start aborting (this does not indicate an error and you can ignore this message): INVALID_ARGUMENT: You must feed a value for placeholder tensor 'Placeholder/_4' with dtype int32 and shape [5923]
[[{{node Placeholder/_4}}]]
Epoch 1/60
Assertion failed: (f == nullptr || dynamic_cast<To>(f) != nullptr), function down_cast, file ./tensorflow/tsl/platform/default/casts.h, line 58.
Assertion failed: (f == nullptr || dynamic_cast<To>(f) != nullptr), function down_cast, file ./tensorflow/tsl/platform/default/casts.h, line 58.
Well we were talking about looking for outliers in the distribution of differences right? So I would start by computing the sample standard deviation, or maybe the median absolute deviation, of the first differences, and then messing around setting thresholds on that quantity. For example, if a deviation is more than two median absolute deviations away from the median, it might be an outlier, i.e. a bubble event
The median absolute deviation should be more robust to the effects of including outliers in the estimation process, compared to the standard deviation
I would probably start by doing this on each time series individually, preferably one where you were able to actually observe and mark known bubbles
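A minimal numpy sketch of that thresholding idea, run on synthetic data. The `flag_jumps` name, the multiplier `k` and the simulated bubbles are all assumptions for illustration; the right threshold would have to come from "messing around" on real series with known bubbles.

```python
import numpy as np

def flag_jumps(series, k=2.0):
    """Return indices where the one-step increase is unusually large.

    "Unusually large" means more than k median absolute deviations (MAD)
    above the median first difference. Only positive jumps are flagged.
    """
    diffs = np.diff(np.asarray(series, dtype=float))
    med = np.median(diffs)
    mad = np.median(np.abs(diffs - med))
    # +1 so the index points at the sample right after the jump
    return np.flatnonzero(diffs - med > k * mad) + 1

# Synthetic stand-in for one measurement run (all numbers are made up).
rng = np.random.default_rng(42)
t = np.arange(600)
conc = 2.0 + 0.02 * t + rng.normal(0, 0.01, t.size)
conc[200:] += 0.5      # a large bubble
conc[450:] += 0.15     # a smaller bubble

print("k=2:", len(flag_jumps(conc, k=2.0)), "points flagged")
print("k=8:", flag_jumps(conc, k=8.0))
```

With `k=2` plain noise gets flagged fairly often, which is why the multiplier is worth tuning. The median and MAD barely move when a few bubbles are present, unlike the mean and standard deviation.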
and by the way you might have a somewhat more enjoyable data analysis experience if you work with pandas or numpy. the trade-off is that they are fairly large libraries and might take a while to get comfortable with, but maybe you can set that as an intermediate goal before moving onto more sophisticated algorithms or scaling up to some kind of automated process
Thank you for your time. I will return to you when I have delved deeper into this.
ANYONE WHO HAS DONE SIGN LANGUAGE DETECTION USING MACHINE LEARNING??!!! ITS URGEENTTT
can someone explain to me why axis=0 in numpy means "applied to each column individually, but along the rows"
Hi guys, this might be a dumb question, but do I have to know numpy in order to understand and use pytorch? I'm learning math for machine learning and AI right now, I want to improve my Python as well, and eventually I want to use either pytorch or tensorflow to apply that math and build models. In what order should I do these? Should I get familiar with numpy first and then move on to pytorch and tensorflow, or get really good at math first and then get into numpy, pytorch and tensorflow?
I hope this make sense, I know that this is dumb question, but I'm a beginner in this field.
pip makes me want to kill a dolphin
is there anyone available? I'd like to ask something easy 🙂
I can but i'm also a beginner
in the first photo I can open the camera easily. in the second photo I would like to open the webcam again, and I'm writing 0 for the default. what can I do?
when I write 0 in the prompt it works, but when I try to write it in code it doesn't work
where did I make a mistake?
hah okey 🙂
Hello. I am doing some preprocessing and have just 2 null values in a single column I want to look at. How can I look at just the two rows of data that have them? I have the typical suite of libraries installed: pandas, numpy, matplotlib and seaborn.
something like df.loc[df.isna().any(axis=...)] might work? or select just that one column before calling isna
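For reference, a small sketch of both variants on a made-up frame. The column names `price` and `qty` are hypothetical; swap in the column you are inspecting.

```python
import numpy as np
import pandas as pd

# Toy frame with two nulls in a single column
df = pd.DataFrame({
    "price": [1.0, np.nan, 3.0, np.nan],
    "qty": [10, 20, 30, 40],
})

# Rows where one specific column is null
rows_with_null = df[df["price"].isna()]

# Rows where any column is null (axis=1 checks across the columns of each row)
any_null = df[df.isna().any(axis=1)]

print(rows_with_null)
```

Both selections return the full rows, so the other columns are visible for working out why the value is missing.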
Yeah, that was just about the first thing that popped up after I rephrased my own question. This always happens 🙃
Thank you so much though 😅
You can use the tools without fully understanding them. To be a professional, you’ll need to round out your knowledge, and will be forced to learn more of numpy, but there are plenty of starter projects that require only a basic understanding
Make sure you first have solid Python skills (finish a tutorial and maybe some small projects), or it might be overwhelming. I’d suggest cs50 for ai or kaggle.com/learn to learn some ML stuff and basic numpy
does anyone do tracking of certain products on the web
What do you mean “why”? It just does. Axis=0 says: operate row by row, axis=1 is columnar
Right, 0 = row but if you do np.sum(axis=0) then it gets the sum along each column
That link explains it better than I would, could you check that first?
I understand your confusion, it’s just one of those things that’s how it works
Thanks I appreciate your answer
the way I think of it: for reduction operations, axis=0 means "get rid of axis 0 by reducing along it".
so you have an (n,m) array, you do something like .sum(axis=0), you get an (m,) array.
in general, the axis parameter tells you which axes are "consumed" by the operation
oh hah thats what reptile just said
so it's not that axis=0 means "operate columnwise", it means "operate everything-other-than-row-wise"
consider an array of RGB images, shape (m,n,3). let's say you want to find the average value of each color across all pixels across all images. that would be np.mean(images, axis=(0,1))
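A small sketch of the "axis is consumed" idea from the messages above. The array contents are arbitrary, and the image stack is assumed to be laid out as (num_images, num_pixels, 3), which is one way to read the (m,n,3) shape mentioned.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)   # shape (n, m) = (2, 3): [[0, 1, 2], [3, 4, 5]]
col_sums = a.sum(axis=0)         # axis 0 is consumed -> shape (3,)
row_sums = a.sum(axis=1)         # axis 1 is consumed -> shape (2,)

# Hypothetical stack of images flattened to (num_images, num_pixels, 3).
images = np.zeros((4, 5, 3))
images[..., 0] = 1.0             # make the red channel all ones
channel_means = np.mean(images, axis=(0, 1))  # axes 0 and 1 consumed -> shape (3,)

print(col_sums.shape, row_sums.shape, channel_means.shape)
```

In each case the output shape is the input shape with the listed axes removed, which is often easier to remember than "rows" versus "columns".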
What do you expect x_train[1] to do?
this makes sense
anyone know a good machine learning algorithm for a tyre degradation prediction model? the idea is you input the compound and the tyre lap and it should give the expected time
F1 tires? linear regression for sure, at least as a starting point. if every lap is driven identically, you would expect the same amount of tire wear each lap. if some laps randomly result in more wear and some laps randomly result in less wear, with a roughly bell curve shaped distribution of wear centered around the average per-lap tire wear, that's the classic linear regression model
you could imagine that maybe tires do not wear consistently. maybe they exhibit a lot of wear in the first couple of laps, then wear rate flattens, and then the tire deteriorates rapidly at the end of its life. or maybe something completely different. but i would always advocate for the simpler model first
assuming you are actually interested in making predictions about F1 race outcomes, you have the problem where you are not physically observing and measuring the condition of the tire, so any proposed model is more like a guess or theory and there's no real principled way to fit that to any data, because you have no data
so in that case you would almost definitely want to go with the linear model, always go with the simpler model in the absence of other information
Hello, I am currently trying to run a pretrained model that classifies the mnist number data set both from huggingface. I am having issues with the dimensions and format of the images. I have attached my code below along with the error raised and would appreciate any help regarding this. Thanks in advance.
Hello, I have a dataset which has 1400 rows and 1800 rows. I am trying to recognize letters. My model can currently recognize every letter but A, B, D and H. I use the random forest algorithm. Do I need more data, or is there some other way to solve this problem? In training it has an accuracy of 97%, but when trying it in practice it doesn't recognize the above-mentioned letters
thats what i used lol, but would linear regression work for a non-linear model like the tyres??
Guys, why doesn't my spacy code return the correct similarity? these sentences are even similar
import spacy
nlp = spacy.load('en_core_web_lg')
sentence1 = nlp("subjective test 3 _ Test paper (Biology) __ PDF ONLY __ (Neev 2024)")
sentence2 = nlp("biology test")
print(sentence1.similarity(sentence2))
0.1846538...
I've been implementing LDA, following a guide, making some tweaks occasionally. But I can't stop asking myself why I have to use the within-class scatter matrix. When I look at the formula I'm really tempted to just go for the cov matrix... it would be so much easier on the computer in terms of computations.
Hello Everyone👋🏼
I am thrilled to share that I have participated in a Datacamp competition to show my analytical and machine-learning skills. Just as you've supported me in past competitions, I am reaching out to you again.🙌🏼
Your support means the world to me. To increase my chance of winning, I kindly ask for a moment of your time to visit my DataCamp workspace and upvote it from the link 👇🏼
https://app.datacamp.com/workspace/w/83209d5b-2341-46d3-88c3-113ebb8d587b
Your upvote could make all the difference. Your encouragement and support have always been a driving force and I am immensely grateful for it. ☺️
Thanks for taking the time to upvote my work ♥️
By
Umar and Faizan
Can anyone help me with making a Sign Language Recognition model
hey guys, which algorithm do you think suits this model the most? it is a model which predicts the time on a specific circuit due to tyre degradation with inputs: compound, laps with tyre, laps in race.
I have used linear regression but now want to try something else
What is your output? It's a specific time?
so for instance '50 hours'?
its laptime
Looks like it might be an issue of black and white vs. colour images. You can import the image dataset using the ImageFolder helper from torchvision instead, as this automatically helps with this.
Otherwise you will have to convert the images in place somehow.
This is likely a case where linear regression is not great 🙂
Probably look at a gamma regression
Examples using sklearn.linear_model.GammaRegressor: Release Highlights for scikit-learn 0.23 Tweedie regression on insurance claims
i mean it gave me an error of like 0,9s
Can you plot the distribution of your target variable
i dont rlly know how to use matplotlib, could you tell me how to do it
I can give you a few pointers:
Seaborn is a great high level plotting library you could use to make a histogram or in this case density plot: https://seaborn.pydata.org/generated/seaborn.displot.html#seaborn.displot
but do you know why my plot is bugged??
The link has clear examples on how to, you'd do something like: sns.displot(data=penguins, x="flipper_length_mm", kind="kde")
this??
Yes
No, personally I would have a hard time knowing why exactly your plot is bugged without reading through your whole script, and if I may be honest I don't have the time for that right now 🙂
Seaborn is built on top of Matplotlib and you should most likely read this page: https://matplotlib.org/stable/users/explain/quick_start.html#a-simple-example to understand how it works (the relationship between Figures and Axes)
so what is penguins here?? sns.displot(data=penguins, x="flipper_length_mm", hue="species", kind="kde")
it's just an example, you'd put your own data there
Your data frame and x is your column
like the csv??
I think you need to read through the documentation (both links I sent you)
and what would be the hue
I won't tell you and that's in your best interest 🙂 Learning to read documentation is probably top 3 most important things in programming.
the thing is i dont rlly understand what histogram you are asking for
if you tell me what the histogram should be
If your "y" column (laptime) is shaped like this, you're better off using a gamma regression.
To find out you make a histogram or kde plot
but what is the graph comparing
If you're having trouble with that then https://seaborn.pydata.org/tutorial/distributions.html <- is a good read to understand the reasoning behind histograms etc
nono, im saying whats the graph u want me to plot
laptime
I'm not going to answer that 😄 Time for you to do the work.
if ur not going to tell me what x is how am i supposed to do the graph u want me to do
this is not about an implementation, it's about something u want me to do
i dont know why gamma regressor would be better
Read the stuff I linked and you'll know what X is. I won't always be here
flipper length mm??
i can see 0 2 4 6 8 10 12
I'll leave you on your own for now with your homework, good luck! (I'm not trying to be annoying, it's for your best interest)
bro this is not good for me
hello can anyone help me with a bug with my project
its a pre-trained image classifier to identify dog breeds
i just hv one small problem i can't fix
if interested dm asap please i really need to complete this project
Hello everyone,
I share data science tutorials regularly every week on YouTube and I wanted to share the playlists I've created. If you are learning about data analysis, data science and machine learning, I have plenty of videos that can help you on this journey.
Data science projects playlist ->https://youtube.com/playlist?list=PLTsu3dft3CWg69zbIVUQtFSRx_UV80OOg&si=cLljLTBYA9c48Bys
This playlist contains my end-to-end data science projects which I provide with the datasets i use.
I share courses on my channel too, the PySpark course that I share on my channel -> https://www.youtube.com/watch?v=jWZ9K1agm5Y&t=2628s
My channel for more -> https://www.youtube.com/@onurbltc
Thanks for reading, have a great day!
Maybe post a help thread, and paste a link here? It’s hard to know who can help without knowing the problem. #❓|how-to-get-help . Threads can be hit or miss if it’s a very advanced topic.
oh alright
i dont actually have a link for it
i was looking forward to getting on like a voice call to share my screen or smt
its from a course online thats why
can someone recommend course/book for deep learning for computer vision?
I'm interested mainly in CNN
Thanks for the advice. I have experienced another issue upon resizing my data to fit the model.
What would you guys do to recognize the position in an online chess board image and return a FEN? Would you use train, models or coordinates in the board?
if you do a nonlinear transformation of the data, you can fit the transformed data with linear regression, eg. y = az + b where z = 1/x
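A minimal sketch of the transform-then-fit idea above, using made-up data that follows y = a/x + b exactly (a = 2 and b = 5 are arbitrary illustration values, not anything from the tyre data).

```python
import numpy as np

# Synthetic data following y = a/x + b with a = 2, b = 5 (made-up values).
x = np.array([1.0, 2.0, 4.0, 5.0, 10.0])
y = 2.0 / x + 5.0

z = 1.0 / x                      # nonlinear transformation of the input
a, b = np.polyfit(z, y, deg=1)   # least-squares line y = a*z + b

print(a, b)
```

The model is still linear in its coefficients, which is why ordinary linear regression applies even though y is not a straight line in x.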
can you clarify the question? what do you mean by "train" and "models" in this case?
is it an image of a physical chessboard, or a computer generated 2d image or something else?
please post code and errors as text and not screenshots. discord has support for formatting text as code with syntax highlighting, and we also have a separate paste site for posting longer sections: https://paste.pythondiscord.com
but the error is that collections.Iterable was removed in recent versions of python, you need collections.abc.Iterable. it's essential to practice understanding and working through error messages on your own, it is a critical skill
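For reference, a short sketch of the import fix described above. The helper `is_iterable` is only an illustration; if the failing import lives inside a third-party package, upgrading that package is probably the better fix.

```python
# collections.Iterable was an alias that was removed in Python 3.10;
# the class itself lives in collections.abc.
from collections.abc import Iterable


def is_iterable(obj):
    """Return True if obj supports iteration."""
    return isinstance(obj, Iterable)


print(is_iterable([1, 2]), is_iterable(3))
```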
I don't know if I translated In my head correctly. But I mean using a machine learning model to be trained and then recognize chess pieces and the position In a chess board
Would you have an idea?
I've already done image recognition
individually
how could I do in the chess board
with many pieces
and
I have to identify the square that each piece is on
already done
Is this a picture of a chess board? Like a photo?
Or a screenshot/whatever of a chess board where the positions are fixed?
screenshot
online chess board
So can you decompose the problem to square level recognition? Instead of recognizing the board, just recognizing each square?
Wdym? I think the only way to find where each piece is and get the position is by dividing the board into squares
But I want opinions on which is the best way to do it
for example, find an online chess board in any image and try to use it
not just this perfect cropped screenshot
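Once each square has been classified, turning the result into FEN is plain string work. A rough sketch, assuming a hypothetical per-square classifier has already produced an 8x8 grid of FEN piece letters (empty string for an empty square) with the top row being rank 8. It only builds the piece-placement field, not side to move, castling rights and so on.

```python
def board_to_fen_placement(board):
    """Convert an 8x8 grid of piece labels into the piece-placement field of a FEN string.

    board[0] is rank 8 (top row of a screenshot seen from White's side); each cell
    is a FEN piece letter such as "P" or "k", or "" for an empty square.
    """
    ranks = []
    for row in board:
        fen_row = ""
        empty = 0  # running count of consecutive empty squares
        for cell in row:
            if cell == "":
                empty += 1
            else:
                if empty:
                    fen_row += str(empty)
                    empty = 0
                fen_row += cell
        if empty:
            fen_row += str(empty)
        ranks.append(fen_row)
    return "/".join(ranks)


# Starting position as the (hypothetical) classifier would report it.
start = [
    list("rnbqkbnr"),
    list("pppppppp"),
    [""] * 8, [""] * 8, [""] * 8, [""] * 8,
    list("PPPPPPPP"),
    list("RNBQKBNR"),
]
print(board_to_fen_placement(start))  # rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR
```

Finding the board inside an arbitrary image is a separate detection step; this part only covers what happens after the squares are known.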
import math

import neat

# GRAY_AREA_TOTAL, MAX_LENGTH, X_SCALING_FACTOR, Y_SCALING_FACTOR, BORDER,
# calculate_valid_area and flip are defined elsewhere in the full script.

# Fitness algorithm
def get_fitness(area):
    return (1000 * ((GRAY_AREA_TOTAL - area) ** 4)) / (GRAY_AREA_TOTAL ** 4)

# Generate segments
def generate_segments(neural_network):
    segments = []
    total_length = 0
    # Use the neural network to produce a flat list of segment parameters
    output = neural_network.activate([1])
    # Interpret each group of four outputs as (x_start, y_start, theta, length)
    for i in range(0, len(output), 4):
        x_start, y_start, theta, length = output[i:i+4]
        x_start = (x_start + 1) * X_SCALING_FACTOR + BORDER[0]
        y_start = (y_start + 1) * Y_SCALING_FACTOR + BORDER[2]
        length += 1
        total_length += length
        if total_length >= MAX_LENGTH:
            # Clip the last segment to the remaining length budget and stop
            total_length -= length
            length = MAX_LENGTH - total_length
            x_end = x_start + length * math.cos(theta * math.pi)
            y_end = y_start + length * math.sin(theta * math.pi)
            segments.append(((x_start, y_start), (x_end, y_end)))
            return segments
        x_end = x_start + length * math.cos(theta * math.pi)
        y_end = y_start + length * math.sin(theta * math.pi)
        segments.append(((x_start, y_start), (x_end, y_end)))
    return segments

def evaluate_genome(genomes, config):
    nets = []
    sets = []
    # Create a neural network from each genome
    for genome_id, g in genomes:
        net = neat.nn.FeedForwardNetwork.create(g, config)
        nets.append(net)
        g.fitness = 0
    # Get the segments NEAT generates
    for net in nets:
        sets.append(generate_segments(net))
    # Implement fitness
    for i, segments in enumerate(sets):
        # Calculate the remaining areas
        area_original = calculate_valid_area(segments)
        area_flipped = calculate_valid_area(flip(segments))
        area_total = area_original + area_flipped
        # Get the fitness of the segments
        fitness = get_fitness(area_total)
        genomes[i][1].fitness = fitness

def run_neat(config, gen_count):
    # Create the NEAT population
    population = neat.Population(config)
    # Add a reporter to monitor progress (optional)
    reporter = neat.StdOutReporter(True)
    population.add_reporter(reporter)
    stats = neat.StatisticsReporter()
    population.add_reporter(stats)
    # Run the NEAT algorithm for the given number of generations
    winner = population.run(evaluate_genome, gen_count)
    # Return the best genome found
    return winner

###
# RUN
###
# Set configuration file
config_path = "./config-feedforward.txt"
config = neat.Config(neat.DefaultGenome, neat.DefaultReproduction,
                     neat.DefaultSpeciesSet, neat.DefaultStagnation, config_path)
run_neat(config, 10)
I want to get the output generated from generate_segments() that performed the best once the simulation is finished. How can i do this?
(I am using NEAT-python)
Apologies for the formatting. I have tried troubleshooting and ended up trying to tackle this another way. Here is my paste: https://paste.pythondiscord.com/ZZWA
You'd have to see if population.run allows you to return more than the best_genome otherwise you can have evaluate_genome mess with some mutable global (e.g., put the best fitness per generation in a dict with its value) but that's messy imo.
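Another option that might be simpler, offered as a sketch rather than the definitive answer: since generate_segments always feeds the same input ([1]) to a feed-forward network, its output depends only on the genome, so the winner's segments can be regenerated after the run. The helper name `segments_for_winner` is made up, and the network factory is passed in as an argument so the sketch does not depend on NEAT-python being installed.

```python
def segments_for_winner(winner, config, create_network, generate_segments):
    """Rebuild the winning genome's network and regenerate its segments.

    create_network and generate_segments are passed in so this sketch has no
    hard dependency on NEAT-python.
    """
    net = create_network(winner, config)
    return generate_segments(net)
```

With the script above this would presumably be called as `segments_for_winner(run_neat(config, 10), config, neat.nn.FeedForwardNetwork.create, generate_segments)`, the same factory call that evaluate_genome already uses.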
Looks like an interesting project what r u doing
Trying to build a program around the opaque set problem. The set (black lines in the right plot) for which all lines that can be drawn through the area (gray square in the right plot) go through at least once
the first 2 plots show the fitness. its that area in grey that isnt covered with either red or blue. The NN is attempting to draw lines whose combined length has a strict limit that meet the conditions
m and b represent the values in y = mx + b for which, in red or blue, the line goes through a segment in the set, or in gray, the line goes through the target area
In discrete geometry, an opaque set is a system of curves or other set in the plane that blocks all lines of sight across a polygon, circle, or other shape. Opaque sets have also been called barriers, beam detectors, opaque covers, or (in cases where they have the form of a forest of line segments or other curves) opaque forests. Opaque sets wer...
im super new to working with neural networks tho so im iffy on whether its not doing anything other than random selection at this point lmaooo
does anyone know how to work with spatial autocorrelations? I have data spaced evenly every 10 meters of the change in a condition and want to know the relationship to surrounding points. Most of what I find when searching for partial autocorrelations is about time series, which seems a little different, so I'm not sure how to start
e.g. a partial autocorrelation for a time-series
Have u ever tried arithmetic operations with neural nets
Anyone familiar with gephi
Hi I have a 3D array which is a 3D image composed of 2D slices, is there a way I can rotate my 3D array in the x-z plane and the y-z plane?
I don't know exactly which function would be right for you, but there's this one for rotating arrays https://numpy.org/doc/stable/reference/generated/numpy.rot90.html
there's also flip, if that one isn't right.
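A quick sketch of how rot90 could be applied to a volume, assuming the array is indexed as volume[z, y, x] (slices stacked along axis 0). That layout is an assumption, so the axes pairs may need swapping for other layouts. rot90 only handles multiples of 90 degrees; for arbitrary angles, scipy.ndimage.rotate takes a similar axes argument.

```python
import numpy as np

volume = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # (z, y, x) = (2, 3, 4)

rot_xz = np.rot90(volume, k=1, axes=(0, 2))  # rotate 90 degrees in the x-z plane
rot_yz = np.rot90(volume, k=1, axes=(0, 1))  # rotate 90 degrees in the y-z plane

print(rot_xz.shape)  # (4, 3, 2): the z and x extents swap
print(rot_yz.shape)  # (3, 2, 4): the z and y extents swap
```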
Companion webpage to the book “Mathematics for Machine Learning”. Copyright 2020 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Published by Cambridge University Press.
thanks
Hey anyone interested in developing AI
Can join me I'm going to develop AI
We can build and learn together 😀
lol @past meteor I linked that because you link it
Hahaha so I'm just giving myself a pat on the back 🤣
you deserve it 👍
I have your message bookmarked to give to new folks, I think we should pin it or something similar: #data-science-and-ml message
So pat yourself on the back with both hands 🙂
by the power vested in me by lemon, so shall it be.
No I haven’t, what would that do?
do you have any recs for libs / tools for data anonymization?
also anyone could recommend books / resources for learning AI with C++? (asking for a friend 🙂 )
So that it can Develop reasoning skills
does anyone have any ideas about how i can begin learning ai in python (neural networks and like maybe image classification). I already have a good basis in python. I also don't want to spend money on resources and am looking for good free resources.
CS50 for AI is a pretty good intro, if you already know Python.
I have a subreddit, r/mlscholar.. go to its wiki page for all the resources
someone have an idea?
guys i need advice for how to segment english word by rules
hey guys, how do you prepare for the programming part of the interview? Leetcode?
I am applying for a graduate position
Probably a better question for #career-advice, please share a little more information about your background and the position you're interviewing for.
Hi... I want to ask the AI and DS engineers a question: what is your view on how AI art is created, and whether it violates artists' copyright? Some artists go as far as to claim that bots trained on such models use stolen images and art taken from artists without their consent, so not just the bot but also the dev behind it is responsible, and monetizing them is even worse. I want to ask where I can find an appropriate library/dataset to build such a model that doesn't violate those rights...
Previously I have been using Kaggle for such datasets... is there one such set available which doesn't do that?
you might have to look at the details of how each dataset was constructed/gathered
So are their claims legitimate? And if yes, then why have devs chosen to build bots/generators like that even if they can be illegal
im gonna play around with falcon today, now i cant really find any good info comparing 180b with 40b. which would you recommend me to go with at first? is 180b significantly more power hungry or harder to use?
can i even run it on a normal computer? it says somewhere 40b requires 90 gigabyte of gpu memory.....
These questions are above the paygrade of 9/10 engineers and make more sense to ask a legal / arts / philosophy person 😄
Hahah... surely.. the point is, if they're right, then is there data available without such copyright claims... and if their claim is not correct, then how dare they insult us... either they didn't foresee the future and are now crying over spilled milk, or we misunderstood what such a thing could bring into our lives... usually, when I asked other fellows these questions, the answer was that they don't care, since art can be subjective
I don't really know what you mean. At work we have a legal dept. if I want to know something I ask them
The AI art debate has so much nuance that most engineers (maybe I'm just projecting) don't have, it's a legal/philosophy matter I think. I can voice an opinion but it's likely going to be bad 😄
It is fine for me... ur bad can be mine good and vice versa as well... but after having such a debate with such fellows i am now in a shock to even start my own ML model on such a thing... or not...
So for me...it is kinda a guiding light now
...uh, sorry but do you have the slightest idea of how much it costs to train a model like Stable Diffusion?
Not at all
over half a million dollars
you can train a mini scale diffusion based image generation model that generates something like 32x32 grayscale images for 10 types of objects, but training something that generates high quality images for almost anything you can imagine requires an absurd amount of compute
which is why you pretty much only ever see giant corporations training their own models
there are a lot of different ways to customize these models though, for example a bunch of people fine tune Stable Diffusion to work better on generating specific kinds of images
Yeah i figured... i dont wanna create a model like stable diffusion or Dall E... i just wanna create something smoller and a proof of concept... which wont require data from such artists... i really like the concept of qr code art so i wanna create a small model like that
the "qr code" part aside, getting a model good enough to generate something people might recognise as "art" is already insanely difficult as-is
iirc the current methods to generate it are mostly using control net to guide stable diffusion, you might want to look into these two in detail first if you haven't yet?
(control net and stable diffusion)
Oka... thanks for the help
I am just starting this field again so i thought to check this out as well... the imaging part...
lol okay. you would need like 360 gigabytes of vram for it to run optimally. think i'll just go ahead with the 7b, which has a lot lower system requirements. eager to hear if anyone tried to run 40b on their system and how the performance was
Hi guys 👋🏻
For generating a synthetic dataset from financial PDFs:
I want to do query generation.
For that, I'm thinking about using a pre-trained language model (such as GPT-3, GPT-4, or other LLMs) to generate queries based on the content of each text chunk.
But the problem is that to use OpenAI's GPT models, I would need access to the OpenAI API and would have to set up an API key.
Do u think I can use LlamaIndex instead to generate these queries?
what kind of synthetic dataset? i don't have an answer, but i'm trying to stay somewhat up-to-date on all the LLM hype and i'm curious what task you have in mind
It’s just for the questions and answers generated by GPT
From a pdf file
Like u can ask questions about the content of the PDF
but I want to generate those questions through another method in order to avoid using OPEN AI
‘Cause I don’t have API KEY / no budget for that
So I was wondering if I can do through Llamaindex model
what do you mean by generate queries though?
or are you talking about fine-tuning a model using some text data that you have?
This seems more of an NLP question; you want to query meaning from (presumably) 10-k’s and q’s?
i mean, querying from a corpus of documents seems to be one of the really strong use cases for fine tuning one of these open models
ive seen it mentioned a handful of times now, that llama performs pretty well when fine tuned
i have zero personal experience with it though
I downloaded llama, just need to finally get around to trying it. Halfway there, I guess 🙂
Are you? For me it feels like a wild goose chase 🤣
I took many NN based courses in uni and ultimately a lot of my projects have been time series so there's overlap in methods but LLMs specifically are a very specific niche. I have 0 FOMO at the "next big thing" coming out every other week, once the hype settles down a little bit I'll catch up.
for Pandas: does anyone know how to get a datetime64[ns] to work with pd.cut()? it apparently supports datetime64 but not [ns]. I'm just trying to bin the datetime into months
How should I deal with related features?
Example: I want to predict housing prices, and two features I have are distance to nearest school and nearest school type(as in, say elementary, middle, high)
I could keep them separate, but intuition tells me that I could "combine" them somehow, or there was some way to inform the model that these two are related, which could yield better results
do you mean that you want to round timestamps down to the start of their month? so 2023-9-27 T 09:11:25 becomes 2023-09-01 T 00:00:00?
or are you trying to group the dataframe by year/month for some subsequent operation?
(ie, group by year/month so that days in March 2020 aren't in the same group as March 2023)
it's just data from the past few months, trying to bin the entries by month. By now we've found a workaround by making the .index the datetime and then do .index.month. but I was kinda expecting pd.cut() to be able to do this type of binning, considering I saw the following:
df['day_bin'] = pd.cut(df['date'], bins='1D')
so my assumption was it would be able to also handle bins='1M' 🤷♂️
it looks like pd.cut returns a tuple of two values
are you sure that what you're trying to do is neither of the two options I gave? I've never used pd.cut before, and I'm trying to understand what you are doing.
I don't know precisely what it means to "bin the entries by month", and that's what I'm trying to understand.
every entry in my df has a column with a datetime64[ns] dtype. I'm just trying to group the entries of the same month together, so all entries with the month 'May' get in a bin, 'June' their own bin, etc.
just binning
https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.20.0.html#other-enhancements
pd.cut and pd.qcut now support datetime64 and timedelta64 dtypes (GH 14714, GH 14798)
i technically got what i want now, such that all entries are grouped by the respective month in the datetime column. I was just thinking that pd.cut could make it easy
Sounds like you solved it anyway. For your awareness, dataframes have a groupby method for doing operations on groups, so that's what people will think you mean if you talk about creating groups. It sounds like you wanted to create a new column where the value represents a group that rows belong to.
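For the "new column that represents the month" idea, here is a minimal sketch that avoids pd.cut entirely. The DataFrame, the column names `date` and `value`, and the dates are all made up for illustration.

```python
import pandas as pd

# Hypothetical data with a datetime64[ns] column.
df = pd.DataFrame({
    "date": pd.to_datetime(["2023-05-03", "2023-05-20", "2023-06-01", "2023-07-15"]),
    "value": [1, 2, 3, 4],
})

# Label each row with its year-month period (2023-05, 2023-06, ...).
df["month"] = df["date"].dt.to_period("M")

# Optional follow-up: aggregate per month with groupby.
monthly = df.groupby("month")["value"].sum()

print(df)
print(monthly)
```

Unlike `.dt.month` or `.index.month`, a period keeps the year, so March 2020 and March 2023 would not end up in the same bin.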
Hmm I'm not sure how much the model will gain from combining these features but there are definitely ways you could try. Say those are your only two features and you are onehot encoding the nearest school type. Instead of using one's in your encoding you could use a distance from the nearest school; i.e. [2.5,0,0] is an elementary school that is 2.5 miles away where [.5, 0, 0] would be one that's .5 miles away.
Ah, so something like
```
elementary | middle | high
------------------------------
5.11         0        0
0            3.2      0
0            0        0
```
Instead of only `1` and `0` the value is now the distance
Yeah, although you might have to fiddle with the values there, normalizing them and/or making the distance inversely proportional to the size of the number if being close would raise housing prices, etc.
completely a wild goose chase, i'm just trying to stay loosely aware of new use cases, meanwhile i educate myself on the fundamentals of LLMs and the transformer/attention architectures
Right
I think I'll just make the 0s (which should indicate no schools nearby) some huge distance
TY for your help!
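A small sketch of the distance-weighted one-hot encoding discussed above, assuming hypothetical column names `school_type` and `distance`. The values are invented, and whether this helps a given model is something to test rather than assume.

```python
import pandas as pd

# Hypothetical housing rows: nearest school type and its distance in miles.
df = pd.DataFrame({
    "school_type": ["elementary", "middle", "high"],
    "distance": [5.11, 3.2, 1.0],
})

# Ordinary one-hot encoding (columns come out in alphabetical order).
onehot = pd.get_dummies(df["school_type"]).astype(float)

# Replace each 1 with that row's distance; the zeros stay zero.
encoded = onehot.mul(df["distance"], axis=0)

print(encoded)
```

As noted in the thread, raw distances may need normalizing or inverting (for example 1 / (1 + distance)) so that "close" and "absent" are not both represented by small numbers.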
Small soapbox, but I think only time will tell which new use cases were viable and which were not. There's a lot of strange stuff going on. For instance, we had a rejected research proposal and my boss was like "Okay we'll go again, let's just make sure it contains AI, VR or XR" 💀 .
lol
that's the way of the world
sad but very true
Which of the popular open sauce gpt LLMs would you suggest for a system with 2x3060Ti (24gb vram + 128gb system ram)?
I came to realize falcon 180b that i was eyeing is way out of league, maybe even 40b is. 7b just sounds so low in comparison
are you training or just running inference? for inference, maybe there are distilled versions that run on consumer hardware
i don't see why you'd want to combine these. it's OK if they're "related", as long as they're not identical or nearly-identical
