#data-science-and-ml
tweet_df['Neutral_count'] = tweet_df['sentiment'].apply(lambda x: 1 if x == 'Neutral' else 0)
tweet_df['Positive_count'] = tweet_df['sentiment'].apply(lambda x: 1 if x == 'Positive' else 0)
tweet_df['Negative_count'] = tweet_df['sentiment'].apply(lambda x: 1 if x == 'Negative' else 0)
tweet_df.head()
It's simply supposed to keep track of the number of positive, neutral, and negative values in the sentiment column
can you do print(tweet_df['sentiment'].head())?
Will you endlessly tune on validation?
Please do not show screenshots of text anymore.
sure
Anyway, all you need to do is tweet_df['sentiment'].value_counts(). but you don't want to put that into tweet_df
because there aren't different sentiment counts for each row. it's counts for the whole dataframe.
what about if i want to store cumsum
would
```py
tweet_df['positive_sent'] = (tweet_df['sentiment'] == 'Positive').cumsum()
tweet_df['negative_sent'] = (tweet_df['sentiment'] == 'Negative').cumsum()
tweet_df['neutral_sent'] = (tweet_df['sentiment'] == 'Neutral').cumsum()
```
to store the running sum
try it and see
tweet_df['sentiment'].eq('Positive').cumsum() -- this notation would also work. I think it looks cleaner, but that's just me.
it didnt work ill try yours
it won't have a different result
Yeah but was this on your first go?
would I also have to store it in a new df?
I don't think so.
how do i get whatsapp seen data? With beautiful soup or selenium
you can show dataframes as text by doing print(df.head().to_dict('list'))
I tried with bs4 to get it by class
But I can't
Worst case scenario, you've just tuned to fit your validation set, haven't you?
Yes but so long as you don't have completely other data you don't know if that'll be consistently lowering
{'id': [1651643206474317824, 1651643242595729408, 1651643261784563719, 1651643301555064835, 1651643310782533632], 'user_name': ['MAKS_Diogenes', 'crypto__wire', 'eurexcoinLTD', 'BitcoinCourant', 'spaziocrypto'], 'text': ['Im thinking about making 50 seed phrases - place 10000 sats on each - then I will engrave the phrases on metal with… https://t.co/SR5Tdq6sY1', 'Crypto Winter Is Over | Mass Bitcoin Adoption | Meme Coins Continue To ... https://t.co/rd1K6tb6sZ #BABYPEPE #memecoins #Bitcoin', '1: Bitcoin price is $29293.55 (0.66% 1h)\n2: Ethereum price is $1906.26 (0.44% 1h)\n3: Tether price is $1.00 (-0.01%… https://t.co/qCEQZgdptU', 'An easy way to run your own Bitcoin node is by using Bitcoin Core\nhttps://t.co/NyAtlHbWTB', 'Pufff, banking system gone... #bitcoin https://t.co/zQSjQls5qx'], 'sentiment': [' Positive', ' Positive', ' Neutral', ' Neutral', ' Positive'], 'positive_sent': [0, 0, 0, 0, 0], 'negative_sent': [0, 0, 0, 0, 0], 'neutral_sent': [0, 0, 0, 0, 0]}
Like it might, but it might not
```py
tweet_df['positive_sent'] = tweet_df['sentiment'].eq('Positive').cumsum()
tweet_df['negative_sent'] = tweet_df['sentiment'].eq('Negative').cumsum()
tweet_df['neutral_sent'] = tweet_df['sentiment'].eq('Neutral').cumsum()
print(tweet_df.head().to_dict('list'))
```
my code for this
Is it because the datatype is an object and not a string?
😦
I found the issue is that there was a space before the sentiment
'sentiment': [' Positive', ' Positive', ' Neutral', ' Neutral', ' Positive']
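The stray leading space means `' Positive' == 'Positive'` is false on every row, which is why all the running counts stayed at 0. The `object` dtype is not the culprit; pandas commonly stores strings that way. A minimal sketch of the fix, using a toy five-row frame that only copies the `sentiment` values printed above: strip the whitespace first, then build the counts.

```python
import pandas as pd

# Toy frame reusing the sentiment values from the printed output (note the leading spaces).
tweet_df = pd.DataFrame(
    {"sentiment": [" Positive", " Positive", " Neutral", " Neutral", " Positive"]}
)

# Remove surrounding whitespace so the labels compare equal to 'Positive' etc.
tweet_df["sentiment"] = tweet_df["sentiment"].str.strip()

# Whole-frame totals: one number per label, not one per row.
totals = tweet_df["sentiment"].value_counts()

# Running counts: one value per row.
for label in ["Positive", "Negative", "Neutral"]:
    tweet_df[f"{label.lower()}_sent"] = tweet_df["sentiment"].eq(label).cumsum()

print(totals.to_dict())
print(tweet_df.to_dict("list"))
```

`value_counts()` answers the "how many of each" question for the whole frame, while the `cumsum()` columns give the running tally per row.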
is there a website to learn machine learning, which is free for Students?
introduction to statistical learning is the book I'd recommend. It's free, but the code examples and labs are in R. A Python version is coming out soon.
if you have proof of your student status, you can apply to "financial aid" on coursera. this allows you to get the certificates for free. just keep in mind the application takes like 2 weeks to get processed
Thanks
yes and for each "course" you have to apply lol
has 10 courses, so you apply 10 times
there are also fast.ai's and sklearn's courses, but like anything 100% free they have no certificates
when it comes to my model rewriting its own pathways and creating neurons, how do I go about that?
are there good libraries to convert somewhat complex XLSX files into json?
was it like pandas or something?
hi guys
# Set the path of the image folder
image_folder_path = "/content/gdrive/MyDrive/video_frames3"
# Define the list of emotions to detect
emotion_labels = ["neutral", "happy", "sad", "surprise", "angry", "fear", "disgust"]
# Create an empty DataFrame to store the emotion data
emotion_df = pd.DataFrame(columns=["Image"] + emotion_labels + ["Dominant Emotion"])
# Loop through the images in the folder
for image_filename in os.listdir(image_folder_path):
    if image_filename.endswith(".jpg") or image_filename.endswith(".png"):
        # Load the image using DeepFace and check if a face is detected
        image_path = os.path.join(image_folder_path, image_filename)
        detected_faces = DeepFace.extract_faces(image_path)
        if len(detected_faces) == 0:
            # If no face is detected, skip to the next image
            continue
        # Perform emotion detection using DeepFace
        emotions = DeepFace.analyze(image_path, actions=['emotion'])
        dominant_emotion = DeepFace.analyze(image_path, actions=["dominant_emotion"])
        # Append the emotion data to the DataFrame
        emotion_data = {"Image": image_filename}
        for label in emotion_labels:
            emotion_data[label] = emotions["emotion"].get(label)
        try:
            dominant_emotion_label = dominant_emotion[0].get("dominant_emotion")
        except:
            dominant_emotion_label = "None"
        emotion_data["Dominant Emotion"] = dominant_emotion_label
        emotion_df = emotion_df.append(emotion_data, ignore_index=True)
# Save the emotion data to a CSV file
emotion_df.to_csv("emotion_data.csv", index=False)
Any changes recommended for this because atm it's giving me this error
TypeError: list indices must be integers or slices, not str
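A likely cause, assuming a recent DeepFace release: `DeepFace.analyze` returns a list with one result dict per detected face, so `emotions["emotion"]` indexes a list with a string. `"dominant_emotion"` is a key inside each result rather than an action, so a second `analyze` call should not be needed. The sketch below does not import DeepFace; it works on a hand-written `sample_result` shaped like that return value (the numbers are invented), and `row_from_analysis` is a hypothetical helper name.

```python
emotion_labels = ["neutral", "happy", "sad", "surprise", "angry", "fear", "disgust"]


def row_from_analysis(image_filename, analysis):
    """Build one output row from the value returned by an emotion analysis."""
    # Accept both shapes: a single dict, or a list of dicts (one per face).
    faces = analysis if isinstance(analysis, list) else [analysis]
    row = {"Image": image_filename}
    if not faces:
        # No face found: leave the scores empty and mark the dominant emotion as "None".
        row.update({label: None for label in emotion_labels})
        row["Dominant Emotion"] = "None"
        return row
    first_face = faces[0]  # only the first detected face is used here
    for label in emotion_labels:
        row[label] = first_face["emotion"].get(label)
    row["Dominant Emotion"] = first_face.get("dominant_emotion", "None")
    return row


# Invented example data in the assumed list-of-dicts shape.
sample_result = [
    {
        "emotion": {"neutral": 10.0, "happy": 80.0, "sad": 2.0, "surprise": 3.0,
                    "angry": 1.0, "fear": 1.0, "disgust": 3.0},
        "dominant_emotion": "happy",
    }
]

rows = [row_from_analysis("frame_001.jpg", sample_result)]
print(rows[0]["Dominant Emotion"])
```

Collecting the rows in a list and calling `pd.DataFrame(rows)` once at the end also sidesteps `DataFrame.append`, which was removed in pandas 2.0.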
Does anyone have a pre trained model to a certain extent I can use as a baseline for mine? Or willing to help me make one
for what?
working ai assistant, just need some baseline
you can download large language models like GPT2 from huggingface, but even if you did, it would be tons of work to create an AI assistant.
call me stony hark!!!
I will not do that. what is your motivation for wanting to create an AI assistant, and do you have any (fairly specific) examples of what you want it to do?
yes, and one day ill get the world to acknowledge me! uses are an integration into an ar glasses hardware ( requires the ai portion to be done ofc)
that's a fine goal. but you'll need to start smaller. an AI assistant that is actually useful would require a lot of components, and you probably don't know ML fundamentals.
shiit they know https://cdn.discordapp.com/emojis/766226970974617600.gif?size=64
im workin on it parental figure believe in me!
Any advice for someone making a career pivot from civil engineering to the data sector? Finished my M.S. and have experience with computational modeling and HPC but want to score a role as a junior level data engineer, analyst, or some sort of developer in this industry. I know I have a lot to learn and I’m excited to do so, but just need some guidance on how to get started. Willing to DM LinkedIN or resume for context.
Maybe find a volunteer role to get experience? https://www.datakind.org/do-good-with-data
@slim lance never thought of that, that’s a really clever idea!
Can someone who's fluent with pyspark & data handling please ping me? I need to ask how to handle 120GB worth of data in a json file using pyspark. I need to clean that data and then put it into my mongoDB, but I don't understand how I'd read so much data with pyspark.
So if someone's familiar with pyspark please ping / dm would be better. It would be a major help
The other thing to do is learn to drive the api of every single service you use with Python. e.g. - Gmail, Gsheets, Trello, Jira, Slack, Discord, etc. Not specific to DE/DS but great coding practice. (At least it has been for me.)
I like this idea too 👍
Code
df = spark.read.option("header", True).csv(*[f"/FileStore/tables/deck_data_{num}.csv" for num in range(500000, 2500001, 500000)])
Error
ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near '/': extra input '/'(line 1, pos 0)
== SQL ==
/FileStore/tables/deck_data_1000000.csv
^^^
What is going on here? I'm not using a SQL cell, I'm using a Python cell. This is in databricks by the way.
I'm trying to read multiple csv files at once by unpacking a list of file paths.
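One plausible explanation: `DataFrameReader.csv` takes the path (a string or a list of strings) as its first argument and `schema` as its second. Unpacking the list with `*` therefore passes the second path as the schema, and Spark tries to parse that string as a schema definition, which would explain why the error quotes `deck_data_1000000.csv`. A sketch that passes the list itself; `deck_data_paths` and `read_deck_data` are hypothetical helper names, and `spark` is assumed to be the session object a Databricks notebook provides.

```python
def deck_data_paths():
    """Paths for deck_data_500000.csv through deck_data_2500000.csv."""
    return [
        f"/FileStore/tables/deck_data_{num}.csv"
        for num in range(500000, 2500001, 500000)
    ]


def read_deck_data(spark):
    """Read all files into one DataFrame; `spark` is an existing SparkSession."""
    # Pass the list as a single argument instead of unpacking it with *.
    return spark.read.option("header", True).csv(deck_data_paths())


print(deck_data_paths())
```

In a notebook cell this would be used as `df = read_deck_data(spark)`.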
Anyone have experience running Athena queries from python?
(I’m wondering if it’s worth the extra step)
I have a dataset with data about 50 stores in US. I need to predict revenue for all of them at once. Dataset looks like this: 10 lines for 1 store, 15 for second, 35 for third etc. Which model should I use?
I have one Y and three X
Rows in csv
im guessing one per year or
One per month
Preferably upcoming 6 months
But how I can do it for all stores at once?
I built a model in ARIMAX that is working only when I use dataset containing one store
yeah ig it doesnt rly make sense to do all stores at once
unless u mean the totality?
hes enforcing that u only use one model?
it doesnt make sense
one store can have 5x less revenue than another lol
ig u can merge ur data together with some column denoting which store it is
seems like a bad idea if stores can have quite different revenues
Or maybe I can do it with a loop? So I will be running predictions for one store only
Has anyone worked with T-SNE visualization? I am having a hard time trying to understand what it actually shows. I have two classes.
do u have a train bit and also a test bit?
Yes
But is there any model that would fit my needs? Let’s assume that stores are connected somehow and using all for training will do a better predictions
no idea
```py
def similarity_matrix_blocking_code(class_dataframe) -> ndarray:
    # using sklearn algorithms
    tfidf_vectorize = TfidfVectorizer(stop_words='english')
    anime_matrix = tfidf_vectorize.fit_transform(class_dataframe[DataframeColumns.COMBINED_FEATURES])
    return cosine_similarity(anime_matrix)
```
I have this code for basic similarity recommendations
the only problem is high cpu usage because 10-15k items are being processed here
which crashes my container on vps ;--;
any suggestions?
!rule 6 @floral comet remove that message pls
Hey folks, I'm looking for resources to build something in ML/AI with IoT sensor data. Any suggestions? Anomaly detection seems to be a common application. Any other applications? Currently planning to focus on Power consumption data or Temperature/Humidity data of a factory
A power sensor connected to a machine and Temp, Humidity, Air Quality sensor in the same room. It's just a scenario as I want to understand how AI/ML can be used in this situation
Why can't I just set a range of values that are OK and if the values start going out of that range I could consider it an anomaly and send an alert?
If you have to set a range of values manually, that's not ML, that's just an if-statement 🙂
Anomaly detection is basically feeding a model a lot of normal data and having the model learn from it what the normal ranges for the values are.
Yeah, let's say a machine normally uses 500W. If the machine not working properly it may use 200W. I can use an if-else to do this. I want to understand in what way can ML help me here
Here's a short overview of how simple (non-neural-network) anomaly detection works: https://scikit-learn.org/stable/modules/outlier_detection.html
I think it's to alert before something goes wrong
Though since your data has timestamps attached, it'd be losing a lot of context to consider your data points independently: like, it might be that none of the measurements are weird on their own, but a specific sequence of several measurements in a row is anomalous. If you want to take that into account, you need anomaly detection on time series as shimmer mentioned, which I know little about (though apparently Azure has an implementation, https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/anomaly-detection)
Say too high of a humidity and the machine wont run properly
I am trying to do feature extraction and visualisation for an image dataset with classification in mind, does anyone have any ideas on any methods to go about this? So far I have used Kmeans clustering to highlight false positives and such but I am looking for other ways to visualize it.
Hi everyone! Could someone tell me if there are any repos, tools that I can use to generate images from text on windows with AMD gpu?
This output means you don't have much, if any, separation between your classes. You can try UMAP (sometimes it succeeds where t-SNE fails, and vice versa) but I wouldn't expect it to perform much better. Your classes are probably not separated well enough.
Oh okay, I just have two classes and I guess it is not able to classify well based on the graph? Also, the images are put into separate sub-folders for each class.
This is a more reliable solution than AI/ML. People sometimes think that AI/ML techniques are better because they're sophisticated. But they can also be more fragile.
If simply setting a range doesn't work, there is a lot of statistical literature on "statistical process control" or "statistical quality control". The material is quite classical at this point; it was developed starting about 100 years ago. It might give you a little more sensitivity while still remaining robust to ordinary variation.
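As one concrete instance of that literature, a Shewhart-style control chart estimates the mean and standard deviation from readings taken while the machine is known to be healthy, then flags anything beyond three standard deviations. A minimal sketch with invented power readings; the 3-sigma width is a common convention, not something stated in this discussion.

```python
import statistics

# Invented baseline readings (watts) from a period when the machine ran normally.
baseline = [498, 502, 505, 497, 500, 503, 499, 501, 496, 504]

mean = statistics.mean(baseline)
sigma = statistics.stdev(baseline)  # sample standard deviation

# Control limits: mean +/- 3 standard deviations.
lower, upper = mean - 3 * sigma, mean + 3 * sigma


def is_out_of_control(reading):
    """True when a reading falls outside the control limits."""
    return reading < lower or reading > upper


new_readings = [501, 499, 200, 503]
flags = [is_out_of_control(r) for r in new_readings]
print(flags)
```

The limits come from the data rather than from a hand-picked range, but the check itself stays a simple comparison that is easy to reason about.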
Yes, it looks like whatever method you used for classification has failed. You'll need to re-evaluate your methods and look for mistakes or ways in which they could be improved.
Thing is, I didn't even classify yet. I just wanted to extract the features and view it.
This is labeled data?
Yeah it's images put into its respective class sub-folders; container and no_container.
I mean it's kind of my first time trying these visualisations out so that's why I am trying to figure it out.
It sounds like the features you extracted aren't powerful enough to distinguish the two classes.
Oh alright, maybe I should try a different model perhaps?
Maybe? I don't know what you've tried or what's appropriate for your data. And feature engineering is a kind of art.
Yeah true. I just tried YOLOv5s maybe it is more suited towards obj detection instead. I should maybe try using a better suited model like VGG or RESNET.
True, I was looking at a TinyML video on DigiKey's Youtube channel. They had to re-train the model whenever they moved the sensor because the data changes based on the location of the sensor. Seems a bit too much work for something that can be achieved in a simpler way.
I'm trying to understand if there is something I don't know or see about ML/AI
But yeah feature engineering is quite hard ngl (at least for me).
Hey guys, long-time lurker here. I had a question about some pandas functionality.
Doesn't seem possible to groupby then aggregate a custom function over multiple columns?
Basically, I need to be able to define, in one groupby/agg statement, the sum & the division of two separate columns and save them as a new column, and then calculate the sum of other columns.
This way the sum & division, and the sums is/are done over the grouped variables.
could you give us a concrete example?
some people forget google exists sometimes
being snide doesn't really add to the conversation constructively 🙂
I appreciate the sarcasm, but I've already looked at this.
Yes -- just a moment ry.
Unfortunately not. They are different
then u can loop over them or smth
Just a moment, I'm typing up an example so I can't respond, thanks
df = pd.DataFrame({'location': ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
                   'is_orange': [1, 1, 0, 0, 1, 0, 1],
                   'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
                   'melons': [73, 81, 94, 174, 23, 71, 65]})
@serene scaffold @mighty patio Sorry for the late response. So I am designing a figure with the ezdxf library. I have seen that I can export that figure and save it, because ezdxf has this functionality implemented; I can do that through matplotlib. Now I also have to make a report that has that figure inside. I wanted to skip the step of saving the figure from my ezdxf script and then loading it into the report docx file. So in summary, I want to keep the figure in memory without saving it locally and then load it directly into the docx file.
Alright, so given this DataFrame, what I'd like to do is something like this: sorry this is pseudocode
df.sort_values(['location']).groupby(['location']).agg(
'total orange/non-orange' : df['is_orange'] + df['is_non_orange'],
'percent_orange' : df['is_orange'] / (df['is_orange'] + df['is_non_orange']),
'sum_melons' : sum(df['melons'])
seems like u shud be able to define custom agg functions to use
and/or combining columns
Right, that's the idea. Basically, the requirements are forcing me to make this into one table. There are posts about custom agg functions for one column, but I unfortunately need this for multiple columns.
We are able to do this in legacy statistical software pretty easily, strangely enough.
u can make the sums first
and then do like py df['percent_orange'] = df['is_orange'] / (df['is_orange'] + df['is_non_orange']) ?
Btw, not to be that guy that says "just google it", but in my experience chatgpt is pretty good for this exact purpose of finding pandas functions to transform dataframes by giving an example and some explanation.
That's a great idea @mild dirge , I'll definitely look into it after this.
Thank you
Yeah, this is exactly the kind of thing I was thinking as well: just define the complex stuff up front and just sum, right? But if the weights (in our case, location) are different, then that would lead to incorrect roll-ups.
```
          is_orange  is_non_orange  melons  percent_orange
location
backyard          2              1     270        0.666667
bank              0              2     165        0.000000
store             2              0     146        1.000000
```
But that is exactly the kind of output I am looking for @cold osprey , yeah
this is slightly odd.
df['is_orange'] + df['is_non_orange']
seems to be operation between two series, which returns a series
sum(df['melons'])
seems to be a scalar
how do you expect the result to look? basically i want to know what is the output after your desired aggregation
oh i see, you just didn't add all the sums in, right?
That's a really good question @boreal gale . The best I can understand it looking at this legacy code, we want a sum of our melons grouped by location. So the bank would have 165 melons, and so on. Then, the df['is_orange'] + df['is_non_orange'], you are correct, this is my bad and it is an abuse of notation. I'd like to sum up each value within the groups.
So if I was trying to explain it, it would be: for each individual, sum up all of the is_orange and is_non_orange values pertaining to that location, and present each sum in the groupby table.
I hope that makes sense !
yeah i think u can do it with sums and defining new columns based on those
understood, sorry if i am being pedantic
```
          is_orange  is_non_orange  melons  percent_orange  total orange/non-orange
location
backyard          2              1     270        0.666667                        3
bank              0              2     165        0.000000                        2
store             2              0     146        1.000000                        2
```
is my final output
not at all. and noted @cold osprey , I'll take a look
Not at all, it was a great question. I was using badly-written pseudo code haha
In the meantime @cold osprey I will try to implement this. @boreal gale definitely let me know what you think.
Thanks everyone for your help
!e i would use something like this, but it's worth remembering shimmer's point - precomputing anything where sensible on the global dataframe level (though i don't think there is any here)
import pandas as pd
df = pd.DataFrame({'location': ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
                   'is_orange': [1, 1, 0, 0, 1, 0, 1],
                   'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
                   'melons': [73, 81, 94, 174, 23, 71, 65]})
def stats(df_subgroup):
    return pd.Series({
        'total_oranges': (df_subgroup['is_non_orange'] + df_subgroup['is_orange']).sum(),
        'melons': (df_subgroup['melons']).sum(),
    })
print(df.groupby('location').apply(stats))
@boreal gale :white_check_mark: Your 3.11 eval job has completed with return code 0.
          total_oranges  melons
location
backyard              3     270
bank                  2     165
store                 2     146
Ah, I didn't think to use apply or the custom function like this at all. So the Series object can in effect contain 2 series.
the returned series in the custom function describe one row at a time like this
not sure if there can be a speed up if u precompute sums then create new cols
may well be slower
i just deleted my code lel
yeah I'm unsure of how the groupby is optimized.
@boreal gale do you think your answer is worth posting on SO?
there might very well be better solutions out there - this is just the way i prefer to do it
at the end of the day, if it solves a particular issue it's probably worth being posted 🤷
Got it, I can post my question then
interesting, let me try it
but need second dataframe or dropping old columns if new 'calculated' columns are placed in same dataframe
ah wait
lemme put the sum groupby in the same cell
unfair comparison
i would run benchmark on the actual problem instead of in this microbenchmark-esque way, but 10x does sound significant
how did you get df2?
Ry, did you have a way to calculate the percentage oranges?
trynna figure out how to increase the loops and runs manually
numbers inconsistent across runs
could you post cell 17 as in here please?
thats ur stats function
sec running benchmark with more loops n runs
will send notebook
nvm cant send notebook LOL
best i can do
looks like 2x
lemme try with df2 declaration in the same cell
ah
df = pd.DataFrame({
    'location' : ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
    'is_orange': [1, 1, 0, 0, 1, 0, 1],
    'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
    'melons': [73, 81, 94, 174, 23, 71, 65]
})
def stats(df_subgroup):
    return pd.Series({
        'total_oranges' : (df_subgroup['is_non_orange'] + df_subgroup['is_orange']).sum(),
        'percentage_oranges' : (df_subgroup['is_orange'] / (df_subgroup['is_non_orange'] + df_subgroup['is_orange'])).mean(),
        'melons': (df_subgroup['melons']).sum()
    })
location  total_oranges  percentage_oranges  melons
backyard            3.0                0.66     270
bank                2.0                0.00     165
store               2.0                1.00     146
no diff this time around
I believe these are the correct results. Shimmer do you get the same thing?
uh
```
          is_orange  is_non_orange  melons  percent_orange  total orange/non-orange
location
backyard          2              1     270        0.666667                        3
bank              0              2     165        0.000000                        2
store             2              0     146        1.000000                        2
```
seems like i do
yeah seems like it
Tell you what, I'll make an SO question here, and I'll send the link to you guys in a few
Both answers seem to work for me as well
ye its roughly doing the same thing
I think your version is good but what do you think about memory overhead
not exactly sure how apply with custom function works under the hood
yeah same
i think memory shud be about the same
using apply will 'delete' the old df/uneeded columns only when it returns the new one so
same thing as df/df2 and then manually deleting df
or using same df and then dropping the unneeded cols
haha i wanna up the runs and loops just to see how far i can push it
oh, the aggregation doesn't get more complicated than that?
In my case, it doesn't, but you've asked a good question. What if it did?
if agg('sum') doesn't give you sufficient information for your further aggregation then you are potentially stuck with apply
Right. Also, it seems that if you need to do some sort of multiplicative thing like *, / , then you'd have to use the mean or median function to retrieve the correct value
Which kinda makes sense, it's sort of what shimmer is doing.
yeah exactly. Basically it's kinda creating your new columns and broadcasting the same value to each subgroup, so it takes the 'mean' of the subgroup which is just all the same numbers
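For this particular example the ratio only needs group sums, so one option is to aggregate with `sum` and derive the extra columns afterwards, with no `apply` involved. A sketch using the example frame from the thread; the column names `total` and `percent_orange` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "location": ["backyard", "store", "bank", "backyard", "backyard", "bank", "store"],
    "is_orange": [1, 1, 0, 0, 1, 0, 1],
    "is_non_orange": [0, 0, 1, 1, 0, 1, 0],
    "melons": [73, 81, 94, 174, 23, 71, 65],
})

# Step 1: plain sums per location.
out = df.groupby("location")[["is_orange", "is_non_orange", "melons"]].sum()

# Step 2: derive the total and the ratio from the group sums.
out["total"] = out["is_orange"] + out["is_non_orange"]
out["percent_orange"] = out["is_orange"] / out["total"]

print(out)
```

Dividing the group sums gives sum(is_orange) / sum(total). That matches the mean of the row-wise ratios here only because every row contributes exactly one item; with unequal row weights the two would differ, and the sum-then-divide form is usually the one a roll-up wants.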
@cold osprey , @boreal gale , does this sound right? [pandas]: I need to groupby on a column, then define multiple (including some custom) aggregation functions.
ye seems like a good way to do it
that way anyone can just modify the agg function and it will apply to everywhere it is used
Yeah, agree
I will see if I can add that point too once I ask that question you guys answer
yeah that sounds sensible as a title, also i gotta go, have fun 🙂
@boreal gale , @cold osprey : https://stackoverflow.com/questions/76130797/pandas-groupby-on-columns-then-define-multiple-including-some-custom-agg
Posted, please answer and decide who will get the accepted answer. Thanks again so much for your guys's help
haha idet i have a stackoverflow account
Got it, would be great if you could upload your answer. If not I can do that as soon as I finish up w work
And give you credit
haha its fine yeah u can upload my ans too
done, I gave you credit as shimmer from the Python Discord server, I hope that is enough
i dont care much for credit but thanks
Of course. Thanks so much for your help, not often you come across something like this
@boreal gale , I'll let you post your answer to that link if you want, otherwise I will post your approach & also credit you
Any of you ever used transformers for multivariate time series analysis, if so how was your experience? I'm not sure how I feel about it since attention is permutation invariant. Not sure we have enough data for stuff like temporal fusion transformers either.
I'm not sure there's any merit to doing this at all - do people just apply them to time series because they are sequences?
Code
df = spark.read.option("header", True).csv(*[f"/FileStore/tables/deck_data_{num}.csv" for num in range(500000, 2500001, 500000)])
Error
ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near '/': extra input '/'(line 1, pos 0)
== SQL ==
/FileStore/tables/deck_data_1000000.csv
^^^
What is going on here? I'm not using a SQL cell, I'm using a Python cell. This is in databricks by the way. I'm trying to read multiple csv files at once by unpacking a list of file paths.
Someone please help me, there is one cell with an error i don't understand please check this link and help me out, PS: I am a beginner practicing python data-science
https://colab.research.google.com/drive/1LE1RLYrl1pCWfoMPbmWiVFebg3a99pB-?usp=sharing
The code and the error you will find in this link below
df = spark.read.option("header", True).csv(*[f"file:/FileStore/tables/deck_data_{num}.csv" for num in range(500000, 2500001, 500000)])
Suppose I have an Nx2 array: e.g
In [92]: a
Out[92]:
array([[3, 7],
       [2, 4],
       [0, 9]])
I want to treat each row as representing the start and end points of a consecutive sequence of elements in another array. How can I efficiently extract those sequences? Obviously they'll be different lengths, so the result can't be an array, but it could be a list of arrays. So e.g. if I have
In [98]: x
Out[98]: array([7, 1, 5, 2, 0, 4, 8, 9, 6, 3])
I want to get:
In [100]: r
Out[100]: [array([2, 0, 4, 8]), array([5, 2]), array([7, 1, 5, 2, 0, 4, 8, 9, 6])]
I can do it by looping over a and forming slices from each row, but is there something that doesn't involve looping in python?
I don't think so, since the output is a list.
tried a numba function - it's exactly as fast as the python implementation, probably because a list is involved.
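For reference, the loop-based baseline mentioned in the question can be written as a list comprehension. Basic slicing returns views, so no element data is copied; only the small view objects are created in Python. The arrays are the ones from the question.

```python
import numpy as np

a = np.array([[3, 7], [2, 4], [0, 9]])
x = np.array([7, 1, 5, 2, 0, 4, 8, 9, 6, 3])

# One slice per row of `a`; each result is a view into `x`, not a copy.
r = [x[start:end] for start, end in a]

print(r)
print(all(np.shares_memory(piece, x) for piece in r))
```

Because the per-row work is only creating a view, the Python loop overhead is about all there is to remove, which fits the observation that a compiled version was no faster.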
I don't think this works.
I'm having a new error in spark though. I'm trying to reformat some csvs.
Code
for num in range(500000, 2500001, 500000):
    path = f"/FileStore/tables/deck_data_{num}.csv"
    with open(path, "r") as f:
        reader = csv.reader(f)
        with open(f"deck_data_{num}_formatted.csv", "w", newline="") as f:
            writer = csv.writer(f, delimiter="|")
            for row in reader:
                writer.writerow(row)
Error
FileNotFoundError: [Errno 2] No such file or directory: '/FileStore/tables/deck_data_500000.csv'
I'm sure the files are in the dbfs. It seems like spark won't let me open them.
You must enter "file:///c:/..." style in my opinion.
Not really sure what you mean by that. This is in databricks remember.
FileNotFoundError: [Errno 2] No such file or directory: 'file:///c:/FileStore/tables/deck_data_500000.csv'
When you input path, couldn't you see the file? like this.
Databricks doesn't have IntelliSense as far as I'm aware. So no, I can't see the filename autocomplete.
I'm using databricks community edition.
If the path is fine, in other words if the file exists, the code will work.
The file system on databricks is a bit different than a regular hard drive.
Never used databricks but can you traverse whatever the file system they use?
Did you check if the file is existing?
Anyone here super familiar with matplotlib?
That's basically what I'm trying to figure out.
Yes.
Although the output is a list, it should be possible to do the iteration over the inputs quicker. I'll have a play with cython and numba...
you could probably rewrite the code that generates that list of lists to generate slices instead
my approach would probably be to rethink whether you really need that list of arrays. like, maybe you can use slices directly?
Not sure what you mean exactly, a slice, of itself, isn't the data - I need to get my hands on the data. The resulting arrays will be views, so no need to actually copy the underlying memory
I mean that maybe you can rewrite whatever function consumes this list to take an array of pairs instead, and take slices using that array, which can be numbified.
I have not experienced databricks as well. So I can't give you advice any more. Sorry.
It's okay. Thanks for trying to help.
It's the "taking slices" bit that I'm trying to solve...
Well no, not quite, you're trying to then put these slices into a list. I'm saying that maybe you can construct these slices right before usage.
Like, instead of having a function that takes a list of variable-length arrays, have a function that takes an (N,2) shaped array, and, inside the function, take slices using these pairs. That way, the whole thing can be numbified much better than creating a list of numpy arrays can be.
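A sketch of that suggestion, assuming (hypothetically) that the downstream computation is a per-segment sum: the function receives the data plus the `(N, 2)` array of bounds and slices inside its own loop, so no list of arrays is ever built. It is written in plain NumPy; the body is a simple loop over numeric arrays, which is the kind of code a JIT compiler such as Numba targets, though that is not tested here. `segment_sums` is an illustrative name.

```python
import numpy as np


def segment_sums(x, bounds):
    """Sum x[start:end] for every (start, end) row in bounds."""
    out = np.empty(len(bounds), dtype=x.dtype)
    for i in range(len(bounds)):
        start, end = bounds[i, 0], bounds[i, 1]
        out[i] = x[start:end].sum()  # the slice is taken right where it is used
    return out


x = np.array([7, 1, 5, 2, 0, 4, 8, 9, 6, 3])
a = np.array([[3, 7], [2, 4], [0, 9]])
print(segment_sums(x, a))
```

The output is a fixed-shape array with one entry per row of bounds, which avoids the variable-length list that made the original version hard to speed up.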
What is the most used function for AI in Python
Too vague and also doesn't mean much, but if u insist imo matrix multiplication
Yes that go with that
Then again i guess a more suitable answer would be tensor multiplication lol
If I can ask, how do you come up with an idea for what to program with AI if you want to create a new tool?
even if you're starting with a pretrained model, you would still need a GPU to fine tune it for your use case. but you can get some free GPU compute at google colab.
@high iron what are you trying to do?
chatbots are not a good first project.
Honestly, just learn multivariable calculus and an intro to linear algebra, and everything else is fairly simple to understand
Training a 1-stack/layer transformer (which I'm assuming you'll be doing anyway, since that's all anyone talks about these days) on hardware such as a 3080 takes about 30 minutes on a dataset of 51785 training, 1803 test, and 1193 validation examples
In batches (mini batch) of 64
And 20 epochs
Each example has 86 features and we infer up to 81 labels max
The 3080 had its power draw capped to 85%
This should be faster with decode only because that's what everyone does, but it makes sense since the MHA confusion matrix for the encoder shows that it's fucking useless as you removed about half the trainable parameters
so technically you don't really need hardware, the problem is data, which everyone seems to forget, for some reason
LinearRegression.predict()
This returns a list with a bunch of decimal values when I used it on my x_test data, what are these numbers for and how can they be used? (Linear regression model from sklearn library)
it returns a list of predictions, where the nth element of the list is the prediction for the nth element of x_test.
you have to know what x_test represents and what the model is intended to do to make sense of the output.
Well it is a spam filter project, one column shows the email, the other has a value of 1 or 0 (0 is not spam, 1 is spam)
if the model is supposed to tell you whether the email is spam or not, then the "is spam" value should not be part of the x data.
Yes, the "is spam" is not present, only the number
@vital cedar how did you represent the emails for the purposes of linear regression?
Count Vectorizer
and the outputs from predict are what? numbers between 0 and 1?
[ 0.63555711 0.28661988 0.63555711 ... -0.36245224 -1.77000725
-0.47555284]
They don't seem to be between 0 and 1
this is an array, not a list.
My bad
youll need to do transformations to force between 0 and 1, I did a spam model the other day and just cut it at .5 (>.5 = 1) (<=.5 =0)
What do these values represent exactly?
I thought it would return 0 or 1
not if you use linear regression, you basically fitted a line on a graph
if you want to use a model like that you can do a random forest i guess but its basically the same thing with the cutoff i wrote just built in
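A minimal sketch of the cutoff idea from this thread, assuming scikit-learn is installed. The emails and labels are invented for illustration, and it swaps LinearRegression for LogisticRegression, which is a different model from the one in the question. Its predict_proba output stays between 0 and 1, so a 0.5 cutoff has a direct reading as "more likely spam than not".

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration: 1 = spam, 0 = not spam.
emails = [
    "win a free prize now",
    "free money click now",
    "claim your free prize",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
    "project notes from the meeting",
]
is_spam = [1, 1, 1, 0, 0, 0]

# Turn each email into word counts, as in the question.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

clf = LogisticRegression()
clf.fit(X, is_spam)

new_emails = ["free prize inside", "notes for tomorrow"]
X_new = vectorizer.transform(new_emails)

# Column 1 of predict_proba is the estimated probability of class 1 (spam).
spam_probability = clf.predict_proba(X_new)[:, 1]

# predict() applies the 0.5 cutoff for you in the binary case.
labels = clf.predict(X_new)
```

With LinearRegression the outputs are unbounded scores, so any cutoff has to be chosen by hand. With a classifier the cutoff is part of the model, but it can still be moved if false positives and false negatives have different costs.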
if line 49 says both have the same dimensions, why does line 50 give no error on view while line 52 raises RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
it sounds like an issue related to the stride? (or something about the underlying array being non-contiguous, see https://github.com/cezannec/capsule_net_pytorch/issues/4)
did you try just using the other function the error message tells you to?
i havent tried those, it would look inconsistent in code to do so, i dont see why they are different, and one is compatible with view and other not
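A small sketch of why two tensors with the same shape can behave differently under view, assuming PyTorch is installed. The tensors p and q are made up for illustration and are not the ones from the original code. Shape alone does not decide whether view works; the memory layout (strides) does.

```python
import torch

# Two tensors with the same shape (4, 3) but different memory layouts.
p = torch.arange(12).reshape(4, 3)        # contiguous, strides (3, 1)
q = torch.arange(12).reshape(3, 4).t()    # transposed view, strides (1, 4)

p_flat = p.view(-1)                       # fine: view only reinterprets existing memory

try:
    q.view(-1)                            # flattening q cannot be expressed with strides alone
    view_failed = False
except RuntimeError:
    view_failed = True

q_flat = q.reshape(-1)                    # reshape copies when a view is impossible
q_flat_alt = q.contiguous().view(-1)      # same result: contiguous copy first, then view
```

Operations such as transpose, permute or slicing with a step typically produce non-contiguous tensors. Using reshape everywhere is one consistent option, since it returns a view when it can and copies only when it must.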
Well, what are those values and how can I use them?
The higher the number the more your robot thinks it’s spam
Thank you 👍
anyone familiar with pyspark here?
please i need guidance
i have a project which consists of a 100GB json file; i need to process it, clean it and then insert it into MongoDB
can someone please guide me, im not actually sure what to do here
I'm messing with the numbers to get one that is accurate enough to differentiate between spam and not spam
if regModel.predict(msg) > 1:
For now it's 1 but is there a good or accurate way to find the number?
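One hedged way to pick the number instead of guessing: sweep candidate cutoffs on a held-out validation set and keep the one with the best score for a metric you care about. The sketch below uses F1 and plain NumPy. The scores and labels are made up, and best_threshold is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def best_threshold(scores, labels, candidates):
    """Return the candidate cutoff with the highest F1 score on (scores, labels)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_cutoff, best_f1 = None, -1.0
    for cutoff in candidates:
        predicted = scores > cutoff
        tp = np.sum(predicted & labels)      # spam correctly flagged
        fp = np.sum(predicted & ~labels)     # ham wrongly flagged
        fn = np.sum(~predicted & labels)     # spam missed
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:                     # ties keep the first (lowest) cutoff
            best_cutoff, best_f1 = cutoff, f1
    return best_cutoff, best_f1

# Made-up validation scores (for example from regModel.predict) and true labels.
val_scores = [-1.2, -0.4, 0.15, 0.35, 0.6, 0.9, 1.4]
val_labels = [0, 0, 0, 1, 0, 1, 1]

cutoff, f1 = best_threshold(val_scores, val_labels, np.linspace(-1.0, 1.0, 21))
```

scikit-learn's roc_curve and precision_recall_curve compute the same kind of sweep over every distinct score. Whichever route you take, choose the cutoff on validation data and report results on a separate test set.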
I want to start learning machine learning but I am not sure which framework to use, any suggestions? I am thinking either Pytorch or Tensorflow...
My advice is to learn PyTorch. It will save you headaches down the road, despite TensorFlow being easier to use OOB.
Also shimmer I was thinking. Your approach yesterday doesn't actually add significant space overhead because pandas doesn't create deep copies by default.
Has your thing been solved?
I normally have a decorator laying somewhere that lets you compute an arbitrary function per group of your df
TensorFlow cause of its built in training loops and nothing else it’s mostly just my preference
I just want more people to use tensorflow tbh
Read my earlier messages
Transformations? How?
.5 cutoff. Here: https://youtu.be/xG-E--Ak5jg
Honestly, starting with scikit-learn might be a good idea
Especially if you're working with tabular data that should be the place to start.
Which one is normally easier to learn?
Pytorch
The degree of difficulty is something like:
Keras < Pytorch < Tensorflow
Keras is highest level API, you assemble your model like you assemble lego pieces.
Tensorflow is the lowest level, you have to do many things manually(not all, though)
Pytorch is the middle ground
I'd say tensorflow and pytorch are at the same level
They have a lot of overlap
Easiest to learn will def be keras, as it is just: import model -> fit model -> use model. But you don't learn a lot from it.
when im feature engineering an imbalanced dataset, should i apply pca before or after resampling?
I'd make sure you want to resample first
It's fair to just do stuff as usual and then select your operating point manually by looking at ROC, PR, DET, ... based on your application
im going for binary classification
one class is 95% of the dataset lol
so specificity and all that would be horrendous without resampling
You need to compute all of those metrics over several operating points (decision thresholds)
You want to PCA first, because you want to know the projection of the original data. And then you can resample
Otherwise the PCA is affected by the resampling
This is the data science channel, so try #web-development . But please always ask a complete question that someone can start answering. Not if someone knows about a topic.
@dapper flame this is the data science channel. kindly remove your messages from this channel and try in #web-development.
I see you also asked in #async-and-concurrency. that's fine. but please ask your question in only one channel, so that no one answers a question that was answered somewhere else.
Are any of you guys good with using pytorch, for some reason I'm getting a dimension mismatch and I don't really know why
show code and error
Ok so just a bit of context first, I'm simply trying to implement UNet for audio source separation and I'm using the musdb dataset and accompanying package in order to do it
There is the entirety of the code that I'm using
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-81-b99926cdd359> in <cell line: 1>()
7 print("Target shape:", target.shape)
8 train_unet_func(umet_model, reshaped_train_loader, optimizer, device, epoch, tb_writer)
----> 9 test_unet_func(umet_model, test_unet, device, epoch, tb_writer)
10 umet_model.cpu()
11 state_dict = umet_model.state_dict()
<ipython-input-79-badf5cb3f78b> in test_unet_func(model, test_data, device, epoch, tb_writer)
38
39 x_padded, (left, right) = padding(x)
---> 40 right = x_padded.size(1) - right
41 mask = model(x_padded.unsqueeze(0)).squeeze(0)[:, :, left:right]
42 y = mask * x.unsqueeze(0)
IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
Also, I printed out the input shape after every step of forward while debugging and some other areas, I don't know how much it will be useful to you, but I have included the paste of it below:
https://paste.pythondiscord.com/zilegevuqe
I'm working with a random forest classification model and when I implement it on an external validation dataset (like the test set), i'm getting a recall (sensitivity) of 100%. Is it fine to have a high recall like this?
that's for you to decide
You can get 100 % recall on a given class by just predicting everything belongs to that class.
I'm working on binary classes... and the rest of the parameters seems to be what I expected
Accuracy = 83.33333333333334
Sensitivity = 100.0
Specificity = 66.66666666666666
Precision = 75.0
ROC = 83.33333333333334
MCC = 70.71067811865476
f1 = 85.71428571428571
I feel like this can still serve the purpose.. even with high recall
care to share your opinion....?
Accuracy, precision, recall, specificity, sensitivity, ... all have intuitive, real world meanings. If I were you I'd ask myself the question "what would make my classifier a good one" and then look up the definitions of these metrics. I can't decide this for you, it's problem dependent 🙂
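For reference, a sketch of how these metrics follow from the four confusion-matrix counts. The counts below are hypothetical. They happen to reproduce the figures posted above (TP=3, FN=0, FP=1, TN=2), which would suggest a test set of six samples or a multiple of that, but that is an inference and not something stated in the chat.

```python
import math

def binary_metrics(tp, fn, fp, tn):
    """Common binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)     # recall: share of actual positives found
    specificity = tn / (tn + fp)     # share of actual negatives found
    precision = tp / (tp + fp)       # share of predicted positives that are right
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty; report 0 then.
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }

# Hypothetical counts consistent with the numbers in the message above.
posted = binary_metrics(tp=3, fn=0, fp=1, tn=2)

# A classifier that calls everything positive also reaches 100 % recall.
always_positive = binary_metrics(tp=3, fn=0, fp=3, tn=0)
```

The second call shows why recall is never read alone: sensitivity is still 1.0, but specificity drops to 0 and precision to 0.5.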
yeahhh... okayy.. Thank you
Hello everyone
plot_decision_boundary(model=model_4,
X=X,
y=y)
With this code im trying to check the decision boundary from the latest model Im building
This is the error I received
ValueError: Exception encountered when calling layer 'sequential_10' (type Sequential).
Input 0 of layer "dense_15" is incompatible with the layer: expected min_ndim=2, found ndim=1. Full shape received: (None,)
Call arguments received by layer 'sequential_10' (type Sequential):
• inputs=('tf.Tensor(shape=(None,), dtype=float32)', 'tf.Tensor(shape=(None,), dtype=float32)')
• training=False
• mask=None
Hi all need some help in data science. Question.
What is it
I forgot to ask😅
What data am I missing currently? Why is it not working?
@lapis sequoia hi davs I have dm you check
Hello, I'm fairly new to ml. I'd consider myself as a intermediate to advanced python developer(although I see myself somewhere in the middle). But I have almost no experience in ml. I learn best by finding somebody who can guide me. Anyone willing to spend some time to help me to hop on the train?
Maybe I should mention that I'm really interested in the math behind it but I dunno if it's really worth it to learn it from scratch when there are already so many libraries etc
@lone plaza https://mml-book.github.io/ (Mathematics for Machine Learning, by Deisenroth, Faisal and Ong), then https://www.statlearning.com/ (An Introduction to Statistical Learning) and https://arxiv.org/abs/2106.11342 (Dive into Deep Learning, an open-source book drafted in Jupyter notebooks). Ideally you should actually do projects etc. while reading these
Each of them has a "who is this book for" section, let that convince you whether or not you want to go into the maths or not.
hello falk
if you managed to find a outline or a guide my case is exactly similar to you
but i have little bit experience in it
i did some projects 😂
can you tell me about it as well
Hi all,
I'm currently trying to work out a bit of an issue i'm having mapping some numpy operations to Rust and have encountered an interesting behaviour which has dumbfounded me.
Say I have an array:
data = np.array([
np.full(5, 0.20, dtype=np.float64),
np.full(5, 25.7, dtype=np.float64),
np.full(5, 3.0, dtype=np.float64),
np.full(5, 0.9, dtype=np.float64),
], dtype=np.float64)
And it's an f64, when I then apply the following ops to it:
hyperplane_vector = np.empty(dim, dtype=np.float64)
for d in range(dim):
    hyperplane_vector[d] = (data[left, d] / left_norm) - (
        data[right, d] / right_norm
    )
Where left, right and their respective norms are:
left = 0
right = 1
left_norm = norm(data[left]) # L2 norm
right_norm = norm(data[right]) # L2 norm
Numpy will produce an array of 0.0
interesting [0. 0. 0. 0. 0.]
But if this array becomes a float32 we get:
interesting [-5.0820086e-09 -5.0820086e-09 -5.0820086e-09 -5.0820086e-09 -5.0820086e-09]
And this is confusing the fuck out of me: where is the accuracy dropping, is the f64 result correct and it should be zero, or is something else going on?
The reason I'm a bit confused is that when porting this over to Rust, the f32 array produces the same values as numpy's float64 behaviour
If I force it to become an f64 array and do all of that with double precision, I get a number close to the f32 value in numpy but off by a tad, which can honestly just be put down to rounding error
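A sketch for putting those numbers in context, assuming norm is a plain L2 norm (np.linalg.norm here; the original norm function is not shown). Both rows are constant vectors, so each normalizes to 1/sqrt(5) in every component and the exact answer is zero. Anything nonzero is rounding error, and its size should be compared against the machine epsilon of the dtype.

```python
import numpy as np

def hyperplane_vector(dtype):
    """Difference of the two normalized rows, computed entirely in `dtype`."""
    data = np.array(
        [np.full(5, 0.20), np.full(5, 25.7), np.full(5, 3.0), np.full(5, 0.9)],
        dtype=dtype,
    )
    left, right = 0, 1
    left_norm = np.linalg.norm(data[left])     # assumption: plain L2 norm
    right_norm = np.linalg.norm(data[right])
    return data[left] / left_norm - data[right] / right_norm

v64 = hyperplane_vector(np.float64)
v32 = hyperplane_vector(np.float32)

# Machine epsilon: the relative rounding error to expect from a single operation.
eps32 = np.finfo(np.float32).eps   # about 1.19e-07
eps64 = np.finfo(np.float64).eps   # about 2.22e-16
```

The reported float32 result of about -5e-9 is well below eps32 times the component size (roughly 0.45), so it is zero to within float32 rounding. Whether the last bits come out as exactly 0 or as a tiny residue depends on details such as summation order inside the norm, so a bit-for-bit match between NumPy and another implementation is not guaranteed. Comparing with a tolerance is the usual approach.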
Wow, dude! Appreciate you!
hi all,
so basically, i am working on a project right now and need some help more like guidance
i want to predict temperature values based on inputs such as date, humidity (percentage values)
First, are these enough inputs?
second, i am using one-hot encoding (not sure if that's the right name, but basically taking the date as Day 1, 2, 3, 4, 5, ...)
third, which algorithm will be best?
i have worked with SVM (RBF) and only dates, and i found the results quite promising.
but when i introduced humidity values the results were worse
like before mean error was 1.xxxx and then it went up to like 144.xxxx
Pearson correlation values were
-.67 for temp and humidity
0.5xx for temp and day
hi all, I've been asked to run random_state 10 times and take the mean of them. Do I just code it like this? I'm new to all this and really struggling to get my head round it
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.20, random_state=10)
no.
look up cross validation
They probably want you to do 10 random runs, so you should set the seed to a different value each time. @sterile belfry
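A sketch of the "10 random runs" reading, assuming scikit-learn is installed. The data and the LogisticRegression model are stand-ins; use your own X, Y and model. The point is that random_state changes on every run and the scores are averaged at the end.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with your own X and Y.
X, Y = make_classification(n_samples=300, n_features=8, random_state=0)

scores = []
for seed in range(10):
    # A different seed gives a different train/test split each run.
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.20, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, Y_train)
    scores.append(model.score(X_test, Y_test))   # accuracy on this split

mean_score = np.mean(scores)
std_score = np.std(scores)
```

Reporting the standard deviation next to the mean shows how much the result depends on the split. Cross-validation (for example cross_val_score) is the more systematic version of the same idea.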
data
processed data
thanks from me as well
hi
watching it
The two videos I sent I actually highly recommend watching. The one by sebastian lague is pretty chill, and he explains in pretty simple terms.
The 3b1b goes a bit more in-depth on the maths
I agree those two are good ones
I'm working on a time series dataset. When I plot it with Plotly's line plot, the time axis shows up like this. I want to format it so it looks a little better; I've tried everything I know but nothing worked. Can anyone help me format this datetime?
can someone help me with this question? i got 7 datapoints for cluster 1 but it was incorrect
are you using Euclidean distance
Manhattan distance
show formula
After performing the initial clustering and re-estimating the centroids using the Manhattan distance as a distance metric, we obtain:
Cluster 1: [2, 1, 4, 5, 3]
Cluster 2: [8, 6, 28, 12, 9, 7, 10]
The new centroids are:
K1_new = (2+1+4+5+3)/5 = 3
K2_new = (8+6+28+12+9+7+10)/7 = 11.42857143
[3, 6, 2, 1, 4, 5, 7] = C1
7 datapoints
Distance to K1_new:
[0.67, 5.67, 3.67, 1.67, 2.67, 4.67, 24.33, 8.33, 5.33, 2.33, 3.33, 6.33]
Distance to K2_new:
[4.5, 5.5, 15.17, 9.5, 11.5, 8.17, 15.83, 3.83, 3.83, 7.17, 4.17, 2.83]
then i used the distances to get the c1 points
ok
i got 7 but it's incorrect
do you know correct answer or no
yes i did that
?
Cluster 1: [3, 2, 1, 4, 5, 7, 9]
Cluster 2: [8, 6, 28, 12, 10]
this is what i got for both clusters
yeah i dont think so
your solution seems ok, the result depends on whether you use <= or <, since it looks like one distance is repeated
!e
import numpy as np
x = np.array([3,8,6,2,1,4,28,12,9,5,7,10])
c1 = 2
c2 = 8
for i in range(2):
    d1 = np.abs(x-c1)
    d2 = np.abs(x-c2)
    print(d1)
    print(d2)
    clusters = d1 < d2
    if i == 1:
        break
    c1 = np.mean(x[clusters])
    c2 = np.mean(x[np.logical_not(clusters)])
print(f"{np.sum(clusters)} points belong to cluster c1 with centroid {c1}")
print(f"{np.sum(np.logical_not(clusters))} points belong to cluster c2 " +
      f"with centroid {c2}")
@wooden sail :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | [ 1 6 4 0 1 2 26 10 7 3 5 8]
002 | [ 5 0 2 6 7 4 20 4 1 3 1 2]
003 | [ 0.5 5.5 3.5 0.5 1.5 1.5 25.5 9.5 6.5 2.5 4.5 7.5]
004 | [ 7.625 2.625 4.625 8.625 9.625 6.625 17.375 1.375 1.625 5.625
005 | 3.625 0.625]
006 | 6 points belong to cluster c1 with centroid 2.5
007 | 6 points belong to cluster c2 with centroid 10.625
notice in the first iteration, there's a point with the same distance to both cluster centroids. the result depends on which one you pick for that iteration
if we use <= instead of <, we get 7 points as you said
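To make the tie-breaking effect concrete, here is a sketch that runs the same two assignment steps with both comparison operators. cluster1_size is a hypothetical helper written for this illustration. It uses the same data and starting centroids (2 and 8) as the eval above.

```python
import numpy as np

x = np.array([3, 8, 6, 2, 1, 4, 28, 12, 9, 5, 7, 10])

def cluster1_size(x, c1, c2, ties_to_c1, n_assignments=2):
    """Run 1-D 2-means assignment steps with absolute distance; return the final size of cluster 1."""
    for step in range(n_assignments):
        d1 = np.abs(x - c1)
        d2 = np.abs(x - c2)
        # The only difference between the two variants: who gets a point at equal distance.
        in_c1 = d1 <= d2 if ties_to_c1 else d1 < d2
        if step < n_assignments - 1:
            # Re-estimate centroids only between assignment steps.
            c1 = x[in_c1].mean()
            c2 = x[~in_c1].mean()
    return int(in_c1.sum())

strict = cluster1_size(x, c1=2, c2=8, ties_to_c1=False)      # d1 < d2
inclusive = cluster1_size(x, c1=2, c2=8, ties_to_c1=True)    # d1 <= d2
```

The point 5 is exactly 3 away from both starting centroids. Sending it to cluster 1 moves the first re-estimated centroids to 3 and about 11.43 instead of 2.5 and 10.625, and that shift is enough to pull 7 into cluster 1 on the next assignment.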
oh ok so i put the wrong question lol.
it was similar but i got 6 points for this
Cluster 1: [3, 2, 1, 4, 5]
Cluster 2: [8, 6, 28, 12, 9, 7, 10]
New centroid K1 = mean([3, 2, 1, 4, 5]) = 3
New centroid K2 = mean([8, 6, 28, 12, 9, 7, 10]) = 11.4
Cluster 1: [3, 2, 1, 4, 5, 7]
Cluster 2: [8, 6, 28, 12, 9, 10]
New centroid K1 = mean([3, 2, 1, 4, 5, 7]) = 3.67
New centroid K2 = mean([8, 6, 28, 12, 9, 10]) = 12.17
Cluster 1: [3, 2, 1, 4, 5, 7]
Cluster 2: [8, 6, 28, 12, 9, 10]
I have used a PCA technique for my image dataset and it shows the two classes points are overlapping. Does it mean there is a high degree of similarity?
The first 2 principal components capture the most variance but it's not guaranteed that your classes can be separated along those 2 dimensions
They could be similar in the first 2 and dissimilar in the others, hard to tell. Maybe you can try LDA, it's more or less a supervised version of PCA.
LDA?
linear discriminant analysis. I assume you use scikit-learn? There's an LDA classifier there. What you want to call is the transform method.
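A minimal sketch of the LDA projection, assuming scikit-learn is installed. The feature matrix is random stand-in data, not real MobileNet features. One thing to note for a two-class problem: LDA can produce at most n_classes - 1 components, so the projection is a single axis rather than a 2-D scatter.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Made-up stand-in for extracted features: 200 samples, 20 features, 2 classes.
features = rng.normal(size=(200, 20))
labels = np.repeat([0, 1], 100)
features[labels == 1, :3] += 1.5   # shift class 1 on a few features so there is something to find

lda = LinearDiscriminantAnalysis()
projected = lda.fit_transform(features, labels)   # supervised: needs the labels

n_axes = projected.shape[1]   # 1 for a two-class problem
```

With a single discriminant axis, a histogram per class is usually a more natural plot than a scatter.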
Oh okay alright, I am just extracting the features through mobileNet and then running it through the visualizations. I tried T-Sne as well, made no sense.
So you're doing features => PCA => plot?
Yeah so normalize the images, resize, put in dataloader etc. => Run the MobileNet for image feature extraction => PCA => plot
Do you use a Standardscaler before you do your PCA
Idk if I am missing any steps.
No. Supposed to do that as well?
Yes you are 🙂 make_pipeline(StandardScaler(), PCA(2))
Personally I would go with PCA first again
I am assuming the scaling function should be used with LDA, T-SNE etc?
Okay cool.
Many methods require feature scaling but not all of them. The rest can feel free to correct me if they disagree but when in doubt you can rescale because the effect of not doing it is worse than the inverse.
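A quick sketch of why scaling matters for PCA in particular, assuming scikit-learn is installed. The three features are invented, with one on a much larger scale than the others.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Made-up independent features; the third has a standard deviation 1000 times larger.
features = np.column_stack([
    rng.normal(0, 1, 500),
    rng.normal(0, 1, 500),
    rng.normal(0, 1000, 500),
])

unscaled = PCA(n_components=2).fit(features)
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(features)

# Share of variance captured by each of the two components.
ratio_unscaled = unscaled.explained_variance_ratio_
ratio_scaled = scaled.named_steps["pca"].explained_variance_ratio_
```

Without scaling, the first component is almost entirely the large-scale feature, because PCA ranks directions by raw variance. After standardizing, every feature contributes on equal footing.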
You should also use the exact image normalization that your pretrained model used. For example, some are trained on [-1, 1] (typically) others on [0, 1] so you should consult the docs to see what they did and mirror that.
Oh alright. So this StandardScaler method is usually done right before using PCA or other methods?
You mean the size?
Yes indeed
Typically pixels are in [0, 255] but models are trained on [0,1] or [-1,1]
Oh okay alright, let me just try out this scaling function and I will let you know.
I think you're already doing this since you mentioned you normalize and then resize. Just check the docs to see what they normalized with in training
So for MobileNet I am using this docs, https://pytorch.org/vision/main/models/generated/torchvision.models.mobilenet_v3_small.html#torchvision.models.mobilenet_v3_small It seems like the model was trained on 256x256.
Also, the scaling option did not work as well. Looks like the same thing.
"The inference transforms are available at MobileNet_V3_Small_Weights.IMAGENET1K_V1.transforms " Ideally you need to run this transformation instead of manually resizing / normalising
The image is exactly the same as the previous one? Can I maybe see the code?
Oh this is new to me. Let me just check it.
Yeah sure. Hold on.
So this code takes the features and labels:
features = []
labels = []
for images, label in train_dataloader:
    with torch.no_grad():
        outputs = model(images)
    features.append(outputs.numpy())
    labels.append(label.numpy())
features = np.concatenate(features, axis=0)
labels = np.concatenate(labels, axis=0)
I get 3488,1000 as the output. 3488 is the number of samples and 1000 is the features.
Then the scaling:
```py
features_2d = scaler.fit_transform(features)
```
pca = PCA(n_components=2)
pca_result = pca.fit_transform(features_2d)
print(pca_result.shape) # (3488,2)
ax = plt.figure()
ax = ax.add_subplot(111)
ax.scatter(pca_result[:, 0], pca_result[:, 1], c=labels, cmap=ListedColormap(colors))
# View the plot
%matplotlib inline
print(plt.show())
This is the visualization.
Also, I did not freeze any layers. I think the model.eval() command does that.
Hmmm then I'm not sure because this seems to be OK. Could be the transforms that are going wrong. I doubt it, but it's good practice to use the ones they suggest in the docs for pretrained models. If it's not that it could be that the features from mobilenet are inadequate (try Xception for example, but the model is a lot larger) OR that it is really your data OR that the first 2 PC's are not discriminative
Hmm okay.
So the docs just say to put the weights like this:
model = models.mobilenet_v3_small(pretrained=True, weights="MobileNet_V3_Small_Weights.IMAGENET1K_V1")
Cause it accepts the params.
Let me try it now.
I remove the normalization and resizing.
But idk how that would make a significant difference.
Yes, that's one thing but the other one (bottom paragraph of the docs) is doing all of the same transformations they did. They make it simple by offering you MobileNet_V3_Small_Weights.IMAGENET1K_V1.transforms on this link: https://pytorch.org/vision/main/models/generated/torchvision.models.mobilenet_v3_small.html#torchvision.models.mobilenet_v3_small
Yeah I saw this but they have not explicitly shown where to add it to.
Wherever you were resizing and normalizing before you can replace it with that
You mean my transform function?
transform = transforms.Compose(
[
transforms.Resize((IMG_HEIGHT, IMG_WIDTH)), # Resize the images to (224, 224)
transforms.ToTensor(), # Convert the images to PyTorch tensors
transforms.Normalize(
mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
), # Normalize the images
]
)
yes, this
Ah okay, so I don't think that works because the transform function does not accept that module.
transform = transforms.Compose(
[
transforms.Resize((IMG_HEIGHT, IMG_WIDTH)), # Resize the images to (224, 224)
transforms.ToTensor(), # Convert the images to PyTorch tensors
# transforms.Normalize(
# mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
# ), # Normalize the images
transforms.MobileNet_V3_Small_Weights.IMAGENET1K_V1,
]
)
So I think I can apply the transformations which they have done through the transform function and then keep the weights param.
Maybe I'm explaining it poorly. Either way, you can just remove that part then because you hard coded the numbers which is fine as well I guess?
Using MobileNet_V3_Small_Weights.IMAGENET1K_V1.transforms (transforms is missing) just applies all the steps you've done.
I actually got to go now, if I were you I'd just try a different model at this point (Xception, Resnet)
Thanks a lot for your help. Will update you when I can.🤗
can someone explain what is graph_execution_error and how can i overcome it ?
i have used tflearn to make a model for disease recognition in pomegranate. when i load the model and try to predict, it gives correct output at first but for second try it gives graph execution error
You should probably check the shapes of your input and output data, and the input and output shape of your model. @maiden widget
thanks
actually, the model needed to be closed or restarted before using it again
using keras, what's the difference between enabling return_sequences and having several nodes in the output layer?
Anyone here know linear & logistic regression?
what's up? you got any specific questions?
hi guys, i use EMNIST balanced dataset(47 classes) in order to train my CNN in keras and i get an accuracy equal to 83%. however, when i make predictions with my model on new data(images drawn by me), i get inaccurate detections all the time(except for letter X for some reason lol).
inputs = Input(shape=(width, height, 1))
x = Conv2D(filters=32, kernel_size=(3,3), activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.001))(inputs)
x = MaxPooling2D(pool_size=(2,2 ))(x)
x = BatchNormalization()(x)
x = Conv2D(filters=64, kernel_size=(3,3), activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.001))(x)
x = BatchNormalization()(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(filters=128, kernel_size=(3,3), activation="relu", kernel_regularizer=tf.keras.regularizers.l1_l2(0.001) )(x)
x = BatchNormalization()(x)
x = Flatten()(x)
x = Dropout(.5)(x)
outputs = Dense(NB_CLASSES, activation="softmax")(x)
model = keras.Model(inputs=inputs, outputs=outputs)
return model
this is how my model looks like. what could be the reason behind incorrect predictions? i preprocess data correctly so i have no idea the cause of the problem.
width = 28; height = 28 and NB_CLASSES = 47
little note: i trained the model on 15 epochs and i set the batch size to 16
apparently it works but i have to rotate the image counter clockwise by 90 like bruh 😭
Hey can someone help with my code, I have an issue, here's my code:
import numpy as np
from keras.models import load_model
from PIL import Image
import time
# model
model = load_model('model.h5')
noise = np.random.randn(1, 512, 512, 3)
delay_time = 30  # seconds
for i in range(10):
    generated_images = model.predict(noise)
    # convert the image array
    generated_image = Image.fromarray(np.uint8(generated_images[0]*255))
    # save the image
    generated_image.save('generated_image.png')
    # delay
    time.sleep(delay_time)
For some reason when I try to run it, it just generates a 1 by 1 white pixel image.
Thanks for any help!
hi guys. im new to programming i want to get into programming
i saw this video 1 month ago https://www.youtube.com/watch?v=WtEYMELvRHI&t=53s&ab_channel=AtleFjellangSæther
i searched many things about this and i want to make robots like this
in internet i understand i need to learn matlab ,python ,machine learning,arduino and raspberry pi
are they good to make one like this
shouldn't need matlab
arduino or raspberry pi, probably no need to learn both
it is a bit harder to use python on arduino than on raspberry pi
do not underestimate machine learning
but yes, it should be possible to do something like that with python + raspberry pi + machine learning, but it is by no means a simple project
(to clarify, by "do not underestimate machine learning" I mean, it is significantly deep - might be like three times harder compared to the other items you'd have to learn if you were to understand what each part of the system is doing with a reasonable depth.... though if you just stick with using it as a black box, copy pasting from some tutorial and editing without trying to understand what's happening under the hood, which tbh is perfectly fine, it might not be that bad)
you guys used langchain?
It is very elementary to find the related text part given a question, but what if the related text is related to another text, and essentially the text, hence all the relevant information, becomes too big to feed to AI models?
guys, any idea how i can perform exploratory data analysis on huge datasets without my analysis being biased? because if i reduce the dataset then my analysis starts to lose integrity
so i have this:
df = pd.concat([pd.read_csv('./Sales_Data/'+file) for file in listdir('./Sales_Data')])
df.dropna(how='any')
df['Price Each'] = pd.to_numeric(df['Price Each'])
which causes this error:
ValueError: Unable to parse string "Price Each"
could someone please explain why this is happening
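One possible cause, guessed from the error text only: some of the concatenated CSVs contain a repeated header row, so the literal string "Price Each" ends up as a value in that column. Separately, df.dropna(...) returns a new frame and the snippet above discards it. A sketch with a made-up frame that imitates both issues:

```python
import pandas as pd

# Made-up frame imitating concatenated CSVs with a repeated header row and a blank row.
df = pd.DataFrame({
    "Product": ["USB cable", "Product", None, "Monitor"],
    "Price Each": ["11.95", "Price Each", None, "149.99"],
})

# dropna returns a new frame; assign it, otherwise nothing changes.
df = df.dropna(how="any")

# errors="coerce" turns strings that are not numbers into NaN instead of raising.
df["Price Each"] = pd.to_numeric(df["Price Each"], errors="coerce")

# Drop the rows that failed to parse (the stray header rows).
df = df.dropna(subset=["Price Each"])
```

Before coercing, it is worth looking at the offending rows, for example df[df["Price Each"] == "Price Each"], to confirm that they really are stray headers and not something else.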
hey guys how we evaluate a recommendation system
Hello, I am trying to run DiffMorph (https://github.com/volotat/DiffMorph/) on my mac. However I ran into some error
To run this program I tried running it by doing the following commands:
pip install -r requirements.txt
python morph.py
However, while installing the packages I ran into an error: the required numpy version needed a higher Python version.
So I switched from Python 3.6 to Python 3.10.
I reran the same commands, but got a new error while running pip install -r requirements.txt:
```
ERROR: Could not find a version that satisfies the requirement tensorflow==2.9.1 (from versions: none)
ERROR: No matching distribution found for tensorflow==2.9.1
```
How do I install tensorflow without getting this error?
Just wanted to let you know that I have tried out the LDA visualization and this is what I came up with. I think I know why the points keep coming closer, the features within each image is similar so that's why there is an overlap of the points. That makes sense right?
this is my code
```py
plt.plot(neigh, acc_score, marker="o", markeredgecolor = 'black', markerfacecolor = 'red')
plt.xlabel("Number of neighbors")
plt.ylabel("Accuracy score")
```
and this below is my graph. How can i add the annotation on the markers only or make the x axis scale a bit more detailed?
You can do plt.xticks(range(51), range(51)) @lapis sequoia
that's great, thank you.
Can also do plt.grid() so you can see where it aligns
pandas dataframe, use it as an object or like a dictionary?
wdym by "as an object"? dicts are also objects
just because series can be accessed with dataframe["header"] and dataframe.header
was wondering if one is preferred
do you mean dataframe.head? those do different things
df["..."] always works, df.... sometimes works; so in "real" code, one might want to prefer first
but latter is easier to type, so there's that
the latter only sometimes works because of name clashes, e.g., if you have a column named "info" or "sum", it would fail and defer to the corresponding attributes
if you have nonvalid Python identifiers as a column name, e.g., one with spaces in it, it will fail too
so in short, IMHO, prefer df["..."] except in one-off quick trials on frames you perform
df["..."] has also the big advantage of conveying you are selecting a column immediately
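A small sketch of the two failure cases described above, assuming pandas is installed. The column names are made up to trigger each case.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [1.0, 2.0],
    "sum": [10, 20],          # clashes with the DataFrame.sum method
    "unit cost": [0.5, 0.7],  # not a valid Python identifier
})

by_key = df["sum"]             # always the column
by_attr = df.sum               # the DataFrame.sum method, not the column

price_attr = df.price          # works: valid identifier and no clash
unit_cost = df["unit cost"]    # a name with a space is only reachable with brackets
```

Attribute access also cannot create a new column: assigning to df.new_name sets a plain attribute on the object, while df["new_name"] = ... adds a column.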
if i have a linear regression model and do a residual plot after prediction, and in this plot i see x:y pairs of x:-x, how can i adapt the model to reduce this phenomenon?
So it predicts the y value too low, and this error is proportional to x? @young granite
i would want something like this:
but i get a linear trend
i did check the IDs of the "outlier" sets for all targets an see that if they follow this linear trend for one feature they do it often for more
i wouldnt say its proportional
i think its just not good represented by the model
The residuals just show you the error for different x values, which is what the top plot shows. I don't see any linear trend in that, nor do I see why you would want to fit a linear regression model on the residuals.
for some targets i got fewer datasets
u misunderstood me i guess, i do a LR on my dataset 2k data, from that i get 50 features and want to generate 10 targets
these pics are just examples
the green dots go from -x:x through 0
I'm sorry, I don't think I can help
thanks for trying, did u understand the problem now or do i need to rephrase more?
this is helpful, thanks
can you explain a little more? we do linear regression to find 50 parameters that explain your 2000 data examples; how many dimensions do the data observations have each?
i got a dataset of 2k datasets, each containing 800x2 datapoints which are reduced to 50 features to predict 10 targets
and you're doing linear regression on those 50 features? or?
yes
ok, so you reduce it to 2000 samples, each one being a vector of 50 features. and you wanna do regression on that. what are these targets you want to predict, and how many parameters are you using in the linear regression?
m/z values, i currently run simple LR from scikitlearn without any parameters
i'm still not understanding what you're trying to do, sorry
where am i losing u?
i'm trying to figure out the size of the matrix we're working with, but i'm not sure what you're calling a feature here
dataset: 2k
1 set contains a df of shape: 800row x 2col
each set is reduced to: 50 features
target are: 10 m/z values
Wait what
and you wanna do a different regression for each of the 2k datasets?
What's that 800 x 2 in a set
so i got 2000x50 features for one LR model
spectroscopy data
and what are you passing to sklearn's LinearRegression.fit()?
ok. so the 2000x50 array of wavelet coefficients, and whatever this m/z is
then the linear regression learns 51 parameters, ok
in the plot you showed above, what is this "fitted value" you put on the x axis?
the predicted m/z?
yyes
residual is: real-pred
so when x:-x occurs the model has problems
and i want to figure out a way to improve that other than to switch model
what does your notation x:-x mean
x:y coordinates
ok
that does indeed indicate model mismatch and not noise
you can try to whiten the data before doing linear regression
but if the relationship between the data is not linear, no amount of preprocessing will help
could it be due to lack of certain m/z values
so that it will decrease over time
cause thats my assumption
probably not tbh
if a m/z is in 90% its good represented but not for 10%
mhhh
i came to this conclusion cause i tried a simple CNN aswell which resulted in a similar trend
so Series.count() isn't what i was expecting, whats the trick?
need to collect the number of "Black" in a series
groupby count
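Series.count() only counts non-null entries, it doesn't match a value. A minimal sketch of the usual alternatives, assuming a hypothetical series called colors (the real one wasn't posted):

```python
import pandas as pd

# Hypothetical example data; the real series is not shown in the chat.
colors = pd.Series(["Black", "White", "Black", None, "Red"])

non_null = colors.count()            # number of non-null entries, not matches
n_black = (colors == "Black").sum()  # number of rows equal to "Black"
per_value = colors.value_counts()    # counts for every distinct value

print(non_null, n_black)
print(per_value)
```

The boolean comparison plus sum() is enough for a single value; value_counts() is handier when you want all categories at once.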
I'm being asked to find a 'nice looking' representation of how close one number is to one another, preferably condensed to a 0-1 or 0-100 scale (it's not a group/population, just a series of A and B data points that are unrelated to other A and B data points, so cannot apply normalization).
So when A=100 and B=100, the score is like 1 or 100. But when A or B is different, they want to know the 'distance' so to speak, without the sign. So e.g. A=50, B=100 = 0.5 or 50 score. But if A=1000000 and B=100, then the score would be 0.000x or similar. Anyone has any ideas? My stats background is in life sciences so I'm a bit lost here 😅
Hello everyone, I need help with understanding the use case of Shapley values.
First of all, is it possible to calculate record-level Shapley values? (record in the sense of an individual observation used for training or testing the model)
Does the imbalanced datasets problem concern only the target variable? Like if i have an imbalanced feature, do i have to deal with it?
Sorry, i meant categorical features.
Ok so that would be a problem only if the imbalance isn't really representative of the population i guess?
Well, you could take |A-B| and apply to it any function that maps [0, ∞) to [0,1). For example, arctan (times a constant).
atan(|A-B|)*2/π would be 0 for A=B, and approaches 1 as |A-B| approaches infinity.
the logistic function (aka the sigmoid, 1/(1+exp(-x))) is another choice for the function. Though that one is 1/2 at 0, so you'd have to rescale it.
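A rough sketch of both ideas, in case it helps. closeness_atan flips the arctan mapping so that equal values score 1; the scale argument is a made-up tuning knob, not something from the discussion. closeness_ratio is a different technique (min/max ratio) that happens to reproduce the A=50, B=100 -> 0.5 example, and it assumes both values are positive.

```python
import math


def closeness_atan(a: float, b: float, scale: float = 1.0) -> float:
    """1.0 when a == b, approaching 0.0 as |a - b| grows.

    `scale` is a hypothetical tuning constant: larger values make the
    score drop faster for the same absolute difference.
    """
    return 1.0 - math.atan(scale * abs(a - b)) * 2.0 / math.pi


def closeness_ratio(a: float, b: float) -> float:
    """min/max ratio for positive inputs: 1.0 when equal, 0.5 for (50, 100)."""
    if a <= 0 or b <= 0:
        raise ValueError("the ratio score assumes both values are positive")
    return min(a, b) / max(a, b)


print(closeness_atan(100, 100))
print(closeness_ratio(50, 100))
print(closeness_ratio(1_000_000, 100))
```

The arctan version depends on the absolute difference, the ratio version on the relative one, so which is "nicer" depends on whether 1 vs 2 should count as close as 1000 vs 1001 or as close as 1000 vs 2000.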
thank you! going to read in to the mapping ability rn
can somebody help me with this
Heyy
Hi @lapis sequoia
following this tutorial:https://www.geeksforgeeks.org/disease-prediction-using-machine-learning/ and in code ```py
# Training the models on whole data
final_svm_model = SVC()
final_nb_model = GaussianNB()
final_rf_model = RandomForestClassifier(random_state=18)
final_svm_model.fit(X, y)
final_nb_model.fit(X, y)
final_rf_model.fit(X, y)

# Reading the test data
test_data = pd.read_csv("./dataset/Testing.csv").dropna(axis=1)
test_X = test_data.iloc[:, :-1]
test_Y = encoder.transform(test_data.iloc[:, -1])

# Making prediction by taking the mode of the predictions
# made by all the classifiers
svm_preds = final_svm_model.predict(test_X)
nb_preds = final_nb_model.predict(test_X)
rf_preds = final_rf_model.predict(test_X)
final_preds = [mode([i, j, k])[0][0] for i, j, k in zip(svm_preds, nb_preds, rf_preds)]

print(f"Accuracy on Test dataset by the combined model: {accuracy_score(test_Y, final_preds)*100}")

cf_matrix = confusion_matrix(test_Y, final_preds)
plt.figure(figsize=(12,8))
sns.heatmap(cf_matrix, annot = True)
plt.title("Confusion Matrix for Combined Model on Test Dataset")
plt.show()
``` i get error y contains previously unseen labels: 'Fungal infection' why is that? my dataset contains the label
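One thing worth checking for an "unseen labels" error is whether the label strings in the two files really match character for character. The chat doesn't confirm the cause, so this is only a guess: stray whitespace is a common culprit. A small sketch with made-up label columns standing in for Training.csv and Testing.csv:

```python
import pandas as pd

# Hypothetical stand-ins for the label columns of the two CSV files.
# Note the trailing space in the first training label.
train_labels = pd.Series(["Fungal infection ", "Allergy", "GERD"])
test_labels = pd.Series(["Fungal infection", "Allergy", "GERD"])

# Labels present in the test data but never seen when the encoder was fitted.
# repr() makes invisible whitespace show up in the printout.
unseen = sorted(set(test_labels) - set(train_labels))
print([repr(label) for label in unseen])

# Strip whitespace on both sides before fitting / transforming the encoder.
train_clean = train_labels.str.strip()
test_clean = test_labels.str.strip()
unseen_after = set(test_clean) - set(train_clean)
print(unseen_after)
```

With a fitted sklearn LabelEncoder, the same comparison can be made against encoder.classes_ instead of the training column.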
if anyone is interested in DE https://www.linkedin.com/posts/benjaminrogojan_data-engineering-study-guide-outline-make-activity-7059166721567297536-rBDL
Help please
i removed loss.backward, and removes shuffle, passed same sample at test and train time, BUT LOSS comes out to be DIFFERENT. HOWWWWWW?
This model is f**king with me, i am fed up.
I'm sorry that this is frustrating for you. if you want help, you might get it if you show the code (not as a screenshot).
It happens, i figured it out, it was the random crop that was making the difference.
One more thing, i noticed my model was not improving. I made some changes to a model(added a new backbone and a transformer encoder) and before that it was training as expected.
Considering Implementation is not an issue, can you please tell me what to try?
few more observation:
loss is decreasing during training
loss isnt decreasing during testing
sometimes test accuracy reduces
Should i try changing hyperparameters? if so, which?
I mean...if you removed the loss.backward() during training, the loss won't decrease because you're not backpropagating the gradients.
Isn't the idea to train the model on train samples, then evaluate it on test sample to check how things are going?
Its loss won't decrease during test. The loss in the test section will only decrease after a train section.
Also, the "sometimes test accuracy reduces", I suppose the cause for that might be similar to why sometimes, after a batch iteration (or even an epoch), the loss might increase instead of decreasing. It's just the stochastic nature of gradient descent. The model might accidentally be optimized into a worse point (or, for the accuracy, towards being overfit), but then fix that afterwards.
Hi everybody
I was wondering if you can recommend good exercises for scipy, numpy, matplotlib including algorithm development
not sure about scipy, but for numpy+matplotlib you could try implementing some simple algorithms like k-means clustering
Does anyone know of any good math courses focused around data science/ML principles?
I've heard about https://www.deeplearning.ai/courses/mathematics-for-machine-learning-and-data-science-specialization but haven't tried it myself so cannot really vouch for it
Ty
there are also 3blue1brown's videos on youtube
For what purpose?
learning math?
they have some very nice visualisations
ah, yeah that was in response to Whip, not to you
hello guys where can i find some ai and ml project to work on it
You can check on Kaggle
@willow quest @tidal bough Here's what chatgpt said which is nice and simple i guess:
One option could be to calculate the ratio between the two numbers and then scale it to a 0-1 or 0-100 range. For example, if A=50 and B=100, the ratio is 0.5, which can be scaled to a 50 out of 100 score or a 0.5 out of 1 score. Similarly, if A=1000000 and B=100, the ratio is 10000, which can be scaled to a 0.0001 out of 1 score or a 0.01 out of 100 score.
Another option could be to take the logarithm of the ratio between the two numbers, which would compress the range of values and make it easier to compare across different magnitudes. For example, if A=50 and B=100, the logarithm of the ratio would be -0.301, which could be scaled to a 30 out of 100 score or a 0.3 out of 1 score. If A=1000000 and B=100, the logarithm of the ratio would be 4.605, which could be scaled to a 0.046 out of 1 score or a 4.6 out of 100 score.
Ultimately, the choice of method would depend on the specific requirements of the task and the preferences of the stakeholders involved.
could use a good video covering pandas if anyone has suggestions
I'm using pyenv with virtualenv. I have a ROCm GPU and i'm running the command on pytorch.org/get-started/locally.
i've set --no-cache-dir and ensured it's pulling from indexes in the proper order. it's downloading the linux manylinux wheels and then downloads the nvidia_cuda_cu11 wheels anyways.
does it just download that anyways? or am i not specifying things correctly?
man... 100days course has me using pandas before numpy, all the resources for pandas seem to suggest that i'm doing this out of order...
I bookmarked this, thankyou
He does have an equally good playlist for numpy as well btw.
Yeah I'm going to run through these exercises and just type what im told for now, i'll probably go through those before I start my DS course
learning pandas first is fine. the main hurdle with learning either is to not write for loops.
i am making a model to classify an image into 5 classes
network = input_data(shape=input_shape)
network = conv_2d(network, 32, 3, activation='relu')
network = max_pool_2d(network, 2)
network = conv_2d(network, 64, 3, activation='relu')
network = max_pool_2d(network, 2)
network = fully_connected(network, 128, activation='relu')
network = dropout(network, 0.5)
network = fully_connected(network, 5, activation='softmax')
network = regression(network, optimizer='adam',loss='categorical_crossentropy', learning_rate=0.001)
can i use anything else rather than regression for output layer ?
Is it run program or is just example?
this is the model i am using to train the model, but my teacher is saying regression is used for prediction not classification
How can I get both my labels to show?
# Plot the data using different colors for each class
fig, ax = plt.subplots()
scatter = ax.scatter(features_2d[:, 0], features_2d[:, 1], c=labels)
plt.legend(loc='upper right', labels=['Container', 'No_Container'])
# Set the title and show the plot
ax.set_title("LDA Visualization")
plt.show()
This only shows one label and not the other.
What does your labels variable contain?(the one used for c)
easy way to convert first tensor to second?
currently i do: sims = sims.reshape(2, 3*2).t().view(2, 3, 2)
Two classes 0 and 1. 0 is container and 1 is no_container.
arr.reshape((2,2,3)).transpose((1,2,0)) would do, for one.
is it faster also? or just cleaner?
who knows, probably about the same
ok, i am gonna use your, looks better
got no overload error for transpose
probably it's done slightly differently in torch than in numpy, maybe .transpose(1,2,0) or whatever
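A quick check that the two index manipulations agree, done in NumPy on a made-up 12-element input (the original tensors were only shown as an image, so the data here is an assumption):

```python
import numpy as np

# Hypothetical input; any array with 12 elements works for the check.
sims = np.arange(12)

# Original approach: reshape to (2, 6), transpose, then reshape to (2, 3, 2).
a = sims.reshape(2, 3 * 2).T.reshape(2, 3, 2)

# Suggested approach: reshape to (2, 2, 3), then reorder the axes.
b = sims.reshape(2, 2, 3).transpose(1, 2, 0)

print(np.array_equal(a, b))

# In PyTorch, Tensor.transpose swaps exactly two dims, which explains the
# "no overload" error; the multi-axis equivalent is
# sims.reshape(2, 2, 3).permute(1, 2, 0).
```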
the crux of the issue is that you have only called ax.scatter once (plt.legend seems to be matching against all the plots you have plotted so far in a best-effort manner, seeing as you only plotted once, there can only be one legend entry)
can you try using two ax.scatter, one for the individual classes on their own, so two in total
Like this?
# Plot the data using different colors for each class
fig, ax = plt.subplots()
scatter1 = ax.scatter(features_2d[:, 0], c=labels)
scatter2 = ax.scatter(features_2d[:, 1], c=labels)
# 0 is container and 1 is no container
plt.legend(*scatter.legend_elements(), loc="upper right", title="Classes")
# Set the title and show the plot
ax.set_title("LDA Visualization")
plt.show()
no, you need to select rows where the corresponding label is 0 and plot and then repeat (subbing 0 with 1), do you know how to do that?
Yeah so currently it is getting all the rows of column 0.
Which is one of my classes.
each column is a class?
0 is indeed one of your classes, but plotting just features_2d[:, 0] is not going to work. you haven't specified the y argument, also think about what is features_2d[:, 0], did it really select all rows of class 0?
Yeah it does not, it prints the first element in the array.
-1.2610931 , 0.05956531], dtype=float32)```
"first element in the array" is not quite correct.
it's printing the red box, the first "column" of the array
I get all with just features_2d[:]
0.17381935, 0.69571465],
[-0.68409383, 1.4792389 , 0.50605595, ..., -1.1951295 ,
-0.1751789 , 0.39030302],
[-1.2169716 , 0.4897305 , 0.03768349, ..., -0.6940949 ,
-1.2008945 , 0.4508954 ],
...,
[-1.0407516 , -0.3269264 , 0.41814092, ..., -1.038974 ,
-1.4509413 , 1.5210139 ],
[-1.2610931 , 0.33272272, 0.88745356, ..., -1.881851 ,
-1.0341158 , 2.5147882 ],
[ 0.05956531, -0.46555987, 1.7835644 , ..., -0.74892455,
-2.190432 , 1.3660964 ]], dtype=float32)
what you want is something like this
assuming all the red box is where the corresponding class is 0
"where the corresponding class is 0" meaning in the label array, the corresponding entry is 0
e.g. the 3rd element in the array is 0 for the first red box
Yeah so 0 and 1 for the respective images, so it should show as [0] in the red boxes?
i didn't really understand what you meant there, mind elaborating?
You said in the red box is where the corresponding class is, so it will show as [0,0], like that?
what shape is your data? i think ry means the data is of size N x 3 and the 3rd column is the class
i was operating under the assumption you have two arrays, a 2D array for the features, and a 1D array for the labels/classes
Yeah I have it as a list.
this is what i meant re. corresponding entry (i didn't highlight all 1s obviously)
Yeah yeah that
Like that is how I have my labels:
[0 0 0 0 0 0 0 0 0 0]```
For each image.
Now I understood what you mean.
you lost me at the word image hmm
but do you know how to select those rows with label 1?
ok that also works. you can make an array of indices based on the labels, and use those to index the rows
For each image, I am having a label of 1 and 0 respectively. That's what I mean.
maybe something like
indices = labels == 0
features_2d[indices, :]
indices is a boolean array with True for rows with a label of 0, and we can use that to weed out the class 0
something similar can be done for the class 1
here's a MWE
!e
import numpy as np
data2d = np.random.normal(size=(4,3))
labels = np.array([0,1,0,0])
print(data2d)
print(labels)
indices = labels == 0
print(data2d[indices, :])
@wooden sail :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | [[-0.59429971 -0.92207244 0.48890026]
002 | [ 0.22585973 -0.70344427 -0.46252381]
003 | [ 0.46275675 1.01961049 -1.97198535]
004 | [ 0.11588553 0.40670295 0.83123672]]
005 | [0 1 0 0]
006 | [[-0.59429971 -0.92207244 0.48890026]
007 | [ 0.46275675 1.01961049 -1.97198535]
008 | [ 0.11588553 0.40670295 0.83123672]]
Okay yeah that helps bring out the class 0 but what is indices for then? To make it include only 0?
to make it include only class 0
Yeah class 0, okay.
what ry has been saying all this time is that your plot is not split by classes, from what i understand. and they're trying to help you do that
Yeah I realized that when I was checking array that it was not showing correctly.
So I gotta redo my scatterplot.
Anyone familiar with neural networks here?
@boreal gale Going back to your statement earlier, what did you mean by two ax.scatter?
it smells slightly to not have legend defined where your scatter plot is though, hence i would prefer this
Yeah I was also getting confused with which label was what. But wait, I am just trying to fully understand this. So the features_2d_0 will contain all of the indices of class 0 right? Then in the first ax.scatter, what is that doing?
features_2d_0 will contain all of the indices of class 0 right
it contains all rows which are of class 0
Then in the first ax.scatter, what is that doing?
it plots a scatter plot, using the 0th column as x and 1st column as y, for all points of class 0.
Oh okay, I get it now. Understood well.
Thanks a lot. @boreal gale @wooden sail
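For anyone reading along later, a minimal sketch of the "one ax.scatter per class" approach discussed above. The random features and labels are stand-ins for the real LDA output, and features_2d_1 is a name made up here to mirror features_2d_0:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-ins for the LDA output and the 0/1 labels.
rng = np.random.default_rng(0)
features_2d = rng.normal(size=(10, 2))
labels = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 0])

# Boolean masks select the rows that belong to each class.
features_2d_0 = features_2d[labels == 0]  # rows of class 0 (Container)
features_2d_1 = features_2d[labels == 1]  # rows of class 1 (No_Container)

fig, ax = plt.subplots()
# One scatter call per class, each with its own label for the legend.
ax.scatter(features_2d_0[:, 0], features_2d_0[:, 1], label="Container")
ax.scatter(features_2d_1[:, 0], features_2d_1[:, 1], label="No_Container")
ax.legend(loc="upper right")
ax.set_title("LDA Visualization")
plt.show()
```

Passing label= to each scatter call keeps the legend text next to the data it describes, so the two can't drift out of sync.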
hi , i am currently learning arduino and matlab i want to make a robot like this https://www.youtube.com/watch?v=WtEYMELvRHI&ab_channel=AtleFjellangSæther. is arduino and matlab enough for this? and should i change matlab with python
guys does this data looks or seems to be white noise ?
you can check. the main properties are zero mean, constant variance, and uncorrelatedness
subtract the mean value and check whether those properties hold
i mainly think it is not due to the std, it seems to be increasing by time
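A rough way to check those three properties numerically, sketched with NumPy only. white_noise_summary and its chunk/lag defaults are made up for illustration, and the two test signals are synthetic:

```python
import numpy as np


def white_noise_summary(x, n_chunks=4, max_lag=5):
    """Rough diagnostics: overall mean, std per time chunk, and
    sample autocorrelation at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    # Constant variance: the std should be similar in every chunk.
    chunk_std = [chunk.std() for chunk in np.array_split(centered, n_chunks)]
    # Uncorrelatedness: autocorrelation should be near 0 for every lag > 0.
    denom = np.dot(centered, centered)
    acf = [np.dot(centered[:-k], centered[k:]) / denom for k in range(1, max_lag + 1)]
    return x.mean(), chunk_std, acf


rng = np.random.default_rng(0)
noise = rng.normal(size=4000)                        # synthetic white noise
growing = noise * np.linspace(1.0, 5.0, noise.size)  # std increases over time

for name, series in [("noise", noise), ("growing", growing)]:
    mean, chunk_std, acf = white_noise_summary(series)
    print(name, round(mean, 3), np.round(chunk_std, 2), np.round(acf, 3))
```

A std that climbs from chunk to chunk, like in the second signal, is the "increasing over time" pattern, and it rules out white noise even if the mean and autocorrelation look fine.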
Hello people,
Question for Dropout
Is spatialDropout used in bi-lstm? or just cnn?
Cuz so far it helped my model a lot
I am working on an object detection project to detect road defects and I already have the data.
There are main class and subclasses, for example one of the main class being cracks and the subclasses being multiple line crack, Hairline Crack , Block Crack and etc.
Right now we are trying out Grounding DINO on this but it is giving a lot of noise and detecting things that are not cracks.
Currently, I am stuck in terms of accuracy/mAP, and any tips/advice to approach this object detection task would be greatly appreciated. Since cracks tend to be similar, I feel simply just using an object detection model would not be enough for good performance
Here is an example of the classes:
- Cracks
---Transverse crack
---Longitudinal crack
---Multi crack
---Alligator crack
---Block crack
---Rigid pavement crack
- Potholes
---Wet pothole
---Pothole with cracks
---Dry pothole
I recommend starting with the Arduino Programming Language (the default it comes with), which is similar to C++. Make some projects with that, no machine learning. Then learn some Python to run on your PC, make some projects with that, no machine learning. Then, if you still want to get into machine learning, you will need to learn some mathematics and play around with some machine learning libraries in Python, there are many resources for that.
Make sure to read the documentation: https://docs.arduino.cc/learn/starting-guide/getting-started-arduino
Hello everyone, I am trying to build a multistep LSTM-DNN model to forecast gold prices
Mainly it is a regression problem with timeseries data
That's really cool what you're doing. Is it open source
What I am struggling to setup is the supervised dataset
have a look at TSfresh
Interesting, but I can see its more about feature extraction which is indeed part of the setup but not really what I am asking for
lets assume that for now, the only feature we have is the historical price
using the last 30 days of data to forecast the next 7 days
Yeah Tsfresh has transforms to make your dataset "rolling" as well
I see. Let me check it out
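If it's only about turning a single price series into supervised samples (last 30 days in, next 7 days out), a plain NumPy sliding window is enough. A minimal sketch; make_windows is a made-up helper, not part of tsfresh, and the price series is fake:

```python
import numpy as np


def make_windows(prices, n_in=30, n_out=7):
    """Build (X, y) where each row of X holds n_in consecutive prices and the
    matching row of y holds the n_out prices that follow them."""
    prices = np.asarray(prices, dtype=float)
    n_samples = len(prices) - n_in - n_out + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    X = np.stack([prices[i : i + n_in] for i in range(n_samples)])
    y = np.stack([prices[i + n_in : i + n_in + n_out] for i in range(n_samples)])
    return X, y


# Fake daily prices 0, 1, ..., 99 so the alignment is easy to eyeball.
prices = np.arange(100.0)
X, y = make_windows(prices)
print(X.shape, y.shape)
print(X[0][-3:], y[0])
```

For an LSTM the inputs usually need a trailing feature axis, e.g. X[..., np.newaxis], and the train/test split should be done by time rather than by shuffling the windows.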
If python has a hard time with computationally demanding tasks, why is it popular for AI development?
all of the computationally demanding parts are handled by libraries like numpy or pytorch, which can do the operations in a way as efficient as they would be if done in C/C++
ofc, that requires using these libraries and writing code in a way that makes good use of the features they offer though
furthermore, when you add GPUs / TPUs to the mix, doing it in python and letting pytorch/tensorflow optimise how to make good use of the GPU/TPU can make it orders of magnitude faster
Okay thanks
so I'm just starting with pandas and have encountered some discrepancies between what the tutor is using and how the docs propose elements are extracted. Specifically, the tutor will grab a "cell" (my wording) and wrap it in int() to achieve element extraction, but the docs would have you write x_value = row_data.loc[row_data.index[0], 'x'] vs the tutor's x_value = int(row_data.x)
and I'm getting the sense that using the tutors approach is overlooking an important aspect of data manipulation
My question, then, is there a right and wrong way? why?
One reason is the abundance of data science/ai libraries
Have you heard about the programming language Mojo? It's a superset of python, meant for high performance AI operations. I think it came out yesterday
I said came out, but it was just announced as a new project :D
which "tutor" specifically are you referring to?
some website/tutorial or a real person that is teaching you?
its the 100days course, angela yu
haven't seen it myself but yeah, without more context, the way they're doing it doesn't make much sense
the two pieces of code you showed are for completely different purposes though
in x_value = row_data.loc[row_data.index[0], 'x'], row_data shouldn't be an actual row (pandas.Series), but rather a data frame (pandas.DataFrame)
its for used in a dataframe state, x coor, y coor
in x_value = int(row_data.x), row data should be a series, and you're taking the x value out of it, then converting it from a pandas/numpy float or integer to a python integer
my code for row data is row_data = state_coors[state_coors["state"] == guess]
if state coors is a dataframe, then the second code should throw an error
!e ```py
import pandas as pd
df = pd.DataFrame({'state': [1, 2, 3]})
rows = df[df['state'] == 2]
print(rows.shape, rows, type(rows), sep='\n')
she may have converted it to a series, i haven't watched the video since yesterday
@agile cobalt :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | (1, 1)
002 | state
003 | 1 2
004 | <class 'pandas.core.frame.DataFrame'>
maybe i need to be more specific
her method of extracting an int was just completely different than what i got from docs
there are a bunch of different ways to do more or less same thing tbf
like, these two would give you the same result:```py
df.iloc[0, 0]
df.loc[df.index[0], df.columns[0]]
in case df = pd.DataFrame({'A': [10]})
df.loc[0, 'A']
yeah no doubt I realize im kind of splitting hairs
the biggest difference would be the type of the number you're getting out of that though
I guess i'm concerned that im not seeing something core to data analysis with python
they're both ints, here's my complete code
import turtle
import pandas as pd
screen = turtle.Screen()
screen.title("State Test")
image = "blank_states_img.gif"
screen.addshape(image)
turtle.shape(image)
state_coors = pd.read_csv("50_states.csv")
state_names = state_coors["state"].tolist()
print(state_names)
def get_mouse_click_coor(x, y):
print(x, y)
def user_answer():
return screen.textinput(title="Guess the states", prompt="Guess a state")
while True:
guess = user_answer()
if guess in state_names:
row_data = state_coors[state_coors["state"] == guess]
x_value = row_data.loc[row_data.index[0], 'x']
y_value = row_data.loc[row_data.index[0], 'y']
print("x_value:", x_value)
print("y_value:", y_value)
state_pointer = turtle.Turtle()
state_pointer.shape("circle")
state_pointer.color("black")
state_pointer.penup()
state_pointer.goto(x_value,y_value)
turtle.mainloop()```
is the difference that 1 is numpy int and the other is core python?
pretty much
would that ever matter?
turtle might complain if you pass a numpy number to it, not sure
eh, never mind
I was trying to demonstrate something but that something shouldn't really matter to you
I don't want to waste your time with something that's unimportant 99.9% of the time
will be helpful to keep in the back of my head, ty
back to the topic... as long as it works, you can just assume that it is a stylistic choice from the author
oftentimes you'll see like 5+ methods to do the same thing
and as for which one is the "right" way, usually you should stick with the official documentation
yeah i get it, just growing pains, it took me 3 hours to get to the same outcome as row_data.x
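To make the numpy-int vs python-int point concrete, here's a small sketch with a made-up two-row stand-in for 50_states.csv (the coordinates are invented):

```python
import pandas as pd

# Hypothetical stand-in for 50_states.csv.
state_coors = pd.DataFrame(
    {"state": ["Ohio", "Texas"], "x": [176, -38], "y": [52, -106]}
)

# Boolean filtering returns a one-row DataFrame, not a scalar.
row_data = state_coors[state_coors["state"] == "Ohio"]

# .loc with a row label and a column label returns a NumPy scalar.
x_np = row_data.loc[row_data.index[0], "x"]

# Series.item() returns a plain Python scalar and raises if the
# Series does not contain exactly one element.
x_py = row_data["x"].item()

# Positional access followed by int() also gives a plain Python int.
x_iloc = int(row_data["x"].iloc[0])

print(type(x_np), type(x_py), type(x_iloc))
```

All three hold the same value. item() has the side benefit of failing loudly if the filter matched zero or several rows.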
Hi, I'm a novice programmer. I want to display a dataframe coming from a csv file using an object oriented programming method. Here I show the code.
"""
`import pandas as pd
import matplotlib.pyplot as plt
class csv2df():
def __init__(self):
df = pd.read_csv("RMS level.csv")`
Explanation please. Thank You
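One possible shape for this, as a sketch only. The key point is that df in the posted __init__ is a local variable, so it is gone once __init__ returns; storing it as self.df keeps it on the object. CsvFrame, show and the in-memory CSV below are made up for illustration:

```python
import io

import pandas as pd


class CsvFrame:
    """Loads a CSV once and keeps the resulting DataFrame on the instance."""

    def __init__(self, source):
        # Store the frame on self so other methods can use it; a plain local
        # variable named df would disappear when __init__ returns.
        self.df = pd.read_csv(source)

    def show(self, n=5):
        """Print the first n rows."""
        print(self.df.head(n))


# Hypothetical in-memory CSV standing in for "RMS level.csv".
demo = io.StringIO("time,rms\n0,0.12\n1,0.15\n2,0.11\n")
frame = CsvFrame(demo)
frame.show()
```

With a real file the call would be CsvFrame("RMS level.csv"), since pd.read_csv accepts either a path or a file-like object.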
Is this a day in the life of a data scientist? They collect data and create a chart for it?
that would be the easiest day in the life of a data scientist.
Are you a Data Scientist?
I'm a computational linguist.
I think you would probably want to change "data" to "science" to better reflect the complexity of the occupation
or title the chart with "Science" perhaps
skills required to become a data scientist
Is that a question? I think advanced statistics is one requirement, calculus maybe
Realistically though, I have heard that there's not much standardization, there's a lot of cross-over between data analyst, product analyst and data scientist
no like i meant programming langs
python or r, and sql
alright ty
I wanted to run a question by a few... I have an opportunity to buy a friends computer for ML/data analytics work...
- AMD Threadripper 3970x
- MSI RTX 3080 10G
- 128GB (4x 32GB) DDR4 3200MHz
- Gigabyte TRX40 Motherboard
- 970 Evo Plus NVMe 500GB
- 970 Evo NVMe 500GB
- Corsair HX1200 PSU
- Noctua NH-U14S
For $2350 USD -- how much value is in the 3970X now? Only concern for the above is locked into TRX40 which won't work with zen3 etc. PC part picker of the above tells me its $5700 but not sure how accurate that is.
Not sure if this is a good deal given it is a used condition
Fair call
One factor is that I'm located in the south pacific on an island called Australia.
how can i find out why pip refuses to install a package? i'm seeing the result when i run pip_search tensorflow-rocm but i can't run pip install tensorflow-rocm it shows 'no matching distribution'
i think it may be a python version thing
GitHub
Using Python 3.7, the command: $ pip3 install --user tensorflow-rocm attempts to download and install the following file: https://files.pythonhosted.org/packages/d6/b5/f14d48711d276f2391c944ecda0bc...
yes, I had a similar issue
Does anyone have an idea of why the predicted results look like this? They seem to be flipped along the horizontal axis as well as scaled down? (I've used LSTM)
i m having this error can anyone help me to resolve this i googled about it and asked gpt4 and got to know it cuz of version mismatch of PyQt and opencv but idk which is the suitable version for opencv and PyQt
does anyone has any solution of this problem ?????
how many of a, b, c (s) are there?
is it 15? or are there 15 unique combinations of this
some are both a and b?
so there could be multiple procedures applied to the same patient(row)?
something like this seems good
if u have 1 million rows, should be fine to have 15 more cols
another way is to one hot encode the combination of procedures, which will be more than 15
Yeah don't think there ie
@barren otter@agile hearth help
Anyone got any resources on how to make dashboards look better? 😅
Design, colours etc
After taking the difference in timeseries data I got negative values, can I change them to positive?
while we can of course tell you how to make it positive from a data manipulation perspective, i think it would be beneficial to know why do you want to change it to positive as that might not be the correct thing to do in the grand scheme of things (grand scheme of things as in the overall goal of completing your coursework/delivering value at work).
I take the diff to make the data stationary, and I resample the data to hourly from the 5 min timestamps. Now I'm just asking, should I change it to positive or not?
ah, i obviously lack caffeine in my body to read properly, heh.
no, you shouldn't make it positive.
and 1 more question: when I run the KPSS and ADF tests to check whether the data is stationary, KPSS says the data is stationary and ADF says it is not. after taking the diff, KPSS says the data is not stationary and ADF says it is.
Like both tests give contradictory answers @boreal gale
u malaysian?
Me??
yeye
Nope
where u from
Pakistan
ah
Yep
so to recap you have only taken the first order difference, and performed ADF and KPSS test on the differenced time series?
have you looked into whether you have trend in your data?
Has anyone had any experience with image adaptive GAN reconstruction?
Does anyone have any experience with image enhancement using cv? I have a project including ocr and the video quality is poor and needs enhancement and I have tried everything I could find and I am really lost
Yes, the data has daily seasonality and also an increasing trend. Yes, I've taken only the first order difference, but it makes the data stationary
What are some examples of Data Science work?
Do they predict the sales of products for a company or something?
you might want to de-trend it first? then ADF and KPSS should hopefully agree with each other
I want to make the data stationary so I can check the ARIMA model performance on our data
ARIMA or SARIMA
I couldn't say but i'm sure you could google some examples of data science
- data collection: collect data from various sources, including databases, APIs, and the web.
- data cleaning: raw data is often noisy and incomplete - part of the job is to preprocess and clean the data to remove outliers, fill in missing values, and standardise data formats.
- exploratory data analysis (EDA): using visualisation tools and stats to explore and summarise the data, identify patterns (in a primitive way(?)), and detect outliers.
- feature engineering: creating new features from the existing data in order to enhance the predictive power of models.
- model development: use machine learning algorithms/traditional stats models to build predictive models that can make somewhat accurate predictions on unseen new data.
- model evaluation: test and evaluate the performance of models to ensure their accuracy (among other evaluation metrics) and effectiveness.
- deployment: data scientists deploy the model into production environments to make data-driven decisions.
- monitoring: data scientists monitor the model's performance over time, detect any anomalies or changes in data patterns (e.g. population drift), and update the model accordingly.
depending on the organisation you work for, your focus might vary among these 8 tasks.
i assume most DS won't be collecting data/deploying their own model.
also, people always say cleaning the data takes 80% of your time, and making models takes 20% of your time.
i personally think this is just a meme, it depends on your org and your task really.
Hey @boreal gale do you mind if I ask a question about that problem that you helped me solve to find the error between two trajectories where you used DTW?
sure, what's up?
if you remember, we had the trajectories and one of them was shifted in time so dtw helped to find the closest distance. But is there a way to find the most optimal time shift that minimizes the error?
Referring to this, as the real trajectory (red) seems to be shifted to the left and it would be nice to find what's the most ideal shift so both of them can be considered to be happening at the same time
In this example is quite clear but there are other cases where it's not so easy to tell where to shift one of the sequences
But is there a way to find the most optimal time shift that minimizes the error?
just so we are on the same page, what is time shift?
is it a constant shift in time? or a dynamic shift in time?
because DTW assumes there could be a dynamic shift in time, as in time 1 in series A could be matched to time 1+4 in series B and time 2 in A could be matched to time 2+10 in B (the +4 and +10 there is not a constant)
just a constant time, or at least that's my approach to what I think is a good way of representing them on the same time axis and considering the best alignment between both of them
oh! if it's constant time, then i don't think DTW is a right approach. i need to re-read the paper to fully confirm this though...
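For a constant shift, one simple option is a brute-force search over candidate lags, keeping the one with the lowest mean squared error on the overlapping part. This is a sketch under assumptions the chat doesn't confirm: both trajectories are 1-D (e.g. latitude over time), sampled at the same rate, and best_constant_shift is a made-up helper name.

```python
import numpy as np


def best_constant_shift(intended, real, max_shift):
    """Return (shift, mse) minimising the MSE between intended[t] and
    real[t + shift], evaluated only where the two series overlap."""
    intended = np.asarray(intended, dtype=float)
    real = np.asarray(real, dtype=float)
    n = min(len(intended), len(real))
    best = None
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            a, b = intended[: n - shift], real[shift:n]
        else:
            a, b = intended[-shift:n], real[: n + shift]
        if len(a) == 0:
            continue  # no overlap left for this lag
        mse = float(np.mean((a - b) ** 2))
        if best is None or mse < best[1]:
            best = (shift, mse)
    return best


# Synthetic example: the "real" signal is the intended one delayed by 5 samples.
t = np.linspace(0.0, 4.0 * np.pi, 200)
intended = np.sin(t)
real = np.roll(intended, 5)

shift, mse = best_constant_shift(intended, real, max_shift=20)
print(shift, mse)
```

The shift comes out in samples, so multiply by the sampling interval to get seconds. For 2-D positions the same loop works with the squared Euclidean distance per time step.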
because what I am doing, is considering these trajectories as time series and then make a model, that passing it the intended trajectory as input, can sort of predict what could be the real trajectory that will follow from the things that it has learned from the data. Do you think that's possible?
i see, you are making a "real trajectory synthesiser"
give the thing the intended trajectory, and you can expect a simulated "real trajectory" from it
i don't see anything that screams "no" to me here - but there is a very real possibility that this is one of those "unknown unknown" thing to me, i don't know what could fail until i attempt it
fair! I've been trying to use neural networks. Like LSTM for now but the results aren't as I was expecting :(
I mean it's possible as there are a lot of papers doing so but I can't get my head around how do they do it
The best I could do is something like this (green is the prediction, orange is the real truth, and blue is the intended)
(of course the prediction matches closely the intended trajectory but the problem is time once again)
from the description of your task, LSTM sounds sensible as a component in your NN.
also there is this concept of GAN, i wonder if you can incorporate it into your NN somehow
(it's important to note my knowledge to NN is almost purely academic, i haven't done any "real" NN development)
if i just look at this by itself, i would argue your NN hasn't learned enough to give you a decent predicted real trajectory 🤔
are you feeding it intended trajectory as features and the unshifted real trajectory you have observed as targets?
I'll take a look at it! I mean i have this as a thesis to deliver soon and I'm not a computer scientist or something so my skills are quite low (that's why you see me asking all the time here! thanks for saving me!) and having to learn about NN is just tough
so what i've tried here is so wrong but I was desperate to see something happen. This has just been trained with some concatenated latitude-positions from different flights over time. That's it, just a column and the time stamp. But the thing is that I can't seem to train it with the real data as then it has to become an input which I don't have for future intended flights
it's either that or I have understood that wrong. As far as I know they usually take previous information to predict the future, but if I feed it all the data, then it asks for the input to be the same size and I don't have things such as the error of a trajectory that hasn't been flown
since you have a real world understanding of your problem, you can potentially think about why the real trajectory is different to the intended trajectory, then come up with features that correspond to the underlying cause of deviation to intended trajectory. and feed those into the NN
e.g. is it about to turn a super tight corner? is it at max acceleration? i don't actually know the physics of drones so this is just me guessing.
it's mainly that the drone doesn't follow exactly the intended trajectory in terms of time. Say that it has to reach a turn at time=10 but the drone arrives at time=12, then somehow the drone speeds up and gets to the next point at time=20 instead of the intended time=18. This leads to the most error. Of course the error is higher in the turns, as we could see from the dtw thing that you did! i marked the highest error points in the trajectory and these were the turns
these drones not perfectly adapting to the time marks of the intended path makes the trajectories very different and I thought that shifting them would make sense
oh, you might want to normalise your input first, as i think having the raw lat lng will impede the NN from learning.
instead of latlng pairs like (62,15), (63,16)
you might instead want (0,0), (1,1)
because that path is virtually the same as (100,100),(101,101) (which normalises to the same (0,0), (1,1) pair), but your NN won't see it that way unless you normalise
i know it's not actually the same because of the way that lat lng works, but it's close enough. also >90 for both of them is obviously wrong lol
(in 3D they look veeery similar though as we ignore the time variable)
Nice explanation. I guess data scientists were useful for predicting the rise in COVID-19 cases.
I've done something like that with a Scaler, which puts it between 0 and 1 but I suppose it is not the same, right?
not quite the same
what do you mean that the NN won't see it?
"won't see it that way"
as in behaviour located at (62,15), (63,16) and (100,100),(101,101) could be treated differently
no it doesn't, if you are using scaler
consider if we only have those two "paths"
it (the scaler) would normalise to something like
(0,0),(0.1,0.1) and (0.99,0.99),(1,1)
which is not the same as treating both paths as (0,0), (1,1)
so (62,15), (63,16) becomes (0,0), (1,1). What does (100,100),(101,101) become?
both should be (0,0), (1,1) imo
basically your input should be what's the trajectory relative to where you started, not the trajectory of actual absolute position on earth
then it is like subtracting the initial point from the rest of the points. Otherwise all the trajectories would have so many points in common?
then it is like subtracting the initial point from the rest of the points.
bingo
otherwise your NN could be learning different things for flights that happen in a different starting location
but my flights all happen in the same area
then that effect is somewhat reduced
plotting this makes all the flights' trajectories overlap
hmm? that's not a problem if we are just using that as training data for NN?
You mean doing that?
ermm.. i guess?
basically you want to take each trajectory - i assume it's 2D np array (call it path), and subtract the starting position
path_deltas = path - path[0]
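A quick sketch of the difference between the two normalisations, using the two example paths from above. The variable names are made up, and the hand-rolled min-max scaling is only a stand-in for whatever scaler was actually used, so treat it as an assumption.

```python
import numpy as np

# The two toy "paths" from the discussion, as (lat, lng) rows.
path_a = np.array([[62.0, 15.0], [63.0, 16.0]])
path_b = np.array([[100.0, 100.0], [101.0, 101.0]])

# Relative to the starting point: both paths become the same thing.
deltas_a = path_a - path_a[0]
deltas_b = path_b - path_b[0]
print(deltas_a)
print(deltas_b)

# Min-max scaling fitted on all points together (assumed scaler behaviour):
# the shape is squashed into [0, 1] but the absolute location is still encoded.
all_points = np.vstack([path_a, path_b])
lo, hi = all_points.min(axis=0), all_points.max(axis=0)
scaled_a = (path_a - lo) / (hi - lo)
scaled_b = (path_b - lo) / (hi - lo)
print(scaled_a)
print(scaled_b)
```

With the deltas, two flights that move the same way look identical to the network no matter where they started; with the scaler alone they still sit at opposite ends of the range.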
My assumption was that this could be treated as a time series
oh
why though
each individual flights should be treated as one time series "observation" imo
u want to forecast so u dont have to do new measurements?
yes makes sense now that you say it. Then the question is that we should also have the real trajectory in the same "observation" too, right?
not sure what you mean here
It's for the development of drones: having the intended path, we need to know what it will "really" look like when the drone flies it, so problems can be avoided
I've truly spent several months just collecting and cleaning data (clinical trial) 
for the NN to learn, it kind of needs to associate the intended trajectory with the real one so it learns what differs from one to another? like providing the input (intended) and output (real)
heh, there are always exceptions 😉
cleaning data is an iterative process 😄 (always coming back to it Q_Q)
I guess it depends on the domain but the 80/20 thing is probably true if you're in an applied science kind of domain where you're responsible for designing the study, collecting data, ...
if u get data from someone else the struggle begins 😄
Yup, agree but for some reason we did the lump sum in the beginning
omg. that reminds me... dealing with census data always gives me a headache 😡
hahahaha
this leads to something like that
Last time I dealt with census data the results were so bad I contacted our government and they said "oopsie we made a mistake, sending you the new dataset in a minute."
government be like : shits on fire yo
The entire thing was a joke. We had a large cross-section of a significant % of the population. Yes they hashed some fields but I'm pretty sure if you tried you could de-anonymize most people. (We obviously did not because this is illegal.)
FBI open the door
don't be shy, share the dataset :)
Nope. I deleted it after the work was done as you should ❤️
that's what I wanted to hear, good job man
OHH, i think i see the issue, i think you might be using LSTM slightly wrong 🤔
the more NN-literate folks can probably comment.
i think you are stitching paths like that because you aren't connecting hidden state of the LSTM units to output units?
i think it's possible to connect hidden state of the LSTM units to output units and have these output units predict your "real" path
What is your task @muted crypt ?
this sounds good but I'd be lying if I said that i know how to do this
in your input to LSTM there should never be real trajectory, the simulated real trajectory should be generated by these output units and compared against the real real trajectory.
I've been weeks with no progress :(
Can you explain what you want to do?
oh this changes so many things
I mean in the training dataset there has to be the real trajectory, marked as the label (y)? at least that's what I have
You want to predict the trajectory of an object through time given a starting position?
so input should be ur expected drone trajectory and that output should be compared with ur real trajectory
I have 2 datasets: one of drone intended trajectories and another of the real trajectories (already flown following the intended ones). From this, I have to develop a model that, given an intended trajectory, predicts what the real one will be
someone else can probably guide you better, since i only have academic experience with NN.
posting your current attempt, potentially with some example data, would really maximise traction.
anyhow, i gotta get back to the fun task of scraping data for now 👋
bs4 ftw
given the whole intended trajectory
from the green one (intended path) as input, predict the red (real flown trajectory) one
this is misleading
so predicted is real?
u only want to compare pred with real wont u?
not compare
So you have 3 input sequences (intended) and your output needs to be 3 output sequences (flown)?
literally provide the green plot and get the red one (or similar)
ah lel
u didnt build a model just yet?
so its green(input) and red(target)
but u didnt do calc. just yet?
Input essentially is a sequence of points (3D points and time stamp) and then the output should also be (3D points and time)
Do you expect the sequences to be independent of each other or not? E.g., the intended altitude has an effect on the flown longitude?
I've done an LSTM model, but it fails
ideally that would be good yeah!
did u check that?
this is probably more illustrative
indeed 😄
(being lat,long,alt)
here the fit is way better than in ur other example, is it the same ?
i'd like to start without taking this into consideration, as it probably is an extra hard step, but I'd love to do that
From my understanding you have a quite normal RNN set-up where you map intended -> flown
u can even choose to give a 3d array into the NN
You should start by sanity checking your model - use a training set of size 1 and get 0 loss
yes, see that in the 2D we have time, while in 3D we don't. So in 3D we don't see the time shift (which is what makes the 2D plots look so different)
so drone is just slow 🗿 😄, jokes aside try what zestar suggested and maybe vary the input's dimensions
haha facts
red just looks like its lagging behind green
can you elaborate more on that?
and theres some kind of 'startup' time
For whatever DNN I'm making I always try to overfit a single example as proof that my architecture is bug-free
yes but not always
so that would be a single flight?
Yes
for a POC yes
so the model kinda represents the quirks of this particular drone flying
certain delays, route simplification etc it may have
I've done that
What is blue and what is yellow?
that's exactly the idea!
yellow is the real (true) and blue is what it has predicted from the training data
reverse engineering the algo in the drone? 😄
haha thats what im thinking also
Did you train it on just one flight?
yes
If you're not hitting 0 loss on 1 flight there's something wrong
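A minimal sketch of that sanity check, assuming PyTorch. Everything here is hypothetical: the `TrajectoryLSTM` class, the layer sizes, and the synthetic flight (the "flown" path is just the intended one delayed by 3 steps, not real drone data). The LSTM only ever sees the intended path as input; a linear layer on the hidden state at each time step produces the predicted flown path, and that prediction is compared against the real one in the loss.

```python
import torch
from torch import nn

torch.manual_seed(0)

class TrajectoryLSTM(nn.Module):
    """Hypothetical model: intended path in, predicted flown path out."""

    def __init__(self, n_features=3, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_features)

    def forward(self, intended):
        hidden_states, _ = self.lstm(intended)  # (batch, time, hidden)
        return self.head(hidden_states)         # (batch, time, n_features)

# One made-up flight with 50 time steps and 3 coordinates, shape (1, 50, 3).
t = torch.linspace(0, 1, 50)
intended = torch.stack([t, torch.sin(6 * t), 0.5 * t], dim=-1).unsqueeze(0)

# Assumed "real" flight: the same path, lagging 3 steps behind the plan.
flown = torch.roll(intended, shifts=3, dims=1)
flown[:, :3] = intended[:, :1]  # the drone sits at the start for the first 3 steps

model = TrajectoryLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

losses = []
for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(intended), flown)  # prediction vs. the flown path
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

If the loss on a single flight refuses to go towards zero in a setup like this, the problem is more likely in the data shapes or the wiring than in the amount of training data.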