#data-science-and-ml


arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

lapis sequoia
#
tweet_df['Neutral_count'] = tweet_df['sentiment'].apply(lambda x: 1 if x == 'Neutral' else 0)
tweet_df['Positive_count'] = tweet_df['sentiment'].apply(lambda x: 1 if x == 'Positive' else 0)
tweet_df['Negative_count'] = tweet_df['sentiment'].apply(lambda x: 1 if x == 'Negative' else 0)
tweet_df.head()
#

It's simply supposed to keep track of the number of positive, neutral and negative values in the sentiment column

serene scaffold
#

can you do print(tweet_df['sentiment'].head())?

lapis sequoia
past meteor
#

Will you endlessly tune on validation?

serene scaffold
lapis sequoia
#

sure

serene scaffold
#

Anyway, all you need to do is tweet_df['sentiment'].value_counts(). but you don't want to put that into tweet_df

#

because there aren't different sentiment counts for each row. it's counts for the whole dataframe.
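A minimal sketch of the point above, using a small made-up `tweet_df` (the sample rows are an assumption, not the real data): `value_counts()` gives one count per label for the whole column, so it lives in its own Series rather than in a new column of `tweet_df`.

```python
import pandas as pd

# Made-up sample standing in for the real tweet_df
tweet_df = pd.DataFrame(
    {"sentiment": ["Positive", "Neutral", "Positive", "Negative", "Neutral", "Positive"]}
)

# One count per distinct label, computed over the whole column
counts = tweet_df["sentiment"].value_counts()
print(counts)
```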

lapis sequoia
#

what about if i want to store cumsum

#

would
```py
tweet_df['positive_sent'] = (tweet_df['sentiment'] == 'Positive').cumsum()
tweet_df['negative_sent'] = (tweet_df['sentiment'] == 'Negative').cumsum()
tweet_df['neutral_sent'] = (tweet_df['sentiment'] == 'Neutral').cumsum()
```

#

to store the running sum

serene scaffold
#

try it and see

#

tweet_df['sentiment'].eq('Positive').cumsum() -- this notation would also work. I think it looks cleaner, but that's just me.

lapis sequoia
#

it didnt work ill try yours

serene scaffold
#

it won't have a different result

past meteor
#

Yeah but was this on your first go?

lapis sequoia
serene scaffold
vague briar
#

how do i get whatsapp seen data? With Beautiful Soup or Selenium

lapis sequoia
#

odd.

#

well i cant post a screen shot to show my df head but its just all zeroes

serene scaffold
vague briar
#

But I can't

past meteor
#

Worst case scenario, you've tuned to solve your validation set, haven't you?

#

Yes but so long as you don't have completely other data you don't know if that'll be consistently lowering

lapis sequoia
#
{'id': [1651643206474317824, 1651643242595729408, 1651643261784563719, 1651643301555064835, 1651643310782533632], 'user_name': ['MAKS_Diogenes', 'crypto__wire', 'eurexcoinLTD', 'BitcoinCourant', 'spaziocrypto'], 'text': ['Im thinking about making 50 seed phrases - place 10000 sats on each - then I will engrave the phrases on metal with… https://t.co/SR5Tdq6sY1', 'Crypto Winter Is Over |  Mass Bitcoin Adoption | Meme Coins Continue To ... https://t.co/rd1K6tb6sZ #BABYPEPE #memecoins #Bitcoin', '1: Bitcoin price is $29293.55 (0.66% 1h)\n2: Ethereum price is $1906.26 (0.44% 1h)\n3: Tether price is $1.00 (-0.01%… https://t.co/qCEQZgdptU', 'An easy way to run your own Bitcoin node is by using Bitcoin Core\nhttps://t.co/NyAtlHbWTB', 'Pufff, banking system gone... #bitcoin https://t.co/zQSjQls5qx'], 'sentiment': [' Positive', ' Positive', ' Neutral', ' Neutral', ' Positive'], 'positive_sent': [0, 0, 0, 0, 0], 'negative_sent': [0, 0, 0, 0, 0], 'neutral_sent': [0, 0, 0, 0, 0]}
past meteor
#

Like it might, but it might not

lapis sequoia
#
```py
tweet_df['positive_sent'] = tweet_df['sentiment'].eq('Positive').cumsum()
tweet_df['negative_sent'] = tweet_df['sentiment'].eq('Negative').cumsum()
tweet_df['neutral_sent'] = tweet_df['sentiment'].eq('Neutral').cumsum()
print(tweet_df.head().to_dict('list'))
```
my code for this
lapis sequoia
lapis sequoia
#

😦

#

I found the issue is that there was a space before the sentiment

#

'sentiment': [' Positive', ' Positive', ' Neutral', ' Neutral', ' Positive']
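A short sketch of one way to handle the leading space, reusing the five `sentiment` values shown above: strip the whitespace first, then compare. The variable name `clean` is just for illustration.

```python
import pandas as pd

# The five sentiment values from the printed dict, leading spaces included
tweet_df = pd.DataFrame(
    {"sentiment": [" Positive", " Positive", " Neutral", " Neutral", " Positive"]}
)

# ' Positive' != 'Positive', so remove surrounding whitespace before comparing
clean = tweet_df["sentiment"].str.strip()

tweet_df["positive_sent"] = clean.eq("Positive").cumsum()
tweet_df["negative_sent"] = clean.eq("Negative").cumsum()
tweet_df["neutral_sent"] = clean.eq("Neutral").cumsum()

print(tweet_df.to_dict("list"))
```

Assigning `tweet_df["sentiment"] = clean` as well would make the fix permanent, so later comparisons don't hit the same problem.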

tawdry flint
#

is there a website to learn machine learning, which is free for Students?

past meteor
#

An Introduction to Statistical Learning is the book I'd recommend. It's free, but the code examples and labs are in R. A Python version is coming out soon.

wooden sail
tawdry flint
#

Thanks

lapis sequoia
#

has 10 courses, so you apply 10 times

wooden sail
#

indeed, but it's free

#

beggars can't be choosers

agile cobalt
#

there are also fast.ai's and sklearn's courses, but like anything 100% free they have no certificates

edgy jacinth
#

when it comes to my model rewriting its own pathways and creating neurons how do I go about that

naive peak
#

are there good libraries to convert somewhat complex XLSX files into json?

#

was it like pandas or something?

hushed wave
#

hi guys

#
# Set the path of the image folder
image_folder_path = "/content/gdrive/MyDrive/video_frames3"

# Define the list of emotions to detect
emotion_labels = ["neutral", "happy", "sad", "surprise", "angry", "fear", "disgust"]

# Create an empty DataFrame to store the emotion data
emotion_df = pd.DataFrame(columns=["Image"] + emotion_labels + ["Dominant Emotion"])

# Loop through the images in the folder
for image_filename in os.listdir(image_folder_path):
    if image_filename.endswith(".jpg") or image_filename.endswith(".png"):
        # Load the image using DeepFace and check if a face is detected
        image_path = os.path.join(image_folder_path, image_filename)
        detected_faces = DeepFace.extract_faces(image_path)
        
        if len(detected_faces) == 0:
            # If no face is detected, skip to the next image
            continue
        
        # Perform emotion detection using DeepFace
        emotions = DeepFace.analyze(image_path, actions=['emotion'])
        dominant_emotion = DeepFace.analyze(image_path, actions=["dominant_emotion"])
        
        # Append the emotion data to the DataFrame
        emotion_data = {"Image": image_filename}
        for label in emotion_labels:
            emotion_data[label] = emotions["emotion"].get(label)
        try:
            dominant_emotion_label = dominant_emotion[0].get("dominant_emotion")
        except:
            dominant_emotion_label = "None"
        emotion_data["Dominant Emotion"] = dominant_emotion_label
        emotion_df = emotion_df.append(emotion_data, ignore_index=True)

# Save the emotion data to a CSV file
emotion_df.to_csv("emotion_data.csv", index=False)

Any changes recommended for this because atm it's giving me this error

#

TypeError: list indices must be integers or slices, not str
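One likely cause, stated as an assumption since the failing line isn't shown: recent DeepFace releases return a list from `DeepFace.analyze`, with one dict per detected face, so `emotions["emotion"]` indexes a list with a string. As far as I know, the dominant label is reported under `"dominant_emotion"` in that same dict when `actions=['emotion']` is used, so a second `analyze` call shouldn't be needed. The helper below is hypothetical and only handles the result shape; the sample result is made up so the sketch runs without DeepFace installed.

```python
def first_face_emotions(analysis):
    """Return (scores, dominant_label) for the first face in an analyze() result.

    Accepts either a list of per-face dicts (assumed newer behaviour)
    or a single dict (assumed older behaviour).
    """
    first = analysis[0] if isinstance(analysis, list) else analysis
    return first["emotion"], first.get("dominant_emotion", "None")


# Made-up result in the assumed list-of-dicts shape
fake_result = [{"emotion": {"happy": 91.2, "sad": 1.3}, "dominant_emotion": "happy"}]

scores, dominant = first_face_emotions(fake_result)
print(scores.get("happy"), dominant)
```

In the loop above this would be used as `scores, dominant = first_face_emotions(DeepFace.analyze(image_path, actions=['emotion']))`. Separately, `DataFrame.append` was removed in pandas 2.0; collecting the `emotion_data` dicts in a list and building the DataFrame once at the end avoids it.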

edgy jacinth
#

Does anyone have a pre trained model to a certain extent I can use as a baseline for mine? Or willing to help me make one

edgy jacinth
#

working ai assistant, just need some baseline

serene scaffold
serene scaffold
# edgy jacinth call me stony hark!!!

I will not do that. what is your motivation for wanting to create an AI assistant, and do you have any (fairly specific) examples of what you want it to do?

edgy jacinth
serene scaffold
edgy jacinth
#

im workin on it parental figure believe in me!

round kettle
#

Any advice for someone making a career pivot from civil engineering to the data sector? Finished my M.S. and have experience with computational modeling and HPC but want to score a role as a junior level data engineer, analyst, or some sort of developer in this industry. I know I have a lot to learn and I’m excited to do so, but just need some guidance on how to get started. Willing to DM LinkedIN or resume for context.

slim lance
round kettle
#

@slim lance never thought of that, that’s a really clever idea!

lapis sequoia
#

Can someone who's fluent with pyspark & data handling please ping me? I need to ask how to handle 120GB worth of data in a json file using pyspark. I need to clean that data and then put it into my mongoDB, but I don't understand how I'd read so much data with pyspark.
So if someone's familiar with pyspark please ping / dm would be better. It would be a major help

slim lance
rugged comet
#

Code

df = spark.read.option("header", True).csv(*[f"/FileStore/tables/deck_data_{num}.csv" for num in range(500000, 2500001, 500000)])

Error

ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near '/': extra input '/'(line 1, pos 0)

== SQL ==
/FileStore/tables/deck_data_1000000.csv
^^^

What is going on here? I'm not using a SQL cell, I'm using a Python cell. This is in databricks by the way.

#

I'm trying to read multiple csv files at once by unpacking a list of file paths.
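A probable explanation, offered as a sketch rather than a confirmed diagnosis: `DataFrameReader.csv` takes `path` as its first parameter and `schema` as its second. Unpacking the list with `*` passes the second file path as `schema`, and a string schema is parsed as DDL, which would explain a SQL parse error pointing at `deck_data_1000000.csv`. `csv` accepts a list of paths directly, so the list can be passed as one argument. The helper name `read_all` is hypothetical.

```python
# Build the list of paths once; no unpacking needed
paths = [
    f"/FileStore/tables/deck_data_{num}.csv"
    for num in range(500000, 2500001, 500000)
]


def read_all(spark, csv_paths):
    """Read several CSV files into one DataFrame (spark is an existing SparkSession)."""
    # The whole list goes in as the single `path` argument
    return spark.read.option("header", True).csv(csv_paths)


print(paths)
```

In a Databricks notebook, where `spark` is already defined, this would be called as `df = read_all(spark, paths)`.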

slim lance
#

Anyone have experience running Athena queries from python?

#

(I’m wondering if it’s worth the extra step)

lapis sequoia
#

I have a dataset with data about 50 stores in US. I need to predict revenue for all of them at once. Dataset looks like this: 10 lines for 1 store, 15 for second, 35 for third etc. Which model should I use?

#

I have one Y and three X

cold osprey
#

what is lines?

#

rows of data or?

lapis sequoia
#

Rows in csv

cold osprey
#

im guessing one per year or

lapis sequoia
#

One per month

cold osprey
#

u need to be clearer on what data u have

#

and u want to predict next month?

lapis sequoia
#

Preferably upcoming 6 months

cold osprey
#

would probably start with linear regression

#

quite little data

lapis sequoia
#

But how I can do it for all stores at once?

#

I built a model in ARIMAX that is working only when I use dataset containing one store

cold osprey
#

yeah ig it doesnt rly make sense to do all stores at once

#

unless u mean the totality?

lapis sequoia
#

No, I need to see results separately

#

I know, but it’s a requirement from my boss

cold osprey
#

hes enforcing that u only use one model?

#

it doesnt make sense

#

one store can have 5x less revenue than another lol

#

ig u can merge ur data together with some column denoting which store it is

#

seems like a bad idea if stores can have quite different revenues

lapis sequoia
#

Or maybe I can do it with a loop? So I will be running predictions for one store only

cold osprey
#

sure

#

not familiar with arimax

earnest widget
#

Has anyone worked with T-SNE visualization? I am having a hard time trying to understand what it actually shows. I have two classes.

cold osprey
#

do u have a train bit and also a test bit?

lapis sequoia
#

Yes

cold osprey
#

i would use a diff model for each store

#

so u would run fit for each store
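A minimal sketch of the "fit once per store" loop. Everything here is an assumption for illustration: the column names (`store`, `month_index`, `revenue`), the made-up data, and the model, which is a plain linear trend from `np.polyfit` standing in for ARIMAX or whatever model is actually used.

```python
import numpy as np
import pandas as pd


def forecast_per_store(df, horizon=6):
    """Fit a separate linear trend per store and forecast `horizon` months ahead."""
    rows = []
    for store_id, group in df.groupby("store"):
        group = group.sort_values("month_index")
        x = group["month_index"].to_numpy(dtype=float)
        y = group["revenue"].to_numpy(dtype=float)
        slope, intercept = np.polyfit(x, y, deg=1)  # stand-in for the real model
        future = np.arange(x.max() + 1, x.max() + 1 + horizon)
        for month, value in zip(future, slope * future + intercept):
            rows.append(
                {"store": store_id, "month_index": int(month), "forecast": float(value)}
            )
    return pd.DataFrame(rows)


# Made-up data: store A has 10 monthly rows, store B has 15
data = pd.concat(
    [
        pd.DataFrame({"store": "A", "month_index": range(10),
                      "revenue": [100 + 10 * m for m in range(10)]}),
        pd.DataFrame({"store": "B", "month_index": range(15),
                      "revenue": [500 - 5 * m for m in range(15)]}),
    ],
    ignore_index=True,
)

forecasts = forecast_per_store(data)
print(forecasts)
```

The results come back in one table keyed by store, which may be enough to satisfy an "all stores at once" requirement even though each store gets its own fit.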

lapis sequoia
#

But is there any model that would fit my needs? Let’s assume that the stores are connected somehow and that using all of them for training will give better predictions

cold osprey
#

no idea

lapis sequoia
#

I will try to google that again

#

Thank you! ❤️

modern belfry
#
```py
def similarity_matrix_blocking_code(class_dataframe) -> ndarray:
    # using sklearn algorithms
    tfidf_vectorize = TfidfVectorizer(stop_words='english')
    anime_matrix = tfidf_vectorize.fit_transform(class_dataframe[DataframeColumns.COMBINED_FEATURES])
    return cosine_similarity(anime_matrix)
```
#

I have this code for basic similarity recommendations

#

the only problem is high cpu usage because 10-15k items are being processed here

#

which crashes my container on vps ;--;

#

any suggestions?
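One possible direction, sketched under assumptions: for 15k items the full dense similarity matrix is 15k x 15k floats, which is often what exhausts a small container. If only the top few neighbours per item are needed, the similarities can be computed in row chunks and reduced to indices straight away. `top_k_similar`, `k` and `chunk_size` are hypothetical names; the sketch relies on `TfidfVectorizer` L2-normalising rows by default, so a dot product between rows equals their cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_k_similar(texts, k=10, chunk_size=1000):
    """Return an (n, k) array of the indices of the k most similar items per row."""
    matrix = TfidfVectorizer(stop_words="english").fit_transform(texts)
    n = matrix.shape[0]
    k = min(k, n - 1)
    neighbours = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only a (chunk, n) block is dense at any one time
        block = (matrix[start:stop] @ matrix.T).toarray()
        # Exclude each item from its own neighbour list
        block[np.arange(stop - start), np.arange(start, stop)] = -1.0
        neighbours[start:stop] = np.argsort(-block, axis=1)[:, :k]
    return neighbours


texts = [
    "space pirate adventure",
    "space pirate comedy",
    "magic school romance",
    "school romance drama",
]
print(top_k_similar(texts, k=1))
```

Peak memory is then governed by `chunk_size` rather than by the number of items; the total CPU work is about the same, it is just spread over smaller pieces.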

mild dirge
#

!rule 6 @floral comet remove that message pls

arctic wedgeBOT
#

6. Do not post unapproved advertising.

deft robin
#

Hey folks, I'm looking for resources to build something in ML/AI with IoT sensor data. Any suggestions? Anomaly detection seems to be a common application. Any other applications? Currently planning to focus on Power consumption data or Temperature/Humidity data of a factory

cold osprey
#

time series prediction?

#

demand prediction maybe

#

where are the IoT sensors located

deft robin
#

Why can't I just set a range of values that are OK and if the values start going out of that range I could consider it an anomaly and send an alert?

tidal bough
#

If you have to set a range of values manually, that's not ML, that's just an if-statement 🙂

#

Anomaly detection is basically feeding a model a lot of normal data and having the model learn from it what the normal ranges for the values are.

deft robin
tidal bough
cold osprey
#

I think it's to alert before something goes wrong

tidal bough
#

Though since your data has timestamps attached, it'd be losing a lot of context to consider your data points independently: like, it might be that none of the measurements are weird on their own, but a specific sequence of several measurements in a row is anomalous. If you want to take that into account, you need anomaly detection on time series as shimmer mentioned, which I know little about (though apparently Azure has an implementation, https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/anomaly-detection)

cold osprey
#

Say too high of a humidity and the machine wont run properly

earnest widget
#

I am trying to do feature extraction and visualisation for an image dataset with classification in mind, does anyone have any ideas on any methods to go about this? So far I have used Kmeans clustering to highlight false positives and such but I am looking for other ways to visualize it.

cold osprey
glass estuary
#

Hi everyone! Could someone tell me if there are any repos, tools that I can use to generate images from text on windows with AMD gpu?

queen cradle
glacial iris
earnest widget
queen cradle
# deft robin Why can't I just set a range of values that are OK and if the values start going...

This is a more reliable solution than AI/ML. People sometimes think that AI/ML techniques are better because they're sophisticated. But they can also be more fragile.

If simply setting a range doesn't work, there is a lot of statistical literature on "statistical process control" or "statistical quality control". The material is quite classical at this point; it was developed starting about 100 years ago. It might give you a little more sensitivity while still remaining robust to ordinary variation.
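A simplified sketch of the control-chart idea: estimate the limits from a period of known-normal readings instead of choosing them by hand, then flag readings outside mean ± 3 standard deviations. The humidity values are made up, and textbook individuals charts usually estimate spread from moving ranges rather than the sample standard deviation used here.

```python
import numpy as np


def control_limits(baseline, n_sigma=3.0):
    """Lower and upper limits from known-normal readings."""
    baseline = np.asarray(baseline, dtype=float)
    center = baseline.mean()
    spread = baseline.std(ddof=1)  # sample standard deviation
    return center - n_sigma * spread, center + n_sigma * spread


def out_of_control(values, limits):
    """Boolean mask of readings that fall outside the limits."""
    lower, upper = limits
    values = np.asarray(values, dtype=float)
    return (values < lower) | (values > upper)


# Made-up humidity readings (%) from a period when the machine ran normally
baseline = [45, 46, 44, 45, 47, 46, 45, 44, 46, 45]
limits = control_limits(baseline)

new_readings = [45, 46, 52, 44]
flags = out_of_control(new_readings, limits)
print(limits, flags)
```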

queen cradle
earnest widget
queen cradle
#

This is labeled data?

earnest widget
#

Yeah it's images put into its respective class sub-folders; container and no_container.

#

I mean it's kind of my first time trying these visualisations out so that's why I am trying to figure it out.

queen cradle
#

It sounds like the features you extracted aren't powerful enough to distinguish the two classes.

earnest widget
queen cradle
earnest widget
deft robin
earnest widget
surreal solstice
#

Hey guys, long-time lurker here. I had a question about some pandas functionality.

#

Doesn't seem possible to groupby then aggregate a custom function over multiple columns?

#

Basically, I need to be able to define, in one groupby/agg statement, the sum & the division of two separate columns and save them as a new column, and then calculate the sum of other columns.

#

This way the sum & division, and the sums is/are done over the grouped variables.

boreal gale
#

could you give us a concrete example?

cold osprey
#

some people forget google exists sometimes

boreal gale
#

being snide doesn't really add to the conversation constructively 🙂

surreal solstice
surreal solstice
cold osprey
#

does each aggregation use the same group by columns?

#

or different

surreal solstice
#

Unfortunately not. They are different

cold osprey
#

then u can loop over them or smth

surreal solstice
#

Just a moment, I'm typing up an example so I can't respond, thanks

#
df = pd.DataFrame({'location': ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
                   'is_orange': [1, 1, 0, 0, 1, 0, 1],
                   'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
                   'melons':     [73, 81, 94, 174, 23, 71, 65]})
lapis sequoia
#

@serene scaffold @mighty patio Sorry for the late response. So I am designing a figure with the ezdxf library. I have seen that I can export that figure and save it, because ezdxf has this functionality implemented; I can do that through matplotlib. Now I also have to make a report that has that figure inside. I wanted to skip the step of saving the figure from my ezdxf script and load it into the report docx file. So in summary, I want to keep the figure in memory without saving it locally and then load it directly into the docx file.

surreal solstice
#

Alright, so given this DataFrame, what I'd like to do is something like this: sorry this is pseudocode

df.sort_values(['location']).groupby(['location']).agg(
    'total orange/non-orange' : df['is_orange'] + df['is_non_orange'],
    'percent_orange'          : df['is_orange'] / (df['is_orange'] + df['is_non_orange']),
    'sum_melons'              : sum(df['melons'])
cold osprey
#

seems like u shud be able to define custom agg functions to use

#

and/or combining columns

surreal solstice
#

Right, that's the idea. Basically, the requirements are forcing me to make this into one table. There are posts about custom agg functions for one column, but I unfortunately need this for multiple columns.

#

We are able to do this in legacy statistical software pretty easily, strangely enough.

cold osprey
#

u can make the sums first

#

and then do like `df['percent_orange'] = df['is_orange'] / (df['is_orange'] + df['is_non_orange'])` ?

mild dirge
surreal solstice
#

That's a great idea @mild dirge , I'll definitely look into it after this.

#

Thank you

surreal solstice
cold osprey
#
```
          is_orange  is_non_orange  melons  percent_orange
location
backyard          2              1     270        0.666667
bank              0              2     165        0.000000
store             2              0     146        1.000000
```
surreal solstice
#

But that is exactly the kind of output I am looking for @cold osprey , yeah

boreal gale
cold osprey
#

i mean i got something

#

but not sure what u mean by the weights and rolling up

boreal gale
#

oh i see, you just didn't add all the sums in, right?

surreal solstice
#

That's a really good question @boreal gale . The best I can understand it looking at this legacy code, we want a sum of our melons grouped by location. So the bank would have 165 melons, and so on. Then, the df['is_orange'] + df['is_non_orange'], you are correct, this is my bad and it is an abuse of notation. I'd like to sum up each value within the groups.

#

So if I was trying to explain it, it would be: for each individual, sum up all of the is_orange and is_non_orange values pertaining to that location, and present each sum in the groupby table.

#

I hope that makes sense !

cold osprey
#

yeah i think u can do it with sums and defining new columns based on those

boreal gale
#

understood, sorry if i am being pedantic

cold osprey
#
```
          is_orange  is_non_orange  melons  percent_orange  total orange/non-orange
location
backyard          2              1     270        0.666667                        3
bank              0              2     165        0.000000                        2
store             2              0     146        1.000000                        2
```
is my final output
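The code behind that table wasn't posted, so the following is a reconstruction of the "sum first, then derive new columns" approach described above, not the original. It uses the example DataFrame from earlier in the conversation.

```python
import pandas as pd

df = pd.DataFrame({
    'location': ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
    'is_orange': [1, 1, 0, 0, 1, 0, 1],
    'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
    'melons': [73, 81, 94, 174, 23, 71, 65],
})

# Step 1: one vectorised sum per location
summary = df.groupby('location')[['is_orange', 'is_non_orange', 'melons']].sum()

# Step 2: derive the remaining columns from the per-group sums
summary['total orange/non-orange'] = summary['is_orange'] + summary['is_non_orange']
summary['percent_orange'] = summary['is_orange'] / summary['total orange/non-orange']

print(summary)
```

Because the ratio is taken on the grouped sums, this stays correct even if the flag columns later hold counts other than 0 and 1.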
surreal solstice
#

not at all. and noted @cold osprey , I'll take a look

surreal solstice
#

In the meantime @cold osprey I will try to implement this. @boreal gale definitely let me know what you think.

#

Thanks everyone for your help

boreal gale
# surreal solstice Not at all, it was a great question. I was using badly-written pseudo code haha

!e i would use something like this, but it's worth remembering shimmer's point - precomputing anything where sensible on the global dataframe level (though i don't think there is any here)

import pandas as pd

df = pd.DataFrame({'location': ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
                   'is_orange': [1, 1, 0, 0, 1, 0, 1],
                   'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
                   'melons':     [73, 81, 94, 174, 23, 71, 65]})



def stats(df_subgroup):
    return pd.Series({
        'total_oranges': (df_subgroup['is_non_orange'] + df_subgroup['is_orange']).sum(),
        'melons': (df_subgroup['melons']).sum(),
    })


print(df.groupby('location').apply(stats))
arctic wedgeBOT
#

@boreal gale :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 |           total_oranges  melons
002 | location                       
003 | backyard              3     270
004 | bank                  2     165
005 | store                 2     146
surreal solstice
#

Ah, I didn't think to use apply or the custom function like this at all. So the Series object can in effect contain 2 series.

boreal gale
#

the returned series in the custom function describe one row at a time like this

cold osprey
#

not sure if there can be a speed up if u precompute sums then create new cols

#

may well be slower

#

i just deleted my code lel

surreal solstice
#

yeah I'm unsure of how the groupby is optimized.

#

@boreal gale do you think your answer is worth posting on SO?

boreal gale
#

there might very well be better solutions out there - this is just the way i prefer to do it
at the end of the day, if it solves a particular issue it's probably worth being posted 🤷

surreal solstice
#

Got it, I can post my question then

cold osprey
#

fair comparison or?

#

10x speedup

surreal solstice
#

interesting, let me try it

cold osprey
#

but need second dataframe or dropping old columns if new 'calculated' columns are placed in same dataframe

#

ah wait

#

lemme put the sum groupby in the same cell

#

unfair comparison

boreal gale
#

i would run benchmark on the actual problem instead of in this microbenchmark-esque way, but 10x does sound significant

surreal solstice
#

agree

#

I think they're both fair shakes at the problem

boreal gale
#

how did you get df2?

cold osprey
#

empty dataframe

boreal gale
#

could you give me code for doing that?

#

literally empty? no index no column?

surreal solstice
#

Ry, did you have a way to calculate the percentage oranges?

cold osprey
#

trynna figure out how to increase the loops and runs manually

#

numbers inconsistent across runs

boreal gale
#

could you post cell 17 as in here please?

cold osprey
#

sec running benchmark with more loops n runs

#

will send notebook

#

nvm cant send notebook LOL

#

best i can do

#

looks like 2x

#

lemme try with df2 declaration in the same cell

#

ah

surreal solstice
#
df = pd.DataFrame({
    'location' : ['backyard', 'store', 'bank', 'backyard', 'backyard', 'bank', 'store'],
    'is_orange': [1, 1, 0, 0, 1, 0, 1],
    'is_non_orange': [0, 0, 1, 1, 0, 1, 0],
    'melons': [73, 81, 94, 174, 23, 71, 65]
})

def stats(df_subgroup):
    return pd.Series({
    'total_oranges' : (df_subgroup['is_non_orange'] + df_subgroup['is_orange']).sum(),
    'percentage_oranges' : (df_subgroup['is_orange'] / (df_subgroup['is_non_orange'] + df_subgroup['is_orange'])).mean(),
    'melons': (df_subgroup['melons']).sum()
})

location    total_oranges    percentage_oranges    melons
backyard              3.0                  0.66       270
bank                  2.0                  0.00       165
store                 2.0                  1.00       146
cold osprey
#

no diff this time around

surreal solstice
cold osprey
#

uh

#
```
          is_orange  is_non_orange  melons  percent_orange  total orange/non-orange
location
backyard          2              1     270        0.666667                        3
bank              0              2     165        0.000000                        2
store             2              0     146        1.000000                        2
```
#

seems like i do

surreal solstice
#

yeah seems like it

#

Tell you what, I'll make an SO question here, and I'll send the link to you guys in a few

#

Both answers seem to work for me as well

cold osprey
#

ye its roughly doing the same thing

surreal solstice
#

I think your version is good but what do you think about memory overhead

cold osprey
#

not exactly sure how apply with custom function works under the hood

surreal solstice
#

yeah same

cold osprey
#

i think memory shud be about the same

#

using apply will 'delete' the old df/unneeded columns only when it returns the new one so

#

same thing as df/df2 and then manually deleting df

#

or using same df and then dropping the unneeded cols

#

haha i wanna up the runs and loops just to see how far i can push it

boreal gale
surreal solstice
#

In my case, it doesn't, but you've asked a good question. What if it did?

boreal gale
#

if agg('sum') doesn't give you sufficient information for your further aggregation then you are potentially stuck with apply

surreal solstice
#

Right. Also, it seems that if you need to do some sort of multiplicative thing like *, / , then you'd have to use the mean or median function to retrieve the correct value

#

Which kinda makes sense, it's sort of what shimmer is doing.

cold osprey
#

ah thats what the mean was for

#

i was wondering what it was there for but it worked

surreal solstice
#

yeah exactly. Basically it's kinda creating your new columns and broadcasting the same value to each subgroup, so it takes the 'mean' of the subgroup which is just all the same numbers
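A caveat worth checking with a small worked example: in the data above every row has `is_orange + is_non_orange == 1`, so the mean of the per-row ratios happens to equal the group's overall share. With other values the two differ, and the ratio of the group sums is the one that matches "oranges divided by total" for the group. The counts below are made up to show the gap.

```python
import pandas as pd

# Made-up counts (not 0/1 flags), chosen so the two calculations disagree
df = pd.DataFrame({
    "location": ["backyard", "backyard", "bank", "bank"],
    "is_orange": [1, 9, 2, 2],
    "is_non_orange": [3, 1, 2, 6],
})

# Mean of per-row ratios: every row gets equal weight
row_ratio = df["is_orange"] / (df["is_orange"] + df["is_non_orange"])
mean_of_ratios = row_ratio.groupby(df["location"]).mean()

# Ratio of per-group sums: every item gets equal weight
sums = df.groupby("location")[["is_orange", "is_non_orange"]].sum()
ratio_of_sums = sums["is_orange"] / (sums["is_orange"] + sums["is_non_orange"])

print(mean_of_ratios)  # backyard 0.575, bank 0.375
print(ratio_of_sums)   # backyard 10/14, bank 4/12
```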

#

@cold osprey , @boreal gale , does this sound right? [pandas]: I need to groupby on a column, then define multiple (including some custom) aggregation functions.

cold osprey
#

ye seems like a good way to do it

#

that way anyone can just modify the agg function and it will apply to everywhere it is used

surreal solstice
#

Yeah, agree

#

I will see if I can add that point too once I ask that question you guys answer

boreal gale
#

yeah that sounds sensible as a title, also i gotta go, have fun 🙂

surreal solstice
#

Posted, please answer and decide who will get the accepted answer. Thanks again so much for your guys's help

cold osprey
#

haha idet i have a stackoverflow account

surreal solstice
#

Got it, would be great if you could upload your answer. If not I can do that as soon as I finish up w work

#

And give you credit

cold osprey
#

haha its fine yeah u can upload my ans too

surreal solstice
#

done, I gave you credit as shimmer from the Python Discord server, I hope that is enough

cold osprey
#

i dont care much for credit but thanks

surreal solstice
#

Of course. Thanks so much for your help, not often you come across something like this

surreal solstice
past meteor
#

Any of you ever used transformers for multivariate time series analysis, if so how was your experience? I'm not sure how I feel about it since attention is permutation invariant. Not sure we have enough data for stuff like temporal fusion transformers either.

#

I'm not sure there's any merit to doing this at all - do people just apply them to time series because they are sequences?

rugged comet
#

Code

df = spark.read.option("header", True).csv(*[f"/FileStore/tables/deck_data_{num}.csv" for num in range(500000, 2500001, 500000)])

Error

ParseException: 
[PARSE_SYNTAX_ERROR] Syntax error at or near '/': extra input '/'(line 1, pos 0)

== SQL ==
/FileStore/tables/deck_data_1000000.csv
^^^

What is going on here? I'm not using a SQL cell, I'm using a Python cell. This is in databricks by the way. I'm trying to read multiple csv files at once by unpacking a list of file paths.

hollow crane
tranquil gust
warm jungle
#

Suppose I have an Nx2 array: e.g

In [92]: a
Out[92]: 
array([[3, 7],
       [2, 4],
       [0, 9]])

I want to treat each row as representing the start and end points of a consecutive sequence of elements in another array. How can I efficiently extract those sequences? Obviously they'll be different lengths, so the result can't be an array, but it could be a list of arrays. So e.g. if I have

In [98]: x
Out[98]: array([7, 1, 5, 2, 0, 4, 8, 9, 6, 3])

I want to get:

In [100]: r
Out[100]: [array([2, 0, 4, 8]), array([5, 2]), array([7, 1, 5, 2, 0, 4, 8, 9, 6])]

I can do it by looping over a and forming slices from each row, but is there something that doesn't involve looping in python?

tidal bough
#

tried a numba function - it's exactly as fast as the python implementation, probably because a list is involved.

rugged comet
# tranquil gust df = spark.read.option("header", True).csv(*[f"file:/FileStore/tables/deck_data_...

I don't think this works.
I'm having a new error in spark though. I'm trying to reformat some csvs.
Code

for num in range(500000, 2500001, 500000):
    path = f"/FileStore/tables/deck_data_{num}.csv"
    with open(path, "r") as f:
        reader = csv.reader(f)
        with open(f"deck_data_{num}_formatted.csv", "w", newline="") as f:
            writer = csv.writer(f, delimiter="|")
            for row in reader:
                writer.writerow(row)

Error


FileNotFoundError: [Errno 2] No such file or directory: '/FileStore/tables/deck_data_500000.csv'

I'm sure the files are in the dbfs. It seems like spark won't let me open them.

tranquil gust
rugged comet
tranquil gust
#

When you input path, couldn't you see the file? like this.

rugged comet
#

I'm using databricks community edition.

tranquil gust
rugged comet
#

The file system on databricks is a bit different than a regular hard drive.

next valley
#

Never used databricks but can you traverse whatever the file system they use?

tranquil gust
#

Did you check if the file is existing?

vagrant ginkgo
#

Anyone here super familiar with matplotlib?

rugged comet
rugged comet
warm jungle
wooden sail
#

you could probably rewrite the code that generates that list of lists to generate slices instead

tidal bough
#

my approach would probably be to rethink whether you really need that list of arrays. like, maybe you can use slices directly?

warm jungle
#

Not sure what you mean exactly, a slice, of itself, isn't the data - I need to get my hands on the data. The resulting arrays will be views, so no need to actually copy the underlying memory

tidal bough
#

I mean that maybe you can rewrite whatever function consumes this list to take an array of pairs instead, and take slices using that array, which can be numbified.

tranquil gust
# rugged comet Yes.

I haven't used databricks either, so I can't give you any more advice. Sorry.

rugged comet
warm jungle
tidal bough
#

Like, instead of having a function that takes a list of variable-length arrays, have a function that takes an (N,2) shaped array, and, inside the function, take slices using these pairs. That way, the whole thing can be numbified much better than creating a list of numpy arrays can be.
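A sketch of both options using the arrays from the question. `slices_from_bounds` is the plain loop: each slice is a view, so no data is copied. `segment_sums` is a made-up example of a consumer that takes the (N, 2) array directly and needs no Python loop at all; a per-segment sum is assumed only for illustration, since the real downstream computation isn't stated.

```python
import numpy as np

x = np.array([7, 1, 5, 2, 0, 4, 8, 9, 6, 3])
a = np.array([[3, 7],
              [2, 4],
              [0, 9]])


def slices_from_bounds(values, bounds):
    """List of views into `values`, one per (start, stop) row."""
    return [values[start:stop] for start, stop in bounds]


def segment_sums(values, bounds):
    """Sum of each [start, stop) segment, computed from prefix sums."""
    prefix = np.concatenate(([0], np.cumsum(values)))
    return prefix[bounds[:, 1]] - prefix[bounds[:, 0]]


r = slices_from_bounds(x, a)
print(r)                   # [array([2, 0, 4, 8]), array([5, 2]), array([7, 1, 5, 2, 0, 4, 8, 9, 6])]
print(segment_sums(x, a))  # [14  7 42]
```

Whether a loop-free form exists depends on what is done with each segment; reductions like sums fit the prefix-sum trick, while arbitrary per-segment work is where a numba function over the pairs would come in.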

narrow crane
#

hey

orchid sky
#

What is the most used function for AI in Python

next valley
orchid sky
#

Yes that go with that

next valley
#

Then again i guess a more suitable answer would be tensor multiplication lol

orchid sky
#

If I can ask, how do you get an idea of what to program with AI if you want to create a new tool

serene scaffold
#

even if you're starting with a pretrained model, you would still need a GPU to fine tune it for your use case. but you can get some free GPU compute at google colab.

#

@high iron what are you trying to do?

#

chatbots are not a good first project.

next valley
#

Honestly, just learn multivariable calculus and an intro to linear algebra and everything else is somewhat simple to understand

next valley
#

Training a 1 stack/layer transformer, which I am assuming you will be doing anyway because that's all everyone talks about these days, with hardware such as a 3080 takes about 30 minutes on a dataset of 51785 training, 1803 test and 1193 validation examples
In batches (mini batch) of 64
And 20 epochs
Each example has 86 features and we infer up to 81 labels max

The 3080 had its power draw capped to 85%

This should be faster with decode only because that's what everyone does, but it makes sense since the MHA confusion matrix for the encoder shows that it's fucking useless as you removed about half the trainable parameters

#

so technically you don't really need hardware, the problem is data, which everyone seems to forget, for some reason

vital cedar
#
LinearRegression.predict()

This returns a list with a bunch of decimal values when I used it on my x_test data, what are these numbers for and how can they be used? (Linear regression model from sklearn library)

serene scaffold
#

you have to know what x_test represents and what the model is intended to do to make sense of the output.

vital cedar
serene scaffold
vital cedar
serene scaffold
#

@vital cedar how did you represent the emails for the purposes of linear regression?

serene scaffold
vital cedar
#

They don't seem to be between 0 and 1

serene scaffold
vital cedar
thorn swift
thorn swift
vital cedar
#

I thought it would return 0 or 1

thorn swift
#

not if you use linear regression, you basically fitted a line on a graph

#

if you want a model like that you can use a random forest i guess, but it's basically the same thing, with the cutoff i wrote just built in
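To make the difference concrete, here is a small sketch on made-up data (one synthetic feature standing in for whatever the email representation was; the 0.5 cutoff is an assumed example value). `LinearRegression.predict` returns points on the fitted line, so they need a manual cutoff to become labels, while a classifier such as `LogisticRegression` returns 0/1 directly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-in: one feature, label 1 when the (noisy) feature is positive.
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

lin = LinearRegression().fit(X, y)
raw = lin.predict(X)                     # continuous values, not restricted to [0, 1]
lin_labels = (raw >= 0.5).astype(int)    # manual cutoff turns them into 0/1

clf = LogisticRegression().fit(X, y)
clf_labels = clf.predict(X)              # already 0/1
clf_probs = clf.predict_proba(X)[:, 1]   # probabilities between 0 and 1
```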

mint palm
#

if line 49 says both have the same dimensions, how does line 50 give no error on view while line 52 says RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
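The original code isn't shown, so this is only a guess at the mechanism: two tensors can report the same shape while having different memory layouts, and `view` only works when the strides allow it. A minimal reproduction using a transpose:

```python
import torch

a = torch.arange(6).reshape(2, 3)
t = a.t()                      # shape (3, 2), but the strides are swapped

same_shape = t.shape == torch.empty(3, 2).shape   # shapes can match...
contiguous = t.is_contiguous()                    # ...while the layout differs

try:
    t.view(6)                  # view needs compatible strides
    view_failed = False
except RuntimeError:
    view_failed = True

flat_a = t.reshape(6)              # reshape copies when it has to
flat_b = t.contiguous().view(6)    # or make it contiguous first, then view
```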

agile cobalt
mint palm
vital cedar
thorn swift
vital cedar
lapis sequoia
#

anyone familiar with pyspark here?

#

please i need guidance

#

i have a project which consists of a 100GB json file. i need to process it, clean it, and then insert it into MongoDB
can someone please guide me, im not actually sure what to do here

vital cedar
copper patio
#

I want to start learning machine learning but I am not sure which framework to use, any suggestions? I am thinking either Pytorch or Tensorflow...

surreal solstice
#

My advice is to learn PyTorch. It will save you headaches down the road, despite TensorFlow being easier to use OOB.

cold osprey
#

+1 pytorch

#

esp if ure on windows

surreal solstice
#

Also shimmer I was thinking. Your approach yesterday doesn't actually add significant space overhead because pandas doesn't create deep copies by default.

past meteor
#

I normally have a decorator laying somewhere that lets you compute an arbitrary function per group of your df

thorn swift
#

I just want more people to use tensorflow tbh

past meteor
#

Especially if you're working with tabular data that should be the place to start.

copper patio
hasty mountain
#

The degree of difficulty is something like:
Keras < Pytorch < Tensorflow

Keras is highest level API, you assemble your model like you assemble lego pieces.
Tensorflow is the lowest level, you have to do many things manually(not all, though)
Pytorch is the middle ground

mild dirge
#

I'd say tensorflow and pytorch are at the same level

#

They have a lot of overlap

#

Easiest to learn will def be keras, as it is just: import model -> fit model -> use model. But you don't learn a lot from it.

hot blade
#

when im feature engineering an imbalanced dataset, should i apply pca before or after resampling?

past meteor
#

It's fair to just do stuff as usual and then select your operating point manually by looking at ROC, PR, DET, ... based on your application

hot blade
#

one class is 95% of the dataset lol

#

so specificity and all that would be horrendous without resampling

past meteor
#

You need to compute all of those metrics over several operating points (decision thresholds)
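A sketch of what computing the metrics over several operating points can look like with scikit-learn. The data is synthetic (a 95/5 split to mimic the imbalance mentioned above), the model is an arbitrary choice, and the "keep recall at or above 0.8" rule is only an assumed example policy.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic 95/5 imbalanced data as a stand-in for the real dataset.
X, y = make_classification(n_samples=4000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# One (fpr, tpr) and one (precision, recall) pair per decision threshold.
fpr, tpr, roc_thr = roc_curve(y_te, scores)
precision, recall, pr_thr = precision_recall_curve(y_te, scores)

# Example policy (an assumption): the highest threshold that keeps recall >= 0.8.
ok = recall[:-1] >= 0.8
chosen = pr_thr[ok][-1]
```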

mild dirge
#

Otherwise the PCA is affected by the resampling

clear siren
#

hiii

#

is there anybody who is good with flask and python , I need a bit of help

serene scaffold
serene scaffold
#

@dapper flame this is the data science channel. kindly remove your messages from this channel and try in #web-development.

#

I see you also asked in #async-and-concurrency. that's fine. but please ask your question in only one channel, so that no one answers a question that was answered somewhere else.

dapper flame
#

Ok no problem sorry

#

Thank you for your feedback

valid wind
#

Are any of you guys good with using pytorch, for some reason I'm getting a dimension mismatch and I don't really know why

valid wind
# serene scaffold show code and error

Ok so just a bit of context first, I'm simply trying to implement UNet for audio source separation and I'm using the musdb dataset and accompanying package in order to do it

#

There is the entirety of the code that I'm using

#
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-81-b99926cdd359> in <cell line: 1>()
      7         print("Target shape:", target.shape)
      8   train_unet_func(umet_model, reshaped_train_loader, optimizer, device, epoch, tb_writer)
----> 9   test_unet_func(umet_model, test_unet, device, epoch, tb_writer)
     10   umet_model.cpu()
     11   state_dict = umet_model.state_dict()

<ipython-input-79-badf5cb3f78b> in test_unet_func(model, test_data, device, epoch, tb_writer)
     38 
     39             x_padded, (left, right) = padding(x)
---> 40             right = x_padded.size(1) - right
     41             mask = model(x_padded.unsqueeze(0)).squeeze(0)[:, :, left:right]
     42             y = mask * x.unsqueeze(0)

IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
#

Also, I printed out the input shape after every step of forward while debugging and some other areas, I don't know how much it will be useful to you, but I have included the paste of it below:
https://paste.pythondiscord.com/zilegevuqe

bleak sky
#

I'm working with a random forest classification model and when I implement it on an external validation dataset (like the test set), i'm getting a recall (sensitivity) of 100%. Is it fine to have a high recall like this?

past meteor
#

You can get 100 % recall on a given class by just predicting everything belongs to that class.
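A tiny illustration with made-up labels: a "model" that always predicts the positive class catches every positive, so recall is perfect, but precision shows the cost.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.ones_like(y_true)                  # always predict the positive class

recall = recall_score(y_true, y_pred)          # every positive is caught
precision = precision_score(y_true, y_pred)    # but most predictions are wrong
```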

bleak sky
#

I'm working on binary classes... and the rest of the parameters seems to be what I expected

#

Accuracy = 83.33333333333334
Sensitivity = 100.0
Specificity = 66.66666666666666
Precision = 75.0
ROC = 83.33333333333334
MCC = 70.71067811865476
f1 = 85.71428571428571
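For what it's worth, these figures are exactly what a very small confusion matrix would produce. The counts below are hypothetical (any multiple of them gives the same percentages, so this doesn't show what the real counts were), and the "ROC" value is assumed to be the AUC computed from hard labels, which equals balanced accuracy.

```python
import math

tp, fn, tn, fp = 3, 0, 2, 1   # hypothetical counts chosen to match the reported numbers

accuracy = (tp + tn) / (tp + fn + tn + fp)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)
mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
balanced = (sensitivity + specificity) / 2   # AUC when only hard 0/1 predictions are used
```

With counts this small, one extra false negative would move recall from 100% to about 67%, so the estimate carries a lot of uncertainty.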

#

I feel like this can still serve the purpose.. even with high recall

#

care to share your opinion....?

past meteor
lapis sequoia
#

Hello everyone

plot_decision_boundary(model=model_4,
X=X,
y=y)

With this code im trying to check the decision boundary from the latest model Im building

#

This is the error I received

ValueError: Exception encountered when calling layer 'sequential_10' (type Sequential).

Input 0 of layer "dense_15" is incompatible with the layer: expected min_ndim=2, found ndim=1. Full shape received: (None,)

Call arguments received by layer 'sequential_10' (type Sequential):
  • inputs=('tf.Tensor(shape=(None,), dtype=float32)', 'tf.Tensor(shape=(None,), dtype=float32)')
  • training=False
  • mask=None
granite falcon
#

Hi all need some help in data science. Question.

lapis sequoia
lapis sequoia
granite falcon
lone plaza
#

Hello, I'm fairly new to ml. I'd consider myself an intermediate to advanced python developer (although I see myself somewhere in the middle). But I have almost no experience in ml. I learn best by finding somebody who can guide me. Anyone willing to spend some time to help me hop on the train?

#

Maybe I should mention that I'm really interested in the math behind it but I dunno if it's really worth it to learn it from scratch when there are already so many libraries etc

past meteor
#

@lone plaza https://mml-book.github.io/ and then https://www.statlearning.com/ and https://arxiv.org/abs/2106.11342 ideally you should actually do projects etc. while reading these

#

Each of them has a "who is this book for" section, let that convince you whether or not you want to go into the maths or not.

zealous hollow
#

can you tell me about it as well

buoyant vine
#

Hi all,

I'm currently trying to work out a bit of an issue i'm having mapping some numpy operations to Rust and have encountered an interesting behaviour which has dumbfounded me.

Say I have an array:

data = np.array([
    np.full(5, 0.20, dtype=np.float64),
    np.full(5, 25.7, dtype=np.float64),
    np.full(5, 3.0, dtype=np.float64),
    np.full(5, 0.9, dtype=np.float64),
], dtype=np.float64)

And it's an f64, when I then apply the following ops to it:

    hyperplane_vector = np.empty(dim, dtype=np.float64)
    for d in range(dim):
        hyperplane_vector[d] = (data[left, d] / left_norm) - (
            data[right, d] / right_norm
        )

Where left, right and their respective norms are:

left = 0
right = 1
left_norm = norm(data[left])  # L2 norm
right_norm = norm(data[right])  # L2 norm

Numpy will produce an array of 0.0

interesting [0. 0. 0. 0. 0.]

But if this array becomes a float32 we get:

interesting [-5.0820086e-09 -5.0820086e-09 -5.0820086e-09 -5.0820086e-09 -5.0820086e-09]

And this is confusing the fuck out of me where the accuracy is dropping / if the f64 is correct and it should be zero or if something else is going on

#

The reason why i'm a bit confused is that when porting this over to Rust, the f32 array produces the same values as numpy's float64 behaviour.
If I force it to become an f64 array and do all of that in double precision, I get a number close to the f32 value in numpy but off by a tad, which can honestly just be put down to rounding error
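One way to frame it: each row is a constant vector, so row / norm(row) is mathematically the same unit vector, with every entry 1/sqrt(5), for both rows. The exact answer is therefore 0, and anything nonzero is rounding noise whose size depends on where the rounding happens. The sketch below assumes `norm` behaves like `np.linalg.norm` (the original `norm` isn't shown) and uses a vectorised form of the loop. It only compares the result against machine epsilon for each dtype, without claiming to reproduce the exact values above.

```python
import numpy as np

def hyperplane(data):
    left, right = 0, 1
    left_norm = np.linalg.norm(data[left])     # assumed stand-in for norm()
    right_norm = np.linalg.norm(data[right])
    return data[left] / left_norm - data[right] / right_norm

rows = [np.full(5, 0.20), np.full(5, 25.7)]
h64 = hyperplane(np.array(rows, dtype=np.float64))
h32 = hyperplane(np.array(rows, dtype=np.float32))

eps32 = np.finfo(np.float32).eps   # about 1.2e-07
eps64 = np.finfo(np.float64).eps   # about 2.2e-16
```

Values like -5e-09 are below float32 epsilon, so they are plausibly the result of mixing float32 inputs with higher-precision intermediate steps. That would also explain why a port that rounds at different points gives a slightly different tiny number.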

zealous hollow
#

hi all,
so basically, i am working on a project right now and need some help more like guidance

i want to predict temperature values based on inputs such as date and humidity (percentage values)
First, are these enough inputs?
second, i am using one hot coding (not sure if that's the right name, but basically taking date as Day 1,2,3,4,5,....)

third, which algorithm will be best?
i have worked with SVM (rbf) using only dates and i found the results quite promising.
but when i introduced humidity values the results were worse
like before the mean error was 1.xxxx and then it went up to like 144.xxxx
pearson correlation values were
-.67 for temp and humidity
0.5xx for temp and day

sterile belfry
#

hi all, I've been asked to run random_state 10 times and take the mean of them. do I just code it like this? new to all this and really struggling to get my head round it
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.20, random_state=10)

agile cobalt
#

no.

look up cross validation

mild dirge
#

They probably want you to do 10 random runs, so you should set the seed to a different value each time. @sterile belfry
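A sketch of that reading of the task: repeat the split with 10 different seeds, score the model each time, and report the mean. The iris data and the logistic regression model are placeholders for whatever the assignment actually uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, Y = load_iris(return_X_y=True)            # placeholder data

scores = []
for seed in range(10):                       # 10 different random splits
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, Y, test_size=0.20, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
    scores.append(model.score(X_test, Y_test))

mean_score = np.mean(scores)                 # the number to report
std_score = np.std(scores)                   # how much it varies between splits
```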

rigid cape
#

hi

mild dirge
#

Oy

#

Have you already watched a video by 3 blue 1 brown @rigid cape ?

rigid cape
#

watching it

mild dirge
#

The two videos I sent I actually highly recommend watching. The one by sebastian lague is pretty chill, and he explains in pretty simple terms.

#

The 3b1b goes a bit more in-depth on the maths

sullen reef
tall tulip
#

I'm working on a time series dataset. When I plot it as a line chart with the plotly library, the time shows up like this. I want to format it so it looks a little better. I've tried everything I know but it still didn't work for me. Can anyone help me format this datetime?

elfin stirrup
#

can someone help me with this question? i got 7 datapoints for cluster 1 but it was incorrect

elfin stirrup
magic dune
elfin stirrup
# magic dune show formula

After performing the initial clustering and re-estimating the centroids using the Manhattan distance as a distance metric, we obtain:

Cluster 1: [2, 1, 4, 5, 3]

Cluster 2: [8, 6, 28, 12, 9, 7, 10]

The new centroids are:

K1_new = (2+1+4+5+3)/5 = 3

K2_new = (8+6+28+12+9+7+10)/7 = 11.42857143

#

[3, 6, 2, 1, 4, 5, 7] = C1

#

7 datapoints

elfin stirrup
# magic dune show formula

Distance to K1_new:

[0.67, 5.67, 3.67, 1.67, 2.67, 4.67, 24.33, 8.33, 5.33, 2.33, 3.33, 6.33]

Distance to K2_new:

[4.5, 5.5, 15.17, 9.5, 11.5, 8.17, 15.83, 3.83, 3.83, 7.17, 4.17, 2.83]

#

then i used the distances to get the c1 points

magic dune
#

ok

elfin stirrup
#

i got 7 but it's incorrect

magic dune
elfin stirrup
#

doesnt give me answer

magic dune
#

|x1 − x2| + |y1 − y2|

#

@elfin stirrup

#

this is manhattan

elfin stirrup
magic dune
#

for my centroids

#

wait

#

wrong

elfin stirrup
magic dune
#

but idk

#

feels like too much

elfin stirrup
#

this is what i got for both clusters

elfin stirrup
wooden sail
#

!e

import numpy as np
x = np.array([3,8,6,2,1,4,28,12,9,5,7,10])
c1 = 2
c2 = 8

for i in range(2):
    d1 = np.abs(x-c1)
    d2 = np.abs(x-c2)
    print(d1)
    print(d2)
    clusters = d1 < d2
    if i == 1:
        break
    c1 = np.mean(x[clusters])
    c2 = np.mean(x[np.logical_not(clusters)])

print(f"{np.sum(clusters)} points belong to cluster c1 with centroid {c1}")
print(f"{np.sum(np.logical_not(clusters))} points belong to cluster c2 " + 
      f"with centroid {c2}")
arctic wedgeBOT
#

@wooden sail :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | [ 1  6  4  0  1  2 26 10  7  3  5  8]
002 | [ 5  0  2  6  7  4 20  4  1  3  1  2]
003 | [ 0.5  5.5  3.5  0.5  1.5  1.5 25.5  9.5  6.5  2.5  4.5  7.5]
004 | [ 7.625  2.625  4.625  8.625  9.625  6.625 17.375  1.375  1.625  5.625
005 |   3.625  0.625]
006 | 6 points belong to cluster c1 with centroid 2.5
007 | 6 points belong to cluster c2 with centroid 10.625
wooden sail
#

notice in the first iteration, there's a point with the same distance to both cluster centroids. the result depends on which one you pick for that iteration

#

if we use <= instead of <, we get 7 points as you said
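The same two-step procedure wrapped in a function so both tie-breaking rules can be compared side by side. The function name and the `ties_to_c1` flag are made up for this sketch; the data and starting centroids are the ones from the eval above.

```python
import numpy as np

def two_means_1d(x, c1, c2, ties_to_c1, n_iter=2):
    """Assign points to the nearer of two centroids, update, repeat."""
    for i in range(n_iter):
        d1, d2 = np.abs(x - c1), np.abs(x - c2)
        in_c1 = d1 <= d2 if ties_to_c1 else d1 < d2
        if i < n_iter - 1:                      # no update after the last assignment
            c1, c2 = x[in_c1].mean(), x[~in_c1].mean()
    return in_c1, c1, c2

x = np.array([3, 8, 6, 2, 1, 4, 28, 12, 9, 5, 7, 10])
strict, *_ = two_means_1d(x, 2, 8, ties_to_c1=False)   # tie (the point 5) goes to c2
loose, *_ = two_means_1d(x, 2, 8, ties_to_c1=True)     # tie goes to c1
```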

elfin stirrup
#

it was similar but i got 6 points for this

#

Cluster 1: [3, 2, 1, 4, 5]
Cluster 2: [8, 6, 28, 12, 9, 7, 10]

New centroid K1 = mean([3, 2, 1, 4, 5]) = 3
New centroid K2 = mean([8, 6, 28, 12, 9, 7, 10]) = 11.4

Cluster 1: [3, 2, 1, 4, 5, 7]
Cluster 2: [8, 6, 28, 12, 9, 10]

New centroid K1 = mean([3, 2, 1, 4, 5, 7]) = 3.67
New centroid K2 = mean([8, 6, 28, 12, 9, 10]) = 12.17

Cluster 1: [3, 2, 1, 4, 5, 7]
Cluster 2: [8, 6, 28, 12, 9, 10]

earnest widget
#

I have used a PCA technique for my image dataset and it shows the two classes points are overlapping. Does it mean there is a high degree of similarity?

past meteor
#

They could be similar in the first 2 and dissimilar in the others, hard to tell. Maybe you can try LDA, it's more or less a supervised version of PCA.

earnest widget
#

LDA?

past meteor
#

linear discriminant analysis. I assume you use scikit-learn? There's an LDA classifier there. What you want to call is the transform method.
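A minimal sketch of that call, using random placeholder features instead of the real image features. One thing to keep in mind: LDA produces at most (number of classes - 1) components, so with two classes the projection is a single axis rather than a 2D plot.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Placeholder for the extracted features: 2 classes, 20 features each.
X = np.vstack([rng.normal(0.0, 1.0, (100, 20)), rng.normal(0.5, 1.0, (100, 20))])
y = np.repeat([0, 1], 100)

lda = LinearDiscriminantAnalysis()
projected = lda.fit_transform(X, y)   # supervised: uses the labels, unlike PCA
```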

earnest widget
#

Oh okay alright, I am just extracting the features through mobileNet and then running it through the visualizations. I tried T-Sne as well, made no sense.

past meteor
earnest widget
past meteor
#

Do you use a Standardscaler before you do your PCA

earnest widget
#

Idk if I am missing any steps.

earnest widget
past meteor
#

Yes you are 🙂 make_pipeline(StandardScaler(), PCA(2))
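Spelled out with placeholder data, to show why the scaler matters: without it, a single large-scale feature can dominate the first principal component. The feature matrix here is synthetic, with one column deliberately blown up by a factor of 1000.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Placeholder features: the second column has a much larger scale than the rest.
features = rng.normal(size=(300, 5)) * np.array([1.0, 1000.0, 1.0, 1.0, 1.0])

unscaled = PCA(2).fit(features)
scaled = make_pipeline(StandardScaler(), PCA(2)).fit(features)

ratio_unscaled = unscaled.explained_variance_ratio_[0]   # dominated by the big column
ratio_scaled = scaled.named_steps["pca"].explained_variance_ratio_[0]
embedding = scaled.transform(features)                   # 2D points for plotting
```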

earnest widget
#

Oh okay.

#

But will that make any difference or should I go straight with LDA?

past meteor
#

Personally I would go with PCA first again

earnest widget
#

I am assuming the scaling function should be used with LDA, T-SNE etc?

earnest widget
past meteor
#

You should also use the exact image normalization that your pretrained model used. For example, some are trained on [-1, 1] (typically) others on [0, 1] so you should consult the docs to see what they did and mirror that.

earnest widget
#

Oh alright. So this StandardScaler method is usually done right before using PCA or other methods?

past meteor
earnest widget
past meteor
#

I think you're already doing this since you mentioned you normalize and then resize. Just check the docs to see what they normalized with in training

earnest widget
earnest widget
past meteor
#

"The inference transforms are available at MobileNet_V3_Small_Weights.IMAGENET1K_V1.transforms " Ideally you need to run this transformation instead of manually resizing / normalising

past meteor
earnest widget
earnest widget
#

So this code takes the features and labels:

features = []
labels = []
for images, label in train_dataloader:
    with torch.no_grad():
        outputs = model(images)
        features.append(outputs.numpy())
        labels.append(label.numpy())
features = np.concatenate(features, axis=0)
labels = np.concatenate(labels, axis=0)

I get 3488,1000 as the output. 3488 is the number of samples and 1000 is the features.

#

Then the scaling:

features_2d = scaler.fit_transform(features)
#
pca = PCA(n_components=2)
pca_result = pca.fit_transform(features_2d)
print(pca_result.shape) # (3488,2)

fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(pca_result[:, 0], pca_result[:, 1], c=labels, cmap=ListedColormap(colors))

# View the plot (in a notebook, %matplotlib inline belongs in an earlier cell)
plt.show()
#

This is the visualization.

#

Also, I did not freeze any layers. I think the model.eval() command does that.
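A side note on that last point, as a small sketch with a throwaway model (not the MobileNet from the thread): `model.eval()` only switches layers like dropout and batch norm to inference behaviour. It does not freeze anything. Freezing means turning off `requires_grad` on the parameters. For pure feature extraction under `torch.no_grad()`, as in the loop above, no gradients are computed either way.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.BatchNorm1d(8), nn.Dropout(0.5), nn.Linear(8, 2))

model.eval()   # inference behaviour for dropout/batch norm, nothing more
still_trainable = all(p.requires_grad for p in model.parameters())

for p in model.parameters():   # freezing is a separate, explicit step
    p.requires_grad_(False)
frozen = not any(p.requires_grad for p in model.parameters())
```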

past meteor
#

Hmmm then I'm not sure because this seems to be OK. Could be the transforms that are going wrong. I doubt it, but it's good practice to use the ones they suggest in the docs for pretrained models. If it's not that it could be that the features from mobilenet are inadequate (try Xception for example, but the model is a lot larger) OR that it is really your data OR that the first 2 PC's are not discriminative

earnest widget
#

Hmm okay.

#

So the docs just say to put the weights like this:

model = models.mobilenet_v3_small(pretrained=True, weights="MobileNet_V3_Small_Weights.IMAGENET1K_V1")

#

Cause it accepts the params.

#

Let me try it now.

#

I remove the normalization and resizing.

#

But idk how that would make a significant difference.

past meteor
earnest widget
past meteor
earnest widget
earnest widget
# past meteor yes, this

Ah okay, so I don't think that works because the transform function does not accept that module.

transform = transforms.Compose(
    [
        transforms.Resize((IMG_HEIGHT, IMG_WIDTH)),  # Resize the images to (224, 224)
        transforms.ToTensor(),  # Convert the images to PyTorch tensors
        # transforms.Normalize(
        #     mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
        # ),  # Normalize the images
        transforms.MobileNet_V3_Small_Weights.IMAGENET1K_V1,
    ]
)
#

So I think I can apply the transformations which they have done through the transform function and then keep the weights param.

lapis sequoia
past meteor
#

Using MobileNet_V3_Small_Weights.IMAGENET1K_V1.transforms (transforms is missing) just applies all the steps you've done.

#

I actually got to go now, if I were you I'd just try a different model at this point (Xception, Resnet)

earnest widget
maiden widget
#

can someone explain what is graph_execution_error and how can i overcome it ?

#

i have used tflearn to make a model for disease recognition in pomegranate. when i load the model and try to predict, it gives correct output at first but for second try it gives graph execution error

mild dirge
#

You should probably check the shapes of your input and output data, and the input and output shape of your model. @maiden widget

maiden widget
#

thanks

#

actually, the model needed to be closed or restarted first before using it again

hot blade
#

using keras, what's the difference between enabling return_sequences and having several nodes in the output layer?

lapis sequoia
#

Anyone here know linear & logistic regression?

wooden sail
graceful inlet
#

hi guys, i use EMNIST balanced dataset(47 classes) in order to train my CNN in keras and i get an accuracy equal to 83%. however, when i make predictions with my model on new data(images drawn by me), i get inaccurate detections all the time(except for letter X for some reason lol).

inputs = Input(shape=(width, height, 1))
    x = Conv2D(filters=32, kernel_size=(3,3), activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.001))(inputs)
    x = MaxPooling2D(pool_size=(2,2 ))(x)
    x = BatchNormalization()(x)
    x = Conv2D(filters=64, kernel_size=(3,3), activation="relu", kernel_regularizer=tf.keras.regularizers.l2(0.001))(x)
    x = BatchNormalization()(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    x = Conv2D(filters=128, kernel_size=(3,3), activation="relu", kernel_regularizer=tf.keras.regularizers.l1_l2(0.001) )(x)
    x = BatchNormalization()(x)
    x = Flatten()(x)
    x = Dropout(.5)(x)
    outputs = Dense(NB_CLASSES, activation="softmax")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

this is how my model looks like. what could be the reason behind incorrect predictions? i preprocess data correctly so i have no idea the cause of the problem.

#

width = 28; height = 28 and NB_CLASSES = 47

#

little note: i trained the model on 15 epochs and i set the batch size to 16

graceful inlet
#

apparently it works but i have to rotate the image counter clockwise by 90 like bruh 😭

kind moth
#

Hey can someone help with my code, I have an issue, here's my code:

import numpy as np
from keras.models import load_model
from PIL import Image
import time

#modle
model = load_model('model.h5')

noise = np.random.randn(1, 512, 512, 3)

delay_time = 30  #sec

for i in range(10):
    generated_images = model.predict(noise)

    #convert imge array
    generated_image = Image.fromarray(np.uint8(generated_images[0]*255))

    #save img
    generated_image.save('generated_image.png')

    #delay
    time.sleep(delay_time)

For some reason when I try to run it, it just generates a 1 by 1 white pixel image.
Thanks for any help!

weary dust
#

hi guys. im new to programming i want to get into programming
i saw this video 1 month ago https://www.youtube.com/watch?v=WtEYMELvRHI&t=53s&ab_channel=AtleFjellangSæther
i searched many things about this and i want to make robots like this
in internet i understand i need to learn matlab ,python ,machine learning,arduino and raspberry pi
are they good to make one like this

This video illustrates the work performed in the
context of our bachelor's thesis.

The project was conducted in collaboration
with Oslo and Akershus University College of
Applied Sciences.

The purpose of the thesis has been to elucidate
the main methods of self-learning systems, and
develop a self-learning algorithm for an
appropriate de...

agile cobalt
#

(to clarify, by "do not underestimate machine learning" I mean, it is significantly deep - might be like three times harder compared to the other items you'd have to learn if you were to understand what each part of the system is doing with a reasonable depth.... though if you just stick with using it as a black box, copy pasting from some tutorial and editing without trying to understand what's happening under the hood, which tbh is perfectly fine, it might not be that bad)

ocean swallow
#

you guys used langchain?

#

It is very elementary to find the related text part given a question, but what if the related text is related to another text, and essentially the text, hence all the relevant information, becomes too big to feed to AI models?

lapis sequoia
#

guys any idea on how i can perform exploratory data analysis on huge datasets without my analysis being biased? because if i reduce the dataset then my analysis starts to lose integrity

umbral delta
#

so i have this:

df = pd.concat([pd.read_csv('./Sales_Data/'+file) for file in listdir('./Sales_Data')])
df.dropna(how='any')
df['Price Each'] = pd.to_numeric(df['Price Each'])

which causes this error:

ValueError: Unable to parse string "Price Each"

could someone please explain why this is happening
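One plausible cause (a guess, since the files aren't shown) is that the header line appears again inside the data, so the column contains the literal string "Price Each". The sketch below reproduces that with a made-up CSV. Note also that `df.dropna(how='any')` returns a new frame, so without assigning the result the call has no effect.

```python
import io
import pandas as pd

# Hypothetical file contents: the header line shows up again in the middle of the data.
raw = "Product,Price Each\nCable,11.95\nProduct,Price Each\nPhone,600\n,\n"
df = pd.read_csv(io.StringIO(raw))

try:
    pd.to_numeric(df["Price Each"])
    failed = False
except ValueError:
    failed = True          # the string "Price Each" cannot be parsed as a number

# errors="coerce" turns unparseable entries into NaN instead of raising.
df["Price Each"] = pd.to_numeric(df["Price Each"], errors="coerce")
df = df.dropna(how="any")  # keep the result; dropna does not modify df in place
```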

torn hull
#

hey guys how we evaluate a recommendation system

uncut wasp
#

Hello, I am trying to run DiffMorph (https://github.com/volotat/DiffMorph/) on my mac. However I ran into some error

To run this program I tried running it by doing the following commands:

pip install -r requirements.txt
python morph.py

#

however while installing the packages I ran into an error: the required numpy version needed a higher python version.
So what I did is switch from python 3.6 to python 3.10.
I redid the same commands above however I got a new error while running pip install -r requirements.txt:

ERROR: Could not find a version that satisfies the requirement tensorflow==2.9.1 (from versions: none)
ERROR: No matching distribution found for tensorflow==2.9.1
How do I install tensorflow without getting this error?
earnest widget
lapis sequoia
#

this is my code

plt.plot(neigh, acc_score, marker="o", markeredgecolor = 'black', markerfacecolor = 'red')
plt.xlabel("Number of neighbors")
plt.ylabel("Accuracy score")

and this below is my graph. How can I add the annotation on the markers only, or make the x-axis scale a bit more detailed?
mild dirge
#

You can do plt.xticks(range(51), range(51)) @lapis sequoia

lapis sequoia
mild dirge
#

Can also do plt.grid() so you can see where it aligns
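Putting those together with per-marker labels, as a sketch: `neigh` and `acc_score` are filled with placeholder values here, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")                  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

neigh = list(range(1, 11))                           # placeholder values
acc_score = [0.80 + 0.01 * (k % 4) for k in neigh]

fig, ax = plt.subplots()
ax.plot(neigh, acc_score, marker="o", markeredgecolor="black", markerfacecolor="red")
ax.set_xlabel("Number of neighbors")
ax.set_ylabel("Accuracy score")
ax.set_xticks(neigh)                                 # one tick per plotted k
ax.grid(True)
for k, acc in zip(neigh, acc_score):                 # label each marker with its score
    ax.annotate(f"{acc:.2f}", (k, acc), textcoords="offset points",
                xytext=(0, 6), ha="center")
```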

somber panther
#

pandas dataframe, use it as an object or like a dictionary?

wooden sail
#

wdym by "as an object"? dicts are also objects

somber panther
#

just because series can be accessed with dataframe["header"] and dataframe.header

#

was wondering if one is prefered

wooden sail
#

do you mean dataframe.head? those do different things

untold bloom
#

df["..."] always works, df.... sometimes works; so in "real" code, one might want to prefer first

#

but latter is easier to type, so there's that

#

latter only sometimes works because of name clashes, e.g., if you have a column named "info" or "sum", it would fail and defer to the corresponding attributes

#

if you have nonvalid Python identifiers as a column name, e.g., one with spaces in it, it will fail too

#

so in short, IMHO, prefer df["..."] except in one-off quick trials on frames you perform

#

df["..."] has also the big advantage of conveying you are selecting a column immediately
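A quick demonstration of those cases on a toy frame (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"price": [1, 2], "sum": [10, 20], "unit cost": [0.5, 0.7]})

attr_ok = df.price.equals(df["price"])   # works: valid identifier, no name clash
clash = callable(df.sum)                 # df.sum is the DataFrame method, not the column
column = df["sum"].tolist()              # bracket access still reaches the column
spaced = df["unit cost"].tolist()        # df.unit cost would be a syntax error
```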

young granite
#

if i got a linear regression model and do a residual plot after prediction and in this plot i see x:y pairs of x:-x, how can i adapt so i reduce this phenomena

mild dirge
#

So it predicts the y value too low, and this error is proportional to x? @young granite

young granite
#

but i get a linear trend

#

i did check the IDs of the "outlier" sets for all targets an see that if they follow this linear trend for one feature they do it often for more

#

i wouldnt say its proportional

#

i think it's just not well represented by the model

mild dirge
#

The residuals just show you the error for different x values, which is what the top plot shows. I don't see any linear trend in that, nor do I see why you would want to fit a linear regression model on the residuals.

young granite
#

for some targets i got fewer datasets

young granite
#

these pics are just examples

#

the green dots go from -x:x through 0

mild dirge
#

I'm sorry, I don't think I can help

young granite
wooden sail
young granite
wooden sail
#

and you're doing linear regression on those 50 features? or?

young granite
#

yes

wooden sail
#

ok, so you reduce it to 2000 samples, each one being a vector of 50 features. and you wanna do regression on that. what are these targets you want to predict, and how many parameters are you using in the linear regression?

young granite
wooden sail
#

i'm still not understanding what you're trying to do, sorry

young granite
cold osprey
#

2000 samples

#

50 features

wooden sail
#

i'm trying to figure out the size of the matrix we're working with, but i'm not sure what you're calling a feature here

young granite
#

dataset: 2k
1 set contains a df of shape: 800row x 2col
each set is reduced to: 50 features
target are: 10 m/z values

cold osprey
#

Wait what

wooden sail
#

and you wanna do a different regression for each of the 2k datasets?

cold osprey
#

What's that 800 x 2 in a set

young granite
young granite
cold osprey
#

How did u get the 50 features

#

oh

wooden sail
#

and what are you passing to sklearn's LinearRegression.fit()?

young granite
#

my features

#

and their m/z

wooden sail
#

ok. so the 2000x50 array of wavelet coefficients, and whatever this m/z is

#

then the linear regression learns 51 parameters, ok

young granite
#

now back to the original question hahahaha

#

sorry for the circumstances

wooden sail
#

in the plot you showed above, what is this "fitted value" you put on the x axis?

#

the predicted m/z?

young granite
#

yyes

#

residual is: real-pred

#

so when x:-x occurs the model has problems

#

and i want to figure out a way to improve that other than to switch model

wooden sail
#

what does your notation x:-x mean

young granite
#

x:y coordinates

wooden sail
#

ok

young granite
#

pred:-pred

#

linear downwards trend

wooden sail
#

that does indeed indicate model mismatch and not noise

#

you can try to whiten the data before doing linear regression

#

but if the relationship between the data is not linear, no amount of preprocessing will help
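One common way to whiten is PCA with `whiten=True`, which decorrelates the features and rescales them to unit variance. Whether this is the kind of whitening meant here is an assumption, and the matrix below is random placeholder data with the same 2000 x 50 shape.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Placeholder for the 2000 x 50 feature matrix, with correlated columns.
X = rng.normal(size=(2000, 50)) @ rng.normal(size=(50, 50))

X_white = PCA(whiten=True, svd_solver="full").fit_transform(X)
cov = np.cov(X_white, rowvar=False)   # close to the identity matrix after whitening
```

The whitened matrix would then be passed to `LinearRegression.fit()` in place of the raw features.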

young granite
#

could it be due to lack of certain m/z values

#

so that it will decrease over time

#

cause thats my assumption

wooden sail
#

probably not tbh

young granite
#

if a m/z is in 90% its good represented but not for 10%

#

mhhh

#

i came to this conclusion cause i tried a simple CNN aswell which resulted in a similar trend

somber panther
#

so Series.count() isn't what i was expecting, whats the trick?

#

need to collect the number of "Black" in a series

cold osprey
#

groupby count
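For reference, on a toy series: `Series.count()` counts non-missing entries regardless of value, so it isn't the right tool for counting one particular value. A boolean comparison, `value_counts()`, or the groupby route all give the per-value count.

```python
import pandas as pd

colors = pd.Series(["Black", "White", "Black", None, "Red", "Black"])

non_null = colors.count()                  # non-missing entries, whatever their value
n_black = (colors == "Black").sum()        # number of entries equal to "Black"
per_value = colors.value_counts()          # counts for every distinct value
grouped = colors.groupby(colors).count()   # the groupby/count route, same numbers
```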

willow quest
#

I'm being asked to find a 'nice looking' representation of how close one number is to one another, preferably condensed to a 0-1 or 0-100 scale (it's not a group/population, just a series of A and B data points that are unrelated to other A and B data points, so cannot apply normalization).

So when A=100 and B=100, the score is like 1 or 100. But when A or B is different, they want to know the 'distance' so to speak, without the sign. So e.g. A=50, B=100 = 0.5 or 50 score. But if A=1000000 and B=100, then the score would be 0.000x or similar. Anyone has any ideas? My stats background is in life sciences so I'm a bit lost here 😅

hoary wigeon
#

Hello everyone, I need help with understanding the use case of Shapley values..

First of all, is it possible to calculate record-level Shapley values? (record in the sense of an individual observation used for training or testing the model)

untold cliff
#

Does the imbalanced dataset problem concern only the target variable? Like if i have an imbalanced feature, do i have to deal with it?

cold osprey
#

wdym imbalanced feature?

#

like skewed?

untold cliff
cold osprey
#

oh

#

idt u need to do anything

untold cliff
tidal bough
#

atan(|A-B|)*2/π would be 0 for A=B, and approaches 1 as |A-B| approaches infinity.

#

the logistic function (aka the sigmoid, 1/(1+exp(-x))) is another choice for the function. Though that one is 1/2 at 0, so you'd have to rescale it.
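A sketch of the atan idea, flipped so that identical numbers score 1 as requested. The second function is a different option added here as an assumption (not something proposed in the chat): a plain smaller-over-larger ratio, which happens to reproduce the example scores from the question for positive inputs.

```python
import math

def closeness_atan(a: float, b: float) -> float:
    """1 when a == b, falling toward 0 as |a - b| grows."""
    return 1.0 - math.atan(abs(a - b)) * 2.0 / math.pi

def closeness_ratio(a: float, b: float) -> float:
    """Smaller over larger; assumes both numbers are positive."""
    return min(a, b) / max(a, b)
```

Note that the atan version depends on the absolute difference, so A=50, B=100 scores about 0.013 rather than 0.5. Dividing the difference by a scale that suits the data before taking the atan would change that. The ratio version gives 0.5 for that pair and 0.0001 for A=1000000, B=100.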

willow quest
lapis sequoia
#

Heyy

uncut wasp
#

Hi @lapis sequoia

upper bridge
#

following this tutorial:https://www.geeksforgeeks.org/disease-prediction-using-machine-learning/ and in code ```py

# Training the models on whole data
final_svm_model = SVC()
final_nb_model = GaussianNB()
final_rf_model = RandomForestClassifier(random_state=18)
final_svm_model.fit(X, y)
final_nb_model.fit(X, y)
final_rf_model.fit(X, y)

# Reading the test data
test_data = pd.read_csv("./dataset/Testing.csv").dropna(axis=1)

test_X = test_data.iloc[:, :-1]
test_Y = encoder.transform(test_data.iloc[:, -1])

# Making prediction by take mode of predictions
# made by all the classifiers
svm_preds = final_svm_model.predict(test_X)
nb_preds = final_nb_model.predict(test_X)
rf_preds = final_rf_model.predict(test_X)

final_preds = [mode([i,j,k])[0][0] for i,j,k in zip(svm_preds, nb_preds, rf_preds)]

print(f"Accuracy on Test dataset by the combined model: {accuracy_score(test_Y, final_preds)*100}")

cf_matrix = confusion_matrix(test_Y, final_preds)
plt.figure(figsize=(12,8))

sns.heatmap(cf_matrix, annot = True)
plt.title("Confusion Matrix for Combined Model on Test Dataset")
plt.show()
``` i get error y contains previously unseen labels: 'Fungal infection' why is that? my dataset contains the label
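
One possible cause (only a guess, since the CSV files are not shown here) is that the label strings differ in a way that is invisible when printed, for example trailing whitespace in one of the files. A quick check, sketched with hypothetical label lists standing in for the training and test label columns:

```python
# Hypothetical label values; in the real code these would come from the
# last column of the training and test DataFrames.
train_labels = ["Fungal infection ", "Allergy", "GERD"]  # note the trailing space
test_labels = ["Fungal infection", "Allergy", "GERD"]

# Labels present in the test set but never seen during fitting.
unseen = sorted(set(test_labels) - set(train_labels))
print([repr(label) for label in unseen])  # repr() makes stray whitespace visible

# Stripping whitespace on both sides before encoding removes this kind of mismatch.
clean_train = {label.strip() for label in train_labels}
clean_test = {label.strip() for label in test_labels}
print(sorted(clean_test - clean_train))
```

If the set difference is empty even before cleaning, the cause lies elsewhere, for example an encoder fitted on a different column.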

misty flint
mint palm
#

i removed loss.backward, and removes shuffle, passed same sample at test and train time, BUT LOSS comes out to be DIFFERENT. HOWWWWWW?

#

This model is f**king with me, i am fed up.

serene scaffold
mint palm
# serene scaffold I'm sorry that this is frustrating for you. if you want help, you might get it i...

It happens, i figured it out, it was the random crop that was making the difference.
One more thing, i noticed my model was not improving. I made some changes to the model (added a new backbone and a transformer encoder), and before that it was training as expected.
Assuming the implementation is not the issue, can you please tell me what to try?
A few more observations:

loss is decreasing during training
loss isn't decreasing during testing
sometimes test accuracy reduces

Should i try changing hyperparameters? If so, which ones?

hasty mountain
#

Isn't the idea to train the model on train samples, then evaluate it on test sample to check how things are going?
Its loss won't decrease during test. The loss in the test section will only decrease after a train section.

#

Also, about the "sometimes test accuracy reduces": I suppose the cause might be similar to why the loss sometimes increases instead of decreasing after a batch iteration (or even an epoch). It's just the stochastic nature of gradient descent. The model might accidentally be optimized into a worse point (or, for the accuracy, towards being overfit), and then recover afterwards.

pallid badge
#

Hi everybody

#

I was wondering if you can recommend good exercises for scipy, numpy, matplotlib, including algorithm development

agile cobalt
#

not sure about scipy, but for numpy+matplotlib you could try implementing some simple algorithms like k-means clustering

waxen tusk
#

Does anyone know of any good math courses focused around data science/ML principles?

agile cobalt
waxen tusk
#

Ty

agile cobalt
#

there are also 3blue1brown's videos on youtube

pallid badge
agile cobalt
#

learning math?
they have some very nice visualisations

#

ah, yeah that was in response to Whip, not to you

cunning agate
#

hello guys where can i find some ai and ml project to work on it

hazy sequoia
untold cliff
#

@willow quest @tidal bough Here's what chatgpt said which is nice and simple i guess:
One option could be to calculate the ratio between the two numbers and then scale it to a 0-1 or 0-100 range. For example, if A=50 and B=100, the ratio is 0.5, which can be scaled to a 50 out of 100 score or a 0.5 out of 1 score. Similarly, if A=1000000 and B=100, the ratio is 10000, which can be scaled to a 0.0001 out of 1 score or a 0.01 out of 100 score.

Another option could be to take the logarithm of the ratio between the two numbers, which would compress the range of values and make it easier to compare across different magnitudes. For example, if A=50 and B=100, the logarithm of the ratio would be -0.301, which could be scaled to a 30 out of 100 score or a 0.3 out of 1 score. If A=1000000 and B=100, the logarithm of the ratio would be 4.605, which could be scaled to a 0.046 out of 1 score or a 4.6 out of 100 score.

Ultimately, the choice of method would depend on the specific requirements of the task and the preferences of the stakeholders involved.
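
A minimal sketch of the two ratio-based ideas, written so that the score is symmetric in A and B and stays between 0 and 1. It assumes both values are strictly positive; the `max_log10` cap in the log version is a hypothetical parameter chosen for illustration.

```python
import math


def ratio_score(a: float, b: float) -> float:
    """Smaller value divided by larger value: 1.0 when equal, near 0 when far apart."""
    return min(a, b) / max(a, b)


def log_ratio_score(a: float, b: float, max_log10: float = 6.0) -> float:
    """Score from |log10(a / b)|, clipped so a ratio of 10**max_log10 or more maps to 0."""
    log_gap = abs(math.log10(a / b))
    return max(0.0, 1.0 - log_gap / max_log10)


print(ratio_score(50, 100))
print(ratio_score(1_000_000, 100))
print(log_ratio_score(50, 100))
print(log_ratio_score(1_000_000, 100))
```

The plain ratio punishes large gaps very hard, while the log version spreads scores more evenly across orders of magnitude.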

somber panther
#

could use a good video covering pandas if anyone has suggestions

untold cliff
cloud marsh
#

I'm using pyenv with virtualenv. I have a ROCm GPU and i'm running the command on pytorch.org/get-started/locally.

i've set --no-cache-dir and ensured it's pulling from indexes in the proper order. it's downloading the linux manylinux wheels and then downloads the nvidia_cuda_cu11 wheels anyways.

does it just download that anyways? or am i not specifying things correctly?

somber panther
#

man... 100days course has me using pandas before numpy, all the resources for pandas seem to suggest that i'm doing this out of order...

somber panther
untold cliff
somber panther
#

Yeah I'm going to run through these exercises and just type what im told for now, i'll probably go through those before I start my DS course

serene scaffold
maiden widget
#

i am making a model to classify an image in 5 class

network = input_data(shape=input_shape)
network = conv_2d(network, 32, 3, activation='relu')
network = max_pool_2d(network, 2)
network = conv_2d(network, 64, 3, activation='relu')
network = max_pool_2d(network, 2)
network = fully_connected(network, 128, activation='relu')
network = dropout(network, 0.5)
network = fully_connected(network, 5, activation='softmax')
network = regression(network, optimizer='adam',loss='categorical_crossentropy', learning_rate=0.001)

#

can i use anything else rather than regression for output layer ?

lapis sequoia
#

Is it run program or is just example?

maiden widget
#

this is the model i am using to train the model, but my teacher is saying regression is used for prediction not classification

earnest widget
#

How can I get both my labels to show?

# Plot the data using different colors for each class
fig, ax = plt.subplots()
scatter = ax.scatter(features_2d[:, 0], features_2d[:, 1], c=labels)

plt.legend(loc='upper right', labels=['Container', 'No_Container'])

# Set the title and show the plot
ax.set_title("LDA Visualization")
plt.show()

This only shows one label and not the other.

untold cliff
mint palm
#

easy way to convert first tensor to second?

#

currently i do: sims = sims.reshape(2, 3*2).t().view(2, 3, 2)

earnest widget
tidal bough
mint palm
tidal bough
#

who knows, probably about the same

mint palm
#

ok, i am gonna use your, looks better

mint palm
tidal bough
#

probably it's done slightly differently in torch than in numpy, maybe .transpose(1,2,0) or whatever

boreal gale
earnest widget
boreal gale
#

no, you need to select rows where the corresponding label is 0 and plot and then repeat (subbing 0 with 1), do you know how to do that?

earnest widget
#

Yeah so currently it is getting all the rows of column 0.

#

Which is one of my classes.

wooden sail
#

each column is a class?

boreal gale
#

0 is indeed one of your classes, but plotting just features_2d[:, 0] is not going to work. you haven't specified the y argument, also think about what is features_2d[:, 0], did it really select all rows of class 0?

earnest widget
boreal gale
earnest widget
#

I get all with just features_2d[:]

         0.17381935,  0.69571465],
       [-0.68409383,  1.4792389 ,  0.50605595, ..., -1.1951295 ,
        -0.1751789 ,  0.39030302],
       [-1.2169716 ,  0.4897305 ,  0.03768349, ..., -0.6940949 ,
        -1.2008945 ,  0.4508954 ],
       ...,
       [-1.0407516 , -0.3269264 ,  0.41814092, ..., -1.038974  ,
        -1.4509413 ,  1.5210139 ],
       [-1.2610931 ,  0.33272272,  0.88745356, ..., -1.881851  ,
        -1.0341158 ,  2.5147882 ],
       [ 0.05956531, -0.46555987,  1.7835644 , ..., -0.74892455,
        -2.190432  ,  1.3660964 ]], dtype=float32)
boreal gale
#

what you want is something like this
assuming all the red box is where the corresponding class is 0

#

"where the corresponding class is 0" meaning in the label array, the corresponding entry is 0
e.g. the 3rd element in the array is 0 for the first red box

earnest widget
boreal gale
#

i didn't really understand what you meant there, mind elaborating?

earnest widget
#

You said in the red box is where the corresponding class is, so it will show as [0,0], like that?

wooden sail
#

what shape is your data? i think ry means the data is of size N x 3 and the 3rd column is the class

boreal gale
#

i was operating under the assumption you have two arrays, a 2D array for the features, and a 1D array for the labels/classes

earnest widget
#

So my features_2d is 3488,1000
Labels is 3488.

#

That is the shape.

boreal gale
#

this is what i meant re. corresponding entry (i didn't highlight all 1s obviously)

earnest widget
#

Yeah yeah that

#

Like that is how I have my labels:

```
[0 0 0 0 0 0 0 0 0 0]
```
#

For each image.

#

Now I understood what you mean.

boreal gale
#

you lost me at the word image hmm

#

but do you know how to select those rows with label 1?

wooden sail
#

ok that also works. you can make an array of indices based on the labels, and use those to index the rows

earnest widget
wooden sail
#

maybe something like

indices = labels == 0
features_2d[indices, :]
#

indices is a boolean array with True for rows with a label of 0, and we can use that to weed out the class 0

#

something similar can be done for the class 1

#

here's a MWE

#

!e

import numpy as np
data2d = np.random.normal(size=(4,3))
labels = np.array([0,1,0,0])
print(data2d)
print(labels)
indices = labels == 0
print(data2d[indices, :])
arctic wedgeBOT
#

@wooden sail :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | [[-0.59429971 -0.92207244  0.48890026]
002 |  [ 0.22585973 -0.70344427 -0.46252381]
003 |  [ 0.46275675  1.01961049 -1.97198535]
004 |  [ 0.11588553  0.40670295  0.83123672]]
005 | [0 1 0 0]
006 | [[-0.59429971 -0.92207244  0.48890026]
007 |  [ 0.46275675  1.01961049 -1.97198535]
008 |  [ 0.11588553  0.40670295  0.83123672]]
earnest widget
#

Okay yeah that helps bring out the class 0 but what is indices for then? To make it include only 0?

wooden sail
#

to make it include only class 0

earnest widget
#

Yeah class 0, okay.

wooden sail
#

what ry has been saying all this time is that your plot is not split by classes, from what i understand. and they're trying to help you do that

earnest widget
#

Yeah I realized that when I was checking array that it was not showing correctly.

#

So I gotta redo my scatterplot.

muted crypt
#

Anyone familiar with neural networks here?

earnest widget
#

@boreal gale Going back to your statement earlier, what did you mean by two ax.scatter?

boreal gale
#

it smells slightly to not have legend defined where your scatter plot is though, hence i would prefer this

earnest widget
boreal gale
#

"features_2d_0 will contain all of the indices of class 0, right?"
not the indices: it contains all the rows that are of class 0.
"Then in the first ax.scatter, what is that doing?"
it plots a scatter plot, using the 0th column as x and the 1st column as y, for all points of class 0.
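
The two-scatter approach can be sketched roughly like this. The data is randomly generated as a stand-in, and the mapping of 0 to "Container" and 1 to "No_Container" is an assumption, since the real encoding is not shown in the chat.

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-in data: 2D features and one 0/1 label per row (hypothetical values).
rng = np.random.default_rng(0)
features_2d = rng.normal(size=(100, 2))
labels = rng.integers(0, 2, size=100)

fig, ax = plt.subplots()
for class_id, name in [(0, "Container"), (1, "No_Container")]:
    mask = labels == class_id  # boolean mask: True for the rows of this class
    # One scatter call per class, each carrying its own legend label.
    ax.scatter(features_2d[mask, 0], features_2d[mask, 1], label=name)

ax.legend(loc="upper right")  # picks up the label= given to each scatter call
ax.set_title("LDA Visualization")
# plt.show()  # or fig.savefig("lda_scatter.png") when running without a display
```

Because each class gets its own `scatter` call with a `label`, the legend lists both classes without passing `labels=` to `legend` by hand.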

earnest widget
#

Thanks a lot. @boreal gale @wooden sail

weary dust
#

hi , i am currently learning arduino and matlab i want to make a robot like this https://www.youtube.com/watch?v=WtEYMELvRHI&ab_channel=AtleFjellangSæther. is arduino and matlab enough for this? and should i change matlab with python

harsh stump
#

guys does this data looks or seems to be white noise ?

wooden sail
#

you can check. the main properties are zero mean, constant variance, and uncorrelatedness

#

subtract the mean value and check whether those properties hold
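
A rough sketch of those three checks with NumPy. The series here is synthetic, and both the half-versus-half variance comparison and the 1.96/sqrt(N) band are informal rules of thumb rather than a formal test.

```python
import numpy as np


def white_noise_summary(x, max_lag=10):
    """Return the mean, the variance of each half, and autocorrelations for lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()                      # subtract the mean first
    half = len(x) // 2
    var_first, var_second = centered[:half].var(), centered[half:].var()
    denom = np.dot(centered, centered)
    acf = np.array([np.dot(centered[:-k], centered[k:]) / denom
                    for k in range(1, max_lag + 1)])
    return x.mean(), (var_first, var_second), acf


rng = np.random.default_rng(0)
series = rng.normal(size=2000)                   # synthetic stand-in for the real data
mean, (v1, v2), acf = white_noise_summary(series)
bound = 1.96 / np.sqrt(len(series))              # rough 95% band for white-noise autocorrelations

print("mean:", mean)
print("variance, first vs second half:", v1, v2)
print("lags inside the band:", np.abs(acf) < bound)
```

For white noise the mean should be near zero, the two variances should be similar, and most autocorrelations should fall inside the band.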

harsh stump
rough lava
#

Hello people,
Question for Dropout
Is spatialDropout used in bi-lstm? or just cnn?
Cuz so far it helped my model a lot

foggy harness
#

I am working on an object detection project to detect road defects and I already have the data.

There are main classes and subclasses, for example one of the main classes is cracks, with subclasses such as multiple line crack, hairline crack, block crack, etc.

Right now we are trying out Grounding DINO on this, but it is giving a lot of noise and detecting things that are not cracks.

Currently, I am stuck in terms of accuracy/mAP, and any tips/advice on how to approach this object detection task would be greatly appreciated. Since cracks tend to look similar, I feel that simply using an object detection model would not be enough for good performance

Here is an example of the classes:

  1. Cracks
    ---Transverse crack
    ---Longitudinal crack
    ---Multi crack
    ---Alligator crack
    ---Block crack
    ---Rigid pavement crack
  2. Potholes
    ---Wet pothole
    ---Pothole with cracks
    ---Dry pothole
iron basalt
# weary dust hi , i am currently learning arduino and matlab i want to make a robot like this...

I recommend starting with the Arduino Programming Language (the default it comes with), which is similar to C++. Make some projects with that, no machine learning. Then learn some Python to run on your PC, make some projects with that, no machine learning. Then, if you still want to get into machine learning, you will need to learn some mathematics and play around with some machine learning libraries in Python, there are many resources for that.

faint mist
#

Hello everyone, I am trying to build a multistep LSTM-DNN model to forecast gold prices

#

Mainly it is a regression problem with timeseries data

uncut wasp
faint mist
#

What I am struggling to setup is the supervised dataset

past meteor
faint mist
#

Interesting, but I can see its more about feature extraction which is indeed part of the setup but not really what I am asking for

#

lets assume that for now, the only feature we have is the historical price

#

using the last 30 days of data to forecast the next 7 days

past meteor
#

Yeah Tsfresh has transforms to make your dataset "rolling" as well
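
For the price-only case, the 30-in / 7-out framing can also be sketched with plain NumPy. This is an illustration using `sliding_window_view`, not tsfresh's own API, and the price series below is a placeholder.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_supervised(prices, n_in=30, n_out=7):
    """Split a 1D series into (past n_in values, next n_out values) pairs."""
    prices = np.asarray(prices, dtype=float)
    windows = sliding_window_view(prices, n_in + n_out)  # shape: (samples, n_in + n_out)
    X = windows[:, :n_in]   # the 30 days the model sees
    y = windows[:, n_in:]   # the 7 days it has to forecast
    return X, y


prices = np.arange(100, dtype=float)  # placeholder for the historical price series
X, y = make_supervised(prices)
print(X.shape, y.shape)  # (64, 30) (64, 7)
```

For an LSTM, `X` would typically be reshaped to `(samples, 30, 1)`. The train/test split should be done by time, not by shuffling, because neighbouring windows overlap.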

faint mist
#

I see. Let me check it out

cosmic lynx
#

If python has a hard time with computationally demanding tasks, why is it popular for AI development?

agile cobalt
#

ofc, that requires using these libraries and writing code in a way that makes good use of the features they offer though

#

furthermore, when you add in GPUs / TPUs to the mix, doing it in python and letting pytorch/tensorflow optimise how to make good use of the GPU/TPU can make it orders of magnitudes faster
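
A small sketch of what "making good use of the libraries" means in practice: the same computation written as an interpreted Python loop and as a single NumPy call. The function names are made up for this example.

```python
import numpy as np

values = np.arange(100_000, dtype=np.float64)


def sum_of_squares_loop(xs):
    """Pure-Python loop: every multiply and add goes through the interpreter."""
    total = 0.0
    for x in xs:
        total += x * x
    return total


def sum_of_squares_numpy(xs):
    """Vectorised version: the loop runs inside NumPy's compiled code."""
    return float(np.dot(xs, xs))


print(sum_of_squares_loop(values))
print(sum_of_squares_numpy(values))
```

Both give the same answer up to floating-point rounding, but the vectorised version is usually much faster; timing them with `timeit` on your own machine shows the gap.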

cosmic lynx
#

Okay thanks

somber panther
#

so I'm just starting with pandas and have encountered some discrepancies between what the tutor is using and how the docs propose elements are extracted. Specifically, the tutor will grab a "cell" (my wording) and wrap it in int() to achieve element extraction, but the docs would have you write `x_value = row_data.loc[row_data.index[0], 'x']`
vs `x_value = int(row_data.x)`
and I'm getting the sense that using the tutor's approach is overlooking an important aspect of data manipulation

#

My question, then, is there a right and wrong way? why?

grand mason
#

Have you heard about the programming language Mojo? It's a superset of python, meant for high performance AI operations. I think it came out yesterday

#

I said came out, but it was just announced as a new project :D

agile cobalt
somber panther
#

its the 100days course, angela yu

agile cobalt
#

haven't seen it myself but yeah, without more context, the way they're doing it doesn't makes much sense

the two pieces of code you showed are for completely different purposes though

#

in `x_value = row_data.loc[row_data.index[0], 'x']`, row_data shouldn't be an actual row (pandas.Series), but rather a data frame (pandas.DataFrame)

somber panther
#

its for used in a dataframe state, x coor, y coor

agile cobalt
#

in x_value = int(row_data.x), row data should be a series, and you're taking the x value out of it, then converting it from a pandas/numpy float or integer to a python integer

somber panther
#

my code for row data is row_data = state_coors[state_coors["state"] == guess]

agile cobalt
#

!e ```py
import pandas as pd
df = pd.DataFrame({'state': [1, 2, 3]})
rows = df[df['state'] == 2]
print(rows.shape, rows, type(rows), sep='\n')
```

somber panther
#

she may have converted it to a series, i haven't watched the video since yesterday

arctic wedgeBOT
#

@agile cobalt :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | (1, 1)
002 |    state
003 | 1      2
004 | <class 'pandas.core.frame.DataFrame'>
somber panther
#

maybe i need to be more specific

#

her method of extracting an int was just completely different than what i got from docs

agile cobalt
#

there are a bunch of different ways to do more or less same thing tbf

#

like, these two would give you the same result:```py
df.iloc[0, 0]
df.loc[df.index[0], df.columns[0]]

# in case df = pd.DataFrame({'A': [10]})

df.loc[0, 'A']
```

somber panther
#

yeah no doubt I realize im kind of splitting hairs

agile cobalt
#

the biggest difference would be the type of the number you're getting out of that though

somber panther
#

I guess i'm concerned that im not seeing something core to data analysis with python

#

they're both ints, here's my complete code

#
```py
import turtle
import pandas as pd

screen = turtle.Screen()
screen.title("State Test")
image = "blank_states_img.gif"
screen.addshape(image)
turtle.shape(image)
state_coors = pd.read_csv("50_states.csv")
state_names = state_coors["state"].tolist()
print(state_names)


def get_mouse_click_coor(x, y):
    print(x, y)


def user_answer():
    return screen.textinput(title="Guess the states", prompt="Guess a state")


while True:
    guess = user_answer()
    if guess in state_names:
        row_data = state_coors[state_coors["state"] == guess]
        x_value = row_data.loc[row_data.index[0], 'x']
        y_value = row_data.loc[row_data.index[0], 'y']
        print("x_value:", x_value)
        print("y_value:", y_value)
        state_pointer = turtle.Turtle()
        state_pointer.shape("circle")
        state_pointer.color("black")
        state_pointer.penup()
        state_pointer.goto(x_value, y_value)

turtle.mainloop()
```
#

is the difference that 1 is numpy int and the other is core python?

agile cobalt
#

pretty much
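
A small sketch of that type difference, using a made-up two-row frame in place of `50_states.csv` (the coordinates are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV.
state_coors = pd.DataFrame({"state": ["Ohio", "Utah"], "x": [176, -114], "y": [52, 0]})
row_data = state_coors[state_coors["state"] == "Ohio"]  # still a DataFrame, with one row

via_loc = row_data.loc[row_data.index[0], "x"]  # NumPy integer scalar
via_item = row_data["x"].item()                 # plain Python int

print(type(via_loc), type(via_item))
print(via_loc == via_item)  # same value, different type
```

As far as I know, recent pandas versions emit a FutureWarning for `int(row_data.x)` on a one-element Series, so `.item()` or `int(row_data.x.iloc[0])` is the safer spelling if a plain Python int is needed.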

somber panther
#

would that ever matter?

agile cobalt
#

turtle might complain if you pass a numpy number to it, not sure

somber panther
#

nah it worked fine

#

from my code?

agile cobalt
#

eh, never mind

#

I was trying to demonstrate something but that something shouldn't really matter to you

somber panther
#

I don't want to waste your time with something that's unimportant 99.9% of the time

#

will be helpful to keep in the back of my head, ty

agile cobalt
#

back to the topic... as long as it works, you can just assume that it is a stylistic choice from the author

#

oftentimes you'll see like 5+ methods to do the same thing

#

and as for which one is the "right" way, usually you should stick with the official documentation

somber panther
#

yeah i get it, just growing pains, it took me 3 hours to get to the same outcome as row_data.x

dusty bay
#

Hi, I'm a novice programmer. I want to display a dataframe coming from a csv file using an object oriented programming method. Here I show the code.
"""

#

`import pandas as pd
import matplotlib.pyplot as plt

class csv2df():

    def __init__(self):
        df = pd.read_csv("RMS level.csv")`
#

Explanation please. Thank You
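
One way such a class could be laid out, as a sketch only: the class and method names are invented for illustration, and an in-memory CSV with hypothetical columns stands in for "RMS level.csv".

```python
import io

import pandas as pd


class CsvFrame:
    """Loads a CSV once and keeps the resulting DataFrame on the instance."""

    def __init__(self, source):
        # Storing the frame on self makes it available to other methods;
        # a plain local variable would be discarded when __init__ returns.
        self.df = pd.read_csv(source)

    def show(self, n=5):
        """Print the first n rows."""
        print(self.df.head(n))


# In-memory stand-in for a real file path (column names are hypothetical).
sample = io.StringIO("time,rms\n0,0.12\n1,0.15\n2,0.11\n")
frame = CsvFrame(sample)
frame.show()
```

With a real file, `CsvFrame("RMS level.csv")` would work the same way, since `pd.read_csv` accepts both paths and file-like objects.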

robust stratus
#

Is this a day in the life of a data scientist? They collect data and create a chart for it?

serene scaffold
robust stratus
serene scaffold
somber panther
#

or title the chart with "Science" perhaps

golden vapor
#

skills required to become a data scientist

somber panther
somber panther
golden vapor
somber panther
#

python or r, and sql

golden vapor
#

alright ty

jade plaza
#

I wanted to run a question by a few... I have an opportunity to buy a friends computer for ML/data analytics work...

  • AMD Threadripper 3970x
  • MSI RTX 3080 10G
  • 128GB (4x 32GB) DDR4 3200MHz
  • Gigabyte TRX40 Motherboard
  • 970 Evo Plus NVMe 500GB
  • 970 Evo NVMe 500GB
  • Corsair HX1200 PSU
  • Noctua NH-U14S

For $2350 USD -- how much value is in the 3970X now? Only concern for the above is locked into TRX40 which won't work with zen3 etc. PC part picker of the above tells me its $5700 but not sure how accurate that is.

faint mist
#

Not sure if this is a good deal given it is a used condition

jade plaza
#

Fair call

#

One factor is that I'm located in the south pacific on an island called Australia.

cloud marsh
#

how can i find out why pip refuses to install a package? i'm seeing the result when i run pip_search tensorflow-rocm but i can't run pip install tensorflow-rocm it shows 'no matching distribution'

cold osprey
#

i think it may be a python version thing

muted crypt
#

yes, I had a similar issue

cold osprey
#

hmm but seems like it does support 3.8++

muted crypt
#

Does anyone have an idea of why the predicted results look like this? They seem to be flipped along the horizontal axis as well as scaled down? (I've used LSTM)

cyan magnet
#

i m having this error can anyone help me to resolve this i googled about it and asked gpt4 and got to know it cuz of version mismatch of PyQt and opencv but idk which is the suitable version for opencv and PyQt

#

does anyone has any solution of this problem ?????

cold osprey
#

how many of a, b, c (s) are there?

#

is it 15? or are there 15 unique combinations of this

#

some are both a and b?

#

so there could be multiple procedures applied to the same patient(row)?

#

something like this seems good

#

if u have 1 million rows, should be fine to have 15 more cols

#

another way is to one hot encode the combination of procedures, which will be more than 15

#

Yeah don't think there ie

cyan magnet
cold osprey
#

Anyone got any resources on how to make dashboards look better? 😅

#

Design, colours etc

tall tulip
#

After taking the difference of my timeseries data I got negative values, can I change them to positive?

boreal gale
tall tulip
boreal gale
#

ah, i obviously lack caffeine in my body to read properly, heh.

no, you shouldn't make it positive.

tall tulip
#

and 1 more question: when I run the KPSS and ADF tests to check whether the data is stationary, the KPSS test says the data is stationary and the ADF test says it is not. After taking the diff, the KPSS test says the data is not stationary and the ADF test says it is.

#

Like both tests give opposite answers @boreal gale

cold osprey
#

u malaysian?

tall tulip
cold osprey
tall tulip
#

Nope

cold osprey
#

where u from

tall tulip
#

Pakistan

cold osprey
#

ah

tall tulip
#

Yep

boreal gale
#

so to recap you have only taken the first order difference, and performed ADF and KPSS test on the differenced time series?

have you looked into whether you have trend in your data?

earnest widget
#

Has anyone had any experience with image adaptive GAN reconstruction?

lavish lagoon
#

Does anyone have any experience with image enhancement using cv? I have a project including ocr and the video quality is poor and needs enhancement and I have tried everything I could find and I am really lost

tall tulip
robust stratus
#

Do they predict the sales of products for a company or something?

boreal gale
tall tulip
#

ARIMA or SARIMA

somber panther
boreal gale
# robust stratus What are some examples of Data Science work?
  • data collection: collect data from various sources, including databases, APIs, and the web.
  • data cleaning: raw data is often noisy and incomplete - part of the job is to preprocess and clean the data to remove outliers, fill in missing values, and standardise data formats.
  • exploratory data analysis (EDA): using visualisation tools and stats to explore and summarise the data, identify patterns (in a primitive way(?)), and detect outliers.
  • feature engineering: creating new features from the existing data in order to enhance the predictive power of models.
  • model development: use machine learning algorithms/traditional stats models to build predictive models that can make somewhat accurate predictions on unseen new data.
  • model evaluation: test and evaluate the performance of models to ensure their accuracy (among other evaluation metrics) and effectiveness.
  • deployment: data scientists deploy the model into production environments to make data-driven decisions.
  • monitoring: data scientists monitor the model's performance over time, detect any anomalies or changes in data patterns (e.g. population drift), and update the model accordingly.

depending on the organisation you work for, your focus might vary among these 8 tasks.
i assume most DS won't be collecting data/deploying their own model.

also, people always say cleaning the data takes 80% of your time, and making models takes 20% of your time.
i personally think this is just a meme, it depends on your org and your task really.

muted crypt
#

Hey @boreal gale do you mind if I ask a question about that problem that you helped me solve to find the error between two trajectories where you used DTW?

boreal gale
#

sure, what's up?

muted crypt
# boreal gale sure, what's up?

if you remember, we had the trajectories and one of them was shifted in time so dtw helped to find the closest distance. But is there a way to find the most optimal time shift that minimizes the error?

#

Referring to this, as the real trajectory (red) seems to be shifted to the left and it would be nice to find what's the most ideal shift so both of them can be considered to be happening at the same time

#

In this example is quite clear but there are other cases where it's not so easy to tell where to shift one of the sequences

boreal gale
#

But is there a way to find the most optimal time shift that minimizes the error?
just so we are on the same page, what is time shift?
is it a constant shift in time? or a dynamic shift in time?

because DTW assumes there could be a dynamic shift in time, as in time 1 in series A could be matched to time 1+4 in series B and time 2 in A could be matched to time 2+10 in B (the +4 and +10 there is not a constant)

muted crypt
boreal gale
#

oh! if it's constant time, then i don't think DTW is a right approach. i need to re-read the paper to fully confirm this though...
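
If the shift really is constant, one simple option is a brute-force search over candidate shifts, keeping the one with the smallest mean squared error. This is a sketch that assumes both trajectories are sampled at the same rate and that the shift is a whole number of samples; the function name is made up.

```python
import numpy as np


def best_constant_shift(intended, real, max_shift):
    """Return (shift, mse) such that real[t + shift] best matches intended[t]."""
    intended = np.asarray(intended, dtype=float)
    real = np.asarray(real, dtype=float)
    best = (0, np.inf)
    for shift in range(-max_shift, max_shift + 1):
        if shift >= 0:
            a, b = intended[: len(intended) - shift], real[shift:]
        else:
            a, b = intended[-shift:], real[: len(real) + shift]
        n = min(len(a), len(b))          # compare only the overlapping part
        if n == 0:
            continue
        mse = np.mean((a[:n] - b[:n]) ** 2)
        if mse < best[1]:
            best = (shift, mse)
    return best


# Synthetic example: the "real" signal lags the intended one by 5 samples.
idx = np.arange(200)
intended = np.sin(0.1 * idx)
real = np.sin(0.1 * (idx - 5))
shift, mse = best_constant_shift(intended, real, max_shift=20)
print(shift, mse)
```

The same function works on arrays of shape `(T, dims)` because the error is averaged over all elements. Keeping `max_shift` small relative to the series length avoids choosing a shift based on a tiny overlap.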

muted crypt
boreal gale
#

i see, you are making a "real trajectory synthesiser"

give the thing the intended trajectory, and you can expect a simulated "real trajectory" from it

i don't see anything that screams "no" to me here - but there is a very real possibility that this is one of those "unknown unknown" thing to me, i don't know what could fail until i attempt it

muted crypt
#

I mean it's possible as there are a lot of papers doing so but I can't get my head around how do they do it

#

The best I could do is something like this (green is the prediction, orange is the real truth, and blue is the intended)

#

(of course the prediction matches closely the intended trajectory but the problem is time once again)

boreal gale
#

from the description of your task, LSTM sounds sensible as a component in your NN.
also there is this concept of GAN, i wonder if you can incorporate it into your NN somehow

(it's important to note my knowledge to NN is almost purely academic, i haven't done any "real" NN development)

boreal gale
#

are you feeding it intended trajectory as features and the unshifted real trajectory you have observed as targets?

muted crypt
#

I'll take a look at it! I mean I have this as a thesis to deliver soon and I'm not a computer scientist or anything, so my skills are quite low (that's why you see me asking here all the time! thanks for saving me!) and having to learn about NN is just tough

muted crypt
#

it's either that or I have understood that wrong. As far as I know they usually take previous information to predict the future, but if I feed it all the data, then it asks for the input to be the same size, and I don't have things such as the error of a trajectory that hasn't been flown

boreal gale
#

since you have a real world understanding of your problem, you can potentially think about why the real trajectory is different to the intended trajectory, then come up with features that correspond to the underlying cause of deviation to intended trajectory. and feed those into the NN

e.g. is it about to turn a super tight corner? is it at max acceleration? i don't actually know the physics of drones so this is just me guessing.

muted crypt
# boreal gale since you have a real world understanding of your problem, you can potentially t...

it's mainly that the drone doesn't follow the intended trajectory exactly in terms of time. Say that it has to reach a turn at time=10 but the drone arrives at time=12, then somehow the drone speeds up and gets to the next point at time=20 instead of the intended time=18. This leads to most of the error. Of course the error is higher in the turns, as we could see from the dtw thing that you did! i marked the highest error points in the trajectory and these were the turns

#

these drones not perfectly adapting to the time marks of the intended path makes the trajectories very different and I thought that shifting them would make sense

boreal gale
# muted crypt so what i've tried here is so wrong but I was desperate to see something happen....

oh, you might want to normalise your input first, as i think having the raw lat lng will impede the NN from learning.

instead of latlng pairs like (62,15), (63,16)
you might instead want (0,0), (1,1)

because that path is virtually the same as (100,100),(101,101) (which normalises to the same (0,0), (1,1) pair), but your NN won't see it that way unless you normalise

i know it's not actually the same, because the way that lat lng works, but it's close enough also >90 for both of them is obviously wrong lol

muted crypt
#

(in 3D they look veeery similar though as we ignore the time variable)

robust stratus
muted crypt
muted crypt
boreal gale
#

"won't see it that way"

as in behaviour located at (62,15), (63,16) and (100,100),(101,101) could be treated differently

muted crypt
#

doesn't normalizing make (62,15), (63,16) == (100,100),(101,101) ?

#

oh

boreal gale
#

no it doesn't, if you are using scaler

#

consider if we only have those two "paths"
it (the scaler) would normalise to something like
(0,0),(0.1,0.1) and (0.99,0.99),(1,1)
which is not the same as treating both path as (0,0), (1,1)

muted crypt
#

so (62,15), (63,16) becomes (0,0), (1,1). What does (100,100),(101,101) become?

boreal gale
#

both should be (0,0), (1,1) imo

#

basically your input should be what's the trajectory relative to where you started, not the trajectory of actual absolute position on earth

muted crypt
#

then it is like subtracting the initial point to the rest of the points. Otherwise all the trajectories would have so many points in common?

boreal gale
#

then it is like subtracting the initial point to the rest of the points.
bingo

#

otherwise your NN could be learning different things for flights that happen in a different starting location

muted crypt
#

but my flights all happen in the same area

boreal gale
#

then that effect is somewhat reduced

muted crypt
boreal gale
muted crypt
boreal gale
#

ermm.. i guess?

basically you want to take each trajectory - i assume it's 2D np array (call it path), and subtract the starting position

path_deltas = path - path[0]
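
A tiny worked example of that subtraction, using the two example paths from earlier in the conversation:

```python
import numpy as np

path_a = np.array([[62.0, 15.0], [63.0, 16.0]])
path_b = np.array([[100.0, 100.0], [101.0, 101.0]])

# Broadcasting subtracts the first row (the starting position) from every row.
deltas_a = path_a - path_a[0]
deltas_b = path_b - path_b[0]

print(deltas_a)
print(np.array_equal(deltas_a, deltas_b))  # both paths become (0,0), (1,1)
```

After this step the network only sees movement relative to the start, so flights with the same shape look the same regardless of where they took off.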

muted crypt
#

My assumption was that this could be treated as a time series

boreal gale
#

oh

#

why though

#

each individual flights should be treated as one time series "observation" imo

young granite
#

u want to forecast so u dont have to do new measurments?

muted crypt
muted crypt
past meteor
muted crypt
# boreal gale not sure what you mean here

for the NN to learn, it kind of need to associate the intended trajectory to the real one so it learns what differs from one to another? like providing the input (intended) and output (real)

boreal gale
young granite
past meteor
#

I guess it depends on the domain but the 80/20 thing is probably true if you're in an applied science kind of domain where you're responsible for designing the study, collecting data, ...

young granite
#

if u get data from someone else the struggle begins 😄

past meteor
#

Last time I dealt with census data the results were so bad I contacted our government and they said "oopsie we made a mistake, sending you the new dataset in a minute."

young granite
#

government be like : shits on fire yo

past meteor
#

The entire thing was a joke. We had a large cross-section of a significant % of the population. Yes they hashed some fields but I'm pretty sure if you tried you could de-anonymize most people. (We obviously did not because this is illegal.)

muted crypt
#

that's what I wanted to hear, good job man

past meteor
#

What is your task @muted crypt ?

muted crypt
#

this sounds good but I'd be lying if I said that I know how to do this

boreal gale
#

the input to your LSTM should never contain the real trajectory. the predicted trajectory should be generated by these output units and then compared against the actual flown trajectory.

past meteor
#

Can you explain what you want to do?

muted crypt
#

I mean in the training dataset there has to be the real trajectory, marked as the label (y)? at least that's what I have

past meteor
#

You want to predict the trajectory of an object through time given a starting position?

young granite
#

so input should be ur expected drone trajectory and that output should be compared with ur real trajectory

muted crypt
# past meteor Can you explain what you want to do?

I have 2 datasets: one of intended drone trajectories and another of the real trajectories (already flown following the intended ones). From this, I have to develop a model that, given an intended trajectory, predicts what the real one will be

boreal gale
#

someone else can probably guide you better, since i only have academic experience with NN.
posting your current attempt, ideally with some example data, would really maximise traction.

anyhow, i gotta get back to the fun task of scraping data for now 👋

muted crypt
#

from the green one (intended path) as input, predict the red (real flown trajectory) one

young granite
#

this is misleading

#

so predicted is real?

#

u only want to compare pred with real wont u?

muted crypt
#

not compare

past meteor
#

So you have 3 input sequences (intended) and your output needs to be 3 output sequences (flown)?

muted crypt
#

literally provide the green plot and get the red one (or similar)

young granite
#

ah lel

#

u didnt build a model just yet?

#

so its green(input) and red(target)

#

but u didnt do calc. just yet?

past meteor
#

Do you expect the sequences to be independent of each other or not? E.g., the intended altitude has an effect on the flown longitude?

muted crypt
#

this is probably more illustrative

muted crypt
#

(being lat,long,alt)

past meteor
#

From my understanding you have a quite normal RNN set-up where you map intended -> flown

young granite
#

u can even choose to give a 3d array into the NN
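
something like this for the 3d array, assuming every flight has the same number of timesteps (the sizes and the random stand-in data are made up, swap in ur own):

```py
import numpy as np

# Assumed sizes; the 3 features are (lat, long, alt).
n_flights, n_steps, n_features = 4, 50, 3
rng = np.random.default_rng(0)

# Stand-in data: one (n_steps, 3) array per flight.
intended_flights = [
    rng.normal(size=(n_steps, n_features)).cumsum(axis=0)
    for _ in range(n_flights)
]
flown_flights = [
    p + rng.normal(scale=0.1, size=p.shape) for p in intended_flights
]

# Express both relative to the intended start, then stack into
# (n_flights, n_steps, n_features): X is the input, y the target.
X = np.stack([p - p[0] for p in intended_flights])
y = np.stack([f - p[0] for p, f in zip(intended_flights, flown_flights)])

print(X.shape, y.shape)  # (4, 50, 3) (4, 50, 3)
```

if the flights have different lengths u would need padding or resampling first, which this sketch skips.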

past meteor
#

You should start by sanity checking your model - use a training set of size 1 and get 0 loss

muted crypt
#

haha facts

cold osprey
#

red just looks like it's lagging behind green

cold osprey
#

and theres some kind of 'startup' time
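
if it really is mostly lag, a dumb baseline could be to estimate the shift directly. rough sketch on synthetic data (`best_lag` is a made-up helper, and the toy flight just holds the start position for a few steps and then follows the plan):

```py
import numpy as np

def best_lag(intended, flown, max_lag):
    """Shift (in timesteps) that minimises the MSE between
    intended[t - lag] and flown[t]."""
    errors = []
    for lag in range(max_lag + 1):
        a = intended[: len(intended) - lag]
        b = flown[lag:]
        errors.append(np.mean((a - b) ** 2))
    return int(np.argmin(errors))

# Synthetic intended path with (lat, long, alt)-like columns.
t = np.linspace(0, 2 * np.pi, 200)
intended = np.column_stack([np.sin(t), np.cos(t), t])

# Toy "flown" path: sits at the start for true_lag steps ('startup'),
# then follows the intended path with that delay.
true_lag = 5
flown = np.vstack([
    np.repeat(intended[:1], true_lag, axis=0),
    intended[:-true_lag],
])

print(best_lag(intended, flown, max_lag=20))
```

worth comparing any NN against something like this, otherwise u dont know if the network learned more than a delay.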

past meteor
#

For whatever DNN I'm making I always try to overfit a single example as proof that my architecture is bug-free
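
A minimal sketch of that check, with a tiny linear model in plain NumPy standing in for the real network (that swap is my assumption to keep it short, and the flight and target are synthetic). The same loop applies to an LSTM: one example, many steps, loss should go to roughly 0.

```py
import numpy as np

# One synthetic "flight": 50 timesteps of (lat, long, alt)-like values.
t = np.linspace(0, 2 * np.pi, 50)
intended = np.column_stack([np.sin(t), np.cos(t), t])
X = (intended - intended.mean(axis=0)) / intended.std(axis=0)

# Toy target that the stand-in model is able to represent exactly.
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.0],
              [0.0, 0.0, 1.1]])
Y = X @ A + np.array([0.05, -0.02, 0.1])

# Stand-in model: Y_hat = X @ W + b, trained with plain gradient descent.
W = np.zeros((3, 3))
b = np.zeros(3)
lr = 0.5
for step in range(500):
    err = X @ W + b - Y
    W -= lr * 2 * (X.T @ err) / err.size       # gradient of the mean squared error
    b -= lr * 2 * err.sum(axis=0) / err.size

final_loss = np.mean((X @ W + b - Y) ** 2)
print(final_loss)
```

If the loss on this single example does not collapse, the bug is in the data pipeline, the loss, or the training loop, not in the amount of data.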

past meteor
#

Yes

young granite
#

for a POC yes

cold osprey
#

so the model kinda represents the quirks of this particular drone flying

#

certain delays, route simplification etc it may have

young granite
#

reverse engineering the algo in the drone? 😄

muted crypt
#

yes

past meteor
#

If you're not hitting 0 loss on 1 flight there's something wrong