astral path Feb 5, 2021, 9:19 PM

#

wrong channel?

rancid ruin Feb 5, 2021, 9:22 PM

#

oops this is data sci

#

srry

odd aspen Feb 6, 2021, 12:29 AM

#

Difference between sentence encoders and tokenizers?

ripe forge Feb 6, 2021, 1:52 AM

#

tokenizers only break text into tokens (be it sentence tokenizers, word tokenizers, etc.)

#

encoders convert tokens (strings basically) into vectors- numbers that can be used by the models
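A minimal sketch of that distinction, using only the standard library. The whitespace tokenizer and the bag-of-words encoder below are illustrative assumptions, not how any particular NLP library works; real sentence encoders produce dense learned vectors rather than word counts.

```python
def tokenize(text):
    """Tokenizer: break a string into tokens (here, lowercase words)."""
    return text.lower().split()


def build_vocab(token_lists):
    """Assign each distinct token an integer position, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Encoder: turn tokens into a vector of numbers (a word-count vector)."""
    vec = [0] * len(vocab)
    for tok in tokens:
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec


sentences = ["the cat sat", "the cat saw the dog"]
tokenized = [tokenize(s) for s in sentences]      # strings -> lists of tokens
vocab = build_vocab(tokenized)
vectors = [encode(t, vocab) for t in tokenized]   # tokens -> numbers
print(tokenized)
print(vectors)
```

The tokenizer output is still text; only the encoder output is something a model can consume.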

astral path Feb 6, 2021, 1:55 AM

#

so ive ran into a problem

abstract zealot Feb 6, 2021, 1:55 AM

#

shoot

astral path Feb 6, 2021, 1:55 AM

#

basically my series looks like this now

📎 unknown.png

#

this is constructed from the code you gave me where i was making a series/dataframe with each combination of distances

#

im trying to enumerate over them and add the frequency to another array so it's not the frequency of each unique combination but more just another column in a dataframe

for i in enumerate(missed_series):
    freq = dist_freq_series.loc[missed_series[newi], next_series[newi]]
    freqs.append(freq)
    newi += 1

#

however, for some reason i'm getting KeyError: (26.0, 25.0), but I don't see why that shouldn't exist

#

wait never mind

#

posting this made me realize that the dataframe i was enumerating over was wrong

abstract zealot Feb 6, 2021, 1:59 AM

#

hahahaha

#

you have solved your own problem?

astral path Feb 6, 2021, 1:59 AM

#

yeah lol

#

¯_(ツ)_/¯

abstract zealot Feb 6, 2021, 1:59 AM

#

hahahaha

#

nice one

astral path Feb 6, 2021, 2:06 AM

#

mmmmmmmmm

📎 AcmD0Hto2rcJAAAAAElFTkSuQmCC.png

#

FINALLY

abstract zealot Feb 6, 2021, 2:08 AM

#

thats a sexy plot

#

love it

#

good job @astral path

astral path Feb 6, 2021, 2:08 AM

#

thank you !

#

idk if you follow basketball but it should make sense anyways

#

the plot is basically where the x axis is the distance from the hoop of a missed shot, and the y axis is the change in how far away the next shot is after that

abstract zealot Feb 6, 2021, 2:10 AM

#

this actually does make sense

#

its a really cool plot

astral path Feb 6, 2021, 2:10 AM

#

the size/color corresponds to the frequency of each pairing of missed shot and consecutive shot

#

i like this color more i think

📎 D9mPioPnWRJvAAAAAElFTkSuQmCC.png

hasty comet Feb 6, 2021, 3:53 AM

#

hi does anyone know about plotly treemap??

lapis sequoia Feb 6, 2021, 8:14 AM

#

google it

rugged comet Feb 6, 2021, 8:19 AM

#

When experiencing poor model performance, how do I know if I need more data or if I need to train longer?

limpid oak Feb 6, 2021, 9:55 AM

#

need help
i have this code:

HB_NO_List = list(VillageDataDict.keys())

for eachVillageHB_NO in HB_NO_List:
    TempVillage, TempHB_NO, TempScale = VillageDataDict[eachVillageHB_NO]
    print(TempVillage[0])
    Vilanename = str(TempVillage[0])
    FilterVillage = Killa_Boundary[Killa_Boundary['Village_Na'] == Vilanename]
    print(FilterVillage.head(1))

but it's giving me an empty dataframe though I have data there:

SHERON
   FID_Final_  Killa_No  Hb_No Village_Na  Mustil_No  Shape_Length
0       38711        15    266     SHERON        116     68.383971

                                            geometry
0  MULTILINESTRING ((495447.227 3470154.065, 4953...
KOT DAS GUNDI MAL
Empty GeoDataFrame
Columns: [FID_Final_, Killa_No, Hb_No, Village_Na, Mustil_No, Shape_Length, geometry]
Index: []

rugged comet Feb 6, 2021, 10:19 AM

#

!code Here's how to format Python code on Discord.

arctic wedgeBOT Feb 6, 2021, 10:19 AM

#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

limpid oak Feb 6, 2021, 10:20 AM

#

Found Error in my Script, Thank you for support

rugged comet Feb 6, 2021, 10:21 AM

#

You're welcome.

raw juniper Feb 6, 2021, 11:05 AM

#

hey guys

#

I have a question

#

I have a row in my excel file

📎 unknown.png

#

it stretches all the way to week 52

#

I want to change it to a datetime

raw juniper Feb 6, 2021, 11:33 AM

#

nvm I got it

last rivet Feb 6, 2021, 11:34 AM

#

@raw juniper what? Are you working in Excel, CSV, pandas dataframes? Actually that does not matter... it's python anyway
Your question does not make any sense at all.

Anyway, I guess your question is answered here if you looking to convert e.g 2021-W1 into: 2021-01-04
https://stackoverflow.com/a/1700069/3314139
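A small sketch of the week-to-date conversion mentioned above, using only the standard library. It assumes the week labels follow ISO 8601 week numbering (weeks start on Monday); the helper name `week_start` is made up for illustration.

```python
from datetime import date, datetime


def week_start(year, week):
    """Return the Monday of the given ISO week (requires Python 3.8+)."""
    return date.fromisocalendar(year, week, 1)


# The same conversion via strptime, using the ISO directives
# %G (ISO year), %V (ISO week number) and %u (weekday, 1 = Monday).
parsed = datetime.strptime("2021-W01-1", "%G-W%V-%u").date()

print(week_start(2021, 1))   # Monday of ISO week 1 of 2021
print(parsed)
```

Under ISO numbering, week 1 of 2021 begins on January 4th, which matches the date quoted in the message above.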

raw juniper Feb 6, 2021, 11:35 AM

#

I got the answer on my own thx btw

#

my question was essentially, I have a row that spans from week 1 to week 52

📎 unknown.png

#

I wanted to change it to this

📎 unknown.png

last rivet Feb 6, 2021, 11:38 AM

#

@raw juniper yes, that is what I provided above. Anyway, good job

raw juniper Feb 6, 2021, 11:38 AM

#

cool

last rivet Feb 6, 2021, 11:38 AM

#

Also week 1 started on January 4th btw, not 7th @raw juniper
Unless you typed a date into the formula to get a specific day of the date in the week

raw juniper Feb 6, 2021, 11:39 AM

#

oh

#

I see

cobalt jewel Feb 6, 2021, 1:25 PM

#

pandas question: In pyspark I can filter a df with df.filter("col1=x and col2=y"), but in pandas the only way I know is df[(df.col1==x)&(df.col2==y)]. This syntax gets very verbose when there are a lot of conditions and the df name is long. Wondering if there is a simpler syntax in pandas similar to what can be used in pyspark

ripe forge Feb 6, 2021, 1:36 PM

#

there is df.query, though i personally find normal filters easier to understand
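A quick sketch of how `df.query` compares with the boolean-mask syntax. The column names and values are made up for illustration; the `@` prefix is how `query` refers to local Python variables.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "col1": [1, 1, 2, 2],
        "col2": ["a", "b", "a", "b"],
        "value": [10, 20, 30, 40],
    }
)
x, y = 1, "b"

# Boolean-mask style: the dataframe name is repeated for every condition.
by_mask = df[(df.col1 == x) & (df.col2 == y)]

# query style: one string, closer to the pyspark filter expression.
by_query = df.query("col1 == @x and col2 == @y")

print(by_query)
```

Both select the same rows, so the choice is mostly about readability once the number of conditions grows.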

twin moth Feb 6, 2021, 1:36 PM

#

Hi guys, I'm trying to write a crawler for a school project in which I basically need to fetch thousands of images.
Currently I'm using Selenium in order to open the website and find the src and alt of the images, then use requests in order to download them.

Doing so means that I have to basically download each image twice - once when I load the page - and once when I download it afterwards

Do you know if it's possible to fetch the images out of the cache of the browser? I mean those images were rendered - they should be somewhere on the PC..
If not, (and I could not find any way to get it to work using GeckoDriver) I'd love some assistance regarding the usage of a proxy in order to do the same.

cobalt jewel Feb 6, 2021, 1:37 PM

#

@ripe forge thanks!

twin moth Feb 6, 2021, 1:44 PM

#

twin moth Hi guys, I'm trying to write a crawler for a school project which-in I need to b...

Bear in mind that I also need the alt of the images since I use it to parse them.
So I can't just cache them with a random name

stone jungle Feb 6, 2021, 2:07 PM

#

what is this type of data called?

📎 unknown.png

twin moth Feb 6, 2021, 2:35 PM

#

stone jungle what is this type of data called?

In case you are serious it is not possible to distinguish the type of such data from looks alone.

abstract zealot Feb 6, 2021, 3:13 PM

#

Encrypted perhaps ? @stone jungle

stone jungle Feb 6, 2021, 3:33 PM

#

Yeah ig

misty flint Feb 6, 2021, 4:24 PM

#

rugged comet When experiencing poor model performance, how do I know if I need more data or i...

this is also my question in case anyone has an answer.

woeful hamlet Feb 6, 2021, 4:36 PM

#

using SMOTE do i need to specify the classes?

#

OR it does everything?

#

Also, after using ImageDataGenerator, how can i take a look at the pictures?

#

like, how can i get them?

fleet trail Feb 6, 2021, 5:53 PM

#

stone jungle what is this type of data called?

I literally haven't seen any type of data resembling this, but even if there is such data, what pattern can be extracted from this data that could be useful in real life?

stone jungle Feb 6, 2021, 5:58 PM

#

fleet trail I literally haven't seen any type of data resembling this, but even if there is ...

It's a video file renamed as a TXT file

#

I wanted to make something that could compress video files but this is way too complex

fleet trail Feb 6, 2021, 5:59 PM

#

Ohh man, my bad 😆

#

I thought it was random gibberish, my bad

stone jungle Feb 6, 2021, 6:00 PM

#

It's alright, I r good

#

U r good bro

#

Btw see this star I captured from my telescope

#

📎 IMG_20210206_232643.jpg

fleet trail Feb 6, 2021, 6:01 PM

#

stone jungle

Cool

stone jungle Feb 6, 2021, 6:01 PM

#

Tq

woeful hamlet Feb 6, 2021, 6:25 PM

#

using SMOTE do i need to specify the classes?
OR it does everything?
Also, after using ImageDataGenerator, how can i take a look at the pictures?
like, how can i get them?

late shell Feb 6, 2021, 7:14 PM

#

Hello, I recently started ML and wanted to know something about epochs & alpha (learning rate). I coded up my own Simple Linear Regression model, and I used it on some data. At first I wasn't getting the same values as I got from sklearn, then I tinkered a bit with the epochs and alpha variables: I increased epochs from 1k to 2k and alpha from 0.0001 to 0.001, and I then had results that matched sklearn's Linear Regression. So I wanted to know how to choose the values for epochs and alpha, so that I don't have to do trial and error every time with my model.

lilac kindle Feb 6, 2021, 7:42 PM

#

How do I round to "one digit precision"? For example: 0.000784 -> 0.0008, 0.69 -> 0.7, 0.049683948 -> 0.05. Tried googling a bit but everyone points to Round() which doesn't work in this case.

lunar cloak Feb 6, 2021, 7:46 PM

#

Hi, does anyone know where the faster rcnn inception v2 model is? I tried looking in https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md but can't find it

nova widget Feb 6, 2021, 10:15 PM

#

late shell Hello I had recently started ML, and wanted to know something about epochs & alp...

An epoch is one full pass over the training data (a batch is just a slice of it), so more epochs generally means the model sees the data more times, while a low learning rate with few epochs will not teach the model enough. As these are variables and not fixed settings, they are dependent on the training data and the model itself. So depending on the settings you can teach your model fast (but maybe over-generalize), or slow (but not learn enough to make an accurate decision, leaving you with just a blurry network)

#

this all depends on the kinds of variations in the training data you want to accept

#

like training a model with emojis could be quite fast as you can teach the model to identify the exact emoji from the training data, while if you want to identify a person in a picture it needs to see more different examples of the same thing to understand what the pattern of "person" actually is

stray ivy Feb 6, 2021, 11:00 PM

#

does it make sense to use stochastic gradient descent, when i have numerical, ordinal, and nominal data? if so, then does it make sense to scale the ordinal data?

lapis sequoia Feb 6, 2021, 11:36 PM

#

Greetings, working on some NLP with spaCy.
IN python what does this mean?
xpos = a[1]['pos']

#

is that an array, first element? that has value of pos?

#

what if I wanted to check whether a[1]['pos'] exists first?

#

ok, that looks like it is a list

cobalt jewel Feb 6, 2021, 11:57 PM

#

df = pd.DataFrame([[1,2,'a','b'],[3,4,'c','d'],[5,6,'c','e']],columns=['inta','intb','Ticker','desc'])
df.groupby('Ticker').agg({'inta':'mean','intb':'mean','desc':'max'})

#

assuming that 'max' is an appropriate operation for aggregating your qualitative data

velvet thorn Feb 7, 2021, 12:22 AM

#

lilac kindle How do I round to "one digit precision"? For example: 0.000784 -> 0.0008, 0.69 -...

interesting question...

#

I would say

#

use log10 to find the first nonzero number

#

then floor divide by it

#

!e

from math import ceil, log10

def f(i):
    exp = ceil(log10(i)) - 1
    sig = round(i / 10 ** exp)
    return sig * 10 ** exp  

print(f(0.69))
print(f(0.000784))

arctic wedgeBOT Feb 7, 2021, 12:28 AM

#

@velvet thorn :white_check_mark: Your eval job has completed with return code 0.

001 | 0.7000000000000001
002 | 0.0008
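A variation on the same idea that sidesteps the `0.7000000000000001` float artifact by letting `round` do the final step with a computed number of digits. The function name is made up for illustration, and the handling of zero and negative inputs is an assumption, since the question only showed positive numbers.

```python
from math import floor, log10


def round_to_one_sig(x):
    """Round x to one significant digit."""
    if x == 0:
        return 0.0
    # Position of the leading digit: -4 for 0.000784, -1 for 0.69, 3 for 1234.
    exp = floor(log10(abs(x)))
    # round() accepts a negative digit count, so this also covers numbers >= 10.
    return round(x, -exp)


for value in (0.000784, 0.69, 0.049683948, 1234):
    print(value, "->", round_to_one_sig(value))
```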

drowsy shore Feb 7, 2021, 2:58 AM

#

hey everyone... need help with pandas... getting this value error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

#

transaction_cust.groupby('store_id').sum((abs(transaction_cust.tran_prod_discount_amt))/sum(transaction_cust.tran_prod_sale_amt))

misty flint Feb 7, 2021, 3:42 AM

#

ahhhhhhhhhhhhh

#

i always forget to capitalize Frame in DataFrame

#

ID_BoomKek

tidal bough Feb 7, 2021, 3:43 AM

#

~~from pandas import DataFrame as DF~~

misty flint Feb 7, 2021, 3:44 AM

#

ik ik

#

i just keep forgetting Clown2

#

inb4 alzheimers

nocturne plover Feb 7, 2021, 3:53 AM

#

uhm, is there any way for the gTTs lib to play without saving it

#

like
```py
from gtts import gTTS
text = input('Enter my speech.. ')
speech = gTTS(text)
```

#

is there something like speech.play or something, or do i have to use the create-and-delete approach

misty flint Feb 7, 2021, 5:58 AM

#

how come i HAD to do .size() first before i could do .sort_values()

📎 unknown.png

#

it returns this instead otherwise

📎 unknown.png

#

does it really not have that attribute, i think thats weird

gaunt thorn Feb 7, 2021, 7:59 AM

#

Every time I install the Python speech recognition package it gives an error.

lilac geyser Feb 7, 2021, 8:08 AM

#

@tidal bough
Is option a correct answer?

📎 Screenshot_2021-02-07-13-31-36-68_27cd8fadfb0bbe694fcb1be8871f11c2.jpg

#

Should I use hypergeometric distribution formula for calculating probability?
Or Which formula is easier to calculate for finding probability?

#

Please @ me once you see this message

tidal bough Feb 7, 2021, 8:15 AM

#

that's the binomial distribution, which can be approximated by a normal one with the right mean and std. So that's what one should use here, I believe.

#

the mean is 90/4 = 22.5, and the std is... what was it, sqrt(npq) I think? That's sqrt(90*3/4*1/4) = sqrt(270/16) ~= 4.108

#

>10 means a z-score of around >-3.043

tidal bough Feb 7, 2021, 8:19 AM

#

lilac geyser @tidal bough Is option a correct answer?

so almost always, I think it's the first answer. Very little of the normal distribution's mass is below -3 standard deviations.

#

(this is pretty clearly how you're supposed to do this, considering you even get a nice z-score of almost exactly -3).
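The arithmetic above can be checked with the standard library. This sketch assumes n = 90 trials with success probability 1/4, which is what the quoted mean and standard deviation imply; the original question was an image and is not reproduced here.

```python
from math import comb, erf, sqrt

n, p = 90, 0.25                  # assumed from the numbers quoted in the chat
mean = n * p                     # 22.5
std = sqrt(n * p * (1 - p))      # sqrt(npq)
z = (10 - mean) / std            # z-score of the cutoff 10


def normal_cdf(value):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1 + erf(value / sqrt(2)))


# Normal approximation of P(X > 10), without a continuity correction.
p_normal = 1 - normal_cdf(z)

# Exact binomial P(X > 10) for comparison.
p_exact = 1 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(11))

print(f"mean={mean}, std={std:.3f}, z={z:.3f}")
print(f"normal approx: {p_normal:.4f}, exact binomial: {p_exact:.4f}")
```

Both probabilities come out very close to 1, which is the "almost always" conclusion reached above.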

minor egret Feb 7, 2021, 8:31 AM

#

Hey,
can someone explain to me what differentiates valmin, valmax, slidermin, slidermax on Sliders in matplotlib and how to use them?

lilac geyser Feb 7, 2021, 8:32 AM

#

@tidal bough
So if the std is not given then we can find using formula and using normal distribution(z score) we can find the probability rite?
We can follow this method for most of the problems am I right?

lilac geyser Feb 7, 2021, 8:33 AM

#

tidal bough >10 means a z-score of around `>-3.043`

I didn't get this part
How is >10 z score is >-3.043?

misty flint Feb 7, 2021, 8:35 AM

#

makes sense to me

#

just look it up on a z table

lilac geyser Feb 7, 2021, 8:45 AM

#

Oops sorry I misunderstood
I thought that >10 z score of around>-3.043 was directly found
We got that value(-3.043) using this
Z=(x-mean)/sigma
So
Z=(10-22.5)/4.108 Z=-3.043
So >10 z score is around >-3.043

Since for me z table will be not available during my future exam
We can approximate it as 99.97%

Am I right?

tidal bough Feb 7, 2021, 9:22 AM

#

lilac geyser Oops sorry I misunderstood I thought that >10 z score of around>-3.043 was direc...

yup, like I said before, not having a z-table means you are supposed to approximate - and here, it's enough to have a very vague intuition that 3 standard deviations is a lot, so it's the three-nines answer.

ripe forge Feb 7, 2021, 9:24 AM

#

You can memorize the values for 1,2 and 3 std deviations to help

misty flint Feb 7, 2021, 9:32 AM

#

thats a good idea

#

68-95-99.9?

#

or something

#

pithink
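For reference, the rule of thumb is usually stated as 68-95-99.7. The values for 1, 2 and 3 standard deviations can be computed directly from the error function in the standard library:

```python
from math import erf, sqrt


def within_k_sigma(k):
    """P(|Z| < k) for a standard normal variable Z."""
    return erf(k / sqrt(2))


coverage = {k: within_k_sigma(k) for k in (1, 2, 3)}
for k, prob in coverage.items():
    print(f"within {k} std: {prob:.4%}")
```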

balmy junco Feb 7, 2021, 9:40 AM

#

is anyone here knowledgeable about xpath for webscraping, or is there another topic that is more relevant for that?

viral wasp Feb 7, 2021, 12:06 PM

#

fuck anaconda is so fucking confusing

#

can someone dm me so we could voice talk and he would teach me how to fucking use it

severe sentinel Feb 7, 2021, 12:11 PM

#

Anyone know how to rectify this? Trying to analyse a .csv file of netflix data to filter episodes of Friends, and when I try to add columns for days and hours, it returns an error.

📎 wFvSFAq7snJDQAAAABJRU5ErkJggg.png

#

Absolute beginner, apologies

stray ivy Feb 7, 2021, 12:27 PM

#

if you read the output, you'll notice that it's actually a warning, but not all warnings should be ignored.

stray ivy Feb 7, 2021, 12:28 PM

#

viral wasp fuck anaconda is so fucking confusing

this is honestly why i just use a text editor. a lot of IDEs introduce too much magic, and it just gets annoying when you have an idea of what you want to do

severe sentinel Feb 7, 2021, 12:43 PM

#

Oh so I can just carry on and it should generally work fine?

hasty grail Feb 7, 2021, 12:48 PM

#

In this case the warning can be ignored but at the cost of your code being less efficient

#

If you follow the provided link it should explain why

severe sentinel Feb 7, 2021, 12:58 PM

#

Brilliant, thank you!

noble sand Feb 7, 2021, 2:45 PM

#

I've got a problem which I can't seem to fix (it's NLP Natural Language Toolkit related!)

#

Sooo I run my application app.py on a Flask server and when it tries to execute the function preprocess_data (depicted below), I get a "TypeError" exception on this line: self.work_tokenize(self.splitsentence)

#

function preprocess_textdata

📎 func-qe-preprocessTextData.png

nova widget Feb 7, 2021, 2:49 PM

#

type(self.splitsentence)

#

you use the wrong datatype as input, or maybe you target some immutable

noble sand Feb 7, 2021, 2:59 PM

#

@nova widget I'm using BeautifulSoup module to format the extracted webpage data in a clean format and this gets rid of the punctuations (including full stops)..

#

Could that be the reason why NLTK's sent_tokenize isn't giving me any result ...?

#

As it couldn't find any full stops (periods) to split the text string into separated sentences , when the BeautifulSoup object parsed the data?

lapis sequoia Feb 7, 2021, 3:36 PM

#

noble sand Could that be the reason why NLTK's `sent_tokenize` isn't giving me any result ....

did you check what type is self.splitsentence

noble sand Feb 7, 2021, 4:08 PM

#

lapis sequoia did you check what type is `self.splitsentence`

type(self.splitsentence) doesn't return anything for some reason idk.... that's odd

river tangle Feb 7, 2021, 4:21 PM

#

Hello guys do you know any good tutorial? For ds?

noble sand Feb 7, 2021, 4:22 PM

#

river tangle Hello guys do you know any good tutorial? For ds?

Data Science is pretty broad... it depends what specific area you wanna delve into..

river tangle Feb 7, 2021, 4:29 PM

#

Sorry economics

misty flint Feb 7, 2021, 5:45 PM

#

a lot of economics people go into DS too

#

BongoCat

inland swallow Feb 7, 2021, 6:42 PM

#

Trying to transform X into a regression model. Is this correct:

#

Transform X, keep y then plot original vs y?

lapis sequoia Feb 7, 2021, 6:52 PM

#

noble sand `type(self.splitsentence)` doesn't return anything for some reason idk.... that'...

you have to print that

#

obviously

noble sand Feb 7, 2021, 10:27 PM

#

lapis sequoia you have to print that

@nova widget It's <class 'list'>... the line I showed you, which wasn't working earlier, executes perfectly now

#

I'm glad that line got executed without getting any exceptions now 😌

#

Before applying any NLP operations, I realise I get improperly decoded Unicode chars/symbols just after I make a request to several URL links and extract webdata...

#

Anyone know the reason why I get this? And how I might properly convert these sort of characters?

#

Decoded Unicode symbols/chars on one of my webpage that I'm constructing...

📎 pic-improperlyDecodedUnicode.png

empty fog Feb 7, 2021, 10:51 PM

#

I am using celery to load csv data as Panda dataframe and using Celery's group function to distribute it across multiple workers. The intent is to distribute the workload in an idempotent way so each worker can execute it.
I am using Data frame chunks to iterate and distribute small batches of datasets to process it by workers

@celery_app.task(bind=True, autoretry_for=(WorkSchedulingException,),
                 retry_kwargs={'max_retries': 5})
def make_inventory_partitions(self, inventory_bucket: str, gz_name: str, tsk_id):
    gz_path = f"s3://{inventory_bucket}/{gz_name}"
    chunk_size = int(getenv('PARTITION_SIZE', 1000))
    logger.info(f'Splitting inventory [{gz_name}] into chunks of {chunk_size} records per task')
    itr = 0
    try:
        group(
            process_inventory_chunk.s(txn_id=tsk_id, gz_name=gz_path, chunk=chunk.to_json()) for chunk in
            pd.read_csv(gz_path, chunksize=chunk_size, header=None, usecols=[1, 3], names=['path', 'dt']))()
    except Exception as err:
        logger.critical(f'Failed to schedule inventory partitions tasks [{tsk_id}]', exc_info=err)
        raise WorkSchedulingException(f'Failed to schedule inventory partitions tasks workloads for task [{tsk_id}]')

#

Is there any better approach, because my task and worker keep going OOM

fathom sail Feb 7, 2021, 11:04 PM

#

Hi, is anyone here familiar with beautifulsoup?

#

I would like to extract the attributes of the child elements in a form element

serene scaffold Feb 8, 2021, 1:07 AM

#

@fathom sail this might not be a data science question, but in either case you might consider opening a help session and sharing what code you've written so far. See #❓｜how-to-get-help

twin moth Feb 8, 2021, 1:27 AM

#

Any chance for some quick help?

#

I have a 700*350 np.ndarray which contains 3D tuples

#

I want to iterate over all of the tuples

#

I am using np.nditer() in order to achieve that but it just flattens my tuples as well and returns the values of the tuple.
I saw that there's op_axes= and itershape= as params but I could not figure out how to use them

serene scaffold Feb 8, 2021, 1:35 AM

#

@twin moth what's the larger context of the problem? If you can turn this into a (700,350,3) matrix, you can probably leverage NumPy's efficient iteration

#

If you use a for loop, you won't get that.
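If per-pixel access really is needed, one option is to reshape rather than use `np.nditer`, which visits every scalar. A minimal sketch, assuming the image is an `(H, W, 3)` array; the tiny array here stands in for the 700 by 350 image.

```python
import numpy as np

# Tiny stand-in for a (700, 350, 3) image: 2 rows, 3 columns, 3 channels.
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)

# Collapse the two spatial axes: one row per pixel, one column per channel.
pixels = image.reshape(-1, 3)

# Each row is now a whole pixel, so iterating yields RGB triples, not scalars.
as_tuples = [tuple(int(c) for c in px) for px in pixels]

print(pixels.shape)
print(as_tuples[:2])
```

This keeps each pixel together as a unit, though a Python-level loop over it still gives up NumPy's speed.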

twin moth Feb 8, 2021, 1:40 AM

#

It's already a matrix of that shape

#

But when I iterate over it using the aforementioned function it just iterates over the tuple values as well

lilac geyser Feb 8, 2021, 2:07 AM

#

tidal bough yup, like I said before, not having a z-table means you are supposed to approxim...

Thanks a lot for the help man!

lilac geyser Feb 8, 2021, 2:08 AM

#

misty flint 68-95-99.9?

Ya from now on I will remember all those values

misty flint Feb 8, 2021, 2:22 AM

#

its an easy interview question imo

#

DoggoKek

velvet thorn Feb 8, 2021, 2:24 AM

#

twin moth It's already a matrix of that shape

huh

#

so is it an array containing tuples or not

#

I'm a bit confused

#

🥴

hasty grail Feb 8, 2021, 4:35 AM

#

Got a feeling that you could vectorize whatever you're trying to do

velvet thorn Feb 8, 2021, 4:48 AM

#

^

nocturne plover Feb 8, 2021, 5:29 AM

#

Uh, i am trying my hands on the ||titanic dataset|| in kaggle and I am getting an unexpected output

#

For predictions, I turned the value of "male" and "female" to "1" and "0". Now, for submission, when I convert them back, I get the gender to be the passenger ID of the person

#

Any help is appreciated.....😩

#

Thanks in advance

misty flint Feb 8, 2021, 6:28 AM

#

why the censor

#

CryLaugh

#

also you should probably post how youre doing it

#

Sip

lapis sequoia Feb 8, 2021, 6:44 AM

#

noble sand Anyone know the reason why I get this? And how I might properly convert these so...

its the encoding

lapis sequoia Feb 8, 2021, 8:38 AM

#

Hey guys, any place I can practice some DS problems? Like some jupyter notebook cases, probability, machine learning concept questions, optimizing scenarios etc..

noble sand Feb 8, 2021, 9:55 AM

#

lapis sequoia its the encoding

What about the encoding @lapis sequoia? I've set the encoding to be UTF-8 as well..

ripe forge Feb 8, 2021, 10:54 AM

#

noble sand What about the encoding @lapis sequoia? I've set the encoding to be UTF-8...

One part of understanding how encoding works is understanding that just setting encodings to utf 8 while reading won't actually solve anything and may also increase the problems in some cases.
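A small illustration of that point: garbled symbols typically appear when bytes written in one encoding are decoded with another. The sample string and the choice of Latin-1 as the wrong codec are assumptions for the demo; the actual page in question may use a different pair of encodings.

```python
original = "café"

# The bytes as a UTF-8 web page would send them.
raw = original.encode("utf-8")

# Decoding those bytes with the wrong codec produces mojibake.
wrong = raw.decode("latin-1")

# Decoding with the codec the bytes were actually written in recovers the text.
right = raw.decode("utf-8")

# Text that was mis-decoded can sometimes be repaired by reversing the mistake.
repaired = wrong.encode("latin-1").decode("utf-8")

print(repr(wrong), repr(right), repr(repaired))
```

The fix is to find out which encoding the bytes really are in, rather than to declare UTF-8 everywhere.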

#

I'd highly recommend taking some time to read this. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

noble sand Feb 8, 2021, 11:04 AM

#

ripe forge I'd highly recommend taking some time to read this. https://www.joelonsoftware....

Thanks @ripe forge I'll check that out

#

I also do get other human-readable English text content extracted along with these mixed Unicode symbols

tall trail Feb 8, 2021, 11:06 AM

#

i cant seem to figure out how to find a row in a dataframe while in a for loop, it just keeps returning an empty dataframe. anyone know why this keeps happening?

more info:
i am trying to check if the value of test is above a threshold over 20 rows of data; if it is, write it to a separate dataframe.

here is my code so far:
test_data = np.random.randint(119,150, size= 300)

peak_df = pd.DataFrame(columns=['test'])
test_df = pd.DataFrame(test_data, columns=['test'])
tempdf = pd.DataFrame(columns=['test'])

print( '\nIndex:', type(test_df.test))

counter = 0

for i in range(len(test_df)):
    val = test_df.loc[i, 'test']
    if val > 119:
        counter = counter + 1
        tempdf.append(test_df.loc[i], ignore_index=True)
        # print(tempdf)
        if counter == 20:
            print('counter reached 20, appending')
            frames = [tempdf, df]
            # print(df)
            # print(tempdf)
            peak_df.append(tempdf, ignore_index=True)
    else:
        counter = 0
        if not tempdf.empty:
            tempdf.drop(['test'])

ripe forge Feb 8, 2021, 11:08 AM

#

General advice, if you're ever iterating on a dataframe, There's probably a better way to do whatever you're trying to do.
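One possible loop-free version of the "above a threshold for 20 rows in a row" check, as a sketch. The function name and the reading of the requirement (keep every row that belongs to a run of at least `window` consecutive values above the threshold) are assumptions. Separately, `DataFrame.append` returns a new object instead of modifying the dataframe in place, and it was removed in pandas 2.0, which is one reason an accumulator built that way can stay empty.

```python
import pandas as pd


def rows_in_long_runs(values, threshold, window):
    """Keep values that sit inside a run of >= `window` consecutive values above `threshold`."""
    above = values > threshold
    # A new run starts wherever the above/below state changes.
    run_id = (above != above.shift()).cumsum()
    # Length of the above-threshold run each row belongs to (0 for below-threshold rows).
    run_len = above.astype(int).groupby(run_id).transform("sum")
    return values[above & (run_len >= window)]


# Small demo with a window of 3 instead of 20.
demo = pd.Series([1, 5, 5, 5, 1, 5, 5])
peaks = rows_in_long_runs(demo, threshold=2, window=3)
print(peaks)
```

With the data from the question, the call would be along the lines of `rows_in_long_runs(test_df['test'], 119, 20)`.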

tall trail Feb 8, 2021, 11:09 AM

#

hehe, ya im still kinda new to dataframes and i dont quite get the lambda function yet. i see that gets used a lot

twin moth Feb 8, 2021, 11:58 AM

#

velvet thorn I'm a bit confused

Why?
It's an image, each of those tuples represent the color of the pixel

ripe forge Feb 8, 2021, 12:02 PM

#

They don't have to be tuples though, right?

#

More specifically it's just a 3d array, and at this stage there's no more "tuples" yes?

#

To phrase it differently, we understand that there's a logical relation and these are denoting the 3 channels of the images, but we very specifically don't store them as tuples anymore, correct?

#

(if tuples exist still in the array, they need to go. Don't mix tuples inside your np array).

velvet thorn Feb 8, 2021, 12:16 PM

#

twin moth Why? It's an image, each of those tuples represent the color of the pixel

then you should use a 3D array

#

not a 2D array of tuples

twin moth Feb 8, 2021, 1:01 PM

#

ripe forge To phrase it differently, we understand that there's a logical relation and thes...

Huh, okay, I'd try to remove them, but I'd still need to parse them as a single unit

hasty grail Feb 8, 2021, 1:01 PM

#

What specifically are you trying to do to the image?

hazy birch Feb 8, 2021, 1:13 PM

#

-help

#

i have project about SVM and NAIVE BAYES for sentiments analysis and classification

fading juniper Feb 8, 2021, 1:21 PM

#

Hey guys, I just joined the channel and I was wondering if anyone could point me in the right direction. I got the basics of python down, and can go as far as programming simple algorithms but I want to start developing, preferably for mobile banking. What steps should I take next and what frameworks should I start learning about?

hasty grail Feb 8, 2021, 1:27 PM

#

Hmm, that's really broad... do you want to deal with the data or the user interface first?

fading juniper Feb 8, 2021, 1:28 PM

#

Probably data, because once I know how it works, I feel learning UI will be easier

hasty grail Feb 8, 2021, 1:28 PM

#

Are you planning on building a database for this?

hazy birch Feb 8, 2021, 1:44 PM

#

hazy birch i have project about SVM and NAIVE BAYES for sentiments analysis and classificat...

my problem is where i can get the dataset. my classification target is a school suggestion box: i must be able to classify whether a suggestion is about school improvement, service, facility, school officials or professors

fading juniper Feb 8, 2021, 1:45 PM

#

hasty grail Are you planning on building a database for this?

I would like to, but thats the thing, I really don’t know where to start cause there's just so much to learn

ripe forge Feb 8, 2021, 1:46 PM

#

twin moth Huh, okay, I'd try to remove them, but I'd still need to parse them as a single ...

Yes. Numpy arrays will make all this a breeze if you let it.

twin moth Feb 8, 2021, 1:52 PM

#

ripe forge Yes. Numpy arrays will make all this a breeze if you *let it*.

I just tried, it failed because I used the tuples as a dictionary key and since ndarrays are not hashable I can't do the same with them

#

In order to find where the pixel sits in the scale and not iterate through the entire scale once again (~200 comparisons) I just insert the tuple to a dictionary and I check if the tuple is found in it for each pixel in the heatmap

ripe forge Feb 8, 2021, 1:55 PM

#

What are you trying to do?

twin moth Feb 8, 2021, 1:56 PM

#

I don't have much time to explain but I'm trying to parse those maps according to their scales
https://earthobservatory.nasa.gov/global-maps/MOD_NDVI_M
And insert each pixel and its position on the scale into a pandas DF

twin moth Feb 8, 2021, 1:56 PM

#

twin moth I don't have much time to explain but I'm trying to parse those maps according t...

That's only a single example, there are many different types of maps

ripe forge Feb 8, 2021, 1:57 PM

#

It's like you've taken a car, put it on water and then you're complaining it doesn't swim properly. Numpy arrays are the de facto standard of working with images, and are insanely fast because they vectorize their operations.

#

But you have an approach in mind already which doesn't sit well with how numpy arrays are meant to be used

#

Which isn't a bad thing btw, just might be worth getting your feet wet and seeing how you can use arrays as intended

twin moth Feb 8, 2021, 1:58 PM

#

I have a clean copy of a map, which I compare to the 'colored' map, I iterate each and every pixel, if they are "equal" (or close to it) I ignore that pixel, if it's not equal I search for the smallest distance between the RGB of the pixel and the colors on the scale

twin moth Feb 8, 2021, 1:59 PM

#

ripe forge Which isn't a bad thing btw, just might be worth getting your feet wet and seein...

I'm always down to learn

ripe forge Feb 8, 2021, 1:59 PM

#

So, when you vectorize, you can compare each pixel near instantly. One rule with arrays, no iteration.

twin moth Feb 8, 2021, 1:59 PM

#


def calculate_lowest_distance(colored_rgb: np.ndarray,
                              default_rgb: np.ndarray,
                              scale_list: list,
                              known_pixels_dict: dict) -> int:
    if any(not is_rgb_array(var) for var in [colored_rgb, default_rgb]):
        print("{} or {} is not an ndarray".format(colored_rgb, default_rgb))
        sys.exit(1)  # TODO: Error handling

    # ndarrays are unhashable, so use a tuple of the channel values as the cache key
    pixel_key = tuple(colored_rgb)
    if pixel_key in known_pixels_dict:
        return known_pixels_dict[pixel_key]

    min_distance = calc_rgb_distance(colored_rgb, default_rgb)
    min_dist_idx = None

    if min_distance == 0:
        known_pixels_dict[pixel_key] = min_dist_idx
        return min_dist_idx

    for idx, scale_pixel in enumerate(scale_list):
        distance = calc_rgb_distance(colored_rgb, scale_pixel)
        if distance == 0:
            min_dist_idx = idx
            break

        if distance < min_distance:
            min_distance = distance
            min_dist_idx = idx

    known_pixels_dict[pixel_key] = min_dist_idx
    return min_dist_idx

That's the current, not complete at all

ripe forge Feb 8, 2021, 2:02 PM

#

Okay. So immediately we see a lot of iterations here. It's a more traditional style of programming

#

When it comes to numpy arrays, this style of programming becomes a hindrance.

#

I need to be on a laptop for this..

#

Meanwhile, can you do a simple experiment? Make two numpy arrays, pure arrays no tuples inside

#

And just subtract them. a-b and just see the output

woeful hamlet Feb 8, 2021, 2:11 PM

#

how can i use SMOTE and flow_from_directory from Image Data Generator class together?

#

i found this so far

#

    for x, y in flow_from_directory:
        yield custom_balance(x, y, options)

#

How can i call SMOTE here?

hasty grail Feb 8, 2021, 2:22 PM

#

fading juniper I would like to, but thats the thing, I really don’t know where to start cause t...

Hmm I haven't really dealt with databases in Python in particular, you might want to see #databases

hasty grail Feb 8, 2021, 2:26 PM

#

twin moth ```py def calculate_lowest_distance(colored_rgb: np.ndarray, ...

Are you familiar with boolean masking in NumPy?
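
(For anyone who has not met boolean masking yet, here is a minimal sketch. It assumes two same-shaped RGB arrays; the names `clean` and `colored` and the tiny 2x2 images are made up for illustration. It marks the pixels that differ without any Python loop.)

```python
import numpy as np

# Two hypothetical 2x2 RGB images (height, width, channel);
# the real maps in this discussion would be (350, 700, 3).
clean = np.zeros((2, 2, 3), dtype=np.uint8)
colored = clean.copy()
colored[0, 1] = (10, 200, 30)   # pretend this pixel carries heatmap colour
colored[1, 0] = (0, 0, 5)

# Elementwise comparison gives a (2, 2, 3) boolean array; a pixel counts
# as changed if any of its three channels differs.
changed = np.any(colored != clean, axis=2)   # shape (2, 2)

# Indexing with the mask selects only the changed pixels,
# as an (n_changed, 3) array in row-major order.
changed_pixels = colored[changed]

print(changed)
print(changed_pixels)
```

The mask can then be reused to write results back only where they are needed, e.g. `result[changed] = ...`.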

fading juniper Feb 8, 2021, 2:55 PM

#

hasty grail Hmm I haven't really dealt with databases in Python in particular, you might wan...

Thanks

tall basin Feb 8, 2021, 4:23 PM

#

https://twitter.com/Br3Sc/status/1358811349053276166?s=20

Pierre de Fermat (@Br3Sc)

Python Basics in 9 tweets 😱, get in the thread

👇🧵

Oh, If you don't wanna miss out a tweet like this follow me It's free.

#100DaysOfCode #Python #codingtips #CodeNewbies #CodeNewbie #code #programming #pythonprogramming #Python3 #Computer #HowtoPerfect #AI #MachineLearning

twin moth Feb 8, 2021, 4:30 PM

#

hasty grail Are you familiar with boolean masking in NumPy?

Nope 🙊

twin moth Feb 8, 2021, 4:31 PM

#

ripe forge And just subtract them. `a-b` and just see the output

I tried that exact example yesterday but used tuples instead of ndarrays, seems like you can't subtract tuples

ripe forge Feb 8, 2021, 4:32 PM

#

mhm. took the car, threw it in water. 🙂

twin moth Feb 8, 2021, 4:33 PM

#

lol

#

!e

import numpy as np
arr1 = np.array([1,7,352,169])
arr2 = np.array([751,7,-352,8690])
arr3 = arr1 - arr2
arr3

arctic wedgeBOT Feb 8, 2021, 4:33 PM

#

You are not allowed to use that command here. Please use the #bot-commands channel instead.

twin moth Feb 8, 2021, 4:33 PM

#

😦

ripe forge Feb 8, 2021, 4:33 PM

#

aww hehe, you can use bot commands for it

twin moth Feb 8, 2021, 4:33 PM

#

Anyways, yes, it works with ndarrays

#

📎 unknown.png

ripe forge Feb 8, 2021, 4:34 PM

#

now heres the main thing, that's a "vectorized" calculation.

twin moth Feb 8, 2021, 4:34 PM

#

Doesn't it work with loops in the background?

ripe forge Feb 8, 2021, 4:34 PM

#

for science, why dont you make the same result via iteration, and then array subtraction. then, do a timeit. and increase the size of the array

#

ish. let's see if theres a difference in timings for large arrays first

#

and then we'll talk

twin moth Feb 8, 2021, 4:35 PM

#

ripe forge for science, why dont you make the same result via iteration, and then array sub...

Is that timeit a thing or did you just mean that I should create it myself using time.time()?

#

    start_time = time.time()
    print("Start at: {}".format(start_time))

    map_urls, scale_image_url = get_frame_urls(map_type)
    create_df(map_urls, scale_image_url).to_csv(csv_file)

    print("Finished at: {}".format(time.time()))
    print("--- {} seconds ---".format(time.time() - start_time))

ripe forge Feb 8, 2021, 4:35 PM

#

oh, you can use any method you prefer, but yes timeit is an actual thing too

twin moth Feb 8, 2021, 4:36 PM

#

ripe forge oh, you can use any method you prefer, but yes timeit is an actual thing too

I'll check it out, always happy to learn 😄

ripe forge Feb 8, 2021, 4:36 PM

#

timeit is great because it automatically repeats a calculation multiple times and gives you a mean and deviation of timings for some calculation.

twin moth Feb 8, 2021, 4:38 PM

#

Sounds amazing, I'm trying it now

chilly geyser Feb 8, 2021, 4:39 PM

#

Simpler to use jupyter's %%timeit magic

twin moth Feb 8, 2021, 4:56 PM

#

ripe forge timeit is great because it automatically repeats a calculation multiple times an...

Can it receive parameters in its stmt parameter?

ripe forge Feb 8, 2021, 4:57 PM

#

honestly ive always used it with the magic command 😛 but uh, i believe it can receive strings or functions
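
(A small sketch of both options with the standard-library `timeit` module. The function and the arrays are placeholders invented for the example; the point is only how arguments reach the timed call.)

```python
import timeit
import numpy as np

def vectorized_sub(arr1, arr2):
    return arr1 - arr2

a = np.arange(1000)
b = np.arange(1000)[::-1]

# stmt can be a zero-argument callable, so wrap the call in a lambda
# to pass arguments.
t_callable = timeit.timeit(lambda: vectorized_sub(a, b), number=1000)

# stmt can also be a string; the globals dict tells timeit where to
# look up the names used inside that string.
namespace = {"vectorized_sub": vectorized_sub, "a": a, "b": b}
t_string = timeit.timeit("vectorized_sub(a, b)", globals=namespace, number=1000)

# repeat() runs the whole measurement several times and returns a list
# with one total time per run.
runs = timeit.repeat(lambda: vectorized_sub(a, b), number=1000, repeat=5)

print(f"callable: {t_callable:.4f}s  string: {t_string:.4f}s  best of 5: {min(runs):.4f}s")
```

Unlike the IPython magic, the plain module does not print a mean and deviation by itself; the list returned by `repeat()` is what you would summarise.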

twin moth Feb 8, 2021, 4:57 PM

#

def iter_sub(arr1: np.ndarray, arr2: np.ndarray) -> np.ndarray:
    if len(arr1) != len(arr2):
        print("arr1 and arr2 not of the same length.")
        sys.exit(1)
    arr3 = np.zeros(len(arr1))
    for idx in range(len(arr1)):
        arr3[idx] = arr1[idx] - arr2[idx]
    return arr3

def vectorized_sub(arr1: np.ndarray, arr2: np.ndarray) -> np.ndarray:
    if len(arr1) != len(arr2):
        print("arr1 and arr2 not of the same length.")
        sys.exit(1)
    return arr1 - arr2

twin moth Feb 8, 2021, 4:57 PM

#

ripe forge honestly ive always used it with the magic command 😛 but uh, i believe it can r...

Care to give me an example?

ripe forge Feb 8, 2021, 4:57 PM

#

do you have an ipython? where do you run code

#

hm, for the iter_sub, for a fair comparison, i'd say get rid of the print, get rid of the if checks from both too

twin moth Feb 8, 2021, 4:58 PM

#

Specifically that code in a Python terminal

ripe forge Feb 8, 2021, 4:58 PM

#

at least for timeits

twin moth Feb 8, 2021, 4:58 PM

#

But I mostly write Python in PyCharm

ripe forge Feb 8, 2021, 4:58 PM

#

okay, that's probably not ipython is it

#

magic commands only work in ipython repls

twin moth Feb 8, 2021, 4:58 PM

#

ripe forge hm, for the iter_sub, for a fair comparison, i'd say get rid of the print, get r...

True, the prints were only for debugging purposes

ripe forge Feb 8, 2021, 4:58 PM

#

in your cmd, type ipython

chilly geyser Feb 8, 2021, 4:59 PM

#

Would recommend Google Colab instead

twin moth Feb 8, 2021, 4:59 PM

#

ripe forge in your cmd, type ipython

Terminal* 😉

#

Sec, I'm installing it

#

nvm, that was fast

ripe forge Feb 8, 2021, 5:00 PM

#

ah you dont have to. it's just a fancy way of running timeit

#

(well, ipython is a GREAT repl though)

twin moth Feb 8, 2021, 5:01 PM

#

Nah, again, always happy to learn about new standards and tools

#

Okay, so now, how do I use it?

ripe forge Feb 8, 2021, 5:02 PM

#

so, it's just a python repl. you can type and run stuff as usual

#

but it allows magic commands. those are prefixed with % and %%

twin moth Feb 8, 2021, 5:03 PM

#

I meant how do I use that magic timeit you were talking about

ripe forge Feb 8, 2021, 5:03 PM

#

so, try %timeit <command_here>

twin moth Feb 8, 2021, 5:03 PM

#

Interesting

#

With parameters?

chilly geyser Feb 8, 2021, 5:03 PM

#

This is what I get

📎 unknown.png

ripe forge Feb 8, 2021, 5:03 PM

#

yep, with params

chilly geyser Feb 8, 2021, 5:03 PM

#

computing courtesy of Google

twin moth Feb 8, 2021, 5:04 PM

#

chilly geyser computing courtesy of Google

Duck Google

#

How do I randomly fill the array?

ripe forge Feb 8, 2021, 5:08 PM

#

you can use one of the np.random. Say np.random.randint(1, 10, size=n)

twin moth Feb 8, 2021, 5:09 PM

#

📎 unknown.png

#

Amazing

chilly geyser Feb 8, 2021, 5:10 PM

#

Indeed

#

nice

twin moth Feb 8, 2021, 5:10 PM

#

ripe forge you can use one of the `np.random`. Say `np.random.randint(1, 10, size=n)`

Kewl

fiery minnow Feb 8, 2021, 5:11 PM

#

I started learning how to use SQL files with the sqlite3 library, and it's all going smoothly so far
but there's one thing that I can't seem to get correctly...
How can I view the SQL file in VSC?

ripe forge Feb 8, 2021, 5:11 PM

#

you cant. sqlite is a database, and so its file would need a program that can read sqlite databases

fiery minnow Feb 8, 2021, 5:12 PM

#

ripe forge you cant. sqlite is a database, and so it's file would need a program that can r...

Ok, can you give me a program that does that?

ripe forge Feb 8, 2021, 5:12 PM

#

https://sqlitebrowser.org/

DB Browser for SQLite

twin moth Feb 8, 2021, 5:13 PM

#

📎 unknown.png

#

It's so much faster

fiery minnow Feb 8, 2021, 5:14 PM

#

ripe forge https://sqlitebrowser.org/

I'm a little confused about what to do on this screen

📎 unknown.png

#

I'm using sqlite

#

but what's the difference between Desktop and Program Menu?

#

Is it just the shortcuts?

#

or something else?

ripe forge Feb 8, 2021, 5:16 PM

#

i think just shortcut

fiery minnow Feb 8, 2021, 5:16 PM

#

ok

#

@ripe forge Here, what do I change here?

📎 unknown.png

#

~~what are those features~~

ripe forge Feb 8, 2021, 5:18 PM

#

dont worry about it too much, choose what makes sense for you. if not sure, press next.

fiery minnow Feb 8, 2021, 5:21 PM

#

@ripe forge I opened it, is that what it should show?

📎 unknown.png

ripe forge Feb 8, 2021, 5:21 PM

#

yes

#

now just open a database (your sqlite db file), and click on browse data.

fiery minnow Feb 8, 2021, 5:22 PM

#

Yeah I did it

#

I'm planning on using it with my discord bot

#

Would it be good to make a table for every server the bot's in?

chilly geyser Feb 8, 2021, 5:23 PM

#

twin moth It's so much faster

If you have a need for speed, you might want to consider C++

ripe forge Feb 8, 2021, 5:28 PM

#

twin moth It's so much faster

yep vectorization is like a whole different world. really good for lots of number crunching. this is essentially the true power of numpy arrays

#

internally, its like "batch" iteration. the key being that it can load and execute operations in batches, as opposed to one at a time

#

for an analogy, it's like iteration is equivalent to making 100 cookies with a baking mold of 1 cookie only. you have to run the oven 100 times, and make one cookie at a time because thats all the mold you have. vs vectorization which is like someone gave you a mold with 20 cookies. so you run the oven 5 times and youre done, making 20 cookies at a time
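
(To see the analogy in numbers, here is a rough benchmark sketch. It uses stripped-down versions of the `iter_sub` and `vectorized_sub` functions posted earlier, without the prints and length checks; the array size and repeat count are arbitrary choices, and the timings will differ from machine to machine.)

```python
import timeit
import numpy as np

def iter_sub(arr1, arr2):
    # One cookie at a time: a Python-level loop over every element.
    out = np.zeros(len(arr1), dtype=arr1.dtype)
    for idx in range(len(arr1)):
        out[idx] = arr1[idx] - arr2[idx]
    return out

def vectorized_sub(arr1, arr2):
    # The whole tray at once: NumPy does the looping in compiled code.
    return arr1 - arr2

n = 100_000  # arbitrary size for the experiment
arr1 = np.random.randint(1, 10, size=n)
arr2 = np.random.randint(1, 10, size=n)

# Both approaches must agree before the timings mean anything.
same_result = np.array_equal(iter_sub(arr1, arr2), vectorized_sub(arr1, arr2))

loop_time = timeit.timeit(lambda: iter_sub(arr1, arr2), number=5)
vec_time = timeit.timeit(lambda: vectorized_sub(arr1, arr2), number=5)

print(f"same result: {same_result}")
print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```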

twin moth Feb 8, 2021, 5:51 PM

#

Sounds really good

#

But how would I use that ?

#

The shape of my array is 350, 700, 3

#

How do I make use of it here?

#

And lets say that I run that piece of code for the matrices as a whole (AKA vectorization) - I'd still need to compare each pixel to each of the pixels in the scale list

#

    return pow(lhs[0] - rhs[0], 2) + pow(lhs[1] - rhs[1], 2) + pow(lhs[2] - rhs[2], 2)

chilly geyser Feb 8, 2021, 5:54 PM

#

Are you using python's pow

twin moth Feb 8, 2021, 5:54 PM

#

Ummm guess so

chilly geyser Feb 8, 2021, 5:54 PM

#

Have you tried numpy's pow

#

I think it has one

#

Also, I just realised

#

This is linalg

#

You should use the norm function instead

#

np.square(lhs-rhs) should be better also, assuming lhs and rhs are 3-dim objects

bronze ermine Feb 8, 2021, 5:58 PM

#

Need some c++ help

#

Am i in the right place?

chilly geyser Feb 8, 2021, 5:59 PM

#

#ot0-psvm’s-eternal-disapproval

limpid oak Feb 8, 2021, 6:01 PM

#

ripe forge in your cmd, type ipython

Its so fast Thank you bro

ripe forge Feb 8, 2021, 6:05 PM

#

twin moth ```py return pow(lhs[0] - rhs[0], 2) + pow(lhs[1] - rhs[1], 2) + pow(lhs[2] ...

whats the purpose of the power here? but you can do the calculation in one go, result = (lhs - rhs) ** 2

#

then if you do need to split them apart later for some reason, do it at the end.
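
(A tiny worked example of that one-liner, with two made-up RGB values. It assumes plain integer arrays; image arrays are often `uint8`, where subtraction wraps around, so those would need a cast to a signed type first.)

```python
import numpy as np

lhs = np.array([10, 200, 30])
rhs = np.array([12, 198, 30])

# All three channel differences are squared in one go.
squared_diffs = (lhs - rhs) ** 2        # array([4, 4, 0])

# Summing the channels gives the squared RGB distance.
squared_distance = squared_diffs.sum()  # 8

# Same value as the channel-by-channel formula from the original function.
by_hand = pow(lhs[0] - rhs[0], 2) + pow(lhs[1] - rhs[1], 2) + pow(lhs[2] - rhs[2], 2)

print(squared_diffs, squared_distance, by_hand)
```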

twin moth Feb 8, 2021, 6:17 PM

#

chilly geyser `np.square(lhs-rhs)` should be better also, assuming `lhs` and `rhs` are 3-dim o...

Currently they are, but I'd rather run that function on the two matrices instead of on two 3-dim arrays

twin moth Feb 8, 2021, 6:20 PM

#

ripe forge whats the purpose of the power here? but you can do the calculation in one go, `...

That's useful

#

But again, I'd need to iterate over the entire matrix

#

Because I still have 350*700 such trios

ripe forge Feb 8, 2021, 6:21 PM

#

no iteration. what steps are missing now?

#

if theres any iteration step in the workload, youve lost a lot of the gains vectorization would have given you

twin moth Feb 8, 2021, 6:21 PM

#

I know

#

That's my current issue

#

As I previously said - what I currently do is:
I iterate over all of the pixels in an image.
Compare it to the same pixel in a similar map (using (rhs - lhs) ** 2)
If they're not the same (distance != 0) - I need to compare that pixel to each and every pixel in (1,200,3) ndarray

#

That's basically the algorithm

ripe forge Feb 8, 2021, 6:25 PM

#

why is the squaring being done?

twin moth Feb 8, 2021, 6:26 PM

#

That's the formula

#

📎 48d85816c09ff94368f2e74aaa0fcdf8bea5da50.png

ripe forge Feb 8, 2021, 6:28 PM

#

okay. that shouldnt be too bad. so only the last step remains. what is this (1,200,3) array, and why are we comparing our pixel distances (we're comparing distances, right?) against this?

twin moth Feb 8, 2021, 6:29 PM

#

RGB distances basically

#

Trying to figure out the value of each pixel in a heatmap

ripe forge Feb 8, 2021, 6:31 PM

#

i suppose the thought process here is this: can we achieve the values in our heatmap without having to do this comparison? it sounds a bit backwards to me

#

like, to me, if we know that we have a finite range of possible distance values, we should be able to coerce it into a range of values always. but maybe im missing something

twin moth Feb 8, 2021, 6:31 PM

#

It kinda is

#

But that's the only solution we found

ripe forge Feb 8, 2021, 6:32 PM

#

do you know the min and max values of your heatmap?

twin moth Feb 8, 2021, 6:33 PM

#

Some of the heatmaps have scales ranging between (for example) [brown...green]
And some have [red...white...blue]

#

And as you know white is (255,255,255) so it's not really in the middle

#

And I can't search for the brightest/darkest colors

twin moth Feb 8, 2021, 6:33 PM

#

ripe forge do you know the min and max values of your heatmap?

It differs on which heatmap

#

We have a couple we want to parse

#

But we don't care about the actual values, as we'd be normalizing them later on anyways

ripe forge Feb 8, 2021, 6:35 PM

#

okay. i suppose lets say we just stick to this approach

twin moth Feb 8, 2021, 6:35 PM

#

Okay

chilly geyser Feb 8, 2021, 6:39 PM

#

twin moth But again, I'd need to iterate over the entire matrix

You should be able to vectorize over this size too I think

twin moth Feb 8, 2021, 6:39 PM

#

Really? interesting

#

Got any idea how?

chilly geyser Feb 8, 2021, 6:39 PM

#

Essentially you have 2 ndarrays right

#

Why not try A-B

twin moth Feb 8, 2021, 6:40 PM

#

Let me try

rotund dock Feb 8, 2021, 6:40 PM

#

Hi!! I'm trying to find the path where my concentration flows in a river network, maybe someone here can help me.
So basically thats the data frame I have, I want to find the path using the columns N_out and N_next, I know where my path starts because the concentration is different to 0, then the idea would be to find the n_next of that row, append it to an array, search that n_next in the n_out and repeat until I reach the end of the rows (for this example i'm going to have 2 paths)

📎 Screenshot_2021-02-08_at_17.18.13.png

chilly geyser Feb 8, 2021, 6:41 PM

#

What do you mean by 'path where my concentration flows'?

#

This is looking like a network-theory problem more than #data-science-and-ml too

#

Do you mean

#

You want to make a forest of directed acyclic graphs

#

from your current data?

rotund dock Feb 8, 2021, 6:44 PM

#

No I then have to make some calculations with discharge and concentration

twin moth Feb 8, 2021, 6:44 PM

#

chilly geyser Why not try A-B

Works, but it's still in the shape of (350,700,3)

chilly geyser Feb 8, 2021, 6:44 PM

#

np.square(A-B) is that squared

#

You can np.sum(np.square(A-B))

twin moth Feb 8, 2021, 6:44 PM

#

chilly geyser np.square(A-B) is that squared

I think that it's squared, not sure

chilly geyser Feb 8, 2021, 6:44 PM

#

Well this should be the best NP can give you I think

#

Anything more, you can try JIT

#

Or other langauges

#

languages*

chilly geyser Feb 8, 2021, 6:45 PM

#

rotund dock No I then have to make some calculations with discharge and concentration

Is N_out some form of ID to be checked against?

rotund dock Feb 8, 2021, 6:45 PM

#

chilly geyser What do you mean by 'path where my concentration flows'?

So my concentration starts in n_out 73012836 and then goes to 73012412 and then 73011907 and then 73011906

twin moth Feb 8, 2021, 6:45 PM

#

I don't really need it that fast, you guys told me I should lmao

#

I'd love it as good as possible of course

rotund dock Feb 8, 2021, 6:45 PM

#

chilly geyser Is `N_out` some form of ID to be checked against?

yes I can check it against N_out

twin moth Feb 8, 2021, 6:45 PM

#

But it's for a school project, it doesn't have to be the most optimal solution

chilly geyser Feb 8, 2021, 6:45 PM

#

rotund dock So my concentration starts in n_out 73012836 and then goest to 73012412 and then...

How do I know if it ends?

twin moth Feb 8, 2021, 6:46 PM

#

chilly geyser Anything more, you can try JIT

How do you do JIT in Python?

rotund dock Feb 8, 2021, 6:46 PM

#

chilly geyser How do I know if it ends?

because the last index is the last node in my network

chilly geyser Feb 8, 2021, 6:46 PM

#

twin moth But it's for a school project, it doesn't have to be the most optimal solution

Optimality is always good

twin moth Feb 8, 2021, 6:46 PM

#

chilly geyser Optimality is always good

Agreed

chilly geyser Feb 8, 2021, 6:46 PM

#

twin moth How do you do JIT in Python?

This one, I'm not too well-versed in it, need to google for it, but if you don't need the speed, just check if it works first then be done with it

twin moth Feb 8, 2021, 6:46 PM

#

chilly geyser You can np.sum(np.square(A-B))

It sums everything up to a single int

chilly geyser Feb 8, 2021, 6:46 PM

#

rotund dock because the last index is the last node in my network

How do I identify 'last node'

twin moth Feb 8, 2021, 6:46 PM

#

I want a sum for each pixel

twin moth Feb 8, 2021, 6:46 PM

#

chilly geyser This one, I'm not too well-versed in it, need to google for it, but if you don't...

True

chilly geyser Feb 8, 2021, 6:47 PM

#

twin moth I want a sum for each pixel

You need to specify the axis of summation

#

Try axis=0, axis=1, axis=2, etc.

twin moth Feb 8, 2021, 6:47 PM

#

chilly geyser You need to specify the axis of summation

On it

chilly geyser Feb 8, 2021, 6:47 PM

#

A little brute forcey but you only have 3 indices, it's fine

#

If you're looking at 20 indices I'd look at solving this 'smartley'

rotund dock Feb 8, 2021, 6:47 PM

#

chilly geyser How do I identify 'last node'

it will always be the last row

chilly geyser Feb 8, 2021, 6:48 PM

#

rotund dock it will always be the last row

Wait then how does the 73011909 N_out work

rotund dock Feb 8, 2021, 6:48 PM

#

let me show you what I'm looking for as a result

chilly geyser Feb 8, 2021, 6:48 PM

#

You are telling me Path_1 is
73012836 -> 73012412 -> 73011907 -> 73011906

#

So path_2 is
73011909 -> 73011908 -> 73011907 -> 73011906?

#

By the way I'm asking you because code-wise, you'd want to know if your thing terminates

#

Also you'd want to know code-wise, what should be 'in-memory' at each point of the computation

twin moth Feb 8, 2021, 6:50 PM

#

chilly geyser Try axis=0, axis=1, axis=2, etc.

Looks like axis=2 solved the issue

rotund dock Feb 8, 2021, 6:50 PM

#

📎 Screenshot_2021-02-08_at_19.50.13.png

chilly geyser Feb 8, 2021, 6:50 PM

#

That means axis=2 is the RGB channel of each pixel, so summing over it leaves one value per pixel

rotund dock Feb 8, 2021, 6:50 PM

#

chilly geyser So path_2 is `73011909 -> 73011908 -> 73011907 -> 73011906`?

yes exactly

twin moth Feb 8, 2021, 6:50 PM

#

Which makes sense since it's (350,700,3) lol
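
(A sketch of that step on random stand-in images. `A` and `B` are the names used above; the random data and the cast to `int32` are additions for the example, the cast being there because `uint8` subtraction wraps around.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the coloured map and its clean copy.
A = rng.integers(0, 256, size=(350, 700, 3), dtype=np.uint8)
B = rng.integers(0, 256, size=(350, 700, 3), dtype=np.uint8)

# Widen to a signed type so negative differences do not wrap around.
diff = A.astype(np.int32) - B.astype(np.int32)

# Without axis=..., everything collapses to a single number.
total = np.sum(np.square(diff))

# Summing over the channel axis only keeps one squared distance per pixel.
sq_dist = np.sum(np.square(diff), axis=2)

print(np.ndim(total))    # 0
print(sq_dist.shape)     # (350, 700)
```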

chilly geyser Feb 8, 2021, 6:50 PM

#

rotund dock

Ok that's what I wrote. The thing is, how do I know that 73011906 is the last?

#

You can tell me because this csv is 8 nodes long

#

What if it were 52000 nodes long

rotund dock Feb 8, 2021, 6:51 PM

#

Because my nodes are already sorted

twin moth Feb 8, 2021, 6:51 PM

#

Okay, so that's actually an amazing step.
How do I compare it to all of the values in the list now?

rotund dock Feb 8, 2021, 6:51 PM

#

my last row will always be the last node of the network

chilly geyser Feb 8, 2021, 6:51 PM

#

twin moth Okay, so that actually an amazing step. How do I compare it to all of the values...

'it' = ?

twin moth Feb 8, 2021, 6:51 PM

#

And take the one closest to it if no match is found

twin moth Feb 8, 2021, 6:51 PM

#

chilly geyser 'it' = ?

Each pixel basically

chilly geyser Feb 8, 2021, 6:51 PM

#

You mean the distance?

twin moth Feb 8, 2021, 6:52 PM

#

That list is the list of pixels in the scale

chilly geyser Feb 8, 2021, 6:52 PM

#

I thought that list is now the list of distances?

twin moth Feb 8, 2021, 6:52 PM

#

chilly geyser You mean the distance?

Yes, but to be specific it's the list of pixels in the scale

chilly geyser Feb 8, 2021, 6:52 PM

#

With a distance you create a sphere

twin moth Feb 8, 2021, 6:52 PM

#

Would you mind joining me on VC? It'd be much easier to explain

chilly geyser Feb 8, 2021, 6:53 PM

#

I rather not speak sorry

twin moth Feb 8, 2021, 6:53 PM

#

All good

#

So basically we have a heatmap, the heatmap has pixels which can not be found in the scale because of compression issues (I guess)

chilly geyser Feb 8, 2021, 6:53 PM

#

rotund dock my last row will always be the last node of the network

As for you, ok, one thing is I'd store this last node first, then after that I just do a filter on concentration, then keep following the N_next, N_out thing til I reach the last node
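
(A minimal sketch of that idea. The node IDs come from the two paths discussed above, but the table itself, the concentration values, the column name `concentration` and the `0` used as the last node's `N_next` are all made up, since the real data is only in the screenshot.)

```python
import pandas as pd

# Hypothetical reconstruction of the network table; the last row is the outlet.
df = pd.DataFrame({
    "N_out":  [73012836, 73012412, 73011909, 73011908, 73011907, 73011906],
    "N_next": [73012412, 73011907, 73011908, 73011907, 73011906, 0],
    "concentration": [5.0, 0.0, 3.0, 0.0, 0.0, 0.0],
})

# Store the last node first: it is the stopping condition.
last_node = int(df["N_out"].iloc[-1])

# Lookup table: node -> next node downstream.
next_of = dict(zip(df["N_out"].tolist(), df["N_next"].tolist()))

# Paths start wherever the concentration is non-zero.
starts = df.loc[df["concentration"] != 0, "N_out"].tolist()

paths = []
for start in starts:
    path = [start]
    # Follow N_next until the outlet; the length guard stops a malformed
    # (cyclic) table from looping forever.
    while path[-1] != last_node and len(path) <= len(df):
        path.append(next_of[path[-1]])
    paths.append(path)

for path in paths:
    print(" -> ".join(str(node) for node in path))
```

Looping is reasonable here because each step depends on the previous one; the dictionary keeps every hop a constant-time lookup instead of a fresh search through the dataframe.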

twin moth Feb 8, 2021, 6:54 PM

#

So we need to find the minimum distance between each pixel in the scale and each pixel in the map if the distance != 0 between the two original images

chilly geyser Feb 8, 2021, 6:54 PM

#

What's a heatmap

#

I mean, I know what's a heatmap

twin moth Feb 8, 2021, 6:54 PM

#

lol

chilly geyser Feb 8, 2021, 6:54 PM

#

But the heatmap is the original 3D array?

#

Also, what's a scale here

twin moth Feb 8, 2021, 6:54 PM

#

https://earthobservatory.nasa.gov/global-maps/MOD_NDVI_M


chilly geyser Feb 8, 2021, 6:55 PM

#

Also, another 3D array?

twin moth Feb 8, 2021, 6:55 PM

#

That's an example

twin moth Feb 8, 2021, 6:55 PM

#

chilly geyser Also, another 3D array?

Yup

chilly geyser Feb 8, 2021, 6:55 PM

#

So you have 2 RGB arrays I assume

rotund dock Feb 8, 2021, 6:55 PM

#

chilly geyser As for you, ok, one thing is I'd store this last node first, then after that I j...

Okey let me give that a try

twin moth Feb 8, 2021, 6:55 PM

#

But this time it's already filtered, I took a single strip from the scale (1 pixel wide) and that should represent the values

chilly geyser Feb 8, 2021, 6:56 PM

#

twin moth So we need to find the minimum distance between each pixel in the scale and each...

Hmm you want to solve for a minimum distance? I'm not too sure what the if here means actually

twin moth Feb 8, 2021, 6:56 PM

#

chilly geyser So you have 2 RGB arrays I assume

Two images which are basically (350,700,3) and another scale which is (1,200,3)

twin moth Feb 8, 2021, 6:56 PM

#

chilly geyser Hmm you want to solve for a minimum distance? I'm not too sure what the if here ...

distance != 0, sorry

chilly geyser Feb 8, 2021, 6:56 PM

#

Distance != 0, mathematically, if and only if not the same thing

twin moth Feb 8, 2021, 6:56 PM

#

(BTW, thanks for all the help, you're awesome)

twin moth Feb 8, 2021, 6:57 PM

#

chilly geyser Distance != 0, mathematically, if and only if not the same thing

Huh?

chilly geyser Feb 8, 2021, 6:57 PM

#

ok ignore that

twin moth Feb 8, 2021, 6:57 PM

#

I didn't understand what you meant but I'd add that the distance^2 == 0 only if the colors are the same

chilly geyser Feb 8, 2021, 6:57 PM

#

Yes

#

Currently

#

Your image is a vector in R^350*700*3 space yes, no?

twin moth Feb 8, 2021, 6:59 PM

#

chilly geyser Your image is a vector in R^`350*700*3` space yes, no?

True, so basically (to make sure we're all on the same page) - height = 350, width = 700, and each of those contains a trio which represents the RGB

chilly geyser Feb 8, 2021, 6:59 PM

#

Indeed

#

So now you have the distance between two images

twin moth Feb 8, 2021, 6:59 PM

#

And other than that we have another array of all the RGBs in the scale

twin moth Feb 8, 2021, 6:59 PM

#

chilly geyser So now you have the distance between two images

Correct

#

(Which is *ucking awesome!)

chilly geyser Feb 8, 2021, 7:00 PM

#

Ok I think I get what you mean

#

Is your scale also R^350*700*3?

twin moth Feb 8, 2021, 7:01 PM

#

BTW, currently if I find an exact match (0 between the matrices) or between the pixel and a pixel in the scale I just return its index

chilly geyser Feb 8, 2021, 7:01 PM

#

Oh ok I get it

#

Your scale is a subset of values

twin moth Feb 8, 2021, 7:01 PM

#

chilly geyser Is your scale also R^`350*700*3`?

No, the scale is R^1*200*3

#

Even less than 200 because I removed the outliers

chilly geyser Feb 8, 2021, 7:02 PM

#

The scale is 600 values of acceptable values

#

Then

#

For anything not in the scale

#

You want a minimum colour distance

#

ok

twin moth Feb 8, 2021, 7:02 PM

#

chilly geyser The scale is 600 values of acceptable values

ummm not really

chilly geyser Feb 8, 2021, 7:02 PM

#

?? 1*200*3 is 600?

twin moth Feb 8, 2021, 7:02 PM

#

Because it's 600 only if you separate the R G and B

chilly geyser Feb 8, 2021, 7:03 PM

#

Ah ok

#

200 acceptable values for R data itself

twin moth Feb 8, 2021, 7:03 PM

#

True

chilly geyser Feb 8, 2021, 7:03 PM

#

200 acceptable values for G data itself

#

ditto B

#

ok ok

twin moth Feb 8, 2021, 7:03 PM

#

Huh?

chilly geyser Feb 8, 2021, 7:03 PM

#

Is that not the case?

ripe forge Feb 8, 2021, 7:03 PM

#

its probably more like the same distance formula applied here

#

to find the closest "pixel" that makes sense

twin moth Feb 8, 2021, 7:04 PM

#

ripe forge its probably more like the same distance formula applied here

Exact one

ripe forge Feb 8, 2021, 7:04 PM

#

so, if i understand correctly, you just need a distance matrix

twin moth Feb 8, 2021, 7:04 PM

#

chilly geyser Is that not the case?

Sorry, I didn't understand what you meant

chilly geyser Feb 8, 2021, 7:04 PM

#

I think the closest pixel isn't unique though

ripe forge Feb 8, 2021, 7:04 PM

#

hm, good shout, it's not. i assume that doesnt matter though

twin moth Feb 8, 2021, 7:05 PM

#

ripe forge so, if i understand correctly, you just need a distance matrix

Yeah, but this time between each pixel on the heatmap and each pixel on the scale

#

So basically 350*700*1*200

chilly geyser Feb 8, 2021, 7:05 PM

#

Ok I think I get it

ripe forge Feb 8, 2021, 7:05 PM

#

theres something in scipy i think, for making a distance matrix

chilly geyser Feb 8, 2021, 7:05 PM

#

The data is getting huge though

twin moth Feb 8, 2021, 7:05 PM

#

chilly geyser I think the closest pixel isn't unique though

It is, because unfortunately the scale doesn't contain all of the values

#

Some are really close and we'd rather just take the closest

chilly geyser Feb 8, 2021, 7:05 PM

#

it's like ~120MB

twin moth Feb 8, 2021, 7:06 PM

#

Unless you have a better idea

chilly geyser Feb 8, 2021, 7:06 PM

#

Ok I have handled bigger Dist mats before

#

I don't recommend anything >1Gb basically

ripe forge Feb 8, 2021, 7:06 PM

#

yeah this is still tiny in the grand scheme of things
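
(A rough size check for that distance matrix, assuming one floating-point number per pixel and scale colour pair. The dtype is an assumption, so both `float64` and `float32` are shown; the actual footprint depends on what the distance routine returns.)

```python
import numpy as np

n_pixels = 350 * 700   # 245,000 heatmap pixels
n_scale = 200          # colours in the scale strip

entries = n_pixels * n_scale   # 49,000,000 distances

bytes_f64 = entries * np.dtype(np.float64).itemsize
bytes_f32 = entries * np.dtype(np.float32).itemsize

print(f"{entries:,} entries")
print(f"float64: {bytes_f64 / 1e6:.0f} MB")   # 392 MB
print(f"float32: {bytes_f32 / 1e6:.0f} MB")   # 196 MB
```

Either way it stays under the 1 GB line mentioned above.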

chilly geyser Feb 8, 2021, 7:06 PM

#

Yeah I think the 'fastest' way is to take the distance matrix

#

And find the minimizer

#

For each pixel

#

If you vectorize it, I can't say anything; it probably doesn't fit in fast cache memory, so likely it will be an iterative process

twin moth Feb 8, 2021, 7:07 PM

#

So iterations?

chilly geyser Feb 8, 2021, 7:08 PM

#

Ya iterate over each pixel, I think?

ripe forge Feb 8, 2021, 7:08 PM

#

nah, i'd say distance matrix

chilly geyser Feb 8, 2021, 7:08 PM

#

For dist mat IDK how it'd be done

ripe forge Feb 8, 2021, 7:08 PM

#

ok, sec. lemme go find it

chilly geyser Feb 8, 2021, 7:08 PM

#

I mean, I don't know the functions off the top of my head at least

ripe forge Feb 8, 2021, 7:08 PM

#

https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html

#

p=2 for euclidean, which is default anyways. just dump your matrix there, and it should spit out the distance matrix

chilly geyser Feb 8, 2021, 7:09 PM

#

Oh yeah just try that function first

#

Might need to fiddle a bit with shape so it likes it, but I think it's fine right now. I think your last .shape is 3, which should be what it wants

ripe forge Feb 8, 2021, 7:10 PM

#

its actually not happy with the shape as is

chilly geyser Feb 8, 2021, 7:10 PM

#

Welp, that means reshaping I think

ripe forge Feb 8, 2021, 7:11 PM

#

yep. slightly annoying but yeah

twin moth Feb 8, 2021, 7:11 PM

#

Wait, how does it help me?

chilly geyser Feb 8, 2021, 7:11 PM

#

You're looking for minimizers on this distance matrix

ripe forge Feb 8, 2021, 7:11 PM

#

so, the thought process is, you get a distance of each pixel against the 200 pixels

chilly geyser Feb 8, 2021, 7:11 PM

#

^yup

ripe forge Feb 8, 2021, 7:12 PM

#

and then you choose the smallest distance from there for each pixel

#

(thats the one weird part about vectorization which i always find hard to digest. It's completely happy with doing really wasteful calculations just to reach an answer, and it still scales insanely well)

twin moth Feb 8, 2021, 7:14 PM

#

Would I also get index of the smallest distance in the scale?

ripe forge Feb 8, 2021, 7:14 PM

#

in a separate step, yep

#

once you have the distance matrix, then that would be your last task.

#

so, say your rgb array is a 3d array yes? then you do a reshape to turn it into a 2d (n_pixel, 3) array. same deal with the scale array, which becomes (200, 3). then the distance matrix should give you a (n_pixel, 200) array.

chilly geyser Feb 8, 2021, 7:17 PM

#

^seems like it

ripe forge Feb 8, 2021, 7:17 PM

#

at this point, if you use .argmin method, with the correct axis, you'll get the index for the min distance for each pixel.

chilly geyser Feb 8, 2021, 7:17 PM

#

The issue is

#

Need to remap each n_pixel back to the original pixel matrix hmm?

#

That might be annoying again

ripe forge Feb 8, 2021, 7:17 PM

#

a simple reshape back will fix that, no issue
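
(Putting the steps together as a sketch: reshape, distance matrix, argmin, reshape back. The arrays are small random stand-ins, a (4, 5, 3) "heatmap" and a (1, 6, 3) scale strip instead of (350, 700, 3) and (1, 200, 3), and the variable names are invented for the example.)

```python
import numpy as np
from scipy.spatial import distance_matrix

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the heatmap and the one-pixel-wide scale strip.
heatmap = rng.integers(0, 256, size=(4, 5, 3)).astype(float)
scale = rng.integers(0, 256, size=(1, 6, 3)).astype(float)

h, w, _ = heatmap.shape

# Flatten both to 2D "list of colours" form, which distance_matrix expects.
pixels = heatmap.reshape(-1, 3)        # (h * w, 3)
scale_colours = scale.reshape(-1, 3)   # (6, 3)

# One row per heatmap pixel, one column per scale colour (Euclidean, p=2).
dists = distance_matrix(pixels, scale_colours)   # (h * w, 6)

# For each pixel, the column index of the smallest distance
# is the position of the closest colour on the scale.
nearest = dists.argmin(axis=1)                   # (h * w,)

# reshape(-1, 3) and reshape(h, w) both use row-major order,
# so the indices land back on the pixels they came from.
nearest_map = nearest.reshape(h, w)              # (h, w)

print(dists.shape, nearest_map.shape)
```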

chilly geyser Feb 8, 2021, 7:17 PM

#

Oh hmm

#

Sounds right

#

from numpy.random import seed, uniform
from numpy import array
from scipy.spatial import distance_matrix as dist_mat
# Pixels
x = array(range(6))
x.shape=(2,3)

# Scale
seed(0)
y = uniform(0,6,size=(20,3))

z = dist_mat(x, y)
minimizers = z.argmin(axis=1)
y[minimizers]

^ some toy script

I get

array([[1.25326054, 0.96785711, 3.91864995],
       [3.67257434, 3.70160398, 5.66248847]])

which indeed seem close to 0,1,2 and 3,4,5

#

I don't have the reshaping back into the 350*700 array issue you have, but yeah essentially this thing works

twin moth Feb 8, 2021, 7:23 PM

#

I don't really get that, sorry

#

I'd try to run the following script and see the shape of z

chilly geyser Feb 8, 2021, 7:26 PM

#

Try it

twin moth Feb 8, 2021, 7:35 PM

#

I really don't see how it can help me

#

In [13]: z
Out[13]: 
array([[4.92828313, 4.07220501, 6.33441159, 4.55356006, 5.90153969,
        3.16539456, 7.38905673, 5.77234009, 3.1409925 , 6.07503396,
        4.04371234, 3.91525835, 5.84810124, 4.29668737, 4.68306374,
        4.21475324, 2.64562386, 5.75833833, 2.29192338, 2.41381881],
       [1.44374159, 1.86099389, 1.60497451, 2.09492149, 4.84765814,
        4.60227022, 2.24361732, 2.1995214 , 4.73392854, 3.76611036,
        2.7447737 , 4.11756178, 0.99009464, 3.200087  , 3.95533112,
        5.13868297, 2.65013839, 4.80767455, 3.66255474, 4.01516057]])

#

(2,20,1)

#

AHHH

#

I see it now

#

It returns an array of distances to all of the pixels from the scale, for each pixel from the heatmap

chilly geyser Feb 8, 2021, 7:41 PM

#

Yup

#

It should also reshape back nicely

#

But I can't confirm, would probably need weirdly shaped things

twin moth Feb 8, 2021, 7:42 PM

#

Okay, so now I have 2 result matrices

#

A distance matrix between the heatmap and its clean counterpart

#

And between each pixel in the heatmap and each pixel in the scale

#

How do I use it now

chilly geyser Feb 8, 2021, 7:43 PM

#

argmins?

#

Those tell you the closest

#

i thought you were picking the closest?

twin moth Feb 8, 2021, 7:43 PM

#

Won't I need to iterate through it in order to fit it into the dataframe?

chilly geyser Feb 8, 2021, 7:44 PM

#

Well try it first I guess

twin moth Feb 8, 2021, 7:44 PM

#

The closest match for each pixel

#

So basically 245000 (350*700) pixels

#

kk

chilly geyser Feb 8, 2021, 7:44 PM

#

Sounds like a lot to you but not too many for a computer honestly

twin moth Feb 8, 2021, 7:44 PM

#

True

chilly geyser Feb 8, 2021, 7:44 PM

#

What's 'a lot' for a computer is more on the order of >10^9

twin moth Feb 8, 2021, 7:45 PM

#

Depends on which computer 😛

chilly geyser Feb 8, 2021, 7:46 PM

#

Haha yes I wouldn't get a raspberry to do this

slate hollow Feb 8, 2021, 8:08 PM

#

so i have this df

#

           a         b         c         d
0   308.3720  290.2366  301.1735  351.9520
1   258.5590  235.2776  261.2084  345.8950
2   161.0730  193.4827  189.8389  309.6990
3   123.1920  112.0545  120.7862  253.0690
4    95.6603  110.8244   93.7067  199.7980
5    93.5633   90.5683   95.8124  128.2830
6   119.5540  170.6676  123.4065   78.5675
7   200.7440  177.9813  182.7717   98.8558
8   238.3210  258.7455  256.2180  180.2480
9   320.6590  311.0475  280.1748  177.7440
10  340.3230  361.0352  347.1527  256.2830
11  350.5850  339.0691  345.8954  311.0590

#

how do i get the average

#

of each row

chilly geyser Feb 8, 2021, 8:12 PM

#

Have you tried pandas.DataFrame.mean
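
#

A short sketch of what that would look like, using the first three rows of the frame above as sample data. The `row_means` name and the extra "mean" column are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [308.3720, 258.5590, 161.0730],
    "b": [290.2366, 235.2776, 193.4827],
    "c": [301.1735, 261.2084, 189.8389],
    "d": [351.9520, 345.8950, 309.6990],
})

# axis=1 averages across the columns, giving one value per row
row_means = df.mean(axis=1)

# Optionally keep the result next to the data
df["mean"] = row_means
print(df)
```

The default `axis=0` would instead give one mean per column.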

twin moth Feb 8, 2021, 8:44 PM

#

chilly geyser Those tell you the closest

It tells me the closest but since it doesn't use the distance formula we used earlier - I won't be able to compare between the two

jagged iris Feb 8, 2021, 8:49 PM

#

@client.command()
async def graph(ctx, *, query):
    ticker = query

    yf.download(ticker)

    newtime = yf.download(ticker, start = "2015-01-01", end = "2021-12-31")

    newtime['Adj Close'].plot()
    fig = plt.figure()
    plt.xlabel("Date")
    plt.ylabel("Adjusted")
    plt.title("Price data")
    plt.style.use('dark_background')


    fig.savefig("data.png")

So I am using this code to graph data on some stocks and would like to save the graph directly as a file on my computer, but it is opening the graph in the python application. Not sure what I am doing wrong atm
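
#

One likely culprit in the snippet above: `.plot()` draws onto one figure, then `plt.figure()` creates a second, empty one, and that empty one is what gets saved. A minimal sketch of a possible fix follows. The price data is synthetic (a stand-in for what `yf.download` returns), and switching to the "Agg" backend is an assumption about why a window pops up, not something confirmed in this thread.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: figures only go to files
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the 'Adj Close' column (assumed data)
dates = pd.date_range("2015-01-01", periods=100, freq="D")
prices = pd.Series(100 + np.arange(100) * 0.5, index=dates, name="Adj Close")

plt.style.use("dark_background")  # set the style before the figure is created
fig, ax = plt.subplots()          # create the figure first...
prices.plot(ax=ax)                # ...then draw onto its axes
ax.set_xlabel("Date")
ax.set_ylabel("Adjusted")
ax.set_title("Price data")

fig.savefig("data.png")
plt.close(fig)                    # free the figure once it is on disk
```

Inside the bot command, the same pattern would apply with `newtime['Adj Close'].plot(ax=ax)`.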

twin moth Feb 8, 2021, 8:51 PM

#

twin moth It tells me the closest but since it doesn't use the distance formula we used ea...

NVM, seems like we just need to multiply it by itself and then they should be equal

#

Now we just have an issue with the shape

#

Because we're trying to calculate the distance matrix between a (350,700,3) and (1,200,3)

#

But even if we do it

twin moth Feb 8, 2021, 8:54 PM

#

chilly geyser Well try it first I guess

How do we continue, what's the next step?

twin moth Feb 8, 2021, 9:05 PM

#

ripe forge a simple reshape back will fix that, no issue

Got any idea how to do that first reshape() though?

ripe forge Feb 8, 2021, 9:06 PM

#

your_array.reshape(-1, 3)

#

(-1 means the value at the axis is inferred. you could also write 350*700 there)

twin moth Feb 8, 2021, 9:07 PM

#

Huh? How would that work?

ripe forge Feb 8, 2021, 9:08 PM

#

just math. if you provide me an array, and give me all axes except 1, i can always infer the dimension of the remaining axis
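
#

A tiny sketch of the inference, using zero-filled arrays with the shapes from this thread (the variable names are illustrative):

```python
import numpy as np

a = np.zeros((350, 700, 3))   # heatmap-shaped array
b = np.zeros((1, 200, 3))     # scale-shaped array

flat_a = a.reshape(-1, 3)     # -1 is inferred as 350 * 700
flat_b = b.reshape(-1, 3)     # -1 is inferred as 1 * 200

print(flat_a.shape)  # (245000, 3)
print(flat_b.shape)  # (200, 3)

# Writing the size out explicitly gives the same result
assert flat_a.shape == a.reshape(350 * 700, 3).shape
```

The trailing axis of size 3 (the RGB trio) is kept; only the leading axes are merged.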

twin moth Feb 8, 2021, 9:12 PM

#

So basically if I have (350,700,3) and (1,200,3) and I want to run distance_matrix() on them, I'd have to run it in the following way:

dist_mat = distance_matrix(a.reshape(-1, 3), b)

#

Right?

twin moth Feb 8, 2021, 9:13 PM

#

ripe forge just math. if you provide me an array, and give me all axes except 1, i can alwa...

I mean why should I remove the axis of the trios

twin moth Feb 8, 2021, 9:21 PM

#

ripe forge `your_array.reshape(-1, 3)`

Oh, now I understand what you meant, unfortunately it doesn't work since the heatmap is 350*700 (245000 pixels) while the other is 1*~200 (~200)

#

So a reshape would basically do nothing

#

Since they are of different sizes

twin moth Feb 8, 2021, 9:49 PM

#

numpy.core._exceptions.MemoryError: Unable to allocate 447. GiB for an array with shape (245000, 245000) and data type float64
😰

#

I literally have no idea what to do

#

@ripe forge @chilly geyser If you guys have any idea please ping me, I'll go cry in a corner or something

velvet thorn Feb 8, 2021, 10:15 PM

#

@twin moth what’s the problem

twin moth Feb 8, 2021, 10:20 PM

#

Heya

#

Basically I'm trying to create a distance matrix between matrices of the following shapes: (1,200,3) and (350,700,3)

#

Other than that, I don't know what's the next step

abstract zealot Feb 8, 2021, 10:42 PM

#

@twin moth its an element by element scalar distance matrix?

twin moth Feb 8, 2021, 10:43 PM

#

Kinda

#

I'm comparing trios

twin moth Feb 8, 2021, 10:45 PM

#

abstract zealot @twin moth its an element by element scalar distance matrix?

Is that what you mean by element by element?

abstract zealot Feb 8, 2021, 10:46 PM

#

ahhhhhh yea it is

#

im honestly not sure man

#

thought i mightve had an idea but tried it and it doesnt work xd

twin moth Feb 8, 2021, 10:48 PM

#

abstract zealot thought i mightve had an idea but tried it and it doesnt work xd

lol

#

Thanks for trying ❤️

twin moth Feb 8, 2021, 10:53 PM

#

velvet thorn @twin moth what’s the problem

Okay, so I managed to fix the issues by just choosing not to create a distance_matrix between the heatmap and the scale - and now it takes about 22 mins to iterate through a single image, even though it took about 4 mins beforehand

#

We've added a couple of checks but it's too long anyways

steady ginkgo Feb 8, 2021, 10:56 PM

#

Hello there
I'm studying statistics and looking to get a step ahead of my classes by being able to retrieve data from APIs to analyse on my own. So I'm looking for help to learn requesting and all that stuff, as I only know the basics about the topic.

#

I know the basics of python too if that's at all relevant

austere swift Feb 9, 2021, 12:21 AM

#

twin moth `numpy.core._exceptions.MemoryError: Unable to allocate 447. GiB for an array wi...

i'm curious as to why you would want to use that big of an array

#

and float64 too lmao

misty flint Feb 9, 2021, 12:33 AM

#

steady ginkgo Hello there I'm studying statistics and looking to take a step ahead my classes ...

watch stat quest. that will help you stay ahead

#

https://www.youtube.com/playlist?list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9

YouTube

Statistics Fundamentals

These videos give you a general overview of statistics as well as a reference for statistical concepts.

twin moth Feb 9, 2021, 12:35 AM

#

austere swift i'm curious as to why you would want to use that big of an array

Well, I didn't really

#

I wanted to create distance matrix between two matrixes of the following shape (350,700,3)

hasty grail Feb 9, 2021, 2:16 AM

#

twin moth I wanted to create distance matrix between two matrixes of the following shape (...

I thought you only had to create a distance matrix between (350, 700, 3) and (1, 200, 3)-sized arrays?

#

!e

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng()
a = rng.random((350, 700, 3)).reshape(-1, 3)  # The RGB image. Reshape so that it is 2-D.
b = rng.random((200, 3))  # The 200 RGB pixels to compare against

dist_mat = distance.cdist(a, b).reshape((350, 700, 200))
print(dist_mat)

arctic wedgeBOT Feb 9, 2021, 2:24 AM

#

@hasty grail :warning: Your eval job timed out or ran out of memory.

[No output]

hasty grail Feb 9, 2021, 2:24 AM

#

It barely fit in memory on my PC (which has 8 GB)

#

dist_mat[i, j, k] is the distance between a[i, j] and b[k]

#

If you're having memory issues then you might want to compare batch by batch (with respect to b)
For example, find the closest match out of the first 100 pixels, then the same out of the next 100 pixels, and compare those 2 closest matches.

import numpy as np
from scipy.spatial import distance

img_size = (350, 700)
num_px = 200
batch_size = 100

rng = np.random.default_rng()
img = rng.random((*img_size, 3)).reshape(-1, 3)  # The RGB image. Reshape so that it is 2-D.
px = rng.random((num_px, 3))  # The 200 RGB pixels to compare against

dist_min = np.full(img_size, np.inf)
dist_argmin = np.full(img_size, -1)
for i_min in range(0, num_px, batch_size):
    px_batch = px[i_min:i_min+batch_size]
    dist_batch = distance.cdist(img, px_batch).reshape(*img_size, -1)

    dist_batch_min = dist_batch.min(axis=-1)
    dist_batch_argmin = dist_batch.argmin(axis=-1)
    updated = (dist_min > dist_batch_min)

    dist_min[updated] = dist_batch_min[updated]
    dist_argmin[updated] = i_min + dist_batch_argmin[updated]

assert (dist_argmin > -1).all()

steel roost Feb 9, 2021, 2:33 AM

#

May i have someone explain why

```python
miles_log = cur_folder+"mile_log.txt"
miledge_sheet = cur_folder+'blank mileage log.xlsx'

with open(miles_log,'r') as f:
    data = f.readlines()
f.close()

wb = load_workbook(miledge_sheet)
for cells in wb.iter_columns:
    print(cell)
```

#

doesnt work

misty flint Feb 9, 2021, 2:36 AM

#

whats your error

#

Sip

steel roost Feb 9, 2021, 3:30 AM

#

essentially i have a text document that i want to write its contents to excel, but only to the rows of 17:38

#

however, all of my solutions rewrite the desired section. i can provide the full code if needed.

plucky zephyr Feb 9, 2021, 4:50 AM

#

is there a daily challenge for data science,
i mean like using pandas, matplotlib etc ?

lilac geyser Feb 9, 2021, 5:16 AM

#

Is Hypergeometric probability distribution identical to Binomial probability distribution?

fair solar Feb 9, 2021, 5:33 AM

#

is there any library which can generate a normal map from a diffuse texture?

#

was recommended this channel from #python-discussion

#

ping me ^

#

any cli app would work too i guess, which i can call from python

astral path Feb 9, 2021, 6:16 AM

#

i need some help rather quickly

#

how can I create frequencies of data points but with some margin for error?

#

i.e.

📎 unknown.png

#

I have a bunch of points on a graph, but none of them are the same point

#

how could I loop over each point and get a value for the frequency of points really close to it?

#

Thanks!!!

hasty grail Feb 9, 2021, 6:20 AM

#

So basically you want a histogram?

astral path Feb 9, 2021, 6:20 AM

#

well what I want is to vary the sizes of each point by how frequently there are points nearby

hasty grail Feb 9, 2021, 6:21 AM

#

Hmm

#

Looks like you want some sort of nearest neighbour search

astral path Feb 9, 2021, 6:21 AM

#

so like around (0, 0), the points would be very large

hasty grail Feb 9, 2021, 6:21 AM

#

This might help: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html

astral path Feb 9, 2021, 6:21 AM

#

I thought about clustering but wasn't sure how to use it in this context

#

i'm following a tutorial but with a different dataset

#

the guy in the tutorial has data already in the form of tiny regions across the plot in hexbins instead of just the raw datapoints, and each region already has the frequency in another dataset

#

his looks like this

📎 1fnWjQl9GPN6AJ1-X90Eznw.png

hasty grail Feb 9, 2021, 6:23 AM

#

In particular the query_ball_tree method

astral path Feb 9, 2021, 6:24 AM

#

i'll take a look at this, thanks

hasty grail Feb 9, 2021, 6:25 AM

#

np, feel free to ask if you run into issues

astral path Feb 9, 2021, 6:25 AM

#

willdo!

astral path Feb 9, 2021, 6:36 AM

#

hasty grail np, feel free to ask if you run into issues

so what I have so far is this:

indices = []
count = 0
for x in missed_x:
    kd_tree1 = cKDTree((missed_x, missed_y))
    kd_tree2 = cKDTree((missed_x[count], missed_y[count]))
    indexes = kd_tree1.query_ball_tree(kd_tree2, r=10)
    indices.append(indexes)
    count += 1

display(pd.DataFrame(indices))

but on the line kd_tree2 = cKDTree((missed_x[count], missed_y[count])), I'm getting an error: ValueError: data must be 2 dimensions
So it looks like it's expecting a cKDTree to have more than one point

#

how would I make a cKDTree with just one?

#

I think i see

hasty grail Feb 9, 2021, 6:44 AM

#

Why don't you just create a cKDTree for each set of points?

astral path Feb 9, 2021, 6:45 AM

#

well partially the problem was that cKDTree didn't work on a single point

hasty grail Feb 9, 2021, 6:45 AM

#

I thought you only wanted to do clustering?

#

In that case the two cKDTrees would be the same

astral path Feb 9, 2021, 6:45 AM

#

but
scipy.spatial.KDTree.query_ball_point should be done if I want to find the nearest neighbors for a single point

hasty grail Feb 9, 2021, 6:46 AM

#

Don't you want to find the nearest neighbors for more than one point?

astral path Feb 9, 2021, 6:47 AM

#

yeah but I'm going to loop over each point and add it to a list with the number of nearest neighbors for each point

#

or is that not the best way to do it

hasty grail Feb 9, 2021, 6:47 AM

#

It's not the best way

#

avoid loops when dealing with NumPy (and its related libraries) whenever possible

#

The reason is that Python for loops are way slower than those in C

astral path Feb 9, 2021, 6:48 AM

#

ah, did not know that

hasty grail Feb 9, 2021, 6:48 AM

#

and NumPy uses C loops internally whenever you execute vectorized operations
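
#

To make that concrete, here is a small sketch comparing a Python-level loop with the equivalent vectorized call. The data and the names `points`, `origin`, `loop_d` and `vec_d` are made up for illustration; both versions compute the same distances, the second just pushes the loop down into NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((1000, 2))   # assumed sample data: 1000 (x, y) points
origin = np.array([0.5, 0.5])

# Python-level loop: one interpreter iteration per point
loop_d = []
for x, y in points:
    loop_d.append(((x - origin[0]) ** 2 + (y - origin[1]) ** 2) ** 0.5)
loop_d = np.array(loop_d)

# Vectorized: a single call, the per-point loop runs in compiled code
vec_d = np.linalg.norm(points - origin, axis=1)

print(np.allclose(loop_d, vec_d))  # True
```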

astral path Feb 9, 2021, 6:49 AM

#

so then how should I do it?

hasty grail Feb 9, 2021, 6:54 AM

#

# Assuming that you have a bunch of points in a 2-D array named `points`
centroids = points[::2]  # Subsample the points. In this example we sample every other point.

centroids_t = cKDTree(centroids)
points_t = cKDTree(points)

# r = Radius of search
result = centroids_t.query_ball_tree(points_t, r)  # result[i] is a list of the neighbours of centroids[i] in points
num_neighbours = [len(neighbours) for neighbours in result]  # Count the number of neighbours for each centroid

astral path Feb 9, 2021, 6:59 AM

#

I have an array of points made using missed_points = np.array([missed_x, missed_y]), and it returns:

          [ 207,  264,  120, ...,   86,   15,   61]])

but it says it's a 2x1331 vector.  this would still work, right?  or would I need to transpose it

#

also how come we would do every other point?

hasty grail Feb 9, 2021, 6:59 AM

#

you'll have to transpose it

#

according to the doc, the first dimension represents the number of points and the second dimension represents the number of spatial dimensions the points reside in
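
#

A quick sketch of the shape fix, with a handful of made-up coordinates standing in for `missed_x` and `missed_y`:

```python
import numpy as np
from scipy.spatial import cKDTree

# Assumed sample coordinates, not the real shot data
missed_x = [10, 207, 35, 61]
missed_y = [5, 264, 40, 15]

stacked = np.array([missed_x, missed_y])          # shape (2, 4): wrong way round
points = stacked.T                                # shape (4, 2): one row per point
same = np.column_stack([missed_x, missed_y])      # builds (4, 2) directly

tree = cKDTree(points)
print(tree.n, tree.m)  # 4 2
```

`tree.n` is the number of points and `tree.m` the number of spatial dimensions, so this is an easy way to check the array went in the right way round.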

astral path Feb 9, 2021, 7:00 AM

#

ah ok, thanks

#

could I just do points[::1] for my case?

hasty grail Feb 9, 2021, 7:00 AM

#

also how come we would do every other point?
It's the easiest method of subsampling, just for demonstration purposes.

#

If you don't want to subsample at all you can just do centroids = points

#

The reason you might want to subsample is to reduce the computational cost

#

(Especially if you have a large number of points like >10k)

astral path Feb 9, 2021, 7:03 AM

#

that does make sense

#

although would it make sense when I want the nearest neighbors for each individual point?

hasty grail Feb 9, 2021, 7:03 AM

#

Yeah, since it is directly related to your purpose

#

Which is to count the number of nearby points

astral path Feb 9, 2021, 7:04 AM

#

how would I use it in my plot?

hasty grail Feb 9, 2021, 7:04 AM

#

query_ball_point isn't actually a nearest neighbour search, it's just a ball query

astral path Feb 9, 2021, 7:04 AM

#

it's a different size from my x and y coordinate arrays

hasty grail Feb 9, 2021, 7:04 AM

#

i.e. find all points within distance r of the given point
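
#

If only the counts are needed, a ball query can return them directly. This is a sketch assuming SciPy 1.3 or newer (where `return_length` is available) and four made-up points; the radius is arbitrary.

```python
import numpy as np
from scipy.spatial import cKDTree

# Three points close together and one far away (assumed data)
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [50.0, 50.0]])
tree = cKDTree(points)

# For every point, count the points within distance r (the point itself included)
counts = tree.query_ball_point(points, r=2.0, return_length=True)
print(counts)  # [3 3 3 1]

# The counts could then drive marker sizes, e.g. plt.scatter(x, y, s=10 * counts)
```

Each point counts itself, so subtract 1 if only the other neighbours should be counted.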

astral path Feb 9, 2021, 7:05 AM

#

running it with no subsampling is basically the same running time

hasty grail Feb 9, 2021, 7:05 AM

#

I would think that the reason why you want to count the number of nearby points is so that you don't have to plot every single point

astral path Feb 9, 2021, 7:05 AM

#

ohhh ok

#

I see

#

this is what the graph was btw

📎 unknown.png

#

the code you provided did work

hasty grail Feb 9, 2021, 7:06 AM

#

running it with no subsampling is basically the same running time
Compared to?

astral path Feb 9, 2021, 7:07 AM

#

with the subsampling

#

the dataset is only 1331 points

hasty grail Feb 9, 2021, 7:07 AM

#

Maybe it's because the number of points isn't large enough to matter much

#

1331 is not a lot

astral path Feb 9, 2021, 7:07 AM

#

yeah thats true

#

its an analysis of shots during an nba season

hasty grail Feb 9, 2021, 7:07 AM

#

astral path the code you provided did work

Glad that worked!

misty flint Feb 9, 2021, 7:07 AM

#

wow it looks like a galaxy

#

MHXwoah

astral path Feb 9, 2021, 7:08 AM

#

looks much better with adjusted sizes

📎 unknown.png

#

hmm now that I think about it

misty flint Feb 9, 2021, 7:08 AM

#

bunch of central planets + an outer asteroid belt

#

DoggoKek

astral path Feb 9, 2021, 7:08 AM

#

it's really weird that the dataset is only 1331 points

#

should be much larger

misty flint Feb 9, 2021, 7:09 AM

#

pithink

#

data was lost somewhere?

astral path Feb 9, 2021, 7:09 AM

#

hmmm

#

there should be 118,396 points

misty flint Feb 9, 2021, 7:10 AM

#

Tidus

#

quite a difference

#

just a bit

#

🤏

astral path Feb 9, 2021, 7:10 AM

#

lol

misty flint Feb 9, 2021, 7:10 AM

#

like this much

astral path Feb 9, 2021, 7:11 AM

#

not really enough of a difference to be statistically significant

#

¯_(ツ)_/¯

misty flint Feb 9, 2021, 7:11 AM

#

pithink

#

ig thats true

astral path Feb 9, 2021, 7:13 AM

#

so im a little put off rn

hasty grail Feb 9, 2021, 7:13 AM

#

If you're actually going to deal with 118,396 points then you'll probably need to subsample (and plot only the centroids)

astral path Feb 9, 2021, 7:14 AM

#

yeah thats my new thought

#

i'm looping over 1230 games played to find every shot taken in the game which can be estimated to be about 200 total

misty flint Feb 9, 2021, 7:15 AM

#

pithink

#

interesting data

serene dragon Feb 9, 2021, 7:16 AM

#

for row in t.main_file_df['ID1784']: 
    new_row = [value for value in row if value in t.param_values]
    row = [value for value in row if value not in t.param_values]

#

how can i transform this loop into 2 columns

#

where row I want to stay as 'ID1784'

#

and new_row as new column

hasty grail Feb 9, 2021, 7:16 AM

#

Wdym by 2 columns?

serene dragon Feb 9, 2021, 7:17 AM

#

this is pandas DataFrame

#

t.main_file_df['ID1784'] is column that contain list of strings

misty flint Feb 9, 2021, 7:18 AM

#

6_ThinkingIntensifies

hasty grail Feb 9, 2021, 7:18 AM

#

Can you provide more background information?

#

Such as the structure of your dataframe

misty flint Feb 9, 2021, 7:18 AM

#

i dont understand

serene dragon Feb 9, 2021, 7:19 AM

#

i have DF that contain 2 columns

#

Symbol and ID1784

hasty grail Feb 9, 2021, 7:20 AM

#

and what do you expect your output to be like?

serene dragon Feb 9, 2021, 7:21 AM

#

ID1784 looks like

#

['automatyczna i ręczna regulacja czułości', 'automatyczne wyłączanie', 'identyfikacja sygnału nadajnika dzięki cyfrowemu kodowaniu', 'możliwość dołączenia dodatkowych nadajników', 'optyczna i akustyczna sygnalizacja zlokalizowanego sygnału nadajnika', 'zestaw złożony z nadajnika i odbiornika', 'lokalizowanie tras kabli w gruncie', 'lokalizowanie tras metalowych rur instalacji wodnych i centralnego ogrzewania', 'lokalizowanie właściwego zabezpieczenia odpowiedniego obwodu elektrycznego']

#

list of many strings

#

and I want to split it into 2 columns

#

one that contain texts that i want, based on list intersection

#

and 2nd that have leftovers

#

this for loop can do this for each row

hasty grail Feb 9, 2021, 7:24 AM

#

# Apply to the column itself so each call receives one list of strings
df['ID1784_new'] = df['ID1784'].apply(lambda values: [value for value in values if value in t.param_values])
df['ID1784'] = df['ID1784'].apply(lambda values: [value for value in values if value not in t.param_values])

#

Not that experienced in Pandas so there might be a more efficient method

#

Fixed

serene dragon Feb 9, 2021, 7:27 AM

#

i was close to something like that 😦

#

thanks

hasty grail Feb 9, 2021, 7:28 AM

#

You're welcome

astral path Feb 9, 2021, 7:28 AM

#

perfect timing because i just realized my problem

#

its also pandas

#

ok so

#

I have a dataframe of every taken shot by a team in a given game:

📎 unknown.png

#

I'm trying to loop over each row using for i in homeoutcomes: where homeoutcomes is the name of the dataframe

#

although the only thing that's returned is EVENT_TYPE

#

what am I missing, how should I be iterating over each row instead?

hasty grail Feb 9, 2021, 7:34 AM

#

Why are you iterating over the rows like that?

#

Again, since pandas is built on top of numpy, avoid direct iteration in Python
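
#

For context on why the loop only returned EVENT_TYPE: iterating over a DataFrame yields its column labels, not its rows. A small sketch with a made-up three-row frame:

```python
import pandas as pd

# Assumed sample data with the same single column as homeoutcomes
homeoutcomes = pd.DataFrame(
    {"EVENT_TYPE": ["Made Shot", "Missed Shot", "Made Shot"]}
)

# Plain iteration walks over column labels
labels = [i for i in homeoutcomes]
print(labels)  # ['EVENT_TYPE']

# Vectorized alternative: a boolean mask over all rows at once
missed = homeoutcomes["EVENT_TYPE"] == "Missed Shot"
n_missed = int(missed.sum())

# If row-by-row access is really needed, itertuples() does it explicitly
rows = [row.EVENT_TYPE for row in homeoutcomes.itertuples(index=False)]
```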

astral path Feb 9, 2021, 7:34 AM

#

how should I be doing it

hasty grail Feb 9, 2021, 7:34 AM

#

depends on what you're doing in each iteration

astral path Feb 9, 2021, 7:35 AM

#

in each iteration I do this:

    display(i)
    if homeoutcomes.iloc[indexH]['EVENT_TYPE'] == 'Missed Shot':
      missed_x.append(homestats.iloc[indexH]['LOC_X'])
      missed_y.append(homestats.iloc[indexH]['LOC_Y'])

      next_x.append(homestats.iloc[indexH + 1]['LOC_X'])
      next_y.append(homestats.iloc[indexH + 1]['LOC_Y'])
    indexH += 1
    print("IndexH", indexH)

hasty grail Feb 9, 2021, 7:35 AM

#

what is display?

astral path Feb 9, 2021, 7:36 AM

#

its an IPython function that is used to display data structures

hasty grail Feb 9, 2021, 7:36 AM

#

ok so it's just for debugging

astral path Feb 9, 2021, 7:36 AM

#

yeah

hasty grail Feb 9, 2021, 7:37 AM

#

what are next_x and next_y for?

astral path Feb 9, 2021, 7:38 AM

#

next_x and next_y are keeping track of the x position and y position of the shot following a missed shot

hasty grail Feb 9, 2021, 7:40 AM

#

Hmm ok

#

missed = (homeoutcomes['EVENT_TYPE'] == 'Missed Shot')
missed_x_y = homeoutcomes.iloc[missed].loc[['LOC_X', 'LOC_Y']]

missed_next = np.concatenate([[False], missed[:-1]])  # Shift the `missed` boolean mask
missed_next_x_y = homeoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]

#

Try this

astral path Feb 9, 2021, 7:45 AM

#

instead of a for loop?

hasty grail Feb 9, 2021, 7:45 AM

#

Yep

astral path Feb 9, 2021, 7:45 AM

#

will missed_next be for any shot after a missed shot? or just a missed shot after a missed shot

hasty grail Feb 9, 2021, 7:46 AM

#

any

astral path Feb 9, 2021, 7:46 AM

#

ok thanks

misty flint Feb 9, 2021, 7:46 AM

#

ValkNaruhodo

astral path Feb 9, 2021, 7:48 AM

#

NotImplementedError: iLocation based boolean indexing on an integer type is not available

hasty grail Feb 9, 2021, 7:50 AM

#

Can you print out missed?

#

maybe you need to add .values at the end to convert it into a boolean array

#

instead of a Series

astral path Feb 9, 2021, 7:52 AM

#

yea sure

#

i changed it to missed_home though because im doing it for two different teams

#

i changed the missed in the iloc to missed_home btw

hasty grail Feb 9, 2021, 7:52 AM

#

that's fine

astral path Feb 9, 2021, 7:53 AM

#

display(missed_home) gives

📎 unknown.png

hasty grail Feb 9, 2021, 7:54 AM

#

can you display missed_home.values?

#

if it becomes a regular numpy array then that's what you need to use instead

astral path Feb 9, 2021, 7:54 AM

#

📎 unknown.png

hasty grail Feb 9, 2021, 7:54 AM

#

yup

astral path Feb 9, 2021, 7:54 AM

#

ah ok

hasty grail Feb 9, 2021, 7:55 AM

#

as in

missed = (homeoutcomes['EVENT_TYPE'] == 'Missed Shot').values

astral path Feb 9, 2021, 7:55 AM

#

that's what i did

#

missed_x_y_home gives an error
KeyError: "None of [Index(['LOC_X', 'LOC_Y'], dtype='object', name='GAME_ID')] are in the [index]"

ripe forge Feb 9, 2021, 7:56 AM

#

twin moth Oh, no I understand what you meant, unfortunately it doesn't work since the heat...

Oh. Both arrays needed to be 2d, so it should have been (200, 3). Catching up on the convo I'm not quite sure how you ended up with an array of shape (245000, 245000). Also apologies for the trouble sounds like it wasn't a fun ride 😅
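
#

A back-of-the-envelope check of the sizes involved. The helper name `dist_matrix_gib` is made up; the numbers assume float64 (8 bytes per entry). A (245000, 245000) result is what you would get from comparing the flattened heatmap against another array with the same number of pixels, as mentioned earlier in the thread.

```python
import numpy as np

def dist_matrix_gib(n_rows, n_cols, itemsize=8):
    """Size in GiB of an (n_rows, n_cols) distance matrix."""
    return n_rows * n_cols * itemsize / 2**30

n_pixel = 350 * 700

# Heatmap vs another 245000-pixel array: matches the MemoryError
print(round(dist_matrix_gib(n_pixel, n_pixel), 1))  # 447.2

# Heatmap vs the 200-colour scale
print(round(dist_matrix_gib(n_pixel, 200), 3))      # 0.365

# Making the scale 2-D, as described above
scale = np.zeros((1, 200, 3))
scale_2d = scale.reshape(-1, 3)
print(scale_2d.shape)  # (200, 3)
```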

hasty grail Feb 9, 2021, 7:56 AM

#

astral path missed_x_y_home gives an error `KeyError: "None of [Index(['LOC_X', 'LOC_Y'], dt...

Can you provide your code?

astral path Feb 9, 2021, 7:56 AM

#

yeah sure

hasty grail Feb 9, 2021, 7:56 AM

#

So that I don't get confused by home and stuff

astral path Feb 9, 2021, 7:56 AM

#

xlocs = df['LOC_X']
ylocs = df['LOC_Y']
made = df['SHOT_MADE_FLAG']

missed_x = []
next_x = []
missed_y = []
next_y = []
df.set_index(keys=['GAME_ID'], drop=False,inplace=True)
game_ids = df['GAME_ID'].unique().tolist()

for id in game_ids:
  gamestats = df.loc[df.GAME_ID==id]
  team_names = list(set(gamestats['TEAM_NAME'].tolist()))
  homestats = gamestats[gamestats['TEAM_NAME'] == team_names[0]]
  awaystats = gamestats[gamestats['TEAM_NAME'] == team_names[1]]

  homeoutcomes = pd.DataFrame(homestats['EVENT_TYPE'])
  awayoutcomes = pd.DataFrame(awaystats['EVENT_TYPE'])

  missed_home = (homeoutcomes['EVENT_TYPE'] == 'Missed Shot').values
  missed_x_y_home = homeoutcomes.iloc[missed_home].loc[['LOC_X', 'LOC_Y']]

  missed_next_home = np.concatenate([[False], missed_home[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_home = homeoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]
    
  missed_away = (awayoutcomes['EVENT_TYPE'] == 'Missed Shot').values
  missed_x_y_away = awayoutcomes.iloc[missed_away].loc[['LOC_X', 'LOC_Y']]

  missed_next_away = np.concatenate([[False], missed_away[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_away = awayoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]

missed_x = np.asarray(missed_x)
missed_y = np.asarray(missed_y)
next_x = np.asarray(next_x)
next_y = np.asarray(next_y)

hasty grail Feb 9, 2021, 7:57 AM

#

ok, can you print homeoutcomes.iloc[missed_home]?

astral path Feb 9, 2021, 7:58 AM

#

and so on with all missed values

📎 unknown.png

hasty grail Feb 9, 2021, 7:59 AM

#

Where are the LOC_X and LOC_Y columns?

astral path Feb 9, 2021, 7:59 AM

#

x coordinate of a given shot and y coordinate

hasty grail Feb 9, 2021, 7:59 AM

#

I don't see those columns in your screenshot...

astral path Feb 9, 2021, 8:00 AM

#

oh those are in a different dataframe !!!

hasty grail Feb 9, 2021, 8:00 AM

#

could you combine them together beforehand?

astral path Feb 9, 2021, 8:00 AM

#

should be homestats

#

display(homestats) looks like this

📎 unknown.png

#

and i'm getting homeoutcomes from homestats

hasty grail Feb 9, 2021, 8:02 AM

#

I guess you can just use homestats all the way then

#

(and awaystats)

astral path Feb 9, 2021, 8:03 AM

#

hmm

#

i still get KeyError: "None of [Index(['LOC_X', 'LOC_Y'], dtype='object', name='GAME_ID')] are in the [index]"

hasty grail Feb 9, 2021, 8:03 AM

#

Your edited code?

astral path Feb 9, 2021, 8:03 AM

#

yes

#

on missed_x_y_home

hasty grail Feb 9, 2021, 8:06 AM

#

As in, what is your code now?

#

(Sorry for not being clear)

astral path Feb 9, 2021, 8:06 AM

#

for id in game_ids:
  gamestats = df.loc[df.GAME_ID==id]
  team_names = list(set(gamestats['TEAM_NAME'].tolist()))
  homestats = gamestats[gamestats['TEAM_NAME'] == team_names[0]]
  awaystats = gamestats[gamestats['TEAM_NAME'] == team_names[1]]

  homeoutcomes = pd.DataFrame(homestats['EVENT_TYPE'])
  awayoutcomes = pd.DataFrame(awaystats['EVENT_TYPE'])

  missed_home = (homestats['EVENT_TYPE'] == 'Missed Shot').values
  missed_x_y_home = homestats.iloc[missed_home].loc[['LOC_X', 'LOC_Y']]

  missed_next_home = np.concatenate([[False], missed_home[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_home = homestats.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]
    
  missed_away = (awayoutcomes['EVENT_TYPE'] == 'Missed Shot').values
  missed_x_y_away = awayoutcomes.iloc[missed_away].loc[['LOC_X', 'LOC_Y']]

  missed_next_away = np.concatenate([[False], missed_away[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_away = awayoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]

missed_x = np.asarray(missed_x)
missed_y = np.asarray(missed_y)
next_x = np.asarray(next_x)
next_y = np.asarray(next_y)

#

np

hasty grail Feb 9, 2021, 8:09 AM

#

what is homestats.iloc[missed_home]?

astral path Feb 9, 2021, 8:10 AM

#

thats homeoutcomes.iloc[missed_home] but with homestats instead

#

same error either way

hasty grail Feb 9, 2021, 8:11 AM

#

well yeah, but does it contain the columns 'LOC_X' and 'LOC_Y'?

astral path Feb 9, 2021, 8:12 AM

#

yes, it does

#

its the most recent screenshot ive posted

hasty grail Feb 9, 2021, 8:13 AM

#

after the iloc

astral path Feb 9, 2021, 8:13 AM

#

hmm?

hasty grail Feb 9, 2021, 8:14 AM

#

run display(homestats.iloc[missed_home])

#

supposedly it should have those two columns

astral path Feb 9, 2021, 8:15 AM

#

yep

📎 unknown.png

#

got them

hasty grail Feb 9, 2021, 8:16 AM

#

so homestats.iloc[missed_home].loc[['LOC_X', 'LOC_Y']] results in an error?

astral path Feb 9, 2021, 8:17 AM

#

yeah that one does

hasty grail Feb 9, 2021, 8:17 AM

#

wait

#

I'm dumb

#

ok umm can you do this?

#

homestats.loc[missed_home, ['LOC_X', 'LOC_Y']]

astral path Feb 9, 2021, 8:18 AM

#

yes

#

that worked

hasty grail Feb 9, 2021, 8:18 AM

#

also maybe you could get away with not using .values

astral path Feb 9, 2021, 8:19 AM

#

thanks! i'll try the rest of it

hasty grail Feb 9, 2021, 8:19 AM

#

somehow I thought .iloc uses row indexing while loc uses col indexing
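
#

A small sketch of the difference, on a made-up four-row frame with a repeated GAME_ID-style index (the values are assumptions). `.loc` takes labels or a boolean mask for rows plus column labels in one call, while `.iloc` is purely positional, so the columns have to be picked in a separate step.

```python
import numpy as np
import pandas as pd

homestats = pd.DataFrame(
    {
        "EVENT_TYPE": ["Missed Shot", "Made Shot", "Missed Shot", "Made Shot"],
        "LOC_X": [10, 20, 30, 40],
        "LOC_Y": [1, 2, 3, 4],
    },
    index=[21500001] * 4,  # hypothetical repeated game id
)

missed = (homestats["EVENT_TYPE"] == "Missed Shot").to_numpy()

# .loc: boolean row mask and column labels together
missed_xy = homestats.loc[missed, ["LOC_X", "LOC_Y"]]

# .iloc: positional rows only, columns chosen afterwards
same_xy = homestats.iloc[missed][["LOC_X", "LOC_Y"]]

# Shift the mask by one row to get the shot after each miss
missed_next = np.concatenate([[False], missed[:-1]])
next_xy = homestats.loc[missed_next, ["LOC_X", "LOC_Y"]]
```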

astral path Feb 9, 2021, 8:20 AM

#

ran with no errors!

hasty grail Feb 9, 2021, 8:22 AM

#

for id in game_ids:
  gamestats = df.loc[df.GAME_ID==id]
  team_names = list(set(gamestats['TEAM_NAME'].tolist()))
  homestats = gamestats.loc[gamestats['TEAM_NAME'] == team_names[0]]
  awaystats = gamestats.loc[gamestats['TEAM_NAME'] == team_names[1]]

  missed_home = (homestats['EVENT_TYPE'] == 'Missed Shot')
  missed_x_y_home = homestats.loc[missed_home, ['LOC_X', 'LOC_Y']]

  missed_next_home = np.concatenate([[False], missed_home[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_home = homestats.loc[missed_next_home, ['LOC_X', 'LOC_Y']]
    
  missed_away = (awaystats['EVENT_TYPE'] == 'Missed Shot')
  missed_x_y_away = awaystats.loc[missed_away, ['LOC_X', 'LOC_Y']]

  missed_next_away = np.concatenate([[False], missed_away[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_away = awaystats.loc[missed_next_away, ['LOC_X', 'LOC_Y']]

#

also you can use pd.unique instead of set

#

team_names = pd.unique(gamestats['TEAM_NAME'])
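
For reference, the mask-shifting step in the loop above can also be sketched with `Series.shift`, which keeps the result aligned to the frame's index. The data and team names below are invented for illustration. Note that `pd.unique` returns values in order of first appearance, whereas the order of a `set` is arbitrary.

```python
import pandas as pd

# Hypothetical events for one game, two teams interleaved
gamestats = pd.DataFrame({
    "TEAM_NAME": ["Team A", "Team B", "Team A", "Team A", "Team B", "Team A", "Team A"],
    "EVENT_TYPE": ["Made Shot", "Missed Shot", "Missed Shot", "Made Shot",
                   "Made Shot", "Missed Shot", "Missed Shot"],
    "LOC_X": [1, 2, 3, 4, 5, 6, 7],
    "LOC_Y": [10, 20, 30, 40, 50, 60, 70],
})

# Order of first appearance is preserved
team_names = pd.unique(gamestats["TEAM_NAME"])
homestats = gamestats.loc[gamestats["TEAM_NAME"] == team_names[0]]

# True where this team's shot was a miss
missed = homestats["EVENT_TYPE"] == "Missed Shot"

# Shift the mask down one row: True where the PREVIOUS shot was a miss
missed_next = missed.shift(1, fill_value=False).astype(bool)

missed_x_y = homestats.loc[missed, ["LOC_X", "LOC_Y"]]
next_x_y = homestats.loc[missed_next, ["LOC_X", "LOC_Y"]]
```

If a team's last shot of the game is a miss, it has no following shot, so `missed_x_y` ends up one row longer than `next_x_y`. That matters if the two are later paired row by row.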

astral path Feb 9, 2021, 8:29 AM

#

that works

#

however,

#

going back to centroids

#

that now gives me a ValueError: setting an array element with a sequence.

hasty grail Feb 9, 2021, 8:34 AM

#

where?

astral path Feb 9, 2021, 8:34 AM

#

it's now

indices = []
count = 0

missed_points = missed_x_y
centroids =  missed_x_y
centroids_t = cKDTree(centroids)
points_t = cKDTree(missed_points)

result = centroids_t.query_ball_tree(points_t, r=10)  
num_neighbours = [len(neighbours) for neighbours in result] 

display(num_neighbours)

#

at cKDTree(centroids)

hasty grail Feb 9, 2021, 8:34 AM

#

what is missed_x_y?

astral path Feb 9, 2021, 8:34 AM

#

wrong thing

#

edited

hasty grail Feb 9, 2021, 8:35 AM

#

edited

astral path Feb 9, 2021, 8:35 AM

#

edited?

hasty grail Feb 9, 2021, 8:35 AM

#

my question

astral path Feb 9, 2021, 8:35 AM

#

oh

#

oh i forgot to say

#

instead of missed_next_x_y_home = homestats.loc[missed_next, ['LOC_X', 'LOC_Y']]

hasty grail Feb 9, 2021, 8:36 AM

#

remember that you need to pass a 2-D ndarray to cKDTree

#

not a pandas dataframe

astral path Feb 9, 2021, 8:36 AM

#

i have an array x_y and I'm using x_y.append(homestats.loc[missed_home, ['LOC_X', 'LOC_Y']])

#

and then missed_x_y = np.asarray(x_y)

hasty grail Feb 9, 2021, 8:37 AM

#

why don't you just set missed_x_y to homestats.loc[missed_home, ['LOC_X', 'LOC_Y']]?

astral path Feb 9, 2021, 8:38 AM

#

because i'm adding homestats.loc[missed_home, ['LOC_X', 'LOC_Y']] to missed_x_y for each game

hasty grail Feb 9, 2021, 8:38 AM

#

ok so the problem is that you need to concatenate the list of dataframes together*

astral path Feb 9, 2021, 8:39 AM

#

you mean missed_x_y?

hasty grail Feb 9, 2021, 8:40 AM

#

yes

#

missed_x_y = pd.concat(missed_x_y)

astral path Feb 9, 2021, 8:41 AM

#

well no errors

hasty grail Feb 9, 2021, 8:41 AM

#

and then

astral path Feb 9, 2021, 8:41 AM

#

although the nearest neighbors is slow as hell

hasty grail Feb 9, 2021, 8:41 AM

#

missed_x_y = missed_x_y.to_numpy()

#

(or just chain them together)
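
A small sketch of the chained version, assuming the per-game results were collected in a plain Python list of DataFrames (the list name `per_game` and the coordinates are made up). Calling `np.asarray` on such a list is a likely source of the `ValueError` above, since frames of different lengths cannot form one regular array; concatenating first gives a single `(n_points, 2)` array that `cKDTree` accepts.

```python
import pandas as pd
from scipy.spatial import cKDTree

# Hypothetical per-game missed-shot coordinates, one DataFrame per game
per_game = [
    pd.DataFrame({"LOC_X": [0.0, 3.0], "LOC_Y": [0.0, 4.0]}),
    pd.DataFrame({"LOC_X": [50.0], "LOC_Y": [50.0]}),
]

# Stack the frames row-wise, then convert to a 2-D ndarray
missed_x_y = pd.concat(per_game, ignore_index=True).to_numpy()

# Count neighbours within radius r of each point (each point counts itself)
tree = cKDTree(missed_x_y)
result = tree.query_ball_tree(tree, r=10)
num_neighbours = [len(neighbours) for neighbours in result]
```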

astral path Feb 9, 2021, 8:41 AM

#

so i should do the subsampling?

hasty grail Feb 9, 2021, 8:41 AM

#

yeah

#

you have >100k points now, right?

astral path Feb 9, 2021, 8:42 AM

#

yea

#

should be ~130k

hasty grail Feb 9, 2021, 8:42 AM

#

There are a couple of ways to do subsampling

astral path Feb 9, 2021, 8:42 AM

#

runs much faster with [::2]

hasty grail Feb 9, 2021, 8:42 AM

#

As mentioned before, the easiest way is to just sample every n points

#

However this is only good when the points are randomly distributed, which is not necessarily true since you have sorted them in time sequence, possibly affecting their spatial distribution

astral path Feb 9, 2021, 8:44 AM

#

yeah

hasty grail Feb 9, 2021, 8:44 AM

#

A better method would be to randomly shuffle the points beforehand

#

Still pretty basic, but should be sufficient

astral path Feb 9, 2021, 8:45 AM

#

how would i shuffle it so that the next shots are shuffled the exact same way?

hasty grail Feb 9, 2021, 8:46 AM

#

rng = np.random.default_rng()
shuffled = rng.shuffle(original)

#

this would be nondeterministic shuffling

#

if you want the shuffling to be identical each time the program is run, pass a seed to default_rng

#

(such as zero)

astral path Feb 9, 2021, 8:48 AM

#

hmm

#

the size/hue looks basically random now

📎 unknown.png

#

which is based on num_neighbours

hasty grail Feb 9, 2021, 8:48 AM

#

oops made a mistake

#

rng.shuffle is in-place so it doesn't return anything

#

so you should just call rng.shuffle(original), and original will be shuffled
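
On the question of shuffling the next shots the exact same way: one common approach (a sketch, not necessarily what was meant here) is to draw a single random permutation of row indices and apply it to both arrays. This assumes the two arrays have the same length and that row `i` of one corresponds to row `i` of the other; the array contents and the `step` value are placeholders.

```python
import numpy as np

# Seeded generator so the shuffle is the same on every run
rng = np.random.default_rng(0)

# Hypothetical paired arrays: row i of next_x_y follows row i of missed_x_y
missed_x_y = np.arange(10).reshape(5, 2)
next_x_y = missed_x_y + 100

# One permutation of the row indices, applied to both arrays
order = rng.permutation(len(missed_x_y))
missed_shuffled = missed_x_y[order]
next_shuffled = next_x_y[order]

# Subsample every `step`-th row of the shuffled data
step = 2
missed_sample = missed_shuffled[::step]
next_sample = next_shuffled[::step]
```

Unlike `rng.shuffle`, indexing with `order` returns new arrays and leaves the originals untouched.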

#

what does your code look like now?

astral path Feb 9, 2021, 8:50 AM

#

this is with the every other one

#

[::2]

hasty grail Feb 9, 2021, 8:51 AM

#

increase the subsampling ratio

#

it's way too messy to really tell

astral path Feb 9, 2021, 8:52 AM

#

yeah there's waaaay too many points

#

16 does this

📎 unknown.png

#

I'll try out the rng.shuffle tomorrow

#

i'm done for the night so THANK YOU!

hasty grail Feb 9, 2021, 8:54 AM

#

No problem!

astral path Feb 9, 2021, 8:54 AM

#

you have no idea how helpful this was!

hasty grail Feb 9, 2021, 8:54 AM

#

Am glad to help

tall trail Feb 9, 2021, 9:49 AM

#

is it possible in dash to plot a line with an x axis start and end time? i cant find anything on the old google machine

twin moth Feb 9, 2021, 10:25 AM

#

hasty grail If you're having memory issues then you might want to compare batch by batch (wi...

Thanks a ton, I'll try it out later and let you know

twin moth Feb 9, 2021, 10:28 AM

#

ripe forge Oh. Both arrays needed to be 2d, so it should have been (200, 3). Catching up on...

Lol, no, it didn't really
We ended up iterating through the 1,200,3 matrix instead of creating a distance matrix
Currently it takes about 3mins for each map of 350,700,3.
We have about 240 of those in each dataset (11 hours) and we have multiple datasets, so still a long way to go

lapis sequoia Feb 9, 2021, 10:31 AM

#

I was told that you have to be really good at math and statistics otherwise you can’t be a great data scientist. Is it true?

sage dagger Feb 9, 2021, 10:38 AM

#

lapis sequoia I was told that you have to be really good at math and statistics otherwise you ...

To understand all the concepts behind data science methods, statistical models and so on you need to be good at maths for sure. But to do good analysis you just need to know how to treat your data right and when to use which model. So you can be a good data scientist without fully understanding everything imo. But you'll be better if you understand the methods, their advantages and limits.

lapis sequoia Feb 9, 2021, 11:01 AM

#

sage dagger To understand all the concepts behind data science methods, statistical models a...

So if I'm not mistaken, even if you are not good at math, that does not limit you?

velvet thorn Feb 9, 2021, 11:05 AM

#

lapis sequoia So if I'm not mistaken, even if you are not good at math, that does not limit yo...

honestly

#

it depends on what kinda DS you wanna be

#

there are many types.

sage dagger Feb 9, 2021, 11:05 AM

#

If you want to study Data Science you'll definitely need math BUT there is no good or bad at math, it's just how much time and exercise you want to put into understanding math.

velvet thorn Feb 9, 2021, 11:05 AM

#

also "really good" is subjective

#

for example

#

if you want to be a research scientist

#

you'll defo need to have your linear algebra etc. down

sage dagger Feb 9, 2021, 11:06 AM

#

For analysis tho you just need to know how to run the right analysis on a given data set

lapis sequoia Feb 9, 2021, 11:06 AM

#

I was into DA but now I’m considering moving to NLP and stuff more like AI than analysis. Then I found out my math sucks.

#

I have basic knowledge in linear algebra, but that’s it

velvet thorn Feb 9, 2021, 11:07 AM

#

so like

#

let's say you're a DS running A/B testing

#

you need to know what you're doing

#

proper statistical methodology is important

#

in, for example, placing subjects into cohorts

#

understanding the statistical significance of your results

#

etc.

#

but there are also DS roles

#

which are more

#

...

#

programming-oriented?

lapis sequoia Feb 9, 2021, 11:08 AM

#

Like?

velvet thorn Feb 9, 2021, 11:08 AM

#

as in

#

what I mean is that

#

both are called "Data Scientist"

#

but

#

actual responsibilities may differ

#

just like you can be a "Software Engineer" and work on embedded systems

#

or a "Software Engineer" and do webdev

lapis sequoia Feb 9, 2021, 11:08 AM

#

Ah yes

#

Different expertise within the same field I get that

#

I’m still not sure what I really want to do

velvet thorn Feb 9, 2021, 11:09 AM

#

and

#

you can be also a more business-facing DS

#

so like your special sauce is being very good @ bringing stakeholders together and communicating with them

lapis sequoia Feb 9, 2021, 11:09 AM

#

Data analyst seems to be... inferior? to data scientist

sage dagger Feb 9, 2021, 11:09 AM

#

lapis sequoia I’m still not sure what I really want to do

no one ever does tbh