#data-science-and-ml


astral path
#

wrong channel?

rancid ruin
#

oops this is data sci

#

srry

odd aspen
#

Difference between sentence encoders and tokenizers?

ripe forge
#

tokenizers only break something into tokens (be it sentence tokenizers, word tokenizers, etc.)

#

encoders convert tokens (strings basically) into vectors: numbers that can be used by the models
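
A toy sketch of that split, standard library only. The whitespace tokenizer and the tiny vocabulary are made up for illustration; a real sentence encoder produces dense float vectors from a trained model rather than integer ids.

```py
# Step 1: a tokenizer only splits text into tokens (still strings).
def tokenize(text):
    return text.lower().split()

# Step 2: an encoder maps tokens to numbers a model can consume.
# Here: a hypothetical vocabulary lookup producing integer ids.
def build_vocab(sentences):
    vocab = {}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab, unknown_id=-1):
    return [vocab.get(token, unknown_id) for token in tokenize(text)]

corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
tokens = tokenize("The dog sat")
ids = encode("The dog sat", vocab)
print(tokens)  # ['the', 'dog', 'sat']
print(ids)     # [0, 3, 2]
```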

astral path
#

so ive ran into a problem

abstract zealot
#

shoot

astral path
#

this is constructed from the code you gave me where i was making a series/dataframe with each combination of distances

#

im trying to enumerate over them and add the frequency to another array so it's not the frequency of each unique combination but more just another column in a dataframe

freqs = []
for newi, missed in enumerate(missed_series):
    # look up the frequency of this (missed, next) distance pair
    freq = dist_freq_series.loc[(missed, next_series[newi])]
    freqs.append(freq)
#

however, for some reason i'm getting KeyError: (26.0, 25.0), but I don't see why that shouldn't exist

#

wait never mind

#

posting this made me realize that the dataframe i was enumerating over was wrong
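
For the "frequency as just another column" part, a loop-free sketch with pandas. The column names and numbers are invented for the example; the real data lives in `missed_series` and `next_series`. A `.loc` lookup raises `KeyError` whenever a pair is missing from the frequency index, which this version sidesteps by counting within the same dataframe.

```py
import pandas as pd

# Hypothetical shot data: distance of a missed shot and of the next shot.
shots = pd.DataFrame({
    "missed": [26.0, 26.0, 3.0, 26.0, 3.0],
    "next":   [25.0, 25.0, 2.0, 24.0, 2.0],
})

# Count how often each (missed, next) pair occurs, aligned row by row,
# so every row gets the frequency of its own pair as a new column.
shots["freq"] = shots.groupby(["missed", "next"])["missed"].transform("size")
print(shots)
```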

abstract zealot
#

hahahaha

#

you have solved your own problem?

astral path
#

yeah lol

#

¯\_(ツ)_/¯

abstract zealot
#

hahahaha

#

nice one

astral path
#

FINALLY

abstract zealot
#

thats a sexy plot

#

love it

#

good job @astral path

astral path
#

thank you !

#

idk if you follow basketball but it should make sense anyways

#

the plot is basically where the x axis is the distance from the hoop of a missed shot, and the y axis is the change in how far away the next shot is after that

abstract zealot
#

this actually does make sense

#

its a really cool plot

astral path
#

the size/color is correlating to the frequency of each pairing of missed shot and consecutive shot

hasty comet
#

hi does anyone know about plotly treemap??

lapis sequoia
#

google it

rugged comet
#

When experiencing poor model performance, how do I know if I need more data or if I need to train longer?

limpid oak
#

need help
i have this code
```py
HB_NO_List = list(VillageDataDict.keys())

for eachVillageHB_NO in HB_NO_List:
    TempVillage, TempHB_NO, TempScale = VillageDataDict[eachVillageHB_NO]
    print(TempVillage[0])
    Vilanename = str(TempVillage[0])
    FilterVillage = Killa_Boundary[Killa_Boundary['Village_Na'] == Vilanename]
    print(FilterVillage.head(1))
```
but its giving me empty dataframe though I have data there
```
SHERON
   FID_Final_  Killa_No  Hb_No Village_Na  Mustil_No  Shape_Length
0       38711        15    266     SHERON        116     68.383971

                                            geometry
0  MULTILINESTRING ((495447.227 3470154.065, 4953...
KOT DAS GUNDI MAL
Empty GeoDataFrame
Columns: [FID_Final_, Killa_No, Hb_No, Village_Na, Mustil_No, Shape_Length, geometry]
Index: []
```

rugged comet
#

!code Here's how to format Python code on Discord.

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

limpid oak
#

Found Error in my Script, Thank you for support

rugged comet
#

You're welcome.

raw juniper
#

hey guys

#

I have a question

#

it stretches all the way to week 52

#

I want to change it to a datetime

raw juniper
#

nvm I got it

last rivet
#

@raw juniper what? Are you working in Excel, CSV, pandas dataframes? Actually that does not matter... it's Python anyway
Your question does not make any sense at all.

Anyway, I guess your question is answered here if you are looking to convert e.g. 2021-W1 into 2021-01-04:
https://stackoverflow.com/a/1700069/3314139
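
A short standard-library sketch of that conversion, assuming ISO week numbering (week 1 is the week containing the first Thursday of the year) and that Monday is the day wanted.

```py
from datetime import date, datetime

# Monday of ISO week 1 of 2021 (Python 3.8+).
monday = date.fromisocalendar(2021, 1, 1)
print(monday)  # 2021-01-04

# Same result when parsing a string such as "2021-W1";
# the trailing "-1" picks Monday as the day of the week.
parsed = datetime.strptime("2021-W1" + "-1", "%G-W%V-%u").date()
print(parsed)  # 2021-01-04
```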

raw juniper
#

I got the answer on my own thx btw

last rivet
#

@raw juniper yes, that is what I provided above. Anyway, good job

raw juniper
#

cool

last rivet
#

Also week 1 started on January 4th btw, not 7th @raw juniper
Unless you typed a date into the formula to get a specific day of the date in the week

raw juniper
#

oh

#

I see

cobalt jewel
#

pandas question: In pyspark I can filter a df with df.filter("col1=x and col2=y"), but in pandas the only way I know is df[(df.col1==x)&(df.col2==y)]. This syntax gets very verbose when there are a lot of conditions and the df name is long. Wondering if there is a simpler syntax in pandas similar to what can be used in pyspark

ripe forge
#

there is df.query, though i personally find normal filters easier to understand
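
A minimal side-by-side of the two styles. The dataframe and the values of `x` and `y` are made up; `@name` inside the query string refers to a local Python variable.

```py
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2], "col2": ["a", "b", "a"], "val": [10, 20, 30]})
x, y = 1, "b"

# Boolean-mask style
masked = df[(df.col1 == x) & (df.col2 == y)]

# Same filter with query; @x and @y are looked up in the local scope
queried = df.query("col1 == @x and col2 == @y")

print(queried)
```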

twin moth
#

Hi guys, I'm trying to write a crawler for a school project which-in I need to basically fetch thousands of images.
Currently I'm using Selenium in order to open the website and find the src and alt of the images, then use requests in order to download them.

Doing so means that I have to basically download each image twice - once when I load the page - and once when I download it afterwards

Do you know if it's possible to fetch the images out of the cache of the browser? I mean those images were rendered - they should be somewhere on the PC..
If not, (and I could not find any way to get it to work using GeckoDriver) I'd love some assistance regarding the usage of a proxy in order to do the same.

cobalt jewel
#

@ripe forge thanks!

twin moth
stone jungle
twin moth
abstract zealot
#

Encrypted perhaps ? @stone jungle

stone jungle
#

Yeah ig

misty flint
woeful hamlet
#

using SMOTE do i need to specify the classes?

#

OR it does everything?

#

Also, after using ImageDataGenerator, how can i take a look at the pictures?

#

like, how can i get them?

fleet trail
stone jungle
#

I wanted to make something that could compress video files but this way too complex

fleet trail
#

Ohh man, my bad 😆

#

I thought it was random gibberish, my bad

stone jungle
#

It's alright, I r good

#

U r good bro

#

Btw see this star I captured from my telescope

fleet trail
stone jungle
#

Tq

late shell
#

Hello I had recently started ML, and wanted to know something about epochs & alpha (learning rate). I coded up my own Simple Linear Regression model, and I used it on some data. At first I wasn't getting the perfect values as I got from sklearn, then I tinkered a bit with the epochs and alpha variables: I increased epochs from 1k to 2k and alpha from 0.0001 to 0.001, and I then had the perfect results that matched with sklearn's Linear Regression. So I wanted to know how to know what the value for epochs and alpha should be, so that I don't have to hit and trial every time with my model.

lilac kindle
#

How do I round to "one digit precision"? For example: 0.000784 -> 0.0008, 0.69 -> 0.7, 0.049683948 -> 0.05. Tried googling a bit but everyone points to Round() which doesn't work in this case.

lunar cloak
nova widget
# late shell Hello I had recently started ML, and wanted to know something about epochs & alp...

An epoch is one full pass over the training data, so more epochs are generally better up to a point, while a low learning rate with few epochs will not teach the model enough. As these are hyperparameters and not fixed settings, they are dependent on the training data and the model itself. So depending on the settings you can teach your model fast (but maybe over-generalize), or slow (but not learn enough to make an accurate decision, leaving you with just a blurry network)

#

this all depends on the kinds of variations in the training data you want to accept

#

like training a model with emojis could be quite fast as you can teach the model to identify the exact emoji from the training data, while if you want to identify a person in a picture it needs to see more different examples of the same thing to understand what the pattern of "person" actually is
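
To make the epochs/alpha trade-off concrete, here is a small sketch of batch gradient descent for a one-variable linear regression. The data, the function name `fit_line` and the two settings are invented for the demo; they are not the values from the question.

```py
def fit_line(xs, ys, alpha, epochs):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):  # one epoch = one full pass over the data
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]  # exact line: w = 2, b = 1

w_slow, b_slow = fit_line(xs, ys, alpha=0.0001, epochs=1000)
w_good, b_good = fit_line(xs, ys, alpha=0.05, epochs=5000)
print(w_slow, b_slow)  # still far from (2, 1)
print(w_good, b_good)  # very close to (2, 1)
```

A common way to avoid guessing is to track the loss per epoch and stop once it no longer improves, rather than fixing the epoch count up front.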

stray ivy
#

does it make sense to use stochastic gradient descent, when i have numerical, ordinal, and nominal data? if so, then does it make sense to scale the ordinal data?

lapis sequoia
#

Greetings, working on some NLP with spaCy.
IN python what does this mean?
xpos = a[1]['pos']

#

is that an array, first element? that has value of pos?

#

what if I wanted to check whether a[1]['pos'] exists first?

#

ok, that looks like it is a list
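
A small sketch of the "check it exists first" part, assuming `a` is a list of dicts as guessed above. The sample data and the helper name `get_pos` are made up.

```py
# Hypothetical token list: each item is a dict that may or may not have 'pos'.
a = [{"text": "I", "pos": "PRON"}, {"text": "run", "pos": "VERB"}, {"text": "!"}]

def get_pos(items, index, default=None):
    """Return items[index]['pos'] if both the index and the key exist."""
    if 0 <= index < len(items) and isinstance(items[index], dict):
        return items[index].get("pos", default)
    return default

print(get_pos(a, 1))  # VERB
print(get_pos(a, 2))  # None (no 'pos' key)
print(get_pos(a, 9))  # None (index out of range)
```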

cobalt jewel
#

df = pd.DataFrame([[1,2,'a','b'],[3,4,'c','d'],[5,6,'c','e']],columns=['inta','intb','Ticker','desc'])
df.groupby('Ticker').agg({'inta':'mean','intb':'mean','desc':'max'})

#

assuming that 'max' is an appropriate operation for aggregating your qualitative data

velvet thorn
#

I would say

#

use log10 to find the first nonzero number

#

then floor divide by it

#

!e

from math import ceil, log10

def f(i):
    exp = ceil(log10(i)) - 1
    sig = round(i / 10 ** exp)
    return sig * 10 ** exp  

print(f(0.69))
print(f(0.000784))
arctic wedgeBOT
#

@velvet thorn :white_check_mark: Your eval job has completed with return code 0.

001 | 0.7000000000000001
002 | 0.0008
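
A variation on the same idea that avoids the `0.7000000000000001` artefact by letting `round()` do the final step. The function name `round_1sig` is made up, and zero is not handled.

```py
from math import floor, log10

def round_1sig(x):
    """Round x to one significant digit (x must be non-zero)."""
    exp = floor(log10(abs(x)))   # position of the first non-zero digit
    return round(x, -exp)        # round to that decimal position

for value in (0.000784, 0.69, 0.049683948):
    print(round_1sig(value))

# Shortcut via string formatting: "g" with precision 1 keeps one significant digit.
print(float(f"{0.69:.1g}"))
```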
drowsy shore
#

hey everyone... need help with pandas... getting this value error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

#

transaction_cust.groupby('store_id').sum((abs(transaction_cust.tran_prod_discount_amt))/sum(transaction_cust.tran_prod_sale_amt))
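
The error most likely comes from passing a Series as the argument of `.sum()`, which pandas then tries to interpret as a flag. Assuming the goal is "total absolute discount divided by total sales, per store", a sketch could look like this. The rows are invented and only the column names come from the question.

```py
import pandas as pd

# Hypothetical rows using the column names from the question.
transaction_cust = pd.DataFrame({
    "store_id": [1, 1, 2],
    "tran_prod_discount_amt": [-1.0, -3.0, -2.0],
    "tran_prod_sale_amt": [10.0, 10.0, 8.0],
})

# Add a column of absolute discounts, sum both columns per store,
# then divide the two per-store totals.
totals = (
    transaction_cust
    .assign(abs_discount=transaction_cust["tran_prod_discount_amt"].abs())
    .groupby("store_id")[["abs_discount", "tran_prod_sale_amt"]]
    .sum()
)
discount_ratio = totals["abs_discount"] / totals["tran_prod_sale_amt"]
print(discount_ratio)
```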

misty flint
#

ahhhhhhhhhhhhh

#

i always forget to capitalize Frame in DataFrame

tidal bough
#

from pandas import DataFrame as DF

misty flint
#

ik ik

#

i just keep forgetting Clown2

#

inb4 alzheimers

nocturne plover
#

uhm, is there any way for the gTTs lib to play without saving it

#

like
```py
from gtts import gTTS
text = input('Enter my speech.. ')
speech = gTTS(text)
```

#

is there something like speech.play or something or ive to do it in the delete and create approach

misty flint
#

does it really not have that attribute, i think thats weird

gaunt thorn
#

Every time I install the Python speech recognition package it gives an error.

lilac geyser
#

Should I use hypergeometric distribution formula for calculating probability?
Or Which formula is easier to calculate for finding probability?

#

Please @ me once you see this message

tidal bough
#

that's the binomial distribution, which can be approximated by a normal one with the right mean and std. So that's what one should use here, I believe.

#

the mean is 90/4 = 22.5, and the std is... what was it, sqrt(npq) I think? That's sqrt(90*3/4*1/4) = sqrt(270/16) ~= 4.108

#

>10 means a z-score of around >-3.043

tidal bough
#

(this is pretty clearly how you're supposed to do this, considering you even get a nice z-score of almost exactly -3).
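
The arithmetic above can be checked in a few lines. This assumes n = 90 trials with success probability 1/4, which is what the numbers 90/4 and sqrt(90*3/4*1/4) imply; the helper `normal_cdf` is only a stand-in for a z table.

```py
from math import comb, erf, sqrt

n, p = 90, 1 / 4          # numbers used in the discussion above
mean = n * p              # 22.5
std = sqrt(n * p * (1 - p))
z = (10 - mean) / std

def normal_cdf(value):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(value / sqrt(2)))

# Normal approximation of P(X > 10), without a continuity correction.
p_normal = 1 - normal_cdf(z)

# Exact binomial tail for comparison.
p_exact = 1 - sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(11))

print(round(mean, 1), round(std, 3), round(z, 3))
print(p_normal, p_exact)
```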

minor egret
#

Hey,
can someone explain to me what differentiates valmin, valmax, slidermin, slidermax on Sliders in matplotlib and how to use them?

lilac geyser
#

@tidal bough
So if the std is not given then we can find using formula and using normal distribution(z score) we can find the probability rite?
We can follow this method for most of the problems am I right?

lilac geyser
misty flint
#

makes sense to me

#

just look it up on a z table

lilac geyser
#

Oops sorry I misunderstood
I thought that >10 z score of around>-3.043 was directly found
We got that value(-3.043) using this
Z=(x-mean)/sigma
So
Z=(10-22.5)/4.108 Z=-3.043
So >10 z score is around >-3.043

Since for me z table will be not available during my future exam
We can approximate it as 99.97%

Am I right?

tidal bough
ripe forge
#

You can memorize the values for 1,2 and 3 std deviations to help

misty flint
#

thats a good idea

#

68-95-99.9?

#

or something

balmy junco
#

is anyone here knowledgeable about xpath for webscraping, or is there another topic that is more relevant for that?

viral wasp
#

fuck anaconda is so fucking confusing

#

can someone dm me so we could voice talk and he would teach me how to fucking use it

severe sentinel
#

Anyone know how to rectify this? Trying to analyse a .csv file of netflix data to filter episodes of Friends, and when I try to add columns for days and hours, it returns an error.

#

Absolute beginner, apologies

stray ivy
#

if you read the output, you'll notice that it's actually a warning, but not all warnings should be ignored.

stray ivy
severe sentinel
#

Oh so I can just carry on and it should generally work fine?

hasty grail
#

In this case the warning can be ignored but at the cost of your code being less efficient

#

If you follow the provided link it should explain why

severe sentinel
#

Brilliant, thank you!

noble sand
#

I've got a problem which I can't seem to fix (it's NLP Natural Language Toolkit related!)

#

Sooo I run my application app.py on a Flask server and when it tries to execute the function preprocess_data (depicted below), I get a "TypeError" exception on this line: self.work_tokenize(self.splitsentence)

nova widget
#

type(self.splitsentence)

#

you use the wrong datatype as input, or maybe you target some immutable

noble sand
#

@nova widget I'm using BeautifulSoup module to format the extracted webpage data in a clean format and this gets rid of the punctuations (including full stops)..

#

Could that be the reason why NLTK's sent_tokenize isn't giving me any result ...?

#

As it couldn't find any full stops (periods) to split the text string into separated sentences , when the BeautifulSoup object parsed the data?

lapis sequoia
noble sand
river tangle
#

Hello guys do you know any good tutorial? For ds?

noble sand
river tangle
#

Sorry economics

misty flint
#

a lot of economics people go into DS too

inland swallow
#

Trying to transform X into a regression model. Is this correct:

#
  • Transform X, keep y then plot original vs y?
lapis sequoia
#

obviously

noble sand
#

I'm glad that line got executed without getting any exceptions now 😌

#

Before applying any NLP operations, I realise I get improperly decoded Unicode chars/symbols just after I make a request to several URL links and extract webdata...

#

Anyone know the reason why I get this? And how I might properly convert these sort of characters?
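
One common cause, offered as a guess since the fetching code isn't shown: the response bytes are UTF-8 but get decoded with a different codec. The sketch below reproduces the effect and reverses it. If the pages are fetched with requests, checking `response.encoding` before reading `response.text` is usually the first thing to try.

```py
text = "café"

# Correct round trip: encode and decode with the same codec.
raw = text.encode("utf-8")
assert raw.decode("utf-8") == text

# Decoding UTF-8 bytes as Latin-1 produces the typical garbage.
garbled = raw.decode("latin-1")
print(garbled)  # cafÃ©

# If the text was only mis-decoded (nothing was lost), the step can be reversed.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # café
```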

empty fog
#

I am using celery to load csv data as Panda dataframe and using Celery's group function to distribute it across multiple workers. The intent is to distribute the workload in an idempotent way so each worker can execute it.
I am using Data frame chunks to iterate and distribute small batches of datasets to process it by workers

@celery_app.task(bind=True, autoretry_for=(WorkSchedulingException,),
                 retry_kwargs={'max_retries': 5})
def make_inventory_partitions(self, inventory_bucket: str, gz_name: str, tsk_id):
    gz_path = f"s3://{inventory_bucket}/{gz_name}"
    chunk_size = int(getenv('PARTITION_SIZE', 1000))
    logger.info(f'Splitting inventory [{gz_name}] into chunks of {chunk_size} records per task')
    itr = 0
    try:
        group(
            process_inventory_chunk.s(txn_id=tsk_id, gz_name=gz_path, chunk=chunk.to_json()) for chunk in
            pd.read_csv(gz_path, chunksize=chunk_size, header=None, usecols=[1, 3], names=['path', 'dt']))()
    except Exception as err:
        logger.critical(f'Failed to schedule inventory partitions tasks [{tsk_id}]', exc_info=err)
        raise WorkSchedulingException(f'Failed to schedule inventory partitions tasks workloads for task [{tsk_id}]')
#

Is there any better approach, because my task and worker keep going OOM

fathom sail
#

Hi, is anyone here familiar with beautifulsoup?

#

I would like to extract the attributes of the child elements in a form element

serene scaffold
#

@fathom sail this might not be a data science question, but in either case you might consider opening a help session and sharing what code you've written so far. See #❓｜how-to-get-help

twin moth
#

Any chance for some quick help?

#

I have a 700*350 np.ndarray which contains 3D tuples

#

I want to iterate over all of the tuples

#

I am using np.nditer() in order to achieve that but it just flattens my tuples as well and returns the values of the tuple.
I saw that there's op_axes= and itershape= as params but I could not figure out how to use them

serene scaffold
#

@twin moth what's the larger context of the problem? If you can turn this into a (700,350,3) matrix, you can probably leverage numpy's efficient iteration

#

If you use a for loop, you won't get that.

twin moth
#

It's already a matrix of that shape

#

But when I iterate over it using the aforementioned function it just iterates over the tuple values as well
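
If a per-pixel walk really is needed, one way to get whole triples instead of single values is to flatten only the two spatial axes. A sketch with a tiny stand-in array:

```py
import numpy as np

# Small stand-in for the (700, 350, 3) image from the discussion.
image = np.arange(2 * 4 * 3).reshape(2, 4, 3)

# Collapse the two spatial axes; each row is now one (r, g, b) triple.
pixels = image.reshape(-1, 3)
print(pixels.shape)  # (8, 3)

for pixel in pixels[:2]:
    print(pixel)      # a length-3 array, not single values

# np.ndindex walks the spatial positions if the (row, col) index is needed.
for row, col in np.ndindex(image.shape[:2]):
    triple = image[row, col]
```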

lilac geyser
lilac geyser
misty flint
#

its an easy interview question imo

velvet thorn
#

so is it an array containing tuples or not

#

I'm a bit confused

#

🥴

hasty grail
#

Got a feeling that you could vectorize whatever you're trying to do

velvet thorn
#

^

nocturne plover
#

Uh, i am trying my hands on the ||titanic dataset|| in kaggle and I am getting an unexpected output

#

For predictions, I turned the value of "male" and "female" to "1" and "0". Now, for submission, when I convert them back, I get the gender to be passenger ID of the person

#

Any help is appreciated.....😩

#

Thanks in advance

misty flint
#

why the censor

#

also you should probably post how youre doing it

lapis sequoia
#

Hey guys, any place I can practice some DS problems? Like some jupyter notebook cases, probability, machine learning concept questions, optimizing scenarios etc..

noble sand
ripe forge
#
noble sand
#

I also do get other human-readable English text content extracted along with these mixed Unicode symbols

tall trail
#

i cant seem to figure out how to find a row in a dataframe while in a for loop, it just keeps returning an empty dataframe. anyone know why this keeps happening?

more info:
i am trying to check if the value of test is above a threshold over 20 rows of data, if it is: write it to a separate dataframe.

here is my code so far:
test_data = np.random.randint(119,150, size= 300)

peak_df = pd.DataFrame(columns=['test'])
test_df = pd.DataFrame(test_data, columns=['test'])
tempdf = pd.DataFrame(columns=['test'])

print( '\nIndex:', type(test_df.test))

counter = 0

for i in range(len(test_df)):
    val = test_df.loc[i, 'test']
    if val > 119:
        counter = counter+1
        tempdf.append(test_df.loc[i], ignore_index=True)
        # print(tempdf)
        if counter == 20:
            print('counter reached 20, appending')
            frames = [tempdf, df]
            # print(df)
            # print(tempdf)
            peak_df.append(tempdf, ignore_index=True)
    else:
        counter = 0
        if not tempdf.empty:
            tempdf.drop(['test'])

ripe forge
#

General advice, if you're ever iterating on a dataframe, There's probably a better way to do whatever you're trying to do.
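
For this particular case, a loop-free sketch could look like the following, assuming the goal is "keep rows that sit in a run of at least 20 consecutive values above the threshold". The data is made up so the result is easy to check.

```py
import pandas as pd

# Made-up data: 5 low rows, 25 high rows, 5 low rows, 10 high rows.
values = [100] * 5 + [130] * 25 + [100] * 5 + [130] * 10
test_df = pd.DataFrame({"test": values})

threshold, min_run = 119, 20
above = test_df["test"] > threshold

# Every switch between above/below starts a new run id.
run_id = above.astype(int).diff().ne(0).cumsum()
# Length of the run each row belongs to.
run_len = above.groupby(run_id).transform("size")

# Keep rows that are above the threshold and sit in a long enough run.
peak_df = test_df[above & (run_len >= min_run)]
print(len(peak_df))
```

Separately, `DataFrame.append` returned a new frame instead of modifying the original (and has since been removed from pandas), which is one reason the loop above ends up with empty dataframes.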

tall trail
#

hehe, ya im still kinda new to dataframes and i dont quite get the lambda function yet. i see that gets used a lot

twin moth
ripe forge
#

They don't have to be tuples though, right?

#

More specifically it's just a 3d array, and at this stage there's no more "tuples" yes?

#

To phrase it differently, we understand that there's a logical relation and these are denoting the 3 channels of the images, but we very specifically don't store them as tuples anymore, correct?

#

(if tuples exist still in the array, they need to go. Don't mix tuples inside your np array).

velvet thorn
#

not a 2D array of tuples

twin moth
hasty grail
#

What specifically are you trying to do to the image?

hazy birch
#

-help

#

i have project about SVM and NAIVE BAYES for sentiments analysis and classification

fading juniper
#

Hey guys, I just joined the channel and I was wondering if anyone could point me in the right direction. I got the basics of python down, and can go as far as programming simple algorithms but I want to start developing, preferably for mobile banking. What steps should I take next and what frameworks should I start learning about?

hasty grail
#

Hmm, that's really broad... do you want to deal with the data or the user interface first?

fading juniper
#

Probably data, because once I know how it works, I feel learning UI will be easier

hasty grail
#

Are you planning on building a database for this?

hazy birch
fading juniper
ripe forge
twin moth
#

In order to find where the pixel sits in the scale and not iterate through the entire scale once again (~200 comparisons) I just insert the tuple to a dictionary and I check if the tuple is found in it for each pixel in the heatmap

ripe forge
#

What are you trying to do?

twin moth
#

I don't have much time to explain but I'm trying to parse those maps according to their scales
https://earthobservatory.nasa.gov/global-maps/MOD_NDVI_M
And insert each pixel and its' position on the scale into a pandas DF

twin moth
ripe forge
#

It's like you've taken a car, put it on water and then you're complaining it doesn't swim properly. Numpy arrays are the de facto standard of working with images, and are insanely fast because they vectorize their operations.

#

But you have a approach in mind already which doesn't sit well with how numpy arrays are meant to be used

#

Which isn't a bad thing btw, just might be worth getting your feet wet and seeing how you can use arrays as intended

twin moth
#

I have a clean copy of a map, which I compare to the 'colored' map, I iterate each and every pixel, if they are "equal" (or close to it) I ignore that pixel, if it's not equal I search for the smallest distance between the RGB of the pixel and the colors on the scale

ripe forge
#

So, when you vectorize, you can compare each pixel near instantly. One rule with arrays, no iteration.
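
As a sketch of what "compare each pixel" looks like without a loop, with two tiny made-up images standing in for the clean and colored maps:

```py
import numpy as np

clean = np.zeros((2, 3, 3), dtype=np.uint8)      # hypothetical clean map
colored = clean.copy()
colored[0, 1] = (255, 0, 0)                      # one pixel differs
colored[1, 2] = (0, 10, 0)                       # and another one

# True where any of the three channels differs: one comparison for the whole image.
changed = np.any(colored != clean, axis=-1)
print(changed)
print(np.argwhere(changed))                      # (row, col) of the changed pixels
```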

twin moth
#

def calculate_lowest_distance(colored_rgb: np.ndarray,
                              default_rgb: np.ndarray,
                              scale_list: list,
                              known_pixels_dict: dict) -> int:
    if any(not is_rgb_array(var) for var in [colored_rgb, default_rgb]):
        print("{} or {} is not an ndarray".format(colored_rgb, default_rgb))
        sys.exit(1)  # TODO: Error handling

    # ndarrays are not hashable, so use a plain tuple as the cache key
    key = tuple(int(channel) for channel in colored_rgb)
    if key in known_pixels_dict:
        return known_pixels_dict[key]

    min_distance = calc_rgb_distance(colored_rgb, default_rgb)
    min_dist_idx = None  # None means "same as the clean map"

    if min_distance == 0:
        known_pixels_dict[key] = min_dist_idx
        return min_dist_idx

    for idx, scale_pixel in enumerate(scale_list):
        distance = calc_rgb_distance(colored_rgb, scale_pixel)
        if distance == 0:
            min_dist_idx = idx
            break

        if distance < min_distance:
            min_distance = distance
            min_dist_idx = idx

    known_pixels_dict[key] = min_dist_idx
    return min_dist_idx

That's the current, not complete at all

ripe forge
#

Okay. So immediately we see a lot of iterations here. It's a more traditional style of programming

#

When it comes to numpy arrays, this style of programming becomes a hindrance.

#

I need to be on a laptop for this..

#

Meanwhile, can you do a simple experiment? Make two numpy arrays, pure arrays no tuples inside

#

And just subtract them. a-b and just see the output

woeful hamlet
#

how can i use SMOTE and flow_from_directory from Image Data Generator class together?

#

i found this so far

#
```py
# (inside a generator function)
for x, y in flow_from_directory:
    yield custom_balance(x, y, options)
```
#

How can i call SMOTE here?

hasty grail
hasty grail
tall basin
#

Python Basics in 9 tweets 😱, get in the thread

👇🧵

Oh, If you don't wanna miss out a tweet like this follow me It's free.

#100DaysOfCode #Python #codingtips #CodeNewbies #CodeNewbie #code #programming #pythonprogramming #Python3 #Computer #HowtoPerfect #AI #MachineLearning

twin moth
twin moth
ripe forge
#

mhm. took the car, threw it in water. 🙂

twin moth
#

lol

#

!e

import numpy as np
arr1 = np.array([1,7,352,169])
arr2 = np.array([751,7,-352,8690])
arr3 = arr1 - arr2
arr3
arctic wedgeBOT
#

You are not allowed to use that command here. Please use the #bot-commands channel instead.

twin moth
#

😦

ripe forge
#

aww hehe, you can use bot commands for it

twin moth
#

Anyways, yes, it works with ndarrays

ripe forge
#

now heres the main thing, that's a "vectorized" calculation.

twin moth
#

Doesn't it work with loops in the background?

ripe forge
#

for science, why dont you make the same result via iteration, and then array subtraction. then, do a timeit. and increase the size of the array

#

ish. let's see if theres a difference in timings for large arrays first

#

and then we'll talk

twin moth
#
    start_time = time.time()
    print("Start at: {}".format(start_time))

    map_urls, scale_image_url = get_frame_urls(map_type)
    create_df(map_urls, scale_image_url).to_csv(csv_file)

    print("Finished at: {}".format(time.time()))
    print("--- {} seconds ---".format(time.time() - start_time))
ripe forge
#

oh, you can use any method you prefer, but yes timeit is an actual thing too

twin moth
ripe forge
#

timeit is great because it automatically repeats a calculation multiple times and gives you a mean and deviation of timings for some calculation.

twin moth
#

Sounds amazing, I'm trying it now

chilly geyser
#

Simpler to use jupyter's %%timeit magic

twin moth
ripe forge
#

honestly ive always used it with the magic command 😛 but uh, i believe it can receive strings or functions

twin moth
#
def iter_sub(arr1: np.ndarray, arr2: np.ndarray) -> np.ndarray:
    if len(arr1) != len(arr2):
        print("arr1 and arr2 not of the same length.")
        sys.exit(1)
    arr3 = np.zeros(len(arr1))
    for idx in range(len(arr1)):
        arr3[idx] = arr1[idx] - arr2[idx]
    return arr3
def vectorized_sub(arr1: np.ndarray, arr2: np.ndarray) -> np.ndarray:
    if len(arr1) != len(arr2):
        print("arr1 and arr2 not of the same length.")
        sys.exit(1)
    return arr1 - arr2
ripe forge
#

do you have an ipython? where do you run code

#

hm, for the iter_sub, for a fair comparison, i'd say get rid of the print, get rid of the if checks from both too

twin moth
#

Specifically that code in a Python terminal

ripe forge
#

at least for timeits

twin moth
#

But I mostly write Python in PyCharm

ripe forge
#

okay, that's probably not ipython is it

#

magic commands only work in ipython repls

twin moth
ripe forge
#

in your cmd, type ipython

chilly geyser
#

Would recommend Google Colab instead

twin moth
#

Sec, I'm installing it

#

nvm, that was fast

ripe forge
#

ah you dont have to. it's just a fancy way of running timeit

#

(well, ipython is a GREAT repl though)

twin moth
#

Nah, again, always happy to learn about new standards and tools

#

Okay, so now, how do I use it?

ripe forge
#

so, it's just a python repl. you can type and run stuff as usual

#

but it allows magic commands. those are prefixed with % and %%

twin moth
#

I meant how do I use that magic timeit you were talking about

ripe forge
#

so, try %timeit <command_here>

twin moth
#

Interesting

#

With parameters?

chilly geyser
ripe forge
#

yep, with params

chilly geyser
#

computing courtesy of Google

twin moth
#

How do I randomly fill the array?

ripe forge
#

you can use one of the np.random. Say np.random.randint(1, 10, size=n)
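
Outside IPython, the same experiment can be run with the plain `timeit` module. A sketch, with the checks and prints stripped from both functions as suggested earlier; the array size is arbitrary and the timings will differ per machine.

```py
import timeit
import numpy as np

def iter_sub(arr1, arr2):
    out = np.zeros(len(arr1))
    for idx in range(len(arr1)):
        out[idx] = arr1[idx] - arr2[idx]
    return out

def vectorized_sub(arr1, arr2):
    return arr1 - arr2

n = 100_000  # arbitrary size for the experiment
a = np.random.randint(1, 10, size=n)
b = np.random.randint(1, 10, size=n)

# Both versions must agree before timing them.
assert np.array_equal(iter_sub(a, b), vectorized_sub(a, b))

# Best of three single runs for each version.
t_loop = min(timeit.repeat(lambda: iter_sub(a, b), number=1, repeat=3))
t_vec = min(timeit.repeat(lambda: vectorized_sub(a, b), number=1, repeat=3))
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.6f}s")
```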

twin moth
#

Amazing

chilly geyser
#

Indeed

#

nice

fiery minnow
#

I started learning how to use SQL files with the sqlite3 library, and it's all going smoothly so far
but there's one thing that I can't seem to get correctly...
How can I view the SQL file in VSC?

ripe forge
#

you cant. sqlite is a database, and so its file would need a program that can read sqlite databases
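
The file is binary, so a text editor shows nothing useful, but the `sqlite3` module itself can list what is inside. A sketch using an in-memory database; the table and rows are made up, and a real file path would go where `":memory:"` is.

```py
import sqlite3

# In-memory database for the demo; pass a file path such as "bot.db" instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("bob",)])
conn.commit()

# List the tables stored in the database.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['users']

# Dump the rows of one table.
rows = conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()
print(rows)    # [(1, 'ada'), (2, 'bob')]
conn.close()
```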

fiery minnow
ripe forge
twin moth
#

It's so much faster

fiery minnow
#

I'm using sqlite

#

but what's the difference between Desktop and Program Menu?

#

Is it just the shortcuts?

#

or something else?

ripe forge
#

i think just shortcut

fiery minnow
#

ok

#

what are those features

ripe forge
#

dont worry about it too much, choose what makes sense for you. if not sure, press next.

fiery minnow
ripe forge
#

yes

#

now just open a database (your sqlite db file), and click on browse data.

fiery minnow
#

Yeah I did it

#

I'm planning on using it with my discord bot

#

Would it be good make a table for every server the bot's in?

chilly geyser
ripe forge
# twin moth It's so much faster

yep vectorization is like a whole different world. really good for lots of number crunching. this is essentially the true power of numpy arrays

#

internally, its like "batch" iteration. the key being that it can load and execute operations in batches, as opposed to one at a time

#

for an analogy, it's like iteration is equivalent to making 100 cookies with a baking mold of 1 cookie only. you have to run the oven 100 times, and make one cookie at a time because thats all the mold you have. vs vectorization which is like someone gave you a mold with 20 cookies. so you run the oven 5 times and youre done, making 20 cookies at a time

twin moth
#

Sounds really good

#

But how would I use that ?

#

The shape of my array is 350, 700, 3

#

How do I make use of it here?

#

And lets say that I run that piece of code for the matrices as a whole (AKA vectorization) - I'd still need to compare each pixel to each of the pixels in the scale list

#
    return pow(lhs[0] - rhs[0], 2) + pow(lhs[1] - rhs[1], 2) + pow(lhs[2] - rhs[2], 2)
chilly geyser
#

Are you using python's pow

twin moth
#

Ummm guess so

chilly geyser
#

Have you tried numpy's pow

#

I think it has one

#

Also, I just realised

#

This is linalg

#

You should use the norm function instead

#

np.square(lhs-rhs) should be better also, assuming lhs and rhs are 3-dim objects

bronze ermine
#

Need some c++ help

#

Am i in the right place?

chilly geyser
limpid oak
ripe forge
#

then if you do need to split them apart later for some reason, do it at the end.

twin moth
twin moth
#

But again, I'd need to iterate over the entire matrix

#

Because I still have 350*700 such trios

ripe forge
#

no iteration. what steps are missing now?

#

if theres any iteration step in the workload, youve lost a lot of the gains vectorization would have given you

twin moth
#

I know

#

That's my current issue

#

As I previously said - what I currently do is:
I iterate over all of the pixels in an image.
Compare it to the same pixel in a similar map (using (rhs - lhs) ** 2)
If they're not the same (distance != 0) - I need to compare that pixel to each and every pixel in (1,200,3) ndarray

#

That's basically the algorithm

ripe forge
#

why is the squaring being done?

twin moth
#

That's the formula

ripe forge
#

okay. that shouldnt be too bad. so only the last step remains. what is this (1,200,3) array, and why are we comparing our pixel distances (we're comparing distances, right?) against this?

twin moth
#

RGB distances basically

#

Trying to figure out the value of each pixel in a heatmap

ripe forge
#

i suppose the thought process here is this: can we achieve the values in our heatmap without having to do this comparison? it sounds a bit backwards to me

#

like, to me, if we know that we have a finite range of possible distance values, we should be able to coerce it into a range of values always. but maybe im missing something

twin moth
#

It kinda is

#

But that's the only solution we found

ripe forge
#

do you know the min and max values of your heatmap?

twin moth
#

Some of the heatmaps have scales ranging between (for example) [brown...green]
And some have [red...white...blue]

#

And as you know white is (255,255,255) so it's not really in the middle

#

And I can't search for the brightest/darkest colors

twin moth
#

We have a couple we want to parse

#

But we don't care about the actual values, as we'd be normalizing them later on anyways

ripe forge
#

okay. i suppose lets say we just stick to this approach

twin moth
#

Okay

chilly geyser
twin moth
#

Really? interesting

#

Got any idea how?

chilly geyser
#

Essentially you have 2 ndarrays right

#

Why not try A-B

twin moth
#

Let me try

rotund dock
#

Hi!! I'm trying to find the path where my concentration flows in a river network, maybe someone here can help me.
So basically thats the data frame I have, I want to find the path using the columns N_out and N_next, I know where my path starts because the concentration is different to 0, then the idea would be to find the n_next of that row, append it to an array, search that n_next in the n_out and repeat until I reach the end of the rows (for this example i'm going to have 2 paths)

chilly geyser
#

What do you mean by 'path where my concentration flows'?

#

Do you mean

#

You want to make a forest of directed acyclic graphs

#

from your current data?

rotund dock
#

No I then have to make some calculations with discharge and concentration

twin moth
chilly geyser
#

np.square(A-B) is that squared

#

You can np.sum(np.square(A-B))

twin moth
chilly geyser
#

Well this should be the best NP can give you I think

#

Anything more, you can try JIT

#

Or other langauges

#

languages*

chilly geyser
rotund dock
twin moth
#

I don't really need it that fast, you guys told me I should lmao

#

I'd love it as good as possible of course

rotund dock
twin moth
#

But it's for a school project, it doesn't have to be the most optimal solution

twin moth
#

I want a sum for each pixel

chilly geyser
#

Try axis=0, axis=1, axis=2, etc.

chilly geyser
#

A little brute forcey but you only have 3 indices, it's fine

#

If you're looking at 20 indices I'd look at solving this 'smartley'

rotund dock
chilly geyser
rotund dock
#

let me show you what I'm looking for as a result

chilly geyser
#

You are telling me Path_1 is
73012836 -> 73012412 -> 73011907 -> 73011906

#

So path_2 is
73011909 -> 73011908 -> 73011907 -> 73011906?

#

By the way I'm asking you because code-wise, you'd want to know if your thing terminates

#

Also you'd want to know code-wise, what should be 'in-memory' at each point of the computation

twin moth
rotund dock
chilly geyser
#

That means axis=2 represents your pixels

twin moth
#

Which makes sense since it's (350,700,3) lol
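
A minimal sketch of the per-pixel sum, assuming two small stand-in images `A` and `B` (hypothetical names and sizes) stored as `uint8` RGB arrays. Summing over `axis=2` collapses the colour channel and leaves one squared distance per pixel. The cast to a wider integer type is there because subtracting `uint8` arrays wraps around.

```python
import numpy as np

# Two tiny stand-in "images" of shape (rows, cols, 3); the real ones would be (350, 700, 3).
A = np.zeros((2, 3, 3), dtype=np.uint8)
B = np.zeros((2, 3, 3), dtype=np.uint8)
B[0, 1] = (3, 4, 0)          # one pixel differs slightly
B[1, 2] = (255, 255, 255)    # one pixel differs a lot

# Cast before subtracting: uint8 arithmetic wraps (0 - 3 would become 253).
diff = A.astype(np.int64) - B.astype(np.int64)

# Sum the squared channel differences over the last axis -> one value per pixel.
sq_dist = np.sum(np.square(diff), axis=2)

changed = sq_dist != 0       # boolean mask of pixels that differ between the two images
print(sq_dist.shape)
print(sq_dist)
```

Only the pixels where `changed` is True would then need the slower comparison against the scale.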

chilly geyser
# rotund dock

Ok that's what I wrote. The thing is, how do I know that 73011906 is the last?

#

You can tell me because this csv is 8 nodes long

#

What if it were 52000 nodes long

rotund dock
#

Because my nodes are already sorted

twin moth
#

Okay, so that's actually an amazing step.
How do I compare it to all of the values in the list now?

rotund dock
#

my last row will always be the last node of the network
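
The walk described above (start where the concentration is non-zero, follow `N_next` until the outlet) can be sketched in plain Python. The node IDs are the ones from the two paths above. The concentration values, and the convention that the outlet's `N_next` is not itself a node, are assumptions made for illustration.

```python
# Hypothetical rows: (N_out, N_next, concentration).
rows = [
    (73012836, 73012412, 1.5),
    (73011909, 73011908, 0.8),
    (73012412, 73011907, 0.0),
    (73011908, 73011907, 0.0),
    (73011907, 73011906, 0.0),
    (73011906, 0, 0.0),   # outlet: its N_next (0 here) does not appear in N_out
]

next_of = {n_out: n_next for n_out, n_next, _ in rows}
starts = [n_out for n_out, _, conc in rows if conc != 0]


def follow(start, next_of):
    """Follow N_next links from `start` until a node has no successor in the table."""
    path, seen = [start], {start}
    node = start
    while next_of.get(node) in next_of:
        node = next_of[node]
        if node in seen:   # guard against cycles so the loop always terminates
            raise ValueError(f"cycle detected at node {node}")
        seen.add(node)
        path.append(node)
    return path


paths = [follow(s, next_of) for s in starts]
for p in paths:
    print(" -> ".join(map(str, p)))
```

The `seen` set is what answers the termination question: even with 52000 nodes, each node is visited at most once per path.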

twin moth
#

And take the one closest to it if no match is found

twin moth
chilly geyser
#

You mean the distance?

twin moth
#

That list is the list of pixels in the scale

chilly geyser
#

I thought that list is now the list of distances?

twin moth
chilly geyser
#

With a distance you create a sphere

twin moth
#

Would you mind joining me on VC? It'd be much easier to explain

chilly geyser
#

I'd rather not speak, sorry

twin moth
#

All good

#

So basically we have a heatmap, the heatmap has pixels which can not be found in the scale because of compression issues (I guess)

chilly geyser
twin moth
#

So we need to find the minimum distance between each pixel in the scale and each pixel in the map if the distance != 0 between the two original images

chilly geyser
#

What's a heatmap

#

I mean, I know what's a heatmap

twin moth
#

lol

chilly geyser
#

But the heatmap is the original 3D array?

#

Also, what's a scale here

twin moth
chilly geyser
#

Also, another 3D array?

twin moth
#

That's an example

twin moth
chilly geyser
#

So you have 2 RGB arrays I assume

twin moth
#

But this time it's already filtered, I took a single strip from the scale (1 pixel wide) and that should represent the values

chilly geyser
twin moth
chilly geyser
#

Distance != 0, mathematically, if and only if not the same thing

twin moth
#

(BTW, thanks for all the help, you're awesome)

chilly geyser
#

ok ignore that

twin moth
#

I didn't understand what you meant but I'd add that the distance^2 == 0 only if the colors are the same

chilly geyser
#

Yes

#

Currently

#

Your image is a vector in R^350*700*3 space yes, no?

twin moth
chilly geyser
#

Indeed

#

So now you have the distance between two images

twin moth
#

And other than that we have another array of all the RGBs in the scale

twin moth
#

(Which is *ucking awesome!)

chilly geyser
#

Ok I think I get what you mean

#

Is your scale also R^350*700*3?

twin moth
#

BTW, currently if I find an exact match (0 between the matrices) or between the pixel and a pixel in the scale I just return its index

chilly geyser
#

Oh ok I get it

#

Your scale is a subset of values

twin moth
#

Even less than 200 because I removed the outliers

chilly geyser
#

The scale is 600 values of acceptable values

#

Then

#

For anything not in the scale

#

You want a minimum colour distance

#

ok

twin moth
chilly geyser
#

?? 1*200*3 is 600?

twin moth
#

Because it's 600 only if you separate the R G and B

chilly geyser
#

Ah ok

#

200 acceptable values for R data itself

twin moth
#

True

chilly geyser
#

200 acceptables values for G data itself

#

ditto B

#

ok ok

twin moth
#

Huh?

chilly geyser
#

Is that not the case?

ripe forge
#

its probably more like the same distance formula applied here

#

to find the closest "pixel" that makes sense

ripe forge
#

so, if i understand correctly, you just need a distance matrix

twin moth
chilly geyser
#

I think the closest pixel isn't unique though

ripe forge
#

hm, good shout, it's not. i assume that doesnt matter though

twin moth
#

So basically 350*700*1*200

chilly geyser
#

Ok I think I get it

ripe forge
#

theres something in scipy i think, for making a distance matrix

chilly geyser
#

The data is getting huge though

twin moth
#

Some are really close and we'd rather just take the closest

chilly geyser
#

it's like ~120MB

twin moth
#

Unless you have a better idea

chilly geyser
#

Ok I have handled bigger Dist mats before

#

I don't recommend anything >1Gb basically

ripe forge
#

yeah this is still tiny in the grand scheme of things

chilly geyser
#

Yeah I think the 'fastest' way is to take the distance matrix

#

And find the minimizer

#

For each pixel

#

If you vectorize it, I can't say for sure. It probably doesn't fit in fast cache memory, so it will likely be an iterative process

twin moth
#

So iterations?

chilly geyser
#

Ya iterate over each pixel, I think?

ripe forge
#

nah, i'd say distance matrix

chilly geyser
#

For dist mat IDK how it'd be done

ripe forge
#

ok, sec. lemme go find it

chilly geyser
#

I mean, I don't know the functions off the top of my head at least

ripe forge
#

p=2 for euclidean, which is default anyways. just dump your matrix there, and it should spit out the distance matrix

chilly geyser
#

Oh yeah just try that function first

#

Might need to fiddle a bit with the shape so it likes it, but I think it's fine right now. I think your last .shape is 3, which should be what it wants

ripe forge
#

its actually not happy with the shape as is

chilly geyser
#

Welp, that means reshaping I think

ripe forge
#

yep. slightly annoying but yeah

twin moth
#

Wait, how does it help me?

chilly geyser
#

You're looking for minimizers on this distance matrix

ripe forge
#

so, the thought process is, you get a distance of each pixel against the 200 pixels

chilly geyser
#

^yup

ripe forge
#

and then you choose the smallest distance from there for each pixel

#

(thats the one weird part about vectorization which i always find hard to digest. It's completely happy with doing really wasteful calculations just to reach an answer, and it still scales insanely well)

twin moth
#

Would I also get index of the smallest distance in the scale?

ripe forge
#

in a separate step, yep

#

once you have the distance matrix, then that would be your last task.

#

so, say your rgb array is a 3d array yes? then you do a reshape to turn it into a 2d (n_pixel, 3) array. same deal with the heatmap array. then the distance matrix should give you a (n_pixel, 200) array.

chilly geyser
#

^seems like it

ripe forge
#

at this point, if you use .argmin method, with the correct axis, you'll get the index for the min distance for each pixel.

chilly geyser
#

The issue is

#

Need to remap each n_pixel back to the original pixel matrix hmm?

#

That might be annoying again

ripe forge
#

a simple reshape back will fix that, no issue
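
A rough sketch of the whole round trip (reshape, distance matrix, argmin, reshape back), assuming small random stand-ins named `heatmap` and `scale` in place of the real (350, 700, 3) and (1, 200, 3) arrays. It uses `scipy.spatial.distance.cdist`, which computes Euclidean distances by default.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
heatmap = rng.integers(0, 256, size=(4, 5, 3))   # stand-in for the (350, 700, 3) image
scale = rng.integers(0, 256, size=(1, 6, 3))     # stand-in for the (1, 200, 3) colour strip

pixels = heatmap.reshape(-1, 3).astype(float)    # (20, 3): one row per pixel
colours = scale.reshape(-1, 3).astype(float)     # (6, 3): one row per scale colour

dist = cdist(pixels, colours)                    # (20, 6): every pixel vs every scale colour
nearest = dist.argmin(axis=1)                    # (20,): index of the closest scale colour

index_map = nearest.reshape(heatmap.shape[:2])   # back to (4, 5): one scale index per pixel
print(index_map)
```

Because `reshape(-1, 3)` flattens in row-major order, reshaping the result to `heatmap.shape[:2]` puts every index back at its original pixel position.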

chilly geyser
#

Oh hmm

#

Sounds right

#
from numpy.random import seed, uniform
from numpy import array
from scipy.spatial import distance_matrix as dist_mat
# Pixels
x = array(range(6))
x.shape=(2,3)

# Scale
seed(0)
y = uniform(0,6,size=(20,3))

z = dist_mat(x, y)
minimizers = z.argmin(axis=1)
y[minimizers]

^ some toy script

I get

array([[1.25326054, 0.96785711, 3.91864995],
       [3.67257434, 3.70160398, 5.66248847]])

which indeed seem close to 0,1,2 and 3,4,5

#

I don't have the reshaping back into the 350*700 array issue you have, but yeah essentially this thing works

twin moth
#

I don't really get that, sorry

#

I'd try to run the following script and see the shape of z

chilly geyser
#

Try it

twin moth
#

I really don't see how it can help me

#
In [13]: z
Out[13]: 
array([[4.92828313, 4.07220501, 6.33441159, 4.55356006, 5.90153969,
        3.16539456, 7.38905673, 5.77234009, 3.1409925 , 6.07503396,
        4.04371234, 3.91525835, 5.84810124, 4.29668737, 4.68306374,
        4.21475324, 2.64562386, 5.75833833, 2.29192338, 2.41381881],
       [1.44374159, 1.86099389, 1.60497451, 2.09492149, 4.84765814,
        4.60227022, 2.24361732, 2.1995214 , 4.73392854, 3.76611036,
        2.7447737 , 4.11756178, 0.99009464, 3.200087  , 3.95533112,
        5.13868297, 2.65013839, 4.80767455, 3.66255474, 4.01516057]])
#

(2, 20)

#

AHHH

#

I see it now

#

It returns an array of the distances to all of the pixels in the scale, for each pixel from the heatmap

chilly geyser
#

Yup

#

It should also reshape back nicely

#

But I can't confirm, would probably need weirdly shaped things

twin moth
#

Okay, so now I have 2 result matrices

#

A distance matrix between the heatmap and its clean counterpart

#

And between each pixel in the heatmap and each pixel in the scale

#

How do I use it now

chilly geyser
#

argmins?

#

Those tell you the closest

#

i thought you were picking the closest?

twin moth
#

Won't I need to iterate through it in order to fit it into the dataframe?

chilly geyser
#

Well try it first I guess

twin moth
#

The closest match for each pixel

#

So basically 245000 (350*700) pixels

#

kk

chilly geyser
#

Sounds like a lot to you but not too many for a computer honestly

twin moth
#

True

chilly geyser
#

What's 'a lot' for a computer is more of the like of >10^9

twin moth
#

Depends on which computer 😛

chilly geyser
#

Haha yes I wouldn't get a raspberry to do this

slate hollow
#

so i have this df

#
           a         b         c         d
0   308.3720  290.2366  301.1735  351.9520
1   258.5590  235.2776  261.2084  345.8950
2   161.0730  193.4827  189.8389  309.6990
3   123.1920  112.0545  120.7862  253.0690
4    95.6603  110.8244   93.7067  199.7980
5    93.5633   90.5683   95.8124  128.2830
6   119.5540  170.6676  123.4065   78.5675
7   200.7440  177.9813  182.7717   98.8558
8   238.3210  258.7455  256.2180  180.2480
9   320.6590  311.0475  280.1748  177.7440
10  340.3230  361.0352  347.1527  256.2830
11  350.5850  339.0691  345.8954  311.0590
#

how do i get the average

#

of each row

chilly geyser
#

Have you tried pandas.DataFrame.mean
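
For reference, a short sketch of `DataFrame.mean` on the first two rows of the frame above. `axis=1` averages across the columns, so the result has one value per row.

```python
import pandas as pd

# First two rows of the frame shown above.
df = pd.DataFrame({
    "a": [308.3720, 258.5590],
    "b": [290.2366, 235.2776],
    "c": [301.1735, 261.2084],
    "d": [351.9520, 345.8950],
})

row_means = df.mean(axis=1)   # axis=1: average over columns, one result per row
print(row_means)
```

The default `axis=0` would instead give one mean per column.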

twin moth
jagged iris
#
@client.command()
async def graph(ctx, *, query):
    ticker = query

    yf.download(ticker)

    newtime = yf.download(ticker, start = "2015-01-01", end = "2021-12-31")

    newtime['Adj Close'].plot()
    fig = plt.figure()
    plt.xlabel("Date")
    plt.ylabel("Adjusted")
    plt.title("Price data")
    plt.style.use('dark_background')


    fig.savefig("data.png")

So I am using this code to graph data on some stocks and would like to save the graph directly as a file on my computer, but it is opening the graph in the python application. Not sure what I am doing wrong atm
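
One possible cause, offered as a guess since no error is shown: `newtime['Adj Close'].plot()` draws on whatever figure is current, and the later `plt.figure()` creates a new, empty one, so `fig.savefig` writes the empty figure. A minimal sketch of the usual pattern, with made-up prices in place of the `yf.download` result and a non-interactive backend so that no window opens:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")                 # non-interactive backend: render to files only
import matplotlib.pyplot as plt

prices = [10.0, 10.5, 10.2, 11.1, 11.4]   # placeholder data, not real quotes

fig, ax = plt.subplots()              # create the figure first...
ax.plot(range(len(prices)), prices)   # ...then draw on its axes
ax.set_xlabel("Date")
ax.set_ylabel("Adjusted")
ax.set_title("Price data")

out_path = os.path.join(tempfile.gettempdir(), "data.png")
fig.savefig(out_path)                 # saves the figure that was actually drawn on
plt.close(fig)                        # free the figure once it is saved
print(out_path)
```

With a pandas Series the same idea would be `newtime['Adj Close'].plot(ax=ax)`, so the line lands on the figure that gets saved.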
twin moth
#

Now we just have an issue with the shape

#

Because we're trying to calculate the distance matrix between a (350,700,3) and (1,200,3)

#

But even if we do it

twin moth
twin moth
ripe forge
#

your_array.reshape(-1, 3)

#

(-1 means the value at the axis is inferred. you could also write 350*700 there)

twin moth
#

Huh? How would that work?

ripe forge
#

just math. if you provide me an array, and give me all axes except 1, i can always infer the dimension of the remaining axis
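
A small sketch of that inference, using zero-filled stand-ins `a` and `b` (hypothetical, much smaller than the real arrays). NumPy divides the total number of elements by the known axes to fill in the `-1`.

```python
import numpy as np

a = np.zeros((4, 5, 3))      # stand-in for the (350, 700, 3) image
b = np.zeros((1, 6, 3))      # stand-in for the (1, 200, 3) scale strip

a2 = a.reshape(-1, 3)        # 4*5*3 = 60 values, 3 per row -> 20 rows
b2 = b.reshape(-1, 3)        # 1*6*3 = 18 values, 3 per row -> 6 rows
print(a2.shape, b2.shape)

try:
    np.zeros(10).reshape(-1, 3)   # 10 is not divisible by 3, so no row count fits
except ValueError as err:
    print("reshape failed:", err)
```

Both arrays end up 2-D with 3 columns, which is the layout the distance functions in `scipy.spatial` expect.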

twin moth
#

So basically if I have (350,700,3) and (1,200,3) and I want to run distance_matrix() on them, I'd have to run it in the following way:

dist_mat = distance_matrix(a.reshape(-1, 3), b)
#

Right?

twin moth
twin moth
#

So a reshape would basically do nothing

#

Since they are of different sizes

twin moth
#

numpy.core._exceptions.MemoryError: Unable to allocate 447. GiB for an array with shape (245000, 245000) and data type float64
😰

#

I literally have no idea what to do

#

@ripe forge @chilly geyser If you guys have any idea please ping me, I'll go cry in a corner or something
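
The size in that error can be checked by hand: a float64 takes 8 bytes, so the array needs rows × columns × 8 bytes. A quick sketch comparing the (245000, 245000) matrix from the traceback with the (245000, 200) matrix that pixels-versus-scale would need (the helper name `float64_gib` is made up for this example):

```python
def float64_gib(rows, cols):
    """Memory needed for a float64 array of the given shape, in GiB."""
    return rows * cols * 8 / 2**30


n_pixels = 350 * 700                          # 245000 pixels in the image

image_vs_image = float64_gib(n_pixels, n_pixels)
image_vs_scale = float64_gib(n_pixels, 200)

print(f"{image_vs_image:.1f} GiB")            # every pixel against every pixel
print(f"{image_vs_scale * 1024:.0f} MiB")     # every pixel against the 200 scale colours
```

The first figure matches the 447 GiB in the error, which suggests both arguments were image-sized rather than one being the 200-colour scale.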

velvet thorn
#

@twin moth what's the problem

twin moth
#

Heya

#

Basically I'm trying to create a distance matrix between matrices of the following shapes: (1,200,3) and (350,700,3)

#

Other than that, I don't know what's the next step

abstract zealot
#

@twin moth its an element by element scalar distance matrix?

twin moth
#

Kinda

#

I'm comparing trios

twin moth
abstract zealot
#

ahhhhhh yea it is

#

im honestly not sure man

#

thought i mightve had an idea but tried it and it doesnt work xd

twin moth
#

Thanks for trying ❤️

twin moth
#

We've added a couple of checks but it's too long anyways

steady ginkgo
#

Hello there
I'm studying statistics and looking to take a step ahead my classes by being able to retrieve data to analyse on my own from APIs. So i'm looking for help to learn requesting and all that stuff as I only know the basics about the topic.

#

I know how the basics of python too if that's any relevant

austere swift
#

and float64 too lmao

misty flint
twin moth
#

I wanted to create a distance matrix between two matrices of the following shape (350,700,3)

hasty grail
#

!e

import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng()
a = rng.random((350, 700, 3)).reshape(-1, 3)  # The RGB image. Reshape so that it is 2-D.
b = rng.random((200, 3))  # The 200 RGB pixels to compare against

dist_mat = distance.cdist(a, b).reshape((350, 700, 200))
print(dist_mat)
arctic wedgeBOT
#

@hasty grail :warning: Your eval job timed out or ran out of memory.

[No output]
hasty grail
#

It barely fit in memory on my PC (which has 8 GB)

#

dist_mat[i, j, k] is the distance between a[i, j] and b[k]

#

If you're having memory issues then you might want to compare batch by batch (with respect to b)
For example, find the closest match out of the first 100 pixels, then the same out of the next 100 pixels, and compare those 2 closest matches.

import numpy as np
from scipy.spatial import distance

img_size = (350, 700)
num_px = 200
batch_size = 100

rng = np.random.default_rng()
img = rng.random((*img_size, 3)).reshape(-1, 3)  # The RGB image. Reshape so that it is 2-D.
px = rng.random((num_px, 3))  # The 200 RGB pixels to compare against

dist_min = np.full(img_size, np.inf)
dist_argmin = np.full(img_size, -1)
for i_min in range(0, num_px, batch_size):
    px_batch = px[i_min:i_min+batch_size]
    dist_batch = distance.cdist(img, px_batch).reshape(*img_size, -1)

    dist_batch_min = dist_batch.min(axis=-1)
    dist_batch_argmin = dist_batch.argmin(axis=-1)
    updated = (dist_min > dist_batch_min)

    dist_min[updated] = dist_batch_min[updated]
    dist_argmin[updated] = i_min + dist_batch_argmin[updated]

assert (dist_argmin > -1).all()
steel roost
#

May i have someone explain why

miles_log = cur_folder+"mile_log.txt"
miledge_sheet = cur_folder+'blank mileage log.xlsx'

with open(miles_log,'r') as f:
    data = f.readlines()
    f.close()

wb = load_workbook(miledge_sheet)
for cells in wb.iter_columns:
    print(cell)

#

doesnt work

misty flint
#

whats your error

steel roost
#

essentially i have a text document that i want to write its contents to excel, but only to the rows of 17:38

#

however, all of my solutions rewrite the desired section. i can provide the full code if needed.

plucky zephyr
#

is there a daily challenge for data science,
i mean like using pandas, matplotlib etc ?

lilac geyser
#

Is Hypergeometric probability distribution identical to Binomial probability distribution?

fair solar
#

is there any library which can generate a normal map from a diffuse texture?

#

ping me ^

#

any cli app would work too i guess, which i can call from python

astral path
#

i need some help rather quickly

#

how can I create frequencies of data points but with some margin for error?

#

I have a bunch of points on a graph, but none of them are the same point

#

how could I loop over each point and get a value for the frequency of points really close to it?

#

Thanks!!!

hasty grail
#

So basically you want a histogram?

astral path
#

well what I want is to vary the sizes of each point by how frequently there are points nearby

hasty grail
#

Hmm

#

Looks like you want some sort of nearest neighbour search

astral path
#

so like around (0, 0), the points would be very large

astral path
#

I thought about clustering but wasn't sure how to use it in this context

#

i'm following a tutorial but with a different dataset

#

the guy in the tutorial has data already in the form of tiny regions across the plot in hexbins instead of just the raw datapoints, and each region already has the frequency in another dataset

hasty grail
#

In particular the query_ball_tree method

astral path
#

i'll take a look at this, thanks

hasty grail
#

np, feel free to ask if you run into issues

astral path
#

willdo!

astral path
# hasty grail np, feel free to ask if you run into issues

so what I have so far is this:

indices = []
count = 0
for x in missed_x:
    kd_tree1 = cKDTree((missed_x, missed_y))
    kd_tree2 = cKDTree((missed_x[count], missed_y[count]))
    indexes = kd_tree1.query_ball_tree(kd_tree2, r=10)
    indices.append(indexes)
    count += 1

display(pd.DataFrame(indices))

but on the line kd_tree2 = cKDTree((missed_x[count], missed_y[count])), I'm getting an error: ValueError: data must be 2 dimensions
So it looks like it's expecting a cKDTree to have more than one point

#

how would I make a cKDTree with just one?

#

I think i see

hasty grail
#

Why don't you just create a cKDTree for each set of points?

astral path
#

well partially the problem was that cKDTree didn't work on a single point

hasty grail
#

I thought you only wanted to do clustering?

#

In that case the two cKDTrees would be the same

astral path
#

but
scipy.spatial.KDTree.query_ball_point should be done if I want to find the nearest neighbors for a single point

hasty grail
#

Don't you want to find the nearest neighbors for more than one point?

astral path
#

yeah but I'm going to loop over each point and add it to a list with the number of nearest neighbors for each point

#

or is that not the best way to do it

hasty grail
#

It's not the best way

#

avoid loops when dealing with NumPy (and its related libraries) whenever possible

#

The reason is that Python for loops are way slower than those in C

astral path
#

ah, did not know that

hasty grail
#

and NumPy uses C loops internally whenever you execute vectorized operations

astral path
#

so then how should I do it?

hasty grail
#
from scipy.spatial import cKDTree

# Assuming that you have a bunch of points in a 2-D array named `points`, shape (n_points, 2)
centroids = points[::2]  # Subsample the points. In this example we sample every other point.

centroids_t = cKDTree(centroids)
points_t = cKDTree(points)

# r = Radius of search, in the same units as the coordinates
result = centroids_t.query_ball_tree(points_t, r)  # result[i] is a list of indices into `points` of the neighbours of centroids[i]
num_neighbours = [len(neighbours) for neighbours in result]  # Count the number of neighbours for each centroid
astral path
#

I have an array of points made using missed_points = np.array([missed_x, missed_y]), and it returns:

          [ 207,  264,  120, ...,   86,   15,   61]])

but it says it's a 2x1331 vector.  this would still work, right?  or would I need to transpose it
#

also how come we would do every other point?

hasty grail
#

you'll have to transpose it

#

according to the doc, the first dimension represents the number of points and the second dimension represents the number of spatial dimensions the points reside in
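
A small sketch of the transpose, with made-up coordinates standing in for `missed_x` and `missed_y`. `np.array([x, y])` stacks the two lists as rows, so `.T` is needed to get one row per point.

```python
import numpy as np
from scipy.spatial import cKDTree

missed_x = [207, 264, 120, 86]   # made-up coordinates
missed_y = [15, 61, 30, 42]

stacked = np.array([missed_x, missed_y])   # shape (2, 4): one row per coordinate axis
points = stacked.T                         # shape (4, 2): one row per point

tree = cKDTree(points)                     # expects (n_points, n_dims)
print(stacked.shape, points.shape, tree.n, tree.m)
```

`np.column_stack((missed_x, missed_y))` would build the (n_points, 2) array directly.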

astral path
#

ah ok, thanks

#

could I just do points[::1] for my case?

hasty grail
#

also how come we would do every other point?
It's the easiest method of subsampling, just for demonstration purposes.

#

If you don't want to subsample at all you can just do centroids = points

#

The reason you might want to subsample is to reduce the computational cost

#

(Especially if you have a large number of points like >10k)

astral path
#

that does make sense

#

although would it make sense when I want the nearest neighbors for each individual point?

hasty grail
#

Yeah, since it is directly related to your purpose

#

Which is to count the number of nearby points

astral path
#

how would I use it in my plot?

hasty grail
#

query_ball_point isn't actually a nearest neighbour search, it's just a ball query

astral path
#

it's a different size from my x and y coordinate arrays

hasty grail
#

i.e. find all points within distance r of the given point
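
A sketch of a ball query that gives one count per point, assuming four made-up points and a radius of 2. Querying the tree with its own points keeps the counts the same length as the coordinate arrays.

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.array([
    [0.0, 0.0], [1.0, 0.0], [0.0, 1.0],   # a tight cluster near the origin
    [50.0, 50.0],                          # an isolated point
])

tree = cKDTree(points)
neighbours = tree.query_ball_point(points, r=2.0)     # one list of indices per query point
counts = np.array([len(idx) for idx in neighbours])   # each count includes the point itself

print(counts)   # [3 3 3 1]
```

Since `counts` lines up with the original points, it could be passed to a scatter plot as the marker size, for example scaled by some factor.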

astral path
#

running it with no subsampling is basically the same running time

hasty grail
#

I would think that the reason why you want to count the number of nearby points is so that you don't have to plot every single point

astral path
#

ohhh ok

#

I see

#

the code you provided did work

hasty grail
#

running it with no subsampling is basically the same running time
Compared to?

astral path
#

with the subsampling

#

the dataset is only 1331 points

hasty grail
#

Maybe it's because the number of points isn't large enough to matter much

#

1331 is not a lot

astral path
#

yeah thats true

#

its an analysis of shots during an nba season

hasty grail
misty flint
#

wow it looks like a galaxy

astral path
#

hmm now that I think about it

misty flint
#

bunch of central planets + an outer asteroid belt

astral path
#

it's really weird that the dataset is only 1331 points

#

should be much larger

misty flint
#

data was lost somewhere?

astral path
#

hmmm

#

there should be 118,396 points

misty flint
#

quite a difference

#

just a bit

#

🀏

astral path
#

lol

misty flint
#

like this much

astral path
#

not really enough of a difference to be statistically significant

#

¯\_(ツ)_/¯

misty flint
#

ig thats true

astral path
#

so im a little put off rn

hasty grail
#

If you're actually going to deal with 118,396 points then you'll probably need to subsample (and plot only the centroids)

astral path
#

yeah thats my new thought

#

i'm looping over 1230 games played to find every shot taken in the game which can be estimated to be about 200 total

misty flint
#

interesting data

serene dragon
#
for row in t.main_file_df['ID1784']: 
    new_row = [value for value in row if value in t.param_values]
    row = [value for value in row if value not in t.param_values]
#

how can i transform this loop into 2 columns

#

where row I want to stay as 'ID1784'

#

and new_row as new column

hasty grail
#

Wdym by 2 columns?

serene dragon
#

this is pandas DataFrame

#

t.main_file_df['ID1784'] is column that contain list of strings

misty flint
hasty grail
#

Can you provide more background information?

#

Such as the structure of your dataframe

misty flint
#

i dont understand

serene dragon
#

i have DF that contain 2 columns

#

Symbol and ID1784

hasty grail
#

and what do you expect your output to be like?

serene dragon
#

ID1784 looks like

#
['automatyczna i ręczna regulacja czułości', 'automatyczne wyłączanie', 'identyfikacja sygnału nadajnika dzięki cyfrowemu kodowaniu', 'możliwość dołączenia dodatkowych nadajników', 'optyczna i akustyczna sygnalizacja zlokalizowanego sygnału nadajnika', 'zestaw złożony z nadajnika i odbiornika', 'lokalizowanie tras kabli w gruncie', 'lokalizowanie tras metalowych rur instalacji wodnych i centralnego ogrzewania', 'lokalizowanie właściwego zabezpieczenia odpowiedniego obwodu elektrycznego']
#

list of many strings

#

and I want to split it into 2 columns

#

one that contain texts that i want, based on list intersection

#

and 2nd that have leftovers

#

this for loop can do this for each row

hasty grail
#
df['ID1784_new'] = df.apply(lambda row: [value for value in row['ID1784'] if value in t.param_values], axis=1)  # axis=1: call the lambda once per row
df['ID1784'] = df.apply(lambda row: [value for value in row['ID1784'] if value not in t.param_values], axis=1)
#

Not that experienced in Pandas so there might be a more efficient method
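
A small runnable sketch of the same split, assuming a made-up `param_values` set in place of `t.param_values` and a two-row frame. It applies the function to the single column rather than the whole frame, which avoids needing `axis=1`.

```python
import pandas as pd

param_values = {"red", "blue"}     # hypothetical stand-in for t.param_values

df = pd.DataFrame({
    "Symbol": ["A1", "B2"],
    "ID1784": [["red", "small", "blue"], ["large"]],
})

# Build the new column first, then overwrite the original with the leftovers.
df["ID1784_new"] = df["ID1784"].apply(lambda vals: [v for v in vals if v in param_values])
df["ID1784"] = df["ID1784"].apply(lambda vals: [v for v in vals if v not in param_values])

print(df)
```

The order matters: overwriting `ID1784` first would leave nothing to match for the new column.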

#

Fixed

serene dragon
#

i was close to something like that 😦

#

thanks

hasty grail
#

You're welcome

astral path
#

perfect timing because i just realized my problem

#

its also pandas

#

ok so

#

I'm trying to loop over each row using for i in homeoutcomes: where homeoutcomes is the name of the dataframe

#

although the only thing that's returned is EVENT_TYPE

#

what am I missing, how should I be iterating over each row instead?

hasty grail
#

Why are you iterating over the rows like that?

#

Again, since pandas derives from numpy, avoid direct iteration in Python
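
A short sketch of why the loop only printed `EVENT_TYPE`, using a three-row stand-in for `homeoutcomes`. Iterating a DataFrame directly walks its column labels, not its rows.

```python
import pandas as pd

homeoutcomes = pd.DataFrame({"EVENT_TYPE": ["Made Shot", "Missed Shot", "Missed Shot"]})

columns_seen = list(homeoutcomes)   # iterating a DataFrame yields column labels
print(columns_seen)

# If a row loop is really needed, itertuples() yields one namedtuple per row.
for row in homeoutcomes.itertuples(index=False):
    print(row.EVENT_TYPE)

# The vectorised alternative: compare the whole column at once.
missed = homeoutcomes["EVENT_TYPE"] == "Missed Shot"
print(missed.sum())                 # number of missed shots
```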

astral path
#

how should I be doing it

hasty grail
#

depends on what you're doing in each iteration

astral path
#

in each iteration I do this:

    display(i)
    if homeoutcomes.iloc[indexH]['EVENT_TYPE'] == 'Missed Shot':
      missed_x.append(homestats.iloc[indexH]['LOC_X'])
      missed_y.append(homestats.iloc[indexH]['LOC_Y'])

      next_x.append(homestats.iloc[indexH + 1]['LOC_X'])
      next_y.append(homestats.iloc[indexH + 1]['LOC_Y'])
    indexH += 1
    print("IndexH", indexH)
hasty grail
#

what is display?

astral path
#

its an IPython function that is used to display data structures

hasty grail
#

ok so it's just for debugging

astral path
#

yeah

hasty grail
#

what are next_x and next_y for?

astral path
#

next_x and next_y are keeping track of the x position and y position of the shot following a missed shot

hasty grail
#

Hmm ok

#
missed = (homeoutcomes['EVENT_TYPE'] == 'Missed Shot')
missed_x_y = homeoutcomes.iloc[missed].loc[['LOC_X', 'LOC_Y']]

missed_next = np.concatenate([[False], missed[:-1]])  # Shift the `missed` boolean mask
missed_next_x_y = homeoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]
#

Try this

astral path
#

instead of a for loop?

hasty grail
#

Yep

astral path
#

will missed_next be for any shot after a missed shot? or just a missed shot after a missed shot

hasty grail
#

any
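
A sketch of the shifted-mask idea on a four-row stand-in frame named `shots` (hypothetical data). One extra step is added here as an assumption about what is wanted: a miss on the very last row has no following shot, so it is dropped to keep the two selections the same length.

```python
import numpy as np
import pandas as pd

shots = pd.DataFrame({
    "EVENT_TYPE": ["Missed Shot", "Made Shot", "Missed Shot", "Missed Shot"],
    "LOC_X": [10, 20, 30, 40],
    "LOC_Y": [1, 2, 3, 4],
})

missed = np.array(shots["EVENT_TYPE"] == "Missed Shot")   # plain boolean array (a copy)
missed[-1] = False                                        # last shot has no follow-up
missed_next = np.concatenate([[False], missed[:-1]])      # True on the row after each miss

missed_xy = shots.loc[missed, ["LOC_X", "LOC_Y"]]
next_xy = shots.loc[missed_next, ["LOC_X", "LOC_Y"]]

print(missed_xy["LOC_X"].tolist(), next_xy["LOC_X"].tolist())
```

Row `i` of `missed_xy` then pairs with row `i` of `next_xy`, whether the following shot was made or missed.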

astral path
#

ok thanks

misty flint
astral path
#

NotImplementedError: iLocation based boolean indexing on an integer type is not available

hasty grail
#

Can you print out missed?

#

maybe you need to add .values at the end to convert it into a boolean array

#

instead of a Series

astral path
#

yea sure

#

i changed it to missed_home though because im doing it for two different teams

#

i changed the missed in the iloc to missed_home btw

hasty grail
#

that's fine

astral path
hasty grail
#

can you display missed_home.values?

#

if it becomes a regular numpy array then that's what you need to use instead

astral path
hasty grail
#

yup

astral path
#

ah ok

hasty grail
#

as in

missed = (homeoutcomes['EVENT_TYPE'] == 'Missed Shot').values
astral path
#

that's what i did

#

missed_x_y_home gives an error
KeyError: "None of [Index(['LOC_X', 'LOC_Y'], dtype='object', name='GAME_ID')] are in the [index]"

ripe forge
astral path
#

yeah sure

hasty grail
#

So that I don't get confused by home and stuff

astral path
#
xlocs = df['LOC_X']
ylocs = df['LOC_Y']
made = df['SHOT_MADE_FLAG']

missed_x = []
next_x = []
missed_y = []
next_y = []
df.set_index(keys=['GAME_ID'], drop=False,inplace=True)
game_ids = df['GAME_ID'].unique().tolist()
for id in game_ids:
  gamestats = df.loc[df.GAME_ID==id]
  team_names = list(set(gamestats['TEAM_NAME'].tolist()))
  homestats = gamestats[gamestats['TEAM_NAME'] == team_names[0]]
  awaystats = gamestats[gamestats['TEAM_NAME'] == team_names[1]]

  homeoutcomes = pd.DataFrame(homestats['EVENT_TYPE'])
  awayoutcomes = pd.DataFrame(awaystats['EVENT_TYPE'])

  missed_home = (homeoutcomes['EVENT_TYPE'] == 'Missed Shot').values
  missed_x_y_home = homeoutcomes.iloc[missed_home].loc[['LOC_X', 'LOC_Y']]

  missed_next_home = np.concatenate([[False], missed_home[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_home = homeoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]
    
  missed_away = (awayoutcomes['EVENT_TYPE'] == 'Missed Shot').values
  missed_x_y_away = awayoutcomes.iloc[missed_away].loc[['LOC_X', 'LOC_Y']]

  missed_next_away = np.concatenate([[False], missed_away[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_away = awayoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]

missed_x = np.asarray(missed_x)
missed_y = np.asarray(missed_y)
next_x = np.asarray(next_x)
next_y = np.asarray(next_y)
hasty grail
#

ok, can you print homeoutcomes.iloc[missed_home]?

astral path
hasty grail
#

Where are the LOC_X and LOC_Y columns?

astral path
#

x coordinate of a given shot and y coordinate

hasty grail
#

I don't see those columns in your screenshot...

astral path
#

oh those are in a different dataframe !!!

hasty grail
#

could you combine them together beforehand?

astral path
#

should be homestats

#

and i'm getting homeoutcomes from homestats

hasty grail
#

I guess you can just use homestats all the way then

#

(and awaystats)

astral path
#

hmm

#

i still get KeyError: "None of [Index(['LOC_X', 'LOC_Y'], dtype='object', name='GAME_ID')] are in the [index]"

hasty grail
#

Your edited code?

astral path
#

yes

#

on missed_x_y_home

hasty grail
#

As in, what is your code now?

#

(Sorry for not being clear)

astral path
#
for id in game_ids:
  gamestats = df.loc[df.GAME_ID==id]
  team_names = list(set(gamestats['TEAM_NAME'].tolist()))
  homestats = gamestats[gamestats['TEAM_NAME'] == team_names[0]]
  awaystats = gamestats[gamestats['TEAM_NAME'] == team_names[1]]

  homeoutcomes = pd.DataFrame(homestats['EVENT_TYPE'])
  awayoutcomes = pd.DataFrame(awaystats['EVENT_TYPE'])

  missed_home = (homestats['EVENT_TYPE'] == 'Missed Shot').values
  missed_x_y_home = homestats.iloc[missed_home].loc[['LOC_X', 'LOC_Y']]

  missed_next_home = np.concatenate([[False], missed_home[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_home = homestats.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]
    
  missed_away = (awayoutcomes['EVENT_TYPE'] == 'Missed Shot').values
  missed_x_y_away = awayoutcomes.iloc[missed_away].loc[['LOC_X', 'LOC_Y']]

  missed_next_away = np.concatenate([[False], missed_away[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_away = awayoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]

missed_x = np.asarray(missed_x)
missed_y = np.asarray(missed_y)
next_x = np.asarray(next_x)
next_y = np.asarray(next_y)
#

np

hasty grail
#

what is homestats.iloc[missed_home]?

astral path
#

thats homeoutcomes.iloc[missed_home] but with homestats instead

#

same error either way

hasty grail
#

well yeah, but does it contain the columns 'LOC_X' and 'LOC_Y'?

astral path
#

yes, it does

#

its the most recent screenshot ive posted

hasty grail
#

after the iloc

astral path
#

hmm?

hasty grail
#

run display(homestats.iloc[missed_home])

#

supposedly it should have those two columns

astral path
#

got them

hasty grail
#

so homestats.iloc[missed_home].loc[['LOC_X', 'LOC_Y']] results in an error?

astral path
#

yeah that one does

hasty grail
#

wait

#

I'm dumb

#

ok umm can you do this?

#

homestats.loc[missed_home, ['LOC_X', 'LOC_Y']]

astral path
#

yes

#

that worked

hasty grail
#

also maybe you could get away with not using .values

astral path
#

thanks! i'll try the rest of it

hasty grail
#

somehow I thought .iloc uses row indexing while loc uses col indexing
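
For reference, a minimal sketch of the difference on a made-up toy frame (the data and the `stats` name are assumptions, not the real shot data). A single list passed to `.loc` is looked up in the row index, which is where the KeyError came from, while `.loc[row_mask, column_list]` selects rows and columns in one step:

```python
import pandas as pd

# Toy stand-in for one team's shots in a game (values are invented).
stats = pd.DataFrame(
    {
        "EVENT_TYPE": ["Missed Shot", "Made Shot", "Missed Shot"],
        "LOC_X": [10, -5, 30],
        "LOC_Y": [20, 15, 40],
    }
)

missed = stats["EVENT_TYPE"] == "Missed Shot"  # boolean Series aligned with stats

# A single list given to .loc is treated as ROW labels, so this lookup fails.
try:
    stats.loc[missed].loc[["LOC_X", "LOC_Y"]]
    row_lookup_failed = False
except KeyError:
    row_lookup_failed = True

# Rows and columns together: .loc[row_selector, column_selector]
missed_x_y = stats.loc[missed, ["LOC_X", "LOC_Y"]]
```

Both `.loc` and `.iloc` take rows first and columns second; the difference is labels/boolean masks versus integer positions.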

astral path
#

ran with no errors!

hasty grail
#
for id in game_ids:
  gamestats = df.loc[df.GAME_ID==id]
  team_names = list(set(gamestats['TEAM_NAME'].tolist()))
  homestats = gamestats.loc[gamestats['TEAM_NAME'] == team_names[0]]
  awaystats = gamestats.loc[gamestats['TEAM_NAME'] == team_names[1]]

  missed_home = (homestats['EVENT_TYPE'] == 'Missed Shot')
  missed_x_y_home = homestats.loc[missed_home, ['LOC_X', 'LOC_Y']]

  missed_next_home = np.concatenate([[False], missed_home[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_home = homestats.loc[missed_next_home, ['LOC_X', 'LOC_Y']]

  missed_away = (awaystats['EVENT_TYPE'] == 'Missed Shot')
  missed_x_y_away = awaystats.loc[missed_away, ['LOC_X', 'LOC_Y']]

  missed_next_away = np.concatenate([[False], missed_away[:-1]])  # Shift the `missed` boolean mask
  missed_next_x_y_away = awaystats.loc[missed_next_away, ['LOC_X', 'LOC_Y']]
#

also you can use pd.unique instead of set

#

team_names = pd.unique(gamestats['TEAM_NAME'])
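
The "shift the boolean mask" step from the loop above can be checked on a toy example. This is only a sketch: the event sequence is invented, and the pandas `shift` variant is an alternative that is not used in the conversation.

```python
import numpy as np
import pandas as pd

# Invented event sequence for one team in one game.
events = pd.Series(["Missed Shot", "Made Shot", "Missed Shot", "Missed Shot"])

missed = (events == "Missed Shot").to_numpy()

# Prepend False and drop the last entry:
# position i is True when the shot at position i-1 was a miss.
after_miss = np.concatenate([[False], missed[:-1]])

# pandas alternative that yields the same mask.
after_miss_pd = (events == "Missed Shot").shift(1, fill_value=False).to_numpy()
```

Note that when the last shot in the sequence is a miss it has no following shot, so the "missed" selection can be one row longer than the "next shot" selection.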

astral path
#

that works

#

however,

#

going back to centroids

#

that now gives me a ValueError: setting an array element with a sequence.

hasty grail
#

where?

astral path
#

it's now

indices = []
count = 0

missed_points = missed_x_y
centroids =  missed_x_y
centroids_t = cKDTree(centroids)
points_t = cKDTree(missed_points)

result = centroids_t.query_ball_tree(points_t, r=10)  
num_neighbours = [len(neighbours) for neighbours in result] 

display(num_neighbours)
#

at cKDTree(centroids)

hasty grail
#

what is missed_x_y?

astral path
#

wrong thing

#

edited

hasty grail
#

edited

astral path
#

edited?

hasty grail
#

my question

astral path
#

oh

#

oh i forgot to say

#

instead of missed_next_x_y_home = homestats.loc[missed_next, ['LOC_X', 'LOC_Y']]

hasty grail
#

remember that you need to pass a 2-D ndarray to cKDTree

#

not a pandas dataframe

astral path
#

i have an array x_y and I'm using x_y.append(homestats.loc[missed_home, ['LOC_X', 'LOC_Y']])

#

and then missed_x_y = np.asarray(x_y)

hasty grail
#

why don't you just set missed_x_y to homestats.loc[missed_home, ['LOC_X', 'LOC_Y']]?

astral path
#

because i'm adding homestats.loc[missed_home, ['LOC_X', 'LOC_Y']] to missed_x_y for each game

hasty grail
#

ok so the problem is that you need to concatenate the list of dataframes together*

astral path
#

you mean missed_x_y?

hasty grail
#

yes

#

missed_x_y = pd.concat(missed_x_y)

astral path
#

well no errors

hasty grail
#

and then

astral path
#

although the nearest neighbors is slow as hell

hasty grail
#

missed_x_y = missed_x_y.to_numpy()

#

(or just chain them together)
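
Chained together, the two steps might look like the sketch below. The per-game frames are invented toy data and `per_game` is a hypothetical name for the list that gets appended to inside the loop; the point is only that the result is a single 2-D array of shape (n_points, 2), which is what cKDTree expects.

```python
import pandas as pd

# Hypothetical list of per-game frames, as built by appending inside the loop.
per_game = [
    pd.DataFrame({"LOC_X": [10, 30], "LOC_Y": [20, 40]}),
    pd.DataFrame({"LOC_X": [-5], "LOC_Y": [15]}),
]

# Stack the frames row-wise, then convert to a plain 2-D ndarray.
missed_x_y = pd.concat(per_game, ignore_index=True).to_numpy()

print(missed_x_y.shape)  # (3, 2)
```

Calling `np.asarray` directly on a list of frames with different lengths cannot produce a regular 2-D array, which is the likely source of the ValueError.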

astral path
#

so i should do the subsampling?

hasty grail
#

yeah

#

you have >100k points now, right?

astral path
#

yea

#

should be ~130k

hasty grail
#

There are a couple of ways to do subsampling

astral path
#

runs much faster with [::2]

hasty grail
#

As mentioned before, the easiest way is to just sample every n points

#

However, this only works well when the points are in random order, which is not necessarily true here: you have sorted them in time sequence, which may affect their spatial distribution

astral path
#

yeah

hasty grail
#

A better method would be to randomly shuffle the points beforehand

#

Still pretty basic, but should be sufficient

astral path
#

how would i shuffle it so that the next shots are shuffled the exact same way?

hasty grail
#
rng = np.random.default_rng()
shuffled = rng.shuffle(original)
#

this would be nondeterministic shuffling

#

if you want the shuffling to be identical each time the program is run, pass a seed to default_rng

#

(such as zero)

astral path
#

hmm

#

which is based on num_neighbours

hasty grail
#

oops made a mistake

#

rng.shuffle is in-place so it doesn't return anything

#

so you should just call rng.shuffle(original), and original will be shuffled
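
To keep each missed shot paired with its following shot, one option (a sketch, not something confirmed in this conversation) is to draw a single random permutation of row indices and apply it to both arrays. `missed_xy` and `next_xy` are hypothetical names for two equally long 2-D arrays, and the data is made up.

```python
import numpy as np

# Made-up paired data: row i of next_xy belongs to row i of missed_xy.
missed_xy = np.arange(20).reshape(10, 2)
next_xy = missed_xy + 100

rng = np.random.default_rng(0)            # a seed makes the shuffle reproducible
order = rng.permutation(len(missed_xy))   # one random ordering of the row indices

# Index both arrays with the same ordering so the pairs stay aligned.
missed_shuffled = missed_xy[order]
next_shuffled = next_xy[order]

# Subsample after shuffling, e.g. keep every 5th pair.
missed_sub = missed_shuffled[::5]
next_sub = next_shuffled[::5]
```

Unlike `rng.shuffle`, `rng.permutation` returns a new array and leaves the originals untouched.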

#

what does your code look like now?

astral path
#

this is with the every other one

#

[::2]

hasty grail
#

increase the subsampling ratio

#

it's way too messy to really tell

astral path
#

yeah there's waaaay too many points

#

I'll try out the rng.shuffle tomorrow

#

i'm done for the night so THANK YOU!

hasty grail
#

No problem!

astral path
#

you have no idea how helpful this was!

hasty grail
#

Am glad to help

tall trail
#

is it possible in dash to plot a line with an x axis start and end time? i cant find anything on the old google machine

lapis sequoia
#

I was told that you have to be really good at math and statistics otherwise you can’t be a great data scientist. Is it true?

velvet thorn
#

it depends on what kinda DS you wanna be

#

there are many types.

sage dagger
#

If you want to study Data Science you'll definitely need math, BUT there is no "good" or "bad" at math; it's just a matter of how much time and practice you want to put into understanding it.

velvet thorn
#

also "really good" is subjective

#

for example

#

if you want to be a research scientist

#

you'll defo need to have your linear algebra etc. down

sage dagger
#

For analysis tho you just need to know how to run the right analysis on a given data set

lapis sequoia
#

I was into DA but now I’m considering moving to NLP and stuff more like AI than analysis. Then I found my math sucks.

#

I have basic knowledge in linear algebra, but that’s it

velvet thorn
#

so like

#

let's say you're a DS running A/B testing

#

you need to know what you're doing

#

proper statistical methodology is important

#

in, for example, placing subjects into cohorts

#

understanding the statistical significance of your results

#

etc.

#

but there are also DS roles

#

which are more

#

...

#

programming-oriented?

lapis sequoia
#

Like?

velvet thorn
#

as in

#

what I mean is that

#

both are called "Data Scientist"

#

but

#

actual responsibilities may differ

#

just like you can be a "Software Engineer" and work on embedded systems

#

or a "Software Engineer" and do webdev

lapis sequoia
#

Ah yes

#

Different expertise within the same field I get that

#

I’m still not sure what I really want to do

velvet thorn
#

and

#

you can be also a more business-facing DS

#

so like your special sauce is being very good @ bringing stakeholders together and communicating with them

lapis sequoia
#

Data analyst seems to be... inferior? to data scientist

sage dagger