#data-science-and-ml
Difference between sentence encoders and tokenizers?
tokenizers only break text into tokens (be it sentence tokenizers, word tokenizers, etc.)
encoders convert tokens (strings, basically) into vectors: numbers that can be used by the models
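To make the distinction concrete, here's a toy sketch. The names `tokenize`, `build_vocab` and `encode` are made up for illustration, and a real sentence encoder would be a trained model rather than a word count:

```python
import re

def tokenize(sentence):
    """Tokenizer: split a string into lowercase word tokens (strings in, strings out)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def build_vocab(sentences):
    """Assign each distinct token an integer index, in first-seen order."""
    vocab = {}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab):
    """Toy encoder: turn a sentence into a bag-of-words count vector."""
    vector = [0] * len(vocab)
    for token in tokenize(sentence):
        if token in vocab:  # tokens outside the vocabulary are ignored
            vector[vocab[token]] += 1
    return vector

corpus = ["The cat sat", "The dog sat down"]
vocab = build_vocab(corpus)
tokens = tokenize("The cat sat")       # still strings
vector = encode("The cat sat", vocab)  # numbers a model can consume
```

The tokenizer output is still text; only the encoder step produces numbers.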
so ive ran into a problem
shoot
basically my series looks like this now
this is constructed from the code you gave me where i was making a series/dataframe with each combination of distances
im trying to enumerate over them and add the frequency to another array so it's not the frequency of each unique combination but more just another column in a dataframe
for i in enumerate(missed_series):
    freq = dist_freq_series.loc[missed_series[newi], next_series[newi]]
    freqs.append(freq)
    newi += 1
however, for some reason i'm getting KeyError: (26.0, 25.0), but I don't see why that shouldn't exist
wait never mind
posting this made me realize that the dataframe i was enumerating over was wrong
thank you !
idk if you follow basketball but it should make sense anyways
the plot is basically where the x axis is the distance from the hoop of a missed shot, and the y axis is the change in how far away the next shot is after that
the size/color corresponds to the frequency of each pairing of missed shot and consecutive shot
i like this color more i think
hi does anyone know about plotly treemap??
google it
When experiencing poor model performance, how do I know if I need more data or if I need to train longer?
need help
i have this code
`HB_NO_List = list(VillageDataDict.keys())
for eachVillageHB_NO in HB_NO_List:
    TempVillage, TempHB_NO, TempScale = VillageDataDict[eachVillageHB_NO]
    print(TempVillage[0])
    Vilanename = str(TempVillage[0])
    FilterVillage = Killa_Boundary[Killa_Boundary['Village_Na'] == Vilanename]
    print(FilterVillage.head(1))
but it's giving me an empty dataframe though I have data there
SHERON
FID_Final_ Killa_No Hb_No Village_Na Mustil_No Shape_Length
0 38711 15 266 SHERON 116 68.383971
geometry
0 MULTILINESTRING ((495447.227 3470154.065, 4953...
KOT DAS GUNDI MAL
Empty GeoDataFrame
Columns: [FID_Final_, Killa_No, Hb_No, Village_Na, Mustil_No, Shape_Length, geometry]
Index: []`
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
Found Error in my Script, Thank you for support
You're welcome.
hey guys
I have a question
I have a row in my excel file
it stretches all the way to week 52
I want to change it to a datetime
nvm I got it
@raw juniper what? Are you working in Excel, CSV, pandas dataframes? Actually that does not matter... it's Python anyway
Your question does not make any sense at all.
Anyway, I guess your question is answered here if you're looking to convert e.g. 2021-W1 into 2021-01-04:
https://stackoverflow.com/a/1700069/3314139
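For reference, one way to do that conversion with the standard library (not necessarily what the linked answer does), assuming the weeks are ISO 8601 weeks and you want the Monday of each week. `iso_week_start` is just an illustrative name:

```python
from datetime import date

def iso_week_start(year, week):
    """Return the Monday of the given ISO 8601 week (needs Python 3.8+)."""
    return date.fromisocalendar(year, week, 1)

first = iso_week_start(2021, 1)   # Monday of week 1
last = iso_week_start(2021, 52)   # Monday of week 52
print(first, last)
```

If your weeks are numbered differently (e.g. week 1 always starting on January 1st), the dates will not line up with this.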
I got the answer on my own thx btw
my question was essentially, I have a row that spans from week 1 to week 52
I wanted to change it to this
@raw juniper yes, that is what I provided above. Anyway, good job
cool
Also week 1 started on January 4th btw, not 7th @raw juniper
Unless you typed a date into the formula to get a specific day of the date in the week
pandas question: In pyspark I can filter a df with df.filter("col1=x and col2=y"), but in pandas the only way I know is df[(df.col1==x)&(df.col2==y)]. This syntax gets very verbose when there are a lot of conditions and the df name is long. Wondering if there is a simpler syntax in pandas similar to what can be used in pyspark
there is df.query, though i personally find normal filters easier to understand
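A small sketch of both styles side by side, with a made-up dataframe (the column names and values here are just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 1, 2], "col2": ["a", "b", "a"], "val": [10, 20, 30]})

x, y = 1, "a"

# Boolean-mask style
masked = df[(df.col1 == x) & (df.col2 == y)]

# Same filter with query; @name refers to a local Python variable
queried = df.query("col1 == @x and col2 == @y")
```

Both give the same rows, so it mostly comes down to which one you find easier to read.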
Hi guys, I'm trying to write a crawler for a school project in which I need to basically fetch thousands of images.
Currently I'm using Selenium in order to open the website and find the src and alt of the images, then use requests in order to download them.
Doing so means that I have to basically download each image twice - once when I load the page - and once when I download it afterwards
Do you know if it's possible to fetch the images out of the cache of the browser? I mean those images were rendered - they should be somewhere on the PC..
If not, (and I could not find any way to get it to work using GeckoDriver) I'd love some assistance regarding the usage of a proxy in order to do the same.
@ripe forge thanks!
Bear in mind that I also need the alt of the images since I use it to parse them.
So I can't just cache them with a random name
what is this type of data called?
In case you are serious it is not possible to distinguish the type of such data from looks alone.
Encrypted perhaps ? @stone jungle
Yeah ig
this is also my question in case anyone has an answer.
using SMOTE do i need to specify the classes?
OR it does everything?
Also, after using ImageDataGenerator, how can i take a look at the pictures?
like, how can i get them?
I literally haven't seen any type of data resembling this, but even if there is such data, what pattern can be extracted from this data that could be useful in real life?
It's a video file rename as TXT file
I wanted to make something that could compress video files but this way too complex
It's alright, I r good
U r good bro
Btw see this star I captured from my telescope
Cool
Tq
using SMOTE do i need to specify the classes?
OR it does everything?
Also, after using ImageDataGenerator, how can i take a look at the pictures?
like, how can i get them?
Hello I had recently started ML, and wanted to know something about epochs & alpha (learning rate). I coded up my own Simple Linear Regression model, and I used it on some data. At first I wasn't getting the perfect values as I got from sklearn, then I tinkered a bit with the epochs and alpha variables: I increased epochs from 1k to 2k and alpha from 0.0001 to 0.001, and I then had the perfect results that matched with sklearn's Linear Regression. So I wanted to know how to choose the values for epochs and alpha, so that I don't have to use trial and error every time with my model.
How do I round to "one digit precision"? For example: 0.000784 -> 0.0008, 0.69 -> 0.7, 0.049683948 -> 0.05. Tried googling a bit but everyone points to Round() which doesn't work in this case.
Hi, does anyone know where the faster rcnn inception v2 model is? I tried looking in https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/tf2_detection_zoo.md but can't find it
An epoch is one full pass over the training data, so more epochs means the model has seen the data more times, while a low learning rate with few epochs will not teach the model enough. As these are variables and not fixed settings, they are dependent on the training data and the model itself. So depending on the settings you can teach your model fast (but maybe over-generalize), or slow (but not learn enough to make an accurate decision, leaving you with just a blurry network)
this all depends on the kinds of variations in the training data you want to accept
like training a model with emojis could be quite fast as you can teach the model to identify the exact emoji from the training data, while if you want to identify a person in a picture it needs to see more different examples of the same thing to understand what the pattern of "person" actually is
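Here's a minimal sketch of how epochs and alpha interact, on made-up data where the true line is known. `fit_line` is a hypothetical helper, not anyone's actual model:

```python
def fit_line(xs, ys, alpha, epochs):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):  # one epoch = one full pass over the data
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

xs = list(range(10))
ys = [2 * x + 1 for x in xs]  # true line: w = 2, b = 1

w_slow, b_slow = fit_line(xs, ys, alpha=0.001, epochs=1000)  # under-trained
w_good, b_good = fit_line(xs, ys, alpha=0.01, epochs=2000)   # close to w = 2, b = 1
print(w_slow, b_slow)
print(w_good, b_good)
```

With the small alpha the intercept is still noticeably off after 1000 epochs, while the larger alpha with more epochs lands close to the true values. As far as I know, sklearn's LinearRegression solves the least-squares problem directly instead of iterating, which is why it gives the "perfect" values without any tuning.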
does it make sense to use stochastic gradient descent, when i have numerical, ordinal, and nominal data? if so, then does it make sense to scale the ordinal data?
Greetings, working on some NLP with spaCy.
IN python what does this mean?
xpos = a[1]['pos']
is that an array, first element? that has value of pos?
what if I wanted to check whether a[1]['pos'] exists first?
ok, that looks like it is a list
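If `a` really is a list of dicts (that's an assumption, the actual spaCy-derived structure isn't shown), a hedged way to check before reading the value could look like this. `get_pos` is an illustrative name:

```python
def get_pos(a, index=1):
    """Return a[index]['pos'], or None if the index or key is missing."""
    if len(a) > index and isinstance(a[index], dict):
        return a[index].get("pos")  # dict.get returns None when the key is absent
    return None

a = [{"pos": "DET"}, {"pos": "NOUN"}, {"text": "runs"}]

xpos = get_pos(a)         # second element has a 'pos' key
missing_key = get_pos(a, 2)   # element exists, key does not
missing_item = get_pos(a, 5)  # element does not exist
```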
df = pd.DataFrame([[1,2,'a','b'],[3,4,'c','d'],[5,6,'c','e']],columns=['inta','intb','Ticker','desc'])
df.groupby('Ticker').agg({'inta':'mean','intb':'mean','desc':'max'})
assuming that 'max' is an appropriate operation for aggregating your qualitative data
interesting question...
I would say
use log10 to find the first nonzero number
then floor divide by it
!e
from math import ceil, log10
def f(i):
    exp = ceil(log10(i)) - 1
    sig = round(i / 10 ** exp)
    return sig * 10 ** exp
print(f(0.69))
print(f(0.000784))
@velvet thorn :white_check_mark: Your eval job has completed with return code 0.
001 | 0.7000000000000001
002 | 0.0008
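An alternative sketch that sidesteps the `0.7000000000000001` float artifact by going through the `g` format specifier, which rounds to a number of significant digits. `round_1sig` is just an illustrative name:

```python
def round_1sig(x):
    """Round x to one significant digit via the 'g' format specifier."""
    return float(f"{x:.1g}")

for value in (0.000784, 0.69, 0.049683948):
    print(value, "->", round_1sig(value))
```

Since the result is parsed back from a short decimal string, it prints cleanly.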
hey everyone... need help with pandas... getting this value error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
transaction_cust.groupby('store_id').sum((abs(transaction_cust.tran_prod_discount_amt))/sum(transaction_cust.tran_prod_sale_amt))
from pandas import DataFrame as DF
uhm, is there any way for the gTTs lib to play without saving it
like ```py
from gtts import gTTS
text = input('Enter my speech.. ')
speech = gTTS(text)
```
is there something like speech.play or something, or do I have to use the create-and-delete approach?
how come i HAD to do .size() first before i could do .sort_values()
it returns this instead otherwise
does it really not have that attribute, i think thats weird
Everytime I install speech python speech recognition it says error.
@tidal bough
Is option a correct answer?
Should I use hypergeometric distribution formula for calculating probability?
Or Which formula is easier to calculate for finding probability?
Please @ me once you see this message
that's the binomial distribution, which can be approximated by a normal one with the right mean and std. So that's what one should use here, I believe.
the mean is 90/4 = 22.5, and the std is... what was it, sqrt(npq) I think? That's sqrt(90*3/4*1/4) = sqrt(270/16) ~= 4.108
10 means a z-score of around -3.043, so >10 corresponds to z > -3.043
so almost always, I think it's the first answer. Very little of the normal distribution's mass is below -3 standard deviations.
(this is pretty clearly how you're supposed to do this, considering you even get a nice z-score of almost exactly -3).
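If you want to check the numbers, here's a sketch using only the standard library. The original problem statement isn't shown, so n = 90, p = 1/4 and the "more than 10" threshold are taken from the messages above and are assumptions:

```python
from math import comb, erf, sqrt

n, p, k = 90, 0.25, 10  # assumed from the discussion: 90 trials, p = 1/4, threshold 10

mean = n * p                     # 22.5
std = sqrt(n * p * (1 - p))      # sqrt(npq)
z = (k - mean) / std

def normal_cdf(value):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(value / sqrt(2)))

p_normal = 1 - normal_cdf(z)  # normal approximation of P(X > 10)
p_exact = 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

print(f"mean={mean}, std={std:.3f}, z={z:.3f}")
print(f"normal approx: {p_normal:.4f}, exact binomial: {p_exact:.4f}")
```

Both probabilities come out very close to 1, which matches the "almost always" reasoning.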
Hey,
can someone explain to me what differentiates valmin, valmax, slidermin, slidermax on Sliders in matplotlib and how to use them?
@tidal bough
So if the std is not given then we can find using formula and using normal distribution(z score) we can find the probability rite?
We can follow this method for most of the problems am I right?
I didn't get this part
How is >10 z score is >-3.043?
Oops sorry I misunderstood
I thought that >10 z score of around>-3.043 was directly found
We got that value(-3.043) using this
Z=(x-mean)/sigma
So
Z=(10-22.5)/4.108 Z=-3.043
So >10 z score is around >-3.043
Since for me z table will be not available during my future exam
We can approximate it as 99.97%
Am I right?
yup, like I said before, not having a z-table means you are supposed to approximate - and here, it's enough to have a very vague intuition that 3 standard deviations is a lot, so it's the three-nines answer.
You can memorize the values for 1,2 and 3 std deviations to help
is anyone here knowledgeable about xpath for webscraping, or is there another topic that is more relevant for that?
fuck anaconda is so fucking confusing
can someone dm me so we could voice talk and he would teach me how to fucking use it
Anyone know how to rectify this? Trying to analyse a .csv file of netflix data to filter episodes of Friends, and when I try to add columns for days and hours, it returns an error.
Absolute beginner, apologies
if you read the output, you'll notice that it's actually a warning, but not all warnings should be ignored.
this is honestly why i just use a text editor. a lot of IDEs introduce too much magic, and it just gets annoying when you have an idea of what you want to do
Oh so I can just carry on and it should generally work fine?
In this case the warning can be ignored but at the cost of your code being less efficient
If you follow the provided link it should explain why
Brilliant, thank you!
I've got a problem which I can't seem to fix (it's NLP Natural Language Toolkit related!)
Sooo I run my application app.py on a Flask server and when it tries to execute the function preprocess_data (depicted below), I get a "TypeError" exception on this line: self.work_tokenize(self.splitsentence)
function preprocess_textdata
type(self.splitsentence)
you use the wrong datatype as input, or maybe you target some immutable
@nova widget I'm using BeautifulSoup module to format the extracted webpage data in a clean format and this gets rid of the punctuations (including full stops)..
Could that be the reason why NLTK's sent_tokenize isn't giving me any result ...?
As it couldn't find any full stops (periods) to split the text string into separated sentences , when the BeautifulSoup object parsed the data?
did you check what type is self.splitsentence
type(self.splitsentence) doesn't return anything for some reason idk.... that's odd
Hello guys do you know any good tutorial? For ds?
Data Science is pretty broad... it depends what specific area you wanna delve into..
Sorry economics
Trying to transform X into a regression model. It this correct:
- Transform X, keep y then plot original vs y?
you have to print that
obviously
@nova widget It's <class 'list'>... it executes the line I showed you which wasn't working earlier perfectly now
I'm glad that line got executed without getting any exceptions now
Before applying any NLP operations, I realise I get improperly decoded Unicode chars/symbols just after I make a request to several URL links and extract webdata...
Anyone know the reason why I get this? And how I might properly convert these sort of characters?
Decoded Unicode symbols/chars on one of my webpage that I'm constructing...
I am using celery to load csv data as Panda dataframe and using Celery's group function to distribute it across multiple workers. The intent is to distribute the workload in an idempotent way so each worker can execute it.
I am using Data frame chunks to iterate and distribute small batches of datasets to process it by workers
@celery_app.task(bind=True, autoretry_for=(WorkSchedulingException,),
                 retry_kwargs={'max_retries': 5})
def make_inventory_partitions(self, inventory_bucket: str, gz_name: str, tsk_id):
    gz_path = f"s3://{inventory_bucket}/{gz_name}"
    chunk_size = int(getenv('PARTITION_SIZE', 1000))
    logger.info(f'Splitting inventory [{gz_name}] into chunks of {chunk_size} records per task')
    itr = 0
    try:
        group(
            process_inventory_chunk.s(txn_id=tsk_id, gz_name=gz_path, chunk=chunk.to_json()) for chunk in
            pd.read_csv(gz_path, chunksize=chunk_size, header=None, usecols=[1, 3], names=['path', 'dt']))()
    except Exception as err:
        logger.critical(f'Failed to schedule inventory partitions tasks [{tsk_id}]', exc_info=err)
        raise WorkSchedulingException(f'Failed to schedule inventory partitions tasks workloads for task [{tsk_id}]')
Is there any better approach, because my task and worker keep going OOM
Hi, is anyone here familiar with beautifulsoup?
I would like to extract the attributes of the child elements in a form element
@fathom sail this might not be a data science question, but in either case you might consider opening a help session and sharing what code you've written so far. See #how-to-get-help
Any chance for some quick help?
I have a 700*350 np.ndarray which contains 3D tuples
I want to iterate over all of the tuples
I am using np.nditer() in order to achieve that but it just flattens my tuples as well and returns the values of the tuple.
I saw that there's op_axes= and itershape= as params but I could not figure out how to use them
@twin moth what's the larger context of the problem? If you can turn this into a (700,350,3) matrix, you can probably leverage NumPy's efficient iteration
If you use a for loop, you won't get that.
It's already a matrix of that shape
But when I iterate over it using the aforementioned function it just iterates over the tuple values as well
Thanks a lot for the help man!
Ya from now on I will remember all those values
huh
so is it an array containing tuples or not
I'm a bit confused
Got a feeling that you could vectorize whatever you're trying to do
^
Uh, i am trying my hands on the ||titanic dataset|| in kaggle and I am getting an unexpected output
For predictions, I turned the value of "male" and "female" to "1" and "0". Now, for submission, when I convert them back, I get the gender to be passenger ID of the person
Any help is appreciated.....
Thanks in advance
its the encoding
Hey guys, any place I can practice some DS problems? Like some jupyter notebook cases, probability, machine learning concept questions, optimizing scenarios etc..
What about the encoding @lapis sequoia? I've set the encoding to be UTF-8 as well..
One part of understanding how encoding works is understanding that just setting encodings to utf 8 while reading won't actually solve anything and may also increase the problems in some cases.
I'd highly recommend taking some time to read this. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Thanks @ripe forge I'll check that out
I also do get other human-readable English text content extracted along with these mixed Unicode symbols
i cant seem to figure out how to find a row in a dataframe while in a for loop, it just keeps returning an empty dataframe. anyone know why this keeps happening?
more info:
i am trying to check if the value of test is above a threshold over 20 rows of data, if it is: write it to a separate dataframe.
here is my code so far:
test_data = np.random.randint(119,150, size= 300)
peak_df = pd.DataFrame(columns=['test'])
test_df = pd.DataFrame(test_data, columns=['test'])
tempdf = pd.DataFrame(columns=['test'])
print( '\nIndex:', type(test_df.test))
counter = 0
for i in range(len(test_df)) :
    val = test_df.loc[i, 'test']
    if val > 119 :
        counter = counter+1
        tempdf.append(test_df.loc[i], ignore_index=True)
        # print(tempdf)
        if counter == 20:
            print('counter reached 20, appending')
            frames = [tempdf, df]
            # print(df)
            # print(tempdf)
            peak_df.append(tempdf, ignore_index=True)
    else:
        counter = 0
        if not tempdf.empty:
            tempdf.drop(['test'])
General advice, if you're ever iterating on a dataframe, There's probably a better way to do whatever you're trying to do.
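For this particular case, one thing worth knowing is that `DataFrame.append` returned a new dataframe instead of modifying in place (and it was removed in pandas 2.0), so calling it without assigning the result leaves `tempdf` empty. A loop-free sketch of the same idea, on a small made-up series and assuming "above the threshold for 20 consecutive rows" is what's wanted:

```python
import pandas as pd

# Made-up data: a run of 5, a run of 25 and a run of 10 rows above 119
values = [120] * 5 + [119] + [130] * 25 + [119] + [125] * 10
test_df = pd.DataFrame({"test": values})

threshold = 119
window = 20

above = test_df["test"] > threshold
# True at the last row of every window of 20 consecutive rows above the threshold
run_end = above.astype(int).rolling(window).sum() == window

# Look forward from each row: does any window ending within the next 20 rows qualify?
in_run = (
    run_end.astype(int)[::-1]
    .rolling(window, min_periods=1)
    .max()[::-1]
    .astype(bool)
)
peak_df = test_df[in_run]
print(peak_df.index.min(), peak_df.index.max(), len(peak_df))
```

Only the 25-row run is long enough, so only those rows end up in `peak_df`.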
hehe, ya im still kinda new to dataframes and i dont quite get the lambda function yet. i see that gets used a lot
Why?
It's an image, each of those tuples represent the color of the pixel
They don't have to be tuples though, right?
More specifically it's just a 3d array, and at this stage there's no more "tuples" yes?
To phrase it differently, we understand that there's a logical relation and these are denoting the 3 channels of the images, but we very specifically don't store them as tuples anymore, correct?
(if tuples exist still in the array, they need to go. Don't mix tuples inside your np array).
then you should use a 3D array
not a 2D array of tuples
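A quick sketch of treating each pixel as a unit without storing tuples in the array. The tiny `image` here is a stand-in for the real (350, 700, 3) one:

```python
import numpy as np

image = np.arange(2 * 3 * 3).reshape(2, 3, 3)  # stand-in for a (350, 700, 3) image

# One row per pixel: shape (6, 3), each row is one (R, G, B) triple
pixels = image.reshape(-1, 3)

# Convert a single pixel to a tuple only when something hashable is needed
first_pixel = tuple(int(c) for c in pixels[0])

# np.ndindex walks the first two axes only, so each pixel stays whole
coords = [(r, c) for r, c in np.ndindex(image.shape[:2])]
whole_pixel = image[coords[1]]  # array of 3 channel values
```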
Huh, okay, I'd try to remove them, but I'd still need to parse them as a single unit
What specifically are you trying to do to the image?
-help
i have project about SVM and NAIVE BAYES for sentiments analysis and classification
Hey guys, I just joined the channel and I was wondering if anyone could point me in the right direction. I got the basics of python down, and can go as far as programming simple algorithms but I want to start developing, preferably for mobile banking. What steps should I take next and what frameworks should I start learning about?
Hmm, that's really broad... do you want to deal with the data or the user interface first?
Probably data, because once I know how it works, I feel learning UI will be easier
Are you planning on building a database for this?
my problem is where i can get the dataset, my target or classification is school suggestion box, i must able to classify the suggestion if its for school improvement, service, facility, school officials or professors
I would like to, but thats the thing, I really don't know where to start cause there's just so much to learn
Yes. Numpy arrays will make all this a breeze if you let it.
I just tried, it failed because I used the tuples as a dictionary key and since ndarrays are not hashable I can't do the same with them
In order to find where the pixel sits in the scale and not iterate through the entire scale once again (~200 comparisons) I just insert the tuple to a dictionary and I check if the tuple is found in it for each pixel in the heatmap
What are you trying to do?
I don't have much time to explain but I'm trying to parse those maps according to their scales
https://earthobservatory.nasa.gov/global-maps/MOD_NDVI_M
And insert each pixel and its' position on the scale into a pandas DF
That's only a single example, there are many different types of maps
It's like you've taken a car, put it on water and then you're complaining it doesn't swim properly. Numpy arrays are the de facto standard of working with images, and are insanely fast because they vectorize their operations.
But you have a approach in mind already which doesn't sit well with how numpy arrays are meant to be used
Which isn't a bad thing btw, just might be worth getting your feet wet and seeing how you can use arrays as intended
I have a clean copy of a map, which I compare to the 'colored' map, I iterate each and every pixel, if they are "equal" (or close to it) I ignore that pixel, if it's not equal I search for the smallest distance between the RGB of the pixel and the colors on the scale
I'm always down to learn
So, when you vectorize, you can compare each pixel near instantly. One rule with arrays, no iteration.
def calculate_lowest_distance(colored_rgb: np.ndarray,
                              default_rgb: np.ndarray,
                              scale_list: list,
                              known_pixels_dict: dict) -> int:
    if any(not is_rgb_array(var) for var in [colored_rgb, default_rgb]):
        print("{} or {} is not an ndarray".format(colored_rgb, default_rgb))
        sys.exit(1) # TODO: Error handling
    if colored_rgb in known_pixels_dict:
        return known_pixels_dict[colored_rgb]
    min_distance = calc_rgb_distance(colored_rgb, default_rgb)
    min_dist_idx = None
    if min_distance == 0:
        known_pixels_dict[colored_rgb] = min_dist_idx
        return known_pixels_dict[colored_rgb]
    for idx, scale_pixel in enumerate(scale_list):
        distance = calc_rgb_distance(colored_rgb, scale_pixel)
        if distance == 0:
            min_dist_idx = idx
            break
        if distance < min_distance:
            min_distance = distance
            min_dist_idx = idx
    known_pixels_dict[colored_rgb] = min_dist_idx
    return min_dist_idx
That's the current, not complete at all
Okay. So immediately we see a lot of iterations here. It's a more traditional style of programming
When it comes to numpy arrays, this style of programming becomes a hindrance.
I need to be on a laptop for this..
Meanwhile, can you do a simple experiment? Make two numpy arrays, pure arrays no tuples inside
And just subtract them. a-b and just see the output
how can i use SMOTE and flow_from_directory from Image Data Generator class together?
i found this so far
```py
for x, y in flow_from_directory:
    yield custom_balance(x, y, options)
```
How can i call SMOTE here?
Hmm I haven't really dealt with databases in Python in particular, you might want to see #databases
Are you familiar with boolean masking in NumPy?
Thanks
Python Basics in 9 tweets, get in the thread
Oh, If you don't wanna miss out a tweet like this follow me It's free.
#100DaysOfCode #Python #codingtips #CodeNewbies #CodeNewbie #code #programming #pythonprogramming #Python3 #Computer #HowtoPerfect #AI #MachineLearning
Nope
I tried that exact example yesterday but used tuples instead of ndarrays, seems like you can't subtract tuples
mhm. took the car, threw it in water.
lol
!e
import numpy as np
arr1 = np.array([1,7,352,169])
arr2 = np.array([751,7,-352,8690])
arr3 = arr1 - arr2
arr3
You are not allowed to use that command here. Please use the #bot-commands channel instead.
aww hehe, you can use bot commands for it
now heres the main thing, that's a "vectorized" calculation.
Doesn't it work with loops in the background?
for science, why dont you make the same result via iteration, and then array subtraction. then, do a timeit. and increase the size of the array
ish. let's see if theres a difference in timings for large arrays first
and then we'll talk
Is that timeit a thing or did you just mean that I should create it myself using time.time()?
start_time = time.time()
print("Start at: {}".format(start_time))
map_urls, scale_image_url = get_frame_urls(map_type)
create_df(map_urls, scale_image_url).to_csv(csv_file)
print("Finished at: {}".format(time.time()))
print("--- {} seconds ---".format(time.time() - start_time))
oh, you can use any method you prefer, but yes timeit is an actual thing too
I'll check it out, always happy to learn
timeit is great because it automatically repeats a calculation multiple times and gives you a mean and deviation of timings for some calculation.
Sounds amazing, I'm trying it now
Simpler to use jupyter's %%timeit magic
Can it receive parameters in its' stmt parameter?
honestly ive always used it with the magic command but uh, i believe it can receive strings or functions
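For the plain `timeit` module (no magic command), a sketch of passing arguments by wrapping the call in a lambda. `loop_sub` is a throwaway helper for the comparison, and the array size and repeat count are arbitrary:

```python
import timeit
import numpy as np

def loop_sub(arr1, arr2):
    """Element-by-element subtraction with an explicit Python loop."""
    out = np.zeros(len(arr1), dtype=arr1.dtype)
    for idx in range(len(arr1)):
        out[idx] = arr1[idx] - arr2[idx]
    return out

n = 10_000
arr1 = np.random.randint(1, 10, size=n)
arr2 = np.random.randint(1, 10, size=n)

# timeit accepts a callable, so a lambda lets us bind the arguments
t_loop = timeit.timeit(lambda: loop_sub(arr1, arr2), number=20)
t_vec = timeit.timeit(lambda: arr1 - arr2, number=20)
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```

Try increasing `n` and see how the two timings scale.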
def iter_sub(arr1: np.ndarray, arr2: np.ndarray) -> np.ndarray:
    if len(arr1) != len(arr2):
        print("arr1 and arr2 not of the same length.")
        sys.exit(1)
    arr3 = np.zeros(len(arr1))
    for idx in range(len(arr1)):
        arr3[idx] = arr1[idx] - arr2[idx]
    return arr3
def vectorized_sub(arr1: np.ndarray, arr2: np.ndarray) -> np.ndarray:
    if len(arr1) != len(arr2):
        print("arr1 and arr2 not of the same length.")
        sys.exit(1)
    return arr1 - arr2
Care to give me an example?
do you have an ipython? where do you run code
hm, for the iter_sub, for a fair comparison, i'd say get rid of the print, get rid of the if checks from both too
Specifically that code in a Python terminal
at least for timeits
But I mostly write Python in PyCharm
True, the prints were only for debugging purposes
in your cmd, type ipython
Would recommend Google Colab instead
Terminal*
Sec, I'm installing it
nvm, that was fast
ah you dont have to. it's just a fancy way of running timeit
(well, ipython is a GREAT repl though)
Nah, again, always happy to learn about new standards and tools
Okay, so now, how do I use it?
so, it's just a python repl. you can type and run stuff as usual
but it allows magic commands. those are prefixed iwth % and %%
I meant how do I use that magic timeit you were talking about
so, try %timeit <command_here>
This is what I get
yep, with params
computing courtesy of Google
Duck Google
How do I randomly fill the array?
you can use one of the np.random. Say np.random.randint(1, 10, size=n)
Kewl
I started learning how to use SQL files with the sqlite3 library, and it's all going smoothly so far
but there's one thing that I can't seem to get correctly...
How can I view the SQL file in VSC?
you cant. sqlite is a database, and so it's file would need a program that can read sqlite databases
Ok, can you give me a program that does that?
I'm a little confused on what do in this screen
I'm using sqlite
but what's the difference between Desktop and Program Menu?
Is it just the shortcuts?
or something else?
i think just shortcut
dont worry about it too much, choose what makes sense for you. if not sure, press next.
@ripe forge I opened it, is that what it should show?
Yeah I did it
I'm planning on using it with my discord bot
Would it be good make a table for every server the bot's in?
If you have a need for speed, you might want to consider C++
yep vectorization is like a whole different world. really good for lots of number crunching. this is essentially the true power of numpy arrays
internally, its like "batch" iteration. the key being that it can load and execute operations in batches, as opposed to one at a time
for an analogy, it's like iteration is equivalent to making 100 cookies with a baking mold of 1 cookie only. you have to run the oven 100 times, and make one cookie at a time because thats all the mold you have. vs vectorization which is like someone gave you a mold with 20 cookies. so you run the oven 5 times and youre done, making 20 cookies at a time
Sounds really good
But how would I use that ?
The shape of my array is 350, 700, 3
How do I make use of it here?
And lets say that I run that piece of code for the matrices as a whole (AKA vectorization) - I'd still need to compare each pixel to each of the pixels in the scale list
return pow(lhs[0] - rhs[0], 2) + pow(lhs[1] - rhs[1], 2) + pow(lhs[2] - rhs[2], 2)
Are you using python's pow
Ummm guess so
Have you tried numpy's pow
I think it has one
Also, I just realised
This is linalg
You should use the norm function instead
np.square(lhs-rhs) should be better also, assuming lhs and rhs are 3-dim objects
Its so fast Thank you bro
whats the purpose of the power here? but you can do the calculation in one go, result = (lhs - rhs) ** 2
then if you do need to split them apart later for some reason, do it at the end.
Currently they are, but I'd rather run that function on the two matrices instead of on two 3-dim arrays
That's useful
But again, I'd need to iterate over the entire matrix
Because I still have 350*700 such trios
no iteration. what steps are missing now?
if theres any iteration step in the workload, youve lost a lot of the gains vectorization would have given you
I know
That's my current issue
As I previously said - what I currently do is:
I iterate over all of the pixels in an image.
Compare it to the same pixel in a similar map (using (rhs - lhs) ** 2)
If they're not the same (distance != 0) - I need to compare that pixel to each and every pixel in (1,200,3) ndarray
That's basically the algorithm
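The steps above can be sketched without a per-pixel loop using a boolean mask plus broadcasting. This is only a sketch under assumptions: `nearest_scale_index` is a made-up name, "not the same" is treated as exact inequality (no tolerance), and -1 marks unchanged pixels:

```python
import numpy as np

def nearest_scale_index(colored, clean, scale):
    """For each changed pixel, return the index of the closest scale colour.

    colored, clean: (H, W, 3) arrays; scale: (1, K, 3) or (K, 3) array.
    Returns an (H, W) int array, with -1 where the pixel matches the clean map.
    """
    colored = colored.astype(np.int64)  # avoid uint8 wrap-around on subtraction
    clean = clean.astype(np.int64)
    scale = scale.reshape(-1, 3).astype(np.int64)

    changed = (colored != clean).any(axis=2)  # (H, W) boolean mask
    pixels = colored[changed]                 # (N, 3): changed pixels only

    # Broadcast (N, 1, 3) against (1, K, 3) -> (N, K) squared distances
    dists = ((pixels[:, None, :] - scale[None, :, :]) ** 2).sum(axis=2)

    result = np.full(colored.shape[:2], -1, dtype=np.int64)
    result[changed] = dists.argmin(axis=1)
    return result

clean = np.zeros((2, 2, 3), dtype=np.uint8)
colored = clean.copy()
colored[0, 1] = [250, 5, 5]    # closest to red
colored[1, 0] = [10, 10, 240]  # closest to blue
scale = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]]], dtype=np.uint8)

result = nearest_scale_index(colored, clean, scale)
print(result)
```

The intermediate array has N x K x 3 entries, so for a full-size map with a 200-colour scale it may be worth processing the changed pixels in chunks if memory gets tight.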
why is the squaring being done?
okay. that shouldnt be too bad. so only the last step remains. what is this (1,200,3) array, and why are we comparing our pixel distances (we're comparing distances, right?) against this?
i suppose the thought process here is this: can we achieve the values in our heatmap without having to do this comparison? it sounds a bit backwards to me
like, to me, if we know that we have a finite range of possible distance values, we should be able to coerce it into a range of values always. but maybe im missing something
do you know the min and max values of your heatmap?
Some of the heatmaps have scales ranging between (for example) [brown...green]
And some have [red...white...blue]
And as you know white is (255,255,255) so it's not really in the middle
And I can't search for the brightest/darkest colors
It differs on which heatmap
We have a couple we want to parse
But we don't care about the actual values, as we'd be normalizing them later on anyways
okay. i suppose lets say we just stick to this approach
Okay
You should be able to vectorize over this size too I think
Let me try
Hi!! I'm trying to find the path where my concentration flows in a river network, maybe someone here can help me.
So basically thats the data frame I have, I want to find the path using the columns N_out and N_next, I know where my path starts because the concentration is different to 0, then the idea would be to find the n_next of that row, append it to an array, search that n_next in the n_out and repeat until I reach the end of the rows (for this example i'm going to have 2 paths)
What do you mean by 'path where my concentration flows'?
This is looking like a network-theory problem more than #data-science-and-ml too
Do you mean
You want to make a forest of directed acyclic graphs
from your current data?
No I then have to make some calculations with discharge and concentration
Works, but it's still in the shape of (350,700,3)
I think that it's squared, not sure
Well this should be the best NP can give you I think
Anything more, you can try JIT
Or other languages
Is N_out some form of ID to be checked against?
So my concentration starts in n_out 73012836 and then goes to 73012412 and then 73011907 and then 73011906
I don't really need it that fast, you guys told me I should lmao
I'd love it as good as possible of course
yes I can check it against N_out
But it's for a school project, it doesn't have to be the most optimal solution
How do I know if it ends?
How do you do JIT in Python?
because the last index is the last node in my network
Optimality is always good
Agreed
This one, I'm not too well-versed in it, need to google for it, but if you don't need the speed, just check if it works first then be done with it
It sums everything up to a single int
How do I identify 'last node'
I want a sum for each pixel
True
You need to specify the axis of summation
Try axis=0, axis=1, axis=2, etc.
On it
A little brute forcey but you only have 3 indices, it's fine
If you're looking at 20 indices I'd look at solving this 'smartly'
it will always be the last row
Wait then how does the 73011909 N_out work
let me show you what I'm looking for as a result
You are telling me Path_1 is
73012836 -> 73012412 -> 73011907 -> 73011906
So path_2 is
73011909 -> 73011908 -> 73011907 -> 73011906?
By the way I'm asking you because code-wise, you'd want to know if your thing terminates
Also you'd want to know code-wise, what should be 'in-memory' at each point of the computation
Looks like axis=2 solved the issue
That means axis=2 represents your pixels
yes exactly
Which makes sense since it's (350,700,3) lol
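A minimal sketch of that summation step, assuming two same-sized RGB arrays (the random data and the names `heatmap`, `clean` and `sq_dist` are made up for illustration). Summing with no axis collapses everything into one number; summing over axis=2 collapses only the colour channels and leaves one value per pixel:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the heatmap and its clean counterpart (random data, assumed shapes).
heatmap = rng.integers(0, 256, size=(350, 700, 3), dtype=np.uint8)
clean = heatmap.copy()

# Make exactly one pixel differ so the result is easy to check by hand.
heatmap[0, 0] = (10, 20, 30)
clean[0, 0] = (13, 24, 30)

# Cast before subtracting: uint8 arithmetic would wrap around instead of going negative.
diff = heatmap.astype(np.int64) - clean.astype(np.int64)

total = (diff ** 2).sum()           # no axis: a single number for the whole image
sq_dist = (diff ** 2).sum(axis=2)   # axis=2: one squared distance per pixel, shape (350, 700)
```

The squared distance is 0 exactly where the two pixels have the same colour, which is the check discussed here.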
Ok that's what I wrote. The thing is, how do I know that 73011906 is the last?
You can tell me because this csv is 8 nodes long
What if it were 52000 nodes long
Because my nodes are already sorted
Okay, so that's actually an amazing step.
How do I compare it to all of the values in the list now?
my last row will always be the last node of the network
'it' = ?
And take the one closest to it if no match is found
Each pixel basically
You mean the distance?
That list is the list of pixels in the scale
I thought that list is now the list of distances?
Yes, but to be specific it's the list of pixels in the scale
With a distance you create a sphere
Would you mind joining me on VC? It'd be much easier to explain
I rather not speak sorry
All good
So basically we have a heatmap, the heatmap has pixels which can not be found in the scale because of compression issues (I guess)
As for you, ok, one thing is I'd store this last node first, then after that I just do a filter on concentration, then keep following the N_next, N_out thing til I reach the last node
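A rough sketch of that idea in plain Python. The node IDs are the ones from the two example paths; the concentration values, the `conc` key, and the 0 used as "no next node" are assumptions for illustration only:

```python
# Hypothetical rows standing in for the data frame; sorted so the last row is the last node.
rows = [
    {"N_out": 73012836, "N_next": 73012412, "conc": 5.0},
    {"N_out": 73012412, "N_next": 73011907, "conc": 0.0},
    {"N_out": 73011909, "N_next": 73011908, "conc": 2.0},
    {"N_out": 73011908, "N_next": 73011907, "conc": 0.0},
    {"N_out": 73011907, "N_next": 73011906, "conc": 0.0},
    {"N_out": 73011906, "N_next": 0, "conc": 0.0},
]

# Lookup table: for every node, which node does it drain into?
next_of = {row["N_out"]: row["N_next"] for row in rows}

# Store the last node first; it is what ends every walk.
last_node = rows[-1]["N_out"]


def follow(start):
    """Walk N_out -> N_next from `start` until the last node is reached."""
    path = [start]
    while path[-1] != last_node:
        nxt = next_of.get(path[-1])
        if nxt is None or nxt in path:  # dangling link or loop: stop instead of spinning forever
            break
        path.append(nxt)
    return path


# One path per row whose concentration is different from 0.
paths = [follow(row["N_out"]) for row in rows if row["conc"] != 0]
```

With a pandas data frame the lookup table could be built as `dict(zip(df["N_out"], df["N_next"]))`; the extra stop condition answers the "how do I know it terminates" question even when the data has a broken link.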
So we need to find the minimum distance between each pixel in the scale and each pixel in the map if the distance != 0 between the two original images
lol
Also, another 3D array?
That's an example
Yup
So you have 2 RGB arrays I assume
Okey let me give that a try
But this time it's already filtered, I took a single strip from the scale (1 pixel wide) and that should represent the values
Hmm you want to solve for a minimum distance? I'm not too sure what the if here means actually
Two images which are basically (350,700,3) and another scale which is (1,200,3)
distance != 0, sorry
Distance != 0, mathematically, if and only if not the same thing
(BTW, thanks for all the help, you're awesome)
ok ignore that
I didn't understand what you meant but I'd add that the distance^2 == 0 only if the colors are the same
True, so basically (to make sure we're all on the same page) - height = 350, width = 700, and each of those contains a trio which represents the RGB
And other than that we have another array of all the RGBs in the scale
Correct
(Which is *ucking awesome!)
BTW, currently if I find an exact match (0 between the matrices) or between the pixel and a pixel in the scale I just return its index
No, the scale is R^1*200*3
Even less than 200 because I removed the outliers
The scale is 600 values of acceptable values
Then
For anything not in the scale
You want a minimum colour distance
ok
ummm not really
?? 1*200*3 is 600?
Because it's 600 only if you separate the R G and B
True
Huh?
Is that not the case?
its probably more like the same distance formula applied here
to find the closest "pixel" that makes sense
Exact one
so, if i understand correctly, you just need a distance matrix
Sorry, I didn't understand what you meant
I think the closest pixel isn't unique though
hm, good shout, it's not. i assume that doesnt matter though
Yeah, but this time between each pixel on the heatmap and each pixel on the scale
So basically 350*700*1*200
Ok I think I get it
theres something in scipy i think, for making a distance matrix
The data is getting huge though
It is, because unfortunately the scale doesn't contain all of the values
Some are really close and we'd rather just take the closest
it's like ~120MB
Unless you have a better idea
Ok I have handled bigger Dist mats before
I don't recommend anything >1Gb basically
yeah this is still tiny in the grand scheme of things
Yeah I think the 'fastest' way is to take the distance matrix
And find the minimizer
For each pixel
If you vectorize it, I can't say for sure. It probably doesn't fit in fast cache memory, so it will likely be an iterative process
So iterations?
Ya iterate over each pixel, I think?
nah, i'd say distance matrix
For dist mat IDK how it'd be done
ok, sec. lemme go find it
I mean, I don't know the functions off the top of my head at least
p=2 for euclidean, which is default anyways. just dump your matrix there, and it should spit out the distance matrix
Oh yeah just try that function first
Might need to fiddle a bit with the shape so it likes it, but I think it's fine right now. I think your last .shape is 3, which should be what it wants
its actually not happy with the shape as is
Welp, that means reshaping I think
yep. slightly annoying but yeah
Wait, how does it help me?
You're looking for minimizers on this distance matrix
so, the thought process is, you get a distance of each pixel against the 200 pixels
^yup
and then you choose the smallest distance from there for each pixel
(thats the one weird part about vectorization which i always find hard to digest. It's completely happy with doing really wasteful calculations just to reach an answer, and it still scales insanely well)
Would I also get index of the smallest distance in the scale?
in a separate step, yep
once you have the distance matrix, then that would be your last task.
so, say your rgb array is a 3d array yes? then you do a reshape to turn it into a 2d (n_pixel, 3) array. same deal with the heatmap array. then the distance matrix should give you a (n_pixel, 200) array.
^seems like it
at this point, if you use .argmin method, with the correct axis, you'll get the index for the min distance for each pixel.
The issue is
Need to remap each n_pixel back to the original pixel matrix hmm?
That might be annoying again
a simple reshape back will fix that, no issue
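The whole chain (reshape, distance matrix, argmin, reshape back) could look roughly like this. It is only a sketch on tiny made-up arrays: the shapes (4, 5, 3) and (1, 20, 3) stand in for (350, 700, 3) and (1, 200, 3), and the names `heatmap`, `scale`, `idx_true` and `nearest` are not from the real project:

```python
import numpy as np
from scipy.spatial import distance_matrix

rng = np.random.default_rng(0)

# Made-up scale strip: 20 random colours, shape (1, 20, 3).
scale = rng.random((1, 20, 3))

# Made-up heatmap: every pixel is a scale colour plus a tiny offset,
# so the right answer (idx_true) is known in advance.
idx_true = rng.integers(0, 20, size=(4, 5))
heatmap = scale[0][idx_true] + 1e-6              # shape (4, 5, 3)

pixels = heatmap.reshape(-1, 3)                  # (n_pixel, 3)
colours = scale.reshape(-1, 3)                   # (n_scale, 3)

dist = distance_matrix(pixels, colours)          # (n_pixel, n_scale), Euclidean by default
nearest = dist.argmin(axis=1)                    # index of the closest scale colour per pixel
nearest = nearest.reshape(heatmap.shape[:2])     # back to the image layout, shape (4, 5)
```

`nearest[i, j]` is then the position in the scale of the colour closest to pixel (i, j); if two scale colours are equally close, `argmin` returns the first one.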
Oh hmm
Sounds right
from numpy.random import seed, uniform
from numpy import array
from scipy.spatial import distance_matrix as dist_mat
# Pixels
x = array(range(6))
x.shape=(2,3)
# Scale
seed(0)
y = uniform(0,6,size=(20,3))
z = dist_mat(x, y)
minimizers = z.argmin(axis=1)
y[minimizers]
^ some toy script
I get
array([[1.25326054, 0.96785711, 3.91864995],
[3.67257434, 3.70160398, 5.66248847]])
which indeed seem close to 0,1,2 and 3,4,5
I don't have the reshaping back into the 350*700 array issue you have, but yeah essentially this thing works
I don't really get that, sorry
I'd try to run the following script and see the shape of z
Try it
I really don't see how it can help me
In [13]: z
Out[13]:
array([[4.92828313, 4.07220501, 6.33441159, 4.55356006, 5.90153969,
3.16539456, 7.38905673, 5.77234009, 3.1409925 , 6.07503396,
4.04371234, 3.91525835, 5.84810124, 4.29668737, 4.68306374,
4.21475324, 2.64562386, 5.75833833, 2.29192338, 2.41381881],
[1.44374159, 1.86099389, 1.60497451, 2.09492149, 4.84765814,
4.60227022, 2.24361732, 2.1995214 , 4.73392854, 3.76611036,
2.7447737 , 4.11756178, 0.99009464, 3.200087 , 3.95533112,
5.13868297, 2.65013839, 4.80767455, 3.66255474, 4.01516057]])
(2,20,1)
AHHH
I see it now
It returns an array of all of the pixels from the scale for each pixel from the heatmap
Yup
It should also reshape back nicely
But I can't confirm, would probably need weirdly shaped things
Okay, so now I have 2 result matrices
A distance matrix between the heatmap and its clean counterpart
And between each pixel in the heatmap and each pixel in the scale
How do I use it now
Won't I need to iterate through it in order to fit it into the dataframe?
Well try it first I guess
Sounds like a lot to you but not too many for a computer honestly
True
What's 'a lot' for a computer is more on the order of >10^9
Depends on which computer
Haha yes I wouldn't get a raspberry to do this
so i have this df
a b c d
0 308.3720 290.2366 301.1735 351.9520
1 258.5590 235.2776 261.2084 345.8950
2 161.0730 193.4827 189.8389 309.6990
3 123.1920 112.0545 120.7862 253.0690
4 95.6603 110.8244 93.7067 199.7980
5 93.5633 90.5683 95.8124 128.2830
6 119.5540 170.6676 123.4065 78.5675
7 200.7440 177.9813 182.7717 98.8558
8 238.3210 258.7455 256.2180 180.2480
9 320.6590 311.0475 280.1748 177.7440
10 340.3230 361.0352 347.1527 256.2830
11 350.5850 339.0691 345.8954 311.0590
how do i get the average
of each row
Have you tried pandas.DataFrame.mean
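For example, a small sketch using the first three rows of the frame above. `axis=1` averages across the columns of each row, while the default `axis=0` would average down each column instead:

```python
import pandas as pd

# First three rows of the data frame posted above.
df = pd.DataFrame({
    "a": [308.3720, 258.5590, 161.0730],
    "b": [290.2366, 235.2776, 193.4827],
    "c": [301.1735, 261.2084, 189.8389],
    "d": [351.9520, 345.8950, 309.6990],
})

row_means = df.mean(axis=1)   # one average per row
col_means = df.mean()         # default axis=0: one average per column
```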
It tells me the closest but since it doesn't use the distance formula we used earlier - I won't be able to compare between the two
@client.command()
async def graph(ctx, *, query):
    ticker = query
    yf.download(ticker)
    newtime = yf.download(ticker, start = "2015-01-01", end = "2021-12-31")
    newtime['Adj Close'].plot()
    fig = plt.figure()
    plt.xlabel("Date")
    plt.ylabel("Adjusted")
    plt.title("Price data")
    plt.style.use('dark_background')
    fig.savefig("data.png")
So I am using this code to graph data on some stocks and would like to save the graph as a direct file on my computer, but it is opening the graph in the python application. Not sure what I am doing wrong atm
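One possible cause, offered as a guess rather than a diagnosis: `plt.figure()` is called after `.plot()`, so the line lands on one figure and `savefig` is called on a new, empty one. A sketch of an ordering that should avoid that, with made-up prices standing in for the `yf.download(...)` result and a temporary folder standing in for wherever the file should go:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render to files, never open a window
import matplotlib.pyplot as plt
import pandas as pd

# Made-up prices in place of newtime['Adj Close'].
prices = pd.Series(
    [10.0, 10.5, 10.2, 11.0],
    index=pd.date_range("2015-01-01", periods=4),
    name="Adj Close",
)

plt.style.use("dark_background")   # set the style before the figure is created
fig, ax = plt.subplots()           # create the figure first...
prices.plot(ax=ax)                 # ...then draw onto its axes explicitly
ax.set_xlabel("Date")
ax.set_ylabel("Adjusted")
ax.set_title("Price data")

out_path = os.path.join(tempfile.mkdtemp(), "data.png")
fig.savefig(out_path)
plt.close(fig)                     # release the figure once it is on disk
```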
NVM, seems like we just need to multiply it by itself and then they should be equal
Now we just have an issue with the shape
Because we're trying to calculate the distance matrix between a (350,700,3) and (1,200,3)
But even if we do it
How do we continue, what's the next step?
Got any idea how to do that first reshape() though?
your_array.reshape(-1, 3)
(-1 means the value at the axis is inferred. you could also write 350*700 there)
Huh? How would that work?
just math. if you provide me an array, and give me all axes except 1, i can always infer the dimension of the remaining axis
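A tiny sketch of that inference, using an array of the same shape as the heatmap (the values are just a counter so the round trip can be checked):

```python
import numpy as np

img = np.arange(350 * 700 * 3).reshape(350, 700, 3)

flat = img.reshape(-1, 3)          # -1 is inferred as 350 * 700 = 245000
same = img.reshape(350 * 700, 3)   # writing the number out gives the same result
back = flat.reshape(350, 700, 3)   # reshaping back restores the original layout
```

The last axis stays 3, so every row of `flat` is still one complete RGB trio; only the two image axes are merged.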
So basically if I have (350,700,3) and (1,200,3) and I want to run distance_matrix() on them, I'd have to run it in the following way:
dist_mat = distance_matrix(a.reshape(-1, 3), b)
Right?
I mean why should I remove the axis of the trios
Oh, no I understand what you meant, unfortunately it doesn't work since the heatmap is 350*700 (245000 pixels) while the other is 1*~200 (~200)
So a reshape would basically do nothing
Since they are of different sizes
numpy.core._exceptions.MemoryError: Unable to allocate 447. GiB for an array with shape (245000, 245000) and data type float64
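The size in that error can be checked with a bit of arithmetic. The (245000, 245000) shape means the image was compared against itself, pixel by pixel, instead of against the roughly 200 scale colours. A small sketch (the helper name `dist_matrix_gib` is made up):

```python
def dist_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Memory needed for an n_rows x n_cols float64 matrix, in GiB."""
    return n_rows * n_cols * bytes_per_value / 2**30


n_pixels = 350 * 700                                    # 245000 pixels in one heatmap

image_vs_image = dist_matrix_gib(n_pixels, n_pixels)    # what the error message is asking for
image_vs_scale = dist_matrix_gib(n_pixels, 200)         # heatmap against a 200-colour scale
```

The first number comes out at about 447 GiB, matching the error; the second is well under half a GiB.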
I literally have no idea what to do
@ripe forge @chilly geyser If you guys have any idea please ping me, I'll go cry in a corner or something
@twin moth what's the problem
Heya
Basically I'm trying to create a distance matrix between matrices of the following shapes: (1,200,3) and (350,700,3)
Other than that, I don't know what's the next step
@twin moth its an element by element scalar distance matrix?
Is that what you mean by element by element?
ahhhhhh yea it is
im honestly not sure man
thought i mightve had an idea but tried it and it doesnt work xd
lol
Thanks for trying ❤️
Okay, so I managed to fix the issues by just choosing not to create a distance_matrix between the heatmap and the scale, and now it takes about 22 mins to iterate through a single image, even though it took about 4 mins beforehand
We've added a couple of checks but it's too long anyways
Hello there
I'm studying statistics and looking to get a step ahead of my classes by being able to retrieve data from APIs to analyse on my own. So I'm looking for help to learn requesting and all that stuff, as I only know the basics about the topic.
I know the basics of Python too, if that's at all relevant
i'm curious as to why you would want to use that big of an array
and float64 too lmao
watch stat quest. that will help you stay ahead
Well, I didn't really
I wanted to create distance matrix between two matrixes of the following shape (350,700,3)
I thought you only had to create a distance matrix between (350, 700, 3) and (1, 200, 3)-sized arrays?
!e
import numpy as np
from scipy.spatial import distance
rng = np.random.default_rng()
a = rng.random((350, 700, 3)).reshape(-1, 3) # The RGB image. Reshape so that it is 2-D.
b = rng.random((200, 3)) # The 200 RGB pixels to compare against
dist_mat = distance.cdist(a, b).reshape((350, 700, 200))
print(dist_mat)
@hasty grail :warning: Your eval job timed out or ran out of memory.
[No output]
It barely fit in memory on my PC (which has 8 GB)
dist_mat[i, j, k] is the distance between a[i, j] and b[k]
If you're having memory issues then you might want to compare batch by batch (with respect to b)
For example, find the closest match out of the first 100 pixels, then the same out of the next 100 pixels, and compare those 2 closest matches.
import numpy as np
from scipy.spatial import distance
img_size = (350, 700)
num_px = 200
batch_size = 100
rng = np.random.default_rng()
img = rng.random((*img_size, 3)).reshape(-1, 3) # The RGB image. Reshape so that it is 2-D.
px = rng.random((num_px, 3)) # The 200 RGB pixels to compare against
dist_min = np.full(img_size, np.inf)
dist_argmin = np.full(img_size, -1)
for i_min in range(0, num_px, batch_size):
    px_batch = px[i_min:i_min+batch_size]
    dist_batch = distance.cdist(img, px_batch).reshape(*img_size, -1)
    dist_batch_min = dist_batch.min(axis=-1)
    dist_batch_argmin = dist_batch.argmin(axis=-1)
    updated = (dist_min > dist_batch_min)
    dist_min[updated] = dist_batch_min[updated]
    dist_argmin[updated] = i_min + dist_batch_argmin[updated]
assert (dist_argmin > -1).all()
May I have someone explain why
```python
miles_log = cur_folder+"mile_log.txt"
miledge_sheet = cur_folder+'blank mileage log.xlsx'
with open(miles_log,'r') as f:
    data = f.readlines()
    f.close()
wb = load_workbook(miledge_sheet)
for cells in wb.iter_columns:
    print(cell)
```
doesn't work
essentially i have a text document that i want to write its contents to excel, but only to the rows of 17:38
however, all of my solutions rewrite the desired section. i can provide the full code if needed.
is there a daily challenge for data science,
i mean like using pandas, matplotlib etc ?
Is Hypergeometric probability distribution identical to Binomial probability distribution?
is there any library which can generate a normal map from a diffuse texture?
was recommended this channel from #python-discussion
ping me ^
any cli app would work too i guess, which i can call from python
i need some help rather quickly
how can I create frequencies of data points but with some margin for error?
i.e.
I have a bunch of points on a graph, but none of them are the same point
how could I loop over each point and get a value for the frequency of points really close to it?
Thanks!!!
So basically you want a histogram?
well what I want is to vary the sizes of each point by how frequently there are points nearby
so like around (0, 0), the points would be very large
I thought about clustering but wasn't sure how to use it in this context
i'm following a tutorial but with a different dataset
the guy in the tutorial has data already in the form of tiny regions across the plot in hexbins instead of just the raw datapoints, and each region already has the frequency in another dataset
his looks like this
In particular the query_ball_tree method
i'll take a look at this, thanks
np, feel free to ask if you run into issues
willdo!
so what I have so far is this:
indices = []
count = 0
for x in missed_x:
    kd_tree1 = cKDTree((missed_x, missed_y))
    kd_tree2 = cKDTree((missed_x[count], missed_y[count]))
    indexes = kd_tree1.query_ball_tree(kd_tree2, r=10)
    indices.append(indexes)
    count += 1
display(pd.DataFrame(indices))
but on the line kd_tree2 = cKDTree((missed_x[count], missed_y[count])), I'm getting an error: ValueError: data must be 2 dimensions
So it looks like it's expecting a cKDTree to have more than one point
how would I make a cKDTree with just one?
I think i see
Why don't you just create a cKDTree for each set of points?
well partially the problem was that cKDTree didn't work on a single point
I thought you only wanted to do clustering?
In that case the two cKDTrees would be the same
but
scipy.spatial.KDTree.query_ball_point should be done if I want to find the nearest neighbors for a single point
Don't you want to find the nearest neighbors for more than one point?
yeah but I'm going to loop over each point and add it to a list with the number of nearest neighbors for each point
or is that not the best way to do it
It's not the best way
avoid loops when dealing with NumPy (and its related libraries) whenever possible
The reason is that Python for loops are way slower than those in C
ah, did not know that
and NumPy uses C loops internally whenever you execute vectorized operations
so then how should I do it?
from scipy.spatial import cKDTree
# Assuming that you have a bunch of points in a 2-D array named `points`
centroids = points[::2] # Subsample the points. In this example we sample every other point.
centroids_t = cKDTree(centroids)
points_t = cKDTree(points)
# r = Radius of search
result = centroids_t.query_ball_tree(points_t, r) # result[i] is a list of the neighbours of centroids[i] in points
num_neighbours = [len(neighbours) for neighbours in result] # Count the number of neighbours for each centroid
I have an array of points made using missed_points = np.array([missed_x, missed_y]), and it returns:
[ 207, 264, 120, ..., 86, 15, 61]]), but it says it's a 2x1331 vector. this would still work, right? or would I need to transpose it
also how come we would do every other point?
you'll have to transpose it
according to the doc, the first dimension represents the number of points and the second dimension represents the number of spatial dimensions the points reside in
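A small sketch of the difference, with made-up coordinates in place of the real `missed_x` and `missed_y`:

```python
import numpy as np

# Made-up shot coordinates; the real arrays have 1331 entries each.
missed_x = np.array([207, 264, 120, 86])
missed_y = np.array([15, 61, 33, 90])

stacked = np.array([missed_x, missed_y])             # shape (2, 4): one row per coordinate
points = stacked.T                                   # shape (4, 2): one row per point
points_alt = np.column_stack((missed_x, missed_y))   # builds the (4, 2) layout directly
```

Either `points` or `points_alt` has the (number of points, number of dimensions) layout that the tree expects.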
also how come we would do every other point?
It's the easiest method of subsampling, just for demonstration purposes.
If you don't want to subsample at all you can just do centroids = points
The reason you might want to subsample is to reduce the computational cost
(Especially if you have a large number of points like >10k)
that does make sense
although would it make sense when I want the nearest neighbors for each individual point?
Yeah, since it is directly related to your purpose
Which is to count the number of nearby points
how would I use it in my plot?
query_ball_point isn't actually a nearest neighbour search, it's just a ball query
it's a different size from my x and y coordinate arrays
i.e. find all points within distance r of the given point
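If the goal is one count per plotted point, a ball query of the points against their own tree keeps the result the same length as the x and y arrays. A sketch with four made-up points and an arbitrary radius of 10:

```python
import numpy as np
from scipy.spatial import cKDTree

# Three points close together and one far away (made-up data).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [50.0, 50.0]])

tree = cKDTree(points)

# For every point, the indices of all points within distance r of it.
neighbours = tree.query_ball_point(points, r=10)

# One count per point; each count includes the point itself.
counts = np.array([len(found) for found in neighbours])
```

Because `counts` lines up with the rows of `points`, it could be passed (possibly scaled) as the `s=` marker sizes of a matplotlib scatter plot.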
running it with no subsampling is basically the same running time
I would think that the reason why you want to count the number of nearby points is so that you don't have to plot every single point
running it with no subsampling is basically the same running time
Compared to?
Maybe it's because the number of points isn't large enough to matter much
1331 is not a lot
Glad that worked!
lol
like this much
so im a little put off rn
If you're actually going to deal with 118,396 points then you'll probably need to subsample (and plot only the centroids)
yeah thats my new thought
i'm looping over 1230 games played to find every shot taken in the game which can be estimated to be about 200 total
for row in t.main_file_df['ID1784']:
    new_row = [value for value in row if value in t.param_values]
    row = [value for value in row if value not in t.param_values]
how can i transform this loop into 2 columns
where row I want to stay as 'ID1784'
and new_row as new column
Wdym by 2 columns?
this is pandas DataFrame
t.main_file_df['ID1784'] is column that contain list of strings

Can you provide more background information?
Such as the structure of your dataframe
i dont understand
and what do you expect your output to be like?
ID1784 looks like
['automatyczna i ręczna regulacja czułości', 'automatyczne wyłączanie', 'identyfikacja sygnału nadajnika dzięki cyfrowemu kodowaniu', 'możliwość dołączenia dodatkowych nadajników', 'optyczna i akustyczna sygnalizacja zlokalizowanego sygnału nadajnika', 'zestaw złożony z nadajnika i odbiornika', 'lokalizowanie tras kabli w gruncie', 'lokalizowanie tras metalowych rur instalacji wodnych i centralnego ogrzewania', 'lokalizowanie właściwego zabezpieczenia odpowiedniego obwodu elektrycznego']
list of many strings
and I want to split it into 2 columns
one that contain texts that i want, based on list intersection
and 2nd that have leftovers
this for loop can do this for each row
df['ID1784_new'] = df['ID1784'].apply(lambda values: [value for value in values if value in t.param_values])
df['ID1784'] = df['ID1784'].apply(lambda values: [value for value in values if value not in t.param_values])
Not that experienced in Pandas so there might be a more efficient method
Fixed
You're welcome
perfect timing because i just realized my problem
its also pandas
ok so
I have a dataframe of every taken shot by a team in a given game:
I'm trying to loop over each row using for i in homeoutcomes: where homeoutcomes is the name of the dataframe
although the only thing that's returned is EVENT_TYPE
what am I missing, how should I be iterating over each row instead?
Why are you iterating over the rows like that?
Again, since pandas derives from numpy, avoid direct iteration in Python
how should I be doing it
depends on what you're doing in each iteration
in each iteration I do this:
display(i)
if homeoutcomes.iloc[indexH]['EVENT_TYPE'] == 'Missed Shot':
    missed_x.append(homestats.iloc[indexH]['LOC_X'])
    missed_y.append(homestats.iloc[indexH]['LOC_Y'])
    next_x.append(homestats.iloc[indexH + 1]['LOC_X'])
    next_y.append(homestats.iloc[indexH + 1]['LOC_Y'])
indexH += 1
print("IndexH", indexH)
what is display?
its an IPython function that is used to display data structures
ok so it's just for debugging
yeah
what are next_x and next_y for?
next_x and next_y are keeping track of the x position and y position of the shot following a missed shot
Hmm ok
missed = (homeoutcomes['EVENT_TYPE'] == 'Missed Shot')
missed_x_y = homeoutcomes.iloc[missed].loc[['LOC_X', 'LOC_Y']]
missed_next = np.concatenate([[False], missed[:-1]]) # Shift the `missed` boolean mask
missed_next_x_y = homeoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]
Try this
instead of a for loop?
Yep
will missed_next be for any shot after a missed shot? or just a missed shot after a missed shot
any
ok thanks

NotImplementedError: iLocation based boolean indexing on an integer type is not available
Can you print out missed?
maybe you need to add .values at the end to convert it into a boolean array
instead of a Series
yea sure
i changed it to missed_home though because im doing it for two different teams
i changed the missed in the iloc to missed_home btw
that's fine
display(missed_home) gives
can you display missed_home.values?
if it becomes a regular numpy array then that's what you need to use instead
yup
ah ok
as in
missed = (homeoutcomes['EVENT_TYPE'] == 'Missed Shot').values
that's what i did
missed_x_y_home gives an error
KeyError: "None of [Index(['LOC_X', 'LOC_Y'], dtype='object', name='GAME_ID')] are in the [index]"
Oh. Both arrays needed to be 2d, so it should have been (200, 3). Catching up on the convo I'm not quite sure how you ended up with an array of shape (245000, 245000). Also, apologies for the trouble; sounds like it wasn't a fun ride
Can you provide your code?
yeah sure
So that I don't get confused by home and stuff
xlocs = df['LOC_X']
ylocs = df['LOC_Y']
made = df['SHOT_MADE_FLAG']
missed_x = []
next_x = []
missed_y = []
next_y = []
df.set_index(keys=['GAME_ID'], drop=False,inplace=True)
game_ids = df['GAME_ID'].unique().tolist()
for id in game_ids:
    gamestats = df.loc[df.GAME_ID==id]
    team_names = list(set(gamestats['TEAM_NAME'].tolist()))
    homestats = gamestats[gamestats['TEAM_NAME'] == team_names[0]]
    awaystats = gamestats[gamestats['TEAM_NAME'] == team_names[1]]
    homeoutcomes = pd.DataFrame(homestats['EVENT_TYPE'])
    awayoutcomes = pd.DataFrame(awaystats['EVENT_TYPE'])
    missed_home = (homeoutcomes['EVENT_TYPE'] == 'Missed Shot').values
    missed_x_y_home = homeoutcomes.iloc[missed_home].loc[['LOC_X', 'LOC_Y']]
    missed_next_home = np.concatenate([[False], missed_home[:-1]]) # Shift the `missed` boolean mask
    missed_next_x_y_home = homeoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]
    missed_away = (awayoutcomes['EVENT_TYPE'] == 'Missed Shot').values
    missed_x_y_away = awayoutcomes.iloc[missed_away].loc[['LOC_X', 'LOC_Y']]
    missed_next_away = np.concatenate([[False], missed_away[:-1]]) # Shift the `missed` boolean mask
    missed_next_x_y_away = awayoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]
missed_x = np.asarray(missed_x)
missed_y = np.asarray(missed_y)
next_x = np.asarray(next_x)
next_y = np.asarray(next_y)
ok, can you print homeoutcomes.iloc[missed_home]?
and so on with all missed values
Where are the LOC_X and LOC_Y columns?
x coordinate of a given shot and y coordinate
I don't see those columns in your screenshot...
oh those are in a different dataframe !!!
could you combine them together beforehand?
should be homestats
display(homestats) looks like this
and i'm getting homeoutcomes from homestats
hmm
i still get KeyError: "None of [Index(['LOC_X', 'LOC_Y'], dtype='object', name='GAME_ID')] are in the [index]"
Your edited code?
for id in game_ids:
    gamestats = df.loc[df.GAME_ID==id]
    team_names = list(set(gamestats['TEAM_NAME'].tolist()))
    homestats = gamestats[gamestats['TEAM_NAME'] == team_names[0]]
    awaystats = gamestats[gamestats['TEAM_NAME'] == team_names[1]]
    homeoutcomes = pd.DataFrame(homestats['EVENT_TYPE'])
    awayoutcomes = pd.DataFrame(awaystats['EVENT_TYPE'])
    missed_home = (homestats['EVENT_TYPE'] == 'Missed Shot').values
    missed_x_y_home = homestats.iloc[missed_home].loc[['LOC_X', 'LOC_Y']]
    missed_next_home = np.concatenate([[False], missed_home[:-1]]) # Shift the `missed` boolean mask
    missed_next_x_y_home = homestats.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]
    missed_away = (awayoutcomes['EVENT_TYPE'] == 'Missed Shot').values
    missed_x_y_away = awayoutcomes.iloc[missed_away].loc[['LOC_X', 'LOC_Y']]
    missed_next_away = np.concatenate([[False], missed_away[:-1]]) # Shift the `missed` boolean mask
    missed_next_x_y_away = awayoutcomes.iloc[missed_next].loc[['LOC_X', 'LOC_Y']]
missed_x = np.asarray(missed_x)
missed_y = np.asarray(missed_y)
next_x = np.asarray(next_x)
next_y = np.asarray(next_y)
np
what is homestats.iloc[missed_home]?
thats homeoutcomes.iloc[missed_home] but with homestats instead
same error either way
well yeah, but does it contain the columns 'LOC_X' and 'LOC_Y'?
after the iloc
hmm?
run display(homestats.iloc[missed_home])
supposedly it should have those two columns
so homestats.iloc[missed_home].loc[['LOC_X', 'LOC_Y']] results in an error?
yeah that one does
wait
I'm dumb
ok umm can you do this?
homestats.loc[missed_home, ['LOC_X', 'LOC_Y']]
also maybe you could get away with not using .values
thanks! i'll try the rest of it
somehow I thought .iloc uses row indexing while loc uses col indexing
ran with no errors!
for id in game_ids:
    gamestats = df.loc[df.GAME_ID==id]
    team_names = list(set(gamestats['TEAM_NAME'].tolist()))
    homestats = gamestats.loc[gamestats['TEAM_NAME'] == team_names[0]]
    awaystats = gamestats.loc[gamestats['TEAM_NAME'] == team_names[1]]
    missed_home = (homestats['EVENT_TYPE'] == 'Missed Shot')
    missed_x_y_home = homestats.loc[missed_home, ['LOC_X', 'LOC_Y']]
    missed_next_home = np.concatenate([[False], missed_home[:-1]]) # Shift the `missed` boolean mask
    missed_next_x_y_home = homestats.loc[missed_next_home, ['LOC_X', 'LOC_Y']]
    missed_away = (awaystats['EVENT_TYPE'] == 'Missed Shot')
    missed_x_y_away = awaystats.loc[missed_away, ['LOC_X', 'LOC_Y']]
    missed_next_away = np.concatenate([[False], missed_away[:-1]]) # Shift the `missed` boolean mask
    missed_next_x_y_away = awaystats.loc[missed_next_away, ['LOC_X', 'LOC_Y']]
also you can use pd.unique instead of set
team_names = pd.unique(gamestats['TEAM_NAME'])
that works
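For reference, the mask-and-shift idea on a toy frame. The column names match the ones used above; the rows are made up, and `Series.shift` is swapped in here for the `np.concatenate` trick as one way to move the mask down by a row:

```python
import pandas as pd

# Toy stand-in for one team's shots in one game, in time order.
homestats = pd.DataFrame({
    "EVENT_TYPE": ["Missed Shot", "Made Shot", "Missed Shot", "Missed Shot", "Made Shot"],
    "LOC_X": [10, 20, 30, 40, 50],
    "LOC_Y": [1, 2, 3, 4, 5],
})

# True on every missed shot.
missed = homestats["EVENT_TYPE"] == "Missed Shot"

# Shift the mask down one row: True on every shot that directly follows a miss.
follows_miss = missed.shift(1, fill_value=False)

# .loc takes the row selector first and the column labels second.
missed_x_y = homestats.loc[missed, ["LOC_X", "LOC_Y"]]
next_x_y = homestats.loc[follows_miss, ["LOC_X", "LOC_Y"]]
```

If the very last shot of a game is a miss it has no following shot, so `missed_x_y` ends up one row longer than `next_x_y`; that case needs handling before the two are paired up.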
however,
going back to centroids
that now gives me a ValueError: setting an array element with a sequence.
where?
it's now
indices = []
count = 0
missed_points = missed_x_y
centroids = missed_x_y
centroids_t = cKDTree(centroids)
points_t = cKDTree(missed_points)
result = centroids_t.query_ball_tree(points_t, r=10)
num_neighbours = [len(neighbours) for neighbours in result]
display(num_neighbours)
at cKDTree(centroids)
what is missed_x_y?
edited
edited?
my question
oh
oh i forgot to say
instead of missed_next_x_y_home = homestats.loc[missed_next, ['LOC_X', 'LOC_Y']]
i have an array x_y and I'm using x_y.append(homestats.loc[missed_home, ['LOC_X', 'LOC_Y']])
and then missed_x_y = np.asarray(x_y)
why don't you just set missed_x_y to homestats.loc[missed_home, ['LOC_X', 'LOC_Y']]?
because i'm adding homestats.loc[missed_home, ['LOC_X', 'LOC_Y']] to missed_x_y for each game
ok so the problem is that you need to concatenate the list of dataframes together*
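A sketch of what that could look like, assuming `x_y` is the list that the per-game `.append(...)` calls build up (the two small frames are made up). Games with different numbers of missed shots give frames of different lengths, which is the likely reason `np.asarray` on the list complains:

```python
import pandas as pd

# Made-up per-game results: each game contributes a different number of rows.
game1 = pd.DataFrame({"LOC_X": [1, 2], "LOC_Y": [3, 4]})
game2 = pd.DataFrame({"LOC_X": [5], "LOC_Y": [6]})
x_y = [game1, game2]

# Stack all games into one frame, then take the plain 2-D array for cKDTree.
missed_x_y = pd.concat(x_y, ignore_index=True).to_numpy()
```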
you mean missed_x_y?
well no errors
and then
although the nearest neighbors is slow as hell
so i should do the subsampling?
There are a couple of ways to do subsampling
runs much faster with [::2]
As mentioned before, the easiest way is to just sample every n points
However this is only good when the points are randomly distributed, which is not necessarily true since you have sorted them in time sequence, possibly affecting their spatial distribution
yeah
A better method would be to randomly shuffle the points beforehand
Still pretty basic, but should be sufficient
how would i shuffle it so that the next shots are shuffled the exact same way?
rng = np.random.default_rng()
shuffled = rng.shuffle(original)
this would be nondeterministic shuffling
if you want the shuffling to be identical each time the program is run, pass a seed to default_rng
(such as zero)
oops made a mistake
rng.shuffle is in-place so it doesn't return anything
so you should just call rng.shuffle(original), and original will be shuffled
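To keep the missed shots and their following shots paired up, one option (not the only one) is to shuffle a single array of indices and apply it to both arrays. A sketch with made-up data, where row i of `following` belongs to row i of `missed`:

```python
import numpy as np

rng = np.random.default_rng(0)   # passing a seed makes the order repeatable between runs

# Made-up paired arrays.
missed = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])
following = missed + 100

# One random ordering of the row indices, applied to both arrays.
order = rng.permutation(len(missed))
missed_shuffled = missed[order]
following_shuffled = following[order]
```

Unlike `rng.shuffle`, indexing with `order` returns new arrays and leaves the originals untouched.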
what does your code look like now?
yeah there's waaaay too many points
16 does this
I'll try out the rng.shuffle tomorrow
i'm done for the night so THANK YOU!
No problem!
you have no idea how helpful this was!
Am glad to help
is it possible in dash to plot a line with an x axis start and end time? i cant find anything on the old google machine
Thanks a ton, I'll try it out later and let you know
Lol, no, it didn't really
We ended up iterating through the 1,200,3 matrix instead of creating a distance matrix
Currently it takes about 3mins for each map of 350,700,3.
We have about 240 of those in each dataset (11 hours) and we have multiple datasets, so still a long way to go
I was told that you have to be really good at math and statistics otherwise you can't be a great data scientist. Is it true?
To understand all the concepts behind data science methods, statistical models and so on you need to be good at maths for sure. But to do good analysis you just need to know how to treat your data right and when to use which model. So you can be a good data scientist without fully understanding everything imo. But you'll be better if you understand the methods, their advantages and limits.
So if I'm not mistaken, even if you are not good at math, that does not limit you?
honestly
it depends on what kinda DS you wanna be
there are many types.
If you want to study Data Science you'll definitely need math BUT there is no good or bad at math, it's just how much time and exercise you want to put into understanding math.
also "really good" is subjective
for example
if you want to be a research scientist
you'll defo need to have your linear algebra etc. down
For analysis tho you just need to know how to run the right analysis on a given data set
I was into DA but now I'm considering moving to NLP and stuff more like AI than analysis. Then I found my math sucks.
I have basic knowledge in linear algebra, but that's it
so like
let's say you're a DS running A/B testing
you need to know what you're doing
proper statistical methodology is important
in, for example, placing subjects into cohorts
understanding the statistical significance of your results
etc.
but there are also DS roles
which are more
...
programming-oriented?
Like?
as in
what I mean is that
both are called "Data Scientist"
but
actual responsibilities may differ
just like you can be a "Software Engineer" and work on embedded systems
or a "Software Engineer" and do webdev
Ah yes
Different expertise within the same field I get that
I'm still not sure what I really want to do
and
you can be also a more business-facing DS
so like your special sauce is being very good @ bringing stakeholders together and communicating with them
Data analyst seems to be... inferior? to data scientist
noone ever does tbh








