#data-science-and-ml

misty flint
iron basalt
#

I don't understand how that is an issue, also it's still sorted after cutting out the non-us rows.

analog pike
#

all i know is that python is disagreeing with me trying to edit values on index, which is a copy of ufos

iron basalt
#

what does ufos look like

#

with row and column labels

analog pike
coral walrus
#

does anyone know how I can format multiple datetime columns in pandas?

```
df['WPCBDATO'] = df['WPCBDATO'].dt.strftime('%d-%m-%Y')
df['WPCIDATO'] = df['WPCIDATO'].dt.strftime('%d-%m-%Y')
df['WPCPLSLUT'] = df['WPCPLSLUT'].dt.strftime('%d-%m-%Y')
```
already tried df[[]]
analog pike
#

pd.to_datetime i'm pretty sure

coral walrus
#

gonna try again, pretty sure it's not working

#

seems odd if I can't format multiple columns at once

#
```
df[['WPCBDATO', 'WPCIDATO', 'WPCPLSLUT']] = pd.to_datetime(df[['WPCBDATO', 'WPCIDATO', 'WPCPLSLUT']], format='%d-%m-%Y')
```
returns ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
#

.apply(pd.to_datetime, format='') did the trick it seems

analog pike
#

hm

#

never heard of that

coral walrus
#

now you have Wowee

analog pike
#

yup

coral walrus
#

but it ignores the format

#

strange
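
On the multi-column question above: `pd.to_datetime` reads a whole DataFrame as a set of year/month/day component columns, which is what the ValueError complains about, so the work has to go column by column. Also, `format=` in `pd.to_datetime` describes how input strings are parsed, not how dates are displayed; the display side is `strftime`. A minimal sketch, assuming the three columns already hold datetimes (the sample values are invented):

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the chat.
df = pd.DataFrame({
    "WPCBDATO": pd.to_datetime(["2021-03-01", "2021-04-15"]),
    "WPCIDATO": pd.to_datetime(["2021-05-20", "2021-06-30"]),
    "WPCPLSLUT": pd.to_datetime(["2021-07-04", "2021-08-09"]),
})
date_cols = ["WPCBDATO", "WPCIDATO", "WPCPLSLUT"]

# DataFrame.apply calls the function once per column (a Series),
# so the .dt accessor is available inside the lambda.
formatted = df[date_cols].apply(lambda col: col.dt.strftime("%d-%m-%Y"))
print(formatted)
```

The result holds strings, so it is probably best kept separate from the datetime columns if any date arithmetic is still needed.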

iron basalt
#
column = ufos.loc[ufos['country'] == 'us', 'datetime']
for i, val in enumerate(column):
  column[i] = val.split()[1]
#

@analog pike

analog pike
#

Well there's no more error

#

but it's giving me the half with the time not the date

#

okay nevermind

#

i switched the 1 to 0

iron basalt
#

date and time

analog pike
#

wdym

iron basalt
#

example:

#

10/10/1949 20:30

analog pike
#

oh yeah

#

only one space

iron basalt
#

split()[0] is the first part which is what you want

analog pike
#

alright thanks so much

iron basalt
#

(side note: split without any arguments passed to it splits by all whitespace, so it works even with multiple spaces in-between, or new lines or tabs)
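
The same split can be done on the whole column at once with the `.str` accessor, which avoids writing into a filtered copy position by position. A sketch assuming a `ufos` frame shaped like the one discussed (the rows here are invented):

```python
import pandas as pd

# Invented rows; only the column names and the date/time layout come from the chat.
ufos = pd.DataFrame({
    "datetime": ["10/10/1949 20:30", "10/10/1956   21:00", "10/10/1960\t20:00"],
    "country": ["us", "gb", "us"],
})

us_rows = ufos["country"] == "us"

# .str.split() with no arguments splits on any run of whitespace;
# .str[0] keeps the first piece (the date) of every row.
us_dates = ufos.loc[us_rows, "datetime"].str.split().str[0]
print(us_dates)
```

`us_dates` keeps the original row labels, so it can be assigned back with `ufos.loc[us_rows, ...]` if that is what is wanted.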

analog pike
#

nice

grave frost
#

Quick question - I am doing Hyperparameter search for my classification model. Is it reasonable to expect that if a model attains higher accuracy (or lower loss) in 5 epochs against another iteration of parameters then it would also attain higher accuracy after full training (say, like 15 epochs)?

cerulean spindle
#

I'm pretty sure that if you put too many epochs, the model will overfit and the loss will go up.

#

so you'll have to have a balance

grave frost
#

I am fine-tuning

velvet thorn
#

example

#

let’s say the hyperparameter is learning rate

#

and weight initialisation is the same

#

you might get near a local minimum faster, but you might also be less likely to reach it exactly

#

with a higher learning rate

grave frost
#

Aha.. good point

#

Looks like I'm gonna have to keep it for like 20 hours

#

Thanx a lot @velvet thorn 🚀

velvet thorn
astral path
#

If I have a column in my dataset which contains short string descriptions using keywords, how could I include that in a heatmap/correlogram to show relationships between the keyword and other variables? e.g. I could use this to find that, for example, descriptions that contain the word "red" and "dress" have a smaller value in a column called stock than a description that includes "green" and "bag"
example of data

velvet thorn
#

i.e. one-hot encoding
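
A rough sketch of what one-hot encoding the keywords could look like, assuming the descriptions are space-separated keywords. The data is invented, and `description` and `stock` stand in for the real column names:

```python
import pandas as pd

# Invented data; "description" and "stock" stand in for the real columns.
df = pd.DataFrame({
    "description": ["red dress", "green bag", "red bag", "green dress"],
    "stock": [2, 9, 5, 4],
})

# One 0/1 column per keyword, split on spaces.
keywords = df["description"].str.get_dummies(sep=" ")
encoded = pd.concat([keywords, df["stock"]], axis=1)

# Correlation of every keyword column with stock; the full matrix
# (encoded.corr()) is what a heatmap would take as input.
print(encoded.corr()["stock"].drop("stock"))
```

With many distinct keywords the matrix gets wide quickly, so it may be worth keeping only the most frequent ones before plotting.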

steady horizon
#

how can i find a degree of similarity between two topics obtained with lda in different documents?

digital crescent
#

I have a dataframe called data with two columns.

print(data.dtypes) yields:
data_in_datetimeformat datetime64[ns]
data_in_float64_format float64
dtype: object

What does "dtype: object" mean? It isn't one of my columns as far as I know.

simple flume
#

any data type in python is object based

#

its considered as an object from a class

#

do u get me ?

digital crescent
#

I think that makes sense, yeah.

#

Why does it list that though?

simple flume
#

object as well but you can tell python that i want list of objects

#

which is a list

digital crescent
#

Is it listing each dataframe's columns' data types and then also saying that "dtype" itself is an object?

simple flume
#

yes data frame columns are objects as well i think you can define them as series and give it to them or a list

#

from panda package

digital crescent
#

I mean, I don't understand why "dtype: object" is part of the output of print(data.dtypes)

simple flume
#

for example dataframe is a class or bigger object which contains smaller objects which are columns

digital crescent
#

That makes sense, yeah

simple flume
#

when python for example tells us something data type is list its origin is a list [] the brackets for example define that you tell the compiler i will have just a number of objects

#

thats why the list could have different data types

#

its not like array in c++

digital crescent
#

I think I understand what you are saying

#

So if a pandas array is composed of columns all themselves composed of the same datatype, pandas.dtypes might return something other than "dtype: object"?

#

Never mind. The output of the function seems like something I don't really need to understand right now

#

The important part to me was being able to identify the datatype being used for each column

plain jungle
#

python has better lists than any other language imo because they allow for multiple datatypes

misty flint
#

scikit learn has the OrdinalEncoder function too. thats the one i used recently

velvet thorn
#

@digital crescent no

#

because that’s a series

#

with string values

#

representing the data types

#

and strings are objects

#

(although that seems like it’s a bit outdated because there is a specific string dtype now)
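
The point above can be checked directly: `data.dtypes` is itself a Series holding one dtype object per column, and a Series of arbitrary Python objects has dtype `object`, which is what the footer line reports. A small sketch with invented values under the column names from the question:

```python
import pandas as pd

data = pd.DataFrame({
    "data_in_datetimeformat": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "data_in_float64_format": [1.5, 2.5],
})

dtypes = data.dtypes
print(type(dtypes).__name__)              # the container holding the dtypes
print(dtypes.dtype)                       # dtype of that container (the footer line)
print(dtypes["data_in_float64_format"])   # dtype of one actual column
```

So the footer describes the Series that `dtypes` returns, not any column of `data`.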

velvet thorn
#

that has its own drawbacks.

#

it’s not necessarily better

velvet thorn
velvet thorn
#

which would explain it

#

you can check

digital crescent
# velvet thorn @digital crescent no

Gotcha. So a pandas dataframe with 2 data columns essentially has an index column, the 2 data columns, a series with values representing the 2 data columns' data types, and everything else a dataframe object would have, right?

velvet thorn
digital crescent
#

I'm not sure. Either way "series/dynamic generator" 🙂

molten bluff
#

guys, I have this tweets data set i am trying to clean. I want to remove all words that begin with @ from a text column and then drop the rows that have the same texts after the above process. I have the following code but it isn't working

df['clean_text']=df['text'].str.replace('(@\w+.*?)',"")
df = df.drop_duplicates(subset = ['clean_text', 'username'])

can someone help me? Thanks!

misty flint
#

sounds like a regex problem

molten bluff
# misty flint sounds like a regex problem

any suggestions on how to remove words that begin with @ or how to modify the above regex? I tried printing the dataframe to the console and it appears to have removed the @sub_strings from the texts but the data frame isn't dropping the duplicates

misty flint
#

regexes are above my paygrade, sorry

harsh timber
#

Hak's regex is working. I think the subset is the problem. Specifically, using the username column which I don't actually see... Probably removing "username" would work?

#

No prob. If it's an issue, just delete that post and let me know if the usernames are the same. I.e. for privacy reasons, you should prob delete that post : )

molten bluff
quiet locust
#

Hi guys I’m trying to figure out a good project to do for my data science portfolio

#

Having a hard time coming up with a research questions

harsh timber
harsh timber
# molten bluff yes

Herm I'm not entirely sure. I would just try to take those two records and compare each string/list using == operator to double check. And then try again. If they are truly equal in Python, then try doing that inplace argument I suppose. Not really my field of expertise since I had my own question to ask, but hopefully testing out a bunch of edge cases should help resolve your issue.

molten bluff
flint mason
#
```
for i in range(100):
  loss = mse(model(inputs),targets)
  loss.backward()
  with torch.no_grad():
    weights -=weights*1e5
    bias -= bias*1e5
    weights.grad.zero_()
    bias.grad.zero_()
```
#

Can someone have a look at whether the weights and bias are used properly?
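
One possible reading of the loop above: a plain gradient-descent step subtracts the gradient (`weights.grad`) scaled by a small learning rate, rather than the weights themselves scaled by `1e5`. A sketch under that assumption; `model`, `mse`, the toy data and the learning rate are all made up here, since the original definitions were not posted:

```python
import torch

# Toy data (invented): targets follow a known linear rule.
inputs = torch.tensor([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]])
targets = inputs @ torch.tensor([[2.0], [-1.0]]) + 0.5

weights = torch.zeros(2, 1, requires_grad=True)
bias = torch.zeros(1, requires_grad=True)

def model(x):
    return x @ weights + bias

def mse(pred, target):
    return ((pred - target) ** 2).mean()

lr = 1e-2  # assumed learning rate, small enough for this toy data
initial_loss = mse(model(inputs), targets).item()

for _ in range(100):
    loss = mse(model(inputs), targets)
    loss.backward()
    with torch.no_grad():
        # Step against the gradient, not against the parameter itself.
        weights -= weights.grad * lr
        bias -= bias.grad * lr
        # Clear the accumulated gradients before the next backward pass.
        weights.grad.zero_()
        bias.grad.zero_()

final_loss = mse(model(inputs), targets).item()
print(initial_loss, final_loss)
```

The `torch.no_grad()` block and the `zero_()` calls from the original stay as they were; only the update lines differ.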

misty flint
#

if you dont like it, you wont finish the project

quiet locust
#

Hmmm okay Rex thanks!

#

Do you mind if I send it in here as I go along?

misty flint
#

sure why not

molten bluff
velvet thorn
molten bluff
#

I figured out that the strings weren't unique because there were exactly the same texts with different @s, hashtags and urls for some reason. So I had to remove all of them and then the drop duplicates worked

#

some user spammed the exact same tweet multiple times with different @s, hashtags and urls on different days and it was messing up my eda
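
A sketch of the cleanup described above: strip mentions, hashtags and URLs, normalise whitespace, then drop duplicates. The tweets are invented and the pattern is one reasonable guess, not the regex that was actually used:

```python
import re
import pandas as pd

# Invented tweets; "text" and "username" are the column names from the chat.
df = pd.DataFrame({
    "username": ["a", "a", "b"],
    "text": [
        "@bob great deal #sale https://example.com/1",
        "@amy great deal #offer https://example.com/2",
        "@bob something else",
    ],
})

# Mentions, hashtags and URLs in one pattern; \S+ runs to the next whitespace.
noise = re.compile(r"(@\w+|#\w+|https?://\S+)")

df["clean_text"] = (
    df["text"]
    .str.replace(noise, "", regex=True)
    .str.split()       # collapse leftover whitespace
    .str.join(" ")
)
deduped = df.drop_duplicates(subset=["clean_text", "username"])
print(deduped)
```

Passing `regex=True` explicitly matters on recent pandas versions, where `str.replace` treats a plain string pattern as literal text by default.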

misty flint
#

sounds like a bot

#

are you doing some sentiment analysis or something?

#

i need to do a project involving twitter api sometime

molten bluff
#

I swear to god the guy was a real person, i personally verified it

misty flint
#

to better understand it

#

how many calls are you limited to daily?

#

100?

dusty pasture
#

Hi

#

Pls verify me I accidentally left

#

Before

molten bluff
astral path
#

i'm using a seaborn distplot to visualize the distribution of one of my columns, but it's extremely distorted because there's some variables (which I'm not sure I want to leave out) that are extremely far away in value from the others

#

it looks like this now

#

there's 301 of these outlier values with a mean of ~10220 and an std of 5544, so idk if i should remove them

#

what do you think I should do?

#

same thing happens with another column

#

it seems like i just have some datapoints which are outliers in all variables

#

i've tried using sb.distplot(plot_df[plot_df['stock'] < 100]) to get all items under 100 (as a test case) and it changed nothing at all...
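
Without the plot it is hard to say why the filter seemed to change nothing. One thing worth checking is that the filtered expression above is still a whole DataFrame, whereas a distribution plot usually wants a single column. A sketch with invented numbers of filtering first and then selecting the `stock` column, plus a percentile cap as an alternative to a hand-picked threshold:

```python
import pandas as pd

# Invented values: mostly small stock counts plus a few extreme ones.
plot_df = pd.DataFrame({"stock": [3, 8, 15, 22, 40, 61, 9000, 12000]})

# Filter rows, then pick the single column to plot.
under_100 = plot_df.loc[plot_df["stock"] < 100, "stock"]

# Alternative: keep everything at or below a chosen percentile of the data.
cap = plot_df["stock"].quantile(0.75)
below_cap = plot_df.loc[plot_df["stock"] <= cap, "stock"]

print(under_100.describe())
print(len(below_cap))
```

Either Series could then be handed to the plotting call in place of the full frame.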

misty flint
#

i would want to work with twitter api only so that i can say ive worked with it

#

but in practicality its annoying

#

lol

misty flint
#

maybe split the dataset?

#

if its that many outliers in that region, seems like a dif. subcategory

astral path
#

i guess i could try that

misty flint
#

honestly idk what youre supposed to do in that case

#

thats just what i would do

#

lol

quiet locust
#

This is a really simple question

#

But what’s the best method for making an api call

#

And why is it necessary?

astral path
#

i think i'll just ask my TA

random thicket
#

/python

slender radish
#

hey everyone

#

i have a question about data types

#

so i have this dataset and height and weight are all integers, are these two attributes continuous data or discrete data?

#

i feel like height and weight should be continuous data, but in this case where all the entries are integers, are they discrete?

dawn turtle
#

they are continuous

#

even if the measurement is granular

idle sail
#

hey i'm new to the server can anyone link me some resources to get started with data science? (like some projects or bootcamp idk).
I have a basic knowledge of python, but i don't know where to find the resources to learn more... thank you in advance 🙂

grave frost
#

I think there is a way to trigger a bot to list all the resources for DS

slender radish
#

@dawn turtle what about an attribute that only takes 0 or 1. 1 being true and 0 being false. Is this categorical data or numerical (discrete) data?

dawn turtle
#

If they represent true and false it's categorical. It's about the thing that the data is representing, not how it is measured

#

@slender radish

slender radish
#

@dawn turtle okie thanks so much!

misty flint
lapis sequoia
arctic wedge BOT
#

Hey @sharp pumice!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

lapis sequoia
#

Hi, hopefully this question isn't too taboo. I work with R a lot and am trying to relearn Python as more and more jobs are using Python (plus a personal interest in it). I've been learning how to use Pandas, matplotlib, numpy etc. Does anyone have any good resources/suggestions on how to get the hang of Python's syntax coming from R?

heady path
#

Anyone interested in chatting approaches to text classification? I am not familiar with data science or python really, but I was a software developer for a few years and have a degree in computer engineering, so I'm sure I can keep up with a convo. I have a specific use case I'm trying to determine whether to go third party, hire a junior developer to work to build something or build it myself

#

Anyway, DM me if anyone is interested

grave frost
#

we are here to help in case of any problem 🙂 but if you want, freelance is always there

little compass
#

The Embedding module of PyTorch is very simple and extremely powerful at the same time. If you are interested in how to deal with representing tokens of a language or encoding categorical variables do not miss this video:)

https://youtu.be/euwN5DHfLEo

In this video, I will talk about the Embedding module of PyTorch. It has a lot of applications in the Natural language processing field and also when working with categorical variables. I will explain some of its functionalities like the padding index and maximum norm. In the second part of this video I will use the Embedding module to represent...

quiet locust
#

would anyone be able to hop on a zoom call with me?

#

I'm having trouble making a request for an api

astral path
#

would binary yes/no variables be considered categorical?

#

e.g. a variable called inStock holds a value 1 if the item is in stock, 0 if not

earnest falcon
quaint bloom
#

this is probably a dumb question, but how is this 4 dimensions? and what would the shape even be?

velvet thorn
quaint bloom
#

ohhhhhhhh

#

i was so confused lmao

velvet thorn
#

okay so you will notice

#

ML programmers

iron basalt
velvet thorn
#

actually use the term differently

#

so in coding

#

if we say “3D array” (3 dimensional)

#

we mean an array that requires 3 levels of indexing

#

to hit a scalar value

#

in mathematics, this is more commonly referred to as a rank 3 tensor

velvet thorn
#

let’s say representing a point on an x, y plot

#

that would be a rank 1 tensor

#

with 2 dimensions (values, namely x and y)

#

all that said

velvet thorn
quaint bloom
#

ah ok

#

thx for the pro explanation

iron basalt
#

Though the misuse is so widespread that it's more or less the accepted use now.

velvet thorn
#

cf. functor

#

🥴

iron basalt
#

True. Though my favorite nonsensical term in programming is Dynamic Programming. It was named that way to disguise the fact that he (Richard Bellman) was doing mathematical research, because the US Secretary of Defense at the time had a pathological fear of the word "research", and to make it sound so cool that no congressman could object to it.

#

I think explaining that difference in dimension usage with number of indices is nice, i'm gonna borrow that.
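
The two meanings discussed above map onto two different NumPy attributes. A tiny sketch with invented arrays:

```python
import numpy as np

point = np.array([3.0, 4.0])        # a point on an x, y plot
image_batch = np.zeros((2, 3, 4))   # needs three indices to reach a scalar

# ndim counts the levels of indexing (the "rank" in the chat's wording);
# shape gives the length of each axis.
print(point.ndim, point.shape)
print(image_batch.ndim, image_batch.shape)
print(image_batch[0, 1, 2])
```

So the point is a "1D array" to a programmer even though it lives in a 2-dimensional space.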

quiet locust
#

What’s the best way to convert a json into a data frame

iron basalt
quiet locust
#

So basically I requested an api

#

And want to take the json and convert it to a data frame

#

So that I can do some analysis

lapis sequoia
#

nice

merry ridge
#

I don't really think of it as mathematicians using the term differently, more so that non-mathematicians are not specific enough when they refer to dimension

#

Take for example, a 2x2 matrix. What is the dimension? Are you referring to the dimension of the column space? Or is this over the vector space of 2x2 matrices over a binary operation like addition?

#

It is especially unclear in machine learning, where there is usually no isomorphism between the row and column space; a mathematician would be more likely to specify exactly what they mean, while a non-mathematician is more likely to brush it under the rug because there is a contextually obvious answer without having to include possibly incorrect mathematical rigor

velvet thorn
#

you wouldn't ask that question

#

you would say "what is the length of <this> axis?"

#

furthermore

#

"matrix" is a mathematical abstraction

#

you can have a 2D array representing that

merry ridge
#

playing a game and my queue popped so I won't be able to reply for a while if necessary

velvet thorn
#

if it's nested you probably won't be able to convert it to a DF as easily

#

because dataframes inherently deal with tabular data

quiet locust
#

can I send a screenshot?

#

here's what I did

velvet thorn
#

and then

quiet locust
#

That’s as far as I got

velvet thorn
#

what do you expect the result to look like

#

I'm not seeing much data there

quiet locust
#

It seems like that api wasn’t a good example

#

Yeah it didn’t come out the way I expected

#

I thought I was getting season statistics for the nba

#

Let me try with a different api and see the result

#

Thank you though @velvet thorn

velvet thorn
#

🥴 you're welcome but I didn't do much

prisma willow
#

I just found that the machineLearning course for my university uses matlab, i wanted to do python with tensorflow+keras and stuff should i drop it and learn supervised/unsupervised/clustering my self or do the course anyway? im at the end of my diploma and done all the major stuff i wanted.

#

what u think?

quiet locust
#

@velvet thorn do you have any suggestions in terms of api usage

#

Because I thought I was getting a large dataset and ended up with this

wispy tangle
#

is there a way to use max for a list and ignoring strings found in the list

velvet thorn
#

it just sounds to me

#

like you used the wrong API

#

🥴

#

maybe look through the docs

#

?

#

if there are docs

velvet thorn
#

but why do you have strings in the list

#

like you can do it but I would suggest you filter first

wispy tangle
#

like i want to get the index of the max number in a list and print the item of that index in another list then move on to the second max number etc..

#

what im trying to do is replace that max num after i use it with a string so it doesn't count

#

if i replace it with zero it gets mixed with other nums

#

oh i could just replace it with -1 since the min num possible is zero

misty flint
#

ML Engineer? youll probs want that matlab experience. Just a Data Scientist? tf/keras will probs be sufficient

#

or if you think youre NOT disciplined enough to learn on your own/you want structure, i would do the ML course

quiet locust
#

@velvet thorn so I used a different api and got this result. Now I want to take this and make it a dataframe

velvet thorn
#

do you know what "flat" means in this context?

velvet thorn
#

what I would suggest instead is

#

are you familiar with the concept of "argsort"
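
For reference, argsort means sorting the positions of a list by the values stored at those positions, which gives the "index of the max, then the next max, ..." order without overwriting anything. A pure-Python sketch with invented lists:

```python
# Invented scores and labels of equal length.
scores = [12, 40, 7, 40, 0]
names = ["a", "b", "c", "d", "e"]

# Argsort: sort the positions by the value found at each position,
# largest first. The original list is never modified.
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

ranked_names = [names[i] for i in order]
print(order)
print(ranked_names)
```

Ties keep their original relative order because Python's sort is stable, so no placeholder value like -1 or a string is needed.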

quiet locust
#

I don't know what flat means

velvet thorn
quiet locust
#

yeah I'm running into a whole host of errors when I try to read it into a dataframe

velvet thorn
#

!e

import pandas as pd

json = """[
    {
        "col_1": "a",
        "col_2": 3
    },
    {
        "col_1": "w",
        "col_2": -3
    }
]"""

print(pd.read_json(json))
arctic wedge BOT
#

@velvet thorn :white_check_mark: Your eval job has completed with return code 0.

001 |   col_1  col_2
002 | 0     a      3
003 | 1     w     -3
velvet thorn
#

this source JSON is flat

#

it maps nicely to a table

#

on the other hand

quiet locust
#

ah ok I see what you mean now

velvet thorn
#

!e

import pandas as pd

json = """[
    {
        "col_1": "a",
        "col_2": {
            "sub_col_1": 1,
            "sub_col_2": None
        }
    },
    {
        "col_1": "w",
        "col_2": {
            "sub_col_1": 6,
            "sub_col_2": 4
        }
    }
]"""

print(pd.read_json(json))
arctic wedge BOT
#

@velvet thorn :x: Your eval job has completed with return code 1.

001 | Traceback (most recent call last):
002 |   File "<string>", line 20, in <module>
003 |   File "/snekbox/user_base/lib/python3.9/site-packages/pandas/util/_decorators.py", line 199, in wrapper
004 |     return func(*args, **kwargs)
005 |   File "/snekbox/user_base/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
006 |     return func(*args, **kwargs)
007 |   File "/snekbox/user_base/lib/python3.9/site-packages/pandas/io/json/_json.py", line 563, in read_json
008 |     return json_reader.read()
009 |   File "/snekbox/user_base/lib/python3.9/site-packages/pandas/io/json/_json.py", line 694, in read
010 |     obj = self._get_object_parser(self.data)
011 |   File "/snekbox/user_base/lib/python3.9/site-packages/pandas/io/json/_json.py", line 716, in _get_object_parser
... (truncated - too many lines)

Full output: https://paste.pythondiscord.com/imiluyarol.txt

velvet thorn
#

this is not

quiet locust
#

hmmm okay

velvet thorn
#

the opposite of flat is "nested"

#

so

#

the term for what you want to do is "normalise"

#

ideally

quiet locust
#

And with "nested" it's much harder to untangle?

velvet thorn
#

you would have some knowledge of the relational model of data

velvet thorn
quiet locust
#

hmmm ok

velvet thorn
#

your data cannot be mapped to a row-column structure

#

pandas has a json_normalize method

#

but that won't work in all cases

#

you can try that

#

if it doesn't

#

then you need to do it manually
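
A short sketch of `json_normalize` on data shaped like the nested example above. It assumes the payload is valid JSON, so the missing value is spelled `null`, and that pandas is version 1.0 or later, where the function is available as `pd.json_normalize` (older releases exposed it as `pandas.io.json.json_normalize`):

```python
import json
import pandas as pd

# Same shape as the nested example, with JSON's null for the missing value.
raw = """[
    {"col_1": "a", "col_2": {"sub_col_1": 1, "sub_col_2": null}},
    {"col_1": "w", "col_2": {"sub_col_1": 6, "sub_col_2": 4}}
]"""

records = json.loads(raw)          # plain list of dicts
flat = pd.json_normalize(records)  # nested keys become dotted column names
print(flat)
```

Nested dictionaries flatten cleanly like this; lists nested inside records usually need the `record_path` argument or some manual reshaping.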

quiet locust
#

ok let me look up some syntax on json_normalize

#

thank you!

#

by manually you mean with like functions and iterations?

quiet locust
#

shit I am not good at those

#

I'm like a beginner for sure as you could prolly already tell haha

#

@velvet thorn do you have any book recs for python with data science in particular?

velvet thorn
#

not really a book-for-learning person

quiet locust
#

I feel that

#

I got a module error when trying json_normalize

#

That's prolly bc of version?

quiet locust
#

@velvet thorn I gave up and decided to use a kaggle dataset LOL

merry pebble
#

i'm looking to create a nice looking line chart that can be styled, is mobile friendly/dynamic. i need the package to either render html of the chart or a .svg so the image is interactive. How could I go about this?

lapis sequoia
#

I used to use matlab at university but havent touched it much since I started doing ML

#

So far I've been able to rely on sklearn for any stat crunching models

quasi sparrow
#

Guys, quick question, Why would this throw me an error?

#

'''

#
    df_bert = pd.DataFrame({
        'id': range(len(train_df)),
        'label': train_df[0],
        'alpha': ['a'] * train_df.shape[0],
        'text': train_df[1].replace(r'\n', ' ', regex=True)
    })
#

So, I have a CSV data file and I am using that file to create a data frame with only two columns

#

My interpreter does not like how I parse the columns [1] and [0] into a new dataframe

#

Anything helps!

lavish swift
#

@quasi sparrow Not sure, but here are some thoughts. 1. Without knowing what's in your train_df, I think you're trying to set the label to the values you have in the first column? If so, you may want to try something like:

'label': train_df.iloc[:, 0].values
#

I don't think you can use the index of the column without telling pandas that's what you're trying to do.

quasi sparrow
#

Yes, that's exactly what I'm trying to do! Many thanks!

#

I'm trying to train a bert model

lavish swift
#

as for the replace, you want to add an .str.replace.... So pandas knows you're using a string method

#

or maybe pandas has a replace method? but it looks like you're doing string work

quasi sparrow
#

I have a dataframe with 2 columns: One with string sentences and the other one with numbers (labels)

lavish swift
#

though my first suggestion (using iloc) might solve both issues without having to use .str.

quasi sparrow
#

I am trying to create a new dataframe with 4 columns, 2 coming from the original dataframe that I already have.

lavish swift
#

hopefully those suggestions help. lemme know how it goes! 🙂 good luck!
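
A sketch of the construction being discussed, on an invented two-column frame. Since the traceback was not posted, the cause is a guess: `train_df[0]` only works when a column is literally labeled `0` (for example a CSV read with `header=None`), while `iloc` selects by position regardless of labels. Both `Series.replace(..., regex=True)` and `Series.str.replace(..., regex=True)` exist; the sketch uses the latter:

```python
import pandas as pd

# Invented two-column frame, as read_csv(..., header=None) would produce:
# column 0 holds the label, column 1 the sentence.
train_df = pd.DataFrame({0: [1, 0], 1: ["first\nsentence", "second one"]})

df_bert = pd.DataFrame({
    "id": range(len(train_df)),
    # iloc selects by position, whatever the column labels happen to be.
    "label": train_df.iloc[:, 0],
    "alpha": ["a"] * len(train_df),
    # The pattern is a regex, so \n matches a real newline character.
    "text": train_df.iloc[:, 1].str.replace(r"\n", " ", regex=True),
})
print(df_bert)
```

If the CSV does have a header row, selecting by the header names would be the more readable option.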

quasi sparrow
#

Thanks! Sure I will!

astral path
#

Hi all, I'm trying to use some functions created in a kaggle notebook to create a custom correlation heatmap (https://www.kaggle.com/mlwhiz/seaborn-visualizations-using-football-data/), however despite the data I'm using being seemingly very similar to the data that they're using, I'm getting an error UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32') when I try to call results = associations(plot_df,nominal_columns=catcols,return_results=True) on my own dataframe plot_df. I have absolutely no idea what this means and answers from other places online haven't really explained this well at all

#

this is their dataframe when i called .info() on it

#

and here's mine

#

I really don't see any differences here in types, any ideas what could possibly be causing this?

#

if more info is needed I can provide it. Thank you!!!

viral goblet
astral path
velvet thorn
#

that error is saying that you're treating non-numeric columns as numeric, basically
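
One way to look for the columns that error is pointing at: list everything pandas does not consider numeric, then convert whatever should be numeric. The frame below is invented, and which columns are affected in the real `plot_df` is unknown:

```python
import pandas as pd

# Invented frame: "price" looks numeric but was loaded as text.
plot_df = pd.DataFrame({
    "stock": [3, 8, 15],
    "price": ["1.50", "2.00", "n/a"],
    "category": ["bag", "dress", "bag"],
})

# Columns that are not numeric are the usual suspects.
text_like = plot_df.select_dtypes(exclude="number").columns.tolist()
print(text_like)

# errors="coerce" turns anything unparseable into NaN instead of raising.
plot_df["price"] = pd.to_numeric(plot_df["price"], errors="coerce")
print(plot_df.dtypes)
```

Anything left in `text_like` after conversion is a candidate for the `nominal_columns` argument mentioned in the question.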

tender gyro
#

Hi guys, can you guide me on how to go about building this vehicle classifier/tracker system? I have data of vehicles in the form of video recordings

grave frost
#

do you want to track or classify?

balmy bear
#

hi

onyx drum
#

Any suggestions for functions I can use to compute correlations? "pearsonr" from scipy.stats crashes my Jupyter notebook consistently. I have 500000 datapoints for each set of distributions whose correlation I want to compute, but surely there has to be a non-crashy way...
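
One alternative to try is `np.corrcoef`, which computes the same Pearson coefficient without the p-value. A sketch on invented data of the same size as in the question; whether it avoids the crash depends on what is actually causing it:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000  # same order of magnitude as in the question

x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)  # invented data with a known relationship

# corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 3))
```

Half a million floats is only a few megabytes, so if this also crashes the problem is likely elsewhere (for example in how the arrays are built).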

astral path
wise kettle
#

What are some good resources to learn about graphs, probability, and statistics? I am auditing an edX course on python that involves things like JupyterLab and data science stuff but I never passed stats in high school

trail breach
#

Is anyone familiar with the perceptron algorithm? Trying to use mnist and tensorflow to figure out how to tell a three from a five

#

looking for a reference

#

I have my data normalized, and separated, but not sure where to go from here.

magic panther
#

Anyone know something about objective functions? How do I create one out of a 3d or more plot of data

misty flint
#

best stats explanations ever

#
  • tina huang
wise kettle
#

Thanks!

#

Checking it out ASAP

grave frost
#

hi

late shell
#

hello. I was building a Multiple Linear Regression Model, with the dummy dataset of 50_Startups which is usually used by beginner ML learners. The data set looks like this

#

I made 2 models, one in which I dropped one dummy variable in order to avoid the dummy variable trap after onehot encoding the State feature, and in the other model I didn't drop any dummy variable, kept all 3 of them. But the results were exactly the same

#

same R^2
same predictions
why did this happen? Aren't we supposed to drop one dummy variable in order to avoid multicollinearity ?
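
One way to see why the two fits can coincide: with an intercept in the model, the three dummy columns add up to the intercept column, so keeping or dropping one dummy spans the same space and least squares gives the same fitted values. What is lost is uniqueness of the individual coefficients. A NumPy sketch on invented data (not the 50_Startups file), using `np.linalg.lstsq`, which returns a minimum-norm solution when the design matrix is rank deficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
state = rng.integers(0, 3, size=n)      # three categories, as with State
spend = rng.normal(size=n)              # one invented numeric feature
profit = 1.0 + 2.0 * spend + 0.5 * state + rng.normal(scale=0.1, size=n)

dummies = np.eye(3)[state]              # one-hot encoding, shape (n, 3)
intercept = np.ones((n, 1))

X_all = np.hstack([intercept, spend[:, None], dummies])          # all 3 dummies
X_drop = np.hstack([intercept, spend[:, None], dummies[:, 1:]])  # one dropped

coef_all, *_ = np.linalg.lstsq(X_all, profit, rcond=None)
coef_drop, *_ = np.linalg.lstsq(X_drop, profit, rcond=None)

pred_all = X_all @ coef_all
pred_drop = X_drop @ coef_drop
print(np.allclose(pred_all, pred_drop))
print(np.linalg.matrix_rank(X_all), X_all.shape[1])
```

So predictions and R^2 can match exactly; the reason to drop a dummy is mostly about interpreting and trusting the individual coefficients, not about the fit itself.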

iron basalt
#

The make_input_fn returns a function which is then later passed to tensorflow during training and evaluation. The reason for the nesting is that train_input_fn and eval_input_fn must be functions that take no arguments. Tensorflow expects / allows them to only have the arguments mode, params and config (and maybe input_context). The functions train_input_fn and eval_input_fn are supposed to return the data, but how are they supposed to do that if they don't take any arguments? How do you get your data to tensorflow? The trick is to either use globals or do the wrapping trick in which the inner function has the data without getting it passed as an argument, but unlike a global, it's not global. In general this is a common trick you will find in python programs. Side note: this is how decorators are implemented.
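
The wrapping trick described above, reduced to a framework-free sketch. This is not the TensorFlow tutorial's actual function; the batching logic and the sample lists are invented to keep it runnable on its own:

```python
def make_input_fn(data, batch_size):
    """Return a zero-argument function that still has access to `data`."""
    def input_fn():
        # `data` and `batch_size` are captured from the enclosing call,
        # so nothing has to be passed in and nothing has to be global.
        return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]
    return input_fn

# Hypothetical stand-ins for the training and evaluation sets.
train_input_fn = make_input_fn([1, 2, 3, 4, 5], batch_size=2)
eval_input_fn = make_input_fn([10, 20], batch_size=2)

print(train_input_fn())  # called with no arguments, as a framework would
print(eval_input_fn())
```

Each call to `make_input_fn` produces an independent inner function with its own captured data, which is why the train and eval versions do not interfere.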

severe python
#

hi data science people

#

have a pandas question that I haven't had a response for in the help channels:

#

made a script that searches an excel file (currently one column) based on user input and prints matching rows which is great, but i would like it to search multiple columns. any idea how to achieve this? here is my code:

```
import pandas as pd
from tabulate import tabulate
from termcolor import colored

class bcolors:
    FAIL = '\033[91m'  # ANSI escape: start red text
    ENDC = '\033[0m'   # ANSI escape: reset to the default colour

while True:
    try:
        variable = input("Please provide an acronym:  ")
        variable = variable.upper()
        df = pd.read_excel("accounts.xlsx")
        # Index by acronym so df.loc can look rows up by label.
        df = df.set_index('Acronym')
        result = df.loc[variable]
        print(tabulate(result, headers='keys', tablefmt='psql'))

    except KeyError:
        print(f"{bcolors.FAIL}Invalid acronym{bcolors.ENDC}")
```
iron basalt
severe python
#

Please provide an acronym: RYAN
+-----------+----------+-----------+-----------+
| Acronym | Parent | Clearer | Account |
|-----------+----------+-----------+-----------|
| RYAN | 26KM291 | GS | 285M322 |
| RYAN | 2378DM | Socgen | 2HKLM242 |
| RYAN | 26KM60 | GS | 285M322 |
| RYAN | 26KM60 | BAML | 268132 |
+-----------+----------+-----------+-----------+
Please provide an acronym:

iron basalt
#

Show accounts.xlsx (assuming it's test data and not something that needs to be kept secret).

#

(screenshot a section of it)

severe python
#

has 4k rows

#

goal is to search by parent or account as well as acronym, searching by clearer isn't necessary

iron basalt
#

so you want to be able to search by a key other than Acronym?

severe python
#

yes exactly, by the parent ID as well as the account ID

#

and print matching rows (including acronym column)

#

i'm thinking i will need to restructure a lot? @iron basalt

misty flint
#

might as well throw it into SQL

iron basalt
#

@severe python I'm back, so you could restructure, much like in a database, but you could also just not use an index and instead do something like df.loc[df["column name"] == value].

#

df.loc will give you the rows and df["some name"] gives you the column with that name. You then check that column for all values that match value and get the rows at those spots.

#

If you go the index route you can make things a lot faster, but you need to normalize (1NF, 2NF, 3NF, BCNF).

magic pivot
#

hi

#

do anyone know about opencv library

#

i have a problem with its haarcascade classifier

#

the .detectMultiScale() method gets stuck while executing

lapis sequoia
#

hi

magic pivot
#

@lapis sequoia hi

#

can you help me with that problem

lapis sequoia
#

don't know nothing

#

about python

misty flint
lapis sequoia
#

just looking around

magic pivot
#

ohk

#

np

#

@misty flint can you help?

misty flint
#

sorry ive only used the draw functions with opencv

magic pivot
#

ohk np

misty flint
#

and the image processing module

iron basalt
#

@magic pivot opencv is a buggy mess (on the c++ side, which the python side inherits), try using scikit-image instead. If that does not work, come back.

magic pivot
#

@iron basalt thanks

#

i was thinking it was a bug

magic pivot
#

@iron basalt thank you very much 🙏

severe python
#

@iron basalt so without index, it would look like df.loc[df["Acronym, Parent, Account"] == variable] ?

astral hound
#

Hey so if I need to read an image and find it as accurately as possible what methods are there? Canny keeps giving me false positives

iron basalt
#
import pandas as pd

df = pd.DataFrame(
    {
        "month": [1, 4, 7, 10],
        "year": [2012, 2014, 2013, 2014],
        "sale": [55, 40, 84, 31]
    }
)

print(df)
print("--------------------------")

print(df.loc[df["year"] == 2014])
print("--------------------------")

print(df.loc[df["year"] == 2014].loc[df["month"] == 4])
print("--------------------------")
#

run that

#

@severe python

#

it gets all rows with year == 2014 and from all those rows it gets all rows with month == 4.
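
A small sketch of the same two-condition filter written as a single boolean mask, reusing the toy frame from the snippet above. Each comparison needs its own parentheses because `&` binds tighter than `==`:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "month": [1, 4, 7, 10],
        "year": [2012, 2014, 2013, 2014],
        "sale": [55, 40, 84, 31],
    }
)

# Combine both conditions element-wise with & (not the Python keyword `and`).
mask = (df["year"] == 2014) & (df["month"] == 4)
result = df.loc[mask]
print(result)
```

Both forms should select the same rows; the single mask just avoids building the intermediate frame.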

severe python
#

AttributeError: partially initialized module 'pandas' has no attribute 'DataFrame' (most likely due to a circular import)

#

i see what you mean, in my case would i just reference variable? which is basically the user input @iron basalt and would i need to change the ending lines?

iron basalt
#

yeah, if your user inputs an acronym (wants to search by one), then you just do what you are doing now. If they select to search by Parent, then it's the same, but with parent instead.

#

So here is what you can do

#

There is a loop, the user selects which column they want to search by.

#

Then which value they want to match in that column.

#

It spits out all rows that match.

#

Then go back to step 1. But this time it only searches the remaining rows.

#
rows = df
# In a loop with user input
rows = rows.loc[rows[column_to_search_by] == value_to_match]
print(rows)
severe python
#

i see what you're saying

iron basalt
#

It filters down till you only have whatever you want left.

severe python
#

so there isn't a way to search multiple columns normally right? like instead of asking user what criteria they want to search by

iron basalt
#

You can also detect that multiple columns were entered, e.g. month, year, and then accept multiple values, 4, 2014, to speed things up a bit for the user.

#

This would just execute as before, one after another.
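
A minimal sketch of that idea, assuming the user types the column names and the values as two comma-separated strings. The helper name `filter_rows` and the hard-coded input strings are made up for illustration, and the values are assumed to be integers as in the toy frame:

```python
import pandas as pd


def filter_rows(df, columns, values):
    """Apply one equality filter per (column, value) pair, one after another."""
    rows = df
    for col, val in zip(columns, values):
        rows = rows.loc[rows[col] == val]
    return rows


df = pd.DataFrame(
    {
        "month": [1, 4, 7, 10],
        "year": [2012, 2014, 2013, 2014],
        "sale": [55, 40, 84, 31],
    }
)

# Stand-ins for what input() would return.
columns = [c.strip() for c in "month, year".split(",")]
values = [int(v) for v in "4, 2014".split(",")]

result = filter_rows(df, columns, values)
print(result)
```

Each pass only looks at the rows that survived the previous filter, which matches the "filter down" loop described earlier.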

severe python
#

a little lost

#

i feel like there's a way to search by both. like if i search a parent account, to print that row. if i search an acronym, print that row

#

what is preventing me from doing that? and can i use df.loc with multiple columns referencing the user input? what you said above

iron basalt
severe python
#

gotcha, can you double check this with the full code i gave above?
df.loc[df["Acronym, Parent, Account"] == variable]
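
A note on that line: `df["Acronym, Parent, Account"]` asks for one column whose name is that whole string, so it raises a `KeyError` unless such a column exists. Here is a minimal sketch of matching one input against several columns, assuming `Acronym`, `Parent` and `Account` are three separate columns (the sample rows are made up):

```python
import pandas as pd

# Made-up sample rows; only the column names come from the discussion.
df = pd.DataFrame(
    {
        "Acronym": ["ABC", "XYZ", "QRS"],
        "Parent": ["Acme", "Globex", "Acme"],
        "Account": ["A-1", "G-7", "A-2"],
    }
)

variable = "Acme"  # stands in for the user's input

search_cols = ["Acronym", "Parent", "Account"]
# eq() compares every cell to the input; any(axis=1) keeps rows with at least one hit.
matches = df.loc[df[search_cols].eq(variable).any(axis=1)]
print(matches)
```

This way the user does not have to say which column they are searching by.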

iron basalt
#
import pandas as pd

df = pd.DataFrame(
    {
        "month": [1, 4, 7, 10],
        "year": [2012, 2014, 2013, 2014],
        "sale": [55, 40, 84, 31]
    }
)

print(df)
print("--------------------------")

while True:
    search = input("Enter your query: ")

    if search == "quit":
        break

    sp = search.split(",")

    col = sp[0]
    val = int(sp[1]) 

    print("Showing all results where {} == {}:".format(col, val))
    print(df.loc[df[col] == val])
#

@severe python

#

Example output:

#
   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31
--------------------------
Enter your query: sale,55
Showing all results where sale == 55:
   month  year  sale
0      1  2012    55
iron basalt
silk axle
#

Is there a way to install TensorFlow with GPU support on Windows 10 without using anaconda? All the tutorials I've seen are either outdated or use anaconda which I don't have

iron basalt
#

The reason why they use anaconda is because tensorflow GPU needs the CUDA toolkit.

#

(And conda has that)

silk axle
#

Right

iron basalt
silk axle
#

But could I not just install the CUDA toolkit myself?

iron basalt
#

You can

silk axle
#

Or is this just something where I should install anaconda?

final granite
#

Not sure if there's an interest in this, but I've been working on a channel for Python for a while. It's not data science-centric, but it is text-analysis specific with bits of DS thrown in the mix. Thought I'd share it here: https://www.youtube.com/pythontutorialsfordigitalhumanities

serene scaffold
final granite
#

Oh dear. I didn't mean to violate the rules. I am sincerely sorry. Everything from custom NER for domain-specific problems to topic modeling.

#

I don't make money off these videos, just helping develop them as part of a postdoc and spreading the word.

serene scaffold
final granite
#

Ah gotcha. No nothing like that.

serene scaffold
#

I also see from your message history that you finished an NLP project with spaCy. My first big Python project was refactoring a spaCy-based NER package that my coworker wrote.

final granite
#

Thanks for the heads up about the rules, though. I'll be more explicit in my intentions if I do something like that in the future.

#

Oh cool. Indeed. I am a huge fan of spaCy. Looking forward to preparing a series of tutorials for version 3.0. There's lots to unpack in the new update and I haven't had the time to fully explore it yet

serene scaffold
#

Oh, there's going to be a new major release?

final granite
#

Was released Feb 1.

#

They are moving towards BERT. Results are expected. Marked increase in accuracy at the cost of performance, but not as much as competitors.

serene scaffold
#

Interesting. One of the reasons we didn't build our NER package entirely through spaCy (we collected features from Doc instances and used other learners) was because we eventually wanted to use BERT. Which we did.

#

But you're telling me that BERT embeddings will be what ship with the large model?

final granite
#

I have only toyed with spaCy 3.0, but it's a fairly easy implementation of BERT. Lots more customization now too with 3.0. You can control your ANN architecture

#

No. There is a .trf model that is the BERT model. They still have all the same sm, md, lg models with embeddings

#

So you can still use the old embeddings, if you desire

serene scaffold
#

Interesting. Right now my advisor has me working on dataset ablation and that might carry me through to when I leave the college. I'm in my last semester of undergrad and I've just had the opportunity to do research because she took a chance on me.

final granite
#

Oh that's really cool. DS/NLP is a fun career path. I'm a historian by training

serene scaffold
#

That is to say, your undergrad was in history but you pursued CS thereafter and that's how you're a postdoc?

final granite
#

Oh no my B.A., M.A., and PhD are all in medieval history. During my PhD I taught myself DS/CS/and programming in secret so that I could do the research I wanted

serene scaffold
final granite
#

Go for it. I will be dashing soon, though so I can only chat for a few miun

#

min*

lone blaze
#
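import numpy as np
import scipy.integrate
import matplotlib.pyplot as plt

# f (the target function) and x (the grid used for plotting) are defined earlier.
aa = []
varss = []
errors = []
for i in range(10000):
    x1 = np.random.rand()*2-1
    x2 = np.random.rand()*2-1
    X = np.array([[x1], [x2]])

    y1 = f(x1)
    y2 = f(x2)
    y = np.array([[y1], [y2]])

    # Least-squares slope of h(x) = a*x through the two sampled points.
    a = np.linalg.solve(X.T @ X, X.T @ y)[0][0]
    # quad returns (value, error estimate); keep only the value.
    var = scipy.integrate.quad(lambda x: (a*x-1.4286*x)**2, -1, 1)[0]
    error = scipy.integrate.quad(lambda x: (a*x - f(x))**2, -1, 1)[0]

    aa.append(a)
    varss.append(var)
    errors.append(error)

variance = np.mean(varss)
error = np.mean(errors)
print("variance", variance)
print("error", error)

ahat = np.mean(aa)
print("ahat", ahat)

plt.plot(x, f(x))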
#

No idea where (ax-1.4286x)^2, 1.4286 came from?

#

My thoughts: it seems to me like the guy who wrote the code first calculated ghat
and got it to be approximately 1.4286
then put the numerical value in, since it won't really differ by much
and then avoid having to do all the calculations twice
seems reasonable or did I get something wrong?
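
A hedged sketch of that first pass, estimating the mean slope before any variance is computed. The chat never shows `f`, so `f(x) = sin(pi*x)` below is purely an assumption. With that assumed target the average slope should come out close to 1.43, which would be consistent with a hard-coded 1.4286:

```python
import numpy as np


def f(x):
    # Hypothetical target function; the original f is not shown in the chat.
    return np.sin(np.pi * x)


rng = np.random.default_rng(0)
n_trials = 200_000

# Two training points per trial, each uniform on [-1, 1].
x = rng.uniform(-1, 1, size=(n_trials, 2))
y = f(x)

# Least-squares slope for h(x) = a*x through the origin: a = sum(x*y) / sum(x*x).
a = (x * y).sum(axis=1) / (x * x).sum(axis=1)

ahat = a.mean()
print("ahat", ahat)
```

Running this once and pasting the resulting number into the variance integrand would save recomputing the mean inside the loop, which fits the guess above.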

astral path
#

I have a dataframe with two columns

#

and another dataframe with a 3rd column representing the count of each pairing of values from the first dataframe with multiindexing

#

how would i populate a 3rd column in the original dataframe with the value of the count in the second dataframe for each row's pairing?

#

i originally thought something along the lines of mb_counts.loc([plot_df['merchantID'], plot_df['BrandID']]) but that doesn't work because it's using lists as an index

exotic maple
#

I think you should be able to create a new column with apply and a lambda

#

something like

#

DF["NEW COLUMN"] = df.loc["applicable_index"].apply(len)

#

I -think- that's what you want, no? the size of the multiindex?

#

or do you want the count of the elements inside the index, per element?

astral path
#

what i'm trying to do is if there's, say 4 rows in the original dataframe where merchantID is 9359 and BrandID is 8360, then the other dataframe contains the number 4 at the multiindex of merchantID being 9359 and BrandID being 8360

#

im adding a column to the original dataframe which contains the number of times each row occurs in the dataframe

exotic maple
#

yeah you can do something like what i said, if i got you right

#

so for example that multiindex has 4 (in the dead) test

#

i can get the number with len(df.loc["Alabama"])

#

and you can multiindex it too

#

so what you need is to pass len into an apply

#

to get the size based on a multiindex loc

astral path
#

hmm ok i'll try that

astral path
velvet thorn
#

no

exotic maple
#

that means those are not in the index

#

i'm also testing the solution myself and its not optimal. let me try something else

astral path
#

ok

velvet thorn
#

that’s wrong

#

you want to merge.

#

AKA join

astral path
#

merge the counts dataframe with the original one?

exotic maple
#

ahhhh

velvet thorn
#

ye

exotic maple
#

yes that's actually much better

scenic patio
#

may i ask some resources to learn data science.

astral path
#

ahhh ok

exotic maple
#

you can do it via pivot as well

#

or groupby

velvet thorn
#

merge on those two columns
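
A minimal sketch of that merge, assuming the counts frame is indexed by a `(merchantID, BrandID)` MultiIndex as described. The sample rows are made up; `plot_df` and `mb_counts` are the names used in the earlier message:

```python
import pandas as pd

# Made-up rows; only the column names merchantID / BrandID come from the chat.
plot_df = pd.DataFrame(
    {
        "merchantID": [9359, 9359, 9359, 9359, 1111],
        "BrandID": [8360, 8360, 8360, 8360, 2222],
    }
)

# Stand-in for the second frame: counts indexed by a (merchantID, BrandID) MultiIndex.
mb_counts = plot_df.groupby(["merchantID", "BrandID"]).size().to_frame("count")

# reset_index() turns the MultiIndex levels into ordinary columns so they can be merge keys.
merged = plot_df.merge(
    mb_counts.reset_index(), on=["merchantID", "BrandID"], how="left"
)
print(merged)
```

`how="left"` keeps every row of the original frame and in its original order; each row just gains the `count` looked up for its pair.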

exotic maple
#

groupby with .agg("count")

velvet thorn
exotic maple
#

yeah honestly grouping / merging would be better there, @velvet thorn is right

velvet thorn
#

doesn’t @astral path mean like just the value in the second DF

#

which is the already computed count

astral path
#

yeah

exotic maple
#

he's trying to compute it, though, right?

#

Or did i misunderstand the whole thing lol

astral path
#

i already have it computed in the second dataframe

exotic maple
#

ok i'm dumb lol

astral path
#

although if it's better to compute it in a one-liner that also adds a column then that's better

exotic maple
#

@velvet thorn 's approach is much cleaner

#

try using pivot or groupby

#

and aggregate

#

via len/count

#

@scenic patio any specifics in mind?

velvet thorn
#

also @astral path a tip

scenic patio
#

nope i just learned basics of python

velvet thorn
#

in general for this kind of question

exotic maple
#

My advice would be to first learn python

velvet thorn
#

if you can provide a runnable sample of input and expected data

#

like something people can copy paste and run

scenic patio
#

i was searching through videos and learned that python is good for ds

velvet thorn
#

it gets a lot easier to understand your problem

astral path
#

how would I get that expected data sample in an easily exportable way from python?

exotic maple
#

@scenic patio There's Python, R, and Julia is interesting too, for example

#

Python is the most popular though

astral path
#

ah ok

velvet thorn
#

with a few rows

exotic maple
#

do you have a background with decent math studies?

scenic patio
#

i am in high school

exotic maple
#

Ok you need to learn some math then

#

Try Khan Academy on these topics: Linear Algebra. Calculus I, II, III (vectors)

#

Multivariate calculus

#

As per resources...there's countless to be honest lol

scenic patio
#

thats the problem so much resources dont know what to use

exotic maple
#

I'm going through the University of Michigan's Applied Data Science

#

which is in python

#

and its pretty good

#

but be aware of one thing

#

not one resource EVER is going to teach you everything

#

you need to research a lot by yourself

scenic patio
#

i am in a midst of exploring web development and ds

#

dont know what i want to main

astral path
exotic maple
#

Well, that's an important life topic you shouldnt ask strangers in discord about lol

#

but for data science you can start there

scenic patio
#

yh i just want to know what ds resources are reliable

exotic maple
#

University of Michigan's Applied Data Science -> I like this, but there are others

astral path
exotic maple
#

Codequest? I think its good too

#

EDX has a fantastic data science esp from Harvard, but thats in R, not Python

scenic patio
#

what about this website i stumbled upon: dataquest

exotic maple
#

I think thats good

#

ive heard good reviews about

#

it

#

but i havent tried it myself

#

dont get stuck in tutorial hell though dude

#

review courses

#

choose one

#

learn as much as you can

#

and then do a project of your interests, whatever that is

grave frost
#

agreed, you would learn more in a project than in a course with spoon-fed material

scenic patio
#

thank you!!!

grave frost
#

IMO that's one of the most enjoyable ways to learn. Because the project would be something you like and would be cool, you won't lose motivation.

exotic maple
#

Exactly

#

choose a topic you like, find data about it, and make sense of it

#

You like anime? Try getting anime data depending on seasons, gender, profits, etc, and try to predict anime's success based on 2 or 3 parameters.

#

You like weather? Plenty of datasets lol

#

economics? shitloads of datasets

#

I think even pornhub has API nowadays...

grave frost
#

That turned dark pretty quickly 🙂

misty flint
#

dont corrupt the highschooler

grave frost
#

😆

exotic maple
#

he's in high school

#

not kinder lmao

misty flint
exotic maple
#

besides, I just said API :p

misty flint
#

true

#

kids gotta grow up fast nowadays

exotic maple
#

actually, jokes aside, those guys at PH have some decent datasets and insights

#

i always laugh at their yearly summaries lol

astral path
#

facts

grave frost
#

Well what do you know - there actually is an API for PH

misty flint
#

i heard thats what let them stay ahead of their competition. them using data science and ML

astral path
#

imagine getting a masters in statistics to go work for pornhub

exotic maple
grave frost
#

I thought MS wont allow it

exotic maple
#

-looks away-

exotic maple
scenic patio
#

true

grave frost
exotic maple
#

Data Science in PH -> Interquartile range of "session" length based on categories

misty flint
# scenic patio true

dw too much about figuring out what you want to do. you still have time. take a chance to explore everything and find what you like

exotic maple
#

this guy is so ahead of things he shouldnt stress

misty flint
#

yeah def

exotic maple
#

im 28, with career and shit, and im still a bit lost lmao

astral path
#

i got my problem fixed merchant_brand_df = merchant_brand_df.groupby(['merchantID', 'BrandID']).size().to_frame('count').reset_index()
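
Worth noting about that one-liner: `groupby(...).size()` collapses the frame to one row per `(merchantID, BrandID)` pair. If the goal is to keep every original row and just add the count, a possible alternative is `transform("size")`, sketched here on made-up data:

```python
import pandas as pd

# Made-up rows; only the column names come from the chat.
merchant_brand_df = pd.DataFrame(
    {
        "merchantID": [1, 1, 2, 1, 2],
        "BrandID": ["a", "a", "b", "a", "c"],
    }
)

# transform("size") returns one value per original row: the size of that row's group.
merchant_brand_df["count"] = merchant_brand_df.groupby(
    ["merchantID", "BrandID"]
)["merchantID"].transform("size")
print(merchant_brand_df)
```

The frame keeps its five rows, and each one carries the number of times its pair occurs.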

exotic maple
#

though Im liking data science a lot

misty flint
exotic maple
misty flint
#

mid career change

#

to data science

#

or at least trying

#

you at least seem to know your stuff unlike me

astral path
#

depressing thought: it would suck being the data scientist for the FBI who has to create algorithms for detecting certain illegal kinds of porn

misty flint
misty flint
astral path
#

you would have to go through and analyze the features of those videos and that could be scarring

misty flint
astral path
#

oh thank god

misty flint
#

then if its actual stuff they investigate it more closely

grave frost
misty flint
exotic maple
#

oh hi FBI

grave frost
misty flint
exotic maple
grave frost
#

Juvenile your 3rd home

astral path
#

or did i

misty flint
#

double entendre?

exotic maple
#

Humanity will be doomed the day an NLP algorithm understands double meaning jokes

#

in multiple languages

astral path
#

oh boy

#

i cant wait for the day algorithms can generate jokes so complex and clever that only other AIs can understand it

misty flint
#

AI jokes

astral path
#

centuple-entendre jokes

grave frost
astral path
#

could be of related interest

#

fictional short story but with massive societal implications

exotic maple
#

reddit and twitter are basically 50% bots

#

jerking off and upvoting each other

#

and creating trends out of nowhere

astral path
exotic maple
#

change my mind

astral path
#

at least 1/3 of new reddit posts are porn bots

#

oh god not marge

grave frost
#

Nothing serious tho

#

Right now our Ai is not that advanced. It's just an effective mode of communication

misty flint
#

for now

#

but yeah i getcha

#

semi-related

#

theres this poster thing at my uni, and i wanna do a poster about ai and ethics/society but idk what would be interesting to others

grave frost
misty flint
#

what is this

exotic maple
#

the one thing I havent been able to find in a "simple" understandable way is the theory behind some of the ML algorithms

#

Like, yeah I 99% care about the applied part, but Im curious about the theory too lol 😦

grave frost
#

Ofc you can learn about theory - or get an idea at least if you watch 3B1B

#

But the real challenge is how exactly does a model learn?

grave frost
#

3Blue 1Brown - youtube channel

exotic maple
#

depends on what you mean by "learn". for advanced stuff like image recognition or NLP I have no clue -yet-

velvet thorn
#

which algorithm specifically

misty flint
grave frost
#

No, I mean interpretability of Black Box models. Its a pretty hot research topic

misty flint
#

i thought for AI like this its a lot of reinforcement learning

grave frost
misty flint
#

wait what are we talking about

velvet thorn
#

not sure if "how does a model learn" is exactly the same thing though

grave frost
#

The underlying RL structure has MDPs (Markov decision processes) at its core. Normal ML is basically supervised .... function mapping? dunno the exact term lolol

velvet thorn
#

the techniques that have been developed to interpret DL models are p cool though

grave frost
exotic maple
#

That personally transcends me lol I'm more interested in the ways to apply ML / DL / NLP in real life

velvet thorn
#

most commonly, yes

astral path
#

woahhh

exotic maple
#

theory is cool, but its not my thing

astral path
#

huge heatmap zoomed way out

exotic maple
#

excellent example of "WTF IS THAT CHART?"

grave frost
#

@velvet thorn Hmm.. weren't all RL based on MDP? Sorry if I am behind times 🤷 I only know basic theory

exotic maple
#

I have no idea of what that heatmap is lol

misty flint
grave frost
#

A3C maybe? I dont get it 🙂

velvet thorn
#

it's just like how there's DL without gradient descent

#

(admittedly a less "out there" case)

exotic maple
iron basalt
#

@grave frost RL has a ton of methods, and an obvious example of non-MDP is when they are solving a partially observable MDP, which is a much more interesting (and much more difficult) problem.

grave frost
#

I stopped learning about RL cuz I don't think it's very useful for real-world application. mostly for playing games. The ones that are useful require a million lines of code with no libs

misty flint
#

isnt RL used heavily in robotics?

velvet thorn
misty flint
#

thats the only one ik ID_BoomKek

grave frost
#

evolution

iron basalt
#

@grave frost Reality is only partially observable.

velvet thorn
misty flint
#

i also recently learned about that

#

very interesting way

grave frost
misty flint
#

of doing things

grave frost
#

carykh does that kinda things - simple RL algo for simulating animals, development and such

exotic maple
#

Is there a way to add inline code in discord?

misty flint
#

backticks

velvet thorn
#

`code` -> code

grave frost
#

Haha 🤣

iron basalt
#

@grave frost One thing to note is that RL is normal ML. RL, supervised, unsupervised are all ML, just supervised is typically the most immediately applicable one.

iron basalt
#

I'm pretty sure that is just kind of set in stone, but ok

velvet thorn
misty flint
#

what else would Reinforcement Learning be if its not Machine Learning?

#

most places ive read place RL under ML

#

unless you consider it under the umbrella term AI instead

grave frost
iron basalt
grave frost
#

yeah, well I guess its the same 🤷

iron basalt
#

wikipedia machine learning page

misty flint
#

yeah its like the black sheep but it happens

iron basalt
#

RL is arguably the most difficult and would be most essential to creating AI. But that means that few people bother with it because it can't be immediately applied most of the time (which means there is no money in it).

misty flint
#

to creating General AI?

#

you mean?

grave frost
iron basalt
grave frost
iron basalt
#

@grave frost Ideally for AI you need to be able to do RL online in the real world, not in a simulation (like a human).

grave frost
#

at least not commercially

iron basalt
#

It's getting pretty close, just not popularized.

grave frost
iron basalt
#

I don't see too many games with RL.

#

Or AI at all.

#

Game AI is a different thing, it does not learn.

#

Perhaps you mean Dynamic Programming? That is part of ML / a technique that can be used, but it's not RL.

grave frost
grave frost
iron basalt
#

It's the general idea of breaking down a problem into subproblems, solving those, combining those answers and so on.

velvet thorn
iron basalt
#

Yeah some do, but I don't see it too often.

grave frost
iron basalt
#

Game AI usually refers to the traditional stuff like GOAP. Or a chess bot.

grave frost
#

Games like DOTA are way too complex to be broken into anything

iron basalt
#

You mean like how DL was used to make a DOTA bot?

#

Yeah that is a more recent development in game design.

grave frost
#

The model does real-time inferencing and partial learning (IDK how exactly does that work) to study the players patterns and counter

#

maybe partial learning is kinda like supervised fine-tuning? who knows

iron basalt
#

However, the DL bots in DOTA don't learn, they only do inference, which is much easier than doing online learning.

#

(DL requires lots of iterations to learn things so it's unsuitable for online learning)

#

(and other reasons)

grave frost
#

but I don't see why an architecture cannot store the players moves into memory to be interpreted by a part of NN (Like previous timesteps) to predict future moves 🤷 A naive workaround

velvet thorn
#

log what's happening

#

but don't use it for training

#

right then?

grave frost
#

Just an idea bud

#

I don't know what OpenAi has done. But by that, I meant that the model can be trained to learn this new piece of info and try to predict the players moves and adjust accordingly

velvet thorn
#

🥴

grave frost
#

not exactly real-time learning, but you can't expect me to come up with an idea right now

#

🙂

#

honestly, even if the model does better job than human without real-time learning, I doubt it would make much difference on the job to be done

iron basalt
#

The DL bots do prediction, but 1. they do it with the exact positions of the players, and they know where all the players are at all times (much easier than what a human needs to do from screen pixels and limited knowledge of the game state). 2. They don't learn online, they do it later. 3. They often learn to abuse their ability to do superhuman timing (a human needs to slowly process the pixel data, and do a complex learned sequence of muscle commands to do things).

grave frost
#

a human needs to slowly process the pixel data, and do a complex learned sequence of muscle commands to do things
That's such a big insult to reflexology lolo

iron basalt
#

slowly compared to a computer

grave frost
#

Gamers rely on their muscle memory which is the most efficient thing

iron basalt
#

Computers are limited by the speed of light (memory transfer speeds).

iron basalt
#

Modern computers are bottle-necked by the memory speed, not computations on that memory.

#

(see caching)

grave frost
#

When you have to make computations that fast, often multiple devices are required CPU, GPU, RAM, etc. all the time taken to immediately process things adds overheads

#

you can't expect all computations to be placed on 1 device

#

its a colab b/w CPU and GPU (+ RAM)

#

that adds overhead. M1 tries to reduce that by a unified pool of memory (an Idea I very much like) but still, nothings out there for the M1 yet

misty flint
#

well yeah. thats why you have distributed computing but i thought you guys were talking about supercomputers

grave frost
#

The more devices (gpus) you have, the more computations you have to do on which device to place tensors on

#

its not exactly a if/else

iron basalt
#

If you did it distributed the latency would probably climb above 200ms which would make the human the winner.

misty flint
#

so supercomputers only?

grave frost
#

yup. and most models are very complex, so they have to be distributed

iron basalt
#

It's a very different and more difficult task to have an AI actually run and learn online in real time on just one device (like a robot's single (and power consumption limited) processor).

grave frost
#

but it is possible

misty flint
#

online learning amegablobsweats

#

so. much. data

grave frost
#

With AGI ¯_(ツ)_/¯

iron basalt
#

You can also just not use DL and instead use much more efficient methods. DL is computationally inefficient like crazy.

grave frost
#

Gotta go. adios

misty flint
#

bye

iron basalt
#
int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
#

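For those wondering what I meant by memory speed being the limiter: these two for loops take about the same amount of time. The first does 16x more multiplications than the second, but they are both limited by memory speed (the cpu fetches a whole cache line at a time, typically 64 bytes, i.e. 16 4-byte ints, so both loops end up pulling every cache line of the array from memory).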

#

(Memory speed is also a huge issue on GPUs)

misty flint
#

an increment operator. interesting

#

oh interesting

#

i see

iron basalt
#

Yeah I could not use python code, because python is not running on the bare metal, it's really slow and kind of nullifies any speed gains you could have by abusing the ability to do a bunch of computation due to fetching.

misty flint
#

makes sense

iron basalt
#

numpy definitely abuses it (more specifically BLAS, which it just calls).

#

Numpy also uses vector operations so it would fetch let's say 16 ints, and 16 other ints, and then add those 16 ints to the other 16 in 1 operation (as opposed to 16 addition operations / operation level parallelism).

iron basalt
#

Btw, in the past, memory speed was much faster than the actual time it took to do the operations. So in the past, things like linked lists actually made sense. Now everyone uses dense arrays even though you sometimes need to resize them (reallocate memory (slow)).

lapis sequoia
#

cool

#

ok guys

#

If you guys use pycharm

#

then enter this code ok?

#

if you want to send automatic emails

vital ocean
#

What's the codE?

lapis sequoia
#

ok

#

one min

vital ocean
#

It will be more cool 😄

lapis sequoia
#

oh ok

coral ginkgo
#

Tried searching on google but any communities I can join for SQL?

misty flint
#

dunno but you can always ask sql stuff here or #databases

astral path
#

so

#

i have two categorical variables, merchantID and brandID, which have a positive correlation of 0.11. I'm trying to somehow plot the correlation. I've chosen a heatmap for now to show correlation between different merchants based on brandID for my assignment, but I'm also looking for ways to visualize merchantID and brandID in a way that's useful. This is hard because both variables have thousands of categories, so a plot like a scatterplot looks extremely cluttered

#

what should I do to visualize correlation between two categorical variables when they take on so many values?

#

should I just take a subset?
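
One possible way to take that subset, sketched under the assumption that the most frequent categories are the interesting ones: keep only the top `n` values of each variable and cross-tabulate them. The helper name `top_n_crosstab` and the sample rows are made up. The resulting table is small enough to hand to a heatmap function:

```python
import pandas as pd


def top_n_crosstab(df, row_col, col_col, n=20):
    """Contingency table restricted to the n most frequent values of each column."""
    top_rows = df[row_col].value_counts().index[:n]
    top_cols = df[col_col].value_counts().index[:n]
    subset = df[df[row_col].isin(top_rows) & df[col_col].isin(top_cols)]
    return pd.crosstab(subset[row_col], subset[col_col])


# Made-up rows; only the column names come from the chat.
demo = pd.DataFrame(
    {
        "merchantID": [1, 1, 1, 2, 2, 3],
        "brandID": ["a", "a", "b", "b", "a", "c"],
    }
)

table = top_n_crosstab(demo, "merchantID", "brandID", n=2)
print(table)
```

Rare merchants and brands drop out, so the picture only describes the head of the distribution, not the whole dataset.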

junior bane
#

how can I pull data from spreadsheet in ETL?

misty flint
#

out of these 9 subjects, what are the top 5 most important topics for data science? my guess would be 1) Programming, 2) Math for CS, 3) DS&A, 4) Databases, and 5) Distributed Systems

#

what do you guys think

astral path
#

i would think programming, math for CS, databases, algo and data structures, and languages and compilers

misty flint
#

you swapped languages and compilers for distributed systems?

#

hadoop, spark, etc. are so important for data science tho

#

esp for big data

astral path
#

yeah, i think it would be wise to understand the languages and how they work, not very familiar with distributed systems

#

although if you think the other is a better option you should do that

misty flint
#

ugh not enough time, too many things to learn

#

ill add it in as #6

astral path
#

oof nice

#

also i just ended up taking a sample

#

unfortunately, all that stuff earlier with pivoting was for nothing

#

cus the heatmap means nothing

misty flint
#

hmm

#

do you have the dataset

#

or a link

#

let me see what happens when i upload it to tableau

astral path
#

uncleaned version

#

what's tableau?

misty flint
#

just another data viz tool

#

lets see

astral path
#

just a more clean version

misty flint
#

rip just had a power surge gimme a sec

astral path
#

ooooof

misty flint
#

wait why are you comparing these two variables

#

i get brand id corresponds to the brand names

#

what does merchant id represent

astral path
#

merchant id represents the merchant it's sold from

#

farfetch is a website that basically acts as a middle man for luxury boutiques which wouldn't otherwise have the reach to sell their products

#

i'm trying to find a correlation between merchants and brands

#

well

#

i used Cramer's V and found there's a correlation of 0.11 between them
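
For reference, a minimal sketch of how Cramér's V can be computed from a contingency table, using the plain formula V = sqrt(chi2 / (n * (min(rows, cols) - 1))) with no bias correction. The function name `cramers_v` and the toy data are made up, and this is not necessarily how the 0.11 above was obtained:

```python
import numpy as np
import pandas as pd


def cramers_v(a, b):
    """Cramér's V between two categorical Series (uncorrected)."""
    table = pd.crosstab(a, b).to_numpy().astype(float)
    n = table.sum()
    # Expected counts under independence: row total * column total / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1  # assumes each variable has at least two categories
    return float(np.sqrt(chi2 / (n * k)))


merchant = pd.Series([1, 1, 2, 2])
brand_same = pd.Series(["x", "x", "y", "y"])   # brand fully determined by merchant
brand_mixed = pd.Series(["x", "y", "x", "y"])  # brand unrelated to merchant

v_same = cramers_v(merchant, brand_same)
v_mixed = cramers_v(merchant, brand_mixed)
print(v_same, v_mixed)
```

V runs from 0 (no association) to 1 (one variable determines the other), and unlike a correlation coefficient it has no sign.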

misty flint
#

i see

#

also tableau hates your dataset

astral path
#

:(

#

what does it look like

misty flint
#

it treats both as independent variables instead of trying to correlate them

astral path
#

yeesh

misty flint
#

its bc theyre both categorical technically

astral path
#

here's another one with both categorical and numerical variables

misty flint
#

interesting

astral path
#

with line of best fit

#

doesnt look that great tbh

#

with smaller sample

misty flint
#

yeah i tried some things too

#

nothing worked

astral path
#

¯_(ツ)_/¯

uncut bloom
#

is the visualization important or knowing the groupings? I think of this as a simple collaborative filter, then tsne, draw bounds, grab 4 samples from each

misty flint
#

interesting quote

#

But data scientists are kind of like the new Renaissance folks, because data science is inherently multidisciplinary.

This is what leads to the big joke of how a data scientist is someone who knows more stats than a computer programmer and can program better than a statistician. What is this joke saying? It’s saying that a data scientist is someone who knows a little bit about two things.

#

this is the rest of the bit:

But I’d say they know about more than just two things. They also have to know to communicate. They also need to know more than just basic statistics; they’ve got to know probability, combinatorics, calculus, etc. Some visualization chops wouldn’t hurt. They also need to know how to push around data, use databases, and maybe even a little OR. There are a lot of things they need to know. And so it becomes really hard to find these people because they have to have touched a lot of disciplines and they have to be able to speak about their experience intelligently. It’s a tall order for any applicant.

#
  • john foreman from mailchimp
devout scroll
#

Restarting my jupyter kernel does not reset variables. I'm using jupyter notebook in vs code. Does someone know what could cause this behaviour?

astral path
#

how do I call sizes (normally a parameter for the seaborn scatterplot function) in scatter_kws in a regplot function?

stoic hollow
#

which minor stream would be most ideal for data science

silk axle
#

I'm currently using a pandas dataframe with scikit-learn LinearRegression as part of my ML program for predicting student grades:```py
data: pd.DataFrame = pd.read_csv('./data/student-mat.csv', sep=";")
data = data[['sex', 'studytime', 'failures', 'schoolsup', 'paid', 'absences', 'G1', 'G2', 'G3']]
data = data.replace({'F': 0, 'M': 1, "no": 0, "yes": 1})

to_predict = "G3"
X = np.array(data.drop(columns=[to_predict]))
y = np.array(data[to_predict])

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)

linear = linear_model.LinearRegression()
linear.fit(X_train, y_train)```How can I implement certain rules to the `linear.fit()`?

For more context, I have two previous test scores ("G1" and "G2", values of 0-20) of each student where 0 means they didn't take the test. I need to implement logic so that when both scores are 0, it'll predict 0 (since it can't make a prediction), when one score is 0 it'll ignore that score and predict on the other score, and when neither are 0 it'll just do as normal --- I need to ignore the scores that are 0 since it means the student didn't take the test and so I can't predict a test score based off it
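
One possible way to get the rules described above is to keep them outside `linear.fit()`: fit one regression per case (both scores present, only G1, only G2) and choose between them at prediction time. This is only a sketch under assumptions: the data is synthetic rather than `student-mat.csv`, only `G1` and `G2` are used as features, and `fit_rule_models` and `predict_with_rules` are hypothetical helper names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def fit_rule_models(g1, g2, g3):
    """Fit one model per case so a score of 0 (test not taken) is never a feature."""
    has1, has2 = g1 > 0, g2 > 0
    both = has1 & has2
    return {
        "both": LinearRegression().fit(np.column_stack([g1[both], g2[both]]), g3[both]),
        "g1": LinearRegression().fit(g1[has1].reshape(-1, 1), g3[has1]),
        "g2": LinearRegression().fit(g2[has2].reshape(-1, 1), g3[has2]),
    }


def predict_with_rules(models, g1, g2):
    """Apply the rules: no scores -> 0, one score -> single-feature model."""
    if g1 == 0 and g2 == 0:
        return 0.0
    if g1 == 0:
        return float(models["g2"].predict([[g2]])[0])
    if g2 == 0:
        return float(models["g1"].predict([[g1]])[0])
    return float(models["both"].predict([[g1, g2]])[0])


# Synthetic stand-in data (an assumption, not the real dataset).
rng = np.random.default_rng(0)
n = 200
g1 = rng.integers(1, 21, size=n).astype(float)
g2 = np.clip(g1 + rng.integers(-2, 3, size=n), 1, 20)
g3 = np.clip(0.3 * g1 + 0.7 * g2 + rng.normal(0, 1, size=n), 0, 20)
g1[rng.random(n) < 0.1] = 0  # mark roughly 10% of first tests as not taken
g2[rng.random(n) < 0.1] = 0  # same for the second test

models = fit_rule_models(g1, g2, g3)
print(predict_with_rules(models, 0, 0))
print(predict_with_rules(models, 12, 0))
print(predict_with_rules(models, 12, 13))
```

The trade-off is three small models instead of one, but a missing test never gets treated as a real score of 0.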

earnest ibex
#

Hi Guys!!

heady tide
#

What is the best platform for training NNs on the cloud ?

lilac geyser
#

Hello
Recently I was going through Hypothesis testing.
What I understood after listening to the introduction to hypothesis testing is.

So basically hypothesis testing is nothing but.
When we take the sample data from population and try predicting the population parameters. Whatever we get the population parameters from the sample data will be judged whether to reject or not...
This process is known as hypothesis testing.
Is my understanding correct?
Please help me!

#

Please @ me🙏

solar phoenix
#

Does anyone know how i can change the name of a dataframe to an item i have in a list

#

I have 6 dataframes and 6 items in my list

#

and i want to name the dataframes those items
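
A DataFrame has no name of its own, so one common approach is to key the frames by the list items in a dict. A minimal sketch, assuming the names and frames below (both made up for illustration):

```python
import pandas as pd

# Hypothetical names and frames; in practice these come from the existing list.
names = ["jan", "feb", "mar", "apr", "may", "jun"]
frames = [pd.DataFrame({"value": [i, i + 1]}) for i in range(len(names))]

# Pair each name with its frame; lookup is then by name instead of by variable.
named = dict(zip(names, frames))

print(named["feb"])
```

This avoids creating variables dynamically, and looping over `named.items()` gives the name and the frame together.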

wild dome
#

Trying to convert a Jupyter notebook to PDF throws the following trace error:

#

pls help I just want a PDF hahaha :(

lapis sequoia
#

wassap

#

u guys use R?

velvet thorn
#

you know this is a Python channel, right

lapis sequoia
#

oh right tho i was gonna ask something about the relationship of R and python

velvet thorn
#

sure

#

go ahead

lapis sequoia
#

like can i put some r files in my python program? or hmmm?

lapis sequoia
#

like lets say I have a bunch of data and stats and graphs in my r script, and some functions like which and i wanna use that in my py code like can i do that?

velvet thorn
#

I mean, you could put them in

#

if you expect to be able to run them

#

that's more complicated

lapis sequoia
#

aaah hmmm

#

ok ok

velvet thorn
#

look into foreign function interfaces

proper swift
#

Hi I have a question, can anyone help with running a Linear Regression in Python? I have a dataset that has 5 categorical variables and 1 dependent variable, and im a little bit confused on how best to do this. Before learning Python, i did this sort of stuff in SPSS
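
For a question like this, one usual route is to one-hot encode the categorical columns and fit an ordinary linear regression on the dummy columns. A minimal sketch, assuming a tiny made-up dataset with two categorical predictors instead of five; the column names `region`, `plan` and `score` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only.
df = pd.DataFrame({
    "region": ["north", "south", "south", "east", "north", "east"],
    "plan": ["basic", "pro", "basic", "pro", "pro", "basic"],
    "score": [10.0, 15.0, 12.0, 18.0, 14.0, 13.0],
})

# drop_first=True drops one level per variable so it acts as the reference category.
X = pd.get_dummies(df[["region", "plan"]], drop_first=True, dtype=float)
y = df["score"]

model = LinearRegression().fit(X, y)

# Each coefficient is the shift relative to the dropped reference level.
coefs = dict(zip(X.columns, model.coef_))
print(coefs)
print(model.intercept_)
```

If SPSS-style output with p-values is needed, a statistics-oriented package may be a better fit than scikit-learn, which only reports the fitted coefficients.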

wild dome
#

do I have to uninstall Python and reinstall it for all users???

#

or guide me if this is not the channel for this question

serene scaffold
#

I have a bunch of dataframes like this:

tag,precision,recall,f1
ADE,0.106,0.062,0.078
Dosage,0.788,0.804,0.796
Drug,0.534,0.655,0.588
Duration,0.674,0.609,0.640
Form,0.800,0.845,0.822
Frequency,0.668,0.725,0.695
Reason,0.250,0.259,0.254
Route,0.759,0.730,0.745
Strength,0.767,0.828,0.796

I want to name each dataframe and get the "argmax" of all the frames. So if dataframe A has the highest value for (Form, recall), I want that cell to be 'A' in the resulting frame. I heard it suggested that one use a multiindex but it's not clear to me from the docs how it could be used for that.
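
One way the multiindex idea could work: concatenate the frames side by side so the columns become (frame name, metric) pairs, then take `idxmax` across frames for each metric. A sketch under assumptions: frame `A` reuses three rows from the table above, frame `B` is made up, and both are indexed by `tag`.

```python
import pandas as pd

tags = ["ADE", "Dosage", "Form"]
frames = {
    "A": pd.DataFrame({"precision": [0.106, 0.788, 0.800],
                       "recall": [0.062, 0.804, 0.845]}, index=tags),
    # Frame B is invented purely for comparison.
    "B": pd.DataFrame({"precision": [0.200, 0.700, 0.810],
                       "recall": [0.050, 0.850, 0.800]}, index=tags),
}

# Dict keys become the outer column level: columns are (frame name, metric).
combined = pd.concat(frames, axis=1)

metrics = combined.columns.get_level_values(1).unique()
best = pd.DataFrame({
    # xs selects one metric across all frames; idxmax returns the winning frame name.
    metric: combined.xs(metric, axis=1, level=1).idxmax(axis=1)
    for metric in metrics
})

print(best)
```

Ties go to whichever frame comes first, since `idxmax` returns the first occurrence of the maximum.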

short heart
#

need help with tensorflow ASAP

serene scaffold
short heart
#

yeah hold on

#

what is train_function error

#

Function call stack:
train_function

abstract zealot
#

Anyone here familiar with chi square goodness of fit?

#

I have a problem with expected values that im generating that are far too small

#

basically im generating these values:```py
val = np.array([abs(int(e)) for e in norm.rvs(loc=1800, scale=2000, size=25, random_state=144)])```

#

Ill send you the rest of the code if you can help 😄

grave frost
#

@short heart post the whole traceback

carmine iron
#

is there any finance / quant here? Having trouble understanding what a holdings vector is.
Given a ticker of one of the stocks, compute the holdings vector h ∈ R^3 for the unique stock portfolio that is both dollar and beta neutral and has unit exposure to the specified stock

verbal jetty
#

im completely new to scripts, so maybe im even in the wrong channel. but after installing the script of: https://github.com/andrewning/sortphotos im not able to run it, and i dont even know where to start

severe python
#

@carmine iron have never heard of a "holdings vector" before

#

@iron basalt you there? want to bounce some ideas off you

carmine iron
#

@severe python yeah me neither...i think it has to do with covariance matrix and optimizing the portfolio for each beta under/over 1

#

but honestly i could be way off. its all linear algebra

severe python
carmine iron
#

well even with that understanding, i am still stuck on how to proceed. All i have is a df with date index, benchmark of SPX, then three unique stocks

quaint kelp
#

Has anybody here encountered this error?:

severe python
#

hmm, so you're not constructing a portfolio, you're evaluating a company's exposure to USD and their beta in general? @carmine iron

carmine iron
#

SPX is the proxy, it will be a portfolio of the three stocks given not including SPX. After the holdings vector is found, i need to calculate daily PnL

#

i believe the exposure should be to SPX

#

unit_exposure is an argument in the function
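
One plausible reading of the task, offered as an assumption rather than the assignment's official definition: the holdings vector h is the dollar amount held in each of the three stocks, and the three conditions form a 3x3 linear system. Dollar neutral means the holdings sum to 0, beta neutral means the beta-weighted holdings sum to 0, and unit exposure means the holding in the specified stock equals `unit_exposure`. The betas and returns below are made up; in practice the betas would be estimated against SPX.

```python
import numpy as np


def holdings_vector(betas, target, unit_exposure=1.0):
    """Solve for h with sum(h) = 0, betas @ h = 0 and h[target] = unit_exposure."""
    betas = np.asarray(betas, dtype=float)
    pick = np.zeros_like(betas)
    pick[target] = 1.0
    A = np.vstack([np.ones_like(betas), betas, pick])  # one row per constraint
    b = np.array([0.0, 0.0, unit_exposure])
    return np.linalg.solve(A, b)


betas = [1.2, 0.8, 1.0]            # made-up betas versus SPX
h = holdings_vector(betas, target=0)

# Made-up daily returns, one row per day and one column per stock.
daily_returns = np.array([[0.01, -0.02, 0.005],
                          [0.00, 0.01, -0.01]])
daily_pnl = daily_returns @ h      # PnL per day for these holdings

print(h)
print(daily_pnl)
```

The system only has a unique solution when the other two stocks have different betas; otherwise `np.linalg.solve` raises `LinAlgError`.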

severe python
#

ah i see, i understand the concept but can't really apply it to python because i'm relatively new to it

carmine iron
#

@severe python lets work together! whats the concept.

#

even if talking in terms of excel, i dont have too much time left to figure this one out

quaint kelp
severe python
#

then you could filter with IF beta over/under 1 then .... whatever. not sure if you are looking for currency exposure as well but could do something similar would have to google. not sure that any of that helps

carmine iron
#

thanks that does!

simple torrent
#

i can't figure out how to run spark on jupyter notebook. Please help

lapis sequoia
#

hello i am new to data science and i am learning it

#

any tips i can use ?

hollow sentinel
#

Kaggle

lapis sequoia
#

and ?

misty flint
lapis sequoia
#

thanks

misty flint
#

np

ripe forge
serene scaffold
ripe forge
#

Does the assumption about the number of rows and the tag column hold?

serene scaffold
restive obsidian
#

how to get row count in pandas with chunksize?
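
With `chunksize`, `pd.read_csv` returns an iterator of DataFrames, so one simple option is to sum the chunk lengths. A minimal sketch, assuming an in-memory CSV in place of the real file:

```python
import io

import pandas as pd

# In-memory CSV with 10 data rows, standing in for a large file on disk.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# Each chunk is a DataFrame with at most 4 rows; only one is held in memory at a time.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)
row_count = sum(len(chunk) for chunk in reader)

print(row_count)  # 10
```

The reader is consumed by this loop, so the file has to be reopened if the chunks are needed again for processing.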

ripe forge
#

Ah then yeah, my knee jerk reaction is to make a 3d array

#

Shape (num_dataframes, rows, cols) and then just freely use numpy operations after slicing for a single row
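
The 3D-array idea can be sketched roughly like this, assuming every frame has the same rows and columns in the same order. Frame `A` reuses three rows from the earlier table and frame `B` is made up.

```python
import numpy as np
import pandas as pd

tags = ["ADE", "Dosage", "Form"]
cols = ["precision", "recall"]
frames = {
    "A": pd.DataFrame([[0.106, 0.062], [0.788, 0.804], [0.800, 0.845]],
                      index=tags, columns=cols),
    # Frame B is invented purely for comparison.
    "B": pd.DataFrame([[0.200, 0.050], [0.700, 0.850], [0.810, 0.800]],
                      index=tags, columns=cols),
}

names = list(frames)
# Stack into shape (num_dataframes, rows, cols).
cube = np.stack([frames[name].to_numpy() for name in names])

# argmax over axis 0 gives, for every cell, the position of the winning frame.
winners = np.array(names)[cube.argmax(axis=0)]
result = pd.DataFrame(winners, index=tags, columns=cols)

print(result)
```

This relies on the frames lining up exactly; if the tag order might differ, reindexing each frame to a shared index before stacking would be safer.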

rare ice
#

I am using Apache Spark (specifically, pyspark) for some data processing. I noticed that syntax for "case when" and for "when otherwise" is very different and they are used differently. Can someone explain the pros/cons of each method? Thanks!

#

It looks like the "case when" approach is potentially harder to use and debug because you are writing a big expression string

real wigeon
#

hey folks, maybe you can chime in when available. I'm using sqlalchemy to pull some data with `data = db.session.query`, with a whole bunch of questions, a few of which contain datetime values

#

i then go on to drop that data in a pandas df, so i can prep it for conversion to xlsx

#

df = pd.DataFrame(data, columns=['upload_timestamp',