#data-science-and-ml
I don't understand how that is an issue, also it's still sorted after cutting out the non-us rows.
all i know is that python is disagreeing with me trying to edit values on index, which is a copy of ufos
anyone know how I can format multiple datetime columns in pandas?
```
df['WPCBDATO'] = df['WPCBDATO'].dt.strftime('%d-%m-%Y')
df['WPCIDATO'] = df['WPCIDATO'].dt.strftime('%d-%m-%Y')
df['WPCPLSLUT'] = df['WPCPLSLUT'].dt.strftime('%d-%m-%Y')
```
already tried df[[]]
pd.to_datetime i'm pretty sure
gonna try again, pretty sure it's not working
seems odd if I can't format multiple columns at once
```
df[['WPCBDATO', 'WPCIDATO', 'WPCPLSLUT']] = pd.to_datetime(df[['WPCBDATO', 'WPCIDATO', 'WPCPLSLUT']], format='%d-%m-%Y')
```
returns ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
.apply(pd.to_datetime, format='') did the trick it seems
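For reference, a minimal sketch of the per-column approach that ended up working. The sample values and the input format string are assumptions made up for illustration; only the three column names come from the messages above.

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the chat.
df = pd.DataFrame({
    "WPCBDATO": ["2021-01-05", "2021-02-10"],
    "WPCIDATO": ["2021-03-15", "2021-04-20"],
    "WPCPLSLUT": ["2021-05-25", "2021-06-30"],
})
date_cols = ["WPCBDATO", "WPCIDATO", "WPCPLSLUT"]

# pd.to_datetime on a whole DataFrame tries to assemble ONE date from
# year/month/day columns, hence the ValueError. Apply it per column instead.
df[date_cols] = df[date_cols].apply(pd.to_datetime, format="%Y-%m-%d")

# The .dt accessor only exists on a Series, so format column by column too.
formatted = df[date_cols].apply(lambda col: col.dt.strftime("%d-%m-%Y"))
print(formatted)
```

Note that `strftime` turns the columns back into strings, so it is probably best kept as the last step before display or export.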
now you have 
yup
column = ufos.loc[ufos['country'] == 'us', 'datetime']
for i, val in enumerate(column):
    column[i] = val.split()[1]
@analog pike
Well there's no more error
but it's giving me the half with the time not the date
okay nevermind
i switched the 1 to 0
yea the format on https://www.kaggle.com/NUFORC/ufo-sightings?select=scrubbed.csv only has two parts
date and time
wdym
split()[0] is the first part which is what you want
alright thanks so much
(side note: split without any arguments passed to it splits by all whitespace, so it works even with multiple spaces in-between, or new lines or tabs)
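A possible vectorised version of the same idea, as a sketch only. The rows below are invented to resemble the `datetime` column being discussed, and writing back through `.loc` is one way to avoid editing a copy.

```python
import pandas as pd

# Hypothetical rows shaped like the "datetime" column discussed above.
ufos = pd.DataFrame({
    "country": ["us", "gb", "us"],
    "datetime": ["10/10/1949 20:30", "10/10/1955 17:00", "10/10/1956   21:00"],
})

us_rows = ufos["country"] == "us"
# .str.split() with no arguments splits on any run of whitespace;
# .str[0] keeps the first piece (the date), .str[1] would be the time.
dates = ufos.loc[us_rows, "datetime"].str.split().str[0]

# Writing back through .loc on the original frame edits ufos itself,
# rather than a temporary copy.
ufos.loc[us_rows, "datetime"] = dates
print(ufos)
```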
nice
Quick question - I am doing Hyperparameter search for my classification model. Is it reasonable to expect that if a model attains higher accuracy (or lower loss) in 5 epochs against another iteration of parameters then it would also attain higher accuracy after full training (say, like 15 epochs)?
I'm pretty sure that if you put too many epochs, the model will overfit and the loss will go up.
so you'll have to have a balance
not necessarily
I am fine-tuning
example
let’s say the hyperparameter is learning rate
and weight initialisation is the same
you might get near a local minimum faster, but you might also be less likely to reach it exactly
with a higher learning rate
Aha.. good point
Looks like I'm gonna have to keep it for like 20 hours
Thanx a lot @velvet thorn 🚀
yw 👋
If I have a column in my dataset which contains short string descriptions using keywords, how could I include that in a heatmap/correlogram to show relationships between the keyword and other variables? e.g. I could use this to find that, for example, descriptions that contain the word "red" and "dress" have a smaller value in a column called stock than a description that includes "green" and "bag"
example of data
create columns representing the presence of words
i.e. one-hot encoding
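A rough sketch of that suggestion, assuming a text column called `description` and a numeric column called `stock` (the data itself is made up):

```python
import pandas as pd

# Hypothetical data: a free-text description column and a numeric "stock" column.
df = pd.DataFrame({
    "description": ["red dress", "green bag", "red bag", "green dress", "red dress"],
    "stock": [2, 9, 5, 7, 1],
})

# One 0/1 column per keyword: 1 if the word occurs in the description.
keywords = df["description"].str.get_dummies(sep=" ")

# Correlate each keyword indicator with the numeric column.
corr = keywords.join(df["stock"]).corr()["stock"].drop("stock")
print(corr)
```

The full matrix from `keywords.join(df["stock"]).corr()` is the kind of table a heatmap function expects. Looking at word combinations ("red" and "dress" together) would need an extra column built from the product of the two indicators.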
how can i find a degree of similarity between two topics obtained with lda in different documents?
I have a dataframe called data with two columns.
print(data.dtypes) yields:
data_in_datetimeformat datetime64[ns]
data_in_float64_format float64
dtype: object
What does "dtype: object" mean? It isn't one of my columns as far as I know.
any data type in python is object based
its considered as an object from a class
do u get me ?
Is it listing each dataframe's columns' data types and then also saying that "dtype" itself is an object?
yes data frame columns are objects as well i think you can define them as series and give it to them or a list
from panda package
I mean, I don't understand why "dtype: object" is part of the output of print(data.dtypes)
for example dataframe is a class or bigger object which contains smaller objects which are columns
That makes sense, yeah
when python for example tells us something data type is list its origin is a list [] the brackets for example define that you tell the compiler i will have just a number of objects
thats why the list could have different data types
its not like array in c++
I think I understand what you are saying
So if a pandas array is composed of columns all themselves composed of the same dataype, pandas.dtypes might return something other than "dtype: object"?
Never mind. The output of the function seems like something I don't really need to understand right now
The important part to me was being able to identify the datatype being used for each column
python has the best list than any other language imo because they allow for multiple datatypes
like gm said one-hot encoding or other recoding methods
scikit learn has the OrdinalEncoder function too. thats the one i used recently
@digital crescent no
because that’s a series
with string values
representing the data types
and strings are objects
(although that seems like it’s a bit outdated because there is a specific string dtype now)
it’s because Python is dynamically typed
that has its own drawbacks.
it’s not necessarily better
this is true but not the point
actually the contents could be type objects and not strings
which would explain it
you can check
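One way to check, sketched with a small stand-in for the `data` frame from the question (the values are invented):

```python
import pandas as pd

data = pd.DataFrame({
    "data_in_datetimeformat": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "data_in_float64_format": [1.5, 2.5],
})

dtypes = data.dtypes
print(type(dtypes))          # the container: a pandas Series indexed by column name
print(dtypes.dtype)          # the dtype of that Series itself
print(type(dtypes.iloc[0]))  # the type of one value stored in it
```

The trailing `dtype: object` line in the printed output describes the Series returned by `data.dtypes`, not a column of `data`.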
Gotcha. So a pandas dataframe with 2 data columns essentially has an index column, the 2 data columns, a series with values representing the 2 data columns' data types, and everything else a dataframe object would have, right?
well, the dtypes series is dynamically generated, I THINK?
I'm not sure. Either way "series/dynamic generator" 🙂
guys, I have this tweets data set i am trying to clean. I want to remove all words that begin with @ from a text column and then drop the rows that have the same texts after the above process. I have the following code but it isn't working
df['clean_text']=df['text'].str.replace('(@\w+.*?)',"")
df = df.drop_duplicates(subset = ['clean_text', 'username'])
can someone help me? Thanks!
any suggestions on how to remove words that begin with @ or how to modify the above regex? I tried printing the dataframe to the console and it appears to have removed the @sub_strings from the texts, but the data frame isn't dropping the duplicates
regex's are above my paygrade, sorry 
regex appears to be working:
Hak's regex is working. I think the subset is the problem. Specifically, using the username column which I don't actually see... Probably removing "username" would work?
No prob. If it's an issue, just delete that post and let me know if the usernames are the same. I.e. for privacy reasons, you should prob delete that post : )
the usernames are the same. I want to delete rows with the same usernames and the same clean_text values
Hi guys I’m trying to figure out a good project to do for my data science portfolio
Having a hard time coming up with a research question
Maybe try adding inplace=True in drop_duplicates? Scratch that. You're returning df
yes
Herm I'm not entirely sure. I would just try to take those two records and compare each string/list using == operator to double check. And then try again. If they are truly equal in Python, then try doing that inplace argument I suppose. Not really my field of expertise since I had my own question to ask, but hopefully testing out a bunch of edge cases should help resolve your issue.
ok thanks! I will try it out. I tried inplace, but that didn't work. Probably will need to compare each string and check what is happening
```
for i in range(100):
    loss = mse(model(inputs), targets)
    loss.backward()
    with torch.no_grad():
        weights -= weights*1e5
        bias -= bias*1e5
        weights.grad.zero_()
        bias.grad.zero_()
```
Can someone have a look at whether the weights and bias are used properly?
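For comparison, here is a NumPy sketch of plain gradient descent on a mean squared error, with the gradients written out by hand. It is not the PyTorch code above: the data, shapes, and learning rate are assumptions. The point it illustrates is that the update subtracts the gradient times a small learning rate, where the snippet above subtracts the parameter times `1e5`. In the PyTorch loop the corresponding lines would presumably read `weights -= weights.grad * 1e-5` and `bias -= bias.grad * 1e-5`.

```python
import numpy as np

# Hypothetical toy data: 5 samples, 3 input features, 2 targets.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(5, 3))
targets = rng.normal(size=(5, 2))
weights = rng.normal(size=(2, 3))
bias = np.zeros(2)
lr = 1e-2  # small step size; a factor like 1e5 would make every update explode

def mse(pred, target):
    return np.mean((pred - target) ** 2)

first_loss = mse(inputs @ weights.T + bias, targets)
for _ in range(100):
    pred = inputs @ weights.T + bias
    diff = pred - targets
    # Gradients of the mean squared error with respect to weights and bias.
    grad_w = 2 * diff.T @ inputs / diff.size
    grad_b = 2 * diff.sum(axis=0) / diff.size
    # Step against the gradient, not against the parameter value itself.
    weights -= lr * grad_w
    bias -= lr * grad_b
last_loss = mse(inputs @ weights.T + bias, targets)
print(first_loss, last_loss)
```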
do something in an industry that interests you
if you dont like it, you wont finish the project
sure why not
is that an AND
or an OR
AND
my best guess is that you have spaces
yeah, i tried updating the regex to (@\w+\s*) so spaces following the @substrings are stripped. But that didn't drop the duplicates as well.
I figured out that the strings weren't unique because there were exactly the same texts with different @s, hashtags and urls for some reason. So I had to remove all of them and then the drop duplicates worked
some user spammed the exact same tweet multiple times with different @s, hashtags and urls multiple times on different days and was messing up my eda
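A sketch of the fix described above, under the assumption that mentions, hashtags and URLs can be matched with the simple patterns below. The tweets are invented, and `pattern` is only an illustration, not the regex that was actually used.

```python
import pandas as pd

# Hypothetical tweets: the same text spammed with different mentions, hashtags and URLs.
df = pd.DataFrame({
    "username": ["bob", "bob", "amy"],
    "text": [
        "@ann buy now #deal https://t.co/abc",
        "@joe  buy now #sale https://t.co/xyz",
        "@ann buy now",
    ],
})

# Remove @mentions, #hashtags and URLs, then normalise whitespace so that
# leftover spaces do not make otherwise identical texts look different.
pattern = r"@\w+|#\w+|https?://\S+"
df["clean_text"] = (
    df["text"]
    .str.replace(pattern, "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Rows count as duplicates only if BOTH clean_text AND username match.
deduped = df.drop_duplicates(subset=["clean_text", "username"])
print(deduped)
```

Passing `regex=True` explicitly is deliberate: the default for `str.replace` differs between pandas versions.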
sounds like a bot

are you doing some sentiment analysis or something?
i need to do a project involving twitter api sometime
I swear to god the guy was a real person, i personally verified it
I used twint and other scrapers to collect data.
i'm using a seaborn distplot to visualize the distribution of one of my columns, but it's extremely distorted because there are some values (which I'm not sure I want to leave out) that are extremely far away from the others
it looks like this now
there's 301 of these outlier values with a mean of ~10220 and an std of 5544, so idk if i should remove them
what do you think I should do?
same thing happens with another column
it seems like i just have some datapoints which are outliers in all variables
i've tried using sb.distplot(plot_df[plot_df['stock'] < 100]) to get all items under 100 (as a test case) and it changed nothing at all...
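One thing that may be worth checking in that call: `plot_df[plot_df['stock'] < 100]` filters the rows but still passes the whole DataFrame, not the `stock` column. A small pandas-only sketch with made-up numbers, showing a filtered column plus a quantile-based split as an alternative to a hand-picked threshold:

```python
import pandas as pd

# Hypothetical data: mostly small stock values plus a few extreme ones.
plot_df = pd.DataFrame({"stock": [3, 8, 12, 40, 75, 9000, 15000]})

# Filter the rows AND pick the single column to plot.
small = plot_df.loc[plot_df["stock"] < 100, "stock"]

# Alternative: cut at a quantile instead of a hand-picked threshold.
cutoff = plot_df["stock"].quantile(0.70)
typical = plot_df.loc[plot_df["stock"] <= cutoff, "stock"]
outliers = plot_df.loc[plot_df["stock"] > cutoff, "stock"]
print(small.tolist(), typical.tolist(), outliers.tolist())
```

`small` or `typical` can then be handed to the plotting function, and `outliers` plotted separately if they turn out to be their own subcategory.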
oh smart
i would want to work with twitter api only so that i can say ive worked with it
but in practicality its annoying
lol
big yikes

maybe split the dataset?
if its that many outliers in that region, seems like a dif. subcategory
i guess i could try that
honestly idk what youre supposed to do in that case
thats just what i would do
lol
This is a really simple question
But what’s the best method for making an api call
And why is it necessary?
i think i'll just ask my TA
/python
hey everyone
i have a question about data types
so i have this dataset and height and weight are all integers, are these two attributes continuous data or discrete data?
i feel like height and weight should be continuous data, but in this case where all the entries are integers, are they discrete?
hey i'm new to the server can anyone link me some resources to get started with data science? (like some projects or bootcamp idk).
I have a basic knowledge of python, but i don't know where to find the resources to learn more... thank you in advance 🙂
I think there is a way to trigger a bot to list all the resources for DS
@dawn turtle what about an attribute that only takes 0 or 1. 1 being true and 0 being false. Is this categorical data or numerical (discrete) data?
If they represent true and false, it's categorical. It's about the thing that the data is representing, not how it is measured
@slender radish
@dawn turtle okie thanks so much!
oh? how? i dont think ive seen that before
really? what is it?
The requests library is a great and simple tool for API calls
Hi, hopefully this question isn't too taboo. I work with R a lot and am trying to relearn Python as more and more jobs are using Python (plus a personal interest in it). I've been learning how to use Pandas, matplotlib, numpy etc. Does anyone have any good resources/suggestions on how to get the hang of Python's syntax coming from R?
Anyone interested in chatting approaches to text classification? I am not familiar with data science or python really, but I was a software developer for a few years and have a degree in computer engineering, so I'm sure I can keep up with a convo. I have a specific use case I'm trying to determine whether to go third party, hire a junior developer to work to build something or build it myself
Anyway, DM me if anyone is interested
There are still some basics of ML that are more mathematical than what you see in normal CS, but I still think you would be able to make a decent model on your own
we are here to help in case of any problem 🙂 but if you want, freelance is always there
The Embedding module of PyTorch is very simple and extremely powerful at the same time. If you are interested in how to deal with representing tokens of a language or encoding categorical variables do not miss this video:)
In this video, I will talk about the Embedding module of PyTorch. It has a lot of applications in the Natural language processing field and also when working with categorical variables. I will explain some of its functionalities like the padding index and maximum norm. In the second part of this video I will use the Embedding module to represent...
would anyone be able to hop on a zoom call with me?
I'm having trouble making a request for an api
would binary yes/no variables be considered categorial?
e.g. a variable called inStock holds a value 1 if the item is in stock, 0 if not
Hello, so basically I want to visualise the cumulative data of a specific country using pandas, but no matter what I can't get it to work. I am able to get the cumulative data of every country. This is my data set: https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv
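A possible approach, sketched on a tiny hand-made table that imitates the layout of that CSV (four metadata columns, then one column per date holding running totals). The numbers are invented and `country` is just an example value.

```python
import pandas as pd

# Hypothetical miniature of the CSV layout: one row per region, one column
# per date, and each date column already holds the running (cumulative) total.
deaths = pd.DataFrame({
    "Province/State": [None, "Hubei", "Beijing"],
    "Country/Region": ["Italy", "China", "China"],
    "Lat": [41.9, 30.9, 40.2],
    "Long": [12.6, 112.3, 116.4],
    "1/22/20": [0, 17, 0],
    "1/23/20": [0, 17, 0],
    "1/24/20": [0, 24, 1],
})

country = "China"
date_cols = deaths.columns[4:]  # everything after the four metadata columns

# Keep only the chosen country, then add up its provinces for each date.
series = deaths.loc[deaths["Country/Region"] == country, date_cols].sum()
series.index = pd.to_datetime(series.index, format="%m/%d/%y")
print(series)
```

Calling `series.plot()` on the result should then give one line for the chosen country.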
this is probably a dumb question, but how is this 4 dimensions? and what would the shape even be?
in this case, “dimension” means “axis length”
There is the dimensions of the shape, and then there is dimension as it refers to the total number of components.
and mathematicians
actually use the term differently
so in coding
if we say “3D array” (3 dimensional)
we mean an array that requires 3 levels of indexing
to hit a scalar value
in mathematics, this is more commonly referred to as a rank 3 tensor
if you had a vector
let’s say representing a point on an x, y plot
that would be a rank 1 tensor
with 2 dimensions (values, namely x and y)
all that said
the programming usage is usually more common, but you should know the mathematical usage if you're doing ML
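The two usages side by side in NumPy, as a small illustration (the arrays are arbitrary):

```python
import numpy as np

point = np.array([3.0, 4.0])   # a vector holding x and y
cube = np.zeros((2, 3, 4))     # needs three indices to reach one number

# Programming usage: number of axes (levels of indexing).
print(point.ndim, cube.ndim)
# Length of each axis.
print(point.shape, cube.shape)
# Mathematical usage for the vector: how many components it has.
print(point.size)
# Three levels of indexing to hit a scalar in the "3D array".
print(cube[1, 2, 3])
```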
I'm being pedantic here, but tensor as used in ML is a programmer term. In the same way that convolutions are not actually used in ML, they just incorrectly use / borrow the term.
Though it's so far spread the misuse that it's more or less the accepted use now.
that happens a lot in programming
cf. functor
🥴
True. Though my favorite nonsensical term in programming is Dynamic Programming. It was named that way to disguise the fact that Bellman was doing mathematical research, because the US Secretary of Defense at the time had a pathological fear of the word "research", and to make it sound so cool that no congressman could object to it.
I think explaining that difference in dimension usage with number of indices is nice, i'm gonna borrow that.
What’s the best way to convert a json into a data frame
Depends on the format of the json.
So basically I requested an api
And want to take the json and convert it to a data frame
So that I can do some analysis
nice
I don't really think of it as mathematicians using the term differently, more so that non-mathematicians are not specific enough when they refer to dimension
Take for example, a 2x2 matrix. What is the dimension? Are you referring to the dimension of the column space? Or is this over the vector space of 2x2 matrices over a binary operation like addition?
It is especially unclear in machine learning, where there is usually no isomorphism between the row and column space. A mathematician would be more likely to specify exactly what they mean, while a non-mathematician is more likely to brush it under the rug because there is a contextually obvious answer without having to include possibly incorrect mathematical rigor
in a programming context?
you wouldn't ask that question
you would say "what is the length of <this> axis?"
furthermore
"matrix" is a mathematical abstraction
you can have a 2D array representing that
playing a game and my queue popped so I won't be able to reply for a while if necessary
again, depends on the JSON.
if it's nested you probably won't be able to convert it to a DF as easily
because dataframes inherently deal with tabular data
and then
That’s as far as I got
It seems like that api wasn’t a good example
Yeah it didn’t come out the way I expected
I thought I was getting season statistics for the nba
Let me try with a different api and see the result
Thank you though @velvet thorn
🥴 you're welcome but I didn't do much
I just found that the machineLearning course for my university uses matlab, i wanted to do python with tensorflow+keras and stuff should i drop it and learn supervised/unsupervised/clustering my self or do the course anyway? im at the end of my diploma and done all the major stuff i wanted.
what u think?
@velvet thorn do you have any suggestions in terms of api usage
Because I thought I was getting a large dataset and ended up with this
is there a way to use max for a list and ignoring strings found in the list
tbh
it just sounds to me
like you used the wrong API
🥴
maybe look through the docs
?
if there are docs
yes
but why do you have strings in the list
like you can do it but I would suggest you filter first
like i want to get the index of the max number in a list and print the item of that index in another list then move on to the second max number etc..
what im trying to do is replace that max num after i use it with a string so it doesn't count
if i replace it with zero it gets mixed with other nums
oh i could just replace it with -1 since the min num possible is zero
do what you want dude. you can either a) do ML with matlab or b) not and learn tf/keras on your own or c) learn both methods of doing it. just depends on your endgoal.
ML Engineer? youll probs want that matlab experience. Just a Data Scientist? tf/keras will probs be sufficient
or if you think youre NOT disciplined enough to learn on your own/you want structure, i would do the ML course
@velvet thorn so I used a different api and got this result. Now I want to take this and make it a dataframe
doesn't look flat
do you know what "flat" means in this context?
hm
what I would suggest instead is
are you familiar with the concept of "argsort"
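In case the term is new: "argsort" means sorting the positions by the values stored at them, so there is no need to overwrite used values with a string or -1. A plain-Python sketch with invented lists:

```python
scores = [12, 47, 5, 30]
names = ["ann", "bob", "cy", "dee"]

# "argsort": sort the positions 0..n-1 by the value stored at each position.
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Walk through the other list from highest score to lowest.
for i in order:
    print(names[i], scores[i])
```

NumPy offers the same idea as `np.argsort`, which returns ascending order.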
I don't know what flat means
okay, so
yeah I'm running into a whole host of errors when I try to read it into a dataframe
!e
import pandas as pd
json = """[
{
"col_1": "a",
"col_2": 3
},
{
"col_1": "w",
"col_2": -3
}
]"""
print(pd.read_json(json))
@velvet thorn :white_check_mark: Your eval job has completed with return code 0.
001 | col_1 col_2
002 | 0 a 3
003 | 1 w -3
ah ok I see what you mean now
!e
import pandas as pd
json = """[
{
"col_1": "a",
"col_2": {
"sub_col_1": 1,
"sub_col_2": None
}
},
{
"col_1": "w",
"col_2": {
"sub_col_1": 6,
"sub_col_2": 4
}
}
]"""
print(pd.read_json(json))
@velvet thorn :x: Your eval job has completed with return code 1.
001 | Traceback (most recent call last):
002 | File "<string>", line 20, in <module>
003 | File "/snekbox/user_base/lib/python3.9/site-packages/pandas/util/_decorators.py", line 199, in wrapper
004 | return func(*args, **kwargs)
005 | File "/snekbox/user_base/lib/python3.9/site-packages/pandas/util/_decorators.py", line 299, in wrapper
006 | return func(*args, **kwargs)
007 | File "/snekbox/user_base/lib/python3.9/site-packages/pandas/io/json/_json.py", line 563, in read_json
008 | return json_reader.read()
009 | File "/snekbox/user_base/lib/python3.9/site-packages/pandas/io/json/_json.py", line 694, in read
010 | obj = self._get_object_parser(self.data)
011 | File "/snekbox/user_base/lib/python3.9/site-packages/pandas/io/json/_json.py", line 716, in _get_object_parser
... (truncated - too many lines)
Full output: https://paste.pythondiscord.com/imiluyarol.txt
this is not
hmmm okay
the opposite of flat is "nested"
so
the term for what you want to do is "normalise"
ideally
And with "nested" it's much harder to untangle?
you would have some knowledge of the relational model of data
how SQL does it
hmmm ok
more like
your data cannot be mapped to a row-column structure
pandas has a json_normalize method
but that won't work in all cases
you can try that
if it doesn't
then you need to do it manually
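A minimal sketch of the `json_normalize` route on data shaped like the nested example above. One assumption to flag: the missing value is written as `null` here, since `None` is Python syntax and not valid JSON.

```python
import json
import pandas as pd

# Same shape as the nested example above. Valid JSON spells the
# missing value as null; None is Python syntax.
text = """[
  {"col_1": "a", "col_2": {"sub_col_1": 1, "sub_col_2": null}},
  {"col_1": "w", "col_2": {"sub_col_1": 6, "sub_col_2": 4}}
]"""

records = json.loads(text)          # plain Python lists and dicts
flat = pd.json_normalize(records)   # nested keys become dotted column names
print(flat)
```

`pd.json_normalize` is available at the top level from pandas 1.0 onwards. Older versions expose it as `pandas.io.json.json_normalize`, which may be relevant if the call fails with a module or attribute error.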
ok let me look up some syntax on json_normalize
thank you!
by manually you mean with like functions and iterations?
shit I am not good at those
I'm like a beginner for sure as you could prolly already tell haha
@velvet thorn do you have any book recs for python with data science in particular?
nope, sorry...
not really a book-for-learning person
I feel that
I got a module error when trying json_normalize
That's prolly bc of version?
@velvet thorn I gave up and decided to use a kaggle dataset LOL
i'm looking to create a nice looking line chart that can be styled, is mobile friendly/dynamic. i need the package to either render html of the chart or a .svg so the image is interactive. How could I go about this?
what exactly do you need matlab for for ML?
I used to use matlab at university but havent touched it much since I started doing ML
So far I've been able to rely on sklearn for any stat crunching models
Guys, quick question, Why would this throw me an error?
```
df_bert = pd.DataFrame({
    'id': range(len(train_df)),
    'label': train_df[0],
    'alpha': ['a'] * train_df.shape[0],
    'text': train_df[1].replace(r'\n', ' ', regex=True)
})
```
So, I have a CSV data file and I am using that file to create a data frame with only two columns
My interpreter does not like how I parse the columns [1] and [0] into a new dataframe
Anything helps!
@quasi sparrow Not sure, but here are some thoughts. 1. Without knowing what's in your train_df, I think you're trying to set the label to the values you have in the first column? If so, you may want to try something like:
'label': train_df.iloc[:, 0].values
I don't think you can use the index of the column without telling pandas that's what you're trying to do.
Yes, that's exactly what I'm trying to do! Many thanks!
I'm trying to train a bert model
as for the replace, you want to add an .str.replace.... So pandas knows you're using a string method
or maybe pandas has a replace method? but it looks like you're doing string work
I have a dataframe with 2 columns: One with string sentences and the other one with numbers (labels)
though my first suggestion (using iloc) might solve both issues without having to use .str.
I am trying to create a new dataframe with 4 columns, 2 coming from the original dataframe that I already have.
hopefully those suggestions help. lemme know how it goes! 🙂 good luck!
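A sketch of the positional approach, assuming a two-column CSV with no header row (label first, then text). The CSV content is invented. `train_df[0]` only works when the column labels really are the integers 0 and 1, whereas `iloc` selects by position whatever the labels are.

```python
import io
import pandas as pd

# Hypothetical two-column CSV without a header row: label, then text.
# The first text field contains a real line break inside the quotes.
csv_text = '1,"good\nmovie"\n0,"bad film"\n'
train_df = pd.read_csv(io.StringIO(csv_text), header=None)

df_bert = pd.DataFrame({
    "id": range(len(train_df)),
    "label": train_df.iloc[:, 0],          # first column by position
    "alpha": ["a"] * train_df.shape[0],
    # Series.replace with regex=True also rewrites substrings.
    "text": train_df.iloc[:, 1].replace(r"\n", " ", regex=True),
})
print(df_bert)
```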
Thanks! Sure I will!
Hi all, I'm trying to use some functions created in a kaggle notebook to create a custom correlation heatmap (https://www.kaggle.com/mlwhiz/seaborn-visualizations-using-football-data/), however despite the data I'm using being seemingly very similar to the data that they're using, I'm getting an error UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> dtype('<U32') when I try to call results = associations(plot_df,nominal_columns=catcols,return_results=True) on my own dataframe plot_df. I have absolutely no idea what this means and answers from other places online haven't really explained this well at all
this is their dataframe when i called .info() on it
and here's mine
I really don't see any differences here in types, any ideas what could possibly be causing this?
if more info is needed I can provide it. Thank you!!!
What do you exactly mean by difference in types?
The types of my data and theirs arent different
<U32 means 32 character unicode string
that error is saying that you're treating non-numeric columns as numeric, basically
Hi guys, can you guide me on how to go about building this vehicle classifier/tracker system? I have data of vehicles in the form of video recordings
do you want to track or classify?
hi
Any suggestions for functions I can use to compute correlations? "pearsonr" from scipy.stats crashes my Jupyter notebook consistently. I have 500000 datapoints for each set of distributions whose correlation I want to compute, but surely there has to be a non-crashy way...
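One alternative worth trying is `np.corrcoef`, sketched below on synthetic data of the same size. This does not explain why the notebook crashes; it only shows another way to get the same coefficient.

```python
import numpy as np

# Hypothetical data: two related samples of 500,000 points each.
rng = np.random.default_rng(0)
x = rng.normal(size=500_000)
y = 0.5 * x + rng.normal(size=500_000)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is the Pearson correlation coefficient between x and y.
r = np.corrcoef(x, y)[0, 1]
print(r)
```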
thank you! it worked
What are some good resources to learn about graphs, probability, and statistics? I am auditing an edX course on python that involves things like JupyterLab and data science stuff but I never passed stats in high school
Is anyone familiar with the perceptron algorithm? I'm trying to use MNIST and TensorFlow to figure out how to tell a three from a five
looking for a reference
I have my data normalized, and separated, but not sure where to go from here.
Anyone know something about objective functions? How do I create one out of a 3D (or higher-dimensional) plot of data?
statquest on YT
best stats explanations ever
- tina huang

hi
hello. I was building a Multiple Linear Regression Model, with the dummy dataset of 50_Startups which is usually used by beginner ML learners. The data set looks like this
I made 2 models, one in which I dropped one dummy variable in order to avoid the dummy variable trap after onehot encoding the State feature, and in the other model I didn't drop any dummy variable, kept all 3 of them. But the results were exactly the same
same R^2
same predictions
why did this happen? Aren't we supposed to drop one dummy variable in order to avoid multicollinearity ?
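One plausible explanation, sketched with NumPy on invented data: with an intercept, the design matrix with all three dummies spans the same column space as the one with a dummy dropped, so a least-squares fit gives the same fitted values (and therefore the same R^2). What the redundant column affects is the coefficients, which are no longer unique. This assumes the library fits with a least-squares routine that tolerates collinear columns.

```python
import numpy as np

# Hypothetical data: one numeric feature plus a three-level category.
rng = np.random.default_rng(0)
n = 30
spend = rng.normal(size=n)
state = np.arange(n) % 3
dummies = np.eye(3)[state]                 # one-hot, all three columns
y = 2.0 * spend + np.array([1.0, 3.0, 5.0])[state] + rng.normal(scale=0.1, size=n)

ones = np.ones((n, 1))
X_full = np.hstack([ones, spend[:, None], dummies])         # rank deficient
X_drop = np.hstack([ones, spend[:, None], dummies[:, 1:]])  # one dummy dropped

# lstsq returns a least-squares solution even when columns are collinear.
coef_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
coef_drop, *_ = np.linalg.lstsq(X_drop, y, rcond=None)

same_predictions = np.allclose(X_full @ coef_full, X_drop @ coef_drop)
print(same_predictions)
print(coef_full, coef_drop)
```

Dropping a dummy still matters when the individual coefficients are going to be interpreted, or when the solver inverts the normal equations directly.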
The make_input_fn returns a function which is then later passed to tensorflow during training and evaluation. The reason for the nesting is that train_input_fn and eval_input_fn must be functions that take no arguments. Tensorflow expects / allows them to only have the arguments mode, params and config (and maybe input_context). The functions train_input_fn and eval_input_fn are supposed to return the data, but how are they supposed to do that if they don't take any arguments? How do you get your data to tensorflow? The trick is to either use globals or do the wrapping trick, in which the inner function has the data without getting it passed as an argument, but unlike a global, it's not global. In general this is a common trick you will find in python programs. Side note: this is how decorators are implemented.
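The wrapping trick in plain Python, as a stand-in sketch. This is not the TensorFlow API: the body of `make_input_fn` and the batching logic are made up to show how the inner function keeps access to the data.

```python
def make_input_fn(features, labels, batch_size=2):
    """Return a zero-argument function that remembers features and labels."""
    def input_fn():
        # features, labels and batch_size come from the enclosing call,
        # not from arguments and not from globals.
        batches = []
        for start in range(0, len(features), batch_size):
            stop = start + batch_size
            batches.append((features[start:stop], labels[start:stop]))
        return batches
    return input_fn

train_input_fn = make_input_fn([1, 2, 3, 4], [0, 1, 0, 1])
eval_input_fn = make_input_fn([5, 6], [1, 0])

# Both take no arguments, yet each returns its own data.
print(train_input_fn())
print(eval_input_fn())
```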
hi data science people
have a pandas question that I haven't had a response for in the help channels:
made a script that searches an excel file (currently one column) based on user input and prints matching rows which is great, but i would like it to search multiple columns. any idea how to achieve this? here is my code:
```
import pandas as pd
from tabulate import tabulate
from termcolor import colored

class bcolors:
    FAIL = '\033[91m'
    ENDC = '\033[0m'  # reset code, needed by the f-string below

while True:
    try:
        variable = input("Please provide an acronym: ")
        variable = variable.upper()
        df = pd.read_excel("accounts.xlsx")
        df = df.set_index('Acronym')
        result = df.loc[variable]
        print(tabulate(result, headers='keys', tablefmt='psql'))
    except KeyError:
        print(f"{bcolors.FAIL}Invalid acronym{bcolors.ENDC}")
```
Show input and output.
Please provide an acronym: RYAN
+-----------+----------+-----------+-----------+
| Acronym | Parent | Clearer | Account |
|-----------+----------+-----------+-----------|
| RYAN | 26KM291 | GS | 285M322 |
| RYAN | 2378DM | Socgen | 2HKLM242 |
| RYAN | 26KM60 | GS | 285M322 |
| RYAN | 26KM60 | BAML | 268132 |
+-----------+----------+-----------+-----------+
Please provide an acronym:
Show accounts.xlsx (assuming it's test data and not something that needs to be kept secret).
(screenshot a section of it)
has 4k rows
goal is to search by parent or account as well as acronym, searching by clearer isn't necessary
so you want to be able to search by a key other than Acronym?
yes exactly, by the parent ID as well as the account ID
and print matching rows (including acronym column)
i'm thinking i will need to restructure a lot? @iron basalt
@severe python I'm back, so you could restructure, much like in a database, but you could also just not use an index and instead do something like df.loc[df["column name"] == value].
df.loc will give you the rows and df["some name"] gives you the column with that name. You then check that column for all values that match value and get the rows at those spots.
If you go the index route you can make things a lot faster, but you need to normalize (1NF, 2NF, 3NF, BCNF).
hi
does anyone know about the opencv library
i have a problem with its Haar cascade classifier
the .detectMultiScale() method gets stuck while executing
hi

just looking around
sorry ive only used the draw functions with opencv
ohk np
and the image processing module
@magic pivot opencv is a buggy mess (on the c++ side, which the python side inherits), try using scikit-image instead. If that does not work, come back.
@iron basalt thank you very much 🙏
@iron basalt so without index, it would look like df.loc[df["Acronym, Parent, Account"] == variable] ?
Hey so if I need to read an image and find it as accurately as possible what methods are there? Canny keeps giving me false positives
import pandas as pd
df = pd.DataFrame(
    {
        "month": [1, 4, 7, 10],
        "year": [2012, 2014, 2013, 2014],
        "sale": [55, 40, 84, 31]
    }
)
print(df)
print("--------------------------")
print(df.loc[df["year"] == 2014])
print("--------------------------")
print(df.loc[df["year"] == 2014].loc[df["month"] == 4])
print("--------------------------")
run that
@severe python
it gets all rows with year == 2014 and from all those rows it gets all rows with month == 4.
AttributeError: partially initialized module 'pandas' has no attribute 'DataFrame' (most likely due to a circular import)
i see what you mean, in my case would i just reference variable? which is basically the user input @iron basalt and would i need to change the ending lines?
yeah, if your user inputs an acronym (wants to search by one), then you just do what you are doing now. If they select to search by Parent, then it's the same, but with parent instead.
So here is what you can do
There is a loop, the user selects which column they want to search by.
Then which value they want to match in that column.
It spits out all rows that match.
Then go back to step 1. But this time it only searches the remaining rows.
rows = df
# In a loop with user input
rows = rows.loc[rows[column_to_search_by] == value_to_match]
print(rows)
i see what you're saying
It filters down till you only have whatever you want left.
so there isn't a way to search multiple columns normally right? like instead of asking user what criteria they want to search by
You can also detect multiple columns were inputted e.g. month, year and then accept multiple values 4, 2014 to speed things up a bit for the user.
This would just execute as before, one after another.
a little lost
i feel like there's a way to search by both. like if i search a parent account, to print that row. if i search an acronym, print that row
what is preventing me from doing that? and can i use df.loc with multiple columns referencing the user input? what you said above
The solution I gave you does exactly that.
gotcha, can you double check this with the full code i gave above?
df.loc[df["Acronym, Parent, Account"] == variable]
import pandas as pd
df = pd.DataFrame(
    {
        "month": [1, 4, 7, 10],
        "year": [2012, 2014, 2013, 2014],
        "sale": [55, 40, 84, 31]
    }
)
print(df)
print("--------------------------")
while True:
    search = input("Enter your query: ")
    if search == "quit":
        break
    sp = search.split(",")
    col = sp[0]
    val = int(sp[1])
    print("Showing all results where {} == {}:".format(col, val))
    print(df.loc[df[col] == val])
@severe python
Example output:
month year sale
0 1 2012 55
1 4 2014 40
2 7 2013 84
3 10 2014 31
--------------------------
Enter your query: sale,55
Showing all results where sale == 55:
month year sale
0 1 2012 55
No you can't do that in pandas. The solution is really simple, just search by Acronym, then take those results and search by Parent, and then take those results and search by Account.
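The chained filtering described above narrows the rows down (an AND across columns). If the goal is instead "show the row when the input matches the acronym or the parent or the account", one possible sketch is a mask built across several columns at once. The helper name `search` and the sample rows are hypothetical, loosely modelled on the printed table.

```python
import pandas as pd

# Hypothetical rows modelled on the printed table above.
df = pd.DataFrame({
    "Acronym": ["RYAN", "RYAN", "ACME"],
    "Parent": ["26KM291", "2378DM", "26KM60"],
    "Clearer": ["GS", "Socgen", "BAML"],
    "Account": ["285M322", "2HKLM242", "268132"],
})

def search(frame, value, columns=("Acronym", "Parent", "Account")):
    """Return rows where ANY of the given columns equals value."""
    # frame[cols] == value gives a True/False table; any(axis=1) collapses
    # it to one True/False per row.
    mask = (frame[list(columns)].astype(str) == str(value)).any(axis=1)
    return frame.loc[mask]

print(search(df, "RYAN"))
print(search(df, "26KM60"))
```

An empty result takes the place of the `KeyError` check from the index-based version, so the "Invalid acronym" message could be printed when `search(...)` returns no rows.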
Is there a way to install TensorFlow with GPU support on Windows 10 without using anaconda? All the tutorials I've seen are either outdated or use anaconda which I don't have
The reason why they use anaconda is because tensorflow GPU needs the CUDA toolkit.
(And conda has that)
Right
Previous releases of the CUDA Toolkit, GPU Computing SDK, documentation and developer drivers can be found using the links below. Please select the release you want from the list below, and be sure to check www.nvidia.com/drivers for more recent production drivers appropriate for your hardware configuration.
But could I not just install the CUDA toolkit myself?
You can
Or is this just something where I should install anaconda?
Not sure if there's an interest in this, but I've been working on a channel for Python for a while. It's not data science-centric, but it is text-analysis specific with bits of DS thrown in the mix. Thought I'd share it here: https://www.youtube.com/pythontutorialsfordigitalhumanities
On this channel, I provide tutorials for working with Python in a digital humanities project. I design my videos and tutorials for humanists who have no coding experience. I am a medieval historian by trade, but I create my videos with all humanists in mind. If you want to interact with the videos in more dynamic ways, check out my website, www....
You may be familiar with our #rules about self-promotion. On-topic self-promotion is a bit of a grey area, so try to make sure references to your own content are part of a legitimate effort to discuss that content.
I've actually been working with NLP for a while. What sort of text analysis are you doing?
Oh dear. I didn't mean to violate the rules. I am sincerely sorry. Everything from custom NER for domain-specific problems to topic modeling.
I don't make money off these videos, just helping develop them as part of a postdoc and spreading the word.
no problem, I was just telling you that for your information. I'm not accusing you of "drop it and run" self promotion
Ah gotcha. No nothing like that.
I also see from your message history that you finished an NLP project with spaCy. My first big Python project was refactoring a spaCy-based NER package that my coworker wrote.
Thanks for the heads up about the rules, though. I'll be more explicit in my intentions if I do something like that in the future.
Oh cool. Indeed. I am a huge fan of spaCy. Looking forward to preparing a series of tutorials for version 3.0. There's lots to unpack in the new update and I haven't had the time to fully explore it yet
Oh, there's going to be a new major release?
I wrote a textbook on NER using spaCy. ner.pythonhumanities.com , if you are interested
Was released Feb 1.
They are moving towards BERT. Results are expected. Marked increase in accuracy at the cost of performance, but not as much as competitors.
Interesting. One of the reasons we didn't build our NER package entirely through spaCy (we collected features from Doc instances and used other learners) was because we eventually wanted to use BERT. Which we did.
But you're telling me that BERT embeddings will be what ship with the large model?
I have only toyed with spaCy 3.0, but it's a fairly easy implementation of BERT. Lots more customization now too with 3.0. You can control your ANN architecture
No. There is a .trf model that is the BERT model. They still have all the same sm, md, lg models with embeddings
So you can still use the old embeddings, if you desire
Interesting. Right now my advisor has me working on dataset ablation and that might carry me through to when I leave the college. I'm in my last semester of undergrad and I've just had the opportunity to do research because she took a chance on me.
Oh that's really cool. DS/NLP is a fun career path. I'm a historian by training
That is to say, your undergrad was in history but you pursued CS thereafter and that's how you're a postdoc?
Oh no my B.A., M.A., and PhD are all in medieval history. During my PhD I taught myself DS/CS/and programming in secret so that I could do the research I wanted
I'd have to DM you to discuss that since history is off topic. Do you mind?
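import numpy as np
import scipy.integrate
import matplotlib.pyplot as plt

# f (the target function) and x (the plotting grid) are defined elsewhere
aa = []
varss = []
errors = []
for i in range(10000):
    # two training points drawn uniformly from [-1, 1]
    x1 = np.random.rand()*2-1
    x2 = np.random.rand()*2-1
    X = np.array([[x1], [x2]])
    y1 = f(x1)
    y2 = f(x2)
    y = np.array([[y1], [y2]])
    # least-squares slope of the line a*x through the origin
    a = np.linalg.solve(X.T @ X, X.T @ y)[0][0]
    # quad returns (value, error estimate); keep only the value
    var = scipy.integrate.quad(lambda x: (a*x-1.4286*x)**2, -1, 1)[0]
    error = scipy.integrate.quad(lambda x: (a*x - f(x))**2, -1, 1)[0]
    aa.append(a)
    varss.append(var)
    errors.append(error)
variance = np.mean(varss)
error = np.mean(errors)
print("variance", variance)
print("error", error)
ahat = np.mean(aa)
print("ahat", ahat)
plt.plot(x, f(x))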
No idea where the 1.4286 in (ax-1.4286x)^2 came from?
My thoughts: it seems to me like the guy who wrote the code first calculated ghat
and got it to be approximately 1.4286
then put the numerical value in, since it won't really differ by much
and then avoid having to do all the calculations twice
seems reasonable or did I get something wrong?
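One way to check that reading is to compute the average slope first and only then measure the spread around it, instead of hard-coding the number. This is a sketch under assumptions: the snippet never shows f, so f(x) = sin(pi x) is a guess here, and the closed-form slope and integral stand in for the `np.linalg.solve` and `quad` calls.

```python
import numpy as np

def f(x):
    # Assumed target function; the original snippet does not define f.
    return np.sin(np.pi * x)

rng = np.random.default_rng(0)
n_runs = 10_000

# Pass 1: fit h(x) = a*x to two random points per run.
x = rng.uniform(-1, 1, size=(n_runs, 2))
y = f(x)
# Least squares through the origin: a = sum(x*y) / sum(x*x).
a = (x * y).sum(axis=1) / (x * x).sum(axis=1)
ahat = a.mean()  # slope of the average hypothesis

# Pass 2: spread of each fitted line around the average line.
# The integral of ((a - ahat) * x)**2 over [-1, 1] is (a - ahat)**2 * 2/3.
variance = np.mean((a - ahat) ** 2 * 2 / 3)

print("ahat", ahat)
print("variance", variance)
```

If the printed `ahat` lands close to the hard-coded constant, that supports the idea that the author ran the first pass once and pasted the result in.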
I have a dataframe with two columns
and another dataframe with a 3rd column representing the count of each pairing of values from the first dataframe with multiindexing
how would i populate a 3rd column in the original dataframe with the value of the count in the second dataframe for each row's pairing?
i originally thought something along the lines of mb_counts.loc([plot_df['merchantID'], plot_df['BrandID']]) but that doesn't work because it's using lists as an index
I think you should be able to create a new column with apply and a lambda
something like
DF["NEW COLUMN"] = df.loc["applicable_index"].apply(len)
I -think- thats what you want no? the size of the multiindex?
or do you want the count of the elements inside the index, per element?
what i'm trying to do is if there's, say 4 rows in the original dataframe where merchantID is 9359 and BrandID is 8360, then the other dataframe contains the number 4 at the multiindex of merchantID being 9359 and BrandID being 8360
im adding a column to the original dataframe which contains the number of times each row occurs in the dataframe
yeah you can do something like what i said, if i got you right
so for example that multiindex has 4 (in the dead) test
i can get the number with len(df.loc["Alabama"])
and you can multiindex it too
so what you need is to pass len into an apply
to get the size based on a multiindex loc
hmm ok i'll try that
i tried merchant_brand_df['counts'] = merchant_brand_df.loc[['merchantID', 'BrandID']].apply(len) and got KeyError: "None of [Index(['merchantID', 'BrandID'], dtype='object')] are in the [index]"
no
that means those are not in the index
i'm also testing the solution myself and its not optimal. let me try something else
ok
merge the counts dataframe with the original one?
ahhhh
ye
yes that's actually much better
may i ask some resources to learn data science.
ahhh ok
merge on those two columns
groupby with .agg("count")
by “count”
yeah honestly grouping / merging would be better there, @velvet thorn is right
doesn’t @astral path mean like just the value in the second DF
which is the already computed count
yeah
he's trying to compute it, though, right?
Or did i misunderstand the whole thing lol
i already have it computed in the second dataframe
ok i'm dumb lol
although if it's better to compute it in a one-liner that also adds a column then that's better
@velvet thorn 's approach is much cleaner
try using pivot or groupby
and aggregate
via len/count
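The merge suggestion above could look roughly like this. The column names merchantID and BrandID and the frame names `plot_df` and `mb_counts` come from the conversation; the rows are made up, and `mb_counts` is rebuilt here only as a stand-in for the already-computed counts frame.

```python
import pandas as pd

# Made-up rows; merchantID and BrandID are the column names from the question.
plot_df = pd.DataFrame(
    {
        "merchantID": [9359, 9359, 9359, 9359, 1001, 1001],
        "BrandID": [8360, 8360, 8360, 8360, 42, 77],
    }
)

# Stand-in for the second dataframe: counts indexed by (merchantID, BrandID).
mb_counts = plot_df.groupby(["merchantID", "BrandID"]).size().to_frame("count")

# Turn the MultiIndex into ordinary columns, then left-merge on the pair.
merged = plot_df.merge(
    mb_counts.reset_index(), on=["merchantID", "BrandID"], how="left"
)
print(merged)
```

A left merge keeps every row of `plot_df` in its original order and attaches the matching count to each one.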
@scenic patio any specifics in mind?
also @astral path a tip
nope i just learned basics of python
in general for this kind of question
My advice would be to first learn python
if you can provide a runnable sample of input and expected data
like something people can copy paste and run
i was searching through videos and learned that python is good for ds
it gets a lot easier to understand your problem
how would I get that expected data sample in an easily exportable way from python?
@scenic patio There's Python, R, and Julia is interesting too, for example
Python is the most popular though
I would just make one up
ah ok
with a few rows
do you have a background with decent math studies?
i am in high school
Ok you need to learn some math then
Try Khan Academy on these topics: Linear Algebra. Calculus I, II, III (vectors)
Multivariate calculus
As per resources...there's countless to be honest lol
also statistics
thats the problem so much resources dont know what to use
I'm going through the University of Michigan's Applied Data Science
which is in python
and its pretty good
but be aware of one thing
not one resource EVER is going to teach you everything
you need to research a lot by yourself
https://worldpece.org/sites/default/files/datastyle.pdf
this is a fantastic book that my data wrangling professor used to teach us the fundamental questions behind data science
Well, that's an important life topic you shouldnt ask strangers in discord about lol
but for data science you can start there
yh i just want to know what ds resources are reliable
University of Michigan's Applied Data Science -> I like this, but there are others
i would search for textbooks/lectures/resources that universities provide or use
Codequest? I think its good too
EDX has fantastic data science courses, esp from Harvard, but thats in R, not Python
what about this website i stumbled upon: dataquest
I think thats good
ive heard good reviews about
it
but i havent tried it myself
dont get stuck in tutorial hell though dude
review courses
choose one
learn as much as you can
and then do a project of your interests, whatever that is
agreed, you would learn more in a project than in a course with spoon-fed material
thank you!!!
IMO that's one of the most enjoyable ways to learn. Because the project would be something you like and would be cool, you won't lose motivation.
Exactly
choose a topic you like, find data about it, and make sense of it
You like anime? Try getting anime data depending on seasons, gender, profits, etc, and try to predict anime's success based on 2 or 3 parameters.
You like weather? Plenty of datasets lol
economics? shitloads of datasets
I think even pornhub has API nowadays...
That turned dark pretty quickly 🙂
😆

besides, I just said API :p
actually, jokes aside, those guys at PH have some decent datasets and insights
i always laugh at their yearly summaries lol
facts
Well what do you know - there actually is an API for PH
i heard thats what let them stay ahead of their competition. them using data science and ML
imagine getting a masters in statistics to go work for pornhub
Yeah, how would I know
I thought MS wont allow it
-looks away-
"What do you do your a living?" I investigate what people jerk off to
true
predicting that with a model
Data Science in PH -> Interquartile range of "session" length based on categories
dw too much about figuring out what you want to do. you still have time. take a chance to explore everything and find what you like
this guy is so ahead of things he shouldnt stress
yeah def
im 28, with career and shit, and im still a bit lost lmao
i got my problem fixed merchant_brand_df = merchant_brand_df.groupby(['merchantID', 'BrandID']).size().to_frame('count').reset_index()
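Worth noting that the one-liner above leaves one row per unique (merchantID, BrandID) pair. If the aim is to keep every original row and put the count beside it, `transform("size")` is one option. A minimal sketch with made-up rows:

```python
import pandas as pd

# Made-up rows; the column names are the ones from the conversation.
merchant_brand_df = pd.DataFrame(
    {
        "merchantID": [9359, 9359, 9359, 9359, 1001, 1001],
        "BrandID": [8360, 8360, 8360, 8360, 42, 77],
    }
)

# transform("size") returns one value per original row: the size of its group.
merchant_brand_df["count"] = merchant_brand_df.groupby(
    ["merchantID", "BrandID"]
)["BrandID"].transform("size")
print(merchant_brand_df)
```

The result has the same index and length as the input, so no merge or `reset_index` is needed.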
though Im liking data science a lot
same except grad school

why the reset index though? 'screeches at positional indexes'
mid career change
to data science
or at least trying

you at least seem to know your stuff unlike me
depressing thought: it would suck being the data scientist for the FBI who has to create algorithms for detecting certain illegal kinds of porn

depressing yes but someones got to do it
you would have to go through and analyze the features of those videos and that could be scarring
oh no, i heard its mostly looking at metadata
oh thank god
then if its actual stuff they investigate it more closely
I am starting to feel too young 😅
how young are you 
Agreed
Technically....
Juvenile your 3rd home
oh wait i took this the wrong way based on prior context
or did i
Humanity will be doomed the day an NLP algorithm understands double meaning jokes
in multiple languages
oh boy
i cant wait for the day algorithms can generate jokes so complex and clever that only other AIs can understand it
centuple-entendre jokes
could be of related interest
fictional short story but with massive societal implications
reddit and twitter are basically 50% bots
jerking off and upvoting each other
and creating trends out of nowhere
literally sometimes
change my mind


big yikes moment
Nothing serious tho
Right now our Ai is not that advanced. It's just an effective mode of communication
for now
but yeah i getcha
semi-related
theres this poster thing at my uni, and i wanna do a poster about ai and ethics/society but idk what would be interesting to others
https://www.youtube.com/watch?v=WnzlbyTZsQY
The comments are pure gold 😁
Are you a Robot or a Unicorn? Let the world know: http://yosinski.com/IAmAUnicorn/
What happens when you let two bots have a conversation? We certainly never expected this... (More: http://creativemachines.cornell.edu/AI-vs-AI)
By Igor Labutov, Jason Yosinski, and Hod Lipson of the Cornell Creative Machines Lab (http://creativemachine...
the one thing I havent been able to find in a "simple" understandable way is the theory behind some of the ML algorithms
Like, yeah I 99% care about the applied part, but Im curious about the theory too lol 😦
Ofc you can learn about theory - or get an idea at least if you watch 3B1B
But the real challenge is how exactly does a model learn?
3B1B?
3Blue 1Brown - youtube channel
depends on what you mean by "learn". for advanced stuff like image recognition or NLP I have no clue -yet-
hm
which algorithm specifically
lol that video was hilarious
No, I mean interpretability of Black Box models. Its a pretty hot research topic
i thought for AI like this its a lot of reinforcement learning
no, they are different
it is
not sure if "how does a model learn" is exactly the same thing though
The underlying RL structure has MDP (Markov decision processes) at its core. Normal ML is basically supervised .... function mapping? dunno the exact term lolol
the techniques that have been developed to interpret DL models are p cool though
true. Some visualizations are pretty cool
That personally transcends me lol I'm more interested in the ways to apply ML / DL / NLP in real life
not necessarily
most commonly, yes
woahhh
theory is cool, but its not my thing
huge heatmap zoomed way out
excellent example of "WTF IS THAT CHART?"
@velvet thorn Hmm.. weren't all RL based on MDP? Sorry if I am behind times 🤷 I only know basic theory
I have no idea of what that heatmap is lol
uhh explainability. the only thing i know is this is a cool tool: https://github.com/slundberg/shap
most are
but
A3C maybe? I dont get it 🙂
it's just like how there's DL without gradient descent
(admittedly a less "out there" case)
Hmmm....
@grave frost RL has a ton of methods, and an obvious example of non-MDP is when they are solving a partially observable MDP, which is a much more interesting (and much more difficult) problem.
I stopped learning about RL cuz I don't think it's very useful for real-world application. mostly for playing games. The ones that are useful require a million lines of code with no libs
complex env?
that's one application, yes
thats the only one ik 
evolution
@grave frost Reality is only partially observable.
do you mean genetic algorithms?
yep. those too, but even the simple ones can replicate simple evolutionary processes
of doing things
carykh does that kinda things - simple RL algo for simulating animals, development and such
Is there a way to add inline code in discord?
backticks
`code` -> code
Haha 🤣
@grave frost One thing to note is that RL is normal ML. RL, supervised, unsupervised are all ML, just supervised is typically the most immediately applicable one.
Hmmm.... debatable
I'm pretty sure that is just kind of set in stone, but ok
can you elaborate
what else would Reinforcement Learning be if its not Machine Learning?

most places ive read place RL under ML
unless you consider it under the umbrella term AI instead
I think it's more like an algo that maximizes reward. ML to me seems a bit.... Well, its kinda similar but still 🥴
yeah, well I guess its the same 🤷
wikipedia machine learning page
yeah its like the black sheep but it happens
RL is arguably the most difficult and would be most essential to creating AI. But that means that few people bother with it because it can't be immediately applied most of the time (which means there is no money in it).
nah, its just that the translation from simulated environments to real world is much more jittery than supervised or unsup
I recommend https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf if you are interested. It's the go-to introduction.
OpenAi has some research for such translation but the tasks are too narrow to be very useful
@grave frost Ideally for AI you need to be able to do RL online in the real world, not in a simulation (like a human).
ofc, that is the end goal. its just not viable with current tech
at least not commercially
It's getting pretty close, just not popularized.
I think it's pretty popular 🤷 people use it for games all the time 🙂 just less for real-world applications
I don't see too many games with RL.
Or AI at all.
Game AI is a different thing, it does not learn.
Perhaps you mean Dynamic Programming? That is part of ML / a technique that can be used, but it's not RL.
I don't know dynamic programming. what's that??
it does actually
It's the general idea of breaking down a problem into subproblems, solving those, combining those answers and so on.
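A tiny illustration of that idea, with the extra ingredient that usually marks dynamic programming: the answer to each subproblem is stored and reused instead of being recomputed. Fibonacci numbers are just a stock example chosen here, not something from the discussion.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """n-th Fibonacci number; each subproblem is solved once and cached."""
    if n < 2:
        return n
    # Combine the answers of the two smaller subproblems.
    return fib(n - 1) + fib(n - 2)

print(fib(50))
```

Without the cache the same subproblems would be solved again and again; with it, each value of `n` is computed only once.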
some do
Yeah some do, but I don't see it too often.
nah, that's not the case with AI for DOTA
Game AI usually refers to the traditional stuff like GOAP. Or a chess bot.
Games like DOTA are way too complex to be broken into anything
You mean like how DL was used to make a DOTA bot?
Yeah that is a more recent development in game design.
The model does real-time inferencing and partial learning (IDK how exactly does that work) to study the players patterns and counter
maybe partial learning is kinda like supervised fine-tuning? who knows
However, the DL bots in DOTA don't learn, they only do inference, which is much easier than doing online learning.
(DL requires lots of iterations to learn things so it's unsuitable for online learning)
(and other reasons)
ikr
but I don't see why an architecture cannot store the players moves into memory to be interpreted by a part of NN (Like previous timesteps) to predict future moves 🤷 A naive workaround
so you're saying
log what's happening
but don't use it for training
right then?
Just an idea bud
I don't know what OpenAi has done. But by that, I meant that the model can be trained to learn this new piece of info and try to predict the players moves and adjust accordingly
🥴
not exactly real-time learning, but you can't expect me to come up with an idea right now
🙂
honestly, even if the model does better job than human without real-time learning, I doubt it would make much difference on the job to be done
The DL bots do prediction, but they 1. do it with the exact position of the players and they know where all the players are at all times (much easier than what a human needs to do from screen pixels and limited knowledge of the game state). 2. they don't learn online, they do it later. 3. They often learn to abuse their ability to do superhuman timing (a human needs to slowly process the pixel data, and do a complex learned sequence of muscle commands to do things).
a human needs to slowly process the pixel data, and do a complex learned sequence of muscle commands to do things
That's such a big insult to reflexology lolo
slowly compared to a computer
Gamers rely on their muscle memory which is the most efficient thing
Computers are limited by the speed of light (memory transfer speeds).
But the computations
Modern computers are bottle-necked by the memory speed, not computations on that memory.
(see caching)
When you have to make computations that fast, often multiple devices are required CPU, GPU, RAM, etc. all the time taken to immediately process things adds overheads
you can't expect all computations to be placed on 1 device
its a colab b/w CPU and GPU (+ RAM)
that adds overhead. M1 tries to reduce that by a unified pool of memory (an Idea I very much like) but still, nothings out there for the M1 yet
well yeah. thats why you have distributed computing but i thought you guys were talking about supercomputers

The more devices (gpus) you have, the more computations you have to do on which device to place tensors on
its not exactly a if/else
If you did it distributed the latency would probably climb above 200ms which would make the human the winner.
yup. and most models are very complex, so they have to be distributed
It's a very different and more difficult task to have an AI actually run and learn online in real time on just one device (like a robot's single (and power consumption limited) processor).
but it is possible
With AGI ¯_(ツ)_/¯
You can also just not use DL and instead use much more efficient methods. DL is computationally inefficient like crazy.
yeah, but its powerful too
Gotta go. adios
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
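For those wondering what I meant by memory speed being the limiter: these two for loops take roughly the same amount of time. The first does 16x more multiplications than the second, but both are limited by memory speed, because the CPU fetches a whole cache line at a time (typically 64 bytes, i.e. 16 four-byte ints), so both loops end up touching every cache line of the array.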
(Memory speed is also a huge issue on GPUs)
Yeah I could not use python code, because python is not running on the bare metal; it's really slow and kind of nullifies any speed gains you could have by abusing the ability to do a bunch of computation per fetch.
makes sense
numpy definitely abuses it (more specifically BLAS, which it just calls).
Numpy also uses vector operations so it would fetch let's say 16 ints, and 16 other ints, and then add those 16 ints to the other 16 in 1 operation (as opposed to 16 addition operations / operation level parallelism).
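A minimal sketch of the difference in how the work is expressed. The array sizes are arbitrary, and whether the compiled loop really uses SIMD instructions depends on the NumPy build and the CPU, so treat that part as an assumption rather than a guarantee.

```python
import numpy as np

a = np.arange(16, dtype=np.int32)
b = np.arange(16, 32, dtype=np.int32)

# Element by element: one Python-level addition per pair of ints.
looped = np.empty_like(a)
for i in range(len(a)):
    looped[i] = a[i] + b[i]

# Whole-array addition: NumPy runs the loop in compiled code,
# which may process several ints per instruction.
vectorized = a + b

print(np.array_equal(looped, vectorized))
```

Both versions give the same result; only the second lets the work happen outside the Python interpreter.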
Btw, in the past, memory access was much faster relative to the time it took to do the operations. So in the past, things like linked lists actually made sense. Now everyone uses dense arrays even though you sometimes need to resize them (reallocate memory (slow)).
cool
ok guys
If you guys use pycharm
then enter this code ok?
if you want to send automatic emails
What's the code?
Why are you sending in data science you can send in #python-discussion
It will be more cool 😄
oh ok
Tried searching on google but any communities I can join for SQL?
dunno but you can always ask sql stuff here or #databases
so
i have two categorical variables, merchantID and brandID, which have a positive correlation of 0.11. I'm trying to somehow plot the correlation. I've chosen a heatmap for now to show correlation between different merchants based on brandID for my assignment, but I'm also looking for ways to visualize merchantID and brandID in a way that's useful. This is hard because both variables have thousands of categories, so a plot like a scatterplot looks extremely cluttered
e.g.
what should I do to visualize correlation between two categorical variables when they take on so many values?
should I just take a subset?
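One way to take a subset without sampling at random is to keep only the most frequent categories on each side and cross-tabulate those. This is a sketch, not a recommendation for this dataset in particular: `top_n_crosstab` is a hypothetical helper and the rows are made up.

```python
import pandas as pd

def top_n_crosstab(df, row_col, col_col, n=20):
    """Cross-tabulate only the n most frequent categories of each column."""
    top_rows = df[row_col].value_counts().head(n).index
    top_cols = df[col_col].value_counts().head(n).index
    subset = df[df[row_col].isin(top_rows) & df[col_col].isin(top_cols)]
    return pd.crosstab(subset[row_col], subset[col_col])

# Made-up rows; only the column names come from the question.
df = pd.DataFrame(
    {
        "merchantID": [1, 1, 1, 2, 2, 3],
        "BrandID": ["a", "a", "b", "a", "b", "c"],
    }
)
table = top_n_crosstab(df, "merchantID", "BrandID", n=2)
print(table)
```

The resulting table is small enough to hand to a heatmap, at the cost of ignoring the long tail of rare merchants and brands.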
how can I pull data from spreadsheet in ETL?
out of these 9 subjects, what are the top 5 most important topics for data science? my guess would be 1) Programming, 2) Math for CS, 3) DS&A, 4) Databases, and 5) Distributed Systems
what do you guys think
i would think programming, math for CS, databases, algo and data structures, and languages and compilers
you swapped languages and compilers for distributed systems?
hadoop, spark, etc. are so important for data science tho
esp for big data
yeah, i think it would be wise to understand the languages and how they work, not very familiar with distributed systems
although if you think the other is a better option you should do that
oof nice
also i just ended up taking a sample
unfortunately, all that stuff earlier with pivoting was for nothing
cus the heatmap means nothing
hmm
do you have the dataset
or a link
let me see what happens when i upload it to tableau
rip just had a power surge gimme a sec
ooooof
wait why are you comparing these two variables
i get brand id corresponds to the brand names
what does merchant id represent
merchant id represents the merchant it's sold from
farfetch is a website that basically acts as a middle man for luxury boutiques which wouldn't otherwise have the reach to sell their products
i'm trying to find a correlation between merchants and brands
well
i used Cramer's V and found there's a correlation of 0.11 between them
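For reference, Cramér's V can be computed from a contingency table as sqrt(chi2 / (n * (min(rows, cols) - 1))). The sketch below does that with pandas and NumPy only; `cramers_v` is a hypothetical helper, no bias correction is applied, and the sample rows are made up.

```python
import numpy as np
import pandas as pd

def cramers_v(a: pd.Series, b: pd.Series) -> float:
    """Uncorrected Cramér's V between two categorical series."""
    table = pd.crosstab(a, b).to_numpy(dtype=float)
    n = table.sum()
    # Expected counts if the two variables were independent.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    k = min(table.shape) - 1  # needs at least two categories on each side
    return float(np.sqrt(chi2 / (n * k)))

# Made-up rows where each merchant sells exactly one brand.
df = pd.DataFrame(
    {
        "merchantID": [1, 1, 2, 2, 3, 3],
        "BrandID": ["a", "a", "b", "b", "c", "c"],
    }
)
v_perfect = cramers_v(df["merchantID"], df["BrandID"])
print(v_perfect)
```

V ranges from 0 (no association) to 1 (each category of one variable determines the other), and unlike a correlation coefficient it has no sign.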
it treats both as independent variables instead of trying to correlate them
yeesh
its bc theyre both categorical technically
here's another one with both categorical and numerical variables
¯_(ツ)_/¯
is the visualization important or knowing the groupings? I think of this as a simple collaborative filter, then tsne, draw bounds, grab 4 samples from each
interesting quote
But data scientists are kind of like the new Renaissance folks, because data science is inherently multidisciplinary.
This is what leads to the big joke of how a data scientist is someone who knows more stats than a computer programmer and can program better than a statistician. What is this joke saying? It’s saying that a data scientist is someone who knows a little bit about two things.
this is the rest of the bit:
But I’d say they know about more than just two things. They also have to know to communicate. They also need to know more than just basic statistics; they’ve got to know probability, combinatorics, calculus, etc. Some visualization chops wouldn’t hurt. They also need to know how to push around data, use databases, and maybe even a little OR. There are a lot of things they need to know. And so it becomes really hard to find these people because they have to have touched a lot of disciplines and they have to be able to speak about their experience intelligently. It’s a tall order for any applicant.
- john foreman from mailchimp
Restarting my jupyter kernel does not reset variables. I'm using jupyter notebook in vs code. Does someone know what could cause this behaviour?
how do I call sizes (normally a parameter for the seaborn scatterplot function) in scatter_kws in a regplot function?
which minor stream would be most ideal for data science
I'm currently using a pandas dataframe with scikit-learn LinearRegression as part of my ML program for predicting student grades:
```py
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn import linear_model

data: pd.DataFrame = pd.read_csv('./data/student-mat.csv', sep=";")
data = data[['sex', 'studytime', 'failures', 'schoolsup', 'paid', 'absences', 'G1', 'G2', 'G3']]
# encode the binary text columns as 0/1
data = data.replace({'F': 0, 'M': 1, "no": 0, "yes": 1})
to_predict = "G3"
# features are every column except the target
X = np.array(data.drop(columns=[to_predict]))
y = np.array(data[to_predict])
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
linear = linear_model.LinearRegression()
linear.fit(X_train, y_train)
```
How can I apply certain rules to the `linear.fit()`?
For more context, I have two previous test scores ("G1" and "G2", values of 0-20) of each student where 0 means they didn't take the test. I need to implement logic so that when both scores are 0, it'll predict 0 (since it can't make a prediction), when one score is 0 it'll ignore that score and predict on the other score, and when neither are 0 it'll just do as normal --- I need to ignore the scores that are 0 since it means the student didn't take the test and so I can't predict a test score based off it
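`fit()` itself has no hook for rules like these, so one possible approach is to wrap several models: one trained with both scores, one without G2, one without G1, and a hard-coded 0 when both are missing. The sketch below is an assumption-heavy illustration: `fit_rule_models` and `predict_with_rules` are hypothetical helpers, the data is synthetic with columns [studytime, G1, G2], and training the single-score models on every row that has that score is just one choice among several.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_rule_models(X, y, g1_col, g2_col):
    """Fit one regression per case; a score of 0 means the test was not taken."""
    has_g1 = X[:, g1_col] != 0
    has_g2 = X[:, g2_col] != 0
    all_cols = np.arange(X.shape[1])
    cases = {
        "both": (has_g1 & has_g2, all_cols),
        "g1_only": (has_g1, np.delete(all_cols, g2_col)),  # G2 column dropped
        "g2_only": (has_g2, np.delete(all_cols, g1_col)),  # G1 column dropped
    }
    return {
        name: (LinearRegression().fit(X[rows][:, cols], y[rows]), cols)
        for name, (rows, cols) in cases.items()
    }

def predict_with_rules(models, X, g1_col, g2_col):
    """Route each row to the model matching the scores it actually has."""
    has_g1 = X[:, g1_col] != 0
    has_g2 = X[:, g2_col] != 0
    pred = np.zeros(len(X))  # rows with both scores missing stay at 0
    routes = {
        "both": has_g1 & has_g2,
        "g1_only": has_g1 & ~has_g2,
        "g2_only": ~has_g1 & has_g2,
    }
    for name, mask in routes.items():
        if mask.any():
            model, cols = models[name]
            pred[mask] = model.predict(X[mask][:, cols])
    return pred

# Synthetic data with columns [studytime, G1, G2]; the real CSV is not used here.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(1, 5, 300),
    rng.integers(0, 21, 300),
    rng.integers(0, 21, 300),
]).astype(float)
y = 0.4 * X[:, 1] + 0.6 * X[:, 2] + rng.normal(0, 1, 300)

models = fit_rule_models(X, y, g1_col=1, g2_col=2)
X_new = np.array([[2, 0, 0], [2, 10, 0], [2, 0, 12], [2, 10, 12]], dtype=float)
pred = predict_with_rules(models, X_new, g1_col=1, g2_col=2)
print(pred)
```

Keeping the rules outside the model means the regression never sees a 0 that stands for "not taken" as if it were a real score.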
Hi Guys!!
What is the best platform for training NNs on the cloud ?
Hello
Recently I was going through Hypothesis testing.
What I understood after listening to the introduction to hypothesis testing is.
So basically hypothesis testing is nothing but.
We take sample data from a population and use it to estimate the population parameters. Whatever we infer about the population parameters from the sample data is then judged, to decide whether to reject a claim about them or not...
This process is known as hypothesis testing.
Is my understanding correct?
Please help me!
Please @ me🙏
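A small worked example may make the idea concrete. Here the claim being tested (the null hypothesis) is that the population mean equals 5.0; the sample, the assumed known standard deviation, and the 0.05 threshold are all made up, and a z-test is used only because it needs nothing beyond the standard library.

```python
from statistics import NormalDist, mean

# Made-up sample; mu0 and sigma are assumptions for the example.
sample = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.7, 5.9, 5.2]
mu0 = 5.0     # population mean claimed by the null hypothesis H0
sigma = 0.5   # population standard deviation, assumed known
alpha = 0.05  # significance level chosen before looking at the data

# How many standard errors the sample mean lies from mu0.
z = (mean(sample) - mu0) / (sigma / len(sample) ** 0.5)
# Probability of a result at least this extreme if H0 were true (two-sided).
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print("z =", z, "p =", p_value)
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The thing that gets rejected or not is the claim about the population (H0), and the sample is the evidence used to judge it.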
Does anyone know how i can change the name of a dataframe to an item i have in a list
I have 6 datafames and 6 items in my list
and i want to name the dataframes those items
Trying to convert a Jupyter notebook to PDF throws the following trace error:
pls help I just want a PDF hahaha :(
rename with zip
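DataFrames have no name of their own, so "rename with zip" most likely means pairing each name with its frame, for example in a dict. A minimal sketch with made-up names and frames:

```python
import pandas as pd

names = ["north", "south", "east"]  # hypothetical labels from the list
frames = [pd.DataFrame({"x": [i]}) for i in range(3)]  # stand-in dataframes

# zip pairs the i-th name with the i-th frame; dict makes them addressable.
named = dict(zip(names, frames))
print(named["south"])
```

Looking frames up by key avoids creating variables dynamically, and the dict can be looped over with `named.items()`.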
you know this is a Python channel, right
oh right tho i was gonna ask something about the relationship of R and python
like can i put some r files in my python program? or hmmm?
like lets say I have a bunch of data and stats and graphs in my r script, and some functions like which and i wanna use that in my py code like can i do that?
I mean, you could put them in
if you expect to be able to run them
that's more complicated
look into foreign function interfaces
Hi I have a question, can anyone help with running a Linear Regression in Python? I have a dataset that has 5 categorical variables and 1 dependant variable, and im a little bit confused on how best to do this. Before learning Python, i did this sort of stuff in SPSS
does anybody knows why this is happening
do I have to uninstall Python and reinstall it for all users???
or guide me if this is not the channel for this question
take a look at this link and see if it applies to your situation https://matplotlib.org/stable/gallery/lines_bars_and_markers/categorical_variables.html
I have a bunch of dataframes like this:
tag,precision,recall,f1
ADE,0.106,0.062,0.078
Dosage,0.788,0.804,0.796
Drug,0.534,0.655,0.588
Duration,0.674,0.609,0.640
Form,0.800,0.845,0.822
Frequency,0.668,0.725,0.695
Reason,0.250,0.259,0.254
Route,0.759,0.730,0.745
Strength,0.767,0.828,0.796
I want to name each dataframe and get the "argmax" of all the frames. So if dataframe A has the highest value for (Form, recall), I want that cell to be 'A' in the resulting frame. I heard it suggested that one use a multiindex but it's not clear to me from the docs how it could be used for that.
need help with tensorflow ASAP
your best bet is to jump right in to describing what kind of help you need.
Anyone here familiar with chi square goodness of fit?
I have a problem with expected values that im generating that are far too small
basically im generating these values:
```py
val = np.array([abs(int(e)) for e in norm.rvs(loc=1800, scale=2000, size=25, random_state=144)])
```
Ill send you the rest of the code if you can help 😄
@short heart post the whole traceback
is there any finance / quant here. Having trouble understanding what a holding vector is .
Given a ticker of one of the stocks, compute the holdings vector h ∈ R^3 for the unique stock portfolio that is both dollar and beta neutral and has unit exposure to the specified stock
im completely new to scripts, so maybe im even in the wrong channel. but after installing the script of: https://github.com/andrewning/sortphotos im not able to run it, and i dont even know where to start
@carmine iron have never heard of a "holdings vector" before
@iron basalt you there? want to bounce some ideas off you
@severe python yeah me neither...i think it has to do with covariance matrix and optimizing the portfolio for each beta under/over 1
but honestly i could be way off. its all linear algebra
that would make sense, not sure why it couldn't have been said in simple terms
well even with that understanding, i am still stuck on how to proceed. All i have is a df with date index, benchmark of SPX, then three unique stocks
Has anybody here encountered this error?:
hmm, so you're not constructing a portfolio, you're evaluating a company's exposure to USD and their beta in general? @carmine iron
SPX is the proxy, it will be a portfolio of the three stocks given not including SPX. After the holdings vector is found, i need to calculate daily PnL
i believe the exposure should be to SPX
unit_exposure is an argument in the function
ah i see, i understand the concept but can't really apply it to python because i'm relatively new to it
@severe python lets work together! whats the concept.
even if talking in terms of excel, i dont have too much time left to figure this one out
anyone know how to fix this?
will look into it thanks
in layman's terms, you are looking for exposure of one stock to SPX. to get beta you could divide the covariance between the stock's returns and SPX's returns by the variance of SPX's returns (use 3yrs or so of data). or if they want you to use linear regression you could do that
then you could filter with IF beta over/under 1 then .... whatever. not sure if you are looking for currency exposure as well but could do something similar would have to google. not sure that any of that helps
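A minimal sketch of the beta estimate described above, assuming you already have two aligned series of periodic returns (not prices). The name `estimate_beta` and the sample numbers are made up for illustration.

```python
import numpy as np


def estimate_beta(stock_returns, market_returns):
    """Beta = cov(stock, market) / var(market), using sample statistics."""
    stock = np.asarray(stock_returns, dtype=float)
    market = np.asarray(market_returns, dtype=float)
    if stock.shape != market.shape:
        raise ValueError("return series must have the same length")

    # np.cov returns the 2x2 covariance matrix; [0, 1] is the cross term.
    covariance = np.cov(stock, market, ddof=1)[0, 1]
    return covariance / np.var(market, ddof=1)


# Hypothetical returns: the stock moves 1.5x the market plus a constant.
market = np.array([0.010, -0.020, 0.015, 0.003, -0.007])
stock = 1.5 * market + 0.001
beta = estimate_beta(stock, market)
```

This gives the same slope as an ordinary least-squares regression of the stock's returns on the market's returns, so the regression route mentioned above should agree with it.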
thanks that does!
i can't figure out how to run spark on a jupyter notebook. Please help
Kaggle
and ?
take a look at some of the vids from this playlist https://youtube.com/playlist?list=PLtqF5YXg7GLlHv-pD8PVu6NFqjwG-_U-s
thanks
np
One time operation or do you plan to do this many times? I don't know about multi index, but for a one off I'd just iterate. And for many times perhaps just make a 3d array of just precision recall f1, and store the tag externally. I'm assuming tag stays same in same order for all df.
I'd like a generalized solution because I'll probably need it multiple times.
Does the assumption about the number of rows and the tag column hold?
You can assume that every dataframe will have identical sets of indices and columns and I don't care if violating that assumption has unpredictable behavior
how to get row count in pandas with chunksize?
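One way to do this, as a sketch that assumes the chunks come from `pd.read_csv` (the question doesn't say): passing `chunksize` makes it return an iterator of DataFrames, so the row count is the sum of the chunk lengths. The helper name `count_rows` and the in-memory CSV are made up for illustration.

```python
import io

import pandas as pd


def count_rows(source, chunksize=1000):
    """Count data rows by summing chunk lengths, one chunk in memory at a time."""
    return sum(len(chunk) for chunk in pd.read_csv(source, chunksize=chunksize))


# Small in-memory CSV standing in for a large file on disk.
csv_text = "a,b\n" + "".join(f"{i},{i * 2}\n" for i in range(10))
total = count_rows(io.StringIO(csv_text), chunksize=3)
```

The header line is not counted, since each chunk only holds data rows.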
Ah then yeah, my knee jerk reaction is to make a 3d array
Shape (num_dataframes, rows, cols) and then just freely use numpy operations after slicing for a single row
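A rough sketch of the 3d-array idea, assuming every frame has the same index and column labels as stated above. The function name `argmax_frame`, the frame names, and the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd


def argmax_frame(frames):
    """Return a frame whose cells name the input frame holding the largest value.

    `frames` maps a name to a DataFrame; all frames are assumed to share
    the same index and column labels.
    """
    names = np.array(list(frames))
    first = next(iter(frames.values()))

    # Reindex so rows and columns line up with the first frame, then stack
    # into shape (num_dataframes, rows, cols).
    stacked = np.stack([
        df.reindex(index=first.index, columns=first.columns).to_numpy()
        for df in frames.values()
    ])

    # argmax over axis 0 picks, per cell, the position of the winning frame.
    winners = names[stacked.argmax(axis=0)]
    return pd.DataFrame(winners, index=first.index, columns=first.columns)


index = ["Form", "Color"]  # hypothetical tags
columns = ["precision", "recall", "f1"]
frames = {
    "A": pd.DataFrame([[0.9, 0.8, 0.7], [0.1, 0.2, 0.3]], index=index, columns=columns),
    "B": pd.DataFrame([[0.5, 0.6, 0.9], [0.4, 0.5, 0.6]], index=index, columns=columns),
}
result = argmax_frame(frames)
```

On ties, `argmax` returns the first maximum, so the frame listed earliest in `frames` wins.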
I am using Apache Spark (specifically, pyspark) for some data processing. I noticed that the syntax for "case when" and for "when otherwise" is very different and they are used differently. Can someone explain the pros/cons of each method? Thanks!
It looks like the "case when" approach is potentially harder to use and debug because you are writing a big expression string
hey folks, maybe you can chime in when available. I'm using sqlalchemy to pull some data (data = db.session.query), with a whole bunch of questions, a few of which contain datetime values
i then go on to drop that data in a pandas df, so i can prep it for conversion to xlsx
df = pd.DataFrame(data, columns=['upload_timestamp',



