#data-science-and-ml
Q(action) = Q(action) + learning_rate * (reward_t + discount_factor * MaxQ(s_t+1,a) - Q(Action))
?
the value in the table that corresponds to the action that just got us to the goal
is being added to because we just got a reward
Q takes two arguments.
a state and an action?
Yes. As shown in the link.
ok Q(state_t, action_t) is being incremented
based on both the current reward
and the future predicted reward?
Not incremented necessarily, just updated.
right
in this part MaxQ(s_t+1,a) what does a represent?
the action we just took?
and s_t+1 is the state we got as a result of action at time t
Ok, so from the beginning, we are at s_t and take action a_t, we are now at s_t+1. And according to the equation we are updating Q(s_t, a_t), which is not a value at the current state Q(s_t+1, ...).
not a value?
oh right
because we haven't gotten a reward yet
so everything in the tables is just 0
Q(s_t, a_t) does not hold the value at the current state s_t+1, it's the previous state (and action from there).
oh... so s_t+1 is the current state? the state after we took the action at time t?
Yes.
If you are at s_t, and take some action, you are now at s_t+1. Or from a different POV (looking into the past), you are now at s_t, and were at s_t-1.
You could, if you wanted to, rewrite the equation from that POV, but it's the same thing.
(Just trivial change of variable names)
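A minimal sketch of the tabular update being discussed, in Python. The table layout (a dict keyed by `(state, action)`), the helper name `q_update`, the `done` flag and the toy corridor are assumptions for illustration, not something taken from the link.

```python
from collections import defaultdict

def q_update(Q, s_t, a_t, reward, s_next, actions,
             learning_rate=0.1, discount_factor=0.9, done=False):
    """Update Q[(s_t, a_t)] in place after moving from s_t to s_next."""
    # Best value the table currently predicts from the new state s_t+1.
    # A terminal state has no future, so its contribution is 0.
    max_next = 0.0 if done else max(Q[(s_next, a)] for a in actions)
    td_target = reward + discount_factor * max_next
    Q[(s_t, a_t)] += learning_rate * (td_target - Q[(s_t, a_t)])
    return Q[(s_t, a_t)]

Q = defaultdict(float)          # every entry starts at 0
actions = ["left", "right"]

# Hypothetical corridor A -> B -> goal, reward 1 only for reaching the goal.
q_update(Q, "B", "right", 1.0, "goal", actions, done=True)
q_update(Q, "A", "right", 0.0, "B", actions)
print(Q[("B", "right")], Q[("A", "right")])
```

Note that the second call gets no reward at all; the entry for A only moves because the entry for B is already non-zero.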
If you look at wikipedia for example: https://en.wikipedia.org/wiki/Q-learning
Q-learning is a model-free reinforcement learning algorithm to learn the value of an action in a particular state. It does not require a model of the environment (hence "model-free"), and it can handle problems with stochastic transitions and rewards without requiring adaptations.
For any finite Markov decision process (FMDP), Q-learning finds a...
In the algorithm section:
```
Before learning begins, Q is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t the agent selects an action a_t, observes a reward r_t, enters a new state s_{t+1} (that may depend on both the previous state s_t and the selected action), and Q is updated. The core of the algorithm is a Bellman equation as a simple value iteration update, using the weighted average of the old value and the new information
```
but is this correct then?
MaxQ(s_t+1, a_t)
so the first time it reaches the end of the maze, that MaxQ is no value, but when it reaches the second to last square, the MaxQ gets passed the last square as its state?
The reward is given for the transition from s_t to s_t+1 (r(s_t, a_t)).
Since you are reaching the end goal, there is a reward.
Which is used to update the value.
But Q-learning uses the reward given and the values. So for ones where there is some next possible action (not at the terminal / final state), there is some max value.
OH
I think i'm starting to understand why it's QMax
when updating the weights for state t
it doesn't just add the reward
Remember that the values (not rewards), are taking into account the future in this back chaining way.
And they take into account the max possible on the next.
it also adds the best move from state t+1
the one it thinks is the best move anyway
(greedy)
so if you get a temporary reward, but the highest move from there is only negative, like dead ends
it'll punish it
Punishment comes in the form of the reward (e.g. negative or just less), value is trying to take into account future discounted rewards.
but is it recursive?
or is it just the t+1 move Q value that gets added in
not t+2
So lets say i'm playing chess and I take a Queen, huge reward. But now my opponent wanted that and they sacrificed it. So they make a move that leaves me with only like 3 moves and all of them are bad. They have bad value, but I will still pick the best one (the max given some action). So now next time i'm in that first state, I know that while my current next reward for taking the queen is good, taking into account my future options (even the best one / max), it's not worth it and I need to update my values to reflect that.
oh, because every state takes into account the state right after it, it's not recursive, but it has essentially the same effect
so three moves before taking the queen, that updated weight gets factored in implicitly because each move considers the Q value of the move right after it as well
and it's like links in a chain
Single step here, going from s to s' by taking action a, resulting in reward r. Now you can update Q(s, a) (update your table if it's tabular).
which python version is best for all D
Dl, Comp vision stuff, and visualisation and all
"trajectories", yes.
3.7.X?
Another way to look at the terminal state in chess is that in that case you only have 1 action, surrender, and so the max of the options is surrender (only item to do max of), and it's a really low max.
I really appreciate you taking the time to explain this
But you don't actually need a future value for the terminal state, because there is no future after that.
The bad reward for getting there will propagate on its own. (max of 0 actions to take, can just let it give 0)
(Or just not have the terminal state be a special case and treat surrender as an action, whatever, same thing, depends on how you prefer to code it)
I need To know how to make image matching in python like in gta 4 or police verification systems
Like how to achieve that ?
Hello guys! Nice to meet you all :), I was wondering if you could help me with something. I'm just within my Data Science master's program, and I'm still a little bit new to this field. My inquiry is, is there something such as a "nested time series regression" model using Keras? Like, a model using a time series data, but instead of having a defined "batch size" for the input of the model, the input of the model depends of the entries available of each "ID" or "Patient" in the dataset? (like treating each batch/sample as an independent one for the input of the time series model)
I hope I'm making myself clear, I'm just struggling with a project right now
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
to find a correlation between the two, is it right to say:
- since satisfied column is left-skewed, scores of 41 and 43 are the most frequent scores of being satisfied with their flight, or lower scores (22 or less) exponentially decreases with the decrease in scores
- as neutral or satisfied column is normally distributed, it is very close to the median Final Score of 37.5 compared to the satisfied columns, with its most common value being 36.
- we can conclude that the highest values of both categories of satisfaction are considerably close to the median Final Score of 37.5, making the Median having a correlation between both satisfaction categories, even if the Neutral or Dissatisfied Columns are left-skewed, it does not deviate much
I'd like to access each individual group of a pandas group by. (there are about 277k of those).
The goal is to pass each individual group though a function, what would be the most efficient way to go about this?
So what 277k groups which you each want to put through the same function?
(data on) users of a website @haughty topaz
If it's a simple function then it's probably best to use the inbuilt method if it exists, or lambda. A loop would probably be slower but easier to implement
for name, groupdf in df.groupby('a'):
    print(name)
    print(groupdf)
what is the function doing?
Dividing users into classes based on row values, so not so simple
I did this:
groups.append([list(groups) for groups in grouper.groups])
This is probably not very efficient (and I guess my kernel doesn't like it)
once you have the groupby, you can just iterate over it. it will give you tuples of (label, dataframe)
I'll give it a try
I've rewritten some deep Q learning code that was originally built to learn flappy bird (which it does perfectly after 2 million rounds)
I revamped it to try to train this stationary blue circle to shoot this stationary red circle
but it's been about 900,000 rounds, and it's still not learning effectively
the reward structure for this model is
Doing Nothing -> 0
Firing -> 0.1
Aiming up one degree -> 0.2
Aiming down one degree -> 0.2
having the laser pointed at the red dot -> 1.0
having a bullet overlapping with the red dot -> 10.0
it trains every turn by selecting a batch of 32 turns chosen randomly from the previous 10,000, and doing the Bellman Equation on them
any idea why it's not training as effectively as the flappy bird model with the same code was?
I know nothing about this problem so I'm probably no help, but have you observed the laser ever actually point at the red dot? I thought that maybe it's just continually shooting below, and since it never sees the dot it's just aiming up and down randomly, which might not be enough to get to the point where it sees it in the first place. Maybe you can alter the reward to be +0.2 for rotating aim to the right and +0.1 to rotate aim to the left so it tends to drift to the right, so it'll eventually intersect with the dot if that is the problem.
it has an epsilon value (that starts at .01 and is now 0.05) and every turn it generates a random number. if the number is less than epsilon, it does a random action regardless of what the neural net tells it to do
and I've observed it both shooting the target successfully and aiming at it successfully
but its rate of success has barely gone up at all
Also my intuition suggests that having a reward for pointing at the dot is a bad idea because that's not the actual angle it needs but idk
wait why
ball goes straight
oh ok I thought it was falling in a parabola, that makes sense then
oh yeah
so if the laser points and fires a bullet at the dot it will for sure overlap with the red dot later?
it seems weird to have two rewards for what will be the same outcome
I added the laser overlap reward as an attempt to get it to learn more effectively, but it doesn't seem to have worked
I was worried that because the bullet hits like 50 turns later, it was too far in the future to effectively teach the agent
my intuition would be to try removing the overlap reward and making the aim up reward double the aim down reward to see what happens. Definitely not my area of work though, this makes me want to look into it
original code is here if you want to try it
thanks
is anyone familiar with the loader issue in google colab
it's driving me nuts
i am trying to import packages and i keep hitting the loader issue
i don't really use google colab so idk
what does it look like
/what are you trying to do
I'm trying to import TimeSeries from darts
But I keep running into an error that says it's missing a positional argument
I figured it out though
I had to download an old version of Colab
Hi, I have a problem that confuses me. In the cost column, some of the values are lists. How do I detect all of the values in the cost column that contain a list?
this seems like a weird situation to have arrived at. you should probably post what the dataframe looked like earlier in the program and explain what you are trying to do
because now that you have numbers and lists of numbers in the same Series, that is bad.
Hm... i'm getting an error in my model that it failed to converge
not totally sure how to proceed with it
or how to go about fixing it
i've been browsing stack overflow for the answer but haven't found a ton of helpful resources.
but how do I show the mode in that case?
hey folks, I am given a problem statement, to predict whether a transaction might be fraud or not based on few parameters, and there are some customer id, should I cluster them on their id and rather than predicting their transaction one by one?
Hey @jaunty sky!
It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
IDs are usually arbitrary values, whereas you want to cluster things based on non-arbitrary properties of what you're trying to represent.
what information are you given about each transaction?
it will be easier for me to help if you answer the question directly. I don't want to have to wade through a pdf.
this is the exact task
so the main question is to predict whether a transaction is fraud or not
it pretty simple (just the question statement)
but its not given whether to cluster them or not
Hello, was hoping to ask if anyone knew how to see the row by row changes that happen when using vectorized operations in pandas/numpy.
Without looping through row data ideally.
since the vectorized operation is a lot faster without needing to loop.
sir, please help me
What information are you given about each transaction?
predict whether it is fraud or not
That's what you're being asked to do. It doesn't answer my question
can someone help me with pandas?
Don't ask to ask, just ask
Ah, ok
I need to do an analysis of a championship. Each year has a file and I need to merge this data, is it possible?
So each file is a csv?
json and csv
Does the JSON have basically the same kind of data
Text files are as json and tables as CSV
So each year has a JSON and a csv?
just drag an example json and csv into the chat
or something. just leaving it at "it's complicated" doesn't help me help you.
['step', 'type', 'amount', 'nameOrig', 'oldbalanceOrg', 'newbalanceOrig', 'nameDest', 'oldbalanceDest', 'newbalanceDest', 'isFraud','isFlaggedFraud']
great. for all the ones that are fraud (according to isFraud), what properties do they have in common? are there "nameDest" that are more frequent for fraud values? what about the change in balance?
```
type - CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER.
amount - amount of the transaction in local currency.
nameOrig - customer who started the transaction
oldbalanceOrg - initial balance before the transaction
newbalanceOrig - new balance after the transaction
nameDest - customer who is the recipient of the transaction
oldbalanceDest - initial balance recipient before the transaction. Note that there is not information for customers that start with M (Merchants).
newbalanceDest - new balance recipient after the transaction. Note that there is not information for customers that start with M (Merchants).
isFraud - This is the transactions made by the fraudulent agents inside the simulation. In this specific dataset the fraudulent behavior of the agents aims to profit by taking control or customers accounts and try to empty the funds by transferring to another account and then cashing out of the system.
isFlaggedFraud - The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction.
```
these are attributes details
@grave cloak this json can't be read directly as tabular data. how does it relate to the content in the CSVs?
there's unfortunately way too much content for me to wrap my head around what the schema of the JSON is.
Ok, I get it. I'm talking to the author of the dataset
@serene scaffold do you know of a way to get debug output when a vectorized operation is carried out in pandas?
Let's say I wanted to cast all my dates to datetime64 and I wanted to see what rows got operated on. Is this possible?
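As far as I know pandas has no built-in per-row trace for vectorized operations. One workaround is to keep the original column and compare it with the result afterwards. A minimal sketch, assuming a hypothetical `date` column of strings:

```python
import pandas as pd

df = pd.DataFrame({"date": ["2022-06-08", "not a date", "2022-06-10"]})

# errors="coerce" turns unparseable entries into NaT instead of raising.
converted = pd.to_datetime(df["date"], errors="coerce")

# Rows that had a value before but could not be converted.
failed = df["date"].notna() & converted.isna()

# Side-by-side view of every row, before and after the operation.
report = pd.DataFrame({"before": df["date"], "after": converted, "failed": failed})
print(report)
```

The conversion itself stays vectorized; the comparison is just a second vectorized pass over the same rows.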
Hey, so I have an np array of say 70K points with their x and y coordinates, with shape [70K, 2]. These points cover a region of about 5368 x 7152 units. I want to sample about 1/25th of the points from them, uniformly based on their local densities. This means removing points from denser regions more than from sparser regions, which will give me a nice uniformly distributed point array. Any idea how I can do it super fast?
Currently, I put all those points on a 2D array of shapes [5368, 7152] and slide a kernel of shape 11x11 onto it to get a rough local density estimate by which I probabilistically sample the points.
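A cheaper variant of the same idea, sketched with NumPy only: assign each point to a coarse grid cell, count points per occupied cell with `np.unique`, and sample with probability inversely proportional to that count. The cell size of 64 units, the function name and the synthetic test points are assumptions; this is a rough sketch, not a tuned method.

```python
import numpy as np

def density_weighted_sample(points, n_samples, cell=64.0, seed=0):
    """Pick n_samples rows of points, favouring sparse regions."""
    rng = np.random.default_rng(seed)
    # Integer grid cell of every point.
    ij = np.floor((points - points.min(axis=0)) / cell).astype(np.int64)
    # Count points per occupied cell; `inverse` maps each point to its cell.
    _, inverse, counts = np.unique(ij, axis=0, return_inverse=True, return_counts=True)
    inverse = inverse.ravel()
    # Points in crowded cells get proportionally smaller weights.
    weights = 1.0 / counts[inverse]
    weights /= weights.sum()
    idx = rng.choice(len(points), size=n_samples, replace=False, p=weights)
    return points[idx]

# Synthetic check: one very dense blob and one sparse region.
rng = np.random.default_rng(1)
dense = rng.uniform(0, 10, size=(1000, 2))
sparse = rng.uniform(1000, 2000, size=(100, 2))
points = np.vstack([dense, sparse])
sample = density_weighted_sample(points, 50)
print(sample.shape)
```

This never allocates the full 5368 x 7152 array, only one entry per occupied cell, which is where most of the saving would come from.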
Hello all. Deep learning/machine learning newbie but ex-competitive programmer here. Is there any problem based curriculum to teach myself deep learning? Like how codeforces/hackerrank has problem based structures. Step by step problems getting harder to solve and along the way require you to learn new things. Any tips welcome
are there any servers dedicated for python ai?
Hey guys, is it normal for a DCGAN to output images with a margin of 9 squares on them? It feels like it adds a # mask on its outputs, even if the generated images are good.
I don't know if it's a sign of overfitting, if it's normal or if it's another problem...
the code might be missing a post-processing step
if its a consistent output of a 9-square margin, you can just add a step to remove said margin
otherwise, youd need to investigate further, starting with maybe the training data as well as padding
Strange... The thing is...the output, at the beginning of the training, seems to include no squares. However, as the generator gets better, the squares show up, gradually.
This is why I thought it was a sign of overfitting.
have you checked the training data itself
like just look through it
actually
what model are you using
since maybe its been trained with data with padding
Yes, I did the training data myself. It's not a big dataset like CIFAR, but I think it'll do, right?
DCGAN
It wasn't. It was trained with Batch Normalization layers. I'm following Pytorch's DCGAN tutorial.
yeah i just looked at the original paper https://arxiv.org/pdf/1511.06434.pdf
looks like so
i have no idea then
we used pix2pix and cycleGAN for our GAN project
and never faced the issue youre facing



computer vision isnt my specialty so maybe someone else has insight into whats going on
I actually removed the Batch Normalization now because I thought it was that that was causing this issue
But...now I'm seeing those squares again
looks like you found out one thing though
it wasnt the batch normalization step

maybe it is just overfitting. idk why that would be a manifestation of overfitting tho
Yeah, I'm starting to think it's overfitting
The squares appeared after 400.000 epochs
I thought it was normal to have that many epochs 
At least I've read in some paper that the guys used many, many epochs
not for GANs my dude
straight from google's intro to GANs
Oh yeah, I had a problem with that in certain model, then I had to use 235.000 epochs
Perhaps my memory is playing tricks at me and it wasn't a paper for GANs, but for Reinforcement Learning
that

Uh...well, then...how many epochs did you use?
I'm using 16x16 images. I thought 500.000 epochs were ok...until now.
i dont remember 
but def not as much
I only noticed overfitting with 32x32
And it was quite clear it was overfitting with model collapse
Can you see the squares?
Fun fact: I removed batch norm because it caused those squares to appear with less epochs, this is why I thought the batch norm was causing them.
It seems the batch norm was only making the generator able to reach its ideal point faster 
I have a quick question about pandas dataframe manipulation
i have a dataframe that i need to limit certain categories of
for example
i have a dataframe of house details
i want to limit the categories of bathrooms and bedrooms to only houses with 2-3 bathrooms and houses with 3-4 bedrooms
i'm looking through the documentation but i cannot find anything thus far
Hm... it's been some time since I don't deal with pandas, but you probably should check pandas.drop()
hm yeah
i figured i'd have to use drop
but i'm more so looking to get those specific parameters
and figure out how to drop it so that i can get those specific parameters
i don't know if a for loop would do me good here
I think you'll use something like data = data.drop("bathroom" < 2) or something like that
that might work
Probably won't, since I don't remember the command, but it's something around that
data = data.drop(data["bathroom"] < 2)
And be careful with the argument axis=0 or axis=1
hm
that's not quite it either
'<' not supported between instances of 'str' and 'int'
the error i got
Check if the numbers in your bathroom column are real numbers and not strings
hm
so i came up with some sort of solution but
it doesn't seem to be a fix all
```
housedf = housedf[df.bedrooms != 5]
housedf = housedf[df.bathrooms != 1]
housedf = housedf[df.bathrooms != 4]
housedf.head()
```
and yes i know it's bad code
but in bathrooms it doesn't register that 4.50 shouldn't be there
so i need to figure out how to put it in a range
housedf = housedf[(df.bathrooms< 4)|(df.bathrooms>=5)] i think would work for that specific line of code
i think in general you may be able to simplify that logic just using a range of what you're looking for
sorry, corrected something with my suggestion, i think. & = and, | = or
hm i'll try it
for some reason it's still including values i am trying to tell it to explicitly exclude
```
housedf = housedf[(housedf.bedrooms > 2)|(housedf.bedrooms < 5)]
housedf.head()
```
square = np.zeros((300,300) , np.uint8)
hey guys, this is a piece of code that i'm not understanding
I get that it's making a matrix with only zeroes, but what is 'np.uint8' here?
because I've seen it used with np.uint32 as well and I don't understand that
@sick fern unsigned eight bit integer
could you elaborate further? Sorry, I've never dealt with numpy before
You know how data is just bits? Everything on the computer is 0 or 1?
Yeah
An 8 bit integer is a whole number represented using 8 bits. So you can represent 2⁸ = 256 different values (0 to 255 when unsigned).
And it's unsigned, so there isn't a bit used to tell you if it's positive or negative.
You're telling numpy to create an array using that data type. Rather than a 32 bit integer. Or a 64 bit float. Or something.
It might be more memory efficient, but I'm not really sure.
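To make the memory point concrete, a small sketch (the comparison with `np.uint32` is added here for illustration):

```python
import numpy as np

square = np.zeros((300, 300), np.uint8)

print(square.dtype)                                     # uint8
print(np.iinfo(np.uint8).min, np.iinfo(np.uint8).max)   # 0 255
print(square.nbytes)                                    # 90000: one byte per element
print(np.zeros((300, 300), np.uint32).nbytes)           # 360000: four bytes per element
```

So the same 300 x 300 array takes a quarter of the memory as `uint8`, at the cost of only holding values from 0 to 255.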
You're welcome
Can you help me unite values in Pandas?
Unite? In what way?
It has the name of the clubs that appear in different rounds, there are 38 in total, I have to gather and show how many goals each team scored in total
But only the minute the goal was scored appears, not the goals themselves.
@grave cloak does the minute determine what round it was part of?
No. It only has the id of the match, the minute in which the goal was scored, and the round
@grave cloak can you do a group by, like I showed you before?
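A sketch of what such a group by could look like, assuming each row of the table is one goal. The column names (`club`, `match_id`, `round`, `minute`) and the data are made up, since the real file layout is not shown here.

```python
import pandas as pd

# Hypothetical layout: one row per goal, with the scoring club, match id, round and minute.
goals = pd.DataFrame({
    "club": ["A", "B", "A", "C", "A", "B"],
    "match_id": [1, 1, 2, 2, 3, 3],
    "round": [1, 1, 2, 2, 3, 3],
    "minute": [12, 45, 30, 88, 5, 67],
})

# Each row is one goal, so counting rows per club gives total goals.
goals_per_club = goals.groupby("club").size().sort_values(ascending=False)
print(goals_per_club)
```

If rows with a null minute mean "no goal", they would need to be dropped first (for example with `goals.dropna(subset=["minute"])`) before counting.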
It didn't work, I tried but it's too long and full of unnecessary information
I thought of a way to do...
If you took the null minutes from each game of each team and each round, and could show how many times they appeared
I'm getting a dimension mismatch here and insanely high loss and 0 accuracy
what is wrong with my implementation?
I don't think the dimension mismatch in the plot is relevant to the model accuracy and loss though
You're using the '|' in this statement which is OR instead of what should be an AND, '&'
you could try housedf.loc[housedf.bathrooms.isin([2,3]) & housedf.bedrooms.isin([3,4])], which should be cleaner, assuming bedrooms and bathrooms are ints
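Since the bathrooms column apparently contains values like 4.50, `isin` would miss something like 2.5, so a range check may fit better. A sketch with made-up data, using `Series.between` (inclusive on both ends by default):

```python
import pandas as pd

housedf = pd.DataFrame({
    "bedrooms": [2, 3, 4, 5, 3, 4],
    "bathrooms": [1.0, 2.0, 3.0, 4.5, 2.5, 4.0],
})

# & means AND: a row must satisfy both conditions to be kept.
mask = housedf["bathrooms"].between(2, 3) & housedf["bedrooms"].between(3, 4)
filtered = housedf[mask]
print(filtered)
```

Rows 1, 2 and 4 survive; everything with fewer than 2 or more than 3 bathrooms, or outside 3 to 4 bedrooms, is dropped.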
Yeah I realized my error eventually
I fixed it
trying to follow along with the code for sklearn's logistic regression (https://github.com/scikit-learn/scikit-learn/blob/80598905e517759b4696c74ecc35c6e2eb508cff/sklearn/linear_model/_logistic.py#L754) and am a bit confused about fit_intercept. where in the code is the intercept added to X?
I want to get the most frequent year and I get an error. How do I fix that?
try replacing .mode with .agg(pd.Series.mode)
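A sketch of that idea on made-up data (the `genre` and `year` columns are assumptions, the original dataframe is not shown). It wraps `Series.mode` in a small lambda and takes the first result, so that ties still give a single value per group:

```python
import pandas as pd

# Hypothetical data: find the most frequent year per genre.
df = pd.DataFrame({
    "genre": ["rock", "rock", "rock", "jazz", "jazz"],
    "year": [1999, 1999, 2005, 2010, 2010],
})

# mode() returns a Series (there can be ties), so take its first entry.
most_common_year = df.groupby("genre")["year"].agg(lambda s: s.mode().iloc[0])
print(most_common_year)
```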
given this image, and the string "bluetooth address", I have to output the corresponding value
this is the image
I got it somewhat functional using pytesseract. The text it identifies isn't perfect though
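import pytesseract
from PIL import Image
import difflib

# Path to the Tesseract binary (Windows install location; adjust for your machine).
pytesseract.pytesseract.tesseract_cmd = 'C:/Users/NAMEHERE/AppData/Local/Programs/Tesseract-OCR/tesseract.exe'

# OCR the image and split the recognised text into lines.
text = pytesseract.image_to_string(Image.open('unknown.png'), config='--psm 3')
lines = text.split('\n')

string_to_match = 'bluetooth address'
for line in lines:
    # Fuzzy match, since the OCR output is not perfect.
    match = difflib.get_close_matches(string_to_match, [line])
    if match:
        # Print everything after the first colon, so a value containing colons stays in one piece.
        print(line.partition(':')[2].strip())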
For 'bluetooth address'
You'll probably have to look into how to make it detect better
this is what it read
Needed some help with Python (particularly MatPlotLib)
I need to plot some data as a stem plot which is very closely spaced sometimes
Now what matplotlib does is that it just plots the two stems on top of each other
I think this won't happen if the least count is reduced along the x-axis
But I am not getting any documentation online on how to do that
Note that I don't want to change x-ticks since I don't care about the text labels; I want to change the least count which matplotlib is using to put the stems on the graph
I dont have a perfect solution off the bat, but maybe you can do grouped bar chart. then you make the single bars very narrow (basically like a line) and then add a circle as a marker for each bar at the top.
so you'd have a basically a bar plot that looks like a stem plot
Okay I'll try and see if this looks better
so basically you'd take a bar chart like this, and style the bars with almost no width. and then simply take a marker like the circle and set it to the same data as the bars
How do you get the markers, like which type of plot do you use to get the circle, triangle, square markers?
I have a question myself: It is very easy to look at correlations of a dataframe in python with a heatmap
It is also very easy to plot autocorrelations with statsmodels:
plot_acf(data, lags=50)
But now, how can I find out if there is a correlation in a more complex dataset? Right now I have around 60 columns and about 3000 rows. How can I see if there is a strong correlation in the dataset between column 23 and column 54 with a 5 day lag or so? Is there some way find it automatically? Or would I need to create this manually? Or maybe is there an interactive plot where I can try different lags and different columns and see if there are strong correlations?
I feel like this might be a standard problem for which there must be some generic solution. I can't be the first one to look for signs in data, if there is some pattern which is correlated with an event a few days later?
Hello, I'm looking for help regarding seaborn heatmap
I follow a training on this topic but I can't see how to handle the topic.
I have a list of products such as pasta, vegetables in a dataframe. One column is the timestamp when the product has been added to the DB.
I'm requested to make a heatmap of when the products are added (crossing month from 1 to 12 and hours from 1 to 24). I'm good with x and y but I don't know how to compute values
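One way to compute the values is to count rows per (month, hour) pair with `pd.crosstab`. A minimal sketch, assuming a hypothetical `added_at` timestamp column; note that `dt.hour` runs from 0 to 23 rather than 1 to 24.

```python
import pandas as pd

# Hypothetical data: one row per product, with the time it was added.
df = pd.DataFrame({
    "product": ["pasta", "carrots", "rice", "beans"],
    "added_at": pd.to_datetime([
        "2022-01-05 09:15", "2022-01-20 09:40", "2022-03-02 18:05", "2022-03-02 18:30",
    ]),
})

month = df["added_at"].dt.month.rename("month")
hour = df["added_at"].dt.hour.rename("hour")

# Rows = month, columns = hour, cell = number of products added.
counts = pd.crosstab(month, hour)
# Fill in months and hours that never occur so the grid is always 12 x 24.
counts = counts.reindex(index=range(1, 13), columns=range(0, 24), fill_value=0)
print(counts.loc[[1, 3], [9, 18]])
```

The resulting table can then be passed to `seaborn.heatmap(counts)`.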
try looking at partial autocorrelation, plot_pacf, I don't know the maths behind it but it basically accounts for the correlation in previous lags built in. I think you can then get those values and test which are significant which is basically above the shaded blue area in the plot
Hey guys, does anyone here have experience with data mapping using AI? My organization does a lot of work mapping client metadata fields to internal ones and I'm interested in automating the process.
Have you tried lagging your data by one period and then regressing your data on your lagged data?
https://stackoverflow.com/questions/72559796/pandas-python-export-to-xlsx-with-multiple-sheets hi guys i have a problem with exporting multiple xlsx as sheets
yeah, i've been using pacf plots a lot as well. the thing is just, that it is also based on the residuals of autocorrelation. So it looks at the correlation within itself of a timeseries, not at the correlation between two different timeseries in the time dimension. So that does not help with finding autocorrelations between columns in time.
that is what I was hoping to avoid, as I would have to create a heatmap per column and per lag. so that would be roughly 60 columns x 50 time lags x 60 columns = 180.000 heatmaps to look through.
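Instead of looking through heatmaps, the lagged correlations can be collected into one table and sorted. A sketch, assuming rows are equally spaced in time (one row per day); the function name and the synthetic demo columns are made up for illustration.

```python
import numpy as np
import pandas as pd

def lagged_correlations(df, max_lag=5):
    """Correlation of every column with every other column shifted by 1..max_lag rows."""
    rows = []
    for lag in range(1, max_lag + 1):
        shifted = df.shift(lag)            # value from `lag` rows earlier
        for target in df.columns:
            for source in df.columns:
                if source == target:
                    continue
                rows.append((source, target, lag, df[target].corr(shifted[source])))
    out = pd.DataFrame(rows, columns=["source", "target", "lag", "corr"])
    # Strongest relationships first, regardless of sign.
    return out.sort_values("corr", key=lambda s: s.abs(), ascending=False).reset_index(drop=True)

# Synthetic check: column b follows column a with a delay of 3 rows.
rng = np.random.default_rng(0)
a = rng.normal(size=300)
noise = rng.normal(size=300)
demo = pd.DataFrame({"a": a, "b": np.roll(a, 3) + 0.1 * noise, "c": noise})
top = lagged_correlations(demo).iloc[0]
print(top["source"], top["target"], top["lag"])
```

With 60 columns and 50 lags this is a few hundred thousand `corr` calls, so it may take a while, but the output is a single sortable table rather than thousands of plots.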
Oof
It totally depends on how the data is structured and what kind of job the AI has to do. Example: having unstructured data, you could use topic modelling or named entity recognition to map this data to your internal fields. it just totally depends on what you need to do and how good the data is
It's mostly a matter of mapping whatever column head a client has used to define an industry standard field (ex. Loan origination date) to the column head we use to denote the same field. Right now we are basically doing the whole process field by field manually, which is incredibly time consuming and tedious. I am hoping to build some sort of a software solution that will match fields based on some combination of data similarity and field name similarity so that match candidates are populated automatically and we just have to validate the generated matches and fill in any fields that weren't matched manually.
you could maybe use a simple way of vectorizing to calculate the distance between the two vectors (which is essentially a score of similarity). You'd have to test how well this approach works and it probably would still need manual control work, but it might be an easy and fast approach to solve this.
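A minimal sketch of that vectorizing idea using only the standard library: each column name becomes a vector of character trigram counts, and cosine similarity scores how close two names are. The field names and helper functions are made up for illustration, and real data would still need manual validation.

```python
import math
from collections import Counter

def trigram_vector(name):
    """Count character trigrams of a normalised column name."""
    text = "".join(ch for ch in name.lower() if ch.isalnum())
    text = f"  {text} "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine_similarity(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def best_matches(client_fields, internal_fields):
    """For each client field, return the most similar internal field and its score."""
    result = {}
    for field in client_fields:
        vec = trigram_vector(field)
        scored = [(cosine_similarity(vec, trigram_vector(cand)), cand) for cand in internal_fields]
        score, match = max(scored)
        result[field] = (match, round(score, 2))
    return result

client = ["Loan Orig. Date", "Borrower_Name", "int rate"]
internal = ["loan_origination_date", "borrower_name", "interest_rate"]
matches = best_matches(client, internal)
print(matches)
```

Low scores could be routed to manual review, which keeps the "validate the generated matches" step described above.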
Alright great, could you recommend any resources that I can use to teach myself how to apply that approach? I'm relatively new to data science and still have a lot to learn.
if you work with a lot of text data, you could start here: https://machinelearningmastery.com/natural-language-processing/
It is also a great resource for other disciplines of AI. It is also good to start with some simple statistics text book as a foundation before jumping into ML
Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software. The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers. In this post, you will […]
hiho, i am trying to make a script (argparse) from my data that i analysed.
unfortunately i don't get any output and some variables are not found.
kind of stuck.. any one down for a quick help
Great, thanks! I studied economics in college, so I have some exposure to statistical analysis, but I could use a refresher.
Anyone wana help me with my stats homework
don't ask to ask, just ask.
if you had asked an actual question, and it was one that I knew the answer to, I'd be answering it right now.
a question:
Lets imagine a scenario where we have to ask all the employees a set of question (mandatory btw) in which each question have a particular weight and based on their answers we have to put them in different categories aka buckets.
how can we implement this using python dynamically
that depends on the type of question. one way, for example, is to take multiple choice questions and encode the answers with a numerical value, e.g. an integer. then you can put all of those values into a vector. you can then split up the n-dimensional space into regions based on the values. you can find intersections of half spaces by just writing out inequalities
Let's assume there are 20 questions, single choice each, and all of the questions depict human nature. Based on that we need to decide if a person is a party head or an introvert who likes to stay home or something else, which is a different output
Now we have to make clusters of employees according to the output that their answers produced
this sounds like something you'd do with a spreadsheet or something of the sort, nothing python specific
you could use a bunch of ifs or inequalities
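A minimal sketch of the "encode the answers as numbers, then use inequalities" idea from above. The weights, the score thresholds and the bucket labels are all made up for illustration; only the general approach comes from the chat. Keeping weights and buckets as plain data is one way to make it "dynamic", since they can change without touching the functions.

```python
from typing import Dict, List, Tuple

# Hypothetical weights, one per question (the chat mentions 20; 3 are used here).
WEIGHTS: List[float] = [2.0, 1.0, 1.5]

# Hypothetical buckets as (inclusive lower bound, label), highest bound first.
BUCKETS: List[Tuple[float, str]] = [
    (10.0, "party head"),
    (5.0, "in between"),
    (0.0, "introvert"),
]


def weighted_score(answers: List[int], weights: List[float]) -> float:
    """Sum of answer value times question weight."""
    if len(answers) != len(weights):
        raise ValueError("all questions are mandatory: need one answer per weight")
    return sum(a * w for a, w in zip(answers, weights))


def assign_bucket(score: float, buckets: List[Tuple[float, str]]) -> str:
    """Return the label of the first bucket whose lower bound the score reaches."""
    for lower_bound, label in buckets:
        if score >= lower_bound:
            return label
    raise ValueError("score is below every bucket")


# Each answer is the numeric code of the chosen option (made-up data).
employees: Dict[str, List[int]] = {"ana": [3, 4, 2], "bo": [0, 1, 1], "cy": [1, 2, 1]}
result = {
    name: assign_bucket(weighted_score(answers, WEIGHTS), BUCKETS)
    for name, answers in employees.items()
}
print(result)
```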
Is it "fair", formally, if a university analyses the questions on which minority students perform worse. And deletes those questions from a test.
@serene scaffold
Help me with this shit
it seems like even the formal definition of fairness still leaves whether or not something is fair subjective, so: unclear imo
I'm not an ethicist. but I would wonder why minorities do worse on those questions. do the questions rely on a cultural context that isn't shared? or is it a socioeconomic issue?
I feel like it is unfair. Maybe because you are forcefully trying to admit minorities. And not based on their skillset.
From what it appears from the prompt. It is an analytical ability test
Not explicitly mentioned though
I think this is an important issue. But I don't think it's one that our community is really intended to facilitate.
3rd part
1 is definitely not true. Not sure about 3rd
Oh nevermind. It is not fair. I made it out from the options. Because second is true. And no option for 2 and 3
wrt test questions, 3 seems more compelling than 1
Is it "fair" or not. We don't know. It's not fair straightforwardly. It's only when you put in some social science in it and use the bigger picture. It might look fair then. But more like unfair rn. Because you are trying to create a "biased" test.
Assuming the questions are analytical in nature. And not some qualitative ones in that case the test might be biased originally and you removed the bias.
Guys, I'm using Pandas and when I try to see the values of different columns it returns the same value
Does anyone know how to solve?
are you trying to output the names of the columns in your dataframe?
Yes it was my mistake
Can you tell me if there is a way to add results from two groupby?
what do you mean by "add"? do you mean actually 1 + 2 or "put side by side"?
1+2
can you show what "the result of a groupby" looks like?
```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                  columns=['Col A', 'Col B', 'Col C'])
print(df.columns[0])
```
would print "Col A"
ok i will try
if the set of labels on the left are the same for both, you can literally just add them with +
hey guys can anyone help me with some ML/NLP error please ?
i searched on google and didn't find a solution
It is yes
anyone free to check #help-cake ? im trying to practice for exam tomorrow and im stuck on a problem
Did not work 🥲
saying that it "didn't work" doesn't help me to help you. what happened instead? did your computer explode?
No hahaha, but I went to do a sum and it added up the match. For example '4x2' became 6
I don't understand.
please show the exact code that you ran and explain what it did that was different from what you wanted.
df1 ['total'] = df1['mandante_placar'] + df1 ['visitante_placar']
did you get an error message or what?
But it's doing the sum of the match and not of each club
please do print(df[['mandante_placar', 'visitante_placar']].to_dict()) and put that in the paste bin
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
it has to be text in the paste bin that I can copy and paste. please do not post screenshots of text.
The content is too big
please give the code in this screenshot as text, and the one after it as well.
Not even here saved
that's what the paste bin is for.
anyway, we can forget about that for now. just give the code in those two screenshots as text. I don't want to retype what is in them
Ok
please do not post any more screenshots. I will not look at them. when you use the paste bin, you have to save it and give the link. this was in the instructions from the !paste command.
Ok, one momento
like I said, just the two lines of code from the two screenshots. as text in the chat.
@grave cloak I have to leave soon
you need to add this expression
Hey @grave cloak!
You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.
Ok, thanks a lot for the help, I'll try here
I tried to send the print and the result but it's too big
In this case, I was only asking for the code, not the displayed result
df1.groupby(['mandante'])['mandante_placar'].sum()
df1.groupby(['visitante'])['visitante_placar'].sum()
df1.groupby(['mandante'])['mandante_placar'].sum() + df1.groupby(['visitante'])['visitante_placar'].sum()
Try that @grave cloak
File "<ipython-input-41-205e99e99ae7>", line 1
df1.groupby(['mandante'])['mandante_placar'].sum() + df1.groupby(['visitante'])['visitante_placar'].sum()
^
SyntaxError: invalid character in identifier
++
That was the mistake I think
Thank you so much @serene scaffold and thanks for your patience too
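For anyone reading along, here is a small sketch of adding the two groupby results per club. The column names are the ones from the chat, but the match data is invented. Plain `+` only works cleanly when both results have the same set of clubs, as mentioned earlier; `Series.add(..., fill_value=0)` is one way to handle clubs that appear on only one side. As a side note, `a ++ b` is valid Python (it parses as `a + (+b)`), and an "invalid character in identifier" error usually points to an invisible or non-ASCII character pasted into the line.

```python
import pandas as pd

# Made-up matches; column names follow the chat (home = mandante, away = visitante).
df1 = pd.DataFrame({
    "mandante": ["A", "B", "A"],
    "visitante": ["B", "C", "C"],
    "mandante_placar": [4, 1, 2],
    "visitante_placar": [2, 0, 3],
})

home = df1.groupby("mandante")["mandante_placar"].sum()     # goals scored at home
away = df1.groupby("visitante")["visitante_placar"].sum()   # goals scored away

# Plain + aligns on the club label and gives NaN where a club is missing on one side.
naive = home + away

# add() with fill_value=0 treats the missing side as zero goals.
total = home.add(away, fill_value=0)
print(total)
```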
is anyone familiar with the python library darts?
it's specifically used for time series modeling
Hey guys, any idea how I can slice a pandas dataframe with a datetime column every X amount of days (specifically monthly and yearly)? The dataframe currently has daily info. I've tried two different approaches: filtering every 30 or 365 indexes, but that doesn't work since there are gaps in information: for example index 30 is Feb 10th, 1995, and index 60 is March 28th, 1995. I've also tried locking the relevant day/month based on the selected filter (image in annex), but again that's not the best solution since it's looking for exact days and there are gaps in the df. In some instances it goes like 8 years without finding the same exact date. Any tips on how to go about this?
@limber token can you set the timestamp as the index?
I could, but how would that help?
You could use a Grouper
!docs pandas.Grouper
```
class pandas.Grouper(*args, **kwargs)
```
A Grouper allows the user to specify a groupby instruction for an object.
This specification will select a column via the key parameter, or if the level and/or axis parameters are given, a level of the index of the target object.
If axis and/or level are passed as keywords to both Grouper and groupby, the values passed to Grouper take precedence.
I'm a bit of a noob with this, how could I filter by time intervals by grouping?
Click the link
I did
Didn't understand how to apply it
Oh
There's literally a freq parameter, my bad lol
Got a <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026B36D96490> object, how to actually keep it as df?
wow you are really a pandas guru, stel
11/10 would follow any online content you post

I believe you would be able to loop through each filtered df like
for name, groupdf in filtered_df:
    print(name)
    print(groupdf)
You have to do something to the groups
Think of a grouped DataFrame as a bag of DataFrames
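A rough sketch of "doing something to the groups" to get back a regular DataFrame with one row per month. The column names `date` and `value` and the data are assumptions; the real frame was only shown as a screenshot. `freq="MS"` bins by calendar month, so gaps in the daily data do not matter; `freq="YS"` would do the same per year.

```python
import pandas as pd

# Made-up daily data with gaps, similar to what was described.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["1995-01-03", "1995-01-17", "1995-02-10", "1995-03-28", "1996-01-05"]
    ),
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})
df = df.sort_values("date")

# Option 1: aggregate each monthly group. The "date" column becomes the month
# label (first day of the month); months without data are dropped afterwards.
monthly = (
    df.groupby(pd.Grouper(key="date", freq="MS"))
      .first()
      .dropna(how="all")
      .reset_index()
)

# Option 2: keep the first original row of each month, with its real date.
first_rows = df.groupby(df["date"].dt.to_period("M")).head(1)

print(monthly)
print(first_rows)
```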
I thought about starting a Twitter to post my hot takes, but that might destroy my family
no no no twitter is a hot mess
dont do it stel

I've never actually used it
Hello, I'm trying to remove all rows with NaN values from my dataframe. I am using the following code:
df = web.DataReader('DEXUSEU', "fred", start, end)
df['SP500'] = web.DataReader('SP500', "fred", start, end)
df['Inflation'] = web.DataReader('FPCPITOTLZGUSA', 'fred', start, end)
df['Interest Rates'] = web.DataReader('INTDSRUSM193N', 'fred', start, end)
df = df.dropna()
print(df)
but when I do that I get this output:
Empty DataFrame
Columns: [DEXUSEU, SP500, Inflation, Interest Rates, M3, M2, M1, Interbank]
Index: []
any ideas on what I'm doing wrong here?
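One possible explanation, offered as a guess since nobody could see the data: the series have different frequencies (daily exchange rate, monthly and annual figures). Assigning a column aligns on the date index, so the slower series are NaN on most days, and `dropna()` removes every row that has at least one NaN. The sketch below uses invented numbers instead of FRED data and forward-fills the slower series before dropping.

```python
import pandas as pd

# Made-up daily series for one quarter.
days = pd.date_range("2020-01-01", "2020-03-31", freq="D")
df = pd.DataFrame({"fx": [float(i) for i in range(len(days))]}, index=days)

# Made-up monthly series, stamped on the first day of each month.
monthly = pd.Series(
    [1.5, 1.6, 1.7],
    index=pd.date_range("2020-01-01", periods=3, freq="MS"),
)

# Column assignment aligns on the index: only 3 of the daily rows get a value.
df["rate"] = monthly
print(len(df.dropna()))        # rows where every column has a value

# Carry the last known monthly value forward to the following days.
filled = df.copy()
filled["rate"] = filled["rate"].ffill()
print(len(filled.dropna()))
```

With more columns, plus holidays where the daily series itself is missing, the set of dates on which every column has a value can shrink to nothing, which would give the empty DataFrame shown above.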
df1.rename(columns={'mandante': 'clube'},inplace=True)
df1_clubegols = df1.groupby(['clube'])['mandante_placar'].sum() ++ df1.groupby(['visitante'])['visitante_placar'].sum()
print(df1_clubegols.to_markdown())
When I run this it is like 'club' and '0' the columns, does anyone know how I can rename this '0' and generate a graph with the values
I'm training a deep Q learning AI to play a top down shooter of my own design
and I just learned that the deep Q learning algorithm is guaranteed to converge
but there's no guarantee it'll happen quickly
at 1.1 Million moves, its loss is still extremely high and it can't hit the stationary target to save its life
but I just watched a video on reinforcement learning for a boxing game in unity
and they could barely stand until 250 million cycles
so I'm wondering if I'm just being impatient, and in a couple hundred million moves it'll converge after all
but currently it's training at a rate of about 100k turns an hour
so it'll take about 104 days of continuous training for it to get there
but the thing is, it's barely using my gpus
and I've got a 3080
so it's pretty powerful
taking millions of cycles sounds about right. as to how long it takes, it depends on how many parameters there are and how difficult it is to compute the maximization/prediction per step
what are you using for this? pytorch? tensorflow?
pytorch
it uses the gpu, but in between every turn it has to run pygame code
I'd like to make full use of the gpu
are you batching things up?
yes, but I'm using the batch size of the code I skeletonized, 32
out of the last 10k turns
you can probably increase that a lot more
if I dramatically increase batch size, would my gpu run them concurrently?
that should be the case
let me test that theory
also, are you generating all quantities on gpu? idk if pytorch allows you to directly create variables on gpu
moving them from cpu to gpu and back is super slow
```python
minibatch = random.sample(replay_memory, min(len(replay_memory), model.minibatch_size))
# unpack minibatch
state_batch = torch.cat(tuple(d[0] for d in minibatch))
action_batch = torch.cat(tuple(d[1] for d in minibatch))
reward_batch = torch.cat(tuple(d[2] for d in minibatch))
state_1_batch = torch.cat(tuple(d[3] for d in minibatch))
if torch.cuda.is_available():  # put on GPU if CUDA is available
    state_batch = state_batch.cuda()
    action_batch = action_batch.cuda()
    reward_batch = reward_batch.cuda()
    state_1_batch = state_1_batch.cuda()
# get output for the next state
output_1_batch = model(state_1_batch)
# set y_j to r_j for terminal state, otherwise to r_j + gamma*max(Q)
y_batch = torch.cat(tuple(reward_batch[i] if minibatch[i][4]
                          else reward_batch[i] + model.gamma * torch.max(output_1_batch[i])
                          for i in range(len(minibatch))))
# extract Q-value
q_value = torch.sum(model(state_batch) * action_batch, dim=1)
# PyTorch accumulates gradients by default, so they need to be reset in each pass
optimizer.zero_grad()
# returns a new Tensor, detached from the current graph, the result will never require gradient
y_batch = y_batch.detach()
# calculate loss
loss = criterion(q_value, y_batch)
# do backward pass
loss.backward()
optimizer.step()
```
this is the training code
my assumption is that the .cuda() will parallelize it
where's the .cuda in there
```python
if torch.cuda.is_available():  # put on GPU if CUDA is available
    state_batch = state_batch.cuda()
```
ah i missed it
that's very slow, i think there's a way to create them directly on gpu. but try increasing the batch size first
what frustrates me is that the flappy bird model this code was originally for converged after 2 million rounds
and mine is at a million and hasn't even improved
is aiming and firing from a stationary location to a stationary location really more complex than flappy bird?
i couldn't say
ok yeah that definitely changed it
it was at 320 batch size before, not 32
so I upped it to 5000
and now the GPU spikes pretty seriously every time it back propagates
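On the "create them directly on gpu" point: PyTorch tensor constructors accept a `device=` argument, so a tensor can be built on the GPU instead of being copied there afterwards. The per-sample Python loop that builds `y_batch` can also be written as one batched expression. This is only a sketch with assumed shapes (rewards of shape `(N,)`, next-state Q-values of shape `(N, n_actions)`, a boolean terminal flag of shape `(N,)`); the function name is made up and the tensors in the training code above may be shaped differently.

```python
import torch

# Falls back to CPU when CUDA is not available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def dqn_targets(reward_batch, next_q_batch, terminal_batch, gamma):
    """y_j = r_j for terminal transitions, else r_j + gamma * max_a Q(s_{j+1}, a)."""
    max_next_q = next_q_batch.max(dim=1).values           # best next-state value per sample
    not_terminal = (~terminal_batch).to(reward_batch.dtype)
    return reward_batch + gamma * max_next_q * not_terminal


# Made-up batch, created directly on the chosen device.
rewards = torch.tensor([1.0, 0.0, -1.0], device=device)
next_q = torch.tensor([[0.5, 2.0], [1.0, 3.0], [4.0, 0.0]], device=device)
terminal = torch.tensor([False, False, True], device=device)

targets = dqn_targets(rewards, next_q, terminal, gamma=0.9).detach()
print(targets)
```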
Do a different problem first to make sure it's not just a bug.
you mean a different top down problem?
Any RL problem, like a maze.
cause I tried the flappy bird model before I started this, and my model trained pretty quickly
Do a bunch and see where it fails or not.
ok
Like gym.
a maze seems like a good idea if this batch size thing doesn't work
maybe I'll do a driving one too
since that's basically a top-down flappy bird
After you have a bunch of different ones played or failed, make a matrix of features of each and try to find out what it struggles with (assuming it's not just a bug).
features? how do I compare different games?
if it succeeds on a maze and a driving game, what features would be different?
from a top down shooter
You kind of have to guess with it. But you can make an educated guess.
yeah any information is valuable
Action space, turn based vs not, grid based vs not, delayed rewards, sparse rewards, etc.
yeah ok
Then make a hypothesis, etc, etc (do science, data science).
(You can make specific games with specific features)
*Also RL is hard/unsolved, it not working without a lot of effort (or just A LOT of compute) is expected.
**What is really annoying is that you will often not know if it's just a bug or not. I have had times where removing a bug made it worse (AI/ML is kind of special in this way / that this can happen).
I guess I'm spoiled then with the flappy bird model being tuned just right
and that is annoying that you can't really know what the problem is
***Some papers out there sometimes make me wonder if their implementation was bugged because when reproducing it I did not get any good results, and tried pretty hard to find any bugs. This is a huge problem with not releasing some source code. There is no reproduction (which is needed for it to be science).
what exactly do you define as a bug?
if a model is valid but only with the hyperparameters tuned just right, is that a bug?
No, but something like writing out of bounds is.
writing out of bounds?
index out of bounds
Bugs happen all the time, and there is no way to tell that they did not just mess it up if there is no source code available.
yeah, a paper with no code is a pretty big red flag
The thing about ML is that the bug could actually result in better results, a special case in software.
Other stuff like traditional algorithms would probably just break (usually very noticeable).
Another place where this kind of thing can show up is in game development where a bug becomes a feature because during testing they found that it enhanced the game (e.g. Minecraft pistons strange buggy behavior).
The reason it shows up in ML is because the systems are often good at dealing with noise and are random already.
And adding a bit of noise in a specific way can make it better (not all the bugs add just a bit of noise, there are other types but that is one of them).
that's a good point, I never thought about how weird that is that a bug could actually be "beneficial" in those fields
yooo do u guys know how to solve this?
looks like you want to set up some inequality based on the value of AVERAGE SALES
can you help me to solve this problem?
fyi, this is excel file
the deadline 2 hours left
i can help you with the logic, but i won't do your homework for you
yea just explain, i'll do the rest
you want some nested if else statements
for example if(value on the column to the left <= some number and value on the column to the left >= some other number, then 'SILVER', else ( if ... ) )
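The same nested if/else logic written out in Python, purely to show the shape of it. The thresholds and the GOLD and BRONZE labels are invented; only SILVER and the idea of comparing the average sales value against bounds come from the chat. In the spreadsheet this would be nested IF formulas instead.

```python
def sales_tier(average_sales: float) -> str:
    """Map an average sales value to a tier using chained inequalities."""
    # Hypothetical thresholds; replace with the ones from the assignment.
    if average_sales >= 1000:
        return "GOLD"
    elif average_sales >= 500:
        return "SILVER"
    else:
        return "BRONZE"


tiers = [sales_tier(v) for v in (1200, 750, 499.99)]
print(tiers)
```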
can i add u on discord, to talk further?
hey guys, I have a question. Let's say you want to predict house/apartment selling price for a part of the city, but you don't have floor details i.e. two apartments on the same building can appear as duplicates in your data but with different selling price. Do you assume that these concern different apartments or assume that it is the same apartment sold another time hence remove them to keep the observations independent?
pls guys i still need ur help
hello everyone, does anyone know the best dataset to practice on for a support recommender task?
association or other approaches such as classification or sparse matrices are welcome
Hey everyone! I'm very new to computer vision and looking for some input on this problem I have.
I want to train a model that takes an image as input and gives the same image as output but with one or multiple added overlays. Like this:
Input:
Output:
It's basically adding a missing voronoi cell
I have identified pix2pix as a possible candidate, but I get the feeling that model is not meant to take a photo as input but more a schematic as input
I'm not asking anyone to guide me through the whole way to do it, just looking for a pointer to the right type of model
Image segmentation would perhaps also work to kind of mask the area that the cell is missing from, but I don't think that's what those models are meant for either
To clarify, it is known in advance what cell is missing, the model would be trained on a synthetic dataset of where these cells are supposed to be. There is only 1 'solution', it doesn't need to create new voronoi patterns
Can we find confidence value of a prediction made by logistic regression model?
Does anyone know how I change that '0'?
the second to last line creates a new DataFrame, so the thing you did in the first line doesn't count anymore.
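A sketch of one way to end up with named columns instead of '0'. The sum of the two groupby results is a new, unnamed Series, which is why the earlier column rename no longer applies; naming the index and the Series before `reset_index()` avoids that. The match data is invented and `gols` is just an illustrative column name.

```python
import pandas as pd

# Made-up matches using the column names from the chat.
df1 = pd.DataFrame({
    "mandante": ["A", "B", "A"],
    "visitante": ["B", "C", "C"],
    "mandante_placar": [4, 1, 2],
    "visitante_placar": [2, 0, 3],
})

df1_clubegols = (
    df1.groupby("mandante")["mandante_placar"].sum()
       .add(df1.groupby("visitante")["visitante_placar"].sum(), fill_value=0)
       .rename_axis("clube")   # name of the index, becomes the first column
       .rename("gols")         # name of the values, replaces the '0'
       .reset_index()
)
print(df1_clubegols)
```

From there, something like `df1_clubegols.plot.bar(x="clube", y="gols")` should give a bar chart, assuming matplotlib is installed.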
going forward, I won't attempt to answer any questions you ask that involve a screenshot. sorry.
OK sorry. But I think it looks better to display the error
unless it's not text, you can copy and paste it as text. everything in that screenshot is text.
Yes you can. You use predict_proba() method to find the probability of your prediction (how confident your model is about the prediction it's made.)
Thanks
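A short illustration of that, assuming the model is scikit-learn's `LogisticRegression` (the question did not name a library) and using synthetic data. `predict_proba` returns one column per class, and the largest value in a row is the estimated probability of the predicted class. It is the model's own estimate, so it is only as trustworthy as the model's calibration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, only for demonstration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

proba = model.predict_proba(X[:5])   # shape (5, 2): one column per class
predicted = model.predict(X[:5])     # predicted class labels
confidence = proba.max(axis=1)       # estimated probability of the predicted class

for label, conf in zip(predicted, confidence):
    print(label, round(float(conf), 3))
```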
Hello, I just found this community (used to look on irc channels but seems like I am getting older)
does anyone have resources on writing aws lambda functions to call ML models for inference

the more i work, the more i feel i know nothing
forever imposter syndrome
calling it now
I wanted to know how to convert this little table:
to something having 3 columns, Year, Month, Value
you could .stack() and then .T to transpose, I think. if you do print(df.to_dict('list')) and give the text in the chat, I can experiment.
Cool let me give that print to you
{2021: {1: '525.785', 2: '427.857', 3: '477.502', 4: '468.083', 5: '484.556', 6: '457.686', 7: '478.079', 8: '518.769', 9: '532.103', 10: '562.109', 11: '544.405', 12: '526.958'}, 2020: {1: '470.827', 2: '424.322', 3: '459.281', 4: '463.401', 5: '507.738', 6: '509.694', 7: '518.549', 8: '543.902', 9: '566.628', 10: '619.065', 11: '589.061', 12: '583.310'}, 2019: {1: '386.320', 2: '331.042', 3: '374.333', 4: '423.750', 5: '518.613', 6: '525.426', 7: '551.882', 8: '575.643', 9: '556.631', 10: '599.759', 11: '573.220', 12: '539.916'}, 2018: {1: '265.377', 2: '224.912', 3: '278.908', 4: '295.472', 5: '317.927', 6: '317.742', 7: '347.696', 8: '373.720', 9: '401.025', 10: '413.556', 11: '406.041', 12: '407.751'}, 2017: {1: '290.174', 2: '225.300', 3: '252.969', 4: '236.823', 5: '248.159', 6: '245.243', 7: '293.297', 8: '316.340', 9: '307.968', 10: '302.871', 11: '293.155', 12: '285.409'}, 2016: {1: '284.158', 2: '226.621', 3: '264.373', 4: '275.014', 5: '295.629', 6: '297.553', 7: '280.241', 8: '299.579', 9: '334.492', 10: '337.164', 11: '320.987', 12: '289.284'}, 2015: {1: '402.896', 2: '335.901', 3: '341.535', 4: '327.988', 5: '367.478', 6: '362.013', 7: '362.265', 8: '357.917', 9: '347.830', 10: '361.113', 11: '332.901', 12: '314.040'}}
(without the 'list' argument)
why did you do something other than what I asked?
but okay
Sorry you are right:
{2021: ['525.785', '427.857', '477.502', '468.083', '484.556', '457.686', '478.079', '518.769', '532.103', '562.109', '544.405', '526.958'], 2020: ['470.827', '424.322', '459.281', '463.401', '507.738', '509.694', '518.549', '543.902', '566.628', '619.065', '589.061', '583.310'], 2019: ['386.320', '331.042', '374.333', '423.750', '518.613', '525.426', '551.882', '575.643', '556.631', '599.759', '573.220', '539.916'], 2018: ['265.377', '224.912', '278.908', '295.472', '317.927', '317.742', '347.696', '373.720', '401.025', '413.556', '406.041', '407.751'], 2017: ['290.174', '225.300', '252.969', '236.823', '248.159', '245.243', '293.297', '316.340', '307.968', '302.871', '293.155', '285.409'], 2016: ['284.158', '226.621', '264.373', '275.014', '295.629', '297.553', '280.241', '299.579', '334.492', '337.164', '320.987', '289.284'], 2015: ['402.896', '335.901', '341.535', '327.988', '367.478', '362.013', '362.265', '357.917', '347.830', '361.113', '332.901', '314.040']}
(I thought you were going to miss the month)
that's fine, I guess. but .unstack() will return a Series of (year, month) -> value, and .stack() will do (month, year) -> value
How do I go from that "(month, year) -> value" to a dataframe with Month and Year columns ? sorry I am just starting on this
In [8]: df.unstack().to_frame().T
Out[8]:
2021 ... 2015
1 2 3 4 5 6 ... 7 8 9 10 11 12
0 525.785 427.857 477.502 468.083 484.556 457.686 ... 362.265 357.917 347.830 361.113 332.901 314.040
oh let me try
I dont really want a multilevel column, but 2 distinct columns
each value has a year and a month. so what are these two columns going to mean?
Example:
Year Month Val
2021 1 427.5
2021 2 456.6
2020 12 123.45
for example
(this is because I need to join this dataframe to another based on Year and Month)
Is this the best place to ask about webscraping?
no, this is for data science. try a help channel. see #❓｜how-to-get-help
TY ❤️
In [13]: df.unstack().reset_index()
Out[13]:
level_0 level_1 0
0 2021 1 525.785
1 2021 2 427.857
2 2021 3 477.502
3 2021 4 468.083
4 2021 5 484.556
.. ... ... ...
79 2015 8 357.917
80 2015 9 347.830
81 2015 10 361.113
82 2015 11 332.901
83 2015 12 314.040
[84 rows x 3 columns]
renaming the columns would be a bit of a pain.
In [15]: df.columns.rename('year', inplace=True)
In [17]: df.index.rename('month', inplace=True)
In [18]: df.unstack().reset_index()
Out[18]:
year month 0
0 2021 1 525.785
1 2021 2 427.857
2 2021 3 477.502
3 2021 4 468.083
4 2021 5 484.556
.. ... ... ...
79 2015 8 357.917
80 2015 9 347.830
81 2015 10 361.113
82 2015 11 332.901
83 2015 12 314.040
[84 rows x 3 columns]
Thank you Stelercus !
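A compact variant of the same idea that names everything in one chain and also converts the string values to floats, which helps before joining on Year and Month. Only a slice of the data from the chat is used, and the `Value` column name is taken from the requested output.

```python
import pandas as pd

# A small slice of the table from the chat: columns are years, index is month.
df = pd.DataFrame({
    2021: {1: "525.785", 2: "427.857"},
    2020: {1: "470.827", 2: "424.322"},
})

long_df = (
    df.rename_axis(index="Month", columns="Year")  # name both axes up front
      .unstack()                                   # Series indexed by (Year, Month)
      .rename("Value")
      .astype(float)                               # the values were strings
      .reset_index()
)
print(long_df)
```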
anyone can help me and give me tips or reference ?
i actually want to build Named Entity Recognition using Conditional Random Fields, but I have trouble with entity labelling. any advice from you guys on how to label data for text in the Indonesian language?
you're trying to make an NER model for more than one language?
actually for indonesian language
so i have done the preprocessing phase, and next i will go to entity labelling / annotating the text as location, person and so on, but i don't know how to do that
so what do you mean by "multi language"?
sorry i was wrong giving the explanation.
Working on a time series model with the package Darts. Having trouble returning the acf, as the error says time series has no attribute to shape
i can't find much on stack overflow about this error
please show the whole error from Traceback as well as the relevant code
will do
AttributeError Traceback (most recent call last)
<ipython-input-33-9be0d81f7eb4> in <module>
----> 1 plot_acf(train)
2 frames
/usr/local/lib/python3.7/dist-packages/statsmodels/graphics/tsaplots.py in _prepare_data_corr_plot(x, lags, zero)
17 if lags is None:
18 # GH 4663 - use a sensible default value
---> 19 nobs = x.shape[0]
20 lim = min(int(np.ceil(10 * np.log10(nobs))), nobs - 1)
21 lags = np.arange(not zero, lim + 1)
AttributeError: 'TimeSeries' object has no attribute 'shape'
the error
the code is this
so x, whatever that is, is not an array.
hm
i'll send the relevant code
```python
train.plot(label='train')
val.plot(label='validation')
plt.legend();
# here we are splitting our model into training and validation. Everything before 2019 is training, and 2019 onwards will be the validation series
```
output
from there i attempted to plot the acf
Python is dynamically typed. so it might be that it's up to you to know what x is supposed to be, both in terms of what type it is and what it represents.
the x value is the dates
but i thought i had already established it
i imported the date time package as well so it could be read
if you do type(x), what is that?
yes it is
the x in your function is lower case. so this is something completely unrelated.
i thought so too
but X capital shows up
x lowercase does not
this is my first time attempting a time series model
you have to put print(type(x)) in the function and run it to figure out what type(x) is.
yeah i tried that
nothing
lowercase x is not defined
i'm following the documentation guide for darts though
on building a time series model
what library does TimeSeries come from? if it's a sequence of some kind, you might be able to convert it to an array.
The Darts library
admittedly a library i have never used before
i can send the code leading up to that error
```python
series = TimeSeries.from_dataframe(wrc, 'Date', 'High', fill_missing_dates=True, freq='B')
train, val = series[:-36], series[-36:]
wrc.head()
```
wrc.shape
```python
model = ExponentialSmoothing()
model.fit(train)
prediction = model.predict(len(val), num_samples=1000)
```
```python
wrcm = Theta()
wrcm.fit(train)
pred = model.predict(len(val))
```
```python
train.plot(label='train')
val.plot(label='validation')
plt.legend();
```
The relevant part of ur code is where u declare x
it's an argument for a function that they didn't necessarily write (though maybe they did)
no i didn't
according to the documentation i didn't need to declare x outright
As Stelercus said before, whatever x is is not what it should be (a list).
there's no "declaring" in Python. variables are just names.
where do you call the function that has x?
series = TimeSeries.from_dataframe(wrc, 'Date', 'High', fill_missing_dates=True, freq='B')
Yes, but if the error is in the call of this function, then he had to call the function passing the parameters to cause that
AttributeError Traceback (most recent call last)
<ipython-input-33-9be0d81f7eb4> in <module>
----> 1 plot_acf(train)
where is this in the code
whatever train is (it is probably a TimeSeries) is not a valid type for that function.
train, val = series.split_before(pd.Timestamp('20190101'))
this is the code that it's pointing to
i had seen someone work around this error using the same darts module though. From what I can tell, they followed relatively the same process.
which is frustrating
I think I understand it now, this error could have been caused by a call to a method from one of the libs it is using.
So, what I understood of your code and may be causing the error is the that val receive a datetime object and them u call the method plot on it, but it expected an list.
I ask you to ignore the English mistakes, I'm not fluent
nah totally fine!
that bit of code does actually return a plot
but it won't return the shape of said plot
Try to put the val in a list and call that list in the place of val.
@pseudo wren this looks like an interesting library nonetheless...maybe i will check it out if i ever have to do time series stuff

i chose it because it supposedly abstracts away a lot of the processes behind creating a timeseries like you would find in matplotlib
buuut so far it's just a headache
oof
well it still looks like itd be promising for maybe simple stuff
at least according to the readme
i like that you can create a time series object straight from a pandas df
so ill bookmark it
just in case

yeah absolutely
i just need to figure out how to actually use it all the way
creating a time series directly from pandas has been insanely convenient though
thanks! i'd be interested to see how you fare with it
So I spoke to my boss and I actually can't, so I tried this:
But I don't really like this solution, for two reasons:
- I need it to match the exact day first, and then if it doesn't match, try the range
- It's pulling more than one day per month, since it's pulling every day that matches the range
you're also overwriting the name of the dataframe, and unless the dataframe is preserved in some other scope, that is bad
omg stel, i have done so many type conversions with this data that im pretty sure something got lost along the way
time to check this json i exported i guess

I have to build a large dataset on employees from a couple of siloed datasets and I'm trying to engineer it so that data scientists have an easy time passing it to estimators / prediction algos.
I'm struggling because I have daily snapshots of the employees properties like their contract, benefits, manager, department. However, I also have other data that does not update daily such as survey results, satisfaction, but fortnight, and I think I have to somehow merge them.
Many employees do not change properties daily such as manager but some do.
One of the goals is enabling DS to be able to generate monthly or weekly predictions over the employees, such as churn rate or satisfaction per department per month with the data set whilst including that auxiliar data that comes from evaluations or surveys.
I would like to know what strategy is used normally to face this type of time granularity diversity on features pertaining the same population.
Could anyone provide an outline please?
In kaggle you rarely see data that has like history snapshots of the features so this feels like a non so common case
yes! my Q learner finally learned to wiggle to maximize laser-on-target time
after 1.5 million rounds
today wiggling, tomorrow aiming
I don't know if understood correctly.
But wouldn't that be a case for providing all the data timestamped to your DS? It would be more like a time series, and the DS would be able to analyse your data and work out what is useful or not.
If it is not helpful, let me know.
aha, so it was the number of iterations
I believe this would just result on passing the problem to the data scientist?
Wouldnt this force them to go around the data silos to find the feature pieces and then figure out how to merge them / combine into useful / homogenous time ranges? We would like to avoid that.
Naturally, we want to still give them the freedom to mix and match but at the same, it seems valuable/standard to have something like a feature store that they can query in an homogenous manner that suits most common training techniques.
I posted the question in help-donut in case anyone has something they'd like to say
Better, I have not much to add about it as I'm used to "receive" the data and deal by myself
@slender plinth Sounds like you are a DS?
I feel inclined to believe your experience should still be relevant as it is just a matter of who prepares the data.
So if you were to receive just the timestamped data as you suggest (keeping in mind that some data does not follow the same granularity), ie: payroll survey data is only available every 15 days whilst sentiment score has daily snapshots whilst the department is sampled in the dataset daily but may not change for years, how would you do the merge?
Keep in mind that this means you would actually receive N datasets where N is number of silos and you'd have to figure out how to merge them should you need to do that (perhaps by date and employee, but date granularity is non-homogenous)
Yes sir I am a DS.
So, this is my opinion and experience.
I would love to have someone parsing and organising data before it comes to me.
But! What I usually see is data coming in raw. It is up to the DS to verify and understand how the DBs should be merged and what is useful or not.
If you could talk to your DS and ask how he/she would like to receive the data, that would be best. I kind of like to receive all the raw information so I can take my own insights from it and then merge it to do all the work (cleaning, feature engineering, etc...)
I would love to have someone parsing and organising data before it comes to me.
if only
there is super nested json where it feels like i have to grab hidden secret data
in order to train a model
it is a nasty schema too since there's somehow multiple choice questions in there
too many records too
it's like 7 levels in btw

and then like the questions and answers are on separate levels
like who did this

I have a meeting scheduled with him but I am trying to get a better feel for the matter to not be so lost.
I think I failed to english in my last paragraph and omitted the question after just explaining context lol
If it's not much of a bother and you have time, I'd like to hear how you would approach merging such a heterogeneous dataset. Like, don't you need to have homogeneous date ranges in order to pass that to an estimator?
Sorry for the late reply, urgent meeting..
Yes, you need to have specific keys between the data sets in order to connect them.
But you can do this, see if it makes sense:
Link all your data by user ID if available. For the data that is not daily, create a specific invalid date (1/1/1800) and let the DS know. For all datasets, create week and month columns. From your explanation, I believe you can do that at least by month for all data points; if you could also add weeks, that would be good.
With that, the DS will be able to connect them.
Edit: I think your primary key would be the user ID, then you would have your secondary keys: Date, Week, Month, Year (if that is the case).
I could only come up with that solution. I'm not sure if it is the optimum, but I'm pretty sure your DS will figure out how to handle this data.
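A minimal pandas sketch of the keying idea above, assuming two made-up silos (daily sentiment and a payroll survey every 15 days). All column names and values are hypothetical. Option 1 joins on employee plus a derived month key as suggested; option 2 is a different technique, `pd.merge_asof`, which keeps the daily rows ungrouped and attaches the most recent survey value.

```python
import pandas as pd

# Hypothetical silos: daily sentiment snapshots and a payroll survey every 15 days.
sentiment = pd.DataFrame({
    "employee_id": [1, 1, 1, 1],
    "date": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-16", "2022-02-01"]),
    "sentiment": [0.1, 0.3, 0.5, 0.7],
})
payroll = pd.DataFrame({
    "employee_id": [1, 1],
    "date": pd.to_datetime(["2022-01-01", "2022-01-16"]),
    "payroll_score": [10.0, 20.0],
})

# Option 1: derive a coarser shared key (month) and join on employee + month.
for frame in (sentiment, payroll):
    frame["month"] = frame["date"].dt.strftime("%Y-%m")
monthly_payroll = payroll.groupby(
    ["employee_id", "month"], as_index=False
)["payroll_score"].mean()
by_month = sentiment.merge(monthly_payroll, on=["employee_id", "month"], how="left")

# Option 2 (swapped-in technique): keep every daily row and attach the latest
# payroll value recorded at or before that date for the same employee.
as_of = pd.merge_asof(
    sentiment.sort_values("date"),
    payroll.sort_values("date").drop(columns="month"),
    on="date",
    by="employee_id",
    direction="backward",
)
print(by_month)
print(as_of)
```

Option 1 summarizes the coarser silo before the join, while option 2 leaves the grouping decision to whoever consumes the data.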
No need to apologize mate, thanks a bunch for taking the time to answer.
This sounds like a plan. I also spoke to them and got some clues about what structure to keep the data in at the DS level for the least amount of pain for us all. I don't think it's necessary to go into detail, but it hints to me that we need to keep the data ungrouped, because different analyses will require them to do different groupings, so summarizing at the source instead of during their pre-processing may end up restricting some analyses. @slender plinth
can anyone help me a bit? I'm a beginner in programming. i want to extract data from a pdf and make graphs using matplotlib. i've already extracted the data string out of the pdf using pdfminer, but the way it is will be very hard for me to parse. i want to make something where i upload any similar pdf and it gives me graphs. If anyone can help or guide me with this, that would be appreciated
Should I learn the math for ML/AI before I get started with programming and making projects?
AI libraries abstract a lot of the math away. but you can't really make intelligent decisions about what models or neural architectures to use unless you understand the math.
you should learn the basics of linear algebra, probability, and statistics. at the very least.
but if you feel like starting a project, and you enjoy working on it, and you learn something along the way, it doesn't necessarily matter if you don't finish it.
So I should learn the basics of linear algebra, and statistics before actually programming?
Wouldn't I also have to learn some calculus?
@serene scaffold
eventually, yes
you can still experiment with numpy, pandas, and sklearn if you want
If I learn the basics of linear algebra and statistics I would be able to make AI/ML projects?
there are ML algorithms that just involve statistics, so yes
what about the programming part?
what would I have to learn for programming the actual model?
you can use numpy and sklearn and stuff
Do you guys have any resources on linear algebra/statistics for ML?
Companion webpage to the book "Mathematics for Machine Learning". Copyright 2020 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Published by Cambridge University Press.
Hello
Hey @severe grail!
It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.
Feel free to ask in #community-meta if you think this is a mistake.
what is wrong with my dataframes indexing?
i used append and concat
```
dataFrameStack = None
cursor = cnx.cursor()
cursor.execute(QUERY)
df = pd.DataFrame(cursor)
df.head()
print(df)
if not df.empty:
    if dataFrameStack is not None:
        dataFrameStack = dataFrameStack.append(df,ignore_index=True)
    else:
        dataFrameStack = df
print('\n\n\n\n\n***********************')
print(dataFrameStack)
field_names = [ i[0] for i in cursor.description]
print(field_names)
xlswriter = pd.ExcelWriter('{}/{}.xls'.format(type,loc),engine='openpyxl')
if not df.empty:
    df.columns = field_names
    df.to_excel(xlswriter,index=False)
    xlswriter.save()
else:
    cnx.close()
```
what is wrong with the logic?
pd.concat([dataFrameStack,df],axis=0,ignore_index=True)
won't work
the second dataframe jumps to after the last columns
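One possible cause, assuming the snippet runs once per query inside a loop: `dataFrameStack = df` stores a reference to the same object, so the later `df.columns = field_names` also renames the stack's columns. The next `df` still has the default integer column labels, and `pd.concat` aligns by label, so its values land in new columns. The sketch below reproduces that with made-up rows (`batches` and `field_names` are stand-ins for the query results and `cursor.description`) and shows one way around it.

```python
import pandas as pd

field_names = ["id", "name"]                   # stands in for cursor.description
batches = [[(1, "a"), (2, "b")], [(3, "c")]]   # stands in for rows from each query

# Reproduce the symptom: the first frame is renamed through the shared
# reference, the second one keeps the default labels 0 and 1.
first = pd.DataFrame(batches[0])
stack = first                  # same object, not a copy
first.columns = field_names    # also renames the columns of `stack`
second = pd.DataFrame(batches[1])
misaligned = pd.concat([stack, second], axis=0, ignore_index=True)
print(list(misaligned.columns))  # four columns instead of two

# One way around it: name the columns when each frame is built, collect the
# frames in a list, and concatenate once at the end.
frames = [pd.DataFrame(rows, columns=field_names) for rows in batches]
stacked = pd.concat(frames, axis=0, ignore_index=True)
print(stacked)
```

Concatenating once at the end also avoids `DataFrame.append`, which was removed in pandas 2.0.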
@harsh spade Per Rule 6, your invite link has been removed. If you believe this was a mistake, please let staff know!
Our server rules can be found here: https://pythondiscord.com/pages/rules
Heylo, on what topic would you guys prefer a research summary!??
Where you are getting problem mate?
hii anyone around? I need some help in a problem.
so I have a dataset with 800 features, and fun part is there is no correlation between these variables.
So how should i reduce the dimensionality of it?
i have tried using pca, feature extraction, feature selection and some other method but on test dataset highest r2 score is 1.8
if you take an SVD of the data, how do the singular values look? do you know anything about where the data comes from?
they haven't told where does that data come from but i think it's a real world data which is feature engineered
it's like 8 main features and rest are derived features
when you did PCA, how did you choose how many components to keep?
and how many did you keep
so i tried keeping various features starting from 50 going till 800
r2 score seems to be increasing with increase in features
with r2 score you mean mean squared error?
anyone knows why my friend's and my feature importance value are different despite the fact that we are using the same dataset?
it's a scaled mean squared error, i just checked
yea scaled between 0-1
you'd expect this quantity to go to 1 as you increase the number of principal values
it hasn't got above 0.3 even taking all features
that seems wrong, since taking all features would mean you just have the original data again
and on test data it messed up big time, .9 so it's overfitting
on training data, you mean?
yea, that's why i am trying to find ways to feature engineer some attributes from original variables
yea yea
probably needs something more robust to noise. that's why i was asking how the singular values look
wait i will send you the ss
that can give you some idea of how noisy the data set is, and whether a robust version of PCA could work better
that does already seem like a pretty decent approx
can you plot all of the singular values?
how many examples are in the training data
20k is pretty good
so yeah, there should be 800 singular values and the question is how noisy they are
how can i analyze the noise of 800 features?
aha, but there you see the singular values are still quite large
what does it indicate?
what i was gonna note is that, if you have your samples in a vector of size 800, and have 20k examples of these vectors, you can place them in a matrix of size 800 x 20k. then 1/20k (MM^T) is an approximation of the covariance matrix. under the assumption that there is noise that is uncorrelated with the true data or true features, 1/20k MM^T = C + N, where C is the true covariance and N is the noise covariance. for real world data, C is usually rank-deficient. noise tends to be full rank, and often/hopefully close to diagonal
under those conditions, the singular values of the covariance matrix are the original singular values plus the noise singular values, so the overall covariance appears to be full rank. as long as the true singular values are modestly large, you will mostly see the behavior of the data. once they become small, they are dominated by the noise
so if there is a weird sudden change in the profile of the singular values, it often hints at moving out of the signal space and into the noise space
(which would, in a noise free case, just be the null space)
so it's a good idea to make a plot of all 800 singular values of the sample covariance
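A small NumPy sketch of that check, using synthetic data rather than the real dataset: a low-rank "signal" plus full-rank noise, with sizes shrunk from 800 x 20k to keep it quick. The rank, noise level, and the "largest relative drop" heuristic are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_samples, true_rank = 50, 2000, 5   # small stand-in for 800 x 20k

# Low-rank signal plus full-rank, roughly diagonal noise.
basis = rng.normal(size=(n_features, true_rank))
signal = basis @ rng.normal(size=(true_rank, n_samples))
noise = 0.1 * rng.normal(size=(n_features, n_samples))
M = signal + noise
M = M - M.mean(axis=1, keepdims=True)            # center each feature

# Sample covariance (features x features) and its singular values,
# which np.linalg.svd returns in descending order.
cov = (M @ M.T) / n_samples
singular_values = np.linalg.svd(cov, compute_uv=False)

# A sudden drop in the profile hints at leaving the signal space:
# look for the largest ratio between consecutive singular values.
ratios = singular_values[:-1] / singular_values[1:]
estimated_rank = int(np.argmax(ratios)) + 1
print(estimated_rank)
```

Plotting `singular_values` on a log scale (for example with matplotlib's `semilogy`) shows the same drop visually. On real data the transition is usually less sharp than in this toy case.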
the data has lots of nan values as well, i have dropped those features which have more than 15,000 nan values
data is overfitting if i use xgboosting
model score is 0.92 on train, on test it's showing 0.34
what are you doing?
so i took all features and transformed using pca, and ran my model
why?
i thought that's what you said
i never said anything about ML
we were looking at the data first to see if we could learn something
na na model i ran just to check
okay my bad.
and what is your model doing anyway? what are you trying to get from the data
based on features i am trying to predict the score of a person given by some coach
scores of what
you never mentioned any of this before so i have no idea what you're doing
okay i will tell you problem first
so i have 8 main features of a football player, and based those 8 features there are other derived variables. We are trying to predict the score given by a scout on the basis of those features
features include, position 1, position 2 of a player, weight, age, height, team code he plays for etc etc
all right
so nearly all the features are scaled between 0, 1
including height
some categorical features are there which i changed using one hot
i just need some clue or hints on how to handle these many features, as i have never worked on something like this
hi, i have a question.. is image classification useful to rate images out of 5 ??
what do the ratings mean?
from the image i can tell if the house for instance is high standing
or not
the rate will be from 1 to 5 its like giving 5 stars thing
I think yess
coz you already have 5 classes, so yess by classifying an image the result will be one of the five classes ofc
ah i see i see thank you
wlcm
so the goal is to decide how "good" the house is on a scale of 1 to 5?
yes
what about the value of the home? are you trying to decide that?
yes its like seeing the rating of the house comparing to its price
cause i did scraping to get the data
the rating of the image of the house*
Is there any algorithm which dynamically updates/eliminates a various number of output while the user is giving input to a set number of questions?
What are you trying to do??
I got a question from a friend,
Assume there are 20 mandatory personality questions, and each question has a specific and unique weight assigned to it. Based on the questions answered by the user there are different outputs, or let's call them buckets, for example the person is a party person, introvert, alcoholic etc.
Now what we have to do is ask all the employees of a company to fill out the form, and based on their input we have to put them in different buckets. Now we can do it with any language by just comparing the weights. My friend asked what we can do to make it dynamic, using python specifically
so i was thinking of an ai/ml algorithm which eliminates the output while the user inputs the form
I think the better way would be to just compare the weights as u suggested earlier. I doubt there's an algorithm of such type. Not that I've heard of ofc.
Again I'm not sure if there really isn't any algorithm possible. U can try out some unsupervised operations to make sure tho.
hi
I need help about data science, the chat room of the problem is #help-chocolate
Does any one know how to code a game
Any specific idea you working on??
have you checked #game-development
I will go check..
I'm developing a machine learning model to identify non-payers
from sklearn.cluster import KMeans
from scipy.spatial import KDTree
import webcolors
import cv2
def convert_rgb_to_names(rgb_tuple):
    # a dictionary of all the hex and their respective names in css3
    css3_db = webcolors.CSS3_HEX_TO_NAMES
    names = []
    rgb_values = []
    for colour_hex, colour_name in css3_db.items():
        names.append(colour_name)
        rgb_values.append(webcolors.hex_to_rgb(colour_hex))
    kdt_db = KDTree(rgb_values)
    distance, index = kdt_db.query(rgb_tuple)
    # This of course only returns a closest match
    return names[index]
image = cv2.imread(r"helper\data\Nike-SB-Dunk-Low-Pro-Bart-Simpson-Product.png")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# get n clusters of colours
image = image.reshape((image.shape[0] * image.shape[1], 3))
clt = KMeans(n_clusters = 4)
clt.fit(image)
img_4_most_frequent_colours = []
for colour in clt.cluster_centers_:
    colour = convert_rgb_to_names(colour.astype("uint8").tolist())
    img_4_most_frequent_colours.append(colour)
print(img_4_most_frequent_colours)
I'm trying to get 4 colours from a picture
is this an efficient way?
Would someone have a moment to guide me through what I would need to do to begin this assignment?
Or perhaps a good youtube tutorial to follow for this assignment? I've only ever used pandas on jupyter notebooks
@wooden sail :)
Hi, Im having a little trouble w/ numpy
where arr is a 2D array with shape (30, 30) and dtype=uint8
arr = np.where(
    arr < lower,
    new - diff,
    np.where(
        arr > upper,
        new + diff,
        new - color + arr,
    )
).astype(np.uint8)
this was my former (but slow) solution
def func(element: int, new: int) -> int:
    if element < lower:
        return new - diff
    elif element > upper:
        return new + diff
    else:
        return new - color + element
# and I map func over each element within the nested array
it does not match the desired results at all
does anyone happen to know where the process is differing?
result = new - color + arr
result[arr < lower] = new - diff
result[arr > upper] = new + diff
something like this?
if that doesn't help and you want to continue, please say the types of all the variables in your example (other than arr)
it looks pretty fine for me
everything is integers (except for arr obv)
yea I was going off of the example you provided this morning
yes, that was suggested to me initially and I tried it, but what about the else statement,
for which @confused reptile suggested np.where this morning
instead of an else statement, the new array is created where every element is what the else block would have created.
ah that is smart I did not see that I will try it, thank you
I suppose it's "smart", but it also results in wasted computation.
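A sketch of the masked-assignment idea checked against the scalar function. The values of `lower`, `upper`, `diff`, `color`, and `new` are made up. One possible source of a mismatch (an assumption, since the real values were not posted) is `uint8` arithmetic: the result can fall outside 0..255, so this version widens the array to `int32` before computing and only narrows at the very end.

```python
import numpy as np

# Hypothetical parameters; the real ones were not given in the chat.
lower, upper, diff, color, new = 50, 200, 10, 120, 180

def func(element: int) -> int:
    """Scalar reference version of the mapping."""
    if element < lower:
        return new - diff
    elif element > upper:
        return new + diff
    return new - color + element

rng = np.random.default_rng(0)
arr = rng.integers(0, 256, size=(30, 30), dtype=np.uint8)

# Widen first so intermediate values above 255 or below 0 are representable.
wide = arr.astype(np.int32)
result = new - color + wide          # what the else branch would produce
result[wide < lower] = new - diff    # overwrite the "too low" cells
result[wide > upper] = new + diff    # overwrite the "too high" cells

expected = np.array([[func(int(v)) for v in row] for row in arr])
print(np.array_equal(result, expected))

# Narrow only at the end, clipping instead of wrapping around.
as_uint8 = np.clip(result, 0, 255).astype(np.uint8)
```

With these made-up numbers the else branch can reach 260, which does not fit in `uint8`, so whether to clip or wrap at the end is a choice the original code has to make explicitly.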
:incoming_envelope: :ok_hand: applied mute to @shell yew until <t:1654992737:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
my favorite DS youtuber https://www.youtube.com/watch?v=0ItYIoOrrUs
What happens when Machine Learning and Baseball converge? You get a system that tells you exactly what you need to do to improve your baseball swing. How much will I improve? Watch to find out! Learn more about it here: https://www.sas.com/en_us/curiosity/battinglab.html
Special thanks to SAS for bringing me out to see their batting lab. The b...
i didnt even know SAS had a campus and everything

fun fact: they apparently use python there too
current flavors of explainable AI?
I'm looking for a data analysis book. I was recommended Wes McKinney's Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, but I've seen bad reviews saying it's not updated. Is there a better book than this? I need your suggestions please
Updated version here https://wesmckinney.com/book/
What is a good package to use for the double scalar error or a good fix in general
I've heard that error is a result of too many large negative numbers
What is double scalar error?
From what I found on stackoverflow that error occurs when there's a very large very negative number
I am trying to create a predictive time series model
honestly, just create an inside function that filters and pass those values.
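Assuming the "double scalar error" refers to NumPy's floating-point overflow warning on scalar arithmetic, one way to find where it happens is to turn the warning into an exception with `np.errstate`. The `risky_step` function below is a hypothetical stand-in for one step of the model, not the actual computation.

```python
import numpy as np

def risky_step(value):
    # Hypothetical computation: squaring a huge magnitude overflows float64.
    x = np.float64(value)
    return x * x

# Inside this block an overflow raises FloatingPointError instead of only
# emitting a RuntimeWarning, so the traceback points at the offending line.
with np.errstate(over="raise"):
    try:
        risky_step(-1e200)
        overflowed = False
    except FloatingPointError:
        overflowed = True

print(overflowed)
```

Once the offending step is known, the usual options are rescaling the inputs or rewriting the step so the intermediate value stays in range.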
Can someone guide me on AI
In blogs there are conflicts
Maybe a roadmap you can provide?
@rich fiber start by learning about k nearest neighbors
I just learned the basics of python
Now should I learn about data analysis libraries?
Or algorithms first?
you can experiment with numpy and pandas, I guess. but it's more useful to learn how to do a certain thing, and learn how to use the libraries as a secondary concern.
Alright so I start by learning different ML algorithms?
sure. make sure you already know the basics of linear algebra and stats.
Can you kindly suggest me any resources to learning the maths required
I tried a book of Cambridge University and I could barely read a line without having to Google
Cambridge University Press publishes all kinds of textbooks at all university levels
The book "Mathematics for Machine Learning" by A. Aldo Faisal
That sounds like it's trying to teach you the mathematics. I haven't read it, so I don't know what level it's aimed at.
Yes that exactly what I was trying to learn alongside basics of python
So later on I won't have trouble learning the algorithms
Or dealing with large data (I hope)
Which books have you read as a beginner?
It sounds like you're a beginner in three different things at once: Python, mathematics, and machine learning. I don't think I have good advice for that, because I have not attempted to learn all three of these at the same time.
I've done python
I would say, go get the mathematical foundation you need to understand the machine learning textbook, and then come back to machine learning later. You need at least basic statistics, probability theory, calculus, and linear algebra.
Yes that's exactly what I was trying to do
Kindly give me a resource where I can learn the maths required
I don't have a good thing to recommend as all the resources I know are graduate level
But look up textbooks on those topics.
Unfortunate although thanks for all the help :)
hey guys i don't know if i'm in the right channel, i'm looking for good proxies for scraping with python, any good website? thanks in advance
Hey can anyone help me out with my DL college project?
It's an Image Classification DL work
I've constructed the model, but my accuracy is always at 0.5000
Hello! Does anyone have a way to save an excel file with a password that works reliably in python?
Anyone w/ experience using numpy and numba together? I'm having a weird error regarding arrays and matrices #help-cupcake
Hehe boi
Hey guys, it's me again 
Is there any way I can filter a datetime column in a df by the closest timedelta?
What exactly do you mean by filter
Are you sure you don't mean sort?
Yes, they're already sorted, what I want is to find the next row with the closest timedelta to either 30 or 365 days
@limber token keep in mind that "filter" means "retain only values that satisfy a certain condition". It does not mean "select the most similar "
Okay, but "sort" is not really the word here either is it? 
It's not
See if there's any "select closest" functionality built into pandas. I doubt that there is, but it's worth it to check.
Otherwise you'll have to make a new column that is the timedelta and loop through it.
Well, I guess you don't have to loop through it manually. Because if you have a column of timedeltas, the closest one is going to be the idxmin
When searching "select closest date" I only found how to find the closest date to the initial date
How do you mean?
Do you understand what an argmin or argmax are? Idxmin is the pandas version of argmin. If you don't know argmin, read about that and come back.
If you have a column of timedeltas relative to time x, the idxmin will be the index of the row closest to time x.
I know what argmin and argmax are, what I meant is, I'm trying to find the closest days to a month and a year after date x, so I'm confused how idxmin would help
(Sorry if I'm being confusing, not a native English speaker)
You can add a 13 month (one year plus one month) timedelta to the original date, and then do what I said, and you'll have the closest date to 13 months out.
@limber token I'm at the gym. If you're still confused in like an hour, ping me and I can give a better example
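A minimal sketch of the idxmin approach, assuming a hypothetical `date` column and treating "a month" as 30 days and "a year" as 365 days as in the question. `closest_row` is an illustrative helper, not a pandas function.

```python
import pandas as pd

# Hypothetical, already sorted dates.
df = pd.DataFrame({"date": pd.to_datetime(
    ["2022-01-01", "2022-01-28", "2022-02-05", "2022-12-20", "2023-01-05"]
)})

start = df["date"].iloc[0]

def closest_row(frame, target):
    """Return the index label of the row whose date is nearest to target."""
    deltas = (frame["date"] - target).abs()   # column of absolute timedeltas
    return deltas.idxmin()

idx_30 = closest_row(df, start + pd.Timedelta(days=30))
idx_365 = closest_row(df, start + pd.Timedelta(days=365))
print(df.loc[[idx_30, idx_365]])
```

If only later rows should be considered, the frame can be restricted to `df[df["date"] > start]` before calling the helper.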
generally speaking it's possible that your dataset doesn't have a problem that can be answered, right?
for example idk what the problem here is
i often have a hard time thinking of the problem at hand
What would be a good way to add values to the NA values in this dataframe
I'm trying to plot three lines but there's values missing
why are they missing? what would they be if they weren't missing?
and this is time series data?
yea
you need to do "time series interpolation"
you can see how they insert values that "make sense" given the known values.
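A short sketch of time-based interpolation in pandas, assuming a datetime index and made-up values. With `method="time"` the gap is filled in proportion to the actual time elapsed rather than the row position.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value and an unevenly spaced index.
idx = pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-04", "2022-01-05"])
df = pd.DataFrame({"line_a": [1.0, np.nan, 4.0, 5.0]}, index=idx)

# Interpolate along the datetime index.
filled = df.interpolate(method="time")
print(filled)
```

Whether interpolated values are appropriate depends on why the values are missing in the first place, as asked above.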
ok thx
df1[df1["kwartaal"] == "Q1"]["month"] == "january"
Is it not possible to add a column on a slice like this?
"month" is a new column
use loc instead of stacked [ ][ ]
also, you have == on the rightmost side, which is not assignment.
that said, you can't add a column to a slice. every cell in a column has to have a value, even if it's NaN. (though pandas might initialize values outside the slice to NaN if you do it that way--idk)
Can someone confirm this,
"Houses with 2 bedrooms are cheaper than houses with 3 bedrooms" is data science
However when predicting prices, it is known as ML
the whole framing of this question is weird. data science is using principles from programming and math to use large amounts of data. ML is when you have an algorithm that adjusts itself ("learns") based on data.
but I guess just making a factual statement about a trend in the data is more "data science" than it is ML.
df1.loc[df1["kwartaal"] == "Q1", ["month", "day"]] = "january", "1"
yea this adds NA values for all not Q1
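A small sketch of the `.loc` assignment on made-up data (the `value` column and the rows are hypothetical), using a single new column to keep it simple. Rows outside the mask end up as NaN in the new column.

```python
import pandas as pd

df1 = pd.DataFrame({"kwartaal": ["Q1", "Q2", "Q1", "Q3"], "value": [1, 2, 3, 4]})

# Chained [ ][ ] followed by == only builds a boolean result; nothing is stored.
# One .loc call with a row mask and a column label performs the assignment.
is_q1 = df1["kwartaal"] == "Q1"
df1.loc[is_q1, "month"] = "january"
print(df1)
```

The missing entries can then be filled per quarter with further `.loc` assignments, or in one go by mapping `kwartaal` to a month.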
my prediction was correct
