#data-science-and-ml
"doubt" to mean "question" is a very Indian thing
along with "the same"
y = data[['CO2EMISSIONS']] will give you 2D
yeah i got that
supposed to be 1d
@velvet thorn you must have a shitload of experience with indians i suppose?
I'm from a country with a sizable Indian minority, and I've worked in a startup with like 80% Indians
there are distinct speech pattern differences between Indians in my country and Indians from India
uhm what country exactly
Singapore
but the Indians here are generally 3rd or 4th generation so they don't resemble India Indians that much
oh
and yeah another small "doubt"
isn't this the reason why x is represented as x = data[['ENGINESIZE']]
isnt this the reason though
uh
not just that
okay, maybe you could explain what you mean by that image
because we should be more or less saying the same thing
Andrew NG didnt say anything about this 1d and 2d stuff
okay
so, basically
the standard way of storing data is as a 2D array
where each row (1st axis) represents a sample and each column (2nd axis) represents a type of observations
therefore, X should always be 2D.
in some cases, you may have only one sample, or only one type of observation (feature).
but that doesn't make your data 1D
it just means that one dimension is 1
now, for y, assuming you're only making a prediction on one variable, it should be 1D
because it's basically another type of observation, except it's the target
@velvet thorn would that be the same case for multivariate linear regression?
yes, sklearn treats simple and multivariate linear regression similarly
in both cases X is 2D
just that in SLR its shape is (N, 1), where N is the number of samples
okay, wait, I should clarify
if you mean "multivariate" in the proper sense (multiple dependent variables) then, yes, y will be 2D
but it is common to say "multivariate" to mean "multiple" (which is, strictly speaking, wrong) in the sense of multiple independent variables
in the case above, you passed a 2D array for y, which is why you got a 2D array for your coefficients
because it's one 1D array for each dependent variable
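The shape difference described above can be checked directly. This is a minimal sketch on invented numbers (only the column names come from the discussion), assuming pandas and scikit-learn are installed:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy values, invented for illustration only.
data = pd.DataFrame({
    "ENGINESIZE": [1.0, 2.0, 3.0, 4.0],
    "CO2EMISSIONS": [100.0, 150.0, 200.0, 250.0],
})

X = data[["ENGINESIZE"]]        # double brackets -> DataFrame, shape (N, 1)
y_1d = data["CO2EMISSIONS"]     # single brackets -> Series, shape (N,)
y_2d = data[["CO2EMISSIONS"]]   # double brackets -> DataFrame, shape (N, 1)

# Fit once with a 1D target and once with a 2D target.
coef_1d = LinearRegression().fit(X, y_1d).coef_
coef_2d = LinearRegression().fit(X, y_2d).coef_

print(X.shape, y_1d.shape, y_2d.shape)
print(coef_1d.shape, coef_2d.shape)
```

With the 1D target the coefficients come back as one 1D array; with the 2D target there is one row of coefficients per target column.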
Hi All, does any of you have code for a numpy based CNN to share for own pictures? (~480x360 pixel input)
Does anyone have an updated Data Science roadmap/long-term tutorial one could follow(Including maths and programming etc)? I am currently finishing my bachelors degree, and would like to practice Data Science at the same time
Ping me if someone has something like this :))
trying to learn how to code by transforming the JHU COVID-19 data into a new df normalizing all countries to the day they hit 100+ cases
and I'm really struggling with some basic groupby stuff to shift the JHU data from state-level to country-level, it's blanking out my df
posted some stuff in #help-chestnut then realized that this is likely more geared to analytics arena of python
specifically, how do I fix this? I want the output to be a grouped table at the country level, by day
import pandas as pd
import numpy as np
#this should link to the raw CSV of the latest time series data
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv"
df = pd.read_csv(url)
#unpivot data
df = pd.melt(df, id_vars=['Province/State','Country/Region','Lat','Long'], var_name='date', value_name='Confirmed Cases')
df.to_csv('Working File.csv',index=True)
#create flag and custom fields
#df["date"] = pd.to_datetime(df["date"], format='%m/%d/%YY', errors='ignore')
#df["Confirmed Cases"] = pd.to_numeric(df["Confirmed Cases"])
df["Flag"] = np.nan
df["DayZeroIndex"] = np.nan
df = df.groupby(['Country/Region','date','DayZeroIndex','Flag']).agg({'Confirmed Cases': 'sum'}) #******ERROR IS HERE******
df = df.sort_values(by=['Country/Region','Confirmed Cases'], ascending=True)
df.to_csv('Working File2.csv',index=True)
Flagged the line that I think the error is on
Is there anyone with object tracking experience in python that is willing to teach me or knows a good way to learn it?
Hey @floral mantle so when you actually perform your melt, you have your confirmed cases cohorted by country, but it looks like cumulative totals aren't being calculated
they're split by country & state & date
New York US 1/1/20 20
Washington US 1/1/20 10
and I'm wanting to group it on country
US 1/1/20 30
then, in a really poor way most likely, I'll add in a couple of custom fields to do an indexed plot
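One likely reason the grouped frame comes out empty: pandas' groupby drops rows whose group key is NaN by default, and both Flag and DayZeroIndex are entirely NaN at that point. A minimal sketch of grouping on country and date only, using invented rows in the same melted layout:

```python
import pandas as pd

# Toy rows in the melted layout, invented for illustration.
df = pd.DataFrame({
    "Province/State": ["New York", "Washington", "Hubei"],
    "Country/Region": ["US", "US", "China"],
    "date": ["1/1/20", "1/1/20", "1/1/20"],
    "Confirmed Cases": [20, 10, 400],
})

# Group only on the keys that define a country-day; columns that are
# entirely NaN should not be used as group keys.
by_country = (
    df.groupby(["Country/Region", "date"], as_index=False)["Confirmed Cases"]
      .sum()
)

# Add the placeholder fields after aggregating instead of before.
by_country["Flag"] = float("nan")
by_country["DayZeroIndex"] = float("nan")
print(by_country)
```

Adding the custom fields after the aggregation keeps them out of the group keys entirely.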
Alternatively the data is already compiled the right way at https://ourworldindata.org/grapher/covid-confirmed-cases-since-100th-case?time=0..62 if I could figure out how to link the csv in the data tab into my python df
@floral mantle if you dont see a way to do this easily, you can definitely run some type of BeautifulSoup/Selenium job to download this information on some recurring basis
yeah only challenge that I'm seeing with downloading the data is that they serve it as a blob:http:// and I don't know how to read that in
So @floral mantle - this first StackOverflow post (https://stackoverflow.com/questions/48404681/python-how-to-download-csv-files-using-selenium) and this other one (https://stackoverflow.com/questions/22676/how-do-i-download-a-file-over-http-using-python) should get you pretty far
I am using Selenium for navigating the following website:
https://apps1.eere.energy.gov/sled/#/
I would like to have data for a city like Boston: what I am doing is the following:
from selenium ...
is there a discord server strictly for jupyter?
Afaik no
so I have to come up with a sorting algorithm
and a way to standardize string inputs
not sure where to start
so like for example im taking items from vendors and coming up with an algo to name these items so that they can be categorized and searched with ease
Hello, I have a problem related to matrix multiplication in python... I tried it in C++ and used the IKJ algorithm, times were around 20 seconds for 2000x2000 matrix times another 2000x2000 matrix... the problem is that when I used the exactly same code in python, and used multithreading / multiprocessing, the times got absurdly high, for multithreading, a 2000x2000 times another matrix the same size ran for like 5h and 40 minutes
2D arrays with numpy
if that's what you meant @velvet thorn , sorry if not, I'm completely new with using python
same code, meaning same algorithm, used IKJ algorithm on both
here's the code snippet
def multiplicationThreading(threadAmount, size):
    dividedAmountThreads = (int)(size / threadAmount)
    threads_list = []
    count = 0
    for thread in range(threadAmount):
        new_thread = threading.Thread(name = thread + 1, target = multiplicationParallel, args=(count, dividedAmountThreads, size,))
        threads_list.append(new_thread)
        count += 1
    start_time = time.time()
    print('Start parallel execution with',threadAmount,'threads for matrixes',size,'x',size)
    for thread in threads_list:
        thread.start()
    for thread in threads_list:
        thread.join()
    print('Execution time:', time.time() - start_time,'seconds')

def multiplicationParallel(threadCount, dividedAmountThreads, size):
    random_matrix_a = numpy.random.randint(0, 1000,(size, size))
    random_matrix_b = numpy.random.randint(0, 1000,(size, size))
    blank_matrix = numpy.zeros(shape=(size, size), dtype = int)
    for i in range((dividedAmountThreads*threadCount), dividedAmountThreads*(threadCount+1)):
        for k in range(size):
            for j in range(size):
                blank_matrix[i][j] += (random_matrix_a[i][k] * random_matrix_b[k][j])
In main method calling the thread function like:
for size in sizes:
    for thread in threadAmount:
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        print("Current Time =", current_time)
        multiplicationThreading(thread, size)
@velvet thorn
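Two general facts are relevant to the timings above: in CPython only one thread executes Python bytecode at a time, so threads do not speed up a pure-Python triple loop, and indexing numpy arrays element by element inside Python loops is slow. A minimal sketch (small invented matrices, not the 2000x2000 case) comparing the same IKJ loop with numpy's built-in product:

```python
import numpy as np

def matmul_ikj(a, b):
    """Triple-loop IKJ product, same loop order as the snippet above."""
    n, m = a.shape
    p = b.shape[1]
    out = np.zeros((n, p), dtype=a.dtype)
    for i in range(n):
        for k in range(m):
            for j in range(p):
                out[i, j] += a[i, k] * b[k, j]
    return out

rng = np.random.default_rng(0)
a = rng.integers(0, 1000, size=(20, 20))
b = rng.integers(0, 1000, size=(20, 20))

slow = matmul_ikj(a, b)   # every element access goes through the interpreter
fast = a @ b              # a single call into compiled code

print(np.array_equal(slow, fast))
```

Both give the same result; for large matrices the `@` version is the one to time against C++.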
Hey. I have a pure theoretical question about RNNs and stuff like that. Let's say I have a long sequence of words
- How can I turn them into numbers?
- When I'll turn them into numbers and make network process it, how can I turn them back into words or letters?
@uncut shadow You will need to turn the words into a vector. The famous algorithm word2vec does this. There has been a lot of advancement in this space as of late, so you should def do more research. But the general premise is getting text -> clean text -> remove stop words -> lemmatization -> vectorization -> feed into model.
The model will output another vector, and this vector corresponds with the vectorization process. If you turned the word into a vector, the same idea can be used to turn the vector into words.
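The round trip from words to numbers and back can be sketched in its simplest form with a plain vocabulary lookup. This is not word2vec, just integer ids and one-hot rows on an invented sentence:

```python
import numpy as np

sentence = "the cat sat on the mat".split()

# Build a vocabulary: each distinct word gets an integer id.
vocab = sorted(set(sentence))
word_to_id = {word: i for i, word in enumerate(vocab)}
id_to_word = {i: word for word, i in word_to_id.items()}

ids = [word_to_id[w] for w in sentence]   # words -> numbers
one_hot = np.eye(len(vocab))[ids]         # numbers -> one-hot rows

# Going back: take the largest entry of each row and look the id up.
decoded = [id_to_word[int(i)] for i in one_hot.argmax(axis=1)]
print(ids)
print(decoded)
```

A model that outputs one score per vocabulary entry can be decoded the same way, by taking the argmax of each output row.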
NLP isn't my expertise, but hope this at least gives you a head start.
thanks! I'll try that
nltk and spacy have good docs and should help you out.
any good sources for learning real time video classification??
using CNN+LSTM perhaps
I just want to start learning about data analysis or data scientist stuff. can u guys suggest me the best place to start?
@lapis sequoia check DataCamp
@lapis sequoia first chapters of courses are free but then you will have to pay (actually, there is other way around -> https://www.quora.com/How-do-I-access-DataCamp-courses-for-free)
Thank you @uncut shadow
@obsidian copper What are you looking for? Are you wanting to classify an entire video, or are you wanting to classify objects in a video?
@oblique belfry it's the same hand gestures thingy dude
df["DayZeroIndex"] = pd.to_numeric(df["DayZeroIndex"], downcast='integer')
Trying to convert that field in my dataframe to drop the decimals
Right now it shows 0.0, 1.0, 2.0, 3.0 -- I just want 0, 1, 2, 3
How do I do that?
I would just do .astype(int)
depends on what you're trying to know about DL
presumably it'd be just setting the derivative of the log likelihood to zero
but typically deep learning models are non-convex so there shouldn't be a unique global minimum
(or rather maximum, in terms of likelihood)
aha
yup
err not particularly, other than the Deep Learning book most of the content for DL models are either in papers or lectures notes of very new classes
although books like Murphy will always be relevant
Trying to come up with a minimal environment.yml to use as my default setup in the future. Here's what I have so far:
name: minimal
channels:
- conda-forge
dependencies:
- python=3.7
- pandas
- scikit-learn
- matplotlib
- jupyterlab
Any other suggestions for the bare minimum needed for the majority of projects?
tqdm/seaborn are nice but not necessary
I've used seaborn a lot (great for violin plots), but what's tqdm?
progress bar
not sure about that, but you can see it here: https://github.com/tqdm/tqdm
https://github.com/nalepae/pandarallel this is probably the best extension to pandas i've seen yet
cool, I'd not heard of it
Especially when you're dealing with something huge like ERA5.
hey guys - dataframe filter question
maybe I have to get it another way though
I have a process that updates all COVID-19 cases, by country, by day from the JHU github source.
It normalizes the data to a DayZeroIndex for each country where that is the day each country hit >= 100 cases.
I realized that I need to cut off the last DayZeroIndex for each country since the data isn't finalized until the next morning.
I'm using the filter below to remove anything where DayZeroIndex = -1 (meaning < 100 cases for the country).
What do I use?
df = df[df["DayZeroIndex"] != -1]
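For cutting off the latest DayZeroIndex of each country, one option is to compare every row against its country's maximum. A minimal sketch on invented data, assuming the -1 rows have already been filtered out:

```python
import pandas as pd

# Toy data, invented for illustration.
df = pd.DataFrame({
    "Country/Region": ["US", "US", "US", "Italy", "Italy"],
    "DayZeroIndex": [0, 1, 2, 0, 1],
    "Confirmed Cases": [100, 150, 210, 120, 180],
})

# Largest DayZeroIndex within each country, aligned back to every row.
last_day = df.groupby("Country/Region")["DayZeroIndex"].transform("max")

# Keep everything except each country's most recent day.
trimmed = df[df["DayZeroIndex"] < last_day]
print(trimmed)
```

Because `transform` returns a Series aligned with the original rows, the comparison works as an ordinary boolean filter.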
@silent swan thank you for the astype(int) suggestion. Will give it a shot. Honestly, I was shotgun approaching the whole thing since I kept getting errors. I think the reason I was having so much trouble is that my DayZeroIndex field originally was set to np.NaN and astype(int) doesn't handle that well so it stayed float64. I changed the default value to -1 though, so maybe it works now
Update: Worked like a charm and that's a lot cleaner than the to_numeric/downcast solution.
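The NaN issue mentioned above can be handled in two ways, sketched here on an invented Series: fill the missing values with a sentinel before casting, or use pandas' nullable integer dtype.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, np.nan, 3.0])

# Option 1: replace missing values with a sentinel, then cast.
as_int = s.fillna(-1).astype(int)

# Option 2: the nullable integer dtype keeps the missing value as <NA>.
as_nullable = s.astype("Int64")

print(as_int.tolist())
print(as_nullable.dtype)
```

The sentinel approach matches the -1 default described above; the nullable dtype avoids inventing a value at all.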
Can anyone help me do a polynomial regression?
Hey. I have a question about RNNs. I have seen many times about RNN cell and I'm wondering, aren't those layers? Or maybe RNNs and LSTMs are cells? I mean, I have heard that amount of cells has to be equal the amount of single time-steps
@uncut shadow there's two dimensions to consider
how many timesteps you're going to process the inputs for (using the same cell each time)
hmm
and how many layers of RNN cells/LSTMs you have (these are usually different)
what about the first one
@bowy could you elaborate what your confusion is? the math gets pretty yucky because you have to sum the gradients over the different spatial locations, so some articles may simplify it
well, I have another question. In Dense NNs I could use different activation functions which I could choose. In RNN or LSTM I see there is tanh, sigmoid and softmax. Does it mean, I can only use those 3?
within the LSTMs, there're specific configurations of activation functions. don't change those
otoh, almost no one uses vanilla RNNs
ohh
cuz of the vanishing gradient?
also, if I have a sequence 110011001100... and I want network to predict next 4 numbers how many cells should I use?
People will be. Just ask your question and someone will respond if they know
okay so i've been having a problem with the matplotlib library
I'm trying to create a barchart from the list that i have
but for some odd reason it won't plot both of them properly on the graph
I'll send you guys the code
one sec
What does the result come out like and what do you expect it to come out like?
which is better?
- Enumerating and giving every symbol in sentence an unique number?
- Using one-hot encoding?
i get these 2 results
but the problem is
the Free paid games = 1087200000
and the total paid games = 900000
but for some reason the paid bar is incorrect, it's even supposed to be a different colour as you can see
so i don't know what to do
I'm very lost
@worn stratus any ideas?
I'm desperate lol
The best way to get help is to have a concise summary of your problem, and a link to your code as your most recent message - then just lots of patience
How about something like this for the plotting?
plt.bar(x, [free_sum, paid_sum], color=['b', 'r'], label=["FREE", "PAID"])
plt.ylabel("Scores")
plt.title('Total Free sum vs Total Paid Sum')
plt.xticks(x)
plt.show()
You'll notice that one column doesn't actually show up, but that is only because, with the values you gave, it's too small. Set them to similar values to see the plot as it's supposed to look.
u see the problem is
i can't set them to similar values
they have to be the values that i have
@polar acorn which is "1087200000" and "900000"
i tried to change it to this
labels = ['Free','Paid']
x = np.arange(n_groups)
bar_width = 0.25
fig, ax = plt.subplots()
rects1 = ax.bar(index,free_sum,bar_width, color='b', label='FREE')
rects2 = ax.bar(index + bar_width+0.2, paid_sum, color='r', label='PAID')
ax.set_title('Total Free sum vs Total Paid Sum')
ax.set_xticks(index)
ax.set_xticklabels(labels)
plt.show()
but now only one bar shows up
I know that the free bar is correct
but the paid bar just doesn't appear
It's too small to show. Look at the numbers: one of them is over a thousand times larger than the other.
why is the number on the y axis small
and how do you fix that?
because the question tells me to get the sum of all the installs of the free apps and the paid apps
i got the sum for both
but i'm struggling to plot them
If you use the code I pasted it will plot them correctly. One of the columns will not be visible but that is correct the number of free games is so much bigger that the other column will simply not be visible.
i'm not sure
because my lecturer somehow managed to plot them
she gave me this code
but for some reason for me it doesn't work
fig, ax = plt.subplots()
index = np.arange(n_groups)
bar_width = 0.25
rects1 = ax.bar(index, Freecount, bar_width,
alpha=opacity, color='b',error_kw=error_config,
label='FREE')
# This sets up the second bar in the chart. The first element decides where to display this bar: it is set to the index + the bar_width (the width of the first bar in the chart) plus 0.2 for the space in between.
rects2 = ax.bar(index + bar_width+0.2, costCount, bar_width,
alpha=opacity, color='r',
error_kw=error_config,
label='PAID')
this is what she wrote
That code works fine as well. The problem is not with the plotting code I gave you or the one she gave you. The problem is that you are plotting one column that is 1000 times as tall as the other one so the second column isn't visible at all
so what am i supposed to do?
just leave it invisible?
because when i saw her answer
there was 2 visible bars
so idk
That means her numbers are different from yours, maybe you made some error in coming up with those numbers?
Maybe you are using different data?
If you still want to plot the numbers you have you can change the y axis to be logarithmic such as I have done here:
plt.bar(x, [free_sum, paid_sum], color=['b', 'r'], label=["FREE", "PAID"])
plt.semilogy()
plt.ylabel("Scores")
plt.title('Total Free sum vs Total Paid Sum')
plt.xticks(x, ["FREE", "PAID"])
plt.show()
is it not possible to separate them?
like instead of having them like one on top of eachother
@polar acorn
To separate what? The columns?
It is a normal bar chart. There is nothing wrong with the plotting code that I and your supervisor gave you. The plotting works just as intended. Try it yourself: find a piece of paper and draw a column 5 centimeters tall, and next to it draw a column 10 micrometers tall. You won't see the second column there either. Try the code I or your supervisor gave you but replace free_sum with 100 and paid_sum with 90, and the plot will look just fine. The plotting is fine, the numbers are wrong.
@uncut shadow How big is the sequence?
im trying to write a basic neural network
but my loss keeps increasing
but im pretty sure my calculations are correct
def sigmoid(x):
    global sigmoid
    return 1.0 / (1 + np.exp(-x))

def sig_deriv(x):
    global sigmoid
    return (sigmoid(x) * (1 - sigmoid(x)))

class NeuralNetwork:
    def __init__(self, x, y):
        self.x = x
        self.y = y
        self.weight1 = np.random.normal()
        self.weight2 = np.random.normal()
        self.bias1 = np.random.normal()
        self.bias2 = np.random.normal()
        self.output = np.zeros(self.y.shape)
        self.rate = 0.1

    def feedforward(self):
        self.neuron1 = sigmoid((self.x * self.weight1) + self.bias1)
        self.output = sigmoid((self.neuron1 * self.weight2) + self.bias2)

    def backprop(self):
        dloss_dy = -2 * (1 - self.output)
        dout_dn1 = self.weight2 * sig_deriv((self.weight2 * self.neuron1) + self.bias1)
        dn1_dw1 = self.x * sig_deriv((self.weight1 * self.x) + self.bias1)
        dout_dw2 = self.neuron1 * sig_deriv((self.weight2 * self.neuron1) + self.bias2)
        dn1_db1 = sig_deriv((self.weight1 * self.x) + self.bias1)
        dout_db2 = sig_deriv((self.weight2 * self.neuron1) + self.bias2)
        self.weight1 -= self.rate * dloss_dy * dout_dn1 * dn1_dw1
        self.weight2 -= self.rate * dloss_dy * dout_dw2
        self.bias1 -= self.rate * dloss_dy * dout_dn1 * dn1_db1
        self.bias2 -= self.rate * dloss_dy * dout_db2
i only have one input and one hidden layer
apologies if it kinda messy
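Two lines in the snippet above look inconsistent with its own forward pass and may be why the loss goes up: dloss_dy uses 1 where a squared-error loss would use the target y, and dout_dn1 adds bias1 where the forward pass added bias2. A minimal sketch of the same one-input, one-hidden-neuron network with those two points changed, assuming a squared-error loss and using invented data and starting weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sig_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def loss(x, y, w1, b1, w2, b2):
    out = sigmoid(sigmoid(x * w1 + b1) * w2 + b2)
    return np.mean((y - out) ** 2)

# Toy data and starting weights, invented for illustration.
x = np.array([0.0, 0.25, 0.75, 1.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
w1, b1, w2, b2 = 0.5, 0.0, 0.5, 0.0
rate = 0.1

initial_loss = loss(x, y, w1, b1, w2, b2)
for _ in range(2000):
    # Forward pass, keeping the pre-activations for the derivatives.
    z1 = x * w1 + b1
    n1 = sigmoid(z1)
    z2 = n1 * w2 + b2
    out = sigmoid(z2)

    dloss_dout = -2.0 * (y - out)   # uses the target y, not 1
    dout_dz2 = sig_deriv(z2)        # z2 is built with bias2
    dn1_dz1 = sig_deriv(z1)

    # Average over samples so each parameter stays a scalar.
    grad_w2 = np.mean(dloss_dout * dout_dz2 * n1)
    grad_b2 = np.mean(dloss_dout * dout_dz2)
    grad_w1 = np.mean(dloss_dout * dout_dz2 * w2 * dn1_dz1 * x)
    grad_b1 = np.mean(dloss_dout * dout_dz2 * w2 * dn1_dz1)

    w1 -= rate * grad_w1
    b1 -= rate * grad_b1
    w2 -= rate * grad_w2
    b2 -= rate * grad_b2

final_loss = loss(x, y, w1, b1, w2, b2)
print(initial_loss, final_loss)
```

Averaging the per-sample gradients is a choice made for this sketch; the original subtracts an array from each scalar weight, which silently turns the weights into arrays.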
how would u make a basic neural network in python and for that do i need to know anything higher than algebra
You will need to know basic calculus. Gradient descent is key.
Hey all. Anyone familiar with how to add external regressors to AutoARIMA? I'm trying to forecast a series where the holidays change dates like Ramadan or Lunar New Year. Any ideas would be appreciated
I don't know if you can add external regressors to AutoARIMA. If you have many of them, you could do a multivariate linear regression and then model the errors with an ARIMA model. Or you could try out fbprophet, as that is a nice library that easily includes external regressors such as moving holidays.
Hi all, I am trying to predict the results of football matches with Poisson regression. How can I improve my accuracy? (I have %40-50 accuracy right now)
Link for the Telegram bot I made if you wanna try (unfortunately it's Turkish): https://web.telegram.org/#/im?p=@MacTahminBot
Code in github: https://github.com/umitkaanusta/MacTahminBotu
What coefficients are you estimating right now?
I'm trying to calculate attacking and defending "powers" for each team, based on their goals for-goals against statistics. There's also the home advantage for the league
It sounds already like a good accuracy tbh, if you only use the previous results as your input data
I've used that model before and heard back then that something that is often done is to increase the likelihood of 1-1 draws as they are more likely in real life than in this model. You can check with your data if that's the case and maybe add a small correction.
Also, a general suggestion I got back then that I never followed up on was to treat the attack and defence powers as time series and allow them to change throughout the league; no idea how I would implement that though
Definitely, the model usually fails to guess draws. Would the use of xG (Expected goals) and xGA (Expected goals against) instead of GF-GA correct the situation with draws?
Hi everyone, I found this channel through a website!
I'm new to data science and learning SQL. Do you guys recommend Coursera or Dataquest?
are there any DS based approaches to dealing with object counting across multiple cameras with overlapping FOV (field of view)? It's a tough ask, but if anyone has some resources around this topic, i'd love some recommendation.
hi i'm back my question is what stuff i should learn to automate games like chess
Does anyone know of a good list of hyperparameters for each model type to tune?
I am SO glad Jupyter notebooks work in VSCode. I don't have to expose the jupyter port. I can just use the SSH connection I am using for remote connections.
Hey all, if you know lmfit uncertainties and understand sigmoid curves, would you please have a look at https://repl.it/@jsalsman/COVID19USgrowthExtrapolation and @-me with some ideas for how to prevent the prediction confidence intervals from decreasing, which I suspect means I should not be trying to extrapolate a sigmoid instead of perhaps a (binomial?) time series of non-cumulative occurrences. E.g.:
semi-fixed, still projecting sigmoid cumulatives instead of binomial (poisson?) non-cumulative occurrences
the correct model of the non-cumulative observations is a lognormal time series
Hi all! Was wondering if I could get some data-related help... does anyone know what's the best method to aggregate sentences together using Python?
Unfortunately, I won't be able to provide the data as it is confidential but I am working on creating a sort of word/sentence cloud for let's say top different reasons why a process failed
Dataset:
ID | reason
1 | "The app crashed"
2 | "...crashed on it's own"
3 | "User12345 hacked into the system"
4 | "New Patch doesn't work"
5 | "Water damage to device"
6 | "User09876 hacked into the system"
so in this case, we can technically eye which reasons are similar and should be counted together (e.g. "The app crashed" and "...crashed on it's own") or (e.g. Userxyz hacked into the system).
I have already tried splitting the sentence into words and getting the top words (while excluding out the most common words such as "the" or "is") and displayed it as a word cloud, but visually speaking it is not as insightful as I had hoped.
Also, would this require NLP to achieve or not necessarily?
https://towardsdatascience.com/word-clouds-in-tableau-quick-easy-e71519cf507a <--- I used this as a reference to create the word cloud viz
hello, i'm working on intrusion detection system currently
and i try to apply K-means clustering algorithm using the sklearn library
k = 30
km = KMeans(n_clusters = k)
km.fit(features)
when i get to the above stage of the code, specifically at km.fit(features) part, I encounter a MemoryError
MemoryError: Unable to allocate array with shape (494021, 38) and data type float64
from what i heard from other members of the server, the (494021, 38) array should approximately be less than 1GB of RAM for the computer to handle
I definitely do have enough physical memory to handle it
is there any possible factors that may influence/cause this?
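The size estimate quoted above can be checked with a little arithmetic. The sketch below also prints whether the interpreter is a 64-bit build, since a 32-bit Python has far less address space available; that is only one possible factor to rule out, not a diagnosis of this particular error.

```python
import sys
import numpy as np

rows, cols = 494021, 38

# Bytes needed for one float64 array of that shape.
n_bytes = rows * cols * np.dtype("float64").itemsize
print(f"{n_bytes / 2**20:.1f} MiB")

# True on a 64-bit interpreter, False on a 32-bit one.
print(sys.maxsize > 2**32)
```

The array itself is about 143 MiB, so the raw data is well under 1 GB; intermediate copies made during fitting add to that.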
So RNN is whole network or just 1 layer? It's cells pass output from one gate to another cell and the output from the second gate is passed forward, right?
and should I mix it with other layers?
both
and also, neither
terms in data science mean a lot of related things. if you're talking specifically for a "layer", yeah mix and match whatever layers you like
although, if you're using RNN layers, generally lstm are their improved counterpart
so you can also be talking about RNN for "architecture", in which case RNN is just the whole network
well, I'm wondering cuz in Keras and TF there is Sequential which is (I have seen it many times in many networks) kinda used nearly "everywhere" and it works with dense networks, rnns and many others.
- I don't know how am I going to create it (from scratch). I created dense layer (forwardpropagation) just by
input @ weights + bias
and here the last hidden state also has to be added. Does anybody have any ideas about how to make it?
- Another question, if I'd like to use a sentence which has 3 words in rnn then I'd need 3 cells, right?
- Another question, let's say that the amount of words in a sentence isn't constant then what should I do? Let's say that I want to make a chatbot and I'll train it on 10 word sentences. What should I do to make sure that network outputs right sentence when user uses the command and inputs sentence with e.g. 20 words?
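The questions above (adding the last hidden state to the dense formula, one cell per word, and sentences of varying length) can be sketched with a vanilla RNN forward pass in numpy. The sizes and random inputs are invented; the point is that one cell with one set of weights is reused at every time step, so the same code handles 3 or 20 words.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 16

# One set of weights, shared by every time step.
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_forward(inputs):
    """Run the same cell over a sequence of shape (timesteps, input_size)."""
    h = np.zeros(hidden_size)   # initial hidden state
    states = []
    for x_t in inputs:
        # Dense-layer formula plus the previous hidden state.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

short = rnn_forward(rng.normal(size=(3, input_size)))    # 3-word sentence
long = rnn_forward(rng.normal(size=(20, input_size)))    # 20-word sentence
print(short.shape, long.shape)
```

Diagrams that show one cell per word are the same cell drawn once per time step. Training (backpropagation through time) is not covered by this sketch.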
Data visualization project with matplotlib / cartopy if interested:
https://www.youtube.com/watch?v=fYVqck4iZSU
Data visualisation project to show the spread of COVID-19 and its impact on the global economy.
Same video without stock markets:
https://youtu.be/P9jhY3U4YRQ
Coronavirus cases and death statistics shown with a 2D colour matrix to represent infection and mortality within ea...
I have a doubt DQN say for nth state i am getting
qnthvalues = [1,2,3]
So here max q value is selected which 2 pos or the 3 rd value and i am doing the action 3 and getting qn+1thvalue and now should i apply bellman eq for that action or the 3rd value of qn+1th value and leave other value the same for target value
qn+1value = [2,3,4]
targetq_values = [2,3,bellmaneq(4)]
Or
targetq_values = bellmaneq(qn+1value)
(So for all q values we will be applying or will be applying for the action q value alone.
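In the usual DQN setup, only the entry for the action actually taken is replaced by the Bellman target; the other entries are left at the network's current predictions so they contribute no error. A minimal sketch with the numbers from the question, where the reward and discount factor are assumed values:

```python
import numpy as np

gamma = 0.9     # discount factor (assumed)
reward = 1.0    # reward observed after the action (assumed)

q_current = np.array([1.0, 2.0, 3.0])   # network output for state n
q_next = np.array([2.0, 3.0, 4.0])      # network output for state n+1

action = int(np.argmax(q_current))      # greedy action: index 2

# Start from the current predictions, then overwrite only the entry
# for the action that was taken.
target = q_current.copy()
target[action] = reward + gamma * np.max(q_next)
print(target)
```

For a terminal transition the target for that action would be just the reward, with no discounted term.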
im kind of new to programming and i need help with classes
so i want to create a calculator in python
and
i have an overall class called math
and in that math class there are names of categories like geometry and calculus
and then i want to create a subclass for geometry that includes area and volume
and then another class that gives area of a square, area of a rectangle, area of a triangle
well, that's a good question but probably not in data science channel lol
#tools-and-devops @supple kelp here
i have a question about sql for data science, which is the most reliable sql language?
for data science
I want to find the rows where the length of items in a column are above a certain value.
The datatype of the column is string.
Looking for something along the lines of the code below
df[df.column.str.length > 1]
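The pandas string accessor exposes the length as the method `str.len()` rather than a `length` attribute. A minimal sketch on an invented column:

```python
import pandas as pd

df = pd.DataFrame({"column": ["a", "abc", "", "hello"]})
min_length = 1   # threshold, chosen for illustration

# str.len() returns the length of each string; compare to build a mask.
long_rows = df[df["column"].str.len() > min_length]
print(long_rows)
```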
i guess this is the right place to ask
final_array = self.eval_values(second_array)
self.map1 = final_array
self.map2 = final_array
For whatever reason any modification i do to self.map2 also happens to self.map1
self.map1 isn't used any where else in the code except for a line self.map2 = self.map1 where i try to reset the modified values in map2 with the original values from map1
is this some weird numpy thingie, or am i completely stupid, shouldnt self.map2 = self.map1 overwrite map2 with map1 ?
and also why are modifications to one happening to both
final_array is a numpy array btw
@shrewd grotto Python doesn't make implicit copies
you can think of self.map1 and self.map2 as pointing to the same object
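A minimal sketch of the difference between a second name for the same array and an explicit copy (array values invented):

```python
import numpy as np

final_array = np.array([1, 2, 3])

alias = final_array                # second name for the same array
independent = final_array.copy()   # separate array with the same values

alias[0] = 99
print(final_array[0], independent[0])
```

Assigning `final_array.copy()` to one of the two attributes would keep later modifications from showing up in both.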
@velvet thorn thx
Hey Guys! So at Uni we are learning R right now, and I like the language. Nevertheless I wanted to ask:
Do you think for someone who's planning on working in Data Science somewhere, is it better to invest my time in learning Python's statistics modules, or rather learn some more R?
I can choose how I do my homework. I could plot with R or with Python whatever. I know Python quite well, but not too much when it comes to statistical work. I don't really know R a lot.
@half sand R was made for data science. If you are going to use Python then ok, but knowing R won't do any harm so if you can, yes, you should learn it too
What's the point of pd.Series.name exactly?
I wanted to ask about the data of an image if it were to be opened as a text-file via the notepad application.
Appreciate anyone able to explain what I'd be looking at in a somewhat broad sense.
Can someone tell me how I can get those data at the same 'level/row'? (The indexes don't matter)
@silk frigate Is what you screenshotted all the data you are trying to fix?
@tepid thorn yes
You could just hardcode the values into the NaN indexes or you could create a for loop that does that for you
prob a for loop is better since it works all the time
also with more data values
but I don't know how to do that in this case
I want to make a bar chart like this
But if I plot them now they aren't joined
(obviously)
you could create a for loop thats in the range of the index values that grabs the values you want and another for loop that put the values in the correct spot
If that made any sense llol
Ehhh
Well not really, I mean I still wouldn't have any idea how to do that
But I was just thinking
This is the result of two tables joined together
Maybe I can reset the indexes of both dataframes before I combine them?
and then they're the same (hopefully)
resetting the indexes would just replace the values starting at 0 but it won't move any values
ah thats why you get those NaN values that makes sense
What type of join are you doing?
Left join, right join, inner, outer?
Wait I'll have to go now but I think I know how to do it
If I have any problems I'll let it know
Guys, for those of you that build dashboards, how large is your dataset usually?
I'm trying to build a dashboard from a game... With enough data that shows all kinds of users etc. The entire dataset it'll return is about 17MB of json
What are the tricks for storing, or splitting this up?
I am finally comfortable with this extrapolation which has the python code and data at bit.ly/covidGrowth
Hey guys has anyone used mpld3?
Would like to convert matplotlib charts to html and send it to a frontend using flask
has anyone worked on such a thing?
How can i filter timeseries with the latest month end date ? Incase exact month end is holiday
@agile anvil looks nice - are you going to set anything up to track how right/wrong your predictions were though?
@silk frigate if you post the code you use and example data it would be a lot easier
eg
In [19]: d1 = pd.DataFrame(dict(a=[1,2,3]))
In [20]: d2 = pd.DataFrame(dict(b=[1,2,3]))
In [21]: d1
Out[21]:
a
0 1
1 2
2 3
In [22]: d2
Out[22]:
b
0 1
1 2
2 3
In [23]: pd.concat([d1, d2])
Out[23]:
a b
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
0 NaN 1.0
1 NaN 2.0
2 NaN 3.0
In [24]: pd.concat([d1, d2], axis=1)
Out[24]:
a b
0 1 1
1 2 2
2 3 3
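To tie this back to the reset-index idea: `pd.concat` lines rows up by index label, so `axis=1` still produces NaN when the two frames carry different labels. A minimal sketch, assuming two made-up frames `left` and `right` (not the original data), of dropping the old labels first so rows pair up by position:

```python
import pandas as pd

# Two hypothetical frames whose row labels do not line up
left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"b": [4, 5, 6]}, index=[10, 11, 12])

# Column-wise concat aligns on the index, so mismatched labels give NaN
misaligned = pd.concat([left, right], axis=1)

# Dropping the old labels first makes the rows pair up by position
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
print(aligned)
```

This only makes sense if the rows really are in matching order; if they share a key column, a merge on that key is probably the safer route.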
Hi! I'm working in the business control department of a medium-sized company (250 employees) and I've gotten interested in learning data science to broaden my own knowledge and to make our business more data driven. Our ERP is NAV2017 and we use Qliksense as a BI tool. My initial goal is to analyze our data structure to find flaws, more precisely to find how many rows we have without key dimensions (e.g. dimensions such as brand in the OPEX, missing data in the item structure etc). I'm picturing running through the data and creating a visual report of how correct our data is (%), and from that adding mandatory fields in the input for those dimensions. How would you recommend I go about this? What should I dive into? I'm thinking about learning Pandas and SQL but I'm not sure if those are the right tools for me. Hope you understand my question. Thanks.
@gaunt fiber
Academia.edu is a platform for academics to share research papers.
@gaunt fiber
https://www.academia.edu/37886932/Data_Analysis_and_Visualization_Using_Python_-_Dr._Ossama_Embarak.pdf
@balmy ocean Thanks - I'll look into that!
Just read the chapters of the books that get you directly into what you need to solve your issues... if You do not know yet how to write useful pieces of code with python, take your time and start from scratch... Happy end of month
@balmy ocean Sounds good, thanks and same to you!
@jolly briar yes, I am keeping copies of each version, and updating them with the latest data every day. At least a half dozen people have put one month reminders on my reddit post so they will all have a look then
@agile anvil cheers, what are your thoughts on the amount of people who've been making graphs based on raw count data with no experience in the field? Do you think there's a risk of a sort of misinformation as a result?
well yes, but it's not consequential. Nobody is going to buy different amounts of canned goods or TP for 10k deaths versus 10m deaths
here's a great range survey from https://fivethirtyeight.com/features/experts-say-the-coronavirus-outlook-has-worsened-but-the-trajectory-is-still-unclear/ consistent with Fauci's 100-200k prediction today
Hey, good day!
I'm doing a small project using spacy, I already have the nouns from a big description, but I need to get only the products because I need to compare them with a DB
Any good ideas ?
Hi! Is there any good source for learning dash plotly in python other than their official docs?
have you tried the plotly samples
how to separate multiple values in a single column?
data.genres.str.split(expand=True) . i used this one but it is splitting by line
@lapis sequoia Perhaps specify the separator? | in this case.
this is the result . i thought if everything adds up in the same column it would be easy for me get count of each genre
i will try other ways
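One way to get a count per genre, sketched on a made-up `genres` column (the sample values are assumptions, the real data is `data.genres`): split on the `|` separator without `expand=True`, then put every genre on its own row and count.

```python
import pandas as pd

# Hypothetical sample; the real column is data.genres
data = pd.DataFrame({"genres": ["Action|Comedy", "Comedy", "Drama|Action|Comedy"]})

# Split on "|" into lists, then give every genre its own row
genres = data["genres"].str.split("|").explode()

# Count how often each genre occurs
genre_counts = genres.value_counts()
print(genre_counts)
```

`explode` keeps the original row label on each piece, so the pieces can still be traced back to the row they came from if needed.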
Hi guys, I would like to train a regression model with one of the boosting algorithms (e.g. LightGBM, XGBoost) and use leave-one-out cross-validation (patient-wise). I'd like to implement a custom loss function that minimizes mean absolute error and regularizes based on the maximum correlation between the reference and estimated values of the batch/fold. I'm new to Python and I would appreciate it if anyone could help with this.
Hey can someone please help me out, I'm trying to learn logistic regression through a Python implementation. But I don't understand what this code is really doing
I don't understand why we are dividing the cost function by m or dividing dW by m
and how they relate to each other
Is there any good resources to get started on sales forecasting?
hey
so
for a data cube
i dont get when they say a data cube is a lattice of cuboids
what are the cuboids??
kinda confused by that
Which module would you recommend for interpolation, data regression?
guys i m not able to install tensorflow 2 to use tensorflow_hub for my project
my version of tensorflow in jupyter notebook shows tensorflow 1.11
i am trying to install directly in notebook using !pip3 install tensorflow==2.0.0
but same version after install
any help guys?
Are you in a virtual environment?
Check your Python version: pip has no TensorFlow 2.0.0 package for Python 3.8 or newer, so it won't install there
Would anyone here be able to assist me with a Matplotlib issue?
!ask
Asking good questions will yield a much higher chance of a quick response:
• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.
You can find a much more detailed explanation on our website.
hey guys, i am trying to get embeddings of text data which is read in pandas dataframe and save in new column of dataframe which is (512,1) dimensions. I have got the embeddings but not able to save for each text row in new column with same index. it throws error.
ValueError: Length of values does not match length of index
```python
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = thub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
    return model(input)
```
this is the model i m using to get embeddings
```python
for t in df['title'].iteritems():
    df = df.assign(emb_title = np.array(embed([t[1]])))
```
https://paste.pythondiscord.com/pigaqumuha.py this is the matrix i got after printing using this code
```python
for t in df['title'].iteritems():
    print(np.array(embed([t[1]])))
```
can someone help me here?
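A likely cause of the length error is that each pass of the loop hands a single embedding to `assign`, which expects one value per row of the whole dataframe. A minimal sketch of embedding all titles in one batch and storing one vector per row; the `embed` below is a hypothetical stand-in that just returns 512 numbers per text, not the real TF Hub model:

```python
import numpy as np
import pandas as pd

def embed(texts):
    # Hypothetical stand-in for the TF Hub model: one 512-d vector per text
    return np.ones((len(texts), 512)) * np.arange(len(texts))[:, None]

df = pd.DataFrame({"title": ["first title", "second title", "third title"]})

# Embed all titles in one batch; result has shape (n_rows, 512)
vectors = np.asarray(embed(df["title"].tolist()))

# list(vectors) yields n_rows arrays, so lengths match and each row gets one vector
df["emb_title"] = pd.Series(list(vectors), index=df.index)
```

With the real model the call would presumably return a tensor, so the `np.asarray` conversion is an assumption worth checking on your setup.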
Hello, any data analyst to help me with what can be analyzed from Medical Transcription dataset, if any please pm me. Thank you
In order to leverage new breakthroughs in AI and Machine Learning, it seems the common trend is to use a "backbone" network and extend it. (I am thinking of ResNet-101 for CV tasks and either BERT or GPT-2 for NLP tasks.) This approach makes sense because you do not want to reinvent the wheel and waste time and compute on building the backbone yourself.
I am uneasy doing this in the enterprise setting because I do not necessarily trust the precomputed weights. My question is: am I being overly cautious or how does one balance this dilemma of using new techniques with pre-computed weights you have not been able to verify yourself?
I guess the proof would be in the pudding? If there's any point to using ML at all you should have a large dataset and metrics to score your solution by. Which means you could use transfer learning and also do something from scratch, and simply find out what's better in your case.
Hi, I am a beginner when it comes to programming, and I decided that the first language I'd try is Python. I completed a beginner's course and now know the basics, but I can't seem to find a project that suits my knowledge. I am interested in AI and machine learning. I would be very grateful if you could suggest something I can work on.
hi @kind ermine ! i'd start by looking into linear regression and logistic regression
Hi, I will surely look into that, but I want to start and complete a project just to keep my motivation up. I am familiar with the basics of Python like loops, lists and all that, but for now everything I find interesting, like machine learning projects or pathfinding programs, seems a bit complicated (I mean the coding part). I think it is because I am not familiar with the libraries and algorithms they include. I am looking for something to apply the little knowledge I gathered during the quarantine.
Also, I am still in high school and I am unfamiliar with things like calculus (I don't know if I need things like that for a beginner project). I just like learning through projects and work a lot, and I did the beginner's course just to learn the fundamentals, but now I am a little confused by all the libraries and different things.
So, part of the data science turf comes with its own libraries and algorithms. To make it simpler for you, numpy is for arrays that all libraries are built upon. Pandas is basically for tables. These two you'll just have to get familiar with at least somewhat whenever dealing with data
After that, you pick a library based on your task. Simple model? scikit-learn. Deep learning? TensorFlow. And so on.
So I'd say, don't worry, give it some time. And you'll want to get familiar with those two first, and then just one library for whatever task you're doing.
Or make it from scratch
Ok thank you for the help
Check these cool ER wait time graphs:
Hey @agile anvil!
It looks like you tried to attach a Python file - please use a code-pasting service such as https://paste.pythondiscord.com
Here's the source, https://paste.pythondiscord.com/ikomahacog.py you can see how to download your own ER's data at https://docs.google.com/spreadsheets/d/1Tqm1AU58VF2bvu_v81S8hgFFn2ukbEzVPb6HC24O2Dc
@agile anvil Very cool graphs!
Any idea how I could make a "sector graph"? More like a horizontal stacked bar graph but with redundant data types (only two)
Something like this:
Nevermind, found the matplotlib "barcode" model could do the job
Check out missingno package too
Hi, so I'm currently using tesseract to attempt to do some OCR. The majority of the time the results are accurate however some digits randomly aren't read at all. An example of this would be this image:
Where the 0 isn't picked up
However in all the other images that I input which has the exact same format and also contains 0's it picks it up
Was wondering what other image processing I could do to decrease the chances of values not getting picked up correctly.
Also as a side note I've set the custom config for single digits as without that it didn't pick up any of the single digits.
Can someone enlighten me a little bit? I'm trying to build a data lake, and I have to admit that I'm very new in that area. We have information coming from multiple APIs and we want to store that information in an S3 bucket for further analysis. Is there a solution in AWS to automate that process? Or do I have to create a Python script and schedule an extraction task?
Let's say that I want to crawl the Studio Ghibli API (https://ghibliapi.herokuapp.com/films/) and store snapshots in a S3 bucket, is there a way to do this directly in the AWS console? Or do I have to build a script for it?
Hello,
Wondering if anyone has taken these courses; if so, what is the best approach/way to absorb the material? I know it's all personal, but I've never learned anything quite like this so I am open to ideas/opinions from people experienced with this domain. Thanks
Hey everyone. I'm not really sure how to do this but I had the idea. Theres this reddit thread: https://www.reddit.com/r/askreddit/comments/fu09ok
I'd like to parse it and make a word bubble that lists the occurrences of individual digits. Can anyone do this or point me towards how I may be able to?
I think there's a Reddit API you can use that might answer this question. Google it and see what you find, it should come with a tutorial and documentation on how to use it.
Hello, do you know a website or book to learn machine learning?
I figured it out but the word bubble idea turned out to be dumb, too confusing
bar graph made more sense but not as hipster
@lapis sequoia i have hella books on python machine learning + some vids and examples i think. Message me
Hi everybody, can someone quickly explain to me the usefulness of asynchronous programming ? (asyncio package python)
from the pic, i can't tell what makes red and magenta squares different
Hi - was wondering if anyone could answer a question regarding k-means clustering.
I've been reading guides and looking at examples like this one: https://github.com/corvasto/Simple-k-Means-Clustering-Python where the data read in is simple values in two columns.
For my assignment I have been given data in the form: (animal, countries, fruits, veggies all in separate files)
eg Animal file -
elephant -0.015926 -0.079864 ...
leopard 0.47727 -0.91587 ...
dog -0.33575 0.38897 ...
etc
So I'm confused how this will work when it comes to plotting the data.
Appreciate any help
Hi, looking for advice. Every day I scan Salesforce to find new opportunities and update the data accordingly. I am looking for a library or method that will store the last_scan_day and create a timestamp afterwards, so that the next day when the code scans for new opportunities it knows what the last_scan_day was, only checks the data where Date > last_scan_day, and updates the timestamp with a new date. And so on. It is a very high-level view of what I am trying to do, but if anyone can direct me to the right Python methods, that would be awesome.
Hello fellow python enthusiasts. I've got a numpy related question. If I have a 2D array of shape m x n, and I want to make it into a bigger 4D array of shape m x n x m x n, how would I do that? Basically I'm asking how to vectorise this function:
```python
import numpy as np

array = np.arange(12).reshape((4,3)) #m = 4, n = 3
#how to vectorise this function?
def makeBigger(a):
    ans = np.ones(a.shape[0]*a.shape[1]*a.shape[0]*a.shape[1])
    ans = ans.reshape((a.shape[0], a.shape[1], a.shape[0], a.shape[1]))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            ans[i][j][:][:] = a[i][j]
    return ans
print(makeBigger(array))
#prints the matrix I am looking for.
```
tricky. I think you can get one dimension for free but not two
oh wait you're just copying across both new dimensions
do i use np.copy?
give me a second, should be straightforward
new_array = np.empty([4, 3, 10, 11])
new_array[:, :, :, :] = array.reshape(4, 3, 1, 1)
could also simplify to new_array[:]
I don't see how that's a solution
what do you mean
for i in range(array.shape[0]):
for j in range(array.shape[1]):
assert (new_array[i][j][:][:] == array[i][j]).all()
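The same broadcasting idea can be written so the trailing dimensions come from the array itself rather than being hard-coded. A small sketch using `np.broadcast_to`, with the names `view` and `bigger` made up for illustration:

```python
import numpy as np

array = np.arange(12).reshape((4, 3))  # m = 4, n = 3
m, n = array.shape

# Add two trailing axes of length 1, then broadcast them to (m, n)
view = np.broadcast_to(array[:, :, None, None], (m, n, m, n))

# broadcast_to returns a read-only view; copy if a writable array is needed
bigger = view.copy()
```

If the result is only read and never written, skipping the `.copy()` avoids allocating the full 4D array.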
lockdown reading https://fivethirtyeight.com/features/why-its-so-freaking-hard-to-make-a-good-covid-19-model/
@agile anvil
This is a fucking great article!
Throwing this here since its a pandas question.
Anyone ever use custom accessors with pandas? If i make a custom accessor for a dataframe. If I perform a groupby, is that accessor available there? I'm guessing the groupby objects dont inherit from dataframe, or maybe they do.
super basic question about Pandas but I can't seem to find it on google - does In [*] mean that it is calculating/working? Seems to be very slow for me
@lapis sequoia
Ive decided to use their example and test it, it doesnt work the way I was hoping
ok guys
I am having a problem with adding a list to a pandas dataframe
I have tried to use pd.Series(mylist) to add it into my dataframe
however it turns the values of my list to "nan"
what is occurring here?
any ideas?
@gaunt fiber Is this in a Jupyter Notebook? Then yes
@lament orchid could you post the code please? It's possible you're building your list incorrectly/not how you expected, or your list is not the same length as the dataframe
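One common way to end up with all NaN here, shown on a made-up frame since the real code wasn't posted: `pd.Series(mylist)` gets labels 0, 1, 2, ..., and assignment matches on labels, so if the dataframe's index is anything else nothing lines up.

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]}, index=[5, 6, 7])  # hypothetical frame
mylist = [1, 2, 3]

# pd.Series(mylist) is labelled 0, 1, 2; none of those labels exist in df
df["aligned"] = pd.Series(mylist)

# Reusing df's own index (or assigning the plain list) matches by position
df["by_index"] = pd.Series(mylist, index=df.index)
df["plain_list"] = mylist
```

Whether this is the actual cause depends on how the dataframe was built, for example a frame that was filtered or sorted earlier keeps its old labels.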
@lament orchid Why did you ping moderators, are you in need of something moderating?
to moderate
pinging moderators is intended for attracting attention of mods if something bad is happening, there isn't a priority service if you ping staff. please only ping in future if you need something moderating.
hey could someone help me out with a pandas question in #help-orange . I'm trying to access cols of a df using .iloc at two different positions
nvm. I'm good now
I have a list of date data, like Sun May 19 00:53:53 2019 +0300, Mon May 20 20:01:07 2019 +0300, Mon Dec 16 01:02:47 2019 +0300 etc
what would be best way to plot it
Tue Nov 19 11:16:46 2019 +0300
Sun Oct 20 23:13:54 2019 +0300
Tue Aug 27 00:45:37 2019 +0300
Thu May 23 04:13:16 2019 +0300
Tue May 21 23:27:36 2019 +0300
Tue May 21 20:47:42 2019 +0300
Mon May 20 23:27:10 2019 +0300
Mon May 20 20:01:07 2019 +0300
Sun May 19 01:10:20 2019 +0300
Sun May 19 00:53:53 2019 +0300
this is an example data, and I want to kind of see it in daily basis
with line graphs
but I couldnt find the correct thing to display it
What values do you want from that data?
I only see the differences in dates and times...
Or do you have a larger dataset?
these are commit dates, and I want to see a line graph of how many commits authored in last year
@paper spindle over the last year? Wouldn't that just be a case of counting the rows? Or do you want a graph of the commits for each day over the last year (so may 21 above would have a value of 2)
yep, the latter
I'm on my phone, but I think you should be able to convert into a date time vector, aggregate then plot?
The time information isn't important as I understand it
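A rough sketch of that convert, aggregate, plot idea using only the standard library, on a few of the timestamps posted above. The format string is an assumption based on how those lines look:

```python
from collections import Counter
from datetime import datetime

raw_dates = [
    "Tue May 21 23:27:36 2019 +0300",
    "Tue May 21 20:47:42 2019 +0300",
    "Mon May 20 20:01:07 2019 +0300",
    "Sun May 19 00:53:53 2019 +0300",
]

# Parse the git-style timestamp, then keep only the calendar date
days = [datetime.strptime(s, "%a %b %d %H:%M:%S %Y %z").date() for s in raw_dates]

# Commits per day, sorted chronologically for plotting
per_day = sorted(Counter(days).items())
```

The `(date, count)` pairs can then go to any plotting library as x and y values. Days with no commits are simply absent, so a line graph would need them filled in with zeros first.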
Does anyone know how to normalize a wave function using numpy?
hi
can someone help me reinforce this.. I'm learning partition by today.. I understand it's part of Over.. and it helps split the table into something it's going to be filtered by eventually, so it's lighter to handle and easier to run on large tables
(select avg(quantity), order_id, year
from salestable1
group by order_id, year);
-- same thing using partition by
select distinct year, order_id, avg(quantity) over(partition by year, order_id) as avg_books
from salestable1
group by order_id, year, quantity;
but I'm having trouble reinforcing this.. it'd be nice if someone explained the logic to me briefly
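Since this is a Python channel, here is the difference sketched in pandas rather than SQL, on made-up rows standing in for `salestable1`. `GROUP BY` collapses each group to one row, while `OVER (PARTITION BY ...)` keeps every row and attaches the group's result to it:

```python
import pandas as pd

# Hypothetical rows standing in for salestable1
sales = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020],
    "order_id": [1, 1, 2, 1],
    "quantity": [2, 4, 5, 3],
})

# GROUP BY: one output row per (year, order_id) group
grouped = sales.groupby(["year", "order_id"], as_index=False)["quantity"].mean()

# PARTITION BY: every input row is kept, with its group's average attached
windowed = sales.assign(
    avg_books=sales.groupby(["year", "order_id"])["quantity"].transform("mean")
)
```

Here `grouped` has three rows and `windowed` still has four, which is the practical difference between the two queries.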
@lapis sequoia Perhaps it might help to visualise it: https://rextester.com/IISV27995
@lapis sequoia Also, #databases is the best section to ask in, for next time!
which library do y'all prefer for geospatial plotting?
@rain palm I get this error remaining connection slots are reserved for non-replication superuser connections
@lapis sequoia Was about to go, but here are some links which may help. If not, perhaps ask in #databases.
https://dba.stackexchange.com/questions/120694/postgresql-remaining-connection-slots-are-reserved-for-non-replication-superuse
https://stackoverflow.com/questions/11847144/heroku-psql-fatal-remaining-connection-slots-are-reserved-for-non-replication
https://github.com/wagtail/wagtail/issues/1242
conda install opencv
How do i reshape data of shape (167076, 66) to shape with 5 time steps for LSTM network... i am getting value error for using : X_train.reshape(int(X_train.shape[0]/5), 5, X_train.shape[1])
good morning, how do I combine in general a timeseries analysis with additional datapoints, for example.. the stock price of a company..with including the gdp development or something like that ?
Hello,
If I use anaconda, or anything inside what comes in the bundle, do I have to reference or give credit in a research work like a thesis?
Same question for public domain data such as iris dataset
Hi all,
I have no idea why this won't run when I try to apply
def classlto(df):
if df[df['prevclass'] < df['classn'] | df['won'] == 1]
return df['bf_decimal_sp']-1
elif df[df['prevclass'] < df['classn'] | df['won'] == 0]
return -1
I was pondering that it's as there is not absolute True or False statement, but I believe that there is.
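A few things stop the snippet above from running: the `if` and `elif` lines have no trailing colon, `|` binds tighter than `<` and `==` so each comparison needs its own parentheses, and `df[...]` is a filtered dataframe rather than a single True/False. A hedged sketch of a row-wise version for `df.apply(..., axis=1)`. It assumes both conditions are meant to hold together (so `and` instead of `|`) and returns NaN when neither branch matches; both of those are guesses, and the sample data is made up:

```python
import numpy as np
import pandas as pd

def classlto(row):
    # Assumption: both conditions must hold, so "and" replaces the original "|"
    if row["prevclass"] < row["classn"] and row["won"] == 1:
        return row["bf_decimal_sp"] - 1
    elif row["prevclass"] < row["classn"] and row["won"] == 0:
        return -1
    return np.nan  # rows matching neither branch

# Hypothetical sample data with the column names from the question
df = pd.DataFrame({
    "prevclass": [3, 3, 5],
    "classn": [4, 4, 4],
    "won": [1, 0, 1],
    "bf_decimal_sp": [6.0, 2.5, 3.0],
})
df["profit"] = df.apply(classlto, axis=1)
```

If "either condition" really was the intent, swap `and` for `or`, but note that the second branch would then never be reached for rows where `prevclass < classn`.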
More generally I am wondering if there is a better way to frame a thesis when exploring a dataset
Any data scientists want to dive into interpreting a flatten the curve covid python model (partial code i need to fix) with me...
This is the "bending the curve" that we are all staying home for. Animation source: https://paste.pythondiscord.com/rajuzixoke.py Static source at bit.ly/covidGrowth
@copper umbra sure!
maybe my repl.it for the static graphs helps? http://bit.ly/covidGrowth
@agile anvil i am a state employee data analyst, but probably one of the closest we have to a potential data scientist, so they threw a project in my lap this evening
to interpret a statistical model for how social distancing affects the curve
the problem is the example code i was sent is in pieces, it is 2 defs and no execution code and has missing references.
i spent a few hours this evening trying to figure it out and am struggling
the model you sent is meant to predict the current growth based on data correct
@copper umbra yes; sorry for my delay I have to be AFK for a little while. If you want to DM me code please do, I promise to keep it confidential.
I am about to go to bed almost midnight here. But this code is not private. I will dm you tomorrow
very good
Hi, I'm new to this server and am wondering if this would be the correct place to ask if anyone has experience with generating netCDF files from a folder of .tiff files?
@whole rampart probably the wrong channel...
hm, actually
thinking about it
it's either this or a help channel, but that's a somewhat specialised query
I've found 1 thread with a similar problem on stackoverflow. I'll come back here with more information if I am unsuccessful
@wild spoke Try using TimeseriesGenerator from keras.preprocessing.sequence
Automatically reshapes the data into 3 dimensional format
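For reference, the plain reshape fails because 167076 rows cannot be split evenly into groups of 5. A minimal numpy sketch of the non-overlapping alternative, using a zero-filled placeholder with the shape from the question (the variable names `usable` and `X_seq` are made up):

```python
import numpy as np

X_train = np.zeros((167076, 66), dtype=np.float32)  # placeholder with the shape from the question
timesteps = 5

# 167076 is not a multiple of 5, which is why the plain reshape fails
usable = (X_train.shape[0] // timesteps) * timesteps

# Drop the leftover row, then form non-overlapping windows
X_seq = X_train[:usable].reshape(-1, timesteps, X_train.shape[1])
```

This gives far fewer samples than a sliding window would, so which one fits depends on how the targets are laid out.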
Has anyone done data science projects for a coffeeshop and or bakery?
I swear somebody was asking about that not too long ago. Is that for a course or something?
Really??
Its actually for my own bakery cafe haha
Wow a course on this would be awesome
Hey,
I have two arrays that I have normalized between 0.0 and 1.0 using sklearn.preprocessing.MinMaxScaler. I want a measure of how well these coincide so I thought calculating the entropy between them could be useful (?).
I tried doing something like scipy.stats.entropy(arr1, arr2) but quickly realized that's not how it works. Any ideas how to do this? I'm expecting a single float as output value
Hi, question about glorot_uniform initializer.
In convolutional nets, what are the fan_in and fan_out and how can I calculate them?
Begin rant.
Nothing is more boring than refactoring someone's bad python code that is a port of another person's poor Matlab code. Unfortunately, this is the fastest way for me to get this thing done. ML sucks at times. Refactoring shitty algorithms from Lua or Matlab is not how I want to spend my Tuesdays.
Rant over.
Hope everyone is having a great day in ML land.
Hey everyone! Need desperate help with a neural network problem I am working on. I'm self-taught and so far this is the only thing I have had trouble finding the answers for online. If you're familiar with CNNN, spotlight, or recommender systems please please please message me
@main coral i'm learning too, whats the issue youre running into in case i might know anything about it?
@oblique belfry just be glad you don't have to port someone's minpack
@agile anvil You are right. I will be thankful for my current situation.
Hey guys, I have a dataset with approx 500 images (kinda small) with 5 classes, with high resolution images and small detail so I can't resize much ...
Should I use a pre-trained network? I tried resnet50 and vgg16 with my own input and I'm always around 30% accuracy, which is pretty bad. Any idea or paper I should look at?
Hey @trail pagoda!
It looks like you tried to attach file type(s) that we do not allow (.txt). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .md.
Feel free to ask in #community-meta if you think this is a mistake.
Does anyone know how to properly interpret the output of a torch utils bottleneck run on your code
My GPU is only at 20% util and I don't know what all these "Fills" mean
anybody here experienced using pandas dataframe?
I have a two column dataframe with a string value in the first column and an int value in the second column. I need to write my code to find where the string in the first column matches some string, and then I next need the output to be the int value in the second column at that same row.
df.loc and df.iloc look similar but presume you have both the row and column known which I don't, only the column. I think maybe I'm supposed to use df.at / df.iat but same problem, these functions need information that changes.
And I read that I should be able to do
df.loc[df['Col1'] == mystring, ['Col2']]
to get a specific column instead of the whole row... but it's returning the value in a dataframe format. I need to be able to get it into a number (int) format eventually.
hmm, I can show you a possible way to do this but I probably cannot show you the best way to do this
i've never been able to figure out how to use pandas effectively so I sometimes write hacky / ugly solutions
I think what you have is a good start
let's work with this as an example dataframe
import pandas
dataframe = pandas.DataFrame(
{
"Names": ["Cat", "Dog", "Bird"],
"Ages": [12, 40, 36],
}
)
say we want to get the age of Cat
I will first get the rows where Names matches Cat
>>> dataframe.loc[dataframe["Names"] == "Cat"]
Names Ages
0 Cat 12
looks good so far
I've been able to do that so far^
I can then grab the Ages column
>>> dataframe.loc[dataframe["Names"] == "Cat"]["Ages"]
0 12
Name: Ages, dtype: int64
so this is a Series object
which is iterable, so we should be able to unpack it
>>> [output] = dataframe.loc[dataframe["Names"] == "Cat"]["Ages"]
>>> output
12
thats our int
alternatively, I think we can grab it by index using iloc
yea works too
>>> dataframe.loc[dataframe["Names"] == "Cat"]["Ages"].iloc[0]
12
actually we dont need iloc here, the series is subscriptable (this looks the value up by index label, and the matching row's label happens to be 0)
>>> dataframe.loc[dataframe["Names"] == "Cat"]["Ages"][0]
12
works too
this works too (although admittedly I don't really know why)
>>> dataframe.loc[dataframe["Names"] == "Cat", ["Ages"]]
Ages
0 12
but it still gives a dataframe
so it's still 2D and we need to iloc the value via both dims
>>> dataframe.loc[dataframe["Names"] == "Cat", ["Ages"]].iloc[0, 0]
12
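Pulling the options above into one short form, as a sketch on the same example dataframe: passing the column as a plain string (`"Ages"`) instead of a list (`["Ages"]`) to `.loc` returns a Series rather than a one-column dataframe, and `.iloc[0]` then takes the first match by position.

```python
import pandas as pd

dataframe = pd.DataFrame({"Names": ["Cat", "Dog", "Bird"], "Ages": [12, 40, 36]})

# Row mask and column label in a single .loc call gives a Series
ages = dataframe.loc[dataframe["Names"] == "Cat", "Ages"]

# .iloc[0] is positional, so it works whatever the row's index label is
age = int(ages.iloc[0])
```

The `int(...)` call converts the numpy integer to a plain Python int, which matters mostly when the value gets serialized or compared by type later.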
lots of ways
Frustrating because I see this working for you and I am still having issues so I think I need to go back and look at my dataframe and see if there is something funky there? idk.
37 323313
Name: Col2, dtype: object
maybe you can show me just the code that you use
so it looks like that should be row 37 and the value is 323313
yup
what happens when you do [0] on it
infringing = DLdf.loc[DLdf['Title'] == title]['Unique Infringements']
sec, having some other weird error pop up-
no rush
.iloc[0]
gives
IndexError: single positional indexer is out-of-bounds
.iloc[0,0]
gives
IndexingError: Too many indexers
and just [0] at the end gives KeyError: 0
ok, so there's nothing in your series
looks like there are no rows where the title was found
is that possible?
try to search for something that is present
if the series is of length 0, then the index 0 will be out of bounds
by the way, since the series is iterable, it may be easier for you to work with it as a python list
>>> list(dataframe.loc[dataframe["Names"] == "Cat"]["Ages"])
[12]
so if i search for an animal that doesn't exist in my dataframe, i should get an empty list
>>> list(dataframe.loc[dataframe["Names"] == "Elephant"]["Ages"])
[]
of course if the animal was present many times, then there would be many ages in my list
these are all cases that need to be accounted for, depending on what kind of data you have and how many assertions you can make about it
ok so I can set up a try/except and that shows me that I CAN get the numbers out of there:
if I do .iloc[0] when the number is present, it gives me a number!
excellent
so when I do .iloc[0] on the returned dataframe... since it's a single row, iloc is looking for the first item in that row (index starts at 0) correct?
yes
OK good good that's how I visualized it/ understood it
it's a relief something makes sense for once!
a try-except would work, but maybe it'd be nicer to look at the length of the resulting series
>>> len(dataframe.loc[dataframe["Names"] == "Elephant"]["Ages"])
0
>>>
>>> len(dataframe.loc[dataframe["Names"] == "Cat"]["Ages"])
1
it kinda depends on what you're looking to do next
Yea, I had gotten halfway there so at some point it was
if not df.empty:
because the times where it can't find the title I need to hard code it to 0 (and I was getting errors when I was looking for the first in a zero length row)
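A small sketch of that "fall back to 0 when the title isn't found" logic, split into steps so a failure points at one line. The helper name `lookup` and its parameters are made up for illustration:

```python
import pandas as pd

def lookup(df, key_col, key, value_col, default=0):
    """Return the first matching value, or `default` when nothing matches."""
    mask = df[key_col] == key          # step 1: boolean Series
    matches = df.loc[mask, value_col]  # step 2: matching values
    if matches.empty:                  # step 3: handle the no-match case
        return default
    return matches.iloc[0]

dataframe = pd.DataFrame({"Names": ["Cat", "Dog", "Bird"], "Ages": [12, 40, 36]})
```

With this, `lookup(dataframe, "Names", "Elephant", "Ages")` gives the default instead of raising, and a key that appears several times quietly returns its first match, which may or may not be what you want.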
but now that I chopped away at this so much I need to go back to my notebook and write out what I want to do then I can go back to my keyboard and fix it up
right haha, ok
thank you so so much!
no worries, glad I could help
by the way, it's interesting how the row filtering works
>>> dataframe["Names"] == "Cat"
0 True
1 False
2 False
Name: Names, dtype: bool
this gives a series of bools, which tell you where the condition holds and where it doesnt
I was halfway there (my set up was pretty much almost there) but I was getting tripped up by the errors I got because I was trying to get the first item in the list even when the list was empty and that threw the error.
Yea I was reading up pandas docs to try to find out what I am supposed to use here and it looks like there is a lot of stuff that is True/False
I didn't see how I could use that?
yeah, it may be better to split up the process into logical chunks, i.e. first get the indices, then the filtered df, then the column, and finally the value
that way, once it fails, you know exactly which step caused the error
if you do it all in-line, it's harder to see
yea, so in my case I had 3 rows in my df
so if I do dataframe["Names"] == "Cat", it will tell me on which rows the condition holds
you can see that it's True, False, False because it only holds on the first line
oh but I bet I could get the rows from there not too difficult after that point?
and then this series is passed to the loc, and it simply grabs the indices where it's True
what we do in the next step is get the rows
using this boolean series
we can build it ourselves
>>> dataframe[[True, False, False]]
Names Ages
0 Cat 12
>>> dataframe[[True, False, True]]
Names Ages
0 Cat 12
2 Bird 36
etc
it's a two-step process
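The two steps described above (build a boolean series, then index with it) can be sketched like this. The `Dog` row is again an assumed filler for the row the chat does not show.

```python
import pandas as pd

dataframe = pd.DataFrame({"Names": ["Cat", "Dog", "Bird"], "Ages": [12, 5, 36]})

# Step 1: a condition produces a boolean Series aligned with the rows.
mask = dataframe["Names"].isin(["Cat", "Bird"])

# Step 2: indexing with that Series keeps only the rows where it is True.
picked = dataframe[mask]

# A hand-written list of bools behaves the same way.
same = dataframe[[True, False, True]]

# Conditions combine with & and |; each side needs its own parentheses.
combined = dataframe[(dataframe["Ages"] > 10) & (dataframe["Names"] != "Cat")]
```

This is roughly the pandas counterpart of `SELECT ... WHERE`: the mask is the `WHERE` clause.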
hey that's kinda cool
yeah, it is cool
sensible use of time to only grab where is True and do the work on those
yeah, exactly
I suppose the goal is to get something that feels similar to an SQL select ... where
I hope some day the way that computers vectorize problems becomes intuitive to me because I think in terms of step by step and looping through things so it's not... there yet.
pandas is confusing, but I promise it does get easier
back in (unrelated) school I used xlrd instead of pandas and did nested for loops for a project because I did NOT have a handle on pandas at the time. Even now sometimes I feel just for loops in xlrd is so much easier even if it might be technically slower. My data isn't big enough where that's a make or break for me
yeah, I'm definitely guilty of similar things
especially when it's one-off, throwaway code
sometimes you just dont want to go through the effort of learning something entirely new
Not when I try and try and spend all that time trying to make the fancy pandas work and give up and do it with xlrd/xlwt and call it a day.
Didn't feel great about how I did it but I got the task done and moved on
(because we can) (sorry)
Got it to work. Well, this function at least. Thank you so so SO much!
Is that a correct implementation for dropout?
def dropout(self, x):
    """
    Applies dropout on `x`
    :param x: input array
    """
    shape = x.shape
    noise = np.random.choice([0, 1], shape, replace=True, p=[self.rate, 1-self.rate])
    return x * noise / (1 - self.rate)
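Nobody answered this in the channel. As far as I can tell, the snippet matches the usual "inverted dropout" used at training time: zero each element with probability `rate`, then divide by `1 - rate` so the expected value is unchanged. Below is a standalone sketch of the same idea. The `training` flag and the explicit `rng` argument are my additions, not part of the original method.

```python
import numpy as np

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each element with probability `rate`
    and rescale the survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return x                      # at inference time dropout is a no-op
    keep_prob = 1.0 - rate
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
y = dropout(x, rate=0.3, rng=rng)
```

With an input of all ones, the mean of `y` should stay close to 1, which is a quick sanity check that the rescaling is right.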
anyone know how i can find the image bounding box coordinates of an image within an image?
Not sure but i think my question belongs here
python question
Write a program matrix.py that takes a matrix of integers as input. The program then determines the largest and smallest elements in each row and column. It also calculates the sum of numbers in each row and column. The program prints all these findings as output.
Been stuck on this for 5 hours now
For input, the program will ask the user the number of rows and columns at first. Then it will ask the user to enter the items row by row, all in one line, each item separated by space. Then the program will process this input and save the items as integers in a 2D list (that is, list of list).
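A minimal sketch of the processing part of that exercise, assuming the rows have already been read as strings (for example via `input()`). The helper names `parse_matrix` and `summarize` are made up for illustration.

```python
def parse_matrix(lines, n_rows, n_cols):
    """Turn space-separated rows of text into a list of lists of ints."""
    matrix = [[int(tok) for tok in line.split()] for line in lines]
    if len(matrix) != n_rows or any(len(row) != n_cols for row in matrix):
        raise ValueError("input does not match the stated dimensions")
    return matrix

def summarize(vectors):
    """Return (largest, smallest, sum) for each row or column."""
    return [(max(v), min(v), sum(v)) for v in vectors]

matrix = parse_matrix(["1 2 3", "4 5 6"], n_rows=2, n_cols=3)
row_stats = summarize(matrix)
# zip(*matrix) transposes the matrix, so columns can reuse the same helper.
col_stats = summarize(list(zip(*matrix)))
```

The transpose trick is the main point: once columns look like rows, the same function handles both.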
Quick Pandas question:
Does anyone know how to rename multilevel column names?
To be more specific: rename one level of column names, if it's "Unnamed..." so that it's the same as the other level column name.
ye
Basically, how to get from here:
To here
I'd like to use just the level1. But some of them are "Unnamed" when level0 hold the names I want them to have.
df[('', 'column_name')] = df[('', 'another_name')]
df = df.drop(columns=[('', 'another_name')])
oh, try reset_index()
That's great. What if I have a lot of these columns
reset_index() didn't work to my dismay
Nubonix, would you mind helping me real quick with my thingy?
Original data is quite messy. But reset_index() gives me something like this
Can I write a function to rename the 2nd level names if there is "Unnamed" in it?
I forgot that tuple is immutable so I failed.
oh right, well this is kinda hacky... but you could write to a csv and then read it
unless u wanna rename every column
and then drop the multiindex version of the column
there are other ways, but its a pain
@lapis sequoia hit me
I thought I was smart enough to not have to resort to saving it as csv. It turned out that I spent way more time.
Will try
ik its dumb, but it works
otherwise u can research multiindex to single index in pandas via google
or ask someone else cause i dont wanna cover it
well, dunno really how, without googling myself.. so
Will both Google and try the csv route. Thanks.
np, if that doesnt work, ill try to help more
Thanks nubonix
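For the record, one way to do this without the CSV round trip is to rebuild the column tuples instead of mutating them, since tuples are immutable but the whole `columns` index can be replaced. This is a sketch on toy data. The column names are made up, and `Unnamed: 0_level_1` is the kind of label pandas produces when a two-row header has blanks.

```python
import pandas as pd

# Toy frame with a two-level header; the names are assumptions.
df = pd.DataFrame(
    [[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples(
        [("id", "Unnamed: 0_level_1"), ("score", "math"), ("score", "art")]
    ),
)

def fix_unnamed(frame):
    """Copy the level-0 name into level 1 wherever level 1 is 'Unnamed...'."""
    new_cols = [
        (top, top) if str(sub).startswith("Unnamed") else (top, sub)
        for top, sub in frame.columns
    ]
    out = frame.copy()
    out.columns = pd.MultiIndex.from_tuples(new_cols)
    return out

fixed = fix_unnamed(df)

# Keep only level 1 as a flat header, as asked above.
flat = fixed.copy()
flat.columns = fixed.columns.get_level_values(1)
```

It handles any number of columns, because the rule is applied to every tuple in one pass.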
@lapis sequoia hit me
@strange stag dm?
ight
I have been having some trouble digging deeper into data science. I've tried lots of approaches to learning (Books, MOOCs, Research, Projects)... (I always hear the best way to get into data science is with projects, which is what I am doing as we speak). However, at the end of the day I feel directionless, like I am just repeatedly exploring the shallow perimeters instead of taking a leap into the greater depths. How did you really get into Data Science? I know there are lots of specializations and lots of industries, so perhaps I need to identify what specialization and what industry I want to pursue...
I feel the same way @opaque stratus . I feel like I never have any direction when learning online and I'm always scraping surface value info only to get bored and move on to the next thing because I'm not getting any value from anything. I found out my company (tech giant) reimburses the cost of some nanodegrees from an online MOOC site. I have been enjoying that more because 1 it provides more structure and 2 I feel like I really need to complete it in order to get reimbursed
I'm taking an intro to data science with python course that also teaches sql and I'm enjoying it a lot. ~1hr per day
I have a really dumb question that I need to ask because I want to make sure I'm thinking about this correctly and I'm not high.
I built a linear regression model after normalizing the variables. I now have to plot the actuals and the predictions (on the same dataset used to build the model) on a scatterplot.
After un-normalizing (or un-transforming) the predictions, there should be a somewhat visually evident linear relationship between it and the un-normalized actuals as well right? Because there is a linear relationship between the normalized actuals and predictions.
Does anyone use XPath Helper here?
@woven tundra does the linear model not work if you don't normalize? Usually you don't need to normalize if it's an ordinary linear model. How many independents?
@agile anvil Just 3 independents (that's after testing about 20 more all of which were insignificant but we're not really interested in accuracy, we just want to make a point to a client).
If we don't normalize the R-squared drops from 50% to 18%. Although I didn't test it with just normal min/max scaling.
Don't you have to normalize though? A regression model assumes normally distributed independents yeah?
@woven tundra you're right, sorry https://stats.stackexchange.com/questions/306019/in-linear-regression-why-do-we-often-have-to-normalize-independent-variables-pr/306032
Does the answer there help?
what is the polynomial order?
No worries. I spoke to a more statistically-inclined colleague about my initial message. He agreed that normalization just centers everything and you should see a somewhat linear relationship between your actuals and predictions after you un-normalize it. Of course the strength of the relationship you see depends on the accuracy rate of your model.
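A small numerical check of that claim. It assumes "normalizing" means a z-score (subtract the mean, divide by the standard deviation); a nonlinear transform such as a log would not carry over exactly. The data is synthetic and the coefficients are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 50 + 10 * (X @ np.array([1.0, -2.0, 0.5])) + rng.normal(scale=5, size=200)

# Normalize the target, fit ordinary least squares on the normalized values.
y_mean, y_std = y.mean(), y.std()
y_norm = (y - y_mean) / y_std
A = np.column_stack([np.ones(len(X)), X])          # add an intercept column
coef, *_ = np.linalg.lstsq(A, y_norm, rcond=None)
pred_norm = A @ coef

# Un-normalize the predictions with the same mean and std.
pred = pred_norm * y_std + y_mean

r_norm = np.corrcoef(y_norm, pred_norm)[0, 1]
r_orig = np.corrcoef(y, pred)[0, 1]
```

Un-normalizing is a shift and a positive rescale, so the correlation between actuals and predictions is the same before and after, and the scatterplot keeps its shape with different axis units.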
Not using any polynomials @agile anvil
no squared or cubed terms?
Nope
Hello, I've been programming python for 3 years, How can I start studying data science?
@silver igloo Plenty of places to start to be honest, how do you learn best? Reading? Online courses? Or just jumping into things and figuring it out?
I am building a ML model. I am getting training results are as follows.
loss: 0.0071 - acc: 1.0000 - val_loss: 0.1213 - val_acc: 1.0000
what can i do for getting proper results and avoid overfitting of model
plot your learning curves
can't go by just a metric
If the gap between your training curve and validation curve is extremely wide (and your training curve is very low on the graph), you're overfitting your model
Here's a simpler article:
https://rmartinshort.jimdofree.com/2019/02/17/overfitting-bias-variance-and-leaning-curves/
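To make "look at the gap" concrete, here is a sketch that compares per-epoch training and validation loss, the same numbers a learning-curve plot would show. The `history` dict mimics the shape of a Keras `History.history`. The final pair echoes the losses posted above; the earlier epochs are invented for illustration.

```python
def loss_gap(history):
    """Per-epoch validation loss minus training loss."""
    return [val - train for train, val in zip(history["loss"], history["val_loss"])]

# Illustrative numbers only; not taken from a real training run.
history = {
    "loss":     [0.90, 0.40, 0.10, 0.030, 0.0071],
    "val_loss": [0.92, 0.45, 0.18, 0.125, 0.1213],
}

gaps = loss_gap(history)
# A gap that keeps growing while training loss keeps falling is the
# usual visual signature of overfitting.
gap_is_widening = all(later > earlier for earlier, later in zip(gaps, gaps[1:]))
```

Plotting `history["loss"]` and `history["val_loss"]` against the epoch number shows the same thing visually.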
Greetings everyone, I am looking for someone who would like to collaborate on a project with me. I am a very novice programmer myself, but I have experience in an industry that has given me an idea that could revolutionize said industry. If you are an experienced programmer with knowledge of data science, feel free to DM me and we can discuss details
If you don't mind revealing it publicly, what's the industry? You can be broad if you'd like to keep it confidential
@slate yacht ^
Auto Transport Brokering
Very Simplistic Idea, I just know it will be viable, because it currently doesnt exist in the industry. And if it did exist, It would increase the quality of the service to customers, as well as pay for truck drivers
It would almost allow a monopoly in the industry, while improving the overall quality
At least for now, I need to be able to ask questions to an experienced Data Science Person (Python) to see how difficult certain calculations would be in relation to accuracy
I have a question, that if someone can answer (without googling or researching), and answer it truthfully, It will let them know that they are indeed qualified for the position that I am looking to fill. Here it is....
Do you know the name of the Algorithm, that is able to tell you what the shortest path will be going from one destination to another on Road Systems. Do you know how it works? And do you understand the math behind it?
KNN - K Nearest Neighbour (or however spell it).
But that depends if you trying to end up at the same point you started then thats chinese postman
Or are you just trying to get to all nodes/points the most effective way, once
Chinese postman is what I think is best for the context you've given me
I may be wrong so anyone can correct me
Uh yeah kind cut you off, were you going to say something?
Are you able to double layer that formula, so that not only do you want to find the shortest distance, but you also want to find the route that, for each stopping point (cities or towns, for example), the sum of all temperatures was the lowest. So if you wanted to go from point A-B or point A-C, in which the temperature of A was 50 degrees, B == 60 degrees and C == 40 degrees, you chose the shortest route that was shooting for the highest temperature, so that if A-B and A-C were both 10 meters, you would still want to go A-B because the sum of the temperatures was higher.
and layer that with even more variables if needed
hopefully i explained that right, its hard to fully gather in my mind to explain in words
I mean I understand the idea, but doing it in python is something else
It looks possible but my skill level is not that high
I can understand where you coming from but I can't do something like that
Ok no worries, if anyone else ends up reading this, and they think this is something they would be interested in discussing, feel free to DM me or message me on instagram @haulerchase
well
Am I right with the algorithm? It's been a while since I've done Graph theory
Dijkstra's path finding algorithm? (had to look up spelling but knew name), i have an implementation that can find path as well as weight of a journey
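For reference, a compact sketch of Dijkstra's algorithm using the standard library's `heapq`. The road graph and its costs are made up. Folding extra criteria such as temperature into a single edge cost is one possible way to "layer" variables, but how to weight them is a modelling assumption, and the costs must stay non-negative.

```python
import heapq

def dijkstra(graph, start, goal):
    """Shortest path in a graph given as {node: [(neighbor, cost), ...]}
    with non-negative costs. Returns (total_cost, path)."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue                      # stale queue entry, already improved
        for nxt, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if goal not in dist:
        return float("inf"), []           # goal is unreachable
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[goal], path[::-1]

# Hypothetical road network.
roads = {
    "A": [("B", 10), ("C", 10)],
    "B": [("D", 5)],
    "C": [("D", 3)],
    "D": [],
}
total, route = dijkstra(roads, "A", "D")
```

Visiting every node and returning to the start (Chinese postman, travelling salesman) is a different and harder family of problems than the point-to-point shortest path shown here.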
I asked this question in the general channel - but they recommended to ask here: "Hey guys.... I need to build a dashboard which can be distributed independent of a server which hosts it... In the past I have made charts using Bokeh which I could distribute as a single HTML file.. I am considering going this route again but also love what Dash can bring.. I have two questions - is Bokeh capable of producing larger single HTML file dashboards without too much speed impact? And as far as I can see it's not possible to generate a single HTML file dashboard with Dash or did any of you succeed in this?"
Idk who did tell you to ask this question here
but no, it's not a correct channel lol
also, I haven't used Bokeh in my life
any idea why my search only works when the string's first character is capital
table = df1.loc[df1['itemtype'] == arg]
in the data frame, the thing im looking for is 'leather', so when I assign arg to 'leather' it returns no results
but if I change 'leather' to 'Leather' in my dataframe and assign arg to 'Leather', it finds all the values
hey any1 familiar with tensorflow ?
i, myself am not. but go ahead and ask your question
dont just ask "can i ask this" or "anyone good with -----". just straight-up ask what you need
I am building a model of image detection using tensorflow. I need to know which layers are suitable and also which optimizer to use? I am using passport and driving license images for it
@mild topaz i'm not hugely experienced, but for image recognition you will want a convolutional network, so if it's a 2D coloured image you would want to start with Conv2D, with some pooling layers in there separating them every now and again. in terms of optimizer, just play around with it. my most recent network uses RMSprop and trains to about 90% accuracy on basic image classification, but just try different combinations to see what works
also the tensorflow docs have good examples of models they have made for similar tasks so you can look there for ideas / tips
hey @spark stag hi, i have passport images for training
so by image detection do you mean more like face id?
yh
hey guys, i'm totally new in this field, maybe someone can help me. how do i specifically count all values between the two birth years?
1998<= Year >=1989 <-according to this
Wait that doesn't make sense
Explain in words
You want all the years between 1998 and 1989?
and count how many
yes?
these are dates of a festival and i want to check how many people have visited the festival between these birthdays
the column is YearOfBirth
10k visitors(lines)
as you can see above are about 2,5k younger than 21
I just don't know how to phrase it correctly to search between births.
So you trying to count how many people are under the age of 21
And between the age of 21 and 30
Oh okay
as you can see in the first line
Yeah I see it now
find out the difference
for i in range of difference
num = num + i
append list
then it counts the lines of all the birth years in between?
i dont know what you're referring to with lines
but that will give you the int for every number between a
and b
i want to count all lines in which these birth years between 1998 and 1989 are registered
for example 100 to 200 is a 100 difference so for i in that difference append year+i aka 100+i to this list
what do you mean by lines
ok so
the list would store
all numbers between A and B
if you wanted to count lines in between
for example 1996
would need to have a dict to understand what that value means
idk how you did it but if its just a lookup for that year
then it should just work
birthYear = user_data['YearOfBirth']
betw_21_and_30 = []
for i in range(len(user_data)):
    if birthYear <= 1998 and birthYear>= 1989:
        betw_21_and_30.append()
num = len(betw_21_and_30)
Idk try this
Okay, thanks, guys. I'll try.
Lemme know if it works
Okay I changed it again
I need to brush up on my list comprehension skills haha
haha
thanks for helping me and torturing yourself, I appreciate it
it's just a simple excel file or what do you mean exactly? ^^ I am a real beginner sorry
I think I got it...
```py
betw_21_and_30 = []
for i in range(len(user_data)):
    birthYear = [user_data['YearOfBirth'][i]]
    if birthYear <= 1998 and birthYear>= 1989:
        betw_21_and_30.append()
num = len(betw_21_and_30)
```
Hope it works
```py
for i in range(len(user_data)):
    if user_data['YearOfBirth'][i] <= 1998 and user_data['YearOfBirth'][i] >=1989:
        betw_21_and_30.append()
num = len(betw_21_and_30)
```
If this doesn't work then I give up
Ay there it's solved
Just need to:
Change:
betw_21_and_30.append()
to:
betw_21_and_30.append(user_data['YearOfBirth'][i])
Thanks
No no it's all free
If that didn't work then I was going to switch to C++
I mean it's not the best way to write it but it works
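A vectorized alternative to the loop, for comparison. This is a sketch on made-up birth years, with the column name `YearOfBirth` taken from the code above. `Series.between` is inclusive on both ends by default.

```python
import pandas as pd

# Toy data standing in for the festival visitor file.
user_data = pd.DataFrame({"YearOfBirth": [1985, 1989, 1993, 1998, 2001, 2003]})

# Boolean Series: True where 1989 <= YearOfBirth <= 1998.
in_range = user_data["YearOfBirth"].between(1989, 1998)

# True counts as 1, so the sum is the number of matching rows.
num = int(in_range.sum())

# The matching rows themselves, if they are needed later.
betw_21_and_30 = user_data[in_range]
```

The same mask written by hand would be `(user_data["YearOfBirth"] >= 1989) & (user_data["YearOfBirth"] <= 1998)`.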
@bronze cipher is it ok if I pm you ?
Uhhh okay
See ya
Hey I was wondering if anyone could assist me with an explanation of the math side of a classification algorithm already built? It's for school and I'm struggling alot with it, thanks
my partner just found another way, if you're interested @bronze cipher
Oh looks so much easier
You don't have to
Hi, how do I keep only one label? All these labels have the same but with different values. I use pandas.
Change the first line of your data file
So change them
Do you understand?
The first line of a data file is always the column headings
"All these labels have the same but with different values" what do you mean?
Btw using df.iloc (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) you can access the dataframe by integer position instead of by column name (df.loc is the label-based one)
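A small sketch of label-based versus position-based access, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["a", "b", "c"], "value": [10, 20, 30]},
    index=["x", "y", "z"],
)

by_label = df.loc["y", "value"]     # row label "y", column label "value"
by_position = df.iloc[1, 1]         # second row, second column, counted from 0
```

With the default `RangeIndex` the row labels happen to be integers, which is why `loc` and `iloc` can look interchangeable until the index changes.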
hey everyone. does anyone know of any discord/slack servers dedicated to data engineering? very curious about new orchestration tools like dagster/prefect and to some extent DBT
^ i would like to know as well
Any idea why I might be having trouble with .loc fetching a certain string from my dataframes?
is there an error?
no, it just doesn't return any results
if the value is saved as 'leather' in my df and I use .loc to search for 'leather', it returns nothing
but if I change 'leather' to 'Leather' in my df and use .loc to search for 'Leather' it will find everything
you're trying to search for values in the dataframe that are leather?
yes
"Access a group of rows and columns by label(s) "
it works for every other variable I have in that column
that's what loc does
but leather
```py
result = df1.loc[df1['itemtype'] == arg]
await message.channel.send('`' + tabulate(result, headers='keys', tablefmt='simple') + '`')
```
that is the code block I am using for every non exception in my discord bot
and it works for every other variable (plate, mail, cloth, accessories)
it just doesn't work for leather unless its saved as 'Leather'
i imagine there's probably a better solution but can you just make everything lowercase?
thats the problem, the whole column is lowercase but for some reason 'leather' cant be found
only 'Leather'
oh sorry
im kind of baffled tbh
oh what are you passing as arg
whatever the user types in discord after !loot
```
-- ---------------------------------- ---------- ---------- ------- ------- ----------------------------
32 Corpuscular Leather Greaves leather feet crit mastery Carapace of N'Zoth
33 Cord of Anguished Cries Leather waist haste mastery Dark Inquisitor Xanesh
34 Gloves of Abyssal Authority leather hands haste mastery Drest'agath
35 Spaulders of Aberrant Allure leather shoulders azerite Il'gynoth, Corruption Reborn
36 Belt of Braided Vessels Leather waist haste vers Il'gynoth, Corruption Reborn
37 Stygian Guise leather head azerite Maut
38 Boots of Manifest Shadow leather feet haste mastery Maut
39 Pauldrons of the Great Convergence leather shoulders azerite N'Zoth the Corruptor
40 Bracers of Dark Prophecy leather wrists crit haste Prophet Skitra
41 Macabre Ritual Pants leather legs crit vers Prophet Skitra
42 Gibbering Maw leather head azerite Ra-den the Despoiled
43 Wristwraps of Volatile Power leather wrists haste mastery Shad'har the Insatiable
44 Chitinspine Gloves leather hands vers mastery The Hivemind
45 Darkheart Robe leather chest azerite Vexiona
46 Onyx-Imbued Breeches leather legs vers mastery Wrathion, the Black Emperor
```
that is the leather portion of the dataframe, those 2 are capitalized on purpose
!loot leather returns:
```
---------- ---------- ---------- ------- ------- --------
```
!loot Leather returns:
```
-- ----------------------- ---------- ---------- ------- ------- ----------------------------
33 Cord of Anguished Cries Leather waist haste mastery Dark Inquisitor Xanesh
36 Belt of Braided Vessels Leather waist haste vers Il'gynoth, Corruption Reborn
```
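The thread does not say what the cause turned out to be. One common culprit for "the value looks right but `==` finds nothing" is invisible whitespace in the stored strings, which is only a guess here. The sketch below shows how to check for it and how to make the lookup tolerant of case and padding. The data is a toy stand-in, and the stray spaces are the assumed cause.

```python
import pandas as pd

# Toy data; the padded "leather " values are an assumption, not confirmed.
df1 = pd.DataFrame({
    "item": ["Stygian Guise", "Cord of Anguished Cries", "Darkheart Robe"],
    "itemtype": ["leather ", "Leather", " leather"],
})

# An exact comparison misses the padded values.
exact = df1.loc[df1["itemtype"] == "leather"]

# repr() makes hidden spaces visible when inspecting the column.
seen = df1["itemtype"].map(repr).unique().tolist()

# Normalize both sides before comparing.
arg = "Leather"
normalized = df1["itemtype"].str.strip().str.lower()
result = df1.loc[normalized == arg.strip().lower()]
```

If `seen` shows clean values, the whitespace guess is wrong and the next things to check are the dtype of the column and what `arg` actually contains when the bot receives it.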
what are some possible explanations to gaussian NB having higher accuracy score than KNN
Just change the separator
owshit that was a silly mistake
I am really new to this, thank you so much guys
No problem
Name the columns
On the first line of your data file
Just add what they represent
What do they represent though?
nothing, in my excel and csv file there is nothing in those columns
@daring locust can u manually drop them?
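like
```py
df = df.drop('Unnamed: 10', axis=1)
# or you could do:
for i in range(0, 100):
    # reassign the result; errors='ignore' skips numbers with no matching column
    df = df.drop("Unnamed: {}".format(i), axis=1, errors='ignore')
```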
yes worked, thank you
Only 2
alright, just changing them in excel will be easier right?
No problem
@bronze cipher here,
yeah this did the job
I thought I will just let you know
Super general question, and I don't at all want to start a holy war, just kind of asking, but why does it seem like exploratory data science overwhelmingly uses a tool like Jupyter Notebooks instead of something more like RStudio and MATLAB, e.g. Spyder? I get that the Notebooks are wonderful for telling an interactive story with data and sharing that story, but aren't they a bit weird for actually doing work in?
I thought so but when you use something like Jupyter Notebook, you can immediately see the result right after every line and that helps me a lot
and it is lite and is handy at the same time. I am really new to this and this is just my observation
Using Matlab and Octave can be daunting and you can use so many libraries when using something like python @serene crane
You can integrate your python code with anything basically
and working with other languages and databases like SQL and all seems easier in python
is there a way to term anything less than say 2% as "other"?
@daring locust i dont know the syntax but maybe one way to tackle this
- Hide labels/values for items <2%
- Create label that just says 2% or less and place it where you want
yeah I do not know the syntax too and I cannot find it anywhere
searched it a lot
still googling
hey guys, does anyone know how I can search for a specific item in this column "ItemName"?
because now I have the mean value of "ItemEffectiveTotalCredits" for each item
package pandas btw ^^
how do you want to search it?
you could say something like
transact_data['ItemName'] == 'Insert Item Name Here'
or do the super special
transact_data[transact_data['ItemName'] == 'Insert Item Name Here']
@echo tendon
thank you
but it should calculate the average value of a certain item from the column "ItemEffectiveTotalCredits" @lament cargo
average of the values (second column) in relation to the item from the first column.
I hope I can convey it clearly
@daring locust you can use replace for those kinda subs
here's an example:
In [6]: s = '{"PassengerId":{"0":1,"1":2,"2":3,"3":4},"Survived":{"0":0,"1":1,"2":1,"3":1},"Pclass":{"0":3,"1":1,"2":3,"3":1},"Name":{"0":"Brau
...: nd, Mr. Owen Harris","1":"Cumings, Mrs. John Bradley (Florence Briggs Thayer)","2":"Heikkinen, Miss. Laina","3":"Futrelle, Mrs. Jacques
...: Heath (Lily May Peel)"}}'
In [7]: df = pd.read_json(s)
In [8]: df
Out[8]:
PassengerId Survived Pclass Name
0 1 0 3 Braund, Mr. Owen Harris
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 3 1 3 Heikkinen, Miss. Laina
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel)
In [9]: df.replace({'Futrelle, Mrs. Jacques Heath (Lily May Peel)': 'something'})
Out[9]:
PassengerId Survived Pclass Name
0 1 0 3 Braund, Mr. Owen Harris
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 3 1 3 Heikkinen, Miss. Laina
3 4 1 1 something
In [10]:
@daring locust also, the following can be useful for dropping columns etc:
In [19]: df = pd.DataFrame(dict( drop1 = [1,2], drop2 = [3,4], keep1 = [3,3], keep2=[2,9]))
In [20]: df.loc[ : , ~df.columns.str.contains('drop')]
Out[20]:
keep1 keep2
0 3 2
1 3 9
@echo tendon oh i think i understand now, did you figure it out yet?
no :/ ^^
so same code
but
transact_data[transact_data['ItemName'] == 'Insert Item Name Here']['ItemEffectiveTotalCredits'].mean()
lol re run your code
so that it is a dataframe
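The same idea as a runnable sketch on made-up data (item names and credit values are invented, and the column spelling `ItemEffectiveTotalCredits` is assumed). It also shows `groupby`, which gives the average for every item in one go.

```python
import pandas as pd

# Toy transactions; not the real festival data.
transact_data = pd.DataFrame({
    "ItemName": ["Beer", "Beer", "Water", "Water", "Water"],
    "ItemEffectiveTotalCredits": [4.0, 6.0, 1.0, 2.0, 3.0],
})

# Average for one specific item: filter the rows, pick the column, take the mean.
mask = transact_data["ItemName"] == "Beer"
beer_mean = transact_data.loc[mask, "ItemEffectiveTotalCredits"].mean()

# Average for every item at once.
mean_per_item = transact_data.groupby("ItemName")["ItemEffectiveTotalCredits"].mean()
```

`mean_per_item["Water"]` then looks up a single item from the grouped result.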
@jolly briar ty so much
I created an array from the value_counts() series and removed the values under 2% using a lambda expression
@daring locust not sure what the data is, there's probably an easier way than that
if it works for now then all good tho
yes, there must be a better way to do it
but ty so much, I learned a new way of doing it
thanks
@daring locust np, post a sample of the data in future and it'll be easier to see what works best
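For the earlier "anything less than 2% as other" question, one possible approach is to compute each category's share with `value_counts(normalize=True)` and relabel the rare ones before counting or plotting. This is a sketch on made-up categories, since the real data was not posted.

```python
import pandas as pd

# Toy categories: "c" and "d" each make up 1% of the rows.
s = pd.Series(["a"] * 60 + ["b"] * 38 + ["c"] + ["d"])

# Share of each category, as a fraction of all rows.
share = s.value_counts(normalize=True)

# Categories below the 2% threshold.
small = share[share < 0.02].index

# Keep a value where it is NOT rare, otherwise replace it with "other".
grouped = s.where(~s.isin(small), "other")
counts = grouped.value_counts()
```

`counts` can then be passed to a pie or bar chart, and the small slices show up as a single "other" entry.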