#data-science-and-ml


uncut shadow
#

well

#

that's how it works

#

I mean

velvet thorn
#

@silk forge are you Indian?

#

your y should be 1D, not 2D, in this case

silk forge
#

is that an indian thing

#

anyway yeah

#

1d

velvet thorn
#

"doubt" to mean "question" is a very Indian thing

#

along with "the same"

#

y = data[['CO2EMISSIONS']] will give you 2D

silk forge
#

yeah i got that

#

supposed to be 1d

#

@velvet thorn you must have a shitload of experience with indians i suppose?

velvet thorn
#

I'm from a country with a sizable Indian majority, and I've worked in a startup with like 80% Indians

#

there are distinct speech pattern differences between Indians in my country and Indians from India

silk forge
#

uhm what country exactly

velvet thorn
#

Singapore

#

but the Indians here are generally 3rd or 4th generation so they don't resemble India Indians that much

silk forge
#

oh

#

isnt this the reason why x is represented as x = data[['ENGINESIZE']]

velvet thorn
#

x is fine

#

x should be 2D

#

rows are samples, columns are features

silk forge
#

isnt this the reason though

velvet thorn
#

uh

#

not just that

#

okay, maybe you could explain what you mean by that image

#

because we should be more or less saying the same thing

silk forge
#

you sure about y being 1D?

#

i cant understand that part still

velvet thorn
#

okay I'm currently doing something

#

I'll get back to you

#

in an hour or so

silk forge
#

Andrew NG didnt say anything about this 1d and 2d stuff

velvet thorn
#

okay

#

so, basically

#

the standard way of storing data is as a 2D array

#

where each row (1st axis) represents a sample and each column (2nd axis) represents a type of observations

#

therefore, X should always be 2D.

#

in some cases, you may have only one sample, or only one type of observation (feature).

#

but that doesn't make your data 1D

#

it just means that one dimension is 1

#

now, for y, assuming you're only making a prediction on one variable, it should be 1D

#

because it's basically another type of observation, except it's the target
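
A minimal sketch of the shapes described above, assuming a small made-up DataFrame that reuses the column names from the question (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; the values are made up.
data = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 1.5, 3.5],
    "CO2EMISSIONS": [196, 221, 136, 255],
})

X = data[["ENGINESIZE"]]        # double brackets -> DataFrame, 2D: (samples, features)
y_2d = data[["CO2EMISSIONS"]]   # double brackets -> DataFrame, 2D: (samples, 1)
y_1d = data["CO2EMISSIONS"]     # single brackets -> Series, 1D: (samples,)

print(X.shape, y_2d.shape, y_1d.shape)   # (4, 1) (4, 1) (4,)
```

The only difference between the two `y` variants is the number of brackets, which is easy to miss when copying from a tutorial.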

silk forge
#

@velvet thorn would that be the same case for multivariate linear regression?

velvet thorn
#

yes, sklearn treats simple and multivariate linear regression similarly

#

in both cases X is 2D

#

just that in SLR its shape is (N, 1), where N is the number of samples

#

okay, wait, I should clarify

#

if you mean "multivariate" in the proper sense (multiple dependent variables) then, yes, y will be 2D

#

but it is common to say "multivariate" to mean "multiple" (which is, strictly speaking, wrong) in the sense of multiple independent variables

#

in the case above, you passed a 2D array for y, which is why you got a 2D array for your coefficients

#

because it's one 1D array for each dependent variable
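
A short sketch of that effect on the coefficient shape, assuming scikit-learn's `LinearRegression` and made-up data that follows y = 3x + 1 exactly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: 5 samples, 1 feature, y = 3x + 1 exactly.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # shape (5, 1)
y_1d = 3.0 * X[:, 0] + 1.0                          # shape (5,)
y_2d = y_1d.reshape(-1, 1)                          # shape (5, 1)

coef_from_1d = LinearRegression().fit(X, y_1d).coef_
coef_from_2d = LinearRegression().fit(X, y_2d).coef_

print(coef_from_1d.shape)   # (1,)   one coefficient per feature
print(coef_from_2d.shape)   # (1, 1) one row of coefficients per target column
```

The fitted slope is the same in both cases; only the container around it changes.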

lapis sequoia
#

Hi All, does any of you have code for a numpy based CNN to share for own pictures? (~480x360 pixel input)

acoustic forge
#

Does anyone have an updated Data Science roadmap/long-term tutorial one could follow(Including maths and programming etc)? I am currently finishing my bachelors degree, and would like to practice Data Science at the same time

#

Ping me if someone has something like this :))

floral mantle
#

trying to learn how to code by transforming the JHU COVID-19 data into a new df normalizing all countries to the day they hit 100+ cases
and I'm really struggling with some basic groupby stuff to shift the JHU data from state-level to country-level, it's blanking out my df

#

posted some stuff in #help-chestnut then realized that this is likely more geared to analytics arena of python

floral mantle
#

specifically, how do I fix this? I want the output to be a grouped table at the country level, by day

#
import pandas as pd
import numpy as np

#this should link to the raw CSV of the latest time series data
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv"
df = pd.read_csv(url)

#unpivot data
df = pd.melt(df, id_vars=['Province/State','Country/Region','Lat','Long'], var_name='date', value_name='Confirmed Cases')
df.to_csv('Working File.csv',index=True)

#create flag and custom fields
#df["date"] = pd.to_datetime(df["date"], format='%m/%d/%YY', errors='ignore')
#df["Confirmed Cases"] = pd.to_numeric(df["Confirmed Cases"])
df["Flag"] = np.nan
df["DayZeroIndex"] = np.nan

df = df.groupby(['Country/Region','date','DayZeroIndex','Flag']).agg({'Confirmed Cases': 'sum'}) #******ERROR IS HERE******
df = df.sort_values(by=['Country/Region','Confirmed Cases'], ascending=True)
df.to_csv('Working File2.csv',index=True)
#

Flagged the line that I think the error is on
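
One likely cause, sketched on a tiny made-up frame: `groupby` drops rows whose group key is NaN by default, and both `Flag` and `DayZeroIndex` are entirely NaN at that point, so every row is dropped. Grouping only on country and date, and adding the placeholder columns afterwards, gives the country-level table. The data below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows in the same long format that pd.melt produces.
df = pd.DataFrame({
    "Province/State": ["New York", "Washington", "Hubei"],
    "Country/Region": ["US", "US", "China"],
    "date": ["1/1/20", "1/1/20", "1/1/20"],
    "Confirmed Cases": [20, 10, 400],
})
df["Flag"] = np.nan
df["DayZeroIndex"] = np.nan

# Grouping on all-NaN key columns drops every row (NaN keys are excluded by default).
blank = df.groupby(["Country/Region", "date", "DayZeroIndex", "Flag"]).agg(
    {"Confirmed Cases": "sum"}
)

# Group on the real keys only, then add the placeholder columns.
by_country = df.groupby(["Country/Region", "date"], as_index=False)["Confirmed Cases"].sum()
by_country["Flag"] = np.nan
by_country["DayZeroIndex"] = np.nan

print(len(blank))   # 0
print(by_country)
```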

lapis sequoia
#

Is there anyone with object tracking experience in python that is willing to teach me or knows a good way to learn it?

willow karma
#

Hey @floral mantle so when you actually perform your melt, you have your confirmed cases cohorted by country, but it looks like cumulative totals aren't being calculated

floral mantle
#

they're split by country & state & date
New York US 1/1/20 20
Washington US 1/1/20 10

#

and I'm wanting to group it on country
US 1/1/20 30

#

then, in a really poor way most likely, I'll add in a couple of custom fields to do an indexed plot

#

Alternatively the data is already compiled the right way at https://ourworldindata.org/grapher/covid-confirmed-cases-since-100th-case?time=0..62 if I could figure out how to link the csv in the data tab into my python df

willow karma
#

@floral mantle if you dont see a way to do this easily, you can definitely run some type of BeautifulSoup/Selenium job to download this information on some recurring basis

floral mantle
#

yeah only challenge that I'm seeing with downloading the data is that they serve it as a blob:http:// and I don't know how to read that in

icy ginkgo
#

is there a discord server strictly for jupyter?

somber hamlet
#

Afaik no

real wigeon
#

so I have to come up with a sorting algorithm

#

and a way to standardize string inputs

#

not sure where to start

#

so like for example im taking items from vendors and coming up with an algo to name these items so that they can be categorized and searched with ease

frail frigate
#

Hello, I have a problem with matrix multiplication in Python. I tried it in C++ using the IKJ algorithm, and it took around 20 seconds to multiply two 2000x2000 matrices. When I used the exact same code in Python with multithreading / multiprocessing, the times got absurdly high: with multithreading, multiplying two 2000x2000 matrices ran for about 5 hours and 40 minutes

velvet thorn
#

@frail frigate what data structure are you using?

#

Python lists?

frail frigate
#

2D arrays with numpy

#

if that's what you meant @velvet thorn , sorry if not, I'm completely new with using python

velvet thorn
#

yup, that's what I meant

#

what do you mean "same code"?

#

that seems a bit long...

frail frigate
#

same code, meaning same algorithm, used IKJ algorithm on both

#

here's the code snippet

#

def multiplicationThreading(threadAmount, size):
    dividedAmountThreads = (int)(size / threadAmount)

    threads_list = []
    count = 0
    for thread in range(threadAmount):
        new_thread = threading.Thread(name = thread + 1, target = multiplicationParallel, args=(count, dividedAmountThreads, size,))
        threads_list.append(new_thread)
        count += 1

    start_time = time.time()
    print('Start parallel execution with',threadAmount,'threads for matrixes',size,'x',size)

    for thread in threads_list:
        thread.start()

    for thread in threads_list:
        thread.join()

    print('Execution time:', time.time() - start_time,'seconds')

def multiplicationParallel(threadCount, dividedAmountThreads, size):

    random_matrix_a = numpy.random.randint(0, 1000,(size, size))
    random_matrix_b = numpy.random.randint(0, 1000,(size, size))

    blank_matrix = numpy.zeros(shape=(size, size), dtype = int)

    for i in range((dividedAmountThreads*threadCount), dividedAmountThreads*(threadCount+1)):
        for k in range(size):
            for j in range(size):
                blank_matrix[i][j] += (random_matrix_a[i][k] * random_matrix_b[k][j])

In main method calling the thread function like:


for size in sizes:
        for thread in threadAmount:
            now = datetime.now()
            current_time = now.strftime("%H:%M:%S")
            print("Current Time =", current_time) 
            multiplicationThreading(thread, size)

@velvet thorn
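
A sketch of the usual alternative, assuming the goal is just the product of two NumPy arrays. Element-by-element indexing in a Python loop is slow, and threads generally do not run CPU-bound Python loops in parallel because of CPython's global interpreter lock. NumPy's `@` operator performs the whole multiplication in compiled code. The small size below is chosen only so the loop version finishes quickly.

```python
import numpy as np

def matmul_ikj(a, b):
    """Reference IKJ triple loop, same loop order as in the snippet above."""
    n, m, p = a.shape[0], a.shape[1], b.shape[1]
    out = np.zeros((n, p), dtype=a.dtype)
    for i in range(n):
        for k in range(m):
            for j in range(p):
                out[i, j] += a[i, k] * b[k, j]
    return out

rng = np.random.default_rng(0)
a = rng.integers(0, 1000, size=(20, 20))
b = rng.integers(0, 1000, size=(20, 20))

slow = matmul_ikj(a, b)   # Python-level loops
fast = a @ b              # single vectorized call

print(np.array_equal(slow, fast))   # True
```

Note also that the posted `multiplicationParallel` creates new random matrices inside every thread, so each thread is working on different inputs.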

uncut shadow
#

Hey. I have a pure theoretical question about RNNs and stuff like that. Let's say I have a long sequence of words

  1. How can I turn them into numbers?
  2. When I'll turn them into numbers and make network process it, how can I turn them back into words or letters?
oblique belfry
#

@uncut shadow You will need to turn the words into a vector. The famous algorithm word2vec does this. There has been a lot of advancement in this space as of late, so you should def do more research. But the general premise is getting text -> clean text -> remove stop words -> lemmatization -> vectorization -> feed into model.

The model will output another vector, and this vector corresponds with the vectorization process. If you turned the word into a vector, the same idea can be used to turn the vector into words.

NLP isn't my expertise, but hope this at least gives you a head start.
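
The simplest version of that round trip, sketched with integer ids instead of learned vectors such as word2vec. The sentence is made up, and a real pipeline would build the vocabulary from a whole corpus:

```python
# Made-up sentence used only for illustration.
sentence = "the cat sat on the mat"
words = sentence.split()

# Assign each distinct word an integer id, in order of first appearance.
word_to_id = {}
for word in words:
    if word not in word_to_id:
        word_to_id[word] = len(word_to_id)
id_to_word = {i: w for w, i in word_to_id.items()}

encoded = [word_to_id[w] for w in words]
decoded = " ".join(id_to_word[i] for i in encoded)

print(encoded)   # [0, 1, 2, 3, 0, 4]
print(decoded)   # the cat sat on the mat
```

A network's output is typically one score per vocabulary entry; taking the index of the highest score and looking it up in `id_to_word` turns it back into a word.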

uncut shadow
#

thanks! I'll try that

oblique belfry
#

nltk and spacy have good docs and should help you out.

obsidian copper
#

any good sources for learning real time video classification??

#

using CNN+LSTM perhaps

lapis sequoia
#

I just want to start learning about data analysis or data scientist stuff. can u guys suggest me the best place to start?

uncut shadow
#

@lapis sequoia check DataCamp

lapis sequoia
#

Thank you @uncut shadow

uncut shadow
#

👍

oblique belfry
#

@obsidian copper What are you looking for? Are you wanting to classify an entire video, or are you wanting to classify objects in a video?

obsidian copper
#

@oblique belfry it's the same hand gestures thingy dude

floral mantle
#
df["DayZeroIndex"] = pd.to_numeric(df["DayZeroIndex"], downcast='integer')
#

Trying to convert that field in my dataframe to drop the decimals

#

Right now it shows 0.0, 1.0, 2.0, 3.0 -- I just want 0, 1, 2, 3

#

How do I do that?

silent swan
#

I would just do .astype(int)

#

depends on what you're trying to know about DL

#

presumably it'd be just setting the derivative of the log likelihood to zero

#

but typically deep learning models are non-convex so there shouldn't be a unique global minimum

#

(or rather maximum, in terms of likelihood)

#

aha

#

yup

#

err not particularly, other than the Deep Learning book most of the content for DL models is either in papers or lecture notes of very new classes

#

although books like Murphy will always be relevant

frail flower
#

Trying to come up with a minimal environment.yml to use as my default setup in the future. Here's what I have so far:

name: minimal
channels:
        - conda-forge
dependencies:
        - python=3.7
        - pandas
        - scikit-learn
        - matplotlib
        - jupyterlab
#

Any other suggestions for the bare minimum needed for the majority of projects?

silent swan
#

tqdm/seaborn are nice but not necessary

frail flower
#

I've used seaborn a lot (great for violin plots), but what's tqdm?

silent swan
#

progress bar

frail flower
#

neat

#

Oh, as in the one that's in the pandarallel demo?

silent swan
#

cool, I'd not heard of it

frail flower
#

Especially when you're dealing with something huge like ERA5.

floral mantle
#

hey guys - dataframe filter question

#

maybe I have to get it another way though

#

I have a process that updates all COVID-19 cases, by country, by day from the JHU github source.
It normalizes the data to a DayZeroIndex for each country where that is the day each country hit >= 100 cases.
I realized that I need to cut off the last DayZeroIndex for each country since the data isn't finalized until the next morning.
I'm using the filter below to remove anything where DayZeroIndex = -1 (meaning < 100 cases for the country).
What do I use?

#
df = df[df["DayZeroIndex"] != -1]
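
For the other half of the question, dropping the most recent day for each country, one possible sketch compares each row with its country's maximum `DayZeroIndex`. It assumes the frame has a `Country/Region` column as in the earlier snippet, and the numbers are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "Country/Region": ["US", "US", "US", "Italy", "Italy"],
    "DayZeroIndex": [-1, 0, 1, 0, 1],
    "Confirmed Cases": [80, 120, 200, 150, 300],
})

# Drop days before the 100-case threshold.
df = df[df["DayZeroIndex"] != -1]

# Per-country maximum index, broadcast back onto every row.
latest = df.groupby("Country/Region")["DayZeroIndex"].transform("max")

# Keep everything except each country's latest (not yet finalized) day.
trimmed = df[df["DayZeroIndex"] < latest]

print(trimmed)
```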
#

@silent swan thank you for the astype(int) suggestion. Will give it a shot. Honestly, I was shotgun approaching the whole thing since I kept getting errors. I think the reason I was having so much trouble is that my DayZeroIndex field originally was set to np.NaN and astype(int) doesn't handle that well so it stayed float64. I changed the default value to -1 though, so maybe it works now

**Update: Worked like a charm and that's a lot cleaner than the to_numeric/downcast solution.**
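
A sketch of why the NaN default got in the way, on a made-up column. Besides filling with a sentinel such as -1, pandas also offers the nullable `Int64` dtype, which is mentioned here as an alternative rather than something used in the conversation:

```python
import numpy as np
import pandas as pd

# Made-up column that still contains a missing value.
with_nan = pd.Series([0.0, 1.0, np.nan, 3.0])

# A plain integer cast cannot represent NaN, so it raises.
try:
    with_nan.astype(int)
    nan_cast_failed = False
except (ValueError, TypeError):
    nan_cast_failed = True

# Option 1: fill the gap with a sentinel first, then cast.
filled = with_nan.fillna(-1).astype(int)

# Option 2: the nullable integer dtype keeps the gap as <NA>.
nullable = with_nan.astype("Int64")

print(nan_cast_failed)          # True
print(filled.tolist())          # [0, 1, -1, 3]
print(int(nullable.isna().sum()))   # 1
```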

dusky cairn
#

Can anyone help me with a polynomial regression?

uncut shadow
#

Hey. I have a question about RNNs. I have seen many times about RNN cell and I'm wondering, aren't those layers? Or maybe RNNs and LSTMs are cells? I mean, I have heard that amount of cells has to be equal the amount of single time-steps

silent swan
#

@uncut shadow there's two dimensions to consider

#

how many timesteps you're going to process the inputs for (using the same cell each time)

uncut shadow
#

hmm

silent swan
#

and how many layers of RNN cells/LSTMs you have (these are usually different)

uncut shadow
#

so

#

what about the first one

silent swan
#

what about the first one

#

@bowy could you elaborate what your confusion is? the math gets pretty yucky because you have to sum the gradients over the different spatial locations, so some articles may simplify it

uncut shadow
#

well, I have another question. In Dense NNs I could use different activation functions which I could choose. In RNN or LSTM I see there is tanh, sigmoid and softmax. Does it mean, I can only use those 3?

silent swan
#

within the LSTMs, there're specific configurations of activation functions. don't change those

uncut shadow
#

hmm

#

Ok, thanks

silent swan
#

otoh, almost no one uses vanilla RNNs

uncut shadow
#

ohh

#

cuz of the vanishing gradient?

#

also, if I have a sequence 110011001100... and I want network to predict next 4 numbers how many cells should I use?

quartz cedar
#

hello guys

#

is anyone here familiar with the matplotlib library?

worn stratus
#

People will be. Just ask your question and someone will respond if they know

quartz cedar
#

okay so i've been having a problem with the matplotlib library

#

I'm trying to create a barchart from the list that i have

#

but for some odd reason it won't plot both of them properly on the graph

#

I'll send you guys the code

#

one sec

worn stratus
#

What does the result come out like and what do you expect it to come out like?

uncut shadow
#

which is better?

  1. Enumerating and giving every symbol in the sentence a unique number?
  2. Using one-hot encoding?
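
Both options sketched side by side for a made-up string. Plain integer ids imply an ordering between symbols (id 3 looks "bigger" than id 1) that usually means nothing, so one-hot vectors or a learned embedding are the more common input. That is general practice rather than something stated in this chat.

```python
import numpy as np

text = "hello"
symbols = sorted(set(text))                        # ['e', 'h', 'l', 'o']
symbol_to_id = {s: i for i, s in enumerate(symbols)}

# Option 1: one integer per symbol.
ids = np.array([symbol_to_id[s] for s in text])

# Option 2: one-hot rows, one column per distinct symbol.
one_hot = np.eye(len(symbols), dtype=int)[ids]

print(ids)             # [1 0 2 2 3]
print(one_hot.shape)   # (5, 4)
```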
quartz cedar
#

i get these 2 results

#

but the problem is

#

the Free paid games = 1087200000

#

and the total paid games = 900000

#

but for some reason the paid bar is incorrect, it's even supposed to be a different colour as you can see

#

so i don't know what to do

#

I'm very lost

#

@worn stratus any ideas?

#

I'm desperate lol

worn stratus
#

I'm not really an expert at all with matplotlib

#

so I can't help much

quartz cedar
#

😦 oh okay

#

waits patiently

quartz cedar
#

is anyone here?

#

i srsly need some help in this

worn stratus
#

The best way to get help is to have a concise summary of your problem, and a link to your code as your most recent message - then just lots of patience

quartz cedar
#

sighhhhh

#

okey okey

#

i kinda need help now tho

polar acorn
#

How about something like this for the plotting?

plt.bar(x, [free_sum, paid_sum], color=['b', 'r'], label=["FREE", "PAID"])
plt.ylabel("Scores")
plt.title('Total Free sum vs Total Paid Sum')
plt.xticks(x)
plt.show()

You'll notice that one column doesn't actually show up, but that is only because with the values you gave it's too small. Set them to similar values to see the plot as it's supposed to look.

quartz cedar
#

u see the problem is

#

i can't set them to similar values

#

they have to be the values that i have

#

@polar acorn which is "1087200000" and "900000"

#

i tried to change it to this

#



labels = ['Free','Paid']
x = np.arange(n_groups)
bar_width = 0.25


fig, ax = plt.subplots()
rects1 = ax.bar(index,free_sum,bar_width, color='b', label='FREE')
rects2 = ax.bar(index + bar_width+0.2, paid_sum, color='r', label='PAID')

ax.set_title('Total Free sum vs Total Paid Sum')
ax.set_xticks(index)
ax.set_xticklabels(labels)
plt.show()

#

but now only one bar shows up

#

I know that the free bar is correct

#

but the paid bar just doesn't appear

polar acorn
#

It's too small to show. Look at the numbers: one of them is over a thousand times larger than the other.

quartz cedar
#

why is the number on the y axis small

#

and how do you fix that?

#

because the question tells me to get the sum of all the installs of the free apps and the paid apps

#

i got the sum for both

#

but i'm struggling to plot them

polar acorn
#

If you use the code I pasted it will plot them correctly. One of the columns will not be visible but that is correct the number of free games is so much bigger that the other column will simply not be visible.

quartz cedar
#

i'm not sure

#

because my lecturer somehow managed to plot them

#

she gave me this code

#

but for some reason for me it doesn't work

#

fig, ax = plt.subplots()

index = np.arange(n_groups)
bar_width = 0.25

rects1 = ax.bar(index, Freecount, bar_width,
                alpha=opacity, color='b', error_kw=error_config,
                label='FREE')

# This sets up the second bar in the chart. The first argument decides where to display this bar: it is set to the index + the bar_width (the width of the first bar in the chart) plus 0.2 for the space in between.
rects2 = ax.bar(index + bar_width+0.2, costCount, bar_width,
                alpha=opacity, color='r',
                error_kw=error_config,
                label='PAID')
#

this is what she wrote

polar acorn
#

That code works fine as well. The problem is not with the plotting code I gave you or the one she gave you. The problem is that you are plotting one column that is 1000 times as tall as the other one so the second column isn't visible at all

quartz cedar
#

so what am i supposed to do?

#

just leave it invisible?

#

because when i saw her answer

#

there was 2 visible bars

#

so idk

polar acorn
#

That means her numbers are different from yours, maybe you made some error in coming up with those numbers?

#

Maybe you are using different data?

quartz cedar
#

nah man

#

it's the same dataset

polar acorn
#

If you still want to plot the numbers you have you can change the y axis to be logarithmic such as I have done here:

plt.bar(x, [free_sum, paid_sum], color=['b', 'r'], label=["FREE", "PAID"])
plt.semilogy()
plt.ylabel("Scores")
plt.title('Total Free sum vs Total Paid Sum')
plt.xticks(x, ["FREE", "PAID"])
plt.show()
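
A self-contained version of the log-scale idea, using the two totals quoted earlier in the conversation. The variable names and the non-interactive `Agg` backend are assumptions made so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt

free_sum = 1087200000
paid_sum = 900000

fig, ax = plt.subplots()
bars = ax.bar(["FREE", "PAID"], [free_sum, paid_sum], color=["b", "r"])
ax.set_yscale("log")   # each major tick is 10x the previous one
ax.set_ylabel("Total installs (log scale)")
ax.set_title("Total Free sum vs Total Paid Sum")
fig.canvas.draw()      # or plt.show() in an interactive session
```

On a log axis the bar heights no longer compare visually as plain ratios, so the axis label should say that the scale is logarithmic.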
quartz cedar
#

is it not possible to separate them?

#

like instead of having them like one on top of eachother

#

@polar acorn

polar acorn
#

To separate what? The columns?

quartz cedar
#

yes

#

just to make it look like a normal bar chart

polar acorn
#

It is a normal bar chart. There is nothing wrong with the plotting code that I and your supervisor gave you. The plotting works just as intended. Try it yourself: find a piece of paper and draw a column 5 centimeters tall, and next to it draw a column 10 micrometers tall. You won't see the second column there either. Try the code I or your supervisor gave you but replace free_sum with 100 and paid_sum with 90, and the plot will look just fine. The plotting is fine, the numbers are wrong.

quartz cedar
#

this is really frustrating

#

ya i understand

oblique belfry
#

@uncut shadow How big is the sequence?

lapis sequoia
#

im trying to write a basic neural network

#

but my loss keeps increasing

#

but im pretty sure my calculations are correct

#
def sigmoid(x):
    global sigmoid
    return 1.0 / (1 + np.exp(-x))

def sig_deriv(x):
    global sigmoid
    return (sigmoid(x) * (1 - sigmoid(x)))


class NeuralNetwork:
    def __init__(self, x, y):
        self.x        = x
        self.y        = y
        self.weight1  = np.random.normal()
        self.weight2  = np.random.normal()
        self.bias1    = np.random.normal()
        self.bias2    = np.random.normal()
        self.output   = np.zeros(self.y.shape)
        self.rate     = 0.1


    def feedforward(self):
        self.neuron1 = sigmoid((self.x * self.weight1) + self.bias1)
        self.output  = sigmoid((self.neuron1 * self.weight2) + self.bias2)


    def backprop(self):
        dloss_dy  = -2 * (1 - self.output)
        dout_dn1  = self.weight2 * sig_deriv((self.weight2 * self.neuron1) + self.bias1)
        dn1_dw1   = self.x * sig_deriv((self.weight1 * self.x) + self.bias1)
        dout_dw2  = self.neuron1 * sig_deriv((self.weight2 * self.neuron1) + self.bias2)
        dn1_db1   = sig_deriv((self.weight1 * self.x) + self.bias1)
        dout_db2  = sig_deriv((self.weight2 * self.neuron1) + self.bias2)

        self.weight1 -= self.rate * dloss_dy * dout_dn1 * dn1_dw1
        self.weight2 -= self.rate * dloss_dy * dout_dw2
        self.bias1   -= self.rate * dloss_dy * dout_dn1 * dn1_db1
        self.bias2   -= self.rate * dloss_dy * dout_db2
#

i only have one input and one hidden layer

#

apologies if it kinda messy
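
Two things in the posted `backprop` look suspicious: `dloss_dy` uses `1 - self.output` where the target `self.y` would be expected, and `dout_dn1` adds `self.bias1` where the output layer's `self.bias2` is used elsewhere. Below is a sketch of the same 1-1-1 network with those two terms changed, assuming a squared-error loss `(y - out)**2`. The function names and starting values are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w1, b1, w2, b2):
    n1 = sigmoid(w1 * x + b1)       # hidden neuron
    out = sigmoid(w2 * n1 + b2)     # output neuron
    return n1, out

def gradients(x, y, w1, b1, w2, b2):
    """Gradients of (y - out)**2 with respect to w1, b1, w2, b2."""
    n1, out = forward(x, w1, b1, w2, b2)
    dloss_dout = -2.0 * (y - out)               # uses the target y
    delta2 = dloss_dout * out * (1.0 - out)     # sigmoid'(z2) = out * (1 - out)
    delta1 = delta2 * w2 * n1 * (1.0 - n1)      # chain rule through the hidden neuron
    return delta1 * x, delta1, delta2 * n1, delta2

x, y = 0.5, 0.8                      # single made-up training sample
params = [0.3, -0.1, -0.4, 0.2]      # w1, b1, w2, b2
rate = 0.1

losses = []
for _ in range(500):
    _, out = forward(x, *params)
    losses.append((y - out) ** 2)
    grads = gradients(x, y, *params)
    params = [p - rate * g for p, g in zip(params, grads)]

print(losses[0], losses[-1])   # the loss should shrink over training
```

Comparing each analytic gradient with a finite-difference estimate of the loss is a quick way to confirm a hand-written backward pass.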

final anvil
#

how would u make a basic neural network in python and for that do i need to know anything higher than algebra

oblique belfry
#

You will need to know basic calculus. Gradient descent is key.

kindred stirrup
#

Hey all. Anyone familiar with how to add external regressors to AutoARIMA? I'm trying to forecast a series where the holidays change dates like Ramadan or Lunar New Year. Any ideas would be appreciated

polar acorn
#

I don't know if you can add external regressors to AutoARIMA. If you have many of them you could do a multivariate linear regression and then model the errors with an ARIMA model. Or you could try out fbprophet, as that is a nice library that easily includes external regressors such as moving holidays.

glad roost
#

Hi all, I am trying to predict the results of football matches with Poisson regression. How can I improve my accuracy? (I have 40-50% accuracy right now)

Link for the Telegram bot I made if you wanna try (unfortunately it's Turkish): https://web.telegram.org/#/im?p=@MacTahminBot
Code in github: https://github.com/umitkaanusta/MacTahminBotu

polar acorn
#

What coefficients are you estimating right now?

glad roost
#

I'm trying to calculate attacking and defending "powers" for each team, based on their goals for-goals against statistics. There's also the home advantage for the league

eager heath
#

It sounds already like a good accuracy tbh, if you only use the previous results as your input data

polar acorn
#

I've used that model before and heard back then that something that is often done is to increase the likelihood of 1-1 draws as they are more likely in real life than in this model. You can check with your data if that's the case and maybe add a small correction.
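
A sketch of how scoreline probabilities fall out of such a model, assuming the common form where expected goals are the product of attack strength, the opponent's defence weakness, and a home factor. All numbers and names below are hypothetical. Comparing the model's 1-1 probability with the observed share of 1-1 results in the data is one way to check whether a draw correction is needed.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

# Hypothetical fitted parameters.
home_attack, home_defence = 1.3, 0.9
away_attack, away_defence = 1.1, 1.0
home_advantage = 1.25

lam_home = home_attack * away_defence * home_advantage   # expected home goals
lam_away = away_attack * home_defence                    # expected away goals

max_goals = 10   # truncate the score grid; the tail beyond this is tiny here
grid = [[poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
         for a in range(max_goals + 1)] for h in range(max_goals + 1)]

p_home = sum(grid[h][a] for h in range(max_goals + 1) for a in range(h))
p_draw = sum(grid[k][k] for k in range(max_goals + 1))
p_away = sum(grid[h][a] for h in range(max_goals + 1)
             for a in range(h + 1, max_goals + 1))
p_one_one = grid[1][1]

print(p_home, p_draw, p_away, p_one_one)
```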

#

Also a general suggestion I got back then that I never followed up on was to treat the attack and defence powers as time series and allow them to change throughout the league. No idea how I would implement that though 🤷‍♂️

glad roost
#

Definitely, the model usually fails to guess draws. Would the use of xG (Expected goals) and xGA (Expected goals against) instead of GF-GA correct the situation with draws?

lapis sequoia
#

Hi everyone, I found this channel through a website!
I'm new to data science and learning SQL. Do you guys recommend Coursera or Dataquest?

ripe forge
#

are there any DS based approaches to dealing with object counting across multiple cameras with overlapping FOV (field of view)? It's a tough ask, but if anyone has some resources around this topic, i'd love some recommendation.

lapis sequoia
#

hi i'm back my question is what stuff i should learn to automate games like chess

still abyss
#

Does anyone know of a good list of hyperparameters for each model type to tune?

oblique belfry
#

I am SO glad Jupyter notebooks work in VSCode. I don't have to expose the jupyter port. I can just use the SSH connection I am using for remote connections.

agile anvil
#

Hey all, if you know lmfit uncertainties and understand sigmoid curves, would you please have a look at https://repl.it/@jsalsman/COVID19USgrowthExtrapolation and @-me with some ideas for how to prevent the prediction confidence intervals from decreasing, which I suspect means I should not be trying to extrapolate a sigmoid instead of perhaps a (binomial?) time series of non-cumulative occurrences. E.g.:

#

semi-fixed, still projecting sigmoid cumulatives instead of binomial (poisson?) non-cumulative occurrences

agile anvil
#

the correct model of the non-cumulative observations is a lognormal time series

grave lodge
#

Hi all! Was wondering if I could get some data-related help... does anyone know what's the best method to aggregate sentences together using Python?
Unfortunately, I won't be able to provide the data as it is confidential but I am working on creating a sort of word/sentence cloud for let's say top different reasons why a process failed

Dataset:

ID | reason
1    | "The app crashed"
2   | "...crashed on it's own"
3   | "User12345 hacked into the system"
4   | "New Patch doesn't work"
5   | "Water damage to device"
6   | "User09876 hacked into the system"

so in this case, we can technically eye which reasons are similar and should be counted together (e.g. "The app crashed" and "...crashed on it's own") or (e.g. Userxyz hacked into the system).
I have already tried splitting the sentence into words and getting the top words (while excluding out the most common words such as "the" or "is") and displayed it as a word cloud, but visually speaking it is not as insightful as I had hoped.

Also, would this require NLP to achieve or not necessarily?
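
Full NLP may not be necessary as a first pass. A sketch using only the standard library: replace the variable parts (user ids here) with a placeholder so identical reasons collapse, then bucket the remaining variants with a hand-written keyword list. The keyword list is an assumption made for this example and would need to come from looking at the real data.

```python
import re
from collections import Counter

reasons = [
    "The app crashed",
    "...crashed on it's own",
    "User12345 hacked into the system",
    "New Patch doesn't work",
    "Water damage to device",
    "User09876 hacked into the system",
]

def normalize(reason):
    """Lowercase and replace concrete user ids with a placeholder."""
    return re.sub(r"user\d+", "user<id>", reason.lower())

# Step 1: exact counts after normalization merge the two "hacked" rows.
exact_counts = Counter(normalize(r) for r in reasons)

# Step 2: hypothetical keyword -> category mapping; first match wins.
keywords = {"crash": "crash", "hack": "hacked", "patch": "patch", "damage": "damage"}

def categorize(reason):
    text = normalize(reason)
    for keyword, label in keywords.items():
        if keyword in text:
            return label
    return "other"

category_counts = Counter(categorize(r) for r in reasons)
print(category_counts.most_common())
```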

proud iron
#

hello, i'm working on intrusion detection system currently

#

and i try to apply K-means clustering algorithm using the sklearn library

#
k = 30
km = KMeans(n_clusters = k)
km.fit(features)
#

when i get to the above stage of the code, specifically at km.fit(features) part, I encounter a MemoryError

#

MemoryError: Unable to allocate array with shape (494021, 38) and data type float64

#

from what i heard from other members of the server, the (494021, 38) array should approximately be less than 1GB of RAM for the computer to handle
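
The arithmetic behind that estimate, as a quick check. The array itself is small. A clustering run may need additional working arrays beyond the input, and a 32-bit Python build limits how much memory a single process can address, so those would be things to rule out. That is general background, not something confirmed in this conversation.

```python
rows, cols = 494021, 38
bytes_per_float64 = 8

n_bytes = rows * cols * bytes_per_float64
print(n_bytes)                      # 150182384
print(round(n_bytes / 1024 ** 2))   # 143 (MiB)
```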

#

I definitely do have enough physical memory to handle it

#

are there any possible factors that may influence/cause this?

uncut shadow
#

So RNN is the whole network or just 1 layer? Its cells pass output from one gate to another cell and the output from the second gate is passed forward, right?

uncut shadow
#

and should I mix it with other layers?

ripe forge
#

both

#

and also, neither 😛

#

terms in data science mean a lot of related things. if you're talking specifically for a "layer", yeah mix and match whatever layers you like

#

although, if you're using RNN layers, generally lstm are their improved counterpart

#

so you can also be talking about RNN for "architecture", in which case RNN is just the whole network

uncut shadow
#

well, I'm wondering cuz in Keras and TF there is Sequential which is (I have seen it many times in many networks) kinda used nearly "everywhere" and it works with dense networks, rnns and many others.

#
  1. I don't know how am I going to create it (from scratch). I created dense layer (forwardpropagation) just by input @ weights + bias and here there also has to be last hidden state added. Does anybody have any ideas about how to make it?
#
  2. Another question, if I'd like to use a sentence which has 3 words in rnn then I'd need 3 cells, right?
#
  3. Another question, let's say that the amount of words in a sentence isn't constant then what should I do? Let's say that I want to make a chatbot and I'll train it on 10 word sentences. What should I do to make sure that network outputs right sentence when user uses the command and inputs sentence with e.g. 20 words?
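
A from-scratch sketch of the forward pass, assuming a plain (vanilla) RNN with a tanh activation and randomly initialized weights. The same cell, meaning the same weight matrices, is applied at every time step, so one cell handles a 3-step and a 20-step sequence alike. The number of steps is the length of the loop, not the number of cells. Names and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8

# One set of weights, shared by every time step.
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

def rnn_forward(inputs):
    """inputs: array of shape (time_steps, input_size). Returns all hidden states."""
    h = np.zeros(hidden_size)            # initial hidden state
    states = []
    for x_t in inputs:                   # one iteration per time step
        # Dense-layer formula plus the previous hidden state.
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)

short = rnn_forward(rng.normal(size=(3, input_size)))
long = rnn_forward(rng.normal(size=(20, input_size)))
print(short.shape, long.shape)   # (3, 8) (20, 8)
```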
lapis sequoia
#

I have a doubt about DQN. Say for the nth state I am getting

qnthvalues = [1,2,3]

Here the max q value is selected, which is at position 2 (the 3rd value), so I take action 3 and get the q values for state n+1. Now, should I apply the Bellman equation only for that action (the 3rd value of the n+1 q values) and leave the other values the same in the target, or apply it to all of them?

qn+1value = [2,3,4]
targetq_values = [2,3,bellmaneq(4)]
               Or
targetq_values = bellmaneq(qn+1value)

(So will we be applying it to all q values, or to the action's q value alone?)
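
In the usual DQN setup (stated here as general background, not something from this chat), only the entry for the action actually taken is replaced by the Bellman target. The other entries keep the network's current predictions for the current state, so they contribute zero error. A sketch with the numbers from the question, assuming a reward of 1 and a discount of 0.9, both made up:

```python
import numpy as np

q_current = np.array([1.0, 2.0, 3.0])   # Q(s_n, a) for each action
q_next = np.array([2.0, 3.0, 4.0])      # Q(s_{n+1}, a) for each action

action = int(np.argmax(q_current))       # greedy action: index 2
reward, gamma, done = 1.0, 0.9, False    # assumed values

# Start from the current predictions and overwrite only the taken action.
target = q_current.copy()
if done:
    target[action] = reward
else:
    target[action] = reward + gamma * q_next.max()

print(target)   # [1.  2.  4.6]
```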

supple kelp
#

im kind of new to programming and i need help with classes

#

so i want to create a calculator in python

#

and

#

i have an overall class called math

#

and in that math class there are names of categories like geometry and calculus

#

and then i want to create a subclass for geometry that includes area and volume

#

and then another class that gives area of a square, area of a rectangle, area of a triangle

uncut shadow
#

well, that's a good question but probably not in data science channel lol

supple kelp
#

but i want to use class method?

#

or should i just do it in computer science

trail grove
#

i have a question about sql for data science, which is the most reliable sql language?

#

for data science

feral thistle
#

I want to find the rows where the length of items in a column are above a certain value.
The datatype of the column is string.

Looking for something along the lines of the code below

df[df.column.str.length > 1]
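
The accessor method is `.str.len()` (there is no `.str.length`). A sketch on a made-up column:

```python
import pandas as pd

df = pd.DataFrame({"column": ["a", "bb", "ccc", ""]})

# .str.len() returns the length of each string; comparing it builds a boolean mask.
long_rows = df[df["column"].str.len() > 1]

print(long_rows["column"].tolist())   # ['bb', 'ccc']
```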
shrewd grotto
#

i guess this is the right place to ask

final_array = self.eval_values(second_array)
self.map1 = final_array
self.map2 = final_array

For whatever reason any modification i do to self.map2 also happens to self.map1
self.map1 isn't used any where else in the code except for a line self.map2 = self.map1 where i try to reset the modified values in map2 with the original values from map1

#

is this some weird numpy thingie, or am i completely stupid, shouldnt self.map2 = self.map1 overwrite map2 with map1 ?

#

and also why are modifications to one happening to both

#

final_array is a numpy array btw

velvet thorn
#

@shrewd grotto Python doesn't make implicit copies

#

you can think of self.map1 and self.map2 as pointing to the same object
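
A sketch of the difference, with made-up values: assignment binds another name to the same array, while `.copy()` creates separate data. In the code above, something like `self.map2 = final_array.copy()` would give `map2` its own data.

```python
import numpy as np

final_array = np.array([1, 2, 3])

alias = final_array               # same object, no data copied
independent = final_array.copy()  # separate data

alias[0] = 99                     # also changes final_array

print(final_array)            # [99  2  3]
print(independent)            # [1 2 3]
print(alias is final_array)   # True
```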

shrewd grotto
#

@velvet thorn thx

half sand
#

Hey Guys! So at Uni we are learning R right now, and I like the language. Nevertheless I wanted to ask:

Do you think for someone who's planning on working in Data Science somewhere, is it better to invest my time in learning Python's statistics modules, or rather learn some more R?

#

I can choose how I do my homework. I could plot with R or with Python whatever. I know Python quite well, but not too much when it comes to statistical work. I don't really know R a lot.

uncut shadow
#

@half sand R was made for data science. If you are going to use Python then ok, but knowing R won't do any harm so if you can, yes, you should learn it too

dusk elk
#

What's the point of pd.Series.name exactly?

north jay
#

I wanted to ask about the data of an image if it were to be opened as a text-file via the notepad application.

#

Appreciate anyone able to explain what I'd be looking at in a somewhat broad sense.

silk frigate
#

Can someone tell me how I can get those data at the same 'level/row'? (The indexes don't matter)

tepid thorn
#

@silk frigate Is what you screenshotted all the data you are trying to fix?

silk frigate
#

@tepid thorn yes

tepid thorn
#

You could just hardcode the values into the NaN indexes or you could create a for loop that does that for you

silk frigate
#

prob a for loop is better since it works all the time

#

also with more data values

#

but I don't know how to do that in this case 😬

#

But if I plot them now they aren't joined

#

(obviously)

tepid thorn
#

you could create a for loop thats in the range of the index values that grabs the values you want and another for loop that put the values in the correct spot

#

If that made any sense lol

silk frigate
#

Ehhh

#

Well not really, I mean I still wouldn't have any idea how to do that 😂

#

But I was just thinking

#

This is the result of two tables joined together

#

Maybe I can reset the indexes of both dataframes before I combine them?

#

and then they're the same (hopefully)

tepid thorn
#

resetting the indexes would just replace the values starting at 0 but it won't move any values

silk frigate
#

yeah but before I combine them

#

They start as two different dataframes of 5 values

tepid thorn
#

ah thats why you get those NaN values that makes sense

#

What type of join are you doing?

#

Left join, right join, inner, outer?

silk frigate
#

Wait I'll have to go now but I think I know how to do it

#

If I have any problems I'll let it know

marsh hull
#

Guys, for those of you that build dashboards, how large is your dataset usually?

#

I'm trying to build a dashboard from a game... With enough data that shows all kinds of users etc. The entire dataset it'll return is about 17MB of json

#

What are the tricks for storing, or splitting this up?

agile anvil
wary siren
#

Hey guys has anyone used mpld3?

#

Would like to convert matplotlib charts to html and send it to a frontend using flask

#

has anyone worked on such a thing?

lapis sequoia
#

How can I filter a time series down to the latest month-end date, in case the exact month end is a holiday?

jolly briar
#

@agile anvil looks nice - are you going to set anything up to track how right/wrong your predictions were though?

#

@silk frigate if you post the code you use and example data it would be a lot easier

#

eg

In [19]: d1 = pd.DataFrame(dict(a=[1,2,3]))
In [20]: d2 = pd.DataFrame(dict(b=[1,2,3]))
In [21]: d1
Out[21]:
   a
0  1
1  2
2  3
In [22]: d2
Out[22]:
   b
0  1
1  2
2  3
In [23]: pd.concat([d1, d2])
Out[23]:
     a    b
0  1.0  NaN
1  2.0  NaN
2  3.0  NaN
0  NaN  1.0
1  NaN  2.0
2  NaN  3.0
In [24]: pd.concat([d1, d2], axis=1)
Out[24]:
   a  b
0  1  1
1  2  2
2  3  3
gaunt fiber
#

Hi! I'm working in the business control department of a medium-sized company (250 employees) and I've gotten interested in learning data science to broaden my own knowledge and to make our business more data driven. Our ERP is NAV2017 and we use Qliksense as a BI-tool. My initial goal is to analyze our data structure to find flaws, more precisely to find how many rows we have without key dimensions (e.g. dimensions such as brand in the OPEX, missing data in the item structure etc). I'm picturing running through the data and creating a visual report of how correct our data is (%) and from that add mandatory fields in the input for those dimensions. How would you recommend me going about this? What should I dive into? I'm thinking about learning Pandas and SQL but I'm not sure if those are the right tools for me. Hope you understand my question. Thanks.

balmy ocean
#

@gaunt fiber

https://www.academia.edu/37886932/Data_Analysis_and_Visualization_Using_Python_-_Dr._Ossama_Embarak.pdf

gaunt fiber
balmy ocean
#

Just read the chapters of the books that get you directly into what you need to solve your issues... if You do not know yet how to write useful pieces of code with python, take your time and start from scratch... Happy end of month

gaunt fiber
#

@balmy ocean Sounds good, thanks and same to you!

agile anvil
#

@agile anvil looks nice - are you going to set anything up to track how right/wrong your predictions were though?
@jolly briar yes, I am keeping copies of each version, and updating them with the latest data every day. At least a half dozen people have put one month reminders on my reddit post so they will all have a look then

jolly briar
#

@agile anvil cheers, what are your thoughts on the amount of people who've been making graphs based on raw count data with no experience in the field? Do you think there's a risk of a sort of misinformation as a result?

agile anvil
#

well yes, but it's not consequential. Nobody is going to buy different amounts of canned goods or TP for 10k deaths versus 10m deaths

agile anvil
crisp widget
#

Hey good day !!! #
I’m doing a small project using spacy, I already have the Nouns from a big description, but I need to get only the products because I need to compare them with a DB
Any good ideas ?

final scaffold
#

Hi! Is there any good source for learning dash plotly in python other than their official docs?

lapis sequoia
#

have you tried the plotly samples

lapis sequoia
#

data.genres.str.split(expand=True) . i used this one but it is splitting by line

rain palm
#

@lapis sequoia Perhaps specify the separator? | in this case.

lapis sequoia
#

this is the result . i thought if everything adds up in the same column it would be easy for me get count of each genre

#

i will try other ways
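
One possible way to get a count per genre, sketched under the assumption that the genres are stored as `|`-separated strings in a single column. The three rows below are made up. Instead of spreading the pieces over columns with `expand=True`, the split lists are exploded into one genre per row.

```python
import pandas as pd

# Made-up rows in the same "a|b|c" layout as the genres column.
data = pd.DataFrame({"genres": ["Action|Comedy", "Comedy", "Drama|Action|Comedy"]})

# Split on the literal "|" into lists, then give every genre its own row.
one_genre_per_row = data["genres"].str.split("|").explode()

# Count how often each genre occurs.
genre_counts = one_genre_per_row.value_counts()
print(genre_counts)
```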

half parrot
#

Hi guys, I would like to train a regression model with one of boosting algorithms (e.g. lightGBM, XGBoost) and use one-hold-out cross-validation (patient-wise). I'd like to implement a custom loss function that minimizes mean absolute error and regularizes based on the maximum correlation between the reference and estimated values of the batch/fold. I'm new with Python and I would appreciate if anyone can help in this matter.

primal ravine
#

Hey can someone please help me out, I'm trying to learn logistic regression through a Python implementation. But I don't understand what this code is really doing

#

I don't understand why we are dividing the cost function by m or dividing dW by m

#

and how they relate to each other

lapis sequoia
#

Is there any good resources to get started on sales forecasting?

velvet thorn
#

@primal ravine mean error
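
To spell out the "mean error" point: the cost is the average of the per-sample losses over the m training samples, and the gradient of an average carries the same 1/m factor. The code under discussion was not posted as text, so the sketch below is an assumed, generic numpy version; the names `cost_and_gradient`, `dw` and `db` are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradient(w, b, X, y):
    """X has shape (m, n_features); y has shape (m,) with values 0 or 1."""
    m = X.shape[0]
    a = sigmoid(X @ w + b)  # predicted probabilities, shape (m,)
    # Sum of the per-sample losses divided by m gives the mean loss.
    cost = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a)) / m
    # Differentiating the mean loss carries the same 1/m into the gradients.
    dw = X.T @ (a - y) / m
    db = np.sum(a - y) / m
    return cost, dw, db

X = np.array([[0.5], [1.5], [-1.0], [2.0]])
y = np.array([0, 1, 0, 1])
cost, dw, db = cost_and_gradient(np.zeros(1), 0.0, X, y)
print(cost, dw, db)
```

Averaging keeps the cost and the step size comparable when the number of samples changes, which is why both quantities share the division by m.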

#

@half parrot do you have a formula for that

lapis sequoia
#

hey

#

so

#

for a data cube

#

i dont get when they say a data cube is a lattice of cuboids

#

what are the cuboids??

#

kinda confused by that

lapis sequoia
#

Which module would you recommend for interpolation, data regression?

burnt wharf
#

guys i m not able to install tensorflow 2 to use tensorflow_hub for my project

#

my version of tensorflow in jupyter notebook shows tensorflow 1.11

#

i am trying to install directly in notebook using !pip3 install tensorflow==2.0.0

#

but same version after install

#

any help guys?

oblique belfry
#

Are you in a virtual environment?

tribal wagon
#

Check your Python version, if it's 3.8 or above then tensorflow 2.0.0 won't install
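
Another possible cause, offered as a guess: `!pip3` may install into a different Python than the one the notebook kernel runs. The sketch below only builds the install command for the kernel's own interpreter; the helper name `pip_install_command` is hypothetical.

```python
import sys

def pip_install_command(requirement):
    """Build a pip command bound to the interpreter this kernel is running."""
    return [sys.executable, "-m", "pip", "install", requirement]

# Shows which Python the notebook kernel actually uses.
print(sys.executable)
print(sys.version_info[:3])

# Passing this list to subprocess.check_call would install into that same interpreter.
command = pip_install_command("tensorflow==2.0.0")
print(command)
```

After installing, the kernel usually has to be restarted before the new version is imported.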

mossy sand
#

Would anyone here be able to assist me with a Matplotlib issue?

polar acorn
#

!ask

arctic wedgeBOT
#

Asking good questions will yield a much higher chance of a quick response:

• Don't ask to ask your question, just go ahead and tell us your problem.
• Don't ask if anyone is knowledgeable in some area, filtering serves no purpose.
• Try to solve the problem on your own first, we're not going to write code for you.
• Show us the code you've tried and any errors or unexpected results it's giving.
• Be patient while we're helping you.

You can find a much more detailed explanation on our website.

burnt wharf
#

hey guys, i am trying to get embeddings of text data which is read in pandas dataframe and save in new column of dataframe which is (512,1) dimensions. I have got the embeddings but not able to save for each text row in new column with same index. it throws error.
ValueError: Length of values does not match length of index

```python
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4", "https://tfhub.dev/google/universal-sentence-encoder-large/5"]
model = thub.load(module_url)
print ("module %s loaded" % module_url)
def embed(input):
    return model(input)
```
this is the model i m using to get embeddings
```python
for t in df['title'].iteritems():
    df = df.assign(emb_title = np.array(embed([t[1]])))
```
https://paste.pythondiscord.com/pigaqumuha.py this is the matrix i got after printing using this code 
```python
for t in df['title'].iteritems():
    print(np.array(embed([t[1]])))
```
can someone help me here?
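
A likely reason for the error: each pass of the loop hands `assign` a single `(1, 512)` array as if it were the whole column, so its length does not match the number of rows. One possible fix is to embed all titles in one call and store one vector per row. In the sketch below `embed` is a hypothetical stand-in that returns random vectors, so the example runs without TensorFlow Hub; the real model call would take its place.

```python
import numpy as np
import pandas as pd

def embed(texts):
    """Hypothetical stand-in for the TF Hub model: one 512-float vector per text."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 512))

df = pd.DataFrame({"title": ["first title", "second title", "third title"]})

# Embed all titles in one call; the result has shape (number of rows, 512).
embeddings = np.asarray(embed(df["title"].tolist()))

# list(...) splits the 2D array into one 1D array per row, so the lengths match the index.
df["emb_title"] = list(embeddings)
print(df["emb_title"].iloc[0].shape)
```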
vocal egret
#

Anyone have experience rendering 3d data with plotly?

#

animating *

lapis sequoia
#

Hello, any data analyst to help me with what can be analyzed from Medical Transcription dataset, if any please pm me. Thank you

oblique belfry
#

In order to leverage new breakthroughs in AI and Machine Learning, it seems the common trend is to use a "backbone" network and extend it. (I am thinking of ResNet-101 for CV tasks and either BERT or GPT-2 for NLP tasks.) This approach makes sense because you do not want to reinvent the wheel and waste time and compute on building the backbone yourself.

I am uneasy doing this in the enterprise setting because I do not necessarily trust the precomputed weights. My question is: am I being overly cautious or how does one balance this dilemma of using new techniques with pre-computed weights you have not been able to verify yourself?

polar acorn
#

I guess the proof would be in the pudding? If theres any point to using ML at all you should have a large dataset and metrics to score your solution by. Which means you could use transfer learning and do something from scratch and simply find out whats better in your case.

kind ermine
#

Hi, I am a beginner when it comes to programming and I decided that the first language I try is Python. I completed a beginner's course and now know the basics, but I can't seem to find a project that suits my knowledge. I am interested in AI and machine learning. I would be very grateful if you could suggest something I can work on.

lament cargo
#

hi @kind ermine ! i'd start by looking into linear regression and logistic regression

kind ermine
#

Hi, I will surely look into that, but I want to start and complete a project just to keep my motivation up. I am familiar with the basics of Python like loops, lists and all that, but for now everything I find interesting, like machine learning projects or path finding programs, seems a bit complicated (I mean the coding part). I think it is because I am not familiar with the libraries and algorithms they include. I am looking for something to apply the little knowledge I gathered during the quarantine.

#

Also I am still in high school and I am unfamiliar with things like calculus (I don't know if I need things like that for a beginner project). I just like learning through projects and work a lot, and I did the beginner's course just to learn the fundamentals, but now I am a little confused with all the libraries and different things.

ripe forge
#

So, part of the data science turf comes with its own libraries and algorithms. To make it simpler for you, numpy is for arrays that all libraries are built upon. Pandas is basically for tables. These two you'll just have to get familiar with at least somewhat whenever dealing with data

#

After that, you pick a library based on your task. Simple model? Scikit learn. Deep learning? Tensor flow. And so on.

#

So I'd say, don't worry, give it some time. And you'll want to get familiar with those two first, and then just one library for whatever task you're doing.

uncut shadow
#

Or make it from scratch

kind ermine
#

Ok thank you for the help

agile anvil
arctic wedgeBOT
agile anvil
#
rain palm
#

@agile anvil Very cool graphs!

agile anvil
#

🙂

maiden palm
#

Any idea how I could make a "sector graph" ? More like an horizontal stacked bar graph but with redundant data type (only two)
Something like this:

maiden palm
#

Nevermind, found the matplotlib "barcode" model could do the job

harsh sapphire
#

Check out missingno package too

restive peak
#

Hi, so I'm currently using tesseract to attempt to do some OCR. The majority of the time the results are accurate however some digits randomly aren't read at all. An example of this would be this image:

#

Where the 0 isn't picked up

#

However in all the other images that I input which has the exact same format and also contains 0's it picks it up

#

Was wondering what other image processing I could do to decrease the chances of values not getting picked up correctly.

#

Also as a side note I've set the custom config for single digits as without that it didn't pick up any of the single digits.

drifting hemlock
#

Can someone enlighten me a little bit? I'm trying to build a data lake, and I have to admit that I'm very new in that area. We have information coming from multiple APIs and we want to store that information in an S3 bucket for further analysis. Is there a solution in AWS to automate that process? Or do I have to create a Python script and schedule an extraction task?

#

Let's say that I want to crawl the Studio Ghibli API (https://ghibliapi.herokuapp.com/films/) and store snapshots in a S3 bucket, is there a way to do this directly in the AWS console? Or do I have to build a script for it?

opaque stratus
#

Hello,
Wondering if anyone has taken these courses; if so, what is the best approach/way to absorb the material. I know it's all personal, but I've never learned anything quite like this so I am open to ideas/opinions from people experienced with this domain. Thanks 😄

warm hollow
marsh swallow
#

I think there's a Reddit API you can use that might answer this question. Google it and see what you find, it should come with a tutorial and documentation on how to use it.

lapis sequoia
#

Hello, do you know a website or book to learn machine learning?

warm hollow
#

I figured it out but the word bubble idea turned out to be dumb, too confusing

#

bar graph made more sense but not as hipster

plush raft
#

@lapis sequoia i have hella books on python machine learning + some vids and examples i think. Message me

raw fractal
#

Hi everybody, can someone quickly explain to me the usefulness of asynchronous programming ? (asyncio package python)

ripe forge
#

from the pic, i can't tell what makes red and magenta squares different

shell quartz
#

Hi - was wondering if anyone could answer a question regarding k-means clustering.

I've been reading guides and looking at examples like this one: https://github.com/corvasto/Simple-k-Means-Clustering-Python where the data read in is simple values in two columns.

For my assignment I have been given data in the form: (animal, countries, fruits, veggies all in separate files)
eg Animal file -
elephant -0.015926 -0.079864 ...
leopard 0.47727 -0.91587 ...
dog -0.33575 0.38897 ...
etc

So I'm confused how this will work when it comes to plotting the data.
Appreciate any help

willow holly
#

Hi, looking for advice. Every day I scan Salesforce to find new opportunities and update the data accordingly. I am looking for a library or method that will store the last_scan_day and create a timestamp afterwards, so that the next day, when the code scans for new opportunities, it knows what the last_scan_day was, only checks the data where Date > last_scan_day, and updates the timestamp with a new date. And so on. This is a very high-level description of what I am trying to do, but if anyone can direct me to the right Python methods, that would be awesome.

lapis sequoia
#

Hello fellow python enthusiasts. I got a numpy related question. If I have a 2d array nxm, and I want to make it into a bigger 4d mxnxmxn matrix, how would I do that? Basically I'm asking how to vectorise this function:


```python
import numpy as np

array = np.arange(12).reshape((4,3)) #m = 4, n = 3

#how to vectorise this function?
def makeBigger(a):
    ans = np.ones(a.shape[0]*a.shape[1]*a.shape[0]*a.shape[1])
    ans = ans.reshape((a.shape[0], a.shape[1], a.shape[0], a.shape[1]))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            ans[i][j][:][:] = a[i][j]
    return ans

print(makeBigger(array))
#prints the matrix I am looking for.
```
silent swan
#

tricky. I think you can get one dimension for free but not two

#

oh wait you're just copying across both new dimensions

lapis sequoia
#

do i use np.copy?

silent swan
#

give me a second, should be straightforward

#
new_array = np.empty([4, 3, 10, 11])
new_array[:, :, :, :] = array.reshape(4, 3, 1, 1)
#

could also simplify to new_array[:]
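
The same broadcasting idea, sketched with the trailing dimensions set to (m, n) as in the original question. `np.broadcast_to` is swapped in here as an alternative to assigning into an empty array; the variable names are illustrative.

```python
import numpy as np

array = np.arange(12).reshape((4, 3))  # m = 4, n = 3
m, n = array.shape

# Add two trailing axes of length 1, then let broadcasting stretch them to (m, n).
view = np.broadcast_to(array[:, :, None, None], (m, n, m, n))

# broadcast_to returns a read-only view; copy it if a writable array is needed.
bigger = view.copy()

# Every (m, n) block bigger[i, j] is filled with the single value array[i, j].
assert (bigger[2, 1] == array[2, 1]).all()
print(bigger.shape)
```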

lapis sequoia
#

I don't see how that's a solution

silent swan
#

what do you mean

#
for i in range(array.shape[0]):
    for j in range(array.shape[1]):
        assert (new_array[i][j][:][:] == array[i][j]).all()
lapis sequoia
#

oh i see

#

I'll test it

#

Sweet it works! Thanks bro

agile anvil
oak grail
rancid dove
#

Throwing this here since its a pandas question.

Anyone ever use custom accessors with pandas? If i make a custom accessor for a dataframe. If I perform a groupby, is that accessor available there? I'm guessing the groupby objects dont inherit from dataframe, or maybe they do.

lapis sequoia
#

what do you mean custom accessor

#

you create a new df when you do a groupby

gaunt fiber
#

super basic question about Pandas but I cant seem to find it on google - does In [*****] means that it is calculating/working? Seems to be very slow for me

rancid dove
#

@lapis sequoia

#

Ive decided to use their example and test it, it doesnt work the way I was hoping
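
A small sketch of what such a test probably shows, assuming the accessor is registered with `pd.api.extensions.register_dataframe_accessor`. The accessor name `demo` and its method `n_rows` are hypothetical. The accessor is attached to `DataFrame`, so the groupby object itself does not have it, but each group produced by iterating is a regular dataframe and does.

```python
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("demo")
class DemoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def n_rows(self):
        return len(self._obj)

df = pd.DataFrame({"k": ["a", "a", "b"], "v": [1, 2, 3]})
grouped = df.groupby("k")

# The groupby object is not a DataFrame, so the accessor is missing there.
on_groupby = hasattr(grouped, "demo")

# Each group yielded by iteration is a DataFrame, so the accessor works per group.
per_group = {key: group.demo.n_rows() for key, group in grouped}
print(on_groupby, per_group)
```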

lament orchid
#

ok guys
I am having a problem with adding a list to a pandas dataframe
I have tried to use pd.Series(mylist) to add it into my dataframe
however it turns the values of my list to "nan"
what is occurring here?
any ideas?

sand girder
#

@gaunt fiber Is this in a Jupyter Notebook? Then yes

lament orchid
#

hello

#

<@&267629731250176001>

sand girder
#

@lament orchid could you post the code please? It's possible you're building your list incorrectly/not how you expected, or your list is not the same length as the dataframe

lament orchid
#

yeah its not the same length

#

thats my issue
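
A sketch of how a length mismatch can turn into NaN, using made-up data: a Series is matched to the dataframe by index label, so rows without a matching label are filled with NaN, while a raw list of the wrong length is rejected outright.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20, 30]})
mylist = [1, 2]  # one element shorter than the dataframe

# A Series is aligned on the index, so the row without a matching label becomes NaN.
df["from_series"] = pd.Series(mylist)
print(df)

# A plain list is assigned by position and must have exactly len(df) elements.
try:
    df["from_list"] = mylist
except ValueError as error:
    print("ValueError:", error)
```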

south quest
#

@lament orchid Why did you ping moderators, are you in need of something moderating?

lament orchid
#

arnt mods good at python?

#

why else would you be mods

south quest
#

to moderate

#

pinging moderators is intended for attracting attention of mods if something bad is happening, there isn't a priority service if you ping staff. please only ping in future if you need something moderating.

solar torrent
#

hey could someone help me out with a pandas question in #help-orange . I'm trying to access cols of a df using .iloc at two different positions

#

nvm. I'm good now

paper spindle
#

I have a list of date data, like Sun May 19 00:53:53 2019 +0300, Mon May 20 20:01:07 2019 +0300, Mon Dec 16 01:02:47 2019 +0300 etc

#

what would be best way to plot it

#
Tue Nov 19 11:16:46 2019 +0300
Sun Oct 20 23:13:54 2019 +0300
Tue Aug 27 00:45:37 2019 +0300
Thu May 23 04:13:16 2019 +0300
Tue May 21 23:27:36 2019 +0300
Tue May 21 20:47:42 2019 +0300
Mon May 20 23:27:10 2019 +0300
Mon May 20 20:01:07 2019 +0300
Sun May 19 01:10:20 2019 +0300
Sun May 19 00:53:53 2019 +0300

this is an example data, and I want to kind of see it in daily basis

#

with line graphs

#

but I couldnt find the correct thing to display it

oak grail
#

What values do you want from that data?
I only see the differences in dates and times...

#

Or do you have a larger dataset?

paper spindle
#

these are commit dates, and I want to see a line graph of how many commits authored in last year

jolly briar
#

@paper spindle over the last year? Wouldn't that just be a case of counting the rows? Or do you want a graph of the commits for each day over the last year (so may 21 above would have a value of 2)

paper spindle
#

yep, the latter

jolly briar
#

I'm on my phone, but I think you should be able to convert into a date time vector, aggregate then plot?

#

The time information isn't important as I understand it
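
The convert, aggregate, plot idea could look roughly like this, using only the standard library and the sample dates posted above. The format string is an assumption based on those samples.

```python
from collections import Counter
from datetime import datetime

raw_dates = [
    "Tue Nov 19 11:16:46 2019 +0300",
    "Sun Oct 20 23:13:54 2019 +0300",
    "Tue Aug 27 00:45:37 2019 +0300",
    "Thu May 23 04:13:16 2019 +0300",
    "Tue May 21 23:27:36 2019 +0300",
    "Tue May 21 20:47:42 2019 +0300",
    "Mon May 20 23:27:10 2019 +0300",
    "Mon May 20 20:01:07 2019 +0300",
    "Sun May 19 01:10:20 2019 +0300",
    "Sun May 19 00:53:53 2019 +0300",
]

# Parse each string, keep only the calendar date, and count commits per day.
fmt = "%a %b %d %H:%M:%S %Y %z"
commits_per_day = Counter(datetime.strptime(text, fmt).date() for text in raw_dates)

# Sorted days and their counts are the x and y values for a line graph.
days = sorted(commits_per_day)
counts = [commits_per_day[day] for day in days]
for day, count in zip(days, counts):
    print(day, count)
```

The `days` and `counts` lists can then be passed to a plotting call such as matplotlib's `plt.plot(days, counts)`. Days without any commits do not appear in the counter and would need to be filled with zeros for a continuous daily line.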

late monolith
#

Does anyone know how to normalize a wave function using numpy?

lapis sequoia
#

hi

#

can someone help me reinforce this.. I'm learning partition by today.. I understand it's part of Over.. and it helps split the table into something it's going to be filtered by eventually, so it's lighter to handle and easier to run on large tables

#
(select avg(quantity), order_id, year
from salestable1
group by order_id, year);

-- same result using partition by (no group by needed, the window defines the groups)
select distinct year, order_id, avg(quantity) over(partition by year, order_id) as avg_books
from salestable1;
#

but I'm having trouble reinforcing this.. it'd be nice if someone explained the logic to me briefly
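
As a rough analogy in pandas, with made-up rows standing in for `salestable1`: `group by` collapses each group to one row, while `over (partition by ...)` keeps every row and attaches the group's value to it. The partition does not filter or shrink the table; it only tells the window function which rows belong together.

```python
import pandas as pd

# Made-up rows standing in for salestable1.
sales = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020],
    "order_id": [1, 1, 2, 1],
    "quantity": [2, 4, 5, 7],
})

# GROUP BY: one output row per (year, order_id) group.
grouped = sales.groupby(["year", "order_id"], as_index=False)["quantity"].mean()

# OVER (PARTITION BY ...): every input row is kept and gets its group's average.
sales["avg_books"] = sales.groupby(["year", "order_id"])["quantity"].transform("mean")

print(grouped)
print(sales)
```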

rain palm
#

@lapis sequoia Also, #databases is the best section to ask in, for next time!

split steppe
#

which library do y'all prefer for geospatial plotting?

restive peak
#

Hi, so I'm currently using tesseract to attempt to do some OCR. The majority of the time the results are accurate however some digits randomly aren't read at all. An example of this would be this image:

Where the 0 isn't picked up
However in all the other images that I input which has the exact same format and also contains 0's it picks it up
Was wondering what other image processing I could do to decrease the chances of values not getting picked up correctly.
Also as a side note I've set the custom config for single digits as without that it didn't pick up any of the single digits.

lapis sequoia
#

@rain palm I get this error remaining connection slots are reserved for non-replication superuser connections

rain palm
#
lapis sequoia
#

hi

#

can i get an help installing cv2 please (im using anaconda3)?

knotty hamlet
#

conda install opencv

deft harbor
wild spoke
#

How do i reshape data of shape (167076, 66) to shape with 5 time steps for LSTM network... i am getting value error for using : X_train.reshape(int(X_train.shape[0]/5), 5, X_train.shape[1])

covert skiff
#

good morning, how do I combine in general a timeseries analysis with additional datapoints, for example.. the stock price of a company..with including the gdp development or something like that ?

gentle depot
#

Hello,
If I use anaconda, or anything inside what comes in the bundle, do I have to reference or give credit in a research work like a thesis?

#

Same question for public domain data such as iris dataset

wintry atlas
#

Hi all,

I have no idea why this won't run when I try to apply

    def classlto(df):
    if df[df['prevclass'] < df['classn'] | df['won'] == 1]
        return df['bf_decimal_sp']-1
    elif df[df['prevclass'] < df['classn'] | df['won'] == 0]
        return -1
#

I was pondering that it's as there is not absolute True or False statement, but I believe that there is.
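
Some likely reasons it won't run: the function body is not indented, the `if` / `elif` lines have no trailing colon, `|` binds tighter than `<` and `==` so each comparison needs its own parentheses, and an `if` on a whole column has no single True/False value. A vectorised sketch with made-up rows follows. It assumes the two conditions were meant to be combined with "and", and that rows matching neither case should be NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows using the column names from the question.
df = pd.DataFrame({
    "prevclass": [1, 1, 3],
    "classn": [2, 2, 2],
    "won": [1, 0, 1],
    "bf_decimal_sp": [3.5, 2.0, 4.0],
})

# Each comparison gets its own parentheses because & and | bind tighter than < and ==.
up_in_class = df["prevclass"] < df["classn"]
conditions = [up_in_class & (df["won"] == 1), up_in_class & (df["won"] == 0)]
choices = [df["bf_decimal_sp"] - 1, -1.0]

# Rows that match neither condition get NaN (an assumption; the original has no else branch).
df["lto"] = np.select(conditions, choices, default=np.nan)
print(df)
```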

#

More generally I am wondering if there is a better way to frame a thesis when exploring a dataset

copper umbra
#

Any data scientists want to dive into interpreting a flatten the curve covid python model (partial code i need to fix) with me...

agile anvil
agile anvil
#

@copper umbra sure! 😄

copper umbra
#

@agile anvil i am a state employee data analyst, but probably one of the closest we have to a potential data scientist so they threw a project in my lap this evening

#

to interpret a statistical model for how social distancing affects the curve

#

the problem is the example code i was sent is in pieces, it is 2 defs and no execution code and has missing references.

#

i spent a few hours this evening trying to figure it out and am struggling

#

the model you sent is meant to predict the current growth based on data correct

agile anvil
#

@copper umbra yes; sorry for my delay I have to be AFK for a little while. If you want to DM me code please do, I promise to keep it confidential.

copper umbra
#

I am about to go to bed almost midnight here. But this code is not private. I will dm you tomorrow

agile anvil
#

very good

whole rampart
#

Hi, I'm new to this server and am wondering if this would be the correct place to ask if anyone has experience with generating netCDF files from a folder of .tiff files?

velvet thorn
#

@whole rampart probably the wrong channel...

#

hm, actually

#

thinking about it

#

it's either this or a help channel, but that's a somewhat specialised query

whole rampart
#

I've found 1 thread with a similar problem on stackoverflow. I'll come back here with more information if I am unsuccessful

worn river
#

@wild spoke Try using TimeseriesGenerator from keras.preprocessing.sequence

#

Automatically reshapes the data into 3 dimensional format
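
For reference, the plain `reshape` raises because 167076 rows cannot be split evenly into blocks of 5. A numpy-only sketch of one workaround follows, using a small stand-in array. Note that it cuts non-overlapping blocks, whereas a generator like the one suggested above produces overlapping sliding windows.

```python
import numpy as np

timesteps = 5
print(167076 % timesteps)  # 1 row left over, so a plain reshape cannot work

# Small stand-in for X_train with a row count that is not a multiple of 5.
X_train = np.arange(21 * 3, dtype=float).reshape(21, 3)

# Drop the leftover rows at the end, then cut the rest into blocks of 5 rows.
usable_rows = (X_train.shape[0] // timesteps) * timesteps
X_blocks = X_train[:usable_rows].reshape(-1, timesteps, X_train.shape[1])
print(X_blocks.shape)  # (4, 5, 3)
```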

steady leaf
#

Has anyone done data science projects for a coffeeshop and or bakery?

polar acorn
#

I swear somebody was asking about that not too long ago. Is that for a course or something?

steady leaf
#

Really??

#

Its actually for my own bakery cafe haha

#

Wow a course on this would be awesome

thin terrace
#

Hey,

I have two arrays that I have normalized between 0.0 and 1.0 using sklearn.preprocessing.MinMaxScaler. I want a measure of how well these coincide so I thought calculating the entropy between them could be useful (?).

I tried doing something like scipy.stats.entropy(arr1, arr2) but quickly realized that's not how it works. Any ideas how to do this? I'm expecting a single float as output value

valid drum
#

Hi, question about glorot_uniform initializer.
In convolutional nets, what are the fan_in and fan_out and how can I calculate them?

oblique belfry
#

Begin rant.

Nothing is more boring than refactoring someone's bad python code that is a port of another person's poor Matlab code. Unfortunately, this is the fastest way for me to get this thing done. ML sucks at times. Refactoring shitty algorithms from Lua or Matlab is not how I want to spend my Tuesdays.

Rant over.

Hope everyone is having a great day in ML land.

main coral
#

Hey everyone! Need desperate help with a neural network problem I am working on. I’m self-taught and so far this is the only thing I have had trouble finding the answers for online. If you're familiar with CNNN, spotlight, or recommender systems please please please message me

lament cargo
#

@main coral i'm learning too, whats the issue youre running into in case i might know anything about it?

agile anvil
#

@oblique belfry just be glad you don't have to port someone's minpack

oblique belfry
#

@agile anvil You are right. I will be thankful for my current situation. 😄

shrewd trellis
#

Hey guys I have a dataset with approx 500 images (kinda small) with 5 classes with high resolution images and small detail so I can’t resize much ...

Should I use a pre trained network? I try with resnet50 and vgg16 with my own input and I’m always around 30% accuracy which is pretty bad. Any idea or paper I should look ?

arctic wedgeBOT
#

Hey @trail pagoda!

It looks like you tried to attach file type(s) that we do not allow (.txt). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .md.

Feel free to ask in #community-meta if you think this is a mistake.

trail pagoda
#

Does anyone know how to properly interpret the output of a torch utils bottleneck run on your code

#

My GPU is only at 20% util and I don't know what all these "Fills" mean

steady dome
#

anybody here experienced using pandas dataframe?

I have a two column dataframe with a string value in the first column and an int value in the second column. I need to write my code to find where the string in the first column matches some string, and then I next need the output to be the int value in the second column at that same row.

df.loc and df.iloc look similar but presume you have both the row and column known which I don't, only the column. I think maybe I'm supposed to use df.at / df.iat but same problem, these functions need information that changes.

And I read that I should be able to do

df.loc[df['Col1'] == mystring, ['Col2']]

to get a specific column instead of the whole row... but it's returning the value in a dataframe format. I need to be able to get it into a number (int) format eventually.

reef bone
#

hmm, I can show you a possible way to do this but I probably cannot show you the best way to do this

#

i've never been able to figure out how to use pandas effectively so I sometimes write hacky / ugly solutions

#

I think what you have is a good start

#

let's work with this as an example dataframe

#
import pandas

dataframe = pandas.DataFrame(
    {
        "Names": ["Cat", "Dog", "Bird"],
        "Ages": [12, 40, 36],
    }
)
#

say we want to get the age of Cat

#

I will first get the rows where Names matches Cat

#
>>> dataframe.loc[dataframe["Names"] == "Cat"]
  Names  Ages
0   Cat    12
#

looks good so far

steady dome
#

I've been able to do that so far^

reef bone
#

I can then grab the Ages column

#
>>> dataframe.loc[dataframe["Names"] == "Cat"]["Ages"]
0    12
Name: Ages, dtype: int64
#

so this is a Series object

#

which is iterable, so we should be able to unpack it

#
>>> [output] = dataframe.loc[dataframe["Names"] == "Cat"]["Ages"]
>>> output
12
#

thats our int

#

alternatively, I think we can grab it by index using iloc

#

yea works too

#
>>> dataframe.loc[dataframe["Names"] == "Cat"]["Ages"].iloc[0]
12
#

actually we dont need iloc here, the series is subscriptable (plain [0] looks up the index label 0 though, so it only works because Cat sits in row 0)

#
>>> dataframe.loc[dataframe["Names"] == "Cat"]["Ages"][0]
12
#

works too

steady dome
#

oh wait did I do my "[]s" wrong?

#

oh my gosh one sec

reef bone
#

this works too (although admittedly I don't really know why)

#
>>> dataframe.loc[dataframe["Names"] == "Cat", ["Ages"]]
   Ages
0    12
#

but it still gives a dataframe

#

so it's still 2D and we need to iloc the value via both dims

#
>>> dataframe.loc[dataframe["Names"] == "Cat", ["Ages"]].iloc[0, 0]
12
#

lots of ways
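
A sketch of why the two-argument `.loc` form behaves that way, reusing the example dataframe: `.loc[rows, columns]` selects on both axes at once, a list of column labels keeps the result two-dimensional, and a single label gives a Series.

```python
import pandas

dataframe = pandas.DataFrame(
    {"Names": ["Cat", "Dog", "Bird"], "Ages": [12, 40, 36]}
)
is_dog = dataframe["Names"] == "Dog"

# A list of column labels keeps the column axis, so the result is a DataFrame.
as_frame = dataframe.loc[is_dog, ["Ages"]]

# A single column label drops that axis, so the result is a Series.
as_series = dataframe.loc[is_dog, "Ages"]

# .iloc[0] is positional; plain [0] would look for the index label 0, and Dog sits at label 1.
age = int(as_series.iloc[0])
print(type(as_frame).__name__, type(as_series).__name__, age)
```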

steady dome
#

Frustrating because I see this working for you and I am still having issues so I think I need to go back and look at my dataframe and see if there is something funky there? idk.

37 323313
Name: Col2, dtype: object

reef bone
#

maybe you can show me just the code that you use

#

so it looks like that should be row 37 and the value is 323313

steady dome
#

yup

reef bone
#

what happens when you do [0] on it

steady dome
#

infringing = DLdf.loc[DLdf['Title'] == title]['Unique Infringements']

#

sec, having some other weird error pop up-

reef bone
#

no rush

steady dome
#

.iloc[0]
gives
IndexError: single positional indexer is out-of-bounds

#

.iloc[0,0]
gives
IndexingError: Too many indexers

#

and just [0] at the end gives KeyError: 0

reef bone
#

ok, so there's nothing in your series

#

looks like there are no rows where the title was found

#

is that possible?

steady dome
#

yea, possible

#

until things are updated, likely lmao

reef bone
#

try to search for something that is present

#

if the series is of length 0, then the index 0 will be out of bounds

#

by the way, since the series is iterable, it may be easier for you to work with it as a python list

#
>>> list(dataframe.loc[dataframe["Names"] == "Cat"]["Ages"])
[12]
#

so if i search for an animal that doesn't exist in my dataframe, i should get an empty list

#
>>> list(dataframe.loc[dataframe["Names"] == "Elephant"]["Ages"])
[]
#

of course if the animal was present many times, then there would be many ages in my list

#

these are all cases that need to be accounted for, depending on what kind of data you have and how many assertions you can make about it

steady dome
#

ok so I can set up a try/except and that shows me that I CAN get the numbers out of there:

#

if I do .iloc[0] when the number is present, it gives me a number!

reef bone
#

excellent

steady dome
#

so when I do .iloc[0] on the returned dataframe... since it's a single row, iloc is looking for the first item in that row (index starts at 0) correct?

reef bone
#

yes

steady dome
#

OK good good that's how I visualized it/ understood it

#

it's a relief something makes sense for once!

reef bone
#

a try-except would work, but maybe it'd be nicer to look at the length of the resulting series

#
>>> len(dataframe.loc[dataframe["Names"] == "Elephant"]["Ages"])
0
>>> 
>>> len(dataframe.loc[dataframe["Names"] == "Cat"]["Ages"])
1
#

it kinda depends on what you're looking to do next

steady dome
#

Yea, I had gotten halfway there so at some point it was
if not df.empty:

because the times where it can't find the title I need to hard code it to 0 (and I was getting errors when I was looking for the first in a zero length row)
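A rough sketch of that "first match, else 0" logic as one helper. The function name and the example titles are made up; the values are stored as strings here because the output earlier showed `dtype: object`, which is an assumption about the real data:

```python
import pandas

def first_match_or_default(df, key_col, key, value_col, default=0):
    """Return value_col of the first row where key_col == key, else default."""
    matches = df.loc[df[key_col] == key, value_col]
    if matches.empty:
        # no row matched, so there is no position 0 to read
        return default
    return int(matches.iloc[0])

# Hypothetical stand-in for the real dataframe
DLdf = pandas.DataFrame(
    {
        "Title": ["Alpha", "Beta"],
        "Unique Infringements": ["323313", "7"],
    }
)
found = first_match_or_default(DLdf, "Title", "Alpha", "Unique Infringements")
missing = first_match_or_default(DLdf, "Title", "Gamma", "Unique Infringements")
```

Checking `.empty` first avoids the out-of-bounds error without needing a try/except.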

#

but now that I chopped away at this so much I need to go back to my notebook and write out what I want to do then I can go back to my keyboard and fix it up

reef bone
#

right haha, ok

steady dome
#

thank you so so much!

reef bone
#

no worries, glad I could help

#

by the way, it's interesting how the row filtering works

#
>>> dataframe["Names"] == "Cat"
0     True
1    False
2    False
Name: Names, dtype: bool
#

this gives a series of bools, which tell you where the condition holds and where it doesnt

steady dome
#

I was halfway there (my set up was pretty much almost there) but I was getting tripped up by the errors I got because I was trying to get the first item in the list even when the list was empty and that threw the error.

#

Yea I was reading up pandas docs to try to find out what I am supposed to use here and it looks like there is a lot of stuff that is True/False

#

I didn't see how I could use that?

reef bone
#

yeah, it may be better to split up the process into logical chunks, i.e. first get the indices, then the filtered df, then the column, and finally the value

#

that way, once it fails, you know exactly which step caused the error

#

if you do it all in-line, it's harder to see
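Splitting it up could look something like this on the example dataframe (the variable names are just for illustration):

```python
import pandas

dataframe = pandas.DataFrame({"Names": ["Cat", "Dog", "Bird"], "Ages": [12, 40, 36]})

mask = dataframe["Names"] == "Cat"   # step 1: boolean Series, one entry per row
rows = dataframe.loc[mask]           # step 2: only the rows where mask is True
ages = rows["Ages"]                  # step 3: the column of interest (a Series)
age = ages.iloc[0]                   # step 4: the scalar; fails here if no rows matched
```

If a step raises, the traceback points at one short line instead of one long chained expression.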

#

yea, so in my case I had 3 rows in my df

#

so if I do dataframe["Names"] == "Cat", it will tell me on which rows the condition holds

#

you can see that it's True, False, False because it only holds on the first line

steady dome
#

oh but I bet I could get the rows from there not too difficult after that point?

reef bone
#

and then this series is passed to the loc, and it simply grabs the indices where it's True

#

what we do in the next step is get the rows

#

using this boolean series

#

we can build it ourselves

#
>>> dataframe[[True, False, False]]
  Names  Ages
0   Cat    12
#
>>> dataframe[[True, False, True]]
  Names  Ages
0   Cat    12
2  Bird    36
#

etc

#

it's a two-step process

steady dome
#

hey that's kinda cool 😄

reef bone
#

yeah, it is cool

steady dome
#

sensible use of time to only grab where is True and do the work on those

reef bone
#

yeah, exactly

#

I suppose the goal is to get something that feels similar to an SQL select ... where
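For the SQL-minded, pandas also has `DataFrame.query`, which takes the condition as a string. A small sketch on the same example data, roughly `SELECT Ages FROM dataframe WHERE Names = 'Cat'`:

```python
import pandas

dataframe = pandas.DataFrame({"Names": ["Cat", "Dog", "Bird"], "Ages": [12, 40, 36]})

# The string is evaluated against the column names of the dataframe
cat_ages = dataframe.query('Names == "Cat"')["Ages"]
```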

steady dome
#

I hope some day the way that computers vectorize problems becomes intuitive to me because I think in terms of step by step and looping through things so it's not... there yet.

reef bone
#

pandas is confusing, but I promise it does get easier

steady dome
#

back in (unrelated) school I used xlrd instead of pandas and did nested for loops for a project because I did NOT have a handle on pandas at the time. Even now sometimes I feel just for loops in xlrd is so much easier even if it might be technically slower. My data isn't big enough where that's a make or break for me

reef bone
#

yeah, I'm definitely guilty of similar things

#

especially when it's one-off, throwaway code

#

sometimes you just dont want to go through the effort of learning something entirely new

steady dome
#

Not when I try and try and spend all that time trying to make the fancy pandas work and give up and do it with xlrd/xlwt and call it a day.

#

Didn't feel great about how I did it but I got the task done and moved on

reef bone
#

every solution cannot be the best solution, lol

#

we do what we must

steady dome
#

(because we can 🎵 ) (sorry)

steady dome
#

Got it to work. Well, this function at least. Thank you so so SO much!

valid drum
#

Is that a correct implementation for dropout?

    def dropout(self, x):
        """
        Applies dropout on `x`
        :param x: input array
        """
        shape = x.shape
        noise = np.random.choice([0, 1], shape, replace=True, p=[self.rate, 1-self.rate])
        return x * noise / (1 - self.rate)
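For what it's worth, the snippet reads like the usual "inverted" dropout: each entry is kept with probability `1 - rate` and the survivors are scaled by `1 / (1 - rate)` so the expected value stays the same. A standalone sketch for sanity-checking that idea, written as a free function rather than the method above (the `rng` argument and the uniform-mask approach are my own choices, not the original code):

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero each entry with probability `rate`, rescale the rest."""
    keep_mask = rng.random(x.shape) >= rate   # True with probability 1 - rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
out = dropout(x, rate=0.2, rng=rng)
```

With an input of all ones, the output should contain only 0 and 1.25, and its mean should stay close to 1. Dropout is normally applied only during training, so a real layer would also need a way to switch it off at inference time.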

strange stag
#

anyone know how i can find the image bounding box coordinates of an image within an image?

lapis sequoia
#

Not sure but i think my question belongs here

#

python question

Write a program matrix.py that takes a matrix of integers as input. The program then determines the largest and smallest elements in each row and column. It also calculates the sum of numbers in each row and column. The program prints all these findings as output.
#

Been stuck on this for 5 hours now

#

For input, the program will ask the user the number of rows and columns at first. Then it will ask the user to enter the items row by row, all in one line, each item separated by space. Then the program will process this input and save the items as integers in a 2D list (that is, list of list).
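One possible way to structure it, as a sketch only: keep the parsing and the statistics in separate functions and leave the `input()` prompts for last. The function names and the result layout here are made up, not part of the assignment:

```python
def parse_rows(lines):
    """Turn lines like "1 2 3" into a 2D list of ints."""
    return [[int(item) for item in line.split()] for line in lines]

def describe(values):
    """Largest, smallest and sum of one row or column."""
    return {"max": max(values), "min": min(values), "sum": sum(values)}

def summarize(matrix):
    """Stats for every row and every column of a rectangular 2D list."""
    columns = [list(col) for col in zip(*matrix)]   # zip(*m) transposes rows to columns
    return {
        "rows": [describe(row) for row in matrix],
        "columns": [describe(col) for col in columns],
    }

matrix = parse_rows(["1 2 3", "4 5 6"])
report = summarize(matrix)
```

In the real program each string passed to `parse_rows` would come from one `input()` call per row.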

drowsy grove
#

Quick Pandas question:
Does anyone know how to rename multilevel column names?
To be more specific: rename one level of column names, if it's "Unnamed..." so that it's the same as the other level column name.

strange stag
#

ye

drowsy grove
#

I'd like to use just the level1. But some of them are "Unnamed" when level0 hold the names I want them to have.

strange stag
#
# copy the column under the new name, then drop the old one
df[('', 'column_name')] = df[('', 'another_name')]
df = df.drop(columns=[('', 'another_name')])
#

oh, try reset_index()

drowsy grove
#

That's great. What if I have a lot of these columns

#

reset_index() didn't work to my dismay

strange stag
#

.reset_index()
been a bit

#

wdym didnt work

lapis sequoia
#

Nubonix, would you mind helping me real quick with my thingy?

strange stag
#

been a bit since ive last done this*

#

can try

drowsy grove
#

Can I write a function to rename the 2nd level names if there is "Unnamed" in it?

#

I forgot that tuple is immutable so I failed.
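Since the tuples can't be edited in place, one option is to build a brand new list of column labels and assign it back. A sketch under the assumption that the header has exactly two levels and the placeholders start with "Unnamed" (the example frame is made up):

```python
import pandas as pd

# Hypothetical two-level header, like the one a two-row header import can produce
df = pd.DataFrame(
    [[1, 2, 3]],
    columns=pd.MultiIndex.from_tuples(
        [("Name", "Unnamed: 0_level_1"), ("Scores", "Math"), ("Scores", "Art")]
    ),
)

# Keep level 1 unless it is an "Unnamed..." placeholder, then fall back to level 0.
# Assigning a list of plain strings also flattens the columns to a single level.
df.columns = [
    level0 if str(level1).startswith("Unnamed") else level1
    for level0, level1 in df.columns
]
```

This handles any number of columns in one pass, with no round trip through a csv.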

strange stag
#

oh right, well this is kinda hacky... but you could write to a csv and then read it

#

unless u wanna rename every column

#

and then drop the multiindex version of the column

#

there are other ways, but its a pain

#

@lapis sequoia hit me

drowsy grove
#

I thought I was smart enough to not have to resort to saving it as csv. It turned out that I spent way more time.

#

Will try

strange stag
#

ik its dumb, but it works

#

otherwise u can research multiindex to single index in pandas via google

#

or ask someone else cause i dont wanna cover it 😛

#

well, dono really how, without googling myself..so

drowsy grove
#

Will both Google and try the csv route. Thanks.

strange stag
#

np, if that doesnt work, ill try to help more

drowsy grove
#

Thanks nubonix

lapis sequoia
#

@lapis sequoia hit me
@strange stag dm?

strange stag
#

ight

opaque stratus
#

I have been having some trouble digging deeper into data science. I've tried lots of approaches to learning (Books, MOOCs, Research, Projects)... (I always hear the best way to get into data science is with projects, which is what I am doing as we speak). However, at the end of the day I feel directionless, like I am just repeatedly exploring the shallow perimeters instead of taking a leap into the greater depths. How did you really get into Data Science? I know there are lots of specializations and lots of industries, so perhaps I need to identify what specialization and what industry I want to pursue...

merry wraith
#

I feel the same way @opaque stratus . I feel like I never have any direction when learning online and I'm always scraping surface value info only to get bored and move on to the next thing because I'm not getting any value from anything. I found out my company (tech giant) reimburses the cost of some nanodegrees from an online MOOC site. I have been enjoying that more because 1 it provides more structure and 2 I feel like I really need to complete it in order to get reimbursed

#

I'm taking an intro to data science with python course that also teaches sql and I'm enjoying it a lot. ~1hr per day

woven tundra
#

I have a really dumb question that I need to ask because I want to make sure I'm thinking about this correctly and I'm not high.

I built a linear regression model after normalizing the variables. I now have to plot the actuals and the predictions (on the same dataset used to build the model) on a scatterplot.

After un-normalizing (or un-transforming) the predictions, there should be a somewhat visually evident linear relationship between it and the un-normalized actuals as well right? Because there is a linear relationship between the normalized actuals and predictions.

grave mango
#

Anyone uses xpath helper here ?

agile anvil
#

@woven tundra does the linear model not work if you don't normalize? Usually you don't need to normalize if it's an ordinary linear model. How many independents?

woven tundra
#

@agile anvil Just 3 independents (that's after testing about 20 more all of which were insignificant but we're not really interested in accuracy, we just want to make a point to a client).

If we don't normalize the R-squared drops from 50% to 18%. Although I didn't test it with just normal min/max scaling.

#

Don't you have to normalize though? A regression model assumes normally distributed independents yeah?

agile anvil
#

Does the answer there help?

#

what is the polynomial order?

woven tundra
#

No worries. I spoke to a more statistically-inclined colleague about my initial message. He agreed that normalization just centers everything and you should see a somewhat linear relationship between your actuals and predictions after you un-normalize it. Of course the strength of the relationship you see depends on the accuracy rate of your model.
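A quick numeric check of that reasoning, assuming the normalization was a linear rescale such as z-scoring (a log or other nonlinear transform would behave differently). The data and the stand-in "model" here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
actual = rng.normal(loc=50.0, scale=10.0, size=200)       # made-up target values
mean, std = actual.mean(), actual.std()

actual_norm = (actual - mean) / std                       # z-score normalization
pred_norm = 0.7 * actual_norm + rng.normal(0, 0.5, 200)   # stand-in for model output

pred = pred_norm * std + mean                             # un-normalize predictions

r_norm = np.corrcoef(actual_norm, pred_norm)[0, 1]
r_raw = np.corrcoef(actual, pred)[0, 1]
```

Shifting and scaling by positive constants does not change the correlation, so `r_norm` and `r_raw` agree and the scatterplot keeps the same shape on the original scale.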

#

Not using any polynomials @agile anvil

agile anvil
#

no squared or cubed terms?

woven tundra
#

Nope

silver igloo
#

Hello, I've been programming python for 3 years, How can I start studying data science?

woven tundra
#

@silver igloo Plenty of places to start to be honest, how do you learn best? Reading? Online courses? Or just jumping into things and figuring it out?

cunning osprey
#

Anyone got any good mathematical sources

#

Like reading about equations and stuff

mild topaz
#

I am building a ML model. I am getting training results are as follows.
loss: 0.0071 - acc: 1.0000 - val_loss: 0.1213 - val_acc: 1.0000

#

what can i do for getting proper results and avoid overfitting of model

woven tundra
#

plot your learning curves

#

can't go by just a metric

#

If the gap between your training curve and validation curve is extremely wide (and your training curve is very low on the graph), you're overfitting your model

slate yacht
#

Greetings Everyone, I am looking for someone who would like to collaborate on a project with me. I myself, am a very novice programmer, But I have experience in an industry that has given me an idea that could revolutionize said industry. If you are an experienced programmer, with knowledge in data science, feel free to DM me, and we can discuss details

woven tundra
#

If you don't mind revealing it publicly, what's the industry? You can be broad if you'd like to keep it confidential

placid gate
#

@slate yacht ^

slate yacht
#

Auto Transport Brokering

#

Very Simplistic Idea, I just know it will be viable, because it currently doesnt exist in the industry. And if it did exist, It would increase the quality of the service to customers, as well as pay for truck drivers

#

It would almost allow a monopoly in the industry, while improving the overall quality

bronze cipher
#

Do we have to be experienced

#

I just want to join for the learning experience

slate yacht
#

At least for now, I need to be able to ask questions to an experienced Data Science Person (Python) to see how difficult certain calculations would be in relation to accuracy

#

I have a question, that if someone can answer (without googling or researching), and answer it truthfully, It will let them know that they are indeed qualified for the position that I am looking to fill. Here it is....

#

Do you know the name of the Algorithm, that is able to tell you what the shortest path will be going from one destination to another on Road Systems. Do you know how it works? And do you understand the math behind it?

bronze cipher
#

KNN - K Nearest Neighbour (or however spell it).

#

But that depends if you trying to end up at the same point you started then thats chinese postman

#

Or are you just trying to get to all nodes/points the most effective way, once

#

Chinese postman is what I think is best for the context you've given me

#

I may be wrong so anyone can correct me

#

Uh yeah kind cut you off, were you going to say something?

slate yacht
#

Are you able to double layer that formula, so that not only do you want to find the shortest distance, but you also want to find the route, that for each stopping point(city's or towns, for example) the sum of all temperatures was the lowest. So If you wanted to go from point A-B or point A-C, in which the temperature of A was 50 degrees, B == 60 degrees and c == 40 degrees, you chose the shortest route, that was shooting for the highest temperature, so that if A-B and A-C was both 10 meters, you would still want to go A-B because the sum of the temperatures was higher.

#

and layer that with even more variables if needed

#

hopefully i explained that right, its hard to fully gather in my mind to explain in words

bronze cipher
#

I mean I understand the idea, but doing it in python is something else

#

It looks possible but my skill level is not that high

#

I can understand where you coming from but I can't do something like that

slate yacht
#

Ok no worries, if anyone else ends up reading this, and they think this is something they would be interested in discussing, feel free to DM me or message me on instagram @haulerchase

uncut shadow
#

well

bronze cipher
#

Am I right with the algorithm? It's been a while since I've done Graph theory

spark stag
#

dijkstra's path finding algorithm? (had to look up spelling but knew name), i have an implementation that can find path as well as weight of a journey
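For reference, a compact sketch of Dijkstra's algorithm with a priority queue. The road graph and its costs are made up, and this is not the implementation mentioned above:

```python
import heapq

def dijkstra(graph, start, goal):
    """Return (total_cost, path) for the cheapest route; graph[u] = {v: cost}."""
    queue = [(0, start, [start])]   # entries are (cost so far, node, path taken)
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, edge_cost in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + edge_cost, neighbor, path + [neighbor]))
    return float("inf"), []   # goal not reachable

# Made-up road graph; Dijkstra requires non-negative edge costs
roads = {
    "A": {"B": 10, "C": 10},
    "B": {"D": 5},
    "C": {"D": 2},
    "D": {},
}
cost, path = dijkstra(roads, "A", "D")
```

To layer in extra variables such as temperature, one option is to fold them into the edge cost as a weighted sum, as long as every resulting cost stays non-negative.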

sharp raven
#

I asked this question in the general channel - but they recommended to ask here: "Hey guys.... I need to build a dashboard which can be distributed independent of a server which hosts it... In the past I have made charts using Bokeh which I could distribute as a single HTML file.. I am considering going this route again but also love what Dash can bring.. I have two questions - is Bokeh capable able of developing larger single HTML file dashboards without too much speed impact? And as far as I can see it's not possible to generate a single HTML file dashboard with Dash or did any of you succeed in this?"

uncut shadow
#

Idk who did tell you to ask this question here

#

but no, it's not a correct channel lol

#

also, I haven't used Bokeh in my life

worldly ruin
#

any idea why my search only works when the string's first character is capital

#

table = df1.loc[df1['itemtype'] == arg]

#

in the data frame, the thing im looking for is 'leather', so when I assign arg to 'leather' it returns no results

#

but if I change 'leather' to 'Leather' in my dataframe and assign arg to 'Leather', it finds all the values
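One thing worth ruling out (a guess, since the raw file isn't shown): stray whitespace or other invisible characters in the stored values, so the cell is really something like `'leather '`. Printing the `repr` of the values makes that visible, and normalizing both sides makes the comparison robust. A sketch on a made-up frame:

```python
import pandas as pd

# Made-up rows; note the trailing space hiding in the first value
df1 = pd.DataFrame(
    {
        "item": ["Stygian Guise", "Cord of Anguished Cries", "Plate Helm"],
        "itemtype": ["leather ", "Leather", "plate"],
    }
)
arg = "leather"

print(df1["itemtype"].map(repr).unique())   # makes hidden whitespace visible

# Compare on a normalized copy: trimmed and lowercased on both sides
normalized = df1["itemtype"].str.strip().str.lower()
table = df1.loc[normalized == arg.strip().lower()]
```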

mild topaz
#

hey any1 familiar with tensorflow ?

lapis sequoia
#

i, myself am not. but go ahead and ask your question

#

dont just ask "can i ask this" or "anyone good with -----". just straight-up ask what you need

mild topaz
#

I am building a model of image detection using tensorflow. I need to know which layers are suitable also which optimizer is used? I am using passport and driving license images for it

spark stag
#

@mild topaz i'm not hugely experienced, but for image recognition you will want a convolutional network, so if it's a 2D coloured image you would want to start with Conv2D, with some pooling layers in there separating them every now and again. in terms of optimizer, just play around with it. my most recent network uses RMSprop and trains to about 90% accuracy on basic image classification, but just try different combinations to see what works

#

also the tensorflow docs have good examples of models they have made for similar tasks so you can look there for ideas / tips

mild topaz
#

hey @spark stag hi i am having passport images for train

spark stag
#

so by image detection do you mean more like face id?

mild topaz
#

see i will explain u my project

#

can i dm u?

spark stag
#

yh

echo tendon
#

hey guys, i'm totally new in this field, maybe someone can help me. how do i specifically count all values between the two birth years?

#

1998<= Year >=1989 <-according to this

bronze cipher
#

Wait that doesn't make sense

#

Explain in words

#

You want all the years between 1998 and 1989?

#

and count how many

#

yes?

echo tendon
#

these are dates of a festival and i want to check how many people have visited the festival between these birthdays

#

the column is YearsOfBirth

#

10k visitors(lines)

#

as you can see above are about 2,5k younger than 21

#

I just don't know how to phrase it correctly to search between births.

bronze cipher
#

So you trying to count how many people are under the age of 21

#

And between the age of 21 and 30

echo tendon
#

yes

#

under 21 I have already solved

bronze cipher
#

Oh okay

echo tendon
#

as you can see in the first line

bronze cipher
#

Yeah I see it now

lapis sequoia
#

find out the difference

#

for i in range of difference

#

num = num + i

#

append list

echo tendon
#

then it counts the lines of all the birth years in between?

lapis sequoia
#

i dont know what you're referring to with lines

#

but that will give you the int for every number between a

#

and b

echo tendon
#

i want to count all lines in which these birth years between 1998 and 1989 are registered

lapis sequoia
#

for example 100 to 200 is a 100 difference so for i in that difference append year+i aka 100+i to this list

#

what do you mean by lines

echo tendon
#

one moment

#

you know what I mean? 😄

lapis sequoia
#

ok so

#

the list would store

#

all numbers between A and B

#

if you wanted to count lines in between

#

for example 1996

#

would need to have a dict to understand what that value means

#

idk how you did it but if its just a lookup for that year

#

then it should just work

bronze cipher
#
birthYear = user_data['YearOfBirth']
betw_21_and_30 = []
for i in range(len(user_data)):
  if birthYear <= 1998 and birthYear>= 1989:
    betw_21_and_30.append()
num = len(betw_21_and_30)
#

Idk try this

echo tendon
#

Okay, thanks, guys. I'll try.

bronze cipher
#

Lemme know if it works

echo tendon
#

kk thanks

bronze cipher
#

Okay I changed it again

echo tendon
#

me too

#

😄

bronze cipher
#

Okay I made it easier

#

I need to brush on my list comprehension skills

#

Try now

echo tendon
#

I need to brush on my list comprehension skills
haha

#

haha

#

thanks for helping me and torturing yourself 😄 I appreciate it 😄

bronze cipher
#

It's fine

#

Wait is it a pandas array

echo tendon
#

it's just a simple excel file or what do you mean exactly? ^^ I am a real beginner sorry

bronze cipher
#

Im asking how did you import the data

#

Did you use pandas or cv2

echo tendon
#

pandas

#

says my project partner ๐Ÿ˜„

bronze cipher
#

I think I got it...

#
betw_21_and_30 = []

for i in range(len(user_data)):
  birthYear = [user_data['YearOfBirth'][i]]
  
if birthYear <= 1998 and birthYear>= 1989:
    betw_21_and_30.append()
num = len(betw_21_and_30)
#

Hope it works 🤞

echo tendon
#

😄

bronze cipher
#

😔

#
for i in range(len(user_data)):
  if user_data['YearOfBirth'][i] <= 1998 and user_data['YearOfBirth'][i] >=1989:
    betw_21_and_30.append()
  num = len(betw_21_and_30)
#

If this doesn't work then I give up

echo tendon
#

thank you anyway for your help!

#

😄

bronze cipher
#

Just indentation

#

Thats an easy fix

#

There I fixed it

#

There's hope

echo tendon
bronze cipher
#

It's just indentation errors

#

Okay last chance

#

I fixed it

echo tendon
#

😦

#

😄

bronze cipher
#

Ay there it's solved

#

Just need to:

#

Change:

betw_21_and_30.append()

to:

betw_21_and_30.append(user_data['YearOfBirth'][i])
echo tendon
#

HELL YEAH!

#

😄

bronze cipher
#

AYYYYYYY

#

Lets goooo 😆

echo tendon
#

thank you! you're the best!

#

xD

bronze cipher
#

Thanks

echo tendon
#

nice haha

#

can i give you 5€ or something for a coffee?

#

😄

#

haha

bronze cipher
#

No no it's all free

#

If that didn't work then I was going to switch to C++ 😂

#

I mean it's not the best way to write it but it works
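For comparison, a vectorized sketch of the same count using `Series.between`, on a made-up stand-in for the festival data:

```python
import pandas as pd

# Made-up stand-in for the festival data
user_data = pd.DataFrame({"YearOfBirth": [1985, 1989, 1993, 1998, 2001, 2003]})

# between() is inclusive on both ends by default, matching the <= / >= checks above
in_range = user_data["YearOfBirth"].between(1989, 1998)
num = int(in_range.sum())   # each True counts as 1
```

No loop or helper list is needed, and `user_data[in_range]` gives the matching rows if they are wanted too.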

nimble elm
#

@bronze cipher is it ok if I pm you ?

bronze cipher
#

Uhhh okay

echo tendon
#

thank you so much!

#

see ya 😄

bronze cipher
#

See ya 😉

nimble elm
#

Hey I was wondering if anyone could assist me with an explanation of the math side of a classification algorithm already built? It's for school and I'm struggling alot with it, thanks

echo tendon
bronze cipher
#

Oh looks so much easier 😔

echo tendon
#

but your lines are more spectacular I'll use them 😄

#

hehe

bronze cipher
#

You don't have to 😂

hollow quartz
#

Hi, how do i keep an only label. All these labels have the same but with differents values. I use pandas.

bronze cipher
#

Change the first line of your data file

hollow quartz
#

why?

#

the labels columns have the same values

bronze cipher
#

So change them

#

Do you understand?

#

The first line of a data file is always the column headings

river bough
#

"All these labels have the same but with different values" what do you mean?

hollow quartz
#

hum I resolve the problem

sturdy trench
#

hey everyone. does anyone know of any discord/slack servers dedicated to data engineering? very curious about new orchestration tools like dagster/prefect and to some extent DBT

placid gate
#

^ i would like to know as well

worldly ruin
#

Any idea why I might be having trouble with .loc fetching a certain string from my dataframes?

fresh walrus
#

is there an error?

worldly ruin
#

no, it just doesn't return any results

#

if the value is saved as 'leather' in my df and I use .loc to search for 'leather', it returns nothing

#

but if I change 'leather' to 'Leather' in my df and use .loc to search for 'Leather' it will find everything

fresh walrus
#

you're trying to search for values in the dataframe that are leather?

worldly ruin
#

yes

fresh walrus
#

"Access a group of rows and columns by label(s) "

worldly ruin
#

it works for every other variable I have in that column

fresh walrus
#

that's what loc does

worldly ruin
#

but leather

#
result = df1.loc[df1['itemtype'] == arg]
await message.channel.send('`' + tabulate(result, headers='keys', tablefmt='simple') + '`')
#

that is the code block I am using for every non exception in my discord bot

#

and it works for every other variable (plate, mail, cloth, accessories)

#

it just doesn't work for leather unless its saved as 'Leather'

fresh walrus
#

i imagine there's probably a better solution but can you just make everything lowercase?

worldly ruin
#

thats the problem, the whole column is lowercase but for some reason 'leather' cant be found

#

only 'Leather'

fresh walrus
#

oh sorry

worldly ruin
#

im kind of baffled tbh

fresh walrus
#

oh what are you passing as arg

worldly ruin
#

whatever the user types in discord after !loot

#
--  ----------------------------------  ----------  ----------  -------  -------  ----------------------------
32  Corpuscular Leather Greaves         leather     feet        crit     mastery  Carapace of N'Zoth
33  Cord of Anguished Cries             Leather     waist       haste    mastery  Dark Inquisitor Xanesh
34  Gloves of Abyssal Authority         leather     hands       haste    mastery  Drest'agath
35  Spaulders of Aberrant Allure        leather     shoulders   azerite           Il'gynoth, Corruption Reborn
36  Belt of Braided Vessels             Leather     waist       haste    vers     Il'gynoth, Corruption Reborn
37  Stygian Guise                       leather     head        azerite           Maut
38  Boots of Manifest Shadow            leather     feet        haste    mastery  Maut
39  Pauldrons of the Great Convergence  leather     shoulders   azerite           N'Zoth the Corruptor
40  Bracers of Dark Prophecy            leather     wrists      crit     haste    Prophet Skitra
41  Macabre Ritual Pants                leather     legs        crit     vers     Prophet Skitra
42  Gibbering Maw                       leather     head        azerite           Ra-den the Despoiled
43  Wristwraps of Volatile Power        leather     wrists      haste    mastery  Shad'har the Insatiable
44  Chitinspine Gloves                  leather     hands       vers     mastery  The Hivemind
45  Darkheart Robe                      leather     chest       azerite           Vexiona
46  Onyx-Imbued Breeches                leather     legs        vers     mastery  Wrathion, the Black Emperor
#

that is the leather portion of the dataframe, those 2 are capitalized on purpose

#

!loot leather returns:

#
----------  ----------  ----------  -------  -------  --------
#

!loot Leather returns:

#
--  -----------------------  ----------  ----------  -------  -------  ----------------------------
33  Cord of Anguished Cries  Leather     waist       haste    mastery  Dark Inquisitor Xanesh
36  Belt of Braided Vessels  Leather     waist       haste    vers     Il'gynoth, Corruption Reborn
iron hornet
#

what are some possible explanations to gaussian NB having higher accuracy score than KNN

daring locust
#

Can someone help me with this? Why am I getting this error?

ancient marsh
#

yeah you probably didn't load the csv right

#

use the delimiter ","

bronze cipher
#

Just change the separator

bronze cipher
#

df= pd.read_csv('path/to/file', sep=',')

#

@daring locust

daring locust
#

owshit that was a silly mistake

#

I am really new to this, thank you so much guys 🙂

bronze cipher
#

No problem

daring locust
#

can some one tell me how to remove these unnamed columns?

bronze cipher
#

Name the columns

#

On the first line of your data file

#

Just add what they represent

#

What do they represent though? 🤔

daring locust
#

nothing, in my excel and csv file there is nothing in those columns 🤭

ancient marsh
#

@daring locust can u manually drop them?

#

like
```py
df = df.drop('Unnamed: 10', axis=1)  # drop returns a new frame, so reassign
```

#

or you could do:

for i in range(0,100):
    # reassign each time; errors='ignore' skips names that aren't present
    df = df.drop("Unnamed: {}".format(i), axis=1, errors='ignore')
daring locust
#

yes worked, thank you 🙂

daring locust
#

how do I plot this? There is no numerical value

#

any ideas?

bronze cipher
#

You can convert it into numbers

#

So how many statuses are there

daring locust
#

Only 2

bronze cipher
#

So if you want you can edit the dataset

#

maybe make recovered 0 and hospitalized 1

daring locust
#

alright, just changing them in excel will be easier right?

bronze cipher
#

Yeah sure but there's some way in pandas that's easier

#

I'm not sure how tbh
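One way this could be done in pandas without editing the file, as a sketch: the column name `status` and the exact spellings of the two values are assumptions about the dataset.

```python
import pandas as pd

# Made-up column name and values
df = pd.DataFrame({"status": ["Recovered", "Hospitalized", "Recovered"]})

# Option 1: encode the two statuses as numbers in a new column
df["status_code"] = df["status"].map({"Recovered": 0, "Hospitalized": 1})

# Option 2: for a bar chart, the counts per status are already numeric
counts = df["status"].value_counts()
```

With matplotlib installed, `counts.plot(kind="bar")` would draw one bar per status.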

daring locust
#

alright I will google

#

thanks though 🙂

bronze cipher
#

No problem

daring locust
bronze cipher
#

Cool

#

Like I said I don't know how it's done 😅

daring locust
#

yeah this did the job
I thought I will just let you know

bronze cipher
#

Oh ok

#

Well ty

serene crane
#

Super general question, and I don't at all want to start a holy war, just kind of asking, but why does it seem like exploratory data science overwhelmingly uses a tool like Jupyter Notebooks instead of something more like RStudio and MATLAB, e.g. Spyder? I get that the Notebooks are wonderful for telling an interactive story with data and sharing that story, but aren't they a bit weird for actually doing work in?

daring locust
#

I thought so but when you use something like Jupyter Notebook, you can immediately see the result right after every line and that helps me a lot
and it is lite and is handy at the same time. I am really new to this and this is just my observation
Using Matlab and Octave can be daunting and you can use so many libraries when using something like python @serene crane

#

You can integrate your python code with anything basically

#

and working with other languages and databases like SQL and all seems easier in python

#

is there a way to term anything less than say 2% as "other"?

lament cargo
#

@daring locust i dont know the syntax but maybe one way to tackle this

  1. Hide labels/values for items <2%
  2. Create label that just says 2% or less and place it where you want
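Another route is to lump the small categories together in the data before plotting, so the chart itself needs no special handling. A sketch with made-up counts and the 2% threshold from the question:

```python
import pandas as pd

# Made-up category counts
counts = pd.Series({"A": 600, "B": 380, "C": 12, "D": 8})

shares = counts / counts.sum()
small = shares < 0.02                       # categories under 2% of the total

grouped = counts[~small].copy()
grouped["Other"] = counts[small].sum()      # lump the small ones together
```

`grouped` can then be plotted in place of `counts`, for example with `grouped.plot(kind="pie")` if matplotlib is installed.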
daring locust
#

yeah I do not know the syntax too and I cannot find it anywhere

#

searched it a lot

#

still googling

echo tendon
#

hey guys, does anyone know how I can search for a specific item in this column "ItemName"?

#

because now I have the mean value of "ITemEffectiveTotalCredits" for each item

#

package pandas btw ^^

lament cargo
#

how do you want to search it?

#

you could say something like

#

transact_data['ItemName'] == 'Insert Item Name Here'

#

or do the super special

#

transact_data[ transact_data['ItemName'] == 'Insert Item Name Here']

#

@echo tendon

echo tendon
#

thank you
but it should calculate the average value of a certain item from the column "ItemEffecitiveTotalCredits @lament cargo

#

average of the values (second column) in relation to the item from the first column.

#

I hope I can convey it clearly 😄

jolly briar
#

@daring locust you can use replace for those kinda subs

#

here's an example:

In [6]: s = '{"PassengerId":{"0":1,"1":2,"2":3,"3":4},"Survived":{"0":0,"1":1,"2":1,"3":1},"Pclass":{"0":3,"1":1,"2":3,"3":1},"Name":{"0":"Brau
   ...: nd, Mr. Owen Harris","1":"Cumings, Mrs. John Bradley (Florence Briggs Thayer)","2":"Heikkinen, Miss. Laina","3":"Futrelle, Mrs. Jacques
   ...:  Heath (Lily May Peel)"}}'

In [7]: df = pd.read_json(s)

In [8]: df
Out[8]:
   PassengerId  Survived  Pclass                                               Name
0            1         0       3                            Braund, Mr. Owen Harris
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...
2            3         1       3                             Heikkinen, Miss. Laina
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)

In [9]: df.replace({'Futrelle, Mrs. Jacques Heath (Lily May Peel)': 'something'})
Out[9]:
   PassengerId  Survived  Pclass                                               Name
0            1         0       3                            Braund, Mr. Owen Harris
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...
2            3         1       3                             Heikkinen, Miss. Laina
3            4         1       1                                          something

#

@daring locust also, the following can be useful for dropping columns etc:

In [19]: df = pd.DataFrame(dict( drop1 = [1,2], drop2 = [3,4], keep1 = [3,3], keep2=[2,9]))

In [20]: df.loc[ : , ~df.columns.str.contains('drop')]
Out[20]:
   keep1  keep2
0      3      2
1      3      9
lament cargo
#

@echo tendon oh i think i understand now, did you figure it out yet?

echo tendon
#

no :/ ^^

lament cargo
#

so same code

#

but

#

transact_data[transact_data['ItemName'] == 'Insert Item Name Here']['ItemEffectiveTotalCredits'].mean()
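
A minimal sketch of the same idea, in case it helps to see it run end to end. The frame below is made up for illustration (the real `transact_data` was not posted), and the column spelling `ItemEffectiveTotalCredits` is an assumption. It shows the mean for one item via a boolean mask, and the mean for every item at once via `groupby`:

```python
import pandas as pd

# Hypothetical stand-in for the transact_data frame discussed above.
transact_data = pd.DataFrame({
    "ItemName": ["Sword", "Shield", "Sword", "Potion"],
    "ItemEffectiveTotalCredits": [100.0, 40.0, 300.0, 5.0],
})

# Mean for a single item: select its rows with a boolean mask, then average.
mask = transact_data["ItemName"] == "Sword"
sword_mean = transact_data.loc[mask, "ItemEffectiveTotalCredits"].mean()

# Mean for every item in one pass, indexed by item name.
mean_by_item = transact_data.groupby("ItemName")["ItemEffectiveTotalCredits"].mean()

print(sword_mean)
print(mean_by_item)
```

Using `.loc[mask, column]` does the row filter and column pick in one step, which avoids the chained `[...][...]` indexing; both forms give the same mean here.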

echo tendon
#

😄

lament cargo
#

lol re run your code

echo tendon
#

kk item name here

#

lul

lament cargo
#

so that it is a dataframe

echo tendon
#

😂

lament cargo
#

well that too haha

#

i don't know what you're looking for exactly

echo tendon
#

😄

daring locust
#

@jolly briar ty so much 🙂

#

I created an array from the value_counts() series and removed the values under 2% using a lambda expression
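
For reference, one possible way to fold everything under 2% into a single "Other" bucket instead of dropping it. This is only a sketch: the series `s`, its category names, and the 2% cutoff variable are all assumed for illustration, since the actual data was not shared.

```python
import pandas as pd

# Hypothetical category column; the real data was not posted in the chat.
s = pd.Series(["a"] * 120 + ["b"] * 74 + ["c"] * 3 + ["d"] * 3)

# Fraction of rows per category (sums to 1.0).
shares = s.value_counts(normalize=True)

# Categories whose share falls below the assumed 2% cutoff.
cutoff = 0.02
small = shares[shares < cutoff].index

# Keep large categories as they are, relabel the small ones, then recount.
grouped = s.where(~s.isin(small), "Other").value_counts()

print(grouped)
```

The resulting `grouped` series can be handed to whatever plotting call was being used, so the small slices show up as one "Other" entry rather than disappearing.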

jolly briar
#

@daring locust not sure what the data is, there's probably an easier way than that

#

if it works for now then all good tho

daring locust
#

yes, there must be a better way to do it

#

but ty so much, I learned a new way of doing it

#

thanks

jolly briar
#

@daring locust np, post a sample of the data in future and it'll be easier to see what works best

daring locust
#

alright

#

is there a way to export jupyter notebook files to pdf