#data-science-and-ml | Python | Page 253

arctic wedgeBOT Sep 17, 2020, 7:43 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

inf

desert oar Sep 17, 2020, 7:43 PM

#

!e ```python
import sys
print(sys.float_info.max)
```

arctic wedgeBOT Sep 17, 2020, 7:43 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

1.7976931348623157e+308

merry ridge Sep 17, 2020, 7:44 PM

#

Just basic addition, subtraction, multiplication, and exponentiation,

desert oar Sep 17, 2020, 7:44 PM

#

this is kind of stupid but can you just divide by 1e400, do the operations you need, then multiply by 1e400 again?

#

maybe using arbitrary-precision decimals

#

!d g decimal

arctic wedgeBOT Sep 17, 2020, 7:44 PM

#

`decimal`

Source code: Lib/decimal.py

The decimal module provides support for fast correctly-rounded decimal floating point arithmetic. It offers several advantages over the float datatype:

• Decimal “is based on a floating-point model which was designed with people in mind, and necessarily has a paramount guiding principle – computers must provide an arithmetic that works in the same way as the arithmetic that people learn at school.” – excerpt from the decimal arithmetic specification.

• Decimal numbers can be represented exactly. In contrast, numbers like 1.1 and 2.2 do not have exact representations in binary floating point. End users typically would not expect 1.1 + 2.2 to display as 3.3000000000000003 as it does with binary floating point.

merry ridge Sep 17, 2020, 7:45 PM

#

No, that will cause stiffness problems

desert oar Sep 17, 2020, 7:45 PM

#

python supports arbitrarily large integers

#

can you stick with integers?

#

that said im not sure how to even create 1e400 as an integer

#

!e ```python
print(10**400 + 10**400)
```

arctic wedgeBOT Sep 17, 2020, 7:46 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

20000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

merry ridge Sep 17, 2020, 7:46 PM

#

I don't think I will be able to no

desert oar Sep 17, 2020, 7:47 PM

#

you need 400 significant figures and fractions?

merry ridge Sep 17, 2020, 7:47 PM

#

I suppose I realistically do not need the decimals

desert oar Sep 17, 2020, 7:48 PM

#

i think decimal might support arbitrarily large numbers as well as arbitrary precision

merry ridge Sep 17, 2020, 7:48 PM

#

but I do need these calculations to not just turn into being infinity

desert oar Sep 17, 2020, 7:48 PM

#

!e ```python
from decimal import Decimal
print(Decimal(10**400) + Decimal(10**400))
```

arctic wedgeBOT Sep 17, 2020, 7:48 PM

#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

2.000000000000000000000000000E+400

desert oar Sep 17, 2020, 7:48 PM

#

so you can do it, but integers will probably be faster and take up less memory

#

so if performance isn't a concern just use Decimal
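A small sketch of the trade-off discussed above, using only the standard library. It assumes the default `decimal` context (28 significant digits); the variable names are illustrative and not from the original problem.

```python
import math
import sys
from decimal import Decimal

# Floats overflow to inf once a result exceeds sys.float_info.max.
overflowed = sys.float_info.max * 10
print(math.isinf(overflowed))

# Python integers are arbitrary precision, so every digit is kept.
int_sum = 10**400 + 1
print(len(str(int_sum)))  # 401 digits

# Decimal copes with the magnitude, but arithmetic results are rounded to the
# context precision (28 significant digits by default), so the "+ 1" is lost.
dec_sum = Decimal(10**400) + 1
print(dec_sum)
print(dec_sum == Decimal(10**400))
```

If the calculation really only needs whole numbers, plain integers avoid the rounding entirely; with `Decimal`, the precision can be raised through `decimal.getcontext().prec` if more digits matter.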

merry ridge Sep 17, 2020, 7:50 PM

#

That looks like it will be able to do it.

#

I am hoping when I hear back from the principal investigator for this side project they just made a sequence of unfortunate typos when emailing me the problem statement because I really do not want to have to worry about these computational issues

delicate nymph Sep 17, 2020, 8:10 PM

#

hello

#

i've been told you might help me with a problem i have

#

i am using pandas, i have two dataframes of different length with datetime index

my_list1 = list((Counter(obs.index) - Counter(data.index)).elements())

i use this code to find the same dates in both frames but i actually want the ones that are different. is there a way to do that?

merry ridge Sep 17, 2020, 8:32 PM

#

There is probably a smarter way to do it, but you could generate a list of the two date times and use the symmetric difference?

delicate nymph Sep 17, 2020, 8:32 PM

#

symmetric difference?

merry ridge Sep 17, 2020, 8:33 PM

#

It is a set operation like union and intersection. It gives you elements of A or B that are not in the intersection of A and B.
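A minimal sketch of the set operation described here, using plain Python sets of dates. The two sets are made-up stand-ins for `obs.index` and `data.index`.

```python
from datetime import date

# Hypothetical stand-ins for the dates in obs.index and data.index.
obs_dates = {date(2020, 9, 1), date(2020, 9, 2), date(2020, 9, 3)}
data_dates = {date(2020, 9, 2), date(2020, 9, 3), date(2020, 9, 4)}

# Dates that appear in exactly one of the two collections.
# obs_dates ^ data_dates is the operator form of the same call.
only_in_one = obs_dates.symmetric_difference(data_dates)
print(sorted(only_in_one))
```

pandas `Index` objects have a method of the same name, so something like `obs.index.symmetric_difference(data.index)` should give the same result without converting to sets first.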

delicate nymph Sep 17, 2020, 8:33 PM

#

symmetric_difference()

#

this one?

#

and may i ask you one more question?

steady dome Sep 17, 2020, 8:44 PM

#

is this a place I can ask a data science question?

#

or do I hit up an available help channel for help and this is more discussion based in special topics?

wise garden Sep 17, 2020, 8:48 PM

#

what's your question

steady dome Sep 17, 2020, 9:05 PM

#

I'm trying to see if there is any sort of correlation between two columns of data. I first thought I could dump in my data like:
r, p = scipy.stats.pearsonr(x=x, y=y)
but my second column is a boolean yes or no; the value is either 0 or 1.

so now I'm not sure if that makes sense? I have a collection of numbers in the first column but the second column only has two possibilities and I feel like I should be able to think clearly and tell if that is an issue that makes this approach not fit but I've been staring at this too long and too closely to see clearly.

I'm sorry if this is dumb.

velvet thorn Sep 17, 2020, 9:36 PM

#

@steady dome it's fine if it's binary

#

@delicate nymph there's a more idiomatic way IMO

#

outer join on index, then drop the ones that come from both DataFrames

#

of course, if it works, it works.
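One way the outer-join suggestion could look, as a sketch on made-up frames. The column names `obs_value` and `data_value` are assumptions; `indicator=True` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

# Made-up frames with a datetime index, standing in for obs and data.
obs = pd.DataFrame(
    {"obs_value": [1.0, 2.0, 3.0]},
    index=pd.to_datetime(["2020-09-01", "2020-09-02", "2020-09-03"]),
)
data = pd.DataFrame(
    {"data_value": [10.0, 20.0]},
    index=pd.to_datetime(["2020-09-02", "2020-09-04"]),
)

# Outer join on the index; _merge is "left_only", "right_only" or "both".
merged = obs.merge(
    data, how="outer", left_index=True, right_index=True, indicator=True
)

# Keep only the dates that are missing from one of the two frames.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

Compared with the set approach, this keeps the values from whichever frame the date was found in, along with the side it came from.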

steady dome Sep 17, 2020, 9:41 PM

#

it is?! thank dog

velvet thorn Sep 17, 2020, 9:45 PM

#

it is?! thank dog
@steady dome yes, subject to certain assumptions

#

but in a loose, general sense, it's fine
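A sketch of why a 0/1 column is acceptable: Pearson's r against a binary variable is the same quantity as the point-biserial correlation, which compares the mean of the numeric column between the two groups. The data below is made up, and both functions are written out by hand for illustration rather than taken from SciPy.

```python
import math

x = [2.1, 3.4, 1.9, 5.6, 4.8, 3.3, 6.1, 2.7]  # numeric column (made up)
y = [0, 0, 0, 1, 1, 0, 1, 0]                  # binary column (made up)


def pearson_r(a, b):
    """Plain Pearson correlation coefficient."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def point_biserial_r(values, flags):
    """Scaled difference between the group means of `values`."""
    n = len(values)
    group1 = [v for v, f in zip(values, flags) if f == 1]
    group0 = [v for v, f in zip(values, flags) if f == 0]
    mean_all = sum(values) / n
    pop_std = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean_diff = sum(group1) / len(group1) - sum(group0) / len(group0)
    return mean_diff / pop_std * math.sqrt(len(group1) * len(group0) / n**2)


r_pearson = pearson_r(x, y)
r_point_biserial = point_biserial_r(x, y)
print(r_pearson, r_point_biserial)
```

The two numbers agree up to floating-point error. SciPy also exposes this as `scipy.stats.pointbiserialr`; the p-value it reports relies on distributional assumptions, which is presumably the caveat mentioned above.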

merry fern Sep 18, 2020, 1:26 AM

#

pandas/dataframes - how would you loop thru a dataframe and check for rows w/ a dupe of Col A/B pairs but don't delete the row, save data from Col C?
want to reduce to unique Col A/B pairs and sum Col C

desert oar Sep 18, 2020, 1:31 AM

#

generally you shouldnt be looping through rows

#

for example you can do something like this data.drop_duplicates(subset=["A", "B"])["C"].sum()

merry fern Sep 18, 2020, 2:29 AM

#

generally you shouldnt be looping through rows
@desert oar i was reading about that, vectorization over looping...

merry fern Sep 18, 2020, 2:46 AM

#

https://paste.pythondiscord.com/bisakosiqe.py
this is where i'm at @desert oar

rustic apex Sep 18, 2020, 3:56 AM

#

With libraries like NumPy, pandas, SciPy, etc., does data science go library by library? At times? I'm learning NumPy now and then I'll look at pandas

worldly schooner Sep 18, 2020, 4:45 AM

#

can anyone please tell me how i can concatenate a string with a variable in a print statement?

brittle agate Sep 18, 2020, 4:57 AM

#

@worldly schooner

x = 'hello'
print('text: ' + x)

#

I hope u know how to convert integer to string mate.
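A short sketch of the common options, including the integer-to-string conversion mentioned above. The variable `count` is an added example.

```python
x = 'hello'
count = 3

print('text: ' + x)                   # + works when both operands are strings
print('count: ' + str(count))         # non-strings must be converted explicitly
print(f'text: {x}, count: {count}')   # f-strings format any type inline

message = f'text: {x}, count: {count}'
```

f-strings need Python 3.6 or newer; `print('text:', x)` with a comma is another option when a separating space is acceptable.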

keen pine Sep 18, 2020, 6:54 AM

#

hello, i'm working on a TextCNN and i've built a model for it, but in the first epoch the validation loss increases while the training loss decreases. how is that possible? my dataset is balanced and shuffled.

lapis sequoia Sep 18, 2020, 10:09 AM

#

with pandas:

df.plot()

how do I use one of the column of the df for the x-axis?

keen pine Sep 18, 2020, 10:31 AM

#

transform dataset

paper niche Sep 18, 2020, 10:31 AM

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html @lapis sequoia you should be able to specify the column name for x

lapis sequoia Sep 18, 2020, 10:32 AM

#

@paper niche I have 2 df that I'm trying to plot on the same plot

#

but this is the result I get so far

📎 unknown.png

feral spoke Sep 18, 2020, 10:34 AM

#

Guys where can I find examples to practise on pandas along with the questions related to datasets and solutions with it?

proper needle Sep 18, 2020, 10:40 AM

#

Hi all. Anyone knows libs, that work good with docx templates. Specifically with end/start of new page? Problems are with repeatable footer and last line of tables. P.S. sorry for my English

paper niche Sep 18, 2020, 10:43 AM

#

but this is the result I get so far
@lapis sequoia get a help channel #❓｜how-to-get-help and ping me; I'll see if I can help

#

@feral spoke kaggle has a short pandas course you could do https://www.kaggle.com/learn/pandas

Learn Pandas Tutorials

Solve short hands-on challenges to perfect your data manipulation skills.

#

also there are community tutorials available; this is featured on the pandas docs, for example: https://github.com/guipsamora/pandas_exercises

GitHub

guipsamora/pandas_exercises

Practice your pandas skills! Contribute to guipsamora/pandas_exercises development by creating an account on GitHub.

lapis sequoia Sep 18, 2020, 10:53 AM

#

@paper niche managed to solve it, thanks

mild topaz Sep 18, 2020, 11:05 AM

#

suppose i have a folder structure this way ```python
demo --> training --> armenia --> driving_licence --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
                              --> passport        --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)

         testing  --> armenia --> driving_licence --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
                              --> passport        --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
```

eager heath Sep 18, 2020, 11:15 AM

#

Well, you can train a CNN on everything

#

The question is how good it will perform

mild topaz Sep 18, 2020, 11:18 AM

#

which training path should i give, i.e. down to which folder?

#

do u get my point here? @eager heath

keen pine Sep 18, 2020, 11:33 AM

#

should datasets always be balanced ?

#

is there any situation where an imbalanced dataset is fine?

velvet thorn Sep 18, 2020, 11:34 AM

#

should datasets always be balanced ?
@keen pine most real life data isn't balanced

keen pine Sep 18, 2020, 11:36 AM

#

@velvet thorn yes that's right but my question is about models.

#

all models should be trained with balanced sets.

#

is above always true ?

rugged owl Sep 18, 2020, 11:45 AM

#

Hi everyone, I'm working with the open-source project called dstack (https://github.com/dstackai/dstack). We'd like to kindly ask you to help us and everyone else in the community get insight from the data science community on how reports, dashboards, and data applications are built with Python or R.
We've designed a quick survey and we'd like to kindly ask everyone for help.

Here's the survey: https://dstackai.typeform.com/to/Xi3ZryqX

All respondents will of course receive a complete report. In order to get even more interest from the community, we will give away a few free licenses for JetBrains PyCharm to randomly chosen respondents.
Thank you, everyone!

Data Science Community Survey 2020: Data Applications, Reports, and...


velvet thorn Sep 18, 2020, 11:46 AM

#

is above always true ?
@keen pine no

desert oar Sep 18, 2020, 12:12 PM

#

@merry fern df_int and df_pb are just numbers, i.e. sums. you want to compute the sum within group of duplicates? that's a totally different operation..

#

maybe you want .groupby instead
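A sketch of the difference between the two operations on a made-up frame with the same A/B/C layout. The values are invented for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "A": ["x", "x", "y", "y"],
    "B": [1, 1, 2, 3],
    "C": [10, 5, 7, 2],
})

# Drops the second ("x", 1) row, then adds up what is left: a single number.
total_after_dedup = data.drop_duplicates(subset=["A", "B"])["C"].sum()

# Sums C separately within each unique (A, B) pair: one row per pair.
sum_per_pair = data.groupby(["A", "B"], as_index=False)["C"].sum()

print(total_after_dedup)
print(sum_per_pair)
```

`drop_duplicates` throws away the C value of the repeated row, while `groupby(...).sum()` folds it into the total for that pair.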

winter portal Sep 18, 2020, 12:17 PM

#

i have some doubts with sqlite

#

pls help

merry fern Sep 18, 2020, 12:17 PM

#

@merry fern df_int and df_pb are just numbers, i.e. sums. you want to compute the sum within group of duplicates? that's a totally different operation..
@desert oar

Hmm I thought dfs were tables of data, I have strings and numbers in there.

What I want to do is consolidate duplicate A/B pairs

formal oasis Sep 18, 2020, 12:22 PM

#

right

#

web scraping

#

I'm using scrapy on a job website as a way to learn how to datascrape

#

and I have came across a problem

#

I got the tutorial documentation code and changed a few variables to test

#

import scrapy


class Seek(scrapy.Spider):

    name = "Seek"

    def start_requests(self):
        url = 'https://www.seek.co.nz/'
        # Optional spider argument, passed on the command line with -a tag=...
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + tag + '-jobs/in-All-Auckland'
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        # One div per job listing; the fields live in its child elements.
        for jobs in response.xpath("//div[@class='_3MPUOLE']"):
            yield {
                'Job Name': jobs.css('._2iNL7wI::text').get(),
                'Classification': jobs.css('._3AMdmRg[data-automation="jobClassification"]::text').get(),
                'Sub classification': jobs.css('._3AMdmRg[data-automation="jobSubClassification"]::text').get(),
                'Location': jobs.css('._3AMdmRg[data-automation="jobLocation"]::text').get(),
                'Area': jobs.css('._3AMdmRg[data-automation="jobArea"]::text').get(),
                'Desc': jobs.css('.bl7UwXp::text').get(),
            }

#

on this site, div containers are used as a way to list job listings

#

in which there are child containers which contain the actual needed information

#

I'm trying to loop over all of them but I just can't

#

no matter what I try

desert oar Sep 18, 2020, 12:31 PM

#

@merry fern just because you call the variable df, doesn't make it a dataframe

#

if df is a DataFrame then df['y'] is a Series, and df['y'].sum() is just a number
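A quick way to check this for yourself, on a made-up one-column frame:

```python
import pandas as pd

df = pd.DataFrame({"y": [1.5, 2.5, 4.0]})

column = df["y"]        # selecting one column gives a Series
total = df["y"].sum()   # reducing a Series gives a single scalar

print(type(df).__name__, type(column).__name__, total)
```

Printing `type(...)` of an intermediate result like this is often the fastest way to see whether a variable named `df_...` still holds a table.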

#

also it's considered bad style to use triple-quoted strings as general-purpose comments

#

consolidate duplicate A/B pairs
what do you mean by consolidate here?

merry fern Sep 18, 2020, 12:34 PM

#

Gotcha I need to look at it.

I added those triple quotes last night to narrate the code.

desert oar Sep 18, 2020, 12:34 PM

#

also there's no reason to abbreviate variable names 😉

#

filenames is perfectly fine, fns is hard to read

merry fern Sep 18, 2020, 12:35 PM

#

what do you mean by consolidate here?
@desert oar
The data is Col A B C D
a pair of col A and B creates a unique item, with characteristics C and D

desert oar Sep 18, 2020, 12:35 PM

#

you can just do .drop_duplicates() on the dataframe itself then

#

https://paste.pythondiscord.com/jimumaxemi.coffeescript

#

you might also want to consider setting Type and ISIN to be the index columns, but that will cause the dataframe to have a "multiindex" which can be harder to work with

#

@merry fern what isnt clear to me is, what do you want to do with the values of C and D from the duplicate records

#

do you want to just delete the rows that are duplicates by A and B? do you want to sum C and D within each A,B group? do you want to collect/aggregate C and D values some other way?

merry fern Sep 18, 2020, 12:39 PM

#

i want to sum C within each A,B group correct

#

D is a different animal, probably average

desert oar Sep 18, 2020, 12:40 PM

#

ok, then that requires groupby, not drop_duplicates

#

import pandas as pd

filenames = {'int': './internal.xlsx', 'pb': './pb.xlsx'}
sheets = {'int': 'Internal', 'pb': 'PB'}

df_int = pd.read_excel(
    filenames['int'],
    sheets['int'],
    header=0,
    usecols=[0, 2, 4, 5],
    names=['Type', 'ISIN', 'Quantity', 'Price']
)
df_int = df_int.sort_values(by=['Type', 'ISIN']).reset_index(drop=True)

df_pb = pd.read_excel(
    filenames['pb'],
    sheets['pb'],
    header=3,
    usecols=[4, 6, 10, 11],
    names=['Type', 'ISIN', 'Quantity', 'Price']
)
df_pb = df_pb.sort_values(by=['Type', 'ISIN']).reset_index(drop=True)

df_int['Price'] = df_int['Price'] * 100

df_int_agg = df_int.groupby(['Type', 'ISIN']).agg({
    'Quantity': 'sum',
    'Price': 'mean'
})

df_pb_agg = df_pb.groupby(['Type', 'ISIN']).agg({
    'Quantity': 'sum',
    'Price': 'mean'
})

#

then you can do ```python
diff_cols = ['Quantity', 'Price']
df_agg_diffs = df_int_agg[diff_cols] - df_pb_agg[diff_cols]
```

merry fern Sep 18, 2020, 12:44 PM

#

so the goal here, is to take the 2 dataframes (df_internal / df_pb), and create a results table that shows each unique ColA/B pair with the difference between the 2 lists in quantity and price

desert oar Sep 18, 2020, 12:44 PM

#

why don't you run this and see what df_agg_diffs looks like
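A sketch of what the subtraction does, on made-up aggregated frames (the ISIN values are placeholders, not real identifiers). pandas aligns the two frames on their `(Type, ISIN)` index before subtracting, so a pair that exists in only one frame comes out as NaN.

```python
import pandas as pd

index_cols = ["Type", "ISIN"]

# Made-up stand-ins for the two aggregated frames.
df_int_agg = pd.DataFrame({
    "Type": ["Bond", "Bond", "Equity"],
    "ISIN": ["AAA", "BBB", "CCC"],
    "Quantity": [100, 50, 10],
    "Price": [99.0, 101.0, 20.0],
}).set_index(index_cols)

df_pb_agg = pd.DataFrame({
    "Type": ["Bond", "Equity", "Equity"],
    "ISIN": ["AAA", "CCC", "DDD"],
    "Quantity": [100, 12, 5],
    "Price": [99.0, 20.5, 7.0],
}).set_index(index_cols)

diff_cols = ["Quantity", "Price"]

# Rows are matched by (Type, ISIN); unmatched pairs become NaN.
df_agg_diffs = df_int_agg[diff_cols] - df_pb_agg[diff_cols]
print(df_agg_diffs)
```

If a pair missing from one side should count as zero instead, `df_int_agg[diff_cols].sub(df_pb_agg[diff_cols], fill_value=0)` is one option.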

merry fern Sep 18, 2020, 12:44 PM

#

(this is my first data science project, i have been taking tutorials and reading docs for 3 weeks, thank you for your help)

desert oar Sep 18, 2020, 12:45 PM

#

let me know if its still not what you need

#

it looks like you did a good job finding a lot of pandas functionality already

#

a lot of people read garbage tutorials and don't/can't understand the docs or don't bother trying to read them

#

so you're doing well by that standard

merry fern Sep 18, 2020, 12:48 PM

#

so now df_int_agg and pb_agg are lists, not dataframes ?

#

same with diff_cols

#

and df_agg_diffs

#

C:\Users\micha\PycharmProjects\PositionRec\venv\lib\site-packages\pandas\core\indexes\multi.py:3366: RuntimeWarning: The values in the array are unorderable. Pass `sort=False` to suppress this warning. uniq_tuples = lib.fast_unique_multiple([self._values, other._values], sort=sort)

desert oar Sep 18, 2020, 12:50 PM

#

no, they shouldnt be lists

#

diff_cols is obviously a list

#

can you give me some sample data to work with?

#

i want to make sure im doing it right but its hard to work blind

#

doesnt need to be real numbers

#

just a handful of rows that mimic the real thing

merry fern Sep 18, 2020, 12:52 PM

#

do I send to you in DM ?

desert oar Sep 18, 2020, 12:52 PM

#

you can just post here

merry fern Sep 18, 2020, 12:52 PM

#

1 sec let me clean up

desert oar Sep 18, 2020, 12:52 PM

#

you know what, let me just generate some

#

its fine

#

i dont want to see your real data

merry fern Sep 18, 2020, 12:54 PM

#

thank you

lapis sequoia Sep 18, 2020, 12:56 PM

#

I'm trying to use the pd.Series().quantile() method

#

but pandas raises ValueError: Can only compare identically-labeled Series objects

desert oar Sep 18, 2020, 1:01 PM

#

@merry fern https://repl.it/@maximum__/JadedUnluckyBellsandwhistles#main.py this worked how i expected it to work

repl.it

maximum__

JadedUnluckyBellsandwhistles

A Python repl by maximum__

#

@lapis sequoia you need to show the code you used and the full error output

solid blaze Sep 18, 2020, 1:02 PM

#

Greetings strangers. I have a question regarding how to properly merge dataframes in Pandas so the models built on the data are at least useful.

lapis sequoia Sep 18, 2020, 1:02 PM

#

boot_mean_diff = []

for i in range(3000):
    boot_before = before_proportion.sample(frac=1, replace=True)
    boot_after = after_proportion.sample(frac=1, replace=True)
    boot_mean_diff.append(boot_after - boot_before)
print(boot_mean_diff[:11])
# Calculating a 95% confidence interval from boot_mean_diff 
confidence_interval = pd.Series(boot_mean_diff).quantile([0.025, 0.975])
confidence_interval

merry fern Sep 18, 2020, 1:03 PM

#

thanks, looking into it, i should be able to extrapolate technique from this and work on it more...

desert oar Sep 18, 2020, 1:03 PM

#

@solid blaze you need to clarify what exactly you mean by "properly" and "merge", and you need to explain how the "models built on the data" part is related

lapis sequoia Sep 18, 2020, 1:04 PM

#

@desert oar
https://www.paste.org/110164


desert oar Sep 18, 2020, 1:04 PM

#

@lapis sequoia it looks like boot_mean_diff is a list of dataframes? this is confusing code

runic stream Sep 18, 2020, 1:04 PM

#

hey! so i'm trying to make a prediction model (using GRU) for heart conditions, and it has 5 classes (which i numbered as 0-4) but while training, every epoch the loss is 0. i don't understand this and also the predicted values are all 1. can someone help? should i paste my code here or in one of the help channels?

desert oar Sep 18, 2020, 1:04 PM

#

it looks like boot_after - boot_before is meant to be a single number, but i don't think that's what you actually get here

#

did you mean to do n=1 instead of frac=1?
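A sketch of one way to make each bootstrap replicate a single number, assuming `before_proportion` and `after_proportion` are one-dimensional Series (the values below are made up). The likely problem in the posted loop is that it appends the difference of two whole resampled Series; reducing each resample with `.mean()` first avoids that. The `random_state` arguments are only there to make the sketch reproducible.

```python
import pandas as pd

# Made-up stand-ins for the two samples.
before_proportion = pd.Series([0.10, 0.12, 0.08, 0.11, 0.09, 0.13])
after_proportion = pd.Series([0.07, 0.06, 0.09, 0.05, 0.08, 0.06])

boot_mean_diff = []
for i in range(3000):
    boot_before = before_proportion.sample(frac=1, replace=True, random_state=i)
    boot_after = after_proportion.sample(frac=1, replace=True, random_state=10_000 + i)
    # Reduce each resample to one number before taking the difference.
    boot_mean_diff.append(boot_after.mean() - boot_before.mean())

# 95% percentile interval for the difference in means.
confidence_interval = pd.Series(boot_mean_diff).quantile([0.025, 0.975])
print(confidence_interval)
```

With a list of plain floats, `pd.Series(boot_mean_diff)` is an ordinary numeric Series and `quantile` has nothing to align.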

#

@runic stream yes post your code here, also can you share what % of the data is in each class?

runic stream Sep 18, 2020, 1:06 PM

#

ok, and my training data shape is (748, 500, 12)

#

📎 unknown.png

#

📎 unknown.png

solid blaze Sep 18, 2020, 1:07 PM

#

@desert oar Thank you for the reply! It's like this. I have two dataframes. One has the population data for each neighbourhood in Toronto. The other has the ratings, categories and price tiers of restaurants on each neighbourhood, I have between 5 to 10 restaurants per neighbourhood. I want to evaluate how the population data affects the ratings of each restaurant. My question is how do I go about joining those two dataframes? Do I add the population data of each neighbourhood to every restaurant result? Do I take an "average" of the restaurants and merge it with the neighbourhood data? What's the correct way of going about this?

desert oar Sep 18, 2020, 1:08 PM

#

hm. i dont see anything egregiously wrong with your code, but im not well-versed in the numerical/computational details of adam and neural nets. how imbalanced is the data? @runic stream

runic stream Sep 18, 2020, 1:09 PM

#

majority of the classes are 0, wait ill paste the y array

desert oar Sep 18, 2020, 1:09 PM

#

@runic stream i dont need the whole y array

rustic apex Sep 18, 2020, 1:09 PM

#

When you start with Data-Science, what libraries should you learn/focus on?

desert oar Sep 18, 2020, 1:09 PM

#

highly imbalanced data might be the problem

runic stream Sep 18, 2020, 1:10 PM

#

but then it should predict 0 right?

desert oar Sep 18, 2020, 1:10 PM

#

@rustic apex core python competence, and numpy, pandas, and matplotlib

#

yes it should be predicting mostly/all 0 if the training data is mostly/all 0
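A small sketch for checking the class balance and the accuracy a model would get by always predicting the most common class. The label list is made up; in practice it would be the real training labels.

```python
from collections import Counter

y_train = [0, 0, 0, 0, 0, 0, 1, 0, 2, 0, 3, 4]  # made-up labels for 5 classes

counts = Counter(y_train)

# Share of the data in each class.
shares = {label: count / len(y_train) for label, count in sorted(counts.items())}

# Accuracy of a "model" that always predicts the majority class.
majority_label, majority_count = counts.most_common(1)[0]
majority_baseline_accuracy = majority_count / len(y_train)

print(shares)
print(majority_label, majority_baseline_accuracy)
```

A network that predicts a single class other than the majority one, as described here, scores below even this trivial baseline, which points towards a problem in the labels or the preprocessing.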

runic stream Sep 18, 2020, 1:10 PM

#

its predicting all 1

desert oar Sep 18, 2020, 1:10 PM

#

check your data then

#

make sure you didnt goof up the data processing

#

and try to fit just a plain logistic regression first, purely as a baseline

#

your NN should always do better than logistic regression otherwise you are wasting your time with the NN

#

(or you designed a particularly bad network architecture, or there is a bug in your data/code)

#

ah wait is this image data

#

maybe never mind, sorry i thought it was just tabular

runic stream Sep 18, 2020, 1:11 PM

#

i'm trying to implement a paper, and i used their architecture

desert oar Sep 18, 2020, 1:12 PM

#

right. in that case we can rule out the architecture as the problem

#

especially if youre using the same hyperparameters as them

runic stream Sep 18, 2020, 1:12 PM

#

yes same

desert oar Sep 18, 2020, 1:12 PM

#

so check for problems w/ your data first

#

since your code looks fine too

runic stream Sep 18, 2020, 1:12 PM

#

its from PTB dataset, if you're familiar...

#

PhysioNet

desert oar Sep 18, 2020, 1:13 PM

#

not familiar

runic stream Sep 18, 2020, 1:13 PM

#

oh ok

desert oar Sep 18, 2020, 1:13 PM

#

way outside my normal problem domain

#

@solid blaze that's a great question. what question are you actually trying to answer?

#

from a probabilistic perspective, in the first case you are proposing a model where the rating of a restaurant is a random variable, and the mean of the random variable depends on the population of the neighborhood

#

it really depends on whether you want to make inferences about neighborhoods or restaurants

#

if you want to answer questions about restaurants, dont aggregate restaurants

#

if you want to answer questions about neighborhoods, do aggregate restaurants to the neighborhood level

#

obviously this is a gross oversimplification

#

and if you are just exploring the data without a clear direction yet, try everything
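A sketch of the two table shapes being contrasted, on made-up data. The column names and the `N1`/`N2` labels are assumptions, not the asker's actual schema.

```python
import pandas as pd

neighbourhoods = pd.DataFrame({
    "neighbourhood": ["N1", "N2"],
    "population": [30000, 24000],
})
restaurants = pd.DataFrame({
    "neighbourhood": ["N1", "N1", "N1", "N2", "N2"],
    "rating": [4.5, 3.8, 4.1, 3.2, 3.9],
    "price_tier": [2, 1, 3, 1, 2],
})

# Restaurant-level table: every restaurant row gets its neighbourhood's
# population attached (one row per restaurant).
restaurant_level = restaurants.merge(
    neighbourhoods, on="neighbourhood", how="left", validate="many_to_one"
)

# Neighbourhood-level table: restaurants are averaged first
# (one row per neighbourhood).
neighbourhood_level = (
    restaurants.groupby("neighbourhood", as_index=False)
    .agg(mean_rating=("rating", "mean"), n_restaurants=("rating", "size"))
    .merge(neighbourhoods, on="neighbourhood", validate="one_to_one")
)

print(restaurant_level)
print(neighbourhood_level)
```

The first table supports questions about individual restaurants; the second only supports questions about neighbourhoods, because the per-restaurant variation has been averaged away.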

solid blaze Sep 18, 2020, 1:16 PM

#

@desert oar I want to be able to extrapolate restaurant ratings based on neighbourhood of choice (population), type of restaurant and price tier. I also have the average rental cost per square foot, to see if this determines the price tier of the restaurants.

desert oar Sep 18, 2020, 1:16 PM

#

then it sounds like you should not average over restaurants

#

right?

#

how could you possibly get restaurant-level inferences if you average across all restaurants in a neighborhood?

solid blaze Sep 18, 2020, 1:17 PM

#

Hmm, then I have a doubt. If I'm adding the same data to each restaurant, am I not messing with the data in some way? I don't have the same number of restaurants per neighbourhood.

desert oar Sep 18, 2020, 1:17 PM

#

ah 🙂 that's a good question. it depends somewhat on the model you have in mind

#

in general though, "no you are fine" -- with one caveat

solid blaze Sep 18, 2020, 1:18 PM

#

I was thinking of doing some basic multiple linear regression just to see what sticks.

desert oar Sep 18, 2020, 1:18 PM

#

yeah

#

i think you are fine to aggregate

#

what might happen is that your model errors are correlated within neighborhoods

#

which technically violates the iid assumption

#

and it is very likely that your data is not iid as there are almost certainly unobserved factors that are common among restaurants within the same neighborhood

#

(what im about to explain applies to pretty much any model, not just linear regression - including deep learning)

#

including neighborhood in the model means that the average restaurant rating predicted by your model is conditional on the neighborhood

#

in a linear model, this means that changing the neighborhood amounts to shifting the average rating up or down by some amount

solid blaze Sep 18, 2020, 1:21 PM

#

So...to safeguard the iid assumption, would it be best to have an equal number of restaurants in each neighbourhood?

desert oar Sep 18, 2020, 1:22 PM

#

so if the neighborhood has additional influence on the rating apart from just shifting the average, then it technically becomes unobserved heterogeneity, i.e. something that causes the model errors to be statistically interdependent, i.e. violating the iid assumption

#

no, number of restaurants per neighborhood has absolutely nothing to do with it

solid blaze Sep 18, 2020, 1:22 PM

#

Oh..OK-

desert oar Sep 18, 2020, 1:22 PM

#

it has to do with how exactly the neighborhood influences the rating

#

however, in a lot of cases this iid assumption violation simply doesn't matter

#

and in your particular case i recommend ignoring it for now, but keeping it in the back of your mind

#

the problem you are most likely to encounter is that the variance of rating might be different within each neighborhood - a linear regression model typically assumes constant error variance across all the data

#

this will throw off the results of statistical inferences like hypothesis tests and confidence intervals

#

so while i think a multiple linear regression is perfectly useful and valid in your case, i would be very wary of actually doing statistical inference on the fitted model

#

at least not without bringing to bear more sophisticated modeling methods that can account for the unobserved/uncaptured effect of neighborhoods

#

and like i said, this is a very important concept that applies to almost all statistical modeling and machine learning that i know of, even when the model is complicated and nonlinear

#

i wish i had a good book or article reference for this, i cant remember where i first learned this stuff. maybe an econometrics textbook, or gelman&hill's hierarchical modeling book

rustic apex Sep 18, 2020, 1:25 PM

#

@desert oar 👍 how long until someone can be proficient enough to work in data science? (Studying by themselves)

solid blaze Sep 18, 2020, 1:25 PM

#

And that's exactly what I want to determine. Each neighbourhood has a distinct population distribution, some older, others younger, and different incomes. I want to explore how these affect restaurant ratings; maybe some wealthier neighbourhoods prefer more expensive restaurants and give higher ratings to those, while less..."fortunate" neighbourhoods with younger populations prefer cheaper restaurants.

desert oar Sep 18, 2020, 1:25 PM

#

@rustic apex no idea, it took me years and i still consider myself a "high functioning moron"

solid blaze Sep 18, 2020, 1:26 PM

#

I may be over my head in this.

desert oar Sep 18, 2020, 1:26 PM

#

nah you arent over your head

#

go ahead with your model. its fine

#

copy down what i wrote into a note file somewhere

rustic apex Sep 18, 2020, 1:26 PM

#

@desert oar what type of work do you do?

desert oar Sep 18, 2020, 1:26 PM

#

and look at it in a month

#

one thing you will want to do is plot the residuals of your model

#

you especially want to look at the distribution of residuals by neighborhood
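A numeric sketch of that check, standing in for a plot and using only the standard library. It fits a one-predictor least-squares line (rating against population, in thousands) on made-up rows and then summarises the residuals per neighbourhood; the real model would have more predictors.

```python
from collections import defaultdict
from statistics import mean, pstdev

# (neighbourhood, population in thousands, rating); all values made up.
rows = [
    ("N1", 30.0, 4.5), ("N1", 30.0, 3.1), ("N1", 30.0, 4.0),
    ("N2", 24.0, 3.6), ("N2", 24.0, 3.7), ("N2", 24.0, 3.5),
    ("N3", 18.0, 3.0), ("N3", 18.0, 3.4),
]

xs = [x for _, x, _ in rows]
ys = [y for _, _, y in rows]
x_bar, y_bar = mean(xs), mean(ys)

# Ordinary least squares for a single predictor.
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar

# Group the residuals (observed minus fitted) by neighbourhood.
residuals_by_neighbourhood = defaultdict(list)
for name, x, y in rows:
    residuals_by_neighbourhood[name].append(y - (intercept + slope * x))

for name, residuals in residuals_by_neighbourhood.items():
    print(name, round(mean(residuals), 3), round(pstdev(residuals), 3))
```

Group means far from zero suggest a neighbourhood effect the model has not captured, and very different spreads between groups are the non-constant error variance mentioned earlier.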

#

@rustic apex my job title is "data scientist", for whatever that is worth

solid blaze Sep 18, 2020, 1:28 PM

#

That's exactly what I'm going to do, plot residuals. But I wanted to to have an idea of what might happen when I add neighbourhood population data to each restaurant result on my dataframe. I now have a better idea of what might happen. I'll have to evaluate every model to the best of my abilities.

desert oar Sep 18, 2020, 1:29 PM

#

if it makes you feel better, the heterogeneity is always there, whether or not you add the feature to your model 🙂

#

so adding "neighborhood population" to your model can only help reduce its effect

solid blaze Sep 18, 2020, 1:30 PM

#

Even if it's not added on the same "proportion"?

merry fern Sep 18, 2020, 1:31 PM

#

how long have you been working in python/data science @desert oar ? you seem very knowledgeable 🙂

desert oar Sep 18, 2020, 1:31 PM

#

what do you mean by that @solid blaze ?

solid blaze Sep 18, 2020, 1:31 PM

#

@desert oar Thank you for taking the time to answer to my questions. I'll go back to number crunching.

desert oar Sep 18, 2020, 1:32 PM

#

@merry fern 5 years professionally

#

not very long really

merry fern Sep 18, 2020, 1:32 PM

#

cool. best resources you've used to learn on your own? i have python for finance and python for data science, but there's so much info im not sure where to focus....

solid blaze Sep 18, 2020, 1:34 PM

#

what do you mean by that @solid blaze ?
@desert oar I mean, some neighbourhoods will have 10 restaurants. Others only 3. So maybe the effect of those neighbourhoods with more results will skew the model.

#

It probably means not all neighbourhoods are equally suited for opening restaurants, or at least, opening restaurants that would get any good ratings.

desert oar Sep 18, 2020, 1:35 PM

#

@merry fern https://stats.stackexchange.com and https://towardsdatascience.com can be nice places to get exposed to ideas and topics that you dont currently understand, which might help give you some direction

merry fern Sep 18, 2020, 1:36 PM

#

thank you.

solid blaze Sep 18, 2020, 1:36 PM

#

@merry fern Coursera and Kaggle can also give lots of info. What little info I have, I picked it up from those places.

merry fern Sep 18, 2020, 1:36 PM

#

is this correct, if i want to drop rows that have NaN in both of these columns, but not only one?
df_agg_diffs = df_agg_diffs.dropna(['Quantity', 'Price'], inplace=True)

#

@desert oar

desert oar Sep 18, 2020, 1:37 PM

#

that will drop rows that have missing values in either column
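A sketch of the two behaviours on a made-up frame. Passing the column list as `subset=` and choosing `how=` controls whether a row is dropped when any or all of those columns are missing.

```python
import pandas as pd

nan = float("nan")
df_agg_diffs = pd.DataFrame({
    "Quantity": [0.0, nan, -2.0, nan],
    "Price": [0.0, nan, nan, 1.5],
})

# how="all": drop a row only when every column in `subset` is NaN.
both_missing_dropped = df_agg_diffs.dropna(subset=["Quantity", "Price"], how="all")

# how="any" (the default): drop a row when at least one of them is NaN.
either_missing_dropped = df_agg_diffs.dropna(subset=["Quantity", "Price"], how="any")

print(both_missing_dropped)
print(either_missing_dropped)
```

Note that `dropna(..., inplace=True)` returns `None`, so its result should not be assigned back to the variable; either assign the returned frame as above or use `inplace=True` on its own.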

#

@solid blaze neighborhood size shouldnt skew the model for any mathematical reason. but

It probably means not all neighbourhoods are equally suited for opening restaurants, or at least, opening restaurants that would get any good ratings.
this is an astute observation, and it's heading towards a discipline called "causal inference". unfortunately it's outside the scope of what you can handle with a simple linear model. one thing you can do is also include a count of the number of restaurants in the neighborhood, in addition to the neighborhood population

#

but thats only useful if you know the true total # of restaurants in the neighborhood

#

the 5-10 in your dataset almost certainly is not related to the true number of restaurants in the neighborhood

#

and the true number of restaurants in the neighborhood will also depend on the physical geographical size of the neighborhood as well as its population

#

so id actually leave it out unless you can get the true number (or at least an estimate of the true number) from somewhere

solid blaze Sep 18, 2020, 1:40 PM

#

HMMMMMM!!!!. iiiiiinteresting. I'll throw some code around this new info. THANK YOU so much.

desert oar Sep 18, 2020, 1:41 PM

#

good luck

solid blaze Sep 18, 2020, 1:43 PM

#

Maybe if I scrounge Toronto's city "registered restaurants" database, or something similar and then search the ratings of those who are on the database....but I digress. And the scope of this hobby project begins to creep beyond my current skill level and time available. I'll focus on a less ambitious, albeit flawed, initial model first.

desert oar Sep 18, 2020, 1:52 PM

#

"all models are wrong, some are useful"

#

(paraphrased from george box)

merry fern Sep 18, 2020, 1:56 PM

#

I guess this is wrong?

df_agg_diffs = [df_agg_diffs['Quantity'].notnull() & df_agg_diffs['Quantity'] != 0]
print(df_agg_diffs)

I am trying to filter out NA/null and 0's

#

I tried using ~

solid blaze Sep 18, 2020, 1:58 PM

#

missing df_agg_diffs at the beginning of the right side of the assignment.

merry fern Sep 18, 2020, 1:59 PM

#

thanks, that almost worked, it returned a few values then i got:
C:\Users\micha\PycharmProjects\PositionRec\venv\lib\site-packages\pandas\core\indexes\multi.py:3366: RuntimeWarning: The values in the array are unorderable. Pass `sort=False` to suppress this warning. uniq_tuples = lib.fast_unique_multiple([self._values, other._values], sort=sort)
@solid blaze

lusty coral Sep 18, 2020, 1:59 PM

#

df_agg_diffs = [df_agg_diffs['Quantity'].fillna(0) != 0] is easier

solid blaze Sep 18, 2020, 1:59 PM

#

df_agg_diffs['condition 1' & 'condition 2']

merry fern Sep 18, 2020, 2:00 PM

#

df_agg_diffs = [df_agg_diffs['Quantity'].fillna(0) != 0] is easier
@lusty coral that produced some True/False boolean results

lusty coral Sep 18, 2020, 2:01 PM

#

oh, sorry let me fix

#

df_agg_diffs = df_agg_diffs.loc[df_agg_diffs['Quantity'].fillna(0) != 0]

#

it was a filter basically 😄

mild topaz Sep 18, 2020, 2:19 PM

#

when i train a model for the following folder structure
```python
demo --> training --> albania --> driving_licence --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
                              --> passport        --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)

         testing  --> albania --> driving_licence --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
                              --> passport        --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)

         training --> armenia --> driving_licence --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
                              --> passport        --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)

         testing  --> armenia --> driving_licence --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
                              --> passport        --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
```

i get in my output layer (dense layer) ["albania", "armenia"] this way. is this possible to get as ["armenia_driving_licence_valid", "armenia_driving_licence_invalid"] in output layer?
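
One possible approach, sketched under assumptions: most directory-based loaders take the class name from the first folder level, so a flat label per leaf folder has to be built by hand. The helper name `combined_labels` is hypothetical, and the sketch assumes the leaf folders are literally named `valid` and `invalid` and that images end in `.jpg`.

```python
from pathlib import Path


def combined_labels(split_dir):
    """Map every image under split_dir to a 'country_document_validity' label.

    Assumes the layout <split_dir>/<country>/<document>/<validity>/<image>.jpg.
    """
    labels = {}
    for image in sorted(Path(split_dir).glob("*/*/*/*.jpg")):
        # The three folders directly above the file become the label parts
        country, document, validity = image.parts[-4:-1]
        labels[image] = f"{country}_{document}_{validity}"
    return labels


def class_names(labels):
    """Sorted list of distinct labels, one per unit of the output layer."""
    return sorted(set(labels.values()))
```

Calling `combined_labels("demo/training")` would then give labels such as `armenia_driving_licence_valid`, and the length of `class_names(...)` is the number of units the final dense layer would need.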

merry fern Sep 18, 2020, 2:22 PM

#

df_agg_diffs = df_agg_diffs.loc[df_agg_diffs['Quantity'].fillna(0) != 0]
@lusty coral boom! that was great. why am I getting this?

: RuntimeWarning: The values in the array are unorderable. Pass `sort=False` to suppress this warning. uniq_tuples = lib.fast_unique_multiple([self._values, other._values], sort=sort)

desert oar Sep 18, 2020, 2:26 PM

#

you showed me that warning somewhere else too

#

you might have some weird data in there

#

mixed data types maybe, unclear

#

df_agg_diffs = df_agg_diffs[df_agg_diffs['Quantity'].notnull() & (df_agg_diffs['Quantity'] != 0)]
print(df_agg_diffs)

you need parentheses around the != part

#

because of python's operator precedence

#

a & b != c in python is (a & b) != c which is not what you want
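
A small sketch of the precedence point, using made-up values for the `Quantity` column. Since `&` binds tighter than `!=`, the comparison needs its own parentheses before it is combined with the null check. The `fillna(0)` variant suggested earlier in the chat gives the same rows.

```python
import numpy as np
import pandas as pd

# Made-up data: one normal value, one zero, one NaN, one negative value
df_agg_diffs = pd.DataFrame({"Quantity": [5.0, 0.0, np.nan, -2.0]})
qty = df_agg_diffs["Quantity"]

# Parenthesize the comparison so it is evaluated before the &
mask = qty.notnull() & (qty != 0)
filtered = df_agg_diffs[mask]

# Equivalent: treat NaN as 0, then keep everything that is non-zero
filtered_alt = df_agg_diffs.loc[qty.fillna(0) != 0]
```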

mild topaz Sep 18, 2020, 2:38 PM

#

when i train a model for the following folder structure
```python
demo --> training --> albania --> driving_licence --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
                              --> passport        --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)

         testing  --> albania --> driving_licence --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
                              --> passport        --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)

         training --> armenia --> driving_licence --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
                              --> passport        --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)

         testing  --> armenia --> driving_licence --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
                              --> passport        --> invalid images (img1.jpg , img2.jpg)
                                                  --> valid images   (img1.jpg , img2.jpg)
```
i get in my output layer (dense layer) ["albania", "armenia"] this way. is this possible to get as ["armenia_driving_licence_valid", "armenia_driving_licence_invalid"] in output layer?
can anyone look into this?

modest rune Sep 18, 2020, 3:24 PM

#

This should be an easy question: I need to construct a dataframe from two 2D numpy arrays, each 2D array is 200 x 2000. I want the resulting dataframe to be 400x2000. And I have an array of 400 column names I want to apply to the dataframe as well.

#

I was thinking there was an easy way to do this pandas dataframe constructor, but I am having no luck so far.

velvet thorn Sep 18, 2020, 3:26 PM

#

uh

#

np.concatenate

modest rune Sep 18, 2020, 3:26 PM

#

yeah, duh

#

I didn't get enough sleep

#

concat first, then construct the dataframe

velvet thorn Sep 18, 2020, 3:26 PM

#

yes

#

of course, you can construct first

#

and then pd.concat

modest rune Sep 18, 2020, 3:27 PM

#

yeah, I assume that would be slower... or it is always a safer bet to do it in numpy first.

velvet thorn Sep 18, 2020, 3:27 PM

#

pandas should be slower

#

but not by much

#

that's my guess

#

because there's the additional overhead of constructing two DataFrames

#

as well as what I assume is index alignment checking

#

on pd.concat
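
A rough sketch of the "concat first, then construct" route. One assumption to flag: 400 column names only fit a frame with 400 columns, so the sketch treats each array as 2000 rows by 200 columns and joins them side by side with `axis=1`. The zero/one arrays and the `col_N` names are placeholders.

```python
import numpy as np
import pandas as pd

# Placeholder inputs: two arrays of 2000 rows x 200 columns each
first = np.zeros((2000, 200))
second = np.ones((2000, 200))
column_names = [f"col_{i}" for i in range(400)]  # stand-in for the real names

# Join in numpy first (axis=1 puts the arrays side by side), then build one DataFrame
combined = np.concatenate([first, second], axis=1)
df = pd.DataFrame(combined, columns=column_names)
```

If the arrays really are 200 x 2000 and should be stacked on top of each other, `axis=0` gives a 400 x 2000 result, and the 400 names would then fit `index=` rather than `columns=`.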

wild dome Sep 18, 2020, 4:09 PM

#

does anybody use Plotly?

solid blaze Sep 18, 2020, 4:14 PM

#

I understand some stuff have to be super fast and optimized when they're products and people are querying all the time. But does efficiency and speed really matter that much when you're coding on your own? Hey, if it works.

wild dome Sep 18, 2020, 4:14 PM

#

I understand some stuff have to be super fast and optimized when they're products and people are querying all the time. But does efficiency and speed really matter that much when you're coding on your own?
@solid blaze I'll say yes, for practice

solid blaze Sep 18, 2020, 4:15 PM

#

hmmm...of course.

wild dome Sep 18, 2020, 4:15 PM

#

I mean if your code is slow that's up to you, but you can practice by optimizing it

solid blaze Sep 18, 2020, 4:16 PM

#

Well, I'm still in that phase where I'm trying to get things to work at all! I'll eventually get to the optimization part. However, for me to get ANY result at all is a win.

wild dome Sep 18, 2020, 4:16 PM

#

yeah, first make it work correctly, then you can optimize it

#

premature optimization is the root of all evil

rustic apex Sep 18, 2020, 5:23 PM

#

Getting started learning DS.
I have: indexing & slicing, broadcasting, iterating over array, array manipulation, binary operators, mathematical functions, arithmetic operations, shallow/deep copy

glacial rune Sep 18, 2020, 5:42 PM

#

I have this data in SQL that I want to extract into Python, what data structure would be best for this? I will be using the data in a string for each record e.g.
Product: {}
Previous Price: {}
Current Price: {}
URL: {}

📎 unknown.png

#

a list of dictionaries for each record? a list of tuples?

wild dome Sep 18, 2020, 5:49 PM

#

@glacial rune if you want to access the columns by their name, use dictionaries, if you want to access them by index, use tuples

#

I think that it actually depends on your problem and your needs

glacial rune Sep 18, 2020, 6:36 PM

#

which would be more performant?

ripe forge Sep 18, 2020, 10:13 PM

#

A pandas dataframe.

#

Well, don't worry about performance that much, they'll all be fast enough

desert oar Sep 18, 2020, 10:13 PM

#

yeah... dont use lists of tuples or dicts

#

use a dataframe

#

it does the job of both, and significantly better for working with tabular data

ripe forge Sep 18, 2020, 10:14 PM

#

But this is basically a table. A good fit for pandas

desert oar Sep 18, 2020, 10:16 PM

#

import pandas as pd
# need to install sqlalchemy, but don't need to import it

query = "SELECT a, b FROM foo"
conn_str = "mysql://scott:tiger@localhost/test"
data = pd.read_sql_query(query, conn_str)

@glacial rune

#

for sqlalchemy connection string syntax, see here https://docs.sqlalchemy.org/en/13/core/connections.html

#

you will also need a database library that supports your database

#

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html
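
To tie this back to the per-record string, here is a self-contained sketch that swaps the MySQL connection for an in-memory SQLite database so it runs anywhere. The table name, column names and the single row are all made up for illustration.

```python
import sqlite3

import pandas as pd

# Stand-in database: an in-memory SQLite table with one made-up record
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (product TEXT, previous_price REAL, current_price REAL, url TEXT)"
)
conn.execute(
    "INSERT INTO products VALUES ('Widget', 9.99, 7.49, 'https://example.com/widget')"
)
conn.commit()

# Pull the whole result set into a DataFrame
data = pd.read_sql_query(
    "SELECT product, previous_price, current_price, url FROM products", conn
)

# Fill the template once per record; itertuples yields the values in column order
template = "Product: {}\nPrevious Price: {}\nCurrent Price: {}\nURL: {}"
messages = [template.format(*row) for row in data.itertuples(index=False)]
```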

dusk aspen Sep 18, 2020, 10:31 PM

#

i need some help, i am trying to create an smb server that i can use on a linux computer. I have found modules for smb clients but nothing that i can use for an smb server

desert oar Sep 18, 2020, 11:13 PM

#

@dusk aspen this is a better question for #unix or #tools-and-devops

#

but maybe look here https://ubuntu.com/tutorials/install-and-configure-samba#1-overview

Ubuntu

Install and Configure Samba | Ubuntu

Ubuntu is an open source software operating system that runs from the desktop, to the cloud, to all your internet connected things.

dusk aspen Sep 18, 2020, 11:13 PM

#

thanks

austere swift Sep 19, 2020, 1:14 AM

#

So i was doing a little bit of research on using XLA in deep learning, and apparently it delivers like 1.15x the performance according to tensorflow, and in tensorflow you're able to use it with gpu but according to pytorch docs you can only use it with TPU in pytorch, why is that?

rustic apex Sep 19, 2020, 2:14 AM

#

What’s the difference between using something like GraphQL and data from a page with Numpy/Pandas? I saw some projects like the Titanic and a shop project on Kaggle

odd yoke Sep 19, 2020, 3:05 AM

#

@austere swift it's really just because they haven't implemented it yet, and it was the easiest way to support TPUs until then

#

in fact torch_xla also supports cpus, not that it is useful

austere swift Sep 19, 2020, 3:06 AM

#

Yeah I saw it supported cpu

#

but there isnt really a point

odd yoke Sep 19, 2020, 3:07 AM

#

xla is a graph -> llvm ir -> machine code compiler, it could theoretically support any target

austere swift Sep 19, 2020, 3:08 AM

#

Yeah I did some research on it, but I feel like it wouldve been smarter to implement it on gpu before tpu since that's more common anyways

odd yoke Sep 19, 2020, 3:08 AM

#

the purpose of torch_xla was to allow torch to have some support for TPUs, not xla itself

#

it just happened that xla was the easiest solution to support tpus

proud iron Sep 19, 2020, 3:45 AM

#

Regarding collecting data through web scraping. What would be a smart way to periodically take snapshots of the HTML of a page without flooding the site with requests? :)

velvet thorn Sep 19, 2020, 4:36 AM

#

Regarding collecting data through web scraping. What would be a smart way to periodically take snapshots of the HTML of a page without flooding the site with requests? :)
@proud iron sleep in a loop?
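
A minimal sketch of the sleep-in-a-loop idea, using only the standard library. The function names and the interval are hypothetical choices, not anything from the chat. The fetch and sleep functions are passed in as parameters so the loop can be tried out without touching a real site.

```python
import time
import urllib.request


def fetch_html(url):
    """Download one page and return its HTML as text."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def take_snapshots(url, count, interval_seconds, fetch=fetch_html, sleep=time.sleep):
    """Fetch url `count` times, pausing `interval_seconds` between requests."""
    snapshots = []
    for i in range(count):
        snapshots.append((time.time(), fetch(url)))
        if i < count - 1:  # no need to wait after the last snapshot
            sleep(interval_seconds)
    return snapshots
```

For example, `take_snapshots(url, count=24, interval_seconds=3600)` would make one request per hour for a day. For anything long-running, a scheduler such as cron calling a one-shot script is probably more robust than a process that sleeps.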

alpine bay Sep 19, 2020, 5:45 AM

#

does anyone know what needs to be implemented to use np ufuncs like np.multiply on a subclass of an ndarray?

https://jiffyclub.github.io/numpy/user/basics.subclassing.html#subclassing-and-downstream-compatibility

For methods that have both an array and ndarray version, the docs say to override the function while keeping the signature in order to use the ndarray version but I don't see where it mentions ufuncs.

when I try to multiply the subclassed ndarray by an int an attribute error appears. AttributeError: 'int' object has no attribute 'view'

alpine bay Sep 19, 2020, 6:03 AM

#

never mind! I figured it out
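
The chat does not say what the fix was. One common cause of that `'int' object has no attribute 'view'` error is an `__array_ufunc__` override that calls `.view(...)` on every input, including plain Python numbers. A sketch of a subclass that only converts its own instances, with a hypothetical `Tagged` class and `tag` attribute:

```python
import numpy as np


class Tagged(np.ndarray):
    """ndarray subclass carrying an extra `tag` attribute (illustrative only)."""

    def __new__(cls, data, tag=None):
        obj = np.asarray(data).view(cls)
        obj.tag = tag
        return obj

    def __array_finalize__(self, obj):
        if obj is None:
            return
        self.tag = getattr(obj, "tag", None)

    def __array_ufunc__(self, ufunc, method, *inputs, out=None, **kwargs):
        # Only Tagged inputs get converted; ints and floats pass through untouched
        raw = tuple(x.view(np.ndarray) if isinstance(x, Tagged) else x for x in inputs)
        if out is not None:
            kwargs["out"] = tuple(
                o.view(np.ndarray) if isinstance(o, Tagged) else o for o in out
            )
        result = super().__array_ufunc__(ufunc, method, *raw, **kwargs)
        if isinstance(result, np.ndarray):
            # Re-wrap array results so the subclass and its tag survive the ufunc
            result = result.view(Tagged)
            result.tag = self.tag
        return result


doubled = np.multiply(Tagged([1, 2, 3], tag="x"), 2)
```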

feral spoke Sep 19, 2020, 6:13 AM

#

Anyone around ?

#

I need help with pandas

velvet thorn Sep 19, 2020, 6:15 AM

#

Anyone around ?
@feral spoke just ask.

feral spoke Sep 19, 2020, 6:16 AM

#

I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.
So apparently this dataset has 3 columns: points, price and winery. Based on the question above I want to get the winery name.

#

@velvet thorn

velvet thorn Sep 19, 2020, 6:16 AM

#

okay, that's p simple

#

do you know how to get the points-to-price ratio?

feral spoke Sep 19, 2020, 6:16 AM

#

yes

#

.max()

velvet thorn Sep 19, 2020, 6:17 AM

#

huh?

#

no...

feral spoke Sep 19, 2020, 6:17 AM

#

on the ratio function

velvet thorn Sep 19, 2020, 6:17 AM

#

well, yes, but that's the next step

#

then you can just sort by the ratio

#

and take the first value (assuming you're sorting in descending order)

feral spoke Sep 19, 2020, 6:19 AM

#

so something like this df[df['col_1']/df['col_2'].sort_values(ascending=False)].head(1)

#

@velvet thorn ?

velvet thorn Sep 19, 2020, 6:19 AM

#

you probably don't want .head(1)

feral spoke Sep 19, 2020, 6:19 AM

#

ya

velvet thorn Sep 19, 2020, 6:19 AM

#

well, you can do that, I guess

#

up to you

#

whether you want the row

#

or just the name cell

#

but no, not exactly like that

feral spoke Sep 19, 2020, 6:19 AM

#

just the name cell

velvet thorn Sep 19, 2020, 6:20 AM

#

df['col_1']/df['col_2'].sort_values(ascending=False) this isn't going to work

feral spoke Sep 19, 2020, 6:20 AM

#

why?

velvet thorn Sep 19, 2020, 6:21 AM

#

you could do this:

df['ratio'] = df['points'] / df['price']
df.sort_values(by='ratio', ascending=False).head(1)['name']

feral spoke Sep 19, 2020, 6:21 AM

#

But this would return the max value right?

velvet thorn Sep 19, 2020, 6:21 AM

#

try it.

#

df[df['col_1']/df['col_2'].sort_values(ascending=False)].head(1) wouldn't work because df['col_1']/df['col_2'].sort_values(ascending=False) gives you a Series of values

#

but you're trying to use that as an index

feral spoke Sep 19, 2020, 6:22 AM

#

Yes but I want the string value corresponding to it

velvet thorn Sep 19, 2020, 6:22 AM

#

yes

#

but that's not relevant

#

at this point

#

first you want to get the row with the max value of ratio

#

THEN you pull the name column out.

#

because you need the other columns to determine the max.

feral spoke Sep 19, 2020, 6:24 AM

#

ok thanks

#

let me try

#

@velvet thorn it returns the series but I want to have the string

velvet thorn Sep 19, 2020, 6:28 AM

#

did you do

#

what I said

#

exactly?

feral spoke Sep 19, 2020, 6:28 AM

#

It showed the error

#

Yes

velvet thorn Sep 19, 2020, 6:28 AM

#

you could do this:

df['ratio'] = df['points'] / df['price']
df.sort_values(by='ratio', ascending=False).head(1)['name']

@velvet thorn this

feral spoke Sep 19, 2020, 6:28 AM

#

yes

velvet thorn Sep 19, 2020, 6:28 AM

#

what error

feral spoke Sep 19, 2020, 6:29 AM

#

Incorrect: Expected bargain_wine to have type <class 'str'> but had type <class 'pandas.core.series.Series'>

velvet thorn Sep 19, 2020, 6:29 AM

#

show code

feral spoke Sep 19, 2020, 6:29 AM

#

reviews['ratio'] = reviews['points'] / reviews['price']
bargain_wine = reviews.sort_values(by='ratio', ascending=False).head(1)['winery']

velvet thorn Sep 19, 2020, 6:30 AM

#

ah, right

#

my bad

#

I forgot

feral spoke Sep 19, 2020, 6:30 AM

#

can I use iloc[0,0]

#

I mean that would return the string after all

velvet thorn Sep 19, 2020, 6:31 AM

#

ye

#

you can

feral spoke Sep 19, 2020, 6:31 AM

#

let me see

#

lol finally I got the string value but answer was incorrect

#

XD

velvet thorn Sep 19, 2020, 6:32 AM

#

maybe the ratio is supposed to be the other way round

#

hm, no, it looks right

feral spoke Sep 19, 2020, 6:33 AM

#

wait I will show you the question

#

I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

#

Does the question makes sense?

velvet thorn Sep 19, 2020, 6:35 AM

#

yes

feral spoke Sep 19, 2020, 6:35 AM

#

so what was wrong in our code?

#

eh sorry to keep bugging you

velvet thorn Sep 19, 2020, 6:36 AM

#

hm.

#

it looks correct, though.

feral spoke Sep 19, 2020, 6:37 AM

#

can I dm you the screenshot?

velvet thorn Sep 19, 2020, 6:37 AM

#

it's text, right?

#

you can paste it here

feral spoke Sep 19, 2020, 6:37 AM

#

code only is in text format

#

wait

#

Damn still showing error

austere swift Sep 19, 2020, 6:52 AM

#

So what would be the equivalent of using padding='same' (from keras) in pytorch?

feral spoke Sep 19, 2020, 6:56 AM

#

@velvet thorn thnx for helping me out,I got the solution 🙂

grand mason Sep 19, 2020, 6:58 AM

#

Hello guys, this chat is for beginners too?

austere swift Sep 19, 2020, 7:00 AM

#

yeah this chat is just for general data science

grand mason Sep 19, 2020, 7:02 AM

#

Ohh great hehe I'm very noob on this topic hahaha

#

Have some pattern to ask questions here? Because I want know some stuffs and i will be glad if someone could help me!

mild topaz Sep 19, 2020, 7:12 AM

#

hey sorry to ping u @brittle agate , i want to train a CNN model where i have classes and subclasses. how can i train for the following folder structure?
project
    training
        albania
            driving_licence
                valid images
                invalid images
            passport
                valid images
                invalid images
        armenia
            driving_licence
                valid images
                invalid images
            passport
                valid images
                invalid images
    testing
        albania
            driving_licence
                valid images
                invalid images
            passport
                valid images
                invalid images
        armenia
            driving_licence
                valid images
                invalid images
            passport
                valid images
                invalid images
can u please look into this?

velvet thorn Sep 19, 2020, 7:16 AM

#

Have some pattern to ask questions here? Because I want know some stuffs and i will be glad if someone could help me!
@grand mason just ask

#

@velvet thorn thnx for helping me out,I got the solution 🙂
@feral spoke that's good! what was wrong

feral spoke Sep 19, 2020, 7:26 AM

#

@feral spoke that's good! what was wrong
@velvet thorn We needed to use the iloc function in beginning and we had to use the idx function
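
Presumably "the idx function" is `idxmax`. A sketch of that route on a made-up three-row frame: `idxmax` returns the index label of the largest ratio, and `.loc` with that label and a single column name returns the cell itself rather than a Series. The exercise asks for the wine's title, so the sketch assumes a `title` column exists, which may be why `winery` was marked incorrect.

```python
import pandas as pd

# Made-up stand-in for the reviews dataset
reviews = pd.DataFrame({
    "title": ["Wine A", "Wine B", "Wine C"],
    "points": [90, 85, 88],
    "price": [30.0, 10.0, 44.0],
})

# Index label of the row with the highest points-to-price ratio
bargain_idx = (reviews["points"] / reviews["price"]).idxmax()

# A label plus a single column name gives back one scalar, here a str
bargain_wine = reviews.loc[bargain_idx, "title"]
```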

grand mason Sep 19, 2020, 7:30 AM

#

So, for a beginner, you guys think it's is better start with Tensorflow or Pytorch?

#

Have some good roadmap for 'became' a datascientist?

#

And the big question: Python or R?

feral spoke Sep 19, 2020, 7:33 AM

#

Tensorflow/pytorch are used for deep learning

#

There are many other libraries and things you need to learn before jumping to that

grand mason Sep 19, 2020, 7:35 AM

#

I hear about pandas, numpy, scikit-learn, seaborn and sciPy

feral spoke Sep 19, 2020, 7:35 AM

#

yes

#

matplotlib also

#

After that you should start with ML

grand mason Sep 19, 2020, 7:36 AM

#

Hmn, the documentations of them are good?

feral spoke Sep 19, 2020, 7:36 AM

#

It depends on you

grand mason Sep 19, 2020, 7:37 AM

#

Or have some other resources to learn about them?

feral spoke Sep 19, 2020, 7:37 AM

#

Are you comfortable with reading the documentations or you like to watch videos and practise?

grand mason Sep 19, 2020, 7:37 AM

#

I prefer read

feral spoke Sep 19, 2020, 7:37 AM

#

I mean there are many courses available on udemy,coursera,edx

#

But I guess you need to practise on your own too

grand mason Sep 19, 2020, 7:38 AM

#

Have some good book? I know some of O'Reilly

lapis sequoia Sep 19, 2020, 7:38 AM

#

udemy also

feral spoke Sep 19, 2020, 7:38 AM

#

Good book for what topic?

#

Cause DS is pretty vast field

grand mason Sep 19, 2020, 7:38 AM

#

For learn about the libraries, or the math behind the scene

lapis sequoia Sep 19, 2020, 7:38 AM

#

I am just a python beginner

feral spoke Sep 19, 2020, 7:39 AM

#

For numpy,pandas,matplot there are videos on YT

#

or go to kaggle

#

It is descriptive mostly along with the exercise

grand mason Sep 19, 2020, 7:39 AM

#

Like i have this book, Mathematics for Machine Learning, but i don't know if is really necessary knows about the math before start practice

feral spoke Sep 19, 2020, 7:40 AM

#

Have you taken the class for calculus ever before?

grand mason Sep 19, 2020, 7:40 AM

#

Yep

feral spoke Sep 19, 2020, 7:40 AM

#

There is Andrew Ng course on ML available on coursera

#

You should try that

#

He explains the math needed for understanding and implementing the algorithms

#

Only drawback is the programming language used is octave/matlab

grand mason Sep 19, 2020, 7:42 AM

#

Oh my god, it's easy to learn? I know python well and javascript

mild topaz Sep 19, 2020, 7:42 AM

#

@feral spoke sorry to ping u can u look into my issue ?

feral spoke Sep 19, 2020, 7:42 AM

#

Oh my god, it's easy to learn? I know python well and javascript
@grand mason Practise makes men perfect

#

@feral spoke sorry to ping u can u look into my issue ?
@mild topaz sorry man I'm yet to start with the ML

#

I'm still learning

mild topaz Sep 19, 2020, 7:43 AM

#

ok np

grand mason Sep 19, 2020, 7:44 AM

#

True, I search now, this website kaggle is really interesting!

#

Wtf have a lot of challenges haha

feral spoke Sep 19, 2020, 7:45 AM

#

Yes there are learning resources as well as challenges

grand mason Sep 19, 2020, 7:46 AM

#

Well tks man, I will explore this site!!!

feral spoke Sep 19, 2020, 7:46 AM

#

No problem buddy 🙂

grand mason Sep 19, 2020, 7:46 AM

#

You like some challenge of him?

#

Oh never mind hahaha have this about pokemon, I will try this haha

#

@feral spoke one more question, it's interesting use Linux for do ML?

feral spoke Sep 19, 2020, 7:53 AM

#

@feral spoke one more question, it's interesting use Linux for do ML?
@grand mason Tbh I don't have idea regarding which OS you should use for ML but I'm using windows as of now

grand mason Sep 19, 2020, 7:54 AM

#

Hmn ok, tks man 😆!

austere swift Sep 19, 2020, 8:07 AM

#

So I get this error when I'm trying to train a pytorch model in the line that it calls the loss function: IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1)
I did some research and found that it had to do with something with the shape of the outputs and the labels being incorrect, but I don't really understand what the expected shape is. The outputs shape is torch.Size([15]) and the label's shape is torch.Size([1, 15]), so I'm guessing that would probably be the issue, but I don't get why they are different shapes anyways since they were both originally size [15]

#

https://github.com/pytorch/pytorch/issues/5554 I also found this issue on gh for it, and they say that they are supposed to be different shapes, so I don't see what's wrong

GitHub

dimension out of range (expected to be in range of [-1, 0], but got...

criterion = nn.CrossEntropyLoss() print(outputs.data) print(label.data) loss = criterion(outputs, label) # getting error at this point The output that I'm getting is => 0.0174 0.1866...

fiery cloak Sep 19, 2020, 8:35 AM

#

hey anyone?

can anyone help with OpenCV?

#

trying to build hand gesture recoginition

lapis sequoia Sep 19, 2020, 9:59 AM

#

doesn't .mean() return a single number? how am I supposed to plot that?

📎 unknown.png

fiery cloak Sep 19, 2020, 10:19 AM

#

whats excess_returns?

lapis sequoia Sep 19, 2020, 10:29 AM

#

a series

#

it's an error in the project, I tried the solution and it doesn't work

zenith yarrow Sep 19, 2020, 11:34 AM

#

Hello. I am new to python and I'm struggling to draw an histogram with multiple data for school.
Basically I have a list of values and a list of list of frequencies
I want to achieve something like that

📎 unknown.png

#

📎 unknown.png

arctic wedgeBOT Sep 19, 2020, 1:21 PM

#

Hey @earnest forge!

It looks like you tried to attach file type(s) that we do not allow (.ipynb). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg, .webm, .webp, .flac, .afdesign, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

brittle agate Sep 19, 2020, 3:26 PM

#

📎 unknown.png

#

https://tenor.com/view/nick-young-question-mark-huh-what-confused-gif-4995479

Tenor

#

Hello. I am new to python and I'm struggling to draw an histogram with multiple data for school.
Basically I have a list of values and a list of list of frequencies
I want to achieve something like that
@zenith yarrow
Go to rooms help-***. And ask this.

grave frost Sep 19, 2020, 3:44 PM

#

Can anyone confirm what an empty assets folder means when checkpointing in TensorFlow? It does say "Assets written to <path to assets folder of checkpoint folder>" but assets remains empty and there are other things like saved_model.pb and variables folder. I was actually having trouble when loading trained model's checkpoints, so am training it again, but noticed this anomaly...

pearl crystal Sep 19, 2020, 4:11 PM

#

Minmax normalization or mean normalization, which one do you choose to normalize your data?

#

and is it essential to normalize data?

grave frost Sep 19, 2020, 6:16 PM

#

Also, can anyone advise me on how to use tf.data? I want to basically have 2 files - 1 is source file and 1 is target file. I want to find the relationship b/w the source sequence and target sequence (like 1st line of src.text corresponds to 1st line of target.txt). so what should I use? tf.dataset? convert to Tfrecords? tf.data.TextLineDataset?? PLease help!

gray sedge Sep 19, 2020, 7:05 PM

#

having a bad time with Pandas, can anyone help ):

📎 unknown.png

uncut shadow Sep 19, 2020, 7:39 PM

#

What's wrong?

#

(show error or something like that)

gray sedge Sep 19, 2020, 7:45 PM

#

One sec I was trying so much and still no luck i know I'm messing up syntax somewhere

#

Gotta revert it to that

#

📎 unknown.png

uncut shadow Sep 19, 2020, 7:50 PM

#

@gray sedge Look at the last line of Ur code

#

Basically, the error says you cannot do temp_df(...)

#

And you are trying to do that on the last line

gray sedge Sep 19, 2020, 8:02 PM

#

So how do I call it

#

Doing it without the parenthesis gives: KeyError: 'Differences'

swift nebula Sep 19, 2020, 8:46 PM

#

temp_df['Differences'][2]

#

for the second element

gray sedge Sep 19, 2020, 8:50 PM

#

just gotta get it to take multiple indices

rustic apex Sep 19, 2020, 9:13 PM

#

Using Numpy and Pandas. So, is this how you create an “updated” visual of business sales???
I create a Django project, set up my classes and views, have an “order” page that writes info to a .txt, and then I can use Numpy/Pandas to display visually how my sales look?

velvet thorn Sep 19, 2020, 10:50 PM

#

Minmax normalization or mean normalization, which one do you choose to normalize your data?
@pearl crystal what is "mean normalisation"?

#

I think you mean "scaling" - "normalisation" has a specific meaning.

#

because there are other ways, too.

#

unless you really mean specific min-max normalisation vs mean normalisation

#

in which case I would say "it depends on your data".

#

🥴
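
For concreteness, a short numpy sketch of the options being compared. It assumes "mean normalisation" means subtracting the mean and dividing by the range, which is one common usage of the term. Standardisation (z-scores) is included as one of the "other ways". The sample values are arbitrary.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])  # arbitrary sample values

# Min-max scaling: squeezes values into [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Mean normalisation (assumed definition): centred on 0, divided by the range
mean_norm = (x - x.mean()) / (x.max() - x.min())

# Standardisation: zero mean and unit standard deviation
standardized = (x - x.mean()) / x.std()
```

Min-max and mean normalisation are both driven by the extreme values, so a single outlier compresses everything else. That is one reason the answer tends to be "it depends on your data".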

gray sedge Sep 20, 2020, 1:59 AM

#

Okay dead ass at this point I'm willing to pay someone to do this pandas assignment I'm about to lose it

serene scaffold Sep 20, 2020, 2:37 AM

#

@gray sedge you can get help with general questions pertaining to an assignment, but you can't ask someone to do assignments for you here or request paid work.

gray sedge Sep 20, 2020, 2:37 AM

#

Moreso just frustration lol i've never struggled this bad

serene scaffold Sep 20, 2020, 2:37 AM

#

I'm sorry to hear that :/

#

I'm trying to figure out a pandas assignment myself.

#

though it's not so much a pandas assignment as much as one where we're allowed to use pandas if we want

gray sedge Sep 20, 2020, 2:42 AM

#

i think i'd rather learn C in depth than finish this assignment

serene scaffold Sep 20, 2020, 2:42 AM

#

probably not

gray sedge Sep 20, 2020, 2:43 AM

#

Yea you're likely right
but it's making me regret this semester I've got 24 hours to finish this assignment, I've been 90% fine up until pandas 😐 this is only halfwayish into the assignment

serene scaffold Sep 20, 2020, 2:45 AM

#

Well, if anyone happens to know

    for i in range(10):
        for b in (TRUE, FALSE):
            row_is_nan = np.isnan(matrix[:, i])
            row_is_relevant = matrix[:, 10] == b
            matrix[row_is_nan & row_is_relevant] = np.mean(matrix[~row_is_nan & row_is_relevant].flat)

#

I need to make sure that the last line is only applied towards the ith column, but I'm not sure how to do that and still have row_is_nan & row_is_relevant

#

TRUE and FALSE are constants that are not True and False

#

it just makes sense to call them that in this context

#

if I had a bool matrix where I could make the ith column truthy, I could just throw that in with another &

rustic apex Sep 20, 2020, 3:25 AM

#

With a retail type site. Do you ever implement pandas to organize/analyse sales in real-time?

hard pelican Sep 20, 2020, 3:39 AM

#

Can anyone suggest a resource on learning about the different layer types in keras?

#

I understand the broad theory behind neural networks, but want to know more about what to do on a practical level

#

I'm specifically interested in regression tasks

velvet thorn Sep 20, 2020, 5:17 AM

#

@serene scaffold huh

#

am I missing something?

#

like just add a column index?

serene scaffold Sep 20, 2020, 5:18 AM

#

@velvet thorn we were talking about it in #algos-and-data-structs because I'm a hypocrite

#

(since I'm always telling people in the CS channel where to go)

velvet thorn Sep 20, 2020, 5:19 AM

#

I do the same thing...

#

so you solved it?

serene scaffold Sep 20, 2020, 5:19 AM

#

the task is to perform conditional mean imputation, if you've heard of that. I haven't.

velvet thorn Sep 20, 2020, 5:19 AM

#

oh

#

yeah, why not

serene scaffold Sep 20, 2020, 5:19 AM

#

I have not, but it's past 1am for me so I may not be able to dedicate full brain power to it

#

I mean I know what conditional mean imputation is now, I just hadn't before.

velvet thorn Sep 20, 2020, 5:20 AM

#

so...am I missing something?

#

why can't you just do matrix[row_is_nan & row_is_relevant, i] = ...

serene scaffold Sep 20, 2020, 5:20 AM

#

why cam?

lapis sequoia Sep 20, 2020, 5:20 AM

#

To activate this environment, use

 $ conda activate C:\Users\Main User\desktop\sample_project_1\env

However, when I run it I receive this error:
```
Enter-CondaEnvironment : A positional parameter cannot be found that accepts argument
'User\desktop\sample_project_1\env'.
At C:\Users\Main User\miniconda3\shell\condabin\Conda.psm1:170 char:17
+             Enter-CondaEnvironment @OtherArgs;
+             ~~~~~
+ CategoryInfo : InvalidArgument: (:) [Enter-CondaEnvironment], ParameterBindingException
+ FullyQualifiedErrorId : PositionalParameterNotFound,Enter-CondaEnvironment
```
How do I solve this error?

serene scaffold Sep 20, 2020, 5:20 AM

#

my professor told me to.

velvet thorn Sep 20, 2020, 5:21 AM

#

same thing for the right side

serene scaffold Sep 20, 2020, 5:21 AM

#

so np.ndarray.__setitem__ accepts a tuple of the mask followed by a slice?

velvet thorn Sep 20, 2020, 5:21 AM

#

does it not...?

serene scaffold Sep 20, 2020, 5:21 AM

#

I thought you had to do a mask or a slice

velvet thorn Sep 20, 2020, 5:22 AM

#

I'm pretty sure it does

serene scaffold Sep 20, 2020, 5:22 AM

#

but I haven't used numpy to this extent before

#

let's see

velvet thorn Sep 20, 2020, 5:22 AM

#

as in I've never thought about it in explicit terms

#

but that's how I would do that

#

>>> import numpy as np
>>> a = np.array([[1, 2], [3, 4]])
>>> a[[True, False], 1]
array([2])

#

i.e. select the first row and the second column

#

which is what one would expect

serene scaffold Sep 20, 2020, 5:22 AM

#

!e

import numpy as np
a = np.array([[1, 2], [3, 4]])
print(a[[True, False], 1])

arctic wedgeBOT Sep 20, 2020, 5:23 AM

#

@serene scaffold :white_check_mark: Your eval job has completed with return code 0.

[2]

serene scaffold Sep 20, 2020, 5:23 AM

#

is that what was wanted?

velvet thorn Sep 20, 2020, 5:23 AM

#

is it not?

#

that's basically masking rows and using a column index

serene scaffold Sep 20, 2020, 5:24 AM

#

it might be, but it's easier to ask you than to think about it 🙂

velvet thorn Sep 20, 2020, 5:24 AM

#

perhaps 2x2 is not clear enough

#

>>> a = np.arange(16).reshape((4, 4))
>>> a[[True, False, False, True], :2]
array([[ 0,  1],
       [12, 13]])

#

in other words, the first and last row, and until the 3rd column, exclusive

#

same thing as what you want
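
The same mask-plus-column index also works on the left side of an assignment, which is the piece the imputation loop was missing. A minimal sketch, assuming a small made-up float matrix whose last column holds the class label (the values and the -1.0 fill are placeholders, not the real data):

```python
import numpy as np

# Hypothetical 4x3 matrix: columns 0-1 are features, column 2 is the class label.
matrix = np.array([
    [1.0, np.nan, 0.0],
    [3.0, 4.0,    0.0],
    [5.0, np.nan, 1.0],
    [7.0, 8.0,    1.0],
])

i = 1                                   # the column to patch
row_is_nan = np.isnan(matrix[:, i])     # 1-D boolean mask over rows
row_is_relevant = matrix[:, 2] == 0.0   # rows that belong to one class

# Boolean row mask + integer column index: only column i of the selected rows changes.
matrix[row_is_nan & row_is_relevant, i] = -1.0

print(matrix)
```

Only the NaN in the first row is replaced here; the NaN in the third row belongs to the other class and is left alone.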

minor ember Sep 20, 2020, 5:27 AM

#

Someone help me please: how should I find the true nature of user input if the input function always makes it a string?

📎 1111111.png

serene scaffold Sep 20, 2020, 5:27 AM

#

I'll need to focus on this tomorrow but I appreciate your input @velvet thorn

#

@minor ember I'm not sure that this is a data science question but for any object, the __class__ attribute is the type

velvet thorn Sep 20, 2020, 5:28 AM

#

@minor ember I'm not sure that this is a data science question but for any object, the __class__ attribute is the type
@serene scaffold I think what they're really asking is

lapis sequoia Sep 20, 2020, 5:28 AM

#

@lapis sequoia ?

velvet thorn Sep 20, 2020, 5:28 AM

#

input() returns a str; how can we determine whether there is any additional data type it can be converted to?

serene scaffold Sep 20, 2020, 5:28 AM

#

@lapis sequoia take a look at #❓｜how-to-get-help

minor ember Sep 20, 2020, 5:29 AM

#

input() returns a str; how can we determine whether there is any additional data type it can be converted to?
@velvet thorn yes exactly my question

lapis sequoia Sep 20, 2020, 5:29 AM

#

@lapis sequoia take a look at #❓｜how-to-get-help
@serene scaffold okay

velvet thorn Sep 20, 2020, 5:31 AM

#

@velvet thorn yes exactly my question
@minor ember okay, so in the simple cases

#

int and float, you can just try to convert them

#

do you have to handle list, tuple, and dict?

#

and set?

minor ember Sep 20, 2020, 5:32 AM

#

yes every data type

velvet thorn Sep 20, 2020, 5:33 AM

#

well, there's a dirty way that I don't think you're supposed to use

#

I guess you would need to parse the string

#

for example, a list always starts with [

serene scaffold Sep 20, 2020, 5:33 AM

#

well, there's a dirty way that I don't think you're supposed to use
@velvet thorn gimme that #esoteric-python

velvet thorn Sep 20, 2020, 5:33 AM

#

so you can tell if it's a compound data type by the first character

#

(don't forget None)

#

@velvet thorn gimme that #esoteric-python
@serene scaffold just ast.literal_eval

#

more "dirty" in the sense of "cheating"
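
For reference, a rough sketch of the `ast.literal_eval` route. The helper name `infer_type` is made up for illustration, and falling back to the original string when parsing fails is an assumption about what the exercise wants:

```python
import ast

def infer_type(text):
    """Return the Python value that `text` spells as a literal, or the string itself."""
    try:
        # literal_eval only accepts literals (numbers, strings, tuples, lists,
        # dicts, sets, booleans, None); it does not run arbitrary code.
        return ast.literal_eval(text.strip())
    except (ValueError, SyntaxError):
        # Not a valid literal, so treat it as plain text.
        return text

for raw in ["42", "3.14", "[1, 2, 3]", "{'a': 1}", "None", "hello"]:
    value = infer_type(raw)
    print(repr(raw), "->", type(value).__name__)
```

In real code the string would come from `input()`; the fixed list above just keeps the sketch runnable without a prompt.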

serene scaffold Sep 20, 2020, 5:34 AM

#

lemon_hyperpleased

lapis sequoia Sep 20, 2020, 6:42 AM

#

Hi everybody! I've been working on an issue for a couple of days, but I don't know if there is a straightforward solution for it or it has to be a totally custom solution.

I have a pandas dataframe with the following columns: [A, B, C, D, E, F, G, H, I, J]. I want to convert this dataframe to a list of dicts such as the one here https://paste.pythondiscord.com/iqarosenuc.json
(it should be group by [A, B, C, D] then group the resulting lists inside this grouping by [E, F], and finally by [G, H, I] to have a list of Js in the last level)
The dataframe is directly obtained from a query to a postgresql db if that helps to find any alternatives.

I've tried applying a group_by but I am only able to do it for the first grouping. I've tried to make custom functions that iterate over the rows, but I got no luck.

#

how can I share the structure I'm looking for?

lyric canopy Sep 20, 2020, 6:47 AM

#

If you want to share something text-based that's relatively long, the best way to do that is by using a paste service or a gist. We've got a paste service (https://paste.pythondiscord.com/), but there are plenty of others on the web as well. It helps with keeping the conversation on screen on all devices, as large chunks of code/data takes up a lot of vertical screen.

paper niche Sep 20, 2020, 7:13 AM

#

@lapis sequoia Can you provide some sample data for us to play with?

lapis sequoia Sep 20, 2020, 7:14 AM

#

Yeah, sorry, what's the preferred format?

paper niche Sep 20, 2020, 7:15 AM

#

the pandas dataframe, maybe as a csv in hastebin?

lapis sequoia Sep 20, 2020, 7:15 AM

#

okay, give me a sec

#

https://paste.pythondiscord.com/raw/rehaqeveha

#

I think that the problem is that I don't have the required knowledge to parse this to a nested list of dicts iterating over the rows lemon_pensive

coral hound Sep 20, 2020, 7:36 AM

#

this is my csv data go to https://archive.ics.uci.edu/ml/index.php

proven moon Sep 20, 2020, 8:50 AM

#

yo, is there any way to create such barplots?

for example, i have data that have 4 columns:

name; count1; count2; count3
foo; 2; 3; 1
bar; 6; 4; 1

hue should be the counts columns
name of stacked barplots are the name column

📎 unknown.png

paper niche Sep 20, 2020, 9:07 AM

#

@lapis sequoia can you check if this gives the result you're looking for?

(df.groupby(['A', 'B', 'C', 'D', 'E','F', 'G', 'H', 'I']).apply(
    lambda x: pd.Series({'third_grouping': x[['J']].to_dict('records')})).reset_index()
 .groupby(['A', 'B', 'C', 'D', 'E' ,'F']).apply(
     lambda x: pd.Series({'second_grouping': x[['G', 'H', 'I', 'third_grouping']].to_dict('records')})).reset_index()
 .groupby(['A', 'B', 'C', 'D']).apply(
     lambda x: pd.Series({'first_grouping': x[['E', 'F', 'second_grouping']].to_dict('records')})).reset_index()
).to_dict('records')

#

if it doesn't, can you provide the desired output for the specific example you gave? I tried my best to work out from your earlier json structure but I'm not 100% sure I understood it correctly.

lapis sequoia Sep 20, 2020, 9:11 AM

#

I get this error:

Traceback (most recent call last):
  File "private/test_survey_formating.py", line 49, in <module>
    result = (df.groupby(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I']).apply(
  File "/venv/lib/python3.6/site-packages/pandas/core/frame.py", line 5810, in groupby
    observed=observed,
  File "/venv/lib/python3.6/site-packages/pandas/core/groupby/groupby.py", line 409, in __init__
    mutated=self.mutated,
  File "/venv/lib/python3.6/site-packages/pandas/core/groupby/grouper.py", line 598, in get_grouper
    raise KeyError(gpr)
KeyError: 'A'

I'm running it like this: https://paste.pythondiscord.com/salupawogo.py

#

can you share your csv loading code?

#

oopsie, sorry, i copy pasted this loading snippet

paper niche Sep 20, 2020, 9:14 AM

#

import io
df = pd.read_csv(io.StringIO(input_string))

try with this

lapis sequoia Sep 20, 2020, 9:15 AM

#

KeyError B

paper niche Sep 20, 2020, 9:15 AM

#

print out df and make sure it's correct first

#

it should have the columns 'A' to 'J'

lapis sequoia Sep 20, 2020, 9:17 AM

#

yeah it does have them

#

 A                         B                C                   D              E                    F    G             H                                                  I                                         J
0   42  22jDAqicaEepSd5KcbXIrAK6  tracking_survey  Liquid Registry ☕️  1592478773258  2020-06-18 11:13:49   56      Checkbox        Select the types of liquids you have taken:  Non-alcoholic and non-stimulant drinks:

paper niche Sep 20, 2020, 9:17 AM

#

ah your column names have a space before them

lapis sequoia Sep 20, 2020, 9:18 AM

#

yeah, that was it

paper niche Sep 20, 2020, 9:18 AM

#

change the first line of your input_string to "A,B,C,D,E,F,G,H,I,J"
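
If editing the header by hand is not practical, pandas can drop the stray spaces itself. A small sketch with a made-up three-column `input_string` (the real one has columns A to J):

```python
import io
import pandas as pd

# Hypothetical CSV text with a space after every comma, like the pasted data.
input_string = "A, B, C\n1, 2, 3\n4, 5, 6\n"

# Default parsing keeps the leading space in the header names.
raw = pd.read_csv(io.StringIO(input_string))
print(list(raw.columns))

# Option 1: let the parser skip whitespace that follows a delimiter.
clean = pd.read_csv(io.StringIO(input_string), skipinitialspace=True)
print(list(clean.columns))

# Option 2: strip the column names after loading.
stripped = raw.rename(columns=str.strip)
print(list(stripped.columns))
```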

lapis sequoia Sep 20, 2020, 9:19 AM

#

it seems to be good, but in this example there seems to be always one element in the third grouping, so let me check with the full data

#

Anyway, this may help me get it, it was sth that I was trying to do (the multi-grouping), but couldn't make it work, now I know the correct syntax, thanks!

paper niche Sep 20, 2020, 9:23 AM

#

yeah, it may be a bit confusing. I suggest figuring out:

what .to_dict('records') does
what .groupby(...).apply(lambda x: pd.Series(...)).reset_index() gives you
with a small example (say, a df with 4 columns 'A'-'D' with random integers 0-2) filled in.

#

I'll just add on: the general approach here is to start with the innermost "json", and work your way outwards. Track what happens after the first set of groupby,apply,reset_index, then you should be able to reason what the other 2 sets are doing. Good luck!
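
A scaled-down version of that exercise, as a sketch: two levels of nesting instead of three, on a tiny made-up frame with columns A, B, C. The key names `inner` and `outer` are placeholders. On some pandas 2.x releases this `apply` pattern may emit a deprecation warning about grouping columns; the result is unaffected.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["x", "x", "x", "y"],
    "B": ["p", "p", "q", "p"],
    "C": [1, 2, 3, 4],
})

# Innermost level first: one row per (A, B), holding a list of {"C": ...} dicts.
inner = (
    df.groupby(["A", "B"])
      .apply(lambda g: pd.Series({"inner": g[["C"]].to_dict("records")}))
      .reset_index()
)
print(inner)

# Then work outwards: one row per A, holding a list of {"B": ..., "inner": [...]} dicts.
outer = (
    inner.groupby(["A"])
         .apply(lambda g: pd.Series({"outer": g[["B", "inner"]].to_dict("records")}))
         .reset_index()
)

result = outer.to_dict("records")
print(result)
```

Printing `inner` before the second step makes it easier to see that each `groupby/apply/reset_index` round collapses one level of columns into a list of dicts.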

lapis sequoia Sep 20, 2020, 9:30 AM

#

okay, it seems to be okay, there is a couple of things that I need to tweak, but I think that I can make it work

#

Thank you very much @paper niche !

paper niche Sep 20, 2020, 9:31 AM

#

sure, happy to help!

lapis sequoia Sep 20, 2020, 10:04 AM

#

@paper niche ME STATING THAT THE RESULT IS NOT THE ONE I WAS LOOKING FOR WITHOUT REVIEWING THE COMPLETE RESULT

#

actually, no, I'm dumb, it's working perfectly!

hearty token Sep 20, 2020, 12:05 PM

#

how do i render javascript for beautifulsoup web scraping?

velvet thorn Sep 20, 2020, 12:18 PM

#

how do i render javascript for beautifulsoup web scraping?
@hearty token you can't

#

use Selenium

hearty token Sep 20, 2020, 12:18 PM

#

oh

#

alright

patent bough Sep 20, 2020, 3:18 PM

#

https://www.kaggle.com/rahulrajpandey31/introduction-to-matplotlib-and-line-plots

Introduction to Matplotlib and Line Plots

Explore and run machine learning code with Kaggle Notebooks | Using data from Immigration to Canada IBM Dataset

serene scaffold Sep 20, 2020, 3:51 PM

#

    for b in (TRUE, FALSE):  # TRUE and FALSE are classes
        is_class = np.tile((matrix[:, 10] == b).transpose(), (11, 1)).transpose()
        is_nan = np.isnan(matrix)
        for i in range(10):
            matrix[(is_class & is_nan), :, i] = np.nanmean(matrix[is_class, :, i], axis=0)

    matrix[(is_class & is_nan), :, i] = np.nanmean(matrix[is_class, :, i], axis=0)
IndexError: too many indices for array: array is 2-dimensional, but 4 were indexed

>>> matrix.shape
<class 'tuple'>: (8795, 11)
>>> matrix[is_class].shape
<class 'tuple'>: (10296,)

#

matrix[is_class, :, i] is meant to pull all the values from a column where the row has the correct properties

#

I guess it doesn't work like that

#

I think I'll try using just pandas

#

for this part

dense copper Sep 20, 2020, 7:01 PM

#

hello, is it possible to somehow use a generator to reduce the mem usage of a for loop that doesn't use just simple numbers? I'm working with a large pandas dataframe, and given a list of strings (about 150 of them, each being an indicator), I'm running into the following when I run memory_profiler on the function:

1966 1023.211 MiB  187.102 MiB       td_group = data.sort_values(by=['calendardate']).groupby(['ticker', 'dimension'])
1967 1057.488 MiB   34.277 MiB       mrq_t_group = mrq.sort_values(by=['calendardate']).groupby('ticker')
1968 1092.551 MiB   35.062 MiB       mry_t_group = mry.sort_values(by=['calendardate']).groupby('ticker')
1969 1207.020 MiB  114.469 MiB       mrt_t_group = mrt.sort_values(by=['calendardate']).groupby('ticker')
1970                             
1971 2256.359 MiB    0.000 MiB       for indicator in indicators:

It looks like that for loop is basically doubling the memory consumption. I'm trying to reduce this as much as possible

merry fern Sep 20, 2020, 7:01 PM

#

I need to add a column ('Type') to a dataframe based on the starting string in another column, whats the best way to do that?

rustic apex Sep 20, 2020, 9:52 PM

#

Data.gov and datasetsearch.research.google.com
Are those good sites to use for Numpy/Pandas?

weary ravine Sep 21, 2020, 12:27 AM

#

Guys how could i get my computer audio?? (obs: its not the microphone)
like in real time

velvet thorn Sep 21, 2020, 12:28 AM

#

I need to add a column ('Type') to a dataframe based on the starting string in another column, whats the best way to do that?
@merry fern show example data

#

matrix[is_class, :, i] is meant to pull all the values from a column where the row has the correct properties
@serene scaffold isn't your data 2D

#

I feel like you want to move your nan calculation into the loop

#

and then just have is_class = matrix[:, 10] == b
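
Putting those two suggestions together, a possible sketch of the whole loop. The function name and the toy matrix are made up, and `np.unique` stands in for the `TRUE`/`FALSE` constants, which assumes the label column itself has no NaNs:

```python
import numpy as np

def impute_by_class(matrix, label_col):
    """Replace NaNs in each feature column with that column's mean within the same class."""
    out = matrix.copy()
    for b in np.unique(out[:, label_col]):
        is_class = out[:, label_col] == b      # 1-D row mask for this class
        for i in range(out.shape[1]):
            if i == label_col:
                continue
            is_nan = np.isnan(out[:, i])       # NaN check moved inside the loop, per column
            target = is_class & is_nan
            known = is_class & ~is_nan
            if target.any() and known.any():
                out[target, i] = np.mean(out[known, i])
    return out

data = np.array([
    [1.0,    np.nan, 0.0],
    [3.0,    4.0,    0.0],
    [np.nan, 6.0,    0.0],
    [10.0,   np.nan, 1.0],
    [np.nan, 8.0,    1.0],
])
print(impute_by_class(data, label_col=2))
```

Both masks are one-dimensional, so `out[target, i]` stays a plain row-mask-plus-column index and the "too many indices" error does not come up.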

restive scroll Sep 21, 2020, 12:48 AM

#

I don't understand why the val_acc stays at 0.5. Anyone encountered this problem before?

📎 Capture.PNG

#

I am doing Cats vs. Dogs using augmentation with Keras

velvet thorn Sep 21, 2020, 12:56 AM

#

I don't understand why the val_acc stays at 0.5. Anyone encountered this problem before?
@restive scroll model isn't learning, most likely

restive scroll Sep 21, 2020, 12:59 AM

#

The learning rate is set to 0.001, could this be the problem?

hasty grail Sep 21, 2020, 1:09 AM

#

You probably forgot to rescale your data

#

Did you divide the pixel values by 255?

restive scroll Sep 21, 2020, 1:12 AM

#

yes you're right on point! I forgot the /😓

📎 Capture.PNG

#

@hasty grail Thank you

hasty grail Sep 21, 2020, 1:12 AM

#

np
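
For anyone hitting the same flat 0.5 accuracy, the rescaling step amounts to the following. This is a NumPy-only sketch on random stand-in images (the batch shape is an assumption); with Keras generators the equivalent is passing `rescale=1./255` to `ImageDataGenerator`.

```python
import numpy as np

# Stand-in for a batch of 8-bit RGB images (4 images of 150x150 pixels).
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 150, 150, 3), dtype=np.uint8)

# Convert to float first, then map pixel values from [0, 255] to [0, 1].
scaled = batch.astype(np.float32) / 255.0

print(batch.dtype, batch.min(), batch.max())
print(scaled.dtype, scaled.min(), scaled.max())
```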

rustic apex Sep 21, 2020, 1:52 AM

#

Which do you prefer using .py or csv? Why not just use a py file? For Numpy and Pandas

velvet thorn Sep 21, 2020, 2:00 AM

#

Which do you prefer using .py or csv? Why not just use a py file? For Numpy and Pandas
@rustic apex like...to store data?

rustic apex Sep 21, 2020, 2:06 AM

#

@velvet thorn I’m starting to learn Numpy and Pandas, is one file better than the other? It seems quicker to use a .py than a jupyterlab

lapis sequoia Sep 21, 2020, 5:02 AM

#

https://demonstrations.wolfram.com/TrinomialTreeOptionPricingMethod/img/popup_3.png

#

Any Idea how to map something like this in python

brittle agate Sep 21, 2020, 5:57 AM

#

📎 fuck.png

mild topaz Sep 21, 2020, 8:00 AM

#

```
Traceback (most recent call last):

  File "E:\paymentz\template.py", line 20, in <module>
    res = cv2.matchTemplate(gray_img, template, cv2.TM_CCOEFF_NORMED)

error: OpenCV(4.2.0) C:\projects\opencv-python\opencv\modules\imgproc\src\templmatch.cpp:1109: error: (-215:Assertion failed) _img.size().height <= _templ.size().height && _img.size().width <= _templ.size().width in function 'cv::matchTemplate'
```

#

@brittle agate sorry to ping u , can u look into this issue?

hasty grail Sep 21, 2020, 8:19 AM

#

The error basically says that your input image cannot be larger than the input template

mild topaz Sep 21, 2020, 8:23 AM

#

okay means template image should be greater than input image ? @hasty grail

hasty grail Sep 21, 2020, 8:23 AM

#

yeah

mild topaz Sep 21, 2020, 8:24 AM

#

see this: img.size: 151242, template.size: 50286

hasty grail Sep 21, 2020, 8:26 AM

#

more specifically the dimensions of your image have to be smaller than that of the template

#

so if you have a template that's 720x480px, your image has to be no more than 720px in width and no more than 480px in height

lapis sequoia Sep 21, 2020, 9:38 AM

#

people, how can I debug the groupby method of pandas dataframe? It's raising a KeyError exception (although the key is there), and only fails with the data that belongs to one user, but not with the data of another one

#

Do you know if the groupby applies a dropna that could drop a column entirely?

winter portal Sep 21, 2020, 9:54 AM

#

can someone please help me with writing concurrently to a database with sqlite3?

#

pls ping me when u help

velvet thorn Sep 21, 2020, 10:36 AM

#

people, how can I debug the groupby method of pandas dataframe? It's raising a KeyError exception (although the key is there), and only fails with the data that belongs to one user, but not with the data of another one
@lapis sequoia provide code and sample data

lapis sequoia Sep 21, 2020, 10:45 AM

#

I think I found the issue: the values in one column are NaN, so they are dropped entirely when the groupby is performed

#

So what I did was put dropna=False in the groupby and also fill the column with values where the NaNs are

#

.fillna("", inplace=True)
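
A small reproduction of that behaviour, as a sketch on made-up data (the column names are placeholders). By default `groupby` drops every row whose grouping key is NaN; `dropna=False` keeps them, and as far as I know that parameter needs pandas 1.1 or newer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "a", "b"],
    "note": ["x", np.nan, np.nan],   # grouping key with missing values
    "score": [1, 2, 3],
})

# Default: rows with a NaN key vanish from the result.
default = df.groupby(["user", "note"])["score"].sum()
print(len(default))

# Keep NaN as its own group.
kept = df.groupby(["user", "note"], dropna=False)["score"].sum()
print(len(kept))

# Or fill the key column first, as in the message above.
filled = df.fillna({"note": ""}).groupby(["user", "note"])["score"].sum()
print(len(filled))
```

Either fix alone should be enough; doing both, as described above, is harmless.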

mild topaz Sep 21, 2020, 11:39 AM

#

@hasty grail how i can get the image pixel

#

or image width and height

brittle agate Sep 21, 2020, 12:02 PM

#

@mild topaz
Well, let's see what is wrong.

#

You are using a template that is larger than the original image.

icy stirrup Sep 21, 2020, 1:07 PM

#

hi

#

im using asyncpg here

#

here is the syntax correct?

eager heath Sep 21, 2020, 1:51 PM

#

Beware of little bobby tables https://xkcd.com/327/

xkcd: Exploits of a Mom

#

For real though, you shouldn’t use an F-string in an SQL query
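
To make the alternative concrete, here is a sketch using the standard-library `sqlite3` module, chosen only because it runs without a database server. The table and the input value are made up. asyncpg follows the same idea but uses `$1`, `$2`, ... placeholders, with the values passed as extra arguments after the query string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT)")

# Hostile input in the spirit of the comic.
name = "Robert'); DROP TABLE students;--"

# Placeholder binding: the value travels separately from the SQL text,
# so it is stored as data and never parsed as SQL.
conn.execute("INSERT INTO students (name) VALUES (?)", (name,))

rows = conn.execute("SELECT name FROM students").fetchall()
print(rows)
conn.close()
```

With an f-string the quote inside `name` would end the string literal early and the rest would be interpreted as SQL.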

mild topaz Sep 21, 2020, 2:35 PM

#

loc: (array([], dtype=int64), array([], dtype=int64)) i am getting empty array @brittle agate

earnest forge Sep 21, 2020, 2:46 PM

#

oh

#

you certainly should use left_on=, right_on= and on=, how=

#

they might help you in what you want to achieve

hasty grail Sep 21, 2020, 2:48 PM

#

@mild topaz np.ndarray.shape
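
A quick sketch of what that looks like. Note that `.size` (the numbers posted earlier) is the total element count, while `.shape` gives `(height, width)` for a grayscale array. The arrays below are zero-filled stand-ins with made-up dimensions, not the real images. As far as I know, the OpenCV docs ask for the template to be no larger than the source image, and the assertion in the traceback seems to fire when neither array fits inside the other.

```python
import numpy as np

def template_fits(image, template):
    """True when the template is no taller and no wider than the image."""
    img_h, img_w = image.shape[:2]
    tpl_h, tpl_w = template.shape[:2]
    return tpl_h <= img_h and tpl_w <= img_w

# Stand-ins for arrays loaded with cv2.imread(..., cv2.IMREAD_GRAYSCALE).
gray_img = np.zeros((300, 400), dtype=np.uint8)
template = np.zeros((100, 450), dtype=np.uint8)   # fewer pixels overall, but wider

print(gray_img.shape, gray_img.size)
print(template.shape, template.size)
print(template_fits(gray_img, template))
```

Here the template has fewer pixels than the image but is wider, which is the kind of mismatch that comparing `.size` alone would hide.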

lime saddle Sep 21, 2020, 3:26 PM

#

Hello

#

I'm looking at the Intel Image Classification dataset on kaggle and I have an error which I don't know why and how to fix

#

I used ImageDataGenerator for gathering, rescaling and labeling the images

#


```python
train_datagen = ImageDataGenerator(rescale = 1. /255)
train_generator = train_datagen.flow_from_directory(train_dir,
                                                   target_size=(300,300),
                                                   batch_size=128)
test_datagen = ImageDataGenerator(rescale = 1. /255)
test_generator = test_datagen.flow_from_directory(test_dir,
                                                   target_size=(300,300),
                                                   batch_size=128)
```

#

I have a small CNN model

#

```python
model = Sequential()  # assumed; the first line of the paste appears to be cut off
model.add(Conv2D(32, (3, 3), activation = 'relu', input_shape = (150, 150, 3)))
model.add(MaxPool2D(2,2))
model.add(Conv2D(32, (3, 3), activation = 'relu'))
model.add(MaxPool2D(2,2))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(6, activation='softmax'))
model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics=['accuracy'])
```

#

and then I call the fit method

#

```python
                   steps_per_epoch=10,
                   epochs=15)
```

#

I get the following error:

```
     [[node sequential_3/flatten_2/Reshape (defined at <ipython-input-14-3c76c98050a9>:4) ]] [Op:__inference_train_function_994]

Function call stack:
train_function
```

#

How can I fix it?

brittle agate Sep 21, 2020, 3:42 PM

#

@lime saddle
Try to set up in Conv2D layer padding='valid' or 'same'.

#

Maybe that can fix this problem.

lime saddle Sep 21, 2020, 3:43 PM

#

Ok, I'll try

#

I now have another error:

#

```
     [[node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at <ipython-input-18-44f659394c13>:3) ]] [Op:__inference_train_function_1783]
```

#

Ok, so I noticed something wrong

#

Here:

```python
                                                   target_size=(300,300),
                                                   batch_size=128)
```

I set the target_size to 300,300 but in the first layer of the convolution I have the input_shape to 150 150

#

I changed the target size to be the same as the input_shape but same error

#

Okkkkk

#

So I changed the loss from sparse_categorical_crossentropy to categorical_crossentropy

#

When I compile my model

#

Seems to work now

#

Maybe it could be helpful for someone
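
A likely reason the loss swap helped: `flow_from_directory` defaults to `class_mode="categorical"`, which yields one-hot label rows, while `sparse_categorical_crossentropy` expects plain integer class ids. The NumPy sketch below only illustrates the two label layouts (the label values are made up). Passing `class_mode="sparse"` to `flow_from_directory` should be the other way to make the original loss work.

```python
import numpy as np

num_classes = 6

# Integer class ids, shape (batch,): what sparse_categorical_crossentropy expects.
int_labels = np.array([0, 3, 5, 1])

# One-hot rows, shape (batch, num_classes): what class_mode="categorical" produces
# and what categorical_crossentropy expects.
one_hot = np.eye(num_classes, dtype=np.float32)[int_labels]

# Going back from one-hot rows to integer ids.
recovered = one_hot.argmax(axis=1)

print(int_labels.shape, one_hot.shape)
print(recovered)
```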

lapis sequoia Sep 21, 2020, 4:18 PM

#

Hi there! I'm wondering if someone has an idea on how to retrieve a value from a probability distribution function (pdf), I have started to fit my data to a pdf and now I want to use the pdf to retrieve values for some x-values

📎 unknown.png

terse cargo Sep 21, 2020, 4:35 PM

#

I don't really know anything about data science, machine learning, or statistical analysis and I'm hoping this might be the right place to get ideas or links to relevant articles but I have a game server I host and the players are interested in seeing their game time. I've got log files for about two years and I wrote a small program that goes through them and looks for connect/disconnect events and put it into a sqlite db. I've got a timestamp, game account id (only for connecting), connection state boolean (1 connect, 0 disconnect), and the ip.

The way the game works it has a lot of different states connecting/disconnecting is wonky so you might see a connection event then two disconnects or none or vice-versa. Is there anything cool (I like learning new technologies) I can do with this information to plot online activity for the users?

winged steppe Sep 21, 2020, 5:08 PM

#

Hey guys! i just posted a question on stack overflow, maybe some of you know the answer? https://stackoverflow.com/questions/63996926/how-to-delete-empty-duplicates-in-a-dataframe-in-a-smart-way

Stack Overflow

How to delete empty duplicates in a dataframe in a smart way?

I have a large dataset. It has some data missing. Dataset contains variables of types string (for columns such as name) and float (for columns such as height). Some rows in this dataset are just

shut robin Sep 21, 2020, 5:15 PM

#

Heyho people of data science, does anyone here know their way around pytorch?

lapis sequoia Sep 21, 2020, 5:16 PM

#

@shut robin For some projects Tensorflow might be better

#

I have found the Keras library to make more sense to me

shut robin Sep 21, 2020, 5:17 PM

#

Well... the goal is to create a DeepL Ai that predicts a simple line

lapis sequoia Sep 21, 2020, 5:17 PM

#

Either should work for that

shut robin Sep 21, 2020, 5:17 PM

#

so Tensorflow has a lower entry barrier then?

lapis sequoia Sep 21, 2020, 5:18 PM

#

No AI has an easy barrier to entry

#

the only reason I understand AI at all is because i literally think of everything in terms of Analog synthesizers

shut robin Sep 21, 2020, 5:18 PM

#

I am a student rn so I am not unfamiliar but some concepts are still out there for me

lapis sequoia Sep 21, 2020, 5:18 PM

#

with each node being an input output into another synth feature

#

Its very much magic box for most people who use it

shut robin Sep 21, 2020, 5:18 PM

#

like a jacobi matrix which I only kinda understand but hey ;D

lapis sequoia Sep 21, 2020, 5:19 PM

#

the math behind it is fairly complex

shut robin Sep 21, 2020, 5:19 PM

#

I buy that for a penny

lapis sequoia Sep 21, 2020, 5:19 PM

#

Watch some videos about what tensors are

#

and vectors of course. If you can understand that, then things will make more sense

shut robin Sep 21, 2020, 5:19 PM

#

lol I am not a highschool student 😄

lapis sequoia Sep 21, 2020, 5:19 PM

#

all tensorflow and most deeplearning is just a hyper complicated geometry problem using tensors

shut robin Sep 21, 2020, 5:20 PM

#

okay Ill try my best ^^

lapis sequoia Sep 21, 2020, 5:20 PM

#

Well I dropped out so I don't know what people learn lol

#

I learned everything on my own

shut robin Sep 21, 2020, 5:21 PM

#

you wouldnt happen to have any recommendations for learning the more complex mathematical models would you?

lapis sequoia Sep 21, 2020, 5:21 PM

#

Well it depends on which one specifically

#

There should be an assortment of videos that walk you through the Keras models

#

which is what you would be using within TF

shut robin Sep 21, 2020, 5:23 PM

#

I mean I get what you mean but seeing these things I just kinda stare at it a lot 😄

📎 d49d7e594aa9b50b5bb740cad39ca228.png

lapis sequoia Sep 21, 2020, 5:23 PM

#

Yeah learning what tensors are will help

shut robin Sep 21, 2020, 5:23 PM

#

fair enough

lapis sequoia Sep 21, 2020, 5:23 PM

#

It all starts with learning about vectors

#

So a vector would be on the x,y only for instance

#

then you learn about vectors in x,y,z

shut robin Sep 21, 2020, 5:24 PM

#

yeah I know what vectors are

#

the normal planes and all I got that part down np

lapis sequoia Sep 21, 2020, 5:24 PM

#

and a tensor is basically a vector, but its the set of all vectors for that object, so that no matter how you move the axis everything is the same

shut robin Sep 21, 2020, 5:24 PM

#

huh okay yeah I gotta read into that

lapis sequoia Sep 21, 2020, 5:25 PM

#

Basically imagine a cube, its every vector for the cube

shut robin Sep 21, 2020, 5:25 PM

#

okay

grave frost Sep 21, 2020, 5:25 PM

#

Does anyone know a way to represent a UNIQUE string (not a category) in numerical format to be used in a pandas DataFrame?

lapis sequoia Sep 21, 2020, 5:26 PM

#

@shut robin https://www.youtube.com/watch?v=f5liqUk0ZTw&ab_channel=DanFleisch

YouTube

Dan Fleisch

What's a Tensor?

Dan Fleisch briefly explains some vector and tensor concepts from A Student's Guide to Vectors and Tensors

▶ Play video

#

this is the video that got me started down the rabbit hole

#

Then you can watch more complicated ones. once you have a good grasp of Tensors, you can start to learn about all the Keras modules, and what the different type of structures are like, RNN, CNN, Etc

#

You will use different structures for different things. Like if you are looking at a time series (Which is my interest in TF) you will need to learn about LSTM's and RNN's

#

@grave frost Can you provide more context in what you are trying to do?

serene scaffold Sep 21, 2020, 5:28 PM

#

I'm trying to understand how I can use pandas for conditional mean imputation. I know I can use DataFrame.groupby, but I'm not sure how to replace the nans in each column with the nanmeans for those columns.

lapis sequoia Sep 21, 2020, 5:28 PM

#

@serene scaffold can you make a new df, in which you drop everything with a nan?

serene scaffold Sep 21, 2020, 5:29 PM

#

the mean method in certain pandas objects automatically ignores nans

lapis sequoia Sep 21, 2020, 5:29 PM

#

You could do a try statement inside of a forloop, with the except having a nan

grave frost Sep 21, 2020, 5:29 PM

#

@lapis sequoia I have an arbitrary string that I want to somehow convert it into a form that I can make an ML model for. Since data has to be numerical, I have no idea on how will I encode a string into something else...

serene scaffold Sep 21, 2020, 5:29 PM

#

my goal is to use pandas built-in functionality so that I don't have to write for loops in Python
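
One loop-free way to do this, as a sketch on a made-up frame (the column names `x`, `y` and `label` are placeholders): `groupby(...).transform("mean")` returns a frame shaped like the original with each row holding its group's mean, and `fillna` then takes values from it only where the original is NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan, 10.0],
    "y": [np.nan, 2.0, 4.0, 8.0, 6.0],
    "label": [0, 0, 0, 1, 1],
})
features = ["x", "y"]

# Per-class column means, broadcast back to the original row order.
# The mean skips NaNs by default, so it behaves like nanmean per group.
group_means = df.groupby("label")[features].transform("mean")

# Fill only the missing cells with the mean of the matching class.
df[features] = df[features].fillna(group_means)
print(df)
```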

#

also I thought I was on another server? oops.

lapis sequoia Sep 21, 2020, 5:30 PM

#

@grave frost What ML Library are you using?

grave frost Sep 21, 2020, 5:31 PM

#

@lapis sequoia TensorFlow and Keras

lapis sequoia Sep 21, 2020, 5:32 PM

#

did tf.strings.to_number not work?

shut robin Sep 21, 2020, 5:32 PM

#

@lapis sequoia haha same I was watching it rn 😄

grave frost Sep 21, 2020, 5:34 PM

#

@lapis sequoia You misunderstand. The string contains both number and alphabets (e.g:- test123)

#

Though I think I have an idea to convert it into int, but it would require a custom encoder...

lapis sequoia Sep 21, 2020, 5:34 PM

#

Whats the data?

grave frost Sep 21, 2020, 5:34 PM

#

Alphanumeric

lapis sequoia Sep 21, 2020, 5:35 PM

#

I think your solution will have to be dependent on the data. Right but dates can be alphanumeric like 23 January 2020

#

But solving that is easy

#

where if its performance data for a serialized component, its more challenging

hard pelican Sep 21, 2020, 5:35 PM

#

So I've got a sort of baseline understanding of the function of RNNs and different layer types, but I'm not clear on how exactly I can best combine them and for what tasks - does anyone have reading they'd suggest on the subject?

grave frost Sep 21, 2020, 5:36 PM

#

@lapis sequoia So how do we encode such alphanumeric data?

lapis sequoia Sep 21, 2020, 5:36 PM

#

Im thinking

#

@hard pelican Look up LSTM/RNN Stocks

#

Medium has several articles and it can give you a really good context for how they work and how to modify them

#

@grave frost have you tried to_numeric yet? And if so whats the output look like?

#

or does it even let you do it?

grave frost Sep 21, 2020, 5:38 PM

#

@lapis sequoia Nope InvalidArgumentError: StringToNumberOp could not correctly convert string: hu342 [Op:StringToNumber]

#

I think I will make a custom encoder then...

hard pelican Sep 21, 2020, 5:39 PM

#

oh wow the first article i found with that term seems to have clarified something that was screwing me up

#

I was sending data in a weird format to my LSTMs

#

so obviously they did nothing

grave frost Sep 21, 2020, 5:39 PM

#

@lapis sequoia But the biggest problem is the delimiter

#

I could keep 1 digit aside as the separator but I don't want to be 1 digit short...

lapis sequoia Sep 21, 2020, 5:42 PM

#

Is your problem type a classification one?

#

@hard pelican I have done that many times in nearly every library I have ever used

hard pelican Sep 21, 2020, 5:44 PM

#

Yeah... I was banging my head against the wall

#

zero loss and zero accuracy

#

yeefuckinhaw

lapis sequoia Sep 21, 2020, 5:44 PM

#

Thats me with dash right now lol

#

Bokeh and Dash are so cool if you know what you're doing, but if you don't it's 100% Ragefuel

pine cloak Sep 21, 2020, 5:45 PM

#

if i have multiple columns with a single letter in one dataframe, and another dataframe has a weight for each letter, how do I multiply the weights?

lapis sequoia Sep 21, 2020, 5:47 PM

#

What do you mean multiply the weights?

#

@grave frost https://www.tensorflow.org/tutorials/structured_data/feature_columns
This works with a lot of strings that are alphabet. May have the snippet of code you are looking for

TensorFlow

Classify structured data with feature columns | TensorFlow Core

grave frost Sep 21, 2020, 5:48 PM

#

@lapis sequoia It's not a classification problem, but sequence2sequence 🙂

lapis sequoia Sep 21, 2020, 5:49 PM

#

@grave frost https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

pine cloak Sep 21, 2020, 5:49 PM

#

lets say one row of the first df is ['a','b','c'], another df has the value associated with each letter [a=1,b=2,c=3], how do I get it to multiply 1*2*3?

lapis sequoia Sep 21, 2020, 5:49 PM

#

I looked it up on the Keras website. This is only alphabet again

shut robin Sep 21, 2020, 5:50 PM

#

k so a tensor is basically just three vectors expressed in one matrix

lapis sequoia Sep 21, 2020, 5:50 PM

#

@pine cloak Is there any particular reason the df's are separate?

#

@shut robin It can be more than three vectors, but yeah basically

pine cloak Sep 21, 2020, 5:50 PM

#

that's the way it was provided

shut robin Sep 21, 2020, 5:50 PM

#

yeah I meant a tensor to the third power

lapis sequoia Sep 21, 2020, 5:50 PM

#

Yeah

#

@pine cloak So Df1 = A,B,C within the rows of column one or in the index column?

#

& DF2 = the weights? or the weights with labels?

pine cloak Sep 21, 2020, 5:53 PM

#

  char0 char1 char2
0 a     b     c
1 d     b     a
2 c     f     c

#

that's df1

#

df2 is

#

char prob
a 0.123
b 0.375
c 0.009

#

is df2

lapis sequoia Sep 21, 2020, 5:54 PM

#

ok I'll enter that, give me a minute

pine cloak Sep 21, 2020, 5:56 PM

#

would it be something like df1['product'] = np.prod([df2[prob] for char in df1])?

lapis sequoia Sep 21, 2020, 5:56 PM

#

well d & f dont have numbers

#

weights

pine cloak Sep 21, 2020, 5:57 PM

#

i didnt type it out

#

assume every letter has a weight

lapis sequoia Sep 21, 2020, 5:57 PM

#

ok,

#

So you need to calculate the probability of each row?

pine cloak Sep 21, 2020, 5:58 PM

#

basically

solar phoenix Sep 21, 2020, 5:59 PM

#

hi all- i am struggling with something on pandas, perhaps you can help. I have a column of my dataframe which is 1,2 or 3. If it is "1" i want to apply function 1, if it is 2 i want to apply function 2, and if it is 3, i want to apply function 3. But one of the inputs for the function is also based on the same row of the dataframe

#

if that makes any sense

lapis sequoia Sep 21, 2020, 6:00 PM

#

@pine cloak does df2 have two columns? one for the letter and one for the probabilities?

pine cloak Sep 21, 2020, 6:00 PM

#

the char is the index

lapis sequoia Sep 21, 2020, 6:02 PM

#

Ok gotcha

solar phoenix Sep 21, 2020, 6:03 PM

#

an example of mine:

df:
"a" "b"
1 5 1
2 5 1
3 5 2
4 5 3

def func1(input):
    output = input*10
    return output

def func2(input):
    output = input+10
    return output

def func3(input):
    output = input+1
    return output

desired output:

df:
"a" "b" "c"
1 5 1 50
2 5 1 50
3 5 2 15
4 5 3 6
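
One possible way to get that output is a dictionary that maps each value of column "b" to its function, applied row by row. This is a sketch built from the example above; the `FUNCS` name and the `value` parameter are assumptions for illustration.

```python
import pandas as pd


def func1(value):
    return value * 10


def func2(value):
    return value + 10


def func3(value):
    return value + 1


# Hypothetical dispatch table: the value in column "b" selects the function.
FUNCS = {1: func1, 2: func2, 3: func3}

df = pd.DataFrame({"a": [5, 5, 5, 5], "b": [1, 1, 2, 3]}, index=[1, 2, 3, 4])

# axis=1 hands each row to the lambda, so both "a" and "b" are available.
df["c"] = df.apply(lambda row: FUNCS[row["b"]](row["a"]), axis=1)
print(df)
```

A value in "b" that is missing from `FUNCS` raises a `KeyError`, which is usually preferable to silently skipping the row.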

pine cloak Sep 21, 2020, 6:06 PM

#

what i did gave me the same value for every row

#

u still there?

lapis sequoia Sep 21, 2020, 6:09 PM

#

I am

#

Im working on it

#

trying to figure out the best way to do it

pine cloak Sep 21, 2020, 6:10 PM

#

ok thanks

lapis sequoia Sep 21, 2020, 6:13 PM

#

import numpy as np
import pandas as pd

data1 = {'Char1':  ['A', 'D',"C"],"Char2":["B","D","F"],"Char3":["C","A","C"]}
data2 = {"Index":["A","B","C","D","F"],'Probabilities':  ['.2', '.4',".5",".3",".7"]}

df = pd.DataFrame (data1, columns = ['Char1',"Char2","Char3"])
df2 = pd.DataFrame (data2, columns = ["Index", 'Probabilities'])
df2 = df2.set_index("Index")

for n in df.index:
    Char1Probability = float(df2.loc[df["Char1"][n]]["Probabilities"])
    Char2Probability = float(df2.loc[df["Char2"][n]]["Probabilities"])
    Char3Probability = float(df2.loc[df["Char3"][n]]["Probabilities"])
    RowNProbability = Char1Probability * Char2Probability * Char3Probability
    print("Row Number: "+str(n))
    print(RowNProbability)
    print("----------------")

#

A forloop makes it pretty simple

#

you just have to use .loc to use the value in one df to select the value in the other

#

and by using the for loop you don't have to manually select each location each time
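
For comparison, the same lookup can be done without an explicit loop by mapping every letter to its probability and taking the product across each row. This is a sketch, not the method used above: the probabilities for `d` and `f` are made up because they were not posted, and `row_products` is a hypothetical helper name.

```python
import pandas as pd


def row_products(letters: pd.DataFrame, prob: pd.Series) -> pd.Series:
    """Replace each letter by its probability, then multiply across each row."""
    looked_up = letters.apply(lambda col: col.map(prob))
    # skipna=False makes a letter with no probability show up as NaN
    # instead of being silently left out of the product.
    return looked_up.prod(axis=1, skipna=False)


df1 = pd.DataFrame({
    "char0": ["a", "d", "c"],
    "char1": ["b", "b", "f"],
    "char2": ["c", "a", "c"],
})
# Indexed by letter, like df2 in the chat; the values for d and f are assumed.
prob = pd.Series({"a": 0.123, "b": 0.375, "c": 0.009, "d": 0.3, "f": 0.7})

df1["product"] = row_products(df1[["char0", "char1", "char2"]], prob)
print(df1)
```

With about 1000 rows either approach is fast enough; the mapped version mainly saves writing one lookup line per column.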

pine cloak Sep 21, 2020, 6:15 PM

#

my df1 is larger than what I posted (about 1000 rows), should still work right?

lapis sequoia Sep 21, 2020, 6:15 PM

#

yes

#

and if its super slow you can do it by slices

#

so you would make the for loop say

for n in df.index[0:10]:

pine cloak Sep 21, 2020, 6:16 PM

#

gotcha, thanks so much

lapis sequoia Sep 21, 2020, 6:17 PM

#

and you can make a new empty DF and use .append to add all the output values to that df so you dont have to read them all from the print statements

pine cloak Sep 21, 2020, 6:17 PM

#

gotcha

lapis sequoia Sep 21, 2020, 6:21 PM

#

Good Luck!

jaunty scroll Sep 21, 2020, 6:22 PM

#

hello would a question about an XML parser be appropriate here?

#

I am trying to parse an XML document that has six sets of parent-child relationships. I can get the parser to run through the document but it doesn't respect the structure of it and just returns the values as if they were all at the same level. This in turn causes problems with the schema validation. Been working on this for a while and not sure where to turn

safe sparrow Sep 21, 2020, 7:09 PM

#

Im looking for a way to multiply all nodes in a tensorflow layer

#

so essentially concatenate layers, but multiplying them

#

But only multiply the nodes if the nodes are non zero

merry fern Sep 21, 2020, 7:14 PM

#

@merry fern show example data
@velvet thorn
I need to add a column ('Type') to a dataframe based on the starting string in another column, whats the best way to do that?

I want to insert a column ('Type') and fill values ('A' 'B' or 'C') depending on the values in 2 other columns
A = if value for column 'Name' does not start with 'RP' 'RV' or 'Buy' or 'Sell' AND 'Price' is not '0.00'
B = if value for column 'Name' does start with 'RP' or 'RV' AND 'Price' is 0.00
C = if value for column 'Name' does start with "Buy" or "Sell"

modern hatch Sep 21, 2020, 7:31 PM

#

write a function that takes in a row of the dataframe and applies that logic

#

def mapper(row):
    # do your conditional checks
    return label

#

then you can just df["Type"] = df.apply(lambda row: mapper(row), axis=1)

merry fern Sep 21, 2020, 7:33 PM

#

i will try thx

modern hatch Sep 21, 2020, 7:34 PM

#

no problem, good luck

jaunty scroll Sep 21, 2020, 7:50 PM

#

If I want to access an element in an XML whose attributes are child elements, what is the best/most efficient way to do so? Having a hard time getting parser to recognize document structure. I have tried to create a new element object and set it equal to the desired element, but it only picks up on elements that have no children.
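
As a rough illustration of keeping the parent-child structure while walking a document, the standard-library `xml.etree.ElementTree` module lets you recurse over each element's children and track the depth. The sample document and the `walk` helper below are hypothetical, since the real XML and parser were not posted.

```python
import xml.etree.ElementTree as ET

# Made-up document with nested parent-child levels.
SAMPLE = """
<library>
  <shelf id="1">
    <book>
      <title>Dune</title>
    </book>
  </shelf>
</library>
"""


def walk(element, depth=0):
    """Yield (depth, tag, text) for an element and all of its descendants."""
    text = (element.text or "").strip()
    yield depth, element.tag, text
    for child in element:  # iterating an Element gives its direct children only
        yield from walk(child, depth + 1)


root = ET.fromstring(SAMPLE)
rows = list(walk(root))
for depth, tag, text in rows:
    print("  " * depth + tag, text)

# A nested element can also be reached directly with a path expression.
title = root.find("shelf/book/title")
print(title.text)
```

Iterating with `root.iter()` instead would visit the same elements but flatten them to one level, which may be what is causing the symptom described above.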

merry fern Sep 21, 2020, 7:55 PM

#

I have no idea what I'm doing @wide heath 🙂

#

no problem, good luck
@modern hatch
I have no idea what I'm doing 🙂

lapis sequoia Sep 21, 2020, 8:35 PM

#

Hello

#

can someone help me with an import error

#

📎 unknown.png

merry fern Sep 21, 2020, 8:36 PM

#

@lapis sequoia do you have matplotlib installed?

lapis sequoia Sep 21, 2020, 8:36 PM

#

Yessir

#

I'm just confused why it doesn't work on the ide

#

It work on cmd

#

but when it runs on ide it just gives me that error

merry fern Sep 21, 2020, 8:37 PM

#

maybe the IDE is not reading the same envt?

#

type pip show matplotlib

lapis sequoia Sep 21, 2020, 8:37 PM

#

📎 unknown.png

#

what is envt

merry fern Sep 21, 2020, 8:38 PM

#

environment, so maybe the IDE is in another environment or a virtual environment
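
A quick way to check this is to run the same few lines once from cmd and once from the IDE and compare the output. The sketch below uses only the standard library; `describe_environment` is a hypothetical helper name.

```python
import importlib.util
import sys


def describe_environment(module_name):
    """Report which interpreter is running and whether it can see a module."""
    spec = importlib.util.find_spec(module_name)
    return {
        "interpreter": sys.executable,  # path of the Python that is running
        "found": spec is not None,
        "location": spec.origin if spec is not None else None,
    }


print(describe_environment("matplotlib"))
```

If the interpreter paths differ between the two runs, the IDE is using a different Python installation than the one pip installed into.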

lapis sequoia Sep 21, 2020, 8:38 PM

#

what does that mean

#

like do I have to configure matplotlib to the ide

#

and I'm just using the basic python ide I don't want to use atom or pycharm till I need to

merry fern Sep 21, 2020, 8:41 PM

#

run help("modules") from the IDE and see if matplotlib shows up

lapis sequoia Sep 21, 2020, 8:45 PM

#

what would modules be

#

If I type it in brackets

#

oh

#

my bad

#

I'm dumb

#

👍

#

I see what you mean

#

matplot

#

it gave me this module

#

@merry fern

merry fern Sep 21, 2020, 8:48 PM

#

is that an old version ?

#

try your code w/ import matplot

lapis sequoia Sep 21, 2020, 8:48 PM

#

I don't know

#

the import works

merry fern Sep 21, 2020, 8:49 PM

#

👍

lapis sequoia Sep 21, 2020, 8:50 PM

#

But how do I get matplotlib

#

to work

#

Do I have to update matplot

#

because its an older module or

#

@merry fern can I dm quikc

#

quick*

merry fern Sep 21, 2020, 8:55 PM

#

IDK @lapis sequoia , i would pip install matplotlib

lapis sequoia Sep 21, 2020, 8:55 PM

#

Alright

merry fern Sep 21, 2020, 8:55 PM

#

<-- a newbie as well

#

no problem, good luck
@modern hatch tried this a few diff ways, I think I'm off the mark...

def admin_mapper(row):
    label = ''
    for column in df_admin[['Price', 'Name']]:
        if column['Name'].str.startswith('RP', na=False) or column['Name'].str.startswith('RV', na=False) and column['Price']==0:
            label = 'Repurchase Agreement'
        elif column['Name'].str.startswith('BUY', na=False) or column['Name'].str.startswith('SELL', na=False):
            label = 'CDS'
        elif column['Price'] != "0":
            label = 'Bond'
    return label
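
A row-wise version of the attempt above might look like the sketch below. Inside `df.apply(..., axis=1)` each `row` is a single row, so `row['Name']` is a plain string and the `.str` accessor and the loop over columns are not needed. The sample data, the case-insensitive matching, and the empty label for unmatched rows are all assumptions.

```python
import pandas as pd


def admin_mapper(row):
    """Label one row from its Name prefix and Price."""
    name = str(row["Name"]).upper()  # assumed: prefixes match regardless of case
    price = float(row["Price"])      # handles prices stored as text such as "0.00"
    if name.startswith(("RP", "RV")) and price == 0:
        return "Repurchase Agreement"
    if name.startswith(("BUY", "SELL")):
        return "CDS"
    if price != 0:
        return "Bond"
    return ""  # assumed fallback for cases the rules do not cover


# Made-up sample rows.
df_admin = pd.DataFrame({
    "Name": ["RP 123", "Buy XYZ", "ACME 5%", "RV 9"],
    "Price": ["0.00", "1.50", "99.10", "0.00"],
})
df_admin["Type"] = df_admin.apply(admin_mapper, axis=1)
print(df_admin)
```

Passing a tuple to `str.startswith` checks several prefixes at once, which avoids the `or`/`and` precedence problem in the original condition.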

hard pelican Sep 21, 2020, 10:09 PM

#

Using keras, how should I format my X array for an LSTM?

#

I'm doing regression on some market data, so I've got 5 features and want to predict one

#

I have a context value, which is at 50 for now, so it gets those 5 features 50 timesteps from the point it's supposed to predict

#

how would I put that in?

#

[[[features], [features], [features], [features]], [[features], [features], [features], [features]]]?
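
Keras recurrent layers expect a 3D array shaped (samples, timesteps, features), so the nesting sketched above is the right idea. A minimal NumPy sketch of building such windows follows; `make_windows` is a hypothetical helper, the data is random, and it assumes each window of `context` rows predicts the target at the step right after the window.

```python
import numpy as np


def make_windows(features, target, context):
    """Slice a (n, n_features) array into overlapping windows of length `context`."""
    n_samples = len(features) - context
    X = np.stack([features[i:i + context] for i in range(n_samples)])
    y = target[context:]  # the value that follows each window
    return X, y


rng = np.random.default_rng(0)
features = rng.random((60, 5))  # 60 timesteps, 5 features (made-up data)
target = rng.random(60)

X, y = make_windows(features, target, context=50)
print(X.shape, y.shape)
```

`X[k]` holds the 50 rows before timestep `k + 50`, and `y[k]` is the target at that timestep.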

hard pelican Sep 21, 2020, 10:34 PM

#

Okay, I think I'm doing this wrong again.

#

I'm putting my data in like that

#

and it's giving me this ludicrously low loss

#

which I think means it's wrong

#

(low as in scientific notation low)

#

this is a pretty simple system, I think

#

model = Sequential()
model.add(LSTM(32, input_dim=5, input_length=CONTEXT))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')

#

5 features; CONTEXT is 50, which is how many timesteps of those features are in each datapoint

#

on the first epoch it gives me this:

#

901/901 - 3s - loss: 5.4002e-04

#

my data is scaled with a minmaxscaler, could that be screwing it up?

#

i really don't know what's causing this

#

well unscaling it results in a loss of 140k

#

so

#

not that, either

#

The ridiculously low loss is inversely proportional to how much data there is

#

why

velvet thorn Sep 21, 2020, 11:03 PM

#

@velvet thorn

I want to insert a column ('Type') and fill values ('A' 'B' or 'C') depending on the values in 2 other columns
A = if value for column 'Name' does not start with 'RP' 'RV' or 'Buy' or 'Sell' AND 'Price' is not '0.00'
B = if value for column 'Name' does start with 'RP' or 'RV' AND 'Price' is 0.00
C = if value for column 'Name' does start with "Buy" or "Sell"
@merry fern
what about the other cases?

serene scaffold Sep 21, 2020, 11:16 PM

#

Still trying to figure out numpy; can I use np.where to get all the rows where the nth element is not nan?

velvet thorn Sep 21, 2020, 11:17 PM

#

Still trying to figure out numpy; can I use np.where to get all the rows where the nth element is not nan?
@serene scaffold you should be using isnan

serene scaffold Sep 21, 2020, 11:17 PM

#

well, I can use isnan to get a boolean array/matrix

#

but then I need the rows where a specific element isn't nan.

velvet thorn Sep 21, 2020, 11:18 PM

#

hm.

#

okay, let me illustrate.

#

>>> a = np.array([1, 2, np.nan, np.nan, 5, np.nan, 7])
>>> a
array([ 1.,  2., nan, nan,  5., nan,  7.])
>>> a[~np.isnan(a)]
array([1., 2., 5., 7.])

@serene scaffold do you see what I mean?

#

now then, say that was a 2D array...

#

>>> b = np.array([[1, 2], [np.nan, 4], [5, np.nan]])
>>> b
array([[ 1.,  2.],
       [nan,  4.],
       [ 5., nan]])
>>> b[~np.isnan(b[:, 0])]
array([[ 1.,  2.],
       [ 5., nan]])

#

in other words, "get me b where the first column is not nan"

serene scaffold Sep 21, 2020, 11:22 PM

#

I see

#

this illustrates what I needed to know

#

Thanks!

velvet thorn Sep 21, 2020, 11:23 PM

#

one step further; I think this might be helpful, based on what you said the other time:

>>> b[~np.isnan(b[:, 0]), 0]
array([1., 5.])

"get me the first column of b where the first column is not nan"

#

you're welcome

serene scaffold Sep 21, 2020, 11:24 PM

#

so within the subscript brackets, you get a mask expression followed by an index

#

and you can do really bizarre indexing with both colons and commas.

velvet thorn Sep 21, 2020, 11:25 PM

#

yup

#

"colon" means "everything along this axis"

#

and a bonus round...

serene scaffold Sep 21, 2020, 11:25 PM

#

bonus? lemon_hyperpleased

velvet thorn Sep 21, 2020, 11:26 PM

#

>>> a = np.zeros(shape=(12, 24, 36, 48))
>>> a[..., 1].shape
(12, 24, 36)

the ellipsis means "all axes until..."

#

so in this case it replaces a[:, :, :, 1]

#

not sure if you'll need it for the problem you're solving (I doubt it) but just so you know it exists

#

kinda like a, *b, c = [1, 2, 3, 4]

serene scaffold Sep 21, 2020, 11:26 PM

#

def manhattan_distance(x: np.ndarray, y: np.ndarray) -> float:
    # compare only the positions where both rows have a value
    not_null = ~np.isnan(x) & ~np.isnan(y)
    x, y = x[not_null], y[not_null]
    return np.mean(np.absolute(x - y))


def hot_deck_imputation(df: pd.DataFrame):
    matrix: np.ndarray = _to_numpy(df.drop(BIN_LABEL, axis=1))
    matrix_is_nan = np.isnan(matrix)
    for i in range(len(matrix)):
        is_nan = matrix_is_nan[i]

        if not np.any(is_nan):
            continue

        # np.where returns a tuple of index arrays, so take its first element
        for j in np.where(is_nan)[0]:
            options = matrix[~np.isnan(matrix[:, j])]
            matrix[i, j] = None  # this is a placeholder

#

the goal is to replace matrix[i, j] with whichever array in options returns the lowest value from manhattan_distance given mahattan_distance(matrix[i], x) for an x in options
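
A loop-based sketch of that goal, under a few assumptions: `hot_deck_impute` is a hypothetical name, distances are computed against the original values, ties go to the first donor, and a pair of rows with no shared non-NaN columns gets an infinite distance so it is never preferred.

```python
import numpy as np


def manhattan_distance(x, y):
    """Mean absolute difference over the positions where both rows have a value."""
    not_null = ~np.isnan(x) & ~np.isnan(y)
    if not np.any(not_null):
        return np.inf  # nothing to compare, so treat as infinitely far
    return float(np.mean(np.abs(x[not_null] - y[not_null])))


def hot_deck_impute(matrix):
    """Fill each NaN with the value from the nearest row that has that column."""
    result = matrix.copy()
    for i in range(len(matrix)):
        for j in np.flatnonzero(np.isnan(matrix[i])):
            donors = matrix[~np.isnan(matrix[:, j])]  # rows with a value in column j
            if len(donors) == 0:
                continue  # no row can supply this column
            distances = [manhattan_distance(matrix[i], d) for d in donors]
            result[i, j] = donors[int(np.argmin(distances))][j]
    return result


data = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 2.0, np.nan],
    [10.0, 20.0, 30.0],
])
filled = hot_deck_impute(data)
print(filled)
```

The row being filled never appears among its own donors, because it is NaN in column `j` by construction.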

#

my understanding is that you're supposed to avoid writing Python loops though

velvet thorn Sep 21, 2020, 11:28 PM

#

sometimes you do need loops

serene scaffold Sep 21, 2020, 11:28 PM

#

but maybe I'm overrating the importance of avoiding loops.

velvet thorn Sep 21, 2020, 11:28 PM

#

it depends.

#

in particular, you cannot avoid a loop where one calculation depends on another

#

which doesn't seem like the case to me

#

let me try to parse your code

#

it's a bit early

#

okay, so first, you skip all rows where every value is not null, because you don't need to perform imputation

#

fair enough

serene scaffold Sep 21, 2020, 11:30 PM

#

I skip all the rows where there are no nans. At least that's the point.

velvet thorn Sep 21, 2020, 11:30 PM

#

not null

#

missed an operative word there

#

wups

#

yup, got that part

#

I don't get the inner loop

serene scaffold Sep 21, 2020, 11:31 PM

#

Disclosure, this is for school, but we're not being graded on numpy or pandas knowledge. I just want to learn how to use them and am using this as an excuse.

velvet thorn Sep 21, 2020, 11:31 PM

#

where is the manhattan_distance call?

serene scaffold Sep 21, 2020, 11:31 PM

#

haven't implemented that.

velvet thorn Sep 21, 2020, 11:31 PM

#

oh, okay.

serene scaffold Sep 21, 2020, 11:31 PM

#

if two rows have a nan in the same column, we can't use them to calculate the manhattan distance to replace either nan

#

so the for j loop is to get all the indices of a row that have a nan

velvet thorn Sep 21, 2020, 11:33 PM

#

np.where returns values though, right

serene scaffold Sep 21, 2020, 11:33 PM

#

because then I need to calculate the manhattan distance for every other row that doesn't have a nan in that slot

merry fern Sep 21, 2020, 11:33 PM

#

@velvet thorn this is what I have so far, both methods dont work...

https://paste.pythondiscord.com/oriqizifif.py

serene scaffold Sep 21, 2020, 11:33 PM

#

does it

#

let's see

velvet thorn Sep 21, 2020, 11:33 PM

#

oh, wait, you're passing isnan to it

#

that's fine

#

the slightly more canonical way to do that is .nonzero(), but that's a minor point
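
To make the comparison concrete, here is a small sketch of what these calls return for a boolean mask. The array is made up for illustration.

```python
import numpy as np

is_nan = np.isnan(np.array([1.0, np.nan, 3.0, np.nan]))

as_where = np.where(is_nan)      # tuple with one index array per axis
as_nonzero = is_nan.nonzero()    # same result as the single-argument np.where
flat = np.flatnonzero(is_nan)    # plain 1D array of indices

print(as_where)
print(as_nonzero)
print(flat)
```

Because the first two return a tuple, looping over the result directly yields one whole index array instead of individual indices; indexing with `[0]` or using `np.flatnonzero` gives the indices themselves.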

serene scaffold Sep 21, 2020, 11:34 PM

#

so in pure Python, I could do min(manhattan_distance(matrix[i], x) for x in options)

velvet thorn Sep 21, 2020, 11:35 PM

#

options is an array

serene scaffold Sep 21, 2020, 11:35 PM

#

so is matrix[i]

velvet thorn Sep 21, 2020, 11:35 PM

#

which means

#

you can pass both to manhattan_distance directly

#

assuming they are the same size

serene scaffold Sep 21, 2020, 11:35 PM

#

no

#

matrix[i] is supposed to be one row of the whole thing

velvet thorn Sep 21, 2020, 11:36 PM

#

and they would have to be, right, along one axis

serene scaffold Sep 21, 2020, 11:36 PM

#

and options are supposed to be all the rows that pass that condition

velvet thorn Sep 21, 2020, 11:36 PM

#

yup

#

precisely

#

so what I'm saying is