#data-science-and-ml | Python | Page 234

serene scaffold Jul 15, 2020, 1:31 AM

#

but to do what?

frank bone Jul 15, 2020, 1:31 AM

#

And then reading those csv files

#

In for loop

serene scaffold Jul 15, 2020, 1:32 AM

#

I want to see if we can find a way to do what you're doing that doesn't use Python for loops, like how numpy has certain methods that do what could be done in a for loop in pure Python.

frank bone Jul 15, 2020, 1:34 AM

#

sec

#

simple 2 liner
csv_files = glob.glob(os.path.join(path, "*.csv"))
df_from_each_file = (pd.read_csv(f, usecols=[5, 6]) for f in csv_files)

#

what im doing here is concatenating multiple csv files into one pandas dataframe

#

dont think anything can be changed about this 😄

serene scaffold Jul 15, 2020, 1:38 AM

#

well, that's a lot of disk reads

frank bone Jul 15, 2020, 1:38 AM

#

alrdy using NVMe

serene scaffold Jul 15, 2020, 1:39 AM

#

idk what that is

frank bone Jul 15, 2020, 1:39 AM

#

disk

#

or what did you mean? any ways to tackle this or bring down the time?

serene scaffold Jul 15, 2020, 1:40 AM

#

I don't know how to make the process of opening all those files any faster

frank bone Jul 15, 2020, 1:40 AM

#

maybe its better to do a DB after all

#

do some major pre-processing

serene scaffold Jul 15, 2020, 1:41 AM

#

makes sense to me

frank bone Jul 15, 2020, 1:41 AM

#

does python use all cores?

#

maybe 16 core cpu would help? if it does

#

currently on 4

serene scaffold Jul 15, 2020, 1:42 AM

#

!e

import numpy as np

mat_1 = np.random.rand(5)
mat_2 = np.random.rand(5)

print(mat_1 == mat_2)

arctic wedgeBOT Jul 15, 2020, 1:42 AM

#

@serene scaffold :white_check_mark: Your eval job has completed with return code 0.

[False False False False False]

serene scaffold Jul 15, 2020, 1:43 AM

#

the idea being that for numpy arrays, __eq__ doesn't actually return True or False but tells you which elements of the two arrays are equal

#

which you could do with a for loop, but doing it using the numpy API causes those operations to happen in C

#

so I was just going to see if we could find similar approaches to optimization in the pandas API, but I don't have experience with pandas.

#

you can multiprocess with python though. Let me look up the name of that library

#

https://docs.python.org/3.8/library/multiprocessing.html

sterile zenith Jul 15, 2020, 1:46 AM

#

does anybody have experience with interpreting dates from unstructured text? Given the example:

it's going well thanks, how are you?

say, what are you doing tomorrow night at 9pm? want to grab drinks? That's July 9, 2020 at 9pm.

Best,
Name

could you pull out the dates? I tried playing around with spaCy, but it looks like that's like use a chainsaw for hammering a nail. I'm wondering if there's anything between that sort of advanced NLP and regexes

frank bone Jul 15, 2020, 1:46 AM

#

i read pandas uses numpy arrays. Thanks for link!

serene scaffold Jul 15, 2020, 1:46 AM

#

Sorry I wasn't more helpful. I should read more about pandas.

frank bone Jul 15, 2020, 1:47 AM

#

all good 😄 i definitely have some more reading to do as well

#

but i didnt expect to get this far, having started on Sunday with python/coding lol

#

gotta love the beginner learning curves, inverse exponential

pseudo sonnet Jul 15, 2020, 1:48 AM

#

@frank bone Look into multiprocessing with dask dataframes

serene scaffold Jul 15, 2020, 1:48 AM

#

huh, you don't have any prior programming experience?

pseudo sonnet Jul 15, 2020, 1:48 AM

#

or just adding multiprocessing pools where appropriate

frank bone Jul 15, 2020, 1:48 AM

#

@frank bone Look into multiprocessing with dask dataframes
@pseudo sonnet noted, thanks 🙂

serene scaffold Jul 15, 2020, 1:49 AM

#

@sterile zenith I think one of my coworkers has worked on that. I asked in our slack channel.

frank bone Jul 15, 2020, 1:49 AM

#

huh, you don't have any prior programming experience?
@serene scaffold well i did some bash script a while ago, nothing complicated tho

sterile zenith Jul 15, 2020, 1:49 AM

#

thanks 👍

mighty escarp Jul 15, 2020, 1:49 AM

#

I have this exercise. If anyone has the time and energy to spare I'd really appreciate some help. I've made some progress independently already

📎 hw1.png

#

I'll detail my progress if anyone is genuinely willing to offer assistance

serene scaffold Jul 15, 2020, 1:51 AM

#

@sterile zenith this looks like one solution but I don't like how it reformats the matches: https://github.com/akoumjian/datefinder

GitHub

akoumjian/datefinder

Find dates inside text using Python and get back datetime objects - akoumjian/datefinder

sterile zenith Jul 15, 2020, 1:51 AM

#

👀

#

looks pretty good:

In [1]: import datefinder

In [2]: email="""it's going well thanks, how are you?
   ...:
   ...: say, what are you doing tomorrow night at 9pm? want to grab drinks? That's July 9, 2020 at 9pm.
   ...:
   ...: Best,
   ...: Name"""

In [3]: matches = datefinder.find_dates(email)

In [4]: matches
Out[4]: <generator object DateFinder.find_dates at 0x1048fd890>

In [5]: list(matches)
Out[5]: [datetime.datetime(2020, 7, 14, 21, 0), datetime.datetime(2020, 7, 9, 21, 0)]

#

unfortunately it literally interpreted "9pm" to mean "9pm tonight"

frank bone Jul 15, 2020, 1:55 AM

#

anyone know how to get a n-th element from a path?
for example "cpu" from /sys/bus/cpu/devices/cpu0/, sort of like basename but with a reverse level?

pseudo sonnet Jul 15, 2020, 1:56 AM

#

I'm thinking you could just extract the date + context (like 3 extra words either side or something) and search context for words like tomorrow

serene scaffold Jul 15, 2020, 1:56 AM

#

@sterile zenith huh, I didn't think it was going to be datetime objects. I like that.

pseudo sonnet Jul 15, 2020, 1:56 AM

#

Not foolproof, but it would catch the majority of cases
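
A minimal sketch of the context idea above, assuming the time string has already been located (here with a plain regex) and parsed into a datetime. `RELATIVE_DAYS` and `shift_by_context` are hypothetical names for illustration and are not part of datefinder.

```python
import re
from datetime import datetime, timedelta

# Hypothetical lookup: relative words and the day offset each one implies.
RELATIVE_DAYS = {"today": 0, "tonight": 0, "tomorrow": 1, "yesterday": -1}


def shift_by_context(text, match_start, match_end, parsed, window=3):
    """Shift `parsed` by a day offset if a relative word appears within
    `window` words on either side of text[match_start:match_end]."""
    before = text[:match_start].split()[-window:]
    after = text[match_end:].split()[:window]
    for word in before + after:
        offset = RELATIVE_DAYS.get(word.lower().strip(".,!?"))
        if offset is not None:
            return parsed + timedelta(days=offset)
    return parsed


text = "say, what are you doing tomorrow night at 9pm? want to grab drinks?"
match = re.search(r"\b9pm\b", text)
parsed = datetime(2020, 7, 14, 21, 0)  # the literal "9pm tonight" reading
print(shift_by_context(text, match.start(), match.end(), parsed))
```

As noted in the chat, this is not foolproof: it only catches the words listed in the lookup and takes the first one it finds.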

sterile zenith Jul 15, 2020, 1:57 AM

#

might end up forking this and expanding it to include more "natural" text

#

thanks!

frank bone Jul 15, 2020, 1:58 AM

#

>>> print(path)
cpu0
goal would be to do something like: os.path.basename('/sys/bus/cpu/devices/cpu0').depth=3

#

to return "bus"

pseudo sonnet Jul 15, 2020, 1:58 AM

#

what's the context?

frank bone Jul 15, 2020, 1:58 AM

#

u asking me?

marble jasper Jul 15, 2020, 1:58 AM

#

recursive dirname?

frank bone Jul 15, 2020, 1:59 AM

#

my goal is to get the Date from path instead from csv, to save time

marble jasper Jul 15, 2020, 1:59 AM

#

or split

#

try pathlib

pseudo sonnet Jul 15, 2020, 1:59 AM

#

Was bout to say what meseta said

marble jasper Jul 15, 2020, 1:59 AM

#

pathlib has a .parts property

#

it returns the folders as a tuple, you could just index that

frank bone Jul 15, 2020, 2:00 AM

#

stackexchange solution:
def nth_parent(path, n):
    return path if n <= 0 else os.path.dirname(nth_parent(path, n-1))

#

would this be ok?

pseudo sonnet Jul 15, 2020, 2:00 AM

#

also remember to write in windows compatibility tee hee

marble jasper Jul 15, 2020, 2:01 AM

#

from pathlib import PurePath

path = PurePath('/sys/bus/blah')
path.parts[2] # this would be "bus"
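
For the "basename but with a reverse level" question, negative indexes on `.parts` count from the end of the path. A small sketch, where `nth_from_end` is a hypothetical helper name and `PurePosixPath` is used so the result is the same on any operating system:

```python
from pathlib import PurePosixPath


def nth_from_end(path, n):
    """Return the n-th path component counting from the end (n=1 is the basename)."""
    return PurePosixPath(path).parts[-n]


print(nth_from_end("/sys/bus/cpu/devices/cpu0/", 1))  # cpu0
print(nth_from_end("/sys/bus/cpu/devices/cpu0/", 3))  # cpu
```

The trailing slash is dropped when the path is parsed, so it does not produce an empty last component.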

frank bone Jul 15, 2020, 2:01 AM

#

oh thats simple

marble jasper Jul 15, 2020, 2:01 AM

#

windows version:
```python
from pathlib import PureWindowsPath

path = PureWindowsPath(r'C:\sys\bus\blah')  # raw string, otherwise \b is read as a backspace escape
path.parts[2] # this would be "bus"
```

pseudo sonnet Jul 15, 2020, 2:02 AM

#

can it take a path object?

#

that would make it cross compatible

#

wait I'm dumb

#

actually forget I said anything

frank bone Jul 15, 2020, 2:03 AM

#

love it, thanks @marble jasper 😄

pseudo sonnet Jul 15, 2020, 2:04 AM

#

I'm confused why they made a separate object for windows paths

marble jasper Jul 15, 2020, 2:06 AM

#

no idea, I haven't played around with pathlib that much

#

although docs seem to indicate that it's automatic:

📎 unknown.png

#

so maybe you don't have to specify PureWindowsPath after all

zenith nova Jul 15, 2020, 2:07 AM

#

That's the case, yes

#

Path() is for paths in general. PurePath is for "virtual" paths that you don't want to match against the filesystem

#

Path().resolve() turns a path that looks like C:\Windows\..\..\Data into C:\Data

#

However, if you've symlinked C:\Data somewhere, it will also follow the symlink

#

if you just want to manipulate the path as a path, use PurePath

#

If you want to manipulate a path as if you were on a windows machine, PureWindowsPath
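
A short sketch of the distinction described above. Pure paths never touch the filesystem, so `..` segments are kept as written, and `PurePath` picks the POSIX or Windows flavour based on the machine it runs on. The example paths are made up.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath

# Works on any OS: the path is only parsed, never looked up on disk.
win = PureWindowsPath(r"C:\Windows\..\..\Data")
print(win.parts)  # ('C:\\', 'Windows', '..', '..', 'Data')

posix = PurePosixPath("/sys/bus/blah")
print(posix.parts[2])  # bus

# PurePath chooses the flavour that matches the current operating system.
print(type(PurePath("some/dir")).__name__)
```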

frank bone Jul 15, 2020, 2:14 AM

#

anyone here has some experience with hierarchical indexing in pandas? im not sure about the syntax when i want to stream(append) new elements into a pandas MultiIndex on the fly

#

https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html

Hierarchical Indexing | Python Data Science Handbook

#

i start with no index and with each iteration in my for loop i want to add new entries

flat quest Jul 15, 2020, 3:21 AM

#

@frank bone are u only using for loops to read the df's?

lapis sequoia Jul 15, 2020, 6:34 AM

#

Is it necessary to have a data science course certificate in order for me to get a data science job?

granite steppe Jul 15, 2020, 6:35 AM

#

@mighty escarp jsut wondering where u got those problems from ?

mighty escarp Jul 15, 2020, 8:39 AM

#

@granite steppe school exercise. Already solved it

granite steppe Jul 15, 2020, 8:49 AM

#

@mighty escarp bro was it hard ? Im just getting into data science and was just curious about it

mighty escarp Jul 15, 2020, 8:49 AM

#

This particular exercise wasnt that hard

lapis sequoia Jul 15, 2020, 9:20 AM

#

Is it necessary to have a data science course certificate in order for me to get a data science job?
@lapis sequoia from personal experience i can tell you that it is not necessary if you can prove that you are qualified, but it also depends on the market where you are competing. i had a social science degree, had years of experience with coding and i had read a lot on various related topics (graph theory, learning theory, common machine learning algorithms) before i applied for a job as a data scientist. the particular market where i am competing is also short on data scientists, data engineers, etc. therefore i had a better chance, but this might not be true in the us or in sweden for example.

#

the company where i work also recently hired somebody without any university degree and he is also another fine example. he worked as a data analyst previously, thus knew a fair share of python prior and had gained experience in building data science solutions while competing on kaggle which came quiet handy for him during the interview.

granite steppe Jul 15, 2020, 11:20 AM

#

@mighty escarp r u like undergrad??

lapis sequoia Jul 15, 2020, 11:45 AM

#

@lapis sequoia i m from india

#

So i can do some internships

#

And then get a certificate of that internship rather than that of data science course

lapis sequoia Jul 15, 2020, 12:36 PM

#

Output:

{'name': 'NepeTheMemer'}
{'name': 'Ostbollarna', 'changedToAt': 1565706666000} 
{'name': 'SmallGorilla', 'changedToAt': 1568300830000}
{'name': 'Bigboyneo', 'changedToAt': 1572086948000}   
{'name': 'DadMan123', 'changedToAt': 1574685201000}   
{'name': 'hypcroite', 'changedToAt': 1578509385000}   
{'name': 'baqnica', 'changedToAt': 1581196208000}
{'name': 'fappingbird123', 'changedToAt': 1584375177000}
{'name': 'walkingmonkey123', 'changedToAt': 1587207096000}
{'name': 'ssk1er', 'changedToAt': 1592750994000}

Code:

for names in data:
                        print(names)

                    name = data[0]["name"]

                    embed = discord.Embed(title="Minecraft User Info",
                                          description=f"**First Name**: `{name}`\n**New Names**: {names}",
                                          color=discord.Color.from_rgb(65, 224, 34),
                                          timestamp=datetime.utcnow())
                    await ctx.send(embed=embed)

It only sends this:
https://imgur.com/a/B5yyrAt

Imgur

stone terrace Jul 15, 2020, 12:54 PM

#

hello people, can somebody help me install tensorflow-gpu on ubuntu with a non-nvidia gpu?

hot parcel Jul 15, 2020, 12:58 PM

#

Hello?

#

Anyone could have something like API

#

And grab something from the website?

#

IDK how to describe what I am going to do 🤭

#

Anyone help me

silk axle Jul 15, 2020, 2:22 PM

#

## Create the model's architecture
model = Sequential()

# Add the first layer
model.add(Conv2D(32, (5, 5), activation='relu', input_shape=(28, 28, 3)))  # 28x28x3 - 3=rgb
...  # have a load of adding layers below

## Compile the model
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

## Train the model
hist = model.fit(x_train, y_train_one_hot,
                 batch_size=256,
                 epochs=10,
                 validation_split=0.2)

The model.fit(...) is raising the following error: ValueError: Input 0 of layer sequential_2 is incompatible with the layer: expected ndim=4, found ndim=3. Full shape received: [None, 28, 28]
How can I specify what ndim should be expected for the model.fit()? I thought I did this when I added the first layer by doing input_shape=(28, 28, 3)?

#

@me upon response please

silk axle Jul 15, 2020, 2:54 PM

#

Edit: found out I needed to .reshape my data, seems to be working now 🙂
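
The exact reshape was not posted, so the following is only a sketch of one likely fix. The error reports a shape of [None, 28, 28], meaning the images have no channel axis, while Conv2D expects (batch, height, width, channels). The array below is a zero-filled stand-in for `x_train`.

```python
import numpy as np

x_train = np.zeros((5, 28, 28))  # stand-in: 5 single-channel 28x28 images

# Add a trailing channel axis so each image is 28x28x1.
x_train_4d = x_train.reshape(-1, 28, 28, 1)

print(x_train.ndim, x_train_4d.shape)  # 3 (5, 28, 28, 1)
```

With one channel, the first layer would presumably need `input_shape=(28, 28, 1)` rather than `(28, 28, 3)`; `(28, 28, 3)` only fits data that really has three colour channels.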

limpid raft Jul 15, 2020, 2:59 PM

#

I'm trying to transform a Convolutional model to CSV. My attempt has been to make it a dataframe then CSV, but I can't even get it dataframe.

#

this is what I've tried:

#

hist=model.fit(training_input,training_output,epochs = n_epochs, verbose = 1, callbacks = es_callback)

#

that gives:
<tensorflow.python.keras.callbacks.History at 0x7f1e475be7b8>

#

and that to dataframe with:
pd.DataFrame(hist).to_csv("predictions_final.csv", header=["Expected"], index_label="Id")

#

Anyone know what I'm doing wrong?

lapis sequoia Jul 15, 2020, 3:14 PM

#

I'm trying to transform a Convolutional model to CSV. My attempt has been to make it a dataframe then CSV, but I can't even get it dataframe.
@limpid raft do you mean like literally exporting your model weights as a csv file? or saving your training metrics into a csv file? for the latter you could just use https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/CSVLogger

TensorFlow

tf.keras.callbacks.CSVLogger | TensorFlow Core v2.2.0
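
If the goal is the training metrics, another option is to build the dataframe from `hist.history` instead of from `hist` itself: the Keras `History` object keeps a dict that maps each metric name to a list with one value per epoch. The sketch below uses a hand-written dict as a stand-in for `hist.history`, and the numbers are made up.

```python
import pandas as pd

# Stand-in for hist.history (made-up values, one entry per epoch).
history = {"loss": [0.9, 0.5, 0.3], "accuracy": [0.6, 0.8, 0.9]}

metrics = pd.DataFrame(history)
metrics.index.name = "epoch"

csv_text = metrics.to_csv()  # pass a filename instead to write to disk
print(csv_text)
```

The file name `predictions_final.csv` suggests the intent may have been to export predictions; in that case the data would come from `model.predict(...)` rather than from the training history.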

hardy shale Jul 15, 2020, 3:18 PM

#

@hot parcel You might try grabbing a help channel, response time will be a bit quicker there

hot parcel Jul 15, 2020, 3:18 PM

#

ok

#

thx

uncut shadow Jul 15, 2020, 3:28 PM

#

ummm

#

wdym

#

I don't think that's a good way of doing this. You can ofc just use .strip() and do it this way

#

oh

#

hmmm

#

so you have a json in csv or csv in json?

frank bone Jul 15, 2020, 3:33 PM

#

@frank bone are u only using for loops to read the df's?
@flat quest reading csvs within a for loop yes

#

does anyone know how to properly use dask? trying to use 8 cores instead of 1...

#

concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
print(concatenated_df)

#

this is pandas

#

i thought dask (as dd) was the same syntax and usage, but multi core

#

concatenated_df = dd.concat(df_from_each_file, ignore_index=True)
print(concatenated_df)

#

just throws an error though

#

TypeError: dfs must be a list of DataFrames/Series objects

lapis sequoia Jul 15, 2020, 3:42 PM

#

@frank bone it says list and you are making a generator, so i would try changing that first

#

you could also just initialize concatenated_df first as an empty dask dataframe and append any newly read dataframe to that. for that you would need a for loop of course
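
The difference the error message points at can be seen without pandas or dask. A small sketch with made-up file names: round brackets create a lazy generator, square brackets create a real list.

```python
files = ["a.csv", "b.csv", "c.csv"]  # hypothetical file names

lazy = (name.upper() for name in files)   # generator expression: nothing computed yet
eager = [name.upper() for name in files]  # list comprehension: computed right away

print(isinstance(lazy, list), isinstance(eager, list))  # False True
print(list(lazy))  # ['A.CSV', 'B.CSV', 'C.CSV']
print(list(lazy))  # [] because a generator can only be consumed once
```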

frank bone Jul 15, 2020, 3:45 PM

#

ill try to figure out..im a complete noob :/ started 3 days ago..googling my way through 😄

#

i just thought pd and dd were the same, one single core and one multi core, guess theres more behind it

lapis sequoia Jul 15, 2020, 3:46 PM

#

so i think the general idea here might be that since you are making a generator (which is lazy by nature) you can't parallelize processing it

frank bone Jul 15, 2020, 3:47 PM

#

the current single core implementation takes 70minutes to run...so im trying to get it under 10min with 8 cores if thats possible

#

just going through csv files, doing some calculations and putting the results in tuples

#

its a time series

lapis sequoia Jul 15, 2020, 3:48 PM

#

are you going to run this script multiple times in the future? if so i would abandon the separate csv idea and put everything into a single dataframe and save it as parquet. it will read faster on the next run

#

parquet format is supported both by pandas and dask

frank bone Jul 15, 2020, 3:48 PM

#

yes multiple runs with adjusted parameters

#

parquet is a database?

upbeat pagoda Jul 15, 2020, 3:49 PM

#

hey team I'm working with splitting a string into multiple columns, I'm not sure why str is used so much in:
df['dateTime'].str.split('-').str[1]

I kinda understand the first .str is an attribute of the df['dateTime'] series object, but the second one baffles me, thanks!

lapis sequoia Jul 15, 2020, 3:50 PM

#

@frank bone nah, its a tabular fileformat designed specifically for cases when you might not need everything from the table. it is going to be a directory of separate files on your disk (just like what you would have with the separate csv files)

frank bone Jul 15, 2020, 3:51 PM

#

oh i see, ive noted it and will look into that. thanks 🙂

#

but just in case my parameters change a lot and that solution wouldnt suit my needs, do you see a way to make my code above multi processed?

#

the part that searches through directories, reads csv into a frame and calculatues new values?

mellow spruce Jul 15, 2020, 3:52 PM

#

Hello guys, what is the best way or library to construct a table GUI where the user inputs values on the table and I it outputs calculations based on those input values in another table?

frank bone Jul 15, 2020, 3:54 PM

#

maybe something like this? http://www.racketracer.com/2016/07/06/pandas-in-parallel/ or dask...

Data Science Jay

Jay

Python Pandas Functions in Parallel - Data Science Jay

Some quick hacks on running pandas in parallel would be nice. By adding Pools to the multiprocessing package in Python, we can spin up multiple threads.

lapis sequoia Jul 15, 2020, 3:54 PM

#

the part that searches through directories, reads csv into a frame and calculatues new values?
@frank bone the code you posted above i think should run if you change one thing:
df_from_each_file = [pd.read_csv(f, usecols=[5, 6]) for f in csv_files]
this will make it a list comprehension (hence reading everything into memory) and not a generator which should suit dask. now if you to want parallelize reading the files you could use joblib for that

#

https://joblib.readthedocs.io/en/latest/parallel.html

frank bone Jul 15, 2020, 3:59 PM

#

oh that seems to work and with variable.compute() i put it back into pandas format

lapis sequoia Jul 15, 2020, 3:59 PM

#

oh it seems like dask's read_csv method accepts patterns, so if you have a large number of files in the same directory starting with the same pattern (like data for example) and then you have a changing pattern (like a continuously increasing number) then you could pass this pattern to the read_csv method like so: data*.csv

#

oh that seems to work and with variable.compute() i put it back into pandas format
@frank bone ok great 🙂

frank bone Jul 15, 2020, 4:00 PM

#

thanks 😄 ill tinker some more but now im wondering if this truly increased speed

chrome barn Jul 15, 2020, 4:00 PM

#

string = '10-09-2019'
print(string.split('-')[0])
print(string.split('-')[1])
print(string.split('-')[2])

because .str is the pandas accessor for handling strings. the above example does the same thing in normal python and gives the outputs 10, 09 and 2019 respectively

lapis sequoia Jul 15, 2020, 4:01 PM

#

thanks 😄 ill tinker some more but now im wondering if this truly increased speed
@frank bone well the first solution i proposed did not. it just read everything sequentially. however any computation you would do afterwards is handled by dask in parallel

frank bone Jul 15, 2020, 4:03 PM

#

so i can only increase the speed with joblib?

lapis sequoia Jul 15, 2020, 4:04 PM

#

i think the second solution would increase speed as well (providing a file pattern for the read_csv method of dask) or you could convert to parquet after reading it first, saving the parquet then using only that afterwards

frank bone Jul 15, 2020, 4:08 PM

#

dask slowed speed by 10x lol

#

mustve done something wrong?

lapis sequoia Jul 15, 2020, 4:39 PM

#

Hi, i don't know how to reformat the data to split it by 'column' in my dataframe.

desert oar Jul 15, 2020, 4:47 PM

#

@lapis sequoia can you provide some sample data and an example of the desired output?

#

@frank bone anything using parallelization has a lot of overhead in moving data/information between processes

#

what are you doing exactly?

lapis sequoia Jul 15, 2020, 4:49 PM

#

@desert oar

📎 car1.PNG

frank bone Jul 15, 2020, 4:51 PM

#

what are you doing exactly?
@desert oar iterating through directories, retrieving csv files, loading them into Pandas dataframes, then doing basic calculation and saving results in tuples, in pairs of 3

pale marsh Jul 15, 2020, 4:53 PM

#

Hey does anyone know altair? I'm trying to figure out why its not plotting my chart 😭

desert oar Jul 15, 2020, 5:00 PM

#

@lapis sequoia what do you want to do with this data?

#

@frank bone write a function that loads 1 file, then use joblib or processpoolexecutor or multiprocessing.pool to parallelize

#

maybe use a queue to save the data or something

#

depends on how you want to store that
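
A rough sketch of the "one function per file, then parallelize" suggestion, using only the standard library. It swaps pandas for the `csv` module, and assumes each file has `Price` and `Volume` columns; `dollar_volume` and `total_dollar_volume` are hypothetical names.

```python
import csv
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def dollar_volume(csv_path):
    """Sum Price * Volume over one CSV file (column names are an assumption)."""
    with open(csv_path, newline="") as handle:
        return sum(float(row["Price"]) * float(row["Volume"])
                   for row in csv.DictReader(handle))


def total_dollar_volume(folder, workers=4):
    """Process every CSV in `folder` in separate processes and add up the results.

    On Windows and macOS, call this from under `if __name__ == "__main__":`
    so worker processes can import the module safely.
    """
    paths = sorted(Path(folder).glob("*.csv"))
    with ProcessPoolExecutor(max_workers=workers) as executor:
        # chunksize batches many small files per task to reduce overhead
        return sum(executor.map(dollar_volume, paths, chunksize=16))
```

Because every file is small, the cost of handing work to other processes can outweigh the gain, so it is worth timing this against the single-process loop before committing to it.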

#

@lapis sequoia use .loc

#

df_cardio1 = df.loc[df['cardio'] == 1]
df_cardio0 = df.loc[df['cardio'] == 0]

lapis sequoia Jul 15, 2020, 5:22 PM

#

@desert oar thanks

upbeat pagoda Jul 15, 2020, 5:28 PM

#

@chrome barn hey again, I see so this

df['dateTime'].str.split('-')

outputs another series object, in order to index it using the [0], we have to call .str to make it a string again
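
A small sketch of what each step returns, using made-up dates. Strictly speaking, the second `.str` does not turn anything back into a string: `.str[1]` indexes into every element of the Series, and here each element is the list produced by `split`.

```python
import pandas as pd

df = pd.DataFrame({"dateTime": ["10-09-2019", "11-10-2019"]})  # made-up sample data

parts = df["dateTime"].str.split("-")  # Series whose elements are Python lists
second = parts.str[1]                  # element-wise indexing: item 1 of each list

print(parts.tolist())   # [['10', '09', '2019'], ['11', '10', '2019']]
print(second.tolist())  # ['09', '10']
```

Plain `parts[1]` would instead select the row with index label 1, which is why the accessor is needed a second time.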

desert oar Jul 15, 2020, 5:33 PM

#

Did you try it at all

#

Is this homework/coursework?

#

We do not give out homework answers here

lapis sequoia Jul 15, 2020, 5:34 PM

#

i try i use with catplot but doesn't work

frank bone Jul 15, 2020, 5:37 PM

#

@desert oar im using this code
```
def main():
    ticker_filter = []
    thistuple = []

    for root, dirs, files in min_depth(os.walk('/home/computer/Documents/Stocks/'), depth=4):
        if not dirs:
            path = root
        ticker = os.path.basename(path)
        date = PurePath(path).parts[6]
        if (len(ticker_filter) > 0) and ((ticker in ticker_filter) == False):
            continue
        csv_files = glob.glob(os.path.join(path, "*.csv"))
        df_from_each_file = (pd.read_csv(f, usecols=[5, 6]) for f in csv_files)
        concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
        dollar_volume = (concatenated_df['Price'] * concatenated_df['Volume']).sum()
        thistuple.append(tuple((str(ticker), str(date), int(dollar_volume))))

    return(thistuple)

main()
```
for what i described before. See any improvement potentials? 1 day of data takes around 11s and that's approximately 9432 files (.csv) with an average file size of 7kB

#

this is the output (example): ```[('AAPL', '20120103', 3835538), ('AAPL', '20120104', 4176408), ('AAPL', '20120105', 5468868), ('AAPL', '20120106', 5520247), ('AAPL', '20120109', 5698068), ('AAPL', '20120110', 2714942), ('AAPL', '20120111', 2038918), ('AAPL', '20120112', 3203726), ('AAPL', '20120113', 2376274), ('AAPL', '20120117', 3423089), ('AAPL', '20120118', 4850734), ('AAPL', '20120119', 4446490), ('AAPL', '20120120', 6234720), ('AAPL', '20120123', 3805538), ('AAPL', '20120124', 7197666), ('AAPL', '20120125', 13312566), ('AAPL', '20120126', 3968039), ('AAPL', '20120127', 3818511), ('AAPL', '20120130', 6087196), ('AAPL', '20120131', 5992732), ('AAPL', '20120201', 3058387), ('AAPL', '20120202', 2930134), ('AAPL', '20120203', 4154695), ('AAPL', '20120206', 3953637), ('AAPL', '20120207', 5685242), ('AAPL', '20120208', 7271777), ('AAPL', '20120209', 18847913), ('AAPL', '20120210', 8719465)]```

mellow spruce Jul 15, 2020, 5:39 PM

#

Having a problem with plotly dash I want to use a dropdown with multi option to update the rows for a dash table. however it looks like this

📎 0.png

#

my update code looks something like this `@app.callback(
    dash.dependencies.Output('output', 'children'),
    [dash.dependencies.Input('select-prd', 'value')])
def update_output(value):
    value_list=[]
    value_list.append(value)
    return [
        dash_table.DataTable(
            id='prd-mix',
            columns=[{'id':'product-group', 'name':'Product Group'},
                     {'id':'mix','name':'Product Mix'}],
            data=[{'product-group':value for value in value_list} for j in range(len(value)) ],
            editable=True
            )`
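
One possible cause, sketched without Dash itself: `value_list.append(value)` wraps the whole selection in another list, so the dict comprehension stores the full list under `'product-group'` and the outer loop repeats that same row once per selected item. Building one dict per selected option avoids this. The sketch assumes the multi-select dropdown passes a list of selected values (or None when nothing is selected); `rows_for_selection` is a hypothetical helper name.

```python
def rows_for_selection(value):
    """Build one table row per selected dropdown option.

    `value` is assumed to be a list of selected options, or None when
    nothing is selected.
    """
    return [{"product-group": option, "mix": ""} for option in (value or [])]


print(rows_for_selection(["A", "B"]))
# [{'product-group': 'A', 'mix': ''}, {'product-group': 'B', 'mix': ''}]
```

The result could then be passed as the `data` argument of the table in place of the nested comprehension.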

chrome barn Jul 15, 2020, 5:43 PM

#

@upbeat pagoda exactly see the screenshot of how python is handling the dataframe based on how you call it

📎 unknown.png

desert oar Jul 15, 2020, 5:48 PM

#

@frank bone where does path come from if dirs is not empty?

frank bone Jul 15, 2020, 5:50 PM

#

not sure what you mean, depth=4 folders are never empty

desert oar Jul 15, 2020, 5:52 PM

#

you have

if not dirs:
    path = root

#

what is min_depth? a function you wrote?

frank bone Jul 15, 2020, 5:53 PM

#

its from walkdir module

#

skips unnecessary walking from os.walk when you want a specific depth

#

you have

if not dirs:
    path = root

@desert oar just something i took from stackexchange that provides me with pathnames of the folders that contain files

#

which are all exclusively at depth=4

#

thats not where im losing time

#

that part takes 0.085s per day (out of the 11s)

upbeat pagoda Jul 15, 2020, 6:03 PM

#

@chrome barn 🙏 thanks again man. btw I meant to let you know I'm implementing what you told me about the scraper, it's just going to take me a few days

weary dune Jul 15, 2020, 6:10 PM

#

Can ayone help me figure out why I have a syntax error?

📎 Screen_Shot_2020-07-15_at_12.09.23_PM.png

frank bone Jul 15, 2020, 6:11 PM

#

@desert oar this one line uses 99% of those 11s

#

concatenated_df = pd.concat(df_from_each_file, ignore_index=True)

#

creating 1 dataframe from x amount of csv within a folder

#

is there a way to multi process this one liner?

#

the reason I merge them is because let's say i have 6 csv files per folder and i want to sum a column

#

so i can sum 1 dataframe

#

maybe there's another faster way?

desert oar Jul 15, 2020, 6:20 PM

#

ah

#

you can sum each df separately

#

then sum the sums

#

sum((df['Price'] * df['Volume']).sum() for df in df_from_each_file)

@frank bone

frank bone Jul 15, 2020, 6:28 PM

#

@desert oar tried it, worked

desert oar Jul 15, 2020, 6:28 PM

#

then you dont have to worry about the big expensive concat operation

#

good

frank bone Jul 15, 2020, 6:28 PM

#

but its actually 0.7s slower

desert oar Jul 15, 2020, 6:28 PM

#

huh really

#

im actually very surprised

frank bone Jul 15, 2020, 6:28 PM

#

yup concat is faster 😄 didnt expect that

desert oar Jul 15, 2020, 6:28 PM

#

maybe concat is smart in that it leaves all the memory in place

frank bone Jul 15, 2020, 6:28 PM

#

another way maybe?

#

or maybe concat multiprocessable?

desert oar Jul 15, 2020, 6:29 PM

#

no i think you should just parallelize over os.walk

frank bone Jul 15, 2020, 6:29 PM

#

like a separate concat function

#

and use multiprocess module?

desert oar Jul 15, 2020, 6:29 PM

#

1. chunk the output of os.walk into N chunks (where N = number of parallel workers)
2. write a function that loops over its inputs and processes them one at a time
3. make N workers, each worker handles one chunk from (1)
4. use sqlite to save the output because sqlite can handle concurrent connections

frank bone Jul 15, 2020, 6:31 PM

#

chunk the output of os.walk into N chunks (where N = number of parallel workers)

write a function that loops over its inputs and processes them one at a time

make N workers, each worker handles one chunk from (1)

use sqlite to save the output because sqlite can handle concurrent connections
@desert oar that's kinda over my head damnit lol im a noob i just started

#

but maybe coffee + google, might give it a try

#

you think making a separate function where i do this one liner might work too?

desert oar Jul 15, 2020, 6:32 PM

#

dont think about one liners

frank bone Jul 15, 2020, 6:32 PM

#

or is there no implementation for that

desert oar Jul 15, 2020, 6:32 PM

#

but yes separate functions are good

frank bone Jul 15, 2020, 6:32 PM

#

was looking at this

#

http://www.racketracer.com/2016/07/06/pandas-in-parallel/

desert oar Jul 15, 2020, 6:32 PM

#

you arent parallelizing pandas

#

remember

#

each dataframe is small

#

the problem is you have a lot of files to process

#

at least, that's the problem as i understand it

#

how many files are we talking about anyway?

frank bone Jul 15, 2020, 6:33 PM

#

yes, around 9000 in those 11s with average size of 7kB

#

9432 files average file size of 7kB

#

to be exact

#

for that folder im talking about, that's taking 11s

#

in total millions, probably

marble jasper Jul 15, 2020, 6:34 PM

#

why not concat the files first. then read into pandas

desert oar Jul 15, 2020, 6:34 PM

#

concatenate 9000 files?

marble jasper Jul 15, 2020, 6:34 PM

#

they're only 7kb

desert oar Jul 15, 2020, 6:35 PM

#

true i guess its what, 63 MB total?

frank bone Jul 15, 2020, 6:35 PM

#

no, im only mergin 5-10 files at a time, per iteration

desert oar Jul 15, 2020, 6:35 PM

#

it looks like they're doing a bunch of more complicated processing

#

not just reading csvs 1 at a time

#

like read from one file, then fetch a bunch of specific corresponding csvs

frank bone Jul 15, 2020, 6:36 PM

#

im reading all csv files in a folder, then concatenating them

#

which is usually 5-10 files or so

#

at avg size of 7kB

#

and i have thousands of folders so that adds up to those 11

#

2800 folders to be exact

#

for that specific day

#

its a time series

#

why not concat the files first. then read into pandas
@marble jasper you have to do it on a ticker basis, cant just merge all, we're talking about AAPL, TSLA, IBM, BA etc...

marble jasper Jul 15, 2020, 6:38 PM

#

so why not concat the data first then create a data frame from the whole lot?

frank bone Jul 15, 2020, 6:38 PM

#

each folder = 1 ticker

#

per day

marble jasper Jul 15, 2020, 6:39 PM

#

yes, you're doing what, hourly or minute stock data? I'm familiar with the format

#

how many symbols do you have?

frank bone Jul 15, 2020, 6:39 PM

#

yes minute

#

depends on the day and year, 3000-4000 usually

marble jasper Jul 15, 2020, 6:39 PM

#

and you're calculating total volume movement per minute?

#

and what is your output format?

frank bone Jul 15, 2020, 6:40 PM

#

no, total dollar volume per day

marble jasper Jul 15, 2020, 6:40 PM

#

ok

frank bone Jul 15, 2020, 6:40 PM

#

right now its tuples in a list with 3 values

#

Ticker, Date, Volume

#

not sure if i want it like that though, but the idea is to store these 3 values

#

and then apply moving averages, standard deviations and whatnot on a per ticker-stream basis

desert oar Jul 15, 2020, 6:42 PM

#

from concurrent.futures import wait, ProcessPoolExecutor
from itertools import islice
from math import ceil

dbconn = sqlite3.connect('output.db')
dbconn.execute('create table ticker_output (ticker text, date text, dollar_volume integer)')


def worker(walkdata):
    dbconn = sqlite3.connect('output.db')
    for path, dirs, files in walkdata:
        # do stuff
        dbconn.execute('insert into ticker_output values (?, ?, ?)', (ticker, date, dollar_volume))
        dbconn.commit()


def iter_batches(data, batchsize):
    data = iter(data)
    while True:
        batch = list(islice(data, batchsize))
        if not batch:
            break
        yield batch


n_proc = 8
walkdata = list(os.walk('data-dir'))
batches = list(iter_batches(walkdata, ceil(len(walkdata) / n_proc)))

with ProcessPoolExecutor(n_proc) as executor:
    futures = [executor.submit(worker, batch) for batch in batches]
    wait(futures)

marble jasper Jul 15, 2020, 6:42 PM

#

I suggest concatenating the csv files for each symbol first, then converting the whole time series to dataframe

#

and yes, parallel process each symbol

desert oar Jul 15, 2020, 6:43 PM

#

the usual caveats about untested code apply

marble jasper Jul 15, 2020, 6:44 PM

#

definitely trust code from people on the internet 100%

frank bone Jul 15, 2020, 6:44 PM

#

I suggest concatenating the csv files for each symbol first, then converting the whole time series to dataframe
@marble jasper 1 dataframe per ticker or the whole ticker universe in 1 dataframe?

marble jasper Jul 15, 2020, 6:45 PM

#

1 dataframe per symbol, though perhaps do the benchmarks to find where your bottlenecks are. I suspect they're in the overhead of dataframe conversion, but I may be wrong

desert oar Jul 15, 2020, 6:45 PM

#

i feel like its not worth it tbh

frank bone Jul 15, 2020, 6:45 PM

#

btw is it possible in python to use a variable output to create a new variable? Let's say $variable = [], where AAPL is behind "variable"

desert oar Jul 15, 2020, 6:45 PM

#

you have to write a shell script and test it and shit to do the concatenation

#

or just... dont bother

#

@frank bone no. use a dict

#

symbols = {
    'AAPL': (1, 2, 3)
}

frank bone Jul 15, 2020, 6:46 PM

#

like now?

#

oh

desert oar Jul 15, 2020, 6:46 PM

#

symbols = {}
for symbol in ['AAPL', 'TSLA']:
    symbols[symbol] = ...  # stuff

#

also, does my parallelization code make sense @frank bone ?

frank bone Jul 15, 2020, 6:47 PM

#

symbols = {}
for symbol in ['AAPL', 'TSLA']:
    symbols[symbol] = ...  # stuff

@desert oar my hello world was on Sunday, so my head is starting to hurt 😄 but its noted...

desert oar Jul 15, 2020, 6:47 PM

#

hah alright

frank bone Jul 15, 2020, 6:48 PM

#

ill try to understand, for now I'd just like to try to multi process the one concat

#

and see if it actually helps speed

marble jasper Jul 15, 2020, 6:48 PM

#

try snakeviz, it's not as complex to use as you might think

frank bone Jul 15, 2020, 6:48 PM

#

never even heard of that 😄 ill take a look

lapis sequoia Jul 15, 2020, 6:48 PM

#

quick one, how to get a value for ram usage?

marble jasper Jul 15, 2020, 6:48 PM

#

it'll give you a breakdown of your execution time and you'll be able to see what needs improving
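
A rough sketch of how such a breakdown can be collected with the standard library's `cProfile`, which produces the data snakeviz visualizes. `slow_part`, `fast_part` and `main` are hypothetical stand-ins, not code from this thread.

```python
import cProfile
import io
import pstats


def slow_part():
    # Stand-in for the expensive step (e.g. the concat).
    return sum(i * i for i in range(200_000))


def fast_part():
    return sum(range(100))


def main():
    slow_part()
    fast_part()


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Collect the five most expensive calls, ordered by cumulative time.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)

# To explore the same data in snakeviz, dump it to a file and run
# `snakeviz profile.prof` from a shell:
# profiler.dump_stats("profile.prof")
```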

frank bone Jul 15, 2020, 6:49 PM

#

i checked manually, like 99% is in the concat 1 liner

#

of the 11 seconds per day

marble jasper Jul 15, 2020, 6:49 PM

#

can't optimize if you don't know where to optimize!

frank bone Jul 15, 2020, 6:49 PM

#

i.e like 10.85s = concat, 0.15s all the rest

marble jasper Jul 15, 2020, 6:50 PM

#

so...concat the csv files first and then convert to pandas, which is still my one and only suggestion

frank bone Jul 15, 2020, 6:54 PM

#

how would you do it the concat? call a bash script from within python? or do it in python directly?

mellow spruce Jul 15, 2020, 6:59 PM

#

Hello all, I have a table that looks like this:

Name      | Quantity
gardeners | 25
janitors  | 30
clerks    | 25

and for each name I have another table that looks like:

Activity   | time
Cleaning   | 0.5
Sweeping   | 0.4
Organizing | 0.7

and so on for 500 or so activities. After selecting the names, I would like to go through each name's table and multiply each activity's time by that name's quantity, with the results output in a different table. However, many names may share some activities, so for those shared activities I would like to output something like 25*(time the gardener spends cleaning)+30*(time the janitor spends cleaning). So the output table should look like this:

Activity   | % of time needed
Cleaning   | 27
Sweeping   | 30
Organizing | 15
and so on

The libraries I am using are plotly dash because it has to be a web app and datascience tables to calculate the time per activity. The output table is not going to be interacted with beyond exporting the table but it has to be output on the dash environment. Thanks for your help
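
One possible way to compute that table, sketched with pandas rather than the datascience tables mentioned in the question. The names, quantities and times below are invented, and `total_time` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical inputs shaped like the tables described above.
quantities = {"gardeners": 25, "janitors": 30}
activity_tables = {
    "gardeners": pd.DataFrame(
        {"Activity": ["Cleaning", "Sweeping"], "time": [0.5, 0.4]}
    ),
    "janitors": pd.DataFrame(
        {"Activity": ["Cleaning", "Organizing"], "time": [0.2, 0.7]}
    ),
}

# Weight every activity time by the head count of its group.
weighted = []
for name, quantity in quantities.items():
    table = activity_tables[name].copy()
    table["total_time"] = table["time"] * quantity
    weighted.append(table[["Activity", "total_time"]])

# Activities shared between groups collapse into a single summed row.
result = (
    pd.concat(weighted)
    .groupby("Activity", as_index=False)["total_time"]
    .sum()
)
print(result)
```

Here Cleaning ends up as 25*0.5 + 30*0.2. If the output has to appear in Dash, `result.to_dict("records")` gives a list of row dicts, which is the shape a Dash DataTable's `data` property usually takes.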

lapis sequoia Jul 15, 2020, 7:11 PM

#

Exception has occurred: TypeError
'str' object does not support item deletion

Code:

for _friends in data["records"]:
                        friends = _friends["uuidSender"]
                        if friends == uuid:
                            del friends["uuidSender"]
                        print(friends)

I'm trying to remove something that looks like this:

38f67f4aa0c74dd99792aaf6cc401424

I'm using a Minecraft API (the Hypixel API) that returns the friends from that specific server. I just want to remove the duplicates that exactly match the user's own username or the uuid passed, so basically I want to remove everything from this list that has the exact same uuid as the one the person types in when invoking the command.
Example:
pls hypixel friends 38f67f4aa0c74dd99792aaf6cc401424
I want this command to return all friends but just remove all friends that match this uuid because hypixel does this sometimes

#

Full Code:
https://paste.pythondiscord.com/relosibowe.py
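
A small sketch of one way around the TypeError above: `_friends["uuidSender"]` is a plain string, so there is nothing to `del` inside it. Building a new list that skips the unwanted uuid avoids the problem. The `data` dict here is a hypothetical response with the same shape the loop expects.

```python
# Hypothetical response shaped like the loop above expects.
data = {
    "records": [
        {"uuidSender": "38f67f4aa0c74dd99792aaf6cc401424"},
        {"uuidSender": "74f8ddef7c824a22a1588299a6e5a541"},
        {"uuidSender": "38f67f4aa0c74dd99792aaf6cc401424"},
        {"uuidSender": "998ef1292aa94d208353ff513d6e86cd"},
    ]
}
uuid = "38f67f4aa0c74dd99792aaf6cc401424"

# Keep every sender except the uuid that was passed to the command.
friends = [
    record["uuidSender"]
    for record in data["records"]
    if record["uuidSender"] != uuid
]
print(friends)
```

Collecting the values in a list also means all of them are still available after the loop, for example via `"\n".join(friends)`.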

modest rune Jul 15, 2020, 7:20 PM

#

I wrote a stock options backtesting tool using pandas. The tool works, but the tool is slow. 30 seconds per option chain. I have been reading as much as I can about optimizing my code. So much, that my head hurts. I decided to write out my data structures on pencil and paper and draw out a solution. I have some questions.

My main approach to solving my speed issues is that I am going to manipulate my multidimensional data with vectorization.

The rough layout of my data is...
Stock History: [days X (price.high, price.low)]
Options Data: 4d array for stock option data [expiration date X [Call/Put Type X [Strike Price X Option Price]]].

Question: my current thinking is that I would concat that 4d array into a 2d array (dataframe). But, I am just not sure what is the best approach when using pandas and numpy.

Approach A: 1 Big 2D array that will need some binning, sorting, and indexing and a dataset that has lots of repetition in its data (ex. the PUT/CALL column would consist of only PUT or CALL, or the Strike Price column would have duplicate strike prices for every option under that strike price).

Approach B: 20 medium sized dataframes part of a 4D array, that are nicely organized but will need to be iterated over.

Approach C: A better way that I am not thinking about.

Footnote: I will be doing vector math between the 2D Stock History dataframe and the Stock Options data.

I am just looking for high level advice on what datastructures tend to work more efficiently with pandas and numpy. A or B? I don't want to head down a design path for doing my math that I will have to redo 4 days from now because I wasn't able to realize significant time savings

desert oar Jul 15, 2020, 7:20 PM

#

@frank bone there's no point in multiprocessing the concat imo

#

better to parallelize the whole thing like i showed you

frank bone Jul 15, 2020, 7:22 PM

#

@frank bone there's no point in multiprocessing the concat imo
@desert oar the only problem is the concat though, but im still trying to understand the code and how to implement it

#

ill try both cause why not 😄

silk axle Jul 15, 2020, 7:31 PM

#

How can I resize an image to be of shape (28, 28, 1)? When I try and do resize(image, (28,28,1)) I get an error saying that depth of 3 can't be converted to depth of 1. Not really sure what to do as my model can only accept images with a depth of 1

#

(I get my (28, 28, 1) training images from mnist)

pale thunder Jul 15, 2020, 7:32 PM

#

I think you are supposed to flatten it to get depth 1

silk axle Jul 15, 2020, 7:32 PM

#

That would make sense

#

Although, how do I flatten? 👀

pale thunder Jul 15, 2020, 7:33 PM

#

Reshape into a shape of length 1

silk axle Jul 15, 2020, 7:33 PM

#

So I have to do resize(image.reshape((28, 28, 1))?

#

That doesn't work

uncut shadow Jul 15, 2020, 7:37 PM

#

u want to turn it to 1D?

silk axle Jul 15, 2020, 7:37 PM

#

Yea

#

.flatten() doesn't seem to work either

#

Actually might've fixed it, need to wait for it to load

uncut shadow Jul 15, 2020, 7:38 PM

#

maybe try, idk resize(img, (784, 1)) or resize(img, 784) but am not sure if it's gonna work

lapis sequoia Jul 15, 2020, 7:40 PM

#

Output:

74f8ddef7c824a22a1588299a6e5a541
74f8ddef7c824a22a1588299a6e5a541
74f8ddef7c824a22a1588299a6e5a541
998ef1292aa94d208353ff513d6e86cd
1f0a16cb035a47f9b715067888d4cb3e
b020a8f479304c5cb6a58f9ef471743d
d62e5b06e0cc4e3f8febdebf85f92522
92f13295be604a2988251b5dc665f91b
30b49a5292b146a080e94ae9fb6f8b92
9b07317430074750874d151c98628d47
8c1096fb49ca4983b0c6afbcd1dab5e0

Code:

for _friends in data["records"]:
                        friends = _friends["uuidSender"]
                        print(friends)

                    embed = discord.Embed(title=f"{uuid} Hypixel Friends",
                                          description=f"**Friends**:\n{friends}",
                                          color=discord.Color.from_rgb(65, 224, 34),
                                          timestamp=datetime.utcnow())
                    await ctx.send(embed=embed)

So my command doesn't really show everything it only shows one uuid

Discord:
https://imgur.com/a/xl8yhKe


silk axle Jul 15, 2020, 7:40 PM

#

Wrong channel @lapis sequoia

#

#discord-bots

lapis sequoia Jul 15, 2020, 7:41 PM

#

yeah but it's the json formatting

#

nvm

desert oar Jul 15, 2020, 7:43 PM

#

@modest rune the first step to optimizing code is to figure out where the slow part is

#

but yes vectorizing is usually good

#

pandas is pretty easy to use, it's often faster to use pure-numpy vectorized ops as opposed to pandas groupby (for example), but pure-numpy vectorized code can be really hard to read and write

silk axle Jul 15, 2020, 7:45 PM

#

## Resize image to 28x28x1
from skimage.transform import resize
new_image = new_image.flatten()
print("flattened")
resized_image = resize(new_image, (28, 28, 1))
print("resized")
img = plt.imshow(resized_image)

this seems to be freezing on the resize, not quite sure why

uncut shadow Jul 15, 2020, 7:46 PM

#

well, it's in 3 dimensions technically

silk axle Jul 15, 2020, 7:47 PM

#

Think it might just be because the image is so massive, just realised

#

It's like 1250x1600 o-o

uncut shadow Jul 15, 2020, 7:47 PM

#

oh, this might be the problem too lol

silk axle Jul 15, 2020, 7:50 PM

#

Gonna try with a smaller image (256x256), brb

modest rune Jul 15, 2020, 7:52 PM

#

@desert oar The past few days I have been using pyinstrument to identify the slow parts of my code. That has helped a lot. I have been able to reduce the time from 30 seconds per run down to 20. But, 12 seconds of the remaining 20 seconds is due to a custom equation I created that runs at the end of a 4 level nested for loop. I am pretty sure if I were to restructure my data into two dataframes and be smart about using vectorization, that I could get rid of most if not all of the looping.

My high level concern is: It seems silly to make a big dataframe from 20 medium sized dataframes, because there would be so much duplication in the data found in the columns. Should I throw that kind of thinking out the window?

And... the more I read about numpy, the more I think I start to understand... Numpy wants people to send as big of chunks of math as possible to it at a time, that way it can do as much math in C as possible before having to interact with python. And vectorization is a fancy way of saying, "Creative ways to do math problems with as many datapoints at a time as possible, so we don't have to go back to python to ask for more data as often."
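
That description can be illustrated with a tiny comparison. This is only a sketch with random made-up data; `prices` and `volumes` are hypothetical arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
prices = rng.random(10_000)
volumes = rng.integers(1, 1_000, size=10_000)

# Pure-Python loop: one interpreter round trip per element.
total_loop = 0.0
for price, volume in zip(prices, volumes):
    total_loop += price * volume

# Vectorized: the multiply and the sum each run as a single call into C.
total_vectorized = float((prices * volumes).sum())

print(total_loop, total_vectorized)
```

Both variants compute the same number (up to floating-point rounding); the vectorized one just hands numpy the whole problem at once.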

silk axle Jul 15, 2020, 7:52 PM

#

well, it's in 3 dimensions technically
@uncut shadow Can I officially make it 1d? Since my model can only take 1d images, or does the reshape do this?

uncut shadow Jul 15, 2020, 7:53 PM

#

well, I don't know, but if your model takes 1d inputs then you should try to turn it to 1d I think
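
A note on the terminology, offered as a guess at what the error means: "depth 1" most likely refers to a single colour channel (grayscale), not to a 1-D array. Under that assumption, the sketch below converts a hypothetical 56x56 RGB image to one channel and shrinks it with plain numpy. The block-averaging step only works because 56 is a multiple of 28.

```python
import numpy as np

# Hypothetical 56x56 RGB image with values in [0, 1].
rgb = np.random.default_rng(0).random((56, 56, 3))

# Collapse the 3 colour channels into 1 using common luminance weights.
gray = rgb @ np.array([0.2125, 0.7154, 0.0721])  # shape (56, 56)

# Shrink 56x56 to 28x28 by averaging each 2x2 block of pixels.
small = gray.reshape(28, 2, 28, 2).mean(axis=(1, 3))  # shape (28, 28)

# Add the trailing channel axis the model expects.
model_input = small[..., np.newaxis]  # shape (28, 28, 1)
print(model_input.shape)
```

For arbitrary image sizes, `skimage.color.rgb2gray` followed by `skimage.transform.resize(gray, (28, 28))` should do the same job more generally.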

rare ice Jul 15, 2020, 7:54 PM

#

Our development team is using Databricks and Databricks Notebooks (and using python in them). Any recommendations/suggestions on how to unit test the python code that is in the notebooks?

balmy kraken Jul 15, 2020, 7:56 PM

#

hello

#

could anyone explain exactly what i should look for to run updatable databases in the cloud?

#

thanks a lot

uncut shadow Jul 15, 2020, 7:57 PM

#

That's not the right channel to ask DB questions. You should probably ask it in #databases

balmy kraken Jul 15, 2020, 7:57 PM

#

thanks a lot

#

what do people here usually talk about? data engineering like pandas etc?

desert oar Jul 15, 2020, 7:58 PM

#

@modest rune yes that's exactly what numpy wants

balmy kraken Jul 15, 2020, 7:58 PM

#

lol

desert oar Jul 15, 2020, 7:58 PM

#

See the einsum function for an extreme example

#

Data duplication is a problem in 2 instances: 1) it makes your data too big for memory, or 2) it makes your computations slower

#

Note also that using pandas multiindex it doesn't actually "duplicate" individual data points in the index

modest rune Jul 15, 2020, 8:05 PM

#

@desert oar I assume since my data has a mix of datetime, float64 and ints, I would likely be happier with pandas, due to its flexibility, and also because i would like to be able to store other datatypes in my 2d arrays (like enums and maybe strings). I am leaning towards a mostly pandas implementation and only moving to pure numpy in specific situations where the time savings make it worth my time? thoughts?

#

the einsum function looked interesting. Not being a data scientist and with a weak background in math, it sounds like it allows someone to do complex math on matrices and arrays through the use of some type of subscript. Like... regex for matrix math.

desert oar Jul 15, 2020, 8:14 PM

#

Yep regex for matrixes haha
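
A few small `np.einsum` examples to make the "subscript" idea concrete. The arrays are arbitrary toy inputs.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)

# "ij,jk->ik": multiply over the shared index j and sum it away (matrix product).
product = np.einsum("ij,jk->ik", a, b)

# "ij->i": sum over j, leaving one value per row.
row_sums = np.einsum("ij->i", a)

# "ij,ij->": elementwise multiply, then sum everything down to a scalar.
total = np.einsum("ij,ij->", a, a)

print(product, row_sums, total)
```

Any index that appears on the left of `->` but not on the right gets summed over, which is the whole trick.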

#

And yes pandas is much nicer for mixed data types

modest rune Jul 15, 2020, 8:15 PM

#

Data duplication is a problem in 2 instances: 1) it makes your data too big for memory, or 2) it makes your computations slower
@desert oar

I already figured that much. But... that is not where my confusion lies... I am 95% sure (1) won't be a problem. Not sure about (2), I guess that depends on pandas and numpy. Is it safe to say that nested looping is USUALLY worse than large 2D arrays with data duplication when dealing with Pandas and Numpy? Or, phrased differently, if I take my 20 medium sized dataframes and make them one big dataframe, does that sound like a reasonable approach to my problem? Most of the time, I would not be doing something stupid if I do this? Is there something I missed, that I should research first before implementing?

desert oar Jul 15, 2020, 8:16 PM

#

Can you give an example of an operation you want to do

modest rune Jul 15, 2020, 8:17 PM

#

Umm, I'll try in pseudocode...

last agate Jul 15, 2020, 8:22 PM

#

How much math will I use if I want to do machine learning?

modest rune Jul 15, 2020, 8:25 PM

#

@desert oar

# this is what I do today
for expire_dates_df in options_df:
  for put_call_df in expire_dates_df:
    for strike_price_tuple in put_call_df:
      analysis = backTest(strike_price_tuple, stock_history_df)

# this is what I have in mind
big_options_df = flattenNestedDF(options_df)
analysis = backTest(big_options_df, stock_history_df)

The flattenNestedDF would hopefully leverage some fancy DataFrame constructor, if they have something that fits well. The data starts out as a bunch of nested dicts and lists.

The backtest function, would be full of vectorization, fancy indexing, and whatever else I would need.

#

I don't know if that is enough info for you to be able to provide advice.
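
A hedged sketch of what a `flattenNestedDF`-style step could look like, assuming the nesting is expiration -> option type -> strike -> option price. The real structure was not shown, so the `options` dict, the column names and the `moneyness` calculation are all hypothetical.

```python
import pandas as pd

# Hypothetical nesting: expiration -> option type -> strike -> option price.
options = {
    "2020-08-21": {
        "CALL": {100.0: 5.2, 105.0: 2.9},
        "PUT": {100.0: 4.8, 105.0: 7.1},
    },
    "2020-09-18": {
        "CALL": {100.0: 6.5},
        "PUT": {100.0: 6.0},
    },
}

# One row per contract, with the nesting levels turned into columns.
rows = [
    (expiration, option_type, strike, price)
    for expiration, by_type in options.items()
    for option_type, by_strike in by_type.items()
    for strike, price in by_strike.items()
]
big_options_df = (
    pd.DataFrame(rows, columns=["expiration", "type", "strike", "price"])
    .set_index(["expiration", "type", "strike"])
    .sort_index()
)

# One vectorized operation over every contract, no nested loops.
spot = 102.0  # hypothetical underlying price
strikes = big_options_df.index.get_level_values("strike").to_numpy()
big_options_df["moneyness"] = strikes / spot
print(big_options_df)
```

The nested loops run once, to build the rows; after that every calculation touches all contracts at once.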

frank bone Jul 15, 2020, 8:27 PM

#

can someone help me fix this syntax?

#

calling = os.system('awk 'NR==1; FNR==1{next} 1 '' +flattened_csv'')

#

SyntaxError: invalid syntax

desert oar Jul 15, 2020, 8:30 PM

#

don't use os.system

frank bone Jul 15, 2020, 8:30 PM

#

i tried subprocess it just doesnt work

#

os.system does what i want, tested it

#

just need to escape this 1 liner properly

desert oar Jul 15, 2020, 8:31 PM

#

subprocess.run(['awk', 'NR==1; FNR==1 {next} 1', flattened_csv])

#

the reason you should not use os.system is precisely because escaping is ass
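
To show why the list form sidesteps escaping, here is a small sketch that passes a deliberately awkward, hypothetical file name to a child process. It uses the Python interpreter itself as the child program so it runs anywhere; awk would receive its arguments the same way.

```python
import subprocess
import sys

# A hypothetical file name containing spaces and a quote, which would need
# careful escaping if it were pasted into a shell command string.
awkward_name = "my file's data.csv"

# With a list, each element reaches the program as exactly one argument;
# no shell is involved, so nothing needs escaping.
proc = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", awkward_name],
    stdout=subprocess.PIPE,
    encoding="utf-8",
    check=True,
)
echoed = proc.stdout.strip()
print(echoed)
```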

frank bone Jul 15, 2020, 8:31 PM

#

ill try that

#

it runs but i dont get the feedback

#

awk: fatal: cannot open file

#

with os it ran and printed back the echo

desert oar Jul 15, 2020, 8:34 PM

#

what is the file named?

#

that shouldn't be a problem

frank bone Jul 15, 2020, 8:34 PM

#

TICKER.DATE.csv

#

is the filename

desert oar Jul 15, 2020, 8:38 PM

#

import subprocess

flattened_csv = 'TICKER.DATE.csv'
subprocess.run(['awk', 'NR==1 FNR==1 {next} 1', flattened_csv])

this doesn't work?

#

also you might just want to use a shell script for this

#

why exactly are you using awk?

#

to concatenate the files but remove the header?

#

thats actually a nice usage

#

wait... do you need the semicolons between patterns? i forget

silk axle Jul 15, 2020, 8:42 PM

#

Before Resizing:

📎 unknown.png

frank bone Jul 15, 2020, 8:42 PM

#

import subprocess

flattened_csv = 'TICKER.DATE.csv'
subprocess.run(['awk', 'NR==1 FNR==1 {next} 1', flattened_csv])

this doesn't work?
@desert oar trying

silk axle Jul 15, 2020, 8:42 PM

#

After Resizing:

📎 unknown.png

#

How can I make it so you don't entirely lose the number through the resize?

frank bone Jul 15, 2020, 8:44 PM

#

@desert oar getting this awk: cmd. line:1: NR==1 FNR==1 {next} 1 awk: cmd. line:1: ^ syntax error

desert oar Jul 15, 2020, 8:44 PM

#

probably my bad awk then

#

awk_script = '''
NR == 1
FNR == 1 { next }
1
'''

flattened_csv = 'TICKER.DATE.csv'

subprocess.run(['awk', awk_script, flattened_csv])

#

idk something like that

#

i used to be an awk wizard

#

i havent touched it in years

frank bone Jul 15, 2020, 8:45 PM

#

os.system('awk \'NR==1; FNR==1{next} 1 \'' +flattened_csv'') this actually worked if i passed path directly, but i cant seem to get it to work when i want to pass path from variable like this

#

idk something like that
@desert oar awk: fatal: cannot open file

desert oar Jul 15, 2020, 8:47 PM

#

...that file exists?

#

in your current working dir?

frank bone Jul 15, 2020, 8:47 PM

#

somebody please help me escape this 😦 os.system('awk \'NR==1; FNR==1{next} 1 \'' +flattened_csv'')

#

yes it does, well im actually passing an absolute path

desert oar Jul 15, 2020, 8:48 PM

#

os.system("""awk 'NR == 1; FNR == 1 {next} 1' """ + flattened_csv)

but i really really really cannot imagine why this would work and the subprocess version wouldn't

#

you can even just use " instead of """

#

and i really really do not recommend "use something known to be broken and unreliable" instead of "figure out the minor issue with the better tool"

frank bone Jul 15, 2020, 8:51 PM

#

going crazy nothing works yet

desert oar Jul 15, 2020, 8:52 PM

#

it really just sounds like the file isn't where you think it is

#

hold on

#

wait

#

hold on

#

are you trying to emit this filename?

#

FNR suggests you're concatenating multiple files

frank bone Jul 15, 2020, 8:53 PM

#

im trying to concatenate multiple csv into a variable

desert oar Jul 15, 2020, 8:53 PM

#

what do you mean "into a variable"

frank bone Jul 15, 2020, 8:54 PM

#

into a python variable

#

it works when i pass the paths directly

desert oar Jul 15, 2020, 8:54 PM

#

you want to get the csv data in a python variable?

#

i see

frank bone Jul 15, 2020, 8:54 PM

#

but i cant escape the above

desert oar Jul 15, 2020, 8:54 PM

#

and you intend to read that with pandas?

frank bone Jul 15, 2020, 8:55 PM

#

yes

#

wanna compare if theres a speed difference this way

#

this works: calling = os.system('awk \'NR==1; FNR==1{next} 1\' /PATH/TO/CSV/file1.csv /PATH/TO/CSV/file2.csv') print(calling)

#

but obviously i wanted to provide the multiple paths with my variable flattened_csv

desert oar Jul 15, 2020, 8:56 PM

#

import io
import subprocess

import pandas as pd

filenames = ['file1.csv', 'file2.csv']
awk_script = """
NR == 1 && FNR > 1 {next}
1
"""

proc = subprocess.run(['awk', awk_script, *filenames], stdout=subprocess.PIPE, encoding='utf-8')
data = pd.read_csv(io.StringIO(proc.stdout))

#

this should work

frank bone Jul 15, 2020, 8:57 PM

#

thanks 😄 ill try

desert oar Jul 15, 2020, 8:58 PM

#

i don't know why awk thinks your files are missing

#

@modest rune what is options_df?

frank bone Jul 15, 2020, 9:04 PM

#

pandas.errors.EmptyDataError: No columns to parse from file

desert oar Jul 15, 2020, 9:06 PM

#

make sure my awk code isn't broken

#

oh yeah it is

#

awk_script = """
NR == 1
FNR == 1 {next}
1
"""

#

use your old awk

#

its correct

frank bone Jul 15, 2020, 9:08 PM

#

same error i hate escape chars 😦

desert oar Jul 15, 2020, 9:08 PM

#

what error

#

this isnt a matter of escaping

#

you need to see what awk is outputting

#

maybe its NR != 1 && FNR == 1 {next}

modest rune Jul 15, 2020, 9:10 PM

#

@desert oar options_df is a large data structure that contains all of the stock's underlying options data. I don't think I accurately represented it in my pseudocode because I was focusing on getting the idea of the nested loops across. If you think it would be helpful, I could do a better job of showing the structure of the data (correct data types and exact nesting order), but it will take me some time.

frank bone Jul 15, 2020, 9:11 PM

#

awk: warning: command line argument /' is a directory: skipped awk: fatal: cannot open file h' for reading (No such file or directory)

#

it splits the path with /

#

every single character

desert oar Jul 15, 2020, 9:11 PM

#

filenames is a list

#

if you passed a string it will fail

#

the * unpacks list arguments

#

so you probably passed a string
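
The difference is easy to see in isolation. A tiny sketch with a hypothetical path:

```python
path = "/home/data.csv"  # hypothetical path

# Unpacking a string yields one argument per character.
args_from_string = ["awk", "1", *path]

# Unpacking a list yields one argument per file name.
args_from_list = ["awk", "1", *[path]]

print(args_from_string)
print(args_from_list)
```

In the first case awk is handed "/", "h", "o", ... as separate file names, which matches the "is a directory" and "cannot open file" messages above.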

frank bone Jul 15, 2020, 9:11 PM

#

ohhh damn

#

true

#

hell yeah

#

finally

#

thanks man 😄

#

would you know how to do exactly this but with only certain columns?

#

say columns 5 and 6

#

with the awk script

#

ah nvm not necessary

#

ill just do that in read_csv

#

lol its way slower when i separately concat files first and then make the df

#

but it was worth a try 😄 @marble jasper unless i did it completely wrong

marble jasper Jul 15, 2020, 9:29 PM

#

have you tried concatenating the strings in python

#

just open() and read(), and concat strings

frank bone Jul 15, 2020, 9:30 PM

#

what strings?

marble jasper Jul 15, 2020, 9:30 PM

#

your CSV file

#

read as a string, concat string

frank bone Jul 15, 2020, 9:31 PM

#

but how would i keep only 1x header?

#

did it with 1 liner awk command called from python

#

awk 'NR==1; FNR==1{next} 1' *.csv

marble jasper Jul 15, 2020, 9:32 PM

#

is awk faster (bearing in mind subprocess overhead) or file read in python?

#

I don't know the answer to that

#

you can have pandas strip out the header also after the fact

lapis sequoia Jul 15, 2020, 9:33 PM

#

hi, i need help with this error :

📎 yuu.PNG

frank bone Jul 15, 2020, 9:34 PM

#

is awk faster (bearing in mind subprocess overhead) or file read in python?
@marble jasper dont know, but pandas concat vs awk concat for 9400 files was 11s vs 23s

marble jasper Jul 15, 2020, 9:35 PM

#

maybe try python reading the file rather than piping stuff around?

frank bone Jul 15, 2020, 9:35 PM

#

idk why python would be faster than pandas though, but ill try

marble jasper Jul 15, 2020, 9:36 PM

#

because you're incurring multiple csv-dataframe conversions, and multiple dataframe concat operations

#

dataframe concat is slower than string concat, and doing csv-dataframe conversion once is less overhead than n times

#

it's just inconceivable that doing n csv-dataframe conversions is faster than doing one conversion of data n times the size

#

and that's the basis for my original suggestion, which remains my only suggestion

#

I might be wrong, but it hasn't been tested yet, and until then, saying "I don't know why it would be faster" to me sounds like an opportunity you're missing

frank bone Jul 15, 2020, 9:39 PM

#

ill try for sure just trying to find syntax 😄 all im finding on google is pandas lol

marble jasper Jul 15, 2020, 9:40 PM

#

the most compact version, which might be slower than the alternative is simply:

all_data = []
for file in files:
  all_data.extend(open(file).readlines()[1:])

#

this does no checks and is generally not clean. I don't recommend for production code, but for the purpose of testing an idea, try it

desert oar Jul 15, 2020, 9:40 PM

#

i still feel like this is optimizing the wrong thing

#

i/o is i/o

#

text processing is text processing

marble jasper Jul 15, 2020, 9:41 PM

#

it saves on the dataframe concat

desert oar Jul 15, 2020, 9:41 PM

#

you're trying to optimize the inside of the loop to gain what, 1-2 seconds?

#

just parallelize the whole thing

#

fuck it

marble jasper Jul 15, 2020, 9:41 PM

#

oh, you should do that as well

desert oar Jul 15, 2020, 9:41 PM

#

save yourself literally 2-4x

frank bone Jul 15, 2020, 9:41 PM

#

the most compact version, which might be slower than the alternative is simply:
all_data = []
for file in files:
  all_data.extend(open(file).readlines()[1:])

@marble jasper found one, would this be better?

import csv

allColumns = []
for dataFileName in [ 'a.csv', 'b.csv', 'c.csv' ]:
    with open(dataFileName) as dataFile:
        fileColumns = zip(*list(csv.reader(dataFile, delimiter=' ')))
        allColumns += fileColumns

allRows = zip(*allColumns)

with open('combined.csv', 'w') as resultFile:
    writer = csv.writer(resultFile, delimiter=' ')
    for row in allRows:
        writer.writerow(row)

marble jasper Jul 15, 2020, 9:41 PM

#

no, why are you reading the csv?

desert oar Jul 15, 2020, 9:41 PM

#

spending the most effort on the smallest gains

#

the python csv library will be much much much slower than pandas

marble jasper Jul 15, 2020, 9:42 PM

#

csv is just a text file with each row on its own line

desert oar Jul 15, 2020, 9:42 PM

#

@lapis sequoia show your code

marble jasper Jul 15, 2020, 9:42 PM

#

you can concatenate these together as text

#

why would you need to parse the CSV data to do the concatenation?

frank bone Jul 15, 2020, 9:42 PM

#

alright ill test ur code

marble jasper Jul 15, 2020, 9:42 PM

#

you can do that as strings, saving you all the pandas dataframe concatenation time, which your tests previously showed is longest

#

also the code above doesn't preserve the first row heading

desert oar Jul 15, 2020, 9:43 PM

#

string concatenation is slow too

#

FYI

#

did you see that looping over DFs wasn't any faster?

#

although i now suspect that they made a mistake

marble jasper Jul 15, 2020, 9:43 PM

#

it's possible

lapis sequoia Jul 15, 2020, 9:43 PM

#

@desert oar result = df_cat.groupby('variable')
#df1 = [df_cat.get_group(x) for x in df_cat.groups]
cities = [variable for variable, df in df_cat.groupby('variable')]
sns.catplot(x = 'variable', y = result[value], col= 'cardio', data = df_cat, kind = 'bar')

desert oar Jul 15, 2020, 9:44 PM

#

@marble jasper

sum((df['a'] * df['b']).sum() for df in df_list)

i was very very very surprised to find that this was slower than pd.concat

#

to the point where i suspect they just did something wrong

marble jasper Jul 15, 2020, 9:44 PM

#

I'm not suggesting doing any concatenation in pandas

desert oar Jul 15, 2020, 9:44 PM

#

right

#

but im saying who needs to concatenate

marble jasper Jul 15, 2020, 9:44 PM

#

he's literally stacking a bunch of data together

desert oar Jul 15, 2020, 9:44 PM

#

and im saying dont stack

#

just do the one operation you need to do

#

which is that ^

marble jasper Jul 15, 2020, 9:45 PM

#

I agree with you on the parallelization, which he should do as well for each symbol

desert oar Jul 15, 2020, 9:45 PM

#

their current code is something like

df_cat = pd.concat(df_list)
(df_cat['a'] * df_cat['b']).sum()

#

so im saying dont bother w/ the concat

#

just add up the individual sums
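
A quick check that the two variants agree, using two tiny hypothetical frames with the same placeholder column names `a` and `b`:

```python
import pandas as pd

# Hypothetical per-file frames.
df_list = [
    pd.DataFrame({"a": [10.0, 11.0], "b": [100, 200]}),
    pd.DataFrame({"a": [12.0], "b": [300]}),
]

# Variant 1: concatenate, then reduce once.
df_cat = pd.concat(df_list, ignore_index=True)
via_concat = (df_cat["a"] * df_cat["b"]).sum()

# Variant 2: reduce each frame, then add up the partial sums.
via_partial_sums = sum((df["a"] * df["b"]).sum() for df in df_list)

print(via_concat, via_partial_sums)
```

This only shows the results match; which variant is faster on thousands of small files is something to measure rather than assume.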

marble jasper Jul 15, 2020, 9:45 PM

#

but you can't get around having to concat the data together at some point, because each set of n files needs to become a single dataframe for the time series

desert oar Jul 15, 2020, 9:45 PM

#

in general yes

#

in this case no

#

because they're just computing a sum and returning

#

and they said the sum code was slower than concatenating

marble jasper Jul 15, 2020, 9:45 PM

#

what

desert oar Jul 15, 2020, 9:45 PM

#

but i find that extremely hard to believe

marble jasper Jul 15, 2020, 9:45 PM

#

he doesn't need the time series?

desert oar Jul 15, 2020, 9:46 PM

#

thats what he told me

marble jasper Jul 15, 2020, 9:46 PM

#

well this was a waste of time

frank bone Jul 15, 2020, 9:46 PM

#

i do need time series

#

it needs to be concatenated at some point

desert oar Jul 15, 2020, 9:46 PM

#

def main():
    ticker_filter = []
    thistuple = []

    for root, dirs, files in min_depth(os.walk('/home/computer/Documents/Stocks/'), depth=4):
        if not dirs:
            path = root
        ticker = os.path.basename(path)
        date = PurePath(path).parts[6]
        if (len(ticker_filter) > 0) and ((ticker in ticker_filter) == False):
            continue
        csv_files = glob.glob(os.path.join(path, "*.csv"))
        df_from_each_file = (pd.read_csv(f, usecols=[5, 6]) for f in csv_files)
        concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
        dollar_volume = (concatenated_df['Price'] * concatenated_df['Volume']).sum()
        thistuple.append(tuple((str(ticker), str(date), int(dollar_volume))))

    return(thistuple)
main()

frank bone Jul 15, 2020, 9:46 PM

#

whether i do summing first or after doesnt matter?

marble jasper Jul 15, 2020, 9:46 PM

#

ok then, so you need to concat, and I suggest concat either the string contents of CSV, or as a list

desert oar Jul 15, 2020, 9:46 PM

#

you literally only use concatenated_df to compute dollar_volume

#

by the time you read a file and concat strings

#

you can pandas csv read it

#

i seriously doubt that will be faster

#

i mean try it

#

but i would be surprised if that beats read_csv then concat

marble jasper Jul 15, 2020, 9:51 PM

#

all_data = []
for idx, file in enumerate(csv_files):
  all_data.extend(open(file).readlines()[(1 if idx > 0 else 0):])

data = StringIO("\n".join(all_data))
concatenated_df = pd.read_csv(data, usecols=[5,6])

#

something like that. I'm not happy with the use of readlines, but at least this'll tell you directionally whether this is any good

#

these lines replace the following in your existing code:

df_from_each_file = (pd.read_csv(f, usecols=[5, 6]) for f in csv_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)

#

it does have an additional StringIO, which I'm not a fan of, but I don't see a way for pandas to import from string

desert oar Jul 15, 2020, 9:53 PM

#

im very very curious if thats faster than the concat

frank bone Jul 15, 2020, 9:54 PM

#

trying

marble jasper Jul 15, 2020, 9:54 PM

#

me too. I don't know for sure, but this hasn't been tried yet

frank bone Jul 15, 2020, 9:55 PM

#

stringio is not defined? not from io lib?

marble jasper Jul 15, 2020, 9:55 PM

#

from io import StringIO

frank bone Jul 15, 2020, 9:59 PM

#

wow this one's much faster

#

4.8s vs 11s

#

thanks 😄

marble jasper Jul 15, 2020, 9:59 PM

#

mm hm, now parallelize all the symbols

#

pandas concat is slow because of all the extra memory allocation and copying that it does, it's basically a deep copy every time

#

while concatenating lists also allocates memory, it's still much faster

#

the method is less robust though - you're making certain assumptions about the CSV structure. and that csv reading code can fail under a lot of different circumstances when the CSV structure isn't what you expect, while doing a read_csv on each file is probably more reliable

frank bone Jul 15, 2020, 10:05 PM

#

quality of this dataset is incredibly high

#

so i hope there wont be hiccups

#

mm hm, now parallelize all the symbols
@marble jasper is this a difficult process? im repeating myself here but my hello world was 3 days ago 😄 for everything you do in 5 minutes ill need 1 hour or more 😄

#

@desert oar linked some code earlier, still figuring out 😄

marble jasper Jul 15, 2020, 10:11 PM

#

python is one of the easier languages to do multiprocessing in

frank bone Jul 15, 2020, 10:17 PM

#

@marble jasper what processing module would you recommend?

#

i think salt recommended "concurrent.futures"?

marble jasper Jul 15, 2020, 10:18 PM

#

yes that's good

#

the other option, and this may be slightly easier to understand is the lower-level multiprocessing library, since there are some relatively easy to understand examples out there using queue-based multiprocessing

#

but concurrent.futures is higher level. also has good examples

frank bone Jul 15, 2020, 10:19 PM

#

so the thing to multiprocess here is the concat

#

just trying to figure out how

#

ill watch a youtube vid or two

marble jasper Jul 15, 2020, 10:21 PM

#

I was suggesting multiprocessing the stock symbols

#

so each stock symbol was being processed in parallel, but within each, it was reading the files as it is now
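
A minimal sketch of that idea with concurrent.futures, assuming one directory per ticker and CSVs that have "Price" and "Volume" columns (the layout and both function names are hypothetical, not from the code above). Each worker sums Price * Volume file by file, so no concat is needed inside a ticker:

```python
import glob
import os
from concurrent.futures import ProcessPoolExecutor

import pandas as pd


def dollar_volume_for_ticker(ticker_dir):
    """Sum Price * Volume over every CSV in one ticker's directory."""
    total = 0.0
    for csv_path in sorted(glob.glob(os.path.join(ticker_dir, "*.csv"))):
        df = pd.read_csv(csv_path, usecols=["Price", "Volume"])
        total += (df["Price"] * df["Volume"]).sum()
    ticker = os.path.basename(os.path.normpath(ticker_dir))
    return ticker, total


def dollar_volume_all(ticker_dirs, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Run one task per ticker directory and return {ticker: total}."""
    with executor_cls(max_workers=max_workers) as pool:
        return dict(pool.map(dollar_volume_for_ticker, ticker_dirs))
```

With a process pool, call dollar_volume_all from under an `if __name__ == "__main__":` guard so worker processes can import the module cleanly. Whether this beats the single-process version depends on how much of the time is disk versus parsing, so it is worth timing both.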

frank bone Jul 15, 2020, 10:22 PM

#

ahhh i see

#

btw for further manipulation of the data how would you store it? think pandas multiindex is fine?

marble jasper Jul 15, 2020, 10:23 PM

#

not sure, I haven't had to store pandas dataframes much

frank bone Jul 15, 2020, 10:24 PM

#

right now im saving them within tuples in lists, just wondering how easily accessible and processable especially in a time series this is

#

putting them in multiindex takes some time as well but oh well cant have everything

#

📎 unknown.png

#

know why this "NaN" appears?

modest blaze Jul 15, 2020, 10:36 PM

#

Hey all, I'm looking for someone familiar in GIS, and Matplotlib to help me solve a problem I'm working on, this is a paid 1 time gig...

marble jasper Jul 15, 2020, 10:41 PM

#

90% of the time the answer is: convert to UTM and do the calculation in meters

modern canyon Jul 15, 2020, 11:13 PM

#

Hello y'all, I'm trying to calculate cosine similarity of a sparse matrix
<63671x30 sparse matrix of type '<class 'numpy.uint8'>'
with 131941 stored elements in Compressed Sparse Row format>

#

I got a memory error, so I tried messing with the paging file and now it just crashes my whole PC when I try to calculate

#

any idea as to how to overcome this?

mellow spruce Jul 15, 2020, 11:34 PM

#

hello all, I have a table with two columns and I want to append the values of the columns to a list of dictionaries. I am trying to do it by [{'key1':i} for i in table['column1}, {'key2':j} for j in table['column2']] but it's giving me a syntax error, any ideas?

serene scaffold Jul 16, 2020, 12:05 AM

#

@mellow spruce look at table['column1}

#

there's a quote that isn't closed and a curly brace that isn't opened.

queen barn Jul 16, 2020, 12:07 AM

#

Hey I'm trying to use PyCharm with Anaconda, and for whatever reason, the two don't seem to be communicating with each other. PyCharm isn't showing up in the Anaconda dashboard like it should, and the Pycharm interpreter isn't benefiting from the packages I've installed through the Anaconda Navigator. Am I missing a setting or something?

mellow spruce Jul 16, 2020, 12:08 AM

#

@mellow spruce look at table['column1}
@serene scaffold That was a typo when writing the question. The actual code doesn't have that typo

serene scaffold Jul 16, 2020, 12:09 AM

#

okay, what is the real code?

#

also, it would be good to see which part of the line it was pointing to for the syntax error in the message.

mellow spruce Jul 16, 2020, 12:22 AM

#

okay, what is the real code?
@serene scaffold This is a function that takes as an input when the user clicks and a list of dictionaries like [{'key':'value'}] this is the code `def table_output(nclicks, data):

if nclicks:

    table=Table().with_columns([

        'col1',[],

        'col2',[]])

    for dat in data:

        value=int(dat['key2'])

        tt=Table.read_table('routing_'+dat['key1']+'.csv')

        pt=tt.select('col1')

        pt=pt.with_column('col2',np.random.rand(pt.num_rows))

        pt['col2']=pt['col 2']*value

        table.append(pt)

    table.group('col1', collect=sum)

    return [{'key1':i} for i in table['column1}, {'key2':j} for j in table['column2']]`                                                                                                  key 1 and key 2 are used to construct  table rows in plotly dash data tables  I tried

serene scaffold Jul 16, 2020, 12:22 AM

#

'column1}

#

this is still there

mellow spruce Jul 16, 2020, 12:23 AM

#

'column1}
@serene scaffold haha I copied from my question again, sorry. I am sure it is not like that in the code, I am just using two computers

serene scaffold Jul 16, 2020, 12:23 AM

#

Okay, well let me know what it looks like in the code that's giving you the syntax error.

mellow spruce Jul 16, 2020, 12:24 AM

#

I tried list1=[]

    list2=[]

    for i in table['col1']:

        list1.append({'key1':i})

    for j in table['col2']:

       time_list.append({'key 2':j})`

serene scaffold Jul 16, 2020, 12:25 AM

#

so one of these has the syntax error?

#

this looks fine to me, syntactically.

#

assuming that time_list is defined somewhere

mellow spruce Jul 16, 2020, 12:27 AM

#

` if nclicks:

    table=Table().with_columns([

        'col1',[],

        'col2',[]])

    for dat in data:

        value=int(dat['key2'])

        tt=Table.read_table('routing_'+dat['key1']+'.csv')

        pt=tt.select('col1')

        pt=pt.with_column('col2',np.random.rand(pt.num_rows))

        pt['col2']=pt['col 2']*value

        table.append(pt)

    table.group('col1', collect=sum)

    return [{'key1':i} for i in table['column1], {'key2':j} for j in table['column2']]  `

#

in this line return [{'key1':i} for i in table['column1], {'key2':j} for j in table['column2']]

#

assuming that time_list is defined somewhere
@serene scaffold it is, just changing a lot of names is hard

serene scaffold Jul 16, 2020, 12:29 AM

#

table['column1] there's no close quote

#

if you're using a code-oriented text editor, it might have a feature to automatically rename your variables.

mellow spruce Jul 16, 2020, 12:30 AM

#

Still throws the error, I am using spyder I don't think it edits the code

serene scaffold Jul 16, 2020, 12:30 AM

#

I assume Spyder would have that feature

#

what did you change the line to be, and what error message?

mellow spruce Jul 16, 2020, 12:32 AM

#

return [{'key1':i} for i in table['column1'], {'key2':j} for j in table['column2']] gives me a syntax error near the comma, here: table['column1'], {'key2':j}

serene scaffold Jul 16, 2020, 12:32 AM

#

aha

#

looks like you're trying to make two lists, basically

#

see how {'key1':i} for i in table['column1'] and {'key2':j} for j in table['column2'] pretty much stand on their own?

#

you could make them both separately and use the plus operator to join them

#

so return [{'key1':i} for i in table['column1']] + [{'key2':j} for j in table['column2']]

mellow spruce Jul 16, 2020, 12:34 AM

#

Thank youuu! I will try that

serene scaffold Jul 16, 2020, 12:34 AM

#

No problem.

mellow spruce Jul 16, 2020, 12:36 AM

#

No problem.
@serene scaffold It works however, since it's two lists it does not append on the same row

serene scaffold Jul 16, 2020, 12:36 AM

#

append on the same row?

mellow spruce Jul 16, 2020, 12:38 AM

#

yeas like when I return that to the data for the table it displays the table like this Col 1 | Col 2 val | val | | val

#

fuck sorry for the formatting

serene scaffold Jul 16, 2020, 12:39 AM

#

you can use three backticks on either side to get more consistent formatting.

#

also works for Python code

#

!code

arctic wedgeBOT Jul 16, 2020, 12:39 AM

#

Discord has support for Markdown, which allows you to post code with full syntax highlighting. Please use these whenever you paste code, as this helps improve the legibility and makes it easier for us to help you.

To do this, use the following method:

```python
print('Hello world!')
```

Note:
• These are backticks, not quotes. Backticks can usually be found on the tilde key.
• You can also use py as the language instead of python
• The language must be on the first line next to the backticks with no space between them

This will result in the following:

print('Hello world!')

mellow spruce Jul 16, 2020, 12:41 AM

#

     val |                                                      
      val|                                                      
          | val ``` like it append all the list  1  and then all the list 2 instead of putting them on the same row

serene scaffold Jul 16, 2020, 12:54 AM

#

Sorry, I'm not sure that I'm following

#

one moment

#

@mellow spruce so list1 and list2 are columns or rows in the table you want to make?

mellow spruce Jul 16, 2020, 12:59 AM

#

list 1 and list 2 are rows. What I have right now is the table, and what I want to return from the function is something that fits this description: data (list of dicts; optional): The contents of the table. The keys of each item in data should match the column IDs. Each item can also have an 'id' key, whose value is its row ID. If there is a column with ID='id' this will display the row ID, otherwise it is just used to reference the row for selections, filtering, etc. Example: [ {'column-1': 4.5, 'column-2': 'montreal', 'column-3': 'canada'}, {'column-1': 8, 'column-2': 'boston', 'column-3': 'america'} ]

#

So "key" is the column name and the for loop gives the values I want to use as the rows of those columns

#

further reference https://dash.plotly.com/datatable/reference

Reference | Dash for Python Documentation | Plotly

A comprehensive list of all of the
DataTable properties.

serene scaffold Jul 16, 2020, 1:01 AM

#

huh interesting

#

so if list 1 and 2 are rows, wouldn't you need to transform them into dicts where each element of the lists are mapped to the right key?

mellow spruce Jul 16, 2020, 1:04 AM

#

I mean it does not have to be two lists. The table I have from the function has the columns together. Wouldn't that be easier

#

?

#


```py
table=Table().with_columns([
    'col1',[],
    'col2',[]])
for dat in data:
    value=int(dat['key2'])
    tt=Table.read_table('routing_'+dat['key1']+'.csv')
    pt=tt.select('col1')
    pt=pt.with_column('col2',np.random.rand(pt.num_rows))
    pt['col2']=pt['col 2']*value
    table.append(pt)
table.group('col1', collect=sum)
```

#

up to here is solid I just don't know what to return or how

serene scaffold Jul 16, 2020, 1:06 AM

#

return [{'key1':i} for i in table['column1']] + [{'key2':j} for j in table['column2']]

#

this is what I suggested before. It returns one list

#

So what about that one list isn't what you wanted?

mellow spruce Jul 16, 2020, 1:08 AM

#

let me take a screenshot

serene scaffold Jul 16, 2020, 1:10 AM

#

if it's code, pasting it is more helpful.

#

or any text, really

mellow spruce Jul 16, 2020, 1:11 AM

#

It does not throw me an error so it's not the code it's how it is rendered

#

Like all the elements of one list first and then all the elements of the second list

serene scaffold Jul 16, 2020, 1:20 AM

#

I can't really tell what I'm looking at

#

return [{'key1': i, 'key2': j} for i, j in zip(table['column1'], table['column2'])]

#

tell me if that works.

#

I'm just guessing.

mellow spruce Jul 16, 2020, 1:25 AM

#

That did the trick. Thank youu for helping me through my mess

#

You are the python master @serene scaffold

serene scaffold Jul 16, 2020, 1:31 AM

#

@mellow spruce I don't know that that's true, but I appreciate you saying so

#

I figured that you were actually trying to create one list of dictionaries that each had two entries, rather than two lists of dicts with one entry each

#

so zip gives you items from two iterables at the same time. Can you see how that works?
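
A small sketch of the difference between the two return statements, using made-up column values in place of table['column1'] and table['column2'] (col1 and col2 are stand-ins, not names from the original code):

```python
col1 = ["a", "b", "c"]
col2 = [1, 2, 3]

# Two comprehensions joined with +: six dicts with one key each,
# so a data table renders all of col1 first and then all of col2.
separate = [{"key1": i} for i in col1] + [{"key2": j} for j in col2]

# zip pairs up the i-th items: three dicts with two keys each,
# which is one dict per table row.
rows = [{"key1": i, "key2": j} for i, j in zip(col1, col2)]

print(separate)
print(rows)
```

The second shape is the one the Dash DataTable description quoted earlier asks for, where each item in the list carries every column for one row.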

mellow spruce Jul 16, 2020, 2:59 AM

#

so zip gives you items from two iterables at the same time. Can you see how that works?
@serene scaffold it worked perfectly

serene scaffold Jul 16, 2020, 3:00 AM

#

Yay! But do you understand why it worked?

modest rune Jul 16, 2020, 4:17 AM

#

I need to do something very simple in pandas... but I think due to my ignorance of linear algebra, I have no idea what type of math is this called which is making even googling for the answer hard. I wrote a mockup of what I want to do. Any help would be greatly appreciated.

import pandas as pd

gain_scenarios = pd.Series({'a':0.34, 'b':0.21, 'c':0.56, 'e':0.11})

stock_data=pd.DataFrame(columns= ['Ticker', 'Shares', 'Cost_Per_Share'],
                           data=[['NFLX'  , 100     , 0.10            ],
                                 ['AAPL'  , 150     , 0.20            ],
                                 ['GOOG'  , 500     , 5.10            ],
                                 ['F'     , 70      , 7.10            ],
                                 ['BKSR'  , 130     , 0.90            ],
                                 ['AMZN'  , 90      , 5.10            ]])

# Run some code that calculates:
# stock_data['Shares'] * (1 + gain_percent) - stock_data['Shares'] * stock_data['Cost_Per_Share']

Desired_Output_df=pd.DataFrame(columns= ['NFLX', 'AAPL', 'GOOG',    'F',  'BKSR',  'AMZN'],
                                  data=[[   124,    171,  -1880, -403.2,  -995.8,  -338.4],  #a
                                        [   111,  151.5,  -1945, -412.3, -1012.7,  -350.1],  #b
                                        [   146,    204,  -1770, -387.8,  -967.2,  -318.6],  #c
                                        [   101,  136.5,  -1995, -419.3, -1025.7,  -359.1]]) #e

turbid oyster Jul 16, 2020, 4:22 AM

#

not sure why you need linear algebra for that

modest rune Jul 16, 2020, 4:22 AM

#

Is it not a matrix multiplication?

turbid oyster Jul 16, 2020, 4:23 AM

#

you could set it up as one , but that's probably more clever than useful

#

you've got the equation you need, just write the function and map across

modest rune Jul 16, 2020, 4:24 AM

#

I think map across is maybe what I need to google to learn about?

turbid oyster Jul 16, 2020, 4:24 AM

#

a for loop

modest rune Jul 16, 2020, 4:25 AM

#

I am trying very very hard not to use any for loops... for performance sake. I already solved this problem using for loops and my code was too slow.

turbid oyster Jul 16, 2020, 4:25 AM

#

did you try using pandas .apply() method ?

modest rune Jul 16, 2020, 4:25 AM

#

This is the last part... so far, my code has gone from taking 30 seconds to run down to 0.1 seconds

#

I did... it didn't help speed things up, because I was essentially looping inside the apply function I wrote. Also, I found a document explaining that apply is sometimes better, but if you can implement a vectorized solution the speedup is usually an order of magnitude greater.

#

I do think the proper matrix multiplication is the correct vectorized approach... I just am clueless about all of that.

#

I'd even hazard a guess that it is the .dot() function that I need.

turbid oyster Jul 16, 2020, 4:32 AM

#

this is completely linear though ?

mellow spruce Jul 16, 2020, 4:33 AM

#

Yay! But do you understand why it worked?
@serene scaffold i did. I had never used zip before tho so I’m not sure what that does

turbid oyster Jul 16, 2020, 4:34 AM

#

def output(df, gain):
    df_out = df['Shares'] * (1 + gain) - df['Shares'] * df['Cost_Per_Share']
    return df_out

^ this is a vectorized pandas function

#

its all you need

#

you just loop over the gain_percentages

serene scaffold Jul 16, 2020, 4:35 AM

#

!e

latin = ['a', 'b', 'c']
greek = ['alpha', 'beta', 'gamma']
for l, g in zip(latin, greek):
    print(l, g)

arctic wedgeBOT Jul 16, 2020, 4:35 AM

#

@serene scaffold :white_check_mark: Your eval job has completed with return code 0.

001 | a alpha
002 | b beta
003 | c gamma

modest rune Jul 16, 2020, 4:35 AM

#

If I am looping over the gain_percentages, that is not as vectorized as it can be.

serene scaffold Jul 16, 2020, 4:35 AM

#

it iterates over two or more iterables at the same time.

mellow spruce Jul 16, 2020, 4:35 AM

#

Ohhh okayy i get it now! Thankss

modest rune Jul 16, 2020, 4:36 AM

#

My actual dataset is much larger, so, avoiding the loop is important

turbid oyster Jul 16, 2020, 4:36 AM

#

just run it in parallel ?

modest rune Jul 16, 2020, 4:36 AM

#

and my actual equation has a lot more going on.

#

I really think there is a matrix multiplication solution that will be more elegant and more efficient.

#

I am reading something akin to 'Matrix Math for Dummies' right now... hoping to figure something out.

turbid oyster Jul 16, 2020, 4:39 AM

#

look at linear systems for matrices

modest rune Jul 16, 2020, 6:11 AM

#

@turbid oyster
figured it out. Pretty sure there is a cleaner way to write this though.

numpy_matrix = np.dot((gain_scenarios.values + 1).reshape(-1, 1),
                      stock_data['Shares'].values.reshape(1, -1))
matrix_df = pd.DataFrame(numpy_matrix)
matrix_df = matrix_df - stock_data['Shares']*stock_data['Cost_Per_Share']
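
One possibly cleaner way to write the same calculation, as a sketch: np.outer builds the scenarios-by-tickers grid directly, and the per-ticker cost broadcasts across the rows. It reuses the mockup data from the question; the variable names shares, cost, values and result are only for illustration.

```python
import numpy as np
import pandas as pd

gain_scenarios = pd.Series({"a": 0.34, "b": 0.21, "c": 0.56, "e": 0.11})
stock_data = pd.DataFrame(
    columns=["Ticker", "Shares", "Cost_Per_Share"],
    data=[["NFLX", 100, 0.10], ["AAPL", 150, 0.20], ["GOOG", 500, 5.10],
          ["F", 70, 7.10], ["BKSR", 130, 0.90], ["AMZN", 90, 5.10]],
)

shares = stock_data["Shares"].to_numpy()
cost = (stock_data["Shares"] * stock_data["Cost_Per_Share"]).to_numpy()

# Outer product: row i is (1 + gain_i) * shares, one column per ticker.
# Subtracting the 1-D cost array broadcasts it across every row.
values = np.outer(1 + gain_scenarios.to_numpy(), shares) - cost

# Put the labels back so rows are scenarios and columns are tickers
result = pd.DataFrame(values, index=gain_scenarios.index,
                      columns=stock_data["Ticker"])
print(result.round(1))
```

There is no Python-level loop over the scenarios here, so it should scale the same way as the np.dot version while keeping the scenario and ticker labels.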

fleet moth Jul 16, 2020, 11:09 AM

#

Hi people,

I have a problem creating a chart with multiple lines using PyQt, pandas and plot.
```py
def select(self):
    return pd.read_sql_query("SELECT DISTINCT date, interruption_type, priority, SUM(interventiontime) from interruptions GROUP BY interruption_type, priority;", self.conn, index_col="date")

class MplCanvas(FigureCanvasQTAgg):
    def __init__(self, parent=None, width=5, height=4, dpi=100):
        fig = Figure(figsize=(width, height), dpi=dpi)
        self.axes = fig.add_subplot(111)
        super().__init__(fig)
        self.draw()

class StatisticDialog(QMainWindow):
    def __init__(self, *args, **kwargs):
        datas = self.db.select()
        sc = MplCanvas(self, width=5, height=4, dpi=100)
        datas.plot(ax=sc.axes)
```

whole rampart Jul 16, 2020, 11:09 AM

#

Is anyone here familiar with boto3 in Python 3.8? I am trying to access some AWS data, though when I use sqs.receive_message() (where sqs = a boto3.client()), I get "NoCredentialsError Unable to locate credentials". (Sorry if this is not the right place for asking about AWS stuff)

lapis sequoia Jul 16, 2020, 11:55 AM

#

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html#configuration

#

Make sure you've set up authentication correctly

#

hi, i don't understand this error message: AttributeError: 'DataFrameGroupBy' object has no attribute 'get'

leaden snow Jul 16, 2020, 12:00 PM

#

can anyone help me understand nlp? or say word embeddings?

turbid oyster Jul 16, 2020, 12:22 PM

#

@modest rune - awesome. This is a bit different than what i thought you were going for but i'm glad you found a solution. I thought you were looking for a single matrix operation that would solve your problem which is why i was getting so confused last night i think.

lapis sequoia Jul 16, 2020, 12:24 PM

#

Error:

Exception has occurred: TypeError
list indices must be integers or slices, not str

Code:

records = data["records"]    
                    
for friends in records["uuidSender"]:
  print(friends)

turbid oyster Jul 16, 2020, 12:25 PM

#

records is a list

lapis sequoia Jul 16, 2020, 12:25 PM

#

yes

#

how would i access the uuidSender in that list?

turbid oyster Jul 16, 2020, 12:25 PM

#

you would loop over and just check it

lapis sequoia Jul 16, 2020, 12:26 PM

#

This is what the data I'm working with looks like:

{
  "success": true,
  "records": [
    {
      "_id": "5d19b5140cf260dfc30b711e",
      "uuidSender": "38f67f4aa0c74dd99792aaf6cc401424",
      "uuidReceiver": "a02053c510584e4fb4563ab4d7d72c04",
      "started": 1561965844302
    },

turbid oyster Jul 16, 2020, 12:26 PM

#

for f in records:
      f['uuidSender']

#

its a list of dictionaries
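
As a sketch of the same access pattern: the payload below is the one record shown above, closed off so that it parses (the real response presumably has more records), and payload and senders are illustrative names.

```python
import json

payload = json.loads("""
{
  "success": true,
  "records": [
    {"_id": "5d19b5140cf260dfc30b711e",
     "uuidSender": "38f67f4aa0c74dd99792aaf6cc401424",
     "uuidReceiver": "a02053c510584e4fb4563ab4d7d72c04",
     "started": 1561965844302}
  ]
}
""")

records = payload["records"]  # a list, so it is indexed by position, not by key

# Each element is a dict, so the key lookup happens per record
senders = [record["uuidSender"] for record in records]
print(senders)
```

records["uuidSender"] fails because the string key is applied to the list itself; the key only exists on the dicts inside it.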

lapis sequoia Jul 16, 2020, 12:26 PM

#

ok

#

Thanks

modest rune Jul 16, 2020, 12:28 PM

#

@turbid oyster I am trying to clean up that code I sent you last night, but am running into an issue. Originally I wanted to dot multiply the 4x1 gain_scenarios series by the 1x6 'Shares' column of Stock_Data. That wouldn't work because I kept getting the error ValueError: matrices are not aligned. I could only get it to work by multiplying the two matrices after converting them to numpy arrays.

turbid oyster Jul 16, 2020, 12:30 PM

#

when you're doing matrix multiplication number of columns from the first matrix must equal the number of rows in the second matrix

lapis sequoia Jul 16, 2020, 12:30 PM

#

I get the results but my embed messes up:
Image:
https://imgur.com/a/4d60WgL

Imgur

turbid oyster Jul 16, 2020, 12:30 PM

#

so i'm not sure how converting to np array helped you there

modest rune Jul 16, 2020, 12:31 PM

#

I am 95% sure the reason it failed is that when I print gain_scenarios.shape and Stock_Data['Shares'].shape, they show (4,) and (6,). They are missing the info about the second dimension, which should be 1.

lapis sequoia Jul 16, 2020, 12:31 PM

#

for f in records:
  friends = f["uuidSender"]

#

i tried printing it and it worked fine

#

but it's just the embed now for some reason

turbid oyster Jul 16, 2020, 12:33 PM

#

@leaden snow - i've done a few implementations, but your question is really vague

#

@modest rune , we're saying the same thing. Try adding 2 gain scenarios see if it fixes the error

regal flax Jul 16, 2020, 12:34 PM

#

ay

modest rune Jul 16, 2020, 12:34 PM

#

Two more weird things: (a) The idea of transposing a series to go from 6x1 to 1x6, seems wonky to me... I mean, it is a series, it doesn't have a second dimension. But... the pandas documentation talks about a dot() function for series. (b) The dot() documentation says my 1 dimension must share the same name between the two matrices... Can you even give a series a column name?

turbid oyster Jul 16, 2020, 12:36 PM

#

you need two more values

#

the dimensions are what matters not the names

modest rune Jul 16, 2020, 12:37 PM

#

I believe you are incorrect. Only the two inner dimensions need to be the same for matrix multiplication. 4x1 dot 1x50 will work. But 1x4 dot 1x50 will not.

turbid oyster Jul 16, 2020, 12:38 PM

#

thats correct

#

(4,1)(6,1) is what you said you had though right ?

modest rune Jul 16, 2020, 12:39 PM

#

Well, no. I can't get to (4,1)(6,1)... only (4,)(6,). The second dimension is missing.

#

Which make sense... because they are series, they don't have a second dimension.

turbid oyster Jul 16, 2020, 12:40 PM

#

^

modest rune Jul 16, 2020, 12:40 PM

#

There is something super simple I am missing here. I need to do something to tell pandas that the 2nd dimension is 1.

turbid oyster Jul 16, 2020, 12:40 PM

#

just do it in numpy ?

#

it'll be faster anyways

modest rune Jul 16, 2020, 12:41 PM

#

Yeah... ok, that is what I did and it works. The code looks less pretty AND I hate not knowing how to do it 🙂

turbid oyster Jul 16, 2020, 12:41 PM

#

lol i'm glad you figured it out still

modest rune Jul 16, 2020, 12:41 PM

#

Especially since there is a pandas.series.dot() function. Just laughing at me!

#

Staring at me in the face, confirming that you can indeed dot multiply two series.
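
A sketch of what seems to be going on, using the mockup numbers: Series.dot is an inner product, so it wants two Series with the same index and returns a single number. For the 4x1 by 1x6 case the second dimension has to be made explicit, and DataFrame.dot then matches the left frame's column labels against the right frame's index labels. The shared label "k" is just a placeholder chosen here.

```python
import pandas as pd

gain_scenarios = pd.Series({"a": 0.34, "b": 0.21, "c": 0.56, "e": 0.11})
shares = pd.Series([100, 150, 500, 70, 130, 90],
                   index=["NFLX", "AAPL", "GOOG", "F", "BKSR", "AMZN"])

# Inner product: same index on both sides, scalar result
inner = shares.dot(shares)

# Outer product: turn each Series into a 2-D frame first
col = (1 + gain_scenarios).to_frame("k")  # shape (4, 1), column label "k"
row = shares.to_frame("k").T              # shape (1, 6), index label "k"
outer = col.dot(row)                      # shape (4, 6), labels preserved

print(inner)
print(outer)
```

If the two labels differ, DataFrame.dot raises the "matrices are not aligned" ValueError even when the shapes are compatible, which would explain why the plain NumPy arrays worked where the labelled objects did not.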

turbid oyster Jul 16, 2020, 12:42 PM

#

🤣

misty cargo Jul 16, 2020, 12:48 PM

#

@leaden snow need help?

leaden snow Jul 16, 2020, 12:48 PM

#

@turbid oyster oh i m stuck with word2vec and having confusions

turbid oyster Jul 16, 2020, 12:48 PM

#

confusions with ?

leaden snow Jul 16, 2020, 12:49 PM

#

how the word embedding actually works

misty cargo Jul 16, 2020, 12:49 PM

#

hmmm

leaden snow Jul 16, 2020, 12:49 PM

#

i've read a couple of articles but still could not grasp it

misty cargo Jul 16, 2020, 12:49 PM

#

well depends

#

are you interested in how they are trained?

turbid oyster Jul 16, 2020, 12:50 PM

#

@leaden snow -do you understand PCA or NMF ?

#

(At least at a high level, you don't need all of the math - just what they're for and how they're useful)

#

@leaden snow , https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/

Stop Using word2vec | Stitch Fix Technology – Multithreaded

When I started playing with word2vec four years ago I needed (and luckily had) tons of supercomputer time. But because of advances in our understanding of w...

#

read this - it presents word2vec as a matrix decomposition exercise

#

it will help you understand what's actually being learned from the shallow embedding

leaden snow Jul 16, 2020, 12:54 PM

#

@turbid oyster thanks

turbid oyster Jul 16, 2020, 12:56 PM

#

np

proper fable Jul 16, 2020, 1:16 PM

#

Hi guys, can I ask something bout data science related problem?

lapis sequoia Jul 16, 2020, 1:27 PM

#

Hi everyone,I want to learn about data science
Where should I start?

frank bone Jul 16, 2020, 1:37 PM

#

what's the best way to store data in this exact structure? To do time series calculation on a per ticker basis?
so I could say from Date x to y for values in Column "IBM", calculate the following...

#

📎 unknown.png

#

Right now I have code that outputs all 3 values in a variable as a str or int

#

but not sure what method to use to structure it like in pic?

turbid oyster Jul 16, 2020, 1:39 PM

#

@proper fable - probably better to just ask instead of for permission

#

@frank bone , that should work as is. Just do the time series on the column

frank bone Jul 16, 2020, 1:40 PM

#

now i have wildly mixed tuples (with correct value pairs though)

#

i need to have structured columns

#

maybe i can skip the whole tuple stuff, I just have those 3 values and i need to generate the data like in pic. What's the method for this? SQL? Array?

turbid oyster Jul 16, 2020, 1:44 PM

#

pandas

frank bone Jul 16, 2020, 1:47 PM

#

the 3 variables get printed on a daily basis like this (it iterates day to day for a year): AMZN 20120103 245661.0966 AAPL 20120103 3835538.1934 IBM 20120103 417207.9702 TSLA 20120103 3198.2502 AMZN 20120104 240235.78290000002 AAPL 20120104 4176408.2924999995 IBM 20120104 64570.466499999995 TSLA 20120104 1983.86 AMZN 20120105 218486.7536 AAPL 20120105 5468868.8049 IBM 20120105 160427.1448 TSLA 20120105 1208.819 AMZN 20120106 445663.6416 AAPL 20120106 5520247.9946 IBM 20120106 107711.8426 TSLA 20120106 17823.2025

#

what to use? simple panda df or does it need to be multiindex?

#

would numpy work too?

turbid oyster Jul 16, 2020, 1:48 PM

#

what are you using for your time series analysis?

#

i'd just make a simple df with column names as stock and dates for rows.

#

you should be able to slice and dice that to your hearts content
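
A minimal sketch of that layout, assuming the loop's output is collected as (ticker, date, dollar_volume) tuples. The values are the first two days printed above; records, long_df and wide are illustrative names.

```python
import pandas as pd

# (ticker, date, dollar_volume) tuples, shaped like the existing loop's output
records = [
    ("AMZN", "20120103", 245661.0966), ("AAPL", "20120103", 3835538.1934),
    ("IBM", "20120103", 417207.9702), ("TSLA", "20120103", 3198.2502),
    ("AMZN", "20120104", 240235.7829), ("AAPL", "20120104", 4176408.2925),
    ("IBM", "20120104", 64570.4665), ("TSLA", "20120104", 1983.86),
]

long_df = pd.DataFrame(records, columns=["ticker", "date", "dollar_volume"])
long_df["date"] = pd.to_datetime(long_df["date"], format="%Y%m%d")

# One row per date, one column per ticker
wide = long_df.pivot(index="date", columns="ticker", values="dollar_volume")

# "From date x to y for values in column IBM"
ibm_slice = wide.loc["2012-01-03":"2012-01-04", "IBM"]
print(wide)
print(ibm_slice)
```

Building the long table first and pivoting once at the end avoids having to know the full ticker list up front, since pivot creates a column for every ticker it encounters.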

frank bone Jul 16, 2020, 1:49 PM

#

the goal is to have a sliding window in the end...i.e. watch each stock for example 30 days on continuous basis

#

then apply moving averages and standard deviations

turbid oyster Jul 16, 2020, 1:50 PM

#

yeah a pandas dataframe is all you should need then

frank bone Jul 16, 2020, 1:50 PM

#

and other stuff i probably don't know yet

turbid oyster Jul 16, 2020, 1:50 PM

#

then just write a windowing function you can apply to the columns
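
For moving averages and standard deviations, pandas already ships a windowing method, so a hand-written function may not be needed. A sketch on random placeholder data (the numbers and the two tickers are made up; only the 30-day window comes from the discussion):

```python
import numpy as np
import pandas as pd

# Placeholder data: 60 business days of made-up dollar volume for two tickers
dates = pd.date_range("2012-01-03", periods=60, freq="B")
rng = np.random.default_rng(0)
wide = pd.DataFrame(rng.uniform(1e5, 1e6, size=(60, 2)),
                    index=dates, columns=["AAPL", "IBM"])

# A 30-row sliding window applied to every column at once
window = wide.rolling(window=30)
moving_avg = window.mean()
moving_std = window.std()

print(moving_avg.tail(3))
print(moving_std.tail(3))
```

The first 29 rows of each result are NaN because the window is not full yet; passing min_periods to rolling changes that if partial windows are acceptable.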

frank bone Jul 16, 2020, 1:51 PM

#

great, ill give that a try 😄 thanks

#

can you use a string as an index?

turbid oyster Jul 16, 2020, 1:52 PM

#

also google fbprophet for some predictive methods that are fun and easy

frank bone Jul 16, 2020, 1:52 PM

#

like when i want to append AAPL volume to the next day

turbid oyster Jul 16, 2020, 1:52 PM

#

look at the pandas concat method

frank bone Jul 16, 2020, 1:55 PM

#

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
But this assumes a fixed column length, no?

turbid oyster Jul 16, 2020, 1:56 PM

#

pd.concat([df, df2]) <-

#

you can add stuff to it

#

thats just an initialization with data

leaden snow Jul 16, 2020, 1:57 PM

#

@misty cargo

📎 fun1.JPG

#

📎 fun2.JPG

#

📎 fun3.JPG

#

hahaha

misty cargo Jul 16, 2020, 2:07 PM

#

?

#

you trying to get the vector of one word?

#

or the whole string

frank bone Jul 16, 2020, 2:08 PM

#

@turbid oyster hmm whats wrong about this? df_volume = pd.DataFrame([[]], columns=list()) df2 = pd.DataFrame([[int(volume)]], columns=list(str(ticker))) df_volume.concat(df2)

turbid oyster Jul 16, 2020, 2:09 PM

#

whats the error you're getting

frank bone Jul 16, 2020, 2:09 PM

#

ValueError: 4 columns passed, passed data had 1 columns

turbid oyster Jul 16, 2020, 2:09 PM

#

so what do you think the problem is

frank bone Jul 16, 2020, 2:10 PM

#

concat must have same amount of columns?

#

you cant "append"?

#

or extend

turbid oyster Jul 16, 2020, 2:11 PM

#

check the append method

modern canyon Jul 16, 2020, 2:11 PM

#

hello y'all, I have this dataframe

📎 Capture.PNG

frank bone Jul 16, 2020, 2:11 PM

#

append same error

#

hm

turbid oyster Jul 16, 2020, 2:12 PM

#

is it erroring when you make df2 ?

#

or when you concat ?

frank bone Jul 16, 2020, 2:12 PM

#

throws this whole thing

modern canyon Jul 16, 2020, 2:13 PM

#

I want to pass 'genres' and 'averageRating' columns to a k-nearest neighbors classifier to find similar movies

turbid oyster Jul 16, 2020, 2:13 PM

#

oh i see set up the df_volume with column names

#

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df
   A  B
0  1  2
1  3  4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
   A  B
0  1  2
1  3  4
0  5  6
1  7  8
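
A note on the "4 columns passed, passed data had 1 columns" error from earlier: it most likely comes from columns=list(str(ticker)), not from the concat or append step. Calling list() on a string splits it into characters, so a four-letter ticker turns into four column names. A sketch with placeholder values:

```python
import pandas as pd

ticker, volume = "AAPL", 3835538

print(list(ticker))  # ['A', 'A', 'P', 'L'] -> four column names for one value

# Wrapping the name in a list gives a single column instead
df2 = pd.DataFrame([[int(volume)]], columns=[ticker])

# concat is a top-level pandas function that takes a list of frames;
# axis=1 places the frames side by side as new columns
df1 = pd.DataFrame([[245661]], columns=["AMZN"])
combined = pd.concat([df1, df2], axis=1)
print(combined)
```

In the docs example above, list('AB') happens to work only because each column name is a single character. Also, DataFrame.append was deprecated and then removed in pandas 2.0, so pd.concat is the call that works on current versions.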

modern canyon Jul 16, 2020, 2:13 PM

#

how do I encode numerical and non-numerical data into a single array?

turbid oyster Jul 16, 2020, 2:14 PM

#

google "categorical encodings"

modern canyon Jul 16, 2020, 2:14 PM

#

I already vectorized 'genres' column and ran cosine similarity on it

frank bone Jul 16, 2020, 2:14 PM

#

oh i see set up the df_volume with column names
@turbid oyster but i want to create them on the go?

modern canyon Jul 16, 2020, 2:14 PM

#

now I just need a way to weight averageRating equally
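
One possible way to combine the two columns, as a sketch: multi-hot encode 'genres' and rescale 'averageRating' into the same 0 to 1 range before stacking them. The three movies and the comma-separated genre format are assumptions made for illustration; the real dataframe is only visible in the screenshot.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'genres' is assumed to be a comma-separated string
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Action,Comedy", "Drama", "Action,Drama"],
    "averageRating": [7.5, 6.0, 9.0],
})

# One 0/1 column per genre
genre_flags = movies["genres"].str.get_dummies(sep=",")

# Min-max scale the rating so it lives on the same 0..1 range as the flags
rating = movies["averageRating"]
rating_scaled = (rating - rating.min()) / (rating.max() - rating.min())

# Single numeric feature matrix: genre flags followed by the scaled rating
features = np.hstack([genre_flags.to_numpy(),
                      rating_scaled.to_numpy().reshape(-1, 1)])
print(features)
```

Whether this counts as "equal" weighting is a judgment call: the rating is one column against many genre columns, so multiplying rating_scaled by a constant is a simple knob for giving it more influence in a distance-based model.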

turbid oyster Jul 16, 2020, 2:15 PM

#

@turbid oyster but i want to create them on the go?
you'll need to know your stock tickers symbols at some point right ?

frank bone Jul 16, 2020, 2:16 PM

#

I get them from my variable

#

i dont feel like typing out 3000 tickers or whatever 😄

turbid oyster Jul 16, 2020, 2:16 PM

#

sure - just put that list in

frank bone Jul 16, 2020, 2:16 PM

#

so its not possible to create that column name on the go?

#

i dont want to put a list together first

turbid oyster Jul 16, 2020, 2:17 PM

#

you don't need to type them df=pd.DataFrame([x],columns=list_of_symbols)

#

you kind of have to

frank bone Jul 16, 2020, 2:17 PM

#

i iterate through them day to day and one day it might be different than the other

#

so i have to scan the whole year first just to provide a list?

turbid oyster Jul 16, 2020, 2:18 PM

#

probably ? if i understand what you're doing you'll need that information no matter what. You can always do it once and just save that initial list

leaden snow Jul 16, 2020, 2:19 PM

#

@misty cargo a word

turbid oyster Jul 16, 2020, 2:19 PM

#

that or create your own data structure where each stock ticker gets a dataframe and do the appends at that level

misty cargo Jul 16, 2020, 2:19 PM

#

then why the need for str.split()?

leaden snow Jul 16, 2020, 2:19 PM

#

word2vec is having trouble distinguishing car and cycle

frank bone Jul 16, 2020, 2:19 PM

#

hmmm or SQL would work too right?

turbid oyster Jul 16, 2020, 2:20 PM

#

but then you might as well go back to something like :{ ticker_name : [(date,value) ...]}

#

even with SQL you have to have column names defined

leaden snow Jul 16, 2020, 2:20 PM

#

its the same things , str.split change a str to a list of strings @misty cargo

frank bone Jul 16, 2020, 2:20 PM

#

cant define them on the go?

turbid oyster Jul 16, 2020, 2:20 PM

#

it's not going to alleviate you from structuring your data in a table

#

SQL is a language for querying tables - so .... you still have to define the schema for your data.
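
A hedged illustration of the schema point, using Python's built-in `sqlite3` and made-up rows: if tickers are stored as values in a long table (one row per day and ticker) rather than as column names, the schema is defined once and a ticker that shows up later needs no change to it. The table and column names here are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Long ("tidy") layout: one row per (day, ticker) observation.
conn.execute("CREATE TABLE volume (day TEXT, ticker TEXT, volume INTEGER)")

# Hypothetical rows; TSLA first appears on the second day without a schema change.
rows = [
    ("2020-07-15", "AAPL", 100),
    ("2020-07-15", "NFLX", 50),
    ("2020-07-16", "AAPL", 120),
    ("2020-07-16", "TSLA", 70),
]
conn.executemany("INSERT INTO volume VALUES (?, ?, ?)", rows)

# The list of tickers is then just a query, not something typed out by hand.
tickers = [r[0] for r in conn.execute(
    "SELECT DISTINCT ticker FROM volume ORDER BY ticker")]
print(tickers)
conn.close()
```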

#

sounds like you want this to be unstructured

#

i'd take a step back and just figure out what types of questions / operations it is you need to do on the data first and let that tell you how to structure it

frank bone Jul 16, 2020, 2:22 PM

#

sounds fair enough, i just thought pandas would be more flexible in that regard

turbid oyster Jul 16, 2020, 2:22 PM

#

it is

frank bone Jul 16, 2020, 2:22 PM

#

like why shouldnt you be able to add new columns?

#

on the go

turbid oyster Jul 16, 2020, 2:22 PM

#

you can add new columns on the go

frank bone Jul 16, 2020, 2:23 PM

#

sorry i mean column names*

#

and columns

turbid oyster Jul 16, 2020, 2:23 PM

#

thats the same thing

#

df = pd.DataFrame(data) 
new_column = ['Delhi', 'Bangalore', 'Chennai', 'Patna'] 
df['City'] = new_column

#

you just have to provide all the values - its not going to magically know how to fill in blank spots

#

(or will default them to something that may or may not be useful)
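
A small sketch of how pandas fills the blank spots, with made-up daily values. `df_volume` is the name used earlier in the chat; everything else is hypothetical. pandas works out the union of tickers on its own and uses NaN wherever a day has no value for a ticker.

```python
import pandas as pd

# Hypothetical daily values; the set of tickers differs from day to day.
daily = {
    "2020-07-15": {"AAPL": 100, "NFLX": 50},
    "2020-07-16": {"AAPL": 120, "TSLA": 70},
}

# Outer keys become rows, inner keys become columns; missing combinations
# are filled with NaN.
df_volume = pd.DataFrame.from_dict(daily, orient="index")

# A later column can be added by assigning a Series; it aligns on the index,
# so days it does not cover also become NaN.
df_volume["GOOG"] = pd.Series({"2020-07-16": 30})

print(df_volume)
```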

frank bone Jul 16, 2020, 2:25 PM

#

oh ill try that then

turbid oyster Jul 16, 2020, 2:26 PM

#

seriously though - its probably not a bad exercise for you to take that step back and just ask yourself: what information do i need to get from this data set to accomplish my goals? might make it easier to compose.

modern canyon Jul 16, 2020, 2:28 PM

#

now I just need a way to weight averageRating equally
@modern canyon ?

frank bone Jul 16, 2020, 2:29 PM

#

seriously though - its probably not a bad exercise for you to take that step back and just ask yourself: what information do i need to get from this data set to accomplish my goals? might make it easier to compose.
@turbid oyster yeah ill probably just make csvs first with the new values, shouldnt be too big

turbid oyster Jul 16, 2020, 2:29 PM

#

👍 i have found that dumb solutions are often perfectly functional in the real world 😄

frank bone Jul 16, 2020, 2:34 PM

#

however if anyone knows any simple solution to this, let me know 🙂

#

📎 unknown.png

#

that doesnt require precompiling anything, just providing the data (including dates and column names) on the go

silk axle Jul 16, 2020, 2:37 PM

#

I have a ML model which predicts what number the given image is, with a training accuracy of about 98% over 1,875 samples, which seems pretty good, but when I test the model with images from my PC it seems to nearly always get it wrong? Have tried 13 different images and only 1 was guessed correctly.

turbid oyster Jul 16, 2020, 2:37 PM

#

Are you familiar with the concept of over fitting ?

silk axle Jul 16, 2020, 2:38 PM

#

Not really; I'm really new to ML (as in this is my second day)

turbid oyster Jul 16, 2020, 2:39 PM

#

some reading topics for you : bias/variance tradeoff , overfitting in deep learning models, and sampling

#

i'm going to guess your 1875 images were the NIST data set ?

silk axle Jul 16, 2020, 2:39 PM

#

MNIST, yea

#

fwiw this is my module code

📎 unknown.png

turbid oyster Jul 16, 2020, 2:40 PM

#

yeah its not your code its your data

#

the model doesn't generalize for a reason

#

sampling bias

silk axle Jul 16, 2020, 2:40 PM

#

So the sample data has say more 1s than 7s?

#

So guessing 1s would be more accurate than 7s

#

So accuracy for 1s might be like 98% but 7s could be much lower, because the model has seen less of them?
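
One way to check this idea is to count the labels per digit and compute accuracy separately for each digit. The labels and predictions below are made up purely for illustration and say nothing about how balanced the real training set is.

```python
import numpy as np

# Hypothetical labels and predictions standing in for a digit classifier's output.
y_true = np.array([1, 1, 1, 1, 7, 7, 1, 7, 1, 1])
y_pred = np.array([1, 1, 1, 1, 1, 7, 1, 1, 1, 1])

# How many samples of each digit are present (array index = digit).
counts = np.bincount(y_true, minlength=10)

# Accuracy computed separately for each digit that actually occurs.
per_class_acc = {
    int(d): float((y_pred[y_true == d] == d).mean())
    for d in np.unique(y_true)
}

print("samples of 1:", counts[1], "samples of 7:", counts[7])
print("overall accuracy:", float((y_true == y_pred).mean()))
print("per-digit accuracy:", per_class_acc)
```

In this toy case the overall accuracy looks decent while the rarer digit does much worse, which is the effect described above.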

#

@turbid oyster

turbid oyster Jul 16, 2020, 2:49 PM

#

that but also the images you learned on probably don't resemble what you're showing the algorithm

silk axle Jul 16, 2020, 2:49 PM

#

this is example of what the model learns on

📎 unknown.png

turbid oyster Jul 16, 2020, 2:50 PM

#

yes i'm aware

silk axle Jul 16, 2020, 2:50 PM

#

this is example of what I'm giving

📎 unknown.png

#

Do I like need to invert the colours of the bottom one?

#

Since the colour of the background and actual number is swapped

turbid oyster Jul 16, 2020, 2:51 PM

#

^

#

that would be a not small part of it

silk axle Jul 16, 2020, 2:51 PM

#

Right

#

So... how would I do that? xD

turbid oyster Jul 16, 2020, 2:51 PM

#

its 0-1 encoded ?

silk axle Jul 16, 2020, 2:51 PM

#

Yes

#

Ig just do like val = 1 - val to invert?

#

I should probs also say I've never really used numpy before so don't really know how to actually do that calculation to each item

turbid oyster Jul 16, 2020, 2:54 PM

#

i think you can just do it on the whole matrix

#

so just use that syntax

silk axle Jul 16, 2020, 2:55 PM

#

Wdym?

modest rune Jul 16, 2020, 2:56 PM

#

I intentionally want to construct a nested series... a series with a series in each element. I can easily make a dataframe, but I can't figure out how to make a nested series. With 1 caveat... I need to do it without any looping.

pale thunder Jul 16, 2020, 2:56 PM

#

inverted_image = 1 - original_image
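
A tiny worked example of that line, assuming pixel values are already scaled to the 0-1 range as stated above. The 2x2 array is a made-up stand-in for an image; NumPy applies the subtraction to every element at once.

```python
import numpy as np

# Hypothetical 2x2 "image" with pixel values in the 0-1 range.
original_image = np.array([[0.0, 0.25],
                           [0.5, 1.0]])

# The scalar 1 is broadcast against every element, so no Python loop is needed.
inverted_image = 1 - original_image

print(inverted_image)
```

For images stored as 0-255 integers the same idea would be `255 - original_image`.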

turbid oyster Jul 16, 2020, 2:56 PM

#

^

silk axle Jul 16, 2020, 2:56 PM

#

Ah, I see

modest rune Jul 16, 2020, 2:56 PM

#

My guess is, pandas doesn't want me to, because it is usually a bad idea.

silk axle Jul 16, 2020, 2:56 PM

#

Forgot that's a thing you can do in numpy

turbid oyster Jul 16, 2020, 2:57 PM

#

@modest rune - probably time to go with a dictionary

silk axle Jul 16, 2020, 2:57 PM

#

That made the actual answer closer but still wrong

#

I'm not really sure how to improve this

turbid oyster Jul 16, 2020, 3:00 PM

#

it could be something as simple as the line thicknesses being different, causing the activation functions to not work as well

silk axle Jul 16, 2020, 3:00 PM

#

But then I can't really control that?

turbid oyster Jul 16, 2020, 3:00 PM

#

sure you can

#

you provide more training data with a greater variety of ways of representing the data

silk axle Jul 16, 2020, 3:01 PM

#

How would I do that?

turbid oyster Jul 16, 2020, 3:01 PM

#

make a bunch of number drawings, label them, and feed them into your training set

silk axle Jul 16, 2020, 3:02 PM

#

So I'd have like images = [(image, label), ...]?

turbid oyster Jul 16, 2020, 3:02 PM

#

something like that yeah

silk axle Jul 16, 2020, 3:02 PM

#

Right okay

#

I think I can do that, thanks

#

Do you know what a good size would be?

turbid oyster Jul 16, 2020, 3:03 PM

#

start with maybe 10 of each number ?

silk axle Jul 16, 2020, 3:03 PM

#

And presumably I do this instead of the mnist stuff? Or do I do both?

turbid oyster Jul 16, 2020, 3:03 PM

#

do both

silk axle Jul 16, 2020, 3:04 PM

#

Right, yea that makes sense because then more variety

turbid oyster Jul 16, 2020, 3:04 PM

#

^

#

FYI this conversation is core to the "ethics in AI" the community is dealing with right now

silk axle Jul 16, 2020, 3:07 PM

#

I'm just thinking

#

I'm drawing these, but all the ones I'm drawing would be the same line-width

#

Ig there's not really anything I can do about that, other than deliberately making them different

turbid oyster Jul 16, 2020, 3:09 PM

#

look into image manipulations for increasing data size and variety, but a common thing is to do some transformations to skew images a bit to provide some more variety
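
A minimal sketch of one such transformation, using only NumPy: shifting an image by a few pixels to get extra training variants. `shift_image` is a hypothetical helper written for this example, and the blob stands in for a drawn digit. Rotation, shear, or thicker strokes would need an image library such as `scipy.ndimage` or the augmentation tools of a deep learning framework.

```python
import numpy as np

def shift_image(image, dy, dx):
    """Return a copy of a 2-D image shifted by (dy, dx) pixels, padded with zeros."""
    shifted = np.zeros_like(image)
    h, w = image.shape
    # Source and destination windows that still overlap after the shift.
    src_rows = slice(max(0, -dy), min(h, h - dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

# Hypothetical 28x28 image with a bright blob standing in for a digit.
image = np.zeros((28, 28))
image[10:18, 12:16] = 1.0

# Nine variants of the same drawing: every combination of small shifts.
augmented = [shift_image(image, dy, dx) for dy in (-2, 0, 2) for dx in (-2, 0, 2)]
print(len(augmented), augmented[0].shape)
```

Each variant keeps the label of the original drawing.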

silk axle Jul 16, 2020, 3:09 PM

#

(like this)

📎 unknown.png

#

And alright, thanks

weary dune Jul 16, 2020, 3:41 PM

#

Can someone help me figure out why I got a syntax error?

📎 unknown.png

pale thunder Jul 16, 2020, 3:43 PM

#

Missing a closing paren on the line above

reef rain Jul 16, 2020, 3:45 PM

#

i want to learn data-science, can someone suggest some online courses (free) that are good?

weary dune Jul 16, 2020, 3:46 PM

#

@pale thunder It worked, thanks

vagrant fiber Jul 16, 2020, 3:51 PM

#

hey

#

I'm still new to python and wanna learn data science, do you guys have any projects to recommend?

frank bone Jul 16, 2020, 4:01 PM

#

any idea how to merge same indexes into same row?

#

📎 unknown.png

#

just used append for this to print

#

        df2.columns = [ticker]
        df = df.append(df2)

kindred peak Jul 16, 2020, 4:08 PM

#

@vagrant fiber check out the courses by https://fast.ai

vagrant fiber Jul 16, 2020, 4:12 PM

#

Thanks

frank bone Jul 16, 2020, 4:13 PM

#

got it df = pd.concat([df, df2], axis=1)
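
To illustrate why `axis=1` puts matching indexes on the same row, here is a small sketch with two made-up per-ticker frames (the tickers and values are hypothetical):

```python
import pandas as pd

dates = ["2020-07-15", "2020-07-16"]

# Hypothetical per-ticker frames, one column each, indexed by date.
aapl = pd.DataFrame({"AAPL": [100, 120]}, index=dates)
tsla = pd.DataFrame({"TSLA": [70]}, index=["2020-07-16"])

# axis=1 places the frames side by side and aligns rows on the index;
# dates missing from a frame become NaN.
wide = pd.concat([aapl, tsla], axis=1)

print(wide)
```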

silk axle Jul 16, 2020, 4:30 PM

#

@turbid oyster
```py
import os

import matplotlib.pyplot as plt
import numpy as np
from skimage.transform import resize  # assumed source of resize()

for image_file in os.listdir(_dir):
    # The first character of the file name is the digit shown in the image
    number_in_image = int(image_file[0])
    image = plt.imread(f"{_dir}/{image_file}")

    ## Resize image to 28x28x1 and invert
    resized_image = 1 - resize(image, (28, 28, 1))
    # print(resized_image)

    ## Show image
    image = np.array(resized_image, dtype='float')
    pixels = image.reshape((28, 28))
    plt.imshow(pixels)
    plt.show()
    break  # only do for one file as test
```
how would I now add the image to an array alongside the number_in_image, in a format I can use to add to the x_train of mnist.load_data()?

#

Now that I think about it, wouldn't I be adding the image to x_train and the number_in_image to y_train?

#

Either way, not sure how to do

#

Also not sure what datatypes I'm meant to be using (what datatypes are returned from mnist.load_data())
(@me upon response please)
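
For reference, Keras' `mnist.load_data()` returns `(x_train, y_train), (x_test, y_test)` as NumPy arrays: `x_train` is uint8 with shape (60000, 28, 28) and `y_train` is uint8 with shape (60000,). A hedged sketch of adding your own drawings follows. Small zero arrays stand in for the real MNIST arrays so the example runs without downloading anything, and `my_images` and `my_labels` are hypothetical.

```python
import numpy as np

# Stand-ins with the same dtypes and per-sample shapes as the arrays that
# mnist.load_data() returns; only 5 samples here instead of 60000.
x_train = np.zeros((5, 28, 28), dtype=np.uint8)
y_train = np.zeros((5,), dtype=np.uint8)

# Hypothetical own drawings, already resized to 28x28 and scaled to 0-1 floats.
my_images = [np.random.rand(28, 28) for _ in range(3)]
my_labels = [3, 7, 1]

# Put everything on one scale before mixing: MNIST pixels are divided by 255.
x_all = np.concatenate([x_train / 255.0, np.stack(my_images)], axis=0)

# Images go with x, the digit in each image goes with y, in the same order.
y_all = np.concatenate([y_train, np.array(my_labels, dtype=np.uint8)], axis=0)

print(x_all.shape, x_all.dtype, y_all.shape, y_all.dtype)
```

If the model expects a trailing channel axis, `x_all.reshape(-1, 28, 28, 1)` would add it.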

proper fable Jul 16, 2020, 4:42 PM

#

Hi guys, sorry to interrupt, I want to ask about a small problem. There is a dataset about cars, and I have to find which variables influence the price column. What technique should I use for this task, and why? Thanks in advance.

#

📎 Screenshot_20200716-233803.png

#

Here are the columns of the data set

#

And here is the shape of it

#

📎 Screenshot_20200716-233815.png

lapis sequoia Jul 16, 2020, 4:45 PM

#

Well you could calculate the correlation of the pairs of features and see how they are correlated with Price

#

If you want to use this for ML, I suggest you just calculate a correlation matrix and remove features that are highly correlated. Otherwise you could also use Principal Component Analysis (PCA) which is often used for dimensionality reduction
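
A short sketch of the correlation idea on a made-up table. The real column names are only visible in the screenshots, so `horsepower`, `mileage`, and `doors` are assumptions. Note that Pearson correlation only captures linear relationships between numeric columns.

```python
import pandas as pd

# Hypothetical car data; only "price" is taken from the question.
cars = pd.DataFrame({
    "price":      [5000, 7000, 9000, 12000, 15000],
    "horsepower": [70, 90, 110, 150, 180],
    "mileage":    [120, 100, 80, 40, 20],
    "doors":      [4, 2, 4, 2, 4],
})

# Pearson correlation of every column with price (price itself dropped).
corr_with_price = cars.corr()["price"].drop("price")

# Order by absolute strength while keeping the sign visible.
order = corr_with_price.abs().sort_values(ascending=False).index
corr_with_price = corr_with_price.reindex(order)

print(corr_with_price)
```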

modest rune Jul 16, 2020, 4:51 PM

#

Pounding my head against the wall again. I wrote a simplified version of my problem. Basically, I can do what I want if I type in the nested 1D array, but if I save it to a variable first, it doesn't work. I think I understand the difference between the two approaches, but I need to use a variable and I need to find a way to get this to work.

import pandas as pd
import numpy as np

stock_data=pd.DataFrame(columns=['Ticker','Shares','Cost_Per_Share'],
                        data=  [[  'NFLX',   100.0,            0.10],
                                [  'AAPL',   150.0,            0.20],
                                [  'GOOG',   500.0,            5.10],
                                [     'F',    70.0,            7.10],
                                [  'BKSR',   130.0,            0.90],
                                [  'AMZN',    90.0,            5.10]])

# Works! AND Does what I want, but I need to use a variable like below
stock_data['Nested_Array_Works'] =     [[1,2,3],
                                        [4,2,3],
                                        [1,2,3],
                                        [1,7,3],
                                        [1,2,3],
                                        [1,2,3]]

print(stock_data)

# Exact same array, but saved to a numpy_array variable first
numpy_array =                 np.array([[1,2,3],
                                        [4,2,3],
                                        [1,2,3],
                                        [1,7,3],
                                        [1,2,3],
                                        [1,2,3]])

# Doesn't Work, Exception: Wrong number of items passed 3, placement implies 1
stock_data['Nested_Array_No_Worky'] = numpy_array

#

Output:

  Ticker  Shares  Cost_Per_Share Nested_Array_Works
0   NFLX   100.0             0.1          [1, 2, 3]
1   AAPL   150.0             0.2          [4, 2, 3]
2   GOOG   500.0             5.1          [1, 2, 3]
3      F    70.0             7.1          [1, 7, 3]
4   BKSR   130.0             0.9          [1, 2, 3]
5   AMZN    90.0             5.1          [1, 2, 3]

...
...
File "C:\Users\HomeLaptop\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\internals\blocks.py", line 124, in __init__
    raise ValueError(
ValueError: Wrong number of items passed 3, placement implies 1

worldly kindle Jul 16, 2020, 4:56 PM

#

@modest rune can you cast np.array() as a list?

numpy_array = list(np.array([[1,2,3], [4,2,3], [1,2,3], [1,7,3], [1,2,3], [1,2,3]]))
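
A reduced sketch of why this works, with three rows instead of six. `list()` iterates over the first axis of the 2-D array, so pandas receives one 1-D array per row and stores each one as a single cell of an object column. The column name `Nested` is made up for this example.

```python
import numpy as np
import pandas as pd

stock_data = pd.DataFrame({"Ticker": ["NFLX", "AAPL", "GOOG"]})
numpy_array = np.array([[1, 2, 3],
                        [4, 2, 3],
                        [1, 7, 3]])

# One 1-D array per row; pandas keeps each as a single object-dtype cell
# instead of trying to spread the 3 values over 3 columns.
stock_data["Nested"] = list(numpy_array)

print(stock_data)
print(stock_data["Nested"].dtype)
```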

modest rune Jul 16, 2020, 4:57 PM

#

Not having a computer science background, I have heard that casting is bad. I think it will fix the problem though. I'll give it a try.

worldly kindle Jul 16, 2020, 4:57 PM

#

it did for me

#

i don't have a CS background either. haven't heard that though. good to know(?)

#

do you know in what context?

#

i defer to others though and their expertise

modest rune Jul 16, 2020, 4:59 PM

#

well... I don't think calling list() is actually casting

#

I think that is a conversion

#

which is safe

#

Hopefully someone more knowledgeable can chime in

proper fable Jul 16, 2020, 5:00 PM

#

If you want to use this for ML, I suggest you just calculate a correlation matrix and remove features that are highly correlated. Otherwise you could also use Principal Component Analysis (PCA) which is often used for dimensionality reduction
@lapis sequoia thanks for the solution. How can I decide which correlation method is best to use? Is it enough to just calculate the correlation, or should I fit a model and then compare each variable's importance to find the best features?

modest rune Jul 16, 2020, 5:01 PM

#

@worldly kindle I appreciate the solution. The only concern I have now is... will there be a major performance hit by converting the numpy array to a list? Because, with my actual code I will be doing that conversion much more frequently on much larger nested arrays.

#

Yeah... I think I am going to need a solution in which I don't spend time converting to a list first. I was half hoping there was a way to add a column to a dataframe without pandas assuming it knows what I want it to do.

balmy kraken Jul 16, 2020, 5:19 PM

#

hey guys

#

so i’ve been working on a dataset that has columns for date and then columns for fund names. Does anyone know how i could “merge” the dates into an index so that for every date i get the fund names in the column and the other info as well?

worldly kindle Jul 16, 2020, 5:21 PM

#

@modest rune no prob bob. also, calling list() is casting/converting. separately, why not assign the list to the variable?

#

like so, numpy_array = [[1,2,3], [4,2,3], [1,2,3], [1,7,3], [1,2,3], [1,2,3]]

#

it has to be np.array()

#

?

modest rune Jul 16, 2020, 5:22 PM

#

I think that converting means, going through a proper process to go from one datatype to another. And casting is leaving the data untouched in memory and simply telling the compiler to pretend it is something else (I could be wrong)

balmy kraken Jul 16, 2020, 5:23 PM

#

Something like

Date   Fund

19/03  X
20/03  X
21/03  X
19/03  Y
20/03  Y
21/03  Y

Would turn into

Date   Fund

19/03  X
       Y
20/03  X
       Y
21/03  X
       Y
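
One way to get that layout, sketched on the example rows above with a hypothetical `Value` column added: put Date and Fund into a MultiIndex and sort it, or pivot so each fund becomes its own column.

```python
import pandas as pd

# The example rows from the message above, plus a hypothetical Value column.
funds = pd.DataFrame({
    "Date": ["19/03", "20/03", "21/03", "19/03", "20/03", "21/03"],
    "Fund": ["X", "X", "X", "Y", "Y", "Y"],
    "Value": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
})

# A (Date, Fund) MultiIndex, sorted so each date lists its funds together.
by_date = funds.set_index(["Date", "Fund"]).sort_index()
print(by_date)

# Alternative: one row per date, one column per fund.
wide = funds.pivot(index="Date", columns="Fund", values="Value")
print(wide)
```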

modest rune Jul 16, 2020, 5:23 PM

#

Yes, the data I will be consuming is already an np.array()

balmy kraken Jul 16, 2020, 5:23 PM

#

anyone could help me on this? it’d be greatly appreciated

uncut shadow Jul 16, 2020, 5:23 PM

#

@modest rune where are you trying to put this np.array? I mean, what are you actually trying to do cuz u didn't show what should be in this stock_data[here]

modest rune Jul 16, 2020, 5:24 PM

#

@uncut shadow the Output I posted shows how I want it to show up in the dataframe

uncut shadow Jul 16, 2020, 5:24 PM

#

@balmy kraken maybe this will answer your question https://stackoverflow.com/questions/36292959/pandas-merge-data-frames-on-datetime-index

Stack Overflow

Pandas: Merge data frames on datetime index

I have the following two dataframes that I have set date to DateTime Index df.set_index(pd.to_datetime(df['date']), inplace=True) and would like to merge or join on date:

df.head(5)
catcod...

#

oh ok

modest rune Jul 16, 2020, 5:28 PM

#

@worldly kindle some good discussion on Casting versus Converting in this Stackoverflow question.
https://stackoverflow.com/questions/3166840/what-is-the-difference-between-casting-and-conversion

I guess my definition is a common one, but often times casting IS converting. As one commenter put it "I would guess that the term "cast" has a bit of a muddy definition and usage."

Stack Overflow

What is the difference between casting and conversion?

Eric Lippert's comments in this question have left me thoroughly confused. What is the difference between casting and conversion in C#?

uncut shadow Jul 16, 2020, 5:31 PM

#

@modest rune check this https://stackoverflow.com/questions/18646076/add-numpy-array-as-column-to-pandas-data-frame

Stack Overflow

Add numpy array as column to Pandas data frame

I have a Pandas data frame object of shape (X,Y) that looks like this:

[[1, 2, 3],
[4, 5, 6],
[7, 8, 9]]
and a numpy sparse matrix (CSC) of shape (X,Z) that looks something like this

[[0, 1, 0],...

worldly kindle Jul 16, 2020, 5:32 PM

#

so essentially passing a nested array into a dataframe is your problem that you're trying to solve?

#

@worldly kindle some good discussion on Casting versus Converting in this Stackoverflow question.
https://stackoverflow.com/questions/3166840/what-is-the-difference-between-casting-and-conversion

I guess my definition is a common one, but often times casting IS converting. As one commenter put it "I would guess that the term "cast" has a bit of a muddy definition and usage."
@modest rune gotcha


modest rune Jul 16, 2020, 5:33 PM

#

@uncut shadow your googling skills are impressive. I hope I can fix it now.

uncut shadow Jul 16, 2020, 5:33 PM

#

👍

modest rune Jul 16, 2020, 5:36 PM

#

@worldly kindle so essentially passing a nested array into a dataframe is your problem that you're trying to solve?
Pretty much.

#

@uncut shadow what google keywords did you use to find that stackoverflow question? Teach me to fish... cuz I have been googling like crazy for that for hours.

uncut shadow Jul 16, 2020, 5:38 PM

#

just how to create a new column with numpy array in pandas

modest rune Jul 16, 2020, 5:39 PM

#

Well... I am sure I googled something very similar to that... thanks for doing the legwork for me!

uncut shadow Jul 16, 2020, 5:39 PM

#

👍

modest rune Jul 16, 2020, 5:48 PM

#

@uncut shadow turns out the stackoverflow solution is the same as @worldly kindle's solution... Convert the data to a list then append to a column. BUT, the stackoverflow answers helped me decide there are no better solutions, and most of the experts said nested arrays inside a dataframe are a bad idea. One person suggested looking into Panels (deprecated) and multi indexing as a way to avoid the nested arrays.

uncut shadow Jul 16, 2020, 5:49 PM

#

yeah, nested arrays aren't the best thing you could do in dataframe

worldly kindle Jul 16, 2020, 5:50 PM

#

i'm not sure it is necessary to call np.array() or call list(). maybe try to simply assign the multi-dimensional array.

modest rune Jul 16, 2020, 5:51 PM

#

Yeah, I just haven't come up with a way to cleanly associate the array with the dataframe's row. I am totally ok with managing two different dataframes if I can find a clean way to reference one from the other.

worldly kindle Jul 16, 2020, 5:51 PM

#

(i am not sure how simplified code compares to the actual code)

modest rune Jul 16, 2020, 5:51 PM

#

i'm not sure it is necessary to call np.array() or call list(). maybe try to simply assign the multi-dimensional array.
@worldly kindle

I am 99% sure I tried that and ran into a similar error.

#

I guess if I have two dataframes that both have the same height but differing widths, all I need to do is have them share an index and that will link them together.
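
A rough sketch of that shared-index idea, using a cut-down version of the stock table from earlier. The column names `v0`, `v1`, `v2` are made up; the 2-D array goes into its own frame, so no cell has to hold a nested array.

```python
import numpy as np
import pandas as pd

tickers = ["NFLX", "AAPL", "GOOG"]
stock_data = pd.DataFrame(
    {"Shares": [100.0, 150.0, 500.0], "Cost_Per_Share": [0.10, 0.20, 5.10]},
    index=tickers,
)

# The 2-D array becomes its own frame with the same index, one column per element.
numpy_array = np.array([[1, 2, 3], [4, 2, 3], [1, 7, 3]])
extra = pd.DataFrame(numpy_array, index=tickers, columns=["v0", "v1", "v2"])

# Look up one row's array through the shared label...
print(extra.loc["AAPL"].to_numpy())

# ...or join the two frames on the index when a single table is needed.
combined = stock_data.join(extra)
print(combined)
```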

#

@void anvil for the past 3 days I have been dealing with similar problems. Yes, there are ways you can make substantial improvements.

#

A quick change to your code that should achieve decent improvement, don't use append. Instead, create an array of series that you build. Then, at the very end, use concat on that array, and generate the dataframe in one fell swoop.
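
A minimal sketch of that pattern. `fetch_chunk` is a hypothetical stand-in for whatever produces each piece of data. The point is that the pieces are collected in a plain list and the frame is built once at the end, rather than being copied on every append.

```python
import pandas as pd

def fetch_chunk(i):
    """Hypothetical stand-in for one query result."""
    return pd.DataFrame({"id": [i], "value": [i * 10]})

# Collect the pieces in a plain Python list...
pieces = [fetch_chunk(i) for i in range(5)]

# ...and build the final frame in one call.
result = pd.concat(pieces, ignore_index=True)

print(result)
```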

#

Is there something better than append? Should I keep all the queries in memory then merge all the dataframes?
@void anvil

That would be even better than what I suggested.

#

Do zero conversions on your data as you read it in. Then, once it is all in memory, use one of Pandas builtin functions to convert the data directly into a dataframe. That will be somewhere between 50x and 500x faster than your current approach.
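
As an illustration only (the real data format was not shared, so the record layout below is an assumption): keep the raw rows as plain dicts while reading, then hand the whole list to pandas in a single constructor call.

```python
import pandas as pd

# Hypothetical raw records, e.g. rows parsed from query responses, kept as plain dicts.
records = []
for i in range(1000):
    records.append({"id": i, "price": i * 0.5})

# One constructor call turns the whole list into a frame.
df = pd.DataFrame(records)

print(df.shape)
print(df.dtypes)
```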

#

Do you mind sharing the format of the data you are querying? Is it json? xml?

#

from a requests call?

#

Here is something to burn into your brain and it will help you find the best ways to improve your performance. Iterating and data manipulation in python is SLOW. So, numpy and pandas get around this by asking the user to send them as large of chunks of memory as possible along with instructions on how to manipulate that data. Then pandas and numpy do all of their work in C, get the desired result, and send it back to python. So, anytime you can find a way to send larger chunks of data at a time to pandas or numpy, you will be rewarded with speed improvements.
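
A small sketch of the difference, with made-up data and hypothetical function names. Both functions compute the same result, but the first takes one interpreter step per element while the second hands the whole array to NumPy's compiled code in a single expression. Timings are left out on purpose since they depend on the machine.

```python
import numpy as np

values = np.arange(100_000, dtype=np.float64)

def scale_loop(arr):
    """Pure-Python loop: one interpreter step per element."""
    out = np.empty_like(arr)
    for i in range(len(arr)):
        out[i] = arr[i] * 2.0 + 1.0
    return out

def scale_vectorized(arr):
    """Vectorized: the whole array is processed in compiled code."""
    return arr * 2.0 + 1.0

# Both approaches agree; only the amount of Python-level work differs.
same = np.array_equal(scale_loop(values[:1000]), scale_vectorized(values[:1000]))
print(same)
```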

#

If you are doing some sort of streaming, you will have to find a balance between chunking your data and keeping your app responsive.

worldly kindle Jul 16, 2020, 6:03 PM

#

@worldly kindle

I am 99% sure I tried that and ran into a similar error.
@modest rune gotcha. I must have coded it before thinking i was assigning a 2D array to the new col, but I must have converted it with list().