#data-science-and-ml
options would be of shape, say, (5, 13)
you can use broadcasting there
and then take .min()
broadcasting?
hold up, though
why are you using mean() instead of sum()
I see it doesn't matter in this case
just curious
maybe I should use sum
I can check the assignment again, but I won't ask you about that to keep my question focused on numpy.
(to eliminate the pretense that you're doing my homework)
I think it needs to be mean though because for the manhattan distance, in this case, we ignore columns where either has a nan
so rows with fewer nans get a much higher chance to be closer.
I think it needs to be mean though because for the manhattan distance, in this case, we ignore columns where either has a nan
if that's your intention, then fair enough
so rows with fewer nans get a much higher chance to be closer.
@serene scaffold why do you say this, though?
because assuming all columns are similarly distributed, the means would also have similar distributions, regardless of the number of non-null values in each row
if I understand you correctly
I think the dataset is fake so I don't know if the distribution of nans is meaningful.
maybe I should use sum
@serene scaffold the thing is I'm not sure the question "what is the Manhattan distance between two partially defined points" has a well-defined answer
to use a practical example...say you are at (0, 0) on a map. which is closer to you, (3, ?), or (1, 4)?
of course, this is more a question of statistical theory than programming
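To make the map example concrete, here is a small sketch (not the assignment's formula, just an illustration) that compares summing against averaging the absolute differences when missing coordinates are skipped. The helper name `nan_abs_diff` is made up for this example.

```python
import numpy as np

origin = np.array([0.0, 0.0])
partial = np.array([3.0, np.nan])   # the point (3, ?)
full = np.array([1.0, 4.0])         # the point (1, 4)

def nan_abs_diff(x, y):
    """Absolute coordinate differences; NaN wherever either input is NaN."""
    return np.abs(x - y)

# Summing over the defined coordinates favours the point with missing data.
sum_partial = np.nansum(nan_abs_diff(origin, partial))   # 3.0
sum_full = np.nansum(nan_abs_diff(origin, full))         # 5.0

# Averaging over the defined coordinates flips the ranking.
mean_partial = np.nanmean(nan_abs_diff(origin, partial))  # 3.0
mean_full = np.nanmean(nan_abs_diff(origin, full))        # 2.5
```

With `sum`, (3, ?) looks closer; with `mean`, (1, 4) does. Which one is "right" depends on what the missing coordinate is assumed to be.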
right
and I don't claim to know the intentions of whoever set you the homework
just thinking out loud
anyway, back to your question
I can
this homework assignment is a trick to show how long data science programs take
a lot of people were emailing the TA saying that their code was running overnight
and he finally said "that's the point"
>>> a = np.random.chisquare(5, size=(5, 3))
>>> a.shape
(5, 3)
>>> a.mean(axis=0).shape
(3,)
>>> a - a.mean(axis=0)
array([[-1.60512055, -1.8917574 , -0.87419636],
[ 2.46631863, 0.17530674, -0.26093926],
[ 4.48044859, -0.56073539, -2.20589024],
[-2.02807185, 0.35415127, 0.15569975],
[-3.31357484, 1.92303477, 3.18532612]])
this is broadcasting
note the shapes of the arrays
!e
import numpy as np
print(np.random.chisquare(5, size=(5, 3)))
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
001 | [[ 3.17295805 5.87307768 5.60451568]
002 | [10.90956275 2.15776996 9.61003041]
003 | [ 4.06197492 5.80642425 6.10925965]
004 | [ 3.45856202 1.74811126 10.31556199]
005 | [ 7.84383268 5.67000452 4.59208444]]
the smaller array gets "copied" across the axes
I don't know that I understand what math is happening here
okay, so first you have an array of random numbers, right?
I'm not sure what the lone 5 is for but I see this has a shape of 5 by 3
where each row is one sample, and each column is one class of observations.
and I'll accept that it's random
so for example, one row could be an individual person's data, and one column could be all the, say, heights of the people in the dataset.
now, say we want to take the mean of each column; that would be a.mean(axis=0)
which would give us an array of shape (3,) (one mean per column)
right
next, we want to subtract that mean from each value in the original dataset, matching columns
so subtract the first mean from every value in the first column, second mean from every value in the second column, etc.
What parts of Numpy should you concentrate on? Also, what’s the difference between Numpy and Pandas? It seems like a lot ends up relying on pandas, so should you use/learn both at once?
@rustic apex pandas is built on numpy.
it depends on what you want to do.
pandas is more for higher-level data wrangling.
@rustic apex numpy is specifically for math and especially linear algebra, pandas is more for tabular data in general.
in particular, data that comes in 2D form.
numpy allows you to work in higher dimensions and perform lower-level operations that often don't make sense on tabular data, such as taking the outer product, matrix product, applying functions across strides, etc.
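A rough illustration of that split, with made-up column names and values: the DataFrame holds labelled tabular data, and the array underneath it can be pulled out for lower-level operations.

```python
import numpy as np
import pandas as pd

# pandas: labelled, tabular (2D) data.
df = pd.DataFrame({"height": [1.6, 1.8], "weight": [60.0, 80.0]})
values = df.to_numpy()            # plain 2D float array, shape (2, 2)

# numpy: lower-level array operations.
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0])
outer = np.outer(u, v)            # outer product, shape (3, 2)
product = outer @ values          # matrix product, shape (3, 2)
cube = np.zeros((2, 3, 4))        # a 3D array, which a single DataFrame does not represent directly
```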
so subtract the first mean from every value in the first column, second mean from every value in the second column, etc.
@velvet thorn but remember your original array has shape (5, 3), and your means have shape (3,)
which means that a - a.mean(axis=0) shouldn't work (shape mismatch) and yet it does.
that's because numpy recognises, in a nutshell, that the arrays match along one axis (the last one) and for every other axis, at least one array has either length 1 or is missing that axis.
so it implicitly "expands" the array of means to the shape (5, 3) (conceptually) before performing the elementwise subtraction
Thank you, I’m interested in using CSV and predicting data. I’m just getting started, it’s clicking with me well
the details can be found here https://numpy.org/doc/stable/user/basics.broadcasting.html
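The column-centering example can be checked by writing the "expansion" out by hand. This is only a sketch with a seeded generator so the numbers are reproducible; the first argument to `chisquare` (the lone 5) is the degrees of freedom of the distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.chisquare(5, size=(5, 3))   # 5 rows (samples), 3 columns
means = a.mean(axis=0)              # one mean per column, shape (3,)

# What broadcasting does implicitly ...
centered = a - means

# ... and the same subtraction with the expansion written out.
expanded = np.broadcast_to(means, a.shape)   # a (5, 3) view, no data copied
centered_explicit = a - expanded

# Shapes that do not line up on the trailing axis are rejected.
try:
    a - a.mean(axis=1)              # (5, 3) minus (5,) cannot broadcast
    broadcast_failed = False
except ValueError:
    broadcast_failed = True
```

To subtract per-row means instead, keep the reduced axis with `a.mean(axis=1, keepdims=True)`, which has shape (5, 1) and does broadcast against (5, 3).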
Thank you, I’m interested in using CSV and predicting data. I’m just getting started, it’s clicking with me well
@rustic apex "using CSV"?
what do you mean by that?
The CSV file?
CSV is just comma separated values
I mean that as in just using data
pandas dataframes...@velvet thorn this is what I got so far with trying to create a new column value based on existing columns, but I can't figure out how to get it to work.
def admin_mapper(df):
    if df['Name'].str.startswith('RP', na=False) or df['Name'].str.startswith('RV', na=False) and df['Price'] == 0:
        return 'Repurchase Agreement'
    elif df['Name'].str.startswith('BUY', na=False) or df['Name'].str.startswith('SELL', na=False):
        return 'CDS'
    elif df['Price'] != "0":
        return 'Bond'

df_admin['Type'] = df_admin.apply(lambda row: admin_mapper(df_admin), axis=1)
saying "using CSV" is kind of like "I want to learn driving and I'm excited to use 95 octane petrol"
@merry fern yes, I saw your message earlier
I'll get to it later.
also, you didn't answer my question.
Haha, yea I mean using data and showing predictions
Haha, yea I mean using data and showing predictions
@rustic apex fair enough
thanks, i just updated it. I am not sure where I am going to go with this in terms of other values...
I would say focus on pandas, but understand numpy
thanks, i just updated it. I am not sure where I am going to go with this in terms of other values...
@merry fern you could return a placeholder or None first.
anyway, your code looks more or less right
but replace or with |
How good should you be at math, and at what kind? I taught myself Trig and Pre Cal
@velvet thorn I'm not sure I see the application. The manhattan distance formula I'm using ignores indices for which either is nan
ah, and with & ?
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
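That error comes from using `or`/`and` on whole Series: Python asks each Series for a single truth value, and pandas refuses. A minimal sketch with invented rows, showing the failure and the elementwise version with `|` and `&` (the parentheses matter because `&` and `|` bind tighter than `==`):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["RP 1", "RV 2", "BUY 3", None], "Price": [0, 5, 0, 0]})
starts_rp = df["Name"].str.startswith("RP", na=False)
starts_rv = df["Name"].str.startswith("RV", na=False)

# `or` needs one True/False for the whole Series, so pandas raises.
try:
    starts_rp or starts_rv
    raised = False
except ValueError:
    raised = True

# Elementwise logic: one boolean per row.
is_repo = (starts_rp | starts_rv) & (df["Price"] == 0)
```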
so I assume it requires individual calls to manhattan_distance each time
@serene scaffold you could do it that way, I guess?
it would be slower and kind of bad practice
so I would suggest using vectorisation in combination with np.nanmean instead
then you could entirely cut out the inner loop
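A sketch of what that might look like, assuming `options` has shape (5, 13) as mentioned earlier and `target` is a single row of 13 values. The names and random data are illustrative, and the `BOOL_MASK` from the original function is left out here.

```python
import numpy as np

rng = np.random.default_rng(0)
options = rng.normal(size=(5, 13))   # 5 candidate rows
options[0, 2] = np.nan
options[3, 7] = np.nan
target = rng.normal(size=13)         # the row being compared against
target[4] = np.nan

# (5, 13) - (13,) broadcasts; any NaN on either side stays NaN in the difference,
# and nanmean then skips those positions row by row.
distances = np.nanmean(np.abs(options - target), axis=1)   # shape (5,)
nearest = int(np.argmin(distances))
closest = distances.min()

# Loop version for comparison: one call per row.
def manhattan_distance(x, y):
    not_null = ~np.isnan(x) & ~np.isnan(y)
    return np.mean(np.abs(x[not_null] - y[not_null]))

looped = np.array([manhattan_distance(row, target) for row in options])
```

Both give the same distances; the vectorised form just does the inner loop inside numpy.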
is there a way to use numpy to get some performance advantage in that case?
ah, and with & ?
@merry fern oh wait, I misread your code
okay, you know one thing you can do?
make a subsidiary DataFrame of conditions
then play with that.
e.g.
columns = ['starts_with_rp', 'starts_with_rv', 'starts_with_buy', 'starts_with_sell', 'zero_price']
conditions_df = pd.concat([
df['Name'].str.startswith('RP', na=False),
df['Name'].str.startswith('RV', na=False),
df['Name'].str.startswith('BUY', na=False),
df['Name'].str.startswith('SELL', na=False),
df['Price'] == 0,
], axis=1).rename(columns=columns)
then you can just use masking to assign
e.g. df.loc[~conditions_df['zero_price'], 'Type'] = 'Bond' would take the place of your last condition
do it in reverse order of priority
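Roughly, the masking idea could look like the sketch below. The sample rows are invented, and the helper `prefix` is only there to shorten the example. Assignments go from lowest to highest priority so that later ones overwrite earlier ones.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["RP ABC", "RV XYZ", "BUY PROT", "SELL PROT", "ACME 5% 2030"],
    "Price": [0, 0, 101.5, 0, 99.0],
})

def prefix(p):
    """Boolean Series: does Name start with p? Missing names count as False."""
    return df["Name"].str.startswith(p, na=False)

conditions_df = pd.DataFrame({
    "starts_with_rp": prefix("RP"),
    "starts_with_rv": prefix("RV"),
    "starts_with_buy": prefix("BUY"),
    "starts_with_sell": prefix("SELL"),
    "zero_price": df["Price"] == 0,
})

c = conditions_df
df["Type"] = None
# Lowest priority first; each later assignment overrides the earlier ones.
df.loc[~c["zero_price"], "Type"] = "Bond"
df.loc[c["starts_with_buy"] | c["starts_with_sell"], "Type"] = "CDS"
df.loc[(c["starts_with_rp"] | c["starts_with_rv"]) & c["zero_price"], "Type"] = "Repurchase Agreement"
```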
is there a way to use numpy to get some performance advantage in that case?
@serene scaffold yup, vectorisation
yikes
this kind of uses concepts that you probably haven't been exposed to, which makes it all seem a bit abstract
like broadcasting
but the main idea is that broadcasting, in general, takes the place of loops in numpy
however, from a distance...
can you see how using nanmean might help?
yikes
@merry fern do you see what I mean?
im trying to interpret it and understand its use case now
would you then combine conditions_df w/ original df ?
that creates a new DataFrame with 5 columns
and the same index as the original DataFrame
and crucially, because of that, you can use it to index the original DataFrame.
how about you run it first
and look at the result
why is it important that you are re-indexing the dataframe?
@merry fern not re-indexing
the point is, for example
you can do df[conditions_df['starts_with_rp']] to give you all the rows in the original DF where the Name column starts with 'RP'.
do you see the utility of that?
absolutely
matrix[~np.isnan(matrix[:, j])]
and you can combine that with other columns of conditions_df
it doesn't work because the mask is a different shape
essentially i was trying to figure out, how do i play with the dataframe now that ive gathered the data
to access the rows that you want
pycharm saying:
Unexpected type(s):(enumerate[str])Possible types:(Mapping)(Iterable[Tuple[Any, Any]])
matrix[~np.isnan(matrix[:, j])]
@serene scaffold huh. it doesn't...?
IndexError: boolean index did not match indexed array along dimension 1; dimension is 10 but corresponding boolean dimension is 1
pycharm saying:
Unexpected type(s):(enumerate[str])Possible types:(Mapping)(Iterable[Tuple[Any, Any]])
@merry fern that shouldn't be a problem
IndexError: boolean index did not match indexed array along dimension 1; dimension is 10 but corresponding boolean dimension is 1
@serene scaffold matrix is 2D, right?
that...shouldn't be happening
(~np.isnan(matrix[:, j])).shape is (8000,1)
hmm. KeyError: 'starts_with_rp'
hmm.
KeyError: 'starts_with_rp'
@merry fern that means the columns weren't renamed properly
just assign conditions_df.columns = columns I guess
so much for using rename 🤷♂️
(~np.isnan(matrix[:, j])).shape is (8000,1)
@serene scaffold that's not right
it should be shape (8000,)
why
because you're indexing on the columns
so that dimension should disappear
if matrix is 2D, matrix[:, j] should be 1D, assuming j is an int
>>> a = np.zeros(shape=(8000, 10))
>>> a[:, 1].shape
(8000,)
@velvet thorn, so your code made a list of rules, then made a new df w/ the conditions, then renamed the columns in the same order as the rules, right?
idk what to say
@velvet thorn, so your code made a list of rules, then made a new df w/ the conditions, then renamed the columns in the same order as the rules, right?
@merry fern yup
idk what to say
@serene scaffold you checked matrix.shape, right?
yes
okay, I'm stumped
the best guess I can make is that somewhere matrix is getting an extra axis
because that wouldn't make sense otherwise
if you slice, one axis disappears
this is like the most fundamental thing in numpy
actually
OH
I know why now
it's because np.where returns a tuple
so for j in... is taking the first element of that tuple, which is an array
so j is not in fact an int, which leads to a 2D array upon slicing
>>> np.where([1, 0, 1])
(array([0, 2]),)
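A small reproduction of that pitfall, as a guess at what the loop looked like (the actual code was not posted). Iterating over the result of `np.where` yields the index array itself, so `j` is an array and the slice keeps two dimensions.

```python
import numpy as np

matrix = np.zeros((8, 10))
matrix[0, 3] = np.nan

cols_with_nan = np.where(np.isnan(matrix).any(axis=0))   # a 1-tuple: (array([3]),)

# Looping over the tuple: j is array([3]), so the slice is 2D.
shapes_from_tuple = [matrix[:, j].shape for j in cols_with_nan]

# Looping over the array inside the tuple: j is an integer, so the slice is 1D.
shapes_from_array = [matrix[:, j].shape for j in cols_with_nan[0]]

# With an integer j the row mask has shape (8,) and works as intended.
j = cols_with_nan[0][0]
rows_kept = matrix[~np.isnan(matrix[:, j])]
```

`np.flatnonzero(...)` is another option here, since it returns the index array directly rather than a tuple.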
I have to find the latest date of the tickers (stock of companies) that have gone bankrupt. I actually want to have all those who no longer publish themselves on the market.
you will see in this table there are a lot of stocks. some no longer publish because they have gone bankrupt. I'd like to have all those who stopped publishing
i know there are 7391 of those who stopped publishing
@velvet thorn oh here's something
@slender nymph define "stopped publishing"
hmm
right before you said that
oh, I didn't see it because I have discord in a small window
and the other user made a comment
well it's running @velvet thorn
so I guess we'll see tomorrow if it worked
I forgot about np.where
@velvet thorn, i guess i can't do this:
conditions_df = pd.concat([
df_admin['Name'].str.startswith('RP', na=False) | df_admin['Name'].str.startswith('RV', na=False),
df_admin['Name'].str.startswith('BUY', na=False) | df_admin['Name'].str.startswith('SELL', na=False),
df_admin['Price'] != 0,
], axis=1)
multiple conditions in there
you could, but the point is to generate the conditions individually
and then use them to create another column
that is fine too, though
it didnt work... but i see what youre saying
bc then i can reference the various characteristics
when i do my mod the files come out empty
your conditions are wrong, I think?
df_admin[conditions_df[['starts_with_rp', 'starts_with_rv', 'starts_with_buy', 'starts_with_sell', 'zero_price']]].to_csv(r'csv_filteroutput' + fntime + '.csv')
anything wrong there?
i added a timestamp variable so i can keep running it without having to close the files 🙂
if i can DM u ill send u files
i was trying to export all the values in that second list
but i guess i can just do conditions
conditions_df[['starts_with_rp', 'starts_with_rv', 'starts_with_buy', 'starts_with_sell', 'zero_price']].to_csv()?
why are you indexing and then saving
yea i have no idea what i was doing there haha
i think i left in df_admin accidentally
i wanted to see what was in conditions_df, but pycharm limits it to 5 rows, wasn't sure how to print it verbosely
@velvet thorn i see now that df_conditions creates a df of bools? interesting. how did you learn python/coding? whats your history
uh
I went for a coding bootcamp
then I worked at a startup for a few months
then I worked at another startup for a few months
then I worked as a data science instructor for a few months
now I'm building my own startup
i guess more what i was asking was when/how did you make the most advancement in your knowledge?
was it all company based projects or on your own?
whats your startup?
by doing stuff
think of something cool to build, build it
I learn by doing
whats your startup?
@merry fern edutech related, basically
What are some topics to focus on while learning Numpy?
so now I need to write a mapper to utilize conditions_df, or how would you approach this?
What are some topics to focus on while learning Numpy?
@rustic apex don't think of topics.
think of things you wanna do
(IMO)
so now I need to write a mapper to utilize conditions_df, or how would you approach this?
@merry fern well.
the CLEANEST way
would be to create a third DataFrame
and join the two
that would be my preferred approach.
do you understand the concept of "join"?
like in the SQL sense
the CLEANEST way
@velvet thorn THIS is where im getting caught up - i keep trying to figure out if what im doing is even efficient or standard
yes
well
you need to have a certain kind of mind
to get things intuitively, I guess?
but hard work works too
gotta understand your strengths and weaknesses
i recently read that iterating is bad, and vectorization is the way to go
anyway, you can think of your problem
as basically
joining on the condition columns.
do you understand why?
@rustic apex like of course you should understand concepts like vectorisation, broadcasting and indexing...
no i cant really, what do you mean "joining on"
...but studying them in isolation is likely to confuse you.
@velvet thorn well, Pandas makes sense with loading files and how to display certain things, but I’ve just gotten used to arrays with Numpy
so it's better to find things that you need to do
and learn how to do them with numpy.
no i cant really, what do you mean "joining on"
@merry fern okay
so
conceptually
what you want to do
okay never mind
you understand
what a "join" is, right?
and you know you join on columns?
im familiar with outer/inner join, but i am a newb @ getting my hands dirty with handling data like this
I understand it a little
@velvet thorn The problem: I need to go through each row in a database & check 2 columns' values. based on those 2 columns' values, I need to create a new column and give it a value of "1,2,3" - But, yes, you are correct where I think you were going w/ your question before. Ideally, I'd like to be able to digest and act on a variety of data that comes in. I just started with this 3 example because its more than 2 (True/False)
I always just think of for loops to do this but that's not exactly python's way, right?
@velvet thorn - I see, both dataframes retained identical indices
@merry fern okay, I need to go, but left joins are the solution to your problem
you need a third DataFrame to represent the correspondence of the various conditions to your final output value, with, of course, one row per output value (so 3 in this case)
then you join that to conditions_df, with conditions_df on the left
think about how that would work
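One way the join could be set up, as a simplified sketch: the two combined condition columns (`repo_like`, `cds_like`) and the sample rows are assumptions made for this example, not the columns used earlier in the chat. The mapping table has one row per output value, and `conditions_df` is on the left so every original row is kept, in order.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["RP ABC", "BUY PROT", "ACME 5% 2030", "RV XYZ"],
    "Price": [0, 101.5, 99.0, 0],
})
name = df["Name"]

# One boolean column per rule (hypothetical combined conditions).
conditions_df = pd.DataFrame({
    "repo_like": (name.str.startswith("RP", na=False) | name.str.startswith("RV", na=False))
                 & (df["Price"] == 0),
    "cds_like": name.str.startswith("BUY", na=False) | name.str.startswith("SELL", na=False),
})

# Third DataFrame: which combination of conditions maps to which output value.
type_mapping = pd.DataFrame({
    "repo_like": [True, False, False],
    "cds_like": [False, True, False],
    "Type": ["Repurchase Agreement", "CDS", "Bond"],
})

# Left join on the condition columns; the left row order is preserved.
merged = conditions_df.merge(type_mapping, on=["repo_like", "cds_like"], how="left")
df["Type"] = merged["Type"].to_numpy()
```

Any combination missing from `type_mapping` (here, both conditions True) would come out as NaN, which makes unhandled cases easy to spot.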
think about how that would work
@velvet thorn yes i was just working on that
this is the mapping
thanks for the help
return _methods._mean(a, axis=axis, dtype=dtype,
C:\development\school\data_science\venv\lib\site-packages\numpy\core\_methods.py:170: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
this is the whole traceback that I get
so I don't even know what in my code is causing this error
the line before it is an expected print statement
huh, I guess it's just a warning
wish it gave me more info though
def manhattan_distance(x: np.ndarray, y: np.ndarray) -> float:
    # keep only positions where both values are present and BOOL_MASK is True
    not_null = ~np.isnan(x) & ~np.isnan(y) & BOOL_MASK
    x, y = x[not_null], y[not_null]
    return np.mean(np.absolute(x - y))
I'm wondering if using functools.lru_cache will speed this up
hashing the two arrays might take too long
eh well I guess you can't do that anyway
I think you can apply the mask after calculating the absolute value, that way you don't have to do it twice
@hasty grail I have to apply the mask first, but the problem is that I have it written in such a way that this function will likely get called on the same pair of arrays a couple times
though that number is at most ten
this is the mapping
@merry fern looks good
so I don't even know what in my code is causing this error
@serene scaffold it says "mean of empty slice"
and you have only one call to .mean, so...
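The warning is easy to reproduce: if two rows have no position where both are defined, the mask selects nothing and `np.mean` is called on an empty array. The guard function below is one possible way to handle that case; its name and the choice to return NaN are assumptions for illustration.

```python
import warnings
import numpy as np

x = np.array([np.nan, 1.0])
y = np.array([2.0, np.nan])
not_null = ~np.isnan(x) & ~np.isnan(y)   # all False: no shared defined position

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = np.mean(np.abs(x[not_null] - y[not_null]))   # mean of an empty array

def safe_mean_abs_diff(x, y):
    """Mean absolute difference over shared positions, NaN if there are none."""
    not_null = ~np.isnan(x) & ~np.isnan(y)
    if not not_null.any():
        return np.nan
    return np.mean(np.abs(x[not_null] - y[not_null]))
```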
@serene scaffold you don't wanna use np.nanmean?
@velvet thorn well, I already finished it and got the results, so it's now the TA's problem.
I'm not sure what nanmean would afford me in this context?
just np.nanmean(np.absolute(x - y)[BOOL_MASK]) (whatever BOOL_MASK is)
without masking
okay, so that then
though I don't know that that's the bottleneck
because the problem is that this function can be called up to nine times for the same x, y
it's incredibly unlikely that it would ever reach nine
but I have the results
so... 
hey so for tensorflow, in a book i read that you could create a model through the functional api through smth like this:
keras.layers.Dense(30, activation='relu')(input_)
but then later in the book they define something like this:
def call(self, inputs):
    Z = inputs
    for layer in self.hidden:  # hidden is a list of dense layers
        Z = layer(Z)
    return inputs + Z
so... what's with the inconsistency? (ping plz)
though I don't know that that's the bottleneck
@serene scaffold the point is to remove the inner loop, I guess
using vectorisation and broadcasting, as discussed earlier
I might look into that in the future, though for some reason this is the only programming assignment for this course.
@slate hollow what do you mean...?
i mean in both things they're calling the layer
but one supposedly tells keras how to connect the layers, while the other applies the functions to the inputs
okay
then I don't really see the inconsistency...?
in both cases they're applying layers to layers
oh, wait, never mind
my bad I read wrongly
you're right
the for loop is the successive application
the addition is the last part
yeah, it takes a number of layers
@velvet thorn the layers are in self.layers, whatever that is
yeah..?
1094/1094 [==============================] - 11s 10ms/step - loss: 0.5547 - accuracy: 0.8070 - val_loss: 1.9851 - val_accuracy: 0.4737
pain
I have column values like '6 months', 'months 12' , any method to extract int values from it
@limpid oak df['column_name'].str.extract(r'\d+') should work
thank you @velvet thorn , let me try
@velvet thorn ValueError: pattern contains no capture groups
r'(\d+)'
You didn't replace the regex
?
r'(\d+)'
this works perfect IntValues = df['Months'].str.extract(r'(\d+)').astype(int)
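For reference, a compact version of the exchange above with the two sample values from the question. `str.extract` needs at least one capture group, which is why the first pattern failed; `expand=False` is an optional extra here that returns a Series instead of a one-column DataFrame.

```python
import pandas as pd

months = pd.Series(["6 months", "months 12"])

# No capture group: pandas raises ValueError.
try:
    months.str.extract(r"\d+")
    raised = False
except ValueError:
    raised = True

# With a capture group the digits are extracted, then converted to integers.
int_values = months.str.extract(r"(\d+)", expand=False).astype(int)
```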
thank you so much your kind support @velvet thorn and @odd yoke
How to get recognition for free 101
np.ndarray.shape
@hasty grail what i can do with this?
you get the image's height and width
Hi everyone, what could be the reasons for the kernel to appear dead while running a cell in jupyter notebook? I use a MacBook Pro, OS = OS X Yosemite version 10.10.5, 2.4 GHz Intel Core 2 Duo.
I tried processing in batches and got the same error message about the kernel
try it on google colab
Will try that, Thanks
welcome
Traceback (most recent call last):
File "E:\paymentz\template.py", line 8, in <module>
np.ndarray.shape(img)
TypeError: 'getset_descriptor' object is not callable @hasty grail
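The TypeError is because `shape` is an attribute of the array, not a function to call. A minimal sketch with a blank array standing in for the loaded image (the real `img` would come from the image-loading call, which is not shown in the chat):

```python
import numpy as np

# Stand-in for a loaded colour image: 480 rows, 640 columns, 3 channels.
img = np.zeros((480, 640, 3), dtype=np.uint8)

# shape is read as an attribute on the array itself.
height, width = img.shape[:2]
```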
https://discord.com/channels/@me/757573609433727037/757874627736502342 @hasty grail is this possible in python ?
your link is not working
see this @hasty grail
sure it's possible
if u find any tutorial please do share , i m also looking for this
sure it's possible
@hasty grail ok
what are your results with OpenCV?
(array([], dtype=int64), array([], dtype=int64))
do you know what that means?
what are your results with OpenCV?
Weren't you trying to do template matching with it?
yes i was
and....
my team mate told me to change the task for now
ok...
Well idk what are you doing now
Just explain it here
which image
The one you just linked
please do share if you find any
I understand the first one... But why when theta1 = 0.5, that the point must go thru the 2,1 mark?
this is uhm...
Linear regression with one variable
@tidal sonnet Because if x=2, h(2) = 0.5 x 2 which is 1
Did you mix up x and y?
@mild topaz Sorry, I don't have experience in that specific area of OpenCV
POG!
Wait... but how do we know that x is 2 in that case?
I was not given a value for x... just told that if theta1 was 0.5, and theta0 was 0, that the line would have to go thru the (2,1) point :(
the line has an equation y = 1/2 * x
Therefore the point (2, 1) resides on that line by definition
Any point that fits that equation exactly lies on that line
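The same check in code, assuming the usual one-variable hypothesis h(x) = theta0 + theta1 * x with theta0 = 0 and theta1 = 0.5 as in the question:

```python
def h(x, theta0=0.0, theta1=0.5):
    """Hypothesis for linear regression with one variable."""
    return theta0 + theta1 * x

# x is free to be anything; (2, 1) is just one of the points the line passes through.
points = [(x, h(x)) for x in (0, 2, 4)]
```

So x = 2 is not given by the problem; it is simply a convenient value to plug in, and h(2) = 1 shows that (2, 1) lies on the line.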
I see... Thank you
@velvet thorn, is this a good approach to creating that 3rd dataframe? its unclear to me how to create the Type column in the original dataframe using the conditions df you suggested and then this logic i put together
df_admin = pd.read_excel(
    filenames['admin'],
    sheets['admin'],
    header=0,
    usecols=[3, 5, 6, 7],
    names=['ISIN', 'Name', 'Price', 'Quantity']
)
columns = ['starts_with_rp', 'starts_with_rv', 'starts_with_buy', 'starts_with_sell', 'zero_price']
conditions_df = pd.concat([
    df_admin['Name'].str.startswith('RP', na=False),
    df_admin['Name'].str.startswith('RV', na=False),
    df_admin['Name'].str.startswith('BUY', na=False),
    df_admin['Name'].str.startswith('SELL', na=False),
    df_admin['Price'] != 0,
], axis=1)
conditions_df.columns = columns
type_mapping_data = {'Type': ['Repurchase Agreement', 'Repurchase Agreement', 'CDS', 'CDS', 'Bond'],
                     'starts_with_rp': [1, 0, 0, 0, 1],
                     'starts_with_rv': [0, 1, 0, 0, 1],
                     'starts_with_buy': [0, 0, 1, 0],
                     'starts_with_sell': [0, 0, 0, 1],
                     'zero_price': [0, 0, 0, 0, 0]
                     }
# Create the Type column in using df_admin, conditions_df.
I'm reading up on pandas.merge now
rather than simply joining 2 dfs, i suppose i am trying to join and create a column based on rules?
can Anyone show me real quick how to iterate and replace all elements with some other ones in a column of a Pandas DataFrame that is memory efficient (I have 2.5 Million rows)?
I basically want to iterate over all elements in first row, and then pass each element through a function, the output (it would return the exact value to be put in place, not a variable) of the function to be put in place of the element...
I tried something like:-
for item in df[0]:
    item = numerise(item)
But it doesn't work
What is the function? I think using a lambda might work...just a suggestion and possibly not valid
@jaunty scroll function is named numerise
Tried a for-loop with replace and it's been going on till now...
Yea I'm really not sure I just thought I'd offer that as a possibility
But a lambda should do basically the same thing here from what I can tell
Can you provide some pseudocode on how I should accomplish that?
This is what I was using. the count is the control variable since I was thinking that it was stuck in an infinite loop:-
count = int(0)
for item in DataFrame[int(Row)]:
    count += 1
    DataFrame[Row].replace(item, numerise(item), inplace=True)
    if count == 2500000:
        break
I believe it's also possible to just apply a function to a column in-place.
EDIT: hmm, maybe not
Well, I thought that this kind of stuff would be easy in Pandas. I can iterate over the required values, but not replace them
trying to figure out how to represent this in a lambda, not sure if this is gonna work
I restarted the whole thing, but still no luck. maybe it's not doing anything. Lemme debug it
but it would be something like output = lambda x, numerise: numerise(x)
I just don't know if it would support a single statement after the colon like that
ohk, I tried with 25 values - Function works but it runs it 4 TIMES! so for 25 values it made it go 100 times. This sucks
for 2.5 Million rows, I will die
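For what it's worth, the plain loop fails because `item = numerise(item)` only rebinds the loop variable, and calling `replace` inside the loop rescans the whole column on every iteration. A sketch of the usual alternative, with a toy `numerise` standing in for the real function (which was not posted):

```python
import pandas as pd

def numerise(value):
    """Hypothetical stand-in for the real conversion function."""
    return len(value)

df = pd.DataFrame({0: ["ab", "cde", "f"], 1: [1, 2, 3]})

# Call the function once per element and assign the whole column back.
df[0] = df[0].map(numerise)
```

No lambda is needed, since `map` accepts the function directly. It still makes one Python call per row, so for 2.5 million rows the speed depends mostly on how expensive `numerise` is.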
Is there a way to write to a csv file row by row without including \r\n? I'm importing that CSV file into a SQL database and it's including the \r\n I think
are you using the csv module?
yes
I am looking back at a script I wrote recently and I just used the w.writerow() command and didn't have to use \r\n
with open(csv_file, 'w', newline='') as out:
    csv_writer = csv.writer(out)
    csv_writer.writerows(data)
something like that
but when I open the raw .csv file in notepad++ there's a CR and LF at the end of each line
maybe just run rstrip? kind of a janky fix I guess but it might work
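The CR comes from the `csv` module itself: `csv.writer` ends rows with `'\r\n'` by default, independent of the platform. Passing `lineterminator='\n'` changes that. The sketch below writes to in-memory buffers instead of a file so the result is easy to inspect; the sample rows are made up.

```python
import csv
import io

data = [["a", "b"], [1, 2]]

# Default writer: every row ends with CR LF.
default_buf = io.StringIO(newline="")
csv.writer(default_buf).writerows(data)

# Explicit line terminator: rows end with LF only.
unix_buf = io.StringIO(newline="")
csv.writer(unix_buf, lineterminator="\n").writerows(data)
```

With a real file, keep `newline=''` in `open(...)` as in the snippet above so Python does not translate the line endings a second time.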
how i can do template matching on document images ? in python
Plz can someone provide help on my problem?
@grave frost you can try to get a help room
@merry fern What's that?
@merry fern What's that?
@grave frost on the left bar there is a section that says "python help: available" - you go into an available room and immediately post your question / code and someone will try to help
So it's like a private room for only my problem? Nice feature!
Correct
@grave frost what is the function?
@velvet thorn it is a function to convert the string to a numerical format (one-hot encoding)
@merry fern you want to combine the original dataframe and conditions_df, then join the result onto your third dataframe
@velvet thorn it is a function to convert the string to a numerical format (one-hot encoding)
@grave frost show code.
and explain exactly what you want to do
@velvet thorn not sure how to do that, but im making progress, bc i couldnt figure out the logic in joining dfs, i started playing with another method which works! again, im just not sure the best most efficient way to go about solving the problem :
conditions_admin = [
df_admin['ISIN'].isnull,
(((df_admin['Name'].str.startswith('RP', na=False)) | (df_admin['Name'].str.startswith('RV', na=False))) & (df_admin['Price'] == 0)),
((df_admin['Name'].str.startswith('BUY', na=False)) | (df_admin['Name'].str.startswith('SELL', na=False))),
(((~df_admin['Name'].str.startswith('RP', na=False)) | (~df_admin['Name'].str.startswith('RV', na=False))) & (df_admin['Price'] != 0))
]
values_admin = ['Other', 'Repurchase Agreement', 'CDS', 'Bond']
df_admin['Type'] = np.select(conditions_admin, values_admin)
sure, if that works for you, go ahead
@velvet thorn Well I am able to do the one-hot encoding, then converting the values to float (since int gives some sort of error) but now the problem is coming in the training phase
actually, the most recent addition, "isnull" is breaking it
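The likely culprit is that `isnull` is referenced without being called, so the first "condition" is a method object rather than a boolean Series. A sketch of the `np.select` approach with `isnull()` called, on invented rows; note that the last condition is written here as "not RP and not RV", which is an assumption about the intent, and `default="Unknown"` is an added placeholder for rows that match nothing.

```python
import numpy as np
import pandas as pd

df_admin = pd.DataFrame({
    "ISIN": [None, "X1", "X2", "X3"],
    "Name": ["CASH", "RP ABC", "BUY PROT", "ACME 5% 2030"],
    "Price": [1.0, 0.0, 101.5, 99.0],
})
name = df_admin["Name"]
is_repo_name = name.str.startswith("RP", na=False) | name.str.startswith("RV", na=False)

# Conditions are checked in order; the first one that is True wins for each row.
conditions_admin = [
    df_admin["ISIN"].isnull(),                      # called, so this is a boolean Series
    is_repo_name & (df_admin["Price"] == 0),
    name.str.startswith("BUY", na=False) | name.str.startswith("SELL", na=False),
    ~is_repo_name & (df_admin["Price"] != 0),
]
values_admin = ["Other", "Repurchase Agreement", "CDS", "Bond"]

df_admin["Type"] = np.select(conditions_admin, values_admin, default="Unknown")
```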
I'm not even sure if a join would be the most efficient, but I would say it is the most idiomatic
@velvet thorn Well I am able to do the one-hot encoding, then converting the values to float (since int gives some sort of error) but now the problem is coming in the training phase
@grave frost what's the problem
`InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: indices[0,0] = 2147483647 is not in [0, 18)
[[node sequential/embedding/embedding_lookup (defined at <ipython-input-10-bf624c056173>:13) ]]
(1) Invalid argument: indices[0,0] = 2147483647 is not in [0, 18)
[[node sequential/embedding/embedding_lookup (defined at <ipython-input-10-bf624c056173>:13) ]]
[[Adam/Adam/update/AssignSubVariableOp/_25]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_3133]
Errors may have originated from an input operation.
Input Source operations connected to node sequential/embedding/embedding_lookup:
sequential/embedding/embedding_lookup/2299 (defined at /usr/lib/python3.6/contextlib.py:81)
Input Source operations connected to node sequential/embedding/embedding_lookup:
sequential/embedding/embedding_lookup/2299 (defined at /usr/lib/python3.6/contextlib.py:81)
Function call stack:
train_function -> train_function
`
looks like integer overflow to me
...I feel like something is wrong with your data processing
Hm.. so TF doesn't accept float values?
it does
can you post your code?
but I have no idea what your pipeline is like
or what your code looks like
which is why I said show code
this is why I always add assertions to my pipelines so things like this don't happen
MIN_VAL, MAX_VAL = 0., 1.
tf.debugging.assert_greater_equal(dataset, MIN_VAL)
tf.debugging.assert_less_equal(dataset, MAX_VAL)
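The same kind of guard can be written without TensorFlow. The error says the embedding layer received index 2147483647 (the largest 32-bit signed integer) when valid indices are 0 to 17. The sketch below is an assumption about how the text could be encoded with the `vocab` list posted in the chat, followed by a range check before anything is fed to a model.

```python
import numpy as np

vocab = ['\n', ',', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
         'a', 'b', 'c', 'd', 'e', 'f']
vocab_size = len(vocab)                       # 18 valid ids: 0..17
char_to_id = {ch: i for i, ch in enumerate(vocab)}

def encode(text):
    """Map each character to its vocabulary index (hypothetical helper)."""
    return np.array([char_to_id[ch] for ch in text], dtype=np.int64)

ids = encode("0a3f,9\n")

def ids_in_range(ids, size):
    """True if every id is a valid embedding index in [0, size)."""
    return bool((ids >= 0).all() and (ids < size).all())

good = ids_in_range(ids, vocab_size)
bad = ids_in_range(np.array([np.iinfo(np.int32).max]), vocab_size)
```

Running a check like `ids_in_range` on the features right before training would have flagged the bad value earlier than the embedding lookup did.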
`for chunk in pd.read_csv("/content/hash.txt", header=None,chunksize=25000000):
df = pd.concat([df, chunk], ignore_index=True)
df.head
from sklearn.model_selection import train_test_split
train, val = train_test_split(df, test_size=0.1)
print(len(train), 'train examples')
print(len(val), 'validation examples') #(train and val are both DF's')
Batch size
BATCH_SIZE = 1
BUFFER_SIZE = 10000
Length of the vocabulary in chars
vocab = ['\n', ',', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'] #Vocab List goes here
vocab_size = len(vocab)+1
The embedding dimension
embedding_dim = 12000
dataset = tf.data.Dataset.from_tensor_slices((train[0].values.astype(float), train[1].values)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
for feat, targ in dataset.take(5):
print ('Features: {}, Target: {}'.format(feat, targ))
val_dataset = tf.data.Dataset.from_tensor_slices((val[0].values.astype(float), val[1].values)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
for feat, targ in val_dataset.take(5): #printing validation data
print ('Features_VAL: {}, Target_VAL: {}'.format(feat, targ))
`
use 3 backticks
less @hasty grail
for the code block
@hasty grail Thanx for the tip
0 object
1 int64
dtype: object
This is the train DataFrame
df = pd.concat([df, chunk], ignore_index=True)
df.head
from sklearn.model_selection import train_test_split
train, val = train_test_split(df, test_size=0.1)
print(len(train), 'train examples')
print(len(val), 'validation examples') #(train and val are both DF's')
# Batch size
BATCH_SIZE = 1
BUFFER_SIZE = 10000
# Length of the vocabulary in chars
vocab = ['\n', ',', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'] #Vocab List goes here
vocab_size = len(vocab)+1
# The embedding dimension
embedding_dim = 12000
dataset = tf.data.Dataset.from_tensor_slices((train[0].values.astype(float), train[1].values)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
for feat, targ in dataset.take(5):
    print ('Features: {}, Target: {}'.format(feat, targ))
val_dataset = tf.data.Dataset.from_tensor_slices((val[0].values.astype(float), val[1].values)).shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
for feat, targ in val_dataset.take(5): #printing validation data
    print ('Features_VAL: {}, Target_VAL: {}'.format(feat, targ))
@velvet thorn @hasty grail
actually, the most recent addition, "isnull" is breaking it
@merry fern okay let me explain the join approach
i fixed it, but i would like to understand what you were talking about last night
for chunk in pd.read_csv("/content/hash.txt", header=None,chunksize=25000000):
    df = pd.concat([df, chunk], ignore_index=True)
Does this even work? You would be trying to concat None and a DataFrame together in the first iteration
what does df look like?
There's no point using a chunksize since you're loading the entire thing into memory anyway
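A minimal sketch of the two points above, assuming a tiny in-memory CSV (`csv_text`, hypothetical) in place of `/content/hash.txt`. Collecting the chunks in a list and concatenating once avoids starting from an undefined `df`. It still ends up with everything in memory, so it only tidies the loop and does not save RAM.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: two columns, no header row.
csv_text = "111,1\n222,2\n333,3\n444,4\n555,5\n"

chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), header=None, chunksize=2):
    # Each chunk is an ordinary DataFrame with at most `chunksize` rows.
    chunks.append(chunk)

# One concat at the end instead of growing df inside the loop.
df = pd.concat(chunks, ignore_index=True)
print(df.shape)
```

To actually reduce memory use, each chunk would have to be processed and discarded inside the loop rather than appended.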
@merry fern
>>> df
cond_1 cond_2 cond_3
0 False True True
1 False True True
2 False False False
3 False True True
4 True True False
5 True False False
6 True True False
7 True True False
8 False True False
9 True True True
10 True False True
11 False False True
12 True True True
13 True True False
14 False False True
15 True False True
>>> indicator_df
cond_1 cond_2 cond_3 category
0 True True True CAT_A
1 False False True CAT_B
2 True False False CAT_C
>>> conditions = ['cond_1', 'cond_2', 'cond_3']
>>> pd.merge(df, indicator_df, how='left', left_on=conditions, right_on=conditions)
cond_1 cond_2 cond_3 category
0 False True True NaN
1 False True True NaN
2 False False False NaN
3 False True True NaN
4 True True False NaN
5 True False False CAT_C
6 True True False NaN
7 True True False NaN
8 False True False NaN
9 True True True CAT_A
10 True False True NaN
11 False False True CAT_B
12 True True True CAT_A
13 True True False NaN
14 False False True CAT_B
15 True False True NaN
(sorry, I know there are two conversations going on and this is a huge chunk)
@hasty grail When loading the entire thing in memory it doesn't load due to the lack of RAM
[2499999 rows x 2 columns]>
AND:-
0 object
1 int64
dtype: object
What object are they exactly?
[0] is supposed to look like a long list of numbers: `3790673563025180902423922202540363554017`
Dunno why object. I did try to convert but got error
So in the pipeline step I use .astype(float)
that's too big for int64
@velvet thorn well, it's a long string encoded in numbers
might help if you explained your problem in detail
like why you're doing what you're doing
@merry fern
@velvet thorn have to stop you for a second and thank you for chatting with me about this, between last night and today made some major leaps in processing the data. you're awesome! thank you
yw 🙂
Can you set the dtype parameter to {'first_col': str, 'second_col': np.int32}?
@velvet thorn I am making a seq2seq model that tries to find out the relationship b/w a string(encoded) and a number.
the long list should then be parsed as a string
I just found that it helps to create a new python file with only the code I am working on to isolate that and then bring it back into the full file once it works accordingly
anyway, the dataframes above should illustrate how to combine the conditions with the "indicator dataframe" (which is a mapping from conditions to intended category)
while the second column would be parsed as int32 so you don't have to cast it to int32 when using TF
imagine also that df has additional columns which are your other features that the conditions are derived from (they won't be affected by the join)
Wait how do I define the dtypes?
im reading
it's a parameter in read_csv
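A rough sketch of the `dtype` suggestion, assuming the file has no header so the columns are labelled `0` and `1` (as in the `header=None` call earlier). The in-memory `csv_text` is a hypothetical stand-in, and the target value in its first row is a placeholder, not real data.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical two-row sample; the 0 in the first row is a placeholder target.
csv_text = (
    "3790673563025180902423922202540363554017,0\n"
    "9128165362313010342475980190211245832428,2499996\n"
)

# With header=None the column labels are the integers 0 and 1.
df = pd.read_csv(io.StringIO(csv_text), header=None, dtype={0: str, 1: np.int32})
print(df.dtypes)
print(df[0].iloc[0])
```

Reading the first column as a string keeps every digit intact, which a numeric dtype cannot do for values this long.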
what's your pandas version @grave frost
okay, i see what youre saying, but how do you then join the newly added column to the original data instead of to the conditions df?
1.0.5
okay, i see what youre saying, but how do you then join the newly added column to the original data instead of to the conditions df?
@merry fern "newly added"?
oh.
okay so basically df here is the concatenation (pd.concat) of the original DataFrame with your features and conditions_df
then you join that on indicator_df
df doesn't contain bools, it contains strings and floats, so i dont follow how to merge them
@hasty grail When loading the entire thing in memory it doesn't load due to the lack of RAM
@grave frost why don't you train with an iterator
that loads on demand?
df doesn't contain bools, it contains strings and floats, so i dont follow how to merge them
@merry fern you're only joining on the condition columns
which we produced earlier, remember?
those are bools
all other columns do not participate in the join, but merely appear in the result unchanged
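The join approach described above can be sketched roughly like this. The `features` frame and its column names are hypothetical; `conditions_df` and `indicator_df` follow the naming used in the discussion.

```python
import pandas as pd

# Hypothetical original data: strings and floats, no bools.
features = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "value": [1.5, 2.0, 0.3, 4.1],
})

# Boolean condition columns derived earlier from the features.
conditions_df = pd.DataFrame({
    "cond_1": [True, False, True, True],
    "cond_2": [True, False, False, True],
    "cond_3": [True, True, False, False],
})

# Mapping from a combination of conditions to the intended category.
indicator_df = pd.DataFrame({
    "cond_1": [True, False, True],
    "cond_2": [True, False, False],
    "cond_3": [True, True, False],
    "category": ["CAT_A", "CAT_B", "CAT_C"],
})

conditions = ["cond_1", "cond_2", "cond_3"]

# Put features and conditions side by side, then join only on the conditions.
df = pd.concat([features, conditions_df], axis=1)
result = pd.merge(df, indicator_df, how="left", on=conditions)
print(result)
```

Rows whose combination of conditions is not listed in `indicator_df` get `NaN` as their category, and the feature columns pass through untouched.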
@velvet thorn I don't know much about them. But whenever I load csv it will run out of memory
there's a way to load part of the dataset, train on that batch, then load another part, repeat
@hasty grail I tried that, but again memory problem
it also supports shuffling so you can do that without using sklearn
did you set the buffer size?
How much should I set it to?
depends on how much memory you can afford
12Gb
I think you can use the default buffer size when initing the dataset
but then use a larger buffer size for the shuffle method, which is in terms of number of elements
Trying with 10
@hasty grail @velvet thorn Is this valid? looks empty to me <CsvDatasetV2 shapes: ((), ()), types: (tf.int64, tf.int64)>
you need to set a valid type for the columns
the x values are likely to overflow given how long they are
what do they even represent?
so tf.float()?
<PaddedBatchDataset shapes: ((1,), (1,)), types: (tf.float32, tf.int64)>
[0] is supposed to look like a long list of numbers :
3790673563025180902423922202540363554017
Can you explain what this means?
It's a string represented by numbers
so it's a string
Yeah, but I converted to int so that model can understand that
models work only on integers, right?
what are you trying to do?
I am making a seq2seq model that tries to find out the relationship b/w a string(encoded) and a number.
oh
and with that relationship predict some more numbers for the given encoded string
then you should transform that sequence of numbers into a sequence of one-hot encoded vectors
Unfortunately, it's alphanumeric
but what is the problem in numbers?
it's too big
You are not allowed to use that command here. Please use the #bot-commands channel instead.
@hasty grail the bot doesnt have numpy installed
NameError: name 'iiinfo' is not defined
It does, actually
I am pretty sure it does
@desert oar :white_check_mark: Your eval job has completed with return code 0.
[1 2 3]
looks like im mistaken, thats very helpful to know
!e ```python
import pandas as pd
print(pd.Series([1,2,3]))
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | 0 1
002 | 1 2
003 | 2 3
004 | dtype: int64
😮
also networkx and forbiddenfruit
import numpy as np
print(np.iinfo(np.int64).max)
print(np.finfo(np.float64).max)
!e ```python
import numpy as np
print(np.iinfo(np.int64).max)
print(np.finfo(np.float64).max)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | 9223372036854775807
002 | 1.7976931348623157e+308
seems big enough for my purpose
note that the numpy max isnt necessarily the same as the python max
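A small check of what those limits mean for the 40-digit value quoted earlier, as a sketch only. float64 can hold the magnitude but not every digit, so casting with `.astype(float)` silently changes the value.

```python
import numpy as np

value = 3790673563025180902423922202540363554017  # example value from the chat

# Far beyond what int64 can represent.
exceeds_int64 = value > np.iinfo(np.int64).max

# float64 has a 53-bit mantissa, so an odd integer this large cannot survive
# a round trip through float unchanged.
round_trips = int(float(value)) == value

print(exceeds_int64, round_trips)
```

Python's own `int` is unbounded, which is why `value` itself is exact here.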
what error function are you planning to use for this?
Ok no it isn't
why do you need numbers that are so large? what are you doing exactly?
you can also get unbounded integers in numpy with the object dtype and using python ints at a pretty severe performance hit
i see some stuff about tensorflow above
sparse_categorical_crossentropy
that's returning numbers that are on the order of 1e300?
I am making a seq2seq model that tries to find out the relationship b/w a string(encoded) and a number.
Can you provide an example?
9128165362313010342475980190211245832428 --> 2499996
yeah i would need to see an example too. but if it's seq2seq i'd imagine you would want to encode the number as a string of digits rather than a number
are you the one who was trying to use ML to reverse hashes a while ago?
is that what these are?
so lets say i have 2 dataframes i want to combine, but in the new combined dataframe, i want each dataframe to have an index from its source...
i dont know what that question is supposed to mean
The fact that you're using sparse_categorical_crossentropy implies that there's a (relatively) small set of output values
@merry fern be careful, what does "combine" mean here?
But your output is 2499996...
There is a small set
@merry fern be careful, what does "combine" mean here?
@desert oar not concat, but to add the rows together
@grave frost cross entropy is for categorical targets. you are not going to have 1 target for literally every real number integer
It's just a combination of different classes including integers
so i think merge
Can you describe the set of outputs?
@merry fern can you give an example of inputs and outputs for this? fake data is fine, maybe just 5 rows
It will predict many classes in succession right? (like 1 then 3 etc..)
@grave frost can you give an example of more records here?
where are these digit strings coming from
what exactly is the relationship you're trying to model
and where are the numbers coming from
Of a hash and its decoded value
ok. and what is the space of all valid decoded values?
no it's not more confusing to explain
@desert oar i hope this is illustrative
how many possible in/out values are there?
it sounds like 2.5 million outputs
that's definitely not a small set
@merry fern so you want to stack them vertically?
So which loss to use?
@merry fern so you want to stack them vertically?
@desert oar right, to create a master list
im compiling differences between data sources, and then i want to have a master list of differences
without losing reference to where those differences originate
So what exactly should I do to accomplish my goal?
assuming your decoded values are in the range of int64 you can use MAE
!e ```python
import pandas as pd
data1 = pd.DataFrame({
    'XYZ': {'ISIN': 123, 'Q': 100, 'P': 1},
    'ABC': {'ISIN': 345, 'Q': 200, 'P': 2},
}).rename_axis(index='Type')
data2 = pd.DataFrame({
    'XYZ': {'ISIN': 123, 'Q': 100, 'P': 1},
    'ABC': {'ISIN': 345, 'Q': 200, 'P': 2},
}).rename_axis(index='Type')
data = pd.concat({'A': data1, 'B': data2})
print( data )
```
@hasty grail mape?
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | XYZ ABC
002 | Type
003 | A ISIN 123 345
004 | Q 100 200
005 | P 1 2
006 | B ISIN 123 345
007 | Q 100 200
008 | P 1 2
sorry, MAE
Will that fix the error?
@grave frost you might be able to get away with classification if you use negative sampling like they do in NLP models http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/
you need to input it as a sequence instead of a single number
and yes 100% the input needs to be a sequence of digits and not a number
SO a string?
more like a sequence of one-hot encoded vectors
so Tf.String is an acceptable dtype?
but you still need to figure out how to convert said string to numbers
whether one-hot-encoded or something else
@merry fern did you see my example above? the important part is passing a dict to pd.concat
But not the one long number?
correct, definitely not one big number
one-hot is the simplest way to do it imo
since you only have 10 individual characters in a string, one hot encoding is parsimonious and sensible
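One way the one-hot suggestion could look, as a minimal sketch. The helper `one_hot_sequence` and its `alphabet` parameter are hypothetical names; the default alphabet assumes digits only, and a hex alphabet would be passed in if the strings are alphanumeric.

```python
import numpy as np


def one_hot_sequence(text, alphabet="0123456789"):
    """Turn a string into a (len(text), len(alphabet)) one-hot matrix."""
    lookup = {ch: i for i, ch in enumerate(alphabet)}
    indices = np.array([lookup[ch] for ch in text])
    # Row i of the identity matrix is the one-hot vector for symbol i.
    return np.eye(len(alphabet), dtype=np.float32)[indices]


encoded = one_hot_sequence("9128165362")  # first ten digits of the example above
print(encoded.shape)
print(encoded.argmax(axis=1))
```

Each row has a single 1, so the model sees a sequence of symbols rather than one enormous number.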
alright, I will implement it. Anything else I need to keep in mind?
like i said, use negative sampling
i linked a blog post about it above
it's what they use in "traditional" word2vec to improve training time
@desert oar this part of the code is what i was looking for! data = pd.concat({'A': data1, 'B': data2})
you could try your model as a regression model too @grave frost but im skeptical that it's the right thing here
@merry fern good. note that this will give you a multi-index on your dataframe, which increases the complexity level
No regression
some hashes are done such that the distribution is essentially random
It is random. That's the whole point
yeah i also think this project will go nowhere btw
might be a good exercise in learning tensorflow but
yes thats what i meant, multi-index. the funny thing is when i output to csv, it puts the multi-index in every line
but in the python console, it just shows a header line and then goes forward, example:
Master Break List
Quantity Price
Type ISIN
Int vs. PB Bond RU000A0JWHA4 7000000.0 -0.258750
RU000A0ZYYN4 -3000000.0 -1.123458
vs.
if you could reverse hashes with machine learning we'd have really big problems on our hands
yeah xD
is that by design?
@merry fern yes, that's intentional
great
I have to find a very Small relationship. Something that even bring the loss to 0.002 is outstanding
awesome awesome awesome
It's just a POC
0.2 would be outstanding already
For another architecture I had in mind
indeed, you are going to basically be depending on the PRNG being insecure
i have made leaps and bounds in digesting excel data, mapping columns, creating new columns based off data, and organizing the output! thank you @desert oar @velvet thorn and everyone else
if you can get the accuracy anywhere close to 5% i'll be surprised
@merry fern thats great to hear. the more you learn the easier (and more fun) it gets
you probably already can
I am expecting val acc to be 0.004-ish
depending on how many possible outputs are there
also here's a hot tip @merry fern :
!e ```python
import pandas as pd
data1 = pd.DataFrame({
    'XYZ': {'ISIN': 123, 'Q': 100, 'P': 1},
    'ABC': {'ISIN': 345, 'Q': 200, 'P': 2},
}).rename_axis(index='Type')
data2 = pd.DataFrame({
    'XYZ': {'ISIN': 123, 'Q': 100, 'P': 1},
    'ABC': {'ISIN': 345, 'Q': 200, 'P': 2},
}).rename_axis(index='Type')
data = pd.concat({'A': data1, 'B': data2}, names=['source'])
print( data )
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
001 | XYZ ABC
002 | source Type
003 | A ISIN 123 345
004 | Q 100 200
005 | P 1 2
006 | B ISIN 123 345
007 | Q 100 200
008 | P 1 2
look at the names of the multiindex when you use the names= parameter
nice little convenience feature
@merry fern otherwise you'd have to write
data = pd.concat({'A': data1, 'B': data2}).rename_axis(index=['source', 'Type'])
nice i just added that
@grave frost yes, otherwise it would be computationally ridiculous to compute the weight update for all 2.5m outputs
Alright, as long as it can be implemented in a line or 2 (as if that ever happens)
thats the wrong attitude
you use the methods that exist, not the tools you happen to have sitting right in front of you
if you want to do something, either you do it or you don't
try without negative sampling and see what happens
leaving for now, need to workout i feel like shit. ttyl
if it works, then great
ok, 1-hot, MAE and -ve sampling. anything else?
but it will probably be dog slow
MAE is incompatible with classification
you know this
No
do you know what MAE stands for?
do you know how it is defined?
it literally does not make sense on a classification problem
it is for regression problems
if you do this as a regression problem, then yes MAE is valid
accuracy, precision, recall, f1, ...
I would like to learn how to use machine learning algorithms, I have no prior experience with machine learning. Can someone recommend a good place to learn from?
@severe spindle how much programming experience do you have?
k, I will do classification
@desert oar Several years , I'm a third year computer science student at university
using machine learning algorithms is not the right way to think about it. machine learning is a process that requires using these algorithms at some point
they dont make sense outside the context of actually doing some kind of prediction or other machine learning work
since you already know programming, the course https://fast.ai might be a good place to start
they get you right into the sexy deep learning stuff
okay sounds really good, thanks for explaining a little too 😄
if you want to make a career out of this (or do more than trivial hobby projects) you'll eventually want to go back and get into the foundational math and start learning stats as well
Thanx a ton for your help guys and placing me in the right direction! 🙂 @desert oar @hasty grail @velvet thorn
np and... good luck
you're welcome both, sorry i'm in a curt mood today but i just dont want people wasting their time on stuff that's not worth doing
Nothing to worry about. I know it is hopeless, but I need a baseline
It is actually a POC for establishing that there is bias in randomness. A base for a new architecture I am planning to develop.
first thing that showed up in my search
I'm taking some ML modules this year so I'm sure I'll have to hit the books for the maths at some point too but you've given me a good place to start, thanks
there is bias in randomness
i think you might need to qualify this a bit. there is a lot of literature already on cryptography... like 50 years of it, written by very smart people
NN's cannot predict plaintext out of encrypted text. The best you can do is REDUCE the time taken to decrypt, which is what my theoretical architecture does. But right now it's in a very naive and basic phase (too many assumptions, too little testing). It would need people with PhDs to make it...
@desert oar A simple assumption that anything random can be proved to be non-random given a complex enough function
That is basically what linear models do, albeit on a very small scale.
There are several theorems and proofs on it
Good Night to all!
i dont think thats what the universal approximation theorem actually says...
but good luck
@desert oar next rabbit hole for me is rendering results in a web platform 😛
@desert oar You don't need to understand any theorems for the basic idea. It's pretty intuitive in itself. Consider a series of numbers: 0,2,4,6,8. Then f(x)=2x, for x in W. Now consider a relatively more complex series: 0,1,4,9. Here, f(x)=x^2. For ANY given sequence of numbers, I can compute a corresponding function to represent that data. Assume you have much less intelligence than the average human, only enough to grasp basic arithmetic. Then the second set of numbers might look like random numerals to you. But actually they are governed by a function whose complexity is outside your understanding. A NN tries to approximate that function. So no matter how random a set of numbers you give it, there will always be a relation. Now that relation might be too complex to compute with normal machines and might require quantum-level power, but there will always be a relation. http://neuralnetworksanddeeplearning.com/chap4.html Look at this link for a more visual idea
Hey all! Anyone able to answer a question about calling data from a linear regression model (OLS) generated by statsmodels?
I'm trying to get a list of the pvalues greater than .05, using:
print('P Values: ', model.pvalues > .05)
This returns a df showing every row and the Boolean value for the pvalue of each row. I'm new to python (and programming), so I don't know a lot of things that should be obvious.
Hey guys! How can I use current value when updating data in Pandas DataFrame?
@grave frost i am familiar with the universal approximation theorem. what you seem to be ignorant of is the large body of work dedicated to making PRNGs look and act as random as possible, specifically for the purpose of making what you are trying to do effectively impossible
it more or less forms the basis of modern cryptography
if you dont want to take my word for it, go ask about it on a math forum or computer science forum or cryptography forum. see what they have to say about your project
i'm willing to be proven wrong, but not by a misquoted textbook chapter
@strong osprey can you clarify what you are trying to do? preferably with sample input data and the desired result
@elder creek the result of model.pvalues > .05 is a Series containing Boolean values -- you can use that Series to select only the True rows like so:
model.pvalues.loc[model.pvalues > .05]
Yes
(mistakes corrected above)
Wow, awesome, thank you so much
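For anyone following along without a fitted model, the same boolean-mask selection can be tried on a plain Series. The `pvalues` Series and its labels are made-up stand-ins for `model.pvalues`.

```python
import pandas as pd

# Hypothetical stand-in for model.pvalues (index = coefficient names).
pvalues = pd.Series({"const": 0.001, "x1": 0.20, "x2": 0.03, "x3": 0.75})

mask = pvalues > 0.05              # Boolean Series, one entry per coefficient
not_significant = pvalues.loc[mask]  # keeps only the rows where mask is True

print(not_significant)
print(list(not_significant.index))
```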
i'd also encourage you to avoid selecting features blindly by p > 0.05
I want to append data to cell. I know I can do :
self.data.loc[self.data['SKU'] == '826945379', 'Images'] = 'aaa'
but how can i use the data that already is in the cell
I've got 136 beta values... just looking to remove the ones with more than .05 pvalue
why 0.05? why not 0.01? have you adjusted the p-values for multiple comparisons? does it even make sense to compare the coefficient to 0? is the model actually homoskedastic i.e. does it satisfy the statistical assumptions required to do such t-tests?
that is an outdated feature selection procedure in my opinion
what is the purpose of your model? to make predictions? or to make inferences about underlying relationships?
in that case, use regularization, ridge or lasso
banish all thought of stepwise selection
I get the concept piece
It's for a class... I've gotten the answers I need, I just want to call it in a tidy manner
@strong osprey ```python
sel = self.data['SKU'] == '826945379'
self.data.loc[sel, 'Images'] = foo(self.data.loc[sel, 'Images'])
```
How do I access attributes of an XML element if those attributes are subelements?
its a shame they are still teaching that shit in classes
anyway hopefully the code snippet helps
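A self-contained version of that snippet, assuming the goal is to append text to whatever string is already in the cell. The sample frame and the `',aaa'` suffix are made up for illustration, and plain string concatenation stands in for `foo`.

```python
import pandas as pd

data = pd.DataFrame({
    "SKU": ["826945379", "111111111"],
    "Images": ["a.jpg", "b.jpg"],
})

sel = data["SKU"] == "826945379"
# Read the current value(s), build the new value from them, and write back.
data.loc[sel, "Images"] = data.loc[sel, "Images"] + ",aaa"

print(data)
```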
@jaunty scroll are you using a specific library to do this?
@desert oar element tree
and can you give an example of some XML & the resulting values you want?
this isnt really a data science question but it might be relevant to people here. normally in the future i would recommend asking questions like this in a help channel, see #❓|how-to-get-help
oh ok @desert oar that's good to know I wasn't really sure where to ask
@desert oar thanks, i see
but this is part of the data structure
I'm asking it here because the end result of this question is going to be a parser that converts to csv and from there into RedShift using dataframe
yeah one of those fringe projects imo
this parser has been a pain because part of the XML just has standard tags and attributes and then here it does this multi-level thing that breaks the program unless I just treat all the elements as equivalent which is useless
do you have the actual definition for ns1?
i think lxml has better namespace support
if you include that as text and not just an image, i can make an example for you @jaunty scroll
one moment sorry was afk
<ns1:recordIdentifier>33</ns1:recordIdentifier>
<ns1:insuredMemberIdentifier>P12561002023TRS</ns1:insuredMemberIdentifier>
<ns1:insuredMemberBirthDate>1989-12-31</ns1:insuredMemberBirthDate>
<ns1:insuredMemberGenderCode>M</ns1:insuredMemberGenderCode>
<ns1:includedInsuredMemberProfile>
<ns1:recordIdentifier>34</ns1:recordIdentifier>
<ns1:subscriberIndicator>S</ns1:subscriberIndicator>
<ns1:subscriberIdentifier></ns1:subscriberIdentifier>
<ns1:insurancePlanIdentifier>93182VA013001402</ns1:insurancePlanIdentifier>
<ns1:coverageStartDate>2015-01-01</ns1:coverageStartDate>
<ns1:coverageEndDate>2015-12-31</ns1:coverageEndDate>
<ns1:enrollmentMaintenanceTypeCode>021028</ns1:enrollmentMaintenanceTypeCode>
<ns1:insurancePlanPremiumAmount>450.00</ns1:insurancePlanPremiumAmount>
<ns1:rateAreaIdentifier>003</ns1:rateAreaIdentifier>
</ns1:includedInsuredMemberProfile>
</ns1:includedInsuredMember>
<ns1:includedInsuredMember>
do you want like an actual text file?
uhh this isnt PII right?
my parser breaks when it gets down to the bottom of these multi-level stacks like this because I think its looking for a standard attribute like a string or integer
can you show your current code
```python
for node in tree.iter(None):
    print('\n')
    for elem in node.iter():
        if not elem.tag==node.tag:
            print("{}: {}".format(elem.tag, elem.text))
```
ok 1 min
this is maybe a stupid question but why does the insuredMemberProfile have a separate record identifier from the insuredMember itself
that is actually a good question that I couldn't even begin to answer other than to say that the fine folks in the federal government choose to organize their files this way and I have no say in the matter
lol ok
the reason i ask is -- you want to flatten all that stuff in there into 1 record?
yea, ideally I want nested dictionaries that can be read into dataframe
I'm very new at this and kind of learning as I go so if there's something I say that makes no sense please correct me or ask for clarification
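import xml.etree.ElementTree as ET

tree = ET.parse('InboundMedicalClaimFileExample.xml')
included_insured_members = []
# note: ElementTree reports namespaced tags as '{namespace-uri}name',
# so the 'ns1:' prefixes below need to be swapped for the real URI form
for member in tree.iter('ns1:includedInsuredMember'):
    member_info = {}
    included_insured_members.append(member_info)
    for node in member:
        if node.tag == 'ns1:includedInsuredMemberProfile':
            # copy the nested profile fields into the same record
            for subnode in node:
                member_info[subnode.tag] = subnode.text
        else:
            member_info[node.tag] = node.text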
what about something like this?
let me play with this for a min but this looks hopefully
hopeful*
so its giving me a mismatched tag error
File "test.py", line 3, in <module>
tree = ET.parse('inboundenrollmentfile.xml')
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1729.0_x64__qbz5n2kfra8p0\lib\xml\etree\ElementTree.py", line 1202, in parse
tree.parse(source, parser)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1729.0_x64__qbz5n2kfra8p0\lib\xml\etree\ElementTree.py", line 595, in parse
self._root = parser._parse_whole(source)
xml.etree.ElementTree.ParseError: mismatched tag: line 439, column 2
that sounds like a problem in the file
ah yes it is, that's the last line in the file
I just sent you the middle part because that's what was giving me the error but I think if I update these tags to capture the whole file this could work
I could extend this code to work for multiple such tiered data structures right?
yes of course
notably, bool(node) will tell you if the node has children (newer Python versions deprecate truth-testing elements, so len(node) > 0 is the safer spelling)
so you can say ```python
if node:
    for subnode in node:
        ...
else:
    node.text ...
```
cool thanks for the help it is much appreciated
@jaunty scroll you can flatten nodes recursively/infinitely too if you want
def flatten_node(node):
    result_container = {}
    if node:
        for subnode in node:
            result_container = {**result_container, **flatten_node(subnode)}
    else:
        result_container[node.tag] = node.text
    return result_container
although this will fail on XML where you have 2 nodes with the same tag
(among many other limitations)
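A runnable variation on the recursive idea, as a sketch only. The namespace URI `urn:example:ns1` is invented because the real `ns1` definition was never posted, and `local_name` is a hypothetical helper. Keys are prefixed with the parent path so the two `recordIdentifier` fields no longer overwrite each other, and `len(node)` replaces the truth test.

```python
import xml.etree.ElementTree as ET

# Trimmed sample based on the snippet above; the namespace URI is made up.
xml_text = """
<ns1:includedInsuredMember xmlns:ns1="urn:example:ns1">
  <ns1:recordIdentifier>33</ns1:recordIdentifier>
  <ns1:insuredMemberGenderCode>M</ns1:insuredMemberGenderCode>
  <ns1:includedInsuredMemberProfile>
    <ns1:recordIdentifier>34</ns1:recordIdentifier>
    <ns1:subscriberIndicator>S</ns1:subscriberIndicator>
  </ns1:includedInsuredMemberProfile>
</ns1:includedInsuredMember>
"""


def local_name(tag):
    """Drop the '{namespace-uri}' part that ElementTree puts in front of tags."""
    return tag.rsplit("}", 1)[-1]


def flatten_node(node, prefix=""):
    """Flatten a node into {dotted.path: text}, recursing into children."""
    key = prefix + local_name(node.tag)
    result = {}
    if len(node):  # the node has child elements
        for subnode in node:
            result.update(flatten_node(subnode, prefix=key + "."))
    else:
        result[key] = node.text
    return result


record = flatten_node(ET.fromstring(xml_text))
for path, text in record.items():
    print(path, "=", text)
```

Sibling elements that share a tag (for example two profiles under one member) would still collide, so real data may need a counter or a list per key.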
what's the purpose of those ** @desert oar
aren't those usually for kwargs in function parameters?
@jaunty scroll it serves the same function here, except to the {} dict constructor
!e ```python
x = {'a': 1, 'b': 2}
y = {'b': 102, 'c': 103}
print( {**x, **y} )
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
{'a': 1, 'b': 102, 'c': 103}
so is constructor that a kind of anonymous function? not sure if that's the right term but it is essentially doing what a function would do here it seems like
that constructor*
its not a function
but it uses similar syntax
you can think of it as the people who designed the python language being clever and letting you use the same syntax in multiple places to mean similar things
how do I get the plain text from pytesseract.image_to_data?
I've got two dfs (both 24 rows) in pandas but have different indices and I can't figure out how to multiply them together. The result is series that has 25 rows so I know something is wrong
The different indices is due to the fact I sliced them from different parts of a data set
How often do you use Numpy by itself? Vs with Pandas or MatLab?
@wise garden do you not care about the indices? then just do x.reset_index(drop=True) * y.reset_index(drop=True)
if you do care about the indexes youll have to do some more work
no this is perfect thx
Hi, not sure if this is the right channel, but are all GPU calculations made with tensorflow based on CUDA? That is, if I have a GPU without CUDA support I'm "doomed"?
I tried to run a CNN the other day and poor CPU... I could almost hear the screams
@keen root pretty much, yes
CUDA is what allows programs to "talk to" the GPU without writing low level graphics code
tensorflow certainly depends on it
i specifically bought an nvidia 1060 for this reason 🙂 even though its not good for actual machine learning performance, at least it runs CUDA so its good enough to test code before paying for cloud compute
that's too bad... This old baby goes back to early high school. If only I knew I would be needing cuda someday 😅
thank you anyway
@desert oar what did that cost?
@serene scaffold i got the whole rig for $300 without SSD
it was a 4 year old gaming rig
I'm hopefully going to build a pc when I graduate
some local kid was upgrading
Of course they were
I'm totally not jealous of gamer kids with better machines than me.
it was a steal tbh, i dont play recent games much, or if i do i dont care if they are max settings
Also I finished that assignment. The prof said it was the hardest thing for the whole course. But it seemed like it was just pandas and numpy fundamentals.
So idk what the rest of the class will even be.
Mean imputation and hot deck imputation
that, or, they were expecting people to do it in java
I got the same answers as two of my friends.
I mean I guess you could have done it in Java if you wanted to be eight levels deep in for loops.
yeah idk
maybe its an easy course i guess
seems like a waste of time if thats the hardest thing in the whole semester
no offense to your instructor but, theres no point in paying for school if you arent being pushed past your limits in a controlled and constructive way (imo)
otherwise you'd just go read a book at home
(there are other benefits to school too, namely the opportunity to meet and talk with other people working on similar problems as you with similar interests who might later form a professional network and also form a support network while in school)
@desert oar I think of formal education as a really expensive way to get your knowledge in a certain area accredited, and that all learning is basically self-learning.
and this is why people feel like education is a ripoff
because if thats all you got out of school, then it was a ripoff
yes
what i am saying is, it doesnt have to be like that, and shouldnt be like that
the next assignment uses data miner
or some platform like that. can't remember the name for sure
classes and dataframes - would it be smart to create a subclass of dataframes if I wanted to dictate their behavior?
for example, if a dataframe was empty, and I went to print it, I would want it to print "No data."
Has anyone ever rented server time/processing time, if so what did you need it for and what requirements did you have ?
@merry fern yes you could subclass dataframe and implement a new __str__ method in the subclass
@desert oar
Would it be simple in the sense that I would just setup init and str but it wouldn't screw up any other functionality built in?
Thanks
you dont even need init @merry fern
!e ```python
import pandas as pd

class MyDataFrame(pd.DataFrame):
    def __str__(self):
        # Show a message instead of pandas' default output when there are no rows.
        if self.shape[0] == 0:
            return 'No data.'
        else:
            return super().__str__()

df = MyDataFrame(columns=list('xyz'))
print(df)
```
@desert oar :white_check_mark: Your eval job has completed with return code 0.
No data.
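One caveat on the "wouldn't screw up any other functionality" part: by default, pandas operations on a subclass hand back a plain DataFrame, so the custom `__str__` is lost after something like `.head()`. The pandas docs describe overriding the `_constructor` property to keep the subclass. A minimal sketch, where the class names `PlainSub` and `StickySub` are made up for illustration:

```python
import pandas as pd

class PlainSub(pd.DataFrame):
    """Subclass that only customises __str__."""
    def __str__(self):
        return 'No data.' if len(self) == 0 else super().__str__()

class StickySub(PlainSub):
    """Same behaviour, but asks pandas to build results with this class."""
    @property
    def _constructor(self):
        return StickySub

data = {'x': [1, 2], 'y': [3, 4]}
plain = PlainSub(data)
sticky = StickySub(data)

# head(0) returns an empty frame; check which class each result has.
print(type(plain.head(0)).__name__)
print(type(sticky.head(0)).__name__)
print(sticky.head(0))
```

Without the `_constructor` override, the empty result of `plain.head(0)` prints as a regular empty DataFrame rather than the custom message.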
that said... i dont really recommend this
it's not easy to "convert" a regular dataframe to this custom dataframe
you have 2 other options: 1) directly override DataFrame.__str__, 2) just write a custom pretty-print function for data frames
overriding DataFrame.__str__ directly is the worse of the two
so i'd recommend just writing a function
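A minimal sketch of that function approach. The name `format_frame` and the `empty_message` parameter are hypothetical; it works on any regular DataFrame, so nothing has to be converted to a subclass.

```python
import pandas as pd

def format_frame(df, empty_message='No data.'):
    """Return the text to display for df (hypothetical helper)."""
    if len(df) == 0:          # no rows, even if columns are defined
        return empty_message
    return df.to_string()

print(format_frame(pd.DataFrame(columns=list('xyz'))))
print(format_frame(pd.DataFrame({'x': [1, 2], 'y': [3, 4]})))
```

Because it returns a string instead of printing, the same helper can feed a log line or a web template later.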
classes and dataframes - would it be smart to create a subclass of dataframes if I wanted to dictate their behavior?
for example, if a dataframe was empty, and I went to print it, I would want it to print "No data."
@merry fern but why?
Anyone know the approximate performance delta between Numba and Numpy?
When preprocessing data, is it better to try to use the built-in pandas functions, or just write your own @numba.jit functions?
the former is usually faster in my experience @safe tapir , unless you're chaining a lot of them together
@merry fern but why?
@velvet thorn so that when I have no results, it doesnt say "empty" it gives a message.
I...don't know if that's worth it
yea, exploring the idea. not too familiar with classes but my next step is thinking about class objects i think...
also, thanks for your help again, i was able to get my code working! im going to try to learn flask now to render it to a web display
I think this is a case where
you should recalibrate your brain to understand that "empty DF" -> "no data"
its not for me, its for when the system looks to output the data. but i can also just do a conditional output
^^yes
or just write a pretty printing function
@desert oar then this
im trying to run this pd.DataFrame(xFeat).columns essentially but i get an attribute error 'numpy.ndarray' object has no attribute 'columns' but i dont understand why since im creating it as a dataframe in the same snippet of code
@wheat pilot can you share more of your code? maybe the error is from a different place in the code
and can you share the full error traceback?
i have everything in #help-cherries
although i may have a temporary solution for that error
let me know if what ive changed is not going to bite me in the butt later if you have a minute
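The actual cause isn't visible in this channel, so this is only a guess at one common way to hit that error: the DataFrame itself has `.columns`, but anything converted back to a NumPy array (or the original array) does not. The data below is a made-up stand-in for `xFeat`.

```python
import numpy as np
import pandas as pd

# Stand-in for xFeat; the real data is not shown in the chat.
xFeat = np.arange(6).reshape(3, 2)

df = pd.DataFrame(xFeat, columns=['a', 'b'])
print(df.columns)                 # the DataFrame wrapper has .columns

arr = df.to_numpy()               # converting back gives a plain ndarray
print(type(arr).__name__)
print(hasattr(arr, 'columns'))    # the ndarray has no such attribute
print(hasattr(xFeat, 'columns'))  # wrapping did not change the original array
```

If the traceback points at a line that uses the array rather than the wrapped DataFrame, keeping the DataFrame in its own variable and reading `.columns` from that usually avoids the error.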
how to fetch sheet name of csv?
im using pandas.dataframe
to read csv
can anyone help?
how to fetch sheet name of csv?
@cerulean ingot what do you mean "sheet name"
this thing in bottom of excel or csv sheet
this thing in bottom of excel or csv sheet
@cerulean ingot CSVs don't have those
at least, not standard CSVs
only Excel files
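To illustrate the difference: a CSV is one flat text table, so `pd.read_csv` has nothing sheet-like to report, while an Excel workbook exposes its tabs through `pd.ExcelFile(...).sheet_names`. A small sketch; the helper name `list_sheet_names` and the sample CSV text are made up, and reading .xlsx files assumes an Excel engine such as openpyxl is installed.

```python
import io
import pandas as pd

# A CSV is plain text: rows and columns only, no sheets.
csv_text = "x,y\n1,2\n3,4\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df)

def list_sheet_names(path):
    """Return the tab names of an Excel workbook (hypothetical helper)."""
    with pd.ExcelFile(path) as workbook:
        return workbook.sheet_names
```

If the file really is an Excel workbook that was only given a .csv-looking name, calling `list_sheet_names` on it is the place to start.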
did you Google it?
I suggest you do, because I found the answer on the first try...
yw but I didn't really do anything
I even appreciate reply 💯
Hi I need help Installing tensorflow anyone online
What issues are you running into?
Hello @hasty grail I need it for the tensorflow google examination
It shows msv or something is not installed
Can I share my screen and can you help me
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
ok one sec I will open pycharm
Traceback (most recent call last):
File "C:/Users/HariAkash/PycharmProjects/TF_Config/venv/tf_first.py", line 1, in <module>
import tensorflow
File "C:\Users\HariAkash\PycharmProjects\TF_Config\venv\lib\site-packages\tensorflow\__init__.py", line 41, in <module>
from tensorflow.python.tools import module_util as module_util
File "C:\Users\HariAkash\PycharmProjects\TF_Config\venv\lib\site-packages\tensorflow\python\__init__.py", line 39, in <module>
from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
File "C:\Users\HariAkash\PycharmProjects\TF_Config\venv\lib\site-packages\tensorflow\python\pywrap_tensorflow.py", line 28, in <module>
self_check.preload_check()
File "C:\Users\HariAkash\PycharmProjects\TF_Config\venv\lib\site-packages\tensorflow\python\platform\self_check.py", line 61, in preload_check
% " or ".join(missing))
ImportError: Could not find the DLL(s) 'msvcp140_1.dll'. TensorFlow requires that these DLLs be installed in a directory that is named in your %PATH% environment variable. You may install these DLLs by downloading "Microsoft C++ Redistributable for Visual Studio 2015, 2017 and 2019" for your platform from this URL: https://support.microsoft.com/help/2977003/the-latest-supported-visual-c-downloads
This is the error
I am just 14 currently so I find it a bit difficult can you guide me
Hello @hasty grail are you there
Follow the link and the instructions on that page
It doesn't seem to help
What have you done exactly?
I just entered: import tensorflow
ImportError: Could not find the DLL(s) 'msvcp140_1.dll'. TensorFlow requires that these DLLs be installed in a directory that is named in your %PATH% environment variable. You may install these DLLs by downloading "Microsoft C++ Redistributable for Visual Studio 2015, 2017 and 2019" for your platform from this URL: https://support.microsoft.com/help/2977003/the-latest-supported-visual-c-downloads
Have you completed this step?
I installed it many times but it still shows the same error
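Since the error is about the DLL not being in a directory named in %PATH%, one way to narrow it down is to check whether the file is visible there from the same interpreter PyCharm uses. A rough sketch; the helper name `find_on_path` is made up, and it only looks at PATH, not at every location Windows may search.

```python
import os

def find_on_path(filename, path=None):
    """Return every directory on PATH that contains `filename`."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        # Skip empty entries that come from stray separators.
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits

if __name__ == "__main__":
    # DLL name taken from the ImportError above.
    print(find_on_path("msvcp140_1.dll"))
```

An empty list suggests the redistributable did not end up anywhere on PATH. One thing worth ruling out in that case is an architecture mismatch, since a 64-bit Python needs the x64 installer rather than the x86 one.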
Which file did you download?
Sorry I won't be on PC for long
oh ok no problem
Which file did you download?
@hasty grail microsoft redistributable 2019
Do you have to use any of your own files?
I feel that you might be better off simply using a Docker container otherwise
I feel that you might be better off simply using a Docker container otherwise
@hasty grail You need to use pycharm for the exam
I usually use Google Colab
3.7 which is the one supported for the exam
can you open your Visual Studio Installer and check whether you can find the C++ Redistributable?
haven't used the installer for a while but I think you should be able to select what packages for VS to install on there
look under C++
that's not the installer
that is visual studio itself
go to your Start Menu and search Visual Studio Installer or something
yeah that one
This one ???
go to the "Installed" tab
click "more"
what do you see now
modify
go to "Language packs"
hmm ok