#data-science-and-ml
I think weather might be okay for RF since the temperatures don't typically exceed historic highs and lows.
Tutorial video on how to connect to the Binance API & break down the data with python and create dashboards & graphs! Hope you guys find it interesting & learn a thing or two: https://youtu.be/RHqEPNgpbzQ
Have Questions check out our discord: https://discord.gg/rNc6xtP Python-Binance wrapper: https://python-binance.readthedocs.io/en/latest/market_data.html Git...
Any recommendations for a science-orientated IDE? Spyder's debugging tools aren't great, PyCharm's science view isn't working for me (so I can't explore my data), and I'm not a fan of notebooks
JupyterLab should give a more sophisticated experience than plain notebooks, so maybe see if that works better for you. And there is a plug-in for Atom called Hydrogen that provides a notebook-like experience and I've seen people swear by it, but never used it personally
Otherwise I'm afraid there isn't that much, at least as far as I know
I'm not a fan of notebook-style work. I'm doing scientific software development
So it's less pure data exploration and more software development
Hence why I want good debugging tools
In that case pycharm is probably your best bet
See if you can figure out why the science view thingy doesn't work for you, you might be able to find someone with the same problem and maybe a solution
Yeah, I'll give it a shot. Thanks!
(I abandoned pycharm myself because of its awful support for notebooks), I'm aware that's not what you're looking for but afaik it's a well known problem that pycharm is a bit lacking in that department
Or at least used to be, apparently there was just an update
Hello guys,
What would be the best way to create a web app (with Django or Flask) which would demonstrate:
1: Automatic Data Gathering for Stocks,
2: Automatic Datasheets for a variety of Stocks and
3: Stock prediction (approximately if possible)
If someone has good advice or a tutorial/instructions, I would be really grateful. I don't have any data science experience, so I am just looking for a good start.
Thank you in advance 😃
Just got it working, I think? I used to be a big Spyder fan, but I've grown too dependent on the good debugging tools of JetBrains' stuff 😛
If I want to call pyplot.axvspan to highlight a range in time-series data indexed by DateTimes, what do I put in for the data units of the range?
nvm
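For anyone landing on the same question, a minimal sketch, assuming the series is plotted with matplotlib directly and the x-axis holds datetime values: `axvspan` takes its limits in data coordinates, and matplotlib's date handling lets you pass `datetime` objects for them. The dates and values below are made up for illustration.

```python
import datetime as dt

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up daily series for illustration
start = dt.datetime(2020, 1, 1)
dates = [start + dt.timedelta(days=i) for i in range(30)]
values = [i % 7 for i in range(30)]

fig, ax = plt.subplots()
ax.plot(dates, values)

# Highlight 10 to 20 January by passing datetimes as the span limits
span = ax.axvspan(dt.datetime(2020, 1, 10), dt.datetime(2020, 1, 20),
                  alpha=0.3, color="orange")
```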
Hey guys just made the decision to dedicate my future to deep Learning but I have an issue.
My maths sucks. So I came up with a plan to learn the required maths alongside learning deep learning and was wondering if you guys could help me out with the order I should learn each topic
I’m starting off with Algebra 1&2
Moving onto Linear Algebra
Then finally calculus followed by multivariate calculus
Is this the right order to learn each of the topics?
I'd do algebra 1&2, single variable calculus, linear algebra, multivariate calculus, probability theory
self teaching myself calculus on top of all this programming 😛
Calculus is great though
I would agree, but I am not the best at math to start with, so learning it on my own is like ugh
This really doesn't have to do with data science, aside that I'm attempting to learn numba
Is there any way I can use str() in a vectorize, or at the very least run a line of code off the gpu?
I'm trying to make a program which simulates aerodynamics around an object using vectors. Can someone who has some experience with vectors please hmu 😃
I have a pandas dataframe consisting of about 10 columns (float values). How do I split the dataframe into subgroups where one column has for example values over 80?
I know of groupby() but I think I need something more because it will output a new dataframe for each row but I would like to split them into "sections" where consecutive values are over 80
@tall drum
df[df['column name'] > 80]
Basically I would like to create a new dataframe for each of the circled sections:
Do I just have to loop through the dataframe and store the start and end points for each section in a list or is there a better way?
@tall drum i literally gave you the solution
Thanks I tried it but the output was only one dataframe.
I would like to create a dataframe separately for each of the "spikes" in the data. Your solution is one dataframe in which those spikes are concatenated .
@tall drum you can do it like:
0) ensure your indices are numbers (not dates, etc.); reset index if need be.
- extract the indices which meet your criteria (>80).
- split the indices (a numpy array) into blocks containing consecutive numbers
- use these indices to extract from the df again
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((30, 2)), columns=['a', 'b'])
# find indices where the 'a' column has values > 0.5
idx = df[df['a'] > 0.5].index.values
# split the index array into blocks of consecutive integers
idx_list = np.split(idx, np.where(np.diff(idx) != 1)[0] + 1)
# use each block of indices to extract the matching rows from df
for i in idx_list:
    print(df.loc[i, :])
a simple example
the top one is the original df, the split ones (where the 'a' column >0.5) are after that
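An alternative sketch that skips the integer-index requirement: label each run of consecutive rows above the threshold with a counter, then group on that label. The column name `a`, the threshold of 80 and the sample values are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 85, 90, 20, 30, 95, 81, 88, 5]})

mask = df["a"] > 80
# A new run starts wherever the mask flips; cumsum turns the flips into run ids
run_id = (mask != mask.shift()).cumsum()

# Keep only the rows above the threshold, one DataFrame per run
sections = [group for _, group in df[mask].groupby(run_id[mask])]

for section in sections:
    print(section)
```

Because the grouping key is built from the mask rather than from the index values, this should also work when the index is made of dates.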
Thank you
anyone used the Alpha Vantage API? or IEX?
Hi guys
I'm new to python
And I want to start with data analytics
Which libraries should I learn
@distant inlet pandas and numpy are a must. as for plotting libraries, minimally matplotlib
😍
oh and sklearn if you're doing machine learning
I'm not that well-read, but I'm looking at "Machine Learning: A Probabilistic Perspective" now. seems like quite a heavy emphasis on the statistical side, so might still be relevant to you. you can consider
idk if this is the right place to ask but I want to do a small study on analysing album covers to see whether or not I can train a neural network to recognize genres associated with the album.
does anyone have an idea where to start or experience with these kind of things? Currently thinking of using tensorflow, but that's as far as my thought process has gone
I have accumulated a dataset of ~4k labeled album covers
ill search for some more information and come back with better defined questions 😛
How does one calculate the maximum both positive and negative and minimum both positive and negative numbers in IEEE 754?
I have the answers but...
Maximum positive: 1,111 1111 1111 1111 1111 1111 x 10^111 1111]2
Minimum positive: 1 x 10^-111 1110]2
Minimum negative: -1 x 10^-111 1110]2
Maximum negative: -1,111 1111 1111 1111 1111 1111 x 10^111 1111]2
@gaunt axle the information for the system float type is present in sys.float_info, the information for numpy types is in numpy.finfo([type])
they use different names unfortunately, so they're not interchangeable
```py
>>> import sys
>>> import numpy
>>> sys.float_info.max
1.7976931348623157e+308
>>> sys.float_info.min
2.2250738585072014e-308
>>> -sys.float_info.min
-2.2250738585072014e-308
>>> -sys.float_info.max
-1.7976931348623157e+308
>>> numpy.finfo(numpy.float64).max
1.7976931348623157e+308
>>> numpy.finfo(numpy.float64).tiny
2.2250738585072014e-308
>>> -numpy.finfo(numpy.float64).tiny
-2.2250738585072014e-308
>>> numpy.finfo(numpy.float64).min
-1.7976931348623157e+308
```
Thanks @desert cradle but I was looking for the explanation behind it hehe : )
well, you understand binary, right?
when you have a "decimal" [not actually decimal but there's no better name for it] point in binary, that means that just like in decimal it goes 1/10, 1/100, 1/1000, it goes 1/2, 1/4, 1/8
so 1,11111111111111111111111 is 1 + 1/2 + 1/4 + 1/8 + 1/16 ... + 1/8388608
which is 1.99999988079071044921875 if written in decimal
and then that number is multiplied by 2^127
so you get 340282346638528859811704183484516925440
which would be 0xffffff00000000000000000000000000 in hex
@gaunt axle
is that what you're looking for? otherwise i'm not sure what explanation you want
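The arithmetic above can be checked with the standard library. This is only a sketch: it rebuilds the largest finite single-precision value from its mantissa and exponent and compares it with the bit pattern `0x7f7fffff` (sign 0, exponent field 254, fraction all ones) unpacked by `struct`. It does the same for the smallest positive normal value, with exponent -126.

```python
import struct

# Mantissa 1.111...1 in binary with 23 fraction bits equals 2 - 2**-23
mantissa = 2 - 2**-23
largest = mantissa * 2**127

# Same value read from its bit pattern: sign 0, exponent 254, fraction all ones
from_bits = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]

# Smallest positive normal value: mantissa 1.0, exponent field 1, i.e. 2**-126
smallest_normal = struct.unpack(">f", bytes.fromhex("00800000"))[0]

print(int(largest))
print(hex(int(largest)))
print(largest == from_bits, smallest_normal == 2**-126)
```

The negative extremes are the same magnitudes with the sign bit set.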
That helped a lot! Now I must work it out with the IEEE 754 single-precision format (1 bit for sign, 8 for exponent, and 22 for mantissa(?))
Ohh, we were told we don't write those
well it's 23 explicit bits, 24 including the 1
sign is the simplest, it's just 1 for negative 0 for positive
then the exponent is a bit complicated because of denormalized numbers, infinity, and NaN
and the regular values are just offset by 127, so 1 [00000001] is -126, 127 [01111111] is 0, and 254 [11111110] is +127
I didn't understand the logic behind , why do we add 127 to the exponent
because that way the negative and positive values are all in order
I just assumed it was like that bc, it said "excess by 127"
the smallest one is the most negative value, and the largest one is the largest positive value
(00 for denormal works because it's smaller than any value with 01, and FF for infinity is larger than any other possible value)
in fact, the floating point bits for +0.0 end up being 0 00000000 00000000000000000000000
which is useful because it means when memory is allocated and filled with zero bytes, it can be used directly as a float initialized to zero without any special code to fill in the parts that are floats
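A small sketch of the field layout described above, using only `struct`. The helper name `float32_fields` is made up for this example. It splits a value into sign, biased exponent and fraction, and then checks the ordering property for a few non-negative values, including one denormal.

```python
import struct

def float32_fields(x):
    """Split x, rounded to single precision, into (sign, biased exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction

print(float32_fields(1.0))   # exponent field 127 encodes 2**0
print(float32_fields(0.0))   # all three fields are zero
print(float32_fields(-2.0))  # sign bit set, exponent field 128 encodes 2**1

# Because of the bias, non-negative floats sort the same way as their raw bit patterns
values = [0.0, 1e-40, 0.5, 1.0, 3.0, 1e30]
patterns = [struct.unpack(">I", struct.pack(">f", v))[0] for v in values]
print(patterns == sorted(patterns))
```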
I will try to do my hw now : ), thanks @tribal kindle
Ty @desert cradle ^_^
nu u
is this channel free?
would anyone be able to help with this
A city council is planning the city’s bus routes. It has decided which places will have a bus stop (schools, cinemas, hospital, etc.). Each bus route will start from the train station, visit a number of bus stops, and then return to the station, visiting the same bus stops in reverse order. Each bus stop has to be served by at least one bus route. The council wants to minimise the total amount of time that all buses are on the road when following their routes.
@urban ibex Not really sure if this is the correct channel but I'm now even more convinced it's a multi-TSP while minimizing the highest route length.
An assumption is that route length is directly proportional to the time buses spend on the routes.
- you need to find a way to visit all nodes. Even in the problem, the return trip I assume is simply the distance between the nodes again, so you have 2x of the distance for the same 'bus route' - effectively meaning this detail does not matter (except at the end when you need to show back answers/calculations relating to this)
- Even if you don't really visit the home node after reaching the end of the trip, you can still transform the problem to an equivalent TSP whereby you visit a place that magically is able to teleport back to origin - essentially, this means it's a TSP.
- Multiple vehicles means it's a multiple-TSP. The part where it says all vehicles essentially means you need to minimise the highest route cost among all routes
With this you should research into min-max TSP algorithms
i think it wants me to use either BFS or DFS
Pretty sure that's not how state-of-the-art does it, but you could....
how would it work with BFS or DFS and what would i need to adapt/change
the unit its linked to focusses on DFS, BFS, Dijkstras, priority queues and greedy algos
what class is this for?
If given those I'd actually use a greedy algo
It's suboptimal but terminates in finite time IIRC
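To make the greedy idea concrete, here is a rough sketch of a nearest-neighbour heuristic. It rests on several assumptions that are not in the exercise: travel times are known between every pair of places, a single bus serves all stops, and the place names and minutes are invented. It is not optimal and not necessarily what the course expects.

```python
def greedy_route(travel_time, station):
    """Nearest-neighbour heuristic: always drive to the closest unvisited stop."""
    route = [station]
    unvisited = set(travel_time) - {station}
    while unvisited:
        here = route[-1]
        # Ties are broken alphabetically so the result is deterministic
        nearest = min(unvisited, key=lambda stop: (travel_time[here][stop], stop))
        route.append(nearest)
        unvisited.remove(nearest)
    return route

def round_trip_time(travel_time, route):
    """Time out along the route plus the same stops in reverse order."""
    one_way = sum(travel_time[a][b] for a, b in zip(route, route[1:]))
    return 2 * one_way

# Hypothetical travel times in minutes between every pair of places
travel_time = {
    "station":  {"school": 4, "cinema": 7, "hospital": 9},
    "school":   {"station": 4, "cinema": 2, "hospital": 6},
    "cinema":   {"station": 7, "school": 2, "hospital": 3},
    "hospital": {"station": 9, "school": 6, "cinema": 3},
}
route = greedy_route(travel_time, "station")
print(route, round_trip_time(travel_time, route))
```

Splitting the stops across several buses would need an extra rule for when to start a new route, which this sketch leaves out.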
this is the Q @hardy crag
Yes and I told you what optimisation problem it was
yeah im just struggling to see how it would be implemented without the combination of another algo
I reckon traveling salesperson is a good start might want to research algorithms that solve it
i think youd have to use like DFS first and break the graph down then traverse them but im not sure
Would anyone be able to help with this?
In graph theory, the number of nodes in a graph is called the order of the graph. The term ‘order’ is unrelated to sorting. Specify the problem of computing the order, as a UGraph operation.
Creator / Inspector / Modifier (delete as appropriate): order
Inputs:
Preconditions:
Outputs:
Postconditions:
Justify the kind of operation.
do you know the difference between directed and undirected graphs
yes I do @lapis sequoia
so you know what a ugraph is?
not 100%
hmmm.. is this problem for python?
because if not.. its a complete tangent.. and I dont want to give you wrong directions..
i just need to complete the spec
i think in this circumstance UGraph is just a name
Consider an ADT for undirected graphs, named UGraph, that includes these operations:
nodes, which returns a sequence of all nodes in the graph, in no particular order
has_edge, which takes two nodes and returns true only if there is an edge between those nodes
edges, which returns a sequence of node-node pairs (tuples), in no particular order. Each edge only appears once in the returned sequence, i.e. if the pair (node1, node2) is in the sequence, the pair (node2, node1) is not.
How each node is represented is irrelevant. Because the graph is undirected, has_edge(node1, node2) and has_edge(node2, node1) return the same. You can assume the graph is connected and has no edge between a node and itself.
they want you to describe how you're going to sort it?
In graph theory, the number of nodes in a graph is called the order of the graph. The term ‘order’ is unrelated to sorting. Specify the problem of computing the order, as a UGraph operation.
ok
the big para is just background info
the small one is what this sub question wants answered
okay mate!
from what I understand the sorting should be based on the degree and index value of the nodes..
it doesnt need sorting does it?
I think it does.. you would sort by descending order of degree, index
I mean.. expressing the order this way
but this is the question...
In graph theory, the number of nodes in a graph is called the order of the graph. The term ‘order’ is unrelated to sorting. Specify the problem of computing the order, as a UGraph operation.
specify the problem of computing the order
the order is just the number of nodes
"unrelated to sorting"
hmmm I'm not sure.. it's been a while since I did any graph theory.. but if you find how to order an undirected graph, you should get your answer..
@lapis sequoia It specifically says "order" doesn't have anything to do with sorting
There's no sorting involved in this exercise
the order of a graph is just the number of nodes
@urban ibex What specifically stumps you about this?
Like, do you have trouble figuring out if this operation is a creator, inspector or modifier?
right so i assume you read the overall sort of brief/background info @wicked flare
Yeah
okay so...
Are we in agreement that its an inspector
because it doesnt change or create
it just inspects the order(how many nodes in ugraph)
Is "UGraph" a data type from some specific library or code example or something? Or is it just an abbreviation for undirected graph?
this is all i have to go off...
Consider an ADT for undirected graphs, named UGraph, that includes these operations:
nodes, which returns a sequence of all nodes in the graph, in no particular order
has_edge, which takes two nodes and returns true only if there is an edge between those nodes
edges, which returns a sequence of node-node pairs (tuples), in no particular order. Each edge only appears once in the returned sequence, i.e. if the pair (node1, node2) is in the sequence, the pair (node2, node1) is not.
How each node is represented is irrelevant. Because the graph is undirected, has_edge(node1, node2) and has_edge(node2, node1) return the same. You can assume the graph is connected and has no edge between a node and itself.
just the graph?
should i put UGraph?
I guess? Do you have any example specification for the existing operations?
nah
Or an example specification for some other data type operation?
yeah
Inspector: isEmpty
Inputs: theStack, a stack of objects (o1, o2, ..., on) Preconditions: true
Outputs: a boolean empty
Postconditions: empty is true if n=0, otherwise false
Ok, right, so you can just specify a UGraph as the input
okay
I don't know what they mean by "Preconditions: true"
that the inputs are true i guess
The input is a stack though. A stack isn't a boolean.
It can't be true or false.
I don't see any need for any preconditions for either isEmpty or order.
Maybe you can just leave that blank.
And then argue with your professor if they disagree.
I mean, a precondition is something that has to be true before the operation can be called.
i think it just means that the input is correct
of course there is
But I mean, in general, an operation can require preconditions.
if it said the input was integers
Let me think of an example.
and i entered letters
then the input is false
so the pre condition is that the input is true(correct)
No, the input would be invalid. But you are already specifying in the input specification that the input is a number.
In a concrete programming language, this would be handled by specifying the data type.
Or possibly validating the data type in the case of a weakly typed language.
In the case of isEmpty, you are specifying that the input is a stack. It can't be anything else.
heres what it means
The precondition true specifies the operation is valid for any state of a stack: there are no preconditions. An ADT invariant specifies conditions that are True for any instance and remain True throughout its lifetime irrespective of the operations carried out on it
Ah, yeah, I was gonna say that.
If you think of the precondition as a condition, i. e. a boolean statement, then "true" would signify that all previous conditions are valid.
Fine.
So that should be the case for your order operation too.
Because you can always ask a graph for its order.
Just as you can always ask a stack if it's empty.
so precon of True then?
the order (number of nodes in UGraph)
Right.
how should i phrase that though
Or well, if we look at the specification for isEmpty, what they seem to want is that you just specify the data type in outputs, and describe the contents of the output in the post-condition
Right
would you just call it order?
an integer, order (the number of nodes in the graph)
Well, I would put what's inside the brackets in the post-condition section
Since that's how they phrased it in the isEmpty specification
okay
Like, post-condition: order is the number of nodes in the graph
should order be > 0
really?
but they arent graphs though unless im missing soemthing
Yeah, they are
??
hmm
Like, you can have a string split operation, which takes a string to split and a separator string to split on. If you supply the empty string as the separator, it will split inbetween every character
Like "abc".split("") == ["a", "b", "c"] in JavaScript (Python's str.split raises ValueError for an empty separator)
Makes total sense
alright well what would you have as post con then?
There are similar things that might make sense for empty instances of other data types
post-condition: order is the number of nodes in the input graph, perhaps?
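As an illustration of the spec being discussed, here is a toy Python version of the ADT. The storage choice (a set of nodes plus a set of `frozenset` edges) and the constructor are assumptions made for this sketch, since the exercise only names the operations. `order` is an inspector because it reads the graph and changes nothing.

```python
class UGraph:
    """Toy undirected graph, stored as a set of nodes and a set of frozenset edges."""

    def __init__(self, nodes, edges):
        self._nodes = set(nodes)
        self._edges = {frozenset(edge) for edge in edges}

    def nodes(self):
        return list(self._nodes)

    def has_edge(self, node1, node2):
        return frozenset((node1, node2)) in self._edges

    def edges(self):
        return [tuple(edge) for edge in self._edges]

    def order(self):
        """Inspector. Precondition: true. Postcondition: result is the number of nodes."""
        return len(self.nodes())

graph = UGraph("abcd", [("a", "b"), ("b", "c"), ("c", "d")])
print(graph.order())  # 4
```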
i guess im not 100% on this one though
What part?
post con
I'm just following the format in the isEmpty example
guess you could also say singular integer
In that one, in output, they just say that it's a boolean
And in post-con, they say what the value of the boolean should be
True if n = 0, else false
should i say singular integer?
I don't know what you mean by "singular"
just 1 integer
If you say "an integer", that implies that it's just one
That's what the word "an" means
I don't think you need to repeat the data type if you already mention it in the outputs section
It's like when you're programming. You only need to specify the data type when you declare a variable. You don't need to repeat it every time you use it.
No, probably not, but why be redundant when you don't need to?
i guess, i think ill leave it though
Up to you.
would you be able to take a look at something im stuck on for me in a moment?
its related to this Q still
Perhaps. I'm at work, but if I have free time.
yeah thats alright no rush im just finishing something off and then ill @ you
hey. I have a dataset where each row looks like this [0,1,2,4] where the columns are [Outcome,Person1,Person2,Person3] I want to transform it into [0,1,1,0,1,0] where Outcome stays the same but Person1-3 changes to reflect 1 or 0. Forexample Person3 is 4 so index 4 is 1, Person1 is 1 so index 1 is 1, no person is 5 or 3, so index 3 and 5 are 0
I want to use it for logistic regression so for something like sklearn or similar would be cool
I've looked around and i am not sure how to proceed. The best I could think of would be to simply make an entirely new dataset and just load that but I thought there might be a better solution.
@lapis sequoia what you need is pd.get_dummies from Pandas. It will convert one column with categories to multiple boolean ones
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
from their documentation
In [11]: df = pd.DataFrame(["a", "b", "c", "d"], columns= ["People"])
In [12]: df
Out[12]:
People
0 a
1 b
2 c
3 d
In [13]: pd.get_dummies(df)
Out[13]:
People_a People_b People_c People_d
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
this is example dataframe from your use case
@supple ferry thanks a lot. that's perfect! Is there a way to do it for multiple columns? Like if coulm2 is People2 and had index 0 = b, would pd.get_dummies write the first row as 1,1,0,0 in your example?
column2*
@lapis sequoia You can pass the names of the columns you want dummies for via the columns argument of that function. Because I had just one column, I did not use that. You can also set a custom prefix for the created columns: instead of People_b you can have Person_b if you pass prefix="Person" as an argument (the "_" separator is added automatically).
ok thats really perfect. thank you so much
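A short sketch of both variants for the `[Outcome, Person1, Person2, Person3]` layout from the question. The two rows and the assumption that person ids run from 1 to 5 are invented for the example. The first part uses `pd.get_dummies` with `columns` and `prefix`, which gives one indicator per (column, value) pair. The second part builds the single indicator per person id that the question asked for, using plain NumPy indexing instead.

```python
import numpy as np
import pandas as pd

# Two made-up rows in the [Outcome, Person1, Person2, Person3] layout
df = pd.DataFrame(
    [[0, 1, 2, 4], [1, 2, 3, 5]],
    columns=["Outcome", "Person1", "Person2", "Person3"],
)
person_cols = ["Person1", "Person2", "Person3"]

# get_dummies on several columns: one indicator per (column, value) pair
per_column = pd.get_dummies(df, columns=person_cols, prefix=person_cols, dtype=int)

# One indicator per person id, whichever slot the id appeared in
n_people = 5  # assumed: ids run from 1 to 5
multi_hot = np.zeros((len(df), n_people), dtype=int)
row_numbers = np.arange(len(df))[:, None]
multi_hot[row_numbers, df[person_cols].to_numpy() - 1] = 1

encoded = pd.concat(
    [df[["Outcome"]], pd.DataFrame(multi_hot, columns=range(1, n_people + 1))],
    axis=1,
)
print(encoded)
```

The `encoded` frame can then be split into `X` (the indicator columns) and `y` (`Outcome`) for a logistic regression.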
hi, I'm new here in the community
I'm trying to convert some XML 3D data from one DCC package to another using numpy.
something like this.
import numpy as np

def setup_lod_arrays(self, point_num_of_lods=(), triangles_of_lods=(), quads_of_lods=()):
    # For each LOD, allocate uninitialized arrays for points, triangles and quads
    point_data = []
    for i, value in enumerate(point_num_of_lods):
        point_data.append(np.empty([value, 14]))
        point_data.append(np.empty([triangles_of_lods[i], 1]))
        point_data.append(np.empty([quads_of_lods[i], 1]))
    self.data = point_data
I have just one question: does this make sense? (-:
Or does using a list of numpy arrays destroy the speed of the arrays?
Is the list just pointing to the arrays?
Help would be awesome
Hey. First things first, it is good if you format your code using backtick notation. With this code it is not clear what you want to do. Maybe more code and some example output you want would be helpful. np.empty simply creates an uninitialized array of the given shape. If you want that array to have particular values in it, it does not do that
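On the question of whether the list just points to the arrays: it does. A Python list stores references to the array objects, so nothing is copied and each array keeps its own buffer. A quick check, with made-up shapes:

```python
import numpy as np

block = np.zeros((4, 14))
lods = [block, np.zeros((2, 1))]  # the list stores references, not copies

# Writing through the list changes the original array
lods[0][0, 0] = 42.0
print(block[0, 0])                        # 42.0
print(lods[0] is block)                   # True
print(np.shares_memory(lods[0], block))   # True
```

Operations on each individual array stay vectorized. Only the loop over the list itself runs at Python speed.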
I need to superimpose transparency onto a matplotlib heatmap
Could anybody help me there?
post your code, if someone can help, they will respond
Any idea why `val = 1 / np.sqrt(2 * np.pi) * integrate.quad(lambda t: np.exp(-t ** 2 / 2), -np.inf, z)` raises `TypeError: can't multiply sequence by non-int of type 'numpy.float64'`?
where `integrate` is `import scipy.integrate as integrate`
and `z = (mean - value) / stddev`
er
yeah that's right
probably because you're multiplying a sequence by a numpy float?...
oh yeah integrate.quad returns 2 items
aaaaaa
confusing as heck error because it points to the line with z
cool I guess it works now
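For reference, a sketch of the fix: `integrate.quad` returns a tuple of the integral and an estimate of the absolute error, so the integral has to be unpacked before it is multiplied. The value of `z` is made up, and the `math.erf` line is only a cross-check against the closed form of the standard normal CDF.

```python
import math

import numpy as np
import scipy.integrate as integrate

z = 0.5  # made-up standardized value

# quad returns (integral, estimated absolute error); unpack before multiplying
integral, abs_error = integrate.quad(lambda t: np.exp(-t ** 2 / 2), -np.inf, z)
val = 1 / np.sqrt(2 * np.pi) * integral

# Cross-check against the closed form of the standard normal CDF
expected = 0.5 * (1 + math.erf(z / math.sqrt(2)))
print(val, expected)
```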
python's instruction to line number tables can't go backwards, even if that means errors make no sense at all
i wonder if it'd make more sense in debug mode
that turns off some optimizations
wait, no, debug mode is default
here we go, boys
When everything is artificial intelligence...is anything intelligible at all?
hi
I am trying to create a histogram but it doesn't look right
do you think it is accurate?
how can i make it look more like a histogram, for example, branching out to the left and right etc
does that only have one plot point?
hi @supple ferry I just want to thank you for helping me out this tuesday. It is working flawlessly 👌
This is my curated list of free courses from reputable universities like MIT, Stanford, and Princeton that satisfy the same requirements as an undergraduate ...
@lapis sequoia you have just one point. Make sure that all points get included.
First of all, not data science, also already been overdone to death to various extents
Is pretty good
@lean ledge hmm
guess im out of the loop
which do you think is the best of these kinds?
I wish the courses in that link did not have set dates
I took Berkeley's AI course, pretty nice too
my other massege didn't go
message*
I said: "Thanx a lot guys, those gits have a lot of valuable information"
You guys know if Python can rip all images from a website, store them, then add metadata based on image analysis of them, and then sort them into categories?
does anyone know a kfold method for python besides sklearn?
this one splits into test and train
i was looking for one that just split into train
@serene veldt what would that look like? let's say 5-fold, so you want something that splits your dataset into 4+1, but both used for training?
anyway there's nothing stopping you from using the outputs from kfold as both training sets.. can you explain a little better about your use case?
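If the goal is just k disjoint chunks of row indices without sklearn, a NumPy-only sketch could look like this. The helper name `kfold_indices` and the sizes are made up. Each chunk can be used on its own, or held out in turn as shown in the loop.

```python
import numpy as np

def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle the row indices and cut them into n_folds nearly equal parts."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), n_folds)

folds = kfold_indices(n_samples=10, n_folds=5)

for i, held_out in enumerate(folds):
    # Everything outside the held-out fold forms the training indices
    train = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
    print(i, sorted(train.tolist()), sorted(held_out.tolist()))
```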
Hey guys, I have a pandas related question.
sure, just ask
Okay, I'm trying to select rows from a DF but I keep getting a max recursion depth error. The code is simple and I know I've done this kind of thing last semester.
df[df["W%"] > conf]
"RecursionError: maximum recursion depth exceeded while calling a Python object"
hmm can you show df.head(), and also what's conf?
W% Var
0 0.608696 0.706
1 0.426829 0.714
2 0.317073 0.754
3 0.207317 0.781
4 0.256098 0.741
conf is just a float.
conf = mean + (1.96 * sd)
conf
0.8041728975013317
df = pd.DataFrame(np.random.random((10, 2)), columns=['W%','b'])
df[df['W%']>0.5]
does this run for you?
it's runs perfectly fine on mine.
try changing conf to 0.8 (the number)?
where are you doing this by the way? notebook or py file?
Jupyter notebook.
hmmm
df[df["W%"] > 0.8] is still max recursion.
what does df["W%"]>0.8 return?
Also a max recursion.
just df["W%"]?
is there max recursion, I mean. I don't really need to see the output
TypeError: cannot concatenate object of type "<class 'numpy.ndarray'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
nd the error is suppppppppppppper long.
can you show the full stacktrace?
if it's long, use hastebin
what comes before this erroneous line? how are you setting up your df?
Hmmm... let me try rerunning jupyter. I just got an error on df.tail()
df = pd.DataFrame(data={"W%":data["W%"], "Var":data["FT%"]})
and data is another dataframe? why create a new dataframe when you can just index the cols out?
though I'm not sure if that's the issue
did restarting the kernel help?
Because I'm ultimately trying to index through and compare each variable to see if Win % is > the confidence interval.
Okay, restarting seems to have fixed it.
okay 👍
df[df["W%"] > conf]
W% Var
19 0.817073 0.696
40 0.833333 0.797
115 0.804878 0.771
137 0.817073 0.794
366 0.841463 0.747
367 0.878049 0.746
371 0.817073 0.744
394 0.804878 0.757
489 0.804878 0.754
826 0.817073 0.788
827 0.890244 0.763
828 0.817073 0.768
980 0.817073 0.805
1067 0.817073 0.803
Thank you for the help.
np
Has anyone had any issues using np.split or np.asarray on a list of dataframes? I'm getting '''ValueError: cannot copy sequence with size 11122 to array axis with dimension 88''', so I assume numpy is trying to operate on the dataframes inside the list. I just want to split the list and not operate on the dataframes.
The shape of the dataframes inside the list are of shape (11122,88), hence why I think numpy is trying to operate on the dataframes inside the list.
Ended up posting to SO
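One possible workaround, sketched here without knowing what the Stack Overflow answer said: split the list with plain slicing so NumPy never tries to turn the DataFrames into one big array. The helper name `split_list` is made up, and short strings stand in for the DataFrames.

```python
def split_list(items, n_chunks):
    """Split a list into n_chunks consecutive sublists without touching the items."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one more item each
        stop = start + size + (1 if i < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks

# Strings stand in for the DataFrames here; any objects work
frames = ["df0", "df1", "df2", "df3", "df4"]
print(split_list(frames, 2))
```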
Hi guys, I'm planning to create a crawler for a company and it involves AI (NLP) and big data. I was wondering, what are the best Python libraries for both crawling and NLP that can help me create something that can scale?
spaCy for nlp is a solid starting point
Not really a data science question but I'm having difficulties with Jupyter.
What kind of difficulties Arc
Oh it's a stupid issue I think it's just an problem with the paths Jupyter is using.
But so I do pip install Jupyter and it installs fine
And python -m jupyter works
But python -m jupyter notebook gives Error executing Jupyter command 'notebook': [Errno 'jupyter-notebook' not found] 2
I should mention Win10 I suppose.
@hardy crag I'll be reading about Scrapy more
@void anvil Is it multi language? The only one I found that supports another language other than english is NLTK
what do you mean?
there's nothing really better out of the box
@void anvil for example, for sentiment analysis
@dense rose windows or linux/mac? system wide python or conda/virtual env?
Win10 just system wide.
have you tried reinstalling jupyter?
pip install --upgrade --force-reinstall --no-cache-dir jupyter
also maybe checkout this. https://github.com/jupyter/notebook/issues/4195
well no he said its installed in the wrong location, so you would have to use that exe instead of the command line command
it might also be worth considering anaconda
Oh my b I just read the first part and saw a similar question and skipped to the (non-existent) answers.
Yeah that's weird.
So this is a basketball data visual.
It's about players shooting three-pointers.
The axes are attempts and makes.
The legend shows the players' positions.
I want to ask: what kind of person would use things like this?
I'm working on an LSTM operating on time-series data. The lengths of my input sequences vary from 968 to 6244 measurements. What's the best way to normalize their lengths? I could just take the median number of measurements, throw away sequences with less, and chop ones with more. Or I could work on interpolating the data, but I wouldn't want to interpolate 968 measurements to 6244. I know this is dependent on what my data actually is (it's all measurement data from several sensors), but are there any rules-of-thumb or guidelines to normalizing sequence length?
I guess another alternative would be to feed in sliding window subsequences (say of length 200) of each sequence rather than the whole sequence itself?
Nvm, I've looked more into it and looks like the norm is just use 250-500 steps as inputs, so I'll just make windows using that
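A minimal sketch of the windowing idea, assuming each recording is an array of shape `(time_steps, n_sensors)`. The window length of 250, the step of 125 and the random data are placeholders.

```python
import numpy as np

def sliding_windows(sequence, window, step):
    """Return an array of shape (n_windows, window, n_features)."""
    starts = range(0, len(sequence) - window + 1, step)
    return np.stack([sequence[s:s + window] for s in starts])

# Made-up recording: 968 time steps from 3 sensors
sequence = np.random.default_rng(0).normal(size=(968, 3))
windows = sliding_windows(sequence, window=250, step=125)
print(windows.shape)  # (6, 250, 3)
```

With a step smaller than the window the windows overlap, so any trailing samples that do not fill a whole window are simply dropped.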
Does anyone know if dataframes have any built in methods for deque-like behavior? Specifically, I want to have the dataframe pop an entry when a new entry is added once the dataframe hits a certain length. Writing code to do this wouldn't be hard, but I was curious if any of that is built-in already.
https://stackoverflow.com/q/29609118 if you havent yet seen it
Thanks! That looks about like what I was going to write manually
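As far as I know pandas has no built-in maximum length for a DataFrame. One hedged alternative to trimming the frame by hand is to keep the rows in a `collections.deque` with `maxlen` and build the DataFrame only when it is needed. The column names and values are invented.

```python
from collections import deque

import pandas as pd

# Rolling buffer: once 3 rows are stored, appending drops the oldest one
buffer = deque(maxlen=3)
for step in range(5):
    buffer.append({"step": step, "value": step * 10})

df = pd.DataFrame(list(buffer))
print(df)
```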
Small question: how good is Dash?
Dash User Guide and Documentation. Dash is a Python framework for building analytical web apps in Python.
I'm looking for a way to create interactive dashboards for my coworkers (not necessarily using Python) and came across it. Is it popular?
I usually use seaborn for my plots but they're kinda hard to automatically export to create browser-compatible dashboards, right?
Tableau is 80$/month though lol
I used it at my previous job but I'm working for a smaller company now and I'd like something that's open source 😄
But I don't want to spend hours understanding the Dash framework if it's not really popular/maintained
Mmmmmh that's true. Never used R but it can't be that hard.
I'm doing all my analysis with Python though so ideally I'd like to work with it
So I can re-use my existing library
And my sql-alchemy integration
Is it really popular? I don't see many people talking about it (compared to seaborn for example)
R isn't used in production at most major companies..
not enough support.. plus R studio license cost..
You can use Plotly but it won't be as versatile as R's Shiny
R's still very popular in academics (more so than Python, although Python is gaining traction). That also means that newer statistical methods may be implemented in R before they are implemented in Python. My guess is that this applies less to machine learning, but that's not my field. I do know my research group still mainly uses R and pushes out a lot of R-packages with the new developments.
Stuff like newly published methods for Prediction Rule Ensembles and multiple imputation.
Definitely correct, R is the preferred language in a lot of science, not just pure statistics
While ML stuff is generally python
@gilded dagger I think Dash is exactly what you are looking for.
I asked this in #career-advice but I thought I would ask here as well.
I am hiring a couple data analysts/scientists/ML dudes for the summer. Got some promising candidates.
Outside of some technical stuff, are there any good questions to ask people through a case study? I am interested in how they deal with problems, their methodology and those things more than their technical acumen in this particular instance.
Give them a data analysis project that'll take 2 hours to do something easy on (clean data set, maybe a couple holes and some irrelevant data) and ask them to give a 10 minute presentation on their findings
Doesn't really help with data science and ML people
DS and ML is more about being able to understand high dimensional data and the maths and behaviour behind it all. I'm not sure how you can measure that
Hello all
I'm very new to using Python for data science, and was wondering if anyone would be able to shine some light on an issue I'm having
I am trying to use statsmodels to forecast call data, using this guide: https://medium.com/datadriveninvestor/how-to-build-exponential-smoothing-models-using-python-simple-exponential-smoothing-holt-and-da371189e1a1
I have a month's worth of call data in a pandas dataframe
it's a simple 1D array with date as the index and calls as data
Simple exponential smoothing seems to work:
but if I try the full ExponentialSmoothing function I just get this error:
When I'm using the exact same dataframe as before
cleared up some clutter
seems to only happen with the mul method
@inland garnet , i am not very expert in statsmodels, but this error indicates that boolean indexing is not properly "translated" into pandas. Seems like statsmodels bug to me (maybe)
If you look at traceback line 391, you will see something like this:
(condition) and (condition)
This is the pythonic way to do it, which is not acceptable to pandas
In pandas, boolean indexing has to be done via the | and & operators
or refer to this Stack Overflow answer
https://stackoverflow.com/a/54358361/10943886
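A small sketch of the difference between `and` and `&` on pandas objects. The call counts are made up, and this only illustrates the general rule; it is not a diagnosis of the statsmodels traceback.

```python
import pandas as pd

calls = pd.Series([120, 95, 0, 143], name="calls")  # made-up data

# Element-wise AND: note the parentheses around each comparison.
mask = (calls > 0) & (calls < 130)
print(calls[mask].tolist())

# Python's `and` asks for the truth value of a whole Series, which is ambiguous.
try:
    (calls > 0) and (calls < 130)
except ValueError as err:
    print("ValueError:", err)
```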
you're better off implementing your own functions.. but where are you importing statsmodels from
Right
I just installed statsmodels via pip
and then i'm importing from statsmodels.tsa.api
Thanks @supple ferry
I think I'm in over my head
Every guide to exponential smoothing in python seems to use pandas, so it seems strange that statsmodels isn't working. my data is extremely simple and 1D
in pandas, i want to check for duplicates for all columns except one
so i made a list of columns with dataf.columns, removed "sku" from every list, then did dataframe.duplicated(subset=listwithoutsku)
but the output is a list of every row where there is at least one duplicated column from that list
how can I make it so it only counts full duplicates from the subset?
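For what it's worth, `duplicated(subset=...)` should only flag a row when every column in the subset matches another row, so it may be worth checking what actually ended up in the column list. A minimal sketch with a made-up frame (the `name` and `color` columns are hypothetical; only `sku` comes from the question):

```python
import pandas as pd

df = pd.DataFrame({
    "sku":   ["a1", "a2", "a3", "a4"],
    "name":  ["pen", "pen", "pen", "ink"],
    "color": ["red", "red", "blue", "red"],
})

cols = [c for c in df.columns if c != "sku"]

# True only where the whole (name, color) combination was seen in an earlier row.
first_kept = df.duplicated(subset=cols)
# keep=False marks every member of a duplicate group, including the first one.
all_marked = df.duplicated(subset=cols, keep=False)

print(first_kept.tolist())
print(all_marked.tolist())
```

Rows that share only `name` or only `color` with another row stay `False` in both results.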
I'm in a introductory ML class and the final project is one of the projects on Kaggle (mostly free to choose). Any suggestions on some that are not too hard but still interesting?
Pokemon one can be amusing
Anybody here has any experience with Dash apps?
I'm having a lot of trouble just making a very basic CSS
Look at that date picker LOL
hello guys, i have installed the cpu version of keras
how do i change it to the gpu one?
didnt find it on Stack Overflow
you should be installing tensorflow-gpu rather than just tensorflow
so shall i uninstall tf and then install tf-gpu?
yeah even im using keras
but my cpu usage was at 95% and gpu usage was 5%
looks like i made a mistake installing keras-cpu
keras-cpu is not a thing, keras is only an API. it doesnt handle hardware on its own
it leaves that to the backend
I have a script I made for turning some geographic data points into a heatmap-style video. It uses scipy.interpolate.griddata() to do the interpolating, which works pretty well (see image).
However, I'd like for it to do some extrapolating outside the boundary created by available data points. I've tried using scipy.interpolate.Rbf for this with extremely poor results. More recently, I tried using scipy.interpolate.interp2d, which has also given extremely poor results (see image).
Can anyone explain why it's like this?
Try sending your code. It often makes issues like this easier https://paste.mcadesigns.co.uk
Give me a second to redact a couple things and I'll put it up. I'll put up a slightly different script than the one that made those screenshots, but it showed the same problems
^ Line 113 works, but if I comment it and uncomment Line 114 that creates bad images
Example from pasted code using interp2d:
Well, if anyone has any insight please ping me. I have this server muted, so I won't notice otherwise.
would asking questions about OpenCL and/or coherent noise be within the realm of this channel, or not really?
@analog helm it's python specific so probably not but asking in the offtopic channels would be fine
I found a great site to learn anything data science for free https://courses.cognitiveclass.ai
it's been there a while.. really elaborate..
too bad people don't use it much
wait a minute.. are you from IBM
@rich chasm i love u
Does anyone have any recommendations for someone wanting to produce and manipulate (coherent) noise in Python? I've only found two libraries which do such, both are ports of the C++ libnoise library. One of the ports has some random Visual Studios dependency, and I can't find a prebuilt wheel, official or not. The other has a dependency on (Py)OpenCL which is a holy nightmare of its own. I checked the original libnoise library, and there is nothing in it to do with OpenCL, so why the author of the port decided to make the baffling, inane, and stupid decision to weld that crap on without any choice on the user's part is completely beyond me. But now I'm just ranting!
Is there a library or something I'm missing which either ports libnoise or has equivalent functionality, and "just works" without necessitating any ridiculous proprietary runtimes, or exotic drivers?
It's gotten to point where I'm seriously considering just porting the original libnoise myself, but obviously I'd like to avoid that if i can
Oh right, the ports i found are noisepy (the one with the VS dependency) and PyNoise (the one with the OpenCL dependency)
@analog helm What kind of noise are you looking for
The noise itself isn't really the issue. I can implement perlin, simplex, voronoi, etc in 10 minutes for each. The main convenience of using libnoise is its full set of features and automation in regards to multi dimensional containers, multiple 'layers' of modules which modify the noise values, and the built-in visualization system
You can probably implement a lot of that relying on standard python data science libraries without much effort.
sklearn.naive_bayes.BernoulliNB
it has a binarize parameter
its a threshold
so i would assume it goes from [0,1]
however, using values above 1 produces different results, some better some worse
so i cant really understand what that threshold means
@serene veldt It's used to convert a floating point number to a binary (boolean) value
So, you need a boundary to determine in which category something belongs
The default is 0.0 (so all negative numbers go into the first category, all positive numbers go into the second; I don't know how it treats the boundary itself)
Since those floating point numbers can have any value (well... you know what I mean), the boundary can be "anywhere"
So, say that I have numbers ranging from 0 to 100 and I want the boundary to be 90, I can use 90 for the binarize parameter
I don't know, but does it matter? I'm not familiar with this model, but it could just be two groups without any significant meaning or order
i would just like to understand how the threshold works, since they are not specific at all
to run some tests
since, imo, they dont properly explain how to work with it and how it binarizes
but i appreciate the help
That's the function it calls to binarize it
According to the source code: https://github.com/scikit-learn/scikit-learn/blob/7b136e9/sklearn/naive_bayes.py#L920
@serene veldt
So it's 1 above the threshold and 0 below or equal
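A quick sketch of that rule in plain numpy, assuming the thresholding works as described above (strictly greater than the threshold gives 1, everything else gives 0). `binarize_like` is a made-up stand-in, not the scikit-learn function. The threshold is in the same units as the features, which is why values above 1 can still change the result when the features themselves go above 1.

```python
import numpy as np

X = np.array([[0.0, 0.5, 1.0, 2.5, 90.0, 100.0]])  # made-up feature values


def binarize_like(X, threshold=0.0):
    """Stand-in for the thresholding step: 1 if value > threshold, else 0."""
    return (X > threshold).astype(int)


print(binarize_like(X, threshold=0.0))
print(binarize_like(X, threshold=1.0))
print(binarize_like(X, threshold=90.0))
```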
plt.subplot()
thanks @lean ledge
Could someone point out how I can get my regex working? https://regex101.com/r/RlR4HO/1/
I'm trying to match just the fish name, then the amount, then the price. I've gotten pretty far, but for some reason I can't get specieweightprice knocked off my group 1 match, even if I add a non-capturing group that captures that exact phrase.
Regex101 allows you to create, debug, test and have your expressions explained for PHP, PCRE, Python, Golang and JavaScript. The website also features a community where you can share useful expressions.
Here's something that allows me to capture it like I want to: https://regex101.com/r/RlR4HO/2
But as soon as I add '?' the rest of group one steals it from the non capturing group.
what pattern do you want to catch
Bit of a long shot, would anyone be up for looking at my implementation of a limiter algorithm and help me figure out why it doesn't work as described in the paper it's based upon?
the paper is here: https://users.aalto.fi/~hamalap5/dafx2002/dafx_hamalainen.pdf
sure.. post your code.. I can get back by tomorrow..
hi i need to create a data table from a JSON url and then sort it, all of that in flask. how do i sort the table
like if u click on number sort by min/max
hey y'all
i need someone to push me in the right direction for a thing
i understand the basics of machine learning, but i've never actually implemented it into anything
here's the gist of what i'd like to do: i want to be able to train an algorithm to judge a section of text on whether it most displays one of four different traits
now, i've been told the way to do this is to make four models, one for each of the traits, and see which confidence appears highest
so, if this is right, what libraries would i want to be using, what kind of stuff should i be googling
if this is wrong, then how should i be going about this?
and then the previous question i guess
thanks y'all
@slate orchid i made a library which judges text based on different traits
that sounds... extremely relevant and useful?
for sake of visualization this website shows images depicting what it does
the model i used was.. i used a Word2Vec implementation and just plugged in the traits i wanted
for each word i found the 'correlation value' and averaged it over the entire section of text
huhh
for example
unfriendliness = model.most_similar(
    positive=['hostile', 'hurtful', 'unfriendly', 'mean'],
    negative=['friendly', 'affectionate', 'loving', 'kind'],
    topn=100000
)
are you familiar with Word2Vec?
nope, looking it up now
is that basically step one for all 'text reading' machine learning stuff?
sure, that sounds like a thing i want
Efficient topic modelling in Python
so how and what would i want to plug that into?
from gensim.models import KeyedVectors
for distinguishing the four traits
(i'll train it on text in which i've established which the dominant trait is)
there are 2 ways to do it : pretrained and posttrained
depending on how fast your application needs to be
mine is pretrained because i wanted it fast
by posttrained i mean 'train on the fly'
ah, sure
or more like 'compute on the fly' which is slow
can you do both?
anything is possible
woah
```py
print("Loading GoogleNews-vectors into word2vec (~30 seconds)")
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz',
    binary=True,
    limit=500000
)
```
the problem with using gensim directly is that every time you run the script it takes 30 seconds to load the model
maybe theres a way to do it faster but i dont know that
haha, clearly i have a lot to google here
gensim is fast but loading the googlenews into it takes time
gensim is... a library?
okay, in dumb terms for me: it inputs your word2vec stuff, and outputs...
(once trained)
in dumb terms Google already trained it on Google News
but the file is like 2GB so its slow to load
oh right
wait, what does it being trained on google news mean?
like, trained for what
it trained on which words occurred frequently together
for example if 'neko' and 'cat' appeared frequently together, their vectors would be closer
confusing
ah, okay, here's the bit i'm explaining crap
i want to have four traits that are basically arbitrary
let's say sweet, salty, bitter, and whatever the last one is
sour
that
if i gave it like 100 examples saying 'this text is sweet, this text is salty', could i then give it some text and it tell me which of those four it is?
yes
so... is this this thing then?
somehow your certainty makes me confused but awesome
because thats what i've been working on
so i tested it and stuff
i did have to choose the parameters though, so i chose limit=500000 and topn=10000 because they gave better values
limit meaning we use 500,000 words from googlenews and 10,000 for each trait but that was just for the specific use case i was using
the other 'optimization' was that i used a bunch of synonyms so that it would reflect the meaning better for example
with antonyms too
dominance = model.most_similar(
    positive=['dominant', 'assertive', 'capable', 'important'],
    negative=['submissive', 'apologetic', 'meek', 'passive'],
    topn=100000
)
as opposed to
dominance = model.most_similar(
    positive=['dominant'],
    topn=100000
)
okay, so what you're doing is trying to find phrases and such related to words that exist
i think?
i rank phrases
what... does that mean
i'm really sorry you're trying to be helpful and i'm pretty useless
"Mitsuki is really kind and sweet" -> friendliness: 3/10, dominance: -2/10
ranks the phrase based on traits
yes
friendliness = model.most_similar(
    positive=['friendly', 'affectionate', 'loving', 'kind'],
    negative=['hostile', 'hurtful', 'unfriendly', 'mean'],
    topn=100000
)
okay, i'm trying to do something kind of different i think
"oranges are really heavy on the acid and vitamin c" -> sweet:-2, sour:+8, bitter:-4
ah, sorry, my example was really confusing
those were meant to just be arbitrary phrases
let me guess you want a one hot encoding :
"oranges are really heavy on the acid and vitamin c" -> [sweet,sour,bitter]=[0,1,0]
f("oranges are really heavy on the acid and vitamin c" ,[sweet,sour,bitter]) -> [0,1,0]
in plain terms: i want a computer to tell me whether some text is more 'bleep' or 'bloop', having given the computer a bunch of phrases and told them whether they were 'bleep' or 'bloop'
choose_trait("oranges are really heavy on the acid and vitamin c" ,[sweet,sour,bitter]) -> sour
the computer has to figure out in the training bit what bleep or bloop actually mean
oh
so, what kinds of words they're associated with
yeah, i think that's what i want
maybe even a neural network
supervised learning
that's the thing i want i think
all data would have to be sorted by hand, so at best like 10^3
more data means less transfer learning and more layers of neural networks
i see
maybe a 2 layer DNN
like, i'm trying to make a crappy experiment, nothing's riding off of this being perfect
but i want it to mean SOME kind of thing
ok
DNN?
deep neural network
ah gotcha
but deep here will be just 1 or 2 layers
okay that's a googlable thing, thank you so much
np
what kinds of libraries do i want?
the easy way would be using keras
in tensorflow 2.0 keras is a part of tensorflow , but its pretty recent and you could find more examples of the 'old' keras
is keras an abstraction on top of tensorflow?
Keras is a powerful easy-to-use Python library for developing and evaluating deep learning models. It wraps the efficient numerical computation libraries Theano and TensorFlow and allows you to define and train neural network models in a few short lines of code. In this post...
yes
oh is it just easier tensorflow then
yay anytime 😃
by the way your project looks awesome
thanks!!!
afternoon
you too
:)
This is a nooby question, but i cant figure out how to label my data. it's a dataset with 140 features per sample in a 2d array with the last cell on each row being the class it belongs to (binary)
do i have to read the labels into a second array
using sklearn*
You can slice that array, transpose it and concatenate with the main array but vertically. Then you will get an array with 141 columns
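For the question as asked (labels sitting in the last column), plain slicing may already be enough. A minimal sketch with random stand-in data; the shapes follow the 140-features-plus-label layout described above, everything else is made up.

```python
import numpy as np

# Made-up stand-in: 5 samples, 140 features, last column is the binary class.
rng = np.random.default_rng(0)
data = np.hstack([rng.normal(size=(5, 140)), rng.integers(0, 2, size=(5, 1))])

X = data[:, :-1]             # every column except the last -> features
y = data[:, -1].astype(int)  # last column -> labels

print(X.shape, y.shape)
```

`X` and `y` can then be passed to an sklearn estimator's `fit(X, y)`.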
I'm pretty novice to python and data science. But I have some code that my organization wants to make available to everyone in our 30+ offices. What is the best way to package my product and make/push future updates?
yo anyone got a clue about this neural net, seems to keep oscillating around 50/50
```py
import numpy as np
import pandas as pd

inputs = []
outputs = []
train = pd.read_csv('traindata.csv')


def preprocessing(x):
    input = x.iloc[:, 0:-2]
    output = x.iloc[:, 6:8]
    for i in range(1, 12):
        inp = input.iloc[i - 1]
        inputs.append(inp.to_numpy())
    for i in range(1, 12):
        out = output.iloc[i - 1]
        outputs.append(out.to_numpy())


def sigmoid(x):
    return 1./(1. + np.exp(-x))


def sigmoid_prime(x):
    return x * (1. - x )


class StroopNetwork:
    def __init__(self, x, y_hat, epsilon):
        self.input = x
        self.theta1 = np.random.randn(4, 6)*np.sqrt(2/6)
        self.theta2 = np.random.randn(2, 4)*np.sqrt(2/4)
        self.expectation = y_hat
        self.output = np.zeros((2, 1))
        self.epsilon = epsilon

    def forward(self, i):
        self.layer1 = sigmoid(np.dot(self.theta1, self.input[i]))
        self.output = sigmoid(np.dot(self.theta2, self.layer1))

    def backward(self, i):
        delta_2 = (self.expectation[i] - self.output) * sigmoid_prime(np.dot(self.theta2, self.layer1))
        delta_1 = (delta_2 @ self.theta2) * sigmoid_prime(np.dot(self.theta1, self.input[i]))
        d_theta2 = self.output @ delta_2
        d_theta1 = self.layer1 @ delta_1
        self.theta1 += d_theta1
        self.theta2 += d_theta2


preprocessing(train)
bob = StroopNetwork(inputs, outputs, 0.12)
for z in range(1000):
    for i in range(1, 12):
        bob.forward(i-1)
        bob.backward(i-1)
        print(bob.expectation[i-1], bob.output, z)
```
i'm 🅱 retty new to this so it should be a rookie mistake lel
^ bit of sample output
def sigmoid_prime(x):
return x * (1. - x )
should be like
def SigmoidGradient(Z):
return Sigmoid(Z) * (1 - Sigmoid(Z)) no?
https://git.io/fhUGE @lapis sequoia this might help, been like a year since I wrote those codes though, so don't remember much >.>
sigmoid(z) = a
so yea that is correct but it doesn't make a difference @hasty maple
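A small sketch of the two conventions being discussed, in case it helps. Both forms give the same derivative, but only if `a * (1 - a)` is fed the activation `a = sigmoid(z)` and not the raw `z`. In the posted `backward`, `sigmoid_prime` is called on `np.dot(...)`, which looks like the pre-activation, so that could be one thing to check. The function names here are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime_from_z(z):
    """Derivative written in terms of the pre-activation z."""
    s = sigmoid(z)
    return s * (1.0 - s)


def sigmoid_prime_from_a(a):
    """Same derivative, written in terms of the activation a = sigmoid(z)."""
    return a * (1.0 - a)


z = 0.7
a = sigmoid(z)
print(sigmoid_prime_from_z(z), sigmoid_prime_from_a(a))  # same value
print(sigmoid_prime_from_a(z))  # different: z was passed where a is expected
```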
idk it's really weird
strange indeed, I would suggest checking the shapes. Usually numpy doesn't cry but works with whatever shape you throw at it. For a multiplication, A(m*n) should have B(n*k), but if it's A(m*n) and B(n) it would still work and give an output, but it might not be the one you expect
sorry i don't quite understand, what do you mean by check for the shapes?
self.layer1 = sigmoid(np.dot(self.theta1, self.input[i]))
self.output = sigmoid(np.dot(self.theta2, self.layer1))
Output the shapes of self.theta1, self.input[i], self.theta2, self.layer1, self.output
When I coded it up last year, the shapes caused some trouble in the math, the self.input[i] might have caused the shapes to not be correct for the matrix multiplication operation
ok ty i'll check
👍
ah you aren't using epsilon
the network might be jumping back and forth due to the large steps
try running it again with epsilon, use small values of epsilon, 0.001,0.0001, etc
Also maybe check for the gradients as well, compute em numerically
You're welcome :)
I can't seem to find the pdf for it, but it's available in Andrew Ng's course on Coursera, he shows how to calculate it in matlab, you can code it in python as well
ok ty i'll give it a search
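A minimal sketch of what computing the gradients numerically can look like, using a central difference on a toy one-weight sigmoid unit. The loss, the input `x`, the target and `h` are all made up for illustration; this is not the course's code.

```python
import math


def loss(w):
    """Toy loss: squared error of a one-weight sigmoid unit (illustrative only)."""
    x, target = 1.5, 1.0
    out = 1.0 / (1.0 + math.exp(-w * x))
    return 0.5 * (out - target) ** 2


def analytic_grad(w):
    """Gradient of the toy loss derived by hand with the chain rule."""
    x, target = 1.5, 1.0
    out = 1.0 / (1.0 + math.exp(-w * x))
    return (out - target) * out * (1.0 - out) * x


def numeric_grad(f, w, h=1e-5):
    """Central difference approximation of df/dw."""
    return (f(w + h) - f(w - h)) / (2.0 * h)


w = 0.3
print(analytic_grad(w), numeric_grad(loss, w))
```

If the two numbers disagree by more than a tiny amount, the hand-written backward pass is the usual suspect.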
im trying to make a digit recogniser. I dont really know what to do. Any help would be great. Thankyou
There are plenty of tutorials online for that. Do you want help finding a good one? Also, do you understand generally how digit recognizers work? If not, do you care about learning the fundamentals or do you just want to code something?
Also, does anyone know what InvalidArgumentError: Can not squeeze dim[1], expected a dimension of 1, got 250 [[{{node metrics_5/acc/Squeeze}}]] means when calling model.fit in TF keras? It backtraces to some nondescript function calls: https://pastebin.com/0bY4TswX
does this help? https://stackoverflow.com/q/49083984
I looked into that yesterday, and it didn't really help. That user is getting the error from a different function.
you have 250 nodes in your output layer right?
Oh, no. I'm new to ML, so I'm still wrapping my head around the architecture.
https://stackoverflow.com/q/55634133 this looks similar to what you're getting?
Not exactly, but I think I know where I'm getting tripped up now.
Not really related to the last question, but I have an architecture question, too.
My X shape is (17042, 250, 87), so 17042 sequences of length 250, each with 87 features. My Y shape is a vector of length 250 containing 1s and 0s, denoting whether a point in the input is important or not.
The last layer in the network should be an LSTM with units=250 and return_sequences=True, right?
Edit: NVM above question.
Anyone have a nice presentation on why Python for data science over BI tools (Splunk, Tableau, etc.) that I can show senior management?
@vague jetty i found something online. using the mnist digit dataset. it says its called a Multi-Layer Perceptron. However when i input my own data to test the model it is incredibly wrong
would it be viable for me to relearn neural networks? and make it from scratch?
I always advocate building things from scratch over copying code. If you have the time, you should absolutely relearn NNs and build one from scratch. I don't know where you are competency-wise, but you should be able to figure out what you do and don't know
@turbid bay Multi-layer perceptron is mostly synonymous with neural network (although some literature makes a slight distinction between them afaik), so don't get confused by the fancy terminology. The MNIST digit recognition task is absolutely canonical and you will find a massive amount of resources dedicated to the problem by just googling - Geoffrey Hinton recognized it as the drosophila of machine learning, meaning it's an extensively studied problem and a good place to start your ML journey
And I would also advocate for trying to build your own NN to solve the task - and after you're done, take a look at the state-of-the-art networks for the problem (Kaggle is a good place for this), see what they do differently, and try to understand why
Keras provides an extremely modular and straight-forward API so you don't really need extensive knowledge beyond what layers are and how they work, although more knowledge always helps
yh i will try. i learnt before how to do some of it. But it was in octave and i found octave very hard to learn so hopefully will understand it when trying to do it in python
i will work on making my own NN. But be prepared for many questions XD
There are some very smart and educated people in this chat so don't be afraid to ask for help
will do thanks
Alright, I fixed some of the layers in the model I posted earlier. Now I'm getting AttributeError: 'builtin_function_or_method' object has no attribute 'shape' from inside some of keras's helper libraries. Here's the Google Colab with the whole code: https://colab.research.google.com/drive/1yvIYYiBVtqQgVGER9rZr9cnxzlTwyiRz
I get an error for that colab
cannot load it
Error loading https://apis.google.com/js/client.js
Error: Error loading https://apis.google.com/js/client.js
at HTMLScriptElement.k.onerror (https://colab.research.google.com/v2/external/gapi_loader.js:9:415)
wait fixed it
my ad blocker's fault 😦
Haha, it happens.
It's read only though
Can you not see the error in the last cell? I've never shared a Colab before.
yes. I see your output but can't change anything
(may be for the better now that I think about it)
can you check if in the lstm function the data is numpy arrays?
found the error lol
Bingo.
yo
i'm putting some tweets into keras text classification
now bear in mind i have no clue what i'm doing
say i were to replace the twitter image link with something like TWITTERIMAGELINKHERE
would that be okay to be added to the 'vocabulary', and then having an image is counted as relevant to the classification?
I would say yes, if you were using some type of bag of words model
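A rough sketch of that replacement step, assuming the image links look like `pic.twitter.com/...` (that pattern is a guess; adjust it to whatever the real tweets contain). The placeholder then becomes an ordinary vocabulary word for the tokenizer.

```python
import re

IMAGE_TOKEN = "TWITTERIMAGELINKHERE"
# Assumed link shape; the real data may need a different pattern.
IMAGE_LINK = re.compile(r"(?:https?://)?pic\.twitter\.com/\S+")


def replace_image_links(text):
    """Swap every image link for a single placeholder word."""
    return IMAGE_LINK.sub(IMAGE_TOKEN, text)


tweet = "look at this owl pic.twitter.com/abc123"
print(replace_image_links(tweet))
```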
yup
nice
can classifications be given actual names?
or are they just integers normally
you would usually have a classification for each integer, no? like 0 = positive, 1 = negative, 2 = whatever, etc.
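A tiny sketch of keeping names and integers side by side, so the model trains on integers while results are reported with names. The class list and its order are hypothetical; any fixed order works as long as it stays the same between training and prediction.

```python
# Hypothetical label order; keep it fixed once the model is trained.
CLASS_NAMES = ["positive", "negative", "neutral"]
NAME_TO_ID = {name: i for i, name in enumerate(CLASS_NAMES)}

labels = ["negative", "positive", "neutral", "negative"]
encoded = [NAME_TO_ID[name] for name in labels]  # what the model sees
decoded = [CLASS_NAMES[i] for i in encoded]      # what you show to people
print(encoded, decoded)
```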
okay then i'm gonna need to find a solid answer on the correct order of the hogwarts houses
i've got a weird project on
Should I get started with tf, keras tf, keras, tf2 ? Which is best? Just trying to learn so I have no specific objective. Just trying to start from what's more logical
Mention me if u answer pls
@void star I'd learn tf and/or pytorch. Pytorch is nicer in my opinion but I kind of do both since for some jobs they want tf and for others they want pytorch. They all do the same thing
r/datascience: A place for data science practitioners and professionals to discuss and debate data science career questions.
if you do tf, do tf2
@lyric canopy could you go through the above sub reddit and see if it's pin worthy for this channel
I agree it is a good place to start
hey, if i want to make text training data for keras, can i go with a csv file formatted like
"text stuff text stuff text stuff", 0
"more text things, other text things", 1
i'm making the dataset out of some survey stuff
so i'm getting other people to classify stuff for me
i need a way for them to send stuff back to me
it's not gonna be that big
okay. So you're probably gonna load it from that csv anyway before using it, which means the format is whatever you want it to be and you just parse it accordingly
nice
thank you!
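A minimal sketch of parsing that format with the standard `csv` module, assuming each row is a quoted text followed by an integer label as in the two example lines above. The `io.StringIO` is only a stand-in for the real file.

```python
import csv
import io

# Stand-in for the file contents; with a real file, use open(path, newline="").
raw = io.StringIO(
    '"text stuff text stuff text stuff", 0\n'
    '"more text things, other text things", 1\n'
)

texts, labels = [], []
# skipinitialspace drops the space after the comma that separates the columns.
for text, label in csv.reader(raw, skipinitialspace=True):
    texts.append(text)
    labels.append(int(label))

print(texts, labels)
```

The quotes matter here: they are what lets the second text keep its own comma without being split into an extra column.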
oh, one more thing
how important is it for the dataset to be balanced?
so, i have four classifications
how important is it that i get a roughly 1:1:1:1 dataset
well, it would be easier but there are ways to deal with class imbalance
you could modify your loss function to account for class frequency, or sample your batches so that they contain roughly equal parts of each class even though the actual dataset does not,
or you could just randomly select a subset of each class for your training set
Not sure what the goto way is in NLP
Uh there is another one (which is fun, but probably not viable): You could try to generate synthetic samples of your smaller classes :p
yw. This is what this discord is for 😃
(or maybe you can rethink your class definitions in a way that make them more balanced)
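A sketch of the first option (weighting the loss by class frequency), using inverse-frequency weights. The label counts are made up. As far as I know a dict like this can be passed to Keras via `model.fit(..., class_weight=class_weight)`, but treat that as something to confirm in the docs.

```python
from collections import Counter

# Made-up label counts for four classes.
labels = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 5

counts = Counter(labels)
n_samples, n_classes = len(labels), len(counts)

# Inverse-frequency weights: rare classes get a larger weight in the loss.
class_weight = {c: n_samples / (n_classes * k) for c, k in counts.items()}
print(class_weight)
```

With perfectly balanced classes every weight comes out as 1.0, so the weighting only kicks in when the dataset is actually skewed.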
uhhh
my class definitions are a little...
fixed
(i'm trying to make something that classifies by hogwarts house)
(don't ask)
well, I'm not in the replace-talking-hats business but I reckon the houses should be pretty equally sized no?
depends on how my friends decide to classify stuff
i'll tell them to try to keep things roughly equal-ish
a little bit of imbalance is fine I reckon. I'd just test it and you will probably see if your network tends to ignore one class or prefer one
is anyone familiar with tensorflow
can you tell me about tf graph and tf session real quick
not sure if I need to initialize a new table for every session
and whether I can have my for loop of items outside tf session or inside
@sand lark great thank you.
@lapis sequoia maybe try tf2. eager execution ftw
I havent learnt it yet
it's very similar, but imho much easier to grasp
you don't need to build a graph anymore with said eager execution
also it has built in keras for "common" architectures
can you share some example code
im doing something like this
what do you think
for query in some_queries:
    with tf.Graph().as_default():
        with tf.Session() as sess:
does your for loop need to be outside of the session?
I tried keeping it inside the session, but got an error saying my table was already initialized
are you only doing inference?
it needs a new table initialized for every query in the for loop..is there a different way
Im getting embeddings.. and comparing them to other embeddings
this is my embd function
hmm. gotta be honest, never run into that case. From the docs I guess that you need to initialize the tables for the graph once
def embd(inputs, module, placeholder, sess):
    embeddings_tensor = module(placeholder)
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    message_embeddings = sess.run(embeddings_tensor,
                                  feed_dict={placeholder: inputs})
    return message_embeddings
so I run this function for every query.. in a new session
so you would need to maybe put the for loop inside the graph context and call the session inside the loop?
I timed that.. it took longer
than If I kept it outside the graph
like 9 seconds longer
I cant afford this.. it takes too long..just trying to get it done faster..
like 2 queries takes 31 seconds
that's days.. if I want to run thousands of queries
how many different graphs do you need to loop over?
what does that mean
I think it's the same graph
im not really sure about graph and sessions..still new
okay different approach :p
I'm guessing by embedding you mean you have a vector that contains some kind of information?
in your embd() function, inputs is a datapoint, module is a neural net?
yes.. its an encoder..
and you want to compare different encodings?
I have a set of embeddings which I want to compare with the embedding gotten from the queries list
I just changed my code
now it runs at half time
woohoo
👍
Necessity is the mother of invention.. saving my ass
thanks man
it really helped talking it out with you
happy to be the rubber duck 😃
what I did was this
Initialized graph and session..
loading the module..
placeholders..
then sess run.. initialized variables and tables..
then in loop, I called for embeddings for each query
sounds like the way to go
doing the initializing beforehand should save loads of time
yeah before I was doing all of this with a function call for each query..
which was a waste of time
anyone familiar with using cloud computing to train neural nets using a python script? pretty confused between several options
@midnight atlas that's a pretty nebulous and broad question, wanna narrow it down?
Well essentially it's two parts, which platform is most intuitive to setup and then how can I do so?
I want to run a training script that runs over a 50gb database
I am using MultiLabelBinarizer() with np.array(), and I got this error: TypeError: 'numpy.float64' object is not iterable
Then I got TypeError: 'numpy.int64' object is not iterable
Anyone?
Which lines gives you that error?
You should be able to find which variable is assigned to a single number instead of an iterable
Trace it back to the source and then try to understand why it's not what you think it is
@lyric canopy new_array_of_labels = mlb.fit_transform(array_of_labels)
So, array_of_labels probably isn't what you think it is
Did you try printing it just before this line?
Oh
It should be an iterable of iterables, not an array of single numbers
According to the docs
this is array_of_labels - [0 0 0 ... 5 5 5]
Ah, yes, so the elements are single numbers
But it actually wants an array of iterables (other arrays, tuples, ...)
"""def:
I tried to run this: H = model.fit_generator(aug.flow(trainX, trainY, batch_size=BS),validation_data=(testX, testY),steps_per_epoch=len(trainX),epochs=EPOCHS, verbose=1)
And I ended up with this at the end: ValueError: Error when checking target: expected activation_7 to have shape (6,) but got array with shape (1,)
The data has 6 classes.
@midnight atlas you can just run it off a single AWS instance
Hello?
sorry it's the p2.8xlarge or p2.16xlarge
you can probably get away with g3 instances as well
how would you guys describe the difference between a classifier and regression model in simple terms?
Hello? I need help.
What are your layers like?
oh wait
is your generator giving only 1d array?
it should look like this [[0, 1, 0, 0, 0, 0]]
instead of [0, 1, 0, 0, 0, 0]
hold on, i just woke up, brb 😛
Okay, another problem might be your training outputs might just be a number. Have you converted them into onehot labels? Your model expects something like this [1, 0, 0, 0, 0, 0] for trainY.
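A rough sketch of the one-hot conversion being described, using plain NumPy. The label values and the names `train_labels` / `train_y_onehot` are hypothetical; the only thing taken from the conversation is that there are 6 classes and the model expects targets of shape (6,) per sample:

```python
import numpy as np

NUM_CLASSES = 6  # the data has 6 classes

# Hypothetical integer labels, one per training sample.
train_labels = np.array([0, 3, 5, 1])

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing with the labels yields one one-hot row per sample.
train_y_onehot = np.eye(NUM_CLASSES, dtype="float32")[train_labels]

print(train_y_onehot.shape)  # (number of samples, NUM_CLASSES)
print(train_y_onehot[1])     # one-hot row for class 3
```

Keras also ships `keras.utils.to_categorical` for the same job, if you would rather not do it by hand.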
I will check it out.
@hoary terrace Classification produces discrete values, regression produces continuous values. If you want to determine whether an image shows a cat or not, and nothing in between, those are 2 discrete classes with no overlap (nocat/cat, or 0/1). If you instead wish to predict how much something looks like a cat, and so get an output anywhere between 0 (absolutely not a cat) and 1 (absolutely a cat), for example 0.75 for "wow that looks a lot like a cat, but not entirely", that would be regression. Naturally, the techniques that you would use for either can overlap, for example classification with neural networks - imagine you're trying to decide which of 10 discrete animal classes is shown in a picture - is it a cat, dog, or armadillo? The neural network would likely end up having 10 output neurons where each represents one of your classes. After the input is propagated through the network, each of the neurons has a certain value - we can understand each neuron's output to be the probability that the image shows its designated animal. We would use a softmax activation to get a normalized probability distribution, and then simply choose the most probable animal as the output. If the input is a cat, the cat's neuron would maybe have a value of 0.4, a dog would have 0.3 because it looks similar, but the armadillo would only get 0.05 because it looks nothing alike. However, we are still performing classification because the output is a discrete class - a cat.
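To make the softmax-then-pick-the-max step concrete, here is a small NumPy sketch. The raw scores (`logits`) and the three class names are invented for illustration, and a real network would have one score per class:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution that sums to 1."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

# Hypothetical raw outputs of the last layer, one per class.
class_names = ["cat", "dog", "armadillo"]
logits = np.array([2.0, 1.7, -0.1])

probs = softmax(logits)

# Classification: the answer is the single most probable class.
predicted = class_names[int(np.argmax(probs))]
print(dict(zip(class_names, probs.round(3))), "->", predicted)
```

The probabilities themselves are continuous, but the final answer is still one discrete class, which is what makes this classification.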
yo anyone know what be going wrong with this? https://pastebin.com/arjx0LLQ
just seems to be oscillating around 50/50
@void anvil how can I upload a large dataset to a VM? struggling with this part as SFTP type solutions can't handle the 50 gb easily
I have a pandas column that is a list of lists
how do I split that column in to separate columns for each item
I didn't take a direct approach, this is what I did:
1. make dataframe using dict, where values are list of lists
2. assigning column names
3. create separate concatenated df where I do a .apply(pd.Series) on the columns containing lists
4. reset index (because up until now, the dict key is the index)
5. Assign column names to the new df..
check out unstack
oh hang on, you mean a column that contains lists
not a grouped column
yeah
yeah so this works.. but requires that the split column be assigned to a new dataframe
then I have to concat..
similar to what I'm doing with apply and pd.series
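For reference, a sketch of a slightly shorter variant of the same idea, assuming every list in the column has the same length. The frame and the column names (`key`, `values`, `v0`..`v2`) are hypothetical. It still builds a second frame and joins it back, so it does not avoid the extra step, but it skips `.apply(pd.Series)`:

```python
import pandas as pd

# Hypothetical frame where each cell of "values" holds a list.
df = pd.DataFrame({"key": ["a", "b"], "values": [[1, 2, 3], [4, 5, 6]]})

# Build the new columns straight from the lists, keeping the original
# index so the rows line up when joined back.
expanded = pd.DataFrame(
    df["values"].tolist(),
    index=df.index,
    columns=["v0", "v1", "v2"],
)

# Drop the list column and attach the expanded columns in one go.
result = df.drop(columns="values").join(expanded)
print(result)
```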
I want to filter my dataframe, based on values in a column.. whether they contain a certain text or not
I can do conditionals for whole match.. but not sure how to do it for contains
I did str.contains.. and it seems to work..
but not sure if it's right..
hmmm
str.contains is the best way to do it if it's sufficient for what you want
otherwise you need to map
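A small sketch of the `str.contains` filter, with a made-up `title` column. The optional arguments shown are the ones that tend to matter: `case` for case sensitivity, `na` for what to do with missing values, and `regex=False` to treat the pattern as plain text:

```python
import pandas as pd

# Hypothetical data, including a missing value.
df = pd.DataFrame({"title": ["red apple", "green pear", None, "Apple pie"]})

# Boolean mask: True where the text contains "apple", ignoring case.
# na=False makes missing values count as "no match" instead of NaN.
mask = df["title"].str.contains("apple", case=False, na=False, regex=False)

filtered = df[mask]
print(filtered)
```

Without `na=False`, missing values in the column can leave NaN in the mask, which then fails when used for boolean indexing.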
is anyone alive
I want to left join two dataframes..
do I mention which columns.. in left_on and right_on.. what if I needed one more column from the right
I got it..
I just had to mention the columns to join on..
didn't need to mention the others I wanted to add.. it did it anyway
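A sketch of that left join with invented frames and column names. `left_on` / `right_on` only name the join keys, and every other column from the right frame comes along automatically. To bring only some of them, one option is to select them on the right frame before merging:

```python
import pandas as pd

# Hypothetical frames whose key columns have different names.
left = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 20, 30]})
right = pd.DataFrame(
    {"cust_id": [10, 20], "name": ["Ann", "Bob"], "city": ["Oslo", "Rome"]}
)

# Left join: keep every row of `left`, attach all columns of `right`.
merged = left.merge(right, how="left", left_on="customer_id", right_on="cust_id")

# Same join, but only carrying "name" over from the right frame.
slim = left.merge(
    right[["cust_id", "name"]],
    how="left",
    left_on="customer_id",
    right_on="cust_id",
)
print(merged)
print(slim)
```

Rows of `left` with no match on the right (customer 30 here) are kept, with NaN in the right-hand columns.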
how do I convert a header to the first row of dataframe?
not when reading
like after assigning column names
adding the same header as row
didn't understand it. can you give a small example?
consider you have a header for a dataframe..
which has column names A, B
you want the first row of the dataframe to be A, B as well
and move the rest of the dataframe contents one cell down.. to accommodate this
You can use pd.DataFrame.shift with the optional fill_value argument and give it your values
thanks
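One caveat worth noting: `shift` keeps the index the same length, so shifting down by one pushes the last row off the end. A sketch of an alternative that keeps every row, using `pd.concat` instead (this swaps in a different technique from the one suggested above; the frame is made up, with columns A and B as in the example):

```python
import pandas as pd

# Hypothetical frame with header A, B.
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# One-row frame whose values are the column names themselves.
header_row = pd.DataFrame([df.columns.tolist()], columns=df.columns)

# Stack the header row on top; the original rows move one position down.
with_header = pd.concat([header_row, df], ignore_index=True)
print(with_header)
```

The columns end up with object dtype, since they now mix strings and numbers.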