#data-science-and-ml


light cloud
#

And that was my understanding initially. As I started reading up on KNN, there were several articles talking about how KNN could do forecasting. Same with RF, but it was predicting just the next day's weather and not doing a time series forecast.

#

This convo has helped clear some stuff up so I appreciate all the responses.

paper niche
#

I think weather might be okay for RF since the temperatures don't typically exceed historic highs and lows.

pulsar stag
#

Tutorial video on how to connect to the Binance API & break down the data with python and create dashboards & graphs! Hope you guys find it interesting & learn a thing or two: https://youtu.be/RHqEPNgpbzQ

vague jetty
#

Any recommendations for a science-orientated IDE? Spyder's debugging tools aren't great, PyCharm's science view isn't working for me (so I can't explore my data), and I'm not a fan of notebooks

reef bone
#

JupyterLab should give a more sophisticated experience than plain notebooks, so maybe see if that works better for you. And there is a plug-in for Atom called Hydrogen that provides a notebook-like experience and I've seen people swear by it, but never used it personally

#

Otherwise I'm afraid there isn't that much, at least as far as I know

vague jetty
#

I'm not a fan of notebook-style work. I'm doing scientific software development

#

So it's less pure data exploration and more software development

#

Hence why I want good debugging tools

reef bone
#

In that case pycharm is probably your best bet

#

See if you can figure out why the science view thingy doesn't work for you, you might be able to find someone with the same problem and maybe a solution

vague jetty
#

Yeah, I'll give it a shot. Thanks!

reef bone
#

(I abandoned pycharm myself because of its awful support for notebooks), I'm aware that's not what you're looking for but afaik it's a well known problem that pycharm is a bit lacking in that department

#

Or at least used to be, apparently there was just an update

lunar rover
#

Hello guys,
What would be the best way to create a web app (with Django or Flask) which would demonstrate:
1: Automatic Data Gathering for Stocks,
2: Automatic Datasheets for a variety of Stocks and
3: Stock prediction (approximately if possible)
If someone has good advice or a tutorial/instructions, I would be really grateful. I don't have any Data Science experience, so I am just looking for a good start.
Thank you in advance 😃

vague jetty
#

Just got it working, I think? I used to be a big Spyder fan, but I've grown too dependent on the good debugging tools of JetBrains' stuff 😛

vague jetty
#

If I want to call pyplot.axvspan to highlight a range in time-series data indexed by DateTimes, what do I put in for the data units of the range?

#

nvm

pearl tangle
#

Hey guys just made the decision to dedicate my future to deep Learning but I have an issue.

#

My maths sucks. So I came up with a plan to learn the required maths alongside learning deep learning and was wondering if you guys could help me out with the order I should learn each topic

#

I’m starting off with Algebra 1&2

#

Moving onto Linear Algebra

#

Then finally calculus followed by multivariate calculus

#

Is this the right order to learn each of the topics?

lean ledge
#

I'd do algebra 1&2, single variable calculus, linear algebra, multivariate calculus, probability theory

waxen vine
#

self teaching myself calculus on top of all this programming 😛

lean ledge
#

Calculus is great though

waxen vine
#

I would agree, but I am not the best at math to start with, so learning it on my own is like ugh

feral jungle
#

This really doesn't have to do with data science, aside that I'm attempting to learn numba

#

Is there any way I can use str() in a vectorize, or at the very least run a line of code off the gpu?

lapis sequoia
#

im trying to make a program which simulates aerodynamics around a object using vectors can someone who has some experience with vectors please hmu 😃

tall drum
#

I have a pandas dataframe consisting of about 10 columns (float values). How do I split the dataframe into subgroups where one column has for example values over 80?

#

I know of groupby() but I think I need something more because it will output a new dataframe for each row but I would like to split them into "sections" where consecutive values are over 80

dim beacon
#

@tall drum

df[df['column name'] > 80]

tall drum
#

Basically I would like to create a new dataframe for each of the circled sections:

#

Do I just have to loop through the dataframe and store the start and end points for each section in a list or is there a better way?

dim beacon
#

@tall drum i literally gave you the solution

tall drum
#

Thanks I tried it but the output was only one dataframe.

#

I would like to create a dataframe separately for each of the "spikes" in the data. Your solution is one dataframe in which those spikes are concatenated .

paper niche
#

@tall drum you can do it like:

  0. ensure your indices are numbers (not dates, etc.); reset index if need be.
  1. extract the indices which meet your criteria (>80).
  2. split the indices (a numpy array) into blocks containing consecutive numbers
  3. use these indices to extract from the df again

#
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((30, 2)), columns=['a','b'])

# find indices where the 'a' column has values > 0.5
idx = df[df['a']>0.5].index.values

# splits numpy array into blocks containing consecutive elements
idx_list = np.split(idx, np.where(np.diff(idx) != 1)[0]+1)

# use the indices to extract from df
for i in idx_list:
    print(df.loc[i,:])
#

a simple example
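
A possible alternative to the index-splitting approach above, as a rough sketch: label each run of consecutive rows with a counter that goes up whenever the condition flips, then group on that label. This does not need a numeric index. The column name `a`, the threshold 80 and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column name and threshold are illustrative only.
df = pd.DataFrame({"a": [10, 85, 90, 20, 30, 95, 81, 88, 5]})

mask = df["a"] > 80

# The run id increases by one every time the mask changes value,
# so consecutive rows with the same mask value share an id.
run_id = (mask != mask.shift()).cumsum()

# Keep only the rows above the threshold and split them by run id.
spikes = [group for _, group in df[mask].groupby(run_id[mask])]

for spike in spikes:
    print(spike)
```

Each element of `spikes` is its own DataFrame holding one consecutive stretch of rows above the threshold.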

tall drum
#

Thank you

lunar rover
#

anyone used the Alpha Vantage API? or IEX?

distant inlet
#

Hi guys

#

I'm new to python

#

And I want to start with data analytics

#

Which libraries should I learn

paper niche
#

@distant inlet pandas and numpy are a must. as for plotting libraries, minimally matplotlib

distant inlet
#

Thanks man !

#

Also I have no prior experience in data analytics

paper niche
#

we all start somewhere

#

😃

distant inlet
#

python 😍

paper niche
#

oh and sklearn if you're doing machine learning

distant inlet
#

Cool..

#

No ML...it's so freaky complex

paper niche
#

keras and tensorflow/pytorch if you're doing deep learning

#

haha okay then

distant inlet
#

:)

#

Any book that u can recommend for data analytics

paper niche
#

I'm not that well-read, but I'm looking at "Machine Learning: A Probabilistic Perspective" now. It seems to put quite a heavy emphasis on the statistical side, so it might still be relevant to you. You can consider it.

rocky moth
#

idk if this is the right place to ask but I want to do a small study on analysing album covers to see whether or not I can train a neural network to recognize the genres associated with the album.
does anyone have an idea where to start or experience with these kinds of things? Currently thinking of using tensorflow, but that's as far as my thought process has gone

#

I have accumulated a dataset of ~4k labeled album covers

#

ill search for some more information and come back with better defined questions 😛

gaunt axle
#

How does one calculate the maximum both positive and negative and minimum both positive and negative numbers in IEEE 754?
I have the answers but...
Maximum positive: 1,111 1111 1111 1111 1111 1111 x 10^111 1111]2
Minimum positive: 1 x 10^-111 1110]2
Minimum negative: -1 x 10^-111 1110]2
Maximum negative: -1,111 1111 1111 1111 1111 1111 x 10^111 1111]2

desert cradle
#

@gaunt axle the information for the system float type is present in sys.float_info, the information for numpy types is in numpy.finfo([type])

#

they use different names unfortunately, so they're not interchangeable
```py
>>> import sys, numpy
>>> sys.float_info.max
1.7976931348623157e+308
>>> sys.float_info.min
2.2250738585072014e-308
>>> -sys.float_info.min
-2.2250738585072014e-308
>>> -sys.float_info.max
-1.7976931348623157e+308
>>> numpy.finfo(numpy.float64).max
1.7976931348623157e+308
>>> numpy.finfo(numpy.float64).tiny
2.2250738585072014e-308
>>> -numpy.finfo(numpy.float64).tiny
-2.2250738585072014e-308
>>> numpy.finfo(numpy.float64).min
-1.7976931348623157e+308
```

gaunt axle
#

Thanks @desert cradle but I was looking for the explanation behind it hehe : )

desert cradle
#

well, you understand binary, right?

#

when you have a "decimal" [not actually decimal but there's no better name for it] point in binary, that means that just like in decimal it goes 1/10, 1/100, 1/1000, it goes 1/2, 1/4, 1/8

#

so 1,11111111111111111111111 is 1 + 1/2 + 1/4 + 1/8 + 1/16 ... + 1/8388608

#

which is 1.99999988079071044921875 if written in decimal

#

and then that number is multiplied by 2^127

#

so you get 340282346638528859811704183484516925440

#

which would be 0xffffff00000000000000000000000000 in hex

#

@gaunt axle

#

is that what you're looking for? otherwise i'm not sure what explanation you want
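
The arithmetic above can be checked with the standard library. This is only a sketch: it builds the largest finite single-precision value from its bit pattern (sign 0, exponent field 1111 1110, all 23 explicit mantissa bits set) and compares it with the hand calculation. The variable names are made up for illustration.

```python
import struct

# Bit pattern of the largest finite float32:
# sign 0, exponent field 1111 1110, mantissa all ones.
bits = 0x7F7FFFFF
max_float32 = struct.unpack(">f", struct.pack(">I", bits))[0]

# The same value from the explanation: 1 + 1/2 + ... + 1/8388608 = 2 - 2**-23,
# multiplied by 2**127.
significand = 2 - 2**-23
by_hand = significand * 2.0**127

print(significand)
print(int(max_float32))
print(hex(int(max_float32)))
```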

gaunt axle
#

That helped a lot! Now I must work it out with the IEEE 754 single precision format (1 bit for sign, 8 for exponent, and 22 for mantissa(?))

desert cradle
#

23

#

and there's the implicit 1 in the mantissa

gaunt axle
#

Ohh, we were told we don't write those

desert cradle
#

well it's 23 explicit bits, 24 including the 1

#

sign is the simplest, it's just 1 for negative 0 for positive

#

then the exponent is a bit complicated because of denormalized numbers, infinity, and NaN

gaunt axle
#

^^^yeah that!!

#

I wonder if I have to memorize those for a test lol

desert cradle
#

and the regular values are just offset by 127, so 1 [00000001] is -126, 127 [01111111] is 0, and 254 [11111110] is +127
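
To see the offset of 127 in practice, here is a small sketch that pulls the three fields out of a float32 using only the standard library. The helper name `float32_fields` is made up for this example.

```python
import struct

BIAS = 127  # the exponent offset for single precision


def float32_fields(x):
    """Return (sign, stored_exponent, mantissa_bits) of x encoded as a float32."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    stored_exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, stored_exponent, mantissa


for value in (2.0**-126, 1.0, 2.0**127):
    sign, stored, mantissa = float32_fields(value)
    # Print the stored exponent bits next to the real exponent (stored - 127).
    print(value, format(stored, "08b"), stored - BIAS)
```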

gaunt axle
#

I didn't understand the logic behind , why do we add 127 to the exponent

desert cradle
#

because that way the negative and positive values are all in order

gaunt axle
#

I just assumed it was like that bc, it said "excess by 127"

desert cradle
#

the smallest one is the most negative value, and the largest one is the largest positive value

#

(00 for denormal works because it's smaller than any value with 01, and FF for infinity is larger than any other possible value)

#

in fact, the floating point bits for +0.0 end up being 0 00000000 00000000000000000000000

#

which is useful because it means when memory is allocated and filled with zero bytes, it can be used directly as a float initialized to zero without any special code to fill in the parts that are floats
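
Both points can be tried out with a short sketch using the standard library: four zero bytes decode to 0.0, and for non-negative floats the raw bit patterns sort in the same order as the values themselves. The sample values and the helper name `float32_bits` are arbitrary choices for illustration.

```python
import struct

# Four zero bytes, as you would find in freshly zeroed memory.
raw = bytes(4)
(value,) = struct.unpack(">f", raw)
print(value)


def float32_bits(x):
    """Return the float32 encoding of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


# Non-negative samples in increasing order, including a denormal and infinity.
samples = [0.0, 1e-40, 1.0, 1.5, 1e30, float("inf")]
patterns = [float32_bits(s) for s in samples]

# True when the integer bit patterns are in the same order as the floats.
same_order = patterns == sorted(patterns)
print(same_order)
```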

gaunt axle
#

I will try to do my hw now : ), thanks @tribal kindle

desert cradle
#

lol you @'d the wrong lemon

#

this april fools joke is kind of annoying

gaunt axle
#

Ty @desert cradle ^_^

hasty maple
#

nu u

urban ibex
#

is this channel free?

#

would anyone be able to help with this

#

A city council is planning the city’s bus routes. It has decided which places will have a bus stop (schools, cinemas, hospital, etc.). Each bus route will start from the train station, visit a number of bus stops, and then return to the station, visiting the same bus stops in reverse order. Each bus stop has to be served by at least one bus route. The council wants to minimise the total amount of time that all buses are on the road when following their routes.

chilly geyser
#

@urban ibex Not really sure if this is the correct channel but I'm now even more convinced it's a multi-TSP while minimizing the highest route length.
An assumption is that route length is directly proportional to the time buses spend on the routes.

  1. you need to find a way to visit all nodes. Even in the problem, the return trip I assume is simply the distance between the nodes again, so you have 2x of the distance for the same 'bus route' - effectively meaning this detail does not matter (except at the end when you need to show back answers/calculations relating to this)
  2. Even if you don't really visit the home node after reaching the end of the trip, you can still transform the problem to an equivalent TSP whereby you visit a place that magically is able to teleport back to origin - essentially, this means it's a TSP.
  3. Multiple vehicles means it's a multiple-TSP. The part where it says all vehicles essentially means you need to minimise the highest route cost among all routes

#

With this you should research into min-max TSP algorithms

urban ibex
#

i think it wants me to use either BFS or DFS

chilly geyser
#

Pretty sure that's not how state-of-the-art does it, but you could....

urban ibex
#

how would it work with BFS or DFS and what would i need to adapt/change

#

the unit its linked to focusses on DFS, BFS, Dijkstras, priority queues and greedy algos

hardy crag
#

what class is this for?

chilly geyser
#

If given those I'd actually use a greedy algo

#

It's suboptimal but terminates in finite time IIRC
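
A rough sketch of what a greedy approach could look like, not a definitive solution: each route keeps extending to the nearest stop that is not yet served. Everything specific here is an assumption made for illustration: the function names, the `max_stops_per_route` limit, the coordinates, and the use of straight-line distance as a stand-in for travel time. The result is generally not optimal.

```python
import math


def greedy_routes(station, stops, max_stops_per_route):
    """Nearest-neighbour heuristic: extend each route to the closest unserved stop."""
    unserved = set(stops)
    routes = []
    while unserved:
        route, current = [], station
        while unserved and len(route) < max_stops_per_route:
            nearest = min(unserved, key=lambda s: math.dist(current, s))
            route.append(nearest)
            unserved.remove(nearest)
            current = nearest
        routes.append(route)
    return routes


def round_trip_length(station, route):
    """Out along the route, then back through the same stops in reverse order."""
    points = [station] + route
    one_way = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    return 2 * one_way


station = (0, 0)
stops = [(1, 0), (3, 0), (0, 2), (0, 5)]  # hypothetical stop coordinates

routes = greedy_routes(station, stops, max_stops_per_route=2)
total = sum(round_trip_length(station, r) for r in routes)
print(routes, total)
```

The loop always terminates because every pass removes one stop from `unserved`.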

urban ibex
#

this is the Q @hardy crag

chilly geyser
#

Yes and I told you what optimisation problem it was

urban ibex
#

yeah im just struggling to see how it would be implemented without combining it with another algo

hardy crag
#

I reckon traveling salesperson is a good start might want to research algorithms that solve it

urban ibex
#

i think youd have to use like DFS first and break the graph down then traverse them but im not sure

urban ibex
#

Would anyone be able to help with this?

#

In graph theory, the number of nodes in a graph is called the order of the graph. The term ‘order’ is unrelated to sorting. Specify the problem of computing the order, as a UGraph operation.

Creator / Inspector / Modifier (delete as appropriate): order
Inputs:
Preconditions:
Outputs:
Postconditions:
Justify the kind of operation.

lapis sequoia
#

do you know the difference between directed and undirected graphs

urban ibex
#

yes I do @lapis sequoia

lapis sequoia
#

so you know what a ugraph is?

urban ibex
#

not 100%

lapis sequoia
#

hmmm.. is this problem for python?

#

because if not.. its a complete tangent.. and I dont want to give you wrong directions..

urban ibex
#

i just need to complete the spec

lapis sequoia
#

there's a package in R called ugraph

#

well function actually..

urban ibex
#

i think in this circumstance UGraph is just a name

lapis sequoia
#

ok

#

undirected graph then..

urban ibex
#

Consider an ADT for undirected graphs, named UGraph, that includes these operations:

nodes, which returns a sequence of all nodes in the graph, in no particular order
has_edge, which takes two nodes and returns true only if there is an edge between those nodes
edges, which returns a sequence of node-node pairs (tuples), in no particular order. Each edge only appears once in the returned sequence, i.e. if the pair (node1, node2) is in the sequence, the pair (node2, node1) is not.
How each node is represented is irrelevant. Because the graph is undirected, has_edge(node1, node2) and has_edge(node2, node1) return the same. You can assume the graph is connected and has no edge between a node and itself.

lapis sequoia
#

they want you to describe how you're going to sort it?

urban ibex
#

In graph theory, the number of nodes in a graph is called the order of the graph. The term ‘order’ is unrelated to sorting. Specify the problem of computing the order, as a UGraph operation.

lapis sequoia
#

ok

urban ibex
#

the big para is just background info

#

the small one is what this sub question wants answered

lapis sequoia
#

ok

#

gimme a sec

urban ibex
#

okay mate!

lapis sequoia
#

from what I understand the sorting should be based on the degree and index value of the nodes..

urban ibex
#

it doesnt need sorting does it?

lapis sequoia
#

I think it does.. you would sort by descending order of degree, index

#

I mean.. expressing the order this way

urban ibex
#

but this is the question...

#

In graph theory, the number of nodes in a graph is called the order of the graph. The term ‘order’ is unrelated to sorting. Specify the problem of computing the order, as a UGraph operation.

#

specify the problem of computing the order

#

the order is just the number of nodes

#

"unrelated to sorting"

lapis sequoia
#

hmmm I'm not sure.. it's been a while since I did any graph theory.. but if you find how to order an undirected graph, you should get your answer..

wicked flare
#

@lapis sequoia It specifically says "order" doesn't have anything to do with sorting

#

There's no sorting involved in this exercise

urban ibex
#

the order of a graph is just the number of nodes

wicked flare
#

@urban ibex What specifically stumps you about this?

#

Like, do you have trouble figuring out if this operation is a creator, inspector or modifier?

urban ibex
#

right so i assume you read the overall sort of brief/background info @wicked flare

wicked flare
#

Yeah

urban ibex
#

okay so...

#

Are we in agreement that its an inspector

#

because it doesnt change or create

wicked flare
#

Yeah

#

That makes sense

urban ibex
#

it just inspects the order(how many nodes in ugraph)

wicked flare
#

Is "UGraph" a data type from some specific library or code example or something? Or is it just an abbreviation for undirected graph?

urban ibex
#

this is all i have to go off...

#

Consider an ADT for undirected graphs, named UGraph, that includes these operations:

nodes, which returns a sequence of all nodes in the graph, in no particular order
has_edge, which takes two nodes and returns true only if there is an edge between those nodes
edges, which returns a sequence of node-node pairs (tuples), in no particular order. Each edge only appears once in the returned sequence, i.e. if the pair (node1, node2) is in the sequence, the pair (node2, node1) is not.
How each node is represented is irrelevant. Because the graph is undirected, has_edge(node1, node2) and has_edge(node2, node1) return the same. You can assume the graph is connected and has no edge between a node and itself.

wicked flare
#

Oh, wait, I missed that

#

Ok

#

Then I'm with you

urban ibex
#

so yeh

#

undirected

wicked flare
#

Ok, so, inputs

#

What's the input for this operation?

urban ibex
#

just the graph?

wicked flare
#

Yeah

#

I'd say so as well

urban ibex
#

should i put UGraph?

wicked flare
#

I guess? Do you have any example specification for the existing operations?

urban ibex
#

nah

wicked flare
#

Or an example specification for some other data type operation?

urban ibex
#

yeah

wicked flare
#

Just so we have an idea of what the expected format is

#

Can you paste that?

urban ibex
#

Inspector: isEmpty
Inputs: theStack, a stack of objects (o1, o2, ..., on) Preconditions: true
Outputs: a boolean empty
Postconditions: empty is true if n=0, otherwise false

wicked flare
#

Ok, right, so you can just specify a UGraph as the input

urban ibex
#

okay

wicked flare
#

I don't know what they mean by "Preconditions: true"

urban ibex
#

that the inputs are true i guess

wicked flare
#

The input is a stack though. A stack isn't a boolean.

#

It can't be true or false.

#

I don't see any need for any preconditions for either isEmpty or order.

#

Maybe you can just leave that blank.

#

And then argue with your professor if they disagree.

#

I mean, a precondition is something that has to be true before the operation can be called.

urban ibex
#

i think it just means that the input is correct

wicked flare
#

There is no sense in which input can be correct or not.

#

The input is the input.

urban ibex
#

of course there is

wicked flare
#

But I mean, in general, an operation can require preconditions.

urban ibex
#

if it said the input was integers

wicked flare
#

Let me think of an example.

urban ibex
#

and i entered letters

#

then the input is false

#

so the pre condition is that the input is true(correct)

wicked flare
#

No, the input would be invalid. But you are already specifying in the input specification that the input is a number.

#

In a concrete programming language, this would be handled by specifying the data type.

#

Or possibly validating the data type in the case of a weakly typed language.

#

In the case of isEmpty, you are specifying that the input is a stack. It can't be anything else.

urban ibex
#

heres what it means

#

The precondition true specifies the operation is valid for any state of a stack: there are no preconditions. An ADT invariant specifies conditions that are True for any instance and remain True throughout its lifetime irrespective of the operations carried out on it

wicked flare
#

Ah, yeah, I was gonna say that.

#

If you think of the precondition as a condition, i. e. a boolean statement, then "true" would signify that all previous conditions are valid.

#

Fine.

#

So that should be the case for your order operation too.

#

Because you can always ask a graph for its order.

#

Just as you can always ask a stack if it's empty.

urban ibex
#

so precon of True then?

wicked flare
#

Right.

#

Ok, so, outputs.

#

What's your output?

urban ibex
#

the order (number of nodes in UGraph)

wicked flare
#

Right.

urban ibex
#

how should i phrase that though

wicked flare
#

Or well, if we look at the specification for isEmpty, what they seem to want is that you just specify the data type in outputs, and describe the contents of the output in the post-condition

urban ibex
#

so an integer

#

because you never have like half a node

wicked flare
#

Right

urban ibex
#

would you just call it order?

wicked flare
#

Yeah

#

Sounds like the most straightforward name

urban ibex
#

an integer, order (the number of nodes in the graph)

wicked flare
#

Well, I would put what's inside the brackets in the post-condition section

#

Since that's how they phrased it in the isEmpty specification

urban ibex
#

okay

wicked flare
#

Like, post-condition: order is the number of nodes in the graph

urban ibex
#

should order be > 0

wicked flare
#

Well, there is such a thing as an empty graph

#

So order could be 0

urban ibex
#

really?

wicked flare
#

Yeah

#

Most data structures in CS can have size 0

#

Empty sets, empty strings, etc.

urban ibex
#

but they arent graphs though unless im missing something

wicked flare
#

Yeah, they are

urban ibex
#

??

wicked flare
#

An empty set is still a set

#

An empty string is still a string

urban ibex
#

hmm

wicked flare
#

Like, you can have a string split operation, which takes a string to split and a separator string to split on. If you supply the empty string as the separator, it will split inbetween every character

#

Like "abc".split("") == ["a", "b", "c"] (that's how JavaScript behaves; Python's str.split raises ValueError on an empty separator)

#

Makes total sense

urban ibex
#

alright well what would you have as post con then?

wicked flare
#

There are similar things that might make sense for empty instances of other data types

#

post-condition: order is the number of nodes in the input graph, perhaps?
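
The specification being worked out here can be mirrored in code. This is only a sketch: the exercise says the representation is irrelevant, so storing nodes in a set and edges as frozensets is an assumption made for this example, as is the constructor signature.

```python
class UGraph:
    """Minimal stand-in for the UGraph ADT; the internal storage is an assumption."""

    def __init__(self, nodes=(), edges=()):
        self._nodes = set(nodes)
        self._edges = {frozenset(edge) for edge in edges}
        # Every endpoint of an edge is also a node of the graph.
        for edge in self._edges:
            self._nodes |= edge

    def nodes(self):
        """Return all nodes in the graph, in no particular order."""
        return list(self._nodes)

    def has_edge(self, node1, node2):
        """Return True only if there is an edge between the two nodes."""
        return frozenset((node1, node2)) in self._edges

    def order(self):
        """Inspector: order.
        Inputs: the graph itself.
        Preconditions: true.
        Outputs: an integer, order.
        Postconditions: order is the number of nodes in the graph.
        """
        return len(self.nodes())


graph = UGraph(edges=[("a", "b"), ("b", "c")])
print(graph.order())
print(UGraph().order())
```

It is an inspector because it neither creates a graph nor changes one, and the empty graph simply has order 0.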

urban ibex
#

i guess im not 100% on this one though

wicked flare
#

What part?

urban ibex
#

post con

wicked flare
#

I'm just following the format in the isEmpty example

urban ibex
#

guess you could also say singular integer

wicked flare
#

In that one, in output, they just say that it's a boolean

#

And in post-con, they say what the value of the boolean should be

#

True if n = 0, else false

urban ibex
#

should i say singular integer?

wicked flare
#

I don't know what you mean by "singular"

urban ibex
#

just 1 integer

wicked flare
#

If you say "an integer", that implies that it's just one

#

That's what the word "an" means

urban ibex
#

yeah just realised i have that

wicked flare
#

I don't think you need to repeat the data type if you already mention it in the outputs section

urban ibex
#

hmm

#

cant really half to though i guess

#

*harm

wicked flare
#

It's like when you're programming. You only need to specify the data type when you declare a variable. You don't need to repeat it every time you use it.

#

No, probably not, but why be redundant when you don't need to?

urban ibex
#

i guess, i think ill leave it though

wicked flare
#

Up to you.

urban ibex
#

would you be able to take a look at something im stuck on for me in a moment?

#

its related to this Q still

wicked flare
#

Perhaps. I'm at work, but if I have free time.

urban ibex
#

yeah thats alright no rush im just finishing something off and then ill @ you

lapis sequoia
#

hey. I have a dataset where each row looks like this [0,1,2,4] where the columns are [Outcome,Person1,Person2,Person3] I want to transform it into [0,1,1,0,1,0] where Outcome stays the same but Person1-3 changes to reflect 1 or 0. Forexample Person3 is 4 so index 4 is 1, Person1 is 1 so index 1 is 1, no person is 5 or 3, so index 3 and 5 are 0

#

I want to use it for logistic regression so for something like sklearn or similar would be cool

#

I've looked around and i am not sure how to proceed. The best I could think of would be to simply make an entirely new dataset and just load that but I thought there might be a better solution.

supple ferry
#

@lapis sequoia what you need is pd.get_dummies from Pandas. It will convert one column with categories to multiple boolean ones

#
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
#

from their documentation

#
In [11]: df = pd.DataFrame(["a", "b", "c", "d"], columns= ["People"])

In [12]: df
Out[12]:
  People
0      a
1      b
2      c
3      d

In [13]: pd.get_dummies(df)
Out[13]:
   People_a  People_b  People_c  People_d
0         1         0         0         0
1         0         1         0         0
2         0         0         1         0
3         0         0         0         1

#

this is example dataframe from your use case

lapis sequoia
#

@supple ferry thanks a lot. that's perfect! Is there a way to do it for multiple columns? Like if coulm2 is People2 and had index 0 = b, would pd.get_Dummies write the first row as 1,1,0,0 in your example?

#

column2*

supple ferry
#

@lapis sequoia You can pass the column names you want dummies for via the columns argument of that function. Because I had just one column, I did not use that. You can also set a custom prefix for the created columns: instead of People_b you get Person_b if you pass prefix="Person" (the "_" separator is added automatically)
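
One thing to watch out for: get_dummies applied to several columns creates a separate set of indicator columns per source column. If the goal is a single indicator per person id shared across Person1-3, as in the original question, one possible approach is to stack the columns first. This is a sketch only: the second data row and the id range 1 to 5 are made up for illustration.

```python
import pandas as pd

# First row is the example from the question; the second row is invented.
df = pd.DataFrame(
    [[0, 1, 2, 4], [1, 2, 3, 5]],
    columns=["Outcome", "Person1", "Person2", "Person3"],
)
person_cols = ["Person1", "Person2", "Person3"]

# Stack the person columns into one long Series so every id is encoded the same way.
stacked = df[person_cols].stack()

# One indicator column per id, then collapse back to one row per original row.
indicators = pd.get_dummies(stacked).groupby(level=0).max().astype(int)

# Assumed id range 1..5; ids that never occur still get an all-zero column.
indicators = indicators.reindex(columns=range(1, 6), fill_value=0)

result = pd.concat([df["Outcome"], indicators], axis=1)
print(result)
```

The resulting frame can be split into `result["Outcome"]` as the target and the remaining columns as features for a logistic regression.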

lapis sequoia
#

ok thats really perfect. thank you so much

tiny comet
#

hi, I'm new here in the community

I'm trying to convert some XML 3D data from one DCC package to another using numpy.
something like this.

def seup_lod_arrays(self, point_num_of_lods=(), triangles_of_lods=(), quads_of_lods=()):
    # One uninitialized array per LOD for points, triangles, and quads.
    point_data = []
    for i, value in enumerate(point_num_of_lods):
        point_data.append(np.empty([value, 14]))
        point_data.append(np.empty([triangles_of_lods[i], 1]))
        point_data.append(np.empty([quads_of_lods[i], 1]))

    self.data = point_data

I have just one question: does this make sense? (-:
Or does using a list of numpy arrays destroy the speed of the arrays?
Is the list just pointing to the arrays?

help would be awesome

supple ferry
#

Hey. First things first, it helps if you format your code using backticks. With this code it is not clear what you want to do; maybe more code and some example output you want would be helpful. np.empty simply creates an array of the given shape without initializing its contents, so if you want that array to hold specific values you have to fill it yourself
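
On the question of whether the list just points to the arrays: as far as Python semantics go, a list stores references, so putting arrays into a list does not copy them. A minimal sketch, with shapes and names chosen only for illustration:

```python
import numpy as np

points = np.empty((4, 14))    # uninitialized memory: contents are arbitrary
triangles = np.zeros((2, 1))  # use zeros (or fill) when defined values are needed

data = [points, triangles]

# The list holds references to the same array objects, no data is copied.
print(data[0] is points)

# Vectorized work on a list element operates on the original array,
# so the change is visible through both names.
data[0][:] = 1.0
print(points.sum())
```

What does cost speed is looping over individual elements in Python; operations on each whole array stay vectorized.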

lapis sequoia
#

I need to superimpose transparency onto a matplotlib heatmap

#

Could anybody help me there?

paper niche
#

post your code, if someone can help, they will respond

finite solar
#

Any idea why `val = 1 / np.sqrt(2 * np.pi) * integrate.quad(lambda t: np.exp(-t ** 2 / 2), -np.inf, z)` raises `TypeError: can't multiply sequence by non-int of type 'numpy.float64'`?

#

where integrate is import scipy.integrate as integrate

#

and z = (mean - value) / stddev

#

er

#

yeah that's right

lapis sequoia
#

probably because you're multiplying a sequence by a numpy float?...

finite solar
#

oh yeah integrate.quad returns 2 items

#

aaaaaa

#

confusing as heck error because it points to the line with z

#

cool I guess it works now
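
For reference, a sketch of the fix discussed above: integrate.quad returns a tuple of the integral estimate and an absolute error estimate, so the tuple is unpacked before multiplying. The z value is a made-up example, and the closed form via math.erf is included only as a cross-check.

```python
import math

import numpy as np
from scipy import integrate

z = 0.5  # hypothetical z-score

# quad returns (integral estimate, absolute error estimate).
integral, abs_error = integrate.quad(lambda t: np.exp(-t**2 / 2), -np.inf, z)

val = integral / np.sqrt(2 * np.pi)

# Cross-check against the closed form of the standard normal CDF.
expected = 0.5 * (1 + math.erf(z / math.sqrt(2)))
print(val, expected)
```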

lapis sequoia
#

python's instruction to line number tables can't go backwards, even if that means errors make no sense at all

desert cradle
#

i wonder if it'd make more sense in debug mode

#

that turns off some optimizations

#

wait, no, debug mode is default

lapis sequoia
#

here we go, boys

lapis sequoia
#

hi

#

I am trying to create a histogram but it doesn't look right

#

do you think it is accurate?

#

how can i make it look more like a histogram, for example, branching out to the left and right etc

waxen vine
#

does that only have one plot point?

lapis sequoia
#

hi @supple ferry I just want to thank you for helping me out this tuesday. It is working flawlessly 👌

lapis sequoia
#

can someone please please please pin this video ^

#

I love it

supple ferry
#

@lapis sequoia you have just one point. Make sure that all points get included.

lean ledge
#

First of all, not data science, also already been overdone to death to various extents

#

Is pretty good

lapis sequoia
#

@lean ledge hmm

#

guess im out of the loop

#

which do you think is the best of these kinds?

waxen vine
#

I wish the courses in that link did not have set dates

blissful cedar
#

I took Berkeley's AI course, pretty nice too

#

my other massege didn't go

#

message*

#

I said: "Thanx a lot guys, those gits have a lot of valuable information"

waxen vine
#

You guys know if python can rip all images from a website, store them, then add metadata based off image analysis of them, and then sort them into categories?

serene veldt
#

does anyone know a kfold method for python besides sklearn?

#

this one splits into test and train

#

i was looking for one that just split into train

paper niche
#

@serene veldt what would that look like? let's say 5-fold, so you want something that splits your dataset into 4+1, but both used for training?

#

anyway there's nothing stopping you from using the outputs from kfold as both training sets.. can you explain a little better about your use case?
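
If the goal is just k disjoint chunks with no train/test pairing, one possible way without sklearn is to shuffle the row indices and split them with numpy. A minimal sketch; the sample count, k, and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n_samples, k = 10, 5

# Shuffle the row indices, then cut them into k near-equal parts.
indices = rng.permutation(n_samples)
folds = np.array_split(indices, k)

for i, fold in enumerate(folds):
    print(i, fold)
```

Each entry of `folds` is an index array that can be used to select rows, for example with `df.iloc[fold]`.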

still abyss
#

Hey guys, I have a pandas related question.

paper niche
#

sure, just ask

still abyss
#

Okay, I'm trying to select rows from a DF but I keep getting a max recursion depth error. The code is simple and I know I've done this thing last semester.

df[df["W%"] > conf]

#

"RecursionError: maximum recursion depth exceeded while calling a Python object"

paper niche
#

hmm can you show df.head(), and also what's conf?

still abyss
#

         W%    Var
0  0.608696  0.706
1  0.426829  0.714
2  0.317073  0.754
3  0.207317  0.781
4  0.256098  0.741

#

conf is just a float.

#

conf = mean + (1.96 * sd)
conf
0.8041728975013317

paper niche
#
df = pd.DataFrame(np.random.random((10, 2)), columns=['W%','b'])
df[df['W%']>0.5]

does this run for you?

#

it runs perfectly fine on mine.

still abyss
#

Yes.

#

Which is why I don't get why I'm having issues.

paper niche
#

try changing conf to 0.8 (the number)?

#

where are you doing this by the way? notebook or py file?

still abyss
#

Jupyter notebook.

paper niche
#

hmmm

still abyss
#

df[df["W%"] > 0.8] is still max recursion.

paper niche
#

what does df["W%"]>0.8 return?

still abyss
#

Also a max recursion.

paper niche
#

just df["W%"]?

#

is there max recursion, I mean. I don't really need to see the output

still abyss
#

TypeError: cannot concatenate object of type "<class 'numpy.ndarray'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid

#

And the error is suppppppppppppper long.

paper niche
#

can you show the full stacktrace?

#

if it's long, use hastebin

#

what comes before this erroneous line? how are you setting up your df?

still abyss
#

Hmmm... let me try rerunning jupyter. I just got an error on df.tail()

#

df = pd.DataFrame(data={"W%":data["W%"], "Var":data["FT%"]})

paper niche
#

and data is another dataframe? why create a new dataframe when you can just index the cols out?

#

though I'm not sure if that's the issue

#

did restarting the kernel help?

still abyss
#

Because I'm ultimately trying to index through and compare each variable to see if Win % is > the confidence interval.

#

Okay, restarting seems to have fixed it.

paper niche
#

okay 👍

still abyss
#

df[df["W%"] > conf]
W% Var
19 0.817073 0.696
40 0.833333 0.797
115 0.804878 0.771
137 0.817073 0.794
366 0.841463 0.747
367 0.878049 0.746
371 0.817073 0.744
394 0.804878 0.757
489 0.804878 0.754
826 0.817073 0.788
827 0.890244 0.763
828 0.817073 0.768
980 0.817073 0.805
1067 0.817073 0.803

#

Thank you for the help.

paper niche
#

np

vague jetty
#

Has anyone had any issues using np.split or np.asarray on a list of dataframes? I'm getting '''ValueError: cannot copy sequence with size 11122 to array axis with dimension 88''', so I assume numpy is trying to operate on the dataframes inside the list. I just want to split the list and not operate on the dataframes.

#

The shape of the dataframes inside the list are of shape (11122,88), hence why I think numpy is trying to operate on the dataframes inside the list.

#

Ended up posting to SO
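
A likely cause, stated as an assumption rather than a diagnosis: `np.asarray` and `np.split` try to turn nested sequences into one n-dimensional array, so equally shaped dataframes get unpacked instead of being treated as opaque items. A minimal sketch that splits the list with plain slicing and never hands the frames to NumPy (the `split_list` helper and the toy frames are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: six frames with identical shapes.
frames = [pd.DataFrame(np.full((4, 3), i)) for i in range(6)]

def split_list(items, n_chunks):
    """Split a list into n_chunks nearly equal parts without touching the items."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for k in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        stop = start + size + (1 if k < extra else 0)
        chunks.append(items[start:stop])
        start = stop
    return chunks

chunks = split_list(frames, 3)
```

Each chunk is an ordinary list holding the original dataframe objects, so nothing is copied or reshaped.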

lusty abyss
#

Hi guys, I'm planning on creating a crawler for a company and it involves AI (NLP) and Big Data. I was wondering, what are the best libraries for both crawling and NLP in python that can help me create something that can scale?

void anvil
#

spaCy for nlp is a solid starting point

dense rose
#

Not really a data science question but I'm having difficulties with Jupyter.

chilly geyser
#

What kind of difficulties Arc

hardy crag
#

@lusty abyss maybe scrapy for the crawling part

#

@dense rose please do tell

dense rose
#

Oh it's a stupid issue, I think it's just a problem with the paths Jupyter is using.

#

But so I do pip install Jupyter and it installs fine

#

And python -m jupyter works

#

But python -m jupyter notebook gives Error executing Jupyter command 'notebook': [Errno 'jupyter-notebook' not found] 2

#

I should mention Win10 I suppose.

lusty abyss
#

@hardy crag I'll be reading about Scrapy more

#

@void anvil Is it multi language? The only one I found that supports another language other than english is NLTK

void anvil
#

what do you mean?

#

there's nothing really better out of the box

lusty abyss
#

@void anvil for example, for sentiment analysis

hardy crag
#

@dense rose windows or linux/mac? system wide python or conda/virtual env?

dense rose
#

Win10 just system wide.

hardy crag
#

have you tried reinstalling jupyter?

#

pip install --upgrade --force-reinstall --no-cache-dir jupyter

dense rose
#

Yes.

#

And that link is an open question with no answer.

hardy crag
#

well no he said its installed in the wrong location, so you would have to use that exe instead of the command line command

#

it might also be worth considering anaconda

dense rose
#

Oh my b I just read the first part and saw a similar question and skipped to the (non-existent) answers.

#

Yeah that's weird.

lyric sedge
vague jetty
#

I'm working on an LSTM operating on time-series data. The lengths of my input sequences vary from 968 to 6244 measurements. What's the best way to normalize their lengths? I could just take the median number of measurements, throw away sequences with less, and chop ones with more. Or I could work on interpolating the data, but I wouldn't want to interpolate 968 measurements to 6244. I know this is dependent on what my data actually is (it's all measurement data from several sensors), but are there any rules-of-thumb or guidelines to normalizing sequence length?

#

I guess another alternative would be to feed in sliding window subsequences (say of length 200) of each sequence rather than the whole sequence itself?

vague jetty
#

Nvm, I've looked more into it and looks like the norm is just use 250-500 steps as inputs, so I'll just make windows using that
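
A rough sketch of that windowing idea, assuming each sequence is a `(timesteps, features)` NumPy array; the `make_windows` helper, the stride of 125 and the random data are illustrative choices, not something from the original setup:

```python
import numpy as np

def make_windows(sequence, window=250, stride=125):
    """Cut one (timesteps, features) array into overlapping fixed-length windows."""
    n_steps = sequence.shape[0]
    # Only start positions that leave room for a full window are used,
    # so a short tail at the end of the sequence is dropped.
    starts = range(0, n_steps - window + 1, stride)
    return np.stack([sequence[s:s + window] for s in starts])

# Hypothetical sequences of different lengths, each with 87 features.
sequences = [np.random.rand(968, 87), np.random.rand(6244, 87)]

# Every sequence contributes as many windows as its length allows.
X = np.concatenate([make_windows(seq) for seq in sequences])
```

Longer sequences simply yield more windows, so nothing has to be interpolated or thrown away wholesale.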

vague jetty
#

Does anyone know if dataframes have any built in methods for deque-like behavior? Specifically, I want to have the dataframe pop an entry when a new entry is added once the dataframe hits a certain length. Writing code to do this wouldn't be hard, but I was curious if any of that is built-in already.

paper niche
vague jetty
#

Thanks! That looks about like what I was going to write manually
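
For anyone reading later, a minimal sketch of the manual version, assuming a fixed maximum length and one row appended at a time (`push_row` is a hypothetical helper, not a pandas method):

```python
import pandas as pd

def push_row(df, row, maxlen):
    """Append one row and keep only the newest `maxlen` rows."""
    appended = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    # Trimming from the front mimics collections.deque(maxlen=...).
    return appended.iloc[-maxlen:].reset_index(drop=True)

buffer = pd.DataFrame([{"t": 0, "value": 0}])
for t in range(1, 5):
    buffer = push_row(buffer, {"t": t, "value": t * 10}, maxlen=3)
```

Rebuilding the frame on every append is slow for large buffers; keeping a `collections.deque` of dicts and building the dataframe only when needed is a common alternative.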

mossy dragon
#

Hello!

#

anybody here actually works as a data scientist?

gilded dagger
#

Small question: how good is Dash?

#

I'm looking for a way to create interactive dashboards for my coworkers (not necessarily using Python) and came across it. Is it popular?

#

I usually use seaborn for my plots but they're kinda hard to automatically export to create browser-compatible dashboards, right?

mossy dragon
#

I think Tableau is popular

#

not python though

gilded dagger
#

Tableau is 80$/month though lol

#

I used it at my previous job but I'm working for a smaller company now and I'd like something that's open source 😄

#

But I don't want to spend hours understanding the Dash framework if it's not really popular/maintained

mossy dragon
#

i see

#

if you knew R you could just use Shiny

#

I hear its great

gilded dagger
#

Mmmmmh that's true. Never used R but it can't be that hard.

#

I'm doing all my analysis with Python though so ideally I'd like to work with it

#

So I can re-use my existing library

#

And my sql-alchemy integration

lean ledge
#

Plotly is decent for making dashboards in general

#

plotly.dashboard_objs

gilded dagger
#

Is it really popular? I don't see many people talking about it (compared to seaborn for example)

lapis sequoia
#

R isn't used in production at most major companies..

#

not enough support.. plus R studio license cost..

supple ferry
#

You can use plotly but it won't be versatile as Shiny of R

lyric canopy
#

R's still very popular in academics (more so than Python, although Python is gaining traction). That also means that newer statistical methods may be implemented in R before they are implemented in Python. My guess is that this applies less to machine learning, but that's not my field. I do know my research group still mainly uses R and pushes out a lot of R-packages with the new developments.

#

Stuff like newly published methods for Prediction Rule Ensembles and multiple imputation.

lean ledge
#

Definitely correct, R is the preferred language in a lot of science, not just pure statistics

#

While ML stuff is generally python

hardy crag
#

@gilded dagger I think Dash is exactly what you are looking for.

light cloud
#

I asked this in #career-advice but I thought I would ask here as well.

I am hiring a couple data analysts/scientists/ML dudes for the summer. Got some promising candidates.

Outside of some technical stuff, are there any good questions to ask people through a case study. I am interested in how they deal with problems, their methodology and those things more than their technical acumen in this particular instance.

void anvil
#

Give them a data analysis project that'll take 2 hours to do something easy on (clean data set, maybe a couple holes and some irrelevant data) and ask them to give a 10 minute presentation on their findings

lean ledge
#

Doesn't really help with data science and ML people

#

DS and ML is more about being able to understand high dimensional data and the maths and behaviour behind it all. I'm not sure how you can measure that

inland garnet
#

Hello all

#

I'm very new to using Python for data science, and was wondering if anyone would be able to shine some light on an issue I'm having

#

I have a month's worth of call data in a pandas dataframe

#

it's a simple 1D array with date as the index and calls as data

#

Simple exponential smoothing seems to work:

#

but if I try the full ExponentialSmoothing function I just get this error:

#

When I'm using the exact same dataframe as before

#

cleared up some clutter

inland garnet
#

seems to only happen with the mul method

supple ferry
#

@inland garnet , i am not very expert in statsmodels, but this error indicates that boolean indexing is not properly "translated" into pandas. Seems like statsmodels bug to me (maybe)
If you look at traceback line 391, you will see something like this:
(condition) and (condition)
This is pythonic way to do it, which is not acceptable to pandas
In pandas, boolean indexing is to be done via | , & operators
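
A small sketch of that difference on a toy Series. It only illustrates the general pandas behaviour and is not a claim about what statsmodels does at that traceback line:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

# Element-wise combination; the parentheses matter because & binds tighter than >.
mask = (s > 1) & (s < 5)
selected = s[mask]

# Python's `and` asks each side for a single truth value, which a Series refuses to give.
try:
    (s > 1) and (s < 5)
    and_error = None
except ValueError as exc:
    and_error = str(exc)
```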

lapis sequoia
#

you're better off implementing your own functions.. but where are you importing statsmodels from

inland garnet
#

Right

#

I just installed statsmodels via pip

#

and then i'm importing from statsmodels.tsa.api

#

Thanks @supple ferry

#

I think I'm in over my head

#

Every guide to exponential smoothing in python seems to use pandas, so it seems strange that statsmodels isn't working. my data is extremely simple and 1D

lime lava
#

in pandas, i want to check for duplicates for all columns except one

#

so i made a list of columns with dataf.columns, removed "sku" from every list, then did dataframe.duplicated(subset=listwithoutsku)

#

but the output is a list of every row where there is at least one duplicated column from that list

#

how can I make it so it only counts full duplicates from the subset?
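
As far as I know, `DataFrame.duplicated(subset=...)` already compares the whole combination of the listed columns, not each column on its own. A minimal sketch on made-up data (column names other than `sku` are hypothetical) that can be used to check the behaviour:

```python
import pandas as pd

df = pd.DataFrame({
    "sku":   ["A1", "A2", "A3", "A4"],
    "name":  ["mug", "mug", "mug", "cup"],
    "color": ["red", "red", "blue", "red"],
})

# Every column except "sku".
cols = df.columns.difference(["sku"]).tolist()

# keep=False marks all members of a duplicate group, not just the later repeats.
full_dupes = df[df.duplicated(subset=cols, keep=False)]
```

Rows A3 and A4 each share one value with other rows but are not flagged, because only the full `(name, color)` combination counts. If real data behaves differently, it may be worth checking that the subset list actually holds the intended columns.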

lapis sequoia
#

could you rephrase your question..

#

unable to follow

dense rose
#

I'm in a introductory ML class and the final project is one of the projects on Kaggle (mostly free to choose). Any suggestions on some that are not too hard but still interesting?

lean ledge
#

Pokemon one can be amusing

gilded dagger
#

Anybody here has any experience with Dash apps?

#

I'm having a lot of trouble just making a very basic CSS

river plume
#

hello guys, i have installed the cpu version of keras

#

how do i change it to the gpu one?

#

didnt find it on stack of

lean ledge
#

you should be installing tensorflow-gpu rather than just tensorflow

river plume
#

so shall i uninstall tf and then install tf-gpu?

lean ledge
#

yes, most likely

#

personally i use the keras built into tensorflow anyway, so

river plume
#

yeah even im using keras

lean ledge
#

there's a separate keras library and then there's tensorflow.keras

#

i use the latter

river plume
#

but my cpu usage was at 95% and gpu usage was 5%

#

looks like i made a mistake installing keras-cpu

lean ledge
#

keras-cpu is not a thing, keras is only an API. it doesnt handle hardware on its own

#

it leaves that to the backend

shrewd phoenix
#

I have a script I made for turning some geographic data points into a heatmap-style video. It uses scipy.interpolate.griddata() to do the interpolating, which works pretty well (see image).

#

However, I'd like for it to do some extrapolating outside the boundary created by available data points. I've tried using scipy.interpolate.Rbf for this with extremely poor results. More recently, I tried using scipy.interpolate.interp2d, which has also given extremely poor results (see image).

#

Can anyone explain why it's like this?
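
Not an explanation of the `Rbf` / `interp2d` results, but one workaround that is sometimes used: keep the `griddata` linear interpolation inside the data boundary and fill the cells outside it with `method="nearest"`. A sketch on synthetic points (the data and grid size are made up):

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# Scattered sample points that only cover the middle of the unit square.
points = rng.uniform(0.2, 0.8, size=(50, 2))
values = points[:, 0] + points[:, 1]

grid_x, grid_y = np.mgrid[0:1:25j, 0:1:25j]

# Linear interpolation returns NaN outside the convex hull of the points.
linear = griddata(points, values, (grid_x, grid_y), method="linear")

# Nearest-neighbour lookup is defined everywhere, so it can fill those gaps.
nearest = griddata(points, values, (grid_x, grid_y), method="nearest")
filled = np.where(np.isnan(linear), nearest, linear)
```

The filled region is piecewise constant, so it is a crude extrapolation, but it avoids the wild oscillations a spline can produce far from the data.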

misty sonnet
shrewd phoenix
#

Give me a second to redact a couple things and I'll put it up. I'll put up a slightly different script than the one that made those screenshots, but it showed the same problems

#

^ Line 113 works, but if I comment it and uncomment Line 114 that creates bad images

shrewd phoenix
#

Well, if anyone has any insight please ping me. I have this server muted, so I won't notice otherwise.

analog helm
#

would asking questions about OpenCL and/or coherent noise be within the realm of this channel, or not really"?

lean ledge
#

@analog helm it's python specific so probably not but asking in the offtopic channels would be fine

rich chasm
lapis sequoia
#

it's been there a while.. really elaborate..

#

too bad people don't use it much

#

wait a minute.. are you from IBM

lapis sequoia
#

@rich chasm i love u

analog helm
#

Does anyone have any recommendations for someone wanting to produce and manipulate (coherent) noise in Python? I've only found two libraries which do such, both are ports of the C++ libnoise library. One of the ports has some random Visual Studios dependency, and I can't find a prebuilt wheel, official or not. The other has a dependency on (Py)OpenCL which is a holy nightmare of its own. I checked the original libnoise library, and there is nothing in it to do with OpenCL, so why the author of the port decided to make the baffling, inane, and stupid decision to weld that crap on without any choice on the user's part is completely beyond me. But now I'm just ranting!

Is there a library or something I'm missing which either ports libnoise or has equivalent functionality, and "just works" without necessitating any ridiculous proprietary runtimes, or exotic drivers?

#

It's gotten to point where I'm seriously considering just porting the original libnoise myself, but obviously I'd like to avoid that if i can

#

Oh right, the ports i found are noisepy (the one with the VS dependency) and PyNoise (the one with the OpenCL dependency)

lean ledge
#

@analog helm What kind of noise are you looking for

analog helm
#

The noise itself isn't really the issue. I can implement perlin, simplex, voronoi, etc in 10 minutes for each. The main convenience of using libnoise is its full set of features and automation in regards to multi dimensional containers, multiple 'layers' of modules which modify the noise values, and the built-in visualization system

lean ledge
#

You can probably implement a lot of that relying on standard python data science libraries without much effort.

serene veldt
#

Need some help with scikit learn

#

using the naive bayes classifiers

serene veldt
#

sklearn.naive_bayes.BernoulliNB

#

it has a binarize parameter

#

its a threshold

#

so i would assume it goes from [0,1]

#

however, using values above 1 produces different results, some better some worse

#

so i cant really understand what that threshold means

lyric canopy
#

@serene veldt It's used to convert a floating point number to a binary (boolean) value

#

So, you need a boundary to determine in which category something belongs

#

The default is 0.0 (so all negative numbers go into the first category, all positive numbers go into the second; I don't know how it treats the boundary itself)

#

Since those floating point numbers can have any value (well... you know what I mean), the boundary can be "anywhere"

#

So, say that I have numbers ranging from 0 to 100 and I want the boundary to be 90, I can use 90 for the binarize parameter

serene veldt
#

so all below 90 turn 0 and above 90 become 1?

#

or the oposite

lyric canopy
#

I don't know, but does it matter? I'm not familiar with this model, but it could just be two groups without any significant meaning or order

serene veldt
#

i would just like to understand how the threshold works, since they are not specific at all

#

to run some tests

#

since, imo, they dont properly explain how to work with it and how it binarizes

#

but i appreciate the help

lyric canopy
#

That's the function it calls to binarize it

#

@serene veldt

#

So it's 1 above the threshold and 0 below or equal
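
A quick way to check this, assuming the function in question is `sklearn.preprocessing.binarize` (the sample values are made up):

```python
import numpy as np
from sklearn.preprocessing import binarize

# One row of feature values around the chosen boundary of 90.
X = np.array([[10.0, 89.9, 90.0, 90.1, 100.0]])

# Values strictly greater than the threshold become 1, everything else 0.
binary = binarize(X, threshold=90)
```

So the threshold lives on the same scale as the features, which is why values above 1 are meaningful when the raw features are not confined to [0, 1].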

serene veldt
#

Much appreciated!

#

That really helps

balmy geyser
#

Are there any folks here that know how to display multiple plots?

#

in matplotlib?

lean ledge
#

plt.subplot()
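
A minimal sketch using the closely related `plt.subplots()`, which creates the figure and all axes in one call (the sine/cosine data is just filler):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One row, two columns: `axes` is an array with one Axes object per panel.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))
axes[0].plot(x, np.sin(x))
axes[0].set_title("sin")
axes[1].plot(x, np.cos(x))
axes[1].set_title("cos")
fig.tight_layout()
```

In an interactive session, drop the `matplotlib.use("Agg")` line and call `plt.show()` at the end.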

balmy geyser
#

thanks @lean ledge

pine yoke
#

Could someone point out how I can get my regex working? https://regex101.com/r/RlR4HO/1/
I'm trying to match just the fish name, then the amount, then the price. I've gotten pretty far, but for some reason I can't get specieweightprice knocked off my group 1 match, even if I add an non-capturing group that captures that exact phrase.

#

But as soon as I add '?' the rest of group one steals it from the non capturing group.

lapis sequoia
#

what pattern do you want to catch

balmy geyser
#

Bit of a long shot, would anyone be up for looking at my implementation of a limiter algorithm and help me figure out why it doesn't work as described in the paper its based upon?

lapis sequoia
#

sure.. post your code.. I can get back by tomorrow..

frigid portal
#

hi i need to create a data table from a JSON url and then sort it, all in flask. how do i sort the table

#

like if u click on number sort by min/max

balmy geyser
#

hey @lapis sequoia

#

let me know if you're still interested.

slate orchid
#

hey y'all

#

i need someone to push me in the right direction for a thing

#

i understand the basics of machine learning, but i've never actually implemented it into anything

#

here's the gist of what i'd like to do: i want to be able to train an algorithm to judge a section of text on whether it most displays one of four different traits

#

now, i've been told the way to do this is to make four models, one for each of the traits, and see which confidence appears highest

#

so, if this is right, what libraries would i want to be using, what kind of stuff should i be googling

#

if this is wrong, then how should i be going about this?

#

and then the previous question i guess

#

thanks y'all

torn musk
#

@slate orchid i made a library which judges text based on different traits

slate orchid
#

that sounds... extremely relevant and useful?

torn musk
#

for sake of visualization this website shows images depicting what it does

#

the model i used was.. i used a Word2Vec implementation and just plugged in the traits i wanted

#

for each word i found the 'correlation value' and averaged it over the entire section of text
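
The averaging step can be sketched like this. The per-word scores below are a hand-written, hypothetical dict; in the setup described here they would come from the word2vec lookup instead:

```python
# Hypothetical per-word trait scores standing in for a word2vec similarity lookup.
friendliness_scores = {"kind": 0.8, "sweet": 0.6, "hostile": -0.9, "mean": -0.7}

def score_text(text, word_scores):
    """Average the scores of the words that have one; unknown words are skipped."""
    words = text.lower().split()
    known = [word_scores[w] for w in words if w in word_scores]
    return sum(known) / len(known) if known else 0.0

score = score_text("Mitsuki is really kind and sweet", friendliness_scores)
```

Whether unknown words should be skipped or counted as zero is a design choice; skipping keeps long neutral sentences from being dragged toward zero.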

slate orchid
#

huhh

torn musk
#

for example

#
unfriendliness = model.most_similar(
    positive=['hostile', 'hurtful', 'unfriendly', 'mean'],
    negative=['friendly', 'affectionate', 'loving', 'kind'],
    topn=100000
)
#

are you familiar with Word2Vec?

slate orchid
#

nope, looking it up now

#

is that basically step one for all 'text reading' machine learning stuff?

torn musk
#

word2vec yes

#

its so simple and does so much

slate orchid
#

sure, that sounds like a thing i want

torn musk
slate orchid
#

so how and what would i want to plug that into?

torn musk
#

from gensim.models import KeyedVectors

slate orchid
#

for distinguishing the four traits

#

(i'll train it on text in which i've established which the dominant trait is)

torn musk
#

there are 2 ways to do it : pretrained and posttrained

#

depending on how fast your application needs to be

#

mine is pretrained because i wanted it fast

#

by posttrained i mean 'train on the fly'

slate orchid
#

ah, sure

torn musk
#

or more like 'compute on the fly' which is slow

slate orchid
#

can you do both?

torn musk
#

anything is possible

slate orchid
#

woah

torn musk
#
print("Loading GoogleNews-vectors into word2vec (~30 seconds)")
model = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin.gz',
    binary=True,
    limit=500000
)
#

the problem with using gensim directly is that everytime you run the script it takes 30 seconds to load the model

#

maybe theres a way to do it faster but i dont know that

slate orchid
#

haha, clearly i have a lot to google here

torn musk
#

gensim is fast but loading the googlenews into it takes time

slate orchid
#

gensim is... a library?

slate orchid
#

okay, in dumb terms for me: it inputs your word2vec stuff, and outputs...

#

(once trained)

torn musk
#

in dumb terms Google already trained it on Google News

#

but the file is like 2GB so its slow to load

slate orchid
#

oh right

#

wait, what does it being trained on google news mean?

#

like, trained for what

torn musk
#

it trained on which words occurred frequently together

#

for example if 'neko' and 'cat' appeared frequently together, their vectors would be closer

slate orchid
#

ah, okay

#

not sure that's what i want? honestly i can't tell

torn musk
#

you can play with it

#

look for your traits

iron latch
#

confusing

slate orchid
#

woah

#

words

torn musk
#

just type your traits in there

#

search box on the right side of page

slate orchid
#

ah, okay, here's the bit i'm explaining crap

#

i want to have four traits that are basically arbitrary

#

let's say sweet, salty, bitter, and whatever the last one is

#

sour

#

that

#

if i gave it like 100 examples saying 'this text is sweet, this text is salty', could i then give it some text and it tell me which of those four it is?

torn musk
#

yes

slate orchid
#

so... is this this thing then?

torn musk
#

yes

#

it works surprisingly well for this purpose

slate orchid
#

somehow your certainty makes me confused but awesome

torn musk
#

because thats what i've been working on

#

so i tested it and stuff

#

i did have to choose the parameters though, so i chose limit=500000 and topn=10000 because they gave better values

#

limit meaning we use 500,000 words from googlenews and 10,000 for each trait but that was just for the specific use case i was using

#

the other 'optimization' was that i used a bunch of synonyms so that it would reflect the meaning better for example

#

with antonyms too

dominance = model.most_similar(
    positive=['dominant', 'assertive', 'capable', 'important'],
    negative=['submissive', 'apologetic', 'meek', 'passive'],
    topn=100000
)
#

as opposed to

dominance = model.most_similar(
    positive=['dominant'],
    topn=100000
)
slate orchid
#

okay, so what you're doing is trying to find phrases and such related to words that exist

#

i think?

torn musk
#

i rank phrases

slate orchid
#

what... does that mean

#

i'm really sorry you're trying to be helpful and i'm pretty useless

torn musk
#

"Mitsuki is really kind and sweet" -> friendliness: 3/10, dominance: -2/10

#

ranks the phrase based on traits

slate orchid
#

ah, gotcha

#

so you're inputting 'friendliness' into the trained thing, right?

torn musk
#

yes

#
  friendliness = model.most_similar(
      positive=['friendly', 'affectionate', 'loving', 'kind'],
      negative=['hostile', 'hurtful', 'unfriendly', 'mean'],
      topn=100000
  )
slate orchid
#

okay, i'm trying to do something kind of different i think

torn musk
#

"oranges are really heavy on the acid and vitamin c" -> sweet:-2, sour:+8, bitter:-4

slate orchid
#

ah, sorry, my example was really confusing

#

those were meant to just be arbitrary phrases

torn musk
#

let me guess you want a one hot encoding :
"oranges are really heavy on the acid and vitamin c" -> [sweet,sour,bitter]=[0,1,0]

#

f("oranges are really heavy on the acid and vitamin c" ,[sweet,sour,bitter]) -> [0,1,0]

slate orchid
#

in plain terms: i want a computer to tell me whether some text is more 'bleep' or 'bloop', having given the computer a bunch of phrases and told them whether they were 'bleep' or 'bloop'

torn musk
#

choose_trait("oranges are really heavy on the acid and vitamin c" ,[sweet,sour,bitter]) -> sour

slate orchid
#

the computer has to figure out in the training bit what bleep or bloop actually mean

torn musk
#

oh

slate orchid
#

so, what kinds of words they're associated with

torn musk
#

well if you have like a lot of phrases

#

you can use linear regression or something

slate orchid
#

yeah, i think that's what i want

torn musk
#

maybe even a neural network

slate orchid
#

YEAH THAT

#

that thing

torn musk
#

supervised learning

slate orchid
#

that's the thing i want i think

torn musk
#

the question is how much data are you feeding it

#

is it 10^2, 10^5 or 10^9

slate orchid
#

all data would have to be sorted by hand, so at best like 10^3
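
With roughly a thousand hand-labelled examples, a plain linear classifier is a reasonable first baseline before any neural network. A sketch using scikit-learn, with logistic regression swapped in for the "linear regression" mentioned above because the targets are categories; the texts and labels are invented placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training examples; real data would be the hand-sorted texts.
texts = [
    "honey and sugar on warm pancakes", "candy floss and caramel syrup",
    "crisps covered in sea salt", "salted peanuts and pretzels",
    "black coffee with no sugar", "dark chocolate and burnt toast",
    "lemon juice and vinegar", "sharp tart green apples",
]
labels = ["sweet", "sweet", "salty", "salty", "bitter", "bitter", "sour", "sour"]

# TF-IDF turns each text into a weighted bag of words; the classifier
# then learns which words are associated with which label.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

sample = ["this lemon drink is so sharp and tart"]
predicted = clf.predict(sample)[0]
probabilities = clf.predict_proba(sample)[0]
```

A single multi-class model like this gives one probability per trait, so separate models per trait are not strictly needed.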

torn musk
#

more data means less transfer learning and more layers of neural networks

#

i see

#

maybe a 2 layer DNN

slate orchid
#

like, i'm trying to make a crappy experiment, nothing's riding off of this being perfect

#

but i want it to mean SOME kind of thing

torn musk
#

ok

slate orchid
#

DNN?

torn musk
#

deep neural network

slate orchid
#

ah gotcha

torn musk
#

but deep here will be just 1 or 2 layers

slate orchid
#

so a 'd'nn then

#

'''d'''nn

torn musk
#

or even sequential neural network

#

dnn is just the common phrase

slate orchid
#

okay that's a googlable thing, thank you so much

torn musk
#

np

slate orchid
#

what kinds of libraries do i want?

torn musk
#

the easy way would be using keras

#

in tensorflow 2.0 keras is a part of tensorflow , but its pretty recent and you could find more examples of the 'old' keras

slate orchid
#

is keras an abstraction on top of tensorflow?

torn musk
#

yes

slate orchid
#

oh is it just easier tensorflow then

torn musk
#

yes

#

a lot easier

slate orchid
#

okay easy is great

#

thank you so so much, this looks like a great place to start

torn musk
#

yay anytime 😃

slate orchid
#

by the way your project looks awesome

torn musk
#

thanks!!!

slate orchid
#

have a good whatever the time is wherever you are

#

night?

torn musk
#

afternoon

slate orchid
#

that

#

have an excellent that

torn musk
#

you too

slate orchid
#

:)

grave patrol
#

This is a nooby question, but i can't figure out how to label my data. it's a dataset with 140 features per sample in a 2d array, with the last cell on each row being the class it belongs to (binary)

#

do i have to read the labels into a second array

#

using sklearn*

supple ferry
#

You can slice that array, transpose it and concatenate with the main array but vertically. Then you will get an array with 141 columns
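
If the goal is to feed sklearn, a simpler route is probably to split the array into a feature matrix and a label vector. A sketch assuming the layout described above (140 feature columns followed by one binary label column; the random data is a placeholder):

```python
import numpy as np

# Placeholder data: 20 samples, 140 features plus a binary label in the last column.
data = np.random.rand(20, 141)
data[:, -1] = np.random.randint(0, 2, size=20)

X = data[:, :-1]             # every column except the last
y = data[:, -1].astype(int)  # the last column as integer class labels
```

`X` and `y` can then be passed to an estimator as `fit(X, y)`.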

mighty vector
#

I'm pretty novice to python and data science. But I have some code that my organization wants to make available to everyone in our 30+ offices. What is the best way to package my product and make/push future updates?

chilly shuttle
#

@mighty vector github

#

Or just host it on an internal wiki like confluence

lapis sequoia
#

yo anyone got a clue about this neural net, seems to keep oscillating around 50/50 ```py
import numpy as np
import pandas as pd

inputs = []
outputs = []
train = pd.read_csv('traindata.csv')

def preprocessing(x):
    input = x.iloc[:, 0:-2]
    output = x.iloc[:, 6:8]
    for i in range(1, 12):
        inp = input.iloc[i - 1]
        inputs.append(inp.to_numpy())
    for i in range(1, 12):
        out = output.iloc[i - 1]
        outputs.append(out.to_numpy())

def sigmoid(x):
    return 1./(1. + np.exp(-x))

def sigmoid_prime(x):
    return x * (1. - x )

class StroopNetwork:
    def __init__(self, x, y_hat, epsilon):
        self.input = x
        self.theta1 = np.random.randn(4, 6)*np.sqrt(2/6)
        self.theta2 = np.random.randn(2, 4)*np.sqrt(2/4)
        self.expectation = y_hat
        self.output = np.zeros((2, 1))
        self.epsilon = epsilon

    def forward(self, i):
        self.layer1 = sigmoid(np.dot(self.theta1, self.input[i]))
        self.output = sigmoid(np.dot(self.theta2, self.layer1))

    def backward(self, i):

        delta_2 = (self.expectation[i] - self.output) * sigmoid_prime(np.dot(self.theta2, self.layer1))
        delta_1 = (delta_2 @ self.theta2) * sigmoid_prime(np.dot(self.theta1, self.input[i]))

        d_theta2 = self.output @ delta_2
        d_theta1 = self.layer1 @ delta_1

        self.theta1 += d_theta1
        self.theta2 += d_theta2

preprocessing(train)

bob = StroopNetwork(inputs, outputs, 0.12)

for z in range(1000):
    for i in range(1, 12):
        bob.forward(i-1)
        bob.backward(i-1)
        print(bob.expectation[i-1], bob.output, z)
```

#

i'm 🅱 retty new to this so it should be a rookie mistake lel

#

^ bit of sample output

hasty maple
#
def sigmoid_prime(x):
    return x * (1. - x )
should be like
def SigmoidGradient(Z):
    
    return Sigmoid(Z) * (1 - Sigmoid(Z)) no?
#

https://git.io/fhUGE @lapis sequoia this might help, been like a year since I wrote those codes though, so don't remember much >.>

lapis sequoia
#

sigmoid(z) = a

#

so yea that is correct but it doesn't make a difference @hasty maple
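
A small check of that point. The form `a * (1 - a)` is the sigmoid derivative only when it is given the activation `a = sigmoid(z)`; in the posted `backward`, `sigmoid_prime` is called on `np.dot(...)`, which is the pre-activation `z`, so the two forms may not be interchangeable there. The values of `z` below are arbitrary:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime_from_activation(a):
    """Sigmoid derivative written in terms of the activation a = sigmoid(z)."""
    return a * (1.0 - a)

z = np.array([-2.0, 0.5, 3.0])
a = sigmoid(z)

correct = sigmoid_prime_from_activation(a)   # derivative evaluated at z
mistaken = sigmoid_prime_from_activation(z)  # z * (1 - z), not a sigmoid derivative

# Central finite difference as an independent reference.
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
```

Comparing `correct` and `mistaken` against `numeric` shows which call matches the true slope.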

#

idk it's really weird

hasty maple
#

strange indeed, I would suggest to check for the shapes, usually numpy doesn't cry but works with whatever shape you throw at it, like for a multiplication, A(m*n) should have B(n*k), but if it's A(m*n) and B(n) it would still work and give an output but it won't be the correct one

lapis sequoia
#

sorry i don't quite understand, what do you mean by check for the shapes?

hasty maple
#
self.layer1 = sigmoid(np.dot(self.theta1, self.input[i]))
self.output = sigmoid(np.dot(self.theta2, self.layer1))

Output the shapes of self.theta1, self.input[i], self.theta2, self.layer1, self.output
When I coded it up last year, the shapes caused some trouble in the math, the self.input[i] might have caused the shapes to not be correct for the matrix multiplication operation
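
A sketch of the kind of shape check meant here, assuming 1-D inputs like the ones `Series.to_numpy()` produces; the arrays are random placeholders with the same sizes as the first layer of the posted network:

```python
import numpy as np

theta1 = np.random.randn(4, 6)
x = np.random.randn(6)       # 1-D input, shape (6,)

layer1 = theta1 @ x          # matrix times 1-D vector gives shape (4,)

delta = np.random.randn(4)   # an error term for the 4 hidden units

# 1-D @ 1-D is an inner product: NumPy returns a single number without complaint.
silent_scalar = layer1 @ delta

# A weight gradient needs one entry per weight, which is an outer product.
outer = np.outer(delta, x)   # shape (4, 6), same as theta1
```

If the same holds in the posted `backward`, expressions such as `self.output @ delta_2` would collapse to one scalar that gets added to every weight, which might be worth ruling out by printing `.shape` on each intermediate.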

lapis sequoia
#

ok ty i'll check

hasty maple
#

👍

lapis sequoia
#

the multiplication seems fine but it still doesn't work

#

ugh what is this

hasty maple
#

ah you aren't using epsilon

#

the network might be jumping back and forth due to the large steps

lapis sequoia
#

it does the exact same thing when i'm using epsilon

#

ty for all the help btw

hasty maple
#

try running it again with epsilon, use small values of epsilon, 0.001,0.0001, etc
Also maybe check for the gradients as well, compute em numerically
You're welcome :)

lapis sequoia
#

how would i compute them numerically?

#

ok will try that 👌

hasty maple
#

I can't seem to find the pdf for it, but it's available in Andrew NG's course on coursera, he shows how to calculate it in matlab, you can code it in python as well
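
In case it helps, a sketch of a numerical gradient check in Python. It is not taken from the course material; it uses a single sigmoid layer with squared-error loss as a stand-in, and all names (`loss`, `analytic_grad`, `numeric_grad`) are made up for the example:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(theta, x, y):
    """Squared error of a single sigmoid layer."""
    return 0.5 * np.sum((sigmoid(theta @ x) - y) ** 2)

def analytic_grad(theta, x, y):
    """Backprop gradient of `loss` with respect to theta."""
    a = sigmoid(theta @ x)
    delta = (a - y) * a * (1.0 - a)
    return np.outer(delta, x)

def numeric_grad(theta, x, y, h=1e-5):
    """Central finite differences, nudging one weight at a time."""
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        step = np.zeros_like(theta)
        step[idx] = h
        grad[idx] = (loss(theta + step, x, y) - loss(theta - step, x, y)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
theta = rng.normal(size=(2, 4))
x = rng.normal(size=4)
y = np.array([1.0, 0.0])

max_diff = np.max(np.abs(analytic_grad(theta, x, y) - numeric_grad(theta, x, y)))
```

A tiny `max_diff` suggests the backprop formula is right; a large one points at the analytic gradient. The check is slow, so it is normally run once on a small network and then switched off.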

lapis sequoia
#

ok ty i'll give it a search

turbid bay
#

im trying to make a digit recogniser. I dont really know what to do. Any help would be great. Thankyou

vague jetty
#

There are plenty of tutorials online for that. Do you want help finding a good one? Also, do you understand generally how digit recognizers work? If not, do you care about learning the fundamentals or do you just want to code something?

paper niche
vague jetty
#

I looked into that yesterday, and it didn't really help. That user is getting the error from a different function.

paper niche
#

you have 250 nodes in your output layer right?

vague jetty
#

Oh, no. I'm new to ML, so I'm still wrapping my head around the architecture.

paper niche
vague jetty
#

Not exactly, but I think I know where I'm getting tripped up now.

#

Not really related to the last question, but I have an architecture question, too.

My X shape is (17042, 250, 87), so 17042 sequences of length 250, each with 87 features. My Y shape is a vector of length 250 containing 1s and 0s, denoting whether a point in the input is important or not.

The last layer in the network should be an LSTM with units=250 and return_sequences=True, right?

Edit: NVM above question.

distant wraith
#

Anyone have a nice presentation on why Python for data science over BI tools (Splunk, Tableau, etc.) that I can show senior management?

turbid bay
#

@vague jetty i found something online. using the mnist digit dataset. it says its called a Multi-Layer Perceptron. However when i input my own data to test the model it is incredibly wrong

#

would it be viable for me to relearn neural networks? and make it from scratch?

vague jetty
#

I always advocate building things from scratch over copying code. If you have the time, you should absolutely relearn NNs and build one from scratch. I don't know where are you competency-wise, but you should be able to figure out what you do and don't know

reef bone
#

@turbid bay Multi-layer perceptron is mostly synonymous with neural network (although some literature makes a slight distinction between them afaik), so don't get confused by the fancy terminology. The MNIST digit recognition task is absolutely canonical and you will find a massive amount of resources dedicated to the problem by just googling - Geoffrey Hinton recognized it as the drosophila of machine learning, meaning it's an extensively studied problem and a good place to start your ML journey

#

And I would also advocate for trying to build your own NN to solve the task - and after you're done, take a look at the state-of-the-art networks for the problem (Kaggle is a good place for this), see what they do differently, and try to understand why

#

Keras provides an extremely modular and straight-forward API so you don't really need extensive knowledge beyond what layers are and how they work, although more knowledge always helps

turbid bay
#

yh i will try. i learnt before how to do some of it. But it was in octave and i found octave very hard to learn so hopefully will understand it when trying to do it in python

#

i will work on making my own NN. But be prepared for many questions XD

reef bone
#

There are some very smart and educated people in this chat so don't be afraid to ask for help

turbid bay
#

will do thanks

vague jetty
hardy crag
#

I get an error for that colab

vague jetty
#

As in the error I mentioned?

#

Or can you not access it?

hardy crag
#

cannot load it

#

wait fixed it

#

my ad blocker's fault 😦

vague jetty
#

Haha, it happens.

hardy crag
#

It's read only though

vague jetty
#

Can you not see the error in the last cell? I've never shared a Colab before.

hardy crag
#

yes. I see your output but can't change anything

#

(may be for the better now that I think about it)

#

can you check if in the lstm function the data is numpy arrays?

vague jetty
#

found the error lol

hardy crag
#

lstm function call

#

y_train.astype

#

?

vague jetty
#

Bingo.

hardy crag
#

yeah

#

noticed it just now :p

slate orchid
#

yo

#

i'm putting some tweets into keras text classification

#

now bear in mind i have no clue what i'm doing

#

say i were to replace the twitter image link with something like TWITTERIMAGELINKHERE

#

would that be okay to be added to the 'vocabulary', and then having an image is counted as relevant to the classification?

heady bone
#

I would say yes, if you were using some type of bag of words model
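A minimal sketch of that replacement step, assuming the image links look like `pic.twitter.com/...` (the pattern is a guess and should be adjusted to whatever the scraped tweets actually contain):

```python
import re

# Assumed link pattern; not taken from the actual data.
IMAGE_LINK = re.compile(r"(?:https?://)?pic\.twitter\.com/\w+")
PLACEHOLDER = "TWITTERIMAGELINKHERE"

def replace_image_links(tweet):
    """Swap every image link for one fixed token so it becomes a single vocabulary word."""
    return IMAGE_LINK.sub(PLACEHOLDER, tweet)

cleaned = replace_image_links("look at my owl pic.twitter.com/AbC123xyz so cute")
print(cleaned)
```

Because every link maps to the same token, a bag-of-words model can learn from "has an image" without seeing thousands of unique URLs.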

slate orchid
#

yup

#

nice

#

can classifications be given actual names?

#

or are they just integers normally

heady bone
#

you would usually have a classification for each integer, no? like 0 = positive, 1 = negative, 2 = whatever, etc.

slate orchid
#

okay then i'm gonna need to find a solid answer on the correct order of the hogwarts houses
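For what it's worth, the order is arbitrary as long as the same mapping is used for training and prediction. A small sketch with an alphabetical order (the variable names are just for illustration):

```python
# The order is arbitrary (alphabetical here); it only has to stay fixed.
HOUSES = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]
house_to_id = {name: i for i, name in enumerate(HOUSES)}

labels = ["Ravenclaw", "Gryffindor", "Slytherin"]
encoded = [house_to_id[name] for name in labels]  # names -> integers for training
decoded = [HOUSES[i] for i in encoded]            # integers -> names for display
print(encoded, decoded)
```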

#

i've got a weird project on

void star
#

Should I get started with tf, keras tf, keras, tf2 ? Which is best? Just trying to learn so I have no specific objective. Just trying to start from what's more logical

#

Mention me if u answer pls

sand lark
#

@void star I'd learn tf and/or pytorch. Pytorch is nicer in my opinion but I kind of do both since for some jobs they want tf and for others they want pytorch. They all do the same thing

hasty maple
hardy crag
#

if you do tf, do tf2

hasty maple
#

@lyric canopy could you go through the above sub reddit and see if it's pin worthy for this channel

hardy crag
#

I agree it is a good place to start

slate orchid
#

hey, if i want to make text training data for keras, can i go with a csv file formatted like

#

"text stuff text stuff text stuff", 0
"more text things, other text things", 1

hardy crag
#

sure

#

is your dataset too big to keep in memory?

slate orchid
#

i'm making the dataset out of some survey stuff

#

so i'm getting other people to classify stuff for me

#

i need a way for them to send stuff back to me

#

it's not gonna be that big

hardy crag
#

okay. So you're probably gonna load it from that csv anyway before using it, which means the format is whatever you want it to be and you just parse it accordingly
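A minimal sketch of parsing that format with the standard `csv` module. The two rows are the ones from the message above; an in-memory string stands in for the file, and `labels.csv` in the comment is a hypothetical filename:

```python
import csv
import io

# Stand-in for the file; with a real file use open("labels.csv", newline="") (hypothetical name).
raw = io.StringIO(
    '"text stuff text stuff text stuff", 0\n'
    '"more text things, other text things", 1\n'
)

texts, labels = [], []
# skipinitialspace=True drops the space after the comma, so the label parses cleanly.
for text, label in csv.reader(raw, skipinitialspace=True):
    texts.append(text)
    labels.append(int(label))

print(texts, labels)
```

Quoting the text column matters here because the second row contains a comma inside the text.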

slate orchid
#

nice

#

thank you!

#

oh, one more thing

#

how important is it for the dataset to be balanced?

#

so, i have four classifications

#

how important is it that i get a roughly 1:1:1:1 dataset

hardy crag
#

well, it would be easier but there are ways to deal with class imbalance

#

you could modify your loss function to account for class frequency, or sample your batches so that they contain roughly equal parts of each class even though the actual dataset does not

#

or you could just randomly select a subset of each class for your training set
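The first option (weighting the loss by class frequency) can be sketched like this. The formula `n_samples / (n_classes * class_count)` is one common heuristic, and the label counts below are made up:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Made-up label distribution for four classes.
labels = [0] * 50 + [1] * 25 + [2] * 15 + [3] * 10
weights = balanced_class_weights(labels)
print(weights)
```

In Keras a dict like this can be passed as `class_weight=weights` to `model.fit`, so mistakes on rare classes cost more.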

#

Not sure what the go-to way is in NLP

slate orchid
#

don't know anything about loss functions yet haha

#

anyway, thank you so much

hardy crag
#

Uh there is another one (which is fun, but probably not viable): You could try to generate synthetic samples of your smaller classes :p

#

yw. This is what this discord is for 😃

#

(or maybe you can rethink your class definitions in a way that make them more balanced)

slate orchid
#

uhhh

#

my class definitions are a little...

#

fixed

#

(i'm trying to make something that classifies by hogwarts house)

#

(don't ask)

hardy crag
#

well, I'm not in the replace-talking-hats business but I reckon the houses should be pretty equally sized no?

slate orchid
#

depends on how my friends decide to classify stuff

#

i'll tell them to try to keep things roughly equal-ish

hardy crag
#

a little bit of imbalance is fine I reckon. I'd just test it and you will probably see if your network tends to ignore one class or prefer one

slate orchid
#

sure, that works

#

thank you!

lapis sequoia
#

is anyone familiar with tensorflow

#

can you tell me about tf graph and tf session real quick

#

not sure if I need to initialize a new table for every session

#

and whether I can have my for loop of items outside tf session or inside

void star
#

@sand lark great thank you.

hardy crag
#

@lapis sequoia maybe try tf2. eager execution ftw

lapis sequoia
#

I havent learnt it yet

hardy crag
#

it's very similar, but imho much easier to grasp

#

you don't need to build a graph anymore with said eager execution

#

also it has built in keras for "common" architectures

lapis sequoia
#

can you share some example code

lapis sequoia
#

im doing something like this

#

what do you think

#
for query in some_queries:
  with tf.Graph().as_default():
    with tf.Session() as sess:
hardy crag
#

does your for loop need to be outside of the session?

lapis sequoia
#

I tried keeping it inside the session, but got an error saying my table was already initialized

hardy crag
#

are you only doing inference?

lapis sequoia
#

it needs a new table initialized for every query in the for loop..is there a different way

#

Im getting embeddings.. and comparing them to other embeddings

#

this is my embd function

hardy crag
#

hmm. gotta be honest, never run into that case. From the docs I guess that you need to initialize the tables for the graph once

lapis sequoia
#

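def embd(inputs, module, placeholder, sess):
  # Build the embedding op by applying the encoder module to the placeholder.
  embeddings_tensor = module(placeholder)

  # Initialize variables and lookup tables (this runs again on every call).
  sess.run(tf.global_variables_initializer())
  sess.run(tf.tables_initializer())
  # Feed the inputs and fetch their embeddings.
  message_embeddings = sess.run(embeddings_tensor,
                                feed_dict={placeholder: inputs})

  return message_embeddings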
#

so I run this function for every query.. in a new session

hardy crag
#

so you would need to maybe put the for loop inside the graph context and call the session inside the loop?

lapis sequoia
#

I timed that.. it took longer

#

than If I kept it outside the graph

#

like 9 seconds longer

#

I cant afford this.. it takes too long..just trying to get it done faster..

#

like 2 queries takes 31 seconds

#

that's days.. if I want to run thousands of queries

hardy crag
#

how many different graphs do you need to loop over?

lapis sequoia
#

what does that mean

#

I think it's the same graph

#

im not really sure about graph and sessions..still new

hardy crag
#

okay different approach :p

#

I'm guessing by embedding you mean you have a vector that contains some kind of information?

#

in your embd() function, inputs is a datapoint, module is a neural net?

lapis sequoia
#

yes.. its an encoder..

hardy crag
#

and you want to compare different encodings?

lapis sequoia
#

I have a set of embeddings which I want to compare with the embedding gotten from the queries list

#

I just changed my code

#

now it runs at half time

#

woohoo

hardy crag
#

👍

lapis sequoia
#

Necessity is the mother of invention.. saving my ass

#

thanks man

#

it really helped talking it out with you

hardy crag
#

happy to be the rubber duck 😃

lapis sequoia
#

what I did was this

#

Initialized graph and session..

#

loading the module..

#

placeholders..

#

then sess run.. initialized variables and tables..

#

then in loop, I called for embeddings for each query

hardy crag
#

sounds like the way to go

#

doing the initializing beforehand should save loads of time

lapis sequoia
#

yeah before I was doing all of this with a function call for each query..

#

which was a waste of time
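The restructuring described above (graph, module, placeholder and initializers set up once, then only `sess.run` inside the loop) can be sketched roughly as follows. This is not the original code: it assumes TensorFlow 2.x with the `tf.compat.v1` API, and a tiny matrix multiply stands in for the real encoder module, which isn't shown in the chat.

```python
import numpy as np
import tensorflow as tf

tf1 = tf.compat.v1  # graph/session-style API

# 1. Build the graph ONCE: placeholder, (stand-in) encoder, initializer ops.
graph = tf.Graph()
with graph.as_default():
    placeholder = tf1.placeholder(tf.float32, shape=[None, 4])
    weights = tf.Variable(tf.ones([4, 3]))          # stand-in for the encoder's variables
    embeddings_tensor = tf.matmul(placeholder, weights)
    init_ops = [tf1.global_variables_initializer(), tf1.tables_initializer()]

# Made-up queries; the real ones would be whatever the encoder accepts.
some_queries = [np.ones((1, 4), dtype=np.float32), np.zeros((1, 4), dtype=np.float32)]

# 2. Open ONE session and run the initializers once.
with tf1.Session(graph=graph) as sess:
    sess.run(init_ops)
    # 3. Inside the loop, only the cheap part: feed a query, fetch its embedding.
    results = [
        sess.run(embeddings_tensor, feed_dict={placeholder: query})
        for query in some_queries
    ]

print([r.shape for r in results])
```

Calling `module(placeholder)` and the initializers per query adds new ops to the graph and redoes the setup each time, which is presumably where most of the earlier runtime went.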

midnight atlas
#

anyone familiar with using cloud computing to train neural nets using a python script? pretty confused between several options

chilly shuttle
#

@midnight atlas that's a pretty nebulous and broad question, wanna narrow it down?

midnight atlas
#

Well essentially it's two parts, which platform is most intuitive to setup and then how can I do so?

#

I want to run a training script that runs over a 50gb database

devout ridge
#

I am using MultiLabelBinarizer() with np.array(), and I got this error: TypeError: 'numpy.float64' object is not iterable

#

Then I got TypeError: 'numpy.int64' object is not iterable

#

Anyone?

lyric canopy
#

Which line gives you that error?

#

You should be able to find which variable is assigned to a single number instead of an iterable

#

Trace it back to the source and then try to understand why it's not what you think it is

misty imp
#

def:

#

sorry, was a test.

devout ridge
#

@lyric canopy new_array_of_labels = mlb.fit_transform(array_of_labels)

lyric canopy
#

So, array_of_labels probably isn't what you think it is

#

Did you try printing it just before this line?

#

Oh

#

It should be an iterable of iterables, not an array of single numbers

#

According to the docs

devout ridge
#

this is array_of_labels - [0 0 0 ... 5 5 5]

lyric canopy
#

Ah, yes, so the elements are single numbers

#

But it actually wants an array of iterables (other arrays, tuples, ...)
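A minimal sketch of one way around it, assuming each sample really has a single integer label: wrap every label in its own list before handing it to `MultiLabelBinarizer`. The label values below are made up.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

array_of_labels = np.array([0, 0, 2, 5, 5])  # one integer label per sample (made-up values)

mlb = MultiLabelBinarizer()
# Wrap each label in a list so every element is itself iterable.
new_array_of_labels = mlb.fit_transform([[label] for label in array_of_labels])

print(mlb.classes_)
print(new_array_of_labels)
```

If there is only ever one label per sample, `LabelBinarizer` from the same module takes the flat array directly and may be the simpler fit.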

devout ridge
#

I see

#

thank you very much

misty imp
#

"""def:

devout ridge
#

I tried to run this: H = model.fit_generator(aug.flow(trainX, trainY, batch_size=BS),validation_data=(testX, testY),steps_per_epoch=len(trainX),epochs=EPOCHS, verbose=1)

#

And I ended up with this at the end: ValueError: Error when checking target: expected activation_7 to have shape (6,) but got array with shape (1,)

#

The data has 6 classes.

void anvil
#

@midnight atlas you can just run it off a single AWS instance

devout ridge
#

Hello?

void anvil
#

sorry it's the p2.8xlarge or p2.16xlarge

#

you can probably get away with g3 instances as well

hoary terrace
#

how would you guys describe the difference between a classifier and regression model in simple terms?

devout ridge
#

Hello? I need help.

heady bone
#

What are your layers like?

devout ridge
heady bone
#

oh wait

#

is your generator giving only 1d array?

#

it should look like this [[0, 1, 0, 0, 0, 0]]

#

instead of [0, 1, 0, 0, 0, 0]

#

hold on, i just woke up, brb 😛

heady bone
#

Okay, another problem might be your training outputs might just be a number. Have you converted them into onehot labels? Your model expects something like this [1, 0, 0, 0, 0, 0] for trainY.
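A quick sketch of that conversion with plain NumPy, assuming the labels are integers from 0 to 5 (the values below are made up):

```python
import numpy as np

NUM_CLASSES = 6
train_y = np.array([0, 1, 5, 3])  # made-up integer labels

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(NUM_CLASSES, dtype="float32")[train_y]
print(one_hot.shape)
print(one_hot[1])
```

`keras.utils.to_categorical(train_y, num_classes=6)` should give the same result; either way each target ends up with shape (6,), which is what the error message says the last layer expects.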

devout ridge
#

I will check it out.

reef bone
#

@hoary terrace Classification produces discrete values, regression produces continuous values. If you want to determine whether an image shows a cat or not, and nothing in between, those are 2 discrete classes with no overlap (nocat/cat, or 0/1). If you instead wish to predict how much something looks like a cat, and therefore get output anywhere between 0 (absolutely not a cat) and 1 (absolutely a cat), for example 0.75 for "wow that looks a lot like a cat, but not entirely", that would be regression. Naturally, the techniques that you would use for either can overlap, for example classification with neural networks - imagine you're trying to decide which of 10 discrete animal classes is shown in a picture - is it a cat, dog, or armadillo? The neural network would likely end up having 10 output neurons where each represents one of your classes. After the input is propagated through the network, each of the neurons has a certain value - we can understand each neuron's output to be the probability that the image shows its designated animal. We would use a softmax activation to get a normalized probability distribution, and then simply choose the most probable animal as the output. If the input is a cat, the cat's neuron would maybe have a value of 0.4, a dog would have 0.3 because it looks similar, but the armadillo would only get 0.05 because it looks nothing alike. However, we are still performing classification because the output is a discrete class - a cat.
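The softmax-then-pick-the-largest step can be sketched in a few lines of plain Python. The raw scores are made up and only three classes are shown to keep it short:

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["cat", "dog", "armadillo"]
raw_scores = [2.0, 1.7, -0.1]                    # made-up outputs of the last layer

probs = softmax(raw_scores)                      # continuous values per class
predicted = classes[probs.index(max(probs))]     # discrete decision
print(probs, predicted)
```

The probabilities are continuous, but the final answer is one of a fixed set of labels, which is what makes it classification.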

lapis sequoia
#

just seems to be oscillating around 50/50

midnight atlas
#

@void anvil how can I upload a large dataset to a VM? struggling with this part as SFTP type solutions can't handle the 50 gb easily

lapis sequoia
#

I have a pandas column that is a list of lists

#

how do I split that column in to separate columns for each item

lapis sequoia
#

I didn't take a direct approach, this is what I did:

#
1. Make a dataframe from a dict whose values are lists of lists
2. Assign column names
3. Create a separate concatenated df by doing a .apply(pd.Series) on the columns containing lists
4. Reset the index (because up until now, the dict key is the index)
5. Assign column names to the new df
chilly shuttle
#

check out unstack

#

oh hang on, you mean a column that contains lists

#

not a grouped column

lapis sequoia
#

yeah

chilly shuttle
lapis sequoia
#

yeah so this works.. but requires that the split column be assigned to a new dataframe

#

then I have to concat..

#

similar to what I'm doing with apply and pd.Series
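One fairly compact way to do the split, sketched on a made-up frame (column names are hypothetical). It still builds a second frame, but `join` on the shared index takes care of lining the rows up:

```python
import pandas as pd

df = pd.DataFrame({"id": ["a", "b"], "scores": [[1, 2, 3], [4, 5, 6]]})

# Expand the list column into its own frame, reusing the original index.
expanded = pd.DataFrame(df["scores"].tolist(), index=df.index, columns=["s0", "s1", "s2"])

# Drop the list column and attach the new columns by index.
result = df.drop(columns="scores").join(expanded)
print(result)
```

Building the frame from `.tolist()` tends to be noticeably faster than `.apply(pd.Series)` on larger data, though that is worth timing on the real column.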

lapis sequoia
#

I want to filter my dataframe, based on values in a column.. whether they contain a certain text or not

#

I can do conditionals for whole match.. but not sure how to do it for contains

#

I did str.contains.. and it seems to work..

#

but not sure if it's right..

#

hmmm

chilly shuttle
#

str.contains is the best way to do it if it's sufficient for what you want

#

otherwise you need to map
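A small sketch of the `str.contains` filter on made-up data. The `na=False` part matters when the column has missing values, since otherwise the mask contains NaN and can't be used for indexing:

```python
import pandas as pd

df = pd.DataFrame({"title": ["Red apple", "green pear", None, "APPLE pie"]})

# regex=False treats the pattern as literal text; case=False ignores case;
# na=False makes missing values count as "does not contain".
mask = df["title"].str.contains("apple", case=False, regex=False, na=False)
filtered = df[mask]
print(filtered)
```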

lapis sequoia
#

is anyone alive

#

I want to left join two dataframes..

#

do I mention which columns.. in left_on and right_on.. what if I needed one more column from the right

#

I got it..

#

I just had to mention the columns to join on..

#

didn't need to mention the others I wanted to add.. it did it anyway
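That matches how `merge` works: `left_on`/`right_on` only name the key columns, and every other column from the right frame comes along automatically. A sketch with made-up frames, including how to bring over only some of the right-hand columns:

```python
import pandas as pd

left = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ann", "bob", "cy"]})
right = pd.DataFrame({
    "uid": [1, 2],
    "city": ["Oslo", "Rome"],
    "age": [30, 41],
    "notes": ["x", "y"],
})

# Subset the right frame first to control which extra columns are pulled in.
merged = left.merge(
    right[["uid", "city", "age"]],
    how="left",
    left_on="user_id",
    right_on="uid",
)
print(merged)
```

Rows on the left with no match on the right (user 3 here) are kept and get NaN in the right-hand columns.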

#

how do I convert a header to the first row of dataframe?

supple ferry
#

header to a first row ?

#

when reading?

lapis sequoia
#

not when reading

#

like after assigning column names

#

adding the same header as row

supple ferry
#

didn't understand it. can you give a small example?

lapis sequoia
#

consider you have a header for a dataframe..

#

which has column names A, B

#

you want the first row of the dataframe to be A, B as well

#

and move the rest of the dataframe contents one cell down.. to accommodate this

supple ferry
#

You can use pd.DataFrame.shift with optional argument fill value and give it your values
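One thing to watch with `shift`: it keeps the number of rows fixed, so shifting down by one pushes the last row out, and `fill_value` fills every new cell with the same value. A sketch of an alternative that keeps all rows, using `pd.concat` on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})

# One-row frame holding the column names, with the same columns as df.
header_row = pd.DataFrame([df.columns.tolist()], columns=df.columns)

# Stack it on top; ignore_index renumbers the rows 0..n.
with_header = pd.concat([header_row, df], ignore_index=True)
print(with_header)
```

The columns become object dtype afterwards, since each one now mixes a string with the original numbers.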

lapis sequoia
#

thanks