#data-science-and-ml
okay. what you can do is again, tweak your reading function
df = pd.read_csv("data.csv", na_values = ["-", "_"])
na_values is the parameter allowing you to add special na values for a given dataset
if parser sees such fields it will treat them as null
there I put a list of two strings to be considered as null. if parser sees a field with any of these two strings, it will parse it as null
then you can easily df.dropna(inplace = True)
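A minimal sketch of the na_values idea above. The CSV content and column names are made up for illustration, and the result of dropna() is reassigned rather than applied in place:

```python
import io

import pandas as pd

# Hypothetical CSV content where "-" and "_" mark missing values.
raw = io.StringIO("name,score\nann,10\nbob,-\ncat,_\ndan,7\n")

# Treat "-" and "_" as NaN in addition to pandas' default NA markers.
df = pd.read_csv(raw, na_values=["-", "_"])

# Drop the rows that contain a NaN and keep the result under a new name.
clean = df.dropna()
print(clean)
```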
Now you know
:)
read_csv is a very powerful function. I advise you to read its documentation. It will save you tons of headaches at read time already
Okay
btw, try not to use inplace = True at any time
it will confuse you sooner or later
inplace is so useful
It is useful, but also confusing. It is better to reassign than to modify in place
can someone explain to me how np.searchsorted() works? I checked the documentation but didn't get it. Thank you
@fervent solar let's say you have two arrays, A and B. You want to put B's elements into A without messing up A's order. That function will give you the indices at which you should insert the elements of B so that the overall sorted order of A remains unchanged
Also inplace is supposed to be deprecated.
I hope in version 1.0 they do it
i have a question tangentially related to data science so maybe you guys could help
here i'm using scipy.interpolate.interp1d (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.interpolate.interp1d.html) to generate the function for fitting the points on these bars into a curve.
as you can see, the result isn't quite that great with those dips on the side
is there a better curve fitting module i can use for this purpose?
i want the curve to follow more along the path in red
oh and i should mention, the points move and there are an arbitrary number of them
like so
This is really a good question, I hope there is someone to answer that. I also became interested
Lmfaooo ya getting Runge phenomenoned
scruuuuub
Use a chebyshev grid next time or use a degree sqrt(n) polynomial regressor
Or throw in a lipschitz coefficient bound
Worlds your oyster fam
Git gud
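One way to read the "degree sqrt(n) polynomial regressor" suggestion, as a rough sketch. The bar heights below are invented, and a low-degree least-squares fit is only one of several options; it smooths the points rather than passing through each one:

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical bar heights at evenly spaced positions.
x = np.arange(16, dtype=float)
y = np.array([1, 2, 4, 7, 9, 12, 14, 15, 15, 14, 12, 9, 7, 4, 2, 1], dtype=float)

# Degree of roughly sqrt(n) keeps the polynomial from oscillating near the edges.
degree = int(round(np.sqrt(len(x))))

# Least-squares fit (regression), not exact interpolation through every point.
curve = Polynomial.fit(x, y, degree)

# Evaluate the fitted curve on a dense grid for drawing.
xs = np.linspace(x.min(), x.max(), 200)
ys = curve(xs)
```

Because the number of points is arbitrary in the question, the degree is recomputed from len(x) each time.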
Hi guys,
So I'm solving a problem where I have html which contains challenge_details ie Question
I managed to extract main_text from this html manually with the help of bs4
now I've the challenge_details in txt file now my next goal is to extract KEYWORDS which I can use to find
relevant documents from the internet which can be helpful to solve the Question or may contain knowledge which can be useful
to understand the question.
My questions are:
- should I manually scrape Google search results or is there a good Python lib for it? (I'm not going to scrape too frequently or too fast, so getting banned will not be an issue)
- I'm only going to take URLs in the search results which lead me to an HTML page or a PDF, so my next question: what is the best library for ARTICLE extraction from HTML and text extraction from PDF?
- Now I would like to use some magic [library which can find document similarity or the n most relevant docs]
what better information retrieval system is there than Google? Hence, after extracting keywords, I'd like to scrape Google search, as the docs in the results will be the most relevant.
but I don't think keywords will fetch good search results, and hence I'll be using the title which I extracted from challenge_details. But extracting keywords and finding documents based on them is a requirement that I need to fulfill.
Library that I've found so far: newspaper3k and spacy
@cursive sun @void anvil thanks! i don't really know what any of these things mean but I'll look into them
Oof, this is why we need better math education in schools tbqh
@cursive sun if he knew, then he would not ask, right? Not knowing is okay, not wanting to know is not
@past sonnet i think you can go to help channels for this question. It is better suited to those profiles
Calm down fam, it's a joke. Now the question is why you were the one to get antsy about it
It's not a very good joke
Fine, if you think qwerty of all people is smart enough to help beyond 'import cv2' then you dont need me
You guys need me to tiptoe around then im not helping for free, you guys provide solutions for issues associated with interpolation problems
Honestly, the most important thing we expect someone to do on this server is to show respect for the other users. Statements like 'qwerty of all people is smart enough' are not that, so you can keep your attitude. @cursive sun
So, drop the attitude
Lmfao im not gonna listen to you, you can keep querty, ill bounce
!kick 339017672211693570 telling staff they're not going to listen when told to change their immature and toxic attitude.
QWERTY why don't you prefer inplace=True what are its drawbacks, if it has any?
Honestly fourier transforms were taught in differential equations for me
which is sophomore math
EEMD and other decompositions are taught in signals processing courses which are at least Jr. level engineering courses and more advanced ones are grad level
differential equations is sophomore math?
@supple ferry can you give a more detailed example
@fervent solar, let's say you have array A = [1, 2, 5, 7, 10] and some array B = [3, 4, 9]. And you want to insert values from B into A so that the sorted order of A (as you see) doesn't change. If you just stick B onto A, it will extend it to [1, 2, 5, 7, 10, 3, 4, 9]. The order is spoiled. If you use np.searchsorted(), it will give you the indexes at which you can insert the elements of B.
In [4]: a = [1, 2, 5, 7, 10]
In [5]: b = [3, 4, 9]
In [6]: np.searchsorted(a, b)
Out[6]: array([2, 2, 4], dtype=int64)
You see the output? It says: put the first element of B (3) at index 2 of A, between 2 and 5, and it won't spoil the order. Same goes for 4: it also goes at index 2, between 2 and 5. But 9 should be put at index 4, between 7 and 10
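To make the "indices where you could insert" idea concrete, here is a small sketch that feeds the result of np.searchsorted into np.insert. Pairing the two calls is an illustration, not something the docs require:

```python
import numpy as np

a = np.array([1, 2, 5, 7, 10])  # must already be sorted
b = np.array([3, 4, 9])

# For each value in b, the position in a where it would keep a sorted.
idx = np.searchsorted(a, b)

# np.insert interprets the indices relative to the original a.
merged = np.insert(a, idx, b)

print(idx)     # [2 2 4]
print(merged)  # [ 1  2  3  4  5  7  9 10]
```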
@midnight oracle, inplace = True is very handy, but only at first sight. If you don't order your code properly, you can easily forget that you modified your dataframe in place. I am telling you this from my experience, and I am pretty sure that there are some other users here who had the same thing
just writing a couple more symbols will not hurt, but may help
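A small sketch of the difference, with a made-up one-column frame. The point is that the in-place call returns None and silently changes the object it is called on:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0]})

# Reassignment: df is untouched and the cleaned frame has its own name.
df_clean = df.dropna()

# In place: the call mutates its target and returns None.
other = df.copy()
result = other.dropna(inplace=True)

print(len(df), len(df_clean), len(other), result)
```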
Okay
@supple ferry oh python is smart
It is NumPy and Python
guess its output @supple ferry
eww, not muting discord channels
hahaha
Literally unplayable
any good platform to ask programming based questions ?
stack overflow @fervent solar
aw first time I saw someone with actual signal processing knowledge online and they get kicked
(x[:-1] > x[1:]) can some one explain this code
x[:-1] and x[1:] are slices
```py
[1,2,3][:-1] -> [1,2]
[1,2,3][1:] -> [2,3]
```
as for comparing lists with < or >
For plain Python lists this compares lexicographically: the first pair of elements that differ decides the result
slices are formatted like this
```py
[start:end:step]
```
with default of
```py
[0:-1:1]
```
that can be omitted
It lets you take part of a list
it was used in
while np.any(x[:-1] > x[1:]):
    np.random.shuffle(x)
return x
bogosort
the thing i'm not getting is why (x[:-1] > x[1:]) because it will miss first value in x[1:])
and why greater than sign is used
It's comparing the first value with the second value
hmm, it's np.any, was gonna say that wouldn't work for any, but maybe it's different for numpy, lemme check
ok
Yeah, for numpy it will check every element against the one after it and return True if any element is larger than its successor
@silk acorn u said default value is [ 0 : - 1 : 1 ] but its printing reversed list
It shouldn't be.
But i did make a mistake there
it's
[0, len of list, 1]
That's not a reversed list, that's a list missing the last element
yes
[::-1] would get you a reversed list
on what basis is (x[:-1] > x[1:]) computed as True when both have the same size?
In [94]: x
Out[94]: [8, 3, 6, 1, 7, 5]
In [95]: x[:-1]
Out[95]: [8, 3, 6, 1, 7]
In [96]: x[1:]
Out[96]: [3, 6, 1, 7, 5]
In [97]: (x[:-1] > x[1:])
Out[97]: True
In [98]: (x[1:] > x[:-1])
Out[98]: False
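With NumPy arrays, > compares the two slices element for element, and np.any returns True if at least one of the comparisons is True. (With plain Python lists, as in the session above, > gives a single bool instead.)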
means it wil compare the first and last element ?
It will compare a[0] > b[0], a[1] > b[1] etc, where a in this case is x[:-1] and b is x[1:]
and return True if one or more are True
The pure python equivalent would be
```py
any(a < b for a, b in zip(x[1:], x[:-1]))
```
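Putting the pieces together, a runnable sketch of the bogosort loop being discussed. The function name and the seeded generator are choices made here so the example is reproducible. Note that x has to be a NumPy array for the elementwise comparison to work:

```python
import numpy as np


def bogosort(x, rng=None):
    """Shuffle until no element is larger than its successor."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x)
    # x[:-1] > x[1:] is True wherever a neighbouring pair is out of order.
    while np.any(x[:-1] > x[1:]):
        rng.shuffle(x)
    return x


x = np.array([8, 3, 6, 1, 7, 5])
out_of_order = x[:-1] > x[1:]    # compares 8>3, 3>6, 6>1, 1>7, 7>5
print(out_of_order.tolist())     # [True, False, True, False, True]
print(bogosort([3, 1, 2]).tolist())
```

Nothing is "missed" by the two slices: every element except the last appears in x[:-1], and every element except the first appears in x[1:], so each neighbouring pair is checked exactly once.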
Hi. Is there a library in python that graphs in the coordinate plane and draws segments in between points? I have tried researching online but I cannot seem to find anything. I don't believe matplotlib can suit my purposes for this.
have you tried seaborn?
No, but that looks cool and useful for another project. However, from what I am seeing of it, it serves nearly the same purpose as matplotlib.
In what way is matplotlib insufficient?
Connecting points with segments seems like something it would do well
By plotting points, I mean like you could manually do in Geogebra. I may just be ignorant but I don't believe matplotlib has a functionality to connect the three points in such a fashion.
How about patches: https://matplotlib.org/api/patches_api.html
Hi, been working on a binary text clf using scikit-learn and I feel a bit lost on what i can do to improve the acc (currently at 0.73-0.75). The dataset is quite small (~7000) so I don't know how far I can push it .
I am still very much learning so if anything seems off please let me know I'd really appreciate it :)
PreProcessing:
Cleaned the data
Set up some stopwords
Tried some word-clustering but didn't see any gains (because of dataset size?)
Just now messing around with MaxAbsScaler and Normalizer
Pipeline:
CountVectorizer
TfidfTransformer
The Preprocessing I mentioned above
an SGDClf
I did not end up needing patches, but I did use matplotlib in a way that I did not know existed.
I also learned about zip(), which was good.
I essentially made a list of lists, something like data = [[1,3],[2,1],[3,2]], then I found a way to make all of the possible segments given some points. Since I only have three (I am graphing triangles), that works for me, and I used: plt.plot(*zip(*itertools.chain.from_iterable(itertools.combinations(data, 2)))) Still not entirely sure how it works, but it does.
Oh yeah, I had to import itertools in addition to matplotlib.pyplot. FYI for anyone looking to use that method.
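For anyone wondering how that one-liner works, here it is unrolled step by step. The plotting call is left as a comment so the sketch runs without matplotlib; the intermediate variable names are made up:

```python
import itertools

data = [[1, 3], [2, 1], [3, 2]]  # three triangle vertices

# Every pair of points, i.e. every possible segment.
pairs = list(itertools.combinations(data, 2))

# Flatten the pairs into one long sequence of points.
points = list(itertools.chain.from_iterable(pairs))

# Transpose: one tuple of x values, one tuple of y values.
xs, ys = zip(*points)

print(pairs)
print(xs, ys)
# plt.plot(xs, ys) then draws a single polyline through these points,
# which retraces some edges but ends up covering every segment.
```

Since combinations() produces every pair, this always draws the complete graph on the points. For four points that includes both diagonals, which is why it stops being enough for quadrilaterals.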
@pure lynx wait you could do that? That's illegal
Does anyone have experience with conditional logistic models?
I want to find out why my hessian matrix fails to calculate. I found out that I have hidden intercepts in my design matrix and I want to find out which variables cause that and remove them
Hello, I have a question about kinds of data for analysis... is working with technical data very different from any other kind of data? I understand that the reasoning the data provides goes a different way, but how about the technical aspect of processing data?
@lapis sequoia , can you be more specific?
What..? How is it illegal?!
As far as I understand, data analysis for, let's say, business analytics requires a set of skills that involves knowledge of the business field, processes etc. Technical analytics would be more specific to some technical area of knowledge, like laser efficiency. But is the process of working with data (munging, cleaning, modeling) the same, with the difference only coming from the background of the data, or does the whole process of working with data differ in some way?
I am not sure I explain myself well enough, sorry, for confusing question
@lapis sequoia have u done any projects in data analytics
I have done some projects at school, but they were related to things like predicting whether a new website would bring in more clients, or which store's range of goods to improve
Nothing with technical data, that is why I am interested if there is any difference
Python
So, from my understanding, data will mean data always. Which means, technical or non-technical you will work with numbers. However, every industry and data type will need specific approach to its data. If you work with panel data, methods that you will use will differ completely from the methods you will use for example in cross-sectional data
Not to be off topic, but... how was what I did illegal?
I don't think that was a serious remark. I'm not sure either.
@pure lynx It was not meant literally. It was most likely meant as an amusing compliment
@pure lynx it was an amusing compliment as @polar acorn suggested
I was surprised that one can use itertools in matplotlib. It is very creative
Oh, I guess I'm just daft in that case. Thanks for the compliment, though I can take credit only for relentless searching on StackExchange. I will have to find a different way to graph when I eventually move on to quadrilaterals.
Is anyone learning data analytics from DataCamp?
How do I build a portfolio for data science if I've never had a job or internship within the field. I'm coming up on my graduation date, and kinda scared my basic understanding of core skills isn't going to be enough to present to employers.
@heavy apex you can use kaggle to see work by other people, which will give you a feeling for what kind of projects people are doing. If you see something you like, you can find another dataset and try to implement a similar methodology there. There are various free dataset websites you can visit, be it kaggle itself, r/datasets or Google's search for datasets, forgot its name
Even more important than modeling is the way you interpret the results
Anyone got a good book advice for Bayesian Statistics?
@supple ferry thanks for insight.
@supple ferry thank you, great advice.
If you get Kaggle GM you'll get 6-7 fig job offers
depending on what field you want to go in
@heavy apex Hackathons, projects, kaggle
Kaggle is the gold standard tbh
And yes you're right, just having the skills won't be enough
hackathons and projects mean fuck all
Kaggle isn't really a good standard at all
It's just a common recommendation since it's easy to find data there
Hackathons and projects can easily mean as much as or more than kaggle
It's not about what it is, it's the skills you display
With hackathons and projects, you can show fast prototyping under pressure or longer term software skills which are hard to show with just Kaggle. Kaggle doesn't actually show any particular kind of skill that the other two don't
For reference: my company has hired both top 50 on kaggle and hackathon winners (I'm from the latter) along with those with work experience only
Recall that previously we created a simple array using an expression like this:
In[3]: x = np.zeros(4, dtype=int)
We can similarly create a structured array using a compound data type specification:
In[4]: # Use a compound data type for structured arrays
data = np.zeros(4, dtype={'names':('name', 'age', 'weight'),
'formats':('U10', 'i4', 'f8')})
how this numpy structure is working
i understand python dictionaries
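A short sketch of how that compound dtype behaves once it has data in it. The names and numbers filled in below are invented for illustration; the dict only describes the fields, it does not make the array behave like a dictionary of rows:

```python
import numpy as np

# 'names' lists the field names, 'formats' the matching types:
# U10 = unicode string up to 10 chars, i4 = 32-bit int, f8 = 64-bit float.
data = np.zeros(4, dtype={'names': ('name', 'age', 'weight'),
                          'formats': ('U10', 'i4', 'f8')})

# Each field can be filled like a column.
data['name'] = ['Alice', 'Bob', 'Cathy', 'Doug']
data['age'] = [25, 45, 37, 19]
data['weight'] = [55.0, 85.5, 68.0, 61.5]

print(data[0])          # one record (all fields of row 0)
print(data['age'])      # one field across all records
young = data[data['age'] < 30]['name']
print(young)
```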
Might not be the correct channel but what is a good name for summing up all the values up to a certain value.
Ex. Value 3, 1 + 2 +3 = 6
Is there a mathematical name or function for it?
sum_until()?
Could work, currently I have "Sum up to value" its going to be for the filename also.
sry for the preview
if you are just interested in the result you might try something like np.arange(n + 1).sum()
or write the function given in the wikipedia article
Me? I am simply looking for a name I can call my file for github push. But I guess Triangular numbers seems to be correct
Cum_sum, short for cumulative sum might be used for similar things.
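The three ideas from this exchange side by side, as a quick sketch (the function name is just a suggestion):

```python
import numpy as np


def triangular(n):
    """n-th triangular number: 1 + 2 + ... + n."""
    return n * (n + 1) // 2


n = 3
closed_form = triangular(n)                  # 6
summed = int(np.arange(1, n + 1).sum())      # 1 + 2 + 3 = 6
running = np.cumsum(np.arange(1, n + 1))     # running totals 1, 3, 6

print(closed_form, summed, running.tolist())
```

The cumulative sum gives every triangular number up to n at once, which is where the two names meet.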
let's say I am using data of stock prices.
I want to train the set on some subset of dataset and other subset to test it.
Should I go with scikit learn's train_test_split, which "Split arrays or matrices into random train and test subsets"
Or should I divide dataset with respect to dates. Such as train on data from 2016 to 31st Dec 2018, and all the data after it for testing?
You should train on data up to a date and test on later data instead of using sklearn's train_test_split. If you want to do CV you probably want to do this several times for different dates. If you just google "time series cross validation" there are lot of guides.
CV?
Cross validation. A more complete way to test model performance but takes more time.
absolutely not random CV greed
time series data needs to be split into train/test with quarantine periods based on forecast period to prevent contamination of data
If you want to do cv train test splits they can be segmented with additional quarantine periods
Or you can generate synthetic data based off of your current time series
just know if you're training on data not actually available at the time you're introducing some level of cheating into your predictions
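A minimal sketch of a date-based split with a gap between train and test. The prices are synthetic, and the 5-day quarantine is an arbitrary placeholder; the right gap depends on the forecast period, as mentioned above:

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for real stock prices.
dates = pd.date_range("2016-01-01", "2019-06-30", freq="D")
df = pd.DataFrame({"date": dates, "price": np.linspace(100, 150, len(dates))})

cutoff = pd.Timestamp("2018-12-31")
gap = pd.Timedelta(days=5)  # hypothetical quarantine period

# Train on everything up to the cutoff, test only on data after the gap.
train = df[df["date"] <= cutoff]
test = df[df["date"] > cutoff + gap]

print(train["date"].max(), test["date"].min())
```

For cross validation the same cut can be repeated at several cutoff dates, always keeping the test window later than the training window.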
I don't think they have taught CV in the course yet. So for this assignment I won't use it. As the assignment is due tomorrow. But I have bookmarked it and will read up on it.
Thanks a lot guys.
@lean ledge, why can't we name it something like sum factorial
@void anvil i really liked the process of making synthetic data and now trying to get deeper into Bayesian inference, Monte carlo methods and Markovian methods
i dont mean for actually using it in production or research
it will help me to master numpy
i am trying to replace most of the python standard functionalities that i use solely with numpy
ah
one ring to rule them all
Hello! Has anyone worked with k-means clustering?
I'm working on a school project and most of the resources I'm finding online deal with just 2-D x and y data. It's possible to cluster more complex data in Python, correct?
Ok I'll look into that. I figured out how to get my program to read and store my csv. Would I have to do anything extra, or will my program realize it is multidimensional?
And plot/cluster the data accordingly
so, for using kmeans you should use external library. for reading csv and working with it pandas is your friend. for kmeans sklearn is the package you should use
in this post you can find very detailed approach how to implement kmeans
Oh thank you! I did use pandas. As for my kmeans, I had found different ways using numpy and matplotlib, but this looks much cleaner!
Should you come up with questions, feel free to ask them here. We will be glad to help :)
For sure, thank you!
the dates on x axis are too cramped up. How do I make it only show year?
got the solution
set_xticks()
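set_xticks() works if you choose the tick positions yourself. An alternative sketch, assuming a date axis, lets matplotlib.dates place one tick per year. The data is synthetic and the Agg backend is only there so it runs without a display:

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this line in a notebook

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily prices over four years.
dates = pd.date_range("2016-01-01", "2019-12-31", freq="D").to_pydatetime()
prices = np.linspace(100, 150, len(dates))

fig, ax = plt.subplots()
ax.plot(dates, prices)

# One major tick per year, labelled with the year only.
ax.xaxis.set_major_locator(mdates.YearLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
```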
you can do k-means on as many dims as you want, the only problem is that it quickly becomes impossible to visualize the results on a single 2d plot
tutorials might deal with 2d solely because it's easy to see what's happening
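A rough sketch of k-means on more than two dimensions with sklearn. The five-feature data is randomly generated here; with a real CSV you would pass the numeric columns of the DataFrame instead:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: 150 samples with 5 features, drawn around three centres.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 5)) for c in (0.0, 5.0, 10.0)])

# KMeans does not care how many columns X has.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(model.cluster_centers_.shape)  # one centre per cluster, one value per feature
print(np.bincount(model.labels_))    # number of samples assigned to each cluster
```

Only the plotting step needs a decision about which two features (or which projection) to show.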
Does anyone have any exp using time series algorithms with python?
@violet crag there used to be plt.tight_layout too
But in your case it will probably not be helpful
Now I've run into another issue.
Regression function needs float for "x" and I have dates
convert to unix time
if you have variant time steps
or if time steps are uniform or could be considered uniform just throw ints
It's variant
is the step important?
It is stock price, sometime difference between the entries is 1 day (consecutive) other time its 2-3 days.
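A small sketch of both suggestions for unevenly spaced dates. The four dates are made up, with a weekend gap in the middle:

```python
import pandas as pd

# Hypothetical trading dates with a gap (no entries for Jan 4-6).
dates = pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-07", "2019-01-08"])

# Days since the first observation: keeps the uneven spacing.
days = (dates - dates[0]).days.to_numpy()

# Unix time in seconds, if an absolute scale is preferred.
unix_seconds = (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

print(days.tolist())
print(list(unix_seconds))
```

Either column can be used as the numeric "x" for a regression; days since the start tends to give friendlier coefficient sizes.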
"step"? Idk what you mean
Hmm, alright, I'll see what those are
you can start here
this one is also neat
Link to the complete notebook: https://github.com/borisbanushev/stockpredictionai
ohhhhhh
but he definitely cheats
Thanks for that
and doesn't realize it
Cheats? How?
he leaks information from train to test that shouldn't exist
He also predicts price, not movement
Oh, I see. In my case I've made a clean cut in the data frame using list slicing.
What do you mean by "movement"?
% change
you should never predict price, it's too easy
and you'll get bad results
you want to predict day over day changes and back that out to price
so if your stock price is 100, 101, 102, 101
you want to be predicting 1%, 0.99%, -0.98%
you will get significantly better-looking metrics predicting price than actual movements, which is exactly why it's misleading
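The price-to-movement conversion from the example above, and the way back, as a quick sketch:

```python
import numpy as np

prices = np.array([100.0, 101.0, 102.0, 101.0])

# Day-over-day fractional change: (p[t] - p[t-1]) / p[t-1].
returns = np.diff(prices) / prices[:-1]
print(np.round(returns * 100, 2))  # approximately 1.00, 0.99, -0.98 percent

# Backing predicted changes out to prices: compound from the first price.
rebuilt = prices[0] * np.cumprod(1 + returns)
print(rebuilt)
```

A model would predict the returns array; the cumprod step turns those predictions back into a price path.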
Is it possible to use ML as a tool for your investments?
Ive done some work before in that area but never had much success using it on my own portfolio.
one possible idea can be to use news, derive sentiment and apply it to prediction models
Is it true that regression can't work on dates and I have to convert them into numeric data
so is anyone here an actual data scientist
@violet crag regression assumes your exogenous variables are scalars. So, it can not work with date or categorical (I don't mean encoded into dummy)
@supple ferry I successfully converted date into numeric value, keeping the step in mind as well.
I've issue in regression. I thought it would draw a regression line. But regressor.predict([dates]) just gave me same values as actual price
I think I know where I am going wrong
@violet crag In case you didn't see you are predicting on your training dates. You should predict on test_dates.
@polar acorn yes, I caught that
now I have issues with Reshape your data either using array.reshape(-1, 1)
is this a numpy thing?
It's a sklearn thing. It prefers your numpy arrays in a certain way.
np.asarray(test_dates).reshape(-1, 1)
ValueError: shapes (101,1) and (403,403) not aligned: 1 (dim 1) != 403 (dim 0)
I am clueless
What is the shape of the X and y you pass to fit()?
if you call test_dates.shape what do you get?
or train_dates.shape rather
Okay I see what you are doing, you are actually feeding two lists to LinearRegression.fit()
yea, one x and one y
The fit function wants 2D input for X instead: shape (samples, features) for X, and (samples,) or (samples, targets) for y. So what you can do is
fit(np.reshape(train_dates, (-1, 1)), np.reshape(train_prices, (-1,1)))
The np.reshape(train_dates, (-1, 1)) means: reshape my list into a numpy array with dimensions -1 and 1. Wtf you might think, -1? -1 just tells numpy to work out that dimension from the number of elements, so here it gets substituted with the length of your list. It is a convenience so you can reshape your list without checking its length.
Makes sense?
can you explain me this -1, 1 further
now since I have to depict this predicted array on graph, I need to remove extra dimension, right?
like from [[]] to []
@violet crag, if you reshape your list of 100 elements into (-1, 1), it will be reshaped to (len(thatList), 1), i.e. (100, 1)
it is easier to know it this way
there is no extra dimension btw, technically there is, but no
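A compact sketch of the whole fit/predict round trip with the reshape. The dates and prices are invented (and perfectly linear, so the output is easy to check). Passing y as a plain 1-D list is a choice made here so that the predictions also come back 1-D and can be plotted directly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical numeric dates (days since start) and prices.
train_dates = [0, 1, 2, 5, 6]
train_prices = [100.0, 101.0, 102.0, 105.0, 106.0]
test_dates = [7, 8]

# X must be 2-D: one row per sample, one column per feature.
X_train = np.reshape(train_dates, (-1, 1))  # shape (5, 1)
X_test = np.reshape(test_dates, (-1, 1))    # shape (2, 1)

model = LinearRegression().fit(X_train, train_prices)

# Predict on the test dates, not the training dates.
pred = model.predict(X_test)                # shape (2,)
print(X_train.shape, pred.shape, pred)
```

If y had been reshaped to (-1, 1) as well, predict would return shape (2, 1), and pred.ravel() would flatten it back for plotting.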
Looks good
I mean the predictions are bad but that is pretty much what you should expect here.
MANOVA is your way to go. Now try that
If you could make money in the market doing linear regression there would be no market
time series and panel data in general require different approaches
Tutorial materials for the Time Series Analysis tutorial including notebooks may be found here: https://github.com/AileenNielsen/TimeSeriesAnalysisWithPython...
i dont know if you can manage watch this and do your assignment at the same time
but this video is gold
ooh I'll add it to the list of material that I've found on last two days, I have to read a lot
reading is the key
will do after assignment
if you can rwerite the text you read into the code and vice versa
you are good to go
rewrite*
you mean theory to practice?
you can do LSMA
with linear regression
it'll be better than one regression for the entire time series at least
you do a linear regression on the last X time points (default is 25) and use it to predict the next period
it's still pretty awful but it's better than predicting out months / years
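A minimal sketch of the LSMA idea as described: fit a line to the last `window` points and extend it one step. The function name is made up, and the test series is a perfect line so the result is easy to verify:

```python
import numpy as np


def lsma_forecast(values, window=25):
    """Fit a line to the last `window` points and predict the next one."""
    values = np.asarray(values, dtype=float)
    recent = values[-window:]
    t = np.arange(len(recent))
    # Degree-1 polyfit returns (slope, intercept).
    slope, intercept = np.polyfit(t, recent, 1)
    # The next period sits at t = len(recent).
    return slope * len(recent) + intercept


series = 2.0 * np.arange(40) + 3.0  # last value is 81, next should be 83
next_value = lsma_forecast(series)
print(next_value)
```

Calling it again after each new observation gives the rolling one-step-ahead forecast.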
"Estimated coefficients for the linear regression problem"
what does this mean in linear regression?
it's literally the coefficients for the line you draw
aah, slope of the line
slope for all the X variables + intercept
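To see "one slope per X variable plus an intercept" in numbers, here is a sketch using plain least squares on noiseless made-up data. sklearn's LinearRegression exposes the same quantities as coef_ and intercept_:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical X variables and a target built from known coefficients.
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

# Append a column of ones so the intercept is estimated as well.
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

slopes, intercept = coef[:2], coef[2]
print(slopes, intercept)  # recovers roughly [3, -2] and 5
```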
Hey
Man
I want to learn machine learning
https://pythonprogramming.net/machine-learning-tutorials/
With what should I start from here (what will be the most comfortable to start with)
well.. you should start with an objective..what are you learning it for?
to do data analysis - for business, finance or for other fields?
or to do ml for image or text processing?
these are very broad..there's more to it.. but it all depends on your objective so you dont end up all over the place
this is where you start
then come back for more
Now, WOW
I would like to start with data analysis ... essentially transition from QA automation to data analysis... finding it a bit difficult to figure out how to go about it
Has any of you heard of any good bootcamps for data science in Europe? There's plenty of them advertised online, but which are any good? Or any advice on how to filter?
@lapis sequoia in Berlin there was one for 8k. I saw its agenda and advised my friend to learn it on his own
It was like a year ago and I was learning myself too
So it's not really worth it? I understand all the information is online and accessible for free, but I am becoming overwhelmed having no guidance, and the amount is so vast I just feel lost.
what do you need guidance with ?
@supple ferry As an experienced programmer desiring to move to data analyst role I dont know a definite path to reach data analyst role... I am looking at some online courses but not sure if they are worth the cost...
@buoyant trellis this can be a good guide
It has also programming things, you can ignore them of course
@lapis sequoia also for you
ok
I ll look through that, thanks! @supple ferry
I guess my main confusion comes from that most courses i looked at for data analysts/ data scientists suggest a package of skills to master, but when I look at job ads they require a phd in physics or mathematics... to my understanding one needs statistics the most and the technical programming tools, but again, i guess i have the whole idea wrong
my understanding was statistics and probability... last time I switched career path best thing that worked was to demonstrate through projects /blogs etc... and that is what I am planning now as well.. @lapis sequoia I think if you have projects to show there is always someone who is there to hire
I do have project but they are far behind a phd research projects. Would you have any references to share to what kind of project is expected for an entry level analyst?
honestly no... but I will go in for some project from kaggle.com and put it on my cv once I am there
@lapis sequoia it's not just statistics and probability, it's also calculus and linear algebra and some other maths here and there. People want others with really good maths backgrounds, people that can understand papers that use differential geometry or Poincare embeddings of trees into hyperbolic surfaces.
To compete with those guys, you have to know maths at a PhD level and that's going to be very hard
Most courses in my experience are only training computer science students and people changing careers for the basics of the field. I'm not sure how accurate a look that gives into the kind of ability employers want, especially since what employers want can vary
@lean ledge wow so differential geometry is a thing??? I really want to know so I can choose to do college course in it
It is a thing
Because everyone I asked was like "oh it's not necessary"
Oof
If it's a thing I'll do it
Prerequisites are real analysis and abstract algebra and number theory here, probably want to do those first
Ahhh interesting
I've done the linear algebra, calculus, statistics package
But have not touched into those real analysis or differential geometry
Linear algebra is not abstract algebra
Ok
That is interesting... I think I should give up the idea of trying to change my career without actually going back to school and starting all over from scratch, especially since I do not hold a degree in a quantitative field.
I'm really close to finishing my BS, and will probably get my MS in data science, but really curious about the difficulty for a non-degree holder trying to get into a data science career off boot camps or self training alone.
I never hear any first hand experiences, just a bunch of those ads everywhere on YouTube and whatnot.
I don't know a single person who's self taught themselves into a data science role completely
Dunno about how difficult it would be but anecdotally everyone I know has some degree, mostly in maths/Phys/stats or engineering, with some CS
I'm not familiar with the job market so forgive my ignorance, but surely most entry-level data analyst positions require only a bachelor's? That's the case with the first few results on glassdoor when i search "entry level data analyst". The Poincaré embedding stuff (https://arxiv.org/pdf/1705.08039.pdf) is still on the research stage no?
^sorry for typing something when you have a question but @lean ledge why do u say u dont know a single person who's self taught themselves data science
What does inline for an embedded message mean?
data analyst positions arent data science positions. in my experience at least, data analysts are mostly full of non-technical people (eg. business majors etc) manually looking for trends. lots of tableau visualisations and whatnot as opposed to data science which tends to be lots of machine learning, statistics and modelling, very math heavy.
there are definitely lots of positions in data science that are satisfied with bachelors level too, I was just trying to give a motivation behind why some specific positions like the one they are finding might be looking for math/physics majors. @feral lodge
the poincare example was just me looking for a complicated sounding paper I saw recently, not an actual serious thing people might implement but I am reading research papers so often for data science stuff, I dont think research stage stuff is off limits for most data scientists at all
@lapis sequoia I know lots of people who've taught themselves data science, just not those with no degree at all who've taught themselves data science to a working capacity. lots of engineering/phys/math/CS majors who self-taught themselves it all, just dont know anyone without a degree at all or from non-technical backgrounds
Well, for data scientist I understand the high requirements, but I was referring more to entry level data analyst. I do see jobs in US that would fall into my educational background and training well, but in EU requirements seem to be through the roof.
Terms are used in a slightly different manner from place to place. Here, it's used for people that have an actual analytical background and an understanding of, say, R/Python, SQL, modelling, and stuff like that
I've just pulled up three random listings for "data analyst" and they all require modelling experience, a mathematical/statistical background, and a university degree. They do pay well, though.
In the EU the situation is quite different though. When I was working in Germany, I was a data analyst, but I was also doing data science stuff, like building predictive models for forecasting. The job description asked for a solid math and programming background. However, when I later compared my skillset to the job market in the US, I saw that they were requiring much less for a data analyst there than in the EU. A solid data analyst in the EU can easily be a senior or lead data analyst in the US for more money
Well I decided to start learning machine learning and some guy here told me to try the Coursera tutorial
Anyone knows if its a good one??
I'm doing edX courses
I don't know if this course is good myself because I am new to this
So I am looking for someone who had learned it already and can tell me what he did
@vagrant vector try that course, you will automatically know what is good or bad for you
anyone know a good package for doing arima modeling? seems more like a job for R
@kindred stirrup statsmodels has arima functions
@supple ferry ahh thanks man i was spending several frustrating hours trying to download this auto.arima package
FWIW I work with a lot of data analysts and all of us have a master's degree. The only Data Science people have PhDs
@kindred stirrup you're welcome
I think the power of having a PhD is the experience you get while doing scientific research. It teaches you to look not just at the result, but also at the process and how it should be done
What library would be the best for machine learning?
sklearn
It automatically deletes random 50% of your data? @lapis sequoia
Reference to Thanos
I know..
I am being too nerdy
Which alternatives can I use for measuring the model performance in Conditional Logit?
alternatives to ROC curve
everything that you use for linear regressions
it runs MLE so log likelihood
- all the lag + normality tests
but you should get a coef, std error, z & p score, conf intervals, chi2, pseudo R2, F score (prob > chi2), log likelihood, etc.
ROC curve is pretty shit for logit/probit tbh
pseudo-R2 has Aldrich-Nelson, Cragg-Uhler (Cox and Snell) and its variants, Estrella and its variants, Veall-Zimmermann
@supple ferry
And, of course, AIC / BIC
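For reference, a minimal sketch of how AIC, BIC and McFadden's pseudo-R2 fall out of the log-likelihoods. The function name and the numbers are made up for illustration; for a conditional logit you would plug in the fitted log-likelihood and whatever null model you compare against.

```python
import math


def fit_statistics(llf, llnull, n_params, n_obs):
    """Return AIC, BIC and McFadden's pseudo-R2 from log-likelihoods.

    llf      -- log-likelihood of the fitted model
    llnull   -- log-likelihood of the null (baseline) model
    n_params -- number of estimated parameters
    n_obs    -- number of observations
    """
    aic = 2 * n_params - 2 * llf
    bic = n_params * math.log(n_obs) - 2 * llf
    mcfadden_r2 = 1 - llf / llnull
    return {"aic": aic, "bic": bic, "mcfadden_r2": mcfadden_r2}


# Illustrative numbers only, not from a real model.
stats = fit_statistics(llf=-120.0, llnull=-150.0, n_params=3, n_obs=200)
print(stats)
```

Lower AIC/BIC is better when comparing models fitted to the same data; the pseudo-R2 is only meaningful relative to the chosen null model.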
whats happening when grouping by L ??
It's grouping by the specified row indices, I think
Fits the numbers it produces
(0 + 2 + 5) / 3 = 2.33333
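A small sketch of what grouping by a plain list does, assuming the original frame looked something like this (the data and the list `L` here are invented to reproduce the 2.33333): pandas lines the list up with the rows by position and treats each entry as that row's group label.

```python
import pandas as pd

df = pd.DataFrame({"x": [0, 1, 2, 3, 4, 5]})

# Hypothetical labels, one per row: rows 0, 2 and 5 land in group 0.
L = [0, 1, 0, 1, 1, 0]

means = df.groupby(L)["x"].mean()
print(means)
```

Group 0 averages the values 0, 2 and 5, which is where (0 + 2 + 5) / 3 comes from.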
@void anvil , thank you!
I tend to agree with you on the ROC curve; however, I am asked to provide some in order to explain the performance of the model to "business" people. I am using statsmodels.discrete.conditional_model.ConditionalLogit for this, which is in master, but not in a published release. They don't have predict for that.
I understand the reason. It is calculated by assuming group-related fixed effects, and predicting just out of the box is statistically stupid.
sklearn has logistic regression
@supple ferry
is a good introduction as well
I have an amazing econometrics book, but it's hand written by our teacher and I can't really scan it up at the moment
But basically if you just follow that lecture you can explain in simple terms what everything means
If you have access to STATA (bleh), it's super easy to run and get everything precalculated by just doing load data, x = these columns, y = this column, logit(x,y)
It's not nearly as powerful but it generates nice, neat tables with nearly 0 effort
iirc R has a ton of business-report level stuff as well
In MPL how do I get these 2 numbers to be the same? The graph isn't scaling properly.
yes i got it
What do you mean by scaling correctly, @lapis sequoia ? To me it just looks like your line ends at (3.00, 6.0)
Or, y = 2x
how much do you think affirmative action plays into graduate admissions
I want the 2 axis to be same
@lapis sequoia , you can access and change ticks via plt.xticks(np.arange(min(x), max(x)+1, 1.0))
if you want to set limits, matplotlib.axes.Axes.set_xlim
I'd say matplotlib is like the go-to for if you just want to quickly visualize something
I know that Seaborn and Plotly generally produce better looking visualizations for if you want to present your results in a more sophisticated way
Also Plotly is I think appropriate for if you want to have interactive visualizations
Not sure what you mean by that, outliers are just data points right
You might need to select the correct visualization type / technique but that's not really a package specific feature
cuz like
theres a mathematical definition of an outlier
so like
is there a way you can plot.get_outliers() or something?
I'm not aware of that being a feature, but I would argue that's not the package's concern
You can always extract them yourself and feed them to the visualization framework
Or colour them differently
yeah but like
how would I know which ones to color differently
how would I extract the outliers
You said there's a mathematical formula you would like to use, why not apply it to the data and extract the data points that you wish to work with?
hmm
Then feed them to matplotlib separately
I said mathematical definition but I guess I could make it into a formula
Which definition are you working with
any data point more than 1.5 interquartile ranges (IQRs) below the first quartile or above the third quartile.
that
I would extract those thresholds from your dataset and then loop over it and filter the outliers
And input them separately
You probably only need to extract the indices, I'm fairly sure matplotlib can colour by indices
For the record I'm not saying that there aren't data vis packages that can do that, I just wouldn't expect it from them
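A minimal sketch of the "extract them yourself" route with the 1.5 IQR rule from above. The helper name and the sample data are made up; the boolean mask is the thing you would hand to the plotting code.

```python
import numpy as np


def iqr_outlier_mask(values, k=1.5):
    """Boolean mask: True where a point is more than k IQRs outside Q1/Q3."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


data = np.array([10, 12, 11, 13, 12, 11, 40, 12, 10, -15])
mask = iqr_outlier_mask(data)

outliers = data[mask]                    # the extreme points themselves
colors = np.where(mask, "red", "blue")   # one colour per point
print(outliers)
```

No explicit loop is needed, and `colors` can go straight into something like `plt.scatter(x, y, c=colors)` to draw the outliers in a different colour.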
anyone know how you can add labels to an image?
I have images that are named XR_ELBOW_patient00011_negative_0
as well as images that are named XR_ELBOW_patient00016_positive_4
essentially my labels are positive or negative
what im looking to do is run them through a CNN model
@sharp jetty , you can just run a script which will parse the names of images and then assign that array to label column
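Something along these lines would do it, assuming every file name contains `_positive_` or `_negative_` as in the two examples above (the `.png` extension and the helper name are guesses):

```python
from pathlib import Path


def label_from_name(filename):
    """Return 1 for a 'positive' study and 0 for a 'negative' one."""
    stem = Path(filename).stem.lower()
    if "_positive_" in stem:
        return 1
    if "_negative_" in stem:
        return 0
    raise ValueError(f"no label found in file name: {filename}")


# Hypothetical file names following the pattern from the question.
names = [
    "XR_ELBOW_patient00011_negative_0.png",
    "XR_ELBOW_patient00016_positive_4.png",
]
labels = [label_from_name(name) for name in names]
print(labels)
```

The resulting list lines up with the list of image paths, so it can be used directly as the label array for training.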
Hi, I'm having a problem with some feature extraction code for speech recognition. I'm trying to use a function to convert from linear magnitude spectrum to mel spectrum. I think I may just be misunderstanding how to use certain functions in Python, but any pointers would be great!
```python
def make_mel_filterbank(self):
    lo_mel = self.lin2mel(self.lo_freq)
    hi_mel = self.lin2mel(self.hi_freq)
    # uniform spacing on mel scale
    mel_freqs = np.linspace(lo_mel, hi_mel, self.num_mel+2)
    # convert mel freqs to hertz and then to fft bins
    bin_width = self.samp_rate/self.fft_size # typically 31.25 Hz, bin[0]=0 Hz, bin[1]=31.25 Hz,..., bin[256]=8000 Hz
    mel_bins = np.floor(self.mel2lin(mel_freqs)/bin_width)
    num_bins = self.fft_size//2 + 1
    self.mel_filterbank = np.zeros([self.num_mel, num_bins])
    for i in range(0,self.num_mel):
        left_bin = int(mel_bins[i])
        center_bin = int(mel_bins[i+1])
        right_bin = int(mel_bins[i+2])
        up_slope = 1/(center_bin-left_bin)
        for j in range(left_bin, center_bin):
            self.mel_filterbank[i, j] = (j - left_bin)*up_slope
        down_slope = -1/(right_bin-center_bin)
        for j in range(center_bin, right_bin):
            self.mel_filterbank[i, j] = (j-right_bin)*down_slope
```
that's the function ^
```python
# for each frame(column of 2D array 'magspec'), compute the log mel spectrum by applying the mel filterbank to the magnitude spectrum
def magspec_to_fbank(self, magspec):
    # apply the mel filterbank
    fbank = np.convolve(self.make_mel_filterbank(), magspec)
    return fbank
```
that's where I'm trying to apply it to magspec
thanks
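Two things stand out in the code as posted: make_mel_filterbank stores the matrix on self but never returns it, so the call inside magspec_to_fbank evaluates to None, and np.convolve only accepts 1-D inputs. Applying a filterbank to every frame is usually a matrix product. A standalone sketch of that idea follows; the mel formulas are the common 2595 * log10(1 + f/700) convention, which is an assumption since the original lin2mel/mel2lin weren't shown.

```python
import numpy as np


def lin2mel(f):
    # Assumed mel formula; the original lin2mel was not posted.
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel2lin(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def make_mel_filterbank(lo_freq, hi_freq, num_mel, samp_rate, fft_size):
    """Return a (num_mel, fft_size//2 + 1) matrix of triangular filters."""
    mel_freqs = np.linspace(lin2mel(lo_freq), lin2mel(hi_freq), num_mel + 2)
    bin_width = samp_rate / fft_size
    mel_bins = np.floor(mel2lin(mel_freqs) / bin_width).astype(int)
    num_bins = fft_size // 2 + 1
    filterbank = np.zeros((num_mel, num_bins))
    for i in range(num_mel):
        left, center, right = mel_bins[i], mel_bins[i + 1], mel_bins[i + 2]
        for j in range(left, center):        # rising edge
            filterbank[i, j] = (j - left) / (center - left)
        for j in range(center, right):       # falling edge
            filterbank[i, j] = (right - j) / (right - center)
    return filterbank                        # the missing return


def magspec_to_fbank(filterbank, magspec):
    """magspec has one frame per column: shape (num_bins, num_frames)."""
    return filterbank @ magspec              # matrix product, not convolution


filterbank = make_mel_filterbank(0, 8000, 40, 16000, 512)
magspec = np.ones((257, 5))                  # dummy spectrum, 5 frames
fbank = magspec_to_fbank(filterbank, magspec)
print(fbank.shape)
```

If the log mel spectrum is what's needed, take the log of the result afterwards, with a small floor to avoid log(0).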
are there any things on Kaggle for students?
to practice doing data science
or should I just pick a competition and try it
I just need simple data science problem to practice the basics; ones that should take about an hour to solve
Hello guys, I have to write a short essay about a technological trend that will impact my target career (data scientist), and I was thinking about writing something about how increasingly complex models and automation are making it easier to make predictions, but since it's becoming more of a "black box" where understanding the model is not necessary, it might cause problems down the line.
sounds like a good idea?
Can anyone here give me a hand with Pandas?
@marsh fog , you can ask your question right away
@marsh fog , I could not find your question there. you can repost it here
I've got a list of dataframes and I'm using pd.melt to modify them: gdp_melt = pd.melt(dataframes[0].reset_index(), id_vars=['country'], var_name="year", value_name="GDP") gdp_melt.set_index(['country', 'year'], inplace=True) is there a way I can run a loop or function to melt all the dataframes in the list but change the value_name to apply to each dataframe?
will your dataset list be grouped somehow?
you can write a function which does that, but you will need to pass it the value for value_name
how would I go about doing that?
Yeah I have 3 dataframes contained in a list called dataframes so to access each one it's dataframes[0], dataframes[1] etc
do you need to modify those datasets, or return a new list with melted datasets
?
something like this maybe
in the last line i got a typo :D
melted_list.append
not meelted_list.append
Modify those datasets rather than return a new list
So something like this def melt_df(df): df = pd.melt(df.reset_index(), id_vars=["country"], var_name="year", value_name="GDP") df = df.set_index(['country', 'year'], inplace=True) return dataframes = [melt_df(df) for df in dataframes] but then it makes all my dataframes blank
blank dataframes is what I get xD
What does your melt_df function return? It looks like it's returning None at the moment
Does it modify something in place instead of returning something?
It's meant to modify the list of dataframes
So, mobile discord doesn't allow me to read the code normally @marsh fog. Ves is right
You are using list comprehension, it means your function must return something
def melt_df(df):
    df = pd.melt(df.reset_index(), id_vars=["country"], var_name="year", value_name="GDP")
    df = df.set_index(['country', 'year'], inplace=True)
    return df

dataframes = [melt_df(df) for df in dataframes]
maybe?
return the df you modified
unhashable list error again
Sorry @lyric canopy this is what I'm trying and It's throwing an unhashable list error def melt_df(df): df.reset_index(inplace=True) df.melt(id_vars=["country"], var_name=["year"], value_name=["GDP"]) df.set_index(['country', 'year'], inplace=True) return df dataframes = [melt_df(df) for df in dataframes]
can you show a picture of the traceback
value name doesnt seem like it should be a list
just do value_name="GDP"
also what IDE are you using? thats the nicest traceback ive ever seen
ty
It's quite nice
whats the traceback for the keyerror
i opened it on browser and zoomed
not sure, but it looks like set_index is expecting certain keys?
Something else Is wrong here though; because if I change the line to df.set_index(['country'], inplace=True)
It no longer throws an error
but
so it doesnt like 'years'
It does absolutely nothing to the dataframes
oh
it doesn't melt them at all
what library is this so i can look at some docs
pandas?
yes
ideally as well I need value_name to change for each dataframe within the list - But I'm just trying to get this to work first xD
because there are 3 datasets in a dataframe of their own - Each of them have a different variable: HDI,GDP, Unemployment figures along with countries and years. So the value_name needs to change for each
def melt_df(df):
    df.reset_index(inplace=True)
    df.melt(id_vars=["country"], var_name=["year"], value_name=["GDP"])
    df.set_index(['country', 'year'], inplace=True)
    return df
when you melt it, the id_vars is ["country"] only
and 'year' is a var
does that cause issues?
Nope
If I run this outside of a function
on a dataframe not in a list
It works fine
Is this sensitive data? Could you dump a snippet of df.to_dict() so we could mess around with it?
Oh how are you getting it into your df
Oh I see, yea a snippet of the converted dict might be easier, try pd.from_dict() and let us know the conversion params, too please
would you just like the datasets? xD
not sure if that's easier
Maybe, I just wanted to run one line to set it up,
Oh, now you'll have to give us all your code though
that's okay, there is only a couple of lines
just imports it and then renames a few columns and then the bit I'm stuck on aha
using melt
filenames = glob.glob("data-*.csv")
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f, encoding = "ISO-8859-1"))

def process_df(df):
    """
    input: unformatted data that comes from the same source so it has the same starting format
    output: processed data that has been re-formatted
    """
    df.drop(index = df.tail(6).index, columns=["Series Name", "Series Code", "Country Code"], inplace=True)
    df.rename(columns= lambda x: x[0:4], inplace=True)
    df.rename(columns={'Coun':'country'}, inplace=True)
    df.set_index('country', inplace=True)
    df.index = df.index.str.strip()
    return df.apply(pd.to_numeric, errors='coerce')

dataframes = [process_df(df) for df in dataframes]

#1 is GDP, 2 is internet-users, 3 is unemployment
def melt_df(df):
    df.reset_index(inplace=True)
    df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
    df.set_index(['country'], inplace=True)
    return df

dataframes = [melt_df(df) for df in dataframes]
@peak jetty I do much appreciate the help though dude! Thank you!
Ok, so it's not throwing an error now, but melt_df doesn't seem to be doing anything, that's where you left off?
```python
In [2]: new_dataframes = [melt_df(df) for df in inter_dataframes]
In [3]: inter_dataframes == new_dataframes
Out[3]: True
```
yes
Is the process function working correctly?
```python
In [2]: the_df = inter_dataframes[0]
In [3]: the_df.columns
Out[3]:
Index(['1960', '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1970', '1971', '1972', '1973', '1974', '1975', '1976', '1977',
       '1978', '1979', '1980', '1981', '1982', '1983', '1984', '1985', '1986',
       '1987', '1988', '1989', '1990', '1991', '1992', '1993', '1994', '1995',
       '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
       '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
       '2014', '2015', '2016', '2017', '2018'],
      dtype='object')
```
[0] is the GDP df?
yes
it should be the GDP df, but after running pd.melt df.columns should only return GDP column
because countries and years should be set as indexes
yes
So melt should change those Year columns into a single column and then you get the GDP values in their own column with the value_name part of pd.melt
Hmm, in the docs (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html) there is a value_vars param which seems to be missing from yours
I believe value_vars is the same as value_name
Are you sure? From their example:
```python
... var_name='myVarname', value_name='myValname')
```
They are passing both
oh yeah
Oh yea?
oh no
by default
value_vars is none
because it's columns to unpivot if not specified by id_vars
That would explain why it doesn't do anything
No, should be working?
value_vars : tuple, list, or ndarray, optional
Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
Oh I see, hmm but your id is the index
Can you actually do that? In their examples they never pass an index as the id_var
from what I looked up - To get around that this is used pd.melt(df.reset_index()
So it resets the index making the index a column again so that it can then be used by id_vars
Ah, got it, that would probably explain why it isn't blowing up
haha
that's why this df.reset_index(inplace=True) is on a separate line because if you try and run df.melt(df.reset_index, id_vars=["country"], var_name=["year"], value_vars="GDP") you get this
```
Out[10]:
               country  year  GDP
0          Afghanistan  1960  NaN
1              Albania  1960  NaN
2              Algeria  1960  NaN
3       American Samoa  1960  NaN
4              Andorra  1960  NaN
5               Angola  1960  NaN
6  Antigua and Barbuda  1960  NaN
```
Only thing I did was assign a new df to the commands already here
So that's only performed it on 1/3 dataframes?
```python
the_df = the_df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
the_df = the_df.set_index(["country"], inplace=True)
```
Literally all I changed
It seemed like df.melt() was returning a df but not actually modifying the df it was applied to
I hate how pandas does that
No it can't xD
I don't see one
I really hate how Pandas does that, and how there is df.melt() and pd.melt(df)
Yea
Oh and sorry, should be
```python
def melt_df(df):
    df.reset_index(inplace=True)
    the_df = df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
    the_df.set_index(["country"], inplace=True)
    return the_df
```
You can't even mix and match the assignments, what a mess that can become
mhmm it's telling me the_df is not defined
import glob

import pandas as pd

filenames = glob.glob("data-*.csv")
dataframes = []
for f in filenames:
    dataframes.append(pd.read_csv(f, encoding="ISO-8859-1"))


def process_df(df):
    """
    input: unformatted data that comes from the same source so it has the same starting format
    output: processed data that has been re-formatted
    """
    df.drop(
        index=df.tail(6).index,
        columns=["Series Name", "Series Code", "Country Code"],
        inplace=True,
    )
    df.rename(columns=lambda x: x[0:4], inplace=True)
    df.rename(columns={"Coun": "country"}, inplace=True)
    df.set_index("country", inplace=True)
    df.index = df.index.str.strip()
    return df.apply(pd.to_numeric, errors="coerce")


inter_dataframes = [process_df(df) for df in dataframes]


# 1 is GDP, 2 is internet-users, 3 is unemployment
def melt_df(df):
    df.reset_index(inplace=True)
    the_df = df.melt(id_vars=["country"], var_name=["year"], value_name="GDP")
    the_df.set_index(["country"], inplace=True)
    return the_df


new_dataframes = [melt_df(df) for df in inter_dataframes]
Maybe I missed another change?
I'm not getting any issues running that
All my values are blank
mhmm xD
oh no
It's working now
how shitty is that though from pandas
You're a legend
So now If I wanted to do exactly the same thing but for where you see value_name
for new_dataframes[0] is should be GDP, new_dataframes[1] it should be internet-users and 3 should be Unemployment
Start by adding a param to the melt function
```python
def melt_df(df, value_name):
    df.reset_index(inplace=True)
    the_df = df.melt(id_vars=["country"], var_name=["year"], value_name=value_name)
    the_df.set_index(["country"], inplace=True)
    return the_df
```
Hey y'all, I have a simple question when @marsh fog is finished with his...
Then it's up to you, instead of a list of dataframes you could loop through objects, where the object has a value_name and dataframe, but you should tie them together somehow, a dict might work, too
if I specify a dictionary of the value_names
How can I get the function df.melt to apply a dictionary name to the value_name
See my comment above, there are two changes to your melt function
No worries, happy to help, I learned there as much as you did
That's what it's all about! ahah
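To round this off, a rough sketch of tying each dataframe to its value name by zipping the list with a list of names. The toy frames and their numbers are invented; this version also avoids inplace so the input frames are left untouched, and it indexes by both country and year as in the very first attempt.

```python
import pandas as pd


def melt_df(df, value_name):
    """Turn one wide country-by-year frame into a long frame indexed by (country, year)."""
    melted = df.reset_index().melt(
        id_vars=["country"], var_name="year", value_name=value_name
    )
    return melted.set_index(["country", "year"])


def toy_frame(offset):
    # Stand-in for one processed dataframe: countries as index, years as columns.
    return pd.DataFrame(
        {"1960": [offset + 1.0, offset + 2.0], "1961": [offset + 3.0, offset + 4.0]},
        index=pd.Index(["Albania", "Algeria"], name="country"),
    )


dataframes = [toy_frame(0), toy_frame(10), toy_frame(20)]
value_names = ["GDP", "internet-users", "unemployment"]  # same order as the list

melted = [melt_df(df, name) for df, name in zip(dataframes, value_names)]
print(melted[0])
```

One caveat: glob.glob doesn't promise any particular file order, so a dict keyed by file name is a safer way to decide which name belongs to which frame than relying on list position.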
so, i have a few machine learning scripts. id like to write a program that takes a specified dataset and evaluates that dataset on each of the ML scripts, collects the results and then outputs the statistics for each script.
i was thinking about writing an evaluation script that uses subprocesses to run the ML scripts, optionally with some passed args, and starts the next subprocess after the current one is finished.
if i run a testscript that has basically only output = subprocess.check_output(["python", "Net1.py", *some args*]) in it, i can see the errors that Net1.py throws when my assert(condition) statements are false, but if they arent and the program just runs as intended, i dont get any feedback...
i havent worked with subprocesses before, am i doing anything wrong here? or more general, is there a better way to transfer results from the ML scripts to the testscript without storing the results in say textfiles somewhere to then just read them?
@cursive glade i don't know how relevant it is, but I had problems with subprocess and multi processing. Depending on IDE I could either see my errors or not. When I was running them from command line I could see all errors, ipython console was outputing nothing
i usually log remotely onto a machine with the necessary gpus so im not rly using any IDE, just execute scripts from the shell with python script.py *args*
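One likely explanation for the "no feedback" part: check_output hands the child's stdout back as its return value instead of echoing it, while stderr still goes to the terminal, so tracebacks show up and normal prints don't unless you print `output` yourself. A sketch of passing results back without temp files, assuming each ML script prints one JSON line as its last line of output (the child here is a stand-in one-liner, not Net1.py):

```python
import json
import subprocess
import sys

# Stand-in for an ML script: prints its metrics as a single JSON line.
CHILD_CODE = (
    "import json, sys; "
    'print(json.dumps({"script": sys.argv[1], "accuracy": 0.9}))'
)


def run_script(args):
    """Run a Python child process and parse the last stdout line as JSON."""
    completed = subprocess.run(
        [sys.executable, *args],
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(completed.stdout.strip().splitlines()[-1])


result = run_script(["-c", CHILD_CODE, "Net1"])
print(result)
```

With real scripts the call would look like `run_script(["Net1.py", "--some-arg"])`, one after the other in a loop. If the scripts can be imported instead, calling a function from each and skipping subprocess altogether is simpler still.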
Hey guys, I'm new to python, I need to use numpy to store multiple 2d arrays in a big 3d array basically a vector of 2d arrays, [number_of_arrays,array_x,array_x], I have troubles doing this
I tried to use many things, the single one that remotely worked is .append
but only without axis and it flattens the input
when I specify axis it throws errors
import numpy as numpy

def load_images(path):
    images = numpy.array([[[]]])
    index = 0
    while True:
        try:
            print("try: " + path + "_" + str(index) + ".npy")
            images = numpy.append(images, numpy.load(path + "_" + str(index) + ".npy"))
        except Exception as ex:
            print(ex)
            break
        index += 1
    print(images.shape)

load_images("assets/car")
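A common way around this is to collect the 2-D arrays in a plain Python list and call np.stack once at the end, which adds the new leading axis. A sketch under the assumption that the files are named `<prefix>_0.npy`, `<prefix>_1.npy`, ... and all have the same shape; the temp directory is only there to make the example self-contained.

```python
import os
import tempfile

import numpy as np


def load_images(prefix):
    """Load prefix_0.npy, prefix_1.npy, ... and stack them into one 3-D array."""
    arrays = []
    index = 0
    while os.path.exists(f"{prefix}_{index}.npy"):
        arrays.append(np.load(f"{prefix}_{index}.npy"))
        index += 1
    # np.stack needs at least one array and identical shapes.
    return np.stack(arrays, axis=0)


# Demo with three dummy 4x4 "images" written to a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "car")
    for i in range(3):
        np.save(f"{prefix}_{i}.npy", np.full((4, 4), i))
    images = load_images(prefix)

print(images.shape)
```

Appending to a list and stacking once is also cheaper than numpy.append in a loop, which copies the whole array on every call.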
Hey all, I am looking to work on a Python project which is essentially a simulation/simulation-like, and based on 3d voxel space/octrees (same thing?). I know I could literally just instantiate a raw block of data with numpy and then fool around with it, but unless I was misunderstanding things, I was under the impression that once your volumes start increasing in size (lets say 256^3), that becomes extremely cumbersome, inefficient, and slow, even with numpy array objects. As a very brief description of my usecase needs, I'm not doing many, or any, matrices operations or transforms, etc. Mostly just running a finite voxel space and performing checks/modifications on specific voxels every tick.
Is there a good library out there for this? Or am I mistaken in the first place for thinking that I need a special library to do this properly?
@analog helm There's nothing that can prevent large matrices from slowing down your computer by consuming memory and computational time unless you know more about the matrix. In particular, if the matrix is almost empty, you can apply sparse matrix optimisations and so on
Other than that, if all of it is filled with no pattern, nothing can help
there will be large portions which are empty, but by no means the majority. How do games which run on standard PC hardware handle this? EG Minecraft or Dwarf Fortress. Am I just seriously overestimating the load that kind of environment applies? I kind of assumed they did something special to work with their data sets. Is it really just an array similar to what numpy would give me?
and thanks for the response btw @lean ledge
You split it up into chunks and only deal with small chunks of the data at a time. While the other chunks arent loaded, they're stored on storage rather than in memory
That's why as you walk over to an area that hasnt loaded yet, it reads it from disk and loads it
As you go away from a chunk, it stores it back into disk
yea, for MC. But this is a finite space, which will be permanently loaded. I should have clarified on that one, my bad. More just thinking of a specific space in MC, and managing the data in that area. If you're familiar with Dwarf Fortress, that is a much better comparison for what I expect to be doing
Never played DF, sorry
well, just imagine a finite space in Minecraft I guess, with tons of entities and interactions. All the entities are bound to the octree coordinate system as well.
I know much of MC's data complexity has to do with buffering and streaming data off of and onto disk, but i was under the impression even besides that, there was some special handling going on. I could be wrong though!
There's nothing you can really do to avoid storing all the data you have to show in memory. It's not generally a problem since unless the memory is starting to get filled, it doesnt take much computation to have voxels stored. Minecraft often doesnt have thaaat many entities and obvious optimisations are made where you can save on interactions when they dont matter. I have no idea how Minecraft in particular works, though I'm sure some others here do
Mhmm, it's not so much the data storage its self Im concerned about. RAM is cheap these days. More the iteration and random access of data
the only real "complexity" in storage Im aware of is differences in data density (ie, one voxel might need to only store a single byte, another voxel might need to store several), but from what I understand that can be easily resolved by having a base voxel array of single bytes, with a sparse voxel tree parallel to store extra info as necessary
It's all in memory! It can be accessed randomly as it pleases with no slowdown. You give modern computers less credit than they deserve given they can render complex 3d scenes, I feel like IO with the data isnt where the main slowdowns would be concerning: rendering would be the bigger problem since that's what needs to be done real time and is just a time consuming process in comparison to fetching whatever few blocks exist
fair enough
On that topic, are there existing libraries which specialize in rendering voxel data? Or am I reaching on that one?
My main interest here is really just the concept its self, so the less I have to reinvent, the better. The underlying engine is of little interest to me.
Something which has some amount of pre-existing code for determining what is or isnt visible based on what voxels are surrounding an area, what is or isnt transparent, etc
volumes would be dynamically generated, and the user would be able to specify which areas of the volume they want to look at, so which voxels are or arent visible in any given situation is dynamic. Dunno if thats something which can be semi-automated via library, or if it is.
In either case, thanks for your input on this
Cant say I know much about gamedev, it just intersects with other things I actually enjoy. But there are voxel based game engines and it might help to see how they handle things. Probably some combination of not worrying about what isnt visible, combining many blocks into the same mesh with fewer vertices, doing weird stuff with lighting, chunks, etc
Alright, I'll keep looking around at other projects, I know of a few. Thanks for talking it out with me @lean ledge!
hello guys
hello, if you want to ask a question can you ask it
Does anyone have experience with Jupyter Notebooks? Can't seem to make my kernel work
Or any kernels. Managed to install everything correctly on another machine, but this one is just throwing errors
can you also paste your errors in formatted way?
Which tensorflow you have and which python?
Is there any open source bot that uses natural language commands out there?
Hello
```python
import tensorflow as tf
# training data
x_train = [1, 2, 3, 4]
y_train = [0, -1, -2, -3]
W = tf.Variable([0.3], tf.float32)
```
is 0.3 the initial value of W?
If so, how can I print the value from W?
You'll have to make a session and evaluate the computation graph for the variable. Pass it to a session.run() call after initialising the variables @wraith sage
It just initialises variables. Tensorflow objects are abstract and you're essentially compiling the logic you want to run into a computational graph and then working on it in a session
Greetings, im trying to create multiple bags for a sort of bootstrap aggregation algorithm
is there a more efficient way than subsampling the dataset n times with pandas.DataFrame.sample ?
like, for example, at least something similar but with a n_samples parameter so i dont have to do a list comprehension or iterate
Not on pandas
But on sklearn.ensemble you can find what you need
how so? they only have the classifiers made with ensembles, cant find anything regarding the subsampling part, unless i go into source code
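One option that stays outside the classifiers: draw all the row indices for every bag in a single NumPy call and then index the frame once per bag. A sketch, with the helper name and the toy frame made up for illustration:

```python
import numpy as np
import pandas as pd


def make_bags(df, n_bags, seed=0):
    """Return n_bags bootstrap samples of df (same size, drawn with replacement)."""
    rng = np.random.default_rng(seed)
    # One call draws every index for every bag: shape (n_bags, len(df)).
    idx = rng.integers(0, len(df), size=(n_bags, len(df)))
    return [df.iloc[row] for row in idx]


df = pd.DataFrame({"feature": range(10), "target": [0, 1] * 5})
bags = make_bags(df, n_bags=5)
print(bags[0])
```

sklearn.utils.resample does the same for a single bag and has an n_samples argument, and BaggingClassifier does the subsampling internally if you don't need the bags themselves.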
using a CNN for a binary image classification, should I use a validation set or simply train and then run it through a test set?
```python
def mbo(n, others):
    not_first = False
    for i in others:
        if not_first:
            x = bin(n ^ i).count('1')
            if x == 0:
                return 0
            elif x < lowest:
                lowest = x
        else:
            lowest = bin(n ^ i).count('1')
            not_first = True
    return lowest

s = set()
for i in range(int(input())):
    e, a = input().split(' ')
    if e == "1":
        s.add(int(a))
    else:
        print(mbo(int(a), s))
```
This is apparently really inefficient
So I am trying to find a way to reduce the complexity
What this is supposed to do is
Person1 and Person2 are playing an XOR game. Initially, Person1 has an empty set of integers. Then a sequence of N events happens. There are two types of events:
Person1 chooses integer A and adds it to the set;
Person2 chooses integer A and passes it to Person1 who finds integer B in the set such that integer A⊕B contains minimal possible number of 1s in its binary representation. Here ⊕ is a bitwise exclusive or operation, for more details check Wikipedia page.
Your task is to help Person1 finding minimal possible number of 1 bits in binary representation of A⊕B.
Input
The first line contains integer N. Each of the following N lines describes an event as two integers T and A separated by a single space. Here T is an event type.
Output
For each event of the second type print the corresponding minimal number of 1 bits in a separate line.
Don't need code, need algorithm
No one?
you should probably go to help and not data science
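One possible direction, assuming the integers fit in a small number of bits (the statement above gives no bounds, so this is a guess): the number of 1s in A⊕B is the Hamming distance between A and B, so you can keep a table with the distance from every possible value to the nearest member of the set. Each insertion runs a breadth-first search that flips one bit at a time and only continues where it improves the table; queries are then a single lookup.

```python
from collections import deque


class MinXorPopcount:
    """Distance table over all values of a fixed bit width (assumed small)."""

    def __init__(self, bits):
        self.bits = bits
        # bits + 1 is a sentinel meaning "set is still empty".
        self.dist = [bits + 1] * (1 << bits)

    def add(self, a):
        if self.dist[a] == 0:
            return  # already in the set
        self.dist[a] = 0
        queue = deque([a])
        while queue:
            x = queue.popleft()
            for b in range(self.bits):
                y = x ^ (1 << b)  # neighbour at Hamming distance 1
                if self.dist[y] > self.dist[x] + 1:
                    self.dist[y] = self.dist[x] + 1
                    queue.append(y)

    def query(self, a):
        return self.dist[a]


game = MinXorPopcount(bits=4)
game.add(5)                # 0101
first = game.query(7)      # 0111 vs 0101 differ in one bit
game.add(7)
second = game.query(7)     # exact match
third = game.query(0)      # closest is 5, two bits away
print(first, second, third)
```

Every table entry can only decrease, at most bits + 1 times, so the total work over all insertions is about 2^bits * bits^2 no matter how many events there are. For wide integers the table gets too big and a different idea is needed.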
@sharp jetty you don't need a validation set
@lapis sequoia why is that?
shouldnt i run the model on a validation ,save the weights and then run it through a test?
depends on the problem
if it's binary classification.. you should have enough examples to not need additional validation
what sorta images are you trying to classify
and how many examples do you have for training
for each class
you don't need to do validation unless your classes are imbalanced..
this type of problem requires that you have large enough training and test sets.. then just plot/get roc for your predictions
well im looking at different body parts individually, but for the most part i dont have class imbalance
on the low side i have about 500 images per class and on the high side i have about 5000 per class
@lapis sequoia You definitely need a validation set
Binary classification does not dictate whether or not you have enough examples
@sharp jetty Dont listen to them, you always need a validation set to prevent overfitting. If you're really short on data, maybe dont have a test set and use validation accuracy at face value
You have no idea when to stop training without validation
yea thats what i was thinking
@lean ledge thanks again
btw so when i run my model and obtain the highest possible accuracy on validation do i save the weights and run it on a test?
Yeah, generally how it works. Most people's workflow looks like giving the framework both train set and validation set and then monitoring the output values on tensorboard or equivalent until validation stops improving (in which case you save and try to drop learning rate further) or starts getting worse (overtraining). It's a good idea to use something like TF's Object detection API or similar since that has essentially everything set up for you with pretrained networks to retrain, all outputs configured as necessary etc
Instruct your framework to just output a model every epoch for the latest 5-10 epochs or something along those lines. I think object detection API is 5 by default but I prefer 10 because I tend to miss a bit
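The "stop when validation stops improving" part boils down to something like this. It's a framework-free sketch with made-up loss values; in real training the loop body would run one epoch and the "new best" branch is where you'd save the weights.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stopped_at) for a sequence of per-epoch validation losses.

    stopped_at is None if training ran through without triggering the stop.
    """
    best_loss = float("inf")
    best_epoch = None
    stopped_at = None
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: this is where a checkpoint would be written.
            best_loss, best_epoch = loss, epoch
        elif best_epoch is not None and epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop.
            stopped_at = epoch
            break
    return best_epoch, stopped_at


# Illustrative numbers: validation improves until epoch 2, then degrades.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.50]
best_epoch, stopped_at = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stopped_at)
```

The weights from best_epoch are the ones to evaluate once on the test set. Note the last value in the list is never reached, which is the trade-off a patience setting makes.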
hmm a lot of what you just said im not familiar with lol. First time working with CNN's. Any resource you would recommend to implement what you described?
What's your specific task? @sharp jetty
binary classification, no object detection or anything?
binary classification of x ray images
abnormality vs normal
and i have 7 different body parts which ill run a model on separately
in which case im not sure if i should use the weights learned from one body part to another
How much data do you have? @sharp jetty
dont at me.. do you have a job or pushed anything to production?
all I see you doing is posting inane academia stuff with little to no background knowledge or know how of actual application..
i have anywhere from 500 to 5000 images per class for each body part
not you coldchillin.. this other person I have blocked who continues to feel like he needs to chime in on things he doesn't understand
just because validation is part of usual workflow doesn't mean it's something you use to update weights for every problem.. reason you don't use validation is you're tweaking weights to fit certain images, that isn't something you do when it comes to medical applications..
...validation isn't used to update weights, it's used to avoid the scenarios of overfitting
hmm, the dataset im working with is part of a competition which is separated by train valid and testing folders
he previously talked about using validation to update weight..
oh so it's for kaggle.. or something and not actual application
is that right?
yea something like that
I have both a job in a data science company where I do computer vision among other things and I'm currently at CSIRO'S Robotics and autonomous systems group chugging through literature reviews on my first computer vision paper for multi camera detection and tracking in an industry application. I've won hackathons using computer vision also. If you feel the need to disparage me because you don't like me disagreeing with you, then go ahead
not kaggle but a competition nonetheless, im just doing it for practice
@sharp jetty I'd start by trying to retrain a pretrained resnet model (avoid inceptionnet, it tends to have worse results in biomedical) on body part with the most data and then try to transfer learn with those weights on parts with smaller datasets
I've blocked you because you're an annoying person with no knowledge of actual application.. I don't feel the need to waste my time with people who feel the need to annoy and spew half knowledge..
by other classes you mean body parts?
also my aim was to try and produce a model from scratch and then compare that to one thats pretrained
right now im still working trying to figure out how to optimize my scratch model
if you're trying to compare architectures, it's only a fair battle if they're all trained on the same dataset
but im getting horrid results lol
yea i plan on training them both on the same set
just doing them seperately
models from scratch will rarely beat a model that's been pretrained and then trained again on the same dataset
yea i figured but is it that huge of a difference?
like im getting 46% validation accuracy on my model so far
can be pretty significant because training from scratch makes it easier to overfit. remember, something like imagenet has 14 million images and a model pretrained on it will have learnt very very generic features in early layers
it only needs to change some logic in the last few layers to start identifying new classes
hmm... what do you think of my validation accuracy results though is it something i should expect? for reference other ppl in the competition are getting like 70% test accuracy
cant say anything about validation accuracy without actually being there, lots of reasons it can be low. bad optimisation/training, problem with the model architecture, problem with datasets
!warn 389084425566289930 Your toxicity is not needed, telling another user "I've blocked you" when the user has done absolutely nothing to provoke you is unreasonable behaviour.
:incoming_envelope: :ok_hand: warned @lapis sequoia (Your toxicity is not needed, telling another user "I've blocked you" when the user has done absolutely nothing to provoke you is unreasonable behaviour.).
Oh, that's a very shallow network. Very much possible that itself is the problem. The more layers there are, the more complex things it can learn. Only a few layers cant learn very many things. For reference, state of the art in many applications have 100s of layers, many layers having complex structures like being a combination of multiple kinds of convolutions for complex features but without loss of speed, or gated units for easier optimisation and so on
I have no idea what kind of problem that is but adding a couple more layers may or may not help
If it doesnt, i'd try messing around with hyperparameters. Lowering learning rate after you've reached your 0.5 accuracy or whatever
layers as in adding more conv layers?
and yea i realized that my arch was pretty basic its something i tweaked from the MNIST project i did earlier
try reading up on the architecture of ResNet. it's essentially a simple architecture but it got state of the art a few years ago. it has lotssss of layers
CS231n is a great course to go through btw if you're interested in deep vision
should go over everything you need to start reading actual literature really
yea any resources would be a great help, it seems i know too little to do a decent from scratch model on my own
nah all good, knowing what you know is a great place to start off. just need something to accelerate you into recent literature
lectures 6, 7, 9 and 10 should get you up to date somewhat
everything 11 onwards is not exactly core deep vision knowledge
except maybe generative models
nice! thanks ill look into it
btw one more question, in terms of accuracy you usually get a higher accuracy for validation than test right?
as ur tuning the model weights on the validation
and running it on the test
if there is a discrepancy at all, yeah, validation is tiny bit higher because you stop based on when validation gets worse. but otherwise, the difference tends to be relatively minor-ish since validation only comes in play with figuring out when to stop rather than actually helping with training
so if i just wanted to show the efficacy of my model would it be enough to just show my validation score?
because in that case what would be the reason for running a test?
it's generally bad practice to report only validation scores in a formal setting (competitions, academia, etc), but nobody would really care too much in industry, where there's often so little data that having no test set is justified
hmm so it would be better if i just take a small sample of images for each class and use that as a test?
would 50 images per class suffice?
everyone I know does it all percentage-wise. Out of all the data you have, splitting it 70/15/15 sounds fairish for train/validation/test.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.33.1337&rep=rep1&type=pdf reading the conclusion of this paper seems to give a recommendation for the kind of split
but honestly, people's splits are anything from 70/15/15 to 70/20/10 to many in between
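A percentage-based split like the ones discussed above can be sketched in plain Python. This is only an illustration: the function name `split_dataset` and the default percentages are assumptions, and for image folders you would pass a list of file paths instead of numbers.

```python
import random

def split_dataset(items, val_pct=15, test_pct=15, seed=0):
    """Shuffle items and split them into train/validation/test lists."""
    if val_pct < 0 or test_pct < 0 or val_pct + test_pct >= 100:
        raise ValueError("val_pct + test_pct must be in [0, 100)")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_val = len(shuffled) * val_pct // 100
    n_test = len(shuffled) * test_pct // 100
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]      # everything left over is training data
    return train, val, test

train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))  # 700 150 150
```

For a classification task it is usually worth running this once per class so each split keeps roughly the same class balance.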
got it, thanks so much man you've been a tremendous help
nw, glad to help
just a quick question regarding CNN implementation, if I'm doing a binary image classification and running a model separately on 7 different body parts (ex hand, wrist, humerus, etc), so ill run a binary classification model on wrist only, then on the next part, and so on. Should I save the weights from each model and reuse them for the next?
@sharp jetty Id train for the part with the largest dataset first, and then reuse weights from the model for retraining other parts with smaller datasets
would i use the adjusted weights after each dataset?
ex i train on hand dataset -> use weights to train on arm->use new weights train on shoulder->use new weights to train on wrist
or just use one set of weights
That, I have no clue about. I don't expect the difference to be huge either way.
I hope this is the appropriate place to ask, which module would be better for working with csv files, panda or csv?
I'd assume CSV since its in the name but I heard a lot about panda as well
Depends on how you mean work with them. If you mean read a csv to a dataframe fiddle around with it and save it as csv I would use pandas.
From where i can find the projects for data analytics
I just want to take some data from a csv file and find averages, max, and mins
@polar acorn
I would use pandas, it has a read_csv function that handles most csv's. And finding summary stats is quite straight forward.
It might take some getting used first time you use it. But its a great tool to know, so it's a worthwhile investment.
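For comparison, here is roughly what averages, max and min look like with only the standard library `csv` module. The column name `score` and the sample data are made up for illustration. In pandas the equivalent is about one line, `pd.read_csv("data.csv")["score"].describe()`, which is a large part of why it gets recommended.

```python
import csv
import io
import statistics

def column_stats(csv_file, column):
    """Return mean, max and min of one numeric column of an open CSV file."""
    reader = csv.DictReader(csv_file)
    # Skip empty cells; everything else is assumed to parse as a number.
    values = [float(row[column]) for row in reader if row[column] != ""]
    return {
        "mean": statistics.mean(values),
        "max": max(values),
        "min": min(values),
    }

# Made-up data standing in for open("data.csv", newline="")
sample = io.StringIO("name,score\na,1.0\nb,4.0\nc,\nd,7.0\n")
print(column_stats(sample, "score"))  # {'mean': 4.0, 'max': 7.0, 'min': 1.0}
```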
Appreciate it
@magic mauve go for pandas all the way. It can work with lots of data types, and not only it allows to manipulate the data but also save it in other formats, plot (simple plots)
I'd assume matplotlib would be better for more complicated plots though?
Hey guys,
so I am quite new to programming in python but there is a question that is bugging me. There is a project in my head which would be awesome to realize but I dont know if it's achievable in any way (technology or skillwise).
The thing is I have been doing quite a lot of demo trading forex currencies for around a year. I did find a pattern that kind of worked for me, the problem is, that I can mostly only see the pattern in the past. If I look at old graphs I will see all the spots where it worked but I can never really make it work for me, if the plot hasnt formed yet.
Would there be a way to feed the computer some currency graph, give it also the points where I would have taken long or short trades because of my pattern with the stop loss (point where you would sell automatically, to cut losses) and where you would put the take profit and the machine (AI, neural network/whatever) spits out the most likely trade in the present?
@magic mauve , matplotlib is good. there are other libraries that are built on top of it and are higher level. maybe you will like them more. seaborn, plotly e.g
@proven ravine , there is an entire concept for such things. Algorithmic trading. Its been around for decades now and with the rise of AI and ML, lots of people find it interesting.
I hope it is okay to post reddit links here, so, if you are interested in such stuff, you can head to https://www.reddit.com/r/algotrading
@supple ferry but is it doable for an average person like me to program such an algorithm?
it is doable. but probably not be useful. you cant compete with bigger machines
It depends on the uniqueness of your signal
tbh
You can market make with relatively simple rules and be profitable, but you're being paid to take on risk
You would want to sample whenever you see a pattern emerging
and either rule-base sample in
for live trading
or have it predict if it's a trade or not
rule-base sample means its like a condition which needs to be met?
yes
for example, one of the patterns includes a certain formation of three candles, which need to be the lowest or highest of a certain move
ah okay
I was hoping I could just throw data into a black box and get an algorithm
as said, marking all the trades with numbers in a chart of the past (for example last 5 years) and the program figures out what it needs to do, to predict present trades
something like that. I thought thats how machine learning or AI works until now, but I guess I need to look into it a little more
that is how it works
you have to mark the trades to sample
and mark the trades for the prediction
Basically instead of your data set being image classification of a cat or a dog
you take your stock history or forex or w/e
and create the "This is an example of a long trade"
and create the "this is an example of a short trade"
then feed in new data and predict if you should go long, short, or not trade
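The labelling idea above can be sketched as turning a price history plus hand-marked trades into (features, label) pairs that any classifier could train on. Everything here is hypothetical: the function name `build_dataset`, the window length, and the prices are made up, and real features would be richer than raw prices.

```python
def build_dataset(prices, marked_trades, window=3):
    """Pair each bar with the prices leading up to it and its trade label.

    marked_trades maps a bar index to "long" or "short";
    bars that were not marked are labelled "none".
    """
    samples = []
    for i in range(window, len(prices)):
        features = prices[i - window:i]        # only past prices, no lookahead
        label = marked_trades.get(i, "none")
        samples.append((features, label))
    return samples

prices = [1.10, 1.11, 1.09, 1.08, 1.12, 1.15, 1.13]   # made-up closes
marked = {3: "long", 5: "short"}                      # trades marked by hand
for features, label in build_dataset(prices, marked):
    print(features, label)
```

Most bars end up labelled "none", so the classes are heavily imbalanced, which is one of the things that makes this harder than cat-vs-dog classification.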
hmm, any tips with which "software" or whatever I can do it "easily" or test it?
or any docs which I should read into ?
hahaha
do it in python
you'll have to create most things from scratch
good luck finding anything useful besides stats packages like ffn
and bt
but you're better off writing your own backtesting functions
99% of all signal generation is going to need to be scripted yourself
thanks for the help
thats at least a project which could be helpful in the future so I am more eager to continue with it
hey I am trying to plot a 2d gaussian distribution with matplot3d but I have some problems.
cset = ax.contourf(temp1, temp2, Z, zdir='z', offset=-0.15, cmap=cm.viridis)
ax.set_zticks(np.linspace(0,0.6,5))
that is my code to plot and the result. I would really appreciate it, if you could help me to make my plot OK
def distance_parse():
    hist_list = []
    with open('20190326_distance_v1', 'r', encoding='utf-8') as f:
        lines = f.readlines()
    for line in lines:
        try:
            m = re.search('sdf : (\d+.\d+)', line).group(1)
            hist_list.append(m)
        except:
            pass
    print(hist_list)
    plt.hist(hist_list, bins=5)
    plt.show()
This code will print and draw like this.
@void anvil do you know if codecademys machine learning course is helpful?
How can I fix it??
no idea
I guess `plt.hist(x, **kwargs)` could be the key to this problem, but I don't know these options.
ah.. maybe type..
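The "maybe type" guess is very likely right: `re.search(...).group(1)` returns strings, and `plt.hist` treats strings as categories rather than numbers. A minimal sketch of a fix, assuming the lines look like `... sdf : 0.25 ...` (the sample lines and the helper names below are made up):

```python
import re

# Escaped dot so "." only matches a literal decimal point.
DISTANCE_RE = re.compile(r"sdf : (\d+\.\d+)")

def parse_distances(lines):
    """Extract the numbers after 'sdf : ' and return them as floats."""
    values = []
    for line in lines:
        match = DISTANCE_RE.search(line)
        if match:  # skip non-matching lines instead of using a bare except
            values.append(float(match.group(1)))
    return values

def plot_distances(values, bins=5):
    """Plot a histogram; matplotlib is imported here so parsing works without it."""
    import matplotlib.pyplot as plt
    plt.hist(values, bins=bins)
    plt.show()

sample_lines = ["id 1 sdf : 0.25", "no distance here", "id 2 sdf : 1.50"]
print(parse_distances(sample_lines))  # [0.25, 1.5]
```

With the real file it would be used as `with open('20190326_distance_v1', encoding='utf-8') as f: plot_distances(parse_distances(f))`.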
@void anvil would you say my goal is more likely to be supervised or unsupervised learning? Cause I am not sure while reading the description of the two ... I am leaning more toward supervised?
if you have a trading pattern you want it to learn it's supervised
if you want it to learn trading patterns, unsupervised
ehhh I am confused
the second is just giving it data and waiting for what happens
?
and the first one is where I give it my long/shorts and it learns to use those in present situations
good read!
Hi guys, any of you familiar with R-studio?
@proven ravine
Supervised: You tell the algorithm this is class A, B, C. Determine if this new stuff is A, B, C.
Unsupervised: Here's a bunch of shit, tell me what you can predict
@vestal axle yes I'm familiar with R Studio
Have any of y'all used facebook's prophet? Wondering how it compares to ARIMA for time series
with regards to early stopping, should i stop my model when it has the highest accuracy for validation or when the val_loss values are lowest?
accuracy is presumably what you're interested in
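The two criteria can point at different epochs, which a small pure-Python sketch makes visible. The history values are made up and `best_epoch` is a hypothetical helper, not a Keras function. In Keras itself the choice is the `monitor` argument of the `EarlyStopping` callback.

```python
def best_epoch(history, metric="val_loss", patience=3):
    """Return (epoch_index, value) of the best epoch before patience runs out.

    history: list of per-epoch dicts such as {"val_loss": 0.7, "val_acc": 0.5}
    metric:  "val_loss" is minimised, any other metric is maximised
    """
    minimise = metric == "val_loss"
    best_idx, best_val, waited = None, None, 0
    for idx, record in enumerate(history):
        value = record[metric]
        if best_val is None:
            improved = True
        elif minimise:
            improved = value < best_val
        else:
            improved = value > best_val
        if improved:
            best_idx, best_val, waited = idx, value, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_idx, best_val

history = [
    {"val_loss": 0.70, "val_acc": 0.50},
    {"val_loss": 0.60, "val_acc": 0.62},
    {"val_loss": 0.55, "val_acc": 0.61},
    {"val_loss": 0.58, "val_acc": 0.66},
    {"val_loss": 0.63, "val_acc": 0.65},
]
print(best_epoch(history, "val_loss"))  # (2, 0.55)
print(best_epoch(history, "val_acc"))   # (3, 0.66)
```

Validation loss tends to be the smoother signal, while accuracy is the number that usually gets reported, so which one to monitor is a judgment call.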
hist = classifier.fit_generator(
    training_set_finger,
    steps_per_epoch=(5064/8),
    nb_epoch=60,
    validation_data=valid_set_finger,
    validation_steps=(461/8),callbacks=[earlyStopping,model_save,reduce_lr_loss])
i have a question regarding this code which is part of a CNN im running. If i run this code and another but using a different dataset are the two pieces of code interacting in any way?
hist = classifier.fit_generator(
    training_set_humerus,
    steps_per_epoch=(1230/8),
    nb_epoch=60,
    validation_data=valid_set_humerus,
    validation_steps=(288/8),callbacks=[earlyStopping,model_save,reduce_lr_loss])
as in when i fit on one set of data and then fit on another are the weights or other parameters being affected?
anyone done the titanic competition on kaggle?
@kindred stirrup In my experience prophet is quite good out of the box while ARIMA models would need more tinkering to achieve the same level of performance.
oh cool, TIL prophet
Hi, still looking for help with speech recognition front end in Python, would be great to get in touch with someone familiar with speech rec. Thanks!
need some help normalizing data
i have tried most of sklearn.preprocessing module but nothing suits my needs
i have a semi sparse set of values
and wish them to normalize them into floats from range [0,1]
scale, binarize, minmax dont work
what does "don't work" mean?
Hey guys, first timer, learning on the job ds - How do I detect abnormal subseries in periodic and synchronous time-series data?
if im doing a CNN model looking to classify abnormal vs normal images, would my dense output be 2 or 1?
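Either works for a two-class problem: one output unit with a sigmoid (paired with binary cross-entropy) or two units with a softmax (paired with categorical cross-entropy). The small check below shows why they are interchangeable: a softmax over the logits `[0, z]` gives the same probability as a sigmoid of `z`. The logit value is made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

z = 1.3  # made-up logit for the "abnormal" class
p_one_unit = sigmoid(z)               # Dense(1) + sigmoid
p_two_units = softmax([0.0, z])[1]    # Dense(2) + softmax, "normal" logit fixed at 0
print(abs(p_one_unit - p_two_units) < 1e-12)  # True
```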
@serene veldt
assuming you have a numpy array x (a plain list won't support the subtraction):
fixed_x = (x - x.min()) / (x.max() - x.min())
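For a plain Python list the same min-max formula can be written as a comprehension. A small sketch (the function name is an assumption) that also guards against the case where every value is identical, which would otherwise divide by zero:

```python
def min_max_scale(values):
    """Scale a sequence of numbers linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: the formula would divide by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(min_max_scale([0, 0, 5, 10]))  # [0.0, 0.0, 0.5, 1.0]
```

Zeros stay at zero as long as the minimum is zero, which is usually what you want for semi-sparse data.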
Yo guys, anyone here is experienced with numerical methods and jupyter?
hitting a wall trying to solve a ODE with bisection and FPI
Yeah I figured it out after some time
I was standardizing the values
Which was the problem
I'm trying to come up with a quick method for determining document similarity to a given query phrase (a list of words). I am fine if it's not extremely accurate, as long as the majority of documents returned are at least "somewhat" similar to the given query. It is important that it is fast though (I need to be able to process at around one hundred 1000-word documents per second, including tokenizing, vectorization, and scanning for matches). I have come up with the following method for extracting a set of similar words to a given term:
def simset(query_word, set_size=10, depth=1, size_decay = 0, threshold_score=0.33, with_scores=False):
    if size_decay < 0 or size_decay > 1:
        raise ValueError("decay rate must be in interval [0,1]")
    simsets = [set([query_word])] # list of sets, one per level, query word at root
    level_set_size = set_size # size of simset for each word @ current level
    for level in range(1, depth+1): # for each level of depth ...
        level_set = set()
        for word in simsets[level-1]: # for each word in the *previous* level's simset
            if level_set_size >=1:
                level_set.update({w[0] for w in vectors.most_similar(word, topn=level_set_size)})
            else: # if decay rate results in set size < 1, just get 1 word for each following level
                level_set.update({w[0] for w in vectors.most_similar(word, topn=1)})
        # remove words from previous levels, to avoid duplicate simset() calls
        for l in range(level-1,-1,-1):
            level_set -= simsets[l]
        simsets.append(level_set)
        level_set_size = round(level_set_size * (1 - size_decay))
    simset = set.union(*simsets)
    if with_scores:
        return {(vectors.similarity(query_word, w), w) for w in simset if vectors.similarity(query_word, w) > threshold_score}
    else:
        return simset
The above code gets a list of most similar words to query term, then for each of those it gets most similar, and so on .. to depth levels. The simset of a query phrase is just the union of the simsets for each of the individual terms. Based on this "simset", I then get a "similarity score" for a given text to the query phrase, by comparing the words in the text with the simset of the query phrase with the following function (similar to Jaccard similarity where I am measuring size of intersection between the two):
def skim_text(query, text, word_simset_size=10, depth=2, freq_threshold = 0.0005):
    query_words = word_tokenize(filter_stopwords(query))
    query_simset = set.union(*[simset(w, word_simset_size, depth) for w in query_words])
    query_simset = {(word_freq(w, 'en'), w) for w in query_simset if word_freq(w, 'en') < freq_threshold}
    text_word_count = len(text.split())
    if not text_word_count > 0:
        raise NLPError("skim_text() requires a non-empty string as input", text = text)
    match_count = 0
    escaped_words = [re.escape(w[1]) for w in query_simset]
    re_query_simset = r'(' + '|'.join(escaped_words) + ')'
    matches = re.findall(re_query_simset, text)
    for m in matches:
        match_count += 1
    if match_count == 0:
        return 0
    score = math.log(match_count+1) / math.log(text_word_count)
    print(f"DEBUG: match_ct: {match_count}, text_word_count: {text_word_count}, score: {score}")
    if score > 1:
        return 1
    else:
        if score <= 0:
            return 0
        else:
            return score
I am looking for feedback on how to improve my code to make it faster / more accurate, or if there are pre-existing tools that I should be using instead that can operate at the speed I need?
... also tips on how to just improve my python would be appreciated, because there are probably several things I'm doing here that could be done more efficiently
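A few things in `skim_text` stand out speed-wise: the simset is rebuilt on every call even though it only depends on the query, and the regex alternation matches substrings (so "cat" also counts inside "category"). One possible restructuring, as a sketch only: build the term set once per query, then score each document with a set lookup per token. The names `make_scorer` and `TOKEN_RE` are hypothetical, and the crude tokenizer stands in for `word_tokenize`.

```python
import math
import re

TOKEN_RE = re.compile(r"[a-z']+")  # crude tokenizer, lowercase words only

def make_scorer(query_terms):
    """Build the term set once per query and return a per-document scorer."""
    terms = frozenset(t.lower() for t in query_terms)

    def score(text):
        tokens = TOKEN_RE.findall(text.lower())
        if len(tokens) < 2:
            return 0.0  # log(1) == 0 would make the ratio below undefined
        matches = sum(1 for tok in tokens if tok in terms)  # O(1) lookup per token
        if matches == 0:
            return 0.0
        # Same log ratio as the original, clamped to at most 1.
        return min(1.0, math.log(matches + 1) / math.log(len(tokens)))

    return score

score = make_scorer({"cat", "kitten"})  # in practice: the query's simset
print(score("The cat chased another cat and a kitten"))
print(score("A category of dogs"))  # 0.0, no whole-word match
```

Here `score` would be created once with the output of `simset(...)` for the query and then called for each of the documents.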
How would you explain additive models to a 5 year old
Im clustering using kmeans and DBSCAN using sklearn
the parameters for kmeans are self explanatory just how many different clusters I want
but how do I go about chosing the parameters for DBSCAN, it wants eps and min_samples
Also, is there a library for Sum of Squared Error?
@obtuse skiff sklearn.metrics has mean_squared_error; take its square root for RMSE, or multiply it by the number of samples to get the sum of squared errors. If you want, you can calculate it yourself easily.
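Calculating the sum of squared errors by hand really is short. A minimal sketch with made-up numbers:

```python
def sum_squared_error(y_true, y_pred):
    """Sum of squared differences between two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred))

print(sum_squared_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0]))  # 1.25
```

For k-means specifically, a fitted sklearn `KMeans` object exposes `inertia_`, which is the sum of squared distances of the samples to their closest cluster centre.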
@supple ferry do you know how I do the parameters for dbscan?
Never used it, sorry
As per documentation, @obtuse skiff , those arguments are optional:
Parameters:
X : array or sparse (CSR) matrix of shape (n_samples, n_features), or array of shape (n_samples, n_samples)
    A feature array, or array of distances between samples if metric='precomputed'.
eps : float, optional
    The maximum distance between two samples for them to be considered as in the same neighborhood.
min_samples : int, optional
    The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
@obtuse skiff
Pasted from Introduction to Machine Learning with Python A Guide for Data Scientists
Increasing eps means that more points will be included in a cluster. This makes clusters grow, but might also lead to multiple clusters joining into one. Increasing min_samples means that fewer points will be core points, and more points will be labeled as noise.
The parameter eps is somewhat more important, as it determines what it means for points to be "close." Setting eps to be very small will mean that no points are core samples, and may lead to all points being labeled as noise. Setting eps to be very large will result in all points forming a single cluster.
The min_samples setting mostly determines whether points in less dense regions will be labeled as outliers or as their own clusters. If you increase min_samples, anything that would have been a cluster with less than min_samples many samples will now be labeled as noise.
So knowing that, I assume the best way to set the parameters is to try and see what clusters you get and adjust the parameters so that the number and size of clusters makes sense to you.
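Besides trial and error, one heuristic that often comes up (it is not from the book excerpt above) is the k-distance plot: compute each point's distance to its k-th nearest neighbour, sort those distances, and look for the "elbow" where they jump. A value just below the jump is a candidate for eps. A brute-force sketch in plain Python, with made-up points and k chosen arbitrarily:

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Brute force O(n^2): fine for small data, too slow for large sets.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]  # four close points, one outlier
print(k_distances(points, k=2))
```

In this toy data the first four values are 1.0 and the last one is above 13, so any eps between those would keep the square together and leave the far point as noise.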
hey everyone. I'm working on an application that does a bunch of file input/pandas transformations/statistics, and i'm using PyCharm as my IDE. I'm wondering if anyone could give their workflows while they're working on a project -- specifically in cases where you import a lot of data and are beginning to write a new function to do ~something~, like create a matplotlib plot, run a statistical test, create outputs, etc.
I find I waste a lot of time because my code needs to import all my files, sort them, clean them, etc. Currently, I set a breakpoint at the start of my new function, run the debugger, it stops at my breakpoint, and then I do code edits, restart the debugger, let it run over my new code, check for problems, and do this again and again. I feel like this is really inefficient because I'm constantly restarting my debugger over and over, and (of course) my code needs to load all my data again every time. I know a lot of this may be unavoidable, but getting some insight into others' data science workflows would be great. Thanks!
I do my analysis in Jupyter notebooks. having the ability to run cell by cell is much easier for data exploration and analysis
I find working with py files very cumbersome for this purpose, especially if you need to make small changes and run blocks of code over and over
@paper niche thanks for the suggestion! I really should get into jupyter -- always been on the list and never really got around to checking it out. I can see how running cell by cell would be useful. When you use jupyter, do you create cells as "building blocks" and just execute them in a certain order depending on what you want to do (or work on next)? Can you also "undo" execution of a cell?
yup i do. as an example notebook (not mine, but you'll get the idea): https://www.kaggle.com/gpreda/rsna-pneumonia-detection-eda
maybe 1 cell for loading data to df, 1 cell for doing some modification to the columns, another for df.plot, say.
and there's no "undo"-ing the execution of a cell, what you do is you re-run the cell that created that particular variable in the first place. e.g. for my example workflow above, if I realized my modification in the second cell was wrong, I'll run the first cell again to re-assign df, then make changes to the second cell, then re-run the second cell again @modest halo
@paper niche Oh I see. so essentially achieving the same thing. After seeing that notebook you linked, I'm really starting to rethink my workflows lol. That notebook is pretty much exactly the kind of thing I want to do, and it's much more readable for people who aren't as familiar with python (a lot of people I work with). Thanks a lot!
@modest halo np, happy to share! yup, the ability to write text and code in the same document is really nice for this purpose as well.
For the part where it plots the regression line
fig, axes = plt.subplots(ncols=3, figsize=(20,8))
for ax, col in zip(axes, mergfix.loc[:, mergfix.columns != "internet_users"]):
    mergfix.plot.scatter(x=[col],y=1, c='green', ax=ax)
    _ = plt.plot(mergfix.iloc[:, v].values, sm.OLS(mergfix.iloc[:, v].values, sm.add_constant(mergfix.iloc[:, v].values)).fit().fittedvalues,'r-')
how can I get it to cycle through the columns in the dataframe?
Help?
@marsh fog I like to just make a list of the columns and wrap it in a for-loop. Like
```
columns = ['column1', 'column2', 'column3']
for column in columns:
    plotfunction(dataframe[column])
```
Not sure if that was what you were asking though :p
It wasn't
anyone here written any useful graphs for MISP (www.misp-project.org)? Or have any experience with data science for threat hunting?
Hey guys, I have 8 black&white images, I need the mean image of them, if I do sum(list_of_images) I get a grayscale image with very bad edges, if I divide this new image by len(images) (which feels logical to do, as that is what a mean is) it gives me the same rough image but with colors
UserWarning: Float image out of standard range; displaying image with stretched contrast.
warn("Float image out of standard range; displaying "
images are stored in numpy arrays
nvm my add logic is wrong
fixed, used numpy.mean() and then casted the output image to uint8 to get it grayscale
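The fix described above, as a short sketch. The likely cause of the bad result from `sum(...)` is that adding `uint8` arrays wraps around at 255, and dividing afterwards produces floats outside the range the viewer expects. The function name and test images below are made up.

```python
import numpy as np

def mean_image(images):
    """Pixel-wise mean of equally sized uint8 grayscale images."""
    # Convert to float first so the intermediate sum cannot overflow uint8.
    stack = np.stack(images).astype(np.float64)
    return np.rint(stack.mean(axis=0)).astype(np.uint8)

a = np.full((2, 2), 200, dtype=np.uint8)
b = np.full((2, 2), 100, dtype=np.uint8)
print(mean_image([a, b]))  # every pixel is 150
```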
I want to see if I can explain KNN here in my own terms.
Let's say I am a company that sells different types of widgets and we have seasonality to our business and I want to see which widgets are likely to go up or down in sales for a given month.
Would I have my various widgets as rows, the months as columns, previous year sales, previous months sales and maybe a few other columns. Then I would label those as either going up or down in sales the next month. Would I then run a KNN algorithm on the data and hope to get a prediction for a row without a next month's sales figure?
Wrong place for that question or am I way off?
you dont want to use KNN for that
at least from what i can remember it
since you're talking about time, and especially seasonality you want to do time series analysis
Thanks for the response. I think I was reading that KNN isn't ideal for time series but it can be used.
What about random forest then?
Similar data setup in structure but different algorithm.
There was a tutorial I was reading that was about forecasting weather with it
I've not done much of time series but what I've read on them, people usually use LSTM's, GRU's, fbprophet, arima for it. Maybe check those out
RF's can't extrapolate well, since they always take the average in the leaf nodes, you can't get values higher or lower than the extreme values in your training set.
unlike a linear regression for example
not to say that it can't, but I would expect other methods to work better for time series predictions
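The extrapolation point can be seen with a toy stand-in for a forest: a one-split regression tree that predicts the mean of whichever leaf a sample falls into. This is a deliberately tiny illustration with made-up data, not how sklearn implements random forests, but the leaf-averaging behaviour is the same idea.

```python
def fit_stump(xs, ys):
    """One-split regression tree: choose the threshold with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skip the smallest x so the left leaf is never empty
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

xs = list(range(10))          # training inputs 0..9
ys = [2.0 * x for x in xs]    # made-up upward trend, largest target is 18
tree, line = fit_stump(xs, ys), fit_line(xs, ys)
print(tree(20), line(20))     # 14.0 40.0
```

The tree's answer for x=20 is capped by the targets it saw in training, while the line keeps following the trend, which is why trending or growing series are awkward for tree ensembles unless the trend is removed first.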
I'd say it's more about what problem you
are trying to solve..than it is about the method..
case you cited for example.. how much can you afford to be off by
like..what is the tolerance..
fbprophet and arima are what I'd suggest for time series.. the person above was right on the money..
with Arima, you can tweak it to be conservative..
I have been using prophet for forecasting and I like it.
I just wanted to try some other models that I haven't used much and that are common in ML, not only to compare results but also to learn.
So to get back to my original question, it looks like KNN and rf still belong in the classification toolbox and prophet in the forecasting toolbox.
yep.. and with knn you use different distance metrics for different applications/problems..
RF.. for most business applications with plenty of tolerance.. and minimum number of trees for fast results..
Do you have a preference between the two? Or is just very dependent on type and amount of data and variables?