#data-science-and-ml
I know how I could write javascript to fix this (draw a circle, then draw a number, through the set), but I'm using a Bokeh widget to filter the data that's working pretty nicely...
I've been reading through their code and it doesn't look like there's anything to do this, but just started looking into using their "Scatter" figure, and creating a second dataframe; one for the circles, one for the numbers, and interleaving the two together. But I'm also open to other libraries if there's one that does this better
looks like peanuts
your goal is to visualize and you say you're open to things other than bokeh.. but you have to state what you intend to visualize better
Haha - I've added borders to those circles since then
I see different colors.. so I'm guessing those are some categories?
and it's a scatter plot
Yeah, those are different categories, and the numbers I'm putting on top of them are the category numbers; because there are so many different ones, I don't want to have to refer to a long color legend
I have a hovertooltip which could include the categories, but I'd like people to be able to get a sense of how close or far away the different categories are from one another just by looking at it
The number is the category. This is a TSNE visualization of an LDA topic model - the numbers and colors are the topics
you can't use colors and numbers to represent the same thing
I think it could look good with the numbers
use plotly
and plot graphs side by side.. one for each category
and label the graph with the number
I tried plotly and found the same problem -
I get the idea of having different graphs for each category, but there are 18 categories
And I'm trying to visualize their relationship/distance to one another
just remove the numbers?
yeah I told him that..
Since it has colors, those numbers can go in the legend, yeah?
that's what I said
I dunno if they represent something but a color gradient might be cool to use too if they make sense to use
If 1 is red and 14 is blue, you just need to put the gradient next to the graph. But this only works if the numbers are a measure of the same thing
they're different categories
@lapis sequoia Pre-training the whole thing is too expensive obviously though
you keep missing the point here
1. your problem/application
2. representation that suits your application
3. metric that fits your application best
2. is where the type of model comes in.. representing your word vectors in a suitable space..
there's lighter frameworks that are language specific.. including ones from BERT that'll help you do that
Hey there! Does anyone know the reason for this behavior? I have this dataframe (just several rows of it for reproduction purposes):
id dpt price minutes
9710556 0 180.82 140
9710556 0 180.82 140
9710556 0 202.32 145
9710556 1 218.32 145
9710556 1 250.82 140
I am trying to find, for each (price, minutes) combo, the number of other combos it is strictly less than in both values. My data will be grouped by id and dpt in the end.
This is the function I came up with which gives me the correct output if I forget about grouping by dept for now:
import numpy as np

def ranker(df):
    values = df[["price", "minutes"]].values
    # result[i, j, k] is True when row i is smaller than row j in column k
    result = values[:, None] < values
    return np.logical_and.reduce(result, axis=2).sum(axis=1)
And if I apply it to my data now, I get this:
small.groupby("id").apply(ranker)
Out[144]:
id
9710556 [2, 2, 0, 0, 0]
dtype: object
Which means that the first (price, minutes) combination is strictly less (in both values) than 2 other rows within this dataset, and so on.
When I try to assign it back to the dataframe, I get NaNs everywhere:
small["a"] = small.groupby("id").apply(ranker)
small.a
Out[147]:
102 NaN
103 NaN
104 NaN
105 NaN
106 NaN
Name: a, dtype: object
How can I solve this? My overall goal is to run this function grouping by id and dpt in the end
EDIT: code
what's small
the name of the dataframe i gave it
as far as i know, groupby applies the function to every group separately, and each group is itself a dataframe
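A minimal sketch of one way around the NaNs. `small.groupby("id").apply(ranker)` returns one array per group, indexed by `id`, so assigning it to a column aligns on the row labels 102..106 and finds no match. If `ranker` instead returns a Series that keeps each group's row index, the pieces can be concatenated and assigned directly. The sample frame is rebuilt from the rows in the question; the column names `a` and `b` are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample rows from the question, with the same row labels as in the output above
small = pd.DataFrame(
    {
        "id": [9710556] * 5,
        "dpt": [0, 0, 0, 1, 1],
        "price": [180.82, 180.82, 202.32, 218.32, 250.82],
        "minutes": [140, 140, 145, 145, 140],
    },
    index=[102, 103, 104, 105, 106],
)

def ranker(df):
    values = df[["price", "minutes"]].to_numpy()
    result = values[:, None] < values
    counts = np.logical_and.reduce(result, axis=2).sum(axis=1)
    # Keep the group's own row labels so the result aligns with `small`
    return pd.Series(counts, index=df.index)

# Grouped by id only
small["a"] = pd.concat([ranker(group) for _, group in small.groupby("id")])

# Grouped by id and dpt, the stated end goal
small["b"] = pd.concat([ranker(group) for _, group in small.groupby(["id", "dpt"])])

print(small)
```

Because assignment aligns on the index, the order in which the groups are concatenated does not matter.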
@lapis sequoia I don't see your point
exactly
Hey im just starting off with CNNs, working with the basic Fashion-MNIST dataset using tensorflow and keras. I am a bit stuck with two things, hoping someone can help me out! If the data is 2D do I have to flatten it to a 1D array? Also, how does the layering work exactly and how do you set it up?
@lapis sequoia No, I mean that you are talking about different parts of solving the problem. I don't see your point in stating it. I was never interested in solving the problem.
- I'm certainly not going to pretrain because I don't have that kind of data, nor is it my main project to do so.
- My problem is primarily input: text, output: classification. It's as simple as that and details like sentiment, text type, etc. generally don't matter except that they are in English sentences.
- Metric - I don't even see your point with this, most business applications would just put a dollar sign to everything that they can or care about. In either case, any and all categorical losses would be relevant to me, and I'm not particularly using or focussing on any
My problem (was) is very simple, the tokenizer from TF2Hub's Albert doesn't seem to produce expected things, and the scripts provided seem tricky (with stuff like FLAGS and TF2.0 migration in the way). That is about it.
@acoustic mural Pooling is probably better, or at least, in the standard help I see online
it depends on what you're doing with it, sometimes pooling all the way down to 1D loses too much information
lol just use roberta+pytorch
albert is new enough that I think you'll get less help around it
or roberta+tf if you really want to use tf
Yeah I noticed the different levels of how involved the coder is for each different package. Anyway I think I accomplished my goal with benchmarks, it does seem that roBERTa works quite out-of-the-box and it's not really quite clear what hyperparameters are really good/important to change for any given problem
In Python multithreading, if you run 2 threads on one class, do the variables within that class stay unchanged by the other thread? Like duck = 0 in thread 1, and in thread 2 duck gets changed to duck = 1; now in thread 1 duck = 0 still, right?
Also, does the same apply when calling another function within that same function with 2 threads?
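Threads in one process share the same objects, so an attribute changed by one thread is changed for every other thread too; only local variables inside a function call are private to that call. A minimal sketch (the `Pond` class and `count_up` function are made up for illustration):

```python
import threading

class Pond:
    def __init__(self):
        self.duck = 0  # instance attribute, visible to every thread

    def set_duck(self, value):
        self.duck = value

pond = Pond()

# A second thread changes the attribute on the same object...
worker = threading.Thread(target=pond.set_duck, args=(1,))
worker.start()
worker.join()

# ...and the main thread sees the new value, not the old one.
print(pond.duck)

def count_up(results, key):
    total = 0  # local variable: every call, and so every thread, gets its own
    for _ in range(1000):
        total += 1
    results[key] = total

results = {}
threads = [threading.Thread(target=count_up, args=(results, i)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

If two threads update the same attribute at the same time, a `threading.Lock` around the update is the usual safeguard.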
What is wrong here
What Python version is that? What version was the as keyword added in?
@distant inlet: :)

this is pretty cool
Course materials and notes for Stanford class CS231n: Convolutional Neural Networks for Visual Recognition.
covers basics for DS
Is there a lib to extract from a photo (Document) to text, tables and photo? Just saw photo to text with cv2 and pil
hey guys, do you have any good resources where you can find messy datasets to train data cleaning skills?
i have a matrix in tensorflow and I want to create a heatmap of it using seaborn and log it using tf.summary.image. pretty much, I need to get pixel data from a seaborn plot. does anyone know how to do this?
I've only found tf.image.decode_png but I think it would be more efficient and accurate to directly get the pixel array from the seaborn plot itself
if you want the summary, as in the shape.. why dont you use np @compact bluff
@olive willow http://www.kdnuggets.com/datasets/index.html
Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
Google Public Data Directory :http://www.google.com/publicdata/directory
Natural Earth Data : http://www.naturalearthdata.com/downloads/
Geocomm : http://data.geocomm.com/drg/index.html
Geonames data: http://www.geonames.org/
US GIS Data: Available from http://libremap.org/
@wraith basin thanks!!!
Is there a way to find the foci of an ellipse based on its contour? numpy+mpl
hope you dont mind me doubleposting, just figured this was a better place to ask: im looking to change the structure of a dataset to this:
Country Year Debt Unemployment GDP
Afghanistan 1986 13 7 3456
Afghanistan 1987 12 8 3487
Afghanistan 1988 13 4 2356
so i have this:
and i want this:
anyone know how i could go about changing that?
@maiden void so I would try this https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html
it's a method that changes columns into individual rows
you would have to look around how to get it into your desired format tho, I've no clue
Can anybody recommend me a website that mainly host contest for ml frequently.
@lapis sequoia - like this? https://www.kaggle.com/competitions
Hey I'm trying to train on a weather data set to tell me if it will rain tomorrow, but the column is "yes/no" and not binary like it's asking for
Anyone know a way around this? I can provide pictures if anyone is interested thanks!
Can you use pandas?
df['RainTomorrow'] = df['RainTomorrow'].map({'Yes': 1, 'No': 0}) Worked for me, now I have to fix the other errors 😦
I know this means very little without knowing other data, but is anyone able to translate what these errors are saying?
thanks @olive willow it did work to some degree, but unfortunately not to do what i wanted
going borderline mental here after spending the entire day on this single dataset
Hahaha, maybe you can get only the columns with the years, transpose them and then add them to the others?
@maiden void
thats actually what i started doing. so now i have:
so all the Japan columns are actually variables
that go on for like 100 variables and then they continue for the next country
so what i should do is move the next country under this one
unfortunately, i dont know how to do that effectively
(could always copy paste, but that will take forever and also its better to do it in python or R so i can recreate it)
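A reproducible way to do this reshaping in pandas is `melt` followed by `pivot`. This is only a sketch: the screenshots are not in the log, so the wide layout below (one row per country and indicator, one column per year) is an assumption, and the values are the ones from the target table above.

```python
import pandas as pd

# Assumed wide layout: one row per (country, indicator), one column per year
wide = pd.DataFrame(
    {
        "Country": ["Afghanistan"] * 3,
        "Indicator": ["Debt", "Unemployment", "GDP"],
        "1986": [13, 7, 3456],
        "1987": [12, 8, 3487],
        "1988": [13, 4, 2356],
    }
)

# Step 1: stack the year columns into rows
long = wide.melt(id_vars=["Country", "Indicator"], var_name="Year", value_name="Value")

# Step 2: spread the indicators back out into columns
tidy = long.pivot(index=["Country", "Year"], columns="Indicator", values="Value").reset_index()
tidy.columns.name = None
tidy = tidy[["Country", "Year", "Debt", "Unemployment", "GDP"]]

print(tidy)
```

With more countries in `wide`, the same two steps stack them under one another without any copy and paste.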
@errant venture this error is very simple - "could not convert string to float" - the date string cannot be converted to a number. In fact, why should the probability that it rains tomorrow depend on the date today? It is possible that rain depends on pressure or temperature for several days earlier. If this is the case, then the table with the dates must be transformed so as to enter data for 1,2,3, ... the last few days.
@true fiber that makes sense, so if I were to remove the date column that would likely fix it?
@errant venture Yes, the error will disappear, but this does not mean that the prediction will make sense, can it still take into account the pressure over the past few days?
@true fiber yeah it has pressure/Wind etc
And location, will I have to remove all data sets that contain strings?
It throws an error for Wind direction and location since they aren't floats, does this mean I can't use them to train on the data set
All strings need to be converted to numbers, but if there are several values, then they need to be binarized. If you simply delete the strings, important information is lost, but you can check how something works.
So say wind direction, that has say N W E S values, would converting them to 1, 2, 3, 4 be smart?
Thank you so much for answering by the way, you've already helped me understand this a lot
No, to encode categorical data you need to use https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
@true fiber Ah thanks for that, will look into it now!
But it is more interesting what to do with the cities. First you need to try to remove the city and look at the accuracy of the forecast. Then it may be necessary to take into account the existence of rainy cities and consider them separately. But probably for a simple task, they can be deleted.
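To illustrate the one-hot idea for wind direction: each direction gets its own 0/1 column, so no artificial ordering like 1 < 2 < 3 < 4 is implied. The sketch below uses pandas' `get_dummies` as a lightweight stand-in for the scikit-learn `OneHotEncoder` linked above; the column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical slice of the weather table
df = pd.DataFrame(
    {
        "WindDir": ["N", "W", "E", "S", "N"],
        "Pressure": [1012.0, 1008.5, 1015.2, 1001.3, 1010.0],
    }
)

# One 0/1 column per wind direction; numeric columns pass through unchanged
encoded = pd.get_dummies(df, columns=["WindDir"], dtype=int)
print(encoded)
```

`OneHotEncoder` does the same job but remembers the categories it saw during fitting, which matters when the same encoding has to be applied to new data later.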
@true fiber Yeah I'll try and use them, but for now I just want to train on the rest of the data set
I'm just working past a "Input contains NaN, infinity or a value too large for dtype('float64')." error
Yes, you must clear the table by deleting all missing data.
df.fillna(0) / df.dropna(axis = 1, thresh=3) / df.fillna(df.mean())
pandas people, whats the difference between df[0] and df[df.columns[0]]
i normally do the first one to get the targets of a df but for some reason its not working on the df im working on rn
@true fiber I've already prepped the data and removed or replaced all NaN values, is there a check for infinite values?
basically df[blah] can be ambiguous. df[column_name] generally works to get a column
wdym by ambiguous?
it can have different behavior depending on what "blah" is
actually don't worry about that
in any case, if you want to get a column, do df[column_name]
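A small sketch of the difference: `df[0]` looks for a column whose label is literally `0`, while `df[df.columns[0]]` looks up whatever label the first column has. The frame and column names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"target": [0, 1, 1], "feature": [3.2, 1.5, 4.8]})

# No column is labelled 0 here, so the label lookup fails
try:
    df[0]
    label_lookup_worked = True
except KeyError:
    label_lookup_worked = False

first_by_label = df[df.columns[0]]  # first column, via its actual label
first_by_position = df.iloc[:, 0]   # first column, purely by position

# df[0] only works when the columns really are labelled 0, 1, 2, ...
unnamed = pd.DataFrame([[0, 3.2], [1, 1.5]])
first_unnamed = unnamed[0]
```

That is the likely reason `df[0]` works on some frames (default integer column labels) and not on others (named columns).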
np.isfinite(df.any()) returns true for every column ,what does that mean?
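`df.any()` first collapses each column to a single True/False, so `np.isfinite(df.any())` only asks whether those booleans are finite, which is always the case. To test the actual values, apply `np.isfinite` to the frame and reduce afterwards. A minimal sketch with hypothetical column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "pressure": [1012.0, np.inf, 1008.5],
        "humidity": [0.4, 0.5, np.nan],
        "temp": [21.0, 19.5, 18.0],
    }
)

# True where a cell is neither NaN nor +/-inf
finite_mask = np.isfinite(df)

# Per column: is every value finite?
column_is_clean = finite_mask.all()

# Rows that contain at least one NaN or infinite value
bad_rows = df[~finite_mask.all(axis=1)]

print(column_is_clean)
print(bad_rows)
```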
Hey is anyone here open to giving me some guidance on this ML project I'm doing?
Oops sorry if I interrupted something
I have a dataframe that has the batting performance of players who played in the World Baseball Classic (WBC). This is only played in certain years. The columns this dataframe has are playerID, yearID, BattingPerformance and Name. I want to calculate the average of their batting performance over the year they played in the WBC, the previous non-WBC year and the following non-WBC year, to see if the WBC had any effect on their performance. For example, player X in year 2006 had a batting performance of 0.3, -0.2 in 2005, and 0.01 in 2007. The average would be (0.3 + (-0.2) + 0.01) / 3. The yearID goes from 2005 to 2018. The WBC years are 2006, 2009, 2013, 2017.
The final output of this new dataframe should have the following columns
[Name] | [Average calculated from 2005,2006,2007] | [Average calculated from 2008,2009,2010] | [Average calculated from 2012,2013,2014] | [Average calculated from 2016,2017,2018]
What methods does pandas have that will help me achieve this?
@lapis sequoia
So this is what I have so far.
def caculate_impact_score(row, batting_df):
    print(row)
    return 0

batting_impact_score = people_WBC_batting.groupby(['playerID', 'yearID']).apply(lambda row: caculate_impact_score(row, batting))
people_WBC_batting is a dataframe that contains all the players that played in a WBC year(yearID are 2006, 2009, 2013, 2017).
batting is the one that has the batting performance for a player in a particular year.
I can't really do mean() on the batting dataframe because it includes player that didn't play in WBC.
So I have to find a way to use the playerID and yearID in the WBC dataframe and associate that with the playerID and yearID in batting to get their batting performance.
could you join the tables
merge them
or filter
bad_id_list = []
filtered_frame = batting[~batting['playerID'].isin(bad_id_list)]
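The merge suggestion can be sketched as follows. This assumes `people_WBC_batting` has one row per player and WBC year, and `batting` has one row per player and season; the sample values reuse player X from the question, and player "y" is a made-up non-WBC player. Because WBC years are at least three years apart, the seasons directly before and after a WBC year are always non-WBC years.

```python
import pandas as pd

batting = pd.DataFrame(
    {
        "playerID": ["x"] * 4 + ["y"] * 3,
        "yearID": [2005, 2006, 2007, 2008, 2005, 2006, 2007],
        "BattingPerformance": [-0.2, 0.3, 0.01, 0.5, 0.1, 0.2, 0.3],
    }
)
people_WBC_batting = pd.DataFrame({"playerID": ["x"], "yearID": [2006]})

# Rename so the WBC year survives the merge next to the season year
wbc = people_WBC_batting[["playerID", "yearID"]].rename(columns={"yearID": "wbcYear"})

# Inner merge keeps only players who appeared in a WBC
merged = wbc.merge(batting, on="playerID", how="inner")

# Keep the WBC season plus the seasons directly before and after it
window = merged[(merged["yearID"] - merged["wbcYear"]).abs() <= 1]

# One row per player, one column per WBC year
averages = (
    window.groupby(["playerID", "wbcYear"])["BattingPerformance"]
    .mean()
    .unstack("wbcYear")
)
print(averages)
```

The Name column can be attached afterwards with another merge on playerID.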
I keep getting:
Input contains NaN, infinity or a value too large for dtype('float64').
But my table has none of these issues
grrr
@lapis sequoia There are only floats in the data set, I've removed any other data types
still, i've had that issue before, and pandas' to_numeric helped resolve it
its good for number type conversions
Is there any way to use the diff function https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.diff.html to find the difference between the previous element and the next element?
has an example of doing previous row and following row but not both.
@jovial river not tested, but probably do a diff, then shift
diff of 2, then shift by -1
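A quick check of the "diff of 2, then shift by -1" idea against the explicit next-minus-previous form, on a made-up series:

```python
import pandas as pd

s = pd.Series([1, 4, 9, 16, 25])

# diff(2) gives s[i] - s[i-2]; shifting by -1 re-centres it on the middle element
via_diff = s.diff(2).shift(-1)

# The same thing written out: next element minus previous element
via_shift = s.shift(-1) - s.shift(1)

print(via_diff)
```

Both versions leave NaN at the first and last position, where one of the two neighbours does not exist.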
When I run a command like this to start training the model
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
I get an output like this: https://pastebin.com/9rdZb41s
Is there anyway to turn this into a graph form? I am using Jupyter so it would be cool to have it in live graph
@errant venture No, I don't think it's a good idea to use this, as @lapis sequoia advised. You must understand what you are doing, so you need to find the erroneous element, for example by narrowing it down through deleting columns or rows. You don't have any infinite numbers; almost surely you just missed some empty spurious element.
df.dropna()
Hey guys can somebody help me with this
I've a dataframe that I want to modify
Country Name Afghanistan ... Zimbabwe
Country Code AFG ... ZWE
Indicator Name 5-bank asset concentration ... Working capital financed by banks (%)
Indicator Code GFDD.OI.06 ... GFDD.AI.35
1960 NaN ... NaN
... ... ... ...
2013 79.6688 ... NaN
2014 86.6035 ... NaN
2015 72.1549 ... NaN
2016 71.9406 ... 5.8
2017 73.6723 ... NaN
this is a preview of it
I want the first 4 rows to become columns and their data points to become vertical, not horizontal, if that makes sense
how about the other rows for the years?
you know like with transpose, I've applied it to the dataset and the years should be the rows like the index rows
but the other columns like Country code, Name etc should become the columns
and all the countries should become the rows not the columns
like from top to bottom
So I'm trying the senet model (https://github.com/moskomule/senet.pytorch) which is supposed to yield better results than ResNet, but I keep getting 0% accuracy on my train set (throughout 10 epochs)
I'm using it as such:
model_senets = se_resnet20(num_classes=len(classes), reduction=16)
cuda = torch.device('cuda')
model = model_senets.to(cuda)
optimizer = homura_optim.SGD(lr=hp["lr"], momentum=0.9, weight_decay=1e-4)
scheduler = homura_lr_scheduler.StepLR(80, 0.1)
tqdm_rep = reporters.TQDMReporter(range(hp["epochs"]), callbacks=[callbacks.AccuracyCallback()])
trainer = homura_Trainer(model, optimizer, F.cross_entropy, scheduler=scheduler, callbacks=[tqdm
for _ in tqdm_rep:
    trainer.train(train_loader)
    trainer.test(test_loader)
    trainer.update_scheduler(scheduler)
Am I implementing it wrong? Why am I always getting 0 accuracy? 😕 (please tag me if you reply)
anyone here can tutor R? Willing to pay, please slide into my dm pls. Thanks.
Hello
```
name,age,weight
mike,22,180.2
alexa,28,133.30
terry,56,
jordan,,
joey,82,138.90
```
I got a csv like that
I want to specify dtypes on import
but the missing data is screwing it up, I get an error ValueError: Integer column has NA values in column 1
how do I avoid that?
use fillna
But what am I gonna fill it with?
I don't want to add random numbers
And filling with None doesn't work
@fallen anchor df = pd.read_csv('test.csv').fillna(value=0) worked for me
hmm
but 0 is appropriate data
can I use something like -1000?
my actual data has temp and wind speed etc, so 0 makes sense
You can use anything, sure
-1000 works as well
```
     name     age  weight
0    mike    22.0   180.2
1   alexa    28.0   133.3
2   terry    56.0 -1000.0
3  jordan -1000.0 -1000.0
4    joey    82.0   138.9
```
this is what it will look like with -1000
You can also do float('inf')
```
     name   age  weight
0    mike  22.0   180.2
1   alexa  28.0   133.3
2   terry  56.0     inf
3  jordan   inf     inf
4    joey  82.0   138.9
```
Infinite age and weight yes
interesting
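If the aim is only to keep `age` an integer column despite the gaps, another option is pandas' nullable integer dtype, which avoids sentinel values such as -1000 or inf. A minimal sketch, using the CSV from above inlined as a string instead of a file:

```python
import io
import pandas as pd

csv_text = """name,age,weight
mike,22,180.2
alexa,28,133.30
terry,56,
jordan,,
joey,82,138.90
"""

# "Int64" (capital I) is pandas' nullable integer dtype: missing cells become <NA>
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"age": "Int64", "weight": "float64"},
)

print(df)
print(df.dtypes)
```

Missing weights stay NaN in the float column, so real zeros in the data remain distinguishable from gaps.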
next question
```
time,temp
2019-11-20 00:56,5
2019-11-20 01:56,
2019-11-20 02:56,8
2019-11-20 03:56,
2019-11-20 04:56,4
2019-11-20 05:56,
2019-11-20 06:56,
2019-11-20 07:56,
2019-11-20 08:56,0
```
I want to interpolate the missing data
Do you mean, filling data?
```py
df = pd.read_csv('test.csv').ffill()
```
Will give
```py
               time  temp
0  2019-11-20 00:56   5.0
1  2019-11-20 01:56   5.0
2  2019-11-20 02:56   8.0
3  2019-11-20 03:56   8.0
4  2019-11-20 04:56   4.0
5  2019-11-20 05:56   4.0
6  2019-11-20 06:56   4.0
7  2019-11-20 07:56   4.0
8  2019-11-20 08:56   0.0
```
interpolate
so lets say at 0 hours temp was 2c at 3 hour it was 7c, we can assume at 2hour it was around 4c
does that make sense?
Yes
```py
df = pd.read_csv('test.csv')
df = df.interpolate(method='linear', limit_direction='forward')
```
```py
               time  temp
0  2019-11-20 00:56   5.0
1  2019-11-20 01:56   6.5
2  2019-11-20 02:56   8.0
3  2019-11-20 03:56   6.0
4  2019-11-20 04:56   4.0
5  2019-11-20 05:56   3.0
6  2019-11-20 06:56   2.0
7  2019-11-20 07:56   1.0
8  2019-11-20 08:56   0.0
```
Like this?
It is, you can shorten it to df = pd.read_csv('test.csv').interpolate(method='linear', limit_direction='forward')
well pandas has a lot of cool stuff
The method is described here https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.interpolate.html
that's awesome. thank you
what if I only want to do it for the temp column (even though this example only has temp I know)
Oh, I can use axis
You can do this
```py
df['temp'] = df['temp'].interpolate(method='linear', limit_direction='forward')
```
Test data
```csv
time,temp,test
2019-11-20 00:56,5,1
2019-11-20 01:56,2
2019-11-20 02:56,8,
2019-11-20 03:56,
2019-11-20 04:56,4,
2019-11-20 05:56,,
2019-11-20 06:56,,
2019-11-20 07:56,,
2019-11-20 08:56,0,3
```
Output
```py
               time  temp  test
0  2019-11-20 00:56   5.0   1.0
1  2019-11-20 01:56   2.0   NaN
2  2019-11-20 02:56   8.0   NaN
3  2019-11-20 03:56   6.0   NaN
4  2019-11-20 04:56   4.0   NaN
5  2019-11-20 05:56   3.0   NaN
6  2019-11-20 06:56   2.0   NaN
7  2019-11-20 07:56   1.0   NaN
8  2019-11-20 08:56   0.0   3.0
```
but how can I do it within the interpolate() call? wouldn't that be cleaner
df = df.interpolate(method='linear', limit_direction='forward', axis='temp') throws an error
UnboundLocalError: local variable 'ax' referenced before assignment
You cannot, axis only accepts 0, 1 or none
```
axis : {0 or ‘index’, 1 or ‘columns’, None}, default None
    Axis to interpolate along.
```
So you will need to split into 2
ahh, this works
df = df.interpolate(method='linear', limit_direction='forward', columns='temp')
Nothing haha
Well this is clean enough
```py
df = pd.read_csv('test.csv')
df['temp'] = df['temp'].interpolate()
```
I will use that
ValueError: time-weighted interpolation only works on Series or DataFrames with a DatetimeIndex
I get that error when I try to use the time method
df['temp'] = df['temp'].interpolate(method='time')
even though column 0 is time
I even added this df = df.set_index('time')
Fixed it
```py
df['time'] = pd.to_datetime(df['time'])
df = df.set_index('time')
df['temp'] = df['temp'].interpolate(method='time')
```
@fallen anchor all such time-related operations require a DatetimeIndex, which can be set as you did with pd.to_datetime, or you can specify the time column when reading the csv
@fallen anchor look at the parse_dates argument in read_csv. Also, date_parser can be of help
They are both arguments of read_csv
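A minimal sketch of the `parse_dates` route: parsing the time column and making it the index at read time yields the DatetimeIndex that `method='time'` needs. The first rows of the temperature CSV from above are inlined as a string instead of a file.

```python
import io
import pandas as pd

csv_text = """time,temp
2019-11-20 00:56,5
2019-11-20 01:56,
2019-11-20 02:56,8
"""

# parse_dates converts the column to datetimes; index_col makes it the index
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"], index_col="time")

# Time-weighted interpolation now works because the index is a DatetimeIndex
df["temp"] = df["temp"].interpolate(method="time")

print(df)
```

With evenly spaced timestamps the result matches linear interpolation; the two only differ when the gaps between readings are uneven.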
Hi everyone I am looking for some person whom I can work with on some data science/Machine learning project. If anyone is working on some data science project I would happy to be part of team. I am an undergrad student and want to gain skills and experience in deep learning. DM me if you have some project. Thanks
just saw a picture of the Two Minute Papers guy... not at all what I expected based on his voice
no other takeaways, just that i pictured him EXTREMELY different
to keep it on topic, it's in this (WILDLY INTERESTING/CONCERNING) video https://www.youtube.com/watch?v=38ZXwJj6j8k
what is a common thing to fillna with? np.nan ?
nevermind, np.nan is the default anyway, no need to set it
Hey guys I kinda need help with making a simple supervised machine learning program for my science project, where inputting an integer will respond with a win or a loss based on given data. If someone could help me out that would be awesome, dm me.
@lime cradle it would be better to ask in the channel instead
ask a specific question
that is some confusing wording ^ @tranquil rose
Well I am very new to coding and I was doing some research and it seems like I am going to be using logistic regression, and I was wondering if there was an algorithm that is already made that I could use for my project
So would I use the decision_function method from the link that you sent?
ok are you more familiar with tensor flow or keras
TF
ok
but really I don't know TF well either
Is there any way if you could help me set up the program if I find the coding to use
well if you arent I just looked up a tutorial so wish me luck
I wish I could, but my knowledge is limited
ok thank you for the help
How does one get the mode of a groupby in pandas?
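One common way to get a per-group mode, sketched on a made-up frame: `Series.mode()` returns all most-frequent values in sorted order, so taking the first one gives a single value per group.

```python
import pandas as pd

# Hypothetical data: most common wind direction per city
df = pd.DataFrame(
    {
        "city": ["A", "A", "A", "B", "B"],
        "wind": ["N", "N", "S", "E", "E"],
    }
)

# mode() can return several values on a tie; iloc[0] keeps the first of them
modes = df.groupby("city")["wind"].agg(lambda s: s.mode().iloc[0])

print(modes)
```

On a tie this silently picks the smallest value; keeping every tied value would need `s.mode().tolist()` instead.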
does any one know what queue model applies to a queue that has a single queue with multiple processors but where each process requires N processors?
I figured the first part would be M/M/c but that only applies when each process use 1 processor
I'm trying to make a simulation
For anyone familiar with pyspark. Im getting the error "Caused by: java.lang.ClassCastException: java.lang.Long cannot be cast to scala.collection.Seq"
First using ALS, Im turning my prediction values from a dataframe to a rdd, then putting that into RankingMetrics. Then calling that evaluators .meanAveragePrecision
the error is occurring when calling .meanAveragePrecision
Any idea why this could be happening?
it's weird because I'm using python not scala, so idk why it's trying to use scala
I am doing something like this
val domainList = data1.select("columnname","domainvalues").where(col("domainvalues").isNotNull).map(r => (r.getString(0), r.getListString.asScala.toList)).
have you looked at something like this
@deft harbor Whats confusing me, is that Im using python, not scala
this is beyond me, but my guess is that spark itself is trying to call something
perhaps there is an issue with configuration or a particular package it is trying to call
sorry im not more of a help
hmm, let me check the parameters for the rdd. looks like you're right, it calls a scala method
a friend and i are trying to get into the realm of data science/ml, any recommended videos to watch or simple projects to try?
Is anyone familiar with RankingMetrics from mllib pyspark?
I have my results from ALS which is in dataframe form, but idk what values I'm supposed to pass into RankingMetrics when I create the rdd for it
I see something called predictionAndLabels but Im not understanding what those values are in relation to the values I get for transforming the test data in the ALS model
@urban shore look up k nearest neighbors (KNN), for movie reviews. also tf-idf is used for it. It's one of the starting projects you can do. if you can get that to work with sklearn, you can do most other similar classification models
Hey im working on a final project for school where I have to build a CNN and plot some interesting results. Im using keras on the fashion mnist dataset, but right now I only have a plot showing training loss and accuracy on my data set with and without regularization. So I was wondering do any of you have any good ideas on stuff I could plot to show interesting results?
1. final project for school - somewhat relevant.. ok
2. build a cnn - for what?
3. need to know what you're applying it for, to tell you what an interesting result is
Im trying to classify images in the Fashion-MNIST dataset by zalando using a convolutional neural network.
Dont know what else to say
classifying images.. there you go
why do you think it's interesting to show training loss
per epoch
if your model was what you're showcasing, you can show how your metrics improve for different methods
Well, I just wanted to show if the model is over or underfitting
oh so you mean like log the results after tweaking parameters?
yes... compare the metrics
you could try things like data augmentation..
also can try sequencing the data (sequence of images) and trying to make a prediction on that..
Not sure I understand the last two points 🙂
@barren bluff If you are mostly interested in nice ways to present your results, you could showcase a confusion matrix of your results. Or visualise both training and validation error when training. Or if you have some time and some curiosity you can look into something like shap and try to interpret why your model predicts a certain class for a certain picture. https://github.com/slundberg/shap
yeah I actually just did that @polar acorn but it looks a bit funky for some reason
looks like this
this is the code that generated the plot:
import itertools
import numpy as np
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Greens):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
Any of you know how I can fix the block at the top and bottom?
row***
Idk, try to set plt.ylim(cm.shape[0] - 0.5, -0.5)?
yeah let me try that
inside of the for loop right?
that fixed it gosh darn!
Why would I even need that block @polar acorn ?
doesnt plt.yticks(tick_marks, classes) define that value already?
It defines where to put the ticks I think not necessarily the axis limits? Anyhow I think that example used to work without that line in a previous matplotlib version and they changed something with update without updating that example.
I'm still learning about nlp and ml, sorry for the newbie question, maybe wrong channel
I have a classified dataset with
Title (string), is_drama (bool), is_scifi (bool)
I'm looking for something that could suggest if it's drama or scifi, based on a Title input, but I have no idea of algorithm or method, it could be a link, video, or anything that could help, thanks in advance
@mystic ravine you can use an LSTM
http://cs231n.stanford.edu/syllabus.html, look for the slides on RNN (recurrent networks) and they also have a link to Goodfellow's DL book which has a section on RNNs
Thanks, I'll try it
Do you all use something like Luigi, Airflow, or some other DAG orchestrator for your data science/machine learning pipelines? Is there a reason to use one of these frameworks versus not using one? I am trying to evaluate the usefulness of them, and I am looking at potentially implementing this at work. I like standardization of processes, but I do not want to over-engineer.
better: use BERT/RoBERTa
you use one of those.. usually airflow..
just depends how mature it is.. and how much you think it'll be maintained as you take the risk of deploying it over one framework..
https://github.com/quantumblacklabs/kedro
Seems like a competitive alternative to Luigi and Airflow.
Any recommendations for an online course or series of courses for scienc
I was thinking edX or Coursera
Maybe even a stats course
Hopefully something comprehensive
Is it normal to get 0.0 as a pvalue?
just seems a little unlikely to get it 4 hypothesis tests in a row
there's nothing inherently unreasonable about it
even from a large real world data set?
sure
ok
i mean depending what you're testing, the fact that it's a large dataset might make it way more likely to have a v small p value
like testing for normality when youve got a big dataset that's very not normal
import scipy.stats as st
mean_elo_n_project_team = assigned_team_df['elo_n'].mean()
print("Mean Relative Skill of the assigned team in the years 1996 to 1998 =", round(mean_elo_n_project_team,2))
mean_elo_n_your_team = your_team_df['elo_n'].mean()
print("Mean Relative Skill of your team in the years 2013 to 2015 =", round(mean_elo_n_your_team,2))
# Hypothesis Test
# ---- TODO: make your edits here ----
test_statistic, p_value = st.ttest_ind(assigned_team_df['elo_n'], your_team_df['elo_n'])
print("Hypothesis Test for the Difference Between Two Population Means")
print("Test Statistic =", round(test_statistic,2))
print("P-value =", round(p_value,4))
im comparing two nba teams from two different time periods
so not testing for normalcy i think
the same 2 nba teams?
yes
you should be doing a paired differences test
ttest_ind is a 2 sample t-test which is for 2 independent samples. paired differences is for testing a difference between 2 dependent samples
like the weight of your family in 2018 vs. their weights in 2019 is two dependent samples because it's the same family members
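To make the independent vs. paired distinction concrete, here's a minimal sketch. The weights are made-up numbers (not the NBA data above), and the only SciPy calls used are `ttest_ind` and `ttest_rel`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: the same 8 people weighed in two different years
weights_2018 = rng.normal(70, 10, size=8)
weights_2019 = weights_2018 + rng.normal(1.0, 0.5, size=8)  # small, consistent gain

# Independent two-sample t-test: ignores the pairing between the two years
t_ind, p_ind = stats.ttest_ind(weights_2018, weights_2019)

# Paired t-test: tests whether the mean of the per-person differences is zero
t_rel, p_rel = stats.ttest_rel(weights_2018, weights_2019)

print(f"ttest_ind: t={t_ind:.2f}, p={p_ind:.4f}")
print(f"ttest_rel: t={t_rel:.2f}, p={p_rel:.4f}")
```

The paired test is only appropriate when each value in one sample has a natural partner in the other; for two unrelated groups, `ttest_ind` is the one to use.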
0.0 as a pvalue
Are you 'allowed' to state this p-value? Because if not, it's probably safer to state that the p-value is below machine epsilon (usually 10^-16)
i dunno i've never seen 10^-16 lol. usually something like < 0.001
If values are assumed to follow either t or norm-dist, a p-value of zero indicates an unbounded difference.
Alternatively, find out the actual t or z-score, and use logarithmic scale
If the logarithmic scale also breaks, your problem's precision is really problematic (since we're talking in orders of magnitudes when talking logarithmic scales)
oh im sorry Naarkie I meant they are 2 different teams, i misunderstood your question.
Denver Nuggets 2013-2015
Chicago Bulls 1996-1998
Oh I realise you have this
print("P-value =", round(p_value,4))
Yes you should state p-value<0.00005 instead (That's with respect to your round function)
>>> round (0.00004,4)
0.0
yeah then that's fine
oh
FWIW, physics uses 5-sigma or p-value ~ 3 * 10^-7 (about 1 in 3.5 million)
print("P-value =", round(p_value, 10))
no wait
no matter what i round to, it comes out as 0.0
What happens when you try just
print(p_value)
omg
Your choices are
- claim p-value is that number - I'm too skeptical of machine floating point to do so
- claim p-value below any usual significance level, which that p-value is definitely
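A rough sketch of the second option: report tiny p-values as an inequality instead of a rounded 0.0. The helper name `format_p_value`, the 1e-4 threshold, and the sample value 3.2e-40 are all made up for illustration:

```python
def format_p_value(p, threshold=1e-4):
    """Report tiny p-values as an inequality rather than a rounded 0.0."""
    if p < threshold:
        return f"p < {threshold:g}"
    return f"p = {p:.4f}"

tiny = 3.2e-40                  # hypothetical p-value, not the one from the test above

print(round(tiny, 10))          # rounding hides the magnitude
print(f"{tiny:.3e}")            # scientific notation keeps it
print(format_p_value(tiny))
print(format_p_value(0.0312))
```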
ok
i dont think i understand p-value like i thought
p-value is the probability that the hypothesis is true right?
I will quote wiki because it's very specific
probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct
Your null hypo is that the means are the same
i see
"At least as extreme as the results" refers to how different your results are from this assumption, with an underlying assumption that the means follow Gaussian distributions (if not, the central limit theorem covers it)
So "at least as extreme" refers to mean differences very far from 0
The "at least" part means inequalities: you integrate the tail of the distribution from negative infinity up to that point, OR on the other side, from positive infinity down towards some positive-side number (both tails for 2-sided)
Whether the test is 2-sided or 1-sided changes the integral by a factor of 2 (since the Gaussian is symmetric) and/or shifts the comparison values, which doesn't really matter with a p-value of that magnitude
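The tail-integral idea can be sketched with SciPy's normal distribution. The z-score of 40 is a made-up value chosen so that the p-value underflows; `norm.sf` is the upper-tail integral and `norm.logsf` is its logarithm:

```python
import numpy as np
from scipy import stats

z = 40.0  # hypothetical z-score, large enough that the tail area underflows

# Two-sided p-value: area in both tails beyond |z| (twice the one-sided area)
p_one_sided = stats.norm.sf(abs(z))
p_two_sided = 2 * p_one_sided

# On a log scale the same quantity stays finite
log10_p = (np.log(2) + stats.norm.logsf(abs(z))) / np.log(10)

print(p_two_sided)   # 0.0, because the true value is below what a float can hold
print(log10_p)       # a finite, very negative exponent

# For comparison: the one-sided tail area at 5 sigma
five_sigma = stats.norm.sf(5)
print(five_sigma)
```

So a printed 0.0 does not have to mean the p-value is exactly zero; it can just be smaller than the smallest representable float.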
holy shit

thats a ton of information, i have more research to do i can see!
thanks so much for the help 😄
take a basic stats class
read The Elements of Statistical Learning on Stanford's website
Is there an R discord someone could send me a discord link for
Suspect you'll have better luck with either slack or irc
Is there anyone particularly experienced in Keras?
I'm trying to build essentially a deep learning hashing algorithm. I have a Keras model, and I'll feed it an image, and another version of the same image with noise/rotations/crops, whatever else I want it to be invariant to. I run both through the same autoencoder, and I train on the similarity between the two vectors, trying to get them as close as possible.
But, there's a problem with this approach. If all that you do is nudge similarity closer together, then all your vectors will end up looking the same no matter what. So, I'm also running the original through an autodecoder and training both models on that too.
I have two loss functions. One trains the autoencoder by comparing the Cartesian distance between the vectors of the original and the scrambled image, and another trains both the autoencoder and the autodecoder on how well it can reconstruct the original image using the vector. Hopefully this combination of loss functions will yield a well trained model.
The issue comes in implementation. This is actually my first project, and I'm not very familiar with setting up branching networks like this in Keras. If I was doing something sequential it would be easy, but I have some questions.
1. The docs say that you can use Models like Layers, which are really just tf Tensors. How do I get that to work with multiple outputs? Furthermore, if I incorporate one model into another and train it, does it train both?
2. Right now how I have it set up is I'm passing it two images. In my autoencoder Model I define convolutional and max pooling layers, then some dense layers, and apply them all on both images in the correct order. My model does the same thing twice. But in "production," I only want to give it one and have it tell me what the autoencoder says. How would I rewrite it to do so, and link up the loss functions correctly?
Hey @rocky maple!
It looks like you tried to attach a file type that we do not allow. We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv.
Feel free to ask in #community-meta if you think this is a mistake.
So I can't attach notebooks. Alright.
@rocky maple
1)
a) each output is one input into the following layers
b) you can freeze layers so they aren't adjusted during training
2) huh? i don't follow sorry
But Keras allows you to branch and share layers, correct?
Is isinstance(Model, Layer) true?
Also, I do absolutely want to train every single value, but only on specific things
I'm gonna be annoying and recommend pytorch as usual
Deep self insight at least 😉
I wasn't a fan of Pytorch at first, but I really like the design concepts behind it now. My team uses Keras extensively, but I am going to reimplement a work project in Pytorch to show the team the difference.
The only thing that worries me about Pytorch is deployment in production. But, I am just waiting on more blog posts about that.
I keep wanting to try PyTorch, but I feel I should properly learn tf 2.0 first to evaluate my options better.
anyone here worked with speech to text models/libraries before?
fair enough, if you have business needs
but if you need to learn just one library for your own use
learn pytorch
I really like Keras Callbacks. Pytorch doesn't have a native way to do that. There are some helper libraries like Pytorch-Lightning and Poutyne that give it a more "Keras-like" api.
Pytorch documentation is pretty dope though.
And, I seemingly never have weird Cuda issues.
In Keras, do I have to output something to train on it for loss?
ModuleNotFoundError: No module named 'modin.backends.pandas.parsers'
modin is a pain in the ass
p value
can you call a classification task as predictive modelling?
import tensorflow_datasets as tfds
dataset, info = tfds.load("imdb_reviews/subwords8k", with_info=True, as_supervised=True)
Can anyone execute these 2 lines and tell if he/she is getting any error. Thanks
so if you had limited data for something would you use an oversampling technique to resample the data? if not what would you do
@wraith bluff was reading about it, as Qualcomm's new chipset has it as one of its features, so it was pretty interesting for me 🙂
@deft harbor thanks
hi guys, just wanted to share ...trying out sentiment analysis with the twitter api's...any suggestions or ideas for a specific data analysis algorithm to learn more...open to all suggestions 🙂
When creating a virtual env in anaconda, is it possible to import packages from the base dir to the virtual env, or do they need to be installed with pip every time?
Will copy pasting work?
i would suggest using pip to install them in your environment whenever you want!!
@native stag ESL isn't appropriate advice to all surely? there's quite a lot of math
i'm reading ISL right now i would start with that i have almost no math background and its completely understandable to me and is fantastic every data scientist should read it, i'm going to move to ESL after and i may have to learn some calc LA in between to fully understand ESL
@native stag ISLR is a more suitable start yes
ESL, no
ISLR wouldn't be suitable to most without any maths either though
I don't know why you'd recommend a resource like that - but hey ho
i have no maths and i'm doing fine it explains things well but ya whatever you wanna do
good for you - i'm talking for most
idk i wanted to incase people haven't heard of it sorry that it bothered you so much but gl to ya mate
it's just not helpful to someone beginning to be recommended texts that aren't at a suitable level imo
I started with ISLR and enjoyed it, found it really helpful
Pretty sure I wouldn't have been able to take as much away from it if it spent most of its time bashing me over the head with only linear algebra notation.
how hard is it to build an ai that can play a game like tic-tac-toe
in highschool
Anyone willing to help me port a Keras model to Pytorch? I am doing a lot with Conv3d stuff.
Input shape is (1, 240, 320, 3)
model = Sequential()
# Define model
model.add(Conv3D(32, kernel_size=(3, 3, 3), input_shape=input_shape, padding="same", kernel_regularizer=l2(opt.l2), bias_regularizer=l2(opt.l2)))
model.add(Activation('relu'))
model.add(Conv3D(32, padding="same", kernel_size=(3, 3, 3),kernel_regularizer=l2(opt.l2), bias_regularizer=l2(opt.l2)))
model.add(Activation('relu'))
model.add(MaxPooling3D(pool_size=(3, 3, 3), padding="same"))
model.add(Dropout(0.7))
model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3)))
model.add(Activation('relu'))
model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling3D(pool_size=(3, 3, 3), padding="same"))
model.add(Dropout(0.25))
model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3)))
model.add(Activation('relu'))
model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling3D(pool_size=(3, 3, 3), padding="same"))
model.add(Dropout(0.25))
model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3), kernel_regularizer=l2(opt.l2), bias_regularizer=l2(opt.l2)))
model.add(Activation('relu'))
model.add(Conv3D(64, padding="same", kernel_size=(3, 3, 3), kernel_regularizer=l2(opt.l2), bias_regularizer=l2(opt.l2)))
model.add(Activation('relu'))
model.add(MaxPooling3D(pool_size=(3, 3, 3), padding="same"))
model.add(Dropout(0.7))
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.7))
model.add(Dense(1, activation='sigmoid'))
model.compile(
    optimizer=RMSprop(lr=opt.learning_rate),
    loss='binary_crossentropy',
    metrics=['accuracy'])
return model
Forget the regularizers and what not. I cannot seem to translate the Flatten layer to Pytorch correctly. I cannot seem to translate the same padding options.
flatten should be doable
but padding is not, without manually calling a padding layer
pytorch and TF for some reason have conv operations with fundamentally different padding
you might be able to find some code that auto computes the correct manual padding for you
that said, whether this is worth it depends on whether you're training from scratch or reusing a trained model
if it's the former, you can skip the exact replication
if you're replicating then yea
worst case just go layer by layer and ensure that they match
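For the manual padding part, here's a rough sketch of the arithmetic behind TensorFlow's "same" rule for a single dimension. `same_padding_1d` is a made-up helper name, and the sizes are taken from the input shape posted above (depth 1, height 240, width 320):

```python
import math

def same_padding_1d(size, kernel, stride=1):
    """Return (pad_before, pad_after) following TensorFlow's "same" rule for one dimension."""
    out = math.ceil(size / stride)                        # "same" output length
    total = max((out - 1) * stride + kernel - size, 0)    # total padding needed
    before = total // 2                                   # any odd leftover goes at the end
    return before, total - before

for size in (1, 240, 320):
    conv_pad = same_padding_1d(size, kernel=3, stride=1)  # Conv3D, kernel 3, stride 1
    pool_pad = same_padding_1d(size, kernel=3, stride=3)  # MaxPooling3D, pool 3 (stride defaults to pool size)
    print(size, conv_pad, pool_pad)
```

For the stride-1 convolutions this comes out symmetric (1 on each side), which a plain `padding=1` in PyTorch covers. For the pooling layers the width needs (0, 1), which is asymmetric, so it would need an explicit padding step before the pool. As far as I know, TF ignores the padded cells when taking the max, so padding with 0 is not quite equivalent if activations can be negative.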
also make sure to configure cudnn to be deterministic
I am trying to replicate the model.
the flip side is that padding is about the only hard thing to translate between the two frameworks, in terms of straightforward models
I am not sure how to configure the Conv3d layers properly. Keras lets you be pretty lazy, but pytorch makes you be explicit. I've tried 2 different ways to create a flatten layer, but each time I get a tensor that is so big, it doesn't fit in memory.
explain what shapes you're trying to go from/to
you'll come to love how explicit pytorch is
I don't think I understand your question. What about shapes?
Oh, I am loving pytorch, until this project. lol
Nah, I like it.
you're trying to flatten right? that's manipulating the shape of hidden activations
put another way, you should be able to find out the shape of your outputs at every stage of that model
that's important for translating between tf and pytorch
I am not sure exactly what the dimensions were, but out of the convolution block, I ran x.view() to flatten the layer. This was an example I found online to flatten the layer.
that should work
2019-12-08 22:17:20,646 - ERROR - root: [example.py:95] Given input size: (512x1x29x39). Calculated output size: (512x0x14x19). Output size is too small
Model file
which line? (your log is pointing to the line number in the full file I guess?)
it doesn't look like Flatten is the issue?
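The sizes in that error can be reproduced with the usual floor-mode output formula. This is only a sketch: a pooling window of 2 with stride 2 and no padding is an assumption inferred from the logged sizes (29 -> 14, 39 -> 19), not something confirmed in the posted code, and `pool_out` is a made-up helper:

```python
def pool_out(size, kernel, stride=None, padding=0):
    """Output length of a pooling window along one dimension (floor mode)."""
    stride = stride or kernel
    return (size + 2 * padding - kernel) // stride + 1

in_shape = (1, 29, 39)  # depth, height, width from the error message

# Pooling all three dims with window 2: the depth of 1 collapses to 0
all_dims = tuple(pool_out(n, kernel=2) for n in in_shape)
print(all_dims)

# Pooling only height and width (window 1 along depth) keeps depth at 1
spatial_only = (pool_out(in_shape[0], kernel=1),) + tuple(pool_out(n, kernel=2) for n in in_shape[1:])
print(spatial_only)
```

So the depth dimension is already 1 going into that pool, and any window larger than 1 along depth with no padding leaves nothing to output.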
hey ! anyone tried new version of Spyder? 4.0 i presume it is.
any feedbacks ??
Yo anyone familiar with pytorch, because i have a problem and cant seem to figure out what it is.
ahh
# Choose device for training
if torch.cuda.is_available():  # call it; the bare function object is always truthy
    device = torch.device("cuda:0")
    print("Running on GPU")
else:
    device = torch.device("cpu")
    print("running on CPU")
net.to(device)
net = Net().to(device)  # note: creates a fresh model; an optimizer built earlier still holds the old parameters
# Print out training information, set epoch range to train
for epoch in range(50): # (n) full passes over the data # set to ridiculous amount if using accepted value
    for data in testset:  # `data` is a batch of data
        X, y = data  # X is the batch of features, y is the batch of targets.
        X, y = X.to(device), y.to(device)
        net.zero_grad()  # sets gradients to 0 before loss calc. You will do this likely every step.
        output = net(X.view(-1,784))  # pass in the reshaped batch (recall they are 28x28 atm)
        loss = F.nll_loss(output, y)  # calculate and grab the loss value
        loss.backward()  # apply this loss backwards thru the network's parameters
        optimizer.step()  # attempt to optimize weights to account for loss/gradients
    print(loss)  # print loss
    # Adding accepted value, Comment out if need be
    if loss <= (accploss):
        break
anyway, the loss i get remains constant and I'm either blind to the problem, or did something really dumb and don't know how to code for the life of me.
tensor(2.3093, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3093, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3093, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3093, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3093, device='cuda:0', grad_fn=<NllLossBackward>)
sample output
I'd appreciate any help whatsoever, thx in advance
maybe set net.train()
I'd also set the gradients on the optimizer to zero instead of the model.
# From the MNIST Pytorch example
def train(args, model, device, train_loader, optimizer, epoch):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % args.log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
guys i was looking into the functional API of keras and came across a problem [a doubt] why do they use the input shape as (64, 64, 1) when the input is just a 64x64 image
this is the code
# Convolutional Neural Network
from keras.utils import plot_model
from keras.models import Model
from keras.layers import Input
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.convolutional import Conv2D
from keras.layers.pooling import MaxPooling2D
visible = Input(shape=(64,64,1))
conv1 = Conv2D(32, kernel_size=4, activation='relu')(visible)
pool1 = MaxPooling2D(pool_size=(2, 2))(conv1)
conv2 = Conv2D(16, kernel_size=4, activation='relu')(pool1)
pool2 = MaxPooling2D(pool_size=(2, 2))(conv2)
flat = Flatten()(pool2)
hidden1 = Dense(10, activation='relu')(flat)
output = Dense(1, activation='sigmoid')(hidden1)
model = Model(inputs=visible, outputs=output)
# summarize layers
print(model.summary())
# plot graph
plot_model(model, to_file='convolutional_neural_network.png')
i think i got it !!!
is it for mentioning the number of channels?
Yep 👍
Yep...channels last. Setting that properly is extremely important...speaking from experience
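A quick NumPy sketch of what that extra 1 is, using an all-zeros array as a stand-in for a real 64x64 grayscale image:

```python
import numpy as np

gray = np.zeros((64, 64), dtype=np.float32)          # a single 64x64 grayscale image
with_channel = gray[..., np.newaxis]                 # channels last  -> (64, 64, 1)
batch = with_channel[np.newaxis, ...]                # add batch axis -> (1, 64, 64, 1)
channels_first = np.moveaxis(with_channel, -1, 0)    # channels first -> (1, 64, 64)

print(with_channel.shape, batch.shape, channels_first.shape)
```

An RGB image would have 3 in that last slot instead of 1, and the batch dimension is left out of `Input(shape=...)`.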
@idle oracle yes check the gradients after one backward pass
@oblique belfry did you figure it out?
nope
@idle oracle as @oblique belfry already said, you need to call optimizer.zero_grad() instead of net.zero_grad()
the two are generally interchangeable unless you have a very esoteric training scheme
optimizer.zero_grad is recommended, but net.zero_grad will still zero out the gradients for the model
@oblique belfry well lemme know if I can help. I've had to port models back and forth between TF and pytorch multiple times, so I'm well aware of the pain points
hey
trying to figure out how to parse OCR'd text that has arbitrary yet somewhat similar formatting
is there some kind of matching system that works well for messed up OCR
like for instance i might want to match "FURCHASE ORDER NO." with "PURCHASE ORDER NO." as my search criteria
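One low-tech option is the standard library's `difflib`, sketched below. The OCR lines other than the "FURCHASE" one are invented for the example, and the 0.8 cutoff is just a starting guess:

```python
from difflib import SequenceMatcher, get_close_matches

def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the strings are identical (case-insensitive here)."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

ocr_lines = ["FURCHASE ORDER NO.", "INVOICE DATE", "SHIP T0"]  # made-up OCR output
target = "PURCHASE ORDER NO."

score = similarity("FURCHASE ORDER NO.", target)
best = get_close_matches(target, ocr_lines, n=1, cutoff=0.8)

print(score)
print(best)
```

For large documents or lots of templates, an edit-distance library would likely be faster, but this is enough to try the idea out.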
@hardy crag nah i tried, didn't work
its most likely an issue with throwing things around with .to(device), because an older version works.
check the gradients! thx
if you call net.parameters(), you should get a list of parameters
check the .grad on each one. It should be none or 0 at the start, and after a forward pass and loss.backward, the .grad should be tensors
let me know if you do or do not see that
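Something like this is what I mean. It's a minimal sketch: the `nn.Linear` is a stand-in for the real `Net`, and the random batch replaces the MNIST data:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

net = nn.Linear(784, 10)                                  # stand-in for the real model
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

# Before any backward pass, .grad is None for every parameter
grads_before = [p.grad is None for p in net.parameters()]
print(grads_before)

X = torch.randn(4, 784)                                   # fake batch of 4 flattened images
y = torch.tensor([0, 1, 2, 3])
loss = F.nll_loss(F.log_softmax(net(X), dim=1), y)
loss.backward()

# After backward, every parameter should hold a gradient tensor of its own shape
for name, p in net.named_parameters():
    print(name, tuple(p.grad.shape), p.grad.abs().sum().item())

# The optimizer must hold the very same tensor objects as the model
model_ids = {id(p) for p in net.parameters()}
optim_ids = {id(p) for group in optimizer.param_groups for p in group["params"]}
print(model_ids == optim_ids)
```

If the last line prints False, the optimizer is updating some other set of parameters, and the loss of the model you're evaluating won't move.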
actually I have another guess. Can you show me the code you're using including where you initialize the optimizer?
Hey @idle oracle!
It looks like you tried to attach a file type that we do not allow. We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv.
Feel free to ask in #community-meta if you think this is a mistake.
ok....
@silent swan so now the loss values change, but...
tensor(2.2650, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.2917, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3413, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.2748, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3047, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.2935, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.2991, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.3095, device='cuda:0', grad_fn=<NllLossBackward>)
tensor(2.2907, device='cuda:0', grad_fn=<NllLossBackward>)
they're pretty stagnant
Ok, I think I have the problem identified, it must be a .to(device) issue
but i dont know how to fix it
try creating the optimizer after all the to(device) stuff
lol we could just fix the issue
Unless your model is tiny, use the GPU.
Honestly i now have a massive problem with weights
for some weird reason it seems like every time i reset the runtime and create a new model (code creates a fresh one) there is always a single weight unaccounted for, with a value of 0.000, whereas the others have massive negative values.
tensor([-23179.8730, -17778.3848, -14537.1084, -31701.2402, -27408.4082,
-20759.7539, -24848.2812, 0.0000, -38601.3164, -40405.9219],
grad_fn=<SelectBackward>)
tensor(7)
so now i have 7 with nothing in it
it goes , 0,1,2,3,4... etc
and the rest have massive neg values
i am going to sleep now cuz almost 12:00AM so i will chk back in the morning. about 7 hrs from now
i have a feeling its because my training set is like in colour
and im testing on
suggestions on how to change format?
You could grayscale the training data.
However.....a good neural net should be able to handle that.
Also, matplotlib's plots aren't always the most indicative of the data. The first image almost looks like a heatmap.
I need some help
is it normal for a vm instance to idle after a heavy computation..
my cpu utilization near maxed out during training for a few hours.. now it's done but in the model exporting phase it's not doing anything and I barely see a blip on the utilization..
how can I understand the unit vector (r_a) on the right side and the orange block U_s
I tried googling but I do not know how this works. I want to understand it
is your input 28x28x3 or 28x28x1?
in any case, no, if the color scale is modified, you should have no prior expectation that the model should still work
would you mind posting your whole code again?
yea ok
@silent swan what if my training set it inverted form what im testing
from*
i think it was sensitive to color
@silent swan it works now,
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from PIL import Image
import PIL.ImageOps
import numpy
im = Image.open("number.png")
img = PIL.ImageOps.invert(im)
num_img = numpy.array(img)
lum_img = num_img[:, :, 0]
plt.imshow(lum_img)
used this to apply the 'LUT' like the data set; inverted it and then it works with 99.6% accuracy
I use cv2 for image transformations. Found it to beat Pillow and scikit-image in terms of speed.
I'm trying to make a family tree image like MarriageBot's for my own bot
Here's a family tree from MarriageBot:
I'd like to achieve something similar, except I'm using PIL to make a family tree with this design:
I've never made something like this, how would I do it while keeping the design?
inversion of inputs can break a model
My coworkers worked on this massive project for weeks and they were getting nowhere. Ends up they had the inputs reversed. They wasted weeks. "channels_first" and "channels_last" are probably the most important words in image recognition.
Harder to mess up in Pytorch though since you have to be so explicit. Keras lets you be a bit too lazy at times.
For the most part, it is pretty cool.
I just am not a fan of Keras/TF exceptions. Curse of the static graph....
That alone might move me to Pytorch.
Also projects like Poutyne are pretty great. It gives a Keras-like interface to Pytorch. I can re-implement all my custom Keras callbacks and it will behave in a similar way as the original. I really didn't want to really configure all that again.
I've never gotten into the whole callback style of programming. For Keras, it really feels more like a hack because Keras owns all of your control flow, so you have to play by its rules
PyTorch is a little more verbose, but in return you basically control everything
but yes, there are definitely similar keras-like wrappers for pytorch
Saves me on boilerplate. I just like running a set of functions at the end. I really don't care about controlling all that. (For one project, I have a Socket.IO callback.) Just let that run at the end, I don't really care. I think the Callbacks are a nice abstraction. I think it makes the training code cleaner. But, that is more for my readability than true functionality.
I like how TF 2.0 essentially took everything people like about Pytorch and tried to implement that in their stuff.
well, I think they tried to at least
Instead of trying to bring in dynamic graphs (okay...technically it is called eager execution) in TF, they should have done an AngularJS and Angular 2 thing. Just do a rewrite and do it the right way.
Right now it is still a mess.
I think TF fundamentally targets a different goal though
I always have Cuda errors with TF. Seems like Pytorch always works with whatever version of Cuda I got. That is so nice.
I think that's mostly because of the TF/Google attitude
"you have to do things our way"
that's why the TF library contains everything, to force you into their eco system
PyTorch feels like it's trying to serve the users
TF feels like you need to follow their way
The next ML project I get, I might do the project in Pytorch.
The only thing about Pytorch is its "deployment strategy." TF/Keras does a good job at deploying models. Pytorch, to me, is a bit lagging in this area. But, this will improve with time and more people writing up articles.
fair enough. I'm on the research side so I don't do much with deployment
TF has a lot of good tools for that
but if you just want to train models, PyTorch is generally the far better option
how so
Still can't configure the model architecture correctly. lol
that's because of the incompatibility of TF/PyTorch conv padding. Not that one or the other is more correct
if anything, PyTorch allows you to drop a debugger into wherever you're running into an error, TF throws you into magic C error space
I haven't touched a debugger since I was learning Visual Basic in undergrad. I am not the biggest fan of them. I need to get into them more. Might could help.
The padding is what is killing me.
show me your stack trace and the line that it's throwing an error on
I will tonight

i have this image, is there a way to center it. Using np?
any machine learning engineer here?
@native rivet I’ll do my best. What’s up?
you can do a thing like find the bounding box for non-white pixels, find center of the bounding box, and then shift
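Roughly like this in NumPy. It's a sketch that assumes a 2-D grayscale array with a pure white (255) background; `center_content` is a made-up helper name and the tiny 9x9 array stands in for the real image:

```python
import numpy as np

def center_content(img, background=255):
    """Shift the non-background pixels of a 2-D grayscale array to the centre."""
    mask = img != background
    rows = np.where(mask.any(axis=1))[0]        # rows containing any content
    cols = np.where(mask.any(axis=0))[0]        # columns containing any content
    if rows.size == 0:                          # blank image: nothing to move
        return img.copy()
    crop = img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]   # bounding box
    out = np.full_like(img, background)
    top = (img.shape[0] - crop.shape[0]) // 2
    left = (img.shape[1] - crop.shape[1]) // 2
    out[top:top + crop.shape[0], left:left + crop.shape[1]] = crop
    return out

img = np.full((9, 9), 255, dtype=np.uint8)
img[0:3, 0:3] = 0                               # dark 3x3 blob in the top-left corner
centered = center_content(img)
print(np.argwhere(centered == 0).min(axis=0), np.argwhere(centered == 0).max(axis=0))
```

For a scanned image the background is rarely exactly 255, so a threshold (e.g. `img < 250`) would probably be needed instead of the exact comparison.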
Dash-Bootstrap-Components How to Build Layered Dashboards with Python https://youtu.be/P-XYio7G_Dg
https://arxiv.org/abs/1912.04316
An interesting approach to action recognition. They use graph CNNs to solve the problem. Very novel. I have not read an approach like this before.
We don't have an ML channel, so sorry if it doesn't truly fit data science.
Spatio-temporal action localization is a challenging yet fascinating task
that aims to detect and classify human actions in video clips. In this paper,
we develop a high-level video understanding...
I am having to refactor our custom ML platform, and I just cannot seem to get started. it was built very project specific. Trying to abstract that is going to be fun. My boss is also a bit opinionated about things and so some of my changes may not even matter.
For the record: Comments are not a waste of time.
"eh, I'll just use kwargs for this"
Honestly, I would take that.
Had to convince him we should use classes for this one thing because it holds state, and we were just passing things through a million functions. I get that some people go a bit too far with polymorphism and inheritance, but there is a time and place. You know?
Currently working on an action recognition project, but we are trying to make this platform accessible for sequence data, tabular data, audio, etc.
It's composed of several parts. Mainly data labelling and deep learning "best practices" our company has found to be very successful. Essentially standardize things and allow an ML engineer to focus on building models, not software development.
If anything, it has been a killer in-house tool.
sounds nice
how did you build it
I'm familiar with building models but not so much on serving..
Can't dive too deep into it. But, the python side is pretty simple. We found caching our data in a fast, binary format gave us 5x to 10x improvement on training time. So, after the traditional ETL process, prep your data into a high performance format, then load that into a Keras generator or Pytorch Dataloader. Kind of like ahead-of-time compiling code. Same principle.
Used React and Express to create a SPA for labeling and viewing model stats. We also found TensorBoard annoying as hell to work with.
labelling?
data labelling.
There weren't any "simple" data labelling solutions. And, we weren't interested in Sagemaker.
how do you save and version models
Labeling videos for Action Recognition isn't the easiest. Gotta search per frame.
ok then
Save it locally. Though it isn't hard to save stuff to S3 or Azure. For each run, we save weights, logs, performance stats, and other stuff.
hmm ok
I was reading about Decision trees on medium.. Can anyone tell me what does it mean by this paragraph 👇
I dont understand what does it mean by saying pure
i even found the same text in the decision tree documentation
Pure means the leaf has only 1 category
To give an example, suppose you had a dataset of 6 things, of category "Blue" and "Red"
Suppose in your dataset you have some predictors
What the tree does is calculate for entropy changes ('information gain') by splitting using the predictors (if predictor <= some_value, classify in some way, else classify the other way)
I have terrible drawing skills but this should illustrate my point
Suppose that basically of your 6 data points, 4 were "Red" and 2 were "Blue"
We assume for simplicity that your data is very good, so it just splits once, and then all the 2 "Blue" get together - it's now pure
The same happens for the 4 "Red" - the leaf is now pure.
Since the tree has split into pure leaves, the algorithm ends
The splitting and deciding of predictors is the core to the algorithm - use of Entropy/Information Gain isn't the only way - although there are reasons for doing so
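The 4 "Red" / 2 "Blue" example can be worked through in a few lines. This is a sketch using Shannon entropy in bits; `entropy` and `information_gain` are helper names made up for the example:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits; 0.0 for a pure set of labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted

parent = ["Red"] * 4 + ["Blue"] * 2
left, right = ["Red"] * 4, ["Blue"] * 2     # a perfect split: both leaves are pure

print(round(entropy(parent), 3))            # mixed node, entropy > 0
print(entropy(left), entropy(right))        # pure leaves, entropy 0
print(round(information_gain(parent, [left, right]), 3))
```

A split that leaves both children pure recovers all of the parent's entropy as gain, which is why the tree stops there.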
Ok i get it now..
thanks a lot
I have one more question
If we oversplit the data in decision tree what will happen?
Will we overfit the data then?
Yes you will overfit your training set, so you need a model selector/parameters/algorithm to decide if you really need a tree that always ends in pure leaves
What happens is that with the fully grown tree, you can prune it at somewhere in the middle, and this is something you can do after you grow the tree
It's going to be a hard problem to dynamically check for overfitting while growing the tree - so that's why that is done after
With pruning the tree becomes something that ends in 'impure' leaves, but what's important is if the tree is really useful and generalises to actual data or use cases you want to present it towards
Ok got it.. thanks a lot
thats a pretty tree you have there
@lapis sequoia look at bagging and boosting as well
Is this the appropriate channel to discuss things related to Scrapy?
I'm not against it, but I don't know the rules
if you keep it general and not 'im scraping some site that has TOS saying not to' its probably fine
although it's also not data science
@deft harbor Sure will
Also in Random forests if we used a large number of n_estimators or trees will we overfit the data?
eventually in some way yes
hi there
anyone heard of ai dungeon2? question related:
can you help me out by modifying that model in a way to shove the entire save and not just the prompt + 10 8 last phrases?
here's the original repo https://github.com/AIDungeon/AIDungeon
here's an example of a desired outcome:
you save your conversation, you load it via id and it shoves the contents of the save into the model, you continue your conversation intact and the context is not broken
I don't think I've seen a lot of people do the work for others here
sorry, but i'm not an ai(TF) programmer, so i can't really figure it out myself
Anyone working in a project that requires scraping the web with Scrapy? I'm up to join you to learn it hands-on and eventually help you in exchange for some knowledge. I've some experience scraping and wrangling data using requests, selenium webdriver, beautifulsoup, regex, lxml, json...
If you've worked with Scrapy before but are not using it in any of your project and have some free time, I'd be interested in partnering up to tackle some whatever-payment freelance jobs using it if that pleases you.
Not sure if this is appropriate channel but seems like it's the most appropriate one.
hey is it possible to create an AI that plays a game against you and gets better everytime?
That is the field of Reinforcement Learning.
i have merged two csv files into a single dataframe, is there any way i can check if the row from the new merged dataframe is present in the second csv file or not?
plenty of ways. why not add a column called “source”, whose value is “first” for the first csv and “second” for the second csv, before merging them?
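A minimal sketch of the "source" column idea, assuming the two files are stacked with pandas. The frames `first` and `second` are hypothetical stand-ins for the two CSVs (in practice they would come from `pd.read_csv`):

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
first = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
second = pd.DataFrame({"id": [3, 4], "value": ["c", "d"]})

# Tag every row with its origin before combining the frames.
first["source"] = "first"
second["source"] = "second"
merged = pd.concat([first, second], ignore_index=True)

# Rows of the combined frame that came from the second file.
from_second = merged[merged["source"] == "second"]
print(from_second)
```

If the frames are joined with `DataFrame.merge` rather than stacked, its `indicator=True` option adds a similar `_merge` column.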
alrighty then, since you either don't know the answer to the previous question or just don't want to answer here's another two three:
- is ML available to the general populace yet (has it been dumbed down enough for anyone to take it on)?
- what is the least resource (GPU/CPU) intensive but just as performant model (compared to GPT-2)?
- what model is better than GPT-2 and supports i18n?
(pardon if i mix my terms up)
df1 = data[data.MESS_DATUM >= 19600101]
df1 = df1[df1.MESS_DATUM <= 19981231]
How do i get this in one line ?
'and' and '&' didn't work
```
data[data.MESS_DATUM.between(19600101,19981231)]
```
@lyric kernel
crisp!
if you want to use &, you're gonna need to surround your conditions in parentheses, like
data[(data.MESS_DATUM >= 19600101) & (data.MESS_DATUM <= 19981231)]
I'm guessing that's why your attempt didn't work
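For reference, a small sketch showing that the two one-liners select the same rows. The sample values are made up. As far as I know, `and` fails because Python asks the whole Series for a single truth value, and a bare `&` without parentheses binds more tightly than `>=` and `<=`:

```python
import pandas as pd

# Made-up dates in the same integer YYYYMMDD form as MESS_DATUM.
data = pd.DataFrame(
    {"MESS_DATUM": [19591231, 19600101, 19750615, 19981231, 19990101]}
)

# Series.between is inclusive on both ends by default.
mask_between = data.MESS_DATUM.between(19600101, 19981231)

# Each comparison is parenthesised so that & combines two boolean Series.
mask_and = (data.MESS_DATUM >= 19600101) & (data.MESS_DATUM <= 19981231)

filtered = data[mask_and]
print(mask_between.equals(mask_and))
print(filtered)
```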
pipenv, venv, or conda for a machine learning project?
conda, always conda
has somebody used a graph database to recommend users based on keywords? The keywords are extracted from a text and combined with the users
Hi, Is this the right place to ask for some Pandas help?
awesome
@silent swan whats the advantage to conda?
imo venv would give you the most flexibility. Some packages might not be of the latest version in conda
conda installs all the scientific computing libraries properly
you get both the package installer and environment manager in one
I've always found conda to be annoying. And the only issue I have with any scientific computing packages is TensorFlow not playing well with certain versions of CUDA.
If anyone has extra time and would like to help with the creation of my Jojo bot you can help create stands here https://docs.google.com/document/d/1o4gkz4jmROzNSp79LvOwggQBW2sUI5gZ0Ft6MF2WEPo/edit?usp=sharing
time so far < expected time to completion probably?
I can’t remember exactly. But. I remember it’s just simpler to set up a virtual env and install what I need. I also don’t mind pipenv either.
I also don’t need all those dependencies in a project.
But it comes with Spyder 😶
....I don’t use Spyder 😬
That's what it says. 900 seconds per iteration, and you have 10000 iterations.
Does anyone know
where I can get started
learning how to make
AI learn how to play games?
@spare arch
@oblique belfry i found conda confusing at first as it dumps stuff into your bashrc or something, but after using it for a bit it's fine really, haven't had issues since, perhaps i don't do a fat lot with it though
If I was on Windows, I'd take a second look at it. But I just haven't had a need good ole pip couldnt solve.
@oblique belfry fair, i'm only using it because the team uses it, and if it wasn't for that it's unlikely i'd have got past the initial hiccup i expect
I get that.
I’ve read that before. Like I said, I don’t look down on conda. But there just hasn’t been a situation where I needed something else than pip to get all my ml packages going.
You know you want spyder
@oblique belfry Do you only use one version of python or you create venvs to replace conda?
Just venvs. I like the separation between projects.
Then that's just fine.
Conda is a huge help when you need to run several python versions and libs versions without having to store python and libs versions over and over within each project folder
But if you're satisfied with storing libs and python for each project, that's a way to do it
it’s been a while since i last used conda, but in my experience it also takes really long for conda to resolve dependencies and when installing packages; whereas if i just want to spin up a quick virtualenv, venv+pip is usually much faster and i can get going quickly
i see the merits of conda as well though, but just saying it’s not for everybody and not necessarily a de facto standard for all ds projects
how to speed up model training using tensorflow for object detection (YOLOv3 Darknet)? it takes 5 days for 1000 iterations
faster GPU, larger batch sizes, or tweak the learning rate if you're okay with having slightly worse performance
i already use a GPU and have also run it on a google colab GPU but it takes the same time
That was a useful article sheemp
@deft harbor thanks
:<
What did you go with
I'd use the original Yolo Darknet that is written in C. It is more finicky to work with, but the performance time is impressive.
Does anyone happen to have a neat example of calculating information gain in Python?
Hello everyone, I'm trying to write an essay on optical character recognition as implemented with machine learning. do you guys have any interesting or useful sources explaining the topic? looking for youtube videos or articles or academic papers.
As it's not for a CS course I want to explain what OCR and Machine learning are as well as what it's useful for.
what libraries do you all use to automate testing different neural architectures? i'm not going to be able to work interactively
not just regular hyperparameter tuning but also things like number, type, and size of layers
on top of keras*
although i'm sure more general solutions exist
def fill_data(data_frame):
    for i in np.setdiff1d(unique_full_folder_path, data_frame["Full Folder Path"].values):
        data_frame.loc[data_frame.shape[0]] = [i, 0, *data_frame.iloc[0,2:].values]
    return data_frame
can anyone tell me what this does?
unique_full_folder_path = report_1_df["Full Folder Path"].append(report_2_df["Full Folder Path"]).unique()
testing in what sense
looks like it gets all the unique "Full Folder Path"s across both data frames
sql, you can ask in databases
Does anyone know of an example of a decision tree (preferably id3) implemented from scratch? I can find a couple on random githubs, but they seem to have issues
@worn stratus you wanna implement decision tree from scratch or you wanna an example for it?
i do have an example and can share you...so do let me know
Yeah, an example would be useful
my end goal is random forest from scratch
but the first step is a decision tree
Hey I need to create a program that can check similarity between two images. Is machine learning the best way to solve this?
are you looking for the SAME image?
if all you need to do is match the same picture to itself, then you dont need machine learning
you could just see if the pixels match
@vague merlin
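A minimal sketch of the "just see if the pixels match" check, assuming both images are already decoded into NumPy arrays. The tiny arrays are made up, and both helper names are hypothetical:

```python
import numpy as np

def same_image(a, b):
    """True only when both arrays have the same shape and identical pixels."""
    return np.array_equal(a, b)

def mean_abs_difference(a, b):
    """Average per-pixel difference; only defined for equally sized images."""
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    return float(np.mean(np.abs(a.astype(np.float64) - b.astype(np.float64))))

# Made-up 2x2 grayscale images standing in for decoded files.
img = np.array([[0, 50], [100, 150]], dtype=np.uint8)
copy = img.copy()
brighter = img + 10  # every pixel shifted by 10, still within uint8 range

print(same_image(img, copy))
print(same_image(img, brighter))
print(mean_abs_difference(img, brighter))
```

An exact comparison breaks as soon as the image is re-encoded, resized or shot from another angle, which is where the fuzzier approaches come in.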
@worn stratus for information gain in Python check this out https://machinelearningmastery.com/information-gain-and-mutual-information/
Has worked examples
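For a quick feel of the calculation, here is a small sketch using only the standard library. It assumes discrete labels and a single categorical feature; the function names are my own, not taken from the linked article:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
perfect_split = ["a", "a", "b", "b"]   # feature separates the classes completely
useless_split = ["a", "b", "a", "b"]   # feature tells us nothing about the class

gain_perfect = information_gain(labels, perfect_split)
gain_useless = information_gain(labels, useless_split)
print(gain_perfect, gain_useless)
```

ID3 evaluates this quantity for every candidate feature at a node and splits on the one with the highest gain.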
Thanks for the read
how is this different from stacking models
say I use catboost or something that works well with categories.. and then append predictions to support the next model on top of this
@worn stratus if you're good at reading code you can read the sklearn cython
actually maybe not, I don't think it's a good learning experience
@silent swan at this point I'd happily look at it in the sklearn source, but I can't find it
nvm - got it
sorry for the ping
@deft harbor It needs to be able to check similarity between two different pictures, it could be almost the same image but from a different angle or zoom etc
Ah, then yeah, you will most likely need some sort of NN for that. @vague merlin
How many different image classes will you have? For example, you know you will want to match pictures of a certain statue downtown, a bus stop and a specific building, that would be three.
If you want to match all possible similar images of anything, that's going to be a pretty large undertaking.
@deft harbor ah okey,
it needs to check similarity between different rooms within a house, so if there are two images of the same kitchen but from a different angle it would still give it a high similarity score, the same goes for bedrooms, bathrooms and so on, preferably the outside of the house as well. Just enough to determine if it's the same house or not.
I assume that it would be quite a large project to make something like that work?
still, try some silly heuristic like pixel-level difference, maybe with image registration first
Maybe try running an edge detector on the image first, and then running correlate2d.
I've had some luck with pHash too.
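To give an idea of how hash-based comparison works, here is a sketch of an "average hash". This is a simpler cousin of pHash, not pHash itself, written with plain NumPy on a grayscale array. The test image and the function names are assumptions for illustration:

```python
import numpy as np

def average_hash(gray, hash_size=8):
    """Shrink the image to hash_size x hash_size block means and threshold at their mean."""
    h, w = gray.shape
    h2, w2 = h - h % hash_size, w - w % hash_size  # crop so blocks divide evenly
    blocks = gray[:h2, :w2].reshape(hash_size, h2 // hash_size, hash_size, w2 // hash_size)
    small = blocks.mean(axis=(1, 3))
    return (small > small.mean()).flatten()

def hamming_distance(hash_a, hash_b):
    """Number of differing bits; a small distance suggests similar images."""
    return int(np.count_nonzero(hash_a != hash_b))

# Synthetic 64x64 grayscale ramp standing in for a real photo.
gray = np.arange(64 * 64, dtype=np.float64).reshape(64, 64)
brighter = gray * 1.1 + 5        # same picture with a brightness/contrast change
inverted = gray.max() - gray     # the photographic negative

d_brighter = hamming_distance(average_hash(gray), average_hash(brighter))
d_inverted = hamming_distance(average_hash(gray), average_hash(inverted))
print(d_brighter, d_inverted)
```

Hashes like this tolerate brightness and mild scaling changes, but I would not expect them to survive a large change of viewpoint, so treat it as a cheap first filter.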
thanks guys, i will try a few different solutions and see what results i get, https://en.wikipedia.org/wiki/Scale-invariant_feature_transform seem interesting as well
Hey. What is the problem here? I visited tf errors page but I couldn't find this error.
Traceback (most recent call last):
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "C:\Users\PC\Anaconda3\envs\TF\lib\imp.py", line 242, in load_module
return load_dynamic(name, filename, file)
File "C:\Users\PC\Anaconda3\envs\TF\lib\imp.py", line 342, in load_dynamic
return _load(spec)
ImportError: DLL load failed with error code 3221225501
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/PC/PycharmProjects/TF/test.py", line 1, in <module>
import tensorflow
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow\__init__.py", line 98, in <module>
from tensorflow_core import *
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\__init__.py", line 40, in <module>
from tensorflow.python.tools import module_util as _module_util
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow\__init__.py", line 50, in __getattr__
module = self._load()
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow\__init__.py", line 44, in _load
module = _importlib.import_module(self.__name__)
File "C:\Users\PC\Anaconda3\envs\TF\lib\importlib\__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 74, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "C:\Users\PC\Anaconda3\envs\TF\lib\site-packages\tensorflow_core\python\pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "C:\Users\PC\Anaconda3\envs\TF\lib\imp.py", line 242, in load_module
return load_dynamic(name, filename, file)
File "C:\Users\PC\Anaconda3\envs\TF\lib\imp.py", line 342, in load_dynamic
return _load(spec)
ImportError: DLL load failed with error code 3221225501
Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/errors
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
This error occurs even when I want to import it
I'm using windows 7 if that's needed to know
@uncut shadow what do you want to run locally with TF
google colab can run TF without separately installing it
Hey, anyone here use Anaconda? I can't seem to update my Spyder...
the "conda install spyder-4.0.0" doesn't work in the Anaconda Prompt, it says it's not a legit command
Hey, I'm trying to give my matrix categories by first turning the words in a column into numbers, and then setting those numbers as categories. However, I keep getting an error message and I don't know what to do, my code is identical to the professor's
Professor's Code:
Any help would be greatly appreciated, im just starting out learning python for data science!
you need to install those packages
are you using spyder
open anaconda prompt and run conda install numpy matplotlib
Hey
```
valid,value
2004-07-21 20:00:00,280
2004-07-21 21:00:00,020
```
df.iloc[1]['value2'] = 555
I am trying to create a new column value2 for the last row, but the above code doesn't do anything
I want it to look like
```
valid,value,value2
2004-07-21 20:00:00,280,
2004-07-21 21:00:00,020,555
```
how can I do that?
this is a csv?
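A possible fix, sketched under the assumption that the CSV is loaded with pandas. As far as I can tell, `df.iloc[1]['value2'] = 555` writes into a temporary copy of the row, so the frame never sees it; a single `.loc` call with row and column together does the assignment on the frame itself:

```python
import io
import pandas as pd

# The two sample rows from the question, read from a string instead of a file.
csv_text = (
    "valid,value\n"
    "2004-07-21 20:00:00,280\n"
    "2004-07-21 21:00:00,020\n"
)
# dtype=str keeps the leading zero in "020".
df = pd.read_csv(io.StringIO(csv_text), dtype={"value": str})

# One .loc call: last row label, new column name. Other rows get NaN.
df.loc[df.index[-1], "value2"] = 555
print(df)
```

Writing the frame back with `df.to_csv(..., index=False)` leaves the `value2` field empty for the rows that were not set.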
Hey everyone, I'm trying to figure out how to do the following with numpy:
There's an array a = [1, 2, 3, 4, 5] with data and an array b = [0, 1, 2, 0, 1] which are row numbers
How would I create a matrix using the data from a, placing the elements in the rows specified by b but keeping their column position from a?
Like:
[[1, 0, 0, 4, 0],
[0, 2, 0, 0, 5],
[0, 0, 3, 0, 0]]
with scipy you can use coo_matrix to construct a sparse matrix using row,col,data then call .todense() after
@oblique wyvern
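If pulling in scipy is not wanted, a plain NumPy alternative to the coo_matrix suggestion is integer-array indexing. A minimal sketch with the arrays from the question:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])   # data
b = np.array([0, 1, 2, 0, 1])   # target row for each element of a

# One row per distinct row number, one column per element of a.
m = np.zeros((b.max() + 1, a.size), dtype=a.dtype)

# Element k of a goes to row b[k], column k.
m[b, np.arange(a.size)] = a
print(m)
```

This builds the dense matrix directly, so the sparse route is probably still the better choice when the result is very large and mostly zeros.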
Is anyone willing to help me troubleshoot my Anaconda installation? I can't update it OR my Spyder program using either: conda update anaconda OR conda install spyder=4.0.0
It's really frustrating, especially because uninstalling and installing seems to take so long
@lapis sequoia yes, a csv
Would using PCA for dimensionality reduction be a good method for a dataset with 600 features?
I’m going to use clustering on the dataset, but I feel as though I am getting really weird results
It won’t hurt. Try and see what happens.
I did, but I’m having some issues with plotting it. If I want to keep my variance at greater than 0.85, I still have over 100 features
If I reduce it to 0.59 i have 3 features which is nice to work with, but wouldn’t the data be extremely muddy?
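To see the trade-off in numbers, here is a sketch that counts how many principal components are needed for a given share of the variance. It uses NumPy's SVD on synthetic data; the data and the helper name are assumptions, not the dataset being discussed:

```python
import numpy as np

def components_for_variance(X, threshold):
    """Smallest number of principal components whose cumulative explained variance reaches threshold."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    n_components = int(np.searchsorted(cumulative, threshold) + 1)
    return min(n_components, cumulative.size), cumulative

# Synthetic data: 5 independent features with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])

n_85, cumulative = components_for_variance(X, 0.85)
n_50, _ = components_for_variance(X, 0.50)
print(n_85, n_50)
print(cumulative)
```

Keeping 0.59 of the variance means the remaining 0.41 is discarded. One option is to cluster on the larger set of components and use the 2 or 3 leading ones only for plotting.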
Hey, I'm new to machine learning, and I'm learning about multiple linear regression, decision trees, support vector machines, random forest classification.
It seems the course is directed towards business, slightly, but i want to know if all this information is Applicable to AI in robotics
for example programming a robot to avoid obstacles
or pick up certain objects
any help would be appreciated
Look into reinforcement learning.
Okay
but
i would like to know if multiple linear regression, decision trees, support vector machines, random forest classification, this sort of thing, will help me in applying machine learning to AI in relation to say robotics
I think I’m just gonna run with 0.59 variance. It works 
If anyone can confirm for me that content such as multiple linear regression, decision trees, support vector machines, random forest classification, this sort of thing, will help me in applying machine learning to AI in relation to say robotics
you've got a long way to go to get to robotics
@primal ravine Potentially...but I don’t know of people currently doing that.
Those are good techniques to know about. But, neural networks are the new norm in that field. You need to learn a super complex non-Linear function. Only neural nets can capture that.
@silent swan Agreed.
those are the correct starting points, but it'll still be far away from your goal
but if this is like something you want to do over the course of like, 3 years, yes, that's where to start
@rigid summit post your error message, and are you using windows?
thanks @silent swan , I posted the issue here: https://github.com/ContinuumIO/anaconda-issues/issues/11524
I don't know if I'll get a response, there are 1000+ open issues apparently
I am using windows
is anyone alive
I need help understanding this graph..
why does the % change keep decreasing
does that mean media is only part of digital ad spending?
nvm I figured it out.. but I'm having trouble understanding difference between ARIMA and ARMA
Hey, can someone explain to me how Deep Learning or Neural networks are used in robotics? By that i mean, how do you actually allow the robot to make its own decision using your intended code? Do they all require an arduino? Is there a more powerful alternative?
Reinforcement learning. Lol
Not all of them have an arduino. But they have some type of sensors that feed into some type of computer. It depends on what the task is. Object detection requires a different architecture than movement.
Robotics is a VERY big field. And each robotics problem can be subdivided into many miniature problems that might use multiple neural networks.
My coworker is doing some data transformation on Temple's eeg corpus.
Man...that is some gnarly code, which is all done in an IPython terminal. I am not a fan of running long-running jobs in IPython/Jupyter.
@oblique belfry looks like it's been copy pasted from a script no?
i use this workflow quite often
Eh.....knowing him....doubt it.
He trains all of his neural nets (ones that take over 5 hours to complete an epoch) through Jupyter. I have had too many kernels crap out on long computations. The above script will take a day and a half to complete. He must have better luck with Jupyter than I do.
I’ve been looking online and just want to make sure I’m right about this
For unsupervised, clustering methods we don’t need to split between training and testing data?
I'm a fan of doing whatever you want in notebooks, but once things move into code, unless it's explicitly a script, you gotta start cleaning things up
@fierce ravine that intuition is somewhat correct, but there can be more subtlety around it
certainly generally you don't really care about evaluating against anything
or in the case of visualization methods, you just want any decent representation of your data so feel free to run over and over on the whole dataset

Hi! I have an NLP task. There is a text (telephone conversations). Voice is already converted into text and is divided into agent and customer paragraphs. I need to understand what approach is the best one for the next tasks:
- Who is the customer and who is the agent?
- Customer Name
- The topic of conversation
- Promises made by the operator to the customer (for example, "I call back tomorrow")
- Negative Sentiment (if there is something in the conversation that the subscriber is not happy with)
I am just trying to understand how to handle it. Is it possible to create some kind of general approach for this? If yes, which packages (maybe BERT), publications, or books should I look at?
best to think in terms of "what is the output" for each task
e.g. 3 is "document" -> "topic" classification
1/5 are per-sentence classification
(some of these can be reframed but this is a starting point)
2 is sort of span prediction, potentially 4 can be framed the same way as well
after that, go look for which models are suitable for each
fwiw, BERT (with additional modules) is suitable for all of them
but BERT is also much more computationally intensive than simpler models
I am running a GPT-2 text generation model that looks something like this
```
gpt2.generate(sess,
              model_name=model_name,
              prefix=pbuffer[0],
              return_as_list=True,
              length=120,
              temperature=flavorslider.value,
              top_p=0.9,
              truncate='<|e',
              nsamples=1,
              batch_size=1,
              )
```
where sess = tensorflow.compat.v1.Session
what I want to do is clear the session graph and variables each run to prevent memory leaks
so basically after this runs, collect the output, then close the session
and just before it runs the next time, open a fresh session
anyone know how to do this?
does session.close() not suffice? or using a context manager
it does work, however returns an error about attempting to reuse closed tf session
not sure why, since its at the end of the code
do you create a new session each time?
yeah thats what im trying to do
sure
```
def reset_session(sess, threads=-1, server=None):
    """Resets the current TensorFlow session, to clear memory
    or load another model.
    """
    tf.compat.v1.reset_default_graph()
    sess.close()
    sess = start_tf_sess(threads, server)
    return sess
```
this might be working
oh it's ai dungeon
yea anyway you can do the above which sounds like it should work
all you want to do is either 1) close and create a new session each time
yeah it wasnt working before but now im not getting an error message
or 2) use a context manager, which does that for you
ok, that was what i thought originally, but the error messages had me confused
glad to know i was right
thank you
always this
```
gpt2.start_tf_sess(threads=-1, server=None)
with sess:
    message = gpt2.generate(sess,
                            model_name=model_name,
                            prefix=pbuffer[0],
                            return_as_list=True,
                            length=120,
                            temperature=flavorslider.value,
                            top_p=0.9,
                            truncate='<|e',
                            nsamples=1,
                            batch_size=1,
                            )
return
```
yes i have switched to that
was getting initialization errors so had to add this:
```
init_op = tf.global_variables_initializer()
```
yep, you need to reinitialize the global variables for a new session
sess.run(init_op)
it runs, but i get crazy garbled output
trying a slightly different arrangement
That’s why I don’t like TF.
new to this, i just finished my first python course and i have been practicing on these two apps, which one do you think is better?
@fallen pendant First of all, congrats for completing first course on python👍
My priority (Level of hardness/complexity) would be:
Hackerrank (Improves your basic)
Leetcode (Improves your knowledge through medium complexity)
Hackerearth (Gives you better knowledge and also you can get internships or jobs)
SPOJ(Very tough level of problems)
Many more exist, but practicing on these is more than enough.
If you are still having any doubts, you can contact me at:
https://www.linkedin.com/in/tejas-s-401ab4185
@granite marsh thanks so much
Don't hesitate to ask me for help, reach me at:
https://www.linkedin.com/in/tejas-s-401ab4185
anyone done parallel processing with R? I'm wondering whether it's more straightforward or not than Python
seems fairly straightforward.
I think this is more of an operational question instead of data science itself but I'm a bit curious, do you guys implement a workflow in your work as a data scientist?
I'm not talking about methodologies (osemn, crisp, asum, etc), more like a team workflow. I've heard of Agile, but that's complicated to implement in a data science team.
Anybody a seasoned user of Anaconda? The tech support is terrible to non-existent so far for me. I have an issue described here:
Try deleting the sitepackages -> %USERPROFILE%/AppData/Roaming/ -> Python/../site-packages
In which environment are you trying to update your packages? The base environment?
Check with "conda info -e"
Anaconda is not that great at handling path variables at install/uninstall. You should check them out and make sure your user path is not referenced for Anaconda:
Oh awesome, thanks let me check
It would be amazing if I can finally get this to work...
the base is c:\Anaconda
@drifting hemlock How do I pull that path up?
I mean, that screen...
Type path in the search bar and select this one:
Then click in PATH and click the Edit... button
Alright, looks like the "Path" does reference my account/user (with the space) for Python
What should I do?
Just change it to reference the folder in which you installed Anaconda, for example change:
C:\Users\Your Username\Anaconda3\Library\usr\bin
to
C:\Anaconda3\Library\usr\bin
Oh, and remember to close and reopen the console so it can refresh PATH
Alright 👍 checking to see if that worked
shoot, no dice... I might have not done it properly - the Paths didn't say Anaconda, they said Python, so that might be one issue... also the user config file, populated config files, package caches, and envs directories all still point through my user name
Then the issue is definitely your path, unfortunately Anaconda sucks at updating the path environment. You have two options I believe:
- Update PATH manually. This means removing the entries referencing Anaconda in your Path variable and then creating them again. That can be a pain in the ass.
- Remove all the entries referencing Anaconda in your PATH variable and then reinstall Anaconda, making sure to select Add Anaconda to my PATH environment variable.
Just full disclosure: playing with Path can lead to undesirable results, so have a backup of your path just in case.
Thanks 🙂 ... how undesirable?
Well, depends on what you have there, anyways you can easily make a backup and then restore it if anything goes wrong, you can just open regedit and go to Computer\HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Environment and then clicking in Path and saving the contents in a text file.
It's not that scary honestly.
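For a quick backup without opening regedit, a small Python sketch could dump the current PATH to a text file. Note the assumption: this saves the PATH as seen by the running process (machine and user entries already merged), not the raw registry value, and the file name is arbitrary:

```python
import os
import tempfile
from pathlib import Path

def backup_path_variable(destination):
    """Write each PATH entry of the current process on its own line and return the entries."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    Path(destination).write_text("\n".join(entries) + "\n", encoding="utf-8")
    return entries

# Arbitrary location for the backup; any writable folder works.
backup_file = Path(tempfile.gettempdir()) / "path_backup.txt"
saved_entries = backup_path_variable(backup_file)
print(f"saved {len(saved_entries)} entries to {backup_file}")
```

Joining the saved lines with `os.pathsep` gives back the original string if something needs to be restored by hand.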
Alright, I'll give it a shot. Thanks very much for your help.
Oh, one thing before I get started - I accidentally deleted "Path" that included an entry with this %USERPROFILE% in it... the only other two were for python... I'm hoping if I restart my computer it will come back...
Should have made a backup, haha
I really hope so too hahaha
Know what? Let me show you how to back it up easily without getting into the registry.
Hold on
@rigid summit here you go https://www.youtube.com/watch?v=dKE1EpACl2E it's on low quality right now because it is still being processed
Perfect!! Thanks. Crossing my fingers that this works
Damnit. I'm still getting the same error when I try to install updates - the directories are all the same too (with my user name in it) using conda info
The Paths are all correct now though.
what environment are you using? Base? do conda env list
You should get something like:
PS C:\Users\Franccesco> conda env list
# conda environments:
#
base * C:\Users\username\Anaconda3
getaltname C:\Users\username\Anaconda3\envs\getaltname
Just: C:\Anaconda
hey guys has anyone here worked with Scrapy for scraping? i am trying to scrape an infinite scrolling page but i dont know how to do it
anyone knows how to get the historical 1 minute data from january 2018 for bitmex without getting api banned
https://www.nature.com/articles/d41586-019-03895-5
Good article. Nice to see fellow ML engineers frustrated by the lack of reproducibility. I am all for the push for submissions to have code when published.
"Code submission is one of the elements I’m most impressed with. A year ago, 50% of accepted NeurIPS papers contained a link to code; this year, we’re at 75%."
Albeit, this is just for NeurIPS, but I hope this trend continues.
I need help in creating a SVM model in pure numpy and python
bumpy? I hope you mean numpy.
I get that.
Our company tried to reproduce results from eeg papers with the given data (Temple's corpus) and we still couldn't achieve the results. We followed the specs of the papers (had the same LSTM/Convolution layers/hyperparameters), but the results didn't match.
Showing the algorithm can get us one step closer.
The way in which people manipulate their data before training can significantly alter the data to the point it isn't generalizable.
hmm....I never thought of doing it with different random seeds. I think that should be a requirement as well.
Yeah. I get why they might not be able to publish their dataset. But publishing their code can help mitigate issues like this.
Ha...use a random number generator to pick seeds for other random generators.
I love it.
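The "seeds from a seeded generator" idea can be sketched with the standard library alone. Here `run_experiment` is a hypothetical stand-in for a real training run, and the master seed value is arbitrary:

```python
import random
import statistics

def derive_seeds(master_seed, count):
    """Use one seeded generator to produce reproducible seeds for the individual runs."""
    rng = random.Random(master_seed)
    return [rng.randrange(2**32) for _ in range(count)]

def run_experiment(seed):
    """Hypothetical stand-in for a training run; returns a noisy 'score'."""
    rng = random.Random(seed)
    return 0.8 + rng.uniform(-0.05, 0.05)

seeds = derive_seeds(master_seed=2019, count=5)
scores = [run_experiment(seed) for seed in seeds]

# Report the spread across seeds instead of a single lucky run.
mean_score = statistics.mean(scores)
score_spread = statistics.stdev(scores)
print(seeds)
print(mean_score, score_spread)
```

Publishing only the master seed is then enough for someone else to regenerate the same list of run seeds.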
Sorry for the repeat for those who have scanned over this before: Anybody a seasoned user of Anaconda? The tech support is terrible to non-existent so far for me. I have an issue described here:
in the case of medical data, a lot of it comes down to preprocessing
that's why they're especially hard to reproduce (in addition to all the big datasets being unsharable)
if you're open to reinstalling
and you're still running into issues
you need to go track down wherever the bad paths are coming from
@oblique belfry if there's code and no data though is it reproducible ?
It’s better than nothing.
You can at least check out their methods and validate the logic behind it.
I’ve also tried to replicate papers that used public datasets.
I think you need both data and code to replicate results. Data will be the hardest to get, code is pretty simple. I’d rather have something than nothing.
What type of job do you want?
I know the basics of SQL, but if I have a project that really utilizes it, I’ll use SQLAlchemy or Orator to query data. I mostly use Mongo at work. But, we do a lot of machine learning and AI.
Data science is a big field that honestly should become more separated. A data scientist at one company may look completely different at another company.
as someone in industry who sees some hiring decisions, SQL on the resume never hurts
(unless it's a lie)
No specific job, more looking to expand my skill set for my own projects that might be useful later down the road. I was thinking SQL, and most of the tasks would be machine learning. At least at first, until I get a better hand on things.
A lot of positions I applied for wanted SQL experience. But, those were jobs that I wasn’t very interested in.
Anyone recommend any guides on how to interact with API using python?
The api documentation?
which API?
@oblique belfry what does SQL experience mean though? I'm never sure how much is typically expected to satisfy that, i guess it's "how longs a piece of string".
personally i've never needed anything beyond a join
no subtables, views, or whatever else
I don’t know either
Has anyone noticed that in Data Science / Analysis there's a ton of information available to improve your skills, but not much information on the "operational" side of the industry? For example, how to work with teams, how to implement a workflow in a corporate environment that is scalable, how the different methodologies fit in a data science organizational team.
At least for me it's been difficult to get this kind of information on the internet.
Management, Dev ops and "big data" covers a lot of that
Yeah but I've been feeling like we're kinda lost in that scenario, we know the process that comes with gathering data, scrubbing, feature engineering, modeling and deployment, but we tend to forget that all of that needs a place where toolsets and teams in a corporate environment can co-exist.
I think that's harder for small teams though.
In my work environment we're still trying to figure this out, so for example we have a binary classification task:
- Where do we document it? Let's say Jira.
- Where do we perform the EDA/Feature Engineering? Jupyter notebooks, right.
- Where and with what do we build the model? sklearn.
- Deployment? IBM Watson or a simple API in a cluster.
All of that comes with a price: depending on what tools you use, you're going to take on a big technical debt, or sacrifice collaboration if you do the development offline, or even reproducibility.
I know that there's no tried-and-true workflow in which all of these can be implemented because the industry is still very young, but it would be neat to have more direction. * rant over lol *

