#data-science-and-ml
Hey guys im reading sensor data from my serial ports, ive successfully read sensor data to variables but i dont know like how to append that sensor data to csv file, can anyone help me?
hello. anyone knows any good book for logic programming and asp?
If you have your data in Python you can organize it and do something like this: https://thispointer.com/python-how-to-append-a-new-row-to-an-existing-csv-file/#:~:text=Open our csv file in,in the associated csv file
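A minimal sketch of the append approach described at that link, assuming the readings are already in Python variables. The file name, column names and the `append_reading` helper are made up for illustration:

```python
import csv
from datetime import datetime
from pathlib import Path


def append_reading(path, temperature, humidity):
    """Append one sensor reading as a new row; write a header if the file is new."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    # newline="" lets the csv module control line endings; "a" appends instead of overwriting
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["timestamp", "temperature", "humidity"])
        writer.writerow([datetime.now().isoformat(), temperature, humidity])
```

Calling `append_reading("sensor_log.csv", 21.5, 40.2)` inside the serial-read loop would add one row per reading.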
Oh okay Thank you
is there a website somewhere that translates all the library abbreviations
like a master list
could anyone help me set up torch xla on google colab, im using pytorch lightning
ive been having tons of issues and theres nothing on stackoverflow
Anyone know the answer to this?
Hello, I'm new in data science, are there any books you recommend to start with?
there's "data science from scratch" published by O'Reilly. I would see if they have any recent books on the topic.
Thank you
another publisher would fill the void
true
they just have the time advantage
just like AWS with the cloud
well ig AWS has more features but thats also a function of their age
side note: our nlp project on contracts is on github. it would be really easy to create a docker image/container for it right?
my first time working with docker
Hey I was wondering if anyone could explain a Hough Transform to me. I know it's not syntax, but it's a CV concept. I have a table that we're supposed to understand for an upcoming exam but still don't
I heard that using docker is pretty difficult (mostly people complaining about cryptic errors)
Also a somewhat argumentative question - I only saw a very basic overview of capsule networks (from Hinton) but it does seem kind of like the hierarchy theory in HTM's and there were little credits toward Hawkins. can someone with deep knowledge explain the core difference b/w Hawkins and Hintons' approach?
Pretty sure that table represents the Hough parameters for the lines in the image. It's similar to the 2nd example on the wikipedia page. If you want more help I can hop in a voice chat and walk you through it but check out the wikipedia page first and see if you can figure it out https://en.wikipedia.org/wiki/Hough_transform#:~:text=The Hough transform is a,shapes by a voting procedure.
The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure. This voting procedure is carried out in a parameter space, from which object candidates are ob...
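To make the voting idea concrete, here is a rough pure-Python sketch for straight lines. It assumes the normal form rho = x*cos(theta) + y*sin(theta), 1-degree steps and rho rounded to whole pixels; the function name and the four points are made up, and this is not how OpenCV implements it:

```python
import math
from collections import Counter


def hough_lines(points, theta_step_deg=1):
    """Vote in (theta, rho) space for every line that could pass through each point."""
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            # Normal form of a line: rho = x*cos(theta) + y*sin(theta)
            rho = round(x * math.cos(theta) + y * math.sin(theta))
            votes[(theta_deg, rho)] += 1
    return votes


# Four edge pixels on the line y = x
votes = hough_lines([(0, 0), (1, 1), (2, 2), (3, 3)])
print(votes[(135, 0)])  # how many points voted for theta=135 degrees, rho=0
```

Cells with many votes correspond to lines that pass through many points, which is most likely what each row of a Hough parameter table is summarising.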
Alright, will do. Thanks!
Would you be able to hop in VC with me? I'm still kinda lost
everyone here seems rather experienced, I have a question about python capabilities
if I had historical data on a price or level of a number
could I build an algo that could give me a rough estimate of the direction the price would go
@empty sable what format is the data in?
That sounds like a prediction problem.
Here's the thing, you can technically do something like that with regressions but dont lean too much on it
yes, you can predict whether it would rise or fall over timesteps (classification), or you could do prediction using RNNs, which would try to predict what the exact price would be for any amount of time you like. RNNs usually provide decent enough results on a good amount of data, so you should not have data problems with them.
Look up stock prediction with LSTM on google. You would find plenty of tutorials to help you out
I’m trying to figure out how to use scipy.interpolate.RectBivariateSpline to perform a polynomial image warp between two halves of an image stored as a numpy array. Can someone help me figure out how to accomplish this by any chance?
You dont need a network, you could just use an autoregressive model to predict the series
They usually get decent results with less work than a RNN
You need to do all the basic data cleaning and normalization tho but you probably will have to do that anyway
Like range scaling, making the series stationary, removing seasonality etc
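A very rough sketch of the autoregressive idea, assuming an AR(1) model with no intercept fitted on first differences (differencing being one simple way to make a trending series closer to stationary). The prices are made up, and scaling and seasonality are not handled here:

```python
def difference(series):
    """First-order differencing, a common way to remove a trend."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(series):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise (no intercept)."""
    prev, curr = series[:-1], series[1:]
    denom = sum(p * p for p in prev)  # zero only if every earlier value is zero
    return sum(p * c for p, c in zip(prev, curr)) / denom


def forecast_next(prices):
    """Predict the next price by forecasting the next difference with AR(1)."""
    diffs = difference(prices)
    phi = fit_ar1(diffs)
    return prices[-1] + phi * diffs[-1]


prices = [100.0, 101.0, 103.0, 104.0, 107.0, 108.0, 112.0]  # made-up data
print(forecast_next(prices))
```

For real work a library model (for example the autoregressive models in statsmodels) would be the more usual choice; this only shows the mechanics.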
no idea haha, just thinking about things I may want to learn and specialize in. perhaps a stock price or the sugar level of a diabetic?
Stock prices are easier to get a hold of than sugar levels
are you familiar with the concept of having data about a type of thing, and having a model that can predict some attribute of those things based on known attributes?
I actually have access to sugar levels from a relative which is why im interested
hmm never made a model before but I assume I could research it
If you go the RNN route youre gonna need years of data
if it's enough, it might be good at predicting that person's BSLs 🤷♂️
hmm I have roughly 3 months of data, I could probably use stock data
and then change it to sugar levels
if the stock one is successful
or at least adapt it
Blood sugar spikes in different ways per person btw, also you would probably also need inputs on their diet too
I live with them
Stock prices behave better imo
thats why I want to use their data specifically
what exactly is in that data? blood sugar readings and the time of day of that reading? anything else?
time of day, sugar level, day it happened. I could get more starting today
if all you have are their blood sugar levels and timestamps, all you can really do is try to fit a curve for their blood sugar levels throughout the day
you don't know anything about what causes those blood sugar levels
you'd need data about their nutritional intake, I believe
though curve fitting is still good.
its the relation of insulin to what they eat I believe
nutritional intake, I could track that
Is anyone going to help me or no
what do you guys think of building one for stocks as a baseline
and adapting it to sugar levels
The stocks one will be easier
idk how sorry
I don't really know how insulin works, as neither me nor any family members need it
And give you an idea on how to make things like this
do you also get to know how much insulin was administered after each reading, or something?
your body naturally produces insulin to counter carbs etc you get from eating food. diabetics have to manually enter insulin
Insulin ratios differ from person to person
what features would you use?
I have access to that but havent been tracking it
do they know what features are?
thats why I want to use data specifically for this person
what would be the things you will track?
sugar level, time of day, day of week, meals and nutritional intakes, and insulin intake
how would you track nutritional intakes (since you would take data on your own)
just the main things from each meal, how big the meal was, calories, carbs, etc
unless you would pester them every 10 minutes about what they ate and are going to eat
the more data the more accurate generally right?
yep
ok I will start planning this out and reading up on this, thanks for letting me bounce some ideas off. do you mind if I occasionally dm you? its cool if you're busy
cool, no problem
I would think that insulin level on its own would be a pretty strong feature
that with the nutritional intake would be good enough. but better collect all data you can
I want to count the eggs in this image using OpenCV
what filters do I have to apply before using findContours? right now I just applied grayscale
mix n match
the empty spaces where there are no eggs are giving me trouble when detecting edges
Hey all, I've got a spreadsheet with lots of data in it and I want to create visuals (charts, graphs, etc) for it. Do you guys know if there is a list of general stuff I can make?
how come when i call my dependencies, some of them look like this
Pillow @ file:///C:/ci/pillow_1615224175364/work
Jinja2 @ file:///tmp/build/80754af9/jinja2_1612213139570/work
instead of
.....?
what command?
Its giving me the normal versions
pretty sure you can just do that on google sheets lol. unless you mean more complicated things
yea, you can use pandas with python and libs like matplotlib to do that. look up data visualization in python
@misty flint yea, conda gives some file paths
prob preinstalled
I dunno
tqdm @ file:///home/conda/feedstock_root/build_artifacts/tqdm_1609612933698/work
hmm
i wonder how it will play when i try to throw it into a docker container
might have to specify a specific version
no, just make a brand new one
@misty flint https://stackoverflow.com/questions/62885911/pip-freeze-creates-some-weird-path-instead-of-the-package-version
Yeah, send me a PM with a time
ah make a new requirements.txt gotcha
@misty flint did your model finish training?
hey i need help with a dataset
so all demographic info is in one column
how do i split the demographic info into different columns, like age group, gender, and etc
You're going to need some regex for that
you can use df["break_out"].str.extract( YOUR REGEX GOES HERE) and you can create new columns from the capture groups
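A small sketch of that `str.extract` idea. The real format of the column was not shown, so the sample strings, the pattern and the group names `age_group` and `gender` are all assumptions:

```python
import pandas as pd

# Hypothetical rows: the real format of the "break_out" column was not shown
df = pd.DataFrame({"break_out": ["Age 18-24, Female", "Age 25-34, Male", "Age 65+, Female"]})

# Each named capture group becomes its own column in the result
pattern = r"Age (?P<age_group>[\d+\-]+), (?P<gender>\w+)"
extracted = df["break_out"].str.extract(pattern)
df = df.join(extracted)
print(df)
```

Rows that do not match the pattern come back as NaN in the new columns, so it is worth checking for those afterwards.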
would be nice to know what an entire row of the table looks like though
right, but there's presumably a relatively low number of columns?
good luck cleaning that mess lol
but also, are those repeat instances?
I mean. does it have ALL attributes in the same row? or just one attribute or something
thats weird
that's... messy
you can keep it as it is
or you can create new columns for each type of breakout
but what will you do for empty ones?
NaN or 0?
it matters a lot for ages, for example
should i do mean of ages?
that's not what I mean
i mean for the empty ones, should i put the mean of age
that depends a lot on the shape of your data. The problem is, you dont have ages, you have age ranges, which is an ordinal data type
ohh
you "could" set the mode of age-range as the fill-in value, but that's your call as researcher
and that's only for age. What about gender, race, etc.
race is a tricky situation
the problem is that break-out column has a lot of mixed info, so you need to decide what to do with all that data
and most importantly, what to do with missing data
should i keep it as is because race is tricky to do missing data for?
thats your call. you're the researcher
I'd keep it, but the most pressing issue for me is "wtf do i do with rows that do not have that info"
ohh
i looked, and those columns dont have missing data
i was wondering how do i make statistical stuff with it if it's all mixed info? @exotic maple
I didnt explain myself properly
think it like this
you have a single column called "TYPES" that holds data of type: "Age", "gender", "race", etc.
On a normal DB each of those would be a single column in itself. In your DB, this is all in a single column, which means you have mixed signals in a single feature.
Ideally, you want to separate that feature into multiple features that actually make sense (each one in its own column, as they are independent of each other), but if you do that, you will have missing data because each row holds only one of those types.
So if you create an age column, you will get NaN for all the rows where there is no age specified
You can, in fact, you should, but you need to deal with the missing data
first can i split them into different columns and then deal with the missing data?
yes, that's what I would do
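A sketch of "split first, then deal with the missing data", assuming a layout where a `TYPES` column says what kind of value sits in `break_out`. The column names and values are invented, and filling with an "unknown" label is just one option:

```python
import pandas as pd

# Hypothetical layout: one row carries exactly one kind of demographic info
df = pd.DataFrame({
    "TYPES": ["Age", "Age", "Gender", "Race"],
    "break_out": ["18-24", "25-34", "Female", "Group A"],
    "value": [10.2, 11.5, 9.8, 12.1],
})

# One new column per type; rows of a different type get NaN
for kind in df["TYPES"].unique():
    df[kind.lower()] = df["break_out"].where(df["TYPES"] == kind)

# Handle the missing data as a separate step, e.g. with an explicit label
filled = df.fillna({"age": "unknown", "gender": "unknown", "race": "unknown"})
print(filled)
```

Keeping the two steps separate means the NaNs stay visible until you have decided how each column should treat them.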
thank you for your help!! @exotic maple
XDD

man pandas is so goddamn powerful. Even if you never do any data science, pandas itself is worth all the struggle
hey guys, I'm working on a df in pandas and after a merge one of the columns is coming in with NaN values.
def preprocess(x):
    df = pd.merge(df_gps, x, on=['bus_id', 'date_time'], how='left')
    df.dropna()
    df.to_csv("./mobility-dataset/merge_gps_translated_validation.csv", index=False)
reader = pd.read_csv("./mobility-dataset/translated_validation.csv", chunksize=10000)
futures = []
with cf.ThreadPoolExecutor(max_workers=6) as exe:
    for r in reader:
        r['date_time'] = pd.to_datetime(r['date_time'], format='%Y-%m-%d %H:%M:%S')
        r['busline_id'] = r['busline_id'].astype('int32')
        r['bus_id'] = r['bus_id'].astype('int32')
        futures.append(exe.submit(preprocess, r))
    cf.wait(fs=futures)
df = pd.read_csv('./mobility-dataset/merge_gps_translated_validation.csv', nrows=1000000)
df
here is the output and above is the merge code between two df
From the two df, busline_id is actually ok, int32 value and what not
Appreciate any thoughts on why the NaN is coming
@undone heron I'm looking at this, but please add a py to your code sample
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
what does preprocess return?
Please ping when you reply or I will probably not know that you replied.
It just so happens that I'm still here
Oh ok lul
though for future reference, if I'm helping you, always ping when you've completed your response no matter what.
so what does it return?
It returns the image below the code, basically that csv I'm writing @serene scaffold
preprocess actually does not return anything
Wait u mean ping you at the end everytime?
Well the at the end it is processing what I want, so the final result is the DF at the image
Once you have a completed thought that you are ready for me to read and there are no more messages that you are going to type until I respond, ping.
def preprocess(x):
    df = pd.merge(df_gps, x, on=['bus_id', 'date_time'], how='left')
    df.dropna()
    df.to_csv("./mobility-dataset/merge_gps_translated_validation.csv", index=False)
Nothing is returned
(except, well None)
Depending on what exe.submit does, it may be that you don't need it to return something.
Note that saving something to disk is not the same as returning it.
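To illustrate the difference, a sketch of a worker that returns its result so the caller can collect it from the futures. The doubling logic and the sample chunks are stand-ins, not the actual merge:

```python
import concurrent.futures as cf


def preprocess(chunk):
    """Return the processed chunk instead of only writing it somewhere."""
    return [value * 2 for value in chunk]  # stand-in for the real merge logic


chunks = [[1, 2], [3, 4], [5, 6]]  # stand-in for the CSV chunks
with cf.ThreadPoolExecutor(max_workers=2) as exe:
    futures = [exe.submit(preprocess, chunk) for chunk in chunks]
    # result() blocks until that task is done and hands back the return value
    results = [future.result() for future in futures]

print(results)
```

`result()` also re-raises any exception from the worker, which `cf.wait` on its own would not surface.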
For sure but on that code, writing to csv is what I need at the end. How would u recommend for me to return something with that futures.append()? @serene scaffold
let's not worry about that for now. Can you give me a sample of df_gps and x as strings and not as screenshots?
print(df.iloc[:5].to_csv())
^ that will print a CSV that I can use to get a sense of what you are trying to do. Please do that for df_gps and x
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
^ you can paste the CSVs there.
Oh, nice 1 sec
Ok, hopefully this is what you asked for @serene scaffold
https://paste.pythondiscord.com/hizomotazi.css
I think it is! and the second one is x?
yep, it is the other csv I'm "merging" with @serene scaffold
alright, let me see
@undone heron is the problem that you're getting nans in busline_id?
and if so, how much do you know about the different types of joins you can do?
Yep, that is the problem. I'm quite new to this whole thing so I don't know that much about the join types. I know the theory somewhat but maybe you have better insight. @serene scaffold
for one thing, are all your datetimes on 2015-03-11, or is that just because of the rows you picked out for me?
Also, here are some of the different types of joins
```
left: use only keys from left frame.
right: use only keys from right frame.
outer: use union of keys from both frames.
inner: use intersection of keys from both frames.
```
@undone heron do those types of joins make sense, or do you need me to explain them?
That is some test data for just one day to decrease the amount of data I need to process. So it is all on the same date, yes.
Makes sense! @serene scaffold
So... Maybe outer makes more sense here? @serene scaffold
I think with union you would get even more NaNs. Do you know why?
because "outer" is the union-like join.
I see, but it feels like the busline_id nan thing is because it is not keeping the column after the merge properly, right? @serene scaffold
what do you mean, not keeping the column properly? pandas isn't making an error.
I know! I just don't understand why the column is not being kept, I'm not using it as a key for the merge... Why would it become NaN? @serene scaffold
the way that join operations work. If you use the wrong type of join for what you are trying to do, Pandas might fill in some blanks with NaNs.
Think of it this way
you join the tables based on date_time and bus_id, right?
within each dataframe, could there ever be two rows that have the same for both date_time and bus_id as another row?
Matching rows on both dataframes based on those 2 keys? Yes, that is what I'm looking for. @serene scaffold
if they need to match in both dataframes, which of the four types of joins is right for that?
Inner lul
let me know if that works
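A tiny reproduction of what is going on, using made-up stand-ins for `df_gps` and one chunk. The `lat` column and all values are invented:

```python
import pandas as pd

# Made-up stand-ins for df_gps and one validation chunk
df_gps = pd.DataFrame({
    "bus_id": [1, 1, 2],
    "date_time": pd.to_datetime(
        ["2015-03-11 08:00:00", "2015-03-11 08:01:00", "2015-03-11 08:00:00"]
    ),
    "lat": [-8.05, -8.06, -8.10],
})
chunk = pd.DataFrame({
    "bus_id": [1],
    "date_time": pd.to_datetime(["2015-03-11 08:00:00"]),
    "busline_id": [230],
})

keys = ["bus_id", "date_time"]
left = pd.merge(df_gps, chunk, on=keys, how="left", indicator=True)
inner = pd.merge(df_gps, chunk, on=keys, how="inner")

print(left)   # rows without a match keep their GPS data but get NaN busline_id
print(inner)  # only rows whose keys appear in both frames
```

The `_merge` column added by `indicator=True` shows which rows had no partner, which is a quick way to check a join on a few rows before running it on the whole file.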
Great stuff, running the script again, should be ready in about 1 hour. Thanks in advance! Really appreciate the way it was explained 
it will be an hour before we know if it worked?!
training models can often take a long time (I've had some take over a week, even with an insane amount of compute power), but you should be able to know if your dataframe manipulations are correct... quickly 
an hour to process a join? Are you working with a billion rows or something?
I want to identify how similar older sections of text are to newer sections of text we'll call "section A". What type of data should I be looking at? I can't just take only "section A" text, but also "non-section A" text. But would I look into taking this "non-section A" text as any other section besides section A? Also, for text similarity models, would you suggest using an LSTM or a siamese NN? Currently I just have an LSTM in tf.
Well, once it reached that script section it was pretty fast and it failed with zero entries all Dtypes as objects non-null @serene scaffold
Hey guys, I have another question not exactly related to my problem above earlier.
I have a couple of dataframes such as validation.csv and I need to "translate" the values of a column and in a way map it to other values I have in another csv (links.csv). How would I go about to do that? For example:
on links.csv
230, 555 -> meaning if I find 555 in validation.csv I need to change it to 230 and so on through the whole file.
Any thoughts?
haha it gives character. i would like reading comments like that
Thresholding by brightness + find contours should work
Luckily for you, your images seem well controlled
Keep in mind you don't need to find the whole segment of the egg, you can just find one half of the egg and count those too
If that gives you trouble, you can also find contours as you're doing now and then filter by area. The eggs should be bigger than the small holes
there are some huge skewness present in it so, is it gonna be good if i use a power transform or instead if i should use log transformation ? cause boxcox is not gonna seem to work on this, since it only works on positive values
Is it possible to do text classification using linear regression model by converting the strings to their sum of the ascii values of all the characters?
Anyone up for implementing this research paper? https://stdm.github.io/downloads/papers/ICDAR_2017.pdf
I have an assignment, basic machine learning application, but I'm very new in this.
there must be 1 runner and 2 chaser neural networks (each one of them are separate neural networks) and chasers aim to catch the runner and runner moves randomly.
what kind of ML can be applied here, unsupervised or reinforcement ? and which library would be more appropriate to use?
P.S. runner should be different program, and the environment is common for runner and chasers. environment has walls and created randomly
Does anyone know the path where I can see the list of pre-trained models in pickle?
I tried getting to pickle file but it was formatted with unsupported or binary formatting so, I can't somehow understand the values.
^^this returns an error that the file I'm trying to load does not exist. I want to know if its name has been changed due to updates or something like that
I bet you could do it with dictionaries and the apply method
Or better yet, replace
!docs pandas.DataFrame.replace
```
DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
```
Replace values given in to_replace with value.
Values of the DataFrame are replaced with other values dynamically. This differs from updating with `.loc` or `.iloc`, which require you to specify a location to update with some value.
Parameters
to_replace: str, regex, list, dict, Series, int, float, or None. How to find the values that will be replaced.
• numeric, str or regex:
  • numeric: numeric values equal to to_replace will be replaced with value
  • str: string exactly matching to_replace will be replaced with value
  • regex: regexs matching to_replace will be replaced with value
• list of str, regex, or numeric:... [read more](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html#pandas.DataFrame.replace)
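A sketch of the dictionary plus `replace` idea for the links.csv question. The column names `new_id`, `old_id` and `busline_id` are assumptions, since the real headers were not shown:

```python
import pandas as pd
from io import StringIO

# Made-up contents; the column names "new_id" and "old_id" are assumptions
links = pd.read_csv(StringIO("new_id,old_id\n230,555\n231,556\n"))
validation = pd.DataFrame({"busline_id": [555, 556, 555, 999]})

# Build an {old: new} lookup, then replace; values not in the lookup stay unchanged
mapping = dict(zip(links["old_id"], links["new_id"]))
validation["busline_id"] = validation["busline_id"].replace(mapping)
print(validation)
```

If unmapped values should become NaN instead of staying as they are, `Series.map(mapping)` behaves that way.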
reinforcement
very quick pandas question - trying to apply a regex expr to my df column like so:
df.desc_copy = df.desc_copy.apply( lambda x: re.sub(r'(([0-9]{2})(Jan|Feb|Mar|Apr|May|Jun|Jul|Sep|Oct|Nov|Dec)([0-9]{2}))' r'(ON \d\d\d\d-\d+-\d+)|(\d+\d+)', '', str(x)))
the expr is supposed to remove each date string from every row in the desc_copy column but no effect is taking place - why?
i was printing the wrong column 😑
the library you would use already has modules that would allow you to load checkpoints. you can google that
so basically masked CNN?
hola
Hey guys I need some help. I am currently doing my thesis and I got stuck in creating a dictionary.
My data comes from an experiment in which they asked ppl what quantity of CO2 they think certain products emit. In total I have 17 products, which means I have a dataframe with a length of N * 17. What I want is to create a nested dictionary that stores all the responses of the individuals, something like: {1: {car:200, beer:500}, 2: {car:5, beer:10}, ..., N: {car:NN, beer:NN}}. How can I do this?
!code use this format to display your code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
@wanton laurel is the problem that your regular expression is not matching anything? or is your problem with usage of dataframes? Please ping me if/when you're ready to continue or I may never know that you've replied.
is there a reason you want to have nested dicts when you already have a dataframe?
In either case, please run print(df.iloc[:5].to_csv()) and copy/paste the string into this chat so I can see what your data looks like.
It worked, thanks though
so you no longer need assistance? Alright, take care!
Yeah that's Right, no longer needed, you too
,ppnr,subject,mode,product
0,5e6cf1cb28e5a82aed026a8f,1,500.0,car
1,5e6cf1cb28e5a82aed026a8f,1,4.0,carSocialCost
2,5e6cf1cb28e5a82aed026a8f,1,2.0,warming
3,5e6cf1cb28e5a82aed026a8f,1,700.0,heatwave
4,5e6cf1cb28e5a82aed026a8f,1,900.0,seaLevelRise
great, and what is the expected output given 0,5e6cf1cb28e5a82aed026a8f,1,500.0,car?
yes, I am trying to test through different methods how they deviate from the real CO2 emissions
all those rows are from the same respondent (5e6cf1cb28e5a82aed026a8f)
I would like to have a row with all his responses instead of 17
that's why I am trying to create a nested dict
@dim trail so given 0,5e6cf1cb28e5a82aed026a8f,1,500.0,car, what sub-dict do you want?
one like this: {1 (this is the subject, 5e6cf1cb28e5a82aed026a8f): {car:200, beer:500}, 2: {car:5, beer:10}, ..., N (this is subject N): {car:NN, beer:NN}}
where did car: 200 come from?
I am only asking about 0,5e6cf1cb28e5a82aed026a8f,1,500.0,car
,ppnr,subject,mode,product
0(index),5e6cf1cb28e5a82aed026a8f(identifier of subject 1),1 (subject 1),500.0 (his guess),car(product)
1,5e6cf1cb28e5a82aed026a8f,1,4.0,carSocialCost ---> all this is from same subject, I want all responses as dict in a bigger dict
2,5e6cf1cb28e5a82aed026a8f,1,2.0,warming
3,5e6cf1cb28e5a82aed026a8f,1,700.0,heatwave
4,5e6cf1cb28e5a82aed026a8f,1,900.0,seaLevelRise
does car: 200 come from a row that is not given in the sample?
no, that's just an example of mine
so each sub-dict in your outputted, nested dict will represent data from different rows?
or does each row of the dataframe get represented as one sub-dict in your nested dict?
no, each subdict will represent data for the subjects
for subject 1 I have 1
for subject 1 I have 17 responses
Okay, so the data structure you want is Dict[str, Dict[str, int]], and each key-value pair in the inner dicts is a row that has the same subject as the outer dict
let me see
yes
are you using mode for anything?
ppnr is the unique identifier of each subject, I have that to identify each person in my others datasets
so what is the key for the outer dict? the subject or the ppnr?
that means that once you transform this to a dict, the ppnr won't be there
yes, no problem. there is only one subject 1 in the whole experiment
okay, so we can drop the ppnr column, basically
yes
@dim trail I'm still looking into it
thanks
@dim trail still looking
is it very complicated? I tried for hours
I can't find a "good" solution, so all I can suggest is iterating through every row of the dataframe and adding the data from each row into one dictionary
. / gasp
thanks
Dataframe iteration is usually a sin. So there's a good chance I may request you to explain the context again, waste 30 min, and then come to the same conclusion.
Though if you're making dictionaries out of it you're anyways leaving dataframes behind
I will try row iteration and if I can't find a solution, I'll abandon the idea and try something else
I can explain it for them, since I already got my head into the problem
I'd actually like to know if there's a "panda-ic" solution
,subject,mode,product
0,1,500.0,car
1,1,4.0,carSocialCost
2,1,2.0,warming
3,2,700.0,heatwave
4,2,900.0,seaLevelRise
The desired output is:
{1: {'car': 500.0, 'carSocialCost': 4.0, 'warming': 2.0},
2: {'heatwave': 700.0, 'seaLevelRise': 900.0}}
The problem is that you're basically trying to create new columns based on values in the product column.
@ripe forge I tried to do a pivot table and then do to_dict
Does the pivot table organize the data by subject? Or do I still have the same problem in which I have 17 rows with answers of subject 1
!e
import pandas as pd
from io import StringIO
string = """,subject,mode,product
0,1,500.0,car
1,1,4.0,carSocialCost
2,1,2.0,warming
3,2,700.0,heatwave
4,2,900.0,seaLevelRise"""
df = pd.read_csv(StringIO(string))
def dict_creator(df):
    return dict(zip(df['product'], df['mode']))
out = df.groupby('subject').apply(dict_creator).to_dict()
print(out)
@ripe forge :white_check_mark: Your eval job has completed with return code 0.
{1: {'car': 500.0, 'carSocialCost': 4.0, 'warming': 2.0}, 2: {'heatwave': 700.0, 'seaLevelRise': 900.0}}
this would be my knee jerk reaction to it, but it's essentially looping via apply
that code is beautiful though, implicit looping aside.
./blushes crimson
i dont think you'll have a lot of gains because ultimately vectorization has to be broken to create the dictionaries at the end though. ideally for larger datastructures you want to avoid going back to dictionaries if possible when using pandas. but this probably is enough for OP's needs
nice, but it copies the value 500 for each product
simplest code to remove a specific list of words from another list of words? maybe using sets?
could you elaborate? does the toy example showcase your original problem adequately?
the words that should be removed can be in a set. then, just iterate on the list of words, and keep those words that arent found in the set
iterating is the simple method. you could use set intersection(edit:? not intersection, difference) if you really wanted to, but you arent gaining performance there
[word for word in words if word not in words_to_remove] # where words_to_remove is a set
one of the list is nested which complicates it 😅
ah, the plot thickens
can you make a minimal example that showcases the question adequately?
You could probably iterate over every element but that sounds very efficient lol
doesnt sound
@grave frost is it possible to transform those lists to numpy arrays? If you can, you could create a 2d array and filter via masking
should be more efficient as a vector operation instead of iterating over N elements of nested lists
oh you want it in base python?
[el for sublist in mylist for el in sublist if el not in words_to_remove]
nested for loop but in a list comprehension
no, I meant the prob has already been resolved in pure python, so no need to use numpy 🙂
I mean yes, you can do it in pure python. Let me think...
I would try this to get all unique words:
set_var = set()
for sublist in mylist:
    for element in sublist:
        set_var.add(element)
question: Do you want to delete unique words or unique lists (as in, the actual block)?
for the general case of arbitrarily nested lists you get to use recursion
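For reference, a sketch of that recursive version, assuming the data is made of plain Python lists and strings. The function name and sample data are made up:

```python
def remove_words(items, words_to_remove):
    """Drop unwanted words from a list nested to any depth, keeping its shape."""
    cleaned = []
    for item in items:
        if isinstance(item, list):
            # Recurse into sublists so the nesting is preserved
            cleaned.append(remove_words(item, words_to_remove))
        elif item not in words_to_remove:
            cleaned.append(item)
    return cleaned


nested = ["the", ["quick", "the", ["brown", "a"]], "fox"]
print(remove_words(nested, {"the", "a"}))
```

Passing the words to remove as a set keeps each membership check cheap.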
leave it guys 🙂
If you want to see the solution, you can check #help-apple
(but you will have to scroll up)
or here: [el for sublist in mylist for el in sublist if el not in words_to_remove]
no, its had to be done like that:
out = [[word for word in words if word not in [_.replace(' ','') for _ in translated_stop]] for words in tqdm(data_text)]
I have a multilayered perceptron implemented in python, but it spits out probabilities instead of classes, what do I need to do for it to return classes ? Should I change the activation function of the last layer to softmax ?
when you do predict(X) you get probability instead of response?
are you sure arent using predict_proba?
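If the network is a hand-written implementation (in scikit-learn, `predict` returns labels and `predict_proba` returns probabilities), turning probabilities into classes is usually just an argmax per row. A sketch with made-up outputs and labels:

```python
def to_class_labels(probabilities, classes):
    """Pick, for each row of class probabilities, the label with the highest one."""
    labels = []
    for row in probabilities:
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_index])
    return labels


probs = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]  # made-up softmax outputs
print(to_class_labels(probs, ["cat", "dog", "bird"]))
```

Softmax on the last layer only normalises the scores into probabilities; the argmax step is what picks the class.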
I'm trying to improve my documentation reading skills. I'm looking at numpy's linspace and I see code like this:
test = np.linspace(0, 500, 12)
and you get the same result if you do this:
test = np.linspace(0, 500, num=12)
Now when I look at the docs, it looks like this:
numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)
My question is how do I know that I can type "num" or not and it will work the same? When I first saw it I was confused what the 12 did...
My question is how do I know that I can type "num" or not and it will work the same?
Because num is the third argument.
that's why when you pass 3 arguments, they resolve to start, stop, num
if you passed 4, the fourth one would be assumed to be endpoint, the fifth one retstep, and so on
Specifying arguments by name allows you to specify them regardless of position. Say, dtype you need to pass by name unless you are also passing num, endpoint and retstep
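A toy function makes this easy to try out. It is not numpy's code; it only copies the first four parameter names from the linspace signature quoted above:

```python
def linspace_like(start, stop, num=50, endpoint=True):
    """Toy function with the same leading parameters as numpy.linspace."""
    return (start, stop, num, endpoint)


# The third positional argument fills the third parameter, so these are the same call
assert linspace_like(0, 500, 12) == linspace_like(0, 500, num=12)

# A keyword lets you skip over parameters: endpoint is set, num keeps its default
print(linspace_like(0, 500, endpoint=False))
```

In the docs, parameters written as `name=value` show the default that is used when you leave them out.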
understood thanks! @tidal bough
that concept of positional vs. keyword arguments

functions and their definitions (*args, **kwargs) sounds like you're grumbling
Hi. I would like to ask for some help with the choice of literature for data science (more precisely, the bioinformatics field). I'm reading "Mathematics for machine learning" by M. P. Deisenroth and "Practical Statistics for data scientists" by Peter Bruce. But I'm not sure about my choice, especially about the second one. Could you recommend some books on statistics for DS (especially with python examples)?
P.S.: I have named just the books I'm reading at the moment. I'm also planning to read "Data science from scratch", "Deep learning for the Life Sciences" from O'Reilly and some other books.
Thanks in advance
how are you liking practical statistics for DS
I have that book stashed somewhere
Bishop's machine learning book or elements of statistical learning are the classic ML references
@hollow sentinel I have a sense that it's incomplete, maybe that is explained by the fact that this book doesn't cover the mathematical part of statistics
Thanks
What is better according to you?
Personal preference. I like bishop more
Hey please could someone help me with a for loop where I'm trying to read multiple tables from an SQLite database, the current code looks like this:
so i have a list of tables within the database, and the idea is that for each table in the list a dataframe is produced as df_<table-name>
for example the output should be 7 dataframes: df_sqlite_sequence, df_Player_Attributes etc
but currently it is just loading the last table (Team) as df_table, overwriting the previous one
@granite wolf look at your code and think about what it does.
Your loop goes through every query and assigns the resulting dataframe to a variable. This variable does not change in any form in any iteration, so it's the same variable every time.
basically, you're overwriting the resulting dataframes with each step of the loop. That's why you only get the last table
now, depending on what you want there are many ways to move forward
you can create a list and store the dataframes there as elements of the list. This is the simplest solution and they will have the same order as the parent list, but you will not have any linking between them
Thanks for replying, if they are stored as part of a list element, could i then unpack the list to get the desired multiple dataframes?
im basically aiming for different dfs for each table in the database
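A rough sketch of the "store them in a container" idea, using a dict keyed by table name instead of a list so each result stays linked to its table. The in-memory database and its two tables are made up for illustration, and plain rows stand in for dataframes so it runs with only the standard library:

```python
import sqlite3

# In-memory database with two made-up tables, standing in for the real file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Team (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE Player (id INTEGER, name TEXT)")
conn.execute("INSERT INTO Team VALUES (1, 'Reds')")
conn.executemany("INSERT INTO Player VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# Table names come from sqlite_master instead of being typed by hand.
table_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]

# One entry per table, keyed by name, so nothing gets overwritten.
tables = {}
for name in table_names:
    # Table names can't be bound as SQL parameters, hence the quoted f-string.
    # With pandas this line would be pd.read_sql_query(f'SELECT * FROM "{name}"', conn).
    tables[name] = conn.execute(f'SELECT * FROM "{name}"').fetchall()

conn.close()
print(tables["Player"])
```

So instead of separate variables like df_Team you would write tables["Team"], which is usually easier to loop over later than unpacking a list into seven names.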
hey guys, im new to ml... i wanted to know how you can parse a live video feed from your computers webcam? ( i am trying to make a rock paper scissors game where the i play with the computer by showing rock paper or scissors (done by my hand) to the camera)
Got a question, let's say you have a computer that can do your machine learning operations reasonably well but not great, meaning it takes a little long to do. Is there any reason why you wouldn't use google colab? Like is there a benefit to just keeping everything on your computer?
there isn't really a benefit to keeping everything on your computer
but it takes a little long for mostly everyone
if you switch everything to another computer....then it makes it harder to get back all of your projects
opencv
time for some research.. thanks 🙂
Thanks, could you explain this comment a little more? Do you mean if I get a new computer, it's hard to transfer it over from my old computer?
its not too hard
but it is a headache
you need to move files from this computer to that
ahh ok, so you're advocating for putting it on colab because of the cloud
welcome
The upside to cloud is that it does not matter which computer you use, all your stuff is always there. The downside is that you need an internet connection (that is stable and decently fast).
hmmm...but for colab, its just text; even if your connection is not that stable, I doubt 15 seconds matter a lot
but overall, the recommendation is GCP. Colab is good for newbies
Idk depends on how quickly you are doing things, the stability and lag can get in the way of a fast feedback loop (assuming the operations themselves don't take very long).
(And images can take a long time if you have a very slow internet speed)
The most overhead I personally face is in the preprocessing part, but thats something even my decade-old laptop can do
Either way, the answer is just use both, local and cloud. Backups are always good.
colab is good for sharing code too
Yup, hence the name.
https://colab.research.google.com/drive/1Fk2yMF1vLLNbVhccn-nBeLnk_uNPnL5v?usp=sharing
I made this notebook that automates linear and logistic regression...please make a copy to use it and the data that you put in should be cleaned(I'm still working on that). Feedback would be nice.
P.S. don't fear the loss of your job....the notebook makes basic predictions. we'll always need people to fine tune models 😉
don't fear the loss of your job
.....?
is there a reason to use seaborn over mpl?
my profs exclusively teaching it and i can't see why
data scientists are good at automating jobs...
what....?
yea
Data science has nothing to do with automating jobs. it is about deriving insight with data
any field even remotely close to that is just robotics
but i view data science as making life easier
well, that's pretty wrong
ok
they do not do automation. They are just implementing pipelines and models 🤷 IMO Robotics is the closest field to automation
ok now im confused
hello
about what?
automation using data science
data science != automation
i need a very simple help; need to plot a function in a subplot (basic)
what about integrating data science with automation
wait yea i get that
data science is just an umbrella term to signify someone very experienced with statistics and other relevant fields to derive insight from data.
how can you do that?
the theory for that was made years ago
like self driving cars
thats AI. not Data science.
what would you put ai under?
i got all my definitions wrong
Data scientists do a little of ML (Machine Learning) but its mostly ML researchers and engineers that do the more complex stuff
ok
AI is a parent term on its own
never thought of it that way
huh I always thought it and nn's were fields of data sci
there's no exact definition for a data scientist. but we can draw some lines
NN's (Neural Networks) are mostly grouped under ML
you can do that, but you would need PhD. that stuff is pretty complex
but you can learn right now too 🙂
i've been doing data science since i was 12
good for you
what do i do to learn
what do you specifically want to learn?
you've been learning it since you were 12? your math skills must be good
Thats Reinforcement learning
dont stress.
lmao
I'm not
no, its just that your tone sounds not very encouraging for a beginner
ok ig i'll learn reinforcement learning
no offense
I don't like the accusation
no, it was not an accusation 🙂
using neural nets right?
yep
RL is a little bit different from other types of ML, but fundamentally they are pretty much the same
like both aim to optimize some function
ok
now what resources do i need?
you can learn about it more by watching 3blue1brown for some basic maths and then prob pick some course
Companion webpage to the book “Mathematics for Machine Learning”. Copyright 2020 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Published by Cambridge University Press.
this is good
uhh, no I dont think so
how long for reinforcement learning?
BTW what is your prior experience? just curious
if you treat something like a goal, then you would never be productive
data science, opencv, and some neural nets
ok i get you
its much better to have an overall positive attitude towards learning than just doing something in "x" amount of time
which types of NN?
igs
CNNs, regression
well, then you have the basics already done. I think you can move on to RL from there on
cool, no worries
Is ROCm good now? I had ordered an AMD GPU before cuz I wanted to try it, but I got disappointed with the bugs and performance so returned the card to get an Nvidia one
But I read some intellectual discussion where they mentioned weird C stuff to prove that AMD cards won't be able to compete with CUDA's performance.
AMD vs Nvidia performance tests are all really bad (both ways), just try it yourself.
(Even if one could be faster than the other it's also limited by how much effort was put into each by the library being used)
Nvidia poured billions on CUDA
Newer AMD GPUs align more with Nvidia GPUs too
That's because Nvidia wanted an iron grip on the ML community. So all the libraries added CUDA support and ignored AMD.
AMD hasn't made many contributions to computing, and OpenCL sucks
well, they deserved it then, and we regret it now
Yeah, they invested.
If it were me, I would have gone with CUDA too. its the most sane decision
But AMD is cheap, and all that so it's totally worth having support for it even if it's slower.
I think the way it works with ROCm is that it somehow runs the CUDA code on AMD gpus.
yeah, AMD is so great. Nvidia just had a monopoly and milked all the money
so they still are using pycuda
making a layer is never great for performance
That way they don't need to recode everything
Not sure how hacky it is, if at all
The graph on the right represents the error of the output layer after each epoch, is this normal for a MLP with these hyperparameters ?
just asking, what is that program?
I made it, a live visualization of how a multilayered perceptron works, using PyQt and multiprocessing to avoid GIL
that is pretty cool
thank you
if it wasn't, they wouldn't have lost out on performance
tbh, its the closest thing to GUI with Machine Learning I have ever seen
well the tricky thing is that you have to run the neural network on a separate process because you're bound to get into GIL if you run it on the main thread, so you have to create pipelines between the GUI and the network to exchange data.
A lot of people don't bother making visual representations but for me it's very insightful to see how everything works in real time
I'm not exactly sure what you mean. Who lost out on performance on what?
ROCm is pretty much native for AMD. Though it's not for all of their cards.
OpenCl loses perf to CUDA
Yeah, because it's locked off from a bunch of things on Nvidia GPUs.
I dunno 🤷 There is a whole github issue about it with plenty of technical arguments
It's pretty well known that the OpenCL drivers are intentionally crippled on Nvidia.
Well AMD is different hardware.
It's not really just a OpenCL vs CUDA thing then. And can't really be compared.
You can do price per compute
Or something like that.
well, its different architecture
but from what I barely understood, OpenCL is in general inferior to CUDA according to some C and optimization stuff
This is false.
Data scientists can be anything from no ML to all ML, because it's a generic buzzword. Many ML engineers do no real ML, just software engineering around ML
Almost no one uses machine learning in robotics other than computer vision
If you're looking at stuff like Boston Dynamics, they use no "AI"
Good ol mechanical engineering and control theory
Not really, that's just what Nvidia just wants you to think or some random people on the internet. OpenCL and CUDA do not really have anything to do with C optimizations. It's just two different APIs, what really matters is the hardware itself.
CUDA and OpenCL are unrelated to architecture lol
OpenCL can work on basically anything
The only issue with OpenCL is like I stated, on Nvidia hardware it's locked off from some stuff.
5-6 years in the US, 3-4 years outside the US
And also how much effort is put into supporting it
The OpenCL api is also not as good as the CUDA one
^
CUDA is really really good
CUDA is really convenient. Closest to it is probably OpenMP on that axis.
(Directly in the C/C++ code, not like passing around some string of code which one then compiles manually)
@uncut orbit In general, take everything anyone says here about machine learning or robotics or anything with a pinch of salt, there's not a lot of people with expertise in the area in this server. There's dedicated servers for AI/ML type stuff and they're generally better, alongside dedicated servers for robotics.
If you're asking about python, not many better servers than this but any theory work has limited talent here
ok
This server is also not really the right place for theory, it's more for practical use with python. There are the off topic channels, but that's it.
I only exist to call out other people's BS on this server
How long shd I spend on understanding the theory of neural networks before I can start implementing one using pytorch? I just want to compare to a logistic regression binary classifier. I have the data rdy. Is 8.5k data points enough?
there should be a baseline for your problem already... so none
just implement and use your baseline as a check that it worked
if not, try, try, try again
regarding data... it depends on how complex a network you're building and how much information your network needs to encode
try training it in batches to see the improvement rate to get a guess at the value of more data
e.g. training on 20%, 40% 60%...
plot out the curve of improvement on your valid set as a metric to get some kind of feel for the value of more data
if you see a jump in value in the last batch it's probably worth getting more data
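A small sketch of that 20% / 40% / 60% idea. Everything here is made up for illustration: the 1-D toy data, and a tiny threshold model standing in for the real network, so it runs with only the standard library:

```python
import random

random.seed(0)

# Toy 1-D data (made up): label is 1 when x plus some noise exceeds 0.5.
def make_point():
    x = random.random()
    y = 1 if x + random.gauss(0, 0.15) > 0.5 else 0
    return x, y

data = [make_point() for _ in range(1000)]
train, valid = data[:800], data[800:]

def fit_threshold(rows):
    """Tiny stand-in model: cut halfway between the two class means."""
    ones = [x for x, y in rows if y == 1]
    zeros = [x for x, y in rows if y == 0]
    return (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2

def accuracy(threshold, rows):
    return sum((x > threshold) == bool(y) for x, y in rows) / len(rows)

# Train on growing slices, always score on the same validation set.
curve = []
for fraction in (0.2, 0.4, 0.6, 0.8, 1.0):
    subset = train[: int(len(train) * fraction)]
    curve.append((fraction, accuracy(fit_threshold(subset), valid)))

for fraction, score in curve:
    print(f"{fraction:.0%} of training data -> validation accuracy {score:.3f}")
```

If the last step of the curve is still climbing, more data is probably worth chasing; if it has flattened, the model or features are more likely the bottleneck.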
but you should also make a business decision on the metrics you're judging by and the difficulty in acquiring more
if you're smart about your weight initialization and optimization less data will be necessary
you can also think of clever ways to use your existing data to weakly label more
Hey guys can anyone help me with an error message that I have been getting while using the scipy library. I am trying to get a numpy array of pearson r correlation similarity from a bulk data and this data is imported in the form of a pandas dataFrame.
Thanks for the advice! I will just dive right in and see how it goes. Wish me luck :D
Hi, so I want to clone my base conda install but swap out the python version. I've tried conda create --name testenv2 python=2.7 --clone root but says too many arguments. Is this not possible?
I'd be fine with just having two installs of python too, so long as I can reference specifically which one I want
You need to design a Neural Network that solves the problem of facial attribute
recognition. More specifically the network should receive in the input an image of a face,
and should recognise whether the depicted subject wears glasses or not, has long or
short hair, smiles or not and should recognise its apparent age. Design the first and the
last layers of such a network, detailing your choices. Define the total cost function and
give the format of a training example and the corresponding ground truth associated with
it.
[Hint: You can treat the recognition of the age either as a regression problem, or as a
classification problem – either choice is equally valid.]
Can anyone help me with this question?
anyone know of a good neural network binary classification problem with solution in Pytorch online to work through so I can familiarize myself with NNs and Pytorch? preferably with at least 10 features and > 5k datapoints
Hi, I recently calculated the number of parameters of conv2d, but how do I calculate the parameters of separableconv2d?
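One way to count it, as a sketch: a separable conv is a depthwise k x k filter per input channel followed by a 1 x 1 pointwise conv that mixes channels. The function names are made up, and the bias convention (one bias per output channel, none on the depthwise step) is an assumption that matches the usual Keras-style layout:

```python
def conv2d_params(kernel_size, in_channels, out_channels, use_bias=True):
    """Weights of a standard k x k convolution, plus one bias per output channel."""
    k = kernel_size
    return k * k * in_channels * out_channels + (out_channels if use_bias else 0)

def separable_conv2d_params(kernel_size, in_channels, out_channels,
                            depth_multiplier=1, use_bias=True):
    """Depthwise k x k filter per input channel, then a 1 x 1 pointwise mix."""
    k = kernel_size
    depthwise = k * k * in_channels * depth_multiplier
    pointwise = in_channels * depth_multiplier * out_channels
    return depthwise + pointwise + (out_channels if use_bias else 0)

print(conv2d_params(3, 32, 64))            # 18496
print(separable_conv2d_params(3, 32, 64))  # 2400
```

For a 3x3 layer going from 32 to 64 channels that is 9*32 + 32*64 + 64 = 2400, versus 9*32*64 + 64 = 18496 for the regular conv.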
guys how do u paste code in discord
add three backticks
ah k thx
does anyone here know how to solve that cause im trying to make it on a hex value
```py
from PIL import Image, ImageDraw, ImageFont

# w, h, items, item_color and ctx come from the surrounding bot command
shape = [(1, 1), (220, 190)]
# creating new Image object
img = Image.new("RGB", (w, h))
# create rectangle image
img1 = ImageDraw.Draw(img)
# fill accepts a color name or a hex string such as "#ff0000"
img1.rectangle(shape, fill=f"{item_color.get(items[0])}")
font = ImageFont.truetype('theboldfont.ttf', 30)
text_position = 25, 80
img1.text(text_position, items[0], 'white', font=font)
img.save('fortnite.jpg', 'JPEG')
await ctx.send(file=discord.File('fortnite.jpg'))
```
im looking for algorithms and methods for detecting anomalies in a vibration signal. there is a machine and i set up a sensor which senses the temperature and vibration, im looking for a machine learning algorithm to detect them. does anyone have advice?
hey im interested in getting into ai machine learning etc... but concerned if i should learn things like calculus and linear algebra first or if it's fine to learn it along the way as well
@merry lintel I'm still learning machine learning and AI and what I did was learn everything I needed in terms of math along the way with whatever I needed. It may be better to learn the math before because you won't have to worry about it as much for the resources that are more theoretical. I hope that helps
@blazing bridge
oh thanks
but probably will just learn them along the way
lazy
xd
i mean algebra and calculus are interesting but still lazy. it is a wide range of concepts isn't it? @blazing bridge
there's a good book for math in ML
Companion webpage to the book “Mathematics for Machine Learning”. Copyright 2020 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Published by Cambridge University Press.
it's also pinned
Your first step should be labelling the dataset.
You'll need both normal and anomaly data. Most likely it would be skewed.
After that you can try algorithms/models like Xgboost, RandomForest, MLP, etc.
There is another way in which you try to find outliers (anomalies). Like 1-Class SVMs.
Here: https://towardsdatascience.com/outlier-detection-with-one-class-svms-5403a1a1878c
When applying K-Means clustering on unlabelled data, if we use a linear classifier and artificial labels, What type of regularisation would we use?
Can anyone help me with this?
can you tell us where this question came from?
Also you can try AutoEncoders for Anomaly Detection.
There is a whole python package (PyOD) that has a lot of models for anomaly detection
https://towardsdatascience.com/anomaly-detection-with-pyod-b523fc47db9
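Before reaching for any of those, a much simpler unsupervised baseline can be useful for comparison. To be clear, this is not a one-class SVM, an autoencoder or anything from PyOD; it is a plain median/MAD outlier score, and the function name, the cutoff of 3.5 and the readings are all assumptions for illustration:

```python
import statistics

def mad_outliers(values, cutoff=3.5):
    """Flag indices whose robust z-score (median/MAD based) exceeds the cutoff."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 rescales MAD so the score is comparable to a standard z-score.
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - median) / mad) > cutoff]

# Made-up vibration amplitudes with one obvious spike at index 6.
vibration = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 2.40, 0.51, 0.49, 0.50]
print(mad_outliers(vibration))
```

If a baseline like this already catches the faults you care about, the heavier models may not be needed; if not, it at least gives you a number to beat.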
Hi guys. What's a good metric/score to look at if I want to prioritise minimizing false positives in my binary classifier? And also if I have a lot more "Y=1"s to "Y=0"s? Approximately in the ratio of 4:1
The main reason for this is to have a labeled dataset. In real world problems you don't know how many classes exist and which data point belongs to which class. So the best cluster size that you get from K-means can be used as a good starting point to create a classifier.
it looks like this person may have been asking an exam question, so please be cautious.
so you basically want to know your false positive rate?
actually you might want to be looking at the precision score?
false positives bring your precision score down, but false negatives don't.
FPs would be catastrophic for what I'm doing so I want to minimize that without losing too many TP. Is AUCROC or specificity ok?
Yes, lowering FPR (i.e. raising specificity) should give what you want.
Draw a Precision vs Recall curve and set the threshold which will maximize the Precision without losing a lot of Recall.
Yeah you can also plot ROC curve and choose the threshold where FPR is low but TPR is high
That should still work for unbalanced data like mine? Like accuracy is kind of bad here because a model that only predicts 1 would still get 75+ percent
ROC doesn't solve the imbalanced dataset problem. Upsample the minority class and plot the ROC curve or Precision vs Recall
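A sketch of the threshold-picking idea from the precision vs recall suggestion: sweep the candidate thresholds and keep the one with the highest precision that still meets a recall floor. The labels, scores, function names and the 0.5 recall floor are made up for illustration:

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting 1 for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # no positive calls, no FPs
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pick_threshold(y_true, scores, min_recall):
    """Highest-precision threshold that still keeps recall >= min_recall."""
    best = None
    for t in sorted(set(scores)):
        p, r = precision_recall(y_true, scores, t)
        if r >= min_recall and (best is None or p > best[1]):
            best = (t, p, r)
    return best

# Made-up labels (4:1 positives to negatives) and model scores.
y_true = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1]
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.60, 0.55, 0.40, 0.30]
print(pick_threshold(y_true, scores, min_recall=0.5))
```

Here the pick is a threshold of 0.80, which gives precision 1.0 at recall 0.5. How much recall you are willing to give up is the business call, not something the curve decides for you.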
Ok. Thanks for the help!
is there a reason to use seaborn over mpl?
my prof's exclusively teaching it and i can't see why other than maybe having to do a bit less work?
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
So seaborn contains a lot of prebuilt and defined plots and visualization that can be directly used.
Whereas matplotlib is a plotting library with limited predefined visualisation methods but greater customisation using its APIs.
In simple words:
If in future you want to create a custom plot/visualization that doesn't exist in seaborn or Matplotlib then you can use matplotlib library to create that using its API.
You can even create your own library like seaborn using Matplotlib.
huh ok thanks that's kinda what I was thinking
@paper lake listening to this podcast from 2 biostatisticians from Johns Hopkins talking about data science and thought of you

noooooooooo
oh well still thanks
i will follow that

I was surprised this took a long time. Apparently, deepix is some GH lib that can interpret blurred text. I thought that this was implemented wayy before
the amount of information in the 'mosaic' blurred images is incredible. a good model with plenty of data can easily break it
lolol one of the biostatisticians now works at etsy as a data scientist

its a field of study
you can start here if you are interested https://www.coursera.org/learn/ai-for-everyone
Offered by DeepLearning.AI. AI is not only for engineers. If you want your organization to become better at using AI, this is the course to ... Enroll for free.
Thank you!
np the course is built for non-technical people but its a nice overview
Is AI machine learning?
ML falls under AI
oh ok
must be some basic stuff with easy to relate examples
why the specific output? just asking
Its still not clear what is your end goal. 'motion tracking' - do you want bounding boxes or segmentation (or maybe both?)
That looks pretty advanced. I have never done anything like that. sorry

gonna have to use some big data tools
big boy tools
the tensor is a float32
the whole of it?
why don't you flatten the output and use it like that
dont know how to fix this exception error, can anyone help?
is this a data science question? in either case, take a look at what the s.run function is expecting
Oh I solved that problem, I was passing wrong arguments and not sure if it was data science question but I was trying to automate my data reading and writing
I really like working with Neural networks, Especially GANs. I am training Stylegan2-ADA on about 6.7K minecraft images, it's really cool
#career-advice is another place where you can ask about that. you might want to give more context. I don't make hiring decisions, though there are data science jobs in my region (mid atlantic US) that accept applicants with bachelors degrees and relevant coursework
Almost every data scientist listing I've seen has said that a master's degree is preferred, even if a bachelor's is the minimum qualification.
some listings say "bachelors and five years experience, or masters and two years". so basically if you don't get a masters but you spent the time you would have spent getting it in industry, the effect is the same.
Well the problem is you have to get that experience in data science
So you have to find at least one job that's willing to take you in with no experience in data science
hmm...also, does anyone have any idea on how to get your first internship?
Funny thing is that most of the types that get those jobs are people whose only actual knowledge of ml comes from coursera classes. The truth is that even something as relevant as a phd in "Mathematical Optimization" does not do much for practicing machine learning. The managers for these positions don't know much about ML and believe the hype; thinking it is something from science fiction. There is not as much work/jobs in ML as people think, most of "Data Scientist" positions are really just "Data Analyst" Positions. After talking to some "Data Scientists", it is shocking how little they know. I have wondered what will happen to these people once the "Data Science" bubble bursts in a few years......
hopefully I will have secured enough marketable experience by then 🤷♂️
same

i think there will always be roles for technical people explaining things to non-technical people. if its less ML than promised, i would still be okay with such a role
Well there are many analyst roles that require python, if you are also very technical you can move over to data engineering.
maybe instead of neural nets, youre doing linear regression. im still okay with that
to be fair, I am more worried about the python language once the data bubble bursts. Python web is dying, there is also a lot of competition in sys-admin languages
the funny thing is one of the 2 hosts of a podcast i listen to did her phd in optimization. shes a principal data scientist or whatever but everything shes said sorta kinda backs up your point but there are exceptions
new language time?

Not so sure about that, I know multiple data scientists with PhDs who've done pretty complex work in industry, even made their own techniques for multimodal data and stuff
Python's current data ecosystem looks a lot like Java's big data/hadoop ecosystem 8 years ago....a lot of those projects died, only a handful outlived the bubble (Spark, Presto)
There might be a lot of data science positions because of hype and a lot of people being hired when they're crap but that's a separate issue entirely that stems from stupid marketing of data science MOOCs
The whole "sexiest job" thing has probably hurt the field significantly
Data science and web isn't the only thing python is good at lol
So much scientific and engineering work is python
I am hoping people in the python community see this, and try to make sure python can survive the data bubble bursting
so python science is prob not going anywhere
not much work outside of academia for it tho
tbh, I think the focus should rather be more on research and development than so called 'application of ML'. right now, its more oriented towards "getting 0.5% more in that benchmark"
There's absolute tons of scientific and engineering python outside academia
if you look at the pydata ecosystem, there are a lot of packages/projects for different science disciplines.....I don't know if they will translate to out of academia work......
umm....why not?
not sure what you mean by pydata ecosystem but the scientific python ecosystem is large and well used within industry
everyone thinks R is going to die but its still used heavily in many places
just not by software developers because they dont actually know any science
tell me how many of those https://numfocus.org/sponsored-projects can be used outside of academia
I know many of these actually being used in industry lol
the top of the list lol numpy, pandas and matplotlib are all used
Heck I've used some of them
I struggle to understand your arguments for such reasoning
things like MDAnalysis and ITK are only for work in academia
who said that?
(I am not talking about Pandas/Matplotlib)
MDAnalysis is a Python library for the analysis of computer simulations of many-body systems at the molecular scale, spanning use cases from interactions of drugs with proteins to novel materials
I assure you this isn't only used in academia
I might be wrong.....fair enough....
spanning use cases from interactions of drugs with proteins to novel materials
Probably most biotech firms?
sounds like biotech/pharma
Biotech and pharmaceutical yes, but also general material science
Molecular dynamics has tons of real world usecases lol
lots of interesting R packages
you know R?
its good for stats tho
But it's darn useful looking
R has many useful functions
I guess there is a place and time for each language
Any programmer claiming X thing isn't used in industry is probably saying it because they have no domain knowledge about any industry outside of software
I personally think MatLab is pretty good for stats. It seems pretty simple IMO
they have a GUI for regression 🤷
hmm idk Matlab's stats capabilities but R's stats functions are pretty comprehensive
Tho its DL toolkit sucks AF. its so limited
it has a self-driving toolkit too. nobody uses it lol
to be fair, that is True.....But I will argue that a few jobs in that small field will not save python
I dont get your obsession with one language

They absolutely will because they're not a "few jobs in that small field", it's "many many many jobs across many fields, just fields programmers don't know enough to work in"
It's like how CS majors claim MATLAB is dead and no one uses it
Because they don't realise it's used a fuck tonne, they just don't know enough other things to find and get those jobs
even if python dies, we will migrate to another language if need be (there are several alternatives in development) what matters are the core programming fundamentals, not syntax
yeah matlab is very industry specific
I guess I am the only one here that likes python a lot and wants it to thrive
Python is going to be used for many decades across robotics, control, signal processing, physics dynamics modelling, data analysis, etc.
yeah, its hard to unravel a lot of effort and work put in it
Python can't be the one language to rule them all until it becomes a fast lower level language with no garbage collection
Rust on the other hand...
omg garbage collection
i hate that i have to do it so often
on some stuff
and then i forget to do it when i need to

I am vaguely familiar with garbage collection. is it the clearing of memory for stuff for which the variable does not exist
if someone deletes a variable, shouldn't it delete the stored contents too
I vaguely played with gc when I tinkered in c/c++. Java has no gc right? but it is a lot faster than python
x = [1, 2, 3] # allocates memory
x = 3
# someone needs to clear the memory in which [1, 2, 3] is stored
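A sketch of who that "someone" is on CPython. The `Payload` class is a made-up stand-in for the list (plain lists can't be weakly referenced), and the immediate cleanup shown here is CPython's reference counting behaviour, not something every Python implementation promises:

```python
import sys
import weakref

class Payload:
    """Stand-in for the list, since plain lists can't be weakly referenced."""

freed = []

x = Payload()                       # allocates an object, x refers to it
weakref.finalize(x, freed.append, "payload freed")
print(sys.getrefcount(x))           # count includes the temporary call argument

x = 3                               # the only strong reference is gone
# On CPython the count drops to zero here, so the object is freed at once.
print(freed)
```

So rebinding or deleting the last name is enough; the interpreter frees the contents itself, and the separate gc module mainly exists to clean up reference cycles.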
You mean you think it should be more theoretical instead of practical?
yeah, I think I meant it the other way around
no, I am just saying that while its good to make great perf on benchmarks, people are losing the ultimate goal for actually developing AI. not overfitting giant models on all the data available on the net
ML isn't "AI", fancy regression is always what it's been about
ML never started with "AI" in the goal
Personally, and maybe its because ive always worked in corps, I've always seen ML as an intricate, valuable tool, but just that, a tool
You will be shocked how many people shy away from python due to its lack of performance
well, some people did aim for making AGI (Like Turing for instance) its just that not much is there for cutting edge stuff
I know people who spend DAYs killing themselves over a meager 0.5% accuracy KPI, where its not needed
I respect the theoretical building and the fantastic work many researchers do in ML and DL, but personally, it's not my thing, fk research lol
making 0.5% on a benchmark =/= AI progress (note I use AI, not ML)
I want to be able to apply those things to something useful to me, that's all
Based on what? For what types of task?
obviously, to apply them properly, i need to understand them, what they do, etc
I'm yet to see anyone shy away from python due to performance. Has to be a very very specific thing lol
well, it depends. for me, I appreciate the theoretical work more than the applied one because the theoretical ones are focusing on making AGI. Practically deploying models doesn't sound very appealing
even simple and non-performance-critical things (like an internal REST API that will get a few requests a week)....people are not rational, they want to learn the fastest thing
This is the beauty of life, people with different perspectives. Some people love building tools for the sake of the challenge or achievement, others like me? We just like to hammer shit xd
😁
I have never seen Python missing out on Rest APIs because it's not fast enough, there's other languages that are better at it for other reasons
Python gets left for something faster all the time but it has to be in:
- business logic not numerics, or
- extremely high throughput required, or
- needing low latency and real-time
dont get me wrong though @grave frost I can still get all nerdy and ask about specific shit, There's a reason i studied engineering (even though i only ever worked business)
but not in data science, any CS graduate could shut me up with their in-depth knowledge xd
why is it that some package written in pure python still do stuff in like 0.001s (or at least they claim to)
What type of engineering?
Industrial engineering ~= business engineering anyway
I doubt low level language can improve performance more than 0.001s
james acaster is great
I mean, i dont regret my choice truth be told
Business degree would be too shallow for me
and the other engineering are too close-minded for me
the bigger thing is if you need to do those 0.001 task 2 million times
in which case the lower level language can and will eat Python's cake
I guess the only other thing i could have studied was CS, but that wasnt a choice for me at the time
😦 ❎ 🍰
how are other engineering degrees too close-minded 
@lean ledge perhaps i didnt explain myself properly. What I meant was: Their domain is exact -> A single topic. They are extremely in-depth and useful, but also narrow
Well, that was my view at the time, and for the most part i think it has held up.
Industrial engineering is shallow af. You dont learn much about any topic, but you get a good notion of a lot of things which helps you be versatile
what I don't get is that java feels like python but with forced classes and forced types......why can't they make a python compiler that takes typed python and gets the performance of java
that was the focus yes
Because types and strictness
what you mean?
@misty flint Its origin is basically in operations: factory management, floor management, etc
you need to know about production processes, statistical quality control, but also business and admin
Knowing X is this particular type and the output is supposed to be that, etc, you can avoid extra operations that check the types of the inputs or ensure X thing is happening, there's less data to clean up, etc
For example, if you have 2 ints in Python and you write (a+b), it checks the type of a. a is an object, so fundamentally at the low level it's a PyObject struct, and you have to access its value, see if it supports the + operation, then see if b can be added to a, etc
is there syntax difference between reference counting and what java does? I feel like they look the same for the user
dym that if python was explicit, it would be faster?
In Java with strict typing, your compiler knows a and b are ints beforehand, so there's nothing extra to do. You just insert an add operation
and that's it
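If it helps, here's a rough Python-level sketch of the lookup that happens for `a + b` at runtime. `manual_add` is a made-up helper name, and it is simplified (it skips the rule that gives subclasses priority and the same-type shortcut), so treat it as an illustration of the idea and not the interpreter's actual code:

```python
def manual_add(a, b):
    """Simplified model of what `a + b` has to figure out at runtime."""
    result = NotImplemented

    # Step 1: does the type of `a` know how to add `b`?
    add = getattr(type(a), "__add__", None)
    if add is not None:
        result = add(a, b)

    # Step 2: if not, does the type of `b` know how to be added to `a`?
    if result is NotImplemented:
        radd = getattr(type(b), "__radd__", None)
        if radd is not None:
            result = radd(b, a)

    # Step 3: nobody handled it, so this is a runtime TypeError.
    if result is NotImplemented:
        raise TypeError(
            f"unsupported operand types: {type(a).__name__} + {type(b).__name__}"
        )
    return result


print(manual_add(1, 2))        # int + int, handled in step 1
print(manual_add(1, 2.5))      # int gives up, float handles it in step 2
print(manual_add("a", "b"))    # same expression, completely different operation
```

None of those checks can be skipped unless something knows the types ahead of time, which is the whole advantage of a statically typed compiler.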
yes!
that's why things like cython etc work
they force you to do type annotations etc
Try making a game engine in C and then the same engine in Python.
and that lets them optimise
gaming is probably the one thing where python will never, ever shine
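Nope, for the user they look the same. Java's memory is managed by the JVM's garbage collector running alongside your program
From the user's side they're the same thing
Python does it automatically too
@lean ledge are you a DS?
Under the hood they differ though: CPython keeps a reference count on each object and frees it when the count hits zero (plus a cycle collector), while a tracing GC like the JVM's periodically goes through the memory it has allocated and clears whatever is no longer reachable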
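For what it's worth, on the Python side you can watch this happen. A small sketch, assuming CPython specifically (other Python implementations and the JVM manage memory differently); `Node` is just a throwaway class for the demo:

```python
import gc
import weakref


class Node:
    """Throwaway object that can point at another Node."""

    def __init__(self):
        self.other = None


gc.disable()  # keep the cycle collector quiet so only refcounting is at work

# 1) Plain reference counting: freed as soon as the last reference is gone.
node = Node()
probe = weakref.ref(node)       # a weak reference does not keep it alive
del node
freed_by_refcount = probe() is None

# 2) A reference cycle: the counts never reach zero on their own.
a, b = Node(), Node()
a.other, b.other = b, a
cycle_probe = weakref.ref(a)
del a, b
alive_after_del = cycle_probe() is not None

# 3) The cycle collector finds the unreachable pair and frees it.
gc.collect()
freed_by_collector = cycle_probe() is None
gc.enable()

print(freed_by_refcount, alive_after_del, freed_by_collector)
```

The count handles the common case immediately, and the periodic collector only exists to mop up cycles.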
https://rag.gy @exotic maple
so, if we do everything explicit in python, does it boost it a little bit?
In CPython, the normal python you do, it doesnt. The explicit type hints are just hints
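Quick way to see that for yourself (a minimal sketch; `add_ints` is just an example name):

```python
def add_ints(a: int, b: int) -> int:
    """Annotated as int-only, but CPython never checks this at runtime."""
    return a + b


# The annotations are stored as metadata on the function...
print(add_ints.__annotations__)

# ...but nothing enforces them: passing strings runs without complaint.
print(add_ints("py", "thon"))
```

The hints are there for tools like type checkers and editors; the interpreter itself still does the usual runtime dispatch.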
bro this is cool. How do you do it lol
can a typed compiled python in theory be as fast as java?
You need to switch to a different python implementation that takes advantage of it
Cython is real fast. Idk how it compares to Java, but it's getting there
cython with only python objects is not very fast
Do what? the website?
yeah
ah i see. still seems pretty practical.
if I take non-numeric python code and add types... it will not be very fast using cython
It's just a normal website I made with bootstrap because I was bored. .gy is the Guyana domain
I just got lucky I got rag.gy as a domain
like a good generalist skillset
It is pretty practical. You can crap on any business bachelor, but any specialized engineer will shit on you
i feel like industrial engineers could be a good product manager maybe

I am a project manager xd

Cython is fast if you add types to the variables and disable a bunch of things python does by default, like bounds checking, etc.
bro. I loved this shit in university
Operations research is amazing
Its the one part of math and college i loved
did not know that.....
Sadly i never got to use it so i forgot everything
Oh yep, safety checks like bound checks also add to slowness
optimization?
yup
Cython can tell you what code is probably slow with `cython -a`.

Optimization, queueing theory, etc
s i m p l e x
HAHAHAHA
S I M P L E X
no wait
"GUYS LETS START EXCEL SOLVER"
I do lots of optimisation as a robotics/control person so operations research is mildly cool. Not as cool as more continuous type optimisation though
I've never made a website. I envy you 😦
Convex optimisation is cooler as a subject
It's really easy to make lol
One big reason to use cython is that it automatically works with numpy and you don't really need to setup a C/C++ project (all those different build tools are a nightmare and one big reason people dislike C/C++).
Cython is what happens when you realise as a scientist your simulation is slightly slow but it's going to be a bitch to rewrite in C++
So you add type hints and a couple other optimisation and get that last bit of juice
so its basically like a C wrapper for python to make it faster?
isnt Cython just normal python?
It's Python compiled into C or C++ through the right typehints and optimisations to make Python significantly faster
That's CPython
It's python (with some extra stuff) to C, but also with a bunch of stuff added to make it work with python. It compiles to a shared object / dll which python can load.
wait those are different things? cython and cpython?
Yep
-screeches in dyslexia-
Cython's used tons by my scientist friends
Although they also liked numba last they tried
I kind of dislike numba, it yells at me too much, I often give up and go to cython.
Also Jitting each time you run the program can give annoying startup times.
numba has some really cool scientific computing stuff
I dream of a world where scientists don't need to learn the ins and outs of C++ because parallelisation to CUDA or clusters becomes much easier and optimisations are automatic
It feels a lot like OpenMP, but python.
Oh how easy it would be to write simulations then.
Right now it's hell trying to get the kernels tweaked just right and dealing with all the API gunk.
I also wish FPGAs were easier to get into and use.
For making things that don't really work well on CPUs nor GPUs.
like python -> FPGA numba or something
anyone had the issue where jupyter lab/notebook can't import NLTK?
I have specific env and everything, installed in there and everything
but its not working
I have verified all pointers in the env are ok as well
@iron basalt You can write OpenCL code and export it to FPGAs https://www.electronicdesign.com/technologies/fpgas/article/21794531/how-to-put-opencl-into-an-fpga
yeah, its a struggle for some reason
thats why I always prefer pre-defined environments
@grave frost any clue on how to fix it?
Yeah that's what I use, but it's not on the level of numba, though some python code could be written to generate opencl kernels.
whats the error anyway?
did you activate it?
obv lol
well, then you did not install it. try force-reinstalling it
I can use it via cmd
maybe some error you missed
I can use it via cmd, but not in jupyter
hmm....whats the output with pip and pip3
already satisfied
you can see im in the env as well
its also listed
I tried everything I can think of
pip3 install nltk
when something doesn't work, there is only one solution left
ima try restarting my pc. bullshit sometimes works after restart
Reboot
LOL
yea, haha
fking crap isn't working
zzz
what the f
now it worked when installed it outside the env...
it might be a PATH issue
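One way to check that theory is to run something like this in a notebook cell and again in the terminal where the import works, then compare. It's a diagnostic sketch only; `describe_environment` is a made-up helper, and it assumes the problem is the kernel running a different interpreter than the env, which is a guess:

```python
import importlib.util
import sys


def describe_environment(package: str = "nltk") -> dict:
    """Report which interpreter is running and whether it can see `package`."""
    spec = importlib.util.find_spec(package)
    return {
        "interpreter": sys.executable,   # the Python binary actually in use
        "prefix": sys.prefix,            # root folder of the active environment
        "package_found": spec is not None,
        "package_location": getattr(spec, "origin", None),
    }


for key, value in describe_environment().items():
    print(f"{key}: {value}")
```

If the interpreter path printed in Jupyter is not the one inside your env, the kernel is pointing somewhere else, and installing with that exact interpreter (`<that path> -m pip install nltk`) or registering the env as a kernel with `python -m ipykernel install --user --name myenv` (where `myenv` is a placeholder name) is usually the next thing to try.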
Did you run any pip stuff in the environment
The conda docs warn that mixing pip installs into a conda env can break it
Then you can throw that env in the trash
That has to be the weirdest argument I have ever seen max_features=0.7000000000000001
Thats like.. Ok bro
without knowing the context, could be one of those people who don't understand how floating point precision works.
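My guess (and it is only a guess, the original code isn't shown) is that the value came out of a parameter grid built with float arithmetic, something like a 0.1 step multiplied up. A small sketch of how that produces exactly this kind of number:

```python
import math

step = 0.1
value = step * 7   # e.g. the 7th point of a grid built from a 0.1 step

print(repr(value))                # not a clean 0.7 on IEEE-754 doubles
print(value == 0.7)               # exact comparison fails
print(math.isclose(value, 0.7))   # tolerance-based comparison succeeds

# Rounding each grid point gives the tidy values people expect to see.
grid = [round(step * i, 1) for i in range(1, 10)]
print(grid)
```

So the argument is harmless, it's just 0.7 with the usual binary rounding error showing through in the printout.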



