#data-science-and-ml
1 messages · Page 300 of 1
And, well, linear function do not useful computation make
Too much CGP Grey?
haven't actually watched that much of him, but maybe that's where I picked up that phrase, yes
there is
AI is purely applied maths. Usually it all boils down to linear algebra. You don't have to understand how many of these models work and what they do mathematically. It would help no doubt. But it's probably better to keep the implementation of them a black-box and focus on the when / what / why / pros / cons of models. You can always dive deeper into the maths later
Is it possible to pass a list of X_trains to LSTM
or how do i fit several x trains into one lstm
Does anyone have any ideas of implementations of fully unsupervised local POS taggers?
Do you mean, like, merging several datasets into one?
Okay, data science question: I have a pandas dataframe of UTC timestamps and values. I want to plot the average (maybe +-std) value by time of day (disregarding the date). How would I do that? My current solution is very hacky and possibly incorrect.
binning by hours?
so 24 bins, count values and plot accordingly?
pretty much
except, obviously, I want to do it "the right way"
since it seems to me like a common-ish operation
Can you share your current way?
I think a transformer can cluster data pretty well. Bidirectional LSTM autoencoder could work as well. But is there a reason to do it unsupervised? There are a bunch of huge datasets out there.
something like how this guy does it? is that a similar scenario https://youtu.be/jV24N7SPXEU?t=171
This is part 8 of my pandas tutorial from PyCon 2018. Watch all 10 videos: https://www.youtube.com/playlist?list=PL5-da3qGB5IBITZj_dYSFqnd_15JgqwA6
This video covers the following topics: math with booleans, groupby, datetime attributes, line plots.
NEW TO PANDAS? Watch my introductory series (30+ videos):
https://www.youtube.com/playlist?list=...
he did pd.to_datetime() previously before working with that column by the way
so his approach was
df.groupby(df.column_datetime.dt.hour).column.mean()
column being the column of interest
This does indeed work, though I wonder if I can also apply seaborn to the task instead of manually calculating the standard deviations.
means = df.groupby(df.date.dt.hour).eu.mean()
stds = df.groupby(df.date.dt.hour).eu.std()
plt.plot(means,label="mean")
plt.plot(means+stds)
plt.plot(means-stds)
Oh, this works and does what I want:
hours = df[["date","eu"]].copy()
hours["hour"] = hours["date"].dt.hour
del hours["date"]
sns.relplot(data=hours,x="hour",y="eu",kind="line")
essentially creating a new column with the dates replaced by hours, then passing this dataframe (with a lot of y-values for each x) to relplot
result looks like this, which is what I wanted
yup
magic
does relplot automatically display std?

or was there a parameter you had to specify
i think i might try and use seaborn then for my project
It's on by default
interesting
it also calculates them using bootstrapping, which takes like a second for my barely-2k datapoints 😅
(but it scales well with big n or something, don't remember why bootstrapping is considered nice)
thats pretty nifty
Low resource language 😦
the problem is not in the clustering - but that I don't have any ground labels
I did find a way using NLTK (after a lot of hours of searching) , but thought that maybe there is some resource I missed
Write a program to generate a series of marks of 10 students. Give grace marks up to 5 of those who are having <33 marks and print the new list of the marks.
I've tried this but not working:-
import pandas as pd def Ser_stumarks(): std_marks = [] for i in range(1,11): m = int(input("Enter the marks:")) std_marks.append(m) s = pd.Series(index=range(1201,1211),data=std_marks) s[s<33]=s+5 print("New List is:") print(s[s>=33]) Ser_stumarks()
Is there any easy and simple code to do that?
what's "grace marks"?
oh, I see, increase their mark by 5
why are you filtering the printed result to only the >33 ones, though?
i added quiet alot of data for lstm to learn with, but the results are only worse
any ideas why?
Hello
I have this
sns.histplot(datos, bins=9, binrange=(10,99), color='gray', kde=True,
line_kws= {'color':'blue','linestyle': 'dashed'},
fill=False)
But it's still entirely gray
How can I change the kde line color?
reduce learning rate, try different optimizers
thanks
That sounds like a bag of words
VSM is basically an improved version of BOW that use some more complex information (like tf-idf score for a word)
I mean, alot of it. Mostly linear algebra. A nightmarish amount. xD
LOL, there's a lot of math inside AI
Any help?
Ty for this, reading this
actually i said that there isn't something about math in site
not in ai
Write a panda program to enter marks in main five subjects of a student and Calculate sum of all marks.
b) Also write a small python code to create a dataframe with headings(Name and Age) from the list given below :
[[‘’Alex”,26],[“Maddy”,44],[“Rolex”,26],[“Mona“,37]]
Now sort the data as per the name
SUM=DF[‘ENG’]+DF[‘MATH’]+DF[‘HINDI’]
DF[‘PER’]
I can't do the sorting part
Rest I've done plz tell me how to do that
Did you create the dataframe?
Yes
Sort it
there is no side of AI that is free from math
.sorted()
What to do with this:- SUM=DF[‘ENG’]+DF[‘MATH’]+DF[‘HINDI’]
DF[‘PER’]
Do I have to use sum command?
What's that?
What part of the question?
This plot is so sexy i want to marry it
its like being in college all over again:
are there sorting algorithms in pandas
or do you just do .sort
I actually don't remember much about pandas
.sort_values
To be quite honest, sorting is one of the basic things that he could look in google
Meanwhile, I am stuck with my seaborn question from 2 hours ago
oh ik now, im just laughing because it reminded of my interactions with ym college teachers
I know the pain too
Instructions so vague so that you cant go anywhere but give the impression you're still teaching something
"prof how do i balance this load (Dynamics)
"do you have the equations?"
"yes"
"solve them"
and im like: This mfker getting paid so much to say that shit?
Lol
Academy is like that
They are paid for research papers, not to actually give a fuck about the future professionals
If i told my boss "solve it" id be out of the door before i said "im so-"
Anyone here knows seaborn?
what do you need?
I want to do a histogram with a kde line overlay
The histogram bars should be gray and the kde line should be blue
This the code
sns.histplot(datos, bins=9, binrange=(10,99), color='gray', kde=True,
line_kws= {'color':'blue','linestyle': 'dashed'},
fill=False)
All the elements are gray
image?
ah
I see yourmistake lmao,
straight from the doc @twin mantle
you're using line_kws
you need kdw_kws
kdw
kde keywords contain "color" as well
Yeah but when I use kde_kws
With a dict with key color and value 'blue'
I get this
There's a question where I have to put 10 values in series but its saying that init() takes from 1 to 7 positional arguments
thats....weird
try again and show mt he full traceback?
bwecause that guy was using distplot, but im checking histplot
That's the full traceback
^^ 
Pass the values inside a list
Ooo
So a([1,2]) instead of a(1,2)
Ya got it
yes
thats defeinitely a bug
nice finding lmao
first i find a bug haha
that's a problem with the class instantiationg since it calls back to init
hello! is there a way of getting the same result as
import numpy as np
labels = np.random.randint(0, 4, size=10)
a = np.zeros(shape=(10,4))
for idx, _ in enumerate(labels):
a[idx, _] = 1
print(labels)
print(a)
without that horrible loop?
uh, are you trying to create an aray with just ones?
because...
What are you trying to do?
on each row in a, I want to set a single value to 1, the position is given by labels
a = [[0. 0. 1. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[0. 1. 0. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]
[0. 0. 0. 1.]
[0. 0. 0. 1.]
[0. 1. 0. 0.]
[0. 0. 1. 0.]]```
so label[0]=2 means a[0,2]=1, label[1]=1 means a[1,1]=1
etc
hey everyone, if anyone's looking to throw down in "Sliced: a Data Science Competition", reach out to nickwan on Twitter and tell him that Nschamps sent you. The competition heats back up in June with 16 competitors as the goal. I had a lot of fun this season look forward to seeing some of you there.
nickwan_datasci went live on Twitch. Catch up on their Science & Technology VOD now.
that's...strange
does that have some programatic logic?
because i cant see any other way, as-is
I still don't understand this
Do the positions change?
Same here, unless is an exercise
I need to prepare a label matrix to pass to a loss function
but i dont see any logic there
I dont see a way outside the loop
because you have your indeces as a list too
or maybe...
mmm
are those indeces related to the row?
not sure what you mean here, labels are generated randomly at each iteration so yeah?
sorry man i cant help you tbh im not seeing the order in what you're trying to do
hm so
I'm trying to compute the cross entropy on the output of a nn
which has size 5000 total samples x 3000 possible outcomes
in my code, labels is the position of the correct outcome out of the 3000
on each of the 5000 rows
Mate, with all respect
You're using big words for this problem. The question is simple:
What is your criterion to change values?
How do you decide the indexes of the values, row-wise, column-wise?
size is constant right?
Write a program to generate a series of marks of 10 students. Give grace marks up to 5 of those who are having <33 marks and print the new list of the marks.
the size of the matrices is constant, the values change with each iteration
How to give that grace marks
Mate, try to do some effort, pls
@shadow frigate well, what you are doing is called one-hot encoding - your label range is determined by the length of the row so that may confuse someone, but it is (in essence) one-hot encoding on a fixed array
yeah now that you put it that way, it is 
uh-huh
holy moly I'm tired 
so you can use the pre-built modules in sklearn, or generate an array each time and append in on the appropriate axis
oh shit you're right
he's manually one hot encoding????
yeah, something like that
might be
but its easy to confuse (AFA I have understood)
'twas a long day ok 
better take a break 😁 I find it helps a lot when doing long stuff
its pretty well studied that most realizations come in a break AFTER working
so i'm trying to make a habit of work, then rest 10-15 mins
repeat
yeah, shower for me 😄
yeah I'm gonna take a fat nap
the brain needs some time for the information to settle
lol whats that?
a long nap
yep time to stop, thanks for pointing that out, I'm definitely done for the day
cheers
it means a long nap people
ahhh...I am too young for naps
ive had naps since i was 10 yrs old lmao
-tropics life-
its heaven dude
a noon nap and you're refreshed all afternoon
sad. I just can't sleep any time 😦
I call it nap, but i dont relaly sleep
just close my eyes
and calm my mind
that works too
why its showing nan ive passed values na 😐
I am making a dungeon crawler and I have done all the basic stuff (the player,tile,collision etc.) and now I wanna make a test level for which I have to make a BOSS I have created the sprite of the boss but IDK how to implement the AI for boss as I never did these kinda stuff ( I am using pygame) so please help me.
you want a dataframe with a single column?
its better to create the series with index right away...
and use this method
Hmm I see
try reading documentation my man. saves a lot of time :p
but for the sake of it, paste your lists here i'll to reproduce
it worked fine for me?
oooh
This isn't really appropriate for this server and can be considered a bit rude. Please be mindful of your wording in the future.
whoops, didnt know. thanks for the heads up
Ya bro done @exotic maple
now i'm curious about why it doesnt work though
it seems you cant convert like that
it inherits the indeces
Lol I have to write all my codes in my file , will see you next time
dang i cant answer why we get those nans vals
Why do you use the list(range())?
placeholder index
you cant pass range alone
its a generator object, not an iterable
I THINK -hesitant-
that the problem there in the 2nd case is that tries to look for the values of index in the parent series
yes, that's exactly what it does. Its not setting the argument of index as the index, its searching for it
I have my answer now 😄
how can I tell the people in pandas to modify that documentation section?
This is the docs for the NLTK HMM - I want to do unsupervised tagging on my dataset
class nltk.tag.hmm.HiddenMarkovModelTrainer(states=None, symbols=None)[source]
Bases: object
Algorithms for learning HMM parameters from training data. These include both supervised learning (MLE) and unsupervised learning (Baum-Welch).
Creates an HMM trainer to induce an HMM with the given states and output symbol alphabet. A supervised and unsupervised training method may be used. If either of the states or symbols are not given, these may be derived from supervised training.
Parameters
states (sequence of any) – the set of state labels
symbols (sequence of any) – the set of observation symbols
Does anyone know about states and symbols? I can't find much from googling
Any improvements as to how I can speed up the following:
Df.groupby([a,b,c]).agg({col1: [funca, funcb, funcc], col2: [funca, funcb, funcc]})
??
doubt you can
what are the functions?
Statistical functions like kstest(), some of the use standardscaler() etc
that’s about as efficient as you can get, assuming everything is vectorised
What you mean vectorised ?
and memoisation won’t help
parallel application of certain basic operations
it’s a numpy thing
how big is it?
25 mill
Yes that’s what I thought
show code
Out of interest does reshaping take a lot of time ?
sure can i dm you them?
an example would be
def func1(x):
x = pd.Series([e*100 for e in x.values])
_scaled =
StandardScaler(with_std=False).fit_transform(x.values.reshape(-1,1))
return kstest(rvs=_scaled, cdf='t', N=len(_scaled), args=(1, ))[1]
no thank you
...yeah
that's probably not a good idea
oh no hahaha
you're new to pandas and numpy, right
so
you should just do this
(for example)
!e
import numpy as np
a = np.array([1, 2, 3])
print(a)
b = a * 100
print(b)
@velvet thorn :white_check_mark: Your eval job has completed with return code 0.
001 | [1 2 3]
002 | [100 200 300]
you have a list comprehension there
which of course will be slow
and then you go through the overhead of converting it back into a Series
I would suggest
reading up on the basics of numpy
it would help you write better code
this makes complete sense thank you very much for the example
also I question the wisdom of using StandardScaler there?
!e
import numpy as np
a = np.random.rand(5)
print(a)
zero_mean = a - a.mean()
print(zero_mean)
print(zero_mean.mean().round(5))
@velvet thorn :white_check_mark: Your eval job has completed with return code 0.
001 | [0.2841178 0.3210537 0.3932124 0.62640306 0.96628655]
002 | [-0.2340969 -0.197161 -0.1250023 0.10818836 0.44807184]
003 | 0.0
yes
again, I would suggest a bit of research on the purpose of sklearn's transformers
they are helpful for building a pipeline
but in this case what you are doing is just a single operation of centering
nooo gm what happened to your username
it would make more sense to use a plain numpy operation
blame Google for making GMail
I kept getting pinged
oh damn you're right I didn't consider that
Hey guys what all comes under data science engineering? Do you guys think it's gonna be worth it ? I'm kinda confused whether I should be taking cs/data science/AI...do you guys think the placements different in them?
what is data science engineering
Thank you so much for all this it has definitely given me some valuable pointers
there is data science, and there is data engineering, but I have not heard of "data science engineering"
yw! atb
does anyone know how to get matplotlib on python? I am taking a class on udemy and its a little outdated so there is no proper instruction. I couldn't find anything online. Does anyone know?
THANK YOU SO MUCH!!!
yeah no problem
!otn a data science engineering
:ok_hand: Added data-science-engineering to the names list.
now everyone will hear about it
anyone know how to use between_time or something equivalent to select rows in a pandas dataframe that are between a given start time and end time, but for a dataframe that has multiple days in it, so like all the rows that are between say 7am and 9am for a dataframe that has a datetime index with rows going across multiple days
filter using datetime accessor
Google “datetime accessor”, I can’t type code right now
sorry, I should be more specific, I want to be able to filter between arbitrary hour and minute combinations, just like with .between_time, but across multiple days
but I have not heard of "data science engineering"
me neither
ive heard of "full stack data science"
which is like front + backend skills + DS + (some Ops skills maybe)
🦄
yeah, why wouldn’t that work
or could you provide an example please
I can conceive of "full stack data scientist" referring to "understanding data science having general programming skills", but idk why having web dev skills would matter
i dont either. if i find the listing again, ill show you
they are listed as two separate skill set categories
as you can see

Is nosql like guis with flowchart blocks for code?
i still dont know what exactly is so attractive about mongodb
bro that profile would need to pay upwards of 150k in the US lmao
i know some remote workers in my country working as FS devs for us companies and they make 100k
REMOTE
which country

oh hey i know someone in the same situation

they have an advantage bc same time zone
unlike EU or Aus
I'd like to get a remote junior data scientist / analyst from my country, but that's too much dreaming i guess lol
one of the companies i think im going to work for has a croatian branch
and im like
how does that work
with time zones and such

lmao i work with my colleagues in China, India, Russia, Bulgaria, etc
ig its fine if its morning here + later afternoon there
trust me, you get used to it
a good scheduler will amke sure there's at least a bit of overlap
depending on business needs
im sure they will more than likely put me in project teams that the members are more local
so we can sync better
anyway
depending on the nature of the job, haivng sometime dif can be good
I can at least restrain myself from shouthing at my india/china colleagues because they arent live :v
so i'lljust send an email and vent more "professionally" ay
hi, i got a 'long' pandas dataframe, that has 3 columns: property, value and playlist, basically is a dataframe converted from wide to long format using pd.melt(), the problem comes when i try to plot a bar catplot with seaborn, and i pass a column name as the x values and when i show the plot the x values don't show up
this is the dataframe
and here are my code and how the plot currently looks
#bar catplot
bar_catplot = sns.catplot(
kind="bar", x="property", y="value", hue="playlist", legend=True, data=long_frame2, dodge=True
)
bar_catplot_figure = bar_catplot.fig
catplot_render = mpld3.fig_to_html(bar_catplot_figure)
i agree. im more of an in-person person anyway

which is why im glad this company usually requires in-person
work
in the office
can anyone help explain what p-value is
`#HOMEWORK
#Q.1.Write a function to find the factors of a number.
number = int(input("Enter a number:"))
factors=[]
for i in range(1,number+1):
if number%i == 0:
factors.append(i) #Append: It adds a single item to the list. It modifies the list by adding an item to -->
#------> the end of the list.
print("Factors of the {} = {}".format(number,factors))`
#Q.2.Write a function to identify whether a number is palindrome or not.
num = int(input("Enter a number:"))
temp = num
rev = 0
while (num>0):
dig = num%10
rev = rev*10+dig
num = num//10
if (temp == rev):
print("The number is a palindrome.")
else:
print("The number is not a palindrome.")
#Q.3.Write a function to identify whether a string is palindrome or not. string = input("Enter a string:") if (string == string [::-1]): print("The string is a palindrome.") else: print("The string is not a palindrome.")
Does anyone have issue with running Tensorflow in Python 3.9?
if you do remote work, then isn't there a chance of you working more to get a project done as opposed to set hours in the office? I read in an article that a lot of tech people are being exploited this way
Hey, did you know any active discord about tensorflow?
You should always negotiate according to your needs and preferences. I don't see why that is bad for speedy programmers, in which they'll make more money than others locked at the same wage. I'm a slow worker so that model is not feasible for me for big projects, but I've done it as a way to get extra money other than my full-time job. Also, it's a great way to improve your efficiency.
is dash plotly used with mongodb?
You could, yes. What do you what to do in dash? If I may ask.
making a dashboard
Disregarding the type of data manager you use, you could always graph data with dash. Even more if it's tabular data.
in short, I can connect with pymongo, do manipulations (most important is json_normalize) and create some graphs I want
but I wanted to make an 'interactive' dashboard that updates via mongodb
not sure if it makes sense, but I guess there are 3 options?
mongodb charts - but I think I cannot manipulate the data as with pymongo
powerbi - which seems to be able to connect with mongodb and allows the needed manipulations but with M which I'm not really familiar with
dash plotly - which I guess I can re-use my previous code, but I can't find many results online how to keep getting data
do the above make sense?
Yeah, makes sense. I've never done anything with mongo other than simple queries and not particulary good with dash either, but you could either do it in PBI or Dash. If your data is a behemot sized monster, I'd suggest using Python with Dash, since you can setup a buffer. I've seen a lot of resources in Dash so it's doable, but you'll have to do some trial and error. If the data is medium (Less than 8gb for example) you could do it in PBI which is higher level and overall easier, also with tons of resources.
I guess then my real question is how the dashboard can be 'live'. But actually even 'updating' daily would be fine for me.
Oh, then in that case I would transform everything to a Pandas Dataframe and update it
PBI also has (something like a checkbox option) alternative were you can toggle updating the datasets and their relations.
I forgot the name, but it should be under 'ñManage Relationships'
ok i will check
Update your apps on page load or
on a predefined interval (e.g. every 5 seconds)
You have to update both the data and the dashboard if you'd like to go with the PBI route.
https://docs.microsoft.com/en-us/power-bi/connect-data/refresh-scheduled-refresh
https://docs.microsoft.com/en-us/power-bi/connect-data/refresh-data#types-of-refresh
how can I visulize clusters if they are based on a single feature?
So i am trying to build a speech recognition model. I am not that skilled so i am using sk learn. Lets say i have some recordings of my voices in .wav format. What do i need to do to make them trainable data?
Is the second dimension a scale? You could do something like a % distribution on the other axis.
What are you using to read the wav files and what is your data like?
Hi guys. I have a CSV file with lot's of empty strings. How do I drop or delete them with pandas?
So far, this is my code.
df = df.dropna(how='any', axis=0, thresh=2, inplace=True)
Running it gives me none. When I remove the inplace I also don't get the dropped rows.
I am not sure of the way i am gonna read my wav files. The datas are like folders of different words such as a folder of hello and a folder of bye
What way would u suggest me to read the files?
Could i use that for sk learn as well?
You need your data in text format first so it could be fed to the sklearn library.
Arroje su pregunta Señor Diego
i dont speak spanish . _.
I said "Ask away, mister". Your name is pretty common in Latin América lol.
Seems like a correct assumption about the xticks, but not sure why it's happening. Give me a minute pls.
linspace(start,stop,number) always gives out start as the first point and stop as the last point.
(there's a parameter to change this behaviour)
str onject has no attribute decode
i keep getting this error
anyone knows wha to do
decode is a method of bytes that converts them into strs, the opposite is str.encode.
ok.. so bytes has the decode attribute? not str?
the decode method, yes
I want to use python to calculate the antiderivative of a function and store it as a function that i can use
how might i do that?
check out scipy.integrate
!docs scipy.integrate
This appears to be a generic page not tied to a specific symbol.
if you mean numerical integration. If you mean analytical, sympy.
i see functions for integation, but would you know how i could basically pass in some function F(X) to integrate
and then it could return me the integral as a function
so i could just pass variables into it
you can just make the function call scipy.integrate.quad each time, from (say) 0 to the argument
that'd be time-inefficient, but will require no extra memory
alternatively, precalculate the integral's values for the entire interval you'll be working on and use values from it
sure
but there is no explicit way to sav eit as a function right
?
if so, i can just create my own function
and assign attributes i guess
yeah, something like that
thanks
so then
if i want to pass in a function f that takes in a value of kx instead of x, do i just multiply the integral range?
quad(f, 0, math.pi)
like quad(f, 0, k*math.pi)
i feel like there needs to be another way
not sure what you mean by this
no it is not it's a continuous variable
Does anyone know how to find the nth largest drawdown of a portfolio
anyone know a good api for facial landmarks?
or anything
to get coordinates of them
the dlib library is a popular one
we used that in our face recognition project
gives you 68 x,y coordinate points
try that one
there you go
rip
yeah its not trained to do it on those types of images
we actually proved that in our project
lol
best one i know, so gl bud
whats ur project
How can i return the nth largest drawdown for example
r = [.01,-.01,.004, -.02,.01] n = 2
for that, you would have to train your own model on your own custom data
I'm not sure whether this should be in this channel, but if I'm looking at historical data in the form of candlesticks, how would I be able to find local mins/maxes ,using say a dataframe format, for my data?
basically it checks for face landmarks like eyes for example. then I get those coordinates and i paste something else on top of it
what's the end goal?
fun
went from a normla picture
pasted flares on it
or paste whatever you want to on the eyes
so you want just eyes or all facial features
yeah
I mean in the future maybe could do smth with the rest but eyes are like main thing
just get their coordinates then 🤷 train a model for that - data wouldn't be too hard
or just google "get location of eyes from face in python" and youd probably get some indian tutorial using OpenCv
probably
tried looking into that already
or well
thats what im doing rn
just have to figure out how to get the end picture of opencv into a pil image or bytesIO
researching things is a pretty important skill
I figured
and with google scholar, its not as hard as it was before
Hey guys, I've been trying to do a Gaussian blur of a RGB image.
I know how to blur a grayscale image (with 2d convolution kernel), but I'm having a problem with implementing the process for RGB image and 3d kernel. Should all layers of the kernel be the same or not?
Yup, all the same, unless you want the kernel to also mess with colors.
so it'd just be 3 gaussian kernels stacked on top of each other
how can I visulize clusters if they are based on a single feature?
Hey anyone knows how to apply groupby().agg() on index instead of columns?
You can always reset index to turn index into column
Hi all! I've recently gotten into data science and I'm currently trying to do some research into Linear Regression. I'm able to train and make one prediction (the basics), but I'm not sure what keyword(s) I should be looking for when I want to use the trained model in order to predict with a given variable.
E.x. I have a dataset with country, text (nl: hallo wereld for example), I'd like to pass a variable to the model to predict what the given text is. What keywords would I have to look for and is linear regression even the way to go for such a thing? Sorry for the confusing question, I tried my best but still trying to get the hang of this thing 😄
Hi, I want work on recommedation system ,but i can't find a source ,course exc. Can you recommend a source 😄
I am self-learning via a book called Intelligent Projects using Python (Packt Publishing), they have a project called "Intelligent Recommender System". For the source code you can find it here: https://github.com/PacktPublishing/Intelligent-Projects-Using-Python/tree/master/Chapter06
But without the book it might be hard to understand how the source code is implemented so I still suggest you look more online
thank you 👍 👍 👍
Hi data scientists, which disk based storage formats do y'all use most often for dataframes?
Use cases are for long term storage, as well fast read/ write capabilities.
I'm a dev on an analytics team. Currently, we are using pickle almost exclusively.
Been exploring parquet, and it's different engines but was hoping someone had some experience 🙂
pandas supports sql, if that matters
spark dataframes is good if youre looking for more production stuff
yeah, I know. We're just a small analytics team in a big company. Our oracle servers are all hosted internally.
Though, I guess the question I should bring up with the analysts is why do they want disk based storage
I wasn't really told a specific "prove parquet > pickle" or something, mostly just to explore the options
oh wait spark doesnt used disk-based storage unless it has to
mapreduce does
this makes it literally 100x faster (spark)
that would be a good question to ask
xd
This is the channel for talking about data science.
Making code reproducible sucks AF
yo
is there a repo that can change my voice to another persons voice?
i just recently ran into https://github.com/NVIDIA/tacotron2
still wondering if I should work on it
Tacotron is for TTS
there's a lot of research done on that, its pretty easy
but you may not be necessarily be able to use tacotron
got links to any papers?
its basically TTS - but the problem is the data.
I thought of using speech recognition to convert my voice into text and then converting that text via tacotron 2
Two Minute papers had a method that can replicate exact voice using 2 minutes of train data
but I would have to hunt for it tho
5 seconds?
wdym?
must be on the cutting-edge - no way you are deploying that unless yove done your masters
or the contributors are active
if you have colab, you dont need docker
yeah, it helps but not always
only issue with colab is the other person has to have access to the dataset too
this was the one featured in the 2 minutes paper, python code mostly but not tensor flow now.
for me, sometimes I get problems with different GPU's on Apex
lol youre probs training too much for colab
so?
I was wondering if there was something newer in the field of speech synthesis
this is a good method that also allows you to share with others https://cloud.google.com/dataproc?hl=en_US
you can even run a jupyter notebook on it
for bigger models
I don't have the $$ I only use colab
done it; blew it
yea

yeet
im a broke boi
tho there is a workaround to use unlimited GCP 😏
Which I am currently using
Hey! im 13 and I am currently learning python via codecademy because I want to get into machine learning... any tips?
im also thinking about ordering these books. https://www.amazon.com/dp/1119245516/ref=cm_sw_r_sms_api_glt_fabc_WMFMSRNVV4PQBHX2B3TH and https://nnfs.io/
if anyone responds please dont hesitate to ping or dm me!
Those seem great starters. ML/AI is pretty complicated - especially the mathematics involved. if you do not understand something, you can get an intuitive knowledge of it from youtube. it would still take you some years to understand, but I promise it would be a pretty fun journey 🙂
As you learn more maths in school, things would make more sense. but don't stress if you don't understand anything. we are always here to help!! 🤗
alright! thanks.
I would recommend you make Youtube your primary source of knowledge. visualizing things is very easy to understand
ok
@fickle surge I'm not a pro but I would start with codecademy (SoloLearn has a course on Machine Learning which is preety good, they're both preety similar though). Go from there to codecademy/udemy, there are a bunch of more advanced courses. Also on YouTube there's a lecture series on ML by Steven Brunton (sp) but it's quite theoretical/mathsy. I wouldn't worry about the maths/theoretical aspect until you've built some practical/fun projects and you still like it Also Unity ML Agents is a great practical intro, no need to know any of the inner workings.
https://www.reddit.com/r/learnmachinelearning/wiki/index is a great index for resources
alright, im already in a python course on codecademy so i think im going to finish that to learn the basics
Pickle is generally a bad choice since it is only Python readable essentially. Parquet can be read by other things
Parquet works really well with services like Azure Data Bricks
We generally just use csv or json though or read directly from a DB
What book/course/websites that you all recommend to build a first project on Google Cloud? I literally never use it before and I just mess around with it today I don't have a clue where I should start...
they have what are called Quests on gcp
try to complete those. i think thats a good place to start
There are a lot of services on GCP. It’ll be easier if you have a project in mind so you can narrow it down to a few core services
@misty flint @stiff barn thanks, just signed up for the quests, wish to learn more haha
Feel lagging so behind in the industry 😦
No better time to start catching up than today haha
If it helps, I’d say the core services to start with would be cloud storage and cloud functions. From there pub/sub and a database like Firestore or BigQuery
You can do quite a lot with that combination
@stiff barn thank you so much! I will do my best to be better in that
Hope you don't mind if I pm you in the future if I have any question
Go for it @gray arch
@grave frost my end goal is to make an assistant that i plan on modeling off of jarvis from iron man. I want to hook it up to smart home stuff and have it write emails to name a few things. how hard is it to do something like that?
That’s a big project. Google, Apple, Amazon, ect... have a bunch of engineers designated to solving just that.
What are some things to do with machine learning?
Or I least I want to make something that I can Interact with that. I could probably have it comunicate with Phillips hue api easily so when
Check out https://ifttt.com/
You can probably use that to build a mvp more simply
Alright
Thanks
@stiff barn just want to point out, I’m trying to learn machine learning not just make something that serves that purpose
Trying to think of a project to do
Large data sorting?
Have any large datasets you want to teach a computer to organize for you?
I’d probably pick a more approachable project
Like....
Not necessarily
Go on kaggle.com and try the beginner projects like the titanic one
Ok
I’d also pick up a book on the subject or sign up for an online course
Uh
Jarvis is a pretty lofty goal
I don’t really understand why that’s so many people’s goal when they first learn DS/ML/AI
it took quite a while to build Siri and Alexa
I’m not saying it’s impossible for one person to build something similar on their own but it’s definitely very difficult
It’s probably just the first use case that comes to people’s mind
Data engineering specific interviews increased by 40% in the past year. The second fastest position growth within data science roles went to business and data analysts which increased by 20%.
Data engineering is the new data science

I was looking at that this morning @misty flint haha
Guess I should just stay a data engineer haha
for now yeah

maybe just keep an eye on the waters for now
they say once companies establish their data infrastructure, there will still be some data eng jobs just less afterwards
i saw it from a YTuber i follow
I have created multiple videos about data engineering, including a data engineering course for beginners. Why would I advise anyone against pursuing a career in data engineering? I like being as transparent as possible - while this job will be great for many people, it might be disappointing for the others. In this video I'm outlining three reas...

I mean like a barebones version of Jarvis
There will always be more data engineers that scientists.
im just glad i signed up for this graduate level databases class next semester

its like super full and the waitlist is super long
even a “barebones version” of Jarvis is going to take a while
This won’t be for a very long time if ever. It takes years to bring an established company into the modern data infrastructure. There will always be more companies and new and improved infrastructure to bring them to.
The correct project in working on to bring a company into the cloud won’t be done until 2022 at the earliest
This kinda inspired me... seems like a good starting point
But yeah, getting data engineering skills even if the goal is to be a data scientist will only help
the YTber was just saying as stuff like DataBricks, Azure Data Factory, and Denodo standardizes and virtualizes data, there will be less tasks
idk if thats true, thats just what she said
V what that guy does
It takes away the annoying stuff. Like setting up and maintaining a Hadoop cluster
Are you talking to me?@hollow sentinel
Who wants to do that really
@fickle surge yeah
If so I’m learning and that’s kinda my goal once I get everything down
idk how fast you learn but it’s quite a bit of stuff
but you’ll make progress
If you do it consistently
You’ll need to have a solid understanding of software engineering as well to build something of that scope @fickle surge. That won’t just be an ML model, it’ll be a system of things that all need to be developed and interact with each other
Alright
yeah that’s what I was trying to say
This notion that because something "big" is coming from "big tech" means you shouldnt learn a skill is crap that should disspear
Hi all anyone here ever made a trading algorithm ?
we have steel mills and automated carpentry nowadays, but carpenters and blacksmiths still exist (as niche, true) careers, and are also well paid
A lot of people have. That would actually be a question for #data-science-and-ml
oh fuck that's where we are
LMAO
thought we were in algos and data structs
Lol
Stelercus.exe has stopped working
pkill -u stelercus
-> googles: "HOW TO KILL A CHILD"
what
no
-> corrects: "HOW TO KILL A CHILD PROCESS I'M SORRY#
Sry for disturbing
amd still have driver and software issues?? should i buy an amd or intel laptop?
I have to say, since i've gotten used to pandas...I kind of dread touching excel lol
You have ascended 👼
is that gil

2nd?

💀
anyway
@exotic maple idk if you saw the charts earlier but maybe data eng remote job?
growing more popular
I only know python, not enough backend to do engineering
well, python and MYSQL
if sql is considered a programming lnaguage
Currently most data engineering roles require only three main types of skillsets: SQL, Python, and algorithms.
oh theres also this but less common
We're seeing a rise though in data engineers needing to understand system design and architecture problems as well.
omg id love statistics and A/B testing
shit's easy AF once you get the hang of it
and its easy to show face with it lmao
you srs?
SQL I kiiind of know, python id say "intermediate" and algos...i've never taken a formal class but feedback from CS friends tell me i have the logic down
eh, who knows, i might just try it out
I work as a data engineer now and almost everything boils down to SQL or Python. I do a lot of cloud work and ML which will set yourself apart but I wouldn’t say it’s required. Cloud is becoming more and more but that’s easier to pick up
screw me. all this time i have been shy to apply when I have at least the basic skills for it?!
Lol yeah I’d give it a shot
you might want to learn Scala too
Yes
What are good resources to learn SQL but dive deep in things like efficient querying and such.
Beyond the basic here's how you do X.
spark is written in scala, and it's going to gain traction as data pipelines start leveraging spark more and SQL less...
get some o'reilly books and learn how to read query plans for the most common database engines
tuning SQL is such a weird art though. It's a declarative language, so its not like you can easily tell the query optimizer "do it like this"
isnt SQL pretty much super optimized by definitiion? I mean, the DB structure and optimization has to be done by a full DBA, not an user of the DBSs
I like to say "SQL gives you the benefit of a bunch of really smart people that already figured out how to do most of the simple things"
like simple joins, you dont have to decide what the best way to join tables is. as long as you have good choices of indexes and keys, the database will usually do the joins in a very efficient way
As of right now, I created my own db with postgresql in ubuntu running on wsl2.
As of right now I haven't figured out how to connect a SQL gui instance to the DB but honestly I kinda want to just use something like psycopg2 then move on to SQLAlchmey.
use psql if you want a handy terminal client for postgres
Yeah I do use that.
python libraries are good but they don't do the admin stuff too well. they kind of assume the database is already built.
I haven't set up my DB for production so to speak. But as an aspiring analyst I hope I won't have to.
Some of the user accounts have their passwords stored as plain text. I don't think I intalled the DB the most secure way according to the docs but I'm just using it to learn.
you can run postgres in docker too, that might make some aspects easier (or it might make it worse)
Apparently you're supposed to install it in its own user w/ the least possible privileges of any user and not have any other software installed on that user.
are you running in linux?
Yeah
this is literally what i did lmao
but for postgre you can use pgadmin
ive had good luck just using the vanilla packages installed using apt or whatever package manager, very little manual setup
Yeah but for production you want to be careful.
sounds like a good reason to run it in docker 
i banged my head for a day trying to dockerize our team project
but eventually i got there


all your dependencies are belong to me 
did you do it using an Alpine image?
i think i used buster

our project was just really finicky
had a flask component to it too
also i never had used docker beforehand
so there was that

buster was a wise choice... but if it was built on flask, i have to ask... what web server did you use
Hi guys.
I need some help about a data analysis project i am working on. I'm working on a bank customer data with transactions and salary info. These information is available for 3 months, and I need to calculate the annual salary of the each customer.
new_df = df.groupby(['account','month'])[['amount']].sum()
I grouped each customer's salaries. however I cannot use each months data as columns. Is there a way to create columns such as 'august', 'september', and 'october' and append the new_df['amount'] values to these columns?
Thanks in advance
Shouldn't you just groupby month and sum without the account names?
Is it because you need to append the average salary as a new column in your original data frame?
The data based on the transaction movements of the customers. for instance, one customer has more than one row in the data.
So you need each customer's average salary by month, I see
Okay so you have a multilevel index because of this. I'd approach it differently, create a new data frame with all unique account numbers only as a column, then write a series of groupbys on account where month =8,9, 10 etc and append each series to the new data frame
You can write a simple function to speed this up and pass a list of months to it
Oh, that's great. Thank you very much. I'll try that immediately.
Or rather, when you groupby, you get accounts and their sum salaries, then join on the account numbers
Something like this (sorry I'm on phone)
Thanks Dyllyn. I appreciate it.
Ugh I swear code is impossible to write on phone
Well lemme get back to my com, but lmk if you get it
No no pls don't write it 🙂 I'm trying to learn it.I appreciated your help, that I was trying to say
Thank you very much.
Ah, I made a mistake, at salary the account and month will still be on the index I think. That's for you to fix :)
I don't use groupby that much
Hello, In NLP text summarisation, is there any way to programmatically differentiate between extractive and abstractive summarisation?
anyone worked with end to end speech synthesis before?
Someone who really understand very well about Data-visualization can help me on the #🤡help-banana pls? i'm stuck on this problem for a while...
I believe unstack() is what you are looking for
anyone used Flowtron, FastSpeech2, WaveRNN, Tacotron2 or Real-Time Voice Cloning before? I'm thinking of starting out on one of them and was hoping to find something simple to begin working on.
what's wrong with the ones you mentioned?
I am just starting out so I want some resources and something simple to start with.
They are pre-trained models lol. what else do you want?
That's about as simple as your task gets.
Yea but I'm not just gonna use it as it is.
voice cloning isn't something easy like visualization or regression
Don't I have to tweak it and stuff?
not much, compared to training your own model from scratch
I wanna train my model but not from scratch
you mean you want to fine-tune a model?
yep.
well they provide pre-trained models in those repos
also for some reason there's no tutorials in youtube regarding this
in python atleast
most of it is in jupyter notebooks.
what? no one is going to spoon feed something so complex. you would have to research and understand things on your own
There is no shortcut that would work well for you. they might give decent results, but not very convincing/realistic
ofc you can - but it would take a lot more projects. learning with projects is great and I consider it the best way to learn; but to learn something, you have to understand some theory too, not just copy the code by some guy on youtube
copy the code, learn about it, mess around with it and build on top of it is what I was actually thinking.
that's pretty shallow learning
for example I want to be able to clone Morgan freeman's voice to generate speech from text, that sounds like him, it's been done before but it's just an example.
that's what I am trying to explain to him. but he seems open and receptive (unlike some of the others)
bc voice cloning is pretty ambitious if you're just a beginner
well I'm sorta informed but nothing indepth.
define sorta informed
like
do you know the math behind the field?
the math is what you're going to need if you want to finetune parameters
by informed I mean I know about LSTM, RNN, CNN, GAN, stylegan
and some of the underlying logic..
supervised / unsupervised learning.
not quiet actually.
so where do I start?
depends on how old you are
so are you in college?
yup computer science major.
well, then just learn it the proper way! take the AI course
attend the stats and math lectures
college is the easiest time to learn IMO
Companion webpage to the book “Mathematics for Machine Learning”. Copyright 2020 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Published by Cambridge University Press.
there is this book here
this will show you what you need to know for the math in ML
A simple start off point https://www.reddit.com/r/learnmachinelearning/wiki/index
r/learnmachinelearning: A subreddit dedicated to learning machine learning
alright perfect.
what would be a challenging yet rewarding task to undertake as a start in ML journey?
thing is I wanna also develop my python skills which is why i wanna do something..
do you have Ai/stats course in your college?
next semester.
we do have simulation and modeling and statistics this sem
is that AI or stats?
AI
ehh, you are in college. just see what books there are and read em up. ask what you don't understand - there are many highly expereinced people here that can answer almost all of your queries
so your recommendation is for now hold off on the voice cloning ambition and work on the core concepts?
exactly
voice cloning will only get easier once you know the core concepts
otherwise you're just grasping at straws
if you aren't enjoying that, you can see some of the technical articles for cloning and they would teach you maths too (albeit with less explanations since there is a lot to cover)
which makes knowing the math even more important
but you would find yourself frequently finding topics to learn and making a list of it
and of course, we are always here
thanks for all the advice, guess I'll start up with some of the math, core concepts and theory before applying things practically
I'll try to break it up into bite sized chunks tho so I don't get bored, cause I actually like working on projects to solidify what I learnt.
me too buddy
@grave frost what was the jedi alternative you were recommending for jupyter?
Is it possible to draw a 3d shape on an image using matplotlib3d?
I know we can do it for 2d, but I am trying to do it for 3d but can't figure it out
I am looking for something like this: https://stackoverflow.com/a/15592168
Is there machine learning app possible with python?
Yes.... probably most of ML is done in python
I’m doing voice cloning to my problems has just been data cleaning for the last two weeks and prob the next two too
Honestly I wouldn’t bother trying to learn the pure math on your own bc like unless you enjoy that you prob won’t finish it you can do a lot of machine learning and deep learning withOUT in depth math
why don't you use some of the pre insisting datasets or do they not fit your purpose?
Bc that’s not what I want I want a certain voice
Like there a lot of real time voice cloning libs on git that are not bad (not great but amazing for the small amount of data given)
how much experience would i need working with them?
@shut valve https://github.com/CorentinJ/Real-Time-Voice-Cloning this one is interesting
also tactotron 2
WaveRNN and Glow,
Yeah some of them are real click record and run it depends what your trying to do with it
TTS tranformers too
I want things I can integrate into bigger projects tbh.
I don't want to re invent the wheel but I'd like to use it build a car, if you catch my drift..
Well yeah then that’s just like taking what you want from it how modular it is to pick up and move differs from project to project
Like most probably allow you to re train a pre trained model and then use that model in your project now that’s an awesome skill to have
But that’s a little more advanced but you don’t need to understand linear to do that
that's what I actually want to do.
retrain, tweak and perhaps understand what's going on.
Hey guys, I am a Data Scientist looking to grow my skills as a Machine Learning Engineer. Does anyone have recommendations for learning resources?
well depending on how much you know Im gonna assume you know python but not much or no ml. https://www.kaggle.com/learn/overview
intro to ml
intermediate to ml
intro to deep
computer vision
Practical data skills you can apply immediately: that's what you'll learn in these free micro-courses. They're the fastest (and most fun) way to become a data scientist or improve your current skills.
I know python but no ml
I do know some of the math involved and I actually love math sometimes.
I like working on projects more than reading tho idk why.
Looking more into MLops and architecture information / resources
i say computer vision because it takes about data augmentation and training pre trained models. now thats a kinda big stack might take you a couple weeks but it would be really good and quick to getting started with ml. Then you can see if you really wanna stay in AI/ML
@shut valve Thanks alot, I've added it to the resource list. I'll be sure to check it out.
Goodluck with your work
and thanks for all the helpful answers.
actually after checking it out I'm adding this to the top of my resource list.
seems like it will help alot.
@dusk kite read pattern classification by stork, duda and hart
kite and TabNine
This topic seems to be less heated today than usual lmao
When I check every hour it always has 50+ messages
Today I came across a job description for a junior data science position that requires "Hands-on experience with ML tools such as TensorFlow, Keras, PyTorch; Experience with Data Mining and Data Analysis technologies and language, including Python, pandas, Jupyter Notebook, Matplotlib, NumPy." Would you say that this is kind of a standard requirement overall?
hi
wait, they want you to use Jupyter notebookss?
I don't know, it seems so, there are no other details
I don't see how jupyter notebook is a skill
its just good for rapid experimentation and visualization
I am not sure about what exactly the job position is but from my AI/ML internship I would say experience in TensorFlow, Keras, pandas, numpy are the top requirements. Jupyter Notebook is useful only when you wanna display graphs and visualizations and such but it shouldn't be a required skill if you are making applications
About PyTorch, it seems legit, but I have only used it in school research not in real industry so far
It is a position related to climate change modelling at a financial advisor, so I guess that the visualization might be related to GHG emissions.
Got it
I see, but it's not too difficult to practice Jupyter Notebook anyway, Google Colab is one of the ways to go haha
Is it ok if my training dataset has some generic floats and some of them are numpy.float s
You would need it when implementing research level findings in your projects - some are for TF, but most use PyTorch
its the weekend here so
im chillaxing today

How do I get Jupyter to display a sympy Matrix?
Currently I have
init_printing()```
In cell 1 and
```A = Matrix([[1, 2, 3], [4, 5, 6], [7,8,9]])
A```
In cell 2
Is there a function in Python to check whether or not a set is a basis?
Or do I just need to do it myself lol?
oh, that's a nice question
hmm, pretty simple actually if you have n n-dimensional vectors - just write them into an n x n matrix as rows and check that the determinant is nonzero.
(but no, don't think there's a function for that in numpy/scipy)
if you have more than n vectors and you want to check whether that set contains a basis, then I actually don't know, hmm. Is there a simple way to check for that?..
Idk how but I got lip-won the HR VP and now I have to develop "an ML model to improve our recruiting"
Its not a job
oh
Im trying to transition away fron my mid position into slmethint about data or abalytics. There is no such thing where I work
So i thought about convincing the CEO, but thars too far up for me
So... HR VP lol.
Oh
Basically, sold her the idea or ML / Data department
sorry I completely misunderstood
So if i get it right
I can get her to tell the CEO and crearw the department
Thats my plan at least
that sounds very good
Yeah but i need to get the ball rolling on my own now lol
Yeah idk about ML helping recruitment
I have a clue of something that can help. Not recruitment itself but after it. Reducing attrition, churn, and other negative metrics
A mix of classification and regression might help there.
Predicting churn. Probabilities or attritition based on profiles, etc etc
it isn’t impossible 🙂
NER on CV to quickly locate high achievers and multiple points of interest. fine-tune the model
NER?
Named Entity Recognition
Named-entity recognition is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. Wikipedia
-shivers- omg not NLP pl00x
I suck at NLP atm haha
Didnt learn properly.
Besides, is there good support for NLP.in Spanish?
still, you can get by with some basics
Everythign ive seen is in Spanish
the CV's?
Ill look into it
Yes. Im from latin america
English wordnets and stuff are useless to me
For that
there are plenty of spanish pre-trained models
spanbert
gpt2 spanish
I myself am currently working with Low resource languages, so your task seems a piece of cake
I wonder if its ok to use pretrained models... legally speaking and all that
yes
AFAIK
I suppose open sourve it shoulsnt matter
yea, both weights and code




