#data-science-and-ml
Do you know the difference between supervised and unsupervised learning?
I only know how to give it a set of inputs and have it predict an output from a different input. But I would like it to learn from its inputs by itself, so unsupervised I guess
Not too familiar with the terms, just the code
Reinforcement learning is actually separate from either of those
Ah
You have to know the terms.
I am learning now lol
Right. Anyway, the first neural network one usually learns is a feed forward neural network
Usually for a supervised classification task
Oh okie
To flap like a chicken
You really remind me of my brother lol.
Is he hot?
Hes my brother
Anyway, I don't think you can just use a feed forward neural network for RL. I can't think of how that would work
You can do RL without a neural network, the neural network serves a specific purpose to expand the capabilities of the RL algorithm beyond toy examples (neural networks do this for a lot of algorithms, not just RL). So you need to first learn RL.
I agree. I would first think of something you can teach an agent to do with RL, and then implement it.
For that, there is the classic book, written by those that invented it.
Finding link.
I miss my brother lol
(They took RL from psychology and made it mathematical)
(it's the classic goto book)
Kind of a bit out of my price range, so ill set aside a fund. But thank you for informing me of it. Hopefully some freelance work can get me enough
Trying to keep a good amount in the bank for interest
Richard S. Sutton is a Canadian computer scientist. Currently, he is a distinguished research scientist at DeepMind and a professor of computing science at the University of Alberta. Sutton is considered one of the founders of modern computational reinforcement learning, having several significant contributions to the field, including temporal ...
"Sutton is considered one of the founders of modern computational reinforcement learning,[1] having several significant contributions to the field, including temporal difference learning and policy gradient methods. "
(Btw it also covers a bit of neuroscience of how actual neurons do it, and they can do several things that DL can't that make them way better at it, it's in one of the last chapters, I highly recommend not skipping that part)
I think this book is available for free from the official site or something
Yeah you can probably find a PDF.
Okie
oh, you can always find a pdf of any book, I mean this one is even official
makes sense. then you can do deep RL

jk
maybe
Please I need help with this
Please don't ask people to read a screenshot of text and infer what your problem is.
what would be a good model architecture for a DNN regression model?
in my dataset, I have:
- 4400 features inputted
- approx 23m samples of data (raw, not split into training, i'm using a 64% train, 16% val, 20% test split)
- 1 output neuron
what i'm mainly looking for is how many hidden layers and neurons per hidden layer I should need for training
I miss my ex lol
So how should I show what's wrong when asking for help, please?
Hi there. When plotting a box plot with plotly express is there a way to only keep the output values when exporting to HTML?
Let me explain :
If I generate a box plot from a dataframe with 1 million records, the output file keeps all 1 million records in the JavaScript, whereas I'm only interested in the min, Q1, median (Q2), Q3, and max values.
Ideally the output file should only have those five values (and some outliers if need be?)
Right now my solution is to manually plot the box plot from a dataframe that contains the BoxPlot info
Add/remove layers or neurons>Does it work well?>Repeat first step.
What are the possible reasons precision, recall, and F1-score would all turn out 0 when I have 27 samples for class 1?
Are there too few samples?
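One common way to end up with all three metrics at 0 is a model that never predicts class 1 at all, which can easily happen when that class is rare. A minimal sketch in plain Python, assuming a hypothetical dataset of 1000 samples with 27 positives and a model that always predicts the majority class (both are made-up numbers for illustration, not the asker's data):

```python
# Hypothetical labels: 27 positives among 1000 samples, and a model
# that always predicts the majority class (0).
y_true = [1] * 27 + [0] * 973
y_pred = [0] * 1000


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class, returning 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
scores = precision_recall_f1(y_true, y_pred)
print(accuracy, scores)
```

Accuracy looks high here while every class-1 metric is 0, so checking how many times the model actually predicts class 1 is a reasonable first step before blaming the sample count.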
Word cloud from Twitter?
Seems about right
Does anyone have articles on distance estimation using object detection? These people have something, but not the distance estimation part yet..
D. Qiao and F. Zulkernine, "Vision-based Vehicle Detection and Distance Estimation," 2020 IEEE Symposium Series on Computational Intelligence (SSCI), 2020, pp. 2836-2842, doi: 10.1109/SSCI47803.2020.9308364.
Guys
I just discovered the iPhones search bar ability to search text in photos
Holy… shit!!!
I need help with Apache Airflow. I'm still browsing Stack Overflow for this. I've been meaning to create 2 custom operators. One gets information and returns a dictionary of it. The other receives that dictionary and prints out the results. I've been stuck on how to share information between the two operators, since both run through execute() of BaseOperator. I tried XComs but still didn't achieve what I want.
class HelloOperator(BaseOperator):
    def __init__(self, **info) -> None:
        super().__init__(**info)

    def execute(self, context):
        # message = f'Your information: {self.info}'
        # print(message)
        return info


class GetInformationOperator(BaseOperator):
    def __init__(self, name: str, age: int, **kwargs) -> None:
        super().__init__(**kwargs)
        self.name = name
        self.age = age

    def execute(self, context):
        return {
            'name': self.name,
            'age': self.age
        }


default_args = {
    'owner': 'Trang Nguyen',
    'retries': 5,
    'retry-delay': timedelta(minutes=5)
}

with DAG(
    dag_id='custom-operator_v1',
    default_args=default_args,
    description='this is my custom operator',
    start_date=datetime(2022, 7, 4),
    schedule_interval='@daily'
) as dag:
    get_info_task = GetInformationOperator(
        task_id='get_info_task',
        name='Cheng',
        age=22
    )
    hello_task = HelloOperator(
        task_id='greet_task',
        info='???'
    )
    get_info_task >> hello_task
does anyone know how to extract OME-XML metadata from czi images in python?
Check pyimagesearch
Where do I start with regex any book or courses?
sounds just like twitter
but thats pretty funny

still valuable data
nonetheless haha
should i jus start at some arbitrary number?
just*
You should probably look at another recent project that did something similar, and see what they chose, and use that as starting point
Hello,i have a doubt while using tensorflow and pytorch
Im trying to plot the model using add_graph
Im using colab but i keep getting an error ,that the only output should be tensors
i like spacy and nltk. havent tried sparknlp.
Hello, people. I am a beginner in programming and I would like to know your opinions on the best learning pathway for learning AI really well and deeply. I am a person who likes to build the foundations of what I want to learn and understand what I am doing. So I would be really grateful for any help
What are you trying to do? SpaCy and NLTK are great.
RuntimeError: Only tensors, lists, tuples of tensors, or dictionary of tensors can be output from traced functions
If you have the time, start by learning Python (not just python for data science) once you're done with python, you can then move to Data Science.
You can use Udemy or Coursera
OK nice! I will check it, thank you very much
Can anyone recommend any good Python tools for getting into reinforcement learning and making RL agents?
I've been having difficulties getting TensorForce to even import
I mean my script couldn't even find the package
that sounds like a directory error
hmm now i see why Stel recommends against conda for beginners

have you tried using just google colab
Never heard of it
try it
just know you need to run
!pip install <your library> before you can import libraries
You'd have to scrape Twitter data first using an API so you can gather enough tweets that capture the specific kind of tweet(s) whose sentiment you want to predict.
After performing the sentiment analysis, if you want to take it a step further, then look into ABSA (Aspect-Based Sentiment Analysis)
Finally, since this is long-term work as you've mentioned, I'd recommend you look into Adversarial Text Attack in NLP if you have more "Whys"
hmm now i see why Stel recommends against conda
for beginners
ftfy
does anyone know template matching?
don't ask to ask
okay
i have wrote a code
which is highlighting wrong boxes
i d like to know the reason
is there anyone to check?
Can someone help me with a credit risk task in Python? Please PM me
i can attach the highlighted boxes if you d like to see
try asking your question in this channel. people don't want to have to DM you to find out if the question is one they can answer.
I'm not comfortable sharing sensitive data with 50k people. If someone is willing to help I'm sure they would DM
you still have to give enough information for people to know in advance if the question is something they can help with.
Ok. It's a task about assessing default risk of loans. Basically I have to construct a regression model using test and training data sets
i ve asked in detail
I wasn't volunteering to answer your question once you had asked it, necessarily. it's just that no one would volunteer to answer until the question was asked.
can you show the code?
and can you show which boxes are highlighted, and explain which ones you want to be highlighted?
you should not feel confident sharing them with a stranger in DMs either
can you like make an neural network ai that can connect to google
like it has access to see links videos and stuff on google like us
if you make the neural network in Python, and there's a YouTube API for Python, then yes.
Ok, Lets say I have a column "Data Type", containing values "1", "2", "3",... How do I create a categorical column out of this? Lets say the column contains 3 different data types. This means I have to create 2 categorical columns. How do I do this with the Pandas package?
try taking that Series and putting .astype('category') on it.
but can anyone make that type of neural network?
Not quite sure how you mean that. Lets say there are only 2 data types in the column. "A" and "B". I have to delete column "Data Type" and instead create a column "A" which will have 1's and 0's
df['DATA_TYPE'] = df["A"].astype("category")
This yields an error for me
if you get an error, please always show the error, instead of just saying that you got one. I have no way of knowing what the error is unless you tell me.
df['A'] = df["DATA_TYPE"].astype("category")
should be this way
No this doesnt create 1's and 0's
I think I have to use a dummy function
not necessarily. but I'm still unclear on what you're trying to do.
I'm going to let someone else try to answer this. statements like "x doesn't work" aren't helpful unless it's clear what x does, and how it's different from what you wanted.
OK. Let me try to explain it better.
I have a (tidy) data frame from a .csv file with 10 different columns and endless rows. For example column "Job", "Data Type", "Salary", "Education", etc.
I want to focus on the "Data Type" column for now. This column contains only "A"s and "B"s. I want to make this column categorical, meaning that I want to delete the column "Data Type" and replace it with a new column "A" that is derived from it. This new column "A" only has 1's and 0's. For example, if in the column "Data Type" in row 4 there was an "A", then, in the new column "A", I want to see a 1 there. If there was a "B", I want to see a 0 there. Hope this was clearer. This all relates to basic regression modelling
you'd wanna add as many columns as you have distinct categories in that case
Yes, but since I have only two distinct categories, I only need 1 new column. If the Data Type is A, there will be a 1, and if it's B, there will be a 0 in the new column. So I don't have to create two new columns
if you have only 2, then yes
you'd given 3 in your original example, so i'd gone with that
Yep sorry, I thought it was three, but it's actually 2
can you do something like myseries = df.pop('col_label')
It's df = pd.concat([df, pd.get_dummies(df['A'])], axis=1)
I just found it with google after making my question clearer
then apply a function to that series, and do df['category_label'] = result_of_operation_on_series
Thank you nonetheless
nice
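For the two-category case described above, a single 0/1 column can be built without a dummy function at all. A minimal sketch, assuming a made-up frame with a "Data Type" column holding only "A" and "B" (the data and the "Salary" values are illustrative, not from the asker's CSV):

```python
import pandas as pd

# Made-up stand-in for the CSV described in the conversation.
df = pd.DataFrame({"Data Type": ["A", "B", "A", "A"], "Salary": [10, 20, 30, 40]})

# Option 1: a direct comparison gives one indicator column (1 for "A", 0 for "B").
df["A"] = (df["Data Type"] == "A").astype(int)

# Option 2: get_dummies with drop_first=True keeps k-1 columns for k categories.
# With categories "A" and "B" it keeps only "B", i.e. the inverse indicator.
dummies = pd.get_dummies(df["Data Type"], drop_first=True, dtype=int)

# Remove the original text column once the indicator exists.
df = df.drop(columns=["Data Type"])
print(df)
print(dummies)
```

Option 2 scales to more than two categories, which is why dropping one dummy column is the usual choice for regression inputs.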
how do I delete multiple columns at once?
df.drop(columns=['my','labels','to','drop']) perhaps?
thx
Task: Define 'REFERENCE_DATE' and 'DEFAULT_DATE' as date variables
how do I do this?
those two are columns
If it appears as datetime64[ns] after running df.dtypes, does this mean they are now defined as date variables?
I ran
df['REFERENCE_DATE'] = pd.to_datetime(df['REFERENCE_DATE'])
before
Is there maybe a more active Data Science related Discord server?
You feel that this channel isn't active enough?
It means that the type of data contained in that column is datetime values. Whether or not that's what you want, I'm not sure.
The data contained within a DataFrame are not "variables"
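Whatever the task author means by "date variables", converting the columns and checking their dtype is easy to verify. A small sketch, assuming made-up values for the two column names mentioned in the task:

```python
import pandas as pd

# Made-up values; only the column names come from the conversation.
df = pd.DataFrame({
    "REFERENCE_DATE": ["2022-01-31", "2022-02-28"],
    "DEFAULT_DATE": ["2022-03-15", None],
})

for col in ["REFERENCE_DATE", "DEFAULT_DATE"]:
    df[col] = pd.to_datetime(df[col])  # missing values become NaT

print(df.dtypes)

# Once both columns hold datetimes, date arithmetic works directly.
days = (df["DEFAULT_DATE"] - df["REFERENCE_DATE"]).dt.days
print(days)
```

If `df.dtypes` reports a datetime64 dtype for both columns, the conversion did what it was supposed to do.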
anyone here can make neural network if so please do tell me
There are a lot of kinds of neural networks. Your question is underspecified

tbh i heard a podcast today about how bad conda was in a production environment
well i need one that is an ai that i can connect to google
way too bulky
Once it "connects to Google", what is it going to do?
Because AIs don't just accumulate arbitrary knowledge
You have to have a very specific idea of what you're trying to do.
well that i and smn else will do but first we need to do the first part
If you don't have a clear idea of what you're trying to do, and you can't communicate it, no one can make a neural network that suits your purposes
something something coherent extrapolated volition :p
the idea is when the neural is made me and my pal are gonna make it self learning ai using python and by that we are gonna make it learn from google and if we can make it create what it learnt
So, neural networks aren't sponges that can just soak up knowledge from anything. They're mathematical constructs that approximate functions.
well we are gonna use that to make it take info from google and learn and keep all its learnings in a type of encryted file
so it doesn't have storage issues
Sorry, but none of this is going to work. I would suggest you try a different project with a more coherent goal.
well tell then how would you make an ai
An AI that does what?
Thats what I was confused about. What do they mean with "Define column x as a date variable"? would df['REFERENCE_DATE'] = pd.to_datetime(df['REFERENCE_DATE']) be wrong here?
Each AI has a very specific thing that it does
self learns like i said
it sounds like you don't have a problem statement
that's literally what i was explaining up there
I don't know what the person who wrote that question thinks those words mean.
Self learns what? This isn't a coherent problem statement. It sounds like you need to spend more time learning about what AI is in general, so that you can come up with project ideas that make sense in terms of what AI actually is.
The first step in any project is defining your problem. You can use the most powerful and shiniest algorithms available, but the results will be meaningless if you are solving the wrong problem. In this post you will learn the process for thinking deeply about your problem before you get started. This is unarguably the […]
this should be pinned imo
I mean its a super simple task no?
I think I'm making this way more complicated than it should be!
I'm sure it's a simple task, but if we don't have shared definition of what "variable" means, that's going to make communication difficult.
What you did is probably correct. Can you ask the person who told you to do it to confirm?
Again, the data in a DataFrame aren't variables. They're just data, or elements of the DataFrame. Variables are names for objects in the python environment.
Why do you want it to "self learn"? What is the end goal? What is done with the knowledge?
are you trying to predict a numerical outcome?
are you trying to label an instance as a particular class?
Forget everything you think you know about AI and whatever. Just specify what the goal is first. Then we can discuss if Ai is even required, and if it is, how to go about it.
Be specific about the goal, start general and add more details.
This reminds me of a friend helping some business owner, and one of the things the owner put on the list of to-dos was to "use AI" with no explanation at all lol
"I want to go to the moon, and it needs to use a train." - You are hiring experts to help you with your business because you admit to not knowing how to do it or don't have the time, so it's only fair that you make no assumptions about the process or it will sound like that quote.
what is done with the knowledge it helps you make stuff or shows you the code you need cause not everyone can learn so it helps learning and showing so it benefits
strange, I thought these days it's usually "it needs to use the blockchain" :p
This is beginning to sound like Github copilot, is this correct?
uh not sure what that it is
hmm interesting i see what you mean but if i just get the neural network to work i can get the rest done too
Picture of train sold as an NFT.
If you still ask this, then you don't understand what we are saying to you
you want a neural network that can help you code?
okie i see i confused everyone give me a bit to make it better a explanation
hello i have a question, in this layer snippet (Conv1D(3,5, activation='relu', input_shape=(200,3))) it has 291 params, how do i explain this manually?
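A 1D convolution has one weight per (kernel position, input channel, filter) plus one bias per filter. A hedged sketch of that count follows; the helper name is made up for illustration and is not a Keras function:

```python
def conv1d_param_count(filters, kernel_size, in_channels, use_bias=True):
    """Trainable parameters of a 1D convolution layer."""
    weights = kernel_size * in_channels * filters
    biases = filters if use_bias else 0
    return weights + biases


# Conv1D(3, 5, input_shape=(200, 3)): 3 filters, kernel size 5, 3 input channels.
print(conv1d_param_count(filters=3, kernel_size=5, in_channels=3))
```

For the layer exactly as written this gives 5 * 3 * 3 + 3 = 48, not 291, so the 291 presumably comes from a different configuration or from more than this one layer. Note that the sequence length (200) does not affect the parameter count.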
another mini data sci project done
how do you like that jason brownlee
but the models probably overfitted even with cross validation
oh no
i'm dumb
i was using regression models for a classification problem
that was embarrassing lol
dude why are people starring my repo
@royal garnet what's your question
I have a dataframe consisting of a bunch of sessions for an event spanning several days.
There is a start datetime and end datetime column - and I want to somehow get new data frames for each day
But the catch is, I am writing a program that can take any csv as input - so I won't always know what the dates are.
Just that there could be 1 or more days worth of dates.
Is that something that can be done?
can you give an example
:incoming_envelope: :ok_hand: applied mute to @royal garnet until <t:1656991001:f> (9 minutes and 59 seconds) (reason: discord_emojis rule: sent 80 emojis in 10s).
oops
well it pings the mods whenever it mutes someone, at least
mhm
:incoming_envelope: :ok_hand: pardoned infraction mute for @royal garnet.
sorry about that
thanks luna
!paste use the pasting service to avoid this issue again
Pasting large amounts of code
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
@royal garnet we're back
@royal garnet
of course 
Man's traumatized
we're chatting in dms now, dont worry :p
Oops
I'm back
I think I have something to try - but a quick follow-on question. To confirm, can pandas group dataframes by datetime objects?
say, by each individual day in a column made up of parsed dates
Oh man this guys video just made my day.
https://www.youtube.com/watch?v=cUArbPdzR_c
Pandas has great support for dates and times, and that extends to its grouping capabilities, too. In this video, I show you how to group on datetime fields, both indirectly (by creating a new column) and directly in the call to "groupby". This video continues my previous one, in which I introduce grouping in pandas.
Jupyter notebooks from what...
hello, I'm using pytorch and colab, how can I visualize my model's architecture
Okay, wtf am I doing wrong here.
grouped = evt.groupby('day')
grouped.get_group('day')
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/home/max/python/venv/pm-toolbox/scratch.ipynb Cell 11' in <cell line: 1>()
----> 1 grouped.get_group('day')
File ~/python/venv/pm-toolbox/lib/python3.10/site-packages/pandas/core/groupby/groupby.py:747, in BaseGroupBy.get_group(self, name, obj)
745 inds = self._get_index(name)
746 if not len(inds):
--> 747 raise KeyError(name)
749 return obj._take_with_is_copy(inds, axis=self.axis)
KeyError: 'day'
I am following the pandas documentation - and it just won't work and I'm about ready to toss my laptop out the window.
If you are grouping by the day column, then "day" isn't going to be the name of a group. One of the unique values in the day column will be.
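To make that concrete: the keys of the groupby are the distinct values in the "day" column, so `get_group` needs one of those values. A minimal sketch, assuming a made-up `evt` frame with a datetime "start" column (the real frame was not shown):

```python
import pandas as pd

# Made-up sessions; the "day" column is derived from the start datetime.
evt = pd.DataFrame({
    "start": pd.to_datetime(["2022-06-14 13:00", "2022-06-15 07:00", "2022-06-15 11:00"]),
    "session": ["SESSION 1", "SESSION 2", "SESSION 3"],
})
evt["day"] = evt["start"].dt.date

grouped = evt.groupby("day")

# Iterating a groupby yields (key, sub-frame) pairs, one per distinct day.
keys = [day for day, _ in grouped]
per_day = {day: frame for day, frame in grouped}

# get_group takes an actual key, not the column name.
second_day = grouped.get_group(keys[1])
print(keys)
print(second_day)
```

The `per_day` dictionary also covers the earlier question about getting one dataframe per day without knowing the dates in advance.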
It's okay
Pandas has so far proven to be the most challenging thing to learn...
Is that normal - or am I just not getting something?
It's different from the rest of python
Okay so question
It's normal. It's big with lots of stuff in it and it uses various things like operator overloading abuse, multiple types allowed per function (implicit function overloading), and abstractions upon abstractions.
I have a pandas dataframe in which column('SOP') has numbers 0 to 100. It also has another column called "open" with numbers in it.
I want to create a new column where whenever the column SOP == 0, it takes the value of open the last time SOP was equal to 0 and subtracts it from the open value of the current row.
How can I do this?
I can show code if this is confusing for you.
I've been stuck on this for literally 8 hours
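One possible reading of the SOP question is "for each row where SOP == 0, current open minus the open of the previous SOP == 0 row". Under that assumption, selecting the zero rows and calling `diff()` on them is enough. A sketch with made-up numbers; the output column name `open_change` is an invention for the example:

```python
import pandas as pd

# Made-up data with the two columns named in the question.
df = pd.DataFrame({
    "SOP": [0, 50, 0, 100, 0],
    "open": [10.0, 11.0, 12.5, 13.0, 15.0],
})

is_zero = df["SOP"] == 0

# diff() on the filtered rows compares each SOP == 0 row with the previous one.
# Assignment aligns on the index, so rows where SOP != 0 are left as NaN.
df["open_change"] = df.loc[is_zero, "open"].diff()
print(df)
```

The first SOP == 0 row has no predecessor, so it stays NaN as well.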
Makes me wonder if maybe there is an easier way to approach solving this. I'm working with csvs and then I want to define some functions to pull certain bits of information based on conditions - and then populate that information and write to a spreadsheet or save it to a db. Right now, I am just trying to find a certain unique string in a row, and then for each day find the minimum time for that given unique string.
In addition it's building on top of Numpy which is already its whole own thing to learn.
It would be simple to do with some manual loops and such, but Python is slow, so to do it fast you need to know how to do it with whatever functions Pandas/Numpy provide.
My datasets are rarely longer than 500 ish rows
Well, you can do it manually first, see how it goes, and then maybe try to find how to do the same thing faster later.
Ok, I solved my problem
But screw it - I've already put 2 days into learning this - may as well keep on cracking
If you are used to Pandas/Numpy then it becomes easier to some extent to do it with the functions provided, but there is a learning curve before that point.
Am I on the right track with using groups to figure this out.
I can't, it has pii info on it.
Give me a moment, I'm putting something together
@iron basalt Something like this:
Session Start Date/Time End Date/Time Session Name Session ID Speaker Code Full Name Email Address
2022-06-14 13:00:00 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 UUID FULLNAME EMAIL
2022-06-14 13:00:00 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 UUID FULLNAME EMAIL
2022-06-14 13:00:00 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 UUID FULLNAME EMAIL
2022-06-14 13:00:00 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 UUID FULLNAME EMAIL
2022-06-14 13:00:00 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 UUID FULLNAME EMAIL
I want an output that shows me the first session, on each date in the DF for each speaker code (which is a uuid)
We're only seeing one session repeated, because in this case that first session has more than 5 speakers
Does that make sense?
no, the df goes on for 100+ more rows
Yeah I mean for what is shown here.
You can try to first add a column for the date and a column for the time to split that up.
Unless they are already separate columns.
They aren't
and I was thinking that would be a good idea just now
can I do that right at the csv_read part of my code?
evt = pd.read_csv(
    sessions,
    sep="\t",
    encoding="utf-8-sig",
    usecols=session_columns,
    converters={"Speaker Code": lambda x: extract_speaker_codes(x, spk)},
)[
    session_columns
]  # [session_columns] at the end here preserves the desired column order
evt = evt.explode("Speaker Code")
parse_dates
Session Start Date/Time End Date/Time Session Name Session ID Speaker Code Full Name Email Address
0 2022-06-14 13:00:00 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 UUID 1 NAME 1 EMAIL 1
1 2022-06-14 13:00:00 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 UUID 1 NAME 1 EMAIL 1
2 2022-06-14 13:00:00 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 UUID 1 NAME 1 EMAIL 1
3 2022-06-14 13:00:00 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 UUID 1 NAME 1 EMAIL 1
4 2022-06-14 13:00:00 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 UUID 1 NAME 1 EMAIL 1
5 2022-06-14 13:00:00 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 UUID 2 NAME 2 EMAIL 2
6 2022-06-15 07:00:00 2022-06-15 09:15:00 SESSION 2 1009a82f-eaa7-40ba-919e-55eeabee64b5 UUID 2 NAME 2 EMAIL 2
7 2022-06-15 11:00:00 2022-06-15 12:15:00 SESSION 3 2222a82f-eaa7-40ba-919e-55eeabee64b5 UUID 2 NAME 2 EMAIL 2
--------------------------------
End Date/Time Session Name Session ID Full Name Email Address
Session Start Date/Time Speaker Code
2022-06-14 13:00:00 UUID 1 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 NAME 1 EMAIL 1
UUID 2 2022-06-14 14:15:00 SESSION 1 4009a82f-eaa7-4068-919e-55ee38ee64b5 NAME 2 EMAIL 2
2022-06-15 07:00:00 UUID 2 2022-06-15 09:15:00 SESSION 2 1009a82f-eaa7-40ba-919e-55eeabee64b5 NAME 2 EMAIL 2
2022-06-15 11:00:00 UUID 2 2022-06-15 12:15:00 SESSION 3 2222a82f-eaa7-40ba-919e-55eeabee64b5 NAME 2 EMAIL 2
If this is what you want then you can do it with groupby.
grouping = df.groupby(["Session Start Date/Time", "Speaker Code"])
print(grouping.first())
Except you can split the date up to further get what you want (day / time).
(Right now showing every session of each day)
(You only want by day, not time)
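Putting the suggestions together (parse the dates on read, add a day column, group by day and speaker), one way to get the first session per speaker per day might look like the sketch below. The inline TSV is a trimmed, made-up version of the sample above, and the "day" column name is an assumption:

```python
import io

import pandas as pd

# Trimmed, made-up version of the sample data, deliberately out of order.
tsv = (
    "Session Start Date/Time\tSession Name\tSpeaker Code\n"
    "2022-06-14 13:00:00\tSESSION 1\tUUID 1\n"
    "2022-06-14 13:00:00\tSESSION 1\tUUID 2\n"
    "2022-06-15 11:00:00\tSESSION 3\tUUID 2\n"
    "2022-06-15 07:00:00\tSESSION 2\tUUID 2\n"
)

# parse_dates converts the column to datetimes while reading.
evt = pd.read_csv(io.StringIO(tsv), sep="\t", parse_dates=["Session Start Date/Time"])

# Split off the calendar day so grouping ignores the time of day.
evt["day"] = evt["Session Start Date/Time"].dt.date

# Sort by start time, then keep the first row of each (day, speaker) group.
first_sessions = (
    evt.sort_values("Session Start Date/Time")
    .groupby(["day", "Speaker Code"], as_index=False)
    .first()
)
print(first_sessions)
```

Sorting before grouping is what makes `first()` mean "earliest" rather than "first row in file order".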
Hi there, I'm studying about keras modelling from two different articles and trying my best to understand how linear probing works
So far I found an article that has this series of code on linear probing in a defined function:
# Single dense layer for linear probing
model.linear_probe = K.Sequential(
[layers.Input(shape=(width,)), layers.Dense(10)], name="linear_probe"
)
model.encoder.summary()
model.projection_head.summary()
model.linear_probe.summary()
I'm wondering how can I better translate this define function code into this:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
input_layer = Dense(32, input_shape=(8,))
model.add(input_layer)
hidden_layer = Dense(64, activation='relu')
model.add(hidden_layer)
output_layer = Dense(8)
model.add(output_layer)
I think my first step can be:
model.linear_probe = K.Sequential()
input_layer = Dense(10, input_shape=(width,))
model.add(input_layer)
I think? I'll also try my best to figure out the width part as well too
I have a question concerning K-fold cross validation for image classification. I am using the function "image_dataset_from_directory" with the validation split set to 0.3. I then want to create three instances where the validation data would consist of the first part of the data, then the middle part of the data, and then the final part of the data. I was thinking of setting shuffle to "True" and changing the seed each time (e.g. seed=0, seed=100, seed=1000), but I don't think that's correct.
So anyone know a better way to do cross validation on image classification?
For cross validation you take the entire training data and split it into evenly sized folds
A regular value is 5 folds, so each time 80% is training, 20% is validation
and that way you have used every bit of your training data for training (4 times) and for validation (1 time)
@unique flame
I am not really sure what that validation split of 0.3 means, maybe that is for splitting the entire dataset into training and testing?
Yes it is splitting it in training and testing. Well i am splitting into training and validation. For testing I add unlabeled images.
Okay so you got 30% testing (we keep that untouched until after we are done with the entire training process) and 70% training right
I could set the validation split to 0.2 and do a 5 fold, but like you said every part of the data should have been part of the validation set. And right now I don't know how to do that
And your question is how to split training into train and validation multiple times?
Such that it uses all data for training and testing at least once
yes
So k-fold cross validation doesn't take a "validation split", it takes an amount of folds
like 5
Here blue is validation for each split, and red/pinkish is training
And the test set is kept completely separate
yes, so you mean i should be using another function to load the data?
No, you used the function "image_dataset_from_directory" and put validation split on 0.3.
So that means you already split it into training (green) and testing (purple)
Now you need to split training into multiple folds for each split of k-fold cross validation
Some pseudo-code for this would be like:
entire_training_data = ...
for split in range(5):
    split_train = []
    split_valid = []
    for idx, sample in enumerate(entire_training_data):
        if idx % 5 == split:
            split_valid.append(sample)
        else:
            split_train.append(sample)
    # Code for training and validation
This is assuming there is no pattern in the order of the data
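If the data might be ordered (for example, all images of one class stored together), shuffling the indices once before cutting them into folds avoids that assumption. A minimal sketch with numpy; the function name `k_fold_indices` is made up, and it only produces index arrays, leaving the actual data loading to whatever pipeline is in use:

```python
import numpy as np


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, valid_idx) pairs so each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)       # shuffle once, up front
    folds = np.array_split(order, n_folds)   # near-equal sized folds
    for i in range(n_folds):
        valid_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx


splits = list(k_fold_indices(n_samples=23, n_folds=5))
for train_idx, valid_idx in splits:
    print(len(train_idx), len(valid_idx))
```

Using one fixed permutation for all folds is the difference from re-shuffling with a new seed each time, which would not guarantee that every sample lands in a validation set exactly once.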
Thanks, I'll try this. Brain was thinking loud for a few sec
Hello everyone
I have an assessment in which I have to implement k-means clustering in Python, which will read and cluster data
but only using numpy and csv.
I don't know much about this subject, but it is my core subject so I have to study it
can anyone provide me any source or help, so I am able to do this
I know about k-means clustering but don't know how to do the coding part. And if I watch videos that use some other library, will that help me?
As I can't find any video which only uses one of those two libraries
You can do it without either of those two libraries
If you understand how k-means clustering works, you can load in the data, and then use some simple for loops to perform iterations of k-means clustering
And later simplify it with numpy
@magic mason
You could also just check if numpy has a certain function that you would think is useful, or maybe just check out a general intro to numpy
Stuff like np.mean could be useful f.e.
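As a rough illustration of where numpy helps, here is a small k-means sketch. It assumes the data has already been loaded into a 2D float array (reading it with the csv module is left out), and the function name, the random initialisation, and the stopping rule are choices made for this example rather than requirements of the assignment:

```python
import numpy as np


def k_means(points, k, n_iter=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # New centroid = mean of its assigned points (kept as-is if a cluster is empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels


# Two made-up, well-separated blobs.
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Writing the same loop first with plain Python lists, as suggested above, and then replacing the inner loops with the array operations is a reasonable way to learn both at once.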
Thanks i will have a look
Hi all!
I've made a package to read and write sklearn objects blueprints to YAML.
The goal is to make experiment tracking more convenient.
I think this is explode
looks like explode needs elements to be lists though, not dicts
ah, and also not quite the right output, hmm
That'd probably work
hi
i am trying to add up all the pixel colours and then divide by number of pixels in this list. however when i do print(colour_average) i am getting [6319.198711063373, 6403.701396348013, 5679.463480128894]. these numbers are much bigger than 255
color_sum = [0,0,0]
for coord in coord_list:
    row = coord[0]
    col = coord[1]
    for new_row in range(0, row):
        pixel = im[new_row][col]
        color_sum[0] += pixel[0]
        color_sum[1] += pixel[1]
        color_sum[2] += pixel[2]
color_average = [color_sum[0]/len(coord_list), color_sum[1]/len(coord_list), color_sum[2]/len(coord_list)]
print(color_average)
the number of elements you're taking the mean over isn't len(coord_list)
it's the sum of coord[0] for coord in coord_list
the easiest way to fix that would be to just do count += 1 every time you take a pixel into account, and divide by that at the end.
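The counting fix can be sketched as follows. The tiny image and the coordinates are made up so the numbers are easy to check by hand, and the inner loop deliberately visits the same rows as the original code (`range(0, row)`):

```python
import numpy as np

# Made-up 4x2 image with 3 channels, plus two marked pixels as (row, col).
im = np.arange(4 * 2 * 3, dtype=np.uint8).reshape(4, 2, 3)
coord_list = [(2, 0), (3, 1)]

color_sum = np.zeros(3, dtype=np.float64)  # float accumulator avoids uint8 overflow
count = 0
for row, col in coord_list:
    for new_row in range(0, row):  # same rows the original loop visits
        color_sum += im[new_row, col]
        count += 1                 # one more pixel included in the mean

# Divide by the number of pixels actually summed, not by len(coord_list).
color_average = color_sum / count if count else color_sum
print(count, color_average)
```

Because the divisor now matches the number of pixels added, each channel average stays within the 0 to 255 range of the input.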
if you can describe the desired effect a little more clearly, we can come up with a 2-liner using numpy, too
what exactly do you want to average?
though python also has a mean() in the standard library (statistics.mean)
ill show the whole code so it makes a bit more sense
Can you close your help-channel if you are getting help here @karmic valley Someone is trying to help you there too
import cv2
from PIL import Image
im = cv2.imread(r"C:\Users\guest\Documents\Education\University Imperial\Module 3\TrackingAI outfolder\test\plot\234496_1024.png", cv2.IMREAD_UNCHANGED)
coord_list = []
for row in range(len(im)):
    for col in range(len(im[row])):
        if im[row][col][2] >= 200 and im[row][col][0] < 100 and im[row][col][1] < 100:
            im[row][col][1] = 255
            im[row][col][0] = 0
            im[row][col][2] = 0
            coord_list.append([row, col])
color_sum = [0,0,0]
for coord in coord_list:
    row = coord[0]
    col = coord[1]
    for new_row in range(0, row):
        pixel = im[new_row][col]
        color_sum[0] += pixel[0]
        color_sum[1] += pixel[1]
        color_sum[2] += pixel[2]
color_average = [color_sum[0]/len(coord_list), color_sum[1]/len(coord_list), color_sum[2]/len(coord_list)]
print(color_average)
cv2.imwrite("output_graph.png", im)
pil_im = Image.open("output_graph.png", 'r')
pil_im.show()
oh okay
you want the average of each of r, g, and b of an image?
so the whole code is this. it basically looks at an image and wherever there is a red line it notes its coordinate. then it converts red line to green line.
next part of code then is meant to look at those coordinates and work out the average pixel rgb colour below the line but its not working
can you clarify "below the line"
so was trying to add up all the rgb pixel values below line and then divide by above line.
yes i will show example 2secs
so the red line is what im referring to
so, you find where there is a red pixel, and then you want the average r, g, and b for the rest of the column below that pixel?
yes exactly.
so will be working out average rgb for everything under red line
I think you straight up can't get a speedup via apply/np.vectorize unless you're using numpy ufuncs
but i think my code is a bit wrong because its giving a massive number
not between 0 and 255
it's certainly wrong if it's giving you a large number
^
so the problem lies somewhere in here
color_sum = [0,0,0]
for coord in coord_list:
    row = coord[0]
    col = coord[1]
    for new_row in range(0, row):
        pixel = im[new_row][col]
        color_sum[0] += pixel[0]
        color_sum[1] += pixel[1]
        color_sum[2] += pixel[2]
color_average = [color_sum[0]/len(coord_list), color_sum[1]/len(coord_list), color_sum[2]/len(coord_list)]
print(color_average)
what i would do is call np.array(the_image) first to get a 3d array. then something like np.mean(my_array[row_with_redpix+1:,current_col,:], axis=0)
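Roughly what that numpy suggestion could look like, as a sketch only. The small synthetic array stands in for what cv2.imread returns (rows x columns x BGR channels), the red test reuses the thresholds from the posted code, and "below" is taken to mean larger row indices, which is an assumption about the intended direction.

```python
import numpy as np

# Synthetic 4x3 BGR image, an assumption standing in for the real picture.
im = np.full((4, 3, 3), 10, dtype=np.uint8)
im[1, 0] = (0, 0, 255)   # red pixel at row 1, column 0
im[2, 2] = (0, 0, 255)   # red pixel at row 2, column 2
im[3, 0] = (40, 10, 10)  # one pixel below the first red pixel with a different value

# Same thresholds as the posted loop: high red, low blue and green.
red = (im[:, :, 2] >= 200) & (im[:, :, 0] < 100) & (im[:, :, 1] < 100)
rows, cols = np.nonzero(red)

# For each red pixel, take the rest of its column below it (row + 1 onwards).
below = [im[r + 1:, c, :] for r, c in zip(rows, cols)]
stacked = np.concatenate(below)  # shape (n_pixels_below, 3); needs at least one red pixel

color_average = stacked.mean(axis=0)  # mean per channel, computed in floating point
print(color_average)
```

Because the mean is taken over exactly the pixels that were collected, each channel stays between 0 and 255.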
sorry im new to coding. took me 3months to write these 20 lines of code lol.
so where exactly do i write this np.array
hmm in that case it's probably better if you don't use numpy arrays, but debug your code instead
i think maybe the maths/logic behind this part of code not right but cant figure it out
for new_row in range(0, row):
    pixel = im[new_row][col]
    color_sum[0] += pixel[0]
    color_sum[1] += pixel[1]
    color_sum[2] += pixel[2]
Actually, looks like it's not quite zero-speedup? The code for non-ufuncs is complicated:
https://github.com/pandas-dev/pandas/blob/e8093ba372f9adfe79439d90fe74b0b5b6dea9d6/pandas/core/apply.py#L1128-L1147=
but looks like it ends up using map_infer, which is a Cython function:
https://github.com/pandas-dev/pandas/blob/7e23a37e1c5bda81234801a6584563e2880769eb/pandas/_libs/lib.pyx#L2869=
So it should be a bit faster than a Python loop at least, even when using normal Python functions. Probably. Needs measurements.
(That's about apply, note. np.vectorize, I remember from reading the source code, does literally just use a normal Python loop when applied to a Python function. Unless they changed that.)
You can use dask
Did you concat on the wrong axis?
What's the index? Because concat joins on that.

I was making a joke. But dask can process a bunch of independent csv files as one DataFrame, provided that they have the same schema
And it can distribute operations across an arbitrary number of cores.
reminds me of that meme, "i paid for the full computer, i'm gonna use all of it"
Forgive the slightly ironic comment: do you have a machine, @hallow turret ?
If you do, start learning
dude
It's just there are so many resources online that any google search gives you so much information on how to go from zero to hero it's not anyone's but your task to personalise learning to your needs
I mean no disrespect, it's very difficult to advise anything on this - it's like a medical student asking what kind of doctor they want to be - nobody can make the decision for them
Try coding an ml project using public data set
UCI is good
And of course , use python
Unless u only know other Lang
R is passable but not as flexible
C++ is a joke for 99% of ml needs
bruh im just starting with ai and python can you recommend me how should i start learning ai...
https://www.python.org/about/gettingstarted/ - that's a good starting point for python. Way back when, I was learning through datacamp, but I haven't checked them out in years. Then pick a project you are interested in and do it - be ready to stomach lots of frustration @hallow turret
Hello people. I recently came across Approximate Nearest Neighbour and was wondering, if I have a master dataset that consists of datasets A,B,C; is it theoretically possible to ensure my output is only from dataset C?
A bit advanced yet simple question:
While returning the result of a layer to a variable tensor, how can I make that tensor require grad?
self.X = self.conv(x)
I want X to record grads
.requires_grad_(), I believe.
https://pytorch.org/docs/stable/generated/torch.Tensor.requires_grad_.html#torch.Tensor.requires_grad_
Thank you, however, to my knowledge, that X must be pre defined
Not sure what you mean. Whenever you assign any tensor to self.X, mark that tensor as requiring grad.
If you assign to self.X in many places and that's annoying, you can use a property to automate it.
Oh you are right since grad will be created in backwards pass!
The problem was that I don't want to predefine "X" and to be able to record grads
Somehow was thinking I need to do requires grad at the assignment
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
After the assignment, using the .requires_grad_ doesn't track the grads
('rf', RandomForestClassifier()),
('abc', AdaBoostClassifier()),
('svc', SVC())]
bstc = StackingClassifier(estimators=bestimators, final_estimator=LogisticRegression())
stc_params = {
    'rf__n_estimators': [100,150, 200],
    'rf__criterion': ['entropy'],
    'rf__bootstrap': [True],
    'rf__oob_score': [True],
    'rf__max_depth': [10],
    'rf__random_state': [5],
    'abc__base_estimator': [DecisionTreeClassifier],
    'abc__n_estimators': [100, 150, 200],
    'abc__learning_rate': [1.0],
    'abc__random_state': [5],
    'svc__C': [1.0],
    'svc__kernel': ['rbf'],
    'svc__gamma': ['auto'],
    'svc__random_state':[5],
    'final_estimator__penalty':['l2'],
    'final_estimator__C':[1.0],
    'final_estimator__fit_intercept': [True],
    'final_estimator__solver': ['liblinear']
}
stc_gs = GridSearchCV(estimator=bstc_ ,param_grid=stc_params, cv=5, n_jobs=4)
stc_gs.fit(X_train, y_train)
bump
Do you maybe mean that you need to also track the grad of applying self.conv? If so, you need to apply .requires_grad_() to x, before you do self.conv(x).
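A small sketch of one way to read the gradient of an intermediate tensor like self.X in PyTorch. The output of a layer is a non-leaf tensor, so its .grad is only kept after backward() if .retain_grad() was called on it. The module, layer sizes, and the sum() loss below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    """Hypothetical module that stores an intermediate activation as self.X."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 2, kernel_size=3)

    def forward(self, x):
        self.X = self.conv(x)
        # X is a non-leaf tensor: ask autograd to keep its gradient after backward().
        self.X.retain_grad()
        return self.X.sum()

net = Net()
loss = net(torch.randn(1, 1, 5, 5))
loss.backward()
print(net.X.grad.shape)  # gradient of the loss with respect to X
```

Calling .requires_grad_() on the input x is a separate thing: it makes autograd track the gradient with respect to the input, not with respect to X.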
Anyone knows about this? Would appreciate a reply regarding this!
question: if I'm using a linear lasso model to train against a right skewed data set, should I set my alpha to 0.01, 0.1, or 1?
?
anyone knows how to combine stacking estimators and gridsearch cv together?
keep getting an error. TypeError: Cannot clone object. You should provide an instance of scikit-learn estimator instead of a class.
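A likely cause of that TypeError, as a sketch: the grid passes the class DecisionTreeClassifier instead of an instance DecisionTreeClassifier(), and GridSearchCV cannot clone a class. The example below uses made-up data from make_classification and a much smaller grid so it runs quickly. The AdaBoost parameter is called estimator in newer scikit-learn and base_estimator in older releases, so the name is looked up at runtime rather than assumed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, an assumption standing in for the real X_train / y_train.
X_train, y_train = make_classification(n_samples=120, n_features=8, random_state=5)

bestimators = [
    ("rf", RandomForestClassifier()),
    ("abc", AdaBoostClassifier()),
    ("svc", SVC()),
]
bstc = StackingClassifier(estimators=bestimators, final_estimator=LogisticRegression())

# Pick whichever parameter name this scikit-learn version uses for the AdaBoost base model.
abc_base = "estimator" if "estimator" in AdaBoostClassifier().get_params() else "base_estimator"

# "<name>__<param>" routes <param> to the estimator registered under <name>.
stc_params = {
    "rf__n_estimators": [10, 20],
    "rf__random_state": [5],
    f"abc__{abc_base}": [DecisionTreeClassifier(max_depth=1)],  # an instance, not the class
    "abc__n_estimators": [10],
    "abc__random_state": [5],
    "svc__C": [1.0],
    "final_estimator__C": [1.0],
}

stc_gs = GridSearchCV(estimator=bstc, param_grid=stc_params, cv=3)
stc_gs.fit(X_train, y_train)
print(stc_gs.best_params_)
```

The double underscore is scikit-learn's convention for addressing a nested estimator's parameter, so rf__n_estimators means "n_estimators of the step named rf".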
how to process a document using LayoutLM model
cant understand where to give the input image or how to process
can anyone please guide me
What's with double underscores
^
I've been working on making an AI chat bot for discord, is anyone interested in trying it out?
It utilizes modified version of GPT-3
Has it improved since last time?
Absolutely

Is it sentient?
No it's very neutral 
Arguably, wanna try?
Where can I try it?
GPT-3 seems to have some sort of censor for racism, to some extent at least.
I just need to invite you to my server if that's okay
You should let it loose on the some unsuspecting server, and just see what happens.

Never heard of it
I like the one that makes greentext posts
Oh dude
#gpt4chan #4chan #ai
GPT-4chan was trained on over 3 years of posts from 4chan's "politically incorrect" (/pol/) board.
(and no, this is not GPT-4)
EXTRA VIDEO HERE: https://www.youtube.com/watch?v=dQw4w9WgXcQ
Website (try the model here): https://gpt-4chan.com
Model (no longer available): https://huggingface.co/ykilcher/gpt-4chan
Code: http...
If I wasn't at work - I'd jump on and play with the bot - but sadly I'm not really free atm.
It's cool 
What kind of hardware do you need to train an ai model anyway? I'd be curious to play with some - but I just have a mid-range gaming pc.
Let me rephrase - what kind of hardware is needed to train one in a reasonable amount of time.
AI operations run from GPU memory, so system memory isn't usually a bottleneck and servers typically have 128 to 512 GB of DRAM.
Regarding time though... that can take a long time
I've got an rtx 3070 - can that be used?
Yeah I don't see why not
But I'd recommend just renting a GPT-3 language model rather than trying to train one yourself, if that's what you're trying to do
that GPU is CUDA-enabled, so you can use it for any CUDA computation up to its memory limit, which I believe is 8GB.
Correct - but is 8gb enough for any sort of meaningful ai model training?
it really depends. I suspect that most state-of-the-art models for a given task use significantly more than 8GB, because organizations that can afford the talent to develop those models can also afford top-tier hardware. but that doesn't mean that similar performance couldn't possibly be achieved with smaller models.
interested if anyone has feedback here: https://www.reddit.com/r/Python/comments/vs2b6d/i_analyzed_1835_hospital_price_lists_so_you_didnt/?
i'm not a real programmer, so any and all criticism is welcome
Hey guys! I have written a kaggle notebook on Training Models (a chapter in the Hands-On Machine Learning book). I have added the key points of that lesson and explained the code. Have a look at it and give your feedback. Cheers! LINK : https://www.kaggle.com/code/supreeth888/training-models-hand-s-on-machine-learning/notebook
!pastebin
Pasting large amounts of code
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
Traceback (most recent call last):
  File "/Users/rahuldas/Desktop/ICH-CAHPS Survey Analysis/ICH-CAHPS Survey Analysis.py", line 27, in <module>
    ], axis = 1)
  File "/Users/rahuldas/Library/Python/3.7/lib/python/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/Users/rahuldas/Library/Python/3.7/lib/python/site-packages/pandas/core/frame.py", line 4913, in drop
    errors=errors,
  File "/Users/rahuldas/Library/Python/3.7/lib/python/site-packages/pandas/core/generic.py", line 4150, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "/Users/rahuldas/Library/Python/3.7/lib/python/site-packages/pandas/core/generic.py", line 4185, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "/Users/rahuldas/Library/Python/3.7/lib/python/site-packages/pandas/core/indexes/base.py", line 6017, in drop
    raise KeyError(f"{labels[mask]} not found in axis")
KeyError: "['Lower box percent of patients-providing information to patients'\n 'Lower box percent of patients-rating of the nephrologist'\n 'Lower box percent of patients-rating of the dialysis center staff'\n 'Top box percent of patients-rating of the dialysis center staff'\n 'Middle box percent of patients-rating of the dialysis facility'] not found in axis"
the key error means the column doesn't exist in the dataframe
but i know it exists
Not really sure, but wouldn't it be simpler to just take the columns that you do actually want
instead of dropping 90% of them
yeah that's true
Should I learn MatPlotLib or Plotly ?
another key error
bruh
oh
i just wanted to select those specific features
is there a way to do it?
ohh
If you're modelling multiple linear regression of a continuous variable against a binary variable plus confounders, does it have to be a generalised model?
And is the equation y' = B0 + x1B1
Hi everyone! I need to use this model for one of my applications: https://github.com/hukenovs/hagrid
Should I be looking for a powerful PC to train the model? And why can't they just upload the models as files?
Thanks in advance
They already supply pre-trained models @hushed sail
It downloads a .pth file, and this link shows how to load such a model I believe
And some of these models are very light-weight, so it should even be possible on a laptop I think
I didn't even notice. Thank you very much, you helped me a lot!
I'm not completely sure how to load the model if you don't have the model class
It's in the repository
They have a demo.py file that loads a pre-trained model It seems, so probably look into that
Oh right, yeah it's in there
Yeah, thanks
Yeah, if you only want data of that category, you should filter it beforehand
@mild pecan
You should also keep x and y together when splitting into training and testing, such that Y still matches with X
And then you could just separate them, as they would be in the same order
That was exactly my thought process, so what I suggested sounds right, even though it seems to be against the order of the task?
Not really sure how you "set a column as target variable"
normally you do something like this
y_col = 'annual_premium'
y = insurance_df[y_col]
X = insurance_df[insurance_df.columns.drop(y_col)]
Which is just making two new dataframes
one for y, and one for X
This relates to regression models. DEFAULT_FLAG becomes the target variable which will be predicted with the help of the other 9 columns/variables
I understand the meaning of X and y, I just don't see how to "mark it" in a pandas dataset
It seems to me that you would just create two new dataframes
That's how I've been doing it at least
Yep, thats what I am doing too
This seems to just re-iterate what I already thought though right?
There's not really a method to "mark a column as target variable"
It's just splitting it into two dataframes
Not really sure what you are trying to show
Yes, that is what I looked at
iris_X, iris_y = datasets.load_iris(return_X_y=True) This is how they define X and y
as two separate variables, not in 1 dataframe
So that confirms what I said yes
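To make the "two separate objects" point concrete, a short sketch that splits off the target column and then uses scikit-learn's train_test_split, which shuffles X and y together so rows stay matched. The dataframe contents are invented; only the column name annual_premium comes from the snippet above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data; only the target column name is taken from the discussion.
insurance_df = pd.DataFrame({
    "age": [23, 35, 47, 52, 61, 29, 40, 33],
    "bmi": [21.0, 27.5, 30.1, 25.3, 28.8, 22.4, 31.0, 26.2],
    "annual_premium": [300, 420, 610, 580, 790, 330, 560, 410],
})

y_col = "annual_premium"
y = insurance_df[y_col]                 # target as a Series
X = insurance_df.drop(columns=[y_col])  # everything else as the feature frame

# Splitting X and y in one call keeps each X row paired with its y value.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
print(X_train.index.equals(y_train.index))
```

Any filtering by category would be done on insurance_df before this split, so X and y are both built from the filtered rows.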
I was recently reminded of https://botnik.org/content/harry-potter.html and was wondering how you would approach something like this today. Transformers are currently all the rage, but they seem poor at generating large amounts of text. I also doubt fine-tuning would work well in a fantasy setting (Most of it's learning has been done with text from our real world). LSTMs seem to remain a decent option. A text gan seems perfect for something like this, but I've heard mixed reviews.
With GPT-4 on the horizon, an upgrade to any GPT-3 chatbot should be easy if the api stays the same.
GPT4 nxt yr?
do u think that a couple of companies are cornering the language model market?
i wonder what the future holds for nlp beyond gpt4, i doubt it can get much more advanced
im weighing my options of specialising/training in NLP or CV, can only rly choose one to focus on
If I have a dataset of online orders, and I'm predicting profit. Logically speaking I can't use the column sales right? Since that would be basically feature leakage?
What's in the sales column?
are you doing time series forecasting?
Rumor that it comes out next month
No, just predicting Profit. Given variables from here: https://www.kaggle.com/datasets/vivek468/superstore-dataset-final
And yeah I don't doubt that a couple companies would corner that language model.
GPT3 is already super expensive I can't imagine how expensive GPT4 would be
Since profit = Sales - Cost, there is correlation between the two. To me, doesn't make sense to use sales.
It's not really possible to do time series analysis since the time periods are not uniform.
i think the solution here is to turn the profit v non profit column to ones and zeroes
the problem is that would split it into two columns
so how do i keep it as one column with ones and zeroes?
i could commit a sin and iterate through the entire column, change "Profit" to 1 and "Non-profit" to 0
i honestly don't know
i could use .replace
holy fucking shit
i did it
am smort
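For anyone reading along, a sketch of that single-column conversion in pandas. The column names and sample values are assumptions; .map with a dict does the same job as .replace here and leaves NaN for any label that is not listed, which makes unexpected values easy to spot.

```python
import pandas as pd

# Hypothetical column holding the two labels mentioned above.
df = pd.DataFrame({"ownership": ["Profit", "Non-profit", "Profit"]})

# One column of ones and zeroes, no extra columns created.
df["is_profit"] = df["ownership"].map({"Profit": 1, "Non-profit": 0})
print(df)
```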
there is a surprisingly strong classification model to predict if a dialysis facility is profit or nonprofit with ratings with these features "Star rating of nephrologists' communication and caring", "Star rating of quality of dialysis center care and operations",
"Star rating of providing information to patients", "Star rating of the nephrologist", "Star rating of the dialysis center staff",
"Star rating of the dialysis facility"
shit
my model overfitted
i have a question, so i'm trying to make a machine learning model and i have an input that is one of two values('L' or 'R'), do i have to one hot encode them or can i just convert to 0 and 1?
yes
well actually no
you would encode L as [1, 0] and R as [0, 1]. or vice versa.
okay
wait, can you convert a value in a column to an array? or do you mean split them into two columns
the point is that for as many unique values there can be, there are vectors with that many elements that are all 0. and then you assign each value an index in those vectors. and for whichever value a given vector is intended to represent, you make the element at that value's index 1.
how you accomplish that is up to you.
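A quick sketch of both options for a two-valued input, assuming a pandas column named side (a made-up name). With only two categories, the one-hot columns carry the same information as a single 0/1 column, since one column is just the inverse of the other.

```python
import pandas as pd

df = pd.DataFrame({"side": ["L", "R", "R", "L"]})  # hypothetical input column

# One-hot: one column per category, L -> [1, 0] and R -> [0, 1].
one_hot = pd.get_dummies(df["side"], dtype=int)

# Single binary column: 0 for L, 1 for R.
binary = (df["side"] == "R").astype(int)

print(one_hot)
print(binary.tolist())
```

One-hot encoding matters more when there are three or more unordered categories, where a single integer column would imply an order that is not there.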
For data science questions, is it best to ask for help here or in a help channel? I'm looking for assistance with my neuralprophet model
You can ask here
I asked in #help-cake
With neuralprophet I am trying to forecast my shipping container volume. The issue is that since my date information is rubbish, I have many days that show 0. To address this I thought to group the data by week, which should be a sufficient fix as I want a weekly forecast anyway. The issue is the forecast goes below 0, which is not possible. What is the proper way, if any, to address this?
one hot encoding and just changing the column values to 0 or 1 got the same result
My MAE is also very high
I'm looking for help with openAI, is this the right spot? or is there a better server? I just want to know how to get the summarization on openai to return a summary that is a complete sentence.
dont ask to ask i have learned
I'm putting this in AI because it's speech related. I am looking for something to decompose speech into International Phonetic Alphabet (IPA). I ran across a great project named Allosaurus that did exactly what I wanted but it has a few limitations - in particular it gives back durations that are all a fixed time. This causes problems. The use-case is to map spoken words into visemes (think like animations or vtubers). Amazon Polly returned good data but it was only on generated speech. Papagayo is an open source project that sort of accomplishes the same thing but it's manual.
Anyone know of anything I should try?
In Pandas, how do I select a row based on a condition, and then cast that entire row to a list? That condition being, say, the min value in a datetime column?
or in for loop, append that row to anew df
e.g. df[df["datetime"]==df["datetime"].min()]. This will be a slice of the original dataframe. Note that it might have more than one row, if the min value repeats more than once.
That is confusing to me - why are we doing df[df['column']]?
instead of say df['column']
Oh wait - its a conditional statement inside the brackets?
In plain english what is that line of code saying exactly?
Comparisons on a Series result in a Series of booleans.
So df["datetime"]==df["datetime"].min() is a Series of booleans - for each element, whether it's equal to df["datetime"].min().
That Series can then be used as an index to select only these rows.
Ahh I see
hence df[that whole comparison operation]
you're saying in this df, select any row where df['datetime'] == df['datetime'].min() is true
!d pandas.DataFrame.idxmin
(bot not like my summon, see next one)
https://stackoverflow.com/a/10202789
SO post is for max, but the min analog seems to be something like df.loc[df['COL_GO_HERE'].idxmin()].tolist()
!d pandas.DataFrame.idxmin
DataFrame.idxmin(axis=0, skipna=True)
Return index of first occurrence of minimum over requested axis.
NA/null values are excluded.
Thanks, looks like two methods to do what I want - I'll experiment with those!
and thanks @tidal bough for the nice explanation of what your suggestion is doing.
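Both approaches side by side, as a small sketch on a made-up dataframe (the column names datetime and value are assumptions). The boolean mask can return several rows if the minimum repeats, while idxmin returns the label of the first occurrence only.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": pd.to_datetime(["2022-07-03", "2022-07-01", "2022-07-02"]),
    "value": [3, 1, 2],
})

# Method 1: boolean mask. The comparison gives a Series of True/False, one per row.
mask = df["datetime"] == df["datetime"].min()
earliest_rows = df[mask]  # still a DataFrame, possibly with more than one row

# Method 2: idxmin gives the index label of the first minimum; .loc fetches that row.
row_as_list = df.loc[df["datetime"].idxmin()].tolist()

print(earliest_rows)
print(row_as_list)
```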
How to extract customers from a sales dataset who have purchased from the website more than once? Basically repeat customers
Someone please help with the logic
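One possible logic for this, sketched in pandas: count distinct orders per customer and keep the customers with more than one. The column names customer_id and order_id are assumptions about the sales dataset.

```python
import pandas as pd

# Hypothetical sales data: one row per order.
orders = pd.DataFrame({
    "customer_id": ["a", "b", "a", "c", "b", "a"],
    "order_id": [1, 2, 3, 4, 5, 6],
})

# Number of distinct orders placed by each customer.
order_counts = orders.groupby("customer_id")["order_id"].nunique()

# Repeat customers are those with more than one order.
repeat_customers = order_counts[order_counts > 1].index.tolist()

# All rows belonging to repeat customers, if the full records are needed.
repeat_orders = orders[orders["customer_id"].isin(repeat_customers)]
print(repeat_customers)
```

If the data has one row per purchased item rather than per order, counting distinct order ids (as above) avoids treating a single multi-item order as a repeat purchase.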
I work as a computational linguist, and I've never heard of a model that can do this. You might try asking in a server that's even more specialized.
Also, the level of detail in the transcription matters
hi everyone, i have a question. i'm still new to machine learning and my first project is to make predictions using regression. after reading some papers about multicollinearity, i think my model may have several issues, and i found a method on the internet to check for it called VIF. does multicollinearity really affect the model's accuracy, or is it going to cause problems for the model in the future? btw i used the OLS method
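A sketch of how VIF can be computed with plain numpy, using the identity that the VIF of each feature is the matching diagonal entry of the inverse correlation matrix, i.e. 1 / (1 - R²) from regressing that feature on the others. The three synthetic features are made up, with x2 built to be nearly collinear with x1. As a general rule of thumb, multicollinearity mostly inflates the variance of the OLS coefficients and makes them hard to interpret; predictions on similar data are usually affected much less.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # deliberately almost a copy of x1
x3 = rng.normal(size=200)             # unrelated feature
X = np.column_stack([x1, x2, x3])

# Correlation matrix of the features (columns are variables).
corr = np.corrcoef(X, rowvar=False)

# VIF_j is the j-th diagonal element of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(corr))
print(vif)
```

Common rules of thumb flag VIF values above roughly 5 or 10; here the first two features should be far above that and the third close to 1.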
I'm trying to use a data generator in pytorch. Is there way I can work around splitting my folders into train and validation while using dataloaders? I separated my image files into trian and validation by paths (X_test = [path/image1.jpg, path/image2.jpg], Y_test = [class1, class2], X_train = [path/image3.jpg, path/image4.jpg]...) But torch datasets require a root path like e_dataset = datasets.ImageFolder(root='e_data/train', transform=data_transform). Is there a way I can work around separating my image folder into train/val/test?
Why do you want to work around that? seems like a organized way to store your data
Wouldn't it be simpler to just organize your data in that way
Weird pandas column naming thing happening..? why aren't I able to name a column like this?
uwo["PR-Q10-1"]=df.loc["PR-Q10-1"].apply(foos.PR_Q10_1)
so that will completely bug, and it won't even add the column, but if I name the new column like this
uwo["PR-Q10-"]=df.loc["PR-Q10-1"].apply(foos.PR_Q10_1)
It will add the new column and works as normal...? I have other columns named with endings with "-1" as well...? what's happening here?
It's really not being added, not just hidden: the rank doesn't change, so I know it's not just hiding the column, and if I name it without "-1", the rank does increase
Thanks. If you're interested in seeing an implementation, this is Allosaurus:
https://github.com/xinjli/allosaurus
It's not a personal project, though. That part isn't much up to me
@urban prism You can make a custom DataSet class in pytorch, this way you can make a DataSet for your train and test
Or a single DataSet class with a flag for train or test or whatever
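A sketch of such a custom Dataset that takes the path and label lists directly, so the image folder does not have to be reorganised. The class name PathListDataset is made up; X_train, Y_train and data_transform in the comments refer to the names used in the question.

```python
from PIL import Image
from torch.utils.data import Dataset

class PathListDataset(Dataset):
    """Hypothetical dataset built from parallel lists of image paths and labels."""

    def __init__(self, paths, labels, transform=None):
        if len(paths) != len(labels):
            raise ValueError("paths and labels must have the same length")
        self.paths = list(paths)
        self.labels = list(labels)
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        # Load lazily so only the requested image is read from disk.
        image = Image.open(self.paths[idx]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, self.labels[idx]

# Usage sketch (names from the question, not defined here):
# train_ds = PathListDataset(X_train, Y_train, transform=data_transform)
# val_ds = PathListDataset(X_test, Y_test, transform=data_transform)
# train_loader = torch.utils.data.DataLoader(train_ds, batch_size=32, shuffle=True)
```

String class names would typically be mapped to integer ids before training, since most loss functions expect numeric labels.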
I have a column in my dataset which contains phone numbers. The majority of them are 10 digits, but some have a country code in front like +91 and some have an extra 0 in front of them. How do I remove these extra +91 and zeros?
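One simple approach, sketched with pandas string methods: strip everything that is not a digit, then keep the last 10 digits. This assumes every valid number ends with its 10-digit subscriber number, so any prefix (+91, a leading 0) falls away. The sample values are invented.

```python
import pandas as pd

# Hypothetical examples of the formats described above.
phones = pd.Series(["+91 9876543210", "09876543210", "9876543210", "+919876543210"])

# Remove every non-digit character (plus signs, spaces, dashes).
digits = phones.str.replace(r"\D", "", regex=True)

# Keep only the last 10 digits, dropping country code or leading zero.
cleaned = digits.str[-10:]
print(cleaned.tolist())
```

Rows with fewer than 10 digits would pass through unchanged, so checking cleaned.str.len() afterwards is a reasonable sanity check.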
Anyone else here would fail a math exam?
Am I the only fake data scientist who couldnt pass second year hs maths?
Realised I don't have the time to learn it, shud i swap to SWE? a lot of people basically infer that I couldn't become a DS
I heard from a seminar on learning to code, "You don't hate math, you hate math class"
so you couldn't solve systems of equations? don't know about functions?
If you want to be DS you definitely need to learn through high school math
Never learnt beyond linalg and calc1 intros
There's literally 0 chance I have time for that
I can use sklearn, tensorflow and produce projects in inferential statistics but I couldn't pass pen and paper calculations
That's huge BS
shrug
It took me about 40 hours to finish basic linear algebra
Exam papers cover far far more than these topics
Look I'm sure you could get a data job without knowing these things, but it would be limiting, and if you don't want to suck this is prerequisite knowledge
How is it even relevant ?
It is necessary to go beyond superficial understanding of what you are doing
yeah
Trigonometry?
obviously not necessary for everything
I personally would not hire, that is one data point
Even a graduate?
Your job is production code + domain understanding + good models
Lol, being able to pass a math exam has no impact on those three
I would expect a DS I hire to be able to read, understand, implement and improve on ML research papers
So long as you understand backprop, matrices, vectors and integration
there are different levels
like I said, I'm sure you could get a job
but don't you want to be good
Why would you need to have the ability to have methods to work out exams?
Data scientist to work on research papers? That isn't a data scientist that's a ml research scientist
to implement methods from papers
understand them completely
use them to solve our problems
Sorry, you work for google?
comparable
I'd be able to learn DSA to get into ur company as an SWE in half of the time to pass ur math exam
sure then do that
U think it's a good idea?
based on this attitude, if you dont want to put in time to learn prerequisites, you will never put in the time to be excellent
Putting in the time, that's what, an entire year of studying with all my free time
more
what you're saying is you can copy paste stuff, but don't understand how or why it works, or how or when to use it
For neural networks? Sort of
When I decided to switch to ML I basically spent 6+ hrs/day for 2 years
and I already had math BS and published in physics
well, you might imagine that limits your options, right?
That isn't healthy when ur already working full time
Good job there's no math entrance exam at most companies
I cud learn while working ...
Instead of before
They will ask you questions about methods that you wouldn't be able to answer
I've had one interview at a big four lately and they just asked stats
You would think so..
Which was easy enough
exactly, and job postings usually ask for masters or "equivalent experience" as a minimum
They didn't ask any pen and paper calculations and equation solving
I have masters almost finished
We don't ask pen and paper but would want you to walk us through your understanding of relevant algorithms
I could easily do that, is it enough?
I don't think you could do that
Like I said earlier, my problem was sitting a final year exam and failing it lately
No, I literally could
Especially with study beforehand
So what company, if I may ask?
This is knowledge you can obtain over time without practising problems
I am a ml research engineer at a SV computer vision company
The only non pure stats question asked was how knn works
Could you tell me what sort of q u ask?
For junior data scientist
in our last interview we asked about linear regression (probabilistic view), SVD (and application to our domain), then basic deep learning questions, deep computer vision architectures, derive variational autoencoder, self-supervised learning
I wasn't saying earlier that I don't understand this stuff and how it works - i do. But I don't have literal solving methods required to pass exams
Where you literally write out your solution line by line
I highly doubt that any company asks those to junior DSs
can you describe the conditions under which the least squares approach is optimal for linear regression?
we asked those of a recent masters grad
Applied math
Nope
Got lin alg and number theory next sem so hopefully this degree turns out well
Position title?
Data Scientist 1
i would say this is like the first thing you learn, which ofc requires stats, linalg, and calculus/optimization
What's the answer?
the deep learning questions probably would not be asked unless the company works on related problems but others are fair game
when the noise follows a distribution described by its mean and covariance, and the mean is 0 and the covariance is a scaled identity matrix
pops up rather naturally when looking at the log-likelihood
Hi, I have a question on the deepmind lectures by David Silver. its about the forward view and backward view TD(lambda). Just to confirm, if we ignore the idea of eligibility traces, then these two are the same algorithm right? its just that the former is waiting for the future to update "now" but backward is like a recursive program where "now" is the furthest function call. right?
From one side, TD(lambda)/forward view looks like basically a fusion of Monte Carlo and TD
and backward view is like TD(lambda) but reversed
but at the same time, my mind says its different cause backward view uses eligibility traces
our candidate answered gaussian distributed noise then showed how the likelihood function gives L2 loss, then we followed up about how to justify regularization and they added a prior
this was good enough for us, we followed up some of the details edd mentioned and they showed understanding
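A compact sketch of the argument described above, written out in LaTeX. The symbols (weights w, noise variance sigma squared, prior variance tau squared) are notation chosen here for illustration: Gaussian noise turns the negative log-likelihood into a sum of squared errors, and a zero-mean Gaussian prior on the weights adds an L2 penalty to the MAP objective.

```latex
\begin{align}
y_i &= w^\top x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{i.i.d.} \\
-\log p(y \mid X, w) &= \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2 + \frac{n}{2} \log\left(2\pi\sigma^2\right) \\
w \sim \mathcal{N}(0, \tau^2 I) \ \Rightarrow\ \hat{w}_{\mathrm{MAP}} &= \arg\min_{w} \sum_{i=1}^{n} \left(y_i - w^\top x_i\right)^2 + \frac{\sigma^2}{\tau^2} \lVert w \rVert_2^2
\end{align}
```

Minimising the second line over w is exactly least squares, and the third line is ridge regression with penalty strength sigma squared over tau squared.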
Well, I'm content not joining your faang research team for a few years anyway.. gives me time to learn
Most companies take graduates without such hard questions
This sort of knowledge is learnable and memorisable without being able to solve equations in exams
you should know this if you've read any intro ML book
which is basically the minimum bar
I haven't read any ML books, during my masters it's been mostly coding and stats
I will def get around to an ml book tho
...
you should probably pick one up, but you'll definitely wanna brush up some earlier maths first
Which is the best one?
bishop
i'd recommend gilbert strang's linalg
I mean something not extremely hard to get into off the bat
Like you said, that info is in intro to ml
Are you referring to Pattern Recognition and Machine Learning?
hastie and tibshirani is another popular option
Is there anywhere I can preview it
I don't want to buy a book and open the third page and be hit with equations I can't understand
you can get pdfs for free of both through google
Link?
i think its against server tos just google name and "pdf"
it's the first result when you google it
someone help me
on this question I have
Wow this book is insane
Hmm quite good
I'm very glad I was at least taught probability in class
What's the level of calculus required?
It definitely isn't easy but if you can get through and understand this book you'll be at the level of a strong ML masters graduate
I can integrate a very very simple equation only
Especially not with a lot of surds fractions powers and multiple variables
just the core ideas
I'm not great with functions
if there is a hard integral it will be in intro chapters or appendix
it's very likely that you won't have to integrate anything by hand anyway, only some special results are important there, e.g. related to expectation and moments, energy-like quantities, and integral transforms
doing the integration isn't the point, the point is the concepts anyways
Yeah but we've been discussing whether the ability to do it is required
Entire talks about math exam
you don't need to know all of the weird integration tricks from calc 2 or anything
Concepts aren't something you need to grind out practise questions for
just "as needed"
We were specifically talking about being able to get the "tricks" and pass calc algebra exams
even so, you'll never run into an integral that requires super fancy tricks and you have to solve by hand unless you're taking a course on integral calc/calc2, so don't worry about it at that level
This guy just said without being able to pass said exam I wudnt be hired by him
That's the entire topic
understanding special properties is what is usually evaluated, not doing a weird integral or antiderivative
I mean, that I can learn, I can read a lot and study… that's different to solving
they won't evaluate you on calc 2, but rather on recognizing an integral is equivalent to a special transform, or that special results can be applied to immediately simplify it
And this is about solving ability
it's about solving ability in the specific context
you could cook up an arbitrarily messed up integral that no one in the world can solve, phd in maths or no
that's beside the point
you need to know the skills for what you're aiming for
Yeah but otherwise I feel like I'm memorising math facts without truly understanding
And that's essentially what is being implied against me; if u can't solve at a low level u can't be a good DS
I CAN memorise all this information and concept
yes but you're not solving low level problems because you're failing to notice what is important
What is
you need a strong grasp on earlier concepts, really understanding them
rote computing does not necessarily equate to understanding
But u can't understand if u can't compute right?
that's absolutely wrong
especially considering several concepts don't even have any computation attached to them
So in your opinion, even if I'd fail a final year math exam I could still be a decent DS?
depends on final year at which level
That's the opposite of what someone just said
HS so calc 3 I believe is American level?
if it's final year HS, you have a ton of ground to make up
tbh the grades are overall not really important if you really understand the concepts, but you also said that wasn't the case
Take a look at AS core 4 exams
A2 sorry not as
It's a2 c4 maths
They also have further maths which is lin Alg
For me it's unbelievably hard
By c4
getting bad grades and struggling with a topic are two separate things
i'd say it seems rough for high school, but these are all things you should be capable of
they're basic undergrad maths you'd pick up in first year at latest
I love what I do it's fun but now I wana swap to engineering and just code now
Bcs that shit would take way too long to get a good grade on
well, switching to engineering means you'll need to learn what they learn in engineering
I could absolutely learn to code well
these maths are the basic foundation to do the actual work later on
I bet they are - and I couldn't get higher than 30% marks
Which is a certified fail
60% is pass minimum I think
then you gotta sink some time into it
Maybe when I start working I will yes
Hopefully the bar will be lower to get into companies than this dudes faang
So start working and get experience and on the side learn that
i would expect it to get higher, since everyone wants to jump into these fields with as little preparation as possible
Higher in a while
Not in a couple months haha
I have an offer to be analytics consultant also which is much less mathematical
But I donβt rly wana do it
I think itβs paid bad
314 votes and 204 comments so far on Reddit
Well makes me feel slightly better…
if you don't wanna learn it, don't. no one will force you lol
you might also wanna read up on confirmation bias
It's not that I don't want to, it's that I may struggle to while working full time and having other commitments
And knowing that it seems like a very scary idea to try work as a DS if I will not be capable to get jobs or do jobs
Especially since I'm finishing uni in 2 months
There's no plan b
Except either consulting (cringe) or data engineering
And the convo started with me saying maybe I shud just focus purely on coding then
What's cringe about consulting?
I associate it with really annoying business jargon people but that's just my bias
I know this one guy and he says touch base like 12 times an hour no joke
I'm not really sure it's for me, and it pays pretty badly too iirc
"Innovative"
I want to train a lstm on a body of text. Is there a way I'm supposed to break the text down in to trainable data?
This seems to explain most of the basics
Thanks a lot. I have found a lot of stuff on using embeddings, but nothing on the data prep for them
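For reference, one common way to break a body of text into trainable data is a sliding window: each run of `seq_len` tokens is an input and the token that follows is the target. A minimal sketch, assuming whitespace tokenization and a hypothetical helper name `make_windows` (neither comes from the linked explanation):

```python
import numpy as np

def make_windows(text, seq_len=3):
    """Split text into (input window, next token) pairs of integer ids."""
    tokens = text.lower().split()                 # naive whitespace tokenizer (assumption)
    vocab = sorted(set(tokens))                   # id -> word lookup
    word_to_id = {w: i for i, w in enumerate(vocab)}
    ids = [word_to_id[w] for w in tokens]
    n_windows = len(ids) - seq_len
    X = np.array([ids[i:i + seq_len] for i in range(n_windows)])  # inputs
    y = np.array([ids[i + seq_len] for i in range(n_windows)])    # next-token targets
    return X, y, vocab

X, y, vocab = make_windows("the cat sat on the mat and the cat slept", seq_len=3)
print(X.shape, y.shape)
print([vocab[i] for i in X[0]], "->", vocab[y[0]])
```

The integer ids in `X` are what an embedding layer would take as input; `vocab` is kept so ids can be mapped back to words later.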
just heard a podcast from the ceo founder of this company, and it sounds like its pretty promising https://venturebeat.com/2022/03/16/hidden-door-reveals-its-ai-powered-narrative-game-building-platform/
"We like to think of it as Roblox meets D&D, where you have the vibe of a tabletop RPG where you and your friends are telling a story together. You're also playing with the AI narrator, who's sort of like our AI dungeon master, who's building a world out of the choices that you make as you play."
Anyone have experience with running something like DALL-E on Google Colab?
How quick does it run when using pro+, and can it run with just pro?
I can convert a word to a vector with embeddings, but how can I do it in reverse.
it's non-trivial to go in reverse, because a given embedding probably won't match up exactly with an embedding in your vector space
you can think of vectors in the original encoded space as being in R^n, and vectors in the space after the embedding as being in R^m, usually with the condition m << n. you can only go in the opposite direction if the subspace of R^n spanned by the words in your text has dimension <= m, or if you encode text that has few enough unique words to satisfy some identifiability condition when paired with the matrix that does the embedding
there's usually no unique way of going back except under special conditions
I want to set up the model in such a way that it writes a whole sentence instead of one word at a time. Normally people use one hot encoding, but it doesn't really work that well here.
it could be doable as long as the sentence satisfies the identifiability condition
Well I definitely don't know how to do it.
should be more or less equivalent to pseudo inverting the embedding matrix, do you have any way to get ahold of its entries?
Keras has a get layer weights function.
I've heard the embedding layer is basically a dense layer
yep
a dense layer is the same as a dense matrix
if we call that M, a matrix that does the embedding, we are interested in x such that Mx = v, where v is the embedded vector and you want to solve for x. M is going to be a fat matrix (more columns than rows), meaning it is underdetermined and the equation has either no or infinitely many solutions
the reason people use one hot encoding here is that that inherently makes x sparse. then you can find the unique sparsest solution x by adding in sparse regularization
using something like combinations of syllables is less likely to have a sparse representation, which is more memory efficient, but also more difficult to invert for many reasons. it's more difficult to build prior info to find a unique sol, distances between words are not uniform, making the matrix poorly conditioned
so if your goal is to generate similar text, i can see the merits of using one hot
that being said, nothing stops you from trying both (other than time constraints)
for the inversion, maybe scipy or scikit learn has something like a lasso regressor
scikit has one, yes https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
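A rough sketch of that inversion, assuming a random stand-in for the embedding matrix M and a single one-hot word (the sizes, the alpha value, and the variable names are illustrative assumptions, not anyone's actual model):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
vocab_size, embed_dim = 200, 32

# Stand-in for the embedding matrix M (embed_dim x vocab_size), a "fat" matrix.
# With Keras, layer.get_weights()[0] has shape (vocab_size, embed_dim),
# so M would be its transpose.
M = rng.normal(size=(embed_dim, vocab_size))

# One-hot encoded word x and its embedded vector v = M x.
true_index = 57
x_true = np.zeros(vocab_size)
x_true[true_index] = 1.0
v = M @ x_true

# Solve M x = v with a sparsity penalty; each vocabulary entry is one "feature".
lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
lasso.fit(M, v)

# The largest coefficient marks the most likely word id.
recovered_index = int(np.argmax(np.abs(lasso.coef_)))
print(recovered_index, true_index)
```

For a generated vector that does not correspond exactly to any single word, the same argmax gives a best guess rather than an exact recovery.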
The goal is to generate text based on the style of a writer (their writings being the training data). I figured out how to make a gan with a LSTM generator (for reasons I don't really understand the generator has to output vectors, so I have to train it with the real data being autoencoded) But I have no way of getting the final text out of the decoder right now.
They are equivalent for off-line. http://incompleteideas.net/book/ebook/node76.html
7.4 Equivalence of Forward and Backward Views
i'm guessing you're managing to generate the sentence in the embedded space?
On-line learning is a whole other unsolved thing. Although we can expect it to be very similar for on-line. As shown in the post, you can modify the definition to get it to work.
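A small numerical check of the off-line claim, as a sketch under assumptions: tabular values, accumulating traces, one made-up episode, and V held fixed during the episode (that last part is what "off-line" means here). The function names are hypothetical.

```python
import numpy as np

def forward_view(states, rewards, V, alpha, gamma, lam):
    """Sum of lambda-return updates over one episode, V held fixed."""
    T = len(states)
    delta_V = np.zeros_like(V)
    for t in range(T):
        lam_return, G = 0.0, 0.0
        for n in range(1, T - t + 1):
            G += gamma ** (n - 1) * rewards[t + n - 1]     # discounted reward sum so far
            if t + n < T:
                n_step = G + gamma ** n * V[states[t + n]]  # n-step return
                lam_return += (1 - lam) * lam ** (n - 1) * n_step
            else:
                lam_return += lam ** (n - 1) * G            # full return takes the leftover weight
        delta_V[states[t]] += alpha * (lam_return - V[states[t]])
    return delta_V

def backward_view(states, rewards, V, alpha, gamma, lam):
    """Sum of TD-error updates weighted by eligibility traces, V held fixed."""
    T = len(states)
    delta_V = np.zeros_like(V)
    e = np.zeros_like(V)
    for t in range(T):
        v_next = V[states[t + 1]] if t + 1 < T else 0.0     # terminal value is 0
        td_error = rewards[t] + gamma * v_next - V[states[t]]
        e *= gamma * lam                                     # decay all traces
        e[states[t]] += 1.0                                  # accumulate for the visited state
        delta_V += alpha * td_error * e
    return delta_V

rng = np.random.default_rng(0)
V = rng.normal(size=4)
states = [0, 1, 2, 1, 3]               # rewards[t] is received after leaving states[t]
rewards = [0.0, 1.0, 0.0, -1.0, 2.0]

fwd = forward_view(states, rewards, V, alpha=0.1, gamma=0.9, lam=0.8)
bwd = backward_view(states, rewards, V, alpha=0.1, gamma=0.9, lam=0.8)
print(np.allclose(fwd, bwd))
```

If V were updated inside the loop of `backward_view` (the on-line case), the two totals would generally no longer match exactly.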
i was trying to webscrape and i got so frustrated i ended up faking my data
Well I am embedding the inputs into the autoencoder, using the embeddings seems like the only good way to get text back out.
without the output being the word count * vocab
one way that is blackboxy is to extend the autoencoder to produce the vector pre-embedding, for instance
but if you already have the embedded output, you can "invert" the embedding matrix, as mentioned above (more like solving an inverse problem, really)
if I invert the embedding matrix how do I get the word itself out
if you invert the embedding matrix, this allows you to map an embedded input into an encoded input
and an encoded input can be decoded with the same function you used to encode it
whatever you used for your word to vector conversion should have a decode function, that's no problem
ok. Thanks a lot
Example (7.3) demonstrates one of the key differences between off-line and on-line: "Note that the on-line algorithm works better over a broader range of parameters. This is often found to be the case for on-line methods.".
words -> encoded vectors -> (this operation is lossy) embedded vectors -> whatever your code does to generate new embedded vectors -> (this inversion is the difficult one) -> encoded output -> (your initial encoder should have a decoder) output sentence @rough mountain
at least that's how i see it in my head
Hey, I'm training a homemade AI model on some basic sentences to analyze them for their real meaning; is there a corpus of simple subject-predicate sentences in a library somewhere?
How do I know how to handle NaNs? I have columns like "Amount due on existing mortgage", "Value of current property", "Years at present job", "Number of major derogatory reports"
"Number of delinquent credit lines", "Age of oldest trade line in months", "Number of recent credit lines"
How do I know if I should use mean/median, a kNN imputer, or an iterative imputer?
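One way to decide empirically is to hide values you do know, impute them with each method, and compare the error. A minimal sketch on synthetic correlated columns (the data is made up and merely stands in for the mortgage columns above):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

rng = np.random.default_rng(0)

# Synthetic stand-in: three correlated numeric columns.
base = rng.normal(size=(200, 1))
X_true = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)])

# Hide roughly 10% of the entries so the imputations can be scored.
mask = rng.random(X_true.shape) < 0.1
X_missing = X_true.copy()
X_missing[mask] = np.nan

imputers = {
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
    "iterative": IterativeImputer(random_state=0),
}

scores = {}
for name, imputer in imputers.items():
    X_filled = imputer.fit_transform(X_missing)
    # RMSE on the hidden entries only.
    scores[name] = float(np.sqrt(np.mean((X_filled[mask] - X_true[mask]) ** 2)))

print(scores)
```

On real data the same masking trick can be applied to the rows that are fully observed; which imputer wins depends on how strongly the columns are related.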
Maybe not super relevant, but I created this video using some text to image AI generator
Say we have 4 training datasets from Kaggle: A_train, B_train, C_train, and D_train. A_train contains all of the columns found in B_train, C_train, and D_train. How should we process them?
On the other hand, A_train is large, with 3 million rows. Should we merge A_train with the other datasets to end up with a smaller dataset? Or something else?
You could look into something like dask. I've only used it on top of a netcdf/hdf5-oriented layer module called xarray
It probably still requires a buttload of disk usage, as much as you'd need in ram if you loaded it "normally"
the data i mean
what is hdf5?
But if you have a dataset like that, what can you do? Do you merge the data first, or choose columns by dropping some from A_train?
Hdf5 is a hierarchical data format kinda like json, I guess, but it's irrelevant.
Again, I used XArray, but it works on top of the Dask module. Dask allows you to read in datafiles too large to fit in memory and process them in "chunks", sounded like something you wanted.
This helped me perform SVD on a total of 120 gigs of data without having to sacrifice any data.
Other than that I can't help you, I'm not a ML expert, sorry.
From their page, maybe this helps:
https://examples.dask.org/machine-learning/training-on-large-datasets.html
Or perhaps this one:
https://ipython-books.github.io/511-performing-out-of-core-computations-on-large-arrays-with-dask/
Their array type should be compatible with pandas, but I'm not sure of how to convert them
Yeah, I already read about hdf5 before. But how do I use it when dealing with CSV files?
You probably need to play around with it, especially if you start specifying chunk size. I don't think it can handle discrepancies in data, so if you have NaN at different positions in your chunks (i.e. without pattern) it can fail the process, you need to account for that
I already imported the dataset like this, but this happens. Why do I only get the data types?
How to show all values in dataset?
What's the shape of your data?
I think it just states you got a crapton of data by this. You could affirm individual elements by accessing them just like any pandas dataframe, reading from their docs
the dataset contains 3 million rows and 73 columns
That's probably why then, it's not gonna list that. The npartitions are the number of "chunks" the data is placed into. So when dask works with your data, it loads it in chunks
Also, as I remember, dask doesn't perform any of its operations (perhaps even .read_csv()) until it has to, so if you need to debug various operations, or use the data outside dask operations, you may need to perform .compute()
But actually, I don't really understand how hdf5 works. Going back to my question: when we have 4 datasets, as I said before, how should we process them? Should we only use A_train, which has all of the columns, or can we merge it with the others to get a smaller dataset?
I think I would read the data the normal way with read_csv
If they have the same variables you could load them into the same dask array using read_csv. I think you could just list them all in a tuple, or perhaps use the globbing, if it fits your files' names
You can also join them by various union types and things like that, but I'll leave that up to you, I honestly can't remember any of that, sorry
A_train has all of the columns of the other datasets. B_train, C_train, and D_train are snippets of the A_train dataset
Think I misunderstood, to process a dask array look into the docs, I can't help you with that, sorry, but they got a whole section called ML-dask on their page, I'm sure you'll find something
EDIT: start here maybe:
https://ml.dask.org/cross_validation.html
OK, thank you for the discussion
Hey all! So I recently came across a TON of stamps, and I am trying to create a DB of them. Because there are literally thousands, I am hoping to be able to take a photo of multiple stamps and have my app split them into individuals. Are there any APIs or SDKs or algorithms that anybody knows of that could help me do this?
Quick question, does anyone here have any experience with the algorithm QMIX? In the linked repository, I am trying to find where the monotonicity constraint is implemented.
https://github.com/quantumiracle/Popular-RL-Algorithms/blob/master/qmix.py
qmix.py lines 243 to 250
```python
w1 = self.hyper_w_1(states).abs() if self.abs else self.hyper_w_1(states) # [#batch*#sequence, action_shape*self.embed_dim*#agent]
b1 = self.hyper_b_1(states) # [#batch*#sequence, self.embed_dim]
w1 = w1.view(-1, self.n_agents*self.action_shape, self.embed_dim) # [#batch*#sequence, #agent*action_shape, self.embed_dim]
b1 = b1.view(-1, 1, self.embed_dim) # [#batch*#sequence, 1, self.embed_dim]
hidden = F.elu(torch.bmm(agent_qs, w1) + b1) # [#batch*#sequence, 1, self.embed_dim]
# Second layer
w_final = self.hyper_w_final(states).abs() if self.abs else self.hyper_w_final(states) # [#batch*#sequence, self.embed_dim]
```
.abs()
The weights of the mixing network are produced by separate hypernetworks. Each hypernetwork takes the state s as input and generates the weights of one layer of the mixing network. Each hypernetwork consists of a single linear layer, followed by an absolute activation function, to ensure that the mixing network weights are non-negative
To enforce the monotonicity constraint of (5), the weights (but not the biases) of the mixing network are restricted to be non-negative
bc w1 and w2 is being constrained there correct?
They can't be negative.
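To see why non-negative weights give monotonicity, here is a stripped-down NumPy sketch of a two-layer mixer. The weights are fixed random arrays standing in for the hypernetwork outputs, and all names are illustrative rather than taken from the repository.

```python
import numpy as np

def elu(x):
    # ELU is monotonically increasing, which the argument relies on.
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def mix(agent_qs, w1, b1, w2, b2):
    # abs() forces the weights (not the biases) to be non-negative.
    hidden = elu(agent_qs @ np.abs(w1) + b1)
    return hidden @ np.abs(w2) + b2

rng = np.random.default_rng(0)
n_agents, embed_dim = 3, 8
w1 = rng.normal(size=(n_agents, embed_dim))   # stand-in for hyper_w_1(states)
b1 = rng.normal(size=embed_dim)               # biases may be negative
w2 = rng.normal(size=embed_dim)               # stand-in for hyper_w_final(states)
b2 = rng.normal()

agent_qs = rng.normal(size=n_agents)
q_tot = mix(agent_qs, w1, b1, w2, b2)

# Raising any single agent's Q-value never lowers Q_tot.
bumped = [mix(agent_qs + 0.5 * np.eye(n_agents)[i], w1, b1, w2, b2)
          for i in range(n_agents)]
monotone = all(q >= q_tot for q in bumped)
print(monotone)
```

Dropping the `np.abs` calls would let some weights be negative, and then the check could fail for some agents.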
also, how do you know everything?
and ur username is Squiggle it's hard to take you srsly lol
I choose my usernames arbitrarily. I would generate a hash but that makes it hard for someone to refer to me.
how did you know it was that line?
I knew that the weights needed to be non-negative.
They are chosen by a "hyper" network.
And those lines were in the class QMix, which is the mixing network that needs the constraint.
Oh I understand now
you dont know squiggle. out of everyone in this channel, squiggle is one of the most knowledgeable. hands down.

i even have in my notes for squiggle: "basically knows everything"

Hey guys, first time posting in here. Had a question. Currently working on a fairly large dataset (options data) - and have a column with a bunch of expiration dates. Now I only want to filter the column to show the expirations on a Friday. Do I need to incorporate this into a loop?
I have made the column into a datetime format and have tried selecting the expiration on only day 5 (Friday) but no luck. Pasting 2 screenshots for reference.
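No loop should be needed for this. One likely cause of the "no luck": pandas numbers weekdays from Monday = 0, so Friday is 4, not 5. A minimal sketch with a made-up frame (the column names `expiration` and `strike` are assumptions):

```python
import pandas as pd

# Hypothetical stand-in for the options data.
df = pd.DataFrame({
    "expiration": ["2022-03-16", "2022-03-18", "2022-03-21", "2022-03-25"],
    "strike": [100, 105, 110, 115],
})
df["expiration"] = pd.to_datetime(df["expiration"])

# dt.dayofweek: Monday=0 ... Friday=4 ... Sunday=6
fridays = df[df["expiration"].dt.dayofweek == 4]
print(fridays)
```

`df["expiration"].dt.day_name() == "Friday"` is an equivalent filter that avoids remembering the numbering.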
yoo, thank you for noticing my question
PLS IS ANYBODY FAMILIAR WITH LASSO REGRESSION AND FINDING THE OPTIMAL ALPHA??
I NEED HELP PLS
what is "alpha" here?
I think its the rate that lasso regression is multiplied against
omg yes
back to this question, what about for an online case tho
I think its a hyperparameter and you just gonna play with it, right?
if you can show the equation you're using, i can take a look. idk if you mean what's normally called the "lambda" parameter
can you show your version of the lasso problem?
ah ok, the sklearn one. then yes, it's the sparsity regularization weight
i don't know what i'm doing, i'm just guessing the alpha value
well, there are 2 common ways
one of them is exactly as you're doing it: you generate a list of alpha values, and then evaluate which one gave you the "best" result in some sense. you keep that one. this is how it's done when you use an algorithm that needs an explicit value of alpha
an alternative is to use an algorithm that can find it explicitly
pls whichever one is easier to do
probably what you're already doing. the answer is: try many different alphas and keep the best
if you know the values x that solve Ax = y ahead of time, you can check the distance between x and your estimate to pick alpha. if not, then the distance between Ax and y also works, though not as well
what is the typical range?
ah right. that one is annoying because it depends on the actual algorithm. the method i'm familiar with is as follows (though it might not work for you, we'll have to try). you have your matrix A and the vector y, yeah? it turns out that some solvers use "soft thresholding" in their iterations. the amount that entries are soft thresholded by is the product of alpha with an internal learning rate that is also used. you can compute the product A^T y and find the element with the largest absolute value. call this quantity, w, for instance. then you can set alpha to w * c, where c is a number between 0 and 1
setting c to 0 should remove the sparse regularization entirely, and setting it to 1 will make the output fully sparse, i.e. a vector of zeros
then all you have to do is test values of c between 0 and 1
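A sketch of that recipe with sklearn's `Lasso`, on made-up data. One assumption to flag: sklearn's objective divides the squared error by 2 * n_samples, so the threshold is max|A^T y| divided by the number of rows rather than the raw maximum.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic problem: 3 non-zero entries in a 20-dimensional x.
A = rng.normal(size=(50, 20))
x_true = np.zeros(20)
x_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
y = A @ x_true + 0.01 * rng.normal(size=50)

# Smallest alpha for which sklearn's Lasso (no intercept) returns all zeros.
alpha_max = np.max(np.abs(A.T @ y)) / A.shape[0]

# Sweep c in (0, 1]; alpha = 0 exactly is avoided because Lasso warns against it.
nonzeros = {}
for c in [1.0, 0.3, 0.1, 0.03, 0.01]:
    model = Lasso(alpha=c * alpha_max, fit_intercept=False, max_iter=10000)
    model.fit(A, y)
    nonzeros[c] = int(np.sum(np.abs(model.coef_) > 1e-8))

print(alpha_max, nonzeros)
```

The number of non-zero coefficients grows as c shrinks; picking the best c still needs a score, such as error on held-out rows.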
oh, goodbye sleep
what if i did the other way where it would automatically find it for me?
yeah but you'd have to use a different solver. i know cvx can do this. idk how to do it with sklearn
It tends to be close, but it's not equivalent as in the off-line case. Not without modifying the definitions. It's addressed in the link.
oh okay thanks anyways!!
ah, bingo. sklearn lassoCV can do this with cross validation
use that
ahh really, I think I skimmed a bit too much. will read it more thoroughly
omg thanks but how would i even start?
by reading the documentation :p the function should do pretty much everything for you
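As a starting point, a minimal `LassoCV` sketch on made-up data (swap in your own feature matrix and target; the 0.5 cut-off below is only for displaying the large coefficients):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Synthetic data: only features 2, 7 and 11 actually matter.
A = rng.normal(size=(50, 20))
x_true = np.zeros(20)
x_true[[2, 7, 11]] = [1.5, -2.0, 1.0]
y = A @ x_true + 0.01 * rng.normal(size=50)

# LassoCV builds its own grid of alphas and picks one by 5-fold cross validation.
model = LassoCV(cv=5, max_iter=10000).fit(A, y)

best_alpha = model.alpha_                                   # the selected alpha
selected = sorted(np.flatnonzero(np.abs(model.coef_) > 0.5).tolist())
print(best_alpha, selected)
```

After fitting, `model.predict` works as with any sklearn regressor, and `model.alphas_` lists the grid that was searched.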


