#data-science-and-ml
guys when a language model gets trained
does it learn the probability distribution for the vocabulary
I don't have much experience with keras, but could it be that you have to specify the batch size too?
It probably thinks there's multiple possible output shapes now or something
!paste
Pasting large amounts of code
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.
where exactly should i do that
"language model" describes what it does, not how its implemented.
if i train a text generative model
how can i describe the state of the model after it has been trained, like would it be correct to say "after training, it has learned the probability distribution for each word in the dataset"
is there a better way to put it
how would you put it, like what distinguishes before and after training with regards to the learned probabilities
it's more about learning the probability of sequences, and saying "this sentence is more likely than this one"
like multiple combination of words right
and learning what is more likely to come after
am writing a report and can't find the proper wording ;-;
a language model can learn that for "the boy was chased by the x", x is more likely to be "dog" than "cat", because "the boy was chased by the dog" is learned to be more likely than "the boy was chased by the cat"
though this isn't necessarily because of some learned properties of the words "dog" and "cat".
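To make the "probability of sequences" idea concrete, here is a minimal sketch using a toy bigram model. The corpus, the `<s>`/`</s>` boundary tokens and the function names are all made up for illustration, and a real model like GPT-2 conditions on much more than one previous token. It only shows that a sequence probability is the product of next-token conditionals:

```py
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each token follows each previous token."""
    follow = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            follow[prev][nxt] += 1
    return follow


def next_token_prob(follow, prev, nxt):
    """Estimate P(nxt | prev) from raw counts (no smoothing)."""
    total = sum(follow[prev].values())
    return follow[prev][nxt] / total if total else 0.0


def sequence_prob(follow, sentence):
    """Chain rule: multiply the conditional probability of every token."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, nxt in zip(tokens, tokens[1:]):
        prob *= next_token_prob(follow, prev, nxt)
    return prob


# Hypothetical toy corpus: "dog" follows "the" more often than "cat" does.
corpus = [
    "the boy was chased by the dog",
    "the boy was chased by the dog",
    "the boy was chased by the cat",
    "the dog was chased by the boy",
]
model = train_bigram(corpus)
p_dog = sequence_prob(model, "the boy was chased by the dog")
p_cat = sequence_prob(model, "the boy was chased by the cat")
print(p_dog > p_cat)  # True
```

Before training the counts are empty, so every sequence gets probability 0; after training the model ranks sequences it has evidence for, which is the before/after difference being asked about.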
now when it comes to writing what you said in simple terms XD
well, if you're writing a report, you should say "sequence" instead of "sentence", and "token" instead of "word".
so i say, "After a language model has been trained, it has learned the probability distribution for every sequences"
.wiki language model
Language model
A language model is a probability distribution over sequences of words. Given such a sequence of length m, a language model assigns a probability P (
BERT (language model)
research publications analyzing and improving the model. The original English-language BERT has two models: (1) the BERTBASE: 12 encoders with 12 bidirectional
the first sentence has it
is mine not correct tho :<
not quite in that the distribution is over the sequences, not a separate distribution per sequence
"over"
"After a language model has been trained, it has learned the probability distribution over sequences of words from the dataset"
if you don't like the word "over", you can also use "joint distribution"
what about conditional probability
well, the way you have it atm says nothing about whether events are independent or not, but sequences are considered to be dependent on each other. you could add info regarding that if you'd like
conditional probability distribution over the set of tokens
hmmm that's more tricky to say
can you please rephrase it for me c:
i'd leave it out if you can't phrase it yourself. this is the sort of stuff that attracts questions during reviews
You will need more details on which model specifically. Some generative models don't even really model the joint distribution, they just generate things that "look like" they would be in the data set.
can anyone pls look into this
(You can sometimes get away with this if you don't care, such as when generating images)
gpt2
Well, that is a bit more complicated. If you are talking about language models in general, then the distribution over sequences is fine. Like Wikipedia (which is why it's also being vague).
*Language models are diverse.
what do you mean, just use it whats the problem?
and it does print in order, or do you mean sorted?
this is stelercus's worst nightmare. not just the code, all of the text is an image
oh itt does?
lol you literally sent your answer
i dont want to print [34234423, 324234 ,34234]
i want to print 423423 then 433243 42342
list = [3242423, 2342342 ,234242]
client.get(list)
im actually tryina do that
replacing the (list) with each number
thats in there
so your omitting the '3' in front correct
oh no its just random numbers
its supposed to be client id?
an id yes
here
so if i do client.get(3) it should give the random numbers with 3 in front?
since ids are unique i would use list index as id
no that is just my function what it does it sends messages to the ids there
ok so i got the list with the numbers
and i have a command
that sends a message to those id's
client.get(list)
client.get(list) will send a message to every item in the list right
it accepts a list, iterates through each item and sends them a message
yes
and i cant figure out
how to send the message to the numbers from my list
client.get(342342353) if i do this it will send a message there
this?
yes
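If the function only accepts one id at a time, a plain loop over the list is probably all that's needed. A minimal sketch, where `sent.append` is just a stand-in for `client.get` (whose real behaviour isn't shown here):

```py
def send_to_each(send, ids):
    """Call `send` once per id instead of passing the whole list."""
    for user_id in ids:
        send(user_id)


sent = []
# Named `ids` rather than `list` so the builtin `list` is not shadowed.
ids = [3242423, 2342342, 234242]
send_to_each(sent.append, ids)  # sent.append stands in for client.get
print(sent)  # [3242423, 2342342, 234242]
```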
guys
i had a question with pytorch
im building a network that classifies breast cancer
ksi bein sus
i keep getting this error: RuntimeError: expected scalar type Float but found Double
have you tried existing tutorials
yes
this is a specific question with my code though im not sure why i keep getting this error
i tried casting my x, y in the trainloader to float
yet im still getting error
what could be the problem?
I have two tables in excel. I'm pulling Table B into pandas, dropping a couple columns, reindexing a few columns, and then pasting that at the end of Table A in excel. There are 60 columns. When I reindex Table B in pandas, I only reindexed the columns I wanted to reorder; however, it appears that the rest of the columns I didn't mention were dropped from the df.
Is there a way to use either column numbers, or to tell pandas "hey, reorder these columns and the rest of them can stay the same, after these reordered columns"?
even though you're conceptualizing it as "only reindexing a few columns", you're changing the order of the columns in general, so you need to provide a list of labels that reflects the order you want at the end for all the columns.
taking the label of each column, is there some property that distinguishes between the ones you want to promote to the front, and the ones that you don't?
(for example, is it "every column that's divisible by 3" or "every column with an underscore")
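One way to build that full list of labels without typing all 60 names: put the columns you care about first and append everything else in its existing order. A minimal sketch; the column names below are invented for the example:

```py
def reorder_labels(all_labels, front):
    """Return every label exactly once: `front` first, the rest in original order."""
    missing = [c for c in front if c not in all_labels]
    if missing:
        raise KeyError(f"labels not found: {missing}")
    front_set = set(front)
    rest = [c for c in all_labels if c not in front_set]
    return list(front) + rest


# Hypothetical headers standing in for the real table.
columns = ["Company", "Pipeline", "Region", "2022-Jul", "2022-Aug"]
new_order = reorder_labels(columns, ["Region", "Company"])
print(new_order)  # ['Region', 'Company', 'Pipeline', '2022-Jul', '2022-Aug']
```

With a DataFrame, the result can be passed to `df.reindex(columns=new_order)` or `df[new_order]`, so the monthly headers never have to be listed by hand.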
Maybe your data is np.float64 instead of np.float32?
my_arr = my_arr.astype(np.float32)
Did you try that?
no i did not try that
i found another fix
but
Which is?
It is 100% based on the column header names. The first 11 in Table B will always be the same; the subsequent ~50 columns headers will change every month.
These two tables are production forecast tables. So the first few columns are "this is where the production is coming from, the company, the pipeline it's feeding into..." etc. The last columns are monthly production forecasts, so each column has a header of, for example, "2022-Jul", "2022-Aug", and so forth. When I run the update each month, the column headers will change.
Table B is the forecast for potential production. Meaning, some folks are proposing new production. Table A is the base, or existing production.
I'm merging the tables so that we can get a summary of the production forecast for the next X years.
LOSS = []
for epoch in range(100):
    for i, (x, y) in enumerate(trainloader):
        x = torch.from_numpy(np.asarray(x)).float()  # added: cast to float32
        y = torch.from_numpy(np.asarray(y)).float()  # added: cast to float32
        yhat = model(x.view(-1, 50 * 50))
        loss = criterion(yhat.flatten(), y)
        LOSS.append(loss)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
#%%
those two lines with .float()
but
ah alright
So after the first 11 columns, the headers in Table B and Table A will always match.
the new problem is that the output is now a 2d tensor w two elements, but my y is a 1d tensor w a single element
also
my yhat values are now negative
@mild dirge
Interestingly it seems that reindex has similar capabilities of drop. It's just the inverse.
You understand the meaning of the two vectors?
How to break python? (not quite) If you want to, you can do it. I think it's really cool. But actually... EVERYTHING WILL BE IN ...
could you explain?
when i set my
d_out to 1
i just get negative values
im not sure why though
is this relevant...?
don't know what d_out is
number of output nodes
class Net(nn.Module):
    def __init__(self, D_in, H, D_out):
        super(Net, self).__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x):
        x = torch.relu(self.linear1(x))
        x = self.linear2(x)
        return x
The output of the model is probably logits, which you need to put through a softmax to get confidence for each class
But without the context of loss, you can just take the argmax, and that will be the model's predicted class
this schoolboy broke a python
So instead of your model just outputting the predictions for each sample like:
[
0,
1,
0,
]
it outputs something like:
[
[0.8, 0.1],
[0.2, 0.6],
[0.95, 0.01]
]
And you want the position of the largest element in each row
So you can convert from the second format to the first
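A minimal framework-free sketch of that conversion, using the example rows above (`argmax_rows` is just an illustrative helper name):

```py
def argmax_rows(scores):
    """For each row, return the index of its largest element."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


scores = [
    [0.8, 0.1],
    [0.2, 0.6],
    [0.95, 0.01],
]
print(argmax_rows(scores))  # [0, 1, 0]
```

On a PyTorch tensor of shape (batch, classes) the same thing is `yhat.argmax(dim=1)`.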
This is the data science channel. Please make sure that all your messages in our server are on-topic.
sry
Is this the right place to ask if there is a pretrained model available? and if so where can I search for it?
you can try tensorflow hub, torchvision models, timm or huggingface
I was wondering if Imaginaire's SPADE has a pretrained or fine tuned model on bdd dataset.
CVPR author literally assume reader has already worked along with them on projects, the way they explain is garbage, maybe not meant for someone new to field.
how do i overcome this
Hi guys, I want to check my understanding. How do I count the number of hidden layers? In this case, do I have 3 hidden layers?
Try and draw it out like this @bold timber
Then you just count the amount of layers that are not the input or the output
But what do you think about my question? how many hidden layer that I have?
^
I'm saying this so you can check it for yourself, that way you know for sure you understand it
You start with 4 nodes (the input) and then try draw it out
That means I have 3 hidden layers, right?
I'm stuck on 70% accuracy, what can I do?```py
model = Sequential()
model.add(Conv2D(8, 2, activation='relu', input_shape=(48, 48, 1)))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(MaxPooling2D(2))
model.add(Conv2D(16, 2, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(MaxPooling2D(2))
model.add(Conv2D(32, 2, activation='relu', kernel_regularizer = keras.regularizers.l2(0.001)))
model.add(BatchNormalization())
model.add(MaxPooling2D(2))
model.add(Conv2D(64, 2, activation='relu', kernel_regularizer = keras.regularizers.l2(0.001)))
model.add(BatchNormalization())
model.add(Conv2D(128, 2, activation='relu', kernel_regularizer = keras.regularizers.l2(0.0005)))
model.add(Dropout(0.4))
model.add(BatchNormalization())
model.add(Conv2D(128, 2, activation='relu', kernel_regularizer = keras.regularizers.l2(0.0005)))
model.add(Dropout(0.4))
model.add(BatchNormalization())
model.add(Flatten())
model.add(Dense(512, activation='relu', kernel_regularizer = keras.regularizers.l2(0.0005)))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(256, activation='relu', kernel_regularizer = keras.regularizers.l2(0.005)))
model.add(BatchNormalization())
model.add(Dropout(0.4))
model.add(Dense(128, activation='relu', kernel_regularizer = keras.regularizers.l2(0.005)))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Dense(64, activation='relu', kernel_regularizer = keras.regularizers.l2(0.0001)))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(7, activation='sigmoid'))
```
That's my model
depends wats the task
It also looks like you only use 2 filters for each convolution
Normally the size of the filters decreases the further in the model, and the amount of filters increases
@gleaming osprey
ok, so what should it be ideally
and should I use pooling layers?
How big are your input images?
Oh actually nvm, the order is different for keras layers from pytorch
You do increase the amount of filters
It's also weird that your kernel size is 2, while it is rarely even
i've been suggesting for days to use a larger kernel (and fewer conv layers). maybe they listen to you
Well? @gleaming osprey
oh sorry
srry
um 48x48
Yeah but this @gleaming osprey
i dont know what are good kernel size
i just know what they are
try some nice odd numbers. 3, 5, maybe even 7. 7 is already huge for that image size
ok
should I use 3 for all?
Well I would use odd numbers for kernels
It's a lot simpler to visualize the convolution as well
idk how to
Knowing that the output of each layer depends on neighbouring pixels in each direction
here is my new model:```py
model = Sequential()
model.add(Conv2D(8, 5, activation='relu', input_shape=(48, 48, 1)))
model.add(BatchNormalization())
model.add(Dropout(0.2))
model.add(MaxPooling2D(2))
model.add(Conv2D(16, 5, activation='relu'))
model.add(Dropout(0.2))
model.add(Conv2D(32, 3, activation='relu', kernel_regularizer = keras.regularizers.l2(0.001)))
model.add(Conv2D(64, 3, activation='relu', kernel_regularizer = keras.regularizers.l2(0.001)))
model.add(Dropout(0.4))
model.add(Conv2D(128, 3, activation='relu', kernel_regularizer = keras.regularizers.l2(0.0005)))
model.add(Dropout(0.4))
model.add(Conv2D(128, 3, activation='relu', kernel_regularizer = keras.regularizers.l2(0.0005)))
model.add(Dropout(0.4))
model.add(MaxPooling2D(2))
model.add(Flatten())
model.add(Dense(512, activation='relu', kernel_regularizer = keras.regularizers.l2(0.0005)))
model.add(Dropout(0.6))
model.add(Dense(256, activation='relu', kernel_regularizer = keras.regularizers.l2(0.005)))
model.add(Dropout(0.4))
model.add(Dense(128, activation='relu', kernel_regularizer = keras.regularizers.l2(0.005)))
model.add(Dropout(0.3))
model.add(Dense(64, activation='relu', kernel_regularizer = keras.regularizers.l2(0.0001)))
model.add(Dropout(0.3))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(7, activation='softmax'))```
Hi, I'm trying to concatenate multiple dataframes together using final_df = pd.concat([final_df, df], axis=1), but it's creating duplicate indices. How do i get around this?
those are SO MANY convolution + pooling layers
you can probably get away with just 2 or 3
ok, how many filters would be ideal?
16 and 32?
That's really hard to say
hm
Depends on how complex patterns are
they are faces
How many patterns there are that are useful for predicting the class etc.
the reason you're needing so much regularization is that you have a humongous amount of parameters for the size of your data set. it's difficult to give a fixed number, but the more params, the more data you need, regardless of how complex the data is
An important thing to look at is also the "receptive field" of each layer
Like for a 3x3 convolution the receptive field is 3x3 pixels
But 3x3 convolution followed by 3x3 convolution would give a receptive field of 5x5
Followed by maxpool(2x2) would give 6x6 f.e., and every layer after the pool grows it twice as fast
And if you think the pattern can be found through looking at subsets of 6x6 pixels that is fine
If you are interested in very detailed patterns, you could try a lower receptive field (maybe not maxpool f.e.)
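If you want to check receptive fields for your own stack, here is a minimal sketch of the usual recurrence: each layer adds `(kernel - 1)` times the product of all earlier strides. It assumes square kernels, no dilation, and treats pooling like a conv with the same kernel and stride; the layer lists are just examples:

```py
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, in order.

    Returns the receptive field size along one axis of the input.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between neighbouring outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


print(receptive_field([(3, 1)]))                  # 3
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (3, 1), (2, 2)]))  # 6
```

Stride is what makes the field grow quickly: after a stride-2 pool, every later 3x3 conv adds 4 pixels instead of 2.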
Make sense? @gleaming osprey
hmm, I need to do something now, I'll try this later
you'll notice that your convolutions boil down to having single pixel outputs, and then you still keep doing convolutions. from the receptive field standpoint, this means you have a bunch of filters that just take in the entire image. at that point you have made makeshift fully connected layers and you may as well just use that instead
like using a fat matrix as a transformation to embed the images in a higher dim space. that's not very useful for the task you're dealing with, but might be useful if you wanted to find "similar images" instead
When I try to draw it out, I think I have two hidden layers and 1 output layer. Is that true?
Please correct me if I'm wrong, Sir. @mild dirge
Show your drawing
This drawing has 3 hidden layers f.e.
like this?
Jup seems correct
I assume the last nn.Linear(4,3) is the output layer? Is it correct?
With PyTorch you just add the connection from one layer to the next
So nn.Linear(4,3) Means you have fully connected weights between 4 and 3 nodes
Does it mean in my case I have 2 hidden layers with 1 output layer, right?
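A small sketch of the counting rule, assuming the network is a plain stack of `nn.Linear`-style layers. Only the 4 input nodes and the final `(4, 3)` come from the messages above; the middle sizes are made up:

```py
def describe_mlp(linear_shapes):
    """linear_shapes: list of (in_features, out_features), one per linear layer.

    Returns (number of hidden layers, number of trainable parameters).
    """
    for (_, out_f), (in_f, _) in zip(linear_shapes, linear_shapes[1:]):
        if out_f != in_f:
            # This mismatch is what shows up as a matrix multiplication error.
            raise ValueError(f"layer outputs {out_f} but the next expects {in_f}")
    hidden_layers = len(linear_shapes) - 1  # every linear layer except the last
    params = sum(i * o + o for i, o in linear_shapes)  # weights + biases
    return hidden_layers, params


# Hypothetical stack: 4 inputs -> 5 -> 4 -> 3 outputs.
print(describe_mlp([(4, 5), (5, 4), (4, 3)]))  # (2, 64)
```

So three linear layers means two hidden layers plus the output layer; the input itself isn't counted as a layer of weights.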
@mild dirge
Hi, is there a way in which I can allow VSCode to acknowledge that CUDA is available to avoid this error? : RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU. My apologies for posting this multiple times but I only just found this channel
Not much to do with VSCode I'd think
Did you install pytorch with cuda according to the website?
@dawn dune
i have started doing data science recently , i have covered python basics and numpy , pandas , matplotlib ... my laptop lags as its ram get utilized 95% with chrome tabs( youtube from where i learn ) and vs code , i have i5 7 gen , integrated gpu , 8gb ram , 256ssd + 2tb hdd , i am thinking to buy a new laptop , can u recommend me what minimum specification should my new laptop has ... for data science ..... mainly processor and amount of ram and gpu
If you are planning to buy a laptop you're better off using Google Colab or something
If you want to run bigger models a desktop will be more appropriate
Sorry hi yes I have, I'm running a TTS model in preparation for a final year project
So you put in a line like pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
You didn't just do pip install torch?
i don't have any idea about big models right now , i am in college and willing to learn complete data science to get some job , i already have 1 laptop , so according to that situation should i get a desktop instead of laptop
I run the demo doc on my conda pycharm env but no audio is produced like in the colab demo, however when I made my own script in VSCode it complains about cuda not being available.
If you want to run it locally yes
If you are fine with needing an internet connection, then using Google Colab or other cloud computing services could be fine
I used : conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
ohk , thanks
Would I not be able to run a model(Tacotron 1 & 2) locally on my laptop with a gtx 1060 and 16gb ram?
Don't know what tacotron is, you can still run some decently sized models, as long as you have enough ram
It's a TTS model
But with a laptop with a somewhat oldish gpu, it might take 10 hours instead of 1 or 2 with rtx 3000 series f.e.
and how long would doing it on colab take?
Probably less, Colab lends out quite a lot of computing imo
I would just use Colab on your laptop as long as you think it's good enough
You can also subscribe for more benefits etc. but I don't know how worth that all is
If only the uni gave me funding for that
Our uni has a big computer you can use for running models
But it's quite a hassle I think
Okay but back to my original point, would you have any idea how to add cuda to VSC
Doesn't have much to do with your IDE I'd think
But normally just doing that line works
Maybe you installed it in a virtual environment, or maybe you installed it globally but not in your venv
I would generally never name anything after a package name
that was just an example but fair
Is there a way to simply rename it without starting from scratch?
Accountability post:
Today I continued to build the database for a passion project. Using the Reddit API, and continuing to work on text preprocessing techniques.
hi, help pandas :'(
how to get unique rows in dataframe, not series?
full_df.loc[:,['MONTH','YEAR']] , I want rows of unique month and year
oh drop duplicates silly me
next question, how do I cross join in pandas
okay nevermind found it.. but why use merge vs join?
This is nice overview of merge, join, concatenate https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
i had a question about training and validation in pytorch
i have the following training function:
ty!
def train(epochs, trainloader, model, criterion, optimizer, validloader):
    TRAININGLOSS = []
    VALIDLOSS = []
    VALIDACCURACY = []
    correct = 0
    for epoch in range(epochs):
        for x, y in trainloader:
            x = torch.from_numpy(np.asarray(x)).float()
            y = torch.from_numpy(np.asarray(y)).float()
            yhat = model(x.view(-1, 288 * 96))
            predictedvalue = torch.max(yhat, dim=1)
            loss = criterion(predictedvalue[0], y)
            print("Training Loss: ", loss)
            TRAININGLOSS.append(loss)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        for x, y in validloader:
            x = torch.from_numpy(np.asarray(x)).float()
            y = torch.from_numpy(np.asarray(y)).float()
            yhat = model(x.view(-1, 288 * 96))
            predictedvalue = torch.max(yhat, dim=1)
            validloss = criterion(predictedvalue[0], y)
            print("Validation Loss: ", validloss)
            VALIDLOSS.append(validloss)
            label = torch.argmax(yhat)
            correct += (label == y).sum().item()
        accuracy = 100 * (correct / len(valid_dataset))
        print("Percent Accuracy:", accuracy)
        VALIDACCURACY.append(accuracy)
sometimes though
the accuracy is over a 100 percent which doesn't make sense
and for three epochs of training the cost function remains at around 50
do you guys have any suggestions on improving the training function so that the cost function goes down and the accuracy is calculated correctly
correct += (label == y).sum().item()
accuracy = 100 * (correct / len(valid_dataset))
something wrong with this logic if you got >100%
why do you have sum there when you just want a 1 or 0 there for correct
thats a good point
i didn't realize that
for x, y in validloader:
    x = torch.from_numpy(np.asarray(x)).float()
    y = torch.from_numpy(np.asarray(y)).float()
    yhat = model(x.view(-1, 288 * 96))
    predictedvalue = torch.max(yhat, dim=1)
    validloss = criterion(predictedvalue[0], y)
    print("Validation Loss: ", validloss)
    VALIDLOSS.append(validloss)
    label = torch.argmax(yhat)
    if label == y:
        correct += 1
accuracy = 100 * (correct / len(valid_dataset))
print("Percent Accuracy:", accuracy)
VALIDACCURACY.append(accuracy)
i just added an increment to the correct
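Two things in that loop can push the number past 100: `correct` is set to 0 only once before the epoch loop, so it keeps growing across epochs, and `torch.argmax(yhat)` without `dim=1` returns one flat index for the whole batch instead of one class per sample. A framework-free sketch of the bookkeeping, assuming each validation batch gives one row of class scores per sample (the scores and labels below are made up):

```py
def epoch_accuracy(batches):
    """batches: iterable of (scores, labels); scores holds one row per sample."""
    correct = 0  # reset on every call, i.e. once per epoch
    seen = 0
    for scores, labels in batches:
        for row, label in zip(scores, labels):
            predicted = max(range(len(row)), key=row.__getitem__)  # per-sample argmax
            correct += int(predicted == label)
            seen += 1
    return 100.0 * correct / seen if seen else 0.0


# Hypothetical validation data: two batches of two samples each.
val_batches = [
    ([[0.9, 0.1], [0.3, 0.7]], [0, 1]),
    ([[0.4, 0.6], [0.8, 0.2]], [0, 0]),
]
for epoch in range(3):
    print(epoch, epoch_accuracy(val_batches))  # 75.0 every epoch
```

In the PyTorch loop the equivalent would be resetting `correct` at the start of each epoch and using `yhat.argmax(dim=1)` before comparing with `y`.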
ok cool so what kind of loss functions are you currently using?
so currently im using a binary cross entropy
what kind of data is in x
ok not sure how much I can help.. as I am new to ML :)
np any help is appreciated
class Net(nn.Module):
    def __init__(self, D_in, H, D_out):
        super(Net, self).__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x):
        x = torch.relu(self.linear1(x))
        x = torch.sigmoid(self.linear2(x))
        return x

#%%
model = Net(27648, 100, 2)
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
so i defined my network
and im using a bceloss function
and im using an adam optimizer
i could try incrementing the learning rate
if you using bceloss then you only have 2 classes?
yep
isnt that going to underfit
when u check your bias variance curve is your test and validation data have similar error but high?
i didn't have the opportunity to do that
because i was thinking of ways to make my training
better
oh why dont you just try adding some extra layers for now
if i do that i get a matrix multiplication error
if i add another layer
oh maybe your D_out is wrong
how would that be possible then because its binary classification
i mean when you are defining your hidden layer 2
why would it cause matrix multiplication error, i thought you can just stack as many hidden layers as you want
let me show you one second
also by the way why even have 2 layers for output
if its just 2 classes, cant you just have 1 layer
you mean node?
yeah
yeah and if its beyond a certain threshold
its one class
but i feel like having two separate nodes is best practice
its just that i always see them use softmax if its more than 1 output node
is there any difference if you try a softmax and sparse cross entropy loss function instead
ok
i can try that
so i can create an additional hidden layer, and on the last part of the forward ill add a softmax
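On the one-node vs two-node question: for two classes, a softmax over two logits gives the same probability as a sigmoid applied to the difference of those logits, so the two setups describe the same kind of model. A minimal check in plain Python (the logit values are arbitrary):

```py
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.3, 1.7]  # arbitrary example values for the two output nodes
p_softmax = softmax(logits)[1]
p_sigmoid = sigmoid(logits[1] - logits[0])
print(abs(p_softmax - p_sigmoid) < 1e-12)  # True
```

What matters more is pairing the output with a matching loss: one sigmoid node with a binary cross entropy, or two nodes with a softmax-style cross entropy.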
ok cool, I will add you :) lets practice ML together!
Yes support is for cuda but i have rx560 opencl, any other way to faster?
Opencl is not cuda right?
You can try free colab, paperspace, kaggle or sagemaker studio lab for free GPU
Can I really make it faster with free?
@tacit basin gradient or core? In paperspace?
gradient is notebooks like environment, should be fine.
they have free gpu, should be way faster than cpu
all have limitations on free tier, like active session time, paperspace is 6 hours i think, then the notebook will stop, but you can start again. not sure how much time it needs....
Never used notebook, wav2lip will work?
Yes i used wav2lip recently on command prompt
or you can run command line prompts from within notebook with !ls for example, but terminal would be better in this case i think
once your machine is running select the jupyter icon bottom left of the screenshot
core is like a proper vm, but not sure about free gpus there, let me check
cheapest gpu on core is ~0.50 /hour
Gradient would be better? , Which runtime i should select PyTorch, tensorflow?
that would depend which framework is used for wav2lip
Error : We are currently out of capacity for the selected VM type. Try again in a few minutes, or select a different instance.
would colab work? i can see they have a link to colab in the github repo, so should be easier to set up
No i don't want to use at least google here
looks like they use tensorflow
In cross_val_score
Why do you only need 2 parameters, xtrain and y train?
Don't you also need xtest and ytest to get the accuracy?
If you assume your whole ecosystem is already in Python, is there any reason to use HDFS over dask?
Hey!
```py
>>> torch.__version__
'1.11.0+cpu'
```
what does "cpu" mean here? Does this mean that current version is only cpu compatible?
yes
okayy thanks! how do i install the gpu version of pytorch, I can't find it.
what OS
win 11
and what CUDA version
currently have 11.7
but it shows False when i run torch.cuda.is_available()
try
pip install torch --extra-index-url https://download.pytorch.org/whl/cu117
thanks! i have the conda env, shouldn't i use conda command?
I don't use conda, but the website says
NOTE: 'conda-forge' channel is required for cudatoolkit 11.6
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
I would encourage you to stop using conda unless you're sure that what you're trying to do can't be done without it.
Great i'll keep that in mind ^^
not sure the best place to ask this, but does anyone have experience with matplotlib
it's slow as hell plotting a simple line graph for like 3 points, half a second
Hey, mind elaborating why you say this? Sorry for randomly chiming in
Conda became the default assumption for DS/AI people in a time when compiling certain libraries (like numpy) was a potential pain point. but that isn't really the case anymore, so for the most part, it's a needless quirk that makes it harder for DS/AI people to get support from the rest of the Python community.
I work in the AI department of the company, and almost none of us use conda, and the holdouts are being asked to stop using it so that we don't have to deal with licensing. It likely won't even be installed on the next iteration of our high-performance computer.
not to mention you have to pay for it if you use it at a company
only for enterprise use, afaik
(that being said, i do still use it :P) the commercial versions, yes. for personal use, no
also, I'm having an issue with dask.
import dask.bag as db
bag = db.read_text('/home/blah/**/*.json')
there's 5 million JSONs in there. my program keeps getting killed before this statement even finishes
and the console just says "Killed", so I don't even get any leads about why
ish? maybe? no? it's all quite confusing tbh
but doesnt it allow you to access other sources like conda forge, and no strings attached?
where are you submitting the dask jobs to? a compute server/cluster with limited compute time?
a linux VM where I have sudo. No scheduling.
it was my understanding the conda itself was free and open source, and the repo access to anaconda was where they wanted to monetize large corps. but frankly, i couldnt wrap my head around it all when i looked at it.
no sort of hpc scheduler at all?
nope
maybe add a loop and read jsons one by one yourself, and i assume you can get print statements to display or something
find out which json is erroring out, if any.
there are no Python exceptions
just "Killed"
yeah i was gonna suggest something similar. maybe using a few threads or processes that log their own status and see where they die, and if it's always at the same place
usual disclaimer, i have no idea how any of this works, just thinking of initial ideas
thats fine though, if you control how the jsons are being read yourself, and print as you go, you'd know the last one you managed to read before "killed"
for starters to see if the behavior is deterministic
on multiprocessing using Value and Array, for example, a common issue is that if these synchronous vars get too large, for whatever reason the processes never send the termination signal
could be the task even finished successfully but then the dask bag gets killed because it sits idle
oh that would be funny
then even simpler, just add a print after the read line, and see if the print message appears
if it does, you know the line worked
when this happens though, the main becomes deadlocked and gets killed too
i think starting by having some periodic logging is a good start
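A minimal sketch of that kind of periodic logging, reading the files one at a time with only the standard library instead of going through dask. The function name and `log_every` are made up for the example. A bare "Killed" on Linux is often the kernel's out-of-memory killer, in which case no Python exception is ever raised, so the last log line is the main clue:

```py
import glob
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(message)s")


def scan_json_files(pattern, log_every=1000):
    """Parse files one at a time, logging progress so the last line shows how far it got."""
    ok, failed = 0, []
    # iglob yields paths lazily instead of building a list of millions of names.
    for i, path in enumerate(glob.iglob(pattern, recursive=True), start=1):
        try:
            with open(path, encoding="utf-8") as fh:
                json.load(fh)
            ok += 1
        except (OSError, ValueError) as exc:  # JSONDecodeError is a ValueError
            failed.append(path)
            logging.warning("could not read %s: %s", path, exc)
        if i % log_every == 0:
            logging.info("processed %d files, last: %s", i, path)
    logging.info("done: %d ok, %d failed", ok, len(failed))
    return ok, failed


if __name__ == "__main__":
    scan_json_files("/home/blah/**/*.json")
```

If this loop finishes but the dask version still dies, that points at the dask step (or memory) rather than at a particular bad file.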
my model validation is stuck at 59% after 3 hrs of training
this is my model: ```py
model = Sequential()
model.add(Conv2D(8, 3, padding='same', input_shape=(48, 48, 1), activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(MaxPooling2D(2))
model.add(Conv2D(16, 5, padding='same', activation='relu', kernel_regularizer=keras.regularizers.l2(l=0.1)))
model.add(Dropout(0.25))
model.add(BatchNormalization())
model.add(MaxPooling2D(2))
model.add(Flatten())
model.add(Dense(512, activation='relu', kernel_regularizer=keras.regularizers.l2(l=0.01)))
model.add(Dropout(0.4))
model.add(Dense(256, activation='relu', kernel_regularizer=keras.regularizers.l2(l=0.01)))
model.add(Dropout(0.3))
model.add(Dense(128, activation='relu', kernel_regularizer=keras.regularizers.l2(l=0.1)))
model.add(Dropout(0.2))
model.add(Dense(7, activation='softmax'))
model.summary() ```
Again, only providing your model isn't super helpful here
It depends on so much more
sorry again, the input shape is 48x48
We can see that from the model
the values can be from 0 - 255
no
i tried it and the loss just didnt lower
tho I think now its because I messed up with the batch normalization
Right, but that should normally just be step one of processing image data
Batch normalization makes it less of a problem
But it should still just be common practice
iirc the kind of expression they make, like disgust or happiness?
yes
28709 input faces
i wanted to do data augmentation and was planning to do it in a while
here is a sample input
Seems like plenty of faces to train on, might not need that much data augmentation
Are they all perfectly centered?
i dont know
Again, looking at your data, very important step of the process
like, if you took a bounding box from opencv, like that
i did
even if they're centered though, you could mirror them left-right to get more data
Then you should know if the faces are centered
they are like that
Maybe the bounding boxes weren't always correct
not always
Did you crop the faces yourself (automatically)?
you can start by reading its description and docs, then
Yeah, it should tell something about the quality of the data
i did?
so are they all centered and cropped correctly?
so that the face is more or less centred and occupies about the same amount of space in each image.
And looking at the data, the immediate thing I see is class imbalance, how did you take care of that?
aight
um i didnt
How many times does your model guess disgust?
im being honest, I didnt really do much with the data other than to simply get it to work
not that often
And what is this measure? accuracy/f1/precision etc?
?
its clearly overfitting
can you give a count of how many times each class occurs in the training data?
What do you mean with model validation of 59%?
the validation/evaluation (cuz they were about the same) are 59%
validation is not a measure of performance
so accuracy?
evaluation accuracy is 57-59%
validation accuracy, sorry
Alright, that is also quite a big choice, what performance measure you use
['accuracy']
Accuracy is not the most common way to measure performance with imbalanced data
optimizer is Adam
Imo the first step would be taking a look at your data processing pipeline, making sure the data is good to use for training, and only then worrying about your model
would you suggest that I invest some time into data augmentation and balancing the data
Because right now, you load the image, instantly feed to model
yes, i admit, i rushed to the model
And balancing the data could be done with data augmentation, but you have a lot of imbalance
There are 17 times more samples of happy than disgust f.e.
hmm
So maybe you need some undersampling too
i am not honestly interested in disgust
But there's multiple ways to tackle that, you should look into it
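One of those ways, random undersampling, could look roughly like the sketch below. The label list is made up and `undersample` is a hypothetical helper, not something from the poster's code. Note that undersampling throws data away, so class weights or augmenting the rare classes are alternatives worth comparing.

```python
import random
from collections import Counter, defaultdict

def undersample(labels, seed=0):
    """Return sorted indices keeping an equal number of samples per class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    n_keep = min(len(idxs) for idxs in by_class.values())  # size of the rarest class
    rng = random.Random(seed)
    kept = []
    for idxs in by_class.values():
        kept.extend(rng.sample(idxs, n_keep))
    return sorted(kept)

# Toy labels (made up): 17 "happy" for every "disgust", echoing the ratio quoted above.
labels = ["happy"] * 170 + ["disgust"] * 10 + ["sad"] * 60
kept = undersample(labels)
print(Counter(labels[i] for i in kept))
```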
I am only interested in anger, sad, neutral and happy
Well if you don't care about disgust, remove it
Up to you
may I tell you my end goal?
sure
My end goal is to tell how annoyed/horrible a person is feeling for my program
so I can annoy them even more
Seems like disgust would be the most important here
yeah
It seems most similar to annoyed I'd think, no?
kinda important
id say a mixture of disgust/anger
So especially then, you don't just want to look at accuracy
Because disgust has way less samples, so you can get an accuracy of 90% without once guessing disgust
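To make that concrete, here is a small sketch with made-up labels: a model that never predicts "disgust" still scores 90% accuracy, while per-class recall shows the problem. The helper names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_per_class(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up labels: 90 "happy" and 10 "disgust"; the model always answers "happy".
y_true = ["happy"] * 90 + ["disgust"] * 10
y_pred = ["happy"] * 100

acc = accuracy(y_true, y_pred)
recalls = recall_per_class(y_true, y_pred)
balanced_acc = sum(recalls.values()) / len(recalls)  # mean of per-class recall
print(acc, recalls, balanced_acc)
```

Balanced accuracy (the mean of the per-class recalls) or a macro-averaged F1 are common choices when the classes are this uneven.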
I won't be on in 30 mins, but maybe someone else can help
ok thanks
Hmm...unethical much
Its supposed to be you start happy - get annoyed, get little happier, - get annoyed again, rinse and repeat
so you dont uninstall cuz the happy points
but in the end, you're still annoyed
I didn't know FER-2013 existed, personally I always feel a bit uneasy using data concerning humans. But that's just me.
Does seem a few papers on it
isn't fer-2013 state of the art like 70 something%?
assuming the dataset is balanced, then 59% is pretty decent of an accuracy I'd say
though if you're getting near perfect accuracy in training, then maybe you might wanna do something like early stopping or revise your regularization techniques
dataset is not balanced..at least the version on Kaggle
there are papers describing models that have reached +90% accuracy... nevermind, you're right, I was looking at something else
what means nvm ?
nvm stands for "nevermind"
Decided to implement this, but am getting:
Traceback (most recent call last):
  File "solution.py", line 112, in <module>
    main()
  File "solution.py", line 42, in main
    fileB.apply(skus_A.__contains__)
  File "/home/remotelinux/.local/lib/python3.8/site-packages/pandas/core/frame.py", line 8845, in apply
    return op.apply().__finalize__(self, method="apply")
  File "/home/remotelinux/.local/lib/python3.8/site-packages/pandas/core/apply.py", line 733, in apply
    return self.apply_standard()
  File "/home/remotelinux/.local/lib/python3.8/site-packages/pandas/core/apply.py", line 857, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/home/remotelinux/.local/lib/python3.8/site-packages/pandas/core/apply.py", line 873, in apply_series_generator
    results[i] = self.f(v)
TypeError: unhashable type: 'Series'
On code:
skus_A = set(fileA['sku'].tolist())
fileB.apply(skus_A.__contains__)
you have to apply it to a specific column, not the whole DataFrame.
if you want to apply it to every cell in the DF, use applymap instead of apply
also @wooden sail @ripe forge the problem was that I was using like 50GB of memory, and someone else was using about 40. so we're now trying to get more RAM.
sudo apt-get install ram
ah fair enough, trying to download some ram
and i suppose doing it single threaded, chunks at a time, takes too long
I wrote a shitty version of what I was trying to do with multiprocessing.Pool, and it's still taking hours
wonderful
pandas documentation has me confused sorry, should it be df['sku'].apply() then?
df['sku'] is probably a Series. and DataFrame.apply isn't quite the same as Series.apply.
for DataFrame.apply, the input is a Series. for Series.apply, the input is an element.
I'm confused then, how do I apply it to a specific column?
this
if you do df.apply, you're doing DataFrame.apply
if you do df['sku'].apply, then it's Series.apply
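A small sketch of the difference, using toy frames in place of the real `fileA` / `fileB` (their contents are an assumption). `Series.isin` is shown alongside as the vectorised way to get the same mask.

```python
import pandas as pd

# Toy frames (assumption): the real fileA / fileB come from the poster's files.
fileA = pd.DataFrame({"sku": ["a1", "b2", "c3"]})
fileB = pd.DataFrame({"sku": ["b2", "x9", "c3"], "qty": [1, 2, 3]})

skus_A = set(fileA["sku"])

# Series.apply passes each element, so set membership works per cell.
mask_apply = fileB["sku"].apply(skus_A.__contains__)

# Series.isin is the vectorised equivalent and usually the simpler choice.
mask_isin = fileB["sku"].isin(skus_A)

print(fileB[mask_isin])
```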
@wooden sail do you know that much about dask, btw? because I don't think it's optimizing my computation graph in a memory-efficient way. though it might also be that it is, and the most memory-efficient solution that it can infer (that doesn't involve writing chunks of it to disk) still exceeds my capacity.
hmm sadly i have only used dask a couple of times, i ended up going back to multiprocessing
how many processes are you spawning at a time and how many cores and threads do you have access to?
not sure about threads. there's like 40 cores though.
(and I just closed my work computer)
all right, so at least 40 partitions should be fine, assuming you have enough memory for all of them. this is very much a thing of "salt to taste" where you try to convince the OS scheduler to always have a few of your processes running by having a ton of them, but not so many that the parallelization and memory overhead bog you down. i could only recommend doing some logging for a few minutes and testing out different numbers of partitions and how much of a file each one handles at a time
a surprising thing is that these servers and clusters with huge numbers of cores are usually slower per core than a bad laptop. it could be that there's just not enough memory for the server to give you any benefit over a new laptop
guys I need help with using np.where
can someone help me with this?
dq_monthly['MoveOutDate'] = np.where(dq_monthly['TenantStatus']=='Current', pd.to_datetime("2022-07-31"), dq_monthly['MoveOutDate'])
I need for tenant status that is not current to just have the original MoveOutDate
```py
if pd.isna(df_concat['value1'][i]) or df_concat['value_1'][i].str.isnumeric():
    if pd.notna(df_concat['value2'][i]):
        df_concat['value1'][i] = df_concat['value2'][i]
```
anyone know why this will give me an error on line2
in the isna statement
for one df/dataset but not for another
weird inconsistency
omfg its a typo thats why
fixed typo still error 0
I don't think we should even try to pick apart how this code works. what is it intended to do?
because it should almost certainly be rewritten.
you can do this with a .loc assignment, without using np.where
dq_monthly.loc[dq_monthly['TenantStatus'].eq('Current'), 'MoveOutDate'] = pd.to_datetime("2022-07-31")
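A runnable sketch of that `.loc` assignment on toy data. The column names follow the chat, but the values are made up.

```python
import pandas as pd

# Toy data (assumption): column names follow the chat, values are made up.
dq_monthly = pd.DataFrame({
    "TenantStatus": ["Current", "Past", "Current"],
    "MoveOutDate": pd.to_datetime(["2022-01-15", "2021-06-30", None]),
})

is_current = dq_monthly["TenantStatus"].eq("Current")
# Only rows where the mask is True are overwritten; the rest keep their MoveOutDate.
dq_monthly.loc[is_current, "MoveOutDate"] = pd.Timestamp("2022-07-31")

print(dq_monthly)
```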
disagree, using .values
all my values are strings by default
im checking if they, when converted, can be float
or if theyre just strings only
for example
'XFJ28.0' shud check the next value, and if its something like '22.10' then it shud be changed to be 22.10
partition = element.partition('.')
if ((partition[0].isdigit() and partition[1] == '.' and partition[2].isdigit())
        or (partition[0] == '' and partition[1] == '.' and partition[2].isdigit())
        or (partition[0].isdigit() and partition[1] == '.' and partition[2] == '')):
    newelement = float(element)
something like this maybe
def is_float(string):
    try:
        float(string)  # raises ValueError if string is not a number
    except ValueError:
        return False
    return '.' in string  # True only if the number is written with a dot
@serene scaffold know any better way to check this? string can be converted to float
im thinking to use astype perhaps
df_concat['value1']= df_concat['value1'].astype(float)
this will error if it hits a non convertible
so i had to use a loop
how else to do it
Finally managed to get my GAN to work, although in pytorch, not from scratch like i planned
@serene scaffold i have brute forced this
```py
new = []
for i in range(len(df_concat['value1'].values)):
    try:
        new.append(float(df_concat['value1'].values[i]))
    except (TypeError, ValueError):  # value cannot be converted to float
        new.append('string')
```
now i just create a new column out of this
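For comparison, `pd.to_numeric` with `errors="coerce"` does the same conversion without a loop, leaving NaN where a value cannot be parsed. The sketch below uses a toy column; the name `value1_num` is made up for illustration.

```python
import pandas as pd

# Toy column (assumption) mixing numeric strings and plain text, as described above.
df_concat = pd.DataFrame({"value1": ["22.10", "XFJ28.0", "7", None]})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
df_concat["value1_num"] = pd.to_numeric(df_concat["value1"], errors="coerce")

print(df_concat)
```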
what do you think to do if over half of a column is NaN
whats a good cut off to just say 'ok were not using this' rather than impute?
approx 65% of one of my key features is missing
why doesn't dropna work for me
dq_monthly['TenantStatus']= dq_monthly[['TenantStatus']].dropna()
you can't have "holes" in a dataframe--NaN is the value that represents missing data. Even though dq_monthly[['TenantStatus']].dropna() removes the NaNs, when you overwrite that data back into the DataFrame, it has to put the NaNs back for all the rows that dropna drops.
isnt dropna just working like a filter
So, the DataFrame has 2458438 rows. If you take one column of that and do dropna, it will have that many rows, or less
but if you add that column back to the DataFrame, it absolutely must have a row for every single value in the index
i dont understand what the point of dropna is then
it removes the NaNs. the problem is that when you put that column back in the DataFrame, it has to put them back, to make up for the rows you deleted
so I don't set it equal to dq_monthly['TenantStatus']
because if you still want the column to be part of the dataframe, it has to have a value for every row. and for the missing rows, that's going to be NaN.
I just do dq_monthly[['TenantStatus']].dropna()
if a value is NaN in the TenantStatus row, are you okay with completely deleting that row from the whole dataframe
yes
even if there's a non-NaN value in other columns for that row?
that is specifically what I want to do
those NaN's are bad or old data not relevant to the analysis
then you should do dropna on the dataframe, not on an individual column.
!docs pandas.DataFrame.dropna
```py
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
```
Remove missing values.
See the [User Guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data) for more on which values are considered missing, and how to work with missing data.
the problem is there may be NaNs in other columns that are irrelevant that I need
note the how='any'. if any of the values in the row are NaN, that whole row goes bye bye.
need in terms of row
yea I dont want to do that
I want to just delete the records with NaNs for TenantStatus
you can do df[~df['TenantStatus'].isna()]
ok
which means "pick the rows of df where not tenant status is nan"
i think thats cleaner
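A sketch of both options on a toy frame (the `Notes` column is made up). The `subset` parameter from the signature above restricts the NaN check to the listed columns, so NaNs elsewhere do not cause a row to be dropped.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption): "Notes" has NaNs that should NOT cause rows to be dropped.
df = pd.DataFrame({
    "TenantStatus": ["Current", np.nan, "Past", np.nan],
    "Notes": [np.nan, "old", np.nan, np.nan],
})

by_mask = df[df["TenantStatus"].notna()]        # boolean-mask filter
by_subset = df.dropna(subset=["TenantStatus"])  # same rows via dropna

print(by_subset)
```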
df[column].dropna(inplace=True), done
I would just never use in-place anything
I think it's even being deprecated
in-place operations have no optimizations
Saves u having to type df[col] =
they just recreate the whole df under the hood
But that's the alternative
no? in-place operations on dataframes are alternatives to df = df.method()
and we're talking about removing rows from the whole df
That would do so if the na is in the column
still getting NaN's
i did that originally and that just replaces original values in there
Because that will show nans
that's what they did originally, and that doesn't work
This command doesn't replace anything
It literally drops when there's a nan
if you do dropna on a column, but then write that column back to the df, it has to replace all the rows that got dropped for having nans with nans
so it's the most pointless thing you could possibly do.
But u don't write it back to the df like that u just declare the df is without the nans in the first place
anyway, I already gave them the solution, for better or worse.
Just try the inplace drop na may work
Man I have so much god damn pandas work to do
@serene scaffold if you have a key predictor that's numerical and ur population is NAN for like 60% of the feature what do u do
I don't know, I do nlp.
Think I shud keep them and just make them all medians? Keep other information
Or try a linear imputation
Perhaps
There's no way it's worth dropping hundred thousand samples
lel
Heya guys, I've never coded an AI before but I want to learn how to do so. any tips on how to start?
it's a bit overwhelming for me. I know the very basics that's it, but I find it very interesting.
@coral cradle so, don't plan to create anything groundbreaking in the foreseeable future. Because this is something that people get PhDs and then spend their entire careers working on. Focus on projects that aren't unique, but which help you develop a sense for what AI is and what the concepts are. What are training and test data? What are features? What is a model, and what makes a model "good" or "bad"?
oh ok I'll try to think simple for now
You can even start by reading Wikipedia and following the links, like what are "intelligent agents"? This will give you a ton of keywords / concepts to look into (including those mentioned by Lurcus): https://en.wikipedia.org/wiki/Artificial_intelligence
Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals including humans. AI research has been defined as the field of study of intelligent agents, which refers to any system that perceives its environment and takes actions that maximize its chance of achieving its goals....
You will see a lot of the same words come up in a lot of the articles, such as "agent", "environment", "actions", "goals", "maximize", "perception", "statistics", "planning", etc. I would take note of these as it will make it easier to find what you are looking for by understanding and using them in your searches.
(And try to re-define what you consider to be AI in terms of those words so that you can find something that matches your goals)
hey guys, how does checking probability of one distribution being greater than the other work when the distributions are mixed like this
probability of being greater? like you draw random samples from all 3 and you look for the probability that you drew a larger number from one of them than from the other 2?
yes probability of being greater
I am not sure if sampling is done
but there is a formula, probability matching, that selects "actions" based on the probability that one distribution is greater than another
but if we sample from all 3, then the "select action" is redundant right?
cause we sampled already, why select action if we already sampled
i'm not sure i understand what it means by Q(a) > Q(a') there when those are curves
i don't find the explanation clear
Q(a) happens to be the distribution I believe
indeed
and a and a' are different distributions
yep
honestly I think its either the mean of the distribution
or the distribution itself
i doubt it's the mean, it probably has more to do with the tails, but that's why i was asking about sampling
the normal distribution
i can see that.
you'd have to scroll back in the slides to where they defined what Q(a) > Q(a') means
alright, ~~probability~~ *probably* have to do that
but there is no way to compare them other than sampling them and counting the number of times one distribution is greater than the other right?
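As a sketch of the sampling idea, assuming "one distribution being greater than the other" means P(X > Y) for independent draws (the slides may define it differently): draw pairs and count how often the first draw wins. The two normal distributions are made up, and `prob_greater` is a hypothetical helper. For independent normals a closed form also exists, which the sketch uses as a check.

```python
import math
import random

def prob_greater(sample_x, sample_y, n=200_000, seed=0):
    """Estimate P(X > Y) by drawing independent pairs and counting wins."""
    rng = random.Random(seed)
    wins = sum(sample_x(rng) > sample_y(rng) for _ in range(n))
    return wins / n

# Two made-up normal distributions standing in for Q(a) and Q(a').
mu_x, sd_x = 1.0, 1.0
mu_y, sd_y = 0.0, 1.0
estimate = prob_greater(lambda r: r.gauss(mu_x, sd_x), lambda r: r.gauss(mu_y, sd_y))

# For independent normals, X - Y is normal too, so P(X > Y) = Phi(z) with
# z = (mu_x - mu_y) / sqrt(sd_x^2 + sd_y^2).
z = (mu_x - mu_y) / math.sqrt(sd_x**2 + sd_y**2)
exact = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
print(estimate, exact)
```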
thats the part I got from your first question/answer
that was my first impression, which would let you answer this question for specific values of the variable that follows these tentative distributions. like given a value you wish to observe, find which distribution is the most likely. as you'd expect, this receives the name "maximum likelihood". but since they don't call it that, i suspect they mean something different
so you'll have to scroll back
i do notice they give you an ht quantity. this might be the given value for which you want the probability
so maybe something like P(X >= ht) when ht ~ Q(a), for example
ah, ht is the history, its the given condition in a posterior distribution
so I won't worry about it for now
might I ask a different question since I feel fuzzy about it. How does two distributions get compared in something like a ratio P[X] / P[Y]
i think you'll need the ht for this... you really should go back and review
P[X] is the probability of X = x? discrete distributions? or?
yes, suppose say we have two different distributions P[x] and P[y] (not sure if this is the correct way of saying this) and if we were to compare them like P[X] / P[Y], do I just divide where ever they have the same input variable?
I hope my question makes sense
not really...
well, what you describe there is really just "division"
since the domain of probability distributions is the values the variable can take
but this implies that the two distributions are over the same domain
it would be better if you could find this in your notes too, then report back
oh ok, I thought there was some more things going on with it, probably got confused myself
thank you for taking the time to answer Edd
it could be more stuff is going on, but usually in cross entropy or mutual information or kullback leibler divergence, these terms with ratios of probability distributions show up
as long as they're two possible distributions for the same variable(s), it should indeed be just vanilla division
ahhh this helps, thanks again
it works weird at first, but they usually show up inside logarithms
so division of probability distributions translates into like differences of information, since the info is related to log probabilities
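For example, the Kullback-Leibler divergence between two discrete distributions over the same outcomes is a sum of log-ratios weighted by the first distribution. A minimal sketch with made-up numbers follows; `kl_divergence` is a hypothetical helper that assumes `q` is nonzero wherever `p` is.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same outcomes.

    Assumes q[i] > 0 wherever p[i] > 0; terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two made-up distributions over the same three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

print(kl_divergence(p, q), kl_divergence(q, p), kl_divergence(p, p))
```

The ratio p/q is taken outcome by outcome, which is the "vanilla division" mentioned above, and the divergence is not symmetric in its two arguments.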
I see, is there any book you recommend particularly on distributions comparisons
like the probability matching, division, etc
i currently have The Elements of Statistical Learning by Trevor Hastie but have yet to read it
books on estimation theory and statistical sig proc. what you mentioned, at least by name, sounds suitable
> estimation theory and statistical sig proc
got it, I will try looking for these
thanks again my man
i like "fundamentals of statistical signal processing" by steven kay
ah louis scharf's statistical signal processing: detection, estimation, and time series analysis is another good one
alright, thanks for the rec again
hello guys I need some help scraping some messages from my google messages using selenium but I am getting some error with my xpath
is there any better scraping tools other than selenium that works better for nested divs and custom angular tags
When trying to read a large Excel file with pd.read_excel() am getting [1] 129017 killed python3 solution.py
is someone here into finance? I'd like to learn ML/AI related with it.
I doubt someone in here will know the answer for that, its a #data-science-and-ml channel
hello, how can i get both product name and family name?
the website doesn't have api to request from it
hi, can anyone please help me to write a good summary for my LinkedIn profile as I'm a first-year data science student
In what context are you trying to do this
I don't understand how this helps, because we only calculate f(k, n, p) for every k up to n/2. We never calculate f(k, n, 1-p) so how would this make it any simpler?
This is about binomial distributions btw
https://en.wikipedia.org/wiki/Binomial_distribution
In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes-no question, and each with its own Boolean-valued outcome: success (with probability p) or failure (with probability q = 1 - p). ...
why don't we?
Well what it suggests is that we don't need to calculate the second half, because we can easily calculate it from the first half results
mhm
But then it says that we can simply get the result by taking the "complement" f(n - k, n, 1 - p) but we have not calculated that in the first half
We have calculated f(n - k, n, p)though
yes but note that taking p <- 1-p simply swaps the two multiplied terms. you can think of it as swapping the exponent of p with that of 1-p
substitute p with 1-p and see what you get
Yeah, but then we are just calculating it from scratch again
I have created the function f, and it works correctly, when I do:
success_arr = [successes(n, k, p) for k in range(n + 1)]
It also works correctly if I do:
first_half = [successes(n, k, p) for k in range((n + 1) // 2 + 1)]
second_half = [successes(n, n - k, 1 - p) for k in range((n + 1) // 2 + 1, n + 1)]
success_arr = first_half + second_half
But how do I use first_half to get second_half without having to call successes again?
You cache it / make a lookup table.
(Which is what first half is)
Right, so that is creating the first_half list, but then how do I get to second_half ?
You don't, you have just first half is the idea. Less memory.
hello, I have an assignment in which I have to use two pictures take the person from the first picture and the background from second picture and merge them together. I'm a beginner, and wanted to ask what would be the best approach and any tips?
Right, but I want to also know the second half, without having to calculate them each from scratch
It's not a memory problem
If you ever want a value that would be in the second half, it can reference into the first half. Avoiding computation.
But how is my question
Let's say I want the amount of successes of k=15 when n=20
Maybe i'm not sure what the question is, but it seems to me they are just making a lookup table and only computing half because of symmetry.
let's take a look. we start with binom(n, k) p^k (1-p)^(n-k). let's set p = 1-p.
we then get binom(n, k) (1-p)^k (p)^(n-k). now let's swap k with n-k
binom(n, n-k) (1-p)^(n-k) p^k
but we know there is symmetry for the binom part. we can also see that the power parts look identical to how they originally did
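Written out, the substitution above reads as follows. The notation f(k; n, p) for the binomial pmf is an assumption here; the chat writes it as f(k, n, p).

```latex
\begin{aligned}
f(k; n, p) &= \binom{n}{k}\, p^{k} (1-p)^{n-k}, \\
f(n-k; n, 1-p) &= \binom{n}{n-k}\, (1-p)^{n-k}\, \bigl(1-(1-p)\bigr)^{n-(n-k)} \\
               &= \binom{n}{n-k}\, (1-p)^{n-k}\, p^{k} \\
               &= \binom{n}{k}\, p^{k} (1-p)^{n-k} = f(k; n, p),
\end{aligned}
```

where the last step uses the symmetry of the binomial coefficient, \binom{n}{n-k} = \binom{n}{k}.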
(Lookup tables are for speed, and having to only store half is for memory)
lemme see if i can whip up a python MWE
import matplotlib.pyplot as plt
from math import factorial as fact

def choose(n, k):
    return fact(n) / (fact(k) * fact(n - k))

def successes(n, k, p):
    return choose(n, k) * (p ** k) * ((1 - p) ** (n - k))

n = 100
p = 0.4
fig, ax = plt.subplots()
success_arr1 = [successes(n, k, p) for k in range(n + 1)]
# first_half runs up to and including k = (n + 1) // 2 so that no index is skipped
first_half = [successes(n, k, p) for k in range((n + 1) // 2 + 1)]
second_half = [successes(n, n - k, 1 - p) for k in range((n + 1) // 2 + 1, n + 1)]
success_arr2 = first_half + second_half
cum_success_arr = []
ax.scatter(range(len(success_arr1)), success_arr1)
plt.show()
This is my code btw
@wooden sail
(Or in the case of by-hand, so you don't need to do as much work filling out the whole table)
And maybe I am confusing you a bit, because my question isn't why f(n-k, n, 1-p) gives the same answer as f(k, n, p) , but how it actually helps me in any way calculate the second half more easily
By more easily do you mean speed? Because lookup tables may be faster than doing the actual f(...).
(This was way more common on older machines for various functions including stuff like binomial coefficients)
Yes, but we have not calculated f(n-k, n, 1-p) yet
We have only calculated f(n-k, n, p)
Because that is part of the first half
what i would point out is that (1-p)^(n-k) can be dealt with with a binomial expansion, and so this product is also symmetric
so the key thing is showing that we can easily swap p with 1-p and it yields the same result thanks to this symmetry
gimme a second to get some paper
n and n-k and p and 1-p
(Or if you look at Pascal's triangle you can see that you only need to store half)
For the 1-p part. Try putting 1-p instead of p into f.
By doing the first half, we basically have. We need to look it up though.
Could you perhaps show how to get the values for the second half using the first half then?
Or an example
ah, but note they say "tables". if you look at the tables in statistics books, they mean all of n, k and p vary. what they mean is that you can get the second half from another distribution with probability 1-p, not from the same one
Hmm
So calculating the first half doesn't help us more easily calculate the second half then?
does not appear to be the case. i played around with the binomial expansion and you get some quite interesting expressions, but nothing from which you can (easily) look for equivalences
Hmm right, well maybe you can use the binomial coefficient of the first half
it's more that you compute pairs of distributions together
But not the rest
Well thx for looking into it @iron basalt @wooden sail ^^
Just trying to get a bit more into probability, since I was struggling understanding that chapter in some books about maths for ml
the following two identities hold:
we already showed that f(k,n,p) = f(n-k,n,1-p), which we check by substitution. now consider
f(k,n,1-p). if we do the same substitution here, we find that f(k,n,1-p) = f(n-k,n,p), which is the bit that, as you noted, was missing. with these two together, it seems we get both f(k,n,p) and f(k,n,1-p) as a 2-for-1
You can def. use the symmetry of the binomial coefficient. Note that you can pre-compute p^k from 0 to some k. And (1-p)^k2 from 0 to some k2 (giving two more tables).
you can also consider f(n-k,n,p) and f(k,n,1-p) this way. this was already included above, oops
You can compute a third table of those two multiplied together.
This doesn't help us in calculating the second half right?
you still have to compute everything for one of the 2. it's more like if you computed one fully, you also know the other
the curves are pairwise related to each other
Right, but none of these help us calculate one half from the other
It just helps us calculate some other distribution
yeah
You can see how across different p. It's the same but in reverse.
.902 .810 .723 .640 .563 .490 .423 .360 .303 .250 .203 .160 .123 .090 .063 .040 .023 .010 .002
...
.002 .010 .023 .040 .063 .090 .123 .160 .203 .250 .303 .360 .423 .490 .563 .640 .723 .810 .902
(n = 2, k = 0, k = 2)
"Second half" is the second half of that whole table linked (varying n, k, and p).
here's a MWE:
import numpy as np
from scipy.special import binom
import matplotlib.pyplot as plt

n = 15
p = 0.234523

# f(n,k,p) computed directly for every k
fnkp = np.zeros(n+1)
for k in range(n+1):
    fnkp[k] = binom(n,k)*p**k*(1-p)**(n-k)

# f(n,k,1-p) computed directly for every k
fnk1_p = np.zeros(n+1)
for k in range(n+1):
    fnk1_p[k] = binom(n,k)*(1-p)**k*(p)**(n-k)

# shortcut: f(n,k,1-p) is just f(n,k,p) read in reverse
fnk1_p_synth = fnkp[::-1]

plt.plot(fnkp)
plt.plot(fnk1_p)
plt.plot(fnk1_p_synth,'*')
plt.legend(('f(n,k,p)','f(n,k,1-p)','shortcut for f(n,k,1-p)'))
plt.show()
as squiggle notes, what they explained in a cursed way as "compute up to k = n/2" is the same as saying, we can compute all of f(n,k,p) and just flip it to get f(n,k,1-p)
the caveat being that, depending on your implementation of the binomial coefficient, it is cheaper to stick to low values of k. then the first half of f(n,k,p) gives you the second half of f(n,k,1-p) for cheap, and in the reverse direction as well
@mild dirge
Lookup tables can be slower on modern machines. Simply because you are loading in more memory into the CPU's cache, which kicks out other stuff. CPU is so fast it can do stuff like f(...) really fast. But you should always time it yourself to see, because f(...) may be complex enough to warrant it.
(Memory speed is the bottleneck on modern machines (so you won't find lookup tables as common as before (still can show up though, just not as obvious as a decision to make as on old machines where they were required)))
Hmm
(Or by-hand (very slow, want to avoid work))
So you calculate the first half of both f(n, k, p) and f(n, k, 1-p) and then you have both distributions fully
is that the idea?
Because the first half is cheaper to compute
Cheaper to compute than the whole.
It's just avoiding repeat work.
You can even have Python do that caching for you with the cache tools built in.
Just need to annotate f.
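A sketch of that, assuming "the cache tools built in" refers to `functools.lru_cache` from the standard library. The argument order `f(k, n, p)` follows the chat, and `math.comb` needs Python 3.8 or newer.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def f(k, n, p):
    """Binomial pmf: probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.4
pmf = [f(k, n, p) for k in range(n + 1)]  # 21 cache misses, one per k
again = f(15, n, p)                        # served from the cache
print(f.cache_info())
```

The cache only skips work for repeated identical arguments. It does not know about the symmetry f(k, n, p) = f(n-k, n, 1-p), so that lookup would still have to be written by hand.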
that's about right, yea
Alright cool
as a sidenote, scipy has several ways of computing the binomial coefficient. binom is the fastest afaik, but also loses precision very quickly. my guess is it uses the gamma function instead of factorials. you can pick your poison between speed and precision
Hey everyone, I had a general question to ask the group.
It seems the AI/machine learning/NLP world is pretty much reserved for those with degrees (not a complaint, just a finding from my humble investigation) so as a person on the self taught road, can anyone point to a small role I could play on the team? I don't have delusions of grandeur and I don't assume I would be able to compete with PhD holders, however there must be like "the boring stuff" you super smarties don't want to do that someone super entry level on the team could be tasked with... but are even those roles reserved for college graduates? Anywhooo... just more of a musing... trying to find my place in all this
data entry and cleanup is something everyone needs and no one wants to do, and does not (always) require that much in depth knowledge, but domain-specific knowledge is always involved
Well gentlemen I choked hard on my interview
They had me returning functions which were pandas manipulations and I totally fucked it
It was expected output style
Ngl it was insanely easy too, just choked
Thank you, and I appreciate the honest reply
also, if you're not aiming for a research position, you may be able to find your way with a nice "portfolio", as the kids call it nowadays, but i have no experience with the non-research end of this stuff
There is a path in machine learning that is more engineering focused called machine learning engineer or MLE
These are also very competitive but can be more accepting of less formal education than data science or research scientist roles
Even for research if you can show that you can do stuff like implement papers and that you can do so quickly, you can out compete PhD holders as they often request higher pay, but may not actually give more value. THIS DEPENDS HIGHLY ON THE COMPANY. If it's a large company it may just do the lazy thing of filtering by degree when hiring.
i don't know any place that would hire a non phd for a research position in europe
the market seems to work very differently in america, so ymmv
I know of one company that will hire for research no degree in Europe. If it's an institute then probably not. Because they love prestige.
FAANG is willing to hire ML research people without PhDs, but not entry level, usually you look at these peoples resumes and say "yeah well that makes sense"
they were 4 years MLE at openai and got on papers and transition into research type of people
the thing is that institutes work at a lower level, not really at "production" level. what you mentioned of implementing papers, for instance, is like the bare minimum expected of everyone. from what i've seen, they wanna see you write papers, write proposals, lead projects, and so on
so more on the "you make the new state of the art" line
You definitely need to stand out and show that you can provide value without relying on a degree to make up for that.
the phd title is supposed to show you are currently the state of the art at something. ofc that's not always the case in practice, but that is the intent. having published brand new results with theory, proof, and verification
In the case of the European company I know of, they were doing SOTA with many non-degree holders. But each member really shows that they can do it, e.g. by having their own papers, blogs, books, videos, etc.
you either need a degree or you need experience that obviously makes up for the lack of a degree (and anyone who looks at your resume would agree that your experience is as valuable as someone's masters)
You can't just skip the learning, you need to actually be as knowledgeable as the PhDs and be able to prove this in your projects or published papers
that sounds about right
Yeah, and it's about showing that you provide value in the end, that is why you are hired. So whether that is your math skills or simply being good at data cleanup and willing to do it.
(And presentation is key for that, have some github repos / projects, papers, books, videos, blogs)
(Or simply be well known in the community)
I have a research scientist role without a PhD - The key was I spent time at a research institute and published papers that would be enough for a PhD thesis + strong engineering and open source projects
(But even with all the effort, some large companies will still just be lazy and filter by degree, and that is the true advantage of a degree if you are just looking for any work (you don't get filtered out blindly))
Fascinating points!!!! thank you. If I may... I'd like to summarize a bit of the vibe I'm getting....
I am hearing that I basically need to specialize in something. In doing so, I can create specific applicable experience that would perhaps maybe with some luck outweigh my lack of a degree?
To quote Dumb and Dumber, ironically a favorite movie of mine... "so you're saying there's a chance?"
Yes, but there is no easy path. No shortcut. As Anokhi wrote, you can't skip the learning, even if you want to just do the more simple tasks for them.
u dont need a phd to do any of those
a masters would suffice, but even still not 100% required
im sure in america if ur from there theres plenty of opportunity to do this without even a degree at all, but over here 2/3rds of roles are 'masters or phd minimum' in the description for ds/ml
Thanks for the honesty.
unless of course u want to be a researcher, and yes thats pretty much phd only
id 100% say college is worth it for this field
feels bad man
it is what it is
basically i cud do all that stuff in my sleep, group, sort, make a df, but in their IDE it was weird af, it had an expected output table i had to mirror and i wasnt used to writing pandas inside a function and i just flopped, cudnt read the error output either in the ide
its kinda dumb, because on the job youd a) not need to do such manipulations often and b) when you do, you do it your own way + have more time than 30 mins
not really indicative of ability imo
If you don't have a degree, your best bet would be to work for a startup. Established companies will usually immediately ignore applications for NLP positions where the applicant doesn't have a relevant degree.
Yea awesome! someone else also mentioned a startup. Thank you for the reply
if ur rly good and can prove that somehow i doubt anything will stand in ur way if u get xp at a startup and show ur good to large companies
the only issue i see is that they sometimes autofilter people out without degrees, but with 5+yoe u shud be ok
in pandas is there any way to generate a dataframe from a list of csv lines?
i.e.:
[
'1,2,3',
'4,5,6'
]
pd.DataFrame(csv_list, columns=['a', 'b', 'c'])
a | b | c
1 2 3
4 5 6
(pseudocode)
could u write a loop to read them all in as dfs and then concat them if thats what u wanted
oh csv lines not files
id read the lines u want into python first
can you base it off my example?
is there any way to go from a list of strings that are in csv format to a dataframe?
that is correct
oh wait
try read csv with a sep
no wait
read_csv doesn't accept string input
it has to be a file or buffer
and instead of converting it to a buffer
i could just use str.split(',') on each item in the list
convert it out of string format i guess
and read it as a list
yep
that sounds like a lot of unnecessary work
what do you suggest?
its literally 0 work its 1 line
dont try parsing it into a csv yourself, that can be error prone too. better to let pandas handle it. my suggestion is, join it into a single string, wrap it into a StringIO. and that pandas will happily accept
^
i mean use whatever method works for u
it's a very elegant way to essentially have strings be treated like "files", and then be fed into anything that accepts files
but he wants 1 string to be an entire row with columns
what do you mean? pandas has no issues parsing csvs with column headers if that's the concern
sounds like his 'csv' looks like a bunch of single string lines
this worked, and is probably going to be easier to avoid issues, thank you
no worries!
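for reference, a minimal sketch of the StringIO approach suggested above, using the example list from the question. the column names a, b, c come from the pseudocode, and header=None assumes the lines carry no header row:

```python
import io

import pandas as pd

csv_lines = ["1,2,3", "4,5,6"]

# Join the rows into one CSV-formatted string and wrap it in a file-like buffer.
buffer = io.StringIO("\n".join(csv_lines))

# header=None: the lines contain only data, so column names are passed explicitly.
df = pd.read_csv(buffer, header=None, names=["a", "b", "c"])
print(df)
```

this leaves quoting, escaping and dtype inference to pandas, which is the part that tends to go wrong with a manual str.split(',').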
Can anyone recommend a type of learning roadmap for what steps/courses I should take to self-teach machine learning and artificial intelligence? I'm currently a python beginner. I know the fundamentals and nothing else. Looking for my next step other than just practicing projects on leetcode.
Hello, I'm having trouble getting the right version of pytorch and cuda. The system has 2 versions of cuda 11.1-gcc-9.1.0 and cuda 11.2, that can be loaded as required. I have tried both, however torch.cuda.is_available() is always False.
Details of the system are as follows:
python=3.8.6
pytorch = 1.10.1+cu111
cudnn = 8005
torchvision=0.11.2+cu111
torchaudio=0.10.1
cudatoolkit=11.3.1
OS Details:
Operating System: Red Hat Enterprise Linux Server 7.9 (Maipo)
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
I have also tried the latest pytorch version with cudatoolkit=11.3 from the https://pytorch.org/get-started/locally/
try also saying what OS
also, make sure you're using the same python environment where you have those dependencies installed.
if you're using conda, I can't help you beyond that.
Updated the os details, yes used conda to install the dependencies
how are you doing the loading? something like purge modules and then module load cuda/v.xxx?
going by the info you put there, it seems you need cudatoolkit=11.2 or 11.1 though
Yes module load cuda/v.xxx
since you put cu111, i'm under the impression you'd need the module cuda/v11.1
there is cuda 11.1-gcc-9.1.0
i think the cudatoolkit also needs to match that
pip install torch==1.10.1+cu111 torchvision==0.11.2+cu111 torchaudio==0.10.1 -f https://download.pytorch.org/whl/torch_stable.html
this was the last command I used from pytorch
I'll try to see If i can match the toolkit to 11.1
also, are the nvidia drivers installed?
I'm assuming so, the machine is a shared hpc machine. nvidia-smi is not available. i'm not sure how else I can check. nvcc returns the following result
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
Downgraded the cudatoolkit to 11.1.1, still the same result
hmm and it's a single hpc machine? but are the computations handled in VMs or possibly other nodes? or it's all done directly on this device, in this same session?
It's JADE-2, "Applications can only be run on the compute nodes by submitting jobs to the Slurm batch queuing system"
aha
it's very likely that the node you are currently in doesn't allow you to run stuff on gpu. you would have to print (actually log) from inside something you submit to slurm. is that what you're doing, or are you checking this on the log-in node?
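a rough sketch of what could be logged from inside a submitted job, assuming the job runs in the same conda environment. the gpu_report name and the logger setup are made up for illustration; only torch.cuda.is_available() comes from the discussion above:

```python
import importlib.util
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gpu_check")


def gpu_report():
    """Collect basic CUDA visibility info; intended to run on a compute node."""
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if report["torch_installed"]:
        import torch

        report["torch_version"] = torch.__version__
        # CUDA version the installed wheel was built against (None on CPU-only builds).
        report["built_for_cuda"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
        report["device_count"] = torch.cuda.device_count()
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        log.info("%s = %s", key, value)
```

if cuda_available is False on the log-in node but True in the job's log, the install itself is probably fine.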
I see, i'm trying to install NVIDIA's Imaginaire library
Using the instructions for Conda here https://github.com/NVlabs/imaginaire/blob/master/INSTALL.md
I'm stuck at this step
# install third-party libraries
export CUDA_VERSION=$(nvcc --version| grep -Po "(\d+\.)+\d+" | head -1)
CURRENT=$(pwd)
for p in correlation channelnorm resample2d bias_act upfirdn2d; do
    cd imaginaire/third_party/${p};
    rm -rf build dist *info;
    python setup.py install;
    cd ${CURRENT};
done
Before I was getting cuda not found errors and stuff
This is the current error
Hey @thick marlin!
You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.
This is the error now
https://paste.pythondiscord.com/izewitenen
it still looks like a mismatch of versions
Is there any way to further debug this?
what's at the end of the traceback? it still keeps going right?
TypeError: expected string or bytes-like object This is the last line
after what I have in pastebin
i can't find where setup py is in this repo
imaginaire/third_party/correlation
hmm, sadly i haven't the slightest
Where would be the best place to ask regarding this on this server? available channels or unix?
Regardless, Thank you for your help!
give it a shot down in unix first, i guess
what are you trying to do?
```py
df_delinquent['CurrStatus'] == 'Current', np.where(df_delinquent['Beginning AR Balance'],
df_delinquent['CurrStatus'] == 'Past'))
```
if CurrStatus equal current subtract ending and beginning Balance,
if past then keep just the beginning balance
@wooden sail
i don't understand what's going on there, you nested np.wheres?
yes its a nested np.where
but the second only has 2 params instead of 3 or 1
how can I have 3
what do you want it to return if the condition is not met?
im thinking of it like a decision tree
so like if current - > do subtraction
if past -> stay as is
there's only two conditions Current and Past for CurrStatus
i don't think this is doing what you think it is
the first parameter is a condition. you're using as condition ending balance - beginning balance. this will probably return true, since nonzero numbers are treated as true
In [15]: if(-1.5):
...: print('boop')
...:
boop
then you're saying, if the condition is true (and it probably will be, unless ending balance == beginning balance), then assign the values currstatus == current. this will be a series of booleans
if it's false, it will go into another np.where. the condition for this second one is beginning balance, which is almost certainly going to be True as well. if this is true, then the return value will be a series of booleans corresponding to currstatus == past. if it is not true, it should return something else. you didn't say what, though
so you're saying the reason its not working is because there is no condition if both are false?
should i just add like a np.nan at the end of it then
bc its spitting out wrong dtypes?
I need a quick help on a Small task with Python
because you mixed up the conditions with the return vals
look at this MWE
```py
In [16]: truevals = ['beep','boop','blargh']
In [17]: falsevals = ['cat', 'dog', 'python']
In [18]: condition = [True,False,True]
In [19]: import numpy as np
In [20]: np.where(condition, truevals, falsevals)
Out[20]: array(['beep', 'dog', 'blargh'], dtype='<U6')
```
if condition is true for index k, then we get the kth element of truevals. otherwise, we get the kth element of falsevals.
try opening Discord on that computer, so that you can copy and paste stuff. taking pictures of your screen isn't the way to go.
there also isn't anything in that photograph to indicate that you're passing the JSON to pandas.
lol at this point im just gonna use .loc
this is more like what you're trying to do, just with numpy arrays instead of dataframe columns
In [25]: status = np.array(['current','boop','blargh'])
In [26]: endmoolah = np.array([500, 100, 300])
In [27]: startbucks = np.array([10,300,20])
In [28]: np.where(status == 'current', startbucks, endmoolah-startbucks)
Out[28]: array([ 10, -200, 280])
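mapped onto the rule stated earlier (Current: ending minus beginning, Past: keep the beginning balance), a sketch on a dataframe could look like this. the 'Ending AR Balance' and 'Result' column names and the sample rows are assumptions, since the first line of the original snippet isn't shown:

```python
import numpy as np
import pandas as pd

# Made-up sample rows, only for illustration.
df_delinquent = pd.DataFrame({
    "CurrStatus": ["Current", "Past", "Current"],
    "Beginning AR Balance": [10.0, 300.0, 20.0],
    "Ending AR Balance": [500.0, 100.0, 300.0],
})

# The condition comes first, then the values to use where it is True,
# then the values to use where it is False.
is_current = df_delinquent["CurrStatus"] == "Current"
df_delinquent["Result"] = np.where(
    is_current,
    df_delinquent["Ending AR Balance"] - df_delinquent["Beginning AR Balance"],
    df_delinquent["Beginning AR Balance"],
)
print(df_delinquent)
```

since CurrStatus only takes two values, a single np.where is enough and no nesting is needed.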
idky but ifelse statements arent supposed to be this complicated
they aren't, but i'm sad to report you are using np.where entirely wrong
thanks but I just did it using .loc instead
i think it's in your best interest to not share your api key
I am trying to convert that JSON data to a DataFrame so I can run some analysis, but I am trying to figure out how to get the columns of the product reviews for an Amazon product, like date, name, title, review.
import requests
url = "https://amazon-product-reviews-keywords.p.rapidapi.com/product/reviews"
querystring = {"asin": "B091HQNRRD", "country": "GB", "variants": "1", "top": "0"}  # B091HQNRRD
headers = {
    "X-RapidAPI-Key": "api",
    "X-RapidAPI-Host": "amazon-product-reviews-keywords.p.rapidapi.com",
}
response = requests.request("GET", url, headers=headers, params=querystring)
print(response.text)
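one possible way to get from JSON to a DataFrame is pd.json_normalize. the layout of the real response isn't shown here, so the "reviews" key, the field names and the sample records below are all assumptions for illustration:

```python
import pandas as pd

# Hypothetical payload standing in for response.json(); the real structure may differ.
payload = {
    "reviews": [
        {"date": "2022-01-05", "name": "A. Reader", "title": "Works well", "review": "Does the job."},
        {"date": "2022-01-09", "name": "B. Buyer", "title": "Too small", "review": "Returned it."},
    ]
}

# record_path points at the list of records inside the JSON object.
df = pd.json_normalize(payload, record_path="reviews")
print(df[["date", "name", "title", "review"]])
```

printing response.json().keys() first would show which key actually holds the list of reviews.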
what are the different ways to code AI? one is TF, but there are other ways right?
pytorch
and do they have different utilities or they are the same
and its better than tf?
yep, much cleaner and faster
its research standard. industry only favours tf cause its easier to code
tf is more famous though right?
'easier' is wrong term imo, its harder due to bugs
no its not more famous
im still learning pytorch. but already feels so much nicer
so you think I could build an algo with pytorch? specifically about financial markets
I mean
u mean lstm to predict stocks?
I just want to create a bot that with certain financial indicators learns to identify possible moves, to the upside or downside
I want to use it as a recommendation
but yeah to make it simpler, lets say to "predict"
yep u build a function to make a lagging window
and use that data to predict the next window
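a minimal sketch of such a lagging-window function, assuming a 1-D series of prices. the make_windows name, the lookback and horizon parameters and the sample prices are made up for illustration:

```python
import numpy as np


def make_windows(series, lookback, horizon=1):
    """Split a 1-D series into (past window, next window) pairs."""
    values = np.asarray(series, dtype=float)
    n_samples = len(values) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for the requested window sizes")
    # Each row of X holds `lookback` consecutive values; the matching row of y
    # holds the `horizon` values that come right after them.
    X = np.stack([values[i : i + lookback] for i in range(n_samples)])
    y = np.stack(
        [values[i + lookback : i + lookback + horizon] for i in range(n_samples)]
    )
    return X, y


prices = [10, 11, 12, 13, 14, 15]
X, y = make_windows(prices, lookback=3, horizon=1)
print(X.shape, y.shape)
```

X would be the model input and y the target, whatever model ends up being trained on it.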
Ummm
Hmmmmph seem like I am in a wrong thread..
As I understood, machine learning is trying different things and dropping those who have bad results, and keeping the one with the best results. then based on that bot with the best results, try new ways and drop the bad ones
until you have an accurate bot?
right?
No
