#data-science-and-ml
Check scaled_data.shape and scaled_data.dtype
float64 and 6007,1
Actually, I think I see a mistake. You're taking elements from scaled_data but looping up to len(train_data). If the latter is higher than len(scaled_data), then the last slices the loop produces can be shorter than 60 elements.
so how could i fix that?
train_data = final_data[0:200,:]
valid_data = final_data[200:,:]
If len(train_data) < len(scaled_data) as this seems to imply, your loop should work.
for i in range(60,len(train_data < len(scaled_data)): like that?
what type does len(train_data) < len(scaled_data) return, and what types do you need to pass to range?
it just returns false
False is a bool. Does range(60, False) make sense?
no
see if you can fix it
yeah i'm stumped
when ConfusedReptile said "If len(train_data) < len(scaled_data) as this seems to imply, your loop should work.", they weren't suggesting that you need to include len(train_data) < len(scaled_data) in your code somewhere. they were saying "if the length of train_data is less than the length of scaled_data, then your loop should work"
so the loop needs to go from 60, to what?
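For reference, a minimal sketch of the windowing loop under discussion. It assumes (the original loop body is not shown in the chat) that each sample is the 60 values preceding position i in the first column of scaled_data. The names make_windows and train_len and the stand-in data are hypothetical. The key point is that the upper bound passed to range must be an integer no larger than len(scaled_data), so every slice holds a full 60 elements:

```python
import numpy as np

def make_windows(series, window=60, stop=None):
    """Build (X, y) pairs where each X row holds the `window` values before y."""
    series = np.asarray(series)
    # Never loop past the end of the data we are slicing from.
    stop = len(series) if stop is None else min(stop, len(series))
    x_rows, y_vals = [], []
    for i in range(window, stop):
        x_rows.append(series[i - window:i, 0])  # always exactly `window` values
        y_vals.append(series[i, 0])
    return np.array(x_rows), np.array(y_vals)

# Stand-in for the real scaled_data, which has shape (6007, 1) in the chat.
scaled_data = np.arange(300, dtype=np.float64).reshape(-1, 1)
train_len = 200  # hypothetical: number of rows used for training

x_train, y_train = make_windows(scaled_data, window=60, stop=train_len)
print(x_train.shape, y_train.shape)
```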
my brain isn't working right now so I need a sanity check
can i train ppo with a continuous stream of downscaled images (think 360x240 or something) and have it just dictate mouse movement and whether or not to click on each frame
and as a followup, should I do it in tensorflow or pytorch or do it with custom gym environment
I assume that custom gym is easiest since theres a lot of backend I don't have to handle but I wanted to get a second opinion
Hi! Has anyone ever worked with the library pydeck?
always ask your actual question, and give enough information for someone who would know the answer to start answering it. don't wait for a commitment.
shud have a bot command for this by now hahah
I just use a self-bot /s
but also we did have a bot command telling people not to ask to ask and such, but it was used in a passive aggressive way, and people didn't really read it.
@past meteor Was I meant to sort the residuals and the variable values for some reason before graphing? I think not because I think that would cause the variable values to not "line up" with their corresponding residuals.
correct, don't sort them
Is there a quick handy way to get a report on outliers (in numerical data) . Number of outliers in each feature and an array/series of indices that contains them?
A basic way is to just make a boxplot
That does not give me a report of how many outliers are there for each feature and their locations if I am thinking right
You can compute the dots above the whiskers manually if you want
It's what I do if I want quick and dirty univariate outliers
I have a dataset with (potentially) 23 features and want to look at similar stuff in the future too. A quick report/describe that helps me determine if I can remove them without worries is what i need. Also need indices/location so that I can remove those easily. Bonus would be to see the overlap
Basically a
df.describe()
with extra data
outlier_mask = df["feature_1"] > df["feature_1"].quantile(q=0.75) * 1.5
And you do the same for those lower than the 25th quantile and do an or between both, those are your outliers
Gives you a true or false series
Finally, you filter by this mask and you get the index of all these values
Note that this is indeed univariate outliers. Sometimes values are anomalous together. If you want to find those consider using a one class SVM: https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html / https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection
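A rough sketch of the kind of per-feature report asked for above. It swaps in the usual boxplot rule (values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR) rather than the one-line mask from the chat, and the function name outlier_report and the toy DataFrame are made up for illustration:

```python
import pandas as pd

def outlier_report(df, k=1.5):
    """Flag univariate outliers per numeric column using the boxplot (IQR) rule."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Boolean DataFrame: True where a value lies outside the whiskers.
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    counts = mask.sum()  # number of outliers per feature
    indices = {col: mask.index[mask[col]].tolist() for col in mask.columns}
    return mask, counts, indices

# Toy data, for illustration only.
df = pd.DataFrame({
    "feature_1": [1, 2, 3, 4, 100],
    "feature_2": [10, 11, 12, 13, 14],
})
mask, counts, indices = outlier_report(df)
overlap = mask.sum(axis=1)            # how many features flag each row
cleaned = df[~mask.any(axis=1)]       # drop rows flagged by any feature
print(counts)
print(indices)
```

Like the mask above, this only finds univariate outliers; rows that are anomalous only in combination will not show up.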
I was attempting to write my own code already. That is why asked for " a quick handy method" in my original question.
But thanks anyways. I will look into your way too
!rule ad
do you guys have any suggestions on what backend to use for ml models? Would it differ between traditional ml methods and dl ?
for traditional ml, most things should be 'cheap' enough to run as far as computing power goes that you don't have to think much about it beyond the usual considerations for literally any project's backend at all
for deep learning, you might need to ensure you have access to gpus depending on which model(s) you are using, which can make it harder and more expensive
classic options like AWS, Azure or Google Cloud Platform can work fine, if you use the right services inside of it and take the usual precautions to like not leaking credentials, not getting hacked, properly deactivating things you're not using to avoid unwarranted fees etc - not ultra specific to ml
though I guess that as far as specifically for ml goes, there are some things like Hugging Face Spaces you might want to look into
Also HF Inference Endpoints if you're looking for production use
Spaces is mostly for demo use ig, and doesn't provide an API to query the model hosted
Still very useful to get things up and running and to test it out
i can't install chatterbot with pip... i am using python 3.11. output:
```
ERROR: Failed building wheel for blis
Running setup.py clean for blis
Failed to build preshed thinc blis
ERROR: Could not build wheels for preshed, thinc, blis, which is required to install pyproject.toml-based projects
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error
× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.
```
looks like it hasn't been updated since 2021?
edit; even worse, the pypi version is from 2020
(2021 was the latest github commit on the main branch)
you might want to look into LangChain instead
You need to download the aforementioned binaries/wheels to your machine, making sure they match your Python version (3.11). Then, from your command line (Windows or Conda, depending on what you're using), change directory to the folder that contains the downloaded wheels (unless that folder is already on the path in your system environment) and pip install them.
Use this website to download the wheels: https://www.lfd.uci.edu/~gohlke/pythonlibs/
Now, after doing that for the missing wheels, try re-installing chatterbot again.
Better Solution: Create a virtual environment with an older version of Python, either 3.9 or 3.10, then install chatterbot in that environment. You should be fine. (I noticed that some of the wheels you're missing, like thinc and blis, aren't available yet for Python 3.11, but the 3.10 versions are.)
Aha @mystic ruin this is why it's not working. The maintainers of Chatterbot appear to have gone on sabbatical since python 3.8 😀 . So you'd either downgrade to 3.8 or find another library for building what you wanted to use Chatterbot for.
I split my data to just training and validation, is it okay if I don’t have a split for testing?
it's arguably better to have a testing split and no validation split, because otherwise you have no "un touched" data for final model evaluation. but why do you want to avoid making a 3rd split? there are good reasons to want to avoid it, but i want to understand your case in particular so i can give sensible advice.
Idk, I just did it like that. Sometimes I see others do 3 splits and sometimes just 2.
If I include the 3rd split, will it increase my model’s accuracy.
The unseen data, isn’t that the validation set?
It’s not about increasing your models accuracy, it’s about evaluating it:
With a train test split, you might test dozens or more of models. You’ll eventually find one that fits the test really well. But then what? Was that just by sheer chance or did you stumble upon a ‘good’ model?
So, at least in my world, the validation split is the very last thing that you hold back as long as you can… because often the ‘test’ effectively becomes part of train
(Curious whether zestar75 agrees with my characterization)
because of the fact that inputs for neural networks have to be a vector of data,
does that mean whenever I'm using keras to make a Neural Network, I need to make sure that the x_train and y_train have a shape of (n,1)?
please at me if you know.
Hello All,
I am working on bar plot in matplot. The second plot is not showing the bar:
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta
import matplotlib as mpl
mpl.rcParams["date.converter"] = 'concise'
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, layout='constrained')
price_date = np.array([datetime(2020, 6, 30),
                       datetime(2020, 7, 22),
                       datetime(2020, 8, 3),
                       datetime(2020, 9, 14)], dtype=np.datetime64)
price_close = [8800, 2600, 8500, 7400]
start_date = np.datetime64(datetime(2020, 6, 1))
ax1.bar(price_date, price_close, width=np.timedelta64(4, "D"))
ax2.bar(start_date, price_close, bottom=price_date)
ax3.bar(np.arange(4), price_date-start_date, bottom=start_date)
Am I missing something, I wanted the time to be at the bottom on the second plot but it does not show the bar
You could format code in code blocks to help others read better
my apologies - let me try to format it in code blocks
!code
I hope this helps as I used pastebin. As mentioned the second bar plot is not showing the bar.
Fully agree. Splitting is not about "increasing performance", it's about having a high fidelity estimate of how good your model is.
As you say, if you just do a train-test split and test N models with it and you likely do this M times ... your test set is de facto part of your training set.
My current work project is medical stuff. What I do is essentially split:
- Split off some patients in full, they are never touched. (inter patient split)
- Split each "training" patient in a train and a test set. (intra patient split)
- Within this train set I split off a little more data and mostly cross validate.
- I try to minimize the amount of times I'm using the "intra patient" test sets.
- I take the top 4 models found on the "intra patient" splits and apply them on the held out ones.
- Report the inter patient split results, these are how our models work on unseen patients. If the difference between test and train here is high we likely overfit by reusing the intra patient split too much.
It sounds convoluted but if you don't have a lot of data how you approach this is important to get unbiased performance estimates.
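The splitting scheme above could be sketched roughly like this. It is only an illustration under assumptions: records are (patient_id, sample) pairs, the fractions are arbitrary, and split_by_patient is a hypothetical helper, not the code used in the project described:

```python
import random

def split_by_patient(records, holdout_fraction=0.2, test_fraction=0.25, seed=0):
    """Return (train, intra_test, inter_test) lists of (patient_id, sample)."""
    rng = random.Random(seed)
    patients = sorted({pid for pid, _ in records})
    rng.shuffle(patients)
    # Inter-patient split: these patients are held out in full.
    n_holdout = max(1, int(len(patients) * holdout_fraction))
    holdout = set(patients[:n_holdout])

    inter_test, by_patient = [], {}
    for pid, sample in records:
        if pid in holdout:
            inter_test.append((pid, sample))
        else:
            by_patient.setdefault(pid, []).append(sample)

    # Intra-patient split: each remaining patient contributes to train and test.
    train, intra_test = [], []
    for pid, samples in by_patient.items():
        rng.shuffle(samples)
        n_test = max(1, int(len(samples) * test_fraction))
        intra_test += [(pid, s) for s in samples[:n_test]]
        train += [(pid, s) for s in samples[n_test:]]
    return train, intra_test, inter_test

# Toy data: 10 patients with 8 samples each.
records = [(pid, i) for pid in range(10) for i in range(8)]
train, intra_test, inter_test = split_by_patient(records)
print(len(train), len(intra_test), len(inter_test))
```

Cross validation would then happen inside train only, with intra_test used sparingly and inter_test reserved for the final report.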
The problem with making too many splits is when you have sparse or unbalanced data, sometimes the splits end up kind of useless and you actually end up underestimating model performance
the compromise i've made in the past in that situation is to create a single train/test split, and rely on cross validation within the train set for model dev
Underestimating model performance is a lot better than overestimating it, I'm fine with that
I really like these authors' illustration and explanation of this train/test overfitting problem (in the finance context but generally applicable): https://www.davidhbailey.com/dhbtalks/battle-quants.pdf
sort of. if you have 1000 classes but only 500 of them make it into the test set, it's a problem
Honestly? I like the idea of bespoke data splitting. For my current use case it made sense. For others I'll have to do something else.
but yeah, implicitly overfitting through iterated model tweaking is a serious problem
In my case there's an element of unbalanced data in it and we solved it by having a semi-stratified test-train split in the intra patient split
Definitely bookmarking this
I don’t know if finance is more prone to this… it feels like it, since we’re inherently trying to maximize a variable
But the stratification procedure was very much linked to the "domain"
The authors have some great stuff, Bailey has some YouTube talks, and de Prado wrote the wonderful: Advances in Financial Machine Learning
yeah, that's how i ended up solving my 1000 class problem too, we were able to group the classes into bigger categories and stratified that way
thank you, i have been looking for more things to listen to while i do house chores
This is the YouTube I was thinking of: https://youtu.be/e3h9xf3p1DE?feature=shared
I think all ML is prone to this, not just finance. It's game over when the data scientist starts thinking their job is to maximize or minimize a variable 😄
Why is it "made for kids"? I can't save the video to a playlist lol
unless that variable is long term business success!
I meant it in a way that if your focus is maximizing or minimizing a variable without regard to due diligence in how you evaluate, you'll have a negative impact on long term business success
The optimal way to get 100 % on the test set is just by copying y
Hence why I truly believe that's not our job, our job is to answer the question "if we go into production with this? How good will it be?"
That's why I don't mind underestimating performance, if all my metrics say the model is fantastic and we go into prod and it fails then that might impact how much they trust us going forward.
This is why I like staying in the DE side in the finance world!
I mostly speak about DS but I like data engineering equally I think, it's good fun as well
In finance, DS feels more like BS most of the time
Ironically, my background is quantitative business. Finance was never my jam. A lot of my cohort went into actuarial science which does seem very legit.
Algo trading ... idk, I feel like there's too much happening in the world you can't control for.
Finance at large does have interesting use cases like credit risk modelling and fraud detection though. My profs couldn't shut up about these 🤣
i have no idea if this is true, but my broad economics-based impression of algo trading is that it only makes sense for HFT
That sounds reasonable to me. I don't know enough to comment, I'll leave that to BillyBobby 😄
I know there’s lots of fund styles: everyone thinks they’re a genius. There are non HFT algorithmic shops
Usually, I think, it comes down to a big broad strategy bet (ie: certain sectors) and small optimizations to hopefully outperform.
The PDF you sent was spot on BTW
Yah, it’s tough to apply in practice, but it’s good to understand the limits
https://d2l.ai/chapter_linear-classification/generalization-classification.html#test-set-reuse from one of the books I always shill.
Oh thanks, this statement is somewhat surprising, I was hoping for a more positive solution: “This problem …. remains a persistent problem plaguing scientific research.”
Anyone here have experience with pandas?
Hello everyone, I uploaded a PySpark course on YouTube channel. I tried to cover wide range of topics including SparkContext and SparkSession, Resilient Distributed Datasets (RDDs), DataFrame and Dataset APIs, Data Cleaning and Preprocessing, Exploratory Data Analysis, Data Transformation and Manipulation, Group By and Window ,User Defined Functions and Machine Learning with Spark MLlib. I am leaving the link below, have a great day!
!rule 6
You're not supposed to advertise stuff on the Discord 🙂
Hello All,
I am still having a challenge with the second subplot. It should show the timedelta at the bottom. Currently, it does not show anything.
Any suggestions?
Sorry, i won't again. Thanks for the warning
!code
Hi, can someone pls help me. I have been trying and trying again to increase the accuracy of my model but it stays at like 70% accuracy. I have added two dropout layers and L1 regularization, and my dataset is about 23k images. What can I do to get my model to at least 80%? Help 😭😭
I'm interested in learning ai & ml but I have 0 idea where to start from, my main lang. is python, if anyone has any tips or courses/vids that can help me I'd appreciate it
Hi guys, I have a large news document dataset and I am trying to train a model on it. The news runs from 1998 until now, and I want to filter the news from 2015. The dataset does not have metadata, so to check the year I have to manually read each news item and its content. I tried using spaCy, but it's not accurate because a news item will contain many dates. Is there a better way of doing it?
I don't think there's a "good" way of doing this, but I would treat the most recent date that's mentioned in the document as the approximate publication date
Unless your data set is very very very large don't be afraid of doing stuff manually. Higher quality data means you get a better model. Start with a subset first and then do more and more until perf flatlines.
Manually meaning checking all the docs, there is over 10,000 docs
Are you alone in this or is this a team effort?
the news has a lot of dates, even when I use entity recognition, it shows a couple of dates, which is inaccurate.
Just Alone.
Hmmm, I don't know how long this would take 🤔 . At least for computer vision I tend to manually label stuff whenever necessary. It sucks but imo it's part of the job 😄
that's why I'm saying that you should pick the most recent one as the approximate date. so if the document mentions 12 January, 1975; 6 March, 2013; and 4 June, 2020; I would expect the document was probably published closer to 2020 than the other two.
That's a good one
Oh okay, this makes a lot of sense.
I would probably also skip dates that only have a month and year, since those might refer to tentatively scheduled future events
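A small sketch of the "most recent full date" heuristic described above. It only handles dates written like "4 June, 2020" (an assumption; real news text will need more formats or an NER pass), and it ignores month-and-year-only mentions as suggested. The function name is hypothetical:

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Matches "12 January, 1975" or "12 January 1975".
# Mentions without a day ("August 2021") are deliberately not matched.
FULL_DATE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r"),? (\d{4})\b")

def approximate_publication_date(text):
    """Return the latest fully specified date in the text, or None."""
    found = []
    for day, month, year in FULL_DATE.findall(text):
        try:
            found.append(date(int(year), MONTHS[month], int(day)))
        except ValueError:
            continue  # e.g. "31 February, 2020"
    return max(found) if found else None

text = ("Founded on 12 January, 1975, the firm merged on 6 March, 2013. "
        "On 4 June, 2020 it announced a launch planned for August 2021.")
print(approximate_publication_date(text))
```

Filtering then comes down to checking the year of the returned date, with documents that return None set aside for manual review.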
Also, one other thing I tried is using llama-7B to output the year, via Hugging Face transformers. The only issue is that it takes around 12 seconds to complete one request. I am using Colab (85GB RAM)
Is this the case usually?
@serene scaffold ?
did you make sure you enabled GPU?
GPU does not increase at all
so, you're not even using the GPU, apparently
Currently I restarted the runtime, usually 70GB of the RAM will be consumed
And GPU does not go up at all
@harsh minnow can you screenshare with me in #751592231726481530?
!stream 533839465060499487 "15 minutes"
✅ @harsh minnow can now stream until <t:1696694410:f>.
I don't have the permission to speak
come back
!pip install transformers
!pip install transformers[sentencepiece]
from transformers import pipeline
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")
from transformers import pipeline
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device=0)
im making a pie chart, how do i get rid of the labels on the pie chart and just instead have a legend
portions = [75, 18, (100-75-18)]
plt.pie(x = portions, labels = ['Sky', 'pyramid', 'shady'], startangle = 320, colors = ['deepskyblue', 'yellow', 'gold'])
plt.legend(loc = 'best', bbox_to_anchor = (0.9, 0.9))
plt.show()
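One way this is commonly done, as a sketch: leave labels out of the pie call and hand the wedge handles plus the names to legend instead. The variable names here are illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line when running interactively
import matplotlib.pyplot as plt

portions = [75, 18, 100 - 75 - 18]
names = ["Sky", "pyramid", "shady"]

fig, ax = plt.subplots()
# No `labels=` here, so nothing is drawn next to the wedges.
wedges, _ = ax.pie(portions, startangle=320,
                   colors=["deepskyblue", "yellow", "gold"])
# The legend gets the wedge handles and their names instead.
legend = ax.legend(wedges, names, loc="best", bbox_to_anchor=(0.9, 0.9))
# plt.show()  # uncomment in an interactive session
```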
I have two questions.
1. Can you easily tell the year a document was published by looking at a few lines?
2. If the answer to #1 is yes, how do you tell that the year you chose is truly the year the document was published and not some random date in the document?
Perhaps there could be a recurring pattern in the documents which you could leverage in getting the year each document was published
If there exists a consistent pattern across all the documents for figuring out the right date, you can write a function to do that. Or better still, train your own custom NER to handle that for you.
not sure where NLP-type questions should go but ...
- say I have data for ~2000 cities for every year from a decade
- I know every city is on every list (for the most part), but the spellings and syntax probably vary from year to year
what's the best way to find the string representing every city in each year? is it stupid to do k-means if you are going to have several thousand clusters? is it stupid to just do pairwise fuzzy matching?
You have data for 2000 cities. What data?
Every city is on every list. What lists?
Are you trying to figure out when the same city is referred to in different instances, but in different ways; example, "Paris" and "the French capital"?
As an aside, "syntax" means "rules about legal orderings of symbols in a language". The words "syntax" and "semantics" are used much more expansively in casual speech than they are in linguistics and NLP, so be sure that you're using them correctly.
@meager ridge I see that you also asked this in a help thread. Please link to your thread when you cross post your question, to avoid duplication of effort
(whats best practices for linking the other post?)
Asking your questions in the help thread, and then linking it in the most relevant topical channel with a brief preview of what your question is about. Thanks for asking.
opened up a help thread about this!
https://discord.com/channels/267624335836053506/1160398584513048707
Has anyone here made a python voice assistant?
Anyone here with experience of performing FFT on timeseries data? Preferably using numpy
Yeah I have performed fft but I need some help regarding the output frequency interval customization
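In case it helps with the frequency-interval question, a short numpy sketch. The sampling rate and the 5 Hz test tone are assumptions made up for the example. The spacing between output frequencies is fs / n, so it is set by the sampling rate and the number of samples (or the padded length passed as n):

```python
import numpy as np

fs = 100.0                 # assumed sampling rate in Hz
n = 200                    # number of samples
t = np.arange(n) / fs
signal = np.sin(2 * np.pi * 5.0 * t)    # 5 Hz test tone

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(n, d=1 / fs)    # bin spacing is fs / n = 0.5 Hz
peak_freq = freqs[np.argmax(np.abs(spectrum))]

# Zero-padding to n_padded samples gives a finer frequency grid
# (fs / n_padded). It interpolates the spectrum; it adds no new information.
n_padded = 400
padded_spectrum = np.fft.rfft(signal, n=n_padded)
padded_freqs = np.fft.rfftfreq(n_padded, d=1 / fs)

print(peak_freq, freqs[1], padded_freqs[1])
```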
somebody can help me with multiNetX?
Any idea why this error occurs? ```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (1663 > 1024). Running this sequence through the model will result in indexing errors
Evaluating: 0%| | 0/1379 [00:00<?, ?it/s]
RuntimeError Traceback (most recent call last)
<ipython-input-18-523c0d2a27d3> in <cell line: 1>()
----> 1 main(trn_df, val_df)
...
/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py in _attn(self, query, key, value, attention_mask, head_mask)
199 # Need to be a tensor, otherwise we get error: RuntimeError: expected scalar type float but found double.
200 # Need to be on the same device, otherwise RuntimeError: ..., x and y to be on the same device
--> 201 mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
202 attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
203
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.```
Guys, I want to use a Python DSL and Lua as the configuration language for a pipeline that manipulates data, trains, validates and deploys a model, and, of course, plots graphs. Lua is a powerful language, so we can use it to create custom feature engineering functions. I have some questions and concerns related to it:
- I would like to avoid recalculating expensive feature engineering. My idea is to hash the current function code plus the columns used in processing. If this hash already points to a parquet file, it will load that file instead of processing all the data. One point is: how accurate are the column stats from pandas' describe()? And is there a chance of collision in the hash result if I combine the description of the columns with the code I will use?
Why would you need to include the describe() output? Seems like you’d be fine if you capture all the inputs/code/etc?
Because if the dataset change for some reason adding more rows for example, in my mind, the feature need to be recalculated?
Or applied any kind of imputation for example, filling nan values with zero
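A hedged sketch of one alternative to hashing describe() output: hash the function's source text together with the actual column values, so added rows or an imputation step change the key. Summary statistics can stay identical while the underlying data differs (a reordered column, for example), which is why this sketch hashes the data itself. feature_cache_key is a hypothetical helper, and the source text is passed in as a plain string:

```python
import hashlib
import pandas as pd

def feature_cache_key(function_source, df, columns):
    """Hash the feature function's source text together with the input data."""
    digest = hashlib.sha256()
    digest.update(function_source.encode("utf-8"))
    for col in sorted(columns):
        digest.update(col.encode("utf-8"))
        # One uint64 hash per row (index included), then fed into SHA-256.
        row_hashes = pd.util.hash_pandas_object(df[col], index=True)
        digest.update(row_hashes.to_numpy().tobytes())
    return digest.hexdigest()

# Toy example: the key changes after imputation or when rows are added.
source = "def feature(df): return df['a'] * 2"
df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [1, 2, 3]})
key = feature_cache_key(source, df, ["a"])
key_imputed = feature_cache_key(source, df.fillna(0), ["a"])
key_more_rows = feature_cache_key(
    source, pd.concat([df, df.iloc[[0]]], ignore_index=True), ["a"])
print(key != key_imputed, key != key_more_rows)
```

The key could then serve as the parquet file name. Hashing every row costs a full pass over the columns, which is usually far cheaper than recomputing an expensive feature.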
Hi, I'm working on weighting some values that will feed a larger datascience dataset but I'm having trouble figuring out the best way to weight the values to keep the results appropriate.
# Define weight factors for each parameter
damage_weight = 0.3
reproducibility_weight = 0.1
exploitability_weight = 0.1
affected_users_weight = 0.2
discoverability_weight = 0.2
# Calculate the weighted sum of the parameters
weighted_sum = (
    self.damage * damage_weight +
    self.reproducibility * reproducibility_weight +
    self.exploitability * exploitability_weight +
    self.affected_users * affected_users_weight +
    self.discoverability * discoverability_weight
)
# Scale the weighted sum to fit within your desired range (0 to DREAD_RISK_CAP)
scaled_risk_value = (weighted_sum / (damage_weight + reproducibility_weight + exploitability_weight + affected_users_weight + discoverability_weight)) * self.DREAD_RISK_CAP
return min(scaled_risk_value, self.DREAD_RISK_CAP)
    (1, 12): "Notice",
    (13, 18): "Low",
    (19, 36): "Medium",
    (37, DREAD_RISK_CAP): "High",
}
these are the risk levels that I'm considering to make it easier to talk about numbers but I end up with results that are always high and I want to avoid that.
First thing I’d do is ask: what makes them ‘always high’?
Is one metric overweighted? Perhaps a log scale is better for that metric, etc
Another method of weighting is to calculate the percentile score for each metric, and score based on their percentile range... ie: 0-25th percentile is a 0, 25-50 is a 1, etc. Or, perhaps a percentile of the composite. I'm just thinking out loud, may not make sense in this case if the metrics are already computed somewhat arbitrarily.
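One possible reason for "always high", offered as a guess: dividing by the sum of the weights gives a weighted average on the same scale as the inputs, so multiplying that by the cap overshoots unless the inputs are already between 0 and 1. The sketch below assumes each parameter is scored from 0 to 10 and that the cap is 50; both values are assumptions, since neither appears in the snippet:

```python
MAX_SCORE = 10        # assumed upper bound of each raw DREAD parameter
DREAD_RISK_CAP = 50   # assumed cap; the real value is not shown in the chat

WEIGHTS = {
    "damage": 0.3,
    "reproducibility": 0.1,
    "exploitability": 0.1,
    "affected_users": 0.2,
    "discoverability": 0.2,
}

def scaled_risk(scores):
    """Map weighted DREAD scores onto the range 0..DREAD_RISK_CAP."""
    total_weight = sum(WEIGHTS.values())
    weighted_avg = sum(scores[name] * w for name, w in WEIGHTS.items()) / total_weight
    fraction = weighted_avg / MAX_SCORE   # normalise to roughly 0..1 first
    return min(fraction * DREAD_RISK_CAP, DREAD_RISK_CAP)

low = scaled_risk(dict.fromkeys(WEIGHTS, 1))
mid = scaled_risk(dict.fromkeys(WEIGHTS, 5))
high = scaled_risk(dict.fromkeys(WEIGHTS, 10))
print(low, mid, high)
```

With the normalisation step, all-ones input lands near the bottom of the range and only all-tens input reaches the cap, so the level thresholds have something to separate.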
hey guys i know this is pyhton server but pliss help me im losing my insanity slowly rn
looks fine
This is excellent and I think where I’ll take this
How do you make a regression plot when there are NA values in your data?
The two columns I'm using both have them in random spots, so I can't use dropna()
You can ffill and bfill na’s. You can also remove rows that contain a NA value. Whatever makes sense for the data/problem.
I removed the rows containing NA values, but am still getting this error Cannot cast ufunc 'svd_n_s' input from dtype('O') to dtype('float64') with casting rule 'same_kind'
That sounds like there are still NA values, but there are none in the new df
Seems like the issue is that you have an object dtype for some reason - you probably want to cast these columns to a normal dtype like np.float64.
yeah that solved it
thank you both
how does something that was previously int become an object dtype?
If the datatype of the column was int when you did your EDA, then it means you must have cast type the column to object unknowingly.
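A minimal sketch of the cast-then-drop approach suggested above, on made-up data. A column usually ends up as object when it holds mixed values (strings next to numbers, or None), and pd.to_numeric with errors="coerce" turns anything unparseable into NaN:

```python
import numpy as np
import pandas as pd

# Toy data: "y" holds strings and None, so pandas stores it as object.
df = pd.DataFrame({
    "x": [1, 2, None, 4, 5],
    "y": ["2.0", "4.1", "6.2", None, "9.9"],
})

cols = ["x", "y"]
numeric = df[cols].apply(pd.to_numeric, errors="coerce")  # object -> float64
clean = numeric.dropna(subset=cols)  # keep rows where both columns are present

# A straight-line fit now works because both columns are float64.
slope, intercept = np.polyfit(clean["x"], clean["y"], deg=1)
print(clean.dtypes)
print(slope, intercept)
```

Dropping with subset only removes rows where one of the two plotted columns is missing, so NA values elsewhere in the DataFrame do not cost any rows.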
Have you tried implementing what was suggested in the error message yet? Did it fix the problem?
Hi, I have an out-of-memory issue when I try to download a very large dataset in parts, through a series of async API calls. pyarrow's concat_tables method was making my service run out of memory and crashing the instance; I increased the EKS memory and that part is fine for now. I then take the dataframe and run to_csv, which causes an OOM failure, and I am limited on cluster size in EKS. Is there a more memory-efficient alternative to pandas to_csv for converting a dataframe to CSV?
I am passing 100k chunk size on the to_csv call
sorry if wrong chat
I have 3500mi cpu limit and 15GI memory and although its only 5.5gb data it still runs oom
This is the right channel for pandas and pyarrow questions. you can add more rows to an existing CSV file like this
df.to_csv('existing_file.csv', mode='a', header=False)
the mode='a' is append mode, and header=False makes sure that the column headers aren't duplicated.
also, depending on how you're using pyarrow.concat_tables, you might be using double the memory you need at any given time. if the data for all the tables ends up copied into a new object (for example when the result is converted to a pandas DataFrame), each row exists in two places at once.
also it looks like you could use pyarrow's native CSV writer instead https://arrow.apache.org/docs/python/generated/pyarrow.csv.write_csv.html
that would probably save you from having to copy the pyarrow object into a pandas DataFrame
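A sketch of the append idea, assuming the pieces can be converted and written one at a time instead of being concatenated first. append_frames_to_csv and the toy frames are hypothetical. Only the frame currently being written needs to be in memory:

```python
import os
import tempfile
import pandas as pd

def append_frames_to_csv(frames, path):
    """Write an iterable of DataFrames to one CSV, one frame at a time."""
    first = True
    for frame in frames:
        # First frame creates the file with a header; the rest are appended.
        frame.to_csv(path, mode="w" if first else "a", header=first, index=False)
        first = False

# Toy frames produced lazily by a generator, so only one exists at a time.
frames = (pd.DataFrame({"id": range(start, start + 3), "value": 1.5})
          for start in (0, 3, 6))
path = os.path.join(tempfile.mkdtemp(), "combined.csv")
append_frames_to_csv(frames, path)

result = pd.read_csv(path)
print(result.shape)
```

This writes sequentially from a single writer; it does not try to handle several tasks appending to the same file at once.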
Thank you, I will try this instead
@tranquil beacon the other thing you can do to save memory is to create as few variables as possible (using more nested expressions) so that intermediate objects are garbage collected as soon as possible
this wont allow writing to the same csv in parallel right
I'm not sure. the actual writing to the file is handled by the operating system, and it probably has a lock for that
Are you doing any transformations on the data or is it just download => write?
Can you get the entire DF in memory? Do you specifically go OOM when writing or before that already?
This is after etl. I can manage to get the df in a data frame and craps out when writing to csv
I have a big list of tables which pyarrow now can concat to data frame after increasing memory
This needs to happen this way as I don’t think the suggestion of using built in to csv will work in async
And the ETL is done in Pyarrow or is this the final step?
With concat you mean join (so concatenating by column and not by row)
Etl is done by a separate process, can be weeks or months before it’s pulled
I’m pretty sure it works by column but how I will get a list of pyarrow tables not sure how that joining logic works
But now *
I was looking into pandas gzip
Using compression= gzip in to_csv call however not sure that will be different as a copy I’m guessing will still be made
I want to understand your use case first better because I might have a solution
So an ETL happens and the result is somewhere behind an endpoint. You query it to get Pyarrow tables, correct?
To me this is already strange, Arrow is specifically an in-memory format. How is the data arriving?
On point one yes.
The api we are using already returns arrow serialized results which are stored in tables and we join using pyarrow concat_tables
A list of Pyarrow.Tables
Said the same thing in 3 different responses my bad, I am using external lib which is a hard requirement that does that pyarrow join and so the data frame to csv part I can control only
Unfortunately cannot work with parquet files for this one
anyone can give me explanation about this?
wdym when no loss is used? you're using binary cross entropy in both no?
oh nvm
I see the comment now
In the first experiments I used # to decline loss function
looks like they use RMSProp if you don't specify a loss function
oh nvm that's optimizer
src/transformers/modeling_tf_utils.py lines 1508 to 1519
```py
"""
This is a thin wrapper that sets the model's loss output head as the loss if the user does not specify a loss
function themselves.
"""
if loss in ("auto_with_warning", "passthrough"):  # "passthrough" for workflow backward compatibility
    logger.info(
        "No loss specified in compile() - the model's internal loss computation will be used as the "
        "loss. Don't panic - this is a common way to train TensorFlow models in Transformers! "
        "To disable this behaviour please pass a loss argument, or explicitly pass "
        "`loss=None` if you do not want your model to compute a loss. You can also specify `loss='auto'` to "
        "get the internal loss without printing this info string."
    )
```
looks like it uses "the model's internal loss computation"
I have no idea what that means for a bert model, maybe the loss that the original model was saved with? I assume whatever loss function it's using is better suited to the task than binary cross entropy
important to note leaving out loss does not train with no loss function as you can see from the message
So you can't sneak in Polars there? Because:
- It uses arrow under the hood
- It was made for this. You can make all of the Arrow tables LazyFrames, merge them, and then sink_csv. Sink streams it to disk instead of trying to compute everything at once.
looking into the code more we see this
if loss in ("auto_with_warning", "passthrough"):  # "passthrough" for workflow backward compatibility
    logger.info(
        "No loss specified in compile() - the model's internal loss computation will be used as the "
        "loss. Don't panic - this is a common way to train TensorFlow models in Transformers! "
        "To disable this behaviour please pass a loss argument, or explicitly pass "
        "`loss=None` if you do not want your model to compute a loss. You can also specify `loss='auto'` to "
        "get the internal loss without printing this info string."
    )
    loss = "auto"
if loss == "auto":
    loss = dummy_loss
    self._using_dummy_loss = True
dummy loss is defined up here
def dummy_loss(y_true, y_pred):
    if y_pred.shape.rank <= 1:
        return y_pred
    else:
        reduction_axes = list(range(1, y_pred.shape.rank))
        return tf.reduce_mean(y_pred, axis=reduction_axes)
So which is the reliable result, Experiment 1 or Experiment 2?
It confuses me because I think it's impossible to train a model without optimizing a loss function. That's what machine learning is and what the optimizer does: it adjusts the weights to minimize the result of the loss function.
Can you elaborate on this?
You're right, gradient descent is optimization of the cost function, you can't train a model using GD without loss
when you leave out the loss keyword argument, it uses that default 'auto_with_warning' but as you can see by the message:
Don't panic - this is a common way to train TensorFlow models in Transformers!
It seems like it uses a separate loss function which in this case (assuming your dataset is valid) performs better than plain BCE
it sets the loss argument to that function I linked earlier, dummy_loss, and has a whole algorithm later in the code that calculates loss for you
that said, i'm not sure exactly what loss function that is and I'm not willing to dive any deeper into the source, but I'd say that explains why you got better performance by using the default loss
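To make the pass-through idea a bit more concrete, here's a rough NumPy mirror of the dummy_loss quoted above. It's only a sketch: dummy_loss_np and the example arrays are made up for illustration, and it assumes (as the log message says) that the model has already computed its own loss and hands it over in place of a prediction.

```python
import numpy as np


def dummy_loss_np(y_true, y_pred):
    """NumPy stand-in for the quoted dummy_loss: ignore y_true, pass y_pred through."""
    y_pred = np.asarray(y_pred, dtype=float)
    if y_pred.ndim <= 1:
        # Already one value per example (or a scalar): return it unchanged.
        return y_pred
    # Otherwise average over every axis except the batch axis.
    reduction_axes = tuple(range(1, y_pred.ndim))
    return y_pred.mean(axis=reduction_axes)


# Hypothetical values standing in for a loss the model computed internally.
per_example = np.array([0.7, 0.2, 0.4])
passthrough = dummy_loss_np(None, per_example)

per_token = np.array([[0.5, 1.5], [1.0, 3.0]])
reduced = dummy_loss_np(None, per_token)
```

So the "loss function" here does no comparison against labels at all; whatever the model computed internally is what gets minimized.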
I will definitely check it out
Which do you think is more reliable between both experiments?
what do you mean by reliable?
reliability (as in being able to trust your model's ability to generalize) would be based on your dataset, assuming your dataset was well constructed the performance you get should pretty often be a reliable outcome.
granted there are caveats there (for example, very small batch sizes in minibatch GD can give you wildly different results on different training runs, so your model might be able to perform well but just fell into a local minimum, or couldn't get a good enough estimate of the gradient to find one) but just generally your dataset is the standard for your model's ability to generalize
Thank you
What I mean is, based on the 2 experiments above, which result should I believe: that the model performance is good (without loss function) or that the model performance is bad (with loss function)?
well the dichotomy here is not with loss function or without loss function
it's how the model performed using two separate loss functions
you can think of it almost as a separate model by using separate loss functions
if the first one consistently performs better, go with that.
ah I see, thank you so much for the explanation! @small wedge
Why do you have to split the data into x and y? What do the x and y stand for? Is it something relating to independent and dependent data? Like input and output? Not sure.
If you think of our models as functions f, x is the input, y is the output. The model learns to estimate a function that turns the x's into the y's; f(x) = y
I see, and when you split the data, what is the split? Like the percentages?
you might be mixing up x/y and train/test
The splits are for the data we train the model on (training data), and the data we test the model on (test/validation data) to ensure that it's not overfitting on our training data
You need train_x and test_x as well as train_y and test_y
@vale swallow make sense?
what is the split? Like the percentages?
for a train/test split, 80/20 is a pretty common place to start.
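Here's a minimal sketch of how x/y and train/test fit together, using only the standard library. The function name, the toy data, and the fixed seed are all made up for illustration; in practice you'd probably reach for a library helper instead.

```python
import random


def train_test_split_xy(x, y, test_fraction=0.2, seed=0):
    """Shuffle row indices once, then cut x and y with the same indices."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [x[i] for i in train_idx]
    test_x = [x[i] for i in test_idx]
    train_y = [y[i] for i in train_idx]
    test_y = [y[i] for i in test_idx]
    return train_x, test_x, train_y, test_y


# Toy data: x holds the inputs (features), y holds the outputs (labels).
x = [[i, i * 2] for i in range(10)]
y = [i % 2 for i in range(10)]
train_x, test_x, train_y, test_y = train_test_split_xy(x, y)
```

The important part is that the same shuffled indices are used for x and y, so every input row stays paired with its own label in both splits.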
Ok, yeah, it makes sense, thank you both!!
Does anyone have any free book suggestions for statistics or data analysis overall? I consider myself a beginner to data analysis. I currently have a good understanding of Python fundamentals and a simple idea of what NumPy, Pandas, and Matplotlib are used for when it comes to data analysis.
hi
does somebody know if the freelance market for data science is a good path to follow?
i've been concerned about this lately
Thanks, should I start with the statistical learning book if I'm focused on mainly data analysis?
yes. statistics is, after all, the actual science of data.
Should i take Data Science or ML? Im very tired to think about it, cuz both of them looks good. But i really want to learn one of it.
Does anyone have any resources for creating transfer functions using time history input and output from an LTI system? Wanted to first understand a single input single output (SISO) system and then work up to a multiple input multiple output system.
[Looking for open-source contributors, see below]
Hi there,
I recently open-sourced PyGraft, a configurable Python tool to generate synthetic knowledge graphs easily!
It can be used in any AI tasks (Machine Learning, Deep Learning, Reasoning, etc.) provided that you work with graphs.
The repo is gaining a lot of visibility, and I am looking for motivated contributors to support me in implementing new features and unit tests. Ideally, you should have a general understanding of knowledge graphs, semantic web, RDF/RDFS, and OWL vocabularies. In addition, strong Python programming skills are required. Experience in Software Engineering is a plus 🙂
DM me if you would like to contribute!
Otherwise, you can still take a look and star the repo if you find the project interesting!
Do you have descriptions of each course? Because there's potential overlap.
https://github.com/TommiKark/AdditiveAutoencoder Could be a very influential paper. Unfortunately they went and implemented the whole thing in matlab.. 😂
https://techxplore.com/news/2023-10-technique-based-18th-century-mathematics-simpler.html More info on the paper and link to the article itself can be found here as well
Researchers from the University of Jyväskylä were able to simplify the most popular technique of artificial intelligence, deep learning, using 18th-century mathematics. They also found that classical training algorithms that date back 50 years work better than the more recently popular techniques. Their simpler approach advances green IT and is ...
Hey guys, I'm a bit new to deep learning. I'm studying neural networks and have a question. The second matrix below represents a neural network derived from a bag of words. The third matrix represents the classification of the second matrix into its respective classes. My question is about initializing the first layer: will it create a layer with 3 neurons corresponding to 0,1,1, or will it generate 3 neurons, each neuron being a line of the matrix of 0,1,1; 1,0,1; 1,1,0?
What is your actual input? And what is your expected output?
Input == bag of words?
Output == one of multiple classes?
The numbers ChatGPT is showing you are very strange, I'd forget that and "start from scratch"
Guys, I have a problem
You can see my problem in python help
"Problems importing tensorflow"
I have 3 sentences that are made up of 3 words. The 2nd matrix shows whether each word shows up in each sentence (each line is a sentence; a word is '1' if it shows up, '0' if not):
the outputs are 2 classes, for example 'hello' and 'goodbye', one for each line
And what is the 1st?
the first is for shuffle the data
so after every shuffle, u still keep the right order of input and output
Pardon me, it's kinda hard to explain in my non-native language.
The test train split?
Is this a toy example to understand neural nets better or do you have an actual use case?
after playing around with duckdb and datafusion, I'm now convinced that both of them are not production ready
gonna try clickhouse now
I had high hopes for datafusion being implemented in Rust...
what are you finding lacking about them? if you're comparing to clickhouse i feel like maybe you're unsure of the right tool for the job. duckdb, datafusion, et al are in-memory query engines, not complete data warehouses like clickhouse.
duckdb and datafusion are basically in-process/in-memory query engines, not all that different from polars or even pandas
being written in rust also should not be taken as an indicator of being production-ready or not. rust is just a programming language.
Does Pandas have a query engine?
Hey y'all.
After a good amount of data processing and encoding, I noticed my function for encoding certain columns gets a little messed up because of a certain value called "NA". I understand that this is because pandas by default interprets the string "NA" as null/NaN. Can I avoid this somehow? All other values are correctly understood by the function, so I know that it is from the "NA" being interpreted wrong (source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html, under "na_values").
Function below for reference:
def rateOrdinalE(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    res_df = df.copy()
    rating_map = {'NA': 0, 'Unf': 1, 'Rec': 2, 'BLQ': 3, 'ALQ': 4, 'GLQ': 5}
    for i in cols:
        res_df[i[0]] = res_df[i[0]].map(rating_map)
    return res_df
Never mind, brain finally woke up. Seem to have solved it with a little revision past the for-loop:
res_df = res_df.fillna(0)
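For what it's worth, another option is to stop pandas from treating "NA" as missing at read time, using the keep_default_na parameter from that same read_csv docs page. A small sketch, with a made-up CSV and a made-up column name (BsmtFinType1):

```python
import io

import pandas as pd

# Hypothetical file contents; the column name is only for illustration.
csv_text = "Id,BsmtFinType1\n1,NA\n2,Unf\n3,GLQ\n"

# Default behaviour: the string "NA" is parsed as a missing value (NaN).
default_df = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False (and no na_values given), "NA" stays a plain string.
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

rating_map = {"NA": 0, "Unf": 1, "Rec": 2, "BLQ": 3, "ALQ": 4, "GLQ": 5}
ratings = literal_df["BsmtFinType1"].map(rating_map)
```

Note that this switches off the default NA parsing for every column in the file, so it only fits if no other column relies on it. The fillna(0) fix is simpler when "NA" is the only source of NaN in those columns.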
How so?
tf am i doing wronggg danggggg
thanks you so much tho u really help me out ❤️ @cold osprey
Hello all I am new to this AI field, so want to know exactly where and how to start to get familiar with Machine learning and Data Science. So can anyone help me out with the roadmap and sources and some basic projects which makes learning interesting..?
ez bro im a 5 years expertize on ai fields and helping to bult chat gpt but first u need solid foundation on MATH, PROGRAMMING, AND DATA HANDLING
second u need to learn the basic of course here a few example ppl u can learn by i learn from themm too
Introduction to Machine Learning by Andrew Ng on Coursera, Python for Data Analysis by Wes McKinney, Introduction to Statistical Learning (ISLR) or Elements of Statistical Learning (ESL) by Hastie, Tibshirani, and Friedman
and some online course Coursera, edX, and Udacity
third
then after u learn some shi u need to do some project like kaggle or self project
forth
after u have mastered the common knowledge lets step it up to next lebel play knowledge
like
-Machine Learning Algorithms
-deep learning
-data processing
fifth
u need to find what u want
-theres natural language processing
-computer vision
-reinforcement learning
(perfect ai like chatgpt need them all btw)
sixth
u go to some advance topic like
-ensemble method
-big data
-model deployment
and then seventh
because ai will always growing u need to stay update on ai field like follow ai forum or community
*note if u mastered all that step just take some ai intern and tell em what u capable off so u can test ur skill and what u dont have so u can learn more and yeah stay sane bro oh yeah i forgor stay healthy and drink a lot of water cuzz if sick cuzz learn and die of it its really shit stupid etc
hope thats help and goodluck
https://roadmap.sh/ai-data-scientist Check it out!
Hi, not sure why this is here--why does it have wrapper 1, 2, 3 instead of the layers I added (flatten/dense) I'm using Resnet50 on my data, this is my code:
`resnet_model = Sequential()
pretrained_model = tf.keras.applications.ResNet50(include_top=False,
                                                  input_shape=(224, 224, 3),
                                                  pooling='avg', classes=77,
                                                  weights='imagenet')
for layer in pretrained_model.layers:
    layer.trainable = False
resnet_model.add(pretrained_model)
resnet_model.add(Flatten())
resnet_model.add(Dense(512, activation='relu'))
resnet_model.add(Dense(77, activation='softmax'))`
This is what shows up when I look at the summary:
It's training but this is what shows up when I compile it:
this looks like the issue: https://github.com/duckdb/duckdb/issues/4295
But your query didn't have an "in" ? There are specific rules about when pushdowns occur, so there are limitations... but your original query was fairly simple.
my bad, this is another issue I had with DuckDB a few days ago (hence why I said I dont think its prod-ready)
and if I recall correctly, this bug was only with Duckdb, not Datafusion
I just don't think that's a fair characterization. You're not describing a bug: Lack of predicate pushdown for complex parquet queries is more of the "absence" of an optimization.
And there are workarounds.
And, most importantly, any data pipeline will employ multiple technologies, which is why arrow is becoming the backbone of most pipelines.
its not an absence of "optimization" when it takes 10x more times than using an equality check
its more like an absence of basic functionality that a good percentage of SQL queries use
Would you prefer the equality check takes 10 times longer? What SQL engine are you comparing it to, that can read a parquet file hot?
Being able to read a parquet file, without ingesting to a table, while filtering the contents on the fly based on the selected criteria is fairly bleeding edge stuff. Doesn't bother me one bit that it has limitations.
I'm comparing SELECT ... WHERE col IN ('hello word') to SELECT ... WHERE col = 'hello word', with the SAME query engine: duckdb
and I think that regardless; a one-element tuple should be converted to an equality check, but thats besides the point
I hear you, I'm just saying, it's absence of an optimization you're complaining about. Pushdown to parquet reading are limited today to column selection, and AND equalities: that's known (perhaps could/should be better documented, but I'm not affiliated with them).
oh and btw, the column in question is actually a partitioned column (hive style) with only about 30 distinct values, how in the world is it even possible for this query to take more than 20 seconds
I have no idea, but if you could publish a reproducible example, I'd love to take a look.
youre mistaking this for a lack of a feature, but I'm pretty confident that I can filter the data faster with pyarrow + pure-python loop
(even while doing all the filtering in Python)
Would love to see that too. I'm open minded... I use whatever tool is right for the problem I have. For me, it tends to be duckdb, but I write plenty of polars/pandas/numpy/python code too. That's what's wonderful about arrow, I can do all of that without copying the data.
@left tartan how do you test your workflow?
I hear a lot about testing in data/DE spheres but I see no one doing it
Data profiling like https://greatexpectations.io/ are a massive anti-pattern imo
We're moving some of our SQL pipelines to DBT, and recently started using Syrupy to do snapshot unit testing in Python.
And what exactly are you testing?
We're not an airflow shop (doesn't fit our use case), but there's also solutions on that side (I think Eivl would be the one to ask on that side).
I don't use Airflow either don't worry
from a data perspective, it's really about testing that our data acquisition is working correctly/consistently, and checking each step along the pipeline to see if we're regressing. So, what we test is primarily injecting known data and making sure we get expected outputs. That's the idea of snapshot testing
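A bare-bones sketch of that idea (inject known data, compare against a stored snapshot) with only the standard library. This is not Syrupy's API; check_snapshot and pipeline_step are hypothetical names, and the JSON-on-disk format is just an assumption for the example.

```python
import json
import tempfile
from pathlib import Path


def check_snapshot(name, actual, snapshot_dir):
    """Return True if `actual` matches the stored snapshot; store it on the first run."""
    path = Path(snapshot_dir) / f"{name}.json"
    serialized = json.dumps(actual, sort_keys=True, indent=2)
    if not path.exists():
        path.write_text(serialized)  # first run: record the expected output
        return True
    return path.read_text() == serialized


def pipeline_step(rows):
    """Hypothetical transformation under test: total minutes per student."""
    totals = {}
    for row in rows:
        totals[row["student"]] = totals.get(row["student"], 0) + row["minutes"]
    return totals


known_input = [
    {"student": "ann", "minutes": 30},
    {"student": "bob", "minutes": 45},
    {"student": "ann", "minutes": 15},
]

with tempfile.TemporaryDirectory() as tmp:
    first_run = check_snapshot("totals", pipeline_step(known_input), tmp)
    second_run = check_snapshot("totals", pipeline_step(known_input), tmp)
    # Simulate a regression: the step suddenly returns something different.
    regressed = check_snapshot("totals", {"ann": 999}, tmp)
```

The snapshot only tells you the output changed, not whether it was right in the first place, so it pairs naturally with explicit assertions.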
I only know snapshot testing from idk jest and front end JS. I'll think about this...
That's exactly where the idea came from... one of our senior engineers basically said: "Why can't we do something like jest?"
I cycle to work every day and I had a good idea on how to test data pipelines, I'll think I'll write it out in full.
Previous company, we built out a test framework for end to end tests (again, injecting known inputs), but we didn't do it at the unit level. A combination of dbt (decomposing to individual scripts) and snapshot testing is kinda nice.
Would love to hear
The gist is simply that you need to know your output schema. Around this you have a "contract". The most basic version, which is always in the contract, is basically "I have these N tables with M columns, they are of type T1, T2, ..". This idea is DB agnostic; your RDBMS captures all of this, but other things wouldn't.
On top of this you formulate extra properties or invariants that are part of your contract. For instance, "all sport sessions in my dataset have a unique identifier" or "The meals column contains all the meals consumed, even if other data is missing".
Each line in the contract, must be handled by either your DB making sure it's impossible or it should be tested. Why? These are exactly the assumptions people downstream make about your data. Finally never make a backwards incompatible change to the contract, so no renaming of tables or columns, it just aint worth it, so much downstream always breaks.
The key insight is that I'm not testing transformations, I'm testing whether or not it adheres to my spec. I should be able to refactor my pipeline into something more performant and hit the same result.
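As a rough illustration of testing against the contract rather than the transformation, here's a small standard-library sketch. The schema, the column names, and contract_violations are all hypothetical; the one invariant checked is the "unique identifier" example from above.

```python
# Hypothetical contract: column name -> expected Python type.
EXPECTED_SCHEMA = {"session_id": int, "sport": str, "minutes": float}


def contract_violations(rows):
    """Return a list of human-readable contract violations (empty if the data conforms)."""
    violations = []
    for i, row in enumerate(rows):
        if set(row) != set(EXPECTED_SCHEMA):
            violations.append(f"row {i}: columns {sorted(row)} do not match the contract")
            continue
        for column, expected_type in EXPECTED_SCHEMA.items():
            if not isinstance(row[column], expected_type):
                violations.append(f"row {i}: {column} is not {expected_type.__name__}")
    # Invariant from the contract: every sport session has a unique identifier.
    ids = [row.get("session_id") for row in rows]
    if len(ids) != len(set(ids)):
        violations.append("session_id values are not unique")
    return violations


good_rows = [
    {"session_id": 1, "sport": "run", "minutes": 30.0},
    {"session_id": 2, "sport": "swim", "minutes": 45.5},
]
bad_rows = good_rows + [{"session_id": 2, "sport": "bike", "minutes": "60"}]

good_violations = contract_violations(good_rows)
bad_violations = contract_violations(bad_rows)
```

Nothing here knows how the rows were produced, so the pipeline can be refactored freely as long as the output still passes.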
Still reading, but you're also touching on something I didn't mention. We have validation sql as part of our test cases process, but these are baked into the pipeline itself (so it runs every time, rather than as a unit test), similar to what you're describing.
I haven't done much with it yet (still adopting dbt), but this comes to mind: https://docs.getdbt.com/docs/build/tests
Personally I'm very wary of being overzealous and testing too much. I don't want to test if my SQL or polars or Pandas does something, I want to wrap that in a function and check if what it returns, given my input, is what I needed it to return
Changing a query should not change my tests etc etc (for the same output)
Yah, that's sort of why I like snapshot testing: I just want to know if something unexpected changed, but I don't want to go through the effort of writing a test for every case. I think I agree with where you're thinking... I like the idea of a combination of validation (assertions) and snapshot testing.
I haven't used DBT but I find it very very sus idk. If you have 4 intermediate DAG nodes you shouldn't test those
And all those models introduce coupling as well
I like it if and only if the contract states that thing A and thing B must go through the same process otherwise I would have linear input output with no shared models, if I were to use DBT that is
With snapshot testing you're basically checking for regressions right? I've had many instances where it was never right to begin with 🤣
Lol, yah, that does indeed happen. But yes, really about regressions.
Tbh, I like having them after I guarantee that what I have is correct especially because backwards incompatible changes (aka. breaking everyone's dashboards) are a massive no-no. I'll think about integrating this.
We do go through fairly lengthy customer acceptance tests, so usually our problems are; some new edge case comes up (ie: data is incorrect under certain conditions)… we fix that, but then need to make sure we didn’t regress under the cases we’ve already validated.
Ah yeah, when you have a very mature pipeline that's the way to go
But yeah, my approach is very inspired by sane SWE style testing. I should really write this in full with an actual project. If you squint enough it's TDD on data.
Yah, I like where you’re going with this!
https://stackoverflow.com/questions/77260669/unable-to-reconstruct-back-the-images-using-ddpm-model
please look at this. I had problem calculating reconstruction error using ddpm models
hey all - is this a good place for a question about the pandas library and a student-scheduling app I'm writing?
sort of an intermediate-ish question, and I'm trying to avoid sloppy programming practices, so it might take a bit of explanation. Not sure if this is the right channel for asking for help.
yes
... a few moments, I may have solved it.
Can someone help me real quick in python please
I can try, while I try and answer my own question.
don't ask to ask or if someone can help, just ask your question
Hi, I’m going to start a sort of a blog where I write stuff related to ML/AI. The main overall subjects I will include are mathematics behind the algorithms, explanation of the actual algorithms alongside some code (project examples). How should I split the menu bar?
Or, If it’s an in-depth question and you need to share a lot of code, you may want to open a help thread. #❓|how-to-get-help
These are a valuable kind of post if you can write them well. Don't worry about the blog format, just start writing. You're more likely to make progress that way.
True, I’ll just have it under “articles” for the moment lmao
Thanks!
I think I've got a solution. The short version is that I'm trying to update a DataFrame over which I'm iterating, which Pandas docs explicitly warn against. I'm realizing I need to refactor my code into a better pattern.
just a note: it's almost always problematic to iterate over something and update it simultaneously. data frames aren't unique in this aspect, it applies to python dicts and lists too
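A tiny plain-Python illustration of that, using a made-up enrollment dict: deleting keys while looping over the same dict blows up, while building a new dict from the old one is fine.

```python
def drop_empty_unsafe(enrollment):
    """Deletes keys while iterating the same dict: raises RuntimeError in CPython."""
    for name in enrollment:
        if enrollment[name] == 0:
            del enrollment[name]
    return enrollment


def drop_empty_safe(enrollment):
    """Builds a new dict instead of mutating the one being iterated."""
    return {name: count for name, count in enrollment.items() if count > 0}


try:
    drop_empty_unsafe({"art": 3, "band": 0, "chess": 0})
    mutation_error = None
except RuntimeError as exc:
    mutation_error = exc

cleaned = drop_empty_safe({"art": 3, "band": 0, "chess": 0})
```

With lists the failure is quieter (items get skipped instead of an exception being raised), which is arguably worse.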
is a better pattern to throw the updates into a second dataframe, and then merge the two at the end?
can you describe what you're actually trying to do?
in this particular case, not your app in general
I have a dataframe of students, their class preferences, and the actual scheduled classes. I'm trying to add to the actual scheduled classes if the class isn't overenrolled. The problem is that, as I add students, iterrows() is still working with an old copy of the dataframe, so I can't poll the dataframe for the updated attendance, and the class gets overenrolled.
it's very rare that i actually need iterrows. can you share your code?
(and i literally never actually use iterrows. i always use itertuples instead)
it was because apply() didn't update the dataframe as I iterated either, and I thought iterrows might
you might be expecting too much magic. show your code so at least i can see what you were trying to do
yep, just deleting a portion that didn't work, one second
if you can include an example input and your desired output, that would be very helpful as well
for day in SCHEDULE_DAYS:
    print("--------now scheduling for day" + str(day))
    for preferenceNum in range(1, PREFERENCES_PER_DAY + 1):
        print("---------now scheduling for preference " + str(preferenceNum))
        for grade in GRADES_ORDER:
            print('--------------------now scheduling for grade ' + str(grade))
            # Because the shuffled student data is working with the filtered student data,
            # the indexes must be preserved
            student_data = student_data.sample(frac=1, axis=0, ignore_index=False)  # shuffle students
            student_data_filtered_by_grade = student_data.loc[student_data["Grade"] == grade]
            classes_to_add = pd.DataFrame()
            for index, row in student_data_filtered_by_grade.iterrows():
                r = scheduleClassIfEligible(row, day, preferenceNum, grade, student_data, electives_list)
                if isinstance(r, pd.Series):  # if it's a series and not None...
                    # count r and student_data's concat'd column, pass in only if it wouldn't over-enroll
                    # add to classes_to_add
                    classes_to_add = pd.concat([classes_to_add, r])
            # add the classes at the end of the grade student_data
the bottom portion is just pseudocoded for now
is this something like where each student has a selection of preferences each day, and you randomly assign first preferences, then second, and so on?
yep yep
it looks like that assignment happens separately for each class year / grade, and for each day?
I've done it in Java, and was porting it over to this trying to use pandas
how did you do it in java? i would almost argue that this is not a great use for pandas, since you are in fact operating on each row sequentially
pandas made every other part of this project easier!
fair enough. occasionally i do things like convert my dataframe to a list of dicts, do something with that, and then convert the dicts back to a data frame.
(you can do this with pandas of course)
but i am curious what your java implementation of this same logic looks like
are you new to python? or just pandas specifically?
to be honest I don't know off the top of my head how I did it in Java, it's on my Github. It's been a while. I think I was maintaining two separate tables of data but I did run into bugs trying to keep both in sync.
I fixed them but thought this would be a more expandable way to do it
i actually would suggest keeping a separate table here as well
but you can rely on the student id / table index (as you already pointed out) to keep it in sync
then you can use DataFrame.join at the end
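Roughly what that could look like, as a sketch only: the column names, class capacities, and student ids below are invented, and it handles a single preference round rather than the full day/preference/grade loop. The live enrollment counts and the assignments live outside the DataFrame being iterated, and get joined back on the index at the end.

```python
import pandas as pd

# Hypothetical data: one row per student, indexed by student id.
students = pd.DataFrame(
    {"grade": [12, 12, 11, 11], "first_choice": ["art", "art", "art", "band"]},
    index=pd.Index([101, 102, 103, 104], name="student_id"),
)
capacity = {"art": 2, "band": 2}  # hypothetical class sizes

enrolled = {name: 0 for name in capacity}  # live counts, updated as we assign
assignments = {}  # student_id -> class, kept separate from `students`

for row in students.itertuples():
    choice = row.first_choice
    if enrolled[choice] < capacity[choice]:
        enrolled[choice] += 1
        assignments[row.Index] = choice

# Join the separate table back onto the students by index.
scheduled = pd.Series(assignments, name="scheduled_class")
result = students.join(scheduled)
```

Because the counts are checked and updated in the same loop, a class can't go over capacity, and students who didn't get their choice (103 here) simply end up with NaN to be handled in the next preference round.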
however i see some issues with your code that you might want to address
not necessarily critical problems, but suggestive that you don't have the algorithm laid out clearly in your mind
hmmm, well - the thing is, is that this worked for our school. But there were a handful of classes overenrolled.
But 95 % of the data was good so we used it and I was going to fix these bugs for next semester.
And they just did the fixes by hand. It still was a big timesaver.
what does each row in this table represent?
and what is r? the code suggests that it's a Series, but a series containing what exactly?
ah, the thing with r was relatively new
probably the most important favor you can do for yourself when working with data is to be very clear about what each "row" or "entity" represents
I should stash this and go back to my earlier commit
that's what branches are for!
i ask what your java implementation looks like in part because i feel like it might be easier to start from something that more or less works
once I dig my way out of this bug, that was the plan haha
I did actually start from that, but I laid out this code a while ago, like months ago. And it may be a bit much for me to get into right now how it worked, mainly because it's late and I need to teach tomorrow 🙂
fair enough
so what is each row of student_data? what are the columns? how are the students' preferences and class schedules represented here? i think i would need to know that in order to help, otherwise it's too much guess work for someone who isn't inside your head
Does it need to be a neural network? You can just get features with something like tsfresh and then cluster with idk k-means.
I work in the time series domain and honestly, I'm a sucker for simple solutions 😄
K-means has exactly the same property, close clusters are similar as well
Their similarity is just the distance between the cluster centers
I mean, you should look at K-means (and all clustering) as solving this optimisation problem:
clustering = argmax(distanceBetweenClusters(data)) && argmin(distanceWithinCluster(data))
So yes you do actively want to create dissimilar clusters but the fact of the matter is that maybe an optimal clustering has 5 close clusters and the sixth that is very far (outliers, fraud, ...)
What are input nodes? Are they just variables?
What are catch 22 parameters? Are they just non informative variables?
What is catch22
ooooh
So you have 440N features where N is the number of input series you have per sample
In defining your problem you're already thinking in terms of neural nets (weights) etc. I'd definitely take a step back because you might find something very simple! 😄
Probably not but who knows
But in general, don't think in terms of the solution, keep it very simple first. I tend to write it in LaTeX style notation in a markdown, like really generalize the problem I'm doing.
I understand your solution now, you want to make clusters and then models for each cluster
Is your problem time series classification?
Okay let me make a suggestion
Keep it simpler. If you are solving a time series classification problem make N separate models for your thing
One using temp, one using humidity, one using velocity (idk)
Evaluate the performance of each of them. Keep the models absurdly simple here, for instance the last N lags with an Xgboost type model.
This will already tell you something about the relevance of each individual series, not everything but something
The next step I'd do is build a stacking ensemble using my N "simple classifiers" and see if this improved my score
Obviously, stacking doesn't take interaction effects at the input level into account, that's the major drawback. The thing is, codewise this is so so simple. I'd also have to see if interaction effects at the input level are relevant for my problem to begin with. If it is then yeah, from that point onwards I'll start thinking of some multivariate time series model compared to an ensemble of univariate models
With this solution I have given myself several places to solve the problem prematurely. If you overengineer from the get go you might have wasted time. Additionally, you need to be able to compare your high complexity solution to low complexity solutions to get a sense of whether or not it was worth it in the first place.
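For the "last N lags" part, a minimal sketch of turning one series into a feature table that any simple classifier could take. It assumes one label per time step, which may not match your setup, and make_lag_features plus the toy temperature data are made up for the example.

```python
def make_lag_features(series, labels, n_lags):
    """For each time step t >= n_lags, use the previous n_lags values as features for labels[t]."""
    if len(series) != len(labels):
        raise ValueError("series and labels must have the same length")
    features, targets = [], []
    for t in range(n_lags, len(series)):
        features.append(series[t - n_lags:t])
        targets.append(labels[t])
    return features, targets


# Toy univariate series with one (hypothetical) class label per time step.
temperature = [20.0, 21.0, 23.0, 22.0, 24.0]
labels = [0, 0, 1, 0, 1]
features, targets = make_lag_features(temperature, labels, n_lags=2)
```

You'd build one such table per input series (temp, humidity, ...), fit one small model on each, and only then look at stacking them.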
Make sense?
Have you tried it yet?
As in, are you sure none of them are better than random
Basically, there's tons of ways to do specifically univariate time series classification. I'd start there. One of them is feature extraction but there's others
Do that, proceed into a stacked ensemble and then gradually go towards your solution
But idk, maybe I don't understand what you're trying to do in the first place 😅 . Considering there's a lot of NN specific terminology it looks like you're set in what you want to do. All I'm saying is, take a step back, formulate your problem without using the word "weight", "cluster", "node", "backpropagation" and then progress from there 🤷
Hi guys , just finished working on New York taxi trip duration prediction. Here is my notebook, please check it out. https://www.kaggle.com/code/nishchay331/pc-1-new-york-city-taxi-trip-duration . If you have any suggestions on improvement or any better idea , feel free to let me know . Thank you.
Hello, I installed the PyTorch GPU module and every day I’ve been training my model for 5 epochs on the GPU, but today for some reason my GPU usage is 0%. Anyone know what’s wrong with it? I didn’t make any changes in my code.
Also there used to be nvidia rtx 3060 next to CUDA_VISIBLE_DEVICES : but now it’s showing 0
I am fine-tuning GPT-3, training it on news documents. When I upload a training file with only system messages, it throws an error and asks for an assistant message. So for each news document I sent the training data as: <>{"role": "system", "content": "NEWS Article"}, "role": "assistant","content": "Placeholder message for fine-tuning"}}<>
Now whenever I ask it a questions, it responds: "Placeholder message for fine-tuning", what should I do?
Wow, that's an anxiety laden email. I think we can all generally guess, but it would certainly depend on the job title/level
@serene scaffold
i can send a JD
Tough thing is it doesn't give a clue about their stack... but it's fintech
haha ye i kinda know what they do here
i would expect it to be python heavy
idk y they have java n cpp
I mean, are they looking for NLP stuff? LLM? More classical EDA? etc
oh
but this seems more like a SWE role right
APIs for the models
ML/MLOps engineer kinda stuff
Yah, but you're meeting with "head of AI", so unsure if that's an AI oriented theory discussion, or a SWE theory discussion?
I think I'm just adding pressure not helping 🙂
HAHA npnp
tbh im pretty weak in either of those topics: AI or SWE
i mainly do data viz stuff and some pipelines at work so
In general, my interview advice is: Know what you know, be pleasant, and know how to not know the answer to something you don't know, especially when you're talking to someone who's an expert.
For AI and SWE, if you're prepping for an interview, you really can't learn something completely new to prep for an interview. But... one thing I think is very helpful is: Watch conference videos to learn about current trends in technology.
(beyond that, maybe ask the same question in #career-advice and you'll get better input)
Ait cool thanks
So if it were technical theory I'd ask basic stuff like how do you know a model is overfitting
And what is the difference between xgboost and random forest, I think it's stuff like that no? The typical ones
a recently-added sonarlint rule, warning against equality comparison with floats, has triggered in a file of mine. Given that the context is an sqlite file imported into a pandas dataframe, and the values in question are powers of two (0, 0.5, 1, 2, with sonar complaining at if cell==0.5), I should be able to ignore this without consequence, right?
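A quick sanity check for the float-equality question, as a minimal sketch rather than sonarlint's own guidance. It assumes the values are stored and read back unchanged (SQLite REAL and pandas float64 are both 8-byte IEEE 754 doubles) and not produced by arithmetic. The variable names are made up for illustration.

```python
from fractions import Fraction
import math

# Fraction(float) converts the stored binary value exactly, with no rounding.
half_is_exact = Fraction(0.5) == Fraction(1, 2)     # 0.5 has an exact binary form
tenth_is_exact = Fraction(0.1) == Fraction(1, 10)   # 0.1 does not

# A value that was stored as 0.5 and read back compares equal to the literal.
cell = 0.5
stored_match = cell == 0.5

# Values produced by arithmetic are where == becomes unreliable.
computed = 0.1 + 0.2
naive_match = computed == 0.3
tolerant_match = math.isclose(computed, 0.3)

print(half_is_exact, tenth_is_exact, stored_match, naive_match, tolerant_match)
# prints: True False True False True
```

So for 0, 0.5, 1 and 2 read straight from the database the comparison should be safe; if the linter warning is a nuisance, `math.isclose` is one way to keep the check and silence it.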
I'm trying to draw a plot to show streamlines for the equation psi = -Kx^2 + Ky^2. Here is my code for that:
import numpy as np
import matplotlib.pyplot as plt
K = 1 # I'm assuming K = 1; this assumption might be wrong
# Define the grid
x = np.linspace(-100, 100, 50)
y = np.linspace(-100, 100, 50)
X, Y = np.meshgrid(x, y)
# Calculate the vector field components
U = -K * X**2
V = K * Y**2
# Create a streamline plot
plt.streamplot(X, Y, U, V, density=1, linewidth=1, arrowsize=1.5, color='blue', broken_streamlines=False)
# Add x and y coordinate lines at 0,0
plt.axhline(0, color='red', linewidth=2.5) # Horizontal line at y=0
plt.axvline(0, color='red', linewidth=2.5) # Vertical line at x=0
# Add labels and title
plt.xlabel('x')
plt.ylabel('y')
plt.title('Streamline Plot of -Kx^2 + Ky^2')
# Show the plot
plt.show()
This is the result that I'm getting (image attached with red coordinates).
However, the expected result should look like the other image
Can anybody help me with this, please?
I'm not sure offhand what plotting the streamlines means - perhaps you should be plotting the gradient of this function? That'd be -2 K x, 2 K y. But what you have on the plot to the right frankly looks like neither to me - it looks to me like the contour plot of the function (the curves on which ψ is constant).
what changes would I have to make in my code for that?
plt.contour takes the scalar field itself, so:
f = K * (Y**2 - X**2)
plt.contour(X, Y, f, levels=20)
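For reference, a fuller sketch of the same idea, assuming K = 1 and the usual 2D stream-function convention u = ∂ψ/∂y, v = -∂ψ/∂x (the original post does not say which convention is intended). Under that convention the contours of ψ are the streamlines, and the velocity field that `streamplot` needs is (2Ky, 2Kx) rather than (-Kx², Ky²).

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

K = 1.0  # assumed value; it only rescales the contour levels
x = np.linspace(-100, 100, 50)
y = np.linspace(-100, 100, 50)
X, Y = np.meshgrid(x, y)

# Stream function psi = -K x^2 + K y^2
psi = K * (Y**2 - X**2)

# Velocity from the stream function (assumed convention):
# u = d(psi)/dy, v = -d(psi)/dx
U = 2 * K * Y
V = 2 * K * X

fig, (ax_c, ax_s) = plt.subplots(1, 2, figsize=(10, 5))
ax_c.contour(X, Y, psi, levels=20)           # lines of constant psi
ax_c.set_title("Contours of psi")
ax_s.streamplot(X, Y, U, V, density=1, linewidth=1)
ax_s.set_title("Streamlines of (u, v)")
for ax in (ax_c, ax_s):
    ax.axhline(0, color="red", linewidth=1)  # y = 0
    ax.axvline(0, color="red", linewidth=1)  # x = 0
    ax.set_xlabel("x")
    ax.set_ylabel("y")
fig.tight_layout()
```

Both panels should show the same family of hyperbolas. To view the figure, call `fig.savefig(...)` with a path of your choice, or drop the `matplotlib.use("Agg")` line and call `plt.show()`.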
Thank you🚀
Why do I get different ticks than the displayed ones?
I would expect ax.set_xticks(ax.get_xticks()) to do nothing... but it makes so x = -2 and x = 6 are visible :/
fixed with:
ax = lines[0].axes
xlim = ax.get_xlim()
ylim = ax.get_ylim()
ax.set_xticks(list(ax.get_xticks()) + [rx1, lx1])
ax.set_yticks(list(ax.get_yticks()) + [0.5])
ax.set_xlim(xlim)
ax.set_ylim(ylim)
(found solution at https://stackoverflow.com/questions/55603710/python-matplotlib-ax-get-xticks-doesnt-get-current-xtick-location )
please always give text as text, not as screenshots. people might need to copy parts of the message to help you.
Sorry. I'll keep that in mind
So we are considering building a data warehouse. We have approximately 300-400 databases, all with identical structure but different datasets, of course. Now we are considering creating one large data warehouse with data from all of the 300-400 databases. How would one go about this?
Which libraries would you guys use, what are dos and donts in regards to this?
is it possible to put arrows in a contour plot?
what kind of databases? SQL?
Ah sorry, MSSQL
We aren't talking millions of rows per DB, probably 20-100k per db ish
hello im trying to make a bot talk to the user but when i try to allow the text to print on the textbox it gives me an error TypeError: CTkEntry.get() takes 1 positional argument but 3 were given
I think we need more code than that
Nah but feel free to DM me a part of the code and I might have a look later tonight
wdym what libraries?
if its on MSSQL, i presume u would use something like SSIS?
Mainly in regards to what would be the smartest, fastest and most reliable way of transferring certain data from these 300-400 databases into another database
Hi, is there anyone who is willing to discuss some python code with me? I have spent so much time on it and there is still something that is not working.
@cursive valve Hi, if you want to discuss your python code.
Are you willing to help me?
what kind of project?
pls show me.
it will be better if could talk because it is quite complicated
but I will leave it up to you
you have document?
yeah but it is not in english
hmm... no problem
I mean in a nutshell it is a program that will subtract numbers like '1.01e-11' and '-1.10110101e111', both in binary form, and both could be negative or positive
and I have to write output
With small numbers it works perfectly fine, but with larger ones it causes trouble
Well I do not really want to share my code publicly, because of strict rules that I have to follow
But I need to work with strings, otherwise there will be inaccuracy
you are right
@keen kettle Hello my friend, how are you?
I don't know if you still have some spare time, but I would like to use this opportunity to ask you something about this project I'm enrolled in
I am building a portfolio project which consists in:
I've downloaded a database from kaggle of a fictional telecom company, regarding churn data
Then, I've uploaded the database to SSMS
I've cleaned the data using python and SQL, one-hot encoded columns, solved missing and zero values
I've done feature engineering to prepare for the usage of Random Forest Classifier machine learning algorithm to create a churn prediction tool
The question is this: my feature engineering produced new tables whose structure, column names and overall shape differ from the original database. How can I merge them into one single dataset to feed into the ML algorithm? From what I've found in my research, the dataframes need a shared index column. The original index column in the database was Customer ID, but the newly created tables have no connection to Customer ID at all, since the feature tables answer different questions
Could you lend me a hand ?
I want to understand how can I use my engineered features to feed my machine learning model
I can share my screen with you
I want to use python to run this machine learning model, but I don't know how to do it, this is my first time dealing with machine learning models
hmm...
let me see your screen
I am seeking guidance for a conceptual endeavor regarding Natural Language Processing (NLP) model refinement. My objective is to tailor an NLP model utilizing a private corpus. The data at hand comprises two distinct datasets:
- The first dataset encapsulates JSON structured data, encompassing names and pertinent information of various enterprises.
- The second dataset embodies a text corpus, inclusive of a passage containing 6000 words.
The envisioned outcome is to engineer a model proficient in executing text or data retrieval from the aforementioned corpus, predicated on user prompts. I am open to insights or suggestions that may steer me towards devising a viable solution for this challenge. Your expertise and directional advice would be greatly valued.
so you're trying to develop an information retrieval system. are you looking for a ChatGPT-like user experience?
by the way, you're using a lot of fancy terms here in a way that I don't think is necessary. NLP professionals don't talk about datasets that "encapsulate JSON structured data" or "engineering a model proficient in executing [task]". you can just say that the first dataset is JSON data representing [whatever it represents], and that the desired result is a model that retrieves information predicated on user prompts.
The second dataset embodies a text corpus, inclusive of a passage containing 6000 words.
"inclusive of a passage containing 6000 words": is that the whole corpus, or is that just one of the documents that are in it?
I am attempting to animate a pcolor plot with matplotlib.pyplot and matplotlib.animation. However, I'm running into an issue with the colorbar duplicating repeatedly. The following is the figure setup code:```py
fig = plt.figure()
ax = fig.add_subplot()

def init():
    return ax

def update(frame):
    f = int(frame)
    plt.cla()
    """
    ax.set_xlabel(r'$x$')
    ax.set_ylabel(r'$y$')
    ax.set_xlim(0, 2 * pi)
    ax.set_ylim(0, 2 * pi)
    ax.xaxis.set_major_formatter(FuncFormatter(lambda val,pos: '{:.0g}$\pi$'.format(val/pi) if val !=0 else '0'))
    ax.yaxis.set_major_formatter(FuncFormatter(lambda val,pos: '{:.0g}$\pi$'.format(val/pi) if val !=0 else '0'))
    ax.xaxis.set_major_locator(MultipleLocator(base = pi))
    ax.yaxis.set_major_locator(MultipleLocator(base = pi))
    """
    u_plot = ax.pcolor(x, y, np.transpose(u[frame, :, :]), cmap = 'coolwarm')
    # ax.set_aspect('equal')
    # ax.set_title(r"$u(t, x, y), t = $" + "{0:.3f}".format(t[frame]))
    fig.colorbar(u_plot)

ANI = ani.FuncAnimation(fig, update, frames = range(0, T), init_func = init)
ANI.save("advection_animation.mp4", fps = 180, dpi = 200)
```
Unfortunately, I end up with animations like the following:
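One likely cause, offered as a sketch rather than a confirmed diagnosis: `fig.colorbar(u_plot)` runs inside `update`, so every frame adds another colorbar axes, and `plt.cla()` only clears the main axes. Creating the mesh and the colorbar once and only swapping the data per frame avoids that. The arrays below are synthetic stand-ins for `x`, `y`, `u`, `t` and `T`, and `pcolormesh` is swapped in for `pcolor` because updating it with `set_array` is straightforward.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import matplotlib.animation as ani

# Synthetic stand-ins for the poster's data (assumption): u has shape (T, N, N).
T, N = 20, 32
x = np.linspace(0, 2 * np.pi, N)
y = np.linspace(0, 2 * np.pi, N)
t = np.linspace(0, 1, T)
u = np.sin(x[None, :, None] - t[:, None, None]) * np.cos(y[None, None, :])

fig, ax = plt.subplots()
# Fix the colour limits up front so the colorbar stays valid for every frame.
mesh = ax.pcolormesh(x, y, u[0].T, cmap="coolwarm",
                     vmin=u.min(), vmax=u.max(), shading="auto")
fig.colorbar(mesh, ax=ax)  # created exactly once, outside update()

def update(frame):
    # Replace only the data of the existing mesh; no new artists or axes.
    mesh.set_array(u[frame].T.ravel())
    ax.set_title(f"t = {t[frame]:.3f}")
    return (mesh,)

anim = ani.FuncAnimation(fig, update, frames=range(T))
```

Saving works as in the original, for example `anim.save("advection_animation.mp4", fps=30, dpi=200)`, which needs ffmpeg to be installed.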
Also, can you show an example of the json data, and explain how it relates to the second dataset?
chat gpt is quite nice at translating jupyter to python scripts
the exports functions normally do not understand what to do with the bash commands
You can already do that deterministically with nbconvert
it didn't run in any way here, not sure why
i wanted to run the cli command to be able to use tmux now that we upgrade from colab to a pure ubuntu server with gpu
it was so complicated to convert that simple file lol...
Tmux is bae af
not sure what bae is, but i'm forcing it to be at ease lol
Try asking this again in several hours if you don't get an answer. I appreciate that your question is detailed. I just don't know what to do
Bae is what one would call their romantic partner
o_0
anyone here uses fluorescent tshirts for coding
Can someone help me with this
is there a way to find out quantitatively whether a given spectrogram contains "some sound" or is just silent
Hiiii
Thank you for responding
I only wrote the request in fancy terms since I have been asking people for help on LinkedIn too
Yes correct - information retrieval system
That would be the whole corpus
What I am trying to develop is a POC
Where I just want to demonstrate that we can build a model that is curated to our needs and our dataset
And not open domain
The two datasets are completely unrelated
hello! i have a weird problem with my DataFrame today. i have a column with datetime. time is in format "2017-08-17 04:00:00", but as soon as i pass it to any function, without doing any manipulations or modifications of it, it changes time to 1970-01-01 00:00:00.000
@terse frigate I still need to see examples of the json data
print(somedataframe)
def wtfisgoingon(input):
    print(input)
    return
wtfisgoingon(somedataframe)
i never experienced such behavior before that my data gets altered by passing it to a function , and quite clueless on how to solve it
wtfisgoingon can't modify input, so by simplifying the actual function, you've removed the part that causes the problem
if an object is mutable (like dataframes), anything with a reference to that object can modify it, including functions
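To illustrate the point about mutable objects, here is a minimal sketch with a plain list standing in for a DataFrame. The function names are made up; the same distinction applies to in-place DataFrame operations versus ones that return a new object.

```python
def only_reads(values):
    # Reading or printing never changes the caller's object.
    return len(values)

def rebinds_locally(values):
    # Rebinding the local name builds a new list; the caller's list is untouched.
    values = values + ["ignored"]
    return values

def mutates_in_place(values):
    # In-place methods act on the very object the caller passed in.
    values.append("1970-01-01 00:00:00")

timestamps = ["2017-08-17 04:00:00"]

only_reads(timestamps)
after_read = list(timestamps)      # snapshot: still one element

rebinds_locally(timestamps)
after_rebind = list(timestamps)    # snapshot: still one element

mutates_in_place(timestamps)       # now the caller's list has two elements
print(after_read, after_rebind, timestamps)
```

A function like `wtfisgoingon`, which only prints its argument, falls in the first category, which is why it cannot reproduce the problem.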
but i assume that would require that said anything or function, is run
yes? in either case, your code example doesn't encapsulate the problem.
ok. what do you suggest me to do to try and solve this?
as you see i print somedataframe, and the only other thing that i do to somedataframe is to pass it to the function and print it again. so no code is run that should change it
create a code example with every variable defined that has the same problem as the actual code
ill try and recreate the problem in a new .py 👍🏻
Types of neural networks and optimization algorithms are independent of each other right? Like I can use any optimization algorithm I want with any type of neural network?
you're talking about stochastic gradient descent vs Adam, right?
I am not talking about either
I am talking in general
yes, and are those two examples of what you mean by "optimization algorithm", or not?
by types I mean like LSTM, ANFIS and optimization algorithm ACSLFA, Levenberg Marquardt
I do mean adam but i am not sure whether acslfa and lm are optimization algorithms and whether they could be used alongside lstm or anfis
as usual there was an explanation to the problem, of a human nature... i did not realize that the variable was holding multiple dataframes (because it loads csv files by searching for a string, and i assumed it had only found one file). what confused me and was unexpected behavior was that printing it initially would print somedataframe[0], and after passing it to the function it would print somedataframe[3], which has its timestamp in an incompatible format
tl;dr: doing print on a variable with multiple objects without specifying which object, it will do print[0], and after passing it to any function, it will print out the [-1] object.
there are no "variables with multiple objects". a variable is always exactly one object, though that object might contain other objects. it sounds like the variable might be a list.
okay, yes, a list storing multiple dataframes
but if you print a list, it will just print the whole list, so your assessment about "it will do print[0], and after passing it to any function, it will print out the [-1] object." is incorrect
there must be more going on.
yes you are right, it does print out them all, what i should have said, is that it prints them out in reverse order when i print from inside the function 🙂
weird
actually my brain is just completely malfunctioning and python is behaving exactly as expected. thanks for your efforts. i will pour down a liter of coffee to correct my cognition
Hi, I am running meta-llama/Llama-2-7b-chat-hf on runpod.
The machine type is: 1 x RTX A6000, 14 vCPU 48 GB VRAM
I am running a for loop and calling the model using hugging face pipelines
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device=0, return_full_text=False, use_cache=False)
After the 5th loop the GPU Utilisation becomes 100% full and output is very slow or sometimes no output. What can I do?
For testing I ran:```py
def data():
    for i in range(1000):
        yield f"My example {i}"

for out in pipe(data()):
    print(out)
```
Still the same delay and pausing after 4th or 5th loop
!code
Just edited.
import torch
for d in data():
    with torch.no_grad():
        print(pipe(d))
try that.
I tried this, it just completed after about 10-12 seconds.
so it worked?
It worked, but it takes 12 seconds or so.
Oh my bad, actually this is the code: ```py
import torch
def data():
    for i in range(1):
        yield f"My example {i}"
for d in data():
    with torch.no_grad():
        print(pipe(d))
```
For 1 instance it takes 12 seconds
hmm, are you sure nothing else is using the GPU?
I guess so, because once the output is completed the GPU drops to 0% utilisation
someone recommend me a roadmap to study AI
my prof is real shitty
there are too many AI resources
idk what course to finish to actually study AI as a major subject 🟥 IMPORTANT 🟥
@harsh minnow I'll try it on my machine
Sure @serene scaffold
@harsh minnow looks like it requires a license from Meta to use it, which I don't have, so RIP.
oh okay, no problem.
Actually I guess there is some weird issue with my system, because this code still has not given an output: ```py
import torch
def data():
    for i in range(1):
        yield f"My example {i}"
for d in data():
    with torch.no_grad():
        print(pipe(d))
```
BTW I am using RunPod (runpod.io)
Hi! I'm in the early stages of implementing face recognition, and am assessing available tools. Python's eco seems great for this, including with Tensorflow, Pytorch, and OpenCV (Aka/formerly-known as CV2?) I'm interested in clustering with faces. Ie recognizing unique faces from a camera, and identifying when they come up again. Where would you start? Ty!

are you me
also @serene scaffold did we still want data eng resources or nah
at this point im not sure since ive sorta merged MLE/DE beginner resources

many places merge the two roles
books, online courses/resources, podcasts
I am running 2 GPUs, but when I run this code it only uses 1 GPU.
```py
from transformers import pipeline
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device=1, return_full_text=False)
```
Hey guys, does someone know why Evolutionary Algorithms are hard to apply in a stochastic nature, like gradient descent?
Usually, I see EAs being applied to optimize the average loss in a dataset as a whole. The models are initialized and only after iterating through the whole dataset the mutations/crossovers are applied. I was thinking: why not apply the mutations before iterating through the whole dataset? Like after N batch iterations, like in SGD?
In reality, I'm asking this a bit too late, as I already did an entire work testing a "stochastic genetic algorithm". But it worked wonderfully as a failure, and I'm now trying to think why it didn't work
(Though I don't discard the possibility that the problem may lie between the screen and the chair...if you know what I mean)
You can apply SGD in EA's
Look up local search in evolutionary algorithms
Yeah, I've seen a paper where the folks used an EA to initialize the model so that it begins with its parameters close to the global minimum. Then, they trained the model using SGD.
A bit like pre-training...
I tried an idea of mixing SGD and EA on a model that is already on a learning plateau (a local minimum, or more like a saddle point), to move the model away from that minimum and thus help SGD improve it even more.
Imagine you have a population, each individual is a model. Your recombination and mutation work on the weights. You can do local search by performing SGD on a few individuals in the population at each step
SGD, or local search in general, will move you toward a local minimum
I don't have the time right now to give elaborate and structured motivation / responses. You can DM if you want and I'll likely answer tomorrow or so 🙂
Oh, I see. Thanks!
Hello guys, this might be a stupid question (I'm a beginner in this field), but I always wondered about the mathematical background of, for example, neural networks, or the complete mathematical background of how a model gets trained and learns. Is there any kind of resource that deals with the mathematical background of these machine learning concepts? Thanks 🙂
I know the mathematics needed for AI is mainly calculus, linear algebra, and probability. But what I would like to have is a book or any type of resource that describes, or has examples of, the mathematical background of these concepts in AI and machine learning.
Ohh I think I found what I was looking for, there is a book "mathematics for machine learning" from the Cambridge University . I think that would be a good place to start.
Hi Ali, to gain clarity and understand the context of your question better, can you elucidate more?
Hi, I can't try jupyter notebook because of some python kernel timeout issue. Could you skip to near the end of this video and tell me how to rectify it.
Bro how can I make / command in my bot and I have to make bot in JavaScript or something else
I've DM'd you code for a basic bot in Python. You can use autogen or openai to make it respond like an assistant.
For further help, see #discord-bots or the discord.py server
Anyone?
You posted a 20 min video. I don’t have that kind of attention span. What issue?
I assume you don’t have Python installed, for starters
Try running something in vscode outside a notebook
https://arxiv.org/pdf/1802.01528.pdf is a paper that steps through all the math and assumes you only know calc 1 and some linear algebra.
https://developers.google.com/machine-learning/crash-course google has a machine learning crash course that begins by covering the math, but also focuses on using modern libs to create your own models
https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi 3b1b has a series of videos going over the math and getting progressively more complicated
andrew ng also has free courses online that go in depth on the math
Guys here is a code that you can use to process a pcap and extract individual bytes to columns of a parquet! it implements the pre processing and labeling required to develop feature-per-byte based NIDS as the one described in DeepPackGen https://arxiv.org/pdf/2305.11039.pdf feedback very welcome! https://github.com/Master-Sorcerer/BytesProcessor
I have a question. In Kaggle Getting Started competitions, we are provided with train and test sets separately. Is it okay to merge both of them to make preprocessing easier, or not? According to this Analytics Vidhya blog post (https://www.analyticsvidhya.com/blog/2021/07/data-leakage-and-its-effect-on-the-performance-of-an-ml-model/) we should not do this because it can lead to data leakage. Can anyone tell?
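On the leakage question, a minimal sketch of the usual pattern, assuming simple standardization as the preprocessing step: the statistics are computed from the training set only and then reused on the test set. The arrays are synthetic, and `fit_scaler` / `apply_scaler` are made-up helper names, not a library API.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
X_test = rng.normal(loc=12.0, scale=2.0, size=(50, 3))  # deliberately shifted

def fit_scaler(X):
    """Learn per-column mean and standard deviation from X."""
    return X.mean(axis=0), X.std(axis=0)

def apply_scaler(X, mean, std):
    """Standardize X with previously learned statistics."""
    return (X - mean) / std

# Leak-free: fit on train only, then apply the same statistics to both sets.
mean, std = fit_scaler(X_train)
X_train_scaled = apply_scaler(X_train, mean, std)
X_test_scaled = apply_scaler(X_test, mean, std)

# Leaky variant, for comparison: the statistics now depend on the test rows.
merged = np.vstack([X_train, X_test])
leaky_mean, leaky_std = fit_scaler(merged)

print("train-only mean:", mean)
print("merged mean:    ", leaky_mean)
```

Merging is generally harmless for steps that learn nothing from the data (renaming columns, parsing dates). It is the fitted quantities such as means, encodings learned from targets, or imputation values that should come from the training set alone.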
Send me too please
Sent.
I have Python 3.11, do I need to make a copy of the python.exe file into the current working directory?
• male_female_X_train.txt
• male_female_X_test.txt
• male_female_y_train.txt
• male_female_y_test.txt
...
(c) Bayes classifier with non-parametric distribution (20 points)
Compute and print the prior probabilities for male and female.
Compute class likelihoods, p(height|male), p(weight|male), p(height|female) and p(weight|female) for
all test samples. This can be done by using the bin min/max values returned by NumPy histogram()
function. You can calculate the centroid of each bin and assign each test sample to the closest bin.
After knowing the bin index, the likelihood can be computed using the count vector provided by the
same histogram() function.
Classify all test samples and compute the classification accuracy. Print accuracies for height only,
weight only, and weight and height together (multiply likelihoods).
My attempted solution went something like this (I tried to follow the instructions as well as I could):
- Calculate the prior probabilities from the training data
  - p(m) = # males / (# males + # females)
  - p(f) = 1 - p(m)
- Calculate likelihoods
  - Use the training data to define the counts and bins for 4 histograms: female height, male height, female weight, male weight
  - Calculate the centroids for all 4 histograms
  - Using those I can calculate the likelihoods (or so I thought)
- Classify
  - Classify based on the maximum likelihood * prior probability
Here is my code (sorry):
import numpy as np
# Male and female height and weight measurements
X_train = np.loadtxt("male_female_X_train.txt")
X_test = np.loadtxt("male_female_X_test.txt")
y_train = np.loadtxt("male_female_y_train.txt")
y_test = np.loadtxt("male_female_y_test.txt")
total_samples_train = len(y_train)
# Compute and print the prior probabilities for male and female
# y=0 -> male and y=1 -> female
p_m = np.mean(y_train == 0) # p(male)
p_f = 1 - p_m
print(f"p(female) = {p_f}")
print(f"p(male) = {p_m}")
# Compute class likelihoods
# Men's heights
mheights_train = X_train[y_train == 0, 0]
# Men's weights
mweights_train = X_train[y_train == 0, 1]
# Women's heights
fheights_train = X_train[y_train == 1, 0]
# Women's weights
fweights_train = X_train[y_train == 1, 1]
# Histograms
# The last bin edge from np.histogram(..) is "closed".
counts_mh, bins_mh = np.histogram(mheights_train, bins=10)
counts_mw, bins_mw = np.histogram(mweights_train, bins=10)
counts_fh, bins_fh = np.histogram(fheights_train, bins=10)
counts_fw, bins_fw = np.histogram(fweights_train, bins=10)
# Bin centroids for histograms (midpoint of each pair of adjacent edges)
get_bin_centr = lambda bins: (bins[:-1] + bins[1:]) / 2
bin_centr_mh = get_bin_centr(bins_mh)
bin_centr_mw = get_bin_centr(bins_mw)
bin_centr_fh = get_bin_centr(bins_fh)
bin_centr_fw = get_bin_centr(bins_fw)
# Nearest-centroid bin lookup (assumed form of the helper used below)
get_bin_idx = lambda centroids, x: int(np.argmin(np.abs(centroids - x)))
# For the sake of brevity and everyone's sanity, I'll omit everything except predictions based on height:
classify = lambda lf, lm: int(lf * p_f > lm * p_m)
y_pred_h = [classify(counts_fh[get_bin_idx(bin_centr_fh, x)] / counts_fh.sum(),
                     counts_mh[get_bin_idx(bin_centr_mh, x)] / counts_mh.sum())
            for x in X_test[:,0]]
And here is the feedback I got:
The index of the bin should be taken considering the whole range of available weight/height values but not for male and female separately
But if I don't have the histograms separated into m/f, I don't understand how to classify then. If I only have one histogram for height and I get the bin index from that combined distribution for male and female heights, how can I determine the class?
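One way to read that feedback, sketched under assumptions rather than as the course's reference solution: keep separate male and female count vectors, but compute them over one shared set of bin edges per feature, taken from the combined training values. A test sample then has a single bin index, and that same index is looked up in both class histograms. The data below is synthetic, `likelihoods` is a made-up helper, and `np.digitize` is swapped in for the nearest-centroid lookup (for equal-width bins the two agree, apart from ties exactly on an edge).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in for the training heights: label 0 = male, 1 = female.
y_train = rng.integers(0, 2, size=n)
heights_train = np.where(y_train == 0,
                         rng.normal(178, 7, size=n),
                         rng.normal(165, 7, size=n))

# One set of bin edges over the whole range of training heights.
_, edges = np.histogram(heights_train, bins=10)

# Per-class counts, both computed over those shared edges.
counts_m, _ = np.histogram(heights_train[y_train == 0], bins=edges)
counts_f, _ = np.histogram(heights_train[y_train == 1], bins=edges)

p_m = np.mean(y_train == 0)
p_f = 1 - p_m

def likelihoods(x):
    """Return (p(x|male), p(x|female)) from the shared-bin histograms."""
    # Digitizing against the 9 interior edges yields indices 0..9; values
    # outside the training range fall into the first or last bin.
    idx = np.digitize(x, edges[1:-1])
    return counts_m[idx] / counts_m.sum(), counts_f[idx] / counts_f.sum()

heights_test = np.array([160.0, 170.0, 185.0])
lik_m, lik_f = likelihoods(heights_test)
y_pred = (lik_f * p_f > lik_m * p_m).astype(int)  # 1 = female, 0 = male
print(y_pred)
```

For height and weight together, the same lookup would be done per feature and the two likelihoods multiplied before comparing, as the assignment text describes.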
hello, i want to create an AI avatar using python. What should i learn to do this?
I was wondering, does it even make sense to use LASSO regression on simple linear regression tasks, where only one independent variable / feature contributes to the target variable? Given that you don't penalize the b0 term, the penalty can only act on that one feature, which doesn't sound like something I'd normally do
what would this AI avatar do? chances are, it's not something that's attainable as a first project.
It would just respond to basic physics questions
@hasty mountain I have time to get back to you now. Your algorithm was basically this:
1. Initialize population
2. Mutate
3. Select top model
4. SGD on top model
5. Return to 2 until convergence or max steps
Right?
Step 4 is what people call "local search" in genetic algorithm literature, it's a valid thing to do. It does have a trade-off:
- Local search decreases the time to convergence.
- Local search can have you converge faster into a local minimum.
What I saw empirically is that LS really decimates diversity in your population and sends you towards a converged population really quickly. To offset this you need to increase mutation rates to compensate etc.
What you can also do is run LS on the top and bottom N % of individuals
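The loop described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the algorithm from the original experiments: the objective is a made-up quadratic, plain gradient descent stands in for SGD, the mutation settings are arbitrary, and copying the refined best over the worst individual is just one possible replacement rule.

```python
import random

random.seed(0)

def loss(w):
    # Toy objective (assumption): quadratic bowl with its minimum at (3, -2).
    return (w[0] - 3.0) ** 2 + (w[1] + 2.0) ** 2

def grad(w):
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 2.0)]

def mutate(w, rate=0.3, scale=0.5):
    # Each weight is perturbed by Gaussian noise with probability `rate`.
    return [wi + random.gauss(0.0, scale) if random.random() < rate else wi
            for wi in w]

def local_search(w, steps=5, lr=0.1):
    # A few gradient-descent steps: the "local search" part.
    for _ in range(steps):
        w = [wi - lr * gi for wi, gi in zip(w, grad(w))]
    return w

# 1. Initialize a diverse population of 20 individuals.
population = [[random.uniform(-10, 10), random.uniform(-10, 10)]
              for _ in range(20)]
initial_best = min(loss(w) for w in population)

for generation in range(30):
    population = [mutate(w) for w in population]      # 2. mutate
    population.sort(key=loss)                         # 3. select (best first)
    population[0] = local_search(population[0])       # 4. local search on top
    population[-1] = list(population[0])              # assumed replacement rule
    # 5. loop back to the mutation step

final_best = min(loss(w) for w in population)
print(f"best loss before: {initial_best:.3f}, after: {final_best:.3f}")
```

Running local search on more individuals, or raising `rate` to compensate for the lost diversity, are the knobs discussed above.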
In reality I'd never run an EA on a neural network though for various reasons. The flavours of SGD that we use are usually "good enough", I'm not sure we need a global optimiser 🙂
no, but "greedy reasoners" is a fun way of putting it
Thank you very much
Hey guys, my neural network with MLPRegressor always outputs the mean of the Y variable. Can anyone help me? I will provide more details
Sounds like all your weights are 0 except that of the intercept. You can look at model.coefs_ to attempt to debug
Do we have here a OCR master? ;p with tesseract?
just ask your question.
Hm... i want to OCR my manga chapter, but tesseract very often gets only about 50% of the page right.
i did it with my colleague, and sometimes where we can see the letters "G and G" in the image, it is written as G and 6
i'm curious where i should focus to resolve this issue.
I’ve only done a simple experiment with tesseract, so I don’t have much experience but: have you tried training against the font? Do you know the font? Context: https://pretius.com/blog/ocr-tesseract-training-data/
Hey guys, I am trying to build a news article generator app that is trained on the news in the US, UK, and Canada. It has to be very accurate. Now I first ask some simple questions to get some data and generate an accurate news article about a specific thing in a specific country.
Now I want to train the AI on new article knowledge, but when I fine-tune a model, it outputs nonsense. I tried fine-tuning GPT 3.5, but it's returning inaccurate data (GPT 4 performs much better). Also, GPT-3 (4096) is not enough to generate a long news article, so I am making a plan and calling GPT-3 around 7–10 times (it's context-aware).
I want the model to be smart enough to make decisions about the country and get information about it from the user. I want the model to output in a different format too, but the articles I train on are not in the format I want. Each user's use case is different, and I want the model to be smart enough to analyse and generate the article.
What is the best way to go about this? How can I achieve a smart AI model that knows each country's information and is accurate?
Lol its possible to train Tesseract? nice i need to write more about it!
what would it mean to "generate accurate news articles"? for it to predict the future?
Please share/tag me if you do, genuinely curious if it helps
No, whatever has already occurred (e.g. wildfires in the US)
@left tartan Do you have any Youtube Video to show how it work?
then aren't you just overfitting it to news articles that already exist?
must you use tesseract?
also are you interested in text detection or text recognition? or both?
I mean its should learn from the articles, account for user's preferences and generate one. May be the user wants more things to be added in the article. This could be a combination of articles
i'm preparing a small project, so here is what i need: to catch the words / letters from the manga plus the pixels where they are, then export that to a .txt file. then:
1) blank out the recognised text (fill it with white).
2) replace the contents of the .txt file with my translation.
3) draw my words back onto the image.
that is the main logic of what i want to do.
so if you know a better option, i'd like to use it.
i used EAST and CRNN_VGG_BiLSTM_CTC laid out here with some success before https://github.com/opencv/opencv/blob/4.x/samples/dnn/text_detection.py
i'm just beginning in the python world. How do I use it?
to me tesseract is great for heavily structured text, but for text that's less structured like in manga, it's unlikely to work out of the box, just my 2 cent.
Ok could we connect tomorrow? for some "training" how to use it?
sorry i don't have bandwidth to help on that level, let me do some googling, there must be tutorial out there
i only managed to find one on east for text detection for some reason..
https://pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
here is one tutorial that aims to use east + tesseract together, which in my experience is also better than just tesseract, again just my 2 cents.
https://nanonets.com/blog/deep-learning-ocr/
download EAST model from 1
download CRNN_VGG_BiLSTM_CTC model from 2
i think the dependencies you need are just
opencv-python
numpy
opencv-python is cv2 - don't ask me why, i always get confused too.
that's all the help i can give at the moment, i really need to get back to work now.
Hey, thanks! Well, the method is a bit more chaotic than that, as the model is already on a local minimum/saddle point, and the population initialized are copies of this same model(thus, already on local minimum). The EA is like...something to force the model to get even better, and it's applied to one single model (the Vanilla) to prevent a madness of memory consumption (I can run this algorithm in my personal computer using 20 models as population, for example).
The SGD is quite interesting, indeed, but I only get a bit upset with the fact that it demands so much patience. I even read a blog post some time ago that the author said there was a saying in Deep Learning that was something like: "Let the model run and take a summer off", to let gradients work by themselves.
You need diversity in a population for EA methods to work, I'm not sure starting with N copies is worth it
Especially at initialisation
Yes, that's the thing. The mutation happens at each batch sampling, and I tend to use a mutation chance of 60%, more or less
Yes but this is a bit like random search then, you have little benefits of the EA framework especially since you're doing no recombination
Yes, it's quite chaotic and random.
I see... So I shouldn't have discarded crossing-over then 
It's funny, though. When I ran some experiments in my poor personal GTX 1650 with a batch size of 4, the method failed miserably.
When I ran it on Kaggle's industrial Tesla P100 with a batch size of 512, the method did provide a result...helping the model get stuck in a new minimum that's even harder to escape (not sure if it's local or global)
Slide spam again. I'd say what you have is probably not an EA because
- You have full exploitation (SGD) and then you move to basically nearly full exploration (random search through mutation)
- Diversity is a key idea to explore the search space, you start from the same instance so you're not seeing a lot of the space, you're only seeing around that saddle point
Oh no
So I basically have a Reinforcement Learning strategy with a low epsilon 
I wouldn't call it RL either, it's random search with a splash of local search (when you do SGD on the best individual)
What do you mean with batch size? Population size, 512 instances? Or do you mean truly batch size in the SGD sense?
Batch size in SGD sense.
My "EA" works in a stochastic manner, together with SGD, using mini batches.
How large is your population?
So your algo is effectively:
1. SGD till convergence
2. Copy 20x
3. Mutate with rate of 60 %
4. SGD best
5. Go to 3 and repeat till convergence
?
In the beginning, I used a decaying factor for this mutation chance, as the cumulative mutations tended to degrade the model's performance. But then I removed it.
1. SGD till convergence (pre-training) ---> this is the Vanilla model.
2. Initialize the population by copying the Vanilla model 20x.
3. Sample a batch from the data and run SGD --> optimize.
4. Get the loss after that optimization step --> SGD loss.
5. Mutate the population with a rate of 60%.
6. Iterate through each individual in the population --> get the individual losses.
7. Select the best loss ---> if it's the SGD loss, proceed to 3. If an individual's loss beats (is lower than) the SGD loss, that individual replaces the Vanilla model; proceed to 3.
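The steps above can be sketched as a toy NumPy loop. This is a minimal sketch and not the poster's code: a linear regression stands in for the network, and the batch size of 64, the mutation noise scale of 0.01, and reading "rate of 60%" as a per-individual mutation probability are all assumptions made for illustration.

```py
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (hypothetical stand-in for the real task).
X = rng.normal(size=(512, 5))
true_w = np.array([1.5, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=512)

def loss(w, xb, yb):
    """Mean squared error of a linear model on one batch."""
    return float(np.mean((xb @ w - yb) ** 2))

def sgd_step(w, xb, yb, lr=0.05):
    """One plain gradient step on the batch MSE."""
    grad = 2.0 * xb.T @ (xb @ w - yb) / len(yb)
    return w - lr * grad

def sample_batch(batch_size=64):
    idx = rng.choice(len(X), size=batch_size, replace=False)
    return X[idx], y[idx]

# 1. Pre-train the "vanilla" model with SGD.
vanilla = np.zeros(5)
for _ in range(200):
    vanilla = sgd_step(vanilla, *sample_batch())

# 2. Initialise the population as 20 copies of the vanilla model.
population = [vanilla.copy() for _ in range(20)]

replacements = 0
for _ in range(100):
    xb, yb = sample_batch()
    # 3-4. SGD step on the vanilla model, then record its loss.
    vanilla = sgd_step(vanilla, xb, yb)
    sgd_loss = loss(vanilla, xb, yb)
    # 5. Mutate each individual with probability 0.6 (assumed meaning of "rate").
    for ind in population:
        if rng.random() < 0.6:
            ind += rng.normal(scale=0.01, size=ind.shape)
    # 6. Evaluate every individual on the same batch.
    losses = [loss(ind, xb, yb) for ind in population]
    best = int(np.argmin(losses))
    # 7. Keep the SGD result unless an individual beats it.
    if losses[best] < sgd_loss:
        vanilla = population[best].copy()
        replacements += 1

final_loss = loss(vanilla, X, y)
print(f"final loss: {final_loss:.4f}, replacements: {replacements}")
```

Because every individual starts as the same copy and only drifts by small noise, the population stays clustered around one point, which is the lack of diversity raised earlier in the thread.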
All with the same learning rate etc?
Yes, the learning rate is fixed. The SGD optimization is only applied to the Vanilla Model, even when an individual from the population replaces a Vanilla Model(becoming the new Vanilla)
Wait a second...the adam optimizer... It would be able to accompany those changes, right?

Thank You
So, let's assume this is an EA, which it's not really: EA's are a waste of compute. Your compute is better spent on something else like bayes opt.
EA's shine where you truly need a global optimisation method for a non-convex problem that is still tractable (note that EA's have 0 guarantees of finding the global optimum), or where there is a "large enough" gap between your heuristic and the global solution (you don't know this before running the EA).
I see... Sad, I like the idea of the EAs.
Looks like I'll have to review my work, then.
Thanks!
I'd look at EAs in the context of combinatorial optimization and not neural networks
I've seen there were some ideas of using them in Reinforcement Learning...and to also select hyperparameters for neural networks.
Yes but EAs are very very sensitive to hyperparameters themselves
I was thinking about trying something chaotic as mutations in some of my projects in Reinforcement Learning... gaming bias in RL is truly annoying...
You've shifted the problem from setting hyperparams of your single NN to setting hyperparams for your EA which involves training 100+ neural networks
Hence why it's a total waste of compute compared to bayes opt

They're really simple to try out. Have you heard of the knapsack problem?
I have (2,) (8,)
but instead of (8,) I want (1,8), and I can't transpose with .T
first is x second is weights
No. I have no idea on how bayesian optimization works. I only heard of it 
But from what I'm reading now...the ELBo loss used in the Variational AutoEncoder could be something like that?
```py
def __init__(self, in_nodes, out_nodes):
    self.weights = Tensor((in_nodes * out_nodes))
    self.bias = Tensor((1, out_nodes))
    self.type = 'linear'
    print('w shape', self.weights.data)
    print('b shape', self.bias.data.shape)

def forward(self, x):
    print('x', x)
    output = np.dot(x, self.weights.data) + self.bias.data
```
Bayes opt is a hyper param tuning method that runs sequentially
At least, that's what it's good at compared to say grid search which you can run in parallel
It matters because you want to waste as little compute as possible when training NNs every evaluation should matter cause it takes too much time to be trying random things (which EAs do)
what's the Tensor class from
class Tensor():
    def __init__(self, shape):
        self.data = np.ndarray(shape, np.float32)
        self.grad = np.ndarray(shape, np.float32)
If you want to learn about them you should decouple EAs from NNs, just make a small project solving a classical problem like travelling salesman or knapsack. Generate fake data using Numpy and code out the whole thing from 0, just depend on Numpy. It's not a big project, it's <100 LoC. Then you'll "get" the pros and cons immediately 🙂
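A small project like the one suggested could look roughly like this. It is only a sketch under stated assumptions: the instance is random fake data, and the population size, tournament size, mutation rate, and the "overweight scores 0" penalty are arbitrary choices, not recommendations.

```py
import numpy as np

rng = np.random.default_rng(42)

# Fake knapsack instance: item values, item weights, and a capacity.
n_items = 30
values = rng.integers(1, 50, size=n_items)
weights = rng.integers(1, 30, size=n_items)
capacity = int(weights.sum() * 0.4)

def fitness(pop):
    """Total value per individual; overweight solutions score 0."""
    total_w = pop @ weights
    total_v = pop @ values
    return np.where(total_w <= capacity, total_v, 0)

def tournament(pop, fit, k=3):
    """Pick one parent: the best of k randomly chosen individuals."""
    idx = rng.integers(0, len(pop), size=k)
    return pop[idx[np.argmax(fit[idx])]]

def evolve(pop_size=60, generations=150, mutation_rate=0.02):
    # Random (diverse) initial population of 0/1 item selections.
    pop = rng.integers(0, 2, size=(pop_size, n_items))
    # The empty knapsack is always feasible, so start from it.
    best, best_fit = np.zeros(n_items, dtype=int), 0
    for _ in range(generations):
        fit = fitness(pop)
        gen_best = int(np.argmax(fit))
        if fit[gen_best] > best_fit:
            best, best_fit = pop[gen_best].copy(), int(fit[gen_best])
        children = []
        for _ in range(pop_size):
            a, b = tournament(pop, fit), tournament(pop, fit)
            cut = rng.integers(1, n_items)               # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_items) < mutation_rate   # bit-flip mutation
            children.append(np.where(flip, 1 - child, child))
        pop = np.array(children)
    return best, best_fit

best, best_fit = evolve()
print("best value:", best_fit, "weight:", int(best @ weights), "capacity:", capacity)
```

Changing `pop_size`, `mutation_rate`, or the tournament size `k` and re-running is a quick way to see how sensitive the result is to the EA's own hyperparameters.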
hm you could do smth like my_arr.resize((1, *my_arr.shape))
I see. I'll take a look. Thanks!
ok
but you might just wanna use (8,1)/(1,8) arrays in the first place so you can use transpose
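For reference, a short sketch of the shape handling discussed here. The array contents are made up; the only thing taken from the thread is that `Tensor((in_nodes * out_nodes))` with 2 inputs and 4 hidden units gives a flat `(8,)` array, which is an inference from the posted code.

```py
import numpy as np

x = np.array([0.0, 1.0])                  # shape (2,)
flat_w = np.arange(8, dtype=np.float32)   # shape (8,), like Tensor((2 * 4))

# Three equivalent ways to add a leading axis: (8,) -> (1, 8).
row_a = flat_w.reshape(1, -1)
row_b = flat_w[np.newaxis, :]
row_c = np.expand_dims(flat_w, axis=0)

# .T is a no-op on 1-D arrays, which is why transposing (8,) changes nothing.
print(flat_w.T.shape, row_a.T.shape)

# For a linear layer, weights shaped (in_nodes, out_nodes) let np.dot work.
in_nodes, out_nodes = 2, 4
w = flat_w.reshape(in_nodes, out_nodes)
b = np.zeros((1, out_nodes), dtype=np.float32)
out = np.dot(x, w) + b
print(out.shape)
```

If the weights are meant to be a matrix, passing the tuple `(in_nodes, out_nodes)` as the shape instead of the product `in_nodes * out_nodes` avoids the reshape altogether.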
Weights are not zero
fun fun fun:
FutureWarning: Styler.applymap has been deprecated. Use Styler.map instead
(got this three times per usage, actually) okay, fine. i'll switch them to .map instead.
except now pylance complains of Cannot access member "map" for type "Styler" Member "map" is unknown !

Isn't EA normally used for architecture/hyperparameter search, as opposed to parameter optimization?
https://github.com/KeananC/MachineLearning/tree/main can you take a look at my cluster alg and give ideas on how to make another one for 2d data(same like style/simpleness, since i just got into ml)
you invented your own algorithm? i admire the ambition but most people are not inventing their own algorithms. the reason is that existing algorithms people use tend to be very carefully crafted, and often have important proven theoretical properties. it takes a lot of work to invent one from scratch, and there's no guarantee it will even work well.
the one with the gaps is interesting. it sounds like your intention is to split the data into 2 clusters by finding the biggest gap between two data points? what about data like this? [1, 2, 11, 12, 21, 22]. presumably you'd need to extend this to handle more than 2 clusters.
it sounds like maybe you're at the start of reinventing something like divisive hierarchical clustering https://en.wikipedia.org/wiki/Hierarchical_clustering#Divisive_clustering
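To make the example data concrete, here is a minimal sketch of gap-based splitting for 1-D data. It is not the code from the linked repo; `split_by_gaps` is a hypothetical helper that cuts the sorted data at its largest gaps.

```py
import numpy as np

def split_by_gaps(values, n_clusters):
    """Cut sorted 1-D data at its (n_clusters - 1) largest gaps."""
    data = np.sort(np.asarray(values, dtype=float))
    gaps = np.diff(data)
    # Indices of the largest gaps; a stable sort keeps ties in left-to-right order.
    order = np.argsort(-gaps, kind="stable")
    cut_after = np.sort(order[: n_clusters - 1])
    return [chunk.tolist() for chunk in np.split(data, cut_after + 1)]

two = split_by_gaps([1, 2, 11, 12, 21, 22], 2)
three = split_by_gaps([1, 2, 11, 12, 21, 22], 3)
print(two)
print(three)
```

With two clusters the two gaps of 9 are tied, so which side gets merged depends only on the tie-breaking rule. With three clusters the split is the natural one. Repeating the split inside each cluster is the divisive idea from the linked article.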
Hello, does anyone know a beginner friendly data analysis project that requires minimum knowledge of statistics? I want to become more familiar with the 3 popular Python libraries and practice the responsibilities of a data analyst by working on projects.
Hi, anyone familiar with seaborn? I am wondering if I can change a legend label without having to provide any other related settings like bbox_to_anchor, loc, etc. Thanks.
https://www.kaggle.com/competitions/titanic
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques
you can follow the prompt and try to fit a predictive model, but these are really good datasets just for getting comfortable with cleaning and exploring data, and forming & testing hypotheses
Thanks
Hey, I was annoyed by the time SGD demands...and the 30 hours in Kaggle wasn't being enough...
and I was feeling creative
hey, props for creativity
Yes
It's even too expensive for architecture and hyperparameter search. It's a blackbox optimization method, it's suitable for rich research groups to search over architectures and then tell us plebs what they found 🙂
I dunno where to post for help sorry :/ Im a python n00b.
I'm installing fenics, though a WSL (Ubuntu) and this is taking forever (days) there seems to be progress, ie. the last report line is changing every couple of hours and so. But I am worried this might never finish; should I be worried? Is this normal? How to fix? halp
Hello everyone, does anyone know how can i use pytorch in aws lambda? Currently i cant because of the torch size and the limit in lambda is 50mb only
how can I implement pred_proba ?
I have
```py
def predict(self, data):
    X = data
    for f in self.computation_graph:
        X = f.forward(X)
    return X
```
on forward
```py
unnormalized_proba = np.exp(x-np.max(x,axis=1,keepdims=True))
self.proba = unnormalized_proba/np.sum(unnormalized_proba,axis=1,keepdims=True)
self.target = target
print(self.proba)
```
looks like I hardly have
hmm in predict there is forward()
ah so I have it
Lol makes sense. Is it too slow even for smaller models like fully connected with one hidden layer?
Maybe not 🙂 it's not a matter of it being too slow but rather too wasteful. SGD sends you to a local minimum (in non-convex problems) but that's fine if, as many have claimed, there's a surface with many local minima that don't really differ a lot. Finding the lowest point in this space doesn't give you a lot.
Just running N SGD's from scratch may be better
But yeah, I've done a fair share of EA's, they're my go to toy problem when trying new languages. I can't stress enough how sensitive they are to their own hyperparameters. You need to tune the thing that tunes your model 🤔
```py
test = label_encoder(['beer','milk'])
predicted_labels = np.argmax(model.predict(test),axis=1)
accuracy = np.sum(predicted_labels==y)/len(y)
print("Model Accuracy = {}".format(accuracy))
print('predicted ', model.predict(test))
print(np.where(predicted_labels == 0, 'chips', 'cereals'))
```
so
how can I convert it to display array of ['chips', 'cereals'] instead of just ['chips']?
because at first I want all predictions then I want to think about only one prediction as is the case when deploying model and making one prediction
batch_size = 2
num_epochs = 10
samples_per_class = 100
num_classes = 2
hidden_units = 4
x = ['beer','milk']
y = ['chips','cereals']
def label_encoder(list):
    l = np.arange(len(list))
    return l
x = label_encoder(x).reshape(1,-1)
y = label_encoder(y)
model = utilities.Model()
model.add(DL.Linear(2,hidden_units))
model.add(DL.ReLU())
model.add(DL.Linear(hidden_units,num_classes))
optim = DL.SGD(model.parameters,lr=1.0,weight_decay=0.001,momentum=.9)
loss_fn = DL.SoftmaxWithLoss()
model.fit(x,y,batch_size,num_epochs,optim,loss_fn)
this is first part
I converted labels to ints
trained and try to predict
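A minimal sketch of the predict_proba and label decoding questions, independent of the custom `DL` and `utilities` modules (which are not shown in the thread). The `logits` values are made up, and `softmax` here is a standalone helper, not a method of the poster's classes.

```py
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

class_names = np.array(["chips", "cereals"])

# Hypothetical raw scores from the last Linear layer: one row per sample.
logits = np.array([[2.0, 0.5],
                   [0.1, 1.7]])

proba = softmax(logits)                # what a predict_proba would return
predicted = np.argmax(proba, axis=1)   # one class index per row
print(proba)
print(class_names[predicted])          # index the name array to decode labels
```

The output has one label per input row. In the posted code `reshape(1,-1)` turns the two encoded inputs into a single row of two features, so only one prediction comes back; that reading is an inference from the snippet, not something confirmed in the thread.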
I'm trying to install pytorch3D for windows and I'm struggling. is there no prebuilt packages for this ? Do I really have to build from sources ?
looks like your only options for windows are to build from source or use conda (according to this)
from that same page, I haven't found the instructions for using conda
there's conda instruction for linux and macOS
but the only place that talks about windows is the Building / installing from source section
and when i try to install from the source, I still have issues I don't understand
I am able to import torch and torchvision in a python repl, but when I try to install pytorch3D I have "no module named torch"
I tried building an AI but I was only able to generate random weird sentences that are grammatically correct but sound nonsensical. How do I go further?
for someone to help you go further, they first have to know what you already did
that's what i was wondering - there's no way to use SGD for hyperparameter search that i know of. unless you mean something weird like gradient boosting where the base learner is a neural network...
(there are also many things i do not know!)
Nope because hyperparameter search, even for simpler stuff, is again a non-convex problem 😩 . Hence why I mentioned bayes opt, it's a global optimizer like EA's but it's more data efficient
right, i've used bayes opt before but never EA
i wasn't sure where you were headed with the point about SGD though, unless that was just following up from the original idea
yeah it doesn't seem hard, but as you said it has its own parameters that need tuning and i never saw it as something i could use effectively
would be a good programming exercise though, i should do it
maybe a good excuse to practice julia
Say looking for hyperparameters were differentiable, it would be a non-convex problem so SGD would struggle there
i see, and agreed
Btw there's many things forgotten in the literature 😄
SVMs are so good because they only have 1 or 2 hyper parameters and the RBF SVM is a universal approximator. They're also somewhat interpretable. They just suck at scale but are a great go-to if you're under 30k data points.
for text classifier the svm seems not bad, but could you list a few other methods that can do similar task except svm?
I'm not sure I'd use it for text classification myself. Other methods that work are the canonical random forest and gradient boosting type models (for tabular data) and neural networks for unstructured data but all 3 have more "dials" (hyperparams) to turn than an SVM.
since there so many variants of neural networks, do you have recommendations about some of them with "relatively" less hyperparameters?
All of them have the same amount. You can always add a layer and you can always add a neuron.
specifically for text classification with bag of words, i've found that plain ridge regression beats SVM
also there was some blog series i found years ago that went into great detail on why plain linear models were superior to RBF SVMs for text classification specifically
something about how the RBF kernel acts like a low-pass filter, i'll have to find it
It's a bit of an apples and oranges because a linear SVM could be used as well
yeah but i think at that point actual computation performance tends to be much better with e.g. liblinear
or just l-bfgs which is what we ended up using in that particular project (because we had some customizations)
hmmm, it depends.
You can solve linear SVMs in the primal (number of unknowns are the number of variables) or the dual (the number of unknowns are the number of data points)
Note: you can solve ridge in the dual as well but nobody does this 🙂
one approach i see very often is to use pre-trained embeddings for the model, so conceptually you end up partitioning the parameter space into "embedding parameters" and "model parameters" which you learn separately, or find a pre-trained model to obtain the former
i've done that too, worked well enough. i'm not sure if it's exactly the same as what people call "transfer learning" but i think conceptually it's close at least
BoW is something where you typically have num_data < num_unique_words so if your ridge regression impl. does not allow to solve in the dual going to an SVM, which solves in the dual by default, will be much faster.
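The primal/dual point can be checked numerically for ridge regression. This is a sketch on random data with made-up sizes (50 rows, 400 features) and an arbitrary penalty `lam`; it only illustrates that both routes give the same weights while solving systems of different size.

```py
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 400                      # fewer rows than features, as in bag of words
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
lam = 1.0

# Primal: solve a d x d system, (X^T X + lam I) w = X^T y.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual: solve an n x n system, (X X^T + lam I) alpha = y, then w = X^T alpha.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n), y)
w_dual = X.T @ alpha

same = bool(np.allclose(w_primal, w_dual))
print(same)
```

Here the dual solves a 50 x 50 system instead of a 400 x 400 one, which is where the speed-up comes from when num_data < num_unique_words.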
of course there's also fine-tuning
in the particular problem i was thinking of, we had a few million rows and we cut down the token space with some pre-filtering and hashing to a few thousand or tens of thousands, but we also had > 1k classes and it was all very sparse and bad
i still dont have a good sense of the performance characteristics of the two algorithms at those kinds of extremes, i was just trying stuff and went with what worked 😛
At the end of the day this is the only thing that matters imho 😄
thanks, i sometimes need to remind myself it's ok to experiment and not just know everything!
Yeah, the only core skill is knowing where to find information and knowing how to evaluate models properly. From then onwards it's empirics (unless you're doing fundamental research ofc)
that sounds new to me, as a novice here. so how would you distinguish the embedding and model parameters? like, are there some recommended practices?
i mean that you'd just fit them separately. for example in the "old days" you'd do something like use fasttext or word2vec to get token embeddings, and then fit a model using those embeddings as features
or you could go even simpler than that, using tfidf and/or dimension reduction like pca
it's less optimal in the sense that the embeddings are not "learned" using the actual objective function of interest
but for computational and practical reasons it might be better, eg. if you don't have enough data to get a good estimate of the large number of parameters involved
@past meteor Having some more discussion on taking the absolute value of the residuals. I got the impression that the residuals were just the actual values minus the predicted values, not the absolute value of that result. I'm getting some conflicting information.
My question is: When, if ever do you think we take the abs of the residuals and why?
When "debugging" your model you shouldn't take the absolute values of the residuals, you're interested in knowing when it's under and overshooting
When you have outliers in your data evaluating with MSE is bad because squaring errors makes them explode
To be clear, when defining residuals, do we take the abs or not?
... no
anyone else try running "conda list" on the new Python in Excel release?
Under what circumstances would you find it to be appropriate to take the abs of the residuals?
For example, doing the Shapiro-Wilk test.
As I mentioned, in the context of the mean absolute error 🙂 Sometimes I also do this np.argmax(np.abs(residuals))
And then I investigate what went wrong
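A small sketch of the distinction, with made-up numbers: the residuals keep their sign, and the absolute value only appears inside the metric or when looking for the worst miss.

```py
import numpy as np

actual = np.array([10.0, 12.0, 9.0, 15.0, 40.0])     # last point is an outlier
predicted = np.array([11.0, 11.5, 9.5, 14.0, 16.0])

residuals = actual - predicted              # signed: > 0 means the model undershot
mae = np.mean(np.abs(residuals))            # abs belongs to the metric, not the residual
mse = np.mean(residuals ** 2)               # squaring lets the outlier dominate
worst = int(np.argmax(np.abs(residuals)))   # index of the largest miss

print(residuals)
print(mae, mse, worst)
```

Here the single outlier pushes the MSE far above the MAE, and `worst` points at the sample to investigate.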
Been too long since I did the shapiro-wilk test
I can't comment on that specifically
Can anyone else here comment on taking the abs of the residuals before performing the Shapiro-Wilk test?
@past meteor The reason it seems I'm so obsessed with taking the abs of the residuals is because my instructor had a code snippet that did it.
# Calculate some characteristics of the residuals
residuals = np.abs(x.values - predictions.values)
Now everyone in the class seems to think that in order to calculate the residuals, you have to take the abs.
That's totally incorrect
And can you say why?
Because it just is, residuals are just not defined that way
https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/residual/ notice how they say "Negative if they are below the regression line"
https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/regression-library/a/introduction-to-residuals notice how there's negative residuals here too
Another one here: https://www.mathworks.com/help/stats/residuals.html from mathworks (matlab) you see the histogram of residuals has negative values. They are not taking the abs as well. I'm adding so many links because you don't have to take it from me 🙂
The reason why not taking the abs is relevant is that you want to differentiate between over and undershooting ofc
@rugged comet i second what zestar said, it's just not what residuals are and it's neither required nor desirable for the shapiro wilk test
always ask your actual question right away. don't ask to ask.
you're still asking to ask. Assume that someone said they will help you--what would they need to know to start helping?
If I define a nested model like this:
class Siamese(nn.Module):
    def __init__(self, model):
        super(Siamese, self).__init__()
        self.model = model
    def forward(self, x1, x2):
        output1 = self.model(x1)
        output2 = self.model(x2)
        return output1, output2
...
model = Siamese(BaseModel(args)).to(device)
Will it fail to save the complete state dict?
torch.save(model.state_dict(), 'ckpt.pt')
...
model.load_state_dict(torch.load('ckpt.pt'))
When I load the model and resume training it appears to start from scratch
What do you see that makes you think it starts from scratch?
Also is BaseModel a subclass of nn.Module?
Being a contrastive learning framework it tends to start accuracy metric at ~0.5 and climb slowly from there, so I just gotta check that
it's just this
class TCN(nn.Module):
    def __init__(self, in_dim, out_dim):
        super(TCN, self).__init__()
        # some conv layers etc
    def forward(self, x):
        ...
        return x
You should look at the weights of the TCN model before and after you save the Siamese model
Would that be done by comparing the .parameters() of both states?
Right, of the original, and of the saved/loaded copy
Ok, one sec
All the weights & bias are different
Hold on, the weights are diff even if I initialize two identical models with the same seed
I'll see if I can look into it tomorrow
Alright, appreciate it
hi
Don't do it that way tbh
The easiest way to implement a Siamese network is just to use the same neural net and feed it two inputs, calculate the loss and backprop. You don't need a special NN for that, just do it in your training loop. It's confusing I know, I had the same idea before I made a Siamese net myself for the first time.
ok Ive been getting this painful error for a very long time
so im in essence attempting to create a real-time training situation
code:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Model
from tensorflow import keras
import time
import numpy as np
epoch = 300
n = 1000
duration = 5
learning_rate = 1e-4
optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
def lossfunction(distX, crash, speed, distY):
    term1 = tf.add(1.0, tf.add(tf.multiply(5.0, crash), tf.multiply(5.0, distY)))
    term2 = tf.add(distX, speed)
    loss = tf.subtract(tf.multiply(n, term1), tf.multiply(2.0, term2))
    return loss
def returnData(input_data):
    return None
def getData():
    data = np.random.rand(5)
    print(data)
    return tf.convert_to_tensor(data.reshape(1, -1), dtype=tf.float32)
def getSpeed():
    return float(input("speed"))
def create_model():
    input_tensor = Input(shape=(5,))
    x = Dense(256, activation='relu', trainable=True)(input_tensor)
    x = Dense(512, activation='relu', trainable=True)(x)
    x = Dense(512, activation='relu', trainable=True)(x)
    x = Dense(256, activation='relu', trainable=True)(x)
    x = Dense(2, activation='sigmoid', trainable=True)(x)
    model = Model(inputs=input_tensor, outputs=x)
    model.summary()
    return model
model = create_model()
for i in range(epoch):
    print("\nStart of epoch %d" % (i,))
    start_time = time.time()
    while time.time() - start_time < duration:
        distX = float(input("distx "))
        crash = int(input("crash "))
        speed = getSpeed()
        distY = float(input("distY "))
        with tf.GradientTape() as tape:
            input_data = getData()
            predictions = model(input_data, training=True)
            loss = lossfunction(distX, crash, speed, distY)
        print("TrainingLoss: " + str(loss))
        returnData(predictions)
        gradients = tape.gradient(loss, model.trainable_weights)
        print("Gradients: ", gradients)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
error:
TrainingLoss: tf.Tensor(30994.0, shape=(), dtype=float32)
Gradients: [None, None, None, None, None, None, None, None, None, None]
Traceback (most recent call last):
File "robotAI.py", line 85, in <module>
optimizer.apply_gradients(zip(gradients, model.trainable_weights))
File "C:\Users\sidha\anaconda3\envs\OneWhereUcanuseGPU\lib\site-packages\keras\optimizers\optimizer_v2\optimizer_v2.py", line 689, in apply_gradients
grads_and_vars = optimizer_utils.filter_empty_gradients(grads_and_vars)
File "C:\Users\sidha\anaconda3\envs\OneWhereUcanuseGPU\lib\site-packages\keras\optimizers\optimizer_v2\utils.py", line 77, in filter_empty_gradients
raise ValueError(
ValueError: No gradients provided for any variable: (['dense/kernel:0', 'dense/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0', 'dense_2/kernel:0', 'dense_2/bias:0', 'dense_3/kernel:0', 'dense_3/bias:0', 'dense_4/kernel:0', 'dense_4/bias:0'],). Provided grads_and_vars is ((None, <tf.Variable 'dense/kernel:0' shape=(5, 256) dtype=float32, numpy=
Yes I have done that a couple of times with Lime & SHAP. But I still don’t know what exactly you need help with
When do you guys think OpenAI will achieve AGI? It doesn't seem like DeepMind can overtake OpenAI
I don't believe anyone should build AGI
AGI would be able to figure out everything that is possible and profitable to figure out. We can advance so much in physics and medicine. AGI wouldn't be exhausted, would train itself to spot and ignore incorrect notions and use data faster than our brains.
The real question is, why should anyone even be interested in building an AI that can do "everything"? Instead of that why not an AI that's good at doing a specific task very well 🤷
The dangers of AGI outweigh its proposed advantages. People who do ML Research in Ethics are even vehemently against it
I think the idea is that there's similarity between tasks you can exploit to make the entire thing more data efficient? Like in the multi task learning case.
But yeah, I'm sceptical we can ever build whatever AGI is anyway
I used to have the mindset that we all should push towards achieving AGI until I met Timnit Gebru in August.
She pretty much delved into the reason why it's a very dangerous mindset to have. It's like someone tryna build a god
Is the presentation somewhere?
Personally I have no opinion on whether we should or shouldn't, I defer that to philosophers 😅
😂 Unfortunately I didn't record the whole conversation but I'll check and revert back with her talk on same topic
Drunk Philosophy @past meteor ?
Her Keynote speech at DLI is more detailed
https://www.youtube.com/live/5cm_FvHmtVI?si=PCvza6Arsx_tlHGK
Thanks! I'll listen to it this evening 😁😁
@odd meteor that was actually, thoughtful and more on the mind than well thought out and made to appear as though she spent 17 straight hours working out her points
The ai realm is pretty scary,. Modelling and data collection and with open source ai sources is honestly a nightmare in my mind, I jumped to the Facebook surveying and the election re trump. Stereotyping millions from adverts and “personality” traits
When you say "stereotyping millions from adverts and personality traits", what outcome are you worried about?
fixed it, had to incorporate predictions(my fault that was kind of obv but I was hoping to find a loophole) I incorporated speed calculation by using simple math to calculate the speed of the robot
The thing is, most big tech companies don't really like ML Researchers whose research work is in Ethics, more specifically, those who are quite vocal and strong willed.
This is because it exposes the dirty side of what goes on behind those hyped-up models that fetch them constant 💵
I see Ethics as that field in AI where as a ML Researcher, if you're not strong and have a tough skin, you'll be crushed very easily. More so, you certainly can't escape being gaslit, harassed, attacked, called "angry bird", fired even.
If Google could fire Timnit, lol... I mean 🤷😀
To me, like honestly I operate under the assumption I’m wrong or right at any given moment.. but it seems ominous that so much data can be hoovered up. Honestly in that instance it’s more the way it was implemented on shallow variables to target groups with certain propaganda. Really showing psychology is pseudo, check ya Facebook group likes. You could be latent who knows what, you wouldn’t even know unless a computer model categorised your data and sold it to companies to be used as analytics! And then got privy through the targeted ads… maybe I’m a skeptic but the direction of ai , the horizons seem scary to me
I mean not ranting, it’s well known fact the analytics and all that. But a true ai learning model that is capable of not imitation, that’s true fear
To me ethics in AI is closer to philosophy than it is to AI. The reason why I always refuse to express an opinion on the ethics side of things is that I don't have any foundational knowledge in it
True, to reference philosophy is pretty taxing.. gotta be able to know and keep track of all the contexts 😳. There is a philosophy of ai, look it up if you're interested
Some philosophies get insane though, I tried 3 pages of philosophy of psychology and just about blew 3 valves and a bottom end bearing
In my masters we had to pick 1 "general education" elective and it was AI ethics, privacy & big data and cognitive science, I went for privacy & big data because I thought it'd have the most value 🤣
Yah; therein lies the rub. You need opinionated people to consider the negative impacts of ethical decisions on a business, but it also needs people who will turn a blind eye when the business chooses the less ethical route (in their opinion). I should add; since ethics are highly subjective, you also need opinionated people who accept that others might disagree with their opinions and know how to be a team player. (** This is not a commentary on Timnit, I’m just saying it’s perhaps impossible to find such people)
guys what are the main models of ai?
i have a research
i think so yea
dm me i can give you what my research is about
I'm good
materials? as in learning resources or like the principal components?
bro i dont know anything my teacher gave me this research
XD well I can't answer your question if you don't even know what it means
wait brb lemme ask him
This made me laugh for a minute. Bro your teacher must have believed you know something and capable of doing the task before giving it to you. So you do know something, well, maybe not enough at the moment, but you do know something.
If I were presented with such electives I think I'd have gone for Ethics 'cos the topic just has a way of opening your mind to see things from a different perspective. It can be boring though if the professor isn't the type that makes his/her class very engaging and interactive.
Seems you'd most likely enjoy working in Ethics if you were to venture into ML Research. I've always avoided it cos it comes with a lot of things I'm not ready for 😀
where I can deploy ai model for free?
heroku has MFA, vercel requires phone to verify
I want some simple
this is not security important model
just for testing
@odd meteor ha fuck dude that’s a laugh, I’m right there with ya . I could spout moral ethics till I die but I guarantee I’ll be unethical within an hour
Hello! Need a bit of help with scikit learn!
So here is my usecase. I have made a model in python scikit learn. It works great with accuracy of 89%. But I want to use it in JS as I am using MERN (I don't want to join the MERN app with a flask/fastAPI server to get the classification). So I looked at tensorflow js and scikit learn wrapper for JS: https://scikitjs.org/ So I preprocess the data in the exact same manner (I self coded the vectorizer in both the languages, so I can compare the data being fed and it's exactly the same) But when I run the SVC on js it leads to an amazing accuracy of 9%. Any idea what I can do?
Thank you for a response in advance
@odd meteor if only philosophy didn’t exist in a semi vacuum, I’d be free to be some sort of moral agent of some purpose
True. Possessing empathy, opposing any form of oppression, advocating for inclusivity, and having a natural inclination to challenge inconsistencies and call out bullsh!t are traits not everyone possesses.
While I'd like to believe that I'm someone who understands and values empathy, I don't necessarily envision myself venturing into research in Ethics. It's just a lot 😀
After going through the 'Stochastic Parrots' research paper, my admiration for those who specialize in Ethics has grown. Reflecting on the incident where Google's 2015 image classification model labelled Black individuals as gorillas, I'm reminded of the pivotal role ethical researchers play. Their contributions are invaluable, but it's a field that might be too heavy for me.
Try Streamlit / Streamlit Cloud or HuggingFace Spaces. More recently I saw a video on a new platform called Runway: https://www.youtube.com/watch?v=tSiS15ubQFQ
hmm streamlit is like R shiny?
I haven't used R-Shiny so I have no idea. You can use Streamlit to deploy your model as a web app. If that's what R-shiny does, then perhaps there's some semblance between both.
ok
I'm adding this to my playlist
Lower learning curve than R shiny
https://mbanet-e62f5mi5jcqypxb2huifrl.streamlit.app/
it's stuck on input
preds = predictions(xTxt)
print('preds', preds)
text = input("type name of product (e.g beer): ")
test = pred(text)
when sharing, users don't see the manage app console?
I don't think that streamlit supports input(), iirc you should use components provided by the library (buttons, checklists, probably some specialized forms of text inputs etc) instead of using just input() itself
check the documentation and examples if you haven't yet
I have a 200MB, 1.1MM-row CSV file. Too big for Excel, but I'd still like a GUI-based method of initial data exploration. What are your favorite ways of doing this? I've found three possible solutions thus far:
-PowerBI
-Datasette
-VS Code's CSV extensions
You could use pandas in a Jupyter notebook
Jupyter forces the dataframe to fit into the screen size, which is really inconvenient when it comes to larger fields like review_content. It also doesn't show me all the entries. I've yet to find a solution for this. Even adjusting display.max_colwidth to 100 doesn't help.
I wonder if there's a Jupyter lab plugin to solve that
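Short of a plugin, one thing that may help is passing `None` instead of a number for the display options, since any integer limit still truncates. A rough sketch on a tiny stand-in frame (only `review_content` is a column name from the discussion; the other column and the values are made up):

```python
import pandas as pd

# Tiny stand-in frame; the real one would come from pd.read_csv(...).
df = pd.DataFrame({
    "review_score": ["3/5", "A-"],
    "review_content": ["x" * 300, "short review"],
})

truncated = repr(df)  # default display.max_colwidth is 50, so long text is cut

# None lifts each limit entirely; any integer (including 100) still truncates.
with pd.option_context(
    "display.max_colwidth", None,
    "display.max_rows", None,
    "display.max_columns", None,
):
    full = repr(df)  # in a notebook, call display(df) inside this block

print(len(truncated), len(full))
```

With `display.max_rows` set to `None`, rendering all 1.1 million rows at once would likely stall the notebook, so it is probably safer to combine this with `df.head(n)` or `df.sample(n)`.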
I'm fine-tuning a GPT-2 124M model using the gpt-2-simple module. Any tips to prevent overfitting or underfitting, like which parameters I should adjust?
If you find it, please give me a holler. Because I'm stuck. For now, I'll leave it for the night and play Cities Skylines or something to relax and forget about it.
you could take a random sample and explore it in Excel, then go back to python and write actual code to process the entire thing, but GUI-based exploration doesn't really scale well
Excel has its own problems; namely, it freaks out when encountering accented characters
that is probably an issue related to file encoding or excel localisation settings?
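For the sample-then-Excel route, here is a standard-library sketch that covers both points: it draws a random sample without loading the whole file, and writes it as UTF-8 with a byte-order mark, which Excel uses to detect the encoding. It assumes the source file is UTF-8; `sample_csv` and its parameters are illustrative names, not from any library.

```python
import csv
import random


def sample_csv(src, dst, k, seed=0):
    """Copy the header plus k randomly chosen rows from src to dst."""
    rng = random.Random(seed)
    with open(src, newline="", encoding="utf-8") as f:  # assumed encoding
        reader = csv.reader(f)
        header = next(reader)
        sample = []
        for i, row in enumerate(reader):
            if i < k:
                sample.append(row)
            else:
                # Reservoir sampling: keep row i with probability k / (i + 1).
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = row
    # "utf-8-sig" prepends a byte-order mark so Excel reads accents correctly.
    with open(dst, "w", newline="", encoding="utf-8-sig") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(sample)
    return len(sample)
```

Something like `sample_csv("rotten_tomatoes_critic_reviews.csv", "sample.csv", 5000)` would then give a file small enough to open directly.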
another option could be yeeting it into a database (either just a sqlite file or something like postgres), then using something like dbeaver, but at that point you might as well just end up using sql instead of python with the same problems
Will try that out. I also stumbled across CSView through Reddit
Postgres would be a bit extravagant, wouldn't it? SQLite can do the job I think
setting up just for that would be
if you already had it set up for something else and could reuse for that, not that much
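If the SQLite route wins, the import itself needs nothing beyond the standard library. A rough sketch, assuming a UTF-8 CSV with a header row and the same number of fields on every line; `csv_to_sqlite` and the table name `reviews` are made up for illustration, and every column is stored as TEXT so that casting can be decided later:

```python
import csv
import sqlite3


def csv_to_sqlite(csv_path, db_path, table="reviews"):
    """Load a CSV with a header row into a SQLite table of TEXT columns."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join(f'"{name}" TEXT' for name in header)
        marks = ", ".join("?" for _ in header)
        con = sqlite3.connect(db_path)
        try:
            with con:  # commits on success, rolls back on error
                con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
                # executemany pulls rows from the reader one at a time.
                con.executemany(f'INSERT INTO "{table}" VALUES ({marks})', reader)
            return con.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        finally:
            con.close()
```

The resulting `.db` file can then be opened in dbeaver or any other SQLite browser.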
Yep, that’s encoding
Happens to me daily
How would you get around this? SQLite -> dbeaver?
Change Excel encoding settings? Switch to PowerBI?
I just got here
Not sure what we’re solving
CSView is okay, but you can't zoom in and out to display more info on the screen. The viewport is fixed.
What program is this?
What’s the starting format?
Is CSV the original type?
And all you want is to just manually look at the data?
Yes. The original source is here: https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset
I wanted a bird's eye view of how the data is formatted as a kind of initial EDA, figure out the proper data format of each field (do I need to do any str -> int casting, etc.) and to check for finicky edge cases
Oh 17k records?
Yeah
That’s potatoes for Excel
That's rotten_tomatoes_movies.csv. rotten_tomatoes_critic_reviews.csv is 1.1 million records
1,130,017 specifically
You would have to do something like this
FYI you can open an empty workbook and then import from CSV, and then in the wizard you can correct the encoding and reject the automatic transformations
But yeah, if you have over 100k, give or take 100k, definitely go for the database option
Thank you
The only large column is review content and I'm pretty sure that's not something you're going to ... plot are you?
I think I'd just work with a notebook and the other 6 columns mainly. If I want the 7th I'd read those reviews "manually"
hey all small question
i am new to machine learning, and i have a data set that i basically train this network on
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(128, input_dim=1, activation='relu'))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='linear'))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['mean_absolute_error'])
history = model.fit(input_data, output_data, batch_size=32, epochs=300,shuffle=True)
i am basically trying to approximate E = 13.6/(n*n)
i generated some readings based on E, where the input data is n and the output data is E. i apply a small deviation to E so that the results are not exact
now the problem is that, if i try to predict values outside of the training range (training range is n = 1 to n = 50), then i get massive error rates, like 200% or sth lol (predicting values within the training range works fine, but anything outside of it just slowly grows into massive error rates)
any ideas?
i am honestly at my wits' end, i have no idea what to do xD
if you explicitly include the model into the network, it'll work better
the network has no way of knowing what it would do outside the training data otherwise
you can rework the network so that it estimates the parameters of your model for E
thanks, and interesting, can you give me an example @wooden sail ?
not sure how to achieve that to be fair 😓
the main question is, what things can we assume to be known ahead of time? and what do we want to find out?
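One possible answer, purely as a sketch: assume the functional form E = a / n^p is known and only the constants a and p are unknown. Then there are just two parameters to estimate, and the fitted formula extrapolates by construction. The version below uses an ordinary least-squares fit in log space rather than a neural network, and the +/-1% multiplicative noise is an assumption standing in for the "small deviation" described above.

```python
import math
import random


def fit_power_law(ns, es):
    """Least-squares fit of log E = log a - p * log n; returns (a, p)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in es]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return math.exp(intercept), -slope


rng = random.Random(0)
ns = list(range(1, 51))  # same training range as above: n = 1 to n = 50
# Assumed noise model: a +/-1% multiplicative deviation on each reading.
es = [13.6 / n**2 * (1 + rng.uniform(-0.01, 0.01)) for n in ns]

a, p = fit_power_law(ns, es)
pred_100 = a / 100**p      # n = 100 is well outside the training range
true_100 = 13.6 / 100**2
rel_err = abs(pred_100 - true_100) / true_100
print(a, p, rel_err)
```

The same idea carries over to a network: instead of mapping n directly to E, it would have trainable scalars for a and p (or output them) and compute E through the known formula.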
