#data-science-and-ml
nice, thank you
Help, I'm making an autoencoder, my error happens when I call autoencoder.fit:
ValueError: y argument is not supported when using dataset as input.
This is my code:
#IMPORTS
#LOADING DATASET
(x_train, x_test), ds_info = tfds.load('celeb_a', split=('train', 'test'), shuffle_files=True, as_supervised=False, with_info=True)
x_train = x_train.map(lambda x: {'image':(x['image']/255)})
x_test = x_test.map(lambda x: {'image':(x['image']/255)})
print(x_train)
print(x_test)

class Autoencoder(Model):
    def __init__(self, latent_dim, shape):
        super(Autoencoder, self).__init__()
        self.latent_dim = latent_dim
        self.shape = shape
        self.encoder = tf.keras.Sequential([
            layers.Input(shape=shape),
            layers.Conv2D(16, (3, 3), activation='relu', padding='same', strides=2),
            layers.MaxPooling2D((2, 2)),
            layers.Conv2D(32, (3, 3), activation='relu', padding='same', strides=2),
            layers.MaxPooling2D((2, 2)),
            layers.Flatten(),
            layers.Dense(latent_dim, activation='relu')])
        self.decoder = tf.keras.Sequential([
            layers.Input(shape=(latent_dim,)),
            layers.Dense(tf.math.reduce_prod(shape), activation='relu'),
            layers.Reshape(shape),
            layers.Conv2D(32, (3, 3), activation='relu', padding='same', strides=2),
            layers.UpSampling2D((2, 2)),
            layers.Conv2D(16, (3, 3), activation='relu', padding='same', strides=2),
            layers.UpSampling2D((2, 2)),
            layers.Conv2D(8, (3, 3), activation='relu', padding='same', strides=2)])

    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

shape = (218, 178, 3)
latent_dim = 64
autoencoder = Autoencoder(latent_dim, shape)
autoencoder.compile(optimizer='adam', loss=losses.MeanSquaredError())
autoencoder.fit(
    x_train, x_train,
    epochs=10,
    shuffle=True,
    validation_data=(x_test, x_test))
maybe not quite what you're looking for, but you can call R using rpy2 https://rpy2.github.io/ and use this https://search.r-project.org/CRAN/refmans/misty/html/na.test.html
Hey there! 👋 It seems like you're having trouble with your autoencoder model. The error "ValueError: y argument is not supported when using dataset as input" usually happens when you're trying to fit a model with a dataset as input, but you're also providing a target y argument.
In your case, you're using tf.data.Dataset objects for training and testing data. When you call fit on your autoencoder model, you're providing x_train as both the input data and the target data. However, when using a tf.data.Dataset, you should only provide it as the first argument to fit, and not as the target data.
Here's how you can modify your fit call:
autoencoder.fit(
    x_train,
    epochs=10,
    shuffle=True,
    validation_data=x_test)
In this case, your x_train and x_test datasets should yield pairs (input_batch, target_batch). But since you're working with an autoencoder, your input data is the same as your target data. So, you might need to modify how your datasets are created to yield pairs of the same data.
I hope this helps! If you have any more questions or need further clarification, feel free to ask! 😊
so I tried to use Optuna and it's still taking quite a while. I realized the main problem is I don't know what range of values I should use for my hyperparameters, which are:
- max_features (sqrt or log2 so not applicable)
- n_estimators
- max_depth
I've tried looking up on "how to know what range of values to test for in random forest hyperparameter tuning", but I just get a bunch of results explaining how to tune them, while each one just chooses a random value for the start and another random value for the end.
Is there a way to know what range of values I should use based on the amount of data I have?
do we have chatgpt now?
Why you think that XC
Cause I am
what values are you using?
you can save a lot of time here by using "warm starts" and gradually increasing n_estimators. or halving random search with the same
max features i don't think is worth cross validating over
so that's just max depth, at which point you're searching over a single parameter
I also did a post in #1051603408597024828 so don't think that if I write well it means I use AI
...also n_estimators...
my input is a series of images processed from tensorflow into numpy arrays (5,712 training and 1311 testing)
atm I'm doing
n_estimators=trial.suggest_int("n_estimators",100,400,step=20)
max_depth=trial.suggest_int(log=True,name="max_depth",step=3,low=2,high=32)
kind of, but that's what i mean about the warm starts thing. you can treat n_estimators as a special case. conceptually, you would do something like this (assuming scikit-learn):
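from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score as score  # assumption: any metric works here

model_base = RandomForestClassifier()
scores = {}
for max_depth in [1, 2, 4, 6, 8]:  # 0 is not a valid max_depth, so start at 1
    model = clone(model_base)
    # warm_start=True keeps the trees that are already fitted and only adds new ones
    model.set_params(warm_start=True, max_depth=max_depth)
    for n_estimators in [10, 50, 100, 150, 200]:  # must be at least 1
        model.set_params(n_estimators=n_estimators)
        model.fit(x_train, y_train)
        score_train = score(y_train, model.predict(x_train))
        score_eval = score(y_eval, model.predict(x_eval))
        scores[(max_depth, n_estimators)] = (score_train, score_eval)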
that should save you a fair amount of individual decision tree fits
also i believe sklearn supports transparently parallelizing RandomForestClassifier
(are you training on image embeddings? why not just keep them going through a fully connected hidden layer at the end of whatever NN you're already using?)
so if you want to use optuna for max_depth you can, but you might not need it
also i don't know that incrementing n_estimators from 100 to 400 in steps of 20 is a good use of cpu cycles. seems excessive
the other option is halving search, which works well on tree ensembles and NNs because it's easy to identify what the "resource" should be (number of trees or number of epochs, respectively) https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV
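To make the "resource" idea concrete, here is a minimal pure-Python sketch of successive halving. It is not scikit-learn's implementation: `successive_halving`, `evaluate` and `toy_score` are hypothetical names, and the toy scoring function only stands in for "fit a forest with this many trees and score it".

```python
def successive_halving(candidates, evaluate, min_resource, max_resource, factor=3):
    """Return the best candidate, dropping the weakest ones as the resource grows.

    `evaluate(candidate, resource)` is a hypothetical callback that trains with
    the given budget (e.g. number of trees) and returns a score, higher = better.
    """
    survivors = list(candidates)
    resource = min_resource
    while True:
        ranked = sorted(survivors, key=lambda c: evaluate(c, resource), reverse=True)
        if len(ranked) == 1 or resource >= max_resource:
            return ranked[0]
        # Keep roughly the top 1/factor and give them `factor` times more resource.
        survivors = ranked[: max(1, len(ranked) // factor)]
        resource = min(resource * factor, max_resource)


# Toy stand-in for "fit a forest with `n_trees` trees at this max_depth":
# the score peaks at depth 6 and improves slightly with more trees.
def toy_score(max_depth, n_trees):
    return -abs(max_depth - 6) + 0.001 * n_trees


best_depth = successive_halving([2, 4, 6, 8, 10, 12, 14, 16, 18], toy_score, 20, 180)
print(best_depth)
```

Most candidates are only ever scored with the small budget; only the survivors get the expensive fits.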
I'm not making a Neural network. I'm trying to see how accurate a Random Forest can get compared to a CNN.
and you think those 5 values for max_depth and n_estimators will be enough?
ahh, i see
well it was more to demonstrate the point, but yeah i think more than 8 max depth is probably overkill
I'd say for random forest the most important parameter to optimise for is cost complexity. N_estimators isn't worth it, it doesn't do a lot for random forest (I'll leave this as an exercise for you to figure out)
likewise with 200+ estimators, although breiman advocates for 1000+ trees, my experience is that you hit diminishing returns pretty hard. but you should see that in the score curve
fortunately your parameter space is small enough to visualize
On the other hand, max_depth is implicitly taken into account when you compute the cost complexity (cc_alpha), look at the equation to understand why 😁.
For xgboost it's the opposite, there you should properly reduce the number of estimators
i'm actually not familiar with that parameter, that's ccp_alpha in sklearn?
Yes
ohh i see, this is for pruning individual trees?
so you'd set a generous max depth and then optimize over the pruning parameter instead?
I'd roll with the default max_depth, which is very high afaik and then prune
That's what the docs also want you to do afaik, it says something about this in the user guide, for all trees
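For reference, the equation being referred to is the cost-complexity measure from scikit-learn's user guide on minimal cost-complexity pruning, written here for a single tree $T$:

```latex
R_\alpha(T) = R(T) + \alpha \, |\widetilde{T}|
```

Here $R(T)$ is the total impurity (traditionally the misclassification rate) of the terminal nodes, $|\widetilde{T}|$ is the number of terminal nodes, and $\alpha \ge 0$ is what sklearn exposes as `ccp_alpha`. A deeper tree has more leaves and so pays a larger penalty, which is why depth is handled implicitly once you tune $\alpha$.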
cool, i think that guide was probably added after i'd already ossified in my mind that tuning max depth was the way to go
this is what i get for not actually ever reading breiman's book. thanks, i learned something new!
hm... i think the docs only talk about pruning of individual decision trees, which i'm definitely familiar with. but i never heard of doing it in an ensemble like that
Np. Generally I don't like tuning more than 2 hyperparameters unless it's a neural network, so for each model I use I only touch the ones that give the most bang for the buck
agreed on that
There's some other tricks for random forest like using the OOB score instead of cross validation etc.
But that I'm sure you know
yeah, i totally forgot to mention OOB
although i've always been a little nervous about it
i've used it, but i was always mildly skeptical even though it's supposed to be fine
I barely use it because I'm usually trying 10 models with the same generic code
yeah, that's part of it. even the thing i showed with n_estimators involves a whole extra training setup
import json
import pandas as pd
import plotly.express as px
print(px.data.carshare())
def plot_choropleth_from_df_and_geojson(df, geojson_path, show_empty_zips=True, zip_range=[]):
    with open(geojson_path, 'r') as file:
        zipcodes = json.load(file)
    df['zip_code'] = df['zip_code'].astype(str)
    for val in zipcodes:
        print(f"{val['zip_code']} is in {val['state']}")
    color_scale = px.colors.sequential.Plasma
    # Create the choropleth map
    fig = px.choropleth(
        df,
        geojson=zipcodes,
        locations="zip_code", # Change this to match the column name in df
        featureidkey="properties.zip_code", # Update to match the geojson properties
        color="zip_count",
        hover_name="zip_code", # Change this to match the column name in df
        hover_data=["zip_count"],
        scope="usa",
        color_continuous_scale=color_scale,
        title="ZIP Code Choropleth based on Count"
    )
    fig.update_geos(fitbounds="locations", visible=False)
    fig.show()
df = pd.read_csv('zip_codes.csv')
print(df.columns)
print(df.head(5))
geojson_path = "/content/USCities.json" # Update this path to the location of your GeoJSON file
plot_choropleth_from_df_and_geojson(df, geojson_path, show_empty_zips=True)
still stuck with figuring out this choropleth map
i need help with understanding how to parse the json data into a pandas dataframe
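A minimal sketch of one way to get from GeoJSON to rows, assuming the file is a standard FeatureCollection (a dict with a `features` list, each feature carrying a `properties` dict). The property names `zip_code` and `state` come from the code above; the sample data and the helper name `geojson_to_rows` are made up for illustration. Note that iterating over the loaded dict directly only yields its top-level keys, not the features.

```python
import json

# Tiny stand-in for the GeoJSON file; only the property names are taken from
# the code above, the values are invented.
sample = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature",
     "properties": {"zip_code": "10001", "state": "NY"},
     "geometry": {"type": "Point", "coordinates": [-73.99, 40.75]}},
    {"type": "Feature",
     "properties": {"zip_code": "94105", "state": "CA"},
     "geometry": {"type": "Point", "coordinates": [-122.39, 37.79]}}
  ]
}
""")


def geojson_to_rows(geojson):
    """Return one dict per feature, built from its `properties` mapping."""
    return [dict(feature["properties"]) for feature in geojson["features"]]


rows = geojson_to_rows(sample)
for row in rows:
    print(f"{row['zip_code']} is in {row['state']}")
```

A list of dicts like `rows` can be handed to `pd.DataFrame(rows)` to get one column per property.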
here's the thing. I've never seen a tutorial that's used this hyperparameter, but to be fair all the tutorials I've seen, except two, have been for very small datasets (meaning they weren't as computationally expensive as my situation is). Heck, this is the first time I'm seeing this hyperparameter.
Also I'd have no idea of where to start looking for values within that hyperparameter, especially since it's a float.
90% of data science / ML tutorials are all just rehashing the same content over and over
and over
read a couple tutorials? write your own, post it on TDS, claim to be an educator on your linkedin
so i wouldn't take absence of any technique from generic tutorials as strong evidence against the usefulness of that technique
well so there's also an implied principle of "I should only try solving problems that there are resources for".
If I am having such a hard time tuning the hyperparameter for which there are lots of tutorials on how to tune, it would be even harder to tune hyperparameters that don't have tutorials on how to do so online.
and yes. I know just because teachers exist, doesn't mean they're good. I go to college.
If you don't believe me I can recommend a chapter in a book you can read at your own time 🙂
There's also these links on sklearn's docs:
https://scikit-learn.org/stable/modules/tree.html#minimal-cost-complexity-pruning
https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html
From my experience the "tutorials" that cover sklearn are from people that have never read the documentation. It's quite well done so I suggest you read it once or twice front to back, it takes 2 days tops
I'd appreciate that if you wouldn't mind.
is there a way to export discord chat data? 😅
@past meteor Would you mind answering this? Thanks 🙂
Introduction to statistical learning, there's a chapter on it
I think it's fine, at least the way it was covered in my classes there were no big dependencies between each other, at least for the basic stuff
Arguably you might retain the information better by spreading the learning out over a longer time and reinforcing it.
!code
Be sure to never ask people to read screenshots of text.
!code
the !code command brings up a message from the bot. Be sure to follow the instructions in the message.
Is speaker diarization and speaker turn detection the same thing? I just need to split an hour long audio based on a change of speaker without the need to identify them.
Hello, I am a beginner to statistics and interpreting graphs like histograms. I can't really tell if this graph is skewed or not. The y axis is the amount and the x axis is the age.
you could numerically compute the skewness either with scipy https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.skew.html or by coding the formula for skewness yourself
the main idea behind skewness is checking whether more of the area of a pdf (or here, histogram) is to the left, the right, or if it's centered
at least at a glance, this data looks skewed to the right*, since skewness describes which side the tail is on
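For the "code the formula yourself" route, a minimal sketch using the Fisher-Pearson coefficient g1 = m3 / m2^1.5, which is the same biased estimate `scipy.stats.skew` returns by default. The `ages` list is made up; a positive result means the tail is on the right.

```python
def skewness(values):
    """Fisher-Pearson coefficient g1 = m3 / m2**1.5 (biased, population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


ages = [21, 22, 22, 23, 24, 25, 27, 30, 41, 58]  # made-up sample with a right tail
print(skewness(ages))
```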
I see, thanks. So if I wanted the typical value of this data, should I calculate the median instead of the mean since it seems to be skewed?
depends what you mean by "typical value" 😛
Ah sorry, I mean the average value when I say typical value. I'm trying to find the average age in this case but I'm kind of stumped
The mean is the average; the median is the middle value of the sorted data (ascending or descending, it doesn't matter). That means the mean is much more sensitive to things like outliers, while the median isn't.
Is my explanation clear to you? I hope so. ¯_(ツ)_/¯
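A small sketch of that difference with made-up ages: adding a single extreme value moves the mean a lot and the median barely at all.

```python
import statistics

ages = [21, 22, 22, 23, 24, 25, 27, 30]  # made-up sample
with_outlier = ages + [95]               # same sample plus one extreme value

mean_shift = statistics.mean(with_outlier) - statistics.mean(ages)
median_shift = statistics.median(with_outlier) - statistics.median(ages)
print(mean_shift, median_shift)
```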
Hi, can you explain to me what you mean with that last part "So, you might need to modify how your datasets are created to yield pairs of the same data."
Also, I tried the suggested code and I got a value error:
Call arguments received by layer 'autoencoder_3' (type Autoencoder): • x={'image': 'tf.Tensor(shape=(218, 178, 3), dtype=float32)'}
median will most likely be a better representation if data is skewed.
Sure, I'd be happy to clarify!
When you're using a tf.data.Dataset object with the fit method in Keras, the dataset is expected to yield a tuple of two elements. The first element is the input data and the second element is the target data. In other words, it should yield (input_batch, target_batch) pairs.
In an autoencoder, the target data is the same as the input data because the model is trying to reconstruct its input. So, you need your dataset to yield pairs where the input batch and target batch are the same.
Here's how you can modify your dataset creation:
x_train = x_train.map(lambda x: (x['image']/255, x['image']/255))
x_test = x_test.map(lambda x: (x['image']/255, x['image']/255))
Regarding the error you're seeing, it seems like your model is receiving a dictionary as input ({'image': 'tf.Tensor(shape=(218, 178, 3), dtype=float32)'}), but it's expecting a tensor. The modification above should also fix this issue because it will directly pass the tensor to your model.
I hope this helps! Let me know if you have any more questions. 😊
Currently getting
```
Traceback (most recent call last):
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\utils.py", line 85, in _init_ffmpeg
    _load_lib("libtorchaudio_ffmpeg")
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\utils.py", line 61, in _load_lib
    torch.ops.load_library(path)
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torch\_ops.py", line 643, in load_library
    ctypes.CDLL(path)
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\ctypes\__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
FileNotFoundError: Could not find module 'C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\Lib\site-packages\torchaudio\lib\libtorchaudio_ffmpeg.pyd' (or one of its dependencies). Try using the full path with constructor syntax.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\__init__.py", line 67, in <module>
    _init_ffmpeg()
  File "C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\lib\site-packages\torchaudio\_extension\utils.py", line 87, in _init_ffmpeg
    raise ImportError("FFmpeg libraries are not found. Please install FFmpeg.") from err
ImportError: FFmpeg libraries are not found. Please install FFmpeg.
```
Returned when I attempt training with this.
https://github.com/CBeast25/Applio-RVC-Fork/blob/main/lib/infer/modules/train/train.py
which doesn't happen with this https://github.com/CBeast25/Applio-RVC-Fork/blob/main/lib/infer/modules/train/train_new.py
Any ideas?
Just to make sure, but do you have the FFmpeg installed?
Hol up
idk if that's the actual ffmpeg or the package for ffmpeg but I have to assume that is the former, so how about the python package do you also have it installed?
got that one too
ok lemme try it
um idk but could you try installing the one I have?
well that sucks
I don't suppose the problem would be due to pathing?
I quoted this part of the error raised:
FileNotFoundError: Could not find module 'C:\Users\Carson\.pyenv\pyenv-win\versions\3.9.8\Lib\site-packages\torchaudio\lib\libtorchaudio_ffmpeg.pyd' (or one of its dependencies). Try using the full path with constructor syntax.
I don't think so. Might be the version of ffmpeg I have.
i have to train an object detection dataset on yolo model
i have downloaded a dataset called indian vehicle dataset, it has 5 classes of trucks, cars, tempos, tractor, tractor, each has over 130 images, i also have annotation folder for the same classes provided in xml format which looks like following
<annotation>
<folder>20210427</folder>
<filename>20210427_12_52_47_000_pgmtGnyv84hd7BiNoFr7ALIRzeZ2_F_4160_3120.jpg</filename>
<source>
<database>Unknown</database>
<annotation>Unknown</annotation>
<image>Unknown</image>
</source>
<size>
<width>3120</width>
<height>4160</height>
<depth/>
</size>
<segmented>0</segmented>
<object>
<name>tempo</name>
<truncated>0</truncated>
<occluded>0</occluded>
<difficult>0</difficult>
<bndbox>
<xmin>7.5</xmin>
<ymin>1089.34</ymin>
<xmax>522.95</xmax>
<ymax>2326.43</ymax>
</bndbox>
<attributes>
<attribute>
<name>rotation</name>
<value>0.0</value>
</attribute>
</attributes>
</object>
<object>
<name>tempo</name>
<truncated>0</truncated>
<occluded>0</occluded>
<difficult>0</difficult>
<bndbox>
<xmin>383.47</xmin>
<ymin>622.4</ymin>
<xmax>2645.4000000000005</xmax>
<ymax>3151.1</ymax>
</bndbox>
<attributes>
<attribute>
<name>rotation</name>
<value>0.0</value>
</attribute>
</attributes>
</object>
</annotation>
so how do i convert it into yolo accepted format, i am doing this type of thing for the first time
lastly how about the environment variables?
ffmpeg is on my path
Really dont want to install conda but it looks like that's what I'll have to try now
I'm afraid so... sorry I couldn't be of any help :/
To train a YOLO model, you need to convert your annotations into a format that the YOLO model can understand. The YOLO model expects a .txt file for each image in the dataset with the same name as the image file. Each line in this file represents one bounding box, in the format <object-class> <x_center> <y_center> <width> <height>, where:
- <object-class> is an integer representing the index of the class of the object, from 0 to (classes-1).
- <x_center>, <y_center>, <width>, and <height> are all relative to the size of the image. They are in float format and range from 0.0 to 1.0.
Here's a Python code snippet that can help you convert your XML annotations to YOLO format:
https://paste.pythondiscord.com/G37Q
In this code, replace 'path_to_annotations' with the path to your XML annotations and 'path_to_yolo_annotations' with the path where you want to save your YOLO annotations. Also, replace 'path_to_images' with the path to your images.
Please note that this code assumes that your XML files have the same name as your images and are in the same directory. If this is not the case, you may need to adjust the code accordingly.
I hope this helps! Let me know if you have any more questions. 😊
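Since the linked paste isn't reproduced here, this is a minimal standard-library sketch of the conversion described above, not the pasted code. The class order in `CLASSES` is an assumption (only `tempo` appears in the sample annotation), and `voc_to_yolo_lines` is a hypothetical helper name. The example XML is a trimmed copy of the first object from the annotation posted earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical class order; the index in this list becomes the YOLO class id.
CLASSES = ["truck", "car", "tempo", "tractor"]


def voc_to_yolo_lines(xml_text, classes=CLASSES):
    """Convert one Pascal VOC style annotation into YOLO label lines."""
    root = ET.fromstring(xml_text)
    img_w = float(root.findtext("size/width"))
    img_h = float(root.findtext("size/height"))
    lines = []
    for obj in root.iter("object"):
        class_id = classes.index(obj.findtext("name"))
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # YOLO wants the box centre and size, normalised by the image size.
        x_center = (xmin + xmax) / 2 / img_w
        y_center = (ymin + ymax) / 2 / img_h
        width = (xmax - xmin) / img_w
        height = (ymax - ymin) / img_h
        lines.append(f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}")
    return lines


example = """<annotation>
  <size><width>3120</width><height>4160</height><depth/></size>
  <object>
    <name>tempo</name>
    <bndbox><xmin>7.5</xmin><ymin>1089.34</ymin><xmax>522.95</xmax><ymax>2326.43</ymax></bndbox>
  </object>
</annotation>"""

print(voc_to_yolo_lines(example))
```

Each list of lines would then be written to a `.txt` file with the same base name as its image.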
Hey everyone
I am new to ML and NN.
I just tried to learn pytorch with some datasets from kaggle but I quickly ran into a problem.
I have a Dataframe with different data types in it. Boolean, Strings and Floats
Most of them are relevant so now I'm wondering how to handle this. Cause for converting it into a tensor it needs to be a Float or int or binary I think.
How would one handle a data input with mixed data?
And also, if I convert the boolean values to a BoolTensor. How do I merge them with the FloatTensors... Cause it should be still one input right?
hey anyone have any idea why it takes forever to build the wheel for Pandas?
you can represent booleans as 0 or 1. but for strings, it depends on what they represent. is there a finite number of unique strings, where they represent categories? or what?
Thank you! I have both. I have columns with finite possibilities but also columns where I have a lot of different strings. (Probably also finite but not just 5 different possibilities)
But isn't it better to use the proper boolean type than converting it to 0 and 1? After all there is a BoolTensor type...
I am an ai researcher. I know 10+ coding languages and around 7 frameworks. I know everything from cnns to rnns to transformers and large language models. I have made my own finetuned multi modal large language model and am researching about augmented lstms. Im fullstack too and im here to discuss about anything CS or math and looking for friends and connections. I own two startups currently and working on a lot of projects.
Dm me if you want to be a friend, collab on smth
for the features that have a finite set of possibilities (ie, they represent a fixed number of categories), you can one-hot encode them
for strings that could be "anything" (like a person's name, or a description of an art piece), the question becomes even more complicated.
hmm ok what do you mean by that?
there are different approaches to representing natural language numerically, but that's more advanced, and you should probably focus on one-hot encoding for the moment.
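A minimal pure-Python sketch of how the pieces can end up in one input: the boolean becomes 0.0 or 1.0, the categorical string is one-hot encoded, and everything is concatenated into a single list of floats. The row, the column names and the `one_hot` helper are hypothetical.

```python
def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    return [1.0 if value == category else 0.0 for category in categories]


# Hypothetical row: one float, one boolean and one categorical string column.
row = {"price": 12.5, "in_stock": True, "color": "green"}
colors = ["red", "green", "blue"]  # the fixed set of categories

features = [row["price"], float(row["in_stock"])] + one_hot(row["color"], colors)
print(features)  # [12.5, 1.0, 0.0, 1.0, 0.0]
```

A list like `features` can then go into `torch.tensor(features, dtype=torch.float32)`, so there is only one float tensor and nothing to merge.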
no you don't.
I have an ID and I want to map the ID to my prediction.
This is no relevant data for the training part.
How would you handle this
aight guess I don't
I'm just being silly, if that wasn't clear.
sure I caught that 😄
anyway, you don't encode the ID. but if you feed a tensor with 10 rows into the network, then you'll get ten rows out. and the nth row of the output represents the nth row of the input.
(or, if you're working with more than two dimensions, whatever the leftmost dimension is)
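A small sketch of the row-order idea, with a stand-in function instead of a real network (`fake_model`, the IDs and the feature values are all made up): keep the IDs in a separate list in the same order as the rows, then pair them with the outputs afterwards.

```python
# Hypothetical data: IDs are kept out of the features but stay in the same order.
ids = ["a17", "b42", "c03"]
feature_rows = [[0.2, 1.0], [0.9, 0.0], [0.5, 1.0]]


def fake_model(rows):
    """Stand-in for a network: one output per input row, in the same order."""
    return [sum(row) for row in rows]


predictions = fake_model(feature_rows)
id_to_prediction = dict(zip(ids, predictions))  # nth ID goes with nth output
print(id_to_prediction)
```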
emm ok...
I think I just keep coding and ask again when I get there... I think I won't understand it right now.
Is it ok if I ping you later?
I check this channel regularly, but don't ping me since I have a variable schedule.
ok
How can I make a simple basic
With no modules
Only built-in modules
Is there a way
hello guys i want to create ai video
where i want to integrate my body and voice into the video
is there any free website or something like that
Hello. I hope you are well. Could you tell me about your industry experience? How did you get into the professional world?
are there any apps or websites to modify and add things to a video with ai?
Hey
I think capcut. Idk how it works but I've seen some Tik-Tok videos using it do some cool stuff
What is your guys' overall opinion about tensorflow and pytorch?
which is used the most? is one easier than the other?
Recently TF dropped native GPU support on Windows (2.10 was the last release that had it), which may be an issue for some people. Besides that, I think they are nearly equivalent these days.
there's probably a github issue explaining why or something.
GPU on windows is only supported via WSL2 now. and unix systems are still supported.
I didn't even know what WSL was until now lol
Hello I built a level in unreal engine and want to do behavioral cloning can anyone point me in the correct direction
I have a little underwater level with a pilotable submarine, and I want to teach a simple model to drive the submarine and perhaps go to targets.
Hello everyone. I was required to write a python script to deduce height of a medical object. Is there any sort of computer vision model to do that directly? Unfortunately, I couldn't find any in context of my use case.
Therefore I came up with this idea: segment the area of interest and draw a bounding box around the segmented polygon. Calculate the height and width of the bounding box with simple geometry, which gives heights and widths in pixels. Convert pixels into our required scale if the focal length is constant, or use something like an ArUco marker for relative measurement.
However, I don't know how reliable this thing will turn out to be. Is there a better way? Or a vision model that I don't know of?
it depends on the constraints defined for your project, because things like which camera is used, the angle the picture is taken from, and the relative distance to the object could all impact the accuracy of your system. if you don't mind sharing a bit more detail, please do.
Thank you for your interest. It is a segment of an OCT image. Let's assume the angle is always perpendicular and the relative distance always remains constant.
oh well then, I'm confident the result would be reliable enough, especially given how well you understood the assignment. but when in doubt, remember that it is a must to run a system evaluation to determine your system's error rate. from there you can decide whether to make drastic changes to your system or just simple tweaks.
And one more thing: I think it would help to have another object as a reference in all of the images, perhaps a piece of A4 paper as the background behind your main object. With this you can sample its (the A4 paper's) pixel to real-world size ratio and use it to calculate the main object's real-world size from its pixels. This is just one way to do it; you might find other methods much easier to implement/understand, and I'd suggest taking a look on Google Scholar.
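A minimal sketch of the bounding-box plus reference-scale idea, assuming a perpendicular view and a reference object lying in the same plane as the target. All numbers and both helper names are made up; for OCT images the device's own pixel spacing may be a better source for the scale than a physical reference.

```python
def bounding_box(polygon):
    """Axis-aligned bounding box (xmin, ymin, xmax, ymax) of (x, y) pixel points."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def size_in_mm(polygon, reference_px, reference_mm):
    """Width and height of the polygon's box, scaled by a known reference length."""
    mm_per_px = reference_mm / reference_px
    xmin, ymin, xmax, ymax = bounding_box(polygon)
    return (xmax - xmin) * mm_per_px, (ymax - ymin) * mm_per_px


# Made-up numbers: a segmented outline in pixels, and a reference object whose
# 210 mm edge (the short side of an A4 sheet) spans 840 pixels in the same image.
outline = [(100, 50), (300, 60), (310, 450), (90, 440)]
width_mm, height_mm = size_in_mm(outline, reference_px=840, reference_mm=210)
print(width_mm, height_mm)
```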
hello
I have trained a model on a dataset with imbalance using xgboost
it gives f1_score_weighted of 85%
per-label it gives f1 score [0.8855230715040509, 0.8542857142857143, 0.7451428571428571, 0.743142144638404, 0.853898561695685, 0.8672331386086033]
what else could be done in order to make better model
to increase the accuracy/f1_score
Should I use CatBoost instead or try to solve imbalance in dataset using smote or other methods
Maybe both
could you perhaps provide us with more context on your model? if it's a CV model, you can augment the data by adding dups to the minority category and changing the dups' rotation or inverting them so there's more variety. if it's a timeseries model like ASR, reversed audio dups could work too.
another thing you can do is add weights to your categories, so minority categories get a bigger weighting to balance things out, which pushes the model to be more careful with rarely occurring data.
lastly, splitting with a proportional distribution of the categories (a.k.a. stratified splitting) might also affect its performance.
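A minimal sketch of the weighting idea, using the same formula scikit-learn applies for `class_weight="balanced"`: n_samples / (n_classes * class_count). The label distribution and the helper name are made up.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


labels = ["a"] * 6 + ["b"] * 3 + ["c"] * 1  # made-up imbalanced labels
weights = balanced_class_weights(labels)
print(weights)

# One weight per training sample, in the same order as the labels.
sample_weights = [weights[label] for label in labels]
```

With xgboost's scikit-learn wrapper, a per-sample list like `sample_weights` can be passed as `sample_weight` to `fit`.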
I assume you haven't already applied the ideas I'm telling you about, @lapis sequoia?
It is a cv model
Emotion model
dair-ai/emotion
It's basically a model for nlp classification
though the ideas I've stated could still work regardless of model since it is more to do with how the data is being presented, btw you haven't answered my question :).
No I've not applied the techniques you've mentioned
Well you might want to try them out first because, from my past experience, these are pretty common ways to deal with an imbalanced dataset.
Are there any good courses on Data analytics and Big data using jupyter for data understanding, cleaning, preprocessing, modeling etc anywhere?
I do apologize as I'm unfamiliar with CatBoost, perhaps it's way more streamlined than the basic ideas I mentioned
Nah
I heard it for the first time 😅
I'll do some fixing to tackle the imbalance
well I guess then it's ok to stick with the basics. haha
you can combine oversampling/smote with undersampling too. try different combinations, who knows what might and might not work together
one thing that's certain is that you should always try and experiment with different scenarios ¯_(ツ)_/¯
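A minimal sketch of plain random oversampling, assuming it is applied to the training split only (after the train/test split). This is not SMOTE, which synthesises new points instead of repeating existing ones. The texts, labels and helper name are made up.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority-class samples until every class matches the largest."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw extra copies from this class until it reaches the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


texts = ["t1", "t2", "t3", "t4", "t5", "t6"]
labels = ["joy", "joy", "joy", "joy", "anger", "anger"]
new_texts, new_labels = random_oversample(texts, labels)
print(Counter(new_labels))
```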
Why not also use a neural network that looks at a CSV file of the winners of all the most recent Grands Prix, so that it can pick up a pattern in the trend of the winners?
Tbh this is the first time I'm experimenting with the nlp model (maybe ML in general)
I'm curious how your visual emotion model could also use nlp? afaik nlp is mostly used for text-based scenarios.
I've never worked with NN
Ik transformers could give a whopping accuracy of maybe >94
But rn I'd like to stick to ml algo
Ok👌
No
It's a text based dataset
oh so it wasn't computer vision then ?🤦
I'm getting thrown off by such revelation
^
Thanks man
ok I'm not really versed in NLP, but I believe the ideas I mentioned still apply, aside from making varied duplicates, for obvious reasons.
Adding duplicate would not be a great idea
In the preprocessing
I literally removed the duplicates
then we have established a similar conclusion.
Use the TextAttack library (the same library used for adversarial text attacks on NLP tasks) to perform data augmentation on the minority class.
That would likely improve the current model's performance.
but don't think of them (duplicates) as an obstacle to your model, because oversampling and undersampling are basically just that: adding duplicates (or removing samples) as a means to balance the dataset.
Actually I'm doing an assignment
They've mentioned to remove the dups
It's a standard procedure in the nlp ig
Will check
Thanks 👍
guys I've been looking for tutorials to create a chatbot (AI) but most of them are outdated (videos too old and don't work) could you recommend me something? I haven't found anything else.
My idea was to have a json (something simple) so: tags,patterns,responses
{
  "intents": [
    {
      "tag": "greetings",
      "patterns": [
        "hello",
        "hey",
        "hi"
      ],
      "responses": [
        "Hello",
        "hey!",
        "what can i do for you?"
      ]
    }
  ]
}
Are there any good courses on Data analytics and Big data using jupyter for data understanding, cleaning, preprocessing, modeling etc anywhere? I'm struggling at uni to understand the terms and techniques
Sure, I found some recent resources that might help you:
- "How to Build Chatbots | Complete AI Chatbot Tutorial for Beginners (https://www.youtube.com/watch?v=jCoH82LPgdk)" by Liam Ottley. This is a comprehensive video tutorial that covers everything you need to know to build custom no-code AI chatbots.
- "Learn how to create your own chatbot easily (https://www.youtube.com/watch?v=Pj00e6lq9Cg)" by TECH WITH SACH. This tutorial teaches you how to create your own chatbot quickly and easily using Dialogflow and different tools and features.
- "How to Build a ChatBot using the GPT-4 API – Full Project-Based Tutorial (https://www.freecodecamp.org/news/build-gpt-4-api-chatbot-turorial/)". This is a full project-based tutorial that teaches you how to build your own chatbot using the GPT-4 API.
- "A Complete Guide on How to Build a Chatbot (Easy to Hard) (https://www.g2.com/articles/how-to-build-a-chatbot)". This guide provides different ways to build a chatbot, each requiring a varying level of technical skills.
- "ChatterBot: Build a Chatbot With Python – Real Python (https://realpython.com/build-a-chatbot-python-chatterbot/)". This tutorial guides you through creating a chatbot using Python ChatterBot.
As for your idea of using a JSON structure for intents, patterns, and responses, it's a common approach in rule-based chatbots. However, for more advanced AI chatbots, you might need to use more sophisticated techniques such as machine learning and natural language processing. I hope this helps! 😊
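For anyone following along, here is a minimal sketch of how a tags/patterns/responses JSON like the one above could drive a purely rule-based bot. The helper names (`load_intents`, `reply`) and the exact-match rule are assumptions for illustration, not taken from any of the linked tutorials; an ML-based bot would swap the lookup for a trained classifier.

```python
import json
import random

# Same structure as the JSON posted above. Note that strict JSON does not
# allow a trailing comma after the last item of a list.
INTENTS_JSON = """
{
  "intents": [
    {
      "tag": "greetings",
      "patterns": ["hello", "hey", "hi"],
      "responses": ["Hello", "hey!", "what can i do for you?"]
    }
  ]
}
"""

def load_intents(raw):
    """Parse the JSON and build a pattern -> intent lookup (hypothetical helper)."""
    lookup = {}
    for intent in json.loads(raw)["intents"]:
        for pattern in intent["patterns"]:
            lookup[pattern.lower()] = intent
    return lookup

def reply(lookup, message, fallback="Sorry, I didn't get that."):
    """Return a random response for an exactly matching pattern, else the fallback."""
    intent = lookup.get(message.strip().lower())
    if intent is None:
        return fallback
    return random.choice(intent["responses"])

lookup = load_intents(INTENTS_JSON)
print(reply(lookup, "Hi"))       # one of the greeting responses
print(reply(lookup, "weather"))  # no matching pattern, so the fallback
```

Exact matching is the crudest possible rule; it is only meant to show where the JSON plugs in.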
Finally I got the voice verified role! 😁
is it complicated to integrate your PyTorch model into an frontend application?
I will take a look at all of them, but using chatterbot doesn't appeal to me, I would like to be able to build something of my own without using something already made. For working with intent do you have any tips?
Hey there! 😊 Sure, integrating a PyTorch model into a frontend application can be a bit tricky, but there are several ways to do it:
-
Flask: You can use Flask to deploy your PyTorch model and create a REST API for your model. This is a common way to integrate ML models into web apps.
-
PyTorch C++ Frontend: If you're working with C++, you can use the PyTorch C++ frontend. It provides a pure C++ interface to PyTorch.
-
Flutter: For mobile apps, you can use PyTorch Mobile with Flutter. This lets you run your PyTorch model directly on a mobile device.
-
Windows ML API: If you're developing a Windows app, you can use the Windows ML API to integrate your PyTorch model.
Remember, the complexity can vary depending on your model and application. It's always a good idea to start with a simple prototype and add complexity as needed. Hope this helps! 😊
can I use graphql or whatever? my app is built with react
Sure thing! 😊 Building your own chatbot from scratch can be a great learning experience and also an amazing challenge. Here are some tips for working with intents:
-
Define the Purpose: Start by figuring out what you want your chatbot to do. This will help you determine the types of intents your chatbot needs to handle.
-
Identify Common Questions: Consider the types of questions your chatbot is likely to get. These can guide you in creating your intents.
-
Create Sample Conversations: Try to come up with possible scenarios or conversations your chatbot might have. This can help you understand how to structure your intents and responses.
-
Annotation: Mark up your training data to identify intents and any entities. This is crucial for training your chatbot to understand user inputs.
-
Continuous Training: Train your chatbot with the marked-up data, and keep fine-tuning and adjusting the chatbot’s responses based on feedback.
-
Add and Refine Intents: As your chatbot interacts with users, you'll likely discover new intents that you hadn't thought of. Add these to your chatbot and refine existing ones based on user interactions.
Remember, building a chatbot from scratch can be complex, especially if you want it to understand natural language. You might need to learn about Natural Language Processing (NLP) techniques, which can help your chatbot understand and interpret user intents in real-time.
Hope this helps! 😊
Absolutely, you can use GraphQL with your React application and PyTorch model! Here's a general approach:
-
Create a Backend Service: This service will host your PyTorch model and expose an API endpoint for making predictions.
-
GraphQL Server: Set up a GraphQL server that communicates with your backend service. This server will receive requests from your frontend, forward them to the backend service, and return the results.
-
Apollo Client: In your React application, use the Apollo Client to communicate with your GraphQL server. Apollo Client is a comprehensive state management library for JavaScript that enables you to manage both local and remote data with GraphQL.
-
Fetch Data: Use the useQuery hook provided by Apollo Client to fetch data from your GraphQL server. This data can then be displayed in your React components.
-
Update Data: If your application allows for updating data (like re-training your model), you can use the useMutation hook.
Remember, integrating these technologies can be complex and requires a good understanding of each component. But with some patience and practice, it's definitely achievable. Good luck! 😊
thank you for your help, I noticed that the various videos or tutorials offered are mostly very basic, chatterbot gpt 4 API etc, do you have a tutorial focused on learning based on intent? this would be very helpful for me to study how it works and do something more
of course I searched but they were all very old (2 to 4 years ago) many libraries have changed and do not work as they used to which makes them useless
gives an error when creating the array (line 60) and I have no idea how to fix it
Traceback (most recent call last):
File "/home/saohy/documents/bots/mei-chan-bot/g.ignore/chatbot/training.py", line 60, in <module>
training = np.array(training)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (5, 2) + inhomogeneous part
Absolutely👌, here are some resources that focus on intent-based chatbot development:
-
"Chatbot Development Tutorial: Introduction of Intent, Stories, Actions in Rasa (https://www.pragnakalp.com/chatbot-development-tutorial-rasa/)": This tutorial provides a comprehensive guide on how to use Rasa, an open-source machine learning framework for building chatbots. It covers the concepts of intents, stories, and actions in detail.
-
"Creating Chatbots on Google Cloud (https://developers.google.com/learn/topics/chatbots)": This resource offers step-by-step guides to learn about Dialogflow and the corresponding Google Cloud services, which facilitate building chatbots. Dialogflow is a powerful tool for intent recognition and handling.
-
"How to Build Your AI Chatbot with NLP in Python? (https://www.analyticsvidhya.com/blog/2021/10/complete-guide-to-build-your-ai-chatbot-with-nlp-in-python)": This guide provides a complete walkthrough on how to build an AI chatbot with Natural Language Processing (NLP) in Python. It covers various aspects of NLP, including intent recognition and handling.
Remember, building an intent-based chatbot can be complex and requires a good understanding of NLP and machine learning concepts. But with some patience and practice, it's definitely achievable. Good luck with your studies! 😊
Also for the error you've encountered in your code I think I know where is the problem, but I'll tell you about it in another message because I don't have Nitro🥲.
The error you're encountering, ValueError: setting an array element with a sequence, typically occurs when you're trying to create a NumPy array from a list of sequences that aren't all the same length. In your case, it seems like the issue is with the training variable.
Here's a possible solution:
Instead of using np.array(training), you can convert training into an array of objects:
training = np.array(training, dtype=object)
This tells NumPy to create an array of Python objects, which can be of varying lengths or types.
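To make the cause concrete, here is a small sketch with made-up data. The names `bag` and `output_row` are assumptions about what each training row holds, guessed from the `(5, 2) + inhomogeneous part` shape in the traceback. Each row pairs two lists of different lengths, so NumPy cannot build a regular array from it. Besides `dtype=object`, keeping inputs and targets in separate lists avoids the ragged array entirely.

```python
import numpy as np

# Hypothetical rows: [bag_of_words, one_hot_label] with different lengths (3 vs 2).
training = [
    [[0, 1, 0], [1, 0]],
    [[1, 1, 0], [0, 1]],
]

# Option 1: object array, then pull the two columns apart into regular arrays.
training_obj = np.array(training, dtype=object)   # shape (2, 2), each cell holds a list
train_x = np.array(list(training_obj[:, 0]))      # shape (2, 3)
train_y = np.array(list(training_obj[:, 1]))      # shape (2, 2)

# Option 2: never build the ragged array in the first place.
train_x2 = np.array([bag for bag, _ in training])
train_y2 = np.array([output_row for _, output_row in training])

print(train_x.shape, train_y.shape)
```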
This is what I can do for now, I hope I've been helpful! 😊
As for recent resources on chatbot development, here are some that might be helpful:
-
"Future directions for chatbot research: an interdisciplinary research agenda (https://link.springer.com/article/10.1007/s00607-021-01016-7)": This article provides a research agenda for chatbot development, discussing future directions and challenges in the field.
-
"7 Chatbot Trends for 2022 & Beyond (https://manychat.com/blog/chatbot-trends/)": This article discusses the latest trends in chatbot development, which could be useful for your project.
-
"How To Develop a Chatbot From Scratch (https://chatbotsmagazine.com/how-to-develop-a-chatbot-from-scratch-62bed1adab8c)": This guide provides a comprehensive walkthrough on how to build a chatbot from scratch.
-
"9 Major Chatbot Trends for 2021 (https://botcore.ai/blog/chatbot-trends-2021/)": Although this article is from 2021, it discusses some major trends in chatbot development that are still relevant today.
Good luck with your project! (Also if you need any more help feel free to ask also in DMs) 😊
Now Igtg, cya👋 (I'll be back in 1 hour)
ty
I am new at this and I want to know how to use LangChain to create a simple application.
I was using the openai library, which was fairly easy, but for a scalable project I think LangChain would be great. BUT I find LangChain very confusing; for example, I can't find the equivalent of
role: system context in LangChain,
i.e. how to pass a system prompt that tells the model what it is,
after which the user/assistant chat starts, right?
I need to find this concept in LangChain.
I have found that with ConversationBufferMemory and the summarizer buffer I can manage a lot, but what I want right now is to give system context to the LLM. Kindly help.
You are welcome. I'm always happy to help
Hey there!👋 It's great to hear that you're diving into data science and AI. Langchain is indeed a powerful tool for building scalable AI applications.
To provide system context in Langchain, you can use the ConversationBufferMemory as you mentioned. This is where you can store the system context or any other information that you want to persist across different turns of the conversation.
Here's a simple example:
from langchain import ConversationBufferMemory
# Initialize the memory buffer
memory = ConversationBufferMemory()
# Add system context
memory.add_system_context("I am an AI developed by Langchain.")
# Now you can pass this memory to the LLM
llm = LangchainLLM(memory=memory)
In this example, the system context "I am an AI developed by Langchain." is stored in the memory and can be accessed by the LLM during the conversation.
Remember, the system context is just like any other piece of information in the conversation. It's up to you how you want to use it in your application.
I hope this helps! If you have any more questions, feel free to ask. Happy coding! 🚀
AttributeError: 'ConversationBufferMemory' object has no attribute 'add_system_context'
I apologize for the error I made😓 . It seems there was a misunderstanding in my previous message. The ConversationBufferMemory class does not have a method called add_system_context.
In Langchain, the system context is typically set at the beginning of the conversation and is passed to the model as part of the conversation history. Here's a simplified example:
from langchain import ConversationBufferMemory, LangchainLLM
# Initialize the memory buffer
memory = ConversationBufferMemory()
# Add system context
memory.add_message("system", "I am an AI developed by Langchain.")
# Now you can pass this memory to the LLM
llm = LangchainLLM(memory=memory)
In this example, the system context "I am an AI developed by Langchain." is added as a system message in the conversation history. The LLM can then use this context during the conversation.
I hope this clears up the confusion. Let me know if you have any other questions! Happy coding! 🚀
AttributeError: 'ConversationBufferMemory' object has no attribute 'add_message'
Oh, I'm really disappointed in myself... I don't know why they work on my machine. Try this last version, I'm really sorry for the confusion:
from langchain import ConversationBufferMemory, LangchainLLM
# Initialize the memory buffer
memory = ConversationBufferMemory()
# Add system context
memory.add_turn({"role": "system", "content": "I am an AI developed by Langchain."})
# Now you can pass this memory to the LLM
llm = LangchainLLM(memory=memory)
If you're having any challenge in using the library let me know.
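Note for later readers: I can't confirm that any of the `ConversationBufferMemory` methods used in the three snippets above exist, and the first two raised `AttributeError`. The underlying idea is simple, though: the system prompt is just the first message in the list sent to the model, followed by the user/assistant turns. Below is a library-free sketch of that idea; `build_messages` and `max_turns` are hypothetical names. In LangChain itself, the usual route as far as I know is a chat prompt template with a `("system", ...)` entry (for example via `ChatPromptTemplate.from_messages`), so check the current docs for the exact imports.

```python
def build_messages(system_prompt, history, user_input, max_turns=10):
    """Assemble a chat payload: system message first, then recent turns, then the new input.

    `history` is a list of {"role": "user" | "assistant", "content": str} dicts.
    """
    recent = history[-max_turns:]  # crude buffer memory: keep only the last few turns
    return (
        [{"role": "system", "content": system_prompt}]
        + recent
        + [{"role": "user", "content": user_input}]
    )

history = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
messages = build_messages(
    "You are a helpful cooking assistant.", history, "Suggest a quick dinner."
)
for m in messages:
    print(m["role"], ":", m["content"])
```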
any tools or libraries of interest for voice recognition?
!mute 857925971004882975 "1 day" Using ChatGPT to answer questions is not allowed. If you are not using a generative AI to produce content, as you allege, stop deliberately writing in a way that resembles generative AI writing.
:incoming_envelope: :ok_hand: applied timeout to @blazing oxide until <t:1699109861:f> (1 day).
how to transcribe audio to text arduino for chat gpt. then chat gpt replies in audio?
I think you just are asking: How to do Text to Speech, and Speech to Text? Doesn't matter if it's chatgpt or not, right?
i was just about to flag this, thank you
no like i talk to chat gpt in speech like siri
and it replies to me as audio
i know what i need but i need code
i just need an idea
Hi Guys, I've been learning pre-calculus from james stewart: "Precalculus: Mathematics for Calculus".
here is a review from this guy:
https://www.youtube.com/watch?v=N-JXs_n2JhI
I'm wondering whether I should go through "all" of it; the guy said that there is much more math in this book than in a college course. My main goal is to learn calculus for machine learning, so I wonder whether it's worth spending years just going through pre-calculus as a 16 year old. Or how should I do it?
All of what? All of the chapters?
yeah all of the content
You're still in high school right? Maybe ask the pre-calculus teacher for a copy of their syllabus, so you can see what information is taught?
In Ireland its different afaik, but I think we have a similar structure, we are going thru these kind of books
This is 5th year, which is the book that we are going thru atm
This is the next version of that book
This is the last one's table of contents
This is the first book(the green) one's content
This is literally the math that is required in leaving cert
I don't know if the content is enough for pre-calc
I know the table of contents might not be enough for you to assume, but what do you think?
or perhaps I can take one of these courses? https://www.khanacademy.org/math/precalculus
The Precalculus course covers complex numbers; composite functions; trigonometric functions; vectors; matrices; conic sections; and probability and combinatorics. It also has two optional units on series and limits and continuity. Khan Academy's Precalculus course is built to deliver a comprehensive, illuminating, engaging, and Common Core align...
Because I've seen a lot of people recommending khan academy for maths for machine learning
I can't commit to helping you in the thread so I'll answer here. You can use small multiples or colour coding. Seaborn and plotly let you do both quite easily.
i am already doing that
but i don't think they would be the best for multilabel as it might have values that have more than one label
Since BillyBobby is not active anymore can you please take a look at my question, if you wouldn't mind 🙂
I'm not from the US, we follow a different system of math. I'm abroad rn actually, can you ping me in a few days? I'll write out the topics of my undergrad math course. That was the bare minimum of what you need.
sure that's fine, thanks. I've sent the topics that is included in Ireland in my school for someone who's pursuing higher level maths.
It looks similar to what I had
Does it? I asked in the Mathematics server and they said that I should go with them because, judging by the table of contents, they are very good books, and since I'm learning this in school it's even better
hi, I'm not sure this is the right channel. I'm working on a small project utilizing some image tag/recognition models, gpt and flask and quite inexperienced
loading the models takes a few minutes. every time I make a change in some file, flask is reloaded (which is good) and with it the models.
currently, I'm calling load_Models() directly before run app line.
what is common practice to keep such models "loaded" and only reload the rest?
If the models are part of the same python process as the flask app that you're editing, there is no way to prevent that from happening.
An alternative design is to have the models in a separate flask app that only has the absolute minimal functionality that needs direct access to the models, and then have the main app interact with that other app via requests.
It's not just that "flask" is reloading. It's the whole python process.
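A rough sketch of that two-service layout, for illustration only. To keep it self-contained it uses the standard library's `http.server` instead of Flask and runs the model service in a background thread; in practice these would be two separate processes (the model service started once without the reloader, the main app restarted as often as you like). `load_models`, the `/predict` path, and the fake "tagger" are all hypothetical stand-ins.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def load_models():
    """Stand-in for the slow model load; returns a fake 'tagger'."""
    return {"tagger": lambda text: sorted(set(text.lower().split()))}

MODELS = load_models()  # happens once, when the model service starts

class ModelHandler(BaseHTTPRequestHandler):
    """Minimal model service: POST JSON {"text": ...}, get JSON {"tags": [...]}."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"tags": MODELS["tagger"](payload["text"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet

def start_model_service(port=0):
    """Start the service on a free local port (port=0 lets the OS choose)."""
    server = HTTPServer(("127.0.0.1", port), ModelHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def ask_model_service(port, text):
    """What the main app would do instead of calling the models directly."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("POST", "/predict", body=json.dumps({"text": text}),
                 headers={"Content-Type": "application/json"})
    result = json.loads(conn.getresponse().read())
    conn.close()
    return result

server = start_model_service()
print(ask_model_service(server.server_address[1], "a cat on a sofa"))
server.shutdown()
```

The main app only ever talks to `ask_model_service`, so reloading it never touches the loaded models.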
oh no, that does not sound good...
so I'd have two flask projects/servers running, communicating with each other?
i suggest maybe just picking a starting point and taking your time
iirc i was about 16 when i did pre-calc in school. do you not have access to a pre-calc course there?
i believe the sequence was something like trigonometry and basic proofs in 10th grade, then logarithms, derivatives, and a cursory look at antiderivatives + integrals in 11th
we probably did other things in 10th too but i can't remember what they were
what i do remember is that we took it pretty slow
there was no rushing through the material. i had very good teachers overall, and we spent a while building things up carefully and developing intuition
for example, i remember one day the teacher had us step through angles and measure the sides of the resulting triangles inscribed in a unit circle
and then she had us plot the ratios, and magically the sine and cosine functions appeared on our graph paper
i loved that, it was the first time in math i felt like there was actually a purpose to what we were doing and things started coming together
likewise the next year, the teacher gave everyone some construction paper and tape and assigned us all various dimensions of boxes to build. then we computed the volumes of all our boxes and showed that there was a unique optimum volume, which i believe we then derived. that was also a fun and memorable demo
the point being: it's worth taking your time with these things. they will form the foundation for everything to follow
I've sent you the two books that we are going to go thru and I've sent the table of contents if you scroll up a bit.
would you think that that would be enough?
.
that's a pretty good loadout
precalc is basically algebra and trig, both of which you cover there. differential and integral calculus are proper calculus, so what you learn after precalc
Do you think that it would be enough to prepare myself for calculus?
that already includes calculus lol
I mean yeah 😄
so I guess it should be enough then 😄
yep
guys, can someone guide me on a Python learning roadmap?
do you guys know the best course to emphasize numpy and pandas?
Not sure about numpy, but I recommend the kaggle pandas tutorial
is this?
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
It's one of the items here ^
ah i figured it out, thanks a bunch!
I've been thinking about this a bit: how to teach data manipulation while showing plain Python vs Pandas vs SQL side by side.
I think a lot of people don't connect sql or pandas operations with the equivalent; how you'd do it in a simple Python loop. When I teach people SQL (or Pandas), I express things in terms of the primitives (how you'd do the same thing in code).
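As a concrete illustration of the "primitives" idea (a sketch with made-up data, not the actual teaching material): the same per-customer total written as a plain loop and as a SQL `GROUP BY`, with the pandas one-liner noted in a comment.

```python
import sqlite3
from collections import defaultdict

rows = [("alice", "books", 12.0), ("bob", "books", 8.0), ("alice", "games", 30.0)]

# Plain Python: the primitive behind GROUP BY is a loop that accumulates per key.
totals = defaultdict(float)
for customer, category, amount in rows:
    totals[customer] += amount

# SQL: the same aggregation, run here on an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, category TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
sql_totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
conn.close()

# pandas equivalent (not executed here): df.groupby("customer")["amount"].sum()
print(dict(totals), sql_totals)
```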
Hey guys, I'm trying to color my points on this regplot by their y value. for example i want all where dropout == 1 to be orange and where 0 to be blue. but I can't seem to figure out how
```py
g = sns.JointGrid(height=6, space=0)
x, y = df["Application_order"], df["Dropout"]
sns.regplot(x=x, y=y, scatter_kws=dict(s=1), x_jitter=.3, y_jitter=.125, logistic=True, truncate=False, ax=g.ax_joint)
sns.histplot(x=x, discrete=True, stat='frequency', ax=g.ax_marg_x)
g.ax_marg_y.remove()
g.ax_marg_x.set_facecolor('white')
g.ax_joint.set_xlabel('Application Order')
plt.title('Dropout Likelihood Across Application Orders')```
Any help? I can manage to make it work with an lmplot but then I don't get my marginal histogram which i really want to keep...
Draw the scatter directly, and set "scatter=False" in the regplot. I think you'll have to add the jitter manually to the df, like: df_x_jittered = df["Application_order"] + np.random.uniform(-0.3, 0.3, size=len(df))
Something like: ```py
jittered_x = df["Application_order"] + np.random.uniform(-0.3, 0.3, size=len(df))
jittered_y = df["Dropout"] + np.random.uniform(-0.125, 0.125, size=len(df))
sns.scatterplot(x=jittered_x, y=jittered_y, hue=df["Dropout"] > 0, palette=['blue', 'orange'], legend=False, ax=g.ax_joint, s=1)
```
Thanks for the answer! The colors are now working, however setting scatter=False seems to mess up regplot...
I'm guessing I should just draw the line manually instead of using regplot?
That’s weird, I did a test and my regplot worked
```py
g = sns.JointGrid(height=6, space=0)
x, y = df["Application_order"], df["Dropout"]
sns.regplot(x=x, y=y, scatter_kws=dict(s=1), scatter=False, x_jitter=.3, y_jitter=.125, logistic=True, truncate=False, ax=g.ax_joint)
jittered_x = df["Application_order"] + np.random.uniform(-0.3, 0.3, size=len(df))
jittered_y = df["Dropout"] + np.random.uniform(-0.125, 0.125, size=len(df))
sns.scatterplot(x=jittered_x, y=jittered_y, hue=df["Dropout"] > 0, palette=['blue', 'orange'], legend=False, ax=g.ax_joint, s=1)
sns.histplot(x=x, discrete=True, stat='frequency', ax=g.ax_marg_x)
g.ax_marg_y.remove()
g.ax_marg_x.set_facecolor('white')
g.ax_joint.set_xlabel('Application Order')
plt.title('Dropout Likelihood Across Application Orders')```
Try regplot -after- scatter?
When I get students to do work for us I give them a pandas crash course and I always put it next to SQL
I always tell them the SQL idioms mostly carry over and I have a bunch of SQL queries and a toy dataset I have where I show the equivalent Pandas code
Hi, I don't know why I get module_wrapper when I grab the summary of this code:
`model = Sequential()
vgg16 = tf.keras.applications.vgg16.VGG16(
include_top=False,
input_shape=(224, 224, 3),
pooling='avg',
classes=91,
weights='imagenet'
)
for layer in vgg16.layers:
layer.trainable = False
model.add(vgg16)
model.add(normalizing)
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(91, activation='softmax'))`
This is what shows up when I train:
So I'm getting some errors and can't find out why it isn't printing that last statement
```py
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
import tensorflow as tf
from sklearn import preprocessing
df = pd.read_csv('Google_Stock_Price.csv')
df['Date'] = pd.to_datetime(df['Date'],format='mixed')
df['Time'] = df.apply(lambda row: len(df) - row.name, axis=1)
df['CloseFuture'] = df['Close'].shift(30)
df_test = df[:185]
df_train = df[185:]
X = np.array(df_train['Time'])
X = X.reshape(-1 ,1)
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)
y = np.array(df_train['CloseFuture'])
model = tf.keras.Sequential([
tf.keras.layers.Dense(10, activation='sigmoid', input_shape=(1,)),
tf.keras.layers.Dense(10, activation='sigmoid'),
tf.keras.layers.Dense(10, activation='relu'),
tf.keras.layers.Dense(1)])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),loss='mse',metrics=['mae'])
model.fit(X_scaled, y, epochs = 30, batch_size = 10)
ennuste_train = model.predict(X_scaled)
df_train['Ennuste'] = ennuste_train
X_test = np.array(df_test['Time'])
X_test = X_test.reshape(-1,1)
X_testscaled = scaler.transform(X_test)
ennuste_test = model.predict(X_testscaled)
df_test['Ennuste'] = ennuste_test
plt.scatter(df['Date'].values, df['Close'].values, color='black')
plt.plot((df_train['Date'] + pd.DateOffset(days=30)).values, df_train['Ennuste'].values, color='blue')
plt.plot((df_test['Date'] + pd.DateOffset(days=30)).values, df_test['Ennuste'].values, color='red')
plt.show()
df_validation = df.test.dropna()
print("Predicted data average error %.f" % mean_absolute_error(df_validation['CloseFuture'], df_validation['Ennuste']))```
D:\Archives\Coder\Python\Roboticas\Tehtävä10.py:31: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_train['Ennuste'] = ennuste_train
6/6 [==============================] - 0s 601us/step
D:\Archives\Coder\Python\Roboticas\Tehtävä10.py:37: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df_test['Ennuste'] = ennuste_test
Traceback (most recent call last):
File "D:\Archives\Coder\Python\Roboticas\Tehtävä10.py", line 45, in <module>
df_validation = df.test.dropna()
^^^^^^^
File "C:\Users\Jani\AppData\Local\Programs\Python\Python311\Lib\site-packages\pandas\core\generic.py", line 6204, in __getattr__
return object.__getattribute__(self, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DataFrame' object has no attribute 'test'
help appreciated..
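Reading the traceback: the last line fails because `df.test` asks pandas for a column or attribute called `test` on `df`, while the variable created earlier is `df_test` (underscore, not dot). The `SettingWithCopyWarning`s come from assigning a new column to a slice of `df`; taking an explicit `.copy()` of each slice is one common way to avoid them. A small sketch with made-up numbers (the data and split point are assumptions, only the column names follow the code above):

```python
import pandas as pd

df = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})
df["CloseFuture"] = df["Close"].shift(2)  # the first two rows become NaN

# Explicit copies, so adding a column later does not trigger SettingWithCopyWarning.
df_test = df[:3].copy()
df_train = df[3:].copy()

df_test["Ennuste"] = [9.5, 10.5, 11.5]  # stand-in for the model.predict(...) output

# df_test (the variable), not df.test (an attribute lookup on df).
df_validation = df_test.dropna()
print(df_validation)
```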
is there an NLP package in python or something that categorizes sentences based on the content of the sentence?
how would I implement q learning on a racing game I made?
Nope, I haven't used it in this project. There shouldn't be any string data for the AI to emulate, it's just column names for the dataframe.. though I am new to machine learning and still discovering stuff, so there could be some inaccuracy..
hey so is this the place to ask about apis
i need an api for llm prompts where u can automate the responses u get from it in a way
Unclear what you’re asking, but you can start from OpenAI’s API and do whatever you want with the response
I keep getting a serialisation error with my Pyspark function, anyone who is able to provide some insight would be greatly appreciated
https://discord.com/channels/267624335836053506/1170499876660973679
hello,
Am working on a project and there seem to be many null values in the dataset
would you advise me to go with fillna or dropna?
also If I use fillna and fill in avg random values wouldn't it affect the dataset?
you can choose to drop, or impute so long as you report your decision
typically any imputation method is effective when <5% of the data is missing, mean imputation drops out at around 10% missing, regression drops out at 15% but multiple imputation and ML imputation methods are more accurate even when larger proportions of the data are missing
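A quick sketch of the two options on a toy frame (the column names and values are made up). Checking the share of missing values first helps decide which way to go. On the "wouldn't it affect the dataset?" question: yes, filling with a single value such as the mean or median shrinks the column's variance, which is part of why the decision should be reported.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "income": [3000.0, 4200.0, np.nan, 5000.0],
})

# Step 1: measure how much is missing per column before choosing a strategy.
missing_share = df.isna().mean()

# Option A: drop every row that has at least one missing value.
dropped = df.dropna()

# Option B: simple imputation with the column median (the mean works the same way).
imputed = df.fillna(df.median())

print(missing_share)
print(len(df), "rows ->", len(dropped), "after dropna")
print(imputed)
```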
I have just been reading some papers to get some inspiration for a model I want to design. My task is to design a model that handles data of which a large part (~90%) is unlabeled, and my model should classify the labels. I read a paper (https://arxiv.org/abs/1911.04252) that starts by training a relatively small model on the 10% labeled data, and then classifying maybe 10% of the unlabeled data, so that we have 10% true labels and 10% predicted labels. Then a new model, slightly bigger than the first, is instantiated and trained on the 20% labeled data. That model is used to label another 10%, so the next model has 30% labeled data, and so on.
My problem is, why would this improve performance at all. In the end all labeled data is based on the information from the initial 10%, it doesn't learn anything "new" from the predicted labels does it?
We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant. Noisy Student Training achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. On robustness test sets, it improves ImageNet-A top-1...
The paper discusses a modification of the algorithm explained above, but I was wondering why the original idea works at all.
I guess the idea is called "self-training", but the confusion still remains. If the initial classifier predicts correctly, the data point probably doesn't bring much new information, whereas if it is predicted incorrectly, the newer iteration model will be learning incorrectly.
Have you looked at general methods from semi supervised learning?
Are you able to label post-hoc or not at all?
What do you mean with this?
What is canonically done is having some model output predictions and uncertainties and you label the ones with the highest uncertainties
This only makes sense if you can still label ofc
Highest certainties right?
The ones you're most uncertain about
Or do you mean, manually labeling?
I thought you meant using the labels that the model was most uncertain about for the next iteration
I get it now
And I think that would be quite difficult
Imagine you can still label the cats and dogs in your dataset but you don't want to do it upfront. Your model is also able to output uncalibrated probabilities. You let it do predictions and afterwards you label the ones with a ton of uncertainty
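In code, the selection step described here could look roughly like this for a binary cat/dog model. `most_uncertain` is a hypothetical helper and "closest to 0.5" is just one simple uncertainty measure; the probabilities are made up.

```python
def most_uncertain(probabilities, k):
    """Return the indices of the k predictions closest to 0.5 (binary case)."""
    ranked = sorted(
        range(len(probabilities)), key=lambda i: abs(probabilities[i] - 0.5)
    )
    return ranked[:k]

# Hypothetical model outputs: P(dog) for six unlabeled images.
p_dog = [0.98, 0.52, 0.10, 0.45, 0.80, 0.03]
to_label = most_uncertain(p_dog, k=2)
print(to_label)  # indices of the samples to hand to a human annotator
```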
The data is thousands or tens of thousands tree point clouds from many different forests
The labeling is mostly done by knowing what trees are in which forest, and labeling entire sections
So post-hoc would be difficult and requires expert knowledge, which I don't have
I think this is an ideal use case for this though, that is if you can get a hold of the experts to do it
Yeah, I understand why it would be great if we can label post-hoc, but even then the labels are mostly from known forests from looking at it in the field.
So not from looking at the data itself, which are point clouds (sometimes it is easy to see, sometimes not)
Iirc this is what one of my lectures was about cause my prof did this for energy load anomalies. I can send you the slides when I get back maybe there's something else of value to you there 🤷
Would be great. It seems that it even works without labeling post-hoc from some papers, but it seems a bit counterintuitive as the actual information you have is just the labeled data and some pseudo-labels.
In the OG paper they only take pseudo-labels from samples that the model is very certain about
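A toy sketch of self-training with a confidence threshold, to show how confidently pseudo-labeled points can still move the model. This is an illustration on 1-D data with a nearest-centroid classifier, not the method from the paper; every name and number here is an assumption.

```python
from statistics import mean

def fit_centroids(points, labels):
    """Return {label: mean of the points carrying that label}."""
    return {
        lab: mean(p for p, l in zip(points, labels) if l == lab)
        for lab in set(labels)
    }

def predict_with_confidence(centroids, x):
    """Nearest-centroid label; confidence = gap between the two smallest distances."""
    dists = sorted((abs(x - c), lab) for lab, c in centroids.items())
    (d_best, label), (d_second, _) = dists[0], dists[1]
    return label, d_second - d_best

def self_train(labeled_x, labeled_y, unlabeled_x, threshold, rounds=5):
    """Repeatedly add only confidently predicted points to the training set."""
    x, y, pool = list(labeled_x), list(labeled_y), list(unlabeled_x)
    for _ in range(rounds):
        centroids = fit_centroids(x, y)
        confident, rest = [], []
        for u in pool:
            label, conf = predict_with_confidence(centroids, u)
            (confident if conf >= threshold else rest).append((u, label))
        if not confident:
            break  # nothing left that the model is sure about
        for u, label in confident:
            x.append(u)
            y.append(label)
        pool = [u for u, _ in rest]
    return fit_centroids(x, y), pool

centroids, still_unlabeled = self_train(
    [0.0, 10.0], ["a", "b"], [1.0, 2.0, 9.0, 8.0, 5.0], threshold=4.0
)
print(centroids, still_unlabeled)
```

The class centres move from the two labeled points toward where the unlabeled data actually sits, and the ambiguous point in the middle never receives a pseudo-label. That is one intuition for why the unlabeled data adds something: it carries information about the shape of the input distribution even without true labels.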
A large part of his research is semi supervised learning in general so afterwards it might be a good idea to check out the papers from his lab 🙂
Appreciate the input, the post-hoc can be something to look into if that's possible at my company (basically need to convince one of my coworkers to do it 😛 )
Hey guys, I am pretty new to working with LLMs. I want to build an AI agent; how can I build it? Are there any useful videos you can share that go in depth on building AI agents? Your suggestions will help me, thanks.
hey you guys, I am interested in AI and machine learning, that's mainly why I learned Python
but as beginner-Junior developer, I don't know how could it be applied
Because I only saw text based simple programs
can you tell me how does it work?
I mean, what does one need to start making AI stuffs
First, you need to learn Python to a fairly good level... it's hard to do anything meaningful until you're able to build intermediate level projects, of any type. Focus on learning the fundamentals and practicing by writing code often. #python-discussion is a good place to ask for help.
Second, you should learn a bit of the theory and ideas behind AI. A nice place to start is: https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi. This is just a taste of the subject, there's a lot more to learn after this.
Third, math. Strong math skills, if you want to do anything more complicated than simple projects with existing models. Once you understand the theory, you need strong math skills to understand the details. See pinned messages for some reading material.
One place to start with using Ai/Ml is: https://keras.io/getting_started/
What part of the data science process is the one going from data collection to database insertion?
but Maths on its own theory or just working with problem solving and algorithms
I mean, I used some Trigonometry to make my game characters rotate, but It's not like a kind of hard Math
Wow 3Blue1Brown, idk if it's gonna be kinda hard
I'll watch the series
Both. It's a long journey of constant learning.
So do I need Calculus? I never tried to do that
Calculus is considered a fundamental college course for any STEM major, just like algebra is in high school.
You can certainly do stuff with AI/ML libraries without calculus, but it's just one part of becoming a complete engineer or scientist: understanding how everything works.
I get it, on the deep levels.
Any book recommendation that clearly establishes differences between data architect, data engineer, and data scientist?
I'm having a hard time organizing data intake vs. validation vs. database insertion in my code, I feel like having a higher-level conceptual understanding would help me out
Question. I am working with a pandas dataframe and the csv file date shows up with month and year only (Jan 2000). How can I separate the month and year and create their own columns?
if it's in datetime format use that, otherwise use the split function and pandas apply
@young granite I'll try that thank you
Guys, I'm confused about vectors and their magnitude. I'm reading a book (Data Science from Scratch: First Principles with Python), and there a vector's magnitude is defined as the square root of the sum of squares of the vector's values. However, I don't think that's necessarily correct. When we calculate a vector's magnitude we consider its components, which from my understanding are Δx and Δy (I'm talking about R^2 specifically); to get the magnitude we square them, add them together, and take the square root of the whole expression. When we have a vector [3,4] and assume it starts from the origin, then yes, we can sum the squares and take the square root, which gives 5. But what we are really doing is considering Δx and Δy, which, since we started from the origin, are just the coordinates (3,4) of the vector. When we don't start from the origin, we have to take the initial point and the terminal point, determine Δx (x2-x1) and Δy (y2-y1), and then calculate the magnitude. The method described in the book only works if the vector starts from the origin; if not, Δx and Δy will not be equal to the vector's coordinates. I think the distinction between a vector that starts from the origin and one that starts somewhere else is really important because it affects the magnitude of the vector. Perhaps I'm overcomplicating it, but I think the book should have been more specific here.
Does anyone have a good recommendation for whatever books, documents, or even pretty nice videos explaining the polynomial regression, SVR, and kernel?
I tried to split but it gives me this 0 [Jan 2000]. I want to separate the month and year into a datetime format but I am having a hard time doing that
The beginning 0 is an index position on the dataframe
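For the split route, here is a minimal sketch. It assumes the column is called `string` (a made-up name) and that every value looks like `Jan 2000`. On its own, `.str.split()` returns one list per row, which is why the output looks like `[Jan, 2000]`; passing `expand=True` spreads the pieces into separate columns instead.

```python
import pandas as pd

# 'string' is a hypothetical column name holding values like "Jan 2000"
df = pd.DataFrame({'string': ['Jan 2000', 'Feb 2001']})

# expand=True returns a DataFrame with one column per piece instead of a list per row
parts = df['string'].str.split(' ', expand=True)
df['month'] = parts[0]
df['year'] = parts[1].astype(int)  # the pieces are strings, so convert the year
print(df)
```

This keeps the month as text (`Jan`); if a month number is needed, parsing to a real datetime is probably the cleaner option.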
!e you can use to_datetime to turn the strings into proper datetime objects, then extract the properties like
```py
import pandas as pd
df = pd.DataFrame({'string': ['Jan 2020', 'Feb 2021']})
df['date'] = pd.to_datetime(df['string'])
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
print(df)
# optionally do not even store df['date'] as a column
# (same thing as before but without storing as a column):
# dates = pd.to_datetime(df['string'])
# df['month'] = dates.dt.month  [...]
```
@agile cobalt :white_check_mark: Your 3.12 eval job has completed with return code 0.
001 | /home/main.py:3: UserWarning: Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
002 | df['date'] = pd.to_datetime(df['string'])
003 | string date month year
004 | 0 Jan 2020 2020-01-01 1 2020
005 | 1 Feb 2021 2021-02-01 2 2021
the format for this case is "%b %Y", so ideally you should explicitly specify like
```py
df['date'] = pd.to_datetime(df['string'], format="%b %Y")
```
there's no difference because vectors canonically do not have a "location in space"
if you specify a vector simply as coordinates, it does not matter where a vector is located in space
the object you're describing requires extra parameters. a second vector, to define both the "tail" and "head" of the vector (more properly, an affine transformation of a vector). but if you then compute the corresponding vector as head - tail, you again get a proper vector with no location in space attached to it, which we can always represent as having its tail at the origin
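A small sketch of the head - tail point, using only the standard library. The points (1, 2) and (4, 6) are made up for illustration: the same arrow drawn from the origin or from somewhere else reduces to the same components, so the book's formula gives the same magnitude.

```python
import math

def magnitude(v):
    """Euclidean norm: square root of the sum of squared components."""
    return math.sqrt(sum(c * c for c in v))

def displacement(tail, head):
    """The vector pointing from tail to head (head - tail)."""
    return [h - t for t, h in zip(tail, head)]

at_origin = displacement([0, 0], [3, 4])   # tail at the origin
elsewhere = displacement([1, 2], [4, 6])   # same arrow, drawn starting at (1, 2)

print(at_origin, magnitude(at_origin))
print(elsewhere, magnitude(elsewhere))
```

Both lines print the components `[3, 4]` and the magnitude `5.0`; the "location" only matters before you subtract.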
That’s so cool
W for the artists
Does it basically hide another image in the least significant bits of the pixels?
So it basically hides tiny pixels in the actual art, and when the AI tries to use this art to train itself it thinks the art is something else
So if there’s a picture of a dog and it has that poison pixel in it, it makes the picture look like a cat instead of a dog
So it’s basically a deterrent against big AI tech companies (like the ones behind ChatGPT) using artists' art without their knowledge
This will be the arms race for next 20(+) years... counter-AI measures and counter-counter measures.
10 years from now, the first "AI Defense" majors...
15 years, interactive ai defense entertainment
All those tower defense games have prepared us well.
Yes
Scary
!e what you are
Has anyone heard of polars?
however when we don't start from the origin
in linear algebra, all vectors start at the origin. the notion of a vector as an arrow floating around in space is a convenience used in physics, but it has little to do with how it's actually modeled in linear algebra.
if you want to live in a world where the origin is "free floating", the magnitude is defined as the distance between the vector and its own origin, not strictly the "global" (0, 0) point.
but yes, your line of thinking is reasonable and it's a good question.
Yes. Are you just interested to know if people have heard of it?
Yeah. I saw two YouTube shorts about it and was like “this is weird. Is it something that’s up and coming or did YouTube hook onto something I didn’t notice?”
My impression is that Polars is on the rise, but I'm not inclined to switch to it.
I don’t know pandas yet. Maybe I should just get ahead with polars.
Maybe. Pandas is very established and integrated with the data science stack
it is in fact up and coming
i'd suggest starting with pandas. it's much more popular and it's better suited for interactive use. polars is lower-level in some respects, less flexible, and more verbose
The video I saw was showing polars to be 4x faster than pandas.
Alright. Thanks.
polars can be significantly faster than pandas on bigger datasets, yes. however pandas is basically instant on any operations of < 1 million rows, and in the 1 - 10 million range it will be fast enough. in the 10 - 100 million range it starts to feel pretty slow.
pandas is not particularly highly optimized, being more aimed at general-purpose use than high performance
(it actually is fairly well optimized for what it is, but the developers have a range of priorities, optimization being only one of them)
polars meanwhile is specifically designed to be high performance
the main advantage of polars over pandas is that polars has a "query engine" much like what you find in a database, that can compile and optimize what you ask it to do. whereas pandas just does whatever you tell it, no query optimization or compilation phase.
there are some interesting benchmark results here that compare Polars and Pandas, if you're interested. This is a DuckDB publication (I'm not shilling...), but you can ignore everything else and just compare the Polars vs Pandas lines. Full blog post: https://duckdb.org/2023/11/03/db-benchmark-update.html
I'd always recommend people get started with Pandas, it integrates best with the data science stack indeed
To me, aside from the speed aspect, Polars has much better syntax and many constructs like over and groupby_dynamic which make it significantly less verbose than Pandas.
I'll always make the argument I've used Spark, SQL, Dplyr, data.table, Polars, Pandas, ... and honestly Pandas has always been the one that makes me want to pull my hair out. Even if it were slower I'd prefer Polars' syntax.
Does anyone know how to easily debug where in a chain of computations dask is failing? It says it failed to infer types on mul but with the amount of multiplications we have that means nothing. The tracebacks contain no call site information and the code ran fine in pandas.
Just spotted this in the wild. A Taipy. https://www.taipy.io/ looks interesting for data scientists. But I don't know how well this compares to streamlit
i don't have an answer, but maybe can you reproduce on a smaller dataset or a subset of the overall computations?
Unfortunately I can't make a subset as that requires loading the original file
you can't create a small sample from the original file for debugging purposes?
try specifying the dtypes explicitly when reading your data (if you are working with a schemaless format like CSV)?
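To illustrate the explicit-dtypes suggestion, here is a minimal sketch with pandas. The CSV text and the column names `zip` and `amount` are invented for the example. As far as I know, `dask.dataframe.read_csv` takes the same `dtype` keyword.

```python
import io
import pandas as pd

# Made-up CSV for illustration; 'zip' and 'amount' are hypothetical column names
csv_text = "zip,amount\n01234,10\n98765,2.5\n"

inferred = pd.read_csv(io.StringIO(csv_text))
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"zip": str, "amount": "float64"})

print(inferred.dtypes)  # 'zip' is guessed to be an integer, dropping the leading zero
print(explicit.dtypes)  # dtypes are whatever we asked for
```

With inferred types the leading zero in `01234` is silently lost; with explicit types the column stays text, which is the kind of surprise that pinning dtypes is meant to avoid.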
I haven't found a way to do partial reads from parquet files unfortunately, but even if I did I'd need to select by certain criteria to perform all tests which would likely require loading the majority of the data anyway
https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.read_parquet.html you can specify n_rows and instantly turn it into a pandas dataframe with to_pandas() assuming you're on Pandas 2.X
Well yeah but then I'd need to loop over all rows and discard them if they don't fit a given criteria and if the resulting dataset is too small I'd need to filter on different criteria to get a good dataset
And with over 1.5m rows I'm not sure if that's feasible
In that case you can scan_parquet and apply all the filters you want using polars, collect(streaming=True) and then to_pandas()
scan_parquet is lazily executed
https://pola-rs.github.io/polars/py-polars/html/reference/lazyframe/api/polars.LazyFrame.collect.html (streaming results is still in alpha)
Does it allow for groupby queries or only simple filtering?
can someone help me add my flask app to github? I have to add a .gitignore file and config.json as well
@lapis sequoia #web-development
Good question! You can definitely groupby on a LazyFrame but I don't know exactly what queries you can do in streaming mode
You might not even need streaming mode tbf
1.5m rows isn't that much if your machine has like 8-16GB ram, considering Polars' memory footprint is substantially smaller to begin with
I've tried both pandas and dask so far but both can't allocate enough memory (dask says I'd need about 80GB)
I'm this room's #1 Polars shill so obviously take what I'm saying with a grain of salt: Pandas' memory footprint is so large, especially pre 2.X that if you move it to Polars and drop Dask you might be fine already
Polars does a lot of multithreading already to begin with. Dask adds so much complexity and opaqueness.
Unfortunately it's management who told me to use dask over pandas/polars so I'm afraid my hands are mostly tied
I remember writing Numba + Numpy + Dask stuff that released the GIL and the code would silently fail without errors if 1 thread or so ran into trouble.
I wouldn't wish this onto my worst enemies
Trust me I'd switch to polars if I could, I've had to abuse some #esoteric-python tech to patch dask to even be compatible with our old code
Have you at least upgraded your Pandas version?
On the latest version compatible with py3.11.5 afaik
You definitely need to check if it starts with a 2
2.1.1 it looks like
If mgmt is set on using Pandas, which is fine if you have tons of existing code, I'd still consider doing the bulk of the preprocessing / subsetting in Polars before moving to Pandas.
That's what I do, I have some legacy Pandas code that works fine. There's literally no sense in refactoring it. I keep it as is, do some Polars stuff beforehand and call df.to_pandas()
Could that work for you? (you can do the same in DuckDB, it doesn't need to be Polars)
I found the RMSE and the average of the absolute value of the residuals for three regression models. I also found the mean and the standard deviation of the response. I'm being asked to 1, determine if the models are accurate, 3, how I determined their accuracy, and 2, whether I compared predictions to other numbers.
- For me, accuracy is hard to quantify for regression models. I typically think of classification models when I hear 'accuracy'. However, I think that we can use RMSE as a measure to evaluate the models. Though I'm not sure how to put into words my findings.
- I compared the predictions to the actual values. That part was easy.
- Are the predictions accurate? Again, this notion of accuracy for a regression model is hard for me to conceptualize. I think I have the pieces of information needed to get to the answer; I'm just not sure how to get there.
- Always compare against a baseline, even with classification. For instance sklearn offers DummyClassifier and DummyRegressor. I always prefer speaking of performance relative to other things.
- GJ!
- It's the same point as 1.
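A minimal sketch of the baseline comparison from point 1, on synthetic data that is purely illustrative (the linear relationship, noise level and split sizes are all assumptions). `DummyRegressor(strategy="mean")` always predicts the training mean, so its RMSE is the number a real model has to beat.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data purely for illustration: y depends linearly on X plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

def rmse(model):
    """Fit on the training split and return RMSE on the test split."""
    model.fit(X_train, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

baseline_rmse = rmse(DummyRegressor(strategy="mean"))  # always predicts the training mean
model_rmse = rmse(LinearRegression())

print(f"baseline RMSE: {baseline_rmse:.2f}")
print(f"model RMSE:    {model_rmse:.2f}")
```

Reporting the model's RMSE next to the baseline's says far more than either number on its own.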
Accuracy also depends on the business problem. Being 3 % off of sales numbers is huge if you're walmart compared to a corner store. Being 3 % off of detecting a life threatening disease is worse than either of those.
Finally, under or overshooting isn't as bad. If you're doing sales forecasting there's typically a cost for overstocking and understocking and it's asymmetric meaning that the same RMSE (say your predictions were always respectively 10 above or 10 under the target) can have different "business" costs. Same goes for detecting a life threatening disease, sending a few more people to a second screening isn't as bad as telling them they flat out have nothing.
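The asymmetric-cost point can be sketched in a few lines. The per-unit costs below (1 for overstocking, 5 for understocking) and the sales numbers are invented; the point is only that two sets of predictions with the same RMSE can carry very different business costs.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def business_cost(actual, predicted, over_cost=1.0, under_cost=5.0):
    """Charge over_cost per unit overstocked and under_cost per unit understocked.
    Both rates are invented for illustration."""
    total = 0.0
    for a, p in zip(actual, predicted):
        error = p - a
        total += over_cost * error if error > 0 else under_cost * -error
    return total

actual = [100, 120, 90]
always_over = [a + 10 for a in actual]    # predictions 10 above the target
always_under = [a - 10 for a in actual]   # predictions 10 below the target

print(rmse(actual, always_over), business_cost(actual, always_over))
print(rmse(actual, always_under), business_cost(actual, always_under))
```

Both prediction sets have an RMSE of 10, but under these made-up rates the undershooting one costs five times as much.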
i'm not saying to stop using dask, i'm just saying to write some script that takes an extract from your data, just so you have a smaller set to debug with
that or generate one randomly
Hi, does anyone here know about a really good research paper where they talk about cGANs and using that for age progression and regression? Feeding an image of a person to the cGAN and then generating an image of the person but younger/older.
I want to use the paper as a guide so I can build one myself.
"accuracy" in this sense just means "correctness of predictions". in classification we have a very natural way to compute that: the "0-1" loss (1 if wrong, 0 if correct), whose complement (the fraction of correct predictions) is usually just called "accuracy". for continuous labels, you're right that we have to use something else. RMSE is a great choice overall.
Are the predictions accurate?
think about what RMSE is: it's the square root of the mean squared error of the model predictions. so are the predictions accurate? you tell me: look at the thing you computed. if i asked you the same about a classification model, it would be the same question: do the model predictions match the labels in the data? if not, what's the error? if we chose a sensible error/loss function, one would hope that lower loss means better predictions.
Say the RMSE was 56000. The mean of the response was 180000. The standard deviation of the response was 80000. Is the model accurate because the RMSE is lower than the standard deviation?
accuracy is not a yes-no thing
that's true for classification as well
You can't make any claims about accuracy without benchmarks and baselines imo
that ☝️
accuracy is only meaningful when compared to other accuracies
or when compared to some benchmark as required for your business
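One way to turn the numbers from the question into a relative statement: a model that always predicts the mean of the response has an RMSE roughly equal to the response's standard deviation, so the standard deviation can act as a crude baseline. The sketch below assumes the RMSE and the standard deviation were computed on the same data, which the question doesn't confirm.

```python
rmse = 56_000
response_std = 80_000   # roughly the RMSE of always predicting the mean

ratio = rmse / response_std       # model error relative to the mean-only baseline
approx_r2 = 1 - ratio ** 2        # share of variance explained, under the assumption above

print(f"RMSE is {ratio:.0%} of the baseline error; R^2 is roughly {approx_r2:.2f}")
```

So the model cuts the error to about 70% of the mean-only baseline. Whether that is "accurate" still depends on what the predictions are for.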
point 1 is the key here and this is how to "catch" people 😄
the real question is, is it accurate enough for whatever you're trying to accomplish?
I think I'm just hung up on the phrasing of the question.
it's either badly phrased or a trick
I'd compare them against each other and add a baseline
i think you're supposed to explain whether it's accurate enough for whatever the context of the question is
And then try to apply this
Think about asymmetric costs (by looking at predicted vs. actual plots) and qualitatively argue for the accuracy of your model compared to your business problem
Don't underestimate how qualitative data science is! 😄
fwiw i wouldn't worry about this on a homework assignment at this level. it's important in real world practice, but if they're not asking about it at all, you probably don't yet have good tools for reasoning about it
I think it's my coursework where I picked up this line of thinking in the first place
But not in an intro to DS course yes I'll give you that
Something that's increasingly causing miscommunication these days is that when people hear "LLM", a lot of people immediately think of instruction-tuned LLMs. but I've been using LLMs for a few years, and I still just think of them as a probability distribution for word co-occurrence.
How come GridSearchCV doesn't need to call the original function that made the time series data?
Ordinarily I would pass the parameters to that same function. How is this possible?
I'm making an image recognition project for Science Fair and I'm using Teachable Machine for the processing. I'm trying to run it in Python, and the export guide has a code snippet to load a Keras model. The model it is trying to load is in a zip file in my downloads. Here is the code that is trying to load it:
model = load_model("keras_Model.h5", compile=False)
I'm getting an error saying no file or directory found at keras_Model.h5. Any ideas on how to get the code to find it?
Thanks!
it should probably be in the same directory as the file that's executing load_model.
But just go to the definition of the function load_model to check exactly what it does and which path it's using to load the file
yup.. but that's just an assumption, try to look at the code, where load_model is defined to see what's actually happening and in which path the file is supposed to be
Okay, it found the file, but now it's saying 'Unable to open file (file signature not found)'
Is it because it's trying to access a zip file?
Never mind it works!
Thank you so much for your help!
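For anyone hitting the same thing: if the export is still zipped, one option is to pull the model file out of the archive first and pass the extracted path to `load_model`. This is a sketch using only the standard library; `extract_member` is a made-up helper, and the archive name and folders in the commented usage are assumptions.

```python
import zipfile
from pathlib import Path

def extract_member(zip_path, member, dest_dir):
    """Extract one file from a zip archive and return its absolute path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        extracted = archive.extract(member, path=dest_dir)
    return Path(extracted).resolve()

# Hypothetical usage; the archive name and location are assumptions:
# model_path = extract_member(Path.home() / "Downloads" / "converted_keras.zip",
#                             "keras_Model.h5", Path(__file__).parent)
# model = load_model(str(model_path), compile=False)
```

Passing an absolute path also sidesteps the question of which directory the script happens to be run from.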
Hello, how should I approach learning how to do exploratory analysis? Any resource suggestions?
Hey
My favorite EDA resource is https://www.itl.nist.gov/div898/handbook/
Thanks, I'll check it out
Running into an issue where df[col] * some_float throws an error because it can't multiply a sequence by a non-int, but at the time of running the dtype of the column should be float, not str, and df[col].compute() even gives a float64 series as output
I'm just wondering how it works.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
It doesn't take the original function that made the time series.
So sorry, my question is so basic.
Has anyone had issues using Poetry and PyTorch 2.1?
For some reason Poetry seems unable to find the right packages even though they exist on PyPI and also in PyTorch's own index, which is the fallback.
And then as a second question, does anyone know if it is possible to update the CUDA version a CML docker container is using (using 11 atm, need 12)?
are you using dask? if you're asking about dataframes, the default assumption is that you're using pandas.
you could make your own docker image and install cml on it
there's probably a pytorch image with your desired cuda version, then you install cml on top
regarding this, if you're using this inside the docker container, you could also consider a solution that is more low tech than poetry, since containerizing more or less achieves the same effect as making venvs
Not everything is using docker though, if you are running locally for example, it's more convenient and easier to manage deps with poetry and the lockfile
ah you're doing both, ok
yeah
i will work on the docker stuff first, since that is currently breaking ci
https://cml.dev/doc/install#docker and here
should be able to use the pytorch image as a base and slap those cml install commands on top... in theory
from your chat history, this is when using dask right?
can you
- confirm how you are reading your dask dataframe?
- confirm how many parquet files you are reading?
- if you are reading more than one parquet file, confirm all the files' schemas are identical?
- confirm that you don't have an interactive python interpreter session that is capable of reading all the parquets with dask?
that error about multiplying a sequence almost sounds like trying to multiply a native python list
!e
```py
[1,2,3] * 5.1
```
@desert oar :x: Your 3.12 eval job has completed with return code 1.
001 | Traceback (most recent call last):
002 | File "/home/main.py", line 1, in <module>
003 | [1,2,3] * 5.1
004 | ~~~~~~~~^~~~~
005 | TypeError: can't multiply sequence by non-int of type 'float'
@spark nimbus ☝️
somehow you ended up with a native python list somewhere. maybe in an "object" dtype column.
Using .astype(float) ended up solving it
Turns out it got confused by the type remapping from a .replace call
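A small reproduction of that situation in plain pandas, as a sketch. The column contents are invented; the assumption is that the column ended up with `object` dtype holding strings. Multiplying such a column by a float fails element by element, and converting with `.astype(float)` first fixes it.

```python
import pandas as pd

# A hypothetical column that ended up with dtype 'object' holding strings
col = pd.Series(["1.5", "2.0"], dtype=object)

try:
    col * 2.0
except TypeError as exc:
    print("object column:", exc)

fixed = col.astype(float) * 2.0   # convert first, then multiply
print(fixed.dtype, fixed.tolist())
```

Checking `df[col].dtype` right before the failing operation is a quick way to spot this kind of remapping.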
I am trying to pass an image into a pretrained VQGAN model. The model takes floats as inputs/uses floats for its biases.
The original image is the one in the middle. The one on the left (the one where it is blue and is missing most of the pixels,) is what happens when I convert it to floats. The one where it looks a bit demonic is the one, when I just load it into a PyTorch tensor.
```python
# Import necessary libraries
import torch
from PIL import Image
import torchvision.transforms as transforms

# Read a PIL image
image = Image.open(segmentation_path)

# Define a transform to convert a PIL image to a Torch tensor
transform = transforms.Compose([
    transforms.PILToTensor()
])
# transform = transforms.PILToTensor()

# Convert the PIL image to a Torch tensor (channels moved last: H x W x C)
img_tensor = transform(image).permute(1,2,0)

# Print the converted Torch tensor
print(img_tensor)
```
My code for loading it.
In order to turn it into a float, I just do
```python
img_tensor.float()
```
We love CI
So trying to setup the action CI, we now use the pytorch image
but now setup-cml doesn't seem to want to work 😢
im trying to make driver drowsiness detection program using a pretrained model.
Here's the code:
import cv2
import os
import tensorflow as tf
from tensorflow.keras.models import load_model
import numpy as np
import pygame.mixer as mixer
mixer.init()
sound = mixer.Sound('alarm.wav')
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
model = load_model(os.path.join("models", "cnnCat2.h5"))
lbl = ['Close', 'Open']
cap = cv2.VideoCapture(0)
while True:
    ret, frame = cap.read()
    height, width = frame.shape[0:2]
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, minNeighbors=3, scaleFactor=1.2, minSize=(25, 25))
    eyes = eye_cascade.detectMultiScale(gray, minNeighbors=3, scaleFactor=1.2)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, pt1=(x, y), pt2=(x + w, y + h), color=(255, 0, 0), thickness=3)
    for (ex, ey, ew, eh) in eyes:
        cv2.rectangle(frame, pt1=(ex, ey), pt2=(ex + ew, ey + eh), color=(255, 0, 0), thickness=3)
        eye = frame[ey:ey + eh, ex:ex + ew]
        eye = cv2.resize(eye, (80, 80))
        eye = eye / 255
        eye = eye.reshape(80,80,3)
        eye = np.expand_dims(eye, axis=0)
        prediction = model.predict(eye)
        print(prediction)
I get this error when i run the code.
@river maple Should the image not be grayscale?
It seems to expect 1 channel, but you have 3
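A shape-only sketch of what a one-channel input could look like, using NumPy on a random stand-in for the eye crop. It assumes the model wants input shaped (batch, 80, 80, 1), which the thread suggests but does not confirm. In the posted code the simpler route is probably to crop from `gray` rather than `frame`.

```python
import numpy as np

# Stand-in for an eye crop already resized to 80x80 (random pixels, BGR order like OpenCV)
eye_bgr = np.random.default_rng(0).integers(0, 256, size=(80, 80, 3), dtype=np.uint8)

# Collapse the 3 colour channels into 1 using the usual luma weights (B, G, R order)
eye_gray = eye_bgr @ np.array([0.114, 0.587, 0.299])

eye_gray = eye_gray / 255.0                 # scale to [0, 1]
batch = eye_gray.reshape(1, 80, 80, 1)      # (batch, height, width, channels)

print(batch.shape)
```

Printing `model.input_shape` on the loaded Keras model is a quick way to check what it actually expects.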
@mild dirge i changed the image to grayscale and reshaped it to (80,80,1)
Now its giving me this
Hi, is there such a thing as a deep learning engineer? Or is it just an ML engineer?
I’m most interested in deep learning so I was wondering if there are careers that only focus on that.
99.99% of machine learning jobs are focused on deep learning
Is that so? 🤔
I think DL gets the most "airtime" but I still see a lot of shops doing ML on tabular data.
you think that those shops will have a professional specialized on machine learning?
Yes
I was at a data science / ML consultancy in the past and they did virtually no DL. Most of it was ML on tabular data. Typical use cases like fraud detection, forecasting, predictive maintenance, ...
how long ago 'in the past'?
I think this was 2 years ago
not sure then
I'm not sure why you'd use DL if you don't need it 🤷 . There are, imo, many cons especially surrounding deployment of it.
So when it's tabular data I'd very much prefer not to.
Also the time it takes to pick a bespoke architecture, tuning which is harder but 100 % necessary etc.
How is it in your experience / location? 😮
stakeholders don't understand the tradeoffs and just want the best thing money can buy, but for cheap
What stakeholders though? If they're business idt they'd understand the difference between xgboost and DL
Sorry if I'm sounding contentious, really just curious about your experience since for me it's different 😄
just managers not very exposed to statistics
My clients are either: "I must gpt everything", or "I don't trust ai to recommend my lunch, much less a trading decision/forecast"
The thing with my time at $ML/AIConsultingCompany is that if they knock on your door they want something already
And the business model was really to have different titles and teams do BI / data infra first and plan ML in the long run after the first set of projects have delivered value
idk how stuff looks like in that scene after GPT dropped 🤷
so much snakeoil.
Ok, so, I’m guessing it depends on the company you work for?
Is anyone here an AI/deep learning researcher? What was your experience and is the income decent?
I'm trying to understand low and high p-values and how they relate to normality and homoscedasticity of data.
Use the Shapiro-Wilk test to determine if the data is normal. Shapiro-Wilk gives us a p-value. We would like the number to be above .05.
This sounds like if the p-value is greater than 0.05, the data is normal.
A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true. The lower the p-value, the greater the statistical significance of the observed difference. A p-value of 0.05 or lower is generally considered statistically significant.
This part I'm not getting. I think it contradicts the first part. This makes it sound like a low p-value is good.
But perhaps "good" has opposite meanings in different contexts.
How would you interpret a low p-value?
what do you understand that the first part means by "determine if the data is normal"?
I understand the first part to mean compare the p-value we got from the Shapiro-Wilk test on the residuals to 0.05.
and do you understand what it means (regarding the data) if it passes or fails the test?
I think that if the data fails the test, p < 0.05, then the data is not normal. If it passes the test, p > 0.05, the data is normal.
Is that what you're asking?
That's not really how to interpret p-values
But interpreting them correctly is a science in and of itself so I get the struggle
The "null hypothesis" of the S-W test is that the underlying distribution is normal. Hence, the p-value the Shapiro-Wilk test gives you is ||the probability of obtaining this data (well, or rather data that's at-least-this-weird by the metrics the S-W test tracks) if the underlying distribution is normal||.
So if you get a low enough p-value, then (as always) you can consider your null hypothesis disproven - this data probably didn't come from a normal distribution.
So if you want to check that the data is from a normal distribution, you're looking for a high p-value on the S-W test.
(I haven't clicked on the spoiler yet)
A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true
So if we assume the null hypothesis is true, that is the data is normally distributed, a low p-value means a low chance that the observed data came from a normal distribution?
Yup. Hence, if you got data with a low p-value on the S-W test, the distribution it came from probably isn't normal (otherwise obtaining this data would be unlikely).
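To see this on data where the answer is known, here is a sketch with SciPy's `shapiro` on two synthetic samples (sizes and seed are arbitrary choices). The null hypothesis is normality, so a low p-value is evidence against normality, while a high one only means the test found nothing to object to.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normal_sample = rng.normal(size=200)
skewed_sample = rng.exponential(size=200)   # clearly not normal

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    w_stat, p_value = stats.shapiro(sample)
    verdict = "reject normality" if p_value < 0.05 else "no evidence against normality"
    print(f"{name}: W={w_stat:.3f}, p={p_value:.4f} -> {verdict}")
```

The exponential sample should come out with a tiny p-value; the normal sample will usually, though not always, land above 0.05.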
Does anyone know how PyTorch lightning distributed backend works with datamodules?
Currently it appears to be creating two instances, which I haven't configured, and I'm not sure how to turn that off, because it seems to be breaking things...
I see the python file being run as __main__ twice, once for each process, and in the logs:
Log: https://paste.pythondiscord.com/7UYA
The problem is PT doesn't seem to then call prepare_dataset again, so the validation dataset is not generated.
But I don't really want it trying to do two distributed systems right now 😅
This is the training code
early_stop = pl.callbacks.early_stopping.EarlyStopping(
    monitor="val_loss",
    min_delta=0.00001,
    patience=5,
    verbose=False,
    mode="min",  # lower val_loss is better
)
trainer = pl.Trainer(
    callbacks=[early_stop],
    max_epochs=self.model_config.n_epochs,
    num_nodes=1,
    log_every_n_steps=32,
    accelerator="auto",
    devices="auto",
)
trainer.fit(self.model, self.data_module)
num_nodes suggests to me that this shouldn't spawn multiple independent processes?
That said, I don't like this definition you quoted:
A p-value measures the probability of obtaining the observed results, assuming that the null hypothesis is true
The probability of obtaining any particular results is usually 0 or close to it. The definition of the p-value usually involves a clause like "results as extreme as the observed".
What is meant by extreme in "results as extreme as the observed"? I think perhaps it means that it would be unlikely to select from a normal distribution and get this result.
Basically, to do statistics this way, you start by selecting the metric you will track. Suppose you're flipping a coin 20 times, say, and you're checking the null-hypothesis of the coin being fair. You choose the number of heads as the metric.
You get 15. What's the p-value of this result? It's not the probability of obtaining exactly 15 heads (1.48%) - that'll be quite low even for 10 heads (worse, if we were doing an experiment with continuous measurements, the probability of obtaining any particular result would just be exactly 0). It's the probability of obtaining >=15 heads, about 2%.
(Or, alternatively, we could choose to track the 2-tailed p-value here - that is, we consider deviations from 10 to both sides to be the same, and so our p-value will be the sum of the probabilities of obtaining 15,...,20 and also 0,...,5.)
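The coin numbers above can be checked directly with the binomial formula; this is just the arithmetic, using the standard library.

```python
from math import comb

n = 20          # flips
observed = 15   # heads seen

def prob_heads(k):
    """P(exactly k heads in n flips of a fair coin)."""
    return comb(n, k) / 2 ** n

exactly = prob_heads(observed)
one_tailed = sum(prob_heads(k) for k in range(observed, n + 1))
# Two-tailed: results at least as far from n/2 as the observed one, in either direction
two_tailed = one_tailed + sum(prob_heads(k) for k in range(0, n - observed + 1))

print(f"P(exactly 15) = {exactly:.4f}")     # 0.0148
print(f"P(>= 15)      = {one_tailed:.4f}")  # 0.0207
print(f"two-tailed    = {two_tailed:.4f}")  # 0.0414
```

The two-tailed value is exactly double the one-tailed one here because a fair coin's distribution is symmetric around 10.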
(If it seems kinda arbitrary that you need to precommit to doing only specific measurements (total number of heads, not some other metric) and how you'll be calculating the results (one-sided or two-sided) before the experiment for it to be valid - yeah, it is, that's a common critique of this entire approach to statistics as opposed to e.g. the Bayesian one.)
As a result of that, what does it mean if you get a p-value of 0.03 from the S-W test? Well, it means that they're measuring some property W of the data (you'd have to look up the definition of the test to know what), and the value of W on the test data is such that data drawn from a normal distribution would only produce a W-value at least that extreme (for S-W, that means at least that low) 3% of the time.
If you get a high p-value here, what does it mean? Well, it means your data is such that this statistic, calculated on this data, looks like what you'd get from a normal distribution.
(It would be a mistake, though, to assume that guarantees it actually came from a normal distribution - maybe there's some distribution that's not normal but produces a similar distribution of this statistic, and the data was drawn from it.)
I avoid using p-values in practice because they require a lot of baggage to interpret correctly
From my old slides:
Ah, that's interesting. What do you do in practice, then? Are you calculating likelihood factors for the hypotheses instead?
I also have this qq plot and this displot of the residuals. The dots should hug the line on the left I believe. And the residuals should look like a bell curve on the right. Neither of these is true, which is more evidence in support of the residuals not being normal or homoscedastic.
I'm not often in situations where I need them. Summary statistics and counts is what I use without trying to make strong claims unless I have access to an actual statistician.
When I was and was "forced" to use them they didn't express anything interesting for my domain. I had a cross section with demographic data. When n is that large any difference becomes significant. Sure then you need to fix an effect size but it was hard to come up with an effect size before looking at the data. Doing it after means the entire thing is biased anyway.
is the right one supposed to look like the burj kalifa?
Does the burj kalifa look like a bell? (rhetorical)
tbh if you showed me the right plot i'd go 'yeah sure looks normal to me'
the left plot is more worrying, with the right tail especially
Perhaps I should compare these with a normal displot and a normal qq plot to get a better idea.
Would you agree with this? I'm uncertain about the null hypothesis for the White test.
!paste
If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/
After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.
https://paste.pythondiscord.com/GFRQ
This is my code for a minimax on tic tac toe. I need to be able to get the score when running it, but in the end i need it to return the best move. How can I do that?
it's unfortunate because the underlying idea is actually pretty elegant
find a statistic with a known distribution under the null. if the value of that statistic in your sample is highly improbable, you reject the null.
interesting
It's generally better to just ask your question away if possible and whoever is familiar with the concept and free can pick it up and help you with it.
People usually don't want to engage in back and forth just to get to the question and then discover whether they're familiar / free enough to help with it or not
Salut
With dask, how would I do df['x'] = df['x'].mask(df['x'] < 0, 0.0) or something along those lines? no matter what I do, it complains about the index not being aligned.
oh I should've given a better example
tl;dr I need to use mask since I'm doing complex filtering
Works fine for me?
import pandas as pd
import seaborn as sns
iris = sns.load_dataset("iris")
iris: pd.DataFrame
iris["sepal_length"] = iris["sepal_length"].mask(iris["sepal_length"] < 5, 0.0)
print(iris.head())
# sepal_length sepal_width petal_length petal_width species
#0 5.1 3.5 1.4 0.2 setosa
#1 0.0 3.0 1.4 0.2 setosa
#2 0.0 3.2 1.3 0.2 setosa
#3 0.0 3.1 1.5 0.2 setosa
#4 5.0 3.6 1.4 0.2 setosa
dask
Unfortunately I'm dealing with Dask, not Pandas
Ah whoops
what exactly are you doing here? what does mask do essentially?
also its not really that hard in dask
this is how you can do conditional assignment
I see
Why does TfidfVectorizer return an all zero array (basically all zero, but there is 1 non-zero value for every other 20,000 zeroes)?
did you read the documentation to see what it is doing in first place?
see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html and https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer if not
if you did see, specify which part exactly you do not understand
yes, i do realize what it does, and it shouldn't return an all zero array in my case, so I'm assuming there could be an implementation problem or something of this sort here
show a reproducible example then
that would be hard without sharing initial data
has anyone here done reinforcement learning? I've been training a DQN on a simple pygame game i made but after a ton of experiments I'm really disappointed at how badly/slowly it learns. I've seen successful projects with more advanced algorithms like PPO that make use of massive parallelization so I figured I'd give that a try
but I can't find a simple code example of PPO so I don't really understand how to implement it
so I figured I'd first try to parallelize the DQN and move on to PPO if that doesn't work
the problem is, I'm not actually sure how to model it
right now I have observations [o_1, ... o_n] and I pass a vector of length n to the neural net
but if I have N observations and M games running in parallel, do I change my neural net to have the first layer be NxM observations? cause that doesn't seem like what I want. I want the agent to be able to run on a single game, and it certainly shouldn't correlate across games
so I'm thinking I should have M copies of the neural net, store the results of the games in my replay memory, and then backpropagate on only one of the neural nets
but this is all way more complicated than I think I can manage without help
and I'm worried I'm overthinking it
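One common way to model this, sketched here under assumptions, is to keep a single network and treat the parallel games as a batch dimension: the input becomes an (M, n) array rather than one vector of length n x M, and the same weights are applied to every row independently. The sizes and the single linear layer below are hypothetical stand-ins for a real DQN.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_actions, n_games = 4, 3, 8   # hypothetical sizes

# One set of weights shared by every game (a single linear layer for brevity)
W = rng.normal(size=(n_obs, n_actions))
b = np.zeros(n_actions)

def q_values(obs: np.ndarray) -> np.ndarray:
    """Q-values for a batch of observations of shape (batch, n_obs)."""
    return obs @ W + b

# One observation per parallel game, stacked along the batch axis
batch_obs = rng.normal(size=(n_games, n_obs))
batch_q = q_values(batch_obs)        # shape (n_games, n_actions)
actions = batch_q.argmax(axis=1)     # one greedy action per game

# A single game is just a batch of size 1 through the same weights
single_q = q_values(batch_obs[:1])
```

Each row is processed on its own, so nothing is correlated across games, and transitions from all games can go into one shared replay memory.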
What's the best layer layout for lightweight NLP model, that puts single word in the right form, tense, case etc?
what do you mean by layer layout
What layers to use and where to put them
are you using an rnn?
Hello. If I want to have Selenium enter login information on a website, how would I do that?
https://member.restaurantdepot.com/customer/account/login
I'm getting this error after trying to run Selenium
There has been an error processing your request
Front controller reached 100 router match iterations
Error log record number: a41644797e68843c6e965cd114df2fce18138d5b6efff34f07c54a26d3378e2f
This is the wrong channel, ask in #python-discussion. But the first question is going to be: are you violating the site's ToS?
No, i have permission from their corporate office
in q learning, how would I convert a state into an int?
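If the state is a tuple of bin numbers, one option is a mixed-radix encoding, which gives every combination its own integer. This is a sketch; the number of bins per dimension is a made-up assumption.

```python
def state_to_index(state, bins_per_dim):
    """Map a tuple of bin numbers to one integer (mixed-radix encoding)."""
    index = 0
    for value, n_bins in zip(state, bins_per_dim):
        if not 0 <= value < n_bins:
            raise ValueError(f"bin {value} out of range for {n_bins} bins")
        index = index * n_bins + value
    return index

bins_per_dim = (3, 3, 6, 6)   # hypothetical number of bins per state variable
n_states = 3 * 3 * 6 * 6      # number of rows a Q-table would need

first = state_to_index((0, 0, 0, 0), bins_per_dim)
middle = state_to_index((1, 1, 1, 1), bins_per_dim)
last = state_to_index((2, 2, 5, 5), bins_per_dim)
print(first, middle, last)
```

Alternatively, a Q-table stored as a NumPy array with one axis per state variable can be indexed with the tuple directly, which avoids the conversion.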
hi all,
i am trying to learn kmeans clustering with sklearn. so far i have imported my data, dropped rows containing NaN, and visualized the fit data with matplotlib. i wanted to break the set of columns into two separate clusters (k=2). however, it seems to be trying to cluster all of the rows into clusters instead.
might i need to transpose the dataframe? or is there another way
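Transposing is one way to do it: scikit-learn's `KMeans` treats each row as a sample, so passing the transposed frame makes each column a sample. A sketch with synthetic data (the column names and values are invented):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
base_a = rng.normal(size=50)
base_b = rng.normal(size=50)
# Hypothetical data: columns a1/a2 follow one pattern, b1/b2 another
df = pd.DataFrame({
    "a1": base_a + 0.05 * rng.normal(size=50),
    "a2": base_a + 0.05 * rng.normal(size=50),
    "b1": base_b + 0.05 * rng.normal(size=50),
    "b2": base_b + 0.05 * rng.normal(size=50),
})

# KMeans clusters rows, so transpose to make each column one sample
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(df.T.to_numpy())
column_clusters = dict(zip(df.columns, labels))
print(column_clusters)
```

With real data it is usually worth scaling the columns first, since columns in different units would otherwise be separated by magnitude alone.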
wut
why is my agent so dumb
am I doing something wrong?
the state never changes
it's always (1, 1, 1, 1)
i think it is the discretize bins function
since the next state is always (1, 1, 1, 1)
btw this is my first time doing anything AI related so bear with me
Is the reward based on driving without crashing, because it's doing a good job with that 😛
Can you guys recommend some books for statistics maths and ai maths?
It could be a tutorial too
intro to statistical learning
anyone from india here who is doing good in data science ? need a bit of advice for indian data science job market
Lstm, but accuracy is terrible
What's your background?
I am pursuing ms in data science
Completed my bachelor's a year ago, but I did not focus much on statistics
Anyone using Apache Spark for big data streaming?
#data-science-and-ml message I wrote this up a couple of weeks ago, you're pretty much the target audience for the resources 😄
Also consider reading practical statistics for data scientists
I'd definitely not recommend reading Bishop as a first read personally
It really helps you understand the more complex topics imo
but maybe not first read
It covers topics that are also not really necessary for industry and there's a lot of diminishing return on those
true
I didn't read PRML in full but many of my coursework referenced it so I read some chapters. Stuff like the decomposition of bias and variance is interesting to see but it doesn't really help you.
but there are topics that are interesting if you want to understand the behind the scenes for the algo
How do I best handle a 2 player (turn based) game? My idea would be to have the following inputs: the environment, movement options of the active bot, location of the other bot, which bot is active (?)
Just imagine a grid based pathfinding game for simplicity.
concern being that the ai might get confused with constant switching between players
a game-playing AI would not get confused by switching between players, because the "AI" can only "think" during its turn.
AIs aren't like human brains that are constantly thinking and perceiving. they only have "brain activity" in the limited context that they're being used.
ChatGPT doesn't "think" between user inputs.
Makes sense. so which player is active is a useless info then 👍
in games, we usually refer to each player as an agent. and the AI is an automated agent.
when it's the automated agent's turn, it needs to be able to get all the same information about the game state that is available to human players.
for example, in chess, it needs to know where all the pieces are on the board. whereas for card games, it needs to be able to see its own cards.
To add to this, does someone maybe know where I can get help with big data streaming?
If I fit_transform TfidfVectorizer multiple times, will it overwrite past words or update the dictionary?
yes, every time you fit something, that basically resets it. including with fit_transform
this is a landmine for a lot of people.
you also shouldn't (for example) have separate vectorizers for train and test. because then your test data represents the same things differently than the training data, making everything meaningless.
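A small sketch of both points, using made-up documents: fit once on the training text, reuse the same fitted vectorizer for the test text, and note that a second `fit_transform` replaces the learned vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat", "the dog barked"]   # made-up examples
test_docs = ["the cat barked", "a bird sang"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learns vocabulary and IDF from train
X_test = vectorizer.transform(test_docs)         # reuses them, no refitting

vocab_after_train = dict(vectorizer.vocabulary_)

# Calling fit_transform again discards what was learned before
vectorizer.fit_transform(test_docs)
vocab_after_refit = dict(vectorizer.vocabulary_)

print(sorted(vocab_after_train))
print(sorted(vocab_after_refit))
print(X_test.toarray().round(2))
```

The second test document shares no words with the training vocabulary, so its row comes out all zeros. Out-of-vocabulary text is one ordinary way to end up with mostly-zero TF-IDF output.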
Hi I'm just wondering if we can compare 2 lists of different length by ID? How do you do it? I'm looking for the difference between multiple dates that are only > 0
Sorry, but I don't really understand what you mean
Maybe a bit of a shot in the dark, but I'm working on making an encoder+decoder model which tries to recreate a point cloud. I am using the chamfer distance for the loss, but it seems that it doesn't get much further than the general shape. I was wondering if anyone had any idea how to change the loss function so that it tries to get the details more similar. This image shows the original point cloud in the bottom row, and my decoded point cloud at the top.
As can be seen, the general shape is pretty good, but when there are tiny details, it just doesn't do well.
If you haven't done it already I suggest you sanity check your architecture / approach by fitting one point cloud and watching the loss go to 0
I always sanity check my approach by ensuring I can at least completely overfit 1 sample or a batch of samples, depending on the case
If that is already hard then there's likely something up with your architecture 🙂
It should be able to overfit on a single point cloud since the decoder is just a bunch of dense layers, and the loss is a pretty standard loss that is 0 when the point clouds are equal
I can maybe look into different decoders, but I'm thinking the model is just stuck in a local *minimum in which optimizing small details just doesn't decrease the overall loss enough for it to move in that direction.
You should still try it imo 😄 it's basically just debugging. It has saved me countless times.
I'll give it a go, but theoretically it should definitely be able to exactly copy the point cloud
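For reference, here is a NumPy sketch of one common Chamfer distance variant (squared nearest-neighbour distances, averaged in each direction). The exact variant used in the model above isn't shown, so treat this as an assumption. It confirms the loss is 0 for identical clouds regardless of point order, and shows how little a small per-point error contributes.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point clouds a (N, 3) and b (M, 3).

    Uses the squared Euclidean distance to the nearest neighbour, averaged
    in each direction. Other variants (unsquared, summed) exist.
    """
    # Pairwise squared distances, shape (N, M)
    diff = a[:, None, :] - b[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    a_to_b = sq_dist.min(axis=1).mean()   # each point in a to its nearest in b
    b_to_a = sq_dist.min(axis=0).mean()   # each point in b to its nearest in a
    return float(a_to_b + b_to_a)

rng = np.random.default_rng(0)
cloud = rng.normal(size=(100, 3))
shuffled = rng.permutation(cloud)                     # same points, different order
noisy = cloud + 0.01 * rng.normal(size=cloud.shape)   # small detail-level error

d_same = chamfer_distance(cloud, cloud)
d_shuffled = chamfer_distance(cloud, shuffled)
d_noisy = chamfer_distance(cloud, noisy)
print(d_same, d_shuffled, d_noisy)
```

The pairwise matrix is O(N x M) in memory, which is fine for a sanity check on one cloud but may need chunking for large clouds.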
Depends. Are you trying to win the competition or do data science properly? 😄
It's important to know that they're completely different things and that Kaggle competitions can, or at least used to, teach bad habits
If you're OK with that then the smartest thing to do if you want to win is to clean the entire dataset together.
If you care about being methodologically correct, the first thing you should do is split your data and only work with the training set. You should then make a preprocessing pipeline that can work with your training set but also works for totally unseen data.
Right so I was thinking as a not cheating thing if I define my preprocessing pipeline using only train data that's probably the correct thing... But now I think I see what you mean I could cheat and include the test data too
Like MEDIAN_AGE just from train
Yes, I think you get it
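As a concrete sketch of the MEDIAN_AGE idea, with toy data and an assumed column name: the statistic is learned from the training split only and then applied unchanged to unseen data.

```python
import pandas as pd

# Hypothetical toy data with a missing age in each split
train = pd.DataFrame({"age": [22.0, 35.0, None, 58.0]})
test = pd.DataFrame({"age": [None, 80.0]})

# Learn the statistic from the training split only
MEDIAN_AGE = train["age"].median()

# Apply the same learned value to both splits
train_clean = train.fillna({"age": MEDIAN_AGE})
test_clean = test.fillna({"age": MEDIAN_AGE})

print(MEDIAN_AGE)
print(test_clean)
```

The test split's own values (the 80 here) never influence the fill value, which is the difference from cleaning the whole dataset together.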
hey all, does anybody have experiences designing RAG systems? I have a bunch of PDFs, and I'm working on a RAG pipeline, and have been researching on different approaches (already went with the very simple LlamaIndex and Langchain thing).
I have been seeing a lot of folks recommending the "hybrid" approach of using both vector and keyword based search. Now the vector part is "ok": you read the docs, embed them, save the embeddings into the vector store and then perform a search. The regular search by itself is also clear: read the docs, store them in a db, and perform a regular search.
what I'm not following is how to mix those two approaches - does that mean I need to duplicate the data? Eg having it as regular text in elasticsearch for example, and also have it as vector? Any ideas?
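One way the two approaches are often combined, offered as a sketch rather than the only design: run both searches separately, have each return a ranked list of chunk IDs, and merge the lists with reciprocal rank fusion. The text is then indexed twice (once as keywords, once as embeddings), but both indexes point at the same chunk IDs. The IDs and the constant `k=60` below are illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for one query; the IDs refer to the same stored chunks
vector_hits = ["doc3", "doc1", "doc7"]
keyword_hits = ["doc1", "doc9", "doc3"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that appear in both lists rise to the top, and only ranks are compared, so the two engines' scores never need to be on the same scale.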
Would anyone here be willing to review my Jupyter notebook for a school project? It's about LinearRegression, lassoCV, and RidgeCV.
Hi, sorry, it’s been a while since I’ve worked in this. I tried your suggestion but I still get the error ;-;
What else can I do? 😔 I need to batch the dataset?
that person was just copying answers from ChatGPT, btw. so they probably don't even understand your question.
😔
can someone explain in object detection the relevance of grids on the image, bounding boxes, and anchor boxes
I made a really cool gpt https://chat.openai.com/g/g-Ks09304lQ-philbo
@boreal gale is it possible to run an initial function with 1 set of parameters but then use gridsearch with the performance from the initial function and a group of parameters?
Grid search exists to find the best combination of hyperparameters, you can give it a set of hyperparameters you initially used along with others you would like to test out.
If that's what you meant
Anyways
Day 2 of me trying to find someone who has experience with Apache spark and Structured data streaming
If you are that person please let me know, i would love to hear your opinion on how it handled big data 🙂
I will be dealing with possibly infinite amount of data so i have to choose proper tool
im realizing hyperparameters are different from parameters
Is there a way to right outer join tables based on a condition? Any links would be greatly appreciated 🙂
what do you mean "based on a condition"?
I'm looking to join 2 tables using the minimum difference between 2 columns. I don't really know what to do my 2 tables have different number of rows.
a regular join (e.g. pandas.merge) matches on exactly equal values between the two tables. you can't use it to join on the closest values.
so you'll have to come up with an approach that does not involve a plain join
My 2 tables have the same IDs. But I don't know if that'll work because they still have different numbers of rows. I've tried concat the 2 dataframes, but I really want them to be on the same row or some other way I can interact with one another.
I'm now out of ideas and looking for help, a way to solve this semi matching table thing.
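For the "closest value within the same ID" case, pandas does have a dedicated function, `pandas.merge_asof`, which matches each left row to the nearest key on the right. A sketch with invented tables and column names:

```python
import pandas as pd

# Hypothetical tables sharing an "id" column, with different numbers of rows
left = pd.DataFrame({
    "id": [1, 1, 2],
    "date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-01-20"]),
})
right = pd.DataFrame({
    "id": [1, 1, 1, 2, 2],
    "event_date": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-03-01", "2024-01-10", "2024-02-15"]
    ),
})

# Both frames must be sorted by their merge key
matched = pd.merge_asof(
    left.sort_values("date"),
    right.sort_values("event_date"),
    left_on="date",
    right_on="event_date",
    by="id",                 # only match rows with the same id
    direction="nearest",     # smallest absolute difference
)
matched["diff_days"] = (matched["date"] - matched["event_date"]).dt.days
print(matched)
```

The result has one row per left row. Using `direction="backward"` instead only matches right-hand dates that are on or before the left-hand date, which keeps the differences non-negative.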
sounds like a normal many-to-many outer join?
Yes I'm looking for a many to many join
you can specify how= to tell pandas whether to do an inner/left/right/outer join
as for many to many... pretty sure that's the default and pretty much only supported option (excluding the validate)
!d pandas.merge
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
Warning
If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.
maybe take a look at the User Guides
I'll try that. Can you explain further what it would do?
What will happen to the empty rows? Do I need to fill anything?
How many rows will the new table have?
if you use inner, it'll exclude rows that do not have a matching record on the other table
if you use left/right, it'll just throw in nulls/Nones/NAs for the right/left table where there were no matching records for the left/right, respectively
if you use outer, it'll throw in NAs in both sides where there is no match
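with inner: at minimum 0 if there are no matches, at most min(x, y) if all rows have matches (assuming the key is unique in each table; with repeated keys every matching pair produces a row, so it can be more)
with outer: at minimum max(x, y) if every row of the smaller table has a match, at most x + y if there are no matches (again assuming unique keys)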
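Row counts are easiest to see on a tiny example. In this sketch (made-up tables), the key `id` repeats on both sides, so every matching pair produces its own row and the counts can exceed the bounds that hold for unique keys.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 1, 2, 3], "a": ["w", "x", "y", "z"]})
right = pd.DataFrame({"id": [1, 1, 4], "b": ["p", "q", "r"]})

row_counts = {}
for how in ("inner", "left", "right", "outer"):
    merged = pd.merge(left, right, on="id", how=how)
    row_counts[how] = len(merged)

print(row_counts)
```

Here `id` 1 appears twice in each table, giving 2 x 2 = 4 matched rows; unmatched rows are then added (with NaN on the missing side) depending on `how`.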
does anyone know a platform to use random forest on python on windows? i tried tensorflow but it only works on macos and linux
if you are not trying to use neural networks, iirc sklearn should suffice
thx
OMG IT WORKS THANK YOU SO MUCH
I'm trying to train an AI that will correctly classify an amazon review with the corresponding rating that was left (1/2/3/4/5 stars), but the AI's accuracy isn't improving at all in the epochs. What does this typically indicate? Fwiw the loss is decreasing (whatever that means)
In fact, the accuracy is 3.79%, which is worse than the "pick a number at random" (would give 20%)
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim),
    keras.layers.GlobalAveragePooling1D(),
    keras.layers.Dense(24, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid")
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
this is the model that I'm using (copy-pasted from an example)
I mean you are trying to train a model without having clue what hyperparameters you are setting
That is not the way to do it
It can indicate anything
What are "hyperparameters"?
The config?
I suggest you get into ML properly from the start if you are interested
You could check Stanford course for the starters
I think it is free on youtube
I know some basics of ML, just not how to choose the right hyperparameters
That's entire point of engineering the model today, fine tuning hyperparameters
What does "loss" measure?
The error essentially, i.e, how far the predicted value is from the actual value.
Right, so saying a 5 is a 1 would have a larger loss than saying a 4 is a 3?
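Whether that holds depends on the loss. A sketch in plain Python with hypothetical model outputs: categorical cross-entropy treats the star ratings as unordered classes, so "5 predicted as 1" and "4 predicted as 3" cost the same at equal confidence, while squared error on the numeric rating does penalise the distance.

```python
import math

def cross_entropy(probs, true_class):
    """Categorical cross-entropy for one example (classes indexed from 0)."""
    return -math.log(probs[true_class])

def squared_error(predicted_stars, true_stars):
    """Squared error when the rating is treated as a number."""
    return (predicted_stars - true_stars) ** 2

def confident_wrong(predicted_class, n_classes=5):
    """Hypothetical output: 90% on one class, the rest spread evenly."""
    probs = [0.025] * n_classes
    probs[predicted_class] = 0.9
    return probs

# True rating 5 predicted as 1, versus true rating 4 predicted as 3
ce_far = cross_entropy(confident_wrong(0), true_class=4)
ce_near = cross_entropy(confident_wrong(2), true_class=3)
se_far = squared_error(1, 5)
se_near = squared_error(3, 4)

print(ce_far, ce_near)   # identical: cross-entropy ignores class order
print(se_far, se_near)   # 16 vs 1: squared error respects the distance
```

Separately, the model quoted earlier ends in `Dense(1, activation="sigmoid")` with `binary_crossentropy`, which is a two-class setup. For five star ratings, a five-unit softmax output with a categorical cross-entropy loss is the more usual choice. This is a guess at the cause, since the label encoding isn't shown.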
Can I just send you this? https://karpathy.github.io/2019/04/25/recipe/ This is mostly the methodology I use. I can give you a watered down version or I can refer you to the source 😄
If it's too technical I can give you a higher level overview you can follow.
Will take a look at it, thanks :)
Np, ping me if you have any specific questions
Is this an open invitation. 😅 😅
Well, it's mostly for this specific case
I'm on frequently enough to notice if there's a question where I can help though 🙂
@past meteor I've been going thru a book called "data science first principles with python" which I found in the python dc server resources it is a very good book and I've been enjoying it. My question is that for example when the book introduces Linear algebra and then it talks about it for a while but then at the end of the chapter there are other books that you can read if you wanna know more, I wonder how should I pursue those books or whether I should? Should I wait until I finish with the whole book or should I stop read the book and then move on? Or should I just read the book note everything down including the books then perhaps do a very basic project and if I decide to move onto a field where the mathematics is needed then I can read those books?
and other helpers perhaps?
what models are used to solve such a problem? Can someone just point me in the right direction and i will do some research.
Looking at customer feedback from surveys and I need to catgeroize the feedback based on some grouping logic for example
"I didn't buy the item because i have no money " ---> Budgeting
I think you're probably thinking too "linearly" (no pun intended): there's no right or wrong answer. From your explanation, it sounds like the book gave you the minimum information you need for linear algebra... but that is an entirely separate (college-level) course, so don't stress it. You can learn it when you want to learn it.
I'll let you in on a secret: I read 3+ books on the same topic 😄
I like having different opinions on the same topics. I also frequently read book 1, then 2 then 1 then 3 and then 2
As I mentioned: you need to get really comfortable with not understanding things in detail
Is there any way to get a head start in math and its implementation in programming before my studies at the university? For reference, I will be studying an AI course, Bachelor's degree.
Of course. Ask your university for a curriculum map: a list of courses you'll need to take, usually by year. It's usually online, or you can send an email to the department.
I tried to look at their website, and I believe it's a new course they provide, so I really can't click on any of the modules that they list, I only have the names, and short explanations about them, even some explains aren't complete and finish with "..." I will email them and ask them what can I expect, I asked in this channel to see if anyone would recommend books, websites... Etc to help with the study.
What country are you in?
UK
You can look at other universities and see what comparable curriculums look like. In the US, freshman year computer science is two semesters of Calculus.
Then, either more calc, or linear algebra... and definitely some calculus-based stats.
Number one tho is: strong algebra / pre-calc skills. Calculus isn't inherently hard, it's the lack of algebra mastery that makes it hard.
I'm not sure whether my university would be the same as the others, but I would try and look that up
But after that, what do I do? How do I study them?
Same thing: find a syllabus for the class you want to take at a local university. Most calc courses, for instance, are largely the same.
For example, here's stanfords summary of Calc 1, 2 and 3: https://drive.google.com/file/d/1ycBlN7LBm1v10gnEU0gg1JfD_v2ncenc/view
Hopefully I can get more knowledgeable in programming and math, I don't want the next year to be harder.
Absolutely. The two things I’d advise any student going to CS is: study calculus and learn programming (whatever language your university teaches first… usually Java or Python). Do practice problems, watch online courses, etc. if you do a little every week, you’ll be ready. If you wait to the end, you won’t be able to cram it in.
(This is us-centric advice but I assume Uk curriculum is similar)
My advice is primarily about making freshman year as easy as possible, not that calculus is more important than other maths.
I believe so, but the study curriculum is different it's mostly focused on search and independent learning, that's why I'm very focused on math since it would be my first time in this kind of studying, add to that I have no idea how math can be studied independently. Yes they would explain it in classes, but I think you'd still have to study harder on yourself.
As for programming, I'm trying to practice more and I can have an idea of how I can study independently.
For calc, I have standard advice, one sec, let me find an old thread
#data-science-and-ml message and the message right after it
This might be obvious, but I should focus on pure programming and Python, right? After this I can take time to learn math and its implementation?
First: find out what the intro language is at your college. It might not be Python.
And, I would suggest studying both. You’ll never ‘finish’ learning Python, nor will you finish learning math. You’re just trying to get better, not be perfect
guys i have a few doubts can i ask here?
in this code of mine:
`import numpy as np
import pandas as pd
import statsmodels.api as sm
# Generate dummy data
np.random.seed(123)
X = np.random.normal(size=(100, 5))
y1 = np.random.normal(size=100)
y2 = np.random.normal(size=100)
data = pd.DataFrame(X, columns=['x1', 'x2', 'x3', 'x4', 'x5'])
data['y1'] = y1
data['y2'] = y2
# Fit model for y1
X = sm.add_constant(data[['x1', 'x2', 'x3', 'x4', 'x5']])
model1 = sm.OLS(data['y1'], X).fit()
# Fit model for y2
model2 = sm.OLS(data['y2'], X).fit()
# Print model summaries
print(model1.summary())
print(model2.summary())
# Fit joint model
X = sm.add_constant(data[['x1', 'x2', 'x3', 'x4', 'x5']])
y = data[['y1', 'y2']]
model_joint = sm.OLS(y, X).fit()
results_df = pd.DataFrame()
results_df['Coefficients Y1'] = model_joint.params.iloc[:, 0]
results_df['Coefficients Y2'] = model_joint.params.iloc[:, 1]
print(results_df)
results_df['Std Errors Y1'] = model_joint.bse.iloc[:, 0].values
results_df['Std Errors Y2'] = model_joint.bse.iloc[:, 1].values
print(results_df)`
i am getting the following error :
ValueError Traceback (most recent call last) <ipython-input-1-e65d97313ad8> in <cell line: 34>() 32 results_df['Coefficients Y2'] = model_joint.params.iloc[:, 1] 33 print(results_df) ---> 34 results_df['Std Errors Y1'] = model_joint.bse.iloc[:, 0].values 35 results_df['Std Errors Y2'] = model_joint.bse.iloc[:, 1].values 36 print(results_df) /usr/local/lib/python3.10/dist-packages/numpy/core/overrides.py in dot(*args, **kwargs) ValueError: shapes (100,2) and (100,2) not aligned: 2 (dim 1) != 100 (dim 0)
please can someone here help me??
@white coral if you look at the full "traceback" you'll see that this error comes from inside statsmodels and isn't related to your dataframe. i can reproduce your error but i'm not sure what causes it. you'll see the error if you just do print(model_joint.bse).
i'd argue that this is a bug in statsmodels. if you did something invalid, you should get a helpful error message, not this
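A workaround that stays within the posted code is to read the standard errors from the two separate fits (`model1.bse` and `model2.bse`), which use a 1-D `y`. If both targets need to be handled at once, the coefficients and classical OLS standard errors can also be computed directly with NumPy. The sketch below uses synthetic data like the question's; the numbers should agree with statsmodels' default (non-robust) results, though that is worth checking against the separate fits.

```python
import numpy as np

rng = np.random.default_rng(123)
n, p = 100, 5
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # constant + 5 regressors
Y = rng.normal(size=(n, 2))                                   # columns: y1, y2

# Coefficients for both targets at once, shape (6, 2)
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Residual variance per target, with n - (columns of X) degrees of freedom
residuals = Y - X @ beta
dof = n - X.shape[1]
sigma2 = (residuals ** 2).sum(axis=0) / dof        # shape (2,)

# Standard errors: sqrt(sigma2_j * diag((X'X)^-1)_i), shape (6, 2)
xtx_inv_diag = np.diag(np.linalg.inv(X.T @ X))
std_errors = np.sqrt(np.outer(xtx_inv_diag, sigma2))

print(beta.round(3))
print(std_errors.round(3))
```

Because both targets share the same design matrix, each column of `beta` equals what a separate regression on that target would give.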
I believe it's Python, but I will look that up, but does it matter, programming is the same isn't it? And to keep in mind, I'm already making some projects and learning more programming using Python.
Yes, I know that programming isn't something you can master, it's always updating and you have to know things that you didn't know before, but what I meant by "finish" is to get to a high level.
Hey
I am just saying: be prepared and ask a lot of questions. Know what the program entails: especially details and example syllabi for first year courses. If it’s Java, then be prepared for Java. Being prepared will make the first year -much- easier
I have a numpy array of 20000 values. How can I plot the standard deviation for sample size 1-20000? (Ox is sample size, Oy is standard deviation)
You can iterate over a for loop and calculate standard deviation for each array:
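import numpy as np
import matplotlib.pyplot as plt

# `array` is your existing 1-D NumPy array of 20,000 values
ox = np.arange(1, len(array) + 1)   # sample sizes 1..20000
oy = np.empty(len(array))
for i in range(len(array)):
    # std of the first i + 1 values (array[:0] would be empty, hence the + 1)
    oy[i] = np.std(array[: i + 1])
plt.plot(ox, oy)   # Ox: sample size, Oy: standard deviation
plt.show()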
Haven't tested it, but give it a go and tag me if something is wrong. But hope it helps.
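The loop recomputes the standard deviation from scratch for every prefix. A vectorised alternative, sketched here with random stand-in data, gets all the running values in one pass from cumulative sums, using var = E[x^2] - (E[x])^2.

```python
import numpy as np

rng = np.random.default_rng(0)
array = rng.normal(loc=5.0, scale=2.0, size=20_000)   # stand-in for the real data

sizes = np.arange(1, array.size + 1)                   # sample sizes 1..n
running_mean = np.cumsum(array) / sizes
running_mean_sq = np.cumsum(array ** 2) / sizes
# Population variance of each prefix; clip tiny negatives caused by rounding
running_var = np.clip(running_mean_sq - running_mean ** 2, 0.0, None)
running_std = np.sqrt(running_var)

# Plotting is optional and assumes matplotlib is installed:
# import matplotlib.pyplot as plt
# plt.plot(sizes, running_std); plt.xlabel("sample size"); plt.ylabel("std"); plt.show()
```

This matches `np.std` with its default `ddof=0`. The subtraction can lose precision when the mean is very large relative to the spread, in which case the explicit loop is the safer option.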
Does anyone here have any experience with tf.data and tensorflow? I have windowed my data for timeseries with window size of 512 and batch size of 2048, and I have 16 features and a total of 161873 datapoints before windowing or batching. I am fairly new to tensorflow, being more experienced in the mathematics of ML. I keep getting this error, and I am unsure how to set up feature layers to make my data compatible. I am more or less randomly putting something into input_shape= trying to figure out how it works.
Error:
ValueError: Exception encountered when calling layer "dense_features" (type DenseFeatures).
We expected a dictionary here. Instead we got:
Call arguments received by layer "dense_features" (type DenseFeatures):
• features=tf.Tensor(shape=(None, 2048, 512, 16), dtype=float32)
• cols_to_output_tensors=None
• training=None
This is my model and feature layers:
data = Data('TData/train.csv')
count = 64
dataset = data.data_ds.take(int(count*0.8))
dataset_cv = data.data_ds.skip(int(count*0.8))
numeric_features = [tf.feature_column.numeric_column(feat) for feat in data.column_names[:-1]]
feature_layer = tf.keras.layers.DenseFeatures(numeric_features, input_shape=(2048, 512, 16))
# output_bias = tf.keras.initializers.Constant(init_bias)
model1 = Sequential([
feature_layer,
LSTM(units=64, stateful=True),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Dropout(0.3),
Dense(units=8, activation='tanh', kernel_regularizer=tf.keras.regularizers.L2(0.16)),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Dropout(0.3),
Dense(units=1, activation='sigmoid') # , bias_initializer=output_bias)
])
# model1.build(input_shape=(64, 300, 16))
cp = ModelCheckpoint('ModelDatasetTesting/', save_best_only=True)
# model1 = tf.keras.models.load_model('ModelV1/')
model1.compile(loss=BinaryCrossentropy(), optimizer=Adam(learning_rate=0.0001), metrics=['accuracy']) # , tf.keras.metrics.Precision(), tf.keras.metrics.Recall()
model1.fit(dataset, validation_data=(dataset_cv), epochs=10, batch_size=2048, callbacks=[cp]) # , class_weight=class_weight
@boreal gale does https://github.com/bayesian-optimization/BayesianOptimization
have lag or get slow with large searches?
hi! does anyone have experience with vectorization (like changing words to vectors)?
i'm trying to use word2vec to vectorize a column of my dataframe, but tbh i've never done it before, so i'm rather confused on how to do it
yo, can I share a model and get some opinions??
It's easier for everyone if you show the thing you want feedback on when you ask. Don't wait for a commitment.
Hello!, can I install tensorflow/tfjs-node using python 3.x? I read that the best option is using Python 2.7
Have any of you guys ever scraped / crawled reddit and analyzed it
Hey guys what algorithm do you guys recommend for anomaly detection? I'm trying to build a model that checks gas meter rates and if it gets out of range flag the value.
You could use any algorithm you prefer for this: XGBoost, CatBoost, a VAE, etc.
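If "out of range" really is a range check, a simple statistical baseline may be worth trying before a learned model. This sketch flags readings outside limits derived from the interquartile range of reference data. The readings and the multiplier `k=1.5` are illustrative assumptions, not values from the question.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Lower and upper limits from the interquartile range of reference data."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical gas meter rates known to be normal
reference = np.array([10.2, 9.8, 10.5, 10.1, 9.9, 10.3, 10.0, 10.4])
low, high = iqr_bounds(reference)

new_readings = np.array([10.1, 14.7, 9.9, 3.2])
flags = (new_readings < low) | (new_readings > high)
print(low, high)
print(flags)
```

A baseline like this also gives something to compare against if a more complex model is tried later.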
Haven't crawled Reddit. Is that what you're currently doing? Are you having any challenge in doing that?
Hello,
How to work with Over-fitting model?
You could use Gensim to achieve that after you've performed tokenization (you can use NLTK for that).
import gensim
from gensim.models import Word2Vec
sorry I'm rather new to programming lol
could you specify what you mean by tokenization/using gensim to achieve this?
i asked chatgpt about it too and it gave me some code, but requires that i download the full pre-trained word2vec model
is there another way to do this without downloading multiple gigabytes of word2vec?
You'd have to reduce your model's complexity so that its high variance drops.
There are several ways you could approach this. We can categorize them into two groups: data-centric and model-centric approaches.
- Data-Centric
- Data Augmentation
- Resampling (up sampling / Down sampling)
- SMOTE (I personally haven't seen the significant impact of this approach when compared to other approaches)
- Model Centric
- Use Regularisation
- Early Stopping / Other callbacks
- Hyperparameter Tuning
- Adjusting the value of the scale_pos_weight parameter in your model
- Adjusting the class_weight parameter
You didn't offend me so no need to apologise for doing nothing wrong 😊. Meanwhile, welcome, I hope you're enjoying your programming experience thus far?
So tokenization is just a fancy way of saying "breaking down the walls of Jericho into tiny stone pebbles".
Only that this time, in NLP, this 'wall' isn't an actual 'wall' but a document or textual data.
You can decide to break this 'wall' into different sizes of pebbles. So we have different types of tokenization, but I'll just highlight sentence tokenization and word tokenization.
-
Word tokenization: When you break down your text data into individual words. For example
"I ate jollof rice this morning"becomes[ 'I', 'ate', 'jollof', 'rice', 'this' , 'morning' ] -
Sentence Tokenization: this is where our document / text data is broken down into individual sentences.
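The two kinds of tokenization described above can be sketched with nothing but the standard library. The regular expressions here are a deliberate simplification; NLTK's `sent_tokenize` and `word_tokenize` handle many cases (abbreviations, punctuation attached to words) that these patterns do not.

```python
import re

text = "I ate jollof rice this morning. It was delicious! Would you like some?"

# Sentence tokenization: split after ., ! or ? when followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word tokenization: pull out runs of letters, digits and apostrophes.
words = re.findall(r"[A-Za-z0-9']+", sentences[0])

print(sentences)
print(words)
```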
NLTK & Gensim are both libraries used for doing some interesting stuff in NLP.
So, once you've installed Gensim on your machine you could easily leverage its models module to access the Word2Vec embedding.
I'll point you to a website for more clarity shortly
Thank you very much!
Thank you! this really helped 🙏🙏
Checking out the website right now
Do you have any tutorials I can follow up?
Unfortunately I have none. You might wanna check online for each individual approach. However, if you tell me the kind of data you're working with and the task you're performing, I could provide more useful resources / feedback.
Guys, what courses should I take on uni that would be really useful for ai and machine learning. I was thinking about doing just mathematics since I already know how to code and taking just computer science wouldn't make sense i think, or perhaps taking a joint course. I was also wondering if there are any courses just about machine learning because I didn't find anything but mathematics and computer science. Any thoughts?
Any of these 3 majors are cool in building solid foundational knowledge. Statistics, Computer Science, and Mathematics.
Nonetheless, this field isn't discriminatory, so you might as well decide to study Fishery and still end up doing Data Science (so long as you are ready to put in the needed work).
Many universities now have a data science undergraduate program. So it depends on where you reside and the schools you're interested in.
Hi there, general question here: how popular is PyTorch Lightning?
Also depends on your country! 🙂 When I was at a data science consultancy, half of the staff had an MSc in Business Engineering (I'm one of them) and the other half an MSc in Comp Sci.
What both have in common is they go very deep into statistics and math. We tended to have a bit more domain knowledge and knowledge of other methods that may be relevant (for instance operations research) for data adjacent problems and they knew (a lot) more programming and a bit more math.
There's also tons of people doing data science that come from any quantitative field such as Bio Engineering / Bioinformatics etc.
Hi all,
I have been working on a proof-of-concept AI tool for my current job.
The purpose of the tool is to map an arbitrary string into a string from the list.
I have a dataset of arbitrary strings and corresponding strings from the target list. The tool is written in Python.
Here is the source code:
Using **LinearSVC** model: https://pastebin.com/R6LNrJ4w
Using **SGDClassifier** model: https://pastebin.com/bxcug3Vb
This code was written using a ChatGPT-like tool and scraps of information on the scikit-learn library that I could understand (I am very far from Python and ML itself).
The dataset consists of 3 million+ mappings (each of the two files is approximately 100 MB).
I would be very grateful if there are people here who can help me with at least some of the related questions:
- The fit() method executes well and fast in both models if I cut the dataset down to around 10k mappings. However, when I keep the original 3 million, it takes a very long time (started approximately 48 hours ago and still running). My PC specs are 32GB of RAM and an i5-12400f CPU. Is this a reasonable time for such a dataset? Could it be at least roughly estimated how much more time is left?
- Is this a good way of using scikit-learn for this kind of tool? I would be grateful if you could recommend other more effective approaches or ready-made solutions.
- Can it be determined which of the models (LinearSVC or SGDClassifier) has more pros than cons for this tool?
- In the current implementation, when I turn off the program or PC, everything the model has learned is lost from RAM. How can I save the training progress to the hard drive and resume training from it?
Thank you!
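On the last question (saving progress and resuming), one possible pattern is incremental training with `SGDClassifier.partial_fit` plus `joblib` checkpoints. The sketch below is not taken from the pastebin code: the data is a hypothetical stand-in, and `HashingVectorizer` is swapped in because it is stateless and therefore works batch by batch, which may differ from the vectorizer the original scripts use. `LinearSVC` has no `partial_fit`, so this pattern only applies to the SGD variant.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical stand-in data: (arbitrary string, target string) pairs.
pairs = [
    ("acme corp.", "ACME Corporation"),
    ("ACME co", "ACME Corporation"),
    ("globex inc", "Globex"),
    ("Globex Intl", "Globex"),
] * 50
texts = [p[0] for p in pairs]
labels = [p[1] for p in pairs]
classes = sorted(set(labels))  # partial_fit needs the full label set up front

# HashingVectorizer needs no fit, so it can transform one batch at a time.
vectorizer = HashingVectorizer(analyzer="char_wb", ngram_range=(2, 4), n_features=2**18)
clf = SGDClassifier(random_state=0)

checkpoint = os.path.join(tempfile.mkdtemp(), "clf.joblib")
batch_size = 50
for start in range(0, len(texts), batch_size):
    X_batch = vectorizer.transform(texts[start:start + batch_size])
    clf.partial_fit(X_batch, labels[start:start + batch_size], classes=classes)
    joblib.dump(clf, checkpoint)  # save progress after every batch

# Later (even after a restart): reload, then predict or continue with partial_fit.
restored = joblib.load(checkpoint)
prediction = restored.predict(vectorizer.transform(["acme corporation"]))[0]
print(prediction)
```

Timing the first few batches also gives a rough way to estimate the total run time, since each batch costs about the same.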
Pastebin.com is the number one paste tool since 2002. Pastebin is a website where you can store text online for a set period of time.
Hmm I'm planning to go to the UK since I'm in Ireland, I was just wondering that what should I go with. Going just math would be interesting as I might need to learn stuff that I won't ever use in machine learning and data science but if I would go with Mathematics then I could literally self learn almost everything in terms of data science and machine learning as I have a really strong foundation in mathematics. I just wondered whether that's a good decision
How do you guys go about testing multi-label classifiers.
Currently I have a classifier model which is trying to predict one label based on the content of the text, but realistically this should be multi-label, but I have a couple questions around testing and evaluating the model:
- In the situation where the model were to produce multiple labels, how would you score that against a dataset of single labels?
- How do you evaluate weight in this case? I.e. Does the score/confidence the model gives mean anything to the metrics? Is the order something I should care about?
- If I shouldn't care about the order, do I just say `all(expected_label in returned_labels for expected_label in expected_labels)` to count as a 'correct' score?
wdym by single labels? like only a binary thing or something else?
I.e. a piece of text being labelled as News rather than [News, Business, Politics] for example
and what does the model output?
hmm?
Atm it only produces a single label, it should (and will) get moved over to a multi-label version though
since single-label doesn't really work
ok im guessing you want your model to be like a hierarchical classification thing? orr am i confused
Precision and recall work in the multi-label case.
Which means you can compute F1-scores as well etc
kinddddddd of, they are IAB categories if you are familiar with them.
So there are effectively 4 tiers of categories like Automotive > Sports cars > Luxury cars
The primary issue we have atm is, as shown by the confusion matrix, that the model generally runs into issues with single labels where the topics/text can easily fall into other categories
You can also make a roc-curve per category (1 vs all) or even per combination (1vs1) if you choose
😅 What is a roc-curve
The absolute GOAT evaluation method, but it's really made for binary classification sadly, you can finesse it into working for more cases.
We are also going to ignore the fact that the current dataset is single label only atm
Imagine you make one of these per label (one versus all) and have a drill-down that business can use to have one per label-label combination
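A ROC curve plots the true positive rate against the false positive rate as the decision threshold moves, and the area under it (AUC) summarises it in one number. A one-versus-all version per label could be sketched like this with scikit-learn; the labels and scores below are made up for illustration.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical one-vs-rest data: for each label, 1 means the text truly has that
# label and the score is the model's confidence for it.
per_label = {
    "News":     ([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80]),
    "Business": ([0, 1, 0, 1], [0.20, 0.90, 0.30, 0.70]),
}

aucs = {}
for label, (y_true, y_score) in per_label.items():
    fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
    aucs[label] = roc_auc_score(y_true, y_score)       # area under that curve
    print(label, "AUC =", round(aucs[label], 2))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the label from the rest.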
Make sense?
Indeed, there's other ways. It largely depends on the amount of categories you have tbf
If you want a single number I'd look towards precision and recall. Are you familiar with them?
not really, I do not typically do AI
Precision = (True Positives) / (True Positives + False Positives)
Recall = (True Positives) / (True Positives + False Negatives)
right
Note that TP + FP => the amount of times your model said it was true, so precision means "the proportion of times it was really right when your model said it was right"
We can do the same for recall: TP + FN = The amount of times it was really label A, so recall means "the proportion of times your model said it was A when it should've been A"
Question, how does that get calculated when we say "Model says it is [A, B, D, E] but it should be [A, B, C]"
each label is just another hit/miss count?
Good question! You calculate the precision and recall for each category and then you have different strategies to combine it. The most obvious one is averaging.
yes, you'll have to calculate TP, TN for all classes, iirc, a One v All thing
You should think of the computation of the precision and recall more per batch (set of rows) than per instance (set of columns)
[A, B, D, E] vs [A, B, C]
[A, B] vs [A, B]
[C, B, D] vs [A]
you compute the precision and recall for A, B, C, D and E over these rows and you take the average.
I'm a bit lazy but I can write it out in full if you want 🤣
😅 I think it's fine
Second question, has anyone tried llama2 for text classification?
Notice how we can drill down again: you can combine Precision and Recall into something called the F1 score (basically their harmonic mean). From an F1 aggregated over all classes you can drill down into P and R over all classes, and then drill down again to P and R per class.
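Written out in full for the three example rows from earlier (predicted labels on the left, expected on the right), the per-label computation and the macro average might look like the sketch below. Treating a 0/0 ratio as 0 is an assumed convention, and it matters here because D and E never appear in the expected labels.

```python
# (predicted labels, expected labels) for the three example rows.
rows = [
    ({"A", "B", "D", "E"}, {"A", "B", "C"}),
    ({"A", "B"}, {"A", "B"}),
    ({"C", "B", "D"}, {"A"}),
]
labels = sorted(set().union(*(pred | true for pred, true in rows)))

def safe_div(num, den):
    # Convention assumed here: a 0/0 ratio counts as 0.
    return num / den if den else 0.0

scores = {}
for label in labels:
    tp = sum(label in pred and label in true for pred, true in rows)
    fp = sum(label in pred and label not in true for pred, true in rows)
    fn = sum(label not in pred and label in true for pred, true in rows)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    scores[label] = (precision, recall, f1)

# Macro average: the plain mean of the per-label values.
macro = [sum(s[i] for s in scores.values()) / len(labels) for i in range(3)]
for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"macro: precision={macro[0]:.2f} recall={macro[1]:.2f} f1={macro[2]:.2f}")
```

scikit-learn's `precision_recall_fscore_support`, applied to label sets encoded with `MultiLabelBinarizer`, computes the same kind of per-label and averaged scores without the manual counting.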
Does the order of the labels returned matter btw?
No
ty
How to avoid overfitting? Or trying to make it so the model overfits?
Early stopping
Its a dengue prediction project
https://www.kaggle.com/discussions/general/91461 I am currently using this dataset as I didnt find any other dataset online
Typhoid and Dengue Fever Symptoms Dataset.
avoiding overfitting...
I more of want accurate results
What do you use the most? LogisticRegression? Random Forest overfits so often
You could also just use statsmodels and run all of the features (if there are not too many) and see what the R-squared score is if it is not discrete.
https://github.com/nickkatsy/python_ml_ect_/blob/master/hmda_updated.py I put 200 hours into that dataset due to all of the NAs, having to read the HMDA act in Boston, and then I just used logistic regression. Maybe just drop columns that don’t have to be there as well.
you can try it. be prepared for a world of prompt engineering depending on your task in which you might as well have used classical ML, again depending on your task
input size torch.Size([1, 4, 320, 320]) not equal to max model size (1, 3, 320, 320)
pls send help i am under the water
Is there class imbalance issue in the data?
I would go with Statistics again (maybe because that's my 1st major), then Computer Science, and then Mathematics comes last for me.
So long as you've done your due diligence, weighed your options, and of course, happy with your decision, then go for it.
The issue is in the second dimension where you have 4 and 3. What are you trying to do?
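If the extra channel turns out to be an alpha (transparency) channel, which is only a guess from the 4-versus-3 mismatch, dropping it is a one-line slice. The sketch uses NumPy to stay self-contained; the same slicing syntax applies to a `torch.Tensor`.

```python
import numpy as np

# Stand-in for the failing input: batch of 1, 4 channels, 320x320 (NCHW layout).
batch = np.random.rand(1, 4, 320, 320).astype(np.float32)

# Keep only the first three channels. Assumption: the fourth channel is an alpha
# (transparency) channel that the model was not trained on.
rgb_batch = batch[:, :3, :, :]

print(rgb_batch.shape)
```

If the images are loaded with Pillow, calling `.convert("RGB")` on the image at load time would avoid the extra channel in the first place.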
isn't statistics part of mathematics?
and computer science would be about learning how to code which I already know how to
They’re different majors though. Pick a university, and look at the degree offerings and course requirements
They can be likened to two peas in the same pod, but they are quite distinct.
well yeah but you said that you would do mathematics last and you'd go with statistics first so thats why I was confused
For example https://www.harvard.edu/programs/statistics/#undergraduate Vs. https://www.harvard.edu/programs/applied-mathematics/#undergraduate vs https://www.harvard.edu/programs/mathematics/#undergraduate
Harvard University is devoted to excellence in teaching, learning, and research, and to developing leaders in many disciplines who make a difference globally.
all im trying to understand is that how would going with pure mathematics help me with machine learning and ai, because I already know how to code and if I'd be really good at math then it would be a lot easier to understand new topics in machine learning or even do something.
Statistics and Math are kinda related but they are two different fields. Mathematics is a broad discipline that deals with numbers, quantities, shapes, and abstract structures. It includes various branches such as algebra, calculus, geometry, and more, providing the theoretical foundation for many other fields, including statistics.
Statistics specifically focuses on data. It involves collecting, analyzing, interpreting, and presenting data to infer proportions in whole from those in a representative sample. While it utilizes mathematical tools and principles, statistics is often more applied, using mathematical theories to solve real-world problems related to data analysis. It's a discipline heavily grounded in real-world applications, including decision-making in business, health, politics, and other domains.
For more context and better explanation: https://www.quora.com/What-is-the-difference-between-mathematics-and-statistics
Answer (1 of 62): Frequently, when people hear the word "statistics", they think of using formulas and spreadsheet programs to analyze data by measuring as many things as possible about the samples being studied. The reason why I don't quite fit in either category is that I don't work with statis...
hmm, then i dont really know what to go with
I got into DS with a economics degree so I do not know
Read the attached Quora page for more clarity. The goal is not to confuse you though.
Meanwhile, there's a lot more you could learn from a Computer Science major other than coding (even if you already know how to code). Just do your own research and finding then go with the one that appears more interesting / fun to you.
At the end of the day, this field isn't discriminatory. So you could even study Actuarial Science, Zoology, Soil Science, or even Human Kinetics (a.k.a 'jumpology' 😀 ) and still end up in Data Science.
I want to add for clarification that I added a bunch of math on top of it and went far out of my way to get better at a bunch of the aspects of DS and programming on the side and had to teach myself a lot of stuff. But it is my favorite thing ever.
all i know is that im very interested in ai and machine learning and I love mathematics
Gotcha
Many schools are now offering DS programs, which are like a hybrid of CS and Stats.
But I don't know UK.
All I will say about majoring in pure mathematics is that if you are using it for nothing, you are wasting your time. Just do what you like and you will be fine.
More so, with a Statistics degree (so long as you end up in a well-to-do school) you're most definitely gonna have math core courses and electives where you'll learn Calculus, Linear Algebra, Laplace Transform, etc. To me, it's like using 1 stone to kill two birds.
With a core Maths program, you probably won't go as far as covering Inferential Stats, Econometrics, Operations Research, Time Series, Gambler's Ruin, Stochastic & Maximum Likelihood Estimation, ANOVA & MANOVA, and all that interesting stuff in Stats.
But don't take my word for it. Do your own research and go with the program that you think will give you wings like red bull 😀
At the end of the day, whether you do Animal Husbandry, or Urban & Regional Planning as your 1st major, you can still get into Data Science.
Yeah that's what im trying to avoid, i wanna learn mathematics that is very useful for machine learning
hmm, that's really good to know
I mean yeah, this sounds like something that I can actually use for machine learning and data science. The problem with pure mathematics is that you learn a bunch of other stuff that you won't ever use, unlike statistics, which afaik is the foundation of ml
ohh really ? @echo mesa
wdym?
stats seems more useful than the crazy complex math
yeah
im drowning in an ocean of math ... not knowledgeable enough to see the horizon
are you in uni or smth?
just messing with multicore mcus and python , wondering if mojo-python is workable
is pytorch good to learn as an intermediate
It depends on what you are trying to do and what an intermediate is.
how do i
df.rolling(100).groupby
>> AttributeError: 'Rolling' object has no attribute 'groupby'
😦
let's "zoom out". what are you trying to do?
"learning libraries" is not a good approach to learning ML. None of them are end-to-end tools (pretty much anything will involve two or more of them), and they're designed from the assumption that you already understand what you're trying to do.
Don't use ChatGPT to answer questions.
hey i am trying to define my problem as simple as possible to understand
that's okay.
if you try to simplify your question to only "how do I do groupby on a rolling operation?", the answer is "you can't", and that's it--there's no way around it. So we have to take a step back and understand what doing groupby on a rolling was intended to accomplish, so we can figure out what pandas functionality is available for that.
yea that's why now i am trying to define my problem as simple as possible again 😄 give me a second ha
it's the first thing that came to my mind, knowing how wonderful python and the pandas api can be i thought that would be so straightforward, seems logical
# original dataframe
df = pd.DataFrame({
'x': ['a', 'b', 'c', 'd', 'e', 'e', 'b', 'c', 'd', 'a'],
'y': [ 0, 1, 2, 3, 4, 4, 3, 2, 1, 9]}
)
# grouper
gr = df.groupby('x')['y']
# operations
assert gr.sum().idxmax() == 'a'
assert gr.sum().max() == 9
# desired output of the rolling 2 operations
# z = x where y is max
# s = sum of y where x is max
pd.DataFrame({
'x': ['a', 'b', 'c', 'd', 'e', 'e', 'b', 'c', 'd', 'a'],
'z': [ pd.NA, 'b', 'c', 'd', 'e', 'e', 'e', 'b', 'c', 'a'],
's': [ pd.NA, 1, 2, 3, 4, 8, 4, 3, 2, 9],
})
now when i think of it it would be very interesting how to simplify that question i am bad at that
I feel ChatGPT is fine if you need to do minor things and need a solution.
sure, but if people want to use ChatGPT, they can do so, and they don't need someone on Discord to copy/paste answers for them.
And if the person doing the copy/pasting doesn't actually understand the question--let alone the answer--and can't verify if the answer is correct or help the asker understand it, then they've just made things worse.
I agree
People who rely solely on ChatGPT are not better off. It cannot work through an entire dataset for you and make sense of it on your behalf. I only use it to add little things that I cannot think of for a project. Even then, if I didn’t know exactly what to ask it, it would do nothing of any real significance
And openAI is wrong a lot as well
OpenAI is a company. Are you talking about them, or ChatGPT?
Sorry, I mean ChatGPT
@tender sparrow did you try df.groupby('x')['y'].rolling(2).max()?
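Note that `df.groupby('x')['y'].rolling(2)` rolls within each `x` group, whereas the desired output above looks like a window over consecutive rows with the groupby applied inside each window. Assuming that reading is right (window of 2 rows, `z` is the `x` with the largest summed `y` in the window, `s` is that sum), a plain loop reproduces it. The function name `rolling_group_max` is made up for this sketch, and the loop is not vectorised, so it may be slow on large frames.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c', 'd', 'e', 'e', 'b', 'c', 'd', 'a'],
    'y': [0, 1, 2, 3, 4, 4, 3, 2, 1, 9],
})

def rolling_group_max(frame, window):
    """For each row, group the last `window` rows by x, sum y, and keep the top group."""
    rows = []
    for end in range(len(frame)):
        start = end - window + 1
        if start < 0:
            rows.append((pd.NA, pd.NA))  # not enough rows yet for a full window
            continue
        sums = frame.iloc[start:end + 1].groupby('x')['y'].sum()
        rows.append((sums.idxmax(), sums.max()))
    return pd.DataFrame(rows, columns=['z', 's'], index=frame.index)

result = rolling_group_max(df, 2)
print(pd.concat([df['x'], result], axis=1))
```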
umm sorry what does that mean?
can someone help me understand this problem i got the answer to be 2 data samples but it was incorrect.
i found id3 and id7 to be classified as c_1 based on the formula
Logistic Regression
I mean the response variable (y) in your dataset, does it have a balanced class-label.
Presuming your dataset's target variable is y, and it contains 2 classes (1s and 0s), confirm whether the number of times 1 and 0 appear is roughly equal.
No...
For 1's its 72
and for 0's its 32
Any of you ever use academictorrents.com ?
what is it for?
Since the two class labels don't have the same proportion, there's class imbalance in your data.
So, I'd suggest using stratified random sampling first (use stratify=y when splitting your data with train_test_split) and see how the model performs before trying any of the approaches I suggested earlier
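As a rough illustration of what `stratify=y` does, the sketch below rebuilds only the class counts mentioned above (72 ones, 32 zeros) with a placeholder feature column; the 25% test size and the random seed are arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the class counts from the discussion: 72 ones and 32 zeros.
y = np.array([1] * 72 + [0] * 32)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Both splits keep roughly the same share of 1s as the full dataset.
print("full: ", round(float(y.mean()), 3))
print("train:", round(float(y_train.mean()), 3))
print("test: ", round(float(y_test.mean()), 3))
```

Without `stratify`, a small test set can end up with a noticeably different class mix from the training set, which makes the train and test scores harder to compare.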
Getting datasets for research/analysis. If a company doesnt want that dataset to be easily available, then you might have to torrent it
Hey guys, can you recommend me some NLP project to make so I can better my resume, I have only tried the health care chatbot and study chatbot project only
Is there any pre-trained LLM that can process a PDF document without converting it to text? Once it gets converted to text or an image, the generated text is not accurate for the images or table structures in the PDF. If you know of any such LLM, or any NLP or document-processing model that I can ask questions about the document, please help.
No, there is no way to pass a PDF file to an LLM without first converting it to text. If you find a tool that purports to do that, it is guaranteed to be converting it to text under the hood.
what do you need to do with the images and tabular data? and are these images of text?
Seems like it's torrents of stuff you can download freely anyway (responding to #data-science-and-ml message)
there are tables like on what input i want to convey a particular image like that and also what structs i need to implement with their particular bits required
there are probably tools for serializing tabular data as text that can be passed to an LLM. And for images, there are probably models that can convert images to natural language descriptions (like the opposite of what Midjourney et al do).
Oh okay cool
idk, I'm trying to get archived reddit data, and in July reddit started cracking down on services like pushshift that used to give it out for free
hey guys, anybody with reinforcement learning experience, especially in proximal policy optimization? I'm dealing with convergence issues for the cartpole v1 environment
?
Hey,
Has anyone here reviewed the following book ?
Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow.
It's a great book
That’s nice to know, I’ll be starting it soon.
In your opinion, will it be better to go over this book to learn about AI, or go for an online course ?!
Yeah, you should have some Python background otherwise it might be really challenging.
For me, I use the book alongside other resources I find online like YT vids. etc.
Alright, thank you!
If it’s possible for you, could you dm some useful resources whenever it’s easy for you!
nobody?
im trying
not gonna lie I don't get what your s is meant to be at all
hello folks! i have a question about webscraping with python, can i ask that here?
this channel would be more about using the data once you have scraped it, so you'd want to use a general help thread. Be sure that the website you're scraping is okay with that in their terms of service
I'm trying to run the ppo pytorch code from here https://pytorch.org/rl/tutorials/coding_ppo.html and I finally got it working, but now I want to retrofit it to some of the other gym mujoco environments. When I change this line
base_env = GymEnv("InvertedDoublePendulum-v4", device=device, frame_skip=frame_skip, render_mode="human")
to this
base_env = GymEnv("Humanoid-v4", device=device, frame_skip=frame_skip, render_mode="human")
I get
```
  File "C:\python\reinforcement_learning\ppo\pytorch_ppo.py", line 130, in <module>
    collector = SyncDataCollector(
                ^^^^^^^^^^^^^^^^^^
  File "C:\Users\mj\miniconda3\Lib\site-packages\torchrl\collectors\collectors.py", line 677, in __init__
    self._tensordict_out.unsqueeze(-1)
  File "C:\Users\mj\miniconda3\Lib\site-packages\tensordict\tensordict.py", line 2184, in expand
    d[key] = value.expand((*shape, *value.shape[-last_n_dims:]))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: expand(): argument 'size' (position 1) must be tuple of ints, but found element of type float at pos 0
```
my full code is here: https://paste.pythondiscord.com/KBVA
obviously I need to change the shape of something like the observations or the action space or something, but I'm not sure what to change it to. any ideas?
Hello, I am a beginner to data analysis. Does data cleaning come before or during exploratory data analysis?
You need to perform EDA in order to see how the data should be cleaned. Once you have created your cleaning steps, continue with your analysis
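In pandas terms, "explore first, then clean" might look like the sketch below. The data is invented, and the cut-off of 120 for a plausible age is an arbitrary rule chosen only for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value, a duplicate row and an impossible age.
raw = pd.DataFrame({
    "age": [25, 31, np.nan, 31, 240],
    "city": ["Lagos", "Dublin", "Boston", "Dublin", "Lagos"],
})

# Explore first: these checks show what needs cleaning.
missing_per_column = raw.isna().sum()
duplicate_rows = raw.duplicated().sum()
print(missing_per_column, duplicate_rows, raw["age"].describe(), sep="\n")

# Then clean based on what the exploration showed.
clean = (
    raw.drop_duplicates()
       .dropna(subset=["age"])
       .query("age <= 120")
       .reset_index(drop=True)
)
print(clean)
```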
Hello, i am trying to build a cnn based ml model. Anyone who has any experience in same, please dm me.
Just tried that out but idt there was any change
except for the graph it got changed to this
Oh I see. I'm not seeing an overfitting problem here. After about +90 epochs, you can see the model converged quite well.
Do you mind sharing the old learning curve?
Evidently, the stratified random sampling you performed certainly improved the model's performance. The model is overfitting here.
Based on your last experiment, you can try to further improve the performance. Try hyperparameter tuning.
It's usually not so common for the test to outperform the train; although not impossible. Could be due to having small sample size in the test compared to the train.
With more experimentation, I'm certain you can squeeze out more performance from your model
hey can someone advise on how to start with ai, im an intermediate python programmer and im pretty good at mathematics. just want to learn about ai before college and not get left behind
Working in Pyspark and have made a mess of code which runs but is slow/inefficient:
```python
for row in collected_loci:
    studyLocusId = row['studyLocusId']
    study = row['studyId']
    sumstats = spark.read.csv(f'gs://finngen-public-data-r9/summary_stats/finngen_R9_{study}.gz', header=True, sep='\t')
    df_locus = (
        df
        .filter(f.col('studyLocusId') == studyLocusId)
        .select('studyId', 'fi.chromosome', 'fi.position', 'fi.referenceAllele', 'fi.alternateAllele', 'idx'))
    gnomad_count = df_locus.count()
    df_locus = (
        df_locus
        .join(
            sumstats,
            (df['fi.chromosome'] == sumstats['#chrom']) &
            (df['fi.position'] == sumstats['pos']) &
            (df['fi.referenceAllele'] == sumstats['ref']) &
            (df['fi.alternateAllele'] == sumstats['alt']),
            'inner')
        .sort('idx'))
    matched_count = df_locus.count()
    df_locus.write.csv(f'output/{studyLocusId}/locus.csv', mode='overwrite')
    count_df = spark.createDataFrame([Row(gnomad_variant_count = gnomad_count, matched_variant_count = matched_count)])
    count_df.coalesce(1).write.mode('overwrite').option('delimiter', '\t').csv(f'output/{studyLocusId}/metadata.tsv')
    idxs = [x['idx'] for x in df_locus.select('idx').collect()]
    bm_filtered = bm.filter(idxs, idxs)
    bm_numpy = bm_filtered.to_numpy()
    bm_mirror = bm_numpy + bm_numpy.T - np.diag(np.diag(bm_numpy))
    np.save(f'output/{studyLocusId}/ld', bm_mirror)
```
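The `bm_mirror` line in the code above reads as a symmetrisation step. Assuming `bm_numpy` holds only the upper triangle plus the diagonal (an assumption, since the matrix itself is not shown), a small NumPy check with made-up values shows what the expression does:

```python
import numpy as np

# Small upper-triangular stand-in for bm_numpy (values are made up).
upper = np.array([
    [1.0, 0.2, 0.5],
    [0.0, 1.0, 0.3],
    [0.0, 0.0, 1.0],
])

# Adding the transpose fills in the lower triangle but doubles the diagonal,
# so the diagonal is subtracted once.
mirrored = upper + upper.T - np.diag(np.diag(upper))

print(mirrored)
```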
Looking to see if there are any ways to speed up the execution other than the obvious things like removing the count() calls.
My initial attempt was to refactor it into functions so that it can be parallelized instead of running iteratively. But this gave me serialization errors due to calling sparkContext within the function.
how large is the original df, how large are each of the sumstats and do you have any duplicate studyId / studyLocusId? (for the last question, not duplicate pairs, but just non-unique values within a column)
also shouldn't you be using df_locus instead of df inside of the join? in the df['col'] == sumstats['col'] lines
Okay cool
Thank You Veryyy Much!!
Any other ways I can try on for improvement on the model?
The original df is about 20,000 rows.
each sumstat file is ~6m rows but I only need a small subset of that.
Every studyLocusID is a unique hash, but there are approximately 1,000 unique studyIDs
Yeah that's a mistake. Although df_locus inherits column names from df, so I guess it doesn't make a difference? I'll change it anyway
Surprised it didn't throw an error from that tbh, but there you go
it could be making the comparisons based on the original dataframe instead of the filtered one? which would slow it down quite a lot ; not sure though, I don't really know how spark works, haven't used it (I'm only used to pandas)