#data-science-and-ml
for lipschitz continuous functions, this gives you an upper bound on the change of magnitude of the gradient for any delta v
What is the lipschitz?
ok, since that doesn't ring a bell, try following the suggestion in the problem. you're told to minimize grad C times delta v (try minimizing its magnitude) with the constraint that delta v has magnitude epsilon
you should be able to set this up in lagrange form
I think I need a deeper understanding of algebra to understand this.
Thank you for your help. I will analyze it one more time before deepening my understanding of the relevant algebra.
in fairness i'm giving you calculus suggestions instead of algebra ones. i'm guessing C here is scalar valued. set up the cauchy schwarz inequality for grad C times delta V, and try to show any other choice of delta v yields a larger value
What do other choices entail?
−η was a variable chosen without explained reasoning, so would that same logic apply for picking any "choice"?
Could any choice introduce a variable?
my hint to you is: delta v is parallel to grad C if you use the expression you were given. what does cauchy schwarz tell us?
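The hint above can be written out as a short derivation. This is only a sketch, assuming C is scalar valued and grad C is nonzero; the relation between η and ε in the last line is an assumption about how the two constants in the problem are connected.

```latex
\begin{align*}
|\nabla C \cdot \Delta v| &\le \|\nabla C\|\,\|\Delta v\| = \epsilon\,\|\nabla C\|
  && \text{(Cauchy--Schwarz, with } \|\Delta v\| = \epsilon\text{)} \\
\nabla C \cdot \Delta v &\ge -\epsilon\,\|\nabla C\|
  && \text{(lower bound for every admissible } \Delta v\text{)} \\
\Delta v &= -\epsilon\,\frac{\nabla C}{\|\nabla C\|} = -\eta\,\nabla C,
  \quad \eta = \frac{\epsilon}{\|\nabla C\|}
  && \text{(the choice that attains the bound)}
\end{align*}
```

Equality in Cauchy-Schwarz holds only when delta v is parallel to grad C, so any other delta v of magnitude epsilon gives a strictly larger value of grad C times delta v.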
Okay. I will try again thank you
Hello, someone pls help. My professor wants us to create an AI model, or anything that could solve questions on the screen. Here is what he said: "A test on google forms or something like that will be posted the goal is for the script to read the screen, move the mouse and answer survey type questions"
Is there any way I could do this?
Any form of help would be greatly appreciated
Does anyone have experience with CNN in tensorflow?
always ask your actual question right away, rather than hinting at the topic and waiting for a commitment
so I am trying to create a CNN model that takes in a face image of a person and predicts the age of that person. But the model is overfitting. Here is my model structure :
```python
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

inputs = Input(input_shape)  # input_shape is defined elsewhere, e.g. (height, width, 3)
# Convolutional layers
conv_1 = Conv2D(32, kernel_size=(3, 3), activation='relu')(inputs)
maxp_1 = MaxPooling2D(pool_size=(2, 2))(conv_1)
conv_2 = Conv2D(64, kernel_size=(3, 3), activation='relu')(maxp_1)
maxp_2 = MaxPooling2D(pool_size=(2, 2))(conv_2)
conv_3 = Conv2D(32, kernel_size=(3, 3), activation='relu')(maxp_2)
maxp_3 = MaxPooling2D(pool_size=(2, 2))(conv_3)
flatten = Flatten()(maxp_3)
# Fully connected layers
dense = Dense(128, activation='relu', kernel_regularizer=l2(0.01))(flatten)
dropout = Dropout(0.4)(dense)
# Feed the dropout output (not `dense`) into the final layer so dropout actually takes effect
output = Dense(1, activation='linear', name='age_out')(dropout)
model = Model(inputs=[inputs], outputs=[output])
model.compile(loss='mean_absolute_error', optimizer='adam', metrics=['mean_squared_error'])
```
I am pretty new to CNNs so any help on how to not make the model overfit would be appreciated. I am training it on 20,105 images and here is the accuracy and error plot.
Hey guys, I'm reviewing the calculations around Variational AutoEncoders and I was wondering... What is the difference between Mean Absolute Error and Kullback-Leibler Divergence?
I know that I can't use MSE in the Encoding Loss because of optimization problems (the MSE would imply a term that I can't know for sure, while KLD allows that term to be eliminated). But how about MAE? It would be basically the absolute value of a subtraction between two distributions.
(I think Edd explained this to me before, but I don't remember...)
Oh, nevermind. I've remembered that the difference between distributions isn't just about a subtraction between numbers, but also between the area underneath them, since it's a subtraction within an integral...something that doesn't happen in MAE.
Now I'm a bit confused over which KLD version I should use... I know that KLD has a formula for discrete variables, and another for continuous variables.
Usually, I see people using the discrete version for VAEs to generate images, but...since I'm dealing with images, pixels...shouldn't I use the continuous version?
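For context on the discrete vs. continuous question: in a standard VAE the KL term is taken over the latent variable, which is usually modelled as a continuous Gaussian even when the pixels are discrete, so the closed-form Gaussian KL is what typically appears. A minimal sketch, assuming a diagonal Gaussian posterior and a standard normal prior (the function name is made up for illustration):

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

# The KL is zero when the posterior already equals the prior,
# and grows as the mean moves away from zero.
print(kl_diag_gaussian_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
print(kl_diag_gaussian_to_standard_normal([1.0, 0.0], [0.0, 0.0]))
```

The integral from the continuous definition is already solved analytically here, which is why no explicit integration shows up in the loss.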
i want to make a programme which detects elephants from an img / vid / webcam using one of the best models on an Arduino. i dont have a gpu and want something lightweight, how do i make it
I've been looking at game theories and algorithms and I'm considering making a tic tac toe bot with multiple difficulties. I found some minimax code for one that runs in command prompt, but it's synchronous. Does anyone know of one written async?
If not then I'm gonna dive into it myself lol
Guys, my company is attempting to experiment with using models like llama to predict big datasets in place of statistical modelling. What do you think about the feasibility of this?
I'm trying to say it's a bad idea on anything larger than a thousand rows...
Pretty sure a larger dataset would yield more accurate results, no?
you'd rather use gpt to analyse a large table? for predictive modelling?
how is that as reliable as normal means
What do you mean by "predict big datasets"?
Imagine you have a table of data that youd normally use a random forest on
to predict y variable
they're feeding that data as string input to a llm
to see how it performs
that honestly sounds like a recipe for disaster.
you can explain what a random forest does and how/why it works
likewise, you can explain how/why a neural network specifically designed for your task works
same can't really be said for an LLM that's trained for other purposes - sure it could spit out things that resemble a reasonable output, but you can't reliably foresee when or understand why it breaks down (i.e. underperforms your baseline model)
that's my two cents anyway.. i am very reluctant in fully embracing LLMs, maybe i should get with the times...
agree
Agreed. You won't be able to explain/interpret the predictions properly
Hey I'm kind of new to open source contributions and GitHub but I had been working on one project for some time and recently finished uploading it to GitHub. Feel free to have a look and let me know if I should do something differently. If there are other ideas you can suggest, that'll be appreciated too. Thanks. https://github.com/Nik-code/nlp-chatbot
The ultimate black box
Hi everyone,
I could really use some assistance with our project. I'm looking to utilize a drone for ground crack detection, but I'm not quite convinced that the camera on the Tello EDU drone is up to the task.
Could you kindly share any recommendations for a programmable drone with superior camera quality?
I greatly appreciate your help in advance! 🙏
would be interesting to see how it behaves when one applies shapley values to make it explainable
I just discovered that GANs aren't black boxes...they're black holes 
Well if we use an LLM for predictions, we can use it to generate SHAP values too 😂
Also guys, a small confusion around information theory in VAEs:
The Encoder, by compressing the data, tries to extract the most relevant features in the data, right? So, the smaller the latent space, the more selective the Encoder must be, fewer features are extracted, and the more specific the latent space gets for each input?
While the Decoder, by decompressing the data, must create new information based on the most relevant features extracted by the Encoder? All this while trying to recompose the original input data?
So, the Encoding process extracts most relevant features and loses information. The Decoding process tries to recompose (create) information based on such relevant features?
Pretty much. This is generally what an autoencoder attempts to do - reconstruct the input through a bottleneck. A variational AE differs by generating a latent distribution which you sample from and pass to your decoder
Hm... I see... I was reading a paper here and I'm thinking that, since the VAE generates a latent distribution (not just a latent variable, like AE), then it's more insensitive to the dimensions of the Encoder output.
I'm even testing this right now in Colab... I was using a VAE with the Encoder generating means and standard deviations with 128 dimensions (Batch, 128), and now I'm testing with 16 dimensions.
I think I'm beginning to understand now... The Encoder output in a VAE is not exactly the features, but rather a distribution, the latent space, and the size of such output is not the number of features, but simply the number of dimensions of this latent space, this distribution. Since this distribution ranges from -inf to inf, then all input features could be fit into this latent space even if it has just one single dimension.
At least I think this is a reasonable explanation 
Curious...then I could pass 64x64 RGB images to my Encoder and simply make it provide outputs with just 1 single dimension...
Optimeezachon! 
hi
i need help
I am building an artificial intelligence model to detect age and gender
I tried to make my own dataset
So, I collected photos and saved their required specifications in a file
When I was labeling the photos, I noticed that the labels were not compatible with the photos
import pandas as pd
import numpy as np
import os
import tensorflow as tf
import glob
data = pd.read_csv("data.csv")
data['Gender'].replace(['male','female'],[0,1],inplace=True)
data['Age'] = data['Age'].replace(['15_22', '23_30'], [0, 1]).astype(int)
data.head(13)
# Define image processing functions
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    img = tf.image.resize(img, (150, 150))
    return img
# Prepare image paths and labels
PATH = "./picture"
image_paths = sorted(glob.glob(PATH + "/*.jpeg")) # Sort image paths
images = tf.stack([load_image(image_path) for image_path in image_paths])
labels = data[['Age', 'Gender']].to_numpy()
print(image_paths)
this is my code...
help me
Yep. The output of the encoder are parameters for a multivariate gaussian distribution which you then sample from
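The "sample from" step is usually done with the reparameterization trick. A minimal sketch in plain Python, assuming the encoder returns a mean and a log-variance per latent dimension (the function name and the log-variance convention are assumptions, not something stated above):

```python
import math
import random

def sample_latent(mu, logvar, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

# A 2-dimensional latent: the decoder would receive z, not mu or logvar directly.
z = sample_latent([0.5, -1.0], [0.0, -2.0])
print(len(z))
```

Because the noise enters through eps, the sample stays a deterministic function of mu and logvar, which is what lets gradients flow back to the encoder in a real framework.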
What do you mean that the labels are not compatible with the photos?
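One possible cause, offered only as a guess since the file names and the CSV layout are not shown: `sorted()` orders paths as strings, so `10.jpeg` comes before `2.jpeg`, and the image order can then differ from the row order in the CSV. A small sketch of a numeric-aware sort key (the example paths are made up):

```python
import re

def natural_key(path):
    """Split digit runs out of the path so that '10.jpeg' sorts after '2.jpeg'."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", path)]

paths = ["./picture/10.jpeg", "./picture/2.jpeg", "./picture/1.jpeg"]
print(sorted(paths))                    # plain string order
print(sorted(paths, key=natural_key))   # numeric-aware order
```

If the CSV has a file name column, joining labels to images on that name would be more robust than relying on any ordering.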
I am curious are you building a CNN? Because I am doing a similar problem where the model figures out the age given a face image
i can't imagine that would work well. among many other problems, that data is going to look completely different from what any LLM was trained on and you couldn't expect any kind of reasonable results from that process. you could ask an LLM to build a model fitting pipeline for you though
!code i recommend using code block formatting for sharing code in the future
I recently came across XPhoneBERT and I am trying to train a model to see if two sentences sound similar using the transformers library on Hugging Face: https://github.com/VinAIResearch/XPhoneBERT
I want to create a model using LSTM binary classification.
inputs: sentence_1 sentence_2
output: whether they sound similar or not.
for sentence_1 and sentence_2 I want to pre-pad them.
I have a list of sentences in sentence_1
I tried doing this using:
```tokenizer(sentence_1, return_tensors='pt', padding=True, max_length=100)```
When I do this it looks like it always puts the padding on the end. How do I pre-pad these values?
Another question I have is once I get all the inputs_ids and attention_mask values do I need to run them through the model and how do I do that? If someone could give me a code example of how to do that it would be really helpful.
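On the pre-padding part: Hugging Face tokenizers have a `padding_side` attribute that can be set to `"left"`, which is probably the simplest route. To show what left padding does to the ids and the attention mask, here is a library-free sketch (the helper name and the pad id of 0 are assumptions for illustration):

```python
def left_pad(sequences, pad_id=0, max_length=None):
    """Pad token-id lists on the left; return (input_ids, attention_mask)."""
    if max_length is None:
        max_length = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = list(seq)[-max_length:]          # keep the last max_length tokens
        n_pad = max_length - len(seq)
        input_ids.append([pad_id] * n_pad + seq)
        attention_mask.append([0] * n_pad + [1] * len(seq))
    return input_ids, attention_mask

ids, mask = left_pad([[5, 6, 7], [8]])
print(ids)
print(mask)
```

The attention mask marks the padded positions with 0 so a model can ignore them, whichever side they are on.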
Thanks in advance
Anyone have any good tips on cs majors starting their first data science courses. Any tips in particular in getting familiar with libs like pandas, numpy, matplotlib
Whatever you're trying to do with numpy or pandas, resist the temptation to use a loop or .apply as much as you possibly can, and look for a solution involving neither in the docs.
There's no helping you with matplotlib because it sucks. I still don't understand it.
there is tutorial on pandas for grandma on youtube. thats pretty neat. My suggestion would be to get started on a simple data analysis project - do summarization, some cleaning, look for outliers, do filtering, grouping, aggregation. Cover the basics first and jump into larger projects
Will keep this in mind thank you!
Thanks!
I agree w the other suggestions. Generally practice makes perfect so try small projects that use those libraries. I also suggest familiarising yourself w statistics
Yeah i took statistics last year I got a good grade but the professor sucked so the concepts didnt stick too well.
to avoid overfitting, try two methods: (1) early stopping, which basically means stopping the training as soon as the validation error stops decreasing, and (2) learning rate decay. they will help a lot in preventing overfitting of your model
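The early stopping idea can be sketched without any framework. This is an illustrative helper, not Keras code (Keras ships its own `EarlyStopping` callback); the `patience` and `min_delta` names are chosen here for the example:

```python
class EarlyStopper:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss       # new best: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1       # no improvement this epoch
        return self.bad_epochs >= self.patience

def stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None."""
    stopper = EarlyStopper(patience=patience)
    for epoch, loss in enumerate(val_losses):
        if stopper.should_stop(loss):
            return epoch
    return None

print(stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.9]))
```

A patience of a few epochs avoids stopping on a single noisy validation result.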
lol
are there any resources for absolute beginners in large language models, something that i can start with? Thank you!
books, pdf, youtube series. anything will do. there is so much content out there, it's hard to choose which one to go with
an absolute beginner trying to do what with LLMs?
pretty much any project that would involve an LLM in a non-trivial way would be exceptionally challenging for a beginner, btw.
we all have to start somewhere, right? 🙂 I'm looking for some easy to follow tutorials like corey schafer on python fundamentals. pdfs and books will also work.
I still need to know what you're trying to do with LLMs to make any suggestions.
but one wants to start with something that is achievable with a medium amount of effort, or you'll give up before achieving anything.
Thanks. im not trying to do anything at this point - just learning the basics. The if-else and for-loops of LLMs. The basic building blocks - i have the math background - algebra,calc. I need to start with what comes next.
There are no if-else and for-loops of LLMs.
I would try to first build an understanding of what LLMs are.
I'm sorry if Im not describing it the right way. Neural networks can be considered the basics of LLMs. Im looking for something that start from the basics. and go all the way up. You don't learn about classes or try to understand what classes are when you start learning Python.
if you come up with an example of what you would want to do with LLMs once you understand how they work, I will look for and give you a resource for that.
that's all I can offer for the moment.
got it. thank you for your time. Appreciate it
you might also read about the differences between GPT and BERT @ionic badge
Just wanted to understand industry standards: what's the benchmark for getting 100K records from a table and holding them in memory? I have to apply a couple of functions on top of this tabular data. Right now it takes 15 seconds for me to get the data from Snowflake and fetch it as a Pandas DataFrame.
Is it normal in industry to do this kind of thing?
Anyone who has done this previously, can you elaborate on your tech stack?
I wanted to do this in a request response cycle. I'll send very minimal data to the frontend after computation. (sub second performance)
I'm looking at some data from RAND "smoothed percent of population with high school or equivalent degree" reporting figures in the .2-.3 range for all states... Am I misunderstanding this, is it not reporting the supposed population with HS diploma level education?
The lack of reliable, state-level data on firearm injuries is a challenge for gun policy researchers. As part of the Gun Policy in America initiative, RAND researchers developed a publicly available longitudinal database of state-level estimates of inpatient hospitalizations that occur as a result of firearm injury.
Funnily no one has to understand how LLMs work anymore, every company is throwing resources to applying the black box at all possible use cases
Insanely easy, pip install langchain or openai and read their docs, apis are king rn
this channel is not really online
@noble quail u hv a lot to type i can see
u done
I'm processing some frames of a video with OpenCV and saving to a file with pickle.
I'd like to check if the file sizes I'm getting as a result are reasonable.
I originally have a 640 x 360 video, I sample about 25,000 frames from this video and resize them to 25% of the original size, normalise by dividing by 255 then pickle.dump them in a file which is about 3.25GB - this seems a bit big to me so it'd be really helpful if someone could check that this is reasonable or not.
Thanks!
is it possible to use both?
what's the source video format? how many channels this video has (rgb / gray)? what do you pickle to disk? numpy array or python lists? it's probably some compressed format but you are dumping frames as raw rgb
Iirc videos are normally compressed using techniques that make heavy use of how most videos are organized
Meaning consecutive frames are often very similar to each other, with only minor differences
This allows for better compression than just regular compression on general data
And also depends on if the compression is lossless or not (when you load it back, is it the exact same?)
what's the source video format?
It's mp4, codec H.264
how many channels this video has (rgb / gray)
3 (I think - I'm not sure what this means so referred to this stackexchange answer)
numpy array or python lists?
each frame is a numpy array but I pickle a normal python list of these arrays
!e ```python
print(((640 * 0.25) * (360 * 0.25) * 25000 * 3) / (1024 ** 3))
```
@shrewd shale :white_check_mark: Your 3.11 eval job has completed with return code 0.
1.0058283805847168
I'm kinda rusty on this calculations nowadays, but rest is probably pickle overhead (numpy/list object overhead)
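The estimate above counts one byte per value. Dividing a uint8 NumPy array by 255 produces floats (float64 by default), so the stored arrays may well be several times larger than the uint8 figure. A quick sketch of the raw sizes for a few dtypes, assuming 160 x 90 frames with 3 channels and exactly 25,000 frames:

```python
def raw_size_gib(width, height, channels, n_frames, bytes_per_value):
    """Uncompressed size in GiB of n_frames arrays of shape (height, width, channels)."""
    return width * height * channels * n_frames * bytes_per_value / 1024 ** 3

w, h = int(640 * 0.25), int(360 * 0.25)   # 160 x 90 after resizing to 25%
for name, nbytes in [("uint8", 1), ("float32", 4), ("float64", 8)]:
    print(name, round(raw_size_gib(w, h, 3, 25_000, nbytes), 2))
```

Keeping the frames as uint8 on disk and normalising after loading is one way to keep the file smaller.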
Can't you construct it into an mp4 again?
Probably libraries for that
moviepy or something
did you try np.save?
Ah yeah this makes sense as to why there's a massive inflation - never realised how much of a difference compression made!
It's actually pretty interesting, it also explains some of the weird artifacts you sometimes get when watching a movie
Where the image seems to move but the actual image is wrong until the next key frame
No I wasn't aware of this - will give it a go!
This is also a smart idea! Thanks!
If building a video again is not an option, may be this helps (https://stackoverflow.com/a/41425878/2886047)
I am trying to run predict_qa_alpaca.py on a finetuned llama model and im getting module 'utils' has no attribute 'jload' - any idea what might be causing this?
anyone with a statista account be willing to pull some data for me? want to compare education to gun hospitalizations for a portfolio project
Has someone worked with FinRL before?
I want to find the average holding period of the portfolio and it's average monthly number of trades.
Also, I want to know if is possible to add a rebalance_portfolio argument
so that my portfolio is annualy rebalanced based on financial indicators of each company
Is it possible to reduce the size of individual marker in plotly.express.scatter_geo() ?
I am working on a project in which I need to plot all the terrorists attacks that have happened since 1970 on a map, with the normal size of marker used by scatter_geo it is looking very messy
marker=dict(size=10)
Hi,
There is this book I’ve been reading that says deploying machine learning models on a FastApi is a bad idea and recommends using TensorFlow serving instead.
In my ML pipeline, I have saved my trained model along with the data preprocessing steps to ensure the preprocessing at inference is the same as in training. But to collect my training data, I have a FastAPI app collecting data, which is what the author says we shouldn’t do.
My question: Is there a way to use tensorflow serving to also collect data? This way I wouldn’t have to have a FastApi collecting data? Is tensorflow serving only used for inference?
Thank you all!
I never fully understood the connection between FastAPI and ML. Fundamentally making a prediction is CPU bound and FastAPI is an async framework, so `await model.predict()` does not make a lot of sense.
TF serving to me does more of the "boilerplate" stuff you'd have to roll yourself like loading the model from disc at every request (which technically is async).
Does that mean the FastAPI will “lock” the memory every time the model makes a prediction?
That's not what I meant. FastAPI (and ASGI in general) thrives when you have non-blocking operations. Basically you make a call to load your model from disk and instead of waiting for it to be loaded you're doing other stuff. I guess the vast amount of tutorials make it an OK choice though.
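To make the blocking vs. non-blocking point concrete, here is a standard-library sketch of offloading a blocking call from an async handler. `predict` is a hypothetical stand-in for a model call, and `time.sleep` only simulates the work:

```python
import asyncio
import time

def predict(x):
    """Stand-in for a blocking model call (hypothetical)."""
    time.sleep(0.2)
    return x * 2

async def handle_request(x):
    loop = asyncio.get_running_loop()
    # Offload the blocking call so the event loop can keep serving other requests.
    return await loop.run_in_executor(None, predict, x)

async def main():
    # Three "requests" handled concurrently instead of one after another.
    return await asyncio.gather(*(handle_request(i) for i in range(3)))

print(asyncio.run(main()))
```

Threads only help when the underlying work releases the GIL, which many numeric libraries do; for pure-Python CPU work a process pool would be the closer fit.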
Imo if you're unsure I think you should look at "integrated" tools like MLflow simply because they're opinionated and make a lot of "decisions" for you.
I see, that makes sense!
Using an API to collect data... it depends on your use case
The easiest thing if you're not working in realtime is to store your data somewhere and process in batch.
The application I'm trying to build works in real-time. It's supposed to be a device on the edge that predicts the output of a chemical process
hey hey yall, anyone got any good data sources to practice machine learning models with python? Ive tried searching in kaggle but i dont think i have a good trained eye to select good data sets. im tryna practice my xgboost, random forest parameter setting and optimization skills. also am pretty new to python coding. i was told to browse thru the pins but i cant find anything too specific
If the device on edge is making predictions, do you need an endpoint?
I think I could use tensor flow serving to collect data from the logs, right? Deploy the model with poor inference and train the model as it gathers data
Because I get the data from the SCADA system that gathers the signals from the industrial controllers
The SCADA system supports HTTP
There's a free ai art generator api?
At what frequency is your data coming in/out?
Phew, fast. Half of a second.
So you're sending data to the server every 0.5s?
I'm gathering data with the FastAPI every half of a second
I haven't made any predictions yet :/
But yes, the SCADA system sends data to my FastAPI every half of a second
What happens if you lose internet for 5 seconds
Imo you should look towards websockets considering they're bidirectional and you only open 1 connection. FastAPI supports them (assuming your SCADA system does too) https://fastapi.tiangolo.com/advanced/websockets/
Interesting, yes! I think the SCADA system supports web sockets. Thank you , I’ll look into it
I’ll try to see if I can deploy it as an online machine learning system since the sensors tend to deteriorate over time
Last advice about moving to websockets: if it aint broke don't fix it 🤣. You could keep what you have rn and it can be a lessons learnt for the next project.
Specifically about online ML, I did my thesis about that 🫡 . Again, you don't need to retrain continuously.
A very sane thing to do is to somehow collect the true label after your model has made its predictions, store it somewhere and monitor the performance drift of your model
That way you can actually monitor the performance of "candidate" models next to it and then deploy a new one when you chose instead of doing a gradient descent style predict -> observe y_true -> update continuously because maybe that'll make your models worse
Oh, so constantly training the model actually deteriorates performance and the model should only be retrained when there is data drift or poor predictions?
No, I meant that continuously retraining might make it better, it might make it worse.
I thought about training the model on a timed schedule.
If you're not forced to continuously retrain for some reason (e.g., you're on an embedded system that never persists the data, just has a limited buffer) then I don't see why you should retrain like that
When retraining an online learner, do you use a whole fresh dataset to train the model or do you include some of the old dataset into the new dataset?
That's another thing, it's situation dependent
Sometimes updating the weights online is good, sometimes retraining from scratch with a window that contains old and new data is good (how big should the window be...?) etc
You can try all of these out, put them in a dashboard and select what works.
Do you know a good resource to learn more about this topic? I remember reading someone talk about it in this channel but I’m not familiar with the concept.
Anyone know about a program that can automatically create bounding boxes on images and label them? (YOLO format)
That’s just back prop., no? Taking the derivative of the functions. And doing the forward prop. again.
https://www.seldon.io/machine-learning-concept-drift
https://www.seldon.io/what-is-drift
This company has some great talks, papers and packages.
Aside from that I'd say scouring the internet / google scholar is a good idea.
Getting good with something like MLflow + dagster might be good:
https://docs.dagster.io/guides/dagster/managing-ml
https://mlflow.org/docs/latest/tracking.html#performance-tracking-with-metrics
Dagster schedules and trains a bunch of models at regular intervals with new data and you persist their results in mlflow. Your sensor has likely drifted if the performance of your current prod model is dropping specifically with respect to the ones that you're training continuously. That's when you swap them out.
Awesome, thank you for all the information, it’s been insightful!
Is updating the weights online similar to fine tuning the models with recent data?
Yeah
Oh yeah, you should also just monitor the summary statistics of your input variables over time. That in and of itself tells you if your sensor is drifting 🙂
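A very small version of that summary-statistics check could look like this. It is a sketch only: the 3-sigma threshold and the function names are arbitrary choices for illustration, and real monitoring would likely track more than the mean:

```python
import statistics

def mean_shift_in_sigmas(reference, recent):
    """How far the recent mean sits from the reference mean, in reference std devs."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    return (statistics.fmean(recent) - ref_mean) / ref_std

def looks_drifted(reference, recent, threshold=3.0):
    """Flag a window whose mean moved more than `threshold` reference std devs."""
    return abs(mean_shift_in_sigmas(reference, recent)) > threshold

reference = [10.0, 10.2, 9.8, 10.1, 9.9]        # readings from when the sensor was trusted
print(looks_drifted(reference, [10.0, 10.1, 9.9]))
print(looks_drifted(reference, [11.0, 11.2, 10.9]))
```

Running this per input variable on a rolling window gives a cheap first signal before model performance is even measured.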
is there a discord server just for large language models? thank you
*exclusively for
it wouldn't really make sense to have a discord server just for that.
did you decide what you want to use large language models to do?
it definitely would - you have new models coming out every other week. And there are so many angles to cover. I'll be surprised if no discord server exists just for llms. Right now i am trying a bit of everything: bert type models, fine tuning llama2 - I'm not looking into anything in particular, just browsing around to see what each model can do
out of curiosity, what was the impetus for your interest in LLMs?
just curiosity - I have the resources to try out all of them as long as they are more or less 100B - I meant fine tune
100B? bytes?
billion
I see
Hello, I am trying to become a data analyst. How much of the NumPy library do I need to learn if I want to start working on basic data analytics projects?
not much in my opinion. you're better off focusing on pandas, which has a broadly similar interface and uses numpy internally in many places, but is more useful for general-purpose data analysis
numpy is very useful. but if you're new, focus on pandas first.
Does Pytorch work well with PyPy or any other CPython alternative?
agree, as a Data Engineer that often does Data Analysis, I use Pandas 99% of the time, I rarely need to use Numpy
What I can say about Numpy is that the Pandas documentation often assumes you know Numpy
But both libraries aren't something you "learn" but rather something you do imo.
This kind of question is asked a lot, and I never understood it. I don’t ‘learn a library’ like reading a book: I learn the parts I care about, and perhaps when I use it a lot, I’ll sit down to understand more about how it works. By this kind of question, I mean the: ‘how much of XYZ should I learn’
I'd at the very least read:
https://numpy.org/doc/stable/user/absolute_beginners.html before https://pandas.pydata.org/docs/user_guide/10min.html
Numpy guides will for instance go into detail about what broadcasting is while the pandas stuff name drops it. Which might be confusing for some readers, especially since it's a foundational aspect of working with dataframes.
People differ! 🙂 At the very least I always read the quick start, user guide and one, or more, of the tutorials in the docs. Reading the reference OTOH makes no sense.
Yah, fair, I skim the quick start, write some code, and then go back for more as needed. I just mean the ‘how much?’ As a sort of percentage question. Maybe it’s just how I translate ‘how much’ to a ‘do I really need to learn all of it?’
I see, that's valid. Especially for libraries with colossal APIs like Numpy/Pandas. Nobody knows all of them. Just reading the docs to know how to write it somewhat idiomatically + what features exist on a high level is OK.
Hey there
I have a few issues
my model works fine on colab.google (the basics - object detection using cifar-10)
but once i bring the model in with flask file (by downloading it)
it is not working properly
is there any special reason to avoid cpython here? your pytorch code's heavy routines won't run in python anyway
I dont want to optimize Pytorch per se, but an API that uses it along with other ML/NLP libraries written in pure-Python
oof
are you sure about that? if performance is a problem, i would say you should avoid the pure python libraries
I agree, I always look for libraries that are implemented with Cython/C-extension for this kind of stuff, but a colleague wrote the API and I'm wondering if there is a drop in replacement for CPython that will optimize it significantly
my major concern is if PyPy is fully supported by Pytorch or if there will be small bugs
i don't think it's supported at all
so do you think the code will crash when I import it through PyPy? or more like subtle errors?
it won't work at all
i don't think pytorch can even be installed for it
you can read through here and take a look https://github.com/pytorch/pytorch/issues/17835 but really there is no way you will get any form of good ML/NLP performance with something written in pure python other than for small toy scenarios
🚀 Feature Support pytorch from PyPy -- a fast, compliant alternative implementation of the Python language (http://pypy.org) Motivation While pytorch itself probably won't benefit much from PyP...
oh damnn
I suppose I can try adding type hints and compile it with Cython
you can give that a shot. idk if you find that easier than using a proper machine learning module
at that point you may consider just rewriting it in C(++)
yeah not really my choice
you cant compare the two, one takes literally an hour, and the other...
well, give it a shot with cython and see if you get the performance you want. but really you should think of any ML stuff implemented in pure python as nothing more than a proof of concept that later needs to be rewritten properly. whether you prefer cython or something else, that's up to you
rewriting in pytorch is probably a good idea if you want to be able to use GPUs, for example
otherwise the parallelization is on you
Hey guys, I was thinking here... Considering that CIFAR10 uses 32x32 RGB images, thus, each image has 32x32x3 = 3,072 values, does it make sense if I make a neural network that receives one of such images and tries to extract it into...let's say... 12,000 values?
I suppose that, despite every image in the dataset having the same amount of pixels, each image has a different amount of relevant features...but the number of features can hardly be equal or higher than its number of pixels, right?
I know that the number of parameters in a neural network is some kind of "trial and error" game, but I'm trying to have some idea of the range of the possibilities I can try.
What kind of layer would it be?
Linear
"extracting values" from the input is the foundation of how deep learning works, but usually you'll want to reduce the number of values rather than increasing, both for efficiency and so that you can reduce it into the answer you want to for the model to give you after some layers
how to extract things efficiently is one of the core questions, and the answer to that are all the different layers/architectures like convolution layers or transformers (attention)
A convolutional layer that outputs a/some feature map(s) totaling 12,000 values
Or a fully connected layer
(Or Fully Connected)
That would mean 3,072 * 12,000 weights
or 36 million weight values for a single layer
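The arithmetic above, next to a convolutional layer for comparison. The helper names and the example conv configuration (3 input channels, 32 filters, 3x3 kernel) are picked here just for illustration:

```python
def dense_params(n_in, n_out, bias=True):
    """Parameter count of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)

def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Parameter count of a 2D convolution with a square kernel."""
    return in_channels * out_channels * kernel_size ** 2 + (out_channels if bias else 0)

n_inputs = 32 * 32 * 3                      # one flattened CIFAR-10 image
print(dense_params(n_inputs, 12_000, bias=False))
print(conv2d_params(3, 32, 3))
```

The convolution reuses the same small kernel at every position, which is why its parameter count does not depend on the image size.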
Disregarding the fact that you should probably not use fully connected on images
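For a sense of scale, here is a small sketch of the parameter count for the fully connected case discussed above. The layer sizes come from the messages; the helper name `dense_param_count` is made up for illustration.

```python
def dense_param_count(n_inputs: int, n_outputs: int, bias: bool = True) -> int:
    """Number of trainable parameters in one fully connected layer."""
    weights = n_inputs * n_outputs
    return weights + (n_outputs if bias else 0)


cifar_values = 32 * 32 * 3  # 3,072 input values per CIFAR-10 image
weights_only = dense_param_count(cifar_values, 12_000, bias=False)
with_bias = dense_param_count(cifar_values, 12_000)
print(f"weights only: {weights_only:,}")
print(f"with biases:  {with_bias:,}")
```

A convolutional layer producing the same number of output values would need far fewer parameters, because its kernel weights are shared across image positions.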
Maybe. But I'm trying to test a VAE using FCC layers to see how it goes.
Pretty sure you don't need fully connected layers for an auto-encoder
At least, it seems that the beta-VAE relies on FCC layers and it goes fine...
Oops, VQ-VAE*
Oh, ok. The VQ-VAE indeed uses FCC layers for the encoder and the decoder. But the Decoder output is also passed to a PixelCNN to generate the image 
I wouldn't expect it to
You can try Taichi.
Taichi and Torch serve different application scenarios but can complement each other.
Not significantly, but you can try Nuitka.
Does anyone have a good/easy to approach tutorial for pytorch? I'm trying to provide a tensor of MFCC features as an input and get 4 numbers out, but I don't understand exactly how to structure my data in order to do this
I want to predict where an image was taken (estimate of coordinates), and this is what I was thinking. I was thinking of using geohashing, getting images of the location, and treat each geohash as an output—I’ll basically treat this as a classification problem. What do you guys think? It is a good way to go about this or do you guys have any suggestions, etc.?
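To make the "each geohash is a class" idea concrete, here is a minimal sketch of the standard geohash encoding plus a label mapping. The function name `geohash_encode`, the precision, and the sample coordinates are assumptions for illustration; a shorter hash means bigger cells and fewer classes.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, n_bits = 0, 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        n_bits += 1
        if n_bits == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits, n_bits = 0, 0
    return "".join(chars)


# Hypothetical photo locations; each distinct hash becomes one class index.
locations = [(57.64911, 10.40744), (57.65, 10.41), (0.0, 0.0)]
hashes = [geohash_encode(lat, lon, precision=3) for lat, lon in locations]
class_index = {h: i for i, h in enumerate(sorted(set(hashes)))}
labels = [class_index[h] for h in hashes]
print(hashes, labels)
```

One thing to keep in mind with this framing: a plain classification loss treats a neighbouring cell as just as wrong as a cell on the other side of the planet.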
anyone looking to start a project with me? looking for good developers for salary who are pretty experienced in AI/DS/ML
I want to make a web application that uses different LLMs from different organizations. I'm worried that earlier inputs won't be used in the attention mechanism, since they are different LLMs. is it possible to retain the attention histories across the LLMs? in other words, use the previous attention from one LLM in another
i have a list of 2-tuples
each tuple contains one number thats a good approx of an unknown, and the other is far away
how do i collect the good number from all tuples so i can take their mean
I'm doing face recognition using cv2 dlib and face_recognition; is this performance normal??
The different videos show face recognition (not detection) performed every 100 frames and on every frame
do you know which item of each tuple is the good one? btw this sounds like something that'd benefit from numpy instead of lists and tuples
i dont thats the hard part
do you know anything about the statistics of the problem? are all of the tuples different realizations of the same random process?
i have an array of sensors in a known arrangement, and im supposed to estimate the angle from which a signal arrives from the relative phase shift between pairs
for a given pair of sensors, for the same shift, theres two possible directions the wave may have come from
ah a DOA problem
yea
but this is super different. you have a parametric model
the front-back problem is most easily solved by restricting your geometry
if you have only e.g. a uniform linear array and waves could really arrive from both directions, i'm not sure there's a good way to do this
its not in a ula its on a circle
then this shouldn't be much of a problem i think
you can't use plane wave models in this case anyway
ive been told to lol
hmmm i mean, i guess you could, if the array is super far away from the sources
a little weird
but yeah since the sensors are now not in a line, you have extra geometric info
i would think a standard migration/correlation/synthetic aperture focusing would already give you a nice spatial correlation map
i spent a... non negligible amount of time trying to get the angles right but i get muddled up with the signs and 180 +-s
(trying to do it on paper i mean)
i don't think this is one you can do on paper
or should, at least
none of the nice subspace methods work with a circular array
i'm 99% sure you'd have to use a more sophisticated estimator or do some matrix products you do not want to do by hand
no im not doing the computation by hand i meant
just converting the angle i get from a pair into an angle wrt center of circle is really janky
a pair of what?
of sensors
i think so
i don't think that's the best way of doing it, but ok
what i imagine is that, for each pair of sensors, you get an okish estimate of the angle, and another estimate that is reflected wrt the line passing through the two sensors, yeah?
so maybe some form of clustering on the points would work
@past meteor hello sorry about the ping you suggested me a book for timeseries the book was otexts forecasting principles and it was a really nice one i completed it do you have any other suggestions for book on machine learning in similar formats like with video explanation and online pdf
also edd if you have any ideas please suggest
yes
you can try averaging and then removing the outliers or something like that. try plotting a scatterplot
but also, i'm crying in maximum likelihood estimation 😭
hows that work
Dont tell me
i was thinking of sthn like this:
start with a uniform pdf over 0 to 2pi
iterate over pairs
for each pair, make some bellcurve like function that has two peaks at the two angles i get from this pair
update my pdf with this using some bayes rule type thing
take argmax pdf at the end
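The steps above can be sketched roughly as follows. This is only an illustration of the idea, assuming each sensor pair yields two candidate angles in radians; the two-peaked bump (a von Mises style bell curve), the `kappa` sharpness value, and the grid size are arbitrary choices, and all names are hypothetical.

```python
import math


def two_peak_likelihood(theta, cand_a, cand_b, kappa=20.0):
    """Bell-shaped bump around each of the two candidate angles (periodic in theta)."""
    return (math.exp(kappa * (math.cos(theta - cand_a) - 1.0))
            + math.exp(kappa * (math.cos(theta - cand_b) - 1.0)))


def estimate_angle(candidate_pairs, n_grid=720, kappa=20.0):
    """Start from a uniform prior on [0, 2*pi) and multiply in one likelihood per pair."""
    grid = [2.0 * math.pi * i / n_grid for i in range(n_grid)]
    log_post = [0.0] * n_grid  # uniform prior (up to a constant)
    for cand_a, cand_b in candidate_pairs:
        for i, theta in enumerate(grid):
            log_post[i] += math.log(two_peak_likelihood(theta, cand_a, cand_b, kappa))
    best = max(range(n_grid), key=log_post.__getitem__)
    return grid[best]


# Toy data: the true angle shows up in every pair, the mirrored one differs per pair.
true_angle = 1.0
baselines = [0.0, 0.7, 1.9, 2.6]  # direction of the line through each sensor pair
pairs = [(true_angle, (2.0 * phi - true_angle) % (2.0 * math.pi)) for phi in baselines]
estimate = estimate_angle(pairs)
print(estimate)
```

Working in log space avoids underflow when many pairs are multiplied together, and the wrong (mirrored) candidates only reinforce each other if several baselines happen to be parallel.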
i rly need to read some literature on this
you can use bayesian optimization to find a distribution for your data points
i can be totally wrong here tho
what one would do is take all of the data from the sensors together, make a parametric model that depends on the AOA, and then solve an optimization problem where one maximizes the log likelihood (assuming your parametric model describes the mean of a random distribution, probably gaussian if you don't know anything else)
so you'd end up with a nonlinear least squares problem. you could take the estimate you get from your current approach, or from some other method, take it as an initial guess, and then use some (quasi) newton method or gradient method to find the angle
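A minimal sketch of that refinement step, assuming a far-field plane wave hitting a circular array and Gaussian noise on the measured delays (so maximum likelihood reduces to least squares). The delay model, the `radius_over_c` scaling, and the function names are assumptions for illustration, not the actual setup from the conversation.

```python
import math


def model_delays(theta, sensor_angles, radius_over_c=1.0):
    """Plane-wave delay at each sensor relative to the array center."""
    return [-radius_over_c * math.cos(theta - g) for g in sensor_angles]


def fit_angle(measured, sensor_angles, theta0, radius_over_c=1.0, n_iter=20):
    """Gauss-Newton iterations on the sum of squared delay residuals."""
    theta = theta0
    for _ in range(n_iter):
        predicted = model_delays(theta, sensor_angles, radius_over_c)
        residuals = [m - p for m, p in zip(measured, predicted)]
        # derivative of each predicted delay with respect to theta
        jac = [radius_over_c * math.sin(theta - g) for g in sensor_angles]
        denom = sum(j * j for j in jac)
        if denom == 0.0:
            break
        theta += sum(j * r for j, r in zip(jac, residuals)) / denom
    return theta % (2.0 * math.pi)


sensor_angles = [2.0 * math.pi * k / 8 for k in range(8)]  # 8 sensors on a circle
true_theta = 2.3
measured = model_delays(true_theta, sensor_angles)  # noise-free toy measurements
refined = fit_angle(measured, sensor_angles, theta0=2.0)  # 2.0 plays the rough initial guess
print(refined)
```

The initial guess matters: started too far away, the iteration can settle on the wrong side, which is why a coarse estimate from the pairwise approach is a useful starting point.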
For time series specifically or ML at large?
ml at large
My favourite text remains introduction to statistical learning. Afterwards you can read dive into deep learning.
i guess reading otexts is enough to at least get me a machine learning internship i suppose
does the book An Introduction to Statistical Learning also have video format explanations? like the otexts one?
plus knowing about ml and dl algos
No. I have a strong bias for reading. Videos are pretty useless
well i do read too i just find the videos convenient thats it
I only use videos if I want to zone out and get a little bit of information for free
ok ok
They trick you into thinking you're learning, reading is tiresome because you actually are
yeaa i basically see those videos for that only when i am low on concentration so videos help a lot
https://youtu.be/LvySJGj-88U?si=ZVdWZTkTwBOFI497
here i found a playlist of the book
Statistical Learning, featuring Deep Learning, Survival Analysis and Multiple Testing
Please don't watch the videos. If you're going to watch videos at the very least watch it from another book
Because what'll happen is you might watch a chapter instead of reading it and that tricks you into thinking you've covered the material
It's not you personally, it's a general observation that applies to everyone
ok so imma avoid watching videos if you suggest it
If you want to watch videos on the side you can watch statistical rethinking 🙂
Read 1 watch 1 is a good compromise
alrighty thnx a lot mate !
@past meteor please check your DMs
Hey so I found this program on the internet for a minimax in a tictactoe game. When it's untouched it works perfectly, but I want to add a difficulty setting and I'm tinkering with how to do that. I decided to add an additional check for if depth == 8 to just pick a random available cell. So if the player goes first, the programs first move will be a random pick instead of the center spot every time. However, this results in some weird behavior where after I pick cell 1, it picks random cell, I pick cell 7, the bot will always pick cell 3 instead of blocking my play (cell 4). That move is handled by the minimax portion so I don't understand why it won't block my play. Is there a flaw in this algorithm causing that? Is there another way I could alter difficulty?
Maybe because if the opponent plays perfectly it can't win anymore anyways?
This situation right? (circle=player, cross=ai) @torn ore
Yea that's pretty much what's happening. I play top circle, opponent plays randomly, as expected, then I play bottom circle, and it will not block with the next move
No point
It has already lost if you play perfectly
You get this then, and player chooses middle
Then you have top right or bottom right is win
So it can't block both
Hmmm
I'm not sure if minimax would choose the cell that prolongs the game for the longest
So you think it's playing through all the scenarios and the one with the smallest max loss is to just let it happen
Thats.. interesting
I'm not familiar with these algorithms this is the first one I've looked at
So that could be the case
Tbh it's been a while since I've implemented minimax, so it's a bit foggy
But from what I remember it simulates an optimal game where both players pick "optimally"
In which case the ai would have lost in that situation, so each choice would give the same reward I suppose
Yea it just threw me off because any other time I've played it, given the opportunity to block me or win with its move it'd block me most of the time - which kinda makes sense now that you bring that up, any other time its trying to tie but in this case it knows its a lost cause
So I guess I need to find a different way to adjust difficulty
🤯 me rn
Might have to just have it check for [x, ~, x] and pick the empty spot for the harder difficulty instead of actually predicting
Yeah you can hardcode stuff for a lower difficulty ig
Maybe there's a more elegant way that you can incorporate the length of the game into the minimax algorithm
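One common way to do that is to make the score depend on how many plies it took to reach the end of the game, so a later loss is preferred over an immediate one. Below is a self-contained sketch, not the program from the link; the flat 9-cell board, the scoring constants, and the assumption that the bot's random first move landed on cell 2 are all made up for illustration.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None


def minimax(board, to_move, ai, ply=0):
    """Return (score, move). Wins score higher when quick, losses score higher when late."""
    w = winner(board)
    if w == ai:
        return 10 - ply, None
    if w is not None:
        return ply - 10, None
    empty = [i for i, v in enumerate(board) if v is None]
    if not empty:
        return 0, None  # draw
    other = "O" if to_move == "X" else "X"
    best_score, best_move = None, None
    for i in empty:
        board[i] = to_move
        score, _ = minimax(board, other, ai, ply + 1)
        board[i] = None
        if best_score is None:
            better = True
        elif to_move == ai:
            better = score > best_score  # the AI maximizes
        else:
            better = score < best_score  # the opponent minimizes
        if better:
            best_score, best_move = score, i
    return best_score, best_move


# Player (O) holds cells 1 and 7, the bot (X) holds cell 2; indices are 0-based.
board = ["O", "X", None,
         None, None, None,
         "O", None, None]
score, move = minimax(board, "X", "X")
print(score, move)
```

With a flat -1 for every loss, all moves in this position tie and the first one found gets picked. With the ply-based score, blocking at index 3 (cell 4) wins out because it postpones the loss.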
I'll probably hardcode certain moves in, seems easier. Or maybe move the random one from depth 8 to depth 6 to give the player one opportunity to win on the hard difficulty
I did have it setup to only look 4 moves ahead, but then I ran into an issue where the depth was never 0
How are you encoding the state in your problem?
This only tangentially related but I'd encourage you to look at post state representations as they're popular for this kind of thing
The state is just a list of coordinates
The problem I was having where the depth was never 0 when looking 4 moves ahead was because I was making the depth a static 4 before giving it to the minimax so my checks for a draw game were never true
I got it wrong, after state is the term I was looking for.
I've managed to get here:
df_action of a given model is considered, df_action has values like 0 (no action was taken), +1 or +100 (representing the purchase of one or more shares, limit being 100 shares) and -1 or -100 (representing the selling of one or more shares)
They're mostly used in reinf learning but they're applicable to minimax too
I don't think there's an after state in this. I put the code in this link if you wanna check it out. I'll look into what an after state is when I get done mowing the lawn lol
I'd have to think about your actual problem at hand, I'll have a look.
Like I said before, I will probably end up hard coding the first couple moves for the 2 easier difficulties
Seems like the simplest way to achieve the expected results
Or I can make a random chance for the bot to play a random valid move
I Read up. Yeah if you play perfectly and go first I guess there's nothing it can do. You can add logic to prolong the game I guess
I think I'm just going to do something like x = randint(0,9) if depth == x <play a random valid move> so there's a small chance the bot misplays once in a while
congrats on the role man
hi, i've been playing around with generating future data(fantasy data) from past data, using brownian motion. its just a hobby of mine, but curious to hear if its something that you use professionally in this field, and also hear if there are other similar methods to use instead?
price_simulations = []
for _ in range(n_simulations):
    price_simulation = [initial_price]
    for _ in range(n_days * 24):
        candle_return = np.random.normal(mu * dt, sigma * np.sqrt(dt))
        price_t = price_simulation[-1] * (1 + candle_return)
        price_simulation.append(price_t)
    price_simulations.append(price_simulation)
as i dont truly know what im doing, i posted the loop im using in case i made any mistakes
first thing I can tell you is that a double for-loop is going to be inefficient
It's going to be O(n_simulations × n_steps) iterations in pure Python
I would look into using some sort of data analysis framework like numpy to see if you can vectorize the operations somehow
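A vectorized version of the loop above could look something like this. It is a sketch that keeps the same update rule (normal per-candle returns, compounded multiplicatively); the function name `simulate_paths` and the parameter values in the usage lines are placeholders.

```python
import numpy as np


def simulate_paths(initial_price, mu, sigma, dt, n_steps, n_simulations, seed=None):
    """Draw all candle returns at once and compound them along the time axis."""
    rng = np.random.default_rng(seed)
    returns = rng.normal(mu * dt, sigma * np.sqrt(dt), size=(n_simulations, n_steps))
    growth = np.cumprod(1.0 + returns, axis=1)  # axis=1 is time, one row per simulation
    paths = np.empty((n_simulations, n_steps + 1))
    paths[:, 0] = initial_price
    paths[:, 1:] = initial_price * growth
    return paths


# Placeholder parameters, not taken from real market data.
paths = simulate_paths(initial_price=100.0, mu=0.05, sigma=0.2, dt=1 / (365 * 24),
                       n_steps=30 * 24, n_simulations=1000, seed=42)
print(paths.shape)
```

The amount of arithmetic is the same as in the double loop; the gain comes from doing it inside numpy instead of the Python interpreter.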
Which brings me to the reason I'm here...
Anyone know the best way to make a dataframe out of JSON data when I only need select fields from the data?
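One option, assuming the JSON is a list of (possibly nested) records, is to flatten it with `pandas.json_normalize` and then keep only the columns you need. The sample payload and the field names below are invented for illustration.

```python
import json

import pandas as pd

# Made-up payload standing in for the real JSON data.
raw = """
[
  {"id": 1, "name": "alpha", "meta": {"score": 0.5, "tags": ["x", "y"]}},
  {"id": 2, "name": "beta",  "meta": {"score": 0.9, "tags": []}}
]
"""
records = json.loads(raw)

wanted = ["id", "meta.score"]  # nested keys are joined with "." by json_normalize
df = pd.json_normalize(records)[wanted]
print(df)
```

If the fields are all top-level, `pd.DataFrame(records, columns=[...])` with just the wanted keys does the same job without flattening.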
okay thanks for letting me know! i plan to run it with multiprocessing to make use of more than 1 core
I think that should be your 2nd option
1st option should be to make the code more efficient
then use multiprocessing if you need to
yes i think that both options arent mutually exclusive
its just alot faster to use 24 cores than 1
Sometimes it isn't
Sometimes the overhead of starting all those processes eats up any speed and/or efficiency gains you get from multiprocessing
If your code is efficient enough you won't need it
when calculating on large datasets with pandas, i observe a rather linear relation between completion time and amount of cores though
Sure if you have a lot of data. Doesn't change the fact that you can get a better efficiency increase by not having needlessly expensive operations in your code
i firmly believe nobody would put up any argument against that more efficient code will lead to a faster completion time 💚
Why are you using brownian motion?
its the only method i know of, im quite green in this field
please tell me if this doesn't belong here. twice I posted in help channel yet i received none so i will change my strategy
How do I change the position of the vertically placed dataframe column's label such that it doesn't overlap with the pie chart label on the left side?
student_list['Ngành đào tạo'].value_counts().plot.pie(autopct='%1.1f%%')
plt.legend(bbox_to_anchor=(2, 1), loc='upper left')
plt.show()
Is that matplotlib or what
yes
No i haven't. I'll give it a shot
yes it is
Try the second one
np
Strange follow up question but what did you look up to find these links?
When I looked at the first link I noticed that there were 'pad' properties and so that tipped me off to search for 'matplotlib axis label padding'
@past meteor hey if you don't mind could you give me a roadmap on learning ML with python
How good is your Python right now?
I'd say im very thorough with the basics and intermediate
im partly more experienced in C# but have been learning python lately
not very high level though
And what's your end goal ML/DL wise. Are you in for the long haul or do you want to explore what exists in the space? (No wrong answers here: 🙂 )
well im still exploring but if I like it, for the long haul ig
I was just learning something about neural networks a few weeks ago and got interested in ML
but honestly, exploring yea
So I'd say neural nets are a very specific type of ML model and you kind of need to have a baseline level of knowledge of the rest to know when they make sense
If your Python is rusty you have 2 problems: figuring out what ML is and figuring out how to do it in Python. Leads to a cognitive overload
If you know more C# I'd suggest you start by "translating" a C# project into Python to be honest.
im not a huge fan of C#, i prefer to dev in python
id rather probably practice and get better at python
That's no problem. Have you completed a project in C#? Just redo it in Python
yea
oh alr
Simply because if you get more of the language down you won't be learning 2 things at once when you head into ML
hmmm
what source would you recommend to learn ML from tho, the book you recommended?
My trinity of resources are:
- https://mml-book.github.io/book/mml-book.pdf
- https://www.statlearning.com/
- https://arxiv.org/abs/2106.11342
All 3 are 100 % free and have PDFs on their sites
Only the last one is strictly about deep learning / neural networks but I actually believe book 3 depends on knowing book 2 and 2 on 1
You're free to start from book 3 for "exploration" purposes but if you're in it for the long haul I'd circle back and read them in that order 🙂
alright, thanks for the books. for now i'll improve my knowledge on python and take a introductory course on kaggle as someone recommended and then checkout the 3rd book. I'm not very serious about it, just exploring as I'm still a highschool grad
thanks for the help tho 😄
Oooh that's important context I didn't get! 🙂 Kaggle is fine as well indeed. Good luck
I mean i'm still very new to it so would prefer to look at basic courses before diving into books
thanks for the help man
I mentioned, but just reiterating: I’m still a fan of cs50 for ai for a structured survey that’s Python oriented
It’s a good flyover of topics with hands on practical examples. Still need the deeper stuff, but it’s satisfying to write code that does stuff
yea i'll look at that
the cs50 courses are pretty good, I've tried cs50t and cs50x as well
thanks mate 😄
Kaggle is really good
If you're willing to pay €10-15 then 100 days of Python is decent as well
does that include ML?
Some of the days are about ML but a whole range of topics are covered including web, game dev, databases, ...
You interested in what I ended up doing for this?
Definitely!
Crap I thought I sent the code over to my phone I guess i didn't, one sec I can explain lol
5 difficulties:
Too easy
Easy
Med
Hard
Too hard
If it's too easy, x,y is a random valid move
Set to too hard, if depth == 9 choose a corner, else use the minimax
At the beginning of the ai turn I generated a randint(1,9)
If depth == 9 for easy and medium, pick a random move, for hard, pick a corner
Easy: if depth >= randint then pick a randomized move (more likely to happen early game)
Medium: if depth <= randint pick a random move(more likely to happen late game)
Hard: if depth == randint pick a random move
For each of those, if not then use minimax
```
ec = empty_cells(board)
depth = len(ec)
odds = randint(1,9)
if depth == 0:
    return
elif depth == 1:
    rn = 0
else:
    rn = randint(0, depth-1)
moves = {
    1: [0, 0], 2: [0, 1], 3: [0, 2],
    4: [1, 0], 5: [1, 1], 6: [1, 2],
    7: [2, 0], 8: [2, 1], 9: [2, 2],}
if difficulty == hardness.Too_Easy:
    x,y = ec[rn][0], ec[rn][1]
elif difficulty == hardness.Easy:
    if depth == 9:
        x = choice([0, 1, 2])
        y = choice([0, 1, 2])
    else:
        if depth >= odds:
            x,y = ec[rn][0], ec[rn][1]
        else:
            move = await minimax(board, depth, COMP)
            x,y = move[0], move[1]
elif difficulty == hardness.Medium:
    if depth == 9:
        x = choice([0, 1, 2])
        y = choice([0, 1, 2])
    else:
        if depth <= odds:
            x,y = ec[rn][0], ec[rn][1]
        else:
            move = await minimax(board, depth, COMP)
            x,y = move[0], move[1]
elif difficulty == hardness.Hard:
    if depth == 9:
        x = choice([0, 2])
        y = choice([0, 2])
    else:
        if depth == odds:
            x,y = ec[rn][0], ec[rn][1]
        else:
            move = await minimax(board, depth, COMP)
            x,y = move[0], move[1]
else:
    if depth == 9:
        x = choice([0, 2])
        y = choice([0, 2])
    else:
        move = await minimax(board, depth, COMP)
        x, y = move[0], move[1]
board[x][y] = COMP
```
have you looked through the fastai course? it seemed pretty good when i skimmed over it last year
Have not, will check it out
caveat: i am not actually very experienced with "ai" things
i am a regression fitter at heart
I, too, am a regressive.
is it for beginners? i have just completed my highschool, would it be helpful for me as i want to pursue a career in data science.
i want to gather content so i can start learning
Anyone know how to restart the kernel in Kaggle? I updated tensorflow but when I double check the version, I still have the old one.
Also, I am trying to run this code in Jupyter notebook locally but every time I run the notebook I get something like “your kernel has died, it will restart automatically”. How can I fix this?
hi, i have multiple large csv files of about 5 GB in size. I tried loading them with pandas but ran into memory error. Not sure what other tools i could use to load the data?
polars seems to crash the kernel every time i try to read those large files
in polars you can use a lazyframe. dask has a similar behavior with dask.dataframe. otherwise you have to split the data into smaller chunks
if you keep trying to load all the data into memory, no language or module will help 😛 you'd have to go out and buy more ram
lemme read up on lazy frame
the problem is i cant even get one of those 5GB files to load
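If even a single file will not fit, one fallback is to stream it row by row and keep only running aggregates in memory. This is a standard-library sketch of that idea, not the polars or dask route mentioned above; the column name `price` and the tiny in-memory sample are placeholders for a real file.

```python
import csv
import io


def column_mean(lines, column):
    """Stream CSV rows and keep only a running sum and count in memory."""
    total, count = 0.0, 0
    for row in csv.DictReader(lines):
        value = row[column]
        if value == "":  # skip missing values
            continue
        total += float(value)
        count += 1
    return total / count if count else float("nan")


# Small stand-in for a large file. With a real file you would pass an open handle:
#   with open(path, newline="") as f: column_mean(f, "price")
sample = io.StringIO("id,price\n1,10.0\n2,\n3,20.0\n")
result = column_mean(sample, "price")
print(result)
```

The same pattern works for counts, sums, minima and maxima; anything that needs all rows at once (sorting, joins) is where a lazy engine or a database earns its keep.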
Why? 😭 someone help T-T
You need to give us more context of your problem, perhaps share a snippet of the code in question on #1035199133436354600 ?
But judging from the error message, I think you wanted to use the keras import from the tensorflow package under the predefined name models. You might want to check whether those names are actually defined. If models isn't, then perhaps use keras, as it's the default reference used.
ahh make sense. Thanks
Hi , so I'm trying to automate preprocessing but I'm kind of stuck with outlier treatment
Afaik to remove outliers , we find the zscore and fix a threshold, and remove the ones which are above the threshold.
But I'm getting a weird error for this dataset where there's no datapoints left, when the threshold is 2.
nvm , i think it might have been due to not accounting for columns with object dtypes
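Here is a small sketch of z-score filtering that skips non-numeric columns, which is the issue described above. It uses plain Python containers instead of a dataframe so it runs anywhere; the column names, the data, and the threshold are made up.

```python
from statistics import mean, pstdev


def zscore_mask(values, threshold):
    """True for values whose absolute z-score is within the threshold."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [True] * len(values)  # constant column: nothing to remove
    return [abs(v - mu) / sd <= threshold for v in values]


def drop_outlier_rows(columns, threshold=3.0):
    """Drop rows that are outliers in any numeric column; other columns are ignored."""
    n_rows = len(next(iter(columns.values())))
    keep = [True] * n_rows
    for values in columns.values():
        numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool)
                      for v in values)
        if not numeric:
            continue  # e.g. object/string columns have no z-score
        mask = zscore_mask(values, threshold)
        keep = [k and m for k, m in zip(keep, mask)]
    return {name: [v for v, k in zip(values, keep) if k]
            for name, values in columns.items()}


table = {
    "city": list("abcdefghij"),
    "price": [10, 11, 9, 10, 12, 10, 11, 9, 10, 100],
}
cleaned = drop_outlier_rows(table, threshold=2.0)
print(cleaned)
```

Note that the masks are combined across columns, so with many columns and a tight threshold the surviving set can shrink quickly.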
I'd say you should try them and see if they're too difficult or not. If they are you can ping me or ask someone else and I can see what else can work 🙂
Btw I also use Lazyframes when the data is already in memory 🙂
I treat it as a query optimiser in SQL, several operations are done and they're "compiled" to a more optimal instruction. The non lazy API does them step-by-step. My entire data pipeline is reading data from a db, doing .lazy() and then aggregating all steps and then doing .collect()in the end 🤣
yeah that also makes sense
each operation to the db has some overhead, so it can make sense to bundle up a few
It's also because if you do, say, 5 things sequentially and the last is a filter, then depending on what the previous 4 were, the query optimizer might filter first, which makes the preceding 4 faster
I’ve never used Kaggle notebooks, but post the code for second issue and we can look. Disregard, I didn’t scroll :/
note that this is one recommended technique among many, it's not the only way
yeah there's IQR too , afaik
Both sadly only remove univariate outliers
Depending on your use case you might have multivariate outliers, personally I don't really go that far 🤣

So imagine you have a variable age that is from 0 to 120 and a variable income that is from 0 to 1M. 12 years old isn't an outlier and 50k income neither but together they are.
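The age/income example can be made concrete with a Mahalanobis distance, which is one of several ways to flag multivariate outliers. The reference data below is invented, and the two-variable closed form is used so that nothing beyond the standard library is needed.

```python
import math
from statistics import mean, stdev


def mahalanobis_2d(point, xs, ys):
    """Return (distance, z_x, z_y) of a point relative to two correlated variables."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    rho = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / ((len(xs) - 1) * sx * sy)
    zx, zy = (point[0] - mx) / sx, (point[1] - my) / sy
    d2 = (zx * zx - 2.0 * rho * zx * zy + zy * zy) / (1.0 - rho * rho)
    return math.sqrt(d2), zx, zy


# Made-up reference population: children earn nothing, adults earn a salary.
ages = [10, 12, 14, 30, 40, 50, 60]
incomes = [0, 0, 0, 40_000, 50_000, 60_000, 55_000]

odd_distance, odd_z_age, odd_z_income = mahalanobis_2d((12, 50_000), ages, incomes)
usual_distance, _, _ = mahalanobis_2d((40, 50_000), ages, incomes)
print(odd_distance, odd_z_age, odd_z_income, usual_distance)
```

Each coordinate of the 12-year-old with a 50k income looks unremarkable on its own, but the combination sits far from the age/income trend, which is what the joint distance picks up.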
ah
so we use clustering, of some sort 
ure just jealous a 12 y.o. has higher income that u did at 12
Nope, I remember Lil Tay and she's definitely unhappy
There's many methods and I think clustering is one of them yeah
interestingly high amount of people who come in, state a question, someone take time to help them and explain the problem, but they just never reply with a thanks or give a hint that they received the instructions 🤷
typical behavior online
i've been playing a bit with the function numpy.random.normal(loc=mu, scale=sigma) - to create artificial data with similar properties as the input data(brownian motion). are there other methods to use in brownian motion to generate the next value in a series, instead of gaussian distribution? should i plot the deviations(mu) of the input data and see whether they are evenly distributed across the bellcurve?
I was thinking about how it would be possible to train a model, and prompt it like this "open the file manager and create a new folder". I mean the mouse interaction is not that hard, but I suppose this model would need to have a huge knowledge about the operating system, and the state, so it can see the files, and stuff.
Is there any paper which describes something similar?
Has anyone done chatbot projects?
Is this in a finance context?
Yes, the plan was to generate a few fantasy datasets when I'm backtesting to assess robustness, but I "got lost in the dataframes" and started exploring it a bit in depth
Oh, as you can imagine; there’s tons of prior work here… such as https://en.m.wikipedia.org/wiki/Geometric_Brownian_motion
A geometric Brownian motion (GBM) (also known as exponential Brownian motion) is a continuous-time stochastic process in which the logarithm of the randomly varying quantity follows a Brownian motion (also called a Wiener process) with drift. It is an important example of stochastic processes satisfying a stochastic differential equation (SDE); ...
There’s even sample Python code in that
I just used the built in generate ohlc data from my framework but noticed that price would often go below 0 , so had to do it myself
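On the negative price issue: a geometric Brownian motion step multiplies the price by an exponential, so the simulated price cannot cross zero. A minimal sketch, assuming `mu` and `sigma` are expressed per unit of time and `dt` is the step length in the same unit; the function name and the numbers are placeholders.

```python
import math
import random


def gbm_path(initial_price, mu, sigma, dt, n_steps, rng=None):
    """Simulate one geometric Brownian motion path using log-returns."""
    if rng is None:
        rng = random.Random()
    drift = (mu - 0.5 * sigma ** 2) * dt
    vol = sigma * math.sqrt(dt)
    prices = [initial_price]
    for _ in range(n_steps):
        shock = rng.gauss(0.0, 1.0)
        # exp(...) is always positive, so the price never reaches zero or below
        prices.append(prices[-1] * math.exp(drift + vol * shock))
    return prices


path = gbm_path(initial_price=100.0, mu=0.05, sigma=0.5, dt=1 / 252,
                n_steps=252, rng=random.Random(0))
print(min(path), path[-1])
```

This differs from compounding `1 + normal_return`, which can go negative whenever a single draw falls below -1.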
Ok thx i have a look
Yes brownian motion is what im doing
I see some variations of it are elaborated in the link 👍 a bit heavy stuff with all the math 😮💨
Yah, it’s a pretty well studied space due to black-scholes
i am looking at histograms on the distribution of the calculated motions vs the real pct differences ... something is definitely off.
when i print out the numbers i see they are conclusively similar, just the calculated motion is 1000 times bigger than the real one
print((returns[0].mean())*10000)
print((data['Close'].pct_change().mean())*10000)
0.39038393...
0.00039782....
Hello, as a beginner to data analytics. Does anyone know where I can find projects? I want to gain experience and getting myself more familiar with the libraries that are commonly used in Data Analytics.
have you done the Kaggle tutorial for Pandas?
Not yet
Happy to look, maybe share more code? Not sure what you’re comparing.
But if bottom is supposed to be market actual, I think you have an error.
Or maybe not, I dunno: day to day movement is going to be very small percentages overall.
one would use pandas a lot to slice and dice tabular data that fits in memory, so I recommend getting comfortable with it. ||(inb4 someone says you should just learn polars.)||
||but what about duckdb||
is that a meme?
Oh, I’m making fun of myself
Hello, I'm wondering if anyone could help me with a tiny problem. I need to get some stats (mean, quartiles, var, min, max, std) from a dataframe. I am able to get most of them with pd.describe except for the var, which I have to generate in a separate dataframe. I've tried concatenating them so it's all nice in a single df, but the var df is vertical and the describe df is horizontal, so I can't join them up properly :<
Sorry if this isn't the place to ask btw lol
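One way this might be done is to add the variance as an extra row of the `describe()` result, since `df.var()` is indexed by column name and lines up with the columns of the summary. The sample frame here is made up.

```python
import pandas as pd

# Made-up numeric data standing in for the real dataframe.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]})

stats = df.describe()         # rows: count, mean, std, min, 25%, 50%, 75%, max
stats.loc["var"] = df.var()   # new row, aligned on the column names
print(stats)
```

`df.agg(["mean", "var", "min", "max", "std"])` is an alternative when the quartiles are not needed.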
alright ill construct a minimal example for you to look at billybobby, would be good to get confirmed whether i am calculating it properly. maybe it do just look like that, as you indicated
import pandas
import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf
ticker_symbol = "^IXIC" # ^IXIC is the Nasdaq Composite (^GSPC would be the S&P 500)
start_date = "2003-01-01"
end_date = "2023-01-01"
data = yf.download(ticker_symbol, start=start_date, end=end_date)
data.index = pandas.to_datetime(data.index, unit='ms')
print(data)
n_simulations = 10
n_rows = len(data)
# per-candle mean and standard deviation of the observed percentage changes
mu = data['Close'].pct_change().mean()
sigma = data['Close'].pct_change().std()
initial_price = data['Close'].iloc[-1]
prices = [initial_price]
# one row per simulation, one column per time step
returns = np.random.normal(mu, sigma, size=(n_simulations, n_rows))
# compound along the time axis (axis=1), not across simulations
cumulative_returns = np.cumprod(1 + returns, axis=1)
price_simulations = prices[-1] * cumulative_returns
simulationsDF = pandas.DataFrame(price_simulations)
# final simulated price of each run (last column)
simresults=simulationsDF.iloc[:, -1]
print(simulationsDF)
print(simresults)
fig, axes = plt.subplots(2, 1, figsize=(10, 10),sharex=True)
axes[0].hist(returns[0], bins=100, edgecolor='black')
axes[0].set_title('Brownian motion simulation daily pct change')
axes[0].set_xlim(xmin=-0.05, xmax=0.05)
axes[1].hist(data['Close'].pct_change(), bins=100, edgecolor='black')
axes[1].set_title("US100 daily pct change")
axes[1].set_xlim(xmin=-0.05, xmax=0.05)
plt.show()
i changed it to take data from yahoo finance so you can run it, i was using a local csv file before. the effect is less pronounced, either because its a different asset or because it is a lot less candles
i tried to make the histograms of equal size by setting bins and limit the x scale of both plots, but without success
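A likely reason the two histograms differ in bar width: `bins=100` splits each dataset's own min-to-max range into 100 bins, and `set_xlim` only changes the view afterwards. Passing the same explicit bin edges to both makes them comparable. A sketch with synthetic stand-in data (the real series would be `returns[0]` and the observed percentage changes):

```python
import numpy as np

rng = np.random.default_rng(0)
simulated = rng.normal(0.0004, 0.013, size=5000)  # stand-in for returns[0]
observed = rng.normal(0.0004, 0.013, size=5000)   # stand-in for the real pct changes
observed[0] = np.nan                              # pct_change() leaves a NaN in the first row

edges = np.linspace(-0.05, 0.05, 101)             # 100 identical bins for both series
sim_counts, _ = np.histogram(simulated, bins=edges)
obs_counts, _ = np.histogram(observed[~np.isnan(observed)], bins=edges)
print(sim_counts.sum(), obs_counts.sum())
```

The same `edges` array can be passed as `bins=edges` to both `axes[0].hist(...)` and `axes[1].hist(...)`.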
also had a third plot of the simulations(the var simresults) but at a point it broke and couldnt get it to work again
I’ll take a look tomorrow, mind firing me a dm reminder?
Hi, yeah, this is my code .-.
you can use triple (`) and programming language name to make a snippet like this:
if __name__ == "__main__":
    print("Hello World!")
Or again, as I've suggested before, perhaps use the #1035199133436354600 forum instead.
I posted it there
Ah yes, I just searched it and found your post, it's already closed apparently.
`import numpy as np
from matplotlib import pyplot as plt
import os
import tensorflow as tf
train_images = '/kaggle/input/cnn-test/geohashing_images/train'
classes = os.listdir(train_images)
print(classes)
data = tf.keras.utils.image_dataset_from_directory(train_images, batch_size=5)
data_iterator = data.as_numpy_iterator()
batch = data_iterator.next()
fig, ax = plt.subplots(ncols=4, figsize=(20,20))
for idx, img in enumerate(batch[0][:4]):
    ax[idx].imshow(img.astype(int))
    ax[idx].title.set_text(batch[1][idx])
data = data.map(lambda x,y: (x/255, y))
scaled_iterator = data.as_numpy_iterator()
batch = scaled_iterator.next()
fig, ax = plt.subplots(ncols=4, figsize=(20,20))
for idx, img in enumerate(batch[0][:4]):
    ax[idx].imshow(img)
    ax[idx].title.set_text(batch[1][idx])
train_size = int(len(data)*.8)
val_size = int(len(data)*.1)+1
test_size = int(len(data)*.1)
train = data.take(train_size)
val = data.skip(train_size).take(val_size)
test = data.skip(train_size+val_size).take(test_size)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten, Dropout
model = models.Sequential(
    [
        layers.Conv2D(16, (3, 3), 1, activation='relu', input_shape=(256, 256, 3)),
        layers.MaxPooling2D(),
        layers.Conv2D(32, (3, 3), 1, activation='relu'),
        layers.MaxPooling2D(),
        layers.Conv2D(16, (3, 3), 1, activation='relu'),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dense(1, activation='sigmoid')
    ]
)`
What are you using to build your project? code editor (Visual Studio Code) or IDE (PyCharm)?
Kaggle, I tried jupyter notebook but there was some sort of issue with the kernel that kept dying.
Ok I see the problem, you might want to check on your post, I've sent you an answer there. Test it out and see if it works now.
this is a really cool field ngl
someone pls help, my model is overfitting lol (asked on python-help)
can anyone explain to me how the yolov8 detect head works exactly? everyone just says its a decoupled head and its anchor free but i'm not understanding how from the 3 feature maps sizes we get out of the neck we make the predictions exactly.
here (https://github.com/akashAD98/yolov8_in_depth) it says the size is 4 * reg_max but what is reg max?
i read this https://github.com/ultralytics/ultralytics/issues/2951 but couldn't really understand the idea of different anchor box and scales etc and how why we have this shape [batch_size, num_anchors * (5 + num_classes), height, width]
I would be really grateful if someone can explain to me
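As far as I understand the Ultralytics implementation, `reg_max` is the number of bins used to describe each box side: for every cell of a feature map the head outputs `4 * reg_max` regression logits plus one logit per class, with no per-anchor multiplier as in the older `num_anchors * (5 + num_classes)` layout. Each group of `reg_max` logits is turned into a distance by a softmax followed by an expected value. The sketch below illustrates that decoding step in plain Python; the function names and the default `reg_max=16` are assumptions, and this is not the library's code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_side(logits):
    """Expected bin index: the predicted distance (in stride units) for one box side."""
    return sum(i * p for i, p in enumerate(softmax(logits)))


def decode_box(raw, anchor_xy, stride, reg_max=16):
    """Turn 4 * reg_max logits (left, top, right, bottom) into x1, y1, x2, y2 in pixels."""
    left, top, right, bottom = (
        decode_side(raw[k * reg_max:(k + 1) * reg_max]) for k in range(4)
    )
    ax, ay = anchor_xy  # center of the grid cell, in grid units
    return ((ax - left) * stride, (ay - top) * stride,
            (ax + right) * stride, (ay + bottom) * stride)


# Uniform logits mean "no idea", which decodes to the middle bin (7.5) on every side.
raw = [0.0] * (4 * 16)
box = decode_box(raw, anchor_xy=(10.5, 10.5), stride=8)
print(box)
```

The three feature maps from the neck only differ in `stride` and grid size; the same decoding is applied to every cell of each map, and the class logits are handled by a separate branch.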
Thanks. Whenever you have the time. I read a bit on wiki and come to understand that brownian motion simulations* doesnt account for the changes in volatility over time, but you can make a function of it called probability density function that accounts for it, but its getting a bit above my head
A pdf is just (eli5) a function that creates the charts you showed above (ok technically it’s a mass function since it’s discretized)
Yes, that’s the problem with simulating the market: volatility changes, so when regimes change, predictions go askew
(That’s “a” problem, not the.. plenty of other problems, like black swans and market manipulation)
I see how this could be accounted for though, to some extent. Like if the return is multiplied with some low frequency oscillating curve with an average of 0 so it doesn't skew the mu. But I wouldn't be able to make that
About to pull backwards out of the rabbithole for now i guess. To get things done instead 😄
🙂 this is a deep hole
Priorities are important. I wish i had the ability to make it, because it would work great. But i better focus on having a minimum product before going full nerd on all the small details
What's your near term / mvp goal?
More important task atm would be to get multiprocessing to work when using sk-opt, then also some work on making the strategy classes work. Just optimizing and automating the process all the way from data input to final test results, with robustness assessments in an HTML report with plots
ok guys I really want to become a data analyst but to learn python with consistency and quickly understand it how?
Quickly and consistently don't go together, in my opinion. If you don't have some basic understanding of other programming languages, I would recommend starting with Automate the Boring Stuff with Python. It's a little book made for people who don't code and who want to automate their daily life, but in my opinion it's a good book to start with and to get interested in Python
Guys i just realised my lecturer was one of the developers of sci py
thats good for me
hopefully that means they won't teach you javapython or looping over numpy/pandas objects.
As I am a senior full-stack AI dev
I have rich experience with embedding and fine tuning models.
https://theimpactpositivecompany.com
https://chatbot.impactbuilder.app
https://insuranceai.app/
Here is my previous projects
in the first project, I implemented GPT Response and learning with pinecone vector database using OpenAI Embeddings with langchain and built search engines by the embedding.
and second and third project, I implemented PDF uploading and reading the data and then vector store using pinecone + langchain.
And then allow users to chat with gpt based on the provided PDF data.
Yea but i took her for granted first year
hopefully she comes back 2nd year for me
try using a logarithmic y scale for the 2nd chart
Do you have any copilot like tools recommendation that would run locally on CPU and use local models for example code llama?
i'd like to know this as well
(or something self-hostable with modest GPU requirements)
For chat lmstudio is nice.
I'm pretty sure that there's no way in hell that anything running on a CPU would get a passing grade performance in both speed and quality
I wonder if there is something like that that would integrate with say vscode
Quantized 7/13B models are fast on M2 32G, i could run 34B code llama 4 bit quantized. Not bad I would say
Now if I could get some vscode integration like copilot with these models that would be great
there seem to be quite a few options
literally just threw "llama" on the marketplace search, cannot really vouch for any of them though
Yeah I did a search too. Fauxpilot seems to work on GPU only
Turbo pilot is early stage but should run on CPU
Just wondered if there's a obvious choice like copilot but local lol
not really ; maybe try Continue or Wingman?
Yeah seems will need to install some 'random' tools and try them
still, even Copilot's performance (as far as quality goes) is questionable at times, and those will probably be a few tiers worse running local LLMs
Probably yes
this is partly why i haven't jumped on the LLM hype train yet, every time i try to use it, i feel like i end up spending more energy fixing the output than i would have spent by just writing it
Copilot does quite a lot in terms of getting context from vscode. So just code model in lmstudio for chat is not enough.
Finetuned WizardCoder 34B beats GPT4 on benchmarks or is close.
In practice not sure
There's a reason GPT4 is a beast of a model
I think
There's definitely ways to make LLMs work. The NLP folk at work pay for a gpt4 sub on the condition that we use it in different ways and see what works and doesn't work.
I don't use Copilot but GPT4 does have a positive impact on my productivity. I would not recommend using the free tier under any circumstance.
Heyy Guys
Im working on a hydroponic plant based Deep Learning Project ,
if anyone has any prior experience
pls DM
people generally don't want to have to send DMs to figure out what the question is going to be--just ask your whole question in this chat.
I would be working on Application of DL to evaluate the concentrations of nutrients in hydroponically grown plants
I need help to exactly figure out how to go through this project
and should i have sensor data for the same or image data ?
If someone has a similar prior experience
Pls help me
followed your suggestion
thanks
sounds like you don't have a specific question yet. try using this channel when you have a specific question.
edited it
thanks
Can you please create a topical chat for deep learning
Can pl some one help me with yolo training. I cant get the model to train on a custom data set. PL help
you can already ask about deep learning in this channel
anyone please provide some good resources to have a great knowledge about transformers
I am using Spyder. I'm reading my files (I only have 3) but when I do so, I get an extra one called .DS_Store. Why is this here and how can I remove it?
https://en.wikipedia.org/wiki/.DS_Store
They are created by MacOS, just ignore it
In the Apple macOS operating system, .DS_Store is a file that stores custom attributes of its containing folder, such as folder view options, icon positions, and other visual information. The name is an abbreviation of Desktop Services Store, reflecting its purpose. It is created and maintained by the Finder application in every folder, and has ...
In terms of "hiding", you can use startswith(".") to ignore all hidden files and directories, or do e.g. [name for name in os.listdir(DIR) if name != ".DS_Store"] to exclude .DS_Store specifically
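A minimal sketch of both filters. The helper names and the file names in the throwaway directory are made up for illustration; DIR from the message above would be your own folder.

```python
import os
import tempfile

def visible_files(directory):
    """Return sorted names in `directory`, skipping dotfiles such as .DS_Store."""
    return sorted(name for name in os.listdir(directory) if not name.startswith("."))

def without_ds_store(directory):
    """Return sorted names in `directory`, skipping only .DS_Store."""
    return sorted(name for name in os.listdir(directory) if name != ".DS_Store")

# Demo in a throwaway directory (hypothetical file names).
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("a.csv", "b.csv", "c.csv", ".DS_Store", ".hidden"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_visible = visible_files(demo_dir)
    demo_no_ds = without_ds_store(demo_dir)
```

The first variant is usually the safer default, since other tools also drop dotfiles into data folders.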
@abstract wasp
Ok, thank you!!
dunno where else to put this, but I'm looking for tips on reducing the number of iteratives to speed up compilation time:
import graphviz
import pylightxl
import re
FeatIndex = {}
GlobalGraph = graphviz.Digraph('PhiloDilemma', format='png', filename='unix.gv', node_attr={'color': 'lightblue2', 'style': 'filled', 'fixedSize': 'false'}, engine='fdp')
#Adds a node to the DAG
def AddFeat(name, parents, descr, prereqs):
    newFeat = {}
    newFeat['name'] = name
    newFeat['parents'] = parents
    newFeat['description'] = descr
    newFeat['prerequisites'] = prereqs
    newFeat['children'] = []
    FeatIndex[name] = newFeat
def parseString(string=""):
    string = string.lstrip();
    string = string.rstrip();
    if string.count('[') > 0:
        p = string.find('[')
        p2 = string.find(']')
        str1 = string[0:p]
        str2 = string[p2+1:len(string)];
        string = str1+str2
    while string.endswith(' ') or string.endswith(','):
        string = string[0:len(string)-1];
    string = re.sub(r'[^a-zA-Z\s]', '', string)
    return string;
def find(lst, val):
    ret = 0;
    try:
        ret = lst.index(val)
    except ValueError:
        return -1
# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    print('PyCharm')
    #print(parseString('prereq=[test]test '));
    database = pylightxl.readxl(fn='C:\\Users\\mthom\\Downloads\\feat database.xlsx');
    sheet = database.ws('The Sheet')
    featList = database.ws(ws='The Sheet').col(col=1)
    #FeatList is roughly 3,800 entries long
    for i in range(0, len(featList)):
        featRaw = database.ws(ws='The Sheet').row(row=i+1)
        prereqs = featRaw[12].split(',');
        name = parseString(featRaw[0]);
        parents = [];
        for _pre in prereqs:
            pre = parseString(_pre)
            for i in featList:
                i = parseString(i)
                if i.lower() == pre.lower():
                    parents.append(i)
        AddFeat(name, parents, "", []);
    for i in FeatIndex:
        feat = FeatIndex[i]
        parents = feat.get('parents')
        newParents = parents;
        if len(newParents) > 0:
            for j in parents:
                _feat = FeatIndex[j]
                _par = _feat.get('parents')
                if len(_par) > 0:
                    for p in _par:
                        found = False;
                        for l in newParents:
                            if l == p:
                                found = True;
                        if found:
                            newParents.remove(p)
        for k in newParents:
            GlobalGraph.edge(k, i)
    GlobalGraph.view()```
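One likely hotspot in the script above is the inner scan: every prerequisite is compared against all ~3,800 names, and parseString runs again on each comparison. A rough sketch of the usual fix is to normalise each name once and resolve prerequisites through a dict. Everything here is an assumption for illustration: `normalize` is a simplified stand-in for parseString, the rows are made-up data instead of the spreadsheet, and `drop_inherited` only approximates the parent-pruning loop.

```python
import re

def normalize(raw):
    """Simplified stand-in for parseString: drop [..] tags and non-letters."""
    raw = re.sub(r"\[.*?\]", "", raw)
    return re.sub(r"[^a-zA-Z\s]", "", raw).strip()

def build_parents(rows):
    """rows: list of (name, comma-separated prerequisites) tuples.

    Normalises every name once, then resolves prerequisites with a dict
    lookup instead of scanning the whole list for each prerequisite.
    """
    names = [normalize(name) for name, _ in rows]
    by_lower = {n.lower(): n for n in names}  # built once, O(1) lookups after
    parents = {}
    for name, (_, prereq_text) in zip(names, rows):
        found = []
        for raw in prereq_text.split(","):
            match = by_lower.get(normalize(raw).lower())
            if match is not None:
                found.append(match)
        parents[name] = found
    return parents

def drop_inherited(parents):
    """Remove a parent that is already a parent of another direct parent."""
    reduced = {}
    for name, direct in parents.items():
        inherited = {gp for p in direct for gp in parents.get(p, [])}
        reduced[name] = [p for p in direct if p not in inherited]
    return reduced

# Hypothetical data standing in for the spreadsheet rows.
rows = [
    ("Power Attack", ""),
    ("Cleave", "Power Attack"),
    ("Great Cleave", "Cleave, Power Attack, BAB +4"),
]
edges = drop_inherited(build_parents(rows))
```

Reading each row once into a list before the loop, rather than calling database.ws(...) on every iteration, would probably help as well.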
I would say start with the original paper, then 'The Illustrated Transformers' article and by then you'll naturally find other resources that work best for you
i have a problem where I need some guidance in what way I can utilize ML to find certain points in a time series. I have about 1 million separate timeseries that all kind of look like the picture attached. Of course every time series is a little different, but generally it kind of looks like it. for every time series i also have a timestamp that indicates a certain event that happens while the signal converges to 0 again. that event is the label that i want to find also for unseen timeseries through machine learning. Most examples found concerning finding certain points in timeseries was about anomaly detection. But I do not find anomalies or certain spikes in the timeseries. I just want to train a network to find a certain point in the timeseries depending on the way the whole timeseries is shaped. how would you go about approaching this? which kind of models or methods would you use? Or maybe you even have a link to a similar problem solution? My simple first approaches (DNNs with one hidden layer) all failed and just kind of returned some arithmetic average point in the timeseries. so if you have any suggestions they are more than welcome.
weird flex but okay
@lapis sequoia , In short, I 'm a full stack web developer / AI engineer
Yeah
if you want, I can help you
Tried this workflow and it 'works'. Not talking about quality yet, it works but need to test it more:
lmstudio -> wizardcoder13b python -> local inference server -> start server
vscode -> install extension Continue -> setup local server in config file -> profit 🙂
https://marketplace.visualstudio.com/items?itemName=Continue.continue
https://continue.dev/docs/customization#local-models-with-openai-compatible-server
from continuedev.src.continuedev.libs.llm.openai import OpenAI
config = ContinueConfig(
    ...
    models=Models(
        default=OpenAI(
            api_key="EMPTY",
            model="Wizard Coder 13b python",
            api_base="http://localhost:1234/v1", # change to your server
        )
    )
)
So in summary:
- You have 1M series
- Some time series are labelled with a point of interest
- You want to use ML to find the point of interest on unlabeled series
Is this correct?
Assuming the problem is as you've described, you must specify it a bit more clearly. You've mentioned:
I just want to train a network to find a certain point in the timeseries depending on the way the whole timeseries is shaped.
If that's truly what you meant it is P(x_t = point_of_interest | x_1 , x_2, ... x_n). That's uncommon for time series, you're conditioning on the whole thing.
Is your problem not p(x_t = point_of_interest | x_1 , x_2, ... x_t-1)?
what is wrong? I cannot solve it 😡
intervals = [("11:10:00", "11:19:59")]
you should pretty much never be using strings to represent time-related stuff.
because you end up with TypeError: '>=' not supported between instances of 'str' and 'datetime.time'
looks like you parse them later, I guess
Do you have any suggestion how I should do it instead?
oh, here's your mistake
data['TIME'] = pd.to_datetime(data['TIME'], errors='coerce').dt.time.astype(str)
you convert it back to a str at the very end.
I assume you did that to work around some other error
Python in a rabit hole. I am confused XD
I am trying to solve that the module is just looking for the given time interval from the Column called "TIME".
I'm busy now, but you probably need to look at how your TIME column is formatted and parse it in such a way that you won't get so many errors.
and no matter what, don't convert it to str
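A small sketch of what that could look like, assuming the TIME column holds "HH:MM:SS" strings (the real format was not shown, so the data and the VALUE column here are made up). The column is parsed once into datetime.time objects and the interval bounds are time objects too, so the comparison never mixes str and time.

```python
import datetime as dt
import pandas as pd

# Hypothetical data; the real TIME column format may differ.
data = pd.DataFrame({
    "TIME": ["11:09:59", "11:10:00", "11:15:30", "11:19:59", "11:20:00", "bad"],
    "VALUE": [1, 2, 3, 4, 5, 6],
})

# Parse once with an explicit format; unparseable entries become NaT.
parsed = pd.to_datetime(data["TIME"], format="%H:%M:%S", errors="coerce")
data = data[parsed.notna()].copy()
data["TIME"] = parsed[parsed.notna()].dt.time  # datetime.time objects, not str

intervals = [(dt.time(11, 10, 0), dt.time(11, 19, 59))]

def in_any_interval(t):
    """True if time t falls inside one of the closed intervals."""
    return any(start <= t <= end for start, end in intervals)

selected = data[data["TIME"].apply(in_any_interval)]
```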
Please check pm and thank you.
I won't have time, sorry.
Not now but when you have! I would highly appreciate it.
Have a good one
I won't have time this week
my schedule is fucked.
write here so we can help you
Thank you. I just left the computer so I will return tomorrow. Last time I was going to ask a question I was faced with stupidity.
thanks for your reply and sorry that I didn't see it until now. And yes, the problem description is a little uncommon, that's why I am asking here. We have a lot of independent timeseries, each labeled with a point of interest.
Current simple solution:
currently what we do is take a timeseries and subtract 0.5 from every observation. Then we look at the shifted timeseries and simply use the timestamp of the last moment it cuts below the x axis as our point of interest. That works well enough in most cases and gets close enough to the actual point of interest. We believe, though, that there might be a better solution that could look at each timeseries more individually and find the point of interest more precisely for each timeseries.
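For reference, the baseline described here can be sketched in a few lines of NumPy. The function name and the sample signal are made up; the only thing taken from the description is "shift by 0.5 and take the last downward crossing of the x axis".

```python
import numpy as np

def last_downward_crossing(signal, level=0.5):
    """Index of the last sample where the signal drops from >= level to < level.

    Returns -1 if the signal never crosses downward.
    """
    shifted = np.asarray(signal, dtype=float) - level
    above = shifted >= 0
    # True at position i when sample i is above and sample i+1 is below.
    drops = above[:-1] & ~above[1:]
    idx = np.flatnonzero(drops)
    return int(idx[-1] + 1) if idx.size else -1

# Made-up normalized signal: a spike followed by a decay towards 0.
signal = [0.0, 1.0, 0.9, 0.7, 0.55, 0.45, 0.3, 0.1, 0.0]
poi_index = last_downward_crossing(signal)
```

Having this as a function also gives any ML model a concrete baseline to beat on held-out series.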
Our goal:
Some type of ML solution that has looked at every of our collected timeseries and knows each point of interest. If shown a new and unseen timeseries, it is able to identify a point of interest that is close to the truth, because of its experience with all the seen and labeled data. what kind of algo or ml model would you suggest? I am even struggling to identify a model that suits this problem description, since most timeseries models like LSTMs and such all try to predict the next timestep. but that is not what we want at all. we simply want it to look at a timeseries and identify a certain point of interest.
So you're definitely in this case: P(x_t = point_of_interest | x_1 , x_2, ... x_n), you're conditioning over the entire series?
yes, the entire timeseries is history and we need to find a point in this timeseries ex post. no prediction of further timesteps or such.
Okay, perfect. Then I have 2 suggestions but they each have the same caveat
i am all ears 🙂
- Bidirectional RNNs/LSTMs or whatever do essentially this: they condition on the entire series and make a prediction. They're (or were?) commonly used in machine translation, and they could be a good fit because they match your problem statement.
- Use an LSTM as you would normally to generate a latent variable. Basically the "latent" Z = F(x_1, x_2, ... x_n). Then you use an MLP or whatever on top of that to predict whether each point is a point of interest, so it's G(x_t, Z). Training this one will be more annoying; I'd train it in 2 phases and only attempt it if the first approach fails and you're desperate...
The caveat imo is that if you use a regular loss like BCE you don't really account for the fact that if your model says the point of interest is at t=49 but in reality it was at t=50 that's a lot better than the model saying it occurs at t=1000
i dont really understand yet. I have used LSTMs with Keras in the past and i always used it to continue a timeseries. How would I need to implement such a LSTM to classify?
also I would need to translate the timeseries into a tensor with a lookback window, right? what would be my input and what would be the output?
do you maybe have a code example that i could look at?
This is done a lot in NLP, part-of-speech tagging (POS) is an example of this I think. The more general term is sequence prediction.
https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/ this is what you are referring to, right?
Bidirectional LSTMs are an extension of traditional LSTMs that can improve model performance on sequence classification problems. In problems where all timesteps of the input sequence are available, Bidirectional LSTMs train two instead of one LSTMs on the input sequence. The first on the input sequence as-is and the second on a reversed copy of...
thanks for your help and input! i will read into this topic
Sequence classification usually means they get 1 label for the entire time series. You want 1 label per time step.
Maybe the models from machine translation (MTL) would work for you? You can reformulate your problem as a seq-to-seq one. You have an input sequence (the time series) and you have an output sequence (zeros, and a 1 where it's a point of interest).
In any case, I encourage you to look up bidirectional RNNs and look at a bit of MTL. Most of sequence stuff in neural nets are done in NLP. That's not my field whatsoever but I'm working on time series so it's good to look at their stuff for inspiration.
Also, the sliding window only matters if you don't want to condition on the entire series.
- You'll need to tinker with your loss. I'd start out with BCE and maybe add MSE to constrain the model from having very wrong point of interests. Like a weighted average.
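To make the loss idea concrete, here is a NumPy-only illustration (not a Keras loss; a framework version would need the same arithmetic written with differentiable tensor ops). The function names, the 0.5 weight, and the form of the position term are assumptions: the position term compares the probability-weighted mean index with the true index, which is one way of adding an MSE-style constraint.

```python
import numpy as np

def one_hot_target(length, poi_index):
    """Output sequence: zeros everywhere, 1 at the point of interest."""
    target = np.zeros(length)
    target[poi_index] = 1.0
    return target

def combined_loss(probs, poi_index, weight=0.5, eps=1e-7):
    """Weighted average of per-step BCE and a squared position error.

    probs: per-timestep probabilities in (0, 1), e.g. sigmoid outputs.
    The position term compares the probability-weighted mean index with
    the true index, scaled by the series length so it stays in [0, 1].
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    target = one_hot_target(len(probs), poi_index)
    bce = -np.mean(target * np.log(probs) + (1 - target) * np.log(1 - probs))
    expected_index = np.sum(np.arange(len(probs)) * probs) / np.sum(probs)
    position = ((expected_index - poi_index) / len(probs)) ** 2
    return (1 - weight) * bce + weight * position

def peaked(at, length=100):
    """Made-up prediction: low probability everywhere, high at one step."""
    p = np.full(length, 0.01)
    p[at] = 0.9
    return p

# True point at t=50: BCE alone cannot tell these two misses apart.
near_miss = combined_loss(peaked(49), poi_index=50)
far_miss = combined_loss(peaked(5), poi_index=50)
```

With weight=0 both misses score the same, which is exactly the caveat about plain BCE; with the position term the miss at t=49 is penalised less than the one at t=5.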
im working on the mnist dataset. Initially I used fetch_openml() to get it, but it takes a long time since it redownloads everytime I run the code. I manually downloaded mnist in .arff format. Is there a way to import it manually in the same format fetch_openml() would? As a bunch object?
Or maybe just a way to avoid fetch_openml() from downloading it every time i run?
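One generic workaround, sketched under assumptions: load the data once, save the arrays to a local .npz file, and read that file on later runs. `load_cached` and `fake_fetch` are hypothetical names; the commented-out fetch_openml call shows where the real loader would go. Note this returns plain arrays (X, y), not a Bunch object.

```python
import os
import tempfile
import numpy as np

def load_cached(cache_path, fetch):
    """Return (X, y), calling fetch() only when the cache file is missing."""
    if os.path.exists(cache_path):
        with np.load(cache_path, allow_pickle=False) as cached:
            return cached["X"], cached["y"]
    X, y = fetch()
    np.savez_compressed(cache_path, X=np.asarray(X), y=np.asarray(y))
    return np.asarray(X), np.asarray(y)

# Stand-in for the slow call; with scikit-learn this could be something like
#   lambda: fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
calls = []
def fake_fetch():
    calls.append(1)
    return np.arange(12).reshape(3, 4), np.array(["0", "1", "2"])

cache_file = os.path.join(tempfile.mkdtemp(), "mnist_cache.npz")
X1, y1 = load_cached(cache_file, fake_fetch)   # first run: fetches and saves
X2, y2 = load_cached(cache_file, fake_fetch)   # later runs: read from disk
```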
Its maybe a long shot but have you tried .set_index('TIME') ?
With or without inplace=true . For me its a bit of trial and error with those time commands when i have some new data
why would that solve it
You can ask Chad, GeePT.
It does ok in finding bugs in code, but I wouldn’t use it to code the solution for you.
I don’t think you need ML for this. I would say this is more of a digital signal processing problem.
Do you only want to find that specific point relative to the time series trend?
Everyone knows about ChatGPT. Suggesting that someone use it isn't that helpful.
basically yes
i mean, i described our current solution in the picture attached which is basically what you proposed. this works "relatively" well, since all timeseries point of interest is where they hit the x axis somewhere around the actual point. but we believe there might be a more precise way when taking the individuality of each curve into consideration.
if you have looked at about 100 of these timeseries you would be able to quite confidently point out the actual point of interest
Just phrasing it differently: so given a set of time series, each with a single event at a different X, you want a model that predicts X from other / test time series?
basically yes. i have a large dataset of independent timeseries labeled with an exact point of interest
here is a better description linked
and here
I'm just thinking about how to formulate the problem... like, one is to predict how long it takes to get to X. But, the supposition here is that X is a function of the shape of the graph?
that is what i am struggling with as well. most time series models just want to predict the next step. but here you have a sequence and a human can quite quickly learn where the point of interest is in a timeseries when looking at some samples. so i feel like this gut feeling can be modeled with an ml model
X could hypothetically be merely a function of the integral (total area to date?) too, right?
Yah, I get that this is a ml question, I’m just doing the usual trying to understand the edges of the problem
the integral is the entire area under the curve, right? sorry english is not my first language. so that wouldnt solve it.
our closest and easiest solution that works well enough is to simply subtract 0.5 and check where the curve hits the x-axis. that works generally well enough for now. but there is still much room for improvement.
another user proposed bidirectional LSTMs. so that is what i am looking at right now. but if you have other suggestions i am very willing to read into it
it's a tough one, it sounds like something I'd need to play with to understand the relationships at play
Like, what's the underlying mechanism at play?
well actually we are looking at a detector signal and the point of interest is a certain particle passing through an accelerator at that time
we have the detector signal on the one hand and on the other hand we can say the particle passed through at that point of time with a high enough degree of certainty. the time series is actually sampled at a frequency of 2 nano seconds
You could take the derivative of the signal and find the points of inflection in that signal.
tried that already. but the -0.5 solution works better in practice
i mean in the picture i kind of depicted an idealized curve. in practice there are disturbances and interferences in the signals
Look into machine learning/deep learning for radio frequency. Maybe you can find some ideas there
but as a human you can tell quite confidently where the point is if you looked at enough timeseries and know what you are looking for
our current next approach is to use bidirectional lstms. and maybe trying some noise reduction on the time series and such.
I still don’t think it is a ML/DL problem, and more of a DSP problem, but I could be wrong. I don’t have much experience in ML or DL 😰
What's the frequency distribution / pdf look like?
do you mean how much they vary? we normalized all timeseries and lined the initial spike up if thats what you mean
yah, I always try to understand classically before throwing any ml/dl/ai techniques, usually because I need to reframe the question.
Like, you mentioned the 0.5 point... like, how accurate is that? How does it deviate from that?
ah okay, deviation from it is like +/- 15-20%. sounds like much but is actually already good enough. but yeah, i believe we can make it even better, since you can point to the point of interest with gut feeling
Hmm, let me think about this a bit, it's an interesting question.
I'm still inclined to try to reformulate this somehow. Like, is it distance from a significant peak (as your graphs suggest)?
We tried already formulating it as relative distance to the initial peak and such and relative to the whole timeseries but so far our simple -0.5 approach worked the best and ML is some kind of Hail Mary to make it more precise
I was also thinking about self organizing maps (there is a really nice package called SuSi for Python) and SVMs. Just throwing it out there in case it inspires you to something 😄
Hmm what about reconstructing the signal’s function from samples?
I did this for homework in grad school but I couldn’t find the code that I used :/
Hi guys, I encountered a mysterium regarding pandas (at least it's a mysterium for me). I have a function like the one below, which is taking a df as input, filtering it (assigning the filtering results to the original df variable) and then doing some changes on slices of the dataframe before returning the result df.
def do_something(df, timestamp):
    df = df[df["column_a"] > 10]
    df["column_b"] = pd.DataFrame(timestamp)
    return df
What confuses me here is, that I get a SettingWithCopyWarning for the second line of my function, caused by the first line. I know in general, why this warning comes up and what it means. But to my knowledge, the first line should simply manipulate the original df and not create any temporary views/subsets. Can someone explain me, why this is happening here?
The first line returns a view/subset of the original dataframe
So, the warning is saying: "Hey, be careful, you're modifying a view/subset of the original dataframe... not the original dataframe". It's really easy to get mixed up when you operate on views.
"the first line should simply manipulate the original df and not create any temporary views/subsets.": That's not at all what's happening
The right side: df[df["column_a"] > 10] is returning a view of the original DF. Then, you used df = <the new view>. So now df points to a view, not the original df. This is all probably bad practice... it'd be better to do:
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame({"column_a": [1,2,3,4,5,6,7]})
def do_something(df, value):
    df.loc[df["column_a"] < 3, "column_b"] = value
do_something(df, datetime.datetime.now())
print(df)
But when I explicitly override the reference on the original df, there is not much room for issues or am I wrong? What I honestly dislike about your solution is, that when I have lots of assignments in my function, I will have to write a lot of boilerplate code. And when I explicitly copy in the first line, I will have higher memory consumption. That does not make me happy, really. But thanks for your explanation
Why do you think you'll have higher memory consumption?
My solution was really just one line: df.loc[df["column_a"] < 3, "column_b"] = value
Whereas your version was two: ```
df = df[df["column_a"] > 10]
df["column_b"] = pd.DataFrame(timestamp)
Sorry for the confusion, what I meant with "explicitly copy in the first line" is this:
def do_something(df, timestamp):
    df = df[df["column_a"] > 5].copy()
    df["column_b"] = pd.DataFrame(timestamp)
    return df
which should also be a viable solution.
What I meant with boilerplate is:
def do_something(df):
    df.loc[df["column_a"] < 3, "column_b"] = value_a
    df.loc[df["column_a"] < 3, "column_c"] = value_b
    df.loc[df["column_a"] < 3, "column_d"] = value_c
    df.loc[df["column_a"] < 3, "column_e"] = value_d
    df.loc[df["column_a"] < 3, "column_f"] = value_e
    return df
When I would be able to filter the dataframe once, before applying all the assignments, I would not have to rewrite the filter over and over again
You could just do: df.loc[df["column_a"] < 3, ["column_b", "column_c"]] = ('a', 'b')
Well 😄
That would be a bit hard to read for assignments like this
df["column_a"] = value.astype("int32").fillna(2).replace(0, 1)
This kind of scenario is pretty unsatisfying, as I encounter such quite often
I'm just commenting on the problem as stated: You want to update column(s) based on a condition with one or multiple constants. I believe what I proposed is the most efficient (barring a numpy solution) and cleanest way to do it. Making a copy seems unnecessary for what you're describing (altho a copy isn't as expensive as it sounds)
You could make it more readable with something like:
condition = df["column_a"] < 3
columns = ['column_a', ...]
values = (val1, val2, ....)
df.loc[condition, columns] = values
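A runnable version of that pattern, with made-up data and constants. The target columns are created up front here as a precaution (an assumption on my part, so the example does not depend on how a given pandas version handles missing columns in .loc).

```python
import pandas as pd

df = pd.DataFrame({"column_a": [1, 2, 0, 4, 5]})

# Build the boolean mask once and reuse it for every assignment.
condition = df["column_a"] < 3
columns = ["column_b", "column_c"]
values = ("a", "b")  # made-up constants, one per column

# Create the target columns explicitly, then assign through the mask.
for col in columns:
    df[col] = None
df.loc[condition, columns] = values
```

Because the assignment goes through .loc on the original frame, there is no intermediate filtered frame and therefore no SettingWithCopyWarning; rows outside the mask keep their placeholder value.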
I will experiment a bit and see, what it will look like. Thanks for your help!
Hello, does anyone know why NumPy is used in Data Analytics?
How can I change my threshold value according to value that adjusts smoothly with gradual changes but responds quickly and noticeably when there's a sudden change. The threshold value should also have some kind of negative value. If it drops significant below a given slope, it should also be considered as a error. https://paste.pythondiscord.com/OZJETWVAW2EXPIXCPYZSVCZYGI
I made a code that generates the data of satellite position for 2 years, every 1 minute .. it will take 186 hrs to run .. where I can run this code for free ?... edit 860 hrs
Huh 186 hrs
Here it is
What does fill_loc do?
Gets the time from df_loc, gets the orbital data from df_sat and calculate the coordinates of satellite and saves against time in df_loc. The time range is 1 jan 2021 to today ..gap of 30 seconds
So like a join of sorts?
Yeah.. but the main location calculation is taking time
Hard to help without details
Hmm.. is there any server which will give me this much cpu time for free? Where I can run this code and collect the csv after 10 days
def fill_loc(df_tle: pd.DataFrame, df_loc: pd.DataFrame) -> pd.DataFrame:
    cols = ['mean_motion', 'eccentricity', 'inclination',
            'ra_of_asc_node', 'arg_of_pericenter', 'mean_anomaly', 'rev_at_epoch',
            'bstar', 'mean_motion_dot', 'mean_motion_ddot', 'semimajor_axis',
            'period', 'apoapsis', 'periapsis']
    tle_idx = df_tle.index
    j=0; count=0; flag=0;
    for i in tqdm(df_loc.index):
        t1 = df_loc.loc[i, 'date']
        if df_tle.loc[tle_idx[j+1], 'epoch'] < t1:
            j += 1
        df_loc.loc[i, cols] = df_tle.loc[tle_idx[j], cols]
        loc = get_live_data(*df_tle.loc[tle_idx[j],['tle_line1','tle_line2']] , t1)
        pos_lst = ['lat','lon','h','vx','vy','vz']
        df_loc.loc[i, pos_lst] = loc.values
    return df_loc
Any possible optimization???
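One part that can probably be taken out of the Python loop is the "copy the most recent TLE row onto each timestamp" step, which looks like an as-of join. A sketch with made-up stand-ins for df_tle and df_loc (only the `epoch`, `date` and `mean_motion` column names come from the function above). One assumed difference: pd.merge_asof also matches when date equals epoch exactly, while the loop uses a strict `<`.

```python
import pandas as pd

# Made-up stand-ins for df_tle (orbital elements per epoch) and df_loc (timestamps).
df_tle = pd.DataFrame({
    "epoch": pd.to_datetime(["2021-01-01 00:00", "2021-01-01 06:00", "2021-01-01 12:00"]),
    "mean_motion": [15.1, 15.2, 15.3],
})
df_loc = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-01-01 00:00", "2021-01-01 03:00", "2021-01-01 06:00",
        "2021-01-01 09:00", "2021-01-01 12:00",
    ]),
})

# Both frames must be sorted on their keys. direction="backward" picks, for
# each date, the last row of df_tle whose epoch is <= that date.
merged = pd.merge_asof(
    df_loc.sort_values("date"),
    df_tle.sort_values("epoch"),
    left_on="date",
    right_on="epoch",
    direction="backward",
)
```

The per-row get_live_data call is likely still the dominant cost; whether it can be batched depends on the propagation library behind it, which was not shown.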
Hi, i am trying to load CSV and PDF files into a vector database using glob to search for the files and Langchain Document Loaders to load them in. using this code:
for file in Path(cfg.DATA_PATH).rglob('*.csv'):
    filecsvint += 1
    print(f'[{datetime.now().strftime("%H:%M:%S")}] Loading {file} into vector database | Document Nr. {filecsvint}')
    try:
        print(f'[{datetime.now().strftime("%H:%M:%S")}] Loaded {file} successfully into vector database | Document Nr. {filecsvint}')
        documents.append(CSVLoader(file))
        filecsvintsucc += 1
    except Exception as e:
        filecsvintfail += 1
        print(f'[{datetime.now().strftime("%H:%M:%S")}] Could not load {file} into vector database | Document Nr. {filecsvint}')
        print(e)
fileint = 0
fileintsucc = 0
fileintfail = 0
for file in Path(cfg.DATA_PATH).rglob('*.pdf'):
    # Convert to string representation
    fileint += 1
    print(f'[{datetime.now().strftime("%H:%M:%S")}] Loading {file} into vector database | Document Nr. {fileint}')
    try:
        print(f'[{datetime.now().strftime("%H:%M:%S")}] Loaded {file} successfully into vector database | Document Nr. {fileint}')
        documents.append(PyPDFLoader(file))
        fileintsucc += 1
    except Exception as e:
        fileintfail += 1
        print(f'[{datetime.now().strftime("%H:%M:%S")}] Could not load {file} into vector database | Document Nr. {fileint}')
        print(e)```
Gives me
```bash
argument of type "PosixPath" is not iterable```
I tried this too:
```py
for file in Path(str(cfg.DATA_PATH)).rglob('*.csv'):
# Rest of the code...
for file in Path(str(cfg.DATA_PATH)).rglob('*.pdf'):
# Rest of the code...```
It gave me the same error.
Can anyone help me?
hope that belongs in here
normally rather in the normal #1035199133436354600
but can u give the full traceback?
As in the full Console output?
the full error yes
sure, one sec
Needed to cut it a bit cuz its hundreds of files that have the same error
It says loaded only cuz i was too stupid to put the log after it actually tries to load lol
this isnt ur complete code?
AttributeError: 'CSVLoader' object has no attribute 'page_content'
the function is not defined and it seems to be a problem with that function
Not its not the full code, and yea that errors too. The one nagging me rn is the argument of type "PosixPath" is not iterable
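A guess at where the "not iterable" message comes from, since the full traceback is not shown here: somewhere downstream a membership test like `x in path` is run on a Path object, which raises TypeError. The snippet below only demonstrates that mechanism with the standard library; the file name is hypothetical. If that is the cause, passing str(file) to the loader would avoid it. The 'page_content' error looks like a separate issue: the list holds loader objects rather than the documents a loader's load() call returns.

```python
from pathlib import Path

file = Path("data") / "example.csv"  # hypothetical file

# A membership test on a Path raises TypeError, because Path objects are
# neither containers nor iterables.
try:
    "://" in file
    path_error = None
except TypeError as exc:
    path_error = str(exc)

# The same test on the string form works as expected.
string_ok = "://" in str(file)
```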
https://paste.pythondiscord.com/YA6A this is the full
thats a problem with the Path lib
can u print out the content of cfg for me
Sure
{'RETURN_SOURCE_DOCUMENTS': True, 'VECTOR_COUNT': 2, 'CHUNK_SIZE': 500, 'CHUNK_OVERLAP': 50, 'DATA_PATH': PosixPath('data'), 'DB_FAISS_PATH': 'vectorstore/db_faiss', 'MODEL_TYPE': 'llama', 'MODEL_BIN_PATH': 'model/llama-2-7b-chat.ggmlv3.q8_0.bin', 'MAX_NEW_TOKENS': 256, 'TEMPERATURE': 0.01}
Ooooh
😄
no
Huh
tay@dedi:~/AiAssistant/API$ python3 db_build.py
[11:53:48] Building vector database from data/
Traceback (most recent call last):
File "/home/tay/AiAssistant/API/db_build.py", line 85, in <module>
run_db_build()
File "/home/tay/AiAssistant/API/db_build.py", line 33, in run_db_build
for file in cfg.DATA_PATH.rglob('*.csv'):
AttributeError: 'str' object has no attribute 'rglob'
u should start to read the Tracebacks 😛
Yea, i am not that experienced with Python at all
And now we are back to the PosixPath error. Cuz i got told to make cfg.DATA_PATH a Path object. Something is clearly wrong here
gonna try ur approach but idk how to fix the glob issue
nope nothing works
Okay i somehow got it
Now i have the CSVLoader issue
Anyone know?
Solved in Post #1151466638567288852
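For anyone hitting the same pair of errors later: a rough sketch of the difference between the two, not the actual fix from the linked post. `rglob` only exists on `Path` objects (hence the `'str' object has no attribute 'rglob'` traceback), while "argument of type 'PosixPath' is not iterable" is what you get when some code runs an `in` check against a `Path`, which is one plausible cause if a loader expects a plain string. The temp directory and file name here are made up for the demo.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    data_path = Path(tmp)  # stands in for cfg.DATA_PATH (assumption)
    (data_path / "sub").mkdir()
    (data_path / "sub" / "a.csv").write_text("x\n1\n")

    # rglob needs a Path; wrap a str config value in Path(...) first.
    # Convert each match to str before handing it to code that expects a string.
    csv_file_names = [str(p) for p in data_path.rglob("*.csv")]

    # An `in` test against a Path raises TypeError, because Path is not a container.
    try:
        "csv" in data_path
        membership_error = None
    except TypeError as exc:
        membership_error = str(exc)
```

So keeping `DATA_PATH` as a `Path` for globbing and passing `str(file)` downstream avoids both errors, assuming the downstream code really does want a string.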
!rule 6
we don't allow unapproved advertising - please remove your post
sorry, removing
thank you 🙏
Is there anyone who know if this could be a good solution or is there any other suggestions:
https://paste.pythondiscord.com/XCAA
you didn't really explain what you were trying to do. this looks like anomaly detection?
so you are trying to maintain some kind of baseline rate of increase, and it's an anomaly if the rate of increase is too low?
it looks like in this current system, the baseline slope can only ever increase, never decrease. is that what you want?
oh wait that's wrong, i see if it decreases it will still update the slope, downward
i'm not sure about the sensitivity threshold for abrupt changes being the same as the adjustment factor, off the top of my head i don't see any convincing reason why those should be the same number
i think the idea is sound though, you might want to look into something like EWMA for a principled approach to adjusting this slope
I found this anomaly detection code on the internet and actually have no clue about it. What I am trying to make is a detector that will find a smooth or instant increase, which can be difficult to detect just by eyeballing a slope. The slope may always have some kind of change, whether it increases or decreases. But the slope should never be able to go negative, as we're talking about FCH4 from an aquatic system. At the same time, the slope will always increase, as you can see from the picture I already posted in this chat.
EWMA is new for me, but if you say that could be a solution, I can take some time to look into that.
the slope must always increase, or the level must always increase?
You may be right here! It can never be the same. Its a little tricky for me to catch this myself.
The slope in this case, will and should always be upwards or at a constant (+-) rate
okay, but that means the slope must be positive, not always increasing
Right
this code looks like it only treats an anomaly as an unexpectedly large decrease in level, not an unexpectedly large increase
what it does do is handle unexpectedly large increases differently, using a different adjustment factor
@desert oar This is somehow an explanation on small event. The slope can be higher. I can show another picture
the picture makes sense now that i've seen the code and you've explained more what you're trying to do
ah wait i was doubly wrong, they're doing the rapid adjustment for both positive and negative changes
so yeah this code seems like it does what you want, but i strongly suggest spending the time to understand it. you will also need to tune these adjustment parameters
and remind me again: what do you consider an "anomaly"?
A slope with approximately R^2 = 0.85 or higher is considered as nomaly
Thank you for that.
R^2?
And yes, I should. I have just not been able to run the full code yet as I still struggle with the first part : P
The coefficient of determination
with respect to what? that's not a slope
R^2 tells you how linear something is, it doesn't tell you how steep the line is
Ehm. The slope is chosen according to a calculation for the FCH4. If the R^2 is not high enough (i.e. 0.85 or above), the time interval over which the measurement is done has to be smaller. E.g. if the measurement ran from 11:14:59 to 11:24:59 and the slope is 0.57, the time interval would be adjusted in order to get a better R^2.
i don't know what you mean by this
it sounds like you're talking about shrinking your measurement time intervals until the linear approximation reaches a certain minimum acceptable level of correctness
the slope and the R^2 are not the same thing. do not get them confused
R^2 tells you how well the data matches any straight line. slope tells you how steep that line is.
you can have an R^2 of 1 with a slope as close to 0 as you like
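A quick numpy sketch of that distinction, with made-up data. `slope_and_r2` is a hypothetical helper, not anything from the pasted code. (For an exactly flat line the usual R^2 formula divides by zero, so the shallow example uses a tiny nonzero slope.)

```python
import numpy as np

def slope_and_r2(x, y):
    """Fit y = slope * x + intercept by least squares and return (slope, R^2)."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return slope, 1.0 - ss_res / ss_tot

x = np.arange(10, dtype=float)

# Perfectly linear data: R^2 is ~1 whether the line is shallow or steep.
shallow = slope_and_r2(x, 0.001 * x + 2.0)
steep = slope_and_r2(x, 10.0 * x + 2.0)

# A steep trend with a large zigzag on top: big slope, poor R^2.
zigzag = slope_and_r2(x, 10.0 * x + 30.0 * (-1.0) ** x)
```

So a threshold on R^2 says how well a straight line fits a window, and a threshold on the slope says how fast the level is rising. They answer different questions.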
According to this figure, a diffusive flux of methane is C-D = example R^2. But if you want to catch an ebullition event, which can happen with a smaller or bigger change (D-F), the change of CH4 ebullition concentration is obtained by subtracting the concentration at point E from the concentration at point F during the observation period.
Yeye, I am familiar with this. Maybe my way of talking is making it a little more confused.
it sounds like you are confused, or deliberately using the wrong terminology
i tried to clarify above
Just so we're on the same page. Do you understand what I am trying to solve? 🙂
i thought i did, but after this talk of R^2 i no longer understand
i thought we were talking about slopes
C-D can have a R^2 that has about 0.85. But the slope itself may after some time have a smooth or sudden increase as you can see here. And yes, my terminology is not the sharpest.
Maybe ignore the R^2 for now haha.
So you want to find anomalies in time series?
Changepoints are abrupt variations in the generative parameters of a data
sequence. Online detection of changepoints is useful in modelling and
prediction of time series in application areas such as finance, biometrics, and
robotics. While frequentist methods have yielded online filtering and
prediction techniques, most Bayesian papers have focu...
Stuff like the chow test tells you if there's a change of slope (but not where)
Imo this is a well researched area, if I were you I'd look at existing methods
it seems more like they're interested in individual big deviations from the slope
honestly their current technique seems fine
it's just EWMA on the trend slope, with the extra caveat that large changes have a greater adjustment factor
but large changes also get flagged as outliers, which i think is reasonable
so you get a big adjustment, and an alert for it
The problem with EWMA is that it has too many hyperparameters
it has 1 hyperparameter 🤔
My personal opinion
in this case they have 3: the small-change factor, the big-change factor, and the big-change threshold
Depending on how you formulate it but you have your alpha and also the size of the change
traditional EWMA just has a constant change factor
Setting both requires hindsight
yeah but that's about as simple as it gets
or simulation and testing
if you can simulate relatively realistic data scenarios, then you can basically run preference elicitation experiments on yourself and tune the hyperparameters by hand, using your opinion as a cost function
SPC can work here as well but it suffers from the same problem
what, setting a threshold on standard deviations from the trend?
Yes
that's basically the same as here, just using EWMA to retroactively estimate the trend
I'm just "paranoid" as I was working with thousands of distinct time series in the past on a similar problem and I didn't have the luxury to go into detail on individual ones
i agree that tuning is required but so does any changepoint detection algorithm
yeah, any time you need to automate this across 1000s of time series things get a lot more complicated
in that case you need an automated tuning procedure and associated cost function, which might be highly task-specific and can be difficult to determine
In the end, after chasing the god particle of ML/stats for too long, I just did SPC
but how did you estimate trend? that's the whole point here
otherwise it literally is just setting a deviation threshold, like in traditional SPC (as far as i understand it)
imo the EWMA is bordering on the simplest possible trend estimation, other than a flat moving average
My series were more or less trend stationary, or at least they should be
in this case they're asking about something that they expect to steadily increase over time, so they need to estimate trend
Or differencing
yeah but they're expecting the trend itself to change over time
at least that's what i understood their intent
although the 2nd chart they showed looked more like shifts in intercept with a constant trend
so i think also there's some confusion on their end that we can't know the truth about
I mean, it depends on the magnitude of the change. From their image differencing + SPC would have worked.
yeah agreed
that's a good point. if you expect the trend to be constant slope but just detecting large level shifts, then yes i agree
but if you want to actually model changing trend over time then i think you'd need to actually estimate the trend
But so would EWMA 🤷 it depends on your use case. In general I think checking out a paper like BOCD is still good because you might find out that your method was naive in some ways.
right, i think we agree on these points
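For reference, a minimal sketch of the differencing + SPC idea on simulated data. Everything here is made up for illustration: the trend, the noise level, the size of the level shift, and the choice of median/MAD with a 4-sigma limit (classic SPC charts typically use mean and standard deviation with 3-sigma limits).

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(200)

# Constant upward trend plus noise, with a sudden level shift at t = 120.
level = 0.5 * t + rng.normal(scale=0.2, size=t.size)
level[120:] += 10.0

# Differencing removes the constant trend; what is left is the step size.
steps = np.diff(level)

# Robust center and spread, so the anomaly itself does not inflate the limits.
center = np.median(steps)
sigma = 1.4826 * np.median(np.abs(steps - center))

# Flag steps that fall outside the control limits (+1 maps back to t).
alerts = np.flatnonzero(np.abs(steps - center) > 4.0 * sigma) + 1
```

This only works while the slope itself stays roughly constant. If the slope drifts, the center of `steps` drifts too, and you are back to estimating the trend.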
as for simplicity, EWMA is just slope_curr = slope_prev + adj_factor * diff_curr (with diff_curr being the newly observed slope minus slope_prev) which is pretty simple imo. their variation is to set adj_factor = adj_factor_large if abs(diff_curr) > diff_large_threshold else adj_factor_small which kind of makes sense if you want faster adjustment response on larger inputs
BOCD expects data with gaussian noise so it would not have worked here anyway. You'd need to detrend.
that is, they're doing EWMA on the slope, not on the level
maybe that's problematic for some reason because the slope is a rate, but as long as they're using fixed-time intervals it's equivalent
if they're using varying time intervals (which it actually does look like they're considering?) then you have issues with units and you might need a harmonic mean instead
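A sketch of that two-speed EWMA on the slope, assuming fixed time intervals. The function name, the parameter values and the sample slopes are invented here and are not taken from the pasted code.

```python
def update_slope(slope_prev, slope_obs,
                 small_factor=0.1, large_factor=0.5, large_threshold=1.0):
    """One EWMA step on the slope estimate.

    Returns the new estimate and whether the change counted as "large",
    which doubles as the anomaly flag.
    """
    diff = slope_obs - slope_prev
    is_large = abs(diff) > large_threshold
    factor = large_factor if is_large else small_factor
    return slope_prev + factor * diff, is_large

# Hypothetical per-interval slopes with a jump from ~1 to ~3.
observed = [1.0, 1.1, 0.9, 1.0, 3.0, 3.1, 3.0]

slope = observed[0]
history = []
for obs in observed[1:]:
    slope, flagged = update_slope(slope, obs)
    history.append((slope, flagged))

flags = [flagged for _, flagged in history]
```

The three knobs (`small_factor`, `large_factor`, `large_threshold`) are the three hyperparameters mentioned above, and they would still need tuning against real or simulated data.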
I think if we both had the dataset we'd 100 % agree, we're just making different assumptions 🤣
My use case was essentially a forecasting case where we wanted to know if the model is deteriorating (thousands upon thousands of SKUs). My intuition was simply that under normal circumstances the error ~ Gaussian. Spikes in error may occur so we want to isolate cases where the error gets "bad enough" over a period that is "long enough"
Obviously the problem is influenced mostly by how you define bad and long enough
that makes sense
the hacky approach would be looking at a moving average of error variance
Hi, I need help with an assignment about ARIMA time series. Can I share the doc?
it's not likely that someone will want to walk you through the whole assignment. copy/paste the part you currently want help with (as text), and show what you've tried so far.
(do not post screenshots of text--these are difficult to read and refer to)
beside importing libraries haven't done much. ARIMA is a new concept to me did some online tutorials but need some human advice as well if possible
ARIMA Modeling:
Implement an ARIMA modeling process using the statsmodels library or a similar library.
Decide on the order (p, d, q) for the ARIMA model based on the characteristics of the simulated data.
Fit the ARIMA model to the simulated data.
Forecasting:
Use the trained ARIMA model to make future forecasts for a specified number of time steps.
Generate forecasts for a period beyond the existing simulated data.
Do you already know what the p, d and q mean?
@serene scaffold
yes
Do you know how to determine them? (I don't want to "solve" your assignment for you because this way you'll learn more tbh)
Not really. First time working with ARIMA and I need some explaining a bit before I can do something. I don't really need someone solving it but rather explaining.
So p is the AR part of ARIMA and q is the moving average part. The I stands for "integrated" and corresponds to d (differencing).
this is an assignment, so presumably you were taught something in your class. what did they teach you? is there something in the material that you're confused about?
You can determine the AR by looking at the PACF and the MA by looking at the ACF. I encourage you to really read what they are doing and not just read the plots.
I'm a self taught programmer. Took a 100 day online bootcamp. I did work with pandas but not on this scale.
Afterwards I'd use auto.arima (for instance https://github.com/Nixtla/statsforecast) to find p, d and q automatically and compare it to what you found from ACF, PACF and unit root tests
Can I use pycharm to code it or is it only strictly jupyterlab?
But really, before you do this you need a good grasp on what all the letters in the acronym (A R I M A) mean separately and then together (AR, I, MA). That'll make everything a whole lot easier.
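To make the "I" part concrete, here is a small numpy-only sketch. It simulates an AR(1) process, integrates it once (so d = 1), and compares the sample ACF before and after differencing. The coefficient 0.7, the series length and the `sample_acf` helper are all assumptions for the demo.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:len(x) - k], x[k:]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(0)
n, phi = 2000, 0.7

# AR(1): each value is phi times the previous value plus fresh noise.
eps = rng.normal(size=n)
ar = np.zeros(n)
for i in range(1, n):
    ar[i] = phi * ar[i - 1] + eps[i]

# Integrating (cumulative sum) once gives a series that needs d = 1.
series = np.cumsum(ar)

acf_raw = sample_acf(series, 5)            # decays very slowly: not stationary
acf_diff = sample_acf(np.diff(series), 5)  # decays geometrically, like AR(1)
```

In statsmodels the plotting equivalents are `plot_acf` and `plot_pacf` from `statsmodels.graphics.tsaplots`, and the fit itself would look something like `ARIMA(series, order=(1, 1, 0)).fit().forecast(steps=10)` with `ARIMA` from `statsmodels.tsa.arima.model`.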
@past meteor Thanks for the explanation. Hope it gets me somewhere.
Yeah but you mentioned an assignment
@desert oar I'm learning new stuff focused on data science so I can land a job and showcase my projects.
I don't have connections that can teach me except on discord.
Books are your friend. Unless you're on a tight schedule and need something done by yesterday you should be following a book imho.
A good place to start on this topic is https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm (6.4.4.5 covers ARIMA). What i like about this text is it's broken into small chunks, with a lot of examples. The entire book is massive, but it's all great stuff.
I once started the book Elements of Statistical Learning.. it was too hard for me. I was only able to complete 3 chapters and gave up. Is there any other good book? Can I start this book from the beginning?
I'll let someone else recommend that, a lot of folks closer to college than I. I just know where I go to for applied stuff.
That book I cited is a good "read a chapter a week" kind of book. Great stuff, but it's long and more for applied / EDA stuff.
Ok.... i am looking for a book that has some good math about ml ..makes my basics strong..enough to land me a good job
zestar75 recently suggested these: #data-science-and-ml message
I'm really lost in my computer vision class. I don't get the math or even the concepts behind stuff like 2D convolution, Gaussian kernels, filtering, edge detection, zero crossings, the Laplacian filter, the 1st derivative filter etc
Can anyone recommend resources that are very easy to understand?
If you like videos: https://www.youtube.com/watch?v=KuXjwB4LzSA
Discrete convolutions, from probability to image processing and FFTs.
Video on the continuous case: https://youtu.be/IaSGqQa5O-M
The last part is a neat application, but maybe too complicated. Just understanding the convolution in 3 different contexts (probability (counting), image processing, polynomial multiplication) will probably help.
I need help fixing my neural network code would anyone be willing to help
# Calculate the upper and lower limits
df_temp = df
Q1 = df['Deaths'].quantile(0.25)
Q3 = df['Deaths'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1*IQR
upper = Q3 + 1*IQR
# Create arrays of Boolean values indicating the outlier rows
upper_array = np.where(df['Deaths']>=upper)[0]
lower_array = np.where(df['Deaths']<=lower)[0]
# Removing the outliers
df_temp.drop(index=upper_array, inplace=True, axis=0)
df_temp.drop(index=lower_array, inplace=True, axis=0)
Is there anything wrong in the code? I am getting a "not found in axis error"
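A likely cause, without having seen the data: `np.where(...)[0]` returns row positions, but `drop(index=...)` expects index labels. The two only line up when the frame has a default 0..n-1 index. Also, `df_temp = df` does not copy, so the in-place drops modify `df` too, and re-running the cell then fails. A boolean mask sidesteps both problems. The sample values and the non-default index below are made up, and the multiplier is kept at 1 as in the question (1.5 is the more common convention).

```python
import pandas as pd

# Made-up data with a non-default index, which is where positional drops break.
df = pd.DataFrame(
    {"Deaths": [10, 12, 11, 13, 12, 500, 11]},
    index=[10, 20, 30, 40, 50, 60, 70],
)

q1 = df["Deaths"].quantile(0.25)
q3 = df["Deaths"].quantile(0.75)
iqr = q3 - q1
k = 1.0  # multiplier from the question; 1.5 is the usual choice
lower = q1 - k * iqr
upper = q3 + k * iqr

# Keep rows strictly inside the limits; the original df is left untouched.
inside = (df["Deaths"] > lower) & (df["Deaths"] < upper)
df_clean = df[inside].copy()
```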
Is AIML and python good for AI?
Imo watch principles of computer vision on YouTube but you specifically need to watch it in the order they suggest on the channel.
Aside from that, you can look at the canonical computer vision books. What I did specifically was read first, watch afterwards and then implement. Rolling your own gaussian kernel etc can help.
+1 these are great!
I would also add https://www.deeplearningbook.org
anyone aware if numpy has a way to collect "tree" scattered data, i have shape like parent_index: uint4, lookup_index: uint2 and given an index i want to get an array of all the lookup indexes until parent_index == 0
Sounds like an iterative task, so probably not easily done with numpy but maybe can be sped up with numba.
Though I'm confused about how your array works. It sounds like you have a structured array with a dtype consisting of two ints - but it seems to me that'd be very annoying to lookup nodes in, requiring a linear search each lookup.
basically the array stores a flattened tree where the index is the parent and the lookup index is the art stored somewhere else
i want to materialize the path of a given index if/when necessary (which is rare)
is the array basically a graph adjacency matrix of said tree?
its not an adjacency matrix (as the tree is basically a very sparse graph), its literally just an array that serializes the tree by having each row store the index of its parent in addition to the data elements
@radiant cipher ConfusedReptile is right that you probably won't find an idiomatic numpy solution, because numpy isn't designed for "stateful iteration" (which is a term I just made up)
have you thought about using networkx for this?
i have a lot of data like this. when possible, i store (or compute) the full path (ie: 1.4.1.2) for later reference... rather than just the index of the parent.
igraph and graph-tool have better interoperability with numpy
Hey, does anyone know the name of a histogram where the x-axis is the percentile (or )? It's easy to find in plt how to normalise the frequencies, but I haven't found how to normalise the x-axis
this server isn't for hiring
i guess i'll just make a cython function that operates on a view of the column with the indexes and then use the resulting array as index
hmm, if i use 0 as the parent of 0, i can actually just create a matrix of materialized paths by appending the values of the view of the parent indexes and creating the next line
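For the occasional single lookup, a plain-Python walk over the two columns may already be enough before reaching for Cython. This sketch assumes row 0 is the root and is its own parent (as suggested above), and that there are no cycles. The dtypes and the sample tree are made up.

```python
import numpy as np

def materialize_path(parent_index, lookup_index, start):
    """Follow parent pointers from `start` up to the root (row 0).

    Returns the lookup indexes along the way, root first.
    """
    path = []
    node = int(start)
    while True:
        path.append(lookup_index[node])
        if node == 0:  # root reached (row 0 is its own parent)
            break
        node = int(parent_index[node])
    return np.array(path[::-1], dtype=lookup_index.dtype)

# Tiny example tree: 0 is the root, 1 -> 0, 2 -> 1, 3 -> 2, 4 -> 1.
parent_index = np.array([0, 0, 1, 2, 1], dtype=np.uint32)
lookup_index = np.array([100, 101, 102, 103, 104], dtype=np.uint16)

path = materialize_path(parent_index, lookup_index, 3)
```

The loop runs once per level of depth, so with filesystem-style paths it stays short even when the array has hundreds of millions of rows.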
Hey, if i wanted to get into AI development where should I start? As a novice
Really I want to train something to handle some basic equations and hopefully scale it up
I only have experience with making algorithms, so I have no idea how to jump to AI from there
Anyone interested in collaborating with me in my IslamAI project? Currently still collecting/cleaning authentic data and creating blueprints for API endpoints. But will definitely love for some help with ML
What kind of tree is it? Does it have a fixed number of child nodes (e.g. a binary tree in the case of N=2)? If not, do you know what the maximum number of children per node is?
the tree starts as a list of path objects of different depth - so filesystem limits apply
(its hundreds of millions of them, most dont need to be materialized, and having just an index to a tree saves so much memory )
what kind of AI are you interested in making? it's a very broad term so just trying to narrow down what you're interested in
Hmm. Something which can understand and perform mathematical operations given an equation and asked to solve, similar to wolframAlpha
Yea, then it probably requires Cython. Although it seems like you only need one loop? So you can probably Cython inline that.
In stable_baselines3, I'm using MaskablePPO(from contrib). How do I set an action mask for a multidiscrete action space? This is what I'm trying right now, but it doesn't work. My multidiscrete action space has a shape of 6 and 4, so I'm returning a tuple with 2 ndarrays with shapes 6 and 4, respectively
def action_masks(self):
    column_mask = np.zeros(
        (width,),
        np.int8,
    )
    for i in range(width):
        column_mask[i] = int(
            self.check_valid(self.last_in_column(i), i, initial=True)
        )
    color_mask = np.zeros((4,), np.int8)
    for i in range(1, 4):
        color_mask[i - 1] = int(self.tiles_left[i] != 0)
    return column_mask, color_mask
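As far as I understand sb3-contrib, `MaskablePPO` wants a single flat mask for a MultiDiscrete space rather than a tuple. That means one array of length sum(nvec), here 6 + 4 = 10, with the per-dimension masks concatenated in order. Worth double-checking against the sb3-contrib docs. A numpy-only sketch of the flattening, with a hypothetical helper name and made-up mask values:

```python
import numpy as np

def flatten_masks(*per_dimension_masks):
    """Concatenate one mask per action dimension into a single flat bool mask."""
    return np.concatenate(
        [np.asarray(mask, dtype=bool) for mask in per_dimension_masks]
    )

# Made-up masks with the shapes from the question: 6 columns, 4 colors.
column_mask = np.array([1, 0, 1, 1, 0, 1], dtype=np.int8)
color_mask = np.array([1, 1, 0, 0], dtype=np.int8)

flat_mask = flatten_masks(column_mask, color_mask)
```

In the method above that would mean ending with `return np.concatenate([column_mask, color_mask])` instead of returning the tuple.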
If possible I would in future want to develop it to be able to perform calculus questions, such as differential equations, and give working, for me to use as an educational tool
Or you can use Taichi, which is my preference these days for loops I need to make fast quickly directly in Python.
(Numba is way more restricted, buggy, etc)
hm we don't really use AI to solve math problems, that just introduces risk of incorrect answers and things like language models that can accept text as input are not very good at doing complex math.
Hmm. It does make sense that something algorithmic would be better
One function I would want to implement is being able to give it an input equation and an output equation and determine how you can manipulate the input equation to the output equation, if that makes sense
ah yes, that would be machine learning then
that entire field is focused on function optimization
Its easy enough for x + 5 = 8, but when you start dealing with calculus, stuff starts getting wild lol
Currently using Pandas, getting this error
result[mask] = op(xrav[mask], yrav[mask])
TypeError: unsupported operand type(s) for -: 'str' and 'str'
https://developers.google.com/machine-learning/crash-course/ google's crash course
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
3b1b's playlist covering neural networks, it get progressively more mathy as the videos continue but the early parts are very simple and intuitive explanations I would recommend watching the first two regardless of whether you want to dive into the math or not
andrew Ng's courses are very highly acclaimed, here is a link to what I believe is a completely free one https://see.stanford.edu/Course/CS229 and he has many others on coursera
For that kind of stuff I recommend looking into how Wolfram Language works (it's actually a LISP-like).
Wonderful! Ty for the help!
It's basically just doing a bunch of operations on a tree.
Like factor, etc.
Then there are generic algorithms for doing various things like solving equations.
Ah ye that makes sense
Pattern matching, then converting to the "normal form" or whatever you want to call it.
Then when it's in the nice form, do the algorithm for that form.
This requires a lot of work since you need to implement pretty much every math algorithm under the sun, e.g. complicated, exact, fast root finding and such.
Each one their own project.
Wolfram Alpha is also their own natural language processing tool that will convert natural language to this LISP-like Wolfram Language.
Sounds like a larger can of worms than I anticipated
You just need to add stuff to it over a long time before it can tackle pretty much any problem like now: https://www.youtube.com/watch?v=MeuCAT5HDh0
In this 1989 video presentation, Mathematica (TM) creator Stephen Wolfram demonstrates his award winning mathematics software. Wolfram demonstrates numerical calculations, algebraic calculation and graphical renderings. He concludes with a discussion of programming.
Wolfram spent decades working on it non-stop.
Is it possible for numbers in a CSV to be perceived as a string by python, coz i think its happening to me and idk how to fix it
However, you can make your life a lot easier by using sympy, to do most of the work for you.
If you just want to solve some simple equations, do a bit of calculus, it's not too bad, I made one such CAS for fun once in C...
Pretty fun project, recommend.
Also printing math to the terminal in ascii form like in that video was fun.
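A tiny taste of what sympy gives you out of the box, using the x + 5 = 8 example from earlier plus a derivative and a simple differential equation. The specific expressions are only illustrations.

```python
import sympy as sp

x = sp.symbols("x")
f = sp.Function("f")

# Solve x + 5 = 8 for x.
solution = sp.solve(sp.Eq(x + 5, 8), x)

# Differentiate x^2 * sin(x) symbolically.
derivative = sp.diff(x**2 * sp.sin(x), x)

# Solve the differential equation f'(x) = f(x).
ode_solution = sp.dsolve(sp.Eq(f(x).diff(x), f(x)), f(x))
```

`sp.pprint(ode_solution)` prints the result as ASCII/Unicode math in the terminal, in the same spirit as the old Mathematica demo.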
Yes.
Yea no i fixed it, for some reason my CSV had commas at the thousands place
to make it easier to read
Idk how that happened to my CSV
Oh, I’ve had to fight those types of issues… or a single letter stuck in a column.
YEA thats so irritating ill be scrolling through 100's of rows to find it
I mean Luckily we can just format the excel column to get rid of commas, coz i would have no idea how to do that in python
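For the record, pandas can handle this without touching Excel. `read_csv` has a `thousands` argument, and for an already-loaded frame you can strip the commas and convert. The column names and values here are made up. The unparsed version is also the kind of data that would give the `unsupported operand type(s) for -: 'str' and 'str'` error from earlier.

```python
import io
import pandas as pd

csv_text = 'item,price,prev_price\nA,"1,250","1,100"\nB,"980","1,020"\n'

# Without help, the comma-formatted numbers come in as strings.
raw = pd.read_csv(io.StringIO(csv_text))

# Option 1: tell read_csv about the thousands separator up front.
parsed = pd.read_csv(io.StringIO(csv_text), thousands=",")

# Option 2: clean columns that were already loaded as strings.
cleaned = raw.copy()
for col in ["price", "prev_price"]:
    cleaned[col] = pd.to_numeric(cleaned[col].str.replace(",", "", regex=False))

# Arithmetic works once the columns are numeric.
change = parsed["price"] - parsed["prev_price"]
```

If a stray letter is hiding in a column, `pd.to_numeric(col, errors="coerce")` turns the bad cells into NaN, so `isna()` finds them without scrolling through hundreds of rows.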
Hi! If i have a df in pandas with the columns 'App' and 'Review'. How can i see which app have the name of the app on the review (without using apply)?
try .str.contains, but it may be bad for specific names or if the app is referred to differently
I've already tried with this:
filtered_df = np.where(df_reviews['Translated_Review'].str.contains(df_reviews['App']))
But it gives: TypeError: unhashable type: 'Series'
have u tried .iterrows
I will look at that, thanks!
In general: don’t use iterrows. That’s an anti-pattern with Pandas.
Then how can i do it?
You could either create a regex if the input is relatively simple, since contains takes a regex argument. Or, you could combine multiple conditions, one for each item in list
Hmm, I’m actually not quite sure what exactly you’re trying to filter on anyway: you want to know if a given row's review contains its app value?
I want to know if the name of the app is contained in the review of the same row. I know i can do this with apply, but i want to know if there is another way to do it
df[df['col1'].str.contains(df['col2'], na=False)] ? I’m not at my desktop right now, otherwise I’d test first.
ssh into prod and test it there
||I'll show myself out||
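On the actual question: `.str.contains` takes one pattern for the whole column, so passing a Series as the pattern raises the TypeError shown above. For a row-by-row check without `apply` or `iterrows`, a list comprehension over the two columns is a common workaround. The sample rows are made up.

```python
import pandas as pd

df_reviews = pd.DataFrame({
    "App": ["Maps", "Chess", "Notes"],
    "Translated_Review": ["Maps is great for hiking", "Crashes all the time", None],
})

# Compare each review with the app name from the same row.
# The isinstance check skips missing reviews (None / NaN).
mentions_app = [
    isinstance(review, str) and app in review
    for app, review in zip(df_reviews["App"], df_reviews["Translated_Review"])
]

filtered_df = df_reviews[mentions_app]
```

This is a case-sensitive substring match. Lowercasing both sides first would catch "maps" as well.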
Are you guys good at both pandas and sql? I’m only good at sql wondering how common having both is
I’m very good at SQL, I’m good at Pandas (I’d rather be in SQL)
I’ll have to practice then cool wish there was pandas leetcode
Lol, that’s a great idea
There’s a great sql ref… https://selectstarsql.com/
Is your job like a Data analyst or sometin
Broadly speaking: I build things for data analysts/scientists.
by things do you mean software?
Yes
Wdym?
is the software you make not public?
Correct
maybe something like this?
https://platform.stratascratch.com/coding?code_type=2
StrataScratch
oh this looks really good
it even has postgres
Ohhh so do you like work for a company or do you like do contracting
Is there a name for the second formula here? the first one is markov's inequality, and the second is similar to (but not the same as) chebyshev's inequality.
(disclosure, this is homework, so do not give me exact solutions)
anyone here good with pytorch and cnn classifiers?