#data-science-and-ml
well yes
i'm calling them tensors but ye they're essentially just n-dimensional arrays
thats just being efficient
i already played with that too
@serene scaffold Good day to you!
the splitting into subtensors to fit into cache part?
Good day, fellow intelligentleman
The maximum size of an array is limited by sys.maxsize, which is 2^63 - 1 on 64-bit builds and 2^31 - 1 on 32-bit ones.
If you're passing that up rn... you'd need an einstein and a chalk board with extensive proofs experience to prove the case beyond that...
I'm a computer science lover with a nerdery for math so if the computer cant do it, I move on haha
i mean sure, it's just that in my mind if i can come up with a way to do that for like 3 or 4 dimensional tensor contractions, i can easily generalise that to any arbitrary number of dimensions that i'd want my library to support
currently attempting to fish out of numpy developers how they acc did it lol
bc their C code is an incomprehensible mess bc it essentially only functions to support the python wrapper
so i can't just take a look at it myself lol
The buffer protocol, which is an abstraction basically
I was thinking if my method using TRIS works faster than numpy even for the case we're evaluating
that might be quite valuable
ye def
it would probably become adopted by a lot of people as it could speed up inference
i noticed that an arbitrary numpy tensor contraction takes about the same as a matrix multiplication with the same number of elements, so their tensor contraction is at the limit of matmul speed
so it would be quite astonishing if you made it faster
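A rough illustration of why a contraction can run at matmul speed: when the contracted axes are adjacent, the contraction is the same computation as one matrix multiplication over reshaped operands. This is only a sketch with made-up shapes, not a claim about how NumPy dispatches internally.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 6))   # axes (i, j, k)
B = rng.standard_normal((5, 6, 7))   # axes (j, k, l)

# Contract over j and k with tensordot.
C_tensordot = np.tensordot(A, B, axes=([1, 2], [0, 1]))   # shape (4, 7)

# Same contraction as one matrix multiplication:
# flatten the contracted axes (j, k) into a single axis of length 5 * 6.
C_matmul = A.reshape(4, 5 * 6) @ B.reshape(5 * 6, 7)

print(np.allclose(C_tensordot, C_matmul))
```

When the contracted axes are not adjacent, a transpose (and possibly a copy) is needed before the reshape, which is where extra memory traffic comes from.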
Here's a useful trick with numpy btw. Use x.__array_interface__ to explore the array
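For anyone who wants to try it, a small sketch of what the array interface dictionary exposes. The array here is just an illustrative example.

```python
import numpy as np

x = np.arange(12, dtype=np.float64).reshape(3, 4)
info = x.__array_interface__

print(info["shape"])    # shape tuple of the array
print(info["typestr"])  # byte order, kind and item size, e.g. '<f8'
print(info["strides"])  # None means plain C-contiguous layout
print(info["data"])     # (memory address, read-only flag)

# A transposed view shares the same buffer but reports explicit strides.
print(x.T.__array_interface__["strides"])
```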
We might need to use numpy's dispatch mechanism
To check if a NumPy function can be overridden via __array_ufunc__, you can use numpy.testing.overrides.allows_array_ufunc_override.
Looks like if you want to push numpy to its breaking point, there's some workaround.
I haven't explored it much though
If you go really low level with numpy, you get into some fun words like jiffies
@serene scaffold How many jiffies does it take to turn on a light bulb? :D
https://github.com/numpy/numpy/blob/v1.26.0/numpy/core/code_generators/generate_numpy_api.py shows examples how they wrote C to abstract the api
They seem to have designed things fairly well to handle cases up to 3 dimensions
numpy/core/shape_base.py line 855
# It was found through benchmarking that making an array of final size
Also worth reading: https://github.com/numpy/numpy/blob/v1.26.0/numpy/core/einsumfunc.py
I didn't know this but apparently _can_dot checks if we can use a BLAS (np.tensordot) call and whether it's beneficial to do so.
They wrote: " If the operations is BLAS level 1 or 2 and is not already aligned we default back to einsum as the memory movement to copy is more costly than the operation itself."
Also you can see their exploration of einsum and testing it with various algorithm pathings (eg. greedy) or pre-computing the optimal path and repeatedly applying it https://github.com/numpy/numpy/blob/d35cd07ea997f033b2d89d349734c61f5de54b0d/numpy/core/einsumfunc.py#L1334
numpy/core/einsumfunc.py line 1334
Chained array operations. For more complicated contractions, speed ups
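A minimal sketch of the "pre-compute the path and reuse it" idea, using np.einsum_path. The shapes and the three-operand expression are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 16))
b = rng.standard_normal((16, 32))
c = rng.standard_normal((32, 4))

# Compute a contraction order once with the greedy strategy...
path, report = np.einsum_path("ij,jk,kl->il", a, b, c, optimize="greedy")
print(path)     # a list starting with 'einsum_path', then pairs of operands
print(report)   # human-readable cost summary

# ...then reuse that path on every later call with same-shaped operands.
result = np.einsum("ij,jk,kl->il", a, b, c, optimize=path)
reference = a @ b @ c
print(np.allclose(result, reference))
```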
Einops is a really big advancement in easing tensor manipulation
guys how can i run these pages locally on my anaconda
is this better or should i switch to alex the analyst videos because i spent a long time to only learn the basics and i still didn't finish learning branching with if else elif
@pearl barn
You mean like: conda create -n "your_environment" ?
Just do something like conda create -n testingenv python=3.10
conda activate testingenv
then just type python on the console or make a new python script
python on the local console will allow you to explore those functions
@pearl barn
Didn't understand anything you are so good in this
how can u make a program in python that can scan 100 plants(a list of plants u made) and recognize them?
my big brother works on a project with a few other people they need to hire a programmer for this and asked me if i can although i'm few months in and idk much, my big brother is in nanotech so he doesn't know anything about programming and i'm curious about this
- Create an anaconda environment, see @proud wing video, or google for how to do that
- install jupyterlab with conda install -c conda-forge jupyterlab
- start jupyter lab and open the jupyter notebook from your tutorial (you have to download the .ipynb file)
@exotic star There's already apps that do this on the app store. You'd need to train a classifier model on plants that are labeled, with an extensive existing database of plants. It's not easily done from scratch with limited experience.
If you were to hire a programmer, how much would it cost
to handle the programming only (not the design of the app)
You could check out fiverr or toptal and ask around there, you might get a quote from someone capable.
got it, ty for the help
hello everyone! I have been a stats programmer working (almost) exclusively with pandas for the last 7 years. I recently began a new job working out of a databricks environment, in which 95% of the notebooks are written in pyspark. I am looking to sharpen my knowledge of pyspark, but I am having trouble setting up an environment to practice on my local machine. Does anyone have any suggestions on how I can do a little pyspark studying outside of my org?
you have to install spark separately then install pyspark
yeah, Java too right? I just cant seem to get my environment correct
why dont you say what the actual issue is because there's a lot of guides out there on how to set it up including the one from the spark docs themselves
sure just give me one second if you dont mind
import pyspark
import os
os.environ["JAVA_HOME"] = r"C:\Program Files (x86)\Java\jre-1.8"
os.environ["SPARK_HOME"] = r"C:\Users\Vince\Downloads\spark-3.5.0-bin-hadoop3.tgz"
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark import SQLContext
spark = SparkSession.builder.master("local[*]").getOrCreate()
df=spark.read.options(delimiter=",", header=True).csv("fake_ae1.csv")
df.show()
FileNotFoundError Traceback (most recent call last)
Cell In[3], line 10
7 from pyspark.sql import SparkSession
8 from pyspark import SQLContext
---> 10 spark = SparkSession.builder.master("local[*]").getOrCreate()
this was in jupyter notebook in a virtual env
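One thing that stands out in the snippet is that SPARK_HOME points at the downloaded .tgz archive rather than an extracted folder, which could plausibly produce a FileNotFoundError when the launch scripts are looked up. The check below is only a sketch under that assumption; spark_home_problems is a hypothetical helper, not part of pyspark or findspark.

```python
from pathlib import Path


def spark_home_problems(spark_home):
    """Return a list of human-readable problems with a SPARK_HOME value.

    Hypothetical helper for illustration; it only inspects the path.
    """
    problems = []
    path = Path(spark_home)
    if path.suffix in {".tgz", ".gz", ".zip"}:
        problems.append("points at an archive; extract it and use the extracted folder")
    if not path.is_dir():
        problems.append("is not an existing directory")
    elif not (path / "bin").is_dir():
        problems.append("has no bin/ subdirectory with the Spark launch scripts")
    return problems


# The value from the snippet above, reduced to its file name.
for problem in spark_home_problems("spark-3.5.0-bin-hadoop3.tgz"):
    print("SPARK_HOME", problem)
```

If the list comes back empty and the error persists, JAVA_HOME is the next thing worth checking the same way.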
How to download .ipynb it's already on jovian website but don't wanna to use it
Idk if there's an option on jovian, but worst case, you can just copy the cells from online into a new local notebook
It shows run code locally but nothing happens after that
What does it show in your jupyter notebook locally? Can you provide code/picture, what it shows and what you expected
Guys I'm desperate
I've created a library to create and use neural networks, and one to make my IA play Tetris. I've been careful about the details that could cause my IA to malfunction. The library I created is for unsupervised learning, and the problem is that after 500 generations of 124 agents, they don't play any better. I searched on the internet, and some people say that Tetris cannot be learned with an unsupervised algorithm. It really bothers me, since I've been working on it for months: can you explain why my IA doesn't learn, or is it just that it cannot learn Tetris by itself?
by unsupervised, do you mean reinforcement?
Hey guys, i am trying to do a machine learning project and I'm stuck in part and not able to figure out what to do. The project is about pdf question answering using llms. I need help
Yes
you can train a tetris playing agent with reinforcement learning.
Remember that artificial intelligence is AI in English, not IA.
how far did you get, and what's the current stumbling block?
Remember that using LLMs is quite challenging.
Ok so I'm just bad, I will work on it. Thank you !
You are great
Thanks, you're right. I tried to do my AI with an RNN, is it compulsory to use a CNN in your opinion?
I've never done anything with reinforcement learning, so idk
It usually takes way more time to use a CNN in my experience. Generally it's easier for a model to learn when the input space is smaller.
If you're operating on raw pixels for Tetris then a CNN might make more sense than an RNN
It seems like this isn't even reinforcement learning but it's actually a genetic algorithm, is this correct?
Well that's what I did, I fed 34 input neurons with processed data, but no results. My algorithm is maybe just too simple, as I was saying I'm not using TensorFlow or anything
Yes, my bad
Well done
Why are you using a genetic algorithm for this? Any particular reason?
It was just an algorithm that I knew how to code, I thought it would work
Is that the problem ?
It can work but it's a bit "wasteful". It's a black box optimiser (look this up) and there're options that can actually use more information of the game and learn in a more "directed" way, for instance deep Q Net (DQN)
A genetic algo might work still, mostly because Tetris isn't a super hard problem I suppose
Ok then I will check on what you tell me, even though it would be a relief for me that my alg works
so i am trying to make a pdf question answering chatbot using google gemini's llm. I have made one with openai and falcon-7b llm and they're working perfectly however when i'm trying to make one with google gemini, i'm getting a lot of errors. I am using streamlit and the issue is that when i'm trying to run the code, i am getting no output
what can cause this type of results in reinforcement learning
I have a branch for the reward wonder if the branch and nonlinearity there is causing this
it's an accumulator and if it falls below 70% of its original balance I give it a heavy reward penalty
otherwise the reward is the net return divided by an estimate of the variance of the return
I'm using PPO with LSTM from sb3-contribs
the observations are the variance driver, a variance estimate, and a handful of exogenous state variables
Not sure how to design the reward to get what I want if I need to take out the branch for it to work properly
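To make the branch concrete, here is a sketch of the reward as described: a heavy penalty below 70% of the starting balance, otherwise net return over a variance estimate. The penalty value, the epsilon guard and all names are assumptions for illustration, not the poster's actual code.

```python
def shaped_reward(balance, initial_balance, net_return, variance_estimate,
                  drawdown_floor=0.7, penalty=-10.0, eps=1e-8):
    """Branching reward: penalty below the floor, risk-adjusted return otherwise.

    penalty and eps are illustrative placeholder values.
    """
    if balance < drawdown_floor * initial_balance:
        # Discontinuous jump: the reward does not depend on the action here,
        # only on having crossed the floor.
        return penalty
    # eps avoids division by zero when the variance estimate is 0.
    return net_return / (variance_estimate + eps)


print(shaped_reward(60.0, 100.0, net_return=1.0, variance_estimate=1.0))
print(shaped_reward(100.0, 100.0, net_return=2.0, variance_estimate=4.0))
```

Written this way it is easy to see the scale mismatch between the two branches, which is one thing to tune when the learning curve looks odd.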
I played with genetic algorithms for a bit and they showed some promise at first but I think they are just variance machines
any information I gleaned from them that stood up out-of-sample was marginal
I ended up using reinforcement learning for my problem instead even though I had some fun messing around with evolutionary algos
for genetic algos to work I would actually have to be able to parameterize the underlying process in a way that aligns with what is causing variance in reality which is a nearly impossible task
reinforcement learning basically does that part for you
I've come to the conclusion after a couple months that genetic algos are basically trash for most things like zestar said
my reinforcement learner is trying to put the appropriate coefficient on the variance driver over time to form a series of products that optimize mean over variance over time but I have some special rules I want to add on top of that
I'd rather it do nothing than explore into spaces where it can end up in a certain state, basically like a game over for a video game
but even penalizing ending up in that state doesn't seem to prevent it reliably and it also has some really weird results on the learning curve
is that just the nature of the beast with reinforcement learning and I need to tinker with the weights of the reward if I want to branch it?
got my gpu quota request accepted, cloudwatch all setup, logs broadcast back to the master node which I just log.info so that it gets printed to the runner logs. two missing pieces are the runner workflow with secrets and finish implementing the fault tolerance in the training loop, it's actually already implemented and tested, I just need some way of having aws notify the program that the instance is gonna be terminated in 2 min, other than that signal its all done.
like what does this mean in reinforcement learning
when the rewards actually go down if you train it too long
if I keep letting it explore in theory should it eventually hit a new optimum
or is it "stuck"
There are quite a few dependencies on how you model the agents, how you combine/mutate/evolve your population, on how you measure fitness, etc.
Yeah I know, but I didn't have a choice but to create my own, since I wanted to make it run on my calculator
Sure, but there are entire books dedicated to genetic algorithms.
Without providing more info on your implementation, there won't be much that can be said
I am just modifying the weights and biases of the network using a uniform function, so it's basically random modifications ig
oh so no selection or combination
Hello guys I'm facing with a problem. I'm currently in secondary school and at the end of school we are gonna learn calculus, currently I'm doing machine learning compilers and so on which would require the knowledge of calculus what I am wondering about is whether I should learn calculus on my own right now or I should wait till I learn it in school, would learning calculus be a waste of time on my own if I'm gonna learn it in school anyways or should I pursue it?
@echo mesa why are you asking us?
@echo mesa You should learn what you are truly interested and excited to learn. Not what 'school' tells you to learn.
Some areas of math that will be useful for you if you want to do some really wild stuff with machine learning include linear algebra and tensor calculus.
Yes
Hey @final kiln where did you get your quota increased at?
In the quota page. Just search quota on AWS console
I want to test my general approach for calculating tensor contractions in C vs NumPy's... if it's faster I will probably need to publish a paper on it
@final kiln oh i just meant which provider are you using
@final kiln I use Google cloud for most things cloud, except ML
I have funding for AWS. I co founded a 501c3 and we get tons of free stuff to do open source
C is usually faster
What results did you get ?
If your C code is slower than Numpy then your C implementation is wrong.
That is usually how it goes
(And/or compiler flags)
This goes for pretty much everything vs C, except maybe assembly if you try hard enough.
(Although you then usually are doing assembly inline in C via compiler extensions)
My impression is that you need to be some sort of wizard to beat the compiler
Not really, compilers are good at repetitive optimizations that would take a human way too long to apply to the whole huge code base, but with time a human will always win.
It's like saying that ChatGPT writes better code (which is probably something people will start saying too).
It's about time spent doing it though.
But most of the time performance issues are from the 80% gains given by computational complexity (big O needs to be reasonable (but not best)) and not having the CPU do more work than needed in general (e.g. by choosing C over Python) for the same end result.
it gets muddy very quickly. There are all sorts of optimizations, including those that benefit from observing the behavior of the program
(And knowing the general stuff like the CPU cache / are you compute bound or memory bound?)
Uhm, all I know is avoid branches, keep data local and compact, inline stuff.
Has served me quite well
The gist is that your CPU can operate on data so fast that what often really matters is getting the data to the CPU in the first place. The CPU has its own local memory (the cache) to make this faster. To help the CPU you want to make memory access predictable and also it fetches memory in chunks, so contiguous memory.
(e.g. loading a single integer from RAM takes so long that in that time your CPU can do hundreds of additions / multiplications)
I am aware. RAM fetches r slow
Like some hundred cycles or something like that
also pipelines have an impact
Since your CPU has SIMD now and multiple threads and more, it can do a ridiculous amount of stuff in that same amount of time.
For that to become the bottleneck you need to do a ridiculous amount of stuff that touches the same memory (already in cache then in that case) over and over, e.g. matrix multiplication of large matrices.
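A tiny NumPy sketch of the "contiguous memory" point: walking along a row of a C-ordered array touches neighbouring addresses, while walking down a column jumps a full row each step. The array size is arbitrary.

```python
import numpy as np

a = np.zeros((1000, 1000), dtype=np.float64)   # C order: rows are contiguous

row = a[0, :]   # 1000 neighbouring elements
col = a[:, 0]   # 1000 elements, each one a full row apart in memory

print(row.strides)   # bytes between consecutive elements of the row view
print(col.strides)   # bytes between consecutive elements of the column view
print(row.flags["C_CONTIGUOUS"], col.flags["C_CONTIGUOUS"])
```

With 8-byte floats the column view steps 8000 bytes at a time, so almost every access lands in a different cache line.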
so i got this data from this website: https://catalog.data.gov/dataset/national-student-loan-data-system-722b0
i'm kinda overwhelmed
i downloaded all their spreadsheets
i had a hypothesis that students from certain demographic groups (age, gender, ethnicity) might exhibit different loan default rates.
the problem is that the data that i'm looking at does not show any demographics
https://nces.ed.gov/programs/digest/d22/tables/dt22_331.95.asp?current=yes i switched to this dataset
trying to get clean data off this is not fun
anyone want to spend the new year helping me find where the issue is with a gesture recognizer calculation:)
It's very easy to pin a CPU with anything that's inherently iterative and if it's something that can be parallelized you can easily pin a server chip of arbitrary size with a large enough problem
i disagree with this idea that cpus are not a bottleneck you have to worry about, it depends on what you're trying to do
there's a lot of problem spaces that can't be vectorized to a GPU, at least not very easily
if you are doing evolutionary algos you can easily pin a cpu for hours and end up with garbage because it's a variance machine
reward = self.sharpe_ratio - prev_sharpe_ratio
if self.initial_margin > 0.2 * current_portfolio_value:
reward = min(reward, 0)
so things like these in reinforcement learning, is it right to make them part of the reward return or should I make them deterministic constraints such that if the agent runs into them it will always do a corrective action
I would rather describe them as meta heuristic.
Whether that runs on GPU or CPU would depend more on the problem to be solved
And whether or not you end up with garbage is more linked to how they are used, no different than any other method.
I think EA are more prone to overfitting than anything else
unless you happen to know the problem has a certain structure and you're just solving for the parameters it's like shooting in the dark that the parameters you are solving for actually mean anything
not really
That said, feature engineering is an important step for any ML/AI related problem
engineering the features is different from engineering a parameterization of them
for RL you engineer features and the algorithm handles the structure
for EA you have to also provide a structure to the problem
and if it doesn't match reality, then you get garbage
which is easy to have happen I think from my experience with them but maybe other people have had better experiences
In my experience, if you get garbage with EA, is that you are doing something really wrong
let's say it's an agent problem like playing a game or optimizing some return over time
for RL you have to define the game and the features, sure, but you don't have to tell it what to do as a result of the features being some value or another etc.
if you want to use EA you have to have an idea of the causal relationship between the features and whatever the reward is
and learn whatever weights/parameters you apply to them using the EA
to me it seems easy to get that part wrong
take NEAT for instance
you don't model the structure
you have your input and output
like any other ML model
but it could be anything else you think is appropriate
You don't need any causal relationship. That's not how EA work
EA is just a way to solve for parameters right
the genes that solve a problem the best
What matter is you need to make it friendly to your mutation/selection/combination operators
what I got was something that fit in-sample time series data extremely well
and generalized very poorly out of sample
whereas with RL I get comparable performances
that's why I called it pejoratively a variance machine
I have no doubt that EA could reproduce what I'm doing with RL if I understood the right way to use the genes
but if I knew that why would I be using ML methods to begin with
Right. And similarly, had you gotten great results with EA and terrible results with RL (even for a single typo), you might be stating the opposite.
What I am reacting to is not that EA should be the solution to all the problems. Far from that. I am reacting to calling it garbage. It's a great tool which works very well for what it's made
I understood it to just be a way to fit parameters for experimental models
things where you aren't quite sure how to solve it using something like differentiation and gradients
It's great when you are ok with approximate solutions, when it's very expensive to derive a model, or when you have to deal with non-linear stuff
And also it has a cool factor
I mean, it's a meta heuristic after all
not a random generator
I think unless there's a clearer way to control overfitting then it's of limited use to complex problems like trading
and even if there was one, and I spent a lot of time thinking about it and talking with people about it, you still end up stuck if you find out you are overfitting and you don't know how to change the model to generalize better
it's just a lot more work
also even though you can parallelize it I don't think you can vectorize it in general
EA is only a partial solution I guess is the point and the other part of the solution can be arbitrarily complex and something like RL can handle that better if you can't start to put into words how you should use the features you have available to solve the problem
you have to be searching in the right places to start with
I'd like something like NEAT for RL though
overfitting is no different than other methods
I think the whole point of EAs is it depends on what you are using the genes for
it seems like with NEAT the genes themselves are weights in a neural network that will handle the structure of the problem for you
but then it's really a neural network but you are using EA as the optimization methodology instead of gradient descent
so EA is just an optimization method
you still need something else to provide the structure to the problem
anyway feel like I'm just repeating myself at this point not sure what our difference in understanding is
like, how do you use EA?
the contribution to NEAT is that you aren't just modifying weights, but also the whole structure
I have used it in many contexts, from generative arts, agents in games, to generating code in b2b products
right, where the structure is represented by the genes
you are providing the neural network and the ways its structure can be represented as a way to abstract away the structuring of the problem though
I was thinking of EA with freeform modeling by hand and solving for parameters
I mean, only the game agent was using NEAT. The rest was using other structures
I am curious how they are able to do both the structures and the weights with EA though
given that the number of weights youd have to solve for would be different depending on the structural parameters
so the individuals would be of different sizes wouldnt they
I would recommend to read the paper about NEAT. It's a pretty cool paper and easy read too
And if you are in the mood, you could take it further with CPPNs
I'm having some success with sb3 is there a library like that for NEAT
and is it as documented and easy to use
there are some python neat libraries
I have been using java though for the NEAT stuff
that's often scary because it means there's a diffusion and duplication of efforts
and none of them end up being obviously the right choice
Hi there, looking for someone as an intern who is good in image processing and know how to deal with PTZ/Network cameras.
Should be available for next 3 months as an intern. Hit me up!!
donzies
gotta test it on a gpu machine tho
should work the same if the os is the same and the drivers are pre installed
its going for 100 epochs, so I got a lot of time for testing
im gonna see if I can get aws to send the spot notice of termination to verify that the fault tolerance is working
ah no way to do it, I'd have to mock it
mlflow is goated
it's backing up my models automatically to s3, providing fault tolerance and facilitating master-slave communication between my actions workflow and spot instance
goated
yep, like everything in the internet
works quite well tho, what happened
don't see how this can possibly substitute an ML Eng, it's more like a logbook or a database
im a bit confused tho
looking at their website they dont seem to claim that it is a substitute
it's possible they offer some sort of consulting service to maintain ml infra
well in any case, im super happy with my setup
I can run the experiments right from actions
then see the stats on mlflow, and if I dont like it I can redo it by setting the same params and some other adjustment
all with spot instances, which is super cost effective
and I can always change the model in code, test it locally with cpu, and then go on to gpu via the actions workflow
I think this is a good point to stop and move on to the Shakespeare dataset, gpu is not fully figured out, but I'm sure it's some detail that I can handle once I need it, I also got quotas for the expensive gpus only so Im gonna have to wait a bit more on that
I use MLFLOW a bit myself at work. I do it on-premise though, it's a fine tool.
hello, I want to change my field to bioinformatics. can anyone guide me on how to get started?
!rule 9 6 No recruiting here, please.
6. Do not post unapproved advertising.
9. Do not offer or ask for paid work of any kind.
Even with this setup, Im looking at 200-300 dollars just to reproduce the smallest gpt 2, so I think I'm gonna stop at the Shakespeare dataset
What algorithm would i want to use for beating a platformer: DQN, DDPG or PPO?
I have 20% of parents (so ai of past generation), 60% of mutations and then some random ones
yeah, in this configuration it feels more like random exploration than an EA.
I would suggest to stick with more classical approaches and to use a library. See https://www.geeksforgeeks.org/genetic-algorithms/
For instance, the top performers from the previous generations are fine being only 2-5% of the new generations. and 60% mutations is pretty high.
Note also that the operators do matter. How do you select individuals, through tournaments, roulette wheel, other?
Here is an example of parameters from a random paper on using EA to control cars:
You guys are so handsome TYSM
I also do think I need to use a library but it would be a shame I have created my own for nothing
For the selection of individuals, I let them play a game and rate it based on how many lines they cleared, how many blocks they placed, the bumpiness of their grid, the number of holes, the mean height of their columns...
I will try modifying the criteria of selection, you're right, but if that was the problem, I guess they would have evolved a bit, at least they would have been slightly better, which is not the case
For the selection of individuals, I let them play a game and rate it based on how many lines they cleared, how many blocks they placed, the bumpiness of their grid, the number of holes, the mean height of their columns...
This is describing your fitness function, not the selection process.
Let's say you have 100 individuals, each with their fitness score as established by your fitness function.
Let's say you need to select 10 individuals out of this population, what is the process?
I take the 10 individuals with the best fitness
I am asking this because it is important to apply pressure to the evolution.
For instance, if you select the top 10, it's a start, but it also means that they are considered the same regardless of their success.
In natural selection / evolution, the most successful entity is the one that tends to reproduce the most since it's the most fit for the environment. And as such, it would make sense to select multiple times the top individual if its fitness is multiple times better than the rest
That's why I would suggest to look into the "roulette wheel" selection operator or the "tournament" one
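For reference, a minimal sketch of both operators using only the standard library. The population, the fitness values and the tournament size are made up; real implementations usually add elitism and handle negative fitness.

```python
import random


def roulette_wheel_select(population, fitnesses, k, rng=random):
    """Pick k individuals with probability proportional to fitness (with replacement)."""
    # Fitness values must be non-negative and not all zero.
    return rng.choices(population, weights=fitnesses, k=k)


def tournament_select(population, fitnesses, k, tournament_size=3, rng=random):
    """Pick k individuals; each one is the fittest of a small random tournament."""
    winners = []
    indices = range(len(population))
    for _ in range(k):
        contenders = rng.sample(indices, tournament_size)
        best = max(contenders, key=lambda i: fitnesses[i])
        winners.append(population[best])
    return winners


rng = random.Random(0)
population = [f"agent_{i}" for i in range(100)]
fitnesses = [float(i) for i in range(100)]   # hypothetical scores

parents_roulette = roulette_wheel_select(population, fitnesses, k=10, rng=rng)
parents_tournament = tournament_select(population, fitnesses, k=10, rng=rng)
print(parents_roulette)
print(parents_tournament)
```

Unlike "take the top 10", both operators let a strong individual be chosen several times while still giving weaker ones a small chance, and the tournament size controls how hard the pressure is.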
Yeah right, I will go check on that one thanks !
But I still think there is another problem to it
Maybe it's just me who messed up a thing on my code, though
Tangential, it looks like your home made library still has some way to go before implementing the basics of GA. Coupled with the tough RL type problem you are trying to solve, that means mixing two different and very complex problems.
I would suggest to pause your problem for now and focus on something simpler until you get your GA library working with the basics of GA
Once you got your GA library solved, it will be easier to focus on your game ai problem
You are so impressive
lol no
Thank you for your tips, I will modify this, can I call you afterwards if it still doesn't work (or if it does to thank you)?
np, have fun!
I've uninstalled and reinstalled h5py multiple times now. I'm trying to use tf 2.9 with nvidia cuda and cudnn so that I can utilize my gpu. Initially I did this project with my mac book and on its cpu. but after the first epoch I keep getting this error. I don't know what to do anymore and I'm beginning to get pretty frustrated
I think it has something to do with the version compatibilities but I've found nothing that says which version of h5py i need to use with tf 2.9 or anything of that matter
Im contemplating redoing this entire project in pytorch at this point
My bad!!
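# Option 1: pick a random period on every reset
def reset(self, seed=42):
    # np.random.randint excludes the upper bound, so use len(self.periods) to keep the last period reachable
    self.period = np.random.randint(0, len(self.periods)) if len(self.periods) > 1 else 0

# Option 2: step through the periods in order and wrap around at the end
def reset(self, seed=42):
    self.period = self.period + 1 if self.period < len(self.periods) - 1 else 0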
What are the pros and cons of each of these approaches to iterating over instances of games/periods in RL training data (picking a random game/period each time you reset the environment vs iterating over them sequentially and looping when you hit the end)
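A small sketch that puts both strategies side by side, to make the trade-off easier to see: random picks decorrelate consecutive episodes but cover the periods unevenly over short runs, while sequential picks visit every period equally often in a fixed order. PeriodSampler is a hypothetical stand-in for the environment's reset logic, not sb3 or gym API.

```python
import numpy as np


class PeriodSampler:
    """Hypothetical helper that chooses which period an episode uses."""

    def __init__(self, n_periods, mode="random", seed=None):
        self.n_periods = n_periods
        self.mode = mode
        self.rng = np.random.default_rng(seed)
        self.period = -1  # so the first sequential reset selects period 0

    def reset(self):
        if self.mode == "random":
            # integers() excludes the upper bound, so every period is reachable.
            self.period = int(self.rng.integers(0, self.n_periods))
        else:
            # Sequential: advance by one and wrap around at the end.
            self.period = (self.period + 1) % self.n_periods
        return self.period


sequential = PeriodSampler(3, mode="sequential")
print([sequential.reset() for _ in range(5)])

randomized = PeriodSampler(3, mode="random", seed=0)
print([randomized.reset() for _ in range(5)])
```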
hi guys, I came across this old exercice I did last year during my studies, but couldn't understand a part of the code:
a = np.random.randint(100, size=10)
print(a)
a = [*a]
a[a.index(max(a))] = None
print(a)
anyone can tell me what I meant with a = [*a]? it doesn't seem to do anything I can't remember what I did here
P.S: another example of why one should always comment code lol
It makes a list of the elements of a
* in that context means unpacking of an iterable (in this case a numpy array)
If the numpy array a has the elements [1 2 3 4] and you write [*a] that is the same as writing [1, 2, 3, 4]
Better to write list(a), a lot more understandable
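A quick check of that equivalence, plus one detail worth knowing: both [*a] and list(a) keep NumPy scalar elements, whereas a.tolist() converts them to plain Python ints.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

unpacked = [*a]       # unpack the array into a new list
as_list = list(a)     # same result, easier to read
plain = a.tolist()    # also converts elements to Python ints

print(unpacked == as_list)
print(type(unpacked[0]).__name__, type(plain[0]).__name__)

# Lists have .index(), which NumPy arrays do not, hence the conversion in the exercise.
print(unpacked.index(max(unpacked)))
```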
@indigo moth
today is the day I teach a machine to write poetry
how to make a command that would work like this:
if subject_code == "0417" and ranses == "w" and paper_number == "2" or "3" and ranyear == "2019":
i want it to proceed only if all the conditions are accepted
elif subject_code == "0417" and ranses == "w" and paper_number == "2" or "3" and ranyear == "2019":
    sesh = "November"
    qpv = f"{paper_number}"
    msv = f"{paper_number}"
    qp = f"https://edupapers.store/wp-content/uploads/simple-file-list/CIE/{programme}/{subject_name}-{subject_code}/{ranyear}/{sesh}/{subject_code}_{ranses}{ranyear[2:5]}_qp_{qpv}.pdf"
    ms = f"https://edupapers.store/wp-content/uploads/simple-file-list/CIE/{programme}/{subject_name}-{subject_code}/{ranyear}/{sesh}/{subject_code}_{ranses}{ranyear[2:5]}_ms_{msv}.pdf"
    print(qp)
    print(ms)
When checking if something is equal to one thing or another, you might think that this is possible:
# Incorrect...
if favorite_fruit == 'grapefruit' or 'lemon':
    print("That's a weird favorite fruit to have.")
While this makes sense in English, it may not behave the way you would expect. In Python, you should have complete instructions on both sides of the logical operator.
So, if you want to check if something is equal to one thing or another, there are two common ways:
# Like this...
if favorite_fruit == 'grapefruit' or favorite_fruit == 'lemon':
    print("That's a weird favorite fruit to have.")
# ...or like this.
if favorite_fruit in ('grapefruit', 'lemon'):
    print("That's a weird favorite fruit to have.")
@brittle storm
but now.. i asked it to print a random paper that has these:
subject_code == "0417" and ranses == "w" and paper_number == "2" or paper_number == "3" and ranyear == "2019":
it prints this now.. https://edupapers.store/wp-content/uploads/simple-file-list/CIE/IGCSE/Information-and-Communication-Technology-0417/2022/November/0417_**s**22_ms_2.pdf.
which is invalid cuz s is wrong
it is supposed to get on w
I'm not sure about the priority of the and and or
Try adding brackets
subject_code == "0417" and ranses == "w" and (paper_number == "2" or paper_number == "3") and ranyear == "2019":
i did that and now it doesn't do anything
it printed once just now
and i ran the command again
it didn't print
@mild dirge
I'm not sure, the syntax corresponds to the logic I think you want to implement. I don't know where the problem lies.
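On the priority question: in Python `and` binds tighter than `or`, so an ungrouped `A or B and C` is read as `A or (B and C)`. The sketch below uses made-up values for `paper_number` and `ranyear` to show how the variants differ.

```python
# Case 1: `or "3"` is a bare non-empty string, which is always truthy
paper_number = "5"
always_true = bool(paper_number == "2" or "3")

# Case 2: `and` binds tighter than `or`, so grouping changes the result
paper_number, ranyear = "2", "2022"
ungrouped = paper_number == "2" or paper_number == "3" and ranyear == "2019"
grouped = (paper_number == "2" or paper_number == "3") and ranyear == "2019"
membership = paper_number in ("2", "3") and ranyear == "2019"
```

If the values are drawn at random on each run (an assumption about how the command works), a correctly grouped condition will only match on some runs, which might be why it printed once and then did nothing.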
Anyone just give me a whole machine learning project for social good already please!!!
I don't wanna do this
Have you checked online for possible suggestions? If yes, what did you find?
how many more timesteps would you run this before you consider it converged
(this is PPO btw)
that looks like a fractal time series
well it's convolved with a window of 50 so it could look even more like a fractal if I didn't smooth it but that's how it looks when it converges slowly I think
uhm, the overall trend seems to be a linear function, so no convergence to a given value
tho I dont really know the context here
It looks like diminishing returns have set in to me
but hard to tell where that last oscillation will end up
I was trying to avoid running it for 2e6 because I will probably iterate on it again anyway
but I guess that's how I will tell for sure
not 100% sure what you're doing, but I'd be tempted to run several experiments with diff seeds and place the graphs on top of each other
I have a text file which looks like this:
number_of_nodes: 9
842 2578 0
842 2578 1
842 2578 2
842 2578 3
842 2578 4
842 2578 5
843 2579 6
843 2579 7
843 2579 8
number_of_nodes: 4
926 2206 0
927 2205 0
927 2204 0
927 2203 0
The lines number_of_nodes: 4 represent how many coordinates there are below. What I want to do is read this file in and get a list of nested lists containing these coordinates, i.e. [[...], [(926, 2206, 0), ..., (927, 2203, 0)]].
Any help will be much appreciated
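One possible way to do this, as a sketch: walk the lines, start a new group at every `number_of_nodes:` header, and append each coordinate row to the current group. The function name `parse_node_groups` is made up for illustration, and the sketch assumes every coordinate line holds whitespace-separated integers.

```python
def parse_node_groups(text):
    """Split the text into groups of (x, y, z) tuples, one group per header."""
    groups = []
    expected = []  # node counts announced by the headers
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("number_of_nodes:"):
            expected.append(int(line.split(":", 1)[1]))
            groups.append([])
        else:
            groups[-1].append(tuple(int(v) for v in line.split()))
    # sanity check: each header should match the number of rows under it
    for count, group in zip(expected, groups):
        if count != len(group):
            raise ValueError(f"expected {count} nodes, found {len(group)}")
    return groups

sample = """number_of_nodes: 2
842 2578 0
843 2579 6
number_of_nodes: 1
926 2206 0
"""
groups = parse_node_groups(sample)
```

For a real file, the same function can be fed the result of reading the file into a string.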
this is what it looked like with another seed, much more convincing diminishing returns
although the policies it came up with are quite different
uhm, looks like you have a random walker
one way to look at this
look at the y axis, and see the difference in y from each step i to the next step i + 1
if you histogram it, you will likely see the gaussian distribution
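A rough sketch of that check, using a synthetic random walk as a stand-in for the reward curve (the real series is not available here). `np.diff` gives the step-to-step differences and `np.histogram` bins them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a reward curve: a pure random walk (cumulative sum of Gaussian steps)
rewards = np.cumsum(rng.normal(loc=0.0, scale=1.0, size=5000))

steps = np.diff(rewards)                  # y[i + 1] - y[i]
counts, edges = np.histogram(steps, bins=30)

# For a driftless walk the mean step sits near zero;
# a genuine upward trend would show up as a positive mean step
mean_step = steps.mean()
std_step = steps.std()
```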
reinforcement learning is supposed to do that
to some extent
it's exploring
it doesn't monotonically increase
the trend is definitely higher up to a point though
importantly it gets into positive rewards territory
so you could say it "passes" the problem
it could be arbitrarily better though
I'm going to do some feature engineering next
especially since this is an on-policy learner
it only can learn from what it does
other methods try to figure out what the best action is instead of trying to make the one it's doing the best one possible if that makes any sense so they have nicer learning curves most of the time but they often fail to actually solve the problem
again I don't really know the context, just gathering from the graph alone haha
shakespeare is coming along
i have not followed what you are doing so far. but
- what is reward here?
- isn't it bad that it's an on-policy learner? the reward still flip flops and some time goes into <0 reward region, isn't that bad for when you are actually trading?
- will you be validating your agent against a dataset that it hasn't explicitly learnt from? i would be super anxious about overfitting here
You only use the saved model from the best reward timestep
reward is change in portfolio value / value at risk
yes I am doing the out-of-sample evaluation but I am only looking at one year period out of sample whereas I'd like to have more to have a "distribution" of out-of-sample performance
importantly the on-policy learner is the only one that actually gets to positive rewards
the off-policy methods overleverage themselves
and blow up
namely PPO vs SAC
I'm looking at PPO and PPO with an LSTM architecture in the policy network vs SAC
and SAC gives a lot nicer training chart but i can't get it to stop overleveraging itself even if I try to penalize it in the reward
the on-policy results are much better out-of-sample too
that's cool
reinforcement learning is not my strong suit.
what asset class are you trading if you don't mind me asking? hopefully you are getting some nice out of sample sharpe/calmar ratio already
this is actually doing the thing that everyone says is a fool's errand
I figured if I can trade 10y duration then I can find a model that can trade almost anything
I've got 1.5 sharpe ratio on a combination of 10y, 2s10s curve and usdjpy models
the correlations were low and the correlation between 10y and 2s10s was even negative
I figure if I do some feature engineering I can get the sharpes convincingly above 2
that would be sweet, good luck!
right now I have a pretty simple model where it's a natural gradient boosting fair value model based on macro variables, a garch-like variance estimate using Light GBM, and some indicators for economic events like NFP and CPI days
I figure if I add some more of the inputs I have on a daily frequency directly to the agent's observations that alone might improve performance
right now I have a lot of inputs that just go into the fair value model without being exposed directly to the RL agent
what I'd like and don't have is historical consensus estimates so it could try to learn a response to actual economic releases in a meaningful way
there's like one company that sells it and it's a ridiculous amount of money
yeah
banks put out estimates for those things like earnings
and they get compiled into a consensus number
so for each release of significance there is a number to compare it to
and say "is this higher or lower than expected"
that delta is what drives the market reaction not the value itself
so if CPI goes down and everyone expected it to go down your model shouldn't react to that like CPI is low
yeah, the expected value is all baked into the price already
it needs to know what the expectation was before it makes a judgment
couldn't you train a giant language model to do this stuff, feed it a bunch of documents and google searches and have it come up with predictions
have it decide what is important
I've thought about that but the market isn't democratic
well
you could maybe get economic sentiment in general from that
but I'm sure it's noisy as hell
from what I know it does require domain knowledge to be predicted
like, people who dont know about it are discouraged to participate in the first place
thus those who stay know their stuff
Prediction markets, also known as betting markets, information markets, decision markets, idea futures or event derivatives, are open markets that enable the prediction of specific outcomes using financial incentives. They are exchange-traded markets established for trading bets in the outcome of various events. The market prices can indicate wh...
You could maybe in theory extract historically significant economic releases from news stories
from places like Bloomberg and WSJ
I say feed it the entire internet and have it converge to a decision, or at least, a slice of the internet, and during training you give it the internet
I would actually like to set up a sentiment model for different assets using bloomberg news stories
the reaction of the internet and the reaction of the market are not similar enough to just use the entire internet
financial media is the way to go
right, the model would filter out what's not relevant
like, training a language model is a compression procedure
when people talk about economic data in general it's much more likely to be about politics than actual economics for instance
since economic data gets politicized
yeah, I think economics as a field seems to interplay a lot with politics, but idk much about either of those topics
still goin
well
at the moment it spits out non-sense 
To be or not to be, that is'###--/00.033300--///>3<:!3030..<=/::.++0O3..3.:L0
the part that is not nonsense is the prompt
im gonna let it run for longer ig
what is it meant to do, generate from shakespeare?
yes
should write poetry
YES
I was just sampling it incorrectly
To be or not to be chopped,
He cannot tell thee to come to have, and we will play him?
If everlastingly king to deny straight: therefore, methinks you,
Must myself against you hear his presence.LADY CAPULE
it's beautiful
im gonna have endless fun with this thing
i'm a bit confused
type object
unit object
creationdate object
startdate object
enddate object
value float64
HKMetadataKeyHeartRateMotionContext float64
HKMetadataKeySyncVersion float64
HKMetadataKeySyncIdentifier object
most of these columns are objects
i'd like to convert creationdate, startdate, enddate to datetime
pd.to_datetime
heart_df["creationdate"] = pd.to_datetime(heart_df["creationdate"])
is it telling you it's an invalid format or something?
i tried this and tried testing again to see if it would change to datetime
UserWarning: Could not infer format, so each element will be parsed individually, falling back to dateutil. To ensure parsing is consistent and as-expected, please specify a format.
well, is there a consistent format?
5/27/2022 8:02:00 AM
did you check and try specifying it?
is it consistent though?
sounds like it might not be from that warning
is there any way to check if it's consistent besides scrolling through the entire file?
31.4 mb
sounds like you would probably need to come up with a way to figure out using python string processing methods
maybe regex
or
you can just try to specify that format
and see if it tells you there's an invalid value given that format
that might give you some hints
any way i can call datetime.strptime on a pandas column?
no, incompatible types.
i can try using a .apply function
there might be a way with the dt accessor
I mean that's what to_datetime is supposed to be doing
!d pandas.to_datetime
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=False, format=None, exact=_NoDefault.no_default, unit=None, ...)
Convert argument to datetime.
This function converts a scalar, array-like, [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series) or [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame)/dict-like to a pandas datetime object.
use the format argument here just like how you would in datetime.strptime
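A sketch of what that could look like for values in the `5/27/2022 8:02:00 AM` style. The sample series below is made up, including one deliberately malformed row. With `errors="coerce"`, anything that does not match the format becomes `NaT`, which gives a way to list inconsistent rows without scrolling through the file.

```python
import pandas as pd

# Hypothetical sample in the "5/27/2022 8:02:00 AM" style, plus one malformed row
raw = pd.Series(["5/27/2022 8:02:00 AM", "12/1/2022 11:15:30 PM", "not a date"])

fmt = "%m/%d/%Y %I:%M:%S %p"  # 4-digit year, 12-hour clock, AM/PM marker
parsed = pd.to_datetime(raw, format=fmt, errors="coerce")

# Rows that did not match the format become NaT, so they are easy to list
bad_rows = raw[parsed.isna()]
```

If `bad_rows` comes back empty on the real column, the format is consistent and the warning goes away once `format` is given.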
letting it go for 500 epochs
I figured out why the off-policy learners were sucking
there's a lot of things im not doing there that could improve this
I forgot to tell the agent what its var was so even though it was using it in the reward function it had no way of directly observing it or using it to constrain its behavior
that's why they kept overleveraging
I'm amazed PPO worked anyway
engineered some new features and our rewards are higher than ever
now that I have a first model that writes poems, I'm gonna setup a more serious infra to experiment with this, I'm mostly gonna copy what I did until now for the gpt array sorter, might need to adapt it a bit for gpu. After that it's time to explore my idea of using metric tensors for performing self attention and compare the results. I mean, tbh I kinda wanna do that now, I'm gonna run a quick loop to see what happen
it was following it a little bit too closely, so I restarted the run, got a print in there just to be certain the mod is there
but it seems that it was right
kinda weird that there's no difference, I expect that at least it takes less time training
for my last unsuccessful prediction, the newer version will start plateauing at a higher value
nvm I was right, mine is gonna end up being a bit faster
if this stays like this I'm gonna start getting excited. But in all likelihood it's gonna converge to a higher loss value because there's less parameters around and the values they can take are constrained
That does look better, but you should probably use a fixed y-axis size
I am only interested in the shape really
of course the value too
but this chart is for the shape
if I were to plot them all to compare then it would be normalized obviously
Yeah the other one looked like a random walker on the y axis
it is supposed to random walk a bit
but I wasn't giving it enough information to reliably solve the problem that was my bad
I also switched to an off-policy method
Right, but I'm guessing that a pure random walker would mean something is wrong
I remember studying this class of graphs in college. Even made a bunch of sims that generate them
it wasn't a pure random walker
it had a definite up trend in the beginning
then it random walked around like 1
but it became definitely positive from negative
so there was something, and it was only with ppo because ppo is risk averse
compared to the off-policy learners
The only way to confirm would've been to run several experiments. A random walker can produce any of the observed patterns via pure chance.
This one is much more clear
I ran it again with a different seed and got the same pattern
The other ones looked random
nah that was just exploration
I'm sure if you looked at the stats that the moving average change was significant
difference is due to a couple factors including changing the model
the moving average definitely changes from negative to positive
and then it flatlines
I ran it again with another seed and found a similar shape
it was two things: this one is actually truncated in the x-axis
and the model is different
and I didn't give it as much information
I mean the y -axis not x-axis sorry
it didn't start out as negative
that is just chance
well, it performed out-of-sample
just not as well
as the new one
reinforcement learning is different from supervised learning
there's a bit of "random walking" by design
it's changing the strategy to try to find new things to do to ultimately get to a higher end point
the on-policy ones tend to do that more aggressively
because they have limited information compared to off-policy
so they need to move around more to find a better policy
the other chart I showed more recently where it looked more like an inverse of your loss chart was from an off-learning policy
off-policy learner*
in a lot of cases they give more robust solutions
they also have better convergence criteria
even though you said it looks like a random walk
the other ones just became more negative
It is one, you can even check that its statistical properties are invariant under scale changes
Making it a fractal
it goes through cycles of exploration and exploitation
but if you insist sure it was a pure random walk
I'm just glad I got the off-policy learner working
guys is maven analytics course for excel is good ??
I thought you were using Jovian to learn Python or so... You ditched it for excel now? How was your experience with Jovian?
so here's another example
this is SAC (off-policy)
This is PPO (on-policy)
you might look at these and say wow PPO sucks in comparison but in reality it's not that clear on out-of-sample data
this is what they did respectively out of sample
the one on the right is a lot more stable and predictable in behavior
might want to run both and put a bigger weight on the less risky one
I'm going to try to tune it to take less risk but the on-policy one takes less risks by default
we've broken 2 sharpe it's time to hook it up to paper trading and demo it
hey guys has anyone worked with llms here? i kinda need a small help
Hi, I'm excited to join here.
I'm a professional machine learning developer and have 8+ years of experience of developing data science, image processing, optimization projects.
Image preprocessing, deep learning, time series processing, dimension reduction and optimization are my major.
Nowadays I'm looking for employment opportunities and willing to do full-time/part-time.
Please DM me if any Employer is interested about me. thanks...
read the rules
what are you trying to do?
i wanna solve image processing or time series tasks...
toda
my keyboard is\ going cra rn, the caps\ l.ock is\ del.eteing chars\
It's restarting, I think I'm gonna have to check for malware.
Anyway, today I'm gonna setup the rest of the infra for training GPT on a GPU machine without supervision and with fault tolerance.
It's mostly copy paste from the previous workflow + some corrections on a couple things I missed in the model code
I've implemented a GA, with elitism, crossover and mutation of parents, and some random agents. I've made a generation of 400 agents, with 8 elitists, 146 parents, 146 children and 100 random. Mutation probability is at 12.5% because I have my data on 8 bits. After ~60 generations, I could still not witness any upgrades.
Though, learning about GA and programming it was a fun task so thank you! But I still think the problem is something more simple...
I'm noticing pycharm has intelligent schema autocomplete with Spark with read statements, is there a way to inform it about schemas of dataframes passed to functions?
the first self attention module of the metric tensor network
How can get Maven Analytics courses for free or by someone who can share thier Udemy username and password
Do not request other people's credentials, that is not appropriate
Learning with high quality should be for everyone
and teachers deserve to get paid
Asking for someone's login credentials is not appropriate, not sure how that is related
"maven" sure is a confusing name for an analytics platform, I was like "wtf, an entire course on, what, doing analytics on Maven downloads?"
What is the best way to smoothen out or make good predictive accuracy score of linear regression model.... how to eliminate fluctuations/noise in a data that has a lot of hips and hops due to being in micro scale
from where should i start machine learning? i know python basics, im c++ programmer and im good at maths and logic building
Datacamp and kaggle have a lot of courses for such... thats a good point
can you please share the links?
Search the world's information, including webpages, images, videos and more. Google has many special features to help you find exactly what you're looking for.
@stark bay ?
Yes?
Plzz can you suggest me any YouTube channel for python ?
I dont use yt
I think u will get more relavant info here since i am not the best person to ask for that
it is redirecting me to google.com
Do you work something ?
I can help you if you want
BeCause i want to learn something
I just want any experience how to work or what to work
Sorry if i am disturbing you
Hello, I am a high school student interested in artificial intelligence. I know some Python. What can I do at this stage without entering the field? I mean I need a strong foundation to enter it (skills). What are these skills and what is the foundation that I need?
wanna start together?
Yes
Take the most advanced math courses that you can and get into a CS university program with an AI concentration
And do well in school in general. But in STEM especially
Well, sir, I will definitely take the mathematics programs for those in my country who never accept high school students in universities
are u indian?
No iam algerian maybe you heard about it or maybe you didn't
IK algeria very well, its a north african country more closely related with arabs
Haha, you seem to know it, but there are many of its residents who are not Arabs, including me
cooool so you are african?
anyone know the best way to locally host an llm cuz i dont have a openai llama thingy
Get a major in CS or Statistics while you build on your python skill gradually.
You can say it
trading USDJPY
live trading bot?
Ok sir thank u so much for this
what training api u using for trading playform or r u
im assuming ur not running it on a live account
to actually make the trades are u using a sim api
oh ibkr
that has actual prices but not actual money
they have paper trading
k
Is local your last resolve? Why not try some cloud options? Some of them even have free tier plan.
- Heroku, Streamlit Cloud, Cerebrium, etc.
Check out https://www.cerebrium.ai/
A platform that makes it easy to build and deploy machine learning models scalably and performantly. We run GPUs serverlessly so you only pay for the compute that you use. Bring your Python code and we take care of all the infrastructure. Typically customers experience a 40%+ cost saving as opposed to AWS or GCP.
You're welcome sir. If you'd have gotten to >= 18 years by 3rd quarter of 2024, I'll recommend you consider applying to attend Deep Learning Indaba 2024.
All the best in your endeavour
What is the best way to smoothen out or make good predictive accuracy score of linear regression model.... how to eliminate fluctuations/noise in a data that has a lot of hips and hops due to being in micro scale
you can try rescaling
using log scale or something like that
note that you will have to exponentiate the prediction from the log-scale model to get the prediction in normal terms
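A small sketch of the log-scale idea on synthetic data (the series below is generated, not real): fit a straight line to `log(y)` and exponentiate the fitted values to get back to the original units.

```python
import numpy as np

x = np.arange(1, 51, dtype=float)
# Synthetic target with exponential growth and multiplicative noise (illustrative only)
rng = np.random.default_rng(1)
y = 2.0 * np.exp(0.05 * x) * np.exp(rng.normal(0.0, 0.05, size=x.size))

# Fit a straight line to log(y), then map predictions back with exp
slope, intercept = np.polyfit(x, np.log(y), deg=1)
log_pred = intercept + slope * x
pred = np.exp(log_pred)
```

This only works when the target is strictly positive, since the logarithm is undefined otherwise.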
Hi ! Does anyone has any tips about good books to learn and deepen the knowledge in data science
there should be a few in the pinned messages in this channel
didn't know it was a thing, ty
Hey buds. So I have excel spreadsheet named data.xlsx
df = pd.read_excel("data.xlsx")
df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1, inplace=True)
I try to do this but there is still freaking indexes
could someone fix it please
@long locust Have you read the books that are recommended in the pin message ?
For a little question
Or did someone ?
I have not gone through them, what question do you have?
Well if I read 1 or 2 of them in their entirety, I'm good to go right ?
Because I read that the best way to learn is to apply the knowledge, but in datascience, I literally have 0 idea of what exercise to do to apply and learn
Sure. Just focus on one book at a time, If you can successfully finish https://mml-book.github.io/ and https://www.statlearning.com/ you're good to go.
Companion webpage to the book "Mathematics for Machine Learning". Copyright 2020 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Published by Cambridge University Press.
Thanks !
The D2L book (Dive into Deep Learning) is also a piece of art. https://d2l.ai/chapter_introduction/index.html
df = pd.read_excel("data.xlsx")
df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)].tolist(), axis=1)
Try this.
Also, just never use in-place
There is still freaking index
df = pd.read_excel("data.xlsx")
df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)].tolist(), axis=1)
df = df[['Game Name', 'Region', 'Group Override']]
Game Name Region Group Override
0 Dead Space Ukraine No
1 Rust Russia No
2 Lethal Company Russia Yes
3 Grand Theft Auto V Ukraine No
by "freaking index", do you mean the 0, 1, 2, 3 on the left? because there must be an index. you can never stop having it no matter what.
Yes
why cant I have it
every row always has an index no matter what. you can print it without showing the index, but it's still there.
do print(df[['Game Name', 'Region', 'Group Override']].head().to_string(index=False))
So there is no sense in dropping index?
"dropping the index" is impossible.
that's like saying "I don't want the columns to be labeled or numbered"
how would you get the first column of the dataframe if the column doesn't have a label, or a number?
perhaps you want to do .set_index("Game Name", drop=True)? having that column be the index would be a reasonable choice.
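To make the two options concrete, here is a sketch on a small hand-built frame that mirrors the columns above (the real data comes from `data.xlsx`, which is not reproduced here).

```python
import pandas as pd

df = pd.DataFrame({
    "Game Name": ["Dead Space", "Rust", "Lethal Company"],
    "Region": ["Ukraine", "Russia", "Russia"],
    "Group Override": ["No", "No", "Yes"],
})

# Option 1: keep the default index but hide it when printing
text = df.to_string(index=False)

# Option 2: use an existing column as the index
by_name = df.set_index("Game Name")
region = by_name.loc["Rust", "Region"]
```

If the goal is a saved file without the extra column, `to_csv` and `to_excel` also accept `index=False`. Saving with the index included is a common source of `Unnamed: 0` columns when the file is read back.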
What are the best courses on Udemy to learn Python data analysis the basics and fundamentals to NumPy and pandas?
I think what he wants is to set the index to game name
by tomorrow I'll have a full variation on the transformer architecture, how do I go about making a fair comparison between them ?
so it displays without the series index
who wants to practice matplotlib with me?
Hello everyone,
Tyler here I'm a Comp Sci student currently on a data science internship. I also have my own startup McCarthy & Brogan Solutions.. we've been setup for around a year now and starting to get invited into various factories around the UK looking at how our services (primarily focused on maintenance & repair in this case) can increase efficiencies using AI amongst other things. We also have a subsidiary SmartFormAI with which we have just built a document automation application utilising LLM, OCR and GAR. I'd love to get to know some of you, please drop me a DM!
This server is not for advertising or self-promotion
I believe you are breaking the rules
@serene scaffold can I promote you?
No
dang
Apologies! Must have missed that
def forward(self, in_sequence_bwc: Tensor) -> Tensor:
    batch, words, coordinates = in_sequence_bwc.size()
    k_dimension = coordinates // self.NUMBER_OF_HEADS
    pre_metric_tensors_nww = self.pre_metric_tensors_nww.masked_fill(self.MASK_ww[:,:,:words,:words] == 0, 0)
    metric_tensors_nww = pre_metric_tensors_nww @ pre_metric_tensors_nww.transpose(-1, -2) # ensures symmetry and positive semi-definiteness
    all_projections_bwc = self.projections_cc(in_sequence_bwc)
    all_projections_bnwk = all_projections_bwc.view(batch, words, self.NUMBER_OF_HEADS, k_dimension).transpose(1, 2)
    all_dot_products_bnww = all_projections_bnwk.transpose(-1, -2) @ metric_tensors_nww @ all_projections_bnwk
    all_dot_products_bnww = all_dot_products_bnww / math.sqrt(k_dimension)
    all_dot_products_bnww = all_dot_products_bnww.masked_fill(self.MASK_ww[:,:,:words,:words] == 0, 0)
    nudged_vectors_bnwk = all_dot_products_bnww @ all_projections_bnwk
    nudged_vectors_bwnk = nudged_vectors_bnwk.transpose(1, 2).contiguous()
    nudged_vectors_bwc = nudged_vectors_bwnk.view(batch, words, coordinates)
    out_sequence_bwc = self.projection_cc(nudged_vectors_bwc)
    return out_sequence_bwc
this is my proposed self attention mechanism
every head has a projection matrix that compresses the embeddings, and a metric tensor that is used to calculate the dot product between all the elements in the sequence
and that's pretty much it, each projection gets scaled according to the dot products and then it's concatenated and mixed
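A quick NumPy illustration of the metric construction on its own (not the PyTorch module above, just the linear algebra): multiplying a matrix by its transpose always gives a symmetric matrix with non-negative eigenvalues. Once some entries are zeroed out, as a mask would do, some eigenvalues can be exactly zero, so the result is positive semi-definite rather than strictly positive definite.

```python
import numpy as np

rng = np.random.default_rng(3)
pre_metric = rng.normal(size=(4, 4))
pre_metric[:, 2:] = 0.0   # zero out some entries, as a mask would

metric = pre_metric @ pre_metric.T

is_symmetric = np.allclose(metric, metric.T)
eigenvalues = np.linalg.eigvalsh(metric)

# Generalized dot product between two vectors under this metric
u, v = rng.normal(size=4), rng.normal(size=4)
score = u @ metric @ v
```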
hey everyone I'd like to promote the Greenest Admin
he's very Green and very much an admin
@agile owl don't shitpost in our wonderful data science chat
Hello. I have a question, I don't exactly understand what this X_embedded mean or how to use it. Can anyone please give a hint?
sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably m...
Mixing elitists and parents does not look normal.
There are also too many pieces. I would still suggest to either use an off the shelf library or to validate your library on something simpler.
Can someone please explain vectorization in Pandas and why do we not have to do a for loop? And how would I know which methods can be used for vectorization?
it's setting up calculations to be applied across an axis in parallel instead of sequentially that's why you don't need a loop
in general if something is autoregressive or recursive it can't be vectorized
anything inherently sequential can't be vectorized
or at least not very easily afaik
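A small sketch of the difference, on a made-up frame. In pandas and NumPy the speedup comes mostly from running the loop in compiled code over whole columns rather than from literal parallelism, but the practical upshot is the same: no Python-level loop is needed for element-wise work.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 12.0, 9.0, 15.0], "qty": [1, 3, 2, 4]})

# Loop version: one Python-level multiplication per row
loop_total = [p * q for p, q in zip(df["price"], df["qty"])]

# Vectorized version: the whole column is multiplied in one call
vec_total = df["price"] * df["qty"]

# A recurrence where each value depends on the previous result is inherently sequential
# (here: exponential smoothing written out by hand)
smoothed = [df["price"].iloc[0]]
for p in df["price"].iloc[1:]:
    smoothed.append(0.5 * p + 0.5 * smoothed[-1])
```

As a rule of thumb, arithmetic operators, comparisons, and most `Series` and `DataFrame` methods are vectorized, while `.apply` with a Python function still runs row by row.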
Hello. I am working on a reinforcement learning model using a gym-anytrading environment. I am having this bug but I couldn't find a solution anywhere. Can someone help?
the error seems pretty straightforward to me
you need to have a certain observation shape
hello guyzz, i learned python and wants to learn machine learning, can anyone share some advice or roadmap to give a great start at my machine learning journey
Apparently while debugging the code, I forgot to rerun the cell above so that fixed the issue. But now I am trying to add some indicators to a custom stock env. It takes gym-anytrading's StockEnv. I am getting this error:
sounds like it needs a data attribute. did you read the environment implementation or is there a specification for how to subclass it?
I just make my own envs so I can't help you there
I only need to implement step and reset
i'm guessing the data attribute just needs to be some dataframe with certain column headers
I think I understand the error better. I need to check the documentation again.
Thanks for the help
hey everyone, i just want to ask on how can i stitch multiple images of receipt?
Does anyone tried it before?
sup everyone, im new to this chat and Ai/data science. i know some basic stuff about Ai and data science but im wondering if someone would be willing to help me learn more on this topic.
Does anyone actually use feature mapping as opposed to kernels/gram matrix in SVM?
Cornell says that it's more efficient in lower dimensional feature maps but wouldn't you have to calculate the inner product anyway? So how is it more efficient when the kernel can do this without the explicit mapping
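To make the comparison concrete, here is a sketch with a degree-2 polynomial kernel, where the explicit feature map is small enough to write down. The function names `phi` and `poly_kernel` are made up for illustration. Both routes give the same inner product.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

def poly_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

explicit = float(np.dot(phi(x), phi(z)))   # inner product in the 9-dimensional feature space
implicit = poly_kernel(x, z)               # same value without building phi
```

As I understand the trade-off, the efficiency argument is less about a single inner product and more about the solver. With an explicit low-dimensional map you can train a linear SVM on an n × D feature matrix and never form the n × n Gram matrix, which tends to win when D is small and n is large.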
Good morning, gentlemen,
I'm writing to you because I can't find the solution to my problem after much research.
I'm running a log-linear regression on the adjusted price of a stock (dependent/target variable = Y).
in order to make the relationship between my variables (price/date) more linear (especially if the distribution doesn't really follow a normal distribution, but I'm not telling you anything...).
On the other hand, I'd like to display the "true price" and the true standard deviations (68/95/99.7%).
But I have no idea...
I use python with yfinance, plotly and streamlit.
Thanks for your help in advance : ) !
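One way this might be approached, sketched on synthetic prices rather than real yfinance data: fit the regression on `log(price)`, measure the residual spread in log space, build the 1/2/3-sigma bands there, and exponentiate everything to get back to price units.

```python
import numpy as np

# Synthetic stand-in for adjusted close prices (the real data would come from yfinance)
rng = np.random.default_rng(2)
t = np.arange(500, dtype=float)
price = 50.0 * np.exp(0.001 * t + rng.normal(0.0, 0.05, size=t.size))

# Regress log(price) on time
slope, intercept = np.polyfit(t, np.log(price), deg=1)
log_fit = intercept + slope * t
sigma = np.std(np.log(price) - log_fit)   # residual spread in log space

# Map the fit and the 1/2/3-sigma bands back to price units
fit_price = np.exp(log_fit)
bands = {k: (np.exp(log_fit - k * sigma), np.exp(log_fit + k * sigma)) for k in (1, 2, 3)}
```

The bands come out asymmetric around the fitted line in price units, which is expected after exponentiating. The resulting arrays can then be passed to plotly as ordinary line traces.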
Any one has pc recommendation for data analytics??
I'm thinking ryzen 5 7600 and 6750xt with 32gb ram
Please advise, my current amd laptop is quite old and has 4gb ram which is hardly available half the time so I saved some money to buy a good pc to last few years
Your planned setup with a Ryzen 5 7600 CPU, Radeon RX 6750XT GPU, and 32GB of RAM seems quite decent for handling data analytics tasks.
Alright just one question should I go with 13500 in intel but i also prioritize power consumption because cant pay too high electricity bills right
If power is something you care about
I would not use Intel, AMD in general has much better power efficiency per compute value in the newer chips in general
Also, in general, unless you're using 100% of the CPU all the time, your power bill is probably not going to be noticeably different regardless of what CPU you go with
other than maybe if you put a Threadripper or server CPU in it
If you really care about the power efficiency per compute and can afford it, the Ryzen x3D chips are incredible CPUs and have by far some of the best efficiency of the modern CPUs on the market rn.
But again, I don't think you will notice much difference in the power bill department
do note that you'll have to fiddle with your BLAS backend if you use AMD and want to do computations on cpu
and that AMD gpu's are not super well supported for any gpu computations
sadly history has favored intel and nvidia greatly in the area of scientific computing
check whether your target modules support ROCm instead of cuda, and read around about MKL vs openBLAS vs BLIS
Hello, does anyone have experience with Python and AI/Machine Learning etc? or good, working tutorials for predicting stock prices? I have data in CSV
import pandas as pd
import numpy as np
import warnings
from datetime import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as dates
heart_df = pd.read_csv("/Users//Desktop/Apple Watch Data/HKQuantityTypeIdentifierHeartRate.csv")
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
print(heart_df.head(5))
heart_df["creationdate"] = pd.to_datetime(heart_df["creationdate"], format = "%m/%d/%y %H:%M")
heart_df["startdate"] = pd.to_datetime(heart_df["startdate"], format = "%m/%d/%y %H:%M")
heart_df["enddate"] = pd.to_datetime(heart_df["enddate"], format= "%m/%d/%y %H:%M")
print(heart_df.dtypes)
plt.plot(heart_df["startdate"], heart_df["value"], linestyle = "dotted")
# Add title and axis labels
plt.title('Time Series Plot')
plt.xlabel('Time')
plt.ylabel('Heart Rate')
plt.xticks(rotation=45)
plt.show()
is there anything else i can do with this visualization?
bc i can't really make sense of this, there's no underlying trend here
maybe combine it with something else?
def forward(self, in_sequence_bwc: Tensor) -> Tensor:
    batch, words, coordinates = in_sequence_bwc.size()
    pre_metric_tensors_nkk = self.pre_metric_tensors_nkk * self.MASK_11ww[0, :, self.K_DIMENSION, self.K_DIMENSION]
    metric_tensors_nkk = pre_metric_tensors_nkk @ pre_metric_tensors_nkk.transpose(-1, -2) # ensures symmetry and positive definiteness
    all_projections_bwc = self.projections_cc(in_sequence_bwc)
    all_projections_bnwk = all_projections_bwc.view(batch, words, self.NUMBER_OF_HEADS, self.K_DIMENSION).transpose(1, 2)
    # all_projections_bnwk = F.normalize(all_projections_bnwk, p=2, dim=-1)
    all_dot_products_bnww = all_projections_bnwk @ metric_tensors_nkk @ all_projections_bnwk.transpose(-1, -2)
    all_dot_products_bnww = all_dot_products_bnww / math.sqrt(self.K_DIMENSION)
    all_dot_products_bnww = all_dot_products_bnww.masked_fill(self.MASK_11ww[:,:,:words,:words] == 0, float('-inf'))
    all_dot_products_bnww = F.softmax(all_dot_products_bnww, dim=-1)
    # all_dot_products_bnww = all_dot_products_bnww * self.MASK_11ww[:,:,:words,:words]
    nudged_vectors_bnwk = all_dot_products_bnww @ all_projections_bnwk
    nudged_vectors_bwnk = nudged_vectors_bnwk.transpose(1, 2).contiguous()
    nudged_vectors_bwc = nudged_vectors_bwnk.view(batch, words, coordinates)
    out_sequence_bwc = self.mixer_cc(nudged_vectors_bwc)
    return out_sequence_bwc
don't know what's up with that softmax stuff, but unlike Q, K, V, it seems to be essential. I've trimmed down Q, K, V to a lower triangular matrix (pre_metric_tensors_nkk, of which only half gets updated during training) and a projection matrix (projections_cc). results so far are very similar to the Q, K, V that result from Wq, Wk and Wv, which total a larger number of params
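For readers who want the idea without the PyTorch plumbing, here is a single-head NumPy sketch of the score computation described above. `P` and `L` are hypothetical stand-ins for one head of `all_projections_bnwk` and `pre_metric_tensors_nkk`; this is an illustration under those assumptions, not the author's implementation.

```python
import numpy as np

def metric_attention(P, L):
    """Single-head sketch: scores come from a learned metric M = L @ L.T."""
    words, k = P.shape
    L = np.tril(L)                      # keep only the lower triangle
    M = L @ L.T                         # symmetric, positive semi-definite
    scores = P @ M @ P.T / np.sqrt(k)   # generalized dot products between words
    causal = np.tril(np.ones((words, words), dtype=bool))
    scores = np.where(causal, scores, -np.inf)   # hide later words
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ P, weights, M

rng = np.random.default_rng(0)
P = rng.normal(size=(5, 4))   # 5 words, K_DIMENSION = 4
L = rng.normal(size=(4, 4))
out, weights, M = metric_attention(P, L)
```

Note that `L @ L.T` is always symmetric and positive semi-definite; it is strictly positive definite only when every diagonal entry of `L` is non-zero.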
Is there a problem with giving some negative values to the input layer of an RNN? I give a distance to the AI, and I think maybe the fact that it's negative will slow down the learning process, since I'm using ReLU for the hidden layers. Should I change the way I compute the distance?
strictly speaking "distance" is always a positive metric, you probably should be abs()'ing it anyway
but there is no problem in giving negative inputs, and in some cases you are even recommended to transform positive inputs into negative inputs as part of normalization - see https://datascience.stackexchange.com/questions/54296/should-input-images-be-normalized-to-1-to-1-or-0-to-1 for example
Ok thanks !
this is the loss graph from the new transformer variant, which I'm calling metric tensor network (mtn), maybe I'm biased but it looks exactly like the transformers loss graph that I've been sharing here so far
the real advantage will come from the fact that I can double the amount of attention heads (since the metric tensor is symmetrical, half the space is being wasted rn) and still have less parameters than the transformer that produces this
I'm looking for a tool to visualize training results. For now I'm just using matplotlib, but are there any better libraries to get the job done?
I've been using MLFlow, and it has been a blessing
Thanks a lot
someone can help me with this ? : https://colab.research.google.com/drive/10g0pY3vv-mBu5t9yVJ2dQADoI-luOPa9
Cant get in, No access
i want to output in srt format, someone know how to do this ?
what ?
so can you help me or no ?
do you know what is insanely fast whisper ?
nah
do you know what is whisper ?
so insanely fast whisper is way way faster
but i don't know how to make srt file output instead of json file
have no idea how to do it
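A possible post-processing step, as a sketch: convert the JSON to SRT yourself. This assumes the JSON has a `chunks` list where each item looks like `{"timestamp": [start, end], "text": "..."}` with times in seconds; check your own output file, since that layout is an assumption here. The helper names are made up.

```python
import json

def to_srt_time(seconds):
    """Format seconds as the HH:MM:SS,mmm timestamp used by SRT."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def chunks_to_srt(chunks):
    """Turn a list of {'timestamp': [start, end], 'text': ...} dicts into SRT text."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        if end is None:          # guard for a chunk without an end time
            end = start
        text = chunk["text"].strip()
        blocks.append(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n")
    return "\n".join(blocks)

# Tiny inline example standing in for the real JSON file.
example = json.loads(
    '{"chunks": [{"timestamp": [0.0, 2.5], "text": " Hello there."},'
    ' {"timestamp": [2.5, 61.04], "text": " Second line."}]}'
)
srt_text = chunks_to_srt(example["chunks"])
```

For a real file you would `json.load` it and write `srt_text` to a `.srt` file.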
okay
maybe go ask in #1035199133436354600
no one knows here
New(er) versions of matplotlib have a "plot training curve thing"
huh
can please send a link to documentation
i looked it up but didnt find anything
ik that sklearn has this but never seen something like that in matplotlib
Oh sorry, I meant newer versions of sklearn
hello, i wrote a mandelbrot zoom code and it pretty much zooms in anywhere.
The issue is, as it zooms it requires more and more iterations to get a clear image (as is intended), but I don't know exactly how many are required, so I just use iterations=frame_n*2.
This works fairly well, but is extremely slow as the video progresses.
Is there any formula or way to approximate the iterations required as i know the zoom in speed and point?
when it turns all white, the iterations are not high enough to get a clear frame. and later, it fuzzes out
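There may not be an exact formula, since the iterations needed depend on where you zoom. If the zoom factor per frame is constant, `frame_n*2` is already linear in log(zoom), which matches the usual rule of thumb. One alternative, sketched here under made-up names and thresholds, is to pick the count adaptively: render a tiny preview and keep doubling until the fraction of pixels that never escape stops changing much.

```python
import numpy as np

def unescaped_fraction(center, width, max_iter, n=48):
    """Fraction of pixels on an n x n preview grid that have not escaped."""
    re = np.linspace(center.real - width / 2, center.real + width / 2, n)
    im = np.linspace(center.imag - width / 2, center.imag + width / 2, n)
    c = re[None, :] + 1j * im[:, None]
    z = np.zeros_like(c)
    alive = np.ones(c.shape, dtype=bool)
    for _ in range(max_iter):
        z[alive] = z[alive] ** 2 + c[alive]   # only iterate points still bounded
        alive &= np.abs(z) <= 2.0
    return alive.mean()

def pick_max_iter(center, width, start=64, tol=0.01, cap=1 << 14):
    """Double max_iter until doubling again changes the preview by at most tol."""
    max_iter = start
    frac = unescaped_fraction(center, width, max_iter)
    while max_iter < cap:
        frac_next = unescaped_fraction(center, width, 2 * max_iter)
        if frac - frac_next <= tol:
            break
        max_iter *= 2
        frac = frac_next
    return max_iter

chosen = pick_max_iter(complex(-0.75, 0.1), 3.0)
```

Running this every few frames instead of every frame keeps the overhead small; the preview size `n` and tolerance `tol` are knobs to tune.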
I have a task at hand. I have to write a python script to build a log parser into JSON Format. The log files are taken from the MACBook. I have to ultimately feed it to an LLM model so that it can detect the issues from the log files and summarise it.
Can anyone point me to a good resource or help me understand how can I build a good log parser?
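A starting point could look like the sketch below. It assumes classic syslog-style lines (`Mon DD HH:MM:SS host process[pid]: message`); the real files may use a different layout, so the regex and the sample lines are assumptions to adapt.

```python
import json
import re

# One regex per line format; named groups become the JSON keys.
LINE_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<host>\S+)\s+(?P<process>[^\[\]:]+)(?:\[(?P<pid>\d+)\])?:\s*(?P<message>.*)$"
)

def parse_log(lines):
    """Parse log lines into a list of dicts; unmatched lines extend the previous message."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        match = LINE_RE.match(line)
        if match:
            record = match.groupdict()
            record["pid"] = int(record["pid"]) if record["pid"] else None
            records.append(record)
        elif records and line.strip():
            records[-1]["message"] += "\n" + line   # continuation of a multi-line entry
    return records

sample = [
    "Jul  1 09:00:55 my-macbook kernel[0]: ARPT: 620701: wl0: wake reason\n",
    "Jul  1 09:01:05 my-macbook com.apple.cts[258]: activity finished\n",
    "    continuation line without a header\n",
]
records = parse_log(sample)
as_json = json.dumps(records, indent=2)
```

Keeping the output as a flat list of small records makes it easy to filter by process before handing anything to the LLM.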
Hi
Please help me to extract 30 secs time interval from a column called created_time (ex 08:00:00 format) in python.
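If this is a pandas column, one option is to parse the strings and floor them to 30-second bins. A minimal sketch with made-up sample values (the `interval_start` column name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"created_time": ["08:00:00", "08:00:12", "08:00:31", "08:01:05"]})

# Parse HH:MM:SS strings, then round each time down to its 30-second bin.
parsed = pd.to_datetime(df["created_time"], format="%H:%M:%S")
df["interval_start"] = parsed.dt.floor("30s").dt.strftime("%H:%M:%S")

# Example use: count rows per 30-second interval.
counts = df.groupby("interval_start").size()
```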
Ohhhhh I see ! Thanks a lot sir.
how do i plot the graph of a function that i calculated
Hello, please show your code as text
!code
i have these functions
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot()
xx = np.linspace(-5., 10., 1000)
# ------- Vul verder aan -------
vgl = opl.subs(a, 1/2)
vgl1 = vgl.subs(b, 2)
vgl2 = vgl.subs(b,4)
display(vgl1)
display(vgl2)
plt.show()
oh thats the wrong thing, let me fix
ye so this is my code so far
and this is with sympy?
uh matplotlib and sympy
the exam on monday is a sympy exam and this is a thing they gave us to prep
ax.plot(x, vgl1)
this is what i gotta use i think but it doesnt work
this is the error i get if i use that code
import sympy as sp
x = sp.symbols('x')
sp.plotting.plot(sp.cos(x), (x, -10, 10))
This worked for me
from your example, I don't know what opl is, so I can't make sense of what happens after that
ye alright but there is given code
sorry i forgot to mention the first part of the code is given
fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot()
xx = np.linspace(-5., 10., 1000)
# ------- Vul verder aan -------
this is given, and the plt.show as well
that's fine, but you don't define opl here, and then you use it in the next line. good examples have every variable defined, or follow common conventions for variable naming
(and maybe "opl" is a common name for something, but idk what)
either that or evaluate the function on some points and then plot them normally:
import sympy as sp, numpy as np
x = sp.symbols('x')
f = sp.cos(x)
X = np.linspace(-10,10,1000)
y = sp.lambdify(x,f)(X)
# plt.plot(X,y) or whatever
vgl = opl.subs(a, 1/2)
vgl1 = vgl.subs(b, 2)
vgl2 = vgl.subs(b,4)
display(vgl1)
display(vgl2)
all this part of the code does is calculate the functions
these functions
opl stands for oplossing which is dutch for solution and opl was the solution of my differential equation and then i sub a,b in the equation with the values that they tell us to sub
yo this works thankss
python for data analysis, a good book to start with?
"data science from scratch" second edition
im not really a beginner, ive learned numpy and pandas on free code camp, but that hasnt equipped me with problem solving
then you can skip a few chapters
you still recommend that book?
ya
um what makes you suggest that very book
I've read it and it's a good overview of the space and doesn't have shitty code examples
ty admin
Bruh spy cam plot too?
sympy*
mpl still top tho
started drafting an explanation thing, if I get good results I'll expand it into a paper
I'm relatively new to this field, but I'm eager to learn about data science and AI. If any of you have recommendations for the best resources to study data science and AI, I would greatly appreciate it if you could kindly share those details with me.
!resources data science
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
how can i interpret this graph? the total is a column that combines the original "drug_recode" column with "drugs_imputated" column
the original drug recode column has 1 for often/sometimes use of drugs. 0 for never use of drugs
imputation increases drug use?
does this mean imputating is not necessary for this dataset?
i feel like it's bad (suggests problems with imputation) if imputed points are so different from the real ones, but unsure.
it seems that it increases the value in the drug_recode column
i was thinking so too, but i wanted to get opinions from those more familiar with the topic
this is my first time working with imputation
this is what i ended up writing: This graph shows the effect of imputation on the drug_recode value. As we see, the imputed values are very different from the reported (no-imputation) values. The combined values better fit the reported values. As next steps, we would revisit the imputation method for significant error or bias as well as investigate how the structure of the data contributed to the accuracy of the imputation.
The combined values better fit the reported values
i wouldn't write this because this is literally always true
an average of datasets X and Y will always be more similar to X than Y is
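A quick numeric check of that claim, with made-up 0/1 values standing in for the reported and imputed groups: the pooled mean is a weighted average of the two group means, so it always sits at least as close to the reported mean as the imputed mean does.

```python
import numpy as np

# Hypothetical data: 1 = uses drugs, 0 = never.
reported = np.array([0, 0, 1, 0, 1, 0, 0, 0])
imputed = np.array([1, 1, 1, 0, 1])
combined = np.concatenate([reported, imputed])

gap_imputed = abs(imputed.mean() - reported.mean())
gap_combined = abs(combined.mean() - reported.mean())

# The combined gap is just the imputed gap scaled by the imputed share of rows.
share_imputed = len(imputed) / len(combined)
```

Here `gap_combined` equals `share_imputed * gap_imputed`, so "combined is closer to reported" says nothing about the quality of the imputation.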
Hi, I want to use a machine learning model to predict future performance of players. I am using Quarterback stats from the NFL. The stats are week to week and player by player. I wanted to use past games to determine future games of a player. I decided to use a Random Forest Model. I will try to predict touchdowns using Features like Pass_YDs, Interceptions, and Pass Attempts.
What I am doing does not seem right. I will never have my Features (Pass_YDs, Interceptions, and Pass Attempts) before a game is played so I cannot predict Pass Touchdowns with those. Features seem to me like ideas I know before a game is played like opponent, Home or Away game, etc. What I am trying to do is predict a players future performance in games based on past performance. Can you help with ideas on how I would do this and if I am on the right track?
Thanks!
i would probably regress each statistic using a model that ensures positivity on a vector of the past statistics of the player, their team, and the opposing team's defense
like the past 5 games or something
then i would fit that across quarterbacks in general so you have enough data
there's a lot of options for how to do the regression
I think with things like sports though the variance is huge
hard to model all the factors
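The "past 5 games" idea can be sketched with pandas as below. The column names and numbers are invented; the key detail is `shift(1)`, which makes sure each row only sees games played before it, so the features are available before kickoff.

```python
import pandas as pd

games = pd.DataFrame({
    "player": ["A"] * 6 + ["B"] * 6,
    "week": list(range(1, 7)) * 2,
    "pass_yds": [250, 300, 180, 320, 275, 290, 210, 190, 260, 240, 230, 300],
    "pass_td": [2, 3, 1, 3, 2, 2, 1, 0, 2, 2, 1, 3],
})
games = games.sort_values(["player", "week"]).reset_index(drop=True)

# Rolling mean over up to 5 previous games, per player, excluding the current game.
for col in ["pass_yds", "pass_td"]:
    games[f"{col}_prev5_mean"] = games.groupby("player")[col].transform(
        lambda s: s.shift(1).rolling(5, min_periods=1).mean()
    )
```

These lagged columns (plus things known in advance, like opponent and home/away) can then feed a Random Forest, or a count model such as Poisson regression if you want predictions that stay non-negative.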
Okay, that should get me in the right direction. It is just an early project for a portfolio I am putting together. I am a data analyst so just practicing.
Thank you for the input!
np
where are you getting the data?
trading usdjpy out of sample
hows shakespeare coming along
ESPN
do they have a way to download it easily or do you have to pull it from their summaries
@agile owl I pull it from their summaries
anyone interested in discussing the topic of conversation?
trading the yield curve
trading the 10y treasury
add these all together and the sharpe ratio is well in excess of 2
I need to write the variance weighting code I've just been doing it by hand on the renders in excel
I learned lots of different models for regression and classification in machine learning, like linear, polynomial, and SVR for regression and logistic, SVM, and kernel SVM for classification. But all of these were intuitive explanations without, for instance, explaining gradient descent and convergence, Ridge, Lasso, or the math behind every model. So I don't want to dive into deep learning without understanding machine learning. So first, what's your opinion? Should I dive more into machine learning, or am I ready to go to deep learning? Second, some people recommended some playlists for machine learning to me.
Krish Naik: https://www.youtube.com/watch?v=kEmnkUw0NTs&list=PLZoTAELRMXVPMbdMTjwolBI0cJcvASePD&index=3
Andrew NG: https://www.youtube.com/watch?v=vStJoetOxJg&list=PLkDaE6sCZn6FNC6YRfRQc_FbeQrF8BwGI
and some more, so is there a good resource explaining the models in depth in a simple way, like code implementation and projects? Thanks!
Hi everyone, I'm trying to install an older version of matplotlib, i.e < 3.3.0 but everytime I try to install it, it's not able to gather the requirements to build the wheel
Anyone got any heads up as how this problem will be resolved?
You probably need to switch to a different Python version, most likely an older one that the old matplotlib release ships prebuilt wheels for.
How can I generate classification dataset using either gaussian distribution or Poisson distribution.
...?
You can do this using Numpy and Pandas
I've done this a couple of times in the past. Whenever I volunteer to give a data science training I make my own datasets and I do it with Numpy. I pick a case that seems interesting to the audience, define some variables and think about how they're generated (e.g., maybe age is gamma distributed, maybe the relationship between age and cost is a multimodal gaussian etc)
Maybe there's tools that do this in a more automated fashion but I do it manually
I don't respond to DMs sorry
Like a dataset of 10k rows where each feature is Poisson distributed, can be any number of features
Can u give me code on how to do it
I'm not a fan of giving you the full solution because then you learn the least
https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.poisson.html <---- with this you can generate variables following a poisson distribution
think about how classification models work and do that in reverse: construct a continuous response/output/label, and convert that to discrete classes
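To make the "do it in reverse" hint concrete, here is one small sketch (the rates, weights and noise level are arbitrary choices, not a recommended recipe): draw Poisson features, build a noisy continuous score from them, then threshold the score into two classes.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_features = 10_000, 5

# Each column gets its own Poisson rate (lambda).
rates = np.array([1.0, 2.0, 4.0, 0.5, 3.0])
X = rng.poisson(lam=rates, size=(n_rows, n_features))

# Continuous response: a weighted sum of the features plus Gaussian noise.
weights = np.array([0.8, -0.5, 0.3, 1.0, -0.2])
score = X @ weights + rng.normal(0, 1.0, size=n_rows)

# Discretize: above-median scores become class 1, the rest class 0.
y = (score > np.median(score)).astype(int)
```

Changing the noise scale controls how hard the classification problem is; changing the threshold controls the class balance.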
the larger scale test is taking a while, I'm almost done refactoring the repo for it. I'm going to search through these guys
https://paperswithcode.com/method/strided-attention
https://paperswithcode.com/method/fixed-factorized-attention
https://paperswithcode.com/method/dot-product-attention
https://paperswithcode.com/method/scaled
to try to see how I can evaluate the performance of my architecture, at least two of them seem to benchmark on language translation tasks, so I might have to adapt for that, the dot product attention seems to be where the 2017 paper ended coming from, I only read it for a bit, they don't seem to be using a lot of gpu which is a relief
ig my thing fits in as a generalization of the scaled dot product attention, or a simplification ig
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(rfm['frequency'],
        rfm['recency_weekly_p'],
        rfm['T_weekly'])
ConvergenceError:
The model did not converge. Try adding a larger penalizer to see if that helps convergence.
Please help me with this, I am getting this error
can you show the import statement for BetaGeoFitter?
Hello. I am training a reinforcement learning model using stablebaselines3 but every time I train in vscode, I am using my cpu even though I have a rtx 4060. How can I make it use my gpu instead?
Hi guys! I've recently started my journey into data science and machine learning in general and one tip for exploring different applications I keep hearing is to read research papers and attempt to replicate the models created in the papers. Are there any websites/journals that ML researchers generally publish their papers on, or are ML research papers on more generic science journals. Thanks
Can you show some code?
I actually don't recommend that. Papers are about very specific developments, and are challenging to read, even for experienced practitioners. You won't build foundational knowledge in DS/ML by reading papers.
Its in vscode notebook so which part do you want to see?
the part where you create the model. By the way, I won't look at screenshots.
class MyCustomEnv(StocksEnv):
    _process_data = signals
env2 = MyCustomEnv(df= data, window_size= 15, frame_bound=(15, 90))
env_maker = lambda: env2
env = DummyVecEnv([env_maker])
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=500000)
while True:
    obs = obs[np.newaxis, ...]
    action, _states = model.predict(obs)
    obs, rewards, terminated, truncated, info = env.step(int(action))
    done = terminated or truncated
    if done:
        print("info", info)
        break
This is the output:
Using cpu device
| rollout/ | |
| exploration_rate | 0.993 |
| time/ | |
| episodes | 4 |
| fps | 16755 |
| time_elapsed | 0 |
| total_timesteps | 348 |
| rollout/ | |
| exploration_rate | 0.987 |
| time/ | |
| episodes | 8 |
| fps | 16891 |
| time_elapsed | 0 |
| total_timesteps | 696 |
| rollout/ | |
| exploration_rate | 0.98 |
| time/ | |
| episodes | 12 |
| fps | 16301 |
...
| learning_rate | 0.0001 |
| loss | 1.3e+05 |
| n_updates | 112431 |
Can you add the import statements for everything used here?
Also, please use python highlighting.
```py
hmm ok then. Thanks!
import gymnasium as gym
import gym_anytrading
from stable_baselines3.common.vec_env import DummyVecEnv
from stable_baselines3 import DQN
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import torch
class MyCustomEnv(StocksEnv):
    _process_data = signals
env2 = MyCustomEnv(df= data, window_size= 15, frame_bound=(15, 90))
env_maker = lambda: env2
env = DummyVecEnv([env_maker])
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=500000)
while True:
    obs = obs[np.newaxis, ...]
    action, _states = model.predict(obs)
    obs, rewards, terminated, truncated, info = env.step(int(action))
    done = terminated or truncated
    if done:
        print("info", info)
        break
modify these two lines accordingly
# before
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=500000)
# after
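# stable-baselines3 takes the device when the model is constructed
# (the default "auto" only picks CUDA when torch.cuda.is_available() is True)
model = DQN("MlpPolicy", env, verbose=1, device="cuda")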
model.learn(total_timesteps=500000)
then see if that moves it to the GPU
import datetime as dt
import pandas as pd
import matplotlib.pyplot as plt
from lifetimes import BetaGeoFitter
from lifetimes import GammaGammaFitter
from lifetimes.plotting import plot_period_transactions
The short answer is that that model can't learn your rfm data with the penalizer_coef hyperparameter that you set.
btw @heady sierra, you might need to redefine obs as a tensor (it appears to be an array currently) and move that to the GPU as well.
ok got it. Thanks
yeah if anything your technique is a specialization of it
great that you're following up on it. i'm looking forward to the benchmark results
Yes indeed, it's a special case of it
As long as I keep getting good results I'll keep digging
I've experimented with several designs now, and it seems that the network doesn't really care what you do as long as you calculate scores.
The exciting thing is the parameter reduction
And the design philosophy of forcing the network to do geometry to make decisions
I checked the hardware used in the 2017 paper and I'll be able to reproduce a good chunk of their table
These scores are already from the new thing
And its output is similar to the transformer
The meaning of life is to shake others.
With old only make an oath shallker thoughts; and says so defend:
but the fault out of thick-morrow, to ruin;
And holy clergymen must be needful
The younger slanders to be more than the hollow service
Known and defars; crave heaven,
Even as it would fear'd keep the time
Of love can admitges change;
So much lenity of soldiers,
Then thieves conn'd poison'd blanks,
Cry
So at least on a very small dataset it works fine and on par w/ the transformer
Me either, I just ask gpt if it looks coherent
I don't think shallker is a word?
then again I think shakespeare invented words so hey
what about thick-morrow
how do those compare to the attention matrices from the original?
not that i'd expect all 8 heads to find the same thing
but maybe we'd expect their sum/average to evaluate to something similar
or i suppose just comparing the final output distribution over tokens (rather than individual tokens that were chosen)
I didn't graph the other scores, but I can do it tomorrow, same code practically
that is, does your model generate quantitatively similar distributions over tokens (e.g. sum of squared differences akin to brier score)?
your projects have been very inspiring!
But the point is that the output is the same nonsense as the transformer
That's a good idea actually
i'm also really curious if this works just as well on bert-like models as it does on gpt-like models
So like, feeding a token and directly comparing the output ?
yeah, but instead of reducing the output to a token or sequence of tokens, leave it as a (sequence of) probability distribution(s) over tokens
Right, ig it would make sense for them to be similar since they're both approximating the same probability law
yeah hopefully
although you need to collapse it down to a single token to generate more than one token... so maybe just start with a single token prediction like you said
maybe walk through an existing valid english-language document and compare the output at each token?
that way you know the inputs make sense
averaging a bunch of one-step-ahead forecasts rather than averaging an entire forecasting procedure, if that makes any sense
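The comparison suggested here could be sketched like this, assuming both models are run on the same real text and give logits of shape (positions, vocab). The random arrays below are placeholders for real model outputs, and `mean_squared_gap` is a made-up name.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def mean_squared_gap(logits_a, logits_b):
    """Average over positions of the squared difference between next-token distributions."""
    pa, pb = softmax(logits_a), softmax(logits_b)
    return float(np.mean(np.sum((pa - pb) ** 2, axis=-1)))

# Placeholders: 12 positions of a document, vocabulary of 65 characters.
rng = np.random.default_rng(0)
logits_transformer = rng.normal(size=(12, 65))
logits_mtn = logits_transformer + rng.normal(scale=0.1, size=(12, 65))

gap = mean_squared_gap(logits_transformer, logits_mtn)
```

The value is 0 when the two models agree exactly and at most 2 per position, so it gives a bounded number to track across experiments.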
Yes it's a good idea. I'm gonna take note. Right now I'm preparing the repo for a series of larger scale and more systematic experiments. After that I'll go through a data analysis phase where I do these comparisons
I didn't know bert was different
i'm sure there are other details involved, but my understanding of the main difference is that bert doesn't do causal masking, so it's considered an "encoder-only" model compared to the gpt "decoder-only" design
that is, in bert every token can attend to every other token in the sequence without restriction, whereas in gpt tokens can only attend to tokens earlier in the sequence (which is why the upper triangle of the attention matrix in your output is all 0)
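A tiny illustration of that difference, using all-zero scores so the effect of the mask is easy to read (the function name is made up):

```python
import numpy as np

def attention_weights(scores, causal):
    """Softmax over the last axis, optionally hiding later positions (GPT-style)."""
    scores = np.array(scores, dtype=float)
    if causal:
        n = scores.shape[-1]
        visible = np.tril(np.ones((n, n), dtype=bool))   # row i sees columns 0..i
        scores = np.where(visible, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))
gpt_like = attention_weights(scores, causal=True)    # upper triangle is exactly 0
bert_like = attention_weights(scores, causal=False)  # every token attends to every token
```

With the causal mask the first token can only attend to itself, while without it each row spreads its weight over all four positions.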
what are the best models for sentiment extraction these days
I'm thinking about adding a new input stream to my reinforcement learner based on sentiment analysis
interestingly it looks like microsoft was attempting to use a bert-like architecture for language generation, which i guess turned out to be a dead end since i've never heard anything about it (and gpt of course turned out to be a tremendous success) https://arxiv.org/abs/1905.02450
Pre-training and fine-tuning, e.g., BERT, have achieved great success in language understanding by transferring knowledge from rich-resource pre-training task to the low/zero-resource downstream tasks. Inspired by the success of BERT, we propose MAsked Sequence to Sequence pre-training (MASS) for the encoder-decoder based language generation tas...
Wait so you literally just don't mask it and have a special char marking a word as empty ?
The output is generated non-autoregressively (every token at the output is computed at the same time, without any self-attention mask), conditioning on the non-masked tokens, which are present in the same input sequence as the masked tokens.
So this is why it's so much faster
Good day everyone, is there a quicker way to train a model?
it takes too much time and the power outage affects it a lot
@final kiln Thats a method... masking versus padding
It's just not supported by all the different tuning approaches out there.
Just finished one of the stages for my model ocr code dataset training-generator.
Model image detection:
[1] CPLUZPLUZ.png
[2] pyth0n.png
[3] ruzt.png
[4] c55.png
Auto-Selecting *.png
Auto-preprocess: (yes/no): yes
File Save Strategy: (txt/pdf/auto): auto
CPLUZPLUZ.png saved as CPLUZPLUZ-OCR.cpp (Cpp detected)
pyth0n.png saved as pyth0n-OCR.py (Py detected)
ruzt.png saved as ruzt-OCR.rs (Rs detected)
c55.png saved as c55-OCR.css (Css detected)
Anyone here use Polars? I'm having a weird issue. I have a program that is using multi-processing and threading in each process, and I'm trying to intermittently write a dataframe to a csv for each process once the dataframe reaches a certain height. The height check gets triggered and by printing the dataframe just before the call to df.write_csv(filename) I can see that the dataframe has data in it, but when I look in the file that gets written out it is only writing out the headers of the dataframe and doesn't contain any data
Can I ask why you've structured your program like that?
There's a lot of data being gathered and I wanted to safeguard it in case the program fails at any point
I was being CPU bottlenecked so I set it up to run in multiple processes
Also, by flushing the dataframes periodically it keeps the in-mem size down
Personally seems like a strange way to structure it, I think if you did it in a more orthodox way it wouldn't be as big of an issue
I would if I could
Especially since Polars lets you stream data from 1 source to another
I had to do it this way
So you can effectively work with larger-than-memory datasets
Polars uses multiple cores by default
Please just trust me lol
Have you tried inspecting the dataframe prior to it reaching the height?
I don't know what your specific issue is, you could have a race condition somewhere
I print the dataframe right before trying to write and it is normal. Full of data
It sounds to me like the way you are writing the data is the issue
Seems likely since you have multiple processes trying to write to the same file
not the dataframe itself, since its printing without any issue
You might have a race condition if youre trying to write to the same file
You can use a writequeue
to ensure they dont compete for writing
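A write queue could look roughly like this: producers only put batches of rows on a queue, and one dedicated writer owns the file. This sketch uses threads and the standard library `csv` module with made-up names and data; for separate processes, `multiprocessing.Queue` plays the same role.

```python
import csv
import queue
import threading

def writer(path, rows, fieldnames):
    """Single consumer: the only code that ever touches the file."""
    with open(path, "w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        while True:
            batch = rows.get()
            if batch is None:        # sentinel: no more data
                break
            w.writerows(batch)
            f.flush()                # keep data on disk in case of a crash

def worker(worker_id, rows, n_batches=3):
    """Producer: hands finished batches to the queue instead of writing itself."""
    for batch_no in range(n_batches):
        rows.put([{"worker": worker_id, "batch": batch_no, "value": worker_id * 10 + batch_no}])

def run(path, n_workers=3):
    rows = queue.Queue()
    writer_thread = threading.Thread(target=writer, args=(path, rows, ["worker", "batch", "value"]))
    writer_thread.start()
    workers = [threading.Thread(target=worker, args=(i, rows)) for i in range(n_workers)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    rows.put(None)                   # tell the writer to finish
    writer_thread.join()
```

Calling `run("output.csv")` writes one header line followed by nine rows, with no two writers competing for the file.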
if main_df.height >= 10:
    if os.path.exists(csv_name):
        existing = pl.read_csv(csv_name)
        main_df = pl.concat([existing, main_df])
    print(main_df)
    main_df.write_csv(csv_name)
    main_df = main_df.clear()
I used a height of 10 here just for testing purposes
But yeah it might be because of the threading
Maybe I'll just have to make it with bulletproof error handling and then return the df from each process
and do the write after it returns
pretty inconvenient though tbh
oh
import os
import pandas as pd
import threading
import random
import time
def generate_test_data(thread_num):
    return pd.DataFrame({'Thread': [thread_num], 'Height': [random.randint(1, 15)]})
def process_and_write(main_df, csv_name, thread_num):
    # only one thread at a time may read or write the file
    with write_lock:
        # NOTE: re-reading the file and then appending the concatenated frame
        # writes the earlier rows again; drop the read/concat to append only new rows
        if os.path.exists(csv_name):
            existing = pd.read_csv(csv_name)
            main_df = pd.concat([existing, main_df])
        print(f"Thread {thread_num} - Data put in write queue: {main_df}")
        main_df.to_csv(csv_name, mode='a', index=False, header=not os.path.exists(csv_name))
        print(f"Thread {thread_num} - Data written to the file: {main_df}")
        main_df = main_df.iloc[0:0]
def data_generation_thread(csv_name, thread_num):
    # runs forever: this is only a demo loop
    while True:
        main_df = generate_test_data(thread_num)
        if main_df.iloc[0]['Height'] >= 10:
            process_and_write(main_df, csv_name, thread_num)
        time.sleep(1)
write_lock = threading.Lock()
num_threads = 3
threads = []
csv_name = 'output.csv'
for i in range(num_threads):
    thread = threading.Thread(target=data_generation_thread, args=(csv_name, i+1))
    threads.append(thread)
    thread.start()
for thread in threads:
    thread.join()
hope that helps.
hmmmm write lock
Can you try writing to different files and just concatenating at the end?
I'm writing to a separate csv for each process. I did plan to combine them at the end
example output from it:
2,15
3,11
1,11
1,11
2,15
1,11
1,11
2,15
3,11
1,14
2,12
3,11
2,15
1,14
3,14
I have it writing the threadnumber to each line
I'm not using the threading library though
If I import it just for write lock will it still work?
Its your script, test it:)
lol true
Another thing I'm thinking is I could use a multiprocessing Value object
idk if I can pass a dataframe into that lol
Is this generated by AI?
I don't immediately see where/how this is failing but all I can say is that you're fighting against the API of polars
I would really really consider using the library as intended, with pl.scan_x (x being your source format) to read lazily and LazyFrame.sink_x to write lazily
Each process has 10 threads running at a time, so it's probably the case that somehow that is causing an issue
Unless there's a very very specific reason why that is not possible
I will look into those. I'm new to polars so I'm not familiar with the workflow yet. I'm basically using it philosophically the same as Pandas, which I'm sure is wrong
Yeah they're very different libraries. They only look similar on the surface
In the past I've reduced a workflow that took 1+ hour to run in under 1 min using polars
Sheeesh
This is due to various reasons, not just the libraries being different. (Pandas consumes a lot of memory so I had to batch my results, there was a lot of overhead of doing DB calls). All I'm saying is: read the documentation first and it'll pay off.
I caught a co-worker using nested calls to iterrows in Pandas the other day lol
I was like, "Nuh uh"
so, I am trying to use stable baselines for some RL. Just learning.
env_name = "CartPole-v0"
env = gym.make(env_name)
env = DummyVecEnv([lambda: env])
model = PPO("MlpPolicy", env, verbose=1, tensorboard_log=log_path)
model.learn(total_timesteps=20000)
the last line is giving the following exception:
75 for env_idx in range(self.num_envs):
76 maybe_options = {"options": self._options[env_idx]} if self._options[env_idx] else {}
---> 77 obs, self.reset_infos[env_idx] = self.envs[env_idx].reset(seed=self._seeds[env_idx], **maybe_options)
78 self._save_obs(env_idx, obs)
79 # Seeds and options are only used once
ValueError: too many values to unpack (expected 2)
Any idea what I am doing wrong?
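One likely cause, judging only from the traceback: the wrapper line expects `reset()` to return an `(obs, info)` pair (the Gymnasium-style API), but the environment returned just the observation, which for CartPole is 4 numbers. The stand-in functions below are hypothetical and only reproduce the unpacking error; they are not the real environment.

```python
# Older `gym` API: reset() returns only the observation (4 numbers for CartPole).
def old_style_reset():
    return [0.01, -0.02, 0.03, 0.04]

# Gymnasium-style API: reset() returns (observation, info).
def new_style_reset():
    return [0.01, -0.02, 0.03, 0.04], {}

obs, info = new_style_reset()        # unpacks fine

try:
    obs, info = old_style_reset()    # 4 values into 2 names
    error_message = None
except ValueError as exc:
    error_message = str(exc)         # the same kind of message as in the traceback
```

If that is what is happening, creating the environment with `import gymnasium as gym` (and a current environment id such as "CartPole-v1") instead of the old `gym` package is the usual thing to try; treat that as an assumption to verify against your installed versions.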
i actually had a legitimate use case for nested itertuples in a project last year
never iterrows though. just use range and iloc for that
It appears that trying to write out from the process pool is causing each process to hang
