#data-science-and-ml
Guys help.
I am running spark in pycharm. Loading data, doing some transformations. But when I try to write it in the same project folder I am getting windows error 5 Access is denied
k thanks
just keep in mind that if you tune the hyperparameters too much you might end up 'overfitting' them to your test set
I will try to break out the math to make it easier to understand.
C = 1/m * Σ(i=1;m) (yᵢ - aᵢ)^2
a = 1/(1+e^-z)
z = wᵀx + b
∂C/∂a = -2/m * Σ(i=1;m) (yᵢ - aᵢ)
∂a/∂z = a * (1-a)
∂z/∂w = xᵀ
to calculate the gradient of the weights ∂C/∂w we just multiply these partials together using the vector chain rule.
∂z/∂w * ∂a/∂z * ∂C/∂a
which is the same as
-2/m Σ(i=1;m) (yᵢ - aᵢ) * aᵢ * (1 - aᵢ) * xᵢᵀ
Sigmoid makes it a bit messy to write out but that would be how you calculate the gradient. Breaking it up into partials makes it a lot easier to understand IMO.
I'd recommend reading this https://explained.ai/matrix-calculus/index.html as it goes a lot more in depth than I can on the exact details; although they use ReLU in the example instead of sigmoid.
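If it helps, here is a small sanity check of that gradient formula, written as a sketch in plain Python. The toy data, the starting weights and the helper names (`predict`, `grad_w`, `numeric_grad_w`) are all made up for illustration. It compares the chain-rule gradient against a finite-difference estimate of the same cost C.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # z = w^T x + b, a = sigmoid(z)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def cost(w, b, X, y):
    # C = 1/m * sum_i (y_i - a_i)^2
    m = len(y)
    return sum((yi - predict(w, b, xi)) ** 2 for xi, yi in zip(X, y)) / m

def grad_w(w, b, X, y):
    # dC/dw = -2/m * sum_i (y_i - a_i) * a_i * (1 - a_i) * x_i
    m = len(y)
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        a = predict(w, b, xi)
        delta = -2.0 / m * (yi - a) * a * (1.0 - a)
        for j, xij in enumerate(xi):
            grad[j] += delta * xij
    return grad

def numeric_grad_w(w, b, X, y, eps=1e-6):
    # Central finite differences, one weight at a time
    grad = []
    for j in range(len(w)):
        up = list(w)
        up[j] += eps
        down = list(w)
        down[j] -= eps
        grad.append((cost(up, b, X, y) - cost(down, b, X, y)) / (2 * eps))
    return grad

# Made-up toy data: 4 samples, 2 features
X = [[0.5, -1.2], [1.5, 0.3], [-0.7, 0.8], [2.0, -0.5]]
y = [0.0, 1.0, 0.0, 1.0]
w, b = [0.1, -0.2], 0.05

analytic = grad_w(w, b, X, y)
numeric = numeric_grad_w(w, b, X, y)
print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places. Note that each sample contributes its own (yᵢ - aᵢ) * aᵢ * (1 - aᵢ) factor before the sum is taken.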
First off, thank you for taking the time to type all of this
I'm trying to process the part with the derivatives, I'm not entirely sure I follow though:
When you write ∂C/∂v = 2/m Σ(i=1;m) vᵢ for example, don't you need to get the partial derivative of some specific vᵢ? It looks like you took the derivative of every single vᵢ and added them up?
hmm why would you need to take a derivative with respect to a single vᵢ? The sum makes the output scalar anyway
Because there is no ‘v’ in the equation, or is there?
It’s just a sum of the delta between predictions and targets?
yes, v is just a term used to shorten that delta between predictions and targets when writing out the formula (and also to make it clear where the -1 is coming from)
you can remove v and put the (y - y_hat) there instead and the math will work the same
C = 1/m * Σ(i=1;m) (yᵢ - aᵢ)^2
∂C/∂a = -2/m * Σ(i=1;m) (yᵢ - aᵢ)
am I understanding what you're asking correctly?
That's not very valid though - what you wrote as ∂C/∂v (which should be a vector - the gradient of the scalar C) is technically ∂C/(Σvᵢ).
oops, I can change it then. I meant to make it more clear where the -1 was coming from but ig I just made it more confusing.
I think so, I just had to bail for a moment. I need to ponder a little on what you wrote
Thanks again
Here's what I'm getting
ahhh I see what you're saying
you're right that was a big mistake
Wow
You two are special, I might finally get it
When people just say “use the chain rule” it really doesn’t mean much to a beginner but when you put it like that it’s so much nicer
Oh and needless to say, thank you
That's great ^^, I know the feeling of that epiphany. It's really simple if you abstract it out to partials
and my apologies with the incorrect math and making it more confusing lol
Oh no worries! It wasn’t such a terrible mistake and it probably made the initial reading a little clearer :)
I think "use the chain rule" is often confusing because calculus courses don't often have examples where a function has n variables 🙂
It'd be like this:
(In the derivation above, the function in question is C, which depends on a_1, ..., a_m. Hence the sum in the result.)
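Written out, a sketch of the general rule in question, assuming C depends on w only through a_1, ..., a_m (same symbols as the derivation above):

```latex
\frac{\partial C}{\partial w}
  = \sum_{i=1}^{m} \frac{\partial C}{\partial a_i}\,\frac{\partial a_i}{\partial w},
\qquad
\frac{\partial C}{\partial a_i} = -\frac{2}{m}\,(y_i - a_i),
\qquad
\frac{\partial a_i}{\partial w} = a_i\,(1 - a_i)\,x_i^{\mathsf T}
```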
fair, i'm guilty of advising that 😉
Admittedly my calculus 2 course barely even touched multivariable calculus so this is all fairly confusing ^^;
This really helped though
You helped a ton too!
yeah but i hope i didn't mislead along the way
@tidal bough do you actually need the sum property there? i think it "expands" naturally in this case because C is itself a sum over i, so the whole thing expands out into a sum of partial derivatives by linearity
i never learned total vs partial derivatives properly in school, it's something i should probably revisit at some point
sure, you could think of it that way, I'm just mentioning the general case
i just wanted to make sure i wasn't missing something fundamental! i feel like i didn't learn anything in school properly and had to re-teach myself everything i know, so i'm always wondering if there's something i don't know that i don't know
There are a few key ideas from which you can probably guess how the rest will play out. It's important to keep in mind what the derivative actually is / represents, and how that plays together with linear algebra. For example, if I just flash this image of the Jacobian matrix, you can probably guess how a lot of other things work (but really you want a multivariate calculus book):
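For reference, a sketch of the standard Jacobian definition for a function f from R^n to R^m. The symbols f, x, m and n are generic placeholders here, not taken from anything above:

```latex
J_f(x) =
\begin{bmatrix}
\nabla^{\mathsf T} f_1(x) \\
\vdots \\
\nabla^{\mathsf T} f_m(x)
\end{bmatrix}
=
\begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial f_m}{\partial x_1} & \cdots & \dfrac{\partial f_m}{\partial x_n}
\end{bmatrix}
```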
ah yes, nabla-transposed 🥴
Notation was a mistake.
mathematicians try not to abuse notation challenge (impossible)
I actually think as far as geometric intuition goes I did nail most of it down
As far as connecting the dots between calculus and linear algebra though… I’m only now beginning to dip my toes into that
ML is probably the first time I’ve seen the two go hand in hand
Oh, and this one image:
And uh, I’m genuinely not sure what you mean by “guess how other things work” ^^;
feel free to elaborate
I think this is hitting beyond my bracket considering I’ve just learned how to use the chain rule with more than a single nested function
Nevertheless, extremely curious to note that such a connection even exists. Thank you
With a strong foundation in the fundamentals of calculus (its purpose / what it's doing), you can predict how it will play out in combination with something like vectors / linear algebra. This is my usual approach, I derive my own stuff, then after that read more about the topic. After a certain point I only need a few cues to predict the rest / adjust my stuff to match. This approach lets me know that I actually understood the previous topic leading up to the next one (confirmation of prediction), including its purpose from which much can be predicted since I then can guess what the inventors of the topic were aiming for / would probably go for next (on the same timeline / "wavelength"/ however you want to put it). The other thing I do is follow a historical approach. I try to find how/why they were inventing the math / what the context was at the time (what did people know then / what were the unsolved problems). This kind of prediction task (predict answer, then check) is usually done via the practice problems in books, but I like to take it a step further and predict the next chapter(s) too. I'm not recommending this approach, it's just what I do.
Basically, I like to reinvent the wheel.
!otn a squiggle's reinvented wheel
:ok_hand: Added squiggle’s-reinvented-wheel to the names list.
Hey guys do you know a lil bit of finance ? Cause i have a trading ai that i try to finish … could someone help me please 🙏. This AI has a very big potential, the people who accept to help can keep the code and run it to generate some wealth… it’s about 95% done
you're not allowed to offer money for work here. You can instead ask your questions that you need help with and people might try to help for free 🙂
Hello @small wedge, thank you for your advice i didn’t offer any money i just mentioned the code generates wealth. If you want to help me, you’re the most welcome i’ll give you the informations. Thank you
if you post your question and the relevant information here I'd be happy to try
thank you so much for your engagement
Here is a snippet; these are the function calls
print("test 1")
bot = Bot()
print("test 2")
# Call Market class
market = Market(symbol='EURUSD', yahoo_ticker='MSFT', currency='EUR', hist_window=365)
market.fx_price()
market.stock_price()
data=market.market_to_dataframe()
# Call the Balance class
ip_address = "127.0.0.1"
port_id = 7495
client_id = 1
current_price=market.fx_price(real_time= True)
price=market.fx_price()
bot.nextorderId = None
bot.run_loop();
print("wa7el Houni");
balance = BalanceApp(ip_address,port_id,client_id)
balance.start()
balance.accountSummary(reqId=123, account="DU11643091", tag="TotalCashValue", value="12345", currency="EUR")
balance.error(reqId=123, errorCode=456, errorString="Some error message")
# Call the RiskManager class
riskmg = RiskManager(balance, stop_loss_pct=0.05)
max_take_profit_pct = riskmg.calculate_max_take_profit_pct()
print("Maximum take profit pct: ", max_take_profit_pct)
order_size=riskmg.calculate_order_size(current_price)
print("Order size:", order_size)
riskmg.calculate_risk(price, stop_loss=7.5)
# Call the NNTS class
nnts = NNTS(lookback=50, units=128, dropout=0.5, epochs=200, batch_size=64)
X, y=nnts._prepare_data(data)
model=nnts._build_model(X)
buy_signals=nnts.generate_signals(data, strategy='buy')
sell_signals=nnts.generate_signals(data, strategy='sell')
# Call the TradingProcess class
tp = TradingProcess(balance, risk_percentage=0.05)
tp.update_equity()
tp.can_open_position(price, stop_loss=0.05)
tp.can_afford_position(price)
tp.open_position(price, stop_loss=0.05)
tp.close_position(price)
tp.update_position(price)
tp.fit(X, y)
tp.predict(X)
# Call the DataProcessor class
datapp = DataProcessor(feature_collumns=["open","high", "low", "close", "volume"])
datapp.preprocess_data(data)
# Call PlaceCancelOrder class
pcorder = PlaceCancelOrder()
pcorder.place_order(buy_signals, sell_signals, symbol='EURUSD', order_type='MKT')
pcorder.cancel_order(order_id=1)
# Call Bot function
bot.execute_trade(buy_signals, sell_signals, price)
and the full code is here: https://github.com/CodeBYMehdi/GPT
@small wedge
which part do you need help with?
principally the function calls
they are almost done but there are some arguments that i couldn't figure out how to pass
Hello, everyone. I would like to ask about a slight problem in an RDF graph, so the elements are too close to each other, and there is no space between them. I have been working on this project for my final school assignment and have searched everywhere on Google, Graphviz documentation, Stack Overflow, and YouTube, but none of the solutions are working. Therefore, I would appreciate some assistance here if you don't mind.
This is the code
`new_rdf_file = '../../output/rdf/dummy_rdf.rdf'
g.parse(new_rdf_file, format='xml')
gv_graph = graphviz.Graph(strict=True, format='svg', engine='neato')

def get_local_name(uri):
    uri_str = str(uri)
    return uri_str.replace(nba_players, '').replace("http://", '').replace("https://", '')

for subject, predicate, obj in g:
    subject_label = get_local_name(subject)
    obj_label = get_local_name(obj)
    predicate_str = str(predicate)
    # Add nodes and edges to the Graphviz graph
    gv_graph.node(subject_label)
    gv_graph.node(obj_label)
    # gv_graph.edge(subject_label, obj_label, label=predicate_str)
    gv_graph.edge(head_name=subject_label, tail_name=obj_label, label=predicate_str)

gv_graph.attr(pad="1.0")
output_file = 'dummy_output.svg'
gv_graph.render(output_file, view=True)`
Thank you in advance
can you give some specific examples? which function(s) are giving you issues
the first one is: ```py
bot.execute_trade()
```
what should i put in quantity
well it'd be whatever units of thing that you're buying/selling with the bot, looks like this is currency exchange?
oh or it's just stocks
then I assume it'd be how many shares you want of a stock
no it's both
but i'll start with currency exchange
the thing is i don't know how the ai will buy the units that i can afford
new colab feature?
I can only give a small recommendation of do not use that thing in real life with your money. You don't seem to even know how it works..
Does anyone know a solution to the issue that I have?
"(base)" does not display on the terminal, when I use bash. But when I change my shell to zsh, it displays.
Why is that?
Can anyone recommend a difficult python project?
Does a text summarisation model from huggingface send your text to huggingface? I downloaded the model to my device and ran it without internet, but was wondering if there are security issues when summarising personal documents.
In theory, you're just loading pretrained models. There is no data sent to their server
Thanks wanted to be sure!
of course i know because i wrote the code but there are some issues that i am struggling to fix, and even if i finish it i'll test in a simulated environment
guys got question is there a way to replace \ in text using replace option?
i tried
message.replace("\", "")
but doesnt work
\
\\
?
?
Use a double backslash, backslash is a special (escape) character in string literals
but one is in text
i just need to change output of script
and output is text and \
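A minimal sketch of what that looks like, assuming `message` is an ordinary `str` (the sample text is made up). The doubled backslash in the source code is just how Python spells one literal backslash, so this still matches single backslashes in the text:

```python
# "\\" in source code is ONE backslash character at runtime
message = "some\\text\\here"
print(message)

# str.replace returns a new string; the original is left unchanged
cleaned = message.replace("\\", "")
print(cleaned)
```

Since strings are immutable, the result has to be assigned back (or printed) for the change to show up in the script's output.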
Question to those who work. How powerful of a PC/laptop do you need? Do employers provide cloud compute, so that you could work on a weak device, or do they expect you to have a powerful PC and use your own processing power for everything you do?
Would the laptop linked below (gave 2 links in case one doesn't work) be good enough for work? No GPU, and not the best CPU. Only 8Gb RAM.. but what do you think?
https://sl.aliexpress.ru/p?key=ScdFZED
https://aliexpress.ru/item/1005001520846730.html?sku_id=12000027438880217&spm=a2g2w.productlist.search_results.1.1a364aa6fKeQyh
if you're doing AI/ML for a company, they'll very likely provide you with a laptop. and it's very unlikely that you'd be doing model development on that laptop.
😢 if it's a remote job for an overseas company - they won't be able to give me a laptop..
laptops can be delivered
Most of the companies I've worked for had a cloud environment or equivalent, you don't need a powerful laptop
Even outside of tech, they provide their own laptop for security reasons
this is pretty dang cool! i am just swamped by work atm, i might give this a look in the weekend
Would be grateful, as I really can't seem to figure out why the table doesn't display immediately, as expected
Kind of curious regarding Pytorch's nn.Linear() function:
test_img = torch.ones(1,4,4, dtype=torch.float)
test_flatten = nn.Flatten()
test_flattened_image = test_flatten(test_img)
test_layer1 = nn.Linear(16, 4)
test_hidden1 = test_layer1(test_flattened_image)
print(test_hidden1)
-------------------
output:
random values in (-1,1)
Anyone got any idea what that's about? are the weights just initialized randomly?
document for reference: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
The weights are initialized randomly
Oh, and I'm assuming there's a way to attribute a weight to every node between two layers somehow later on?
Ah cool, that's kind of what confused me to begin with
Which browser are you using by the way? Pytorch.org doesn't have native dark mode right?
I'm on iceraven on my phone, with the dark reader extension
Looks really neat 🙂 Ty for the help
Hello humans, fastest way to render a 2d image? Im doing some computing and the output is a 2d array. Ive used mplt. Scipy or Pillow seem good too
depends whats loader is doing ;d
Can you stack arrays with different shapes somehow? I want to index different arrays
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5])
arr3 = np.array([6, 7, 8, 9])
stacked = np.stack([arr1, arr2, arr3])
scope = stacked[0,1] # 2
scope = stacked[2,2] # 8
Hey, any data engineers here?
I have 1 year experience as a data analyst and I am trying to break into data engineering.
I know python, sql, hadoop, spark and azure (adf, databricks) and also the basics of airflow.
Is this enough to land a job?
does python have a filter function similar to R's? specifically I'm looking for a way to utilize R's circular parameter
circular: for convolution filters only. If TRUE, wrap the filter around the ends of the series, otherwise assume external values are missing (NA).
https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/filter
Applies linear filtering to a univariate time series or to each series
separately of a multivariate time series.
Numpy arrays need to have a homogenous shape, so no
You could perhaps pad the shorter arrays if it really helps with efficiency and the lengths don't differ too much
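A rough sketch of the padding idea, assuming a fill value of -1 is acceptable as a "missing" marker (that choice is arbitrary, pick whatever cannot occur in your data):

```python
import numpy as np

arrays = [np.array([1, 2, 3]), np.array([4, 5]), np.array([6, 7, 8, 9])]

# Allocate a rectangular array wide enough for the longest input,
# pre-filled with the (assumed) sentinel value -1
width = max(len(a) for a in arrays)
stacked = np.full((len(arrays), width), -1)

# Copy each array into the start of its row; the tail stays at -1
for row, a in zip(stacked, arrays):
    row[:len(a)] = a

print(stacked)
print(stacked[0, 1], stacked[2, 2])
```

After this, normal 2D indexing works, at the cost of having to skip the sentinel values yourself.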
nah, I guess looping is my only solution to it
The functions from scipy.signal usually have an argument for boundary conditions. Not sure which specific function this one corresponds to, though.
thank you, im trying to accomplish this from R:
> filter(x, rep(1, 3), circular = TRUE)
Time Series:
Start = 1
End = 100
Frequency = 1
[1] 103 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48 51 54 57 60 63 66 69 72 75 78 81 84 87 90 93 96 99 102
[35] 105 108 111 114 117 120 123 126 129 132 135 138 141 144 147 150 153 156 159 162 165 168 171 174 177 180 183 186 189 192 195 198 201 204
[69] 207 210 213 216 219 222 225 228 231 234 237 240 243 246 249 252 255 258 261 264 267 270 273 276 279 282 285 288 291 294 297 200
in python the closest i can get is this:
np.convolve(x, [1,1,1], mode='valid')
array([ 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42,
45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81,
84, 87, 90, 93, 96, 99, 102, 105, 108, 111, 114, 117, 120,
123, 126, 129, 132, 135, 138, 141, 144, 147, 150, 153, 156, 159,
162, 165, 168, 171, 174, 177, 180, 183, 186, 189, 192, 195, 198,
201, 204, 207, 210, 213, 216, 219, 222, 225, 228, 231, 234, 237,
240, 243, 246, 249, 252, 255, 258, 261, 264, 267, 270, 273, 276,
279, 282, 285, 288, 291, 294, 297])
but the 200 and the 103 drop off
Huh, it looks like scipy.signal's convolve doesn't have wrapping, which is weird to me. Anyway, you can use ndimage's instead:
>>> scipy.ndimage.convolve(x, [1,1,1], mode='wrap')
array([103, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39,
42, 45, 48, 51, 54, 57, 60, 63, 66, 69, 72, 75, 78,
81, 84, 87, 90, 93, 96, 99, 102, 105, 108, 111, 114, 117,
120, 123, 126, 129, 132, 135, 138, 141, 144, 147, 150, 153, 156,
159, 162, 165, 168, 171, 174, 177, 180, 183, 186, 189, 192, 195,
198, 201, 204, 207, 210, 213, 216, 219, 222, 225, 228, 231, 234,
237, 240, 243, 246, 249, 252, 255, 258, 261, 264, 267, 270, 273,
276, 279, 282, 285, 288, 291, 294, 297, 200])
oh you just made my day friend!
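If you'd rather stay in plain numpy, here is a sketch of the same wrap-around idea: pad the series with its own ends and then do a 'valid' convolution. `circular_filter` is a made-up helper name, and it assumes an odd-length kernel centered on each point, like the R example.

```python
import numpy as np

def circular_filter(x, weights):
    # Centered filter with wrap-around ends; assumes an odd-length kernel
    weights = np.asarray(weights)
    half = len(weights) // 2
    # mode="wrap" pads the front with the last values and the back with the first
    padded = np.pad(x, half, mode="wrap")
    return np.convolve(padded, weights, mode="valid")

x = np.arange(1, 101)
out = circular_filter(x, [1, 1, 1])
print(out[:3], out[-3:])
```

For x = 1..100 this gives 103 at the start and 200 at the end, matching the R output.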
__getitem__ is used with []
is anybody really strong with using langchain/chroma with the gpt-api. Im working on a project and have some questions if anybody would hop on a discord call wit me.
what do you measure with time? full loop? or just getting items
prefixing this answer with disclaimer: i am not a dash expert, so take what i said with a grain of salt.
here is what i understood
- upon turning on debug mode, i immediately saw there is an error on start up, this would explain the issue you are seeing (or maybe rather confirming the symptoms you are seeing)
- upon opening the callback DAG view with debug mode on, i see your intra_sector_corr populating function takes quite some time to run, whereas your stocks table populating function is immediately triggered. this is basically a race condition due to improper specification of what callback to run first (as to how you can do this, see the link i posted before or my attempt below)
- by adding
@app.callback(Output("stocks-dropdown", "value"), Input("stocks-dropdown", "options"))
def pick_first_option_on_change(options):
    return options[0]
i can alter the callback DAG into this. i believe the callbacks run from top to bottom on initialisation, so we have successfully made the race condition go away by forcing the stocks table to wait until the first callback for intra_sector_corr population completes
I'm not sure I understand your question. Should the 5k photos be labelled as 0 and the 70-100 photos labelled as 1?
@lapis sequoia
Since you should have random photos vs a small number of close photos, I'd use an existing image embedder, embed a few of the class 1 photos as references, embed the rest of the images, and classify them by cosine similarity with the references
The technique I mentioned should work; the alternative is probably handling the imbalance by oversampling your class 1 photos
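A toy sketch of the similarity step, in plain Python. The 3-number "embeddings" and the 0.8 threshold are made up for illustration; real ones would come from whatever image embedder you pick, and the threshold would need tuning on your data.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(embedding, references, threshold=0.8):
    # Class 1 if the embedding is close enough to ANY reference embedding
    best = max(cosine_similarity(embedding, ref) for ref in references)
    return 1 if best >= threshold else 0

# Hypothetical embeddings of a few known class 1 photos
references = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]]

print(classify([0.95, 0.15, 0.05], references))  # points the same way as the references
print(classify([0.0, 0.1, 1.0], references))     # points in a very different direction
```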
how do i build numpy with certain cpu options? i need it to be faster so i want to build it and i followed this
https://numpy.org/devdocs/reference/simd/build-options.html#quick-start
the problem is, i dunno where the setup.py is. if i go to the site-packages/numpy/setup.py it just says that this is the incorrect setup.py to run
how do you know that your CPU supports operations that aren't supported by the wheel you would get from pypi?
just checked, numpy wheels on pypi already utilize max cpu instructions
Is there a python analog to this behavior from sequence in R? :
[1] 1 2 3 4 1 2 3 1 2 1
not built-in, but you can just implement it yourself
we have range() for normal [1, 2, 3, 4], but that behaviour you exemplified seems very weird
!e ```py
def sequence(n):
    for j in range(n, -1, -1):
        yield from range(j)

print(list(sequence(4)))
```
@agile cobalt :white_check_mark: Your 3.11 eval job has completed with return code 0.
[0, 1, 2, 3, 0, 1, 2, 0, 1, 0]
numpy/scipy might have it builtin somewhere, but if not you'll probably want to create arrays with np.arange then concatenate them with some other numpy function
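For a 1-based version that matches the R output, here's a sketch using only the standard library. `r_sequence` is a made-up name; it takes the list of lengths, the way R's `sequence(nvec)` concatenates 1..n for each n:

```python
from itertools import chain

def r_sequence(nvec):
    # Concatenate 1..n for every n in nvec
    return list(chain.from_iterable(range(1, n + 1) for n in nvec))

print(r_sequence([4, 3, 2, 1]))
```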
I have penned an article with valuable insights. I would love to hear your feedback.
https://medium.com/@sahaniiianuj/bidirectional-english-marathi-language-translation-model-82f39b99bf98
Thank You !
Fine-tuning pretrained mbart50 model to make a en-mr bidirectional translation model using Hugging Face transformers.
Hello!
I would like to make an application that will make me an electricity forecast for the next period based on trained models. what should I start with? Apart from the correlation coefficient/humidity/temperature/seasonality coefficient, what else can I use? Sorry if I'm posting where I shouldn't, please redirect me!
so basically, if I understand correctly, that's simply setting the default value for the second dropdown. And it works!!! Thank you! I still don't really quite understand why it doesn't work without it.. I would have thought that since the table updating callback has the second dropdowns "value" as an input, dash would make the connection that the second dropdown needs to have an "options" parameter to work, which is the output of the first dropdown, and dash would define the order appropriately. Turns out, it seems, it doesn't make the connection, that in order to choose a value, you first need options. Will keep that in mind for the future. I learned something new today! Thx)
From my past experience, as long as your data is consistent and not SVG (which uses a different method to represent images), everything should be fine. Some other factors outside of your question could still affect the performance of your model, such as the general size of the images: high-resolution images tend to have thousands if not millions of pixels, which affects the time it takes for your model to process and train on the data. I hope my answer is to your satisfaction and of use. 🎩 👌
Good afternoon, can somebody please take a look at this?
Hey folks, I think there's a critical knowledge gap in my understanding of gradient descent:
Let us assume a neural network with a single input layer with 3 neurons , and an output layer with 2 neurons
So we feed the system some data, and it outputs some neuron with the highest value (prediction)
I'll ignore the activation function
To fix the weights take some loss function L:
L = loss(w1a1 + w2a2 + w3a3 + b)
calculate its gradient with respect to the weights, and update the weights - This decreases the loss of the function (Assuming we're not already close to the local minima):
New weights: m1, m2, m3
Now we go to the next batch of data and do the same thing: (b1, b2, b3)
Problem is though - now the function has changed: The input is different, and thus the loss function is different - so the local minima of the loss function has shifted elsewhere.
L = loss(m1b1 + m2b2 + m3b3 + b)
What am I missing here? Thanks in advance
(Just to clarify, this is me explaining the mental image in my head - not me trying to prove something of course)
You aren't missing anything - when doing minibatching, the current weights "jerk around", moving each time towards the local minimum on the current batch.
But weirdly enough, that ends up working alright to optimize on the whole dataset. In fact, even weirder (to me at least), stochastic gradient descent sometimes works better than optimizing on the whole dataset at once, because this jittering helps the model to not get stuck in shallow local minima, but rather move gradually to the global one - much like how optimization algorithms like simulated annealing occasionally accept changes that increase loss in order to break out of local minima.
You're kind of blowing my mind here, let me get this straight -
The function changes between each batch - and thus the local minima we've been chasing* moves (as in, it might be in a different direction entirely than the negative gradient we've been "chasing" thus far*)
And yet, despite this local minima shift, the algorithm still works?
Is this because the loss decreases between batches due to chasing the minimum? So the current minimum we're chasing isn't as important as just decreasing the cost?
I hope this makes sense
More simply, maybe - we're moreso trying to simply decrease the cost as efficiently as possible ,which happens to be in the direction of some local minima, rather than trying to actually reach a local minima
Well, suppose we have very large batches - we split our million-sample dataset into only ten parts. It's pretty believable that in that case, a random tenth of the dataset will have roughly the same local minima as the whole dataset, and after averaging over the 10 batches it works out to pretty much the same as training on the whole dataset.
Another intuition pump is that if your learning rate is very small, then it should work no matter how small the batch is - because taking a tiny step down the gradient of the entire dataset is the same as taking a very tiny step down the gradient of each sample. (I think I can mathematically formalize this one if you want)
And it turns out that in between it still works - if you have a not-too-big learning rate and use not-too-small samples, the weights on average end up going in the whole dataset's direction.
@tidal bough It’s not so much that batching is possible that confuses me - more so that the local minima shifts between each batch and that the algorithm still works as intended
Like, it’s the fact that the function changes at any point of time
Although, I think I kind of get it now
If this is correct, that is
Sure, that's about right - we don't actually want to just get into some local minimum, because any notable NN has approximately infinity of them and most of them are pretty bad. We actually want to reach as deep a minimum as we can.
So just perfectly going along the gradient is actually a bad idea. And it turns out that introducing minibatching, and hence a random aspect to the walk, fixes that.
I hope I'm explaining myself properly, and not misinterpreting what you're saying (Thus far, everything you've said makes perfect sense)
Maybe I should explain the root of this question: I'm watching 3b1b's videos, and when they explain gradient descent, they explain it as "We're trying to reach a function's local minima"
More specifically, they use this graph throughout all of the videos
So I got the impression that the loss function has a single "form" if you will, and that the local minimas never move
So maybe if I state what I understand in a concise manner, you could just confirm:
- The loss function changes between each and every step
- Thus, the local minimas* move between each step
- Despite this property, the algorithm still works
- Not only does it work, it's occasionally better, and helps us break out of "bad" local minimas (Typically done with batching, which is what the right side of the image is trying to illustrate)
Are all of these correct?
Yeah, this looks right to me. Showing a graph like that is mostly a lies-for-simplicity kind of thing - a realistic one would be where there's local minima everywhere, and some are deeper than others, and just going for the valley in which you start will be a bad solution.
Yeah, this is obviously just a function with two variables
It might also be interesting for you to look up some modifications of gradient descent other than SGD, like gradient descent with momentum, but tbh I don't myself know much about how they work (basically, you can make your gradient descent intentionally overshoot the minima it goes for, which again helps with getting into a global optimum instead).
Maybe even more illuminating would be simulated annealing. It's a metaheuristic algorithm for multidimensional optimization (I don't think people use it in NNs, mostly just for normal problems) - you have an iterative optimizer with "temperature", and for zero temperature it's basically gradient descent, whereas for infinite temperature it's just a random walk. You start with a high temperature and gradually lower it to zero over the iterations. As a result, the optimizer ends up first wandering into a relatively large and deep basin, and then finding its local minimum, and that usually produces decent results.
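A bare-bones sketch of the idea in plain Python, in case it helps. Everything specific here is an assumption for illustration: the bumpy 1-D loss `f`, the linear cooling schedule, the Gaussian proposal step and the parameter values. Improvements are always accepted; worse moves are accepted with a probability that shrinks as the temperature drops, so near zero temperature it only ever goes downhill.

```python
import math
import random

def f(x):
    # Hypothetical 1-D loss with several local minima
    return x * x + 10.0 * math.sin(3.0 * x)

def simulated_annealing(f, x0, steps=5000, t_start=10.0, seed=0):
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_fx = x, fx
    for k in range(steps):
        # Linear cooling from t_start towards zero
        temperature = t_start * (1.0 - k / steps)
        candidate = x + rng.gauss(0.0, 0.5)
        fc = f(candidate)
        # Always accept improvements; accept worse moves with
        # probability exp(-(increase in loss) / temperature)
        if fc <= fx or rng.random() < math.exp((fx - fc) / max(temperature, 1e-12)):
            x, fx = candidate, fc
            if fx < best_fx:
                best_x, best_fx = x, fx
    return best_x, best_fx

best_x, best_fx = simulated_annealing(f, x0=4.0)
print(best_x, best_fx)
```

The function keeps track of the best point seen so far, so the returned loss is never worse than the starting one, whatever the random walk does in between.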
That's so interesting - indeed a lot of the things in ML feel so... What's the word, deterministic? As in -
"Why should I use X over Y"
"We just tested a bunch of models and we've reached the conclusion that X is typically better"
Which is a pretty unsatisfying answer, but at the same time kind of what you want to hear as a beginner instead of being overwhelmed with even more theory
Specifically, the second approach you mentioned sounds extremely random and doesn't sound like anything you can formally explain beyond "Yeah it just sounded like something that could work and it did"
Sure, the reason it's called that is because it's loosely based on the theory of how metals anneal. Works for nature, apparently works for numerical optimization too 😛
I suspect that there are in fact more convergence guarantees for all of this than I'm implying, because I don't often read ML research papers, but not sure it's much more.
In the meanwhile I mathematically formalized this note (for two batches, but it generalizes).
So minibatching (stochastic gradient descent) is provably the same as ordinary gradient descent for small enough learning rates.
(Note how this means that this is a case where lowering the learning rate might hurt your model, because the lower the learning rate, the more SGD acts like ordinary GD, which means going for the closest local minimum rather than jumping around - and for training NNs, that's generally a bad idea.)
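Here's a tiny numerical illustration of that claim, as a sketch under made-up assumptions: one scalar weight, two "batches" whose losses are (w - 1)^2 and (w - 3)^2, and the full-dataset loss taken as their sum so the same learning rate applies to both procedures. It compares one full-batch step with two consecutive per-batch steps.

```python
def grad(w, target):
    # Gradient of the per-batch loss (w - target)^2
    return 2.0 * (w - target)

def full_batch_step(w, targets, lr):
    # One step on the summed loss over all batches
    return w - lr * sum(grad(w, t) for t in targets)

def minibatch_steps(w, targets, lr):
    # One step per batch, each using the already-updated weight
    for t in targets:
        w = w - lr * grad(w, t)
    return w

def gap(lr, w0=0.0, targets=(1.0, 3.0)):
    # Distance between where the two procedures end up
    return abs(full_batch_step(w0, targets, lr) - minibatch_steps(w0, targets, lr))

for lr in (0.1, 0.01, 0.001):
    print(lr, gap(lr))
```

The step itself shrinks linearly with the learning rate, while the gap between the two procedures shrinks quadratically (10x smaller learning rate, 100x smaller gap), which is the sense in which they become the same for small enough learning rates.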
Extremely interesting - I’ll read what you’ve formalized in a few minutes (irl shenanigans)
Thank you for your help and the curious insights!
Here's the same thing but slightly rewritten (including made slightly more correct by noting the next term of the taylor series, etc) and using linalg notation rather than indices everywhere.
First time ever hustling with sorta-data-science, and I just challenged myself to build a script to find dominant colors in each frame for a video
with AMD Ryzen 5 4600H with Radeon Graphics (12) @ 3.000GHz processing 59450 frames takes ```bash
name           id  tid              ttot      scnt
_MainThread    0   139651551671424  73.62470  14472
ManagerThread  1   139651316668096  7.709011  12115
Thread         2   139651325060800  4.361026  8368
```
Are you making all of this notation that you're sending? 
I don't think I do? What do you mean?
like all these screenshots you're sending, are you making them with latex or is this from a source somewhere?
I wrote that just now, yeah.
i mean, I can post the latex 😛
nah I figured it was from a book or something, disregard me XD
that's simply setting the default value for the second dropdown.
kind of, it's setting the default value of the dropdown when the dropdown's list of possible options changes.
I still don't really quite understand why it doesn't work without it..
it's about the ordering of when callback are invoked on initialisation. and as you rightly pointed out, dash does not make that connection between "value"and "options" for you.
Can model conversion to fp16 take a hit on accuracy? Does it have an impact on inference time?
typically you'd expect it to decrease the accuracy slightly, keep the inference time about the same or lower it (depending on whether the hardware has fast fp16 support), and roughly halve the model size
usually you'll have to train a bit after converting to a different precision iirc
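A small sketch of where the accuracy loss comes from, using only the standard library: the `'e'` format in `struct` is IEEE 754 half precision, so round-tripping a value through it mimics storing a weight in fp16. The sample weights are made up. How sensitive a real model is to this rounding depends on the model and is not shown here.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


# Made-up example weights, not taken from any real model.
weights = [0.1, -0.333, 1.0001, 1e-8]

for w in weights:
    w16 = to_fp16(w)
    print(f"{w!r:>10} -> {w16!r:<24} abs error {abs(w - w16):.2e}")

# Storage per value: 2 bytes in fp16 versus 4 bytes in fp32.
print(struct.calcsize("e"), struct.calcsize("f"))
```

Small differences between weights (such as `1.0001` vs `1.0`) and very small magnitudes are what get lost, which is one reason a short fine-tuning pass after conversion is sometimes used.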
Its a completely closed source model
and?
I'd recommend not trying to convert it yourself then but rather asking whoever gave you the model then
Alright
hii, i want to make an automatic licence plate detection, how can i do so and what tutorial should i follow?
If I wanted to make an MLP in PyTorch and then move all the weights to my own library for testing, is there any better move than making an MLP nn.Module and then manually parsing its .state_dict()?
There are 3 steps for this kind of task: detecting the license plate, outlining the characters in the license plate, and reading those characters. This is a basic tutorial and then you can improve each step: https://pyimagesearch.com/2020/09/21/opencv-automatic-license-number-plate-recognition-anpr-with-python/
That would probably be the easiest if all you want are the parameters
When working with pandas and you have a categorical column, do you usually convert it from the default object type to categorical (which saves a bit of memory and, I assume, makes some operations faster), or is the overhead of the type casting/conversion (whatever it does under the hood) not worth it? What's the best practice here?
I use it when the values repeat a lot and I have to do multiple transformations and/or the dataframe is large enough. In my experience casting to categorical is pretty cheap anyway.
Hey, I'm trying to create an RNN. I have multiple audio dataframes for each song; every dataframe corresponds to a chunk of the song. This means that songs with varying lengths have varying numbers of dataframes. From my very limited understanding of RNNs, it's beneficial to train in batches where the batch size matches the number of dataframes for a single element. My question is whether it is a valid approach to pad the number of dataframes with dataframes containing only -1, so it's consistent.
If something i said makes no sense or is stupid, feel free to point it out.
i tried this but the results werent accurate because of only using opencv, thats why i used neural nets, but still i am struggling with the results, thats why was looking for other tutorials. Can u share if there are any other with good accuracy as most of them i saw uses api
Which part of it was not accurate?
Dividing it up into three parts is not a bad idea, so if you only need to replace one part that is more doable
Hoi, before i ask, which channel is appropriate for help with stable diffuson dependencies and such? Specifically on amd
are you trying to run it locally, or through an API?
Locally.
Could be an incompatible something that it can't read from because it's too new, possibly. Honestly don't know
I will not read any screenshots of text; please copy and paste it directly
Gotchu. Just didn't know if there was a "don't do that, it's linespam" :P
(134)(deck@arch ComfyUI)$ python main.py --normalvram --disable-cuda-malloc --use-split-cross-attention
Total VRAM 4096 MB, total RAM 11795 MB
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Custom GPU 0405 : native
Using split optimization for cross attention
python: /usr/src/debug/hip-runtime-amd/clr-rocm-5.6.0/hipamd/src/hip_code_object.cpp:754: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion err == hipSuccess' failed.
Aborted (core dumped)
As it's a steam deck, i'd toy around with the "--insertcommand" to see what runs the best, but can't get it to even launch :P Gotten automatic to run in the past, but can't seem to get it to work now, so trying comfyui lol
and you replaced the steam deck's native linux flavor with arch?
Oops, forgot to note that.
SteamOS, but distrobox with arch. Also got distrobox for ubuntu 20.04, but it didn't work either with automatic1111.
I'm not sure what to suggest, unfortunately
No worries. I'll ask around/wait for someone who could possibly know :P
HIP probably does not support that GPU. These kinds of libraries tend to only support the most recent dedicated GPUs.
Aye. But got it somewhat working now, with python rocm, now i'm debugging with comfy's creator as there's a conflict when i try to generate. Doesn't get past clip
guys, I don't think this is worth a topic on help because it's not exactly python-related. But since you guys are used to using jupyter notebooks, I'd like to know something: if you use the vscode extension, does it stop the colors out of nowhere sometimes too? It's getting annoying for me over the last day. It keeps "crashing" the colors, autocomplete, etc. The notebook itself still works. I've already looked for conflicting extensions, but nothing I could find that helped.
Try asking in #editors-ides instead
tyty!
I had a question about Neural Networks.
Are there any tutorials which teach how to make neural networks from scratch without using any library or frameworks?
I wanted to learn the basics in Julialang, so the language won't matter for the most part as long as it's a sane one.
There are, but many will probably still use numpy.
So? Any language can have constructs that don't exist in other languages
And you could have a language where something like numpy is part of the language
@serene scaffold most of the numpy's functions are covered in base Julia interpreter. so what are the suggested tutorials that you have
I would still avoid any frameworks related directly to ai/nneu though
I don't have a specific one in mind, but you'll probably get better results for "neural network in Python with numpy"
Numpy doesn't have any constructs that are intended to make machine learning any easier, so none of the important parts would be abstracted away.
A machine learning craftsmanship blog.
I was doing a dude's nneu from scratch in python but he used his own library in middle so i felt betrayed
OK, that is a very awesome tutorial
God bless you mate
Looking for some guidance in creating an interactive html report similar looking to the image. I have some csv/excel data and want to create a nice dashboard looking report detailing data migration progress. The objective would be to output a single html encapsulating the data and interactive visualisations. Has anyone done something similar before? Would you be able to point me in the right direction?
for something as complicated as that, with multiple pages or a complex layout? you have no choice but to go to JS
otherwise Quarto or possibly Plotly subplots + writing out HTML are the closest things I'm aware of in python
Are there any good resources for learning reinforcement learning hands-on? I've tried a few university courses on youtube but all of them are highly theoretical and don't involve code.
well yeah, machine learning is 95% theory 5% code
If so, as a complete beginner in ML, where should I start learning reinforcement learning? Is it ok to skip stuff like supervised and unsupervised learning and delve into reinforcement learning directly?
In short, what are the prerequisites for reinforcement learning?
Can anyone provide me a roadmap or maybe some resources to get started with neuroevolution (genetic algorithms and NEAT)
I am currently doing the Hugging Face Deep RL course (if that helps)
Please ping me when you reply
Hugging Face's Deep RL course is where i started my reinforcement learning journey, and no, it's not okay to skip those topics as they form the foundation of all topics in machine learning
Ah I see... I do have some basic understanding on stuff such as cost functions, gradient descent, regression etc. Is that enough to start learning RL? How far should I delve into those topics before starting to learn RL?
Thanks for the suggestion though, I'll check it out
that should be enough, although the course i mentioned is a Deep RL course, so if you are considering to do it, then i would recommend familiarising yourself with neural networks as well
I happened to look at Quarto more this morning. Seems super powerful, and should be able to do what you want
I may have asked several times about this questions but how would you guys made this into latex text?
I normally would not make something like that in LaTeX and instead make it elsewhere and include it as a figure later. If you really want to make it in LaTeX I would use TikZ, but that's not a very pleasant task
Ok great thank you, i will take a look.
I agree with that last part haha
Hi all, I'm looking for some advice about tensor libraries. I'm working on a chemometrics project working with spectrum-chromatograms, 2nd order tensors, and am looking for a python library that will enable me to apply custom preprocessing algorithms to the tensor prior to modelling. In the past i have achieved this by producing pandas dataframes of dataframes, but this is both cumbersome and frankly just feels dirty. I've given a cursory glance to several libraries such as Keras, but they don't seem to fit my needs, at least not superficially. Please help!
what preprocessing are you trying to do? one hot encoding?
no its all numerical signals with a single label, at least initially. Looking to normalize, apply a savitzky-golay filter, PLS baseline correction, resample, and then align the signals. not necessarily in that order.
if you can use pandas to produce a dataframe that's structured like the tensors you need, that is fine.
the previous solution was a series of dataframes, I was wondering if there was a more elegant approach, as without for example creating my own dataframe class, a series of dataframes is difficult to observe / debug
a series of dataframes, as in a pandas Series?
!docs pandas.Series
```py
class pandas.Series(data=None, index=None, dtype=None, name=None, copy=None, fastpath=False)
```
One-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).
Operations between Series (+, -, /, \*, \*\*) align values based on their associated index values– they need not be the same length. The result index will be the sorted union of the two indexes.
this?
correct
you should never have a Series of DataFrames. all the DataFrames that are in it should be concatenated into one (potentially with multiple levels of indexing)
Yeah I know its unorthodox, hence my question. I was inspired to use a Series of Dataframes by Jodie Burchell in this podcast https://open.spotify.com/episode/6iN2nYZGBTdUAdpVnWvI5W
Listen to this episode from The Real Python Podcast on Spotify. Are you still using loops and lists to process your data in Python? Have you heard of a Python library with optimized data structures and built-in operations that can speed up your data science code? This week on the show, Jodie Burchell, developer advocate for data science at JetBr...
having nested pandas structures doesn't mean that you need to switch away from pandas. it just means that you're using pandas wrong.
and keras isn't an alternative to pandas
pandas and polars are for the same thing
pytorch and tensorflow (and therefore keras) are for the same thing
That's fair to say, and Id rather stay within the pandas ecosystem if i can. How would you suggest structuring my 3-dimensional numerical data?
with multiindexing
yeah right, I can see how that would work.
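A minimal sketch of the MultiIndex layout suggested above, assuming each sample is a small 2-D table (rows are time points, columns are wavelengths). The sample names, axis names, and random values are hypothetical placeholders, not the poster's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data: 3 samples, each a (time x wavelength) signal table.
samples = {
    name: pd.DataFrame(
        rng.random((4, 2)),
        index=pd.Index([0.0, 0.5, 1.0, 1.5], name="time_min"),
        columns=pd.Index([254, 280], name="wavelength_nm"),
    )
    for name in ["sample_a", "sample_b", "sample_c"]
}

# One DataFrame with a (sample, time_min) MultiIndex instead of a Series of DataFrames.
signals = pd.concat(samples, names=["sample"])

# Pull out one sample's table, or run a per-sample operation with groupby.
one_sample = signals.xs("sample_b", level="sample")
normalized = signals.groupby(level="sample").transform(lambda s: s / s.max())

print(signals.head())
```

Everything stays in one frame that can be printed and debugged directly, and per-sample preprocessing steps become `groupby(level="sample")` calls.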
What's a good modern reference for 3D image classification? Any SOTA models and standard data processing techniques?
Hi,
I just got my master degree in experimental psychology from a really good college.
Right now i'm doing a gap year bc I want to properly learn how to code and ML.
I'm not sure yet if I want to do a Phd mixing experimental psychology and cognitive process modeling or become a data scientist.
I just started CS50P(ython) from Harvard and like it very much.
I plan to do the regular CS50 after and follow with CS50AI.
I'm also considering using Dataquest or Datacamp on the side to reinforce/train.
Are they worth it? Are they equally good?
(I read on reddit that datacamp is too easy and consists of filling in blanks, no actual typing. I just started the free version of the python course and it seems that's not the case in the first course.
On the other hand Dataquest seems more challenging but is lacking in variety of courses and is much more expensive)
Thank for reading this long message 🙂
PS: my only real coding experience is some C in highschool and R during college for stats, but R is quite different from other languages from my understanding.
how does one transfer arguments from minimizer_kwargs in basinhopping to a custom function being used as a method?
I don't know how to copy pandas,
.copy method is not working, .copy(deep=True) also does not work.
"Not working" I mean that columns are not copies, so when I modify columns in one, the other dataframe is also modified 😐
segment_df.columns.to_numpy() has some trickery inside not doing copy...
!e Can't replicate this:
import pandas as pd
df = pd.DataFrame({'a': list('aaabb'), 'b': range(5)})
df2 = df.copy()
df2["b"] *= 2
print(df)
@tidal bough :white_check_mark: Your 3.11 eval job has completed with return code 0.
001 | a b
002 | 0 a 0
003 | 1 a 1
004 | 2 a 2
005 | 3 b 3
006 | 4 b 4
What's the .dtypes of your dataframe? Perhaps it's something really weird that needs deep copying?
I solved, but problem was with this code.
```py
"""list_ofdf is a list of dataframes split into smaller segments to separate scope
"""
for segi, segment_df in enumerate(list_ofdf):
    # print()
    segment_df = segment_df.copy(deep=True)
    print(f"Segment {segi:>3}: {segment_df.shape}. cols: ")
    # print(segment_df.columns)
    timestamp_ind = np.argwhere(segment_df.columns == "timestamp_ns").ravel()[0]
    segment_df.iloc[:, timestamp_ind] = segment_df.iloc[:, timestamp_ind] / 1e9
    # print(f"timestamp_ind: {timestamp_ind}")
    base_features = segment_df.shape[1]
    segm_columns = segment_df.columns.to_numpy()  # THIS DOES NOT WORK
    # segm_columns = np.array(segment_df.columns)  # THIS WORKS
    segm_columns[timestamp_ind] = "timestamp_s"
```
to_numpy tries to make a view rather than a copy if possible, pass copy=True if that's undesirable.
but it should be copied already so its a bit confusing
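A hedged sketch of the two safer options mentioned or implied above. Index objects are meant to be immutable, so it seems pandas may let a DataFrame and its copy share the array behind their column labels. Writing into the result of a plain `.to_numpy()` can then leak into both. The small frame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"timestamp_ns": [1_000_000_000, 2_000_000_000], "value": [0.1, 0.2]})
segment_df = df.copy(deep=True)

# Ask for an independent array so edits cannot leak into any shared Index.
segm_columns = segment_df.columns.to_numpy(copy=True)
segm_columns[0] = "timestamp_s"

# Alternative that never touches the underlying array at all.
renamed = segment_df.rename(columns={"timestamp_ns": "timestamp_s"})

print(list(df.columns))       # original labels are untouched
print(list(renamed.columns))
```

`rename` is usually the more idiomatic choice when the goal is only to relabel a column.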
hello guys, does anyone had this problem ?
i installed tensorflow, tf-gpu and keras. I want to train a zoo model and i have problems setting up 😦
you cut off almost all of the traceback, so hard to tell
i have looked on the internet and some say that the numpy version is deprecated, installed another version and it still didn't work
hi can someone help me with a project that I have
I have to make a NN model for regression, and the dataset just consists of x and y values so really simple.
learn coding by doing some projects; a good first impression can be found here:
!resources
The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.
how shd I get started?
own NN? (e.g from scratch)
so from scratch
yep just need to acheive a high accuracy rate
this is a pretty amazing video
https://www.youtube.com/watch?v=w8yWXqWQYmU&ab_channel=SamsonZhang
Kaggle notebook with all the code: https://www.kaggle.com/wwsalmon/simple-mnist-nn-from-scratch-numpy-no-tf-keras
Blog article with more/clearer math explanation: https://www.samsonzhang.com/2020/11/24/understanding-the-math-behind-neural-networks-by-building-one-from-scratch-no-tf-keras-just-numpy.html
oh no I can use tensorflow and stuff
🗿
sorry I was a bit unclear haha cuz I'm so confused rn
I just don't know how to get started
so did u plotted ur data (x,y)?
always good to do a exploratory data analysis
ok
are u a graduate? (which complexity u want)
do u need to present and explain why u chosen certain model?
not really, since I only have to use neural networks, so I don't really need to explore forests or kcluster but I do need to present how I built my model and it's kinda important that I get a good accuracy rate (which shd not be too complication since the data is (x,y)
so for simple regression go with a sequential model
yea that's what I thought thanks a lot!
do you know any code online that mimics a project aiming for a high accuracy rate
u can write a cross-validation loop over different architectures, batch sizes, epochs etc.
im pretty sure u will find packages which do that for u
oh never looked into it, do you know what I should search up to get started? like any package names you know?
is this similar to hypertuning parameters?
ok tysm!
I'm working on an accounting dataset. They have a revenue amount that's like 1000 and then some number n of lines that offset that 1000. I can group the accounting data into small chunks of usually less than 100 lines. Problem is that the offsetting amount can be 1 line or 10 lines and it's mixed into those 100 mostly identical looking lines. (There isn't any other good data to filter it down any further.) My natural inclination is to iterate over every combination to see if any number of lines equals 1000. Then I can show them these as proposed matches and they can nail down which ones they want. Does anyone know of a fast-ish algorithm to do this? I'm going to run it on a GL dataset filtered down to the 25 million potential offsetting lines. Thankfully once it's done we just need to do it go-forward, which should be easier
can anyone teach me pytorch with flask
I really can't understand what you're describing. Can you provide a minimal example?
the short answer is: sure, probably, as long as you can provide a formula, it can be done!
@left tartan sure!
Here are some headers and fake data. This is my revenue line:
Account | cost center | location| date | amount
ABC123 | 111 | CA | 6/30/2023 | 1000
Same headers, here is my offsetting cash.
ABC123 | 111 | CA | 6/30/2023 | -250
ABC123 | 111 | CA | 6/30/2023 | -333
ABC123 | 111 | CA | 6/30/2023 | -500
ABC123 | 111 | CA | 6/30/2023 | -250
Out of the above 4 lines, 3 of them will tie to my 1000 revenue. So (for now) I have a recursive function that sums every possible combo to see if it matches the 1000 revenue amount. So for these 4 lines it should create about 4! or 24 loops to try every combo until it finds that lines 1, 3, 4 add together to offset the revenue. I have a description column to tell me which line is the revenue line, but the cash offsets are either blank or not helpful in finding the offsetting amount
Sounds very reminiscent of two sums, but with N.
I guess you could employ a recursive approach or DP algorithm (ie: cache the intermediate calculations so you're not re-calculating) in the abstract.
This might be a better question in #algos-and-data-structs
definitely look at the 2sum algorithms, they can be generalized: https://leetcode.com/problems/two-sum/
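A minimal sketch of the DP idea mentioned above (this is the subset-sum problem), using the fake data from the example. The function name is made up, and amounts are assumed to be converted to integer cents so sums compare exactly. It returns only one matching set of lines per target, and the table of reachable sums can still grow large when amounts are very diverse, so treat it as a starting point and not a tuned solution.

```python
from typing import Optional, Sequence, Tuple


def find_offsetting_lines(
    amounts_cents: Sequence[int], target_cents: int
) -> Optional[Tuple[int, ...]]:
    """Return indices of one subset of `amounts_cents` summing to `target_cents`, or None.

    Dynamic programming over reachable sums: each distinct sum is stored once,
    together with one set of line indices that produces it.
    """
    reachable = {0: ()}  # sum -> indices of the lines used to reach it
    for i, amount in enumerate(amounts_cents):
        # Snapshot the items so each line is used at most once.
        for total, used in list(reachable.items()):
            new_total = total + amount
            if new_total not in reachable:
                reachable[new_total] = used + (i,)
    return reachable.get(target_cents)


# The fake data from the example above, in cents: revenue 1000.00, four cash lines.
revenue = 100_000
cash_lines = [-25_000, -33_300, -50_000, -25_000]

match = find_offsetting_lines(cash_lines, -revenue)
print(match)  # zero-based indices into cash_lines
```

For comparison, brute force over unordered combinations of n lines checks up to 2^n subsets (16 for four lines). The table above only grows with the number of distinct sums, which helps when many lines share the same amount.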
Okay yeah! Oh nice 👍 thank you! I'll give this a shot. It gives me better words to keep researching at least
It looks like a really messy way to generate a complex selfcontained website. I have done something similar in rmarkdown but I think probably best either to simplify the project or try find something else. The embedded html approach gets quite messy with complex designs 😦
How do I hide **specific **cells from rendering when exporting to HTML in Jupyter Notebook (VSCode) Similar to how you can do it in R studio notebooks. edit: I know about cmd line nbconvert --no-input
Question - I have created a duckdb database with a 3d instrument signal table totalling 173 million rows by 9 columns. Trying to load this into memory results in the process being killed. Is a database the best solution for this type of data, or should I be looking at another format?
Do you need 173Mx9 in memory?
There will be no use case that requires all of the signal data at once but I am trying to establish test parameters
See if there's an option to chunk your data set. (i.e. read in one chunk of your data at a time)
I'm not familiar with DuckDB so I won't be able to give much advice other than that.
That might work! Side note, if I wasnt going to use a database, what would be your choice of data format, considering my database table is actually 200 individuals in long form, approximately 4500 rows per individual once pivoted
"Depends on your data" would be the generic answer. Probably be a better question for #databases
I only really use CSV myself. The 'data format' can be a csv, parquet, sparse matrix, or something else, but each one depends on what your data looks like. (E.g. a sparse matrix is good for datasets with a lot of 0s)
It's also good to point out that, at a small scale, it doesn't matter too much. Optimization only really matters when you reach like hundreds of millions of rows +
thanks, ill give it a go at #databases as well.
Would you elaborate a little more? Make it elsewhere what? : )
A little off topic, is VSC ok for the use of SQL ? Or it's better to use My SQL or something else ?
Guys, just a quick help, lets say i have size of data (1214 rows , 93 columns), if i want to remove rows based on columns ranging from 44:88 for example using pandas. I am having difficulty achieving this cuz all i can do is remove rows based on columns values, i want to remove just based on columns
I tried something like this df.drop(df.columns[44:89], axis=0, inplace=True)
but does not work as it drop columns but not rows associated with it
df.drop(list(range(44,89)),axis=0) should work
When I am making publication quality figures, if I am not using TikZ for whatever technical reason, I’ll make them by hand in Adobe Photoshop or Illustrator or pay someone to do it if it’s beyond my ability.
Hey,
I'm looking for data documentation tool / package.
I want it to document the input data and the output data.
Currently each stage in the pipeline load the data from a given path.
Compute some features, and save to another given path.
Thanks
I’m usually the duckdb guy around here, but I’d suggest going over to the duckdb discord and asking there. And sharing the query. The reason is; there are strategies for dealing with larger than memory datasets and queries, whether by chunking or writing queries that operate in subsets of data. Another strategy is to partition the source data… I use parquet a lot for this, and keep large data external.
I just want to make it into text form and thought latex would be the simplest, knowing what was connected to what*
so u generate json files ? what is ur goal
we dont .drop we use .loc 🗿
Hi! I'm dealing with a slimy landlord unfortunately and may have to go to a tribunal hearing. I want to be well prepared and had the idea that I could scrape their publicly available decision cases, and then train a LLM using that. For scraping I think I could use BeautifulSoup, but does anyone have suggestions for the LLM part?
Having worked with many lawyers, and although IANAL, I can safely say: that sounds like a colossal waste of time to try to do (to try to build a meaningful model from a handful of cases). How many cases are you talking? A dozen? You might as well read each of them, take notes, and summarize.
They have about 12-13 years of stuff, and it looks about 25-40 cases each year
If I could learn some neat stuff while getting some benefit from it, I'd be happy
And, i'd be remiss if I didn't say: Get a lawyer :)... anyway, If I were starting with something like this, I'd probably look at classifying them. Check this out Comparative Study of Classifying Legal Documents with Neural Networks
Awesome, thanks! Also, yea, I have some legal aid, this is just supplementary / hobby project 🤓
if anyone has experience with time series models, especially ARIMA, can you kindly help me with my project. ADF and pacf, acf plots are done. just need help with p,d,q values
I'm no expert, but have you looked at auto-arima? I've used it mostly when trying to optimize the parameters https://alkaline-ml.com/pmdarima/modules/generated/pmdarima.arima.auto_arima.html
no work it just gave a straight line as output
That would seem like some sort of data error, i'd guess. Can you share a minimal reproduction?
btw letme show you the acf and pacf plots
the number of lags selected is 365
because the data is yearly and a value does depend on its last year value
i am using gradient boosting models with input data like this
using temp did give it a little boost
it's only giving me 0.53 r2score
i need it to be atleast 0.75+
so trying arima model
i am open to other model suggestions as well
And what do you get from Arima?
Here's an example of a simple arima model against a sin+noise signal:
```py
# ARIMA example with sin + noise (updated with correct m=20)
import matplotlib.pyplot as plt
import numpy as np
from pmdarima import auto_arima

# 100 points of a noisy sine wave with roughly 20 points per cycle
n_points = 100
x = np.linspace(0, 20 * np.pi, n_points)
noise = np.random.normal(0, 0.5, n_points)
y = 5 * np.sin(x / 2) + noise

# Search for a seasonal ARIMA model; m is the number of points per season
model = auto_arima(y, seasonal=True, m=20, trace=True, error_action='ignore', suppress_warnings=True)
forecast, conf_int = model.predict(n_periods=20, return_conf_int=True)

# Plot the data, the forecast, and the confidence band
plt.figure(figsize=(12, 6))
plt.plot(y, label="Data")
plt.plot(np.arange(n_points, n_points + 20), forecast, color="red", label="Forecast")
plt.fill_between(np.arange(n_points, n_points + 20), conf_int[:, 0], conf_int[:, 1], color="pink", alpha=0.3)
plt.legend()
plt.show()
print(model.summary())
```
what is m?
but... the period of the input data is 20 points, not 25. or to be precise, 19.8, I think.
and if i have yearly ? 🌝
good point. Trig was never my strong suit 🙂 fixed
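The period arithmetic behind that correction, as a short worked check. It only restates the numbers from the example (100 points from 0 to 20π, signal `sin(x / 2)`); no new data is assumed.

```python
import math

n_points = 100
x_start, x_stop = 0.0, 20 * math.pi

# np.linspace(0, 20*pi, 100) places 100 points over 99 equal intervals.
step = (x_stop - x_start) / (n_points - 1)

# sin(x / 2) repeats every 4*pi in x.
period_in_x = 2 * math.pi / 0.5

# Number of samples per cycle, which is what `m` should approximate.
points_per_cycle = period_in_x / step
print(points_per_cycle)
```

This comes out at about 19.8 points per cycle, so `m=20` is the nearest integer season length.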
I want to finetune a language model with my custom data, does the data have like a define format or does the format depends on the model?
The question really is what's your "season". Is there an inherent cycle to the data? If you only have annual points, you may not be looking at an arima model, unless there's some underlying cycle to the data
nah i have daily readings
Yah, so m=365, probably
iirc, pmdarima doesn't use an x axis... it expects the data to be sequential (chronological) and the observations to be evenly spaced
hmm so datetime objects as index and simply 'temp' values to it should work right
nah data is proper and complete
Hello guys, I made a library which will genrate random data matrix, would anyone will try?
sure
Here
pip install rand-omata
Auto arima is slow because it’s trying lots of models out
i amma need it very soon so as soon as i use it will give you my feedback
Okk
1.5 h 😭
2h
2.5 h
lmao
Maybe share the code you’re running?
And, you don’t have to use auto arima, you could try tuning the parameters m yourself and measuring aic
```py
import pandas as pd
# %%
df = pd.read_csv('data.csv')
# %%
df = df[['date', 'Temperature']]
# %%
df.columns = ['date', 'temp']
# %%
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
# %%
# set_index returns a new DataFrame, so the result has to be assigned
df = df.set_index('date')
# %%
y_train = df['temp'].iloc[:5843]
y_test = df['temp'].iloc[5843:]
# %%
from pmdarima import auto_arima
model = auto_arima(y_train, seasonal=True, m=365, trace=True, error_action='ignore', suppress_warnings=True)
forecast, conf_int = model.predict(n_periods=len(y_test), return_conf_int=True)
```
i think it's because the data has nearly 7300 observations
Maybe autoarima on a subset to get the parameters, then train?
btw still running 💀
only a year?
btw training data has values from 2003 - 2018
while testing 2018-22
and pandemic really affected the values i am working with
eto so for research standards
is it plausible to explain the drop in accuracy with the pandemic as an anomaly?
how small of a data set you suggest i should use?
No idea, seems odd it’s so slow but I don’t work with daily data much and only use arima occasionally
well i amma try running 2 instances
of one year data of 2 diffirent years
if the parameters remain same
it should be okay to go with
right?
Yah, yah, the arima parameters aren’t extremely complicated
Right, because it's free 😛
There aren't that many companies happy with throwing money away for the greater good
Hello
And it's not that slow I don't think
How are you
jupyter notebook
...
ehh works i can use it to play games while it's being done on the cloud 🤣
how big is your dataset
how are you defining seasonal in your dataset?
?
didnt get your question>
how my data is seasonal?
well it's temperature values
💀 first and foremost and it's visible from the plots as well
@zealous hollow I think parsimony is generally appreciated in ARIMA modeling. Tomorrow's temp is probably somewhat related to last year's temp but more closely related to today's temp. Since there's a moving avg component, using 365 is just going to pull your model toward the yearly average, which in summer or winter isn't representative at the extremes. An AR(365) is saying that every single day for the past year impacts the temperature tomorrow. I don't think either of those are convincing (personally IMHO)
yeah 💀
that may fix your issue too
so 1,2? i do have the acf and pacf plots
lag no is 76
i have tested arima with
0-6 for each p,d,q but the only time it even showed something other than a straight line was with 4,0,3
but that quickly went to straight line as well only after a few cycles
Where are your confidence intervals in those graphs? To me that looks like an AR(2) you would need to run the graphs again after running the model on your residuals to look for remaining significance
How far into the future are you forecasting?
my data set has temps from 2003-2022 so i am atleast expecting it to go to 2030
M=365 is just defining the seasonality of the data, that’s the correct use of the m parameter for daily data… that’s separate from the parameters to the arima model
ye i amma take billy's side on this one 🌝
Second, op is running pmd autoarima to find optimal parameters. That’s the slow part for op
Oh I must have misread, I thought they were using a model like (365,0,365) or something crazy.
Oh no, autoarima searches through the parameter space. Ghosty: can you paste the autoarima output?
Oh, you’re not printing the output incrementally?
I dunno, my example above would print the models as they are tested
from pmdarima import auto_arima
model = auto_arima(y_train, seasonal=True, m=365, trace=True, error_action='ignore', suppress_warnings=True)
forecast, conf_int = model.predict(n_periods=len(y_test), return_conf_int=True)
it's exactly your code for me it's not printing out anything
set seasonal to false
data is seasonal though
should i use m=7? @left tartan ||sorry for pinging||
i want to change the shape of a data frame from (400 x 1300) to (400 x 1200) with something similar to PCA, but cannot use PCA since n_components has to be smaller than 400, any ideas?
7 is for weekly periodicity. Like sun-sat
oh well i amma wait till it finishes and see the output
what type of data?
real numbers
how about LDA
the same story
??
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Assuming X is your data and y are the corresponding class labels
lda = LinearDiscriminantAnalysis(n_components=1200)
X_lda = lda.fit_transform(X, y)
yeah but it also has to be lower than 400
like i dont have (1300 x 400) but (400 x 1300)
yeah i want 1200
i want (400 x 1200)
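A hedged sketch of why the component count is capped by the number of rows. After centering, n samples can only spread out along at most n - 1 directions, so any further components would carry no variance. The matrix below is a scaled-down random stand-in (40 x 130) for the (400 x 1300) frame, not the poster's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the (400 x 1300) frame, scaled down: 40 samples, 130 features.
X = rng.normal(size=(40, 130))
X_centered = X - X.mean(axis=0)

# Singular values of the centered data: PCA components correspond to the nonzero ones.
singular_values = np.linalg.svd(X_centered, compute_uv=False)
rank = np.linalg.matrix_rank(X_centered)

print(rank)                 # number of directions that carry any variance
print(singular_values[-1])  # the last singular value is numerically zero
```

So with 400 rows, a projection to 1200 columns cannot hold more information than one to 399. Keeping 1200 of the original columns is a feature-selection problem, not a PCA one.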
how about rfe?
ok let me do some reserach about that
@zealous hollow what is the problem you're trying to solve exactly? The forecast being flat?
this is my data
i want to predict values of each till 2030(very least)
i alreay achieved 93% accuracy with gradient boosting for temp
but rest it's not getting any above 60%
@zealous hollow but can you describe in a few words more or less what this (rfe) does?
why are many pytorch models not exported to onnx?
btw can you resend that research paper
shouldn't this help in getting rid of the torch dependency?
hm?
someone sent a research paper link here just now related to my work and it got deleted by bot
who is "someone"? I can check why the message was removed, but I have to know the user ID for the author.
i don't know myself 💀
Ghosty, is that data set public? Was gonna run arima on it on my end. Always fun to try new data.
nah
but you can use it
||bro please do, it will save me a lot of time||
🤣
make sure to save your results of testing
np
wow, that's one heckuva first search
The first layer of my fully connected layer is 2x the output of my final model's layer. What am I doing wrong here? ```py
if accuracy is not higher, and not changing epochs or batch size
more CNN Layers, more nodes in layers
'Conv2d, BatchNorm2d, and ReLU.'
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.model = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(16),
            nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(32),
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(64))
        self.classifier = nn.Sequential(
            nn.Linear(64 * 128 * 128, out_features=256),
            nn.ReLU(),
            nn.Linear(in_features=256, out_features=10)
        )

    def forward(self, x):
        x = self.model(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x
```
basically my first linear layer needs to be divided by 2, but I shouldn't hard code that. What am I getting wrong about the input parameters?
can you paste the error?
mat1 and mat2 shapes cannot be multiplied (32x65536 and 1048576x256)
okay
the shapes are wrong
self.classifier = nn.Sequential(
nn.Linear(65536, out_features=256),
nn.ReLU(),
nn.Linear(in_features=256, out_features=10)
)
this should fix
right. but I shouldn't hardcode 65536 - I should reuse the height * w of the previous output layer. I'm wondering what the numbers are, as I thought the output of the previous layer was 64, 128, 128 (which is incorrect)
yeah, why is it 32 * 32?
But applied to this: nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
nn.ReLU(),
nn.BatchNorm2d(64). Wouldn't the output be 64*64?
the image size is 1x32x32?
yes
okay in the formula
W = 32 (Image size)
F = 3 ( Kernel)
P = 1 (Padding)
S = 1 (Strides)
so
so the width and height of the image does not change throughout the network?
just the feature maps at each convolution?
as soon as you have the 64
referring to out channels
64 * 32 * 32
of the previous layer
the calc is the same
@maiden wadi so are feature maps kind of arbitrarily chosen? I was getting confused because I was using that as W to compute the feature maps at the next stage (which somehow worked)
you were doing well. In this case the shape doesn't change because of the values of the kernel and padding
(32 - 3 + 2 * 1 ) / 1 + 1 = 32
but imagine
we have a kernel size of 4x4
(32 - 4 + 2 * 1 ) / 1 + 1 = 31
now the value decreases
so each layer will decrease it by 1
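The arithmetic above is the usual convolution output-size formula, floor((W - F + 2P) / S) + 1. A minimal sketch in plain Python, reusing the W, F, P, S names from the messages above (the helper name `conv_out` is made up for illustration, it is not a PyTorch function):

```python
def conv_out(w, f, p=0, s=1):
    """Spatial output size of a convolution: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

same_size = conv_out(32, f=3, p=1, s=1)  # 3x3 kernel, padding 1: size is preserved
shrunk = conv_out(32, f=4, p=1, s=1)     # 4x4 kernel, padding 1: loses one pixel

# Three stacked 3x3 / padding-1 convolutions keep a 32x32 input at 32x32,
# so the flattened size depends only on the final channel count (64 here).
size = 32
for _ in range(3):
    size = conv_out(size, f=3, p=1)
flattened = 64 * size * size
print(same_size, shrunk, flattened)
```

Computing the `nn.Linear` input size this way (channels times the traced height and width) is one way to avoid hardcoding 65536.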
how are the increasing feature map values determined?
There isn't really one correct way of choosing the number of features
There is a bit of intuition in it: you expect earlier layers to pick up smaller features (with a smaller receptive field), so probably fewer of them, and later layers to combine them into more complex features, of which you expect more.
But no set rule.
This translates into later layers having more channels
got this error
i have 32 gbs of ram
and i just saw python take 20 gb of ram 🤣
Does anyone know why the flattened features are 10368? Shouldn't they be 3200? My input dims are 3, 160, 160: ```py
self.model = nn.Sequential(
nn.Conv2d(3, 32, 3, 1),
nn.ReLU(),
nn.MaxPool2d(),
# 32, 80, 80
nn.Conv2d(32, 64, 3, 1),
nn.ReLU(),
# 64, 40, 40
nn.MaxPool2d(),
# 32, 40, 40
nn.Conv2d(64, 32, 3, 1),
nn.ReLU(),
# 32, 20, 20
nn.MaxPool2d(),
# 32, 20, 20
nn.Conv2d(32, 32, 3, 1),
nn.ReLU(),
# 32, 10, 10
nn.MaxPool2d())
self.classifier = nn.Sequential(
nn.Linear(10368, 2048),
nn.Linear(2048, 128),
nn.Linear(128, n_classes))```
The discrepancy in flattened features arises from not accounting for the final Conv2d layer's output dimensions before the classifier; specifically, the dimensions after the last MaxPool2d layer are 32x8x8, leading to 8192 flattened features instead of 10368.
why would that equate to 8192
shouldn't the flattened input from 32 * 8 * 8 = 2048?
why would they be 32 * 8 * 8
don't you flatten the feature maps * h * w
if the final layer after max pool is 32 * 8 * 8
what features do you care about? if output, then only the final layer, right?
Idk I thought that was just the process
pass the input through the conv layers then get the flattened output and use as FCL input
well, if you're classifying something, it just uses 3 linear layers. if you are training it, then it is nn.Conv2d(32, 32, 3, 1),
Your intermediate dimensions seem wrong, how did you compute them?
If you don't set a stride, the dimensions are not divided; each conv just shrinks them by (kernel size - 1) * dilation, offset by 2 * padding if there is any
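To make that concrete, here is a rough sketch that traces one spatial dimension through four Conv2d(kernel 3, stride 1, no padding) + MaxPool2d blocks. The pooling kernel size of 2 is an assumption (the snippet above calls `nn.MaxPool2d()` without arguments), and `conv_out` / `trace` are made-up helper names:

```python
def conv_out(size, kernel, stride=1, padding=0, dilation=1):
    """Output size along one dimension for a conv or pooling layer."""
    return (size + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

def trace(size, n_blocks=4, pool=2):
    """Size after each Conv2d(k=3, s=1, p=0) + MaxPool2d(pool) block."""
    sizes = []
    for _ in range(n_blocks):
        size = conv_out(size, kernel=3)                  # unpadded 3x3 conv: size - 2
        size = conv_out(size, kernel=pool, stride=pool)  # pooling: floor(size / pool)
        sizes.append(size)
    return sizes

sizes_160 = trace(160)
flattened_160 = 32 * sizes_160[-1] ** 2
print(sizes_160, flattened_160)
```

Under these assumptions a 160x160 input ends at 8x8, i.e. 32 * 8 * 8 = 2048 features. The reported 10368 equals 32 * 18 * 18; in this sketch 18 is the size after only three blocks from 160, or after four blocks from a 320x320 input, but the snippet alone doesn't say which (if either) applies.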
.parquet
My goal is to know how each feature was created, what were the inputs that were use for these feature creation
Are you looking for something like this? https://github.com/qchenevier/pandas-pipeline-graphviz doesn't seem well-maintained unfortunately, it's apparently a built-in feature in dask
yes, thank you
Im doing data transformations on an excel file. Now I wanna test a function that cleans the top off an excel file (the file sorta looks like this)
My company name. Some Additional Info Some Other stuff
Space
Space
Additional Stuff
Column Title 1 Column Title 2
Column Value 1 Column Value 2
Just wondering if for my pytest dataframe fixture, would it be ok to make it read from an excel file instead of making this manually in the dataframe? Or is communicating with an external source strictly bad for testing
i have a bunch of rna sequences (and their secondary structures) and their corresponding energy values and im trying to find a way to identify features (patterns in their structures or sequences) common between samples of similar energy values. would be super super grateful if anyone could recommend me algorithms to look into using - im assuming this would be an unsupervised learning project and i only have experience with supervised stuff, but im looking into pca right now and not sure if thatd be useful. should i be looking into something else?
intuitively im imagining it as like a clustering + feature extraction problem where i have a bunch of dots, each representing an rna sample, and then the axis represents energy so the dots could be clustered in energy value similarity and then within each energy value cluster patterns/relationships could be found between the sequences and structures of the samples. but not sure if an ml algo exists to handle this and pca seems like not what im looking for because the axes would be principal components and not energy... pls help if u have any ideas!
again would really really appreciate any ideas for how to go about doing this or validation on whether pca is the way to go
I'm having trouble understanding this ensemble learning boosting equation in the georgia tech course for machine learning. What does Zt represent in the equation?
Here is the link to the playlist, the explanation of the algorithm goes on for 4-5 videos.
https://www.youtube.com/watch?v=ooxQS5-Grgc&list=PLPhC147aCdDF_9RWFadPcZomRB2tjAhQ9&index=14
Really appreciate if someone can help me out. I'm a self learner and don't have access to a teacher so discord is my only way to ask questions
Is knowing stuff like Agile (Scrum, Kanban) and/or Jira necessary for data scientists/analysts? What tools do you use?
ads are not allowed here unless they have been previously approved. please remove your post if you haven't obtained permission to post this.
necessary as in to get a job?
no, those are things you pick up on the job.
necessary as in to improve your workflow?
not really, they are just ways to do project management, i feel like this is just a matter of the organisation's preference
Properly applying the principles of the agile manifesto is of benefit to nearly every job
But it's improperly applied more often than not, so it's pretty toxic in reality
Hey folks, for my summer vacation I took up ML as I'd like to work in the field in the future (or something close to it, at least)
Learning about the fundamental theory of neural networks was fascinating but I find the programming a bit... uninspired? I just find myself simply copy-pasting everything which feels pretty lame, regardless of how cool the outcome is.
Basically, I'd like to know if this is a common issue, and how can I make the coding process a little more creative? I feel like I'm lacking vision regarding what's exactly out there.
Hopefully most of what I'm saying makes sense, if not @ me and I'll clarify
Thanks in advance!
What kind of projects have you worked with so far?
While using pandas, should I write in a method-chaining style, or is it a recipe for disaster?
Try picking some project (mnist classifier) and try creating the solution from scratch. Sure look at how other people have done it, but don't look up a tutorial
Take the tensorflow example, and then start by implementing some steps. (2d convolution, forward pass etc)
Yo guys im trying to build a posture detector app, from what ive found out, ill need openCV + either tensorflow OR pytorch, what would you guys recommend? i need some kind of guidance on this
I'm really sorry for the late reply.
I only did a couple of vision models, where the code was pretty much entirely just this:
https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
With a few tweaks here and there.
I didn't realize just how much of the programming would just be "behind the scenes magic". It feels like everything has been implemented for me.
What am I missing here?
Yeah, with modern high level libs/apis for ML it is pretty much all implemented for you. I'd say there are two branches here to go down, either you should pick a project that doesn't have a fleshed out tutorial. Maybe even something you have to collect your own dataset for, and do some experimentation implementing your knowledge of theory on a sort of novel project. That will lead you to do a lot of experimentation with activation functions, architectures, optimizers, hyperparameters, maybe even push you to researching other types of models such as RNNs.
I also know you've been working on the math, another option is to go the other direction and try to implement all of the low level stuff yourself. For this I would recommend using a reputable dataset like you have been, and a task that requires a small/shallow/simple model like classifying MNIST handwritten digits.
Hey id appreciate if anyone could help
and a task that requires a small/shallow/simple model like classifying MNIST handwritten digits.
I actually did try implementing this but I was a tad bit overwhelmed. I later implemented it with Pytorch with, like, 25 lines of code (Insanity!)
I am a lot more interested in the "under the hood stuff" than the actual implementation of things. So I suppose I'm a little more interested in the math-ier side of things and actually understanding what's going on.
On a slightly different note, I can't help but wonder what people who work in the field actually do? I can't imagine they're just using Pytorch all day long.
Why not?
Apparently it's a highly specialized field that requires master's and PhD degrees in order to be* qualified to work in the field
It seems like Pytorch has abstracted away so many layers it's almost unreal. I'm probably missing something though
you kind of need all 3: how to work with code, understanding the math (not just linear algebra, lots of probability and stats to learn too), and getting experience-based intuition for working with data and models generally
how you proceed through the very long process of developing all 3 of those things is up to you and depends on your immediate interests
i do data science professionally and i have a fairly "light" masters in quantitative social science. i've had to go back and re-learn several parts of the math that i didn't learn well or didn't learn completely enough the first time around. but even before i did that, i knew enough to fit models with pytorch. i just didn't have a really solid grasp of things enough to develop more advanced customized solutions to problems i had.
if you're doing NN stuff you're probably using pytorch on a regular basis. but plenty of people are very productive in data-scientist-like jobs and generally don't need NNs on a regular basis. it depends a lot on what you specialize in and/or what your particular company/industry needs.
guys please 😭
I thought Data Science merely uses ML as a tool and is not strictly about ML? Or have I missed what you're trying to say?
Also, regarding this - If I had to put my current ambitions into words as accurately as possible I'd say "Whatever it's like to work as an AI(*) Engineer/Scientist/Whatever in the industry, I'd like to experience that"
Unfortunately I have no knowledge in stats nor probabilities yet, so if that's impossible I'll probably put that ambition on hold. I would like to know if that's actually the case though and if I could do something relevant without any knowledge in stats
(*) I don’t know if this is the generalization I’m looking for
Is knowing ORM necessary, or is SQL enough?
I'm pretty sure data/ML engineering doesn't require as much statistics, but I'm not an expert. In general, you don't need a very deep understanding of statistics but it's highly recommended you at least know the fundamentals
Oh, I was told the opposite - that most of the work involved is actually statistics. Curious
Well, it is but it's not that hard? 
Oh, that would be cool if that’s the case 🙂
I wouldn't recommend it over an engineering position if you're struggling with stats and probabilities, but if it's just that you haven't studied it yet and you have a good sense of stats then there shouldn't be any issue
I don’t really believe you’re inherently “bad” at something. From my experience there’s a strong correlation between grades and interest
That’s besides the point though
I’m really just trying to understand how to experience “the real deal” of AI/ML engineering or whatever
Data science doesn't really mean anything
Because each company defines it differently
This isn't standard across pretty much anything either. You could make an entire career out of using existing open source repos for ML without ever needing to understand how exactly it works or modifying underlying architectures. It all depends on what specifically you're working on
Surely there's some foundation necessary to work in the field, though
And some common experience all of them share
I've taught middle schoolers how to implement image classification models
I don't understand what this has to do with the question at hand though
It means middle schoolers can work as data scientists
I... have to call BS on that, I'm sorry
I believe it's because you haven't posed an answerable question. There is no "standard" foundation necessary, as different companies define the roles and the requirements for those roles differently
My point in mentioning teaching middle schoolers classification is if someone needed a very basic classifier to be implemented across a lot of different applications with pretty low accuracy, they would be qualified to do it
Okay, I suppose there's no "standard" for being a CEO either - but I'm not asking for an absolute standard but rather a generalisation of said standard
That generalisation appears to be higher education. I'm fairly sure there's a common derivative at the end of the tunnel
Also, just for the record - @ Tonabrix seemed to have an idea of what I'm asking, and indeed suggested a couple of "paths" or whatever one might call them
I'll just note that I don't have a job, so.. don't listen to me. But if you want to work with AI, higher mathematics and statistics definitely will not be extra. I've been regretting that I didn't try to actually remember what I studied at university, and I haven't even gone very far down the rabbit hole of DS/ML/AI
Can you be a DS/MLE without strong fundamentals of math and stats? Yes, there are enough abstractions and good libraries to get by. But you'll eventually have to learn all that stuff, probably :3
Yeah this has been made clear to me, I've actually been somewhat fond of the mathematics I've learned so far so I'm looking to apply my knowledge
The question has sort of been missed throughout the entire conversation unfortunately, it started out fairly simple and diverged to other topics I'm afraid
You'll need at least linear algebra or stats, both is better
I also understand what you're asking, but I don't believe you're asking the right question. It's pretty straightforward to figure out what is necessary to become an ML person "in general", and what they do. Just go to indeed, search Machine Learning Engineering and do your best to find common threads. Past that, everyone is just going to give you responses based on their personal experience
I... thank you but that really wasn't the question - I have a rough idea what mathematical background is required. I'm currently in university lol
Personal experience is perfect
What did you mean by this then?
Sorry this was directed at Rose, I forgot to tag
I'd like to emphasize that I'm not looking for an objective answer in case that wasn't clear
If there's somebody here that can share their own experience from the job market/academia that'd be grand. That's sort of what I'm looking for
My main tasks as a data scientist have been (on different projects, I'm a consultant) :
- data cleaning/analysis
- feature engineering
- being up-to-date on recent models/advances
- finding ways to exploit available data
- designing models
- fine-tuning models (can be deep learning models or boosting ones)
- develop an interface for POC
- deploy models/apps (most of the time in a cloud environment)
- communication to stakeholders
I now do data science for a living, and have my own business that provides ML solutions to clients. However my situation will be vastly different from yours because I started as a mechanical engineer that got interested in applying the statistics electives I took in uni after working on measurement systems. I eventually worked my way through jobs focused more and more on data/stats/ML and now I'm here. In my experience, your focus should be on what makes you want to investigate and try things. If you want a list of things you should be able to do, I know a lot of what Rose said as well, but I didn't know how to do a number of them when I got what I'd consider my first true job in this field. Additionally, a lot of what I do in my actual job involves estimating risk (and therefore stats), but if you're working on keypoint tracking to put digital butterflies on people in a museum, you may never touch the kind of risk analysis I do
I also have two friends that work in AI for the defense sector and high precision optics respectively (I'm in healthcare). They focus on WILDLY different things than me once the "standard" stuff above is done or needs to be customized for a specific purpose
Thank you both for the detailed description. This does give me some insights regarding where my question lacks. If you don't mind though, I'd like to try and explain my current situation and perhaps just seek "advice", instead of asking something concrete, regarding where I should continue from here
I've picked up ML not long ago, and learning about the fundamental theory of the field absolutely fascinated me. Unfortunately when I got to work with Pytorch I've discovered a lot of those processes have been simplified to lines like
outputs = net(inputs)
loss = criterion(outputs, labels)
loss.backward()
Which on one hand, is incredible, but on the other hand this made coding not fun at all, as everything feels like a black box.
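For a rough idea of what those three lines hide, here is a from-scratch sketch for a single sigmoid neuron with a mean-squared-error loss, in plain Python. It is a toy under stated assumptions (one weight, one bias, made-up data and learning rate), not how PyTorch implements autograd:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, b, xs):
    # Plays the role of "outputs = net(inputs)" for one sigmoid neuron.
    return [sigmoid(w * x + b) for x in xs]

def mse(outputs, labels):
    # Plays the role of "loss = criterion(outputs, labels)".
    return sum((y - a) ** 2 for a, y in zip(outputs, labels)) / len(labels)

def backward(w, b, xs, labels):
    # Plays the role of "loss.backward()": chain rule dC/da * da/dz * dz/dw.
    m = len(xs)
    grad_w = grad_b = 0.0
    for x, y in zip(xs, labels):
        a = sigmoid(w * x + b)
        dz = -2.0 * (y - a) * a * (1.0 - a) / m
        grad_w += dz * x
        grad_b += dz
    return grad_w, grad_b

# Made-up toy data: negative inputs are class 0, positive inputs are class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
labels = [0.0, 0.0, 1.0, 1.0]
w, b, lr = 0.0, 0.0, 0.5
initial_loss = mse(forward(w, b, xs), labels)
for _ in range(200):
    gw, gb = backward(w, b, xs, labels)
    w -= lr * gw  # plain gradient descent step
    b -= lr * gb
final_loss = mse(forward(w, b, xs), labels)
print(initial_loss, final_loss)
```

Writing the gradient by hand like this, then checking it against a finite-difference estimate, is one way to make `loss.backward()` feel less like a black box.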
Essentially, I like to know what's going on but I'm not sure where to continue from here. "Read a book" seems like the obvious answer here but most of the books in my arsenal are study books which require knowledge I don't yet have.
Essentially, I'm not sure where to go, or what I'm even looking for. I'm hoping this vague description of my experience should suffice for you folks to understand where my interest is and what it is I'm trying to do
If I haven't mentioned already, I finished my calculus and linear algebra courses - no knowledge in probability/stats as those come in the following semesters
hi guys i need ur help. if i have a line chart with the x axis from 2020 to 2022, and the y axis being sale quantity, and i have different lines representing each store which are the legends, is that bivariate or univariate
I got a lot of my knowledge of ML from Ng's free introductory course on coursera. That was years ago, back when that course was entirely in Octave, so I don't know how the modern version (which is in Python) compares, but the course had many assignments for implementing ML primitives like backpropagation, gradient descent, support vector machines, etc. Maybe that'll make them feel less like blackboxes for you (if the modern version even has these exercises still, of course).
quite a basic question but wld like to clarify for my school assignments
Ng's course was(is?) incredible
I suppose the question I have is why do you want to learn more about how these processes exactly work? Are you hoping to go into research and/or work independently to create the next best pytorch or the most accurate/fastest model architecture X? Or is it more related to confidence in what you're developing? Pytorch is open source, so if you don't like not knowing how something works, you should be able to find exactly how these "black boxes" work. At one point I also felt like I needed to understand how absolutely every detail of everything in stats worked, but after a while I personally found it to be far more practical and enjoyable when encountering something new to learn just enough to be able to implement it while paying attention to the assumptions/uncertainties with the model. That way I could see how it worked, and then determine how much more I needed to learn in order to achieve what I was trying to do.
From what I've seen most of these courses (Coursera, Udemy, Google, and the like) mostly focus on the "create something" side of things rather than the theory (foundation?) of it all. Then again what you're describing sounds like what I'm looking for which is a little strange. Maybe I haven't looked hard enough?
I haven't looked at that course nowadays but back when I did it, it had lectures on a lot of the theory, and the implementation tasks were mostly of little parts.
In my experience, coursera has very fundamental courses. You should probably sign up and try some since it's free
"Why?" is a bit of a hard question because I'm at the beginning of my road of course. When I envision myself working in the field I'm thinking about the "Big dreamy models", like chatGPT or an automated car or whatever.
Of course this is all a "postcard description" but I'm not sure how else to put it. The dream project would probably be an attempt at a general purpose AI? Or a model capable of automating basic tasks that millions of people have as their job nowadays?
Whatever that entails, ig. Honestly a part of me just tells me to wait out and take the university courses I'll inevitably have to take anyway, but I have some time to burn and it'd be lovely to spend it on something I gravitate towards
Completely unrelated but isn't Coursera an extremely expensive monthly subscription service?
Wait, really?
There is a small link when you click to enroll on a course
Must've missed it, I'll check again. Thank you
yeah, enrolling for free locks you out of some assignments but it's usually not very different
im getting quite confused and struggling to distinguish bivariate and univariate analyses, cld anyone lend me a helping hand? 🥲
tried finding information online, couldn't really find any useful sources
As someone who spent a bunch of time trying to learn fundamentals before jumping into the real world, only to realize once I got there that I actually wanted to do something totally different (mechanical engineering -> data sci + ML), I would HIGHLY suggest you spend some time trying to answer that question. It's also why I keep asking the questions I do and responding in the way I do. I would suggest figuring out what you want to work on before going in deep on the how. Some understanding is necessary to pick a direction, but not much. Whether it be the nitty gritty math, or implementing models for time series analysis, image analysis, NLP, etc. I would pick one (or many) things that interest you and try to figure out how to do it yourself, even if it doesn't work well. If you really like what you picked, and are like me, you'll find motivation in figuring out how to make it better. From there, you have the context and the reason to want to dig into the "black box" and find out what you need to learn in order to do what you want to do.
Or, you could go deep into a field and discover that you really love it via learning the technical details first and everything's great! Though, in my case I just hope some day I find a use for the fluid dynamics of viscous plastic extrusion that's still taking up space in my brain
Admittedly, I just enjoy learning about the abstractions and the math behind things at the moment :/ So I figured whatever involves "that", is what I'd like to try for now
I completely agree with what you're saying, about trying out everything, but I have a hunch that a lot of "things" are sort of "out of my reach" in terms of knowledge (Please do correct me if I'm wrong on this, but NLP seems crazy complicated for example).
That might be kind of what I'm poking at though - how do I even go about "trying everything"?
I'd like to emphasize that this is supposed to be more "fun" than actually practical. Although, if I could spare myself some future headache when I'll actually have to learn about this in university that'd be grand
Funny you should say that, I got into CS because I had to take a programming course as an entrance test to get accepted into a Psychology major
So maybe I won't even get to work in ML in the future and will shift to something else entirely, who knows? ^^;
A psych major doing ML would be good for the field
If you want the way I did it, start with something that irritates you, and see if you think it could be predicted (even partially) and/or automated. From there, look into how other people did it, and how you might do it yourself
I've got to go, but feel free to DM me if you still have more questions
Maybe later down the line, for now I’m enjoying CS 🙂
Cheers! Take care
I meant the ML field
Yeah I got that haha
Based on what?
I remember seeing some nice site that trained some model (ResNet?) on several cloud providers and provided a table with the costs, but I don't have a link
(I think AWS was ahead? not sure the results would even apply a year or two later, though)
or what is the most popular way, I don't think people train models on personal computers.
AWS is probably the most popular one, followed by Azure
under what distro?
yes
amazon linux is the default for most
they have their own flavor of distro
for aws
its so their services can be compatible (SageMaker, ECR, Lambda, etc.)
dont know about azure or gcp
Out of curiosity, why are you interested in the distro used by cloud services?
What?
hello #data-science-and-ml, i got a bunch of different datasets, some are weekly data, some daily, some monthly. i need to group them all together, preferably like i round down the weekly rows to the monthly rows, daily rows to monthly etc... so should i just write up a script that finds the month that each week/day occurs in?
how would i do that with pandas?
well its good to know before trying one out.
Convert to datetime, you can directly access the month/week/etc. If it's in datetime format https://pandas.pydata.org/docs/user_guide/timeseries.html
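As a rough sketch of that suggestion: convert the date column with `pd.to_datetime`, round each row down to its month with `.dt.to_period("M")`, and aggregate. The two frames and their column names below are made up for illustration, and summing is just one possible aggregation:

```python
import pandas as pd

# Hypothetical daily and weekly datasets; column names are invented.
daily = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=59, freq="D"),
    "visits": 1,
})
weekly = pd.DataFrame({
    "date": ["2023-01-02", "2023-01-09", "2023-02-06"],
    "sales": [10, 20, 5],
})

def to_monthly(df, date_col, value_col):
    """Round each row's date down to its month and sum the values per month."""
    month = pd.to_datetime(df[date_col]).dt.to_period("M")
    return df.groupby(month)[value_col].sum()

# Both results share a monthly PeriodIndex, so they line up column by column.
monthly = pd.concat(
    [to_monthly(daily, "date", "visits"), to_monthly(weekly, "date", "sales")],
    axis=1,
)
print(monthly)
```

Whether to sum, average, or take the last value per month depends on what each dataset measures.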
hi,i know this is pretty simple but how do i take an average of a dataframe column, I tried np.average but i get errors, ive also tried .mean(). with similar errors
you would do df['the_column'].mean(). if you tried that and you "got an error", please copy and paste the whole error message into this chat.
(In general, you should always give the error message for the error you need help with. if you just say that you got an error, we have no way of knowing what it is until you tell us.)
What’s a good format on disk for time series data? (Basically wondering if there is something like parquet but optimized for time series?)
Parquet is pretty darn good for time series, imo. not sure I know anything better.
Hi @hot obsidian - How to call(execute) a function with 2 or more dataframe arguments? Sample code as below.. I would like to print the result of this function.
import pandas as pd
def department_highest_salary(emp: pd.DataFrame, dept: pd.DataFrame):
merged_data = pd.merge(emp, dept, left_on='DEPTNO', right_on='DEPTNO')
grouped = merged_data.groupby('ENAME')['SAL'].max().reset_index()
result = pd.merge(merged_data, grouped, how='inner', left_on=['DEPTNO', 'SAL'], right_on=['DEPTNO', 'SAL'])
result = result.rename(columns={'DEPTNO': 'Department', 'ENAME': 'Employee', 'SAL': 'Salary'})
return result[['Department', 'Employee', 'Salary']]
Not sure I understand the q. You're asking how to call a function with two arguments? That's just result=department_highest_salary(df1, df2)
Let me try, thanks!!
Hi @left tartan - Thanks for response. I did pass the arguments like below. However, i am getting a key error
result=department_highest_salary(emp, dept)
You'd have to share the full error plz.
I really don’t know how to wrap my head around it. I feel like time series data would best be represented in a 3 dimensional data frame. Think daily or houly metrics from thousands of data producers. I’d want to look in aggregate over time as well as drill into individual time series.
That's more where hive comes in tho, partitioning across... say... producer
I see.. So I can’t do this just in the storage layer? (I wanted to just use python with an on disk format.)
Hive partitioning is just a directory organization of parquet files, so yah, you can do it in the storage layer
fwiw, you dont need "hive" to read hive files. pyspark, duckdb, etc can read hive partitioned files.
Hi @edgy venturey Bobby - Just checking if you got the full error message as file?
Nope, just paste the text in discord
You can't upload files
KeyError Traceback (most recent call last)
D:\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_label_or_level_values(self, key, axis)
1838 values = self.axes[axis].get_level_values(key)._values
1839 else:
-> 1840 raise KeyError(key)
1841
1842 # Check for duplicates
KeyError: 'DEPTNO'
KeyError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_18972\3746871449.py in <module>
----> 1 result=department_highest_salary(emp, dept)
~\AppData\Local\Temp\ipykernel_18972\3858326592.py in department_highest_salary(emp, dept)
5 merged_data = pd.merge(emp, dept, left_on='DEPTNO', right_on='DEPTNO') # join two data frames based on
6 grouped = merged_data.groupby('ENAME')['SAL'].max().reset_index()
----> 7 Merge_group = pd.merge(merged_data, grouped, how='inner', left_on=['DEPTNO', 'SAL'], right_on=['DEPTNO', 'SAL'])
8 Merge_group= Merge_group.rename(columns={'DEPTNO': 'Department', 'ENAME': 'Employee', 'SAL': 'Salary'})
9 return Merge_group[['Department', 'Employee', 'Salary']]
D:\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
105 validate: str | None = None,
106 ) -> DataFrame:
--> 107 op = _MergeOperation(
108 left,
109 right,
D:\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
698 self.right_join_keys,
699 self.join_names,
--> 700 ) = self._get_merge_keys()
701
702 # validate the merge keys dtypes. We may need to coerce
D:\Anaconda3\lib\site-packages\pandas\core\reshape\merge.py in _get_merge_keys(self)
1095 if not is_rkey(rk):
1096 if rk is not None:
-> 1097 right_keys.append(right._get_label_or_level_values(rk))
1098 else:
1099 # work-around for merge_asof(right_index=True)
Hi everybody! I'm looking for someone who uses arcgis and langchain to provide feedback on my pull request
Hi @left tartan - Please ignore.. I figured out, an error in code... fixed myself...thanks for response..
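The thread doesn't say what the fix was, but the traceback hints at one likely cause: `grouped` is built with `groupby('ENAME')`, so it only has `ENAME` and `SAL` columns, and the second merge then can't find `DEPTNO` on the right side. Below is a sketch of one possible fix that groups by `DEPTNO` instead. The sample `emp`/`dept` rows and the `DNAME` column are invented for illustration.

```python
import pandas as pd


def department_highest_salary(emp: pd.DataFrame, dept: pd.DataFrame) -> pd.DataFrame:
    # Join employees to departments on the shared key.
    merged = pd.merge(emp, dept, on="DEPTNO")
    # Group by department (not by employee) so the result keeps a DEPTNO column.
    top = merged.groupby("DEPTNO")["SAL"].max().reset_index()
    # Keep only the employees whose salary equals their department's maximum.
    result = pd.merge(merged, top, how="inner", on=["DEPTNO", "SAL"])
    result = result.rename(
        columns={"DEPTNO": "Department", "ENAME": "Employee", "SAL": "Salary"}
    )
    return result[["Department", "Employee", "Salary"]]


# Hypothetical sample data, not the asker's real tables.
emp = pd.DataFrame(
    {
        "ENAME": ["SMITH", "ALLEN", "KING", "FORD"],
        "SAL": [800, 1600, 5000, 3000],
        "DEPTNO": [20, 30, 10, 20],
    }
)
dept = pd.DataFrame(
    {"DEPTNO": [10, 20, 30], "DNAME": ["ACCOUNTING", "RESEARCH", "SALES"]}
)

result = department_highest_salary(emp, dept)
print(result)
```

Printing `grouped.columns` right before the failing merge is a quick way to confirm whether a key column is really missing.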
For Power BI, How do I create a dynamic index based on the current view? Like .reset_index() in python applied every time the table is updated.
Is there any way to get more free compute credits on Google Colab after using all 100 (?) hours?
And does the T4 GPU even take up credits?
Hello everyone, I want to use an open source LLM (like LlaMA 2) for text generation task. My prompt looks something like this:
Use the given question and context to generate a detailed,
authentic description about the machine. Make it sound as if you are a great salesman and are pitching this machine
to a potential buyer. Use good formatting and the description should not be too long (About 200 words only).
Try to make it as easy to read as possible. Most importantly, you absolutely must include all the information provided in the description
that you generate. Do not make up new information. It's a pre owned machine, therefore the description should not be like the launch of a new product.
Generate a description of the machine using the information provided under the Context.
Context:
categoryName: Post Press
subcategoryName: Saddle Stitcher
subsubcategoryName: Conveyor belt
manufacturerName: Monotype
Year: 0.0
MachineModelName: Boston Double Head Stitching
Location: Germany
Info: DOUBLE HEAD STITCHING MACHINE BOSTON
2 HEAD FLAT AND SADDLE STITCHING MACHINE
DOUBLE WIRE
SCENARIO: I'm currently running the ggml quantized versions of llama-2-7b and llama-2-13b locally (I can't use API-based models due to my company's data security concerns). The results this prompt generates are somewhat satisfactory as a starting point, but the problem is that it takes around 4-5 mins to generate the whole response (with 33.6 GB of RAM and an NVIDIA GeForce GTX 1080 Ti). Sometimes it just keeps running for 15 mins and doesn't generate anything.
QUESTION: I'm wondering if I can either speed up the inference somehow or downgrade my model, since the quantized llama-2-7b/13b might be overkill for this task. I want a model that gives satisfactory results while using the least amount of resources, since I need to run it on the company server. How do I go about narrowing down my model hunt for this task?
Hi there! Take a look at vLLM, it will (hopefully) give you acceptable performance
If you want to research more on your own this issue is called: inference speed up / tokens per second
On an M1 Mac you can reach 20-30 tokens per second without any effort on llama 2 13b 4-bit ggml
Oh okay okay. vLLM looks promising. I'll look into it. There's also exllama, do you think that could be helpful too?
Hey All,
I am not sure this is the right chat, but it is about LLMs.
So, I am trying to get https://github.com/mosaicml/llm-foundry running.
Short: I think I am there, the only problem I seem to be facing is that composer seems not to be using my conda env and thus I am not able to run the train example on this page.
Long:
I installed all deps et cetera and I am trying out the quick start example.
When I am at the training part, you have to run:"composer train/train.py \ et cetera"
This returns ModuleNotFoundError: No module named 'llmfoundry'. Which is interesting since when I open python and import this, it works.
When I was debugging I found that composer wants to use another python executable: /sw/arch/RHEL8/EB_production/2022/software/Python/3.10.4-GCCcore-11.3.0/bin/python
Is there a way to force to use the conda env python with all the required packages et cetera?
Is there a machine learning classifier algorithm that classifies points on a 2d plane using a vertical or horizontal line as a separator?
I'm trying to write an ensemble learning algorithm from scratch, and I need a simple classifier like this as my base learner
🤔 how important is it for a data scientist to know a framework, like flask or django? Or is that a mostly useless skill and not necessary at all?
This is basically just any classifier (like logistic regression) but it takes only the x coordinate (vertical line) or y coordinate (horizontal line) of the input points. @quartz wigeon
can you clarify? I'm quite new to machine learning
Not always useful. I've had to use Flask but I think I'm in the minority
So logistic regression predicts a line that separates the points in 2d space (with an orientation and position). If you flatten the points, i.e. take only the x coordinate, or the y coordinate, you can predict a point that separates points on either side of the point in 1d space. This is the same as predicting a horizontal/vertical line in 2d space that separates the points.
thanks for the tip! I'll check out logistic regression
The important part here is that you only use the x-coordinate or y-coordinate, as this forces you to predict a horizontal line or vertical line.
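For a from-scratch ensemble, the same "one coordinate only" idea can be sketched without logistic regression at all: a decision stump, which just searches for the best threshold on x or on y. This is a swapped-in technique, not the logistic regression suggested above, and the function names, the -1/+1 label convention and the toy points are assumptions made for illustration.

```python
def fit_stump(points, labels):
    """Find the axis-aligned line with the fewest training errors.

    Returns (axis, threshold, polarity): axis 0 means a vertical line
    (split on x), axis 1 a horizontal line (split on y). Points with
    coordinate > threshold are predicted as `polarity`, others as `-polarity`.
    """
    best = None
    for axis in (0, 1):
        values = sorted({p[axis] for p in points})
        # Candidates: one threshold below all points, plus midpoints between
        # consecutive distinct coordinate values.
        thresholds = [values[0] - 1.0] + [
            (a + b) / 2 for a, b in zip(values, values[1:])
        ]
        for t in thresholds:
            for polarity in (1, -1):
                errors = sum(
                    (polarity if p[axis] > t else -polarity) != y
                    for p, y in zip(points, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, axis, t, polarity)
    _, axis, t, polarity = best
    return axis, t, polarity


def predict_stump(stump, point):
    axis, t, polarity = stump
    return polarity if point[axis] > t else -polarity


# Toy data: the classes are separable by a vertical line between x=3 and x=6.
points = [(1, 5), (2, 1), (3, 4), (6, 2), (7, 5), (8, 1)]
labels = [-1, -1, -1, 1, 1, 1]

stump = fit_stump(points, labels)
print(stump)
print(predict_stump(stump, (5, 0)), predict_stump(stump, (4, 9)))
```

A boosting-style ensemble would typically count weighted errors instead of plain errors, so a per-sample weight argument is the natural next extension.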
I got my first job without knowing any of this
Nowadays I'm more interested in making data / AI products and not just making models so I learnt those by myself. It's definitely not a requirement though
just out of curiosity, what exactly did u end up learning?
I read MDN's documentation first and then learnt (some of) Django and did a project without any JS
Django is one of best documented projects so it's a good place to start
I was thinking to just learn js (svelte or react) and do everything there
Afterwards I progressively went towards JS, Typescript and so on
look at me, making plans for the distant future when I don't even have a job.. :/
Hoi, i'm really close to getting stable diffusion to run on steam deck in ubuntu 22.04, any idea how to fix these last remaining conflicts/issues?
Don't worry - it could be part of your "strategy" imo
In businesses notebooks and models (purely exploratory work) don't really mean too much unless you're a bonafide statistician. You need to be able to put it into production / work. Smaller companies don't have the budget to have a data engineer, data analyst, AI engineer, frontend dev, backend dev and a devops.
You can be a generalist and spread yourself a bit more thin, but do end-to-end work
Or you can be a specialist and pick out for instance NLP, Vision, Time series, ... or a business domain e.g., finance and do that really well.
Hi, anyone here familiar with the fiftyone module? I’m getting a ServiceListenTimeout error, i.e. fiftyone is failing to bind to a port while importing the module
Fyi hiring a data scientist is hard
At my company we interviewed 15 for junior and 0 made it through
So if you become skilled it should be "easy" to get a job depending on location
I suggest switching to poetry and/or pyenv
Exactly how skilled does one have to be? 👀
Subjective question, but is it because they were all that bad, or is it that the requirements and expectations are just very high for DS juniors?
I wonder if I would pass 🤔
I don't mind sharing the whole process in DMs and where applications failed to meet expectations
If it helps you
Guys, I am trying to compare 2 dataframes to verify which rows were changed, created and removed. Can anyone help?
Can someone help me write data into Excel faster? For 20K records, it takes around 25 seconds with pandas + xlsxwriter. With pyexcelerate it takes around 12 seconds, but pyexcelerate has limitations on formatting (Accounting format).
Hi anyone here who tried ibm watson ai to create a chatbot using python?
but I need to verify which values are different
ValueError: Can only compare identically-labeled (both index and columns) DataFrame objects
diff = df1.compare(df2, align_axis = 0)
Can someone please help me with this
I tried to import 'llama_index' in Jupyter and it shows the following error:
'If you use @root_validator with pre=False (the default) you MUST specify skip_on_failure=True. Note that @root_validator is deprecated and should be replaced with @model_validator.
Apparently Pydantic V2 has made some changes and it is showing this error .
Makes sense, compare would only work if both frames have identical row and column labels, with only values changed. My intuition is to make an outer join between the two and filter the rows with nans
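The outer-join idea can be sketched with `merge(..., indicator=True)`, which adds a `_merge` column saying which side each row came from. This assumes both frames share a unique key column (called `id` here, purely for illustration) and a single `value` column to compare.

```python
import pandas as pd

# Hypothetical before/after snapshots keyed by "id".
old = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
new = pd.DataFrame({"id": [2, 3, 4], "value": ["b", "x", "d"]})

merged = old.merge(
    new, on="id", how="outer", suffixes=("_old", "_new"), indicator=True
)

# Rows present on only one side were removed or created.
removed = merged[merged["_merge"] == "left_only"]
created = merged[merged["_merge"] == "right_only"]

# Rows present on both sides are "changed" if the values differ.
# Note: NaN != NaN is True, so missing values would need special handling.
both = merged[merged["_merge"] == "both"]
changed = both[both["value_old"] != both["value_new"]]

print("removed:", list(removed["id"]))
print("created:", list(created["id"]))
print("changed:", list(changed["id"]))
```

With more than one value column, the comparison line would have to be repeated or generalized per column.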
Hey guys, I'm doing a kaggle comp rn and I am using gradient boosted regressor model along with using iterative imputer to fill in missing values. My laptop apparently isnt performing this well and has been running for a few hours I think there's some problem with it. But if I gave someone the dataset and the code could you please run it for me? It would rlly help a lot with my chances of getting higher on the leaderboard
Excited to announce the initial release of VectorFlow, written in Python! VectorFlow is an open-source, high volume vector embedding pipeline.
Our pipeline is built to embed large volumes of data quickly and reliably. While embedding a handful of documents for Q&A is straightforward, the real challenge arises when ingesting gigabytes of unstructured data to leverage the full power of LLMs on top of your data.
With just a simple API request, you can effortlessly embed raw data and store the vectors in your vector database, eliminating the need for intricate cloud infrastructure setups.
🔗 Check out our Github repo and give us a star: https://lnkd.in/en6FhfN9
For all the innovators working with vector databases, we're eager to hear your insights, feedback, and ideas for the roadmap.
Demo can be viewed here: https://www.youtube.com/watch?v=aQOlOT14DaA
And our website is here, sign up for a free consultation: https://www.getvectorflow.com/
VectorFlow is an open source, high throughput, fault tolerant vector embedding pipeline. With a simple API request, you can send raw data that will be embedded and stored in any vector database or returned back to you.
VectorFlow: Open source, high-throughput, fault-tolerant vector embedding pipeline. Simple API endpoint that ingests large volumes of raw data, processes, and stores or returns the vectors quickly and reliably.
@iron basalt good evening mate
I had a question about the nneu source material you suggested to me prior
When we multiply the "slopes" by the error, we are reducing the error of high confidence predictions
What does this statement mean
A machine learning craftsmanship blog.
You can run your code on kaggle for free, any time and as much as you want (for up to 12h sessions, which are more like 11h, but that's still great)
Google how to do it so u don't lose ur progress 👍
It seems to be a redundant statement, ignore it.
The following sentences are what they are getting at.
This post is mostly for the code, if you want an understanding of the mathematics there are other places to look.
hey guys, i need some help clarifying a concept. is anyone available to help me at the moment? 🙂
i have 3 variables, date, store, sales amount. i put x axis as date, y axis as sales amount, and store as legends.
is this univariate, bivariate, multivariate, or neither of them?
Multivariate if you use all three store timeseries for a singular goal
E.g. a timeseries with multiple features (2+) for each timestamp
Hello, does anyone know why tensorflow gets stuck on import? I've been waiting and nothing happens on the console (ignore the typo)
I'm using a tensorflow 2 build without AVX that I downloaded from here
heres quick spec of my system:
OS: Linux Mint 20.1 x86_64
Host: 80MH Lenovo ideapad 100-14IBY
Kernel: 5.4.0-58-generic
CPU: Intel Celeron N2840 (2) @ 2.582GHz
GPU: Intel Atom Processor Z36xxx/Z37xxx Series Graphics & Di
Memory: 1227MiB / 1869MiB
I'm sure the processor/ram is not a problem when just importing the module (right?)
try initializing python with verbose python -vvv then import again, see what it says.
hello
I'm trying to convert a .py file to .exe
When I run the .py file it runs fine, but when I run the .exe file it doesn't run.
help me
Thank youuuuu
there's so much output and i don't know how to read it
what are you using to build the .exe?
is it still hanging? whats the last line of the output?
I use PyInstaller to convert the .py file to .exe
When I run the .exe, it doesn't run.
But when I run the .py file, it displays the word "hello"
I also used the "Auto py to exe" tool to convert it, but the result is similar
this
# code object from '/home/alfarizi/Documents/machine_learning_flask/venv/lib/python3.8/site-packages/tensorflow/lite/experimental/microfrontend/ops/__pycache__/gen_audio_microfrontend_op.cpython-38.pyc'
import 'tensorflow.lite.experimental.microfrontend.ops.gen_audio_microfrontend_op' # <_frozen_importlib_external.SourceFileLoader object at 0x7fa0632c0a30>
here: pyinstaller --onefile [--windowed] yourfile.py
it's a broad enough field that you can get through junior level (or could in the past, before it got super competitive) with expertise or deep knowledge only in 1 or 2 areas + being willing and able to learn on the job.
as for this, you really should take stats and probability classes while you're in school. it's harder to self-study that material than to learn it in an organized controlled environment
sql is enough, you will likely not need or even want to use an ORM for most data science work
a lot of data science work requires you to have a good understanding of undergrad-level statistics, but you aren't usually "doing" statistics in the sense of running t-tests all day
Don't worry, I couldn't skip them even if I wanted to 🙂
As a matter of fact, I'll probably have to take a couple of advanced classes related to statistics so I'll probably have that section covered
Err, I'm not sure what you're trying to say
Basically, you need to know the theory, but you won't be practicing the statistics you'll learn at university?
It depends on what statistics.
Machine learning is technically a subset of statistics that was pioneered by computer scientists
good. you basically need all of this:
- calculus (pre-req for probability mostly), ideally also multivariate
- linear algebra
- probability
- statistics
that's a lot of new material to learn and intuition/understanding to develop. if you think you understand the basics after a couple of lectures, you didn't spend long enough time pondering them. take the advanced classes, but don't lose sight of the fact that it all builds sequentially, and you can't really apply any of the advanced material without really understanding the fundamentals.
We are not a job recruitment board. Please do not post job ads in the future.
From that perspective it makes no sense to limit yourself to just "machine learning", techniques from "traditional" statistics are valuable
in a lot of statistics classes, students are taught certain procedures or recipes, which are often not directly applicable in the real world. however the principles underlying them are very useful and sometimes necessary to do your job well.
Take as many of those classes as possible because they'll cover techniques you may (or may not) use in the future
Well yeah, I understand that it's an entire sub-field (Like "Calculus") in math. I'm assuming there is some broad idea of what "Intro to statistics" teach though?
Intro to stats typically teaches probability theory (make no mistake, this is NOT part of statistics, it's a prereq), descriptive stats and inferential stats
People hate probability theory and get turned off statistics as a whole
it's unfortunate because probability theory is more useful than traditional statistics in some fields, e.g. reasoning about rare events and uncertain outcomes even when you don't need to "fit a model"
Yeah I do have a solid foundation with Calculus (Real Analysis, apparently) and Linear Algebra
Understanding the fundamental theory is, of course, ideal - but not always possible under the extremely hectic curriculum of university
Descriptive statistics (summarizing data) is very important for machine learning as it relates closely to experimental data analysis
indeed, you're almost certainly going to keep studying things and learning things for years. however if you have the ability to choose priorities at all, hopefully this helps you decide what to focus on.
Oh of course. I don't think I'll ever be doing Delta-Epsilon proofs again but the idea behind them is rather crucial to understanding more complicated things like Gradient Descent I'd reckon
heh, if there's ever a class that i don't think has been even remotely useful for me, it's real analysis
Yeah but... my grades, y'know?
Inferential statistics is at the heart of machine learning. I see too many practitioners (even people at work!) focussing on getting a very low MSE / high accuracy when the point is actually getting unbiased estimates of performance
The whole notion of unbiased estimates is very rarely covered in ML classes (we only briefly spoke about it in my entire AI masters) but it's a big part of statistics
(however if you get into numerical computing then yes real analysis i believe becomes very important)
I have intro to probability as a prerequisite class to Intro to statistics, so that shouldn't be a problem
Really? It's my absolute favourite actually
Almost entirely useless in practice, but it really teaches you to "think" if that makes sense
this is also why people tend to need a masters degree to even get into this field. you're usually packing a ton of things into your 4 years at school (as you should!) and you need a year or two to reset and focus a little more heavily on a smaller set of core ideas + spend more dedicated time on a thesis or capstone project
That's interesting, what should one do about this?
yes definitely. i shouldn't discourage people from spending their undergrad time just learning how to think. that's arguably even more foundational than any particular math concept. as i just mentioned above, part of the point of a masters is so that you can have more focused learning time on your chosen subject after spending your time on a broad range in undergrad.
Doing Kaggle implicitly helps explain this discrepancy
Cause in Kaggle you actually have 2 jobs:
- getting a good model
- Finding a way to robustly evaluate your models
If you don't succeed at both you're bust
Makes perfect sense honestly
I generally believe a masters degree is the way to go if you want to "hone your craft" in most cases
Then again, I'm too much of an academic newbie to have this opinion
School teaches you point 1. Unless you go to production and your model fails you won't learn point 2 either at work
school ought to teach 2 as well
some curricula cover it. traditional stats does to some extent
They ought to, but they don't cover it well enough
Explaining what a roc curve and cross validation is, isn't enough
CV at least is a valid and useful technique
i remember we learned about cv, bootstrap, leave-one-out, AIC, etc
It is, but if you do like the people at work and you CV endlessly
It defeats the purpose of CV
it wasn't a particularly well-informed introduction, nor did it instill deep understanding, but at least i'd seen it before
Admittedly I don't look at school as a practical tool for the job market
I don't mind learning "useless things". To me school is just a foundation to obtain the ability to learn whatever's necessary
btw there's some pushback now against "unbiased" estimation in statistics as well. the machine learning concept of bias-variance has a lot of overlap with the use of priors in bayesian statistics.
People learn CV as a tool and idt the reasons behind the tool (and how you can still abuse it) are covered adequately
this is probably the right way to think. i just don't know if the economy is in good enough shape to allow people to think this way 😬 but i'm pretty out of touch with the job market for juniors. i hear it's rough right now.
i'm curious what abuses of CV you've seen in industry
i've been lucky to work with very few knuckleheads and mostly people who are very conscientious about their work
Thankfully It'll be many years until I finish my masters, so maybe things will change for the better by then 😅
Well, someone at work is working with a medical dataset. Not a lot of subjects. They're making a model. They use their entire dataset as validation
Because test train splitting with a small dataset isn't great either
ah. do they not know about bootstrapping and cv?
But they've iterated so much on their dataset that they're overfitting implicitly now
They're using cross validation
Cross validation does not save you here
After 1000 rounds of CV you're essentially making new features to raise the validation score
Each evaluation on your test / validation set increases the bias on your score
oh so they'll do CV, make a change, do CV again, etc?
yeah that's always a tough one. in theory you're not supposed to do it at all, but how else are you supposed to iterate?
i've definitely fudged it with problematic datasets where we did things like CV simultaneously for hyperparameter selection and performance eval 😆 but we 1) knew we were overestimating performance and undersold our results to the business, 2) knew we would be able to get new out-of-sample data soon that we could use to evaluate the model properly, and 3) had good business reasons to believe that our data was "representative enough" (part of it was synthetically constructed anyway)
You can't iterate without doing it, but doing it too much means you're overfitting so the answer is doing it "a little"
yep. that seems like something you could maybe study with an information theoretic approach (how much is too much) but i haven't seen any papers on it
There are but they're tedious haha
i'd be curious what the literature says on it
This problem has a name, iirc it's "adaptive overfitting"
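The effect is easy to reproduce in a toy simulation. The sketch below is an illustration under stated assumptions, not anyone's real workflow: labels are pure coin flips, the "models" are random linear classifiers, and repeatedly picking whichever model scores best on a small validation set stands in for iterating on features or hyperparameters. All names and sizes are made up.

```python
import random


def make_data(rng, n, dim):
    """Gaussian features with labels that are independent of the features."""
    xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    ys = [rng.randint(0, 1) for _ in range(n)]
    return xs, ys


def accuracy(w, xs, ys):
    """Accuracy of the linear rule 'predict 1 if w.x > 0 else 0'."""
    hits = 0
    for x, y in zip(xs, ys):
        score = sum(wi * xi for wi, xi in zip(w, x))
        hits += (1 if score > 0 else 0) == y
    return hits / len(xs)


rng = random.Random(0)
DIM = 10
val_x, val_y = make_data(rng, 50, DIM)        # small, reused validation set
fresh_x, fresh_y = make_data(rng, 5000, DIM)  # data never used for selection

# "Iterate" 1000 times, always keeping whatever scores best on validation.
best_w, best_val = None, 0.0
for _ in range(1000):
    w = [rng.gauss(0, 1) for _ in range(DIM)]
    acc = accuracy(w, val_x, val_y)
    if acc > best_val:
        best_w, best_val = w, acc

fresh_acc = accuracy(best_w, fresh_x, fresh_y)
print(f"best validation accuracy: {best_val:.2f}")
print(f"same model on fresh data: {fresh_acc:.2f}")
```

Since there is nothing to learn here, any validation accuracy clearly above 50% is selection bias, and the fresh-data number shows what the selected model is really worth.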
I think business reasons makes you exempt tbf
If I can sell it to myself that it's OK it's OK
hah yep
The problem is not knowing and having crazy inflated scores as a result
but that's why we need all this foundational knowledge: can you sell it to yourself in a way that's legit?
like you're saying, you have to know what's wrong with doing Bad Thing in order to ever coherently justify doing Bad Thing
also i didn't know the term "adaptive overfitting", i've heard about it before in cases like everyone training on the same reference dataset but not with a nice name
I don't think you even need to justify it? If you know it's bad and you can attach a "performance may be inflated" disclaimer you're fine
Why? Let's say you cut some corners and the performance is 3 % higher than the baseline. I'm picking the baseline
If it's 30 % and the corners that I cut aren't too severe, sure I'm still picking my approach
fair
ideally you can get a numerical estimate though
that's not always easy. simulation studies can be hard to design
And to do that we'd have to look at our cousins from statistics
this was a good read and analysis of the adaptive overfitting problem https://gregpark.io/blog/Kaggle-Psychopathy-Postmortem/
How I dropped 50 spots in one minute by overfitting in a Kaggle contest
fwiw i think multiple comparisons correction is controversial even in stats
you need a pretty well-formed decision criterion to do any kind of "testing" properly
The thing is, at least they know it's a problem
If I were serious about tackling it I know stats has been grappling with this for ages and I know what to read
Oh... thank you mate
what reading do you have in mind? i've read a bunch of the older papers but i haven't seen any recent work on it
yesterday my plot was plotting.. I don't remember touching it.. today it's not plotting anymore.. :?
Can you show the code? This isn't nearly enough information to start diagnosing the problem
can anyone provide me a good website focusing on ai ml dl data etc. stuff?
+1 me too..
Websites can do a lot of stuff. Without further information, Kaggle? Kdnuggets maybe?
sites that would teach me this stuff, would they do?
Kaggle has a pretty good introduction so definitely yeah
fixed it 🖤 I guess I did touch the plot after all, and somehow forgot. I don't understand why, but adding transition_duration to my plotly figures layout somehow made it so that some traces weren't rendered.. 🤔 🤷♀️
Kaggle is a great site, but their tutorials..? not so much, in my opinion at least
It has its flaws but it's solid for beginners since it gives you a playground to test things out
I haven't really compared beginners tutorials though, feel free to add your own recommendations
it also has errors, or "too simple/short" examples with leakage and stuff like no normalization for models that need it, which can lead to confusion, especially for beginners. It's not bad, but it's definitely not great. I'd say udemy level (and I generally avoid udemy)
Practical data skills you can apply immediately: that's what you'll learn in these no-cost courses. They're the fastest (and most fun) way to become a data scientist or improve your current skills.
Learn Data Science for free through application oriented courses. Utilize our expert-curated resources as per your interest and pace.
Brandon Rohrer post library
StatQuest for simple explanations on Youtube (stats and ML)
can anyone help me with installing keras? I have installed python 3.12 and when I install tensorflow, it gives me these errors,
u using correct env?
You might wanna confirm if the python version you're using in your IDE is in same environment where Tensorflow was installed.
its alr im using pytorch now
Is it confirmed that Keras will support pytorch this year?
that sounds odd to me considering that keras is a wrapper for tensorflow
ah, thanks! that was what i was looking for...
ooo thanks to you too...
Thanks man
keras started life as a higher-level framework over tensorflow. tensorflow then kind of ate keras and made their "keras api", but keras itself continued to exist. now keras is branching out again to actually support other frameworks as backends.
Great because I only need to learn one framework, hopefully
Hehehe this was also what I said before; that I'm gonna learn Tensorflow. You certainly need to start with your most favourite framework but it'll be nice to be framework agnostic. Started with Tensorflow, but then I recently moved into ML Research, and here I am still learning PyTorch. It's nice to know at least two in my opinion. Tensorflow / PyTorch / JAX
Just read this. It looks like what Ivy is also trying to do https://unify.ai/
Is there a ((detailed)) pretrained model for audio samples used in music? AudioSet isn't that detailed and its the highest benchmark currently available for audio classifiers
HuggingFace usually has one or two gems for almost everything ML. Maybe try checking there.
Can someone tell me if I need Cuda 11 for tensorflow to work with my GPU? I currently have CUDA 12 and tensorflow is not detecting my GPU
O’Reilly 🙂
?
You asked for sites that would teach AI
I had the same problem. I then tried Pytorch and that worked. So I just use Pytorch now, and most research is using Pytorch so yea..
I wanna stick to tensorflow for now. Seems like this is really annoying for a lot of people lol
There's some docker thing that makes it easy apparently so im gonna look into that ig
Landed my next Data scientist job! its been such a long journey I feel like sam on mount doom
Congrats
🥳 ty
had a really weird interview question though about prior likelihood vs probability, I think they got the wording mixed up
gotta say the job market's so bad at the moment, was a real grind
Didn't Keras start as a multi backend thing?
Also my hot take is that once you know one framework you can be productive by googling "How do I do X in Pytorch / Tensorflow"
Hello everyone!
I see this chat is a lot less populated than Python General which I have been frequenting recently
Quantity is no indicator of quality though!
Does anyone here have any insight into the best paid positions within the AI and ML field?
I've seen two commonly recurring job titles are "AI engineer" and "ML engineer" and the internet seems quite divided on who has the higher salary
there is no consistency in what different AI/ML/DS job titles actually mean.
I've met "artificial intelligence engineers" who just flat out do not write code.
which means they are not "engineers" in the programming sense.
so even if it turns out that people who have the title "ML engineer" on average make more than people who have the title "AI engineer", that doesn't really tell us anything.
what's the appropriate plot for displaying the min, max, and avg execution time of a function (a plot of execution time against input size)
I was thinking of using three lines with the same color (but different shades) and filling the area between the min & max with a color
and I want to display the benchmark for 4 functions, so 12 data points in total
Often multiple bar plots.
that's definitely easier than the classical multi-line chart, but I would rather the former
I was taught that I shouldn't use bar plots for continuous data
It's often discrete, you run with a couple of input sizes, and each input size has a group of bars.
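For comparison, the band idea from the question (one average line per function, with the min-max range shaded in the same color) can be sketched like this. The timing part only needs the standard library. The plotting helper assumes matplotlib is installed and imports it locally so the rest still runs without it. The function names, input sizes and output path are made up for illustration.

```python
import statistics
import time


def benchmark(func, sizes, repeats=5):
    """Return {size: (min, mean, max)} wall-clock seconds for func on a list."""
    results = {}
    for n in sizes:
        data = list(range(n, 0, -1))
        times = []
        for _ in range(repeats):
            work = list(data)  # fresh copy so in-place functions see the same input
            start = time.perf_counter()
            func(work)
            times.append(time.perf_counter() - start)
        results[n] = (min(times), statistics.mean(times), max(times))
    return results


def plot_band(results_by_name, path="benchmark.png"):
    """Plot one mean line per function with a shaded min-max band."""
    import matplotlib

    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for name, results in results_by_name.items():
        sizes = sorted(results)
        lo = [results[n][0] for n in sizes]
        avg = [results[n][1] for n in sizes]
        hi = [results[n][2] for n in sizes]
        (line,) = ax.plot(sizes, avg, marker="o", label=name)
        ax.fill_between(sizes, lo, hi, color=line.get_color(), alpha=0.2)
    ax.set_xlabel("input size")
    ax.set_ylabel("seconds")
    ax.legend()
    fig.savefig(path)


sizes = [100, 1000, 10000]
results = {
    "sorted": benchmark(sorted, sizes),
    "list.sort": benchmark(lambda xs: xs.sort(), sizes),
}
for name, res in results.items():
    print(name, res)
# plot_band(results)  # uncomment if matplotlib is available
```

With only a handful of input sizes the bands can look jagged, which is one reason grouped bars are a common alternative.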
i remember seeing ivy posted a while ago,i haven't kept up with it but it's nice to see it's still alive
for chatbot just say like a fun, normal conversation which type of ml model should i be aiming for?
yes but afaik tensorflow was the only backend that is both still extant and was supported at the time. i think maybe it also supported theano but i might have also made that up
the best paying in AI/ML as with most tech-related fields are phd-level high-ranking individual contributor (as in, you're an actual known researcher being hired to solve hard problems) or senior management (you're finding/hiring/managing the people solving the hard problems)
Either way, Theano is basically dead no?
we use candlestick charts for this type of stuff, when we want to show avg/hi/low plus stddev
yeah it was put gently to bed by its original developers, although i remember seeing something about the community continuing to fix bugs and keep the project alive, even if not advancing
or maybe that was pymc3 which was based on theano? idk
oh I completely forgot about that
Even the about in the github page for it says "Theano was a Python library that allows you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently."
Keyword, "was."
Or box and whiskers
but multiple candlesticks on the x axis, how would that work?
Ah, it was continued as "Aesara," which was forked to "PyTensor."
@unique ether but more broadly, the best-paid positions are high-value positions in industries and at companies that have the capital to pay a lot of money for that high-value work. that would typically be ML/AI/data engineering supporting advanced research teams and/or critical production systems, or being an advanced researcher yourself. for positions that are realistically obtainable for normal people, "data engineering" and "data science" are still the two primary tracks. pay depends on seniority/expertise, region, and choice of industry
