#data-science-and-ml | Python | Page 84

tardy lark Oct 5, 2023, 3:51 PM

#

although this time the detected shape is 140

#

uhm idk the shape but the data is stock prices

tidal bough Oct 5, 2023, 3:55 PM

#

Check scaled_data.shape and scaled_data.dtype

tardy lark Oct 5, 2023, 3:56 PM

#

float64 and 6007,1

tidal bough Oct 5, 2023, 3:56 PM

#

Actually, I think I see a mistake. You're taking elements from scaled_data but looping up to len(train_data). If the latter is higher than len(scaled_data), then the last slices the loop produces can be shorter than 60 elements.

tardy lark Oct 5, 2023, 4:00 PM

#

so how could i fix that?

#

        train_data = final_data[0:200,:]
        valid_data = final_data[200:,:]

tidal bough Oct 5, 2023, 4:03 PM

#

If len(train_data) < len(scaled_data) as this seems to imply, your loop should work.

tardy lark Oct 5, 2023, 4:04 PM

#

for i in range(60,len(train_data < len(scaled_data)): like that?

serene scaffold Oct 5, 2023, 4:14 PM

#

tardy lark ``` for i in range(60,len(train_data < len(scaled_data)):``` like that?

what type does len(train_data) < len(scaled_data) return, and what types do you need to pass to range?

tardy lark Oct 5, 2023, 4:24 PM

#

serene scaffold what type does `len(train_data) < len(scaled_data)` return, and what types do yo...

it just returns false

serene scaffold Oct 5, 2023, 4:25 PM

#

tardy lark it just returns false

False is a bool. Does range(60, False) make sense?

tardy lark Oct 5, 2023, 4:26 PM

#

no

serene scaffold Oct 5, 2023, 4:26 PM

#

see if you can fix it

tardy lark Oct 5, 2023, 5:39 PM

#

serene scaffold see if you can fix it

yeah i'm stumped

serene scaffold Oct 5, 2023, 5:41 PM

#

tardy lark yeah i'm stumped

when ConfusedReptile said "If len(train_data) < len(scaled_data) as this seems to imply, your loop should work.", they weren't suggesting that you need to include len(train_data) < len(scaled_data) in your code somewhere. they were saying "if the length of train_data is less than the length of scaled_data, then your loop should work"

#

so the loop needs to go from 60, to what?
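A minimal sketch of where this hint leads, assuming the loop builds 60-step windows from scaled_data in the usual way (the original loop body is not shown in the chat, so the body here is a guess, and the stand-in data and `train_len` are hypothetical). The point is that `range` needs an integer stop value, such as a length, rather than the result of a comparison:

```python
import numpy as np

WINDOW = 60  # look-back length mentioned in the conversation

# Stand-in data with the reported shape (6007, 1); the real values are unknown
scaled_data = np.linspace(0.0, 1.0, 6007).reshape(-1, 1)
train_len = 200  # hypothetical: number of rows kept for training

# range() takes integers, so pass a length, never `a < b` (which is a bool).
# Capping at len(scaled_data) guarantees every slice has exactly WINDOW values.
stop = min(train_len, len(scaled_data))

x_train, y_train = [], []
for i in range(WINDOW, stop):
    x_train.append(scaled_data[i - WINDOW:i, 0])  # the previous 60 values
    y_train.append(scaled_data[i, 0])             # the value to predict

x_train = np.array(x_train)
y_train = np.array(y_train)
```

With 200 training rows and a window of 60, this yields 140 samples, which matches the "detected shape is 140" mentioned at the start of the thread.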

neon lantern Oct 5, 2023, 5:46 PM

#

my brain isn't working right now so I need a sanity check
can i train ppo with a continuous stream of downscaled images (think 360x240 or something) and have it just dictate mouse movement and whether or not to click on each frame

#

and as a followup, should I do it in tensorflow or pytorch or do it with custom gym environment

#

I assume that custom gym is easiest since theres a lot of backend I don't have to handle but I wanted to get a second opinion

half cliff Oct 5, 2023, 6:03 PM

#

Hi! Has anyone ever worked with the library pydeck?

serene scaffold Oct 5, 2023, 6:13 PM

#

half cliff Hi! Has anyone ever worked with the library pydeck?

always ask your actual question, and give enough information for someone who would know the answer to start answering it. don't wait for a commitment.

cold osprey Oct 5, 2023, 6:24 PM

#

serene scaffold always ask your actual question, and give enough information for someone who wou...

shud have a bot command for this by now hahah

serene scaffold Oct 5, 2023, 8:50 PM

#

cold osprey shud have a bot command for this by now hahah

I just use a self-bot /s

#

but also we did have a bot command telling people not to ask to ask and such, but it was used in a passive aggressive way, and people didn't really read it.

rugged comet Oct 5, 2023, 9:45 PM

#

@past meteor Was I meant to sort the residuals and the variable values for some reason before graphing? I think not because I think that would cause the variable values to not "line up" with their corresponding residuals.

past meteor Oct 5, 2023, 10:26 PM

#

rugged comet @past meteor Was I meant to sort the residuals and the variable values ...

correct, don't sort them

small ore Oct 5, 2023, 10:27 PM

#

Is there a quick handy way to get a report on outliers (in numerical data) . Number of outliers in each feature and an array/series of indices that contains them?

past meteor Oct 5, 2023, 10:27 PM

#

small ore Is there a quick handy way to get a report on outliers (in numerical data) . Num...

A basic way is to just make a boxplot

small ore Oct 5, 2023, 10:28 PM

#

That does not give me a report of how many outliers are there for each feature and their locations if I am thinking right

past meteor Oct 5, 2023, 10:31 PM

#

small ore That does not give me a report of how many outliers are there for each feature a...

You can compute the dots above the whiskers manually if you want

#

It's what I do if I want quick and dirty univariate outliers

small ore Oct 5, 2023, 10:35 PM

#

I have a dataset with (potentially) 23 features and want to look at similar stuff in the future too. A quick report/describe that helps me determine if I can remove them without worries is what i need. Also need indices/location so that I can remove those easily. Bonus would be to see the overlap

small ore Oct 5, 2023, 10:50 PM

#

Basically a

df.describe()

with extra data

past meteor Oct 5, 2023, 11:18 PM

#

small ore I have a dataset with (potentially) 23 features and want to look at similar stuf...

outlier_mask = df["feature_1"] > df["feature_1"].quantile(q=0.75) * 1.5

#

And you do the same for those lower than the 25th quantile and do an or between both, those are your outliers

#

Gives you a true or false series

#

Finally, you filter by this mask and you get the index of all these values

#

Note that this is indeed univariate outliers. Sometimes values are anomalous together. If you want to find those consider using a one class SVM: https://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html / https://scikit-learn.org/stable/modules/outlier_detection.html#outlier-detection
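For the "df.describe() with extra data" request, here is a rough sketch of a per-feature report using the boxplot whisker rule (values beyond Q1 - 1.5·IQR or Q3 + 1.5·IQR). Note that this swaps in the standard IQR rule rather than the quantile-times-1.5 shortcut written above. The function name `iqr_outlier_report` and the toy DataFrame are made up for illustration:

```python
import pandas as pd

def iqr_outlier_report(df, k=1.5):
    """Count IQR-rule outliers per numeric column and collect their index labels."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # True where a value lies beyond the boxplot whiskers
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    counts = mask.sum()  # number of outliers in each feature
    indices = {col: mask.index[mask[col].to_numpy()].tolist() for col in mask.columns}
    return counts, indices, mask

# Hypothetical toy data: one obvious outlier in feature_1
df = pd.DataFrame({
    "feature_1": [1.0, 2.0, 3.0, 4.0, 100.0],
    "feature_2": [10.0, 11.0, 12.0, 13.0, 14.0],
})

counts, indices, mask = iqr_outlier_report(df)
cleaned = df[~mask.any(axis=1)]  # drop rows flagged in any feature
```

`mask.sum(axis=1)` gives the number of features in which each row is flagged, which is one way to look at the overlap between features.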

small ore Oct 6, 2023, 12:34 AM

#

I was attempting to write my own code already. That is why I asked for "a quick handy method" in my original question.

#

But thanks anyways. I will look into your way too

small wedge Oct 6, 2023, 4:08 AM

#

!rule ad

arctic wedgeBOT Oct 6, 2023, 4:08 AM

#

Rules

6. Do not post unapproved advertising.

honest verge Oct 6, 2023, 4:27 AM

#

do you guys have any suggestions on what backend to use for ml models? Would it differ between traditional ml methods and dl ?

agile cobalt Oct 6, 2023, 4:45 AM

#

for traditional ml, most things should be 'cheap' enough to run as far as computing power goes that you don't have to think much about it beyond the usual considerations for literally any project's backend at all

for deep learning, you might need to ensure you have access to gpus depending on which model(s) you are using, which can make it harder and more expensive

agile cobalt Oct 6, 2023, 4:49 AM

#

honest verge do you guys have any suggestions on what backend to use for ml models? Would it ...

classic options like AWS, Azure or Google Cloud Platform can work fine, if you use the right services inside of it and take the usual precautions to like not leaking credentials, not getting hacked, properly deactivating things you're not using to avoid unwarranted fees etc - not ultra specific to ml

though I guess that as far as specifically for ml goes, there are some things like Hugging Face Spaces you might want to look into

potent sky Oct 6, 2023, 5:53 AM

#

Also HF Inference Endpoints if you're looking for production use

#

Spaces is mostly for demo use ig, and doesn't provide an API to query the model hosted

#

Still very useful to get things up and running and to test it out

mystic ruin Oct 6, 2023, 12:39 PM

#

i can't install chatterbot with pip... i am using python 3.11 output :

```
        ERROR: Failed building wheel for blis
        Running setup.py clean for blis
      Failed to build preshed thinc blis
      ERROR: Could not build wheels for preshed, thinc, blis, which is required to install pyproject.toml-based projects
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.
```

agile cobalt Oct 6, 2023, 5:09 PM

#

mystic ruin i can't install chatterbot with pip... i am using python 3.11 output : ```note: ...

from https://pypi.org/project/ChatterBot/ :

PyPI

ChatterBot

ChatterBot is a machine learning, conversational dialog engine.

#

looks like it hasn't been updated since 2021?
edit; even worse, the pypi version is from 2020
(2021 was the latest github commit on the main branch)

#

you might want to look into LangChain instead

odd meteor Oct 6, 2023, 5:10 PM

#

mystic ruin i can't install chatterbot with pip... i am using python 3.11 output : ```note: ...

You need to download the wheels for the packages that failed to build (make sure they match your Python version, 3.11). Then, from your command line (Windows / Conda, depending on what you're using), change directory to the folder where you saved the wheels and pip install them.

Use this website to download the wheels: https://www.lfd.uci.edu/~gohlke/pythonlibs/

Now, after doing that for the missing wheels, try re-installing chatterbot again.

Better solution: create a virtual environment with an older version of Python, either 3.9 or 3.10, then install chatterbot in that environment. You should be fine. (I noticed that some of the wheels you're missing, like thinc and blis, aren't available yet for Python 3.11, but the 3.10 versions are.)

odd meteor Oct 6, 2023, 5:14 PM

#

agile cobalt from https://pypi.org/project/ChatterBot/ :

Aha @mystic ruin this is why it's not working. The maintainers of Chatterbot appear to have gone on sabbatical since python 3.8 😀 . So you'd either downgrade to 3.8 or find another library for building what you wanted to use Chatterbot for.

abstract wasp Oct 6, 2023, 6:51 PM

#

I split my data to just training and validation, is it okay if I don’t have a split for testing?

desert oar Oct 6, 2023, 7:00 PM

#

abstract wasp I split my data to just training and validation, is it okay if I don’t have a sp...

it's arguably better to have a testing split and no validation split, because otherwise you have no "untouched" data for final model evaluation. but why do you want to avoid making a 3rd split? there are good reasons to want to avoid it, but i want to understand your case in particular so i can give sensible advice.

abstract wasp Oct 6, 2023, 7:03 PM

#

desert oar it's arguably better to have a testing split and no validation split, because ot...

Idk, I just did it like that. Sometimes I see others do 3 splits and sometimes just 2.
If I include the 3rd split, will it increase my model’s accuracy.
The unseen data, isn’t that the validation set?

left tartan Oct 6, 2023, 7:53 PM

#

abstract wasp Idk, I just did it like that. Sometimes I see others do 3 splits and sometimes j...

It’s not about increasing your models accuracy, it’s about evaluating it:

#

With a train test split, you might test dozens or more of models. You’ll eventually find one that fits the test really well. But then what? Was that just by sheer chance or did you stumble upon a ‘good’ model?

#

So, at least in my world, the validation split is the very last thing that you hold back as long as you can… because often the ‘test’ effectively becomes part of train

#

(Curious whether zestar75 agrees with my characterization)

cerulean kayak Oct 6, 2023, 11:15 PM

#

because of the fact that inputs for neural networks have to be a vector of data,
does that mean whenever I'm using keras to make a Neural Network, I need to make sure that the x_train and y_train have a shape of (n,1)?
please at me if you know.

thick walrus Oct 6, 2023, 11:34 PM

#

Hello All,
I am working on bar plot in matplot. The second plot is not showing the bar:
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta
import matplotlib as mpl

mpl.rcParams["date.converter"] = 'concise'
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, layout='constrained')
price_date = np.array([datetime(2020, 6, 30),
datetime(2020, 7, 22),
datetime(2020, 8, 3),
datetime(2020, 9, 14)], dtype=np.datetime64)

price_close = [8800, 2600, 8500, 7400]
start_date = np.datetime64(datetime(2020, 6, 1))

ax1.bar(price_date, price_close, width=np.timedelta64(4, "D"))
ax2.bar(start_date, price_close, bottom=price_date)
ax3.bar(np.arange(4), price_date-start_date, bottom=start_date)

#

Am I missing something, I wanted the time to be at the bottom on the second plot but it does not show the bar

#

small ore Oct 6, 2023, 11:37 PM

#

thick walrus Hello All, I am working on bar plot in matplot. The second plot is not showing ...

You could format code in code blocks to help others read better

thick walrus Oct 6, 2023, 11:38 PM

#

small ore You could format code in code blocks to help others read better

my apologies - let me try to format it in code blocks

small ore Oct 6, 2023, 11:42 PM

#

!code

arctic wedgeBOT Oct 6, 2023, 11:42 PM

#

Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

thick walrus Oct 6, 2023, 11:45 PM

#

I hope this helps as I used pastebin. As mentioned the second bar plot is not showing the bar.
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta
import matplotlib as mpl

mpl.rcParams["date.converter"] = 'concise'
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, layout='constrained')
price_date = np.array([datetime(2020, 6, 30),
datetime(2020, 7, 22),
datetime(2020, 8, 3),
datetime(2020, 9, 14)], dtype=np.datetime64)

price_close = [8800, 2600, 8500, 7400]
start_date = np.datetime64(datetime(2020, 6, 1))

ax1.bar(price_date, price_close, width=np.timedelta64(4, "D"))
ax2.bar(start_date, price_close, bottom=price_date)
ax3.bar(np.arange(4), price_date-start_date, bottom=start_date)

past meteor Oct 7, 2023, 1:01 AM

#

left tartan It’s not about increasing your models accuracy, it’s about evaluating it:

Fully agree. Splitting is not about "increasing performance", it's about having a high fidelity estimate of how good your model is.

#

As you say, if you just do a train-test split and test N models with it and you likely do this M times ... your test set is de facto part of your training set.

#

My current work project is medical stuff. What I do is essentially split:

1. Split off some patients in full, they are never touched. (inter patient split)
2. Split each "training" patient in a train and a test set. (intra patient split)
3. Within this train set I split off a little more data and mostly cross validate.
4. I try to minimize the amount of times I'm using the "intra patient" test sets.
5. I take the top 4 models found on the "intra patient" splits and apply them on the held out ones.
6. Report the inter patient split results, these are how our models work on unseen patients. If the difference between test and train here is high we likely overfit by reusing the intra patient split too much.

It sounds convoluted but if you don't have a lot of data how you approach this is important to get unbiased performance estimates.
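The first two steps of that scheme could be sketched roughly as follows. This is only an illustration under assumptions: the helper names, the fractions, and the toy data are invented, and a plain random shuffle within each patient may not suit data with temporal structure.

```python
import random

def split_patients(patient_ids, holdout_fraction=0.2, seed=0):
    """Inter-patient split: set aside whole patients that are never touched."""
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_holdout = max(1, round(len(ids) * holdout_fraction))
    return ids[n_holdout:], ids[:n_holdout]

def split_within_patient(samples, test_fraction=0.25, seed=0):
    """Intra-patient split: divide one patient's samples into train and test."""
    samples = list(samples)
    rng = random.Random(seed)
    rng.shuffle(samples)
    n_test = max(1, round(len(samples) * test_fraction))
    return samples[n_test:], samples[:n_test]

# Hypothetical data: 10 patients with 8 samples each
data = {f"patient_{p}": [f"p{p}_s{s}" for s in range(8)] for p in range(10)}

dev_patients, holdout_patients = split_patients(data.keys())
# Per development patient: (train samples, test samples)
intra = {p: split_within_patient(data[p]) for p in dev_patients}
```

Cross-validation would then run only on the per-patient train portions, and the held-out patients would be used once, for the final report.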

desert oar Oct 7, 2023, 1:19 AM

#

past meteor My current work project is medical stuff. What I do is essentially split: 1. Sp...

The problem with making too many splits is when you have sparse or unbalanced data, sometimes the splits end up kind of useless and you actually end up underestimating model performance

#

the compromise i've made in the past in that situation is to create a single train/test split, and rely on cross validation within the train set for model dev

past meteor Oct 7, 2023, 1:20 AM

#

Underestimating model performance is a lot better than overestimating it, I'm fine with that

left tartan Oct 7, 2023, 1:20 AM

#

I really like this authors illustration and explanation of this train/test overfitting problem (in the finance context but generally applicable): https://www.davidhbailey.com/dhbtalks/battle-quants.pdf

desert oar Oct 7, 2023, 1:21 AM

#

past meteor Underestimating model performance is a lot better than overestimating it, I'm fi...

sort of. if you have 1000 classes but only 500 of them make it into the test set, it's a problem

past meteor Oct 7, 2023, 1:21 AM

#

Honestly? I like the idea of bespoke data splitting. For my current use case it made sense. For others I'll have to do something else.

desert oar Oct 7, 2023, 1:21 AM

#

but yeah, implicitly overfitting through iterated model tweaking is a serious problem

past meteor Oct 7, 2023, 1:22 AM

#

In my case there's an element of unbalanced data in it and we solved it by having a semi-stratified test-train split in the intra patient split

desert oar Oct 7, 2023, 1:22 AM

#

left tartan I really like this authors illustration and explanation of this train/test overf...

Definitely bookmarking this

left tartan Oct 7, 2023, 1:22 AM

#

desert oar but yeah, implicitly overfitting through iterated model tweaking is a serious pr...

I don’t know if finance is more prone to this… it feels like it, since we’re inherently trying to maximize a variable

past meteor Oct 7, 2023, 1:22 AM

#

But the stratification procedure was very much linked to the "domain"

left tartan Oct 7, 2023, 1:23 AM

#

desert oar Definitely bookmarking this

The authors have some great stuff, Bailey has some YouTube talks, and de Prado wrote the wonderful: Advances in Financial Machine Learning

desert oar Oct 7, 2023, 1:23 AM

#

yeah, that's how i ended up solving my 1000 class problem too, we were able to group the classes into bigger categories and stratified that way

desert oar Oct 7, 2023, 1:23 AM

#

left tartan The authors have some great stuff, Bailey has some YouTube talks, and de Prado w...

thank you, i have been looking for more things to listen to while i do house chores

left tartan Oct 7, 2023, 1:24 AM

#

This is the YouTube I was thinking of: https://youtu.be/e3h9xf3p1DE?feature=shared

past meteor Oct 7, 2023, 1:25 AM

#

I think all ML is prone to this, not just finance. It's game over when the data scientist starts thinking their job is to maximize or minimize a variable 😄

desert oar Oct 7, 2023, 1:25 AM

#

Why is it "made for kids"? I can't save the video to a playlist lol

desert oar Oct 7, 2023, 1:25 AM

#

past meteor I think all ML is prone to this, not just finance. It's game over when the data ...

unless that variable is long term business success!

past meteor Oct 7, 2023, 1:26 AM

#

desert oar unless that variable is long term business success!

I meant it in a way that if your focus is maximizing or minimizing a variable without regard to due diligence in how you evaluate, you'll have a negative impact on long term business success

#

The optimal way to get 100 % on the test set is just by copying y

#

Hence why I truly believe that's not our job, our job is to answer the question "if we go into production with this, how good will it be?"

#

That's why I don't mind underestimating performance, if all my metrics say the model is fantastic and we go into prod and it fails then that might impact how much they trust us going forward.

left tartan Oct 7, 2023, 1:30 AM

#

This is why I like staying in the DE side in the finance world!

past meteor Oct 7, 2023, 1:32 AM

#

I mostly speak about DS but I like data engineering equally I think, it's good fun as well

left tartan Oct 7, 2023, 1:38 AM

#

In finance, DS feels more like BS most of the time

past meteor Oct 7, 2023, 1:44 AM

#

Ironically, my background is quantitative business. Finance was never my jam. A lot of my cohort went into actuarial science which does seem very legit.

Algo trading ... idk, I feel like there's too much happening in the world you can't control for.

Finance at large does have interesting use cases like credit risk modelling and fraud detection though. My profs couldn't shut up about these 🤣

desert oar Oct 7, 2023, 1:45 AM

#

past meteor Ironically, my background is quantitative business. Finance was never my jam. A ...

i have no idea if this is true, but my broad economics-based impression of algo trading is that it only makes sense for HFT

past meteor Oct 7, 2023, 1:46 AM

#

That sounds reasonable to me. I don't know enough to comment, I'll leave that to BillyBobby 😄

left tartan Oct 7, 2023, 1:46 AM

#

desert oar i have no idea if this is true, but my broad economics-based impression of algo ...

I know there’s lots of fund styles: everyone thinks they’re a genius. There are non HFT algorithmic shops

#

Usually, I think, it comes down to a big broad strategy bet (ie: certain sectors) and small optimizations to hopefully outperform.

past meteor Oct 7, 2023, 1:49 AM

#

left tartan I really like this authors illustration and explanation of this train/test overf...

The PDF you sent was spot on BTW

left tartan Oct 7, 2023, 1:49 AM

#

Yah, it’s tough to apply in practice, but it’s good to understand the limits

past meteor Oct 7, 2023, 1:51 AM

#

https://d2l.ai/chapter_linear-classification/generalization-classification.html#test-set-reuse from one of the books I always shill.

left tartan Oct 7, 2023, 1:56 AM

#

Oh thanks, this statement is somewhat surprising, I was hoping for a more positive solution: “This problem …. remains a persistent problem plaguing scientific research.”

kind lotus Oct 7, 2023, 4:43 AM

#

Anyone here have experience with pandas?

nimble hawk Oct 7, 2023, 7:35 AM

#

Hello everyone, I uploaded a PySpark course on YouTube channel. I tried to cover wide range of topics including SparkContext and SparkSession, Resilient Distributed Datasets (RDDs), DataFrame and Dataset APIs, Data Cleaning and Preprocessing, Exploratory Data Analysis, Data Transformation and Manipulation, Group By and Window ,User Defined Functions and Machine Learning with Spark MLlib. I am leaving the link below, have a great day!

https://www.youtube.com/watch?v=jWZ9K1agm5Y

YouTube

Onur Baltacı

PySpark Course: Big Data Handling with Python and Apache Spark

PySpark, the Python API for Apache Spark, empowers data engineers, data scientists, and analysts to process and analyze massive datasets efficiently. In this course, you'll dive deep into the fundamentals of PySpark, learning how to harness the combined power of Python and Apache Spark to handle big data challenges with ease. From data manipulat...

▶ Play video

past meteor Oct 7, 2023, 7:48 AM

#

nimble hawk Hello everyone, I uploaded a PySpark course on YouTube channel. I tried to cover...

!rule 6

arctic wedgeBOT Oct 7, 2023, 7:48 AM

#

Rules

6. Do not post unapproved advertising.

past meteor Oct 7, 2023, 7:48 AM

#

You're not supposed to advertise stuff on the Discord 🙂

thick walrus Oct 7, 2023, 8:06 AM

#

Hello All,
I am still having a challenge with the second subplot. It should show the timedelta at the bottom. Currently, it does not show anything.
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta
import matplotlib as mpl

mpl.rcParams["date.converter"] = 'concise'
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, layout='constrained')
price_date = np.array([datetime(2020, 6, 30),
datetime(2020, 7, 22),
datetime(2020, 8, 3),
datetime(2020, 9, 14)], dtype=np.datetime64)

price_close = [8800, 2600, 8500, 7400]
start_date = np.datetime64(datetime(2020, 6, 1))

ax1.bar(price_date, price_close, width=np.timedelta64(4, "D"))
ax2.bar(start_date, price_close, bottom=price_date)
ax3.bar(np.arange(4), price_date-start_date, bottom=start_date)

Any suggestions?

nimble hawk Oct 7, 2023, 8:14 AM

#

past meteor You're not supposed to advertise stuff on the Discord 🙂

Sorry, i won't again. Thanks for the warning

cold osprey Oct 7, 2023, 9:50 AM

#

!code

arctic wedgeBOT Oct 7, 2023, 9:50 AM

#

Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

vale swallow Oct 7, 2023, 11:46 AM

#

Hi, can someone pls help me. I have been trying and trying again to increase the accuracy of my model but it stays at like 70% accuracy. I have added two dropout layers, L1 regularization, and my dataset is about 23k images big. What can I do to get my model to at least 80%? Help 😭😭

open meadow Oct 7, 2023, 12:50 PM

#

I'm interested in learning ai & ml but I have 0 idea where to start from, my main lang. is python, if anyone has any tips or courses/vids that can help me I'd appreciate it

tidal bough Oct 7, 2023, 12:59 PM

#

https://www.coursera.org/specializations/machine-learning-introduction, say.

harsh minnow Oct 7, 2023, 1:32 PM

#

Hi guys, I have a large news document dataset, I am trying to train a model on it. The news runs from 1998 until now. I want to filter the news from 2015. The dataset does not have metadata, so to check the year I have to manually check each news item and its content. I tried using spaCy, but it's not accurate because the news will have many dates. Is there a better way of doing it?

serene scaffold Oct 7, 2023, 2:09 PM

#

harsh minnow Hi guys, I have a large news document dataset, I am trying to train a model on ...

I don't think there's a "good" way of doing this, but I would treat the most recent date that's mentioned in the document as the approximate publication date

past meteor Oct 7, 2023, 2:48 PM

#

harsh minnow Hi guys, I have a large news document dataset, I am trying to train a model on ...

Unless your data set is very very very large don't be afraid of doing stuff manually. Higher quality data means you get a better model. Start with a subset first and then do more and more until perf flatlines.

harsh minnow Oct 7, 2023, 2:48 PM

#

Manually meaning checking all the docs? There are over 10,000 docs

past meteor Oct 7, 2023, 2:50 PM

#

Are you alone in this or is this a team effort?

harsh minnow Oct 7, 2023, 2:50 PM

#

serene scaffold I don't think there's a "good" way of doing this, but I would treat the most rec...

the news has a lot of dates, even when I use entity recognition, it shows a couple of dates, which is inaccurate.

harsh minnow Oct 7, 2023, 2:50 PM

#

past meteor Are you alone in this or is this a team effort?

Just Alone.

past meteor Oct 7, 2023, 2:52 PM

#

Hmmm, I don't know how long this would take 🤔 . At least for computer vision I tend to manually label stuff whenever necessary. It sucks but imo it's part of the job 😄

serene scaffold Oct 7, 2023, 2:52 PM

#

harsh minnow the news has a lot of dates, even when I use entity recognition, it shows a coup...

that's why I'm saying that you should pick the most recent one as the approximate date. so if the document mentions 12 January, 1975; 6 March, 2013; and 4 June, 2020; I would expect the document was probably published closer to 2020 than the other two.

past meteor Oct 7, 2023, 2:53 PM

#

That's a good one

harsh minnow Oct 7, 2023, 2:53 PM

#

serene scaffold that's why I'm saying that you should pick the most recent one as the approximat...

Oh okay, this makes a lot of sense.

serene scaffold Oct 7, 2023, 2:56 PM

#

I would probably also skip dates that only have a month and year, since those might refer to tentatively scheduled future events
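The heuristic described here (take the most recent fully specified date, ignore month-and-year-only mentions) might look something like the sketch below. It uses a plain regular expression instead of an NER model, only handles the "4 June, 2020" style from the example, and the function name is invented, so treat it as a starting point rather than a robust extractor:

```python
import re
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

# Matches full dates such as "4 June, 2020". Month-and-year-only mentions
# ("August 2021") are deliberately not matched.
FULL_DATE = re.compile(r"\b(\d{1,2})\s+([A-Za-z]+),?\s+(\d{4})\b")

def approximate_publication_date(text):
    """Return the most recent full date mentioned in text, or None."""
    found = []
    for day, month, year in FULL_DATE.findall(text):
        month_number = MONTHS.get(month.lower())
        if month_number is None:
            continue  # the word before the year was not a month name
        try:
            found.append(date(int(year), month_number, int(day)))
        except ValueError:
            continue  # impossible date such as 31 February
    return max(found) if found else None

sample = ("Founded on 12 January, 1975, the firm was sold on 6 March, 2013. "
          "On 4 June, 2020 it closed. A relaunch is planned for August 2021.")
guess = approximate_publication_date(sample)
```

Filtering by year is then a matter of comparing `guess.year` against 2015 for the documents where a date was found.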

harsh minnow Oct 7, 2023, 2:56 PM

#

Also, one other thing I tried is using llama-7B to output the year, I used it via Hugging Face transformers. But the only issue is it takes around 12 seconds to complete one request. I am using Colab (85GB RAM)

#

Is this the case usually?

#

@serene scaffold ?

serene scaffold Oct 7, 2023, 3:41 PM

#

harsh minnow Also, one other thing I tried is using llama-7B to output the year, I used it v...

did you make sure you enabled GPU?

harsh minnow Oct 7, 2023, 3:42 PM

#

GPU does not increase at all

serene scaffold Oct 7, 2023, 3:42 PM

#

harsh minnow GPU does not increase at all

so, you're not even using the GPU, apparently

harsh minnow Oct 7, 2023, 3:44 PM

#

Currently I restarted the runtime, usually 70GB of the RAM will be consumed

#

And GPU does not go up at all

serene scaffold Oct 7, 2023, 3:44 PM

#

@harsh minnow can you screenshare with me in #751592231726481530?

#

!stream 533839465060499487 "15 minutes"

arctic wedgeBOT Oct 7, 2023, 3:45 PM

#

✅ @harsh minnow can now stream until Oct 7, 2023, 4:00 PM.

harsh minnow Oct 7, 2023, 3:46 PM

#

I don't have the permission to speak

serene scaffold Oct 7, 2023, 3:47 PM

#

come back

harsh minnow Oct 7, 2023, 3:53 PM

#

!pip install transformers
!pip install transformers[sentencepiece]

from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

arctic wedgeBOT Oct 7, 2023, 3:53 PM

#

install v1.3.5

Install packages from within code

serene scaffold Oct 7, 2023, 3:54 PM

#

from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device=0)

harsh minnow Oct 7, 2023, 4:04 PM

#

#

Thanks @serene scaffold

#

It worked 🎉

umbral charm Oct 7, 2023, 4:51 PM

#

https://gyazo.com/4ffdd8c405e74c19b0ea3e6f7f3437be

Gyazo

#

im making a pie chart, how do i get rid of the labels on the pie chart and just instead have a legend

#

portions = [75, 18, (100-75-18)]
plt.pie(x = portions, labels = ['Sky', 'pyramid', 'shady'], startangle = 320, colors = ['deepskyblue', 'yellow', 'gold'])
plt.legend(loc = 'best', bbox_to_anchor = (0.9, 0.9))
plt.show()
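One possible approach, sketched with the same numbers: leave `labels=` out of the `pie` call so nothing is drawn on the wedges, and pass the wedge handles plus the names to `legend` instead. The variable `names` is introduced here for illustration.

```python
import matplotlib.pyplot as plt

portions = [75, 18, 100 - 75 - 18]
names = ["Sky", "pyramid", "shady"]

fig, ax = plt.subplots()
# Without labels= (and without autopct) pie() returns (wedges, texts),
# and no text is drawn on the wedges themselves
wedges, texts = ax.pie(portions, startangle=320,
                       colors=["deepskyblue", "yellow", "gold"])
# Hand the wedge handles and their names to the legend
ax.legend(wedges, names, loc="best", bbox_to_anchor=(0.9, 0.9))
```

Calling `plt.show()` afterwards displays the chart with only the legend carrying the names.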

odd meteor Oct 7, 2023, 10:50 PM

#

harsh minnow Hi guys, I have a large news document dataset, I am trying to train a model on ...

I have two questions.

1. Can you easily tell the year a document was published just by looking at a few lines?
2. If the answer to #1 is yes, how do you tell that the year you chose is truly the year the document was published and not some random date in the document?

#

Perhaps there could be a recurring pattern in the documents which you could leverage in getting the year each document was published

odd meteor Oct 7, 2023, 10:54 PM

#

open meadow I'm interested in learning ai & ml but I have 0 idea where to start from, my main...

https://Kaggle.com/learn

Learn Python, Data Viz, Pandas & More | Tutorials | Kaggle

Practical data skills you can apply immediately: that's what you'll learn in these no-cost courses. They're the fastest (and most fun) way to become a data scientist or improve your current skills.

odd meteor Oct 7, 2023, 10:57 PM

#

harsh minnow the news has a lot of dates, even when I use entity recognition, it shows a coup...

If there is a consistent pattern across all the documents for figuring out the right date, you can write a function to do that. Or better still, train your own custom NER to handle that for you.

meager ridge Oct 8, 2023, 2:05 AM

#

not sure where NLP-type questions should go but ...

say I have data for ~2000 cities for every year from a decade
I know every city is on every list (for the most part), but the spellings and syntax probably vary from year to year

what's the best way to find the string representing every city in each year? is it stupid to do k-means if you are going to have several thousand clusters? is it stupid to just do pairwise fuzzy matching?

serene scaffold Oct 8, 2023, 2:33 AM

#

meager ridge not sure where NLP-type questions should go but ... - say I have data for ~2000...

You have data for 2000 cities. What data?

Every city is on every list. What lists?

Are you trying to figure out when the same city is referred to in different instances, but in different ways; example, "Paris" and "the French capital"?

As an aside, "syntax" means "rules about legal orderings of symbols in a language". The words "syntax" and "semantics" are used much more expansively in casual speech than they are in linguistics and NLP, so be sure that you're using them correctly.

#

@meager ridge I see that you also asked this in a help thread. Please link to your thread when you cross post your question, to avoid duplication of effort

meager ridge Oct 8, 2023, 2:41 AM

#

serene scaffold @meager ridge I see that you also asked this in a help thread. Please li...

(whats best practices for linking the other post?)

serene scaffold Oct 8, 2023, 2:42 AM

#

meager ridge (whats best practices for linking the other post?)

Asking your questions in the help thread, and then linking it in the most relevant topical channel with a brief preview of what your question is about. Thanks for asking.

meager ridge Oct 8, 2023, 3:01 AM

#

meager ridge not sure where NLP-type questions should go but ... - say I have data for ~2000...

opened up a help thread about this!
https://discord.com/channels/267624335836053506/1160398584513048707

night turret Oct 8, 2023, 5:40 AM

#

Has anyone here made a python voice assistant?

bronze robin Oct 8, 2023, 6:05 AM

#

Anyone here with experience of performing FFT on timeseries data? Preferably using numpy

bronze robin Oct 8, 2023, 6:22 AM

#

Yeah I have performed fft but I need some help regarding the output frequency interval customization

halcyon jasper Oct 8, 2023, 6:48 AM

#

somebody can help me with multiNetX?

lunar current Oct 8, 2023, 8:20 AM

#

Any idea why this error occurs? ```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (1663 > 1024). Running this sequence through the model will result in indexing errors
Evaluating: 0%| | 0/1379 [00:00<?, ?it/s]

RuntimeError Traceback (most recent call last)
<ipython-input-18-523c0d2a27d3> in <cell line: 1>()
----> 1 main(trn_df, val_df)
...
/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py in _attn(self, query, key, value, attention_mask, head_mask)
199 # Need to be a tensor, otherwise we get error: RuntimeError: expected scalar type float but found double.
200 # Need to be on the same device, otherwise RuntimeError: ..., x and y to be on the same device
--> 201 mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
202 attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
203

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.```

fiery maple Oct 8, 2023, 11:52 AM

#

Guys, I want to use a Python DSL and Lua as the configuration language for a pipeline that manipulates data, trains, validates, and deploys a model, and of course plots graphs. Lua is a powerful language, so we can use it to create custom feature engineering functions. I have some questions and concerns about this:

I would like to avoid recalculating expensive feature engineering steps. My idea is to hash the current function code plus the columns it processes. If that hash already points to a parquet file, the pipeline loads the file instead of processing all the data again. One question is: how accurate are the column statistics from pandas' describe()? And is there a chance of a hash collision if I combine the description of the columns with the code I will use?

left tartan Oct 8, 2023, 12:36 PM

#

Why would you need to include the describe() output? Seems like you’d be fine if you capture all the inputs/code/etc?

fiery maple Oct 8, 2023, 1:05 PM

#

Because if the dataset changes for some reason, for example by adding more rows, then in my mind the feature needs to be recalculated?

#

Or if any kind of imputation was applied, for example filling NaN values with zero
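
One way to cover both concerns (added rows, imputed values) without relying on summary statistics is to hash the column contents themselves, since two different datasets can share the same `describe()` output. The sketch below is an assumption-laden illustration, not the poster's pipeline: `cache_key` is a hypothetical helper, and the transformation code is passed in as plain text.

```python
import hashlib

import pandas as pd


def cache_key(code_text: str, df: pd.DataFrame, columns: list) -> str:
    """Hash the transformation source plus the actual values of the input columns."""
    digest = hashlib.sha256()
    digest.update(code_text.encode("utf-8"))
    for col in columns:
        digest.update(col.encode("utf-8"))
        # One uint64 hash per row; index=True makes row labels count as well
        row_hashes = pd.util.hash_pandas_object(df[col], index=True)
        digest.update(row_hashes.to_numpy().tobytes())
    return digest.hexdigest()


df = pd.DataFrame({"price": [1.0, 2.0, None], "volume": [10, 20, 30]})
code = "lambda d: d['price'] * d['volume']"

key_before = cache_key(code, df, ["price", "volume"])
key_after = cache_key(code, df.fillna(0), ["price", "volume"])
print(key_before != key_after)  # imputation changes the key
```

The resulting hex string could serve as the parquet file name. With SHA-256 an accidental collision is not a practical concern; the cost is that every row has to be read once to compute the key.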

solemn glen Oct 8, 2023, 1:31 PM

#

Hi, I'm working on weighting some values that will feed a larger data science dataset, but I'm having trouble figuring out the best way to weight the values to keep the results appropriate.

    # Define weight factors for each parameter
    damage_weight = 0.3
    reproducibility_weight = 0.1
    exploitability_weight = 0.1
    affected_users_weight = 0.2
    discoverability_weight = 0.2
    
    # Calculate the weighted sum of the parameters
    weighted_sum = (
        self.damage * damage_weight +
        self.reproducibility * reproducibility_weight +
        self.exploitability * exploitability_weight +
        self.affected_users * affected_users_weight +
        self.discoverability * discoverability_weight
    )
    
    # Scale the weighted sum to fit within your desired range (0 to DREAD_RISK_CAP)
    scaled_risk_value = (weighted_sum / (damage_weight + reproducibility_weight + exploitability_weight + affected_users_weight + discoverability_weight)) * self.DREAD_RISK_CAP
    
    return min(scaled_risk_value, self.DREAD_RISK_CAP)

solemn glen Oct 8, 2023, 1:56 PM

#

    RISK_LEVELS = {
        (1, 12): "Notice",
        (13, 18): "Low",
        (19, 36): "Medium",
        (37, DREAD_RISK_CAP): "High",
    }

these are the risk levels that I'm considering to make it easier to talk about numbers but I end up with results that are always high and I want to avoid that.

left tartan Oct 8, 2023, 2:51 PM

#

solemn glen ``` RISK_LEVELS = { (1, 12): "Notice", (13, 18): "Low", ...

First thing I’d do is ask: what makes them ‘always high’?

#

Is one metric overweighted? Perhaps a log scale is better for that metric, etc

left tartan Oct 8, 2023, 3:10 PM

#

solemn glen ``` RISK_LEVELS = { (1, 12): "Notice", (13, 18): "Low", ...

Another method of weighting is to calculate the percentile score for each metric and score based on the percentile range, i.e. the 0-25th percentile is a 0, 25-50 is a 1, etc. Or perhaps use a percentile of the composite. I'm just thinking out loud; it may not make sense in this case if the metrics are already computed somewhat arbitrarily.
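
A minimal sketch of the percentile idea, assuming the metrics for many findings sit in a pandas DataFrame. The column names and ratings below are made up, and `quartile_scores` is a hypothetical helper rather than part of the original code.

```python
import numpy as np
import pandas as pd


def quartile_scores(metrics: pd.DataFrame) -> pd.DataFrame:
    """Map each metric to 0-3 according to the quartile of its percentile rank."""
    pct = metrics.rank(pct=True)  # percentile rank in (0, 1], per column
    return (np.ceil(pct * 4) - 1).astype(int)


# Made-up ratings for four findings
metrics = pd.DataFrame({
    "damage": [2, 5, 8, 10],
    "discoverability": [9, 3, 6, 1],
})

scores = quartile_scores(metrics)
print(scores)
print(scores.sum(axis=1))  # composite score per finding
```

Because the scores are relative to the other findings, they cannot all land in the top bucket, which is the behaviour the fixed thresholds were producing.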

topaz night Oct 8, 2023, 3:15 PM

#

hey guys i know this is a python server but pls help me im losing my insanity slowly rn

#

cold osprey Oct 8, 2023, 3:55 PM

#

topaz night

looks fine

lunar current Oct 8, 2023, 4:29 PM

#

lunar current Any idea why this error occurs? ``` Special tokens have been added in the vocabu...

solemn glen Oct 8, 2023, 4:31 PM

#

left tartan Another method of weighting is to calculate the percentile score for each metric...

This is excellent and I think where I’ll take this

gaunt geyser Oct 8, 2023, 6:28 PM

#

How do you make a regression plot when there are NA values in your data?

#

The two columns I'm using both have them in random spots, so I can't use dropna()

left tartan Oct 8, 2023, 6:34 PM

#

gaunt geyser The two columns I'm using both have them in random spots, so I can't use dropna(...

You can ffill and bfill na’s. You can also remove rows that contain a NA value. Whatever makes sense for the data/problem.

gaunt geyser Oct 8, 2023, 6:42 PM

#

I removed the rows containing NA values, but am still getting this error Cannot cast ufunc 'svd_n_s' input from dtype('O') to dtype('float64') with casting rule 'same_kind'

#

That sounds like there are still NA values, but there are none in the new df

tidal bough Oct 8, 2023, 6:44 PM

#

Seems like the issue is that you have an object dtype for some reason - you probably want to cast these columns to a normal dtype like np.float64.
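
For illustration, a small sketch of that cast on a made-up frame whose columns ended up as `object` because numbers are mixed with strings and missing values. `pd.to_numeric` with `errors="coerce"` turns anything unparseable into NaN, after which the incomplete rows can be dropped.

```python
import pandas as pd

# Made-up data: numbers mixed with strings and missing values
df = pd.DataFrame(
    {"x": ["1", "2", None, "4"], "y": [1.5, "2.5", 3.5, None]},
    dtype=object,
)
print(df.dtypes)  # both columns are object

numeric = df.apply(pd.to_numeric, errors="coerce")  # unparseable values become NaN
clean = numeric.dropna(subset=["x", "y"])           # keep rows where both are present
print(clean.dtypes)  # both columns are float64
```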

gaunt geyser Oct 8, 2023, 6:45 PM

#

yeah that solved it

#

thank you both

#

how does something that was previously int become an object dtype?

odd meteor Oct 8, 2023, 7:03 PM

#

gaunt geyser how does something that was previously int become an object dtype?

If the datatype of the column was int when you did your EDA, then you must have unknowingly cast the column to object at some point.

odd meteor Oct 8, 2023, 7:07 PM

#

lunar current

Have you tried implementing what was suggested in the error message yet? Did it fix the problem?

tranquil beacon Oct 8, 2023, 8:05 PM

#

Hi, I have an out-of-memory issue that occurs when I try to download a very large dataset in parts, through a series of API calls run asynchronously. pyarrow's concat_tables method was making my service run out of memory and crashing the instance; I increased the EKS memory and that part is fine for now. I then take the DataFrame and run to_csv, which causes an OOM failure, and I am limited on cluster size in EKS. Is there a more memory-efficient alternative to pandas to_csv for converting a DataFrame to CSV?

#

I am passing 100k chunk size on the to_csv call

#

sorry if wrong chat

#

I have 3500mi cpu limit and 15GI memory and although its only 5.5gb data it still runs oom

serene scaffold Oct 8, 2023, 8:31 PM

#

tranquil beacon Hi, I have a out of memory issue ocurring when i try to download a very large d...

This is the right channel for pandas and pyarrow questions. you can add more rows to an existing CSV file like this

df.to_csv('existing_file.csv', mode='a', header=False)

the mode='a' is append mode, and header=False makes sure that the column headers aren't duplicated.
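
A sketch of how the append pattern can be used to write a frame slice by slice. The helper name and chunk size are illustrative, and whether this lowers peak memory depends on where the memory is actually going.

```python
import os
import tempfile

import pandas as pd


def write_in_chunks(df: pd.DataFrame, path: str, rows_per_chunk: int) -> None:
    """Write df to CSV one slice at a time, appending after the first slice."""
    for start in range(0, len(df), rows_per_chunk):
        first = start == 0
        df.iloc[start:start + rows_per_chunk].to_csv(
            path,
            mode="w" if first else "a",  # overwrite once, then append
            header=first,                # header only on the first slice
            index=False,
        )


df = pd.DataFrame({"a": range(10), "b": range(10, 20)})
path = os.path.join(tempfile.mkdtemp(), "out.csv")
write_in_chunks(df, path, rows_per_chunk=4)
print(pd.read_csv(path).shape)
```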

#

also, depending on how you're using pyarrow.concat_tables, you might be using double the memory you need at any given time. because by default, the data for all the tables is copied into a new one, so it's like each row exists in two places at once.

#

also it looks like you could use pyarrow's native CSV writer instead https://arrow.apache.org/docs/python/generated/pyarrow.csv.write_csv.html

#

that would probably save you from having to copy the pyarrow object into a pandas DataFrame

tranquil beacon Oct 8, 2023, 8:38 PM

#

serene scaffold that would probably save you from having to copy the pyarrow object into a panda...

Thank you, I will try this instead

serene scaffold Oct 8, 2023, 8:39 PM

#

@tranquil beacon the other thing you can do to save memory is to create as few variables as possible (using more nested expressions) so that intermediate objects are garbage collected as soon as possible

tranquil beacon Oct 8, 2023, 8:40 PM

#

this wont allow writing to the same csv in parallel right

serene scaffold Oct 8, 2023, 8:46 PM

#

tranquil beacon this wont allow writing to the same csv in parallel right

I'm not sure. the actual writing to the file is handled by the operating system, and it probably has a lock for that

past meteor Oct 8, 2023, 9:18 PM

#

tranquil beacon Hi, I have a out of memory issue ocurring when i try to download a very large d...

Are you doing any transformations on the data or is it just download => write?

Can you get the entire DF in memory? Do you specifically go OOM when writing or before that already?

tranquil beacon Oct 8, 2023, 9:29 PM

#

This is after etl. I can manage to get the df in a data frame and craps out when writing to csv

#

I have a big list of tables which pyarrow now can concat to data frame after increasing memory

#

This needs to happen this way as I don’t think the suggestion of using built in to csv will work in async

past meteor Oct 8, 2023, 9:31 PM

#

tranquil beacon This is after etl. I can manage to get the df in a data frame and craps out whe...

And the ETL is done in Pyarrow or is this the final step?

#

With concat you mean join (so concatenating by column and not by row)

tranquil beacon Oct 8, 2023, 9:39 PM

#

Etl is done by a separate process, can be weeks or months before it’s pulled

#

I’m pretty sure it works by column but how I will get a list of pyarrow tables not sure how that joining logic works

#

But now *

#

I was looking into pandas gzip

#

Using compression='gzip' in the to_csv call; however, not sure that will be any different, as I'm guessing a copy will still be made

past meteor Oct 8, 2023, 9:47 PM

#

I want to understand your use case first better because I might have a solution

past meteor Oct 8, 2023, 9:48 PM

#

tranquil beacon Etl is done by a separate process, can be weeks or months before it’s pulled

So an ETL happens and the result is somewhere behind an endpoint. You query it to get Pyarrow tables, correct?

To me this is already strange, Arrow is specifically an in-memory format. How is the data arriving?

tranquil beacon Oct 8, 2023, 10:03 PM

#

On point one yes.
The api we are using already returns arrow serialized results which are stored in tables and we join using pyarrow concat_tables

#

A list of Pyarrow.Tables

tranquil beacon Oct 8, 2023, 10:24 PM

#

Said the same thing in 3 different responses my bad, I am using external lib which is a hard requirement that does that pyarrow join and so the data frame to csv part I can control only

#

Unfortunately cannot work with parquet files for this one

bold timber Oct 8, 2023, 10:33 PM

#

anyone can give me explanation about this?

small wedge Oct 8, 2023, 10:34 PM

#

wdym when no loss is used? you're using binary cross entropy in both no?

#

oh nvm

#

I see the comment now

bold timber Oct 8, 2023, 10:35 PM

#

small wedge wdym when no loss is used? you're using binary cross entropy in both no?

In the first experiment I used # to comment out the loss function

small wedge Oct 8, 2023, 10:37 PM

#

https://huggingface.co/docs/transformers/main_classes/model#transformers.TFPreTrainedModel.compile

#

looks like they use RMSProp if you don't specify a loss function

#

oh nvm that's optimizer

#

https://github.com/huggingface/transformers/blob/v4.34.0/src/transformers/modeling_tf_utils.py#L1508-L1519

arctic wedgeBOT Oct 8, 2023, 10:39 PM

#

src/transformers/modeling_tf_utils.py lines 1508 to 1519

```
"""
This is a thin wrapper that sets the model's loss output head as the loss if the user does not specify a loss
function themselves.
"""
if loss in ("auto_with_warning", "passthrough"):  # "passthrough" for workflow backward compatibility
    logger.info(
        "No loss specified in compile() - the model's internal loss computation will be used as the "
        "loss. Don't panic - this is a common way to train TensorFlow models in Transformers! "
        "To disable this behaviour please pass a loss argument, or explicitly pass "
        "`loss=None` if you do not want your model to compute a loss. You can also specify `loss='auto'` to "
        "get the internal loss without printing this info string."
    )
```

small wedge Oct 8, 2023, 10:39 PM

#

looks like it uses "the model's internal loss computation"

#

I have no idea what that means for a bert model, maybe the loss that the original model was saved with? I assume whatever loss function it's using is better suited to the task than binary cross entropy

#

important to note leaving out loss does not train with no loss function as you can see from the message

past meteor Oct 8, 2023, 10:42 PM

#

tranquil beacon Said the same thing in 3 different responses my bad, I am using external lib whi...

So you can't sneak in Polars there? Because:

It uses arrow under the hood
It was made for this. You can make all of the Arrow tables lazyframes, merge and then sink_csv. Sink streams it to disk instead of trying to compute everything at once.

small wedge Oct 8, 2023, 10:43 PM

#

looking into the code more we see this

        if loss in ("auto_with_warning", "passthrough"):  # "passthrough" for workflow backward compatibility
            logger.info(
                "No loss specified in compile() - the model's internal loss computation will be used as the "
                "loss. Don't panic - this is a common way to train TensorFlow models in Transformers! "
                "To disable this behaviour please pass a loss argument, or explicitly pass "
                "`loss=None` if you do not want your model to compute a loss. You can also specify `loss='auto'` to "
                "get the internal loss without printing this info string."
            )
            loss = "auto"
        if loss == "auto":
            loss = dummy_loss
            self._using_dummy_loss = True

dummy loss is defined up here

def dummy_loss(y_true, y_pred):
    if y_pred.shape.rank <= 1:
        return y_pred
    else:
        reduction_axes = list(range(1, y_pred.shape.rank))
        return tf.reduce_mean(y_pred, axis=reduction_axes)

bold timber Oct 8, 2023, 10:45 PM

#

small wedge looks like it uses "the model's internal loss computation"

So which is the reliable result, Experiment 1 or Experiment 2?

It confuses me because I think it's impossible to train a model without optimizing a loss function; that's what machine learning is and what the optimizer does: it adjusts the weights to minimize the result of the loss function.

#

Can you elaborate on this?

small wedge Oct 8, 2023, 10:47 PM

#

You're right, gradient descent is optimization of the cost function, you can't train a model using GD without loss

#

when you leave out the loss keyword argument, it uses that default 'auto_with_warning' but as you can see by the message:

Don't panic - this is a common way to train TensorFlow models in Transformers!

It seems like it uses a separate loss function, which in this case (assuming your dataset is valid) performs better than plain BCE

#

it sets the loss argument to that function I linked earlier dummy_loss, and has a whole algorithm later in the code that calculates loss for you

#

that said, i'm not sure exactly what loss function that is and I'm not willing to dive any deeper into the source, but I'd say that explains why you got better performance by using the default loss
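
To make the `dummy_loss` step concrete, here is a NumPy re-creation of the function quoted above (an illustration only, not the library code). It ignores `y_true` entirely and just averages whatever arrives as `y_pred` over the non-batch axes, which fits the docstring's description: the loss value comes from the model's own loss output, and the Keras-level loss merely passes it through.

```python
import numpy as np


def dummy_loss_np(y_true, y_pred):
    """NumPy mirror of the quoted dummy_loss: y_true is never used."""
    y_pred = np.asarray(y_pred, dtype=float)
    if y_pred.ndim <= 1:
        return y_pred
    reduction_axes = tuple(range(1, y_pred.ndim))
    return y_pred.mean(axis=reduction_axes)


# A loss that is already one value per example passes through unchanged
per_example_loss = np.array([0.7, 0.2, 0.4])
print(dummy_loss_np(None, per_example_loss))

# A higher-rank loss is averaged down to one value per example
per_token_loss = np.array([[1.0, 3.0], [2.0, 4.0]])
print(dummy_loss_np(None, per_token_loss))
```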

tranquil beacon Oct 8, 2023, 10:52 PM

#

past meteor So you can't sneak in Polars there? Because: 1) It uses arrow under the hood 2...

I will definitely check it out

past meteor Oct 8, 2023, 10:52 PM

#

tranquil beacon I will definitely check it out

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.LazyFrame.sink_csv.html and https://pola-rs.github.io/polars/user-guide/lazy/using/

bold timber Oct 8, 2023, 10:55 PM

#

small wedge it sets the `loss` argument to that function I linked earlier `dummy_loss`, and ...

Which do you think is more reliable between both experiments?

small wedge Oct 8, 2023, 10:55 PM

#

what do you mean by reliable?

#

reliability (as in being able to trust your model's ability to generalize) would be based on your dataset, assuming your dataset was well constructed the performance you get should pretty often be a reliable outcome.

#

granted there are caveats there (very small batch sizes in minibatch GD can give you wildly different results on different training runs, so your model might be able to perform well but just fell into a local minimum, or couldn't get a good enough estimate of the gradient to find one, for example), but generally your dataset is the standard for your model's ability to generalize

tranquil beacon Oct 8, 2023, 11:04 PM

#

past meteor https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.LazyFrame.s...

Thank you

bold timber Oct 8, 2023, 11:06 PM

#

small wedge what do you mean by reliable?

What I mean is: based on the 2 experiments above, which result should I believe, that the model performance is good (without loss function) or that the model performance is bad (with loss function)?

small wedge Oct 8, 2023, 11:08 PM

#

well the dichotomy here is not with loss function or without loss function

#

it's how the model performed using two separate loss functions

#

you can think of it almost as a separate model by using separate loss functions

#

if the first one consistently performs better, go with that.

bold timber Oct 8, 2023, 11:10 PM

#

ah I see, thank you so much for the explanation! @small wedge

vale swallow Oct 9, 2023, 12:32 AM

#

Why do you have to split the data into x and y? What do the x and y stand for? Is it something relating to independent and dependent data? Like input and output? Not sure.

small wedge Oct 9, 2023, 12:41 AM

#

vale swallow Why do you have to split the data into x and y? What do the x and y stand for? I...

If you think of our models as functions f, x is the input, y is the output. The model learns to estimate a function that turns the x's into the y's; f(x) = y

vale swallow Oct 9, 2023, 12:46 AM

#

small wedge If you think of our models as functions `f`, x is the input, y is the output. T...

I see, and when you split the data, what is the split? Like the percentages?

serene scaffold Oct 9, 2023, 12:46 AM

#

vale swallow I see, and when you split the data, what is the split? Like the percentages?

you might be mixing up x/y and train/test

small wedge Oct 9, 2023, 12:47 AM

#

The splits are for the data we train the model on (training data), and the data we test the model on (test/validation data) to ensure that it's not overfitting on our training data

#

You need train_x and test_x as well as train_y and test_y

serene scaffold Oct 9, 2023, 12:48 AM

#

68747470733a2f2f7265732e636c6f7564696e6172792e636f6d2f6479643931316b6d682f696d6167652f75706c6f61642f665f6175746f2c715f6175746f3a626573742f76313534333833363838332f696d6167655f365f6366706a70722e706e67.png

#

@vale swallow make sense?

#

what is the split? Like the percentages?
for a train/test split, 80/20 is a pretty common place to start.
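
A small sketch tying the two ideas together on made-up data: `X` holds the inputs, `y` the outputs, and a shuffled 80/20 split yields `train_x`, `test_x`, `train_y`, and `test_y`. Only NumPy is used here; the array sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

X = np.arange(20, dtype=float).reshape(10, 2)  # inputs: 10 rows, 2 features
y = X.sum(axis=1)                              # outputs: one target per row

order = rng.permutation(len(X))  # one shuffled order, shared by X and y
cut = int(0.8 * len(X))          # 80% of the rows go to training
train_idx, test_idx = order[:cut], order[cut:]

train_x, test_x = X[train_idx], X[test_idx]
train_y, test_y = y[train_idx], y[test_idx]
print(train_x.shape, test_x.shape)
```

Indexing `X` and `y` with the same shuffled indices is what keeps each input row paired with its own output.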

vale swallow Oct 9, 2023, 12:52 AM

#

Ok, yeah, it makes sense, thank you both!!

shut girder Oct 9, 2023, 12:54 AM

#

Does anyone have any free book suggestions for statistics or data analysis overall? I consider myself a beginner at data analysis. I currently have a good understanding of Python fundamentals and a simple idea of what NumPy, Pandas, and Matplotlib are used for when it comes to data analysis.

coarse mica Oct 9, 2023, 12:57 AM

#

hi

#

does somebody know if the freelancer market for data science is a good path to follow?

#

i've been concerned about this these days

serene scaffold Oct 9, 2023, 12:59 AM

#

shut girder Does anyone have any free book suggestions for statistics or data analysis overa...

#data-science-and-ml message

shut girder Oct 9, 2023, 1:03 AM

#

serene scaffold https://discord.com/channels/267624335836053506/366673247892275221/1150186929053...

Thanks, should I start with the statistical learning book if I'm focused on mainly data analysis?

serene scaffold Oct 9, 2023, 1:07 AM

#

shut girder Thanks, should I start with the statistical learning book if I'm focused on main...

yes. statistics is, after all, the actual science of data.

orchid cargo Oct 9, 2023, 2:49 AM

#

Should i take Data Science or ML? Im very tired to think about it, cuz both of them looks good. But i really want to learn one of it.

viscid silo Oct 9, 2023, 2:53 AM

#

Does anyone have any resources for creating transfer functions using time history input and output from an LTI system? Wanted to first understand a single input single output (SISO) system and then work up to a multiple input multiple output system.

minor cloak Oct 9, 2023, 3:40 AM

#

[Looking for open-source contributors, see below]

Hi there,

I recently open-sourced PyGraft, a configurable Python tool to generate synthetic knowledge graphs easily!
It can be used in any AI tasks (Machine Learning, Deep Learning, Reasoning, etc.) provided that you work with graphs.

The repo is gaining a lot of visibility, and I am looking for motivated contributors to support me in implementing new features and unit tests. Ideally, you should have a general understanding of knowledge graphs, semantic web, RDF/RDFS, and OWL vocabularies. In addition, strong Python programming skills are required. Experience in Software Engineering is a plus 🙂

DM me if you would like to contribute!

Otherwise, you can still take a look and star the repo if you find the project interesting!

https://github.com/nicolas-hbt/pygraft

GitHub

GitHub - nicolas-hbt/pygraft: Configurable Generation of Synthetic ...

Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips - GitHub - nicolas-hbt/pygraft: Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Finger...

misty flint Oct 9, 2023, 3:40 AM

#

orchid cargo Should i take Data Science or ML? Im very tired to think about it, cuz both of t...

data engineering Running https://docs.google.com/spreadsheets/u/0/d/1GOO4s1NcxCR8a44F0XnsErz5rYDxNbHAHznu4pJMRkw/htmlview

serene scaffold Oct 9, 2023, 3:43 AM

#

orchid cargo Should i take Data Science or ML? Im very tired to think about it, cuz both of t...

Do you have descriptions of each course? Because there's potential overlap.

hot pivot Oct 9, 2023, 9:16 AM

#

https://github.com/TommiKark/AdditiveAutoencoder Could be a very influential paper. Unfortunately they went and implemented the whole thing in matlab.. 😂

GitHub

GitHub - TommiKark/AdditiveAutoencoder: This repository contains th...

This repository contains the reference implementation of the additive autoencoder. The technique is derived and experiments summarized in Manuscript. SupplementaryInformation documents all the expe...

#

https://techxplore.com/news/2023-10-technique-based-18th-century-mathematics-simpler.html More info on the paper and link to the article itself can be found here as well

New technique based on 18th-century mathematics shows simpler AI mo...

Researchers from the University of Jyväskylä were able to simplify the most popular technique of artificial intelligence, deep learning, using 18th-century mathematics. They also found that classical training algorithms that date back 50 years work better than the more recently popular techniques. Their simpler approach advances green IT and is ...

muted hollow Oct 9, 2023, 9:37 AM

#

Hey guys, I'm a bit new to deep learning. I'm studying neural networks and have a question. The second matrix below represents a neural network derived from a bag of words. The third matrix represents the classification of the second matrix into its respective classes. I want to know what happens when the first layer is initialized: will it create a layer with 3 neurons corresponding to 0,1,1, or will it generate 3 neurons, each neuron being a line of the matrix of 0,1,1; 1,0,1; 1,1,0?

#

past meteor Oct 9, 2023, 10:24 AM

#

muted hollow Hey guys, im a bit new to deep learning, Im studying Neuron network but having a...

What is your actual input? And what is your expected output?

Input == bag of words?
Output == one of multiple classes?

#

The numbers ChatGPT is showing you are very strange; I'd forget that and "start from scratch"

spiral frigate Oct 9, 2023, 10:38 AM

#

Guys, I have a problem

#

You can see my problem in python help

#

"Problems importing tensorflow"

muted hollow Oct 9, 2023, 10:42 AM

#

past meteor What is your actual input? And what is your expected output? Input == bag of w...

I have 3 sentences made up of 3 words. The 2nd matrix shows whether each word appears in each sentence (each line is a sentence; a word is '1' if it shows up and '0' if not).
The outputs are 2 classes, for example 'hello' and 'goodbye', one for each line.

past meteor Oct 9, 2023, 10:43 AM

#

And what is the 1st?

muted hollow Oct 9, 2023, 10:44 AM

#

the first is for shuffle the data

#

so after every shuffle, u still keep the right order of input and output

#

Pardon me, it's kinda hard to explain in my non-native language.

past meteor Oct 9, 2023, 10:50 AM

#

The test train split?

past meteor Oct 9, 2023, 10:50 AM

#

muted hollow so after every shuffle, u still keep the right order of input and output

Is this a toy example to understand neural nets better or do you have an actual use case?

fallow frost Oct 9, 2023, 2:12 PM

#

after playing around with duckdb and datafusion, I'm now convinced that both of them are not production ready

#

gonna try clickhouse now

#

I had high hopes for datafusion being implemented in Rust...

desert oar Oct 9, 2023, 2:15 PM

#

fallow frost after playing around with `duckdb` and `datafusion`, I'm now convinced that both...

what are you finding lacking about them? if you're comparing to clickhouse i feel like maybe you're unsure of the right tool for the job. duckdb, datafusion, et al are in-memory query engines, not complete data warehouses like clickhouse.

#

duckdb and datafusion are basically in-process/in-memory query engines, not all that different from polars or even pandas

#

being written in rust also should not be taken as an indicator of being production-ready or not. rust is just a programming language.

past meteor Oct 9, 2023, 2:35 PM

#

desert oar duckdb and datafusion are basically in-process/in-memory query engines, not all ...

Does Pandas have a query engine?

sterile barn Oct 9, 2023, 3:12 PM

#

Hey y'all.
After a good amount of data processing and encoding, I noticed my function for encoding certain columns gets a little messed up because of a certain value called "NA". I understand that this is because pandas by default interprets the string "NA" as Null/NaN or what have you. Can I avoid this somehow? All other values are correctly understood by the function, so I know the problem comes from "NA" being interpreted wrongly (source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html, under "na_values").

Function below for reference:

def rateOrdinalE(df:pd.DataFrame, cols:list) -> pd.DataFrame:
    res_df = df.copy()
    rating_map = {'NA': 0, 'Unf': 1, 'Rec': 2, 'BLQ': 3, 'ALQ': 4, 'GLQ': 5}
    for i in cols:
        res_df[i[0]] = res_df[i[0]].map(rating_map)
    return res_df

#

Never mind, brain finally woke up. Seem to have solved it with a little revision past the for-loop:

res_df = res_df.fillna(0)
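
An alternative to filling afterwards, sketched under the assumption that the data comes in through `pd.read_csv`: passing `keep_default_na=False` stops pandas from treating strings such as "NA" as missing, so the mapping sees the literal value. The column names and rows below are made up.

```python
import io

import pandas as pd

# Made-up CSV in which "NA" is a real category, not a missing value
csv_text = "finish_type,sale_price\nGLQ,200000\nNA,150000\nUnf,120000\n"

# keep_default_na=False: do not convert "NA", "N/A", "" and similar to NaN
df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

rating_map = {"NA": 0, "Unf": 1, "Rec": 2, "BLQ": 3, "ALQ": 4, "GLQ": 5}
df["rating"] = df["finish_type"].map(rating_map)
print(df["rating"].tolist())
```

The trade-off is that genuinely missing cells are then no longer converted to NaN either, unless they are listed explicitly through `na_values`.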

left tartan Oct 9, 2023, 4:16 PM

#

fallow frost after playing around with `duckdb` and `datafusion`, I'm now convinced that both...

How so?

topaz night Oct 9, 2023, 4:37 PM

#

cold osprey looks fine

tf am i doing wronggg danggggg

#

thanks you so much tho u really help me out ❤️ @cold osprey

left kraken Oct 9, 2023, 5:16 PM

#

Hello all I am new to this AI field, so want to know exactly where and how to start to get familiar with Machine learning and Data Science. So can anyone help me out with the roadmap and sources and some basic projects which makes learning interesting..?

topaz night Oct 9, 2023, 5:43 PM

#

left kraken Hello all I am new to this AI field, so want to know exactly where and how to st...

ez bro, i have 5 years of experience in the ai field and am helping to build chatgpt, but first u need a solid foundation in MATH, PROGRAMMING, AND DATA HANDLING

second, u need to learn the basics of course. here are a few examples u can learn from, i learned from them too:
Introduction to Machine Learning by Andrew Ng on Coursera, Python for Data Analysis by Wes McKinney, Introduction to Statistical Learning (ISLR) or Elements of Statistical Learning (ESL) by Hastie, Tibshirani, and Friedman
and some online courses on Coursera, edX, and Udacity

third
then after u learn some shi u need to do some project like kaggle or self project

fourth
after u have mastered the common knowledge lets step it up to next level knowledge
like
-Machine Learning Algorithms
-deep learning
-data processing

fifth
u need to find what u want
-theres natural language processing
-computer vision
-reinforcement learning
(perfect ai like chatgpt need them all btw)

sixth
u go to some advance topic like
-ensemble method
-big data
-model deployment

and then seventh
because ai will always growing u need to stay update on ai field like follow ai forum or community

*note if u mastered all that step just take some ai intern and tell em what u capable off so u can test ur skill and what u dont have so u can learn more and yeah stay sane bro oh yeah i forgor stay healthy and drink a lot of water cuzz if sick cuzz learn and die of it its really shit stupid etc

#

hope that helps and good luck

slender bone Oct 9, 2023, 5:54 PM

#

left kraken Hello all I am new to this AI field, so want to know exactly where and how to st...

https://roadmap.sh/ai-data-scientist Check it out!

roadmap.sh

AI and Data Scientist Roadmap

Learn to become an AI and Data Scientist using this roadmap. Community driven, articles, resources, guides, interview questions, quizzes for modern backend development.

abstract wasp Oct 9, 2023, 6:12 PM

#

Hi, not sure why this is here--why does it have wrapper 1, 2, 3 instead of the layers I added (flatten/dense) I'm using Resnet50 on my data, this is my code:
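`import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten

resnet_model = Sequential()

# Pretrained ResNet50 backbone without its classification head.
# classes=77 has no effect when include_top=False.
pretrained_model = tf.keras.applications.ResNet50(
    include_top=False,
    input_shape=(224, 224, 3),
    pooling='avg',
    classes=77,
    weights='imagenet')

# Freeze the backbone so only the new layers are trained.
for layer in pretrained_model.layers:
    layer.trainable = False

resnet_model.add(pretrained_model)
resnet_model.add(Flatten())  # redundant after pooling='avg', which already returns a flat vector
resnet_model.add(Dense(512, activation='relu'))
resnet_model.add(Dense(77, activation='softmax'))`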

#

This is what shows up when I look at the summary:

Screenshot_2023-10-09_at_11.13.07_AM.png

#

It's training but this is what shows up when I compile it:

Screenshot_2023-10-09_at_11.13.51_AM.png

fallow frost Oct 9, 2023, 6:27 PM

#

this looks like the issue: https://github.com/duckdb/duckdb/issues/4295

GitHub

When reading parquet, filter using IN () on two values much slower ...

What happens? When reading a parquet dataset (no specific partitioning, so no pruning) via parquet_scan() , applying a filter via column in (value1, value2) is much slower than a single value via c...

left tartan Oct 9, 2023, 6:37 PM

#

fallow frost this looks like the issue: https://github.com/duckdb/duckdb/issues/4295

But your query didn't have an "in" ? There are specific rules about when pushdowns occur, so there are limitations... but your original query was fairly simple.

fallow frost Oct 9, 2023, 6:39 PM

#

left tartan But your query didn't have an "in" ? There are specific rules about when pushdow...

my bad, this is another issue I had with DuckDB a few days ago (hence why I said I dont think its prod-ready)

#

and if I recall correctly, this bug was only with Duckdb, not Datafusion

left tartan Oct 9, 2023, 6:42 PM

#

fallow frost my bad, this is another issue I had with DuckDB a few days ago (hence why I said...

I just don't think that's a fair characterization. You're not describing a bug: Lack of predicate pushdown for complex parquet queries is more of the "absence" of an optimization.

#

And there are workarounds.

#

And, most importantly, any data pipeline will employ multiple technologies, which is why arrow is becoming the backbone of most pipelines.

fallow frost Oct 9, 2023, 6:54 PM

#

left tartan I just don't think that's a fair characterization. You're not describing a bug: ...

it's not an absence of "optimization" when it takes 10x longer than using an equality check

#

its more like an absence of basic functionality that a good percentage of SQL queries use

left tartan Oct 9, 2023, 6:55 PM

#

fallow frost its not an absence of "optimization" when it takes 10x more times than using an ...

Would you prefer the equality check takes 10 times longer? What SQL engine are you comparing it to, that can read a parquet file hot?

#

Being able to read a parquet file, without ingesting to a table, while filtering the contents on the fly based on the selected criteria is fairly bleeding edge stuff. Doesn't bother me one bit that it has limitations.

fallow frost Oct 9, 2023, 6:56 PM

#

left tartan Would you prefer the equality check takes 10 times longer? What SQL engine are y...

I'm comparing SELECT ... WHERE col IN ('hello word') to SELECT ... WHERE col = 'hello word', with the SAME query engine: duckdb

#

and regardless, I think a one-element tuple should be converted to an equality check, but that's beside the point

left tartan Oct 9, 2023, 6:59 PM

#

I hear you, I'm just saying, it's the absence of an optimization you're complaining about. Pushdowns to parquet reading are limited today to column selection and AND-ed equalities: that's known (perhaps it could/should be better documented, but I'm not affiliated with them).

fallow frost Oct 9, 2023, 7:01 PM

#

oh and btw, the column in question is actually a partitioned column (hive style) with only about 30 distinct values, how in the world is it even possible for this query to take more than 20 seconds

left tartan Oct 9, 2023, 7:01 PM

#

I have no idea, but if you could publish a reproducible example, I'd love to take a look.

fallow frost Oct 9, 2023, 7:02 PM

#

youre mistaking this for a lack of a feature, but I'm pretty confident that I can filter the data faster with pyarrow + pure-python loop

#

(even while doing all the filtering in Python)

left tartan Oct 9, 2023, 7:03 PM

#

fallow frost youre mistaking this for a lack of a feature, but I'm pretty confident that I ca...

Would love to see that too. I'm open minded... I use whatever tool is right for the problem I have. For me, it tends to be duckdb, but I write plenty of polars/pandas/numpy/python code too. That's what's wonderful about arrow, I can do all of that without copying the data.

past meteor Oct 9, 2023, 7:03 PM

#

@left tartan how do you test your workflow?

#

I hear a lot about testing in the data/DE spheres but I see no one doing it

#

Data profiling tools like https://greatexpectations.io/ are a massive anti-pattern imo

left tartan Oct 9, 2023, 7:05 PM

#

past meteor @left tartan how do you test your workflow?

We're moving some of our SQL pipelines to DBT, and recently started using Syrupy to do snapshot unit testing in Python.

past meteor Oct 9, 2023, 7:05 PM

#

And what exactly are you testing?

left tartan Oct 9, 2023, 7:06 PM

#

We're not an airflow shop (doesn't fit our use case), but there's also solutions on that side (I think Eivl would be the one to ask on that side).

past meteor Oct 9, 2023, 7:06 PM

#

I don't use Airflow either don't worry

left tartan Oct 9, 2023, 7:07 PM

#

past meteor And what exactly are you testing?

from a data perspective, it's really about testing that our data acquisition is working correctly/consistently, and checking each step along the pipeline to see if we're regressing. So, what we test is primarily injecting known data and making sure we get expected outputs. That's the idea of snapshot testing
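
The snapshot idea described here (inject known data, compare against a previously recorded output) can be sketched with only the standard library. This is a minimal illustration and not how Syrupy works internally: `transform`, `assert_matches_snapshot`, and the JSON snapshot file are all hypothetical names chosen for the example.

```python
import json
import tempfile
from pathlib import Path


def transform(rows):
    """Hypothetical pipeline step: total amount per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return totals


def assert_matches_snapshot(result, snapshot_path):
    """Record the output on the first run; afterwards fail if it changed."""
    path = Path(snapshot_path)
    serialized = json.dumps(result, sort_keys=True, indent=2)
    if not path.exists():
        path.write_text(serialized)  # first run stores the expected output
        return "written"
    if path.read_text() != serialized:
        raise AssertionError(f"output differs from snapshot {path}")
    return "matched"


# Known input that is injected on every test run.
KNOWN_INPUT = [
    {"customer": "a", "amount": 10},
    {"customer": "b", "amount": 5},
    {"customer": "a", "amount": 7},
]

snapshot_file = Path(tempfile.mkdtemp()) / "totals.json"
first_run = assert_matches_snapshot(transform(KNOWN_INPUT), snapshot_file)
second_run = assert_matches_snapshot(transform(KNOWN_INPUT), snapshot_file)
```

The first call records the snapshot and the second one passes only because the output is unchanged, so a refactor of `transform` that alters the totals would fail the check.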

past meteor Oct 9, 2023, 7:07 PM

#

I only know snapshot testing from idk jest and front end JS. I'll think about this...

left tartan Oct 9, 2023, 7:08 PM

#

That's exactly where the idea came from... one of our senior engineers basically said: "Why can't we do something like jest?"

past meteor Oct 9, 2023, 7:08 PM

#

I cycle to work every day and I had a good idea on how to test data pipelines, I think I'll write it out in full.

left tartan Oct 9, 2023, 7:08 PM

#

Previous company, we built out a test framework for end to end tests (again, injecting known inputs), but we didn't do it at the unit level. A combination of dbt (decomposing to individual scripts) and snapshot testing is kinda nice.

left tartan Oct 9, 2023, 7:09 PM

#

past meteor I cycle to work every day and I had a good idea on how to test data pipelines, I...

Would love to hear

past meteor Oct 9, 2023, 7:13 PM

#

left tartan Would love to hear

The gist is simply that you need to know the schema of your output. Around this you have a "contract". The most basic version, which is always in the contract, is basically "I have these N tables with M columns, they are of type T1, T2, ..." This idea is DB agnostic: your RDBMS captures all of this, but other things wouldn't.

On top of this you formulate extra properties or invariants that are part of your contract. For instance, "all sport sessions in my dataset have a unique identifier" or "The meals column contains all the meals consumed, even if other data is missing".

Each line in the contract must either be enforced by your DB, so that violating it is impossible, or be tested. Why? These are exactly the assumptions people downstream make about your data. Finally, never make a backwards-incompatible change to the contract, so no renaming of tables or columns. It just ain't worth it, so much downstream always breaks.

#

The key insight is that I'm not testing transformations, I'm testing whether or not it adheres to my spec. I should be able to refactor my pipeline into something more performant and hit the same result.
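
A contract like this could be checked with a small function that looks only at the output, never at the transformations. The sketch below is one possible reading of the idea, using plain dicts as rows; `CONTRACT_SCHEMA`, `check_contract`, and the column names are made up for illustration.

```python
from collections import Counter

# Hypothetical contract: column names mapped to the Python type each value must have.
CONTRACT_SCHEMA = {"session_id": str, "sport": str, "duration_min": float}


def check_contract(rows):
    """Return a list of human-readable contract violations (empty means OK)."""
    violations = []
    for i, row in enumerate(rows):
        if set(row) != set(CONTRACT_SCHEMA):
            violations.append(f"row {i}: columns {sorted(row)} do not match the schema")
            continue
        for column, expected_type in CONTRACT_SCHEMA.items():
            if not isinstance(row[column], expected_type):
                violations.append(f"row {i}: {column} is not {expected_type.__name__}")
    # Invariant from the contract: every sport session has a unique identifier.
    counts = Counter(row.get("session_id") for row in rows)
    for session_id, n in counts.items():
        if n > 1:
            violations.append(f"session_id {session_id!r} appears {n} times")
    return violations


good = [
    {"session_id": "s1", "sport": "run", "duration_min": 30.0},
    {"session_id": "s2", "sport": "swim", "duration_min": 45.5},
]
bad = good + [{"session_id": "s1", "sport": "bike", "duration_min": 60.0}]
```

Because `check_contract` takes the final rows as input, the pipeline behind them can be rewritten freely as long as the returned list stays empty.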

left tartan Oct 9, 2023, 7:15 PM

#

Still reading, but you're also touching on something I didn't mention. We have validation sql as part of our ~~test cases~~ process, but these are baked into the pipeline itself (so it runs every time, rather than as a unit test), similar to what you're describing.

#

I haven't done much with it yet (still adopting dbt), but this comes to mind: https://docs.getdbt.com/docs/build/tests

past meteor Oct 9, 2023, 7:16 PM

#

Personally I'm very wary of being overzealous and testing too much. I don't want to test if my SQL or polars or Pandas does something, I want to wrap that in a function and check if what it returns, given my input, is what I needed it to return

#

Changing a query should not change my tests etc etc (for the same output)

left tartan Oct 9, 2023, 7:17 PM

#

Yah, that's sort of why I like snapshot testing: I just want to know if something unexpected changed, but I don't want to go through the effort of writing a test for every case. I think I agree with where you're thinking... I like the idea of a combination of validation (assertions) and snapshot testing.

past meteor Oct 9, 2023, 7:18 PM

#

I haven't used DBT but I find it very very sus idk. If you have 4 intermediate DAG nodes you shouldn't test those

#

And all those models introduce coupling as well

#

I like it if and only if the contract states that thing A and thing B must go through the same process. Otherwise I would have linear input-to-output with no shared models, if I were to use DBT that is

past meteor Oct 9, 2023, 7:19 PM

#

left tartan Yah, that's sort of why I like snapshot testing: I just want to know if somethin...

With snapshot testing you're basically checking for regressions right? I've had many instances where it was never right to begin with 🤣

left tartan Oct 9, 2023, 7:22 PM

#

past meteor With snapshot testing you're basically checking for regressions right? I've had ...

Lol, yah, that does indeed happen. But yes, really about regressions.

past meteor Oct 9, 2023, 7:24 PM

#

Tbh, I like having them after I guarantee that what I have is correct especially because backwards incompatible changes (aka. breaking everyone's dashboards) are a massive no-no. I'll think about integrating this.

left tartan Oct 9, 2023, 7:25 PM

#

We do go through fairly lengthy customer acceptance tests, so usually our problem is that some new edge case comes up (ie: data is incorrect under certain conditions)… we fix that, but then need to make sure we didn’t regress under the cases we’ve already validated.

past meteor Oct 9, 2023, 7:26 PM

#

Ah yeah, when you have a very mature pipeline that's the way to go

#

But yeah, my approach is very inspired by sane SWE style testing. I should really write this in full with an actual project. If you squint enough it's TDD on data.

left tartan Oct 9, 2023, 7:31 PM

#

Yah, I like where you’re going with this!

frank quiver Oct 9, 2023, 7:54 PM

#

https://stackoverflow.com/questions/77260669/unable-to-reconstruct-back-the-images-using-ddpm-model

please look at this. I had a problem calculating the reconstruction error using DDPM models

Stack Overflow

Unable to reconstruct back the images using DDPM model

So I have trained a DDPM(diffusion) model and had the checkpoints. now I loaded the checkpoint and to check the performance of the model I have fed images on my test set to the model. The intuition...

fair arrow Oct 10, 2023, 1:30 AM

#

hey all - is this a good place for a question about the pandas library and a student-scheduling app I'm writing?

#

sort of an intermediate-ish question, and I'm trying to avoid sloppy programming practices, so it might take a bit of explanation. Not sure if this is the right channel for asking for help.

small wedge Oct 10, 2023, 1:35 AM

#

fair arrow hey all - is this a good place for a question about the pandas library and a stu...

yes

fair arrow Oct 10, 2023, 1:38 AM

#

... a few moments, I may have solved it.

lapis sequoia Oct 10, 2023, 1:42 AM

#

Can someone help me real quick in python please

fair arrow Oct 10, 2023, 1:42 AM

#

I can try, while I try and answer my own question.

small wedge Oct 10, 2023, 1:42 AM

#

lapis sequoia Can someone help me real quick in python please

don't ask to ask or if someone can help, just ask your question

abstract wasp Oct 10, 2023, 1:53 AM

#

Hi, I’m going to start a sort of a blog where I write stuff related to ML/AI. The main overall subjects I will include are mathematics behind the algorithms, explanation of the actual algorithms alongside some code (project examples). How should I split the menu bar?

left tartan Oct 10, 2023, 1:54 AM

#

fair arrow hey all - is this a good place for a question about the pandas library and a stu...

Or, If it’s an in-depth question and you need to share a lot of code, you may want to open a help thread. #❓｜how-to-get-help

desert oar Oct 10, 2023, 1:55 AM

#

These are a valuable kind of post if you can write them well. Don't worry about the blog format, just start writing. You're more likely to make progress that way.

abstract wasp Oct 10, 2023, 1:55 AM

#

desert oar These are a valuable kind of post if you can write them well. Don't worry about ...

True, I’ll just have it under “articles” for the moment lmao

#

Thanks!

fair arrow Oct 10, 2023, 1:57 AM

#

I think I've got a solution. The short version is that I'm trying to update a DataFrame over which I'm iterating, which Pandas docs explicitly warn against. I'm realizing I need to refactor my code into a better pattern.

desert oar Oct 10, 2023, 2:02 AM

#

fair arrow I think I've got a solution. The short version is that I'm trying to update a Da...

just a note: it's almost always problematic to iterate over something and update it simultaneously. data frames aren't unique in this aspect, it applies to python dicts and lists too
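
The same pitfall is easy to reproduce with plain lists and dicts. A minimal sketch (the function names are made up for the example):

```python
def drop_even_while_iterating(values):
    """Buggy on purpose: removing from the list being iterated skips elements."""
    values = list(values)
    for v in values:
        if v % 2 == 0:
            values.remove(v)  # shifts later items left, so the next one is skipped
    return values


def drop_even_safely(values):
    """Build a new list instead of mutating the one being iterated."""
    return [v for v in values if v % 2 != 0]


def grow_dict_while_iterating(d):
    """Adding keys to a dict during iteration raises RuntimeError."""
    try:
        for key in d:
            d[key + "_copy"] = d[key]
    except RuntimeError as exc:
        return str(exc)
    return None


buggy = drop_even_while_iterating([2, 4, 6, 8])   # leaves some even numbers behind
safe = drop_even_safely([2, 4, 6, 8])
error_message = grow_dict_while_iterating({"a": 1})
```

Lists fail silently by skipping items, while dicts at least raise an error. Collecting the changes separately and applying them after the loop avoids both.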

fair arrow Oct 10, 2023, 2:02 AM

#

is a better pattern to throw the updates into a second dataframe, and then merge the two at the end?

desert oar Oct 10, 2023, 2:02 AM

#

can you describe what you're actually trying to do?

#

in this particular case, not your app in general

fair arrow Oct 10, 2023, 2:03 AM

#

I have a dataframe of students, their class preferences, and the actual scheduled classes. I'm trying to add to the actual scheduled classes if the class isn't overenrolled. The problem is that, as I add students, iterrows() is still working with an old copy of the dataframe, so I can't poll the dataframe for the updated attendance, and the class gets overenrolled.

desert oar Oct 10, 2023, 2:04 AM

#

fair arrow I have a dataframe of students, their class preferences, and the actual schedule...

it's very rare that i actually need iterrows. can you share your code?

#

(and i literally never actually use iterrows. i always use itertuples instead)

fair arrow Oct 10, 2023, 2:04 AM

#

it was because apply() didn't update the dataframe as I iterated either, and I thought iterrows might

desert oar Oct 10, 2023, 2:05 AM

#

fair arrow it was because apply() didn't update the dataframe as I iterated either, and I t...

you might be expecting too much magic. show your code so at least i can see what you were trying to do

fair arrow Oct 10, 2023, 2:05 AM

#

yep, just deleting a portion that didn't work, one second

desert oar Oct 10, 2023, 2:05 AM

#

if you can include an example input and your desired output, that would be very helpful as well

fair arrow Oct 10, 2023, 2:06 AM

#

for day in SCHEDULE_DAYS:
        print("--------now scheduling for day" + str(day))
        
        for preferenceNum in range(1,PREFERENCES_PER_DAY + 1):  
            print("---------now scheduling for preference " + str(preferenceNum))  
            for grade in GRADES_ORDER: 
                print('--------------------now scheduling for grade ' + str(grade))
                #Because the shuffled student data is working with the filtered student data,
                #the indexes must be preserved
                student_data = student_data.sample(frac=1,axis=0, ignore_index=False) #shuffle students
                student_data_filtered_by_grade = student_data.loc[student_data["Grade"] == grade]
                classes_to_add = pd.DataFrame()
                for index, row in student_data_filtered_by_grade.iterrows():
                    r = scheduleClassIfEligible(row, day, preferenceNum, grade, student_data, electives_list)
                    if isinstance(r, pd.Series):# if it's a series and not None...
                        
                        #count r and student_data's concat'd column, pass in only if it wouldn't over-enroll
                        #add to classes_to_add
                        classes_to_add = pd.concat([classes_to_add, r])
                        #add the classes at the end of the grade student_data

#

the bottom portion is just pseudocoded for now

desert oar Oct 10, 2023, 2:07 AM

#

is this something like where each student has a selection of preferences each day, and you randomly assign first preferences, then second, and so on?

fair arrow Oct 10, 2023, 2:08 AM

#

yep yep

desert oar Oct 10, 2023, 2:08 AM

#

it looks like that assignment happens separately for each class year / grade, and for each day?

fair arrow Oct 10, 2023, 2:08 AM

#

I've done it in Java, and was porting it over to this trying to use pandas

desert oar Oct 10, 2023, 2:08 AM

#

how did you do it in java? i would almost argue that this is not a great use for pandas, since you are in fact operating on each row sequentially

fair arrow Oct 10, 2023, 2:09 AM

#

pandas made every other part of this project easier!

desert oar Oct 10, 2023, 2:10 AM

#

fair enough. occasionally i do things like convert my dataframe to a list of dicts, do something with that, and then convert the dicts back to a data frame.

#

(you can do this with pandas of course)
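
A rough sketch of that list-of-dicts approach for the scheduling case, assuming each record has a student name and an ordered list of preferences (with pandas, such records could come from `DataFrame.to_dict("records")` and go back through `pd.DataFrame(records)`). The function name, the keys, and the `capacity` mapping are hypothetical; the point is that the running head count lives in one `Counter` that is updated as seats are handed out, so it is never stale.

```python
import random
from collections import Counter


def assign_by_preference_rounds(students, capacity, seed=0):
    """Hand out first preferences to everyone, then second ones, and so on.

    students: list of dicts with hypothetical keys "student" and "preferences".
    capacity: dict mapping a class name to its maximum enrollment.
    """
    rng = random.Random(seed)
    order = list(students)
    enrolled = Counter()  # live head count per class
    assignments = {record["student"]: None for record in students}
    max_rank = max((len(r["preferences"]) for r in students), default=0)
    for rank in range(max_rank):
        rng.shuffle(order)  # reshuffle so no student is favoured in every round
        for record in order:
            name = record["student"]
            prefs = record["preferences"]
            if assignments[name] is not None or rank >= len(prefs):
                continue
            choice = prefs[rank]
            if enrolled[choice] < capacity.get(choice, 0):
                enrolled[choice] += 1
                assignments[name] = choice
    return assignments, enrolled


students = [
    {"student": "ana", "preferences": ["art", "music"]},
    {"student": "ben", "preferences": ["art", "music"]},
    {"student": "cy", "preferences": ["art", "music"]},
]
capacity = {"art": 1, "music": 2}
assignments, enrolled = assign_by_preference_rounds(students, capacity)
```

Grade ordering and multiple days from the original loop are left out to keep the sketch short; they would be extra outer loops around the same counter.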

#

but i am curious what your java implementation of this same logic looks like

#

are you new to python? or just pandas specifically?

fair arrow Oct 10, 2023, 2:10 AM

#

to be honest I don't know off the top of my head how I did it in Java, it's on my Github. It's been a while. I think I was maintaining two separate tables of data but I did run into bugs trying to keep both in sync.

#

I fixed them but thought this would be a more expandable way to do it

desert oar Oct 10, 2023, 2:11 AM

#

i actually would suggest keeping a separate table here as well

#

but you can rely on the student id / table index (as you already pointed out) to keep it in sync

#

then you can use DataFrame.join (or pd.merge) at the end

#

however i see some issues with your code that you might want to address

#

not necessarily critical problems, but suggestive that you don't have the algorithm laid out clearly in your mind

fair arrow Oct 10, 2023, 2:13 AM

#

hmmm, well - the thing is, is that this worked for our school. But there were a handful of classes overenrolled.

#

But 95 % of the data was good so we used it and I was going to fix these bugs for next semester.

#

And they just did the fixes by hand. It still was a big timesaver.

desert oar Oct 10, 2023, 2:13 AM

#

what does each row in this table represent?

#

and what is r? the code suggests that it's a Series, but a series containing what exactly?

fair arrow Oct 10, 2023, 2:14 AM

#

ah, the thing with r was relatively new

desert oar Oct 10, 2023, 2:14 AM

#

probably the most important favor you can do for yourself when working with data is to be very clear about what each "row" or "entity" represents

fair arrow Oct 10, 2023, 2:14 AM

#

I should stash this and go back to my earlier commit

desert oar Oct 10, 2023, 2:14 AM

#

that's what branches are for!

#

i ask what your java implementation looks like in part because i feel like it might be easier to start from something that more or less works

fair arrow Oct 10, 2023, 2:15 AM

#

once I dig my way out of this bug, that was the plan haha

#

I did actually start from that, but I laid out this code a while ago, like months ago. And it may be a bit much for me to get into right now how it worked, mainly because it's late and I need to teach tomorrow 🙂

desert oar Oct 10, 2023, 2:16 AM

#

fair enough

#

so what is each row of student_data? what are the columns? how are the students' preferences and class schedules represented here? i think i would need to know that in order to help, otherwise it's too much guess work for someone who isn't inside your head

past meteor Oct 10, 2023, 6:08 AM

#

Does it need to be a neural network? You can just get features with something like tsfresh and then cluster with idk k-means.

#

I work in the time series domain and honestly, I'm a sucker for simple solutions 😄

#

K-means has exactly the same property, close clusters are similar as well

#

Their similarity is just the distance between the cluster centers

#

I mean, you should look at K-means (and all clustering) as solving this optimisation problem:

clustering = argmax(distanceBetweenClusters(data)) && argmin(distanceWithinCluster(data))
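
For K-means specifically, the informal statement can be made precise. The notation below is introduced here and is not from the conversation: $x_i$ are the $n$ data points, $C_1,\dots,C_k$ the clusters, $\mu_j$ the mean of cluster $C_j$, and $\bar{x}$ the overall mean. K-means minimizes the within-cluster sum of squares, and because the total scatter is fixed by the data, that is the same as maximizing the between-cluster term in the second identity.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i

\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2
  = \underbrace{\sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2}_{\text{within clusters}}
  + \underbrace{\sum_{j=1}^{k} |C_j| \, \lVert \mu_j - \bar{x} \rVert^2}_{\text{between clusters}}
```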

#

So yes you do actively want to create dissimilar clusters but the fact of the matter is that maybe an optimal clustering has 5 close clusters and the sixth that is very far (outliers, fraud, ...)

#

What are input nodes? Are they just variables?

#

What are catch 22 parameters? Are they just non informative variables?

#

What is catch22

#

ooooh

#

So you have 440N features where N is the number of input series you have per sample

#

In defining your problem you're already thinking in terms of neural nets (weights) etc. I'd definitely take a step back because you might find something very simple! 😄

#

Probably not but who knows

#

But in general, don't think in terms of the solution, keep it very simple first. I tend to write it in LaTeX style notation in a markdown, like really generalize the problem I'm doing.

#

I understand your solution now, you want to make clusters and then models for each cluster

#

Is your problem time series classification?

#

Okay let me make a suggestion

#

Keep it simpler. If you are solving a time series classification problem make N separate models for your thing

#

One using temp, one using humidity, one using velocity (idk)

#

Evaluate the performance of each of them. Keep the models absurdly simple here, for instance the last N lags with an Xgboost type model.

#

This will already tell you something about the relevance of each individual series, not everything but something
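
The "last N lags per series" features mentioned above could be built roughly like this. It is only a sketch of the feature step: the variable names and sample layout are assumptions, and the simple classifier that would be fitted on each table (for example a gradient-boosted model) is left out.

```python
def last_n_lags(series, n):
    """Feature vector made of the last n observations of one series."""
    if len(series) < n:
        raise ValueError("series shorter than the number of lags")
    return list(series[-n:])


def build_feature_tables(samples, n):
    """One feature table per variable, so each can feed its own simple model.

    samples: list of dicts mapping a variable name to that sample's series.
    Returns {variable: [feature vector per sample]}.
    """
    tables = {}
    for sample in samples:
        for variable, series in sample.items():
            tables.setdefault(variable, []).append(last_n_lags(series, n))
    return tables


# Two made-up samples, each with a temperature and a humidity series.
samples = [
    {"temp": [1, 2, 3, 4], "humidity": [10, 11, 12, 13]},
    {"temp": [5, 6, 7, 8], "humidity": [20, 21, 22, 23]},
]
tables = build_feature_tables(samples, n=2)
```

Each entry of `tables` is the input for one univariate model, and comparing their scores gives a first hint about which series carries signal.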

#

The next step I'd do is build a stacking ensemble using my N "simple classifiers" and see if this improved my score

#

Obviously, stacking doesn't take interaction effects at the input level into account, that's the major drawback. The thing is, codewise this is so so simple. I'd also have to see if interaction effects at the input level are relevant for my problem to begin with. If it is then yeah, from that point onwards I'll start thinking of some multivariate time series model compared to an ensemble of univariate models

#

With this solution I have given myself several places to solve the problem prematurely. If you overengineer from the get go you might have wasted time. Additionally, you need to be able to compare your high complexity solution to low complexity solutions to get a sense of whether or not it was worth it in the first place.

Make sense?

#

Have you tried it yet?

#

As in, are you sure none of them are better than random

#

Basically, there's tons of ways to do specifically univariate time series classification. I'd start there. One of them is feature extraction but there's others

#

Do that, proceed into a stacked ensemble and then gradually go towards your solution

#

But idk, maybe I don't understand what you're trying to do in the first place 😅 . Considering there's a lot of NN-specific terminology, it looks like you're set on what you want to do. All I'm saying is, take a step back, formulate your problem without using the words "weight", "cluster", "node", "backpropagation" and then progress from there 🤷

dusk tide Oct 10, 2023, 7:01 AM

#

Hi guys, just finished working on New York taxi trip duration prediction. Here is my notebook, please check it out: https://www.kaggle.com/code/nishchay331/pc-1-new-york-city-taxi-trip-duration . If you have any suggestions for improvement or any better ideas, feel free to let me know. Thank you.

PC-1 | New York City Taxi Trip Duration

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

somber prism Oct 10, 2023, 10:37 AM

#

Hello, I installed the PyTorch GPU module and every day I’ve been training my model for 5 epochs on the GPU, but today for some reason my GPU usage is 0%. Anyone know what’s wrong with it? I didn't make any changes in my code.

#

Also there used to be nvidia rtx 3060 next to CUDA_VISIBLE_DEVICES : but now it’s showing 0

harsh minnow Oct 10, 2023, 11:41 AM

#

I am fine-tuning gpt 3, training it on news documents. When I upload a training file with only system messages, it throws an error and asks for an assistant message. So for each news document I sent the training data as: `{"role": "system", "content": "NEWS Article"}, {"role": "assistant", "content": "Placeholder message for fine-tuning"}`

Now whenever I ask it a question, it responds: "Placeholder message for fine-tuning", what should I do?
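
The placeholder comes back because it is the only assistant text the model ever saw during training: fine-tuning teaches the model to reproduce the assistant turns. A minimal sketch of a more useful training line, assuming the chat-style JSONL format in which each line is a `{"messages": [...]}` object, and assuming real question/answer pairs exist for each article. The helper name and the example article, question, and answer are invented for illustration.

```python
import json


def make_training_line(article_text, question, answer):
    """One JSONL line: the assistant turn holds the text the model should learn to produce."""
    example = {
        "messages": [
            {"role": "system", "content": "Answer questions using the news article provided."},
            {"role": "user", "content": f"Article:\n{article_text}\n\nQuestion: {question}"},
            {"role": "assistant", "content": answer},
        ]
    }
    return json.dumps(example, ensure_ascii=False)


# Invented example data, for illustration only.
line = make_training_line(
    article_text="The city council approved the new bike lanes on Monday.",
    question="What did the city council approve?",
    answer="New bike lanes.",
)
```

If the goal is only to let the model look things up in the articles, retrieving the relevant article and putting it in the prompt at question time may fit better than fine-tuning, since fine-tuning mostly shapes response style.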

cold osprey Oct 10, 2023, 11:44 AM

#

any idea what technical theory means?

#

role is for AI/ML software engineer

left tartan Oct 10, 2023, 11:46 AM

#

Wow, that's an anxiety laden email. I think we can all generally guess, but it would certainly depend on the job title/level

harsh minnow Oct 10, 2023, 11:48 AM

#

harsh minnow I am finetuning gpt 3, I am training it on news documents. When I upload a train...

@serene scaffold

cold osprey Oct 10, 2023, 11:49 AM

#

left tartan Wow, that's an anxiety laden email. I think we can all generally guess, but it w...

i can send a JD

#

https://www.linkedin.com/jobs/view/3729051219

left tartan Oct 10, 2023, 11:52 AM

#

Tough thing is it doesn't give a clue about their stack... but it's fintech

cold osprey Oct 10, 2023, 11:52 AM

#

haha ye i kinda know what they do here

#

i would expect it to be python heavy

#

idk y they have java n cpp

left tartan Oct 10, 2023, 11:53 AM

#

I mean, are they looking for NLP stuff? LLM? More classical EDA? etc

cold osprey Oct 10, 2023, 11:53 AM

#

oh

#

but this seems more like a SWE role right

#

APIs for the models

#

ML/MLOps engineer kinda stuff

left tartan Oct 10, 2023, 11:54 AM

#

Yah, but you're meeting with "head of AI", so unsure if that's an AI oriented theory discussion, or a SWE theory discussion?

cold osprey Oct 10, 2023, 11:54 AM

#

HAHAHA

#

fair point

#

theyre hiring for hella roles

#

https://www.linkedin.com/jobs/search/?currentJobId=3731071291&f_C=14607755

left tartan Oct 10, 2023, 11:56 AM

#

I think I'm just adding pressure not helping 🙂

cold osprey Oct 10, 2023, 11:57 AM

#

HAHA npnp

#

tbh im pretty weak in either of those topics: AI or SWE

#

i mainly do data viz stuff and some pipelines at work so

left tartan Oct 10, 2023, 11:59 AM

#

In general, my interview advice is: Know what you know, be pleasant, and know how to not know the answer to something you don't know, especially when you're talking to someone who's an expert.

#

For AI and SWE, you really can't learn something completely new while prepping for an interview. But... one thing I think is very helpful is: Watch conference videos to learn about current trends in technology.

#

Europython 2023, Pycon US 2023, Pycon UK 2023

#

(beyond that, maybe ask the same question in #career-advice and you'll get better input)

cold osprey Oct 10, 2023, 12:10 PM

#

Ait cool thanks

past meteor Oct 10, 2023, 1:44 PM

#

cold osprey any idea what technical theory means?

So if it were technical theory I'd ask basic stuff like how do you know a model is overfitting

#

And what is the difference between xgboost and random forest, I think it's stuff like that no? The typical ones

little idol Oct 10, 2023, 2:09 PM

#

Hi guys

#

I'm in need of a little bit of help here

delicate apex Oct 10, 2023, 2:47 PM

#

a recently-added sonarlint rule, warning against equality comparison with floats, has triggered in a file of mine. Given that the context is an sqlite file imported into a pandas dataframe, and the values in question are powers of two (0, 0.5, 1, 2, with sonar complaining at if cell==0.5), I should be able to ignore this without consequence, right?
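
A quick way to see why these particular values are safe, and where the rule does matter. This sketch assumes the values are stored and read back unchanged (SQLite REAL and pandas float64 are both 8-byte IEEE 754 doubles) rather than produced by arithmetic; `is_exact` is a helper made up for the example.

```python
import math
from fractions import Fraction


def is_exact(value, numerator, denominator):
    """True if the float stores exactly numerator/denominator."""
    return Fraction(value) == Fraction(numerator, denominator)


half_is_exact = is_exact(0.5, 1, 2)      # 0.5 is 2**-1, stored exactly
tenth_is_exact = is_exact(0.1, 1, 10)    # 0.1 has no finite binary expansion

# Equality becomes unreliable once values are the result of arithmetic.
computed_sum_equal = (0.1 + 0.2 == 0.3)
computed_sum_close = math.isclose(0.1 + 0.2, 0.3)
```

So `cell == 0.5` is fine for values like 0, 0.5, 1, and 2 that are written and read as-is, while computed values are better compared with `math.isclose` or a tolerance.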

rustic scarab Oct 10, 2023, 3:45 PM

#

I'm trying to draw a plot to show streamlines for the equation psi = -Kx^2 + Ky^2. Here is my code for that:

import numpy as np
import matplotlib.pyplot as plt

K = 1     # assumed value; I suspect this assumption is wrong

# Define the grid
x = np.linspace(-100, 100, 50)
y = np.linspace(-100, 100, 50)
X, Y = np.meshgrid(x, y)

# Calculate the vector field components
U = -K * X**2
V = K * Y**2

# Create a streamline plot
plt.streamplot(X, Y, U, V, density=1, linewidth=1, arrowsize=1.5, color='blue', broken_streamlines=False)

# Add x and y coordinate lines at 0,0
plt.axhline(0, color='red', linewidth=2.5)  # Horizontal line at y=0
plt.axvline(0, color='red', linewidth=2.5)  # Vertical line at x=0

# Add labels and title
plt.xlabel('x')
plt.ylabel('y')
plt.title('Streamline Plot of -Kx^2 + Ky^2')

# Show the plot
plt.show()

This is the result that I'm getting (image attached with red coordinates).

However, the expected result should look like the other image.

Can anybody help me with this, please?

tidal bough Oct 10, 2023, 3:56 PM

#

rustic scarab I'm trying to draw a plot to show streamlines for the equation `psi = -Kx^2 + Ky...

I'm not sure offhand what plotting the streamlines means - perhaps you should be plotting the gradient of this function? That'd be -2 K x, 2 K y. But what you have on the plot to the right frankly looks like neither to me - it looks to me like the contour plot of the function (the curves on which ψ is constant).

rustic scarab Oct 10, 2023, 3:58 PM

#

tidal bough I'm not sure offhand what plotting the streamlines means - perhaps you should be...

what changes would I have to make in my code for that?

tidal bough Oct 10, 2023, 3:59 PM

#

plt.contour takes the scalar field itself, so:

f = K * (Y**2 - X**2)
plt.contour(X, Y, f, levels=20)

rustic scarab Oct 10, 2023, 4:06 PM

#

tidal bough `plt.contour` takes the scalar field itself, so: ``` f = K * (Y**2 - X**2) plt.c...

Thank you🚀

stark phoenix Oct 10, 2023, 4:30 PM

#

Why do I get different ticks than the displayed ones?

#

I would expect ax.set_xticks(ax.get_xticks()) to do nothing... but it makes it so x = -2 and x = 6 are visible :/

#

fixed with:

        ax = lines[0].axes
        xlim = ax.get_xlim()
        ylim = ax.get_ylim()
        ax.set_xticks(list(ax.get_xticks()) + [rx1, lx1])
        ax.set_yticks(list(ax.get_yticks()) + [0.5])
        ax.set_xlim(xlim)
        ax.set_ylim(ylim)

(found solution at https://stackoverflow.com/questions/55603710/python-matplotlib-ax-get-xticks-doesnt-get-current-xtick-location )

serene scaffold Oct 10, 2023, 4:55 PM

#

please always give text as text, not as screenshots. people might need to copy parts of the message to help you.

reef pulsar Oct 10, 2023, 4:56 PM

#

Sorry. I'll keep that in mind

thick topaz Oct 10, 2023, 4:57 PM

#

So we are considering making a data warehouse. We have approximately 300-400 databases, all with identical structure but different datasets of course. Now we are considering creating a large data warehouse with data from all of the 300-400 databases. How would one go about this?

#

Which libraries would you guys use, what are dos and donts in regards to this?

rustic scarab Oct 10, 2023, 5:22 PM

#

is it possible to put arrows in a contour plot?

serene scaffold Oct 10, 2023, 6:34 PM

#

thick topaz So we are considering making a datawarehouse, we have approximately 3-400 databa...

what kind of databases? SQL?

thick topaz Oct 10, 2023, 6:34 PM

#

serene scaffold what kind of databases? SQL?

Ah sorry, MSSQL

#

We aren't talking millions of rows per DB, probably 20-100k per DB or so

radiant crypt Oct 10, 2023, 6:37 PM

#

hello im trying to make a bot talk to the user but when i try to allow the text to print on the textbox it gives me an error TypeError: CTkEntry.get() takes 1 positional argument but 3 were given

thick topaz Oct 10, 2023, 6:38 PM

#

I think we need more code than that

radiant crypt Oct 10, 2023, 6:38 PM

#

r u able to be in a vc?

#

its easier to show u there

thick topaz Oct 10, 2023, 6:39 PM

#

Nah but feel free to DM me a part of the code and I might have a look later tonight

radiant crypt Oct 10, 2023, 6:39 PM

#

alr

#

thx

cold osprey Oct 10, 2023, 7:00 PM

#

thick topaz Which libraries would you guys use, what are dos and donts in regards to this?

wdym what libraries?

#

if its on MSSQL, i presume u would use something like SSIS?

thick topaz Oct 10, 2023, 7:01 PM

#

cold osprey wdym what libraries?

Mainly in regards to what would be the smartest, fastest and most reliable way of transferring certain data from these 3-400 databases into another database

languid prairie Oct 10, 2023, 8:08 PM

#

need someone who is familiar with LlamaIndex

#

Capture_decran_2023-10-10_a_13.09.00.png

cursive valve Oct 10, 2023, 8:30 PM

#

Hi, is there anyone who is willing to discuss some python code with me? I have spent so much time on it and there is still something that is not working.

keen kettle Oct 10, 2023, 8:31 PM

#

@cursive valve Hi, if you want to discuss your python code.

cursive valve Oct 10, 2023, 8:32 PM

#

Are you willing to help me?

keen kettle Oct 10, 2023, 8:33 PM

#

what kind of project?
pls show me.

cursive valve Oct 10, 2023, 8:33 PM

#

it will be better if could talk because it is quite complicated

#

but I will leave it up to you

keen kettle Oct 10, 2023, 8:34 PM

#

you have document?

cursive valve Oct 10, 2023, 8:34 PM

#

yeah but it is not in english

keen kettle Oct 10, 2023, 8:35 PM

#

hmm... no problem

cursive valve Oct 10, 2023, 8:36 PM

#

I mean in a nutshell it is a program that will subtract numbers like '1.01e-11' and '-1.10110101e111', both in binary form, and both could be negative or positive

#

and I have to write output

#

With small numbers it works perfectly fine, but with larger ones it causes trouble

keen kettle Oct 10, 2023, 8:37 PM

#

that's interesting.

#

Can you share your code?

#

@cursive valve are you there?

cursive valve Oct 10, 2023, 8:40 PM

#

Well I do not really want to share my code publicly, because of strict rules that I have to follow

#

But I need to work with strings, otherwise there will be inaccuracy

keen kettle Oct 10, 2023, 8:42 PM

#

you are right

little idol Oct 10, 2023, 8:54 PM

#

@keen kettle Hello my friend, how are you?

I don't know if you still have some spare time, but I would like to use this opportunity to ask you something about this project I'm enrolled in

I am building a portfolio project which consists in:

I've downloaded a database from kaggle of a fictional telecom company, regarding churn data
Then, I've uploaded the database to SSMS
I've cleaned the data using python and SQL, one-hot encoded columns, solved missing and zero values
I've done feature engineering to prepare for the usage of Random Forest Classifier machine learning algorithm to create a churn prediction tool

The question is, since my features resulted in new tables with structures, column names and overall shape different than the original database, how can I merge them into one singular database to feed into the ML algorithm, since what I've discovered through my research is that there must be an Index Column on the dataframes, but the original index column contained on the database was Customer ID and on the newly created tables, there will be no connections at all with customer ID since different questions were answered by the feature tables

Could you lend me a hand ?

keen kettle Oct 10, 2023, 9:02 PM

#

@little idol okay

#

so u want sql query?

#

i am interested in your project.

little idol Oct 10, 2023, 9:07 PM

#

I want to understand how can I use my engineered features to feed my machine learning model

I can share my screen with you

#

I want to use python to run this machine learning model, but I don't know how to do it, this is my first time dealing with machine learning models

keen kettle Oct 10, 2023, 9:09 PM

#

hmm...
let me see your screen

terse frigate Oct 11, 2023, 2:47 AM

#

I am seeking guidance for a conceptual endeavor regarding Natural Language Processing (NLP) model refinement. My objective is to tailor an NLP model utilizing a private corpus. The data at hand comprises two distinct datasets:

The first dataset encapsulates JSON structured data, encompassing names and pertinent information of various enterprises.
The second dataset embodies a text corpus, inclusive of a passage containing 6000 words.

The envisioned outcome is to engineer a model proficient in executing text or data retrieval from the aforementioned corpus, predicated on user prompts. I am open to insights or suggestions that may steer me towards devising a viable solution for this challenge. Your expertise and directional advice would be greatly valued.

serene scaffold Oct 11, 2023, 3:04 AM

#

terse frigate I am seeking guidance for a conceptual endeavor regarding Natural Language Proce...

so you're trying to develop an information retrieval system. are you looking for a ChatGPT-like user experience?

by the way, you're using a lot of fancy terms here in a way that I don't think is necessary. NLP professionals don't talk about datasets that "encapsulate JSON structured data" or "engineering a model proficient in executing [task]". you can just say that the first dataset is JSON data representing [whatever it represents], and that the desired result is a model that retrieves information predicated on user prompts.

#

The second dataset embodies a text corpus, inclusive of a passage containing 6000 words.
inclusive of a passage containing 6000 words. is that the whole corpus, or is that just one of the documents that are in it?

random fox Oct 11, 2023, 3:09 AM

#

I am attempting to animate a pcolor plot with matplotlib.pyplot and matplotlib.animation. However, I'm running into an issue with the colorbar duplicating repeatedly. The following is the figure setup code:
```py
fig = plt.figure()
ax = fig.add_subplot()

def init():
    return ax

def update(frame):
    f = int(frame)
    plt.cla()

    """
    ax.set_xlabel(r'$x$')
    ax.set_ylabel(r'$y$')

    ax.set_xlim(0, 2 * pi)
    ax.set_ylim(0, 2 * pi)
    ax.xaxis.set_major_formatter(FuncFormatter(lambda val,pos: '{:.0g}$\pi$'.format(val/pi) if val !=0 else '0'))
    ax.yaxis.set_major_formatter(FuncFormatter(lambda val,pos: '{:.0g}$\pi$'.format(val/pi) if val !=0 else '0'))
    ax.xaxis.set_major_locator(MultipleLocator(base = pi))
    ax.yaxis.set_major_locator(MultipleLocator(base = pi))
    """

    u_plot = ax.pcolor(x, y, np.transpose(u[frame, :, :]), cmap = 'coolwarm')
    # ax.set_aspect('equal')
    # ax.set_title(r"$u(t, x, y), t = $" + "{0:.3f}".format(t[frame]))
    fig.colorbar(u_plot)

ANI = ani.FuncAnimation(fig, update, frames = range(0, T), init_func = init)
ANI.save("advection_animation.mp4", fps = 180, dpi = 200)
```
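
A likely cause is that `fig.colorbar(...)` runs on every frame, and `plt.cla()` only clears the main axes, so each call adds one more colorbar axes to the figure. A minimal sketch of one way around it is to draw the mesh and its colorbar once and only swap the data per frame. The array `u` below is synthetic stand-in data, and `pcolormesh` is swapped in for `pcolor` as an assumption because its `set_array` is convenient to use.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; drop for interactive use
import matplotlib.pyplot as plt
import matplotlib.animation as ani
import numpy as np

# Synthetic stand-in for the u[t, x, y] array in the question (assumption).
T, N = 5, 16
x = np.linspace(0, 2 * np.pi, N)
y = np.linspace(0, 2 * np.pi, N)
t = np.linspace(0, 1, T)
u = np.sin(x[None, :, None] - t[:, None, None]) * np.cos(y[None, None, :])

fig, ax = plt.subplots()

# Draw the first frame once, with fixed color limits, and attach ONE colorbar.
mesh = ax.pcolormesh(x, y, u[0].T, cmap="coolwarm",
                     vmin=u.min(), vmax=u.max(), shading="auto")
fig.colorbar(mesh, ax=ax)


def update(frame):
    # Replace the data of the existing mesh instead of clearing and redrawing.
    mesh.set_array(u[frame].T.ravel())
    ax.set_title(f"t = {t[frame]:.3f}")
    return (mesh,)


anim = ani.FuncAnimation(fig, update, frames=range(T))
```

Fixing `vmin`/`vmax` up front keeps the color scale the same across frames, which is usually what you want when a single colorbar is shared by the whole animation.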

#

Unfortunately, I end up with animations like the following:

serene scaffold Oct 11, 2023, 3:10 AM

#

terse frigate I am seeking guidance for a conceptual endeavor regarding Natural Language Proce...

Also, can you show an example of the json data, and explain how it relates to the second dataset?

lapis sequoia Oct 11, 2023, 3:16 AM

#

chat gpt is quite nice at translating jupyter to python scripts

#

the exports functions normally do not understand what to do with the bash commands

serene scaffold Oct 11, 2023, 3:17 AM

#

lapis sequoia chat gpt is quite nice at translating jupyter to python scripts

You can already do that deterministically with nbconvert

lapis sequoia Oct 11, 2023, 3:18 AM

#

it didn't run in any way here, not sure why

#

i wanted to run the cli command to be able to use tmux now that we upgrade from colab to a pure ubuntu server with gpu

#

it was so complicated to convert that simple file lol...

serene scaffold Oct 11, 2023, 3:19 AM

#

Tmux is bae af

lapis sequoia Oct 11, 2023, 3:21 AM

#

not sure what bae is, but i'm forcing it to be at ease lol

serene scaffold Oct 11, 2023, 3:21 AM

#

random fox I am attempting to animate a `pcolor` plot with `matplotlib.pyplot` and `matplot...

Try asking this again in several hours if you don't get an answer. I appreciate that your question is detailed. I just don't know what to do

serene scaffold Oct 11, 2023, 3:22 AM

#

lapis sequoia not sure what bae is, but i'm forcing it to be at ease lol

Bae is what one would call their romantic partner

lapis sequoia Oct 11, 2023, 3:25 AM

#

o_0

lapis sequoia Oct 11, 2023, 4:45 AM

#

anyone here uses fluorescent tshirts for coding

somber prism Oct 11, 2023, 6:14 AM

#

somber prism Hello , I installed PyTorch gpu module and everyday I’ve been training my model ...

Can someone help me with this

tame path Oct 11, 2023, 12:06 PM

#

is there a way to find out quantitatively whether a given spectrogram contains "some sound" or is just silent

terse frigate Oct 11, 2023, 12:17 PM

#

serene scaffold so you're trying to develop an information retrieval system. are you looking for...

Hiiii

#

Thank you for responding

#

I only wrote the request in fancy terms since I have been asking people for help on LinkedIn too

#

Yes correct - information retrieval system

terse frigate Oct 11, 2023, 12:18 PM

#

serene scaffold > The second dataset embodies a text corpus, inclusive of a passage containing 6...

That would be the whole corpus

#

What I am trying to develop is a POC

#

Where I just want to demonstrate that we can build a model that is curated to our needs and our dataset

#

And not open domain

terse frigate Oct 11, 2023, 12:20 PM

#

serene scaffold Also, can you show an example of the json data, and explain how it relates to th...

The two datasets are completely unrelated

weak mortar Oct 11, 2023, 1:40 PM

#

hello! i have a weird problem with my DataFrame today. i have a column with datetime. time is in format "2017-08-17 04:00:00", but as soon as i pass it to any function, without doing any manipulations or modifications of it, it changes time to 1970-01-01 00:00:00.000

serene scaffold Oct 11, 2023, 1:41 PM

#

@terse frigate I still need to see examples of the json data

weak mortar Oct 11, 2023, 1:41 PM

#

print(somedataframe)
def wtfisgoingon(input):
    print(input)
    return
wtfisgoingon(somedataframe)

#

i never experienced such behavior before that my data gets altered by passing it to a function , and quite clueless on how to solve it

serene scaffold Oct 11, 2023, 1:50 PM

#

weak mortar i never experienced such behavior before that my data gets altered by passing it...

wtfisgoingon can't modify input, so by simplifying the actual function, you've removed the part that causes the problem

#

if an object is mutable (like dataframes), anything with a reference to that object can modify it, including functions
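
A small sketch of that point, using a plain list as a stand-in for the dataframe (an assumption to keep it dependency-free; in-place operations on a DataFrame behave the same way). The function names are hypothetical.

```python
def rebind_only(rows):
    # Rebinding the local name creates a new object; the caller's list is untouched.
    rows = []
    return rows


def mutate_in_place(rows):
    # Mutating through the reference changes the one object both names point to.
    rows.append({"time": "1970-01-01 00:00:00"})


data = [{"time": "2017-08-17 04:00:00"}]

rebind_only(data)
print(len(data))  # the caller's list is unchanged here

mutate_in_place(data)
print(len(data))  # the caller's list has grown
```

So a function that only prints its argument cannot alter it, but any function that calls a mutating method on it can.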

weak mortar Oct 11, 2023, 1:54 PM

#

but i assume that would require that said anything or function, is run

serene scaffold Oct 11, 2023, 1:54 PM

#

yes? in either case, your code example doesn't encapsulate the problem.

weak mortar Oct 11, 2023, 1:55 PM

#

ok. what do you suggest me to do to try and solve this?

#

as you see i print somedataframe , and the only thing other that i do to somedataframe is to pass it to the function, and print it again. so no code is initiated that should change it

serene scaffold Oct 11, 2023, 1:57 PM

#

weak mortar ok. what do you suggest me to do to try and solve this?

create a code example with every variable defined that has the same problem as the actual code

weak mortar Oct 11, 2023, 1:57 PM

#

ill try and recreate the problem in a new .py 👍🏻

modern quail Oct 11, 2023, 2:00 PM

#

Types of neural networks and optimization algorithms are independent of each other right? Like I can use any optimization algorithm I want with any type of neural network?

serene scaffold Oct 11, 2023, 2:01 PM

#

modern quail Types of neural networks and optimization algorithms are independent of each oth...

you're talking about stochastic gradient descent vs Adam, right?

modern quail Oct 11, 2023, 2:02 PM

#

serene scaffold you're talking about stochastic gradient descent vs Adam, right?

I am not talking about either

#

I am talking in general

serene scaffold Oct 11, 2023, 2:03 PM

#

yes, and are those two examples of what you mean by "optimization algorithm", or not?

modern quail Oct 11, 2023, 2:04 PM

#

by types I mean like LSTM, ANFIS and optimization algorithm ACSLFA, Levenberg Marquardt

#

I do mean adam but i am not sure whether acslfa and lm are optimization algorithms and whether they could be used alongside lstm or anfis

weak mortar Oct 11, 2023, 2:10 PM

#

weak mortar hello! i have a weird problem with my DataFrame today. i have a column with d...

as usual there was an explanation for the problem, of a human nature... i did not realize that the variable was holding multiple dataframes (because it loads csv files by searching for a string, and i assumed it had only found one file). what confused me, and was unexpected behavior, was that printing it initially would print somedataframe[0], and after passing it to the function it would print somedataframe[3], which has its timestamp in an incompatible format

#

tl;dr: doing print on a variable with multiple objects without specifying which object, it will do print[0], and after passing it to any function, it will print out the [-1] object.

serene scaffold Oct 11, 2023, 2:15 PM

#

weak mortar tl;dr: doing print on a variable with multiple objects without specifying which ...

there are no "variables with multiple objects". a variable is always exactly one object, though that object might contain other objects. it sounds like the variable might be a list.

weak mortar Oct 11, 2023, 2:16 PM

#

okay, yes, a list storing multiple dataframes

serene scaffold Oct 11, 2023, 2:16 PM

#

but if you print a list, it will just print the whole list, so your assessment about "it will do print[0], and after passing it to any function, it will print out the [-1] object." is incorrect

#

there must be more going on.

weak mortar Oct 11, 2023, 2:17 PM

#

serene scaffold but if you print a list, it will just print the whole list, so your assessment a...

yes you are right, it does print out them all, what i should have said, is that it prints them out in reverse order when i print from inside the function 🙂

serene scaffold Oct 11, 2023, 2:17 PM

#

weak mortar yes you are right, it does print out them all, what i should have said, is that ...

weird

weak mortar Oct 11, 2023, 2:20 PM

#

actually my brain is just completely malfunctioning and python is behaving exactly as expected. thanks for your efforts. i will pour down a liter of coffee to correct my cognition

serene scaffold Oct 11, 2023, 2:21 PM

#

weak mortar actually my brain is just completely malfunctioning and python is behaving exac...

https://tenor.com/view/coffee-gif-14866884794849307214


harsh minnow Oct 11, 2023, 4:43 PM

#

Hi, I am running meta-llama/Llama-2-7b-chat-hf on runpod.
The machine type is: 1 x RTX A6000, 14 vCPU 48 GB VRAM

I am running a for loop and calling the model using hugging face pipelines
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device=0, return_full_text=False, use_cache=False)

After the 5th loop the GPU Utilisation becomes 100% full and output is very slow or sometimes no output. What can I do?
For testing I ran
```py
def data():
    for i in range(1000):
        yield f"My example {i}"

for out in pipe(data()):
    print(out)
```

Still the same delay and pausing after 4th or 5th loop

serene scaffold Oct 11, 2023, 4:43 PM

#

harsh minnow Hi, I am running meta-llama/Llama-2-7b-chat-hf on runpod. The machine type is: ...

!code

arctic wedgeBOT Oct 11, 2023, 4:43 PM

#

Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

harsh minnow Oct 11, 2023, 4:44 PM

#

serene scaffold !code

Just edited.

serene scaffold Oct 11, 2023, 4:45 PM

#

import torch

for d in data():
    with torch.no_grad():
        print(pipe(d))

try that.

harsh minnow Oct 11, 2023, 4:47 PM

#

serene scaffold ```py import torch for d in data(): with torch.no_grad(): print(pip...

I tried this, it's just completed after about 10-12 seconds.

serene scaffold Oct 11, 2023, 4:48 PM

#

harsh minnow I tried this, it;s just completed after about 10-12 seconds.

so it worked?

harsh minnow Oct 11, 2023, 4:48 PM

#

It worked, but it takes 12 seconds or so.

serene scaffold Oct 11, 2023, 4:48 PM

#

to do 1000 instances, one at a time?

#

because that doesn't surprise me.

harsh minnow Oct 11, 2023, 4:50 PM

#

Oh my bad, actually this is the code:
```py
import torch

def data():
    for i in range(1):
        yield f"My example {i}"

for d in data():
    with torch.no_grad():
        print(pipe(d))
```

#

For 1 instance it takes 12 seconds

serene scaffold Oct 11, 2023, 4:51 PM

#

hmm, are you sure nothing else is using the GPU?

harsh minnow Oct 11, 2023, 4:52 PM

#

I guess so, because once the output is completed the GPU utilisation drops to 0%

neon field Oct 11, 2023, 4:55 PM

#

someone recommend me a roadmap to study AI
my prof is real shitty
there are too many AI resources
idk what course to finish to actually study AI as a major subject 🟥 IMPORTANT 🟥

serene scaffold Oct 11, 2023, 4:59 PM

#

@harsh minnow I'll try it on my machine

harsh minnow Oct 11, 2023, 4:59 PM

#

Sure @serene scaffold

serene scaffold Oct 11, 2023, 5:01 PM

#

@harsh minnow looks like it requires a license from Meta to use it, which I don't have, so RIP.

harsh minnow Oct 11, 2023, 5:02 PM

#

oh okay, no problem.

#

Actually I guess there is some weird issue with my system, because this code still has not given an output:
```py
import torch

def data():
    for i in range(1):
        yield f"My example {i}"

for d in data():
    with torch.no_grad():
        print(pipe(d))
```

#

BTW I am using RunPod (runpod.io)

dense sluice Oct 11, 2023, 5:17 PM

#

Hi! I'm in the early stages of implementing face recognition, and am assessing available tools. Python's eco seems great for this, including with Tensorflow, Pytorch, and OpenCV (Aka/formerly-known as CV2?) I'm interested in clustering with faces. Ie recognizing unique faces from a camera, and identifying when they come up again. Where would you start? Ty!

misty flint Oct 11, 2023, 5:40 PM

#

serene scaffold https://tenor.com/view/coffee-gif-14866884794849307214

adogslurp

#

are you me

#

also @serene scaffold did we still want data eng resources or nah

#

at this point im not sure since ive sorta merged MLE/DE beginner resources

#

thinkfast

#

many places merge the two roles

#

books, online courses/resources, podcasts

#

harsh minnow Oct 11, 2023, 6:31 PM

#

I am running 2 GPUs, but when I run this code it only uses 1 GPU.

```py
from transformers import pipeline
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device=1, return_full_text=False)
```

hasty mountain Oct 11, 2023, 7:01 PM

#

Hey guys, does someone know why Evolutionary Algorithms are hard to apply in a stochastic nature, like gradient descent?
Usually, I see EAs being applied to optimize the average loss in a dataset as a whole. The models are initialized and only after iterating through the whole dataset the mutations/crossovers are applied. I was thinking: why not apply the mutations before iterating through the whole dataset? Like after N batch iterations, like in SGD?

#

In reality, I'm asking this a bit too late, as I already did an entire work testing a "stochastic genetic algorithm". But it worked wonderfully as a failure, and I'm now trying to think why it didn't work

#

(Though I don't discard the possibility that the problem may lie between the screen and the chair...if you know what I mean)

past meteor Oct 11, 2023, 7:23 PM

#

hasty mountain Hey guys, does someone know why Evolutionary Algorithms are hard to apply in a s...

You can apply SGD in EA's

#

Look up local search in evolutionary algorithms

hasty mountain Oct 11, 2023, 7:24 PM

#

past meteor Look up local search in evolutionary algorithms

Yeah, I've seen a paper where the folks used an EA to initialize the model and make it begin with its parameters close to the global minima. Then, they trained the model using SGD.
A bit like pre-training...

glass mason Oct 11, 2023, 7:25 PM

#

How to make ai

#

Can someone help me

hasty mountain Oct 11, 2023, 7:25 PM

#

I tried an idea of mixing SGD and EA to move a model that is already in a learning plateau (a local minima, or more like of a saddle point), to move the model away from that minima, and thus help SGD improve it even more.

past meteor Oct 11, 2023, 7:25 PM

#

hasty mountain I tried an idea of mixing SGD and EA to move a model that is already in a learni...

Imagine you have a population, each individual is a model. Your recombination and mutation work on the weights. You can local-search an individual by performing a few SGD steps on it, and do that for a few individuals in the population at each step

#

SGD, or local search in general, will move you toward a local minimum

#

I don't have the time right now to give elaborate and structured motivation / responses. You can DM if you want and I'll likely answer tomorrow or so 🙂

hasty mountain Oct 11, 2023, 7:27 PM

#

past meteor I don't have the time right now to give elaborate and structured motivation / r...

Oh, I see. Thanks!

echo mesa Oct 11, 2023, 7:39 PM

#

Hello guys, this might be a stupid question(I'm beginner in this field), but I always wondered about the mathematical background of for example neural networks, or for example the complete mathematical background of how a model gets trained and learns, is there any kinda of resource that is dealing with the mathematical background of these machine learning concepts? Thanks 🙂

#

I know that the mathematics needed for AI is mainly calculus, linear algebra, and probability. But what I would like to have is a book or any type of resource that would describe or have examples on the mathematical background of these concepts in AI and machine learning.

#

Ohh I think I found what I was looking for, there is a book "mathematics for machine learning" from the Cambridge University . I think that would be a good place to start.

odd meteor Oct 11, 2023, 9:07 PM

#

glass mason Can someone help me

Hi Ali, to gain clarity and understand the context of your question better, can you elucidate more?

plucky ivy Oct 11, 2023, 9:16 PM

#

Hi, I can't try jupyter notebook because of some python kernel timeout issue. Could you skip to near the end of this video and tell me how to rectify it.

glass mason Oct 11, 2023, 9:30 PM

#

Bro how can I make / command in my bot and I have to make bot in JavaScript or something else

plucky ivy Oct 11, 2023, 9:37 PM

#

glass mason Bro how can I make / command in my bot and I have to make bot in JavaScript or ...

I've DM'd you code for a basic bot in Python. You can use autogen or openai to make it respond like an assistant.

#

For further help, see #discord-bots or the discord.py server

plucky ivy Oct 11, 2023, 10:17 PM

#

plucky ivy Hi, I can't try jupyter notebook because of some python kernel timeout issue. Co...

Anyone?

left tartan Oct 11, 2023, 10:19 PM

#

plucky ivy Anyone?

You posted a 20 min video. I don’t have that kind of attention span. What issue?

#

I assume you don’t have Python installed, for starters

#

Try running something in vscode outside a notebook

small wedge Oct 11, 2023, 10:31 PM

#

echo mesa Hello guys, this might be a stupid question(I'm beginner in this field), but I a...

https://arxiv.org/pdf/1802.01528.pdf is a paper that steps through all the math and assumes you only know calc 1 and some linear algebra.

https://developers.google.com/machine-learning/crash-course google has a machine learning crash course that begins by covering the math, but also focuses on using modern libs to create your own models

https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi 3b1b has a series of videos going over the math and getting progressively more complicated

#

andrew ng also has free courses online that go in depth on the math

deep tendon Oct 12, 2023, 1:13 AM

#

Guys here is a code that you can use to process a pcap and extract individual bytes to columns of a parquet! it implements the pre processing and labeling required to develop feature-per-byte based NIDS as the one described in DeepPackGen https://arxiv.org/pdf/2305.11039.pdf feedback very welcome! https://github.com/Master-Sorcerer/BytesProcessor


dusk tide Oct 12, 2023, 6:36 AM

#

I have a question . On Kaggle Getting started competitions, we are provided with train and test sets separately. Is it okay to merge both of them for doing preprocessing easily or not ? According to this blog : analytics vidya blog (https://www.analyticsvidhya.com/blog/2021/07/data-leakage-and-its-effect-on-the-performance-of-an-ml-model/) we should not do this because it can lead to Data Leakage . Can anyone tell ?


lavish kraken Oct 12, 2023, 6:40 AM

#

plucky ivy I've DM'd you code for a basic bot in Python. You can use `autogen` or `openai` ...

Send me too please

plucky ivy Oct 12, 2023, 7:14 AM

#

lavish kraken Send me too please

Sent.

plucky ivy Oct 12, 2023, 7:16 AM

#

left tartan I assume you don’t have Python installed, for starters

I have Python 3.11, do I need to make a copy of the python.exe file into the current working directory?

plush pagoda Oct 12, 2023, 12:16 PM

#

• male_female_X_train.txt
• male_female_X_test.txt
• male_female_y_train.txt
• male_female_y_test.txt
...
(c) Bayes classifier with non-parametric distribution (20 points)
Compute and print the prior probabilities for male and female.
Compute class likelihoods, p(height|male), p(weight|male), p(height|female) and p(weight|female) for
all test samples. This can be done by using the bin min/max values returned by NumPy histogram()
function. You can calculate the centroid of each bin and assign each test sample to the closest bin.
After knowing the bin index, the likelihood can be computed using the count vector provided by the
same histogram() function.
Classify all test samples and compute the classification accuracy. Print accuracies for height only,
weight only, and weight and height together (multiply likelihoods).

My attempted solution went something like this (I tried to follow the instructions as well as I could):

Calculate the prior probabilities from the training data
- p(m) = # males / (# males + # females)
- p(f) = 1 - p(m)
Calculate likelihoods
- Use the training data to define the counts and bins for 4 histograms : female height, male height, female weight, male weight
- Calculate the centroids for all 4 histograms
- Using those I can calculate the likelihoods (or so I thought)
Classify
- Classify based on the maximum likelihood * prior probability

#

Here is my code (sorry):

import numpy as np

# Male and female height and weight measurements
X_train = np.loadtxt("male_female_X_train.txt") 
X_test = np.loadtxt("male_female_X_test.txt") 
y_train = np.loadtxt("male_female_y_train.txt")
y_test = np.loadtxt("male_female_y_test.txt")

total_samples_train = len(y_train)

# Compute and print the prior probabilities for male and female
p_m = np.mean(y_train == 0) # p(male), using y=0 -> male as in the slicing below
p_f = 1 - p_m
print(f"p(female) = {p_f}")
print(f"p(male) = {p_m}")

# Compute class likelihoods

# y=0 -> male and y=1 -> female
# Men's heights
mheights_train = X_train[y_train == 0, 0]
# Men's weights
mweights_train = X_train[y_train == 0, 1]
# Women's heights
fheights_train = X_train[y_train == 1, 0]
# Women's weights
fweights_train = X_train[y_train == 1, 1]


# Histograms 
# The last bin edge from np.histogram(..) is "closed".
counts_mh, bins_mh = np.histogram(mheights_train, bins=10)
counts_mw, bins_mw = np.histogram(mweights_train, bins=10)
counts_fh, bins_fh = np.histogram(fheights_train, bins=10)
counts_fw, bins_fw = np.histogram(fweights_train, bins=10)

# Bin centroids for histograms
get_bin_centr = lambda bins: [b + ((max(bins) - min(bins)) / 20) for b in bins][:10]

bin_centr_mh = get_bin_centr(bins_mh)
bin_centr_mw = get_bin_centr(bins_mw)
bin_centr_fh = get_bin_centr(bins_fh)
bin_centr_fw = get_bin_centr(bins_fw)

# For the sake of brevity and everyone's sanity, I'll omit everything except predictions based on height:
classify = lambda lf, lm: int(lf * p_f > lm * p_m)

y_pred_h = [classify(counts_fh[get_bin_idx(bin_centr_fh, x)] / counts_fh.sum(), 
                     counts_mh[get_bin_idx(bin_centr_mh, x)] / counts_mh.sum()) 
            for x in X_test[:,0]]

#

And here is the feedback I got:
The index of the bin should be taken considering the whole range of available weight/height values but not for male and female separately

But if I don't have the histograms separated into m/f, I don't understand how to classify then. If I only have one histogram for height and I get the bin index from that combined distribution for male and female heights, how can I determine the class?
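
One way to read that feedback, as a sketch only: keep one histogram per class, but build all of them on the same bin edges computed from the combined data, so that bin index k means the same interval for both classes. The data below is synthetic (an assumption, not the course files), and `bin_index` / `classify_height` are hypothetical helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in data (assumption): heights in cm, label 0 = male, 1 = female.
h_male = rng.normal(178, 7, 500)
h_female = rng.normal(165, 6, 500)
heights = np.concatenate([h_male, h_female])
labels = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])

# One set of bin edges over the whole range of heights, shared by both classes.
edges = np.histogram_bin_edges(heights, bins=10)

# Per-class counts on the shared edges.
counts_m, _ = np.histogram(heights[labels == 0], bins=edges)
counts_f, _ = np.histogram(heights[labels == 1], bins=edges)

# Likelihood of each bin given the class, and the class priors.
lik_m = counts_m / counts_m.sum()
lik_f = counts_f / counts_f.sum()
p_m = np.mean(labels == 0)
p_f = 1.0 - p_m


def bin_index(x, edges):
    """Index of the bin whose centroid is closest to each value in x."""
    centroids = (edges[:-1] + edges[1:]) / 2
    x = np.asarray(x, dtype=float)
    return np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)


def classify_height(x):
    """Return 1 (female) where p(x|f) p(f) > p(x|m) p(m), else 0 (male)."""
    idx = bin_index(x, edges)
    return (lik_f[idx] * p_f > lik_m[idx] * p_m).astype(int)


pred = classify_height(heights)
accuracy = np.mean(pred == labels)
print(f"accuracy on the synthetic data: {accuracy:.3f}")
```

The class still comes from comparing the two per-class likelihoods; the only change from separate histograms is that the bin lookup is done once against the shared edges. For weight and height together, the per-feature likelihoods would be multiplied before the comparison, as the assignment text says.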

reef spade Oct 12, 2023, 12:41 PM

#

hello, i want to create an AI avatar using python. What should i learn to do this?

oblique quarry Oct 12, 2023, 12:49 PM

#

I was wondering: does it even make sense to use LASSO regression on simple linear regression tasks where only one independent variable / feature contributes to the target variable? Given that you don't target the b0 term, you can only target the one feature that contributes to the output, which doesn't sound like something I'd normally do
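
For what it's worth, the one-feature case has a closed form, which makes the effect easy to see. A sketch, assuming the objective (1/(2n)) Σ (y_i - b0 - b1 x_i)² + λ|b1| with an unpenalized intercept; `lasso_1d` is a hypothetical helper, not a library function.

```python
def lasso_1d(x, y, lam):
    """Closed-form LASSO fit for one feature with an unpenalized intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    cov = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n
    var = sum((xi - x_mean) ** 2 for xi in x) / n
    # Soft-thresholding: shrink |cov| by lam, and clip at zero.
    shrunk = max(abs(cov) - lam, 0.0)
    slope = (shrunk if cov >= 0 else -shrunk) / var
    intercept = y_mean - slope * x_mean
    return intercept, slope


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
for lam in (0.0, 0.5, 3.0):
    print(lam, lasso_1d(x, y, lam))
```

With a single feature the penalty can only shrink the slope toward zero, or drop the feature entirely once λ is large enough, so there is no feature selection to gain; it acts as plain shrinkage.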

serene scaffold Oct 12, 2023, 1:33 PM

#

reef spade hello, i want to create an AI avatar using python. What should i learn to do thi...

what would this AI avatar do? chances are, it's not something that's attainable as a first project.

reef spade Oct 12, 2023, 1:35 PM

#

serene scaffold what would this AI avatar do? chances are, it's not something that's attainable ...

It would just respond to basic physics questions

past meteor Oct 12, 2023, 2:59 PM

#

@hasty mountain I have time to get back to you now. Your algorithm was basically this, correct?

1. Initialize population
2. Mutate
3. Select top model
4. SGD on top model
5. Return to 2 until convergence or max steps

Right?

#

Step 4 is what people call "local search" in genetic algorithm literature, it's a valid thing to do. It does have a trade-off:

Local search decreases the time to convergence.
Local search can have you converge faster into a local minimum.

What I saw empirically is that LS really decimates diversity in your population and sends you towards a converged population really quickly. To offset this you need to increase mutation rates to compensate etc.

What you can also do is run LS on the top and bottom N % of individuals

#

In reality I'd never run an EA on a neural network though for various reasons. The flavours of SGD that we use are usually "good enough", I'm not sure we need a global optimiser 🙂

serene scaffold Oct 12, 2023, 3:59 PM

#

no, but "greedy reasoners" is a fun way of putting it

echo mesa Oct 12, 2023, 4:02 PM

#

small wedge <https://arxiv.org/pdf/1802.01528.pdf> is a paper that steps through all the mat...

Thank you very much

rapid oriole Oct 12, 2023, 4:13 PM

#

Hey guys, my neural network with MLPRegressor always outputs the mean of the Y variable. Can anyone help me? I will provide more details

past meteor Oct 12, 2023, 4:57 PM

#

rapid oriole Hey guys, my neural network with MLPRegressor always outputs the mean of the Y v...

Sounds like all your weights are 0 except that of the intercept. You can look at model.coefs_ to attempt to debug

ashen crypt Oct 12, 2023, 5:07 PM

#

Do we have here a OCR master? ;p with tesseract?

serene scaffold Oct 12, 2023, 5:07 PM

#

ashen crypt Do we have here a OCR master? ;p with tesseract?

just ask your question.

ashen crypt Oct 12, 2023, 5:08 PM

#

Hm... I want to OCR my manga chapter, but tesseract very often gets only about 50% of the page right.

#

I did it with my colleague, and sometimes where the image shows the letters "G and G", it is written out as "G and 6".

#

I'm curious where I should focus to resolve this issue.

left tartan Oct 12, 2023, 5:25 PM

#

ashen crypt Hm... i want to OCR my manga chapter but tesseract very often catch correct only...

I’ve only done a simple experiment with tesseract, so I don’t have much experience but: have you tried training against the font? Do you know the font? Context: https://pretius.com/blog/ocr-tesseract-training-data/

harsh minnow Oct 12, 2023, 5:25 PM

#

Hey guys, I am trying to build a news article generator app that is trained on the news in the US, UK, and Canada. It has to be very accurate. Now I first ask some simple questions to get some data and generate an accurate news article about a specific thing in a specific country.

Now I want to train the AI on new article knowledge, but when I fine-tune a model, it outputs nonsense. I tried fine-tuning GPT 3.5, but it's returning inaccurate data (GPT 4 performs much better). Also, GPT-3 (4096) is not enough to generate a long news article, so I am making a plan and calling GPT-3 around 7–10 times (it's context-aware).

I want the model to be smart enough to make decisions about the country and get information about it from the user. I want the model to output in a different format too, but the articles I train on are not in the format I want. Each user's use case is different, and I want the model to be smart enough to analyse and generate the article.

What is the best way to go about this? How can I achieve a smart AI model that knows each country's information and is accurate?

ashen crypt Oct 12, 2023, 5:27 PM

#

left tartan I’ve only done a simple experiment with tesseract, so I don’t have much experien...

Lol its possible to train Tesseract? nice i need to write more about it!

serene scaffold Oct 12, 2023, 5:27 PM

#

harsh minnow Hey guys, I am trying to build a news article generator app that is trained on t...

what would it mean to "generate accurate news articles"? for it to predict the future?

left tartan Oct 12, 2023, 5:28 PM

#

ashen crypt Lol its possible to train Tesseract? nice i need to write more about it!

Please share/tag me if you do, genuinely curious if it helps

harsh minnow Oct 12, 2023, 5:29 PM

#

serene scaffold what would it mean to "generate accurate news articles"? for it to predict the f...

No, whatever has already occurred (e.g. wildfires in the US)

ashen crypt Oct 12, 2023, 5:29 PM

#

@left tartan Do you have any Youtube Video to show how it work?

serene scaffold Oct 12, 2023, 5:29 PM

#

harsh minnow No, whatever the things that has been occurred (eg: Wild fires in US)

then aren't you just overfitting it to news articles that already exist?

boreal gale Oct 12, 2023, 5:31 PM

#

ashen crypt Hm... i want to OCR my manga chapter but tesseract very often catch correct only...

must you use tesseract?

also are you interested in text detection or text recognition? or both?

harsh minnow Oct 12, 2023, 5:31 PM

#

serene scaffold then aren't you just overfitting it to news articles that already exist?

I mean it should learn from the articles, account for the user's preferences and generate one. Maybe the user wants more things to be added to the article. This could be a combination of articles

ashen crypt Oct 12, 2023, 5:34 PM

#

I'm preparing a small project, so here is what I need: I need to catch the words / letters from the manga plus the pixels they occupy, then export them to a .txt file. Then:
1) Blank out the recognised text (fill it with white).
2) Replace the contents of the .txt file with my translation.
3) Put my words back onto the image.

#

that is the main logic what i want to do.

#

so if u know better option i want to use it.

boreal gale Oct 12, 2023, 5:35 PM

#

ashen crypt so if u know better option i want to use it.

i used EAST and CRNN_VGG_BiLSTM_CTC laid out here with some success before https://github.com/opencv/opencv/blob/4.x/samples/dnn/text_detection.py

ashen crypt Oct 12, 2023, 5:36 PM

#

im beginning in my python world. How to use it?

boreal gale Oct 12, 2023, 5:36 PM

#

to me tesseract is great for heavily structured text, but for text that's less structured like in manga, it's unlikely to work out of the box, just my 2 cent.

ashen crypt Oct 12, 2023, 5:37 PM

#

Ok could we connect tomorrow? for some "training" how to use it?

boreal gale Oct 12, 2023, 5:40 PM

#

sorry i don't have bandwidth to help on that level, let me do some googling, there must be tutorial out there

#

i only managed to find one on east for text detection for some reason..
https://pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/

here is one tutorial that aims to use east + tesseract together, which in my experience is also better than just tesseract, again just my 2 cents.
https://nanonets.com/blog/deep-learning-ocr/

boreal gale Oct 12, 2023, 6:10 PM

#

download EAST model from 1
download CRNN_VGG_BiLSTM_CTC model from 2

i think the dependencies you need are just

opencv-python
numpy

opencv-python is cv2 - don't ask me why, i always get confused too.

that's all the help i can give at the moment, i really need to get back to work now.

hasty mountain Oct 12, 2023, 6:17 PM

#

past meteor Step 4 is what people call "local search" in genetic algorithm literature, it's ...

Hey, thanks! Well, the method is a bit more chaotic than that, as the model is already on a local minimum/saddle point, and the population initialized are copies of this same model(thus, already on local minimum). The EA is like...something to force the model to get even better, and it's applied to one single model (the Vanilla) to prevent a madness of memory consumption (I can run this algorithm in my personal computer using 20 models as population, for example).

The SGD is quite interesting, indeed, but I only get a bit upset with the fact that it demands so much patience. I even read a blog post some time ago that the author said there was a saying in Deep Learning that was something like: "Let the model run and take a summer off", to let gradients work by themselves.

past meteor Oct 12, 2023, 6:18 PM

#

hasty mountain Hey, thanks! Well, the method is a bit more chaotic than that, as the model is a...

You need diversity in a population for EA methods to work, I'm not sure starting with N copies is worth it

#

Especially at initialisation

hasty mountain Oct 12, 2023, 6:19 PM

#

Yes, that's the thing. The mutation happens at each batch sampling, and I tend to use a mutation chance of 60%, more or less

past meteor Oct 12, 2023, 6:20 PM

#

Yes but this is a bit like random search then, you have little benefits of the EA framework especially since you're doing no recombination

hasty mountain Oct 12, 2023, 6:21 PM

#

Yes, it's quite chaotic and random.
I see... So I shouldn't have discarded crossing-over then :yert:

#

It's funny, though. When I ran some experiments in my poor personal GTX 1650 with a batch size of 4, the method failed miserably.
When I ran in the industrial Tesla P100 of Kaggle with a batch size of 512, the method did provide a result...helping the model get stuck in a new minimum, even harder to escape (not sure if it's local or global)

past meteor Oct 12, 2023, 6:26 PM

#

hasty mountain Hey, thanks! Well, the method is a bit more chaotic than that, as the model is a...

Slide spam again. I'd say what you have is probably not an EA because

You have full exploitation (SGD) and then you move to basically nearly full exploration (random search through mutation)
Diversity is a key idea to explore the search space, you start from the same instance so you're not seeing a lot of the space, you're only seeing around that saddle point

hasty mountain Oct 12, 2023, 6:28 PM

#

Oh no
So I basically have a Reinforcement Learning strategy with a low epsilon :yert:

past meteor Oct 12, 2023, 6:28 PM

#

I wouldn't call it RL either, it's random search with a splash of local search (when you do SGD on the best individual)

hasty mountain Oct 12, 2023, 6:29 PM

#

It's a crazy thing that I managed to make with my mind

#

:py_guido:

past meteor Oct 12, 2023, 6:31 PM

#

hasty mountain It's funny, though. When I ran some experiments in my poor personal GTX 1650 wit...

What do you mean with batch size? Population size, 512 instances? Or do you mean truly batch size in the SGD sense?

hasty mountain Oct 12, 2023, 6:31 PM

#

past meteor What do you mean with batch size? Population size, 512 instances? Or do you mean...

Batch size in SGD sense.
My "EA" works in a stochastic manner, together with SGD, using mini batches.

past meteor Oct 12, 2023, 6:32 PM

#

How large is your population?

hasty mountain Oct 12, 2023, 6:32 PM

#

I've used 20 in both devices.

#

Always using a mutation chance of 60%

past meteor Oct 12, 2023, 6:33 PM

#

hasty mountain 20. I've used 20 in both devices.

So your algo is effectively:

1. SGD till convergence
2. Copy 20x
3. Mutate with rate of 60 %
4. SGD best
5. Go to 3 and repeat till convergence

?

hasty mountain Oct 12, 2023, 6:33 PM

#

In the beginning, I used a decaying factor for this mutation chance, as the cumulative mutations tended to degrade the models performance. But then I removed it.

hasty mountain Oct 12, 2023, 6:36 PM

#

past meteor So your algo is effectively: 1. SGD till convergence 2. Copy 20x 3. Mutate with...

1. SGD till convergence (Pre-training) ---> This is the Vanilla Model
2. Initialize Population by copying the Vanilla model 20x.
3. Sample a batch from the data and run SGD --> Optimize.
4. Repeat the iteration and get the loss obtained after optimization --> SGD Loss
5. Mutate population with rate of 60%
6. Iterate through each individual in population --> Get individual losses.
7. Select best loss ---> If SGD Loss, proceed to 3. If one individual's Loss surpasses (is lower than) SGD Loss, that individual replaces the Vanilla model, proceed to 3.
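
For readers following along, the loop described in this message can be sketched roughly as below. This is only an illustration under several assumptions that are not in the chat: a toy linear-regression model stands in for the neural network, "mutation" is taken to mean adding Gaussian noise to a whole individual with probability 0.6, and the population is never reset to the current Vanilla model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data standing in for the real dataset (assumption).
X = rng.normal(size=(512, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=512)

def mse(w, xb, yb):
    """Mean squared error of weights w on the batch (xb, yb)."""
    return float(np.mean((xb @ w - yb) ** 2))

def sgd_step(w, xb, yb, lr=0.05):
    """One gradient step on the MSE loss."""
    grad = 2.0 * xb.T @ (xb @ w - yb) / len(yb)
    return w - lr * grad

# Step 1: pre-train the "Vanilla" model with plain SGD.
vanilla = np.zeros(3)
for _ in range(50):
    idx = rng.choice(len(X), size=32, replace=False)
    vanilla = sgd_step(vanilla, X[idx], y[idx])

# Step 2: the population is 20 copies of the Vanilla model.
population = [vanilla.copy() for _ in range(20)]

for step in range(200):
    idx = rng.choice(len(X), size=32, replace=False)
    xb, yb = X[idx], y[idx]
    # Steps 3-4: one SGD step on the Vanilla model, then its loss on the batch.
    vanilla = sgd_step(vanilla, xb, yb)
    sgd_loss = mse(vanilla, xb, yb)
    # Step 5: mutate each individual with probability 0.6 (noise model is an assumption).
    for ind in population:
        if rng.random() < 0.6:
            ind += rng.normal(scale=0.05, size=ind.shape)
    # Step 6: evaluate every individual on the same batch.
    losses = [mse(ind, xb, yb) for ind in population]
    best = int(np.argmin(losses))
    # Step 7: an individual replaces the Vanilla model only if it beats the SGD loss.
    if losses[best] < sgd_loss:
        vanilla = population[best].copy()

final_loss = mse(vanilla, X, y)
print("final loss on all data:", final_loss)
```

Because the mutations accumulate in this sketch, the individuals drift away from the Vanilla model over time, which matches the degradation mentioned earlier in the conversation.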

past meteor Oct 12, 2023, 6:37 PM

#

hasty mountain 1. SGD till convergence (Pre-training) ---> This is the Vanilla Model 2. Initial...

All with the same learning rate etc?

hasty mountain Oct 12, 2023, 6:38 PM

#

Yes, the learning rate is fixed. The SGD optimization is only applied to the Vanilla Model, even when an individual from the population replaces a Vanilla Model(becoming the new Vanilla)

#

Wait a second...the adam optimizer... It would be able to accompany those changes, right?

#

:yert:

ashen crypt Oct 12, 2023, 6:40 PM

#

boreal gale download EAST model from 1 download CRNN_VGG_BiLSTM_CTC model from 2 i think th...

Thank You

past meteor Oct 12, 2023, 6:41 PM

#

So, let's assume this is an EA, which it's not really: EA's are a waste of compute. Your compute is better spent on something else like bayes opt.

EA's shine where you truly need a global optimisation method for non-convex problems (note that EA's have 0 guarantees of finding the global optimum) that is tractable, or where there is a "large enough" gap between your heuristic and the global solution (you don't know this before running the EA).

hasty mountain Oct 12, 2023, 6:43 PM

#

I see... Sad, I like the idea of the EAs.
Looks like I'll have to review my work, then.

#

Thanks!

past meteor Oct 12, 2023, 6:43 PM

#

hasty mountain I see... Sad, I like the idea of the EAs. Looks like I'll have to review my work...

I'd look at EAs in the context of combinatorial optimization and not neural networks

hasty mountain Oct 12, 2023, 6:44 PM

#

I've seen there were some ideas of using them in Reinforcement Learning...and to also select hyperparameters for neural networks.

past meteor Oct 12, 2023, 6:44 PM

#

Yes but EAs are very very sensitive to hyperparameters themselves

hasty mountain Oct 12, 2023, 6:45 PM

#

I was thinking about trying something chaotic as mutations in some of my projects in Reinforcement Learning... gaming bias in RL is truly annoying...

past meteor Oct 12, 2023, 6:45 PM

#

You've shifted the problem from setting hyperparams of your single NN to setting hyperparams for your EA which involves training 100+ neural networks

#

Hence why it's a total waste of compute compared to bayes opt

hasty mountain Oct 12, 2023, 6:45 PM

#

:pithink:

past meteor Oct 12, 2023, 6:46 PM

#

They're really simple to try out. Have you heard of the knapsack problem?

verbal oar Oct 12, 2023, 6:46 PM

#

I have (2,) (8,)
but want instead of (8,) (1,8) but I cant transpose with .T

#

first is x second is weights

hasty mountain Oct 12, 2023, 6:47 PM

#

past meteor They're really simple to try out. Have you heard of the knapsack problem?

No. I have no idea on how bayesian optimization works. I only heard of it :yert:

#

But from what I'm reading now...the ELBo loss used in the Variational AutoEncoder could be something like that?

verbal oar Oct 12, 2023, 6:48 PM

#

```py
    def __init__(self,in_nodes,out_nodes):

        self.weights = Tensor((in_nodes * out_nodes))
        self.bias    = Tensor((1,out_nodes))
        self.type = 'linear'

        print('w shape', self.weights.data)
        print('b shape', self.bias.data.shape)

    def forward(self,x):

        print('x', x)
        output = np.dot(x,self.weights.data)+self.bias.data
```

past meteor Oct 12, 2023, 6:48 PM

#

hasty mountain No. I have no idea on how bayesian optimization works. I only heard of it <:yert...

Bayes opt is a hyper param tuning method that runs sequentially

#

At least, that's what it's good at compared to say grid search which you can run in parallel

#

It matters because you want to waste as little compute as possible when training NNs. Every evaluation should matter, because it takes too much time to be trying random things (which EAs do)

small wedge Oct 12, 2023, 6:51 PM

#

verbal oar ```py def __init__(self,in_nodes,out_nodes): self.weights = Ten...

what's the Tensor class from

verbal oar Oct 12, 2023, 6:52 PM

#

class Tensor():
    def __init__(self,shape):
        self.data = np.ndarray(shape,np.float32)
        self.grad = np.ndarray(shape,np.float32)

past meteor Oct 12, 2023, 6:52 PM

#

If you want to learn about them you should decouple EAs from NNs, just make a small project solving a classical problem like travelling salesman or knapsack. Generate fake data using Numpy and code out the whole thing from 0, just depend on Numpy. It's not a big project, it's <100 LoC. Then you'll "get" the pros and cons immediately 🙂
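
A rough sketch of what such a toy project could look like for the knapsack problem, depending only on NumPy. All concrete choices here (30 items, tournament selection, one-point crossover, bit-flip mutation, elitism, and every rate and size) are illustrative assumptions rather than anything prescribed in the chat.

```python
import numpy as np

rng = np.random.default_rng(42)

# Fake knapsack instance: 30 items with random weights and values.
n_items = 30
weights = rng.integers(1, 20, size=n_items)
values = rng.integers(1, 50, size=n_items)
capacity = int(weights.sum() * 0.4)

def fitness(pop):
    """Total value per individual; overweight solutions score 0."""
    total_w = pop @ weights
    total_v = pop @ values
    return np.where(total_w <= capacity, total_v, 0)

pop_size, n_generations, mutation_rate = 50, 100, 0.02
# Start sparse so that most of the initial population is feasible.
population = (rng.random((pop_size, n_items)) < 0.2).astype(int)
best_per_generation = []

for _ in range(n_generations):
    fit = fitness(population)
    best_per_generation.append(int(fit.max()))
    elite = population[np.argmax(fit)].copy()

    # Tournament selection: keep the fitter of two random individuals.
    a = rng.integers(0, pop_size, size=pop_size)
    b = rng.integers(0, pop_size, size=pop_size)
    parents = population[np.where(fit[a] >= fit[b], a, b)]

    # One-point crossover between each parent and its neighbour.
    mates = np.roll(parents, 1, axis=0)
    cut = rng.integers(1, n_items, size=pop_size)[:, None]
    children = np.where(np.arange(n_items) < cut, parents, mates)

    # Bit-flip mutation.
    flips = rng.random(children.shape) < mutation_rate
    children = np.where(flips, 1 - children, children)

    # Elitism: carry the best individual over unchanged.
    children[0] = elite
    population = children

best = population[np.argmax(fitness(population))]
print("best value:", int(best @ values), "weight:", int(best @ weights), "capacity:", capacity)
```

Changing `mutation_rate`, `pop_size`, or the selection scheme and watching `best_per_generation` is a quick way to see how sensitive the method is to its own settings.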

verbal oar Oct 12, 2023, 6:53 PM

#

hmm wait its float not int

#

my input is int

small wedge Oct 12, 2023, 6:55 PM

#

hm you could do smth like my_arr.resize((1, *my_arr.shape))

hasty mountain Oct 12, 2023, 6:55 PM

#

past meteor If you want to learn about them you should decouple EAs from NNs, just make a sm...

I see. I'll take a look. Thanks!

verbal oar Oct 12, 2023, 6:55 PM

#

ok

small wedge Oct 12, 2023, 6:55 PM

#

but you might just wanna use (8,1)/(1,8) arrays in the first place so you can use transpose
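
A small sketch of the shape issue being discussed, assuming plain NumPy arrays. `.T` does nothing on a 1-D array, so a `(8,)` array has to be reshaped before it can be transposed. It also looks like the `(8,)` shape comes from `Tensor((in_nodes * out_nodes))` passing the product 2 * 4 rather than the tuple `(in_nodes, out_nodes)`; that reading is an inference from the posted shapes, not something confirmed in the chat.

```python
import numpy as np

w = np.zeros(8, dtype=np.float32)    # shape (8,)
print(w.T.shape)                     # still (8,): .T is a no-op on 1-D arrays

row = w.reshape(1, -1)               # shape (1, 8)
also_row = w[np.newaxis, :]          # same shape via indexing
col = row.T                          # shape (8, 1): transpose works on 2-D arrays

# For a dense layer, np.dot expects a 2-D weight matrix of shape (in_nodes, out_nodes).
in_nodes, out_nodes = 2, 4
x = np.ones((1, in_nodes), dtype=np.float32)
weights = np.zeros((in_nodes, out_nodes), dtype=np.float32)
bias = np.zeros((1, out_nodes), dtype=np.float32)
output = np.dot(x, weights) + bias   # shape (1, 4)
print(output.shape)
```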

rapid oriole Oct 12, 2023, 10:09 PM

#

past meteor Sounds like all your weights are 0 except that of the intercept. You can look at...

Weights are not zero

delicate apex Oct 13, 2023, 12:34 AM

#

fun fun fun:
FutureWarning: Styler.applymap has been deprecated. Use Styler.map instead
(got this three times per usage, actually) okay, fine. i'll switch them to .map instead.
except now pylance complains of Cannot access member "map" for type "Styler" Member "map" is unknown !
:firRly:

desert oar Oct 13, 2023, 2:06 AM

#

past meteor In reality I'd never run an EA on a neural network though for various reasons. T...

Isn't EA normally used for architecture/hyperparameter search, as opposed to parameter optimization?

unique quail Oct 13, 2023, 2:43 AM

#

https://github.com/KeananC/MachineLearning/tree/main can you take a look at my cluster alg and give ideas on how to make another one for 2d data(same like style/simpleness, since i just got into ml)

desert oar Oct 13, 2023, 3:26 AM

#

unique quail https://github.com/KeananC/MachineLearning/tree/main can you take a look at my c...

you invented your own algorithm? i admire the ambition but most people are not inventing their own algorithms. the reason is that existing algorithms people use tend to be very carefully crafted, and often have important proven theoretical properties. it takes a lot of work to invent one from scratch, and there's no guarantee it will even work well.

#

the one with the gaps is interesting. it sounds like your intention is to split the data into 2 clusters by finding the biggest gap between two data points? what about data like this? [1, 2, 11, 12, 21, 22]. presumably you'd need to extend this to handle more than 2 clusters.

#

it sounds like maybe you're at the start of reinventing something like divisive hierarchical clustering https://en.wikipedia.org/wiki/Hierarchical_clustering#Divisive_clustering
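
To make the `[1, 2, 11, 12, 21, 22]` example concrete, here is a minimal sketch that assumes the algorithm sorts 1-D data and splits at the largest gap, as guessed above. The extension to `k` clusters (splitting at the `k - 1` largest gaps) is one possible generalisation, not the linked repository's code.

```python
import numpy as np

data = np.array([1, 2, 11, 12, 21, 22])
sorted_data = np.sort(data)
gaps = np.diff(sorted_data)              # distances between neighbouring points

# Two clusters: split once at the (first) largest gap.
split = int(np.argmax(gaps))
left, right = sorted_data[:split + 1], sorted_data[split + 1:]
print(left, right)                       # the right part still contains two obvious groups

# k clusters: split at the k - 1 largest gaps instead.
k = 3
cut_points = np.sort(np.argsort(gaps)[-(k - 1):]) + 1
clusters = np.split(sorted_data, cut_points)
print([c.tolist() for c in clusters])
```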

shut girder Oct 13, 2023, 4:00 AM

#

Hello, does anyone know a beginner friendly data analysis project that requires minimum knowledge of statistics? I want to become more familiar with the 3 popular Python libraries and practice the responsibilities of a data analyst by working on projects.

quartz karma Oct 13, 2023, 4:01 AM

#

Hi, anyone familiar with seaborn? I am wondering if I can change a legend label without having to provide any other related settings like bbox_to_anchor, loc, etc. Thanks.

desert oar Oct 13, 2023, 4:11 AM

#

shut girder Hello, does anyone know a beginner friendly data analysis project that requires ...

https://www.kaggle.com/competitions/titanic
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques
you can follow the prompt and try to fit a predictive model, but these are really good datasets just for getting comfortable with cleaning and exploring data, and forming & testing hypotheses

shut girder Oct 13, 2023, 4:14 AM

#

Thanks

hasty mountain Oct 13, 2023, 4:18 AM

#

desert oar Isn't EA normally used for architecture/hyperparameter search, as opposed to par...

Hey, I was annoyed by the time SGD demands...and the 30 hours in Kaggle wasn't being enough...
and I was feeling creative

desert oar Oct 13, 2023, 4:19 AM

#

hey, props for creativity

past meteor Oct 13, 2023, 6:03 AM

#

desert oar Isn't EA normally used for architecture/hyperparameter search, as opposed to par...

Yes

#

It's even too expensive for architecture and hyperparameter search. It's a blackbox optimization method, it's suitable for rich research groups to search over architectures and then tell us plebs what they found 🙂

median monolith Oct 13, 2023, 7:34 AM

#

I dunno where to post for help sorry :/ Im a python n00b.
I'm installing fenics, though a WSL (Ubuntu) and this is taking forever (days) there seems to be progress, ie. the last report line is changing every couple of hours and so. But I am worried this might never finish; should I be worried? Is this normal? How to fix? halp

gloomy parrot Oct 13, 2023, 9:46 AM

#

Hello everyone, does anyone know how can i use pytorch in aws lambda? Currently i cant because of the torch size and the limit in lambda is 50mb only

verbal oar Oct 13, 2023, 12:25 PM

#

how can I implement pred_proba ?

#

I have
```py
def predict(self,data):
    X = data
    for f in self.computation_graph: X = f.forward(X)
    return X
```

#

on forward

        unnormalized_proba = np.exp(x-np.max(x,axis=1,keepdims=True))
        self.proba         = unnormalized_proba/np.sum(unnormalized_proba,axis=1,keepdims=True)
        self.target        = target

        print(self.proba)

#

looks like I hardly have

#

hmm in predict there is forward()

#

ah so I have it
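
For reference, a `predict_proba` along these lines can be sketched as a forward pass followed by the same numerically stable softmax shown above. The function names and the stand-in layers below are hypothetical and only illustrate the idea; they are not the poster's `DL` module.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = logits - np.max(logits, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)

def predict_proba(forward_fns, data):
    """Run the data through each layer's forward function, then apply softmax."""
    x = data
    for forward in forward_fns:
        x = forward(x)
    return softmax(x)

# Stand-in layers: a fixed linear map followed by ReLU (made up for the example).
layers = [
    lambda x: x @ np.array([[1.0, -1.0], [0.5, 2.0]]),
    lambda x: np.maximum(x, 0.0),
]
proba = predict_proba(layers, np.array([[1.0, 2.0], [3.0, 0.0]]))
print(proba)
```

Each row of `proba` sums to 1, and `np.argmax(proba, axis=1)` gives the same class as taking the argmax of the raw scores.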

desert oar Oct 13, 2023, 2:51 PM

#

past meteor It's even too expensive for architecture and hyperparameter search. It's a black...

Lol makes sense. Is it too slow even for smaller models like fully connected with one hidden layer?

past meteor Oct 13, 2023, 2:58 PM

#

desert oar Lol makes sense. Is it too slow even for smaller models like fully connected wit...

Maybe not 🙂 it's not a matter of it being too slow but rather too wasteful. SGD sends you to a local minimum (in non-convex problems) but that's fine if, as many have claimed, there's a surface with many local minima that don't really differ a lot. Finding the lowest point in this space doesn't give you a lot.

#

Just running N SGD's from scratch may be better

#

But yeah, I've done a fair share of EA's, they're my go to toy problem when trying new languages. I can't stress enough how sensitive they are to their own hyperparameters. You need to tune the thing that tunes your model 🤔

verbal oar Oct 13, 2023, 3:11 PM

#

```py
    test = label_encoder(['beer','milk'])
    predicted_labels = np.argmax(model.predict(test),axis=1)
    accuracy         = np.sum(predicted_labels==y)/len(y)
    print("Model Accuracy = {}".format(accuracy))

    print('predicted ', model.predict(test))
    print(np.where(predicted_labels == 0, 'chips', 'cereals'))
```

#

so

#

how can I convert it to display array of ['chips', 'cereals'] instead of just ['chips']?

#

because at first I want all predictions then I want to think about only one prediction as is the case when deploying model and making one prediction

#

    batch_size        = 2
    num_epochs        = 10
    samples_per_class = 100
    num_classes       = 2
    hidden_units      = 4

    x = ['beer','milk']
    y = ['chips','cereals']
    
    def label_encoder(list):
        l = np.arange(len(list))
        return l

    x = label_encoder(x).reshape(1,-1)
    y = label_encoder(y)

    model             = utilities.Model()
    model.add(DL.Linear(2,hidden_units))
    model.add(DL.ReLU())
    model.add(DL.Linear(hidden_units,num_classes))
    optim   = DL.SGD(model.parameters,lr=1.0,weight_decay=0.001,momentum=.9)
    loss_fn = DL.SoftmaxWithLoss()
    model.fit(x,y,batch_size,num_epochs,optim,loss_fn)

#

this is first part

#

I converted labels to ints

#

trained and try to predict
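
One likely reason only a single label comes back: the setup code reshapes `x` with `reshape(1,-1)`, which NumPy treats as one sample with two features, so the model produces one row of scores and `argmax` yields one prediction. A minimal sketch of getting one name per sample, using made-up scores in place of the model output (`scores` and `class_names` are assumptions for illustration):

```python
import numpy as np

class_names = np.array(['chips', 'cereals'])

# Shape (1, 2) is ONE sample with two features;
# shape (2, 1) is TWO samples with one feature each.
encoded = np.arange(2)
one_sample = encoded.reshape(1, -1)
two_samples = encoded.reshape(-1, 1)
print(one_sample.shape, two_samples.shape)

# Made-up scores, one row per sample.
scores = np.array([[2.0, 0.1],
                   [0.3, 1.5]])
predicted_labels = np.argmax(scores, axis=1)

# Index the array of names with the predicted class indices.
predicted_names = class_names[predicted_labels]
print(predicted_names)
```

Feeding two samples of one feature each would presumably also require the first layer to be `Linear(1, hidden_units)` instead of `Linear(2, hidden_units)`. The same indexing works for a single prediction at deployment time, since a one-row score array simply yields a one-element array of names.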

shy carbon Oct 13, 2023, 3:23 PM

#

I'm trying to install pytorch3D for windows and I'm struggling. is there no prebuilt packages for this ? Do I really have to build from sources ?

serene scaffold Oct 13, 2023, 3:25 PM

#

shy carbon I'm trying to install pytorch3D for windows and I'm struggling. is there no preb...

looks like your only options for windows are to build from source or use conda (according to this)

shy carbon Oct 13, 2023, 3:30 PM

#

serene scaffold looks like your only options for windows are to build from source or use conda (...

from that same page, I haven't found the instructions for using conda

#

there's conda instruction for linux and macOS

#

but the only place that talks about windows is the Building / installing from source section

#

and when i try to install from the source, I still have issues I don't understand

#

I am able to import torch and torchvision in a python repl, but when I try to install pytorch3D I have "no module named torch"

wheat fox Oct 13, 2023, 3:43 PM

#

I tried building an AI but I was only able to generate random weird sentences that are grammatically correct but sound nonsensical. How do I go further?

serene scaffold Oct 13, 2023, 3:52 PM

#

wheat fox I tried building an AI but I was only able to generate random weird sentences th...

for someone to help you go further, they first have to know what you already did

desert oar Oct 13, 2023, 3:54 PM

#

past meteor Just running N SGD's from scratch may be better

that's what i was wondering - there's no way to use SGD for hyperparameter search that i know of. unless you mean something weird like gradient boosting where the base learner is a neural network...

#

(there are also many things i do not know!)

past meteor Oct 13, 2023, 3:56 PM

#

Nope because hyperparameter search, even for simpler stuff, is again a non-convex problem 😩 . Hence why I mentioned bayes opt, it's a global optimizer like EA's but it's more data efficient

desert oar Oct 13, 2023, 3:57 PM

#

right, i've used bayes opt before but never EA

past meteor Oct 13, 2023, 3:57 PM

#

Code an EA up over the weekend 😄 it sounds like more than it is

#

<100 LoC

desert oar Oct 13, 2023, 3:57 PM

#

i wasn't sure where you were headed with the point about SGD though, unless that was just following up from the original idea

#

yeah it doesn't seem hard, but as you said it has its own parameters that need tuning and i never saw it as something i could use effectively

#

would be a good programming exercise though, i should do it

#

maybe a good excuse to practice julia

past meteor Oct 13, 2023, 3:59 PM

#

desert oar i wasn't sure where you were headed with the point about SGD though, unless that...

Say looking for hyperparameters were differentiable, it would be a non-convex problem so SGD would struggle there

desert oar Oct 13, 2023, 4:03 PM

#

i see, and agreed

past meteor Oct 13, 2023, 4:14 PM

#

Btw there's many things forgotten in the literature 😄

SVMs are so good because they only have 1 or 2 hyper parameters and the RBF SVM is a universal approximator. They're also somewhat interpretable. They just suck at scale but are a great go-to if you're under 30k data points.

quartz karma Oct 13, 2023, 4:17 PM

#

past meteor Btw there's many things forgotten in the literature 😄 SVMs are so good becaus...

for text classifier the svm seems not bad, but could you list a few other methods that can do similar task except svm?

past meteor Oct 13, 2023, 4:19 PM

#

quartz karma for text classifier the svm seems not bad, but could you list a few other method...

I'm not sure I'd use it for text classification myself. Other methods that work are the canonical random forest and gradient boosting type models (for tabular data) and neural networks for unstructured data but all 3 have more "dials" (hyperparams) to turn than an SVM.

quartz karma Oct 13, 2023, 4:28 PM

#

past meteor I'm not sure I'd use it for text classification myself. Other methods that work ...

since there so many variants of neural networks, do you have recommendations about some of them with "relatively" less hyperparameters?

past meteor Oct 13, 2023, 4:29 PM

#

All of them have the same amount. You can always add a layer and you can always add a neuron.

desert oar Oct 13, 2023, 4:43 PM

#

specifically for text classification with bag of words, i've found that plain ridge regression beats SVM

#

also there was some blog series i found years ago that went into great detail on why plain linear models were superior to RBF SVMs for text classification specifically

#

something about how the RBF kernel acts like a low-pass filter, i'll have to find it

past meteor Oct 13, 2023, 4:45 PM

#

It's a bit of an apples-and-oranges comparison because a linear SVM could be used as well

desert oar Oct 13, 2023, 4:45 PM

#

yeah but i think at that point actual computation performance tends to be much better with e.g. liblinear

#

or just l-bfgs which is what we ended up using in that particular project (because we had some customizations)

past meteor Oct 13, 2023, 4:45 PM

#

hmmm, it depends.

#

You can solve linear SVMs in the primal (number of unknowns are the number of variables) or the dual (the number of unknowns are the number of data points)

#

Note: you can solve ridge in the dual as well but nobody does this 🙂
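
The primal/dual point can be checked numerically for ridge regression. A small sketch with random data, no intercept, and an arbitrary regularisation strength `lam` (all assumptions for the demo): the primal solves a system with one unknown per feature, the dual solves one with one unknown per data point, and both give the same weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 100          # fewer data points than features
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)
lam = 1.0

# Primal: solve a (n_features x n_features) system for the weights directly.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Dual: solve a (n_samples x n_samples) system, then map back to weights.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n_samples), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))
```

With 20 samples and 100 features the dual system is 20 x 20 instead of 100 x 100, which is the case where solving in the dual pays off.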

desert oar Oct 13, 2023, 4:47 PM

#

quartz karma since there so many variants of neural networks, do you have recommendations ab...

one approach i see very often is to use pre-trained embeddings for the model, so conceptually you end up partitioning the parameter space into "embedding parameters" and "model parameters" which you learn separately, or find a pre-trained model to obtain the former

#

i've done that too, worked well enough. i'm not sure if it's exactly the same as what people call "transfer learning" but i think conceptually it's close at least

past meteor Oct 13, 2023, 4:48 PM

#

BoW is something where you typically have num_data < num_unique_words so if your ridge regression impl. does not allow to solve in the dual going to an SVM, which solves in the dual by default, will be much faster.

desert oar Oct 13, 2023, 4:48 PM

#

of course there's also fine-tuning

desert oar Oct 13, 2023, 4:49 PM

#

past meteor BoW is something where you typically have num_data < num_unique_words so if your...

in the particular problem i was thinking of, we had a few million rows and we cut down the token space with some pre-filtering and hashing to a few thousand or tens of thousands, but we also had > 1k classes and it was all very sparse and bad

#

i still dont have a good sense of the performance characteristics of the two algorithms at those kinds of extremes, i was just trying stuff and went with what worked 😛

past meteor Oct 13, 2023, 4:50 PM

#

desert oar i still dont have a good sense of the performance characteristics of the two alg...

At the end of the day this is the only thing that matters imho 😄

desert oar Oct 13, 2023, 4:51 PM

#

thanks, i sometimes need to remind myself it's ok to experiment and not just know everything!

past meteor Oct 13, 2023, 4:51 PM

#

Yeah, the only core skill is knowing where to find information and knowing how to evaluate models properly. From then onwards it's empirics (unless you're doing fundamental research ofc)

quartz karma Oct 13, 2023, 5:08 PM

#

desert oar one approach i see very often is to use pre-trained embeddings for the model, so...

that sounds new to me, a novice here. So how would you distinguish the embedding and model parameters? Is there some guidance on common practice?

desert oar Oct 13, 2023, 6:37 PM

#

quartz karma that sounds new to me, a novice here, so how would you distinguish the embed...

i mean that you'd just fit them separately. for example in the "old days" you'd do something like use fasttext or word2vec to get token embeddings, and then fit a model using those embeddings as features

#

or you could go even simpler than that, using tfidf and/or dimension reduction like pca

#

it's less optimal in the sense that the embeddings are not "learned" using the actual objective function of interest

#

but for computational and practical reasons it might be better, eg. if you don't have enough data to get a good estimate of the large number of parameters involved

rugged comet Oct 13, 2023, 7:48 PM

#

@past meteor Having some more discussion on taking the absolute value of the residuals. I got the impression that the residuals were just the actual values minus the predicted values, not the absolute value of that result. I'm getting some conflicting information.

#

My question is: When, if ever do you think we take the abs of the residuals and why?

past meteor Oct 13, 2023, 7:55 PM

#

rugged comet @past meteor Having some more discussion on taking the absolute value o...

When "debugging" your model you shouldn't take the absolute values of the residuals, you're interested in knowing when it's under and overshooting

past meteor Oct 13, 2023, 7:55 PM

#

rugged comet My question is: When, if ever do you think we take the abs of the residuals and ...

When you have outliers in your data evaluating with MSE is bad because squaring errors makes them explode

rugged comet Oct 13, 2023, 7:59 PM

#

To be clear, when defining residuals, do we take the abs or not?

past meteor Oct 13, 2023, 8:00 PM

#

rugged comet To be clear, when defining residuals, do we take the abs or not?

... no

warm steppe Oct 13, 2023, 8:02 PM

#

anyone else try running "conda list" on the new Python in Excel release?

rugged comet Oct 13, 2023, 8:13 PM

#

past meteor ... no

Under what circumstances would you find it to be appropriate to take the abs of the residuals?

#

For example, doing the Shapiro-Wilk test.

past meteor Oct 13, 2023, 8:15 PM

#

rugged comet Under what circumstances would you find it to be appropriate to take the abs of ...

As I mentioned, in the context of the mean absolute error 🙂 Sometimes I also do this np.argmax(np.abs(residuals))

#

And then I investigate what went wrong

#

Been too long since I did the shapiro-wilk test

#

I can't comment on that specifically

rugged comet Oct 13, 2023, 8:18 PM

#

Can anyone else here comment on taking the abs of the residuals before performing the Shapiro-Wilk test?

#

@past meteor The reason it seems I'm so obsessed with taking the abs of the residuals is because my instructor had a code snippet that did it.

# Calculate some characteristics of the residuals
residuals = np.abs(x.values - predictions.values)

Now everyone in the class seems to think that in order to calculate the residuals, you have to take the abs.

past meteor Oct 13, 2023, 8:21 PM

#

rugged comet @past meteor The reason it seems I'm so obsessed with taking the abs of...

That's totally incorrect

rugged comet Oct 13, 2023, 8:22 PM

#

And can you say why?

past meteor Oct 13, 2023, 8:22 PM

#

Because it just is, residuals are just not defined that way

#

https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/residual/ notice how they say "Negative if they are below the regression line"

#

https://www.khanacademy.org/math/statistics-probability/describing-relationships-quantitative-data/regression-library/a/introduction-to-residuals notice how there's negative residuals here too

past meteor Oct 13, 2023, 8:32 PM

#

rugged comet @past meteor The reason it seems I'm so obsessed with taking the abs of...

Another one here: https://www.mathworks.com/help/stats/residuals.html from mathworks (matlab) you see the histogram of residuals has negative values. They are not taking the abs as well. I'm adding so many links because you don't have to take it from me 🙂

#

The reason why not taking the abs is relevant is that you want to differentiate between over and undershooting ofc
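
A small sketch of the distinction, using made-up numbers rather than anything from the thread: residuals keep their sign, and the absolute value only shows up when summarising error size or hunting for the worst miss.

```python
# Hypothetical toy data, not from the thread.
actual = [10.0, 12.0, 9.0, 15.0, 11.0]
predicted = [11.0, 11.0, 10.5, 14.0, 20.0]  # the last prediction is a large overshoot

# Residuals keep their sign: actual minus predicted.
residuals = [a - p for a, p in zip(actual, predicted)]

# The mean of the signed residuals shows the direction of the bias
# (negative here means the model overshoots on average).
mean_residual = sum(residuals) / len(residuals)

# Absolute values only enter when summarising error size (MAE) ...
mae = sum(abs(r) for r in residuals) / len(residuals)
# ... while MSE squares each residual, so one large miss dominates it.
mse = sum(r * r for r in residuals) / len(residuals)

# Pure-Python counterpart of np.argmax(np.abs(residuals)): index of the worst miss.
worst = max(range(len(residuals)), key=lambda i: abs(residuals[i]))
```

Taking `abs` when defining `residuals` would make it impossible to tell the overshoot at index 4 from an undershoot of the same size.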

desert oar Oct 14, 2023, 12:07 AM

#

@rugged comet i second what past meteor said, it's just not what residuals are and it's neither required nor desirable for the shapiro wilk test
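
One way to see why, as a sketch with simulated data (the seed and sample size are arbitrary choices): if the signed residuals are roughly normal, their absolute values follow a folded, right-skewed distribution, so a normality test on them would tend to reject even for a well-behaved model. In practice the signed residuals are what would go to something like `scipy.stats.shapiro`.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the mean cubed z-score (near 0 for a symmetric sample)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Simulated residuals from a well-behaved model: symmetric around zero.
residuals = [rng.gauss(0.0, 1.0) for _ in range(5000)]
abs_residuals = [abs(r) for r in residuals]

signed_skew = skewness(residuals)      # close to 0
folded_skew = skewness(abs_residuals)  # clearly positive (right-skewed)
```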

serene scaffold Oct 14, 2023, 1:41 AM

#

always ask your actual question right away. don't ask to ask.

#

you're still asking to ask. Assume that someone said they will help you--what would they need to know to start helping?

ancient fossil Oct 14, 2023, 2:14 AM

#

If I define a nested model like this:

class Siamese(nn.Module):
    def __init__(self, model):
        super(Siamese, self).__init__()
        self.model = model

    def forward(self, x1, x2):
        output1 = self.model(x1)
        output2 = self.model(x2)
        return output1, output2
...
model = Siamese(BaseModel(args)).to(device)

Will it fail to save the complete state dict?

torch.save(model.state_dict(), 'ckpt.pt')
...
model.load_state_dict(torch.load('ckpt.pt'))

When I load the model and resume training it appears to start from scratch

serene scaffold Oct 14, 2023, 2:28 AM

#

ancient fossil If I define a nested model like this: ```py class Siamese(nn.Module): def __...

What do you see that makes you think it starts from scratch?

#

Also is BaseModel a subclass of nn.Module?

ancient fossil Oct 14, 2023, 2:32 AM

#

serene scaffold What do you see that makes you think it starts from scratch?

Being a contrastive learning framework it tends to start accuracy metric at ~0.5 and climb slowly from there, so I just gotta check that

ancient fossil Oct 14, 2023, 2:33 AM

#

serene scaffold Also is BaseModel a subclass of nn.Module?

it's just this


class TCN(nn.Module):
    def __init__(self, in_dim, out_dim):
        super(TCN, self).__init__()
        # some conv layers etc
    def forward(self, x):
        ...
        return x

serene scaffold Oct 14, 2023, 2:33 AM

#

ancient fossil Being a contrastive learning framework it tends to start accuracy metric at ~0.5...

You should look at the weights of the TCN model before and after you save the Siamese model

ancient fossil Oct 14, 2023, 2:35 AM

#

serene scaffold You should look at the weights of the TCN model before and after you save the Si...

Would that be done by comparing the .parameters() of both states?

serene scaffold Oct 14, 2023, 2:36 AM

#

ancient fossil Would that be done by comparing the `.parameters()` of both states?

Right, of the original, and of the saved/loaded copy

ancient fossil Oct 14, 2023, 2:37 AM

#

serene scaffold Right, of the original, and of the saved/loaded copy

Ok, one sec

ancient fossil Oct 14, 2023, 2:41 AM

#

serene scaffold Right, of the original, and of the saved/loaded copy

All the weights & bias are different

ancient fossil Oct 14, 2023, 2:44 AM

#

serene scaffold Right, of the original, and of the saved/loaded copy

Hold on, the weights are diff even if I initialize two identical models with the same seed

serene scaffold Oct 14, 2023, 2:44 AM

#

I'll see if I can look into it tomorrow

ancient fossil Oct 14, 2023, 2:45 AM

#

serene scaffold I'll see if I can look into it tomorrow

Alright, appreciate it

lapis sequoia Oct 14, 2023, 3:01 AM

#

hi

past meteor Oct 14, 2023, 5:26 AM

#

ancient fossil Hold on, the weights are diff even if I initialize two identical models with the...

Don't do it that way tbh

#

The easiest way to implement a Siamese network is just to use the same neural net and feed it two inputs, calculate the loss and backprop. You don't need a special NN for that, just do it in your training loop. It's confusing I know, I had the same idea before I made a Siamese net myself for the first time.
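
A rough sketch of that approach, assuming PyTorch; the small `nn.Sequential` encoder is a hypothetical stand-in for the TCN, and the data, labels and margin are made up. One encoder embeds both inputs, a contrastive loss is computed in the training step, and the checkpoint is then round-tripped to check that the weights survive:

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)


def make_encoder():
    # Hypothetical stand-in for the TCN from the question.
    return nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))


encoder = make_encoder()
optimizer = torch.optim.SGD(encoder.parameters(), lr=0.1)

# One training step: the same encoder embeds both inputs.
x1, x2 = torch.randn(32, 8), torch.randn(32, 8)
same = torch.randint(0, 2, (32,)).float()  # 1 = matching pair (random toy labels)
sq_dist = (encoder(x1) - encoder(x2)).pow(2).sum(dim=1)
dist = torch.sqrt(sq_dist + 1e-8)  # small epsilon keeps the gradient finite
# Contrastive loss with margin 1: pull matching pairs together, push the rest apart.
loss = (same * sq_dist + (1 - same) * torch.clamp(1 - dist, min=0).pow(2)).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Checkpoint round trip through an in-memory buffer (a file path works the same way).
buffer = io.BytesIO()
torch.save(encoder.state_dict(), buffer)
buffer.seek(0)
restored = make_encoder()  # freshly initialised, so it differs until loaded
restored.load_state_dict(torch.load(buffer))

weights_match = all(
    torch.equal(a, b)
    for a, b in zip(encoder.state_dict().values(), restored.state_dict().values())
)
```

When comparing weights, compare the trained model against the copy that went through `load_state_dict`. Two models constructed one after the other following a single `torch.manual_seed` call draw different random numbers, so they are expected to differ.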

odd relic Oct 14, 2023, 7:17 AM

#

ok Ive been getting this painful error for a very long time

#

so im in essence attempting to create a real-time training situation

#

code:

import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Model
from tensorflow import keras
import time
import numpy as np

epoch = 300
n = 1000
duration = 5
learning_rate = 1e-4
optimizer = keras.optimizers.SGD(learning_rate=learning_rate)

def lossfunction(distX, crash, speed, distY):

    term1 = tf.add(1.0, tf.add(tf.multiply(5.0, crash), tf.multiply(5.0, distY)))
    term2 = tf.add(distX, speed)
    loss = tf.subtract(tf.multiply(n, term1), tf.multiply(2.0, term2))

    return loss
def returnData(input_data):

    return None

def getData():
    data = np.random.rand(5)
    print(data)
    return tf.convert_to_tensor(data.reshape(1, -1), dtype=tf.float32)

def getSpeed():

    return float(input("speed"))

def create_model():
    input_tensor = Input(shape=(5,))
    x = Dense(256, activation='relu', trainable=True)(input_tensor)
    x = Dense(512, activation='relu', trainable=True)(x)
    x = Dense(512, activation='relu', trainable=True)(x)
    x = Dense(256, activation='relu', trainable=True)(x)
    x = Dense(2, activation='sigmoid', trainable=True)(x)
    model = Model(inputs=input_tensor, outputs=x)
    model.summary()
    return model

model = create_model()

for i in range(epoch):
    print("\nStart of epoch %d" % (i,))
    start_time = time.time()

    while time.time() - start_time < duration:
        distX = float(input("distx "))
        crash = int(input("crash "))
        speed = getSpeed()
        distY = float(input("distY "))

        with tf.GradientTape() as tape:
            input_data = getData()
            predictions = model(input_data, training=True)
            loss = lossfunction(distX, crash, speed, distY)
            print("TrainingLoss: " + str(loss))

            returnData(predictions)

        gradients = tape.gradient(loss, model.trainable_weights)
        print("Gradients: ", gradients)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))

#

error:

TrainingLoss: tf.Tensor(30994.0, shape=(), dtype=float32)
Gradients:  [None, None, None, None, None, None, None, None, None, None]
Traceback (most recent call last):
  File "robotAI.py", line 85, in <module>
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))
  File "C:\Users\sidha\anaconda3\envs\OneWhereUcanuseGPU\lib\site-packages\keras\optimizers\optimizer_v2\optimizer_v2.py", line 689, in apply_gradients
    grads_and_vars = optimizer_utils.filter_empty_gradients(grads_and_vars)
  File "C:\Users\sidha\anaconda3\envs\OneWhereUcanuseGPU\lib\site-packages\keras\optimizers\optimizer_v2\utils.py", line 77, in filter_empty_gradients
    raise ValueError(
ValueError: No gradients provided for any variable: (['dense/kernel:0', 'dense/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0', 'dense_2/kernel:0', 'dense_2/bias:0', 'dense_3/kernel:0', 'dense_3/bias:0', 'dense_4/kernel:0', 'dense_4/bias:0'],). Provided grads_and_vars is ((None, <tf.Variable 'dense/kernel:0' shape=(5, 256) dtype=float32, numpy=

odd meteor Oct 14, 2023, 8:44 AM

#

Yes I have done that a couple of times with Lime & SHAP. But I still don’t know what exactly you need help with

wheat fox Oct 14, 2023, 10:16 AM

#

When do you guys think OpenAI will achieve AGI? It doesn't seem like DeepMind can overtake OpenAI

odd meteor Oct 14, 2023, 11:01 AM

#

wheat fox When do you guys think OpenAI will achieve AGI? It doesn't seem like DeepMind ca...

I don't believe anyone should build AGI

#

wheat fox Oct 14, 2023, 1:54 PM

#

odd meteor I don't believe anyone should build AGI

AGI would be able to figure out everything that is possible and profitable to figure out. We can advance so much in physics and medicine. AGI wouldn't be exhausted, would train itself to spot and ignore incorrect notions and use data faster than our brains.

odd meteor Oct 14, 2023, 2:24 PM

#

wheat fox AGI would be able to figure out everything that is possible and profitable to fi...

The real question is, why should anyone even be interested in building an AI that can do "everything"? Instead of that why not an AI that's good at doing a specific task very well 🤷

#

The dangers of AGI outweigh its proposed advantages. People who do ML research in ethics are even vehemently against it

past meteor Oct 14, 2023, 2:26 PM

#

I think the idea is that there's similarity between tasks you can exploit to make the entire thing more data efficient? Like in the multi task learning case.

#

But yeah, I'm sceptical we can ever build whatever AGI is anyway

odd meteor Oct 14, 2023, 2:33 PM

#

past meteor I think the idea is that there's similarity between tasks you can exploit to mak...

I used to have the mindset that we all should push towards achieving AGI until I met Timnit Gebru in August.

She pretty much delved into the reason why it's a very dangerous mindset to have. It's like someone tryna build a god

past meteor Oct 14, 2023, 2:34 PM

#

odd meteor I used to have the mindset that we all should push towards achieving AGI until I...

Is the presentation somewhere?

Personally I have no opinion on whether we should or shouldn't, I defer that to philosophers 😅

odd meteor Oct 14, 2023, 2:35 PM

#

past meteor Is the presentation somewhere? Personally I have no opinion on whether we shou...

😂 Unfortunately I didn't record the whole conversation but I'll check and revert back with her talk on same topic

still whale Oct 14, 2023, 2:36 PM

#

Drunk Philosophy @past meteor ?

odd meteor Oct 14, 2023, 2:39 PM

#

past meteor Is the presentation somewhere? Personally I have no opinion on whether we shou...

https://stanforddaily.com/2023/02/15/utopia-for-whom-timnit-gebru-on-the-dangers-of-artificial-general-intelligence/

The Stanford Daily

Sophia Artandi

‘Utopia for Whom?’: Timnit Gebru on the dangers of Artificial Gener...

At the 2023 Symbolic Systems Distinguished Speaker Lecture, Timnit Gebru ’08 M.A. ’10 Ph.D. ’17 spoke on the dangers of artificial intelligence.

odd meteor Oct 14, 2023, 2:40 PM

#

past meteor Is the presentation somewhere? Personally I have no opinion on whether we shou...

Her Keynote speech at DLI is more detailed

https://www.youtube.com/live/5cm_FvHmtVI?si=PCvza6Arsx_tlHGK

YouTube

Deep Learning Indaba

Deep Learning Indaba DAY 2


past meteor Oct 14, 2023, 2:47 PM

#

odd meteor Her Keynote speech at DLI is more detailed https://www.youtube.com/live/5cm_Fv...

Thanks! I'll listen to it this evening 😁😁

still whale Oct 14, 2023, 3:10 PM

#

@odd meteor that was actually, thoughtful and more on the mind than well thought out and made to appear as though she spent 17 straight hours working out her points

#

The ai realm is pretty scary. Modelling and data collection with open source ai sources is honestly a nightmare in my mind, I jumped to the Facebook surveying and the election re trump. Stereotyping millions from adverts and “personality” traits

serene scaffold Oct 14, 2023, 3:29 PM

#

still whale The ai realm is pretty scary. Modelling and data collection with open sourc...

When you say "stereotyping millions from adverts and personality traits", what outcome are you worried about?

odd relic Oct 14, 2023, 3:38 PM

#

odd relic code: ```py import tensorflow as tf from tensorflow.keras.layers import Dense, I...

fixed it, had to incorporate predictions(my fault that was kind of obv but I was hoping to find a loophole) I incorporated speed calculation by using simple math to calculate the speed of the robot
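
For anyone hitting the same error: in the posted loop, `lossfunction` only uses the typed-in measurements and never touches `predictions`, so there is no path from the loss back to the weights and `tape.gradient` returns `None` for every variable. A framework-free sketch of the same effect, with a hypothetical one-weight model and finite differences standing in for the gradient tape:

```python
def model(w, x):
    """Stand-in for the network: a single weight times the input."""
    return w * x


def loss_ignoring_prediction(w, x):
    _ = model(w, x)        # the prediction is computed but never used
    return 5.0 * x + 1.0   # depends only on the external measurement


def loss_using_prediction(w, x, target=2.0):
    return (model(w, x) - target) ** 2


def numeric_grad(loss, w, x, eps=1e-6):
    """Central finite-difference estimate of d(loss)/dw."""
    return (loss(w + eps, x) - loss(w - eps, x)) / (2 * eps)


# The weight has no effect on the first loss, so its gradient is zero.
g_ignored = numeric_grad(loss_ignoring_prediction, w=0.5, x=3.0)
# The second loss is computed from the prediction, so the weight gets a signal.
g_used = numeric_grad(loss_using_prediction, w=0.5, x=3.0)
```

In the TensorFlow version the fix has the same shape: the loss has to be computed from `predictions` with TensorFlow ops inside the `tf.GradientTape` block.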

odd meteor Oct 14, 2023, 3:43 PM

#

The thing is, most big tech companies don't really like ML researchers whose research work is in ethics, more specifically those who are quite vocal and strong-willed.

This is because it exposes the dirty side of what goes on behind those hyped-up models that fetch them constant 💵

I see ethics as that field in AI where, as an ML researcher, if you're not strong and don't have a thick skin, you'll be crushed very easily. More so, you certainly can't escape being gaslit, harassed, attacked, called "angry bird", even fired.

If Google could fire Timnit, lol... I mean 🤷😀

still whale Oct 14, 2023, 3:48 PM

#

serene scaffold When you say "stereotyping millions from adverts and personality traits", what o...

To me, like honestly I operate under the assumption I’m wrong or right at any given moment.. but it seems ominous that so much data can be hoovered up. Honestly in that instance it’s more the way it was implemented on shallow variables to target groups with certain propaganda. Really showing psychology is pseudo, check ya Facebook group likes. You could be latent who knows what, you wouldn’t even know unless a computer model categorised your data and sold it to companies to be used as analytics! And then got privy through the targeted ads… maybe I’m a skeptic but the direction of ai , the horizons seem scary to me

#

I mean not ranting, it’s well known fact the analytics and all that. But a true ai learning model that is capable of not imitation, that’s true fear

past meteor Oct 14, 2023, 4:05 PM

#

odd meteor The thing is, most big tech companies don't really like ML Researchers whose res...

To me ethics in AI is closer to philosophy than it is to AI. The reason why I always refuse to express an opinion on the ethics side of things is that I don't have any foundational knowledge in it

still whale Oct 14, 2023, 4:16 PM

#

True, to reference philosophy is pretty taxing.. gotta be able to know and keep track of all the contexts 😳. There is a philosophy of ai, look it up if you're interested

#

Some philosophies get insane though, I tried 3 pages of philosophy of psychology and just about blew 3 valves and a bottom end bearing

past meteor Oct 14, 2023, 4:22 PM

#

In my masters we had to pick 1 "general education" elective and the options were AI ethics, privacy & big data, and cognitive science. I went for privacy & big data because I thought it'd have the most value 🤣

left tartan Oct 14, 2023, 4:27 PM

#

odd meteor The thing is, most big tech companies don't really like ML Researchers whose res...

Yah; therein lies the rub. You need opinionated people to consider the negative impacts of ethical decisions on a business, but it also needs people who will turn a blind eye when the business chooses the less ethical route (in their opinion). I should add; since ethics are highly subjective, you also need opinionated people who accept that others might disagree with their opinions and know how to be a team player. (** This is not a commentary on Timnit, I’m just saying it’s perhaps impossible to find such people)

lapis sequoia Oct 14, 2023, 7:19 PM

#

guys what are the main models of ai?

small wedge Oct 14, 2023, 7:20 PM

#

wdym the main models?

#

like RNN, CNN, etc?

lapis sequoia Oct 14, 2023, 7:20 PM

#

i have a research

lapis sequoia Oct 14, 2023, 7:20 PM

#

small wedge like RNN, CNN, etc?

i think so yea

small wedge Oct 14, 2023, 7:21 PM

#

https://machinelearningmastery.com/a-tour-of-machine-learning-algorithms/

lapis sequoia Oct 14, 2023, 7:21 PM

#

dm me i can give you what my research is about

small wedge Oct 14, 2023, 7:21 PM

#

I'm good

lapis sequoia Oct 14, 2023, 7:21 PM

#

alr thank you anyway

#

another question, what are the materials of ai?

small wedge Oct 14, 2023, 7:25 PM

#

lapis sequoia another question, what are the materials of ai?

materials? as in learning resources or like the principle components?

lapis sequoia Oct 14, 2023, 7:26 PM

#

small wedge materials? as in learning resources or like the principle components?

bro i dont know anything my teacher gave me this research

small wedge Oct 14, 2023, 7:26 PM

#

XD well I can't answer your question if you don't even know what it means

lapis sequoia Oct 14, 2023, 7:27 PM

#

small wedge XD well I can't answer your question if you don't even know what it means

wait brb lemme ask him

odd meteor Oct 14, 2023, 8:15 PM

#

lapis sequoia bro i dont know anything my teacher gave me this research

This made me laugh for a minute. Bro your teacher must have believed you know something and capable of doing the task before giving it to you. So you do know something, well, maybe not enough at the moment, but you do know something.

odd meteor Oct 14, 2023, 8:19 PM

#

past meteor To me ethics in AI is closer to philosophy than it is to AI. The reason why I al...

If I were presented with such electives I think I'd have gone for Ethics 'cos the topic just has a way of opening your mind to see things from a different perspective. It can be boring though if the professor isn't the type that makes his/her class very engaging and interactive.

odd meteor Oct 14, 2023, 8:22 PM

#

still whale To me, like honestly I operate under the assumption I’m wrong or right at any gi...

Seems you'd most likely enjoy working in Ethics if you were to venture into ML Research. I've always avoided it cos it comes with a lot of things I'm not ready for 😀

verbal oar Oct 14, 2023, 8:27 PM

#

where I can deploy ai model for free?

#

heroku has MFA, vercel requires phone to verify

#

I want some simple

#

this is not security important model

#

just for testing

small wedge Oct 14, 2023, 8:32 PM

#

if it's just for testing, repl.it?

still whale Oct 14, 2023, 8:45 PM

#

@odd meteor ha fuck dude that’s a laugh, I’m right there with ya . I could spout moral ethics till I die but I guarantee I’ll be in ethical within an hour

slow totem Oct 14, 2023, 8:45 PM

#

Hello! Need a bit of help with scikit learn!

So here is my use case. I have made a model in Python with scikit-learn. It works great, with an accuracy of 89%. But I want to use it in JS as I am using MERN (I don't want to pair the MERN app with a Flask/FastAPI server just to get the classification). So I looked at TensorFlow.js and a scikit-learn wrapper for JS: https://scikitjs.org/ I preprocess the data in exactly the same manner (I coded the vectorizer myself in both languages, so I can compare the data being fed in, and it's exactly the same). But when I run the SVC in JS, it gives an amazing accuracy of 9%. Any idea what I can do?

Thank you for a response in advance

Hello from Scikit.js | Scikit.js


still whale Oct 14, 2023, 8:45 PM

#

@odd meteor if only philosophy didn’t exist in a semi vacuum, I’d be free to be some sort of moral agent of some purpose

odd meteor Oct 14, 2023, 8:46 PM

#

left tartan Yah; therein lies the rub. You need opinionated people to consider the negative ...

True. Possessing empathy, opposing any form of oppression, advocating for inclusivity, and having a natural inclination to challenge inconsistencies and call out bullsh!t are traits not everyone possesses.

While I'd like to believe that I'm someone who understands and values empathy, I don't necessarily envision myself venturing into research in Ethics. It's just a lot 😀

After going through the 'Stochastic Parrots' research paper, my admiration for those who specialize in Ethics has grown. Reflecting on the incident where Google Photos' 2015 image classifier labeled Black individuals as gorillas, I'm reminded of the pivotal role ethical researchers play. Their contributions are invaluable, but it's a field that might be too heavy for me.

odd meteor Oct 14, 2023, 8:56 PM

#

verbal oar where I can deploy ai model for free?

Try Streamlit / Streamlit Cloud, HuggingFace Spaces, more recently I saw a video on this new platform called Runway. https://www.youtube.com/watch?v=tSiS15ubQFQ

YouTube

Thu Vu data analytics

How to Deploy Machine Learning Models (ft. Runway)

⚙️ Runway - MLOps made easy: https://bit.ly/mrxrunway
🛣 Full Stack Data science roadmap: https://shorturl.at/abiJY
📚 Designing Machine Learning Systems (by Chip Huyen) 👉 https://amzn.to/3Cajv0Y
👔 Need help with preparing for your next data science interview? Use referral code ‘thuvu’ to get 10% discount on any of the offerings on https://dataint...


verbal oar Oct 14, 2023, 8:57 PM

#

hmm streamlit is like R shiny?

odd meteor Oct 14, 2023, 9:00 PM

#

verbal oar hmm streamlit is like R shiny?

I haven't used R-Shiny so I have no idea. You can use Streamlit to deploy your model as a web app. If that's what R-shiny does, then perhaps there's some semblance between both.

verbal oar Oct 14, 2023, 9:00 PM

#

ok

sharp crest Oct 14, 2023, 9:01 PM

#

odd meteor Try Streamlit / Streamlit Cloud, HuggingFace Spaces, more recently I saw a video...

I'm adding this to my playlist

past meteor Oct 14, 2023, 9:01 PM

#

verbal oar hmm streamlit is like R shiny?

Lower learning curve than R shiny

verbal oar Oct 14, 2023, 9:26 PM

#

https://mbanet-e62f5mi5jcqypxb2huifrl.streamlit.app/
it's stuck on input

#

    preds = predictions(xTxt)
    print('preds', preds)

    text = input("type name of product (e.g beer): ")
    test = pred(text)

#

when sharing, users don't see the manage app console?

agile cobalt Oct 14, 2023, 9:29 PM

#

I don't think that streamlit supports input(), iirc you should use components provided by the library (buttons, checklists, probably some specialized forms of text inputs etc) instead of using just input() itself

#

check the documentation and examples if you haven't yet

verbal oar Oct 14, 2023, 9:30 PM

#

ok because its for users right

#

found text_input

lapis sequoia Oct 15, 2023, 12:55 AM

#

I have a 200MB, 1.1MM-row CSV file. Too big for Excel, but I'd still like a GUI-based method of initial data exploration. What are your favorite ways of doing this? I've found three possible solutions thus far:

-PowerBI
-Datasette
-VS Code's CSV extensions

serene scaffold Oct 15, 2023, 1:34 AM

#

lapis sequoia I have a 200MB, 1.1MM-row CSV file. Too big for Excel, but I'd still like a GUI-...

You could use pandas in a Jupyter notebook

lapis sequoia Oct 15, 2023, 1:41 AM

#

serene scaffold You could use pandas in a Jupyter notebook

Jupyter forces the dataframe to fit into the screen size, which is really inconvenient when it comes to larger fields like review_content. It also doesn't show me all the entries. I've yet to find a solution for this. Even adjusting display.max_colwidth to 100 doesn't help.

serene scaffold Oct 15, 2023, 1:47 AM

#

lapis sequoia Jupyter forces the dataframe to fit into the screen size, which is really inconv...

I wonder if there's a Jupyter lab plugin to solve that

vestal widget Oct 15, 2023, 1:48 AM

#

Im finetuning a gpt2 124M model using gpt-2-simple module, any tips to prevent overfitting or underfitting, like which parameter should i adjust

lapis sequoia Oct 15, 2023, 1:49 AM

#

serene scaffold I wonder if there's a Jupyter lab plugin to solve that

If you find it, please give me a holler. Because I'm stuck. For now, I'll leave it for the night and play Cities Skylines or something to relax and forget about it.

agile cobalt Oct 15, 2023, 2:01 AM

#

lapis sequoia Jupyter forces the dataframe to fit into the screen size, which is really inconv...

you could take a random sample and explore it in Excel, then go back to python and write actual code to process the entire thing, but GUI-based exploration doesn't really scale well
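
A sketch of the sampling step using only the standard library, so the full file never has to fit in memory. The in-memory CSV is a made-up stand-in for the real file, and `sample_csv_rows` is a hypothetical helper name:

```python
import csv
import io
import random


def sample_csv_rows(src, k, seed=0):
    """Return the header and k uniformly sampled data rows (reservoir sampling)."""
    reader = csv.reader(src)
    header = next(reader)
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(reader):
        if i < k:
            sample.append(row)      # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:
                sample[j] = row     # replace with probability k / (i + 1)
    return header, sample


# Hypothetical stand-in for the big file; in practice pass
# open("reviews.csv", newline="", encoding="utf-8") instead.
big = io.StringIO("id,review\n" + "".join(f"{i},text {i}\n" for i in range(1000)))
header, rows = sample_csv_rows(big, k=50)

# For a real file: open("sample.csv", "w", newline="", encoding="utf-8-sig")
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(rows)
```

Writing the sample with `encoding="utf-8-sig"` may help Excel pick up accented characters, assuming the problem is encoding detection.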

lapis sequoia Oct 15, 2023, 2:02 AM

#

agile cobalt you could take a random sample and explore it on Excel then go back to python an...

Excel has its own problems; namely, it freaks out when encountering accented characters

agile cobalt Oct 15, 2023, 2:02 AM

#

that is probably an issue related to file encoding or excel localisation settings?

#

another option could be yeeting it into a database (either just a sqlite file or something like postgres), then using something like dbeaver, but at that point you might as well just end up using sql instead of python with the same problems
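
A minimal sketch of the SQLite route with the standard library only. The table name `reviews` and the tiny in-memory CSV are assumptions for illustration; every column is loaded as untyped text:

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the CSV; a real run would use
# open("reviews.csv", newline="", encoding="utf-8") here.
src = io.StringIO("critic,score,review\nZoë,4,Très bien\nAna,2,Too long\n")
reader = csv.reader(src)
header = next(reader)

# Pass a filename instead of ":memory:" to get a file a GUI client can open.
conn = sqlite3.connect(":memory:")
columns = ", ".join(f'"{name}"' for name in header)
placeholders = ", ".join("?" for _ in header)
conn.execute(f"CREATE TABLE reviews ({columns})")
# executemany streams the remaining rows straight from the CSV reader.
conn.executemany(f"INSERT INTO reviews VALUES ({placeholders})", reader)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM reviews").fetchone()[0]
first_critic = conn.execute(
    "SELECT critic FROM reviews ORDER BY rowid LIMIT 1"
).fetchone()[0]
```

Reading the file with an explicit `encoding` is what keeps the accented names intact on the way in.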

lapis sequoia Oct 15, 2023, 2:04 AM

#

agile cobalt another option could be yeeting it into a database (either just a sqlite file or...

Will try that out. I also stumbled across CSView through Reddit

#

Postgres would be a bit extravagant, wouldn't it? SQLite can do the job I think

agile cobalt Oct 15, 2023, 2:05 AM

#

setting up just for that would be
if you already had it set up for something else and could reuse for that, not that much

peak hamlet Oct 15, 2023, 2:08 AM

#

agile cobalt that is probably an issue related to file encoding or excel localisation setting...

Yep, that’s encoding
Happens to me daily

lapis sequoia Oct 15, 2023, 2:09 AM

#

peak hamlet Yep, that’s encoding Happens to me daily

How would you get around this? SQLite -> dbeaver?

#

Change Excel encoding settings? Switch to PowerBI?

peak hamlet Oct 15, 2023, 2:10 AM

#

lapis sequoia How would you get around this? SQLite -> dbeaver?

I just got here
Not sure what we’re solving

lapis sequoia Oct 15, 2023, 2:11 AM

#

peak hamlet I just got here Not sure what we’re solving

#data-science-and-ml message

#

CSView is okay, but you can't zoom in and out to display more info on the screen. The viewport is fixed.

peak hamlet Oct 15, 2023, 2:13 AM

#

lapis sequoia CSView is okay, but you can't zoom in and out to display more info on the screen...

What program is this?

lapis sequoia Oct 15, 2023, 2:13 AM

#

peak hamlet What program is this?

https://kothar.net/csview

peak hamlet Oct 15, 2023, 2:14 AM

#

What’s the starting format?

#

Is CSV the original type?

#

And all you want is to just manually look at the data?

lapis sequoia Oct 15, 2023, 2:15 AM

#

peak hamlet Is CSV the original type?

Yes. The original source is here: https://www.kaggle.com/datasets/stefanoleone992/rotten-tomatoes-movies-and-critic-reviews-dataset

Rotten Tomatoes movies and critic reviews dataset

17k+ movies and their related critic reviews scraped from Rotten Tomatoes

lapis sequoia Oct 15, 2023, 2:17 AM

#

peak hamlet And all you want is to just manually look at the data?

I wanted a bird's eye view of how the data is formatted as a kind of initial EDA, figure out the proper data format of each field (do I need to do any str -> int casting, etc.) and to check for finicky edge cases

peak hamlet Oct 15, 2023, 2:17 AM

#

Oh 17k records?
Yeah
That’s potatoes for Excel

lapis sequoia Oct 15, 2023, 2:18 AM

#

peak hamlet Oh 17k records? Yeah That’s potatoes for Excel

That's rotten_tomatoes_movies.csv. rotten_tomatoes_critic_reviews.csv is 1.1 million records

#

1,130,017 specifically

peak hamlet Oct 15, 2023, 2:18 AM

#

oh

#

yeah
that's over Excel's max row count

peak hamlet Oct 15, 2023, 2:19 AM

#

lapis sequoia How would you get around this? SQLite -> dbeaver?

You would have to do something like this

#

FYI you can open an empty workbook and then import from CSV, and then in the wizard you can correct the encoding and reject the automatic transformations
But yeah, if you have over 100k rows, give or take, definitely go for the database option

lapis sequoia Oct 15, 2023, 2:32 AM

#

peak hamlet FYI you can open an empty workbook and then import from CSV, and then in the wiz...

Thank you

past meteor Oct 15, 2023, 7:00 AM

#

lapis sequoia Jupyter forces the dataframe to fit into the screen size, which is really inconv...

The only large column is review content and I'm pretty sure that's not something you're going to ... plot are you?

#

I think I'd just work with a notebook and the other 6 columns mainly. If I want the 7th I'd read those reviews "manually"

sand pivot Oct 15, 2023, 7:54 AM

#

hey all small question
i am new to machine learning, and i have a data set that i basically train this network on

model = tf.keras.Sequential()

model.add(tf.keras.layers.Dense(128, input_dim=1, activation='relu'))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='linear'))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['mean_absolute_error'])

history = model.fit(input_data, output_data, batch_size=32, epochs=300,shuffle=True)

i am basically trying to approximate E = 13.6/(n*n)
i generated some readings based on E, where the input data is n and the output data is E. i apply a small deviation to E so that the results are not exact

now the problem is that if i try to predict values outside of the training range (training range is n = 1 to n = 50), i get massive error rates, like 200% or sth lol (predicting values within the training range works fine, but anything outside of it just slowly grows into massive error rates)
any ideas?

#

i am honestly at my wits' end, i have no idea what to do xD

wooden sail Oct 15, 2023, 8:33 AM

#

if you explicitly include the model into the network, it'll work better

#

the network has no way of knowing what it would do outside the training data otherwise

#

you can rework the network so that it estimates the parameters of your model for E

sand pivot Oct 15, 2023, 8:38 AM

#

thanks, and interesting, can you give me an example @wooden sail ?

#

not sure how to achieve that to be fair 😓

wooden sail Oct 15, 2023, 8:40 AM

#

the main question is, what things can we assume to be known ahead of time? and what do we want to find out?

sand pivot Oct 15, 2023, 8:41 AM

#

honestly, my main idea is to have some data entries, and use them to approximate E, through a neural network

#

i have some code that generates data readings, with some deviation/controlled error rate
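
A minimal sketch of that setup together with wooden sail's suggestion, with two assumptions: the noise is 2% multiplicative, and the parameter is fitted by plain least squares rather than by a network. If the form E = A / n² is taken as known, only A has to be estimated, and the fit then extrapolates past n = 50 because the 1/n² shape is built in:

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Noisy readings of E = 13.6 / n**2 for n = 1..50
# (the 2% noise level is an assumption).
ns = list(range(1, 51))
readings = [13.6 / n**2 * (1 + rng.gauss(0, 0.02)) for n in ns]

# The model E = A / n**2 is linear in A once x = 1 / n**2, so least squares
# has a closed form: A = sum(x * E) / sum(x * x).
xs = [1 / n**2 for n in ns]
A = sum(x * e for x, e in zip(xs, readings)) / sum(x * x for x in xs)


def predict(n):
    """Predicted E for any n, inside or outside the training range."""
    return A / n**2


# Relative error at n = 100, well outside the training range.
true_at_100 = 13.6 / 100**2
rel_err_at_100 = abs(predict(100) - true_at_100) / true_at_100
```

A plain dense network has no such structure to lean on, which matches the behaviour described above: good inside n = 1..50 and unreliable beyond it.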