#data-science-and-ml


tardy lark
#

although this time the detected shape is 140

#

uhm idk the shape but the data is stock prices

tidal bough
#

Check scaled_data.shape and scaled_data.dtype

tardy lark
#

float64 and 6007,1

tidal bough
#

Actually, I think I see a mistake. You're taking elements from scaled_data but looping up to len(train_data). If the latter is higher than len(scaled_data), then the last slices the loop produces can be shorter than 60 elements.

tardy lark
#

so how could i fix that?

#
        train_data = final_data[0:200,:]
        valid_data = final_data[200:,:]
tidal bough
#

If len(train_data) < len(scaled_data) as this seems to imply, your loop should work.

tardy lark
#

for i in range(60,len(train_data < len(scaled_data)): like that?

serene scaffold
serene scaffold
tardy lark
#

no

serene scaffold
#

see if you can fix it

tardy lark
serene scaffold
# tardy lark yeah i'm stumped

when ConfusedReptile said "If len(train_data) < len(scaled_data) as this seems to imply, your loop should work.", they weren't suggesting that you need to include len(train_data) < len(scaled_data) in your code somewhere. they were saying "if the length of train_data is less than the length of scaled_data, then your loop should work"

#

so the loop needs to go from 60, to what?
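
A minimal sketch of the loop under discussion, assuming a 60-step window and the 200-row training slice from the snippet above. `scaled_data` is random here, purely as a stand-in for the real prices, and `WINDOW`, `x_train` and `y_train` are illustrative names. The loop runs from 60 up to `len(train_data)` and slices the same array its bound comes from, so every window has exactly 60 elements:

```python
import numpy as np

WINDOW = 60  # assumed look-back length, taken from the "60" in the loop

# Stand-in for the real scaled prices: shape (6007, 1), dtype float64.
scaled_data = np.random.default_rng(0).random((6007, 1))
train_data = scaled_data[:200, :]

x_train, y_train = [], []
# Bound the loop by the array that is actually being sliced.
for i in range(WINDOW, len(train_data)):
    x_train.append(train_data[i - WINDOW:i, 0])  # previous 60 values
    y_train.append(train_data[i, 0])             # value to predict

x_train = np.array(x_train)
y_train = np.array(y_train)
print(x_train.shape, y_train.shape)
```

With 200 training rows this yields 200 - 60 = 140 windows, which lines up with the "detected shape is 140" mentioned earlier.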

neon lantern
#

my brain isn't working right now so I need a sanity check
can i train ppo with a continuous stream of downscaled images (think 360x240 or something) and have it just dictate mouse movement and whether or not to click on each frame

#

and as a followup, should I do it in tensorflow or pytorch or do it with custom gym environment

#

I assume that custom gym is easiest since theres a lot of backend I don't have to handle but I wanted to get a second opinion

half cliff
#

Hi! Has anyone ever worked with the library pydeck?

serene scaffold
cold osprey
serene scaffold
#

but also we did have a bot command telling people not to ask to ask and such, but it was used in a passive aggressive way, and people didn't really read it.

rugged comet
#

@past meteor Was I meant to sort the residuals and the variable values for some reason before graphing? I think not because I think that would cause the variable values to not "line up" with their corresponding residuals.

small ore
#

Is there a quick handy way to get a report on outliers (in numerical data)? Number of outliers in each feature and an array/series of indices that contains them?

past meteor
small ore
#

That does not give me a report of how many outliers are there for each feature and their locations if I am thinking right

past meteor
#

It's what I do if I want quick and dirty univariate outliers

small ore
#

I have a dataset with (potentially) 23 features and want to look at similar stuff in the future too. A quick report/describe that helps me determine if I can remove them without worries is what i need. Also need indices/location so that I can remove those easily. Bonus would be to see the overlap

small ore
#

Basically a

df.describe()

with extra data

past meteor
#

And you do the same for those lower than the 25th quantile and do an or between both, those are your outliers

#

Gives you a true or false series

#

Finally, you filter by this mask and you get the index of all these values
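
A rough sketch of the mask-based approach described above, extended to every numeric column at once. It assumes the common 1.5 × IQR fences (the thresholds in the original suggestion are not fully visible in this log), and `outlier_report` is a hypothetical helper name, not a pandas function:

```python
import pandas as pd

def outlier_report(df, k=1.5):
    """Flag univariate outliers per numeric column using IQR fences (assumed rule)."""
    num = df.select_dtypes("number")
    q1 = num.quantile(0.25)
    q3 = num.quantile(0.75)
    iqr = q3 - q1
    # True where a value falls below the lower fence OR above the upper fence.
    mask = (num < (q1 - k * iqr)) | (num > (q3 + k * iqr))
    counts = mask.sum()  # number of outliers per feature
    indices = {col: num.index[mask[col].to_numpy()].tolist() for col in num.columns}
    overlap = mask.sum(axis=1)  # per row: how many features flag it
    return counts, indices, overlap

df = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
counts, indices, overlap = outlier_report(df)
print(counts)
print(indices)
```

`overlap` covers the "bonus" request: rows with a value above 1 are outliers in more than one feature.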

small ore
#

I was attempting to write my own code already. That is why asked for " a quick handy method" in my original question.

#

But thanks anyways. I will look into your way too

small wedge
#

!rule ad

arctic wedgeBOT
#

6. Do not post unapproved advertising.

honest verge
#

do you guys have any suggestions on what backend to use for ml models? Would it differ between traditional ml methods and dl ?

agile cobalt
#

for traditional ml, most things should be 'cheap' enough to run as far as computing power goes that you don't have to think much about it beyond the usual considerations for literally any project's backend at all

for deep learning, you might need to ensure you have access to gpus depending on which model(s) you are using, which can make it harder and more expensive

agile cobalt
# honest verge do you guys have any suggestions on what backend to use for ml models? Would it ...

classic options like AWS, Azure or Google Cloud Platform can work fine, if you use the right services inside of it and take the usual precautions to like not leaking credentials, not getting hacked, properly deactivating things you're not using to avoid unwarranted fees etc - not ultra specific to ml

though I guess that as far as specifically for ml goes, there are some things like Hugging Face Spaces you might want to look into

potent sky
#

Also HF Inference Endpoints if you're looking for production use

#

Spaces is mostly for demo use ig, and doesn't provide an API to query the model hosted

#

Still very useful to get things up and running and to test it out

mystic ruin
#

i can't install chatterbot with pip... i am using python 3.11 output :

```
        ERROR: Failed building wheel for blis
        Running setup.py clean for blis
      Failed to build preshed thinc blis
      ERROR: Could not build wheels for preshed, thinc, blis, which is required to install pyproject.toml-based projects
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× pip subprocess to install build dependencies did not run successfully.
│ exit code: 1
╰─> See above for output.
```
agile cobalt
#

looks like it hasn't been updated since 2021?
edit; even worse, the pypi version is from 2020
(2021 was the latest github commit on the main branch)

#

you might want to look into LangChain instead

odd meteor
# mystic ruin i can't install chatterbot with pip... i am using python 3.11 output : ```note: ...

You need to download the aforementioned binaries/wheels to your machine (make sure the wheels match the version of Python you have, 3.11). Then, from your command line (Windows / Conda, depending on what you're using), change directory to the folder where you saved the wheels, unless that folder is already on the path in your system environment, and pip install the wheels from there.

Use this website to download the wheels: https://www.lfd.uci.edu/~gohlke/pythonlibs/

Now, after doing that for the missing wheels, try installing chatterbot again.

Better Solution: Create a virtual environment with an older version of Python, either 3.9 or 3.10, then install chatterbot in that environment. You should be fine. (I noticed some of the wheels you're missing, like thinc and blis, aren't available yet for Python 3.11; however, the 3.10 versions are available.)

odd meteor
# agile cobalt from https://pypi.org/project/ChatterBot/ :

Aha @mystic ruin this is why it's not working. The maintainers of Chatterbot appear to have gone on sabbatical since python 3.8 😀 . So you'd either downgrade to 3.8 or find another library for building what you wanted to use Chatterbot for.

abstract wasp
#

I split my data to just training and validation, is it okay if I don’t have a split for testing?

desert oar
abstract wasp
left tartan
#

With a train test split, you might test dozens of models or more. You’ll eventually find one that fits the test really well. But then what? Was that just by sheer chance or did you stumble upon a ‘good’ model?

#

So, at least in my world, the validation split is the very last thing that you hold back as long as you can… because often the ‘test’ effectively becomes part of train

#

(Curious whether zestar75 agrees with my characterization)

cerulean kayak
#

because of the fact that inputs for neural networks have to be a vector of data,
does that mean whenever I'm using keras to make a Neural Network, I need to make sure that the x_train and y_train have a shape of (n,1)?
please at me if you know.

thick walrus
#

Hello All,
I am working on bar plot in matplot. The second plot is not showing the bar:
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta
import matplotlib as mpl

mpl.rcParams["date.converter"] = 'concise'
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, layout='constrained')
price_date = np.array([datetime(2020, 6, 30),
datetime(2020, 7, 22),
datetime(2020, 8, 3),
datetime(2020, 9, 14)], dtype=np.datetime64)

price_close = [8800, 2600, 8500, 7400]
start_date = np.datetime64(datetime(2020, 6, 1))

ax1.bar(price_date, price_close, width=np.timedelta64(4, "D"))
ax2.bar(start_date, price_close, bottom=price_date)
ax3.bar(np.arange(4), price_date-start_date, bottom=start_date)

#

Am I missing something, I wanted the time to be at the bottom on the second plot but it does not show the bar

small ore
thick walrus
small ore
#

!code

arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

thick walrus
#

I hope this helps as I used pastebin. As mentioned the second bar plot is not showing the bar.
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta
import matplotlib as mpl

mpl.rcParams["date.converter"] = 'concise'
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, layout='constrained')
price_date = np.array([datetime(2020, 6, 30),
datetime(2020, 7, 22),
datetime(2020, 8, 3),
datetime(2020, 9, 14)], dtype=np.datetime64)

price_close = [8800, 2600, 8500, 7400]
start_date = np.datetime64(datetime(2020, 6, 1))

ax1.bar(price_date, price_close, width=np.timedelta64(4, "D"))
ax2.bar(start_date, price_close, bottom=price_date)
ax3.bar(np.arange(4), price_date-start_date, bottom=start_date)

past meteor
#

As you say, if you just do a train-test split and test N models with it and you likely do this M times ... your test set is de facto part of your training set.

#

My current work project is medical stuff. What I do is essentially split:

  1. Split off some patients in full, they are never touched. (inter patient split)
  2. Split each "training" patient in a train and a test set. (intra patient split)
  3. Within this train set I split off a little more data and mostly cross validate.
  4. I try to minimize the amount of times I'm using the "intra patient" test sets.
  5. I take the top 4 models found on the "intra patient" splits and apply them on the held out ones.
  6. Report the inter patient split results, these are how our models work on unseen patients. If the difference between test and train here is high we likely overfit by reusing the intra patient split too much.

It sounds convoluted but if you don't have a lot of data how you approach this is important to get unbiased performance estimates.
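
The first two steps above can be sketched with plain NumPy. Everything here is illustrative: the patient counts, the 25% intra-patient test fraction, and the variable names are assumptions, not details of the actual project:

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, rows_per_patient = 10, 20  # assumed toy sizes
patient_id = np.repeat(np.arange(n_patients), rows_per_patient)

# 1. Inter-patient split: hold out whole patients, never touched during dev.
shuffled = rng.permutation(n_patients)
held_out_patients = shuffled[:2]
inter_test = np.isin(patient_id, held_out_patients)

# 2. Intra-patient split: inside each remaining patient, set aside some rows.
intra_test = np.zeros_like(inter_test)
for p in shuffled[2:]:
    rows = np.flatnonzero(patient_id == p)
    test_rows = rng.choice(rows, size=len(rows) // 4, replace=False)
    intra_test[test_rows] = True

# Everything else is available for training and cross-validation (step 3).
train = ~inter_test & ~intra_test
print(train.sum(), intra_test.sum(), inter_test.sum())
```

The three masks are disjoint by construction, so a held-out patient can never leak rows into training.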

desert oar
#

the compromise i've made in the past in that situation is to create a single train/test split, and rely on cross validation within the train set for model dev

past meteor
#

Underestimating model performance is a lot better than overestimating it, I'm fine with that

left tartan
desert oar
past meteor
#

Honestly? I like the idea of bespoke data splitting. For my current use case it made sense. For others I'll have to do something else.

desert oar
#

but yeah, implicitly overfitting through iterated model tweaking is a serious problem

past meteor
#

In my case there's an element of unbalanced data in it and we solved it by having a semi-stratified test-train split in the intra patient split

left tartan
past meteor
#

But the stratification procedure was very much linked to the "domain"

left tartan
desert oar
#

yeah, that's how i ended up solving my 1000 class problem too, we were able to group the classes into bigger categories and stratified that way

desert oar
left tartan
past meteor
#

I think all ML is prone to this, not just finance. It's game over when the data scientist starts thinking their job is to maximize or minimize a variable 😄

desert oar
#

Why is it "made for kids"? I can't save the video to a playlist lol

desert oar
past meteor
#

The optimal way to get 100 % on the test set is just by copying y

#

Hence why I truly believe that's not our job. Our job is to answer the question "if we go into production with this, how good will it be?"

#

That's why I don't mind underestimating performance, if all my metrics say the model is fantastic and we go into prod and it fails then that might impact how much they trust us going forward.

left tartan
#

This is why I like staying in the DE side in the finance world!

past meteor
#

I mostly speak about DS but I like data engineering equally I think, it's good fun as well

left tartan
#

In finance, DS feels more like BS most of the time

past meteor
#

Ironically, my background is quantitative business. Finance was never my jam. A lot of my cohort went into actuarial science which does seem very legit.

Algo trading ... idk, I feel like there's too much happening in the world you can't control for.

Finance at large does have interesting use cases like credit risk modelling and fraud detection though. My profs couldn't shut up about these 🤣

desert oar
past meteor
#

That sounds reasonable to me. I don't know enough to comment, I'll leave that to BillyBobby 😄

left tartan
#

Usually, I think, it comes down to a big broad strategy bet (ie: certain sectors) and small optimizations to hopefully outperform.

past meteor
left tartan
#

Yah, it’s tough to apply in practice, but it’s good to understand the limits

past meteor
left tartan
#

Oh thanks, this statement is somewhat surprising, I was hoping for a more positive solution: “This problem …. remains a persistent problem plaguing scientific research.”

kind lotus
#

Anyone here have experience with pandas?

nimble hawk
#

Hello everyone, I uploaded a PySpark course on YouTube channel. I tried to cover wide range of topics including SparkContext and SparkSession, Resilient Distributed Datasets (RDDs), DataFrame and Dataset APIs, Data Cleaning and Preprocessing, Exploratory Data Analysis, Data Transformation and Manipulation, Group By and Window ,User Defined Functions and Machine Learning with Spark MLlib. I am leaving the link below, have a great day!

https://www.youtube.com/watch?v=jWZ9K1agm5Y

arctic wedgeBOT
#

6. Do not post unapproved advertising.

past meteor
#

You're not supposed to advertise stuff on the Discord 🙂

thick walrus
#

Hello All,
I am still having a challenge with the second subplot. It should show the timedelta at the bottom. Currently, it does not show anything.
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime, timedelta
import matplotlib as mpl

mpl.rcParams["date.converter"] = 'concise'
fig, (ax1, ax2, ax3) = plt.subplots(3, 1, layout='constrained')
price_date = np.array([datetime(2020, 6, 30),
datetime(2020, 7, 22),
datetime(2020, 8, 3),
datetime(2020, 9, 14)], dtype=np.datetime64)

price_close = [8800, 2600, 8500, 7400]
start_date = np.datetime64(datetime(2020, 6, 1))

ax1.bar(price_date, price_close, width=np.timedelta64(4, "D"))
ax2.bar(start_date, price_close, bottom=price_date)
ax3.bar(np.arange(4), price_date-start_date, bottom=start_date)

Any suggestions?

nimble hawk
cold osprey
#

!code

arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

vale swallow
#

Hi, can someone pls help me. I have been trying and trying again to increase the accuracy of my model but it stays at like 70% accuracy. I have added two drop outs layers, L1 regularization, and my dataset is about 23k images big. What can I do to get my model to at least 80% Help 😭😭

open meadow
#

I'm interested in learning ai & ml but I have 0 idea where to start from, my main lang. is python, if anyone has any tips or courses/vids that can help me I'd appreciate it

harsh minnow
#

Hi guys, I have a large news document dataset and I am trying to train a model on it. The news runs from 1998 until now. I want to filter the news from 2015. The dataset does not have metadata, so to check the year I have to manually check each article and its content. I tried using spaCy, but it's not accurate because an article will contain many dates. Is there a better way of doing it?

serene scaffold
past meteor
harsh minnow
#

Manually meaning checking all the docs, there is over 10,000 docs

past meteor
#

Are you alone in this or is this a team effort?

harsh minnow
harsh minnow
past meteor
#

Hmmm, I don't know how long this would take 🤔 . At least for computer vision I tend to manually label stuff whenever necessary. It sucks but imo it's part of the job 😄

serene scaffold
past meteor
#

That's a good one

harsh minnow
serene scaffold
#

I would probably also skip dates that only have a month and year, since those might refer to tentatively scheduled future events
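
One way to act on that advice, as a rough sketch only: match full dates (day, month and year) with a regular expression, ignore month-and-year mentions, and take the most frequent year. The "most frequent year" rule and the names `FULL_DATE` and `guess_year` are assumptions for illustration, and the pattern only covers English month names written out in full:

```python
import re
from collections import Counter

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches "3 March 2016" or "March 3, 2016", but not "June 2017" (no day).
FULL_DATE = re.compile(
    rf"\b(?:\d{{1,2}}\s+(?:{MONTHS})|(?:{MONTHS})\s+\d{{1,2}}),?\s+(\d{{4}})\b"
)

def guess_year(text):
    """Return the most common year among full dates in text, or None."""
    years = Counter(int(y) for y in FULL_DATE.findall(text))
    return years.most_common(1)[0][0] if years else None

doc = (
    "Published 3 March 2016. The summit is tentatively planned for June 2017. "
    "A follow-up appeared on March 10, 2016."
)
print(guess_year(doc))
```

Documents where this returns `None`, or where the vote is close, would still need a manual look.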

harsh minnow
#

Another thing I tried is using llama-7B to output the year; I used it via Hugging Face transformers. The only issue is that it takes around 12 seconds to complete one request. I am using Colab (85GB RAM)

#

Is this the case usually?

#

@serene scaffold ?

serene scaffold
harsh minnow
#

GPU does not increase at all

serene scaffold
harsh minnow
#

Currently I restarted the runtime, usually 70GB of the RAM will be consumed

#

And GPU does not go up at all

serene scaffold
#

!stream 533839465060499487 "15 minutes"

arctic wedgeBOT
#

✅ @harsh minnow can now stream until <t:1696694410:f>.

harsh minnow
#

I don't have the permission to speak

serene scaffold
#

come back

harsh minnow
#

!pip install transformers
!pip install transformers[sentencepiece]

from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf")

arctic wedgeBOT
serene scaffold
#
from transformers import pipeline

pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device=0)
harsh minnow
#

Thanks @serene scaffold

#

It worked 🎉

umbral charm
#

im making a pie chart, how do i get rid of the labels on the pie chart and just instead have a legend

#
portions = [75, 18, (100-75-18)]
plt.pie(x = portions, labels = ['Sky', 'pyramid', 'shady'], startangle = 320, colors = ['deepskyblue', 'yellow', 'gold'])
plt.legend(loc = 'best', bbox_to_anchor = (0.9, 0.9))
plt.show()
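
A possible way to do this, sketched under the assumption that the goal is simply "no text on the wedges, names in the legend": leave `labels` out of `pie()` and hand the wedge handles plus the names to `legend()` instead. The `Agg` backend line is only there so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; drop it for interactive use
import matplotlib.pyplot as plt

portions = [75, 18, 100 - 75 - 18]
names = ["Sky", "pyramid", "shady"]

fig, ax = plt.subplots()
# No labels= argument, so nothing is drawn next to the wedges.
wedges = ax.pie(portions, startangle=320,
                colors=["deepskyblue", "yellow", "gold"])[0]
# The legend gets the wedge handles and the names explicitly.
ax.legend(wedges, names, loc="best", bbox_to_anchor=(0.9, 0.9))
fig.savefig("pie_with_legend.png")
```

In an interactive session, `plt.show()` can replace the `savefig` call.
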
odd meteor
#

Perhaps there could be a recurring pattern in the documents which you could leverage in getting the year each document was published

odd meteor
odd meteor
meager ridge
#

not sure where NLP-type questions should go but ...

  • say I have data for ~2000 cities for every year from a decade
  • I know every city is on every list (for the most part), but the spellings and syntax probably vary from year to year

what's the best way to find the string representing every city in each year? is it stupid to do k-means if you are going to have several thousand clusters? is it stupid to just do pairwise fuzzy matching?

serene scaffold
# meager ridge not sure where NLP-type questions should go but ... - say I have data for ~2000...

You have data for 2000 cities. What data?

Every city is on every list. What lists?

Are you trying to figure out when the same city is referred to in different instances, but in different ways; example, "Paris" and "the French capital"?

As an aside, "syntax" means "rules about legal orderings of symbols in a language". The words "syntax" and "semantics" are used much more expansively in casual speech than they are in linguistics and NLP, so be sure that you're using them correctly.

#

@meager ridge I see that you also asked this in a help thread. Please link to your thread when you cross post your question, to avoid duplication of effort

meager ridge
serene scaffold
night turret
#

Has anyone here made a python voice assistant?

bronze robin
#

Anyone here with experience of performing FFT on timeseries data? Preferably using numpy

bronze robin
#

Yeah I have performed fft but I need some help regarding the output frequency interval customization
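
Since the question is about controlling the frequency grid of the output, here is a small sketch with NumPy. The sampling rate, the 5 Hz test signal and the padded length are all made-up values. The bin spacing of `rfft` is `fs / n`, so it can be made finer by zero-padding through the `n` argument (this interpolates the spectrum; it does not add real resolution):

```python
import numpy as np

fs = 100.0                       # assumed sampling rate in Hz
t = np.arange(200) / fs          # 2 seconds of samples
signal = np.sin(2 * np.pi * 5.0 * t)

# Default grid: spacing fs / N = 100 / 200 = 0.5 Hz.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
peak = freqs[np.argmax(np.abs(spectrum))]

# Zero-padded grid: n=1000 gives spacing fs / n = 0.1 Hz.
n_padded = 1000
spectrum_padded = np.fft.rfft(signal, n=n_padded)
freqs_padded = np.fft.rfftfreq(n_padded, d=1 / fs)

print(peak, freqs[1] - freqs[0], freqs_padded[1] - freqs_padded[0])
```

To look at only a band of interest, a boolean mask such as `(freqs >= 1) & (freqs <= 20)` can be applied to both `freqs` and `spectrum`.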

halcyon jasper
#

somebody can help me with multiNetX?

lunar current
#

Any idea why this error occurs? ```
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (1663 > 1024). Running this sequence through the model will result in indexing errors
Evaluating: 0%| | 0/1379 [00:00<?, ?it/s]

RuntimeError Traceback (most recent call last)
<ipython-input-18-523c0d2a27d3> in <cell line: 1>()
----> 1 main(trn_df, val_df)
...
/usr/local/lib/python3.10/dist-packages/transformers/models/gpt2/modeling_gpt2.py in _attn(self, query, key, value, attention_mask, head_mask)
199 # Need to be a tensor, otherwise we get error: RuntimeError: expected scalar type float but found double.
200 # Need to be on the same device, otherwise RuntimeError: ..., x and y to be on the same device
--> 201 mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
202 attn_weights = torch.where(causal_mask, attn_weights.to(attn_weights.dtype), mask_value)
203

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.```

fiery maple
#

Guys, I want to use a Python DSL and Lua as the configuration language for a pipeline that manipulates data, trains, validates and deploys a model, and of course plots graphs. Lua is a powerful language and we can use it to create custom feature engineering functions. I have some questions and concerns related to it:

  • I would like to avoid recalculating expensive feature engineering. My idea is to hash the current function code plus the columns it processes; if that hash already points to a parquet file, the pipeline loads the file instead of processing all the data again. One question: how accurate are the column stats that pandas' describe() reports? And is there a chance of a hash collision if I combine the description of the columns with the code I will use?
left tartan
#

Why would you need to include the describe() output? Seems like you’d be fine if you capture all the inputs/code/etc?

fiery maple
#

Because if the dataset changes for some reason, by adding more rows for example, in my mind the feature needs to be recalculated?

#

Or applied any kind of imputation for example, filling nan values with zero
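
A sketch of one way to build such a cache key, hashing the actual column contents rather than `describe()` statistics, so added rows or imputed values change the key. `cache_key` is a hypothetical helper, and the feature code is assumed to be available as a string (as it would be for a Lua snippet):

```python
import hashlib
import pandas as pd

def cache_key(code_text, df, columns):
    """Hash the feature code, the column names, and the column contents."""
    h = hashlib.sha256()
    h.update(code_text.encode("utf-8"))
    h.update(",".join(columns).encode("utf-8"))
    # One 64-bit hash per row; any changed, added or imputed value alters it.
    row_hashes = pd.util.hash_pandas_object(df[columns], index=True)
    h.update(row_hashes.to_numpy().tobytes())
    return h.hexdigest()

df = pd.DataFrame({"x": [1.0, None, 3.0]})
code = "return x * 2"  # stand-in for the feature function source

key_original = cache_key(code, df, ["x"])
key_same = cache_key(code, df.copy(), ["x"])
key_imputed = cache_key(code, df.fillna(0), ["x"])
print(key_original == key_same, key_original == key_imputed)
```

With SHA-256, accidental collisions are not a practical concern; the cost is that hashing reads every row of the input columns once.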

solemn glen
#

Hi, I'm working on weighting some values that will feed a larger datascience dataset but I'm having trouble figuring out the best way to weight the values to keep the results appropriate.

    # Define weight factors for each parameter
    damage_weight = 0.3
    reproducibility_weight = 0.1
    exploitability_weight = 0.1
    affected_users_weight = 0.2
    discoverability_weight = 0.2
    
    # Calculate the weighted sum of the parameters
    weighted_sum = (
        self.damage * damage_weight +
        self.reproducibility * reproducibility_weight +
        self.exploitability * exploitability_weight +
        self.affected_users * affected_users_weight +
        self.discoverability * discoverability_weight
    )
    
    # Scale the weighted sum to fit within your desired range (0 to DREAD_RISK_CAP)
    scaled_risk_value = (weighted_sum / (damage_weight + reproducibility_weight + exploitability_weight + affected_users_weight + discoverability_weight)) * self.DREAD_RISK_CAP
    
    return min(scaled_risk_value, self.DREAD_RISK_CAP)
solemn glen
#
    RISK_LEVELS = {
        (1, 12): "Notice",
        (13, 18): "Low",
        (19, 36): "Medium",
        (37, DREAD_RISK_CAP): "High",
    }

these are the risk levels that I'm considering to make it easier to talk about numbers but I end up with results that are always high and I want to avoid that. 
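
One possible cause, offered as a guess since the value ranges are not shown: if each parameter is scored on something like 0 to 10, the weighted average is also 0 to 10, and multiplying it by the cap overshoots almost immediately, so `min()` clamps nearly everything to "High". A sketch that divides by the maximum parameter score first. `MAX_PARAM = 10` and `DREAD_RISK_CAP = 50` are assumptions:

```python
DREAD_RISK_CAP = 50  # assumed cap; the real value is not shown in the chat
MAX_PARAM = 10       # assumed maximum score of each individual parameter

WEIGHTS = {
    "damage": 0.3,
    "reproducibility": 0.1,
    "exploitability": 0.1,
    "affected_users": 0.2,
    "discoverability": 0.2,
}
RISK_LEVELS = {
    (1, 12): "Notice",
    (13, 18): "Low",
    (19, 36): "Medium",
    (37, DREAD_RISK_CAP): "High",
}

def dread_risk(scores):
    """Weighted average of the scores, rescaled to 0..DREAD_RISK_CAP."""
    total_weight = sum(WEIGHTS.values())
    weighted_avg = sum(scores[k] * w for k, w in WEIGHTS.items()) / total_weight
    return min(weighted_avg / MAX_PARAM * DREAD_RISK_CAP, DREAD_RISK_CAP)

def risk_level(value):
    """Map a risk value to its label; None if it falls below every range."""
    rounded = round(value)
    for (low, high), name in RISK_LEVELS.items():
        if low <= rounded <= high:
            return name
    return None

for score in (2, 5, 10):
    value = dread_risk({k: score for k in WEIGHTS})
    print(score, round(value, 2), risk_level(value))
```

If the parameters use a different scale, only `MAX_PARAM` needs to change.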


left tartan
#

Is one metric overweighted? Perhaps a log scale is better for that metric, etc

left tartan
# solemn glen ``` RISK_LEVELS = { (1, 12): "Notice", (13, 18): "Low", ...

Another method of weighting is to calculate the percentile score for each metric, and score based on their percentile range... ie: 0-25th percentile is a 0, 25-50 is a 1, etc. Or, perhaps a percentile of the composite. I'm just thinking out loud, may not make sense in this case if the metrics are already computed somewhat arbitrarily.

topaz night
#

hey guys i know this is python server but pls help me im losing my insanity slowly rn

cold osprey
solemn glen
gaunt geyser
#

How do you make a regression plot when there are NA values in your data?

#

The two columns I'm using both have them in random spots, so I can't use dropna()

left tartan
gaunt geyser
#

I removed the rows containing NA values, but am still getting this error Cannot cast ufunc 'svd_n_s' input from dtype('O') to dtype('float64') with casting rule 'same_kind'

#

That sounds like there's still NA values, but there are none in the new df

tidal bough
#

Seems like the issue is that you have an object dtype for some reason - you probably want to cast these columns to a normal dtype like np.float64.
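
A small sketch of that fix with made-up data. The column names `x` and `y` are placeholders. An `object` column is built on purpose here; in practice it often shows up when a numeric column also holds `None` or strings:

```python
import numpy as np
import pandas as pd

# Toy frame with object dtype and missing values in both columns.
df = pd.DataFrame(
    {"x": [1, 2, None, 4, 5], "y": [2.0, 4.0, 6.0, None, 10.0]},
    dtype=object,
)
print(df.dtypes)  # both columns are object

# Drop rows with a missing value in either column, then cast to float64.
clean = df.dropna(subset=["x", "y"]).astype(np.float64)

# A plain least-squares line as a stand-in for the regression plot.
slope, intercept = np.polyfit(clean["x"], clean["y"], deg=1)
print(clean.dtypes, slope, intercept)
```

`pd.to_numeric(column, errors="coerce")` is an alternative when the column contains strings that should become NaN rather than raise.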

gaunt geyser
#

yeah that solved it

#

thank you both

#

how does something that was previously int become an object dtype?

odd meteor
odd meteor
# lunar current

Have you tried implementing what was suggested in the error message yet? Did it fix the problem?

tranquil beacon
#

Hi, I have an out of memory issue occurring when I try to download a very large dataset in parts, through a series of API calls run asynchronously. pyarrow's concat_tables method was making my service run out of memory and crashing the instance; I increased the EKS memory and that part is fine for now. I then take the dataframe and run to_csv, which causes an OOM failure, and I am limited on cluster size in EKS. Is there an alternative to pandas to_csv that converts a dataframe to CSV and is more memory efficient?

#

I am passing 100k chunk size on the to_csv call

#

sorry if wrong chat

#

I have 3500mi cpu limit and 15GI memory and although its only 5.5gb data it still runs oom

serene scaffold
#

also, depending on how you're using pyarrow.concat_tables, you might be using double the memory you need at any given time. because by default, the data for all the tables is copied into a new one, so it's like each row exists in two places at once.

#

that would probably save you from having to copy the pyarrow object into a pandas DataFrame

tranquil beacon
serene scaffold
#

@tranquil beacon the other thing you can do to save memory is to create as few variables as possible (using more nested expressions) so that intermediate objects are garbage collected as soon as possible

tranquil beacon
#

this wont allow writing to the same csv in parallel right

serene scaffold
past meteor
tranquil beacon
#

This is after etl. I can manage to get the df in a data frame and craps out when writing to csv

#

I have a big list of tables which pyarrow now can concat to data frame after increasing memory

#

This needs to happen this way as I don’t think the suggestion of using built in to csv will work in async

past meteor
#

With concat you mean join (so concatenating by column and not by row)

tranquil beacon
#

Etl is done by a separate process, can be weeks or months before it’s pulled

#

I’m pretty sure it works by column but how I will get a list of pyarrow tables not sure how that joining logic works

#

But now *

#

I was looking into pandas gzip

#

Using compression= gzip in to_csv call however not sure that will be different as a copy I’m guessing will still be made

past meteor
#

I want to understand your use case first better because I might have a solution

past meteor
tranquil beacon
#

On point one yes.
The api we are using already returns arrow serialized results which are stored in tables and we join using pyarrow concat_tables

#

A list of Pyarrow.Tables

tranquil beacon
#

Said the same thing in 3 different responses my bad, I am using external lib which is a hard requirement that does that pyarrow join and so the data frame to csv part I can control only

#

Unfortunately cannot work with parquet files for this one

bold timber
#

anyone can give me explanation about this?

small wedge
#

wdym when no loss is used? you're using binary cross entropy in both no?

#

oh nvm

#

I see the comment now

bold timber
arctic wedgeBOT
#

src/transformers/modeling_tf_utils.py lines 1508 to 1519

```py
"""
This is a thin wrapper that sets the model's loss output head as the loss if the user does not specify a loss
function themselves.
"""
if loss in ("auto_with_warning", "passthrough"):  # "passthrough" for workflow backward compatibility
    logger.info(
        "No loss specified in compile() - the model's internal loss computation will be used as the "
        "loss. Don't panic - this is a common way to train TensorFlow models in Transformers! "
        "To disable this behaviour please pass a loss argument, or explicitly pass "
        "`loss=None` if you do not want your model to compute a loss. You can also specify `loss='auto'` to "
        "get the internal loss without printing this info string."
    )
```
small wedge
#

looks like it uses "the model's internal loss computation"

#

I have no idea what that means for a bert model, maybe the loss that the original model was saved with? I assume whatever loss function it's using is better suited to the task than binary cross entropy

#

important to note leaving out loss does not train with no loss function as you can see from the message

past meteor
small wedge
#

looking into the code more we see this

        if loss in ("auto_with_warning", "passthrough"):  # "passthrough" for workflow backward compatibility
            logger.info(
                "No loss specified in compile() - the model's internal loss computation will be used as the "
                "loss. Don't panic - this is a common way to train TensorFlow models in Transformers! "
                "To disable this behaviour please pass a loss argument, or explicitly pass "
                "`loss=None` if you do not want your model to compute a loss. You can also specify `loss='auto'` to "
                "get the internal loss without printing this info string."
            )
            loss = "auto"
        if loss == "auto":
            loss = dummy_loss
            self._using_dummy_loss = True

dummy loss is defined up here

def dummy_loss(y_true, y_pred):
    if y_pred.shape.rank <= 1:
        return y_pred
    else:
        reduction_axes = list(range(1, y_pred.shape.rank))
        return tf.reduce_mean(y_pred, axis=reduction_axes)
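
To make the behaviour of `dummy_loss` concrete, here is a NumPy analogue (not the library's code; `dummy_loss_np` is a made-up name). It assumes, following the log message quoted above, that the model has already computed its own loss and hands it over as `y_pred`, so the function just passes that value through, averaging over any non-batch axes:

```python
import numpy as np

def dummy_loss_np(y_true, y_pred):
    """NumPy analogue of dummy_loss: y_true is ignored entirely."""
    y_pred = np.asarray(y_pred, dtype=float)
    if y_pred.ndim <= 1:
        return y_pred  # already one loss value per example (or a scalar)
    # Average over every axis except the batch axis.
    return y_pred.mean(axis=tuple(range(1, y_pred.ndim)))

per_example = dummy_loss_np(None, [0.5, 0.7])
averaged = dummy_loss_np(None, [[1.0, 3.0], [2.0, 4.0]])
print(per_example, averaged)
```

So a loss is still being minimized; it is just computed inside the model rather than by the function passed to `compile()`.
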
bold timber
# small wedge looks like it uses "the model's internal loss computation"

So which is the reliable result, Experiment 1 or Experiment 2?

It confuses me because I think it's impossible to train a model without optimizing a loss function, that's what machine learning is and what the optimizer does, it adjusts the weights to minimize the results from the loss function.

#

Can you elaborate on this?

small wedge
#

You're right, gradient descent is optimization of the cost function, you can't train a model using GD without loss

#

when you leave out the loss keyword argument, it uses that default 'auto_with_warning' but as you can see by the message:

Don't panic - this is a common way to train TensorFlow models in Transformers!

It seems like it uses a separate loss function which in this case (assuming your dataset is valid) performs better than plain BCE

#

it sets the loss argument to that function I linked earlier dummy_loss, and has a whole algorithm later in the code that calculates loss for you

#

that said, i'm not sure exactly what loss function that is and I'm not willing to dive any deeper into the source, but I'd say that explains why you got better performance by using the default loss
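To make the pass-through idea concrete, here is a minimal NumPy stand-in for the `dummy_loss` quoted above. It is only a sketch: `dummy_loss_np` is a made-up name, and it assumes (as the log message says) that the model has already computed its own loss and hands it to Keras in place of the predictions.

```python
import numpy as np


def dummy_loss_np(y_true, y_pred):
    """NumPy analogue of the quoted dummy_loss (hypothetical helper).

    y_pred is assumed to already contain loss values computed inside the
    model, so y_true is ignored and the values are only averaged per sample.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    if y_pred.ndim <= 1:
        # Already one loss value per sample (or a scalar): pass it through.
        return y_pred
    # Average over every axis except the batch axis.
    return y_pred.mean(axis=tuple(range(1, y_pred.ndim)))
```

So the "loss function" that Keras sees does no comparison of labels and predictions at all; the real comparison happened earlier, inside the model.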

tranquil beacon
bold timber
small wedge
#

what do you mean by reliable?

#

reliability (as in being able to trust your model's ability to generalize) would be based on your dataset, assuming your dataset was well constructed the performance you get should pretty often be a reliable outcome.

#

granted there are caveats there (very small batch sizes in minibatch gd giving you wildly different results on different training runs, so your model might be able to perform well but just fell into a local minimum or couldn't get a good enough estimation of the gradient to find one for example) but just generally your dataset is the standard for your model's ability to generalize

bold timber
# small wedge what do you mean by reliable?

What I mean is: based on the 2 experiments above, which result should I believe? That the model performance is good (without loss function) or that the model performance is bad (with loss function)?

small wedge
#

well the dichotomy here is not with loss function or without loss function

#

it's how the model performed using two separate loss functions

#

you can think of it almost as a separate model by using separate loss functions

#

if the first one consistently performs better, go with that.

bold timber
#

ah I see, thank you so much for the explanation! @small wedge

vale swallow
#

Why do you have to split the data into x and y? What do the x and y stand for? Is it something relating to independent and dependent data? Like input and output? Not sure.

small wedge
vale swallow
serene scaffold
small wedge
#

The splits are for the data we train the model on (training data), and the data we test the model on (test/validation data) to ensure that it's not overfitting on our training data

#

You need train_x and test_x as well as train_y and test_y

serene scaffold
#

@vale swallow make sense?

#

what is the split? Like the percentages?
for a train/test split, 80/20 is a pretty common place to start.
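A rough sketch of an 80/20 split in plain Python, just to show why x and y have to be split with the same shuffled indices. `train_test_split_xy` is a hypothetical helper written for this example, not a library function.

```python
import random


def train_test_split_xy(x, y, test_fraction=0.2, seed=0):
    """Shuffle x and y together and split them into train and test parts."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # one shuffle, shared by x and y
    n_test = int(round(len(x) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [x[i] for i in train_idx]
    test_x = [x[i] for i in test_idx]
    train_y = [y[i] for i in train_idx]
    test_y = [y[i] for i in test_idx]
    return train_x, test_x, train_y, test_y
```

Because both lists are indexed with the same shuffled positions, every input stays paired with its own output after the split.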

vale swallow
#

Ok, yeah, it makes sense, thank you both!!

shut girder
#

Does anyone have any free book suggestions for statistics or data analysis overall? I consider myself a beginner to data analysis. I currently have a good understanding of Python fundamentals and a simple idea of what NumPy, Pandas, and Matplotlib are used for when it comes to data analysis.

coarse mica
#

hi

#

does somebody know if the freelance market for data science is a good path to follow?

#

i've been concerned about this lately

shut girder
serene scaffold
orchid cargo
#

Should I take Data Science or ML? I'm tired of thinking about it, because both of them look good. But I really want to learn one of them.

viscid silo
#

Does anyone have any resources for creating transfer functions using time history input and output from an LTI system? I wanted to first understand a single-input single-output (SISO) system and then work up to a multiple-input multiple-output system.

minor cloak
#

[Looking for open-source contributors, see below]

Hi there,

I recently open-sourced PyGraft, a configurable Python tool to generate synthetic knowledge graphs easily!
It can be used in any AI tasks (Machine Learning, Deep Learning, Reasoning, etc.) provided that you work with graphs.

The repo is gaining a lot of visibility, and I am looking for motivated contributors to support me in implementing new features and unit tests. Ideally, you should have a general understanding of knowledge graphs, semantic web, RDF/RDFS, and OWL vocabularies. In addition, strong Python programming skills are required. Experience in Software Engineering is a plus 🙂

DM me if you would like to contribute!

Otherwise, you can still take a look and star the repo if you find the project interesting!

https://github.com/nicolas-hbt/pygraft

GitHub

Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Fingertips - GitHub - nicolas-hbt/pygraft: Configurable Generation of Synthetic Schemas and Knowledge Graphs at Your Finger...

serene scaffold
hot pivot
#

https://techxplore.com/news/2023-10-technique-based-18th-century-mathematics-simpler.html More info on the paper and link to the article itself can be found here as well

Researchers from the University of Jyväskylä were able to simplify the most popular technique of artificial intelligence, deep learning, using 18th-century mathematics. They also found that classical training algorithms that date back 50 years work better than the more recently popular techniques. Their simpler approach advances green IT and is ...

muted hollow
#

Hey guys, I'm a bit new to deep learning. I'm studying neural networks and have a question. The second matrix below represents a neural network derived from a bag of words. The third matrix represents the classification of the second matrix into its respective classes. I want to know what happens when the first layer is initialized. Will it create a layer with 3 neurons corresponding to 0,1,1, or will it generate 3 neurons, each neuron being a row of the matrix 0,1,1; 1,0,1; 1,1,0?

past meteor
#

The numbers ChatGPT is showing you are very strange, I'd forget that and "start from scratch"

spiral frigate
#

Guys, I have a problem

#

You can see my problem in python help

#

"Problems importing tensorflow"

muted hollow
past meteor
#

And what is the 1st?

muted hollow
#

the first is for shuffling the data

#

so after every shuffle, u still keep the inputs and outputs paired in the right order

#

Pardon me, it's kinda hard to explain in my non-native language.

past meteor
#

The test train split?

past meteor
fallow frost
#

after playing around with duckdb and datafusion, I'm now convinced that both of them are not production ready

#

gonna try clickhouse now

#

I had high hopes for datafusion being implemented in Rust...

desert oar
#

duckdb and datafusion are basically in-process/in-memory query engines, not all that different from polars or even pandas

#

being written in rust also should not be taken as an indicator of being production-ready or not. rust is just a programming language.

past meteor
sterile barn
#

Hey y'all.
After a good amount of data processing and encoding, I noticed my function for encoding certain columns gets a little messed up because of a certain value called "NA". I understand that this is because pandas by default interprets the string "NA" as a missing value (Null/NaN or what have you). Can I avoid this somehow? All other values are correctly understood by the function, so I know that it comes from the "NA" being interpreted wrong (source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html, under "na_values").

Function below for reference:

def rateOrdinalE(df:pd.DataFrame, cols:list) -> pd.DataFrame:
    res_df = df.copy()
    rating_map = {'NA': 0, 'Unf': 1, 'Rec': 2, 'BLQ': 3, 'ALQ': 4, 'GLQ': 5}
    for i in cols:
        res_df[i[0]] = res_df[i[0]].map(rating_map)
    return res_df
#

Never mind, brain finally woke up. Seem to have solved it with a little revision past the for-loop:

res_df = res_df.fillna(0)
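Another option, sketched below, is to stop pandas from treating "NA" as missing at load time with `keep_default_na=False`, so the mapping sees the literal string. The column name `BsmtFinType1` and the inline CSV are made up for the example; note that `fillna(0)` also zeroes any values that really were missing, which this approach avoids.

```python
import io

import pandas as pd

# Hypothetical CSV where "NA" is a real category, not a missing value.
csv_text = "Id,BsmtFinType1\n1,GLQ\n2,NA\n3,Unf\n"

# Default behaviour: the string "NA" is parsed as NaN.
default_df = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False (and no na_values), no strings become NaN.
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

rating_map = {'NA': 0, 'Unf': 1, 'Rec': 2, 'BLQ': 3, 'ALQ': 4, 'GLQ': 5}
encoded = literal_df["BsmtFinType1"].map(rating_map)
```

If other columns do rely on the default missing-value markers, `na_values` can be given per column as a dict instead of switching the defaults off everywhere.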
topaz night
#

thanks you so much tho u really help me out ❤️ @cold osprey

left kraken
#

Hello all I am new to this AI field, so want to know exactly where and how to start to get familiar with Machine learning and Data Science. So can anyone help me out with the roadmap and sources and some basic projects which makes learning interesting..?

topaz night
# left kraken Hello all I am new to this AI field, so want to know exactly where and how to st...

Easy, bro. I have 5 years of experience in AI and helped build ChatGPT. First you need a solid foundation in MATH, PROGRAMMING, AND DATA HANDLING.

Second, you need to learn the basics. Here are a few resources you can learn from (I learned from them too):
Machine Learning by Andrew Ng on Coursera, Python for Data Analysis by Wes McKinney, An Introduction to Statistical Learning (ISLR) by James, Witten, Hastie, and Tibshirani, or The Elements of Statistical Learning (ESL) by Hastie, Tibshirani, and Friedman,
and online courses on Coursera, edX, and Udacity.

Third,
after you learn some of that, do some projects, like Kaggle or your own projects.

Fourth,
after you have mastered the common knowledge, step up to the next level, like
-machine learning algorithms
-deep learning
-data processing

Fifth,
find what you want to focus on:
-natural language processing
-computer vision
-reinforcement learning
(a full AI like ChatGPT needs them all btw)

Sixth,
go on to advanced topics like
-ensemble methods
-big data
-model deployment

And seventh,
because AI keeps growing, you need to stay up to date, e.g. by following AI forums or communities.

*Note: once you have mastered all those steps, take an AI internship and tell them what you are capable of, so you can test your skills, find out what you are missing, and learn more. And yeah, stay sane, bro. Oh yeah, I forgot: stay healthy and drink a lot of water, because getting sick from studying too hard would be really stupid.

#

hope thats help and goodluck

slender bone
abstract wasp
#

Hi, not sure why this is here--why does it have wrapper 1, 2, 3 instead of the layers I added (flatten/dense) I'm using Resnet50 on my data, this is my code:
`resnet_model = Sequential()

pretrained_model = tf.keras.applications.ResNet50(include_top=False,
                                                  input_shape=(224, 224, 3),
                                                  pooling='avg', classes=77,
                                                  weights='imagenet')
for layer in pretrained_model.layers:
    layer.trainable = False

resnet_model.add(pretrained_model)
resnet_model.add(Flatten())
resnet_model.add(Dense(512, activation='relu'))
resnet_model.add(Dense(77, activation='softmax'))`

#

This is what shows up when I look at the summary:

#

It's training but this is what shows up when I compile it:

fallow frost
left tartan
fallow frost
#

and if I recall correctly, this bug was only with Duckdb, not Datafusion

left tartan
#

And there are workarounds.

#

And, most importantly, any data pipeline will employ multiple technologies, which is why arrow is becoming the backbone of most pipelines.

fallow frost
#

it's more like an absence of basic functionality that a good percentage of SQL queries use

left tartan
#

Being able to read a parquet file, without ingesting to a table, while filtering the contents on the fly based on the selected criteria is fairly bleeding edge stuff. Doesn't bother me one bit that it has limitations.

fallow frost
#

and I think that regardless, a one-element tuple should be converted to an equality check, but that's beside the point

left tartan
#

I hear you, I'm just saying it's the absence of an optimization you're complaining about. Pushdowns into the parquet reader are limited today to column selection and AND-ed equalities: that's known (perhaps it could/should be better documented, but I'm not affiliated with them).

fallow frost
#

oh and btw, the column in question is actually a partitioned column (hive style) with only about 30 distinct values, how in the world is it even possible for this query to take more than 20 seconds

left tartan
#

I have no idea, but if you could publish a reproducible example, I'd love to take a look.

fallow frost
#

you're mistaking this for a lack of a feature, but I'm pretty confident that I can filter the data faster with pyarrow + pure-python loop

#

(even while doing all the filtering in Python)

left tartan
past meteor
#

@left tartan how do you test your workflow?

#

I hear a lot about testing in data/DE spheres but I see no one doing it

left tartan
past meteor
#

And what exactly are you testing?

left tartan
#

We're not an airflow shop (doesn't fit our use case), but there's also solutions on that side (I think Eivl would be the one to ask on that side).

past meteor
#

I don't use Airflow either don't worry

left tartan
# past meteor And what exactly are you testing?

from a data perspective, it's really about testing that our data acquisition is working correctly/consistently, and checking each step along the pipeline to see if we're regressing. So, what we test is primarily injecting known data and making sure we get expected outputs. That's the idea of snapshot testing
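As a rough illustration of that idea (not the actual framework being described), a snapshot test feeds known data through a step and compares the result with a stored copy. `transform` and `check_snapshot` are hypothetical names, and the JSON-on-disk format is an assumption made for this sketch.

```python
import json
from pathlib import Path


def transform(rows):
    """Hypothetical pipeline step: total amount per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return [{"customer": c, "total": t} for c, t in sorted(totals.items())]


def check_snapshot(name, actual, snapshot_dir):
    """Return True if `actual` matches the stored snapshot.

    On the first run there is nothing to compare against, so the output
    is recorded as the new snapshot and the check passes.
    """
    path = Path(snapshot_dir) / f"{name}.json"
    serialized = json.dumps(actual, indent=2, sort_keys=True)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(serialized)
        return True
    return path.read_text() == serialized
```

The first run records the output; later runs only pass while the output for the same known input stays identical, which is what flags a regression.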

past meteor
#

I only know snapshot testing from idk jest and front end JS. I'll think about this...

left tartan
#

That's exactly where the idea came from... one of our senior engineers basically said: "Why can't we do something like jest?"

past meteor
#

I cycle to work every day and I had a good idea on how to test data pipelines, I think I'll write it out in full.

left tartan
#

Previous company, we built out a test framework for end to end tests (again, injecting known inputs), but we didn't do it at the unit level. A combination of dbt (decomposing to individual scripts) and snapshot testing is kinda nice.

past meteor
# left tartan Would love to hear

The gist is simply that you need to know the schema of your output. Around this you have a "contract". The most basic version, which is always in the contract, is basically "I have these N tables with M columns, they are of type T1, T2, ..". This idea is DB agnostic: your RDBMS captures all of this but other things wouldn't.

On top of this you formulate extra properties or invariants that are part of your contract. For instance, "all sport sessions in my dataset have a unique identifier" or "The meals column contains all the meals consumed, even if other data is missing".

Each line in the contract must either be enforced by your DB (so violating it is impossible) or be tested. Why? These are exactly the assumptions people downstream make about your data. Finally, never make a backwards incompatible change to the contract, so no renaming of tables or columns. It just ain't worth it, so much downstream always breaks.

#

The key insight is that I'm not testing transformations, I'm testing whether or not it adheres to my spec. I should be able to refactor my pipeline into something more performant and hit the same result.
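A small sketch of what testing against such a contract could look like. The schema, the column names, and `contract_violations` are all made up for illustration; the point is only that the checks look at the output, never at how it was produced.

```python
# Hypothetical contract: column name -> expected Python type.
EXPECTED_SCHEMA = {"session_id": int, "sport": str, "meals": list}


def contract_violations(rows):
    """Return a list of human-readable contract violations (empty if none)."""
    violations = []
    for i, row in enumerate(rows):
        if set(row) != set(EXPECTED_SCHEMA):
            violations.append(f"row {i}: columns {sorted(row)} do not match the contract")
            continue
        for column, expected_type in EXPECTED_SCHEMA.items():
            if not isinstance(row[column], expected_type):
                violations.append(f"row {i}: {column} is not {expected_type.__name__}")
    # Invariant from the contract: every session has a unique identifier.
    ids = [row["session_id"] for row in rows if isinstance(row.get("session_id"), int)]
    if len(ids) != len(set(ids)):
        violations.append("session_id is not unique")
    return violations
```

A refactored pipeline passes as long as its output still yields an empty list, regardless of which queries or libraries it uses internally.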

left tartan
#

Still reading, but you're also touching on something I didn't mention. We have validation sql as part of our test cases process, but these are baked into the pipeline itself (so it runs every time, rather than as a unit test), similar to what you're describing.

past meteor
#

Personally I'm very wary of being overzealous and testing too much. I don't want to test if my SQL or polars or Pandas does something, I want to wrap that in a function and check if what it returns, given my input, is what I needed it to return

#

Changing a query should not change my tests etc etc (for the same output)

left tartan
#

Yah, that's sort of why I like snapshot testing: I just want to know if something unexpected changed, but I don't want to go through the effort of writing a test for every case. I think I agree with where you're thinking... I like the idea of a combination of validation (assertions) and snapshot testing.

past meteor
#

I haven't used DBT but I find it very very sus idk. If you have 4 intermediate DAG nodes you shouldn't test those

#

And all those models introduce coupling as well

#

I like it if and only if the contract states that thing A and thing B must go through the same process. Otherwise I would have linear input-to-output flows with no shared models, if I were to use DBT, that is

past meteor
left tartan
past meteor
#

Tbh, I like having them after I guarantee that what I have is correct especially because backwards incompatible changes (aka. breaking everyone's dashboards) are a massive no-no. I'll think about integrating this.

left tartan
#

We do go through fairly lengthy customer acceptance tests, so usually our problems are: some new edge case comes up (ie: data is incorrect under certain conditions)… we fix that, but then need to make sure we didn’t regress under the cases we’ve already validated.

past meteor
#

Ah yeah, when you have a very mature pipeline that's the way to go

#

But yeah, my approach is very inspired by sane SWE style testing. I should really write this in full with an actual project. If you squint enough it's TDD on data.

left tartan
#

Yah, I like where you’re going with this!

frank quiver
fair arrow
#

hey all - is this a good place for a question about the pandas library and a student-scheduling app I'm writing?

#

sort of an intermediate-ish question, and I'm trying to avoid sloppy programming practices, so it might take a bit of explanation. Not sure if this is the right channel for asking for help.

fair arrow
#

... a few moments, I may have solved it.

lapis sequoia
#

Can someone help me real quick in python please

fair arrow
#

I can try, while I try and answer my own question.

small wedge
abstract wasp
#

Hi, I’m going to start a sort of a blog where I write stuff related to ML/AI. The main overall subjects I will include are mathematics behind the algorithms, explanation of the actual algorithms alongside some code (project examples). How should I split the menu bar?

left tartan
desert oar
#

These are a valuable kind of post if you can write them well. Don't worry about the blog format, just start writing. You're more likely to make progress that way.

abstract wasp
#

Thanks!

fair arrow
#

I think I've got a solution. The short version is that I'm trying to update a DataFrame over which I'm iterating, which Pandas docs explicitly warn against. I'm realizing I need to refactor my code into a better pattern.

desert oar
fair arrow
#

is a better pattern to throw the updates into a second dataframe, and then merge the two at the end?

desert oar
#

can you describe what you're actually trying to do?

#

in this particular case, not your app in general

fair arrow
#

I have a dataframe of students, their class preferences, and the actual scheduled classes. I'm trying to add to the actual scheduled classes if the class isn't overenrolled. The problem is that, as I add students, iterrows() is still working with an old copy of the dataframe, so I can't poll the dataframe for the updated attendance, and the class gets overenrolled.

desert oar
#

(and i literally never actually use iterrows. i always use itertuples instead)

fair arrow
#

it was because apply() didn't update the dataframe as I iterated either, and I thought iterrows might

desert oar
fair arrow
#

yep, just deleting a portion that didn't work, one second

desert oar
#

if you can include an example input and your desired output, that would be very helpful as well

fair arrow
#
for day in SCHEDULE_DAYS:
    print("--------now scheduling for day" + str(day))

    for preferenceNum in range(1, PREFERENCES_PER_DAY + 1):
        print("---------now scheduling for preference " + str(preferenceNum))
        for grade in GRADES_ORDER:
            print('--------------------now scheduling for grade ' + str(grade))
            #Because the shuffled student data is working with the filtered student data,
            #the indexes must be preserved
            student_data = student_data.sample(frac=1, axis=0, ignore_index=False) #shuffle students
            student_data_filtered_by_grade = student_data.loc[student_data["Grade"] == grade]
            classes_to_add = pd.DataFrame()
            for index, row in student_data_filtered_by_grade.iterrows():
                r = scheduleClassIfEligible(row, day, preferenceNum, grade, student_data, electives_list)
                if isinstance(r, pd.Series):# if it's a series and not None...

                    #count r and student_data's concat'd column, pass in only if it wouldn't over-enroll
                    #add to classes_to_add
                    classes_to_add = pd.concat([classes_to_add, r])
                    #add the classes at the end of the grade student_data
#

the bottom portion is just pseudocoded for now

desert oar
#

is this something like where each student has a selection of preferences each day, and you randomly assign first preferences, then second, and so on?

fair arrow
#

yep yep

desert oar
#

it looks like that assignment happens separately for each class year / grade, and for each day?

fair arrow
#

I've done it in Java, and was porting it over to this trying to use pandas

desert oar
#

how did you do it in java? i would almost argue that this is not a great use for pandas, since you are in fact operating on each row sequentially

fair arrow
#

pandas made every other part of this project easier!

desert oar
#

fair enough. occasionally i do things like convert my dataframe to a list of dicts, do something with that, and then convert the dicts back to a data frame.

#

(you can do this with pandas of course)
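A minimal sketch of that round trip, with made-up columns: convert to a list of dicts, work on the dicts sequentially, then build a new DataFrame.

```python
import pandas as pd

# Hypothetical student table.
df = pd.DataFrame({"student": ["Ana", "Ben"], "grade": [9, 10]})

# DataFrame -> list of plain dicts, one per row.
records = df.to_dict(orient="records")

# Ordinary sequential Python on each row.
for record in records:
    record["label"] = f'{record["student"]} ({record["grade"]})'

# List of dicts -> DataFrame again.
result = pd.DataFrame.from_records(records)
```

Note that this drops the original index unless it is first moved into a column with `reset_index()`.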

#

but i am curious what your java implementation of this same logic looks like

#

are you new to python? or just pandas specifically?

fair arrow
#

to be honest I don't know off the top of my head how I did it in Java, it's on my Github. It's been a while. I think I was maintaining two separate tables of data but I did run into bugs trying to keep both in sync.

#

I fixed them but thought this would be a more expandable way to do it

desert oar
#

i actually would suggest keeping a separate table here as well

#

but you can rely on the student id / table index (as you already pointed out) to keep it in sync

#

then you can use DataFrame.join at the end
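Something like the sketch below, assuming a single preference column and fixed class capacities (both invented for the example). The running enrollment lives in a plain dict that is updated inside the loop, so the over-enrollment check always sees current counts, and the assignments are joined back on the student id at the end.

```python
import pandas as pd

# Hypothetical data: one row per student, indexed by student id.
students = pd.DataFrame(
    {"first_choice": ["Art", "Art", "Art", "Chess"]},
    index=pd.Index([101, 102, 103, 104], name="student_id"),
)
capacity = {"Art": 2, "Chess": 5}

enrolled = {name: 0 for name in capacity}  # live enrollment counts
assignments = {}                           # student id -> scheduled class

for row in students.itertuples():
    choice = row.first_choice
    if enrolled[choice] < capacity[choice]:
        enrolled[choice] += 1
        assignments[row.Index] = choice

# Separate table keyed by student id, joined back once the loop is done.
scheduled = pd.Series(assignments, name="scheduled_class")
result = students.join(scheduled)
```

Students who did not get their choice end up with NaN in `scheduled_class`, ready for the next preference round.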

#

however i see some issues with your code that you might want to address

#

not necessarily critical problems, but suggestive that you don't have the algorithm laid out clearly in your mind

fair arrow
#

hmmm, well - the thing is that this worked for our school. But there were a handful of classes overenrolled.

#

But 95 % of the data was good so we used it and I was going to fix these bugs for next semester.

#

And they just did the fixes by hand. It still was a big timesaver.

desert oar
#

what does each row in this table represent?

#

and what is r? the code suggests that it's a Series, but a series containing what exactly?

fair arrow
#

ah, the thing with r was relatively new

desert oar
#

probably the most important favor you can do for yourself when working with data is to be very clear about what each "row" or "entity" represents

fair arrow
#

I should stash this and go back to my earlier commit

desert oar
#

that's what branches are for!

#

i ask what your java implementation looks like in part because i feel like it might be easier to start from something that more or less works

fair arrow
#

once I dig my way out of this bug, that was the plan haha

#

I did actually start from that, but I laid out this code a while ago, like months ago. And it may be a bit much for me to get into right now how it worked, mainly because it's late and I need to teach tomorrow 🙂

desert oar
#

fair enough

#

so what is each row of student_data? what are the columns? how are the students' preferences and class schedules represented here? i think i would need to know that in order to help, otherwise it's too much guess work for someone who isn't inside your head

past meteor
#

Does it need to be a neural network? You can just get features with something like tsfresh and then cluster with idk k-means.

#

I work in the time series domain and honestly, I'm a sucker for simple solutions 😄

#

K-means has exactly the same property, close clusters are similar as well

#

Their similarity is just the distance between the cluster centers

#

I mean, you should look at K-means (and all clustering) as solving this optimisation problem:

clustering = argmax(distanceBetweenClusters(data)) && argmin(distanceWithinCluster(data))

#

So yes you do actively want to create dissimilar clusters but the fact of the matter is that maybe an optimal clustering has 5 close clusters and the sixth that is very far (outliers, fraud, ...)
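For reference, the standard k-means objective only minimises the within-cluster part explicitly. Since the total scatter of the data is fixed, that is the same as maximising the between-cluster part, which is the sense in which the two goals above go together. Here $\mu_k$ is the mean of cluster $C_k$ and $\bar{x}$ the mean of all $n$ points:

```latex
\[
\min_{C_1,\dots,C_K} \; \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
\]
\[
\underbrace{\sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^2}_{\text{total (fixed)}}
=
\underbrace{\sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2}_{\text{within clusters}}
+
\underbrace{\sum_{k=1}^{K} |C_k| \, \lVert \mu_k - \bar{x} \rVert^2}_{\text{between clusters}}
\]
```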

#

What are input nodes? Are they just variables?

#

What are catch 22 parameters? Are they just non informative variables?

#

What is catch22

#

ooooh

#

So you have 440N features where N is the number of input series you have per sample

#

In defining your problem you're already thinking in terms of neural nets (weights) etc. I'd definitely take a step back because you might find something very simple! 😄

#

Probably not but who knows

#

But in general, don't think in terms of the solution, keep it very simple first. I tend to write it in LaTeX style notation in a markdown, like really generalize the problem I'm doing.

#

I understand your solution now, you want to make clusters and then models for each cluster

#

Is your problem time series classification?

#

Okay let me make a suggestion

#

Keep it simpler. If you are solving a time series classification problem make N separate models for your thing

#

One using temp, one using humidity, one using velocity (idk)

#

Evaluate the performance of each of them. Keep the models absurdly simple here, for instance the last N lags with an Xgboost type model.
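A rough sketch of the "last N lags" feature table for one input variable, assuming each sample is a dict holding one list of readings per variable plus a class label (that layout and the helper names are invented here). The resulting `X` and `y` could then go into any simple classifier, one model per variable.

```python
def last_lags(series, n_lags):
    """Return the final n_lags readings of one series as a feature row."""
    if len(series) < n_lags:
        raise ValueError("series is shorter than the requested number of lags")
    return list(series[-n_lags:])


def build_feature_table(samples, variable, n_lags):
    """Build (X, y) for a single variable.

    samples: list of dicts mapping variable name -> list of readings,
    plus a "label" key holding the class (hypothetical layout).
    """
    X = [last_lags(sample[variable], n_lags) for sample in samples]
    y = [sample["label"] for sample in samples]
    return X, y
```

Repeating this per variable (temperature, humidity, and so on) gives the N separate simple models whose scores can be compared.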

#

This will already tell you something about the relevance of each individual series, not everything but something

#

The next step I'd do is build a stacking ensemble using my N "simple classifiers" and see if this improved my score

#

Obviously, stacking doesn't take interaction effects at the input level into account, that's the major drawback. The thing is, codewise this is so so simple. I'd also have to see if interaction effects at the input level are relevant for my problem to begin with. If they are then yeah, from that point onwards I'll start thinking of some multivariate time series model instead of an ensemble of univariate models

#

With this solution I have given myself several places to solve the problem prematurely. If you overengineer from the get go you might have wasted time. Additionally, you need to be able to compare your high complexity solution to low complexity solutions to get a sense of whether or not it was worth it in the first place.

Make sense?

#

Have you tried it yet?

#

As in, are you sure none of them are better than random

#

Basically, there's tons of ways to do specifically univariate time series classification. I'd start there. One of them is feature extraction but there's others

#

Do that, proceed into a stacked ensemble and then gradually go towards your solution

#

But idk, maybe I don't understand what you're trying to do in the first place 😅 . Considering there's a lot of NN specific terminology it looks like you're set on what you want to do. All I'm saying is, take a step back, formulate your problem without using the word "weight", "cluster", "node", "backpropagation" and then progress from there 🤷

dusk tide
somber prism
#

Hello, I installed the PyTorch GPU build and I've been training my model for 5 epochs on the GPU every day, but today for some reason my GPU usage is 0%. Does anyone know what's wrong with it? I didn't make any changes in my code.

#

Also there used to be nvidia rtx 3060 next to CUDA_VISIBLE_DEVICES : but now it’s showing 0

harsh minnow
#

I am fine-tuning GPT-3 on news documents. When I upload a training file with only system messages, it throws an error and asks for an assistant message. So for each news document I sent the training data as: {"role": "system", "content": "NEWS Article"}, {"role": "assistant", "content": "Placeholder message for fine-tuning"}

Now whenever I ask it a question, it responds: "Placeholder message for fine-tuning". What should I do?

cold osprey
#

any idea what technical theory means?

#

role is for AI/ML software engineer

left tartan
#

Wow, that's an anxiety laden email. I think we can all generally guess, but it would certainly depend on the job title/level

left tartan
#

Tough thing is it doesn't give a clue about their stack... but it's fintech

cold osprey
#

haha ye i kinda know what they do here

#

i would expect it to be python heavy

#

idk y they have java n cpp

left tartan
#

I mean, are they looking for NLP stuff? LLM? More classical EDA? etc

cold osprey
#

oh

#

but this seems more like a SWE role right

#

APIs for the models

#

ML/MLOps engineer kinda stuff

left tartan
#

Yah, but you're meeting with "head of AI", so unsure if that's an AI oriented theory discussion, or a SWE theory discussion?

cold osprey
#

HAHAHA

#

fair point

#

theyre hiring for hella roles

left tartan
#

I think I'm just adding pressure not helping 🙂

cold osprey
#

HAHA npnp

#

tbh im pretty weak in either of those topics: AI or SWE

#

i mainly do data viz stuff and some pipelines at work so

left tartan
#

In general, my interview advice is: Know what you know, be pleasant, and know how to not know the answer to something you don't know, especially when you're talking to someone who's an expert.

#

For AI and SWE, if you're prepping for an interview, you really can't learn something completely new to prep for an interview. But... one thing I think is very helpful is: Watch conference videos to learn about current trends in technology.

#

(beyond that, maybe ask the same question in #career-advice and you'll get better input)

cold osprey
#

Ait cool thanks

past meteor
#

And what is the difference between xgboost and random forest, I think it's stuff like that no? The typical ones

little idol
#

Hi guys

#

I'm in need of a little bit of help here

delicate apex
#

a recently-added sonarlint rule, warning against equality comparison with floats, has triggered in a file of mine. Given that the context is an sqlite file imported into a pandas dataframe, and the values in question are powers of two (0, 0.5, 1, 2, with sonar complaining at if cell==0.5), I should be able to ignore this without consequence, right?
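One way to check that reasoning, sketched with the standard library only: values like 0.5, 1 and 2 are stored exactly in binary floating point, while something like 0.1 is not. `is_exactly_representable` is a helper invented for this example. The comparison is only safe if the stored values were never produced by arithmetic that rounds; otherwise `math.isclose` is the usual fallback.

```python
import math
from fractions import Fraction


def is_exactly_representable(decimal_text):
    """True if the decimal literal survives conversion to float unchanged."""
    return Fraction(float(decimal_text)) == Fraction(decimal_text)


# Values from the question: all exact, so == compares what you expect.
exact_values = [is_exactly_representable(t) for t in ("0", "0.5", "1", "2")]

# Counter-example: 0.1 is only approximated, and rounding shows up in sums.
tenth_is_exact = is_exactly_representable("0.1")
sum_matches = (0.1 + 0.2 == 0.3)
sum_is_close = math.isclose(0.1 + 0.2, 0.3)
```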

rustic scarab
#

I'm trying to draw a plot to show streamlines for the equation psi = -Kx^2 + Ky^2. Here is my code for that:

import numpy as np
import matplotlib.pyplot as plt

K = 1     # I'm assuming this assumption is wrong

# Define the grid
x = np.linspace(-100, 100, 50)
y = np.linspace(-100, 100, 50)
X, Y = np.meshgrid(x, y)

# Calculate the vector field components
U = -K * X**2
V = K * Y**2

# Create a streamline plot
plt.streamplot(X, Y, U, V, density=1, linewidth=1, arrowsize=1.5, color='blue', broken_streamlines=False)

# Add x and y coordinate lines at 0,0
plt.axhline(0, color='red', linewidth=2.5)  # Horizontal line at y=0
plt.axvline(0, color='red', linewidth=2.5)  # Vertical line at x=0

# Add labels and title
plt.xlabel('x')
plt.ylabel('y')
plt.title('Streamline Plot of -Kx^2 + Ky^2')

# Show the plot
plt.show()

This is the result that I'm getting (image attached with red coordinates).

However, the expected result should look like the other image.

Can anybody help me with this, please?

tidal bough
rustic scarab
tidal bough
#

plt.contour takes the scalar field itself, so:

f = K * (Y**2 - X**2)
plt.contour(X, Y, f, levels=20)
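If arrows are wanted as well, `streamplot` needs the velocity components rather than the two terms of psi. A sketch under the usual stream-function convention u = dpsi/dy, v = -dpsi/dx (an assumption, since the question does not state which convention it uses), which for psi = K(y^2 - x^2) gives u = 2Ky and v = 2Kx. The `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, drop this for interactive use
import matplotlib.pyplot as plt
import numpy as np

K = 1.0

# Same evenly spaced grid as in the question.
x = np.linspace(-100, 100, 50)
y = np.linspace(-100, 100, 50)
X, Y = np.meshgrid(x, y)

# Stream function and the velocity field derived from it.
psi = K * (Y**2 - X**2)
U = 2 * K * Y   # u =  d(psi)/dy
V = 2 * K * X   # v = -d(psi)/dx

fig, ax = plt.subplots()
ax.contour(X, Y, psi, levels=20, colors="lightgray")  # lines of constant psi
ax.streamplot(X, Y, U, V, density=1, color="blue")    # same curves, with arrows
ax.set_xlabel("x")
ax.set_ylabel("y")
```

The streamlines follow the contour lines because the velocity is everywhere perpendicular to the gradient of psi.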
stark phoenix
#

Why do I get different ticks than the displayed ones?

#

I would expect ax.set_xticks(ax.get_xticks()) to do nothing... but it makes it so x = -2 and x = 6 are visible :/

serene scaffold
#

please always give text as text, not as screenshots. people might need to copy parts of the message to help you.

reef pulsar
#

Sorry. I'll keep that in mind

thick topaz
#

So we are considering building a data warehouse. We have approximately 300-400 databases, all with identical structure but different datasets of course. Now we are considering creating one large data warehouse with data from all of the 300-400 databases. How would one go about this?

#

Which libraries would you guys use, and what are the dos and don'ts in regards to this?

rustic scarab
#

is it possible to put arrows in a contour plot?

serene scaffold
thick topaz
#

We aren't talking millions of rows per DB, probs 20-100k per DB ish

radiant crypt
#

hello im trying to make a bot talk to the user but when i try to allow the text to print on the textbox it gives me an error TypeError: CTkEntry.get() takes 1 positional argument but 3 were given

thick topaz
#

I think we need more code than that

radiant crypt
#

r u able to be in a vc?

#

its easier to show u there

thick topaz
#

Nah but feel free to DM me a part of the code and I might have a look later tonight

radiant crypt
#

alr

#

thx

cold osprey
#

if its on MSSQL, i presume u would use something like SSIS?

thick topaz
# cold osprey wdym what libraries?

Mainly in regards to what would be the smartest, fastest and most reliable way of transferring certain data from these 300-400 databases into another database

languid prairie
#

need someone who is familiar with LlamaIndex

cursive valve
#

Hi, is there anyone who is willing to discuss some python code with me? I have spent so much time on it and there is still something that is not working.

keen kettle
#

@cursive valve Hi, if you want to discuss your python code.

cursive valve
#

Are you willing to help me?

keen kettle
#

what kind of project?
pls show me.

cursive valve
#

it will be better if could talk because it is quite complicated

#

but I will leave it up to you

keen kettle
#

you have document?

cursive valve
#

yeah but it is not in english

keen kettle
#

hmm... no problem

cursive valve
#

I mean in a nutshell it is a program that will subtract numbers like '1.01e-11' and '-1.10110101e111', both in binary form, and both could be negative or positive

#

and I have to write output

#

With small numbers it works perfectly fine, but with larger ones it causes trouble

keen kettle
#

that's interesting.

#

Can you share your code?

#

@cursive valve are you there?

cursive valve
#

Well I do not really want to share my code publicly, because of strict rules that I have to follow

#

But I need to work with strings, otherwise there will be inaccuracy
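
One hedged way to stay exact without going through floats is to parse the strings into `fractions.Fraction`. This sketch assumes the mantissa and the exponent are both written in binary and that the exponent is a power of two; the real format is not shown above, and `parse_binary_sci` is a made-up helper name:

```python
from fractions import Fraction


def parse_binary_sci(text: str) -> Fraction:
    """Parse e.g. '-1.101e11' (binary mantissa, binary exponent) exactly."""
    mantissa, _, exponent = text.partition("e")
    sign = -1 if mantissa.startswith("-") else 1
    mantissa = mantissa.lstrip("+-")
    int_part, _, frac_part = mantissa.partition(".")
    digits = (int_part + frac_part) or "0"
    # Binary digits as an integer, scaled down by the number of fractional bits
    value = Fraction(int(digits, 2), 2 ** len(frac_part))
    exp = int(exponent, 2) if exponent else 0  # int() accepts a leading sign
    return sign * value * Fraction(2) ** exp


a = parse_binary_sci("1.01e-11")
b = parse_binary_sci("-1.10110101e111")
difference = a - b
print(difference)
```

`Fraction` uses arbitrary-precision integers, so large exponents do not lose bits the way `float` does. Turning the result back into a binary string would be a separate step.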

keen kettle
#

you are right

little idol
#

@keen kettle Hello my friend, how are you?

I don't know if you still have some spare time, but I would like to use this opportunity to ask you something about this project I'm enrolled in

I am building a portfolio project which consists in:

I've downloaded a database from kaggle of a fictional telecom company, regarding churn data
Then, I've uploaded the database to SSMS
I've cleaned the data using python and SQL, one-hot encoded columns, solved missing and zero values
I've done feature engineering to prepare for the usage of Random Forest Classifier machine learning algorithm to create a churn prediction tool

The question is: my engineered features ended up in new tables whose structure, column names and overall shape differ from the original database. How can I merge them into one single dataset to feed into the ML algorithm? From my research, the dataframes need a common index column, but the original index column was Customer ID, and the newly created tables have no connection to Customer ID, since each feature table answers a different question.

Could you lend me a hand ?

keen kettle
#

@little idol okay

#

so u want sql query?

#

i am interested in your project.

little idol
#

I want to understand how can I use my engineered features to feed my machine learning model

I can share my screen with you

#

I want to use python to run this machine learning model, but I don't know how to do it, this is my first time dealing with machine learning models

keen kettle
#

hmm...
show me your screen

terse frigate
#

I am seeking guidance for a conceptual endeavor regarding Natural Language Processing (NLP) model refinement. My objective is to tailor an NLP model utilizing a private corpus. The data at hand comprises two distinct datasets:

  • The first dataset encapsulates JSON structured data, encompassing names and pertinent information of various enterprises.
  • The second dataset embodies a text corpus, inclusive of a passage containing 6000 words.

The envisioned outcome is to engineer a model proficient in executing text or data retrieval from the aforementioned corpus, predicated on user prompts. I am open to insights or suggestions that may steer me towards devising a viable solution for this challenge. Your expertise and directional advice would be greatly valued.

serene scaffold
# terse frigate I am seeking guidance for a conceptual endeavor regarding Natural Language Proce...

so you're trying to develop an information retrieval system. are you looking for a ChatGPT-like user experience?

by the way, you're using a lot of fancy terms here in a way that I don't think is necessary. NLP professionals don't talk about datasets that "encapsulate JSON structured data" or "engineering a model proficient in executing [task]". you can just say that the first dataset is JSON data representing [whatever it represents], and that the desired result is a model that retrieves information predicated on user prompts.

#

"The second dataset embodies a text corpus, inclusive of a passage containing 6000 words."
"inclusive of a passage containing 6000 words": is that the whole corpus, or is that just one of the documents that are in it?

random fox
#

I am attempting to animate a pcolor plot with matplotlib.pyplot and matplotlib.animation. However, I'm running into an issue with the colorbar duplicating repeatedly. The following is the figure setup code:
```py
fig = plt.figure()
ax = fig.add_subplot()

def init():
    return ax

def update(frame):
    f = int(frame)
    plt.cla()

    """
    ax.set_xlabel(r'$x$')
    ax.set_ylabel(r'$y$')

    ax.set_xlim(0, 2 * pi)
    ax.set_ylim(0, 2 * pi)
    ax.xaxis.set_major_formatter(FuncFormatter(lambda val,pos: '{:.0g}$\pi$'.format(val/pi) if val !=0 else '0'))
    ax.yaxis.set_major_formatter(FuncFormatter(lambda val,pos: '{:.0g}$\pi$'.format(val/pi) if val !=0 else '0'))
    ax.xaxis.set_major_locator(MultipleLocator(base = pi))
    ax.yaxis.set_major_locator(MultipleLocator(base = pi))
    """

    u_plot = ax.pcolor(x, y, np.transpose(u[frame, :, :]), cmap = 'coolwarm')
    # ax.set_aspect('equal')
    # ax.set_title(r"$u(t, x, y), t = $" + "{0:.3f}".format(t[frame]))
    fig.colorbar(u_plot)

ANI = ani.FuncAnimation(fig, update, frames = range(0, T), init_func = init)
ANI.save("advection_animation.mp4", fps = 180, dpi = 200)
```

#

Unfortunately, I end up with animations like the following:
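
A likely cause, offered as a sketch rather than a confirmed diagnosis: `fig.colorbar(u_plot)` creates a new colorbar axes on every call, and clearing `ax` does not remove it. One workaround is to reserve a dedicated colorbar axes once and redraw into it. The data below is a made-up stand-in for `u`, `x`, `y`, and `T`, and `pcolormesh` is swapped in for `pcolor`:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical stand-in data: u[t, i, j] on a small grid
T = 5
x = np.linspace(0, 2 * np.pi, 16)
y = np.linspace(0, 2 * np.pi, 16)
t = np.arange(T)
u = np.sin(x[None, :, None] - 0.3 * t[:, None, None]) * np.cos(y[None, None, :])

# One axes for the plot, one narrow axes reserved for the colorbar
fig, (ax, cax) = plt.subplots(1, 2, gridspec_kw={"width_ratios": [20, 1]})


def update(frame):
    ax.cla()
    cax.cla()  # clear the old colorbar instead of stacking a new one
    mesh = ax.pcolormesh(x, y, u[frame].T, cmap="coolwarm",
                         vmin=u.min(), vmax=u.max(), shading="auto")
    fig.colorbar(mesh, cax=cax)  # draw into the reserved axes
    return (mesh,)


# Stand-in for what FuncAnimation would do: call update once per frame
for frame in range(T):
    update(frame)
```

The same `update` can be passed to `FuncAnimation(fig, update, frames=range(T))`. Fixing `vmin`/`vmax` keeps the colour scale constant across frames, which may or may not be what you want.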

serene scaffold
lapis sequoia
#

chat gpt is quite nice at translating jupyter to python scripts

#

the exports functions normally do not understand what to do with the bash commands

serene scaffold
lapis sequoia
#

it didn't run in any way here, not sure why

#

i wanted to run the cli command to be able to use tmux now that we upgraded from colab to a pure ubuntu server with gpu

#

it was so complicated to convert that simple file lol...

serene scaffold
#

Tmux is bae af

lapis sequoia
#

not sure what bae is, but i'm forcing it to be at ease lol

serene scaffold
serene scaffold
lapis sequoia
#

o_0

lapis sequoia
#

anyone here uses fluorescent tshirts for coding

tame path
#

is there a way to find out quantitatively whether a given spectrogram contains "some sound" or is just silent
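
One simple approach, sketched under assumptions: treat the spectrogram as linear magnitudes, compute the mean power in dB, and compare it to a threshold. The `is_silent` name, the -60 dB threshold, and the reference level are illustrative choices that would need tuning on real recordings:

```python
import numpy as np


def is_silent(spectrogram, threshold_db=-60.0, reference=1.0):
    """Return (silent?, level in dB) for a magnitude spectrogram."""
    power = np.abs(np.asarray(spectrogram)) ** 2
    mean_power = power.mean()
    # Clamp to avoid log10(0) for an all-zero spectrogram
    level_db = 10.0 * np.log10(max(mean_power, 1e-12) / reference)
    return level_db < threshold_db, level_db


silence = np.zeros((64, 100))
tone = np.zeros((64, 100))
tone[10, :] = 1.0  # constant energy in a single frequency bin

print(is_silent(silence))
print(is_silent(tone))
```

If the spectrogram is already in dB (as many libraries produce for display), skip the log step and threshold the values directly.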

terse frigate
#

Thank you for responding

#

I only wrote the request in fancy terms since I have been asking people for help on LinkedIn too

#

Yes correct - information retrieval system

terse frigate
#

What I am trying to develop is a POC

#

Where I just want to demonstrate that we can build a model that is curated to our needs and our dataset

#

And not open domain

terse frigate
weak mortar
#

hello! i have a weird problem with my DataFrame today. i have a column with datetime. time is in format "2017-08-17 04:00:00", but as soon as i pass it to any function, without doing any manipulations or modifications of it, it changes time to 1970-01-01 00:00:00.000

serene scaffold
#

@terse frigate I still need to see examples of the json data

weak mortar
#
print(somedataframe)
def wtfisgoingon(input):
    print(input)
    return
wtfisgoingon(somedataframe)
#

i never experienced such behavior before that my data gets altered by passing it to a function , and quite clueless on how to solve it

serene scaffold
#

if an object is mutable (like dataframes), anything with a reference to that object can modify it, including functions
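
a tiny illustration with a plain list (the same reasoning applies to dataframes); the function names are made up:

```python
def add_item_in_place(items):
    # Mutates the object the caller passed in
    items.append("added inside the function")


def rebind_only(items):
    # Builds a new list and rebinds the local name; the caller's list is untouched
    items = items + ["never seen by the caller"]
    return items


data = ["original"]
add_item_in_place(data)
rebind_only(data)
print(data)  # ['original', 'added inside the function']
```

merely passing the object to a function changes nothing; only code inside the function that mutates it does.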

weak mortar
#

but i assume that would require that said anything or function, is run

serene scaffold
#

yes? in either case, your code example doesn't encapsulate the problem.

weak mortar
#

ok. what do you suggest me to do to try and solve this?

#

as you see i print somedataframe , and the only thing other that i do to somedataframe is to pass it to the function, and print it again. so no code is initiated that should change it

serene scaffold
weak mortar
#

ill try and recreate the problem in a new .py 👍🏻

modern quail
#

Types of neural networks and optimization algorithms are independent of each other right? Like I can use any optimization algorithm I want with any type of neural network?

serene scaffold
modern quail
#

I am talking in general

serene scaffold
#

yes, and are those two examples of what you mean by "optimization algorithm", or not?

modern quail
#

by types I mean like LSTM, ANFIS and optimization alogrithm ACSLFA, Levenberg Marquardt

#

I do mean adam but i am not sure whether acslfa and lm are optimization algorithms and whether they could be used alongside lstm or anfis

weak mortar
# weak mortar hello! i have a weird problem with my DataFrame today. i have a column with d...

as usual, there was an explanation for the problem, of a human nature... i did not realize that the variable was holding multiple dataframes (because it loads csv files by searching for a string, and i assumed it had only found one file). what confused me and was unexpected behavior was that printing it initially would print somedataframe[0], and after passing it to the function it would print somedataframe[3], which has its timestamp in a non-compatible format

#

tl;dr: doing print on a variable with multiple objects without specifying which object, it will do print[0], and after passing it to any function, it will print out the [-1] object.

serene scaffold
weak mortar
#

okay, yes, a list storing multiple dataframes

serene scaffold
#

but if you print a list, it will just print the whole list, so your assessment about "it will do print[0], and after passing it to any function, it will print out the [-1] object." is incorrect

#

there must be more going on.

weak mortar
weak mortar
#

actually my brain is just completely malfunctioning and python is behaving exactly as expected. thanks for your efforts. i will pour down a liter of coffee to correct my cognition

harsh minnow
#

Hi, I am running meta-llama/Llama-2-7b-chat-hf on runpod.
The machine type is: 1 x RTX A6000, 14 vCPU 48 GB VRAM

I am running a for loop and calling the model using hugging face pipelines
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device=0, return_full_text=False, use_cache=False)

After the 5th loop the GPU Utilisation becomes 100% full and output is very slow or sometimes no output. What can I do?
For testing I ran

```py
def data():
    for i in range(1000):
        yield f"My example {i}"

for out in pipe(data()):
    print(out)
```

Still the same delay and pausing after 4th or 5th loop
arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

harsh minnow
serene scaffold
#
import torch

for d in data():
    with torch.no_grad():
        print(pipe(d))

try that.

harsh minnow
serene scaffold
harsh minnow
#

It worked, but it takes 12 seconds or so.

serene scaffold
#

to do 1000 instances, one at a time?

#

because that doesn't surprise me.

harsh minnow
#

Oh my bad, actually this is the code:
```py
import torch

def data():
    for i in range(1):
        yield f"My example {i}"

for d in data():
    with torch.no_grad():
        print(pipe(d))
```

#

For 1 instance it takes 12 seconds

serene scaffold
#

hmm, are you sure nothing else is using the GPU?

harsh minnow
#

I guess so, because once the output is completed the GPU goes back to 0% utilisation

neon field
#

someone recommend me a roadmap to study AI
my prof is real shitty
there are too many AI resources
idk what course to finish to actually study AI as a major subject 🟥 IMPORTANT 🟥

serene scaffold
#

@harsh minnow I'll try it on my machine

harsh minnow
#

Sure @serene scaffold

serene scaffold
#

@harsh minnow looks like it requires a license from Meta to use it, which I don't have, so RIP.

harsh minnow
#

oh okay, no problem.

#

Actually I guess there is some weird issue with my system, because this code still has not given an output:
```py
import torch

def data():
    for i in range(1):
        yield f"My example {i}"

for d in data():
    with torch.no_grad():
        print(pipe(d))
```

dense sluice
#

Hi! I'm in the early stages of implementing face recognition, and am assessing available tools. Python's eco seems great for this, including with Tensorflow, Pytorch, and OpenCV (Aka/formerly-known as CV2?) I'm interested in clustering with faces. Ie recognizing unique faces from a camera, and identifying when they come up again. Where would you start? Ty!

misty flint
#

are you me

#

also @serene scaffold did we still want data eng resources or nah

#

at this point im not sure since ive sorta merged MLE/DE beginner resources

#

many places merge the two roles

#

books, online courses/resources, podcasts

harsh minnow
#

I am running 2 GPUs, but when I run this code it only uses 1 GPU.

```py
from transformers import pipeline
pipe = pipeline("text-generation", model="meta-llama/Llama-2-7b-chat-hf", device=1, return_full_text=False)
```
hasty mountain
#

Hey guys, does someone know why Evolutionary Algorithms are hard to apply in a stochastic nature, like gradient descent?
Usually, I see EAs being applied to optimize the average loss in a dataset as a whole. The models are initialized and only after iterating through the whole dataset the mutations/crossovers are applied. I was thinking: why not apply the mutations before iterating through the whole dataset? Like after N batch iterations, like in SGD?

#

In reality, I'm asking this a bit too late, as I already did an entire work testing a "stochastic genetic algorithm". But it worked wonderfully as a failure, and I'm now trying to think why it didn't work

#

(Though I don't discard the possibility that the problem may lie between the screen and the chair...if you know what I mean)

past meteor
#

Look up local search in evolutionary algorithms

hasty mountain
glass mason
#

How to make ai

#

Can someone help me

hasty mountain
#

I tried an idea of mixing SGD and EA to take a model that is already on a learning plateau (a local minimum, or more like a saddle point) and move it away from that minimum, and thus help SGD improve it even more.

past meteor
#

SGD, or local search in general, will move you toward a local minimum

#

I don't have the time right now to give elaborate and structured motivation / responses. You can DM if you want and I'll likely answer tomorrow or so 🙂

echo mesa
#

Hello guys, this might be a stupid question (I'm a beginner in this field), but I've always wondered about the mathematical background of, for example, neural networks, or how a model gets trained and learns. Is there any kind of resource that deals with the mathematical background of these machine learning concepts? Thanks 🙂

#

I know that the mathematics needed for AI is mainly calculus, linear algebra, and probability. But what I would like is a book or any type of resource that describes, or has examples of, the mathematical background of these concepts in AI and machine learning.

#

Ohh I think I found what I was looking for, there is a book "mathematics for machine learning" from the Cambridge University . I think that would be a good place to start.

odd meteor
plucky ivy
#

Hi, I can't try jupyter notebook because of some python kernel timeout issue. Could you skip to near the end of this video and tell me how to rectify it.

glass mason
#

Bro how can I make / command in my bot and I have to make bot in JavaScript or something else

plucky ivy
left tartan
# plucky ivy Anyone?

You posted a 20 min video. I don’t have that kind of attention span. What issue?

#

I assume you don’t have Python installed, for starters

#

Try running something in vscode outside a notebook

small wedge
# echo mesa Hello guys, this might be a stupid question(I'm beginner in this field), but I a...

https://arxiv.org/pdf/1802.01528.pdf is a paper that steps through all the math and assumes you only know calc 1 and some linear algebra.

https://developers.google.com/machine-learning/crash-course google has a machine learning crash course that begins by covering the math, but also focuses on using modern libs to create your own models

https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi 3b1b has a series of videos going over the math and getting progressively more complicated

#

andrew ng also has free courses online that go in depth on the math

deep tendon
#

Guys here is a code that you can use to process a pcap and extract individual bytes to columns of a parquet! it implements the pre processing and labeling required to develop feature-per-byte based NIDS as the one described in DeepPackGen https://arxiv.org/pdf/2305.11039.pdf feedback very welcome! https://github.com/Master-Sorcerer/BytesProcessor


dusk tide
#

I have a question. In Kaggle Getting Started competitions, we are provided with the train and test sets separately. Is it okay to merge them to make preprocessing easier? According to this Analytics Vidhya blog post (https://www.analyticsvidhya.com/blog/2021/07/data-leakage-and-its-effect-on-the-performance-of-an-ml-model/) we should not do this, because it can lead to data leakage. Can anyone confirm?
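
For illustration, the usual leakage-free pattern is to fit any preprocessing statistics on the training set only and then reuse them on the test set. A minimal sketch with made-up data and plain standardisation (the arrays and the choice of scaler are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(10.0, 2.0, size=(100, 3))  # hypothetical training features
X_test = rng.normal(12.0, 2.0, size=(20, 3))    # hypothetical test features

# Statistics come from the training set only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)

# The same training statistics are applied to both sets
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std
```

Computing `mean` and `std` on the concatenated train and test data instead would let information about the test set influence training, which is the leakage the blog post warns about.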


plucky ivy
plucky ivy
plush pagoda
#

• male_female_X_train.txt
• male_female_X_test.txt
• male_female_y_train.txt
• male_female_y_test.txt
...
(c) Bayes classifier with non-parametric distribution (20 points)
Compute and print the prior probabilities for male and female.
Compute class likelihoods, p(height|male), p(weight|male), p(height|female) and p(weight|female) for
all test samples. This can be done by using the bin min/max values returned by NumPy histogram()
function. You can calculate the centroid of each bin and assign each test sample to the closest bin.
After knowing the bin index, the likelihood can be computed using the count vector provided by the
same histogram() function.
Classify all test samples and compute the classification accuracy. Print accuracies for height only,
weight only, and weight and height together (multiply likelihoods).

My attempted solution went something like this (I tried to follow the instructions as well as I could):

  1. Calculate the prior probabilities from the training data
    • p(m) = # males / (# males + # females)
    • p(f) = 1 - p(m)
  2. Calculate likelihoods
    • Use the training data to define the counts and bins for 4 histograms : female height, male height, female weight, male weight
    • Calculate the centroids for all 4 histograms
    • Using those I can calculate the likelihoods (or so I thought)
  3. Classify
    • Classify based on the maximum likelihood * prior probability
#

Here is my code (sorry):

# Male and female height and weight measurements
X_train = np.loadtxt("male_female_X_train.txt") 
X_test = np.loadtxt("male_female_X_test.txt") 
y_train = np.loadtxt("male_female_y_train.txt")
y_test = np.loadtxt("male_female_y_test.txt")

total_samples_train = len(y_train)

# Compute and print the prior probabilities for male and female
p_m = np.mean(y_train == 0) # p(male); y=0 is male, matching the masks below
p_f = 1 - p_m
print(f"p(female) = {p_f}")
print(f"p(male) = {p_m}")

# Compute class likelihoods

# y=0 -> male and y=1 -> female
# Men's heights
mheights_train = X_train[y_train == 0, 0]
# Men's weights
mweights_train = X_train[y_train == 0, 1]
# Women's heights
fheights_train = X_train[y_train == 1, 0]
# Women's weights
fweights_train = X_train[y_train == 1, 1]


# Histograms 
# The last bin edge from np.histogram(..) is "closed".
counts_mh, bins_mh = np.histogram(mheights_train, bins=10)
counts_mw, bins_mw = np.histogram(mweights_train, bins=10)
counts_fh, bins_fh = np.histogram(fheights_train, bins=10)
counts_fw, bins_fw = np.histogram(fweights_train, bins=10)

# Bin centroids for histograms
get_bin_centr = lambda bins: [b + ((max(bins) - min(bins)) / 20) for b in bins][:10]

bin_centr_mh = get_bin_centr(bins_mh)
bin_centr_mw = get_bin_centr(bins_mw)
bin_centr_fh = get_bin_centr(bins_fh)
bin_centr_fw = get_bin_centr(bins_fw)

# For the sake of brevity and everyone's sanity, I'll omit everything except predictions based on height:
classify = lambda lf, lm: int(lf * p_f > lm * p_m)

y_pred_h = [classify(counts_fh[get_bin_idx(bin_centr_fh, x)] / counts_fh.sum(), 
                     counts_mh[get_bin_idx(bin_centr_mh, x)] / counts_mh.sum()) 
            for x in X_test[:,0]]
#

And here is the feedback I got:
The index of the bin should be taken considering the whole range of available weight/height values but not for male and female separately

But if I don't have the histograms separated into m/f, I don't understand how to classify then. If I only have one histogram for height and I get the bin index from that combined distribution for male and female heights, how can I determine the class?
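
One reading of that feedback, sketched with synthetic data: the bin edges are computed once from all training heights, but the counts are still kept per class on those shared edges. A test sample then maps to a single bin index, and that index is looked up in both count vectors. The data, the 10 bins, and the helper names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
heights_m = rng.normal(180, 7, 500)  # synthetic male heights (y = 0)
heights_f = rng.normal(166, 6, 500)  # synthetic female heights (y = 1)
all_heights = np.concatenate([heights_m, heights_f])

# Shared bin edges over the whole range of available values
edges = np.histogram_bin_edges(all_heights, bins=10)
centroids = (edges[:-1] + edges[1:]) / 2

# Per-class counts on the same edges
counts_m, _ = np.histogram(heights_m, bins=edges)
counts_f, _ = np.histogram(heights_f, bins=edges)

p_m = len(heights_m) / len(all_heights)
p_f = 1 - p_m


def bin_index(x):
    """Index of the bin whose centroid is closest to x."""
    return int(np.argmin(np.abs(centroids - x)))


def classify(x):
    """Return 0 (male) or 1 (female) from likelihood * prior."""
    i = bin_index(x)
    likelihood_m = counts_m[i] / counts_m.sum()
    likelihood_f = counts_f[i] / counts_f.sum()
    return int(likelihood_f * p_f > likelihood_m * p_m)


print(classify(190.0), classify(155.0))
```

With separate edges per class, the same height can land in bin 3 of one histogram and bin 7 of the other, so the two likelihoods describe intervals of different position and width and are not directly comparable.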

reef spade
#

hello, i want to create an AI avatar using python. What should i learn to do this?

oblique quarry
#

I was wondering does it make even sense to use LASSO-Regression on simple linear regression tasks where you have only one independent variable / feature that contributes to the target variable, given that you don't target the the b0-term you can only target the one feature that contributes to the output, which doesnt sound like something I'd normally do

serene scaffold
reef spade
past meteor
#

@hasty mountain I have time to get back to you now. Your algorithm was basically this:

  1. Initialize population
  2. Mutate
  3. Select top model
  4. SGD on top model
  5. Return to 2 until convergence or max steps

Right?

#

Step 4 is what people call "local search" in genetic algorithm literature, it's a valid thing to do. It does have a trade-off:

  1. Local search decreases the time to convergence.
  2. Local search can have you converge faster into a local minimum.

What I saw empirically is that LS really decimates diversity in your population and sends you towards a converged population really quickly. To offset this you need to increase mutation rates to compensate etc.

What you can also do is run LS on the top and bottom N % of individuals

#

In reality I'd never run an EA on a neural network though for various reasons. The flavours of SGD that we use are usually "good enough", I'm not sure we need a global optimiser 🙂

serene scaffold
#

no, but "greedy reasoners" is a fun way of putting it

rapid oriole
#

Hey guys, my neural network with MLPRegressor always outputs the mean of the Y variable. Can anyone help me? I will provide more details

past meteor
ashen crypt
#

Do we have here a OCR master? ;p with tesseract?

serene scaffold
ashen crypt
#

Hm... i want to OCR my manga chapter, but tesseract very often only recognises about 50% of the page correctly.

#

i did it with my colleague, and sometimes where the image shows the letters "G and G", it gets written as "G and 6"

#

i'm curious where i should focus to resolve this issue.

left tartan
harsh minnow
#

Hey guys, I am trying to build a news article generator app that is trained on the news in the US, UK, and Canada. It has to be very accurate. Now I first ask some simple questions to get some data and generate an accurate news article about a specific thing in a specific country.

Now I want to train the AI on new article knowledge, but when I fine-tune a model, it outputs nonsense. I tried fine-tuning GPT 3.5, but it's returning inaccurate data (GPT 4 performs much better). Also, GPT-3 (4096) is not enough to generate a long news article, so I am making a plan and calling GPT-3 around 7–10 times (it's context-aware).

I want the model to be smart enough to make decisions about the country and get information about it from the user. I want the model to output in a different format too, but the articles I train on are not in the format I want. Each user's use case is different, and I want the model to be smart enough to analyse and generate the article.

What is the best way to go about this? How can I achieve a smart AI model that knows each country's information and is accurate?

ashen crypt
serene scaffold
left tartan
harsh minnow
ashen crypt
#

@left tartan Do you have any Youtube Video to show how it work?

serene scaffold
boreal gale
harsh minnow
ashen crypt
#

i'm preparing a small project, so here's what i need: catch the words / letters from the manga together with their pixels, export them to a .txt file, and then:
1) blank out the recognised text (fill it with white)
2) replace the contents of the .txt file with my translation
3) put my words onto the image

#

that is the main logic what i want to do.

#

so if u know better option i want to use it.

boreal gale
ashen crypt
#

im beginning in my python world. How to use it?

boreal gale
#

to me tesseract is great for heavily structured text, but for text that's less structured like in manga, it's unlikely to work out of the box, just my 2 cent.

ashen crypt
#

Ok could we connect tomorrow? for some "training" how to use it?

boreal gale
#

sorry i don't have bandwidth to help on that level, let me do some googling, there must be tutorial out there

boreal gale
#

download EAST model from 1
download CRNN_VGG_BiLSTM_CTC model from 2

i think the dependencies you need are just

opencv-python
numpy

opencv-python is cv2 - don't ask me why, i always get confused too.

that's all the help i can give at the moment, i really need to get back to work now.

hasty mountain
# past meteor Step 4 is what people call "local search" in genetic algorithm literature, it's ...

Hey, thanks! Well, the method is a bit more chaotic than that, as the model is already on a local minimum/saddle point, and the population initialized are copies of this same model(thus, already on local minimum). The EA is like...something to force the model to get even better, and it's applied to one single model (the Vanilla) to prevent a madness of memory consumption (I can run this algorithm in my personal computer using 20 models as population, for example).

The SGD is quite interesting, indeed, but I only get a bit upset with the fact that it demands so much patience. I even read a blog post some time ago that the author said there was a saying in Deep Learning that was something like: "Let the model run and take a summer off", to let gradients work by themselves.

past meteor
#

Especially at initialisation

hasty mountain
#

Yes, that's the thing. The mutation happens at each batch sampling, and I tend to use a mutation chance of 60%, more or less

past meteor
#

Yes but this is a bit like random search then, you have little benefits of the EA framework especially since you're doing no recombination

hasty mountain
#

Yes, it's quite chaotic and random.
I see... So I shouldn't have discarded crossing-over then yert

#

It's funny, though. When I ran some experiments in my poor personal GTX 1650 with a batch size of 4, the method failed miserably.
When I ran it on Kaggle's industrial Tesla P100 with a batch size of 512, the method did provide a result...helping the model get stuck in a new minimum that is even harder to escape (not sure if it's local or global)

past meteor
hasty mountain
#

Oh no
So I basically have a Reinforcement Learning strategy with a low epsilon yert

past meteor
#

I wouldn't call it RL either, it's random search with a splash of local search (when you do SGD on the best individual)

hasty mountain
#

It's a crazy thing that I managed to make with my mind

past meteor
hasty mountain
past meteor
#

How large is your population?

hasty mountain
#
  1. I've used 20 in both devices.
#

Always using a mutation chance of 60%

past meteor
hasty mountain
#

In the beginning, I used a decaying factor for this mutation chance, as the cumulative mutations tended to degrade the models performance. But then I removed it.

hasty mountain
# past meteor So your algo is effectively: 1. SGD till convergence 2. Copy 20x 3. Mutate with...
  1. SGD till convergence (Pre-training) ---> This is the Vanilla Model
  2. Initialize Population by copying the Vanilla model 20x.
  3. Samples a batch from data and SGD --> Optimize.
  4. Repeat Iteration and get loss got after optimization --> SGD Loss
  5. Mutate population with rate of 60%
  6. Iterate through each individual in population --> Gets individual losses.
  7. Selects best loss ---> If SGD Loss, proceed to 3. If one individual Loss surpasses(is lower than) SGD Loss, that individual replaces the Vanilla model, proceed to 3.
past meteor
hasty mountain
#

Yes, the learning rate is fixed. The SGD optimization is only applied to the Vanilla Model, even when an individual from the population replaces a Vanilla Model(becoming the new Vanilla)

#

Wait a second...the adam optimizer... It would be able to accompany those changes, right?

past meteor
#

So, let's assume this is an EA, which it's not really: EAs are a waste of compute. Your compute is better spent on something else like bayes opt.

EAs shine where you truly need a global optimisation method for a non-convex problem that is tractable (note that EAs give no guarantee of finding the global optimum), or where there is a "large enough" gap between your heuristic and the global solution (you don't know this before running the EA).

hasty mountain
#

I see... Sad, I like the idea of the EAs.
Looks like I'll have to review my work, then.

#

Thanks!

past meteor
hasty mountain
#

I've seen there were some ideas of using them in Reinforcement Learning...and to also select hyperparameters for neural networks.

past meteor
#

Yes but EAs are very very sensitive to hyperparameters themselves

hasty mountain
#

I was thinking about trying something chaotic as mutations in some of my projects in Reinforcement Learning... gaming bias in RL is truly annoying...

past meteor
#

You've shifted the problem from setting hyperparams of your single NN to setting hyperparams for your EA which involves training 100+ neural networks

#

Hence why it's a total waste of compute compared to bayes opt

hasty mountain
past meteor
#

They're really simple to try out. Have you heard of the knapsack problem?

verbal oar
#

I have (2,) (8,)
but instead of (8,) I want (1,8), and I can't transpose with .T

#

first is x second is weights

hasty mountain
#

But from what I'm reading now...the ELBo loss used in the Variational AutoEncoder could be something like that?

verbal oar
#
```py
    def __init__(self,in_nodes,out_nodes):

        self.weights = Tensor((in_nodes * out_nodes))
        self.bias    = Tensor((1,out_nodes))
        self.type = 'linear'

        print('w shape', self.weights.data)
        print('b shape', self.bias.data.shape)

    def forward(self,x):

        print('x', x)
        output = np.dot(x,self.weights.data)+self.bias.data
```
past meteor
#

At least, that's what it's good at compared to say grid search which you can run in parallel

#

It matters because you want to waste as little compute as possible when training NNs. Every evaluation should matter, because it takes too much time to be trying random things (which EAs do)

small wedge
verbal oar
#
```py
class Tensor():
    def __init__(self,shape):
        self.data = np.ndarray(shape,np.float32)
        self.grad = np.ndarray(shape,np.float32)
```
past meteor
#

If you want to learn about them you should decouple EAs from NNs, just make a small project solving a classical problem like travelling salesman or knapsack. Generate fake data using Numpy and code out the whole thing from 0, just depend on Numpy. It's not a big project, it's <100 LoC. Then you'll "get" the pros and cons immediately 🙂
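A minimal sketch of the kind of toy project suggested above: a small evolutionary algorithm for the 0/1 knapsack problem that depends only on NumPy. The problem instance is random fake data, and every name and setting here (`knapsack_ea`, `pop_size`, `mutation_rate`, tournament selection, uniform crossover) is an assumption chosen for illustration, not a recommended configuration. It assumes all item values are positive, so a fitness of 0 means "empty or overweight".

```python
import numpy as np

def knapsack_ea(values, weights, capacity, pop_size=60, generations=200,
                mutation_rate=0.02, seed=0):
    """Tiny evolutionary algorithm for the 0/1 knapsack problem (sketch)."""
    rng = np.random.default_rng(seed)
    values, weights = np.asarray(values), np.asarray(weights)
    n_items = len(values)

    def fitness(pop):
        # Total value per individual; overweight individuals score 0.
        return np.where(pop @ weights <= capacity, pop @ values, 0)

    # Sparse random bitstrings, so that many starting individuals are feasible.
    pop = (rng.random((pop_size, n_items)) < 0.1).astype(int)
    best, best_fit = np.zeros(n_items, dtype=int), 0   # the empty knapsack

    for gen in range(generations + 1):
        fit = fitness(pop)
        if fit.max() > best_fit:
            best, best_fit = pop[fit.argmax()].copy(), int(fit.max())
        if gen == generations:
            break  # the final population has been scored; stop here

        # Tournament selection: each parent is the fitter of two random picks.
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((fit[a] >= fit[b])[:, None], pop[a], pop[b])

        # Uniform crossover with a randomly chosen partner.
        partners = parents[rng.permutation(pop_size)]
        take_parent = rng.random((pop_size, n_items)) < 0.5
        children = np.where(take_parent, parents, partners)

        # Bit-flip mutation.
        flip = rng.random((pop_size, n_items)) < mutation_rate
        pop = np.where(flip, 1 - children, children)

    return best, best_fit


# Fake problem instance generated with NumPy.
rng = np.random.default_rng(42)
values = rng.integers(1, 100, size=30)
weights = rng.integers(1, 50, size=30)
capacity = 200

best, best_value = knapsack_ea(values, weights, capacity)
print("value:", best_value, "weight:", int(best @ weights), "capacity:", capacity)
```

Changing `pop_size`, `mutation_rate`, or the seed and re-running is a quick way to see how sensitive the result is to the EA's own hyperparameters.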

verbal oar
#

hmm wait its float not int

#

my input is int

small wedge
#

hm you could do smth like my_arr.resize((1, *my_arr.shape))

hasty mountain
verbal oar
#

ok

small wedge
#

but you might just wanna use (8,1)/(1,8) arrays in the first place so you can use transpose
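A minimal NumPy sketch of the shape issue above. A 1-D array has no second axis, so `.T` returns it unchanged; `reshape` or `np.newaxis` adds the missing axis and avoids the restrictions that in-place `resize` has on arrays that do not own their data. The layer sizes below (`in_nodes = 2`, `out_nodes = 4`) are assumptions: they would explain the `(8,)` weight shape, because `in_nodes * out_nodes` is a single integer rather than a 2-D shape tuple.

```python
import numpy as np

w = np.zeros(8, dtype=np.float32)      # 1-D array, shape (8,)
assert w.T.shape == (8,)               # .T is a no-op on 1-D arrays

row = w.reshape(1, -1)                 # shape (1, 8)
col = w[:, np.newaxis]                 # shape (8, 1)
assert row.T.shape == (8, 1)           # with two axes, transpose does something

# For a linear layer, a 2-D weight matrix lets np.dot line up the shapes.
in_nodes, out_nodes = 2, 4             # assumed sizes, for illustration only
weights = np.zeros((in_nodes, out_nodes), dtype=np.float32)
bias = np.zeros((1, out_nodes), dtype=np.float32)
x = np.array([[0, 1]], dtype=np.float32)    # one sample, shape (1, 2)
output = np.dot(x, weights) + bias          # shape (1, 4)
print(row.shape, col.shape, output.shape)
```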

delicate apex
#

fun fun fun:
FutureWarning: Styler.applymap has been deprecated. Use Styler.map instead
(got this three times per usage, actually) okay, fine. i'll switch them to .map instead.
except now pylance complains of Cannot access member "map" for type "Styler" Member "map" is unknown !
firRly

desert oar
unique quail
desert oar
#

the one with the gaps is interesting. it sounds like your intention is to split the data into 2 clusters by finding the biggest gap between two data points? what about data like this? [1, 2, 11, 12, 21, 22]. presumably you'd need to extend this to handle more than 2 clusters.
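A minimal sketch of the gap idea extended to more than two clusters: sort the values, find the `n_clusters - 1` largest gaps between neighbours, and cut there. The function name `split_by_gaps` and its parameters are hypothetical, and ties between equally large gaps are broken arbitrarily (here: by position), which is exactly the weakness the example data exposes when only two clusters are requested.

```python
import numpy as np

def split_by_gaps(values, n_clusters=2):
    """Split 1-D data into clusters by cutting at the largest gaps."""
    data = np.sort(np.asarray(values))
    if n_clusters < 2 or data.size < 2:
        return [data.tolist()]
    gaps = np.diff(data)
    # Positions of the (n_clusters - 1) largest gaps, back in left-to-right order.
    largest = np.sort(np.argsort(gaps, kind="stable")[-(n_clusters - 1):])
    return [chunk.tolist() for chunk in np.split(data, largest + 1)]

data = [1, 2, 11, 12, 21, 22]
print(split_by_gaps(data, n_clusters=3))   # [[1, 2], [11, 12], [21, 22]]
print(split_by_gaps(data, n_clusters=2))   # two equal gaps: the cut point is a tie-break
```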

shut girder
#

Hello, does anyone know a beginner friendly data analysis project that requires minimum knowledge of statistics? I want to become more familiar with the 3 popular Python libraries and practice the responsibilities of a data analyst by working on projects.

quartz karma
#

Hi, is anyone familiar with seaborn? I am wondering if I can change a legend label without having to provide any other related settings like bbox_to_anchor, loc, etc. Thanks.

desert oar
# shut girder Hello, does anyone know a beginner friendly data analysis project that requires ...

https://www.kaggle.com/competitions/titanic
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques
you can follow the prompt and try to fit a predictive model, but these are really good datasets just for getting comfortable with cleaning and exploring data, and forming & testing hypotheses

shut girder
#

Thanks

hasty mountain
desert oar
#

hey, props for creativity

past meteor
#

It's even too expensive for architecture and hyperparameter search. It's a blackbox optimization method, it's suitable for rich research groups to search over architectures and then tell us plebs what they found 🙂

median monolith
#

I dunno where to post for help sorry :/ Im a python n00b.
I'm installing fenics through WSL (Ubuntu) and this is taking forever (days). There seems to be progress, i.e. the last report line changes every couple of hours or so, but I am worried this might never finish. Should I be worried? Is this normal? How to fix? halp

gloomy parrot
#

Hello everyone, does anyone know how can i use pytorch in aws lambda? Currently i cant because of the torch size and the limit in lambda is 50mb only

verbal oar
#

how can I implement pred_proba ?

#

I have
```py
def predict(self,data):
    X = data
    for f in self.computation_graph:
        X = f.forward(X)
    return X
```

#

on forward

```py
unnormalized_proba = np.exp(x-np.max(x,axis=1,keepdims=True))
self.proba         = unnormalized_proba/np.sum(unnormalized_proba,axis=1,keepdims=True)
self.target        = target

print(self.proba)
```
#

looks like I hardly have

#

hmm in predict there is forward()

#

ah so I have it

desert oar
past meteor
#

Just running N SGD's from scratch may be better

#

But yeah, I've done a fair share of EA's, they're my go to toy problem when trying new languages. I can't stress enough how sensitive they are to their own hyperparameters. You need to tune the thing that tunes your model 🤔

verbal oar
#
```py
test = label_encoder(['beer','milk'])
predicted_labels = np.argmax(model.predict(test),axis=1)
accuracy         = np.sum(predicted_labels==y)/len(y)
print("Model Accuracy = {}".format(accuracy))

print('predicted ', model.predict(test))
print(np.where(predicted_labels == 0, 'chips', 'cereals'))
```
#

so

#

how can I convert it to display array of ['chips', 'cereals'] instead of just ['chips']?

#

because at first I want all predictions then I want to think about only one prediction as is the case when deploying model and making one prediction

#
```py
batch_size        = 2
num_epochs        = 10
samples_per_class = 100
num_classes       = 2
hidden_units      = 4

x = ['beer','milk']
y = ['chips','cereals']

def label_encoder(list):
    l = np.arange(len(list))
    return l

x = label_encoder(x).reshape(1,-1)
y = label_encoder(y)

model             = utilities.Model()
model.add(DL.Linear(2,hidden_units))
model.add(DL.ReLU())
model.add(DL.Linear(hidden_units,num_classes))
optim   = DL.SGD(model.parameters,lr=1.0,weight_decay=0.001,momentum=.9)
loss_fn = DL.SoftmaxWithLoss()
model.fit(x,y,batch_size,num_epochs,optim,loss_fn)
```
#

this is first part

#

I converted labels to ints

#

trained and try to predict
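A minimal sketch of turning raw model outputs into probabilities and then into label names, using only NumPy. The scores in `logits` are made-up numbers and `class_names` is a hypothetical lookup array; nothing here calls the custom `DL` or `utilities` modules. Indexing the name array with the predicted class indices works for any number of classes and returns one label per input row, so a model output with a single row (for example when both inputs were reshaped into one sample of shape `(1, 2)`) yields a single label.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = np.exp(logits - np.max(logits, axis=1, keepdims=True))
    return shifted / np.sum(shifted, axis=1, keepdims=True)

class_names = np.array(["chips", "cereals"])

# Made-up raw scores: one row per sample, one column per class.
logits = np.array([[2.0, 0.5],
                   [0.1, 1.7]])

proba = softmax(logits)                   # what a predict_proba method could return
predicted = np.argmax(proba, axis=1)      # one class index per row
labels = class_names[predicted]           # index array -> array of label names
print(proba)
print(labels)
```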

shy carbon
#

I'm trying to install pytorch3D for windows and I'm struggling. Are there no prebuilt packages for this? Do I really have to build from source?

serene scaffold
shy carbon
#

there's conda instruction for linux and macOS

#

but the only place that talks about windows is the Building / installing from source section

#

and when i try to install from the source, I still have issues I don't understand

#

I am able to import torch and torchvision in a python repl, but when I try to install pytorch3D I have "no module named torch"

wheat fox
#

I tried building an AI but I was only able to generate random weird sentences that are grammatically correct but sound nonsensical. How do I go further?

serene scaffold
desert oar
#

(there are also many things i do not know!)

past meteor
#

Nope, because hyperparameter search, even for simpler stuff, is again a non-convex problem 😩. Hence why I mentioned bayes opt: it's a global optimizer like EAs but it's more data efficient

desert oar
#

right, i've used bayes opt before but never EA

past meteor
#

Code an EA up over the weekend 😄 it sounds like more work than it is

#

<100 LoC

desert oar
#

i wasn't sure where you were headed with the point about SGD though, unless that was just following up from the original idea

#

yeah it doesn't seem hard, but as you said it has its own parameters that need tuning and i never saw it as something i could use effectively

#

would be a good programming exercise though, i should do it

#

maybe a good excuse to practice julia

past meteor
desert oar
#

i see, and agreed

past meteor
#

Btw there's many things forgotten in the literature 😄

SVMs are so good because they only have 1 or 2 hyper parameters and the RBF SVM is a universal approximator. They're also somewhat interpretable. They just suck at scale but are a great go-to if you're under 30k data points.

quartz karma
past meteor
quartz karma
past meteor
#

All of them have the same amount. You can always add a layer and you can always add a neuron.

desert oar
#

specifically for text classification with bag of words, i've found that plain ridge regression beats SVM

#

also there was some blog series i found years ago that went into great detail on why plain linear models were superior to RBF SVMs for text classification specifically

#

something about how the RBF kernel acts like a low-pass filter, i'll have to find it

past meteor
#

It's a bit of an apples-and-oranges comparison, because a linear SVM could be used as well

desert oar
#

yeah but i think at that point actual computation performance tends to be much better with e.g. liblinear

#

or just l-bfgs which is what we ended up using in that particular project (because we had some customizations)

past meteor
#

hmmm, it depends.

#

You can solve linear SVMs in the primal (the number of unknowns is the number of variables) or the dual (the number of unknowns is the number of data points)

#

Note: you can solve ridge in the dual as well but nobody does this 🙂

desert oar
#

i've done that too, worked well enough. i'm not sure if it's exactly the same as what people call "transfer learning" but i think conceptually it's close at least

past meteor
#

BoW is something where you typically have num_data < num_unique_words, so if your ridge regression implementation does not allow solving in the dual, going to an SVM, which solves in the dual by default, will be much faster.
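A minimal NumPy sketch of the primal/dual point for ridge regression, on random fake data with fewer samples than features. The primal form solves a system whose size is the number of features, the dual form solves one whose size is the number of data points, and both give the same weights. The penalty `lam` is an arbitrary choice and there is no intercept term, both assumptions made to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 20, 200          # fewer data points than features, as in BoW
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)
lam = 1.0                                # ridge penalty (arbitrary choice)

# Primal: solve an (n_features x n_features) system for the weights directly.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Dual: solve an (n_samples x n_samples) system, then map back to weights.
alpha = np.linalg.solve(X @ X.T + lam * np.eye(n_samples), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))
```

With 20 samples and 200 features, the dual system is 20 by 20 while the primal one is 200 by 200, which is where the speed difference comes from.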

desert oar
#

of course there's also fine-tuning

desert oar
#

i still dont have a good sense of the performance characteristics of the two algorithms at those kinds of extremes, i was just trying stuff and went with what worked 😛

past meteor
desert oar
#

thanks, i sometimes need to remind myself it's ok to experiment and not just know everything!

past meteor
#

Yeah, the only core skill is knowing where to find information and knowing how to evaluate models properly. From then onwards it's empirics (unless you're doing fundamental research ofc)

quartz karma
desert oar
#

or you could go even simpler than that, using tfidf and/or dimension reduction like pca

#

it's less optimal in the sense that the embeddings are not "learned" using the actual objective function of interest

#

but for computational and practical reasons it might be better, eg. if you don't have enough data to get a good estimate of the large number of parameters involved

rugged comet
#

@past meteor Having some more discussion on taking the absolute value of the residuals. I got the impression that the residuals were just the actual values minus the predicted values, not the absolute value of that result. I'm getting some conflicting information.

#

My question is: When, if ever do you think we take the abs of the residuals and why?

past meteor
past meteor
rugged comet
#

To be clear, when defining residuals, do we take the abs or not?

warm steppe
#

anyone else try running "conda list" on the new Python in Excel release?

rugged comet
# past meteor ... no

Under what circumstances would you find it to be appropriate to take the abs of the residuals?

#

For example, doing the Shapiro-Wilk test.

past meteor
#

And then I investigate what went wrong

#

Been too long since I did the shapiro-wilk test

#

I can't comment on that specifically

rugged comet
#

Can anyone else here comment on taking the abs of the residuals before performing the Shapiro-Wilk test?

#

@past meteor The reason it seems I'm so obsessed with taking the abs of the residuals is because my instructor had a code snippet that did it.

```py
# Calculate some characteristics of the residuals
residuals = np.abs(x.values - predictions.values)
```

Now everyone in the class seems to think that in order to calculate the residuals, you have to take the abs.

rugged comet
#

And can you say why?

past meteor
#

Because it just is, residuals are just not defined that way

past meteor
#

The reason why not taking the abs is relevant is that you want to differentiate between over and undershooting ofc

desert oar
#

@rugged comet i second what zestar said: it's just not what residuals are, and it's neither required nor desirable for the Shapiro-Wilk test
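A minimal NumPy sketch of why the sign matters, on fake data with Gaussian prediction error. Signed residuals keep the over/undershoot direction and stay roughly symmetric around zero, while their absolute values are all non-negative and clearly skewed, so a normality check on the absolute values would be testing a different, folded distribution. The data, the `skewness` helper, and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
actual = rng.normal(loc=10.0, scale=2.0, size=1000)
predicted = actual + rng.normal(scale=1.0, size=1000)   # fake model with Gaussian error

residuals = actual - predicted          # signed: negative means the model overshot
abs_residuals = np.abs(residuals)       # the direction of the error is lost

def skewness(values):
    """Sample skewness: close to 0 for symmetric data such as a normal distribution."""
    centered = values - values.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

print("signed skewness:  ", skewness(residuals))
print("absolute skewness:", skewness(abs_residuals))
```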

serene scaffold
#

always ask your actual question right away. don't ask to ask.

#

you're still asking to ask. Assume that someone said they will help you--what would they need to know to start helping?

ancient fossil
#

If I define a nested model like this:

```py
class Siamese(nn.Module):
    def __init__(self, model):
        super(Siamese, self).__init__()
        self.model = model

    def forward(self, x1, x2):
        output1 = self.model(x1)
        output2 = self.model(x2)
        return output1, output2
...
model = Siamese(BaseModel(args)).to(device)
```

Will it fail to save the complete state dict?

```py
torch.save(model.state_dict(), 'ckpt.pt')
...
model.load_state_dict(torch.load('ckpt.pt'))
```

When I load the model and resume training it appears to start from scratch

serene scaffold
#

Also is BaseModel a subclass of nn.Module?

ancient fossil
ancient fossil
serene scaffold
ancient fossil
serene scaffold
ancient fossil
ancient fossil
ancient fossil
serene scaffold
#

I'll see if I can look into it tomorrow

ancient fossil
lapis sequoia
#

hi

past meteor
#

The easiest way to implement a Siamese network is just to use the same neural net and feed it two inputs, calculate the loss and backprop. You don't need a special NN for that, just do it in your training loop. It's confusing I know, I had the same idea before I made a Siamese net myself for the first time.

odd relic
#

ok Ive been getting this painful error for a very long time

#

so im in essence attempting to create a real-time training situation

#

code:

```py
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras import Model
from tensorflow import keras
import time
import numpy as np

epoch = 300
n = 1000
duration = 5
learning_rate = 1e-4
optimizer = keras.optimizers.SGD(learning_rate=learning_rate)

def lossfunction(distX, crash, speed, distY):

    term1 = tf.add(1.0, tf.add(tf.multiply(5.0, crash), tf.multiply(5.0, distY)))
    term2 = tf.add(distX, speed)
    loss = tf.subtract(tf.multiply(n, term1), tf.multiply(2.0, term2))

    return loss

def returnData(input_data):

    return None

def getData():
    data = np.random.rand(5)
    print(data)
    return tf.convert_to_tensor(data.reshape(1, -1), dtype=tf.float32)

def getSpeed():

    return float(input("speed"))

def create_model():
    input_tensor = Input(shape=(5,))
    x = Dense(256, activation='relu', trainable=True)(input_tensor)
    x = Dense(512, activation='relu', trainable=True)(x)
    x = Dense(512, activation='relu', trainable=True)(x)
    x = Dense(256, activation='relu', trainable=True)(x)
    x = Dense(2, activation='sigmoid', trainable=True)(x)
    model = Model(inputs=input_tensor, outputs=x)
    model.summary()
    return model

model = create_model()

for i in range(epoch):
    print("\nStart of epoch %d" % (i,))
    start_time = time.time()

    while time.time() - start_time < duration:
        distX = float(input("distx "))
        crash = int(input("crash "))
        speed = getSpeed()
        distY = float(input("distY "))

        with tf.GradientTape() as tape:
            input_data = getData()
            predictions = model(input_data, training=True)
            loss = lossfunction(distX, crash, speed, distY)
            print("TrainingLoss: " + str(loss))

            returnData(predictions)

        gradients = tape.gradient(loss, model.trainable_weights)
        print("Gradients: ", gradients)
        optimizer.apply_gradients(zip(gradients, model.trainable_weights))
```
#

error:

TrainingLoss: tf.Tensor(30994.0, shape=(), dtype=float32)
Gradients:  [None, None, None, None, None, None, None, None, None, None]
Traceback (most recent call last):
  File "robotAI.py", line 85, in <module>
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))
  File "C:\Users\sidha\anaconda3\envs\OneWhereUcanuseGPU\lib\site-packages\keras\optimizers\optimizer_v2\optimizer_v2.py", line 689, in apply_gradients
    grads_and_vars = optimizer_utils.filter_empty_gradients(grads_and_vars)
  File "C:\Users\sidha\anaconda3\envs\OneWhereUcanuseGPU\lib\site-packages\keras\optimizers\optimizer_v2\utils.py", line 77, in filter_empty_gradients
    raise ValueError(
ValueError: No gradients provided for any variable: (['dense/kernel:0', 'dense/bias:0', 'dense_1/kernel:0', 'dense_1/bias:0', 'dense_2/kernel:0', 'dense_2/bias:0', 'dense_3/kernel:0', 'dense_3/bias:0', 'dense_4/kernel:0', 'dense_4/bias:0'],). Provided grads_and_vars is ((None, <tf.Variable 'dense/kernel:0' shape=(5, 256) dtype=float32, numpy=
odd meteor
#

Yes I have done that a couple of times with Lime & SHAP. But I still don’t know what exactly you need help with

wheat fox
#

When do you guys think OpenAI will achieve AGI? It doesn't seem like DeepMind can overtake OpenAI

odd meteor
wheat fox
# odd meteor I don't believe anyone should build AGI

AGI would be able to figure out everything that is possible and profitable to figure out. We can advance so much in physics and medicine. AGI wouldn't be exhausted, would train itself to spot and ignore incorrect notions and use data faster than our brains.

odd meteor
#

The dangers of AGI outweigh its proposed advantages. People who do ML research in ethics are vehemently against it, even

past meteor
#

I think the idea is that there's similarity between tasks you can exploit to make the entire thing more data efficient? Like in the multi task learning case.

#

But yeah, I'm sceptical we can ever build whatever AGI is anyway

odd meteor
past meteor
odd meteor
still whale
#

Drunk Philosophy @past meteor ?

past meteor
still whale
#

@odd meteor that was actually, thoughtful and more on the mind than well thought out and made to appear as though she spent 17 straight hours working out her points

#

The ai realm is pretty scary. Modelling and data collection with open source ai sources is honestly a nightmare in my mind. I jumped to the Facebook surveying and the election re Trump: stereotyping millions from adverts and “personality” traits

serene scaffold
odd relic
odd meteor
#

The thing is, most big tech companies don't really like ML researchers whose research work is in ethics, more specifically those who are quite vocal and strong-willed.

This is because it exposes the dirty side of what goes on behind those hyped-up models that fetch them constant 💵

I see ethics as that field in AI where, as an ML researcher, if you're not strong and don't have a tough skin, you'll be crushed very easily. More so, you certainly can't escape being gaslit, harassed, attacked, called "angry bird", fired even.

If Google could fire Timnit, lol... I mean 🤷😀

still whale
# serene scaffold When you say "stereotyping millions from adverts and personality traits", what o...

To me, like honestly I operate under the assumption I’m wrong or right at any given moment.. but it seems ominous that so much data can be hoovered up. Honestly in that instance it’s more the way it was implemented on shallow variables to target groups with certain propaganda. Really showing psychology is pseudo, check ya Facebook group likes. You could be latent who knows what, you wouldn’t even know unless a computer model categorised your data and sold it to companies to be used as analytics! And then got privy through the targeted ads… maybe I’m a skeptic but the direction of ai , the horizons seem scary to me

#

I mean not ranting, it’s well known fact the analytics and all that. But a true ai learning model that is capable of not imitation, that’s true fear

past meteor
still whale
#

True, to reference philosophy is pretty taxing.. gotta be able to know and keep track of all the contexts 😳. There is a philosophy of ai, look it up if you're interested

#

Some philosophies get insane though, I tried 3 pages of philosophy of psychology and just about blew 3 valves and a bottom end bearing

past meteor
#

In my masters we had to pick 1 "general education" elective and it was AI ethics, privacy & big data and cognitive science, I went for privacy & big data because I thought it'd have the most value 🤣

left tartan
# odd meteor The thing is, most big tech companies don't really like ML Researchers whose res...

Yah; therein lies the rub. You need opinionated people to consider the negative impacts of ethical decisions on a business, but it also needs people who will turn a blind eye when the business chooses the less ethical route (in their opinion). I should add; since ethics are highly subjective, you also need opinionated people who accept that others might disagree with their opinions and know how to be a team player. (** This is not a commentary on Timnit, I’m just saying it’s perhaps impossible to find such people)

lapis sequoia
#

guys what are the main models of ai?

small wedge
#

wdym the main models?

#

like RNN, CNN, etc?

lapis sequoia
#

i have a research

lapis sequoia
lapis sequoia
#

dm me i can give you what my research is about

small wedge
#

I'm good

lapis sequoia
#

alr thank you anyway

#

another question, what are the materials of ai?

small wedge
lapis sequoia
small wedge
#

XD well I can't answer your question if you don't even know what it means

lapis sequoia
odd meteor
odd meteor
odd meteor
verbal oar
#

where I can deploy ai model for free?

#

heroku has MFA, vercel requires phone to verify

#

I want some simple

#

this is not security important model

#

just for testing

small wedge
still whale
#

@odd meteor ha fuck dude that’s a laugh, I’m right there with ya . I could spout moral ethics till I die but I guarantee I’ll be in ethical within an hour

slow totem
#

Hello! Need a bit of help with scikit learn!

So here is my use case. I have made a model in Python with scikit-learn. It works great, with an accuracy of 89%. But I want to use it in JS as I am using MERN (I don't want to join the MERN app with a flask/fastAPI server to get the classification). So I looked at tensorflow js and a scikit-learn wrapper for JS: https://scikitjs.org/ I preprocess the data in the exact same manner (I self-coded the vectorizer in both languages, so I can compare the data being fed, and it's exactly the same). But when I run the SVC in JS it leads to an amazing accuracy of 9%. Any idea what I can do?

Thank you for a response in advance


still whale
#

@odd meteor if only philosophy didn’t exist in a semi vacuum, I’d be free to be some sort of moral agent of some purpose

odd meteor
# left tartan Yah; therein lies the rub. You need opinionated people to consider the negative ...

True. Possessing empathy, opposing any form of oppression, advocating for inclusivity, and having a natural inclination to challenge inconsistencies and call out bullsh!t are traits not everyone possesses.

While I'd like to believe that I'm someone who understands and values empathy, I don't necessarily envision myself venturing into research in Ethics. It's just a lot 😀

After going through the 'Stochastic Parrot' research paper, my admiration for those who specialize in Ethics has grown. Reflecting on the incident where Google's 2015 image classification model labelled Black individuals as gorillas, I'm reminded of the pivotal role ethical researchers play. Their contributions are invaluable, but it's a field that might be too heavy for me.

odd meteor
# verbal oar where I can deploy ai model for free?

Try Streamlit / Streamlit Cloud, HuggingFace Spaces, more recently I saw a video on this new platform called Runway. https://www.youtube.com/watch?v=tSiS15ubQFQ

verbal oar
#

hmm streamlit is like R shiny?

odd meteor
# verbal oar hmm streamlit is like R shiny?

I haven't used R-Shiny so I have no idea. You can use Streamlit to deploy your model as a web app. If that's what R-shiny does, then perhaps there's some semblance between both.

verbal oar
#

ok

sharp crest
past meteor
verbal oar
#
    preds = predictions(xTxt)
    print('preds', preds)

    text = input("type name of product (e.g beer): ")
    test = pred(text)
#

when sharing, users don't see the "Manage app" console?

agile cobalt
#

I don't think that streamlit supports input(), iirc you should use components provided by the library (buttons, checklists, probably some specialized forms of text inputs etc) instead of using just input() itself

#

check the documentation and examples if you haven't yet

verbal oar
#

ok because its for users right

#

found text_input

lapis sequoia
#

I have a 200MB, 1.1MM-row CSV file. Too big for Excel, but I'd still like a GUI-based method of initial data exploration. What are your favorite ways of doing this? I've found three possible solutions thus far:

-PowerBI
-Datasette
-VS Code's CSV extensions

serene scaffold
lapis sequoia
# serene scaffold You could use pandas in a Jupyter notebook

Jupyter forces the dataframe to fit into the screen size, which is really inconvenient when it comes to larger fields like review_content. It also doesn't show me all the entries. I've yet to find a solution for this. Even adjusting display.max_colwidth to 100 doesn't help.

serene scaffold
vestal widget
#

Im finetuning a gpt2 124M model using gpt-2-simple module, any tips to prevent overfitting or underfitting, like which parameter should i adjust

lapis sequoia
agile cobalt
lapis sequoia
agile cobalt
#

that is probably an issue related to file encoding or excel localisation settings?

#

another option could be yeeting it into a database (either just a sqlite file or something like postgres), then using something like dbeaver, but at that point you might as well just end up using sql instead of python with the same problems

lapis sequoia
#

Postgres would be a bit extravagant, wouldn't it? SQLite can do the job I think

agile cobalt
#

setting up just for that would be
if you already had it set up for something else and could reuse for that, not that much
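A minimal sketch of the SQLite route using only the Python standard library: stream the CSV into a SQLite file in batches, so the whole file never has to sit in memory. The function name `csv_to_sqlite`, the table name `reviews`, and the batch size are hypothetical. It assumes the first row is a header and every row has the same number of fields; all columns are stored without a declared type.

```python
import csv
import sqlite3

def csv_to_sqlite(csv_path, db_path, table="reviews", batch_size=10_000):
    """Stream a CSV file into a SQLite table and return the number of rows loaded."""
    conn = sqlite3.connect(db_path)
    total = 0
    try:
        # "with conn" commits on success and rolls back if an exception is raised.
        with conn, open(csv_path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            header = next(reader)
            columns = ", ".join(f'"{name}"' for name in header)
            placeholders = ", ".join("?" for _ in header)
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
            insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'

            batch = []
            for row in reader:
                batch.append(row)
                if len(batch) >= batch_size:
                    conn.executemany(insert_sql, batch)
                    total += len(batch)
                    batch = []
            if batch:                      # flush the last partial batch
                conn.executemany(insert_sql, batch)
                total += len(batch)
    finally:
        conn.close()
    return total
```

The resulting `.db` file can then be opened in a GUI such as DBeaver or served with Datasette, both of which were mentioned above.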

peak hamlet
lapis sequoia
#

Change Excel encoding settings? Switch to PowerBI?

peak hamlet
lapis sequoia
#

CSView is okay, but you can't zoom in and out to display more info on the screen. The viewport is fixed.

peak hamlet
#

What’s the starting format?

#

Is CSV the original type?

#

And all you want is to just manually look at the data?

lapis sequoia
lapis sequoia
peak hamlet
#

Oh 17k records?
Yeah
That’s potatoes for Excel

lapis sequoia
#

1,130,017 specifically

peak hamlet
#

oh

#

yeah
that's over Excel's max row count

peak hamlet
#

FYI you can open an empty workbook and then import from CSV, and then in the wizard you can correct the encoding and reject the automatic transformations
But yeah, if you have over a 100k, give or take 100k, definitely go for the database option

past meteor
#

I think I'd just work with a notebook and the other 6 columns mainly. If I want the 7th I'd read those reviews "manually"

sand pivot
#

hey all small question
i am new to machine learning, and i have a data set that i basically train upon this network

```py
model = tf.keras.Sequential()

model.add(tf.keras.layers.Dense(128, input_dim=1, activation='relu'))
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='linear'))

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)
model.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['mean_absolute_error'])

history = model.fit(input_data, output_data, batch_size=32, epochs=300,shuffle=True)
```

i am basically trying to approximate E = 13.6/(n*n)
i generated some readings based on E, where the input data is n and the output data is E. i apply a small deviation to E so that the results are not exact

now the problem is that if i try to predict values outside of the training range (training range is n = 1 to n = 50), then i get massive error rates, like 200% or sth lol (predicting values within the training range works fine, but anything outside of it just slowly grows into massive error rates)
any ideas?

#

i am honestly at my wits' end, i have no idea what to do xD

wooden sail
#

if you explicitly include the model into the network, it'll work better

#

the network has no way of knowing what it would do outside the training data otherwise

#

you can rework the network so that it estimates the parameters of your model for E
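A minimal sketch of what "estimating the parameters of your model for E" could look like without a neural network at all, using only NumPy. It assumes the functional form `E = a / n**p` is known and only `a` and `p` are unknown; taking logs turns that into a straight line, `log(E) = log(a) - p * log(n)`, which a least-squares fit can recover. The fake readings, the 1% multiplicative noise, and the names `a_hat`, `p_hat`, and `predict_energy` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = np.arange(1, 51, dtype=float)                    # training range n = 1..50
noise = 1.0 + rng.normal(scale=0.01, size=n.size)    # about 1% multiplicative deviation
E = 13.6 / n**2 * noise                              # noisy fake "readings"

# Assumed model form: E = a / n**p, i.e. log(E) = log(a) - p * log(n).
slope, intercept = np.polyfit(np.log(n), np.log(E), deg=1)
a_hat, p_hat = np.exp(intercept), -slope

def predict_energy(n_new):
    """Evaluate the fitted power law at new n (works outside the training range)."""
    return a_hat / np.asarray(n_new, dtype=float) ** p_hat

print(a_hat, p_hat)              # should land close to 13.6 and 2
print(predict_energy(100))       # extrapolation well outside the training range
```

Because the structure of the formula is built in, extrapolation only depends on how well two numbers were estimated, whereas a generic ReLU network has nothing that tells it how the curve continues past n = 50.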

sand pivot
#

thanks, and interesting, can you give me an example @wooden sail ?

#

not sure how to achieve that to be fair 😓

wooden sail
#

the main question is, what things can we assume to be known ahead of time? and what do we want to find out?

sand pivot
#

honestly, my main idea is to have some data entries, and use them to approximate E, through a neural network

#

i have some code that generates data readings, with some deviation/controlled error rate