#data-science-and-ml


serene scaffold
#

what's the tldr?

smoky basalt
#

so is deep seek better

untold shell
#

Hey guys, any sites or Youtube channels where i can learn about Evolutionary AI ?

fringe bluff
floral tree
#

yeah what is it

past meteor
#

it's just optimization algorithms

smoky basalt
#

wat should i learn after pandas if i wanna go into ml

serene scaffold
smoky basalt
#

🙃

#

i ve got gcses this year

#

but i wanna excel

serene scaffold
# smoky basalt im 15

You should plan to get a degree that's related to ML. But in the meantime, a good next step would be to train basic classifiers with scikit-learn, so that you learn about concepts like X and y data, train and test sets, and evaluation metrics
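
A minimal sketch of that suggestion, assuming scikit-learn is installed. The iris dataset, the logistic regression model, and the 75/25 split are illustrative choices, not part of the original advice:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# X holds the features (one row per flower), y holds the labels to predict
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows so the model is scored on data it never saw
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a basic classifier on the training set only
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Evaluate on the test set with a simple metric
y_pred = clf.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print(f"test accuracy: {test_accuracy:.2f}")
```

Swapping in a different classifier or metric changes only the lines that create `clf` and compute the score; the X/y and train/test structure stays the same.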

smoky basalt
#

greatly in cs

smoky basalt
#

so which packages should i learn after pandas

serene scaffold
#

Don't focus on learning libraries

#

Repeat after me

#

Don't focus on learning libraries

#

(I can't hear you)

smoky basalt
#

no as in i mean to take that route of learning sci kit learn

smoky basalt
#

im js asking bc idk if i should learn matplotlib

#

is it related?

#

or is it more data science

serene scaffold
#

@smoky basalt I just gave some suggestions for what concepts to learn next and what library to use to do it. So that's my answer for now.

smoky basalt
#

would i need to learn numpy or smthn like that?

#

heard numpy is important

serene scaffold
smoky basalt
#

uhm im confused

serene scaffold
#

You need to learn concepts. You use libraries to apply those concepts

smoky basalt
#

so wat concepts to learn first

serene scaffold
#

I already told you

unkempt wigeon
#

why do I need to place everything into a class?

serene scaffold
#

python is not java.

unkempt wigeon
#

I was looking at the pytorch resources and it had the code in a class so i was just wondering

serene scaffold
unkempt wigeon
#

i got to the code with the class and thought, do i even need this in a class

vivid skiff
#

Is there any discord server specific to pytorch?

flint sierra
neon island
#

Besides, you may want to parameterize your NeuralNetwork to try different instances of parameters, it's useful to have it as a class.

iron basalt
#

A standard OOP library pattern is to have a fundamental building block that can be inherited from. For example, in a game engine this would typically be the GameObject type.

unkempt wigeon
iron basalt
arctic wedgeBOT
#

torch/nn/modules/flatten.py line 13

class Flatten(Module):
iron basalt
#

The idea is to build new Modules by composition of other Modules.

#

This forms a tree structure, where each Module has some list of children, and when you then call something like `to`, it can navigate this tree and call `to` on all the children, and all of their children (etc), resulting in everything getting sent.

#

You could build your own system like this, or do it manually for each one.
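
A toy sketch of that tree idea in plain Python. This `Module` class is a hypothetical stand-in, not the real `torch.nn.Module`, and the "device" is just a string. It also shows why a class helps with parameterizing a network (here, `hidden_layers`):

```python
class Module:
    """Toy stand-in for a building block such as torch.nn.Module (not the real class)."""

    def __init__(self):
        self.children = []   # sub-modules this module is composed of
        self.device = "cpu"  # where this module's parameters would live

    def add_child(self, child):
        self.children.append(child)
        return child

    def to(self, device):
        # Move this module, then recurse so the whole subtree follows
        self.device = device
        for child in self.children:
            child.to(device)
        return self


class Linear(Module):
    pass


class Flatten(Module):
    pass


class NeuralNetwork(Module):
    def __init__(self, hidden_layers=1):
        super().__init__()
        self.add_child(Flatten())
        stack = self.add_child(Module())  # a nested container of layers
        for _ in range(hidden_layers):
            stack.add_child(Linear())


def all_devices(module):
    # Walk the tree and collect the device of every module in it
    devices = [module.device]
    for child in module.children:
        devices.extend(all_devices(child))
    return devices


net = NeuralNetwork(hidden_layers=3).to("cuda")
print(all_devices(net))
```

One call on the root reaches every nested layer, which is the bookkeeping you would otherwise do by hand for each piece.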

gilded sundial
#

Can anyone help on a beginner's project "Spotify Recommendation System" ? I am really stuck on how I should Cluster and find data to make the recommendation system model

gilded sundial
stuck tapir
#

Check dms,

tawdry sundial
#

How is fine-tuning expensive? It seems very cheap to fine-tune, while a lot of people claim it's expensive

#

Fine-tuning seems better for most use cases

#

I feel like i am missing something

past meteor
tawdry sundial
past meteor
# tawdry sundial Than rag

finetuning openai models is mostly about having the models follow the system prompt and not necessarily about "knowledge"

tawdry sundial
#

if 3-7 batches is enough for the model to acquire knowledge and follow certain prompts, then it would be really worth it. however i am still not sure why its not nearly as common as rag

round tusk
#

I'm a little new to this concept, but I was looking at this video where they simulated a Deep Reinforcement Learning AI. The guy said that it was trained for 5 years. Do they actually train and simulate it for 5 years in human time? Or do they mean the 5 years in game?

#

The game they simulated was pokemon red

#

I'm dipping my feet in this area, and the fact that it takes years to train seems a little daunting, but I highly doubt they do that for so long.
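
Figures like this typically count simulated experience rather than wall-clock time: the emulator runs faster than real time and many copies run in parallel. A back-of-the-envelope sketch with made-up numbers (the environment count and speed-up below are hypothetical, not taken from the video):

```python
HOURS_PER_YEAR = 24 * 365


def wall_clock_hours(simulated_years, parallel_envs, speedup):
    """Real hours needed to collect `simulated_years` of in-game experience."""
    simulated_hours = simulated_years * HOURS_PER_YEAR
    return simulated_hours / (parallel_envs * speedup)


# Hypothetical setup: 40 emulator copies, each running 20x faster than real time
hours = wall_clock_hours(simulated_years=5, parallel_envs=40, speedup=20)
print(f"{hours:.1f} hours of real time")
```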

serene grail
past meteor
#

finetuning an openai model should be seen as a substitute for a long prompt, i.e. a way to shorten it

unkempt wigeon
#

Is this correct for a resource?

serene scaffold
unkempt wigeon
#

Sorry

glacial root
#

how important is differential equations for machine learning

glacial root
#

and then statistics

serene scaffold
#

those are more relevant, yes

#

and it's specifically multivariate calculus for derivatives. I don't know of any application for integrals in ML.

wooden sail
#

many cost functions tend to be formulated as integrals, in their so-called "variational form"

iron basalt
#

Differential equations are what physics currently uses to describe the behavior of physical systems (and will for the foreseeable future).

#

If you want to do ML research, then it can be good to know too, as it opens the door to physics for you, and ML is no stranger to taking ideas from there (the recent Nobel Prize in Physics was given to an ML researcher due to the work's link to physics).

#

(If you want to do (broad) research you want all the math so you can take ideas from other fields (go wide, not narrow))

past meteor
#

I think I like answering no to these questions

#

If someone asks if they need to know diff eq for ML/AI the answer is likely just no. If they’re interested in theory and not practice, they’d likely just want to pick it up themselves

#

Or you learn the parts of diff eq you need to when you need to (it’s how I approach it)

gilded sundial
#

from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# Keep only songs with at least 3 interaction rows; rarer songs are dropped
# here and will later be reported as "not found"
song_interaction_count = test_data_kmeans.groupby('name')['user'].count()
popular_songs = song_interaction_count[song_interaction_count >= 3].index
test_data_kmeans = test_data_kmeans[test_data_kmeans['name'].isin(popular_songs)]

# Create a utility matrix (user-item matrix)
utility_matrix = test_data_kmeans.pivot_table(index='user', columns='name', values='user_rating', fill_value=0)

# Convert the utility matrix to a sparse matrix
sparse_matrix = csr_matrix(utility_matrix)

# Define a batched KNN function for incremental computation
def recommend_songs_batched(song_name, utility_matrix, sparse_matrix, batch_size=1000, num_recommendations=5):
    if song_name not in utility_matrix.columns:
        print(f"Song '{song_name}' not found in the dataset!")
        return []

    song_idx = utility_matrix.columns.get_loc(song_name)
    item_matrix = sparse_matrix.T.tocsr()  # rows are songs, columns are users
    target = item_matrix[song_idx]
    recommendations = []

    for start in range(0, item_matrix.shape[0], batch_size):
        end = min(start + batch_size, item_matrix.shape[0])
        # A batch can hold fewer songs than the requested number of neighbors
        k = min(num_recommendations + 1, end - start)

        # Fit KNN only on the batch
        knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=k)
        knn.fit(item_matrix[start:end])

        # Compute neighbors for the target song
        distances, indices = knn.kneighbors(target)

        for dist, idx in zip(distances.flatten(), indices.flatten()):
            global_idx = start + idx
            # Skip the target song itself (it only appears in one batch)
            if global_idx == song_idx:
                continue
            recommendations.append((utility_matrix.columns[global_idx], 1 - dist))

    return sorted(recommendations, key=lambda x: -x[1])[:num_recommendations]
#

# Example usage

song_to_recommend = "Camby Bolongo" # Replace with a valid song name
recommendations = recommend_songs_batched(song_to_recommend, utility_matrix, sparse_matrix, batch_size=5000, num_recommendations=5)

print(f"Recommendations for '{song_to_recommend}':")
for rec, score in recommendations:
    print(f"{rec} (Similarity Score: {score:.2f})")

#

Can anyone tell me why I can't find songs in my dataset when I'm searching for songs that are in my dataset?

jaunty helm
#

like I expect utility_matrix['some_column_that_has_song_names']

gilded sundial
jaunty helm
gilded sundial
#

I did, and some names do match. This problem is likely occurring because of the batching

woeful pulsar
#

Hello guys. I am a newbie in python and a data science enthusiast.
Quick question, how do I delete a NaN row in python?
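
Assuming the data lives in a pandas DataFrame (the question does not say), a short sketch of the usual options. The table and column names below are made up:

```python
import numpy as np
import pandas as pd

# Small made-up table with missing values
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [1.0, np.nan, 3.0, 4.0],
    "age": [20, 21, np.nan, 23],
})

# Drop every row that contains at least one NaN
no_nan_rows = df.dropna()

# Drop rows only when a specific column is NaN
has_score = df.dropna(subset=["score"])

# Drop a single row by its index label
without_row_1 = df.drop(index=1)

print(no_nan_rows)
```

All three calls return a new DataFrame and leave `df` unchanged, so assign the result if you want to keep it.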

woeful pulsar
#

Yes. Okay.
Thanks a lot

wide bane
#

I want to build a pipeline that generates a 2D image from the user's text input and then builds a 3D model from that 2D image, which I can open and view in Blender. Which models should I use, or what are the best resources for this?

Also the model must run locally in my system

agile cobalt
#

see https://stability.ai/stable-3d + look up alternatives to it

candid ridge
#

idk if this is related but
am i the only one getting this error with google gemini?
EPROTO 04110000:error:0A000119:SSL routines:ssl3_get_record:decryption failed or bad record mac:c:\\ws\\deps\\openssl\\openssl\\ssl\\record\\ssl3_record.c:623:

same error happened even when i use python, nodejs, or curl

serene scaffold
candid ridge
#

it also happens when i use the official library (pip install google-genai)

serene scaffold
candid ridge
#
requests.exceptions.SSLError: HTTPSConnectionPool(host='generativelanguage.googleapis.com', port=443): Max retries exceeded with url: /v1beta/models/gemini-1.5-flash:generateContent (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:1018)')))
#
from google import genai

client = genai.Client(api_key="bla")
response = client.models.generate_content(
    model="gemini-1.5-flash", contents="Explain how AI works"
)
print(response.text)
lean oriole
#

hello guys

#

i'll be in your care 🙂

silent basin
#

OpenAi: "hey, you stole our data"
Everyone: "now you know how it feels"

sullen herald
candid ridge
#
>>> import google.generativeai as genai
>>> genai.configure(api_key='dskodkskdsooas')
>>> model = genai.GenerativeModel('gemini-1.5-flash')
>>> resp = model.generate_content('hello')
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1738599352.419306   11664 ssl_transport_security.cc:1665] Handshake failed with error SSL_ERROR_SSL: error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT: Invalid certificate verification context
I0000 00:00:1738599352.538987   11664 ssl_transport_security.cc:1665] Handshake failed with error SSL_ERROR_SSL: error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT: Invalid certificate verification context
I0000 00:00:1738599352.661139   20156 ssl_transport_security.cc:1665] Handshake failed with error SSL_ERROR_SSL: error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT: Invalid certificate verification context
I0000 00:00:1738599352.791475   20156 ssl_transport_security.cc:1665] Handshake failed with error SSL_ERROR_SSL: error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT: Invalid certificate verification context
I0000 00:00:1738599352.899497    9664 ssl_transport_security.cc:1665] Handshake failed with error SSL_ERROR_SSL: error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT: Invalid certificate verification context
I0000 00:00:1738599353.024943    9664 ssl_transport_security.cc:1665] Handshake failed with error SSL_ERROR_SSL: error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT: Invalid certificate verification context
I0000 00:00:1738599353.155409   11664 ssl_transport_security.cc:1665] Handshake failed with error SSL_ERROR_SSL: error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT: Invalid certificate verification context
I0000 00:00:1738599353.287683    9664 ssl_transport_security.cc:1665] Handshake failed with error SSL_ERROR_SSL: error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT: Invalid certificate verification context
I0000 00:00:1738599353.415192    9664 ssl_transport_security.cc:1665] Handshake failed with error SSL_ERROR_SSL: error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT: Invalid certificate verification context
I0000 00:00:1738599353.545329   20156 ssl_transport_security.cc:1665] Handshake failed with error SSL_ERROR_SSL: error:1e000065:Cipher functions:OPENSSL_internal:BAD_DECRYPT: Invalid certificate verification context
sullen herald
#

thank you for your reply, i have fixed this with genai.configure(api_key=GOOGLE_API_KEY, transport='rest')

unkempt wigeon
#

Is it possible to combine activation functions?
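
In principle yes: an activation is just a function applied elementwise, so activations can be composed, multiplied, or blended. A small sketch in plain Python. SiLU (also called swish) is a well-known example, defined as x * sigmoid(x); `blend` and its `alpha` weight are hypothetical names used only for illustration:

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def silu(x):
    # Product of the identity and a sigmoid gate
    return x * sigmoid(x)


def blend(x, alpha=0.5):
    # Hypothetical weighted mix of two activations
    return alpha * relu(x) + (1.0 - alpha) * math.tanh(x)


for value in (-2.0, 0.0, 2.0):
    print(value, round(silu(value), 4), round(blend(value), 4))
```

Whether a given combination trains well is an empirical question; the sketch only shows that the combination itself is easy to express.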

candid ridge
#
raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='generativelanguage.googleapis.com', port=443): Max retries exceeded with url: /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac (_ssl.c:1018)')))
#

can you try with ur api key so i know if it on google or on me

#

it used to work not that long ago

#

like few weeks

sullen herald
#

I will check wait

sullen herald
#

Sending you nb link over dm.

candid ridge
eager hamlet
#

hey, I'm trying to train a model to read a word from an image. I'm exploring architectures like attention-based encoder-decoder models or CNN-RNN-CTC architecture. What do you guys think would be better?

glacial root
unkempt apex
dawn current
#

Can anyone suggest platforms to learn and brush up my coding skills

summer flame
#

Hey guys I always struggled with Neural Networks, so I made a simple, yet powerful Neural Network file for beginners 😄

Here it is:

#

please give me reviews since this is my first release in... anything, so please give feedback!

jade hatch
summer flame
#

nop spent a month writing this thing

serene scaffold
summer flame
#

o yeah

#

since its an unofficial official package, i just wrote a bunch of comments

hollow cobalt
#

Hello guys. Me and a friend are beginning a project of building an LLM. If you want to join us or hear more send me a DM.

serene scaffold
hollow cobalt
#

No, I know the difference. Give an example of the minimum computing power you think is required.

serene scaffold
hollow cobalt
#

It’s a transformer architecture LLM

serene scaffold
flat mirage
#

Someone who speaks Spanish

tidal bough
#

deepseek v3 was trained for "only" about $5 million, but it's a massive outlier in cheapness that required a lot of low-level optimization.

hollow pagoda
#

ajajajaja

tropic bluff
#

Hi guys I want to prepare for data science internship but I don't have that much knowledge but my interview is very near how can I prepare to ace the interview?

void crescent
#

guys whats the best BERT edition i just picked one random one (its doing horribly during fine-tuning)

serene scaffold
void crescent
serene scaffold
void crescent
#

its mostly political

#

but like

#

one is on US's history

#

one is on New Zealand airline

#

ok nvm its kinda varied

#

but the sources are all from CNN

#

so which models can understand CNN articles

serene scaffold
#

@void crescent look to see if there's a BERT model on huggingface that's trained on news articles.

hard nest
#

In object detection, I have an image and the coordinates of the item in question. But some images don't have the item, and the code returns an error if I give empty coordinates during training. What should I do?

serene scaffold
glacial root
#

which is better to learn, tensorflow, pytorch, or scikit-learn

jaunty helm
glacial root
jaunty helm
glacial root
#

with a focus on torch or tensor cause computer vision is heavily focused on cnns right

jaunty helm
glacial root
#

i don't really know much, not in college yet

#

but hopefully i can learn some cv before college, i need to first learn math up until differential equations though

frosty pawn
#

is there someone who knows about ai

#

i want some advice like...

#

what i should learn

#

i know i should learn python, but do i need a lot of experience in the python language first

small arch
#

does someone know about fast rcnn?

serene scaffold
muted vine
#

hey guys, how is it going? have any of you used the DeepSeek distill of Llama on AWS Bedrock? I want to know about the pricing and how the process of importing such a model works. My API currently uses the Llama model on AWS, but I want to migrate to the DeepSeek model.

smoky basalt
#

for it to return a response

#

can someone show me complex code for smthn in ml

#

i wanna see how complex it gets

tepid tartan
#

stats: basic descriptive + tests (t-test, ANOVA, chi2, correlation)
SQL: entity-relationship modeling, obv joins, views, common table expressions, being good at queries
python: pandas, numpy, matplotlib/seaborn
excel basics, Power BI basics, Tableau basics
added value: R, SAS

is this a good way to get the basics as a data analyst

serene scaffold
#

and that's before you can actually start using it

smoky basalt
#

💀

tepid tartan
smoky basalt
#

ml looks cool

spice ravine
#

Claude AI seems to be the best for coding

#

At least for me

#

Then it’s DeepSeek and then gpt is just the worst

serene scaffold
spice ravine
#

I've never tried the llamas

#

is it good?

obtuse yacht
#

From what I've seen, llama is better at accuracy and efficiency in coding tasks

#

Chatgpt though is better for assistance with creative tasks ex. writing code comments or generating documentation

#

but claude is the best for coding in my opinion

tender hearth
#

theyre good to establish some sort of baseline but anything within a few percentage points shouldn't be taken to be significant

#

use the models on your own and formulate your own opinion

obtuse yacht
#

thats true

#

companies often fund these "benchmarks" to demonstrate the new models

tender hearth
#

Goodhart's law is an adage often stated as, "When a measure becomes a target, it ceases to be a good measure". It is named after British economist Charles Goodhart, who is credited with expressing the core idea of the adage in a 1975 article on monetary policy in the United Kingdom:

Any observed statistical regularity will tend to collapse once...

woeful escarp
#

Does anybody know about daily dataset updates?
I'm looking for data for a trading bot (ML). I saw a dataset on Kaggle but found it's not accurate, it had a price spread

unkempt apex
heavy canyon
woeful escarp
#

because i want daily trading or weekly trading, but 1st of all i need the data haha

obtuse yacht
#

stocks?

woeful escarp
#

Man index and currency

obtuse yacht
# woeful escarp Index
import pandas as pd
from tqdm import tqdm
import yfinance as yf
import os
import contextlib
import shutil
from os.path import join

def read_symbols_data():
    data = pd.read_csv("http://www.nasdaqtrader.com/dynamic/SymDir/nasdaqtraded.txt", sep='|')
    data_clean = data[data['Test Issue'] == 'N']
    symbols = data_clean['NASDAQ Symbol'].tolist()
    return symbols, data_clean

def download_specific_symbols_data(symbols, period):
    os.makedirs('hist', exist_ok=True)
    is_valid = {}

    with open(os.devnull, 'w') as devnull:
        with contextlib.redirect_stdout(devnull):
            for symbol in tqdm(symbols, desc="Collecting"):
                data = yf.download(symbol, period=period)
                if len(data.index) == 0:
                    continue
                is_valid[symbol] = True
                data.to_csv(f'hist/{symbol}.csv')

    valid_symbols = [symbol for symbol in symbols if symbol in is_valid]
    return valid_symbols

def move_symbols_to_directory(symbols, source, dest):
    os.makedirs(dest, exist_ok=True)
    for symbol in symbols:
        filename = f'{symbol}.csv'
        shutil.move(join(source, filename), join(dest, filename))
#
def check_if_empty(input_path:str):
    # Despite the name, this clears the directory: remove it and everything inside
    shutil.rmtree(input_path)

def optimize_code_specific_symbols(symbols, period='max'):
    os.makedirs('hist', exist_ok=True)

    data_path = "data"
    if os.path.exists(data_path):
        check_if_empty(data_path)

    valid_symbols = download_specific_symbols_data(symbols, period)

    _, data_clean = read_symbols_data()
    valid_data = data_clean[data_clean['NASDAQ Symbol'].isin(valid_symbols)]

    os.makedirs('data', exist_ok=True)

    # Index tickers such as ^GSPC are not in the NASDAQ listing, so keep
    # every downloaded symbol that is not flagged as an ETF
    etfs = set(valid_data[valid_data['ETF'] == 'Y']['NASDAQ Symbol'])
    stocks = [symbol for symbol in valid_symbols if symbol not in etfs]

    move_symbols_to_directory(stocks, "hist", "data")

    # ETF files are left behind in hist, so remove it recursively
    shutil.rmtree('hist')


if __name__ == "__main__":
    index_symbols = [ 
        "^GSPC", # s&p 500
        "^DJI", # dow jones industrial average
        "^IXIC", # nasdaq composite
        "^RUT", # russell 2000
        "^VIX", # volatility index
        "^FTSE", # ftse 100
        "^N225" # nikkei 225
    ]
    specific_symbols = ['^GSPC'] #S&P 500 should appear in a folder under "{stock_name}.csv"
    optimize_code_specific_symbols(specific_symbols)
#

this is some old code that i made when I did a stock prediction project

#

I added the index ticker codes in the main function

#

your dataset should look like this

glacial root
obtuse yacht
#

I added it so when you download more than one stock it would show a loading bar of how many stocks are done

glacial root
# obtuse yacht its the loading bars

oh, yeah i got it mixed up with another library. a while ago i watched a video about some random python libraries and tqdm was mentioned along with another library that organizes error messages and such

ionic valley
#

Hello, got a data science interview for a Two Sigma internship but I've legit never done a data science interview before. How should I prep?

Here's what I know about the interview: "The permitted languages for the first question are: C, C++, Java, and Python. The permitted languages for the second and third questions are: Java, Octave (Matlab), Python, and R."

What should I be grinding? So far I've just been spamming Pandas syntax

void crescent
#
import tensorflow_hub as hub
import tensorflow_text as text

preprocess = hub.KerasLayer(tfhub_handle_preprocess)
encoder = hub.KerasLayer(tfhub_handle_encoder)

inputs = tf.keras.layers.Input(shape=(1,), dtype=tf.string)
x = preprocess(inputs)
x = encoder(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x["pooled_output"])

model = tf.keras.Model(inputs, outputs)

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.summary()

weird im getting

#

ValueError: Exception encountered when calling layer 'keras_layer_4' (type KerasLayer).

A KerasTensor is symbolic: it's a placeholder for a shape an a dtype. It doesn't have any actual numerical value. You cannot convert it to a NumPy array.

Call arguments received by layer 'keras_layer_4' (type KerasLayer):
  • inputs=<KerasTensor shape=(None, 1), dtype=string, sparse=False, name=keras_tensor_364>
  • training=None
hollow silo
#

what is a good pandas book to read cover to cover?

flint sierra
untold bloom
#

questions and answers in the pandas tag of stackoverflow

agile cobalt
gilded pebble
#

I want to learn python where can i learn from

short barn
#

Hello

serene scaffold
arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

short barn
#

Can anyone recommend me how to learn AI development?

#

I haven't found any good explanations of how AI works

serene scaffold
main fox
#

y=mx+b, at scale 🐒

hollow cobalt
#

Lesson 1: How to break Copilot.

limber belfry
#

Is there any off-topic channel? I have a survey to ask and idk where…

fluid basalt
obtuse yacht
#

What are the neurons, why are there layers, and what is the math underlying it?
Help fund future projects: https://www.patreon.com/3blue1brown
Written/interactive form of this series: https://www.3blue1brown.com/topics/neural-networks

Additional funding for this project was provided by Amplify Partners

Typo correction: At 14 minutes 45 seconds...

fading wigeon
#

I have a basic question about tree ensembles. Like I get well enough how through various methodologies we can generate 100 trees or whatever. After that is it just like… the most common answer? So if 51 trees say it’s a cat and 49 say it’s a dog then it’s a cat?
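
For bagging-style ensembles such as random forests, that is roughly it, with one common variation: instead of counting votes, some implementations average each tree's class probabilities (scikit-learn's random forest is documented as doing this). Boosted ensembles combine trees differently, by adding up their outputs. A plain-Python sketch of both voting styles, with made-up per-tree numbers:

```python
from collections import Counter


def hard_vote(predictions):
    """Return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average per-class probabilities across trees, then pick the largest."""
    classes = probabilities[0].keys()
    n_trees = len(probabilities)
    averaged = {c: sum(p[c] for p in probabilities) / n_trees for c in classes}
    return max(averaged, key=averaged.get), averaged


# 51 trees say "cat", 49 say "dog"
tree_labels = ["cat"] * 51 + ["dog"] * 49
print(hard_vote(tree_labels))

# Same split, but the "dog" trees are far more confident than the "cat" trees
tree_probs = [{"cat": 0.55, "dog": 0.45}] * 51 + [{"cat": 0.10, "dog": 0.90}] * 49
label, averaged = soft_vote(tree_probs)
print(label, averaged)
```

With the same 51/49 split, the two schemes can disagree when the minority trees are much more confident.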

serene scaffold
past meteor
steel hull
#

Guys Which is/are the best Machine Learning resource(s) for a strong academic and practical foundation? ISLP or Andrew Ng (2018 - YouTube Version) or some other resource? Which one to pick up first?

I am looking forward to build a strong academic/theoretical and practical foundation in ML. So that it would help me out during advanced courses in my masters and also in building projects

Please suggest

flat token
dense needle
#

linear algebra done right is my favorite linear algebra book but it’s a tough follow if someone isn’t comfortable with proof based math

#

And linear algebra foundations really

flat token
#

Then they shouldn't be learning ML bc all they really want is to copy paste code that has been 100 ways

#

Done*

tame blaze
#

anyone available?

serene scaffold
#

Even if someone is available, they can't know if they can help until they know the question.

lapis sequoia
#

Red-pill me on image segmentation

flat token
# lapis sequoia Red-pill me on image segmentation

image segmentation is best done using linear programming. adobe actually originally used LP as their solver for patching together images. combine that with a skillful algorithm for managing nxm images with variable colors and relevant pixelations and you have the most modern image segmentation that exists

#

enjoy ur red-pill

lofty thorn
#

I am 25 years old...and don't know much about DATA SCIENCE.... but i am really interested to pursuing it...
and also wanna earn something from this profession
is it too late for me ?

lofty thorn
#

I am really interested to meet with a data scientist...
is there any?

flat token
grave garnet
dense needle
delicate cargo
vale matrix
#

need help with face recognition anyone ?

bronze fossil
#

just ask your question

fervent canopy
weary timber
#

guys how the fk does chatgpt handle oov when i type in the prompt "hello chat gpt kjakfdask"
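
GPT-style tokenizers work on subword pieces (byte-level BPE), so there is no true out-of-vocabulary case: an unfamiliar string is split into smaller known pieces, down to single bytes if needed. A toy sketch of the idea, using greedy longest-match over a tiny hypothetical vocabulary with a byte fallback. Real BPE applies learned merge rules instead of greedy matching, so this only illustrates the fallback behavior:

```python
# Hypothetical toy vocabulary; a real tokenizer has tens of thousands of entries
VOCAB = {"hello", " chat", " gpt", " k", "ja", "as", "k"}


def tokenize(text, vocab=VOCAB, max_len=5):
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that matches at position i
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Nothing matched: fall back to the raw UTF-8 bytes of one character
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


tokens = tokenize("hello chat gpt kjakfdask")
print(tokens)
```

Common words come out as one token each, while the gibberish costs many small tokens.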

weary timber
fervent canopy
#

there are some other font files in there too

#

I made it cuz, I hate corporate greed lol

weary timber
fervent canopy
weary timber
#

oh okay

#

thanks

dapper dune
#

Hi there. I'm new to ai. I am currently researching this topic for a few personal projects. Can someone guide me on the topic of fine-tuning language models? I have a general idea of this process from the hugging face manuals, but if I can ask someone about it in more detail (preferably in private messages, as I don't want to reveal details of projects to an audience), please let me know. I'd be very grateful

serene scaffold
dapper dune
# serene scaffold I will only help you in the server. Can you describe more about what you want th...

The main idea is that I write some literary texts and as an experiment and study of the topic I would like to train a model on them (for example gpt2-large or any other, if you can suggest better options). in particular, I am interested in how to properly compose a dataset based on my texts, how to correctly mark them up, so that the whole text is used for training and not just a part of it or 2 texts are glued together, how to add any third-party labels that will be important to the tokenizer, as well as how to train the model so that the model is approximately as good as it should be.

dapper dune
serene scaffold
#

and as an experiment and study of the topic
also I don't understand this part.

dapper dune
serene scaffold
#

The main idea is that I write some literary texts and I would like to train a model on them
is this the salient part?

dapper dune
serene scaffold
dapper dune
#

By structure I mean ordering of information, for example each text has a label [TITLE] and [MAIN PART]. The goal is to generate a text with a structure where there will be a title and a main part

serene scaffold
dapper dune
serene scaffold
tidal pebble
#

Any1 know where I can learn to build a collaborative filtering model? Im trying to make one for my project

dapper dune
# serene scaffold so you give it the title, and then it generates the main part?

that's not quite right, I'm writing a label and the beginning of the text content inside this part. As a prompt, there can be either the beginning of the text in the [TITLE] or the beginning of the text in the [MAIN PART]. the result of the generation is a text with the structure [TITLE] and [MAIN PART]. that is, in fact, the structure of the final text is always the same, but the starting point of generation can be any of its parts. I will also be satisfied with the option if just a phrase is used as a prompt, without specifying which part it refers to
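
One possible way to lay out such data, as a sketch: each text becomes one training string with the markers inline and an end-of-text token at the end, so separate texts are not glued together. `[TITLE]` and `[MAIN PART]` come from the description above; `format_example`, `build_prompt`, and the sample texts are hypothetical. `<|endoftext|>` is GPT-2's end-of-text token, and other models use a different one:

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-text token; other models use a different one


def format_example(title, body):
    # One complete text per training string, always in the same order
    return f"[TITLE] {title}\n[MAIN PART] {body}{EOS}"


def build_prompt(part, start):
    # Generation can begin inside either part; `part` is "TITLE" or "MAIN PART"
    return f"[{part}] {start}"


# Made-up sample texts
texts = [
    {"title": "The Lighthouse", "body": "The lamp had been dark for years."},
    {"title": "Winter Road", "body": "Snow covered the tracks by morning."},
]
train_strings = [format_example(t["title"], t["body"]) for t in texts]

print(train_strings[0])
print(build_prompt("TITLE", "The Last"))
```

Keeping one text per string also makes it easy to check that long texts are not silently truncated once they are tokenized.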

dire loom
#

hello peeps, can anyone point me in the direction of stock market focused trading communities? im a novice python programmer & funded day trader 🙂

turbid viper
#

anyone who's worked with layoutLM or floorplans or both?

serene scaffold
barren veldt
#

when doing backpropagation do I use the output that is after Softmax or the raw output? (neural networks)
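
In the common setup of softmax followed by cross-entropy loss, the loss is computed from the softmax output, and the gradient with respect to the raw outputs (logits) simplifies to `probs - one_hot_target`. A plain-Python sketch that checks this formula against finite differences; the logits and target are made up:

```python
import math


def softmax(logits):
    # Subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    # Loss is computed from the softmax probabilities
    return -math.log(softmax(logits)[target])


def grad_wrt_logits(logits, target):
    # Analytic gradient of the loss with respect to the raw outputs
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]


def numeric_grad(logits, target, eps=1e-6):
    # Central finite differences, used only to check the formula
    grads = []
    for i in range(len(logits)):
        up = logits[:i] + [logits[i] + eps] + logits[i + 1:]
        down = logits[:i] + [logits[i] - eps] + logits[i + 1:]
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads


logits, target = [2.0, -1.0, 0.5], 0
analytic = grad_wrt_logits(logits, target)
numeric = numeric_grad(logits, target)
print(analytic)
print(numeric)
```

That gradient on the logits is what then gets propagated back into the last layer's weights.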

serene scaffold
#

you also seem to have forgotten a very important step

stable isle
#

@dapper dune are you generating your own training data?

#

i see you're trying to train/fine-tune on literary texts you will write...

#

@dapper dune are you interested in creating data-sets of an application's usage?

dapper dune
#

I'm more interested in how to properly prepare these texts for fine-tune.

stable isle
dapper dune
barren veldt
#

I'm a bit new to neural networks

snow moat
gleaming plinth
#

good morning, i have a quick question. before i study videos regarding data science with python, should i first familiarize myself with ML or can i learn data science first and then ML? thank you for your advice

#

A.I suggested that I start with Data Science first for the following reason. If you were to start directly with ML, you would constantly run into problems because, for example, you don't know how to prepare or analyze data.

dense needle
#

You’ll want a solid math/stats foundation to do ML yes

#

How much experience do you have with math, stats, and programming

gleaming plinth
dense needle
#

How about math/stats

hybrid zodiac
#

Is there anyone experimenting with a CLIP model ?

dense needle
gleaming plinth
toxic mortar
#

Are there any good papers that cover embedding code repository into vector database?

gleaming plinth
dense needle
gleaming plinth
#

if you are talking about advanced math/stats skills i would say im a noob tbh

dense needle
#

Def get on the math as well on top of the programming. Maybe prioritize it even more

untold fable
untold fable
#

didn't get

sterile heath
pastel vessel
dusty sentinel
#

Hi guys, has anyone ever had issues with the DBSCAN algorithm? I'm using it in a research project with simple code on images, but it's crashing my machine. I've been coding for four years, and this is the first time I've encountered a real bottleneck in it.

#

I am a statistician, so I tested it in R too. While searching for 'why does this work here but not in Python?', I discovered that the implementation in R is more efficient (AKA C++ imp), running smoothly. However, for real-world applications, Python would be a better choice. So if anyone has experienced these issues, a faster solution would be great! 🙂

serene scaffold
dusty sentinel
#

Yes, I also tried HDBSCAN, but with the exponential increase in parameters for images, the same error appears. Tonight, I will run it on Colab to test the actual limits of the bottleneck.

serene scaffold
#

This implementation has a worst case memory complexity of O(n^2), which can occur when the eps param is large and min_samples is low, while the original DBSCAN only uses linear memory. For further details, see the Notes below.
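
To see why that O(n^2) worst case matters for images, a rough estimate. It assumes every pixel is one sample and every stored neighbor index takes 8 bytes; both are simplifying assumptions, and the tile sizes are hypothetical:

```python
def worst_case_neighbor_bytes(n_points, bytes_per_index=8):
    """Upper bound if every point ends up listed as a neighbor of every other point."""
    return n_points * n_points * bytes_per_index


# Hypothetical tile sizes, treating every pixel as one sample
for side in (100, 500, 1000):
    n = side * side
    gib = worst_case_neighbor_bytes(n) / 2**30
    print(f"{side}x{side} image -> {n:,} points -> up to {gib:,.1f} GiB")
```

A smaller `eps` or a downsampled image shrinks the neighbor lists, which is usually the first thing to try before switching algorithms.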

dusty sentinel
#

With something like min_samples=200 or eps=2.5 (I know that's too much, but I don't work with the algorithm much), it crashes.

dusty sentinel
past meteor
#

That being said, I rarely cluster and when I do, I just use k-means

#

I assume your issue is that you don’t want to specify the number of clusters ahead of time

dusty sentinel
past meteor
#

How many samples do you have

dusty sentinel
#

A lot, like 250+ images, and possibly more. I just split some satellite images provided by my professor from my country for the beginning of the research. But I’m processing them one by one. The full image is around 500MB, but each split is about 2.5MB

#

250 images cover my entire state

#

In the test case I'm working on right now, only with partitioned images, with a specific area.

past meteor
#

Even if your program didn’t crash I’d worry about the quality of clustering in this high dimensional space

odd meteor
fallow coyote
#

what are some good linear and logistic regression projects I could do that'd be good to increase my programming skills and as a good portfolio project? nothing too complicated but challenging enough. I want to get into ML, but it's too fucking complicated so I'll stick with the simpler aspects

crude karma
#

guys does anyone have Zillow dataset

fervent canopy
glacial root
#

do you guys think i could learn pytorch while learning calc 3 or is it better to wait until after i finish calc 3 and linear algebra

small wedge
rain path
#

I want to process a dataset but it's too large for my disk space, so I'm using streaming mode to iterate over it. Is there a way to free up memory after each batch of iteration since the data in memory builds up: (and this is a really large dataset)

import itertools

from datasets import load_dataset

dataset = load_dataset("calabi-yau-data/ws-5d", name="reflexive", split="full", streaming=True)

# Convert dataset to an iterator
dataset_iter = iter(dataset)

# Iterate through first 1000 rows, in chunks of 100
batch_size = 100
total_rows = 1000

for i in range(0, total_rows, batch_size):
    batch = list(itertools.islice(dataset_iter, batch_size))
    if not batch:  # Stop if there are no more rows
        break
    print(f"Batch {i // batch_size + 1}:")
    print(batch)  # Process the batch as needed

It's a shame streaming datasets aren't indexable; otherwise I would've just asked for the specified range of rows.
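For what it's worth, a loop like this rebinds `batch` on every pass, so the previous batch can be garbage-collected as long as nothing else keeps a reference to it (an ever-growing results list, for example). A minimal sketch of that pattern, with a plain generator standing in for the streamed dataset; `iter_batches` and `fake_stream` are made-up names for illustration:

```python
import itertools
from typing import Iterable, Iterator, List


def iter_batches(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield lists of up to batch_size rows; only one batch is alive at a time."""
    it = iter(rows)
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return
        yield batch


def fake_stream(n: int) -> Iterator[dict]:
    # Stand-in for a streaming dataset: rows are produced lazily, one by one.
    for i in range(n):
        yield {"row": i}


# Keep only a small summary per batch instead of the rows themselves.
batch_sizes = [len(batch) for batch in iter_batches(fake_stream(250), 100)]
print(batch_sizes)  # [100, 100, 50]
```

Streaming datasets in `datasets` also have `skip()` and `take()` methods, which may cover the "rows N to M" case without indexing.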

marble shoal
#

How to install torchdirect-ml because i use pip but it cannot find the package even though i already installed the packages from pypi

dark tangle
#

Guys do you know how can i convert this data into a table? I think it should be printed in a table right?

agile cobalt
dark tangle
brave adder
#

hey guys, i have heard about andrew ng's ml course. I wanna know whether it is good? and also if it is free (lol)?
for context im kinda new to ml and would like to go in depth into how different models work

odd meteor
glacial root
#

not bs stuff, actual good projects

#

i'm learning multivariable calculus and linear algebra anyways, just wanted to know if i could go on ahead and learn ml frameworks/libraries as well

brave adder
agile cobalt
#

just keep in mind it's about the foundations you need to understand how models work, it will not go into specifics about different models, it'll just give you the knowledge to understand what is going on for the general case

glacial root
#

so it's all theory and not much application?

agile cobalt
keen perch
#

Anyone knows how to download tensorflow for GPU and what other things I should download?

small wedge
# glacial root but math knowledge is necessary to actually do stuff with it right

No, one of the main points of libraries like pytorch and tensorflow is that they enable you to use machine learning without needing the intense math knowledge. Certainly having a foundation of knowledge will help you understand what's going on under the hood, and it's required to actually perform novel research, but as far as required math knowledge to build and train a model with one of these frameworks, it's about as minimal as you can get.

brave adder
#

and i want something similar for other types of models
like svm, random forest etc

lofty thorn
#

The data science role means different things at different companies. Some people say go for data analyst and some say something else. I also don't know any data scientists.
Can anyone help me deciding which job role should I go for?

timid veldt
#

what's the best approach to learning AI/machine learning faster with python?

timid veldt
serene scaffold
timid veldt
timid veldt
serene scaffold
timid veldt
#

I'm a 3rd year student

serene scaffold
#

can you switch to an AIML related major?

timid veldt
serene scaffold
timid veldt
serene scaffold
timid veldt
serene scaffold
#

!resources data science

arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

timid veldt
#

Thank you.

sullen herald
#

Any reviews on Kerashub?

sullen herald
empty furnace
#

Can someone familiar with rerankers explain the differences between AnswerDotAI's rerankers and FlagRerankers?

sullen herald
round rapids
#

how

#

do I import pandas

serene scaffold
#

@round rapids you're voice banned for voice gate spam btw

round rapids
#

whats voice gate spam?

serene scaffold
round rapids
#

oh

round rapids
serene scaffold
round rapids
#

in the cmd?

serene scaffold
round rapids
#

I tried

round rapids
serene scaffold
serene scaffold
# round rapids what?

there's text in the terminal in the screenshot that you posted. I need to copy and paste it, so please put it in this chat as text.

round rapids
#

C:\Users\bilal>pip install pandas
'pip' is not recognized as an internal or external command,
operable program or batch file.

C:\Users\bilal>

serene scaffold
round rapids
#

or am I missing something

serene scaffold
round rapids
#

or do you want another text, I don't get it

serene scaffold
round rapids
#

PS C:\Users\bilal\Desktop\coding> & C:/Users/bilal/AppData/Local/Programs/Python/Python313/python.exe "c:/Users/bilal/Desktop/coding/python with pandas/python.py"
Traceback (most recent call last):
File "c:\Users\bilal\Desktop\coding\python with pandas\python.py", line 1, in <module>
import pandas as pd
ModuleNotFoundError: No module named 'pandas'
PS C:\Users\bilal\Desktop\coding>

serene scaffold
#

@round rapids do this

C:/Users/bilal/AppData/Local/Programs/Python/Python313/python.exe -m pip install pandas
round rapids
#

thanks

round rapids
serene scaffold
round rapids
#

aight

main fox
#

@serene scaffold you're a saint of patience, helpfulness,and restraint

serene scaffold
#

@round rapids did that command work?

#

also what are you trying to do with pandas?

round rapids
serene scaffold
round rapids
serene scaffold
empty furnace
opal frigate
#

Hello

#

Can someone help me with my data science homework?

#

We're tasked with making the same output as this picture, but i can't get the position for the individual inset just right

lapis sequoia
opal frigate
lapis sequoia
#

Keep trying, that's what I say to people around me who are stuck, and right now that's you, a server-mate (idk if that term is correct or not)

silver hill
#

I personally have been doing a f tokenizer over and over again for almost 3 months now...hope this is the last version I will ever make because it's a true headache... it's still interesting though, but well, also tiring

lapis sequoia
opal frigate
opal frigate
lapis sequoia
#

Oh dear devil..

#

I remembered one incident where we had to take anhydrous for experiment and it was just 1± mg differencing the hell out of me, either 0.5 or 1.5 mg went more..

silver hill
#

Could anyone explain to me the mathematical theory behind a tokenizer?

#

I've got a system that counts chars, an arbitrary threshold (so far) that takes out too-frequent characters, then a compiler that looks up all possible character combinations with the surrounding characters, then it looks up the frequency of the combinations too. Then I tried normalizing their frequency by dividing it by the overall frequency of all elements with that length

#

Maybe I'm doing it wrong from the start. The thing I am aiming for is a dynamic threshold and tokenizer (in the long run)

mighty radish
#

Hi everyone,

I’m working on developing a web scraping API using Django to collect financial data and save it to a database that can be automatically downloaded. I’m looking for any guidance or step-by-step tutorials on:
• Setting up Django for web scraping
• Creating API endpoints to expose the scraped data
• Automating database downloads

If anyone has experience with this or knows of a comprehensive tutorial, I’d greatly appreciate your help!

Thanks in advance!

open slate
#

what asked ?

unkempt apex
#

ohh sorry wrong pin

#

my bad

tawdry plover
#

kaprekar numbers

glacial root
lapis sequoia
#

Question: How do you learn about and get into AI?

silver hill
#

AI is a really broad term

#

sorry to say this but it encompasses tons of processes and sub-processes

#

it goes from tokenization to neural network, its a really vast field, what do you need to know about in particular?

hollow pagoda
#

i think he wants to know how people learned and got into the subject programming wise

glacial root
#

hey guys, what are some good beginner projects to do with just numpy

silver hill
# lapis sequoia Yes

do you want to finetune an already existing model or instead want to do it all by yourself

silver hill
#

Start by a tokenizer

serene scaffold
#

start what with a tokenizer?

silver hill
#

An AI

#

He wants to make an AI from scratch

serene scaffold
#

if you're interested to learn about AI, tokenizers aren't a good place to start.

silver hill
#

Dang... nvm then

serene scaffold
silver hill
#

Text analysis and answer processing

serene scaffold
#

that's an incredibly narrow subset of AI.

silver hill
#

Whats yours then?

serene scaffold
#

Programs that emulate the application of knowledge.

silver hill
#

Thats also not correct

serene scaffold
#

It's correct.

silver hill
#

Because it sees the rules of text, not reality

serene scaffold
#

Do you not consider self-driving cars to be AI?

silver hill
#

It formulates rules that are derived from tons of text not from visual or other perceptions

silver hill
serene scaffold
#

formulates rules that are derived from tons of data
this is approaching a correct definition of machine learning

silver hill
#

Where does machine learning start?

#

Whats the first phase

serene scaffold
#

machine learning is where you have a computation graph whose state is determined by data

silver hill
#

First you look up too frequent characters right?

#

As they have a meaning by themselves

serene scaffold
#

you're still only thinking about NLP

silver hill
#

Im too new to this to understand it as a whole

serene scaffold
#

it's fine. NLP is the best one 😄

serene scaffold
# silver hill What do you mean by that?

the steps of the algorithm (the computation graph) are decided upon by humans, but the algorithm depends on values that are not set manually--they're "learned" from data.
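A tiny sketch of that distinction, assuming a straight-line model as the example: the steps (predict, measure the error, nudge the parameters) are written by hand, while `w` and `b` are never set manually. `fit_line` and the toy data are made up for illustration.

```python
def fit_line(xs, ys, lr=0.01, steps=5000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # the "learned" values start out arbitrary
    n = len(xs)
    for _ in range(steps):
        # Gradients of mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1; nobody tells the program the 2 or the 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(round(w, 2), round(b, 2))
```

Feed it different data and the same code ends up with different `w` and `b`, which is the "learned from data" part.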

silver hill
#

So far I've written code that reads a text and gets rid of too-frequent characters by storing them. Here I already run into a problem, because I can either look up the characters neighboring the target character or build combinations and look up their frequency

#

I also have trouble deciding how to make the threshold that decides whether something is too frequent dynamic

gray slate
silver hill
#

So far I got the overall frequency and normalized group and individual frequencies

gray slate
silver hill
gray slate
silver hill
#

Ye I'm attempting to make a tokenizer from scratch

serene scaffold
#

tokenizers often do actually require that you set the rules manually

silver hill
#

I'm already on version 3.2 of my project. I have been doing this for almost 4 months and I'm getting nowhere lol

silver hill
serene scaffold
lapis sequoia
#

If I want a job in the industry

serene scaffold
arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

silver hill
serene scaffold
gray slate
#

What's the purpose of it?

silver hill
gray slate
#

Yeah I mean what's your goal

#

experiment, produce something novel, just to learn?

lapis sequoia
gray slate
#

or a specific use case

serene scaffold
lapis sequoia
silver hill
#

I want it to identify tokens, and well, get somewhere at the very least. As long as I can make progress I will keep working on it

lapis sequoia
#

I could get into for example software engineering or cyber security by self learning or no

gray slate
#

Just make things that are useful and interesting, build a profile on GitHub

lapis sequoia
#

Or like what is something that’s really growing that’s not niche that I can self learn and make lots of money off of

serene scaffold
gray slate
#

Agentic systems probably.
I don't have a related degree but know devops, done gigs in data science that way. 'cause someone has to actually deploy the stuff if it's for industry

silver hill
#

What formula should I use to make a dynamic threshold to choose whether a token is too frequent?

gray slate
#

Too frequent for what?

silver hill
gray slate
#

Are you looking to make something novel, something that is better than what we have?

silver hill
serene scaffold
silver hill
#

if its too much text dm me

#

dont get auto banned for text wall

gray slate
#

Well consider that you can take a word and break it up into ["w", "o", "r", "d", "wo", "or", "rd", "wor", "ord", "word"]. Do that for all the words you see in some text that you care about understanding, count how many times each one shows up. Sort by the score. Choose a maximum list length and cut off at that point.
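Roughly, in code (a sketch only: whitespace splitting, lowercasing, and the function names are assumptions for illustration, and enumerating every substring gets expensive on long words):

```python
from collections import Counter


def substrings(word: str):
    """Yield every contiguous substring of word, shortest first."""
    for length in range(1, len(word) + 1):
        for start in range(len(word) - length + 1):
            yield word[start:start + length]


def build_vocab(text: str, max_size: int):
    """Count substrings over whitespace-split words and keep the most frequent."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(substrings(word))
    # most_common sorts by count, highest first.
    return [piece for piece, _ in counts.most_common(max_size)]


print(list(substrings("word")))
vocab = build_vocab("the cat sat on the mat", max_size=5)
print(vocab)
```

Here `max_size` plays the role of the threshold: instead of picking a frequency cutoff, you pick how long the list may be and let the counts decide what makes the cut.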

serene scaffold
#

!paste

arctic wedgeBOT
#
Pasting large amounts of code

If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.

gray slate
#

like "ize" and "ise" for example

silver hill
#

how do you differentiate structures of consecutive tokens from subparts that have to be fused into a token

gray slate
#

also, you could look at the average position where sentences containing a token end up in an LLM's latent space, take the distances between them, and use that to help order the list - maybe this would improve training times if you're making a model from scratch, because the token ids actually contain semantic data

gray slate
#

or I guess "tokenizing priority" first, whatever that turns out to be. Which is also an interesting problem

silver hill
#

guess that works

#

although I still believe that differentiating structures from tokens is a huge challenge

silver hill
# gray slate you check length-first when you tokenize

as you have to know when to check surrounding tokens to make structures and when to unify tokens based on frequency patterns...I'm not even sure if I should just get the general frequency of consecutive characters to fuse tokens that likely appear together, or go for a more roundabout approach dividing the text on a percentage basis

#

I find it a fascinating topic...anyway, too many hard questions that have been making my head hurt for quite a while. It's already horribly complicated altogether, at least for me. I have been trying to apply some normalizing and threshold-finding formulas that are already in use and their results weren't that great...maybe because it lacks input text. Maybe that's why I'm trying to read up on as many studies as I can from ppl who actually got somewhere

uneven pawn
#

Anybody have any experience in fine tuning LLMs? Wondering how good a model would be if I finetuned it on a codebase to fully understand it/I could ask questions about it

silver hill
uneven pawn
#

Na, it needs that higher level of reasoning and writing that you get with some of the top models

#

I don't really want it to just be repeating existing code

#

But to actually understand it

silver hill
#

ye guess you got a point there

silver hill
#

Llama?

uneven pawn
#

I'm not sure, I'm thinking llama or deepseek

#

But I'm not even sure how feasible finetuning deepseek is

silver hill
#

deepseek is too new. if you want more feedback you should get llama as it's a classic

#

there is a ton more information on the net about Llama I mean

uneven pawn
#

Right but isn't r1 also opensource?

#

So.. the reasoning capabilities would be a lot higher

#

Might be worth the effort

silver hill
#

guess that settles it then

#

btw, have you ever made a tokenizer?

uneven pawn
#

Yes in some simple programs

silver hill
#

how did you handle too-frequent characters or character combinations? how did you separate combinations of characters (fusing) from structures of those?

uneven pawn
#

I know there are entire GitHub projects that make it easy to finetune

#

And that tokenises your input

#

What do you mean by how did I handle too frequent characters? The data set, if its good, won't have too frequent characters

silver hill
#

I mean, have you ever built a tokenizer from nothing?

uneven pawn
#

Yes, character level based tho, very simple

silver hill
#

you know the datasets that LLMs read are real life books right? they read an immense amount of data

uneven pawn
#

Basically string to int lol

silver hill
#

yes you get the frequencies

#

but in tokenization you have to choose

#

whether a token is meaningful by itself or if it has to be combined

#

and that requires a threshold

#

wanted to ask how you did that because I cant figure it out

uneven pawn
#

Llama trains on huge amounts of data Yes but finetuning requires minimal data

silver hill
#

from nothing

uneven pawn
#

You can start with your wanted token count, say 10million

#

Then for each part/character only take the top N amount that fits into your required count

#

Obviously 10million is insanely high

silver hill
#

how do you choose what to take and what to not

#

there are many combinations that may be useless

uneven pawn
#

Frequency analysis

#

If the combination appears many times, that is what your data is showing

silver hill
#

and on what basis do you calculate the threshold that the analyzer uses

#

is there a formula?

uneven pawn
#

The amount of time you're willing to spend on compute and required time to train N amount of tokens

#

Obviously you can do it at character level and it will take 100x longer or large strings

#

It purely depends on how granular you want it

#

There isn't really a set margin I don't think, maybe somebody else can help ya there

silver hill
#

got it

#

thanks for the advice

#

I'm trying to make it as granular as possible haha...

gray slate
silver hill
#

I'm trying to get subword-level tokenization so that afterwards I can get sentence boundaries in place, e.g. "Dr." isn't a sentence end most of the time

silver hill
gray slate
#

numpy is python for working with numbers, it's data scientists making Python ugly but it's fast

silver hill
#

thats a concise answer

gray slate
#

What are you gonna use this tokenizer for btw? Like, break text into parts... and then what?

#

Because the design decisions all have trade-offs, and it's really common for people to aim for something they don't need at the expense of something they need later

harsh lagoon
#

hey guys is it possible to get the coordinates of detected objects in yolov5?

median nexus
#

Anyone familiar with Streamlit?

unkempt apex
median nexus
#

Trying to deploy my ML model but I keep on getting the 'ModuleNotFoundError', I have installed and provided all modules in requirements.txt, any idea how to debug it?

abstract basin
#

I want help with my Regression Project !

#

Can Anyone help ?

odd meteor
odd meteor
ashen blaze
#

Hey so I want to learn data analytics.
From YouTube, I've gathered that this field is mostly about making attractive dashboards in Power BI or Tableau.

#

So could any expert explain why Python or SQL is used?
Is it necessary to learn these 2?

normal grove
#

for example you could join a dataset in a way that part of the data is kept and only certain parts of another set are kept while the rest of the data is deleted. This is helpful in scenarios such as if maybe you are filtering out all data that has a duplicate, etc., or maybe you want to get rid of all blank accounts with nothing in them. im not sure if you would use that in a real scenario since i am just a college student, but this is what was relayed to me basically

ashen blaze
#

I use Tableau... So is it possible to work on it too?

normal grove
#

python can also automate some processes and is just a good general language overall because its compatible with a lot of things i believe. tl;dr: they make doing stuff faster

#

not sure on Tableau, since i honestly haven't learned it yet 😅

ashen blaze
#

I see

#

Are you a data analyst?

normal grove
#

nope im a college accounting student with an MIS minor looking to pursue a Data Science master's, i currently just work in tax

ashen blaze
#

Hmm would you like to give any opinion or advice since going into data related field?

Like how much SQL should I know or is python really necessary to learn? Since I can start early and learn it later on

normal grove
#

so it really depends on what type of "data" you want to do. there are a lot of different things that are labelled under those positions that are all a little different. data analyst is commonly used interchangeably with some terms. for example you might see it mixed in with business and occasionally financial analyst positions, basically if it lists using Python etc. in the job description it is most likely what you are looking for. im not sure too much on the big differences between data scientists and data engineers. it could be a good idea to do research online on youtube most likely of people in those professions documenting what its like and requirements to break into the field. data analysts though, i dont believe you usually need a degree but they might ask for one in a field related to it like statistics, mathematics, etc.

#

if you just want to be an analyst of some sort, so, a lot of those positions are basically working with datasets and communicating the results to teams or managers. so depending on the field what they ask you to know could be different. for example a financial analyst would be asked to make budget forecasts etc., i think these positions usually operate in Excel/PowerBI. like i said probably a good idea to look at some youtubers who are in the field most likely!

#

and yeah python/sql would probably be honestly the most basic things to learn ^^; i believe R and some other things are good later on that might be a bit more advanced too

ashen blaze
normal grove
# ashen blaze True both should be learned!

If you want to self-learn, I really like the GeeksforGeeks website projects. You could try taking notes on the basics, and then try recreating projects without looking at them. PyCharm Community Edition is great for practicing. I also watch a lot of Python Programmer's youtube channel and he has a lot of project/video basics as well. I honestly haven't been practicing SQL too hard but I probably should because I have an exam this week over some of it 😭 but yeah hope this helps!!

young granite
#

maybe a few more remarks on ur initial question.
Data analytics can differ, as Redd stated already; there are many different job roles where those analytics methods are put to work.
Business -> BI-Reports (mainly PowerBI)
R&D -> ML/NN (Python, Azure, SQL)
Production -> SCADA

and so on

#

so id say bare minimum is SQL

ashen blaze
young granite
#

as you get more experienced you will be confronted with cloud, kubernetes, new frameworks etc.

ashen blaze
young granite
#

!resources

arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

young granite
normal grove
#

and sql honestly, but im familiar with powerbi. i mean its not hard to learn powerbi tbh

young granite
ashen blaze
young granite
#

SQL is an easy language, even easier than python

normal grove
#

oh i think one of my graduate courses i might take at one of my target schools actually uses one of those books since the course name is similar to the book name

young granite
#

u can get SQL basics in like 1 day and have 60% of the whole lang. already learned

ashen blaze
young granite
#

what u mean by align data and hardware?

ashen blaze
normal grove
#

yeah we had 1 day in my AIS class where we discussed different SQL commands you can use that i think ill be tested on D: and then make a project later i believe. just 1 day lmao. i wanted to take a full course on it later on anyways though. my professor was previously a software engineer for tax software services in SoCal so thats probably why, he likely thinks we wont need to know too much more than what he showed us

ashen blaze
#

But either ways how much SQL should I know?
Like in terms of context

young granite
#

the basics would be how SQL works, e.g. what is a table, how to aggregate data, applying filters
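Those three basics fit in a few lines. A sketch using Python's built-in `sqlite3` so there's nothing to install; the `sales` table and its values are made up for illustration:

```python
import sqlite3

# In-memory database so the sketch leaves nothing on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 50.0), ("south", 30.0)],
)

# Filter rows (WHERE), then aggregate per group (GROUP BY + SUM).
rows = conn.execute(
    """
    SELECT region, SUM(amount) AS total
    FROM sales
    WHERE amount >= 50
    GROUP BY region
    ORDER BY region
    """
).fetchall()
print(rows)  # [('north', 200.0), ('south', 50.0)]
conn.close()
```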

#

understanding the concepts of relational db vs non-relational

ashen blaze
# young granite what u mean by align data and hardware?

Hmm to do something special yk.
Like making a server and clouds.
I see Data analytics as a way to understand SQL , python and machine in a beginner way.

I still haven't figured it out but I have major interest in this field

ashen blaze
young granite
#

so u wanna be a sys admin?

ashen blaze
#

System admin?

young granite
#

o'reilly Learning SQL, 3rd Edition

ashen blaze
#

Idk that but like server related or database engineer

lapis sequoia
#

a book i accidently stole from the airport and read the entire thing on flight good read 10/10

ashen blaze
lapis sequoia
#

and the thief does not read

#

but i definitely put it in my bag and forgot to return it so

ashen blaze
#

💀💀💀 bruv

#

THATS CLEARLY STEALING

ashen blaze
lapis sequoia
#

but ay the book taught me sum good shit

#

tested it out at home

ashen blaze
#

What's your house address so that I can steal it?

lapis sequoia
#

55 bucks i just searched up the price 😭

lime grove
#

up to 1000 dimensions for some of the sets.

#

These are nice, clean sets with ground truths that somewhat remove the dataset hassle from algorithm development

granite agate
#

What are the latest versions of Python and CUDA that both TensorFlow and PyTorch support? I have an RTX 3060 and I need them to run on the GPU

serene scaffold
granite agate
agile cobalt
granite agate
#

It says u need WSL only if u use 2.11, so I can use 2.10 for native Windows support. So I just need to get the highest CUDA possible which is compatible with both TensorFlow 2.10 and PyTorch

lime grove
#

Have you encountered any WSL-specific bugs within CUDA?

#

I recall running into some strange issues a few years back, some of the examples they provided wouldn't compile. But I haven't really looked into things since then

granite agate
#

The only time I used linux was in college when I had to use Ubuntu. I thought I could get away without touching linux but here we are

lime grove
#

Linux is superior.

#

this is sort of tangential to the channel topic, but I really dislike Windows and the complete mess it represents. Sorry - won't talk about this again 😄

agile cobalt
granite agate
#

Well I guess it is a good time to start learning then

lime grove
#

btw, my gaming laptop has an RTX 3070 laptop GPU. I think it is enough for practice, but not production

agile cobalt
#

no mods I think, but I am not even sure about what mods would be in this case? overclocking?

#

the first time I installed it, it worked without any problems

later on after I installed more things was when things got weird, not sure if I broke apt at some point

lime grove
#

I am not sure either. I added that question out of my own personal ignorance. Maybe mods are possible, but not sure

opaque condor
opaque condor
karmic void
#

Hello guys, I am trying to get into data science by making projects. Can anyone tell me what projects I can create, I know the usual matplotlib, pandas, numpy and plotly. I am bored of creating graphs and all, is there any other thing I can do using these?

echo yacht
#

hellooo pandas question D:

#

i cant figure out how to merge two datasets on the index if they have different row counts (patients), one of them has significantly more than the other but im trying to keep the rows to match the smaller set to be more conservative
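In case it helps, one way this could look is an inner join on the index. A sketch with made-up patient tables; the column names and IDs are assumptions for illustration:

```python
import pandas as pd

# Hypothetical patient tables indexed by patient ID.
labs = pd.DataFrame(
    {"glucose": [5.1, 6.3, 4.8, 7.0, 5.5]},
    index=pd.Index([101, 102, 103, 104, 105], name="patient_id"),
)
outcomes = pd.DataFrame(
    {"readmitted": [True, False]},
    index=pd.Index([102, 104], name="patient_id"),
)

# how="inner" keeps only index values present in both frames, so with
# unique indexes the result has at most as many rows as the smaller table.
merged = labs.merge(outcomes, left_index=True, right_index=True, how="inner")
print(merged)
```

`how="left"` or `how="right"` would instead keep every row of one side and fill the gaps with NaN, which is the less conservative option.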

undone ridge
#

I am looking for an experienced developer in Python openCV and .NET programming.
If you have the ability, you should work on a project related to image processing.
If you have the ability, please contact me.

jaunty helm
serene trellis
#

yo anyone familiar with RL?i think i fumbled but im not sure what could be the reasons

#

kinda feels like its trying random stuff instead of exploiting a possible strat

noble arch
keen perch
#

I need help installing tensorflow GPU for windows I installed wsl, cuda toolkit and Nvidia drivers, what else should i do, I need detailed explanation please help

valid swift
#

ok

fleet marlin
#

bro can someone help me navigate im currently learning python as my first language my goal is to get into AI/Ml what should i do after learning python can someone explain me?

unkempt apex
serene trellis
#

i used bibit algo for reducing state space by making states biclusters of user -item matrix instead of items

#

but the performance varies wildly by user

#

and im not exactly sure why that is, since the Q-learning and gridworld are the same (same as in the same for every user)

#

it should start low and plateau at 60 but it starts going crazy for some users

granite agate
#

Is cudnn 9.7.1 only available for windows 10? is there one for windows 11

agile cobalt
lucid hornet
#

Are pretty much all of the common models out there based on data that only goes up to Oct '23 at the latest? It feels weird to me

#

The only one I found was "Meta Llama 3.3 70B" which goes up to December '23

agile cobalt
lucid hornet
#

Ah true, I didn't think about that

agile cobalt
#

most models from companies with $$$ to spend include some premium training data, like the partnerships OpenAI has been doing, or otherwise include content scraped from sources they probably had better not admit they are using

lucid hornet
#

With the current administration, who knows if it'll matter, though

#

inb4 OpenAI is given access to NSA data

agile cobalt
#

well... at least as far as US goes I wouldn't be surprised if they damaged themselves just to try and fail to harm Deepseek

lucid hornet
#

That still cracks me up

#

The hypocrisy is so strong

fading wigeon
#

Weell, you see, Chat GPT is good. But Deepseek is bad.

#

Because uh....

#

Well, let me ask Chat GPT to answer for you, one sec

craggy agate
unkempt apex
gritty vessel
#

Hey

gritty vessel
#

Is there a guide to install ollama with different backend?

#

In llama.cpp docs they have mentioned metallium backend I want to install the same backend on ollama

#

Are there any guide for that?

scarlet anchor
tropic sphinx
#

Hey everyone! 👋

I’m working on a Python package that needs to automatically track transformations applied to pandas, NumPy, and scikit-learn. The goal is to detect when a dataset is modified without requiring the user to write extra code or manually call tracking functions.

The main challenge is finding a method that works seamlessly while ensuring all meaningful changes are detected.


🔹 What I Want to Achieve

  • Automatically track modifications when a user applies transformations like df.fillna(), df.drop_duplicates(), or sklearn_pipeline.fit(X).
  • Ensure minimal code changes for the user—ideally, they just import the package and work with pandas/NumPy as usual.
  • Detect in-memory modifications, including df.iloc[0, 1] = 5 or array[2] = 100, without requiring the user to explicitly log them.
  • Avoid major performance overhead—the tracking system should be lightweight and not slow down computations.

🔹 Approaches I’ve Considered

Proxy Wrapping (Overriding pandas, NumPy, and scikit-learn Methods)

  • Override common transformation functions (fillna(), drop_duplicates(), apply(), fit(), transform()).
  • Pros: Works transparently, no user interaction needed.
  • Cons: Every method has to be overridden individually!

🔹 What I Need Help With

  • What other approaches would you suggest for tracking pandas/NumPy transformations **(almost) without user interaction**?
  • How would you track inline modifications (df.iloc[...] = 5) without modifying user code too much?
  • What’s the most efficient way to track changes while avoiding performance overhead?

Would love to hear your thoughts on how you’d approach this! 🚀 Thanks in advance for any insights! 🙌
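
A minimal sketch of the proxy-wrapping idea, using a plain list as a stand-in for a DataFrame or array. `TrackedProxy` and its `_log` attribute are hypothetical names made up for this illustration, not part of pandas, NumPy, or scikit-learn. It forwards attribute access to the wrapped object, records method calls by name, and records item assignment.

```python
import functools


class TrackedProxy:
    """Hypothetical wrapper that records method calls and item assignments."""

    def __init__(self, target, log=None):
        self._target = target
        self._log = log if log is not None else []

    def __getattr__(self, name):
        # Only reached for attributes that are not defined on the proxy itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        @functools.wraps(attr)
        def wrapper(*args, **kwargs):
            self._log.append(name)
            result = attr(*args, **kwargs)
            # Re-wrap results of the same type so chained calls stay tracked.
            if isinstance(result, type(self._target)):
                return TrackedProxy(result, self._log)
            return result

        return wrapper

    def __setitem__(self, key, value):
        # In-place writes such as data[0] = 10 land here.
        self._log.append(f"__setitem__[{key!r}]")
        self._target[key] = value

    def __getitem__(self, key):
        return self._target[key]


data = TrackedProxy([3, 1, 2])
data.sort()            # logged as "sort"
data[0] = 10           # logged as "__setitem__[0]"
copied = data.copy()   # logged as "copy"; the result is wrapped again
```

With pandas, `df.iloc[0, 1] = 5` goes through the indexer object returned by `df.iloc`, so a real version would probably need to proxy that indexer as well. The sketch also says nothing about overhead.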

keen perch
#

I need help installing TensorFlow with GPU support on Windows. I installed WSL, the CUDA toolkit, and the NVIDIA drivers. What else should I do? I need a detailed explanation, please help

harsh lagoon
#

hey guys how do I use this?

#

where should i put it?

#

I want to know the coordinates using detect.py

#

I'm new to python so I really don't know what most of these do

unkempt wigeon
#
import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as transforms

import matplotlib.pyplot as plt

#device configuration
device = torch.device('cuda' if  torch.cuda.is_available() else "cpu")


#hyper paramiters
input_size = 784 # 28X28=784
hidden_size = 100
num_classes = 10
num_epochs = 2
Batch_size = 100

learn_rate = 0.0001

trainging_datasets = torchvision.datasets.MNIST(root='./data',train=True,
                                               transform=transforms.ToTensor(), download=True)


tests_datasets = torchvision.datasets.MNIST(root='./data',train=False,
                                               transform=transforms.ToTensor())

train_loader = torch.utils.data.DataLoader(dataset=trainging_datasets, batch_size=Batch_size,
                                           shuffle=True)


test_loader = torch.utils.data.DataLoader(dataset=tests_datasets, batch_size=Batch_size,
                                           shuffle=False)


exampels = iter(train_loader)
samples, labels = exampels.next(exampels)
print(samples.shape,labels.shape)
#

how come there is an attribute error
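
A likely cause, sketched with a plain list standing in for `train_loader` (an assumption, so the snippet runs without PyTorch): Python 3 iterators have no `.next()` method, so the call has to be the built-in `next(iterator)`. Older tutorials sometimes still show `.next()`.

```python
# Stand-in for iter(train_loader); each item mimics a (samples, labels) batch.
batches = iter([("samples_0", "labels_0"), ("samples_1", "labels_1")])

# Calling .next() on a Python 3 iterator raises AttributeError.
try:
    batches.next()
except AttributeError as err:
    error_message = str(err)

# The built-in next() is the supported way to pull one item.
samples, labels = next(batches)
```

Applied to the code above, the line would become `samples, labels = next(exampels)`.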

true flicker
#

Hi

unkempt wigeon
#

hello

serene scaffold
unkempt wigeon
unkempt wigeon
#

its in the paste bin

serene scaffold
unkempt wigeon
#

thank you

#

n_correct = (predictions == labels).sum().item()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'bool' object has no attribute 'sum'

serene scaffold
unkempt wigeon
#

it is not finding a sum even though I put .sum, following the tutorial

serene scaffold
#

and tell me what it says

unkempt wigeon
#

printed output:
<class 'torch.Tensor'> <class 'torch.Tensor'>

unkempt wigeon
odd meteor
unkempt wigeon
#
epoch -1 / 2, step 200/600, loss = 1.3214
epoch -1 / 2, step 300/600, loss = 0.8638
epoch -1 / 2, step 400/600, loss = 0.8123
epoch -1 / 2, step 500/600, loss = 0.6779
epoch -1 / 2, step 600/600, loss = 0.4998
epoch 0 / 2, step 100/600, loss = 0.5865
epoch 0 / 2, step 200/600, loss = 0.4572
epoch 0 / 2, step 300/600, loss = 0.3157
epoch 0 / 2, step 400/600, loss = 0.5043
epoch 0 / 2, step 500/600, loss = 0.4080
epoch 0 / 2, step 600/600, loss = 0.3515

it does not update the epoch and it starts with -1 instead of 1

unkempt wigeon
#

yes but it is not updating the epoch

serene scaffold
unkempt wigeon
#

what do i need to do to update the epoch?
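
A small sketch of how the epoch label behaves, assuming the outer loop is `for epoch in range(num_epochs)` as in most tutorials. `range(2)` yields 0 and 1, so printing `epoch - 1` would give -1 and 0, while `epoch + 1` gives the 1-based labels 1 and 2. The list `progress_lines` is only there to collect the output for inspection.

```python
num_epochs = 2
n_total_steps = 600

progress_lines = []
for epoch in range(num_epochs):        # epoch takes the values 0, 1
    for i in range(n_total_steps):     # i takes the values 0 .. 599
        if (i + 1) % 100 == 0:
            # Add 1 to both counters so the printed labels start at 1.
            progress_lines.append(
                f"epoch {epoch + 1} / {num_epochs}, step {i + 1}/{n_total_steps}"
            )
```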

#

it does update in the positive

odd meteor
unkempt wigeon
#
print(f'epoch {epooch+1} / {num_epochs}, step {i+1}/{n_total_steps}, loss = {loss.item():.4f}')
odd meteor
# serene scaffold `AttributeError: 'bool' object has no attribute 'sum'` indicates that ` (predict...

Yeah, I initially suspected it's probably a case where Wendigo might have mistakenly made predictions and labels a rank 0 tensor instead of a rank 1 tensor. But then i also noticed in his code that s/he used torch.max(outputs, 1) instead of torch.argmax(outputs, dim=1)

torch.max(outputs, 1) returns a (values, indices) tuple instead of just the predicted labels. I usually prefer using torch.argmax() in this case because it's the safer option.

unkempt wigeon
#

Im just following a tutorial

odd meteor
unkempt wigeon
#

epoch wont update

odd meteor
unkempt wigeon
#

epoch 1 / 2, step 100/600, loss = 1.7161
epoch 1 / 2, step 200/600, loss = 1.2840
epoch 1 / 2, step 300/600, loss = 0.8888
epoch 1 / 2, step 400/600, loss = 0.7217
epoch 1 / 2, step 500/600, loss = 0.6352
epoch 1 / 2, step 600/600, loss = 0.4856
epoch 2 / 2, step 100/600, loss = 0.5465
epoch 2 / 2, step 200/600, loss = 0.3712
epoch 2 / 2, step 300/600, loss = 0.4181
epoch 2 / 2, step 400/600, loss = 0.3767
epoch 2 / 2, step 500/600, loss = 0.4154
epoch 2 / 2, step 600/600, loss = 0.4518
accurecy = 0.8600000143051147

odd meteor
odd meteor
unkempt wigeon
#

it's updating correctly now???

odd meteor
unkempt wigeon
#

how can i test my forward network?

unkempt wigeon
odd meteor
# unkempt wigeon how can i test my forward network?

The last part in your code that's using a context manager has already handled the evaluation of your MLP (feed forward NN) on your test-set; hence, the reason you got accuracy= 0.86

with torch.no_grad():
    n_correct = 0
    n_samples = 0

    for images, labels in test_loader:
        images = torch.flatten(images, start_dim=1).to(device) #<--flatten image to a 1D vector
        labels = labels.to(device)

        logits = model(images) #<---Your feed-forward / forward propagation 

        predicted_labels = torch.argmax(logits, dim=1) #<--- I used 'argmax' here instead of 'max' to get predicted labels

        labels = labels.view_as(predicted_labels)  #<--- ensure labels match shape of predicted_labels

        # Accumulate sample and correct-prediction counts per batch
        n_samples += len(labels)
        n_correct += torch.sum(predicted_labels == labels).item()
    # Compute the average accuracy over all batches

    final_acc = 100.0 * n_correct / n_samples
    print(f'Accuracy = {final_acc:.2f}%')
unkempt apex
unkempt wigeon
odd meteor
#

About using Tkinter to draw a digit and have your trained model have a go at it, yeah, I believe it's possible as well. I haven't done something like that myself but yeah it's possible.

unkempt wigeon
unkempt wigeon
odd meteor
unkempt wigeon
odd meteor
unkempt wigeon
#

Where would I implement it at the end inside of the loop or in the main body?

unkempt wigeon
#

What earlystop should I go for in training on average
To avoid overfitting?

cerulean rain
#

hello, I am trying to create a neural network from scratch using numpy, but I am kinda lost on building an optimiser, I don't know how I should implement that... if anyone can give me any info, that would be super helpful! thanks!
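
One possible shape for a from-scratch optimiser, as a rough sketch: an object that holds its own state (here a velocity per parameter) and exposes a `step(params, grads)` method. The class name `SGDMomentum` and the toy loss are made up for this example, and plain Python lists stand in for NumPy arrays so it runs without dependencies. The same update rule applies elementwise to arrays.

```python
class SGDMomentum:
    """SGD with classical momentum over a flat list of parameters."""

    def __init__(self, lr=0.1, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.velocity = None  # created lazily on the first step

    def step(self, params, grads):
        if self.velocity is None:
            self.velocity = [0.0] * len(params)
        new_params = []
        for i, (p, g) in enumerate(zip(params, grads)):
            # v <- momentum * v - lr * grad ; p <- p + v
            self.velocity[i] = self.momentum * self.velocity[i] - self.lr * g
            new_params.append(p + self.velocity[i])
        return new_params


# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
params = [0.0]
opt = SGDMomentum(lr=0.1, momentum=0.9)
for _ in range(300):
    grads = [2.0 * (params[0] - 3.0)]
    params = opt.step(params, grads)
```

In a network, `grads` would come from your backward pass, and the optimiser would be called once per batch.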

unkempt wigeon
odd meteor
# unkempt wigeon What is the limit that I should set to avoid overfitting?

There is no one-size-fits-all answer. It depends on multiple factors such as dataset size, model complexity, and the problem you're solving. I could have a tolerance level of 3 while yours could be 9. In my case, if validation loss doesn't improve after 3 epochs, EarlyStopping will be triggered which ultimately will terminate the model from training further.
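
A minimal sketch of the patience logic described above. The `EarlyStopping` class here is a hand-written illustration, not a specific library's API, and the validation losses are made-up numbers. The check runs once per epoch, after the validation loss for that epoch is known.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]  # illustrative values
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    # In real training, the train and validation passes for this epoch go here.
    if stopper.step(loss):
        stopped_at = epoch
        break
```

With these numbers, training stops after epoch 6, three epochs after the best loss of 0.60, and never sees the 0.50 that would have come next. That is the trade-off a small patience value makes.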

unkempt wigeon
#

To be honest, I forgot and hit retrain, and now I'm up to 98% accuracy, which I'm surprised actually worked. That's why I was wondering: I didn't realize it, and instead of interrupting the network I just let it keep going. Do you think it's overfitting a little now?

tough ingot
#

Which tool can make a node-link graph for big data? I tried Pyvis and Sigma.js, but there were so many connections that they formed white blobs

lime grove
#

Maybe you could coarse grain the graph, and represent that instead?

#

as things stand, visualizing the whole thing in one go is just not a good idea

#

You can perform a type of renormalization wherein you could, for instance, only represent nodes with a certain connectivity greater than N
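
A rough sketch of that kind of filtering on a plain edge list. The function name `filter_by_degree` and the toy edges are invented for illustration. It keeps only nodes whose degree is greater than N, then drops every edge that touches a removed node.

```python
from collections import Counter


def filter_by_degree(edges, n):
    """Keep only edges whose two endpoints both have degree greater than n."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    keep = {node for node, d in degree.items() if d > n}
    return [(u, v) for u, v in edges if u in keep and v in keep]


edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e"), ("e", "f")]
kept = filter_by_degree(edges, 1)  # drops "f", the only node with degree 1
```

The reduced edge list can then be handed to whichever plotting tool you already use.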

#

I actually thought that these two images were some sort of a burlap cloth.

#

Or maybe represent only neighborhoods? there are a number of interesting things you could do that would convey information more efficiently than this data dump

odd meteor
tough ingot
#

The white are the Links / connections between nodes

odd meteor
unkempt wigeon
odd meteor
main fox
#

Could also use Patience, e.g. if no improvement in test loss after x amount of epochs

tough ingot
#

So instead of generating one graph.html, I create many graph.html files (one per cluster) and then build a graph.html that combines all the clusters into one graph.

odd meteor
# unkempt wigeon what should i make next?

Since you've implemented EarlyStopping in your MLP, you might wanna move to CNN next.

Well, before moving to CNN, pick two different datasets (one tabular dataset and one image data) and practise what you just learned by training a NN with MLP.

#

Once you've trained an MLP on a new image dataset, preferably one with 3 color channels, try training a CNN model on the same dataset. Hopefully, this will help you see and understand why CNNs tend to outperform MLPs on images.

unkempt wigeon
#

MLP?

unkempt wigeon
odd meteor
unkempt wigeon
opaque condor
orchid light
#

Bruh, I made a GPT-like transformer but it sucks, it learns so slowly......

#

Could I get any tips? I tried everything: changing the learning rate, optimizers, model parameters, datasets, vocab... and my model is just stuck

orchid light
#

I just switched from a character-level vocab to subword and it just learns so slowly

opaque condor
# orchid light

Remember, learning takes time, especially in this case: the computer has to translate what humans mean in language, then convert that into vectors that it can draw lines between to find the words that are appropriate for what is on the graph

cerulean kayak
#

So I just found out that xscale and yscale exist in matplotlib.pyplot.

Basically, is this a function that you guys use often? Because this seems like a function that could be a big game changer, yet I've never seen it used before today.

please at me if you have anything.

odd meteor
odd meteor
calm thicket
#

it's impossible in practice because there is noise

limpid bronze
#

I want to build a mobile app for real-time detection. Should I use yolov8n.pt or yolov8s.pt, or any other YOLOv8 model?

jaunty helm
tidal bough
#

> Basically, is this a function that you guys use often?
I use logarithmic scales often, so yes

sour kelp
#

hello, I am looking for a partner to learn DSA with me using python and on intermediate level

opaque condor
orchid light
#

I train on a good dataset called FineWeb 10BT (10 billion tokens), and my character-level model (with fewer parameters) did better (loss 0.8) than my new model, which can't get below a loss of 4.6.
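
One thing worth checking before comparing those two numbers: cross-entropy is measured per token, so a character-level loss and a subword loss are not on the same scale. A rough sketch of normalising both to loss per character follows. The figure of 4.0 characters per subword token is purely an assumption for illustration; the real average depends on your tokenizer and data.

```python
def loss_per_char(loss_per_token, chars_per_token):
    """Convert a per-token cross-entropy (in nats) to a per-character one."""
    return loss_per_token / chars_per_token


# Character-level model: one token is one character.
char_model = loss_per_char(0.8, 1.0)

# Subword model: 4.0 characters per token is an ASSUMED average, not measured.
subword_model = loss_per_char(4.6, 4.0)
```

Under that assumption the subword model is still behind the character-level one per character, but the gap is far smaller than 0.8 versus 4.6 suggests.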

#

My model graph isnt even close to this

#

I just dont know what to do anymore

jaunty helm
# orchid light

from my limited knowledge, I think llms usually only train for a few (< 5) epochs, on a very big (trillion tokens) dataset

#

also, maybe dynamic learning rates as training goes on?
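
A sketch of one common form of dynamic learning rate: linear warmup followed by cosine decay. The function name `lr_at` and every number here (peak and floor rates, warmup length, total steps) are placeholder assumptions, not recommendations for any particular model.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, total_steps=1000):
    """Learning rate for a given training step (0-based)."""
    if step < warmup_steps:
        # Linear warmup from near zero up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + cosine * (max_lr - min_lr)


schedule = [lr_at(s) for s in range(1000)]
```

Each training step would then set the optimizer's learning rate to `lr_at(step)` before the update.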

orchid light
#

I also tried turning it off but it didnt help

orchid light
#

I could give u my code if u want to look for some errors

jaunty helm
thorn flame
#

What's an alternative library to sentence-transformers for creating embeddings

#

That supports python 3.8

rich moth
#

!paste

arctic wedgeBOT
#
Pasting large amounts of code

If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the Paste! button in the bottom left, or by pressing CTRL + S. After doing that, you will be navigated to the new paste's page. Copy the URL and post it here so others can see it.

cerulean kayak
# tidal bough > Basically, is this a function that you guys use often? I use logarithmic scale...

So in general, how do you know that scaling the graph will give you a better idea of what you want to illustrate?
Because my problem is I heard of it while doing a tutorial where they were like "Oh geez. You should really scale it to see what I mean", and I'm wondering "How the heck would I know to do that by myself?"

Sorry this is such an open ended question: I typed the function in on YouTube and expected to find several videos on when and when not to use it and found very little.

calm thicket
#

if you know your data has a certain distribution, you could scale it to make it more obvious or show up better. a long-tailed distribution is hard to visualize without a log scale because everything will be clumped on the left. but with scaling, it will look normal

#

or if the data follows a power law, you could use a log log scale to show that
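
A small worked check of why that works, using a made-up power law y = 3x². Taking logs gives log y = log 3 + 2 log x, which is a straight line with slope 2 in log-log coordinates. The sketch computes that slope numerically instead of plotting it.

```python
import math

xs = [1, 2, 4, 8, 16]
ys = [3 * x ** 2 for x in xs]  # made-up power law with exponent 2

log_x = [math.log10(x) for x in xs]
log_y = [math.log10(y) for y in ys]

# Slope between consecutive points in log-log space; constant for a power law.
slopes = [
    (log_y[i + 1] - log_y[i]) / (log_x[i + 1] - log_x[i])
    for i in range(len(xs) - 1)
]
```

Every slope comes out at the exponent, 2. So if a plot with both `xscale` and `yscale` set to `"log"` looks like a straight line, that hints at a power law.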

cerulean kayak
tidal bough
#

interactive plots are also very useful for exploring data

cerulean kayak
mild dirge
unkempt wigeon
#

for i (images, labels) in enumerate(train_loader):
^^^^^^^^^^^^^^^^^^
SyntaxError: cannot assign to function call
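
The likely culprit is a missing comma. Python reads `i (images, labels)` as a call to `i`, hence "cannot assign to function call". A sketch with a plain list standing in for `train_loader` (an assumption, so it runs without PyTorch):

```python
# Stand-in for a DataLoader: each item mimics an (images, labels) batch.
train_loader = [("images_0", "labels_0"), ("images_1", "labels_1")]

seen = []
# Note the comma between i and the (images, labels) tuple.
for i, (images, labels) in enumerate(train_loader):
    seen.append((i, images, labels))
```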

serene scaffold
gray shard
#

lads what is your opinion on data camp as a learning resource?

#

I feel some of the concepts are slightly rushed and not explained as well

opaque condor
serene scaffold
opaque condor
#

No my brain is just confused sometimes I am feeling what people might be thinking very tired

serene scaffold
opaque condor
#

Yep

lime grove
#

chatroom paranoia has remained the same for around 4 decades. Are you, or are you not a sockpuppet, sir?

serene scaffold
lime grove
#

sure, but regardless of those details, the point remains. Autoconfusion is a giveaway, but also writing style.

gray shard
#

for the pd.read_csv() function there is an argument called na_values? anyone know what this is?

serene scaffold
#

!docs pandas.read_csv

arctic wedgeBOT
#
pandas.read_csv(filepath_or_buffer, *, sep=<no_default>, delimiter=None, header='infer', names=<no_default>, index_col=None, usecols=None, dtype=None, engine=None, ...)
Read a comma-separated values (csv) file into DataFrame.

Also supports optionally iterating or breaking of the file into chunks.

Additional help can be found in the online docs for [IO Tools](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).
serene scaffold
#

take a look and you'll see what that argument is for.

gray shard
#

ahhhh ok thank you! where can I learn more about the bot commands here in this server

serene scaffold
gray shard
serene scaffold
#

I'm a lad now?

gray shard
#

if ur girl thanks lass

serene scaffold
#

I don't care if people think I'm a girl as long as they think I'm a pretty girl.

gray shard
#

only pretty girls out of the female category know data science and ai

jaunty helm
neat bloom
#

do y'all have any recommended books for newbies?

#

pls ping me if y'all know, i will genuinely forget

odd meteor
hearty rampart
odd meteor
unkempt wigeon
serene scaffold
lapis sequoia
#

so basically for our project we will be having to use some models and we decided on LSTM/ RNN as its time series

tmrw we have a meeting with our mentor and we r supposed to have done some research or have a demo model or anything

so basically main PS is
like aiml powered smart energy management system
be it for home or office anything
so like we r supposed to collect data through IoT devices
and there will be 3 models:
consumption prediction
anomaly detection
generation prediction

original idea was to make it for gated communities then thermostat, AC use case for offices came later we just combined and made it general

idk what exactly to prepare for this. Like my teammate is going through a github project that has like LSTM for some finland electricity consumption thingy

im trying to go through research papers but any inputs ideas etc? what models can be used and metrics to be kept in mind. we talked to our professor and she suggested the RNNs / LSTMs

tidal bough
woeful escarp
#

Hello, I am starting in ML, I would like to work in a project to improve, send me DM

worldly wagon
#

hi does polars have tuple support? can't find anything on it

main fox
# unkempt wigeon nope

Your class definition is all messed up

class ConvNet(nn.Module):
    def __init__(self,):
        self.conv1 = nn.Conv2d(3, 6,5)#r channels
        self.pool = nn.MaxPool2d(2,2)
        self.conv2= nn.Conv2d(6,16,5)
        self.fc1 = nn.Linear(16*5*5, 120)
        self.fc2= nn.Linear(120, 84)
        self.fc3= nn.Linear(84, 10)
        x = conv1

    def forward(self,x):
        pass

Revise this part first

unkempt wigeon
weary timber
#

is there a decent free website i can train my llm's

#

?

serene scaffold
weary timber
#

to be exact

serene scaffold
weary timber
#

from scratch

main fox
#

You didn't call super().__init__(), so nn.Module never gets initialized and your layers can't be registered

You don't seem to have a step for flattening, so your Linear layer won't be able to take the outputs of the conv layer

x = conv1 should not be in your __init__() (and there it would have to be self.conv1 anyway)

You don't have any activation functions, so the network can't learn any non-linear patterns

Your forward pass doesn't do anything, so your model won't process inputs
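
On the flattening point, a worked sketch of where the `16*5*5` in `fc1` comes from. It assumes 32x32 input images (as in CIFAR-10) and a forward pass that applies the pooling layer after each conv layer. Neither is stated in the code above. The helper names are made up for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a conv layer on a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel, stride):
    """Output width/height of a max-pool layer on a square input."""
    return (size - kernel) // stride + 1


size = 32                      # ASSUMED input width/height
size = conv_out(size, 5)       # conv1, kernel 5: 32 -> 28
size = pool_out(size, 2, 2)    # pool 2x2:        28 -> 14
size = conv_out(size, 5)       # conv2, kernel 5: 14 -> 10
size = pool_out(size, 2, 2)    # pool 2x2:        10 -> 5

flat_features = 16 * size * size  # 16 channels from conv2
```

Under those assumptions the conv stack produces 16 channels of 5x5, which has to be flattened to 400 values per sample before it reaches `fc1`.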

serene scaffold
weary timber
serene scaffold
# weary timber so i cant test my own model :(?

you can create some models that aren't LLMs from scratch, and you can fine-tune existing LLMs. Only a handful of very wealthy organizations with tons of training data can create LLMs from scratch.

weary timber
#

i want to make a small one couldnt i ?

serene scaffold
#

I suppose, but it probably wouldn't be able to respond coherently to any prompt.

weary timber
#

okay then ty

#

i was ready to work on this for a full week

#

to waste all my time

serene scaffold
#

Sorry I don't have better news

#

There are still beginner ML projects that you can do. But they probably won't involve LLMs.

weary timber
#

can you tell me one in nlp?

#

i made countless project on image classification and stuff

serene scaffold
main fox
#

Image captioning
Stack a CNN on top of a RNN

weary timber
#

okay thanks

unkempt wigeon
serene scaffold
weary timber
main fox
unkempt wigeon
#

how can i disable that cursor?

main fox
#

By indenting properly

unkempt wigeon
#

in the white block

main fox
#

Press tab there, yeah

weary timber
#

on your keyboard

unkempt wigeon
#

thank you

#

thank you

unkempt wigeon
dense needle
#

Used an existing package that already had a tokenizer and tools for making document feature matrices

#

And then I implemented the prediction method just to get a feel for how ngrams work

#

Not LLM level stuff but it was doable and I learned a lot

#

Point being I think you could do something small like that
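
For a feel of how small such a project can be, here is a minimal bigram sketch in plain Python. The function names and the toy sentence are invented for this example, and the prediction rule (pick the most frequent follower) is only one simple choice.

```python
from collections import Counter, defaultdict


def train_bigrams(tokens):
    """Count, for each word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigrams(tokens)
prediction = predict_next(model, "the")
```

Here "the" is followed by "cat" twice and "mat" once, so the prediction is "cat". Longer n-grams work the same way with tuples of words as keys.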

unkempt wigeon
unkempt wigeon
main fox
#

Matter of fact, I mentioned changes in my previous reply when you asked

unkempt wigeon
unkempt wigeon
agile cobalt
sour kelp
#

hello, I am looking for a partner to learn DSA with me using python and on intermediate level

serene scaffold
orchid light
serene scaffold
orchid light
#

U cant....

#

U mean pre train?

serene scaffold
#

When you make an LLM from scratch

orchid light
#

I made one but its shitty

#

Like I tried to switch from character-level vocab to subword and it's shit now

serene scaffold
#

I don't think even gpt-2 could be taught to respond correctly to prompts

orchid light
#

training it rn

weary timber
weary timber
weary timber