#data-science-and-ml


verbal venture
#

pandas is throwing a "could not convert string to float: '?'" error when I try to do print(df['horsepower'].median()). Does anyone know why? When I run this in PyCharm I get the error, but when I run it in a Jupyter notebook it works.

queen cradle
#

You should check very carefully whether you're working with the same data in PyCharm and your Jupyter notebook. It sounds like they're out of sync.

hasty mountain
#

Does anyone know a trick to check if a sentence has special characters (like a comma) and isolate that character from nearby words?
I'm trying to preprocess a text to be used for a small-scale GPT, and I don't want to remove those special characters from my input text because, well, I want the model to learn when to use them. However, my vocabulary list doesn't have those characters attached to words (like this,), only isolated tokens ['like', 'this', ','].

#

Oh yes...regular expressions module... yert

hasty mountain
gritty heart
#

u could use it to isolate the characters

#

do u want to completely leave them out or just make them special in some sort of way like adding 2 spaces after them or a symbol after them?

hasty mountain
#

I want to add a space before and after that character

#

this, ----> this ,

gritty heart
#

so whenever this AI uses those words it just adds a space?

hasty mountain
#

Nah, it's just so I don't get an error because my vocabulary was made like ['this', ','] and not like ['this,']

tidal bough
hasty mountain
#

I didn't really want something to return a list, though

#

There's sentence.replace(',', ' , '), but...if I use that for multiple characters, it gets messy

tidal bough
#

something like

def partition_all(text: str, separators: list[str]) -> list[str]:
    tokens = [text]
    for sep in separators:
        tokens = [part for token in tokens for part in token.partition(sep) if part]
    return tokens

is what I'm thinking. probably very slow though.

hasty mountain
tidal bough
#

oh right, that doesn't quite work because partition only does one

#

one needs re.split instead

#

!e

import re
def tokenize(text: str, separators: list[str]) -> list[str]:
    tokens = [text]
    for sep in separators:
        # re.escape keeps regex-special separators (like "." or "?") literal
        pattern = f"({re.escape(sep)})"
        tokens = [part for token in tokens for part in re.split(pattern, token) if part]
    return tokens
print(tokenize(
    "Nah, it's just so I don't get an error because my vocabulary was made like ['this', ','] and not like ['this,']",
    list(";, '"),
))
arctic wedgeBOT
#

@tidal bough :white_check_mark: Your 3.11 eval job has completed with return code 0.

['Nah', ',', ' ', 'it', "'", 's', ' ', 'just', ' ', 'so', ' ', 'I', ' ', 'don', "'", 't', ' ', 'get', ' ', 'an', ' ', 'error', ' ', 'because', ' ', 'my', ' ', 'vocabulary', ' ', 'was', ' ', 'made', ' ', 'like', ' ', '[', "'", 'this', "'", ',', ' ', "'", ',', "'", ']', ' ', 'and', ' ', 'not', ' ', 'like', ' ', '[', "'", 'this', ',', "'", ']']
tidal bough
#

this looks good to me

hasty mountain
#

Thanks!
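
Since the original ask was for a string rather than a list, here is a minimal sketch of the "space before and after" idea using a single `re.sub`. The function name `pad_punctuation` and its default punctuation set are assumptions for illustration, not something from the conversation.

```python
import re

def pad_punctuation(text: str, chars: str = ",.;:!?") -> str:
    # Build a character class from the punctuation to isolate;
    # re.escape guards any regex-special characters in `chars`
    pattern = f"([{re.escape(chars)}])"
    # Surround every match with spaces, then collapse runs of whitespace
    padded = re.sub(pattern, r" \1 ", text)
    return re.sub(r"\s+", " ", padded).strip()

print(pad_punctuation("like this, and that; done."))
```

Calling `.split()` on the result gives the isolated tokens if a list is needed later.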

wooden sail
#

my best guess is that the curvature of your loss function is very steep near that local minimum, so it starts behaving poorly. some things to note with SGD are that you have no guarantee the loss will actually decrease at each iteration, and the convergence of all gradient methods depends on the curvature of the loss. the larger the curvature, the smaller the step size has to be

#

maybe try with a smaller step size first
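
To make the curvature point concrete, here is a toy sketch using plain (non-stochastic) gradient descent on the quadratic f(x) = 0.5 * c * x**2, where c plays the role of the curvature. Each step multiplies x by (1 - lr * c), so the iterates shrink only when lr < 2 / c. The function name and the numbers are made up for illustration.

```python
def gradient_descent(curvature: float, lr: float, steps: int = 50, x0: float = 1.0) -> float:
    # Minimise f(x) = 0.5 * curvature * x**2, whose gradient is curvature * x
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x
    # Distance from the minimum at x = 0 after the final step
    return abs(x)

# With curvature 100 the stability threshold is lr = 2 / 100 = 0.02
for lr in (0.001, 0.019, 0.021):
    print(lr, gradient_descent(curvature=100.0, lr=lr))
```

The step size just under the threshold converges while the one just over it blows up, which is why dividing the learning rate by 10 or 100 is a reasonable first experiment.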

serene scaffold
wooden sail
merry fern
#

Try df.info() whenever you get a TypeError and then go from there.

I think you could do something like this to find the strings:
df[df['horsepower'].str.contains('^[a-zA-Z]$')]
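
Another way to hunt for the offending values, sketched under the assumption that the column was read in as strings: coerce it with `pd.to_numeric` and look at what fails to parse. The sample data below is made up; only the column name comes from the question.

```python
import pandas as pd

# Made-up sample with a '?' placeholder, similar to the reported error
df = pd.DataFrame({"horsepower": ["130", "165", "?", "150"]})

# Coerce to numbers; anything unparseable becomes NaN
as_number = pd.to_numeric(df["horsepower"], errors="coerce")

# Rows whose original value was present but not numeric
print(df[as_number.isna() & df["horsepower"].notna()])

# Replace the column and the median now works (NaN is skipped)
df["horsepower"] = as_number
print(df["horsepower"].median())
```

If the data comes from a CSV, passing `na_values="?"` to `pd.read_csv` handles the placeholder at load time instead.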

hasty mountain
#

(Or so I remember)

serene scaffold
verbal venture
verbal venture
wooden sail
#

does it still work in your notebook if you rerun everything from top to bottom in one go?

serene scaffold
serene scaffold
#

also, when Edd says "rerun everything from top to bottom in one go", that includes restarting the notebook kernel.

#

think of all the variables and functions in your notebook as existing in a dict (because they actually do). each cell can change what's in that dict. deleting the cell or changing what's in it and re-running it doesn't undo the changes you already made to the global state dict.

#

whereas restarting the kernel clears everything out of the global state dict, and you start over fresh.
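
A rough simulation of that mental model, treating the notebook namespace as a plain dict and each cell as source code executed against it. `run_cell` and the variable names are hypothetical; a real kernel does more than this.

```python
# Model the notebook's global namespace as a plain dict
namespace = {}

def run_cell(source: str) -> None:
    # Running a cell executes its source against the shared namespace
    exec(source, namespace)

run_cell("df_rows = [1, 2, 3]")     # cell 1 defines a variable
run_cell("df_rows = df_rows[:1]")   # cell 2 overwrites it

# Deleting or editing cell 2 afterwards changes nothing already stored
print(namespace["df_rows"])

# "Restarting the kernel": everything is gone and you start fresh
namespace.clear()
print("df_rows" in namespace)
```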

ocean holly
#

How do I use the GPU in JupyterLab? I am training an AI on the CPU, and it is so slow.

serene scaffold
ocean holly
#

and 3090 at school

serene scaffold
mellow pendant
#

Hi all, I have a question that is more about statistics. I have an Excel file that contains transaction information. The file has separate sheets for each year since 2011 and contains columns for the business name, customer name, line of business, sub line of business, and revenue. I want to look at specific businesses and see if diversifying their products has helped with revenue. I have looked at it a few different ways and wanted to know if it would make sense to look at the standard deviation of the percentage makeup of the sub lines of business as a measure of whether a company has diversified. For example, if a certain company had an 80%/10%/10% breakdown of the sub lines of business, where one takes up 80% of the records, the standard deviation would be higher than if it were something like 50%/30%/20%, since the values would be closer to the mean. Does it make sense to do it that way? I feel like I am missing something.
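
The two breakdowns from the question can be checked numerically. This sketch also computes the sum of squared shares (the Herfindahl-Hirschman index), which is a swapped-in alternative and not something the question proposed. For a fixed number of sub lines n the two measures carry the same information (variance = HHI / n - 1 / n**2), but the standard deviation is not directly comparable between companies with different numbers of sub lines.

```python
import numpy as np

def share_std(shares):
    # Population standard deviation of the sub-line shares (shares should sum to 1)
    return float(np.std(shares))

def herfindahl(shares):
    # Sum of squared shares: 1/n when evenly spread, 1.0 when one line takes everything
    return float(np.sum(np.square(shares)))

for shares in ([0.8, 0.1, 0.1], [0.5, 0.3, 0.2]):
    print(shares, round(share_std(shares), 3), round(herfindahl(shares), 3))
```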

languid mortar
#

I had a data scaling and database question: if I were storing JSON data and webpage HTML in a db, what is an efficient db backend to use: plain old MySQL, Redis, or something else? I'd like to build this app to be scalable from the beginning so I don't have to change anything later. It will mostly be used as a cache. Thanks for the guidance

#

oh theres a database channel nevermind my bad

violet gull
wooden sail
#

the number depends entirely on the loss function, the network, and the data. there is no 1 number that always works

#

try dividing by 10 or 100 and see if it behaves any better or different. if not, then we have to give some thought to what the reason might be

violet gull
#

@wooden sail 0.0001

#

but all that did was slow the rate down so if i did more than 200 iterations it would probably go back to doing the unga bunga dance

velvet mountain
#

does someone know a good alternative to MLflow that is reliable? I tried unsuccessfully to set it up with FTP but ran into massive failures (and I'm not the only one, according to their issue tracker). so I'm a bit tired of them. any other good tool in the scope of managing model deployments ?

gilded bobcat
#

Hi all I had a question on a personal project im working on:

I have animal types, their outcome (adopted, not adopted), etc...

I find that most animals get adopted in 2 months or less, however I get roughly .01% being adopted in 2 years, 3 years, 5 years.... These extreme observations are true and are not errors. I fear keeping them will ruin all my inputs to my models as the standardization values will be influenced by these outliers, however I don't want to drop them because they are informative!

Curious on what I should do? Leaning on dropping anyway?

velvet mountain
gilded bobcat
#

idk if my logic even sound to you lol

#

I guess I could standardize to the median and not mean, win-win?

violet gull
#

i did 2000 iterations and 0.0001 lr

#

it still broke

velvet mountain
#

median sounds like a nice thing to at least try

gilded bobcat
#

I could always share my screen and show you my logic but idk if you wanna hear my rambling lol

velvet mountain
#

std deviation is also not robust to outliers, so watch out here too
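
One common way to act on both points is to centre on the median and scale by the interquartile range instead of the standard deviation. A minimal sketch with made-up numbers; `robust_scale` is a hypothetical helper (scikit-learn's `RobustScaler` does the same kind of thing), and it assumes the IQR is not zero.

```python
import numpy as np

def robust_scale(values):
    # Centre on the median and divide by the interquartile range (IQR)
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (values - median) / (q3 - q1)

# Made-up days-to-adoption with one 5-year stay
days = [10, 20, 30, 40, 50, 1825]
scaled = robust_scale(days)
print(scaled)
```

The extreme stay still shows up as a large value, so it remains informative, but it no longer drags the centre and spread used to scale everything else.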

gilded bobcat
#

Yeah fair ty ty

#

I'll just push forward and hope for the best 🙂

queen cradle
#

@gilded bobcat My advice is, don't try to fit a normal distribution. Instead, try a gamma distribution. They're more appropriate to your situation.

gilded bobcat
queen cradle
#

If you're ultimately going to use a non-parametric model, then I'd say the only reason to try to fit a parametric distribution in the first place is so that you have some idea of what the data looks like.

#

If a gamma distribution fits well, then you learn something. If it doesn't fit well, and if you can identify where it doesn't fit well, then you learn something else.

#

That is, you can fit a parametric distribution as a kind of exploratory tool, instead of because you want to shoehorn the data into something parametric.

gilded bobcat
#

Makes sense, do you have advice on how to ensure it looks like a gamma dist other than inspection? Ngl I am used to shoving my data into normal and moving on lol

queen cradle
#

For exploratory purposes, inspection is a good way to go. For example, plot the density of the fitted gamma distribution over a histogram or KDE of the data. Or make a Q-Q plot of the data versus the fitted gamma distribution.
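
A sketch of the Q-Q idea with SciPy, using synthetic data as a stand-in for the real days-to-adoption column. Pinning the location at zero is an assumption made here because durations cannot be negative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for days-to-adoption
days = rng.gamma(shape=2.0, scale=15.0, size=500)

# Fit shape and scale, pinning the location at zero
shape, loc, scale = stats.gamma.fit(days, floc=0)

# Q-Q points: sorted data against fitted-gamma quantiles at matching plotting positions
probs = (np.arange(1, len(days) + 1) - 0.5) / len(days)
theoretical = stats.gamma.ppf(probs, shape, loc=loc, scale=scale)
observed = np.sort(days)

# A good fit keeps the points near the line y = x,
# e.g. when drawn with plt.scatter(theoretical, observed)
qq_corr = np.corrcoef(theoretical, observed)[0, 1]
print(shape, scale, qq_corr)
```

Systematic bends away from the diagonal show where the gamma fit breaks down, which is the exploratory information being described.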

gilded bobcat
#

Got it! Visually it looks good, let me know if you agree:

queen cradle
#

Huh, those are interesting. They all seem to have distinct elbows.

gilded bobcat
#

Like the ~750 spot on "adopt"?

queen cradle
#

Yeah. They all seem to have elbows at about that height.

hasty mountain
#

Hey guys, can someone help me with Pretraining of a Transformer?
I know that the Unsupervised Learning phase of neural networks is mostly to train the "feature extracting" layers, with the objective of minimizing information entropy to make things easier for the classifier. I can see that quite easily for image models. But how can I do that for a Transformer?
Should I use as "information entropy" output the Encoder output?

But then, GPT-1 had only the Decoder part, right? Shouldn't I use something related to the Decoder for this?

hasty mountain
#

ChatGPT told me that the Transformer would be trained to predict whether 2 generated sentences are consecutive or not...but it also gets pretty messed up with that information.

#

Uh...ok...I don't get it...it would be a CrossEntropy, ok? But what would be the targets?

silk marsh
#

So what math Field should I learn for AI except statistics

mild elk
#

does anyone know why this is not working

#

I just converted the numpy array X_test_MinMax into a dataframe called a

#

and then i just want to get the dataframe when the "Island" column is 1.0 but it shows NaN values

austere swift
austere swift
#

for example if you give it "I really love programming in", it would give you what it thinks the next token should be, which could be something like: 0.5 python, 0.3 C++, 0.1 java, 0.05 rust, ...

hasty mountain
austere swift
#

this is if you're using word tokenization, there's also a bunch of other tokenization methods that are subword (so tokens represent parts of words) so the tokens could be something like "th-" or "ex-"

hasty mountain
#

In a supervised learning configuration, the loss would be CrossEntropy(model_output, target_text).
But what about unsupervised, where there's no labels, no targets?

austere swift
#

the target is the actual next word in the sequence

#

because your training data is a bunch of text, you already know what the next word is

#

so the label is just that next word
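
A minimal sketch of how the targets fall out of the text itself, assuming simple whitespace word tokenization. `next_token_pairs` and `context_size` are names made up for this illustration.

```python
def next_token_pairs(tokens, context_size):
    # Each example is a window of tokens; its target is the token that follows it
    pairs = []
    for i in range(len(tokens) - context_size):
        pairs.append((tokens[i:i + context_size], tokens[i + context_size]))
    return pairs

tokens = "I really love programming in python".split()
pairs = next_token_pairs(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

The cross-entropy loss then compares the model's predicted distribution over the vocabulary with each target token, so no hand-made labels are needed.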

hasty mountain
#

That doesn't look like unsupervised learning to me

austere swift
#

it's self-supervised learning

hasty mountain
#

Ok, but I want to use unsupervised learning for pre-training

austere swift
#

there is no unsupervised learning for that, people call it "unsupervised" because there are no explicit labels, but it's technically self-supervised

hasty mountain
#

Ugh... Then I see no difference between an unsupervised learning configuration and a supervised one for a Transformer

#

Working with images is way easier... and clarified

gilded bobcat
#

I have a NLP-ish type of question.... I have a feature called animal breeds, many (if not most) of these breeds are sparse (like 1 or 2 animals per breeds). Could embedding these with a pretrained model (like GloVe) be a good idea? Would it be able to understand the similarity between "German Shepard Mix" and "German Shepard" and "Pitbull?" My end goal is to use breed to predict if an animal will be adopted

serene scaffold
gilded bobcat
# serene scaffold You're trying to predict breeds, with what features?

I am trying to predict if an animal will be adopted using some predictors.... One of them being breed, I feel as if it will provide some great explanatory power, but if I OHE it itll be incredibly sparse. I thought maybe I could make embeddings instead and use those for prediction.

#

To make it more confusing some animals are "short hair tabby" and others are "short hair tabby mix"

gilded bobcat
#

Y is adoption (dichotomous), possible X's are: Age (continuous), Animal Type (categorical), Breed (categorical), Color (categorical), Intake Reason (categorical), Intake Sex (categorical), Intake Conditional (categorical)

#

I will prob drop color its even worse than breed

serene scaffold
serene scaffold
#

Anyway, I wouldn't do the word embedding thing. I'll explain why later.

austere swift
#

Just putting this out there as well, a lot of those features will likely not get you much information for the model to predict the breed

#

color and type I think would make sense, but intake reason, sex, and age likely have little to no predictive power

gilded bobcat
#

Sorry I might have been unclear. I want to use breed as a feature to predict if an animal will get adopted.

austere swift
#

ah okay that makes sense

austere swift
gilded bobcat
#

Yeah let me edit, I think steelercus read it the same

austere swift
#

I think it would make sense to kind of merge some of those breeds together into the same category if they're very similar

gilded bobcat
#

Here is an idea of what they look like:

austere swift
#

like "german shepard mix" and "german shepard" would be merged into "german shepard" etc

gilded bobcat
#

I would agree but my pain is like I think for dogs being 'mixed' actually matters.... A purebred german shepard is wildly different from a mix. Moreover, I am just unsure how german shepard/lab is different from a lab/german shepard... I could def break it up though

austere swift
#

yeah if the mix would non-negligibly affect the chances of being adopted then it should be kept
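
One way to merge similar breeds while keeping the "mix" information is to split it into its own flag and make crosses order-independent. A rough sketch; `normalize_breed` and its rules are assumptions, and the breed strings are spelled as in the messages above.

```python
def normalize_breed(raw: str):
    # Lower-case the label and pull the word "mix" out into a separate flag
    text = raw.strip().lower()
    is_mix = text.endswith(" mix") or "/" in text
    if text.endswith(" mix"):
        text = text[: -len(" mix")]
    # Sorting the parts makes "german shepard/lab" and "lab/german shepard" identical
    parts = tuple(sorted(part.strip() for part in text.split("/")))
    return parts, is_mix

for raw in ("German Shepard", "German Shepard Mix", "German Shepard/Lab", "Lab/German Shepard"):
    print(raw, "->", normalize_breed(raw))
```

The breed parts and the flag can then be encoded as separate features, which keeps purebred and mixed animals distinguishable.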

gilded bobcat
#

but like Chicken and Chicken mix? Wtf is a chicken mix??

#

I have another Q on feature selection if thats okay

#

I plan to do feature selection prior to building my model, probably something like RFE.... Should I include my OHE categoricals when I do this? If so, and it says one value of a categorical is useless (say the values are A, B, C and it says A is useless), should I drop all my dummies for that categorical?

serene scaffold
#

anyway @gilded bobcat, you could use word vectors to see if the names of the breeds form discernible clusters in that vector space, and treat any breeds that are part of the same cluster as the same breed. But that's basically just binning. If you want to create bins of breeds, you can already do that using whatever bins you want.

serene scaffold
gilded bobcat
#

I see, honestly worth a shot or atleast a fun way to practice my clustering techniques.... With 4k+ breeds I need a better way to automate over me deciding

serene scaffold
gilded bobcat
#

Ty 🙂

serene scaffold
#

yw

serene scaffold
gilded bobcat
# serene scaffold you have to decide what features you're going to feed into the model before you ...

I might be confused, what if I have a categorical with A, B, and C values only. I go ahead and OHE these so that I now have three columns in my feature dataset. I then run a feature selection technique over all my possible features. If my feature selection were to say "A and B are really important but C is useless", should I just drop the whole categorical variable or drop the one hot encoded C column?

#

I say drop all of that categorical variable, but curious none the less

serene scaffold
#

what does ohe stand for

gilded bobcat
#

one hot encode/dummy them out

serene scaffold
#

ah right. I've never seen that as an acronym for some reason.

#

also there's no established meaning for "feature dataset" as a separate thing from "dataset".

#

"A and B are really important but C is useless" should I just drop the whole categorical variable or drop the one hot encoded C column?
why would you drop A and B if they're important?

gilded bobcat
#

Because they're all dummies within the same categorical variable, I have no good reason to say I should/shouldn't, but it feels like I am tossing out a 1/3 of a variable and not sure if that's okay.

serene scaffold
#

my guess is that the model would just learn to ignore C anyway, but it depends on the model and properties of your dataset.

mild elk
mild elk
#

it still does not work

mild elk
charred frost
#

hey I got a quick question for pandas, I want to add a dataframe to another to the right of it

#

so add it column wise but keep it as is

#

keep the row names etc

#

so do this

       |  A                                                    |  B                  

row1 | True row5 | True
row2 | False row6 | False
row3 | False

after appending B to A

       |  A                      |  B                  

row1 | True row5 | True
row2 | False row6 | False
row3 | False

gilded bobcat
# mild elk

I'm not too sure unless I can see that minmax array. Could you send example data?

charred frost
#

seems like something really basic but I can't find an easy way to do this
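
A sketch of one way to get that side-by-side layout, assuming A and B are single-column DataFrames with the row labels shown above. Note that a plain `pd.concat([A, B], axis=1)` aligns on the index, so with non-overlapping row names it would produce five rows full of NaN. Moving each index into an ordinary column first keeps the row names and glues the frames by position. The column names `A_row` and `B_row` are made up.

```python
import pandas as pd

A = pd.DataFrame({"A": [True, False, False]}, index=["row1", "row2", "row3"])
B = pd.DataFrame({"B": [True, False]}, index=["row5", "row6"])

# Turn each index into an ordinary column so the row names survive
left = A.rename_axis("A_row").reset_index()
right = B.rename_axis("B_row").reset_index()

# Both frames now share a 0..n-1 index, so axis=1 places them next to each other
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

The shorter frame is padded with NaN in the rows it does not have.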

plush jungle
#

does anyone know why the matplotlib window might be not responding when I use:

```py
plt.pause(0.001)
```
iron basalt
plush jungle
#

I tried 0.1, but that didn't work either

#

same thing happens with zero pause

iron basalt
#

How much work is it doing? Are you plotting a lot? Matplotlib is slow.

plush jungle
#

nah, it's a single data point on the first iteration

#
```py
def plot_rewards(show_result=False):
    plt.figure(1)

    rewards_t = torch.tensor(episode_rewards, dtype=torch.float)
    if show_result:
        plt.title('Result')
    else:
        plt.clf()
        plt.title('Training...')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.plot(rewards_t.numpy())
    # Take 10 episode averages and plot them too
    if len(rewards_t) >= 10:
        means = rewards_t.unfold(0, 10, 1).mean(1).view(-1)
        # Pad with window size - 1 = 9 zeros so the averages line up with their episodes
        means = torch.cat((torch.zeros(9), means))
        plt.plot(means.numpy())

    plt.pause(0.001)  # pause a bit so that plots are updated
    #plt.show()
    if is_ipython:
        if not show_result:
            display.display(plt.gcf())
            display.clear_output(wait=True)
        else:
            display.display(plt.gcf())
```
#

not on ipython

iron basalt
#

Is interactive mode on?

plush jungle
#

ooh ok, now it's actually plotting data once I up the wait to 1, but it's still not responding when I click on it

#

I guess that's not a super big problem, but it is annoying

iron basalt
#

Use FuncAnimation instead.
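
A minimal sketch of what the FuncAnimation approach could look like. It assumes the rewards live in a list called `episode_rewards`, as in the snippet above, and that something else (for example the training loop running in a background thread) keeps appending to it. The headless `Agg` backend is selected only so the sketch runs anywhere; drop that line to get a live window.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; remove for an interactive window
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

episode_rewards = []  # assumed to be filled in by the training code

fig, ax = plt.subplots()
(line,) = ax.plot([], [])
ax.set_xlabel("Episode")
ax.set_ylabel("Reward")

def update(frame):
    # Redraw the line from whatever rewards have been collected so far
    line.set_data(list(range(len(episode_rewards))), episode_rewards)
    ax.relim()
    ax.autoscale_view()
    return (line,)

# Matplotlib's own event loop calls update() every 200 ms, so the window stays responsive
anim = FuncAnimation(fig, update, interval=200, cache_frame_data=False)
# plt.show()  # uncomment when using an interactive backend
```

Keep a reference to `anim` alive for as long as the plot is open, otherwise the animation is garbage collected and stops.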

plucky bolt
#

Anyone here read this book?

#

Wondering if it is beginner friendly

half quest
#

Hey, I have a lot of data that will be key-value pairs. Both the keys and values will be integers. There will probably be either millions, or maybe billions of key-value pairs. What would be the best way to store this data so python could get a value with a key efficiently?

wooden sail
#

probably with a database if you don't want to store all the data in memory
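
As one possible reading of "a database", here is a sketch using SQLite from the standard library. The table name `kv`, the helper `lookup`, and the sample data are all made up, and whether SQLite is the right choice at billions of rows is not something this sketch establishes.

```python
import sqlite3

# Use a file path instead of ":memory:" to keep the data on disk
conn = sqlite3.connect(":memory:")

# INTEGER PRIMARY KEY makes lookups by key an indexed search rather than a full scan
conn.execute("CREATE TABLE kv (key INTEGER PRIMARY KEY, value INTEGER NOT NULL)")
conn.executemany("INSERT INTO kv VALUES (?, ?)", ((k, k * k) for k in range(100_000)))
conn.commit()

def lookup(key: int):
    # Return the stored value, or None when the key is absent
    row = conn.execute("SELECT value FROM kv WHERE key = ?", (key,)).fetchone()
    return None if row is None else row[0]

print(lookup(12345))
```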

young granite
#

Hey, im currently trying out sklearns pipeline and wondered if theres an easy way to implement preprocessing and postprocessing into it.

young granite
foggy yarrow
#

Does anyone have experience with extracting data from uneven grid?

sullen flicker
#

I got a question regarding data labeling:

I have a few large datasets of customer feedback which I would wish to label for text classification. However, my time and resources are limited, so I do not have the capacity to label the whole dataset. Therefore, I am interested in algorithms that help me to create labels with only a fraction of the data by utilizing unsupervised/semi-supervised techniques and accepting some noise in the labels.

So my questions would be: Which approaches would you recommend? How would you solve this problem? What state of the art algorithms /papers exist on this topic?

young granite
#

currently i use functions for preprocessing but would want to store all steps into the pipeline to have a complete "model", but i cant implement the functions directly to the pipeline cause it starts with a df then transformations etc. and i struggle to get the right approach

#

i got n dfs with the shape (300, 2) and transform them into a df where n rows are present and 60+ cols

#

from this df the cols are my features and i got another df with 20 cols which are my targets, n is always the ID of the df in all cases
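
A sketch of how custom functions can become pipeline steps in scikit-learn: `FunctionTransformer` wraps a plain function for preprocessing, and `TransformedTargetRegressor` applies a transform to the target and its inverse to the predictions, which covers one kind of postprocessing. The function `add_row_mean`, the choice of estimator, and the data are placeholders and not the setup described above.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def add_row_mean(X):
    # Hypothetical preprocessing step: append each row's mean as an extra feature
    X = np.asarray(X, dtype=float)
    return np.column_stack([X, X.mean(axis=1)])

pipeline = Pipeline([
    ("prep", FunctionTransformer(add_row_mean)),  # custom function as a pipeline step
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# Postprocessing of predictions: fit on log1p(y), return expm1(prediction)
model = TransformedTargetRegressor(regressor=pipeline, func=np.log1p, inverse_func=np.expm1)

rng = np.random.default_rng(0)
X = rng.random((50, 4))
y = X.sum(axis=1) + 1.0
model.fit(X, y)
predictions = model.predict(X[:3])
print(predictions)
```

Because `FunctionTransformer` does not validate its input by default, the wrapped function can in principle accept something other than a 2D array, such as a list of DataFrames, as long as it returns the feature matrix the next step expects.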

cold snow
#

hello any there

#

i need a small help

serene scaffold
cold snow
#

ok

#

actually i am creating an AI bot. after importing the csv file i got a msg like this .

#

ValueError Traceback (most recent call last)
<ipython-input-21-6250f5fee32f> in <module>
----> 1 df.sample(6)

1 frames
/usr/local/lib/python3.9/dist-packages/pandas/core/sample.py in sample(obj_len, size, replace, weights, random_state)
148 raise ValueError("Invalid weights: weights sum to zero")
149
--> 150 return random_state.choice(obj_len, size=size, replace=replace, p=weights).astype(
151 np.intp, copy=False
152 )

mtrand.pyx in numpy.random.mtrand.RandomState.choice()

ValueError: a must be greater than 0 unless no samples are taken

#

i am using this csv file

#

i am using google colab to create this bot

serene scaffold
#

@cold snow df.sample(6) worked for me when I did it with your csv, so you may have inadvertently overwritten your df variable with the wrong thing.

cold snow
#

how to correct it

#

actually I am a beginner, so I am asking

serene scaffold
#

I haven't seen enough of your code to know. try restarting the notebook kernel, and then do df.sample(6) immediately after the df is created.
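
For what it's worth, this particular message ("a must be greater than 0 unless no samples are taken") is what NumPy raises when asked to choose from zero items, so one thing worth checking is whether `df` has ended up with no rows. A small sketch with a deliberately empty frame; the column names are placeholders.

```python
import pandas as pd

# A frame with columns but zero rows
df = pd.DataFrame(columns=["response", "context"])
print(len(df), df.empty)

try:
    df.sample(6)
    sampling_failed = False
except ValueError as err:
    sampling_failed = True
    print("sampling failed:", err)

# Guard before sampling, and never ask for more rows than exist
if not df.empty:
    print(df.sample(min(6, len(df))))
```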

cold snow
#

i will try it

#

bro its still the same

#

let me send my ipynb file

arctic wedgeBOT
#
Pasting large amounts of code

If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

serene scaffold
cold snow
#

ok

#

# context window of size 7
n = 7

for i in data[data.name == 'KIRTHIN waifu'].index:
    if i < n:
        continue
    row = []
    prev = i - 1 - n # we additionally subtract 1, so row will contain current response and 7 previous responses
    for j in range(i, prev, -1):
        row.append(data.line[j])
    contexted.append(row)

columns = ['response', 'context']
columns = columns + ['context/' + str(i) for i in range(n - 1)]

df = pd.DataFrame.from_records(contexted, columns=columns)

serene scaffold
cold snow
#

actually its a pre made command line

#

i am editing and using it

#

and the context is the problem i think

#

@serene scaffold are you there bro

cold snow
#

@serene scaffold

#

bro

#

anybody help

#

me

#

@barren otter

serene scaffold
cold snow
#

ok

merry fern
cold snow
#

bro i cant understand

#

pls explain clearly

merry fern
#
```py
n = 7

for i in data[data.name == 'KIRTHIN waifu'].index:
  if i < n:
    continue
  row = []
  prev = i - 1 - n # we additionally subtract 1, so row will contain current response and 7 previous responses
  for j in range(i, prev, -1):
    row.append(data.line[j])
  contexted.append(row)

columns = ['response', 'context']
columns = columns + ['context/' + str(i) for i in range(n - 1)]

df = pd.DataFrame.from_records(contexted, columns=columns)
```
cold snow
#

well yes

#

😅

#

!paste

cold snow
#

@serene scaffold bro i fixed my problem

#

but again new one appeared that is

serene scaffold
cold snow
#

# create dataset suitable for our model
def construct_conv(row, tokenizer, eos = True):
    flatten = lambda l: [item for sublist in l for item in sublist]
    conv = list(reversed([tokenizer.encode(x) + [tokenizer.eos_token_id] for x in row]))
    conv = flatten(conv)
    return conv

class ConversationDataset(Dataset):
    def __init__(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):

        block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)

        directory = args.cache_dir
        cached_features_file = os.path.join(
            directory, args.model_type + "_cached_lm_" + str(block_size)
        )

        if os.path.exists(cached_features_file) and not args.overwrite_cache:
            logger.info("Loading features from cached file %s", cached_features_file)
            with open(cached_features_file, "rb") as handle:
                self.examples = pickle.load(handle)
        else:
            logger.info("Creating features from dataset file at %s", directory)

            self.examples = []
            for _, row in df.iterrows():
                conv = construct_conv(row, tokenizer)
                self.examples.append(conv)

            logger.info("Saving features into cached file %s", cached_features_file)
            with open(cached_features_file, "wb") as handle:
                pickle.dump(self.examples, handle, protocol=pickle.HIGHEST_PROTOCOL)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item], dtype=torch.long)
arctic wedgeBOT
#
Formatting code on discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

serene scaffold
#

be sure to always use this from now on ^

cold snow
#

ok

#

!code

serene scaffold
cold snow
#

ok

lapis sequoia
#

Hello chat

cold snow
#

i used pastebin but how to use that in here

lapis sequoia
#

Woah, CS grad?

lapis sequoia
#

What college?

serene scaffold
serene scaffold
serene scaffold
lapis sequoia
#

Woah

#

Can you show me your setup

#

I have so many questions

#

I'm planning to study CS in the future

#

Right now just focusing on school and a little bit of programming when I have time

serene scaffold
#

cool flag though 🇳🇵

#

try asking in #career-advice if there are any developers in india or nepal who know what to do

lapis sequoia
#

I don't see much potential here

cold snow
#

NameError Traceback (most recent call last)
<ipython-input-12-a654172287f5> in <module>
6 return conv
7
----> 8 class ConversationDataset(Dataset):
9 def init(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):
10

<ipython-input-12-a654172287f5> in ConversationDataset()
7
8 class ConversationDataset(Dataset):
----> 9 def init(self, tokenizer: PreTrainedTokenizer, args, df, block_size=512):
10
11 block_size = block_size - (tokenizer.model_max_length - tokenizer.max_len_single_sentence)

NameError: name 'PreTrainedTokenizer' is not defined

#

bro this is the error i got

lapis sequoia
#

Code

cold snow
#

any way to solve it

lapis sequoia
#

Send code

cold snow
serene scaffold
cold snow
#

how to import it

serene scaffold
#

an import statement. but you have to figure out where it's located.

#

you must have other import statements in your code to use as an example

cold snow
#

i c

#

let me try. thanks again

lapis sequoia
#

He hasn't defined Dataset either

odd steeple
#

How to get started in developing a ai model like chatgpt

serene scaffold
serene scaffold
odd steeple
#

Language models means like NLP? Just asking

lapis sequoia
#

LMAO WHY DO I HAVE THIS

serene scaffold
cold snow
#

stelercus bro i solved tokenizer error also

#

i reloaded kernel and forgot to install pip transformers and install tokenizer

#

reloaded

#

ran the command at runtime and the problem is solved

odd steeple
serene scaffold
odd steeple
serene scaffold
odd steeple
#

I understood brother

heavy crow
#

Do you guys know of any papers on caption to image search? Along the lines of clip or lit but if possible a bit more efficient and accurate ofc haha

#

I feel like lit, clip, align, blip, blip2 are all more about captioning and classification whereas I am looking only for caption to image search

#

Please ping me if you know any cool research:)

lapis sequoia
#

Does anyone here know how to use OpenAI gym library in python for reinforcement learning? I've been having trouble using an environment to train an AI on.

storm geyser
#

anyone want to chat and code

hasty mountain
# lapis sequoia Does anyone here know how to use OpenAI gym library in python for reinforcement ...

The idea is basically this:

import retro
import time
from stable_baselines import PPO2
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines.common import set_global_seeds
from stable_baselines.common.callbacks import CheckpointCallback
from wrapper import wrapper

# Pre-saved states
states = ["ChunLiVsBlanka.1star", "ChunLiVsBalrog.1star", "ChunLiVsBison.1star", "ChunLiVsChunLi.1star", "ChunLiVsDhalsim.1star",
            "ChunLiVsGuille.1star", "ChunLiVsHonda.1star", "ChunLiVsKen.1star", "ChunLiVsRyu.1star", "ChunLiVsSagat.1star", "ChunLiVsVega.1star",
            "ChunLiVsZahgief.1star"]

env = retro.make(game="StreetFighterIISpecialChampionEdition-Genesis", state="ChunLiVsBlanka.1star")
#env = retro.make(game="DonkeyKongCountry2-Snes")
env = wrapper(env)
#model = PPO2.load("D:/Python/Projects/Hakisa/rl_model_1000000_steps")

obs = env.reset()
total_reward = []
steps = 0
done = False

# Save a checkpoint every 100,000 steps; this has to exist before model.learn() uses it
checkpoint = CheckpointCallback(save_freq=100000, save_path="D:/Python/Projects/Hakisa/Donkey_Kong")

model = PPO2(policy="CnnPolicy", env=env, gamma=0.99, n_steps=64, learning_rate=3e-9, vf_coef=0.5, verbose=1)
start = time.time()
model.learn(total_timesteps=1000000000, log_interval=100, reset_num_timesteps=True, callback=checkpoint)

# Play one episode with the trained model
while not done:
    env.render()
    action, state = model.predict(obs)
    obs, reward, done, info = env.step(action)
    #steps += 1
    total_reward.append(reward)
    time.sleep(0.05)

# Don't use these
#env.render()
#env.close()

end = time.time()

print("Duration (hours): ", (end-start)/3600)
print(f"Total Reward: {sum(total_reward)}")
#

Also, env.render() and env.close() have some bugs. If you call env.close() it'll simply close your window directly...but when you call env.render(), it'll already render the game window and then close it.

#

Oh yes... I forgot to mention...this one is gym retro, which is more focused on retro games...but I suppose the original gym might have an idea that is close to this.

lapis sequoia
#

thanks so much!

#

this helped alot!

gilded bobcat
#

Hello my DS friends, curious on your take:

Would you do a feature selection technique if youll already have regularization in your model?

gilded bobcat
# thorn swift why not?

I guess best case scenario it'll make your model faster but won't improve it over just regularization (cause regularization would have dropped the same variables anyway!), at worst you'll over-generalize and make a worse model... This is my guess tho

#

I guess ill do both and report back haha

plucky bolt
#

Anyone here use anaconda?

thorn swift
plucky bolt
thorn swift
#

it shouldn't be a problem

plucky bolt
mint palm
#

scikit learn
how do i pronounce it, i heard multiple pronunciations, what do you use?

#

have heard*

thorn swift
mint palm
#

some people say skeeet learn

thorn swift
#

yea thats probably how it was meant

soft badge
#

Guys, what is the path to becoming an expert in IA?

odd meteor
lapis sequoia
#

so for some fun AI stuff im currently working on a SC2 AI for their deep learning ladder and i have gotten this error

"c:/Users/Redux/Documents/python VB/hi.py"
Traceback (most recent call last):
  File "c:\Users\Redux\Documents\python VB\hi.py", line 1, in <module>
    from pysc2.agents import base_agent
  File "C:\Users\Redux\AppData\Local\Programs\Python\Python311\Lib\site-packages\pysc2\agents\base_agent.py", line 20, in <module>
    from pysc2.lib import actions
  File "C:\Users\Redux\AppData\Local\Programs\Python\Python311\Lib\site-packages\pysc2\lib\actions.py", line 27, in <module>
    from s2clientprotocol import spatial_pb2 as sc_spatial
  File "C:\Users\Redux\AppData\Local\Programs\Python\Python311\Lib\site-packages\s2clientprotocol\spatial_pb2.py", line 16, in <module>
    from s2clientprotocol import common_pb2 as s2clientprotocol_dot_common__pb2
  File "C:\Users\Redux\AppData\Local\Programs\Python\Python311\Lib\site-packages\s2clientprotocol\common_pb2.py", line 32, in <module>
    _descriptor.EnumValueDescriptor(
  File "C:\Users\Redux\AppData\Local\Programs\Python\Python311\Lib\site-packages\google\protobuf\descriptor.py", line 796, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:
 1. Downgrade the protobuf package to 3.20.x or lower.
 2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

More information: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

not sure if this is the right place to post this but do you guys have any insight on how to correct this error?
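For what it's worth, workaround 2 from the error text can be applied from inside the script, as long as the variable is set before anything imports protobuf. A minimal sketch (the pysc2 import is commented out so the snippet runs on its own, and as the message says, this switches to the slower pure-Python parser):

```python
import os

# Must be set before protobuf (and therefore pysc2) is imported for the first time.
os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"] = "python"

# from pysc2.agents import base_agent  # import pysc2 only after the variable is set

print(os.environ["PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"])
```

The other route named in the error is workaround 1: pinning the protobuf package to 3.20.x or lower in the environment that runs the script.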

odd meteor
serene scaffold
#

the sci is the same as science, and I'm not aware of a variety of English where the "c" is pronounced.

lapis sequoia
#

English is a very weird language for sure haha

serene scaffold
lapis sequoia
#

I couldn't speak to that as I only speak English xD I just know the weird things about English, of which there are a lot: for example "ate" and "eight", "knife", "science", and some other examples of English being interesting

rough lava
#

Hey hey people
Anyone here has any experience with bias (e.g. gender bias) in text based data ?
Looking to build a binary classifier to detect bias as a first step
So here are my 3 questions

  1. Could anyone suggest any open source annotated/labelled datasets? (labels --> 1: biased, 2: non-biased)
  2. What other methods would you recommend if any?
  3. In conjunction with 2, I know of word embeddings but never actually used them. Are they implemented/trained with NN mostly ?

Sorry for the long post ^^

P.S. Using python/anaconda and mostly interested in 1)

serene scaffold
soft badge
serene scaffold
soft badge
#

okay

#

yeah

#

so I haven't studied AI in depth yet

#

but my goal is to create automations with AI.

#

either with computer vision or even within some software.

#

what do you recommend studying on the internet?

wooden sail
#

if you really wanna be an expert, you should start by learning the basics of python and learning math

#

AI is math, and the more math you know, the better stuff you'll be able to do with it. separately, python is currently the most popular language for AI due to the large community and available modules, among other things

rough lava
# soft badge yeah, because in brasil we say IA

there are plenty of courses for you to start with, at low cost (e.g. edx ones ~300) or even completely free (e.g. MIT uploads lots to youtube)
You could check those courses and, if interested, then delve deeper into math as @wooden sail said, to build strong foundations. (Math will help you remove the feeling of wondering "why" in most cases of AI)

serene scaffold
serene scaffold
#

getting AI advice from Edd is like getting a million dollars, or something.

wooden sail
#

maybe that's the regional variant of mhm

soft badge
#

it's my way of understanding

serene scaffold
#

interesting

soft badge
#

but it's also important to know about communication between systems, so you know how to pull data from a given system and apply the right model?

wooden sail
#

that's also true. there's layers to AI, and that's on the system design level. you may or may not have to deal with that at all depending on which part of AI you want to focus on

#

for very large tasks, it's not just one person. you have people building a pipeline, people doing math, people coding the models, people sifting through data, etc

#

it's not realistic for one person to do all of these for a large model, but they can be good skills to have. in that sense, getting familiar with databases (something i've never done in my life) and generally with linux (because everything runs on linux) are good ideas

soft badge
#

oh yeah

rough lava
#

do you have lots of working experience ? 😮 @wooden sail

wooden sail
#

depends on what you call work experience 😛

weary flint
soft badge
rough lava
wooden sail
#

i do signal processing stuff, which usually has me dealing with the math part, but not necessarily handling data

rough lava
#

oh knife-edge model on matlab and stuff like that?

wooden sail
#

sure. i've never explicitly used that diffraction model, but similar stuff

rough lava
#

Interesting and love reading code about those things
But I prefer to stay away from math parts if possible
Nlp seems fun for the time being, except when I need to hunt for datasets...

odd meteor
soft badge
#

deep, its big truly

#

sometimes i stay just on language

wooden sail
#

if you're in HS, different math topics might be more accessible than others

#

many schools cover calculus toward the end, but not much on statistics and linear algebra

#

the good thing is that basic probability and a lot of linalg are independent from most other stuff you learn in school, so you could jump right into them

odd meteor
rough lava
#

At first I did, but due to having to read papers most of the time, not anymore

#

at least not in volumes
if I have the time to spread them more in my day, yeah depending on the NLP topic

violet gull
#

Can I train an image classification cnn on 60 images per class?

agile cobalt
#

depends on the model, whether you are planning to train from scratch or fine-tune, as well as how easy your images are to tell apart

olive stone
#

Hey, when creating an ML model can the validation data be the same as the training data?

wooden sail
#

there are recent papers showing that some neural network architectures reach 0% training loss under mild conditions, and this in general says nothing about how well the network generalizes

#

it's a recipe for overfitting

olive stone
#

I see, then having separate data for validation is better?

wooden sail
#

i'd say necessary, not just better, if you want a useful network outside of the training data

#

unless you only ever need the network to work on the training data. there are special cases where this makes sense

agile cobalt
#

there are even recommendations to have a third data set to see how the network will work with completely unseen data after you reach a final model
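A rough sketch of that three-way split, using only the standard library. The function name and the 15%/15% fractions are made up for illustration; the point is just that the three parts never share an item:

```python
import random

def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into three non-overlapping parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]                  # touched only once, after the final model is chosen
    val = items[n_test:n_test + n_val]     # used while tuning / comparing models
    train = items[n_test + n_val:]         # used to fit the model
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Validating on the training rows would only re-measure training loss, which is why the held-out parts are cut before any fitting happens.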

empty inlet
#

Hello, good afternoon

#

how to solve this inequality with sympy? 4 <= 3x - 2 < 13

#

tried a lot of solvers and also can't find an example on the internet

wooden sail
#

hmm

#

oops

#

ugh just do two inequalities and find the intersection lol

agile cobalt
#

iirc pandas also requires you to use (a < b) & (b < c) (or methods like .between) instead of supporting a < b < c; I guess the way python handles a == b == c, a < b < c etc. doesn't allow as much customisation
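To make that concrete, here is a small sketch with a toy Series (the data is invented). Python expands `2 <= s < 5` into `(2 <= s) and (s < 5)`, and `and` needs a single True/False, which a Series can't provide, so pandas raises instead:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6])

# `2 <= s < 5` would raise ValueError (truth value of a Series is ambiguous),
# so combine the two element-wise comparisons with `&` instead.
mask_ops = (s >= 2) & (s < 5)

# Same filter via .between; inclusive="left" means closed at 2, open at 5
# (the string form of `inclusive` assumes a reasonably recent pandas).
mask_between = s.between(2, 5, inclusive="left")

print(s[mask_ops].tolist())
```

The parentheses around each comparison matter, because `&` binds more tightly than `<` and `>=`.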

wooden sail
#

!e

from sympy import solve_univariate_inequality, Symbol
x = Symbol('x')
s1 = solve_univariate_inequality(4 <= 3*x - 2, x)
s2 = solve_univariate_inequality(3*x - 2 < 13, x)
print(s1 & s2)
arctic wedgeBOT
#

@wooden sail :white_check_mark: Your 3.11 eval job has completed with return code 0.

(-oo < x) & (2 <= x) & (x < 5) & (x < oo)
wooden sail
#

i thought it would simplify it, somehow

agile cobalt
#

oo? is that their infinite representation?

wooden sail
#

when printing, yeah

empty inlet
#

nice solution... I was trying to solve it all together.... thank you for the help. Really appreciate it

wooden sail
#

i guess we can pass the parameter extended_real = False to get rid of the infinities

#

!e

from sympy import solve_univariate_inequality, Symbol
x = Symbol('x')
s1 = solve_univariate_inequality(4 <= 3*x - 2, x, extended_real=False)
s2 = solve_univariate_inequality(3*x - 2 < 13, x, extended_real=False)
print(s1 & s2)
arctic wedgeBOT
#

@wooden sail :x: Your 3.11 eval job has completed with return code 1.

001 | Traceback (most recent call last):
002 |   File "/home/main.py", line 3, in <module>
003 |     s1 = solve_univariate_inequality(4 <= 3*x - 2, x, extended_real=False)
004 |          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
005 | TypeError: solve_univariate_inequality() got an unexpected keyword argument 'extended_real'
wooden sail
#

hmmm

#

the API is kinda bad for this, the argument is there for rational polynomials. oh well, no matter

empty inlet
#

it's very tricky to get without documentation... and I also have lots of books here. Not one example like that

wooden sail
#

the docs are there, i used the last example to cook something up

empty inlet
#

in the book, the solution is the interval closed at 2 and open at 5

#

[2, 5)
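One way to get the answer as an interval rather than a chain of relations is to ask for set output with `relational=False` and intersect the two sets. A sketch along the lines of the snippet above:

```python
from sympy import Interval, Symbol, solve_univariate_inequality

x = Symbol('x')

# relational=False returns a set (an Interval) instead of a relational expression.
left = solve_univariate_inequality(4 <= 3*x - 2, x, relational=False)
right = solve_univariate_inequality(3*x - 2 < 13, x, relational=False)

solution = left.intersect(right)
print(solution)
```

The intersection is the half-open interval from 2 (included) to 5 (excluded), which lines up with the book's `[2, 5)`.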

junior sun
#

does anyone know any alternative to GPT 3 cuz i used all of my tokens and uh let's say i have some paypal issues

wooden sail
#

alternative in what sense?

junior sun
#

basically it can do the same stuff but it's free

#

i'm fairly new to AI and stuff so i just want to try it out and see if i can use it in one of my projects

#

so

#

i don't want to pay rn

#

i know that this isn't about openai and gpt but i thought this might be the place where i can find people who have knowledge about this kind of stuff

agile cobalt
#

perhaps chatgpt or bloom

#

that thing?

#

does not sound like the weights are public
and let's not talk about leaked models

olive stone
#

When dealing with a dataset of images, is it always recommended to apply normalization on the images?

hasty mountain
hasty mountain
iron basalt
wooden sail
#

this is your daily reminder that CS is originally a branch of mathematics

#

the compute in computer science is from computability theory

iron basalt
#

(Also it makes the "science" in "computer science" extra wrong)

#

("computer math?")

wooden sail
#

computer meth

warm copper
#

hello everyone

#

does anyone know pyspark here

#

I can't seem to get pyspark to work with pycharm

#

reee

serene scaffold
warm copper
#

its in this channel @serene scaffold

weak briar
#

Hi all, I'm having problems with a task I was given for a course. For context, I'm a physics student in their first semester, and I'm taking a course to be able to use python in my branch. I'm afraid that the task is a bit above my level in physics though, so I'm not sure how I should continue... I've plotted the data, but I don't know how I could continue with the rest of the task, so if anyone could at least help me with a periodogram, I'd be really thankful!!

short heart
#

I need help with pandas ASAP, I would appreciate help. Basically I have df with 3 cols: userID, value, itemID. I need to do the following: group by user, pick the biggest value and assign itemID, which corresponds to value to all userIDs. How can I do this?
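If I'm reading the question right, one possible approach is `groupby(...).idxmax()` to find each user's best row, then `map` to spread that row's itemID over all of the user's rows. A sketch with invented data (the `top_itemID` column name is just a placeholder):

```python
import pandas as pd

df = pd.DataFrame({
    "userID": [1, 1, 2, 2, 2],
    "value":  [10, 30, 5, 7, 6],
    "itemID": ["a", "b", "c", "d", "e"],
})

# Row label of the largest `value` within each user.
best_rows = df.groupby("userID")["value"].idxmax()

# Lookup table: userID -> itemID of that user's best row.
best_item = df.loc[best_rows].set_index("userID")["itemID"]

# Broadcast the winning itemID to every row of the same user.
df["top_itemID"] = df["userID"].map(best_item)
print(df)
```

With ties, `idxmax` keeps the first row that reaches the maximum, so decide whether that is the behaviour you want.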

serene scaffold
serene scaffold
charred light
#

I have a 1 x n df (n is always going to be even). Columns are named column1_null_count, column1_not_null_count, column2_null_count, column2_not_null_count, ...

How can I turn this into:
column1 | column1_null_count | column1_not_null_count
column2 | column2_null_count | column2_not_null_count

Without having to do some kind of looping and split on '_'

serene scaffold
#

but do print(df.T.head(10).to_dict('list'))

#

you also don't want to have dataframes where the number of columns is the one that varies.

#

please ping me if/when you do that.

boreal gale
charred light
#

Essentially ended up as a large sum(case when XXXX is null then 1 else 0 end) XXXX_count_nulls, count(XXXX) as XXXX_count_not_nulls for each column. It's not great, but no way around it. Ends up as a 1 x n df.

boreal gale
#

" If it's stupid but it works, it's not stupid " 🤷

charred light
#

More like I already spent an hour googling and couldn't find a built in function w/ SQL that does that.

#

My other option was to pull the entire table, but that takes longer.

serene scaffold
boreal gale
#

i would take the underlying numpy values, reshape and assign another column for the extra column name
meaning something like

import pandas as pd
df = pd.DataFrame({'col_1_a': [1], 'col_1_b': [1], 'col_2_a': [2], 'col_2_b': [3]})
pd.DataFrame(df.values.reshape(2,-1), columns=['a', 'b']).assign(column=df.columns[::2].str.rsplit('_', n=1).str[0])

(i am all ears for a neat actual pandas solution if anyone has one 🙏 )

serene scaffold
#

I might have one once I get the print result 😛

charred light
charred light
#

Nvm, above works if I swapped the reshape.

serene scaffold
#

@charred light

In [23]: df
Out[23]:
   col1_a  col1_b  col2_a  col2_b  col3_a  col3_b
0       0      10       0      10       1       9

In [24]: df2 = df.T.reset_index()

In [25]: df2
Out[25]:
    index   0
0  col1_a   0
1  col1_b  10
2  col2_a   0
3  col2_b  10
4  col3_a   1
5  col3_b   9

In [26]: df3 = df2['index'].str.extract(r"col(\d+)_(\w+)")

In [27]: df3
Out[27]:
   0  1
0  1  a
1  1  b
2  2  a
3  2  b
4  3  a
5  3  b

In [28]: df3['num'] = df2[0]

In [29]: df3
Out[29]:
   0  1  num
0  1  a    0
1  1  b   10
2  2  a    0
3  2  b   10
4  3  a    1
5  3  b    9

In [37]: df3.pivot_table(columns=1, index=0, values='num')
Out[37]:
  num
    a   b
1   0  10
2   0  10
3   1   9
#

CC @boreal gale

boreal gale
#

oh yeah, good one 👍 i was thinking of .T.reset_index() but my brain was fried and it just eluded my mind

charred light
#

I also just realized, I actually don't really need the not_null count. ASfacepalm Could have just gotten the full count once at the start.

sterile wyvern
#

What if the returns do not make a continuous function?

boreal gale
sterile wyvern
rancid ruin
#

What is backpropagating/how do you do it?

whole gazelle
#

Hi! Has anybody here worked with object detectors in webapps?

I'm trying to integrate YOLOv8 into my React Project. I have code for frontend and backend. I'm thinking that I'm gonna need 3 CLIs for this since I need 1 each for client and server and another for the Object Detector. Do any of you have any ideas as to how I could accomplish this?

wooden sail
#

yes

boreal gale
# sterile wyvern TestWindowLen = [90, 125, 256] CheckDays = [ 1,15, 30 ] std_dev=[25, 50, 100]

i don't think that's an issue. have a look at https://optuna.org/ if you just want a package to do the hyperparameter optimisation for you. otherwise you will need to dive into some papers to fully understand what's going on

sterile wyvern
boreal gale
sterile wyvern
boreal gale
#

well firstly it's important to really distinguish properly what it is that you are showing.
are these hyperparameters of your models?
are these some sort of output of your models?

boreal gale
#

so the hyperparameters you have shown look like components you would need to create a grid in grid search.
without knowing the model you are working with, i can't tell if they are continuous or not.

sterile wyvern
#

So I've been told that using random search should be fast and possible. It was said random search can be combined with other optimization techniques like Bayesian optimization.

#

How do you determine if a function is continuous? @boreal gale

boreal gale
#

sorry i am really confused as to what you are trying to do. perhaps it's a language barrier or there is some knowledge gap somewhere

#

what's the thing you are trying to model?
what are your inputs? where are they from? what do they mean in the real world?
what are your output(s)? where are they from? what do they mean in the real world?

#

How do you determine if a function is continuous?
layman explanation is probably "function that does not have discontinuities" or "something that you can draw with one stroke of a pen, as opposed to something that requires you to lift your pen"

sterile wyvern
#

I have a program that uses a grid search to find the optimal parameters for another function. I'm trying to find a better way to get optimal values. @boreal gale

#

So I'm wondering if a random search can be used/applied to this.

boreal gale
#

another function
what is this "another function"? without knowing this i can't comment on whether the parameters (i.e. the hyperparameters) are continuous. do you see what i am getting at?

#

So I'm wondering if a random search can be used/applied to this.
yes, random search is always an option.

sterile wyvern
boreal gale
#

i don't know. but preferably the entire definition.

#

also are you asking whether the hyperparameters are continuous or the function itself? those are two different things.

mild dirge
#

If you can use grid search, then obviously you can sample random points on that grid to test your model with. So yes, you are able to use random search.
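Here's a rough sketch of that idea using the three value lists quoted earlier in the conversation. The `score` function is a made-up stand-in for whatever the real backtest returns, so treat it as an assumption; nothing here needs the score to be continuous.

```python
import itertools
import random

# Grid values quoted earlier in the conversation.
grid = {
    "TestWindowLen": [90, 125, 256],
    "CheckDays": [1, 15, 30],
    "std_dev": [25, 50, 100],
}

def score(params):
    """Hypothetical stand-in for the real evaluation; higher is better."""
    return (-abs(params["TestWindowLen"] - 125)
            - abs(params["CheckDays"] - 15)
            - abs(params["std_dev"] - 50))

def grid_search(grid):
    # Try every combination on the grid (3 * 3 * 3 = 27 evaluations here).
    keys = list(grid)
    candidates = [dict(zip(keys, combo)) for combo in itertools.product(*grid.values())]
    return max(candidates, key=score)

def random_search(grid, n_trials, seed=0):
    # Try only n_trials randomly sampled points from the same grid.
    rng = random.Random(seed)
    candidates = [{k: rng.choice(v) for k, v in grid.items()} for _ in range(n_trials)]
    return max(candidates, key=score)

best_grid = grid_search(grid)
best_random = random_search(grid, n_trials=10)
print(best_grid, score(best_grid))
print(best_random, score(best_random))
```

Random search trades completeness for speed: with 10 trials it evaluates fewer points than the full grid, so it may or may not land on the same best configuration.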

sterile wyvern
analog cipher
#

Hello

#

if there is someone who's good at numpy, can I talk to you in private ?

wooden sail
#

i'm decent at numpy, but won't go in dms

analog cipher
#

u do u

wooden sail
#

we discourage dms, as the server is for helping in the server

analog cipher
#

k

wooden sail
#

pretty much

#

the composition through different layers makes that a terrible exercise in the chain rule

#

deep learning frameworks evaluate lazily. they create a computational graph that makes the forward pass more efficient, and automatic differentiation easier

#

i think someone had linked you before to a website on how to construct and traverse computation graphs. if you don't use autodiff, you either construct the graph yourself in code, or do the derivatives on paper
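For a feel of what "construct the graph yourself in code" can look like, here is a tiny reverse-mode sketch for scalars. The `Value` class is invented for this example and only supports `+` and `*`; real frameworks do the same bookkeeping for tensors and many more operations.

```python
class Value:
    """A scalar node in a computation graph."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # nodes this value was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Order nodes so every node comes after the nodes it depends on,
        # then walk that order backwards applying the chain rule.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad

w, x, b = Value(2.0), Value(3.0), Value(1.0)
y = w * x + b    # forward pass records the graph
y.backward()     # backward pass fills in .grad on every node
print(w.grad, x.grad, b.grad)
```

Each operation stores its local derivatives during the forward pass, so the backward pass is just multiplication and accumulation along the recorded edges.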

tidal bough
#

tensorflow used to do that, didn't it

#

before version 2 or so. when pytorch just began, its killer feature was implicit computational graphs rather than explicit.

#

(then TF learned to do that too)

terse kindle
#

does anybody have resource link where I can learn about Handwriting recognition using Deep Learning ?

grand warren
#

What are the neurons, why are there layers, and what is the math underlying it?
Help fund future projects: https://www.patreon.com/3blue1brown
Written/interactive form of this series: https://www.3blue1brown.com/topics/neural-networks

Additional funding for this project provided by Amplify Partners

Typo correction: At 14 minutes 45 seconds, th...

#

3 blue one brown explains it

serene scaffold
#

@terse kindle ^

terse kindle
#

Will look into it. Thank you so much

wet blade
#

dataframe\pandas or numpy

wooden sail
#

yes, that would just make it easier for you to make your own autodiff

#

i don't know how in depth you wanna go into the making your own deep learning stuff

#

fair enough. in python, i would recommend stuff like jax, pytorch, and tensorflow for this task. you can also do it with sympy, but sympy is pretty slow

tidal bough
#

jax, notably, is very annoying to build on windows

plush jungle
#

are there any obvious bugs in this DQN back propagation code I wrote? I mostly used the pytorch example but I changed a couple things so it would fit my tensor shapes and I'm worried the reason it's diverging is cause I have some bug

#
def optimize_model():

    if len(memory) < BATCH_SIZE:
        return
    batch = memory.sample(BATCH_SIZE)
        
    state_batch, reward_batch, next_state_batch, terminal_batch = zip(*batch)

    state_batch = torch.stack(tuple(state for state in state_batch))
    reward_batch = torch.stack(reward_batch)
    next_state_batch = torch.stack(tuple(state for state in next_state_batch))

    if torch.cuda.is_available():
        state_batch = state_batch.cuda()
        reward_batch = reward_batch.cuda()
        next_state_batch = next_state_batch.cuda()

    q_values = policy_net(state_batch)
    policy_net.eval()
    
    with torch.no_grad():            
        next_prediction_batch = target_net(next_state_batch)

    y_batch = torch.cat(
        tuple(reward if terminal else reward + GAMMA * prediction for reward, terminal, prediction in
              zip(reward_batch, terminal_batch, next_prediction_batch)))

    optimizer.zero_grad()

    loss = loss_function(q_values, y_batch.float())
    loss.backward()

    torch.nn.utils.clip_grad_value_(policy_net.parameters(), 100)
    optimizer.step()
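One thing that might be worth checking against the PyTorch tutorial: there the target uses the maximum next-state Q-value over actions, and the loss compares it with the Q-value of the action that was actually taken. The code above passes `q_values` and `prediction` through whole, which only amounts to the same thing if the network outputs a single value per state. A plain-Python sketch of the usual one-step target, with invented numbers and an assumed `GAMMA`:

```python
GAMMA = 0.99  # assumed discount factor

def td_targets(rewards, terminals, next_q_values, gamma=GAMMA):
    """One-step Q-learning targets: r if terminal, else r + gamma * max_a Q_target(s', a)."""
    targets = []
    for r, done, next_q in zip(rewards, terminals, next_q_values):
        targets.append(r if done else r + gamma * max(next_q))
    return targets

rewards = [1.0, 0.0, -1.0]
terminals = [False, False, True]
next_q_values = [[0.5, 2.0], [1.0, 0.0], [3.0, 4.0]]  # one row of Q-values per next state
print(td_targets(rewards, terminals, next_q_values))
```

Also, `policy_net.eval()` is called after the forward pass and never switched back with `policy_net.train()`, which matters if the network has dropout or batch norm layers.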
warm copper
#

I assume no one works with pyspark on Mac here?

#

No one managed to answer the error I keep getting 😅

serene scaffold
#
Traceback (most recent call last):
  File "/Users/kadiraltunel/PythonProjects/Lab1/main.py", line 17, in <module>
    df = spark.createDataFrame(data=data, schema=col_names)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kadiraltunel/Documents/Spark/python/pyspark/sql/session.py", line 894, in createDataFrame
    return self._create_dataframe(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kadiraltunel/Documents/Spark/python/pyspark/sql/session.py", line 938, in _create_dataframe
    jrdd = self._jvm.SerDeUtil.toJavaArray(rdd._to_java_object_rdd())
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kadiraltunel/Documents/Spark/python/pyspark/rdd.py", line 3113, in _to_java_object_rdd
    return self.ctx._jvm.SerDeUtil.pythonToJava(rdd._jrdd, True)
                                                ^^^^^^^^^
  File "/Users/kadiraltunel/Documents/Spark/python/pyspark/rdd.py", line 3505, in _jrdd
    wrapped_func = _wrap_function(
                   ^^^^^^^^^^^^^^^
  File "/Users/kadiraltunel/Documents/Spark/python/pyspark/rdd.py", line 3362, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kadiraltunel/Documents/Spark/python/pyspark/rdd.py", line 3345, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
                      ^^^^^^^^^^^^^^^^^^
  File "/Users/kadiraltunel/Documents/Spark/python/pyspark/serializers.py", line 468, in dumps
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range

^ this is the error message they're referring to

warm copper
#

The code doesn’t have any issues @serene scaffold

#

It’s the pyspark causing issues on my Mac

serene scaffold
warm copper
#

Oh that’s the error

#

I’m just saying that so that people don’t try to find out if the code is wrong. That’s my teacher’s code which works on his computer

#

I’m using Pycharm on MacBook Air M2

#

My teacher couldn’t figure it out either but then he doesn’t use Mac @serene scaffold

novel python
#

anyone used to plotly dash in python here?

boreal gale
#

@warm copper what python version are you using?

because i recall there was someone who dug up a jira issue for you which shows a similar error for python 3.11 iirc, and it was quickly dismissed for some reason, all without actually checking anything as far as i can see.

warm copper
#

I am using the latest version of python @boreal gale

#

I even added the paths

sterile wyvern
boreal gale
misty flint
boreal gale
austere swift
# tidal bough jax, notably, is very annoying to build on windows

this might be a controversial take but I think anybody who has the know-how to use jax and who is doing complex applications that would need it (since it's oriented towards applications where you'd want complete control over pretty much every operation) would likely already be using linux due to the headache that windows brings

zenith plover
#
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import cv2
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow.keras.optimizers import Adam
from keras import backend as K
from keras.layers import Conv2D,MaxPooling2D,UpSampling2D,Input,BatchNormalization,LeakyReLU
from keras.layers.merge import concatenate
from keras.models import Model
from keras.preprocessing.image import ImageDataGenerator
import tensorflow
import tensorflow.compat.v1 as tf

tensorflow.random.set_seed(123)
session_conf = tf.compat.v1.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
tf.keras.backend.set_session(sess)
tensorflow.random.set_seed(2)
np.random.seed(1)

print(os.listdir("../deneme/dataset/dataset_updated/"))
#
LEARNING_RATE = 0.001
Model_Colourization.compile(optimizer=Adam(learning_rate=LEARNING_RATE),
                            loss='mean_squared_error')
Model_Colourization.summary()
#

I am getting an error in the compile part. How do I fix it?

sterile wyvern
boreal gale
# sterile wyvern I was told the results need to be continuous to use random search.

hmm okay, again we need to clarify what is "the results"

in hyperparameter optimisation, you obviously have some metric you are trying to maximise/minimise. if that's what you are calling "the results" (since your application is finance, let's just assume this is your profit for example) then i don't think that statement is correct, random search doesn't have any requirements like that.

random search is literally trying some hyperparameter configuration and seeing how well it performs; whether the metric you are trying to maximise/minimise is continuous or not shouldn't matter i think.

however if that statement is aimed towards bayesian optimisation, then maybe there is some truth to it. i have never dealt with a discrete metric to optimise for. but i can kinda see why the metric being discrete might be an issue.

sterile wyvern
boreal gale
sterile wyvern
#

Besides english

boreal gale
#

just english and chinese

sterile wyvern
#

lol

#

Lets stick to english.

sterile wyvern
#

I have a function I'm using to get hyperparameters.

#

To use random search on this function the results should be continuous.

#

I can be wrong.

boreal gale
#

I have a function I'm using to get hyperparameters.
okay, let's call this function the hyperparameters-optimiser from now on.

sterile wyvern
#

Ok.

boreal gale
#

To use random search on this function the results should be continuous.
is "this function" here the hyperparameters-optimiser?

boreal gale
#

when you say on, did you mean in?

sterile wyvern
boreal gale
#

meaning.. "using random search as the core logic of this hyperparameter-optimiser"

boreal gale
#

okie dokie

sterile wyvern
#

🙂

boreal gale
#

now that only leaves "the results" as my only source of confusion.
i assume "the results" means the evaluation metric of the model?
the model here is not the hyperparameter-optimiser

sterile wyvern
boreal gale
#

when you use the word "the results", my mind just draws a blank, hence the confusion.

#

but if you do mean the evaluation metric your hyperparameter-optimiser will be working to maximise, then my above reply is relevant
#data-science-and-ml message

sterile wyvern
#

what makes more sense to you?

sterile wyvern
boreal gale
#

"performance of the function that its being used in" makes more sense, yeah (but i can't just assume that's what you meant)

sterile wyvern
sterile wyvern
boreal gale
#

okay, random search should be fine for reasons stated above.

as for the issue of continuous or not, ratios/percentages sound pretty continuous to me, unless the components of the ratio are themselves bounded and discrete

e.g. say if your ratio is X:Y, and X can only ever take the values (1,2,3,4,5) and Y only the values (5,6,7,8,9), then that ratio doesn't sound continuous to me.

boreal gale
#

it's continuous then

arctic wedgeBOT
#

Hey @vocal fractal!

You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.

vocal fractal
sterile wyvern
#

How often do you use random search?

#

@boreal gale

#

Do you use an in-sample/out-of-sample split?

#

80/20 for example

boreal gale
# vocal fractal https://paste.pythondiscord.com/fefapididi getting this error (pasted at t...

RSI as in Relative Strength Index?
presumably it's because when calculating RSI, you had to discard some data because you don't have 14 (or whatever your window is) days worth of history for RSI calculation, and you will have less rows compared to the original dataframe

when you try to assign non-scalar (i.e. not just one value) data (in a data structure without pandas index information) back into the dataframe, the length has to match the original dataframe.
in this case you are indeed assigning non-scalar data without pandas index information (a numpy array) back into a dataframe and the length doesn't match, hence an error occurs, specifically "ValueError: Length of values (3312) does not match length of index (3326)" - notice how the two values are off by 14!
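A small reproduction of the same situation, with toy numbers. `np.diff` is only a stand-in for an RSI-style calculation that needs `window` rows of history; the column names are invented:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"close": [10.0, 11.0, 12.0, 11.5, 12.5]})
window = 2

# Stand-in for a calculation that loses `window` rows: 5 values in, 3 out.
short = np.diff(df["close"].to_numpy(), n=window)

try:
    df["indicator"] = short    # plain array with the wrong length
except ValueError as err:
    print("ValueError:", err)

# A Series carries index labels, so pandas can align it with the frame;
# the rows without a value simply become NaN.
df["indicator"] = pd.Series(short, index=df.index[window:])
print(df)
```

Whether the values belong at the start or the end of the frame depends on how the calculation consumes its history, so the `index=` slice is the part to adapt.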

#

if this isn't homework, maybe looking into using TA-lib instead of homebrewing the RSI calculation is a worthwhile thing to do.
https://pypi.org/project/TA-Lib/

boreal gale
vocal fractal
#

@boreal gale Thank you so much! I have so much to learn!

sterile wyvern
boreal gale
boreal gale
sterile wyvern
boreal gale
#

👌

sterile wyvern
#

Which is fast and powerful?

#

Why not use random search?

#

Is Bayesian optimization fast?

boreal gale
#

random search is very luck based.
bayes opt at least has some theory as to why it should work

#

though it's really worth noting, hyperparameter optimisation is not a silver bullet.
sometimes one's data/model is just not up to scratch: no matter how you tune your hyperparameters, the model is still not going to perform to your satisfaction
that has happened to me once or twice: my data was just shit, and no matter what i did, i couldn't make anything useful from it

hasty mountain
#

Can someone give me some help with unsupervised learning applied to neural networks?
I'm currently trying to test a Minimum Entropy Loss from a recent paper. Its objective is to make the model learn to minimize the information entropy of a given input more explicitly, even allowing it to "pre-classify" the input (inputs with a similar entropy level are more likely to be from the same class; e.g. in CIFAR100, ambulance pics tend to have the same minimum entropy).

However, I'm having the problem that...my model loss is actually increasing after each epoch, not decreasing(if not to say that the entropy minimization doesn't seem to make sense at all). I suppose this means the model isn't being consistent with the entropy minimization.

Can someone give me some ideas on what could be causing this? Is there any more established loss function for this task, just so I can use it as a control compared to this MinEntLoss?

#

(Yes, I have reviewed my code quite a few times to make sure I'm implementing the loss and unsupervised task correctly)
Also, the model is a ResNet extracting features from 100x100x3 images into 512 features.

warm copper
#

bruh I think its my teacher's code @boreal gale

boreal gale
warm copper
#

so its the python?

boreal gale
#

indeed. python 3.11 is not supported yet.

warm copper
#

oh interesting

boreal gale
warm copper
#

lol

#

3.4?

#

is py 3.4 out?

boreal gale
#

spark 3.4 not python 3.4

but to answer your question 3.4 was out a long long time ago

warm copper
#

yeah

#

im on 3.3

boreal gale
#

it's not out yet, i am unsure what's their release schedule

boreal gale
warm copper
#

ugh

#

thats gonna suck

#

this code worked tho

#

interestingly

boreal gale
#

if you haven't heard of pyenv, it might be worth looking into it, it will lessen the burden of installing python/switching python version.

warm copper
#
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum, desc

spark = SparkSession.builder.appName(
    'Covid').getOrCreate()

covid = spark.read.csv('/Users/kadiraltunel/PycharmProjects/covid-us.csv', sep=',',
                       inferSchema=True, header=True)

covid.show(50)

covid.groupBy('date').agg(sum('cases'), sum('deaths')).orderBy('date').show()
covid.groupBy('state').agg(sum('cases'), sum('deaths')).orderBy(desc('sum(cases)')).show()

covid.select(sum(covid.cases)).show()
#

I wonder if you get the same results as I do tho

#

when you run it

boreal gale
#

i don't have access to your csv, so it's gonna blow up, but i will most likely get the same behaviour as you i believe.

warm copper
#

this is the data

boreal gale
#

as to why it only breaks when running the dataframe example script?
it's because of some spark internal which deemed creating a dataframe from user manually supplied data requires a shuffle in the data (or at least the shuffle function), such that the random.Random referenced here https://github.com/apache/spark/pull/38987/files is pickled for transport to other process (basically this is how your python code gets transported to the executor in spark, via something called a pickle, you might also see something like cloudpickle, or dill - all very similar and built upon pickle), seeing as this class no longer exists, code running on python3.11 blows up.

warm copper
#

Im confused because the number of cases seem way too high

#

oh I see @boreal gale

boreal gale
#

anyway, you can tell your professor 3.11 can't run the dataframe script properly at the moment. you can link https://github.com/apache/spark/pull/38987 and/or https://issues.apache.org/jira/browse/SPARK-41125 if he asks for proof/reasons

also, if that's the only script that doesn't run, there is not much reason to downgrade 🤷 (my advice to downgrade was based on my understanding that spark just plainly doesn't work at all in python3.11, of which obviously i was wrong)

sonic comet
warm copper
#

weird that there are over 31 billion cases right? @boreal gale

rapid oriole
#

Hey guys, I need help with a customer churn prediction model for my group project in my marketing course. Basically we have a database of 1.2 million customers who made purchases at various retailers. A specific retailer has been assigned to us, so we now have 591k customers who have made at least one purchase at this retailer. We would like to create a binary variable called 'churn' that takes the value 1 if the last purchase was more than 18 months ago, 0 if not. We have data for 36 months (2019-2021). We would like to predict customer churn. I intend to split the data randomly into train and test, however my question is: after fitting the model, what do I do? Obviously I can't predict for the customers the model was trained on, so we can ignore them. What about the rest? How do I apply my model so we can confidently say: this cluster of customers is at risk of churning?

Edit: After thinking, I had this idea: after training the model, I create a new dataframe containing only customers that are still active and haven't churned yet. Then I run the trained model on it, and every customer predicted to churn will form my group of customers at risk of churning?

edgy falcon
#

Hi!, can someone help me with the next problem:
I'm trying to use a Transformer XL layer in my model, but when I used the argument "**kwargs" it told me it's not defined, even though the documentation uses it. Help me please

soft badge
#

Does anyone know a Discord community focused on ChatGPT and other AI?

lapis sequoia
#

is there a way to version control jupyter notebooks on github that doesnt make the diffs insane and huge

#

or does kaggle or hugging face have something smarter

night pasture
#

hello i am new here and i am searching for how to become data science programmer can anyone suggest what do i learn first

night pasture
plush jungle
#

have you taken any machine learning courses in university?

night pasture
plush jungle
#

well the first thing to do would be get super familiar with all the basics of python. data scientists use almost exclusively python

night pasture
plush jungle
#

learning by doing is the best way. pick a project and test your skills. preferably a data science project

#

there are three main things data scientists do:

gather data
clean data
apply machine learning/statistics to data

night pasture
#

thanks i will try to learn those

plush jungle
#

if you're looking for data that's already gathered and mostly cleaned, kaggle.com is a great resource

hasty mountain
# edgy falcon Hi!, can someone help me with the next problem: Im trying to use a Transformer ...

**kwargs is not really a proper argument, it's just extra arguments that could be added.

import torch

def init_optimizer(params, **kwargs):
    # defaults first; anything passed in kwargs is added or overrides them
    arguments = {
        'lr': 0.001,
        'betas': (0.9, 0.999),
        **kwargs,
    }
    return torch.optim.Adam(params, **arguments)

Considering that torch.optim.Adam() accepts lr, betas and eps as arguments, you could call init_optimizer(model.parameters(), eps=1e-6), for example. Here eps lands in **kwargs: an extra argument that the function doesn't define by default.
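
To see the `**kwargs` mechanics without needing torch installed, here is a minimal sketch. `make_optimizer_config` and `describe` are made-up helpers for illustration, not part of any library; they only show how extra keyword arguments get collected into a dict and then unpacked again.

```python
def make_optimizer_config(lr=0.001, betas=(0.9, 0.999), **kwargs):
    """Collect defaults plus any extra keyword arguments into one dict."""
    # kwargs is a plain dict holding every keyword argument that was not
    # matched by a named parameter (here: anything other than lr and betas).
    config = {"lr": lr, "betas": betas}
    config.update(kwargs)
    return config


def describe(**settings):
    """Receives the dict entries as individual keyword arguments."""
    return ", ".join(f"{name}={value}" for name, value in sorted(settings.items()))


config = make_optimizer_config(eps=1e-6)  # eps is not a named parameter
print(config)
print(describe(**config))                 # ** unpacks the dict back into keywords
```

So `**kwargs` in a signature means "collect the leftovers", and `**some_dict` in a call means "spread these out as keywords". Writing a bare `**kwargs` in a call only works if a variable named `kwargs` actually exists at that point.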

edgy falcon
#

Thank u bro

mint palm
#

pca, tSNE (visualisation, dimensionality reduction), contingency tables, uni/bi/multi variate analysis, what other statistics should i learn before my interview?
I will do micro/macro F1, confusion matrix, AUC, ROC etc, too

#

from libraries what important functions should i know?

cloud finch
rough lava
#

Anyone here knows any free/open-source datasets regarding bias ?

rough lava
rough lava
#

Should I maybe dm another channel here? lemon_thinking

lavish wave
#

Can anyone help me in resolving this error?

serene scaffold
#

though I suspect that the model you're trying to load is just invalid.

lapis sequoia
#

Hi I want to train a regression model.
Should I find the optimal degree first on a default model?
And then take care of regularisation and other parameters later with the optimal degree of the regression model?

agile cobalt
#

use regularisation from the start.

lapis sequoia
#

Does that mean if you have n hyperparameters you would always need a n-nested loop?

agile cobalt
#

you do not have to perform a full grid search on all parameters, if there are too many hyper parameters to tune just use a random search instead of grid search or fix some of them
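
A rough sketch of the difference, using only the standard library. `toy_score` is a made-up stand-in for a cross-validated score and the parameter values are arbitrary assumptions; the point is only that grid search evaluates every combination while random search spends a fixed budget.

```python
import itertools
import random


def toy_score(alpha, degree):
    """Stand-in for a cross-validated score; higher is better."""
    return -((alpha - 0.1) ** 2) - 0.05 * (degree - 3) ** 2


alphas = [0.001, 0.01, 0.1, 1.0, 10.0]
degrees = [1, 2, 3, 4, 5]

# Grid search: evaluates every combination (5 * 5 = 25 fits).
grid = list(itertools.product(alphas, degrees))
best_grid = max(grid, key=lambda p: toy_score(*p))

# Random search: evaluates only a fixed budget of sampled combinations.
rng = random.Random(0)
budget = 8
sampled = [(rng.choice(alphas), rng.choice(degrees)) for _ in range(budget)]
best_random = max(sampled, key=lambda p: toy_score(*p))

print("grid:", len(grid), "fits, best", best_grid)
print("random:", len(sampled), "fits, best", best_random)
```

With n hyperparameters the grid grows multiplicatively, while the random-search budget stays whatever you set it to. `itertools.product` also removes the need to write an n-nested loop by hand.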

lapis sequoia
#

In this case it's just 2, so it will work

#

Once I did a 4 level grid search, that took a very very long time so I ended up fixing one of the parameters to just an acceptable value

agile cobalt
#

remember that if you are way too picky about your hyper parameters you may end up effectively overfitting to your test set

lapis sequoia
#

hmm

lapis sequoia
#

my column looks like this, do I perform power transformation on them to make them normal for polynomial regression

#

I don't really know when to use it and if there's any downsides to just always using it. Chatgpt says to compare performance before and after using it. But that's just another 'hyperparameter' to tune then.

#

@agile cobalt

agile cobalt
lapis sequoia
#

ye

agile cobalt
#

honestly I have never seen the term power transformation before and taking a quick look at wikipedia I don't get it, but for features that typically scale exponentially like population or money (such as the GDP column) you may want to consider using log(), while for things that scale linearly you'll probably want to not use any way too fancy methods
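
A small illustration of why log() helps for that kind of column. The GDP-like numbers below are invented for the example; only the effect on the spread matters.

```python
import math
import statistics

# Hypothetical GDP-like values spanning several orders of magnitude.
gdp = [2.1e9, 8.5e9, 4.0e10, 3.2e11, 1.9e12, 2.3e13]

log_gdp = [math.log10(value) for value in gdp]

# On the raw scale the largest value dominates; on the log scale the
# points are spread much more evenly.
raw_ratio = max(gdp) / statistics.median(gdp)
log_ratio = max(log_gdp) / statistics.median(log_gdp)
print(f"raw max/median: {raw_ratio:.1f}")
print(f"log max/median: {log_ratio:.2f}")
```

The transform only works for strictly positive values; columns with zeros usually need something like log(1 + x) instead.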

lapis sequoia
#

hmm

#

what's a way to check for growth rate of something

#

what plot will it reflect in

agile cobalt
#

you don't

#

if anywhere, it might be reflected in the distribution of the data

#

but that is something that you should know about the data you are dealing with, not something you'll infer from the data

lapis sequoia
#

What if it's all just black box data with no labels

#

or you can't understand the labels

agile cobalt
#

then you should not be using that data at all?

#

model interpretability is already bad enough as-is, I cannot recommend using data you don't understand

lapis sequoia
#

I see

boreal gale
#

if your question is about when to use power transformation, imo you should use it when your model assumes normality and your data doesn't follow a normal distribution (how you determine if your data is roughly normal is another question, QQ plot and kolmogorov smirnov test are pretty common)

if your model doesn't require/assume normality, then there are less reasons to use it but sometimes it is indeed useful, i feel this is all very context-dependent.

also power transform impacts the interpretability of your model, which might be an issue. but you can always use SHAP to recover some if not all interpretability.

lapis sequoia
boreal gale
#

i actually have no idea 🤔 my stats has degraded a lot since leaving uni

lapis sequoia
#

what did you study

boreal gale
#

stats 😂

wooden sail
#

if you do it via linear least squares, poly regression does indeed assume normality

#

there's more than one way to find the coefficients of a polynomial

#

least squares always assumes normally distributed observations, i.i.d.

lapis sequoia
#

hmm, I think sklearn does it the least square way, doesn't it?

wooden sail
#

most likely, but i can't say for sure 😛

lapis sequoia
#

That's what I am taught as well

#

But I also used regularisation, but that doesn't affect much ig?

wooden sail
#

depends on the kind of regularization

lapis sequoia
#

I did elastic net

#

l1+l2

wooden sail
#

you can usually think of regularized least squares as assuming there is AWGN, and the regularization terms are equivalent to assuming your coefficients are random and come from a special distribution

#

with l1 and l2, that'd be some weird combo of laplace and gaussian priors
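
Spelling that out as a sketch (standard textbook setup; the symbols σ², τ², b, λ₁ and λ₂ are assumed names for the noise variance, the prior scales and the penalty weights): with additive white Gaussian noise, the MAP estimate minimises the squared error plus the negative log-prior of the coefficients.

```latex
\begin{aligned}
y &= Xw + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I) \\
\hat{w}_{\text{MAP}} &= \arg\max_w \; p(y \mid w)\, p(w)
  = \arg\min_w \; \frac{1}{2\sigma^2}\lVert y - Xw \rVert_2^2 - \log p(w) \\
\text{Gaussian prior } p(w) \propto e^{-\lVert w \rVert_2^2 / (2\tau^2)}
  &\;\Rightarrow\; -\log p(w) = \frac{1}{2\tau^2}\lVert w \rVert_2^2 + \text{const} \quad (\ell_2,\ \text{ridge}) \\
\text{Laplace prior } p(w) \propto e^{-\lVert w \rVert_1 / b}
  &\;\Rightarrow\; -\log p(w) = \frac{1}{b}\lVert w \rVert_1 + \text{const} \quad (\ell_1,\ \text{lasso}) \\
\text{Elastic net: } p(w) \propto e^{-\lambda_1 \lVert w \rVert_1 - \lambda_2 \lVert w \rVert_2^2}
  &\;\Rightarrow\; \hat{w} = \arg\min_w \; \frac{1}{2\sigma^2}\lVert y - Xw \rVert_2^2 + \lambda_1 \lVert w \rVert_1 + \lambda_2 \lVert w \rVert_2^2
\end{aligned}
```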

lapis sequoia
#

I have 38 features now, and with degree 3 that becomes a lot of features and takes long to run

#

don't people have tons of features irl?'

#

That would make polynomial regression inefficient

#

beyond maybe 2 degrees

boreal gale
#

don't people have tons of features irl?'
yes, but most people don't use polynomial regression, at least from my past experience

lapis sequoia
#

why is it there then

#

What are the popular, most-used algorithms?

#

Cool graph isn't it

wooden sail
#

you usually only use poly regression for modestly low degree polynomials

lapis sequoia
#

My teacher made it for 10 degrees, but I have tons of features

wooden sail
#

because it turns out it's a fairly challenging problem

#

it involves a vandermonde matrix with a terrible condition number

#

you can run into issues involving numerical stability, or if direct inversion is impossible, slow convergence
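
A quick way to see the conditioning problem, assuming NumPy is available. The design matrix of a one-variable polynomial fit has columns 1, x, x², ..., and its condition number grows quickly with the degree. The sample points are an arbitrary choice for the sketch.

```python
import numpy as np

# 50 sample points on [0, 1]; the columns of the design matrix are
# 1, x, x^2, ..., x^degree (a Vandermonde matrix).
x = np.linspace(0.0, 1.0, 50)

conds = {}
for degree in (2, 5, 10):
    design = np.vander(x, degree + 1, increasing=True)
    conds[degree] = np.linalg.cond(design)
    print(f"degree {degree:2d}: condition number ~ {conds[degree]:.2e}")
```

A large condition number means small changes in the data can produce large changes in the fitted coefficients, which is one reason high-degree fits are usually done with scaled inputs or an orthogonal polynomial basis rather than raw powers.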

lapis sequoia
#

@wooden sail

agile cobalt
wooden sail
#

the additive error is normally distributed

thick viper
#

Any ideas?🤣

agile cobalt
#

Challenge portfolio work
what does that mean?

thick viper
#

It’s just a task, it’s my homework

#

I’m trying to do the challenge task because I wanna try to learn it better but I literally have no clue

wooden sail
#

you'll have to brush up on your joint and conditional probabilities

thick viper
#

Nevermind I got it :so

#

Had to do it on paper haha

mossy lance
#

in pytorch, does anyone know how to combine a sequence of multidimensional tensors? 65536 4x4x4 tensors that i want to reshape into 8x8x8 tensors

serene scaffold
mint palm
#

can interviewer ask me "without checking syntax" write code for something?

#

what am i generally expected to write without reference? To give you an idea of the role: i am applying for a "Sr. data scientist" role at a startup

scarlet kite
#

beginner question. what is the bias used for? also is it automatically added or not?

agile cobalt
#

I'm guessing the b in y = x*w + b?

low musk
#

bias in neural networks

low musk
agile cobalt
#

usually it is just a feature with value 1 for all records

keen loom
mint palm
low musk
thick viper
agile cobalt
#

without it, you would always get y=0 for Xs = [0, 0, 0, 0, 0, ..., 0, 0, 0]
some machine learning libraries might add it automatically for you, while others may require for you to add it yourself

mint palm
#

am i hired, lmao?

keen loom
low musk
#

i thought it was just some random value u had to choose

#

in my project the value of bias didnt matter much 🤔

scarlet kite
#

thanks

low musk
#

hmm I didnt use any library i so thats why i had to select a random value

agile cobalt
#

the bias "feature" always has a value of 1
the bias weight is initialised randomly and learned by the network, just like all other feature weights
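
A tiny sketch of that idea in plain Python. The weights and inputs are arbitrary numbers and `neuron` is a made-up helper; it just shows that "a feature fixed at 1 with its own weight" is the same thing as adding b.

```python
def neuron(inputs, weights, bias_weight):
    """Weighted sum plus bias: y = sum(x_i * w_i) + b * 1."""
    # Treating the bias as an extra input fixed at 1 gives the same result
    # as adding bias_weight directly.
    extended_inputs = list(inputs) + [1.0]
    extended_weights = list(weights) + [bias_weight]
    return sum(x * w for x, w in zip(extended_inputs, extended_weights))


zeros = [0.0, 0.0, 0.0]
weights = [0.4, -1.2, 0.7]

print(neuron(zeros, weights, bias_weight=0.0))   # no bias: output is stuck at 0
print(neuron(zeros, weights, bias_weight=0.5))   # bias lets the output move off 0
```

During training only `bias_weight` changes; the constant 1 it multiplies never does.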

low musk
#

idk what is feature

agile cobalt
#

one of the names for your input data

low musk
#

oh

#

gradient descent

#

i didnt implement it

#

but my program still works

#

98% of the time it guesses it correctly

#

what else matters?

#

but it improved?

agile cobalt
#

there are dozens if not hundreds of different ways to measure how well a model is doing

low musk
#

when u start training it has accuracy 0.5

#

eventually reaches 0.98

agile cobalt
#

!paste can you show your code?

arctic wedgeBOT
#
Pasting large amounts of code

If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

low musk
scarlet kite
#

is an activation function used after the last layer of the network?

low musk
#

idk what that means yet

scarlet kite
#

i was just asking because i dont know the answer

low musk
#

oh i thought u were asking about my code

#

😅

agile cobalt
#

not sure if it counts as actual gradient descent, but I guess that this bit trains it

arctic wedgeBOT
#

algorithm.py lines 135 to 145

```py
if product + bias > 0:
    predict_shape = shapes[0]
    if shape != predict_shape:
        weight = addition(weight, image_list, +1)
else:
    predict_shape = shapes[1]
    if shape != predict_shape:
        weight = addition(weight, image_list, -1)

if shape == predict_shape:
    correct_guesses += 1
```
agile cobalt
#

but +1 on their suggestion to use numpy

low musk
#

is it neccesary

agile cobalt
#

would probably run at least 10~100x faster, and if you want to actually work with data science or do anything even hobby level of seriousness you'll 100% need to use numpy and friends

low musk
#

how do I learn it properly is there a beginner's book for this

agile cobalt
#

I'd recommend taking a course like Andrew Ng's machine learning introduction on coursera, or at least following something like 3Blue1Brown's videos or https://course.fast.ai

low musk
#

i am just starting linear algebra at school 💀

#

how do they identify more than 2 things

agile cobalt
#

watch it to find out /s

low musk
#

/s ?

agile cobalt
#

sarcasm

low musk
#

ohh

agile cobalt
#

usually you'll do something like ```
1-10 how much of a circle is it
1-10 how much of a rectangle is it
1-10 how much of a triangle is it
```

low musk
agile cobalt
#

you don't | you rewrite a lot of things

low musk
#

how do they get those values

agile cobalt
#

actually seriously this time, watch the video to find out

low musk
#

yeah no good night

#

i have school tom

scarlet kite
#

is an activation function used after the last layer of the network?

#

or just for hiudde

#

hidden

agile cobalt
mild dirge
#

And not every layer in the same model needs to have the same activation

#

You will often see hidden layers each having ReLU activation, and the last layer Softmax f.e.
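
For a concrete picture, here is a minimal forward pass in plain Python with ReLU on the hidden layer and softmax on the output. The weights are hand-picked placeholders, not trained values, and `dense` is a made-up helper.

```python
import math


def relu(values):
    """Hidden-layer activation: clamp negatives to zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Output activation for classification: scores -> probabilities."""
    shifted = [v - max(values) for v in values]  # subtract max for stability
    exps = [math.exp(v) for v in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(x * w for x, w in zip(inputs, row)) + b
            for row, b in zip(weights, biases)]


# Hypothetical hand-picked weights: 2 inputs -> 3 hidden units -> 3 classes.
hidden = relu(dense([1.0, -2.0],
                    [[0.5, 0.1], [-0.3, 0.8], [0.2, -0.4]],
                    [0.0, 0.1, 0.0]))
probs = softmax(dense(hidden,
                      [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.3, 0.3, 1.0]],
                      [0.0, 0.0, 0.0]))
print(hidden)
print(probs, "-> predicted class", probs.index(max(probs)))
```

The softmax output gives one score per class that sums to 1, which is how a network can pick between more than two things. For regression the last layer is often left without an activation at all.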

scarlet kite
#

and when i use random data for a network, is there a way of seeing the predictions vs. the real data?

#

@mild dirge

edgy falcon
#

Hi! someone can help me with this error in a Transformer XL layer on tensorflow:
TypeError: tf__call() missing 1 required positional argument: 'relative_position_encoding'

Here's the layer

    vocab_size=140,
    num_layers=6,
    hidden_size=256,
    num_attention_heads=30,
    head_size=5,
    inner_size=30,
    dropout_rate=0.2,
    attention_dropout_rate=0.2,
    initializer="glorot_uniform",
    two_stream=True,
    tie_attention_biases=True,
    memory_length=30,
    reuse_length=30,
    inner_activation='relu'
)(embedding_1)```
hasty mountain
edgy falcon
#

I'll try it, thank u bro

brittle pivot
#

I have a pandas dataframe with a time column, a power column and a frequency column. The power and frequency are measured at 0.05s intervals. The frequency that is measured is based on a target frequency, where the frequency will jump to a value and then be held for 60 seconds. how can I split the dataframe based on these frequency jumps?

brittle pivot
#

I want to detect when the frequency changes, i.e. f1 and take a slice from index[0] to index[f1], then from index[f1] to index[f2] and so on
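
One possible approach, sketched with made-up data: flag rows where the frequency changes by more than some threshold, then use a cumulative sum of those flags as a segment id. The column names, the sample values and the 1 Hz threshold are all assumptions for the example.

```python
import pandas as pd

# Hypothetical data: 0.05 s sampling, the frequency is held and then jumps.
df = pd.DataFrame({
    "time": [round(0.05 * i, 2) for i in range(9)],
    "power": [1.0, 1.1, 0.9, 2.0, 2.1, 1.9, 3.0, 3.1, 2.9],
    "frequency": [50.0, 50.01, 49.99, 60.0, 60.02, 59.98, 55.0, 55.01, 54.99],
})

jump_threshold = 1.0  # assumed: anything above 1 Hz between samples is a jump

# True on the first row of each new plateau; cumsum turns that into a segment id.
is_jump = df["frequency"].diff().abs() > jump_threshold
df["segment"] = is_jump.cumsum()

segments = [group.drop(columns="segment") for _, group in df.groupby("segment")]
for segment in segments:
    print(segment, end="\n\n")
```

If the measured frequency ramps over several samples instead of jumping in one step, the threshold would need tuning, or the comparison could be done against the target frequency column if one is available.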

lofty dagger
#
import plotly.graph_objects as go

lat = ["22.290222"]
lon = ["73.167065"]

fig = go.Figure(go.Scattermapbox(
    lat=lat,
    lon=lon,
    mode="markers",
    marker=go.scattermapbox.Marker(
        size=10,
        color='red'
    ),
    text=['Location'],
))

fig.update_mapboxes(style="open-street-map")
fig.write_html("/tmp/temp.html")

anyone knows how i would change the shape of the marker to that of a bus?

untold flicker
#

Hi I'm just trying to create a time-series neural network. I don't know what format to put time into my data? Do I convert it into seconds and input my data as a 3-D tensor, samples, seconds, features or do I keep it in date and time format

lapis sequoia
#

How can I remove the dependent variable from my list of Xs? Here is what I got so far

#
                                paste(feature.names, 
                                      collapse = ' + ')))```
#
Type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe + Type```
#

Nvm I figured it out

clever summit
#

Hello

#

Can you help me?

#

I'm keeping on hitting this error:

~\AppData\Local\Temp\ipykernel_7264\806691498.py in <module>
     27 incCount5=0
     28 incCount_reset=0
---> 29 start_time=time.time()
     30 
     31 net = cv2.dnn.readNetFromDarknet(model_config,model_weights)

AttributeError: 'float' object has no attribute 'time'```

What did i do wrong? What should i do?
wooden sail
#

call the variable or the module a different name
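
A minimal reproduction of that kind of name clash, assuming the cause is an earlier assignment like `time = time.time()` somewhere in the notebook (the original cell isn't shown, so this is a guess at the pattern):

```python
import time

start_time = time.time()       # fine: `time` is still the module

# Rebinding the name `time` to a float hides the module from here on.
module_backup = time
time = time.time()

try:
    time.time()                # now a float has no attribute `time`
except AttributeError as error:
    message = str(error)
    print("AttributeError:", message)

# Fix: keep the module name intact and use a different variable name.
time = module_backup
elapsed = time.time() - start_time
print(f"elapsed: {elapsed:.6f} s")
```

In a notebook the bad assignment can live in a cell that was run earlier, so restarting the kernel after renaming the variable is usually needed too.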

mild dirge
#

Yeah, we're solving it in their help channel

clever summit
#

Hello, i need help again

#

This time about opencv dnn error

#
error                                     Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_7264\3346177036.py in <module>
     87     #classIds, confs, bbox = net.detect(img,confThreshold=thres)
     88     print(classIds,bbox)
---> 89     blob=cv2.dnn.blobFromImage(img,1/255,(wght_hght_target,wght_hght_target),[0,0,0,0],1,crop=False)
     90     net.setInput(blob)
     91     LayerNames=net.getLayerNames()

error: OpenCV(4.7.0) D:\a\opencv-python\opencv-python\opencv\modules\imgproc\src\resize.cpp:4062: error: (-215:Assertion failed) !ssize.empty() in function 'cv::resize'```
#

I am currently using yolov3-320.cfg as config and yolov3-320.weights as weight

#

What is happening here?

queen cradle
silent spade
#

Has anyone tried deploying an NLP model that uses nltk wordnet or stopwords to AWS Lambda?

serene scaffold
silent spade
#

ok my apologies.

I have built a lambda function that uses NLTK to preprocess text before being used in my classification model. The function needs to use NLTK's stopwords, punkt and wordnet libraries to work. I am having issues with the lambda function being able to download the libraries upon execution. Everything works fine locally, but when deployed to AWS it doesnt download the files to the right directory. Has anyone come across this issue before?

dusty valve
#

I wanted to display a cnn layer weights in mplt,the shape is (3,3,3,32), can i do it?

agile cobalt
silent spade
#

I have tried to change the install location to a /tmp/ directory, but the function doesnt want to search that directory for the libraries.

arctic wedgeBOT
#

9. Do not offer or ask for paid work of any kind.

low musk
#

should I run some random apk from a discord server 🤔

serene scaffold
#

You can't recruit for paid opportunities or business projects here, so please remove your message.

#

!warn 807551900417130537 We've asked you before not to recruit for projects like this. This is your last warning about this, so please contact @sonic vapor if you need any further clarification about what is or isn't appropriate.

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied warning to @shell sequoia.

midnight grotto
#

how tough is it to like make some sort of ai . that simply has to choose between apis to use for results on the basis of text provided

tidal bough
#

"some sort of ai" can mean a lot of things, including "an if statement" :p

#

depends what kind of accuracy you want, and what kind of task you have in detail.
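
At the "if statement" end of that spectrum, a keyword router might look like the sketch below. The API names and keyword sets are invented for illustration; a real task would likely need a proper text classifier once the inputs get less predictable.

```python
# Hypothetical API names and keywords, purely for illustration.
ROUTES = {
    "weather_api": {"weather", "rain", "temperature", "forecast"},
    "stocks_api": {"stock", "price", "shares", "market"},
    "news_api": {"news", "headline", "article"},
}


def choose_api(text, default="search_api"):
    """Pick the API whose keyword set overlaps most with the text."""
    words = set(text.lower().split())
    scores = {name: len(words & keywords) for name, keywords in ROUTES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


print(choose_api("what is the weather forecast today"))
print(choose_api("Apple stock price"))
print(choose_api("hello there"))
```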

iron basalt
#

A bit late on this news, but this is something nice.

wooden sail
#

oh that's fantastic

tidal bough
#

heh, I saw this suggested as a possible way to improve gpt4's capabilities, like, yesterday

#

since it's not good at math, but pretty good at delegating to tools

iron basalt
#

If it could read your input files / directories it could be a full DS tool.

wooden sail
#

yeah this makes a lot more sense. basically use it to translate your natural language queries to formal mathematical ones and back

iron basalt
#

(And remote databases that you point it at)

bright pasture
#
ValueError: num_samples should be a positive integer value, but got num_samples=0```

I'm trying to run something called Lora_SVC, and it gives me this error when I try to train.
edgy falcon
#

Hi! i need some help please, im passing the argument relative_position_encoding to a Transformer XL layer (in tensorflow) like this:
relative_position_encoding=(None, 300, 256)
But i get the next error:

Dimension value must be integer or None or have an index method, got value 'TensorShape([])' with type '<class 'tensorflow.python.framework.tensor_shape.TensorShape'>'

So whats wrong?

misty flint
hasty mountain
#

Guys, when preprocessing text data for training a Transformer model, should I add a <Start-Of-Sentence> token to my target sentence?
So at the first iteration in a sentence, the model must predict the <SOS> token before predicting any actual word?

#

It feels a bit weird, since the <SOS> token is inserted by default during inference...

hasty mountain
#

Uh... I suppose when the Transformer is implemented correctly, vanishing gradients isn't that much of a problem?

glad otter
#

I have a log file that may or may not contain error lines. I want to write code that can understand each error line and print just one unique example of each, with no duplicates

Example

Input file :
Leakage value 1.2 for circuit 1 is greater than the standard

1)Leakage value 0.9 for circuit 2 is greater than standard

2)Capacitance is huge in circuit 3

3)Capacitance is huge in circuit 4

4)Capacitance is huge in circuit 5

5)Capacitance is huge in circuit 6

6)High delay in circuit 7

Output: must be the unique ignoring instance information like circuit number or certain value

  1. Leakage value 1.2 for circuit 1 is greater than the standard

  2. Capacitance is huge in circuit 3

  3. High delay in circuit 7

The log file may contain more than 10,000 errors, but there might be just 10 unique errors, as shown in the output.

Anyone can suggest a library, or a place to start from?is it possible to make code clever enough to determine these things ?
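
A simple place to start, using only the standard library: replace every number with a placeholder so that lines differing only in circuit number or measured value collapse to the same template, then keep the first line seen for each template. The sample lines are adapted from the example above, and `template_of` is a made-up helper.

```python
import re

log_lines = [
    "Leakage value 1.2 for circuit 1 is greater than the standard",
    "Leakage value 0.9 for circuit 2 is greater than the standard",
    "Capacitance is huge in circuit 3",
    "Capacitance is huge in circuit 4",
    "High delay in circuit 7",
]

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def template_of(line):
    """Replace every number with a placeholder so instances collapse."""
    return NUMBER.sub("<NUM>", line.strip())


# dict keeps insertion order, so the first example of each template is kept.
first_seen = {}
for line in log_lines:
    first_seen.setdefault(template_of(line), line)

for index, example in enumerate(first_seen.values(), start=1):
    print(f"{index}. {example}")
```

This only handles variation in numbers. If the wording itself varies slightly between instances (for example "the standard" versus "standard"), something fuzzier such as `difflib.SequenceMatcher` from the standard library would be a possible next step.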

keen kestrel
#

Anyone can suggest me which vendor that offers VM with gpu like V100 / A10 / A100 at decent price? This is for my personal learning on training deep learning model on public data, no privacy / enterprise feature necessary.

misty flint
#

courtesy of the FSDL folks

#

blobpray🥞

#

thanks josh tobin and charles frye and fam

keen kestrel
hasty mountain
misty flint
#

their online course is also really good

obsidian moth
#

First of all hi, I need a data about whether the python is playing in data science how??

How to become data scientist by using python language

white sun
#

maximum knowledge required in DS&A (PYTHON) for Data Sc.?

wooden sail
#

wdym by "maximum knowledge"?

mint palm
#

I had an interview yesterday, went well.
One anomalous question was following:

HIM: why F1 is HM?
ME: To punish score even if one of precision or recall is low even when other might be high.
HIM: so why cant we use F1=precision X recall

Now, i wasnt able to comeup with explanation, he told me it had to do something with Harmonic motion that we learn in high school.
Does anyone know why F1 cant be precision X recall?

mild dirge
#

It's the harmonic mean between precision and recall

#

In mathematics, the harmonic mean is one of several kinds of average, and in particular, one of the Pythagorean means. It is sometimes appropriate for situations when the average rate is desired.
The harmonic mean can be expressed as the reciprocal of the arithmetic mean of the reciprocals of the given set of observations. As a simple example, t...

#

And this is probably what they meant with that motion:

#

@mint palm

#

Here you can see the resulting f1 score (yellow is 1, dark blue is 0) for the harmonic mean, and simply multiplying them
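
The same comparison in numbers, as a small sketch. One common argument (not necessarily the one the interviewer had in mind) is that the harmonic mean stays on the same scale as its inputs, while the plain product shrinks even when precision and recall agree.

```python
def harmonic_mean(precision, recall):
    """F1 score: 2PR / (P + R), defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def product(precision, recall):
    """Plain product of precision and recall, for comparison."""
    return precision * recall


for p, r in [(1.0, 1.0), (0.5, 0.5), (0.9, 0.1), (0.8, 0.8)]:
    print(f"P={p:.1f} R={r:.1f}  F1={harmonic_mean(p, r):.3f}  PxR={product(p, r):.3f}")
```

Both measures punish a low value on either side, but with P = R = 0.5 the F1 is 0.5 (still readable as "an average of the two") while the product is 0.25.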

queen vector
#

Hello, I am using requests-html for web scraping, but when I encounter a <div class=""> no children are returned, even though there is an img tag when inspecting. How do I get that img tag's src data?

edgy falcon
#

Hi! i need some help please, im passing the argument relative_position_encoding to a Transformer XL layer (in tensorflow) like this:
relative_position_encoding=(None, 300, 256)
But i get the next error:

Dimension value must be integer or None or have an index method, got value 'TensorShape([])' with type '<class 'tensorflow.python.framework.tensor_shape.TensorShape'>'

So whats wrong?

light quiver
#

hey,does anyone use simpy in simulations???

hasty mountain
#

Hey guys, I'm trying to make a vectorizer model. The model has 4 fully connected layers, receives features from an image and then generates a vector. It works fine for generating vectors with dimensions (Batch, vector_size).
However, I want to generate 2 dimensional vectors to see how things will work and also to be able to plot the model performance and visualize how things are going(like it's done with PCA and tSNE), but I don't know how to do this without getting an output where the first vector number will be exactly equal to the second.
My code looks like this:

x = self.neuronA(x)
x = self.Relu(x)
x = self.neuronB(x)
x = self.Relu(x)
x = self.neuronC(x)
x = self.Relu(x)
features_embedding = self.neuronD(x)

return features_embedding

I want to make something similar to Pytorch/Keras embedding layer:

test = torch.randint(0, 10, (1, 5)) # (Batch, n_features)

embed = nn.Embedding(10, 10) # 10 embedding dimensions

out = embed(test) # (Batch, n_features, embedding_dimension) = (1, 5, 10)

Any tip or suggestion?

#

(ChatGPT suggested that I simply reshape my output, but this doesn't seem to make sense mathematically)

mild dirge
#

Well, reshaping is the first thing I thought of too. It will generate 50 output values. You can interpret it as a 1d vector, or 5x10 f.e.

#

So what do you want different from a 2d output than from a 1d with same number of elements?

#

And with PCA you would normally get a 1d vector output with 2 elements. Such that you can plot it as a 2d x-y graph

hasty mountain
#

I think the correct term would be "spatial vector representation", or something like that...

mild dirge
#

So an embedding where you want similar inputs to be close in the output space as well

hasty mountain
mild dirge
#

Yes, you could just make the output a 1d vector of two elements

hasty mountain
#

Oh, ok. Now that I think about it...it's a bit like how we do to create images with linear layers... We get a 1-d output, and simply apply reshape to get a 2-d or 3-d array

mild dirge
#

You want the output to be an image?

hasty mountain
#

Apply dimensionality reduction to a dimensionality reduction?

hasty mountain
#

Ok then... That was easier than I expected. Thanks!

boreal gale
#

could you type out the expected output by hand in form of a dataframe please?

#

and can i assume column Two prefix will match One? i.e. if Two is 'A-123' then One must be A?

boreal gale
#

okay perfect, gimme a moment!

#

okay, long story short is that there is just no out of the box way to do this merge natively using just the toolbox pandas provides

doing any naive merge and then filling in the blanks seems to be making it harder on yourself.

i first assume you know how to truncate Two from df_data into the string before ;, i will call this truncated Two
imo, your best bet would be then to compute the correct join key in your df_data first, i.e. first check if the corresponding truncated Two exists in df_keys, if it does, great, use that truncated Two as is, if not then the join key would be None/NaN (since your wildcard join is indicated by None/NaN)
(edit: by join key, i meant one part of the actual join condition you will be using, namely how you match up the Twos from both dataframe, since One is already known to be equal from both dataframe, we pay no extra attention to it)

all together this would be

df_data['truncated_two'] = df_data['Two'].str.split(';').str[0]  # > i first assume you know how to truncate Two from df_data into the string before ;, i will call this truncated Two

df_data['joinable_two'] =  np.where(
    df_data['truncated_two'].isin(df_keys['Two']),  # >  first check if the corresponding truncated `Two` exists in `df_keys`
    df_data['truncated_two'],  # > if it does, great, use that truncated `Two` as is
    None  # > if not then the join key would be `None`/`NaN` (since your wildcard join is indicated by `None`/`NaN`)
)
    

pd.merge(
    df_keys,
    df_data,
    left_on=['One', 'Two'],
    right_on=['One', 'joinable_two'],
    how='right',
)[['One', 'Two_x', 'Target', 'Total']]
iron basalt
agile cobalt
#

there's also the alternative of building a multiindex instead of using merge() / join() but I probably shouldn't really recommend it ```py
import pandas as pd
...
df_data["Two"] = df_data["Two"].str.split(";", n=1).str[0]
mapping = df_keys.set_index(["One", "Two"])["Target"]
keys_to_map = pd.MultiIndex.from_frame(df_data[["One", "Two"]])

values = keys_to_map.map(mapping)

result = df_data.assign(Target=values.fillna(0))
print(result)
```

primal linden
#

This is the last cell of a project I've been working on in jupyter notebook. I added it specifically because a blogger said it only requires pandas/numpy, and I have no other visualizations in the notebook. Upon running it in a virtual environment it turns out the blogger lied to me and it requires matplotlib.

My question is, do you fine folks think it is worth including matplotlib just for this one, rinky dink visualization, or should I just remove it altogether because the values are already discussed in the cells prior?

tidal bough
#

You need matplotlib for background_gradient? huh, weird

#

oh, I guess it's for the colormap.

primal linden
#

I'll try it without the cmap.

tidal bough
#

looking at the source code, it uses matplotlib unconditionally

arctic wedgeBOT
#

pandas/io/formats/style.py lines 3930 to 3936

```py
with _mpl(Styler.background_gradient) as (plt, mpl):
    smin = np.nanmin(gmap) if vmin is None else vmin
    smax = np.nanmax(gmap) if vmax is None else vmax
    rng = smax - smin
    # extend lower / upper bounds, compresses color range
    norm = mpl.colors.Normalize(smin - (rng * low), smax + (rng * high))
    from pandas.plotting._matplotlib.compat import mpl_ge_3_6_0
```
tidal bough
#

IMO, matplotlib is so common you might as well install it. Your choice, though; it's not like the gradient is even very noticeable here on a 2x2 table.

primal linden
#

I appreciate the input, as well as others'!

tawdry ruin
#

Is anyone interested in doing LeetCode questions together, starting from easy level?
We can each use our own approach and then discuss the concepts!

thin palm
queen cradle
arctic wedgeBOT
#

6. Do not post unapproved advertising.

frank sinew
#

What are some ways I can implement an AI-controlled bot for my game (training and usage)? The background is that each client controls an EV3 Mindstorms robot, but the robot can also run on AI if there are no players, doing things like moving around in the real world and shooting other robots. The data the robots have is the positions of the other robots, which I get from the ArUco markers via OpenCV, using the camera pointing down on all of them

#

I am limited to one main phone camera (third person), which points down at the ArUco marker on top of each robot. Each robot also has a camera in front (first person).

worldly dawn
#

Like maybe starting simple with a small 2d simulation

karmic valley
#

is ChatGPT good at writing code?

serene scaffold
# karmic valley is ChatGPT good at writing code?

sometimes it produces correct results, and sometimes it produces mostly-correct results. but if it produces mostly-correct results, and you have no idea which part to fix, then it doesn't really help.

#

A lot of people overrate its abilities.

edgy falcon
#

How can I solve this error:
Dimension value must be integer or None or have an index method, got value 'TensorShape([])' with type '<class 'tensorflow.python.framework.tensor_shape.TensorShape'>'

On this: relative_position_encoding=(None, 300, None)

That is how I'm passing the argument to a Transformer-XL layer in TensorFlow

charred light
serene scaffold
serene scaffold
# supple pine Ask chatgpt to code it

I asked chatgpt to code a discord bot thing a few weeks ago, and it was surprisingly good, but it had some extra lines that were crap. I think its training data included non-overlapping versions of dpy

supple pine
white pendant
#

For example

#

I was just asking if it could help me design a version of Simpletron that ran on a 640 x 480 screen

#

Here is a fragment of the answer it came up with

#


First, let's lay out a plan for how the display and the memory should be organized for the 640x480 resolution:

Display: The Simpletron display will have a resolution of 640x480 pixels. Each pixel can be represented by a single memory address in Simpletron's memory. To store the entire display, we need 640x480 = 307,200 memory addresses. You can choose to use 16-bit or 32-bit memory addresses, depending on your requirements.

Memory: Simpletron typically has a fixed amount of memory, for example, 1000 memory addresses. However, for our design, we need to accommodate 307,200 memory addresses for the display alone. You can choose to expand Simpletron's memory to fit the display requirements or use a separate memory space for the display.

Now, let's outline some basic instructions for the Simpletron display:

SET_PIXEL: This instruction will set a pixel at the specified X and Y coordinates to a specific color. The operand will contain the X, Y, and color information.

CLEAR_SCREEN: This instruction will clear the entire screen by setting all pixels to a specific color. The operand will contain the color information.

DRAW_LINE: This instruction will draw a line between two points with a specific color. The operand will contain the X1, Y1, X2, Y2, and color information.

DRAW_RECT: This instruction will draw a rectangle with a specified color at the given X, Y coordinates, and dimensions. The operand will contain the X, Y, width, height, and color information.
#

This is something I asked about that is not present anywhere (or not anywhere that I know of)

#

And yet, look how intricate this is

queen cradle
#

Sure, it's intricate. But it says you can choose to use 16-bit memory addresses when you have 307,200 things to store.

white pendant
#

It's 307,200 memory addresses

queen cradle
#

To store the entire display, we need 640x480 = 307,200 memory addresses.

white pendant
#

The 307,200 memory addresses is for simulating the display

queen cradle
#

Each pixel can be represented by a single memory address in Simpletron's memory.

white pendant
#

Yes, that's correct

#

It's an 8-bit pixel

queen cradle
#

So you need 307,200 bytes. But 16-bit addresses can only index 65,536 separate items.

#

ChatGPT is clearly confused about arithmetic here.

white pendant
#

Um...

white pendant
#

Yes, you need 307,200 memory addresses for the memory itself

#

But each memory address can hold a 16-bit value

queen cradle
#

That only gets you to 131,072 bytes.
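
The numbers in this exchange can be checked directly. A quick sketch, assuming one byte per pixel (8-bit colour) as stated above; the variable names are just for illustration:

```py
# One byte per pixel at 8-bit colour depth.
pixels = 640 * 480

# A 16-bit address can name 2**16 distinct locations.
addressable_16 = 2 ** 16

# Even if every one of those locations holds a 16-bit (2-byte) value,
# the total capacity is still well short of the display size.
bytes_with_16bit_words = addressable_16 * 2

# Smallest address width that can index every pixel individually.
bits_needed = (pixels - 1).bit_length()

print(pixels, addressable_16, bytes_with_16bit_words, bits_needed)
# 307200 65536 131072 19
```

So addressing one pixel per location takes at least 19 address bits; 32-bit addresses are plenty, 16-bit ones are not.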

white pendant
#

Q: What is the math for the display again? Can you please clarify? If each memory address stores an 8-bit address, how come the total reserved is 307,200 bytes?


A: Apologies for the confusion earlier. You are correct that if each memory address stores an 8-bit value, the calculation for the total memory reserved for the display would be different. Let me clarify the math for the display:

For a 640x480 resolution display, there are a total of 640 x 480 = 307,200 pixels. If we use 8 bits per pixel (256 colors), each pixel will require 1 byte of memory.

Now, if each memory address stores an 8-bit value (1 byte), then we would need one memory address per pixel. Therefore, the total memory required for the display would be:

307,200 pixels * 1 byte per pixel = 307,200 bytes

So, in this case, the total memory reserved for the display would still be 307,200 bytes (or 307.2 KB), as each pixel requires 1 byte of memory, and there are 307,200 pixels in a 640x480 resolution display.

Please note that this calculation assumes an 8-bit color depth. If you use a different color depth, such as 16-bit (65,536 colors) or 24-bit (16.7 million colors), the memory requirements for the display would change accordingly.

#

@queen cradle How's that?

queen cradle
#

It doesn't acknowledge that it made a mistake in claiming that you could use a 16-bit address space. Though to be fair, you didn't specifically ask it about that. Also to be fair, it wouldn't matter to me if you did. ChatGPT isn't good with arithmetic; there are plenty of examples of this, and yours is just one more.

#

I think I've said all I have to say here.

white pendant
#

@queen cradle The mistake was on me though, not on GPT

#

Because it originally worded as such:

#

Display: The Simpletron display will have a resolution of 640x480 pixels. Each pixel can be represented by a single memory address in Simpletron's memory. To store the entire display, we need 640x480 = 307,200 memory addresses. You can choose to use 16-bit or 32-bit memory addresses, depending on your requirements.

#

I suppose the last paragraph could be reworded to add: "Please note my calculation assumes 8-bit color depth. If you choose a different resolution, your requirements will change."

queen cradle
#

Okay.

edgy falcon
#

Hi! How can I solve this error:
Dimension value must be integer or None or have an index method, got value 'TensorShape([])' with type '<class 'tensorflow.python.framework.tensor_shape.TensorShape'>'

On this: relative_position_encoding=(None, 300, None)

That is how I'm passing the argument to a Transformer-XL layer in TensorFlow

whole gazelle
#

Hi! Has anybody here worked with YOLOv8? I'm trying to save the values in xywh format using the save_txt=True CLI argument, but the output is currently what I assume to be normalized:

2 0.839807 0.165415 0.12882 0.0654616
24 0.850087 0.551329 0.193253 0.089764
2 0.840522 0.179473 0.128972 0.088213
0 0.535866 0.103385 0.0689186 0.0563577
2 0.898594 0.135384 0.202476 0.0797681
0 0.364594 0.115743 0.08385 0.058203
2 0.957544 0.171258 0.0844107 0.0878528
2 0.0187968 0.179325 0.0375859 0.107676
2 0.935403 0.13661 0.128447 0.0786718
0 0.80963 0.272964 0.101042 0.200643
0 0.471424 0.116067 0.212462 0.208825
0 0.686469 0.351023 0.275836 0.676815
0 0.310915 0.28648 0.20025 0.400011
0 0.245834 0.285782 0.200336 0.39382
0 0.533272 0.447903 0.397006 0.667959
0 0.090301 0.226236 0.180229 0.413001
36 0.442348 0.725502 0.411456 0.163376
36 0.642305 0.625472 0.274271 0.159934
#

Unless some of you know how to convert this to the xywh format that I need?
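
If those lines are the usual YOLO label format (class id, then box centre x, centre y, width and height as fractions of the image size), converting back to pixels is just a multiplication by the image dimensions. A minimal sketch under that assumption; the function names and the 640x480 image size are made up for illustration, and which "xywh" is wanted (centre-based or top-left-based) is not stated, so both are shown:

```py
def yolo_to_pixels(line, img_w, img_h):
    """Convert one 'class xc yc w h' line (normalized) to pixel units.

    Returns (class_id, x_center, y_center, width, height) in pixels.
    """
    cls, xc, yc, w, h = line.split()
    return (
        int(cls),
        float(xc) * img_w,
        float(yc) * img_h,
        float(w) * img_w,
        float(h) * img_h,
    )


def center_to_top_left(xc, yc, w, h):
    """Turn a centre-based box into (x_left, y_top, width, height)."""
    return xc - w / 2, yc - h / 2, w, h


# Hypothetical image size; use the real width/height of the source image.
IMG_W, IMG_H = 640, 480

cls, xc, yc, w, h = yolo_to_pixels("2 0.839807 0.165415 0.12882 0.0654616", IMG_W, IMG_H)
print(cls, xc, yc, w, h)
print(center_to_top_left(xc, yc, w, h))
```

Each image needs its own width and height here, since the normalization is per image.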

serene scaffold
silent spade
# misty flint yes. it is a pain. you should try to build a custom container image with the lib...

Yeah, I managed to figure it out. I did create a Docker image with the required packages. The issue was that, upon execution, the function downloads the stop words and other NLTK resources needed for preprocessing. It tried downloading the files to directories that can’t be modified for some reason. So I had to have it download to a temp directory and manually point NLTK to look in that particular directory. It was a pain
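
For anyone hitting the same wall, the workaround described above might look roughly like this. It is a sketch, not the poster's actual code: the helper names and the `nltk_data` folder name are assumptions, and `nltk` is imported lazily so the directory setup works on its own.

```py
import os
import tempfile


def prepare_nltk_dir(base=None):
    """Create a writable data directory and advertise it via NLTK_DATA."""
    target = os.path.join(base or tempfile.gettempdir(), "nltk_data")
    os.makedirs(target, exist_ok=True)
    # NLTK consults this environment variable when it builds its search path,
    # so it should be set before nltk is imported.
    os.environ["NLTK_DATA"] = target
    return target


def download_stopwords(target):
    """Download the stopwords corpus into `target` and make NLTK look there."""
    import nltk  # imported here so the helper above has no hard dependency

    if target not in nltk.data.path:
        nltk.data.path.append(target)
    nltk.download("stopwords", download_dir=target)


# Typical use inside the container's entry point:
#   target = prepare_nltk_dir()
#   download_stopwords(target)
```

Downloading the resources while building the image, rather than at run time, would avoid the write-permission problem altogether, if the environment allows it.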

serene scaffold
hasty mountain
#

Or make a Diffusion model

mild dirge
#

What's the point of this part in a PyTorch dataset? Isn't the index always just an integer?

hasty mountain
#

Also... do diffusion models work with audio data, for audio generation?

hasty mountain
#

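So, if your batch has size 8, the dataloader will do something like

batch = []
for i in range(8):
    item = dataset.__getitem__(i)  # same as dataset[i]
    batch.append(item)
# the collected items are then collated into one batch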
mild dirge
#

Right, but in that case it would be an integer

#

So why would it ever expect a tensor

hasty mountain
#

Hm... maybe because of .iloc?

#

Yeah, I don't know either

wooden sail
#

that would be my guess as well

#

for example in numpy and pandas, indexing with a list yields different results from indexing with another numpy array

#

and you may compute indices using other pytorch functions

mild dirge
#

I've been writing this entire dataset for pytorch

import glob
import os
import random

from torch.utils.data import Dataset


class VegetableDataset(Dataset):
    def __init__(self, dataset_path, nr_images_per_class=None, transform=None):
        """
        :param dataset_path: Path to the dataset containing all images
        :param nr_images_per_class: The number of images per class that are loaded
        :param transform: transform to be applied to samples
        """
        # Initialize some instance variables
        self.dataset_path = dataset_path
        self.nr_images_per_class = nr_images_per_class
        self.transform = transform

        # Compose a dict with a list of paths for each image class
        image_path_dict = {}
        for class_path in glob.glob(os.path.join(dataset_path, '*')):
            class_image_paths = list(glob.glob(os.path.join(class_path, '*')))
            random.shuffle(class_image_paths)

            if nr_images_per_class is not None:
                class_image_paths = class_image_paths[:nr_images_per_class]

            class_name = os.path.basename(os.path.normpath(class_path))
            image_path_dict[class_name] = class_image_paths

        # Put the images of the dict into a single list together with the labels
        self.paths_and_labels = []
        for idx, class_name in enumerate(sorted(image_path_dict.keys())):
            for image_path in image_path_dict[class_name]:
                self.paths_and_labels.append((image_path, idx))

        # Shuffle this list to randomize the order of the images
        random.shuffle(self.paths_and_labels)

    def __len__(self):
        return len(self.paths_and_labels)

    def __getitem__(self, idx):
        image_path, label = self.paths_and_labels[idx]
        image = ...
#

Turns out, I can just use this

from torchvision.datasets import ImageFolder

dataset = ImageFolder('vegetable_data/')
#

Does anyone know if I can make it grab only x images per class, instead of all of them?

wooden sail
#

you'd then use a dataloader

#

imagefolder does not load the contents to memory

#

you can tell the dataloader to load some amount (batch size) of images each time, selected at random from a source of images (imagefolder)

#

these dataloaders usually include augmentation capabilities btw. tensorflow has something similar as well

mild dirge
#

Yeah, but I want to limit the "entire dataset" to only contain x nr of images per class. Not just adjust the batch size when loading the images in

#

I want to train multiple cnns and then combine the results using some election rules, so I want each one to be trained on some random set of images

#

But I already wrote the custom dataset, I'll just continue that so I can personalize it anyways

wooden sail
#

i'm pretty sure there should be some parameter for that, but i don't use pytorch so i don't know which one. the rule of thumb is that, if it seems like a common enough problem, it already has a solution 😛
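
As far as I know there is no per-class limit argument on ImageFolder itself, but one option is to pick the indices by hand and wrap the dataset in a `torch.utils.data.Subset`. A minimal sketch of the index selection in plain Python; `sample_indices_per_class` is a hypothetical helper, and the `targets` list below is a stand-in for the `targets` attribute of an `ImageFolder` (one integer class label per image):

```py
import random
from collections import defaultdict


def sample_indices_per_class(targets, n_per_class, seed=None):
    """Pick up to n_per_class random sample indices for every class label."""
    rng = random.Random(seed)

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(targets):
        by_class[label].append(index)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        chosen.extend(indices[:n_per_class])

    # Mix the classes so the subset is not ordered by label.
    rng.shuffle(chosen)
    return chosen


# Stand-in for ImageFolder('vegetable_data/').targets
targets = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
indices = sample_indices_per_class(targets, n_per_class=2, seed=0)
print(indices)

# With torch installed, the selection would be applied as:
#   from torch.utils.data import Subset
#   subset = Subset(dataset, indices)  # dataset = ImageFolder('vegetable_data/')
```

Using a different `seed` per classifier gives each CNN its own random subset, which fits the ensemble idea mentioned above.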

serene scaffold
wooden sail
#

i use jax, but never for this sort of stuff. i usually generate my own data synthetically and rarely work on measured data

#

i did do some tensorflow courses at some point but i've never used it for anything myself other than the usual mnist, fashion-mnist, hand signs, etc that everyone does while learning

#

off the top of my head, a solution could be to make a list of dataloaders per class, but that only makes sense if you first split the data into directories per class

mild dirge
#

Yeah, I suppose it doesn't even matter whether the classes are perfectly balanced; I could just iterate through a set number of batches per classifier. But it's good to know that there is already stuff out there for image datasets.

serene scaffold
#

is Jax more explicit than pytorch when it comes to adjusting the weights? because I find loss.backward() and optim.step() to be weird and implicit.

wooden sail
#

as explicit as you like

#

you can compute the gradients and do whatever you like with them before updating the parameters

#

jax by default is just numpy with jit and autodiff

#

there's also the optax module that works much like pytorch and tf. you tell it which optimizer you like and it handles the rest

serene scaffold
#

is it at least as fast as pytorch?

wooden sail
#

should be comparable

serene scaffold
#

hmm, maybe I'll try it for my next project that uses neural networks

wooden sail
#

that's in general a terrible idea, i think

serene scaffold
#

why