#data-science-and-ml


arctic wedgeBOT
#

@hasty grail :white_check_mark: Your eval job has completed with return code 0.

001 |          35 function calls (32 primitive calls) in 0.097 seconds
002 | 
003 |    Ordered by: standard name
004 | 
005 |    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
006 |         1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(concatenate)
007 |         1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(diff)
008 |         1    0.000    0.000    0.000    0.000 <__array_function__ internals>:2(nonzero)
009 |         1    0.000    0.000    0.097    0.097 <__array_function__ internals>:2(unique)
010 |         3    0.000    0.000    0.000    0.000 _asarray.py:110(asanyarray)
011 |         1    0.000    0.000    0.000    0.000 arraysetops.py:125(_unpack_tuple)
... (truncated - too many lines)

Full output: https://paste.pythondiscord.com/fiducopemu.txt?noredirect

hasty grail
#

!e

import cProfile
from collections import Counter
import numpy as np

arr = np.random.randint(0, 100, size=1000000)
with cProfile.Profile() as pr:
     c = Counter(arr)

pr.print_stats()
arctic wedgeBOT
#

@hasty grail :white_check_mark: Your eval job has completed with return code 0.

001 |          35 function calls (19 primitive calls) in 0.274 seconds
002 | 
003 |    Ordered by: standard name
004 | 
005 |    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
006 |         1    0.000    0.000    0.274    0.274 __init__.py:581(__init__)
007 |         1    0.000    0.000    0.274    0.274 __init__.py:649(update)
008 |         9    0.000    0.000    0.000    0.000 _collections_abc.py:409(__subclasshook__)
009 |         1    0.000    0.000    0.000    0.000 abc.py:117(__instancecheck__)
010 |       9/1    0.000    0.000    0.000    0.000 abc.py:121(__subclasscheck__)
011 |         1    0.000    0.000    0.000    0.000 cProfile.py:117(__exit__)
... (truncated - too many lines)

Full output: https://paste.pythondiscord.com/liberofige.txt?noredirect

hasty grail
#

np.unique is around 3x faster than Counter in this example (0.097 s vs 0.274 s)
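For reference, a minimal sketch of the two counting approaches being compared. The exact np.unique call behind the first profile isn't shown in the thread, so np.unique(arr, return_counts=True) is an assumption, and the array is kept small here so it runs quickly:

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 100, size=10_000)

# NumPy route: sort-based counting done in compiled code.
values, counts = np.unique(arr, return_counts=True)
unique_counts = dict(zip(values.tolist(), counts.tolist()))

# Pure-Python route: Counter hashes every element one by one.
counter_counts = dict(Counter(arr.tolist()))

# Both give the same value -> count mapping.
print(unique_counts == counter_counts)
```

Timings will vary with the array size and the number of distinct values, so the 3x figure should be read as specific to this example.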

#

@gaunt marsh

serene scaffold
#
result = 0
for a, b in itertools.product(A, B):
    result += 1 / abs(a - b)

If we assume that A and B are arrays, is there a way to vectorise this?

hasty grail
#

when I see itertools.product I think of np.meshgrid

#

so maybe something like...

#

!e

import numpy as np

A = np.arange(4, 8)
B = np.arange(0, 4)

x, y = np.meshgrid(A, B)
print(x - y)
print((1 / (x - y)))

result = (1 / (x - y)).sum()
print(result)
arctic wedgeBOT
#

@hasty grail :white_check_mark: Your eval job has completed with return code 0.

001 | [[4 5 6 7]
002 |  [3 4 5 6]
003 |  [2 3 4 5]
004 |  [1 2 3 4]]
005 | [[0.25       0.2        0.16666667 0.14285714]
006 |  [0.33333333 0.25       0.2        0.16666667]
007 |  [0.5        0.33333333 0.25       0.2       ]
008 |  [1.         0.5        0.33333333 0.25      ]]
009 | 5.076190476190476
hasty grail
#

oops forgot the abs

#

but you get the idea
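A minimal sketch of the same idea with the abs restored, using broadcasting instead of np.meshgrid. It assumes A and B are 1-D arrays that share no values (otherwise some a - b is zero and the division blows up):

```python
import itertools

import numpy as np

A = np.arange(4, 8)
B = np.arange(0, 4)

# A[:, None] has shape (4, 1) and B[None, :] has shape (1, 4), so the
# subtraction broadcasts to the full (4, 4) table of pairwise differences.
diff = A[:, None] - B[None, :]
result = (1.0 / np.abs(diff)).sum()

# The original loop, kept only to check the vectorised answer.
loop_result = sum(1 / abs(a - b) for a, b in itertools.product(A, B))

print(result, loop_result)
```

The pairwise table takes len(A) * len(B) elements of memory, so for very large inputs it may be worth processing A in chunks.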

desert oar
#

a "normal array" would be a list, and yes you can represent rgb_ids and rgb_counts as numpy arrays, use np.array(list(...))

serene scaffold
#

@hasty grail thanks lemon_hyperpleased

agile jolt
#

anyone?

desert oar
#

did you get some negative feedback on it? there are 2 main problems with "radial" charts like this:

  1. perceptual non-linearity due to the geometry of circles
  2. lots of wasted space in the middle; the relevant data is squeezed to the outer edges

the value of a radial chart is when the data covers multiple cycles, e.g. a time series that covers 10 years: you can see all 10 years overlaid effectively, without the artificial visual "break" between the end of one cycle and the start of the next

i'd argue that in this particular case, the cyclical nature isn't relevant because this isn't time series data, and the cycles are so short and easy to understand that you don't really get value out of it anyway

#

so my recommendation would be to turn this into a line plot with 2 lines, actual and expected

#

and adjust the y axis

agile jolt
#

okay, but i should keep polar chart as an example..so any idea which data could be useful for it?

desert oar
#

time series that covers multiple cycles

#

so 2 months of data that exhibits a weekly cycle

agile jolt
#

hmm, okay thank you

#

very useful info

lapis sequoia
#

best video in hindi for data science?

light warren
#

every time i use the code return_portfolio = z.pct_change()

#

i get TypeError: unsupported operand type(s) for /: 'str' and 'str'

crisp wing
#

Does anyone know why scikit-learn's learning_curve would throw this error?

ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it

Using code from
https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html
Specifically the problem arises within

    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes,
                       return_times=True)
#

Seems to be network-related, but I don't know why that would be the case here

lapis sequoia
hidden eagle
#

Hey guys, I'm a little lost on how to create a vector from an image:

@app.route('/mix_images', methods=['POST'])
def mix_images():
    try:
        img = open("test1.jpeg")
        img.load()
        img2 = open("test2.jpeg")
        img2.load()
        # label1 = np.asarray(json.loads(request.form['label1']), dtype='float64')
        # label2 = np.asarray(json.loads(request.form['label2']), dtype='float64')
        label1 = np.asarray(img, dtype='float64')
        label2 = np.asarray(img2, dtype='float64')
        vector1 = np.asarray(json.loads(request.form['vector1']), dtype='float64')
        vector2 = np.asarray(json.loads(request.form['vector2']), dtype='float64')

        new_vectors, new_labels = interpolate(12, vector1, vector2, label1, label2)
        new_ims = sample(new_vectors, new_labels)

        result = PIL.Image.fromarray(new_ims)
        result.save("test.png")
        return jsonify([
            [encode_img(arr) for arr in new_ims],
            new_vectors.tolist(),
            new_labels.tolist()
        ])
    except Exception as e:
        print(e)
        return '', 500

Can someone point me in the right direction to create the labels and vectors from an image?

light warren
hidden eagle
light warren
#

my dataframe is called z, so do i do z.pandas.to_numeric()
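pd.to_numeric is a top-level pandas function rather than a DataFrame attribute, so z.pandas.to_numeric() won't work. A minimal sketch of one way to apply it, assuming z holds price columns that were read in as strings plus a date column that can serve as the index (the column names and values here are made up):

```python
import pandas as pd

# Hypothetical stand-in for z: prices stored as strings, which is what
# makes pct_change() fail with "unsupported operand type(s) for /".
z = pd.DataFrame(
    {
        "date": ["2021-01-01", "2021-01-02", "2021-01-03"],
        "AAA": ["100", "110", "121"],
    }
)

# Move the non-numeric date column out of the way, then convert the rest.
z = z.set_index("date")
z = z.apply(pd.to_numeric, errors="coerce")  # unparseable cells become NaN

return_portfolio = z.pct_change()
print(return_portfolio)
```

Checking z.dtypes before and after the conversion is a quick way to confirm which columns were stored as object.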

storm hill
lapis sequoia
digital sparrow
#

Hey guys, I'm having an issue with this pandas script at work.

Somewhere it is screwing up the values in a few select columns.

arctic wedgeBOT
#

Hey @digital sparrow!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .csv attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

safe viper
#

I am making a data science project and would like to use data to see if masks have been impactful in limiting covid transmission in the US. How would I tackle this problem practically?

#

It seems simple, but now i can't figure out how i would actually try to get data that would represent no mask and data that would represent mask

#

Should I try to compare states that lifted mask mandates early?

desert oar
#

You could do that, use something like a discontinuity design

lapis sequoia
#

Hi, I'm new to the Python Discord channel. I'm having trouble extracting stock data from the Bloomberg API. I installed blpapi and it works, but my goal is to fetch all stock data from the Bloomberg API into a database like MySQL or PostgreSQL. I can't work with Excel because of its row and column limits. If you have any ideas or a solution, please let me know. Thanks

desert oar
#

But keep in mind that there's probably a relationship between the state relieving the mask mandate and people's behavior

#

This is called "endogeneity" in econometrics literature

#

You could do something like compare results on 2 different sides of a state/county/town line

#

@safe viper ☝️

desert oar
#

So you'd just fetch the data you want and save it to the database

lapis sequoia
#

Ok, thanks. The data is huge, I mean all stocks from the Bloomberg API. Do you think Postgres or MySQL can handle this much data? Do you know of other options? Anyway, I will ask this question in the database channel.

#

@desert oar

desert oar
lapis sequoia
#

Okk..do you have any idea to solve this problem ?

#

My main issue is to select all stocks from bloomberg api.

#

Here is the code .. i mean im not sure ..

#

tickets = ['a', 'b',.....,'z']
blp.bdh(tickers=tickets, flds=['Last_Price','EQY_FUND_CRNCY=CAD'], start_date='2020-12-31', end_date='2021-05-31', Per='M')

#

actually the tickets should be name of stock.. but the name is always with alphabet order..

#

@desert oar

hidden eagle
glad mulch
#

whats your problem

hasty grail
hidden eagle
#

I

hasty grail
#

You could just use PIL.Image.open instead of the builtin open method then

hidden eagle
#

kk, let me try that and harass you later. Thanks mate!

hasty grail
#

np

hidden eagle
# hasty grail np

I need to ask one question. I'm on the software side (higher level), so this doesn't make a lot of sense to me. Can you explain what exactly is going on from the computer's point of view?

#

Is the represented vector here a "direction" in what the bytes are?

hasty grail
#

Without looking at your JSON, idk what your "vectors" are

tender hearth
#

are you creating a vector of the rows of pixels in your image?

hidden eagle
hidden eagle
#

This is what I need help understanding, is how Tensorflow "views" a byte array / vector of data.

tender hearth
#

from a computer science perspective a vector is just a fixed sized array

hidden eagle
#

Because an image is a byte array.

#

It means something that I don't understand.

#

I understand it's an array of values, but I just don't know what they are / how to convert an image to those values 😄

desert oar
#

@hidden eagle you might need to turn the PIL image data into a numpy array of RGB(A) pixels

hidden eagle
#

kk, I'll give that a try.

desert oar
#

That or these already are RGBA pixels

#

Note that "vector" is an overloaded term

#

In math it's best to think of a "vector" as a collection of numbers that live in a "space" with a coordinate system

#

And even that very abstract idea is fudging the real version

hasty grail
#

Can you describe what 'label1' and 'vector1' (as queried from your database) represent? I am still confused.

#

An image is represented by an array of pixels

#

The pixel values are normally either encoded as byte values (0-255) or floats (0-1)
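A minimal sketch of those two encodings, using a tiny hand-made array in place of a real file. With Pillow installed, np.asarray(PIL.Image.open("test1.jpeg")) would be expected to give an array of the same kind, shaped (height, width, channels); the pixel values below are made up:

```python
import numpy as np

# Stand-in for a decoded image: 2 rows x 3 columns x 3 channels (RGB),
# stored as bytes in the range 0-255.
pixels = np.array(
    [
        [[255, 0, 0], [0, 255, 0], [0, 0, 255]],
        [[0, 0, 0], [128, 128, 128], [255, 255, 255]],
    ],
    dtype=np.uint8,
)

# The float encoding: same pixels rescaled to the range 0-1.
scaled = pixels.astype("float64") / 255.0

# Flattening gives one long 1-D array of numbers, i.e. a "vector" of pixels.
flat = scaled.reshape(-1)

print(pixels.shape, scaled.min(), scaled.max(), flat.shape)
```

Note that this pixel vector is not necessarily the same thing as the 'vector' stored in the database, which may be something else entirely.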

hidden eagle
#

No, I understand what a vector is. I'm asking what a vector is in this case, not the literal meaning of a vector. Here's the code:

 const image1 = await knex.from('image').where({ key:key1 }).first()
        const image2 = await knex.from('image').where({ key:key2 }).first()

        if (image1.size != image2.size) {
            return res.status(400).send('Cannot mix images of differnet sizes.')
        }
        const url = (image1.size == 128) ? secrets.ganurl128 : secrets.ganurl256
        const [ imgs, vectors, labels ] = await request({
            url: url+'/mix_images',
            method: 'POST',
            json: true,
            form: {
                label1: JSON.stringify(image1.label),
                label2: JSON.stringify(image2.label),
                vector1: JSON.stringify(image1.vector),
                vector2: JSON.stringify(image2.vector)
            }
        })
#

That's the API that handles the request.

#

and how I'm handling / trying to handle it:

        img = PIL.Image.open("test1.jpeg")
        img.load()
        img2 = PIL.Image.open("test2.jpeg")
        img2.load()
        # label1 = np.asarray(json.loads(request.form['label1']), dtype='float64')
        # label2 = np.asarray(json.loads(request.form['label2']), dtype='float64')
        label1 = np.asarray(img, dtype='float64')
        label2 = np.asarray(img2, dtype='float64')
        vector1 = np.asarray(json.loads(request.form['vector1']), dtype='float64')
        vector2 = np.asarray(json.loads(request.form['vector2']), dtype='float64')
hidden eagle
hasty grail
#

Oh, I thought you had knowledge of the contents of the database you're accessing

hidden eagle
#

I do, but it's been converted with the np.asarray.

#

I think the confusion here is I don't really understand what the data is.

#

An image is a byte array. From what I understand looking at the code, it's being converted to a vector (some other form of data) that I don't understand.

hasty grail
#

What is the original source of the data?

#

(Before it is put into the database)

hidden eagle
#
 const ids = await knex('image').insert(insert).returning('id')
    if (secrets.local_images) {
        for (let i = 0; i < imgs.length; i++) {
            const buf = new Buffer(imgs[i].replace(/^data:image\/\w+;base64,/, ""),'base64')
            fs.writeFileSync("public/img/"+insert[i].key+".jpeg", buf);
        }
    } 

// dont worry about this this is the s3 save 

else {
        const uploads = []
        for (let i = 0; i < imgs.length; i++) {
            const buf = new Buffer(imgs[i].replace(/^data:image\/\w+;base64,/, ""),'base64')
            uploads.push(s3.upload({
                Bucket: bucket,
                Key: `imgs/${insert[i].key}.jpeg`,
                Body: buf,
                ACL: 'public-read',
                ContentEncoding: 'base64',
                ContentType: 'image/jpeg'
            }).promise())
        }
        await Promise.all(uploads)
    }
    return insert.map(({ key }) => ({ key }))
}
hidden eagle
#

The api is sent a file that it handles

#

It saves the file as a jpg and stores the vector to the db

hasty grail
#

I assume that the API backend* is where the 'label' and 'vector' are generated

hidden eagle
# hasty grail I assume that the API backend* is where the 'label' and 'vector' are generated

app.post('/mix_images', async (req, res) => {
    const key1 = req.body.key1
    const key2 = req.body.key2
    if (!key1 || !key2) return res.sendStatus(400)
    try {
        const image1 = await knex.from('image').where({ key:key1 }).first()
        const image2 = await knex.from('image').where({ key:key2 }).first()

        if (image1.size != image2.size) {
            return res.status(400).send('Cannot mix images of differnet sizes.')
        }
        const url = (image1.size == 128) ? secrets.ganurl128 : secrets.ganurl256
        const [ imgs, vectors, labels ] = await request({
            url: url+'/mix_images',
            method: 'POST',
            json: true,
            form: {
                label1: JSON.stringify(image1.label),
                label2: JSON.stringify(image2.label),
                vector1: JSON.stringify(image1.vector),
                vector2: JSON.stringify(image2.vector)
            }
        })
        const children = await save_results({ imgs, vectors, labels,
                                              size: image1.size,
                                              parent1: image1.id,
                                              parent2: image2.id })
        return res.json(children)
    } catch(err) {
        console.log('Error: /mix', err)
        return res.sendStatus(500)
    }
})
hasty grail
#

Need to know how exactly are image.label and image.vector generated

#

If you can't access the internals and there is no public documentation for it, well...

lapis sequoia
#

is it possible to get sweet randal out of the html text <a class="linkFriend" href="https://steamcommunity.com/id/bepri">sweet randal</a>

hidden eagle
# hasty grail Need to know how exactly are `image.label` and `image.vector` generated
 const image1 = await knex.from('image').where({ key:key1 }).first()
        const image2 = await knex.from('image').where({ key:key2 }).first()

        if (image1.size != image2.size) {
            return res.status(400).send('Cannot mix images of differnet sizes.')
        }
        const url = (image1.size == 128) ? secrets.ganurl128 : secrets.ganurl256
        const [ imgs, vectors, labels ] = await request({
            url: url+'/mix_images',
            method: 'POST',
            json: true,
            form: {
                label1: JSON.stringify(image1.label),
                label2: JSON.stringify(image2.label),
                vector1: JSON.stringify(image1.vector),
                vector2: JSON.stringify(image2.vector)
            }
        })
#

That's how they are generated.

#

And the save above.

lapis sequoia
#

I tried but i've only gotten to

#
def extract_members(self, response):
        for href in response.css('.linkFriend::attr(href)'):
            yield { 'href': href.extract() }
hasty grail
lapis sequoia
#

and that just saves the link of their profile in the href
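In Scrapy the link text is usually reachable with the ::text pseudo-element, e.g. response.css('.linkFriend::text'), alongside the ::attr(href) selector already used above. A dependency-free sketch of the same extraction with the standard library's html.parser, in case it helps to see what is being pulled out (the class name LinkFriendText is made up for this example):

```python
from html.parser import HTMLParser


class LinkFriendText(HTMLParser):
    """Collect the text inside every <a class="linkFriend"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "linkFriend" in classes:
            self._inside = True
            self.names.append("")

    def handle_data(self, data):
        # Text may arrive in several pieces, so accumulate it.
        if self._inside:
            self.names[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False


html = '<a class="linkFriend" href="https://steamcommunity.com/id/bepri">sweet randal</a>'
parser = LinkFriendText()
parser.feed(html)
parser.close()
print(parser.names)
```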

#

@hidden eagle

hidden eagle
# hasty grail That's just querying the database

Yes and this is how it saves to the db

 const ids = await knex('image').insert(insert).returning('id')
    if (secrets.local_images) {
        for (let i = 0; i < imgs.length; i++) {
            const buf = new Buffer(imgs[i].replace(/^data:image\/\w+;base64,/, ""),'base64')
            fs.writeFileSync("public/img/"+insert[i].key+".jpeg", buf);
        }
    } 
#

It's a buffer array

lapis sequoia
#

thats js not python

hidden eagle
hasty grail
#

I don't see the strings 'label' and 'vector'

hidden eagle
# hasty grail I don't see the strings 'label' and 'vector'
module.exports = async function({ imgs, vectors, labels, size, parent1=null, parent2=null}) {
    const insert = []
    for (var i = 0; i < imgs.length; i++) {
        insert.push({
            parent1, parent2, size,
            key: randomString(12),
            label: labels[i],
            vector: vectors[i],
        })
    }

This is js too

hasty grail
#

You need to trace where 'vectors' comes from when the function is called

hidden eagle
#

Basically the js api sends the data to the db and the python is handling the data with TF

#
async function main() {
    const size = 256
    const [ imgs, vectors, labels ] = await request({
        url: secrets.ganurl256+'/random',
        method: 'POST',
        json: true,
        form: { num: '24' }
    })
    await save_results({ imgs, vectors, labels, size })
    console.log('all done')
}
hasty grail
#

I suspect that the vector is the seed passed to the GAN

#

i.e the noise vector

hidden eagle
#

"noise vector" how does that work from a data sci explanation?

#

Is that the weight of a pixel?

hasty grail
#

The training loop begins with generator receiving a random seed as input. That seed is used to produce an image. The discriminator is then used to classify real images (drawn from the training set) and fakes images (produced by the generator). The loss is calculated for each of these models, and the gradients are used to update the generator and discriminator.

  • From TensorFlow's GAN tutorial
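If the stored vectors are indeed GAN noise vectors, "mixing" two images would most likely mean blending their vectors and feeding the blends back to the generator. A minimal sketch of that blending step, assuming plain linear interpolation and a 128-dimensional noise vector; the real interpolate function and vector size are not shown in the thread, so interpolate_vectors and the sizes here are hypothetical:

```python
import numpy as np


def interpolate_vectors(v1, v2, steps):
    """Return `steps` vectors moving in a straight line from v1 to v2."""
    t = np.linspace(0.0, 1.0, steps)[:, None]  # shape (steps, 1)
    return (1.0 - t) * v1 + t * v2             # shape (steps, len(v1))


rng = np.random.default_rng(0)
z1 = rng.standard_normal(128)  # hypothetical noise vector of image 1
z2 = rng.standard_normal(128)  # hypothetical noise vector of image 2

mixed = interpolate_vectors(z1, z2, 12)
print(mixed.shape)
```

Each row of mixed would then be passed to the generator to produce one in-between image.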
hidden eagle
#

Basically:

app.post('/mix_images', async (req, res) => {
    const key1 = req.body.key1
    const key2 = req.body.key2

that's pulled from:

form.addEventListener('submit', event => {
    event.preventDefault()
    const parent1 = form.querySelector('.parent1').value
    const parent2 = form.querySelector('.parent2').value
    let key1, key2
    try {
        key1 = get_key(parent1)
        key2 = get_key(parent2)
    } catch(e) {
        alert('URL not valid.')
        console.log(e)
        return
    }

and

   form.mixinput(style='text-align:center;')
            input.parent1(type='text', placeholder='Enter url', required)
            input.parent2(type='text', placeholder='Enter url', required)

            br
            input.submit(type='submit', value='Submit')

hasty grail
#

You should take a look at the 'random' endpoint

hidden eagle
#

That actually does do a noise function, you are correct.

hasty grail
#

Since vectors is returned from the POST request sent there

hidden eagle
#

"Noise" being the model data:

async function main() {
    const size = 256
    const [ imgs, vectors, labels ] = await request({
        url: secrets.ganurl256+'/random',
        method: 'POST',
        json: true,
        form: { num: '24' }
    })
    await save_results({ imgs, vectors, labels, size })
    console.log('all done')
}
hidden eagle
#

By calling the random endpoint.

#

docker-compose exec server node make_randoms.js

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @lapis sequoia until <t:1631340444:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

boreal wasp
#

Hi I have a question...

#

I'm using sqlalchemy and created a table called TweetWordPair

#

The table includes 2 columns, "tweet_id" and "word"

#

But I want to display another column "ftd" using query after selecting values

#

I want to show the result of the values in ftd

#

but I get the add_columns error...

#

does anyone have any suggestions? ideally without creating a new column in the table

#
from sqlalchemy.orm import sessionmaker
from sqlalchemy import func

Session = sessionmaker(bind=engine)
session = Session()

ftd = session.query.add_columns(func.count(TweetWordPair.words).label("ftd")).first()
result = session.query(TweetWordPair.tweet_id, TweetWordPair.word, ftd).group_by(TweetWordPair.tweet_id).all()
for r in result:
    print(r)
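One likely cause of the error: session.query is a method, so session.query.add_columns(...) looks up add_columns on the method itself instead of on a query. The snippet also refers to TweetWordPair.words while the column is called word. A minimal sketch of computing the extra column inside a single query, assuming SQLAlchemy 1.4 or newer, an in-memory SQLite database, and that "ftd" means how often a word occurs in a tweet (the model definition and sample rows are made up):

```python
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class TweetWordPair(Base):
    # Hypothetical model: only tweet_id and word come from the question.
    __tablename__ = "tweet_word_pair"
    id = Column(Integer, primary_key=True)
    tweet_id = Column(Integer)
    word = Column(String)


engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

session.add_all(
    [
        TweetWordPair(tweet_id=1, word="cat"),
        TweetWordPair(tweet_id=1, word="cat"),
        TweetWordPair(tweet_id=1, word="dog"),
        TweetWordPair(tweet_id=2, word="cat"),
    ]
)
session.commit()

# The count is a labelled expression in the SELECT list, not a table column.
rows = (
    session.query(
        TweetWordPair.tweet_id,
        TweetWordPair.word,
        func.count(TweetWordPair.word).label("ftd"),
    )
    .group_by(TweetWordPair.tweet_id, TweetWordPair.word)
    .order_by(TweetWordPair.tweet_id, TweetWordPair.word)
    .all()
)

for r in rows:
    print(r.tweet_id, r.word, r.ftd)
```

Grouping by both tweet_id and word keeps every selected column either grouped or aggregated, which stricter databases than SQLite require.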
lapis sequoia
# glad mulch whats your problem

My issue is that I want to extract all stocks from Bloomberg and I have no idea how to do it. I'm using blpapi. My goal is to fetch all stock data into a database like MySQL or PostgreSQL. Any ideas?

agile jolt
#

Can someone explain to me how to calculate outliers with st. dev?
Just the theory part, no code needed

#

To be more specific I was given a thesis:
"We can calculate outliers in 2 ways: 1. with IQR, 2. with +/- 3 st.dev based on normal distribution. Explain!"

#

Basically, I need just the second way

prime hearth
#

i think +/- is just #1

agile jolt
#

Wdym, like it's self explanatory?

#

Nvm, i figured it out
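For anyone reading along, the theory behind the second way: if the data is roughly normally distributed, about 99.7% of values fall within 3 standard deviations of the mean, so anything further out is rare enough to flag. A minimal sketch of both rules side by side, assuming the common 1.5 x IQR fence for the first way; the data is synthetic, with two extreme values planted on purpose:

```python
import numpy as np

rng = np.random.default_rng(0)
# Roughly normal data (mean 50, st. dev 5) plus two planted extremes.
data = np.append(rng.normal(50, 5, size=1000), [120.0, -40.0])

# Way 2: flag points more than 3 standard deviations from the mean.
z_scores = (data - data.mean()) / data.std()
outliers_sd = data[np.abs(z_scores) > 3]

# Way 1: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
outliers_iqr = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]

print(len(outliers_sd), len(outliers_iqr))
```

The mean and standard deviation are themselves pulled around by extreme values, which is one reason the IQR rule is often preferred for skewed or heavy-tailed data.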

harsh bear
#

So i have a vps, where im plotting some data of my bot on a graph. problem is, its deployed on a vps. How can i see the graph live on my pc?

lapis sequoia
#

What would be required to build the world's best general chatbot? One with long-term memory that learns about you and remembers everything. So basically like the movie “Her”

quiet vault
#

Does anyone have code for Bayesian Optimization for time series data with deep learning?

desert oar
muted falcon
misty flint
#

i recommend alexey grigorev

#

he also has an online course and a bunch of stuff on github

#

if we're doing rec's

royal crest
#

!rule 6

arctic wedgeBOT
#

6. Do not post unapproved advertising.

rain wadi
#

just started out with neural networks and i was wondering something about the opencv library

#

i wanted to make my own computer vision library just for playing around

#

but i cant figure out how opencv understands the "x y" coordinates of an object on the camera/screen

#

for example i know how to make the neural network know if theres a pattern/object or not in the image/frame but i cant figure out how to understand its position on the screen

#

should i train the model with different positions for the object?

#

or is there a smarter way im missing?

#

if anyone can suggest an article that might be able to explain this. i would be happy to read it

near spindle
#

Is it worth buying a bundle of ML related books on Humble Bundle?

lapis sequoia
#

the first thing is if you'll actually read them; second is if they are written well and lastly if you'll actually learn from them

near spindle
lapis sequoia
#

Then I'd recommend these books:

  1. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython 2nd Edition
  2. Introduction to Machine Learning with Python: A Guide for Data Scientists
  3. Deep Learning with Python, Second Edition
lapis sequoia
near spindle
#

I'm looking rn for any fragments to see how well or badly it's written

near spindle
#

They seem to be quite well written from what I saw

#

I think I'll buy them

quick relic
#

Hi, I am trying to train my CNN, but the loss does not decrease and the test results are always the same

#
    def train_loop(self):

        for batch, (X,y) in tqdm(enumerate(self.train_dataloader), desc="TRAINING"):
            X = X.to(self.device)
            y = y.to(self.device)

            self.optimizer.zero_grad()
            pred = self.model(X)
            loss = self.loss_fn(pred, y)

            loss.backward()
            self.optimizer.step()
        print(f"Result\nloss: {loss.item()}")
#
    def test_loop(self):
        size = len(self.test_dataloader.dataset)
        num_batches = len(self.test_dataloader)
        test_loss = 0
        correct = 0

        with torch.no_grad():
            for batch, (X,y) in tqdm(enumerate(self.test_dataloader), desc="TESTING"):
                X = X.to(self.device)
                y = y.to(self.device)

                pred = self.model(X)
                test_loss += self.loss_fn(pred, y).item()
                correct += (pred.argmax(1) == y).type(torch.float).sum().item()

        test_loss /= num_batches
        correct /= size
        print(f"Result\nAccuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f}")
#
            self.model = Network().to(self.device)
            self.data_train = ImageSet(img_path, annotations_file_name, transform=ToTensor(), train=True)
            self.data_test = ImageSet(img_path, annotations_file_name, transform=ToTensor(), train=False)

            self.train_dataloader = torch.utils.data.DataLoader(self.data_train, batch_size=32, shuffle=True)
            self.test_dataloader = torch.utils.data.DataLoader(self.data_test, batch_size=32, shuffle=True)
            #self.test_dataloader = torch.utils.data.DataLoader(self.data_train, batch_size=16, shuffle=True)
        
            #training params
            self.epochs = 10
            self.learning_rate = 1e-2
            self.loss_fn = nn.CrossEntropyLoss()
            self.optimizer = torch.optim.Adam(self.model.parameters(), lr=self.learning_rate)
#

I assume there is some stupid mistake in the train function but I cant spot it

hoary wigeon
#

help me!

#

I have created a model using the Decision Tree algorithm on imbalanced data

#

how can i make the decision tree ignore a few points, like svm and logistic regression do

velvet thorn
#

ignore points

hoary wigeon
#

like in svc, few points are mis classified

#

so we have a parameter C in svc

velvet thorn
#

normally, to decrease variance in decision trees, some sort of ensembling (e.g. bagging, as in random forests) is used

hoary wigeon
#

i thought it is a critical point

velvet thorn
hoary wigeon
velvet thorn
hoary wigeon
#

hold on

velvet thorn
hoary wigeon
#

yeah it is a regularization param

#

is there something like this in decision tree ?

velvet thorn
velvet thorn
#

I mean

#

you can change the max depth of the tree

#

or how it splits

hoary wigeon
#

yeah i know about svc and logistic

velvet thorn
#

but they're fundamentally different models

hoary wigeon
#

yep

velvet thorn
hoary wigeon
#

yeah bruh..

#

i read all the intuition

velvet thorn
#

okay, then you should understand what I'm saying 🙂

hoary wigeon
#

sure

#

so how can i fix this ?

velvet thorn
velvet thorn
hoary wigeon
#

i tried with grid search cv

#

using cv as Stratified

lapis sequoia
#

precision and recall for class 0 are 0, which means it's always classifying as 1. i would say that's quite bad regardless of the accuracy of 74. try increasing depth? i feel this few rules may not be enough for the tree.

hoary wigeon
#

theres no change in accuracy after max_depth 3++

lapis sequoia
#

i mean increase max depth. say 5?

hoary wigeon
#

after changing class weight

lapis sequoia
#

I'm saying increase depth🤕

hoary wigeon
hoary wigeon
lapis sequoia
#

weights 2 1?

hoary wigeon
#

cool

lapis sequoia
#

I'd say this is better.

hoary wigeon
#

yeah

lapis sequoia
#

than that 75

hoary wigeon
#

at least the model is supporting both classes

#

how do you calculate weight

lapis sequoia
#

nice question. in IR system I'd somehow make a framework of AW = y
since we do have y and do inversion but not sure about this one.

hoary wigeon
#

i didn't get this

lapis sequoia
#

class_weight : dict, list of dict or “balanced”, default=None

Weights associated with classes in the form {class_label: weight}. If None, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].
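On the "how do you calculate weight" question: the “balanced” option is documented as n_samples / (n_classes * np.bincount(y)), so rarer classes get proportionally larger weights. A minimal sketch of that formula for a single-output binary target; the 26/74 class split below is made up for illustration:

```python
import numpy as np

# Hypothetical imbalanced labels: 26 samples of class 0, 74 of class 1.
y = np.array([0] * 26 + [1] * 74)

n_samples = len(y)
classes, counts = np.unique(y, return_counts=True)

# Same formula as the "balanced" heuristic: inverse class frequency.
weights = n_samples / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

print(class_weight)
```

The resulting dict has the {class_label: weight} form described above, so it could be passed as class_weight directly, or class_weight="balanced" could be used to let the library compute it.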
#

is yours multilabel?

#

i think no. then shouldn't it be like below.

hoary wigeon
lapis sequoia
#

why not just remove weights for now? what accuracy does it give?

hoary wigeon
lapis sequoia
#

seems better.

hoary wigeon
#

i guess it is changing because of train test split

#

previous results were much worse

lapis sequoia
#

we also have depth of 5 rn.

hoary wigeon
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1, shuffle=True, stratify=y) # Splitting data
zenith crow
#

hello

#

pls help

#

my python is pretty basic

#

here

#

five_info.append("Input=")

#

is there a way i can append "Input" to the first list

#

five_info = [[],[]]

#

five_info.append("Input=",[1])

#

it doesn't work
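A minimal sketch of what seems to be wanted here, assuming "the first list" means the first inner list of five_info. list.append takes exactly one argument, so the inner list is picked with an index first:

```python
five_info = [[], []]

# Index into the outer list to get the first inner list, then append to it.
five_info[0].append("Input=")

print(five_info)
```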

serene scaffold
#

@zenith crow hello, this is not a data science question

zenith crow
#

sry

light warren
#

hey, i am trying to run this code that i found on my school help sheet but i keep getting this error,

prime hearth
#

i guess would it be okay to ask

#

have you broken down

#

the .methods

#

so portfolio.sample()

#

does that run successfully

#

if it doesn't, then that's where the error is, due to an invalid argument

light warren
#

same error

prime hearth
#

Yes

#

so thats the error

#

its the argument that is being passed

#

i dont know much about time and date

#

but can try looking (googling) the error

#

and that should answer the problem

light warren
#

shall i remove the date column, every other column is float64 and that one is object

prime hearth
#

im not sure, that depends on what the error means; do a quick google search and look at the solutions offered on Stack Overflow

lapis sequoia
#

there's one value which is coming out as zero in the batch. How can I see what is in the batch? Can anyone help, please?

serene scaffold
#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

hasty grail
lapis sequoia
# serene scaffold Can you provide all the code as text?

model.eval()
print("entering validation")
data_iterator = tqdm(loader, leave=False, unit="batch", disable=silent)
losses = []
rl_loss = []
kl_loss = []
counts = []
for index, batch in enumerate(data_iterator):
    batch = batch[0]
    if cuda:
        batch = batch.cuda(non_blocking=True)
    recon, mean, logvar = model(batch)
    rl, kl, loss = model.loss(batch, recon, mean, logvar)
    losses.append(loss.detach().cpu())
    rl_loss.append(rl.detach().cpu())
    kl_loss.append(kl.detach().cpu())
    # losses.append(model.loss(batch, recon, mean, logvar).detach().cpu())
    batch_sum = batch.sum(1).detach().cpu()
    # b = (batch_sum == 0)
    # if torch.sum(b):
    #     a = 0
    counts.append(batch_sum)
# dimon = torch.cat(counts).mean().exp().item()
# nomi = torch.cat(losses)
return float((torch.cat(losses) / torch.cat(counts)).mean().exp().item())

serene scaffold
lapis sequoia
hasty grail
#

Oh, sorry didn't catch that

#

I guess you can iterate through your data until the index is equal to that number

#

and inspect the corresponding data sample

lapis sequoia
#

im not getting how do i do that

lapis sequoia
serene scaffold
arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

hasty grail
#
for index, batch in enumerate(data_iterator):
    if index == empty_batch_index:
        # Do something with batch
serene scaffold
#

^ click the link

lapis sequoia
serene scaffold
lapis sequoia
serene scaffold
lapis sequoia
#

inside the if condition, how can i get the data?

hasty grail
#

wait, your index is the index within a batch

lapis sequoia
#

yes

hasty grail
#

in that case, batch[empty_batch_index] would be the data you need

lapis sequoia
#

it gives me like this

#

batch[125]
tensor([0., 0., 0., ..., 0., 0., 0.])

#

but I want to see what is this tensor associated with in my data..

hasty grail
#

you'll have to trace to your data source and iterate through it in the same way as you did here

#

then find the corresponding sample in the batch

lapis sequoia
#

is it possible to tell how do i do this?

#

coz this is where im stuck at

#

unless i know what is causing the issue i cannot fix it

hasty grail
#

You have control over the data iterator, right?

#

as in, it is constructed inside your code

lapis sequoia
#

yes.

hasty grail
#

ok so you need to go to whatever line that is

#

and figure out what data is being fed to it

#

apart from the index of the empty sample in the batch (i.e. sample index), also print out the batch index (index in the code you showed above)

#

then, you find the piece of data that corresponds to the batch index and the sample index
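
For what it's worth, if the loader is not shuffled and uses a fixed batch size, the batch index and the sample index can be turned back into a position in the underlying dataset with plain arithmetic. A minimal sketch under those assumptions (`locate_sample` and the numbers are hypothetical, not from the code above):

```python
def locate_sample(batch_index, sample_index, batch_size):
    """Map (batch index, index within the batch) to a dataset index.

    Assumes the loader iterates the dataset in order (no shuffling) with a
    fixed batch size, so only the final batch may be shorter.
    """
    return batch_index * batch_size + sample_index


# Hypothetical numbers: sample 125 inside batch 3, with batch_size=256
dataset_index = locate_sample(3, 125, 256)
print(dataset_index)
```

With a torch `DataLoader` built that way, `loader.dataset[dataset_index]` should then be the record behind the all-zero row. If the loader shuffles, this mapping does not hold.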

lapis sequoia
#

ok i will try to search for some code to do that.... if you have something, would help too...

#

also, im using the torch dataloader.

hasty grail
#

it would be helpful if you set up your own dummy dataloader that, for example, prints out the file source for the currently yielded batch

#

and iterate through the actual and dummy dataloader in parallel (using zip)

#

that way, when you get to that problematic batch, you can look at the console/log and identify the relevant files

lapis sequoia
#

ah!! i shall do that then.

#

thank you so much 🙂

#

it really helped a lot

hasty grail
#

you're welcome 🙂

lapis sequoia
#

also, there's no way we could do with the in built ah?

hasty grail
#

I haven't worked with PyTorch all that much, so idk

lapis sequoia
#

ah okok

#

thats ok, this was very helpful 🙂

serene scaffold
#

I have this

                                    CRF               BiLSTM+CRF               BioBERT+CRF              
                                      P      R     F1          P      R     F1           P      R     F1
Animal   CellLine                 0.952  0.513  0.667      0.719  0.590  0.648       0.739  0.872  0.800
         GroupName                0.819  0.457  0.586      0.752  0.581  0.656       0.684  0.716  0.699
         GroupSize                0.835  0.662  0.739      0.764  0.706  0.734       0.723  0.861  0.786
         SampleSize               0.800  0.279  0.414      0.529  0.419  0.468       0.630  0.674  0.652
         Sex                      0.980  0.760  0.856      0.977  0.871  0.921       0.958  0.913  0.935
         Species                  0.982  0.844  0.907      0.978  0.893  0.933       0.952  0.917  0.934
         Strain                   0.917  0.742  0.820      0.865  0.825  0.845       0.832  0.863  0.847

I want to get a boolean dataframe where the best P, R, or F1 for each of the three algorithms is True, but idxmax doesn't support level=

serene scaffold
#

df.groupby(level=0, axis=1).apply(lambda x: x.droplevel(axis='columns', level=0) == maxx) appears to be it.

#

annoying 😄

serene scaffold
#

Now how can I make this one statement?

is_max = (
    df.groupby(level=0, axis=1)
        .apply(lambda d: 
            d.droplevel(axis=1, level=0)
            == 
            df.max(level=1, axis=1)
        )
)

df_str = df.applymap(lambda x: f'{x:.3f}')
df_str[is_max] = r'\textbf{' + df_str + '}'

print(df_str.to_latex(escape=False))
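
One possible way to get `is_max` in a single expression, sketched on a made-up two-algorithm slice of the table above. It assumes the columns are an (algorithm, metric) MultiIndex, and it avoids `max(level=...)`, which newer pandas releases no longer accept:

```python
import pandas as pd

# Hypothetical miniature of the table: (algorithm, metric) column MultiIndex.
columns = pd.MultiIndex.from_product([["CRF", "BiLSTM+CRF"], ["P", "R"]])
df = pd.DataFrame(
    [[0.952, 0.513, 0.719, 0.590],
     [0.819, 0.457, 0.752, 0.581]],
    index=["CellLine", "GroupName"],
    columns=columns,
)

# Best value per metric across algorithms, broadcast back to df's shape,
# then compared elementwise against df.
is_max = df.eq(df.T.groupby(level=1).transform("max").T)

# Format every cell, then wrap only the winners in \textbf{...}.
df_str = df.apply(lambda col: col.map("{:.3f}".format))
df_bold = df_str.where(~is_max, r"\textbf{" + df_str + "}")
print(df_bold)
```

Transposing twice is a bit roundabout, but it keeps the grouping on rows, which is the direction `groupby` handles without an `axis` argument.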
frosty flame
#

hey

#

i need help with something

#

iam trying to import tensorflow_hub

#

but i keep getting this:

#

cannot import name 'parameter_server_strategy_v2' from 'tensorflow.python.distribute'

serene scaffold
frosty flame
#

can i send as a file, cuz i cant send large texts while iam not nitro

serene scaffold
#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

frosty flame
#

!paste

serene scaffold
#

You have to click the link from the bot message

frosty flame
#

paste it there?

#

then what?

serene scaffold
#

yes, then save and share the URL

frosty flame
#

there you go

#

this is the error

serene scaffold
#

@frosty flame what tensorflow version are you using?

frosty flame
#

latest? i downloaded it via anaconda

serene scaffold
#

I've never used anaconda

#
pip install --upgrade tensorflow-estimator==2.3.0

Try that

#

but for anaconda

frosty flame
#

then i tried it using pycharm, and the same issue happened

serene scaffold
#

idk how to install stuff with anaconda.

frosty flame
#

its spider

serene scaffold
frosty flame
#

anaconda is like a hub

serene scaffold
#

have you tried not using anaconda?

#

most people don't, so it's easier to get help that way.

frosty flame
#

yes, and it was the same issue

serene scaffold
#

well, try installing tensorflow-estimator==2.3.0 specifically

#

two pages told me that that is the solution

frosty flame
#

i executed what you sent

#

and other issue appeared

serene scaffold
#

what is the other issue?

serene scaffold
#

I see. what are you trying to do, exactly?

#

are you following a tutorial?

frosty flame
#

trying to make ESP bot

frosty flame
serene scaffold
#

can you help me understand how you ended up at this problem?

frosty flame
#

its just importing the library what sends this error

serene scaffold
#

what OS are you on?

frosty flame
#

Windows 10 pro

serene scaffold
#

what python version?

frosty flame
#

python 3.8

serene scaffold
#

I'm trying to get it working locally

#

I'm going to go do a thing while it installs

frosty flame
#

ok thanks

serene scaffold
# frosty flame ok thanks
pip install tensorflow_hub
pip install tensorflow
python -c "import tensorflow_hub; print('Done!')"
#

worked for me

#

though I'm on linux, so you may have to install the build tools

#

!build

arctic wedgeBOT
#

Microsoft Visual C++ Build Tools

When you install a library through pip on Windows, sometimes you may encounter this error:

error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

This means the library you're installing has code written in other languages and needs additional tools to install. To install these tools, follow the following steps: (Requires 6GB+ disk space)

1. Open https://visualstudio.microsoft.com/visual-cpp-build-tools/.
2. Click Download Build Tools >. A file named vs_BuildTools or vs_BuildTools.exe should start downloading. If no downloads start after a few seconds, click click here to retry.
3. Run the downloaded file. Click Continue to proceed.
4. Choose C++ build tools and press Install. You may need a reboot after the installation.
5. Try installing the library via pip again.

frosty flame
#

i already have build tools for intellig, do they work?

#

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'D:\Apps\Anaconda\Lib\site-packages\~cipy\integrate\lsoda.cp38-win_amd64.pyd'

#

this happens when executed pip install tensorflow

serene scaffold
serene scaffold
frosty flame
#

but i think it worked

#

no errors are apearing when running

#

thanks for helping

serene scaffold
#

💚

frosty flame
#

💚

boreal bear
#

Hello all, I am looking at some survey data and I have filled my 'NaN' values with 'NO RESPONSE' but I want to count how many values are 'NO RESPONSE' in the pd dataset. How can I do this? .isnull() won't catch the 'NO RESPONSE'

thin prism
boreal bear
#

I used .fillna() on the dataset but I guess i could also go back and remove that and count the NAs but that's a little annoying

thin prism
#

yea im more like a oldschool coder so

boreal bear
#

What's the easiest way to filter 'NO RESPONSE' across 42 columns?

thin prism
#

i do hardcode stuff haha

#

are you using pd?

boreal bear
#

yeah

thin prism
#

try search

#

df.filter(regex=)

boreal bear
#

ok will try. Thank you so much!

thin prism
#

i really do hope im not wasting your time haha

boreal bear
#

Of course not. It's a learning process.

#

And I thoroughly enjoy the process

thin prism
#

let me know if that method works 🙂

boreal bear
#

thanks will do

#

doesn't seem to be working

#

df_nores = df.filter(regex='NO RESPONSE', axis=1)

#

is this syntax correct?
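
A note on why that returns nothing useful: `DataFrame.filter` selects by row or column labels, not by cell values. Comparing the whole frame against the placeholder string and summing the resulting booleans is one way to count it. A small sketch with made-up column names:

```python
import pandas as pd

# Hypothetical survey data; the real frame has 42 columns.
df = pd.DataFrame({
    "q1": ["yes", "NO RESPONSE", "no"],
    "q2": ["NO RESPONSE", "NO RESPONSE", "maybe"],
})

is_no_response = df.eq("NO RESPONSE")  # boolean frame, same shape as df
per_column = is_no_response.sum()      # count per column
total = int(per_column.sum())          # count over the whole frame

print(per_column)
print(total)
```

This works across all columns at once, so there is no need to iterate.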

rigid zodiac
#

Quick question, how do you apply your deep learning model H5 to the new data

thin prism
#

can i know the column title for the value that contains no response?

boreal bear
#

there are a lot of null values

thin prism
#

ohhhhh

#

so you have multiple null values inside the multiple columns

boreal bear
#

130 to be exact!!

thin prism
#

in that case we can try

#

uhhh .str.contains('NO RESPONSE') on entire column

#

try search that method

#

hope that works

boreal bear
#

Maybe I can try that and iterate through

thin prism
#

mhm 🙂

lapis sequoia
#

Hey, is there some machine learning specialist here that could tell me what kind of data quality I need to get what I want in my project?

#

Also would love to hear about what algorithm to use and so on

lapis sequoia
#

I know 😄

lapis sequoia
# faint ravine That's a very broad question

The aim of my project is to provide a user-specific, individualized presentation of product images, maximizing customer attraction and thereby increase conversion.
The data for this is collected from bot user interaction with the shop.
And just to provide you a brief overview:
At first, the customer is shown randomly selected product images from the backend.
Currently, the majority of the data consists of the correlation between the product image viewed and whether a purchase was subsequently made.
The selection of the best product image should then be based on some machine learning algorithm.

faint ravine
lapis sequoia
#

exactly

#

and data is the image_url

#

do you think you can help with that? @faint ravine

prime hearth
#

I am just a student

#

not sure if this might help

#

but I would try all algorithms

#

since this is part of the machine learning model training process

#

right after model training

#

Identify problem -> data collection -> data prep (feature engineering, feature selection...) -> exploratory analysis (visualization of distributions etc) -> model training -> evaluate -> repeat

#

Some algorithms would be classifiers or CNNs, and recommendation systems (content/collaborative)

faint ravine
# lapis sequoia do you think you can help with that? @faint ravine

I'm not sure how you collect your data. In all honesty, I don't think you need a machine learning model to decide which product image works best. My 2 cents:

  1. Most people are similar, and if one image looks good for one person, it'll look good for anyone else.
  2. You could probably use a machine learning model to create a recommendation system.
  3. If you have lots of product images and want to decide which one has the most user engagement, you can randomly evenly show images for a number of users and extract a rating for each image. Over time, the one with the highest rating is the one that is displayed by default.

Again its really difficult for me to understand why you need an ML model to solve the problem of the "optimal image". You might have a good reason that I can't see..
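
The rating idea in point 3 can be sketched without any ML at all: show images uniformly at random, tally views and purchases, and rate each image by its purchase rate. Everything here (`ImageStats`, `pick_image`, the file names) is a hypothetical illustration, not an existing API:

```python
import random
from collections import defaultdict


def pick_image(image_urls, rng=random):
    """Choose an image uniformly at random so every image gets similar exposure."""
    return rng.choice(image_urls)


class ImageStats:
    """Tally views and purchases per image and rate images by purchase rate."""

    def __init__(self):
        self.views = defaultdict(int)
        self.purchases = defaultdict(int)

    def record(self, image_url, purchased):
        self.views[image_url] += 1
        if purchased:
            self.purchases[image_url] += 1

    def rating(self, image_url):
        views = self.views[image_url]
        return self.purchases[image_url] / views if views else 0.0

    def best(self):
        # Raises ValueError if nothing has been recorded yet.
        return max(self.views, key=self.rating)


stats = ImageStats()
stats.record("a.jpg", purchased=True)
stats.record("a.jpg", purchased=False)
stats.record("b.jpg", purchased=False)
print(stats.best(), stats.rating("a.jpg"))
```

With only a handful of views per image the rates are very noisy, so the default image should only be switched after enough traffic has been collected.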

lapis sequoia
#

So by optimal image, I actually meant a recommendation system

#

Very good points, indeed

#

@faint ravine

#
  1. is what I'm doing to collect data
    Though I can't use production data, so I use bots for this
faint ravine
#

Hold on. Let's say you have a shoe product, call it X. You have N images of product X and you're trying to find the one that works best for User P, right?

lapis sequoia
#

exactly

faint ravine
#

So the ML model is a function F that takes in: UserData, ProductID and outputs ProductImage.

wide rose
#

oh

#

I read yesterday at

#

5

lapis sequoia
#

Inside UserData there has to be some sort of previous purchases, metadata, etc.

faint ravine
#

How many product images on average do you have per product? (You must have a lot)

lapis sequoia
#

So for testing purposes, I could use 4-10 per product

#

Is this enough to make it work? Or would I need like 500

faint ravine
#

Things you can learn from past purchases would be like colors, background color..etc.

#

You can extract features from images of past user's purchased products. And try to find the image of the recommended product that most resembles that..

lapis sequoia
#

Theyre just different resolutions for now, so all we can work with is the different image url as a feature

faint ravine
#

Not sure there's a whole lot of things you can find that are common between two entirely different product images..

lapis sequoia
#

Yeah, it seems very hard

faint ravine
#

For example, let's say User: Jack has purchased a shoe that has a red color in the past. And now he's browsing curtains and trying to purchase a curtain for his new room. You have 4 images of curtains with different colors. The best image to show him is the one of the red curtain

#

So this is just a guess. But color is something that can be an individualized users specific feature.

#

Can't think of anything else

lapis sequoia
#

Yeah either it is super simple or gets way too complex for my time span - either way, currently, the majority of the data consists of the correlation between the product image viewed and whether a purchase was subsequently made.

#

I only got 2 weeks

faint ravine
#

You could build

lapis sequoia
#

So you would go with the different color approach?

faint ravine
#

A simple model that takes in the exact images used in all past purchases. And use images of similar products or different product images as "false data". Now you have a function that tells you whether the user is gonna like some image or not.
Then feed this function all images of the selected product, and pick the one with the highest output..

lapis sequoia
#

different product images being?

#

different on the same product?

faint ravine
#

But again, there's a lot of edge cases. The model will make some random predictions that have nothing to do with whether the user likes the product or not.. Because of the small dataset.

I'd honestly use a more manual approach to this. Or target a set of very specific features like color. (You can get creative here).

lapis sequoia
#

So what I could either do is to create a concept on how it could work

#

Or implement some manual approach, but I would like the first

#

Since I'm writing my first scientific working paper on it

faint ravine
#

Yes, if you try to use a complex deep learning algorithm on this, it'll just be a big blackbox and return random predictions that you won't appreciate..

faint ravine
#

Its a very complex problem. One that I don't think will have a generic solution but still, you could get creative.

lapis sequoia
#

Yeah, do you think I can create 20 pages by the end of the month on this?

faint ravine
#

20 pages of what?/

lapis sequoia
#

Of paper

#

😄

faint ravine
#

Not sure.

lapis sequoia
#

So uhm I'm really thinking how I could utilize the data I currently have

faint ravine
#

I mean, some complex papers are only 7 pages. Why the obsession with the number?

lapis sequoia
#

its a requirement

#

20-30 pages

faint ravine
#

Where do you study lol?

lapis sequoia
#

Germany

faint ravine
#

I doubt they're looking for something super original

#

Is this a masters thesis

lapis sequoia
#

hmm yeah it is my first one anyways

#

nono

#

pre-bachelors

faint ravine
#

So a bachelor's thesis right?

lapis sequoia
#

literally my first one at uni, 2nd semester

faint ravine
#

Oh

#

Youngster

#

xD

lapis sequoia
#

haha ye

faint ravine
#

You probably could yea.

#

Just put in lots of figures

lapis sequoia
#

Is there anything I could do with the order history? Say if I make the bots place 100 different orders with like 3 products each

#

Then I got some sample data, like just to have something very basic stuff working, which could be colors etc. with the correct data

#

like I use image resolution rn

#

since there are no different colors

#

or do you think thats not doable

#

using resolution as a feature

faint ravine
#

Uhm could be, but the weight of the feature isn't significant. A user might like a high resolution image. But its not much of an inconvenience you know..

lapis sequoia
#

oh sure, not talking about production data

#

like its just randomly selected rn

faint ravine
#

Try color and resolution. Give color more emphasis for now. I gotta sleep now. Tell me about your progress tomorrow..

lapis sequoia
#

Sure, will try to get some colors

#

thx!

light warren
#

Hey, i dont know if this the right channel, but could i ask a question about portfolio optimization using python

minor geyser
#

Hello, has anyone ever heard of DataCamp? If you have what is your thoughts on it?

amber blaze
#

I'm trying to find a way to hide code cell input in a regular Jupyter Notebook so it just shows the output (ideally with some kind of toggle to show or hide the input) - is there a way to do that? I find instructions for things like 'Jupyter Book' but I guess that is a different thing.

amber blaze
#

Looks like "Hide input" extension might be what I want

desert oar
#

there's another extension in that same package that does it, i think

#

look through the list

#

note that if you're running a kernel in a separate environment from jupyter, the extension must be installed in the jupyter environment

amber blaze
#

Ok thanks, right now I am just using vanilla Jupyter Notebooks - I'm just starting out with them.

dense lintel
#

i have an mp3 file
and i want to check if some part of that mp3 file is present in that mp4 file

bold timber
#

hi, i'm so confused about handling outliers. how do you decide whether a value is an outlier or not?

#

is there a parameter to decide whether a value is an outlier or not?

desert oar
smoky lava
#

Ok so I've written this small package that recursively crawls links starting with an initial URL, then crawling the URLs on that page and so on... example:
```py
crawler = Crawler(
    head_url='https://docs.python.org',
    branching=1,
    max_depth=2,
    url_filter=URLFilter('python.org', True)
)
crawler.crawl()
```
produces the following json data
```json
{
  "3.9.7 Documentation": {
    "url": "https://docs.python.org",
    "links": {
      "3.11.0a0 Documentation": {
        "url": "https://docs.python.org/3.11/",
        "links": {
          "3.9.7 Documentation": {
            "url": "https://docs.python.org/3.9/",
            "links": {}
...
```

#

I don't have any specific idea, I just thought there might be some mining utility in this

bold timber
west dagger
#

is there any way to render a dynamic web page besides requests_html?

chilly geyser
#

Outliers can be extreme events and they can stem from the same data-generating process
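
There is no single parameter that settles it, but one common convention (an assumption here, nobody above proposed it) is Tukey's rule: flag values more than 1.5 times the interquartile range below the first quartile or above the third. A minimal standard-library sketch with made-up data:

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [10, 12, 11, 13, 12, 11, 10, 500]  # hypothetical measurements
outliers = iqr_outliers(data)
print(outliers)
```

Flagging a value this way only marks it for a closer look; whether to drop, cap, or keep it still depends on whether it is a genuine extreme event.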

smoky lava
west dagger
#

so im stuck with selenium then

smoky lava
#

Looks like your options are to use PyQt's web browser utilities to run the JavaScript, or to use Selenium. It does look like you can use Selenium to load specific snippets of JavaScript instead of loading the whole page. That could save a lot of time

#

@west dagger

west dagger
smoky lava
#

no problem!

lapis sequoia
#

guys, any idea what would be the metric to use to check the performance of a neural topic model? and why should we use the mentioned method?

serene scaffold
#

!remind 5h topic modeling

arctic wedgeBOT
#
You got it!

Your reminder will arrive on <t:1631533733:F>!

lapis sequoia
#

sure, that would be really nice! thanksss man!

iron basalt
#

(right click on cmd prompt -> run as administrator (popup will ask for permission))

#

(programs run without admin cannot touch stuff in certain folders like program files, system32, etc)

#

(though i'm not sure why it would need permissions, maybe some conda thing)

royal crest
#

shouldn't need root access for pip

#

or uhh what's the windows equivalent

#

Admin

iron basalt
#

Windows has further separation between the real root and admins.

royal crest
#

oh turns out they had their problems fixed

iron basalt
#

Not meant to use the real root unless you need to do a recovery boot.

royal crest
#

4 messages below with a thank you

iron basalt
#

(Or uninstall something that won't let itself be uninstalled, that bug still there)

royal crest
#

good to know

slate fox
#

Hello

#

i am trying to run some kaggle code in my Google Colab to understand it but i am getting an error

#

can anyone help me

#

i am not able to find this on the net

tidal bough
# slate fox

The normal way to get a column of a dataframe is df["column name"]. If the column name is a valid identifier, like mycoolcolumn, then you can also do it as df.mycoolcolumn. Check your column names - presumably, text_sent is not the column name.
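
A tiny illustration of the two access styles, with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({"mycoolcolumn": [1, 2], "text sent": ["x", "y"]})

by_attr = df.mycoolcolumn  # works: the name is a valid identifier
by_key = df["text sent"]   # names with spaces need bracket access

print(list(df.columns))    # quick way to check the real column names
```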

slate fox
#

okay got it thanks

rigid zodiac
#

Morning everyone, quick question, have you ever used an LSTM, CNN, or ConvLSTM model? If yes, do you shuffle your data and randomly split it into train and test sets?

slate fox
#

is anyone good with docker kubernetes who can help me understand something or do you know a server where I can get that help

royal crest
#

not sure if containerisation is data science or AI related

rigid zodiac
royal crest
#

what do you want me to share exactly

rigid zodiac
buoyant adder
#

Why you should not directly remove outliers? To understand check out this video:
https://youtu.be/BAQvZntcOpo

This will give you an intuition why you should not remove outliers and how they can be used to facilitate your analysis and expand your vision.
arctic wedgeBOT
rigid zodiac
#

Any body know how to stack frames?

velvet thorn
rigid zodiac
#

I'm working on a mmwave radar and wondering how can I stack all the frame together

drifting mason
#

Hey guys, I wanted an Idea

#

I have a list of molecule IDs from a database

#

I have their corresponding names

#

All I wanna do is search the IDs in the database (online database called ChEBI) and check if I have the proper names

#

example

#

Here say suppose, there is an online fruit database

#

The ID of Apple is 11111

#

I wanna make a python program to search the online fruit database 22222 and check if it is really mango

#

How can I do that

#

Please help me

royal crest
#

i believe chembl has API you can use

#

see here

drifting mason
#

Sorry it is ChEBI

royal crest
#

check out libChEBI

#
#

this article explains it in detail

#

this one for the github

drifting mason
hardy cloud
#

hey everyone! i've been learning python for a year and it's my first language. i wanna enter the data science field but i'm afraid i don't have the required skills, so my question is, what should i be capable of doing before learning data science?

rigid zodiac
#

the ability to not give up.... tbh

#

also you basically have to read a lot of stackoverflow and github + able to understand the error

gaunt marsh
#

I am using Matplotlib and I have a bar chart. Is it possible to plot directly to an image file with high resolution?

#

not a 640x480 png file but maybe a 4K picture with good zoomability

royal crest
#

!d matplotlib.pyplot.figure

arctic wedgeBOT
#

matplotlib.pyplot.figure(num=None, figsize=None, dpi=None, facecolor=None, edgecolor=None, frameon=True, FigureClass=<class 'matplotlib.figure.Figure'>, clear=False, **kwargs)
Create a new figure.
royal crest
#

experiment around the params figsize and dpi
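
As a rough sketch of how those two parameters combine: the pixel size of the saved file is approximately `figsize` in inches times `dpi`, so 12.8 x 7.2 inches at 300 dpi comes out at roughly 3840 x 2160. The bar data and output path below are made up:

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, writes files without a window
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12.8, 7.2))  # size in inches
ax.bar(["a", "b", "c"], [3, 7, 5])           # placeholder bar chart

out_path = os.path.join(tempfile.gettempdir(), "bar_chart_4k.png")
fig.savefig(out_path, dpi=300)  # pixels are roughly inches * dpi
plt.close(fig)
```

Saving to a vector format such as `.svg` or `.pdf` is another option when unlimited zoom matters more than a fixed pixel size.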

gaunt marsh
velvet thorn
#

wait

#

it's not

#

must have been thinking of something else

#

or did it get changed

gaunt marsh
#

its like that @velvet thorn

velvet thorn
gaunt marsh
#

this is my code with np.unique

velvet thorn
gaunt marsh
velvet thorn
lapis sequoia
#

quick question:

  1. Can inner product be something other than dot product for say R^2?

reasoning to ask question:
An inner product is a generalization of the dot product.
altho it does mention multiplying, I'd want to know if we can define our way of it?
from: https://mathworld.wolfram.com/InnerProduct.html

  2. also, assuming that we can have some different inner product, can we create Hilbert space by that?
    if yes, is there any example of that?
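
For reference, a standard example (not from the linked page): any symmetric positive-definite matrix $A$ defines an inner product on $\mathbb{R}^2$, and the dot product is the special case $A = I$. With one arbitrary choice of $A$:

```latex
\langle x, y \rangle_A = x^{\top} A \, y, \qquad
A = \begin{pmatrix} 2 & 1 \\ 1 & 3 \end{pmatrix}, \qquad
\langle x, y \rangle_A = 2 x_1 y_1 + x_1 y_2 + x_2 y_1 + 3 x_2 y_2 .
```

Since every finite-dimensional real inner product space is complete, $\mathbb{R}^2$ with $\langle \cdot, \cdot \rangle_A$ is also a Hilbert space.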
gaunt marsh
velvet thorn
#

but also

#

I was thinking of something else

gaunt marsh
#

I have an Array with multiple sub-arrays. I want to compare the subarrays to each other and filter out the duplicate subarrays. The subarrays contain integer values. What is the best way to do that?

lapis sequoia
#

sorting out can take O(nlogn) after that removing dupes(if you sort that way) can be done in O(n)
(altho not sure)
also you may ask this question in #algos-and-data-structs

dusty anchor
faint ravine
#

Can someone help with offline speech recognition

light warren
#

hey, could someone help me with this error

lone drum
#

when i replace my -9999 values by NaN values it is not getting replaced , i am using pandas
my code
```python
df = pd.read_csv(f'{path}{file_name}{extension}')
print(df)
print()

# replace -9999 values by NaN values
new_df = df.replace(-9999, np.NaN)
print(new_df)
```
```
        day  temperature  windspeed  event
0  1/1/2017           32          6   Rain
1  1/2/2017       -99999          7  Sunny
2  1/3/2017           28     -99999   Snow
3  1/4/2017       -99999          7      0
4  1/5/2017           32     -99999   Rain
5  1/6/2017           31          2  Sunny
6  1/6/2017           34          5      0

        day  temperature  windspeed  event
0  1/1/2017           32          6   Rain
1  1/2/2017       -99999          7  Sunny
2  1/3/2017           28     -99999   Snow
3  1/4/2017       -99999          7      0
4  1/5/2017           32     -99999   Rain
5  1/6/2017           31          2  Sunny
6  1/6/2017           34          5      0
```
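
One thing that stands out in the printed frame: the sentinel in the data is `-99999` (five nines), while the code replaces `-9999` (four nines), so nothing matches. A small sketch with an inline copy of the first few rows, showing both a replace after loading and the `na_values` option at load time:

```python
import io

import numpy as np
import pandas as pd

# Inline stand-in for the CSV file; only the first rows are reproduced.
csv_text = """day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/2/2017,-99999,7,Sunny
1/3/2017,28,-99999,Snow
"""

df = pd.read_csv(io.StringIO(csv_text))
new_df = df.replace(-99999, np.nan)  # the value must match the data exactly

# Alternative: declare the sentinel as missing while reading.
df_at_load = pd.read_csv(io.StringIO(csv_text), na_values=[-99999])

print(new_df)
```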

gaunt marsh
hardy cloud
#

for a lot of data, this code is slow, O(n^2); look on Stack Overflow for a faster one!

serene scaffold
#

still looking

#

sigh, the explanation is really long. let me try to figure it out.

gaunt marsh
#

For example: I want only the subarrays which are at least 3 times in the super array

#

so every unique subarray and just normal duplicates (two times) will be removed

lapis sequoia
#

making a dict?

gaunt marsh
#

or let's say: I want to remove every subarray, which has no duplicates. I tried it with unique but it doesn't seem to work in my chart

desert oar
serene scaffold
# lapis sequoia reminding you 😉

I haven't found the paper associated with it yet. unfortunately I have to get back to working on it, but I might figure it out since that's what I'm working on today.

gaunt marsh
#

if you look at my chart, there are a lot of bars which only appear once. every bar represents a subarray. I want to remove the unique ones

desert oar
#

oh, this thing

#

i told you, use a Counter and remove all the elements with count 1

#

that should hopefully remove the vast majority of your data

gaunt marsh
#

okay the collections counter which has to be imported, right?

desert oar
#

!e ```python
from collections import Counter
data = list('aabcabdafbba')
item_counter = Counter(data)
item_counter_dup = {k: v for k, v in item_counter.items() if v > 1}
print(item_counter)
print(item_counter_dup)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 | Counter({'a': 5, 'b': 4, 'c': 1, 'd': 1, 'f': 1})
002 | {'a': 5, 'b': 4}
lapis sequoia
#

hello>

#

my python AI is not taking my voice input

#

what shall i do

#

?

gaunt marsh
#

What's the matter with numpy-array? Do I have to use a DataFrame for that?

desert oar
#

and yes, there's no numpy ndarray count method

#

!e

from collections import Counter
data = list('aabcabdafbba')
item_counter = Counter(data)

print('All counts:', dict(item_counter))
print('Unique counts:', {k: v for k, v in item_counter.items() if v == 1})
print('Duplicated counts:', {k: v for k, v in item_counter.items() if v > 1})
arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 | All counts: {'a': 5, 'b': 4, 'c': 1, 'd': 1, 'f': 1}
002 | Unique counts: {'c': 1, 'd': 1, 'f': 1}
003 | Duplicated counts: {'a': 5, 'b': 4}
desert oar
#

i also told you to use tuples for this

#

numpy isn't helpful for this problem

#

you want a list of RGB tuples

#

[(r1, g1, b1), (r2, g2, b2), ...]
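
Applied to the "at least 3 times" requirement from earlier, the Counter approach on a list of RGB tuples might look like this sketch (the pixel values and `min_count` are made up):

```python
from collections import Counter

# Hypothetical pixel data: one (r, g, b) tuple per pixel.
pixels = [
    (255, 0, 0), (0, 255, 0), (255, 0, 0),
    (0, 0, 255), (255, 0, 0), (0, 255, 0),
]

counts = Counter(pixels)  # tuples are hashable, so they can be counted directly
min_count = 3
frequent = {rgb: n for rgb, n in counts.items() if n >= min_count}

print(frequent)
```

Setting `min_count = 2` instead would only drop the colours that appear once.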

gaunt marsh
#

Okay thanks. I have it as tuples but commented it. Played around with np.array and np.unique

pine wolf
#

well, no count method, but you can do :

In [47]: a = np.array(list('aabcabdafbba'))

In [48]: (a == 'a').sum()
Out[48]: 5
desert oar
#

oh you probably can do this with numpy unique

#

!e ```python
import numpy as np

rgbs = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [1, 2, 3],
    [4, 5, 7],
])

rgb_uniq, rgb_uniq_idx, rgb_counts = np.unique(rgbs, return_index=True, return_counts=True, axis=1)

print(rgb_uniq)
print(rgb_uniq_idx)
print(rgb_counts)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 | [[1 2 3]
002 |  [4 5 6]
003 |  [1 2 3]
004 |  [4 5 7]]
005 | [0 1 2]
006 | [1 1 1]
desert oar
#

hm, i wonder why that first one is not deduplicated

#

i'll have to experiment with this

#

!e ```python
import numpy as np

rgbs = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [1, 2, 3],
    [4, 5, 7],
])

rgb_uniq, rgb_uniq_idx, rgb_counts = np.unique(
    rgbs, return_index=True, return_counts=True, axis=0
)

print(rgb_uniq)
print(rgb_uniq_idx)
print(rgb_counts)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 | [[1 2 3]
002 |  [4 5 6]
003 |  [4 5 7]]
004 | [0 1 3]
005 | [2 1 1]
desert oar
#

there we go @gaunt marsh ☝️

#

that would work if you already have a numpy array of rgb values, e.g. 700k rows and 3 columns (or 4 if you have alpha)

#

!e ```python
import numpy as np

# Every row is an RGBA quad
rgbas = np.array([
    [0.50, 0.75, 0.25, 1.00],
    [0.25, 0.50, 0.75, 1.00],
    [0.50, 0.75, 0.25, 0.50],
    [0.25, 0.50, 0.75, 1.00],
]) * 255
print(rgbas)

# Every row is an RGB triple (drop the A channel)
rgbs = rgbas[:-1]
print(rgbs)

rgb_uniq, rgb_uniq_idx, rgb_counts = np.unique(
    rgbs, return_index=True, return_counts=True, axis=0
)
print(rgb_uniq)
print(rgb_uniq_idx)
print(rgb_counts)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 | [[127.5  191.25  63.75 255.  ]
002 |  [ 63.75 127.5  191.25 255.  ]
003 |  [127.5  191.25  63.75 127.5 ]
004 |  [ 63.75 127.5  191.25 255.  ]]
005 | [[127.5  191.25  63.75 255.  ]
006 |  [ 63.75 127.5  191.25 255.  ]
007 |  [127.5  191.25  63.75 127.5 ]]
008 | [[ 63.75 127.5  191.25 255.  ]
009 |  [127.5  191.25  63.75 127.5 ]
010 |  [127.5  191.25  63.75 255.  ]]
011 | [1 2 0]
... (truncated - too many lines)

Full output: https://paste.pythondiscord.com/ayepatiyey.txt?noredirect

desert oar
#

so you would use rgb_uniq_idx as the bar position in the plot, and rgb_counts as the bar size
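
One thing to watch in the RGBA example above: `rgbas[:-1]` slices along the first axis, so it drops the last row rather than the alpha channel, which is why the printed `rgbs` still has four columns. A minimal sketch of the column slice, using a smaller made-up array (the values are assumptions, not the original data):

```python
import numpy as np

# Hypothetical RGBA data: one pixel per row, columns are R, G, B, A
rgbas = np.array([
    [128, 191, 64, 255],
    [64, 128, 191, 255],
    [128, 191, 64, 128],
    [64, 128, 191, 255],
])

dropped_row = rgbas[:-1]   # removes the last row, keeps all 4 columns
rgbs = rgbas[:, :-1]       # keeps every row, removes the last (alpha) column

# Count identical RGB rows once the alpha column is gone
rgb_uniq, rgb_counts = np.unique(rgbs, return_counts=True, axis=0)
print(dropped_row.shape, rgbs.shape)
print(rgb_uniq)
print(rgb_counts)
```

With the alpha column removed, rows that differ only in transparency collapse into the same RGB triple.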

hard gull
#

Is there a good way of learning AI in Python?

grave frost
serene dragon
#

Is there any easy way to color scatter plot in Pandas according to column values?

serene scaffold
gleaming osprey
#

plz roast immediately. ```python
classifications = []

for i in data['Rating']:
    if i < 3:
        classifications.append('Average')

    elif i >= 3 and i < 7:
        classifications.append('Great')

    elif i >= 7:
        classifications.append('Excellent')

data['Classifications'] = classifications```

#

there is data['Rating'] between 1 - 10 and I have to classify. Question: Classify Movies Based on Ratings [Excellent, Good, and Average]
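
A vectorized alternative to the loop is `pd.cut`, which bins a numeric column in one call. This is only a sketch: the ratings are made up, the thresholds are the ones from the loop above, and the middle label follows the question text ("Good") rather than the `'Great'` used in the loop.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"Rating": [1.5, 3.0, 6.9, 7.0, 9.2]})  # made-up ratings

# right=False makes each bin closed on the left: [-inf, 3), [3, 7), [7, inf)
data["Classifications"] = pd.cut(
    data["Rating"],
    bins=[-np.inf, 3, 7, np.inf],
    labels=["Average", "Good", "Excellent"],
    right=False,
)
print(data)
```

The result is a categorical column, which also keeps the three labels in a fixed order for sorting or plotting.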

serene scaffold
serene dragon
#

When i have used custom names for legend

#

it got disconnected from hue

#
df = pd.read_csv('https://raw.githubusercontent.com/dspiegel29/ArtofStatistics/master/00-1-age-and-year-of-deathofharold-shipmans-victims/00-1-shipman-confirmed-victims-x.csv', sep=",", decimal='.')
plt.figure(figsize=[12,10])
plt.rc('font', size=15)
x = sns.scatterplot(x='fractionalDeathYear', y='Age', data= df, hue='gender2',)
plt.ylabel('Wiek ofiary')
plt.xlabel('rok')
plt.legend(['Kobieta','Mężczyzna'], loc='upper left')
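
One way to keep the legend tied to `hue` is to rename the category values in the data and let seaborn build the legend itself, instead of passing hand-typed names to `plt.legend`. A hedged sketch of the renaming step only: the toy frame stands in for the real CSV, and the `'Women'`/`'Men'` values and the `gender_label` column name are assumptions.

```python
import pandas as pd

# Toy stand-in for the real CSV; the 'Women'/'Men' values are assumed
df = pd.DataFrame({
    "fractionalDeathYear": [1975.2, 1984.6, 1995.1],
    "Age": [72, 81, 65],
    "gender2": ["Women", "Men", "Women"],
})

# Hypothetical mapping from the stored values to the display names
label_map = {"Women": "Kobieta", "Men": "Mężczyzna"}
df["gender_label"] = df["gender2"].map(label_map)
print(df[["gender2", "gender_label"]])
```

Passing `hue='gender_label'` to the existing `sns.scatterplot` call should then produce legend entries that match the point colours, so the manual label list is no longer needed.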
bold timber
#

Can I replace an outlier with the median?

serene scaffold
bold timber
serene scaffold
bold timber
#

6500000 and others around 100-100000

serene scaffold
bold timber
#

yes, like 150000

serene scaffold
#
df[df > 100_000] = df.median(axis=1)
#

this will do it with the median of every column, I believe.
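
Note that `df.median(axis=1)` computes one median per row; per-column medians come from the default `axis=0`. A minimal sketch of replacing values above a threshold with each column's own median, using made-up data and the 100_000 cut-off from the message above:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [100, 200, 6_500_000],
    "b": [50_000, 150_000, 100],
})
threshold = 100_000

# For each column, swap values above the threshold for that column's median
cleaned = df.apply(lambda col: col.mask(col > threshold, col.median()))
print(cleaned)
```

The median here is computed with the outliers still included; since the median is robust to a few extreme values, that is usually acceptable, but it is a choice rather than a requirement.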

bold timber
serene scaffold
bold timber
gleaming osprey
serene scaffold
crisp wing
#

Anyone knows how to get statsmodels.api.OLS.predict working with train/test-split data, like with sklearn's predict?
Using pandas.dataframes,

from sklearn.model_selection import train_test_split
import statsmodels.api as sm
        df_x_train, df_x_test, df_y_train, df_y_test = train_test_split(df.loc[:, df.columns != 'Y'],
                                                                        df.loc[:, 'Y'],
                                                                        train_size=0.75, random_state=42)

        df_x_train = sm.add_constant(df_x_train)
        df_x_test = sm.add_constant(df_x_test)
        model_lm = sm.OLS(df_y_train, df_x_train)
        result = model_lm.fit()
        df_pred = model_lm.predict(df_x_test) # ValueError here

Gives me a

ValueError: shapes (11976178,3) and (11976178,3) not aligned: 3 (dim 1) != 11976178 (dim 0)

Which to me seems like the predict function assumes df_x_test to be same size as df_x_train?

bold timber
serene scaffold
bold timber
serene scaffold
#

you might use a more sophisticated way of determining what an outlier is, like standard deviation, or something

bold timber
serene scaffold
serene scaffold
bold timber
digital sparrow
#

Does anyone know a pandas command to drop all rows below a certain row number? For example, row 21 and below need to be removed.
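
Slicing is one common way to do this. The sketch below uses a made-up frame; which slice is right depends on whether "row 21" means the 21st row by position or the index label 21, which the question does not say.

```python
import pandas as pd

df = pd.DataFrame({"value": range(30)})  # made-up frame with 30 rows

first_20 = df.iloc[:20]   # position-based: keeps positions 0..19
by_label = df.loc[:20]    # label-based: keeps labels up to and including 20
print(len(first_20), len(by_label))
```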

iron basalt
digital sparrow
#

Thanks!

lapis sequoia
#

Hi, how do I extract all tickers from Bloomberg? I'm still looking.

#

Im using xbbg library

hasty grail
desert oar
#

R operators have different precedences so you don't need them in R

serene scaffold
#

idk R

lapis sequoia
#

Can anyone recommend some scientific Python projects on GitHub that use a parameters file as input to a computational model?

desert oar
#

maybe DVC has some examples

lapis sequoia
#

DVC?

desert oar
#

i don't know if they have anything, but maybe they have some examples repository somewhere

lapis sequoia
#

I found a project called pyrk on GitHub that uses a Python file for input parameters. But I'm curious how others handle parameters especially if those parameters are defined in text file like TOML or YAML. https://github.com/pyrk/pyrk

GitHub

Python for Reactor Kinetics. Contribute to pyrk/pyrk development by creating an account on GitHub.

timid saffron
#

What is this?

vital stone
#

same question here

inner pebble
#

Hello there,
(I am working with pandas)
I have a column where each row is a series of date ranges in string format, like: [2021-07-23--2021-07-30, 2021-07-16--2021-07-21, ...]

I would like first to explode the date range represented by each index and then create one column for the starting date and another column for the ending date.

How would you do that?

Thanks for your help.

#

I did it by exploding first and then manually splitting the string in each row, but I'm looking for something maybe more accurate

shut dock
#

I would start with splitting the string by '--', then splitting those by '-', and saving each as a datetime object. So quick and dirty...

#

start_date = datetime.datetime(date_range_entry)[0].split('--')[0].split('-')[0], date_range_entry)[0].split('--')[0].split('-')[1], date_range_entry)[0].split('--')[0].split('-')[2])

#

I screwed that up a bit with the copy/paste, but do you see what I'm after?

#

start_date = datetime.datetime(date_range_entry.split('--')[0].split('-')[0], date_range_entry.split('--')[0].split('-')[1], date_range_entry.split('--')[0].split('-')[2]) there we go

#

end_date = datetime.datetime(date_range_entry.split('--')[1].split('-')[0], date_range_entry.split('--')[1].split('-')[1], date_range_entry.split('--')[1].split('-')[2])

#

You may have to convert the string to int, now that I'm looking at it

inner pebble
#

ahhh yes I understand now your idea. That s a good idea.
I will try this thanks @shut dock

shut dock
#

start_date = datetime.datetime(int(date_range_entry.split('--')[0].split('-')[0]), int(date_range_entry.split('--')[0].split('-')[1]), int(date_range_entry.split('--')[0].split('-')[2]))

#

Sure thing 👍

tender hearth
#

datetime accepts strings
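
To be precise, the `datetime.datetime(...)` constructor takes integers, but the parsing helpers on the same class accept strings. A short standard-library sketch, using one of the range strings from the question:

```python
from datetime import datetime

date_range_entry = "2021-07-23--2021-07-30"

# Split the range once, then parse each half as a date string
start_str, end_str = date_range_entry.split("--")
start_date = datetime.strptime(start_str, "%Y-%m-%d")
end_date = datetime.fromisoformat(end_str)  # ISO dates also parse directly
print(start_date, end_date)
```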

desert oar
#

!eval @inner pebble ```python
import pandas as pd

df = pd.DataFrame({
"y": pd.Series([
"2021-07-23--2021-07-30",
"2021-07-16--2021-07-21",
])
})
print(df)

y_ranges = df['y'].str.split('--', expand=True)
y_ranges = y_ranges.apply(pd.to_datetime)
df[["y_start", "y_end"]] = y_ranges
print(df)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 |                         y
002 | 0  2021-07-23--2021-07-30
003 | 1  2021-07-16--2021-07-21
004 |                         y    y_start      y_end
005 | 0  2021-07-23--2021-07-30 2021-07-23 2021-07-30
006 | 1  2021-07-16--2021-07-21 2021-07-16 2021-07-21
desert oar
#

if you run df.info(), you'll see that those 2 new columns are indeed datetime data and not strings

inner pebble
#

expand=True 👍 I forgot this option
and I hadn't thought to use an independent variable (y_ranges) to perform the split and then put it back into the initial df
Thanks @desert oar 👍

desert oar
#

@inner pebble you can do it without the intermediate variable too 🙂

df[["y_start", "y_end"]] = (
    df['y']
    .str.split('--', expand=True)
    .apply(pd.to_datetime)
)
#

but yes, the key is expand=True

gaunt marsh
# desert oar there we go @gaunt marsh ☝️

I already have tuples and it looks like this:

```python
[(200, 200, 215), (161, 162, 172), (72, 45, 31), (116, 75, 33), (182, 182, 195), (103, 63, 26), (151, 152, 156), (211, 211, 228), (190, 191, 204), (98, 75, 49), (93, 51, 23), (135, 135, 135), (117, 107, 84), (163, 99, 35), (172, 173, 184), (172, 173, 184), (89, 84, 75), (163, 148, 120), (167, 152, 124), (173, 162, 142), (237, 235, 217), (108, 101, 90), (122, 115, 101), (135, 128, 115), (199, 190, 170), (144, 129, 103), (155, 141, 114), (213, 206, 187), (188, 177, 155), (151, 144, 128), (70, 68, 62)]
```
#

Used this: rgbs = list(map(tuple, list_of_three_values))
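
A list of tuples like this can be counted either directly with `collections.Counter` or by converting it to an `(n, 3)` array and reusing the `np.unique(..., axis=0)` approach from earlier. A small sketch with four of the tuples from the list above:

```python
from collections import Counter
import numpy as np

# A few of the tuples from the list above, including the repeated one
rgbs = [(200, 200, 215), (161, 162, 172), (172, 173, 184), (172, 173, 184)]

counts = Counter(rgbs)   # pure-Python counting, keys stay as tuples
arr = np.array(rgbs)     # shape (n, 3), one colour per row
uniq, uniq_counts = np.unique(arr, axis=0, return_counts=True)

print(counts.most_common(1))
print(uniq)
print(uniq_counts)
```

Note that `np.unique` returns the unique rows in sorted order, not in order of first appearance.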

dawn orbit
#

Hi, I have a problem in a published Jupyter notebook about an algorithm that's called S3D. There is a figure that is meant to visualize your trained model, but for some reason I get this error "Length of values (1) does not match length of index (9)" when running the notebook's code block. What's written in the error log is also quite complicated, so I have no idea where to start debugging. Could someone here give me a hand with this?

worn bough
#

@dawn orbit could you give the command that raises the error? And do you know what the 'values' and the 'index' are?

dawn orbit
mortal dove
#

I've mainly been using R for time series analysis since that's just what we've been using in uni classes.
Which libraries are generally used for it in python?

dawn orbit
#

I'd say Prophet and Darts

#

But you probably already used Prophet if you've used R

prime hearth
#

hello, i would like to please ask

#

what is the easiest way to implement sentiment analysis?

#

I've done a little research, and mostly it uses natural language processing with algorithms and tools such as:
Neural Network/CNN/RNN
TextBlob
MultinomialNB
TfidfVectorizer
K-means
SVM

gaunt marsh
#

Do you know how to make the bars thicker and the ticks per bar visible? These are about 129.000 values which are processed via a matplotlib bar chart

ebon walrus
#

theres no way to spread it out other than find the mean of the data and display it in 5-10 bars max if possible

#

try to find a way around it

gaunt marsh
#

There were around 700.000 values and I removed a lot of them 😫

ebon walrus
#

i mean unless you want to make it a scrollable dataset you could possibly make it so that you could scroll very far with it but i dont have the skills for that

ebon walrus
#

there aren't enough colors to display that many values

dawn orbit
#

What is the point of the graph? Just looks like colors and their frequency to me

#

What I mean is there has to be a better way to display whatever it is that you want to display in the graph

gaunt marsh
dawn orbit
short heart
#

Does LabelEncoder from sklearn automatically fill the nans

uncut barn
#

Hi guys,

I'm doing a project that involves a binary segmentation task of an image as well as predicting the grade of an image from (0-3), would I need to create 2 separate CNNs for each task or is there some way I can interlink these 2 tasks, any help would be much appreciated.

rigid zodiac
west dagger
#

anyone knows why when i try to start selenium, it cant locate the chrome driver

#

ive added it to /local/bin but joker still wont work

quasi parcel
#

hi everyone i hope you are doing well
i need help in constructing a co-occurrence matrix
the data will be like this
for this data
so i am trying to construct customer_id and product_ids
the confusion is
there is a list of product ids, so how can i create a co-occurrence matrix
can anyone help
please

#

this is the sample data

serene scaffold
#

This would be easier to replicate if you provided the dataframe as text so that we can reproduce it. With screenshots, it's all conjecture.

quasi parcel
#

thank you @serene scaffold let me try that

serene scaffold
#

If it doesn't work, please do print(df.head().to_csv()) and copy/paste the text into this chat exactly so that we can continue.

quasi parcel
#

sure i will do that

#

will 10 row be sufficient?

serene scaffold
#

just the code I provided (namely print(df.head().to_csv())) is fine.

#

though you may need to replace df if it has a different name.

#

I assume it worked; otherwise please ping or I won't know to come back.

quasi parcel
#

there is this key error

#

@serene scaffold

serene scaffold
#

go ahead and do print(df.head().to_csv()) then and provide the text.

quasi parcel
#

sure

arctic wedgeBOT
#

Hey @quasi parcel!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .csv attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

serene scaffold
#

It should only be five lines of text.

quasi parcel
serene scaffold
#

thanks. one moment

quasi parcel
#

sure thanks a lot

serene scaffold
quasi parcel
#

could you share the output

#

if you dont mind

serene scaffold
#
            Time_stamp                                            ...
Product_id      194757 203271 214387 235573 240620 248684 255443  ... 353982 353987 358385 358386 359707 359783 359784
Customer_ID                                                       ...
2145912              0      0      0      0      0      0      0  ...      0      0      0      0      0      0      0
3803083              0      0      0      0      0      0      0  ...      0      1      0      0      0      0      0
4678591              0      0      1      1      1      0      0  ...      0      0      0      0      0      0      0
5247070              0      0      0      0      0      0      0  ...      0      0      0      0      0      0      0
5371841              0      0      0      0      0      0      0  ...      0      0      1      1      0      0      0
6410476              0      0      0      0      0      0      0  ...      0      0      0      0      0      1      1
6427723              0      0      0      0      0      0      0  ...      1      0      0      0      0      0      0
6428352              1      0      0      0      0      1      0  ...      0      0      0      0      1      0      0
6430668              0      0      0      0      0      0      0  ...      0      0      0      0      0      0      0
6435539              0      1      1      0      0      0      1  ...      0      0      0      0      0      0      0

[10 rows x 23 columns]
quasi parcel
#

seriously thanks a lot now, can we store this in nparray or something?

serene scaffold
#

yes, you can do .to_numpy() at the end.

#

and all your wildest dreams will come true

quasi parcel
#

can we also do product to product? @serene scaffold

#

please

#

sorry to ask

serene scaffold
quasi parcel
#

same co-occurrence matrix for product to product

serene scaffold
#

let me think

desert oar
serene scaffold
#

that might be better lol

serene scaffold
#

I didn't really have a plan

quasi parcel
#

thank you so much @serene scaffold and @desert oar

desert oar
#

that pivot table thing was pretty clever

#

i would have probably done it with a loop or something

quasi parcel
#

i have a doubt @desert oar i have done with binarizer but

#

its not doing co occurance

#

i mean i want like this

#

but i need to add the number of occurrence of product ids

desert oar
#

!e @quasi parcel ```python
from itertools import combinations
import pandas as pd

def unpack_to_columns(series, colnames=None):
    return pd.DataFrame(series.tolist(), columns=colnames)

df = pd.DataFrame({
    "customer_id": ["c123", "c456", "c789"],
    "product_ids": [["a", "c"], ["b", "c", "f"], ["a", "c", "e"]],
}).rename_axis(index="transaction_id")

unique_prod_ids = df['product_ids'].explode().unique()

adjmat_prod_prod = (
    unpack_to_columns(
        df['product_ids']
        .apply(lambda x: [list(pair) for pair in combinations(x, 2)])
        .explode(),
        colnames=['product_id_1', 'product_id_2']
    )
    .pivot_table(index='product_id_1', columns='product_id_2', aggfunc=len)
    .fillna(0)
    .astype(int)
    .reindex(index=unique_prod_ids, columns=unique_prod_ids, fill_value=0)
)

print(adjmat_prod_prod)
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 | product_id_2  a  c  b  f  e
002 | product_id_1               
003 | a             0  2  0  0  1
004 | c             0  0  0  1  1
005 | b             0  1  0  1  0
006 | f             0  0  0  0  0
007 | e             0  0  0  0  0
desert oar
#

!e and the customer-product matrix:

import pandas as pd

df = pd.DataFrame({
    "customer_id": ["c123", "c456", "c789"],
    "product_ids": [["a", "c"], ["b", "c", "f"], ["a", "c", "e"]],
}).rename_axis(index="transaction_id")

unique_cust_ids = df['customer_id'].unique()
unique_prod_ids = df['product_ids'].explode().unique()

adjmat_cust_prod = (
    df
    .explode('product_ids')
    .rename(columns={'product_ids': 'product_id'})
    .pivot_table(index='customer_id', columns='product_id', aggfunc=len)
    .fillna(0)
    .astype(int)
    .reindex(index=unique_cust_ids, columns=unique_prod_ids)
)

print(adjmat_cust_prod)
arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

001 | product_id   a  c  b  f  e
002 | customer_id               
003 | c123         1  1  0  0  0
004 | c456         0  1  1  1  0
005 | c789         1  1  0  0  1
desert oar
#

i also cooked up versions using scipy sparse matrices if this data is really big, i can show those if you want

#

@serene scaffold you might be interested too

desert oar
#

of course, if the data is really huge (streaming from 1 billion gzipped rows or a data warehouse query), you will want to use a dok_matrix instead and fill it with a boring old loop

#

you'll need to incrementally build the product_id -> i dict in that case, too
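
The "boring old loop" idea can be sketched without any sparse library: build the `product_id -> i` dict as products first appear, and keep counts in a dictionary keyed by `(i, j)`, which is the same layout a dictionary-of-keys matrix uses. The generator below is a stand-in for a real streamed source and reuses the sample data from above.

```python
from collections import Counter
from itertools import combinations

def stream_transactions():
    # Stand-in for a large streamed source; same sample data as above
    yield ["a", "c"]
    yield ["b", "c", "f"]
    yield ["a", "c", "e"]

prod_index = {}          # product_id -> integer position, built as we go
pair_counts = Counter()  # (i, j) -> number of co-occurrences

for product_ids in stream_transactions():
    # Assign the next free position to any product not seen before
    for pid in product_ids:
        prod_index.setdefault(pid, len(prod_index))
    # Count every pair of products within this transaction
    for p1, p2 in combinations(product_ids, 2):
        pair_counts[(prod_index[p1], prod_index[p2])] += 1

print(prod_index)
print(pair_counts)
```

Once the stream ends, the keys and values of `pair_counts` could be handed to a sparse constructor such as `scipy.sparse.coo_matrix((values, (rows, cols)), shape=(n, n))` with `n = len(prod_index)`.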

quasi parcel
#

@desert oar thank you so much

charred umbra
#

So this is my paper for an algorithm I made for identifying diseases using semi-supervised learning

#

you guys have any suggestions for making the algorithm better?

quasi parcel
#

@desert oar Hi, thank you so much. Can you suggest which course you took for data science, please?

desert oar
#

i didn't take any standalone data science courses

velvet thorn
#

I'm looking to take one next year 🙏

desert oar
#

and some moderately regrettable "applied machine learning" classes

desert oar
#

not very rigorous, fluffy assignments and exams

#

although one of them was my first introduction to unix (literally a mainframe operated by the university) so that was a very positive experience

#

i also didn't have a thesis adviser until my thesis was already mostly done... oops

velvet thorn
#

were they classes targeted @ students in general (i.e. "breadth" courses)?

desert oar
#

I think it was a combination of departments dialing in the right difficulty levels, and catering to students in less rigorous masters degrees

#

I remember I specifically took the "applied" classes because I wanted lots of project experience, but that was a mistake

#

It turns out that my friends who took the not-"applied" version not only had more rigorous material, but also had much more intensive projects, and professors who were better connected in industry

serene scaffold
#

!d pandas.DataFrame.idxmax

arctic wedgeBOT
#

DataFrame.idxmax(axis=0, skipna=True)
Return index of first occurrence of maximum over requested axis.

NA/null values are excluded.
serene scaffold
#

@desert oar where is the level=?

#

||WHERE IS IT?!||

desert oar
#

Does it return a tuple if it's a multiindex?

serene scaffold
#

it returns a Series

desert oar
#

Interesting

#

Oh with axis

#

!e ```
import pandas as pd

idx = pd.MultiIndex.from_tuples([
("a","x"), ("b", "y"), ("c", "z")
], names=["l1", "l2"])
y = pd.Series([1,2,3], index=idx, name="y")

print(y.idxmax())
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your eval job has completed with return code 0.

('c', 'z')
desert oar
#

@serene scaffold ☝️

#

This took me like 10 minutes to type on my phone

serene scaffold
desert oar
#

Lol yes

serene scaffold
#

but look at your DMs!

desert oar
#

Heh i see

#

You get an array of tuples then right?

serene scaffold
#

Series (of tuples)

desert oar
#

Seems like that can work with .loc

serene scaffold
#

it's not enough information though

#

there needs to be three selections per row

#

one for precision, recall, and f1

#

I don't need to know that the CRF precision was usually best because what that really means is that it was hyper conservative in making predictions lemon_angrysad

desert oar
#

Do 3 separate idxmax'es, one per column

serene scaffold
#

isn't that basically what my groupby was?

desert oar
#

Yeah but any time i see is_max = x == x_max i squirm

serene scaffold
#

also
what about ties?

#

there were a few

desert oar
#

Aha

#

Idxmax takes the first

serene scaffold
#

in which case they both get to be bold

desert oar
#

In that case yes do your thing

#

It would be nice if they supported that natively

#

On a bigger data set that could be the difference between a fast operation and a very slow one

#

Numpy probably supports it

serene scaffold
#

pandas not having something that numpy has despite being a layer on top of numpy

desert oar
#

You can compute all maxima and their indices in a single pass over the data, it's annoying that the top level functions don't make it easy to do that

desert oar
#

But yeah pandas can be weird like that
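
The difference between the two approaches discussed above can be sketched on a small made-up table of scores (the model names and numbers are assumptions): `idxmax` returns only the first label per column, while comparing against the column maximum marks every tied row.

```python
import pandas as pd

# Made-up scores; 'crf' and 'lstm' tie on recall
scores = pd.DataFrame(
    {"precision": [0.91, 0.85, 0.80],
     "recall": [0.70, 0.70, 0.65],
     "f1": [0.79, 0.77, 0.72]},
    index=["crf", "lstm", "svm"],
)

first_max = scores.idxmax()        # one label per column, first on ties
is_max = scores.eq(scores.max())   # boolean frame, True for every tied maximum

print(first_max)
print(is_max)
```

The boolean frame is what a "bold every winner" table needs, at the cost of a second pass over the data.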

tender hearth
#

Hey folks, I'm a bit confused on how Transformers work:
If I understand right, at the beginning of the decoding process, a special '<start>' token is fed to the decoder. That input goes through the first MHA layer, and then to the second MHA layer, this time along with the encoder output. Then the decoder loops this process until the special '<end>' token is produced.

If this is the case, why is masking the decoder input during training even required? Why not just feed the '<start>' token, wait for the decoder to finish, and then compute loss based on the decoder output and the ground truth?

serene scaffold
tender hearth
#

Can you elaborate? I don't understand

serene scaffold
#

What do you know about attention?

tender hearth
#

Still learning about it, but I think I have basic understanding. The attention mechanism computes an attention vector that determines how much weight a particular part of the input should have in the output

serene scaffold
tender hearth
#

track?

serene scaffold
#

but what word would determine that "track" is an acceptable replacement?

tender hearth
#

running

serene scaffold
#

"track" is not in the sentence currently

#

yes 😄

#

so whatever you pick for the mask, we would say that that word attends to "running" more than "was".

tender hearth
#

Ok, so mask the output during training, but what happens during inference?

#

When you don't know the length of the output and thus can't compute a masking vector

serene scaffold
#

let me think

#

I'm not sure sad_cat

tender hearth
#

Haha, it's fine

serene scaffold
#

have you read the original paper?

tender hearth
#

It was a bit too convoluted for me, so I escaped to blog posts, but I agree, I think the paper has the answer

serene scaffold
#

I guess I need to read it too

#

don't duck that, it's just a sad cat

velvet thorn
#

the paper isn’t written very well IMO

#

but maybe it’s because I don’t have a math background

#

😔

#

wait, wrong paper

#

ignore me

tender hearth
#

@serene scaffold Okay so it turns out masking is only done during training. And masking is done in the first place because the ground truth is fed into the decoder in parallel (instead of sequentially)

#

I was under the impression that it did the same thing as it does in inference, i.e. loops token by token until '<end>' is produced
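
The masking described above can be sketched with NumPy. This is only an illustration under stated assumptions: `scores` is random data standing in for the scaled query-key products, and `T` is a made-up target length. The lower-triangular mask lets every position be computed in parallel while position `i` still sees only positions `0..i` of the ground truth.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4                               # target length (made up)
scores = rng.normal(size=(T, T))    # stand-in for Q @ K.T / sqrt(d)

# Position i may attend to positions 0..i only
causal = np.tril(np.ones((T, T), dtype=bool))
masked = np.where(causal, scores, -np.inf)

# Row-wise softmax; masked entries get weight exactly 0
shifted = masked - masked.max(axis=1, keepdims=True)
weights = np.exp(shifted)
weights /= weights.sum(axis=1, keepdims=True)
print(np.round(weights, 2))
```

At inference time no future tokens exist yet, so the token-by-token loop already provides the same restriction.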

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @livid kiln until <t:1631672865:f> (9 minutes and 59 seconds) (reason: newlines rule: sent 126 newlines in 10s).

vagrant kite
#

!unmute 581318950760218636

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: pardoned infraction mute for @livid kiln.

vagrant kite
#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in discord, you can paste your code here:
https://paste.pydis.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

vagrant kite
#

use that ^

livid kiln
#

Here the step size is 5

df["diff"] = df["val"].diff()
df["diff"].mode()[0]
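
A tiny worked example of the step-size detection above, on made-up values that sit on a grid of 5 with one off-grid point:

```python
import pandas as pd

# Made-up values on a grid of 5 with one off-grid point (17)
df = pd.DataFrame({"val": [0, 5, 10, 15, 17, 20, 25, 30]})

df["diff"] = df["val"].diff()   # first entry is NaN
step = df["diff"].mode()[0]     # most frequent difference between neighbours
print(step)
```

The off-grid point produces two unusual differences (2 and 3), but the mode still recovers the dominant step.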
#

Here the true/false border is 645/650 and their difference does satisfy the step size of 5

#
df[600:700]
livid kiln
#

I've completed the task, any suggestion on how to improve this code?

df["diff"] = df["val"].diff()
diffmode = df["diff"].mode()[0]
idx = df[df["diff"] <= diffmode].index
#Remove index which have smaller than diffmode difference
df = df.loc[pd.Index(np.arange(idx.min()-diffmode,idx.max()+diffmode,diffmode)).intersection(df.index)]
df["diff"] = df["val"].diff()
#Select index which have multiples of diffmode difference
nonmodal = np.setdiff1d(df["diff"].dropna().unique(), diffmode)
bands = df.loc[df["diff"].isin(nonmodal)]
#Find the latter (low to high, therefore upperBand) index on the true/false boundary
upperBand = bands.loc[bands["truth"].rolling(2).sum() == 1.0]["val"].values[0]
lowerBand = bands.iloc[bands.index.get_loc(upperBand) - 1]["val"]
#Get the inner block of the bands
upperIdx = df.iloc[df.index.get_loc(upperBand) - 1]["val"]
lowerIdx = df.iloc[df.index.get_loc(lowerBand) + 1]["val"]
df[lowerIdx:upperIdx]

EDIT: comments added

desert oar
#

add comments, because this is non-trivial code

#

you will probably forget what this does in 2 weeks 🙂

fluid sparrow
#

@desert oar hi, can you help me understand how to populate a list in R with numeric values

desert oar
#

@fluid sparrow be specific about:

  1. what you are trying to do
  2. what you already tried
  3. what went wrong when you tried
fluid sparrow
#

Sure

desert oar
#

also be specific about whether you mean a list or a numeric vector

fluid sparrow
#

I am trying to create a list, I did myList = [ 1, 2, 3, 4, 5, 6]

#

and it says object error

#

I have no idea what the difference is between a list and a numeric vector

#

I am completely new to R

desert oar
#

are you in a course? or just trying to teach yourself?

#

that's python syntax, not R syntax

fluid sparrow
#

I am in a course but I just began

#

I have prior experience with Python

#

Instead of answering this question directly, if it's easier you can point me to good R resources

desert oar
#

this is good if you already know R, but i don't know any beginner programming guides off the top of my head https://r4ds.had.co.nz/

#

i can search

#

anyway, R has a very different approach to representing data compared to python

#

python has basic data types like strings, numbers, booleans, etc.

#

R does too, but everything is an array - in most cases, you can't get a "number" by itself, you can only get an array containing 1 number

#

consequently, you won't use lists as frequently as you do in python, because the basic data types are already very list-like

#

as for syntax, R does not have special syntax for constructing lists or "vectors" (the R name for a 1-D array). the c() function constructs a vector, and the list() function constructs a list

fluid sparrow
#

@desert oar thank you so much! I will definitely check out the site you referred me to and I hope I can build my knowledge on it! As you may already know I am very confused and rusty so hopefully I will get better!

desert oar
#

i honestly wouldn't look there to start

#

it assumes you already know the basics

#
# A vector of numbers
v1 <- c(1, 2, 3)

# A vector of strings, called a "character" vector
v2 <- c("hello", "welcome to R")

# A vector of Booleans, called a "logical" vector
v3 <- c(TRUE, FALSE, TRUE)

# A list of vectors
items <- list(v1, v2, v3)
fluid sparrow
#

Thank you so much again I also see you have been helping other people its truly kind and incredible! I hope I can be as knowledgeable someday

#

I see so I am trying to create a vector of numbers

#

does the c represent something?