#data-science-and-ml

final kiln
#

I don't think they'll be able to meet my salary expectations anyway

lofty thorn
#

let's go

final kiln
#

I don't trust this equity split stuff

lofty thorn
#

is it an HR round or the first step to get the job in that company?

final kiln
#

no this is the third round already

#

there's a fourth and a fifth ._.

lofty thorn
#

oh

final kiln
#

gotta take it in stride

final kiln
#

I think there's a sixth

#

no it's 5, and the fifth I have to travel to their office, which I don't mind since it's in a cool place

lofty thorn
#

you seem very unsure

final kiln
#

third is solving a physics problem, fourth is more leetcode and fifth is traveling to their place

#

i speak Spanish but im not gonna answer

lofty thorn
#

oops

#

why is 'i = 1' written?

tidal bough
#

That's just a sum as i goes from 1 to n (inclusive).

lofty thorn
#

oh ok

#

i thought the first item needed to be one

wooden sail
#

the first value i takes is 1

#

math notation is something one reads and interprets, just like any other language

#

you would read this as "the sum of elements x_i, where i goes from 1 to n"

#

so, x1 + x2 + ... + xn
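
To make the notation concrete, here is a minimal Python sketch of the same sum. The list `x` and its values are made up for illustration; the only point is that the math index i runs from 1 to n, while Python lists start at 0.

```python
# Hypothetical values standing in for x_1 ... x_n.
x = [3, 1, 4, 1, 5]
n = len(x)

# Math indices run 1..n; Python list indices run 0..n-1, hence x[i - 1].
total = sum(x[i - 1] for i in range(1, n + 1))

# The built-in sum over the whole list gives the same result.
print(total == sum(x))
```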

vapid minnow
#

Hey, i've made a 3D plot with matplotlib and i would like to know if there is some way to enable antialiasing for it?

#

I've tried passing antialiased=True as a parameter for plot but it doesn't seem to make any difference

#

Here's the code of the graph:

def plot_graph3D(func, zlim, name):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')

    x = np.linspace(0.0, 2.0, GPAPH_RESOLUTION)
    y = np.linspace(0.0, 1.0, GPAPH_RESOLUTION)

    X, Y = np.meshgrid(x, y)
    Z = func(X, Y)

    ax.view_init(15, GRAPH_ANGLE)
    ax.plot_surface(X, Y, Z, cmap='viridis', antialiased=True)

    ax.set_zlim(0.0, zlim)

    ax.set_xlabel('Pixel Luminance')
    ax.set_ylabel('Average Luminance')
    ax.set_zlabel('Output Luminance')

    if name is not None:
        plt.savefig(DIR + 'graph3D-' + name + '.pdf', bbox_inches='tight')

    plt.show()
lusty lotus
#

hello! i am a 16 year old living in the uk. i want to have some work experience in research before i enter university. i am highly interested in reinforcement learning. i have successfully replicated a fairly well-known research paper and am currently refining the code (i have a functional MVP but am currently writing code to hopefully speed up the learning). however i consider my ml background to be "shaky" as i still lack some math understanding in some fundamental aspects of ml. i currently use pytorch, which abstracts some math away. would this be ok for volunteer work experience, i.e. can i just work behind a library?

i would want to have some work experience with some tech companies (i can volunteer). does anyone know any good companies/programmes that i can join to get some real-world experience? i know that at least for some us-based highschoolers there are some programmes for them and internships are usually for undergraduates. can someone please recommend some international/uk based programmes/opportunities for me? tysm :D

jagged latch
#

I need a bit of help on something related to Plotly Dash. I am making an app, but it appears that I'm running into an issue where not all the data is being displayed when the data is being entered into a new dataframe from the text box which runs code to generate df_2. Here's the background. When I first run the program, I use Tkinter to enter an initial date for when the app first goes live. The data looks good when I do that. The problem is when I enter the same exact date in the text box and click the button, which should execute the Callback, which should then call the previous functions to generate a new dataframe, now a lot of the data is missing. Theoretically, nothing should change because it's the same exact query being run. Most of the data from df is the same, but when data is supposed to be saved into df_2 at certain times, it just does not do it, when it did it prior to the app going live with the initial date and continues like it never even saw said row in question even though it did. What could be causing such an issue when running all those functions from the Callback?

#

Here's the syntax I'm using to add rows to df_2 if it helps where A, B, C, D are variables defined in the function:

df_2.loc[len(df_2.index)] = [A, B, C, D]

#

Many times that line of code would appear like it's being ignored when the functions using this line are being called from the Callback.

#

It does not get ignored when I first run the program and enter the date. This problem I noticed only happens when the date is entered in the text box of the Dash App by the user.

#

In other words. It happens when everything is executing inside of the Callback instead of outside of it, where I get the expected results.
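
One possible cause, offered only as a guess since the callback code isn't shown: `df_2.loc[len(df_2.index)] = ...` appends only when the index labels are exactly 0..n-1. If `df_2` has been filtered or had rows dropped, the label `len(df_2.index)` may already exist, and the assignment silently overwrites that row instead of adding one. The sketch below reproduces this with made-up values; `df_safe` and `rows` are hypothetical names.

```python
import pandas as pd

df_2 = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    columns=["A", "B", "C", "D"],
)

# Filtering keeps the original labels: the index is now [1, 2], but len() is 2.
df_2 = df_2[df_2["A"] != 1]

# Label 2 already exists, so this overwrites that row instead of appending.
df_2.loc[len(df_2.index)] = [0, 0, 0, 0]
print(len(df_2))  # still 2

# One way around it: collect rows in a plain list and build the frame once.
rows = [[5, 6, 7, 8], [0, 0, 0, 0]]
df_safe = pd.DataFrame(rows, columns=["A", "B", "C", "D"])
```

If the frames are module-level globals that the Dash callback mutates, state left over from the first run could produce the same symptom, so rebuilding `df_2` from scratch inside the callback may be worth trying as well.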

carmine pecan
#

Hi, I need help sending an email to a website. I want to scrape their data and use it to train my AI for a research paper/whatever that paper you send to conferences is called.

What should the email contain? They already told me I can do it via a call but I want to have an actual email.

brave arch
#

Hello, I need help with Python. I have some code, but I am unable to use Beautiful Soup to scrape the column-level data from a website

river cape
#

Hey, in SVR we call the points which are outside the epsilon-insensitive tube support vectors, right?
And for training the model we use the .fit() method, right? Should it always have a 2D array as its input?

past meteor
#

And then call .shape on it

river cape
#

It gives it in the form of a 2D array

past meteor
#

try a bunch of things, try multiplying two vectors of shape nx1 in numpy as well

#

You gotta run all the code to get the intuitions

river cape
#

I want to make that 1D array to be vertical?

past meteor
#

You don't need to write print in notebooks btw

river cape
#

Is there any way?

past meteor
#

also try calling .shape here on it

#

(Look up what .reshape(-1) does as well)

wooden sail
#

i would note that numpy ndarrays of dimension 1 do not behave like proper vectors

#

you can multiply them from the left or right with no issues
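
A small sketch of that point, with a made-up matrix `A` and vector `v`: a 1-D array is accepted on either side of a matrix product, whereas an explicit column of shape (2, 1) is only accepted on the side where the shapes line up.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, 1.0])  # shape (2,), neither a row nor a column

left = v @ A   # v is treated like a row vector, result has shape (2,)
right = A @ v  # v is treated like a column vector, result has shape (2,)

col = v.reshape(-1, 1)  # explicit column, shape (2, 1)
ok = A @ col            # (2, 2) @ (2, 1) works and gives shape (2, 1)
# col @ A would raise ValueError, since (2, 1) @ (2, 2) has mismatched inner dimensions.
```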

past meteor
#

That's true

#

Even so, that's something you need to run to find out

#

Compared to someone telling you

wooden sail
#

hmmm

past meteor
#

Ah, it's (-1, 1)

#

You should be able to call .shape on the array you get
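
For anyone following along, a minimal sketch of what `.reshape(-1, 1)` and `.reshape(-1)` do to the shape, plus the nx1 multiplication suggested earlier. The array values are arbitrary.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
print(a.shape)    # (4,)

# -1 means "infer this dimension"; the result is a vertical column.
col = a.reshape(-1, 1)
print(col.shape)  # (4, 1)

# reshape(-1) flattens back to one dimension.
flat = col.reshape(-1)
print(flat.shape)  # (4,)

# Multiplying two (n, 1) arrays with * is elementwise and keeps the shape.
elementwise = col * col
# Mixing a column with a 1-D array broadcasts to an (n, n) grid instead.
grid = col * a
```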

final kiln
#
# Calculation
number_of_neurons = 86_000_000_000  # 86 billion neurons
average_synapses_per_neuron = 5_000  # average synapses
parameters_per_synapse = 40  # 10 parameters for each of 4 modes (neurotransmitter types, receptor types, synaptic strength, other factors such as modulatory receptors, ion channels, postsynaptic properties)

total_parameters = number_of_neurons * average_synapses_per_neuron * parameters_per_synapse
total_parameters
#

a fermi estimate for the number of parameters in the human brain

#

17200000000000000

#

17.2e15 if I counted the zeros right
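
A quick way to avoid counting zeros by hand is to let Python format the number. This sketch just repeats the same three factors from the estimate above.

```python
# Same factors as in the estimate: neurons * synapses per neuron * parameters per synapse.
total_parameters = 86_000_000_000 * 5_000 * 40

# Thousands separators make the digit groups easy to count.
print(f"{total_parameters:,}")

# Scientific notation: 1.72e+16, which is the same value as 17.2e15.
print(f"{total_parameters:.2e}")
```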

final kiln
#

But ngl, I might've over engineered my pipeline

#

Now that I'm able to fit the entire dataset in ram

#

The code will probably be useful eventually, so I'm not too bummed out

#

Maybe now it fits, but nothing is stopping me from just getting more data

desert oar
# jagged latch I would if this wasn't for work though.

setting up reproducible examples is an important skill. if you can't isolate the problem enough to explain it to someone else, you might not understand the problem that well yourself. often the process of figuring out how to reproduce a problem also elucidates the root cause, and from there a solution.

#

that's been my experience at least, and i know i'm not alone in that.

final kiln
#

That moment when I realize most of my job is debugging stuff

desert oar
#

truly. 70% debugging, 20% data cleaning, 5% talking to people, 5% actual data analysis and machine learning

final kiln
#

so good to hear that, was starting to think I was doing something wrong ahah

#

There's just like, infinite configuration. In the GitHub actions stuff alone there were like 10 unexpected things that prevented them from reaching the 24h run time mark

#

Not even in my code, just libs not working as expected, or settings that I didn't know about, like a timeout at 6h, that one was unfortunate cuz I could've seen it coming if I read the docs

desert oar
# final kiln so good to hear that, was starting to think I was doing something wrong ahah

it's the current plague of doing data science in industry. data scientists are under-supported by engineers and devops people. so you have this perverse trend of trying to hire data scientists with unicorn-level credentials to do 3 different jobs at once, instead of hiring 2 extra people to collaborate with the data scientist and get a lot more value out of the whole team. save 2/3 on payroll but only get 1/4 productivity, it's a bad deal in the end for everybody (including and especially the data scientist who doesn't get to actually do their own job and their resume / skills atrophy over time).

versed pilot
#

I'm a data analyst and it's not much better

#

There are some data engineers in the company but they have their own work to focus on, mostly I have to do my own full stack end to end tasks, from system administration , to etl scripts, to cloud platform work, to SQL, python and Tableau

desert oar
#

that's particularly bad

#

80-90th percentile bad

past meteor
#

I think companies doing this aren't necessarily wrong

#

DS folk I meet just think they can get away with knowing 1 thing

versed pilot
#

On the other hand companies are trying to get away with having small teams and fewer people.

past meteor
#

And that's really a thing for the vast majority of jobs, especially if you're not a specialist in a large company

versed pilot
#

you get to a point when that is not very productive

past meteor
#

Yeah, if they're not pulling a lot of revenue what else are you supposed to do?

desert oar
#

don't hire people that you can't make use of

versed pilot
#

I'm a jack of all trades and it's very hard to become a master of any

past meteor
#

Unless your argument is: don't get a data scientist until you're at a big scale

desert oar
#

not at big scale, but don't get a data scientist if you don't have at least 1 engineer that can help support getting their stuff into prod

past meteor
#

Or they hire someone that can do a bit of both? 🤔

desert oar
#

it never works out that way

past meteor
#

I think it's a myth that having more than one responsibility means you're doing twice the work

desert oar
#

i can do both. i've done both, professionally. unequivocally it's worse when you expect someone to do both.

past meteor
#

you're doing half of each

desert oar
#

that's mythical man-month thinking

past meteor
#

I do both and I don't do twice the job tbh

desert oar
#

right

#

case in point, no?

#

it's not about doing twice the job. it's about doing less than half of each job

#

if i split my time 50/50 i end up doing 30% of each job

#

that leaves 40% of the job not done, or backlogged

#

yet i'm still spending the same 100% of hours

past meteor
#

It's all data, I never understood the distinction

versed pilot
#

there's the time spent context switching, and missed synergies

desert oar
#

then maybe you're not doing what i'm doing

past meteor
#

I don't see the context switching, data is data 🤷

desert oar
#

also just the fact that it's ridiculous to expect a data scientist to also be a software engineer

versed pilot
#

If I work on notebooks I have a variety of projects I can work on

#

If I switch from notebooks to sql to tableau to unix sysadmin

#

that's context switching 😉

desert oar
#

is writing an HTTP API and setting up a CI/CD pipeline "data" work in any non-trivial sense?

#

we literally pay kids $100k out of college to do that and only that, full time

#

here i am doing that and trying to also do data science and keeping an eye on the ETL pipelines

past meteor
#

My interest is in making things that work. If that requires an HTTP API, CI/CD, an ETL, ... so be it personally

#

What I see of a lot of DS is no interest in making things that work

desert oar
#

same. that's why i do those things and know how to do them. i still think it's stupid to expect to hire someone who can do that

#

you and i are unicorns

past meteor
#

The interest is in doing stuff in notebooks

desert oar
#

it's very very well known across basically all industries that 1 person doing 3 jobs in 1/3 time ratios is less effective than 3 people specializing

past meteor
#

ML model in a notebook, plot in a notebook 👎

#

I don't want to make any excuses for that

desert oar
#

i use notebooks 🤷‍♂️ not sure what that has to do with it

#

i don't expect them to run in production

past meteor
#

I also use notebooks, that's not what I mean

desert oar
#

but even if i did, i don't see why it matters. tests are tests, pipelines are pipelines, etc.

past meteor
#

I mean, no interest in going to prod

desert oar
#

is it no interest in going to prod? or is it lack of interest in doing what should be someone else's job? another specialist's job, so you can focus on your own specialty?

past meteor
#

Because some of my colleagues believe their responsibilities start at getting a clean dataset and end at producing a PoC

past meteor
#

When your responsibility should be: getting something that works. If there's no one to bail you out, then you gotta do it yourself imho

past meteor
#

sure I do

desert oar
#

yes, at smaller scales, it pays off to be a generalist and to hire generalists

past meteor
#

Depending on the scale of their company

#

You can 100 % blame them

desert oar
#

have you ever worked with a good DBA?

past meteor
#

What I'm saying is DS do this at any scale

#

And that's unreasonable

desert oar
#

sure, maybe that's entitled on their part. but at the same time, it's ridiculous to expect this level of multi-specialization as table stakes for all DS

river cape
#

sc2.inverse_transform(regressor.predict(sc1.transform([[6.5]])).reshape(-1,1))
the above statement is used to predict the result of an svr regression algo

surreal sedge
#

hu

river cape
#

Is it necessary to use a reshape()
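
A hedged sketch of why the reshape is there, assuming `sc1` and `sc2` are `StandardScaler` objects and `regressor` is an `SVR` (the message doesn't say, and the toy data below is invented). `predict` returns a 1-D array, while the scaler was fitted on a 2-D column and, as far as I know, recent scikit-learn versions expect 2-D input for `inverse_transform`, so the prediction is turned back into one column first.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Toy data, invented for illustration: one feature, one target.
X = np.arange(1, 11, dtype=float).reshape(-1, 1)  # shape (10, 1)
y = (X.ravel() ** 2) * 1000.0                     # shape (10,)

sc1 = StandardScaler()  # scales the feature
sc2 = StandardScaler()  # scales the target
X_scaled = sc1.fit_transform(X)
y_scaled = sc2.fit_transform(y.reshape(-1, 1)).ravel()

regressor = SVR(kernel="rbf")
regressor.fit(X_scaled, y_scaled)  # X must be 2-D, y is 1-D

pred_scaled = regressor.predict(sc1.transform([[6.5]]))
print(pred_scaled.shape)  # 1-D: (1,)

# Reshape to a single column before undoing the target scaling.
pred = sc2.inverse_transform(pred_scaled.reshape(-1, 1))
print(pred.shape)         # 2-D: (1, 1)
```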

past meteor
#

I think there's just like an overspecialisation of DS folk

#

I don't think any CS niche overspecialises this much

desert oar
#
  1. DS isn't CS

  2. i totally disagree, i think there's an unrealistic expectation of DS people also being software engineers and unless we pay them 2x what software engineers make, it's just employers trying to be cheap

past meteor
#

What

#

But you're not doing twice the job 😭

desert oar
#

i spent 6 years in school studying math, statistics, and machine learning. you now want me to also become a professional software engineer?

final kiln
#

I've been mostly working at startups, and I do really like doing a bit of everything but, I've decided that I'm not a substitute for a team, there's a point at which it's just not fair. I personally don't see the fun in just doing one thing, but I also see the line in the sand as an extremely important thing for my own well being

desert oar
#

that's two careers and two specializations. i expect to be paid double accordingly.

past meteor
#

The expectation is just that people can deliver results

#

Every role has this problem, it's not unique to DS

desert oar
#

but that's a very startup-centric small-scale mindset

past meteor
#

Not necessarily imho

desert oar
#

there is absolutely a niche for people who can "deliver results"

past meteor
#

It's definitely compounded by the fact that the slice of the process DS people do is super narrow

#

Narrower than other roles

#

Like pure pure DS roles. You need a large supporting cast for that

#

I'm just pleading for knowing more than 1 thing is all, just knowing DE is already a step up

left tartan
#

Ivory tower DSers

past meteor
#

At my job I grew into a lot of tasks because all the rest just says no

past meteor
#

That was the first contract I signed but I got cold feet and tore it up

#

I might in the future

wooden sail
#

i'm kinda in salt rock's camp here

#

on a time constraint, time spent learning math is time spent not learning software eng

#

i would say that's a job for 2 people at least

#

the code and software optimizations you learn on one side are completely unrelated to the ones on the other

past meteor
#

The truth is that for most roles there's diminishing returns on that math vs. software

wooden sail
#

in general, a lot of "DS" positions really just need software eng

past meteor
#

If you go that deep then you should really only aim for the ones where the diminishing returns aren't doing you in

wooden sail
#

people don't even know what DS and ML are in the first place

versed pilot
#

But a lot of the data work requires a different kind of software engineering. Optimal SQL or pandas is very different to the skills you learn with C/C++/Java type software engineering

final kiln
#

But there's a reason why we team up, right? A team goes farther, and for a team to work you gotta have lanes

past meteor
#

But it's definitely a two way street that's what I meant in the discussion tbh

final kiln
#

Yeah I'm aware of what happens, especially in the smaller companies

past meteor
#

Because from observation DS are unique in the fact that they say "not my job" and don't grow towards the mismatch in the hire

versed pilot
#

Actually a lot of DS are moving into Data Engineering

#

getting AWS/GCP certifications, learning dbt

past meteor
#

Unless the point all of you are trying to make is that companies should hire fewer data scientists

final kiln
#

I don't know, I had to learn to say no. Not because I'm not willing to take additional tasks but because I'll very quickly become overworked

wooden sail
#

a lot of them don't really need it imo

versed pilot
#

well, they should think whether they need a data engineer first

#

before going for the data scientist

final kiln
#

I'm very productive in general and that creates this illusion even to myself that I can just keep on doing more stuff, but it's not the case

wooden sail
#

just basic stats would take them a long way, which doesn't require heavy ds

past meteor
#

To come full circle

#

What I'm trying to say is, move some of those math / stats hours to software

#

Or work at google / do a PhD / ...

versed pilot
#

but business people don't always get statistics

#

don't like uncertainty

past meteor
#

We actually had a breath of fresh air 2 hires ago

past meteor
#

The person that we hired wasn't married to ML (and was previously a software engineer)

#

They ended up building an awesome ML product, one of the best we have on offer

#

Because they're willing to do what it takes

wooden sail
#

that kinda piggybacks on what i said though

#

that software eng is truly what is usually needed

past meteor
#

I think we're in agreement

wooden sail
#

you could probably even do with a single ds person that doesn't even code, but regularly participates in the meetings where stuff is arranged with the others

past meteor
#

I just don't agree with the "I can do two things so I need to be paid 2x" argument

wooden sail
#

yeah i guess that's unrealistic expectations, but from both sides

past meteor
#

Software isn't a monolith, software engineers themselves need to do 2+ things all the time (frontend, backend, devops, data, ...) and none of them makes this argument tbh

wooden sail
#

the employer not knowing what to ask for, and DS people being reluctant

past meteor
#

There's still things I more or less "refuse" to do because I don't enjoy them and I'm not good at them either, I'm just transparent about it

#

If anyone decides to not hire me on the basis of that, both of us win

wooden sail
#

lol

final kiln
#

Where's the line that separates the roles

past meteor
#

I can imagine the DS will be terrible at frontend

#

This doesn't happen tho

#

My next project is on HCI / explainable AI. The very first thing I'll do is make a frontend we'll use for the experiments.

#

My focus is on making cool stuff and if there's no one else to do it, then I step up. Obviously it'll take me longer than a specialist, but at the end we do have something tangible which is what matters

wooden sail
#

zestar, maker of cool stuff

past meteor
#

Yeah, maybe I should put that on my linkedin

#

And take away data scientist or whatever I have, I have been thinking of it 😛

final kiln
#

I guess the fear is to be stuck doing things that don't further what the person feels should be their career, and this is a pretty strong thing because a lot of people derive purpose from their work

wooden sail
#

i guess DS people hit this wall often because it's a buzzword that was turned into a career in unis for whatever reason

left tartan
#

Lots of people think ML -should- be their career path, and it (imo) won’t be for most of them.

#

I think reality is: careers are shaped primarily by opportunity, some luck, and a bit of preparation

past meteor
#

My organized thoughts will be written down about this. I have many sketches (actual drawings/figures) of what I think the problem is

#

Will take me a couple of months to write it out, but then I'll let all of you know

wooden sail
#

zestar approaching us in 3 months with a large mirror
"look"

left tartan
#

I’m wondering which other hype cycles have been like this.

wooden sail
#

bitconnect

past meteor
#

The gist of it: if you look at the N % most valuable work in an org, the org likely needs to be very large to sustain someone with a very lopsided skillset (I'll use radar charts for these).

left tartan
#

The dot com boom was just general SWEing, but I guess it mainstreamed web dev

versed pilot
#

yep, web dev hype in the late 90s

versed pilot
#

like Verilog, layout etc.?

past meteor
#

in what sense?

left tartan
#

One sec, there’s a talk…

wooden sail
#

idk if that's the best comparison though

versed pilot
#

Those chip companies just focus on chips though, they'll have a hardware team, a layout team, an embedded software team etc.

wooden sail
#

cuz there's also the current struggle that too few people study electronics compared to what the market would like (in the chip design end)

wooden sail
#

and the overall trend of fewer people studying STEM

left tartan
#

[EuroPython 2023 — Forum Hall on 2023-07-20]

https://ep2023.europython.eu/session/the-future-of-microprocessors

The Future of Microprocessors - a talk about the history of microprocessors, how we got here and what might happen next. There will be two laws, one equation, some graphs and a particle beam weapon out of Star Trek.

This work is lic...

wooden sail
#

what's the TL;DW

versed pilot
#

ok she's a bit of a legend

versed pilot
#

but I think in her line of work it's as I mentioned above, lots of verilog people, lots of embedded software people (including assembly)

#

and some people who are more into layout etc.

left tartan
#

Similarly, I think there’s some limit to the returns in DS in a single organization. Maybe a bit of a stretch.

versed pilot
#

It depends, you can do lots of R&D to develop the next processor or ASIC

#

but it's much harder to push CMOS technology further

#

or give up on CMOS and come up with a replacement

#

not sure how this compares to data science

wooden sail
#

past a certain point what you need is a team of physicists to research new fancy stuff and separately, engineers to try implementing it

versed pilot
#

CMOS is hitting limits in terms of electrons tunnelling through thin layers of insulator

#

quantum mechanics and all that

#

so you need a paradigm shift

#

and that was parallelisation, multicore, GPUs etc.

past meteor
#

Anyhow, I apologize for the controversial opinions! 😄

#

Esp to Salt Rock if he's still reading

#

It's a difficult topic

versed pilot
#

data science had the opposite with GPUs, suddenly a world of possibility opened

past meteor
#

Just when I thought I was out they pull me back in :/

#

We agree that people exaggerated with stuff like microservices because Google did it yeah? "Google does it, they're big so if we do it, we'll be big!"

final kiln
#

Just to clarify tho, I don't think it's healthy to do just one thing, but there's no arrogance in pursuing self determination in a career

past meteor
#

Who says it's not the same for data science 😭

#

At least in the way where it's frequently touted

versed pilot
#

people invented Hadoop to mimic Google's file system, Big Table etc.

#

and that was the data science fad of the 2010s

#

ok, one of them

brave arch
#

hello, I need help with Python

brave arch
#

for scraping a website, I am unable to get the subcolumn data

brave arch
#

import requests
from bs4 import BeautifulSoup
import csv

# Define the URL
url = "https://izw1.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm"

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the table using its attributes (modify as needed)
    table = soup.find('table', {'border': '1', 'width': '1500', 'bgcolor': '#ECFFFF'})

    # Open a CSV file in write mode
    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')

        # Process all rows
        for row in table.find_all('tr'):
            # Extract data from each row
            row_data = [column.text.strip() for column in row.find_all(['td', 'th'])]

            # Write the row data to the CSV file
            writer.writerow(row_data)

    print("Data has been successfully written to output.csv")

else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
final kiln
#

Here, use this ```

brave arch
#

I do not get the subcolumn data

final kiln
#

: P

past meteor
#

I also don't think this question is best suited for this room

#

Could you make a help thread?

iron basalt
#

(See who remains after the layoffs, it's not the specialists...)

#

Software is returning to where it was before, lots of generalists with many hats.

final kiln
#

import requests
from bs4 import BeautifulSoup
import csv

# Define the URL
url = "https://izw1.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm"

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the table using its attributes (modify as needed)
    table = soup.find('table', {'border': '1', 'width': '1500', 'bgcolor': '#ECFFFF'})

    # Open a CSV file in write mode
    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')

        # Process all rows
        for row in table.find_all('tr'):
            # Extract data from each row
            row_data = [column.text.strip() for column in row.find_all(['td', 'th'])]

            # Write the row data to the CSV file
            writer.writerow(row_data)

    print("Data has been successfully written to output.csv")

else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
#

It's the same code, I need the syntax highlight

final kiln
#

Like maybe show the page and the CSV

left tartan
#

But unclear on whether the subtotals are in the same html table (I didn’t look at the raw html)

final kiln
#

Yes I understood, I'm not understanding what went wrong

final kiln
#

Seeing the CSV will make it clear for me what is being described

left tartan
#

I think the basic problem is this isn’t a simple HTML table. Colspan headers, multiple splits, etc

brave arch
#

so i think there should be a way to distinguish the subcolumns, and the column names must be processed to create as many names as there are columns/subcolumns

left tartan
#

Yah, it’s the cold pans

#

Colspan

brave arch
#

how to fix this in my code so that I get appropriate data ?

final kiln
#

But it looks fine

brave arch
#

no, the "ICME Plasma/Field Start, End Y/M/D (UT)" can be divided into "ICME Plasma/Field Start Y/M/D (UT)" and "ICME Plasma/Field End Y/M/D (UT)"

#

currently the data goes under a different column header, which is Comp. Start, End (Hrs wrt. Plasma/

final kiln
#

I see three columns

brave arch
#

yes but the header is not mapped correctly and I need to fix my code

final kiln
#

Four rows before the row with the text

brave arch
#

missing the header

#

check the below row and it misses the sub column

#

so data is messed up

final kiln
#

Below row it's still three columns

#

And choosing cells at random, they seem to match

#

Maybe I'm not seeing something, but the only thing missing is the first row, which is the header

left tartan
#

The Comp Start values are actually the second column of the ICME plasma

#

Because of the col span

#
<thead>
<tr align="center"><td><b>Disturbance Y/M/D (UT) <A HREF="#(a)">(a)</a></b>  </td><td colspan="2">
<b>ICME Plasma/Field Start, End Y/M/D (UT) <A HREF="#(b)">(b)</a> </b> </td><td colspan="2">

<b>Comp. Start, End (Hrs wrt. Plasma/ Field) <A HREF="#(c)">(c)</a></b> </td><td colspan="2">
final kiln
#

Oooh I see it

left tartan
#

Just a terrible table design.

final kiln
#

But can CSV represent this table ?

left tartan
#

Yah, the col span needs to be split into two headers

#

Meaning, the colspan=2 headers need to be unpacked into two headers
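
A rough sketch of that unpacking step, kept independent of the scraping library. It assumes each header cell has already been reduced to a `(text, colspan)` pair (with Beautiful Soup the colspan would come from something like `int(cell.get('colspan', 1))`). The function name `unpack_header` and the `[1/2]` suffix scheme are made up for illustration; the header texts are copied from the page.

```python
def unpack_header(cells):
    """Expand (text, colspan) pairs into one name per logical column."""
    names = []
    for text, colspan in cells:
        if colspan == 1:
            names.append(text)
        else:
            # Number the copies so every CSV column gets a distinct name.
            names.extend(f"{text} [{i}/{colspan}]" for i in range(1, colspan + 1))
    return names


header = [
    ("Disturbance Y/M/D (UT)", 1),
    ("ICME Plasma/Field Start, End Y/M/D (UT)", 2),
    ("Comp. Start, End (Hrs wrt. Plasma/ Field)", 2),
]

for name in unpack_header(header):
    print(name)
```

Three header cells become five column names, which is the count the data rows underneath need to line up with.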

final kiln
#

Wait is the / meant to match the two columns

#

Like "name of first column/ name of second"

#

No it's the start, end

#

It's a date range or something

left tartan
#

Yah, start/end I think

#

The funny thing is you can just open the HTML in Excel

final kiln
#

I think even copying and pasting would work

left tartan
#

Yah

brave arch
#

any suggestion for code ?

brave arch
#

any suggestion ?

#

what exactly ?

left tartan
#

You'll have to modify this step: row_data = [column.text.strip() for column in row.find_all(['td', 'th'])]

brave arch
#

to ?

left tartan
#

You'll have to figure that out, I'm just telling you the problem and where to start.

brave arch
#

Okay

#
import requests
from bs4 import BeautifulSoup
import csv

# Define the URL
url = "https://izw1.caltech.edu/ACE/ASC/DATA/level3/icmetable2.htm"

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the table using its attributes (modify as needed)
    table = soup.find('table', {'border': '1', 'width': '1500', 'bgcolor': '#ECFFFF'})

    # Open a CSV file in write mode
    with open('output.csv', 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')

        # Process all rows
        for idx, row in enumerate(table.find_all('tr')):
            # Extract data from each row
            if idx == 0:
                # Handle headers with colspans for both main columns and subcolumns
                header_row = []
                for cell in row.find_all(['td', 'th']):
                    colspan = int(cell.get('colspan', 1))
                    header_text = cell.text.strip()
                    if colspan > 1:
                        # If colspan is greater than 1, add the text multiple times
                        header_row.extend([header_text] * colspan)
                    else:
                        # Otherwise, just add the text once
                        header_row.append(header_text)
                writer.writerow(header_row)
            else:
                # Extract data from each cell in the row
                row_data = [column.text.strip() for column in row.find_all(['td', 'th'])]
                writer.writerow(row_data)

    print("Data has been successfully written to output.csv")

else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")
#

This works for 1st 4 row and not all. Any suggestion ?

left tartan
#

This is all GPT code, right?

brave arch
#

not all .

#

some to get the suggestion

left tartan
#

So, the first step is to figure out what is not doing what you want it to do.

#

You ask: "Any suggestion ?". What do you need help with in this version?

#

Or, said differently, you say it works for 1st 4 rows. What happens on 5th and 6th?

brave arch
#

now with the updated code the first four rows work, but when it finds the 5th, which is again a column header, the colspan is not working and it is messing up the data

left tartan
#

What about 6th row?

brave arch
#

not working, same problem as earlier

left tartan
#

What's wrong with 6th row?

brave arch
#

Let me fix it

left tartan
#

The trick to debugging these problems is to add a few print statements, so you can see what's happening.

#

Do you know what a colspan is?
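For reference, a minimal sketch of the colspan idea discussed above. The helper name expand_row and the sample header are hypothetical, and the cells are given as plain (text, colspan) pairs so the example runs without fetching the page:

```python
def expand_row(cells):
    """Expand (text, colspan) pairs so each spanned column gets its own entry."""
    expanded = []
    for text, colspan in cells:
        # A cell with colspan=2 covers two columns, so repeat its text twice.
        expanded.extend([text] * colspan)
    return expanded


# Hypothetical header row: the second cell spans two columns (start/end).
header = [("Shock", 1), ("ICME start/end", 2), ("Quality", 1)]
print(expand_row(header))
```

In the scraper above, the pairs could be built per row as (cell.text.strip(), int(cell.get('colspan', 1))). Applying the expansion to every row instead of only idx == 0 would also cover header rows that repeat further down the table.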

short path
#

I need help with some basic machine learning. I am trying to solve the Titanic prediction problem from Kaggle, but after imputation my train data somehow gets more rows and then it doesn't match y_train

#
X = train_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = train_data['Survived']

X_train, X_val, y_train, y_val = train_test_split(X, y)

# Encoding

oh_enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)

oh_X_train = pd.DataFrame(oh_enc.fit_transform(X_train[['Sex']]))
oh_X_val = pd.DataFrame(oh_enc.transform(X_val[['Sex']]))

X_train_encoded = pd.concat([X_train.drop('Sex', axis=1), oh_X_train], axis=1)
X_val_encoded = pd.concat([X_val.drop('Sex', axis=1), oh_X_val], axis=1)

X_train_encoded.columns = X_train_encoded.columns.astype(str)
X_val_encoded.columns = X_val_encoded.columns.astype(str)

# Imputation

imputer = SimpleImputer()

imputed_train_data = pd.DataFrame(imputer.fit_transform(X_train_encoded))
imputed_test_data = pd.DataFrame(imputer.transform(X_val_encoded))

imputed_train_data.index = X_train_encoded.index
imputed_test_data.index = X_val_encoded.index

imputed_train_data.columns = X_train_encoded.columns
imputed_test_data.columns = X_val_encoded.columns
#

I put an X_train_encoded.describe() after the encoding and it says the dataframe has 668 rows at that point, which is what it should have

#

But when I do this after the imputation, for some reason it shows the df with around 830 rows, and this number varies a little bit every time I restart the kernel. At the end of the program, when trying to fit a model, I get this error: "ValueError: Found input variables with inconsistent numbers of samples: [838, 668]"

#

Do you have any idea about what it could be?

#

Is it wrong to do the imputation after encoding?

serene scaffold
short path
#

But then I can't think of an actual problem with the imputation I did

#

even if I take out the parts where I set the index and the columns, it keeps adding rows
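A likely cause, sketched on toy data: pd.concat(axis=1) aligns rows by index label, and a DataFrame built from the encoder's array gets a fresh 0..n-1 index, while X_train keeps the shuffled labels from train_test_split. The union of the two index sets is larger than either, and its size changes with each random split. The frame and values below are made up for illustration, and the encoder output is replaced by a hand-written array:

```python
import pandas as pd

# Toy frame whose index is shuffled, as it would be after train_test_split.
X_train = pd.DataFrame(
    {"Age": [22.0, None, 35.0], "Sex": ["male", "female", "female"]},
    index=[7, 2, 5],
)

# Stand-in for the encoder output: an array wrapped in a DataFrame gets a
# fresh RangeIndex (0, 1, 2), not the index of X_train.
encoded = pd.DataFrame([[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]])

# concat(axis=1) aligns on index labels, so the union {0, 1, 2, 5, 7} appears.
misaligned = pd.concat([X_train.drop("Sex", axis=1), encoded], axis=1)
print(len(misaligned))

# Reusing the original index keeps the rows lined up.
encoded_fixed = pd.DataFrame(encoded.to_numpy(), index=X_train.index)
aligned = pd.concat([X_train.drop("Sex", axis=1), encoded_fixed], axis=1)
print(len(aligned))
```

In the pasted code, the same idea would mean passing index=X_train.index (and index=X_val.index) when wrapping the encoder output, before the concat rather than after the imputation.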

serene scaffold
#

Encoding is about making sure that the same information is represented the same way, and that information is represented in a way that is intelligible by the model

#

Both of those concerns are equal

#

Imputation is about filling in missing information with whatever would be least interesting to the model

short path
#

that's very odd

serene scaffold
#

No...

short path
#

Could you try to run this program if I send you the dataset?

serene scaffold
#

No.

#

I'm actually on vacation

#

I even promised my mom that I wouldn't answer questions on discord during it.

#

I don't live with my mom bte

short path
#

lol

serene scaffold
#

Btw

#

I just answer questions on discord when I'm at her house so I don't have to talk to her

#

So she assumes I do it all the time.

short path
#

I get it

serene scaffold
#

Yeah

short path
#

but then I don't know what to do

serene scaffold
#

Me either

#

Just don't give up

#

You can do it

#

Eventually

short path
#

I could try to do it using get_dummies

#

probably it wouldn't give an error

#

but I want to see what the problem really is

serene scaffold
#

That's basically like one hot encoding, I think

short path
#

Yeah

short path
#

It's very odd and I don't want to have it happening again

serene scaffold
#

It can also do something other than what you intended, without raising an error

#

In which case you're fucked. Unless you know what you're doing.

short path
#

Yeah

short path
serene scaffold
short path
#

Oh OK

#

my bad

crimson summit
#

I just want to verify my intuition of why activation functions are necessary. For this example let's consider a network that classifies numbers 0-9. A network WITHOUT an activation function will be able to do well on numbers that are similar to the sizes of the numbers in the training set, but it will struggle if numbers appear darker or lighter, because it is linear and cannot take both size and lightness/darkness into account. A neural network completing the same task but WITH activation functions will be able to take into account both orientation and lightness/darkness, because the weights will learn all possible relationships between the pixel values in the data set, and the sigmoid (or other activation function) will then take numbers that are slightly lighter or darker and transform/smooth them so that the network can output the same probability as if they had the normal darkness/lightness. Does this intuition sound about right or is it incorrect in some way?

burnt coral
#

sorry, didn't mean to send the previous message. i'm using pytorch, and i'm not really sure on the format of the data on which my pretrained model was... trained. there's a lot of stuff in the code about making vocabs that are, to my knowledge, not actually torch vocab objects (it seems all custom?). i'm having trouble parsing it but it doesn't seem to be changing anything outside of something about ID correspondence and making a .pkl file. i want to fine tune it for a binary classification task. should i be worried about this data formatting? as far as i know the data is similar enough, at least in its raw form. would it be okay for me to just put it into a dataset and start training?

#

the creators of the pretrained model talk about implementing classification as a downstream task, if that changes the process at all

mild dirge
#

Because in the end each output is just a linear combination of the inputs

#

In the output an activation could be necessary because, e.g., you want the outputs to sum up to 1, because it needs to be a probability distribution. That is why you would use softmax, for example.
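The "just a linear combination" point can be checked numerically. A small sketch with made-up random weights (all names and shapes here are arbitrary): two stacked layers with no activation in between give the same output as a single affine layer with merged weights.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked layers with no activation in between...
two_layers = W2 @ (W1 @ x + b1) + b2

# ...equal one affine map with merged weights and bias.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

So without a nonlinearity, adding depth does not add expressive power: the whole stack collapses to one affine map of the input.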

eternal pawn
gritty vessel
#

i am trying to predict next two days value

#

can anyone review my code please

final kiln
#

I'm updating my resume, and this is the stuff I've been doing, and it reminded me of yesterday's conversation

looking at this, it feels like coding the model and the direct contact with data is a small part of what needs to be done to get these things going, either that or I'm doing something wrong, but I don't really see how I'm gonna do this without all the infra, and when the infra is done all I'm doing is sort of waiting around (I use that time to do other stuff like update my resume, etc.)

#

like, once the infra is done, I trigger some runs and wait for it to do its thing

#

then I take the results, review what I did right or wrong, go back to the data and repeat the process

#

once that is done I'll be doing deployments

#

what I'm saying is that rn to me it feels like the MLOps stuff is 90% of the work that needs to happen to train a model

wooden sail
#

that sounds about right in practical jobs and is in line with what we were discussing yesterday with zestar and the others

#

we use ml extensively where i work too, but we do none of this other than things like in your first bullet point :p

final kiln
#

like do you just get a super expensive instance and train stuff there

wooden sail
#

we do math on paper, run our bad code on the university compute cluster (e.g. lsf or slurm), and publish the results in papers

final kiln
#

is it like a super computer

wooden sail
#

yeah

final kiln
#

I think we have a couple in my country, there was a cluster in my faculty but it was small stuff

wooden sail
#

this one is not huge either, but the nodes add up to a couple dozen A100s or so

#

and the nature of the work is more about reformulating problems and showing why some approaches should be better. actually running code is more to show evidence

#

or at least that's how i see it :p maybe my boss hates me haha

final kiln
#

like, if you read their paper they were very concerned with performance and doing all these crazy optimizations

#

which is my next thing to do

#

I'm essentially emulating their challenge but on a smaller scale

#

my constraint is a 16 GB GPU on the cloud, which is not a lot to train even small transformers

wooden sail
#

yeah that's rough

final kiln
#

I'm writing it to be scalable, eventually if I decide to I can just add more gpu at will

#

I think it's a good play to start in an artificially constrained setup because that encourages me not to be wasteful of the available resources

wooden sail
#

yeah. we do a fair amount of that too, since faster and less mem is always a selling point

muted iron
#

ASDSC

gritty vessel
#

hey I trained multiple regression models on an Ethereum historical dataset and all of them are giving good results

#

is it time to get rich?

left tartan
river cape
#

What's the importance of the p-value in multiple linear regression?

mild dirge
desert oar
agile owl
#

what's the least performance-impacting way to do bounds checking on the size of a cuda array so that I can implement some kind of mitigation on my program's concurrent memory usage

final kiln
#

Like prevent out of bounds access of a memory address ?

agile owl
#

prevent it from trying to allocate too much CUDA ram using something like thread blocking or smth

#

my model fits in memory but I'm using async on parts of my program

#

so it can be loaded twice I think

#

in which case being in memory twice doesn't fit

#

so I need to wait for the first model to be deallocated

#

in principle though let's say you have many GPUs I only want it to block if the next action would overallocate to CUDA

final kiln
#

If you know how much memory each model occupies on the GPU (accounting for the stuff that happens when calculations are being made), you can use redis to store how many models have been loaded to GPU

#

Redis can act as a lock for multi processing and multi threading stuff since it's single threaded and does one thing at a time
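A sketch of the same "count the loaded models and block when full" idea, with one swap stated plainly: instead of Redis it uses an in-process asyncio.Semaphore, which only helps when all the work lives in a single async process. The limit, the function names and the sleep standing in for model work are all hypothetical.

```python
import asyncio

MAX_MODELS_ON_GPU = 1  # assumption: only one copy of the model fits in memory


async def run_with_model(slot, name, log):
    # Wait here until a slot is free instead of allocating and failing.
    async with slot:
        log.append(f"{name} loaded")
        await asyncio.sleep(0.01)  # stand-in for loading + inference
        log.append(f"{name} released")


async def main():
    slot = asyncio.Semaphore(MAX_MODELS_ON_GPU)
    log = []
    await asyncio.gather(*(run_with_model(slot, f"job{i}", log) for i in range(3)))
    return log


log = asyncio.run(main())
print(log)
```

With a limit of 1, every "loaded" entry is followed by its "released" entry before the next job starts. A shared counter in Redis would play the same role across several processes or machines.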

agile owl
#

good idea thx

#

speaking of which, I wish there were something like Spark for the GPU

#

that's where I really need the memory mgmt

final kiln
#

I think pytorch has builtin functionality for handling multiple GPUs

#

Something something nn.DataParallel, idk

agile owl
#

i don't mean just that part, I mean the memory management part

#

like it's not smart enough to say "no more right now"

#

the way Spark is

final kiln
#

Yeah that would be useful for sure

#

But many times it's hard to predict because it's not just the memory of the model, it's the allocations that happen in between when the graph is being executed

#

Got an interview in 10min or so

#

Just wanna get it over with, these things give me anxiety ._.

jagged latch
final kiln
#

It's gonna be a physics problem or something of the sort

final kiln
#

Ngl, failing this interview would be a bit of hit on my pride 🥲

#

But I'm rusty on physics, haven't touched it in 2 years

jagged latch
final kiln
final kiln
#

Physics simulations tend to be costly and some can't even be put on a GPU without a ton of simplifying assumptions

#

My MSc thesis was physics simulation, but of a different kind to the ones they seem to be doing, which relates more to structural engineering

#

My stuff was simulation of elementary particles like photons and electrons

#

The first time they started doing them was actually at Los Alamos during the Manhattan project

#

They had to simulate neutrons and such

#

Aight gotta go

jagged latch
jagged latch
#

I have a question. Does anyone experienced in Plotly Dash know how to update 'active_cell' with the new dataframe so that the contents of the cell for data_table will be read with the new data rather than reading the old data from the data_table cell.

lofty thorn
#

trimmed mean: first sort the numbers, then remove the 1st and last number, and then find the mean of that... did i understand it right?

jagged latch
# jagged latch I have a question. Does anyone experienced in Plotly Dash know how to update 'ac...

To give a better idea of the problem, imagine you have a table with 1 column and the rows have the values A, B, C, D. You then press a button that performs an operation that changes the values from A, B, C, D to 1, 2, 3, 4. Now you go to click the cell with the value 2, but the contents have it read as B, even though you can clearly see the 2 on that table. What would be the best way to solve this?

wooden sail
trim saddle
lofty thorn
#

do i take the largest no. or smallest

#

in place of 'p'

wooden sail
#

.latex sigma notation means the following:
\[
\sum_{i=n}^N x_i = x_n + x_{n+1} + x_{n+2} + \cdots + x_{N-1} + x_{N}
\]

strange elbowBOT
wooden sail
#

so if you add p to n, and subtract p from N, it means "ignore the p smallest and p largest entries, then take the mean"

#

p can be 1, but it can also be any other value
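As a small sketch of that definition (the function name and the sample numbers are made up): sort the values, drop the p smallest and p largest, and average what remains.

```python
def trimmed_mean(values, p):
    """Mean after dropping the p smallest and p largest values."""
    if p < 0 or 2 * p >= len(values):
        raise ValueError("p must satisfy 0 <= p < len(values) / 2")
    ordered = sorted(values)
    kept = ordered[p:len(ordered) - p]
    return sum(kept) / len(kept)


print(trimmed_mean([100, 2, 4, 3, 1], p=1))  # 3.0
```

With p=1 the outlier 100 and the smallest value 1 are ignored, so the result is the mean of 2, 3 and 4. With p=0 it is the ordinary mean.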

jagged latch
final kiln
#

Eh, I do believe it went almost to perfection

#

I wasn't able to answer a question about the boundary conditions of the Navier-Stokes equations on an airplane wing, but he said it was cool. I aced all the first questions

#

And there was a lot of time left at the end

#

We just kinda talked about the role

#

Tho I'm a bit embarrassed I didn't know that one

#

I also got wrong a question regarding genetic algorithms, I think I correctly explained everything but I said something wrong by relating them to reinforcement learning

dusty forge
#

Anyone here running a study group? Looking for one where people actually want and enjoy learning and working together on (mini) projects. Preferably based in the European timezone.

crimson summit
jagged latch
jagged latch
final kiln
odd meteor
desert oar
jagged latch
final kiln
# jagged latch Do you have a specific reason?

It's just a bunch of stereotypes. High levels of Intelligence also frequently come with high levels of empathy and emotional understanding. Sheldon is a myth as far as I know, and you can see that if you read up on real world geniuses

#

Physics and science students in general are also kinda just normal college students, they don't look or sound like the people on that show

desert oar
#

it's a noxious combination of bad stereotypes, pop-culture pandering, and generally not being funny or interesting

proud river
#

ai is cool

gritty vessel
#

Guys anyone in research ?

#

I wrote my first paper today
So just looking for a review before submitting it

#

Did I write something wrong? 😶‍🌫️

crimson summit
#

@wooden sail In a fully connected neural network that detects numbers 0-9, if two 7's are inputted into the network and both 7's are the exact same size and position, except one 7 is slightly darker and one is slightly lighter, then for the pre-activation values at neuron 1 (for example) the sigmoid function will essentially make the activation values the same, something like 0.993 and 0.992, which will allow both 7's to be treated the same throughout the rest of the network and be classified correctly. Does this intuition sound right?

wooden sail
#

the most important being that you wouldn't really know what value the network will output nor why

#

and there's no reason why the output values would be close to each other

#

0.50001 for one of them and 0.999999 for the other is still a correct classification

crimson summit
# wooden sail hmm there's several issues with that reasoning

at the end of the network there would be probabilities for each number, so if one is 0.50001 in the first layer it could cause the network to output a higher probability for another number, whereas if the network has already learned the relationship for the pixel values for a 7 in that position, the sigmoid will essentially just transform the lighter 7 to take the same path as the normal 7 throughout the network

wooden sail
#

idk what you mean by "path through the network", any input will have exactly the same operations done on it

#

classifiers do often have an output corresponding to a probability being assigned to each class. all you need is for one class to have a probability higher than the others for that to count as the class the network predicts

#

it doesn't matter if it's the largest by 0.9 or by 1e-15

#

also what you understand by "similarity" is not at all what the network will learn to treat as "similarity". what networks do usually has no real world interpretation that makes intuitive sense to people

#

it's just not the case that you can interpret what a neural network is doing in general. you'd be better served thinking of it as a function that maps an input to a categorical distribution, without wondering about the "how" for now

crimson summit
# wooden sail idk what you mean by "path through the network", any input will have exactly the...

pre activation value for lighter 7 =4 pre activation value for darker 7 =5. Without the sigmoid the "lighter seven" would come up with a much different probability at the end of the network compared to the darker 7 but with the sigmoid if the pixel values are slightly different it will make them both 0.991 and 0.992 so then when all operations get done throughout rest of network they come out with the same probability

wooden sail
#

there's no reason why the probability of the two 7s will be anywhere near each other

#

in fact the scenario you're describing is often enough to make a network guess the number incorrectly

crimson summit
wooden sail
#

it can very well happen

crimson summit
#

so then isnt that what sigmoid does in this case

wooden sail
#

nope

#

all the sigmoid does is enforce that the outputs are between 0 and 1, and the softmax (the multidimensional form of the sigmoid) makes it so that the outputs are between 0 and 1 and add up to 1
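A short sketch of those two functions in plain Python, to make the ranges concrete. The inputs 4 and 5 are the pre-activation values used as an example earlier in the conversation, and the softmax input is made up.

```python
import math


def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    # Subtracting the max does not change the result but avoids overflow.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


print(round(sigmoid(4), 3), round(sigmoid(5), 3))  # 0.982 0.993
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Each softmax output lies between 0 and 1 and together they sum to 1, which is what lets them be read as a probability distribution over classes.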

crimson summit
hoary halo
#

Can anyone help me with a problem im having with chromadb in python? -
im using unstructured to chunk and embed files to my local instance of chromadb. i then query the chromadb and send the k chunks to an LLM for natural language processing, and get a result. This flow works well, but im at the point where i need to store metadata and filter by the metadata when querying.

I am inserting vendor invoices in pdf format into chroma, and then i need to query them later. This is obviously difficult with multiple invoices as chroma does not 'know' which one or ones i want to query. Therefore, i want to extract some data from the invoice into metadata (invoice #, payee, vendor name) so when i query i can use this to filter results. (example: give me the total of all invoices from VENDOR to PAYEE, or what is the total of invoice INVOICE_NUMBER)

does anyone here have any experience with this? am i barking up the wrong tree and there is an easier way to do this? at a certain point the docs for both chroma and unstructured kind of just drop off and stop being useful

#

i know its a long shot 😆

crimson summit
#

So a fully connected network is trained to detect numbers 0-9 (numbers are black and the background of the image is white). It is trained on numbers that have slightly different shapes, sizes and brightness. Through backprop the function (aka the neural net) learns to generalize across all numbers 0-9. During testing we have two 7's of the same size and shape, but one 7 has a slightly different brightness that has not been seen in training. When both of these 7's are inputted into the network they get classified correctly. The reason they were both classified correctly is not just the sigmoid but a combo of the weights and the sigmoid: through backprop the weights learned relationships between pixel values of all different types of 7's, so one that appeared in training will obviously do well, and for the one whose brightness did not show up in training, the sigmoid helps this unseen brightness be seen the same as one seen in training by mapping positive values to nearly the same output, e.g. 4 and 5 get mapped to 0.991 and 0.993. So in conclusion it's a combination of the sigmoid and all the learned weights from training on many different examples that allows this generalization to occur. @wooden sail does this seem to follow a better train of thought?

wooden sail
#

there's no reason two different instances of the same class would be classified with the same probability, neither through the affine transformations nor through the sigmoid

crimson summit
wooden sail
#

no

#

if the sigmoid is not there you won't even have probabilities in the first place

crimson summit
#

ohh right cause its a linear transformation

#

w out sigmoid

wooden sail
#

affine

formal sky
#

watching a tutorial and i don't think it was explained, why is X always capital?

#

Or better, is it always capitalized?

wooden sail
#

need some more context, but capital bold letters usually represent matrices or tensors, while capital letters without boldface denote random variables

#

you'd have to show an example though, because notation only makes sense in context and varies by book/course/video. symbols in math don't have fixed meanings

crimson summit
# wooden sail affine

so w out sigmoid it's just a linear transformation, but with sigmoid it helps give similar probability to 7's of diff brightness. But I can't compare and say sigmoid helps give a better probability than without sigmoid, because without sigmoid there is no probability at all, just a linear transformation

verbal bay
wooden sail
wooden sail
crimson summit
formal sky
#

Alright ty

crimson summit
#

sorry if being redundant just trying to get this through my head

wooden sail
#

what you hope is that, in most cases, you predict the class correctly. all that needs to happen for that is that the correct category gets the largest probability. nothing is said about the value of that probability

#

an output of [1,0,0] is just as valid as [0.35, 0.33, 0.32]

#

and you can't even guarantee that it'll always work. you'll have many cases where you get the wrong output too
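The two outputs mentioned above can be compared directly. A tiny sketch (the helper name is made up) that picks the predicted class as the index of the largest probability:

```python
def predicted_class(probs):
    """Index of the largest probability, i.e. the class the network predicts."""
    return max(range(len(probs)), key=lambda i: probs[i])


print(predicted_class([1.0, 0.0, 0.0]))     # 0
print(predicted_class([0.35, 0.33, 0.32]))  # 0
```

Both vectors give the same prediction even though the winning probabilities are very different, which is the point about the margin not mattering for correctness.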

wooden sail
#

not from me atm, sorry

verbal bay
#

:/

crimson summit
# wooden sail yeah but why would they be 4 and 5 to begin with

lets say the darker 7 pixel values =0.7 0.3 0.4 0.5 and the lighter 7 pixel values =0.6 0.3 0.2 0.4 when these values are multiplied by the same configuration of weights that lead to neuron 1 in layer one and then passed through the sigmoid they will have similar values

wooden sail
#

not necessarily, and especially not if you have several layers

#

what you consider distance or similarity is not the same as what the network considers distance or similarity

formal sky
#

I am getting this error, the person who is doing the tutorial fixed it by running:

train, valid, test = np.split(df.sample(frac=1), [int(0.6*len(df)), int(0.8*len(df))])

but i cannot fix it by running that, what am i doing wrong? 🙂
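For context, the np.split call quoted above shuffles the rows and cuts them at 60% and 80%, giving a 60/20/20 train/valid/test split. A rough equivalent with explicit slicing, on a made-up ten-row frame (the column name and random_state are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({"x": range(10)})

# Shuffle all rows, then cut at 60% and 80% of the length.
shuffled = df.sample(frac=1, random_state=0)
n = len(shuffled)
first_cut, second_cut = int(0.6 * n), int(0.8 * n)

train = shuffled.iloc[:first_cut]
valid = shuffled.iloc[first_cut:second_cut]
test = shuffled.iloc[second_cut:]

print(len(train), len(valid), len(test))
```

Every row lands in exactly one of the three pieces, so the lengths add up to len(df).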

#

oh wait, think i found the issue

#

yep found it, inside the scale_dataset i was passing the wrong args

crimson summit
#

so similar activation value doesn't mean similar probability, but a probability that will still be the biggest in relation to all other classes so classification can be correct

#

@wooden sail

wooden sail
#

yeah

crimson summit
# wooden sail yeah

o shitttttttt thanks so much man ! Thanks for working with noobs like me !

final kiln
#

Anyone know how to try Gemini 1.0 if you're in Europe ? I have the subscription thing but the model says it's not capable of generating images, which I thought was one of its capabilities

#

It's also clearly gpt 3.5 level

#

The Android app is not available either

agile cobalt
#

sometimes models might just hallucinate that they are or are not able to do something, and asking again in a new chat (and phrasing it differently, e.g. be more explicit like include "generate an image" in the start of the phrase) could work correctly, but there is also a huge chance it is just not available in Europe at all

final kiln
#

See what I mean by gpt3.5 level

#

Ah no wait, I got misled by the "you're right" part

#

The text does kinda make sense still

final kiln
agile cobalt
#

one way or the other, that is closer to tech support than data science ; may as well move to offtopic

final kiln
#

Uhm not sure if it is off topic, since I'm trying to gather how google is doing their AI model rollout

past meteor
#

I gave one of our interns the job to play around with these for a couple of months

#

So we know the capabilities of stuff like goose.ai, quantized models etc.

final kiln
#

Open ai has also released a new model

#

Looks insanely good

#

The dogs in the snow are my favorite

frigid terrace
final kiln
trim saddle
final kiln
#

I really wanna work at open ai they're doing so much cool stuff 😭

#

It's like those impossible drawings, the whole doesn't make sense but the details do

versed pilot
odd meteor
# frigid terrace It's like Gemini is insufficient in coding

Just saw that they launched Gemini 1.5 today.

https://x.com/sundarpichai/status/1758145921131630989?s=46&t=sRKd79BJEKLsp89AJToMJw

I feel they probably added mixture of experts (M.O.E) and called it 1.5

In December, we launched Gemini 1.0 Pro. Today, we're introducing Gemini 1.5 Pro! 🚀

This next-gen model uses a Mixture-of-Experts (MoE) approach for more efficient training & higher-quality responses. Gemini 1.5 Pro, our mid-sized model, will soon come standard with a…

versed pilot
#

a year ago they were saying this, not sure where they are now: "Next-generation OpenAI model. We’re excited to announce the new Bing is running on a new, next-generation OpenAI large language model that is more powerful than ChatGPT and customized specifically for search. It takes key learnings and advancements from ChatGPT and GPT-3.5 – and it is even faster, more accurate and more capable." https://blogs.microsoft.com/blog/2023/02/07/reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/

The Official Microsoft Blog

To empower people to unlock the joy of discovery, feel the wonder of creation and better harness the world’s knowledge, today we’re improving how the world benefits from the web by reinventing the tools billions of people use every day, the search engine and the browser. Today, we’re launching an all new, AI-powered Bing search...

final kiln
versed pilot
#

ok makes sense, the blog entry I linked to is old

#

I had some fun with bing gpt recently. It refused to write a poem about an old cunning fox

#

because apparently there are iranian poems that describe England as an old cunning fox

#

I had to ask in a roundabout way to convince it I am not a revolutionary guard 🙂

final kiln
#

I just asked it, it did it

#

In the shadowed woods, under moon's soft gaze,
Lived an old fox, traversing life's complex maze.
With fur as red as the dying day's sun,
He moved in silence, his cunning second to none.

Through the thicket, under canopy's embrace,
He danced with shadows, a silent, fleeting grace.
Eyes gleaming bright in the dark of the night,
He was a specter, a ghost, just out of sight.

(...)

#

I truncated it cuz it goes for a long time

versed pilot
#

The thing with GPT is that it doesn't do the same thing in a repeatable way. It second guessed me in an entirely wrong way here

final kiln
versed pilot
#

it's always an older dodgier version of gpt 🙂

odd meteor
#

AI Twitter is buzzing today. What a terrific Thursday.

Sora
Gemini 1.5
V-JEPA

All in a day. And the day's not even over yet

hollow escarp
#

Does anybody know a good way to perfectly time showing captions with an ElevenLabs-generated voice using Python?

hollow escarp
#

If yes please ping

formal sky
#

Anyone knows this tutorial? https://www.youtube.com/watch?v=i_LwzRVP7bg
And would it be good for learning some basics?

Learn Machine Learning in a way that is accessible to absolute beginners. You will learn the basics of Machine Learning and how to use TensorFlow to implement many different concepts.

✏️ Kylie Ying developed this course. Check out her channel: https://www.youtube.com/c/YCubed

⭐️ Code and Resources ⭐️
🔗 Supervised learning (classification/MAGIC...

tidal scroll
#

hi guys, I just want to ask a question. I have read an interesting journal article about the transformer model and found out that the transformer has its own inverted version. Does anyone understand it? I need help understanding how it works

orchid forge
#

I need help regarding data analysis project making

dusty forge
rose crest
#

hi guyz i am working on a college project, can anyone suggest / give me a learning link to train custom object classification using Tensorflow?

formal sky
odd meteor
odd meteor
formal sky
#

But i can try to give it a look again and see if i can understand text based

odd meteor
spark nimbus
#

Using pyspark pandas, is there a way to do operations on dates lazily? (in my case, adding pd.offsets.MonthEnd)

dry geyser
#

anyone with elastic py experience, and elastic in general, what is the best way to handle null fields gracefully?

#

pipeline? mappings/schema?

#

also, how can I make polars convert a column into a list of its own former values?

formal sky
dusty forge
# formal sky It's the dataset in question, but thanks for the insight. By chance you know oth...

I've decided on two main things. The first is the ML course on Coursera by Andrew Ng (the specialization is paid but the individual courses can be done for free); this course is theory-heavy with math etc. To balance out the theory and frankly not get bored, I also chose the hands-on book from O'Reilly in which you work on projects in every chapter. Anything that I need to read into in terms of math and stats, I'll use books. Those single tutorials on YouTube are nice, but after doing several of them I came to the conclusion that I learned what to type, but not why and how it actually works, which feels like a risk of learning bad habits from the start.

thorn flame
#

I'm also currently transitioning to the ML field and it's already blowing my mind lol

#
  • you get a chance to earn a Udacity nanodegree :)
formal sky
thorn flame
#

Ah, okays

final kiln
#

Y can't the GPU do unsigned ints

#

Why do I need to use int64 for indexing

#

I thought I had all this resolved

#

D :

lyric forge
#

Can someone guide me with this exercise?

desert oar
#

i'm not familiar with this particular YT channel and it might be really good. but i also strongly encourage spending some time with hands-on practice projects and ideally also practice problems from a good textbook

desert oar
#

without thinking too hard about code, how would you do it? if you had to just explain it in words.

formal sky
wooden sail
#

i'll share this here, it might be interesting to some of you. when doing constrained optimization, one interior point method is to make the problem unconstrained by adding "barrier functions", among which log barriers are common. they explode to infinity when you come close to a specific value, so it's a good way of enforcing an inequality. going past the point in a single step, however, gets you either a complex number or a nan, depending on how the log is implemented. in the case of pytorch you get a nan, but the gradient seems to be hardcoded as 1/x even for values of x outside the domain of log(x) over the reals https://github.com/pytorch/pytorch/issues/76516
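
A rough sketch of the log-barrier idea in plain Python, not the PyTorch setup from the issue. The toy problem (minimize x² subject to x > 1), the function names, and the step sizes are all assumptions made for illustration. Note that `math.log` raises `ValueError` outside its domain rather than returning a nan, so the loop shrinks the step until the candidate stays feasible.

```python
import math

def barrier_objective(x, mu):
    """x**2 plus a log barrier that blows up as x approaches 1 from above."""
    return x * x - mu * math.log(x - 1.0)

def barrier_gradient(x, mu):
    return 2.0 * x - mu / (x - 1.0)

def minimize_with_barrier(x0, mu, lr=0.1, steps=500):
    x = x0
    for _ in range(steps):
        g = barrier_gradient(x, mu)
        step = lr
        # Backtrack: halve the step until the candidate stays strictly
        # feasible (x > 1) and does not increase the barrier objective.
        while True:
            candidate = x - step * g
            if candidate > 1.0 and barrier_objective(candidate, mu) <= barrier_objective(x, mu):
                break
            step *= 0.5
            if step < 1e-16:
                return x  # no acceptable step left, treat as converged
        x = candidate
    return x

x_star = minimize_with_barrier(x0=3.0, mu=0.01)
print(x_star)  # slightly above the constraint boundary at 1
```

Without the feasibility check, a single large step would land at x ≤ 1, which is exactly the "going past the point" case described above.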

GitHub

🐛 Describe the bug When the input is real, the log function and its derivative are undefined over the negative domain. In torch, the log function is undefined, but its derivative evaluates to 1/x o...

final kiln
#

I might've been using models that are too large

#

but yet to see what's gonna happen on that final slope

#

don't matter where it is now, if it converges to a flatter slope blue will catch up

ornate ledge
#

Hey guys newbie here, last year I started to get involved in data analysis using python, something pretty basic like using pandas on a df in a notebook. Instead of getting into more data analysis I got more interested in python itself. To be honest I was even embarrassed by how I was "writing" code before, not even using functions or error handling. Well the thing is I read a couple of books, Python Crash Course and Python by John Zelle, which was a bit more formal. And now my code is a bit more modular and follows PEP 8. The last project I made was scraping about 20k data entries from a website and also creating a recommender system for anime based on similarity in a database.

Now I was diving into a book more related to web scraping and I got frustrated because it started to use classes and objects. Any recommendations for dealing with these more "advanced" concepts? Probably pretty basic, but I'm 28 and migrating from a health career.

serene scaffold
#

anyway, the data science side of Python applies OOP differently than "normal" python. we don't often create our own classes. and when we do, it's really just extending the interface of a library like pytorch.

trail summit
#

hi guys

#

is there anyone who would be kind enough to help me a little

trail summit
#

im new to all this

#

thanks in advance

#

please ping me if you can ;-;

serene scaffold
serene scaffold
trail summit
#

im using a python backend for my react native project and basically it uses the phone's acceler-

serene scaffold
#

if there's an error message, post the whole error message as text.

trail summit
#

its more python than react :D

serene scaffold
#

okay. well give as much information about your data science question as is needed in one message. don't spread it out over a bunch of messages.

trail summit
#

tysm brb

left tartan
serene scaffold
trail summit
#

App.js:

import React, { useEffect, useState, useRef } from 'react';
import { StyleSheet, Text, View } from 'react-native';
import { Accelerometer } from 'expo-sensors';
import axios from 'axios';

export default function App() {
  const [data, setData] = useState({});
  const subscription = useRef(null);

  const _subscribe = () => {
    Accelerometer.setUpdateInterval(1000);
    subscription.current = Accelerometer.addListener(accelerometerData => {
      setData(accelerometerData);

      // Log the accelerometer data
      console.log("Accelerometer data: ", accelerometerData);

      // Send accelerometer data to the server for prediction
      sendDataToServer(accelerometerData);
    });
  };

  const _unsubscribe = () => {
    subscription.current && subscription.current.remove();
    subscription.current = null;
  };

  const sendDataToServer = (data) => {
    console.log("Sending data to server: ", data);
  
    axios.post('http://10.0.0.21:5000/predict', data)
      .then(response => {
        console.log("Response received from server: ", response.data);
        const predictedMovement = response.data.prediction;
  
        // Log the predicted action based on your model's predictions
        logPredictedAction(predictedMovement);
  
        // TODO: You can add logic to perform actions based on the predicted movement here
        // For example, send an emergency notification or update the UI
      })
      .catch((error) => {
        console.error('Error:', error);
        console.log('Error details:', error.response);
      });
  };
serene scaffold
#

!code

arctic wedgeBOT
#
Formatting code on Discord

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

For long code samples, you can use our pastebin.

trail summit
serene scaffold
#

three backticks, not one
but it still isn't clear how this is a python, data science question

trail summit
#

ic

#

nonono this is frontend backend still coming

#

one sec

serene scaffold
#

okay, well like I've said a few times, you need to ask your whole question in one message.

trail summit
#

character limit

#

im so sorry

#

continuation of above code:


  const logPredictedAction = (predictedMovement) => {
    // Customize this logic based on your model's predictions
    console.log("Predicted action: ", predictedMovement);
  };
  

  useEffect(() => {
    _subscribe();
    return () => _unsubscribe();
  }, []);

  let { x, y, z } = data;
  return (
    <View style={styles.container}>
      <Text>Accelerometer:</Text>
      <Text>x: {round(x)} y: {round(y)} z: {round(z)}</Text>
    </View>
  );
}

function round(n) {
  if (!n) {
    return 0;
  }
  return Math.floor(n * 100) / 100;
}

const styles = StyleSheet.create({
  container: {
    flex: 1,
    justifyContent: 'center',
    paddingHorizontal: 10,
  },
});```
#

now python:

serene scaffold
#

at least get to the python/data science part. so far, it isn't clear why anyone should want to read this code.

trail summit
#

server.py:

from flask import Flask, request, jsonify
import joblib
import numpy as np
import pandas as pd
from scipy.signal import butter, lfilter

app = Flask(__name__)

# Load the trained model
model = joblib.load('model.pkl')

# Add your preprocessing and feature extraction functions here
def butter_lowpass(cutoff, fs, order=5):
    nyq = 0.5 * fs
    normal_cutoff = cutoff / nyq
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    return b, a

def butter_lowpass_filter(data, cutoff, fs, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    y = lfilter(b, a, data)
    return y

def window_data(data, window_size):
    windows = []
    for i in range(0, len(data) - window_size + 1, window_size // 2):
        windows.append(data[i:i+window_size])
    return windows

#

def extract_features(windows):
    features = []
    for window in windows:
        feature = [np.mean(window), np.std(window)]  # Replace with your actual feature extraction process
        features.append(feature)
    return features

@app.route('/predict', methods=['POST'])
def predict():
    # Get the accelerometer data from the request
    data = request.get_json()

    # Apply the low-pass filter to the data
    filtered_data = butter_lowpass_filter(data, cutoff=0.3, fs=50, order=6)

    # Divide the data into windows
    windows = window_data(filtered_data, window_size=128)

    # Extract features from each window
    features = extract_features(windows)

    # Use the model to make a prediction for each window
    # (convert each one-element result array to a plain int so it is hashable)
    predictions = [int(model.predict([feature])[0]) for feature in features]

    # Return the most common prediction
    prediction = max(set(predictions), key=predictions.count)

    return jsonify({'prediction': prediction})

if __name__ == '__main__':
    app.run(host='0.0.0.0', debug=True)```
#

model.py:


import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import joblib  # Import joblib for model saving

# Load the training data
X_train = pd.read_csv('UCI HUMAN MOVEMENT DATASET/UCI HAR Dataset/UCI HAR Dataset/train/X_train.txt', delim_whitespace=True, header=None)
y_train = pd.read_csv('UCI HUMAN MOVEMENT DATASET/UCI HAR Dataset/UCI HAR Dataset/train/y_train.txt', delim_whitespace=True, header=None)

# Create a new random forest classifier
rf = RandomForestClassifier()

# Train the model on the training data
rf.fit(X_train, y_train.values.ravel())

# Save the trained model to a file
joblib.dump(rf, 'model.pkl')  # Add this line to save the model

# Load the test data
X_test = pd.read_csv('UCI HUMAN MOVEMENT DATASET/UCI HAR Dataset/UCI HAR Dataset/test/X_test.txt', delim_whitespace=True, header=None)
y_test = pd.read_csv('UCI HUMAN MOVEMENT DATASET/UCI HAR Dataset/UCI HAR Dataset/test/y_test.txt', delim_whitespace=True, header=None)

# Make predictions on the test data
y_pred = rf.predict(X_test)

# Print a classification report
print(classification_report(y_test, y_pred))
serene scaffold
#

remember to put a py after the three backticks.

trail summit
#

oh i forgot mbmb

#

uh so what im trying to do is use a dataset called:
"Human Activity Recognition Using Smartphones" from the UCI machine learning repository

#

and essentially use the phone's accelerometer and collect the values and use those to print in console the predicted movement/action

#

and here was the description of the dataset:

#

**The experiments have been carried out with a group of 30 volunteers within an age bracket of 19-48 years. Each person performed six activities (WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) wearing a smartphone (Samsung Galaxy S II) on the waist. Using its embedded accelerometer and gyroscope, we captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50Hz. The experiments have been video-recorded to label the data manually. The obtained dataset has been randomly partitioned into two sets, where 70% of the volunteers was selected for generating the training data and 30% the test data.

The sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 sec and 50% overlap (128 readings/window). The sensor acceleration signal, which has gravitational and body motion components, was separated using a Butterworth low-pass filter into body acceleration and gravity. The gravitational force is assumed to have only low frequency components, therefore a filter with 0.3 Hz cutoff frequency was used. From each window, a vector of features was obtained by calculating variables from the time and frequency domain.**
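
The windowing described above (128 readings per window, 50% overlap) can be sketched in plain Python as follows. The function name and the fake readings are assumptions for illustration only; this is not the dataset authors' code.

```python
def sliding_windows(samples, window_size=128, overlap=0.5):
    """Split a sequence into fixed-width windows with fractional overlap."""
    step = int(window_size * (1 - overlap))
    if step <= 0:
        raise ValueError("overlap must be below 1")
    return [
        samples[start:start + window_size]
        for start in range(0, len(samples) - window_size + 1, step)
    ]

# 50 Hz * 2.56 s = 128 readings per window; 10 s of data is 500 readings.
readings = list(range(500))
windows = sliding_windows(readings)
print(len(windows), windows[1][0])

# Fewer than 128 readings yields no window at all.
print(sliding_windows(list(range(100))))
```

Since the app code above posts one reading per request, a server would presumably have to buffer readings until a full window is available before anything can be extracted from it.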

#

thank you once again 🥹

serene scaffold
trail summit
#

LOG Error details: undefined
LOG Accelerometer data: {"x": 0.1241912841796875, "y": -0.073974609375, "z": -0.999786376953125}
LOG Sending data to server: {"x": 0.1241912841796875, "y": -0.073974609375, "z": -0.999786376953125}

#

only the values outputted to console

serene scaffold
#

@trail summit so the model is supposed to tell you when the user transitions between activity classes, right?

trail summit
#

yes pretty much

#

:)

serene scaffold
#

yes pretty much, or yes?

trail summit
#

yes

#

sorry im an idiot

serene scaffold
#

no you're not

#

but also, where did you get the idea to use a decision tree to do this?

trail summit
#

youtube and friends ;-;

serene scaffold
#

can you show what the first few lines of X_train.txt and y_train.txt look like?

trail summit
#

yes

#

X_train.txt

#

in notepad :/

serene scaffold
#

No screenshots.

trail summit
#

oh

#

oops

#

brb sry

#

how do i format this?

#
2.8858451e-001 -2.0294171e-002 -1.3290514e-001 -9.9527860e-001 -9.8311061e-001 -9.1352645e-001 -9.9511208e-001 -9.8318457e-001 -9.2352702e-001 -9.3472378e-001 -5.6737807e-001 -7.4441253e-001  8.5294738e-001  6.8584458e-001  8.1426278e-001 -9.6552279e-001 -9.9994465e-001 -9.9986303e-001 -9.9461218e-001 -9.9423081e-001 -9.8761392e-001 -9.4321999e-001 -4.0774707e-001 -6.7933751e-001 -6.0212187e-001  9.2929351e-001 -8.5301114e-001  3.5990976e-001 -5.8526382e-002  2.5689154e-001 -2.2484763e-001  2.6410572e-001 -9.5245630e-002  2.7885143e-001 -4.6508457e-001  4.9193596e-001 -1.9088356e-001  3.7631389e-001  4.3512919e-001  6.6079033e-001  9.6339614e-001 -1.4083968e-001  1.1537494e-001 -9.8524969e-001 -9.8170843e-001 -8.7762497e-001 -9.8500137e-001 -9.8441622e-001 -8.9467735e-001  8.9205451e-001 -1.6126549e-001  1.2465977e-001  9.7743631e-001 -1.2321341e-001  5.6482734e-002 -3.7542596e-001  8.9946864e-001 -9.7090521e-001 -9.7551037e-001 -9.8432539e-001 -9.8884915e-001 -9.1774264e-001 -1.0000000e+000 -1.0000000e+000  1.1380614e-001 -5.9042500e-001  5.9114630e-001 -5.9177346e-001  5.9246928e-001 -7.4544878e-001  7.2086167e-001 -7.1237239e-001  7.1130003e-001 -9.9511159e-001  9.9567491e-001 -9.9566759e-001  9.9165268e-001  5.7022164e-001  4.3902735e-001  9.8691312e-001  7.7996345e-002  5.0008031e-003 -6.7830808e-002 -9.9351906e-001 -9.8835999e-001 -9.9357497e-001 -9.9448763e-001 -9.8620664e-001 -9.9281835e-001 -9.8518010e-001 -9.9199423e-001 -9.9311887e-001  9.8983471e-001  9.9195686e-001  9.9051920e-001 -9.9352201e-001 -9.9993487e-001 -9.9982045e-001 -9.9987846e-001 -9.9436404e-001 -9.8602487e-001 -9.8923361e-001 -8.1994925e-001 -7.9304645e-001 -8.8885295e-001  1.0000000e+000 -2.2074703e-001  6.3683075e-001  3.8764356e-001  2.4140146e-001 -5.2252848e-002
serene scaffold
#

```
line 1
line 2
line 3
```

trail summit
#

oh

#

k

serene scaffold
#

also is this one line?

trail summit
#

no

serene scaffold
#

I need to know what the structure of the data is

trail summit
serene scaffold
#

what are the rows and columns. like what do they represent

trail summit
#

ah

#

Each row represents a single observation or sample. In the context of this dataset, a sample is a 2.56-second window of time where multiple measurements were taken from the smartphone’s accelerometer and gyroscope.

Each column represents a different feature that has been calculated from the raw accelerometer and gyroscope data. These features are various statistical measures (like mean, standard deviation, etc.) and frequency domain variables that were calculated for each window of data.

is what the university of genova(the dataset makers) said

serene scaffold
#

okay. but you don't know what measure each column is?

#

also what about the y data? what does that look like?

trail summit
#

like this

#
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
5
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
4
trail summit
#

and a bunch of other stuff

serene scaffold
#

and each number represents a state?
so does the model only need to identify which line is which state, in isolation? or does it need to be able to tell you when someone is switching between states?

trail summit
#

"The model is trained to predict the activity (or state) based on the features of each observation. In its basic form, the model treats each observation in isolation and doesn’t consider the sequence of activities. So, it doesn’t inherently know when someone is switching between states."

#

basically

serene scaffold
#

Okay, so it doesn't identify state transitions.
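
If transitions are wanted anyway, one option is to derive them afterwards from the sequence of per-window predictions. A minimal sketch, assuming the predictions arrive in time order and that labels 5 and 4 mean STANDING and SITTING (following the activity order quoted earlier); the function name and the sample predictions are made up:

```python
def find_transitions(labels):
    """Return (index, previous_label, new_label) wherever the label changes."""
    return [
        (i, labels[i - 1], labels[i])
        for i in range(1, len(labels))
        if labels[i] != labels[i - 1]
    ]

# Hypothetical per-window predictions in time order.
predicted = [5, 5, 5, 4, 4, 4, 1, 1]
print(find_transitions(predicted))  # [(3, 5, 4), (6, 4, 1)]
```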

serene scaffold
trail summit
serene scaffold
#

what happens when you run model.py by itself, without the react part?

serene scaffold
#

can you show the printed output of print(classification_report(y_test, y_pred)) as text?

trail summit
#

k brb

#

um

#

@serene scaffold should i format it?

#
              precision    recall  f1-score   support

           1       0.90      0.97      0.93       496
           2       0.89      0.91      0.90       471
           3       0.96      0.85      0.90       420
           4       0.91      0.89      0.90       491
           5       0.90      0.92      0.91       532
           6       1.00      1.00      1.00       537

    accuracy                           0.93      2947
   macro avg       0.93      0.92      0.92      2947
weighted avg       0.93      0.93      0.93      2947
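
For reference, the f1-score column is the harmonic mean of precision and recall, and the macro average is the unweighted mean over classes. A small sketch that recomputes both from the rounded values in the report above (so the last digit could differ slightly from what scikit-learn computes on unrounded values):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, copied from the rounded report values.
report = {1: (0.90, 0.97), 2: (0.89, 0.91), 3: (0.96, 0.85),
          4: (0.91, 0.89), 5: (0.90, 0.92), 6: (1.00, 1.00)}

f1 = {label: f1_score(p, r) for label, (p, r) in report.items()}
macro_f1 = sum(f1.values()) / len(f1)
print({label: round(v, 2) for label, v in f1.items()}, round(macro_f1, 2))
```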
#

k its like this

serene scaffold
#

I have good news and bad news for you

trail summit
#

o no

serene scaffold
#

the good news is that this is great model performance

#

the bad news is that the python code works correctly, which means that the problem is only with the javascript code, so you'll have to ask somewhere else.

trail summit
#

noooooooooooooooo

#

i never liked js ;-;

#

but

#

thank you so so much!

#

I really appreciate it

serene scaffold
#

are you a member of the js server?

trail summit
#

no

serene scaffold
#

lms if I can find the invite

trail summit
#

thx :D

serene scaffold
trail summit
#

yay tysmm

serene scaffold
#

@trail summit just remember what I said about how to ask questions effectively. it will increase your chances of getting help quickly in the future.

serene scaffold
#

I thought maybe someone was bullshitting you

serene scaffold
#

it's okay

trail summit
#

I wouldn't even know at this stage

serene scaffold
agile owl
#
Exception has occurred: AttributeError
'ArrowExtensionArray' object has no attribute 'to_pydatetime'

does someone have a fix for this

#

I just updated my OS and now my code doesn't run anymore

#

It's on writing to DB with pandas

serene scaffold
#

did you accidentally switch python environments?

agile owl
#

I had to switch python environments

#

something changed where my old venv didn't work after the upgrade

#

so I had to make a new one

serene scaffold
#

okay, so you might not have the same pyarrow version that you had before

agile owl
#

upgrading was probably a bad idea but I wanted to try to upgrade my GPU driver

serene scaffold
#

do python -m pip freeze in both and compare
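
Comparing two freeze outputs by eye gets tedious, so here is a small sketch that diffs them. It assumes both outputs were saved as text and only handles plain `name==version` lines; the function names and the version numbers are made up for illustration.

```python
def parse_freeze(text):
    """Map package name -> pinned version from `pip freeze` style lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" in line and not line.startswith("#"):
            name, version = line.split("==", 1)
            pins[name.lower()] = version
    return pins

def diff_freeze(old_text, new_text):
    """Return {package: (old_version, new_version)} for every mismatch."""
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }

# Made-up version numbers, only to show the shape of the output.
old_env = "pandas==2.1.4\npyarrow==15.0.0\npolars==0.20.10\n"
new_env = "pandas==2.2.0\npyarrow==15.0.0\nSQLAlchemy==2.0.25\n"
print(diff_freeze(old_env, new_env))
```

A value of `None` on either side means the package is missing from that environment.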

agile owl
#

I'm in (venv) (base) (venv) hell

#

but I think they were on the same version (15)

#

pandas version was off by a minor version

#

still doesn't work, so something is strange

final kiln
#

Dev containers ftw

agile owl
#

ok I just installed the old env packages lets see if it works now

#

nope

#

lol

agile owl
#
            # ensure conversion to pandas uses the pyarrow extension array option
            # so that we can make use of the sql/db export *without* copying data
            res: int | None = self.to_pandas(
                use_pyarrow_extension_array=True,
            ).to_sql(
                name=unpacked_table_name,
                schema=db_schema,
                con=engine_sa,
                if_exists=if_table_exists,
                index=False,
            )
            return -1 if res is None else res
        else:
            msg = f"engine {engine!r} is not supported"
            raise ValueError(msg)
#

do you understand what this comment means

#

it's in the polars source code

#

I just set it to false screw it

#

works now

#

i don't care if the write is expensive

#

I just need the read to be fast

final kiln
#

I don't know anything about pyarrow

left tartan
agile owl
#

why would it copy data otherwise

left tartan
#

But pandas also supports numpy data types, which is completely different than pyarrow data types and is the default

trail summit
#

Hi @serene scaffold

#

(sorry for the ping)

serene scaffold
#

whAT

trail summit
#

so I came to the conclusion

#

that I can't do react native ;-;

#

it's better to stick with python for both frontend and backend

#

uh if you remember my goal from earlier today

#

can you please give me advice on how to go about it?

serene scaffold
#

Don't ask if someone will do something based on information that you haven't yet provided. Give all the information, and invite anyone to help.

trail summit
#

?

serene scaffold
#

!

trail summit
#

?!

#

lol

serene scaffold
#

trail summit
#

copy paste or is there a way to do that

#

lol

serene scaffold
#

The interrobang, also known as the interabang ‽ (often represented by any of ?!, !?, ?!?, ?!!, !?? or !?!), is an unconventional punctuation mark intended to combine the functions of the question mark (also known as the interrogative point) and the exclamation mark (also known in the jargon of printers and programmers as a "bang"). The glyph i...

trail summit
#

OO

#

wow

#

u can combine them.

#

crazy 😭

#

how doooo I dooooooo thissssss

#

its like im drowning in confusion xd

serene scaffold
serene scaffold
#

because I don't have access to your computer and idk what you've been doing since the last time we talked.

serene scaffold
#

I feel that

trail summit
#

and compromising

#

mostly

trail summit
#

well, in the end all I want to do is take a dataset like MotionSense dataset and train a model with it and somehow implement it in a way that the device using the app(phone) uses its accelerometer and gyroscope and etc. along with that model and like it displays in console the action happening

#

like

#

idk:
WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING

#

stel?

#

k im not going to ping

#

bro when u see this can you please ping me

#

thx

serene scaffold
#

I'm busy irl. I'll get to this if I can. but someone else might be able to help as well.

trail summit
#

thanks

#

gtg anyways cya :)

serene scaffold
#

but that's beyond the scope of this channel.

errant pivot
#

good morning , for my project on Large language model , i need to know any good demo ?

serene scaffold
dry geyser
#

is there anything "ready made" that allows efficient mapping of a set of column values into key-values? I would like to benchmark doing that outside of polars. right now I built a expr chain class that allows me to "cheat" around the problem by coercing sets of columns into structs, and then selecting these alone and converting the result to a dict

serene scaffold
dry geyser
#

lemme check

#

lemme rephrase the question

#

imagine we already have a dict of column names -> column values, I have also written my own expr builder so I can (if i want to load the polars parsing heavily...) coerce the desired columns into a struct for a new column. suppose I want FOOBAR to be a composite of X,Y,Z columns. right now i do this via the expr chain i build. how can I "displace" that specific step into something done after iter_rows?

#

can you provide me a sample input for the df?

#
import polars as pl
from pprint import pprint as pp
from typing import Dict

def expr_concat_columns_unique(new_column_name, columns):
    return pl.concat_list(
        [
            pl.when(
                pl.col(column).is_not_null()
            ).then(
                pl.col(column)
            ).otherwise(
                pl.lit(None)
            ).alias(column)
        
            for column in columns
        ]
        ).list.drop_nulls().list.unique().alias(new_column_name)

def expr_structured_column_with_mappings(name: str, mapping: Dict) -> pl.Expr:
    return pl.concat_list(
        pl.struct(
        {
        new_key: pl.col(original_column) for new_key, original_column in mapping.items()
        }
    ).struct.rename_fields(
        list(mapping.values())
    
    )).list.drop_nulls().list.unique().alias(name)
    
struct_mapping = {
    "foobar": {
        'foo' : "from_foo",
        'bar' : "from_bar"
    },
}

df = pl.DataFrame({
    'foo': ['xyz', 'blah' ],
    'bar': ['zyx', 'bleh' ],
})


exprs = []
exprs.append(expr_structured_column_with_mappings("foobars", struct_mapping['foobar']))

new_df = df.with_columns(exprs)

print(new_df)

#

this is how im doing it now

#

but it puts a significant strain on polars' engine, which handles it fine, but it does have quite some ram backpressure

agile owl
#

"ram backpressure?"

#

sounds dangerous, be careful

dry geyser
#

@agile owl dude it's too early to start trolling

#

relax

#

;P

#

ram backpressure = the scan_csv op no longer seems larger-than-ram friendly

#

so in other words, consumption shoots up. at least in virt addr space if you are pedantic.

#

didnt measure actual effective occupied memory...

#

(good luck using something like valgrind while processing a 10mil row csv file)

agile owl
#

I wanted to try to help you but I don't understand what your problem is sorry. what do you mean by done after iter_rows? btw polars is meant to be used with the Lazy API most of the time that's where it gets its optimization benefits from but it looks like you're just using a normal eager dataframe

dry geyser
#

so, im trying to validate and coerce as much data as possible into the actual ingestion schema for elastic

#

i made an expr "compiler" that takes my configuration (tl;dr "create these key-value mappings from the CSV rows, apply some transforms, inject some static values") and I apply it to the lazyframe returned from scan_csv

#

taking a CSV row with a set of columns i need a final dict/document for elastic as { 'somekey': { fields: ...mapped values from CSV row/dict }, 'anotherkey': ... }

#

as an experiment i built all that using exprs, but polars is then forced to allocate new data for every row in the dataset
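
One way to benchmark the "do it outside polars" variant is to build the nested document in plain Python from each row dict (for example the dicts yielded by `iter_rows(named=True)`). This is only a sketch: the function name, the mapping layout (output field name to source column name), and the `source` static value are assumptions, and whether it beats the expression chain would need measuring.

```python
def row_to_document(row, struct_mapping, static_values=None):
    """Build a nested dict from a flat row, skipping null (None) fields."""
    document = dict(static_values or {})
    for new_key, field_map in struct_mapping.items():
        fields = {
            field: row[column]
            for field, column in field_map.items()
            if row.get(column) is not None
        }
        if fields:  # drop the whole sub-document when every field is null
            document[new_key] = fields
    return document

# Hypothetical mapping: output field name -> source column name.
struct_mapping = {"foobar": {"from_foo": "foo", "from_bar": "bar"}}
rows = [{"foo": "xyz", "bar": "zyx"}, {"foo": "blah", "bar": None}]
docs = [row_to_document(r, struct_mapping, {"source": "csv"}) for r in rows]
print(docs)
```

Dropping `None` values here also sidesteps the null-field handling on the Elastic side, at the cost of documents with varying shapes.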

final kiln
#

Overfitting is nonexistent

#

I've noticed that single head single layer transformer works better

#

Due to performance mainly

#

I might have to look into my self attention implementation

#

And use a non learnable positional encoding to reduce the number of gradients per mini batch index
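
A common non-learnable choice is the fixed sinusoidal encoding from the original transformer paper; whether that is what is meant here is an assumption. A dependency-free sketch (the function name is made up):

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Fixed positional encodings: sin on even dims, cos on odd dims."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each (even, odd) pair of dimensions shares one frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positions(seq_len=4, d_model=8)
print(pe[0])  # position 0 alternates sin(0) = 0.0 and cos(0) = 1.0
```

Because the table has no parameters, it contributes nothing to the gradient computation and can be precomputed once.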

#

The size of the embeddings seems to matter quite a lot, more than the number of heads or number of blocks

#

This at small scales with little compute, I'm sure the story is different if you can do gradient accumulation across several GPUS

#

Of 80Gb mem each

#

Changing regions is time consuming, and the current one only provides this 16gb machine

#

I can do federated training to get to a 32gb GPU, but at that scale might as well just tank the extra iteration on my training loop instead of having to collect results through a network

dry geyser
#

anyone has recommendations for handling parquet io (writing) from multiple Process(es)?

#

im using polars already so perhaps I can just output to parquet as is

versed pilot
agile owl
#

is there a handy replacement for ffill() in polars

past meteor
agile owl
#

"DataFrame" has no attribute "forward_fill"

#

hmm

#
    test_df = (
        pl.read_csv("data/master.csv", try_parse_dates=True)
        .with_columns(pl.col("date").cast(pl.Date).alias("date"))
        .drop("")
        .sort("date", "ticker")   
        .forward_fill()    #   <---- typechecker/intellisense doesn't pick up method
    )
past meteor
#

it's an expression

#

Well, you do it on an expression

#

So you forward fill on a date for instance

agile owl
#

so I can't just do it across all columns

past meteor
#

I don't know by heart but I think selectors are expressions so you could try cs.all().forward_fill()

agile owl
#

gotcha

past meteor
#

Or maybe they return expressions

#

lmk if it worked

agile owl
#
    test_df = (
        pl.read_csv("data/master.csv", try_parse_dates=True)
        .with_columns(pl.col("date").cast(pl.Date).alias("date"))
        .drop("")
        .sort("date", "ticker")
    )
    test_df = test_df.select(cs.all().forward_fill())

so like this?

#

seems to be working
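
One thing to keep in mind with a frame sorted by date and then ticker: filling straight down a column carries the last non-null value from one ticker into the next. A plain-Python sketch of the difference (not polars; the function names and `open` values are made up):

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value before it."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def forward_fill_by_key(rows, key, column):
    """Forward fill `column` separately for each distinct value of `key`."""
    last_seen = {}
    filled_rows = []
    for row in rows:
        value = row[column]
        if value is None:
            value = last_seen.get(row[key])
        else:
            last_seen[row[key]] = value
        filled_rows.append({**row, column: value})
    return filled_rows

# Made-up rows, already sorted by date and then ticker.
rows = [
    {"date": "2021-03-01", "ticker": "ABBV", "open": 1.0},
    {"date": "2021-03-01", "ticker": "ACB", "open": 2.0},
    {"date": "2021-03-02", "ticker": "ABBV", "open": None},
    {"date": "2021-03-02", "ticker": "ACB", "open": None},
]
plain = forward_fill([r["open"] for r in rows])
grouped = [r["open"] for r in forward_fill_by_key(rows, "ticker", "open")]
print(plain, grouped)
```

In polars the per-ticker behaviour would presumably be expressed by adding `.over("ticker")` to the expression, though that has not been checked against this exact frame.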

#
┌─────────────────────┬────────┬────────┬─────────┬───┬─────────────────┬─────────────────┬──────────┬───────────┐
│ date                ┆ ticker ┆ open   ┆ high    ┆ … ┆ IRLTLT01JPM156N ┆ IRLTLT01GBM156N ┆ WTISPLC  ┆ DEXCAUS   │
│ ---                 ┆ ---    ┆ ---    ┆ ---     ┆   ┆ ---             ┆ ---             ┆ ---      ┆ ---       │
│ datetime[μs]        ┆ str    ┆ f64    ┆ f64     ┆   ┆ f64             ┆ f64             ┆ f64      ┆ f64       │
╞═════════════════════╪════════╪════════╪═════════╪═══╪═════════════════╪═════════════════╪══════════╪═══════════╡
│ 2021-03-01 00:00:00 ┆ ABBV   ┆ 108.53 ┆ 109.21  ┆ … ┆ 1.684268        ┆ 1.029709        ┆ 1.226966 ┆ -1.790026 │
│ 2021-03-01 00:00:00 ┆ ACB    ┆ 10.84  ┆ 11.41   ┆ … ┆ 1.684268        ┆ 1.029709        ┆ 1.226966 ┆ -1.790026 │
│ 2021-03-01 00:00:00 ┆ ALKS   ┆ 19.2   ┆ 19.605  ┆ … ┆ 1.684268        ┆ 1.029709        ┆ 1.226966 ┆ -1.790026 │
│ 2021-03-01 00:00:00 ┆ AMGN   ┆ 225.88 ┆ 227.929 ┆ … ┆ 1.684268        ┆ 1.029709        ┆ 1.226966 ┆ -1.790026 │
│ 2021-03-01 00:00:00 ┆ AMPH   ┆ 17.79  ┆ 18.03   ┆ … ┆ 1.684268        ┆ 1.029709        ┆ 1.226966 ┆ -1.790026 │
│ …                   ┆ …      ┆ …      ┆ …       ┆ … ┆ …               ┆ …               ┆ …        ┆ …         │
└─────────────────────┴────────┴────────┴─────────┴───┴─────────────────┴─────────────────┴──────────┴───────────┘
shape: (10_320, 165)

Here's something super annoying: it uses pandas.to_sql and the conversion to pandas converts my date column to datetime microseconds so I can't get the right date I want without manually editing the db column

#

I might as well just write the insert myself I guess

agile owl
#

no

#

what's the library to use it?

agile cobalt
dry geyser
#

@versed pilot there seems to be a parquet sink

pulsar elk
#

Hello, can someone tell me which course I should pursue to get into AI/ML or any data job, in India or abroad?

#

because I think getting into the industry in this field is tough

dusty forge
#

I started with Andrew Ng's course and even after only the first regression videos, I feel smarter already 🤣

pulsar elk
#

any suggestions please?

versed pilot
dry geyser
#

yes

#

ill let you know how that works out

#

now trying to solve an issue

#

i cant seem to be able to filter null columns

trail summit
#

welp thx

vocal cove
#

Greetings guys
Hope all are well. Anyone here who is familiar with iTensor library in Julia?
Kindly let me know.

final kiln
#

I need help

#

Left side are actual train loss

crisp raptor
final kiln
#

Right side is the smoothened thing

crisp raptor
final kiln
#

So like, if I let these go for like 3 days they'll converge, almost certainly without over fitting, been keeping an eye on eval loss, eval acc, eval f1 etc

#

Those graphs are different runs, leftmost graphs are larger batches

#

They run for an epoch each

#

So, smaller batches converge faster but their loss graph is extremely chaotic

#

I don't care for their loss graph if at the end I still get a model I can put into production

#

My question is then, do I care for the chaotic graph ?

#

The loss graph itself looks fine as far as I can tell, that's how a loss graph of a transformer training on text looks like
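
For context on the "smoothened" curves: dashboards typically draw something like an exponential moving average over the raw per-step loss, so a noisy raw curve and a calm smoothed curve can describe the same run. A minimal sketch with made-up loss values (the function name and the weight are assumptions, not what any particular tool uses):

```python
def ema_smooth(values, weight=0.5):
    """Exponential moving average; a higher weight gives a smoother curve."""
    if not values:
        return []
    smoothed, running = [], values[0]
    for value in values:
        running = weight * running + (1 - weight) * value
        smoothed.append(running)
    return smoothed

# Made-up noisy losses from a small-batch run.
raw = [4.0, 2.0, 3.5, 1.5, 3.0, 1.0]
smooth = ema_smooth(raw)
print(smooth)
```

The step-to-step jumps in `smooth` are much smaller than in `raw`, while the downward trend is preserved, which is the part that matters when comparing runs.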

sand grove
#

do you need to understand data analytics to do data science

#

who's a data scientist that can guide me

shy grove
#

I need some help with a classification algorithm for sentiment of sentences. If anyone is able to help, please check my post in #1035199133436354600 for more information about it

rose plume
#

Good day everyone

I have a problem with learning. I actually spend a lot of time on the learning process, but I don't get results, and in some cases I even get panic attacks.

Another problem is that I can't build a routine for myself. So it occurred to me to ask you: how do you study?

At which hour do you start and what time do you finish? Can you please give me some advice on how to learn, or how to manage my time?

shut girder
# rose plume Good day everyone I have a problem in term of learning. Actually I spend so muc...

I'm not very experienced in this field of study, however when it comes to learning, I try to break down what I need to learn. By breaking something down, it becomes more feasible to learn each part.

For example: I am currently learning to improve my ability to explore data by not only attempting to grasp the essence of EDA, but also the techniques used within EDA. Data cleaning can be used in conjunction with the EDA process. And because duplicates must be cleansed depending on the context, I must first learn how to identify these duplicates. So I ask myself, "what graphs can I use to identify these duplicates?" I then research that, and once I find a graph that seems efficient to me, I learn how to actually create that graph using a tool. I personally use Python for this, so I would then read the matplotlib documentation on said graph.

In terms of motivation to learning, I make time out of the day; About 1-2 hours of studying and practice before going to bed. I gain this motivation by reading articles on data science, watching YouTube videos on data science, or even talking about data science. These methods really get me motivated to learn.

I also recommend referring to a resource fitting to you as long as it is trusted. This Discord server is also great when it comes to help and resources

left tartan
#

It’s long but very thorough and no single part too difficult

shut girder
#

Thanks, I will take a look at it

rose plume
shut girder
left tartan
tacit basin
magic dune
#

!toopic

#

!topic

#

.topic

strange elbowBOT
#
**No topics found for this channel.**

Suggest more topics here!

errant pivot
slim turret
#

hello, anybody knows any good datasets libraries?

gritty vessel
#

hey everyone, any idea where i am going wrong? i am getting very high accuracy here

#

which should not be the case here

#

i split the data 80-20% and i believe i am not leaking data to the model beforehand either

#

it's time series forecasting if anyone wants to know
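
One thing worth ruling out with time series is a shuffled split, where test points sit between training points in time. A minimal sketch of a chronological 80-20 split, assuming the series is already sorted by time (the function name and the toy series are made up):

```python
def chronological_split(series, train_fraction=0.8):
    """Split an ordered series so every test point comes after every train point."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]

# Toy series where the value doubles as the timestamp.
series = list(range(100))
train, test = chronological_split(series)
print(len(train), len(test), train[-1], test[0])
```

Any scaling statistics or lag features would likewise need to be computed from `train` only; otherwise information from the test period still reaches the model.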