#data-science-and-ml


pine vapor
#

Question about method chaining with pandas, when I want to refer to a column created as part of the method within the chain, that will not exist as it all refers to the initial dataframe. How do I best handle this? Do I have to start another chain?
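One common way to handle this without starting a new chain, sketched under the assumption that the new column is created with `DataFrame.assign`: pass a callable (for example a lambda) instead of a precomputed Series. pandas calls it on the intermediate DataFrame at that point in the chain, so columns created earlier in the same chain are visible. The column names `price`, `qty`, `total` and `share` are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the column names are assumptions for this sketch.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

result = (
    df
    # The lambda receives the DataFrame as it exists at this step of the chain.
    .assign(total=lambda d: d["price"] * d["qty"])
    # 'total' did not exist on the original df, but it is visible here.
    .assign(share=lambda d: d["total"] / d["total"].sum())
    # .loc also accepts a callable, so filtering on the new column works too.
    .loc[lambda d: d["total"] > 10]
)

print(result)
```

The original `df` is left untouched, since `assign` returns a new DataFrame at each step.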

solemn hinge
uneven gust
#

hey can someone help me please?

#

i'm workin on a project in R

#

does anyone know R?

candid sable
#

hi guys - I'm getting

```
        y_pred.shape.assert_is_compatible_with(y_true.shape)

ValueError: Shapes (None, 2) and (None, 1) are incompatible
```
when trying to use more than only 1 metric in my model.compile.. why would that be?

when I only use metrics=['acc'], it works..
grave frost
#

I think that you have particularly strong opinions on some topics that may not necessarily align with reality or with the needs of someone else. For example, Kaggle's mini-courses do teach important aspects for beginners. while it may not be needed for you, in practice most beginners like to start with something small.

https://towardsdatascience.com/kaggles-micro-courses-my-favorite-introduction-to-data-science-f0cc6aeb024c Here, the author notes that Kaggle's mini-courses start with:

  1. Data visualization
  2. Pandas
  3. Basic DL which covers Transfer Learning and Data augmentation
  4. Intro and Advanced SQL
  5. Geospatial analysis with GeoPandas
  6. Basic NLP
  7. Intro to RL that covers an agent using simple minimax

This differs somewhat from your statement-

In fact, it doesn't teach you ML at all
It gives you a tutorial thing which literally just calls the Decision Tree method without really explaining what it is

Which seems pretty wrong seeing the above evidence.

TBH I really admire your knowledge and apologize if I sound rude or always contradictory (because people in PyDis enjoy arguments a lot) but simply that everyone has an opinion - and if different people provide their perspectives to someone, the person on the other end receives a much better answer of their question overall.

hollow sentinel
#

Kaggle is not good enough for beginners

#

I'm sorry dude

#

it's designed to show the highlights of ML while they spoonfeed you code

#

but it'll never give you a strong basis

#

so I agree with Raggy

hollow sentinel
#

bc if you didn't I don't think you have any standing to talk about it

grave frost
# hollow sentinel did you even do the course?

I did do the course when I was new to ML 🙂 if you want to see the amount of stuff they have, this is "Intro to DL" where I think the amount of math and code usually levels out with the amount of code required for a beginner https://www.kaggle.com/learn/intro-to-deep-learning

You can inspect more courses here https://www.kaggle.com/learn/overview and decide for yourself. as for me, they helped me when I was a beginner so I stand by my opinion

tidal bronze
#

Which metric could I use to evaluate different clusters given that the feature they are based on will differ

hollow sentinel
#

doesn't mean you actually know what you're doing

grave frost
#

@hollow sentinel also, you can execute an exercise where it takes you to a different notebook which tries to explain the concepts learnt with a sample dataset 🤷

hollow sentinel
#

especially when they spoonfeed you all the code

#

lmao

#

wow an exercise with a sample dataset

#

with all the code given to you

#

so innovative

#

fill in the blanks

#

lmao

grave frost
#

@hollow sentinel I think your approach is very new to CS, because most people already know the basic coding required to start the courses from scratch

hollow sentinel
grave frost
#

intro to DL is not the first course BTW

hollow sentinel
#

I know

#

but it's not enough

grave frost
#

its like the 6th or 7th one. before that they teach stuff like visualization and even more basic stuff

hollow sentinel
#

micro courses are not enough to build any significant skills

grave frost
#

ofc

hollow sentinel
#

it's more like dipping your toes in

#

yes

grave frost
#

hence the name "Micro-course"

hollow sentinel
#

so you just conceded why you're wrong

#

congrats

grave frost
#

yeah, but its more than good to give an overview to a beginner

#

(especially when they are not in college)

hollow sentinel
#

an overview when they literally know nothing but the basics of python

#

ok

grave frost
hollow sentinel
#

it's just a way to cater excitement

#

generate hype

#

it does nothing to teach you

grave frost
#

well, then I can't argue with you since you are just fueled on opinion rather than arguments 🙂 have a good day

hollow sentinel
#

nice way to concede

tidal bronze
kindred radish
#

So I've been trying to use K-means and Spectral Clustering to try and detect this cluster over here. I've read that these are good with even clustering sizes, so does this mean it couldn't pick up on the cluster in the circled area?

ripe forge
#

Uh. Visually that circle doesn't look like a separate cluster to me

#

Unless you only mean the little group of points off to its own side

kindred radish
#

Really?

#

Yeah i do sorry i made the circle too big hahaha

ripe forge
#

Look at where you've drawn the circle

#

Ah so you do mean the small group then?

kindred radish
#

I mean this, i just didn't want the circle bit to cover the gap

ripe forge
#

Got it. That is definitely better

kindred radish
#

So i also know that there is a cluster here

ripe forge
#

So, K means needs a k upfront, what output did you get with K set to 4?

kindred radish
#

hol' up lemme go check

#

Im using sklearn btw, so do you mean like the number of clusters i told it?

ripe forge
#

As for clustering algorithms in general, the idea is they're usually doing their own thing. Usually you only really want to use them for exploration that leads up to something down the line

#

And yes, the number of clusters

kindred radish
#

Ill show you what i get for 4 clusters:

kindred radish
#

It looks like the spectral one is handling that cluster a little better, but it's slightly off

forest plover
#

Where can I get started on machine learning and ai in general?

kindred radish
candid sable
#

Anyone can help me figure out why I'm getting incompatible shapes ValueError when using multiple metrics in my model.compile? If I only use 'acc', it works..

ripe forge
#

I'm curious how dbscan would perform here.

forest plover
#

Thank you

kindred radish
#

oooooh i read about dbscan! Sklearn says it's good for uneven clusters right? But it said its use-case was for "non-flat geometry"?

misty flint
#

found it from a podcast. very nicely done

#

apparently made with D3

kindred radish
ripe forge
#

So, dbscan is intended for spatial clustering, so yes on that.

#

However that shouldn't stop you from just seeing how it performs, since you can treat each feature as an axis in one dimension

#

For eps, it's simply a param to play around with, I'd say let it do its thing. Higher eps makes fewer clusters iirc

ripe forge
#

Its a measure of distances that are within a tolerance

#

For any two points

kindred radish
#

Well i've been playing around with it and I can't really get better than this:

#

Doesn't seem to even be able to tell the two big clusters apart a lot of the time

ripe forge
#

Ah. Hmm. Guess that's not the move for this dataset then

kindred radish
#

RIP

#

I guess spectral is giving me the best of other options ive tried. I went for MeanShift as well

grave frost
kindred radish
#

The purple overlaps into the red, where the true cluster is just the tiny LHS island of the purple (for the spectral result)

#

Should look like that

grave frost
#

your clusters are not distinct enough to be identified by k-means.

#

A simple google yields me this paper that deals with clusters with high overlap http://ceur-ws.org/Vol-1455/paper-06.pdf their recommendation is to use some EM algorithm using another CBOvalue score to aid it (and they claim it works better than spectral)

kindred radish
#

ah thank you for that! I'll go check it out ^^

tidal bronze
#

Which metric could I use to evaluate different clusters given that the feature they are based on will differ but all are performed using kmeans

uncut orbit
#

what are some good hyperparams for keras neural net?

pseudo wing
#

Why is my SVC f1-score on training data in the first code and 2nd code different?

lapis sequoia
#

Hey,

#

Not sure if this is the right place but im having some issues with matplotlib

#

this is my graph

#

it does not show the actual data

#

and the line should have a smooth increase

#

only like +100 per hour

#

but as you can see it bugs significantly

#

also, is there a way to make the graph fit to size?

#

Its a bit wide atm

#
```python
import tkinter as tk
import matplotlib.pyplot as plt
from pandas import DataFrame
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
import random
import datetime

data = {'Price': [1],
        'Years': []
        }
time = datetime.datetime.now()
previous_number = 1
for x in range(1000):
    num = random.randint(1, 100)
    previous_number += num
    data["Price"].append(previous_number)
    data["Years"].append(time.strftime("%d/%m/%Y %H:%M:%S"))
    time += datetime.timedelta(hours=1)
data["Years"].append(time.strftime("%d/%m/%Y %H:%M:%S"))


df = DataFrame(data, columns = ['Price','Years'])
for x in df.values:
    print(x)

root = tk.Tk()
figure = plt.Figure(figsize=(1000,4), dpi=100)
ax = figure.add_subplot(111)
chart_type = FigureCanvasTkAgg(figure, root)
chart_type.get_tk_widget().pack()
df = df[['Price','Years']].groupby('Years').sum()
df=df.astype(float)
df.plot(kind='line', legend=True, ax=ax)
ax.set_title('Example')
```
#

this is my code atm

misty flint
#

have you double-checked to see if your data is sorted
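A minimal sketch of why sorting could matter here, assuming the x-axis labels are `%d/%m/%Y %H:%M:%S` strings as in the code above. pandas `groupby` sorts the group keys by default, and strings sort character by character, not chronologically, so once the data crosses a month boundary the rows may end up in a different order than they were generated in. The fixed start date below is an assumption chosen to make the effect reproducible.

```python
import datetime

FMT = "%d/%m/%Y %H:%M:%S"

# Hypothetical fixed start so the run is reproducible; 1000 hourly steps
# carry the series from 20 January into early March.
start = datetime.datetime(2021, 1, 20)
stamps = [start + datetime.timedelta(hours=h) for h in range(1000)]
labels = [t.strftime(FMT) for t in stamps]  # already in chronological order

# Sorting the strings compares the day-of-month digits first.
lexicographic = sorted(labels)

# Sorting by the parsed datetime restores the chronological order.
chronological = sorted(labels, key=lambda s: datetime.datetime.strptime(s, FMT))

print(lexicographic[0])         # a February label now comes first
print(lexicographic == labels)  # False: the order was scrambled
print(chronological == labels)  # True
```

If this is the cause, keeping the column as real datetime values (instead of formatted strings) before grouping and plotting would likely avoid the jagged line.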

lapis sequoia
#
```python
data = {'Price': [1],
        'Years': []
        }
time = datetime.datetime.now()
previous_number = 1
for x in range(1000):
    num = random.randint(1, 100)
    previous_number += num
    data["Price"].append(previous_number)
    data["Years"].append(time.strftime("%d/%m/%Y %H:%M:%S"))
    time += datetime.timedelta(hours=1)
data["Years"].append(time.strftime("%d/%m/%Y %H:%M:%S"))
```
#

is how its generated

#

it should always be going up

#

but 100 max

tidal bronze
#

Which metric could I use to evaluate different clusters given that the feature they are based on will differ but all are performed using kmeans

misty flint
#

"should" but is it?

#

look at your price values once more

#

the actual data values

lapis sequoia
#

@misty flint

#

i got the aspect ratio to fit

#

but

#

1s

#

how can i download with python images from google image search?

#

and @lapis sequoia just ran

#
```python
before = 0
total = 0
for line in string.splitlines():
    num = int(line.split("[")[1].split(" ")[0])
    if num <= before:
        total += 1
        print(before)
        print(num)
    before = num
```
#

basic checker to see if there is any anomalies in the data set but it never printed anything

#

its the graph

#

its only when showing it in tkinter

tidal bronze
#

Which metric could I use to evaluate different clusters given that the feature they are based on will differ but all are performed using kmeans

grave frost
#

could you elaborate on your problem?

limber vector
#

Can anyone suggest some site from which I can get a stream of data / free API to be fed into pyspark

left arch
#

Hello everyone, sorry if this is the wrong channel but I figured this was a basic data science question. I am trying to create a pandas column that holds the percentage change for stock data between yesterday's close and today's open. I cannot seem to find the right wording to google for this very simple task. Attached is the Excel relation I am trying to translate into my Python script. Anything helps, thank you!
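A minimal sketch of one way to do this, assuming the frame has one row per trading day in date order and columns named `Open` and `Close` (both names, the sample prices and the `gap_pct` column are assumptions, since the attachment is not visible here). `Series.shift(1)` moves each close down one row, so yesterday's close lines up with today's open.

```python
import pandas as pd

# Hypothetical prices, one row per trading day, oldest first.
df = pd.DataFrame(
    {"Open": [101.0, 99.0, 103.0], "Close": [100.0, 102.0, 104.0]},
    index=pd.to_datetime(["2021-02-01", "2021-02-02", "2021-02-03"]),
)

# Yesterday's close aligned with today's row; the first row has no "yesterday".
prev_close = df["Close"].shift(1)

# Percentage change from yesterday's close to today's open.
df["gap_pct"] = (df["Open"] - prev_close) / prev_close * 100

print(df)
```

The first row comes out as NaN because there is no previous close to compare against. Search terms that tend to turn this up are "pandas shift" and "overnight gap".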

grave frost
#

I read an account where a person trained a word2vec model on their own dataset and then used those vectors to train the model. that seems strange - can we expect a reasonable boost in accuracy on an embedding model trained on little data and get it to capture the contextual vector for each word? it doesn't seem right to me, but maybe you guys can illuminate this issue

candid sable
#

I have a retrained InceptionV3 model on Keras and I'm getting 100% acc and val_acc from first to last epoch.. and ofc it's not accurate when I predict.

what could be wrong?

solid quest
#

Anyone know any good source of data for big datasets? I need at least 4 GB of data for a project where I need to apply clustering through Apache Spark

grave frost
misty flint
#
Prophet

Prophet is a forecasting procedure implemented in R and Python. It is fast and provides completely automated forecasts that can be tuned by hand by data scientists and analysts.

misty flint
#

take a sneak peek. the visuals are mind-blowing

#

🧠

#

made with d3

#

they had a full stack data scientist doing it

quasi sparrow
#

Do you guys prefer pandas over conventional NoSQL? I'm trying to figure out if I should keep learning SQL as a DevOps

#

I mean, I'm not a DevOps yet but I'm working towards it, lol

shell summit
#

So I developed a rudimentary chess AI, but it’s slow as hell. Any way I can speed it up?

grave frost
#

that's because brute-forcing takes time

tidal bough
#

if you're searching the state-space in Python, it's just going to be slow in general

#

pretty much just because it's the kind of thing Python is slow in - iteration and number-crunching,

lean ledge
#

python's fine at number crunching using numpy, just iteration and general business logic is slow

tidal bough
#

In general, though:

  1. Profile your program. (Every single optimization must start with that step)
  2. See what the most expensive functions are, and consider if you can't speed them up, or even rewrite them in something like Cython.
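The profiling step above can be sketched with the standard library's `cProfile` and `pstats`. The functions `slow_square_sum` and `main` are hypothetical stand-ins for the real program (for example, the move search of a chess AI).

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    """Hypothetical stand-in for an expensive function."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def main():
    """Hypothetical entry point that calls the expensive function repeatedly."""
    return sum(slow_square_sum(20_000) for _ in range(20))


# Collect timing data while the program runs.
profiler = cProfile.Profile()
profiler.enable()
result = main()
profiler.disable()

# Render the ten most expensive entries, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

For a whole script, `python -m cProfile -s cumulative your_script.py` gives a similar report without touching the code.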
tidal bough
#

but if there isn't a library implementing what you want, your only choice is to learn how to rewrite functions in numba/cython/whatever yourself.

grave frost
#

A guy did try to brute-force in chess and implemented some sophisticated techniques to reduce time. it still wasn't enough (he used C++)

#

so I doubt python contributes much to it

#

The problem is that chess just has too many combinations. not as many as Go thankfully, but it's still pretty significant

#

so the best way is to train an AI

lean ledge
tidal bough
#

I mean, it's true that search is just slow, but also it'd probably be like a hundred times faster in C++ than in Python 😅

lean ledge
#

more like 200

tidal bough
#

also, isn't this how classical (non-ML) chess engines work? They can be very advanced.

lean ledge
#

search engines are sorta different, but basically every game AI you encountered in any game up until the last couple years had no learning

grave frost
#

I agree with all your points, but just that with so many combinations there is no reasonable way to speed it up. minimum, it takes 20-30 minutes for each move

grave frost
lean ledge
#

...but we've had much faster chess AIs for like

#

decades

grave frost
#

I thought we were trying to get the best possible player 🥴

#

my bad

lean ledge
#

even state of the art ML doesn't give the best player possible, that's not reasonably searchable as a space

lean ledge
#

nope, it's still not the optimal player, just a really really good one

tidal bough
grave frost
#

and it beat the world champion twice, so its not much up to debate

lean ledge
#

also chess engines were beating grand champions decades ago

lean ledge
#

it's not hard to write a chess engine that is better than all humans

grave frost
#

not chess

lean ledge
#

we're talking about chess right?

#

how is go relevant

grave frost
#

but chess was beaten for the first time by IBM

#

using ML

deft ruin
#

Chess engines definitely don’t search the whole state space, but they have efficient methods for pruning nodes with bad moves and taking into account transposition
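A minimal sketch of the two ideas mentioned above, pruning and a transposition table, on a toy game instead of chess. The game (take 1 to 3 stones from a pile, whoever takes the last stone wins) and all names here are assumptions for illustration. The table in this sketch only caches exact values, which is a simplification of what real engines store.

```python
def negamax(stones, alpha, beta, table):
    """Score for the player to move: +1 means a forced win, -1 a forced loss."""
    if stones == 0:
        return -1  # the opponent took the last stone, so the mover has lost
    if stones in table:
        return table[stones]  # position already solved via another move order

    alpha_orig = alpha
    best = -2  # lower than any real score
    for take in (1, 2, 3):
        if take > stones:
            break
        # The opponent's best score, negated, is our score for this move.
        score = -negamax(stones - take, -beta, -alpha, table)
        best = max(best, score)
        alpha = max(alpha, best)
        if alpha >= beta:
            break  # prune: the opponent would never allow this line

    # Cache only when the window did not clip the result, so it is exact.
    if alpha_orig < best < beta:
        table[stones] = best
    return best


def mover_wins(stones):
    """True if the player to move can force a win from this pile size."""
    return negamax(stones, -2, 2, {}) == 1


print([n for n in range(1, 21) if not mover_wins(n)])
```

In this toy game the losing pile sizes for the player to move are the multiples of 4, which makes the result easy to check by hand.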

tidal bough
#

there's stuff like this which is completely ML-less and still human-level
https://en.wikipedia.org/wiki/Stockfish_(chess)

Stockfish can use up to 512 CPU threads in multiprocessor systems. The maximal size of its transposition table is 32 TB. Stockfish implements an advanced alpha–beta search and uses bitboards. Compared to other engines, it is characterized by its great search depth, due in part to more aggressive pruning, and late move reductions.[4] As of November 2020, Stockfish 12 (4-threaded) achieves an Elo rating of 3516 (+24/−20) on the CCRL 40/15 benchmark.[5]
though it does get murdered by ML chess players:
In December 2017, Stockfish 8 was used as a benchmark to test Google division Deepmind's AlphaZero, with each engine supported by different hardware. AlphaZero was trained through self-play for a total of nine hours, and reached Stockfish's level after just four.[48][49][50] In 100 games from the normal starting position, AlphaZero won 25 games as White, won 3 as Black, and drew the remaining 72, with 0 losses.[51] AlphaZero also played twelve 100-game matches against Stockfish starting from twelve popular openings for a final score of 290 wins, 886 draws and 24 losses, for a point score of 733:467.[52][note 1]

Stockfish is a free and open-source chess engine, available for various desktop and mobile platforms. It is developed by Marco Costalba, Joona Kiiski, Gary Linscott, Tord Romstad, Stéphane Nicolet, Stefan Geschwentner, and Joost VandeVondele, with many contributions from a community of open-source developers.Stockfish is consistently ranked firs...

grave frost
#

On this day 21 years ago, the world changed forever when a computer beat the then-chess champion of the world at his own game. On February 10, 1996, Deep Blue beat Garry Kasparov in the first game of a six-game match—the first time a computer had ever beat a human in a formal chess game.

#

yeah, the best chess player is always an AI hands down

lean ledge
#

mostly because the "AI" is just a normal chess engine being sped up with a good heuristic

#

so it's more of a normal engine++

tidal bough
#

questionable. Are there any model-free-learning-based chess AIs?

lean ledge
#

there might be but they won't beat stockfish :p

grave frost
lean ledge
#

model-free RL chess?

grave frost
#

try it tho, it might be 😉 I just get 3D models lol

lean ledge
#

got it lol

#
Deepmind

In 2016, we introduced AlphaGo, the first artificial intelligence (AI) program to defeat humans at the ancient game of Go. Two years later, its successor - AlphaZero - learned from scratch to master Go, chess and shogi. Now, in a paper in the journal Nature, we describe MuZero, a significant step forward in the pursuit of general-purpose algorit...

grave frost
#

its a model....?

#

MuZero just models aspects that are important to the agent’s decision-making process. After all, knowing an umbrella will keep you dry ....

tidal bough
#

wtf

lean ledge
#

I dont think you know what you're talking about enough to actually have a conversation about this. It's not given a model of the game which includes the rules of chess

tidal bough
#

why are they calling it model-free then

tidal bough
grave frost
#

and why do they use reward, value and policy then? sounds kinda like RL to me

lean ledge
#

MuZero learns a model that, when applied iteratively,
predicts the quantities most directly relevant to planning: the reward, the action-selection policy, and
the value function.

lean ledge
tidal bough
#

Specifically, MuZero models three elements of the environment that are critical to planning:

The value: how good is the current position?
The policy: which action is the best to take?
The reward: how good was the last action?

oh, that's just what all RL (or at least everything derived from q-learning) does. Not sure why are they calling it modelling, tbh.

grave frost
#

https://paperswithcode.com/method/muzero

MuZero is a model-based reinforcement learning algorithm.
LOLOL

MuZero is a model-based reinforcement learning algorithm. It builds upon AlphaZero's search and search-based policy iteration algorithms, but incorporates a learned model into the training procedure.

The main idea of the algorithm is to predict those aspects of the future that are directly relevant for planning. The model receives the observat...

lean ledge
lean ledge
#

This is exactly what I've been saying all the time

#

You haven't "owned" anyone, you just dont understand the conversation we're having

grave frost
#

model-free reinforcement learning
^ that's what you said
MuZero is a model-based reinforcement learning
^ paper

lean ledge
#

Sorry, I should have been more precise, it has a model of the graph structure, it doesn't have the dynamics of the game

grave frost
#

that kinda seems like RL, but you can help me understand the difference

The model receives the observation (e.g. an image of the Go board or the Atari screen) as an input and transforms it into a hidden state. The hidden state is then updated iteratively by a recurrent process that receives the previous hidden state and a hypothetical next action. At every one of these steps the model predicts the policy (e.g. the move to play), value function (e.g. the predicted winner), and immediate reward (e.g. the points scored by playing a move). The model is trained end-to-end, with the sole objective of accurately estimating these three important quantities, so as to match the improved estimates of policy and value generated by search as well as the observed reward.

tidal bough
grave frost
#

so...the only difference is there in this part

the hidden states are free to represent state in whatever way is relevant to predicting current and future values and policies. Intuitively, the agent can invent, internally, the rules or dynamics that lead to most accurate planning.
like you mean it does not model the game or its rules

lean ledge
#

It's not completely model-free, it just doesn't have the dynamics

lean ledge
tidal bough
#

I should probably try reading that paper, lol

lean ledge
grave frost
lean ledge
#

alphazero does have a dynamic model lol, you have to program it in

tidal bough
#

that's just how nearly all RL works. If I understand it right, the idea is that the transitions between current and possible future states depending on the action are learned by the model (instead of programmed in)... and also not stored at all, I think, just used on each value update?

lean ledge
#

RL can work entirely on the observables (image, etc) of program (model free), or it can have an idea of the dynamics and/or the decision making structure (model)

#

This one is partially model free in not having the dynamics but modelled in that it's not working on the direct observables but a parsed structure with decision information

grave frost
#

im not much familiar with RL, but wasn't the whole point to learn an environment without any prior hard coding (contrary to the dynamic model that we have to program in it)

lean ledge
#

Not of AlphaZero

#

MuZero learns dynamics, sort of, it learns the dynamics of the optimal control based values

iron basalt
grave frost
#

and AZ does not; it has to be hard coded. that seems like a workaround

tidal bough
lean ledge
#

But MuZero is still fundamentally based on a graph search, it's still doing Monte Carlo Tree Search

#

It's just learning some dynamics alongside the fundamental decision making MDP model

grave frost
#

but conventional consumer level stuff can usually solve simple environments without any hard coding, so I assumed that other techniques just scale up the complexity 🤷 sad

lean ledge
iron basalt
#

The chess "AI"'s are kind of fuzzy when it comes to model vs model-free. Of course, one does not need to model everything, only parts of some task could be modeled.

lean ledge
#

RL is very, very far behind what people want to think it is

grave frost
lean ledge
grave frost
#

Just a pretty naive way to solve the possible RL problem: while an agent is randomly searching its space, why can't we inject a pseudo-random combination that represents, to a great degree, the task we want it to do, and have the model just optimize it for maximum reward (a well defined and thought out reward function).

Wouldn't this allow it to learn complex task if we give it a boost in the start (like a nudge to the correct direction) so that it can easily make the connection on the best way to accomplish a complex task?

tidal bough
#

it seems to me like you just described what all RL agents that get taught on human records do

#

like, AlphaZero is Zero because it only got taught on its own games - it wasn't primed by learning on tons of human matches like AlphaGo. As a result, AlphaZero took a lot longer to learn, but ended up better at the end.

grave frost
#

umm, I may be misunderstanding you here, but what I meant is just to provide a skeleton of the possible action that the RL algo should take to help it accomplish pretty complex tasks and get a general idea of how it is supposed to solve a particular environment.

austere swift
#

does anybody here actually use cupy?

#

i've tried using it once and it had all sorts of errors

#

and it seems nice but i've never found it super helpful anyways

misty flint
#

@stiff barn @rapid fog yo yo

stiff barn
#

ayy

misty flint
#

google analytics ive heard is a real nice tool

#

for tracking

#

does google have AutoML or is that a dif cloud provider?

rapid fog
#

What's the easiest way to host postgres on the cloud for a discord bot?

#

Just on the VPS?

stiff barn
#

Google Cloud has hosted postgres

misty flint
rapid fog
#

Is it free?

stiff barn
#

Can also use digital ocean

misty flint
#

i see

#

oh digital ocean

stiff barn
#

Yeah, GCP has auto ml. Most of the clouds do

misty flint
#

that one seems interesting

#

interesting

stiff barn
#

You get plenty of free credits with both @rapid fog

#

Digital Ocean is my go to for smaller projects. Has easy to launch vps, managed databases, kubernetes, etc.

misty flint
rapid fog
#

I'll take a look. Thank you!

stiff barn
#

No problem

misty flint
#

i need to become more familiar with the cloud

#

maybe this summer

#

when working with AWS

stiff barn
#

AWS is still the most popular so a good place to start

misty flint
#

i always feel like in this field there is always an endless amount of things to learn

#

if you want to stay relevant

stiff barn
#

Haha yeah you can never learn it all.

misty flint
#

do you do any testing/unit-testing

#

people have said i should also learn that

#

this never-ending bucket list...

stiff barn
#

Yeah, I write test cases for every function/method I build.

misty flint
#

that sounds like good swe practice

stiff barn
#

It becomes natural very quickly. Writing tests in Python is pretty intuitive.

misty flint
#

do you use..what is it called

#

pytest

stiff barn
#

just the native unittest library

misty flint
#

ah

#

i see

stiff barn
#

Next time you go to test some python code manually just try to write a test instead and you might find that it actually makes your life easier.
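A small sketch of what that can look like with the built-in `unittest` library. The function `percent_change` is a hypothetical example, not something from this conversation. Each check you would normally do by hand in the REPL becomes one test method.

```python
import unittest


def percent_change(old, new):
    """Hypothetical function under test."""
    if old == 0:
        raise ValueError("old value must be non-zero")
    return (new - old) / old * 100


class PercentChangeTests(unittest.TestCase):
    def test_increase(self):
        self.assertAlmostEqual(percent_change(100, 110), 10.0)

    def test_decrease(self):
        self.assertAlmostEqual(percent_change(200, 150), -25.0)

    def test_zero_old_value_is_rejected(self):
        with self.assertRaises(ValueError):
            percent_change(0, 5)


# Run the tests programmatically so this snippet works when pasted anywhere.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PercentChangeTests)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print(outcome.wasSuccessful())
```

In a real project the test class would usually live in its own file and be run with `python -m unittest`.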

misty flint
#

hmm

#

i need to remember this

#

anyway

#

how goes your ML studies

stiff barn
#

Going well. Working on the finishing touches of a recommendation engine I've been building for a while.

#

Had to build too many pieces for it haha

lean ledge
#

Should go with Azure AutoML for totally unbiased reasons

misty flint
#

noice

#

glad its coming together for you

stiff barn
# lean ledge What type?

Somewhat nontraditional. I built a binary classification multi-modal model that takes the apartment images and processes those via a CNN, then structured data as a DNN, then concatenates them. I'm recommending apartments to just myself.

misty flint
#

hey, its a nice use case

#

one guy on a podcast i heard built a neural net just for tinder swipes

#

for himself

lean ledge
#

Ah so it's supervised content based recommendation

#

I was expecting collaborative filtering

stiff barn
stiff barn
#

How about yours @misty flint?

misty flint
#

wait let me see if i can find a link

wide sorrel
#

hello

#

how would you recommend begging to learn about ai?

misty flint
#

💀

wide sorrel
#

beginning

misty flint
#

if you come from a non-technical background, you can start with andrew ng's AI for Everybody course on coursera

#

its a good start

stiff barn
#

Seems to be the one people gravitate to. Must be good

misty flint
#

we barely finished a 2-3 week long one where we made a contract analysis app with some basic nlp

misty flint
#

he goes through what kind of business projects AI/ML is good at vs. those that arent good projects

#

and then walks through how to try to build up a AI/data culture at your company if youre trying to create buy-in/not everyone is onboard with change

#

lol

exotic maple
stiff barn
#

Pretty interesting and quick read

wide sorrel
#

im looking for a course thats more hands-on, being that im already somewhat fluent in python

stiff barn
misty flint
#

yeah but i wouldnt really recommend it to most technical peeps unless theyre interested in going into management or part of a large team

#

at the very least, you can 2x through the videos and get through the gist of it pretty quickly

misty flint
#

then you can report back to the class

exotic maple
#

The problem is, i have no training data

#

1 sample from my side :v

#

few dimensions

#

F

misty flint
#

looks like youll have to do some swiping

#

to feed the model

stiff barn
#

Haha yup, gotta build it up

misty flint
stiff barn
#

That's the fun part

misty flint
#

honestly that sounds like a hilarious project to have on your resume

#

def a talking point

stiff barn
#

For sure

#

I'm surprised that dude labeled 10,000. That's a lot....

misty flint
#

thats pretty wild yeah but hes done other crazy things before

stiff barn
#

I built a client and labeled like 2000 apartments for my project and that took a long long time

misty flint
#

oh yeah he did this challenge called 12 months to mastery which is honestly pretty ridic

#

@stiff barn March was his Tinder bot month

#

💀

stiff barn
#

Very interesting haha

#

Can see a lot of room for improvement but super cool for a fast project

exotic maple
#

i really need to learn CNN

#

and OpenCV

#

the whole analyze images makes for some very damn good portfolio projects

misty flint
#

did opencv for one project but nowhere near mastery

#

pillow is a cool library too

#

using that for another project

stiff barn
#

The nice thing in that area is that is where the bulk of the pre-trained models for transfer learning are so you can get good results quickly.

#

I like pillow. Keeps things simple

exotic maple
#

I think i want to make a repo with 3 projects. One purely "analytics" and visualization
One with a ML model (not sure of how to show it here)
One with an image NN (havent even started NNs lol)

misty flint
#

yeah i feel like projects are now either CV or NLP focused

#

at least the "interesting" ones

exotic maple
#

though, I actually feel like doing a sentiment analysis miniproject in spanish (my mother language)

misty flint
#

well i heard you usually end up choosing one or the other

stiff barn
#

NLP is weird because you just know GPT-3 is there and you'll never get anywhere near it

misty flint
#

so its okay

exotic maple
#

but id need to scrape some language

#

yeah NLP is a dead end with GPT and transformer models lol

misty flint
exotic maple
#

truth be told, with how behind most companies are, even a simple KNN implementation would do wonders

misty flint
stiff barn
#

Would be nice if GPT-3 was open so we could use it for transfer learning and such. I have access to the api for it but that's limited to some extent.

misty flint
#

maybe it will be open more in the future

exotic maple
#

bro, no joke, I have a guy in my BU trying to make a "bot" for classifying cases? His idea, REGEX!!

misty flint
stiff barn
#

There are still plenty of use cases for NLP.

exotic maple
#

motherfucker I can do that better with a simple KNN lmao

stiff barn
#

And GPT-3 is spawning many businesses.

#

Probably will be closed forever though since Microsoft bought the exclusive license to it.

misty flint
#

maybe it will be an azure service?

tidal bough
stiff barn
#

It will be for sure. And you can get access to the api if you ask nicely and wait a long time.

misty flint
exotic maple
#

there's literally a startup to make webapps, based on GPT-3, that only needs a rough description of the app

misty flint
stiff barn
#

Yeah, the stuff being built with GPT-3 is very interesting

exotic maple
#

it's pretty rough from what i've seen, but c'mon, can you imagine the chaos of getting rid of half of "full stack devs"?

stiff barn
#

This was cool as well

exotic maple
#

AI will kill us all

#

abandon AI, return to Amish

stiff barn
#

Agreed

misty flint
#

haha this is great. i like the avocado chair

exotic maple
#

REx

#

are you a student?

stiff barn
#

Avocado chair is pretty good

misty flint
#

hmm?

#

yes..?

exotic maple
#

Lol

#

i'll send you a DM

misty flint
#

ok

exotic maple
#

oof

#

privated

#

just like in tinder

#

-cries-

misty flint
arctic wedgeBOT
#

Hey @tardy plover!

It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a.

Feel free to ask in #community-meta if you think this is a mistake.

frigid forum
clever karma
#

How much linear algebra do people in machine learning actually use?

ivory pendant
#

35% in pinned messages

gentle bramble
#

hi guys

#

so i'm an intermediate in python

#

and i wrote a tutorial on how to make a face recognition program with python

#

if anyone has any spare time i'd love if you checked it out and tell me if i have any mistakes

frigid forum
tidal bronze
#

what do you guys think should be my k in this case?

serene scaffold
#

My coworkers haven't run out of things to do 🤷‍♂️

exotic maple
#

I mean, I said it as a kind of tongue-in-cheek kind of joke, but I can see how it definitely didnt come across as it

#

My hate for NLP is also not subtle :p

grave frost
#

There is so much to learn NLP that I am dying everyday

#

it's like every paper has some different technique and there's a whole flood of them

serene scaffold
exotic maple
#

😔

grave frost
exotic maple
#

I said hate, not boring

grave frost
#

what do you hate about it then?

raw minnow
#

Hiiii, I have a question

#

If I want to start learning data science

#

should i learn jupyter, rstudio, watson studio,... or python's numpy, pandas, matplotlib, seaborn... first?

#

thanks!

#

is machine learning related to data science?

keen kestrel
#

emacs

exotic maple
serene scaffold
raw minnow
#

@serene scaffold wow, I thought machine learning and deep learning are both AI

keen kestrel
grave frost
raw minnow
#

so is an if-else program to determine whether a number is even or odd artificial intelligence?

grave frost
grave frost
#

it technically counts as logic, but doesn't actually exhibit intelligence, you can't say its AI

odd lion
# grave frost I don't even know what i33t is

Leet (or "1337"), also known as eleet or leetspeak, is a system of modified spellings used primarily on the Internet. It often uses character replacements in ways that play on the similarity of their glyphs via reflection or other resemblance. Additionally, it modifies certain words based on a system of suffixes and alternate meanings. There ar...

raw minnow
#

i'm currently learning computer science at college as a first year

grave frost
#

Again though, most of the definition are up for discussion and opinion 🤷 and you would find plenty of ideas online

raw minnow
#

is data science a subset of computer science or it use computer science as a tool?

austere swift
#

computer science is a tool for it

#

you can do data science by hand

#

but, who wants to do that 😆

lapis sequoia
#

Can any1 help me?

odd lion
#

I think most DS studies would fall into CS schools these days because the majority of the work is involving CS. But economists use DS all the time, so does business school, agriculture,etc...

austere swift
lapis sequoia
#

I have a file that I've done. but I am facing some issues that not running the file

arctic wedgeBOT
austere swift
#

can you elaborate

#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

lapis sequoia
#

I made a file in python

grave frost
#

you mean you wrote code

lapis sequoia
#

but there is an error

#

ye exactly

austere swift
#

can you show some code or tell us what the error is

lapis sequoia
#

and i couldn't solve this error

grave frost
#

is it specific to AI / Data-science?

austere swift
#

^

lapis sequoia
#

idk what this means

austere swift
#

it means what it says

#

invalid syntax

#

you wrote the code wrong

#

this doesnt look like it has anything to do with data science though

#

so you should claim a help channel

lapis sequoia
#

Okay ik
So what is the right code to

#

how :/

#

i jus joined

#

here

austere swift
#

it looks like you already have one lol

lapis sequoia
#

o;

misty flint
raw minnow
#

is this a good road map?

grave frost
#

why are you learning jupyter BTW?

raw minnow
#

it's included in the course, i didnt specifically pick it

misty flint
#

are you learning on mobile?

#

uh oh

grave frost
#

and I recommend you leave roadmaps but rather learn basics and learn what you like rather than following some set path

raw minnow
#

no, its a screenshot from my laptop

misty flint
#

ah its cropped

#

from..it looks like udemy?

raw minnow
#

yup

misty flint
#

might as well start somewhere. as long you think you can complete it

#

just know its not comprehensive

grave frost
#

tbh I find a mindset that "I have to learn x thing by y time" to be the most unproductive one ever. its not how we really learn things

raw minnow
#

because there are a course from edx too but it's very different and have things like sql, rstudio, jupyter lab, watson studio,...

serene scaffold
# raw minnow

I discourage Python learners from touching jupyter until they have more experience with the language, as it makes everything more difficult to debug and encourages you to not think about code re-usability.

grave frost
#

I don't get why everyone's like "I started coding and aim to build an app in 1 month and then learn AI at the end of 5th month" after that just milk that 120K. That is such a bad mindset. you end up leaving CS in 2 weeks just because you don't like it.

This isn't your school exams that you force yourself and it doesn't matter much if you forget everything (or worse, just rote memorize and forget). CS takes time, years, decades of work to be good at it. roadmaps are a good indicator as to what amount of knowledge an average person is expected to have at a certain point, but its not a path set in stone to follow for eternity.

austere swift
#

yeah honestly imo it's better to just think of an application or project you wanna learn how to do, then learn about that specific topic/project to be able to do it. then after you learn one project and like the basics of it, you can adapt your code to do other stuff as well

#

so like if you start off with something like "i want to be able to visualize this dataset", then you learn about how to use the different visualization tools, learn about data management and stuff like pandas/numpy, etc

#

later, you can use that same code with the same dataset, and learn more stuff building on that

#

by building up on concepts you learn it makes it a lot easier than just learning stuff in order

exotic maple
#

Literally I love @austere swift 's approach

#

personally i set myself 3 goals, in increasing difficulty :

grave frost
exotic maple
#
  1. visualization project with python
  2. ML application, simple with python (predict something with an ML model)
  3. More complex, CV application with CNN
#

and im basing my learning on that

#

mostly

grave frost
#

I used to do personal projects for learning all the basics - now I have reduced those (because I can't manage the time very well) but I find competitions much more encouraging to explore experimental techniques and somehow apply them to increase my LB score.

hidden cove
#

Hello Evervybody , I have a question , how to calculate the average of a signal please ??

grave frost
#

That's why I encourage beginners to do those simple kaggle competitions (one which have a monthly LB) to learn more

hidden cove
#

the statement tells me: create a function that evaluates the average of a signal

austere swift
grave frost
austere swift
#

yeah its a lot more interesting than learning it from somewhere online since you can actually see the results of what you did

#

and satisfying

grave frost
exotic maple
#

oh

#

@grave frost answering your question. It might not be that i dislike NLP, but mostly that i'm just pissed off at the awful quality of teaching ive had of it so far lol

#

so i'll probably have to relearn it from scratch if i ever use it

exotic maple
#

2

grave frost
exotic maple
#

awful explanation, the lecturer was as stimulating as a political speech and his explanations and analogies were shit and there was very little code or math along

#

might as well read wikipedia to learn it

grave frost
exotic maple
#

and honestly im a bit burnout too i think lol

grave frost
#

sad, NLP is kinda interesting. though my interest in AI has been dwindling somewhat lately

exotic maple
#

I hate it because i loved the ML part and i was actually excited about everything i did, even if i struggled with seemingly basic stuff someties

#

sometimes

#

so i went to NLP excited, but this guy killed me in a week lmao

light stump
#

does anyone here have some experience with skimage.transform module?

#

i'm trying to implement either PolynomialTransform().estimate or PiecewiseAffineTransform().estimate and i'm getting errors that idk how to deal with properly

#

but any experience at all with skimage.transform would be helpful

exotic maple
#

@grave frost is this what you meant?

grave frost
#

for what?

exotic maple
#

the competitions

#

leaderboard

grave frost
#

yeah, that's the LB - the top rankers in a competition

edgy edge
#

Hello guys

#

How is the job thing in US concerning Data Science

lapis sequoia
#

Can I ask here a pandas question?

exotic maple
edgy edge
lapis sequoia
#

df['invoicepayed'] = df['invoicepayed'].replace([r'\N'], np.nan)

#

will replace \N with NaN in my dataframe, seems to work okay

#

if df['invoicepayed'].notnull():
pd.to_datetime(df['invoicepayed'], format='%Y-%m-%d %H:%M:%S')

#

format works for not NaN and without the if-statement

#

so with if ... I try to convert only for not NaN

#

I get

#

"The truth value of a {0} is ambiguous. "

#

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

fast plover
#

Quick pandas/numpy question: I want to get the mean of the bottom 10% of values in a series. How do I do this?
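
A minimal sketch of two ways to read "bottom 10%", using a made-up series `s`. Which one fits depends on whether the cutoff should be a value (the 10th percentile) or a row count (the smallest 10% of rows); both are assumptions about what the question wants.

```python
import pandas as pd

# Made-up data: the integers 1..10 in shuffled order.
s = pd.Series([5, 1, 9, 3, 7, 2, 8, 4, 10, 6])

# Option 1: keep values at or below the 10th percentile, then average.
bottom_by_quantile = s[s <= s.quantile(0.10)].mean()

# Option 2: take the smallest 10% of rows by count (at least one row).
n = max(1, int(len(s) * 0.10))
bottom_by_count = s.nsmallest(n).mean()
```

With ties or very short series the two options can disagree, so it is worth checking both on real data.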

lapis sequoia
#

I checked stackoverflow found some posts but I seem not to understand the underlying problem

exotic maple
#

df['invoicepayed'].notnull()

#

this is not what you think it is

#

that returns a series

#

where each value in the column is evaluated to be NAN or not NAN

#

so if you do if df['invoicepayed'].notnull():
you get an error because you are saying

#

"Is Series True?"

#

and python cant answer that

#

@lapis sequoia you're not comparing each object inside the series, you're comparing the series itself there
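
A minimal sketch of the point above, on a made-up `invoicepayed` column. `notnull()` gives one boolean per row, so it works as a mask rather than as an `if` condition. `pd.to_datetime` already turns missing values into `NaT`, so the `if` should not be needed; `errors="coerce"` is an optional extra added here that also turns unparseable strings into `NaT`.

```python
import numpy as np
import pandas as pd

# Made-up sample data; "\N" marks a missing value, as in the question.
df = pd.DataFrame(
    {"invoicepayed": ["2021-01-05 10:30:00", r"\N", "2021-02-01 08:00:00"]}
)

df["invoicepayed"] = df["invoicepayed"].replace([r"\N"], np.nan)

# One boolean per row, not a single True/False.
mask = df["invoicepayed"].notnull()

# Convert the whole column at once; missing entries become NaT.
df["invoicepayed"] = pd.to_datetime(
    df["invoicepayed"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)
```

If a single yes/no answer really is wanted, `mask.any()` or `mask.all()` is what the error message is pointing at.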

abstract zealot
#

Hi basically working with data frames with excess of 1 million rows and I’m using pd.groupby, specifically
```py
for (a, b), c in df.groupby(by=['col1', 'col2']):
```
I’ve noticed this is very very slow and was wondering if anyone had any suggestions for improvement? I’ve tried itertools groupby which slightly improved times, but I think because the column consists of strings maybe somehow converting the columns to an integer value might speed things up ? I have no idea but would love to try some of your guys suggestions 🙂

exotic maple
#

a million rows shouldnt be that much of a problem for pandas @abstract zealot can you share a screenshot of your df?

lapis sequoia
#

Ok. So I need to check for every element in the series not the series itself.

exotic maple
#

if datetime ignores nans, then you can just use it

abstract zealot
#

Sorry my bad it’s 25 million you made me recheck hahaha

lapis sequoia
#

oh.... I played in my jupyter notebook and didn't notice.

misty flint
#

thats a lot of data

lapis sequoia
misty flint
#

itll probably take a long time regardless unless you have access to more processing power

exotic maple
#

thats probably what's slowing you down

#

why are you looping to do a groupby anyways lol

misty flint
#

lol

abstract zealot
#

@exotic maple are there any better alternatives to groupby? I unfortunately need to do calculations on Sub data frames returned by groupby for certain values

exotic maple
#

you can try just aggregating or passing a custom function

#

IF i get you right

#

for example, you want the SUM of N values of a row in a groupby

#

df.groupby("RELEVANT GROUP").agg({"COLUMN TO SUMMARIZE": SUMMARY FUNCTION})
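
A small sketch of that pattern with the placeholders filled in. The column names `col1`, `col2` and `value` and the data are made up for illustration; the idea is to let pandas aggregate every group in one call instead of looping over the groups in Python.

```python
import pandas as pd

# Made-up data standing in for the real frame.
df = pd.DataFrame({
    "col1": ["a", "a", "b", "b", "b"],
    "col2": ["x", "x", "y", "y", "z"],
    "value": [1, 2, 3, 4, 5],
})

# One aggregation call instead of a Python loop over every group.
summary = df.groupby(["col1", "col2"]).agg({"value": "sum"})

# Custom per-group logic can be passed as a function too
# (usually slower than the built-in aggregations).
spread = df.groupby(["col1", "col2"])["value"].agg(lambda g: g.max() - g.min())
```

Built-in aggregations named by string (`"sum"`, `"mean"`, `"count"`) tend to be the fastest option on large frames; a lambda runs once per group in Python.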

abstract zealot
#

I’ll definitely try something like this and let you know thank you very much man

exotic maple
#

you can also try it via pivot tables

#

but i found pivot tables in pandas...odd. i prefer grouping manually lol

abstract zealot
#

Another quick question @exotic maple what if in addition to grouping, I wanted to only look at the data frames in intervals from rows 0-20, 20-40, 40-60 etc

#

Is this possible with the method you describe?

exotic maple
#

You can by grouping by partitions?

#

You mean?

abstract zealot
#

I think so yes

exotic maple
#

Eh ive never done that but i think you can.

Id try this.
df["splits"] = pd.cut(df.index, 5)
This will create 5 equal-width bins over the row index for splitting

Then id use groupby by that column

#

Im sure theres a better way but i dont have a sample df nor energy right now lol
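
One possible alternative for fixed-size row intervals (0-20, 20-40, ...) is integer division on the row position. This is a sketch on a made-up 50-row frame, not something tested on the 25 million row data from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(50)})  # made-up data, 50 rows

# Rows 0-19 -> chunk 0, rows 20-39 -> chunk 1, rows 40-49 -> chunk 2.
chunk_id = np.arange(len(df)) // 20

chunk_sums = df.groupby(chunk_id)["value"].sum()
```

To chunk rows inside each existing group instead of across the whole frame, `df.groupby("col1").cumcount() // 20` should give a per-group chunk id that can be used as a second grouping key.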

abstract zealot
#

Jahahaha no problem thank you very much again

short heart
#

ValueError: Input 0 of layer sequential is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 60)

#

any ideas how to fix that?

tidal bronze
#

how can I save cluster mappings of Kmeans aggregation to a dictionary?

#

I used sklearn if it matters

fading kernel
tidal bronze
#
clust_map = dict(zip(agg_df.index, agg_df["an_vol_cluster"]))

I used this in the end which seems to be quite similar to what you are suggesting 😄

#

thanks anyway @fading kernel

short heart
#

How do i put a 2d array into sequential in keras

#

Or how do i reshape it

short heart
#
model = Sequential()
model.add(LSTM(4,input_shape=(940,60),return_sequences=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, epochs = 100, batch_size = 32)
grave frost
#

input?

short heart
#

an array of minmaxed values

#

give me a sec

#

[[1.98211310e-02 2.13912644e-02 2.05622164e-02 ... 7.13034744e-02
  8.30816740e-02 8.40888464e-02]
 [2.13912644e-02 2.05622164e-02 2.05600173e-02 ... 8.30816740e-02
  8.40888464e-02 8.43153503e-02]
 [2.05622164e-02 2.05600173e-02 2.06237903e-02 ... 8.40888464e-02
  8.43153503e-02 8.45660438e-02]
 ...
 [5.82092875e-05 4.71699742e-05 4.45750758e-05 ... 1.26729997e-03
  1.00240043e-03 1.10061074e-03]
 [4.71699742e-05 4.45750758e-05 4.63343289e-05 ... 1.00240043e-03
  1.10061074e-03 1.38178337e-03]
 [4.45750758e-05 4.63343289e-05 4.59165063e-05 ... 1.10061074e-03
  1.38178337e-03 1.59278379e-03]]
#

looks something like this

grave frost
#

I mean using the input layer

#

tf.keras.layers.Input

short heart
#

idk

#

im having a brain fart at this point

#

i tried reshaping it like this X_train = np.reshape(X_train, (940, 60, 1)) which worked before btw, but now it gives out this error ValueError: total size of new array must be unchanged, input_shape = [60, 1], output_shape = [940, 60]

abstract zealot
#

@exotic maple your suggestion works! As does using a lambda function in the groupby 🙂

short heart
surreal radish
short heart
#

60

#

Should be 60

surreal radish
#

940 * 60 is not 60 !!!!

short heart
#

oh u meant that

#

Well yeah, so what?

surreal radish
#

you can use this for the second argument of reshape :

#

(1,len(X_train))

#

is it work !?

short heart
#

but its 940,60…

surreal radish
#

you want 3D array ?

short heart
#

Yes, i do

#

For a sequential in keras

surreal radish
#

as i know the multiplication of Dimensions should equal your array length

short heart
#

Wdym

surreal radish
#

i mean multiplication of values in second argument of reshape

short heart
#

So what u r saying is multiplication of second arguments should equal 3?

surreal radish
#

no no

short heart
#

Just give an example...

#

I dont think it solves the problem though

misty flint
#

i did matrix multiplication by hand the other day

surreal radish
#

for example you have array with length 12 okay ? if you want to reshape it to any Dimensions the multiplication of numbers should be 12
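
The rule can be checked with a few lines of NumPy. The 12-element array mirrors the example above, and `X_train` here is random stand-in data with the (940, 60) shape mentioned earlier, not the real training set.

```python
import numpy as np

a = np.arange(12)             # 12 elements

a_2d = a.reshape(3, 4)        # fine: 3 * 4 == 12
a_3d = a.reshape(2, 3, 2)     # fine: 2 * 3 * 2 == 12

# Same idea for a (940, 60) array: adding a trailing axis of size 1
# keeps the element count at 940 * 60.
X_train = np.random.rand(940, 60)
X_3d = X_train.reshape(940, 60, 1)

try:
    a.reshape(5, 3)           # 5 * 3 == 15, not 12
except ValueError as err:
    reshape_error = str(err)
```

For a Keras LSTM the expected layout is (samples, timesteps, features), and `input_shape` leaves out the samples axis, so for data shaped (940, 60, 1) it would likely be `input_shape=(60, 1)` rather than `(940, 60)`.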

misty flint
#

tldr: it was not fun

lucid ferry
#

Hey,
I have this def that takes a string and a DataFrame as arguments.
def accuracy_by_species(specie_name, df):

Now, I want to use apply function and pass DataFrame as argument. Is this possible?
Something like:
['a', 'b', 'c'].apply(accuracy_by_species, MyDF)

Any help would be appreciated.
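
A plain list has no `apply`, but a pandas Series does, and it can forward extra arguments through `args`. In this sketch the body of `accuracy_by_species` and the `species` / `predicted` columns are hypothetical stand-ins, since the real function is not shown.

```python
import pandas as pd

# Hypothetical stand-in: share of correct predictions for one species.
def accuracy_by_species(specie_name, df):
    rows = df[df["species"] == specie_name]
    return (rows["predicted"] == rows["species"]).mean()

# Made-up data.
my_df = pd.DataFrame({
    "species":   ["a", "a", "b", "c"],
    "predicted": ["a", "b", "b", "c"],
})

# Series.apply passes each element first, then the extra positional args.
result = pd.Series(["a", "b", "c"]).apply(accuracy_by_species, args=(my_df,))

# A list comprehension does the same thing without a Series.
result_list = [accuracy_by_species(name, my_df) for name in ["a", "b", "c"]]
```

`functools.partial(accuracy_by_species, df=my_df)` would be another way to bind the DataFrame before mapping.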

surreal radish
short heart
#

R u sure it works that way

surreal radish
#

yes because it is a rule you can check it in documents of numpy

red yew
#

Hey there, I have a pretty common task that I struggle with in python, usually trying to use pandas and matplotlib. I've seen guides online for similar things, but never quite this issue, which I'd think is very common:

I have a series of discrete events broken up into timestamps, like "message sent at timestamp x", and some 10,000 of those. The timestamps span maybe 13 months. All I want to do is bin that data into days, so like "10 messages received on Jan 1st, 12 on Jan 2nd, 7 on Jan 3rd," etc. I'd like to see it on a graph, showing the number of events per day over time, to see trends. I've already converted the timestamps into epoch time (seconds since 1970) and have it in CSV form and as a dataframe in pandas.

Anyone know how I can do this?

grave frost
exotic maple
grave frost
red yew
red yew
exotic maple
#

if you have your days only, you can use

misty flint
red yew
#

do you mean basically just count 24 hour periods from a start date to an end date, iterating over my sorted data and counting manually in a loop?

exotic maple
#

df.groupby("date") -> this will groupby the table by the UNIQUE values of the grouping column

#

then, you can cast an aggregation function

misty flint
#

numpy can do it in like 1/100th of the time

exotic maple
#

in your case you simply want sum so

grave frost
exotic maple
#

df.groupby("date").agg({"messages":sum})

misty flint
#

...i didnt say it was not easy

exotic maple
#

you can also do it via pivot table, since pivot table is literally a grouping function as well

grave frost
red yew
#

Sure, just give me a minute to upload it

exotic maple
#

the way i got it he only wants grouped sums

grave frost
exotic maple
#

isnt it easier to convert to timestamp and extract day?

grave frost
#

it depends

exotic maple
#

that's the approach id use

grave frost
#

I prefer the shortest route however dirty it might be 😁

exotic maple
grave frost
#

I had a problem to solve to get the mean of nested arrays and this time I decided to do it properly with a class since it would be re-used. took me an hour

red yew
grave frost
#

just to write this piece of shit:

class BPE():
  def bpe_embed(self, arg):
    vec = []

    for _ in arg:
      vec.append(bpemb_ny.embed(_))
    
    return self.averager(vec)

  def averager(self, sentence_vec):
    averaged_vec = []

    for j in sentence_vec:
      for k in j:
        averaged_vec.append(k)
      
    avg = np.mean(averaged_vec, axis=0)
    #print("avg:", avg)
    return avg
  
  def final(self, arg):
    final = []
    for stanza in tqdm(arg):
      final.append(self.bpe_embed(stanza))

    print(final)
    return np.array(final)
grave frost
#

!e

a_string = '1593316925.431|user1'
print(a_string.split('|')[0])
arctic wedgeBOT
#

You are not allowed to use that command here. Please use the #bot-commands channel instead.

red yew
#

I can see programatically sorting it and then iterating over it in 24 hour chunks, creating a new table that way

#

I've also used a pivot table for this in the past somehow

grave frost
#

how does epoch and user correlate with day and events?

red yew
#

each timestamp is an event that occurs on some day in time. The user column is mostly useless for this case, it's already filtered by user

grave frost
#

and the number at the start?

red yew
#

epoch|author is the row that defines the headers
1593316925.431|user1 is the first row of data, with the first column being an event that occurred on June 28th, 2020, 4:02:05 AM UTC, by user1

grave frost
#

aight. so then just extract the day from each entry, put it in a list and then use matplotlib. what do you find difficulty in?

#

you can use count to then count the number of times it appears

red yew
#

I guess I was expecting that matplotlib or pandas would have a function like hist() or something that would automatically know how to do this

#

I can extract the day and count programmatically, then plot that

#

I guess simpler is better. I was hoping for fanciness

grave frost
#

I guess there might be some function 🤷 but I don't know

#

tho tbh you might be surprised how many common things we do not have functions (or libs) for

red yew
#

could numpy help me perhaps? I'll need to sort the data by that column, find the min and max to get the date range, and then iterate through the data to sum up the counts per day. basically binning it manually

tidal bough
#

as for a numpy solution, hmm.

exotic maple
#

seaborn has hist

#

and pandas has

#

df.plot.hist

#

i think

tidal bough
#

what do you need exactly? Plot the counts of events per day? That seems like a histogram with fixed bin edges to me - if so, you can just use plt.hist or np.histogram.

red yew
#

it's a one-time processing of, worst case, 310,000 rows, so I don't really mind the processing time

exotic maple
#

in fact, matplotlib has histograms too...

tidal bough
#

matplotlib's hist calls np.histogram, even.

red yew
exotic maple
#

try weeks

tidal bough
#

pretty much; you might just need to manually generate the bin edges

exotic maple
#

you can try getting "week of year" (a number from 1 to 52, or 53 in some years) and generate a histogram from there

tidal bough
#

but that's a pretty small function, comparatively; it's only 400 numbers

exotic maple
#

by day is awful

#

you wont be able to read it

red yew
#

my goal is to see a trend over time, where counts per day would give me a good indicator of daily activity that may fluctuate over time. Like, picture wanting to do analytics on a website where you see hits per day from one user

exotic maple
red yew
#

basically I'm doing a transformation of an unordered list of discrete events into a summation of hits per day

#

then plotting that

tidal bough
#

I see. So yeah, that's just a histogram.

red yew
#

ah

#

I've looked up guides on histograms but not found something clear about this specific thing

exotic maple
#

@red yew try this

  1. separate epoch from author as awesome told you
  2. convert the epoch to a readable dt format
  3. extract day / week from dt format
  4. create histogram
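
The steps above can be sketched roughly like this. The rows are made up but follow the `epoch|author` layout described earlier, and `resample("D")` is one possible way to do the per-day binning, not necessarily what was meant by "create histogram".

```python
import pandas as pd

# Made-up rows in the "epoch|author" layout.
df = pd.DataFrame({
    "epoch": [1593316925.431, 1593320000.0, 1593403325.0, 1593576125.0],
    "author": ["user1"] * 4,
})

# Step 2: epoch seconds -> datetime (UTC).
df["dt"] = pd.to_datetime(df["epoch"], unit="s")

# Steps 3-4: count events per calendar day.
# Days without any events show up with a count of 0.
daily_counts = df.set_index("dt").resample("D").size()
```

From there `daily_counts.plot()` should draw a line graph with real dates on the x-axis, and `daily_counts.rolling(7).mean()` gives a running weekly average.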
red yew
#

like I've tried this:

df = pd.read_csv('output.csv', header=0, delimiter='|', quotechar='^', quoting=csv.QUOTE_MINIMAL)
fig, ax = plt.subplots()
df["timestamp"].astype(np.int64).plot.hist(ax=ax)
labels = ax.get_xticks().tolist()
labels = pd.to_datetime(labels)
ax.set_xticklabels(labels, rotation=90)

plt.show()

but it resulted in a graph without sufficient bins

tidal bough
#

As an example, just plt.hist it. The results will be horrible because it will by default choose like 10-20 evenly sized bins, but it should work. To make it right, pass the bins argument to it.

red yew
#

ah hm, let me see the bins arg

tidal bough
#

but it resulted in a graph without sufficient bins
yup, precisely, you'd have to specify their count and maybe also the precise positions.

red yew
#

current output

exotic maple
#

its just the x axis is wrong

#

because you are using epochs and not human dates

red yew
#

can I format the x-axis with a function that knows how to handle the epoch? Or do I need to change the input format

exotic maple
#

i'd choose changing input format, much cleaner

#

and reproducible

#

but thats up to you

#

labels = pd.to_datetime(labels) this part of your code doesnt seem to be working

red yew
#

I can put it into ISO8601 or something. I'll need to lookup how matplotlib handles datetimes I suppose

#

o

exotic maple
#

ISO8601 my man here with ISO format

#

-hugs-

red yew
#

we are nothing without standards

#

I don't know how to convert a column of data that's in a dataframe. I assume there's a transformation function that can be applied over it.

#

bins=365 improves things already

exotic maple
#

pd.to_datetime or however the hell it's spelled

#

buuuuuuuuuuuut

#

im not sure if datetime converts epochs

tidal bough
#

should be possible to at least convert it to numpy's datetime type

red yew
#

hmm to_datetime takes an strftime format string, hm

exotic maple
#

once you know what you want, its easy to google it 😉

#

pandas was literally built to handle annoying datetime stuff lol id be surprised if it didnt handle it

#

now, being a bit more..."scientific" why are you looking at that trend like that plasma? I think a more interesting observation could be:
1-) messages by day of week.
2-) seasonality of messages by month / day / week, etc
daily change itself doesnt seem too valuable to me there

#

in fact, your data has some very noticeable spikes, so there seems to be somehting there

red yew
exotic maple
#

see, the spikes i mentioned :p

red yew
#

like ideally, a vertical line in the graph with a label indicating what happened on that day

#

again, if it were a website, picture "sale on this day"

exotic maple
#

you could add the following

#

compute the mean per day

#

and make a single horizontal line

#

to display it

#

and then color all the bars n-stds away from the mean

red yew
#

that'd be neat

exotic maple
#

id like to see your data plotted as a normal distribution

#

tbf it seems like it COULD approximate it

red yew
#

I imagine there'd be some trends in day-of-week, just not interesting in my case

#

a running weekly average would be interesting too

#

spikes here probably line up with weekends

exotic maple
#

what I'm trying to say is: mark the mean. mark 1 std deviation above and below the mean

#

and color the bars ABOVE the 1-std differently
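
A rough sketch of that idea. The `daily_counts` values here are made up; in practice they would be the per-day event counts. The one standard deviation threshold is the one suggested above and can be changed freely.

```python
import pandas as pd

# Made-up daily counts for one week.
daily_counts = pd.Series(
    [10, 12, 7, 11, 30, 9, 10],
    index=pd.date_range("2021-01-01", periods=7, freq="D"),
)

mean = daily_counts.mean()
std = daily_counts.std()

# Days more than one standard deviation above the mean.
is_spike = daily_counts > mean + std

# One color per bar, ready to hand to a bar plot.
colors = is_spike.map({True: "red", False: "gray"})
```

For the plot itself, `plt.bar(daily_counts.index, daily_counts.values, color=colors.tolist())` followed by `plt.axhline(mean)` should draw the colored bars with a horizontal line at the mean.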

red yew
#

I'd like to figure that out as an experiment, sure. It'd be neat to see

exotic maple
#

that's pretty easy to plot :p

#

and it can visually display your idea of "something different happened here"

red yew
#

I have no idea how to do that currently. both pandas and matplotlib are opaque to me, and most docs seem to be just SO questions/answers, or very verbose API references

twin moth
#

Any idea which of those is better in order to count duplicates in a dataframe?

def count_duplicatives(df, col_name=None):
    return df.duplicated(col_name or df.columns.tolist()).sum() 
def count_duplicatives(df, col_name=None):
    return df[df.duplicated(col_name or df.columns.tolist())].shape[0] 
exotic maple
#

you can cast np.mean(df["value"]) on the column that holds your values

red yew
#

I think I'll start with trying to fix these x-axis labels (still trying), and then drawing vertical lines for important events

red yew
red yew
#

thanks

exotic maple
#

instead of verical bars

#

thats what you want no?

#

use annotate, much cleaner

red yew
#

basically yea, and it sounds like the coordinate system is the x-axis by default, so I can specify an epoch time of the event

#

which I can figure out

exotic maple
#

matplotlib is really cool but a massive pain in the ...

red yew
#

so I've gathered! Do you have a preferred plotting lib?

#

I saw pyplot but it seemed very focused on web-based notebooks

exotic maple
#

try seaborn

#

its prettier

#

and abstracts a lot of stuff you dont want

red yew
#

cool

exotic maple
#

and you can still reference matplotlib objects

#

since seaborn is built on top of matplotlib

#

thats the kind of plot that id like to see in your data. kde for values basically

#

if its normally or normal-like distributed, you can easily find outliers mathematically by declaring

#

Z scores

#

(how many standard devs is the value away from the mean)

red yew
#

but what if the outlier is a trend over time? Like "the user slowly stopped using this service over a period of 1 month"

#

one could calculate the weekly frequency of usage

exotic maple
#

that's different. I would have to think it over

#

but thats not related to population

#

but to one user

#

so you'd have to compute it separately

#

i should be working on NLP but im finding your data more interesting lmao

#

I guess i like the intersection of marketing and analytics :p

red yew
#

haha. I think I'd rather be working on NLP

exotic maple
#

people really need to check documentation more often

#

xD

red yew
#

speaking of documentation, I'm trying to find out just what subplots() does and how I can go from this default bar graph to a connected line graph

#

and then I can have multiple lines indicating different users

tidal bough
#

subplots is for several subfigures on one figure, basically

#

if you want to plot more than one plot on a figure (like, several lines), this is as simple as plotting them all between getting a new figure and showing it

red yew
#

I wonder why the example I used had me use it. Maybe so I can control ax

#

o hm

tidal bough
#
plt.figure()
plt.plot(...)
plt.plot(...)
plt.plot(...)
plt.show()
red yew
#
#!/usr/bin/python3

import pandas as pd
import csv
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('output.csv', header=0, delimiter='|', quotechar='^', quoting=csv.QUOTE_MINIMAL)
df['timestamps'] = pd.to_datetime(df['timestamp'], unit='s')
fig, ax = plt.subplots()
df["timestamp"].astype(np.int64).plot.hist(ax=ax, bins=75)
labels = ax.get_xticks().tolist()
labels = pd.to_datetime(labels)
ax.set_xticklabels(labels, rotation=90)

plt.annotate("event 1", (1611925200, 100), color='r')
plt.axvline(x=1611925200, color='r')

plt.show()

in my case I'm not even sure why it's creating a bar graph

#

I assume it's being set to that by plt.subplots()

tidal bough
#

bar graph?

red yew
#

ye, currently looks like this

tidal bough
#

uhh

#

so, just a histogram? I don't see how it's different from what you had before.

red yew
#

it is, but, can the histogram data be displayed as a line graph?

twin moth
exotic maple
twin moth
exotic maple
#

you can just count the False instances then

#

actually

tidal bough
twin moth
#

The Trues and that's what I did kinda

exotic maple
#

since True has value 1. You can do
len(df) - sum(column)

tidal bough
#

plt.hist calls np.histogram, so their arguments are pretty much the same.

exotic maple
#

try that

#

or something similar

#

basically

#

Trues are 1

twin moth
exotic maple
#

if you subtract their sum from the number of rows, you get the Falses

twin moth
#

I did exactly that by using .sum()

exotic maple
#

shape should be faster but does it give you the same result?

twin moth
#

Both work exactly the same

#

I get that generally shape is faster

#

But is it really faster in this implementation even though I create a whole new DF just for that?

exotic maple
#

unfortunately i cant answer that confidently

#

so id rather not misinform you

#

time them

#

and if shape is faster, it is
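
A rough way to time the two approaches, as a sketch only: the column name `flag`, the data size, and both helper functions are assumptions made up for this example, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"flag": rng.random(100_000) < 0.3})  # hypothetical boolean column


def falses_via_sum(frame):
    # True counts as 1, so rows minus the sum of Trues gives the Falses
    return len(frame) - int(frame["flag"].sum())


def falses_via_shape(frame):
    # Builds a filtered copy first, then reads its row count
    return frame[~frame["flag"]].shape[0]


# Both approaches should agree before comparing their speed
assert falses_via_sum(df) == falses_via_shape(df)

for fn in (falses_via_sum, falses_via_shape):
    seconds = timeit.timeit(lambda: fn(df), number=200)
    print(f"{fn.__name__}: {seconds:.4f} s for 200 runs")
```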

twin moth
#

Thanks 🙂

#
for col in df.select_dtypes(exclude=['int64','float64']):
    most_common = df[col].mode()[0]
    df[col].fillna(most_common, inplace=True)

How would you guys achieve that?

#

Gives me the following error A value is trying to be set on a copy of a slice from a DataFrame

woven pumice
#

The SettingWithCopyWarning should just be a warning, not an error, and I am actually able to run your code without issue. Another method would be
df.loc[:, col] = df[col].fillna(most_common). Also, using scikit-learn's SimpleImputer with strategy='most_frequent' may be a more effective way of filling missing data in preprocessing

vestal bough
#

I was wondering about using KMeans clustering for a multidimensional dataset.
Since it's a geometric method, how can I be sure that the clustering has been done correctly? (Without any visual feedback)

twin moth
#

Got any idea how to fetch the most common value for each column without iterating through it?

woven pumice
#

you could do something like new_df.mode().iloc[0]

twin moth
# woven pumice you could do something like `new_df.mode().iloc[0]`

Actually I did the following:

def replace_missing_values(df, col_to_def_val_dict):
    new_df = df.copy()
    new_df.fillna(col_to_def_val_dict, inplace=True)
    
    new_df[new_df.select_dtypes(exclude=['int64','float64']).columns.tolist()] = new_df.select_dtypes(exclude=['int64','float64']).fillna(new_df.mode())
    new_df[new_df.select_dtypes(include=['int64','float64']).columns.tolist()] = new_df.select_dtypes(include=['int64','float64']).fillna(new_df.median())
    return new_df
#

Doesn't always work though

#

Some of those values stay NaN

#

If I replace new_df.mode() with a single string it "works"

#

Otherwise it just stays the same
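
A likely explanation, sketched on toy data: `DataFrame.mode()` returns a DataFrame (one row per tied mode), and `fillna` with a DataFrame aligns on the row index as well as the columns, so only NaNs in rows that exist in the mode frame get filled. Taking `.iloc[0]` gives a Series keyed by column name, which fills every row. The column names and values below are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "red", "blue", np.nan],
    "size": ["S", "M", "M", np.nan],
})

modes_frame = df.mode()        # DataFrame with a single row (index 0) here
modes_row = df.mode().iloc[0]  # Series: one most-common value per column

# Aligns on row index too: the NaNs sit in row 3, the modes in row 0
filled_by_frame = df.fillna(modes_frame)

# Aligns on column names only, so every NaN in a column is filled
filled_by_row = df.fillna(modes_row)

print(filled_by_frame)
print(filled_by_row)
```

`median()` already returns a Series keyed by column, which would explain why the numeric half of the function behaved as expected.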

exotic maple
#

by definition KMeans is an unsupervised / descriptive model, and KMeans requires predetermining the number of clusters you want to use

#

If you have no idea of how many you have perhaps try using DBSCAN?
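
On checking a clustering without plotting it: one common numeric check is the silhouette coefficient, which compares each point's mean distance to its own cluster with its mean distance to the nearest other cluster. Below is a small NumPy sketch of that idea for tiny datasets; `mean_silhouette` is a name made up here, it assumes at least two clusters, and scikit-learn's `sklearn.metrics.silhouette_score` is the usual tool in practice.

```python
import numpy as np


def mean_silhouette(X, labels):
    """Mean silhouette coefficient with Euclidean distances (small data only)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix: O(n^2) memory
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # a point is not its own neighbour
        if not same.any():
            scores.append(0.0)  # singleton cluster: score 0 by convention
            continue
        a = dists[i, same].mean()  # mean distance within its own cluster
        b = min(dists[i, labels == other].mean()  # nearest other cluster
                for other in np.unique(labels) if other != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


X = [[0, 0], [0, 1], [10, 10], [10, 11]]
good = mean_silhouette(X, [0, 0, 1, 1])  # labels match the two tight groups
bad = mean_silhouette(X, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

Values near 1 suggest compact, well-separated clusters; values near 0 or below suggest the labels do not match the geometry.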

twin moth
#

Now it works!

strong zephyr
uncut barn
#
import torch
import torch.nn as nn
import torch.nn.functional as F

class encoder(nn.Module):
    def __init__(self, n_inputs = 40):
        super(encoder, self).__init__()
        self.n_inputs = n_inputs
        self.N_c = torch.randint(1, n_inputs + 1, (1,)).item()
        self.random_indices = torch.randperm(self.n_inputs)[:self.N_c]

        self.fc_enc1 = nn.Linear(2, 64)
        self.fc_enc2 = nn.Linear(64, 32)
        self.fc_enc3 = nn.Linear(32, 2)

        torch.nn.init.normal_(self.fc_enc1.weight, std=0.01)
        torch.nn.init.zeros_(self.fc_enc1.bias)
        torch.nn.init.normal_(self.fc_enc2.weight, std=0.01)
        torch.nn.init.zeros_(self.fc_enc2.bias)
        torch.nn.init.normal_(self.fc_enc3.weight, std=0.01)
        torch.nn.init.zeros_(self.fc_enc3.bias)

    def forward(self, X, y):
      x_c, y_c = X[:, self.random_indices], y[:, self.random_indices]
      input = torch.cat((x_c, y_c), 2)
      h1_enc_output = F.relu(self.fc_enc1(input))
      h2_enc_output = F.relu(self.fc_enc2(h1_enc_output))
      r_c = F.relu(self.fc_enc3(h2_enc_output))
      return r_c

would this be a possible way for a model to accept an arbitrary number of inputs?

exotic maple
#

though, not interested in pipelines...atm

woven pumice
# twin moth Now it works!

After some searching around this also seems to work

import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1] * 3 + [2] * 3 + [np.nan] * 2,
                   'b': [True, True, True, True, True, False, np.nan, np.nan],
                   'c': [1.0, 2.0, 3.0, np.nan, np.nan, 6.0, 7.0, 8.0]})
print(df.head(10))
# Assign the result back instead of using inplace=True on a column selection
df['a'] = df['a'].fillna(df['a'].mean())
df['b'] = df['b'].fillna(df['b'].mode().iloc[0])
df['c'] = df['c'].fillna(df['c'].median())
print(df.head(10))
strong zephyr
#

@exotic maple pipelines are just one possibility with easy jobs, but the next closest parallel is celery 🙂

twin moth
#

But I replaced a bunch of columns in each operation while you only did one at a time

woven pumice
#

Ah, I see. Seems like a good approach

exotic maple
#

that's not a DS question, but are you looping through those images?

grave frost
#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

exotic maple
#

well, a while loop does not create an i value for you

#

you're better off using a for loop

#

but if you insist on a while

#

you can do something like

#

While i

#
i = 0
while i < len(df):
    # do something here
    i += 1
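
Side by side, as a small sketch (the `path` column and its values are invented for the example): the while loop needs the counter created and advanced by hand, while the for loop gets it from `range()`.

```python
import pandas as pd

df = pd.DataFrame({"path": ["a.png", "b.png", "c.png"]})  # hypothetical column

# while loop: the counter has to be created and advanced by hand
seen_while = []
i = 0
while i < len(df):
    seen_while.append(df["path"].iloc[i])
    i += 1

# for loop: range() supplies the counter
seen_for = []
for i in range(len(df)):
    seen_for.append(df["path"].iloc[i])

print(seen_while == seen_for)
```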
analog cave
#

does df stand for dataframe?

exotic maple
#

yes

#

its the standard short version

serene scaffold
exotic maple
#

Ofc there is, but I'm playing Minecraft with my daughter 😂 can't think right now

serene scaffold
#

Cute

twin moth
exotic maple
#

28

grave frost
#

I thought you were somewhere around 22-23 😂

velvet thorn
dark sonnet
#

hi

serene scaffold
#

@dark sonnet hi. Do you wanna talk about data science?

gray arch
exotic maple
#

growing old doesnt mean you need to be grumpy :v

fleet hare
#

Anyone have suggestions on repo structure/templates for ml projects? Specifically a project with heavy experimentation but also a deployed production model

exotic maple
#

man i still dont get how to upload files to google colab

#

that shit is as cryptic as matplotlib docs...

fleet hare
#

Pretty sure it's just drag and drop

serene scaffold
exotic maple
fleet hare
# serene scaffold so you have a repo that has a pretrained model, but you can also use it to train...

Yeah or many different types of models and data preprocessing and everything else that you mess around with while working on a project. Basically what’s the best way to structure a repo to keep track of these experiments but also have a production model.

Looking for something like this: https://github.com/jeremyjordan/data-science-template or this: https://github.com/ml-tooling/ml-project-template but I wanted to see if there was anything else out there

#

The second one seems like a bit of an anti-pattern, since it's basically the same as splitting the research and production into 2 different repos, which I don't want to do

fleet hare
# exotic maple on the notebook?

You have to click the folder icon on the menu on the left side when you’re in a notebook and then you can drag and drop files or there should be an upload button

exotic maple
#

@fleet hare I dont know

#

but i will find you, and i will hug you

#

thanks

#

lmao

serene scaffold
#

@fleet hare I'm not sure I understand the issue with having the user-facing code in a separate repository if the research-specific code isn't useful to them.

lapis sequoia
#

It's a way of representing documents usually employed in Information Retrieval or learning from texts. Each document (observation) is modelled as a vector of N dimensions, N being the number of words, terms or whatever base unit you are working with. If the document contains a given word, then the corresponding element of the vector is not zero. It's a generalization of the Standard Boolean Model, where elements of a vector can only take values 0 or 1.
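
A tiny sketch of that representation in plain Python. Raw term counts are used as the non-zero weights here, which is just one possible choice, and the two example documents are made up.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]

# One dimension per distinct term across all documents
vocabulary = sorted({word for doc in docs for word in doc.split()})


def count_vector(doc):
    """Vector whose elements are term counts (zero if the term is absent)."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]


def boolean_vector(doc):
    """Boolean-model special case: 1 if the term occurs, else 0."""
    return [1 if count else 0 for count in count_vector(doc)]


print(vocabulary)               # ['cat', 'dog', 'sat', 'saw', 'the']
print(count_vector(docs[1]))    # [1, 1, 0, 1, 2]
print(boolean_vector(docs[1]))  # [1, 1, 0, 1, 1]
```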

short heart
#

Link to my recent problem

solemn atlas
#

Hello Gentlemen,
Hope you have a very enormous day,
I am new to AI and stuff, I want to write my very first neural network, just want to get started, but on YouTube I can't find an appropriate video. If you guys can suggest some YouTube video for an absolute beginner who knows Python programming up to a certain extent (not pro though), that will be great 😁

grave frost
short heart
#

So I'm using an LSTM. I'm using 60 values to predict 1 value, append it to these 60 and remove the first value, then predict again. But my model doesn't seem to be very effective. Is it worth giving it more data (gonna take a long time) or do I have to somehow change the model?
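
The sliding-window loop described there can be sketched without any model at all. `predict_next` below is a hypothetical stand-in for the LSTM's predict call (it just returns the window mean), so only the windowing logic is meaningful.

```python
from collections import deque

WINDOW = 60


def predict_next(window):
    """Hypothetical stand-in for the trained model: returns the window mean."""
    return sum(window) / len(window)


def forecast(history, steps, window_size=WINDOW):
    """Predict `steps` values, feeding each prediction back into the window."""
    window = deque(history[-window_size:], maxlen=window_size)
    predictions = []
    for _ in range(steps):
        next_value = predict_next(window)
        predictions.append(next_value)
        window.append(next_value)  # maxlen drops the oldest value automatically
    return predictions


print(forecast([0.0, 2.0], steps=3, window_size=2))  # [1.0, 1.5, 1.25]
```

Because every prediction becomes an input for the next step, errors can compound over long horizons, which may be part of why the results look weak regardless of how much data is used.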

grave frost
sonic raft
#

Hi! I've been struggling to understand why we need nonlinearity for neural networks, why we need to use activation functions.. also for example in case of the famous mnist_dataset where the image sizes are 28*28, we construct the weight matrix with dimensions of (28 * 28, 30) (30 is just an example), but the point is that it's bigger than one.. why? 😄 What does it look like when an input flows through a neural network with two layers? (28 * 28, 30) and the second layer (30, 1)
That's my biggest problem: I have no idea what it looks like when the pixels of an image (the input data) flow through the network.
(fastai fastbook chapter: https://github.com/fastai/fastbook/blob/master/04_mnist_basics.ipynb)

short heart
winged yew
#

heyy anyone

#

who know data science fully

tidal bronze
#

can you use silhouette score to compare different kmeans aggregation that use different features?

tidal bough
# sonic raft Hi! I've been struggling to understand why we need nonlinearity for neural netwo...

why we need nonlinearity for neural networks, why we need to use activation functions..
Because it's easy to prove that if you only use linear activation functions, then your network, no matter how deep or wide it is, is equivalent to just a single linear function from inputs to outputs. And, well, linear function do not useful computation make. A linear function classifying images as cat or not would have some pixels with positive weights and some with negative ones - make all the former ones pure white, all the latter ones pure black, and you'll get a "perfect" cat image as far as the network is concerned.
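
The collapse is easy to check numerically. In this sketch the layer sizes (3, 4, 2) are arbitrary and biases are left out for brevity; with biases the stacked layers collapse to a single affine map instead.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))   # 5 samples, 3 inputs
W1 = rng.normal(size=(3, 4))  # first "layer"
W2 = rng.normal(size=(4, 2))  # second "layer"


def relu(z):
    return np.maximum(z, 0)


# Two linear layers in a row...
two_layers = (x @ W1) @ W2
# ...give the same result as one layer with the merged weight matrix
one_layer = x @ (W1 @ W2)
linear_collapses = np.allclose(two_layers, one_layer)

# With a ReLU in between, the merged matrix no longer reproduces the output
with_relu = relu(x @ W1) @ W2
relu_differs = not np.allclose(with_relu, one_layer)

print(linear_collapses, relu_differs)
```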

#

Don't think I understand the rest of your question.

sonic raft
# tidal bough Don't think I understand the rest of your question.

I don't understand it either, I just can't imagine what it looks like when our input data "flows" through the network, the features that it constructs.
Furthermore, as the image shows, instead of creating our weights with one column we create 30 columns, maybe that gives you some idea of what I was trying to say.

tidal bough
#

This is a network with 28*28 inputs, 30 neurons in the first (and only) hidden layer, and 1 output

#

So the matrix that transforms from first (inputs) to second(first hidden) layer is (28*28) x 30, and the matrix transforming from the first hidden to the outputs is 30 x 1.
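
To make the "flow" concrete, here is a sketch that pushes one fake 28x28 image through those two matrices and tracks the shapes. The random weights, the zero biases, and the ReLU in the hidden layer are assumptions for illustration; a trained network would have learned values instead.

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.random((28, 28))    # stand-in for one MNIST digit
x = image.reshape(1, 28 * 28)   # flatten to a row vector: (1, 784)

W1 = rng.normal(scale=0.01, size=(28 * 28, 30))  # inputs -> 30 hidden neurons
b1 = np.zeros(30)
W2 = rng.normal(scale=0.01, size=(30, 1))        # hidden -> 1 output
b2 = np.zeros(1)

hidden = np.maximum(x @ W1 + b1, 0)  # (1, 784) @ (784, 30) -> (1, 30)
output = hidden @ W2 + b2            # (1, 30) @ (30, 1)   -> (1, 1)

print(x.shape, hidden.shape, output.shape)
```

Each of the 30 columns of `W1` is its own weighted sum over all 784 pixels, so the hidden layer holds 30 different numbers per image, and the second matrix combines them into a single output.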

sonic raft
#

I see, but Why is it good to have more and more neurons? I mean I know that it will make the network deeper and deeper, and it will perform better, but why?

tidal bough
#

Well, the more complexity, the more complex relationships the network can approximate.

short heart
#

can someone help me with lstm

sonic raft
tidal bough
#

If your layers are just dense like here, there's no meaningful "purpose" of each layer. They just do some stuff that, after training, ends up being involved somehow in calculating the result.

#

If the layers are different, like how in image classification neural networks the first few layers usually do convolutions and stuff, then you can say that the first few ones do stuff like search for lines, then for angles and more complex details - but even that's mostly a guess.

#

Generally speaking, neural networks just work - their training adjusts the flow between each layer so that the whole ends up doing the task you're making it do. There's no guarantee you can describe the purpose of any specific layer.

#

You can try searching for research on that matter though, maybe there are papers about trying to determine the function of parts of trained neural networks.

sonic raft
#

because I constantly think that the whole matrix multiplication it does can be done by just simply one layer, because they basically multiply the layers with weights and weights, but I guess I can explain why it works like that with nonlinearity, like ReLU

#

😄

tidal bough
#

because I constantly think that the whole matrix multiplication it does can be done by just simply one layer, because they basically multiply the layers with weights and weights
yup, this is precisely why activation functions are needed - without them (or if they were linear), like I said, the entire network can be collapsed into just one layer mapping from inputs to outputs

sonic raft
tidal bough
#

yeah, it's pretty weird, and yet ReLU is newer to become popular - stuff like logistic and tanh are the older ones. Apparently ReLU was shown to be better, and I don't think I know enough to understand why.
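
A commonly cited part of the reason is saturation: the logistic function's gradient is at most 0.25 and shrinks toward zero for large inputs, while ReLU's gradient stays at 1 for any positive input. A small sketch of that comparison (the function names are made up here, and this is only one piece of the story):

```python
import math


def sigmoid_grad(z):
    """Derivative of the logistic function at z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU at z (taken as 0 at z == 0)."""
    return 1.0 if z > 0 else 0.0


for z in (0.0, 2.0, 10.0):
    print(z, sigmoid_grad(z), relu_grad(z))
```

Multiplying many small gradients together across layers is what makes deep logistic or tanh networks slow to train; ReLU avoids that for active units.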

sonic raft
#

Yes, I guess it's just trying to push parameters that are important to be positive and less important ones to be closer to negative 😄

sonic raft
quiet dawn
#

I have a question

#

what is the best source for learning pytorch

#

but it should be beginner level

lapis sequoia
quiet dawn
#

i didn't check completely but there isn't math side of ai

#

probably