#data-science-and-ml


velvet turtle
#

I am using .pivot_table function but in the output its not showing the column and values output

desert oar
velvet turtle
#

ok

fringe anvil
#

Order ID that repeats means there was multiple items in 1 order. so i got

desert oar
#

(you might want to ensure that p1 and p2 are sorted in some unambiguous way, so that you don't accidentally treat p2,p1 as distinct from p1,p2)

fringe anvil
#

the 2 in combinations means what?

desert oar
#

combinations of length 2

#

!d itertools.combinations

arctic wedgeBOT
#

```py
itertools.combinations(iterable, r)
```
Return *r* length subsequences of elements from the input *iterable*.

The combination tuples are emitted in lexicographic ordering according to the order of the input *iterable*. So, if the input *iterable* is sorted, the combination tuples will be produced in sorted order.

Elements are treated as unique based on their position, not on their value. So if the input elements are unique, there will be no repeat values in each combination.

Roughly equivalent to:
fringe anvil
#

ah, thanks

desert oar
#

it's also nice because each combination keeps the input order, so if you sort the product list first, every pair already comes out sorted
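
To make the ordering point concrete, here is a minimal sketch (the product names are made up for illustration). `itertools.combinations` follows the order of its input, so sorting the product list first gives each pair one canonical form:

```python
from itertools import combinations

# Hypothetical products from two orders; names are made up for illustration.
order_a = ["usb cable", "phone"]
order_b = ["phone", "usb cable"]

# Without sorting, the same two products can appear as two different tuples.
unsorted_pairs = [list(combinations(o, 2)) for o in (order_a, order_b)]

# Sorting the input first makes every pair come out in one canonical order.
sorted_pairs = [list(combinations(sorted(o), 2)) for o in (order_a, order_b)]

print(unsorted_pairs)
print(sorted_pairs)
```

The `2` is the length of each combination, so every emitted tuple holds exactly two products.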

velvet turtle
#

this is the problem that im facing @desert oar

desert oar
#

@fringe anvil

import itertools
import pandas as pd

df: pd.DataFrame = ...  # your data here

product_pair_counts = {}
product_pair_counts.setdefault(0)
for order_id, group in df.groupby('Order ID', sort=False):
    product_ids = group['Product ID'].to_list()
    for pair in itertools.combinations(product_ids, 2):
        product_pair_counts[pair] += 1

this is how i'd write it probably

#

it's parallelizable too, by chunking up the groups, dispatching each group to a different process, and then combining the resulting dicts (by summing) at the end. although that's of course more advanced and probably not necessary for your bootcamp course (or a good use of your time at this point)
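
A rough sketch of the chunk-and-merge idea, assuming the orders are already available as plain lists of product IDs (the data and the `count_pairs` helper are hypothetical). The chunks are processed sequentially here; each `count_pairs(chunk)` call is the part that could be handed to a separate process:

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders: each inner list holds the product IDs of one order.
orders = [["A", "B", "C"], ["A", "B"], ["B", "C"], ["A", "C"]]

def count_pairs(chunk):
    """Count product pairs within one chunk of orders."""
    counts = Counter()
    for product_ids in chunk:
        counts.update(combinations(sorted(product_ids), 2))
    return counts

# Split the orders into chunks; each chunk could go to a separate worker.
chunk_size = 2
chunks = [orders[i:i + chunk_size] for i in range(0, len(orders), chunk_size)]

# Process the chunks one after another, then merge the partial counts by summing.
partial_counts = [count_pairs(chunk) for chunk in chunks]
total = sum(partial_counts, Counter())

print(total)
```

Because the merge step only adds counts, the result does not depend on how the orders are split.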

desert oar
velvet turtle
#

ok

#
observations = pd.pivot_table(observations,index='PATIENT',values='VALUE',columns='DESCRIPTION')

observations.head()
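
For reference, a small sketch of what `pivot_table` produces on long-format data of this shape. The rows below are made up, and the cause of the missing output is not established in the chat. One thing that may be worth checking (an assumption) is that `VALUE` is numeric, since the default aggregation is the mean:

```python
import pandas as pd

# Made-up long-format data shaped like the PATIENT / DESCRIPTION / VALUE columns.
data = pd.DataFrame({
    "PATIENT": ["p1", "p1", "p1", "p2", "p2"],
    "DESCRIPTION": ["Body Height", "Body Weight", "Body Weight",
                    "Body Height", "Body Weight"],
    "VALUE": [193.3, 87.8, 88.2, 170.0, 65.0],
})

# One row per patient, one column per description; repeated measurements
# are combined with the default aggregation (the mean).
wide = pd.pivot_table(data, index="PATIENT", columns="DESCRIPTION", values="VALUE")
print(wide)
```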
fringe anvil
desert oar
velvet turtle
#

ok

fringe anvil
# desert oar it's parallelizable too, by chunking up the groups, dispatching each group to a ...

hmm, also for loops and double for loops are kinda slow compared to default pandas methods. we had a lecture about not using them, if we could. i fixed the code a bit to reflect my data and got

orders = df[df['Order ID'].duplicated(keep=False)]
product_pair_counts = {}
product_pair_counts.setdefault(0)
for order_id, group in df.groupby("Order ID", sort=False):
    product_ids = group["Product"].to_list()
    for pair in combinations(product_ids, 2):
        product_pair_counts[pair] += 1
desert oar
#

!e ```python
x = {}
x.setdefault(0)
x[('a','b')] += 1
print(x)
```

arctic wedgeBOT
#

@desert oar :x: Your 3.11 eval job has completed with return code 1.

001 | Traceback (most recent call last):
002 |   File "<string>", line 3, in <module>
003 | KeyError: ('a', 'b')
desert oar
#

that's on me

#

let me see what i did wrong

#

!d dict.setdefault

arctic wedgeBOT
#

```py
setdefault(key[, default])
```
If *key* is in the dictionary, return its value. If not, insert *key* with a value of *default* and return *default*. *default* defaults to `None`.
desert oar
#

oh, that's just not how setdefault works
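
A short sketch of the difference, using the same `('a', 'b')` key as the eval above. `setdefault` only inserts the one key it is given, while `defaultdict(int)` supplies a default for any missing key:

```python
from collections import defaultdict

# dict.setdefault(key, default) inserts one specific key; it does not
# set a default for keys that are looked up later.
x = {}
x.setdefault(0)          # inserts the key 0 with value None
print(x)                 # {0: None}

# Per-key usage: make sure the key exists, then increment it.
pair = ("a", "b")
x.setdefault(pair, 0)
x[pair] += 1

# defaultdict(int) supplies 0 for any missing key automatically.
y = defaultdict(int)
y[pair] += 1

print(x[pair], y[pair])  # 1 1
```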

#

lol my mistake

#
from collections import defaultdict
from itertools import combinations

orders = df[df['Order ID'].duplicated(keep=False)]

product_pair_counts = defaultdict(int)
for order_id, group in df.groupby("Order ID", sort=False):
    product_ids = group["Product"].to_list()
    for pair in combinations(product_ids, 2):
        product_pair_counts[pair] += 1
product_pair_counts = dict(product_pair_counts)

try that

serene scaffold
desert oar
#

btw @fringe anvil

orders = df.drop_duplicates(subset=['Order ID'])

this would work too. but you do not at all want to drop duplicate order ids here!!! then you'd only be getting 1 product per order, which makes no sense for this task
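
To illustrate the difference, a minimal sketch on made-up order data (the column names follow the chat, the values are invented):

```python
import pandas as pd

# Hypothetical order data; column names follow the chat, values are made up.
df = pd.DataFrame({
    "Order ID": [1, 1, 2, 3, 3, 3],
    "Product": ["phone", "cable", "laptop", "phone", "case", "cable"],
})

# duplicated(keep=False) marks every row whose Order ID occurs more than once,
# so all items of multi-item orders are kept.
multi_item = df[df["Order ID"].duplicated(keep=False)]

# drop_duplicates keeps only the first row per Order ID, losing the other items.
first_only = df.drop_duplicates(subset=["Order ID"])

print(multi_item)
print(first_only)
```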

desert oar
#

you'd need to map the inner counter over the groupby

fringe anvil
#

its getting complicated lol

desert oar
#

wait... + works on counters?? TIL

velvet turtle
desert oar
desert oar
desert oar
velvet turtle
desert oar
desert oar
velvet turtle
desert oar
# velvet turtle https://paste.pythondiscord.com/igimozipal
In [4]: data.head()
Out[4]: 
                                PATIENT                                        DESCRIPTION  VALUE
0  034e9e3b-2def-4559-bb2a-7850888ae060                                        Body Height  193.3
1  034e9e3b-2def-4559-bb2a-7850888ae060  Pain severity - 0-10 verbal numeric rating [Sc...    2.0
2  034e9e3b-2def-4559-bb2a-7850888ae060                                        Body Weight   87.8
3  034e9e3b-2def-4559-bb2a-7850888ae060                                    Body Mass Index   23.5
4  034e9e3b-2def-4559-bb2a-7850888ae060                           Diastolic Blood Pressure   82.0

does this look right?

#

what are you trying to calculate here?

#

https://synthetichealth.github.io/synthea/ is this the source of the data?

fringe anvil
#

hmm what method do i call on that Counter to return only the highest value? looks like it's the second one

velvet turtle
#

ya

desert oar
#

however you can replace the inner for loop with a Counter if you prefer

#

however i think it's simpler to just update the "main" dict all at once, instead of first constructing a big list of Counters and summing them

arctic wedgeBOT
#

```py
class collections.Counter([iterable-or-mapping])
```
A [`Counter`](https://docs.python.org/3/library/collections.html#collections.Counter "collections.Counter") is a [`dict`](https://docs.python.org/3/library/stdtypes.html#dict "dict") subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values. Counts are allowed to be any integer value including zero or negative counts. The [`Counter`](https://docs.python.org/3/library/collections.html#collections.Counter "collections.Counter") class is similar to bags or multisets in other languages.

Elements are counted from an *iterable* or initialized from another *mapping* (or counter):

```py
>>> c = Counter()                           # a new, empty counter
>>> c = Counter('gallahad')                 # a new counter from an iterable
>>> c = Counter({'red': 4, 'blue': 2})      # a new counter from a mapping
>>> c = Counter(cats=4, dogs=8)             # a new counter from keyword args
```
desert oar
#

i think it has a method to compute the maximum value, check the docs. otherwise you can do it with something like max(counter.items(), key=lambda pair: pair[1])[0]

fringe anvil
#

hmm nvm, thats not it lol

desert oar
#

.most_common(1)
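
Putting the Counter suggestions together, a small sketch with made-up orders. `Counter.update` replaces the inner loop, and `most_common(1)` returns the top pair with its count:

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders; each list holds the products bought together.
orders = [["phone", "cable"], ["phone", "case", "cable"], ["laptop", "mouse"]]

pair_counts = Counter()
for products in orders:
    # update() adds one count for every pair produced by combinations().
    pair_counts.update(combinations(sorted(products), 2))

# most_common(1) returns a list with a single (pair, count) tuple.
top_pair, top_count = pair_counts.most_common(1)[0]
print(top_pair, top_count)
```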

fringe anvil
#

yeah i ended up trying it

#

alright. ill try to simplify it into easy to read lines, for my own understanding. thanks a lot salt. you're always coming in clutch 🙂

desert oar
fringe anvil
#

i wanted to call the variable cccombo_breaker .. but it was too long and im sure the instructor wouldnt get the reference lol

desert oar
#

i would also strongly encourage using product ids instead of names whenever possible. names are more likely to change or be misspelled

#

if you do the code above and convert it to a dataframe, then you can easily .join in the names and other metadata later if you need it

fringe anvil
desert oar
#

ah, that's too bad then

#

these product names look pretty "clean" and it's just for the exercise anyway

#

but something to keep in mind when working with real data

fringe anvil
#

yeah its some amazon sales from 2019 csv that was provided in a zip when i forked the github repo of the course

brave sand
#

the output of RL is a policy right

#

so when I run RL code I'm not training it?

fresh tiger
#

Hi! I had a question related to gradient descent, in particular with the formula in the first screenshot.

Im currently just tryna test my knowledge in terms of drawing a graph on how different sizes of the learning rate, alpha, can impact computation time.

The solution is in screenshot2. What I am not understanding is how the graph would have lower computation times for very large values of alpha.. My take on the answer is in screenshot 3. Wouldnt we have potentially an infinite amount of computation time if we keep over shooting the minima in gradient descent due to very large values of alpha?
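
The screenshots are not part of this log, so the following is only a toy experiment under stated assumptions: plain gradient descent on f(x) = x**2, with "computation time" measured as the number of update steps until |x| falls below a tolerance. The function, starting point, and tolerance are all illustrative choices:

```python
def steps_to_converge(alpha, x0=10.0, tol=1e-6, max_steps=10_000):
    """Run gradient descent on f(x) = x**2 and return the number of steps used.

    Returns None if |x| has not dropped below tol within max_steps.
    """
    x = x0
    for step in range(1, max_steps + 1):
        grad = 2 * x          # derivative of x**2
        x = x - alpha * grad  # gradient descent update
        if abs(x) < tol:
            return step
        if abs(x) > 1e12:     # clearly diverging, stop early
            return None
    return None

for alpha in (0.001, 0.01, 0.1, 0.4, 0.9, 1.1):
    print(alpha, steps_to_converge(alpha))
```

In this toy setting the step count drops as alpha grows, rises again once the updates start overshooting, and the run never converges once alpha is large enough that each step moves further from the minimum.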

silk axle
#

There's definitely still some funkiness going on... although it could be the data ig

fringe anvil
#

like opening a dictionary in the middle, then your word is in the first half. so you open the first half in the middle, and your word is in the 2nd half of that half... etc until you find your word
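
The dictionary analogy describes binary search. A minimal sketch on a sorted list of made-up words:

```python
def binary_search(sorted_words, target):
    """Return the index of target in sorted_words, or -1 if it is absent."""
    lo, hi = 0, len(sorted_words) - 1
    while lo <= hi:
        mid = (lo + hi) // 2          # open the remaining range in the middle
        if sorted_words[mid] == target:
            return mid
        if sorted_words[mid] < target:
            lo = mid + 1              # target is in the second half
        else:
            hi = mid - 1              # target is in the first half
    return -1

words = ["apple", "banana", "cherry", "grape", "mango", "peach"]
print(binary_search(words, "grape"))
```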

fresh tiger
fringe anvil
fading wigeon
#

I have nothing to contribute other than I love all these handwritten drawings

fading wigeon
#

The head engineer at my company calls those "picassos" whenever I scribble something out for him, lol

frozen summit
#

hey! I was wondering if anyone knew where I could get started on machine learning? specifically on creating a prediction system using python

frozen summit
young granite
#

what do u want to achieve using it?

frozen summit
#

I want to predict the winner of the world cup tournament

young granite
#

🗿

frozen summit
#

I have a dataset I just dont really know what to do with it

#

and I found some projects on github but they are all for single matches not entire tournament brackets

young granite
young granite
#

i dont earn credit for that 😄

#

@frozen summit but if u are completely new to ML i suggest starting with simpler things

young granite
#

well if its ur first time check kaggle iris dataset for example

#

just to get in touch with pandas commands

frozen summit
#

Im really rusty on python too should I go back to the basics before kaggle

#

@young granite btw wheres the tournament bracket side of things? i cant find it

young granite
#

i just bookmarked it

frozen summit
#

I read a bit of it but I think its only for football matches

#

single match

bright sundial
#

Hi guys, I'm currently studying engineering and I have an important question.

#

Is it realistic to solve questions or exercises in advanced subjects such as calculus, algebra, physics, and thermodynamics, as part of a university degree, using only data science libraries like pandas, numpy, matplotlib, pytorch, etc.?

#

For example: Can I solve complex multivariable calculus exercises just using numpy or any python library?

young granite
# bright sundial Hi guys, I'm currently studying engineering and I have an important question.

In this video I go through all the formulas in 2nd year calculus and how to evaluate them symbolically in python with no pencil or paper required

First year calculus:
https://youtu.be/-SdIZHPuW9o

Link to code:
https://github.com/lukepolson/youtube_channel/blob/main/Python Tutorial Series/math2.ipynb

DISCORD SERVER:
https://discord.gg/hTBz...

sand parrot
#

Hi, I have 2 csv files that both have vid.

Vio(description, risk_category, vid)

and want to combine the csv files. what kind of join should I perform?
I want it to be (iid, description, risk_category, vid)

coral cradle
#

when normalizing data, should I normalize the prediction and the predictors?
does the test set also get normalized?
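
One common convention (not stated in the chat) is to compute the scaling statistics on the training set only and reuse them on the test set, so that nothing about the test data leaks into training. A minimal sketch with made-up numbers. Whether the target also needs scaling depends on the model and is left open here:

```python
from statistics import mean, pstdev

# Made-up feature values, split into training and test sets.
train = [2.0, 4.0, 6.0, 8.0]
test = [5.0, 10.0]

# Fit the scaling parameters on the training data only.
mu = mean(train)
sigma = pstdev(train)

def standardize(values):
    """Scale values with the training mean and standard deviation."""
    return [(v - mu) / sigma for v in values]

train_scaled = standardize(train)
test_scaled = standardize(test)   # reuses training statistics, no refitting

print(train_scaled)
print(test_scaled)
```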

hasty mountain
rugged comet
#

When using the functional API, I have three input layers. I'm trying to create the next layer for one of the inputs which is a Normalization layer. I don't understand why I would call normalization.adapt on the raw training data instead of on the input layer.

verbal venture
#

hey guys, is running a tf project locally just installing tf in a python ide? or is there a specific setup to it

rugged comet
#

There's a specific setup to it.

verbal venture
cerulean glacier
#

I know that there is a cost function that neural networks attempt to "optimize." But I was wondering what it means to optimize the function. Do you try to reduce it to zero? Or get it as high/low as possible?

fading wigeon
#

Minimize, I believe

serene scaffold
#

Yes, minimize.

#

The cost is the difference between the actual and desired output. So if the actual output is the desired output, the cost is zero
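
As a small illustration, here is mean squared error, which is one common choice of cost (an assumption, since the chat does not name a specific cost function). It is zero exactly when the outputs match the desired outputs:

```python
def mse(outputs, targets):
    """Mean squared error between model outputs and desired outputs."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)

targets = [1.0, 0.0, 1.0]
print(mse([0.5, 0.5, 0.5], targets))   # 0.25
print(mse([1.0, 0.0, 1.0], targets))   # 0.0
```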

fading wigeon
#

Hey so I'm working on some variable transforms and since some transforms require positive numbers, I was considering just adding an offset. Here's the problem: I can add an offset based on the data I'm working from, but future data will probably still contain negative values beyond that offset. Is there a danger to overshooting that offset? Say my data ranges from -10 to 10 in my dataset but future data may go beyond that. Should I make 10 an offset? 11? 20? 100????

#

Like, let's choose something simple like sqrt as an example

verbal venture
#

essentially

rugged comet
#

Sure but you need all that extra software.

cerulean glacier
turbid arch
#

Hello, I have a question: can ML use a voice module known as pyttsx3? If not then could you reply other working voice modules?

desert oar
#

aka "asinh"

fading wigeon
desert oar
fading wigeon
desert oar
#

but that's a separate problem

fading wigeon
#

Another problem I'm working on is sometimes I have to analyze a group of let's say 50 variables, where some perform better with one transform and some with another, but I have to choose which transform to uniformly apply to the whole group. I'm still not sure how I'm going to solve that.

desert oar
#

fwiw things like gradient boosting and neural networks are supposed to free us from having to worry about getting the precisely optimal feature transformations

#

why do you need to apply the same transformation to all of them?

#

anyway box-cox / power and asinh are both good ones to have in your automated feature engineering toolbox
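
A quick sketch of why asinh sidesteps the offset question from earlier. The values and the offset of 10 are made up to mirror the -10 to 10 example. `sqrt` fails once a value falls below the chosen offset, while `asinh` accepts any real number:

```python
import math

# Made-up values; the training data ranged from -10 to 10, but -25 arrives later.
values = [-25.0, -10.0, 0.0, 10.0, 1000.0]

offset = 10.0  # chosen from the training range, so it fails for -25

for v in values:
    shifted = v + offset
    # sqrt needs a non-negative input, so the offset approach breaks below -offset.
    sqrt_result = math.sqrt(shifted) if shifted >= 0 else None
    # asinh is defined for every real number and is an odd function.
    print(v, sqrt_result, round(math.asinh(v), 3))
```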

fading wigeon
#

I'm not 100% sure, lol. I was asked to by an industry expert so I just went along with it.

desert oar
#

if they're all kind of "the same" feature then i think it has intuitive appeal

fading wigeon
#

Oh, on a similar note.... what should I call this toolbox/function/feature? I'm so bad at naming things

desert oar
#

it's a kind of ad-hoc regularization to avoid overfitting

#

i have no idea what the feature is 😆

fading wigeon
desert oar
#

is this some automated feature engineering thing for linear models?

#

you might want to try a GAM instead of doing all this

fading wigeon
#

Ah hah, perhaps so. But I'm working with old school people who don't touch ML and I have to use these distributions downstream to Z-score against different databases.

desert oar
#

wait what

#

what are these z-scores for? you're trying to hammer a huge number of features into a gaussian-ish distribution so you can compute differences in z-scores?

fading wigeon
#

The Z scores will go into a multivariate and will also be used in clustering to explore the dataset.

#

K-means

#

This will all eventually be something similar to 23 and me except for neuroscience so people can compare their brain's overall health/functioning to people in their age group, across all age groups, etc

desert oar
#

yeesh

#

sounds really ad-hoc

#

probably will work okay but k-means seems weird here

#

if this is your goal then you definitely want/need to look into the box-cox and yeo-johnson transformations https://en.wikipedia.org/wiki/Power_transform#Yeo–Johnson_transformation as well as asinh

In statistics, a power transform is a family of functions applied to create a monotonic transformation of data using power functions. It is a data transformation technique used to stabilize variance, make the data more normal distribution-like, improve the validity of measures of association (such as the Pearson correlation between variables), a...

fading wigeon
#

I started reading a 40 page paper on box cox today, lol

#

Apparently there have been a lot of developments since it was pioneered but also a lot of contention on how to extend it

#

I'm not familiar with yeo-johnson, I'll look into it

#

Yeo–Johnson transformation looks really promising! Hopefully there are python implementations to solve for lambda, I'm not sure how I'd go about doing that on my own
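
On existing implementations: SciPy provides `scipy.stats.yeojohnson`, which estimates lambda when none is passed, and scikit-learn offers `PowerTransformer(method="yeo-johnson")`. To show what the transform itself does, here is a dependency-free sketch for a fixed lambda. The lambda value is arbitrary and only for illustration:

```python
import math

def yeo_johnson(y, lmbda):
    """Apply the Yeo-Johnson transform to a single value for a fixed lambda."""
    if y >= 0:
        if lmbda == 0:
            return math.log(y + 1)
        return ((y + 1) ** lmbda - 1) / lmbda
    if lmbda == 2:
        return -math.log(-y + 1)
    return -((-y + 1) ** (2 - lmbda) - 1) / (2 - lmbda)

# Arbitrary lambda for illustration; in practice it is estimated from the data.
values = [-10.0, -1.0, 0.0, 1.0, 10.0]
print([round(yeo_johnson(v, 0.5), 3) for v in values])
```

Unlike Box-Cox, the transform is defined for negative inputs, and with lambda equal to 1 it leaves the data unchanged.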

lapis sequoia
#

Can someone help me to understand these two graphs? Both are comparison of built models, but one uses MAE and the other uses RMSE. I can't figure out why the difference between the models when I use MAE is much greater than when I use RMSE.

lapis sequoia
# trail rune Check out this article https://medium.com/human-in-a-machine-world/mae-and-rmse-...

Thank you for your reply @trail rune !

Do you believe then that one of the reasons for this difference would be because RMSE penalizes outlier errors more strongly than MAE?

That is, when analyzing with MAE I get the idea that the error frequency of some models is much higher than that of others. However, when analyzing using RMSE, I realize that although some models err more frequently (a conclusion drawn using MAE), the magnitude of the error, when looking with RMSE, is similar across all models.

Does this interpretation make any sense?
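
A toy illustration with made-up errors for two hypothetical models. It shows how two models can differ clearly in MAE yet tie in RMSE, because RMSE weights a single large error more heavily:

```python
import math

def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Made-up prediction errors for two hypothetical models.
model_a = [2.0, 2.0, 2.0, 2.0]      # many moderate errors
model_b = [0.0, 0.0, 0.0, 4.0]      # mostly exact, one large error

print(mae(model_a), rmse(model_a))  # 2.0 2.0
print(mae(model_b), rmse(model_b))  # 1.0 2.0
```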

mint palm
#

i am making a "video" anomaly detection algorithm, which can identify anomaly at segment level rather than video level( i mean it is able to categorise portions of video as anomalous rather than categorise whole video as anomalous)
I trained my autoencoder on normal videos (no portion is anomalous).
i am getting following results:

  1. When test set has normal videos(no portion is anomalous) + anomalous video(some portion is anomalous with some portions non-anomalous too),
    Model has AUC = 0.63 ish

  2. When test set has only anomalous video(some portion is anomalous with some portions non-anomalous too)
    Model has AUC = 0.51 ish (pathetic)

What can be the reason?

trail rune
wooden sail
serene scaffold
#

Hmm, what does norm mean in this context?

wooden sail
#

vector norm

#

p-norm, in particular

#

that illustration of norm balls in 2d (and also in 3d) gives an intuitive visualization of how distance is measured. the 2-norm is what you normally think of as "distance". with the 1-norm, you see that moving diagonally is kinda "further away"
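
A small numeric sketch of the same idea. The two vectors are illustrative: one step along an axis versus one step along each axis:

```python
def p_norm(vector, p):
    """Return the p-norm of a vector for p >= 1."""
    return sum(abs(v) ** p for v in vector) ** (1 / p)

straight = (1.0, 0.0)   # one step along an axis
diagonal = (1.0, 1.0)   # one step along each axis

for p in (1, 2):
    print(p, p_norm(straight, p), p_norm(diagonal, p))
```

Under the 1-norm the diagonal point is twice as far away as the axis-aligned one, while under the 2-norm it is only about 1.41 times as far.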

bold timber
#

Hello guys, do we need to scale the image first if we want to make predictions with the EfficientNetB0 model?

lapis sequoia
supple wyvern
#

URGHHH tensorflow not working on 3.11

dire falcon
#

Hello does anyone know any good learning resources for getting into this field?
I currently have a module on data science in my course however I am really struggling to keep up with the lecturers pace.
I'm looking for something like a youtube series/ free online course thats easy to follow.

supple wyvern
#

Firstly, which one? data science or AI?

dire falcon
#

well both but best start learning about data science no?

#

I am planning to do a machine learning based application for my final year project this year

supple wyvern
#

Well, I kinda only know things for AI

supple wyvern
dire falcon
#

But I'm quite disappointed in the lecturer's teaching method, the way she conveys the lectures is near impossible to understand, and her lab tutorials consist of copying code off her and she gets mad at you for not understanding it, not that she describes anything about it at all

dire falcon
supple wyvern
#

nice

#

I recommend tensorflow since it's like the biggest growing AI framework

dire falcon
#

i kind of got trapped in a tough situation by my thesis supervisor however, he suggested a project idea that well i cant really do so i need a new topic to do as well

supple wyvern
#

A youtube video that I'd recommend is 7hr tensorflow tutorial from freecodecamp

#

I'll get you the link

dire falcon
#

Learn how to use TensorFlow 2.0 in this full tutorial course for beginners. This course is designed for Python programmers looking to enhance their knowledge and skills in machine learning and artificial intelligence.

Throughout the 8 modules in this course you will learn about fundamental concepts and methods in ML & AI like core learning alg...

#

this one?

supple wyvern
#

yep

#

Sometimes things are hard to understand so you might have to watch that part a few times

dire falcon
#

ah yeah a course mate got me that, I was hoping to get something similar on the basics of data analytics.

supple wyvern
dire falcon
#

yeah i was considering doing chord classification for music?

#

since the project idea my supervisor proposed was to predict when a musician would reach a plateau in their mechanical skill but 💀 how do I get that data, couldnt find anything to really work with

supple wyvern
#

Have you heard of teachable machine?

dire falcon
supple wyvern
#

I think this will help a lot with your project if you don't want to create a model yourself

#

One of its functions is audio classification

#

I actually contributed to the image classification keras code snippet

dire falcon
#

one sec i have to meet with my supervisor i'll mention this haha

supple wyvern
#

Literally takes like 10 minutes to generate a model and you can test it before exporting the model, and I think it'll be good if you want to test out before actually making the model yourself.

dire falcon
#

i need enough content to write a 100 page thesis so 💀

supple wyvern
#

oh

supple wyvern
#

just for testing out your training data

dire falcon
#

yeah i'll have to look into it

#

brb

supple wyvern
#

It's an unofficial server tho

#

but really quickly growing

azure fern
#

Is there one for yolo?

supple wyvern
#

wdym yolo?

wooden sail
narrow flare
#

im looking to create a bot for a game. the game is 3d and u can walk around with the arrow keys.

the aim of the bot is to complete "tasks" in the game, which involve:

  1. reading off what the task is (writing detection)
  2. walk to where the task is telling you to go (with arrow keys)

Step 2 would involve some computer vision scheme so that the bot can see where it needs to go, and then I'd algorithmically tell the bot which arrows to press depending on where it sees the target location. So it detects the location via computer vision, but does the movement without any AI.

What im wondering is, what technologies would I need to learn to be able to do this? I already know basic tensorflow.

winged mason
#

Can anyone help me with this pytorch issue?

dire falcon
#

btw supervisor said chord classification is fine to go with so thumbsup

#

can start working on a prototype

supple wyvern
#

So I'd suggest once you get all your training data, train a prototype with teachable machine (You can save your progress) and when it works well, after changing epochs and all, train it properly with tf 🙂

boreal cape
#

hey does anyone know anything about topic modelling

merry ridge
#

Can anyone suggest a good resource on parallelization for databricks? My Google searches are getting clogged with a ton of low quality medium/towards data science posts that don't explain enough.

narrow flare
#

we dont know what the assistant does though xD

#

what does it do @lapis sequoia

#

lol

#

weird question but how old are u?

#

did you hard code all the responses of the robot lol

arctic wedgeBOT
rapid cedar
#

should i start with sklearn/matplotlib/tkinter

wicked wing
#

anyone here good with jupyter and matplotlib? getting a weird importlib error

#
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In [1], line 1
----> 1 get_ipython().run_line_magic('matplotlib', 'widget')
      3 import matplotlib.pyplot as plt

File ~/.cache/pypoetry/virtualenvs/U26YDnIW-py3.10/lib/python3.10/site-packages/IPython/core/interactiveshell.py:2309, in InteractiveShell.run_line_magic(self, magic_name, line, _stack_depth)
   2308 with self.builtin_trap:
-> 2309     result = fn(*args, **kwargs)
   2310 return result

File ~/.cache/pypoetry/virtualenvs/U26YDnIW-py3.10/lib/python3.10/site-packages/IPython/core/magics/pylab.py:99, in PylabMagics.matplotlib(self, line)
---> 99     gui, backend = self.shell.enable_matplotlib(args.gui.lower() if isinstance(args.gui, str) else args.gui)

File ~/.cache/pypoetry/virtualenvs/U26YDnIW-py3.10/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3473, in InteractiveShell.enable_matplotlib(self, gui)
-> 3473 pt.activate_matplotlib(backend)
   3474 configure_inline_support(self, backend)

File ~/.cache/pypoetry/virtualenvs/U26YDnIW-py3.10/lib/python3.10/site-packages/IPython/core/pylabtools.py:359, in activate_matplotlib(backend)
    357 from matplotlib import pyplot as plt
--> 359 plt.switch_backend(backend)
    361 plt.show._needmain = False

File ~/.cache/pypoetry/virtualenvs/U26YDnIW-py3.10/lib/python3.10/site-packages/matplotlib/pyplot.py:265, in switch_backend(newbackend)
--> 265 backend_mod = importlib.import_module(
    266     cbook._backend_module_name(newbackend))

File /usr/lib/python3.10/importlib/__init__.py:126, in import_module(name, package)
    124             break
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

(...)

File <frozen importlib._bootstrap>:1004, in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 'ipympl'
#

I'm running jupyter from inside a poetry virtual environment

#

for some reason it looks like importlib is escaping the virtual environment

desert oar
hushed kraken
#

Use a generic method from statistics that is independent of the timeseries to remove outliers in the data

mean = data.belpex.mean()
std = data.belpex.std()
n_std = 5
data.loc[data.belpex >= mean + n_std*std, 'belpex'] = mean + n_std*std
data.loc[data.belpex <= mean - n_std*std, 'belpex'] = mean - n_std*std

Does anyone know what the name of this method is? Would like to learn more about it
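
Capping values at the mean plus or minus a number of standard deviations is often described as clipping. A dependency-free sketch on a made-up series follows. Here `n_std` is set to 2 rather than 5 only because a single spike in such a short list cannot sit 5 standard deviations from the mean:

```python
from statistics import mean, stdev

# Made-up price series with one extreme spike.
prices = [40.0, 42.0, 38.0, 41.0, 39.0, 43.0, 40.0, 37.0, 44.0, 41.0, 500.0]

n_std = 2
mu = mean(prices)
sigma = stdev(prices)
lower, upper = mu - n_std * sigma, mu + n_std * sigma

# Cap values outside [lower, upper] at the nearest bound instead of dropping them.
clipped = [min(max(p, lower), upper) for p in prices]
print(clipped[-1], upper)
```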

hushed kraken
#

nvm

#

me stupid

torn elm
#

Hello.. my code is not working specifically for values 2.53 and 2.51

#

I am using Spyder( Python 3.9)

#

Can anyone please help

#

def changeMarker(value):
    if value >= 0 and value <= 5:
        amount = int(value*100)
        two_pound = amount // 200

        one_pound = amount % 200 // 100
        p50 = amount % 200 % 100 // 50
        p20 = amount % 200 % 100 % 50 // 20
        p10 = amount % 200 % 100 % 50 % 20 // 10
        p5 = amount % 200 % 100 % 50 % 20 % 10 // 5
        p2 = amount % 200 % 100 % 50 % 20 % 10 % 5 // 2
        p1 = amount % 200 % 100 % 50 % 20 % 10 % 5 % 2 // 1
    else:
        two_pound = -1
        one_pound = -1
        p50 = -1
        p20 = -1
        p10 = -1
        p5 = -1
        p2 = -1
        p1 = -1

    return two_pound, one_pound, p50, p20, p10, p5, p2, p1

value = 2.53
output = changeMarker(value)
print("output = {0}".format(output))
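
A likely cause (an assumption, since nobody in the log confirms it) is floating-point representation: `value * 100` can land just below the intended whole number, and `int()` truncates it. A sketch of the pitfall and one way around it, using `round` and a greedy `divmod` loop:

```python
value = 2.53

# Binary floats cannot represent 2.53 exactly, so the product can land just
# below the intended integer and int() then truncates it.
print(repr(value * 100), int(value * 100))

# Rounding to the nearest penny before converting avoids the truncation.
pence = round(value * 100)
print(pence)

# Greedy change-making with divmod, largest coin first (values in pence).
coins = [200, 100, 50, 20, 10, 5, 2, 1]
counts = []
for coin in coins:
    n, pence = divmod(pence, coin)
    counts.append(n)
print(counts)
```

The same `round` call could replace `int(value*100)` in `changeMarker` without changing anything else.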

wicked wing
desert oar
#

you might also want to install ipywidgets

wicked wing
#

I just had to specify it explicitly

#

in my pyproject.toml

desert oar
#

you intend to use the project's python venv to run jupyter, right?

wicked wing
#

yes exactly, everything's installed to the venv

desert oar
#

it's actually possible to have a single "central" jupyter installation that runs "kernels" from various envs/projects. but if you aren't doing that setup, i didn't want to complicate things

#

okay, in that case yes. you just need to install ipympl and i suggest ipywidgets as well

wicked wing
#

gotcha, thanks!

young granite
#

is there an automated method in pandas to drop cols/rows which got outliers ?

young granite
# young granite is there an automated method in pandas to drop cols/rows which got outliers ?
```
           1         2         3          4          5         6         7  \
0    29740.0   69277.0  189645.0  1321527.0   112478.0   19536.0    5413.0
1    57228.0   37776.0  148611.0        0.0    81654.0       0.0       0.0
2    21263.0   55671.0   51399.0        0.0   123019.0   57952.0   23970.0
3    71677.0   65626.0   49598.0  1017098.0   128965.0   42908.0   21552.0
4    41682.0   67693.0   34373.0        0.0   175257.0   82372.0   46864.0
5   123677.0   89131.0   41563.0   909706.0   229204.0   71436.0   42461.0
6    73058.0  225785.0       0.0  1327173.0   817648.0  165429.0  125564.0
7    23898.0   90253.0       0.0   610598.0   558249.0  102117.0   99471.0
8    86272.0  286587.0   23501.0   989984.0  1693514.0       0.0  166103.0
9   114224.0  167569.0  251141.0   463315.0   836308.0       0.0  115151.0
10       0.0    4029.0    6826.0   108047.0   101546.0       0.0    1879.0
11       0.0   47296.0    1487.0   200398.0   671665.0       0.0   39387.0
```
#

i wanted to remove the cols where the violin chart indicates outliers

desert oar
desert oar
young granite
#

first off, u are 100% right on the definition of outlier, however i can say that those are outliers due to the measurement method

desert oar
#

the definition of an outlier is entirely specific to your task. therefore pandas cannot possibly have a method for it.

young granite
#

i was thinking of (df-df.mean())<= df.std()

desert oar
#

in general, the technique is to construct some kind of equivalent drop_mask, which is a boolean Series with True corresponding to the rows to be dropped

#

if your df .index is set up intelligently, then you can also do

def standardize(y):
    return (y - y.mean()) / y.std()
df_std = df.apply(standardize)
drop_mask = (df_std >= 1).any(axis=1)

df.drop(df.index[drop_mask], inplace=True)
#

and of course there are many variations thereof
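
One such variation, sketched on made-up data. It restricts the check to a chosen list of columns and uses the absolute z-score, so that values far below the mean are caught as well. The threshold of 2 is arbitrary:

```python
import pandas as pd

# Made-up measurements; row 3 contains an extreme value in column "b".
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0],
    "b": [10.0, 11.0, 9.0, 100.0, 10.0, 11.0, 9.0, 10.0],
})

def standardize(y):
    return (y - y.mean()) / y.std()

cols = ["a", "b"]                 # columns to check; other columns are left alone
z = standardize(df[cols])

# abs() catches values far below the mean as well as far above it.
drop_mask = (z.abs() >= 2).any(axis=1)

cleaned = df.loc[~drop_mask].copy()
print(cleaned)
```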

young granite
desert oar
#

note the use of .copy to avoid the "setting on a slice" warning, if you intend to do further data manipulations

#

as always, think before copying and pasting. the usual caveats about untested code written by strangers apply.

#

actually i think you can just call standardize on the entire dataframe

#
df_std = standardize(df)
drop_mask = (df_std >= 1).any(axis=1)
young granite
#

nah i would need to allow it only for a range of cols

#

atm i got my input variables in there aswell

desert oar
#
cols = [ ... ]
df_std = standardize(df[cols])
young granite
#

let me try that real quick

desert oar
#

note also the use of .loc to select rows. i never use "plain" [] for selecting rows. too easy to make typos and get a weird result

young granite
#

makes sense

#

by that u would mean like this ?
df_7d.loc[:, 1: 42]

#
```py
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots


def standardize(y):
    return (y - y.mean()) / y.std()

df_std = standardize(df_7d.loc[:, 1: 42])
drop_mask = (df_std >= 3).any(axis=1)
df_std.drop(df_std.index[drop_mask], inplace=True)

fig = go.Figure()

trace = np.arange(0,43).astype("str")

for i in np.arange(1,43):
    fig.add_trace(go.Violin(
        name=trace[i],
        y=df_std[i],
        box_visible=True,
        meanline_visible=True,
    ))
fig.show()
```
#

well >=3 and still a mess 🗿

#

what did i just measure there 🐸

#

i guess i will only delete outliers manually, where i know that the values are faulty and leave everything else untouched and proceed with em

desert oar
#

you would use iloc to select columns by number

#

...unless your column names are actually numbers

young granite
#

indeed 😄

azure fern
#

Hello, who can help me run yolov7 locally on CPU in real time?

strong sedge
#

sorry for asking a math heavy question, but this has been bugging me for days

'''
lets take a single neuron

the output of this neuron is

y = wx + b

how do we go from this to updating the weight and bias by
dw = dy * x
db = dy
and how does the error backpropagate as
dx = dy * w
'''
#

when you take the partial derivative of y = wx + b
you get
dy = dw * x + dx * w + db

#

how does one go from this to what was written above ?
(Note the location of dy and dw, dx)
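
One way to read the notation, offered as a sketch rather than a full derivation: in backpropagation code, `dy`, `dw`, `db`, and `dx` usually stand for the derivatives of a loss L with respect to those quantities, not for differentials. The chain rule then gives dL/dw = dL/dy * x, dL/db = dL/dy, and dL/dx = dL/dy * w. The squared-error loss and the numbers below are assumptions made for illustration:

```python
def forward(w, x, b):
    return w * x + b

def loss(y, target):
    return (y - target) ** 2

# Arbitrary numbers for illustration.
w, x, b, target = 0.5, 3.0, 1.0, 2.0

y = forward(w, x, b)
dy = 2 * (y - target)   # dL/dy, the gradient arriving from the loss

# Chain rule: dL/dw = dL/dy * dy/dw, and dy/dw = x (likewise for b and x).
dw = dy * x
db = dy * 1.0
dx = dy * w

# Numerical check of dL/dw with a small central difference.
eps = 1e-6
numeric_dw = (loss(forward(w + eps, x, b), target)
              - loss(forward(w - eps, x, b), target)) / (2 * eps)
print(dw, numeric_dw)
```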

wooden sail
#

are you asking for the total derivative as a differential form?

mossy haven
#

Where do I start with reinforcement learning?

strong sedge
wooden sail
#

not that way, i would say

strong sedge
#

or give me some resource to read, I dont mind

wooden sail
#

if you want a full derivation of gradient descent, a bunch of stuff is needed

#

have you done any convex optimization?

strong sedge
strong sedge
wooden sail
#

hmm that's pretty far removed from the question you asked

strong sedge
#

I am hearing the term convex optimisation for the first time

#

I thought that machine learning is just multivariate calculus 🥲

wooden sail
#

that's what you'd have to read about

strong sedge
vast stirrup
#

anyone know if you're able to submit transparent png cutouts for images to be recognized in pyautogui?

strong sedge
strong sedge
vast stirrup
strong sedge
#

opencv is ai related so its all good :D

wooden sail
#

aha, it actually let me send it

#

this is a notebook i compiled for the students in our lab. it's an intro to gradient descent, maybe you'll find it useful

#

most of the stuff is explained there, except for one detail involving taylor's theorem

wooden sail
#

also some details regarding wirtinger calculus are just taken at face value. i don't recall if i provide a reference to that, but it should be fairly easy to find it in google under "wirtinger calculus" or "C-R calculus"

#

important to keep in mind when dealing with complex-valued functions but only requiring them to be real-differentiable

#

and the books by boyd are great for convex optimization. i think i referenced those extensively there

strong sedge
wooden sail
#

well, the short answer is that you need linear algebra and multivariable calculus to show how and when gradient descent works

#

and later on statistics to show how it works in machine learning with stochastic gradients (not covered in the pdf)

merry pike
#

Hello community, can you see this project and give me your feedback

mossy haven
strong sedge
strong sedge
mossy haven
mossy haven
strong sedge
#

np :D

young granite
#

can i get input/output correlations with sklearn aswell or is that more DL with TF or PT?

#

or in other words i struggle a bit to approach my datasets
i got 4 datasets each containing 12rows*42cols and 9 input values for each of the 4 datasets

storm kelp
#

What is the appropriate way to use spark.cache? do I need to uncache?

serene scaffold
verbal venture
#

hey, what's the risk of having your training data accuracy to 1.0 but not your test data

serene scaffold
strong sedge
merry pike
past prawn
#

So I'm pre-processing my data for machine learning training. I can't figure out if I should take care of the outliers in the dataset first or the missing data. I googled it and there were mixed opinions. I think it would make sense to remove the outliers first as that would mean the imputed data would be unaffected by the outliers. Any thoughts?
(I'm a beginner so don't know a whole lot)

storm kelp
#

Depends entirely on why you're considering them outliers. Are those data points likely errors or just extreme values?

verbal venture
#

how do you sum a specific row in numpy? or do I need to use pandas for it?

#

nvm got it, thanks
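For anyone landing on the same question, a minimal sketch with a made-up array; plain NumPy indexing is enough and pandas is not required.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6]])

row_total = arr[1].sum()          # sum of the second row only
all_row_totals = arr.sum(axis=1)  # one sum per row

print(row_total)       # -> 15
print(all_row_totals)  # -> [ 6 15]
```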

past prawn
willow hedge
#
narrow flare
#

hey guys im trying to learn openCV for python, but most of the tutorials i can find for it use non-ML computer vision algorithms

#

can someone direct me to a resource which explains how to use ML models for opencv

#

like suppose i already have the model, i just wanna use it in openCV

#

i understand that this is mainly a matter of syntax but i cant find any good resources

#

oh i didnt realise i need to be looking for tutorials for the DNN module

novel python
#

guys, I have a dataframe with around 15k rows, and I wanted to run a linear regression for every row, is there an "easy" way to do it?

novel python
#

I'm trying to run it through the whole dataset row by row using the following code:

    grid_model.fit(X_train.iloc[i], y_train[i])
    y_pred = grid_model.predict(X_test.iloc[i])
    predictions.append(y_pred)
    rmse_errors.append(mean_squared_error(y_test[i], y_pred, squared=False))
    print(i)```
#

but I'm getting "TypeError: Singleton array 0.389 cannot be considered a valid collection."

#

that's the value of the first y_train, not sure why I'm getting this error, already looked up on google

lapis sequoia
#

im not sure im new to this stuff

rugged comet
#

When using the tensorflow functional API, I have three input layers. I'm trying to create the next layer for one of the inputs which is a Normalization layer. I don't understand why I would call normalization.adapt on the raw training data instead of on the input layer.

converted_mana_cost_inputs = keras.Input(shape=x_train_converted_mana_cost_input_shape)

# Normalize the converted mana costs
normalization = layers.Normalization()
normalization.adapt(converted_mana_cost_inputs)

I get this error
https://pastebin.com/qv7pL8s9
when calling adapt on the input layer.

verbal venture
#

can z and t tests only be used if the underlying distribution is normal?

pine yoke
#

is there a specific model that would lend itself to getting bboxes and classes for this dataset?

#

each square = 1 input image

#

i thought about yolov5 but it seems overkill, wonder if there's any general ideas

night sequoia
strong sedge
hazy cosmos
#

Hi everyone, my name is Gladin, from Kerala, India. I am a Data Scientist. I am at the moment seeking a new job opportunity. Excited to learn more and grow. Kindly add me on LinkedIn everyone, I am open to endorsing everyone's skills on LinkedIn. Let's network: https://www.linkedin.com/in/gladin/

cinder schooner
#

hello, would anyone recommend a good book to start reinforcement learning?

hybrid plaza
#

Hello! Basic Pandas question.. How do I access a column (Series) by index (rather than by name)?

strong sedge
#

You can also check .at

hybrid plaza
#

.iloc accesses a row. I'm trying to access a column.

#

df['a'] works, df[0] does not.

#

My current workaround is series = [serie for _name, serie in data.items()], then series[0], but it feels like there ought to be a better way.
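A minimal sketch of positional column access, using a made-up frame named `data` to match the workaround above. `.iloc` takes a row indexer and a column indexer, so passing `:` for the rows selects a whole column by position.

```python
import pandas as pd

data = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

first_col = data.iloc[:, 0]  # all rows, column at position 0 (returned as a Series)

print(first_col.name)      # -> a
print(first_col.tolist())  # -> [1, 2, 3]
```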

main fox
hybrid plaza
#

Thank you!

mint palm
#

which is better?

minor coral
#

hi

#

does anybody knows how mnist dataset works?

#

Im currently trying to create cgan but with my own dataset, but I dont know how to implement it

serene scaffold
# minor coral does anybody knows how mnist dataset works?

the dataset itself doesn't "work". it's just there. it's the model that actually does something with the dataset.

you can use MNIST to make a character recognition model, and you can get good results doing that with just a basic (feed forward) neural network. The code for doing it should be basically the same even if you're using a dataset where the only difference is the letter/number system

minor coral
#

Can I create a dataset that matches the format of the mnist?

#

Our professor requires us to use cgan specifically for this

serene scaffold
minor coral
#

Like I want to use the cgan for generating the image

#

conditional generative adversarial network

#

i saw this code but I dont know how to do the "sudo" something

serene scaffold
#

"sudo" is a linux command. it's not really relevant to what you're trying to do, conceptually speaking.

desert oar
#

not only is it completely irrelevant, but it's likely that you will break your system if you run "sudo" commands without understanding them

minor coral
#

I mean, is there any counterpart to windows os for this?

desert oar
#

it's the Linux equivalent of messing around in C:\Windows with administrator access turned on

desert oar
# verbal venture can z and t tests only be used if the underlying distribution is normal?

yes, but be careful about what you mean by the "underlying distribution".

any hypothesis test requires the test statistic to follow a particular probability distribution when the null hypothesis is true.

for example, consider the "welch's T test" for differences in means in independent samples. the data itself does not need to be normally distributed, because there is a more general set of conditions under which the test statistic follows the T distribution.

#

in particular, you only need the sample mean to be (approximately) normally distributed, which holds in samples that are "big enough" as long as the data has finite variance, as per the central limit theorem

#

if you have not yet wrapped your head around the concept of a sample mean being a random variable with its own probability distribution, spend the time to do so
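A small simulation can make the idea concrete. This is only a sketch under assumed settings (exponential data, samples of size 100, 20,000 repeated experiments); none of these numbers come from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Exponential(1) data: strongly right-skewed, with mean 1 and standard deviation 1.
n, repeats = 100, 20_000
samples = rng.exponential(scale=1.0, size=(repeats, n))
sample_means = samples.mean(axis=1)  # one sample mean per repeated experiment


def skewness(x):
    # Third standardized moment; 0 for a symmetric distribution.
    x = np.asarray(x)
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)


print(skewness(samples.ravel()))  # raw data: clearly skewed
print(skewness(sample_means))     # sample means: much closer to symmetric
print(sample_means.std())         # close to 1 / sqrt(n), the standard error of the mean
```

Each row plays the role of one study; the collection of row means is an empirical picture of the sampling distribution that the t statistic relies on.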

tiny wadi
#

Anyone know a fast way of changing list like [1,2,4,1] into representation in the form [[1,2,3,4]]?

Output like this:
[[1,0,0,1],[0,1,0,0],[0,0,0,0],[0,0,1,0]]?

tiny wadi
serene scaffold
tiny wadi
serene scaffold
#

can you give the actual input that is intended to produce [[1,0,0,1],[0,1,0,0],[0,0,0,0],[0,0,1,0]]?

#

sorry, misread

#

one moment

#

seems like the relationship is arbitrary?

#

why does 1 become [1,0,0,1] in the first element, and then [0,0,1,0] for the last one?

tiny wadi
#

arr = np.array([1,2,4,1])
split_list = []
for i in range(1, 5):
    split_list.append(np.multiply(arr == i, 1))

#

its like a dictionary, {1:[1,0,0,1],2:[0,1,0,0],3:[0,0,0,0],4:[0,0,1,0]}

serene scaffold
#

ah, I see now

#

!e @tiny wadi this would be the idiomatic way to do it

import numpy as np
arr = np.array([1, 2, 4, 1])
index = np.arange(1, 5)
result = (arr[None, :] == index[:, None]).astype(int)
print(result)
arctic wedgeBOT
#

@serene scaffold :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | [[1 0 0 1]
002 |  [0 1 0 0]
003 |  [0 0 0 0]
004 |  [0 0 1 0]]
tiny wadi
serene scaffold
minor coral
#

hii, is there any way to make the dataset ( above) to be the same format with the dataset below?

serene scaffold
#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

young granite
#

@serene scaffold may i ask u what kind of sklearn algo. would suit in ur opinion for a dataset of 12*42 with inputvalues from 0-3

serene scaffold
young granite
serene scaffold
#

it would also be helpful to know what the data represents.

young granite
#

integrated areas in this form:

           1        2         3          4         5        6       7       8  \
0    29740.0  69277.0  189645.0  1321527.0  112478.0  19536.0  5413.0  1423.0   
1        0.0      0.0    2555.0    54682.0    6512.0      0.0   547.0     0.0   
2        0.0   1352.0    4098.0    40962.0    1275.0      0.0     0.0     0.0   
3        0.0      0.0    1776.0    36531.0    1509.0      0.0   787.0     0.0   
4        0.0      0.0     759.0    28094.0    1905.0      0.0   386.0     0.0   
..       ...      ...       ...        ...       ...      ...     ...     ...   
325      0.0      0.0    3388.0    21471.0    1115.0      0.0     0.0     0.0   
326      0.0      0.0    2897.0    23324.0       0.0      0.0   820.0     0.0   
327      0.0      0.0       0.0    23832.0     852.0      0.0     0.0     0.0   
328      0.0      0.0       0.0    21121.0       0.0      0.0     0.0     0.0   
329      0.0      0.0       0.0    21031.0       0.0      0.0     0.0     0.0  
#

as u can see in this full dataset the area values drop after certain time thats why i reduced it to:

          1         2          3          4          5         6         7   \
56    21263.0   55671.0    51399.0        0.0   123019.0   57952.0   23970.0   
21   112953.0   39277.0   454261.0   442966.0    79459.0       0.0    7731.0   
42    16039.0  681685.0   119236.0  1595052.0   196827.0       0.0  109792.0   
267   81984.0  117635.0     3743.0   564249.0  1004721.0       0.0  127240.0   
225  114224.0  167569.0   251141.0   463315.0   836308.0       0.0  115151.0   
87    35274.0       0.0  7149357.0  1106840.0   158358.0   69680.0   24107.0   
112  123677.0   89131.0    41563.0   909706.0   229204.0   71436.0   42461.0   
309       0.0       0.0     1603.0   230084.0   781602.0       0.0       0.0   
99    53284.0   72158.0    31252.0  1341475.0   347423.0   77789.0   33366.0   
..       ...      ...       ...        ...       ...      ...     ...     ... 
211   24247.0  120011.0        0.0   860222.0   781548.0  117812.0  107597.0   
204   91321.0   90479.0        0.0   774805.0   595667.0  112264.0   79113.0   
35    38419.0       0.0  7992028.0        0.0    86738.0       0.0       0.0   
75    57301.0   68681.0    96929.0  1190922.0   159876.0   62375.0   24785.0   
232   86978.0  403606.0     2730.0  1539340.0  2215212.0       0.0  361130.0   
7     57228.0   37776.0   148611.0        0.0    81654.0       0.0       0.0   
302       0.0    1647.0     1304.0   115092.0    96582.0       0.0    2265.0   
140   78284.0       0.0  5966734.0  1125559.0   263598.0  116030.0   64898.0   
14    23068.0   98709.0    58554.0  1329078.0   118384.0   19615.0   15860.0   
...
190   88009.0   896297.0   2147.0      0.0      0.0       0.0  
260  266873.0   646077.0  25561.0      0.0  48154.0       0.0  

[38 rows x 42 columns]```
desert oar
#

moreover, asking the question of "which algo in scikit-learn do i use" generally suggests that you don't actually know what the various algorithms do and how they work. that's not a good way to do things.

bold timber
#

Hello guys, Does anyone clearly understands about efficientnetb0 model?

noble grove
#

Looking for a way to extract some keys and values from a dictionary then replacing the values. Also concatenating two other key values. The dictionary also has a nested dictionary inside. Any tutorial I can go through? Thanks for the help.

storm kelp
#

I have a dataframe with columns a, b, and c. I want to group by a and b, and then find and keep the row with the minimum value for c.
I have a solution that kinda works but it isn't able to break ties. If there is a tie I'm not bothered I just want it to take the first occurrence of the minimum value. I'm stuck trying to figure this out though.

#

Is there a way to group by a and b, then order by c and then just retain the top row for each grouping?

noble tusk
#

Can you send what the DataFrame looks like after you do those operations? When there's a tie

#

Cos you might be able to just use iloc but I wanna make sure

storm kelp
#

Will iloc work after running a groupby().orderby()? @noble tusk

#

If so that would actually be much more simple than my current method lol

#

I'm working within PySpark if that matters

noble tusk
#

I know group_by() does at least

#

You may also need to use .reset_index() to reset the indices to start from 0, then do .iloc[0]

storm kelp
#

Hopefully it runs within PySpark

young granite
noble tusk
storm kelp
noble tusk
#

If you pass 0 it should be the 0th row. If that doesn't work it's probably one-indexed so pass 1 instead

wispy coyote
# young granite i was asking for something like this more like a best practice approach i do now...

All of these algorithms do very very different things. It highly depends on what you want the data to do for you. Do you want to take that information and make decisions about what to do next? Classification is great. Do you want to take a point you're really interested in, and find points you might also be interested in? Clustering might be a way to go. Does the data change over time and do you want to know what some of it will be in a few time stamps? Regression is helpful. Do you have a bunch of data, much of which could be summarized into a smaller group, and then do analysis on that? Dimensionality reduction is helpful for that.

All of my prompts are just a subset of the ways you can use those four groups, but they all allude to the fact that you need to want to do something with the data before you start asking about algorithms to achieve that something. You have a bunch of something, but you need to want to do something with it. Otherwise it's just noise. There's lot of signals, but which signal do you want?

desert oar
young granite
#

@wispy coyote @desert oar well first of all thanks that u took the time and explained it in more depth to me.
Im new to the field of DS and therefore appreciate it even more!
So u got any sources other than https://scikit-learn.org/stable/user_guide.html
to get more in touch with the procedure?

THANKS

dense lagoon
#

anyone here good with using annotated images to find location of data that you want to grab from a image based on the label?

serene scaffold
weary goblet
#

Hey there,
Wanted to ask, when do i branch off away from python to learn Data science?
as in is there a milestone or a specific topic?

storm kelp
serene scaffold
dense lagoon
#

do you guys think keras would be best and fast enough to process the year and mint of a coin?

serene scaffold
#

Though I would suggest that you use pytorch.

#

I'm assuming this is image classification?

dense lagoon
#

yea and I wanted pytorch also, but my buddy is suggesting keras

serene scaffold
dense lagoon
#

Yea even tesla I believe

#

if my datasets are fairly small, no more than 10k, keras should be perfectly fine yea?

weary goblet
weary goblet
serene scaffold
#

The size of the dataset is irrelevant for deciding which neural network library to use. They both let you train neural networks.

storm kelp
serene scaffold
lean hawk
#

Hey, Do i have to learn something before learn data science (I mean when you already know how to program)?

#

Like machine learning or something like that?

serene scaffold
weary goblet
serene scaffold
weary goblet
#

Dayum, that hopefully should be in the works.

#

Well thank you, i now know what to look out for!

dense lagoon
#

better processing and predictions too

serene scaffold
dense lagoon
#

really? i heard many stories of keras being annoying to deal with when it comes to that

serene scaffold
#

being annoying for the programmer to use has nothing to do with what the weights of the model are, or what the outputs are for the same input.

serene scaffold
arctic wedgeBOT
#
Resources

The Resources page on our website contains a list of hand-selected learning resources that we regularly recommend to both beginners and experts.

lean hawk
#

Because there are a lot of fields

#

ok

#

thanks

serene scaffold
lean hawk
torn elm
#

Hello

#

I am trying to solve a multiple linear regression model and I am getting R square as 1

#

The actual y and predicted y are same

#

Could anyone help me understand what happens or in which scenario this happens

storm kelp
lapis sequoia
storm kelp
lapis sequoia
lapis sequoia
#

Lemme read.

lapis sequoia
#

and lemme think about it.

storm kelp
#

So I was suggested df.groupby('A', 'B').orderby('C').iloc[0]

#

I can understand the logic of why that would do what I want, I'm just not sure if Python will actually work with that logic

lapis sequoia
lapis sequoia
storm kelp
#

Haven't tried it yet - need to wait till tomorrow when I'm back on my work computer

dull fern
#

Hey, I have multiple neural networks that solve the same problem, I would like to know if their predictions could be combined to improve the overall performance. How would you do that ? Any specific plot that could give me a good insight ?

lapis sequoia
#

Also may be you could do better than orderby. Orderby is kinda sorting, finding min is O(n) and sorting is O(nlogn)

strong cairn
#

hello guys
about A.I does anyone have any experience?

arctic wedgeBOT
#

@lapis sequoia :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 |         Max Speed  d
002 | Animal              
003 | Falcon      370.0  1
004 | Parrot       24.0  4
lapis sequoia
lapis sequoia
storm kelp
strong cairn
#

and how they could interact with a web interface

lapis sequoia
strong cairn
lapis sequoia
#

!e

import pandas as pd
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 370., 24., 26.],
                   'd': [1,2,3,4,5]})
print(df)
print('-'*20)
print(df.loc[df.groupby('Animal')['Max Speed'].idxmin()].reset_index(drop=True))
#

perfect.

arctic wedgeBOT
#

@lapis sequoia :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 |    Animal  Max Speed  d
002 | 0  Falcon      380.0  1
003 | 1  Falcon      370.0  2
004 | 2  Falcon      370.0  3
005 | 3  Parrot       24.0  4
006 | 4  Parrot       26.0  5
007 | --------------------
008 |    Animal  Max Speed  d
009 | 0  Falcon      370.0  2
010 | 1  Parrot       24.0  4
lapis sequoia
#
df.groupby('Animal') # you'll group by 2 cols here
df.groupby('Animal')['Max Speed'].idxmin() # row label of the minimum 'Max Speed' within each group (first occurrence on ties); we take the label rather than the value because the row may have more fields

df.loc[df.groupby('Animal')['Max Speed'].idxmin()]
# just taking those rows from original df

and at the end resetting index.
storm kelp
#

Is it just the result df will have strange indexes?

lapis sequoia
#

!e

import pandas as pd
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 370., 24., 26.],
                   'd': [1,2,3,4,5]})
print(df)
print('-'*20)
print(df.loc[df.groupby('Animal')['Max Speed'].idxmin()].reset_index())
arctic wedgeBOT
#

@lapis sequoia :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 |    Animal  Max Speed  d
002 | 0  Falcon      380.0  1
003 | 1  Falcon      370.0  2
004 | 2  Falcon      370.0  3
005 | 3  Parrot       24.0  4
006 | 4  Parrot       26.0  5
007 | --------------------
008 |    index  Animal  Max Speed  d
009 | 0      1  Falcon      370.0  2
010 | 1      3  Parrot       24.0  4
storm kelp
#

ah ok

lapis sequoia
#

see, now you have extra column for index, if drop not provided.

storm kelp
#

Thanks for your help - I'll let you know tomorrow if it works in pyspark in the same way effectively

storm kelp
# lapis sequoia Sure!

somewhat ironic - the 'solution' I found after trawling through stackoverflow was needlessly complicated and didn't actually work if there were rows tied. This solution seems much simpler and less computationally intensive

#

discord + documentation > stackoverflow
haha

lapis sequoia
# storm kelp somewhat ironic - the 'solution' I found after trawling through stackoverflow wa...

Honestly it's about how we google a lot of times, I would be lying if I said I did not use stackoverflow, that's the link

https://stackoverflow.com/questions/54470917/pandas-groupby-and-select-rows-with-the-minimum-value-in-a-specific-column

P.s. Now I know how to tackle this since I read whole thing and put the example here.

storm kelp
quaint plover
#

I have a somewhat datascience related question on help-coconut on intersection of sets if anyone has 5 min

novel python
#

what's the easiest way to count how many times a column has a minimum value when compared to 9 other columns in a dataframe?
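One possible approach, sketched on a made-up three-column frame (the real case would have ten columns): find the column holding each row's minimum with `idxmin(axis=1)`, then count how often it is the column of interest. The column names here are hypothetical.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1, 5, 2, 9],
    "b": [3, 4, 2, 1],
    "c": [2, 6, 7, 8],
})

# Per row, the label of the column holding the smallest value
# (the first such column when several are tied).
row_min_col = df_demo.idxmin(axis=1)
times_b_is_min = int((row_min_col == "b").sum())

# Tie-inclusive alternative: count rows where "b" equals the row minimum.
times_b_ties_included = int((df_demo["b"] == df_demo.min(axis=1)).sum())

print(times_b_is_min)         # -> 2
print(times_b_ties_included)  # -> 3
```

The two counts differ only on tied rows, so which one is wanted depends on whether a shared minimum should count.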

serene scaffold
hasty mountain
#

Hey guys, which neural network structure tends to be more stable? A model that outputs floats between -1 and 1, or a model that outputs integers(an index to a list)?

PS: the list can have an index like 1500

serene scaffold
hasty mountain
# serene scaffold what does the model in question do?

Model 1: Outputs a number which will be used to get a key in a dictionary to return a proper response(since it's a RL model, it returns a command for a game)

Model 2: Outputs a number which will serve as an index to get a string in a list with the proper response(the command in question)

serene scaffold
serene scaffold
hasty mountain
#

Hm... I see...
I'm actually thinking about making the model play and create the data as it plays.
It'll receive a frame from the game as input, generate a random output, and then get a reward for that.
If the reward is good, that frame+action will become the dataset for a supervised learning

serene scaffold
#

anyway, normalizing the range for the output (which is what you were getting at with the -1, 1 thing) is often good, but you can't do that if you're treating each option as discrete

#

and if the output is a dict key, that is discrete.

hasty mountain
#

Oh, I see...

#

I always used a normalized range for my output, so I don't know for sure the consequences of not doing so.

serene scaffold
#

that's fine for things that are continuous

hasty mountain
#

And I'm doing this model based on NLP...and in NLP, the output isn't normalized, at least as far as I've seen

hasty mountain
serene scaffold
# hasty mountain Like RGB images?

yes, RGB values are continuous, because a pixel can have any amount of each color from 0 to 1. and 0.880000000001 is meaningfully different from 0.89

hasty mountain
#

Oh, I see

turbid arch
serene scaffold
turbid arch
#

Man

#

Now I know

#

Because you answered me. Thank you

hasty mountain
#

I'm doing this, actually. It's helpful, but I don't know if this affects the performance

serene scaffold
#

again, continuous vs discrete.

hasty mountain
#

I see...but why? if the model outputs 0.88, which is closer to 0.89 than to 0.71, then it wouldn't be a problem to consider it a 0.89, right?

serene scaffold
hasty mountain
#

The only problem is fitting the KNN to big dictionaries...that take quite a long time

serene scaffold
#

I'd be really surprised if you can get good performance doing that.

hasty mountain
serene scaffold
#

pretty much

hasty mountain
#

Hm... Good to know. Then I'll double check my testing process...

hasty mountain
# serene scaffold pretty much

Also...tell me something... What is the difference between using an Embedding layer with...let's say...a matrix of size 10 and output of size 1, and using a fully connected layer which receives 10 features and outputs 1 feature?

#

For this model, I was thinking about using embedding layers, but I don't see how much this would benefit the model in relation to a dense layer
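A rough way to see the relationship, sketched with plain NumPy rather than any particular framework: an embedding is a table lookup by integer index, which gives the same result as a bias-free dense layer applied to a one-hot vector. The sizes (10 entries, output size 1) follow the question; the table values are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
num_items, dim = 10, 1
table = rng.normal(size=(num_items, dim))  # embedding matrix: one row per index

idx = 7
embedded = table[idx]  # embedding layer: a row lookup by integer index

one_hot = np.zeros(num_items)
one_hot[idx] = 1.0
dense_out = one_hot @ table  # bias-free dense layer applied to the one-hot vector

print(np.allclose(embedded, dense_out))  # -> True
```

The practical difference is in the input: an embedding expects a single categorical index, while a dense layer with 10 inputs expects 10 real-valued features, so the two only coincide when those features are a one-hot encoding.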

serene scaffold
hasty mountain
#

Or in the middle...

novel python
hasty mountain
#

But never in the ending

serene scaffold
#

is "embedding layer" a keras-specific term? because that would explain why I haven't heard of it.

#

are you using keras @hasty mountain?

hasty mountain
#

Uh...it's used for keras and for pytorch

#

I'm actually using Pytorch

#

Perhaps you might know it as embedding matrix

serene scaffold
#

what kind of neural network is this?

hasty mountain
#

Mine? Or the embedding?

serene scaffold
#

the network that you're making

hasty mountain
#

It gets a frame from a game, decomposes it through convolutions and, in the end, it passes through a linear layer to get an output value which corresponds to an action to be performed in the game

#

A Reinforcement Learning algorithm

serene scaffold
#

how many linear layers do you have?

hasty mountain
#

Just one

serene scaffold
#

that's probably not going to be enough

hasty mountain
#

Why?

serene scaffold
#

more layers means more memory capacity for the model, and more opportunity to learn subtle relationships between inputs

hasty mountain
#

But there's 10 convolution layers

#

And maxpooling after each 2 convs

serene scaffold
#

sure, but aren't convolutions and maxpools just "distilling" the image? if you only have one linear layer, you're still saying that once the image is "distilled", the relationship between it and what you're trying to learn can be learned with one transformation.

hasty mountain
#

Hm... Well, the convolutions and maxpools serve as feature extractors

#

At least in VGG, they use convs + maxpools as feature extractors and, then, 2 linear layers if I'm not mistaken

#

I'm just using fewer convs and a single linear layer

#

So, the model will extract features from the image, and, based on the most relevant features, will generate an output...which can be a value from a dictionary, or, as I'm considering now, a value that will be converted to an integer and then be used as a list index

weary crown
#
import pandas as pd
import numpy as np
from numpy import sqrt
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from nptyping import NDArray, Int, Shape
import pickle

# read in csv into dataframe
df = pd.read_csv(r"C:\Users\josmo\Downloads\creditcard.csv")

target = df['Class']
df.pop('Class')

scaler = MinMaxScaler(feature_range=(-1, 1))

# feature scale each column
for column in df.columns:
    scaler.fit(df[column].values.reshape(-1, 1))
    df[column] = scaler.transform(df[column].values.reshape(-1, 1) + 1e-4)

data_train, data_test, target_train, target_test = train_test_split(
    df, target, test_size=0.2, random_state=42)

tree_reg = DecisionTreeRegressor()
tree_reg.fit(data_train, target_train)

# Testing
housing_predictions = tree_reg.predict(pd.concat([data_test, target_test]))

# RMSE evaluation
lin_mse = sqrt(mean_squared_error(target_test, housing_predictions))
print(f"Loss: {lin_mse}")

# Cross Validation
scores = cross_val_score(tree_reg, target_train, target_test, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = sqrt(-scores)

# Display Cross Validation results
def display_scores(scores):
    print(f"Scores: {scores}\nMean: {scores.mean()}\nStandard Deviation: {scores.std()}")```
#

so um

#
C:\Users\josmo\PycharmProjects\FraudDetection\venv\lib\site-packages\sklearn\utils\validation.py:1858: FutureWarning: Feature names only support names that are all strings. Got feature names with dtypes: ['int', 'str']. An error will be raised in 1.2.
  warnings.warn(
C:\Users\josmo\PycharmProjects\FraudDetection\venv\lib\site-packages\sklearn\base.py:450: UserWarning: X does not have valid feature names, but DecisionTreeRegressor was fitted with feature names
  warnings.warn(
Traceback (most recent call last):
  File "C:\Users\josmo\PycharmProjects\FraudDetection\main.py", line 32, in <module>
    housing_predictions = tree_reg.predict(pd.concat([data_test, target_test]))

ValueError: Input X contains NaN.
DecisionTreeRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values```
serene scaffold
weary crown
#

I somehow have NaN values but I used df.dropna() - but it didnt work and i still dont know where i get nan values from??

#
df.isnull().sum().sum()``` used this to count NaN values in the df but it printed 0... so where the error is from??
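One possible source of the NaN values, judging only from the traceback: `pd.concat([data_test, target_test])` stacks the target Series underneath the feature frame, so every cell that exists in only one of the two inputs is filled with NaN. The warning about mixed `int` and `str` feature names would be consistent with that. A minimal sketch with made-up data (`X` and `y` are hypothetical stand-ins):

```python
import pandas as pd

X = pd.DataFrame({"V1": [0.1, 0.2, 0.3], "V2": [1.0, 2.0, 3.0]})  # hypothetical features
y = pd.Series([0, 1, 0], name="Class")                              # hypothetical target

stacked = pd.concat([X, y])  # default axis=0: y is appended as extra rows below X

print(stacked.shape)               # twice as many rows as X, plus an extra column
print(stacked.isna().sum().sum())  # non-zero: NaN was introduced by the concat
print(X.isna().sum().sum())        # -> 0, the feature frame itself is clean
```

If that is what is happening, calling `dropna()` on `df` cannot help, because `df` never contained NaN; passing only the features to `predict` (for example `tree_reg.predict(data_test)`) would avoid building the mixed frame in the first place.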
hasty mountain
serene scaffold
serene scaffold
weary crown
#

also, what does that have to do with naN values?

serene scaffold
hasty mountain
#

(Also, when using tensor.long, Pytorch rounds 1.0005 to 1...and 0.0095 to 0)

serene scaffold
serene scaffold
#

!paste

arctic wedgeBOT
#

Pasting large amounts of code

If your code is too long to fit in a codeblock in Discord, you can paste your code here:
https://paste.pythondiscord.com/

After pasting your code, save it by clicking the floppy disk icon in the top right, or by typing ctrl + S. After doing that, the URL should change. Copy the URL and post it here so others can see it.

weary crown
#

so i fit it on the first column only?

chrome lake
#

Can't post my code.

serene scaffold
chrome lake
#

It's not a code related isue anyways

serene scaffold
# chrome lake Can't post my code.

well, no one wants to help over DMs until they know for sure what the question is. because people don't want to get DMs that they have to read before finding out if they can do anything with it.

weary crown
#

i have 10 features - are there 10 different encoder methods??

serene scaffold
hasty mountain
weary crown
#

Min maxing everything should be okay

#

i just need to fix the Nan error which isnt caused by min maxing

#

this dataset is large so heavily reduced sampling noise so i have leeway

#

this is my first real ml project sorta thingy

serene scaffold
#

so each feature, which has its own min and max, needs its own minmaxscaler.
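
As a hedged aside: scikit-learn's `MinMaxScaler` already keeps a separate min and max for every column it is fit on, so a single instance fit on the whole frame should give the same numbers as one scaler per column. A minimal sketch on a made-up two-column frame (the column names and values are assumptions for illustration, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy data, purely illustrative
toy = pd.DataFrame({"amount": [1.0, 5.0, 9.0], "time": [100.0, 200.0, 400.0]})

# one scaler for the whole frame: min and max are tracked per column
whole = MinMaxScaler(feature_range=(-1, 1)).fit_transform(toy)

# one scaler per column, as in a column-by-column loop
per_column = np.column_stack([
    MinMaxScaler(feature_range=(-1, 1)).fit_transform(toy[[col]]).ravel()
    for col in toy.columns
])

print(np.allclose(whole, per_column))
```

Either way works; the single-scaler form is simply less bookkeeping when the same scaling has to be re-applied to new data.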

weary crown
#

yeah

#

so thats what im doing by resetting each column right

hasty mountain
serene scaffold
weary crown
# serene scaffold you need separate instances of MinMaxScaler
# feature scale each column
for column in df.columns:
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaler.fit(df[column].values.reshape(-1, 1))
    df[column] = scaler.transform(df[column].values.reshape(-1, 1) + 1e-4)``` like this? The NaN error is still there... :((
serene scaffold
queen holly
#

I could use some advise on a dataset that I have to be able to slice and filter, essentially it is a collection of message types (in the hundreds of variants) each message having a different set of fields / attributes. As an example

#

Am I better off to keep this as a single data frame and turn into something that has every possible attribute as columns

#

or do I convert this into a DF of DFs and manage every message as it's own DF

serene scaffold
queen holly
#

yeah I was investigating that too

#

my thought is that I'm going to be wanting to split up the attribute:data such that each attribute will be it's own column

#

since each message_type has its own collection of attributes then my only option would be to create one master list of all attributes (possibly hundreds) as a superset of attributes across all message types...

desert oar
#

the truth is that you really don't, at least not at first

#

i also want to be clear that i'm not trying to resist making a recommendation here. but i legitimately don't know enough about your problem to recommend something

#

i could say that in general you have a few options for doing regression on an unknown dataset of sufficient size and "density" of data: GAM, random forest, gradient boosting, shallow feedforward NN

#

if you're going to study up on algorithms, those are the ones to consider if you are just trying to predict something with minimal error

#

(by "density" i mean that you have good coverage across the range of the data within your dataset)

#

but "minimize prediction error according to one specific metric" is usually not a useful goal except as a study tool and/or in well-defined business automation problems

minor coral
#

Hello, does anyone knows how to create cgan with custom dataset?

arctic wedgeBOT
#

Hey @undone mirage!

It looks like you tried to attach file type(s) that we do not allow (.pdf). We currently allow the following file types: .gif, .jpg, .jpeg, .mov, .mp4, .mpg, .png, .mp3, .wav, .ogg, .webm, .webp, .flac, .m4a, .csv, .json.

Feel free to ask in #community-meta if you think this is a mistake.

lapis sequoia
#

Anyone have a good place to learn NLP from?

craggy wadi
#

Hi everyone, I am looking for resources that explain how to implement a Binary search tree to store an object with multiple attributes in python.

rugged comet
#

Please explain to me what projection means in the context of the shapes of data.

young granite
hasty mountain
#

You basically just create a GAN and then concatenate the conditioning input with your input (before passing it to both the discriminator and the generator). Remember to concatenate along the channels dimension

#

I don't know quite the logic behind it...but this makes me feel stupid because now I have to fix my code for an audio generator...where I concatenated in the batch dimension

minor coral
#

But the thing is, i dont know how to process the images i have that matches the cgan

#

I saw a sample code of cgan but it uses mnist dataset , and I dont know how to implement the images I have as the dataset in the model

#

🥲

hasty mountain
#

Take a look at the first function

#

And ignore the audio part...specially the preprocessing

minor coral
#

Does the dataset there can be use here?

hasty mountain
#

Yes, it can. You'll just have to convert it from numpy to pytorch tensor


tensor_data = torch.from_numpy(numpy_data)
# permute reorders the axes (N, H, W, C) -> (N, C, H, W);
# view would keep the memory order and scramble the pixels
tensor_data = tensor_data.permute(0, 3, 1, 2)
minor coral
#

@hasty mountain thank you so much imma try it later!

hasty mountain
#

NumPy image arrays (as used with Keras/TensorFlow) are usually laid out as (N_samples, Height, Width, Channels), while PyTorch expects (N_samples, Channels, Height, Width)
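
A small sketch of why the axis order matters, shown in NumPy with a made-up 2x2 batch (the array contents are illustrative). Moving the channel axis has to be a transpose; a reshape to the same target shape only re-chunks the values in memory order. The same distinction applies to `permute` versus `view` on PyTorch tensors.

```python
import numpy as np

# fake batch: 2 images, 2x2 pixels, 3 channels, in (N, H, W, C) order
nhwc = np.arange(2 * 2 * 2 * 3).reshape(2, 2, 2, 3)

# correct: move the channel axis, keeping every pixel with its own channels
nchw = nhwc.transpose(0, 3, 1, 2)

# incorrect: same target shape, but values are just re-chunked in memory order
rechunked = nhwc.reshape(2, 3, 2, 2)

print(nchw.shape, rechunked.shape)
# channel 0 of image 0 should equal the first channel of every pixel
print(np.array_equal(nchw[0, 0], nhwc[0, :, :, 0]))
print(np.array_equal(rechunked[0, 0], nhwc[0, :, :, 0]))
```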

minor coral
#

Imma back later for questions but thank you!!

rugged comet
#

What do I do when I have inputs of different shapes?

ValueError: A `Concatenate` layer requires inputs with matching shapes except for the concatenation axis. Received: input_shape=[(None, 376, 16), (None, 19), (None, 2644, 128)]

I tried this

    # Get all inputs to same shape
    type_x = layers.Dense(8)(type_x)
    converted_mana_cost_x = layers.Dense(8)(converted_mana_cost_x)
    text_x = layers.Dense(8)(text_x)

But that resulted in a similar error

ValueError: A `Concatenate` layer requires inputs with matching shapes except for the concatenation axis. Received: input_shape=[(None, 376, 8), (None, 8), (None, 2644, 8)]
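
One possible reading of the error, sketched with NumPy shapes rather than Keras itself: `Dense` only changes the last axis, so two of the inputs stay rank 3 with different lengths on axis 1 while the third is rank 2. Reducing the rank-3 inputs to (batch, features) first makes them concatenable. In Keras that reduction would typically be `layers.GlobalAveragePooling1D()` or `layers.Flatten()`; the batch size of 4 and the use of mean pooling below are assumptions for illustration.

```python
import numpy as np

batch = 4  # hypothetical batch size standing in for None
type_x = np.zeros((batch, 376, 16))
converted_mana_cost_x = np.zeros((batch, 19))
text_x = np.zeros((batch, 2644, 128))

# collapse the sequence axis of the rank-3 inputs (average pooling over axis 1)
type_vec = type_x.mean(axis=1)   # shape (batch, 16)
text_vec = text_x.mean(axis=1)   # shape (batch, 128)

# all inputs are now (batch, features), so they can be joined on the last axis
merged = np.concatenate([type_vec, converted_mana_cost_x, text_vec], axis=-1)
print(merged.shape)
```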
hasty mountain
rugged comet
dense lagoon
#

damn you guys are all smart fr

minor coral
#

i hate online class

#

they only gave us 5 weeks to study machine learning to AI ...

quaint plover
#

I'm looking for some support into using sets to find intersections between two sets (one list of keywords and a list of strings), channel help-candy

charred light
#

I think this is hilarious. I would assume most API would have a limiting factor.

mortal dove
#

You would also have no idea what the underlying architecture looks like, so even if you could get enough data for a training set, you wouldn't be able to replicate it 😂

desert parcel
#

The values produced by MSE, RMSE, R2 score, etc. show loss, which is how good or bad your model is at predicting.

#

So does that mean that loss is actually variance? Since the higher the loss the more inaccurate your predictions are and the further away from the actual values they are from the labels.
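
A hedged sketch of how these quantities relate, on made-up numbers. MSE and RMSE are errors (lower is better), while R2 is a score (higher is better, 1 is perfect). MSE is not the variance itself: it equals the variance of the residuals plus their squared mean, so the two coincide only when the residuals average to zero.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 8.0, 10.0])  # made-up predictions

residuals = y_true - y_pred
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# MSE splits into the variance of the residuals plus their squared mean (bias)
decomposed = residuals.var() + residuals.mean() ** 2

print(mse, rmse, r2)
print(np.isclose(mse, decomposed))
```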

plush glacier
minor coral
#
ValueError                                Traceback (most recent call last)
<ipython-input-18-1b5f3d095e18> in <module>
      1 batch_size = 32
      2 
----> 3 data_loader = torch.utils.data.DataLoader(MNIST(root="/content/dataset",train=True,download=True,transform=transform),
      4                                           batch_size=batch_size, shuffle=True)

3 frames
/usr/local/lib/python3.7/dist-packages/torchvision/datasets/mnist.py in read_sn3_pascalvincent_tensor(path, strict)
    524     # we need to reverse the bytes before we can read them with torch.frombuffer().
    525     needs_byte_reversal = sys.byteorder == "little" and num_bytes_per_value > 1
--> 526     parsed = torch.frombuffer(bytearray(data), dtype=torch_type, offset=(4 * (nd + 1)))
    527     if needs_byte_reversal:
    528         parsed = parsed.flip(0)

ValueError: offset (16 bytes) must be non-negative and no greater than buffer length (16 bytes) minus 1
#

does somebody knows how to fix this error?

#

What I did was replace the downloaded mnist file with the file I created, but it now show this error

dusk tide
#

I am working on a college project. The project is "Plant species identification " . So I have decided to do this via deep learning . But unable to find a good dataset with lots of images like around 1000s of each category. Can someone guide ??

wind patrol
#

Hello there, im extremely new to ai and ml, and was just getting the dice rolling, i was working on an image scene classification thing using VGG16 from the keras import lib, i was getting an error when i was tryna get results for a run , my full traceback is as follows-

MemoryError                               Traceback (most recent call last)
c:\Users\blufl\OneDrive\Desktop\CNN shtuff\Researchpaper.ipynb Cell 8 in <cell line: 52>()
     48 labels = lb.fit_transform(labels)
     50 # perform a training and testing split, using 75% of the data for
     51 # training and 25% for evaluation
---> 52 (trainX, testX, trainY, testY) = train_test_split(np.array(data),
     53     np.array(labels), test_size=0.25)
     55 # define our Convolutional Neural Network architecture
     56 '''model = Sequential()
     57 model.add(Conv2D(8, (3, 3), padding="same", input_shape=(128, 128, 3)))
     58 model.add(Activation("relu"))
   (...)
     76 model.add(Dense(6))
     77 model.add(Activation("softmax"))'''

MemoryError: Unable to allocate 19.1 GiB for an array with shape (17034, 224, 224, 3) and data type float64```
im not sure how to fix this
#

!paste

#

this is the full code for training and testing part at least

#

i had made another cell in jupyter, to do the exact same thing- and it gave me an entirely different error

#

^^^ above is the 2nd error when run on a different cell along with the code that was used

wind patrol
mild dirge
#

So when trying to allocate 19.1 GB, that will not be enough

wind patrol
#

im not sure why its trying to allocate that much

mild dirge
#

(17034, 224, 224, 3)

#

This is the shape

#

That is basically 17 thousand RGB images of 224x224 pixels
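
The 19.1 GiB figure follows directly from the shape and dtype in the error message. A minimal sketch of the arithmetic (the helper name `array_gib` is made up for this example):

```python
def array_gib(shape, bytes_per_element):
    """Size in GiB of a dense array with the given shape."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 2**30

shape = (17034, 224, 224, 3)
print(round(array_gib(shape, 8), 1))  # float64, as in the error message
print(round(array_gib(shape, 4), 1))  # float32 would halve it
print(round(array_gib(shape, 1), 1))  # uint8 pixel data would be 1/8 of it
```

So part of the cost comes from the float64 dtype, but even at a smaller dtype the full array is large, which is why batching is the more robust fix.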

wind patrol
#

the shape shud be (none,224,224,3)

mild dirge
#

Can't allocate that all at once, so a solution would be to do it in batches

mild dirge
wind patrol
#

yes yes

#

i read ur msg afterwards

mild dirge
#

which in your case is 17k (because you are probably trying to do it all in 1 batch)

wind patrol
#

yep, when using the sequential base, it never gave me this issue, in total i have around 24,000 imgs which are of 32x32 each

#

when ran on sequential the shape used (None,128,128,3)

mild dirge
#

32x32 is a lot less pixels than 224x224

wind patrol
#

ye

#

thats the default params for vgg16 iirc

mild dirge
#

Anyways, whatever the shape is, if you don't have enough memory to load it all in at once, load it in in batches
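
A library-agnostic sketch of what "load it in batches" means: keep only the list of file paths in memory and materialise one slice at a time. The file names and the helper `iter_batches` are hypothetical; a Keras data generator does the same job with more features.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# hypothetical file list standing in for the real image paths
image_paths = [f"img_{i}.jpg" for i in range(70)]

for batch_paths in iter_batches(image_paths, 32):
    # a real loop would decode only these files here, then train on them
    print(len(batch_paths))
```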

wind patrol
#

checks out

wind patrol
mild dirge
#

I'm not super comfortable with keras, but this link shows an example of how to do transfer learning with vgg16, there are code snippets in there for an image generator, which takes the directory with images, and loads them in batches

mild dirge
wind patrol
#

and i think ik why thats happening

mild dirge
#

The memory error is simply because you try to make an array that is too large

wind patrol
#

yep

#

now what im confused about is, when im using the same exact same test with vgg16 instead of sequential, the error comes up, however it never happened on sequential before

mild dirge
#

Not completely sure what you mean with sequential

#

Is it a different model?

wind patrol
#

yep

#

its not pretrained, but its from keras only

#

ValueError: Input 0 of layer "vgg16" is incompatible with the layer: expected shape=(None, 224, 224, 3), found shape=(None, 128, 128, 3)

#

so the value error is coming cuz of the shape

mild dirge
#

So is your data of shape (batch size x)128x128x3?

wind patrol
#

ye

mild dirge
#

The model expects the images to be 224x224 then

wind patrol
#

ye

mild dirge
#

Which is not the same shape as your data

wind patrol
#

im not sure how to get around that then

mild dirge
#

Resize the images

#

Or use a model that expects 128x128

wind patrol
#

alright one sec lemme try it

mild dirge
#

With resize I really mean resize and not reshape
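
To make the distinction concrete, a small NumPy sketch on a random stand-in image: reshape only rearranges existing values, so it fails when the element counts differ, whereas resizing creates new pixels. The nearest-neighbour indexing below is just an illustration; real code would normally call the image library's own resize function.

```python
import numpy as np

image = np.random.rand(128, 128, 3)  # stand-in for one decoded image

# reshape only rearranges existing values, so the element count must match
try:
    image.reshape(224, 224, 3)
    reshape_ok = True
except ValueError:
    reshape_ok = False

# nearest-neighbour resize: pick a source row/column for every target pixel
rows = np.arange(224) * 128 // 224
cols = np.arange(224) * 128 // 224
resized = image[rows][:, cols]

print(reshape_ok, resized.shape)
```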

wind patrol
mild dirge
#

yeah resizing is the most logical option then

wind patrol
#

the original images are 32x32 but i resized them to 128x128 to make sequential work

mild dirge
#

Kind of a waste of resources and memory to have such small images and resize them to work with a larger model

#

Probably prone to overfitting too, the model is probably too large for the simplicity of the data and the problem

wind patrol
#

nope nvm they 150x150 each

#

i confused it for a diff dataset

mild dirge
#

Alright, well that makes a bit more sense then

#

So keras probably has a resize function, otherwise you can use something like opencv or something

wind patrol
#

ye thats what im using

mild dirge
wind patrol
#

mhm running the test now again, lets hope it works

mild dirge
#

Running it in batches now, or all at once?

wind patrol
#

i was doing it in batches of 32

#

but i think i gotta dumb it down even more

mild dirge
#

32 should be alright when you have that much memory

#

That is 4,816,896 floats, so only a few tens of megabytes

wind patrol
#

it gave a mem error still

mild dirge
#

Could you show it?

wind patrol
#

so like im running it even smaller batches

wind patrol
mild dirge
#

alright haha

wind patrol
#

like show as in on like a vc or like just a screengrab, cuz for 128x128 images, 32 batch size worked fine

mild dirge
wind patrol
#

my pc is lagging more lets hope its doing something :hidesthepain:

mild dirge
#

It should be like 38 megabytes if the shape is 32x224x224x3

#

So if that gives a memory problem, there might be another issue

wind patrol
#

send help man wtf is this shit

#

shit wait

mild dirge
#

So it is still loading it all at once

wind patrol
#

seems like

#

it

mild dirge
#

They also use vgg16, so it should be pretty simple to follow

wind patrol
#

leme have a look

mild dirge
#

This part especially

wind patrol
#

ye so thats exactly how im doing it

#

the batch size part at least

#

since im using a different optimiser in adam

#

its still tryna run itself together

#

im so confused

#

i might try to take their approach once

viscid flume
viscid flume
#

Eh, anyone there? 😅

viscid flume
minor coral
#

does anyone knows how to convert png, jpg to mnist format dataset?

hasty mountain
#

It'll open the JPG/PNG image as PIL Image object, then you can simply call np.array(image) on that

minor coral
#

but I also need the labels and such

hasty mountain
#

Are the images organized like: "class1.png", where class is that image class?

minor coral
#

I arrange the images into different folders

#

Dataset
0
image1....
1
Image1....

hasty mountain
#

Oh, then it's quite easy

minor coral
#

like 5weeks

#

these are my classes looks like

hasty mountain
# minor coral man, im a beginner T_T

Try something like this:


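import os

pics = []
labels = []

# os.walk yields (dirpath, dirnames, filenames); here `filename` receives the
# subfolder names and `folder` receives the file names inside `directory`
for directory, filename, folder in os.walk(path):
    for file in folder:
        pics.append(os.path.join(directory, file))
        labels.append(os.path.basename(directory))  # folder name is the label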

#

I don't remember if this filename is indeed the filename. Usually I just use directory and folder

#

Path is indeed the path to your directory, like C:/User/Dataset
Inside Dataset, each folder will be directory. So, if you have Dataset/label1, label 1 will become directory.
Inside directory, each image will be file.

So if you have C:/User/Dataset, pass it as path, then directory will be your labels, and then remove that for file in folder, as your folder will already be your images.
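
An alternative sketch of the same idea with `pathlib`, assuming the layout is exactly root/<label>/<image> with no deeper nesting. The function name `list_images` is made up for this example.

```python
from pathlib import Path

def list_images(root):
    """Return (paths, labels) for a root/<label>/<image> folder layout."""
    paths, labels = [], []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for image_path in sorted(class_dir.iterdir()):
            if image_path.is_file():
                paths.append(str(image_path))
                labels.append(class_dir.name)  # folder name is the class label
    return paths, labels
```

Calling `list_images("C:/User/Dataset")` would then give two parallel lists, one path and one label per image.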

viscid flume
#

Does anyone know how to use a wrapper or something to change the memory allocator in pytorch?

minor coral
#

I tried this but it isnt working

hasty mountain
#

# Load from and save to
Names = [['./training-images','train'], ['./test-images','test']]

for name in Names:
    
    data_image = array('B')
    data_label = array('B')

It'll create a list of lists Name, where each element is a list with element 0 being the images path and element 1 being where you'll save the images array.

Then it'll just iterate through the images path, append each image to a new list.

This new list will be used to open each image with PIL Image, resize them as you wish, and then make some preprocess things that I don't really understand, and finally create the dataset in bytes type

#

Honestly, though...simply try using the way it's in the DatasetCreator I've sent...it's way easier.
If you need the image in bytes and grayscale, just add image.convert('L') in the code.

#

Resizing images can also be done with image.resize((width, height)) instead of iterating through each pixel (note that PIL takes the size as width first, then height)

cold saddle
#

How would i go about making a recommendation system based on images AND descriptions?
I am able to use cosine similarity on the descriptions, which has been working well to make recommendations.

But now i want to suggest products that looks similar.
I am looking for something simple not state of the art.

I am using the Amazon Berkeley dataset.
https://amazon-berkeley-objects.s3.amazonaws.com/index.html

#

My initial thought was to keep them separate. Recommend similar images. Then recommend similar descriptions and do some weighting of the two.
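
A minimal sketch of that weighting idea, assuming each product already has a text vector and an image vector. The embeddings and the 0.6 weight are made up; real vectors would come from the text vectoriser already in use and from some image feature extractor.

```python
import numpy as np

def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T

# made-up embeddings for 3 products
text_vecs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
image_vecs = np.array([[0.2, 0.8], [0.1, 0.9], [0.3, 0.7]])

text_weight = 0.6  # assumed weighting, to be tuned
combined = (text_weight * cosine_similarity_matrix(text_vecs)
            + (1 - text_weight) * cosine_similarity_matrix(image_vecs))

query = 0
scores = combined[query].copy()
scores[query] = -np.inf  # never recommend the product itself
print(int(np.argmax(scores)))
```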

copper fjord
#
def findPeakInterval(dag, time, kwh):
    
    fåForbrukMaxdf = Forbruk_Dag_Time.reset_index()
    finneForbrukMax = fåForbrukMaxdf['KWH 60 Forbruk'].max()
    filter_forbrukMax = (fåForbrukMaxdf['KWH 60 Forbruk'] == finneForbrukMax)
    filter_forbrukMax = fåForbrukMaxdf.loc[filter_forbrukMax]
    forbrukMaxDagNr = filter_forbrukMax.iloc[0]['Dag']
    forbrukMaxTimeNr = filter_forbrukMax.iloc[0]['Time']
    forbrukMaxKWh = filter_forbrukMax.iloc[0]['KWH 60 Forbruk']
    return forbrukMaxDagNr[dag], forbrukMaxTimeNr[time], forbrukMaxKWh[kwh]

findPeakInterval(1,1,1)```
#

cant seem to assign each return variable to their own respective indexes.

serene scaffold
#

you can also do print(Forbruk_Dag_Time.reset_index().head().to_dict('list'))

#

Please ping me when you have done that.

copper fjord
serene scaffold
#

I need to know what the data looks like before you try doing any of this.

copper fjord
#

yes

#
   Time  Dag  KWH 60 Forbruk
282   18   12           7.981```
serene scaffold
#

I can't help. sorry.

copper fjord
#

look, what i want is to extract each column from this mini-dataframe as its own variable

#

@serene scaffold

#

can't be that hard right

serene scaffold
#

you can just return list(df.iloc[0]), and if there's three variables to "catch" the result, it will be what you want.

copper fjord
#

i got the same output as you

#

how do i extract them to their own variable

serene scaffold
copper fjord
#

thanks

serene scaffold
#

trying to end this channel for everyone? lemon_sweat

copper fjord
#

wait

#

😭

#
def findPeakInterval(x,y,z):
    
    fåForbrukMaxdf = Forbruk_Dag_Time.reset_index()
    finneForbrukMax = fåForbrukMaxdf['KWH 60 Forbruk'].max()
    filter_forbrukMax = (fåForbrukMaxdf['KWH 60 Forbruk'] == finneForbrukMax)
    filter_forbrukMax = fåForbrukMaxdf.loc[filter_forbrukMax]
    forbrukMaxDagNr, forbrukMaxTimeNr, forbrukMaxKWh = filter_forbrukMax.iloc[0]
    x,y,z = forbrukMaxDagNr, forbrukMaxTimeNr, forbrukMaxKWh
    return forbrukMaxDagNr[x], forbrukMaxTimeNr[y], forbrukMaxKWh[z]
    

    

findPeakInterval(x)```
#

today is not my day man

#

still error

serene scaffold
#

forbrukMaxDagNr, forbrukMaxTimeNr, forbrukMaxKWh isn't similar to what I said

#

you showed me a dataframe with one row. where is that?

copper fjord
#

it is

`filter_forbrukMax`
serene scaffold
#

and you just want to return the three values that are in it, right?

copper fjord
#

yes

#

i did it

serene scaffold
#

so return filter_forbrukMax.iloc[0]

#

if you do list(filter_forbrukMax.iloc[0]), you get the three values in a list. they aren't keys for looking up the values, like you seemed to assume when you wrote return forbrukMaxDagNr[x], forbrukMaxTimeNr[y], forbrukMaxKWh[z]

copper fjord
#

do i have to set each variable as `global`?
serene scaffold
#

!code

arctic wedgeBOT
#

Here's how to format Python code on Discord:

```py
print('Hello world!')
```

These are backticks, not quotes. Check this out if you can't find the backtick key.

serene scaffold
#

But generally speaking, the global keyword is only for if you want to overwrite the variable for the whole module. you can always read module-level variables.

vale prawn
#
print('hello world!')
copper fjord
#
def findPeakInterval():
    global
    fåForbrukMaxdf = Forbruk_Dag_Time.reset_index()
    finneForbrukMax = fåForbrukMaxdf['KWH 60 Forbruk'].max()
    filter_forbrukMax = (fåForbrukMaxdf['KWH 60 Forbruk'] == finneForbrukMax)
    filter_forbrukMax = fåForbrukMaxdf.loc[filter_forbrukMax]
    x = list(filter_forbrukMax.iloc[0])
    forbrukMaxDagNr,forbrukMaxTimeNr,forbrukMaxKWh = x[0], x[1], x[2] 
    return forbrukMaxDagNr, forbrukMaxTimeNr, forbrukMaxKWh
  
findPeakInterval()```
arctic wedgeBOT
#

Hello, @vale prawn!

copper fjord
#

now i can't use these return values outside the function

serene scaffold
#
def findPeakInterval():
    fåForbrukMaxdf = Forbruk_Dag_Time.reset_index()
    finneForbrukMax = fåForbrukMaxdf['KWH 60 Forbruk'].max()
    filter_forbrukMax = (fåForbrukMaxdf['KWH 60 Forbruk'] == finneForbrukMax)
    filter_forbrukMax = fåForbrukMaxdf.loc[filter_forbrukMax]
    return filter_forbrukMax.iloc[0]
  
a, b, c = findPeakInterval()
copper fjord
#

nvm

#

found it

quiet seal
#

Hi I'm using plotly.express to generate a radar chart with px.line_polar(df, r='Score', theta='Section', line_close=True, range_r=[0,5]) and I want to plot two series

#

since this doesn't support r=[r1, r2, ...] I'm doing this with plotly.graph_objects, but I can't figure out the equivalent of line_close and range_r. I tried fig.update_traces(marker_colorbar_tickformatstops=dict(dtickrange=[0,5]), selector=dict(type='scatterplot')) and a few other things, no luck

#

Any suggestions on how to get this thing to close the line and set the range of r?

#

oh huh. setting the range was in the example I read, I missed it somehow for the past 2 hours. Closing the line is not

brave sand
#

what is the oracle in RL?

storm kelp
#

@lapis sequoia@noble tusk
Unfortunately I couldn't use the solutions you guys suggested because many of those functions are not available for PySpark Dataframes. I ended up coming up with this somewhat grotesque method which appears to be working but I still need to do more QC on the results to confirm.
df.select("*", F.row_number().over(Window.partitionBy("A", "B").orderBy("C")).alias("rn")).filter(F.col("rn") == 1)

noble tusk
#

Yikes, that's pretty rough

#

Looking up PySpark, it seems to be a wrapper for Apache Spark. I don't really know a lot about that, so I might be wrong here, but could you instead use Apache Arrow tables, or even Polars, which is based on Arrow?

storm kelp
#

It's just getting used to it I guess. Does seem crazy complicated for something as simple as grouping a df and finding the smallest value

noble tusk
#

Yeah. I feel like there must be a better way but I've never used PySpark so idk

#

If it's SQL DataFrames you're using, and then using SQL is the only requirement, you could get away with using Pandas. But that would be dependent on organisation's situation itself

#

I might try and trawl through the docs see if I can find anything on PySpark that would work better for you

#

@storm kelp Looking at the docs, something like this might work ```py
df.groupby(["a", "b"]).min("c").collect()
```

lapis sequoia
storm kelp
storm kelp
noble tusk
#

If it is necessary then idk enough about PySpark to be able to help that much more unfortunately

#

Here's some benchmark data

#

I've not used Polars that much, but the syntax looks pretty similar to what I'm seeing from PySpark

#

Those benchmarks are actually run on groupby() as well lmao

desert oar
serene plume
#

Is it possible to add an axis to a numpy array but only when there is a single one?

#

!e

import numpy as np

def expand(arr):
    if arr.ndim == 1:
        return np.expand_dims(arr, axis=0)
    return arr

a = np.array([10, 20])
b = np.array([[10, 20], [10, 20]])

print(a.shape)
print(b.shape)
print()

a = expand(a)
b = expand(b)

print(a.shape)
print(b.shape)
arctic wedgeBOT
#

@serene plume :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | (2,)
002 | (2, 2)
003 | 
004 | (1, 2)
005 | (2, 2)
serene plume
#

Does numpy not have something that works like this expand?

desert oar
#

i don't think it has that built-in

#

or wait, i misread your code

#

this is atleast_2d

#

!d numpy.atleast_2d

arctic wedgeBOT
#

numpy.atleast_2d(*arys)```
View inputs as arrays with at least two dimensions.
desert oar
#

!e ```python
import numpy as np

a = np.array([10, 20])
b = np.array([[10, 20], [10, 20]])

print(np.atleast_2d(a))
print()
print(np.atleast_2d(b))
```

arctic wedgeBOT
#

@desert oar :white_check_mark: Your 3.11 eval job has completed with return code 0.

001 | [[10 20]]
002 | 
003 | [[10 20]
004 |  [10 20]]
weary crown
#

@serene scaffold would u mind helping me with my code again 🙂

desert oar
weary crown
#
import pandas as pd
import numpy as np
from numpy import sqrt
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from nptyping import NDArray, Int, Shape
import pickle

# read in csv into dataframe
df = pd.read_csv(r"C:\Users\josmo\Downloads\creditcard.csv")

target = df['Class']
df.pop('Class')



# feature scale each column
for column in df.columns:
    scaler = MinMaxScaler(feature_range=(-1, 1))
    scaler.fit(df[column].values.reshape(-1, 1))
    df[column] = scaler.transform(df[column].values.reshape(-1, 1) + 1e-4)

data_train, data_test, target_train, target_test = train_test_split(
    df, target, test_size=0.2, random_state=42)

tree_reg = DecisionTreeRegressor()
tree_reg.fit(data_train, target_train)

# Testing
housing_predictions = tree_reg.predict(pd.concat([data_test, target_test]))

# RMSE evaluation
lin_mse = sqrt(mean_squared_error(target_test, housing_predictions))
print(f"Loss: {lin_mse}")

# Cross Validation
scores = cross_val_score(tree_reg, target_train, target_test, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = sqrt(-scores)

# Display Cross Validation results
def display_scores(scores):
    print(f"Scores: {scores}\nMean: {scores.mean()}\nStandard Deviation: {scores.std()}")```
#

When I train my dataset I get NaN value error - but I counted the number of NaN values in the DF and it said 0...

#

so like what happened-

serene plume
desert oar
#

@weary crown show the full exception?

#

or at least say what line the exception occurs on

storm kelp
desert oar
weary crown
#

the except is too large to fit in 1 message

#

idk what hist gradient boosting classifier is

desert oar
#

you are sure that after applying the min-max scaler, data_test.isnull().any().any() is false?

desert oar
#

note that :
1) you can apply the scaler to the entire dataframe at once as an array nvm you are scaling each column individually
2) .values is discouraged in favor of .to_numpy()
3) you are messing with columns that have non-string names, which might break things (as per the warning messages in the output you showed)

weary crown
desert oar
#

also, scaling min-max on both train and test sets is inadvisable. it's basically cheating, using test data in training
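
A minimal sketch of the leak-free order of operations, on a made-up single feature: split first, fit the scaler on the training rows only, then reuse the fitted min and max on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# made-up single feature, purely illustrative
X = np.arange(20, dtype=float).reshape(-1, 1)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_scaled = scaler.fit_transform(X_train)  # learns min/max from train only
X_test_scaled = scaler.transform(X_test)        # reuses the train min/max

print(X_train_scaled.min(), X_train_scaled.max())
print(X_test_scaled.ravel())
```

Test values can land slightly outside the target range when the test set contains a value more extreme than anything seen in training, which is expected.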

weary crown
desert oar
weary crown
#

ok

desert oar
#

you might also want to use a Pipeline for this, which takes care of the bookkeeping related to having one scaler per column that you need to store and re-apply on the test set

#

@weary crown does this work? any errors?

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

data = pd.read_csv(r"C:\Users\josmo\Downloads\creditcard.csv")
target = data.pop('Class')

scaler = MinMaxScaler(feature_range=(-1, 1))
scaler_columnwise = ColumnTransformer([], remainder=scaler)
tree_reg = DecisionTreeRegressor()
pipeline = make_pipeline(scaler_columnwise, tree_reg)

data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)

pipeline.fit(data_train, target_train)
weary crown
desert oar
weary crown
#

I forgot what transformers are

desert oar
#
pipeline.fit(data_train, target_train)

pred_test = pipeline.predict(data_test)

it lets you write this, without having to re-apply all the fitted transformers to the test set

weary crown
#

well i have a vague idea

desert oar
#

an "estimator" is what you might otherwise call a model

#

a "transformer" just transforms data

weary crown
#

ooh i see

desert oar
#

transformers have a .transform method, estimators have a .predict method. that's the main difference.

weary crown
#

yay this code works

#

now how to predict with it?

desert oar
serene scaffold
#

the authors of "Attention is all you need" poisoned the well for the meaning of "transformer"

desert oar
#

sorry, not the user guide, the tutorial

weary crown
desert oar
weary crown
#

ofc i can do boring .predict

desert oar
#

yep that's it

weary crown
#

oh great!

desert oar
#

pipeline is awesome. it's one of the things that got me to switch to python from r in 2015

#

pandas was new and really clunky at the time, but scikit-learn was already excellent

#

the r equivalent (caret) seemed archaic by comparison

weary crown
#
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from math import sqrt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

data = pd.read_csv(r"C:\Users\josmo\Downloads\creditcard.csv")
target = data.pop('Class')

scaler = MinMaxScaler(feature_range=(-1, 1))
scaler_columnwise = ColumnTransformer([], remainder=scaler)
tree_reg = DecisionTreeRegressor()
pipeline = make_pipeline(scaler_columnwise, tree_reg)

data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)

pipeline.fit(data_train, target_train)

# Testing
pred = pipeline.predict(pd.concat([data_test, target_test]))

# RMSE evaluation
lin_mse = sqrt(mean_squared_error(target_test, pred))
print(f"Loss: {lin_mse}")

# Cross Validation
scores = cross_val_score(tree_reg, target_train, target_test, scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = sqrt(-scores)

# Display Cross Validation results
def display_scores(scores):
    print(f"Scores: {scores}\nMean: {scores.mean()}\nStandard Deviation: {scores.std()}")``` still same error when i do.predict??
#

i was 7 in 2015 hehe

desert oar
#

just do pipeline.predict(data_test)

weary crown
#

ok

desert oar
#

what were you trying to achieve with that pd.concat?
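
For reference, a sketch of why that `pd.concat` can produce NaN even when neither input has any. The tiny frames below are stand-ins for `data_test` and `target_test`, with made-up values. Stacking a Series under a DataFrame row-wise adds an extra column for the Series values, and every cell without a counterpart is filled with NaN.

```python
import pandas as pd

# tiny stand-ins for data_test and target_test
features = pd.DataFrame({"V1": [0.1, 0.2], "Amount": [10.0, 20.0]}, index=[5, 6])
labels = pd.Series([0, 1], index=[5, 6], name="Class")

# neither input contains missing values
print(features.isnull().sum().sum(), labels.isnull().sum())

# row-wise concat: the Series becomes extra rows in an extra column,
# and the gaps on both sides are filled with NaN
stacked = pd.concat([features, labels])
print(stacked.shape)
print(stacked.isnull().sum().sum())
```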

weary crown
#

it fixed a previous error

#
C:\Users\josmo\PycharmProjects\FraudDetection\venv\Scripts\python.exe C:/Users/josmo/PycharmProjects/FraudDetection/main.py 
Traceback (most recent call last):
  File "C:\Users\josmo\PycharmProjects\FraudDetection\main.py", line 33, in <module>
    scores = cross_val_score(tree_reg, target_train, target_test, scoring="neg_mean_squared_error", cv=10)
  File "C:\Users\josmo\PycharmProjects\FraudDetection\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 515, in cross_val_score
    cv_results = cross_validate(
  File "C:\Users\josmo\PycharmProjects\FraudDetection\venv\lib\site-packages\sklearn\model_selection\_validation.py", line 252, in cross_validate
    X, y, groups = indexable(X, y, groups)
  File "C:\Users\josmo\PycharmProjects\FraudDetection\venv\lib\site-packages\sklearn\utils\validation.py", line 433, in indexable
    check_consistent_length(*result)
  File "C:\Users\josmo\PycharmProjects\FraudDetection\venv\lib\site-packages\sklearn\utils\validation.py", line 387, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [227845, 56962]
Loss: 0.03050319422577728```
desert oar
weary crown
#

ohh

desert oar
#

you shouldn't get that error with this code

#

the only way you'd get that error is if you mixed up train and test data in the same fit call

#

the error message means that your data and labels have different lengths
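
A hedged sketch of the call shape `cross_val_score` expects, on made-up data: the second and third arguments are the features and labels of the same rows, and the function does its own splitting. With the names used in this chat, that would mean passing `data_train` and `target_train`. `np.sqrt` is used because the scores come back as an array, which `math.sqrt` cannot take.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# made-up training data: X and y must have the same number of rows
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train[:, 0] + rng.normal(scale=0.1, size=100)

scores = cross_val_score(
    DecisionTreeRegressor(random_state=0),
    X_train,   # features of the training set
    y_train,   # labels of the same training rows
    scoring="neg_mean_squared_error",
    cv=5,
)
rmse_scores = np.sqrt(-scores)  # element-wise square root over the fold scores
print(rmse_scores.mean(), rmse_scores.std())
```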

#

hopefully you can understand why that's a problem

weary crown
#

i hate this dataset its not applicable or anything

#

since im not given what the labels mean due to the creator of the dataset saying hes unwilling to disclose it

desert oar
desert oar
weary crown
#

no i just searched up cool datasets on kaggle and found it