#data-science-and-ml

wicked grove
#

so should i add a flatten layer

#

these are the last 2 layers of my model

#
                                )                                                                 
                                                                                                  
 top_activation (Activation)    (None, 17, 17, 2304  0           ['top_bn[0][0]']                 
                                )                                                                 
                                                                                                  
 avg_pool (GlobalAveragePooling  (None, 2304)        0           ['top_activation[0][0]']         
 2D)                                                                                              
                                                                                                  
 top_dropout (Dropout)          (None, 2304)         0           ['avg_pool[0][0]']               
                                                                                                  
 predictions (Dense)            (None, 1000)         2305000     ['top_dropout[0][0]']  ```
stone marlin
#

I'm not sure, I'm not great at NN stuff yet, someone else in here should be able to answer you though. :']

wicked grove
#

alrightt,thankss:))

wicked grove
#

but i am getting another error

#
logits and labels must have the same first dimension, got logits shape [32,3] and labels shape [96]
     [[node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits
 (defined at C:\Users\Urja\anaconda3\lib\site-packages\keras\backend.py:5113)``` i haven't even used logits
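
A note on the shapes in that error: 96 is 32 × 3, which is what you would see if the labels were one-hot encoded (3 values per image) while the loss expects one integer class id per image. That is an assumption about the setup here, not something the traceback confirms. A minimal NumPy sketch of the mismatch, using made-up labels:

```python
import numpy as np

batch_size, num_classes = 32, 3

# Hypothetical integer class ids, the format sparse_categorical_crossentropy expects.
sparse_labels = np.arange(batch_size) % num_classes      # shape (32,)

# The same labels one-hot encoded, the format categorical_crossentropy expects.
one_hot_labels = np.eye(num_classes)[sparse_labels]      # shape (32, 3)

# Flattening the one-hot labels gives 96 values, matching "labels shape [96]".
flattened = one_hot_labels.reshape(-1)

# Going back from one-hot rows to integer ids.
recovered = one_hot_labels.argmax(axis=1)

print(sparse_labels.shape, one_hot_labels.shape, flattened.shape)
```

If that is the cause, the usual options are to switch the loss to categorical_crossentropy or to feed integer labels. The word "logits" in the message comes from the name of the underlying TensorFlow op, not from anything in the user's code.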
stone marlin
arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @lapis sequoia until <t:1641725672:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

stuck holly
#

Anyone here use vscode?

wicked grove
#

Hello, i have a problem...im doing transfer learning for a dataset with 3390 images
The time taken for an epoch is an hour ://

#

Is there any way i can improve the speed

#

After the first epoch the training stopped because I had no memory left on my C drive

upbeat prism
#

@wicked grove I'm no expert myself and no idea about the time since I never did such a thing. I suggest you do the following:

  1. If you have a GPU use cuda to train your NN
  2. Enable the profiler of whatever NN library you use, or alternatively measure it yourself. E.g. in your loop for 1 epoch I guess you have something like "train()" and "evaluate()", and in train() you might have model.predict(batch) and optimizer.step() or whatever. You have several big function calls; measure each of them and check which one takes the longest. Also make sure you measure the duration of your file reads/writes.
  3. Generally, file I/O (writing/reading a file) is slow. You also want to minimize communication between CPU memory (your RAM) and CUDA. So make sure that you move as much data to CUDA memory as possible. I can only give details here if you use PyTorch's Dataset class, but basically: try to store your images in HDF5 format. Use h5py for that. (It's probably quite a bit of work.)

Are you sure you ran out of space on your C drive and not out of memory? (Maybe you ran out of memory and your OS created a swap file? No idea how Windows works really, sorry.) My guess is that you somehow read too much into memory.

E.g. if you run out of memory on Linux, Linux will swap memory out to disk (if swap is configured), making memory reads extremely slow. That could be a reason, but I'm really guessing here because, as I said, I have no idea about Windows. Make sure to monitor your memory.

To put it simply: make basic measurements to find the bottleneck and then try to fix it.

arctic wedgeBOT
#

:incoming_envelope: :ok_hand: applied mute to @lapis sequoia until <t:1641738329:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).

upbeat prism
#

Maybe a plan of action. What I'd do is: import time, then put start = time.time() and end = time.time() around the basic function calls like the optimizer step, model evaluation and file I/O, and print(end - start) for each of them. Also open Task Manager and look at the memory usage. If it gets close to 100% you need to fix how you read data. Also make sure you use CUDA.
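
The measuring idea above can be sketched with the standard library alone. The helper name `timed` and the two placeholder functions are hypothetical stand-ins, not part of any training library; `time.perf_counter()` is used here instead of `time.time()` because it is designed for measuring short durations.

```python
import time

def timed(label, func, *args, **kwargs):
    """Run func once, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result, elapsed

# Placeholder steps; swap in the real data loading and training calls.
def load_batch():
    time.sleep(0.05)          # pretend to read images from disk
    return list(range(32))

def train_step(batch):
    return sum(batch)         # pretend to do a forward/backward pass

batch, load_seconds = timed("load_batch", load_batch)
loss, train_seconds = timed("train_step", train_step, batch)
```

Whichever label reports the largest time is the first place to look for the bottleneck.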

upbeat dove
#

How do I make a many to one RNN network that can take an infinitely long string

#

In tensorflow

boreal summit
#

Hello everyone, I'm on this task for sentiment analysis for Amazon reviews. So I have the dataset loaded on Colab, it has 3 columns namely class_index (1 to 5), review title and review text. I am asked to perform EDA on this dataset and I honestly do not know what to do next. I'm wondering if I should predict the class index or what?

serene scaffold
#

and yes, your goal is to design a system that could predict the "class index" given only the review title and the review text.

#

can you think of what features you could use?

boreal summit
serene scaffold
#

I assume you can do it however you want. You could also concatenate the two.

boreal summit
#

Yes, I was asked to perform EDA on the dataset and report the final performance metrics for my approach. Also suggest ways in which I can improve my model.

#

So I'm guessing I would have to create a model to predict the class index, then show the accuracy & loss, and some other metrics and suggest ways to improve the model.

#

I'm working on it already.

serene scaffold
#

Sounds good to me

boreal summit
#

Thanks.

serene scaffold
#

I assume you're familiar with NLP basics like tokenization?

boreal summit
#

Sure.

serene scaffold
#

Sure, as in yes?

boreal summit
#

Yes.

serene scaffold
boreal summit
#

I'm also familiar with embedding, conv1d. So I'll just try out different things and see what works.

#

👍🏿

unborn temple
#

Uh, I would like to learn all the math behind machine learning and how it works. Any suggestions for resources?

#

I'm not good at reading research papers, so if there are any alternatives

vast thunder
#

Something is pinned about math. I guess this https://mml-book.com/

serene scaffold
unborn temple
#

Hmm, yeah I will try it, but is there any easier way like books mentioned above, for deep learning algorithm, deep reinforcement algorithms, new algorithms?

serene scaffold
#

they have deep learning books with Python examples.

unborn temple
serene scaffold
unborn temple
#

It will continue like maybe min 3 to Max 8 months

serene scaffold
#

in my case, it was a matter of checking the website for the university library. Though it might be as easy as trying to create an account using your university email.

unborn temple
serene scaffold
#

If you need help finding books that aren't behind a paywall, I can try to help with that, instead.

pearl grove
#

Does anyone know why this is happening? Code works in cell 1 but not in cell 2

serene scaffold
unborn temple
#

It could be a utf-16 character kinda error

grave frost
grave frost
#

try with the "Mathematics for Machine Learning" book. with a little googling, it should get you far with just High school math knowledge

pearl grove
# serene scaffold Can you show the whole error message as text (not a screenshot)?

`---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
C:\Users****\AppData\Local\Temp/ipykernel_20900/2342191653.py in <module>
----> 1 Year1 = pd.read_csv(r'..\datasets\2001.csv')

C:\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper

C:\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
584 kwds.update(kwds_defaults)
585
--> 586 return _read(filepath_or_buffer, kwds)
587
588 `

serene scaffold
pearl grove
#

`C:\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _read(filepath_or_buffer, kwds)
480
481 # Create the parser.
--> 482 parser = TextFileReader(filepath_or_buffer, **kwds)
483
484 if chunksize or iterator:

C:\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in __init__(self, f, engine, **kwds)
809 self.options["has_index_names"] = kwds["has_index_names"]
810
--> 811 self._engine = self._make_engine(self.engine)
812
813 def close(self):

C:\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _make_engine(self, engine)
1038 )
1039 # error: Too many arguments for "ParserBase"
-> 1040 return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
1041
1042 def _failover_to_python(self):

C:\Anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py in __init__(self, src, **kwds)
67 kwds["dtype"] = ensure_dtype_objs(kwds.get("dtype", None))
68 try:
---> 69 self._reader = parsers.TextReader(self.handles.handle, **kwds)
70 except Exception:
71 self.handles.close()

C:\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

C:\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._get_header()

C:\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()

C:\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 343: invalid continuation byte`

serene scaffold
#

guess we should add that to our website

grave frost
#

Also if you can afford it @unborn temple nnfs.io is a very popular starter with code and simply explained math to start you off

#

that alone should give a very strong foundation to your knowledge

serene scaffold
#

@pearl grove did you check what the encoding is for the csv file? guess it's not utf-8, or something
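
For context on the byte in that traceback: 0xe4 is "ä" in Latin-1 (and in Windows-1252), so a file saved in one of those encodings is a plausible cause. The real encoding of 2001.csv is not known from the chat, so the sketch below uses made-up CSV bytes and assumes Latin-1 purely for illustration.

```python
import io
import pandas as pd

# Made-up CSV content saved as Latin-1, where "ä" is the single byte 0xe4.
raw = "name,count\nJärvi,3\n".encode("latin-1")

# Decoding those bytes as UTF-8 fails the same way the traceback does.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as exc:
    utf8_failed = True
    print(exc)

# Passing the encoding the file was actually written in lets pandas read it.
df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")
print(df)
```

With a real file the same `encoding=` argument goes straight into `pd.read_csv(path, encoding=...)`.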

pearl grove
serene scaffold
serene scaffold
#

I think colab uses a linux environment, but I'm not sure

unborn temple
pearl grove
unborn temple
pearl grove
serene scaffold
pearl grove
serene scaffold
#

is it possible to share the colab so I can look?

grave frost
serene scaffold
pearl grove
unborn temple
serene scaffold
#

there's a csv module that comes with python

#

!docs csv

arctic wedgeBOT
#
csv

Source code: Lib/csv.py

The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. CSV format was used for many years prior to attempts to describe the format in a standardized way in RFC 4180. The lack of a well-defined standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer.

pearl grove
#

sorry I'm very new to coding so still unaware of the jargon

unborn temple
#

Try this

#

@pearl grove try putting your file path in filename variable in that code

pearl grove
pearl grove
unborn temple
#

This might help you faster

#
    filename = r"C:\Users\Mohammed Haris\Documents\Prog for DS\Coursework\datasets\2000.csv"
unborn temple
lapis sequoia
#

Hii all

pearl grove
pearl grove
untold yew
#

I have 32 images for my object classifier. Is 30 for training and 2 for testing fine?
32 images of all objects if I might add

unborn temple
untold yew
unborn temple
#

Is the problem hard or easy

untold yew
#

not really hard

#

it has to decide what object it is

#

from 3 very different ones, in front of a white wall

#

the objects are a tennis ball, nasal spray and hand cream

#

which all look very different

#

(in color and size)

unborn temple
#

Hmm, you could increase your dataset with some methods like cropping the images (without cutting out the objects), changing the colours, or stretching the images

untold yew
#

note, it doesnt need to be very good yet, im following a tutorial and this is the first one

unborn temple
#

Are you using tensorflow?

untold yew
#

yes

#

tf1

unborn temple
#

Then wait a sec, I will tell you a preprocessing method

untold yew
#

okay

#

I labeled all the images already btw

#

1 label, 1 picture | 32 times

unborn temple
#

Here are the methods to increase the dataset

untold yew
#

I think it does augmentation automatically

#

at least he said that

#

in the video

unborn temple
#

Hmm, then, 25 for training, remaining for testing

untold yew
#

btw someone just told me I should use 70/30 for training/testing

untold yew
unborn temple
#

It all depends on complexity

untold yew
#

I see

#

I put all the images in, im gonna train now

unborn temple
untold yew
#

cause I picked from model zoo

#

but not sure

#

im new to this

unborn temple
#

It might take some time then

untold yew
#

it will

#

it did when I tried yesterday too

#

but thats fine I planned that

unborn temple
#

So okay, good luck on your project

untold yew
#

thanks!

untold yew
#

@unborn temple there doesnt seem to be a lot happening here except it having saved the first checkpoint 40 min ago

#

even though it is using 20% cpu

#

and directml has connected to my gpu

lapis sequoia
#

Hello is it possible to compare the values of two columns of different length in two different dataframes?

df['habitants'] = habitants_df.loc[(habitants_df['Municipality'].apply(lambda x: str(x).split(' ', 1)[-1]) == df['cityTown']), 'Total']

Getting: raise ValueError("Can only compare identically-labeled Series objects") ValueError: Can only compare identically-labeled Series objects

#

df has approximately 53k rows whereas habitants_df has 501 rows

azure talon
#

What is a great way to learn Python ML?

serene scaffold
#

@lapis sequoia you can't compare series that don't have identical sets of indices. You have to use the eq method and specify a fill value

#

!docs pandas.Series.eq

arctic wedgeBOT
#

Series.eq(other, level=None, fill_value=None, axis=0)```
Return Equal to of series and other, element-wise (binary operator eq).

Equivalent to `series == other`, but with support to substitute a fill_value for missing data in either one of the inputs.
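
A small sketch of the difference between `==` and `Series.eq` on two Series whose indices do not match. The values are made up; only the behaviour of the two comparison styles is being shown.

```python
import pandas as pd

left = pd.Series([1, 2, 3])          # index 0, 1, 2
right = pd.Series([1, 2])            # index 0, 1

try:
    left == right                    # labels differ, so pandas refuses
    raised = False
except ValueError:
    raised = True

# .eq aligns on the index first; the missing label compares as False.
aligned = left.eq(right)

# With fill_value, the missing entry is replaced before comparing.
filled = left.eq(right, fill_value=3)

print(raised, aligned.tolist(), filled.tolist())
```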
vale isle
#

I get the error "You must compile your model before training/testing. Use model.compile(optimizer, loss)." on the last line

I compiled it... can anyone spot my mistake?

model_2 = tf.keras.Sequential([
    data_augmentation ,
    layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2), strides=2),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2), strides=2),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2), strides=2),
    layers.Flatten(),
    layers.Dense(64,activation='relu'),
    layers.Dense(5, activation='softmax'),
])

model_2.compile(optimizer='adamax',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(),
              metrics=['accuracy'])

history_3 = model_2.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs,
    callbacks=[es]
)


adamax_nodropout_es = model_2.evaluate(val_ds)
serene scaffold
#

@vale isle you redefined model_2

lapis sequoia
vale isle
serene scaffold
vale isle
#

i'm going to restart and run the complete notebook. i guess i messed up with some copy pastes trying out different parameters

serene scaffold
#

I'd have to look at the notebook and know the exact execution order. But I would restart the kernel and do everything in one cell until you're sure it's working

vale isle
#

allright, thanks!

rose pasture
stone marlin
#

In the case of the Scaler, the Scaler "fits" onto your data and gets a mean and a standard deviation (which are given by the mean_, scale_ parts below). It then computes the corresponding z-score for whatever you put in when you call transform.

In [1]: from sklearn.preprocessing import StandardScaler
In [2]: import numpy as np
In [3]: scaler = StandardScaler()
In [4]: x = np.array([10, 10, 10, 12, 10, 12, 9, 9, 8, 8]).reshape(-1, 1)
In [7]: scaler.fit(x)

In [10]: scaler.mean_
Out[10]: array([9.8])

In [11]: scaler.scale_
Out[11]: array([1.32664992])

In [13]: scaler.transform(np.array([9.8, 12, 15, 20]).reshape(-1, 1))
Out[13]:
array([[0.        ],
       [1.6583124 ],
       [3.91964748],
       [7.68853929]])
#

In general, there are two parts to most estimators / processors in Sklearn: fit and transform. Fit will take in training data and figure out what it needs to do with the data (compute means, stdevs, or, in the case of models, the model itself). Transform will transform the data into a new form to be used wherever. Scaling, for example, will scale the data nicely. PCA will apply PCA to the data.

Pipelining in sklearn makes heavy use of this and it might "click" better to see things in terms of pipelines:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

lapis sequoia
#

Bayesian optimization is a little more sophisticated. It assumes that a specific probability distribution, which is typically a Gaussian distribution is underlying the performance of model architectures. So you use observations from tested architectures to constrain the probability distribution and guide the selection of the next option. This allows us to build up an architecture stochastically based on the test results and the constrained distribution.
I don't understand what does
which is typically a Gaussian distribution is underlying the performance of model architectures.
mean.
Also, what does this
to constrain the probability distribution and guide the selection of the next option.
mean?

stone marlin
#

Bayesian optimization is a little rough to understand. The gist (greatly simplified) is:

You assume that something has a certain distribution (like maybe a normal curve). But then you get more data and you notice it's a little skewed. So you update the distribution. Then the distribution tells you what to do next. You do it, you get more data which changes the distribution a bit more --- so you update it and continue.

serene scaffold
#

@lapis sequoia @vale isle did you both figure out what you were working on?

rose pasture
stone marlin
#

It does figure it out by itself given the data you put into the "fit" method. StandardScaler, for example, when you "fit" the data, it'll find the mean and stdev of that data. Then, whenever you "transform" new data, it'll scale it according to the mean and stdev it found.

The z-score is a standard thing in statistics that maps some value to a standard normal distribution. I think this site might explain what the z-score is and how it's used:
https://www.simplypsychology.org/z-score.html

The gist is, it tells you how many standard deviations away from the mean a data point is. So if you transform a number and get "1" back, it was 1 stdev above the mean. If you get -1.6 back, it was 1.6 stdev below the mean.
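
The z-score arithmetic can be checked by hand against the StandardScaler session earlier in the chat. This sketch uses plain NumPy and the same ten numbers; `z_score` is a hypothetical helper name. `x.std()` is the population standard deviation (ddof=0), which is also what StandardScaler uses.

```python
import numpy as np

x = np.array([10, 10, 10, 12, 10, 12, 9, 9, 8, 8], dtype=float)

mean = x.mean()          # what StandardScaler stores as mean_
std = x.std()            # population std (ddof=0), what it stores as scale_

def z_score(value):
    """How many standard deviations `value` lies from the training mean."""
    return (value - mean) / std

new_values = np.array([9.8, 12, 15, 20])
print(z_score(new_values))
```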

rose pasture
stone marlin
#

No problem, if you haven't done stats in a while it can be a bit confusing, and you slowly get used to the fit-transform stuff in sklearn.

What do you mean by the range?

rose pasture
stone marlin
#

Like, a confidence interval? The mean and standard deviation of data are both floats. For example,

In [14]: a = np.array([1, 2, 3, 4])
In [15]: a.mean()
Out[15]: 2.5
In [16]: a.std()
Out[16]: 1.118033988749895
rose pasture
azure talon
stone marlin
#

Well, you could find the range of the transformed training data by doing .max(), .min(), but there won't be a max or a min in general. The StandardScaler can transform a number to any number between -infty and infty.

rose pasture
lapis sequoia
# serene scaffold @lapis sequoia you can't compare series that don't have identical sets of...

Thanks for the clarification. Can I set a multi-index on the columns that I want to compare within the dataframe that has fewer rows and then perform a .loc with column values of the other dataframe? This is what I mean:

multi_indexed_habitants_df = habitants_df.set_index(['Municipality', 'Period']).sort_index(ascending=True)
df['habitants'] = multi_indexed_habitants_df.loc[(df['cityTown'], pd.to_datetime(df['startDate'], format='%Y-%m-%dT%H:%M:%S').dt.year), 'Total']
serene scaffold
#

Please ping me when you've done that.

#

also, please run it before the code that you show me, so that I can see what the data looks like before this

lapis sequoia
slow vigil
#

Is there any way to use a function that returns two separate values with rolling.apply() ?

#

in pandas

#

df['A'], df['B'] = series.rolling(7).apply(my_func)

#

something like this?

serene scaffold
slow vigil
#

ohh ok, but can I use one that returns two values

serene scaffold
#

and no, a function applied with rolling has to return a numeric type.

slow vigil
#

oh I know what I could do

#

I could apply rolling inside the function to two separate series

#

and then return them

#

or

#

no

serene scaffold
#
roller = series.rolling(7)
df['A'], df['B'] = roller.apply(my_func), roller.apply(other_func)
slow vigil
#

you know what I mean

serene scaffold
#

@lapis sequoia still working on it?

lapis sequoia
serene scaffold
vale isle
lapis sequoia
#

{'incidenceId': [56125, 56123, 56122, 56121, 56120], 'sourceId': [1, 1, 1, 1, 1], 'incidenceType': ['Accidente', 'Accidente', 'Accidente', 'Seguridad vial', 'Seguridad vial'], 'autonomousRegion': ['Euskadi', 'Euskadi', 'Euskadi', 'Euskadi', 'Euskadi'], 'province': ['BIZKAIA', 'ARABA', 'GIPUZKOA', 'BIZKAIA', 'GIPUZKOA'], 'carRegistration': ['BI', 'VI', 'SS', 'BI', 'SS'], 'cause': ['Alcance', 'Alcance', 'Alcance', 'Avería', 'Avería'], 'cityTown': ['Zeanuri', 'Vitoria-Gasteiz', 'Eskoriatza', 'Barakaldo', 'Villabona'], 'startDate': ['2022-01-08T13:21:08', '2022-01-08T13:36:43', '2022-01-08T13:38:31', '2022-01-08T12:27:20', '2022-01-08T10:52:05'], 'incidenceLevel': ['Green', 'Yellow', 'Yellow', 'Green', 'Green'], 'road': ['N-240', 'N-622', 'AP-1', 'A-8', 'N-1'], 'pkStart': [39.0, 5.0, 121.0, 124.0, 444.0], 'pkEnd': [39.0, 5.0, 121.0, 124.0, 444.0], 'direction': ['BILBAO', 'BILBAO', 'Madrid', 'CANTABRIA', 'IRÚN'], 'latitude': [43.06164, 42.88119, 43.01685, 43.28869, 43.19736],
'longitude': [-2.70798, -2.69503, -2.541243, -3.0047, -2.0412], 'incidenceName': [None, None, None, None, None], 'endDate':
[None, None, None, None, None]}
Int64Index([ 0, 2, 3, 4, 5, 7, 8, 9, 10,
11,
...
57334, 57335, 57337, 57338, 57339, 57340, 57341, 57343, 57344,
57345],
dtype='int64', length=43493)
{'Municipios': ['Agurain/Salvatierra', 'Agurain/Salvatierra', 'Alegría-Dulantzi', 'Alegría-Dulantzi', 'Amurrio'], 'Sexo': ['Total', 'Total', 'Total', 'Total', 'Total'], 'Periodo': [2021, 2020, 2021, 2020, 2021], 'Total': ['5.029', '5.038', '2.925', '2.935', '10.307'], 'Provincia': ['Araba', 'Araba', 'Araba', 'Araba', 'Araba']}
RangeIndex(start=0, stop=502, step=1)

serene scaffold
#

thanks 😄

#

@lapis sequoia wow, are you Basque?

lapis sequoia
#

Yes, I am

serene scaffold
#

wow, I've never met a Basque person surprisedPika

upbeat dove
#

Can I make a Sequential take any length of an input (for an RNN network)

serene scaffold
#

pretty sure you have to use something like an RNN for that, but I'm not completely sure

upbeat dove
#

Yeah I am

#

My nn will consist of Embedding -> LSTM (no return sequences) -> Dense

serene scaffold
#

@lapis sequoia can you explain what comparison you're trying to do? population for towns at each period?

lapis sequoia
upbeat dove
lapis sequoia
#

Basically I'm trying to create a new column called habitants within the incidences dataframe that contains the number of habitants of that particular city where the incidence took place, for that I compare Municipality and cityTown which are the same but with different names for each dataframe and then I extract the year from the startDate column and compare it to the Period column which is just the year.

serene scaffold
#

so first I did df['year'] = pd.to_datetime(df['startDate'], format='%Y-%m-%dT%H:%M:%S').dt.year

lapis sequoia
lapis sequoia
#

With that you have a new column with the year within incidences df

serene scaffold
#

!docs pandas.DataFrame.merge

arctic wedgeBOT
#

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)```
Merge DataFrame or named Series objects with a database-style join.

A named Series object is treated as a DataFrame with a single named column.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
serene scaffold
#

So you need to pick which columns represent the same thing in both dataframes

#

which one means the same thing as Municipios?

#

cityTown? or province?

lapis sequoia
serene scaffold
#

okay great

#

so you will have something like left_on=['Municipios', 'Periodo'], right_on=['cityTown', 'year']

#

to show which columns are used to match the rows.
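
A sketch of that merge on made-up rows shaped like the two dataframes printed earlier. Here the incidents dataframe is the left side, so the `left_on` / `right_on` lists are swapped relative to the message above. The names `incidents`, `population` and the second population figure are invented for illustration, and the `Total` strings are assumed to use "." as a thousands separator.

```python
import pandas as pd

# Made-up rows shaped like the two dataframes shown earlier in the chat.
incidents = pd.DataFrame({
    "cityTown": ["Amurrio", "Amurrio", "Zeanuri"],
    "startDate": ["2021-03-01T10:00:00", "2020-07-15T08:30:00", "2021-01-08T13:21:08"],
})
population = pd.DataFrame({
    "Municipios": ["Amurrio", "Amurrio"],
    "Periodo": [2021, 2020],
    "Total": ["10.307", "10.250"],   # "." assumed to be a thousands separator
})

incidents["year"] = pd.to_datetime(incidents["startDate"], format="%Y-%m-%dT%H:%M:%S").dt.year
population["habitants"] = population["Total"].str.replace(".", "", regex=False).astype(int)

# Left join keeps every incident; towns with no match get NaN.
merged = incidents.merge(
    population[["Municipios", "Periodo", "habitants"]],
    how="left",
    left_on=["cityTown", "year"],
    right_on=["Municipios", "Periodo"],
)
print(merged[["cityTown", "year", "habitants"]])
```

A left join is used so that incidents in towns missing from the population table are kept rather than silently dropped.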

lapis sequoia
#

I see and then once I do the merge I can query those columns as usual

serene scaffold
#

yes

lapis sequoia
#

spicy move not gonna lie

#

Well I'm gonna implement that and see what I get

serene scaffold
#

I have to go, but I wrote down the solution. So see if you can figure it out

#

if you try and can't figure it out, I will show you.

lapis sequoia
#

If you wanna learn basque let me know😅

stone marlin
#

This isn't really data science or ai, you may have better luck in one of the help channels, Viking.

eager cloak
#

i tried clicking on #discord-bots and mustve accidentally clicked here

lapis sequoia
#

When the distance between observations grows, supervised learning becomes more difficult because predictions for new samples are less likely to be based on learning from similar training examples.
What is meant by "observations" here?

#

Is that input variables?

cursive onyx
#

can someone send the source code of a premade chatbot AI? would mean a lot

#

in python

proper wren
#

Who wants to work on a project involving ML and AI?

serene scaffold
proper wren
serene scaffold
proper wren
serene scaffold
#

in a public repository? link?

proper wren
serene scaffold
#

who is dekriel

proper wren
proper wren
#

not much progress yet tho

earnest fog
#

Why is the csv file being read like this instead of displaying the columns and rows nicely, or is the csv file broken?

#

Like this

earnest fog
#

not the other

proper wren
earnest fog
#
data_files = [file for file in os.listdir('./dataframes')]

both_subjects = pd.DataFrame()

for file in data_files:
    df = pd.read_csv('./dataframes/' + file)
    both_subjects = pd.concat([both_subjects, df])

both_subjects.head()
#

this is the code

proper wren
#

hm

#

i think thats the website

earnest fog
#

Any idea how to fix it?

serene scaffold
#

your separator appears to be a ;, whereas , is the default.

#

also, you should avoid design like both_subjects = pd.concat([both_subjects, df]). this involves copying every single cell of both_subjects every time, and is thus horribly inefficient
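
Both points can be sketched together: pass `sep=";"` when reading, collect the frames in a list, and call `pd.concat` once at the end. The in-memory "files" below are made-up stand-ins for the CSVs in `./dataframes`, so the example runs without touching disk.

```python
import io
import pandas as pd

# Stand-ins for files on disk; each uses ";" as the separator.
fake_files = {
    "subject_a.csv": "name;score\nann;1\nbob;2\n",
    "subject_b.csv": "name;score\ncid;3\n",
}

# Read every file into a list first...
frames = [pd.read_csv(io.StringIO(text), sep=";") for text in fake_files.values()]

# ...then concatenate once, instead of growing a dataframe inside the loop.
both_subjects = pd.concat(frames, ignore_index=True)
print(both_subjects)
```

With real files, the list comprehension would read `pd.read_csv(path, sep=";")` for each path instead.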

midnight fossil
#

Hi, im trying to add the column "Days" with the column "Received". However, idk how to make it so python adds it with the days in the month. ex: 7 + 06.12.21 = 13.12.21

#

The library im using is pandas
Any help would be really appreciated

serene scaffold
#

@midnight fossil one column is of floating point numbers, and one is of strings. are you trying to do addition or string concatenation?

#

oh I see. they're both wrong.

#

Days needs to be a timedelta, and Received needs to be a datetime.

midnight fossil
#

I've heard of the datetime library before

#

but im not quite sure what you mean by that

#

sorry im a bit of a noob

#

like from dateutil.relativedelta import relativedelta?

serene scaffold
#

a datetime is an exact time in the real world ("16.12.21"), and a timedelta is a duration ("7 days")
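
The distinction can be shown with the standard library, using the example from the question (7 days added to 06.12.21). The format string assumes the dates are written day.month.year.

```python
from datetime import datetime, timedelta

received = datetime.strptime("06.12.21", "%d.%m.%y")   # day.month.year
waiting = timedelta(days=7)

# datetime + timedelta gives a new datetime, with month lengths handled for you.
due = received + waiting
print(due.strftime("%d.%m.%y"))
```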

midnight fossil
#

Ok, thanks. I guess ill watch a video on how to convert "Received" to a datetime

serene scaffold
#

can you do print(df.head().to_dict('list')) and copy and paste the text into this chat?

#

I will not accept a screenshot

midnight fossil
#

import pandas as df
df = df.read_excel('C:/Users/enesi/OneDrive/Amazon accounting.xlsx')
Total = df['Days '] + df['Received']
print(df.head().to_dict('list'))

serene scaffold
#

you have to copy and paste the result of the print statement, not the code.

midnight fossil
#

oh, my bad😆

serene scaffold
#

C:/Users/enesi/OneDrive/Amazon accounting.xlsx is not a file that I have, so this is how I can figure out what is in it

#

in a way that is useful

midnight fossil
#

result[mask] = op(xrav[mask], yrav[mask])
TypeError: unsupported operand type(s) for +: 'float' and 'str'

serene scaffold
#

do it immediately after you define df

#

before the line that causes the error

midnight fossil
#

'Amazon product testing': [1, 2, 3, 4, 5], 'Platform': ['Telegram', 'Telegram', 'Telegram', 'Telegram', 'Telegram'], 'Seller name': ['Mikki Mikki', 'Wangzekun', 'Chari', 'E S', 'Mikki Mikki'], 'Order date': ['03.12.21', '03.12.21', '04.12.21', '05.12.21', '05.12.21'], 'Order num': ['302-9210629-2828348', '302-8699596-8046740', '302-4841467-2222755', '302-4797916-9617919', '302-1684012-4898734'], 'product name': ['Uhrenladegerät', 'PC controller', 'GROJAT smart uhr', 'Aufsteckbürsten', 'Styles pen'], 'price': ['12,99', '27,99', '45,99', '7,99', '27,98'], 'Days ': [7.0, 5.0, 7.0, 5.0, 7.0], 'Received': ['06.12.21', '08.12.21', '08.12.21', '08.12.21', '08.12.21'], 'Reviewed': [1.0, 1.0, 1.0, 1.0, 1.0], 'refund status': [1.0, 1.0, 1.0, 1.0, 1.0], 'sold price': [nan, nan, nan, nan, nan], 'Add. Info': [nan, nan, 'picture', nan, nan], 'Trustworthy': ['YES', 'YES', 'YES', 'YES', nan]}

#

it just lists everything in the excel file

serene scaffold
#

There's no opening {, but I'm going to assume that's the only character that's missing.

midnight fossil
#

yeah

serene scaffold
#

If there are missing columns in this, I will never be able to figure that out if you don't tell me.

midnight fossil
#

forgot to copy that

#

everything is there except the {

serene scaffold
#
>>> df2 = df.assign(
    **{
        'Order date': pd.to_datetime(df['Order date'], format='%d.%m.%y'),  # Convert `Order date` (day.month.year) to a timestamp
        'Days': pd.to_timedelta(df['Days '], unit='D')  # convert `Days ` to a duration, without the extra space
    }
).drop('Days ', axis=1)  # Drop the column that has the extra space

@midnight fossil look at this

midnight fossil
#

Damn

#

Thanks a lot

#

quick question, what do the 2 ** mean?

serene scaffold
#

good question! it's an esoteric python thing.
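
For reference, `**` in a function call unpacks a dict into keyword arguments. A minimal sketch with hypothetical function names:

```python
def describe(name, days):
    return f"{name}: {days} days"

options = {"name": "order", "days": 7}

# These two calls are equivalent: ** spreads the dict into keyword arguments.
a = describe(name="order", days=7)
b = describe(**options)

# It is handy when a key is not a valid identifier, like a column name with a space.
def collect(**columns):
    return sorted(columns)

keys = collect(**{"Order date": 1, "Days": 2})
print(a == b, keys)
```

That second use is why the `df.assign(**{...})` form shows up with column names such as `Order date`, which could not be written as a plain keyword argument.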

midnight fossil
midnight fossil
serene scaffold
#

yes, I know what two asterisks you're referring to, lol

#

also remove the >>> as those are part of a REPL

#

which is a type of python environment

#

you're probably not using that.

midnight fossil
#

Yeah, i figured.

#

Thanks again for the help

serene scaffold
#

anyway, from there, Days and Order date are in formats that you can work with as actual time representations

#

since Order date becomes an exact timestamp, you can add durations to it and get an updated timestamp
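
A small sketch of that, using two rows copied from the dict above and assuming the dates are day-first (dd.mm.yy); the name `expected_by` is just a placeholder:

```python
import pandas as pd

df = pd.DataFrame({
    "Order date": ["03.12.21", "05.12.21"],
    "Days ": [7.0, 5.0],
})

# Parse the strings into real timestamps and the day counts into durations
order_date = pd.to_datetime(df["Order date"], format="%d.%m.%y")
days = pd.to_timedelta(df["Days "], unit="D")

# Timestamp + duration gives a new timestamp
expected_by = order_date + days
print(expected_by.dt.strftime("%d.%m.%y").tolist())
```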

midnight fossil
#

I still got this error for some reason:

#

I think it's because you created a dataframe, but I don't need a dataframe since I'm trying to get python to read off of an Excel document

serene scaffold
tall drum
#

Hi, I have three numpy arrays: x and y coordinates and deflections in the z direction, each of size about 2500. I tried to do meshgrid for the x/y coordinates, which was successful, but somehow when I try np.meshgrid(x,y,z) it uses all the memory on my machine (180GB).
Is this normal or is there a better way maybe?
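
For scale, a rough sketch of why the three-argument call blows up: a dense 3D meshgrid stores one value per (x, y, z) combination, in each of its three output arrays. The small stand-in arrays below are made up so the snippet runs quickly.

```python
import numpy as np

n = 2500
# One dense float64 output array of shape (n, n, n); meshgrid returns three of them
bytes_per_array = n**3 * np.dtype(np.float64).itemsize
print(bytes_per_array / 1e9, "GB per array")

# Tiny stand-ins to show the shapes involved
x = np.linspace(0.0, 1.0, 4)
y = np.linspace(0.0, 1.0, 5)
z = np.linspace(0.0, 1.0, 6)

dense_shapes = [a.shape for a in np.meshgrid(x, y, z)]
# sparse=True keeps each axis as a thin array that broadcasts when combined
sparse_shapes = [a.shape for a in np.meshgrid(x, y, z, sparse=True)]
print(dense_shapes, sparse_shapes)
```

If z is really one deflection per (x, y) point, a 2D grid over x and y plus interpolation of z onto it (for example with `scipy.interpolate.griddata`) may be closer to what is wanted, though that is an assumption about the data.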

wicked grove
#

I checked it ,apparently there is intel(R) UHD 620 and radeon graphics (I'm kinda confused about this)

void peak
severe folio
#

Would it be unwise to try and learn how to do machine learning/ai with python (I have only probably written 400 lines of code max)

lapis sequoia
#

this is just my two cents, I'm still somewhere at point 2/3 myself

warm jungle
lapis sequoia
gentle lion
#

I have a neural network that should try to classify images of chairs into 18 different classes. More specifically, given an image of a chair, it should be able to predict its rotation. Each class is a certain rotation angle of a chair. For example, as you can see in the image, all chairs in the class "260 degrees" are rotated 260 degrees. I have the same 1.8k images in each class, but rotated differently of course. Now after 16 epochs, my model's accuracy has not improved. It's still guessing with an accuracy of 5.5% (which is random guessing if there are 18 classes). I'll also include my neural network's structure. Does anyone know some key things to make this better? (maybe even a different approach, as it should ideally predict more angles and not only angles divisible by 20)

night gorge
#

I plotted a boxplot using seaborn library.
For "Iris-setosa",
Q1 is 3.1,
Q2 is 3.4,
Q3 is 3.6
IQR value is Q3-Q1 = 0.5

According to theory, the lower and upper whiskers (the extended thin lines) should be at (Q1 - 1.5xIQR) and (Q3 + 1.5xIQR) respectively.
That would give us values 2.35 and 4.35 respectively.

But as you can see in the box plot for "Iris-setosa", the starting and ending points of the whiskers are not on those values. I have also specified that the whisker value should be 1.5 while calling boxplot. Why does this happen?

wicked grove
#

Can someone please tell me how i can increase the ram on google colab

serene scaffold
serene scaffold
#

though keep in mind that as a free service, they're not going to keep granting you more ram indefinitely

#

also, if you're using a deep learning library, I would confirm that your tensors are actually on the GPU and not in the CPU.

wicked grove
wicked grove
serene scaffold
wicked grove
#

and run it on my local pc?yes somebody suggested that to me, but idk if cuda works for radeon graphics

serene scaffold
#

numpy arrays can't go on the GPU, so they'll just take up all your RAM.

wicked grove
#

yeah i am not even tho i have activated it

serene scaffold
#

so that's probably the problem

serene scaffold
#

use cuda tensors. in colab.

wicked grove
wicked grove
wicked grove
#

ill google a few videos and try it,thank youu!!

#

@serene scaffold can i use sklearn's train_test_split on tensors?

serene scaffold
wicked grove
#

tensorflow

serene scaffold
untold yew
#

this code would work locally and just detect objects I hold infront of the webcam. However, I am doing it in google colab and it just stops after 0 seconds and never shows anything. I suppose that has to do with the capture device not being accessed. How do I access my webcam from google colab with opencv?

wicked grove
serene scaffold
lapis sequoia
#

Hey guys, I have two concerns in the graph. Could someone guide me.

  1. How do I make the years (x-axis) tilt when I'm using plt.stackplot?
  2. What scale should be used to make the values on the y-axis more interpretable?
gentle lion
#

I am trying to make a neural network that can predict the rotation of a chair. I have right now 18 different categories that each contain 1.7k chair images of different degrees ( so 1 category contains 1.7k images of chairs rotated 20 degrees, another for 40 degrees and so on)

#

Does anyone know some CNN structure to make this work?

gentle lion
wicked grove
#

i have an array as 512,512 how can i reshape it to 512,512,1
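
For reference, a minimal sketch of three equivalent ways to add a trailing axis in NumPy (the zero-filled array is just a stand-in for the real image):

```python
import numpy as np

arr = np.zeros((512, 512))  # stand-in for the real 512x512 array

a = arr.reshape(512, 512, 1)      # explicit target shape
b = arr[..., np.newaxis]          # add a new last axis by indexing
c = np.expand_dims(arr, axis=-1)  # same thing, spelled out

print(a.shape, b.shape, c.shape)
```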

fair oracle
#

hii is anyone well versed in pandas

#

dataframes

stone marlin
boreal bear
#

Hello all, I need help with a problem:

I have a Pandas dataframe and want to create a conditional column based on the "USA State" column. I want to use multiple conditions to yield a column with a rep name. I have created lists associated with each rep and want to write the rep's name in a column if the State is in that rep's list.
df1["Rep Name"] = ["Darrin" if i in Darrin else "None" for i in df1["State"]]
This works but I want it to work with multiple names. How would I do this?

serene scaffold
#

@boreal bear I'm on mobile, but the solution doesn't involve any loops. Look into masking with conditionals.
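
A minimal sketch of what that could look like, assuming one list of states per rep. The second rep `Maria` and all the state values are invented for illustration:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"State": ["TX", "CA", "NY", "WA"]})
Darrin = ["TX", "OK"]
Maria = ["CA", "WA"]  # hypothetical second rep

# One boolean mask per rep; np.select picks the first name whose mask is True
conditions = [df1["State"].isin(Darrin), df1["State"].isin(Maria)]
names = ["Darrin", "Maria"]
df1["Rep Name"] = np.select(conditions, names, default="None")
print(df1)
```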

serene scaffold
stone marlin
# night gorge I plotted a boxplot using seaborn library. For "Iris-setosa", Q1 is 3.1, Q2 i...

For a boxplot, your whiskers won't be exactly on (q1 - 1.5 * iqr, q3 + 1.5 * iqr), since, otherwise, boxplots would always be symmetric. Instead, each whisker is on the max/min value which is not an outlier. An outlier, in this case, is defined as anything outside that (q1 - 1.5 * iqr, q3 + 1.5 * iqr) range.

(Also, I think your values for q1, q2, q3 may be off, I get slightly different ones using the same data and functions.)

q1: 3.2
q2: 3.4
q3: 3.675
IQR: 0.475
Whisker bounds: (2.9, 4.2)

Which means, in this case, the bounds should be at 2.9 and 4.2 respectively, which seems to be the case in the plot.

Refs:
https://www.simplypsychology.org/boxplots.html
https://github.com/mwaskom/seaborn/blob/e04b07eb3df135511e71e556c2bd34ef59ba08ba/seaborn/categorical.py#L1288-L1294
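
Here is a small sketch of that rule with NumPy. The numbers are made up (not the iris data), picked so that there is one outlier on each side:

```python
import numpy as np

# Made-up sample with one low and one high outlier
data = np.array([2.0, 2.9, 3.0, 3.1, 3.2, 3.4, 3.4, 3.5, 3.6, 3.7, 3.8, 4.2, 5.0])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that are still inside the fences
lower_whisker = data[data >= low_fence].min()
upper_whisker = data[data <= high_fence].max()
outliers = data[(data < low_fence) | (data > high_fence)]

print(lower_whisker, upper_whisker, outliers)
```

So the fences only decide which points count as outliers; the whiskers themselves always land on actual data values.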

arctic wedgeBOT
#

seaborn/categorical.py lines 1288 to 1294

```py
def draw_box_lines(self, ax, data, support, density, center):
    """Draw boxplot information at center of the density."""
    # Compute the boxplot statistics
    q25, q50, q75 = np.percentile(data, [25, 50, 75])
    whisker_lim = 1.5 * (q75 - q25)
    h1 = np.min(data[data >= (q25 - whisker_lim)])
    h2 = np.max(data[data <= (q75 + whisker_lim)])
```
lapis sequoia
#

Hey guys, How can i plot the two values on the yaxis as shown in this figure?

blazing anchor
#

I'm having some trouble appending DFs. I get the error Reindexing only valid with uniquely valued Index objects. The dfs have multiindex (date, company identifier) and the combo is unique. Date will repeat, company identifier will repeat, but the combo will not. I cannot reindex the dfs as I need the index values. I have a sample in a kaggle notebook that I can share. The dfs share some columns, but each have unique columns as well.

serene scaffold
blazing anchor
charred umbra
mild dirge
#

It should learn the rotation of the chair, instead of which "class" it belongs to

lapis sequoia
mild dirge
#

That way it would also be easier to generalize it to more rotation values

lapis sequoia
#

Does anyone know if its a good idea to normalize all my data per column at the start or should I normalize them after I split them into train/test data

#

I am not sure how it works

serene scaffold
#

If you normalize the test data differently, the model won't understand what it's looking at

lapis sequoia
#

Okay thank you very much, I am doing a ml project and it's my first one so I have a lot of questions. I've really just been opening and closing tabs for the last 10 hours.

lapis sequoia
serene scaffold
#

Idk

stone marlin
#

When I do my preprocessing, I make a pipeline and fit my normalizers / scalers / whatever only on training data. If my training data is representative, then my test data should scale the same way with the same parameters.

#

If you fit your scaler with test data, that's using test data for training, which isn't an excellent practice.

lapis sequoia
#

I see so what I did is I normalized all my data between 0 and 1 and then split them

#

So in a way the test data are affected by the normalization

#

But only if they do contain the maximum/minimum value

stone marlin
#

When making a model, you're probably never going to want to use your test set (or validation set, in the case of CV) for any preprocessing.

#

I honestly don't think it would affect too much of the preprocessing, but looking at or manipulating the test data in the parameter fitting / training step is not a good practice to get into.

#

So, if you want to do some kind of normalization, it should be like this:

  1. Split data into train/test (or train/validation for CV).
  2. Fit your normalizer / whatever on your training data.
  3. Make the model and train it on the training data.

Now you have two things: your normalizer and your model. When you score the test data, you want to put it through both. This is called pipelining.

See: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

#

See the example there. They do a StandardScaler and an SVC (which is a type of model). Then they fit that, which tells the scaler how to scale everything and the model how to classify. Then it scores on the test set, running the testing sets through the pipeline.
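
Roughly, the steps above can be sketched like this. The data is synthetic (from `make_classification`), so the score itself means nothing; it only shows the order of operations:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 2 + 3. Scaler and model are fitted together, on the training data only
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
pipe.fit(X_train, y_train)

# Scoring pushes the test data through the already-fitted scaler, then the model
test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)
```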

lapis sequoia
#

I see so I should split my data first

#

I havent checked cross-validation yet but I think I have to do it for my project

stone marlin
#

IMO, best practice when you get data is kind'a check it out and get a test set / validation set ready at the beginning.

lapis sequoia
#

• Preprocessing: normalization
• Learning: training, cross-validation
• Diagnostics: testing, accuracy, loss
I need to do all of this for multiple linear regression

stone marlin
#

It's all good, you still should split with CV to get a training set and a validation set, but some people (for whatever reason) don't use a validation set. Validation sets are pretty much just "test sets" for CV. At least, they're used in the same way.

#

Yep. So, you'd pretty much follow something similar to the pipeline example here. Then at the end, to test, you'd use whatever the metrics you want, here: https://scikit-learn.org/stable/modules/model_evaluation.html

lapis sequoia
#

I am not sure if I need to split validation though, my data is time-related its a stock price

#

And I have read in the course that i need to get train data to be older

stone marlin
#

It's up to you. Some people split for CV, some people don't.

lapis sequoia
#

CV is cross validation?

stone marlin
#

Yes, Cross Validation.

#

For time-series, this can get a bit tricky. I'd be interested to know how some of the DS people in here handle train/test or CV on their time-series. There's many ways to do it.

#

I use older data for train and newer data for test, but I also normalize with respect to trend and seasonality first. But some people do the opposite thing: they don't normalize and they test older, train newer.
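
A sketch of the "older for train, newer for test" version, assuming the rows are already sorted oldest first. `prices` is a stand-in array, and `TimeSeriesSplit` is scikit-learn's splitter that keeps every validation fold later than its training fold:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

prices = np.arange(10)  # stand-in for rows sorted oldest -> newest

# Plain chronological hold-out: first 70% train, last 30% test, no shuffling
split = int(len(prices) * 0.7)
train, test = prices[:split], prices[split:]

# Cross-validation variant: each fold trains on the past and validates on what follows
tscv = TimeSeriesSplit(n_splits=3)
folds = [(tr.tolist(), va.tolist()) for tr, va in tscv.split(prices)]
for tr, va in folds:
    print("train:", tr, "validate:", va)
```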

#

I've read both sides claim their way was better so, you know, who knows.

lapis sequoia
#

I see I am doing this for a course and prof says train to be older so I guess I wont have a problem with deciding :p

stone marlin
#

Haha, yeah, I honestly think it makes more sense, but they're both probably fine ways to do it.

lapis sequoia
#

But I have seen some tutorials where they normalize before splitting I am still not sure

stone marlin
#

When you say normalize, in this case, for time series, what normalization are you applying?

lapis sequoia
#

date isnt normalized its just works like index, all the prices and volume is normalized

#

its open/close/max/min

stone marlin
#

Like, trend + seasonality, or minmaxscaler?

lapis sequoia
#

prices

#

minmax

stone marlin
#

Oh, got'cha.

#

Ehhh. I would probably still split first and do it all with a pipeline, but if it's a course it's prob not a big deal if you normalize first, if that makes things significantly easier for you.

lapis sequoia
#

can i use the same minmaxscale to scale multiple columns and then reuse the multiplier of each column to the test data to normalize them with the same "multiplier"?

stone marlin
#

Yeah, that's what you usually will do in timeseries stuff. You'll scale on a lot of train data, then you'll apply that to the test data as well when you score that.

lapis sequoia
#

so the project states that i need to do cv, should i split the data to 70-10-20 train/val/test?

stone marlin
#

I'm not sure how good minmax scaling on stock data will be (as it is notoriously non-seasonal and variable w/rt short-term trends).

#

You won't need a test set anymore for CV, just the number of splits to use.

lapis sequoia
#

Learning: training, cross-validation
so if i want to do these 2 things I will need to resplit the data?

stone marlin
#

The easiest way might be to do something like... Make a training set and a test set, something like 70-30. Then you can use that for training without CV. For CV, you can just plug in the training set, and use the test set as your validation set.

#

That way you're still splitting up data nicely, but you only have to split it once and not worry about it too much.

lapis sequoia
#

Oh so I will never need the 3 sets for the same process

#

okay

#

I havent read the cv thing yet so thats why i have questions

stone marlin
#

Yeah, no problem. Some people here might do things differently, so they might have good advice as well.

lapis sequoia
#

I am trying to store the fitted values from the minmaxscaler so i can simply transform train data

stone marlin
#

Q for you NN People. I started to look into NNs --- neat stuff! But I wanted to know your workflow. When you get data and a task where you want to use a NN, do you build it up yourself, or look for pre-built ones, or what's the deal? I feel like it'd be annoying to build it up all the time, but there are "recipes" to make various types, so, you know, who knows. I'm interested in your take.

warped creek
#

do you guys need to know math for ml?

stone marlin
#

The "fit" part is the part that "saves" the fitted values, so you only need to call transform on the stuff.
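
Something like this, as a minimal sketch with made-up numbers (two columns, say a price and a volume):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[10.0, 100.0], [20.0, 300.0], [30.0, 200.0]])
X_test = np.array([[25.0, 400.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns per-column min/max from train
X_test_scaled = scaler.transform(X_test)        # reuses them, no refitting

print(scaler.data_min_, scaler.data_max_)
print(X_test_scaled)
```

Note that test values can land outside 0..1 when they fall outside the range seen in training, which is expected with this approach.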

stone marlin
# warped creek do you guys need to know math for ml?

Depends on what you want to do with it. I'm a strong believer that one needs at least a basic understanding of calc, linear algebra, and stats to have a career in DS, but to do projects or even to do more of the software-engineering side, you may not need to know as much.

lapis sequoia
#

Do you think I should normalize the target variable too?

#

I think thats where errors were popping from

#

cuz I didnt this for y_train/y_test too with the same scaler

#

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

#

also should i ignore this warning?

stone marlin
#

I'm not sure, but everything you do for the training set you should do for the test set as well. Yeahhh, that warning is saying you have something like,

df[row_indexer][col_indexer] = value
or something somewhere (assigning into a slice of a DataFrame), and it wants you to use

df.loc[row_indexer, col_indexer] = value
instead. Something like that. I try to fix it when I see it; I think it'll often work with that warning, but the assignment can end up on a copy instead of the original DataFrame.
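
A small sketch of two ways that usually avoid the warning; the column names and values are made up:

```python
import pandas as pd

df = pd.DataFrame({"close": [10.0, 12.0, 11.0], "volume": [100, 300, 200]})

# Take an explicit copy when you want an independent training frame
train = df.iloc[:2].copy()
train.loc[:, "close"] = train["close"] / train["close"].max()

# Assign through a single .loc call instead of chained indexing
mask = df["volume"] > 150
df.loc[mask, "close"] = 0.0

print(train)
print(df)
```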

wicked grove
#

hello, how can i use Conv2D on numpy arrays

#

i want to apply Conv2D on X_train before passing it to my model

odd meteor
wicked grove
odd meteor
odd meteor
wicked grove
lapis sequoia
#

hey, is anybody here...who can help me with pandas dataframes

#

i want to concatenate 3 dataframes below each other vertically ... using pandas
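
For reference, a minimal sketch with `pd.concat`; the three tiny frames are placeholders:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})
c = pd.DataFrame({"x": [5, 6]})

# axis=0 (the default) stacks rows; ignore_index=True renumbers them 0..n-1
combined = pd.concat([a, b, c], ignore_index=True)
print(combined)
```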

lapis sequoia
#

great !

#

@odd meteor just tell me, are you gonna text something related to me? .... cause I'm waiting for you as you're typing...XD

odd meteor
# wicked grove Yes it hangs notoriously, the mouse pointer stops working My ram size is 8 gb ,...

At this point, you might wanna consider purchasing the paid version of Colab 😀

On a more serious note, isn't there a way to switch to GPU instead of using CPU? I think there should be a way.

My PC's GPU is Iris XE. So I can't utilize CUDA either although I have thunderbolt4 port to connect eGPU.

If it's any consolation, not everyone with Nvidia GPU can afford to train heavy/deep NN on their pc. Most people still utilize colab.

lapis sequoia
# odd meteor 😊

Ooh, i thought. You would text something related to pandas merging data frames. that is why i was waiting for your text... *i know im very stupid

odd meteor
lapis sequoia
#

btw, do you use PyTorch or Tensorflow.. ?

odd meteor
lapis sequoia
#

i think pytorch is better with RNNs than tensorflow

odd meteor
#

To me, no framework is superior to the other. TensorFlow is a lot easier, that's why I picked it first. Ooh, Keras is fun as well.

wicked grove
odd meteor
wicked grove
#

Alrightt, thanks😁

lapis sequoia
#

hello

#

I'm trying to read these irregular tables into a dataframe

#

I dont know how to go about it

#

please help ~

#

it has like.. row data and column data

lapis sequoia
#

help plox

night gorge
#

```py
svm_model_linear = SVC(kernel = 'linear').fit(x_train, y_train)
```
.
ytrain is one hot encoded (it contains 1 among 3  values)
. 
But while running it gives error
```y should be a 1d array, got an array of shape (105, 3) instead.```
.
How to fix this?
lapis sequoia
lapis sequoia
#

they are column headers

lapis sequoia
#

then I'd start with deleting lines 1, 2, 8 and 9 if they're not relevant

upbeat prism
# wicked grove I checked it ,apparently there is intel(R) UHD 620 and radeon graphics (I'm kind...

not all GPUs support CUDA. Disk space is not the same as memory. Disk space is your HDD or SSD while memory is your RAM or the GPU's RAM. If you run out of RAM, the computer may use your HDD/SSD as RAM, which is very slow.

In any case, make sure you understand why you run out of memory. E.g. if your network has an input size of, let's say, 10 numbers (doubles) and you have a batch size of 1000, then you can compute approximately how much memory you need (10 * 1000 * 8 bytes = 80,000 bytes); 8 bytes because a double is 8 bytes (64 bits).

Now you will need a bit more memory since you might store some other things. Coding mistakes can lead to you running out of memory. Maybe you have too much input => out of memory. Make sure you understand why you run out of memory.
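
The back-of-the-envelope estimate can be sketched like this. `batch_bytes` is a hypothetical helper, and the second call assumes 512x512 single-channel float32 images purely as an example:

```python
import numpy as np

def batch_bytes(n_features, batch_size, dtype=np.float64):
    """Bytes needed to hold one input batch (inputs only, nothing else)."""
    return n_features * batch_size * np.dtype(dtype).itemsize

small = batch_bytes(10, 1000)                   # 10 doubles per sample, batch of 1000
images = batch_bytes(512 * 512, 32, np.float32)  # 32 images of 512x512 float32

print(small, "bytes")
print(images / 2**20, "MiB")
```

Activations, gradients and optimizer state come on top of this, so treat it as a lower bound.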

upbeat prism
# serene scaffold there's a few solutions given here: https://stackoverflow.com/questions/41859605...

@wicked grove

what I use is pytorch's subset.

        # Read samples dataset
        samplesDS = SamplesDataset(args.samples_file,
            device=args.device)

        # Make a 80/20 split for training/eval data
        k = len(samplesDS)
        train_indices = np.arange(0, int(k * 0.8), dtype='int')
        validation_indices = np.arange(int(k * 0.8), k, dtype='int')
        TrainDS = torch.utils.data.Subset(samplesDS, train_indices)
        ValidDS = torch.utils.data.Subset(samplesDS, validation_indices)

(I don't shuffle my data before splitting in this example)

Not sure if it's the best way though.

upbeat prism
lapis sequoia
wicked grove
wicked grove
#

Yeahh i have amd gpu and it isnt supported by CUDA

lapis sequoia
odd meteor
wicked grove
gentle lion
# lapis sequoia Hi have you found solution to this?

i have now switched to linear regression as it would indeed make more sense. I am training some different CNN structures, but my pc is very slow so it takes a long time. I will try on a better PC soon and then i will know if one of these structures works

night gorge
# odd meteor Is there any reason why you onehot encoded the response variable? Even if you ha...

I am a beginner, I am working on iris dataset.
.
We have 150 rows of Sepal Width, Sepal Length,Petal Width, Petal Length and its corresponding classification ("Iris-virginica","Iris-versicolor" or "Iris-setosa")
.
I need to make a model which will predict "classification" depending on values of SW, SL, PW, PL
.
I made:
x = SW, SL, PW, PL
y = "classification"
.
Then since y is a categorical data, I encoded with y=pd.get_dummies(y).
.
Then I split x and y and tried to model. Where can it go wrong?
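
For reference, a minimal sketch of the two label shapes involved. As far as I know, scikit-learn classifiers such as `SVC` expect a 1-D label array and accept string labels directly; the six rows below are made up and only two feature columns are used:

```python
import pandas as pd
from sklearn.svm import SVC

X = pd.DataFrame({
    "SW": [3.5, 3.0, 3.2, 2.8, 3.3, 2.7],
    "PL": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
})
y = pd.Series([
    "Iris-setosa", "Iris-setosa",
    "Iris-versicolor", "Iris-versicolor",
    "Iris-virginica", "Iris-virginica",
])

one_hot = pd.get_dummies(y, dtype=int)  # shape (6, 3): the 2-D y that SVC rejects
print(one_hot.shape)

# Pass the plain 1-D labels instead
model = SVC(kernel="linear").fit(X, y)
pred = model.predict(X.iloc[[0]])
print(pred)

# If only the one-hot version is available, idxmax recovers the 1-D labels
y_from_one_hot = one_hot.idxmax(axis=1)
```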

pastel valley
#

yo is there any library like keras, which has ImageDataGenerator, but where instead of flip, zoom and shift I can apply color space transformations

lapis sequoia
lapis sequoia
upbeat prism
# wicked grove Thank youu!! Yess so the data i was trying to store was 10.6GiB and it was runni...

10.6 GiB shouldn't be a problem though, no? How much do you read at once?

Think about it: You basically have one loop which loops over the epochs and each epoch does training + evaluation and maybe some other minor things right? While training, you loop over those 10.6 GiB of data right? Now here you don't have to read all at once! You can read 1GiB, train it, read next 1GiB, train it. Once done with training, you go to validation and do the same.

You might have to change how you store the data though. If you store the data in one big normal file like csv or txt or whatever people use, then reading it in 1 GiB chunks won't be possible.

  1. How do you store the data?
  2. how do you read the data?
  3. How much data do you process at once?
#

E.g. my training data currently is 30gb or something like that and test data is 2x80gb

#

and my GPU has 3.7GB ram.

odd meteor
vast thunder
#

Guys is Brain.js a good library for AI? I'm not necessarily gonna do super advanced stuff. Just some kaggle datasets ig?

wicked grove
wicked grove
#

npz file

upbeat prism
#

do you use pytorch? keras? why npz?

wicked grove
#

keras

#

i just found it easier to load it in npz after preprocessing

wicked grove
upbeat prism
#

I don't know keras nor npz but basically what you do is you load the whole 10GB at once. Right? Then you triple it => you'd need 30GB of memory, which you probably don't have.

What you can do is to only read a piece of the data, feed it to your network, load new data, feed it again. You might have to store your data in chunks of npz files or just use HDF?

see e.g. https://keras.io/api/data_loading/
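
One way to read only a piece at a time, as a sketch: store the array as a plain `.npy` file and open it memory-mapped. As far as I know `mmap_mode` applies to `.npy` files rather than `.npz` archives. The file name and the `iter_batches` helper are made up for illustration:

```python
import os
import tempfile

import numpy as np

def iter_batches(path, batch_size):
    """Yield consecutive batches from a .npy file without loading it all."""
    data = np.load(path, mmap_mode="r")  # memory-mapped: nothing is read yet
    for start in range(0, len(data), batch_size):
        # Only this slice is actually pulled into RAM
        yield np.asarray(data[start:start + batch_size])

# Tiny stand-in file so the sketch runs anywhere
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "X_train.npy")
np.save(path, np.arange(40, dtype=np.float32).reshape(10, 4))

batch_shapes = [batch.shape for batch in iter_batches(path, batch_size=4)]
print(batch_shapes)
```

Each batch could then be handed to the model one at a time instead of holding the full 10GB in memory.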

wicked grove
#

thank youu!!:)) i will check that

upbeat prism
#

how much memory do you have?

wicked grove
#

i have 8 gb ram

#

google colab pro offers 25gb

molten wedge
#

for computer vision, is it possible to take five different pictures of the same object and also train the model to understand that these five different pictures are actually the same object?

iron basalt
molten wedge
#

I am building a grading system which will grade diamonds based on the initial image clicked. I have to assume that multiple images from different angles will be best able to grade as opposed to a single image. However is it possible to tell the model that for example these five images are of the same diamond clicked from different angles?

iron basalt
#

Yes.

#

You can also just do video and rotate it depending on how good the model is.

molten wedge
#

thank you for your answer. What are the limitations of using a video? I have to assume that prices in the video would take much much longer time than processing a single or two images. '

#

that processing*

iron basalt
#

If you know where the diamond is relative to the camera, you can do it without ML, since it's nice and rigid, not something like a pile of mud.

molten wedge
#

I see.. Let's say we do not know where the diamond is relative to the camera. Is it possible to train on a dataset of, say, 360° videos of 10,000 diamonds? can you ballpark how long it would take?

#

or if this is something that is even reasonable…

iron basalt
#

A model to segment the image for the diamond and then another to assess its quality can work. Or you can do a single end-to-end model. Or some combination of various models that are experts that each find out something and finally you take that output and simply apply some fuzzy logic. There are multiple ways.

#

Getting the training data will probably be the most difficult part.

molten wedge
#

That is true....pretty much impossible to find a training data like that

#

so just so that I understand clearly... It is indeed possible to tell a model when it is looking at the same thing from different angles without obtaining any kind of 3D scan or 3D data or some other complicated stuff

iron basalt
#

It can be done with camera alone. 3D scans make it a lot easier though.

molten wedge
#

Got it...thanks for helping me.

safe elk
#

Search for Photogrammetry algo or software might help

molten wedge
#

problem is that 3D scans capture voxel data but they do not capture colour characteristics... Unless I'm mistaken

safe elk
#

Ah yes and your subject is transparent ...photogrammetry has troubles there too

molten wedge
#

yeah....

safe elk
#

Lemme think

molten wedge
#

sure thanks

#

how can I answer the following question:
for (z) e.g. 1000 number of images given (x) resolution (e.g. 512x512 or 1024x1024) what amount of (y) time it would take to train a neural network using Google colab?

#

I think this would make it easier for me to go ahead

iron basalt
#

Not nearly as long as it takes to get those 1000 images I would think. It also depends a lot on which model / method is chosen.

#

If your model can perform online learning, then technically the training would take as long as it takes you get each sample, since it could learn it on the spot for each image you get.

molten wedge
#

Let's say i have 200 images at 2048x2048 resolution of 200 diamonds...are you saying it would take 1 second per image to process and learn?

iron basalt
#

Standard deep learning usually can't do online learning though, so if you want to stick with that, which most people know about, then it might take a while to get a good model.

molten wedge
#

can you please elaborate what you mean when you say it depends on which model/method? I am quite new to this aspect of Python and am trying to learn more

#

sorry if these are too many questions...

iron basalt
#

Some methods require a lot of samples and resampling (to avoid forgetting), while others can instantly and permanently learn things (one-shot). While the latter is the ideal and what the future holds, the current most popular methods can't do it yet, or at least not very well. But also within those slower methods, some may still require more samples, or less, and more processing power / time or less.

#

One thing that obviously controls how long it will take is how large the model is.

#

But a larger model might give better results.

#

In this regard, datascience is more like alchemy than chemistry, more of an art than a science (despite the name), and really just requires some testing. Try some small models that don't take very long to train to get a feel for how long it might take and how well it can perform.

#

Predicting it upfront is really hard since not only does the choice of model and model hyper-parameters matter, but also hardware configurations, and desired results (how well does it perform).

#

(Also without having done a similar problem with that exact same setup, it's even harder)

lapis sequoia
#

Hello help me pls

molten wedge
#

got it. thank you so much for all your responses, I think I'm starting to understand why my questions don't have a straight answer. Truly appreciate you taking the time to tell me all of this

robust jungle
#

how viable is it to make a dataset by separating frames of a few videos

#

assuming the environments / lightings were varied along with the angles

#

note: specifically talking about image recognition

rapid pawn
#

anyone know if google colab free tier has a better gpu or is a 3080 better at training networks?

#

ping me plz thank you

olive shore
#
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB


df = pd.read_csv('../input/heart-failure-prediction/heart.csv')


Age = df.Age.tolist()
Sex = df.Sex.tolist()
CPT = df.ChestPainType.tolist()
RestingBP = df.RestingBP.tolist()
Cholesterol = df.Cholesterol.tolist()
FastingBS = df.FastingBS.tolist()
RestingECG = df.RestingECG.tolist()
MaxHR = df.MaxHR.tolist()
ExerciseAngina = df.ExerciseAngina.tolist()
Oldpeak = df.Oldpeak.tolist()
ST_Slope = df.ST_Slope.tolist()
HeartDisease = df.HeartDisease.tolist()

#encoding strings into integers in order to calculate the probabilities.


le = preprocessing.LabelEncoder()

Sex_E=le.fit_transform(Sex)
CPT_E=le.fit_transform(CPT)
RestingECG_E=le.fit_transform(RestingECG)
ExerciseAngina_E=le.fit_transform(ExerciseAngina)
ST_Slope_E=le.fit_transform(ST_Slope)

variables = zip(Age,Sex,CPT_E,RestingBP,Cholesterol,FastingBS,RestingECG_E,MaxHR,ExerciseAngina_E,Oldpeak,ST_Slope_E)


model = GaussianNB()
model.fit(variables,HeartDisease)
predicted= model.predict([[0,1]])

print(predicted) 

I am trying to use Bayes' theorem for heart conditions. This dataset has 12 columns; one is the heart condition column. I am getting an error telling me to reshape my array if my data has a single feature. What do I do?

austere swift
olive shore
#

Ok

#

If you need like specific data I will give it to you in a second I need to get on my laptop

#

I was using the instructions

#

First ones

storm blade
#

How can I learn Data Science and AI for free? it can also be cheap but like my wallet is struggling to survive sooo...

olive shore
#

there are so many free sources out there like fast.ai's course and courses on coursera and datacamp

brave sand
#

how do I create a custom dataset for image detection?

#

as in what kind of images go in?

#

I just take hundreds of pictures of the object?

heady spoke
#

hi

serene scaffold
brave sand
olive shore
#

is anyone good with data science or AI and could help me with my issue

stone marlin
#

Try doing .reshape(-1, 1) on whatever it's throwing the error on, but you also should probably do the "10 minutes to pandas" guide, since most of what you're doing in terms of manipulation can be done with dataframe operations.

olive shore
#

i did it,it doesnt work. They are saying do that if its one feature. doesnt one feature mean one column?

serene scaffold
olive shore
serene scaffold
#
Sex_E=le.fit_transform(Sex)
CPT_E=le.fit_transform(CPT)
RestingECG_E=le.fit_transform(RestingECG)
ExerciseAngina_E=le.fit_transform(ExerciseAngina)
ST_Slope_E=le.fit_transform(ST_Slope)

every time you call fit_transform, you reset the encoder, which means you can't transform instances of the same feature again.

brave sand
serene scaffold
brave sand
serene scaffold
#

I don't know how to do that, sorry.

brave sand
#

same, couldn’t find any tips online

safe elk
#

We tried that and trained

#

You do have to label the regions of interest for training data

#

So you need photos of the target taken from a distance then do as above

brave sand
#

can yolo recognize a custom dataset?

#

alright thanks

safe elk
#

It needs the annotated images and there are annotation tools

#

Get a fast machine with a good gpu on or offline

stone marlin
#

We've seen a lot of YOLO in the past few days, dang.

#

Also, I finished up that NN deeplearning ai course, it was pretty good. I did the basic keras/tensorflow tutorial, and that was also fun. It was nice building up little dealies and messing around with it. I didn't make anything substantial, but I did get it to classify some article text into a few subjects [sports, tech, and finance], which is better than nothing!

#

Thanks for recommending it to me, y'all. I feel like I know "something" about NNs now. Enough to know what to google, anyhow, if I ever need'em again.

desert oar
stone marlin
#

Next project: relearn PySpark junk so I don't look like a fool when I need to do it. :''']

#

Ugh, the worst thing about PySpark is honestly the logs. The rest is fine. The logs are just a gd nightmare. You make a typo and it's 50 pages of logs.

desert oar
#

alas, jvm tracebacks

olive shore
#

le = preprocessing.LabelEncoder()

serene scaffold
#

but in either case, every feature needs its own encoder, unless you never want to encode instances of that feature again.
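
A minimal sketch of the one-encoder-per-feature idea, assuming scikit-learn's LabelEncoder and a made-up two-column frame (the column names are borrowed from the thread, the values are invented):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented sample rows; only the column names come from the discussion
df = pd.DataFrame({"Sex": ["M", "F", "M"], "ST_Slope": ["Up", "Flat", "Up"]})

encoders = {}  # hypothetical helper dict: one fitted encoder per column
for col in ["Sex", "ST_Slope"]:
    encoders[col] = LabelEncoder()
    df[col + "_E"] = encoders[col].fit_transform(df[col])

# Because each column kept its own encoder, new values can be encoded later
new_codes = encoders["Sex"].transform(["F", "M"])
print(df)
print(new_codes)
```

Keeping the encoders in a dict is just one way to do it; the point is that the "Sex" encoder is never refit on another column.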

olive shore
#

ok

#

yeah so the stuff I am encoding are strings and they need to be put into integers in order for them to be calculated into the probability

#

i used the same encoding method they used here

serene scaffold
#

so MaxHR is a string?

#

because that sounds wrong.

olive shore
serene scaffold
#

why would a heartrate be a string

#

I just checked the dataset and it's a number.

olive shore
#

which one?

#

maxHR?

serene scaffold
#

yes. several of the features are numbers, actually.

olive shore
#
Sex_E=le.fit_transform(Sex)
CPT_E=le.fit_transform(CPT)
RestingECG_E=le.fit_transform(RestingECG)
ExerciseAngina_E=le.fit_transform(ExerciseAngina)
ST_Slope_E=le.fit_transform(ST_Slope)

those are the only one that are being encoded

#

maxHR isnt there?

#

wait are the stuff that go here

variables = zip(Age,Sex,CPT_E,RestingBP,Cholesterol,FastingBS,RestingECG_E,MaxHR,ExerciseAngina_E,Oldpeak,ST_Slope_E)

only supposed to be the encoded ones?

serene scaffold
#

I see

stone marlin
#

That is the most things I've ever seen zipped. :'']

olive shore
#

i just followed exactly what they did in datacamp

#

idk if I messed something up though

stone marlin
#

I'm only half paying attention, I'm sorry, Mr. Einstein, I've got to finish up a submission. I think a lot of what you want to do can be done with df operations, though.

wicked grove
#

can someone pls tell me how i can resolve this

#

Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.

olive shore
#

what is the code?

gray swallow
#

Guys can I have any recommendations for tutorials in AI and ML for beginners like absolute beginner

fair locust
#
rows, columns = (3, 6)
fig, axes = plt.subplots(nrows=rows, ncols=columns)
for i, key in enumerate(df.keys()):
    plot = df.boxplot(column=key, ax=axes[i%rows, i//rows])
    plt.setp(axes, xticks=[], yticks=[])

plt.show()

How can I add titles to this?

#

df.hist returns an ndarray

#

And I can't just use plt.title

prisma jay
#

What makes df.groupby([some columns]).sum().reset_index() return a zero-row df?

olive patio
#

Hey guys

#

I'm trying to do something with shap

#

import shap
batch = next(iter(test_dl))
images, _ = batch

background = images[:100].to(device)
test_images = images[100:105].to(device)

e = shap.DeepExplainer(model, background)
shap_values = e.shap_values(images)

shap_numpy = [np.swapaxes(np.swapaxes(s, 1, -1), 1, 2) for s in shap_values]
test_numpy = np.swapaxes(np.swapaxes(test_images.cpu().numpy(), 1, -1), 1, 2)
shap.image_plot(shap_numpy, -test_numpy)

#

I'm getting an error -The size of tensor a (512) must match the size of tensor b (2048) at non-singleton dimension 1

#

This is my model -
class PredsModel(ImageClassificationBase):
    def __init__(self, num_classes, pretrained=True):
        super().__init__()
        # Use a pretrained model
        self.network = models.resnet50(pretrained=pretrained)
        # Replace last layer
        self.network.fc = nn.Linear(self.network.fc.in_features, num_classes)

    def forward(self, xb):
        return self.network(xb)
#

How do I solve this? Thanks

tardy badger
#

does anyone know a good course to learn Pyspark?

lapis sequoia
#

So I am using Tensorflow for making 2 classifiers, I first placed 70% data into training, then 15% data into validation and 15% in test. I used that data for training 2 classifiers. Then I create new dataset in the same way and again train two classifiers. I do that 5 times. I got accuracy for each phase of testing.

#

Now I would like to use a statistical test for comparing my models

#

What statistical test do you propose?

nocturne hedge
#

Hello, I've a question regarding neural networks. Does every input neuron get the whole input vector or just one element of it?

vast thunder
#

Guys is https://www.w3schools.com/ai/ a good tutorial source for ML?

desert oar
desert oar
#

i think you can also invoke fig.suptitle if you want to use the OO interface
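
A rough sketch of both options, assuming a made-up DataFrame with six numeric columns: ax.set_title gives each subplot its own title, and fig.suptitle puts one title over the whole figure.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: 50 rows, 6 numeric columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 6)), columns=list("abcdef"))

rows, columns = 2, 3
fig, axes = plt.subplots(nrows=rows, ncols=columns)
for ax, key in zip(axes.flat, df.columns):
    df.boxplot(column=key, ax=ax)
    ax.set_title(key)      # title for this one subplot
    ax.set_xticks([])
    ax.set_yticks([])

fig.suptitle("Boxplots per column")  # single title for the whole figure
```

Iterating over axes.flat avoids the i%rows, i//rows index arithmetic; call plt.show() at the end when running interactively.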

desert oar
hot slate
#

Hi everyone, in pandas.Series, how can I overload the == operator?

For example:

s = pd.Series(['abc', 'daf', 'ghi'])
if (s == 'a').any():
  print('True')

I want to overload the == operator to perform some regular expression. The result that I want is:

True
True
False
serene scaffold
#

you don't need to overload the operator for that. what regular expression operation are you trying to do?

#

@hot slate

#

!e

import pandas as pd
s = pd.Series(['abc', 'daf', 'ghi'])
result = s.str.contains('a')
print(result)
arctic wedgeBOT
#

@serene scaffold :white_check_mark: Your eval job has completed with return code 0.

0     True
1     True
2    False
dtype: bool
serene scaffold
#

and you can add .any() to the end of that if you want a single bool.

hot slate
#

oh thanks alot

#

I didn't know the .str.contains before

serene scaffold
#

!docs pandas.Series.str

arctic wedgeBOT
#

```py
Series.str()
```
Vectorized string functions for Series and Index.

NAs stay NA unless handled otherwise by a particular method. Patterned after Python’s string methods, with some inspiration from R’s stringr package.

Examples

```py
>>> s = pd.Series(["A_Str_Series"])
>>> s
0    A_Str_Series
dtype: object
```...
serene scaffold
#

the str accessor has tons of good stuff.

#

most string methods or regex functions, you can do to the whole series via the str accessor.

hot slate
#

thanks

#

this is extremely helpful and informative for me

serene scaffold
#

any time 💚

lapis sequoia
#

So I am using Tensorflow for making 2 classifiers, I first placed 70% data into training, then 15% data into validation and 15% in test. I used that data for training 2 classifiers. Then I create new dataset in the same way and again train two classifiers. I do that 5 times. I got accuracy for each phase of testing.
Now I would like to use a statistical test for comparing my models
What statistical test do you propose?
If that's important I have maybe 850 photos per class

vagrant monolith
#

Hello i have a timeseries type date i want to extract the year only, how can i do this ?

stone marlin
#

Another sweet accessor, dt.

lapis sequoia
stone marlin
#

What statistical testing would you like to do?

olive patio
stone marlin
#

"This model is statistically more accurate in some regard to this other one."?

vagrant monolith
lapis sequoia
vagrant monolith
#

Thing is its not a datetime value its a time series

robust jungle
#

I'm trying to make a bot to parry a kick in a game. The bot needs to be able to recognize when it will be hit by the kick in real time, how should I approach this?

desert oar
#

post your data, or a sample thereof

#

or at least tell us what the dtype is and give some example values

vagrant monolith
stone marlin
#

No DMs please, keep everything public.

lapis sequoia
stone marlin
#

In re: to statistical tests, you've got a few models and you want to compare them. So, you want to kind of say, "This one has X accuracy (or whatever), and this one has Y accuracy (or whatever), so this one is better." That kind of thing? There's a large number of ways to do this sort of thing, so I'm trying to narrow down what you want.

serene scaffold
# vagrant monolith

what type is that column currently? are those all strings? because strings aren't proper datetimes.

#

!docs pandas.to_datetime

arctic wedgeBOT
#

```py
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True)
```
Convert argument to datetime.
serene scaffold
lapis sequoia
stone marlin
#

Remind me which type of classifier you've decided to use on them?

lapis sequoia
#

One is my custom made CNN other is InceptionV3 with transfer learning

vagrant monolith
#

@serene scaffold

#

Im sorry im new and kinda lost

serene scaffold
#

the pd.to_datetime function will help you.

stone marlin
#

The usual way that I know of (maybe someone else in here knows more) to test is called McNemar's test. It's a chi-square test which compares error rates in the two models.
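
A hedged sketch of the continuity-corrected McNemar statistic, using only the standard library. It assumes both models were evaluated on the same test examples; the labels and predictions below are made up, and the function name is just for illustration.

```python
import math

def mcnemar(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar chi-square statistic and p-value."""
    # b: model A right, model B wrong; c: model A wrong, model B right
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree, so there is nothing to test
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Made-up labels and predictions for ten test images
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
pred_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
pred_b = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

stat, p_value = mcnemar(y_true, pred_a, pred_b)
print(stat, p_value)
```

Only the examples where the two models disagree enter the statistic; with very few disagreements an exact binomial version is usually preferred over the chi-square approximation.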

#

There's some other ones, but they're usually fairly specific.

serene scaffold
#

it might even work if you just do pd.to_datetime(df['Date'])

vagrant monolith
#

I've already tried it no luck

serene scaffold
#

so you'll have to provide the format=, which is where you say what each part of the string means.

#

oh

#

so it worked, you just have to write it back to the dataframe

#

market_cap_doge['Date'] = pd.to_datetime(market_cap_doge['Date'])

stone marlin
#

McNemar's test is given here: https://en.wikipedia.org/wiki/McNemar's_test and it might help you out. There are other tests --- like t- and z- score tests for some models, but the recent criticism iirc has been that they violate basic assumptions of t- and z-, so the recommendation was to not use them unless literally nothing else would work.

#

If anyone else knows better than me, let me know, since I don't do statistical testing all that much on my models. I know, I know, it's bad.

vagrant monolith
#

@serene scaffold it workeddd !! thanks so much!!

serene scaffold
#

@vagrant monolith pandas objects work differently from the rest of Python. most pandas functions/methods return new objects, without changing existing ones

#

so, pd.to_datetime returns a new Series. it won't change the DataFrame that that Series came from unless you tell it to.
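
A small sketch of that, with made-up date strings: the first call returns a new Series and leaves the frame alone, the assignment writes it back, and after that the dt accessor can pull out the year.

```python
import pandas as pd

# Made-up date strings standing in for the real column
df = pd.DataFrame({"Date": ["2021-05-08", "2022-01-09"]})

pd.to_datetime(df["Date"])               # returns a new Series; df is unchanged
df["Date"] = pd.to_datetime(df["Date"])  # write the converted column back
df["Year"] = df["Date"].dt.year          # extract the year via the dt accessor

print(df)
```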

vagrant monolith
#

unless u tell it so

#

i see now thanks a bunch

stone marlin
#

Student t-test and Z-test, which are two of the "big" tests in statistics. https://en.wikipedia.org/wiki/Z-test It's the statistical test most stats classes will lead off with since it's fairly robust and usually pretty good.

serene scaffold
lapis sequoia
#

Going to eat, I already read something so I will respond later

stone marlin
#

Good luck. I don't know much about this, so maybe someone else will be able to chime in later.

vagrant monolith
#

@serene scaffold Oh ii get it now that makes sense

serene scaffold
#

soon you will understand 😄

vagrant monolith
#

@serene scaffold yeaa i see how it can be helpful you don't wanna end up with modified data everytime

lime sigil
#

How can I detect how likely it is that a string as the same meaning as another string?
Like I have a sentence "The first programming language was Fortran " and "Fortran was the first programming language"
For us they say the same, I need to detect it via Python

serene scaffold
#

unrelated, but does anyone know of a library for taking existing voice audio and making it sound higher or lower? one that just changes the pitch isn't sufficient as there's more to voice quality than that. it's apparently very difficult to Google for because there's too much noise (voice synthesis libraries, general audio manipulation libraries, etc.)

serene scaffold
brave granite
#

how to find location of data in excel file using python

uneven oracle
#

Can anybody support on this.

  1. Randomly place 20 points within a unit square.

  2. Find the two points that are closest to each other and compute their distance. Find the two points that are farthest from each other and compute their distance. Code these calculations from scratch; do not use a packaged function.

  3. Repeat (1) to (2) r=100 times. Collect the closest and farthest pairs. Plot all pairs on a scatter plot, with blue points for closest pairs and red points for farthest pairs. Report the average closest- and farthest-pair distances on the scatter plot.

lime sigil
#

I need it for a discord bot

opal fern
#

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as data
from tensorflow.keras.models import load_model
import streamlit as st

# the time of start and close

start = '2010-01-01'
end = '2019-12-31'

st.title('stock Trend Prediction')
user_input = st.text_input('Enter stock Ticker', 'AAPL')
#using datareader to take data, 'AAPL'is the company ticket
df = data.DataReader(user_input,'yahoo', start, end)

#describing data
st.subheader('Data from 2010 - 2019')
st.write(df.describe())

#

error : ModuleNotFoundError: No module named 'keras.models'

quasi parcel
#

have you tried

#

uninstalling and installing?

lapis sequoia
quasi parcel
#

i have an issue and its troubling me since days
the issue is in this sample csv
https://paste.pythondiscord.com/nufixokoka.yaml
i need to change these following columns type
product_category_id is in string need to parse to list of integer
product_category is in string need to parse in list of String
product_ids is in string need to parse to list of string
i tried ast.literal_eval
df_explode = piv_pdp.assign(second=piv_pdp.product_ids.str.split(","))
when i executed this literal_eval code it was giving this traceback

```
1         [240623]
2         [286313]
3         [285627]
4         [312021]
Name: product_ids, Length: 390202, dtype: object
```
#

can anyone please help

median idol
#

Could someone please help, as I'm trying to use replace in my dataset but part that I'm trying to replace remains the same:

#
```py
df['floor'] = df['floor'].astype('int64')

col_to_replace = ['hoa', 'rent amount', 'property tax', 'fire insurance']

for i in col_to_replace:
    df[i] = df[i].astype('string')
    df[i] = df[i].replace('R$', ' ')

df.head()
```
#

The output remains the same
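
A likely cause, sketched with made-up values: Series.replace('R$', ' ') only swaps cells whose whole value equals 'R$', so it leaves 'R$3,300' alone. The .str.replace method works on substrings, and regex=False makes '$' a literal character rather than an end-of-string anchor.

```python
import pandas as pd

# Made-up sample of two of the columns mentioned above
df = pd.DataFrame({
    "hoa": ["R$2,065", "R$0"],
    "rent amount": ["R$3,300", "R$800"],
})

for col in ["hoa", "rent amount"]:
    # Substring replacement; regex=False treats 'R$' literally
    df[col] = df[col].astype("string").str.replace("R$", "", regex=False)

print(df)
```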

timber sky
#

Hi, I am training a model out of a sensordataset from kaggle. I guess somehow I am doing something wrong 😄

lapis sequoia
#

Howdy y’all

#

Could anyone answer a question and provide a few insights for me? It’d be greatly appreciated

stone marlin
lapis sequoia
#

I’m a bit mentorless and just trying to understand a project I have for my boot camp. Sorry if it’s noobish.

lapis sequoia
stone marlin
#

Oof, I guess don't ask to ask is blocked. Either way, "Don't ask to ask", scoby, just ask.

lapis sequoia
#

Damn, fucking statistics makes my life hard...

stone marlin
#

Yeah, in that case I'm not sure what statistical test would be good. I imagine that McNemar's could be extended, but I've never done it.

serene scaffold
lapis sequoia
#

I’m just starting a Data Science program and basically I have this rubric for an assignment.

timber sky
#

Why does the accuracy sink if I train a neural network? Like with every step until almost 0 😄

serene scaffold
#

are you sure you're not talking about loss?

lapis sequoia
#

not looking for someone to do it for me just… a bit intimidated.

#

I was curious how y’all might approach this.

Im pretty familiar w python and it’s our first project. I just wanted some insights from those who are more knowledgeable than myself. It’s a data science ML boot camp. This is our first assignment.

#

It’s a simple dataset really.

serene scaffold
lapis sequoia
#

It’s just my desktop. I’ll do that.

stone marlin
#

The first step for any good DS project is to do EDA (exploratory data analysis) and to make a bunch of graphs and things. By the end of that, you should be able to answer part 1. I'd take things step-by-step.

robust jungle
#

How can I go about making a bot to predict when something will happen via video (e.g. when a falling ball will hit the ground).

robust jungle
lapis sequoia
#

Description
Objective

Explore the dataset to identify differences between the customers of each product. You can also explore relationships between the different attributes of the customers. You can approach it from any other line of questioning that you feel could be relevant for the business. The idea is to get you comfortable working in Python.

You are expected to do the following :

Come up with a customer profile (characteristics of a customer) of the different products
Perform univariate and multivariate analyses
Generate a set of insights and recommendations that will help the company in targeting new customers.

Data Dictionary

The data is about customers of the treadmill product(s) of a retail store called Cardio Good Fitness. It contains the following variables-

Product - The model no. of the treadmill
Age - Age of the customer in no of years
Gender - Gender of the customer
Education - Education of the customer in no. of years
Marital Status - Marital status of the customer
Usage - Avg. # times the customer wants to use the treadmill every week
Fitness - Self rated fitness score of the customer (5 - very fit, 1 - very unfit)
Income - Income of the customer
Miles- Miles that a customer expects to run

serene scaffold
timber sky
#

basically I see it live and it goes down every second it does another step

robust jungle
#

another example

#

say there was an attack in a game you needed to dodge

#

you already knew what attack it was you needed to dodge, and your goal was to dodge that one attack

lapis sequoia
#

Sorry for the text block, I’m not so worried about completing the project or anything of that sort. Just wanted insights about how y’all might approach this,

trying to learn as much as I can without being in a vacuum

#

@stone marlin I will read about different non parametric stat tests...can I ask you something if I don't understand something?

robust jungle
#

it needs to:
realize that the attack is being used
find out when it will land (it already knows how long the animation is, it needs to figure out how far behind it was based on where it is in the animation)
avoid the attack

stone marlin
lapis sequoia
stone marlin
#

I'm only tangentially familiar with it, unfortunately, and I may not have time to do the necessary research to answer your questions.

#

I will skim back up and if no one answers I'll try my best.

stone marlin
#

Please no DMs, y'all, keep it in public chat.

#

(Not you, Luka, haha.)

lapis sequoia
#

why is this so intimidating

stone marlin
#

No, no, I meant, not this time, you.

#

No DMs from anyone, please. Everything in public chat.

lapis sequoia
#

😅 it was me

#

Lol

stone marlin
#

It's all good, it's one of those old-man vs younger peeps things. I'm an angry ancient guy.

lapis sequoia
#

For example, Student’s t-test for two independent samples is reliable only if each sample follows a normal distribution and if sample variances are homogeneous.

#

So I should calculate normal distribution and variance of each image?

stone marlin
#

You could do this --- I've read recently that this is violated but, honestly, I think most people still do the t-test / z-test.

lapis sequoia
#

Also, these two (normal distribution and sample variances) are for Student's t-test

stone marlin
#

IIRC, it's that it's not homoskedastic, but I'd have to read the paper again.

#

Lemme see if I can find an example of this.

lapis sequoia
stone marlin
#

(There is also a chi-square version but I've literally never seen it used, so idk about it.)

lapis sequoia
#

Hello I need a help

#

pls

robust jungle
lapis sequoia
#

With .csv files

robust jungle
#

what about them?

lapis sequoia
#

Problems with appending

robust jungle
#

this may be the wrong channel, but I would still be happy to help

lapis sequoia
#

I am using pandas library

lapis sequoia
#

@stone marlin Hmm, this is for numeric data in all examples

stone marlin
#

Maybe I'm misunderstanding what you're trying to test --- there are many, many potential ways to "test" models to see which is "better", and there's various things better means.

lapis sequoia
#

I thought to set hypothesis

stone marlin
#

Wait, it's not important so you track it?

lapis sequoia
#

accuracy of X > accuracy of Y

lapis sequoia
stone marlin
#

It's okay, I just was unsure what you meant.

lapis sequoia
#

I wanted to say I don't pay attention to accuracy of particular class

#

So I thought that it's valid to set hypothesis as model X has greater acc than model Y

stone marlin
#

Okay, so you have two models. You are looking at accuracy for model X and model Y. And you want a test to say, "This model performs better, in terms of this."

#

Right, exactly. So, that's exactly the test.

#

But, there's one big issue there. You don't have a standard deviation with just one run of the model.

#

You have just one value: the accuracy.

lapis sequoia
#

For knowing my samples follow normal distribution?

stone marlin
#

Well, right now, you've set up

H0: mean(X) == mean(Y)
HA: mean(X) != mean(Y)

I thought you noted you were going to do a t-test to test this.

lapis sequoia
#

I thought you noted you were going to do a t-test to test this.
I read that in using t-test

stone marlin
#

Oh, whoops, you did >. But same deal.

lapis sequoia
#

Observations in each sample are independent and identically distributed (iid).
Observations in each sample are normally distributed.
Observations in each sample have the same variance.

#

What does it mean that samples are independent?

stone marlin
#

The gist for hypothesis testing is you calculate a "test statistic" and then you try to see if the corresponding p-value is small or not. [I'm leaving a LOT out here, because this could cover the last third of a stats course.]

lapis sequoia
#

Independent samples are samples that are selected randomly so that their observations do not depend on the values of other observations.

#

How did you conclude that I set
H0: mean(X) == mean(Y)
HA: mean(X) != mean(Y)

stone marlin
#

Sorry, above you said >. In this case, though, it should probably be two-sided.

#
I thought to set hypothesis
accuracy of X > accuracy of Y
#

This is what you noted before.

lapis sequoia
#

H0: mean(X) == mean(Y)
HA: mean(X) != mean(Y)

stone marlin
#

Ah, I see, I used "mean" which threw you off probably.

#

I was getting ahead of myself here. The idea is that you cannot do this test for one value of accuracy. Otherwise your test is just "this is greater than this".

desert oar
# stone marlin The gist for hypothesis testing is you calculate a "test statistic" and then you...

i'll elaborate further on this because i feel strongly about it:

hypothesis testing works by setting up a "null hypothesis". you then figure out how unlikely/unusual/rare/extreme your data is, assuming that the null hypothesis is true. if it turns out that your data is very unlikely/unusual/rare/extreme when the null hypothesis is assumed to be true, then this is taken as evidence against the null hypothesis. and we reject the null hypothesis when that evidence exceeds a pre-determined threshold.

the evidence is usually the p-value, and the threshold is the size of the test (often written as α)
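
One way to make "how extreme is my data if the null is true" concrete is a permutation test. Nobody in the thread proposed this; it is only an illustration of the logic, with made-up accuracy values and an illustrative function name. Under the null hypothesis the model labels are interchangeable, so the scores get shuffled between the two groups and the sketch counts how often the shuffled difference is at least as large as the observed one.

```python
import random
import statistics

def permutation_p_value(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel the scores at random
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # The +1 counts the observed labelling itself, so p is never exactly 0
    return (hits + 1) / (n_perm + 1)

# Made-up test accuracies from five runs of each model
acc_x = [0.91, 0.89, 0.92, 0.90, 0.93]
acc_y = [0.86, 0.88, 0.85, 0.87, 0.84]

p = permutation_p_value(acc_x, acc_y)
alpha = 0.05  # the pre-determined threshold
print(p, p < alpha)
```

A small p-value here means the observed gap would be rare if the two models were really interchangeable, which is the evidence against the null described above.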

stone marlin
#

Thank you, Salt, haha. I'm going to also have to step back because work is picking up in a few mins so feel free to chime in.

lapis sequoia
stone marlin
#

The usual way to do it is to get multiple values for accuracy, and then you've got a set of accuracy values for each model. At that point, you have a mean and a standard deviation for the accuracy of both models.

lapis sequoia
stone marlin
#

You can then perform a t-test (since you have the mean and stdev of accuracies for both models), and you conclude that either the means of the accuracies are the same OR they are different.

#

(Actually: you either reject the null hypothesis or you fail to reject it, but, in this case it's probably going to be rejected.)

#

Good, so you've got some values for accuracy from each model

#

So you get the means + stdevs from those, and with those you perform a t-test.
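
A sketch of the equal-variance two-sample t statistic using only the standard library. The accuracy lists are made up and the function name is just for illustration; swap in the five accuracies per model.

```python
import math
import statistics

def pooled_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_var = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / se, na + nb - 2

# Made-up test accuracies from five runs of each model
acc_x = [0.91, 0.89, 0.92, 0.90, 0.93]
acc_y = [0.86, 0.88, 0.85, 0.87, 0.84]

t_stat, dof = pooled_t(acc_x, acc_y)
print(t_stat, dof)
```

The statistic is then compared against a t distribution with that many degrees of freedom; for 8 degrees of freedom the two-sided 5% critical value is about 2.306. If SciPy is installed, scipy.stats.ttest_ind returns the p-value directly, and scipy.stats.ttest_rel is the paired version, which may fit better when both models are trained on the same splits.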

lapis sequoia
#

@stone marlin You mean this test Paired Student’s t-test?

#

But what if two means of two paired samples are significantly different?

#

I don't understand how this test relates to what I want to test - whether model X has greater accuracy than model Y

stone marlin
#

https://www.investopedia.com/terms/t/t-test.asp I don't know exactly how scipy does it, so I'd use the formulas here. I'd prob use the "Equal Variance" t-test for this.

#

The gist is like this: what if, by a fluke, your accuracy in model X was higher than model Y. Then model X is better right? Not necessarily. So you want to try a few different times to see if the one time you trained it wasn't a "fluke".

#

Like, you might have gotten lucky with data the first time and model X was really good, but it was terrible every other time.

#

I've got to go to a meeting, but I'd say the following: if this is an assignment, you might want to ask the teacher / TA what they want from this, there are MANY things that we could do to test it. If not, I wouldn't worry about testing right now, esp if you don't know hypothesis testing.

lapis sequoia
#

@stone marlin

The gist is like this: what if, by a fluke, your accuracy in model X was higher than model Y.
I think that's not a case in my case. What I did was to randomize data, placed 70% in training, 15% in validation and 15% in testing. Then I trained first model and tested it, then trained second model and tested it. I then again randomized data, placed 70% in training, 15% in validation and 15% in testing, trained first model and tested it, then trained second model and tested it...and I did that 4 more times, so I have 5 accuracies

stone marlin
#

(Right: that's not the case. That's what you're trying to show explicitly with this test.)

lapis sequoia
stone marlin
#

I'd say that's what I'd use. Salt may know more. I will say this is not a common thing most DS people deal with --- at least in my field.

#

(okay, now I'm really gone.)

lapis sequoia
#

@desert oar are you here?

urban meadow
#

dumb question, how to use numpy to convert an array of this kind to this kind?
[[255 255 255 ... 255 255 255]] -> [[[255 255 255] [255 255 255] [255 255 255]]]
so effectively for each element i do: 255 -> [255 255 255]

serene scaffold
#

when you reshape an array, -1 means "the rest".

#

I made you a little example.

In [8]: np.repeat(255, 9)
Out[8]: array([255, 255, 255, 255, 255, 255, 255, 255, 255])

In [9]: arr = _

In [10]: arr.reshape(1, -1, 3)
Out[10]:
array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]]])
urban meadow
#

ty for the answer, just digesting

serene scaffold
#

do you know how array shapes work?

#

not just reshaping them, but what the shape of an array is, in general

urban meadow
#

ok i think i tried reshape but one problem is that i don't want to have 3 elements, for each element I want to replace it with an array of size 3 that has the same number

#

the real question is i have an image that is being read by opencv but the image is 2 colors, white/red and it keeps being read as a grayscale and it's not being read as rgb/bgr/whatever and every time i try to mask the mask image is grayscale

lapis sequoia
#

Yo guys,

#

So I want to find the Average income for each gender in my dataset.

#
# Reading Data from relative directory
data = pd.read_csv("CardioGoodFitness.csv")

#Converting Gender Values to Intergers for later use. Female = 0, Male = 1

data['Gender'].replace('Female', 0, inplace=True)
data['Gender'].replace('Male', 1, inplace=True)
#

ran this: then this,

#
data.groupby("Income")["Gender"].mean().sort_values(ascending=True)
#

got this:

#

Income
55713 0.0
65220 0.0
62535 0.0
53536 0.0
52291 0.0
...
48658 1.0
48556 1.0
31836 1.0
68220 1.0
104581 1.0
Name: Gender, Length: 62, dtype: float64

#

How can I consolidate this so its not ascending

#

or sorted, i guess i just delete the end of that command?

serene scaffold
grand imp
#

Hey, is there a way to train a Neural network on hundreds of human conversations, and make it be able to respond to inputs with a human-like tone?

serene scaffold
grand imp
#

Well the input is text based and not speech

serene scaffold
#

so, you're asking how to make a conversational chat bot? those are notoriously difficult because when humans have conversations, the things that they say are informed by a large body of knowledge and experiences. robots don't have that.

urban meadow
#

@serene scaffold yes can I just feed in my array to np.arrange like np.arrange(imagearray)? that would be perfect
@lapis sequoia you can just clean it up after line per line. then do string manipulation/convert to int. u can use grep or just make a for loop and check if it's a number or not to make your income per person if speed isn't a concern

grand imp
#

Exactly, but how about training it on thousands if not millions of conversations (yes I know it will take really long but it's possible) so that it gains information of the conversations

lapis sequoia
#

Its not for that I have to

#

create business insights for a project

grand imp
#

So it will compare your input with the others and see which output is most similar to the correct input, and give that which already contains the knowledge you are looking for.

serene scaffold
urban meadow
#

ah i see

atomic leaf
#

Hey guys! I hope I am not disturbing you convo too much c:
I am making a CAPTCHA solver with pytorch, but I can't create a dataset that has both the image/captcha and the label/target in the dataset/dataloader. Can someone assist me on this? ❤️

lapis sequoia
#

im crying

grand imp
#

we all are

serene scaffold
lapis sequoia
#

I just wanna see if the men on average are making more than the women in this hypothetical scenario
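
A minimal sketch with a made-up frame (only the column names match the dataset): grouping by Gender and averaging Income gives one row per gender, which is the opposite direction from grouping by Income. The Gender column can stay as strings for this.

```python
import pandas as pd

# Made-up rows; the real file is CardioGoodFitness.csv
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Income": [50000, 42000, 60000, 48000],
})

# One mean income per gender
avg_income = data.groupby("Gender")["Income"].mean()
print(avg_income)
```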

serene scaffold
#

@atomic leaf keep in mind that asking for help with captcha solvers is against the rules, so don't do that in the future.

lapis sequoia
#

trying to build a customer profile for a project

#

fake dataset

atomic leaf
lapis sequoia
#

going to make a business recommendation on marketing to men or women more based on customers income provided

urban meadow
#

i mean what's stopping you from calling income for men -> create average by cleaning up dataset, calling income from women -> create average then compare?

grand imp
urban meadow
#

@serene scaffold is it possible to clone my image array and then read into a new array of size 3? too dumb to figure out the syntax, something like a1, a2, a3, and then answerarray[a1,a2,a3]

#

not size 3 but each element is size 3

serene scaffold
spice hamlet
#

hi guys. i saw a few vids about ai in games (geometry dash for example) and wondered, how an ai can teach itself how to play the game, without even having a variable that tells it (good/bad). im talking of genetic algorithms

serene scaffold
urban meadow
#

@serene scaffold so for each element X in the original array, in the new array is [X X X] so not really the same because Ihave 2 colors

serene scaffold
urban meadow
#

source image/array is (315, 375) target is (315, 375, 3)

serene scaffold
#

so, aren't you repeating the array three times into a new dimension?

#

because that's what I just did.
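
A minimal sketch of repeating into a new last axis, assuming a (315, 375) grayscale array like the one described (filled with 255 here as a stand-in for the real image):

```python
import numpy as np

# Stand-in for the grayscale image: shape (315, 375)
gray = np.full((315, 375), 255, dtype=np.uint8)

# Add a trailing axis, then repeat each pixel value 3 times along it
rgb = np.repeat(gray[:, :, np.newaxis], 3, axis=2)

print(gray.shape, rgb.shape)
```

Every pixel X becomes [X X X]. np.stack([gray] * 3, axis=-1) gives the same result.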

urban meadow
#

yes

#

ok i see it tyty

lapis sequoia
serene scaffold
lapis sequoia
serene scaffold
#

but that code didn't actually solve the asker's question.

serene scaffold
#

so if you have an array of size 12, you can reshape it to (2, -1, 2), and then -1 gets interpreted as 3, because 2 * 3 * 2 is 12.
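
The same thing as a runnable snippet:

```python
import numpy as np

arr = np.arange(12)               # 12 elements
reshaped = arr.reshape(2, -1, 2)  # numpy infers the middle dimension

print(reshaped.shape)  # (2, 3, 2)
```

Only one dimension can be -1, and the total size has to divide evenly by the dimensions that are given.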

urban meadow
#

in a philosophical perspective u use -1 because you can never get to -1 (unless you go backwards/use a negative step). even if you use an arbitrarily high number you will end up getting to it at some point

#

ok i never knew that about numpy

serene scaffold
#

I think they just picked -1 somewhat arbitrarily.

lapis sequoia
#

@serene scaffold I see. Thanks for explanation. Btw, are you maybe familiar with statistical tests?

serene scaffold
#

like t tests?

lapis sequoia
serene scaffold
lapis sequoia
#

I want to make sure that the model with the greater mean is really better

#

What test do you propose that I use?

serene scaffold
#

this assumes that each prediction is independent.
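
The test suggested here is not preserved in the log, so the following is not necessarily what was proposed. As one possible sketch, assuming both models were scored on the same folds or samples (the score lists are made up), a paired sign-flip permutation test checks whether the difference in means could plausibly be chance:

```python
from itertools import product
from statistics import mean


def paired_permutation_test(scores_a, scores_b):
    """Exact two-sided sign-flip test on paired score differences.

    Assumes the pairs are independent of each other. Enumerates all 2**n
    sign assignments, so it is only practical for a small number of pairs.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        flipped = abs(mean(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical per-fold accuracies for two models.
model_a = [0.90, 0.80, 0.85, 0.95, 0.70]
model_b = [0.80, 0.70, 0.80, 0.90, 0.60]
p_value = paired_permutation_test(model_a, model_b)
print(p_value)
```

With only five pairs the smallest achievable two-sided p-value is 2/32, so more folds are needed before a result like this can look convincing.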

lapis sequoia
#

I read what identically distributed means

lapis sequoia
#

What do you think? @serene scaffold

serene scaffold
#

I'm not really sure; I don't have time to dive in, unfortunately

lapis sequoia
#

@stone marlin Are you available maybe? I don't know if my data is identically distributed, googled and still not sure

serene scaffold
lapis sequoia
lapis sequoia
serene scaffold
lapis sequoia
#

and community members deserve to only be pinged to draw their attention to ongoing conversations that they've chosen to participate in
Well, I wanted to ask him about something we talked about

serene scaffold
serene scaffold
#

If what you are currently typing pertains to this, please send it to @sonic vapor

lapis sequoia
#

Yeah ok, I don't have time for that

#

I will follow rules

lapis sequoia
#

Hello, I have an existing project and I want to set up a dev environment with conda, because I'm using libraries compiled in C and what started as a relatively small project has turned into a big one. I'm unsure whether I should install Anaconda in Windows Subsystem for Linux or install it natively on Windows, since I have never used conda before and I'm not sure which would be the better choice

rapid pawn
desert oar
#

it works fine in powershell/cmd, you don't need a VM or WSL

#

in general building packages is nontrivial in windows, whereas in a linux-based environment you typically have a sensible build toolchain already set up

#

but if you are using pre-compiled conda packages, you should be fine working directly in windows

modest mulch
#

does anyone know how to approximate/calculate the median or nth quantile of very large datasets that can't fit into memory at once? preferably in batches rather than having to iterate through each sample.
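
No answer to this appears in the log. One common approach, sketched here under the assumption that a rough value range (`lo`, `hi`) is known in advance, is to accumulate a fixed-bin histogram one batch at a time with NumPy and read the quantile off the cumulative counts. The function name and parameters are illustrative, not from any library:

```python
import numpy as np


def approx_quantile(batches, q, lo, hi, n_bins=1000):
    """Approximate the q-quantile of data delivered in batches.

    Only the bin counts are kept in memory. Accuracy is limited to about
    one bin width; values outside [lo, hi] are ignored by np.histogram.
    """
    edges = np.linspace(lo, hi, n_bins + 1)
    counts = np.zeros(n_bins, dtype=np.int64)
    for batch in batches:
        hist, _ = np.histogram(batch, bins=edges)
        counts += hist
    cum = np.cumsum(counts)
    target = q * cum[-1]
    i = int(np.searchsorted(cum, target))       # first bin reaching the target
    below = cum[i - 1] if i > 0 else 0
    frac = (target - below) / counts[i] if counts[i] else 0.0
    return edges[i] + frac * (edges[i + 1] - edges[i])  # interpolate within the bin


# Usage on synthetic data split into 10 batches.
rng = np.random.default_rng(0)
data = rng.uniform(0, 100, 100_000)
estimate = approx_quantile(np.array_split(data, 10), 0.5, lo=0, hi=100)
print(estimate, np.median(data))
```

If the range is not known up front, a first pass over the batches to find the minimum and maximum would be needed, or a sketch structure such as a t-digest could be used instead.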

lapis sequoia
#

Thank you very much for both explanations. I will definitely follow your leads tomorrow when I set up my development environment. I will probably stick to installing Miniconda instead of Anaconda, natively in Windows, since I think the packages I will be using are accessible with Miniconda

#

I have one more question though

lapis sequoia
#

I'm planning to use Jupyter notebooks to display relevant information in a user-friendly manner thanks to Markdown. I have never used Jupyter notebooks before, so I wonder how I can set up my conda environment to work with them

desert oar
desert oar
#

jupyter itself is a client/server setup

#

the actual code is run by a "jupyter kernel", which acts as the server

#

and you interact with a "jupyter frontend", which acts as the client

#

ipykernel (part of the ipython project) is the standard python kernel. you install this into your conda environment with all your dependencies

#

jupyter notebook or jupyterlab are frontends. these can run any kernel anywhere on your system. you can install them right into your project, in which case no additional setup is required. but if you are hosting this over the web, you might want to run a centralized instance (e.g. jupyterhub), in which case you will need to set up a "kernel spec" that tells the jupyter frontend how to start and connect to your desired jupyter kernel
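
To make the "kernel spec" idea concrete: it is a small `kernel.json` file that tells a frontend which command starts the kernel. The sketch below builds one in the shape ipykernel uses; the display name is a made-up example, and `sys.executable` stands in for the Python interpreter of your conda environment:

```python
import json
import sys


def build_kernel_spec(display_name, python_executable=sys.executable):
    """Return a kernel spec dict pointing at a specific Python interpreter."""
    return {
        # Command the frontend runs; {connection_file} is filled in by Jupyter.
        "argv": [
            python_executable,
            "-m",
            "ipykernel_launcher",
            "-f",
            "{connection_file}",
        ],
        "display_name": display_name,
        "language": "python",
    }


spec = build_kernel_spec("Python (my-project)")  # hypothetical display name
print(json.dumps(spec, indent=2))
```

In practice you rarely write this by hand: running `python -m ipykernel install --user --name my-project` from inside the activated environment generates and registers an equivalent file.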

#

(it can be a bit confusing because jupyter notebook is itself a server. but with respect to the jupyter protocol, it's a client)

#
jupyter kernel  <->  jupyter notebook  <->  user's browser
  (conda env)           (anywhere)            (anywhere)
rapid pawn
#

you can even use jupyter notebook in pycharm

#

which enhances the experience massively

#

also if you decided to stick with browsers i suggest jupyter lab instead of jupyter notebook

#

since jupyter lab is the way forward

desert oar
#

it's worth being precise about the terminology: pycharm can act as a jupyter frontend/client, and it can read and edit the same file format as jupyter notebook

rapid pawn
#

yes exactly

#

jupyter notebook files have the extension .ipynb

#

which stands for ipython notebook iirc

#

in my experience if you get a massive project jupyter notebook or jupyter lab in a browser alone would be really messy

#

because it lacks a lot of the IDE functionalities such as stepwise debug, auto association of variables, suggestive contexts etc

lapis sequoia
# desert oar it depends a little on how the notebooks will be shared/hosted/run

We will use it mainly during development to see a more visual representation of the data, and also to show it to the client so that they better understand how the process works. I don't think we are planning to host it alongside the code, which will probably be installed globally as a Python package on a server hosted by the client

desert oar
#

it's a way to host jupyter notebooks with user authentication

#

you can serve it over http and put it behind your company's domain

lapis sequoia
#

Wow, thank you so much to all of you for the in-depth explanations. I will be reading through the messages to make sure that I have a proper understanding of everything so that tomorrow I can start setting up the environment

#

And I will definitely propose the idea of JupyterHub to my organization

stone marlin
#

As noted before by both myself and now by Sterlercus, please do not ping me, it pings on all my devices.

#

I'm not sure if your data is iid, but you can assume it is for the sake of this problem.

#

I also am going to be busy for a while, I've got to finish a project fairly quickly, so unfortunately I will be unable to help for a bit.