#data-science-and-ml
these are the last 2 layers of my model
top_activation (Activation)         (None, 17, 17, 2304)   0         ['top_bn[0][0]']
avg_pool (GlobalAveragePooling2D)   (None, 2304)           0         ['top_activation[0][0]']
top_dropout (Dropout)               (None, 2304)           0         ['avg_pool[0][0]']
predictions (Dense)                 (None, 1000)           2305000   ['top_dropout[0][0]']
I'm not sure, I'm not great at NN stuff yet, someone else in here should be able to answer you though. :']
alrightt,thankss:))
i think the article made sense, i had to add GlobalAveragePooling2D so that it maps to the new dense layer
but i am getting another error
logits and labels must have the same first dimension, got logits shape [32,3] and labels shape [96]
[[node sparse_categorical_crossentropy/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits
(defined at C:\Users\Urja\anaconda3\lib\site-packages\keras\backend.py:5113)
i haven't even used logits
I'm not sure, but this SO comes up when I google the error. https://stackoverflow.com/questions/49161174/tensorflow-logits-and-labels-must-have-the-same-first-dimension
I am new in tensoflow and I want to adapt the MNIST tutorial https://www.tensorflow.org/tutorials/layers with my own data (images of 40x40).
This is my model function :
def cnn_model_fn(features,
Thank you soo much!!
Anyone here use vscode?
Hello, i have a problem...im doing transfer learning for a dataset with 3390 images
The time taken for an epoch is an hour ://
Is there any way i can improve the speed
After the first epoch the training stopped cause the i had no memory left on my C drive
@wicked grove I'm no expert myself and no idea about the time since I never did such a thing. I suggest you do the following:
- If you have a GPU use cuda to train your NN
- Enable the profiler of whatever NN library you use, or alternatively measure it yourself. E.g. in your loop for one epoch I guess you have something like "train()" and "evaluate()", and inside train() you might have model.predict(batch) and optimizer.step() or whatever. You have several big function calls; measure each of them and check which one takes the longest. Also make sure you measure the duration of your file reads/writes.
- Generally, file I/O (writing/reading a file) is slow. You also want to minimize communication between CPU memory (your RAM) and CUDA memory, so move as much data to CUDA memory as possible. I can only give details here if you use PyTorch's Dataset class, but basically: try to store your images in HDF5 format. Use h5py for that (it's probably quite a bit of work).
Are you sure you ran out of space on your C drive and not out of memory? (Maybe you ran out of memory and your OS created a swap file? No idea how Windows works really, sorry.) My guess is that you somehow read too much into memory.
E.g. if you run out of memory on Linux, Linux will dump memory to swap space on disk, which makes memory reads extremely slow. Could be a reason, but I'm really guessing here because, as I said, I have no idea about Windows. Make sure to monitor your memory.
To put it simply: make basic measurements to find the bottleneck and then try to fix it.
Maybe a plan of action: what I'd do is import time and then put start = time.time() and end = time.time() around the basic function calls like the optimizer step, model evaluation, and file I/O, and print(end - start) for each of them. Also open Task Manager and look at the memory usage. If it gets close to 100% you need to fix how you read data. Also make sure you use CUDA.
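A minimal sketch of that measurement idea, assuming the epoch is split into stages. `load_batch`, `train_step` and `evaluate` are hypothetical placeholders that only sleep, so swap in your real calls. It uses `time.perf_counter()`, which is generally better suited to measuring durations than `time.time()`.

```python
import time
from collections import defaultdict

# Hypothetical stand-ins for the real stages of one epoch.
def load_batch():
    time.sleep(0.02)   # pretend file I/O

def train_step():
    time.sleep(0.01)   # pretend forward/backward pass

def evaluate():
    time.sleep(0.005)  # pretend validation

def timed(totals, name, func):
    """Run func() and add its wall-clock duration to totals[name]."""
    start = time.perf_counter()
    result = func()
    totals[name] += time.perf_counter() - start
    return result

def profile_epoch(n_batches=3):
    """Time each stage of one (fake) epoch and return seconds per stage."""
    totals = defaultdict(float)
    for _ in range(n_batches):
        timed(totals, "load_batch", load_batch)
        timed(totals, "train_step", train_step)
    timed(totals, "evaluate", evaluate)
    return dict(totals)

totals = profile_epoch()
# Print the slowest stage first.
for name, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} {seconds:.3f}s")
```

Whichever stage ends up at the top of the list is the one worth optimizing first.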
How do I make a many to one RNN network that can take an infinitely long string
In tensorflow
Hello everyone, I'm on this task for sentiment analysis for Amazon reviews. So I have the dataset loaded on Colab, it has 3 columns namely class_index (1 to 5), review title and review text. I am asked to perform EDA on this dataset and I honestly do not know what to do next. I'm wondering if I should predict the class index or what?
by EDA, do you mean "exploratory data analysis"? well, look at the data, see what is in it that could be used to signify which class a given review has been assigned.
and yes, your goal is to design a system that could predict the "class index" given only the review title and the review text.
can you think of what features you could use?
Thanks for the response. I would use a pretrained model, but I was wondering if I could use just the review text or I have to use both.
I don't know why you're doing this task. Did someone tell you that you have to use the title?
I assume you can do it however you want. You could also concatenate the two.
Yes, I was asked to perform EDA on the dataset and report the final performance metrics for my approach. Also suggest ways in which I can improve my model.
So I'm guessing I would have to create a model to predict the class index, then show the accuracy & loss, and some other metrics and suggest ways to improve the model.
I'm working on it already.
Sounds good to me
Thanks.
I assume you're familiar with NLP basics like tokenization?
Sure.
Sure, as in yes?
Yes.

I'm also familiar with embedding, conv1d. So I'll just try out different stuffs and see what works.
👍🏿
Uh, I would like to learn whole math and how it works behind machine learning. Any suggestions to resources
I'm not good at reading research papers, so if there is any alternatives
Something is pinned about math. I guess this https://mml-book.com/
Companion webpage to the book “Mathematics for Machine Learning”. Copyright 2020 by Marc Peter Deisenroth, A. Aldo Faisal, and Cheng Soon Ong. Published by Cambridge University Press.
research papers are intended to be read by other experts. They also aren't always that well written. So don't feel bad if you find them confusing. One technique is to read the abstract, section headings, and conclusion a few times before trying to read the whole paper.
Hmm, yeah I will try it, but is there any easier way like books mentioned above, for deep learning algorithm, deep reinforcement algorithms, new algorithms?
do you go to a school/university, or are employed by a tech company? If so, I would see if you have access to O'Reilly's online library.
they have deep learning books with Python examples.
I'm currently, a first year college student in Data science and Artificial intelligence
great. see if your college/university gives you that access. Mine did.
Uh, in my country, covid lockdown has been placed, so I couldnt access the library in my college. So, I'm searching for online resources
It will continue like maybe min 3 to Max 8 months
well, it's O'Reilly's online library, so it's a matter of seeing if your university will let you open an account with O'Reilly.
in my case, it was a matter of checking the website for the university library. Though it might be as easy as trying to create an account using your university email.
I strongly suspect they have one, what to do if have to learn on my own?
I'm not sure I follow. You're asking how to get ahold of online books, right? There are a bunch on the website I just mentioned; you just have to try to make an account with your university email and see if your university has a contract with them.
If you need help finding books that aren't behind a paywall, I can try to help with that, instead.
Anyone knows why this is happening? Code works in cell 1 but not in cell 2
That would help me
Can you show the whole error message as text (not a screenshot)?
It could be a utf-16 character kinda error
There are many blogposts and YT videos out there to break stuff down very well
try with the "Mathematics for Machine Learning" book. with a little googling, it should get you far with just High school math knowledge
`---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
C:\Users****\AppData\Local\Temp/ipykernel_20900/2342191653.py in <module>
----> 1 Year1 = pd.read_csv(r'..\datasets\2001.csv')
C:\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
309 stacklevel=stacklevel,
310 )
--> 311 return func(*args, **kwargs)
312
313 return wrapper
C:\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in read_csv(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, encoding_errors, dialect, error_bad_lines, warn_bad_lines, on_bad_lines, delim_whitespace, low_memory, memory_map, float_precision, storage_options)
584 kwds.update(kwds_defaults)
585
--> 586 return _read(filepath_or_buffer, kwds)
587
588 `
that was suggested earlier in this conversation, interestingly enough
`C:\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _read(filepath_or_buffer, kwds)
480
481 # Create the parser.
--> 482 parser = TextFileReader(filepath_or_buffer, **kwds)
483
484 if chunksize or iterator:
C:\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in __init__(self, f, engine, **kwds)
809 self.options["has_index_names"] = kwds["has_index_names"]
810
--> 811 self._engine = self._make_engine(self.engine)
812
813 def close(self):
C:\Anaconda3\lib\site-packages\pandas\io\parsers\readers.py in _make_engine(self, engine)
1038 )
1039 # error: Too many arguments for "ParserBase"
-> 1040 return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
1041
1042 def _failover_to_python(self):
C:\Anaconda3\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py in __init__(self, src, **kwds)
67 kwds["dtype"] = ensure_dtype_objs(kwds.get("dtype", None))
68 try:
---> 69 self._reader = parsers.TextReader(self.handles.handle, **kwds)
70 except Exception:
71 self.handles.close()
C:\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()
C:\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._get_header()
C:\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
C:\Anaconda3\lib\site-packages\pandas\_libs\parsers.pyx in pandas._libs.parsers.raise_parser_error()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 343: invalid continuation byte`
guess we should add that to our website
Also if you can afford it @unborn temple nnfs.io is a very popular starter with code and simply explained math to start you off
that alone should give a very strong foundation to your knowledge
@pearl grove did you check what the encoding is for the csv file? guess it's not utf-8, or something
how do I do that?
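One rough way to check, sketched with the standard library only: read the raw bytes and try a few candidate encodings in order. The candidate list and the demo file are assumptions for illustration. Note that "latin-1" can decode any byte sequence, so a success there only means "readable", not "definitely correct".

```python
import os
import tempfile

def first_working_encoding(path, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes the whole file."""
    with open(path, "rb") as f:
        raw = f.read()
    for encoding in candidates:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None

# Demo file: byte 0xE4 is "ä" in cp1252/latin-1 but is not valid UTF-8 here.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"name,city\nJ\xe4rvinen,Espoo\n")
    demo_path = tmp.name

detected = first_working_encoding(demo_path)
print(detected)
os.remove(demo_path)
```

The detected name can then be passed on, e.g. `pd.read_csv(path, encoding=detected)`.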
But wait when I write this code instead Year1 = pd.read_csv('..\datasets\2001.csv') I get the File not found error
I'm not sure how to do it on colab, unfortunately
try switching the backslashes to forward slashes?
I think colab uses a linux environment, but I'm not sure
Yeah I'm aware of it, its a good one for neural networks, sentdex, even has a video series for free, it helps in understanding normal Neural networks, but I need deep reinforcement learning algorithm math, and new algorithm math
it still gives the unicode error
Yeah you are right it's linux
Im using jupyter
jupyter isn't the same as the operating system.
oh nvm then, I thought colab was an IDE, havent heard of it before
is it possible to share the colab so I can look?
the ideas and basics are the same
colab is an online environment; it's not an IDE per se.
Um how do I do that...
Try reading it with the csv module; it might be slower, but it will tell you where the error is
wdym by csv module
Source code: Lib/csv.py
The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. CSV format was used for many years prior to attempts to describe the format in a standardized way in RFC 4180. The lack of a well-defined standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer.
sorry I'm very new to coding so still unaware of the jargon
Try this
@pearl grove try putting your file path in filename variable in that code
okay I'll do that, thanks
File "C:\Users\****\AppData\Local\Temp/ipykernel_20900/1996697643.py", line 1
filename = "C:\Users\Mohammed Haris\Documents\Prog for DS\Coursework\datasets\2000.csv"
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
It didnt show error when i just typed in the actual file name
This might help you faster
filename = r"C:\Users\Mohammed Haris\Documents\Prog for DS\Coursework\datasets\2000.csv"
This the correct file path, the one you used is not
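A small sketch of why the path spelling matters (the path below is a made-up example). In a normal string, Python treats a backslash as the start of an escape sequence: `\U` expects eight hex digits, which is the 'unicodeescape' error above, and `\200` is an octal escape for a single character, which is probably why `'..\datasets\2001.csv'` ended up as File not found.

```python
# Three equivalent ways to write a Windows path (made-up example path).
raw = r"C:\Users\demo\datasets\2000.csv"          # raw string: backslashes kept as-is
escaped = "C:\\Users\\demo\\datasets\\2000.csv"   # each backslash escaped
forward = "C:/Users/demo/datasets/2000.csv"       # forward slashes also work on Windows

assert raw == escaped
assert raw.replace("\\", "/") == forward

# In a normal string, backslash + 200 is one octal-escaped character,
# so the first literal below has 2 characters and the raw one has 5.
print(len("\2001"), len(r"\2001"))  # 2 5
```

Use either the `r"..."` prefix or doubled backslashes, not both at once.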
Hii all
I tried that too but same error
k lemme check that
I have 32 images for my object classifier. Is 30 for training and 2 for testing fine?
32 images of all objects if I might add
Uhh, that depends on the problem you are working
3 objects, just infront of a white background
Is the problem hard or easy
not really hard
it has to decide what object it is
from 3 very different ones, in front of a white wall
the objects are a tennis ball, nasal spray and hand cream
which all look very different
(in color and size)
Hmm, you could enlarge your dataset with augmentation methods like cropping the images (without cutting out the object), changing the colour, or stretching the image
note, it doesnt need to be very good yet, im following a tutorial and this is the first one
Are you using tensorflow?
Then wait a sec, I will tell you a preprocessing method
Here are the methods to increase the dataset
Hmm, then, 25 for training, remaining for testing
btw someone just told me I should use 70/30 for training/testing
okay I will do that
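A minimal sketch of a 70/30 split using only the standard library. The file names are hypothetical stand-ins for the 32 images, and the fixed seed is just there to make the split repeatable.

```python
import random

def split_dataset(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical file names standing in for the 32 images.
images = [f"img_{i:02d}.jpg" for i in range(32)]
train, test = split_dataset(images)
print(len(train), len(test))  # 22 10
```

With only 3 classes and this few images, it is probably worth checking that every object shows up in both parts; a purely random split does not guarantee that.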
There are cases, training using 15 images, and getting awesome results
It all depends on complexity
Are you fine tuning a trained model, or creating one from scratch?
fine tuning I think
cause I picked from model zoo
but not sure
im new to this
It might take some time then
So okay, good luck on your project
thanks!
@unborn temple there doesnt seem to be a lot happening here except it having saved the first checkpoint 40 min ago
even though it is using 20% cpu
and directml has connected to my gpu
Hello is it possible to compare the values of two columns of different length in two different dataframes?
df['habitants'] = habitants_df.loc[(habitants_df['Municipality'].apply(lambda x: str(x).split(' ', 1)[-1]) == df['cityTown']), 'Total']
Getting: raise ValueError("Can only compare identically-labeled Series objects") ValueError: Can only compare identically-labeled Series objects
df has approximately 53k rows whereas habitants_df has 501 rows
What is a great way to learn Python ML?
@lapis sequoia you can't compare series that don't have identical sets of indices. You have to use the eq method and specify a fill value
!docs pandas.Series.eq
Series.eq(other, level=None, fill_value=None, axis=0)
Return Equal to of series and other, element-wise (binary operator eq).
Equivalent to `series == other`, but with support to substitute a fill_value for missing data in either one of the inputs.
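A small sketch of the difference, using two made-up Series of different lengths. Keep in mind that `eq` still compares label by label after aligning the indexes; it does not look values up in the other Series.

```python
import pandas as pd

a = pd.Series([1, 2, 3])   # index 0, 1, 2
b = pd.Series([1, 2])      # index 0, 1

try:
    a == b                 # the operator refuses differently-labeled Series
except ValueError as exc:
    print("ValueError:", exc)

aligned = a.eq(b)                 # aligns on index; the missing label compares as False
filled = a.eq(b, fill_value=3)    # the label missing from b is treated as 3 before comparing
print(aligned.tolist(), filled.tolist())
```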
I get the error "You must compile your model before training/testing. Use model.compile(optimizer, loss)." on the last line
I compiled it... anyone spots my mistake?
model_2 = tf.keras.Sequential([
    data_augmentation,
    layers.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2), strides=2),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2), strides=2),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D((2, 2), strides=2),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(5, activation='softmax'),
])
model_2.compile(optimizer='adamax',
                loss=tf.keras.losses.SparseCategoricalCrossentropy(),
                metrics=['accuracy'])
history_3 = model_2.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs,
    callbacks=[es]
)
adamax_nodropout_es = model_2.evaluate(val_ds)
@vale isle you redefined model_2
depends on many things, like your prior experience and learning style
I copy pasted my code wrong from my notebook (its in different cells), edited it now
So you're using a notebook? Make sure you're executing the cells in the right order. Did you compile it in the same cell where it was defined?
i tried both, in the same cell and in a different cell
i'm going to restart and run the complete notebook. i guess i messed up with some copy pastes trying out different parameters
I'd have to look at the notebook and know the exact execution order. But I would restart the kernel and do everything in one cell until you're sure it's working
allright, thanks!
Hi, can someone explain what happens when the transform() method is used please? I don't understand what it does and the purpose of it. I have an example here https://paste.pythondiscord.com/vubuyurewo.yaml
In the case of the Scaler, the Scaler "fits" onto your data and gets a mean and a standard deviation (which are given by the mean_, scale_ parts below). It then computes the corresponding z-score for whatever you put in when you call transform.
In [1]: from sklearn.preprocessing import StandardScaler
In [2]: import numpy as np
In [3]: scaler = StandardScaler()
In [4]: x = np.array([10, 10, 10, 12, 10, 12, 9, 9, 8, 8]).reshape(-1, 1)
In [7]: scaler.fit(x)
In [10]: scaler.mean_
Out[10]: array([9.8])
In [11]: scaler.scale_
Out[11]: array([1.32664992])
In [13]: scaler.transform(np.array([9.8, 12, 15, 20]).reshape(-1, 1))
Out[13]:
array([[0.        ],
       [1.6583124 ],
       [3.91964748],
       [7.68853929]])
In general, there are two parts to most estimators / processors in sklearn: fit and transform. Fit takes in training data and figures out what it needs to do with the data (compute means, stdevs, or, in the case of models, the model itself). Transform converts the data into a new form to be used wherever you need it. Scaling, for example, will scale the data nicely. PCA will apply PCA to the data.
Pipelining in sklearn makes heavy use of this and it might "click" better to see things in terms of pipelines:
https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
Bayesian optimization is a little more sophisticated. It assumes that a specific probability distribution, which is typically a Gaussian distribution is underlying the performance of model architectures. So you use observations from tested architectures to constrain the probability distribution and guide the selection of the next option. This allows us to build up an architecture stochastically based on the test results and the constrained distribution.
I don't understand what does
which is typically a Gaussian distribution is underlying the performance of model architectures.
mean.
Also, what does this
to constrain the probability distribution and guide the selection of the next option.
mean?
Bayesian optimization is a little rough to understand. The gist (greatly simplified) is:
You assume that something has a certain distribution (like maybe a normal curve). But then you get more data and you notice it's a little skewed. So you update the distribution. Then the distribution tells you what to do next. You do it, you get more data which changes the distribution a bit more --- so you update it and continue.
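To make the "assume, observe, update" loop concrete, here is a deliberately tiny sketch: a normal belief about one unknown number (say, a model's accuracy) updated after each noisy observation. All numbers are made up, and this is only the updating step, not Bayesian optimization itself; real implementations typically model the whole search space (often with a Gaussian process) and use the updated belief to pick the next configuration to try.

```python
def update_normal(prior_mean, prior_var, observations, noise_var):
    """Conjugate update of a normal belief about an unknown mean,
    given noisy observations with known noise variance."""
    n = len(observations)
    if n == 0:
        return prior_mean, prior_var
    post_precision = 1.0 / prior_var + n / noise_var
    post_var = 1.0 / post_precision
    post_mean = post_var * (prior_mean / prior_var + sum(observations) / noise_var)
    return post_mean, post_var

# Made-up numbers: we first believe the accuracy is around 0.50,
# then observe three validation scores one at a time.
mean, var = 0.50, 0.04
for score in [0.71, 0.69, 0.73]:
    mean, var = update_normal(mean, var, [score], noise_var=0.01)
    print(f"after score {score:.2f}: mean={mean:.3f}, var={var:.4f}")
```

Each observation pulls the mean toward the data and shrinks the variance, which is the "constraining the distribution" part of the quoted text.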
@lapis sequoia @vale isle did you both figure out what you were working on?
The transform method will transform the data into a new form that I need, by figuring it out by itself? The z-score is used to make sense of the given data?
It does figure it out by itself given the data you put into the "fit" method. StandardScaler, for example, when you "fit" the data, it'll find the mean and stdev of that data. Then, whenever you "transform" new data, it'll scale it according to the mean and stdev it found.
The z-score is a standard thing in statistics that maps some value to a standard normal distribution. I think this site might explain what the z-score is and how it's used:
https://www.simplypsychology.org/z-score.html
The gist is, it tells you how many standard deviations away from the mean a data point is. So if you transform a number and get "1" back, it was 1 stdev above the mean. If you get -1.6 back, it was 1.6 stdev below the mean.
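The same fit/transform idea written by hand with the standard library, as a sketch of what the scaler computes for the numbers in the session above (StandardScaler uses the population standard deviation, which is why `pstdev` is used here):

```python
import statistics

train = [10, 10, 10, 12, 10, 12, 9, 9, 8, 8]

# "fit": learn the mean and the (population) standard deviation
mean = statistics.fmean(train)
std = statistics.pstdev(train)

def transform(values):
    """Return the z-score of each value relative to the fitted mean/std."""
    return [(v - mean) / std for v in values]

print(mean, std)
print(transform([9.8, 12, 15, 20]))
```

The value 9.8 maps to 0 because it equals the fitted mean, and 12 maps to about 1.66 because it sits about 1.66 standard deviations above it.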
Is there a way to find the range of the stdev transformed data? Thank you once again for the great explanation! I understand it much better now.
No problem, if you haven't done stats in a while it can be a bit confusing, and you slowly get used to the fit-transform stuff in sklearn.
What do you mean by the range?
Yeah haven't done stats in a while. I might have to go through my notes on stats again just to refresh my memory.
Like the min max of the stdev of the given data.
Like, a confidence interval? The mean and standard deviation of data are both floats. For example,
In [14]: a = np.array([1, 2, 3, 4])
In [15]: a.mean()
Out[15]: 2.5
In [16]: a.std()
Out[16]: 1.118033988749895
Let's say you transform that np.array([1,2,3,4]).reshape(-1,1)
you'd get this:
https://paste.pythondiscord.com/iyerurihax.lua
So the range of the transformed data would be from -1.34164079 to 1.34164079, how do you go about finding that?
No experience, and learning style being hands on, so I guess through projects.
Well, you could find the range of the transformed training data by doing .max(), .min(), but there won't be a max or a min in general. The StandardScaler can transform a number to any number between -infty and infty.
Oh i see so there's no point in finding that. Thanks!!
Thanks for the clarification, Can I set a multi-index on the columns that I want to compare within the dataframe that has less rows and then perform a .loc with column values of the other dataframe? This is what I mean:
multi_indexed_habitants_df = habitants_df.set_index(['Municipality', 'Period']).sort_index(ascending=True)
df['habitants'] = multi_indexed_habitants_df.loc[(df['cityTown'], pd.to_datetime(df['startDate'], format='%Y-%m-%dT%H:%M:%S').dt.year), 'Total']
wow, there's a lot going on here. can you run this and show me the resultant text?
for d in (df, habitants_df):
    print(d.head().to_dict('list'), d.index)
Please ping me when you've done that.
also, please run it before the code that you show me, so that I can see what the data looks like before this
Sure
Is there any way to use a function that returns two separate values with rolling.apply() ?
in pandas
df['A'], df['B'] = series.rolling(7).apply(my_func)
something like this?
when you apply a function, don't call it. you want to pass the actual function
ohh ok, but can I use one that returns two values
and no, a function applied with rolling has to return a numeric type.
oh I know what I could do
I could apply rolling inside the function to two separate series
and then return them
or
no
roller = series.rolling(7)
df['A'], df['B'] = roller.apply(my_func), roller.apply(other_func)
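A runnable sketch of that two-function approach. `window_range` and `window_last_minus_mean` are hypothetical examples of per-window statistics; each one returns a single number, which is what `rolling.apply` expects.

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# Hypothetical per-window statistics; each returns one float.
def window_range(window):
    return window.max() - window.min()

def window_last_minus_mean(window):
    return window.iloc[-1] - window.mean()

roller = series.rolling(3)
df = pd.DataFrame({
    "A": roller.apply(window_range),            # pass the function, don't call it
    "B": roller.apply(window_last_minus_mean),
})
print(df)
```

The first two rows are NaN because a window of 3 is not full yet.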
you know what I mean
@lapis sequoia still working on it?
I have it I'm just trying to pass it prettier
it doesn't matter as I'm going to copy and paste it into a REPL
Wow, thank you for asking! Yes i solved it, thanks for your help :))
There u have it
{'incidenceId': [56125, 56123, 56122, 56121, 56120], 'sourceId': [1, 1, 1, 1, 1], 'incidenceType': ['Accidente', 'Accidente', 'Accidente', 'Seguridad vial', 'Seguridad vial'], 'autonomousRegion': ['Euskadi', 'Euskadi', 'Euskadi', 'Euskadi', 'Euskadi'], 'province': ['BIZKAIA', 'ARABA', 'GIPUZKOA', 'BIZKAIA', 'GIPUZKOA'], 'carRegistration': ['BI', 'VI', 'SS', 'BI', 'SS'], 'cause': ['Alcance', 'Alcance', 'Alcance', 'Avería', 'Avería'], 'cityTown': ['Zeanuri', 'Vitoria-Gasteiz', 'Eskoriatza', 'Barakaldo', 'Villabona'], 'startDate': ['2022-01-08T13:21:08', '2022-01-08T13:36:43', '2022-01-08T13:38:31', '2022-01-08T12:27:20', '2022-01-08T10:52:05'], 'incidenceLevel': ['Green', 'Yellow', 'Yellow', 'Green', 'Green'], 'road': ['N-240', 'N-622', 'AP-1', 'A-8', 'N-1'], 'pkStart': [39.0, 5.0, 121.0, 124.0, 444.0], 'pkEnd': [39.0, 5.0, 121.0, 124.0, 444.0], 'direction': ['BILBAO', 'BILBAO', 'Madrid', 'CANTABRIA', 'IRÚN'], 'latitude': [43.06164, 42.88119, 43.01685, 43.28869, 43.19736], 'longitude': [-2.70798, -2.69503, -2.541243, -3.0047, -2.0412], 'incidenceName': [None, None, None, None, None], 'endDate': [None, None, None, None, None]}
Int64Index([ 0, 2, 3, 4, 5, 7, 8, 9, 10,
11,
...
57334, 57335, 57337, 57338, 57339, 57340, 57341, 57343, 57344,
57345],
dtype='int64', length=43493)
{'Municipios': ['Agurain/Salvatierra', 'Agurain/Salvatierra', 'Alegría-Dulantzi', 'Alegría-Dulantzi', 'Amurrio'], 'Sexo': ['Total', 'Total', 'Total', 'Total', 'Total'], 'Periodo': [2021, 2020, 2021, 2020, 2021], 'Total': ['5.029', '5.038', '2.925', '2.935', '10.307'], 'Provincia': ['Araba', 'Araba', 'Araba', 'Araba', 'Araba']}
RangeIndex(start=0, stop=502, step=1)
Yes I'm
wow, I've never met a Basque person 
Can I make a Sequential take any length of an input (for an RNN network)
pretty sure you have to use something like an RNN for that, but I'm not completely sure
@lapis sequoia can you explain what comparison you're trying to do? population for towns at each period?
Sure
Kaggle is the place to go then, they have both free courses and lots of projects https://www.kaggle.com/learn/intro-to-machine-learning
Learn the core ideas in machine learning, and build your first models.
Since I'm not returning a sequence, in theory, I should be able to give as long of an input as I please
Basically I'm trying to create a new column called habitants within the incidences dataframe that contains the number of habitants of that particular city where the incidence took place, for that I compare Municipality and cityTown which are the same but with different names for each dataframe and then I extract the year from the startDate column and compare it to the Period column which is just the year.
oh okay, you just have to do a merge operation
so first I did df['year'] = pd.to_datetime(df['startDate'], format='%Y-%m-%dT%H:%M:%S').dt.year
Maybe it could be accomplished easily as you said with a merge
I see
With that you have a new column with the year within incidences df
!docs pandas.DataFrame.merge
DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
Merge DataFrame or named Series objects with a database-style join.
A named Series object is treated as a DataFrame with a single named column.
The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.
So you need to pick which columns represent the same thing in both dataframes
which one means the same thing as Municipios?
cityTown? or province?
cityTown
okay great
so you will have something like left_on=['Municipios', 'Periodo'], right_on=['cityTown', 'year']
to show which columns are used to match the rows.
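A minimal sketch of that merge on tiny made-up frames that reuse the column names from the output above (the population figure for 2020 is invented). Here the incidents frame is on the left, so the `left_on`/`right_on` lists are swapped compared with the message above, and `how="left"` is an assumption that every incident should be kept even when no population row matches.

```python
import pandas as pd

# Tiny made-up frames using the column names shown in the output above.
df = pd.DataFrame({
    "cityTown": ["Amurrio", "Amurrio", "Zeanuri"],
    "startDate": ["2021-03-01T10:00:00", "2020-07-15T08:30:00", "2021-01-08T13:21:08"],
})
habitants_df = pd.DataFrame({
    "Municipios": ["Amurrio", "Amurrio"],
    "Periodo": [2021, 2020],
    "Total": ["10.307", "10.286"],   # "." is a thousands separator here
})

df["year"] = pd.to_datetime(df["startDate"], format="%Y-%m-%dT%H:%M:%S").dt.year

merged = df.merge(
    habitants_df[["Municipios", "Periodo", "Total"]],
    how="left",                          # keep every incident, even without a match
    left_on=["cityTown", "year"],
    right_on=["Municipios", "Periodo"],
)
# Turn "10.307" into the number 10307; unmatched rows stay missing (NaN).
merged["habitants"] = pd.to_numeric(merged["Total"].str.replace(".", "", regex=False))
merged = merged.drop(columns=["Municipios", "Periodo", "Total"])
print(merged)
```

Town names that are spelled differently in the two frames (for example 'Agurain/Salvatierra') would not match and would need to be normalized first.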
I see and then once I do the merge I can query those columns as usual
yes
I have to go, but I wrote down the solution. So see if you can figure it out
if you try and can't figure it out, I will show you.
Yeah sure I will try it tomorrow. Thank you very much
If you wanna learn basque let me know😅
This isn't really data science or ai, you may have better luck in one of the help channels, Viking.
mb wasnt supposed to send it here xD
i tried clicking on #discord-bots and mustve accidentally clicked here
Thanks for answer
When the distance between observations grows, supervised learning becomes more difficult because predictions for new samples are less likely to be based on learning from similar training examples.
What is meant by "observations" here?
Is that input variables?
can someone send the source code of a premade chatbot AI? would mean a lot
in python
Who wants to work on a project involving ML and AI?
Why don't you say what the project is? Remember that you can't recruit for closed-source projects or business ventures.
Oh. Well, were trying in the end to make like an alexa-type robot and soon have hardware for it... but i guess that counts as closed-source?
if you do it on github, then it's open-source
Ok. We are doing it on github
in a public repository? link?
getting it from dekriel
who is dekriel
Why is the csv file being read like this instead of displaying the columns and rows nicely, or is the csv file broken?
Like this
Maybe something went wrong with the code? or the website formatting is diffrent?
data_files = [file for file in os.listdir('./dataframes')]
both_subjects = pd.DataFrame()
for file in data_files:
    df = pd.read_csv('./dataframes/' + file)
    both_subjects = pd.concat([both_subjects, df])
both_subjects.head()
this is the code
Any idea how to fix it?
try this:
both_subjects = pd.concat((pd.read_csv(f'./dataframes/{file}', sep=';') for file in data_files), axis=1)
your separator appears to be a ;, whereas , is the default.
also, you should avoid design like both_subjects = pd.concat([both_subjects, df]). this involves copying every single cell of both_subjects every time, and is thus horribly inefficient
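A sketch of the "read everything, concatenate once" pattern, using in-memory text with made-up content in place of real files so it runs anywhere. It stacks the rows (the default `axis=0`), which is an assumption about what is wanted; pass `axis=1` to place the frames side by side instead.

```python
import io
import pandas as pd

# In-memory stand-ins for two semicolon-separated files (made-up content).
files = {
    "subject_a.csv": "name;score\nAna;7\nBen;9\n",
    "subject_b.csv": "name;score\nCai;8\n",
}

# Read every file with the right separator, then concatenate once at the end.
frames = [pd.read_csv(io.StringIO(text), sep=";") for text in files.values()]
both_subjects = pd.concat(frames, ignore_index=True)
print(both_subjects)
```

Collecting the frames in a list first means the data is copied once, instead of once per file as in the loop version.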
Hi, im trying to add the column "Days" to the column "Received". However, idk how to make it so python adds it with the days in the month. ex: 7 + 06.12.21 = 13.12.21
The library im using is pandas
Any help would be really appreciated
@midnight fossil one column is of floating point numbers, and one is of strings. are you trying to do addition or string concatenation?
oh I see. they're both wrong.
Days needs to be a timedelta, and Received needs to be a datetime.
I've heard of the datetime library before
but im not quite sure what you mean by that
sorry im a bit of a noob
like from dateutil.relativedelta import relativedelta?
a datetime is an exact time in the real world ("16.12.21"), and a timedelta is a duration ("7 days")
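The same distinction in plain Python, as a small sketch using the example from the question. The format string assumes the dates are written day.month.year with a two-digit year.

```python
from datetime import datetime, timedelta

received = datetime.strptime("06.12.21", "%d.%m.%y")  # an exact point in time
days = timedelta(days=7)                              # a duration

due = received + days
print(due.strftime("%d.%m.%y"))  # 13.12.21

# Month and year boundaries are handled automatically.
late = datetime.strptime("28.12.21", "%d.%m.%y") + timedelta(days=7)
print(late.strftime("%d.%m.%y"))  # 04.01.22
```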
Ok, thanks. I guess ill watch a video on how to convert "Received" to a datetime
can you do print(df.head().to_dict('list')) and copy and paste the text into this chat?
I will not accept a screenshot
import pandas as df
df = df.read_excel('C:/Users/enesi/OneDrive/Amazon accounting.xlsx')
Total = df['Days '] + df['Received']
print(df.head().to_dict('list'))
you have to copy and paste the result of the print statement, not the code.
oh, my bad😆
C:/Users/enesi/OneDrive/Amazon accounting.xlsx is not a file that I have, so this is how I can figure out what is in it
in a way that is useful
result[mask] = op(xrav[mask], yrav[mask])
TypeError: unsupported operand type(s) for +: 'float' and 'str'
'Amazon product testing': [1, 2, 3, 4, 5], 'Platform': ['Telegram', 'Telegram', 'Telegram', 'Telegram', 'Telegram'], 'Seller name': ['Mikki Mikki', 'Wangzekun', 'Chari', 'E S', 'Mikki Mikki'], 'Order date': ['03.12.21', '03.12.21', '04.12.21', '05.12.21', '05.12.21'], 'Order num': ['302-9210629-2828348', '302-8699596-8046740', '302-4841467-2222755', '302-4797916-9617919', '302-1684012-4898734'], 'product name': ['Uhrenladegerät', 'PC controller', 'GROJAT smart uhr', 'Aufsteckbürsten', 'Styles pen'], 'price': ['12,99', '27,99', '45,99', '7,99', '27,98'], 'Days ': [7.0, 5.0, 7.0, 5.0, 7.0], 'Received': ['06.12.21', '08.12.21', '08.12.21', '08.12.21', '08.12.21'], 'Reviewed': [1.0, 1.0, 1.0, 1.0, 1.0], 'refund status': [1.0, 1.0, 1.0, 1.0, 1.0], 'sold price': [nan, nan, nan, nan, nan], 'Add. Info': [nan, nan, 'picture', nan, nan], 'Trustworthy': ['YES', 'YES', 'YES', 'YES', nan]}
it just lists everything in the excel file
There's no opening {, but I'm going to assume that's the only character that's missing.
yeah
If there are missing columns in this, I will never be able to figure that out if you don't tell me.
>>> df2 = df.assign(
**{
'Order date': pd.to_datetime(df['Order date']), # Convert `Order date` to a timestamp
'Days': pd.to_timedelta(df['Days '], unit='D') # convert `Days ` to a duration, without the extra space
}
).drop('Days ', axis=1) # Drop the column that has the extra space
@midnight fossil look at this
good question! it's an esoteric python thing.
oh, cool
yes, I know what two asterisks you're referring to, lol
also remove the >>> as those are part of a REPL
which is a type of python environment
you're probably not using that.
anyway, from there, Days and Order date are in formats that you can work with as actual time representations
since Order date becomes an exact timestamp, you can add durations to it and get an updated timestamp
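For anyone following along, here is a minimal sketch of that idea using the first few `Days ` and `Received` values pasted above. One assumption on my part: the dates look day-first (`DD.MM.YY`), so I pass an explicit `format=`; without it pandas may read `03.12.21` as March 12. `Total` is just the name from the original attempt.

```python
import pandas as pd

# First three rows of the two relevant columns from the pasted dict
df = pd.DataFrame({
    "Days ": [7.0, 5.0, 7.0],
    "Received": ["06.12.21", "08.12.21", "08.12.21"],
})

df2 = df.assign(
    # Parse day.month.year strings into real timestamps
    Received=pd.to_datetime(df["Received"], format="%d.%m.%y"),
    # Turn the float day counts into durations, under a name without the trailing space
    Days=pd.to_timedelta(df["Days "], unit="D"),
).drop(columns="Days ")

# timestamp + duration = later timestamp
df2["Total"] = df2["Received"] + df2["Days"]
print(df2)
```

The same pattern should work on the full frame returned by `pd.read_excel`, as long as the column names match.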
I still got this errror for some reason:
I think its because you created a dataframe but I dont need a dataframe since im trying to get python to read off of an excel document
you did import pandas as df when you should have done import pandas as pd. if you import pandas as df, and then immediately name your first DataFrame df, then you won't be able to use the to_datetime or to_timedelta functions.
You are creating a DataFrame of the excel document.
Hi, I have three numpy arrays, x and y coordinates and deflections in the z coordinate, each of size about 2500. I tried to do meshgrid for the x/y coordinates, which was successful, but somehow when I try to do np.meshgrid(x,y,z) it uses all the memory on my machine (180GB).
Is this normal or is there a better way maybe?
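Not a full answer, but the memory blow-up looks expected to me: with three inputs of length n, `np.meshgrid` returns three arrays of n × n × n values each. A rough sketch of the arithmetic, assuming float64 data (8 bytes per value) and using tiny stand-in arrays for the shape demo:

```python
import numpy as np

n = 2500
bytes_per_array = n**3 * 8  # one float64 grid of shape (n, n, n)
print(bytes_per_array / 1e9, "GB for ONE of the three returned arrays")

# Tiny demo of the shapes involved (4 points instead of 2500)
x = np.linspace(0.0, 1.0, 4)
y = np.linspace(0.0, 1.0, 4)
z = np.linspace(0.0, 1.0, 4)

X3, Y3, Z3 = np.meshgrid(x, y, z)  # three 3D grids
X2, Y2 = np.meshgrid(x, y)         # two 2D grids
print(X3.shape, X2.shape)
```

If z holds the deflection measured at each (x, y) point, it is a value on the 2D grid rather than a third axis, so the 2D meshgrid is probably all that is needed. That part is a guess about the data, though.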
Heyy i just saw this
Thank you soo much for the detailed answer
I do have a gpu on my laptop but idk how powerful it is
Yes i will check CUDA out,but i switched to google colab and i really hope that works
Ah okay you are right i ran out of memory ( like my c drive was full)
I checked it ,apparently there is intel(R) UHD 620 and radeon graphics (I'm kinda confused about this)
Dam, how does that device work with that
Would it be unwise to try and learn how to do machine learning/ai with python (I have only probably written 400 lines of code max)
in general I would advise this:
1 learn Python basics
2 small projects with only Python (preferably no frameworks)
3 learn data science basics, think Numpy, Pandas and Matplotlib
4 learn ML/AI
because it is an advanced topic
this is just my two cents, I'm still somewhere at point 2/3 myself
Thanks 👍
It's not uncommon for computers to have a graphics chip on the motherboard as well as another GPU
you're welcome !
I have a neural network that should classify images of chairs into 18 different classes. More specifically, given an image of a chair, it should be able to predict its rotation. Each class is a certain rotation angle of a chair. For example, as you can see in the image, all chairs in the class "260 degrees" are rotated 260 degrees. I have the same 1.8k images in each class, but rotated differently of course. Now after 16 epochs, my model's accuracy has not improved. It's still guessing with an accuracy of 5.5% (which is random guessing if there are 18 classes). I'll also include my neural network's structure. Does anyone know some key things to make this better? (maybe even a different approach, as it should ideally predict more angles and not only angles divisible by 20)
I plotted a boxplot using seaborn library.
For "Iris-setosa",
Q1 is 3.1,
Q2 is 3.4,
Q3 is 3.6
IQR value is Q3-Q1 = 0.5
According to theory, lower and upper plot whiskers(extended thin line) should be on (Q1 - 1.5xIQR) and (Q3 + 1.5xIQR) respectively.
That would give us values 2.35 and 4.35 respectively.
But as you can see in the box plot for "Iris-setosa", the starting and ending points of the whiskers are not on those values. I have also specified that the whisker value should be 1.5 while calling boxplot. Why does this happen?
Can someone please tell me how i can increase the ram on google colab
is it possible to do it as a sparse array?
I found two articles confirming that you can use this weird trick: https://towardsdatascience.com/double-your-google-colab-ram-in-10-seconds-using-these-10-characters-efa636e646ff
though keep in mind that as a free service, they're not going to keep granting you more ram indefinitely
also, if you're using a deep learning library, I would confirm that your tensors are actually on the GPU and not in the CPU.
i read these articles and tried their code but ig colab changed their policy cause that option isnt there anymore
yes i am using tensorflow...but all my data is loaded in numpy arrays
why don't you convert them to cuda tensors?
and run it on my local pc?yes somebody suggested that to me, but idk if cuda works for radeon graphics
no, in colab. if I understand correctly, you're not using the GPU in colab
numpy arrays can't go on the GPU, so they'll just take up all your RAM.
yeah i am not even tho i have activated it
so that's probably the problem
ohhh
use cuda tensors. in colab.
yes you are rightt
yeah somebody suggested CUDA yesterday...do you think it is better to use my local pc (has 8gb ram) or do everything on colab?
alrightt
ill google a few videos and try it,thank youu!!
@serene scaffold can i use sklearn's train_test_split on tensors?
are you using pytorch or tensorflow
tensorflow
there's a few solutions given here: https://stackoverflow.com/questions/41859605/split-tensor-into-training-and-test-sets
this code would work locally and just detect objects I hold infront of the webcam. However, I am doing it in google colab and it just stops after 0 seconds and never shows anything. I suppose that has to do with the capture device not being accessed. How do I access my webcam from google colab with opencv?
Thank youu!! I have another question
On colab i will use the gpu for the tensors and then change the runtime to none otherwise ?
I've never actually used colab, so I'm not completely sure.
anyone?
Hey guys, I have two concerns in the graph. Could someone guide me.
1. How do I make the years (x-axis) tilt when I'm using plt.stackplot?
2. What scale should be used to make the values on the y-axis more interpretable?
I am trying to make a neural network that can predict the rotation of a chair. I have right now 18 different categories that each contain 1.7k chair images of different degrees ( so 1 category contains 1.7k images of chairs rotated 20 degrees, another for 40 degrees and so on)
Does anyone know some CNN structure to make this work?
I have tried something here which did not work
i have an array as 512,512 how can i reshape it to 512,512,1
Try reshape.
In [19]: a = np.ones(shape=(20, 15))
In [20]: a = a.reshape(20, 15, 1)
In [21]: a.shape
Out[21]: (20, 15, 1)
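Two other spellings that should give the same result for the (512, 512) case, in case they read better. Both are plain NumPy; the array of ones is just a stand-in for the real image:

```python
import numpy as np

a = np.ones((512, 512))

b = a[..., np.newaxis]          # index with np.newaxis to append a length-1 axis
c = np.expand_dims(a, axis=-1)  # same thing as a function call

print(b.shape, c.shape)
```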
Hello all, I need help with a problem:
I have a Pandas dataframe and want to create a conditional column based on the "USA State" column. I want to use multiple conditions to yield a column with a rep name. I have created lists associated with each rep and want to write the reps name in a column if in State is in the Rep list.
df1["Rep Name"] =
["Darrin" if i in Darrin else "None" for i in df1["State"]]
This works but I want it to work with multiple names. How would I do this?
@boreal bear I'm on mobile, but the solution doesn't involve any loops. Look into masking with conditionals.
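One loop-free way to handle several reps, sketched with made-up names and states. Note this uses a dictionary lookup with `Series.map` rather than boolean masks, so it is an alternative to the masking approach, not the same technique. It also assumes each state belongs to at most one rep.

```python
import pandas as pd

# Hypothetical rep -> list of states (names and states are invented)
reps = {
    "Darrin": ["TX", "OK"],
    "Maria": ["CA", "NV"],
}

# Invert it into state -> rep so each state can be looked up directly
state_to_rep = {state: rep for rep, states in reps.items() for state in states}

df1 = pd.DataFrame({"State": ["TX", "CA", "NY", "NV"]})

# States missing from the dict become NaN, then get the "None" placeholder
df1["Rep Name"] = df1["State"].map(state_to_rep).fillna("None")
print(df1)
```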
can you do print(df.head().to_dict('list')) and copy and paste the text into this chat?
For a boxplot, your whiskers won't be exactly on (q1 - 1.5 * iqr, q3 + 1.5 * iqr), since, otherwise, boxplots would always be symmetric. Instead, each whisker is on the max/min value which is not an outlier. An outlier, in this case, is defined as anything below q1 - 1.5 * iqr or above q3 + 1.5 * iqr.
(Also, I think your values for q1, q2, q3 may be off, I get slightly different ones using the same data and functions.)
q1: 3.2
q2: 3.4
q3: 3.675
IQR: 0.475
Whisker bounds: (2.9, 4.2)
Which means, in this case, the bounds should be at 2.9 and 4.2 respectively, which seems to be the case in the plot.
Refs:
https://www.simplypsychology.org/boxplots.html
https://github.com/mwaskom/seaborn/blob/e04b07eb3df135511e71e556c2bd34ef59ba08ba/seaborn/categorical.py#L1288-L1294
seaborn/categorical.py lines 1288 to 1294
```python
def draw_box_lines(self, ax, data, support, density, center):
    """Draw boxplot information at center of the density."""
    # Compute the boxplot statistics
    q25, q50, q75 = np.percentile(data, [25, 50, 75])
    whisker_lim = 1.5 * (q75 - q25)
    h1 = np.min(data[data >= (q25 - whisker_lim)])
    h2 = np.max(data[data <= (q75 + whisker_lim)])
```
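The same calculation can be tried standalone. This is a small sketch on made-up numbers (not the iris data), mirroring the percentile and whisker logic in the snippet above:

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30], dtype=float)

q25, q50, q75 = np.percentile(data, [25, 50, 75])
whisker_lim = 1.5 * (q75 - q25)

low_fence = q25 - whisker_lim   # theoretical lower limit
high_fence = q75 + whisker_lim  # theoretical upper limit

# The whiskers stop at the most extreme data points still inside the fences
h1 = np.min(data[data >= low_fence])
h2 = np.max(data[data <= high_fence])
outliers = data[(data < low_fence) | (data > high_fence)]

print("fences:", low_fence, high_fence)
print("whiskers:", h1, h2)
print("outliers:", outliers)
```

Here the fences land at -3.5 and 14.5, but the whiskers are drawn at 1 and 9 because those are the last real data points inside the fences.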
Hey guys, How can i plot the two values on the yaxis as shown in this figure?
I'm having some trouble appending DFs. I get the error Reindexing only valid with uniquely valued Index objects. The dfs have multiindex (date, company identifier) and the combo is unique. Date will repeat, company identifier will repeat, but the combo will not. I cannot reindex the dfs as I need the index values. I have a sample in a kaggle notebook that I can share. The dfs share some columns, but each have unique columns as well.
Are you sure you're not trying to concatenate the two?
I had 130+ columns, didn't realize that some column names are duplicated. That's why it didn't want to concat. Now I have to figure out and name columns with a unique identifier. Thank you for your response.
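In case it helps anyone hitting the same error, a quick sketch for spotting repeated column names and giving them unique ones. The frame and the `_2` suffix scheme are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["price", "volume", "price"])

# Names that appear more than once (each repeat after the first is listed)
dupes = df.columns[df.columns.duplicated()].tolist()
print(dupes)

# Rename repeats by appending a counter: price, volume, price_2
seen = {}
new_names = []
for name in df.columns:
    seen[name] = seen.get(name, 0) + 1
    new_names.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
df.columns = new_names
print(df.columns.tolist())
```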
maybe try new activation functions (eg. softplus, GELU, swish, mish, phish)
Also seems like instead of separating it into completely separate classes, you would have it predict a single value which is the rotation
It should learn the rotation of the chair, instead of which "class" it belongs to
Hi have you found solution to this?
That way it would also be easier to generalize it to more rotation values
Does anyone know if its a good idea to normalize all my data per column at the start or should I normalize them after I split them into train/test data
I am not sure how it works
You want to normalize all the data the same way
If you normalize the test data differently, the model won't understand what it's looking at
Okay thank you very much, I am doing a ml project and its my first one so I have a lot of questions. I really open and close tabs the last 10 hours.
Do you think I should normalize the target variable too
Idk
When I do my preprocessing, I make a pipeline and fit my normalizers / scalers / whatever only on training data. If my training data is representative, then my test data should scale the same way with the same parameters.
If you fit your scaler with test data, that's using test data for training, which isn't an excellent practice.
I see so what I did is I normalized all my data between 0 and 1 and then splitted them
So in a way the test data are affected by the normalization
But only if they do contain the maximum/minimum value
When making a model, you're probably never going to want to use your test set (or validation set, in the case of CV) for any preprocessing.
I honestly don't think it would affect too much of the preprocessing, but looking or manipulating the test data in the parameter fitting / training step is not a good practice to get into.
So, if you want to do some kind of normalization, it should be like this:
- Split data into train/test (or train/validation for CV).
- Fit your normalizer / whatever on your training data.
- Make the model and train it on the training data.
Now you have two things: your normalizer and your model. When you score the test data, you want to put it through both. This is called pipelining.
See: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
See the example there. They do a StandardScaler and an SVC (which is a type of model). Then they fit that, which tells the scaler how to scale everything and the model how to classify. Then it scores on the test set, running the testing sets through the pipeline.
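Roughly what that looks like end to end. This is a sketch along the lines of the scikit-learn example, with synthetic data from `make_classification` standing in for a real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data: 200 rows, 5 features
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 1. Split first
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. + 3. Fit the scaler and the model on the training data only
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])
pipe.fit(X_train, y_train)

# Scoring pushes the test data through the already-fitted scaler, then the model
score = pipe.score(X_test, y_test)
print(score)
```

The test rows never influence the scaler's mean and standard deviation, which is the point of fitting inside the pipeline.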
I see so I should split my data first
I havent checked cross-validation yet but I think I have to do it for my project
IMO, best practice when you get data is kind'a check it out and get a test set / validation set ready at the beginning.
• Preprocessing: normalization
• Learning: training, cross-validation
• Diagnostics: testing, accuracy, loss
I need to do all these stuff for multiple linear regression
It's all good, you still should split with CV to get a training set and a validation set, but some people (for whatever reason) don't use a validation set. Validation sets are pretty much just "test sets" for CV. At least, they're used in the same way.
Yep. So, you'd pretty much follow something similar to the pipeline example here. Then at the end, to test, you'd use whatever the metrics you want, here: https://scikit-learn.org/stable/modules/model_evaluation.html
I am not sure if I need to split validation though, my data is time-related its a stock price
And I have read in the course that i need to get train data to be older
It's up to you. Some people split for CV, some people don't.
CV is cross validation?
Yes, Cross Validation.
For time-series, this can get a bit tricky. I'd be interested to know how some of the DS people in here handle train/test or CV on their time-series. There's many ways to do it.
I use older data for train and newer data for test, but I also normalize with respect to trend and seasonality first. But some people do the opposite thing: they don't normalize and they test older, train newer.
I've read both sides claim their way was better so, you know, who knows.
I see I am doing this for a course and prof says train to be older so I guess I wont have a problem with deciding :p
Haha, yeah, I honestly think it makes more sense, but they're both probably fine ways to do it.
But I have seen some tutorials where they normalize before splitting I am still not sure
When you say normalize, in this case, for time series, what normalization are you applying?
date isnt normalized its just works like index, all the prices and volume is normalized
its open/close/max/min
Like, trend + seasonality, or minmaxscaler?
Oh, got'cha.
Ehhh. I would probably still split first and do it all with a pipeline, but if it's a course it's prob not a big deal if you normalize first, if that makes things significantly easier for you.
can i use the same minmaxscale to scale multiple columns and then reuse the multiplier of each column to the test data to normalize them with the same "multiplier"?
Yeah, that's what you usually will do in timeseries stuff. You'll scale on a lot of train data, then you'll apply that to the test data as well when you score that.
so the project states that i need to do cv, should i split the data to 70-10-20 train/val/test?
I'm not sure how good minmax scaling on stock data will be (as it is notoriously non-seasonal and variable w/rt short-term trends).
You won't need a test set anymore for CV, just the number of splits to use.
Learning: training, cross-validation
so if i want to do these 2 things I will need to resplit the data?
The easiest way might be to do something like... Make a training set and a test set, something like 70-30. Then you can use that for training without CV. For CV, you can just plug in the training set, and use the test set as your validation set.
That way you're still splitting up data nicely, but you only have to split it once and not worry about it too much.
Oh so I will never need the 3 sets in for the same process
okay
I havent read the cv thing yet so thats why i have questions
Yeah, no problem. Some people here might do things differently, so they might have good advice as well.
I am trying to store the fitted values from the minmaxscaler so i can simply transform train data
Q for you NN People. I started to look into NNs --- neat stuff! But I wanted to know your workflow. When you get data and a task where you want to use a NN, do you build it up yourself, or look for pre-built ones, or what's the deal? I feel like it'd be annoying to build it up all the time, but there are "recipes" to make various types, so, you know, who knows. I'm interested in your take.
do you guys need to know math for ml?
If you're using sklearn, you can do something like:
minmaxscaler = MinMaxScaler()
minmaxscaler.fit_transform(training_data_stuff)
and then the next time you wanna use it (like on the test data) you can do
minmaxscaler.transform(test_data_stuff)
The "fit" part is the part that "saves" the fitted values, so you only need to call transform on the stuff.
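A slightly fuller sketch of the same thing with an older/newer split, since the data here is a time series. The numbers and the two columns ("price", "volume") are invented; rows are assumed to be in chronological order:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up [price, volume] rows, oldest first
data = np.array([
    [10.0, 100.0],
    [12.0, 150.0],
    [11.0, 120.0],
    [14.0, 200.0],
    [16.0, 90.0],
])

split = 4  # first 4 rows (older) for training, the rest for testing
train, test = data[:split], data[split:]

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns each column's min and max from train only
test_scaled = scaler.transform(test)        # reuses those values, column by column

print(scaler.data_min_, scaler.data_max_)
print(test_scaled)
```

One scaler handles all columns at once and keeps a separate min and max per column. Test values can land outside 0 to 1 (here 1.5 and -0.1) when the newer data goes beyond the training range, which is normal with this approach.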
Depends on what you want to do with it. I'm a strong believer that one needs at least a basic understanding of calc, linear algebra, and stats to have a career in DS, but to do projects or even to do more of the software-engineering side, you may not need to know as much.
Do you think I should normalize the target variable too?
I think thats where errors were popping from
cuz I didnt this for y_train/y_test too with the same scaler
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
also should i ignore this warning?
ah thats good to hear
I'm not sure, but everything you do for the training set you should do for the test set as well. Yeahhh, that warning is saying you have something like,
df[row_indexer][col_indexer] = value
or something somewhere (assigning into what may be a copy of a slice of another DataFrame), and it wants you to use
df.loc[row_indexer, col_indexer] = value
instead. Something like that. The code often still runs with that warning, but the assignment may not land where you expect, so I try to fix it when I see it.
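A small sketch of one common way to avoid the warning when scaling a subset, assuming the subset is meant to be its own table. The column names and values are made up:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 12.0, 11.0, 14.0],
    "volume": [100, 150, 120, 200],
})

# An explicit copy makes `train` an independent DataFrame, not a view of `df`
train = df[df["volume"] > 100].copy()

# Assign through .loc with both a row selector and a column name
train.loc[:, "price"] = train["price"] / train["price"].max()

print(train)
print(df)  # the original frame keeps its raw prices
```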
hello, how can i use Conv2D on numpy arrays]
i want to apply Conv2D on X_train before passing it to my model
You need to know Math and Statistics actually. It's not really much of a big deal since you don't need to know the entirety of both subjects before getting started with ML.
Heyy,Will my cpu crash if i convolve tensors on it??
I ran this code and the pc started hanging soo much
X_train = tf.nn.Conv2D(X_train)
I think it depends on what you're working on. Thanks to Transfer Learning for actually being a thing. It saves time and makes the modelling process a lot easier and faster.
Idk bruh 😀... If it's notoriously hanging then it's a sign you might be over loading it.
If you perceive the workload is too much for your machine, you might wanna use Colab.
What's the RAM size of your PC?
Do you have GPU?
Yes it hangs notoriously, the mouse pointer stops working
My ram size is 8 gb ,i have gpu but it's AMD so i cant use CUDA
I wanna use Colab but it keeps crashing cause it runs out of ram
hey, is anybody here...who can help me with pandas dataframes
i am willing to concatenate 3 dataframes below each other vertically ... using pandas
Thanks, i did it myself 👍 ignore_index=True was needed to be given as parameter to pd.concat for natural behavior...
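For reference, a tiny sketch of what `ignore_index=True` changes, using three made-up frames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})
c = pd.DataFrame({"x": [5, 6]})

stacked = pd.concat([a, b, c])                        # keeps each frame's own index
renumbered = pd.concat([a, b, c], ignore_index=True)  # fresh index for the result

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Without the flag the row labels repeat (0, 1, 0, 1, 0, 1); with it they run from 0 to 5.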
great !
@odd meteor just tell that, Are gonna text something releated to mee ? .... cause im waiting for you as you're typing...XD
At this point, you might wanna consider purchasing the paid version of Colab 😀
On a more serious note, isn't there a way to switch to GPU instead of using CPU? I think there should be a way.
My PC's GPU is Iris XE. So I can't utilize CUDA either although I have thunderbolt4 port to connect eGPU.
If it's any consolation, not everyone with Nvidia GPU can afford to train heavy/deep NN on their pc. Most people still utilize colab.
😊
Ooh, i thought. You would text something related to pandas merging data frames. that is why i was waiting for your text... *i know im very stupid
No you're not. I just figured you've already resolved the issue ✌️
btw, do you use PyTorch or Tensorflow.. ?
TensorFlow but currently learning PyTorch
i was in this stage 5months ago
i think pytorch is better with RNNs than tensorflow
To me, no framework is superior to the other. TensorFlow is a lot easier, that's why I picked it first. Ooh, Keras is fun as well.
The paid version of colab allocates 25 GB of ram will that be sufficient for a dataset which is of size (3390,512,512,3) for transfer Learning?
Honestly, I have no idea. I haven't used the paid version yet. Others who have used it might have a better answer.
Alrightt, thanks😁
hello
I'm trying to read these irregular tables into a dataframe
I dont know how to go about it
please help ~
it has like.. row data and column data
help plox
```svm_model_linear = SVC(kernel='linear').fit(x_train, y_train)```
.
ytrain is one hot encoded (it contains 1 among 3 values)
.
But while running it gives error
```y should be a 1d array, got an array of shape (105, 3) instead.```
.
How to fix this?
does it though ? Where are the column headers ?
the things you see in row 3 in the first table and row 10 in second table
they are column headers
then I'd start with deleting lines 1, 2, 8 and 9 if they're not relevant
not all GPU support CUDA. Disk space is not the same as memory. Disk space is your HDD or SSD while memory is your RAM or the GPU's RAM. If you run out of ram it can be that the computer uses your HDD/SSD as ram which is very slow.
In any case, make sure you understand why you run out of memory. E.g. if your network has an input size of, let's say, 10 numbers (doubles) and you have a batch size of 1000, then you can compute approximately how much memory you need: 10 * 1000 * 8 bytes = 80,000 bytes, since a double is 8 bytes (64 bits).
Now you will need a bit more memory since you might store some things. Coding mistakes can lead to you running out of memory. Maybe you have too much input => out of memory. Make sure you understand why you run out of memory.
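That estimate is easy to script. A rough sketch, where `array_bytes` is a helper name I made up, and float32 (4 bytes per value) is an assumption for the image array mentioned earlier:

```python
def array_bytes(shape, bytes_per_value):
    """Approximate memory needed for a dense array of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value

# The example above: batch of 1000 samples, 10 doubles each
batch = array_bytes((1000, 10), 8)
print(batch, "bytes")

# The (3390, 512, 512, 3) image dataset, assuming float32 values
images = array_bytes((3390, 512, 512, 3), 4)
print(round(images / 2**30, 2), "GiB")
```

The second number comes out just under 10 GiB for the raw array alone, before any copies made during preprocessing or training.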
@wicked grove
what I use is pytorch's subset.
# Read samples dataset
samplesDS = SamplesDataset(args.samples_file,
device=args.device)
# Make a 80/20 split for training/eval data
k = len(samplesDS)
train_indices = np.arange(0, int(k * 0.8), dtype='int')
validation_indices = np.arange(int(k * 0.8), k, dtype='int')
TrainDS = torch.utils.data.Subset(samplesDS, train_indices)
ValidDS = torch.utils.data.Subset(samplesDS, validation_indices)
(I don't shuffle my data before splitting in this example)
Not sure if it's the best way though.
the error is clear. x_train is probably something like [1,2,3,4] while y is something like [[1,2,3],[4,5,6],[6,7,8],...]
Just make them the same dimension. You probably do your one hot encoding wrong.
they are relevant.. it's weird because it's like the field name is in one cell on those lines and the data for that field is in the subsequent cell
Thank youu!! Yess so the data i was trying to store was 10.6GiB and it was running of ram on colab as well
I'm planning on buying colab pro which 25 GB and i hope that works
Ohhh okayy, i converted my tensor back to array and used sklearn's train_test_split
Yeahh i have amd gpu and it isnt supported by CUDA
then I'm afraid I cannot help you, there are so many abbreviations that it's hard to see what it's all about
Is there any reason why you onehot encoded the response variable? Even if you had to encode the target, ensure it's still in 1D
Just like the error message reads, your response variable is supposed to be in 1d space
I got the same error,you can reshape y
i have now switched to linear regression as it would indeed make more sense. I am training some different CNN structures, but my pc is very slow so it takes a long time. I will try on a better PC soon and then i will know if one of these structures works
I am a beginner, I am working on iris dataset.
.
We have 150 rows of Sepal Width, Sepal Length,Petal Width, Petal Length and its corresponding classification ("Iris-virginica","Iris-versicolor" or "Iris-setosa")
.
I need to make a model which will predict "classification" depending on values of SW, SL, PW, PL
.
I made:
x = SW, SL, PW, PL
y = "classification"
.
Then since y is a categorical data, I encoded with y=pd.get_dummies(y).
.
Then I split x and y and tried to model. Where can it go wrong?
yo are there any library llike keras which has imagedatagenerator but instead of flipping zoom and shift i want to applica color space transformations
its not abbreviations, it's column name
let's agree to disagree 🙂
10.6 GiB shouldn't be a problem though, no? How much do you read at once?
Think about it: You basically have one loop which loops over the epochs and each epoch does training + evaluation and maybe some other minor things right? While training, you loop over those 10.6 GiB of data right? Now here you don't have to read all at once! You can read 1GiB, train it, read next 1GiB, train it. Once done with training, you go to validation and do the same.
You might have to change how you store the data though. If you store the data in one big normal file like csv or txt or whatever people use, then reading it in 1 GiB chunks won't be possible.
- How do you store the data?
- how do you read the data?
- How much data do you process at once?
E.g. my training data currently is 30gb or something like that and test data is 2x80gb
and my GPU has 3.7GB ram.
I have no idea but maybe you find something here https://pytorch.org/vision/stable/index.html ?
Since you have 3 classes in your response variable, you can simply encode the 3 species of flowers as, 0,1,2 respectively. And then do a multi-class classification
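A minimal sketch of that, with a handful of made-up rows in the spirit of the iris data. `pd.factorize` is just one way to get the integer codes; the `SL`/`SW` column names are shorthand I picked:

```python
import pandas as pd
from sklearn.svm import SVC

# Tiny illustrative sample (values are not the real dataset)
x = pd.DataFrame({
    "SL": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "SW": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
})
labels = pd.Series([
    "Iris-setosa", "Iris-setosa",
    "Iris-versicolor", "Iris-versicolor",
    "Iris-virginica", "Iris-virginica",
])

y_onehot = pd.get_dummies(labels)      # shape (6, 3): the 2D target that SVC rejects
codes, classes = pd.factorize(labels)  # shape (6,): one integer code per row

model = SVC(kernel="linear").fit(x, codes)

# Map predicted integer codes back to the species names
pred = classes[model.predict(x)]
print(y_onehot.shape, codes.shape)
print(list(pred))
```

As far as I know, scikit-learn classifiers also accept the string labels directly as a 1D target, so encoding may not even be needed here.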
Guys is Brain.js a good library for AI? I'm not necessarily gonna do super advanced stuff. Just some kaggle datasets ig?
The problem i faced was allocating that much data in numpy arrays
It ran out of ram then
Initially i was able to run 1 epoch but then it crashed later
It is extremely slow and difficult to load all the data in a numpy array idk why
The data i have is 3390,528,528,1
And then i use np.repeat to make it 3390,528,528,3 and that's when it starts hanging
How do you store the data?
npz file
do you use pytorch? keras? why npz?
tensorflow.keras
I don't know keras nor npz but basically what you do is you load the whole 10GB at once. Right? Then you triple it => you'd need 30GB of memory, which you probably don't have.
What you can do is to only read a piece of the data, feed it to your network, load new data, feed it again. You might have to store your data in chunks of npz files or just use HDF?
see e.g. https://keras.io/api/data_loading/
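A rough sketch of the "read a piece at a time" idea using plain NumPy. Assumptions on my part: the images are saved as a single uncompressed `.npy` file (as far as I know, `mmap_mode` applies to `.npy` files, not to arrays inside an `.npz` archive), and the channel tripling is done per batch instead of on the whole array. The tiny random array stands in for the real images, and `iter_batches` is a made-up helper name.

```python
import os
import tempfile

import numpy as np

# Stand-in for the real data: 20 small grayscale "images" instead of 3390 large ones
path = os.path.join(tempfile.mkdtemp(), "images.npy")
np.save(path, np.random.rand(20, 8, 8, 1).astype(np.float32))

def iter_batches(npy_path, batch_size):
    """Yield 3-channel batches without loading the whole file into RAM."""
    data = np.load(npy_path, mmap_mode="r")  # memory-mapped: read lazily from disk
    for start in range(0, data.shape[0], batch_size):
        chunk = np.asarray(data[start:start + batch_size])  # only this slice is read
        yield np.repeat(chunk, 3, axis=-1)  # 1 channel -> 3 channels, per batch

shapes = [batch.shape for batch in iter_batches(path, batch_size=8)]
print(shapes)
```

Only one batch of tripled data exists in memory at a time, so peak usage depends on the batch size rather than on the dataset size.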
no no i have around 3.55 gb of data which i load at once and then i triple it which becomes 10.6
thank youu!!:)) i will check that
how much memory do you have?
for computer vision, is it possible to take five different pictures of the same object and also train the model to understand that these five different pictures are actually the same object?
For human made objects it can be done without ML.
I am building a grading system which will grade diamonds based on the initial image clicked. I have to assume that multiple images from different angles will be best able to grade as opposed to a single image. However is it possible to tell the model that for example these five images are of the same diamond clicked from different angles?
thank you for your answer. What are the limitations of using a video? I have to assume that prices in the video would take much much longer time than processing a single or two images. '
that processing*
If you know where the diamond is relative to the camera, you can do it without ML, since it's nice and rigid, not something like a pile of mud.
I see.. Let's say we do not know whether diamond is relative to the camera. Is it possible to train a dataset say on 360° videos of 10,000 diamonds? can you ballpark along it would take?
or if this is something that is even reasonable…
A model to segment the image for the diamond and then another to assess its quality can work. Or you can do a single end-to-end model. Or some combination of various models that are experts that each find out something and finally you take that output and simply apply some fuzzy logic. There are multiple ways.
Getting the training data will probably be the most difficult part.
That is true....pretty much impossible to find a training data like that
so just that I understood clearly... It indeed is possible to tell a model when it is looking at the same thing from different angles without obtaining any kind of 3D scan or 3D data some other complicated stuff
It can be done with camera alone. 3D scans make it a lot easier though.
Got it...thanks for helping me.
Search for Photogrammetry algo or software might help
problem is that 3D scans capture voxel data but they do not capture colour characteristics... Unless I'm mistaken
Ah yes and your subject is transparent ...photogrammetry has troubles there too
yeah....
Lemme think
sure thanks
how can I answer the following question:
for (z) e.g. 1000 number of images given (x) resolution (e.g. 512x512 or 1024x1024) what amount of (y) time it would take to train a neural network using Google colab?
I think this would make it easier for me to go ahead
Not nearly as long as it takes to get those 1000 images I would think. It also depends a lot on which model / method is chosen.
If your model can perform online learning, then technically the training would take as long as it takes you get each sample, since it could learn it on the spot for each image you get.
Lets say i have 200 images at 2048x2048 resoltion of 200 diamonds...are you saying it would tkae 1 second per image to process and learn?
Standard deep learning usually can't do online learning though, so if you want to stick with that, which most people know about, then it might take a while to get a good model.
can you please elaborate what you mean when you say it depends on which model/method? I am quite new to this aspect of Python and am trying to learn more
sorry if these are too many questions...
Some methods require a lot of samples and resampling (to avoid forgetting), while others can instantly and permanently learn things (one-shot). While the latter is the ideal and what the future holds, the current most popular methods can't do it yet, or at least not very well. But also within those slower methods, some may still require more samples, or less, and more processing power / time or less.
One thing that obviously controls how long it will take is how large the model is.
But a larger model might give better results.
In this regard, datascience is more like alchemy than chemistry, more of an art than a science (despite the name), and really just requires some testing. Try some small models that don't take very long to train to get a feel for how long it might take and how well it can perform.
Predicting it upfront is really hard since not only does the choice of model and model hyper-parameters matter, but also hardware configurations, and desired results (how well does it perform).
(Also without having done a similar problem with that exact same setup, it's even harder)
Hello help me pls
got it. thankyou so much for all your responses, I think I'm starting to understand why my questions don't have a straight answer. Truly appreciate you taking the time to tell me all of thiss
how viable is it to make a dataset by separating frames of a few videos
assuming the environments / lightings were varied along with the angles
note: specifically talking about image recognition
anyone know if the Google Colab free tier has a better gpu, or is a 3080 better at training networks?
ping me plz thank you
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
df = pd.read_csv('../input/heart-failure-prediction/heart.csv')
Age = df.Age.tolist()
Sex = df.Sex.tolist()
CPT = df.ChestPainType.tolist()
RestingBP = df.RestingBP.tolist()
Cholesterol = df.Cholesterol.tolist()
FastingBS = df.FastingBS.tolist()
RestingECG = df.RestingECG.tolist()
MaxHR = df.MaxHR.tolist()
ExerciseAngina = df.ExerciseAngina.tolist()
Oldpeak = df.Oldpeak.tolist()
ST_Slope = df.ST_Slope.tolist()
HeartDisease = df.HeartDisease.tolist()
#encoding strings into integers in order to calculate the probabilities.
le = preprocessing.LabelEncoder()
Sex_E=le.fit_transform(Sex)
CPT_E=le.fit_transform(CPT)
RestingECG_E=le.fit_transform(RestingECG)
ExerciseAngina_E=le.fit_transform(ExerciseAngina)
ST_Slope_E=le.fit_transform(ST_Slope)
variables = zip(Age,Sex,CPT_E,RestingBP,Cholesterol,FastingBS,RestingECG_E,MaxHR,ExerciseAngina_E,Oldpeak,ST_Slope_E)
model = GaussianNB()
model.fit(variables,HeartDisease)
predicted= model.predict([[0,1]])
print(predicted)
I am trying to use Bayes' theorem for heart conditions. This dataset has 12 columns; one is the heart condition column. I am getting an error telling me to reshape my array if my data has a single feature. What do I do?
could you give some details on the data in the csv file as well as a paste of the full error message
Ok
https://www.kaggle.com/fedesoriano/heart-failure-prediction is the dataset
If you need like specific data I will give it to you in a second I need to get on my laptop
I was using the instructions
First ones
How can I learn Data Science and AI for free? it can also be cheap but like my wallet is struggling to survive sooo...
there are so many free sources out there like fast.ai's course and courses on coursera and datacamp
how do I create a custom dataset for image detection?
as in what kind of images go in?
I just take hundreds of pictures of the object?
hi
So, the model needs to be able to identify that certain regions of the image depict a certain thing? That kind of thing?
like in a real time video stream, it needs to be able to detect a certain image
is anyone good with data science or AI and could help me with my issue
Try doing .reshape(-1, 1) on whatever it's throwing the error on, but you should also probably do the "10 minutes to pandas" guide, since most of the manipulation you're doing can be done with dataframe operations.
i did it,it doesnt work. They are saying do that if its one feature. doesnt one feature mean one column?
detect certain images. do you mean it has to pick out entire frames, or identify things in the video?
do you know how to fix my issue?
no
Sex_E=le.fit_transform(Sex)
CPT_E=le.fit_transform(CPT)
RestingECG_E=le.fit_transform(RestingECG)
ExerciseAngina_E=le.fit_transform(ExerciseAngina)
ST_Slope_E=le.fit_transform(ST_Slope)
every time you call fit_transform, you reset the encoder, which means you can't transform instances of the same feature again.
so like I have a drone 50 feet in the air. I have to detect a big red target mat in real time and fly over. Best way of doing this?
this problem is called object detection, not image detection. see if you can find resources about it with that in mind.
ah sorry. so I did something with a pre trained ssd mobile net but now I have to train my own neural network and label and create my own dataset. any tips for creating my own dataset?
I don't know how to do that, sorry.
same, couldn’t find any tips online
You only look once (YOLO) is a state-of-the-art, real-time object detection system.
We tried that and trained
You do have to label the regions of interest for training data
In this blog we will show how to label custom images for making your own YOLO detector. We have other blogs that cover how to setup Yolo with Darknet, running object detection on images, videos and live CCTV streams. If you want to detect items not covered by the general model, you need custom training.
So you need photos of the target taken from a distance then do as above
It needs the annotated images and there are annotation tools
Get a fast machine with a good gpu on or offline
We've seen a lot of YOLO in the past few days, dang.
Also, I finished up that NN deeplearning ai course, it was pretty good. I did the basic keras/tensorflow tutorial, and that was also fun. It was nice building up little dealies and messing around with it. I didn't make anything substantial, but I did get it to classify some article text into a few subjects [sports, tech, and finance], which is better than nothing!
Thanks for recommending it to me, y'all. I feel like I know "something" about NNs now. Enough to know what to google, anyhow, if I ever need'em again.
typical for this channel
Next project: relearn PySpark junk so I don't look like a fool when I need to do it. :''']
Ugh, the worst thing about PySpark is honestly the logs. The rest is fine. The logs are just a gd nightmare. You make a typo and it's 50 pages of logs.
alas, jvm tracebacks
so are you saying I need to put this after everytime I encode something
le = preprocessing.LabelEncoder()
I would first establish that every feature (a) needs to be encoded and that (b) the LabelEncoder is the right encoder for the job
but in either case, every feature needs its own encoder, unless you never want to encode instances of that feature again.
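For reference, here is a minimal sketch of the "one encoder per feature" idea, building the feature matrix straight from the DataFrame instead of zipping lists. The tiny frame below is made up and only stands in for heart.csv, and the `encoders` dict is a name chosen here for illustration. Passing a bare `zip(...)` object to `fit` is probably what triggers the reshape error, since scikit-learn expects a 2D array-like, and a prediction input needs one value per training feature rather than `[[0, 1]]`.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Tiny made-up frame standing in for heart.csv (values are illustrative only).
df = pd.DataFrame({
    "Age": [40, 49, 37, 54, 58, 62],
    "Sex": ["M", "F", "M", "M", "F", "F"],
    "ST_Slope": ["Up", "Flat", "Up", "Flat", "Down", "Flat"],
    "HeartDisease": [0, 1, 0, 1, 1, 1],
})

# One encoder per string column, kept around so new rows can be encoded later.
encoders = {}
X = df.drop(columns="HeartDisease").copy()
for col in ["Sex", "ST_Slope"]:
    encoders[col] = LabelEncoder()
    X[col] = encoders[col].fit_transform(X[col])

model = GaussianNB()
model.fit(X, df["HeartDisease"])  # X is 2D: one row per patient, one column per feature

# A new sample needs every feature, encoded with the same fitted encoders.
new_patient = pd.DataFrame({"Age": [45], "Sex": ["F"], "ST_Slope": ["Flat"]})
for col, enc in encoders.items():
    new_patient[col] = enc.transform(new_patient[col])

predicted = model.predict(new_patient)
print(predicted)
```

Whether LabelEncoder is the right encoder for these columns is a separate question, as noted above; the sketch only shows how to keep the encoders separate and the input 2D.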
ok
yeah so the stuff I am encoding are strings and they need to be put into integers in order for them to be calculated into the probability
i used the same encoding method they used here
all of the stuff here are strings
yes. several of the features are numbers, actually.
Sex_E=le.fit_transform(Sex)
CPT_E=le.fit_transform(CPT)
RestingECG_E=le.fit_transform(RestingECG)
ExerciseAngina_E=le.fit_transform(ExerciseAngina)
ST_Slope_E=le.fit_transform(ST_Slope)
those are the only one that are being encoded
maxHR isnt there?
wait are the stuff that go here
variables = zip(Age,Sex,CPT_E,RestingBP,Cholesterol,FastingBS,RestingECG_E,MaxHR,ExerciseAngina_E,Oldpeak,ST_Slope_E)
only supposed to be the encoded ones?
I see
That is the most things I've ever seen zipped. :'']
i dont need to zip them right?
i just followed exactly what they did in datacamp
idk if I messed something up though
I'm only half paying attention, I'm sorry, Mr. Einstein, I've got to finish up a submission. I think a lot of what you want to do can be done with df operations, though.
can someone pls tell me how i can resolve this
Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
what is the code?
Guys can I have any recommendations for tutorials in AI and ML for beginners like absolute beginner
rows, columns = (3, 6)
fig, axes = plt.subplots(nrows=rows, ncols=columns)
for i, key in enumerate(df.keys()):
    plot = df.boxplot(column=key, ax=axes[i%rows, i//rows])
plt.setp(axes, xticks=[], yticks=[])
plt.show()
How can I add titles to this?
df.hist returns an ndarray
And I can't just use plt.title
What makes df.groupby([some columns]).sum().reset_index() return a zero-row df?
Hey guys
I'm trying to do something with shap
import shap
batch = next(iter(test_dl))
images, _ = batch
background = images[:100].to(device)
test_images = images[100:105].to(device)
e = shap.DeepExplainer(model, background)
shap_values = e.shap_values(images)
shap_numpy = [np.swapaxes(np.swapaxes(s, 1, -1), 1, 2) for s in shap_values]
test_numpy = np.swapaxes(np.swapaxes(test_images.cpu().numpy(), 1, -1), 1, 2)
shap.image_plot(shap_numpy, -test_numpy)
I'm getting an error -The size of tensor a (512) must match the size of tensor b (2048) at non-singleton dimension 1
This is my model:
class PredsModel(ImageClassificationBase):
    def __init__(self, num_classes, pretrained=True):
        super().__init__()
        # Use a pretrained model
        self.network = models.resnet50(pretrained=pretrained)
        # Replace last layer
        self.network.fc = nn.Linear(self.network.fc.in_features, num_classes)

    def forward(self, xb):
        return self.network(xb)
How do I solve this? Thanks
^
does anyone know a good course to learn Pyspark?
So I am using Tensorflow for making 2 classifiers, I first placed 70% data into training, then 15% data into validation and 15% in test. I used that data for training 2 classifiers. Then I create new dataset in the same way and again train two classifiers. I do that 5 times. I got accuracy for each phase of testing.
Now I would like to use a statistical test for comparing my models
What statistical test do you propose?
Hello, I've a question regarding Neural Networks. Does every Inputneuron get the whole Inputvector or just one element of it
Guys is https://www.w3schools.com/ai/ a good tutorial source for ML?
provide a sample of data that reproduces this problem
you should be able to set the title on each Axes object, otherwise use plt.suptitle to set a title for the whole figure
i think you can also invoke fig.suptitle if you want to use the OO interface
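A small sketch of both suggestions, assuming a made-up DataFrame with four numeric columns: each Axes gets its own title via `ax.set_title`, and the figure gets an overall title via `fig.suptitle`. The `Agg` backend line is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data: 50 rows, 4 numeric columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])

rows, columns = 2, 2
fig, axes = plt.subplots(nrows=rows, ncols=columns)

# axes.flat walks the 2D grid of Axes one at a time.
for ax, key in zip(axes.flat, df.columns):
    df.boxplot(column=key, ax=ax)
    ax.set_title(key)  # title for this subplot only

fig.suptitle("Box plots per column")  # title for the whole figure
fig.tight_layout()
```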
good question. i learned the basics from a coworker...
Hi everyone, in pandas.Series, how can I overload the == operator?
For example:
s = pd.Series(['abc', 'daf', 'ghi'])
if (s == 'a').any():
    print('True')
I want to overload the == operator to perform some regular expression. The result that I want is:
True
True
False
you don't need to overload the operator for that. what regular expression operation are you trying to do?
@hot slate
!e
import pandas as pd
s = pd.Series(['abc', 'daf', 'ghi'])
result = s.str.contains('a')
print(result)
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
0 True
1 True
2 False
dtype: bool
and you can add .any() to the end of that if you want a single bool.
!docs pandas.Series.str
Series.str()
Vectorized string functions for Series and Index.
NAs stay NA unless handled otherwise by a particular method. Patterned after Python’s string methods, with some inspiration from R’s stringr package.
Examples
```py
>>> s = pd.Series(["A_Str_Series"])
>>> s
0 A_Str_Series
dtype: object
```
the str accessor has tons of good stuff.
most string methods or regex functions, you can do to the whole series via the str accessor.
any time 💚
So I am using Tensorflow for making 2 classifiers, I first placed 70% data into training, then 15% data into validation and 15% in test. I used that data for training 2 classifiers. Then I create new dataset in the same way and again train two classifiers. I do that 5 times. I got accuracy for each phase of testing.
Now I would like to use a statistical test for comparing my models
What statistical test do you propose?
If that's important I have maybe 850 photos per class
Hello i have a timeseries type date i want to extract the year only, how can i do this ?
df['Date'].dt.year
Another sweet accessor, dt.
Do you know maybe answer for my question? I remember you previously commented something about deep learning on my post
What statistical testing would you like to do?
guys can someone answer this please https://stackoverflow.com/questions/70684946/the-size-of-tensor-a-512-must-match-the-size-of-tensor-b-2048-at-non-singlet
"This model is statistically more accurate in some regard to this other one."?
I would like to compare my two models, I am not sure what statistical test would be best
Thing is its not a datetime value its a time series
I'm trying to make a bot to parry a kick in a game. The bot needs to be able to recognize when it will be hit by the kick in real time, how should I approach this?
what do you mean by "a time series"
post your data, or a sample thereof
or at least tell us what the dtype is and give some example values
heres the dataset
No DMs please, keep everything public.
Ok
In re: to statistical tests, you've got a few models and you want to compare them. So, you want to kind of say, "This one has X accuracy (or whatever), and this one has Y accuracy (or whatever), so this one is better." That kind of thing? There's a large number of ways to do this sort of thing, so I'm trying to narrow down what you want.
what type is that column currently? are those all strings? because strings aren't proper datetimes.
!docs pandas.to_datetime
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix', cache=True)
Convert argument to datetime.
you'll have to use this to parse them out. here's the docs for the format= mini-language: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
Yeah in my case are two models
Remind me which type of classifier you've decided to use on them?
One is my custom made CNN other is InceptionV3 with transfer learning
so, they are strings. proper datetimes are stored as numbers.
the pd.to_datetime function will help you.
The usual way that I know of (maybe someone else in here knows more) to test is called McNemar's test. It's a chi-square test which compares error rates in the two models.
There's some other ones, but they're usually fairly specific.
it might even work if you just do pd.to_datetime(df['Date'])
so you'll have to provide the format=, which is where you say what each part of the string means.
oh
so it worked, you just have to write it back to the dataframe
market_cap_doge['Date'] = pd.to_datetime(market_cap_doge['Date'])
McNemar's test is given here: https://en.wikipedia.org/wiki/McNemar's_test and it might help you out. There are other tests --- like t- and z- score tests for some models, but the recent criticism iirc has been that they violate basic assumptions of t- and z-, so the recommendation was to not use them unless literally nothing else would work.
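As a rough sketch of what McNemar's test computes, assuming both models were evaluated on the same test samples: count the samples where only model A was right (b) and where only model B was right (c), then compare (b - c)^2 / (b + c) against a chi-square distribution with one degree of freedom. The labels and predictions below are made up, and this is the plain statistic without a continuity correction.

```python
from scipy.stats import chi2

def mcnemar(y_true, pred_a, pred_b):
    """Return (statistic, p_value) for McNemar's test without continuity correction."""
    # b: A correct and B wrong; c: A wrong and B correct (the "discordant" pairs).
    b = sum(1 for t, pa, pb in zip(y_true, pred_a, pred_b) if pa == t and pb != t)
    c = sum(1 for t, pa, pb in zip(y_true, pred_a, pred_b) if pa != t and pb == t)
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree in correctness
    statistic = (b - c) ** 2 / (b + c)
    p_value = chi2.sf(statistic, df=1)  # upper tail of chi-square with 1 dof
    return statistic, p_value

# Made-up labels and predictions for 8 test samples.
y_true = [0, 1, 2, 1, 0, 2, 1, 0]
pred_a = [0, 1, 2, 1, 0, 2, 0, 1]
pred_b = [0, 1, 2, 0, 1, 0, 1, 1]

statistic, p_value = mcnemar(y_true, pred_a, pred_b)
print(statistic, p_value)
```

The function only looks at whether each prediction was right or wrong, so the number of classes does not enter the calculation.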
If anyone else knows better than me, let me know, since I don't do statistical testing all that much on my models. I know, I know, it's bad.
@serene scaffold it workeddd !! thanks so much!!
@vagrant monolith pandas objects work differently from the rest of Python. most pandas functions/methods return new objects, without changing existing ones
so, pd.to_datetime returns a new Series. it won't change the DataFrame that that Series came from unless you tell it to.
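A tiny sketch of that behaviour, using made-up date strings (the format of the real column is not shown here, so `format="%Y-%m-%d"` is an assumption): the column only changes once the result is assigned back, and after that the `dt` accessor works.

```python
import pandas as pd

# Made-up date strings standing in for the real column.
df = pd.DataFrame({"Date": ["2021-01-05", "2022-03-17"]})

parsed = pd.to_datetime(df["Date"], format="%Y-%m-%d")  # returns a new Series

# The DataFrame itself has not changed yet: the column still holds strings.
was_datetime_before = pd.api.types.is_datetime64_any_dtype(df["Date"])

df["Date"] = parsed            # write the parsed values back
years = df["Date"].dt.year     # now the dt accessor is available
print(was_datetime_before, years.tolist())
```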
What you mean by -t and -z?
Ohhh i see so it's not an instance method that modifies the existing value
unless u tell it so
i see now thanks a bunch
Student t-test and Z-test, which are two of the "big" tests in statistics. https://en.wikipedia.org/wiki/Z-test It's the statistical test most stats classes will lead off with since it's fairly robust and usually pretty good.
instance methods in pandas work the way that I said. they return a new instance, without changing the one that you called the method from.
Going to eat, I already read something so I will respond later
Good luck. I don't know much about this, so maybe someone else will be able to chime in later.
@serene scaffold Oh ii get it now that makes sense
it's confusing at first, but it's actually very useful once you get used to it. it makes it easier to keep track of how your data changes through your program.
soon you will understand 😄
@serene scaffold yeaa i see how it can be helpful you don't wanna end up with modified data everytime
How can I detect how likely it is that a string has the same meaning as another string?
Like I have a sentence "The first programming language was Fortran " and "Fortran was the first programming language"
For us they say the same, I need to detect it via Python
unrelated, but does anyone know of a library for taking existing voice audio and making it sound higher or lower? one that just changes the pitch isn't sufficient as there's more to voice quality than that. it's apparently very difficult to Google for because there's too much noise (voice synthesis libraries, general audio manipulation libraries, etc.)
what kind of computer will you be running this on? you might be able to use models for sentence similarity.
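Before reaching for a model, a cheap baseline is to compare the words the two sentences use. The sketch below is a bag-of-words cosine similarity with only the standard library. It is not a semantic model and is a swapped-in, simpler technique: it scores the two Fortran sentences as identical only because they contain the same words, and it would miss paraphrases that use different words.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

s1 = "The first programming language was Fortran "
s2 = "Fortran was the first programming language"
s3 = "Pandas makes dataframes easy"  # unrelated sentence, made up for contrast

same = cosine_similarity(s1, s2)       # same words, different order
different = cosine_similarity(s1, s3)  # no words in common
print(same, different)
```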
how to find location of data in excel file using python
Can anybody support on this.
-
Randomly place 20 points within a unit square.
-
Find the two points that are closest to each other and compute their distance. Find the two points that are farthest from each other and compute their distance. Code these calculations from scratch; do not use a packaged function.
-
Repeat (1) to (2) r=100 times. Collect the closest and farthest pairs. Plot all pairs on a scatter plot, with blue points for closest pairs and red points for farthest pairs. Report the average closest- and farthest-pair distances on the scatter plot.
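The three steps above can be sketched with a brute-force loop over every pair of points. This is only one possible reading of the task: the seed, the function name `closest_and_farthest`, and the choice to skip the scatter plot are all assumptions made here, and the distance is computed by hand rather than with a packaged function.

```python
import random

def closest_and_farthest(points):
    """Return ((d_min, p, q), (d_max, p, q)) by checking every pair of points."""
    closest = farthest = None
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            dx = points[i][0] - points[j][0]
            dy = points[i][1] - points[j][1]
            d = (dx * dx + dy * dy) ** 0.5  # Euclidean distance, written out
            if closest is None or d < closest[0]:
                closest = (d, points[i], points[j])
            if farthest is None or d > farthest[0]:
                farthest = (d, points[i], points[j])
    return closest, farthest

rng = random.Random(0)  # fixed seed so runs are repeatable
results = []
for _ in range(100):    # r = 100 repetitions
    pts = [(rng.random(), rng.random()) for _ in range(20)]  # 20 points in the unit square
    results.append(closest_and_farthest(pts))

avg_closest = sum(c[0] for c, _ in results) / len(results)
avg_farthest = sum(f[0] for _, f in results) / len(results)
print(avg_closest, avg_farthest)
```

Each entry of `results` keeps the two points of the closest and farthest pair, so they can be passed to a scatter plot afterwards.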
linux server
I need it for a discord bot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as data
from tensorflow.keras.models import load_model
import streamlit as st
# the time of start and close
start = '2010-01-01'
end = '2019-12-31'
st.title('stock Trend Prediction')
user_input = st.text_input('Enter stock Ticker', 'AAPL')
#using datareader to take data, 'AAPL' is the company ticker
df = data.DataReader(user_input,'yahoo', start, end)
#describing data
st.subheader('Data from 2010 - 2019')
st.write(df.describe())
error : ModuleNotFoundError: No module named 'keras.models'
Did you say McNemar's test because in my case non-parametric test should be done? If so, why is in my case non-parametric test?
i have an issue and its troubling me since days
the issue is in this sample csv
https://paste.pythondiscord.com/nufixokoka.yaml
i need to change these following columns type
product_category_id is in string need to parse to list of integer
product_category is in string need to parse in list of String
product_ids is in string need to parse to list of string
i tried ast.literal_eval
df_explode = piv_pdp.assign(second=piv_pdp.product_ids.str.split(","))
when i executed this literal_eval code it was giving this traceback
1 [240623]
2 [286313]
3 [285627]
4 [312021]
Name: product_ids, Length: 390202, dtype: object
can anyone please help
Could someone please help, as I'm trying to use replace in my dataset but part that I'm trying to replace remains the same:
df['floor'] = df['floor'].astype('int64')
col_to_replace = ['hoa', 'rent amount', 'property tax', 'fire insurance']
for i in col_to_replace:
    df[i] = df[i].astype('string')
    df[i] = df[i].replace('R$', ' ')
df.head()
The output remains the same
Hi, I am training a model out of a sensordataset from kaggle. I guess somehow I am doing something wrong 😄
Howdy y’all
Could anyone answer a question and provide a few insights for me? It’d be greatly appreciated
I don't know why, but I assumed it was a classifier between two classes, but I'm not sure if that is the case --- if it isn't, I'm not exactly sure what test they do for multi-class modeling. And, no, it didn't have anything to do with parametric-ness.
I’m a bit mentorless and just trying to understand a project I have for my boot camp. Sorry if it’s noobish.
I don't know why, but I assumed it was a classifier between two classes
It's multi class classifier
Oof, I guess don't ask to ask is blocked. Either way, "Don't ask to ask", scoby, just ask.
Damn, fucking statistics makes my life hard...
Yeah, in that case I'm not sure what statistical test would be good. I imagine that McNemars could be extended, but I've never done it.
you have to remove the non-numeric characters, and then convert it to a numeric type.
df['hoa'].replace(r'[\$R,]', '', regex=True).astype(float)
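A short sketch of why the plain `replace` call appeared to do nothing, using made-up values (the exact formatting in the real dataset is an assumption): without `regex=True`, `Series.replace` only swaps values that are exactly equal to `'R$'`, while the `str` accessor or a regex replace works on substrings.

```python
import pandas as pd

# Made-up currency strings standing in for the real column.
df = pd.DataFrame({"hoa": ["R$2,065", "R$1,200", "R$0"]})

# Whole-value match: no cell is exactly "R$", so nothing changes.
unchanged = df["hoa"].replace("R$", " ")

# Substring replacement via the str accessor, then convert to a numeric type.
cleaned = (
    df["hoa"]
    .str.replace("R$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)

# The regex form from the message above gives the same result.
cleaned_regex = df["hoa"].replace(r"[\$R,]", "", regex=True).astype(float)
print(cleaned.tolist())
```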
I’m just starting a Data Science program and basically I have this rubric for an assignment.
Why does the accuracy sink if I train a neural network? Like with every step until almost 0 😄
are you sure you're not talking about loss?
not looking for someone to do it for me just… a bit intimidated.
I was curious how y’all might approach this.
Im pretty familiar w python and it’s our first project. I just wanted some insights from those who are more knowledgeable than myself. It’s a data science ML boot camp. This is our first assignment.
It’s a simple dataset really.
any reason you can't copy and paste the text in that screenshot? it's easier for people to help you when they're dealing with text and it doesn't take that much more work on your part.
It’s just my desktop. I’ll do that.
The first step for any good DS project is to do EDA (exploratory data analysis) and to make a bunch of graphs and things. By the end of that, you should be able to answer part 1. I'd take things step-by-step.
How can I go about making a bot to predict when something will happen via video (e.g. when a falling ball will hit the ground).
loss is pretty much inaccuracy
Description
Objective
Explore the dataset to identify differences between the customers of each product. You can also explore relationships between the different attributes of the customers. You can approach it from any other line of questioning that you feel could be relevant for the business. The idea is to get you comfortable working in Python.
You are expected to do the following :
Come up with a customer profile (characteristics of a customer) of the different products
Perform univariate and multivariate analyses
Generate a set of insights and recommendations that will help the company in targeting new customers.
Data Dictionary
The data is about customers of the treadmill product(s) of a retail store called Cardio Good Fitness. It contains the following variables-
Product - The model no. of the treadmill
Age - Age of the customer in no of years
Gender - Gender of the customer
Education - Education of the customer in no. of years
Marital Status - Marital status of the customer
Usage - Avg. # times the customer wants to use the treadmill every week
Fitness - Self rated fitness score of the customer (5 - very fit, 1 - very unfit)
Income - Income of the customer
Miles- Miles that a customer expects to run
how narrow is the scope of what it needs to predict? because otherwise it sounds like you're edging "true AI".
basically I see it live and it goes down every second it does another step
quite narrow
another example
say there was an attack in a game you needed to dodge
you already knew what attack it was you needed to dodge, and your goal was to dodge that one attack
Sorry for the text block, I’m not so worried about completing the project or anything of that sort. Just wanted insights about how y’all might approach this,
trying to learn as much as I can without being in a vacuum
@stone marlin I will read about different non parametric stat tests...can I ask you something if I don't understand something?
Thank you
it needs to:
realize that the attack is being used
find out when it will land (it already knows how long the animation is, it needs to figure out how far behind it was based on where it is in the animation)
avoid the attack
You should ask the room in general, I may not be around or asleep, and I'm not an expert on non-param stats. :'[ Someone here may be though.
Yeah, it seems that you are only one who is familiar with that
I'm only tangentially familiar with it, unfortunately, and I may not have time to do the necessary research to answer your questions.
I will skim back up and if no one answers I'll try my best.
Nice
Haha, I can DM you?
why is this so intimidating
No, no, I meant, not this time, you.
No DMs from anyone, please. Everything in public chat.
It's all good, it's one of those old-man vs younger peeps things. I'm an angry ancient guy.
For example, Student’s t-test for two independent samples is reliable only if each sample follows a normal distribution and if sample variances are homogeneous.
So I should calculate normal distribution and variance of each image?
You could do this --- I've read recently that this is violated but, honestly, I think most people still do the t-test / z-test.
Why it's violated?
Also, this two (normal distribution and sample variances) are for students tests
IIRC, it's that it's not homoskedastic, but I'd have to read the paper again.
Lemme see if I can find an example of this.
Thanks man
https://machinelearningmastery.com/statistical-hypothesis-tests-in-python-cheat-sheet/ Here's some examples of the tests. I'm not 100% sure how to do it with NNs, but with Cross Validation stuff you'll usually take the means of the slices, create a distribution with those, and test that the distributions are different using a difference-of-means.
(There is also a chi-square version but I've literally never seen it used, so idk about it.)
what do you need help with?
With .csv files
what about them?
Problems with appending
this may be the wrong channel, but I would still be happy to help
I am using pandas library
Thanks, I read several articles from that site about this topic but didn't find that one
@stone marlin Hmm, this is for numeric data in all examples
Maybe I'm misunderstanding what you're trying to test --- there are many, many potential ways to "test" models to see which is "better", and there's various things better means.
Well, in my case accuracy of particular class isn't important, so I track accuracy
I thought to set hypothesis
Wait, it's not important so you track it?
accuracy of X > accuracy of Y
English isn't my native language, sorry. I mean, I follow accuracy as metric how I compare two models
It's okay, I just was unsure what you meant.
I wanted to say I don't pay attention to accuracy of particular class
So I thought that it's valid to set hypothesis as model X has greater acc than model Y
Okay, so you have two models. You are looking at accuracy for model X and model Y. And you want a test to say, "This model performs better, in terms of this."
Right, exactly. So, that's exactly the test.
But, there's one big issue there. You don't have a standard deviation with just one run of the model.
You have just one value: the accuracy.
Yeah, why would I need standard deviation?
For knowing my samples follow normal distribution?
Well, right now, you've set up
H0: mean(X) == mean(Y)
HA: mean(X) != mean(Y)
I thought you noted you were going to do a t-test to test this.
Haha, I didn't know that I set up that 😄
I thought you noted you were going to do a t-test to test this.
I read that in using t-test
Oh, whoops, you did >. But same deal.
Observations in each sample are independent and identically distributed (iid).
Observations in each sample are normally distributed.
Observations in each sample have the same variance.
What does it mean that samples are independent?
The gist for hypothesis testing is you calculate a "test statistic" and then you try to see if the corresponding p-value is small or not. [I'm leaving a LOT out here, because this could cover the last third of a stats course.]
Independent samples are samples that are selected randomly so that their observations do not depend on the values of other observations.
How did you conclude that I set
H0: mean(X) == mean(Y)
HA: mean(X) != mean(Y)
Sorry, above you said >. In this case, though, it should probably be two-sided.
I thought to set hypothesis
accuracy of X > accuracy of Y
This is what you noted before.
Yeah but how does that relate to
H0: mean(X) == mean(Y)
HA: mean(X) != mean(Y)
Ah, I see, I used "mean" which threw you off probably.
I was getting ahead of myself here. The idea is that you cannot do this test for one value of accuracy. Otherwise your test is just "this is greater than this".
i'll elaborate further on this because i feel strongly about it:
hypothesis testing works by setting up a "null hypothesis". you then figure out how unlikely/unusual/rare/extreme your data is, assuming that the null hypothesis is true. if it turns out that your data is very unlikely/unusual/rare/extreme when the null hypothesis is assumed to be true, then this is taken as evidence against the null hypothesis. and we reject the null hypothesis when that evidence exceeds a pre-determined threshold.
the evidence is usually the p-value, and the threshold is the size of the test (often written as α)
Thank you, Salt, haha. I'm going to also have to step back because work is picking up in a few mins so feel free to chime in.
So how I should test this? What's your proposal?
The usual way to do it is to get multiple values for accuracy, and then you've got a set of accuracy values for each model. At that point, you have a mean and a standard deviation for the accuracy of both models.
Yeah I have multi values of accuracy actually
You can then perform a t-test (since you have the mean and stdev of accuracies for both models), and you conclude that either the means of the accuracies are the same OR they are different.
(Actually: you either reject the null hypothesis or you fail to reject it, but, in this case it's probably going to be rejected.)
Good, so you've got some values for accuracy from each model
So you get the means + stdevs from those, and with those you perform a t-test.
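A sketch of that calculation with made-up accuracy values (five per model, not anyone's real results): the equal-variance t statistic is computed by hand from the means and sample standard deviations, then checked against SciPy's `ttest_ind_from_stats`, which takes the same summary numbers.

```python
from statistics import mean, stdev
from scipy.stats import ttest_ind_from_stats

# Made-up test accuracies from 5 runs of each model.
acc_x = [0.91, 0.89, 0.93, 0.90, 0.92]
acc_y = [0.85, 0.88, 0.84, 0.87, 0.86]

n_x, n_y = len(acc_x), len(acc_y)
mean_x, mean_y = mean(acc_x), mean(acc_y)
std_x, std_y = stdev(acc_x), stdev(acc_y)  # sample standard deviation (ddof=1)

# Equal-variance (pooled) t statistic, written out.
pooled_var = ((n_x - 1) * std_x**2 + (n_y - 1) * std_y**2) / (n_x + n_y - 2)
t_manual = (mean_x - mean_y) / (pooled_var * (1 / n_x + 1 / n_y)) ** 0.5

# Same test from the summary statistics via SciPy (two-sided by default).
result = ttest_ind_from_stats(mean_x, std_x, n_x, mean_y, std_y, n_y, equal_var=True)
print(t_manual, result.statistic, result.pvalue)
```

A small p-value here would count as evidence against the null hypothesis that the two mean accuracies are equal, subject to the assumptions listed earlier.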
@stone marlin You mean this test Paired Student’s t-test?
But what if two means of two paired samples are significantly different?
I don't understand how this test relates to what I want to test - whether model X has greater accuracy than model Y
https://www.investopedia.com/terms/t/t-test.asp I don't know exactly how scipy does it, so I'd use the formulas here. I'd prob use the "Equal Variance" t-test for this.
The gist is like this: what if, by a fluke, your accuracy in model X was higher than model Y. Then model X is better right? Not necessarily. So you want to try a few different times to see if the one time you trained it wasn't a "fluke".
Like, you might have gotten lucky with data the first time and model X was really good, but it was terrible every other time.
I've got to go to a meeting, but I'd say the following: if this is an assignment, you might want to ask the teacher / TA what they want from this, there are MANY things that we could do to test it. If not, I wouldn't worry about testing right now, esp if you don't know hypothesis testing.
Good luck with meeting. I will read more about hypothesis testing
@stone marlin
The gist is like this: what if, by a fluke, your accuracy in model X was higher than model Y.
I think that's not a case in my case. What I did was to randomize data, placed 70% in training, 15% in validation and 15% in testing. Then I trained first model and tested it, then trained second model and tested it. I then again randomized data, placed 70% in training, 15% in validation and 15% in testing, trained first model and tested it, then trained second model and tested it...and I did that 4 more times, so I have 5 accuracies
(Right: that's not the case. That's what you're trying to show explicitly with this test.)
Ok, just to check it, so you propose to use Paired Student’s t-test?
I'd say that's what I'd use. Salt may know more. I will say this is not a common thing most DS people deal with --- at least in my field.
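Since both models were trained and tested on the same five splits, the accuracies come in pairs, which is what the paired test uses. A sketch with made-up numbers, assuming SciPy 1.6 or newer for the `alternative` argument:

```python
from scipy.stats import ttest_rel

# Made-up accuracies; position i in both lists comes from the same data split.
acc_x = [0.91, 0.89, 0.93, 0.90, 0.92]
acc_y = [0.85, 0.88, 0.84, 0.87, 0.86]

# Two-sided: H0 is "mean difference is zero", HA is "it is not zero".
two_sided = ttest_rel(acc_x, acc_y)

# One-sided: HA is "model X has the greater mean accuracy".
one_sided = ttest_rel(acc_x, acc_y, alternative="greater")

print(two_sided.statistic, two_sided.pvalue, one_sided.pvalue)
```

With only five pairs the test has little power, so a result either way should be read cautiously.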
(okay, now I'm really gone.)
Thanks a lot for help, man! I appreciate that!
@desert oar are you here?
dumb question, how to use numpy to convert an array of this kind to this kind?
[[255 255 255 ... 255 255 255]] -> [[[255 255 255] [255 255 255] [255 255 255]]]
so effectively for each element i do: 255 -> [255 255 255]
looks like you're trying to reshape it. try arr.reshape(1, -1, 3). It won't work if the number of elements isn't evenly divisible by 3.
when you reshape an array, -1 means "the rest".
I made you a little example.
In [8]: np.repeat(255, 9)
Out[8]: array([255, 255, 255, 255, 255, 255, 255, 255, 255])
In [9]: arr = _
In [10]: arr.reshape(1, -1, 3)
Out[10]:
array([[[255, 255, 255],
[255, 255, 255],
[255, 255, 255]]])
ty for the answer, just digesting
do you know how array shapes work?
not just reshaping them, but what the shape of an array is, in general
ok i think i tried reshape but one problem is that i don't want to have 3 elements, for each element I want to replace it with an array of size 3 that has the same number
the real question is i have an image that is being read by opencv but the image is 2 colors, white/red and it keeps being read as a grayscale and it's not being read as rgb/bgr/whatever and every time i try to mask the mask image is grayscale
Yo guys,
So I want to find the Average income for each gender in my dataset.
# Reading Data from relative directory
data = pd.read_csv("CardioGoodFitness.csv")
# Converting Gender values to integers for later use. Female = 0, Male = 1
data['Gender'].replace('Female', 0, inplace=True)
data['Gender'].replace('Male', 1, inplace=True)
ran this: then this,
data.groupby("Income")["Gender"].mean().sort_values(ascending=True)
got this:
Income
55713 0.0
65220 0.0
62535 0.0
53536 0.0
52291 0.0
...
48658 1.0
48556 1.0
31836 1.0
68220 1.0
104581 1.0
Name: Gender, Length: 62, dtype: float64
How can I consolidate this so its not ascending
or sorted, i guess i just delete the end of that command?
like this?
In [47]: np.tile(np.arange(10), (3, 1)).T
Out[47]:
array([[0, 0, 0],
       [1, 1, 1],
       [2, 2, 2],
       [3, 3, 3],
       [4, 4, 4],
       [5, 5, 5],
       [6, 6, 6],
       [7, 7, 7],
       [8, 8, 8],
       [9, 9, 9]])
Hey, is there a way to train a Neural network on hundreds on human conversations, and make it be able to respond to inputs with a human-like tone?
I would train it to input/output text, and then handle the speech synthesis separately
Well the input is text based and not speech
so, you're asking how to make a conversational chat bot? those are notoriously difficult because when humans have conversations, the things that they say are informed by a large body of knowledge and experiences. robots don't have that.
@serene scaffold yes, can I just feed my array to np.arange, like np.arange(imagearray)? that would be perfect
@lapis sequoia you can just clean it up line by line afterwards, then do string manipulation/convert to int. you can use grep, or just write a for loop that checks whether each value is a number, to build your income per person if speed isn't a concern
Exactly, but how about training it on thousands if not millions of conversations (yes I know it will take really long but it's possible) so that it gains information of the conversations
So it will compare your input with the others and see which output is most similar to the correct input, and give that which already contains the knowledge you are looking for.
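What's described here is roughly a retrieval-based bot: look up the stored prompt closest to the input and return the reply that went with it. A minimal sketch of that idea, assuming a tiny made-up corpus and plain character similarity from the standard library (`CORPUS`, `similarity` and `respond` are hypothetical names; a real system would likely use learned embeddings instead):

```python
from difflib import SequenceMatcher

# Hypothetical (prompt, reply) pairs standing in for a real conversation corpus.
CORPUS = [
    ("hello there", "hi, how are you?"),
    ("what is your name", "i am just a demo bot"),
    ("how is the weather today", "no idea, i live in a terminal"),
]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a crude stand-in for embeddings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def respond(user_input: str) -> str:
    """Return the reply paired with the stored prompt most similar to the input."""
    _, best_reply = max(CORPUS, key=lambda pair: similarity(user_input, pair[0]))
    return best_reply


print(respond("what is your name?"))
```

This only ever repeats replies it has already seen, which is the limitation raised above: it has no knowledge beyond the stored pairs.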
no, np.arange takes an integer n and returns a one-dimensional array of every integer in [0, n). you can't just pass random stuff to it.
ah i see
Hey guys! I hope I am not disturbing your convo too much c:
I am making a CAPTCHA solver with pytorch, but I can't create a dataset that has both the image/captcha and the label/target in the dataset/dataloader. Can someone assist me on this? ❤️
im crying
we all are
We can't help you with a captcha solver, sorry.
I just wanna see if the men on average are making more than the women in this hypothetical scenario
@atomic leaf keep in mind that asking for help with captcha solvers is against the rules, so don't do that in the future.
oh, yea okay. My bad! Thanks for the answer
going to make a business recommendation on marketing to men or women more based on customers income provided
i mean what's stopping you from pulling the income for men -> creating an average by cleaning up the dataset, pulling the income for women -> creating an average, then comparing?
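Note that the groupby posted earlier groups by Income and averages Gender, which is the reverse of the question being asked. A minimal sketch of the per-gender average, assuming the CSV has "Gender" and "Income" columns as in that snippet (the rows below are made up for illustration, not taken from CardioGoodFitness.csv):

```python
import pandas as pd

# Hypothetical rows; only the column names come from the snippet in the chat.
data = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Income": [50000, 40000, 60000, 70000],
})

# Group by the category (Gender) and average the numeric column (Income).
mean_income = data.groupby("Gender")["Income"].mean()
print(mean_income)
```

With this shape there is no need to convert Gender to 0/1 first; the string labels work as group keys.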
Just change the question to "needing help with image recognition" lmao
@serene scaffold is it possible to clone my image array and then read into a new array of size 3? too dumb to figure out the syntax, something like a1, a2, a3, and then answerarray[a1,a2,a3]
not size 3 but each element is size 3
so the array is currently of shape (n, m), but you want to change it to (n, m, 3), where each slice in the third dimension is the same?
hi guys. i saw a few vids about ai in games (geometry dash for example) and wondered how an ai can teach itself to play the game without even having a variable that tells it (good/bad). i'm talking about genetic algorithms
This is the solution, assuming that is the question:
In [59]: np.arange(12).reshape(4, 3)
Out[59]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [60]: arr = _
In [61]: np.dstack([arr] * 3)
Out[61]:
array([[[ 0,  0,  0],
        [ 1,  1,  1],
        [ 2,  2,  2]],

       [[ 3,  3,  3],
        [ 4,  4,  4],
        [ 5,  5,  5]],

       [[ 6,  6,  6],
        [ 7,  7,  7],
        [ 8,  8,  8]],

       [[ 9,  9,  9],
        [10, 10, 10],
        [11, 11, 11]]])
@serene scaffold so for each element X in the original array, the new array has [X X X], so not really the same because I have 2 colors
I can't answer the question effectively without knowing the number of dimensions in the array.
source image/array is (315, 375) target is (315, 375, 3)
so, aren't you repeating the array three times into a new dimension?
because that's what I just did.
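For the shapes mentioned, a minimal sketch of the same trick, assuming a stand-in array in place of the real image (only the (315, 375) shape comes from the conversation; `gray`, `stacked` and `repeated` are illustrative names):

```python
import numpy as np

# Stand-in for the single-channel image; the pixel values are made up.
gray = np.zeros((315, 375), dtype=np.uint8)
gray[100:200, 100:200] = 255

# Stack three copies along a new third axis, as in the answer above.
stacked = np.dstack([gray] * 3)

# Equivalent spelling: add a trailing axis, then repeat along it.
repeated = np.repeat(gray[:, :, np.newaxis], 3, axis=2)

print(stacked.shape)
```

Every pixel X becomes [X, X, X], so a two-valued image stays two-valued; it just has three identical channels.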
What does -1 mean here? Also, I am not sure why you use 1 for the first parameter and 3 for the last parameter?
when you reshape an array, -1 means "the rest"
Yeah I read that but couldn't understand what "the rest" would mean in that context
but that code didn't actually solve the asker's question.
the size of an array is the product of the length of each axis. when you reshape an array, the size has to remain the same. so the -1 represents whatever integer, if there is one, completes the product.
so if you have an array of size 12, you can reshape it to (2, -1, 2), and then -1 gets interpreted as 3, because 2 * 3 * 2 is 12.
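A small sketch of that rule, using the size-12 example from the message above (the variable names are just for illustration):

```python
import numpy as np

arr = np.arange(12)          # size 12

a = arr.reshape(2, -1, 2)    # -1 is inferred as 3, since 2 * 3 * 2 == 12
b = arr.reshape(1, -1, 3)    # -1 is inferred as 4, since 1 * 4 * 3 == 12

# If no integer completes the product, NumPy raises ValueError.
error = None
try:
    arr.reshape(5, -1)       # 12 is not divisible by 5
except ValueError as exc:
    error = exc

print(a.shape, b.shape, type(error).__name__)
```

Only one axis can be -1 in a single reshape call, because otherwise the missing lengths would not be uniquely determined.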
in a philosophical perspective u use -1 because you can never get to -1 (unless you go backwards/use a negative step). even if you use an arbitrarily high number you will end up getting to it at some point
ok i never knew that about numpy
I think they just picked -1 somewhat arbitrarily.
@serene scaffold I see. Thanks for explanation. Btw, are you maybe familiar with statistical tests?
like t tests?
Yes
there's a few t test implementations in scipy
Okay, so I have a few accuracies for my two models (each prediction from the same training and test dataset)
I want to make sure that the model with the greater mean is really better
What test do you propose that I use?
so you have two arrays, each representing scores from two models on the same data? you can use this: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_ind.html
this assumes that each prediction is independent.
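A minimal sketch of both variants in scipy, assuming five accuracies per model from repeated splits. The numbers are made up for illustration and are not results from this conversation. `ttest_ind` treats the two lists as independent samples, while `ttest_rel` is the paired test, which fits when each pair of scores comes from the same split:

```python
from scipy import stats

# Hypothetical accuracies, one per repeated split; not real results.
acc_model_a = [0.81, 0.79, 0.84, 0.80, 0.83]
acc_model_b = [0.78, 0.77, 0.82, 0.79, 0.80]

# Independent two-sample t-test: ignores which split each score came from.
ind = stats.ttest_ind(acc_model_a, acc_model_b)

# Paired t-test: compares the per-split differences a[i] - b[i].
rel = stats.ttest_rel(acc_model_a, acc_model_b)

print(ind.pvalue, rel.pvalue)
```

With so few runs either test has little power, so the p-value is better read as a rough indication than as proof that one model is better.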
Yes, and I don't know whether my predictions are independent and identically distributed
I read what identically distributed means
I think they are independent, because the previous or next prediction didn't influence the current prediction
What do you think? @serene scaffold
I'm not really sure; I don't have time to dive in, unfortunately
@stone marlin Are you available maybe? I don't know if my data is identically distributed, googled and still not sure
please don't ping people asking for help. if they're not actively reading the channel, assume they aren't available.
they already told you that: #data-science-and-ml message
Hmm, tbh, that doesn't make sense to me...can you explain, why I wouldn't tag someone?
Yeah but if I ask something and tag him that doesn't mean that he can just reply
it's rude to ping people to draw attention to your question. all help given here is volunteer-driven, and community members deserve to only be pinged to draw their attention to ongoing conversations that they've chosen to participate in
it's rude to ping people to draw attention to your question.
Why is it rude?
and community members deserve to only be pinged to draw their attention to ongoing conversations that they've chosen to participate in
Well, I wanted to ask him about something we talked about
because they get a notification on all of their devices, and for those who frequently volunteer to help, that gets noisy very quickly.
Please DM @sonic vapor if you have any other questions about this.
but they already told you not to do that when they said to direct your questions to the whole channel.
If what you are currently typing pertains to this, please send it to @sonic vapor
Hello, I have an existing project and I want to set up a dev environment with conda, because I'm using libraries compiled in C and what started as a relatively small project has turned into a big one. I'm unsure whether I should install Anaconda in Windows Subsystem for Linux or install it regularly on Windows. I have never used conda before, so I'm not sure which would be the better choice
i would either dual boot ubuntu or another linux distro, or install conda natively on windows, if you have gigantic projects, just to avoid the jankiness of WSL. i would recommend a native linux environment if resources are tight, since those tend to take up less RAM and VRAM if you have no GUI, compared to windows, whose own GUI consumes a lot of both
i would avoid using anaconda and stick with plain conda ("miniconda") if possible
it works fine in powershell/cmd, you don't need a VM or WSL
in general building packages is nontrivial in windows, whereas in a linux-based environment you typically have a sensible build toolchain already set up
but if you are using pre-compiled conda packages, you should be fine working directly in windows
anyone knows how to approximate/calculate the median or nth quantile of very large datasets that can't be fit into memory at once? preferably doing so in batches rather than having to iterate through each sample.
Thank you very much for both explanations. I will definitely follow your leads tomorrow when I set up my development environment. I will probably stick to installing "Miniconda" instead of Anaconda, natively in Windows, since I think that I will be using packages that are accessible with Miniconda
I have one more question though
Is my data identically distributed?
I'm planning to use jupyter notebooks to display relevant information in a user friendly manner thanks to markdown language. I have never used jupyter notebook before so I wonder how can I set up my conda environment to be able to interact with jupyter notebooks
anaconda is just conda with a bunch of stuff included by default. miniconda is conda only, it's only "mini" because it's a minimal installation; the underlying package manager is identical
it depends a little on how the notebooks will be shared/hosted/run
jupyter itself is a client/server setup
the actual code is run by a "jupyter kernel", which acts as the server
and you interact with a "jupyter frontend", which acts as the client
ipykernel (part of the ipython project) is the standard python kernel. you install this into your conda environment with all your dependencies
jupyter notebook or jupyterlab are frontends. these can run any kernel anywhere on your system. you can install them right into your project, in which case no additional setup is required. but if you are hosting this over the web, you might want to run a centralized instance (e.g. jupyterhub), in which case you will need to set up a "kernel spec" that tells the jupyter frontend how to start and connect to your desired jupyter kernel
(it can be a bit confusing because jupyter notebook is itself a server. but with respect to the jupyter protocol, it's a client)
jupyter kernel <-> jupyter notebook <-> user's browser
(conda env)        (anywhere)           (anywhere)
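To make the "kernel spec" part concrete: it is a small `kernel.json` file that tells the frontend which command starts the kernel. A minimal sketch of its contents, assuming it is run with the conda environment's own Python (the display name "my-project" is a placeholder, not something from this conversation):

```python
import json
import sys

# Command the frontend runs to start the kernel. sys.executable is the Python
# of whichever environment runs this script, e.g. the project's conda env.
kernel_spec = {
    "argv": [sys.executable, "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "display_name": "Python (my-project)",  # hypothetical label shown in the UI
    "language": "python",
}

kernel_json = json.dumps(kernel_spec, indent=2)
print(kernel_json)
```

In practice you rarely write this file by hand: running `python -m ipykernel install --user --name my-project` inside the environment generates and registers an equivalent spec.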
you can even use jupyter notebook in pycharm
which enhances the experience massively
also if you decided to stick with browsers i suggest jupyter lab instead of jupyter notebook
since jupyter lab is the way forward
it's worth being precise about the terminology: pycharm can act as a jupyter frontend/client, and it can read and edit the same file format as jupyter notebook
yes exactly
jupyter notebook files have the extension .ipynb
which stands for ipython notebook iirc
in my experience if you get a massive project jupyter notebook or jupyter lab in a browser alone would be really messy
because it lacks a lot of the IDE functionalities such as stepwise debug, auto association of variables, suggestive contexts etc
We will use it mainly during development, to see a more visual representation of the data, and also to show it to the client so that they better understand how the process works. I don't think we are planning to host it alongside the code, which will probably be installed globally as a python package on a server hosted by the client
you should look into JupyterHub
it's a way to host jupyter notebooks with user authentication
you can serve it over http and put it behind your company's domain
Wow, thank you so much to all of you for the in-depth explanations. I will be reading through the messages to make sure that I have a proper understanding of everything so that tomorrow I can start setting up the environment
And I will definitely propose the idea of JupyterHub to my organization
As noted before by both myself and now by Sterlercus, please do not ping me, it pings on all my devices.
I'm not sure if your data is iid, but you can assume it is for the sake of this problem.
I also am going to be busy for a while, I've got to finish a project fairly quickly, so unfortunately I will be unable to help for a bit.