#data-science-and-ml
any help in #help-ramen, please !
it seems like theyre helping you already
that is a very, how to put this, broad question that could go in many different directions depending on what youre looking for
id say focus on the goal: what does that look like to you?
and incrementally learning just enough ML to solve your problem, and then over time, increasing your knowledge base from there
no, i don't think they are entirely sure since they are also second guessing, so it is quite confusing
please if you have any more and actual knowledge
it won't be long !
well im on mobile but it looks like a join problem
sorry but i wont be able to answer without a laptop
the code seems to be done now its just errors im assuming
if ur able to look through that
if not then its fine
JavaLim seems to have it handled
yeah ! going good now
hello
is anyone available
they had to go, and even though its nearly done im back to needing some help with errors
please lmk if u can
sorry for asking the same questions about LSTMs, but should i be scaling my sentiment values, as they lie b/w [-1,1]
hello, anyone available to help me ?
@fresh shadow please don't flood this channel with requests for help about the same question. You may be on your own for a bit.
apologies, i have gotten help twice and both have left in the middle and so i have to pick it up all the way again to explain it to the new helper
and i am literally almost done, so just kind of ticked and desperate
its taken up a lot of my time to explain it to the new helper who has offered, only for them to then leave again
and i really need to finish this code, i am on the last step
i understand i am texting quite a lot, but i think u can also understand that if all 3 helpers have left half way, that is a lot of time gone taken for me to re explain
and not even end up at the end of/ finishing my code
we're doing minitorch for my deep learning class and i have mixed feelings about it lol. idk if anyone else has tried it.
full error ```python
File "D:\college_project\modules\train.py", line 212, in post
x_train = np.array(list(map(preprocessing, x_train)))
File "D:\college_project\modules\train.py", line 309, in preprocessing
images= grayscale(images) #convert to grayscale
File "D:\college_project\modules\train.py", line 301, in grayscale
images= cv2.cvtColor(images, cv2.COLOR_BGR2GRAY)
cv2.error: OpenCV(4.5.5) D:\a\opencv-python\opencv-python\opencv\modules\imgproc\src\color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cv::cvtColor'```
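A note on this error: the `!_src.empty()` assertion means `cv2.cvtColor` was handed an empty image. A common cause is `cv2.imread` returning `None` for a path it could not read, so some entries in `x_train` are probably not real images. A minimal sketch for finding them before preprocessing, using NumPy only (the helper names `is_usable_image` and `split_usable` and the sample data are made up for illustration):

```python
import numpy as np

def is_usable_image(image):
    """Return True if `image` is a non-empty NumPy array."""
    return isinstance(image, np.ndarray) and image.size > 0

def split_usable(images):
    """Separate usable images from the indices of missing or empty ones."""
    good, bad_indices = [], []
    for i, image in enumerate(images):
        if is_usable_image(image):
            good.append(image)
        else:
            bad_indices.append(i)
    return good, bad_indices

# Hypothetical stand-in for x_train: one real image, one failed read, one empty array.
x_train = [
    np.zeros((4, 4, 3), dtype=np.uint8),
    None,
    np.empty((0, 0, 3), dtype=np.uint8),
]
good, bad = split_usable(x_train)
print(len(good), bad)
```

Printing the bad indices should point at the file paths that failed to load, which is usually where the real fix is.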
looks like its purpose is meant to be for ML engineers who want to study its internals
just reading its description idk why you'd use it for a deep learning class
i guess they just want us to have SOME code implementation rather than just learning the pure math
which is understandable but at the same time...trying to write the backprop algo into an ML library is a pain ngl
makes me appreciate all the ML libraries i take for granted that i use regularly
meh.. you can implement backprop in plain numpy
i dont think you should waste time on numerical computing when trying to also teach deep learning
hahaha
honestly i think it was mostly the TA's agenda
the prof is the dept head
so she has tenure and all that
she mostly just teaches the math part
that bothers me
i really dont like classes where you dont learn what your peers would be learning
not that school should be a competition as such but
yeah, idk, its def something, hence my mixed feelings
I actually really like this from the POV of being able to make something new. But with the goal in mind of applying ML as quickly as possible (and maybe also for getting the bigger picture first), not so much.
This is something to be done AFTER learning Pytorch.
actually i think i wouldve liked that approach yeah
like this could be an elective/advanced class or something after doing that
but not really important
It's important for the more niche group that makes ML libraries and systems. But if you are not a systems programmer it's not applicable to you.
It depends what you do. I do crank out ML libs, but that is relatively niche.
very interesting
well, for me, since its for a grade, ill just approach it from the perspective of working on my coding + problem solving skills
It's also perhaps nice from a sort of "I could make this if it suddenly got deleted from the internet" POV.
im curious. like open source stuff or proprietary stuff?
do certain companies ask for custom made ML library functions or something?
Proprietary, but may start having some OSS.
i mean if you cant talk about it too much i understand
ah interesting
i heard a podcast today about OSS business models and sustainability/maintenance over time
Yes, for example, to make use of FPGAs, or other strange / new hardware. Or it simply does not fall within the domain of what Pytorch / TF / etc can do / handle.
it was kinda sad especially since so many devs use certain OSS
interesting interesting
"Systems programming" as people call it, although that is often associated with operating systems. I just use the term in general for systems which support some task and generally care about making that fast and stable (usually written in something like C, but in the case of ML also has bindings / a Python side to it (to allow non-systems people to interact with it and unite the systems and scientific community)).
i can see that
i'd call that "lower level" programming
my company is still relatively new so out-of-the-box stuff works fine for them
i definitely stick to "systems programming = close to hardware"
they did ask me to see if i can sit on this one meeting to improve one of their "AI products"
i heard the problem and i was like "oh. you could probably use NLP to solve this"
Systems programming is not always close to hardware. It used to be but not so much now. Low level is when using something like a hardware description language IMO. C used to be considered high-level.
Those were dark times
thats wild
If you think assembly is a pain, try programming an FPGA with a HDL.
my cousin uses C for work since he does embedded systems and gave me a mini-lecture on EE concepts and i was like "nope"

he says C IDEs are funky
i said "lower level", not "low level" for a reason 😛
OpenCL feels like a blessing then.
Anyhow, someone has to be able to make stuff like Pytorch, and while you don't need that many people doing that, it still is a thing. Should it be taught in schools? Yeah, probably, but later, if only so that if something goes wrong, like a bug, it could be fixed by them or at least give a good report of it.
Also the way things are trending, stuff like FPGAs and other strange hardware is on the rise, and important for future ML.
File "D:\college_project\modules\train.py", line 213, in post
x_validation = np.array(list(map(preprocessing, x_validation)))
File "D:\college_project\modules\train.py", line 309, in preprocessing
images= grayscale(images) #convert to grayscale
File "D:\college_project\modules\train.py", line 301, in grayscale
images= cv2.cvtColor(images, cv2.COLOR_BGR2GRAY)
cv2.error: OpenCV(4.5.5) D:\a\opencv-python\opencv-python\opencv\modules\imgproc\src\color.cpp:182: error: (-215:Assertion failed) !_src.empty() in function 'cv::cvtColor'``` can anyone help me in this ?
can Matplotlib/Pyplot do something like this? IE, plotting the opening of an angle with an arc and a bisecting line at the mid point ?
Hi everyone. If we have a 27.3% of missing values in 8 columns. Is this considered a class imbalance?
u can use 3d graphs to do that
TypeError: 'int' object is not iterable
Whats wrong with this
hi everyone!
I'm a physics undergraduate student, searching for ideas for a biophysics data analysis project
any suggestion (site or book or something)?😅
Biophysics maybe https://emedicine.medscape.com/article/320160-overview
Gait cycle Walking is the most convenient way to travel short distances. Free joint mobility and appropriate muscle force increases walking efficiency.
thanks! ill check them!
Think of using motion sensors used in VR trackers in this project
Amazon.com: vr full body tracking
Then its all data analysis
A huge problem that a lot of people would appreciate being solved is measuring gross wattage vs leverage etc whilst jogging. If you know how cyclists do it, doing it for bipeds is a real challenge.
Figure out how to use sensors to measure leg angles, leverage with the ground and such.
Is anyone familiar with a module that works with Python 3.9 that can be used for conjugating verbs in English? Feels like a lot of the modules that can do this (nodebox, pattern.en) have fallen into serious disrepair.
please do ping me if you know about anything like that. 
hmm, never mind, this seems to work quite well https://github.com/SekouDiaoNlp/mlconjug3
Hi together!
Do you have any idea where the error Columns must be same length as key could come from and how I can solve it?
It appeared after I tried to turn my last loop
def compare_row(value, index):
    return df[[strName, simRows]].apply(lambda y:
        y[simRows] + '|' + str(index + 1)
        if fuzz.partial_token_sort_ratio(value, y[strName]) > 90
        else y[simRows], axis=1)

for i in range(len(df)):
    df[simRows] = compare_row(df.at[i, strName], i)
into a pandas apply function.
def compare_row(value, index):
    return df[[strName, simRows]].apply(lambda y:
        y[simRows] + '|' + str(index + 1)
        if fuzz.partial_token_sort_ratio(value, y[strName]) > 90
        else y[simRows], axis=1)

df[simRows] = df[[strName, simRows]].apply(lambda x: compare_row(x[strName], 1), axis=1)
(Upper code works, but is slow)
(Code below doesn't work, but would be much faster)
Not really, but if you break out your lambda function into a subroutine things would be a lot easier to understand.
Changed it, thanks for making me acknowledge it.
Basically I'm comparing a column with itself with the program above.
my code here https://paste.pythondiscord.com/xoyosenimo
Traceback (most recent call last):
File "D:\college_project\modules\model_train.py", line 15, in <module>
model.add(Convolution2D(16, 3, 3, input_shape = ( 64, 64, 3), activation = 'relu'))
File "C:\Users\shubh\anaconda3\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "C:\Users\shubh\anaconda3\lib\site-packages\keras\layers\convolutional.py", line 473, in __init__
super(Conv2D, self).__init__(
File "C:\Users\shubh\anaconda3\lib\site-packages\keras\layers\convolutional.py", line 105, in __init__
super(_Conv, self).__init__(**kwargs)
File "C:\Users\shubh\anaconda3\lib\site-packages\keras\engine\base_layer.py", line 132, in __init__
name = _to_snake_case(prefix) + '_' + str(K.get_uid(prefix))
File "C:\Users\shubh\anaconda3\lib\site-packages\keras\backend\tensorflow_backend.py", line 74, in get_uid
graph = tf.get_default_graph()
AttributeError: module 'tensorflow' has no attribute 'get_default_graph'```
ping me when replying
fuzzy wuzzy string matching? we used that at work recently

Has anyone has used TF-IDF?
I have a small question: once we apply TF-IDF to our text data and set the result as the X features, should we pad the X features to a fixed length before feeding them to a neural network?
to elaborate, if we apply tfidf to a short sentence, lets say we get 20 features (columns)
but if it was a long sentence, we could get 20+ features
so is there any way i can have a fixed feature length, just so that i can have that fixed feature length set as the No of Neurons on the input layer of the NN
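On the TF-IDF question: a TF-IDF vector has one entry per vocabulary term, not per word in the sentence, so once the vocabulary is fixed every document gets a vector of the same length and no padding is needed. A minimal pure-Python sketch (the helper names `fit_vocabulary` and `tfidf_vectors` are made up for illustration, and the plain `tf * log(N / df)` weighting is a textbook variant, not necessarily what a given library computes):

```python
import math
from collections import Counter

def fit_vocabulary(docs):
    """Collect every distinct token across the corpus, in a stable order."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def tfidf_vectors(docs, vocabulary):
    """Return one vector per document, each with len(vocabulary) entries."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([
            (counts[term] / len(toks)) * math.log(n_docs / doc_freq[term])
            if counts[term] else 0.0
            for term in vocabulary
        ])
    return vectors

# Made-up corpus: one short and one long sentence.
docs = [
    "good movie",
    "a very long and rambling review of a good but slow movie",
]
vocabulary = fit_vocabulary(docs)
vectors = tfidf_vectors(docs, vocabulary)
print(len(vocabulary), [len(v) for v in vectors])
```

The input layer size would then be `len(vocabulary)` (or a capped vocabulary size), regardless of sentence length.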
Mfw you didn't just message me directly 
Work is how I got the idea, but I'm programming it in my free time as practice for starting data analysis.

Right now I'm trying to get rid of my last for loop, but I'm not sure why it keeps telling me that Columns must be same length as key.
Please always show the whole error message
!traceback
Please provide the full traceback for your exception in order to help us identify your issue.
While the last line of the error message tells us what kind of error you got,
the full traceback will tell us which line, and other critical information to solve your problem.
Please avoid screenshots so we can copy and paste parts of the message.
A full traceback could look like:
Traceback (most recent call last):
File "my_file.py", line 5, in <module>
add_three("6")
File "my_file.py", line 2, in add_three
a = num + 3
TypeError: can only concatenate str (not "int") to str
If the traceback is long, use our pastebin.
One second, I will do.
Maybe also crosstab could be an option, just saw how it works.
that isn't a 3d image though
I understand. I meant plot a 3d graph.
Traceback (most recent call last):
File "C:\Users\Adrian\PycharmProjects\ProjectMystery1\version2.py", line 22, in <module>
df[simRows] = df[[strName, simRows]].apply(lambda x: compare_row(x[strName], 1), axis=1)
File "C:\Users\Adrian\PycharmProjects\ProjectMystery1\venv\lib\site-packages\pandas\core\frame.py", line 3645, in __setitem__
self._set_item_frame_value(key, value)
File "C:\Users\Adrian\PycharmProjects\ProjectMystery1\venv\lib\site-packages\pandas\core\frame.py", line 3775, in _set_item_frame_value
raise ValueError("Columns must be same length as key")
ValueError: Columns must be same length as key
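A note on this ValueError: `compare_row` returns a whole Series (one value per row of `df`), so calling it inside `apply(..., axis=1)` produces a DataFrame with one column per row, and pandas cannot assign that to the single column `df[simRows]`. One way around it is to have the per-row function return a single string. A rough sketch, with assumptions flagged: `difflib.SequenceMatcher` from the standard library stands in for `fuzz.partial_token_sort_ratio` (so scores will differ from fuzzywuzzy's), and the column names and sample data are invented:

```python
from difflib import SequenceMatcher

import pandas as pd

def similar(a, b, threshold=0.9):
    """Stand-in similarity check; NOT the same scoring as fuzzywuzzy."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold

def similar_rows(value, names):
    """Return the 1-based positions of all similar names as one string."""
    return "|".join(
        str(j + 1) for j, other in enumerate(names) if similar(value, other)
    )

# Hypothetical data standing in for df[strName].
df = pd.DataFrame({"name": ["Acme Corp", "acme corp", "Globex", "Initech"]})
names = df["name"].tolist()

# The function returns a scalar per row, so the result is a plain Series
# that fits into a single column.
df["similar_rows"] = df["name"].apply(lambda v: similar_rows(v, names))
print(df)
```

This still does one comparison per pair of rows, so it mainly removes the repeated column assignment rather than the quadratic cost.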
But now I saw it, it might be smarter to not work with a double apply, but instead with crosstab.
Is there some kind of way or a similar function to do this (found it in the internet), but without the aggregation?
def cross_fuzz(df):
    ct = pd.crosstab(df['strings'], df['strings'])
    ct = ct.apply(lambda col: [fuzz.ratio(col.name, x) for x in col.index])
    return ct
no no, that's what I meant. or 3d graph can plot 2d images
METRICS = [
keras.metrics.TruePositives(name='tp'),
keras.metrics.FalsePositives(name='fp'),
keras.metrics.TrueNegatives(name='tn'),
keras.metrics.FalseNegatives(name='fn'),
keras.metrics.CategoricalAccuracy(name='accuracy'),
keras.metrics.Precision(name='precision'),
keras.metrics.Recall(name='recall'),
]
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=METRICS)
this the right way to get evaluations ?
btw by building a model
model = Sequential()
model.add(Conv2D(16, (3, 3), padding='valid', activation='relu', input_shape=(144,144,3),name='input_tensor'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), padding='valid', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(64, (3, 3), padding='valid', activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(512, activation='relu'))
model.add(Dense(256, activation='relu'))
model.add(Dense(train_generator.num_classes, activation='softmax', name='output_tensor'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=METRICS)
their initial weights are random right?
is there a way i can create a 2 models with same initial weights? like a perfectly identical model?
because ill try to train both using different kind of dataset but contains the same classes
you were asleep, it was middle of the night US time.
do feel free to make a suggestion though

I wouldn't have actually known anyway
I mostly deal with getting information out of text than creating it.
finally i made my code
Hello, I am trying to take the correlation coefficient here using pandas of the 2nd two columns of data within each row for each corresponding country
For instance, for however many Afghanistan's there are I would take the correlation coefficient of (0.14, 0.222), (0.19, 0.183), (0.24, 0.24), and so on and have a number dedicated just toward that country
If I want to go down the whole list all the way to the end is there a way I can do that to only take the CC of certain row indices? Any help would be greatly appreciated.
@snow helm so you want to group by country, and then treat the left column as x and the right column as y, and calculate the correlation coefficient?
from scipy.stats import pearsonr
df.groupby('Country').apply(lambda d: pearsonr(d['x'], d['y']).r)
something like that.
though it's not clear from your screenshot what data type this is. it looks like you actually have a numpy array.
Yes that is exactly what I am trying to do but still struggling a tad @serene scaffold
I want a correlation coefficient of both columns for each country
This is my code
data = pd.read_csv('/users/rchan/Desktop/BCYear3/Data_Science/Homework/Homework4/owid-covid-data.csv')
country_stats2 = data[["location", "people_fully_vaccinated_per_hundred", "new_deaths_smoothed_per_million"]]
country_stats2.dropna(subset = ["people_fully_vaccinated_per_hundred"], inplace=True)
country_stats2.dropna(subset = ["new_deaths_smoothed_per_million"], inplace=True)
print(country_stats2.to_numpy())
columnA = country_stats2["people_fully_vaccinated_per_hundred"]
columnB = country_stats2["new_deaths_smoothed_per_million"]
@snow helm try making it a dataframe with columns named Country, x, and y, and then do the thing I showed.
so do not convert to numpy then
no
so right now it looks like this not in numpy form
now designate column x to the first variable and y to the second?
I believe this is what you are saying but I'm getting a keyerror saying columnA or 'x' in your case doesnt exist
KeyError: 'columnA'
Sorry still learning in my intro to ds class so im a bit lost on the proper syntaxing
your syntax is fine. the problem is that you used a column name that does not appear in the dataframe
but i designated column A to that literal column?
so you get the same result as if you had looked up a dictionary key that's not there
"people_fully_vaccinated_per_hundred"
nope, it has to be based on d in the lambda
assigning a variable doesn't change the names of df columns
So I would have to assign the actual column names in the lambda
by replacing columnA and B to the name of the actual thing
just use the whole name of the column.
let me think
No problem
I appreciate the help by the way
Thank you very much
country_stats2.groupby("location").apply(lambda d: pearsonr(d['people_fully_vaccinated_per_hundred'], d['new_deaths_smoothed_per_million']).r)
just remove the .r and see what happens
I thought it would return a named tuple, not a regular tuple
ValueError: x and y must have length at least 2.
but I was wrong. just like I always am 
okay, so there must be countries that only have one row
can you do country_stats2['location'].value_counts() so we can see?
Quite possibly the list is like 10k rows
sure
Looks like there exists at least 1 with a count of 1
and here I was, thinking that the Solomon Islands were innocent
ahaha
so, you'll have to drop rows for countries with only one row
I dont think you can even take a proper CC with 1 row of data anyways
Not enough data points
yeah, that's what the error from before was telling you 
sorry im a newb 😦
that's okay 🙂
how exactly would i drop the guilty solomon islands from my list?
or those with count of 1
I'm reading a book about neural networks right now in another tab, so I'm someone else's noob.
From the looks of a grouping by count on my own it looks like there might be more with just 1 instance so ill have to eliminate all of those somehow
Would it be something like this?
df = df[df.groupby('ID').ID.transform(len) > 1]
Found that on stack
thinking
I mean you can try that 😄 and then just chain .groupby("location").apply(lambda d: pearsonr(d['people_fully_vaccinated_per_hundred'], d['new_deaths_smoothed_per_million'])) onto it, without writing over any variables
i mean it looks like it possibly worked?
what was the result
wooo
country_stats2 = country_stats2[country_stats2.groupby('location').location.transform(len) > 1]
though you want to be careful about writing over df variables, since you lose whatever it was before
Would it work if I kept the df variables?
Because I named my data frame as something else
As shown with 'country_stats2'
depends on what calculations you're doing
Ah
So it looks like it runs now the line you provided as well with lambda but I'm not seeing where my coefficients are exactly
Or I might just be clueless 🤣
dataframe operations mostly create new dataframes, rather than modifying existing ones. so if you don't print it, it's just gone
oh so I would have to print right away
yes, or give it to a variable
is this everything you ever wanted?
I would think the CC would be just one number but i guess its of each column
And what would the right represent
oh wait
CC is between -1 and 1
im a bozo
The last part I just need to accomplish now is to put all of the CC's in a 1d array to take stats of it
:incoming_envelope: :ok_hand: applied mute to @rich kestrel until <t:1645384841:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
oh boy
Hmmm so I gotta find a way to iterate each country and just pull the left number
or just put [0] at the end of the lambda to get the first element of the tuple
lambda d: pearsonr(d['people_fully_vaccinated_per_hundred'], d['new_deaths_smoothed_per_million']) is the lambda
No solomon island indeed!
got my median number available to me as well with .describe()
and now just gotta do a notched box plot and i am done 😃
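The steps worked out above (drop countries with a single row, compute one correlation per country, then summarise) can be sketched end to end. This is only an illustration under stated assumptions: the data is made up, the short column names `vaccinated` and `deaths` stand in for the long OWID ones, and pandas' own `Series.corr` (Pearson by default) is swapped in for `scipy.stats.pearsonr` so each group yields a single number directly:

```python
import pandas as pd

# Made-up data; country "C" has only one row and must be dropped.
df = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B", "C"],
    "vaccinated": [10.0, 20.0, 30.0, 10.0, 20.0, 30.0, 5.0],
    "deaths": [3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 9.0],
})

# Keep only rows whose country appears more than once.
group_sizes = df.groupby("location")["location"].transform("count")
enough_rows = df[group_sizes > 1]

# One Pearson correlation coefficient per country.
correlations = (
    enough_rows.groupby("location")[["vaccinated", "deaths"]]
    .apply(lambda d: d["vaccinated"].corr(d["deaths"]))
)

print(correlations)
print(correlations.describe())  # summary stats over all countries
```

Because the result is assigned to `correlations`, it stays available for `.describe()`, `.to_numpy()`, or plotting, without overwriting the original dataframe.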
Has anyone seen this use of a semicolon in math notation?
I think this is basically the point? you get a vector that shows the distance between the prediction and the desired value for each item, and then the cost is the average.
yeah, i've seen it, let me dig up what it means
so I assume the colon is just separating arguments
ah right, separating variables and parameters
instead of a comma
i was wondering if somebody had a good explanation of cache awareness and out of core computation and how these improve the performance of xgboost
does anyone want to build an ai in a discord bot????
Yes, (args; parameters).
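The image from the chat is not reproduced here, so as a generic illustration of the convention (the symbols $f$, $\theta$, $\mathcal{L}$ and $m$ are assumptions, not taken from the original picture): the inputs go before the semicolon and the learned parameters after it, and the cost averages the per-example loss.

```latex
\hat{y}^{(i)} = f\left(x^{(i)};\, \theta\right),
\qquad
J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(f\left(x^{(i)};\, \theta\right),\, y^{(i)}\right)
```

Here $x^{(i)}$ varies from example to example, while $\theta$ is held fixed during a single evaluation of $f$.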
its a bad table cause they didnt label the rows and the columns, but if the rows are the predicted class and the columns are the actual class its correct
yeah not labelling axes is a big red flag for me
ah yeah the second plot is better
i find it easier to think of recall as "of the samples we applied the label to, what % did we label correctly" rather than in terms of TP/FN
oh no thats precision
or is it
gah i need to finish my coffee
recall is "of all the instances of label x, what proportion did our classifier actually give the label x"
its better for reasoning when u have more than 2 classes
This is the pic I always think about.
Precision is: "I said a lot of things were positives... but what percent were true positives?"
Recall is: "I got a lot of positives... but how many positives did I get out of the total possible?"
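Those two definitions can be checked on a toy example. A small sketch in plain Python (the function name `precision_recall` and the labels are invented for illustration):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    # Precision: of everything we labelled positive, how much really was?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything that really was positive, how much did we find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

With more than two classes, the same function can be called once per class by changing `positive`.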
yo
how do i join these two columns together, so there is another column
so the two columns are side by side along with the date*
So I'm curious about generating unique sentences using AI, but I'm not sure where to start. My only requirement is that it should be able to use a list like this:
self.responses = [
    "good",
    "fine, how are you?",
    "understandable, have a great day",
    "Reddit is wonderful",
    "Python is better then JavaScript and C++",
    "The dev cat speel :p",
    "Hi guys! Today’s going great! ᕕ( ᐛ )ᕗ",
    "c:"
]
What do you guys recommend for making an AI?
the simplest technique is to use an ngram sentence generator
!docs pandas.DataFrame.join
DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)```
Join columns of another DataFrame.
Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.
what's that?
ok, so I've got something like this: py [('that', 'stupid', 'fan', 'in', 'standingpad'), ('stupid', 'fan', 'in', 'standingpad', 's'), ('fan', 'in', 'standingpad', 's', 'raspberry'), ('in', 'standingpad', 's', 'raspberry', 'pi'), ('standingpad', 's', 'raspberry', 'pi', 'won'), ('s', 'raspberry', 'pi', 'won', 't'), ('raspberry', 'pi', 'won', 't', 'shut'), ('pi', 'won', 't', 'shut', 'up'), ('i', 'm', 'trying', 'to', 'study'), ('m', 'trying', 'to', 'study', 'for'), ('trying', 'to', 'study', 'for', 'a'), ('to', 'study', 'for', 'a', 'math'), ('study', 'for', 'a', 'math', 'test'), ('will', 'you', 'shut', 'up', 'man'), ('can', 'you', 'shut', 'up', 'for'), ('you', 'shut', 'up', 'for', '5'), ('shut', 'up', 'for', '5', 'minutes'), ('may', 'you', 'please', 'stop', 'asking'), ('if', 'you', 'won', 't', 'shut'), ('you', 'won', 't', 'shut', 'up'), ('won', 't', 'shut', 'up', 'i'), ('t', 'shut', 'up', 'i', 'll'), ('shut', 'up', 'i', 'll', 'just'), ('up', 'i', 'll', 'just', 'stop'), ('i', 'll', 'just', 'stop', 'here'), ('for', 'goodness', 'sake', 'let', 'me'), ('goodness', 'sake', 'let', 'me', 'watch'), ('sake', 'let', 'me', 'watch', 'youtube'), ('let', 'me', 'watch', 'youtube', 'c'), ('angry', 'garbling', 'noises', 'someone', 'stole'), ('garbling', 'noises', 'someone', 'stole', 'my'), ('noises', 'someone', 'stole', 'my', 'muffins')]
but how would I work with this?
huh, that was fast. so you've created some 5-grams
fuck, that means you've made pentagrams. that means this channel is going to get sucked into hell
anyway, you also need 4-grams for the same sentences. once you've made the 4-grams, pick one randomly, and then pick a 5-gram that's the same except for the last token (since the 5-gram has an extra token on the end)
I just used the nltk library
Lol
@spark urchin here's an example from an ngram generator I made a few years ago
using 3-grams and 4-grams
you have to count (n-1)grams and ngrams. or n and (n+1)grams, I guess, depending on how you think of it. two consecutive integers >2
and you pick one of the (n-1)grams that you have, and then pick an ngram that completes it
and then step forward by one token/gram and continue.
does that help?
in this example, "They fought with weapons" and "They fought with them" are sequences of 4 tokens that also appear in the data
but we randomly select one so that we can continue.
so for selecting, something like this?: ```py
def mood_response(self):
    rand_response = random.choice(self.responses)
    self.out4 = list(ngrams(rand_response.split(), 4))
    self.out3 = list(ngrams(rand_response.split(), 3))
    rand_response4 = random.choice(self.out4)
    rand_response3 = random.choice(self.out3)
default palette. they just need to be distinct.
it's black on my screen and then exports as white.
so I have an output like: slams head on CPU | head on CPU
no, not really. you have to make ngrams and (n-1)grams for the entire corpus (which is the collection of sentences/documents)
and then to start making a sentence, you randomly pick an (n-1)gram
so in this case, a 3gram?
and then you randomly pick an ngram where the first n-1 grams are the same
this means that the first n grams will be an actual sequence of grams from the corpus.
does this much make sense? so far, you've just copied an ngram from the corpus, basically.
I don't get it
okay, well I'll continue with the explanation
the window then slides forward
so now you're looking at (n-1) grams again
see how for the red section, it's an (n-1) gram of ("they", "fought", "with")?
yeah
and then the next n-1gram is ("fought", "with", "the")?
ok
each time, you're completing the current (n-1)gram with an ngram.
and then you move forward by one gram and do it again.
ahh
so something like this?:
loop:
    rand = select_rand(ngrams)
    next_ngram = rand + 1
    merge(rand, next_ngram)
    continue
you need a collection of (n-1)grams and ngrams. two separate lists/sets/whatever
there are guides for how to make them online.
I do have a set of 4grams and 3 grams
but this is basically the original system for sentence generation.
so it's more like?:
loop:
    ngram_3 = select_rand(ngrams)
    ngram_4 = get_ngram_4(ngram_3)
    merge(ngram_3, ngram_4)
    continue
something like that, I guess
if you want to do it the simpler way, where you don't have to account for sentences ending, you can just stop after you reach a certain number of grams
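The procedure described above can be sketched in plain Python. This is one possible reading of it, not the generator mentioned in the chat: the function names, the tiny corpus, and the choice to store each (n-1)-gram with its possible next tokens are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_model(sentences, n=4):
    """Map each (n-1)-gram to the tokens that complete it into an n-gram."""
    continuations = defaultdict(list)
    for sentence in sentences:
        for gram in ngrams(sentence.split(), n):
            continuations[gram[:-1]].append(gram[-1])
    return continuations

def generate(continuations, max_tokens=12, rng=random):
    """Pick a random (n-1)-gram, then keep completing and sliding forward."""
    context = rng.choice(sorted(continuations))
    output = list(context)
    while len(output) < max_tokens:
        # The current window is the last (n-1) tokens produced so far.
        options = continuations.get(tuple(output[-len(context):]))
        if not options:
            break  # nothing in the corpus continues this window
        output.append(rng.choice(options))
    return " ".join(output)

# Made-up corpus; the whole corpus is used, not a single sentence.
corpus = [
    "they fought with weapons of war",
    "they fought with them all day",
    "we fought with the enemy at dawn",
]
model = build_model(corpus, n=4)
print(generate(model, max_tokens=8, rng=random.Random(0)))
```

Stopping at `max_tokens` is the simpler variant mentioned above, where sentence endings are not modelled.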
ok, thanks
oh stelercus is here
i have an NLP question if youre interested
so we are trying to find out synonyms for words in this one domain
my initial idea is to find out the word embeddings of the words that we need synonyms for
so that we can figure out some sort of cosine similarity score between words that are synonyms or not
then i thought, hmm, this domain is unique enough where it would be better to use a model fine-tuned with words from this domain beforehand
@serene scaffold what would be your approach?
if youre busy, feel free to ignore
just curious
if there's an approach for this specific task, it's very likely that it involves word embeddings and cosine similarity in some way.
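As a rough illustration of the embeddings plus cosine similarity idea: the sketch below ranks candidate words by cosine similarity to a query word. The three-dimensional vectors are invented for the example; in practice they would come from a trained (ideally domain-specific) embedding model, and the function names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def most_similar(word, embeddings, top_k=2):
    """Rank every other word by similarity to `word`, best first."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Made-up toy vectors, NOT real embeddings.
embeddings = {
    "tumor": [0.9, 0.1, 0.3],
    "neoplasm": [0.8, 0.2, 0.35],
    "invoice": [0.1, 0.9, 0.0],
}
print(most_similar("tumor", embeddings))
```

A similarity threshold for "counts as a synonym" would still have to be tuned on labelled pairs from the domain.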
cool cool
im glad our initial ideas are the same
ill look more into it just wanted to make sure im going down the right path
thanks

though one dilemma with word embeddings is that if there are homographs for that word, it's all the same
are you trying to find synonyms in context, or just in general? @misty flint
if you're looking for synonyms in context, one approach is to use BERT. though the top prediction from BERT might be something uninteresting, like a pronoun, or a word of a different part of speech that happens to also form a coherent result.
theres a biomedical BERT that might give us better results
but i would have to test it
I think I also have a trivial commit in their repo
thx bb 💚

its one of my favorite emojis
@misty flint you can also use this to fill in the blanks, if you decide to use bert: https://github.com/swfarnsworth/madlibert
I wrote it like two years ago. I think it works.
oh noice
ill def look into it and try a few things
see what works best for our use case
if i end up using your repo, ill def reference you


A very simple method is to just create a Markov chain.
(I hope it's the most relevant channel)
I am trying to implement fact searching, but I'm stuck on finding facts in my language
So, would it be ok to translate the text to English, use English solutions, and translate the facts back to my language, or would the probability of mistakes be too high? 🤔
Hello, peeps, would someone mind explaining radiance and irradiance in computer vision? I searched the internet a lot but didn't get the idea.
In radiometry, radiance is the radiant flux emitted, reflected, transmitted or received by a given surface, per unit solid angle per unit projected area. Spectral radiance is the radiance of a surface per unit frequency or wavelength, depending on whether the spectrum is taken as a function of frequency or of wavelength. These are directional qu...
In radiometry, irradiance is the radiant flux received by a surface per unit area. The SI unit of irradiance is the watt per square metre (W⋅m−2). The CGS unit erg per square centimetre per second (erg⋅cm−2⋅s−1) is often used in astronomy. Irradiance is often called intensity, but this term is avoided in radiometry where such usage leads to con...
In geometry, a solid angle (symbol: Ω) is a measure of the amount of the field of view from some particular point that a given object covers. That is, it is a measure of how large the object appears to an observer looking from that point.
The point from which the object is viewed is called the apex of the solid angle, and the object is said to s...
yeah
or an API for that
Radiance - "Radiant flux emitted, reflected, transmitted or received by a surface, per unit solid angle per unit projected area. This is a directional quantity."
Irradiance - "Radiant flux received by a surface per unit area."
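In symbols, the two definitions read as follows, with $\Phi$ the radiant flux, $A$ the surface area, $\Omega$ the solid angle and $\theta$ the angle between the surface normal and the viewing direction (the symbol names are the usual radiometry conventions, not taken from the chat):

```latex
E = \frac{\mathrm{d}\Phi}{\mathrm{d}A}
\quad \left[\mathrm{W\,m^{-2}}\right],
\qquad
L = \frac{\mathrm{d}^{2}\Phi}{\mathrm{d}A \,\cos\theta \,\mathrm{d}\Omega}
\quad \left[\mathrm{W\,sr^{-1}\,m^{-2}}\right],
\qquad
E = \int_{\Omega} L \cos\theta \,\mathrm{d}\Omega
```

So irradiance is what arrives at a point summed over all directions, while radiance describes a single direction, which is roughly what one camera pixel measures.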
"Radiance is useful because it indicates how much of the power emitted, reflected, transmitted or received by a surface will be received by an optical system looking at that surface from a specified angle of view."
please help me solve this
Which would be a place to draw our NNs to put in report or something?
when i do that, i get this
they cannot magically know which columns to join ofc. please read the function examples at least.
i used df.fillna() , but it still doesnt work
Hello all
I am working in a time series LSTM problem
I have numeric data and also categorical data (weekdays, which I one-hot encoded). How do I combine these to create a sequence for my prediction? I want to predict the numerical values
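One common way to do this, sketched here with made-up toy data, is to stack the one-hot columns next to the numeric column so each timestep becomes one feature vector, then cut sliding windows of shape (samples, timesteps, features), which is the layout LSTM layers usually expect. The helper name `make_windows` and the window length are assumptions for illustration only.

```python
import numpy as np

def make_windows(numeric, one_hot, window):
    """Stack features per timestep and cut sliding windows (hypothetical helper).

    numeric: shape (T,), the series to predict
    one_hot: shape (T, K), e.g. one-hot encoded weekdays
    Returns X with shape (T - window, window, 1 + K) and y with shape (T - window,).
    """
    features = np.hstack([numeric.reshape(-1, 1), one_hot])  # (T, 1 + K)
    X = np.stack([features[i:i + window] for i in range(len(features) - window)])
    y = numeric[window:]  # the value right after each window
    return X, y

# toy data: 14 days of values plus a weekday index 0..6
values = np.arange(14, dtype=float)
weekdays = np.eye(7)[np.arange(14) % 7]  # one-hot rows
X, y = make_windows(values, weekdays, window=3)
print(X.shape, y.shape)
```

Scaling the numeric column (fitted on the training part only) would normally happen before windowing; the one-hot columns can stay as they are.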
hmm please read the message again, you need to tell it that which cols to merge.
I executed the code combined = data.join(df1) where data and df1 are my dataframes
even when i tweak it, using the how parameter to join on the right, it still gives me a column of NaN values
not sure why stel suggested join
they have same vals right?
and you want to join in terms of their vals?
(the two cols)
here
i think the date column is an issue
so you want something like 130.08+0.21 something for date 2020....
@upper spindle
yeye
so i want a column for sentiments and a column for price
in the same df
if that makes sense
so you can join with on as Date
then add both cols simply.
df.new_col = df.a + df.b
hmm, let me give it a try
I'll mess in my lab hold on
sure thing, no probs
!e
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
'A': np.arange(6) + 1})
df2 = pd.DataFrame({'Date': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
'B': np.arange(6) + 3})
df3 = df.merge(df2, on='Date')
df3['C'] = df3.A+df3.B
print(df3)
@lapis sequoia :white_check_mark: Your eval job has completed with return code 0.
001 | Date A B C
002 | 0 K0 1 3 4
003 | 1 K1 2 4 6
004 | 2 K2 3 5 8
005 | 3 K3 4 6 10
006 | 4 K4 5 7 12
007 | 5 K5 6 8 14
thanks, ill give it a try now aha
I don't know enough about your data to know why you got this. You get a NaN if there's a date missing from one dataframe or the other
i just realised this now, sorry for me being so stupid 🤦♂️
dammit
is there a way to give the missing values the value from before? because i scraped the data it would be hard otherwise, as there isn't a row that can replace the missing data
Hey,
is there anyone who worked with healthcare data standards like HL7 V2 or FHIR ?
@upper spindle look into interpolation
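A rough sketch of both options on made-up data (the column names `price` and `sentiment` are assumptions): merge on `Date` keeping every price row, then either carry the previous value forward with `ffill()` or fill linearly with `interpolate()`.

```python
import pandas as pd

prices = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"]),
    "price": [130.0, 131.0, 129.5, 132.0],
})
# the scraped sentiment data has no row for 2020-01-02
sentiment = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-04"]),
    "sentiment": [0.2, -0.1, 0.4],
})

# how="left" keeps every price row; the missing date shows up as NaN
merged = prices.merge(sentiment, on="Date", how="left").sort_values("Date")
merged["sentiment_ffill"] = merged["sentiment"].ffill()         # value from before
merged["sentiment_interp"] = merged["sentiment"].interpolate()  # linear between neighbours
print(merged)
```

Which fill makes sense depends on the data; forward fill never uses information from a later date, while interpolation does.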
Just ask your actual question
actually they are two :
- How can I convert some proper data to FHIR standard ?
- How can I transform an HL7 V2 data to FHIR data ?
hm these seem like a very very specific set of questions. I assume, that if you give enough background to understand it in simple words, people may be able to help, this may go unanswered because this is too specific and may be it is not hard but people don't have background.
I'm working with text data, specifically I'm trying to make sure links have been generated correctly based off of record names. I have extracted the relevant part of the link to check. Now I'm dealing with the issue of trying to understand how the programmer sanitized the name before generating the link. For example, in one instance he removed all commas in the name before generating the link. Should I bother trying to reverse engineer what he did? There seems to be a lot of edge cases. I'm thinking of using more of a heuristic and just split the name and check to see if each word in the name is in the link. And if that threshold is > 50% say the link is correctly generated.
I think this is working.
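For reference, the heuristic described above could look roughly like this. The function name, the tokenising regex, and the sample names and links are all made up; the 50% threshold is the one from the message. Comparing whole tokens rather than substrings avoids short words like "co" matching inside "records".

```python
import re

def link_matches_name(name, link, threshold=0.5):
    """Return True if more than `threshold` of the words in name appear in link."""
    name_words = re.findall(r"[a-z0-9]+", name.lower())
    if not name_words:
        return False
    link_words = set(re.findall(r"[a-z0-9]+", link.lower()))
    hits = sum(word in link_words for word in name_words)
    return hits / len(name_words) > threshold

print(link_matches_name("Smith, Jones & Co.", "/records/smith-jones-co"))
print(link_matches_name("Smith, Jones & Co.", "/records/acme-holdings"))
```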
What does lstm len mean in bilstm, pytorch?
might be long short term memory?
The lstm length usually refers to the number of elements in a window of the input sequence data https://towardsdatascience.com/text-generation-with-bi-lstm-in-pytorch-5fda6e7cc22c
Thanks, also, what does time_feature_count mean?
I don’t see that variable in the tutorial I linked, so I’m not sure what you are referring to.
dict = {}
def alphaToInt256(x):
    ascii_values = [ord(char) for char in x]
    count = 0
    tracker = 0
    intValue = 0
    for i in ascii_values:
        count = count + 1
    for i in ascii_values:
        i = (i * 256 ** (count - 1))
        ascii_values[tracker] = i
        tracker = tracker + 1
        count = count - 1
    for i in ascii_values:
        intValue = intValue + i
    return intValue
numerical_code = list(map(alphaToInt256, iso_codes))
#print(numerical_code)
total_dpm = data.groupby("iso_code", as_index=False)[["total_deaths_per_million"]].max()
#print(total_dpm)
for i in range(len(numerical_code)):
    dict[numerical_code[i]] = total_dpm.iloc[i,1]
print(dict)
print("\nQ1 - nanquantile of dict : ", np.nanquantile(dict.values(), .25))
Just trying to take the values within my dictionary and get the nanquantiles of them, would anyone be able to provide some assistance?
Just having trouble accessing the values themselves since they dont support indexing
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
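The likely cause: `dict.values()` is a view object, and NumPy wraps it as a single object rather than as an array of numbers, so `isnan` has nothing numeric to work on. A minimal sketch of the fix, with made-up numbers standing in for the real data, is to turn the values into a float array first:

```python
import numpy as np

# made-up stand-in for the real dictionary
deaths_per_million = {"AFG": 190.5, "ALB": 1200.0, "DZA": np.nan, "AND": 1900.25}

values = np.array(list(deaths_per_million.values()), dtype=float)
q1 = np.nanquantile(values, 0.25)  # NaN entries are ignored
print(q1)
```

Naming the dictionary something other than `dict` also avoids shadowing the built-in type.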
For code below I am getting result
df.groupby(["Outlet_Establishment_Year","Item_Type","Item_Outlet_Sales"]).max().sort_values(by=['Item_Outlet_Sales'])
But for below code I am getting "sort_values() got an unexpected keyword argument 'by'
df.groupby(["Outlet_Establishment_Year","Item_Type","Item_Outlet_Sales"]).size().sort_values(by=['Item_Outlet_Sales'])
Why sort_values is not working with size()?
can someone suggest a good course for SQL on youtube
DataFrame.sort_values has a by argument, but Series.sort_values does not.
the Socratica channel has one
my guess is that df.groupby(["Outlet_Establishment_Year","Item_Type","Item_Outlet_Sales"]).max() returns a Series with multiple levels of indexing.
or the .size() one.
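To make that concrete with a toy frame (the data is made up): `.size()` on a groupby gives a Series, so it sorts without `by`, or you can turn it back into a DataFrame with a named count column and then `by` works again.

```python
import pandas as pd

df = pd.DataFrame({
    "Outlet_Establishment_Year": [1999, 1999, 1999, 2004, 2004, 2004],
    "Item_Type": ["Dairy", "Dairy", "Snacks", "Dairy", "Snacks", "Meat"],
})

sizes = df.groupby(["Outlet_Establishment_Year", "Item_Type"]).size()
print(type(sizes).__name__)  # Series

# Option 1: a Series has only one set of values, so no `by` is needed
sorted_series = sizes.sort_values()

# Option 2: name the counts and go back to a DataFrame
sorted_frame = sizes.reset_index(name="count").sort_values(by="count")
print(sorted_frame)
```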
I want a course that is from beginner to advance
freecodecamp has one but it is just a beginner course
@lapis bramble we unfortunately don't have a recommended resource for that: https://www.pythondiscord.com/resources/?topics=databases&type=course&difficulty=beginner
in either case, this is the data science channel. try asking in #databases
if you do end up finding a resource you like that meets your expectations, let me know in #community-meta so I can see about getting it on our website.
@lapis sequoia the question you asked belongs in an individual help channel (see #❓|how-to-get-help) or in #algos-and-data-structs. Please remember to always ask your actual question, not if someone knows about a topic.
update:
IBM Watson is pretty decent at NLP
however, there are some serious deficiencies when applied to our domain
i think its just the nature of the data it was trained on aka probably not enough from our domain
IBM Watson 🤮
I just don't like how pretentious IBM is, and how they're still coasting on the publicity that their Watson brand got for the success of a specific algorithm that's now a decade old and probably not even used.
yeahhh
like
i feel like, at least for our use case, i could probably fine-tune GPT3 and it would probs perform better
can you fine-tune gpt3? I thought there's only one instance of gpt3 and it's behind an api paywall
but idk how much we have in our budget to try something new and i think they charge per token
this FEELS expensive
but I think you can train your own gpt2, or something? (I'm mostly about information extraction so I'm not up-to-date on GPT-\d)
but i dont actually know
ah. we did text generation on GPT 1 and 2 for a project. it was okay. probs not enough training data.
this is also true

the link btw https://openai.com/api/pricing/
Open?
Why is my fillna method not working?
geo_AL_fips["BusinessType"] = geo_AL_fips["BusinessType"].fillna('N/A')
It works and then I write to csv and check the dataframe again and its back to nan
well for one thing, why do you want to fill them with N/A?
because that's just a worse version of NaN, in many ways.
Oh so when I write/read the csv it's reinterpreting the N/A as Nan?
geo_AL_fips["BusinessType"].fillna('N/A') does not create a CSV; it creates a dataframe. though I'm not sure if that's what you mean.
are you trying to make sure that NaN values are written a certain way when you call to_csv?
No here's my process -
geo_AL_fips = pd.read_csv("data/state_data/geo/geo_fips/AL_fips_scraped.csv") #read csv
geo_AL_fips["BusinessType"] = geo_AL_fips["BusinessType"].fillna('N/A')
geo_AL_fips["BusinessType"].unique() #check
geo_AL_fips.to_csv("data/state_data/geo/geo_fips/AL_fips_scraped.csv") #re-write
what is geo_AL_fips["BusinessType"].unique() intended to do?
but when I read it in again it's back to nan. I just need it to read 'N/A' for a js application
If all you're trying to do is write the null values in the CSV file as N/A, you can just do this
path = "data/state_data/geo/geo_fips/AL_fips_scraped.csv"
pd.read_csv(path).to_csv(path, na_rep='N/A')
Yes but I only need it for this one column.
If it's just going to do that because thats how it interprets N/A I'll just come up with something else to change it to.
But that may come in handy in the future as well, thank you.
path = "data/state_data/geo/geo_fips/AL_fips_scraped.csv"
df = pd.read_csv(path)
df['BusinessType'].fillna('N/A', inplace=True)
df.to_csv(path)
Still does the same thing. fills with N/A, I re-write file, read it back in and it is back to nan.
oh. can you see what's missing from my to_csv call? I included it earlier.
actually, hmm
well, I need to focus on work. sorry.
No problem, thanks.
what does df.BusinessType.dtype say
if u want to store a literal 'N/A' it should be type object
i am noob looking for fun learning project based on lottery numbers draw history results, not sure what to do with csv of 100's of lotto results, any ideas please?! 🙂
you could check how well they fit with Benford's law
or that they fit uniform distribution
benfords law... ok I will google this ty!, I was thinking about 'popularity' of numbers, or trying to find a recurring pattern! this is more analysis i think, not ML which was my first thought ha ha!
(like i say i am beginner to data science/ ML!) still got long way to go!
Ik you all must have heard this question several times, but pls bear with me. I am done with the Andrew Ng course on machine learning and now I am feeling lost on how to actually code ML and AI, or what tutorials or resources to follow. Could anyone pls help?
Ah I see
u can convert it with like geo_AL_fips["BusinessType"] = geo_AL_fips["BusinessType"].fillna('N/A').astype('object')
although i woulda thought that happened automatically
I ended up changing it to a different string variable but I was thinking that's the case. Edit: Was already obj. type
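For anyone hitting the same thing: the string is written to the file fine, but `pd.read_csv` treats `N/A` (among other markers) as missing by default, so it comes back as NaN on the next read. A small sketch with an in-memory file and made-up rows:

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["a", "b"], "BusinessType": ["Retail", None]})
df["BusinessType"] = df["BusinessType"].fillna("N/A")

buffer = io.StringIO()  # stands in for the CSV file on disk
df.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)                          # "N/A" parsed back to NaN
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)   # "N/A" kept as text

print(default_read["BusinessType"].tolist())
print(literal_read["BusinessType"].tolist())
```

Note that `keep_default_na=False` applies to the whole file, so other columns stop getting their empty cells turned into NaN as well. Passing `index=False` to `to_csv` also stops an extra unnamed index column from piling up on every rewrite.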
throwing random numbers into ML or any other algo will give you random results, you might get something that looks meaningful just by chance
understood.... I recall when my mates would play roulette: " it has not been black for x spins, thus chances increased of being black on next spin"!
i am in similar position, I did the google crash course and now looking for starter projects I can do myself for further practice
Anyone can give them money and get access, so it’s open! /s
The big libraries all have excellent tutorials for how to apply ML/AI for various datasets. See https://scikit-learn.org/stable/tutorial/index.html as an example. PyTorch, tensorflow, xgboost all have good introductions as well.
Writing a basic neural network “from scratch” (meaning just numpy functions) is also a good introductory project
yo is it possible to program a gf for myself?
!e
print('Hello World')
Guys how do y'all get your python code to run on Discord? 😀 Apparently, this isn't working for me.... Or am I not doing it the right way?
depends how low your standards are i suppose 🤔
Did you add the !e after the fact?
The bot won't recognize it as a command if you edit it. It has to be a command the first time
!echo spam spam spam spam
spam spam spam spam
There's no particular reason why I did that.
!e
import pandas as pd
import numpy as np
df = pd.DataFrame({'Languages' : ['Lingala', 'Swahili', 'Twi', 'Yoruba'], 'Country': ['Congo', 'South Africa', 'Ghana', 'Nigeria']})
print(df)
@odd meteor :white_check_mark: Your eval job has completed with return code 0.
001 | Languages Country
002 | 0 Lingala Congo
003 | 1 Swahili South Africa
004 | 2 Twi Ghana
005 | 3 Yoruba Nigeria
Thank you 😊. It has finally worked
@odd meteor this significantly understates the linguistic diversity of Africa. And the cultural irrelevance of its borders.
That's actually very true! The 'artificial' borders were drawn by the colonial masters at the Berlin Conference, mainly for resource exploitation.
We have over 300 languages here in Nigeria. By default almost everyone here speaks at least 2 languages.
:incoming_envelope: :ok_hand: applied mute to @exotic epoch until <t:1645487073:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
rekt
very low
> having standards
Thanks again for your help over the last few days, @serene scaffold.
Improved my program a bit more and changed to using lists and later tuples to store my results.
Now I got a program which compares automatically columns by name for me. 😄
I mean, it still can and will be extended, but the core function does work very well already.
The main changes I will add from now on, are mostly only some kind of splitting the data and maybe add a renaming algorithm for very similar, but not equally named rows.
I want to make a CNN model (the FrontBlock) that shares the same output with two other model blocks (BuildingBlock and BuildingBlock2), then combine the results at the end with a multiply (EndModel). However, it seems that my model just split into BuildingBlock and BuildingBlock2, each with its own copy of the FrontBlock structure. This is how I create my model:
def FrontBlock():
    input_front = Input(shape=(160, 160, 3))
    x = Conv2D(filters=64, kernel_size=(3, 3), activation='relu')(input_front)
    x = Conv2D(filters=128, kernel_size=(3, 3), activation='relu')(x)
    out = Conv2D(filters=128, kernel_size=(3, 3), activation='relu')(x)
    HeadBlock = Model(input_front,out)
    return HeadBlock
def BuildingBlock():
    hdr = FrontBlock()
    x = Conv2D(filters=64,kernel_size=(3,3),activation='relu')(hdr.output)
    x = Conv2D(filters=64,kernel_size=(3,3),activation='relu')(x)
    x = BatchNormalization()(x)
    x = AveragePooling2D(pool_size=(3,3))(x)
    x = Flatten()(x)
    out = Dense(512, 'relu')(x)
    MiddleBlock = Model(hdr.inputs,out)
    return MiddleBlock
def BuildingBlock_Two():
    hdr = FrontBlock()
    x = DepthwiseConv2D(kernel_size=(3, 3), activation='relu')(hdr.output)
    x = Conv2D(filters=64, kernel_size=(5, 5), activation='relu')(x)
    x = MaxPooling2D(pool_size=(3, 3))(x)
    x = Flatten()(x)
    out = Dense(512, 'relu')(x)
    MidModel = Model(hdr.inputs,out)
    return MidModel
def EndModel():
    mid_one = BuildingBlock()
    mid_two = BuildingBlock_Two()
    x = Multiply()([mid_one.output,mid_two.output])
    out = LeakyReLU()(x)
    TailModel = Model([mid_one.inputs,mid_two.inputs],out)
    return TailModel
and this is when i call these line:
```
MComb.compile(optimizer='Adam')
plot_model(MComb, to_file=r'C:\Users\Admin\Pictures\modelo.png', show_shapes=True)
```
anyone know why matplotlibs bar.set_height doesn't do anything for me?
it's not stacked properly, since the front block isn't split into the two branches as i expected (i can't be sure that both share the same output from FrontBlock)
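A likely explanation: `FrontBlock()` is called once inside each building block, and every call creates a fresh `Input` and fresh conv layers, so the combined model ends up with two inputs and two separate front blocks. A minimal sketch of one way around it is to build the front once and feed its output tensor to both branches. The input shape and layer sizes below are shrunk on purpose to keep the example light, and `build_shared_model` is a made-up name; this is not the original architecture.

```python
from tensorflow.keras import Input, Model
from tensorflow.keras.layers import (AveragePooling2D, Conv2D, Dense, Flatten,
                                     LeakyReLU, MaxPooling2D, Multiply)

def build_shared_model(input_shape=(32, 32, 3)):
    inputs = Input(shape=input_shape)

    # Front block: created once, so both branches read the same tensor
    x = Conv2D(16, (3, 3), activation="relu")(inputs)
    shared = Conv2D(32, (3, 3), activation="relu")(x)

    # Branch one
    a = Conv2D(16, (3, 3), activation="relu")(shared)
    a = AveragePooling2D(pool_size=(3, 3))(a)
    a = Flatten()(a)
    a = Dense(64, activation="relu")(a)

    # Branch two
    b = Conv2D(16, (5, 5), activation="relu")(shared)
    b = MaxPooling2D(pool_size=(3, 3))(b)
    b = Flatten()(b)
    b = Dense(64, activation="relu")(b)

    # Combine the two branches element-wise
    merged = Multiply()([a, b])
    outputs = LeakyReLU()(merged)
    return Model(inputs, outputs)

model = build_shared_model()
print(len(model.inputs), model.output_shape)
```

If the blocks need to stay as separate functions, the same idea applies: create the front model once and pass it (or its output tensor) into each building block as an argument instead of calling `FrontBlock()` inside them.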
you'll have to show enough code for someone to reproduce the figure, and what the figure looks like for you, for anyone to answer that.
tysm for the help, you made me try and go to the help channel which led me to try and simplify the problem to reproduce the error on a smaller scale for clarity which led me to understanding the error which led me to solving it 👍 👍
amazing
you'll always learn the most from solving it on your own
@serene scaffold update: the VP wants me to see if we can replace Watson with GPT-3
but the thing is, he also told me that they dont have any FTE positions after i graduate
and i was like

welp
rip
i guess i will try my best before i leave
just so i can say ive done it before

idk
the less we legitimize IBM's business-to-business SAAS hell, the better.
Question about scree plots in Python. Is there a way add the bar charts and value labels for each point like in R?
python result
https://cdn.discordapp.com/attachments/412620789133606914/945525172176314418/unknown.png
honestly. 1000% agree

looks like this has some examples of what you're going for: https://queirozf.com/entries/add-labels-and-text-to-matplotlib-plots-annotation-examples
though this person uses str.format where I would use fstrings. and my way is better.

I have a story I can tell you over DMs if you want, though I see that your settings are off. up to you.
i dont usually use the regular matplotlib but ill save your link for when i do
ah thats bc of spam lol. sure
THANK YOUUU
for ordinary least squares, how would I go about proving this?
Could someone explain or show me how?
Interesting question, and one I should probably know the answer to. I'll see if I can investigate tomorrow.
Thanks @serene scaffold
Hello plz help me
I don't know much statistics
Please suggest me some YouTube channel or any paid course that teach complete statistics for data science and ML
Check out "Krish Naik" yt channel
"Statistics for ML" book for ML
For transforming a very large JSON dataset into a different schema, should I use pyspark? I'm particularly interested since it seems very easy to use through AWS Glue
are validation sets always necessary
Yes
hello all
i have this array
```
array([[1,2,3,4,5,6,7,8,9,0],
[9,7,6,5,4,3,2,1,6,7],
[7,6,5,4,3,7,8,9,0,1]])
```
and i want to convert it to
```array([[1,[2,3,4,5,6,7,8,9,0]],
[9,[7,6,5,4,3,2,1,6,7]],
[7,[6,5,4,3,7,8,9,0,1]]])```
How can i go around it
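A regular NumPy array can't hold rows shaped like `[number, [list]]` without falling back to `dtype=object`, so one option is a plain nested list; another is to keep two arrays, one for the first column and one for the rest. A small sketch (the first row here is inferred from the desired output above):

```python
import numpy as np

arr = np.array([[1, 2, 3, 4, 5, 6, 7, 8, 9, 0],
                [9, 7, 6, 5, 4, 3, 2, 1, 6, 7],
                [7, 6, 5, 4, 3, 7, 8, 9, 0, 1]])

# Option 1: nested Python lists in exactly the requested shape
nested = [[int(row[0]), row[1:].tolist()] for row in arr]
print(nested)

# Option 2: stay in NumPy and split into two arrays
heads, tails = arr[:, 0], arr[:, 1:]
print(heads.shape, tails.shape)
```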
hi how can i plot the learning rate?
also why are there values like this on my metrics?
speaking atleast 2 language is kind of true for India as well for the most part, since states have their own language, and we mostly know hindi(national language) too.
I'm pretty sure you don't mean learning rate here.
you can consider those as 0.
anyways, can you tell me which metric you want to plot? I just did it in my own notebook.
you can store the return value of model.fit in a history variable and then use it.
are learning rate irrelevant ? i want to plot it and compare like that? or its non sense?
i mean. learning rate is 0.007, its just a value, what do you mean plot it?
we usually plot what changes with iterations, like accuracy.
so my model is really the worst to get those kind of results? wew
like this
hm not sure about that. but its 0 nonetheless.
these are the metrics i used
so i think for this you will need to do experiment with different learning rates.
then you can plot with given data.
categoricalaccuracy for multiclass right?
yeah but i think you're taking wrong precision may be.
may be tho. I'll need to check.
maybe i should manually calculate the precision and recall based on tp fp etc?
i can see your accuracy going up. so precision and recall should too. atleast more than 0.
Your model isn't getting any true or false positives at all
i can see accuracy as .38 tho
It looks like everything is being predicted as negative
And it just happens that negative is right 38% of the time
oh yes yes, tn and fn are all up and tp and fp are all 0.
hm yep true
@pastel valley you may want to look at the model again then, currently it's predicting everything as negative.
why dont you show your model?
There is something fundamentally wrong with your model and it's not learning anything
also why would you print tn, fn, tp, fp anyways since you are saying it's categorical.
It's probably because the final layer is set up wrong, likely using a sigmoid layer for binary classifications when he wants to do something like softmax for multiclassification
how to create a python program to extract emails found within the txt file . and result in nested json format with No of times the email is repeated in the text.
you gotta show what you have done so far.
but apart from that,
well you can start by reading file, managing starts and ends of email as first step.
Look at your error
f is undefined, probably because it can't read the file
specify your encoding
open(path, 'r', encoding='utf-8')
it might not be utf-8 though
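As a starting point for the email question, here is a rough sketch using only the standard library. The regex is a simplified pattern that will miss some valid addresses, and the JSON layout (`emails`, `address`, `count`) is made up since the required format wasn't given.

```python
import json
import re
from collections import Counter

# simplified pattern: good enough for typical addresses, not the full spec
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def count_emails(text):
    """Count each address found in text and return a nested, JSON-ready dict."""
    counts = Counter(match.lower() for match in EMAIL_RE.findall(text))
    return {"emails": [{"address": addr, "count": n}
                       for addr, n in counts.most_common()]}

# with a real file: text = open(path, "r", encoding="utf-8").read()
sample = "Contact bob@example.com or alice@example.org. Bob again: bob@example.com."
result = count_emails(sample)
print(json.dumps(result, indent=2))
```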
is the implementation of tp fp etc same for multi and binary classification?
do you know what we mean by tp and fp theoretically?
we assume that one class is positive and one is negative, we cannot define positive and negative if we have 5 classes.
@acoustic halo@lapis sequoia
its not this bad last attempt i just tried to input custom learning rate rather than the default? what is the default learning rate of i did not put something there?
It has nothing to do with your learning rate
yeah i think i get it
it can be used to create confusion matrix right? useful for multi class?
what is the problem?
Your model is broken
You are doing multiclassification right? Your model is trying to do binary classification
hahaha damn thats worst
yeah but not for multiclass.
Just post what your model looks like
Layer (type) Output Shape Param #
conv2d_32 (Conv2D) (None, 142, 142, 16) 448
max_pooling2d_33 (MaxPoolin (None, 71, 71, 16) 0
g2D)
conv2d_33 (Conv2D) (None, 69, 69, 32) 4640
max_pooling2d_34 (MaxPoolin (None, 34, 34, 32) 0
g2D)
conv2d_34 (Conv2D) (None, 32, 32, 64) 18496
max_pooling2d_35 (MaxPoolin (None, 16, 16, 64) 0
g2D)
dropout_11 (Dropout) (None, 16, 16, 64) 0
flatten_11 (Flatten) (None, 16384) 0
dense_40 (Dense) (None, 1024) 16778240
dense_41 (Dense) (None, 512) 524800
dense_42 (Dense) (None, 256) 131328
dense_43 (Dense) (None, 5) 1285
why don't you show the model? may be you fucked up your loss.
What activation function is on your last dense layer?
huh, that is surprising
just give lr to say 0.3 for now and see.
Can you give an example of some of the outputs of the model?
i think its cause the rescaling is a linear transform of x_hat where alpha is a constant, so minimizing y_i - w*(x_hat_i*alpha) is the same as minimizing y_i - (w*alpha)*x_hat_i
Are they all exactly the same or do you get different results?
if i dont put any learning rate what is the default? if i put only like optimizer='adam'
this is the prediction of gt_model why its above 1?
it's not above 1
they are not this is with a different input
o those e- hahaha
hm so you are applying softmax, it is converting to a probability distribution, so it is doing what you are telling it to do.
Okay, well I think your model is maybe alright then, true positive etc are probably just not a good metric for this
its the only metric i know like accuracy but my data is not balance so i searched that its good to see if precision and recall is also good
your expected y is like... [0,0,0,1,0..] ?
yeah each column represent each class right? that is what hot label is right?
Yeah thats right, its fine
i am retraining it now its looks ok
maybe the learning rate i entered is bad
i just now used the default
What was it originally?
the learning rate?
yeah
hm i guess you trained less(or less slowly). also do not use tp tn kinda metrics here, well they are not logical for multiclass.
when i used the default optimizer='adam' i get this good result and when i tried to change the learning rate to 0.007 thats where everything is negative
with same epochs? 0.007 is actually higher than adam's default of 0.001 in keras, so the steps may be too big for it to learn well.
but precision and recall is good? they are using tp tn in calculating right or its still ok?
btw whats this number ?
precision and recall..well we usually use them for 2 classes(if im not mistaken(i can be wrong))
probably the number of rows it has completed.
you can use tp/tn/fp/fn but its a bit of a pain, you have to do it treating each class individually as a binary classification problem to get the F1 score for EACH class
rows? what rows?
Then you can use the F1 from each class to get the macro/micro f1 scores
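A small sketch of that one-vs-rest idea in plain NumPy, on made-up labels (`per_class_f1` is a hypothetical helper, not a Keras function): for each class, count tp/fp/fn as if that class were "positive" and everything else "negative", compute F1, then average the per-class scores for macro F1.

```python
import numpy as np

def per_class_f1(y_true, y_pred, n_classes):
    """One-vs-rest F1 for each class; labels are integer class ids."""
    f1 = np.zeros(n_classes)
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1[c] = 2 * precision * recall / (precision + recall)
    return f1

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
scores = per_class_f1(y_true, y_pred, n_classes=3)
macro_f1 = scores.mean()
print(scores, macro_f1)
```

With one-hot targets and softmax outputs, `argmax` over the last axis gives the integer labels this expects. scikit-learn's `f1_score` with `average="macro"` does the same job if you would rather not hand-roll it.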
It's the batches
wait wait lets talk it slowly
precision is the confidence of the model in predicting the right class
recall is the score of the model on how much its getting wrong prediction?
f1 score is?
a measure of accuracy calculated from precision and recall (their harmonic mean)
how could I calculate the (standard deviation)*7 for the first seven days, then the next 7 days then the 7 days after that, to get the weekly volatility from the column log returns. And put it into a dataframe, please, thanks in advance
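One way to sketch this, assuming the rows are daily and already in date order, and with random numbers standing in for the real `log returns` column (named `log_returns` here): number the rows in blocks of seven, take the standard deviation per block, and scale it. The factor of 7 is taken from the question as written; a common convention scales by the square root of the number of periods instead, so pick whichever matches your definition.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"log_returns": rng.normal(0, 0.01, size=21)})  # made-up data

scale = 7  # as asked; use np.sqrt(7) for the square-root-of-time convention
week = np.arange(len(df)) // 7  # 0,0,...,0,1,1,... one label per 7-row block

weekly = (df.groupby(week)["log_returns"].std() * scale) \
    .rename("weekly_volatility").to_frame()
weekly.index.name = "week"
print(weekly)
```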
what does it mean should i get a high score or low with that?
1 is best, 0 is worst
btw its doing good now model is being genius
Yeah thats doing really well now
A little bit of hyperparam optimisation and you could probably bump it up another 0.5%
aside from learning rate what else could i explore to bump up the model?
hyperparameter optimization is like trial and error right?
since this is image you can use pretrained models and add another layer on them
try this if its good maintain if bad try other is it how it works?
It can be trial and error, thats the simplest way, You could technically automate it
You could try different optimizers, different layer sizes
diffferent activation functions
loads of stuff
no in this i need to use 2 from scratch and identical models for comparing
damn maybe next time ahahha
could someone help me out please
its pretty high because the train and test is the same images hahaha
im just still trying to explore and understand stuff like this batches i dont understand
is the the image being processed by the model per epoch?
How many images are you training?
Each epoch is 1 full round of training after every image has been used
jesus what? don't you split?
6k
also i rescaled it its the same as normalize? right ? its good because it makes the training faster right?
you have a lot more options in image generator btw.
When training through a single epoch, it does it in smaller batches of images
giving more rotated and zoomed out and inned and blurred images, it helps your model.
my folder is arranged like this
oh right i already did that i want to compare different augmentation techniques thats why i want to create 2 identical models without pre trained weights to make sense i guess hahaha
The fact that you don't have separate training and test data is a problem though
i set the batch to 5 and on model.fit() i set this stuff i found it on tutorial
what does it mean
So it will train 5 images at a time
by giving same training and testing data, you are literally ignoring that overfitting exists in ML.
6k/5 ~=1360
this is not the complete dataset, i'm just trying this stuff out and learning. when i get the custom dataset i need, i'll do it the proper way, with the train/test split and all
Yeah, I just mean you can't tell if your model is good or not until you do that
and per epoch it will use 1360 images?
1360 batches**
Each epoch is basically a load of smaller training steps
I would probably leave that as 32(default)
yeah that is one problem hhaha it may be worst but atleast i can tinker it a little now somehow hahah
oh so it really the batch but with 5 images i get it now is it also considered as hyperparameter?
if i tweak the batch will it affect performance or just training speed?
Yes and Both
32 is the default batch dang so i am doing it low hgahaha
It probably wont make much of a difference tbh, plus there's not much point making hyperparam optimisations until you have the separate test set
if i dont include batch_size in flow_from_directory() then the default is 32?
atleast am getting some ideas hahaha i hope you guys will be here when i got the images 😅 👍
If I'm having my custom layer in keras, how can i add non linearity function?
btw this means half of the weights are ignored right ?
yeah and dropout is another hyperparameter
(add randomly)
oh yeah its randomly selected every input ?
is it by input?
or pass?
or batch?
whats the term?
I would assume each batch but i'm not sure
meaning half of the 16384 are only activated?
The layer before
in images batch works like getting image on top of each other right or am i wrong?
No, thats wrong
for each batch yes.
also refs
https://stackoverflow.com/questions/40061258/neural-networks-how-often-is-dropout-filter-updated
which part its gets ignored the activations?
Basically, each epoch the model will train on 5 images, then the next 5, then the next 5 until all images have been trained
before flatting?
Before being passed into flatten_12 yes
It will basically set half of the values passed into flatten to 0
not each epoch you mean
but 5 by 5 yeah.
in imagination the 16,16,64 is being flattened to be 16384 but the half of it is ignored and dont apply activations for the dense_44?
It will train all images over a single epoch, but it does it 5 at a time in smaller training steps is what i mean
if you had somethign like [[1,2][3,4]], it might pass something like [0,2,0,4] to the dense layer
it will become zero or just example?
will become 0
i thought its not getting passed
It has to be the same shape still, it can't just not pass values
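A toy NumPy version of the idea, not Keras's actual implementation: draw a random keep/drop mask with the same shape as the activations, zero the dropped ones, and scale the survivors by 1/(1 - rate) so the expected total stays the same. As far as I know, Keras's `Dropout` layer applies this kind of scaling during training and passes values through unchanged at inference.

```python
import numpy as np

def dropout(x, rate, rng):
    """Zero out roughly `rate` of the entries and rescale the rest (training-time behaviour)."""
    keep = rng.random(x.shape) >= rate  # True where the value survives
    return x * keep / (1.0 - rate)

rng = np.random.default_rng(0)
activations = np.array([[1.0, 2.0], [3.0, 4.0]])
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)  # same shape; each entry is either 0 or twice the original
```

A new mask is drawn on every call, which is why the pattern of zeros changes from one training step to the next.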
oh i see it will retain the units
btw on f1 score if i got the precision and recall i just use to for the formula then thats the f1 score of my model?
I think keras can calculate the f1 score for you
not being passed requires messing up with the shape, 0ing out is faster.
Basically no, you have to calculate the f1 for each class
Then use those f1 scores to get the macro-f1 score for the model
TensorflowAddons has f1Score, not yet in tf.
So I would use whatever keras has built-in to do it for you (if it has any)
how do i do that? how can i know the precision and recall for each class?
is this it?
tfa.metrics.F1Score
this is probably will come from the model.fit right? how can i retrieve the y_true , y_pred on training ?
or if i add it here, will it work automatically?
i dont even know how the model knows that metrics isnt its just argument to fit()? is the keras.model.sequential expecting those metrics already maybe?
Generally i only use the default metrics so you will have to play about and see what happens
If you are only bothered about the accuracy, just use that
oh i see but if i did it this should be what you mean by f1 score by class?
Probably? I can't be bothered to do the math in my head
But like i said, if you don't specifically need it, just stick to accuracy
hahah sorry i think it is the score for each class
i may or may not need it hahaha
It is the f1 for each class im fairly sure
btw my model is predicting 1s and 0s is it because its confident? or i can change it so that it outputs the probabily for each classes?
What was the final layer?
softmax
It should be a probability distribution then, maybe it is really confident then
Infact yeah obviously, its because you are training and testing on the same data
It knows it is correct because it's been trained on the test data so it is super confident
haha yeah thats another problem hahaha
but if its softmax isnt it the spread of probabilities between classes?
so if i tried a real deal i will probably get like [0.1, 0.1, 0.2, 0.3 ,0.2]
thats what probability distribution right? or i understand it wrong?
Yeah thats right, except they add up to 1 not 0.9
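To see both behaviours side by side, here is softmax written out in NumPy with made-up logits: modest logits give a spread-out distribution, while one very large logit gives something that prints as 1s and 0s. Either way the outputs sum to 1.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

probs = softmax(np.array([1.0, 1.0, 2.0, 3.0, 2.0]))       # unsure model
confident = softmax(np.array([0.0, 0.0, 0.0, 20.0, 0.0]))  # very sure model

print(np.round(probs, 3), probs.sum())
print(np.round(confident, 3), confident.sum())
```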
ops hahaha i swear i counted it 1
anyways i learned i lot today thank you @acoustic halo@lapis sequoia
till next time 😅 👍 .
What is the difference between fit_transform & transform?
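import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2)
# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))  # learn column means from the training data, then fill
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))  # fill using the means learned from the training data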
Using Imputation to address Missing Values
Can we use transform instead of fit_transform for the above example as the second argument is not passed
It says it will first fit then it will return the transformed X. Can anyone tell me what it is going to do by fitting it
you must first fit a transform before using it to transform the data, so that it will have the necessary information about the data you're fitting
if you check the documentation, it should have some fields ending with _ that are "learned" from the data
Ok & what about transform? It is not fitting the data
You call .fit() and .transform() method on your train set but only call .transform() method on the test set.
So, with that said, fit_transform is a way to kill 2 birds with one stone.
try to comment out the line with fit_transform and see what happens if you try to transform without fitting
that said... if the transformer you're using is doing something extremely simple that does not require any information about the data, it is possible that they just included it for compatibility with the rest of the api
Ok but X_train & X_valid have the same features, so why is one transformed differently from the other
except that X_train consists of 80% data & rest is in X_valid
Because that's generally how it works. Remember we apply the information learned from the train set on the validation/test set.
Recall that we only call fit() method on the train set when training a model and then call .predict() method on the holdout set ( val/test set) when making prediction. If you understand this concept without confusion, then you'll realize it's pretty much the same principle that's applied when calling .fit_transform() method on train set and then calling .transform() method on the val set.
They are transformed the same - that is the whole point. For example, if you are Imputing missing values by using the average of that feature for the data set, you want to "fit" that average to be from the training set data only. You then "transform" the training data by filling in missing values with the average from the training data. Then in the validation and future prediction input data, you "transform" them in the same way - by filling in missing values with the average of the training data.
like has been mentioned. the fit_transform method is just for convenience. You could also use fit and transform called separately, both on the same dataset.
however you definitely do not want to call fit_transform or fit on the validation, testing, or future prediction data, because then you've changed your model pipeline.
Ok got it. we want to fit that average from the training data set only makes sense. Thanks @neat anvil , @odd meteor
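To make the fit/transform split above concrete, here is a minimal sketch. The tiny arrays are made up for illustration and are not from the original question; it relies on `SimpleImputer` defaulting to the mean strategy and exposing the learned values as `statistics_`.

```py
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up toy data: one feature, with a missing value in each split
X_train = np.array([[1.0], [3.0], [np.nan]])
X_valid = np.array([[np.nan], [10.0]])

imputer = SimpleImputer()  # default strategy is "mean"

# fit_transform learns the column mean from X_train, then fills X_train with it
imputed_train = imputer.fit_transform(X_train)

# transform reuses the mean learned from X_train; nothing is re-learned from X_valid
imputed_valid = imputer.transform(X_valid)

print(imputer.statistics_)  # the "learned" field ending with _ (training mean)
print(imputed_valid)        # the NaN is filled with the training mean, not the validation mean
```

If `fit_transform` were called on `X_valid` instead, the missing value would be filled with the validation mean, which is exactly the pipeline change warned about above.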
I have the following code which I've got from a book
```py
import matplotlib.pyplot as plt

mentions = [500, 505]
years = [2017, 2018]

plt.bar(years, mentions, 0.8)
plt.xticks(years)
plt.ylabel("# of times I heard someone say 'data science'")

# if you don't do this, matplotlib will label the x-axis 0, 1
# and then add a +2.013e3 off in the corner (bad matplotlib!)
plt.ticklabel_format(useOffset=False)

# misleading y-axis only shows the part above 500
plt.axis([2016.5, 2018.5, 499, 506])
plt.title("Look at the 'Huge' Increase!")
plt.show()
```
and this is the graph that is produced
but when I comment out that line, this is the graph that is produced
What exactly is the difference?
there may be none - are you using the exact same version of matplotlib as the code in the book?
If not, the behavior may be different
could have: https://pypi.org/project/matplotlib/#history
I feel like it wouldn't even necessarily require a different mpl version; the defaults might be system-dependent.
yeah - it'd be very likely for the book author to have some plotting configuration saved on their computer that they were using to make sure plots looked good in print.
this is why we have Docker
. Plot a 30Hz sine wave, with an amplitude of 40mV peak-to-peak over 3 cycles. Select an appropriate step size for the plot and add an appropriate title and axis labels with units.
- Plot a 30Hz cosine wave, with an amplitude of 40mV peak-to-peak over 3 cycles. Select an appropriate step size for the plot and add an appropriate title and axis labels with units. Use the same worksheet, and the same time column from Q1.
trying this on excel
any tips
guys i got one doubt, is it necessary to divide the image /255 and also use Normalize from torchvision transforms ?? isn't /255 normalize the image and convert the pixels range to 0-1 ? then why do we have to use Normalize again or am i mistaken ?
Normalize doesn't act to make the range of values from 0 to 1, but to get a specific mean and std (usually 0 and 1)
observe that if all the values are from 0 to 1, the std is definitely less than 1, and the mean is probably higher than 0
ohhh
0~1 is something like MinMax if I recall correctly?
yeh
if it's image data then you usually divide the image by 255
so i should divide the image by 255 first and then get the mean and std to normalize the image ?
yeah, though you could skip the division too, really - Normalize can take it, it's a linear transformation after all
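A quick way to see the "linear transformation" point is a NumPy sketch. The random image and the `normalize` helper below are stand-ins written for illustration; the helper only mirrors the per-channel `(x - mean) / std` arithmetic that torchvision's `Normalize` performs, it is not the torchvision code itself.

```py
import numpy as np

rng = np.random.default_rng(0)
# Fake channels-first image with raw pixel values in 0..255 (illustrative data)
raw = rng.integers(0, 256, size=(3, 4, 4)).astype(np.float64)

def normalize(img, mean, std):
    """Per-channel (img - mean) / std."""
    mean = np.asarray(mean).reshape(-1, 1, 1)
    std = np.asarray(std).reshape(-1, 1, 1)
    return (img - mean) / std

# Route 1: divide by 255 first, then normalize with the stats of the scaled image
scaled = raw / 255.0
mean = scaled.mean(axis=(1, 2))
std = scaled.std(axis=(1, 2))
a = normalize(scaled, mean, std)

# Route 2: skip the division and scale the stats by 255 instead
b = normalize(raw, mean * 255.0, std * 255.0)

print(np.allclose(a, b))                       # both routes give the same result
print(a.mean(axis=(1, 2)), a.std(axis=(1, 2)))  # per-channel mean ~0 and std ~1
```

Note that in a typical torchvision pipeline `ToTensor` already scales uint8 images to [0, 1], so published mean/std values usually assume that range.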
hi guys
this is an assignment given to me to get an internship
i have never done any of these
and i am still learning
what would be a good resource or a starting point?
is there any particular tool/library i should learn or use?
pls guide me
you can go through andrew ng's deep learning CNN course
would be a good resource to get into computer vision
interesting problem 
any model trained on ImageNet should be sufficient. but yeah theres tons of resources out there
Hi, so I'm trying to line-graph music albums. I want the Y-axis to be the length of the longest album I'm going to use, and every song on the album to be dotted out at the right time on the timeline. I have no idea how to go about this at all. Can anyone point me in the right direction? Don't really know the right terms to search
i will draw an image
ok
is there anything specifically regarding the assignment i can read?
Hi, so I'm trying to line-graph music albums. I want the Y-axis to be the length of the longest album I'm going to use, and every song on the album to be dotted out at the right time on the timeline. I have no idea how to go about this at all. Can anyone point me in the right direction? Don't really know the right terms to search (see a drawing of what I'm trying to do https://i.imgur.com/Pd13HsK.jpeg )
Now @misty flint hope this makes it more clear.
this is one of those cases where i actually think a bubble chart would be better
where the size of the bubble can indicate one of your axes
since youre technically comparing 3 things at once
hmm but if youre insistent, you could just graph this using excel, powerBI, or tableau
or any other BI tool
if you want to use python, matplotlib, seaborn, plotly
etc.
i recommend tableau or plotly
Im doing this just to learn so i want to use python.
Thanks a lot i will check them out. i was using matplot first
that one is ok. its just too ugly for me sometimes lol
haha, ye it doesn't look too twenty first century 😄
Thanks for the help.
Hi, could you help me with this code? https://paste.ofcode.org/HGDkzK3za3nJV2TDLXenKP How to save data from run_mcmc separately to a file? Thank you
This is the current output. After one iteration, I would like to save the np array of positions and one value for ln probability. After another iteration I would like to append to the file.
Hey, I'm building a script to do an analysis of a text to find lines that corresponds to a given subject matter. For example to grab all lines that correspond to a subject matter of ”love” or ”hate”, or maybe even something more specific like ”I have just fallen in love” or ”I’m feeling lonely”. Anyone know a library that would help me to do such analysis?
it would probably be easier to save 1 file per iteration
i don't think numpy data files are "extensible"
or you can save to hdf5, which does allow you save multiple arrays in the same file
just call it mcmc_output_1.npz, mcmc_output_2.npz etc
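A rough sketch of the one-file-per-iteration idea. The random `positions` and `ln_prob` values are placeholders for whatever the sampler returns (the actual `run_mcmc` call from the paste is not reproduced here), and the temporary output directory is only there so the example cleans up easily.

```py
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
out_dir = Path(tempfile.mkdtemp())  # placeholder output folder

n_walkers, n_dim = 4, 2  # made-up sizes for illustration
for iteration in range(1, 4):
    # Stand-ins for the sampler output of this iteration
    positions = rng.normal(size=(n_walkers, n_dim))
    ln_prob = float(rng.normal())

    # One .npz per iteration, holding both the array and the scalar
    np.savez(out_dir / f"mcmc_output_{iteration}.npz",
             positions=positions, ln_prob=ln_prob)

# Later: read everything back and stack it into single arrays
files = sorted(out_dir.glob("mcmc_output_*.npz"))
loaded = [np.load(f) for f in files]
all_positions = np.stack([d["positions"] for d in loaded])
all_ln_prob = np.array([float(d["ln_prob"]) for d in loaded])
print(all_positions.shape, all_ln_prob.shape)
```

With ten or more iterations, plain name sorting puts `_10` before `_2`, so zero-padding the counter (e.g. `{iteration:04d}`) would keep the files in order.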
does anyone know how to take a data frame in python and convert it into a table in excel?
Sorry, what is a workbook? An excel file with multiple pages?
Save a dataframe as an excel file. Look for ExcelWriter in the pandas docs @ornate ore
I'm on mobile or I'd find the link for you
It's easy though.
okay thank you
Has anyone gotten OpenAI to work?
I’ve downloaded it but it won’t let me run the Lunar Lander bc of Box2D not valid.
There are toy datasets on Kaggle
how can i get output
{ "market@qq.com" : {"Occurance":2, "EmailType": "Non-Human"} ,"jeff.peterson@b2bsearch.com" : {"Occurance":1, "EmailType": "Human"}}
in this format
in nested json
Is this a data science question? I'll only look at the code as actual text
!code
Here's how to format Python code on Discord:
```py
print('Hello world!')
```
These are backticks, not quotes. Check this out if you can't find the backtick key.
import re
from collections import Counter
file = open('websiteData.txt', 'r', encoding='utf-8')
f = file.read()
h = re.findall('[A-Za-z0-9.+-]+@[A-Za-z0-9.-]+.[a-zA-Z]*', f)
count=Counter(h)
print(count)
for i in h:
    a = i.split('@')
    if (len(a[0]) <= 8):
        print('Company Email:',i)
    else:
        print('Human Email:',i)
Next time, use the markdown format.
import re
from collections import Counter
file = open('websiteData.txt', 'r', encoding='utf-8')
f = file.read()
h = re.findall('[A-Za-z0-9.+-]+@[A-Za-z0-9.-]+.[a-zA-Z]*', f)
count=Counter(h)
print(count)
for i in h:
    a = i.split('@')
    if (len(a[0]) <= 8):
        print('Company Email:',i)
    else:
        print('Human Email:',i)
Please copy and paste an example from websiteData.txt as text.
Hey @forest bluff!
You either uploaded a .txt file or entered a message that was too long. Please use our paste bin instead.
Emonics LLC
Senior Big Data Engineer
Emonics LLC
Vancouver, British Columbia, Canada
send email to hr@vancuemonics.com
likecelebratesupport agent syneca.gregory@gmail.com
@serene scaffold
I think this pattern is better: [\w\.+-]+@[\w-]+\.[A-Za-z]+
is the length of the website name really what tells you if it's a company or not?
you can also add named groups: (?P<name>[\w\.+-]+)@(?P<site>[\w-]+\.[A-Za-z]+)
ok
!e
import re
text = """Emonics LLC
Senior Big Data Engineer
Emonics LLC
Vancouver, British Columbia, Canada
send email to hr@vancuemonics.com
likecelebratesupport agent syneca.gregory@gmail.com"""
emails = re.finditer(
r"""(?P<name>[\w\.+-]+)@(?P<site>[\w-]+\.[A-Za-z]+)""",
text
)
for email in emails:
print(email.groupdict())
@serene scaffold :white_check_mark: Your eval job has completed with return code 0.
001 | {'name': 'hr', 'site': 'vancuemonics.com'}
002 | {'name': 'syneca.gregory', 'site': 'gmail.com'}
@forest bluff see?
anyway, I would put every email in a list, and then pass that list to Counter
okay
and then you can make a new dict that's {email: {count: n, type: human/company}}
by iterating over the counter and checking the end of the email string
!e
import re
text = """Emonics LLC
Senior Big Data Engineer
Emonics LLC
Vancouver, British Columbia, Canada
send email to hr@vancuemonics.com
likecelebratesupport agent syneca.gregory@gmail.com"""
emails = re.finditer(
r"""(?P<name>[\w.+-]+)@(?P<site>[\w-]+\.[A-Za-z]+)""",
text
)
for email in emails:
print(email.groupdict())
@tranquil glacier :white_check_mark: Your eval job has completed with return code 0.
001 | {'name': 'hr', 'site': 'vancuemonics.com'}
002 | {'name': 'syneca.gregory', 'site': 'gmail.com'}
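Putting the Counter and nested-dict suggestion together, a minimal sketch could look like this. The sample text is adapted from the pasted example with one address repeated, and the `PERSONAL_DOMAINS` rule for deciding "Human" vs "Non-Human" is purely an assumption for illustration; the real rule is up to whoever defines the email types. The `Occurance` key is spelled the way the requested output spells it.

```py
import json
import re
from collections import Counter

# Sample text adapted from the pasted example (one address repeated on purpose)
text = """send email to hr@vancuemonics.com
likecelebratesupport agent syneca.gregory@gmail.com
or contact hr@vancuemonics.com again"""

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[A-Za-z]+")

# Assumption: addresses on common free-mail domains are treated as "Human"
PERSONAL_DOMAINS = ("@gmail.com", "@yahoo.com", "@outlook.com")

# Every email goes in a list, and the list goes to Counter
counts = Counter(EMAIL_RE.findall(text))

# New dict built by iterating over the counter and checking the end of each email
result = {
    email: {
        "Occurance": n,
        "EmailType": "Human" if email.endswith(PERSONAL_DOMAINS) else "Non-Human",
    }
    for email, n in counts.items()
}

print(json.dumps(result, indent=2))
```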
what would be the best way to go about making a new column in a data frame based on the values of some other columns?
for example if i want to make New based on A and B and the logic being
New_i = 1 if A_i == 1 or B_i == 1 else 0
| A | B | New |
0 0 0 0
1 0 1 1
2 1 0 1
There are several ways:
One way is to do
```py
df['new'] = df.apply(lambda x: 1 if (x.A==1) or (x.B==1) else 0, axis=1)
```
But it will execute row by row and that may be slow if data is big
Taking advantage of vectorization you can do:
```py
df['new'] = ((df.A==1) | (df.B==1)).map({True: 1, False: 0})
```
But still the map part is not vectorized (it will be row by row). So, an even faster way will be:
```py
df['new'] = 0
df.loc[(df.A==1) | (df.B==1), 'new'] = 1
```
Just checked in my pc, with a dataframe of size 10k.
Method 1: 94.7 ms ± 1.27 ms
Method 2: 1.35 ms ± 79.2 µs
Method 3: 555 µs ± 146 µs
So yup, as always in python, avoid for loops D:
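For anyone wanting to check that the three methods agree, here is a small sketch on the example table from the question. The variable names are made up for the comparison, and the last line adds one more vectorized option, `astype(int)` on the boolean mask, which was not one of the methods timed above.

```py
import pandas as pd

# The example table from the question
df = pd.DataFrame({"A": [0, 0, 1], "B": [0, 1, 0]})
mask = (df.A == 1) | (df.B == 1)

# Method 1: row-by-row apply
by_apply = df.apply(lambda row: 1 if (row.A == 1) or (row.B == 1) else 0, axis=1)

# Method 2: vectorized comparison, then map booleans to 1/0
by_map = mask.map({True: 1, False: 0})

# Method 3: start from 0 and overwrite the matching rows
by_loc = pd.Series(0, index=df.index)
by_loc.loc[mask] = 1

# Extra option: cast the boolean mask straight to integers
by_astype = mask.astype(int)

df["New"] = by_astype
print(df)
```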
I want to write my pandas data frame (1371980 rows × 2 columns) to CSV. If I use df.to_csv(path, index=False), it only writes 104000 rows. Can anyone help me with that? I am using Google Colab
cheers, works brilliantly
hi @serene scaffold
@serene scaffold any clue?
Find a pretrained object detection model and scan the video frame by frame?
i dont actually understand what that first point is about but for the second one, there are lots of tutorials , just look for object detection
nltk, spacy, gensim , these are the most commonly used libs for nlp task
Thank you!
:incoming_envelope: :ok_hand: applied mute to @compact stirrup until <t:1645615491:f> (9 minutes and 59 seconds) (reason: duplicates rule: sent 4 duplicated messages in 10s).
hey guys, how to put every word in a string into a list? The below image shows my example string
did they really just encourage list comprehensions in pandas
hi
does anyone have ai program that generate pictures from given sample
for example sample is folder with 1000 photos
and this program should generate similar photos
hola
^
hmm
let me be honest
it is possible, yes
but in that condition
it will be hardest thing to ever done
or maybe there is already something like this
on internet
but i just can't find anything like this
that used text instead of picture right?
no
any example
i don't think it will be done within 10k IMO
this is based on some actual testing done at the link in the image, which redirects here https://towardsdatascience.com/apply-function-to-pandas-dataframe-rows-76df74165ee4
i find it odd that he is using a line profiler and not just timing the execution
Hi everyone, I was wondering if anyone could help me find resources about a problem I have. Basically I'd like to plot a harmonic function and a scatter plot (on the same figure) and dynamically count how many times the harmonic function encounters randomly placed dots on the figure. I'd love to make it dynamic, in terms of seeing the curve being traced and having a counter at the top showing how many times it meets a dot on the graph. Thanks in advance!
does anyone know why to calculate softmax we use exp(x) / exp(x).sum(dim=1), why not use 10^x or 100^x, why we use specifically e**x???
This is essentially what I want to do (source: https://www.youtube.com/watch?v=DxfEbulyFcY&t=64s&ab_channel=SebastianLague in the Light Scattering section)
the natural logarithm specifically has a lot of very convenient mathematical properties
There are two aspects to this. The first is to count the number of times the function intersects a dot. That is unrelated to the "plotting" part, it's just math. Although keep in mind that the dots and/or the line must have some spatial extent, otherwise the probability of intersection is essentially 0. The second part is animation, which is built into matplotlib.
Would it be possible to have a counter during the animation tho? so getting real-time information from the graph? or highlighting the concerned dots from time to time?
can you please give an example of what you exactly mean?
basically I don't really know where to start on that, I can make a basic program with two harmonic functions, make a 2d array of some kind with random points and check how many times each function crosses a dot from the array
but how could I specify that the size of a dot isn't 0 (basically just a geometrical position) but a real dot with a radius ? should I make a specific function that creates a circle inside an array and then apply that randomly to each cell of the global array?
The derivative of the exponential function with base e is the exponential function with base e
And of course the natural logarithm is the inverse of an exponential function with base e
yes, forgot about that, simplifies a lot the calculations, thank you
I think 3b1b has some good video about why it's so "natural"
Maybe not him, but I know somebody does
thanks for the tip, I'll look into that
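One extra way to look at the base question: since b^x = e^(x ln b), a softmax with any other base b is the same as the usual softmax applied to logits multiplied by ln(b). The sketch below checks that numerically; the `softmax` helper and its `base` parameter are written here for illustration and are not a library function.

```py
import numpy as np

def softmax(x, base=np.e):
    """Softmax along the last axis using base**x instead of a fixed e**x."""
    x = np.asarray(x, dtype=float)
    # Subtracting the row max keeps exp() from overflowing and does not change the result
    z = (x - x.max(axis=-1, keepdims=True)) * np.log(base)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0]])

p_e = softmax(logits)              # the usual e**x version
p_10 = softmax(logits, base=10.0)  # the 10**x version asked about

print(p_e, p_10)  # both are valid distributions; base 10 is more peaked

# Base 10 is just base e on logits scaled by ln(10)
print(np.allclose(p_10, softmax(logits * np.log(10.0))))
```

So a different base only rescales the logits, which a network can absorb into its weights; e is the convenient choice because of the derivative and logarithm properties mentioned above.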
Yeah you can check collision if the function is within a certain radius of a point, The naïve but simple approach would be to write loops
my problem is that (not my only one), I don't know how to create a numpy array containing the dots in the first place
I would just :
Create array
Take random position
Apply some function to create dots of a certain radius there
return the array
Apply some function to create dots of a certain radius there is my issue
All about ln(x).
Full playlist: https://www.youtube.com/playlist?list=PLZHQObOWTQDP5CVelJJ1bNDouqrAhVPev
Home page: https://www.3blue1brown.com
you don't have to do that, you can basically just check which points are within radius r of the "head" of the advancing harmonic function
😳
and "advancing" in this case just means looping through the pre-computed function
ho ye !
And is it possible in an animated graph to get the current value of the point at a certain time, and then apply the function on that point to see if there is any nearby circles ?
I feel like this would be very slow, so maybe using time steps ?
again, do this without worrying about animation first
just write a loop that prints a counter
yes, always
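A minimal sketch of that no-animation-first approach, assuming a sine wave as the harmonic function. The function name `count_hits`, the number of dots, the plot bounds, and the radius are all made-up choices for illustration. Each dot is counted once, the first time the head of the curve comes within the radius.

```py
import numpy as np

def count_hits(xs, ys, dots, radius):
    """Walk along the precomputed curve and count dots that come within
    `radius` of its advancing head. Each dot is counted at most once."""
    hit = np.zeros(len(dots), dtype=bool)
    for x, y in zip(xs, ys):  # "advancing" = looping over precomputed samples
        dist = np.hypot(dots[:, 0] - x, dots[:, 1] - y)
        hit |= dist <= radius
    return int(hit.sum())

# Precompute the harmonic function
xs = np.linspace(0.0, 2.0 * np.pi, 500)
ys = np.sin(xs)

# Dots are just an (n, 2) array of random x, y positions; no need to draw
# circles into a grid, the radius is handled by the distance check
rng = np.random.default_rng(0)
dots = rng.uniform(low=[0.0, -1.5], high=[2.0 * np.pi, 1.5], size=(50, 2))

print(count_hits(xs, ys, dots, radius=0.1))
```

Once this works, the same loop body can run inside a matplotlib animation callback, updating a text label with the running count at each step.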
