#data-science-and-ml
well thats the index
are u sure the index correlates with the same class value?
I am not sure of that
I assumed it was because of the way it was trained
Is there a way I can find this out?
Can I do like model.labels[idx] or something?
Not exactly that
But somehow get the label for it
well it'll follow the datasets label format
oh wait this is sparse, so that shouldnt be an issue
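On the index-vs-label question: with sparse integer labels, position i of the softmax output is the score for class i, so the predicted label is just the argmax. A minimal sketch, assuming a length-10 output vector; the `probs` values are made up and `class_names` is a hypothetical lookup that only matters when the labels are not already the digits 0-9.

```python
import numpy as np

# Hypothetical softmax output for one image (10 classes).
probs = np.array([0.0, 0.0, 0.002, 0.9946, 0.0, 0.0028, 0.0, 0.0006, 0.0, 0.0])

# With sparse integer labels, index i is the score for class i.
predicted_index = int(np.argmax(probs))

# Hypothetical lookup table; for MNIST the label is the index itself.
class_names = [str(d) for d in range(10)]
predicted_label = class_names[predicted_index]

print(predicted_index, predicted_label, float(probs[predicted_index]))
```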
Let me try with other numbers
yeah it might just be one bad one
Yeah something is weird
I would assume the model wouldn't be this bad
It is predicting 3 for everything
Lol
lol
PS C:\Users\Ryan\Desktop\ml-api> python .\predict.py
{'confidence': [1.6521667149409522e-18, 2.794477973674936e-12, 0.002019522013142705, 0.9946495890617371, 0.0, 0.002819732530042529, 2.1464751850941433e-11, 0.0005111345089972019, 3.013474395628175e-32, 9.863308066247621e-36], 'prediction': 3}
PS C:\Users\Ryan\Desktop\ml-api> python .\predict.py
{'confidence': [1.6521667149409522e-18, 2.794477973674936e-12, 0.002019522013142705, 0.9946495890617371, 0.0, 0.002819732530042529, 2.1464751850941433e-11, 0.0005111345089972019, 3.013474395628175e-32, 9.863308066247621e-36], 'prediction': 3}
PS C:\Users\Ryan\Desktop\ml-api> python .\predict.py
{'confidence': [2.4253387200669225e-17, 1.805287798591071e-12, 0.08808434754610062, 0.868319034576416, 0.0, 0.042480047792196274, 7.983710914594155e-11, 0.0011165498290210962, 1.889110677748859e-29, 2.5717616017503705e-33], 'prediction': 3}
PS C:\Users\Ryan\Desktop\ml-api> python .\predict.py
This is multiple runthroughs
All with different digits
I had this issue before where it was predicting 5 for everything
hm
mine is working fine lol
and its the same code as yours
whats the shape of the digit_data?
it should be sent as a batch size of one -> np.array([img])
yup thats the problem
just tested it
its probably not the same image when u reshape it
try plotting it before and after u reshape it
When I receive the image I do this to it
def read_digit(data: list, encoding: str) -> np.array:
    im = Image.frombytes(encoding, (28, 28), data)
    im.save('im.png')
    arr = np.array(im)
    arr = arr / 255.0
    return np.array([arr])
I just changed it
Still same
I am viewing the image I read
And it looks good
However I didn't consider what it looks like after I do arr = arr / 255.0
hm
yeah the reshaping looks fine after i looked over again, i forgot to update the ar values
normalizing shouldnt be an issue
Yeah I am very confused
Yeah
those probability output numbers look really similar actually, you might be running the same image into the detector each run
Unfortunately that isn't the case
hmm
did u try running the test image?
Yeah 1 sec
So it is supposed to be a 7
According to y_test
[[1.0039120e-06 4.1181980e-08 1.7208897e-05 1.5117071e-04 1.2190830e-10
5.2620344e-06 7.6247827e-11 9.9981230e-01 7.5138960e-06 5.5338201e-06]]
It is a 7 here
So the image input is somehow screwing up
yeah that's what i suspected
since the models running as expected
its likely that ur referencing the same img somehow
can i see ur full code for getting the images and feeding them into the model?
import requests
from PIL import Image
import numpy as np
from io import BytesIO
import tensorflow as tf
model = tf.keras.models.load_model('mnist.model')
def read_digit(fname):
    print(f'LOADING {fname}')
    img = Image.open(f'example-digits/{fname}').convert('L').resize((28, 28))
    return BytesIO(img.tobytes())
img = read_digit('4.png')
r = requests.post('http://127.0.0.1:5000/MNIST/predict', files={'file': img})
if r.status_code == 200:
    print(r.json())
else:
    print(f'Error: {r.status_code}')
I checked the fname and it changes
I can even show the image to make sure
This is what gets sent with the 4
Aka the current code
Is the stroke weight of the images not high enough?
Maybe it needs to be thicker?
hm it could be, tho that doesn't explain why ur getting the same result
i would understand if the model performance is bad, but the probabilites are too similar each run
Yeah I ran through a couple and the proper image is being loaded
I even check on the API side
And it is the same thing I sent
Correct
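One quick way to confirm that the bytes leaving the client match the bytes arriving at the API is to print a short hash on both sides. This is only a sketch: `digest` and the two payloads are hypothetical stand-ins, not part of the code posted above.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return a short SHA-256 fingerprint of the raw image bytes."""
    return hashlib.sha256(data).hexdigest()[:12]

# Hypothetical payloads standing in for two different 28x28 grayscale images.
payload_a = bytes([0] * 784)
payload_b = bytes([0] * 783 + [255])

# Print this on the client before sending and on the API after receiving;
# identical digests on both sides mean the bytes arrived unchanged.
print(digest(payload_a), digest(payload_b))
```

If the digests differ between runs but the prediction does not, the transport is fine and the problem is in how the pixels are prepared.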
Let me show what an image from the dataset looks like
Lets see if its way thicker
Wait...
test data array looks like
[0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0.3254902 0.99215686 0.81960784 0.07058824 0. 0.
0. 0. 0. 0. ]
[0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.08627451
0.91372549 1. 0.3254902 0. 0. 0.
0. 0. 0. 0. ]
Meaning white is 0
from the test dataset or from ur images?
So rn mine is inverted
Test dataset
So anything that is 255 needs to be 0
And then the blacks get normalized down to 0 to 1 scale
? but the number isnt white
Well there is lots of 0 which I assume is supposed to be the background
This is just 2 rows actually
I tried opening this as image
And its pure black
Even after doing it *= 255
To bring it back
first_x *= 255.0
img = Image.fromarray(first_x, mode='L')
img.show()
Yeah so I made sure
0 is background
Values > 0 is digit pixels
I didn't mess with print options a lot
But you can see the numbers make a 7 lol
Which means I need to prep my data differently
255 (white) -> 0
Anything else gets scaled
Yeah @flat quest I am getting different values now but still some are wrong
are most still correct tho?
When I send a 1, I get this
lol u shouldnt be getting 0's for all values, those probabilities should sum to 1
This is the pixel values
So the actual pixels that make up the digit get scaled between 0 and 1
And background is 0
And I need to somehow encode the image to get it in a 28x28 array so it isn't rgba
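The prep described here (white background becomes 0, digit pixels scaled into 0..1, one image sent as a batch of one) can be sketched in plain NumPy. `prepare_digit` is a hypothetical helper, and it assumes the input is already a single-channel 28x28 uint8 array; the `.convert('L')` call in the client script is what collapses RGBA to one channel.

```python
import numpy as np

def prepare_digit(gray: np.ndarray) -> np.ndarray:
    """Convert a 28x28 uint8 image (dark digit on white) to model input."""
    if gray.shape != (28, 28):
        raise ValueError(f"expected (28, 28), got {gray.shape}")
    inverted = 255 - gray.astype(np.float32)   # white (255) -> 0, black (0) -> 255
    scaled = inverted / 255.0                  # into the 0..1 range
    return scaled[np.newaxis, ...]             # add the batch dimension: (1, 28, 28)

# Hypothetical input: white canvas with one black pixel.
canvas = np.full((28, 28), 255, dtype=np.uint8)
canvas[14, 14] = 0
batch = prepare_digit(canvas)
print(batch.shape, batch.min(), batch.max())
```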
Managed to invert the image so it is what the model expects, but still nada
Oh well, will try tomorrow
yeah lol this is a lot harder than expected to solve
This is the image I am sending
Before it gets normalized by dividing by 255
The downscaling is messing with it a lot
I might need to multiply all the pixels that aren't hard black up by some factor
Sending this through works
So it is 100% a downscaling issue @flat quest last time I tag you sorry
Just thought you might wanna know
I drew the digit in a 28x28 px window so it didnt need to downscale
And of course it works
yeah no all good
tho i still don't get why it doesnt work tbh
I think it is too noisy
i wouldve expected that a model could work on lighter images of numbers but i guess it just needed to be trained on lighter images as well
Yeah I think the dataset has a lot of large stroke images
So I might just do like
If its not black scale the white up to 255
Or set a very low threshold
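The low-threshold idea could look roughly like this. It assumes the image is already inverted and scaled to 0..1 with background at 0; `boost_strokes` is a hypothetical name and the 0.1 default is an arbitrary starting point, not a tuned value.

```python
import numpy as np

def boost_strokes(img: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Push any pixel above `threshold` to full intensity; zero the rest."""
    return np.where(img > threshold, 1.0, 0.0).astype(np.float32)

# Hypothetical faint strokes left over after downscaling.
faint = np.array([[0.0, 0.05, 0.3], [0.6, 0.0, 0.12]], dtype=np.float32)
print(boost_strokes(faint))
```

A hard threshold throws away the soft edges that the training images do have, so it may be worth comparing against simply multiplying the non-zero pixels and clipping at 1.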
the stroke and also the color might be an issue
u could add those kinds of images in ur training data
through data augmentation
its basically transforming some of ur data so u have a greater variety
u might change the brightness, rotation, shear, randomly change a color channel, etc.
it allows the model to learn more types of objects
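As a rough illustration of augmentation, here is a NumPy-only sketch that dims and shifts a digit at random. The ranges (brightness 0.4 to 1.0, shifts of up to 2 pixels) and the `augment` helper are assumptions chosen for the example, not values from this conversation.

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly dimmed and shifted copy of a 2D image in 0..1."""
    brightness = rng.uniform(0.4, 1.0)        # fade the strokes
    dy, dx = rng.integers(-2, 3, size=2)      # shift by up to 2 pixels each way
    # np.roll wraps around the edges; harmless here because the border is empty.
    shifted = np.roll(img, shift=(int(dy), int(dx)), axis=(0, 1))
    return np.clip(shifted * brightness, 0.0, 1.0)

rng = np.random.default_rng(0)
digit = np.zeros((28, 28), dtype=np.float32)
digit[10:18, 13:15] = 1.0                     # hypothetical vertical stroke
variants = np.stack([augment(digit, rng) for _ in range(4)])
print(variants.shape)
```

Keras also ships preprocessing layers such as `RandomRotation` and `RandomTranslation` that do this inside the model, which is usually less work than rolling your own.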
I have two images that are binary images and need to find the similarity. I was looking at using Pearsons (np.corrcoef). how do I go about doing this
what am I putting in the X and Y
Hey guys
Just reshape your images into column vectors and throw them in.
what you need is the inner dot product
look up cosine similarity
but what sort of images are these
what do you mean by binary images
@lapis sequoia is the outer product ever referred to as dot?
I thought that was only used for inner product
you're right
inner product is the dot product.. inner dot product is a weird way of saying it
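Both suggestions come down to flattening the two images and comparing the vectors. A small sketch, with made-up 3x3 binary images: `np.corrcoef` returns a 2x2 matrix, so the X and Y are just the two flattened images and the answer is the off-diagonal entry.

```python
import numpy as np

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between two same-shaped images."""
    # np.corrcoef returns a 2x2 matrix; entry [0, 1] is r(a, b).
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the norms."""
    x, y = a.ravel().astype(float), b.ravel().astype(float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical binary images (0 = background, 1 = foreground).
img1 = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0]])
img2 = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 1]])
print(pearson(img1, img2), cosine(img1, img2))
```

Pearson is undefined (NaN) when one image is constant, and cosine is undefined when one is all zeros, so both need a guard for blank images.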
can i install postgresql + pgadmin so i can have a web ui thing on an ubuntu 18.04 vps
you want #databases
oh i misread im dumb sorry
How do I delete rows in which birthYear is \N? dropna() doesnt work
@faint furnace \N is not a missing value. dropna() removes only missing ones. Just drop
i was trying something like this but its not even recognising it
is it because my "birthYear" is object type?
it didnt give error but didnt remove the \
maybe it change it only in the first occurance?
Ohh
now this removed everything
noice
i did this change /N to NaN
but it didnt consider it as empty value
try replacing by np.nan? ‘NaN’ is still a string
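To make the two options concrete, here is a sketch on a made-up frame. The column names and values are hypothetical; `"\\N"` in Python source is the literal two-character string backslash + N.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the "\N" marker stored as a plain string.
df = pd.DataFrame({"name": ["a", "b", "c"], "birthYear": ["1950", "\\N", "1987"]})

# Option 1: filter the rows out directly.
kept = df[df["birthYear"] != "\\N"]

# Option 2: turn the marker into a real missing value, then dropna works.
as_missing = df.assign(birthYear=df["birthYear"].replace("\\N", np.nan))
cleaned = as_missing.dropna(subset=["birthYear"])

print(kept)
print(cleaned)
```

If the data is loaded with `pd.read_csv`, passing `na_values="\\N"` should mark these as missing at load time, so `dropna()` works without the extra step.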
Hi, sorry to interrupt, can someone help with contour sorting? 😄
I built a custom dataset and it's quite accurate, I just need to sort from left to right now...
Any help would be greatly appreciated!
@faint furnace try posting a small data example, df.to_json() will enable you to do that
Hi, I dont know if someone can help me. I have a set of letters (they are amino acids), I have 6 of them. I want to get every combination of 6 possible, including just 20 repetitions of one of them. Perhaps the fact that there will be millions of them means that this is basically impossible?
note: i want the combinations to be 20 letters long
@solar phoenix what's the list?
The letters are I, A, G, L,F, V, M
list(itertools.permutations(['i', 'a', 'g', 'l', 'f', 'v', 'm']))
Thank you forcousteau helped me with that.
@faint furnace sure - i mean in general though, that is a useful thing to do
yea i actually am checkign what this line of code does
takes a little while to run it
rie this gives me a list that is 7 long
i want one that is 20 long
so for example, one output would be LLLLLLLLLLLLLLLLLLLL
there's one in there that matches that
oh hang on no, i don't understand why you would get that from a permutation of those elements
every combination of 6 possible, including just 20 repetitions of one of them
i don't follow
i want to create strings of length 20
that include absolutely every combination of those 7 elements
it is probably going to be too many isn't it
it's going to be a lot
ye
somthing like that
is too much
a LOT
what u need it for? xD
7^20 ? where's that from?
in length 20
there are 7! ways to arrange 7 things, you have 20 spaces... I'm trying to remember combinations etc 🤦‍♂️
Is it not 20**7?
so i want to make a list of all of them
maybe but with binary 10 digits it 2**10 ways, this has 7 states length 20
20^7, now where's that from?
20 * 20 * 20 etc for each char
Anyway, you are looking at itertools.combinations_with_replacement
Idk, 20^7 sounds off
20 options for each of 7 characters.
Wait now that I type it out that is the wrong way round
7 ** 20 indeed.
binary has 2 states so number of states is 2**length so with 7 states i thought it would be 7**length tho
yh, but tbh i just guessed the order, there was a 7 and a 20 and the answer is very big
Well how many ways are there to arrange the 7 characters?
it's going to be 7! right?
no exponent there
yep same letter valid
That's without replacement though.
ah yeah , shit
i was thinkning of it as a base 7 number problem with 20 digits, how many ways are there to arrange 5 decimal number 0-99999 = 100000 = 10(number base) ** 5 (length of number)
itertools.combinations_with_replacement(['i', 'a', 'g', 'l', 'f', 'v', 'm'], 20)
is that even a good idea to run? how long do u think that will take xD
0 seconds.
really!?
Since it doesn't actually make the strings right away untill you use them.
creates a generator
It creates a generator.
ah
oh ok thats a more efficient way to do it
so as long as u dont call list on it your ok
In [101]: len(list(itertools.combinations_with_replacement(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 20)))
Out[101]: 230230
Python can do that much yeah.
this is fine
thanks for this all
this is way smaller than some of those numbers though 🤔
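The gap between 230230 and 7**20 comes from order. `combinations_with_replacement` treats aab and aba as the same selection, so it counts multisets, C(n + k - 1, k), while ordered strings number n**k. A quick check of both formulas:

```python
import itertools
import math

n_letters, length = 7, 20

# Ordered strings with repetition: one of 7 choices at each of 20 positions.
ordered = n_letters ** length

# combinations_with_replacement ignores order, so it counts multisets:
# C(n + k - 1, k).
unordered = math.comb(n_letters + length - 1, length)

print(ordered)    # 79792266297612001
print(unordered)  # 230230

# Small check against itertools itself (3 letters, length 4).
small = list(itertools.combinations_with_replacement("abc", 4))
print(len(small), math.comb(3 + 4 - 1, 4))  # 15 15
```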
In [108]: samp = pd.Series([''.join(x) for x in itertools.combinations_with_replacement(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 20)])
In [109]: samp.sample(20)
Out[109]:
113891 aabbbcccceeeeffffffg
131937 aaccdeeffffggggggggg
115873 aabbbdddddeeeeeffggg
82491 aaabbbbcdeeefffffggg
173967 accdeeeeeeffffgggggg
90337 aaabcccccccccddddeee
56377 aaaabbbbbbccccceefff
101120 aabbbbbbbbbbbbbcdeeg
224770 cccddddddeeeeggggggg
216041 bccdddddfffffffggggg
115529 aabbbceeeeeeeeeegggg
131396 aaccddddddeeeeefgggg
95043 aaaccccccccccccddeef
4470 aaaaaaaaaaacccdfgggg
65145 aaaabbceeeeeffffffgg
16792 aaaaaaaaccccccccdeff
36986 aaaaaacccdgggggggggg
216190 bccdddeeeeeeffffffff
100210 aaadddddeeeefffffggg
23192 aaaaaaabcccccddfffff
dtype: object
looks alright
You might want to use a generator expression instead, if pandas is okay with that
for what? this is fine
In [112]: pd.DataFrame(samp).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230230 entries, 0 to 230229
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 230230 non-null object
dtypes: object(1)
memory usage: 1.8+ MB
You could have a pretty big memory burst here, and it could fail in some circumstances
I mean, the list
there's really not much
The lists have way more overhead compared to a dataframe
really, that surprises me
not too sure how to check the memory usage for that though
In [113]: pd.DataFrame(samp).info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230230 entries, 0 to 230229
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 230230 non-null object
dtypes: object(1)
memory usage: 16.9 MB
with deep
But it will not count strings, which have even more overhead
how to get the memory usage of a list then
In [118]: l.__sizeof__()
Out[118]: 1880784
this is inaccurate?
In [119]: l = [''.join(x) for x in itertools.combinations_with_replacement(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 20)]
that's l
that seems close with the original dataframe response
You also need to count all the strings inside
why would it do them separately? but ok, i'll check
Here it is just counting the chain of references, which are more or less lightweight
In [120]: x= ''.join(str(x) for x in l)
In [121]: x.__sizeof__()
Out[121]: 4604649
seems pretty small still
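Joining everything into one string hides the per-object overhead, since each element of the list is its own `str` object with its own header. A rough way to count both the list and its contents, shown on a smaller hypothetical case (length 5 instead of 20) so it runs instantly:

```python
import itertools
import sys

# Smaller stand-in for the 230230-string list, same structure.
strings = ["".join(t) for t in itertools.combinations_with_replacement("abcdefg", 5)]

# The list object itself: header plus one pointer per element.
shallow = sys.getsizeof(strings)

# Add the size of every str object the list points to.
deep = shallow + sum(sys.getsizeof(s) for s in strings)

print(len(strings), shallow, deep)
```

Exact byte counts depend on the Python version and platform, so treat the numbers as indicative only.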
!e
print(7**20)
@silk acorn :x: Your eval job has completed with return code 1.
001 | File "<string>", line 1
002 | print(7"*20)
003 | ^
004 | SyntaxError: EOL while scanning string literal
!e
print(7**20)
@silk acorn :white_check_mark: Your eval job has completed with return code 0.
79792266297612001
that's a big number
In [127]: samp[samp.str.startswith('f')]
Out[127]:
230209 ffffffffffffffffffff
230210 fffffffffffffffffffg
230211 ffffffffffffffffffgg
230212 fffffffffffffffffggg
230213 ffffffffffffffffgggg
230214 fffffffffffffffggggg
230215 ffffffffffffffgggggg
230216 fffffffffffffggggggg
230217 ffffffffffffgggggggg
230218 fffffffffffggggggggg
230219 ffffffffffgggggggggg
230220 fffffffffggggggggggg
230221 ffffffffgggggggggggg
230222 fffffffggggggggggggg
230223 ffffffgggggggggggggg
230224 fffffggggggggggggggg
230225 ffffgggggggggggggggg
230226 fffggggggggggggggggg
230227 ffgggggggggggggggggg
230228 fggggggggggggggggggg
neat
nice to look at too
there should be examples of fgfffffffffffffff
probably somewhere (way) further down the dataset
no these are sorted
you can also see the index, and the previous len
In [130]: list(itertools.combinations_with_replacement(['a', 'b', 'c'], 4))
Out[130]:
[('a', 'a', 'a', 'a'),
('a', 'a', 'a', 'b'),
('a', 'a', 'a', 'c'),
('a', 'a', 'b', 'b'),
('a', 'a', 'b', 'c'),
('a', 'a', 'c', 'c'),
('a', 'b', 'b', 'b'),
('a', 'b', 'b', 'c'),
('a', 'b', 'c', 'c'),
('a', 'c', 'c', 'c'),
('b', 'b', 'b', 'b'),
('b', 'b', 'b', 'c'),
('b', 'b', 'c', 'c'),
('b', 'c', 'c', 'c'),
('c', 'c', 'c', 'c')]
dislist=list(itertools.combinations_with_replacement(['a', 'b'],3))
[('a', 'a', 'a'), ('a', 'a', 'b'), ('a', 'b', 'b'), ('b', 'b', 'b')]
there is no 'b''b''c'
for example
there's no babb in the version that i posted above tho
but your example only contains half the samples it should have, was that the whole output?
mine, yes
should be length 81, its not making every combination
it's making what i posted
its like the original one only made like 230000 but should have had way more
ye agree
yeah, it's not done babb in the small example
I think there will be too many examples for this, I will have to think of a new approach
thanks all for your help
Anyone can offer a helping hand on contour sorting using Tensorflow and OpenCV, please?
@solar phoenix
{('a', 'a', 'a', 'a'),
('a', 'a', 'a', 'b'),
('a', 'a', 'a', 'c'),
('a', 'a', 'b', 'b'),
('a', 'a', 'b', 'c'),
('a', 'a', 'c', 'b'),
('a', 'a', 'c', 'c'),
('a', 'b', 'b', 'b'),
('a', 'b', 'b', 'c'),
('a', 'b', 'c', 'c'),
('a', 'c', 'b', 'b'),
('a', 'c', 'c', 'b'),
('a', 'c', 'c', 'c'),
('b', 'a', 'a', 'a'),
('b', 'a', 'a', 'c'),
('b', 'a', 'c', 'c'),
('b', 'b', 'a', 'a'),
('b', 'b', 'a', 'c'),
('b', 'b', 'b', 'a'),
('b', 'b', 'b', 'b'),
('b', 'b', 'b', 'c'),
('b', 'b', 'c', 'a'),
('b', 'b', 'c', 'c'),
('b', 'c', 'a', 'a'),
('b', 'c', 'c', 'a'),
('b', 'c', 'c', 'c'),
('c', 'a', 'a', 'a'),
('c', 'a', 'a', 'b'),
('c', 'a', 'b', 'b'),
('c', 'b', 'a', 'a'),
('c', 'b', 'b', 'a'),
('c', 'b', 'b', 'b'),
('c', 'c', 'a', 'a'),
('c', 'c', 'a', 'b'),
('c', 'c', 'b', 'a'),
('c', 'c', 'b', 'b'),
('c', 'c', 'c', 'a'),
('c', 'c', 'c', 'b'),
('c', 'c', 'c', 'c')}
this was it right?
although what i've just written doesn't want to scale lol
@solar phoenix what i wrote didn't really scale
Oh
i mean - it might have run with patience, i didn't have patience
@solar phoenix doesn't this pattern have a name?
i'd have thought it would have been done before somewhere and you could just use their file / data
i'll get the code it should be in memory
@jolly briar cool thanks
import itertools
import string

unique_chars = 3
string_len = 4
# every ordering of the first `unique_chars` letters
permutations = itertools.permutations(list(string.ascii_letters[:unique_chars]))
all_combinations = []
for perm in permutations:
    # order-insensitive selections, relative to this particular ordering
    c = list(itertools.combinations_with_replacement(perm, string_len))
    all_combinations.append(c)
all_combo_list = list(itertools.chain.from_iterable(all_combinations))
all_combo_list_unique = set(all_combo_list)
Oh nice
this will generate the above list, you can change params 3,4 at the top there
Yeah I see that
but don't just stick 7,20 in as it probably won't run
@jolly briar yeah I might see if it can run it on a supercomputer or something
i think 5,20 will run 🤔
there's probably plenty of room for optimising the above, if it's salvageable
Think that I could use that and then just run it on a server somewhere. Thanks for this- cool solution
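For reference, the standard-library tool for "every string of length k over these letters" is `itertools.product` with `repeat=k`. The set printed above has 39 tuples, and interleaved ones such as ('a', 'b', 'a', 'b') are absent from it, while `product` yields all 3**4 = 81. A sketch using the small case and then the seven letters from the question:

```python
import itertools

chars = "abc"
length = 4

# product(..., repeat=k) yields every ordered string of length k, lazily.
all_strings = ["".join(t) for t in itertools.product(chars, repeat=length)]
print(len(all_strings))          # 81 == 3 ** 4
print("abab" in all_strings)     # True

# For 7 letters and length 20, never build the full list; take what you need.
amino = "IAGLFVM"
lazy = itertools.product(amino, repeat=20)
first_three = ["".join(t) for t in itertools.islice(lazy, 3)]
print(first_three)
```

With 7**20 results, even a generator only helps if you stop early or sample; iterating over all of them is not practical on any machine.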
i need some help with xml parsing
Did someone try using machine learning algorithms for stock/crypto trading
I've got a bit of free time so I want to try out to code something like trading bot as a personal project
So if someone worked on something similar to this, I would like to hear your experiences
For anyone new and looking to get into Data Science and Machine Learning: we have made a YouTube channel about Data Science and Machine Learning, and it would mean a lot if you could check it out and, if you like it, please subscribe. https://www.youtube.com/channel/UCKaajyjktvduM6mmuBtAOyg
Does TensorFlow take our Python code (in map function) and do something like JIT compiling? I try to set breakpoint with VS Code but it is not hit at all.🤔
Hi! I can get help here in machine learning (time series) ?
Upload the question, so we can see...
I need to make a machine learning model on the time series to predict the quality of communication.
In this dataset, I need to predict the “Y” column.
I plotted a linear plot of y versus date, as well as ACF and PACF.
The Dickey-Fuller criterion is 0.
What model can be built and how to determine the parameters for it? The dataset itself was collected over 14 days and contains ~ 7 million rows. I averaged the value over a period of 1 minute. The dataset currently contains 20,160 rows
can anyone help me with something regarding to converting DOB to age in an excel form?
Anyone wnna help me with my data analytics assignment? - its about pandas and stuff
plt.pie(genrevotes1)
i generated this by using the code. but i want the labels as well. how do i do that ?
@faint furnace you need to pass labels to plt.pie(). Create a list of labels, in your case ["Drama", "Comedy"...], then add the argument labels=labels so the call looks like plt.pie(genrevotes1, labels=labels). That should show the categories, but just a warning: with that many categories the text may become cramped together at the smaller slices
btw you don't need to manually create the list if the data is in a dataframe, you just need to get the row names
yes i know that but i want to create the plot directly thorugh the series i have
genrevotes1. has 1 column as all the genres which I want as labels
solved. i was able to do it by typeing
labels=genrevotes1.index
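Put together, the working version looks something like this. `genre_votes` is a hypothetical stand-in for genrevotes1 with made-up numbers, and the Agg backend is only there so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical Series: genre names as the index, vote counts as values.
genre_votes = pd.Series({"Drama": 120, "Comedy": 80, "Action": 45})

fig, ax = plt.subplots()
# The Series index supplies one label per wedge.
ax.pie(genre_votes, labels=genre_votes.index)
ax.set_title("Votes by genre")
```

`genre_votes.plot.pie()` is a shortcut that picks up the index as labels on its own.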
Hey guys
Is there a simple way of changing a grid solving program from monte carlo approach to temporal difference learning
I have the full monte carlo approach code and just struggling to convert it
Hey,
is there a way to complete 2d on numpy array to matrix with zeroes?
like [[1,2], [1]] to [[1,2], [1,0]]?
i don't think so because the 2 original lists are of different length so it will give you an array containing lists (so you can't use numpy features on them)
you can probably do it with loops but I don't think there will be a nice numpy feature like .reshape unless you initialize the array with uniform dimensions
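The loop version is short: allocate a filled array of the right shape first, then copy each row in. `pad_rows` is a hypothetical helper name.

```python
import numpy as np

def pad_rows(rows, fill=0):
    """Right-pad ragged rows with `fill` and return a 2D array."""
    width = max((len(r) for r in rows), default=0)
    # dtype follows `fill`; pass fill=0.0 if the rows hold floats.
    out = np.full((len(rows), width), fill)
    for i, row in enumerate(rows):
        out[i, :len(row)] = row
    return out

print(pad_rows([[1, 2], [1]]))
```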
Is anyone interested in Stock Market algorithms?
wdym?
In the development and application of algorithms that trade the markets
wouldn't recommend it
why do you say that
@supple moon yep
its difficult to make an algorithm that will perform well with stock markets, there's so many factors
u'll have to have access to a large amount of quality data
@flat quest And thats why it is fun to think about and play with, right?
i was looking to collaborate to see if we could build something good
too early for me, but I'd like to see some links with concepts here 🙂
well it would be fun
but might become an unnaturally large project lol (will probably take up a lot of resources) and getting/cleaning data gets more frustrating the more you do it
That's my main area of work although a fair number of resources have gotten shifted to pandemic modeling instead.
does anyone know how to use tensor processing unit (TPU) in Google Colab or Kaggle?
i am trying to use TPU and think i am following the example code exactly
but it is only using CPU
can share notebook
im doing exactly this
please anyone know anything about TPU at all and tensorflow help me
anyone at all please @ or PM me
are u using a layer / model compatible with tpu? @drifting umbra
@flat quest i believe so. keras Sequential see this:
# instantiate a distribution strategy
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)
# instantiating the model in the strategy scope creates the model on the TPU
with tpu_strategy.scope():
    model = Sequential()
    model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(256))
    model.add(Dropout(0.2))
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
# train model normally - doing this below
# model.fit(training_dataset, epochs=EPOCHS, steps_per_epoch=…)
model.summary()
starts as
import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
import os
import tensorflow as tf
print("Tensorflow version " + tf.__version__)
...
i can upload file if that helps
notebook?
What's up guys... So I wanna learn ML and robotics, something like putting these 2 things together, but I don't know where to start or which one I should start with. Can someone pls give me some recommendations for tutorials or advice?
LSTM layers I know are compatible with TPU by default. @drifting umbra
are u running this code?
# detect and init the TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
@arctic canopy well you should first learn the basics of each one separately. Not too sure for robotics but ML u should start with some sort of course to give you a good overview.
@flat quest So is it better to start with ML first?
and is it that complex? I mean I heard it needs a lot of math and stuff
@flat quest thank you and let me check
# detect and init the TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
that is exactly what i have
can i link you this
yeah i'll take a look
@arctic canopy
it depends on where u want to go with ML
basic stuff like simple text generation and classifying images doesn't really need much knowledge with math
if you want to make a production ready model, or look into improving existing models through new architectures, then you'll have to learn the math to some extent
@drifting umbra don't think i have access to your data
lol idk why but i can see the files listed in the input directory
but the os module cannot find it :/
hm
yeah its odd cause i can find the file using os.walk
@flat quest so basically as you go more deep it will require more math I think, Thanks mate
Im stoked man just learned a bit of selenium, its so fun
what?
learned how to use selenium with python to web scrape


Let's imagine I have a dataframe with an arbitrary length, and each row has 4 parameters. I want to rank the dataframe by these 4 parameters. At the moment what I do is test each of the parameters against a desired value and then rank them by how close they are to the value. I then add a new column to the dataframe which indicates which rank they are for that parameter. I do this for all 4 parameters then combine the rank of parameters 1-4 and get a "total" rank, with the closest to my desired set of parameters being at the top. Is there a better way to do this? My concern is that a row that has very good score for 3 parameters will constantly outcompete one that has a mediocre score in all 4 parameters.
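One alternative to summing per-parameter ranks is a single distance to the target, with each parameter scaled by its spread. This is a different technique from the one described, offered only as a sketch; the frame, the target values, and the choice of Euclidean distance are all assumptions.

```python
import pandas as pd

# Hypothetical data: four parameters per row and a desired value for each.
df = pd.DataFrame({
    "p1": [1.0, 2.0, 3.0],
    "p2": [10.0, 12.0, 30.0],
    "p3": [0.5, 0.4, 0.1],
    "p4": [100.0, 105.0, 90.0],
})
target = pd.Series({"p1": 2.0, "p2": 12.0, "p3": 0.4, "p4": 100.0})

# Scale each deviation by the column's spread so no parameter dominates,
# then combine them into one Euclidean distance per row.
deviation = (df - target) / df.std()
df["distance"] = (deviation ** 2).sum(axis=1) ** 0.5

ranked = df.sort_values("distance")
print(ranked)
```

Squaring penalises one large miss more than several small ones, which may or may not be the trade-off wanted; summing absolute deviations instead treats them evenly.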
@solar phoenix might be easier to follow with example data
Hi @jolly briar I'll make an example and send it, cheers
df.sample(20,random_state=1).to_json() might be useful
Im quite new to kaggle. I have a model now running. Its gonna take 2 hours to make the version. Can I safely close the window?
Knowing that my version creation will finish? 😄
Hey guys sorry if this is the wrong spot... thoughts on using a .txt or .xlsx to create a wordbank that i will use as an array to compare other parsed text to? im trying to automate my job search abit 
@twin parcel why xlsx over csv?
just kinda compare them all forgot csv was another type and the best option atm prob
completed numpy, pandas and matplotlib. now what next?
also am i able to import csv filled with reg expressions by chance?
read ISL @lapis sequoia
what does "completed" mean @lapis sequoia
what do you mean completed numpy pandas and matplotlib?
Studied them
@twin parcel can you give an example
@lapis sequoia what does studied mean here, to what extent
what's the next step?
explaining yourself properly would be a good step imo
@jolly briar covered these topics with practical data analysis
@jolly briar do you have any idea about auto encoders ?
@somber rune sorry no 😦
@lapis sequoia you're being vague as hell so I"m just going to say go read ESL
@twin parcel do you ?
Im trying to add "minimum of 6 years" to a list im going to filter out but id rather have a reg expression that would cover minimum of 6< years instead of writing "minimum of 6 years, minimum of 7 years, minimum of 8 years" but i also want to keep it imported from a file
Err not a clue besides for image and vid but i've used python for a total of 3 hours now so cant help much 😂
ah okay
@lapis sequoia lol that is still so vague
@twin parcel tbh if you only have a fixed set of cases just writing them out and putting them in a list isn't really a bad idea
@twin parcel idk what the source is though - are they guaranteed to have this structure?
only problem is im going to cover 6< and that could reach 20 and yes they will always have this structure
I want to learn machine learning and data science. I am a self learner. Learned Python with OOP concepts, and topics like file handling, regex, web scraping, numpy, pandas and matplotlib. What's the next 3 things should I learn to get one step further? Should I learn scikit now?
or if one falls out of it, at least i catch 7/10 and have that many less jobs to read over
why would you store regex's in a csv rather than in the script? I don't really follow that
because I want to make it simple for someone to adapt without touching the code to much i would call the txt in a loop to use each regex inside it
sounds funky
it is 😂
I don't think i'd store regexs like that
they should be in code, data in data
separation and all that jazz
without a concrete example idk what to say here, other than have your scripting in the script
in terms of ML
depends on what u want to do
surface level ML or if u just want to work with existing architectures, yeah scikit and tensorflow are a good place to go
if u want to improve existing models/architectures, you're gonna have to learn some aspect of the math. (online vids, courses, wiki are all good for that).
@lapis sequoia
if u want to improve existing models/architectures
this seems like an extremely narrow set of people
hmmm thinking on that maybe ill format my txt different code the regex in but get the year value based on the files value
might use a txt, dedicate the first few lines to specific stats, then split after those lines on ,
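The "regex in code, year value from a file" split could be sketched like this: one pattern with a capture group, and the cutoff passed in as a number. `too_senior`, `PATTERN`, and the default of 5 are hypothetical; the cutoff is the part that would come from the settings file.

```python
import re

# One pattern with a capture group instead of one entry per year count.
PATTERN = re.compile(r"minimum of (\d+)\+? years", re.IGNORECASE)

def too_senior(text: str, max_years: int = 5) -> bool:
    """True if the posting asks for more than `max_years` years."""
    return any(int(m.group(1)) > max_years for m in PATTERN.finditer(text))

# `max_years` could be read from a settings file; 5 here is a placeholder.
print(too_senior("Requires a minimum of 8 years experience"))   # True
print(too_senior("Requires a minimum of 3 years experience"))   # False
```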
not really. A lot of ML startups/companies have some focus on improving existing architectures. Most of the ML stuff we use now was invented in like the 80's.
I think it's still very narrow, of the people that are going to learn these tools, that's a narrow set of people
Can you guys tell me what 3 skills should I learn now? I'm trying to get into data science career with self learning
yeah, but if you plan to make a career or job out of it. The math can't be ignored.
though it depends what improve means I guess - if it's publishing and stuff, very narrow
@lapis sequoia what do you think you should do, based on what you've looked at so far
performance of models. More data/cleaning/feature engineering does help, but only to a certain extent.
yeah i think data engineering is more important for most
Can you specify the topics ?
esp. if someone is self learning
yeah
@lapis sequoia if u don't have any clue of any architectures. Start out with linear reg/logistic on scikit. Then try decision trees and gradient boosting.
I'm trying to set a roadmap and create a curriculum on my own to continue learning from here.
well those are the topics u should look into
Is this a good information? I'm almost following this. https://towardsdatascience.com/a-road-map-for-data-science-d1977504a72b
trying to plan everything out to the nth degree is the biggest waste of time
^
those suggestions from drag are good, do them
Don't bother reading medium posts about planning about planning about planning
if you've been through np/pandas/mpl as much as you think trying a project is a good test
should be able to get some open data from a gov site, clean it, and provide some insights
I would bet it's more time consuming than you expect
yeah got stuck on that for a while too
u never get out of the phase of planning to do something
u need to just start working with ML.
yeah cleaning data takes longer than model building imo
I actually downloaded some datasets from Kaggle and presented visual representation well for practice. Is that all what data analyzing mean?
how old are you?
because i'm not sure why these questions are phrased as they are
Is that all what data analyzing mean?
this question just seems nuts
presenting means nothing. You need to be able to extract information from it @lapis sequoia
like?
maybe ur familiar with the kaggle titanic dataset.
Let's say u see that there's a cluster of deaths with people who have the same last name
because this is probably the first thing I'd learn, you're not going to be given a 3 step guide to anything at work
then you might infer that those people are part of the same family group, and family groups are likely to all die or survive.
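A minimal sketch of that surname idea in pandas. The rows below are made up for illustration; only the "Name" and "Survived" column names follow the Kaggle Titanic layout, and the "Surname" column is a helper introduced here.

```python
import pandas as pd

# Made-up rows in the Kaggle Titanic layout ("Name" is "Surname, Title. First").
df = pd.DataFrame({
    "Name": [
        "Smith, Mr. Adam", "Smith, Miss. Beth", "Smith, Master. Carl",
        "Jones, Miss. Dana", "Jones, Mrs. Erin", "Brown, Mr. Frank",
    ],
    "Survived": [0, 0, 0, 1, 1, 0],
})

# Surname is everything before the first comma.
df["Surname"] = df["Name"].str.split(",").str[0].str.strip()

# Group size and survival rate per surname; keep groups with more than one member.
by_family = df.groupby("Surname")["Survived"].agg(["size", "mean"])
families = by_family[by_family["size"] > 1]
print(families)
```

A mean of 0.0 or 1.0 for a multi-member group is the "all die or all survive" pattern described above; on the real data a shared surname does not guarantee a shared family, so this is only a first pass.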
I played with some other dataset as well analyzing different scenarios
aight idk the only advice i can really give
Is to dive into your data and then work with it. Like rie said make a project utilizing the python data-packages and publish the results on a website or on github.
I don't really understand what type of projects it can be. Can you generalize it?
@lapis sequoia have a guess
just have a guess at something that shows you have thought for yourself and go from there, maybe it's a good idea
should be able to get some open data from a gov site, clean it, and provide some insights
@jolly briar like this?
using my thought as your own, sure
given what you say you have learnt - my biggest concern would be that you don't seem to be able to piece anything together for yourself
which suggests that perhaps you've rattled through a few tutorials without digesting / internalising any of it
@jolly briar sounds like every coding meeting ive been a part of...
😄
i never did understand that regex thing you were talking about @twin parcel , personally i'd always go for having them in the script, and having data as it's own thing
you can just extract the numbers if that's easier, re.search(r'(\d+)', sentence).group(1) looks like it'd catch what you needed
I decided to go that route. im gonna research regex in python, but i added a placeholder in the .txt that others can change, and ill just read the number: "Minimum Years: 7" is line 2 and im able to grab the 7 easily, so ill use that var in the regex and it should work out 🙂
soon i will have a tool to not waste time on the job search 
share me that tool when u make it 😛
Figure ive looked once a day for last few months alone switching tabs prob is around 5 mins a day
so already i have it just take the first 10 pages of indeed based on my search and print it in one page, next step filter out requirments im nowhere near as a new grad, theres several extra minutes a day determining if i can. then save it all to a file is the last step. figure at 22 i have a good 40 years of career so this tool can be used later on also
also another big benefit is i can now add python to my portfolio as i wanted to make a scraper with it for a while but couldnt find a legal use
what's indeeds policy on scraping?
showing off a tool that shows you haven't read a data usage policy might be a bit brave 😅
If I read its TOS and other google searches correctly its ok as long as its not for commercial uses but i think imma email their customer service before this is center of my git or im IP banned from a job site 😂
i have a directory called notebooks that contains the fastai directory, inside this notebooks folder i also have a folder for each week of exercises. In these directories there are notebooks but they can't find the path to the fastai directory because it is complaining about relative paths, could someone help me out ?
@thin remnant if you run things from project root it makes stuff like this a lot simpler
so in your notebook you can have something like os.chdir('../../') , or better something like os.chdir(here()) using here() from pyprojroot
here's a link to that : https://github.com/chendaniely/pyprojroot
is it not possible to access the fastai modules by using just a path
it's only one directory up
i think the os.chdir('../') worked
doesn't change anything, running from proj root is still simpler?
so you can't, from an interactive session in root, do from blah import blah, where blah is what you want?
right, but if you re-run it then it'll keep knocking you back (you'll have to reset kernel)
would appending fastai to the pythonpath be a solution?
it says here() not defined
because you haven't installed the package I linked i guess
should i also place that in the same directory as the fastai directory
the package? You'd just install with pip
do you have this project version controlled?
I've no idea what that is
why would you clone?
how do i install then
pip
pip install pyprojroot?
here() is still not defined :/
have you followed the readme
are you in jupyter notebook or lab
jupyter
yeah, which
you need to restart the kernel
you installed it in the wrong env then
do i have to install in the fastai-cpu env ?
if that's what you're using for this notebook then yeah
you have to install things for particular envs, you can use requirements files and such to manage this for you
make sure you install to that env, should then work
also - i tend to use it as os.chdir(here()), just at the top of the notebook
import here works now
cool
so i usually have something like
import os
from pyprojroot import here
os.chdir(here())
<other imports>
it still doesnt find fastai
which is an odd ordering, i just don't want to use here() throughout the script
import os
from pyprojroot import here
os.chdir(here())
do this, then os.listdir(), is it in the root?
idk why it wouldn't work when previously os.chdir(.../ stuff did
hi
shouldn't have to
mmm
this is weird
@jolly briar what should i do when the dir is changed
it still doesn't load the import of fastai
@thin remnant looks like it's put you into your ~/ dir, which i doubt is the project root
is it?
nope ..
🤷♂️
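On the earlier question about appending to the Python path: a rough sketch of that alternative, assuming the layout described above (a notebooks folder that contains both fastai and the weekly folders). The helper name add_to_path is made up for this example. Unlike os.chdir('../'), re-running the cell does not keep moving one directory further up.

```python
import sys
from pathlib import Path

def add_to_path(start, marker):
    """Walk up from `start` until a directory containing `marker` is found,
    put that directory on sys.path, and return it (None if not found)."""
    start = Path(start).resolve()
    for folder in [start, *start.parents]:
        if (folder / marker).exists():
            if str(folder) not in sys.path:
                sys.path.insert(0, str(folder))
            return folder
    return None

# From a notebook in notebooks/week1/, this looks for the folder holding fastai.
root = add_to_path(Path.cwd(), "fastai")
print(root)  # None means no parent directory contains a "fastai" entry
```

After this, import fastai should resolve as long as the returned directory really is the one that contains the fastai package.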
I'm trying to generate plot points for a 3d scatter plot. I have the values, but being new to python, numpy, pandas, etc., I'm not sure if I'm capturing and structuring the data in the most simplified way for plotting. Here is my code:
sample_data_subset_intervals = np.unique(sample_data_subset_df['sampling_interval'].to_numpy())
sample_data_subset_durations = np.unique(sample_data_subset_df['sampling_duration'].to_numpy())
scatterplot_raw_data_df = \
    (sample_data_subset_df[['sampling_interval','sampling_duration','sampling_error']]).dropna()
scatterplot_raw_data_df['sampling_error'] = scatterplot_raw_data_df['sampling_error'].abs()
scatterplot_3d_plot_points_dtype = \
    [('sampling_interval', np.int32), ('sampling_duration', np.int32), ('sampling_error', np.float64)]
scatterplot_3d_plot_points = np.empty([0,1],dtype=scatterplot_3d_plot_points_dtype)
plot_points_index = 0
for interval in sample_data_subset_intervals:
    for duration in sample_data_subset_durations:
        if duration <= interval:
            interval_duration_pair_data_subset_df = \
                scatterplot_raw_data_df[(scatterplot_raw_data_df['sampling_interval']==interval) & \
                                        (scatterplot_raw_data_df['sampling_duration']==duration)]
            idp_sampling_error_summation = interval_duration_pair_data_subset_df['sampling_error'].sum()
            idp_mean_sampling_error = \
                idp_sampling_error_summation / len(interval_duration_pair_data_subset_df.index)
            scatterplot_3d_plot_points.resize(plot_points_index + 1,1)
            scatterplot_3d_plot_points[plot_points_index]=(interval,duration,idp_mean_sampling_error)
            plot_points_index = plot_points_index + 1
and the output looks like this:
[[( 10, 10, 0.00000000e+00)]
[( 30, 10, 4.56183120e-04)]
[( 30, 30, 0.00000000e+00)]
[( 60, 10, 2.84578755e-03)]
[( 60, 30, 1.92741648e-03)]
[( 60, 60, 0.00000000e+00)]
[( 120, 10, 1.33025818e-01)]
[( 120, 30, 1.21143218e-01)]
[( 120, 60, 9.39393846e-02)]
[( 120, 120, 0.00000000e+00)]
[( 300, 10, 7.69409264e-01)]
[( 300, 30, 7.70362944e-01)]
[( 300, 60, 7.38203127e-01)]
[( 300, 120, 5.79511920e-01)]
[( 300, 300, 0.00000000e+00)]
[( 600, 10, 1.18857403e+00)]
[( 600, 30, 1.18091259e+00)]
[( 600, 60, 1.16379460e+00)]
[( 600, 120, 1.02220597e+00)]
[( 600, 300, 6.36643452e-01)]
[( 600, 600, 0.00000000e+00)]
[( 900, 10, 1.38186398e+00)]
[( 900, 30, 1.41657535e+00)]
[( 900, 60, 1.42654824e+00)]
[( 900, 120, 1.28564349e+00)]
[( 900, 300, 9.52358564e-01)]
[( 900, 600, 4.13780964e-01)]
[( 900, 900, 0.00000000e+00)]
[(1800, 10, 1.56350134e+00)]
[(1800, 30, 1.59038708e+00)]
[(1800, 60, 1.57760143e+00)]
[(1800, 120, 1.47674187e+00)]
[(1800, 300, 1.27458568e+00)]
[(1800, 600, 9.84249018e-01)]
[(1800, 900, 7.20700696e-01)]
[(1800, 1800, 0.00000000e+00)]
[(3600, 10, 1.58364303e+00)]
[(3600, 30, 1.62856429e+00)]
[(3600, 60, 1.66236178e+00)]
[(3600, 120, 1.67353265e+00)]
[(3600, 300, 1.47160299e+00)]
[(3600, 600, 1.39347321e+00)]
[(3600, 900, 1.18549807e+00)]
[(3600, 1800, 7.73267790e-01)]
[(3600, 3600, 0.00000000e+00)]]
The number of brackets and parens in the output implies to me perhaps unnecessary complexity in my data structure, but that may just be due to me being unfamiliar with structuring data in python/numpy. Does the format/structure of this output look correct and most simplified for moving forward with it to plot? Thanks!
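One possible simplification, sketched with made-up rows standing in for sample_data_subset_df: a groupby over the (interval, duration) pair gives the mean absolute error per pair directly, and the three columns can then be handed to a 3D scatter as flat x, y, z arrays. One difference from the loop above: groupby only emits pairs that actually occur in the data, so pairs with no rows are left out instead of showing up as NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for sample_data_subset_df.
sample_data_subset_df = pd.DataFrame({
    "sampling_interval": [30, 30, 30, 60, 60, 60],
    "sampling_duration": [10, 10, 30, 10, 30, 30],
    "sampling_error":    [0.2, -0.4, 0.0, np.nan, -0.1, 0.3],
})

cols = ["sampling_interval", "sampling_duration", "sampling_error"]
points = (
    sample_data_subset_df[cols]
    .dropna()
    .assign(sampling_error=lambda d: d["sampling_error"].abs())
    .groupby(["sampling_interval", "sampling_duration"], as_index=False)["sampling_error"]
    .mean()
)
# Same filter as the nested loops: keep pairs where duration <= interval.
points = points[points["sampling_duration"] <= points["sampling_interval"]]

# Flat 1-D arrays, one entry per plotted point.
x = points["sampling_interval"].to_numpy()
y = points["sampling_duration"].to_numpy()
z = points["sampling_error"].to_numpy()
print(points)
```

The extra brackets in the printed output above come from the (n, 1) shape of the structured array; three plain 1-D arrays (or the DataFrame itself) are usually easier to pass to a plotting call.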
Hi, I am getting really frustrated, because the changes I am making to a dataframe, inside a function, are not committed outside the function. I use return, but it does not work. Am I missing something ?
to be more specific, I wrote a function that takes a column away from the df, and that merges it with another df. The function output is correct i.e. a new df that looks exactly like I want. However, I would like this df to overwrite the original one, and I can't make it work
well, technically it should look like that
data = pd.DataFrame(data={"me": [1, 2], "something": [3, 4]})
data = function(data)
and this function would look like
def function(data):
    # do something with this data
    return data # or some other variable if you want
is it usual for pandas to sometimes replace (numeric) values?
because i have a dataframe which if i visualize it, there are some 0 in the values which shouldn't be there. i checked the file which i also print it to but all values in it are correct
def visualize(dataFrames:dict, outputLocation:str, showing:dict, show = True, save = True) -> None:
    print("Start visualizing...")
    yValues = []
    print("- Start Extracting what to show...")
    for key, val in showing.items():
        if (val):
            print("- - Adding item {}...".format(key))
            yValues.append(key)
    print("- Finished Extracting what to show")
    print("- Start iterating plots...")
    for id, df in dataFrames.items():
        print("- - Starting plot: "+id+"...")
        print("- - - Start converting Duration to numeric values...")
        df.index += 1
        df.DurationIncl = pd.to_numeric(df.DurationIncl)
        #df.ScanTimeAutoLight = pd.to_numeric(df.ScanTimeAutoLight)
        print("- - - Finished converting Duration to numeric values...")
        print("- - - Start plotting...")
        plottedFrame = df.plot(
            y = yValues,
            kind = "line",
            title = "Runtime analysis",
            use_index = True,
            grid = True
        )
        print("- - - Finished plotting")
        print("- - - Start adding legend...")
        legend = []
        for yval in yValues:
            legend.append(yval + " of " + id)
        plottedFrame.legend(legend)
        plottedFrame.set_xlabel("index")
        plottedFrame.set_ylabel("Time in seconds")
        print("- - - Finished adding legend")
        if save:
            print("- - - Start saving...")
            matplotlib.pyplot.savefig(outputLocation+"AnalysedData{}.png".format(id))
            print("- - - Finished saving")
        print("- - finished plot: "+id)
    print("- Finished iterating plots")
    if show:
        print("- Start showing...")
        matplotlib.pyplot.show()
        print("- Finished showing")
    print("Finished visualizing")
the code of the visualiziation
Hi!
How to install fbprophet on win10 for python 3.8?
I searched for manuals and tried them - all in vain
Never tried with windows, but it worked for me on macOs using conda.
In windows10 python3.8 works fine with anaconda.
Hi!
@lapis sequoia in windows10, python3.8 works fine with anaconda
while using a scraper, sites that store cookies and login sessions would the scraper use that session, or as a scraper it has its own session?
i assume towards its own just like different browsers
@twin parcel indeed
my old version worked pretty well until I realized some divs had an optional field that i want to track, so now i have to rebuild based on divs.
https://www.coursera.org/specializations/jhu-data-science is this course good to get started in data science?
@frail ocean I need in FBProphet lib
I see.. then I have no idea. Sorry.
Any suggestions on changing this to allow 2 chars minExperienceLimit = badFiltersContent[2][15] Im using it to grab the int from this string, but this wont work for >9 ``` This is the Bad filter list, add words or phrases that make a job more likely not match, seperate with commas!
Minimum Years: 7```
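A small sketch of how a regex could replace the fixed index [15], so that "Minimum Years: 12" works as well as "Minimum Years: 7". The function name parse_min_years is made up here; the line text is the one from the filter file above.

```python
import re

def parse_min_years(line):
    """Return the integer after 'Minimum Years:' or None if the line doesn't match."""
    match = re.search(r"Minimum Years:\s*(\d+)", line)
    return int(match.group(1)) if match else None

print(parse_min_years("Minimum Years: 7"))
print(parse_min_years("Minimum Years: 12"))
```

With the list from the question this would be called as parse_min_years(badFiltersContent[2]); \d+ matches one or more digits, so the number of characters no longer matters.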
i dont know if this fits here but i need to compare 2 numpy arrays with eachother, they have different sizes. if 1 of the colors in first array are found in the big array i need to have a true output
i tried allclose but that works for all, i tried isclose, i tried any(isclose)
i tried if A in B is true then:
wdym by true output?
i dont need to know the value that matched
just that they did match
a boolean
@uncut shadow
What is the best resource for learning regression and classification in Python?
@kind saddle I think this should give you the way to do this (https://stackoverflow.com/questions/25490641/check-how-many-elements-are-equal-in-two-numpy-arrays-python). You can change it to what you actually wanted to achieve
it works too for different sizes?
well, gimme a sec
im alot asking sorry, ive been brainstorming and testing so much that all my ideas ran out :/
well, this one should do
https://stackoverflow.com/questions/45936138/check-how-many-numpy-array-within-a-numpy-array-are-equal-to-other-numpy-arrays
@uncut shadow 2 questions but 1 is decently stupid, if it found a match then my if statement should say if value > 0 is true right?
second question is even tho the arrays are the same partly i dont get a match, can i put in an error margin for like +-3?
well, I don't know much about this particular package (I didn't need it before) but you should check it's documentation to check if you can add a margin or stuff like that
but to my original arrays before i feed them into the package
nvm this is overthinking it too much
actually, what do you mean with the first question cuz I don't think I understood it right
the problem is, 2 pictures are translated into arrays, 1 is a small part out of the bigger one. so if i compare i should get a value or anything apart from 0 or whatever since they are the same and treated the same in the code
question 1 was, that the count of matches would be greater than 0 if there was a match
im sure that is yes xD
so yes, if there is a match then it should be bigger than 0
i think the problem is in the processing or asking too many matches. ill try it with 1 single color first
@kind saddle do you have example data
@jolly briar yes and no, not a raw file its converted from image to array
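A rough sketch of the "does any color match, within a margin" check, assuming both images are arrays of shape (height, width, 3) holding RGB values. The function name and the default margin of 3 are illustrative, not taken from any package mentioned above.

```python
import numpy as np

def any_color_match(small, big, tol=3):
    """True if any RGB color in `small` is within +-tol of some color in `big`."""
    # Cast to a signed type first: uint8 subtraction would wrap around.
    small = np.asarray(small, dtype=np.int32).reshape(-1, 1, 3)
    big = np.asarray(big, dtype=np.int32).reshape(1, -1, 3)
    # Broadcasting compares every small pixel with every big pixel.
    close = np.abs(small - big) <= tol      # shape (n_small, n_big, 3)
    return bool(close.all(axis=2).any())

big = np.zeros((4, 5, 3), dtype=np.uint8)
big[2, 3] = [12, 18, 31]
small = np.array([[[10, 20, 30]]], dtype=np.uint8)
print(any_color_match(small, big))
```

The intermediate array has one entry per pair of pixels, so for large images it is worth reducing each image to its unique colors first (for example with np.unique(..., axis=0) on the flattened pixels).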
is this where the NLTK nerds are
yes
you know how to parse feature based semantics
Wdym?
define semantics
uh
i have a CFG with fol and lambda calculus and i give it a sentence and it tokenizes and parses the tree for it with a semantic representation
but it does not work with certain constructions, like subject inverted ditransitive questions where the recipient is a prepositional phrase “for x”
For beginners 😊 https://youtu.be/38KOhekzEgA
Can a neural network add two numbers? In this video I tried something different for practice. Here I created a video for addition of two numbers using artificial neural networks. The whole code can be found in the GitHub link below.
code: https://gist.github.com/Pawandeep-...
did you really use a 5 layer dense model for addition :/
if ur doing addition a single perceptron will work
at an equal or higher efficiency
just consider the mathematical basis of the perceptron: x1w1 + x2w2 + b
set b to 0 and w1 and w2 to 1 and you have addition.
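The point above as a tiny sketch, assuming a single linear unit with no activation function (a classic perceptron with a step activation would output 0 or 1 rather than the sum):

```python
import numpy as np

w = np.array([1.0, 1.0])  # w1 = w2 = 1
b = 0.0                   # bias set to 0

def perceptron(x):
    """Linear unit: x1*w1 + x2*w2 + b."""
    return float(np.dot(x, w) + b)

print(perceptron([2.0, 3.5]))
```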
Anyone knows where to find the source code for "vocab_file" for (thanks):
FullTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=False)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy() #The vocab file of bert for tokenizer
tokenizer = FullTokenizer(vocab_file)
i could only find the source code of the KerasLayer and the resolved_object method, but no vocab_file nor asset_path methods/attributes...
it seems vocab_file is an attribute dynamically set in the export_bert_tfhub function
bert_layer.resolved_object.vocab_file is an Asset object, https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/training/tracking/tracking.py#L280-L330, where you can see the asset_path property being defined (it returns a tf tensor/string, thus the .numpy() at the end)
@rustic igloo
Thanks @paper niche
@worn stratus hey man thanks for the reply!, im just wondering do you know where i can find the study materials for decision trees, i need to know about the classification and regression type specifically. i need to make them without sklearn or other 3rd party modules. all i can find in the web were those with them involved.
honestly im at a loss lol
Hm - I've had a look around for that stuff myself to not great success. If I were you, I'd have a look around at stuff like the Wikipedia page for ID3 decision trees, and random implementations you can find on github. When I looked - they were all a bit rubbish, but still helpful for figuring out roughly what is required
From a quick google search, there are tutorials out there for things like building an ID3 decision tree - just they're all not great. With some patience you can probably use a combination of the different resources available to figure things out
ID3 is definitely what I'd be looking towards
I have a dataset of images, with each image tagged with keywords. For example, a photo of the Earth from space may have the keywords ["earth", "space", "planet", "photo"] while an illustrated diagram of the Sun may have the keywords ["sun", "star", "illustration", "rendering", "diagram", "labeled"].
I want to be able to automatically filter a set of images with keywords to end up with a smaller set of images that match the aesthetic that I am looking for. To do this, my plan is to figure out the weight of each keyword in determining if a particular image should or should not be included.
I will have a human go through the dataset of images, and say if they should or should not be included. The classification for each image will be recorded.
I think the output will be something like this where each keyword that was seen is given a weight:
{
"keyword1": 1.25,
"keyword2": 0.75,
"keyword3": -8.6,
...
}
Then, when given an unclassified image, my program will use those determined weights to say if or if not that unclassified image should be included.
What types of techniques should I look into for this? One consideration I'm thinking of is that the frequency of a keyword needs to be considered in the weight. If a word only shows up a single time, and I say that associated image does not belong in the set, then that keyword probably shouldn't be 100% bad.
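One simple way to get weights of that shape, offered as a sketch rather than a recommendation: smoothed log-odds per keyword, which is close in spirit to naive Bayes (logistic regression on a bag of keywords would be the more standard tool). The smoothing constant alpha is an assumption of this sketch; it is what keeps a keyword seen once from becoming "100% bad". The function names and sample labels are made up.

```python
import math
from collections import Counter

def keyword_weights(labeled, alpha=1.0):
    """labeled: list of (keywords, include) pairs. Returns {keyword: weight}.
    Weight = smoothed log-odds of inclusion among images carrying the keyword."""
    yes, no = Counter(), Counter()
    for keywords, include in labeled:
        (yes if include else no).update(set(keywords))
    return {
        kw: math.log((yes[kw] + alpha) / (no[kw] + alpha))
        for kw in set(yes) | set(no)
    }

def score(keywords, weights):
    """Sum of weights for known keywords; above 0 leans towards 'include'."""
    return sum(weights.get(kw, 0.0) for kw in keywords)

labeled = [
    (["earth", "space", "photo"], True),
    (["sun", "star", "photo"], True),
    (["sun", "diagram", "labeled"], False),
    (["earth", "diagram"], False),
    (["moon", "photo"], True),
]
weights = keyword_weights(labeled)
print(weights["diagram"], weights["labeled"])
print(score(["photo", "space"], weights))
```

Here "labeled" was rejected once and "diagram" twice, so "labeled" ends up with the milder negative weight; with more evidence the weights move further from zero.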
When orchestrating ML into the business, do you all use event sourcing and CQRS concepts? We are doing a lot of stream processing, and we are trying to plan out the best strategy for ML predictions.
Am i in the right place here to ask Questions about graphs? (nodes and vertices and stuffs)
okay, here goes: I'm writing something to represent a game I'm playing. it has a lot of 'currencies' which can be converted into other 'currencies'. Some of them i can directly assign a value to (in this case a power increase/currency)
there are loops in this graph, and I can not guarantee that the dev was smart enough to prevent diverging loops
Do you have any hints on how to calculate the 'value' for each currency, when factoring in conversions?
I'm thinking about setting all values to 0, except for those which I can directly assign.
all directly assigned currencies go into a queue.
then I take one item out of the queue and update all currencies which can convert into that currency. I will save a 'value' for each currency the currency can convert to. Then I add that currency to the queue
that by itself does not terminate.
so its a BFS?
whats a BFS?
breadth first search
here is what i would do instead
searching seems overcomplicated from what i understand
if you are able to recode it
just make a class for currency then have a method converting between
or a function
the issue are the loops in that directed graph
do you have a picture of the graph you can upload
the values should converge when going through a cycle. So if the increase when updating a node is negligible i won't add it to the queue. (hm... there is the issue of possible multiple small updates?)
so it will eventually terminate if all cycles converge
I'm unsure how to handle the (probably not happening) case of a diverging cycle
what do you mean by diverging cycle?
like it goes to infinity
so it just gets trapped?
trivial example: i can buy 2 bananas for 1 apple. and i can buy 2 apples for 1 banana
so if I were to assign some base value to an apple from elsewhere, the value of an apple would diverge by swapping between apples and bananas
ofc the graph I'm talking about is a bit more complicated ^.^
not yet
1 apple = 2 bananas = 4 apples = 8 bananas = 16 apples = 32 bananas
so if my 'base value' for an apple is non-zero it will diverge
wait that doesnt make sense tho
more realistic case: 1 apple = 5 dollars; 1 apple = 2 bananas; 1 banana = 3 dollars. whats the dollar value of a banana
1 apple= 2 bananas = 4 apples
that doesnt make sense
thats where your error is
that cant be true
the trades are not transitive or reflective
it is a directed graph
so bannas->apples needs not be 1/(apples->bananas)
the diverging case would only happen if the dev of the game seriously messes up. But i can not exclude the possibility
maybe transitive was the wrong word
but they surely are not reflective
yeah they should be transitive, I used the wrong word there sorry
yea but if its not reflective how does anything have value
example: I can get 1 banana for 2 apples or 1apple for 2 bananas but i can also sell 1 apple for 1 dollar(my 'value' is meassured in dollars here)
wait
edited
in that case 1 apple is worth 1 dollar and a banan is worth 0.5 dollars
thats the simplest case i can imagine for a non-reflective loop with convergent values
also the transitions are more complicated.... I think a graph could not be enough to represent it. for example you could get 1 apple 3 kiwi and a grapefruit for 10 bananas. (but only all at once, no single trading) in some cases
there is no supply and demand
its all 'trades' set by the game dev
I think I have an idea how to tackle it. I may need to divide by zero a bit, but thats okay ;)
ok if you have some code or a graph let me know
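A sketch of the queue idea from this thread, written as Bellman-Ford style rounds so that a diverging cycle is detected instead of looping forever. It assumes every trade is one-to-one (1 unit of src becomes rate units of dst) and that a currency is worth the best it can be turned into; bundle trades like the "1 apple, 3 kiwi and a grapefruit for 10 bananas" case are not covered. All names are made up for the example.

```python
def currency_values(direct, conversions, tol=1e-12):
    """direct: {currency: value} for currencies with a known value.
    conversions: list of (src, dst, rate), meaning 1 src converts into `rate` dst.
    Returns {currency: best value}; raises ValueError on a value-amplifying cycle."""
    nodes = set(direct)
    for src, dst, _ in conversions:
        nodes.update((src, dst))
    value = {n: direct.get(n, 0.0) for n in nodes}

    # Without an amplifying cycle the best conversion chain visits each currency
    # at most once, so the values settle within len(nodes) rounds.
    for _ in range(max(1, len(nodes))):
        changed = False
        for src, dst, rate in conversions:
            candidate = rate * value[dst]
            if candidate > value[src] + tol:
                value[src] = candidate
                changed = True
        if not changed:
            return value
    raise ValueError("values keep growing: some conversion cycle multiplies value")

# Convergent example from the chat: 2 apples -> 1 banana, 2 bananas -> 1 apple,
# and 1 apple sells for 1 dollar.
print(currency_values({"apple": 1.0}, [("apple", "banana", 0.5), ("banana", "apple", 0.5)]))
```

With the diverging trades (1 apple -> 2 bananas, 1 banana -> 2 apples) the same call raises ValueError, which is the "dev seriously messed up" case.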
Best intro data science course also what should I Know before getting into an intro to data science course?
@wide rose is a json-serialized dictionary with dummy data (some data I don't yet have accurate values of, other data depending on my gamestate and I haven't set up those calcuations yet) okay?
It's 1am here, maybe tomorrow?
just started making some code to input the data. started with nested loops with global data, now refactoring to reasonable function calls. too lazy to refactor out the global data dict hehe
hahahha
Thanks guys
Found a video for creating deep fakes https://youtu.be/RsOJJd1q6Bg
Hi, I have a question regarding data visualization. I have a simple ORM Event model with a single attribute, date_created. (to track volume of API calls over time) How would I go about visualizing this in graphs of different resolutions? For instance, there may be a graph with 15 min resolution that sums up all of the events that occurred within that time span, or an hour resolution that sums up the events within that hour. Isn't there some python library that can make an interactive graph with JS that plugs into the web frameworks?
This is all new to me, any pointing in the right direction is appreciated. Thanks.
I heard plotly is based on plotly.js so that might be what ur looking for. Though its performance isn’t as good as the c based ones.
Thank you. And 'histogram' was the word I was looking for, a lot more pieces fell into place once I discovered the concept I had in my head had an actual name lol
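For the different resolutions, one option is to do the counting in pandas before plotting. A sketch with made-up timestamps standing in for the date_created values loaded from the Event model:

```python
import pandas as pd

# Made-up date_created values; in practice these would come from the ORM query.
date_created = pd.to_datetime([
    "2020-05-01 10:02", "2020-05-01 10:07", "2020-05-01 10:20",
    "2020-05-01 10:50", "2020-05-01 11:05",
])
events = pd.Series(1, index=date_created)  # one count per event

per_15min = events.resample("15min").sum()  # events per 15-minute bucket
per_hour = events.resample("60min").sum()   # events per hour
print(per_15min)
print(per_hour)
```

Either series can then be handed to whichever plotting library is used as a bar or line chart; empty buckets show up as 0.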
Hi,
I’m trying to implement a CNN with Numpy only and I have a problem that the Convolutional layer is very slow - takes ~1 second...
def run(self, x, is_training=True):
    """Convolves the filters over 'x' """
    if self.filters is None:
        self.filters = self.initialize_weights((self.units, x.shape[0], *self.filter_size))
        self.grads = self._init_bias_weight_like()
    if is_training:
        self.cache['X'] = x
    n_filt, dim_filt, size_filt, _ = self.filters.shape
    dim_img, size_img, _ = x.shape
    if dim_filt != dim_img:
        raise ValueError("Image and filter dimension must be the same")
    size_out = int((size_img - size_filt) / self.stride) + 1
    out = np.zeros((n_filt, size_out, size_out))
    for filt in range(n_filt):
        y_filt = y_out = 0
        while y_filt + size_filt <= size_img:
            x_filt = x_out = 0
            while x_filt + size_filt <= size_img:
                out[filt, y_out, x_out] = np.sum(
                    self.filters[filt] * (x[:, y_filt: y_filt + size_filt, x_filt:x_filt + size_filt])
                    + self.bias[filt]
                )
                x_filt += self.stride
                x_out += 1
            y_filt += self.stride
            y_out += 1
    out = self.activation.apply(out, is_training)
    return out
Does anybody have an idea how to improve it? Thanks
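A possible direction, sketched as a standalone function rather than the class method above: build all patches at once with sliding_window_view (NumPy 1.20 or newer) and let einsum do the multiply-and-sum for every filter and position, which removes the Python loops. Shapes follow the code above; conv_forward is a made-up name. One assumption to flag: if self.bias[filt] is a scalar, the original adds it inside np.sum, so it is counted once per multiplied element; this sketch adds the bias once per output value, which is the usual convention.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20

def conv_forward(x, filters, bias, stride=1):
    """x: (channels, H, W); filters: (n_filt, channels, k, k); bias: (n_filt,).
    Returns (n_filt, H_out, W_out) with the bias added once per output value."""
    k = filters.shape[-1]
    # windows[c, i, j] is the k x k patch of channel c with top-left corner (i, j).
    windows = sliding_window_view(x, (k, k), axis=(1, 2))
    windows = windows[:, ::stride, ::stride]
    # Sum over the channel axis and both patch axes for every filter at once.
    out = np.einsum("cijyx,fcyx->fij", windows, filters)
    return out + bias[:, None, None]

rng = np.random.default_rng(0)
x = rng.random((3, 8, 8))
filters = rng.random((4, 3, 3, 3))
bias = rng.random(4)
out = conv_forward(x, filters, bias, stride=2)
print(out.shape)
```

The activation and the training cache would stay as they are in run(); only the nested loops are replaced.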
Are you following any tutorial for that? (I'm asking cuz I'm curious)
@uncut shadow No
Does anyone know a paper or a blog which goes in-depth about the architecture/technologies?
Computer architectures?
Thanks guys
@eager heath my apology I did not include the extra crucial detail.
I am looking for GAN generator/discriminator data basically
I appreciate the advice
can anyone recommend a good data science book for a beginner ?
@lapis sequoia python datascience handbook
@lapis sequoia you are welcome
what is the fastest way to iterate through a function 100s or 1000s of times that gives a string output and add the output to a list?
at the moment i just do, for i in range(1000):
then append the output to a list
speed matters because i will end up doing it several million times
@solar phoenix As you probably already know, append has an amortized O(1) cost [which doesn't mean that each individual append costs O(1) time; it just means that because python over-allocates with append, on average, the append operation costs O(1)], so over many many times, that O(1) cost should be very close to actually being O(1)
@solar phoenix have you tried using list comprehentions (sry for that spelling) because as it doesn't call append it is much faster to create large list item by item
iirc it is also faster than list(map(...))
It still has to recreate the list every now and then, the same as append although it does have some other optimisations
It might be faster to pre-allocate a list if you know the exact number of iteration ahead of time. But, even that is debatable. You can easily test which way is faster in your particular code using a smaller run with python's timeit
check out an actual time comparison and do one yourself maybe https://stackoverflow.com/questions/22225666/pre-allocating-a-list-of-none
@spark stag I have not tried that but will now
@ivory plank ok did not know about pre allocation
thanks all
Try this badly written quick program I wrote @solar phoenix , you can appropriately define new ways and give it to the list of functions to time them
import timeit
from collections import deque  # needed by deque_way

def append_way(n, to_append):
    l = []
    for _ in range(n):
        l.append(to_append)

def pre_allocate(n, to_append):
    l = [""]*n
    for i in range(n):
        l[i] = to_append

def list_comprehension(n, to_append):
    l = [to_append for _ in range(n)]

def deque_way(n, to_append):
    d = deque()
    for _ in range(n):
        d.append(to_append)

def main():
    n = 10**1
    to_append = "test"
    for func in ["append_way", "pre_allocate", "list_comprehension", "deque_way"]:
        seconds = timeit.timeit("{}(n,to_append)".format(func), setup="from __main__ import {};n={};to_append='{}'".format(func, n, to_append), number = 1)
        print("{} takes {} seconds".format(func, seconds))

if __name__ == "__main__":
    main()
@ivory plank awesome will do, thanks so much
it's not actually seconds btw, it's usecs. I forgot about the defaults (EDIT: actually, it's seconds. The thing that's actually the problem here is that timeit by default repeats the code 1M times. To make it 1 time, add "number =1" in the timeit call. But, none of this actually changes the difference in timing between the different functions)
Understood. A million might be excessive...
I have a pandas dataframe with a column 'score' in the range [-1, 1] and I have 10-15 terms in other columns. What would be the best tool to understand how these n-terms predict the score?
Finding the correlation between each continuous feature and the score is a good start. Plotting each feature vs the score also gives a good indication.
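A minimal sketch of that first step, with a made-up frame: 'score' is the column from the question, while term_a and term_b are placeholder names for the other columns.

```python
import pandas as pd

# Made-up data: 'score' in [-1, 1] plus two hypothetical feature columns.
df = pd.DataFrame({
    "score":  [-0.8, -0.3, 0.0, 0.4, 0.9],
    "term_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "term_b": [2.0, 2.0, 1.0, 2.0, 2.0],
})

# Pearson correlation of every feature with the score, strongest first.
correlations = df.corr()["score"].drop("score").sort_values(key=abs, ascending=False)
print(correlations)
```

Correlation only picks up linear, one-feature-at-a-time relationships, which is why plotting each feature against the score is worth doing alongside it.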
You guys are so helpful! Thank you
@polar acorn thank you
@wise igloo Are you being sarcastic or something?
?
Besides python what else should I know before going into an intro to data science course?
Programmingwise you should probably take a quick look at numpy and pandas. Mathwise you should be familiar with calculus, basic stats and some linear algebra. All of these can be learnt at the same time as you're doing an intro to data science course and that is perhaps what I would recommend. Just jump into the course, pause and dive into the math or libraries you don't understand.
@polar acorn from your comment I take it you mean to start with scatter plots then go into linear regression?
Hello,
I have a problem and I have spent about two hours on it but still unsolved!!
I have a dataset which contains nearly 600,000 data. It is the air pollution of a city. I want to train my machine with 599,999 other data and predict one of them.
Like I drop the data in row 100 and train the machine with 599,999 data and my goal is to predict the dropped row. But I get an error.
I really appreciate it if you could help me.
df = df.head(100000)
df["Measurement date"] = pd.to_datetime(df["Measurement date"])
df["Year"] = df["Measurement date"].apply(lambda x:x.year)
df["Month"] = df["Measurement date"].apply(lambda x:x.month)
df["Day"] = df["Measurement date"].apply(lambda x:x.dayofweek)
df.drop(["Latitude","Longitude","Address","Measurement date"] , axis=1 , inplace=True)
df.drop(100, axis=0, inplace=True)
a=[101,0.004,0.05,0.002,0.9,59,39,2017,1,3]
mine = pd.DataFrame(index=["Station code","SO2","NO2","O3","CO","PM10","PM2.5","Year","Month","Day"] ,
data=a , columns=["Goal"])
y = df["PM10"]
X = df[["Station code","SO2","NO2","O3","CO","PM2.5","Year","Month","Day"]]
from sklearn.model_selection import train_test_split
X_train, y_train, X_test, y_test = train_test_split(X,y,test_size=0.4, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
predictions = lm.predict(mine)
print(len(predictions))
print("==========================")
print(df["PM10"])
This is the code.
Hi, I am working on a multilabel classification problem where I have many possible labels (probably like 30-40 in total)... Generally speaking, is creating a somewhat accurate multilabel classification model straightforward when I have so many possible labels? I have about 50k images which I will have tagged and I am looking to train locally with a 1070 gpu
@umbral aspen A 30-40 dimensional output isn't uncommon for a conventional problem. The difficulty of the problem depends entirely on your data. The one thing I would look out for is noise in your data and negative samples/negative data [if a particular sample belongs to class 0, does your data ensure that it doesn't also belong to class 1?]. Your GPU is a little underpowered to train state of the art models with large datasets, but also remember that the training difficulty is usually more dependent on the complexity of the model you pick and not the problem itself. I'd personally first do data analysis to find what makes my data easy/difficult to model, and then start with the most simple model that I believe would work for my problem, only moving on to more complex models if my performance isn't sufficient. If your data isn't very complex, even an SVM is pretty good at modeling things.
http://nvidia-research-mingyuliu.com/gaugan check that site
X_train, X_test, y_train, y_test = train_test_split(...)
@lapis sequoia
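To expand on that fix: `train_test_split` returns the splits in the order `X_train, X_test, y_train, y_test`, so in the original unpacking the name `y_train` actually held the test split of the features. The row to predict also has to be a single row with the same feature columns as `X`. In the original, `mine` has the features as rows, so it would need transposing (`mine.T`) and the `PM10` entry dropped. A rough sketch on made-up numbers, with a reduced column set for brevity:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the air pollution data (values are made up).
rng = np.random.default_rng(101)
features = ["SO2", "NO2", "O3", "CO", "PM2.5"]
df = pd.DataFrame(rng.uniform(0, 1, size=(1000, len(features))),
                  columns=features)
df["PM10"] = 40 * df["PM2.5"] + 10 * df["CO"] + rng.normal(scale=1.0, size=len(df))

# Hold out one row (index label 100) and train on everything else.
held_out = df.loc[[100]]        # double brackets keep a one-row DataFrame
train_df = df.drop(index=100)

X = train_df[features]
y = train_df["PM10"]

# Note the order of the four returned values.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=101)

lm = LinearRegression()
lm.fit(X_train, y_train)

# Predict the dropped row: one row, same feature columns as X.
prediction = lm.predict(held_out[features])
print("predicted:", prediction[0], "actual:", held_out["PM10"].iloc[0])
print("R^2 on the test split:", lm.score(X_test, y_test))
```

If the only goal is to predict that one dropped row, the extra train/test split is optional; it is kept here because the original code had it.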
@ivory plank I need to first do some manual tagging of the images but then the quality will be good. Also this isn't a problem where I have to classify 1 class per image, as each image could have multiple classes (multi label)... So not sure how much extra complexity this adds to my model...
Ah that sounds like a noisy dataset
But your problem sounds very similar to ImageNet
You might be able to use large parts of ResNet and our current efforts on efficient ImageNets
hey, fellow people! i've got pretrained GPT-2 model here that I want to load with gpt2_simple library. how could i do this? two models don't match up, especially since one is tensorflow model and other is pytorch. maybe anyone got gpt2_simple analog for pytorch model?
@lapis sequoia I saw that you were looking for stuff related to Data Science. I would recommend this, https://www.youtube.com/channel/UCKaajyjktvduM6mmuBtAOyg
@blazing bridge thanks
If you like it, I would appreciate it if you subscribed
There will be another course on pandas, scikit-learn and matplotlib
Can u recommend any book for data science?
gosh, i've been trying to wrap my head around it for 2 hours already and still can't get it to work
https://github.com/mgrankin/ru_transformers
anyone gets how to make those models up and running?
it seems that those models just lack tokenizers and i don't understand how to finetune them without tokenizers
i haven't looked through all the code
but the github documentation has a tokenizer step (5.5)
hey guys
well i plotted a decision tree using matplotlib but i can't zoom in for some reason
im talking about this zoom to rectangle thing
featureNames = ["Sex", "FamilySize", "Age","Pclass" , "Fare","Embarked"]
classNames = ["Survived",'Succumbed']
fig, ax = plt.subplots(figsize=(10, 10))
plot_tree(clf,feature_names=featureNames,class_names=classNames,filled=True,ax=ax)
plt.show()
this is my code
really
@silk forge the code shouldn't be the issue. are you just clicking the magnifying glass, or dragging to create a rectangle for it to zoom into? if that doesn't work either, you can try holding the right mouse button and dragging to zoom
if there are no errors idk what the issue could be. did you try using right mouse drag to zoom in? it rescales each axis as you drag. it's not the most convenient fix but if it works it's better than nothing
Hey can anyone help me with something related to data visualisation
I wanna recreate this graph using matplotlib and I need help figuring out the code
What does the code look like
@crimson umbra I am happy to have a look for you. Send me the code if you can.
Hello guys, I'm currently struggling a bit implementing SA (simulated annealing) to optimize a solution for an assignment. My solution consists of a list containing numbers from 0 to 11, each representing a position in a storage array. Anyway, my code executes but does not find any improvement, which is definitely wrong. Does anyone see something wrong here? For the neighbour solution our script said to pick some random neighbour
ur last if statement says currentCost < currentCost? o.0
so you’re only doing 1 round of randint for the neighbour before dropping the temperature? when I implemented this for MC a while ago I seem to recall I attempted multiple random “flips” per temperature
The TA laid some groundwork. And the pseudocode we got was this
No word about how to choose our neighbour so I just assumed it's like this
maybe I have to play with the temps a little bit more but its like no improvement at all, so i thought it must be because i made a mistake in the algo somewhere or with choosing the neighbour but dunno
yeah the neighbour function seems off.. what's the physical context? as in, what does a neighbour mean in this assignment?
when I did this, it was in the context of modeling spins in a lattice. so states are up/down of spins in a lattice, and the neighbours are well-defined
We have a warehouse and some storage shelves. Each iteration a shelf from our demand list gets called and placed in a queue, and then placed back into our warehouse. We have to optimize the location where it's placed back so that the distance these shelves move gets minimized. Each number in the solution list is the nth free slot, so 0 is the nearest free slot (the warehouse is a 1D array) and 11 the farthest.
ah it's a 1D array. hmm so wouldn't the neighbour just be +/- 1 of the current index?
Doubt changing more than 1 of the numbers in our list helps. Thing is, to calc the cost you have to simulate the whole process, and the only chance I see of improving it is with a lower temp and something like 0.9999 as the cooling coeff, which will take ages to compute
Yeah, also thought of it. Will try it out. They didn't specify the neighbour in our class so I thought changing a number also counts as a neighbour, but +-1 will prob see better results
as in, before the while loop, select a random position, inside the while loop find its neighbour (50% chance of +/- 1), change its state, calculate the Cost, perform the acceptance algorithm
the next loop, pick its neighbour, and so on
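Here's a rough sketch of how the pieces discussed above could fit together. It's one reading of the +/- 1 idea, not the TA's pseudocode: `toy_cost` is a placeholder (the real cost needs the warehouse simulation), and all names and parameter values are made up. It also covers the two points raised earlier: several moves per temperature, and comparing the new cost against the best cost instead of against itself.

```python
import math
import random

def toy_cost(solution):
    # Placeholder cost: the real assignment simulates the warehouse.
    # Here lower slot numbers are simply cheaper.
    return sum(solution)

def neighbour(solution, low=0, high=11):
    # Copy the solution and nudge one randomly chosen entry by +/- 1,
    # clamped to the valid slot range.
    candidate = list(solution)
    i = random.randrange(len(candidate))
    step = random.choice((-1, 1))
    candidate[i] = min(high, max(low, candidate[i] + step))
    return candidate

def anneal(start, cost, temp=10.0, cooling=0.95, min_temp=1e-3,
           tries_per_temp=20):
    current, current_cost = list(start), cost(start)
    best, best_cost = current, current_cost
    while temp > min_temp:
        for _ in range(tries_per_temp):       # several moves per temperature
            cand = neighbour(current)
            cand_cost = cost(cand)
            delta = cand_cost - current_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / temp).
            if delta <= 0 or random.random() < math.exp(-delta / temp):
                current, current_cost = cand, cand_cost
                if current_cost < best_cost:  # compare against the best so far
                    best, best_cost = current, current_cost
        temp *= cooling
    return best, best_cost

random.seed(0)
start = [random.randint(0, 11) for _ in range(12)]
best, best_cost = anneal(start, toy_cost)
print(toy_cost(start), "->", best_cost)
```

Keeping a separate `best` means a late uphill move can't lose the best solution seen so far.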
If we select a random pos before our while loop, it won't go over the other positions. After it has run through once it's done, and I'm not allowed to change that
yeah big f should have chosen web-dev that would be an easy a
Hi everyone, I'm currently working with pandas. I have an Excel file, and when I load it into a pandas df, some cells in certain columns show as NaN, although those cells have values in the Excel file. Is this because of the values' type in the Excel file?
thats really weird behavior @sick nacelle
do you mind uploading the excel sheet
and posting your code
does anyone know how FSTs work
@raw rapids I can provide the excel file. The code part it's just loading the excel in a df, though i got the code that generates that excel file.
Hey @sick nacelle!
It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg.
Feel free to ask in #community-meta if you think this is a mistake.
so
i have no experience with jupyter notebooks, I watched a 30 minute video.
is it worth spending time working with it, or should I do a dashboard with some descriptive stats
well if you are planning on doing more data science
then you should make a larger attempt to learn jupyter notebook
but a dashboard with descriptive stats is fine too
@sick nacelle , I dont know how you can share the excel files. If you find some online sharing for excel you could ping me
@brisk moth, FSTs are usually for NLP projects
o
idk how
you want to know how to code it in python
bummer
sorry tho
there is an openfst package in python
and pywrapfst
im more into spacy for nlp
@brisk moth
open fst seems really easy to use
so the states are like the Q0 and Q1 and the arcs are the transitions?
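Roughly, yes. A toy sketch in plain Python (this is not the OpenFst API, just dictionaries, and the example machine is made up): the states are `q0` and `q1`, and each arc reads one input symbol, writes one output symbol and moves to the next state. This one capitalises the first letter of every word.

```python
import string

# Arcs: (state, input symbol) -> (output symbol, next state)
# q0 = at the start of a word, q1 = inside a word
arcs = {}
for ch in string.ascii_lowercase:
    arcs[("q0", ch)] = (ch.upper(), "q1")  # first letter of a word: capitalise
    arcs[("q1", ch)] = (ch, "q1")          # inside a word: copy through
arcs[("q0", " ")] = (" ", "q0")
arcs[("q1", " ")] = (" ", "q0")            # a space ends the word

def transduce(text, start="q0", final=("q0", "q1")):
    """Follow one arc per input symbol and collect the emitted symbols."""
    state, out = start, []
    for sym in text:
        if (state, sym) not in arcs:
            raise ValueError(f"no arc from {state} on {sym!r}")
        emitted, state = arcs[(state, sym)]
        out.append(emitted)
    if state not in final:
        raise ValueError("input ended in a non-final state")
    return "".join(out)

print(transduce("finite state transducer"))  # Finite State Transducer
```

A plain finite-state acceptor is the same thing without the output symbols on the arcs.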
well if you are planning on doing more data science
@raw rapids yes, but I am trying to pad my resume asap
its no use padding ur resume unless you actually know how to work with your tools. Otherwise you'll be lost even if u get a job.
Concrete knowledge with your tools will also allow you to create better projects to pad your resume, so there's no point not learning them.
@real wigeon
right, but jupyter notebooks is less important than knowing pandas, matplotlib, or numpy
?
well its one of the main ways of sharing data science related work
so if you want to display work you've done (for others to see) in an easy to run notebook, jupyter is usually a good way to go. Also, when ur running models / doing data science ur going to be making visualizations, which is much easier in jupyter usually.
they're different tools for different things
yeah
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE

# load_alexa / load_dga are helpers defined elsewhere (not shown here)
x1_domain_list = load_alexa("top-100.csv")
x2_domain_list = load_dga("dga-cryptolocke-50.txt")
x3_domain_list = load_dga("dga-post-tovar-goz-50.txt")
x_domain_list = np.concatenate((x1_domain_list, x2_domain_list, x3_domain_list))
# labels: 0 = Alexa (benign) domains, 1 = DGA domains
y1 = [0] * len(x1_domain_list)
y2 = [1] * len(x2_domain_list)
y3 = [1] * len(x3_domain_list)
y = np.concatenate((y1, y2, y3))
# print(x_domain_list)
cv = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                     token_pattern=r"\w", min_df=1)
x = cv.fit_transform(x_domain_list).toarray()
# apply KMeans and TSNE ...
k_means = KMeans(init='k-means++', n_clusters=2, random_state=170)
k_means.fit(x)
# assign the labels to a variable
k_means_labels = k_means.labels_
# assign the cluster centres to a variable
k_means_cluster_centers = k_means.cluster_centers_
x_embedd = TSNE(n_components=2,
                learning_rate=100,
                random_state=170).fit_transform(x)
y_pred = k_means.predict(x)
# fig, ax = plt.subplots(figsize=(7,7), dpi=100)
# plt.scatter(x_embedd[:, 0], x_embedd[:, 1], c=y_pred)
print('Before TSNE: ', x.shape)
print('Accuracy: ', np.mean(y_pred == y) * 100)
print('After TSNE: ', x_embedd.shape)
```
please feel free to help. I don't know why my accuracy is at 17%, before it was 79%. have I overfit my data or something? I am a bit lost, thanks in advance
what did u do before?
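One thing worth ruling out (a guess, since the earlier version of the code isn't shown): KMeans cluster ids are arbitrary, so cluster 0 doesn't have to be your class 0. Comparing `y_pred == y` directly can report a low accuracy just because the two ids are swapped. KMeans never sees `y`, so this isn't overfitting in the usual sense. A small numpy-only sketch for the two-cluster case (the helper name is made up):

```python
import numpy as np

def two_cluster_accuracy(y_true, y_pred):
    """Accuracy for 2 clusters, taking the better of the two id mappings."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    direct = np.mean(y_pred == y_true)          # cluster 0 -> class 0
    flipped = np.mean((1 - y_pred) == y_true)   # cluster 0 -> class 1
    return max(direct, flipped)

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 0, 0, 0, 0])   # similar grouping, ids swapped

print(np.mean(y_pred == y_true))             # naive comparison
print(two_cluster_accuracy(y_true, y_pred))  # id-invariant comparison
```

With a fixed `random_state` the ids are reproducible for the same input, but they can still come out the other way round once the data or the feature matrix changes.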
Would you recommend the https://www.udemy.com/course/machinelearning/
Can anyone suggest a way i can paste code to Colab with Android mobile phone? Trying to be productive on the way to work and have sublime text on my phone. So once i finish a snippet, i thought I could just copy and paste it to Colab. BUT - so far i have been unsuccessful to paste to Colab via my mobile browser. The attached list appears when I hold my finger on the screen (no paste). Thanks!!
btw, i've searched online a bit for the solution, but couldn't find anything useful. Wondering if problem is myself.
Would you recommend the https://www.udemy.com/course/machinelearning/
@blazing bridge
I guess Udemy courses are good enough. Just check the background of the instructor and the course first, and then you can surely go for it imo
@rustic igloo Even if the app doesn't allow you to paste, your mobile keyboard should allow you to do so regardless. You should be able to select a text you type and then paste using your phone's default menu or from your phone's keyboard/clipboard
Hi guys! I have a very "newbie" question, so I'm learning about ML algorithms, and how to implement them in Python. So my question is about Linear Regression, If I split my data into train and test sets, should I count the accuracy for the train set too, or is it okay if I just count it for the test set?
you should count for both
Both. Because it will tell you about overfitting /underfitting
Thanks! 🙂
What do you mean by overfitting / underfitting? (Too low, or too high accuracy score?)
@ivory plank thanks. I am finally able to paste using the phone's Gboard clipboard.
well a neural model tries to emulate the true relationship between the inputs and outputs right
but since we are not given data on the entire population (only a sample), the pattern in our sample is likely different from the true one.
When we try to fit on the relationship represented by our inputs and outputs in our sample too well (ie pick up patterns that are very subtle) our model will not generalize well to the pattern for the total population since these subtle patterns do not exist in the population.
Underfitting is when we fit on the sample so poorly, that we miss out on major patterns. These major patterns generally also occur in the true relationship, so its important to find those patterns.
Underfitting and overfitting are tradeoffs. The less you try to overfit, the more potential for underfitting.
@sonic raft
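To see it in numbers, here is a small sketch on synthetic data (the sine "true relationship", the sample sizes and the degrees are all arbitrary choices). It fits polynomials of increasing degree to a small sample and reports the error on both the training sample and a larger held-out sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical ground truth: a sine curve plus noise.
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(15)    # the small sample we actually have
x_test, y_test = make_data(200)     # stands in for "the population"

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

train_mse, test_mse = {}, {}
for degree in (1, 5, 12):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse[degree] = mse(coeffs, x_train, y_train)
    test_mse[degree] = mse(coeffs, x_test, y_test)
    print(f"degree {degree:2d}: train MSE {train_mse[degree]:.3f}, "
          f"test MSE {test_mse[degree]:.3f}")
```

The training error can only go down as the degree goes up. A high error on both sets points to underfitting, and a tiny training error next to a much larger test error points to overfitting, which is why it helps to report both.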
Thanks, that's really helpful! So Let's say, I underfitted my model(Simple Linear Regression model) , how should I cure this problem? Change train/test scale?(I usually use the 80/20 variation)
@flat quest
hi everybody
I'm looking for a labeled text-based dataset with less than 50% accuracy for my uni project
can you help me?
not necessarily change train and test sizes, but u could use a more complex model.
for example a multilayer ReLU network will be able to fit progressively more complex relationships, such as non-linear ones with many local maxima and minima
@sonic raft
Thanks!
I imagine there's a bit of bias here, but would the most logical order to learn the 3 languages be Python -> SQL -> R?
I would separate SQL from the others
you want to know SQL nevertheless
you should learn that in parallel to other things
Well
having that said, what kind of work do you intend to do?
I've barely touched R so far in my jobs where I dwelled in machine learning stuff
manipulating the data in excel mostly
didn't even know VBA
but now with all this time, I intend to learn proper data processing / visualization
I want to get back into data for banks I guess, gathering insights for what works and what to do next
if you want to learn a "real" programming language... go with Python
probably some ML to prevent attrition by predicting behaviors that lead to closed accounts
R's basically a language created by statisticians and it shows
So ultimately tracking customer transaction histories (massive DB of line by line per account)
to see what actions trend towards a closed account
such as if they stop using a debit card
having that said, if you intend to work with time series analysis in general, then I would for sure recommend R
or their DD disappears
trigger them for a contact from a banker or something
but there's 10k - millions of transactions a day depending on the bank size
but working with big databases like that to gather insights for customer behaviors to ensure maximum profitability
I have 0 coding experience 😦
outside of the bit of SQL I had to try to work with during a Salesforce integration
although when I caught an error that the expert made I felt pretty good, haha
alright, so I'll go back to learning Python, thanks again! Datacamp is offering a free week, so I wanted to maximize my time with the platform
yeah, go with Python + SQL for now.
I might go round robin between the two
do their intro course to python, then SQL, then do the next python, then next SQL
❤️ this server, you all are the best
yeah, that seems like a good approach
Hi guys!! I have an issue with pandas that I'm sure is bc I am new. I don't know if you guys want me to post my code or just a screenshot, but all I can do is a ss for now.
Basically I'm entering in all information from previous dataframes, using user inputs of previous columns!
share the code if possible
and state specifically what's the problem you're having
I can in a little bit. Those NaNs shouldn't be there
Under dry sample. If I do dry sample first I get NaNs, if I do weathered sample first I get NaNs. I tried doing fillna in that for loop but it never worked
I think it's something that the two dataframes indices don't match?? But it's hard to know what question to ask and thus what to search for.
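Can't tell without the code, but mismatched indices are a common cause of NaNs like that. When you assign a column from one DataFrame into another, pandas lines the rows up by index label, not by position, and any label that has no match becomes NaN. A tiny sketch with made-up frames and column names:

```python
import pandas as pd

# Two hypothetical frames whose row labels don't line up.
dry = pd.DataFrame({"mass": [1.0, 2.0, 3.0]}, index=[0, 1, 2])
weathered = pd.DataFrame({"mass": [1.5, 2.5, 3.5]}, index=[3, 4, 5])

summary = pd.DataFrame(index=dry.index)
summary["dry"] = dry["mass"]
summary["weathered"] = weathered["mass"]   # labels 3..5 don't exist here -> NaN
print(summary)

# If the rows are meant to match by position, drop the labels first.
fixed = pd.DataFrame(index=dry.index)
fixed["dry"] = dry["mass"]
fixed["weathered"] = weathered["mass"].to_numpy()   # positional, no alignment
print(fixed)
```

`weathered.reset_index(drop=True)` before the assignment would work as well. Printing `df.index` for both frames is a quick way to check whether this is what is happening.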
hi everybody
I'm looking for a labeled text-based dataset where less than 50% accuracy has ever been achieved, for my uni project
can you help me?
that code looks a bit weird
you intend to append rows to a dataframe
but I think you're always replacing entire columns each time you do those assigments
nvm, I can see that you really intend to operate column wise
I don't know the best way to go about this. Just kinda winging it. I know nothing of coding this is like 1 month of quarantine in fruition
but it seems off to me, are you sure that's what you really want to do?
what do you want to do then
What do you mean? Code wise
yes, code wise it looks weird
Inefficient??
no, just plain wrong
Or just like logically doesn't make sense
Oh shit ok. Let me get my code I'll be back on in a bit.
Attached is the pastebin. At the top has an explanation of the code, a tldr of what it's meant to do, and what issue I am having.
I want to note I'm very new to coding and this is my first project. I want to use this for myself as I am a material scientist by trade and WFH I spent all this time learning python.
Every problem I've had I addressed by googling questions, but this I haven't been able to see a lot of people have this issue.
If you guys can look at this please let me know if there's a fix I can use. If you do look at it and reply, I'd highly appreciate it if you can tag me
Whats the best way to encode text data for building tree models so that we dont get dim curse. I have used target encoding , any alternatives?
One hot is better for tree based
One hot creates large vectors > large dims > complex trees
U can always limit tree size
But my bad I meant better for non tree based
Integer (label) encoding creates relationships that aren't there: red being 1 and blue being 2 doesn't mean blue is "more" than red.
But one hot can make the tree split on color red or blue rather than on the
For trees prob just use target encoding for now, since one hot causes those features to lose importance in the model even when they shouldn't
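To make the two encodings concrete, a small pandas sketch (the `color` / `target` columns and their values are made up):

```python
import pandas as pd

# Hypothetical data: one categorical column and a binary target.
df = pd.DataFrame({
    "color":  ["red", "red", "blue", "blue", "green", "green", "green"],
    "target": [1, 0, 1, 1, 0, 0, 1],
})

# One-hot: one 0/1 column per category (gets wide for many categories).
one_hot = pd.get_dummies(df["color"], prefix="color")

# Target (mean) encoding: replace each category by the mean target for it.
# In real use the means should come from the training folds only,
# otherwise the target leaks into the feature.
category_means = df.groupby("color")["target"].mean()
df["color_te"] = df["color"].map(category_means)

print(one_hot.shape)   # (7, 3)
print(df)
```

Target encoding keeps a single column no matter how many categories there are, which is what keeps the dimensionality down, at the cost of having to guard against leakage.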