#data-science-and-ml


woven saffron
#

Maybe its too light?

#

I am normalizing the greyscale to a range between 0 and 1

#

By dividing by 255.0

flat quest
#

well thats the index
are u sure the index correlates with the same class value?

woven saffron
#

I am not sure of that

#

I assumed it was because of the way it was trained

#

Is there a way I can find this out?

#

Can I do like model.labels[idx] or something?

#

Not exactly that

#

But somehow get the label for it

flat quest
#

well it'll follow the datasets label format
oh wait this is sparse, so that shouldnt be an issue

woven saffron
#

Let me try with other numbers

flat quest
#

yeah it might just be one bad one

woven saffron
#

Yeah something is weird

#

I would assume the model wouldn't be this bad

#

It is predicting 3 for everything

#

Lol

flat quest
#

lol

woven saffron
#
PS C:\Users\Ryan\Desktop\ml-api> python .\predict.py
{'confidence': [1.6521667149409522e-18, 2.794477973674936e-12, 0.002019522013142705, 0.9946495890617371, 0.0, 0.002819732530042529, 2.1464751850941433e-11, 0.0005111345089972019, 3.013474395628175e-32, 9.863308066247621e-36], 'prediction': 3}
PS C:\Users\Ryan\Desktop\ml-api> python .\predict.py
{'confidence': [1.6521667149409522e-18, 2.794477973674936e-12, 0.002019522013142705, 0.9946495890617371, 0.0, 0.002819732530042529, 2.1464751850941433e-11, 0.0005111345089972019, 3.013474395628175e-32, 9.863308066247621e-36], 'prediction': 3}
PS C:\Users\Ryan\Desktop\ml-api> python .\predict.py
{'confidence': [2.4253387200669225e-17, 1.805287798591071e-12, 0.08808434754610062, 0.868319034576416, 0.0, 0.042480047792196274, 7.983710914594155e-11, 0.0011165498290210962, 1.889110677748859e-29, 2.5717616017503705e-33], 'prediction': 3}
PS C:\Users\Ryan\Desktop\ml-api> python .\predict.py
#

This is multiple runthroughs

#

All with different digits

#

I had this issue before where it was predicting 5 for everything

flat quest
#

hm

#

mine is working fine lol

#

and its the same code as yours

#

whats the shape of the digit_data?

#

it should be sent as a batch size of one -> np.array([img])
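A quick way to check the batch-of-one shape mentioned here. This is only a sketch: `img` is a zero-filled stand-in for a 28x28 greyscale digit, not the real input.

```python
import numpy as np

# Stand-in for a single 28x28 greyscale digit (hypothetical data).
img = np.zeros((28, 28), dtype=np.float32)

# Three equivalent ways to add a leading batch dimension of size one.
batch_a = np.array([img])
batch_b = np.reshape(img, (1, 28, 28))
batch_c = np.expand_dims(img, axis=0)

print(batch_a.shape, batch_b.shape, batch_c.shape)
```

All three give the same array, so wrapping in a list and reshaping should behave identically for a single image.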

woven saffron
#

I am doing this

#

With my image

#
np.reshape(arr, (1, 28, 28))
flat quest
#

yup thats the problem

#

just tested it
its probably not the same image when u reshape it

#

try plotting it before and after u reshape it

woven saffron
#

When I receive the image I do this to it

#
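import numpy as np
from PIL import Image

def read_digit(data: bytes, encoding: str) -> np.ndarray:
    # Rebuild the 28x28 image from the raw bytes sent by the client
    im = Image.frombytes(encoding, (28, 28), data)
    im.save('im.png')  # saved so the received image can be inspected
    arr = np.array(im)
    arr = arr / 255.0  # scale 0-255 greyscale down to 0-1
    return np.array([arr])  # batch of one: shape (1, 28, 28)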
#

I just changed it

#

Still same

#

I am viewing the image I read

#

And it looks good

#

However I didn't consider what it looks like after I do arr = arr / 255.0

flat quest
#

hm

#

yeah the reshaping looks fine after i looked over again, i forgot to update the ar values
normalizing shouldnt be an issue

woven saffron
#

Yeah I am very confused

flat quest
#

can u try running one of the images in the test dataset?

#

instead of a custom one

woven saffron
#

Yeah

flat quest
#

those probability output numbers look really similar actually, you might be running the same image into the detector each run

woven saffron
#

Unfortunately that isn't the case

flat quest
#

hmm
did u try running the test image?

woven saffron
#

Yeah 1 sec

#

So it is supposed to be a 7

#

According to y_test

#
[[1.0039120e-06 4.1181980e-08 1.7208897e-05 1.5117071e-04 1.2190830e-10
  5.2620344e-06 7.6247827e-11 9.9981230e-01 7.5138960e-06 5.5338201e-06]]
#

It is a 7 here

#

So the image input is somehow screwing up

flat quest
#

yeah that's what i suspected
since the models running as expected

its likely that ur referencing the same img somehow

can i see ur full code for getting the images and feeding them into the model?

woven saffron
#
import requests
from PIL import Image
import numpy as np
from io import BytesIO
import tensorflow as tf

model = tf.keras.models.load_model('mnist.model')

def read_digit(fname):
    print(f'LOADING {fname}')
    img = Image.open(f'example-digits/{fname}').convert('L').resize((28, 28))
    return BytesIO(img.tobytes())


img = read_digit('4.png')

r = requests.post('http://127.0.0.1:5000/MNIST/predict', files={'file': img})

if r.status_code == 200:
    print(r.json())
else:
    print(f'Error: {r.status_code}')
#

I checked the fname and it changes

#

I can even show the image to make sure

#

This is what gets sent with the 4

#

Aka the current code

#

Is the stroke weight of the images not high enough?

#

Maybe it needs to be thicker?

flat quest
#

hm it could be, tho that doesn't explain why ur getting the same result
i would understand if the model performance is bad, but the probabilities are too similar each run

woven saffron
#

Yeah I ran through a couple and the proper image is being loaded

#

I even check on the API side

#

And it is the same thing I sent

flat quest
#

but the model works for normal images

#

from the test dataset

woven saffron
#

Correct

#

Let me show what an image from the dataset looks like

#

Lets see if its way thicker

#

Wait...

#

test data array looks like

#
[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.3254902  0.99215686 0.81960784 0.07058824 0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.08627451
  0.91372549 1.         0.3254902  0.         0.         0.
  0.         0.         0.         0.        ]
#

Meaning white is 0

flat quest
#

from the test dataset or from ur images?

woven saffron
#

So rn mine is inverted

#

Test dataset

#

So anything that is 255 needs to be 0

#

And then the blacks get normalized down to 0 to 1 scale

flat quest
#

? but the number isnt white

woven saffron
#

Well there is lots of 0 which I assume is supposed to be the background

#

This is just 2 rows actually

#

I tried opening this as image

#

And its pure black

#

Even after doing it *= 255

#

To bring it back

#
first_x *= 255.0

img = Image.fromarray(first_x, mode='L')
img.show()
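
One thing that may matter when previewing a normalized sample: mode 'L' holds 8-bit pixels, so a float array may not render as expected unless it is cast first. A minimal sketch, assuming `first_x` is a 28x28 float array in the 0 to 1 range (the array below is made up):

```python
import numpy as np
from PIL import Image

# Hypothetical stand-in for one normalized 28x28 test image (values in 0-1).
first_x = np.zeros((28, 28), dtype=np.float64)
first_x[4:24, 13:15] = 1.0  # a vertical stroke

# Scale back to 0-255 and cast to 8-bit before building the image.
pixels = (first_x * 255.0).round().astype(np.uint8)
img = Image.fromarray(pixels)  # a 2-D uint8 array becomes a greyscale ("L") image
# img.show() would open the image in a viewer window
```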
#

Yeah so I made sure

#

0 is background

#

Values > 0 is digit pixels

#

I didn't mess with print options a lot

#

But you can see the numbers make a 7 lol

#

Which means I need to prep my data differently

#

255 (white) -> 0

#

Anything else gets scaled
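
The mapping described here (white background to 0, darker stroke pixels scaled up towards 1) can be sketched as an invert-then-normalize step. The helper name `to_mnist_style` and the sample patch are made up for illustration.

```python
import numpy as np

def to_mnist_style(pixels: np.ndarray) -> np.ndarray:
    """Invert a 0-255 greyscale image (dark digit on white) and scale to 0-1."""
    pixels = pixels.astype(np.float32)
    return (255.0 - pixels) / 255.0

# Hypothetical 2x3 patch: white background (255), one black pixel, one grey pixel.
patch = np.array([[255, 255, 255],
                  [255,   0, 128]], dtype=np.uint8)
print(to_mnist_style(patch))
```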

#

Yeah @flat quest I am getting different values now but still some are wrong

flat quest
#

are most still correct tho?

woven saffron
#

When I send a 1, I get this

flat quest
#

its the convert ("L") thats making it black

#

try removing that

woven saffron
#

But it predicts this as a 4

flat quest
#

lol u shouldnt be getting 0's for all values, those probabilities should sum to 1

woven saffron
#

This is the pixel values

#

So the actual pixels that make up the digit get scaled between 0 and 1

#

And background is 0

#

And I need to somehow encode the image to get it in a 28x28 array so it isn't rgba

woven saffron
#

Managed to invert the image so it is what the model expects, but still nada

#

Oh well, will try tomorrow

flat quest
#

yeah lol this is a lot harder than expected to solve

woven saffron
#

This is the image I am sending

#

Before it gets normalized by dividing by 255

#

The downscaling is messing with it a lot

#

I might need to multiply all the pixels that aren't hard black by some factor

#

Sending this through works

#

So it is 100% a downscaling issue @flat quest last time I tag you sorry

#

Just thought you might wanna know

#

I drew the digit in a 28x28 px window so it didnt need to downscale

#

And of course it works

flat quest
#

yeah no all good
tho i still don't get why it doesnt work tbh

woven saffron
#

I think it is too noisy

flat quest
#

i wouldve expected that a model could work on lighter images of numbers but i guess it just needed to be trained on lighter images as well

woven saffron
#

Yeah I think the dataset has a lot of large stroke images

#

So I might just do like

#

If its not black scale the white up to 255

#

Or set a very low threshold

flat quest
#

the stroke and also the color might be an issue
u could add those kinds of images in ur training data

through data augmentation

woven saffron
#

Like anything > 50 gets set to 255
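
A rough sketch of that threshold idea, assuming the image is already inverted (bright strokes on a black background) and stored as 0 to 255 values. The function name and the default of 50 are illustrative only.

```python
import numpy as np

def boost_strokes(pixels: np.ndarray, threshold: int = 50) -> np.ndarray:
    """Push every pixel brighter than `threshold` to full white (255)."""
    return np.where(pixels > threshold, 255, pixels).astype(np.uint8)

# Made-up row of pixels: background, faint noise, faint stroke, strong stroke.
row = np.array([0, 30, 51, 200], dtype=np.uint8)
print(boost_strokes(row))
```

Pixels at or below the threshold are left alone, so faint downscaling noise stays faint while the stroke becomes solid.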

#

What is data augmentation?

flat quest
#

its basically transforming some of ur data so u have a greater variety
u might change the brightness, rotation, shear, randomly change a color channel, etc.

#

it allows the model to learn more types of objects
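
To make the idea concrete, here is a small NumPy-only sketch of augmentation on a 0 to 1 greyscale image: a random shift plus random dimming. It is not how any particular library does it, and the `augment` helper is hypothetical.

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly shifted and dimmed copy of a 0-1 greyscale image."""
    dy, dx = rng.integers(-2, 3, size=2)  # shift by up to 2 pixels each way
    # np.roll wraps pixels around the edges, which is fine for a black border.
    shifted = np.roll(img, shift=(int(dy), int(dx)), axis=(0, 1))
    brightness = rng.uniform(0.5, 1.0)    # simulate lighter strokes
    return np.clip(shifted * brightness, 0.0, 1.0)

# Made-up digit: a vertical stroke in the middle of a 28x28 canvas.
digit = np.zeros((28, 28), dtype=np.float32)
digit[4:24, 13:15] = 1.0

rng = np.random.default_rng(0)
variants = [augment(digit, rng) for _ in range(5)]
```

Training on the original image together with such variants is one way to expose the model to lighter or off-center digits.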

obtuse skiff
#

I have two images that are binary images and need to find the similarity. I was looking at using Pearsons (np.corrcoef). how do I go about doing this

#

what am I putting in the X and Y

unreal thistle
#

Hey guys

merry ridge
#

Just reshape your images into column vectors and throw them in.
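
A minimal sketch of that suggestion, using two made-up binary images of the same shape. Flattening each one gives the X and Y vectors for `np.corrcoef`.

```python
import numpy as np

# Two made-up binary images of identical shape.
img_a = np.array([[1, 0, 1],
                  [0, 1, 0]])
img_b = np.array([[1, 0, 0],
                  [0, 1, 0]])

# Flatten each image to a 1-D vector; corrcoef returns a 2x2 matrix,
# and the off-diagonal entry is the Pearson correlation between the two.
r = np.corrcoef(img_a.ravel(), img_b.ravel())[0, 1]
print(r)
```

Note that the correlation is undefined (NaN) if either image is constant, for example all zeros.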

lapis sequoia
#

what you need is the inner dot product

#

look up cosine similarity
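
For reference, cosine similarity is the dot product of the two flattened images divided by the product of their norms. A small sketch with made-up binary images (the zero-vector fallback of 0.0 is an arbitrary choice here):

```python
import numpy as np

def cosine_similarity(x: np.ndarray, y: np.ndarray) -> float:
    """Cosine of the angle between two arrays, treated as flat vectors."""
    x = x.ravel().astype(float)
    y = y.ravel().astype(float)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom else 0.0

img_a = np.array([[1, 0, 1], [0, 1, 0]])
img_b = np.array([[1, 0, 0], [0, 1, 0]])
print(cosine_similarity(img_a, img_b))
```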

#

but what sort of images are these

#

what do you mean by binary images

jolly briar
#

@lapis sequoia is the outer product ever referred to as dot?

#

I thought that was only used for inner product

lapis sequoia
#

you're right

#

inner product is the dot product.. inner dot product is a weird way of saying it

deft jacinth
#

can i install postgresql + pgadmin so i can have a web ui thing on an ubuntu 18.04 vps

lapis sequoia
deft jacinth
#

oh i missread im dumb sorry

faint furnace
#

How do I delete rows in which birthYear is \N? dropna() doesnt work

flat bough
#

@faint furnace \N is not a missing value. dropna() removes only missing ones. Just drop

past pewter
faint furnace
#

i was trying something like this but its not even recognising it

#

is it because my "birthYear" is object type?

flat bough
#

try typing \\N. Because in a string \N reads as a special symbol, just like \n or \t

faint furnace
#

i am also trying to replace the "\" but i think my code is wrong

flat bough
#

if you want to use \ in a string you need to type it twice

faint furnace
flat bough
#

maybe it changes it only at the first occurrence?

faint furnace
#

now this removed everything

#

noice

paper niche
#

try replacing by np.nan? ‘NaN’ is still a string
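
Two ways this could look in pandas, sketched on a made-up frame with a string-typed `birthYear` column. The raw string `r"\N"` (equivalently `"\\N"`) is the literal two-character marker.

```python
import numpy as np
import pandas as pd

# Made-up sample; the literal marker is a backslash followed by N.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "birthYear": ["1950", r"\N", "1987"],
})

# Option 1: keep only the rows whose birthYear is not the marker.
kept = df[df["birthYear"] != r"\N"]

# Option 2: turn the marker into a real missing value, then dropna() works.
cleaned = df.assign(birthYear=df["birthYear"].replace(r"\N", np.nan))
cleaned = cleaned.dropna(subset=["birthYear"])

print(kept)
print(cleaned)
```

Replacing with the string 'NaN' would not help, since `dropna()` only removes true missing values such as `np.nan`.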

jagged plume
#

Hi, sorry to interrupt, can someone help with contour sorting? 😄

#

Any help would be greatly appreciated!

jolly briar
#

@faint furnace try posting a small data example, df.to_json() will enable you to do that

solar phoenix
#

Hi, I dont know if someone can help me. I have a set of letters (they are amino acids), I have 6 of them. I want to get every combination of 6 possible, including just 20 repetitions of one of them. Perhaps the fact that there will be millions of them means that this is basically impossible?

#

note: i want the combinations to be 20 letters long

jolly briar
#

@solar phoenix what's the list?

solar phoenix
#

The letters are I, A, G, L,F, V, M

jolly briar
#

list(itertools.permutations(['i', 'a', 'g', 'l', 'f', 'v', 'm']))

faint furnace
#

Thank you forcousteau helped me with that.

jolly briar
#

@faint furnace sure - i mean in general though, that is a useful thing to do

faint furnace
#

yea i actually am checking what this line of code does

#

takes a little while to run it

solar phoenix
#

rie this gives me a list that is 7 long

#

i want one that is 20 long

#

so for example, one output would be LLLLLLLLLLLLLLLLLLLL

jolly briar
#

there's one in there that matches that

#

oh hang on no, i don't understand why you would get that from a permutation of those elements

#

every combination of 6 possible, including just 20 repetitions of one of them

i don't follow

solar phoenix
#

i want to create strings of length 20

#

that include absolutely every combination of those 7 elements

#

it is probably going to be too many isn't it

jolly briar
#

it's going to be a lot

solar phoenix
#

ye

spark stag
#

i think its 20**20 combinations?

#

oh wait 7**20

solar phoenix
#

yeah

#

7**20

spark stag
#

something like that

solar phoenix
#

is too much

spark stag
#

a LOT

solar phoenix
#

ok

#

i'll re think

#

thanks

spark stag
#

what u need it for? xD

solar phoenix
#

each of the letters represent amino acids

#

and i know the properties i want

jolly briar
#

7^20 ? where's that from?

solar phoenix
#

in length 20

jolly briar
#

there are 7! ways to arrange 7 things, you have 20 spaces... I'm trying to remember combinations etc 🤦‍♂️

silk acorn
#

Is it not 20**7?

solar phoenix
#

so i want to make a list of all of them

spark stag
#

maybe but with binary 10 digits it 2**10 ways, this has 7 states length 20

jolly briar
#

20^7, now where's that from?

silk acorn
#

20 * 20 * 20 etc for each char

solar phoenix
#

yeah Grote, i think you are right

#

1280000000

silk acorn
#

Anyway, you are looking at itertools.combinations_with_replacement

jolly briar
#

Idk, 20^7 sounds off

silk acorn
#

20 options for each of 7 characters.
Wait now that I type it out that is the wrong way round

#

7 ** 20 indeed.

spark stag
#

binary has 2 states so number of states is 2**length so with 7 states i thought it would be 7**length tho

#

yh, but tbh i just guessed the order, there was a 7 and a 20 and the answer is very big

jolly briar
#

Well how many ways are there to arrange the 7 characters?

#

it's going to be 7! right?

#

no exponent there

silk acorn
#

That's without replacement though.

#

They said 20 * the same letter was a valid option

solar phoenix
#

yep same letter valid

jolly briar
#

That's without replacement though.
ah yeah , shit

spark stag
#

i was thinking of it as a base 7 number problem with 20 digits. how many 5 digit decimal numbers are there? 0-99999 = 100000 = 10 (number base) ** 5 (length of number)
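
The base-7 view also gives a way to look up any single string by its index without generating the others. A sketch, assuming the seven letters listed earlier; the `sequence_at` helper is made up for illustration.

```python
LETTERS = "IAGLFVM"
LENGTH = 20

def sequence_at(index: int, letters: str = LETTERS, length: int = LENGTH) -> str:
    """Return the index-th string by reading the index as a base-len(letters) number."""
    base = len(letters)
    if not 0 <= index < base ** length:
        raise IndexError("index out of range")
    chars = []
    for _ in range(length):
        index, digit = divmod(index, base)  # peel off the lowest digit
        chars.append(letters[digit])
    return "".join(reversed(chars))

total = len(LETTERS) ** LENGTH
print(total)                   # 79792266297612001
print(sequence_at(0))          # IIIIIIIIIIIIIIIIIIII
print(sequence_at(total - 1))  # MMMMMMMMMMMMMMMMMMMM
```

Combined with random indices, this allows sampling from the full space without ever storing it.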

silk acorn
#

itertools.combinations_with_replacement(['i', 'a', 'g', 'l', 'f', 'v', 'm'], 20)

spark stag
#

is that even a good idea to run? how long do u think that will take xD

silk acorn
#

0 seconds.

spark stag
#

really!?

silk acorn
#

Since it doesn't actually make the strings right away untill you use them.

jolly briar
#

creates a generator

silk acorn
#

It creates a generator.

solar phoenix
#

ah

spark stag
#

oh ok thats a more efficient way to do it

solar phoenix
#

so when i loop through this

#

that is when it will become an issue

spark stag
#

so as long as u dont call list on it your ok

jolly briar
#
In [101]: len(list(itertools.combinations_with_replacement(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 20)))
Out[101]: 230230
silk acorn
#

Python can do that much yeah.

jolly briar
#

this is fine

solar phoenix
#

thanks for this all

jolly briar
#

this is way smaller than some of those numbers though 🤔

#
In [108]: samp = pd.Series([''.join(x) for x in itertools.combinations_with_replacement(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 20)])

In [109]: samp.sample(20)
Out[109]:
113891    aabbbcccceeeeffffffg
131937    aaccdeeffffggggggggg
115873    aabbbdddddeeeeeffggg
82491     aaabbbbcdeeefffffggg
173967    accdeeeeeeffffgggggg
90337     aaabcccccccccddddeee
56377     aaaabbbbbbccccceefff
101120    aabbbbbbbbbbbbbcdeeg
224770    cccddddddeeeeggggggg
216041    bccdddddfffffffggggg
115529    aabbbceeeeeeeeeegggg
131396    aaccddddddeeeeefgggg
95043     aaaccccccccccccddeef
4470      aaaaaaaaaaacccdfgggg
65145     aaaabbceeeeeffffffgg
16792     aaaaaaaaccccccccdeff
36986     aaaaaacccdgggggggggg
216190    bccdddeeeeeeffffffff
100210    aaadddddeeeefffffggg
23192     aaaaaaabcccccddfffff
dtype: object
#

looks alright

eager heath
#

You might want to use a generator expression instead, if pandas is okay with that

jolly briar
#

for what? this is fine

#
In [112]: pd.DataFrame(samp).info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230230 entries, 0 to 230229
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   0       230230 non-null  object
dtypes: object(1)
memory usage: 1.8+ MB
eager heath
#

You could have a pretty big memory burst here, and it could fail in some circumstances

#

I mean, the list

jolly briar
#

there's really not much

eager heath
#

The lists have way more overhead compared to a dataframe

jolly briar
#

really, that surprises me

#

not too sure how to check the memory usage for that though

eager heath
#

Well, it doesn't hurt, you just need to change the [] by ()

#

You can use sizeof()

jolly briar
#
In [113]: pd.DataFrame(samp).info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230230 entries, 0 to 230229
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype
---  ------  --------------   -----
 0   0       230230 non-null  object
dtypes: object(1)
memory usage: 16.9 MB

with deep

eager heath
#

But it will not count strings, which have even more overhead

jolly briar
#

how to get the memory usage of a list then

#
In [118]: l.__sizeof__()
Out[118]: 1880784

this is inaccurate?

#
In [119]: l = [''.join(x) for x in itertools.combinations_with_replacement(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 20)]

that's l

#

that seems close with the original dataframe response

eager heath
#

You also need to count all the strings inside

jolly briar
#

why would it do them separately? but ok, i'll check

eager heath
#

Here it is just counting the chain of references, which are more or less lightweight
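
A sketch of the difference being discussed, using `sys.getsizeof`: the list's own size only covers its array of references, so the strings have to be added separately. A shorter string length (4) is assumed here to keep the example quick.

```python
import sys
from itertools import combinations_with_replacement

strings = ["".join(c) for c in combinations_with_replacement("abcdefg", 4)]

# Size of the list object alone: header plus one reference per element.
shallow = sys.getsizeof(strings)

# Add the size of every string object the list points to.
deep = shallow + sum(sys.getsizeof(s) for s in strings)

print(len(strings), shallow, deep)
```

The exact byte counts depend on the Python version and platform, so no specific numbers are claimed here.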

solar phoenix
#

I don't know how it can only be 230230

#

it must be more

jolly briar
#
In [120]: x= ''.join(str(x) for x in l)

In [121]: x.__sizeof__()
Out[121]: 4604649
#

seems pretty small still

silk acorn
#

!e
print(7**20)

arctic wedgeBOT
#

@silk acorn :x: Your eval job has completed with return code 1.

001 |   File "<string>", line 1
002 |     print(7"*20)
003 |                ^
004 | SyntaxError: EOL while scanning string literal
silk acorn
#

!e
print(7**20)

arctic wedgeBOT
#

@silk acorn :white_check_mark: Your eval job has completed with return code 0.

79792266297612001
jolly briar
#

that's a big number

#
In [127]: samp[samp.str.startswith('f')]
Out[127]:
230209    ffffffffffffffffffff
230210    fffffffffffffffffffg
230211    ffffffffffffffffffgg
230212    fffffffffffffffffggg
230213    ffffffffffffffffgggg
230214    fffffffffffffffggggg
230215    ffffffffffffffgggggg
230216    fffffffffffffggggggg
230217    ffffffffffffgggggggg
230218    fffffffffffggggggggg
230219    ffffffffffgggggggggg
230220    fffffffffggggggggggg
230221    ffffffffgggggggggggg
230222    fffffffggggggggggggg
230223    ffffffgggggggggggggg
230224    fffffggggggggggggggg
230225    ffffgggggggggggggggg
230226    fffggggggggggggggggg
230227    ffgggggggggggggggggg
230228    fggggggggggggggggggg
#

neat

spark stag
#

nice to look at too

solar phoenix
#

there should be examples of fgfffffffffffffff

spark stag
#

probably somewhere (way) further down the dataset

jolly briar
#

no these are sorted

#

you can also see the index, and the previous len

#
In [130]: list(itertools.combinations_with_replacement(['a', 'b', 'c'], 4))
Out[130]:
[('a', 'a', 'a', 'a'),
 ('a', 'a', 'a', 'b'),
 ('a', 'a', 'a', 'c'),
 ('a', 'a', 'b', 'b'),
 ('a', 'a', 'b', 'c'),
 ('a', 'a', 'c', 'c'),
 ('a', 'b', 'b', 'b'),
 ('a', 'b', 'b', 'c'),
 ('a', 'b', 'c', 'c'),
 ('a', 'c', 'c', 'c'),
 ('b', 'b', 'b', 'b'),
 ('b', 'b', 'b', 'c'),
 ('b', 'b', 'c', 'c'),
 ('b', 'c', 'c', 'c'),
 ('c', 'c', 'c', 'c')]
solar phoenix
#

dislist=list(itertools.combinations_with_replacement(['a', 'b'],3))

#

[('a', 'a', 'a'), ('a', 'a', 'b'), ('a', 'b', 'b'), ('b', 'b', 'b')]

#

there is no 'b''b''c'

#

for example

jolly briar
#

well you couldn't have c in your example

#

it's not in the list

solar phoenix
#

oh yeah

#

lol

#

sorry

jolly briar
#

there's no babb in the version that i posted above tho

spark stag
#

but your example only contains half the samples it should have, was that the whole output?

jolly briar
#

mine, yes

spark stag
#

should be length 81, its not making every combination

jolly briar
#

it's making what i posted

spark stag
#

its like the original one only made like 230000 but should have had way more

solar phoenix
#

ye agree

jolly briar
#

yeah, it's not done babb in the small example

solar phoenix
#

I think there will be too many examples for this, I will have to think of a new approach

#

thanks all for your help

silk acorn
#

Oh, looks like combinations_with_replacement only gives the sorted ones

#

My bad
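
The gap between the two counts can be checked on the small example: `combinations_with_replacement` ignores order, while `itertools.product` with `repeat` yields every ordered string. A short sketch:

```python
from itertools import combinations_with_replacement, product
from math import comb

letters = ["a", "b", "c"]
length = 4

unordered = list(combinations_with_replacement(letters, length))
ordered = list(product(letters, repeat=length))

print(len(unordered))                       # 15
print(len(ordered))                         # 81
print(("b", "a", "b", "b") in unordered)    # False
print(("b", "a", "b", "b") in ordered)      # True

# The same counts for 7 letters and length 20, computed without generating anything.
print(comb(7 + 20 - 1, 20))                 # 230230
print(7 ** 20)                              # 79792266297612001
```

`product(letters, repeat=20)` is also lazy, but iterating over all 7**20 results is still impractical.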

jagged plume
#

Anyone can offer a helping hand on contour sorting using Tensorflow and OpenCV, please?

jolly briar
#

@solar phoenix

{('a', 'a', 'a', 'a'),
 ('a', 'a', 'a', 'b'),
 ('a', 'a', 'a', 'c'),
 ('a', 'a', 'b', 'b'),
 ('a', 'a', 'b', 'c'),
 ('a', 'a', 'c', 'b'),
 ('a', 'a', 'c', 'c'),
 ('a', 'b', 'b', 'b'),
 ('a', 'b', 'b', 'c'),
 ('a', 'b', 'c', 'c'),
 ('a', 'c', 'b', 'b'),
 ('a', 'c', 'c', 'b'),
 ('a', 'c', 'c', 'c'),
 ('b', 'a', 'a', 'a'),
 ('b', 'a', 'a', 'c'),
 ('b', 'a', 'c', 'c'),
 ('b', 'b', 'a', 'a'),
 ('b', 'b', 'a', 'c'),
 ('b', 'b', 'b', 'a'),
 ('b', 'b', 'b', 'b'),
 ('b', 'b', 'b', 'c'),
 ('b', 'b', 'c', 'a'),
 ('b', 'b', 'c', 'c'),
 ('b', 'c', 'a', 'a'),
 ('b', 'c', 'c', 'a'),
 ('b', 'c', 'c', 'c'),
 ('c', 'a', 'a', 'a'),
 ('c', 'a', 'a', 'b'),
 ('c', 'a', 'b', 'b'),
 ('c', 'b', 'a', 'a'),
 ('c', 'b', 'b', 'a'),
 ('c', 'b', 'b', 'b'),
 ('c', 'c', 'a', 'a'),
 ('c', 'c', 'a', 'b'),
 ('c', 'c', 'b', 'a'),
 ('c', 'c', 'b', 'b'),
 ('c', 'c', 'c', 'a'),
 ('c', 'c', 'c', 'b'),
 ('c', 'c', 'c', 'c')}

this was it right?

#

although what i've just written doesn't want to scale lol

solar phoenix
#

@jolly briar yeah this is it

#

What did you run

jolly briar
#

@solar phoenix what i wrote didn't really scale

solar phoenix
#

Oh

jolly briar
#

i mean - it might have run with patience, i didn't have patience

#

@solar phoenix doesn't this pattern have a name?

#

i'd have thought it would have been done before somewhere and you could just use their file / data

solar phoenix
#

Yeah I thought that too

#

What exactly did you run to get that?

jolly briar
#

i'll get the code it should be in memory

solar phoenix
#

@jolly briar cool thanks

jolly briar
#
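import itertools
import string

unique_chars = 3
string_len = 4
permutations = itertools.permutations(list(string.ascii_letters[:unique_chars]))

# Note: this only yields strings in which each letter forms one contiguous run
# (39 of the 3**4 = 81 possible strings here), so e.g. 'abab' is not produced.
all_combinations = []
for perm in permutations:
    c = list(itertools.combinations_with_replacement(perm, string_len))
    all_combinations.append(c)

all_combo_list = list(itertools.chain.from_iterable(all_combinations))
all_combo_list_unique = set(all_combo_list)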
solar phoenix
#

Oh nice

jolly briar
#

this will generate the above list, you can change params 3,4 at the top there

solar phoenix
#

Yeah I see that

jolly briar
#

but don't just stick 7,20 in as it probably won't run

solar phoenix
#

@jolly briar yeah I might see if it can run it on a supercomputer or something

jolly briar
#

i think 5,20 will run 🤔

#

there's probably plenty of room for optimising the above, if it's salvageable

solar phoenix
#

Think that I could use that and then just run it on a server somewhere. Thanks for this- cool solution

arctic cliff
#

i need some help with xml parsing

wintry mural
#

Did someone try using machine learning algorithms for stock/crypto trading

#

I've got a bit of free time so I want to try out to code something like trading bot as a personal project

#

So if someone worked on something similar to this, I would like to hear your experiences

blazing bridge
#

For anyone new and looking into getting into Data Science and Machine Learning. We have made a Youtube channel related to Data Science and Machine Learning and it would mean a lot if you could check it out and if you like it, please subscribe. https://www.youtube.com/channel/UCKaajyjktvduM6mmuBtAOyg

main narwhal
lapis sequoia
#

Hi! I can get help here in machine learning (time series) ?

lapis sequoia
#

Upload the question, so we can see...

lapis sequoia
#

I need to make a machine learning model on the time series to predict the quality of communication.
In this dataset, I need to predict the “Y” column.

#

I plotted a linear plot of y versus date, as well as ACF and PACF.

#

The Dickey-Fuller criterion is 0.
What model can be built and how to determine the parameters for it?The dataset itself was collected over 14 days and contains ~ 7 million rows. I averaged the value over a period of 1 minute. The dataset currently contains 20,160 rows

crimson umbra
#

can anyone help me with something regarding to converting DOB to age in an excel form?

spark pelican
#

Anyone wnna help me with my data analytics assignment? - its about pandas and stuff

faint furnace
#

i generated this by using the code. but i want the labels as well. how do i do that ?

spark stag
#

@faint furnace you need to pass labels to plt.pie(). In your case, create a list of labels like ["Drama", "Comedy"...], then add the argument labels=labels so it looks like plt.pie(genrevotes1, labels=labels). That should show the categories. Just a warning: with that many categories the text may get cramped around the smaller slices

#

btw you don't need to manually create the list if the data is in a dataframe, you just need to get the row names

faint furnace
#

yes i know that but i want to create the plot directly through the series i have

#

genrevotes1. has 1 column as all the genres which I want as labels

faint furnace
#

solved. i was able to do it by typing
labels=genrevotes1.index
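
A self-contained sketch of that fix. The series below is made up; only the `labels=series.index` part comes from the message above. The Agg backend is selected so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up vote counts indexed by genre.
genre_votes = pd.Series({"Drama": 120, "Comedy": 80, "Action": 45})

fig, ax = plt.subplots()
# The index of the series supplies one label per wedge.
wedges, texts = ax.pie(genre_votes, labels=genre_votes.index)
```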

cunning wadi
#

Hey guys

#

Is there a simple way of changing a grid solving program from monte carlo approach to temporal difference learning

#

I have the full monte carlo approach code and just struggling to convert it

split drift
#

Hey,
is there a way to pad a ragged 2d list into a numpy matrix with zeroes?
like [[1,2], [1]] to [[1,2], [1,0]]?

spark stag
#

i don't think so because the 2 original lists are of different length so it will give you an array containing lists (so you can't use numpy features on them)

#

you can probably do it with loops but I don't think there will be a nice numpy feature like .reshape unless you initialize the array with uniform dimensions
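
The loop-based approach could look like this: allocate a zero-filled matrix with uniform dimensions, then copy each row in. The `pad_rows` helper is a made-up name.

```python
import numpy as np

def pad_rows(rows, fill=0):
    """Pad a ragged list of lists on the right so it forms a rectangular array."""
    width = max(len(r) for r in rows)
    out = np.full((len(rows), width), fill)
    for i, r in enumerate(rows):
        out[i, :len(r)] = r  # copy the row; the rest keeps the fill value
    return out

print(pad_rows([[1, 2], [1]]))
# [[1 2]
#  [1 0]]
```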

supple moon
#

Is anyone interested in Stock Market algorithms?

uncut shadow
#

wdym?

supple moon
#

In the development and application of algorithms that trade the markets

silent swan
#

wouldn't recommend it

supple moon
#

why do you say that

calm pewter
#

Is anyone interested in Stock Market algorithms?
@supple moon yep

flat quest
#

its difficult to make an algorithm that will perform well with stock markets, there's so many factors
u'll have to have access to a large amount of quality data

calm pewter
#

its difficult to make an algorithm that will perform well with stock markets, there's so many factors
u'll have to have access to a large amount of quality data
@flat quest And thats why it is fun to think about and play with, right?

supple moon
#

i was looking to collaborate to see if we could build something good

calm pewter
#

too early for me, but I'd like to see some links with concepts here 🙂

flat quest
#

well it would be fun
but might become an unnaturally large project lol (will probably take up a lot of resources) and getting/cleaning data gets more frustrating the more you do it

merry ridge
#

That's my main area of work although a fair number of resources have gotten shifted to pandemic modeling instead.

drifting umbra
#

does anyone know how to use tensor processing unit (TPU) in Google Colab or Kaggle?

#

i am trying to use TPU and think i am following the example code exactly

#

but it is only using CPU

#

can share notebook

#

im doing exactly this

#

please anyone know anything about TPU at all and tensorflow help me

drifting umbra
#

anyone at all please @ or PM me

flat quest
#

are u using a layer / model compatible with tpu? @drifting umbra

drifting umbra
#

@flat quest i believe so. keras Sequential see this:

# instantiate a distribution strategy
tpu_strategy = tf.distribute.experimental.TPUStrategy(tpu)

# instantiating the model in the strategy scope creates the model on the TPU
with tpu_strategy.scope():
    model = Sequential()
    model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(256))
    model.add(Dropout(0.2))
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')


# train model normally - doing this below
# model.fit(training_dataset, epochs=EPOCHS, steps_per_epoch=…)

model.summary()
#

starts as

import numpy
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils
import os
import tensorflow as tf
print("Tensorflow version " + tf.__version__)
...
#

i can upload file if that helps

#

notebook?

arctic canopy
#

What's up guys... So I wanna learn ML and robotics ,something like put these 2 things together but I don't know where to start or which one should I start with,can someone pls give me some recommendation for some tutorial or advice .

flat quest
#

LSTM layers I know are compatible with TPU by default. @drifting umbra
are u running this code?

# detect and init the TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
#

@arctic canopy well you should first learn the basics of each one separately. Not too sure for robotics but ML u should start with some sort of course to give you a good overview.

arctic canopy
#

@flat quest So is it better to start with ML first?

#

and is it that complex? I mean I heard it needs a lot of math and stuff

drifting umbra
#

@flat quest thank you and let me check

#
# detect and init the TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
#

that is exactly what i have

#

can i link you this

flat quest
#

yeah i'll take a look
@arctic canopy

#

it depends on where u want to go with ML

#

basic stuff like simple text generation and classifying images doesn't really need much knowledge with math

#

if you want to make a production ready model, or look into improving existing models through new architectures, then you'll have to learn the math to some extent

#

@drifting umbra don't think i have access to your data

arctic wedgeBOT
#

Hey @drifting umbra!

Uh-oh! It looks like your message got zapped by our spam filter. We currently don't allow .txt attachments, so here are some tips to help you travel safely:

• If you attempted to send a message longer than 2000 characters, try shortening your message to fit within the character limit or use a pasting service (see below)

• If you tried to show someone your code, you can use codeblocks
(run !code-blocks in #bot-commands for more information) or use a pasting service like:

https://paste.pythondiscord.com

drifting umbra
#

just alice in wonderland txt

flat quest
#

lol idk why but i can see the files listed in the input directory
but the os module cannot find it :/

drifting umbra
#

hm

flat quest
#

yeah its odd cause i can find the file using os.walk

arctic canopy
#

@flat quest so basically as you go more deep it will require more math I think, Thanks mate

south dagger
#

Im stoked man just learned a bit of selenium, its so fun

lapis sequoia
#

what?

south dagger
#

learned how to use selenium with python to web scrape

vivid badge
timber jolt
cunning wadi
#

hi guys

#

How would i go about changing this into a temporal difference approach

solar phoenix
#

Let's imagine I have a dataframe with an arbitrary length, and each row has 4 parameters. I want to rank the dataframe by these 4 parameters. At the moment what I do is test each of the parameters against a desired value and then rank them by how close they are to the value. I then add a new column to the dataframe which indicates which rank they are for that parameter. I do this for all 4 parameters then combine the rank of parameters 1-4 and get a "total" rank, with the closest to my desired set of parameters being at the top. Is there a better way to do this? My concern is that a row that has very good score for 3 parameters will constantly outcompete one that has a mediocre score in all 4 parameters.
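One possible way to soften the concern above is to combine scaled distances rather than ranks, so that the size of each miss counts and not just its order. The following is only a sketch: the column names `p1`..`p4`, the target values, and the choice of squared distance are all assumptions made for illustration, not something taken from the real data.

```python
import pandas as pd

# Toy data: column names and target values are made up for illustration.
df = pd.DataFrame({
    "p1": [1.0, 2.0, 3.0, 10.0],
    "p2": [5.0, 5.5, 9.0, 5.0],
    "p3": [0.1, 0.2, 0.9, 0.1],
    "p4": [100.0, 110.0, 300.0, 100.0],
})
targets = pd.Series({"p1": 1.0, "p2": 5.0, "p3": 0.1, "p4": 100.0})

# Absolute distance to the target, scaled per column so that parameters
# measured in different units contribute comparably.
distance = (df[targets.index] - targets).abs()
scaled = distance / distance.max().replace(0, 1)

# Sum of squared scaled distances: one large miss costs more than
# several small ones, which a plain sum of ranks cannot express.
df["score"] = (scaled ** 2).sum(axis=1)
df["total_rank"] = df["score"].rank(method="min").astype(int)

print(df.sort_values("total_rank"))
```

Whether a row that is mediocre everywhere should beat a row with one bad miss is a modelling decision; changing the exponent or adding per-column weights shifts that trade-off.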

jolly briar
#

@solar phoenix might be easier to follow with example data

solar phoenix
#

Hi @jolly briar I'll make an example and send it, cheers

jolly briar
#

df.sample(20,random_state=1).to_json() might be useful

faint umbra
#

Im quite new to kaggle. I have a model now running. Its gonna take 2 hours to make the version. Can I safely close the window?

#

Knowing that my version creation will finish? 😄

twin parcel
#

Hey guys sorry if this is the wrong spot... thoughts on using a .txt or .xlsx to create a wordbank that i will use as an array to compare other parsed text to? im trying to automate my job search a bit

somber rune
#

urgent help needed

#

on google colab

#

can i share screen

#

??

jolly briar
#

@twin parcel why xlsx over csv?

twin parcel
#

just kinda compare them all forgot csv was another type and the best option atm prob

jolly briar
#

you can append to a csv

#

to_csv( mode = 'a') iirc
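A minimal sketch of appending with `to_csv`, assuming a hypothetical `wordbank.csv` with a single `phrase` column (both names are made up here):

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name, written to a temp dir so the sketch is harmless.
path = os.path.join(tempfile.mkdtemp(), "wordbank.csv")

first = pd.DataFrame({"phrase": ["minimum of 6 years"]})
more = pd.DataFrame({"phrase": ["security clearance required"]})

# The first write creates the file, including the header row.
first.to_csv(path, index=False)
# mode="a" appends; header=False avoids writing the column names again.
more.to_csv(path, mode="a", header=False, index=False)

wordbank = pd.read_csv(path)["phrase"].tolist()
print(wordbank)
```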

lapis sequoia
#

completed numpy, pandas and matplotlib. now what next?

twin parcel
#

also am i able to import csv filled with reg expressions by chance?

somber rune
#

read ISL @lapis sequoia

jolly briar
#

what does "completed" mean @lapis sequoia

flat quest
#

what do you mean completed numpy pandas and matplotlib?

lapis sequoia
#

Studied them

jolly briar
#

@twin parcel can you give an example

#

@lapis sequoia what does studied mean here, to what extent

lapis sequoia
#

what's the next step?

jolly briar
#

explaining yourself properly would be a good step imo

lapis sequoia
#

@lapis sequoia what does studied mean here, to what extent
@jolly briar covered these topics with practical data analysis

somber rune
#

@jolly briar do you have nay idea about auto encoders ?

jolly briar
#

@somber rune sorry no 😦

#

@lapis sequoia you're being vague as hell so I'm just going to say go read ESL

somber rune
#

@twin parcel do you ?

twin parcel
#

Im trying to add "minimum of 6 years" to a list im going to filter out but id rather have a reg expression that would cover minimum of 6< years instead of writing "minimum of 6 years, minimum of 7 years, minimum of 8 years" but i also want to keep it imported from a file

#

Err not a clue besides for image and vid but i've used python for a total of 3 hours now so cant help much 😂

somber rune
#

ah okay

flat quest
#

@lapis sequoia lol that is still so vague

jolly briar
#

@twin parcel tbh if you only have a fixed set of cases just writing them out and putting them in a list isn't really a bad idea

#

@twin parcel idk what the source is though - are they guaranteed to have this structure?

twin parcel
#

only problem is im going to cover 6< and that could reach 20 and yes they will always have this structure

lapis sequoia
#

I want to learn machine learning and data science. I am a self learner. Learned Python with OOP concepts, and topics like file handling, regex, web scraping, numpy, pandas and matplotlib. What's the next 3 things should I learn to get one step further? Should I learn scikit now?

twin parcel
#

or if one falls out of it itleast i catch 7/10 and have that many less jobs to read over

jolly briar
#

why would you store regex's in a csv rather than in the script? I don't really follow that

twin parcel
#

because I want to make it simple for someone to adapt without touching the code to much i would call the txt in a loop to use each regex inside it

jolly briar
#

sounds funky

twin parcel
#

it is 😂

jolly briar
#

I don't think i'd store regexs like that

#

they should be in code, data in data

#

separation and all that jazz

#

without a concrete example idk what to say here, other than have your scripting in the script

flat quest
#

in terms of ML
depends on what u want to do

surface level ML or if u just want to work with existing architectures, yeah scikit and tensorflow are a good place to go

if u want to improve existing models/architectures, you're gonna have to learn some aspect of the math. (online vids, courses, wiki are all good for that).

@lapis sequoia

jolly briar
#

if u want to improve existing models/architectures
this seems like an extremely narrow set of people

twin parcel
#

hmmm thinking on that maybe ill format my txt different code the regex in but get the year value based on the files value

#

might use a txt dedicate first few lines to specific stats then split after thoose lines on ,

flat quest
#

not really. A lot of ML startups/companies have some focus on improving existing architectures. Most of the ML stuff we use now was invented in like the 80's.

jolly briar
#

I think it's still very narrow, of the people that are going to learn these tools, that's a narrow set of people

lapis sequoia
#

Can you guys tell me what 3 skills should I learn now? I'm trying to get into data science career with self learning

flat quest
#

yeah, but if you plan to make a career or job out of it. The math can't be ignored.

jolly briar
#

though it depends what improve means I guess - if it's publishing and stuff, very narrow

#

@lapis sequoia what do you think you should do, based on what you've looked at so far

flat quest
#

performance of models. More data/cleaning/feature engineering does help, but only to a certain extent.

jolly briar
#

yeah i think data engineering is more important for most

lapis sequoia
#

Can you specify the topics ?

jolly briar
#

esp. if someone is self learning

flat quest
#

yeah

#

@lapis sequoia if u don't have any clue of any architectures. Start out with linear reg/logistic on scikit. Then try decision trees and gradient boosting.

lapis sequoia
#

I'm trying to set a roadmap and create a curriculum on my own to continue learning from here.

flat quest
#

well those are the topics u should look into

lapis sequoia
jolly briar
#

trying to plan everything out to the nth degree is the biggest waste of time

flat quest
#

^

jolly briar
#

those suggestions from drag are good, do them

#

Don't bother reading medium posts about planning about planning about planning

#

if you've been through np/pandas/mpl as much as you think trying a project is a good test

#

should be able to get some open data from a gov site, clean it, and provide some insights

#

I would bet it's more time consuming than you expect

flat quest
#

yeah got stuck on that for a while too
u never get out of the phase of planning to do something

u need to just start working with ML.

#

yeah cleaning data takes longer than model building imo

lapis sequoia
#

I actually downloaded some datasets from Kaggle and presented visual representation well for practice. Is that all what data analyzing mean?

jolly briar
#

how old are you?

#

because i'm not sure why these questions are phrased as they are

#

Is that all what data analyzing mean?
this question just seems nuts

flat quest
#

presenting means nothing. You need to be able to extract information from it @lapis sequoia

lapis sequoia
#

like?

jolly briar
#

What do you think?

#

do you have any ideas / thoughts of your own?

flat quest
#

maybe ur familiar with this dataset, the kaggle titanic.

Let's say u see that there's a cluster of deaths with people who have the same last name

jolly briar
#

because this is probably the first thing I'd learn, you're not going be given a 3 step guide to anything at work

flat quest
#

then you might infer that those people are part of the same family group, and family groups are likely to all die or survive.
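The surname idea can be checked roughly like this. It is a sketch on made-up rows that follow the Kaggle Titanic `Name` layout ("Surname, Title. Given names"); a shared surname is only a hint of a family group, not proof.

```python
import pandas as pd

# Toy rows in the Titanic "Name" / "Survived" layout; values are made up.
df = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Smith, Mrs. Anna", "Smith, Miss. Eva",
             "Jones, Mr. Tom", "Brown, Mrs. Ida", "Brown, Mr. Carl"],
    "Survived": [0, 0, 0, 1, 1, 1],
})

# Everything before the first comma is the surname.
df["Surname"] = df["Name"].str.split(",").str[0].str.strip()

# Group size and survival rate per surname; rates near 0 or 1 for groups
# larger than one person are the pattern described above.
by_family = df.groupby("Surname")["Survived"].agg(["size", "mean"])
print(by_family)
```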

lapis sequoia
#

I played with some other dataset as well analyzing different scenarios

flat quest
#

aight idk the only advice i can really give

Is to dive into your data and then work with it. Like rie said make a project utilizing the python data-packages and publish the results on a website or on github.

lapis sequoia
#

I don't really understand what type of projects it can be. Can you generalize it?

jolly briar
#

@lapis sequoia have a guess

#

just have a guess at something that shows you have thought for yourself and go from there, maybe it's a good idea

lapis sequoia
#

should be able to get some open data from a gov site, clean it, and provide some insights
@jolly briar like this?

jolly briar
#

using my thought as your own, sure

#

given what you say you have learnt - my biggest concern would be that you don't seem to be able to piece anything together for yourself

#

which suggests that perhaps you've rattled through a few tutorials without digesting / internalising any of it

twin parcel
#

using my thought as your own, sure
@jolly briar sounds like every coding meeting ive been apart of...

jolly briar
#

😄

#

i never did understand that regex thing you were talking about @twin parcel , personally i'd always go for having them in the script, and having data as it's own thing

#

you can just extract the numbers if that's easier, re.search(r'(\d+)', sentence).group(1) looks like it'd catch what you needed
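Building on the `re.search` idea, one pattern plus a numeric comparison can replace a list of "minimum of 6 years", "minimum of 7 years", and so on. This is a sketch: the phrase wording, the `MIN_YEARS_LIMIT` name, and the threshold are assumptions.

```python
import re

MIN_YEARS_LIMIT = 6  # hypothetical threshold; could be read from a settings file

# Matches e.g. "minimum of 7 years" or "Minimum of 10+ years".
PATTERN = re.compile(r"minimum of (\d+)\+? years?", re.IGNORECASE)


def asks_too_much(text, limit=MIN_YEARS_LIMIT):
    """Return True if the text asks for at least `limit` years."""
    match = PATTERN.search(text)
    # Guard against no match: calling .group() on None would raise.
    return match is not None and int(match.group(1)) >= limit


print(asks_too_much("Minimum of 7 years experience required"))
```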

twin parcel
#

I decided to go that route im gonna research regex in python but i added a place holder in the .txt to fill so others can change it and ill just read the number "Minimum Years: 7" is line 2 and im able to grab the 7 easily so ill use that var in regex and it should work out 🙂

#

soon i will have a tool to not waste time on the job search

flat quest
#

share me that tool when u make it 😛

jolly briar
twin parcel
#

Figure ive looked once a day for last few months alone switching tabs prob is around 5 mins a day

#

so already i have it just take the first 10 pages of indeed based on my search and print it in one page, next step filter out requirments im nowhere near as a new grad, theres several extra minutes a day determining if i can. then save it all to a file is the last step. figure at 22 i have a good 40 years of career so this tool can be used later on also

#

also another big benefit is i can now add python to my portfolio as i wanted to make a scraper with it for a while but couldnt find a legal use

jolly briar
#

what's indeeds policy on scraping?

#

showing off a tool that shows you haven't read a data usage policy might be a bit brave 😅

twin parcel
#

ah also true 😩

#

imma read up on that

twin parcel
#

If I read its TOS and other google searches correctly its ok as long as its not for commercial uses but i think imma email there customer service before this is center of my git or im IP banned from a job site 😂

thin remnant
#

i'm having a directory notebooks that contains the fastai directory, inside this notebook folder i also have a folder for each week of exercices. In these directories there are notebooks but they can't find the path to the fastai directory because it is complaining about relative paths, could someone help me out ?

jolly briar
#

@thin remnant if you run things from project root it makes stuff like this a lot simpler

#

so in your notebook you can have something like os.chdir('../../') , or better something like os.chdir(here()) using here() from pyprojroot

thin remnant
#

is it not possible to access the fastai modules by using just a path

#

it's only one directory up

#

i think the os.chdir'../' worked

jolly briar
#

it's only one directory up
doesn't change anything, running from proj root is still simpler?
so you can't, from an interactive session in root, do from blah import blah where blah is what you want

#

i think the os.chdir'../' worked
right, but if you re-run it then it'll keep knocking you back (you'll have to reset kernel)

thin remnant
#

would appending fastai to the pythonpath be a solution?

jolly briar
#

better to use here()

#

i've given what i think is a good solution

thin remnant
#

it says here() not defined

jolly briar
#

because you haven't installed the package I linked i guess

thin remnant
#

should i also place that in the same directory as the fastai directory

jolly briar
#

the package? You'd just install with pip

#

do you have this project version controlled?

thin remnant
#

like this

#

i didn't do any pip install

jolly briar
#

I've no idea what that is

thin remnant
#

pyprojroot

#

thats what u linked me

#

i cloned that from git

jolly briar
#

why would you clone?

thin remnant
#

how do i install than

jolly briar
#

pip

thin remnant
#

pip install pyprojroot?

jolly briar
thin remnant
#

here() is still not defined :/

jolly briar
#

have you followed the readme

thin remnant
#

yes

#

the pyprojroot import here doesnt work

#

no module named pyprojroot

jolly briar
#

are you in jupyter notebook or lab

thin remnant
#

jupyter

jolly briar
#

yeah, which

thin remnant
#

notebook

#

conda

jolly briar
#

you need to restart the kernel

thin remnant
#

i did

#

3 times already xd

jolly briar
#

you installed it in the wrong env then

thin remnant
#

do i have to install in the fastai-cpu env ?

jolly briar
#

if that's what you're using for this notebook then yeah

thin remnant
#

i think so

#

i'll try

jolly briar
#

you have to install things for particular envs, you can use requirements files and such to manage this for you

#

make sure you install to that env, should then work
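A quick way to see which environment a notebook kernel is really using, and whether a package is visible from it, is sketched below using only the standard library:

```python
import importlib.util
import sys

# The interpreter the notebook kernel is actually running on.
print(sys.executable)

# find_spec returns None when the package cannot be imported from here.
spec = importlib.util.find_spec("pyprojroot")
print("pyprojroot importable here:", spec is not None)
```

If the second line prints False, running `python -m pip install pyprojroot` with the interpreter shown on the first line installs it into the matching environment.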

#

also - i tend to use it as os.chdir(here()), just at the top of the notebook

thin remnant
#

import here works now

jolly briar
#

cool

#

so i usually have something like

import os
from pyprojroot import here
os.chdir(here())
<other imports>
thin remnant
#

it still doesnt find fastai

jolly briar
#

which is an odd ordering, i just don't want to use here() throughout the script

thin remnant
jolly briar
#
import os
from pyprojroot import here
os.chdir(here())

do this, then os.listdir(), is it in the root?

thin remnant
#

yes

#

it's in the root

#

should i put the entire path to the project dir ?

jolly briar
#

idk why it wouldn't work when previously os.chdir(.../ stuff did

lapis sequoia
#

hi

thin remnant
#

meh ill just make a path variable

#

not that big of a deal

jolly briar
#

shouldn't have to

thin remnant
#

mmm

#

this is weird

#

@jolly briar what should i do when the dir is changed

#

it still doesn't load the import of fastai

jolly briar
#

@thin remnant looks like it's put you into you ~/ dir, which i doubt is the project root

#

is it?

thin remnant
#

nope ..

jolly briar
#

in your project root git init it

#

git init

#

don't do that in your ~/ dir

thin remnant
#

nvm

#

this will do

jolly briar
#

🤷‍♂️

heavy night
#

I'm trying to generate plot points for a 3d scatter plot. I have the values, but being new to python, numpy, pandas, etc., I'm not sure if I'm capturing and structuring the data in the most simplified way for plotting. Here is my code:

sample_data_subset_intervals = np.unique(sample_data_subset_df['sampling_interval'].to_numpy())
sample_data_subset_durations = np.unique(sample_data_subset_df['sampling_duration'].to_numpy())

scatterplot_raw_data_df = \
    (sample_data_subset_df[['sampling_interval','sampling_duration','sampling_error']]).dropna()
scatterplot_raw_data_df['sampling_error'] = scatterplot_raw_data_df['sampling_error'].abs()

scatterplot_3d_plot_points_dtype = \
    [('sampling_interval', np.int32), ('sampling_duration', np.int32), ('sampling_error', np.float64)]
scatterplot_3d_plot_points = np.empty([0,1],dtype=scatterplot_3d_plot_points_dtype)
plot_points_index = 0

for interval in sample_data_subset_intervals:
    for duration in sample_data_subset_durations:
        if duration <= interval:
            interval_duration_pair_data_subset_df = \
                scatterplot_raw_data_df[(scatterplot_raw_data_df['sampling_interval']==interval) & \
                                        (scatterplot_raw_data_df['sampling_duration']==duration)]
            idp_sampling_error_summation = interval_duration_pair_data_subset_df['sampling_error'].sum()
            idp_mean_sampling_error = \
                idp_sampling_error_summation / len(interval_duration_pair_data_subset_df.index)
            scatterplot_3d_plot_points.resize(plot_points_index + 1,1)
            scatterplot_3d_plot_points[plot_points_index]=(interval,duration,idp_mean_sampling_error)
            plot_points_index = plot_points_index + 1
#

and the output looks like this:

[[(  10,   10, 0.00000000e+00)]
 [(  30,   10, 4.56183120e-04)]
 [(  30,   30, 0.00000000e+00)]
 [(  60,   10, 2.84578755e-03)]
 [(  60,   30, 1.92741648e-03)]
 [(  60,   60, 0.00000000e+00)]
 [( 120,   10, 1.33025818e-01)]
 [( 120,   30, 1.21143218e-01)]
 [( 120,   60, 9.39393846e-02)]
 [( 120,  120, 0.00000000e+00)]
 [( 300,   10, 7.69409264e-01)]
 [( 300,   30, 7.70362944e-01)]
 [( 300,   60, 7.38203127e-01)]
 [( 300,  120, 5.79511920e-01)]
 [( 300,  300, 0.00000000e+00)]
 [( 600,   10, 1.18857403e+00)]
 [( 600,   30, 1.18091259e+00)]
 [( 600,   60, 1.16379460e+00)]
 [( 600,  120, 1.02220597e+00)]
 [( 600,  300, 6.36643452e-01)]
 [( 600,  600, 0.00000000e+00)]
 [( 900,   10, 1.38186398e+00)]
 [( 900,   30, 1.41657535e+00)]
 [( 900,   60, 1.42654824e+00)]
 [( 900,  120, 1.28564349e+00)]
 [( 900,  300, 9.52358564e-01)]
 [( 900,  600, 4.13780964e-01)]
 [( 900,  900, 0.00000000e+00)]
 [(1800,   10, 1.56350134e+00)]
 [(1800,   30, 1.59038708e+00)]
 [(1800,   60, 1.57760143e+00)]
 [(1800,  120, 1.47674187e+00)]
 [(1800,  300, 1.27458568e+00)]
 [(1800,  600, 9.84249018e-01)]
 [(1800,  900, 7.20700696e-01)]
 [(1800, 1800, 0.00000000e+00)]
 [(3600,   10, 1.58364303e+00)]
 [(3600,   30, 1.62856429e+00)]
 [(3600,   60, 1.66236178e+00)]
 [(3600,  120, 1.67353265e+00)]
 [(3600,  300, 1.47160299e+00)]
 [(3600,  600, 1.39347321e+00)]
 [(3600,  900, 1.18549807e+00)]
 [(3600, 1800, 7.73267790e-01)]
 [(3600, 3600, 0.00000000e+00)]]

The number of brackets and parens in the output implies to me perhaps unnecessary complexity in my data structure, but that may just be due to me being unfamiliar with structuring data in python/numpy. Does the format/structure of this output look correct and most simplified for moving forward with it to plot? Thanks!
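For comparison, the same per-pair mean of the absolute error can probably be had from a single `groupby`, which yields a flat three-column DataFrame instead of a nested structured array. This is a sketch on a made-up stand-in for `sample_data_subset_df`; unlike the loop above, it only returns interval/duration pairs that actually occur in the data.

```python
import pandas as pd

# Toy stand-in for sample_data_subset_df; the values are made up.
sample_data_subset_df = pd.DataFrame({
    "sampling_interval": [30, 30, 30, 60, 60, 10],
    "sampling_duration": [10, 10, 30, 10, 120, 10],
    "sampling_error": [-0.002, 0.004, 0.0, 0.01, 0.5, None],
})

columns = ["sampling_interval", "sampling_duration", "sampling_error"]
plot_points = (
    sample_data_subset_df[columns]
    .dropna()
    # Keep only pairs where the duration fits inside the interval.
    .query("sampling_duration <= sampling_interval")
    .assign(sampling_error=lambda d: d["sampling_error"].abs())
    .groupby(["sampling_interval", "sampling_duration"], as_index=False)["sampling_error"]
    .mean()
)
print(plot_points)
```

Each column of `plot_points` can then be passed straight to a 3D scatter call as the x, y and z values.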

surreal flume
#

Hi, I am getting really frustrated, because the changes I am making to a dataframe, inside a function, are not committed outside the function. I use return, but it does not work. Am I missing something ?

#

to be more specific, I wrote a function that takes a column away from the df, and that merges it with another df. The function output is correct i.e. a new df that looks exactly like I want. However, I would like this df to overwrite the original one, and I can't make it work

uncut shadow
#

well, technically it should look like that

data = pd.DataFrame(data={"me": [1, 2], "something": [3, 4]})
data = function(data)
#

and this function would look like

def function(data):
  # do something with this data
  return data # or some other variable if you want
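Applied to the "take a column away and merge it into another df" case, that pattern might look like the sketch below. The function name `move_column`, the `id` key, and the toy frames are all made up; the point is only that `drop` and `merge` return new DataFrames, so the result has to be assigned back.

```python
import pandas as pd


def move_column(df, other, column, key):
    """Return (df without `column`, `other` merged with that column)."""
    moved = df[[key, column]]
    # drop() and merge() both return new DataFrames; the inputs are unchanged.
    return df.drop(columns=column), other.merge(moved, on=key)


# Toy frames; names and values are made up for illustration.
data = pd.DataFrame({"id": [1, 2], "me": [1, 2], "something": [3, 4]})
other = pd.DataFrame({"id": [1, 2], "extra": ["a", "b"]})

# The changes only "stick" once the results are assigned back to the names.
data, other = move_column(data, other, column="something", key="id")
print(data)
print(other)
```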
hard fiber
#

is it usual for pandas to sometimes replace (numeric) values?
because i have a dataframe which if i visualize it, there are some 0 in the values which shouldn't be there. i checked the file which i also print it to but all values in it are correct

#
def visualize(dataFrames:dict, outputLocation:str, showing:dict, show = True, save = True) -> None:
    print("Start visualizing...")
    yValues = []
    print("- Start Extracting what to show...")
    for key, val in showing.items():
        if (val):
            print("- - Adding item {}...".format(key))
            yValues.append(key)
    print("- Finished Extracting what to show")

    print("- Start iterating plots...")
    for id, df in dataFrames.items():
        print("- - Starting plot: "+id+"...")
        print("- - - Start converting Duration to numeric values...")
        df.index += 1
        df.DurationIncl = pd.to_numeric(df.DurationIncl)
        #df.ScanTimeAutoLight = pd.to_numeric(df.ScanTimeAutoLight)
        print("- - - Finished converting Duration to numeric values...")
        print("- - - Start plotting...")
        plottedFrame = df.plot(
            y = yValues,
            kind = "line",
            title = "Runtime analysis",
            use_index = True,
            grid = True
        )
        print("- - - Finished plotting")
        print("- - - Start adding legend...")
        legend = []
        for yval in yValues:
            legend.append(yval + " of " + id)
        plottedFrame.legend(legend)
        plottedFrame.set_xlabel("index")
        plottedFrame.set_ylabel("Time in seconds")
        print("- - - Finished adding legend")
        if save:
            print("- - - Start saving...")
            matplotlib.pyplot.savefig(outputLocation+"AnalysedData{}.png".format(id))
            print("- - - Finished saving")
        print("- - finished plot: "+id)
    print("- Finished iterating plots")
    if show:
        print("- Start showing...")
        matplotlib.pyplot.show()
        print("- Finished showing")
    print("Finished visualizing")

the code of the visualization

lapis sequoia
#

Hi!

#

How to install fbprophet on win10 for python 3.8?

#

I searched for manuals and tried them - all in vain

polar acorn
#

Never tried with windows, but it worked for me on macOs using conda.

frail ocean
#

In windows10 python3.8 works fine with anaconda.

Hi!
@lapis sequoia in windows10, python3.8 works fine with anaconda

twin parcel
#

while using a scraper, sites that store cookies and login sessions would the scraper use that session, or as a scraper it has its own session?

#

i assume towards its own just like different browsers

hearty holly
#

@twin parcel indeed

twin parcel
#

my old version worked pretty well until I realized some divs had an optional field that i want to track, so now i have to rebuild based on divs.

lapis sequoia
#

In windows10 python3.8 works fine with anaconda.
@lapis sequoia in windows10, python3.8 works fine with anaconda
@frail ocean I need in FBProphet lib

frail ocean
#

I see.. then I have no idea. Sorry.

twin parcel
#

Any suggestions on changing this to allow 2 chars minExperienceLimit = badFiltersContent[2][15] Im using it to grab the int from this string, but this wont work for >9 ``` This is the Bad filter list, add words or phrases that make a job more likely not match, seperate with commas!

Minimum Years: 7```
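Indexing one character (`[15]`) can only ever return a single digit. Splitting the line on the colon is one way around that; this is a sketch that assumes the line always has the form `Minimum Years: <number>`.

```python
# Example line; in the real script it would come from the settings .txt file.
settings_line = "Minimum Years: 12"

# Take everything after the first colon, trim whitespace, convert to int.
min_experience_limit = int(settings_line.split(":", 1)[1].strip())
print(min_experience_limit)
```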

kind saddle
#

i dont know if this fits here but i need to compare 2 numpy arrays with eachother, they have different sizes. if 1 of the colors in first array are found in the big array i need to have a true output

#

i tried allclose but that works for all, i tried isclose, i tried any(isclose)

#

i tried if A in B is true then:

uncut shadow
#

wdym by true output?

kind saddle
#

i dont need to know the value that matched

#

just that they did match

#

a boolean

#

@uncut shadow

tacit spruce
#

What is the best resource for learning regression and classification in Python?

uncut shadow
#

@kind saddle I think this should give you the way to do this (https://stackoverflow.com/questions/25490641/check-how-many-elements-are-equal-in-two-numpy-arrays-python). You can change it to what you actually wanted to achieve

kind saddle
#

it works too for different sizes?

uncut shadow
#

well, gimme a sec

kind saddle
#

im alot asking sorry, ive been brainstorming and testing so much that all my ideas ran out :/

uncut shadow
kind saddle
#

@uncut shadow 2 questions but 1 is decently stupid, if it found a match then my if statement should say if value > 0 is true right?
second question is even tho the arrays are the same partly i dont get a match, can i put in an error margin for like +-3?

uncut shadow
#

well, I don't know much about this particular package (I didn't need it before) but you should check it's documentation to check if you can add a margin or stuff like that

kind saddle
#

but to my original arrays before i feed them into the package

#

nvm this is overthinking it too much

uncut shadow
#

actually, what do you mean with the first question cuz I don't think I understood it right

kind saddle
#

the problem is, 2 pictures are translated into arrays, 1 is a small part out of the bigger one. so if i compare i should get a value or anything apart from 0 or whatever since they are the same and treated the same in the code

#

question 1 was, that the count of matches would be greater than 0 if there was a match

#

im sure that is yes xD

uncut shadow
#

so yes, if there is a match then it should be bigger than 0

kind saddle
#

i think the problem is in the processing or asking too many matches. ill try it with 1 single color first
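For the "any colour matches, within a margin" check, one option is plain NumPy broadcasting, which does not care that the two arrays have different sizes. This is a sketch assuming both images are arrays of shape (height, width, channels) with integer colour values; the function name and the default margin of 3 are made up.

```python
import numpy as np


def any_color_match(small, big, tol=3):
    """True if any colour in `small` is within +/-tol of any colour in `big`."""
    channels = small.shape[-1]
    # Flatten to lists of unique colours. int16 avoids uint8 wrap-around
    # when subtracting (e.g. 2 - 5 would otherwise become 253).
    a = np.unique(small.reshape(-1, channels), axis=0).astype(np.int16)
    b = np.unique(big.reshape(-1, channels), axis=0).astype(np.int16)
    # Broadcasting compares every colour in `a` with every colour in `b`.
    diff = np.abs(a[:, None, :] - b[None, :, :])
    # A pair matches only if all channels are within the tolerance.
    return bool((diff <= tol).all(axis=-1).any())


small = np.array([[[10, 20, 30]]], dtype=np.uint8)
big = np.array([[[0, 0, 0], [12, 19, 31]],
                [[255, 255, 255], [200, 100, 50]]], dtype=np.uint8)
print(any_color_match(small, big))
```

The intermediate array grows with (unique colours in small) x (unique colours in big), so for large photos it may need chunking.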

jolly briar
#

@kind saddle do you have example data

kind saddle
#

@kind saddle do you have example data
@jolly briar yes and no, not a raw file its converted from image to array

brisk moth
#

is this where the NLTK nerds are

uncut shadow
#

yes

brisk moth
#

you know how to parse feature based semantics

agile cypress
#

Wdym?

weary ferry
#

define semantics

brisk moth
#

uh

#

i have a CFG with fol and lambda calculus and i give it a sentence and it tokenizes and parses the tree for it with a semantic representation

#

but it does not work with certain constructions, like subject inverted ditransitive questions where the recipient is a prepositional phrase “for x”

trail parcel
flat quest
#

did you really use a 5 layer dense model for addition :/

trail parcel
#

@flat quest it worked better than less number

#

Its not that intensive

flat quest
#

if ur doing addition a single perceptron will work

#

at an equal or higher efficiency
just consider the mathematical basis of the perceptron: x1w1 + x2w2 + b

set b to 0 and w1 and w2 to 1 and you have addition.
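To illustrate, here is a sketch of a single linear unit (the weighted sum above, with no activation function) trained by plain stochastic gradient descent on random sums. The learning rate, step count and input range are arbitrary choices.

```python
import random

random.seed(0)

# One linear unit: prediction = x1*w1 + x2*w2 + b
w1, w2, b = 0.0, 0.0, 0.0
learning_rate = 0.05

for _ in range(5000):
    x1, x2 = random.uniform(-1, 1), random.uniform(-1, 1)
    error = (x1 * w1 + x2 * w2 + b) - (x1 + x2)
    # Gradient of the squared error with respect to each parameter.
    w1 -= learning_rate * error * x1
    w2 -= learning_rate * error * x2
    b -= learning_rate * error

# The learned parameters should end up very close to w1 = 1, w2 = 1, b = 0.
print(w1, w2, b)
print(3 * w1 + 4 * w2 + b)
```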

rustic igloo
#

Anyone knows where to find the source code for "vocab_file" for (thanks):


import bert  # assumption: the bert-for-tf2 package
import tensorflow_hub as hub

FullTokenizer = bert.bert_tokenization.FullTokenizer
bert_layer = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1", trainable=False)

vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy() #The vocab file of bert for tokenizer
tokenizer = FullTokenizer(vocab_file)
#

i could only find the source code of the KerasLayer and the resolved_object method, but no vocab_file nor asset_path methods/attributes...

paper niche
#

it seems vocab_file is an attribute dynamically set in the export_bert_tfhub function

#

@rustic igloo

rustic igloo
#

Thanks @paper niche

orchid tinsel
#

@worn stratus hey man thanks for the reply!, im just wondering do you know where i can find the study materials for decision trees, i need to know about the classification and regression type specifically. i need to make them without sklearn or other 3rd party modules. all i can find in the web were those with them involved.

#

honestly im at a loss lol

worn stratus
#

Hm - I've had a look around for that stuff myself to not great success. If I were you, I'd have a look around at stuff like the Wikipedia page for ID3 decision trees, and random implementations you can find on github. When I looked - they were all a bit rubbish, but still helpful for figuring out roughly what is required

#

From a quick google search, there are tutorials out there for things like building an ID3 decision tree - just they're all not great. With some patience you can probably use a combination of the different resources available to figure things out

#

ID3 is definitely what I'd be looking towards
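The core of ID3 is choosing, at each node, the feature with the highest information gain. A standard-library sketch of that one step for categorical features is below; the toy rows and function names are made up, and a full tree would apply `best_feature` recursively to each subset.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction from splitting on one categorical feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_feature(rows, labels, features):
    """The feature ID3 would split on at this node."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


# Toy data, made up for illustration.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]
print(best_feature(rows, labels, ["outlook", "windy"]))
```

This covers classification only. Regression trees usually pick splits by reduction in variance (or squared error) instead of entropy.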

torpid ingot
#

I have a dataset of images, with each image tagged with keywords. For example, a photo of the Earth from space may have the keywords ["earth", "space", "planet", "photo"] while an illustrated diagram of the Sun may have the keywords ["sun", "star", "illustration", "rendering", "diagram", "labeled"].

I want to be able to automatically filter a set of images with keywords to end up with a smaller set of images that match the aesthetic that I am looking for. To do this, my plan is to figure out the weight of each keyword in detemining if a particular image should or should not be included.

I will have a human go through the dataset of images, and say if they should or should not be included. The classification for each image will be recorded.

I think the output will be something like this where each keyword that was seen is given a weight:

{
    "keyword1": 1.25,
    "keyword2": 0.75,
    "keyword3": -8.6,
    ...
}

Then, when given an unclassified image, my program will use those determined weights to say if or if not that unclassified image should be included.

What types of techniques should I look into for this? One consideration I'm thinking of is that the frequency of a keyword needs to be considered in the weight. If a word only shows up a single time, and I say that associated image does not belong in the set, then that keyword probably shouldn't be 100% bad.
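One simple technique that addresses the frequency concern is a smoothed log-odds weight per keyword, in the spirit of a naive Bayes classifier: with additive smoothing, a keyword seen once gets a milder weight than one seen many times. The sketch below uses made-up labels and function names, and it ignores class priors and absent keywords. Logistic regression over keyword indicators would be a natural next step.

```python
import math
from collections import Counter


def keyword_weights(labelled, alpha=1.0):
    """labelled: list of (keywords, included) pairs -> {keyword: weight}."""
    pos, neg = Counter(), Counter()
    n_pos = n_neg = 0
    for keywords, included in labelled:
        if included:
            n_pos += 1
            pos.update(set(keywords))
        else:
            n_neg += 1
            neg.update(set(keywords))
    weights = {}
    for kw in set(pos) | set(neg):
        # alpha is the smoothing term: it keeps rare keywords from
        # getting extreme (or infinite) weights.
        p_in = (pos[kw] + alpha) / (n_pos + 2 * alpha)
        p_out = (neg[kw] + alpha) / (n_neg + 2 * alpha)
        weights[kw] = math.log(p_in / p_out)
    return weights


def score(keywords, weights):
    """Sum of known keyword weights; above 0 leans towards 'include'."""
    return sum(weights.get(kw, 0.0) for kw in set(keywords))


# Toy human decisions, made up for illustration.
labelled = [
    (["earth", "space", "photo"], True),
    (["moon", "space", "photo"], True),
    (["sun", "diagram", "labeled"], False),
    (["earth", "diagram"], False),
]
weights = keyword_weights(labelled)
print(sorted(weights.items(), key=lambda item: item[1]))
```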

oblique belfry
#

When orchestrating ML into the business, do you all use event sourcing and CQRS concepts? We are doing a lot of stream processing, and we are trying to plan out the best strategy for ML predictions.

sage latch
#

Am i in the right place here to ask Questions about graphs? (nodes and vertices and stuffs)

wide rose
#

computer science is probably more appropriate

#

tho here is probably fine

sage latch
#

okay, here goes: I'm writing something to represent a game I'm playing. it has a lot of 'currencies' which can be converted into other 'currencies'. Some of them i can directly assign a value to (in this case a power increase/currency)

#

there are loops in this graph, and I can not guarantee that the dev was smart enough to prevent diverging loops

#

Do you have any hints on how to calculate the 'value' for each currency, when factoring in conversions?

#

I'm thinking about setting all values to 0, except for those which I can directly assign.

#

all directly assigned currencies go into a queue.

#

then I take one item out of the queue and update all currencies which can convert into that currency. I will save a 'value' for each currency the currency can convert to. Then I add that currency to the queue

#

that by itself does not terminate.

wide rose
#

so its a BFS?

sage latch
#

whats a BFS?

wide rose
#

breadth first search

#

here is what i would do instead

#

searching seems overcomplicated from whati understand

#

if you are able to recode it

#

just make a class for currency then have a method converting between

#

or a fuction

sage latch
#

the issue are the loops in that directed graph

wide rose
#

do you have a picture of the graph you can upload

sage latch
#

the values should converge when going through a cycle. So if the increase when updating a node is negligible i won't add it to the queue. (hm... there is the issue of possible multiple small updates?)

#

so it will eventually terminate if all cycles converge

#

I'm unsure how to handle the (propably not happening) case of a diverging cycle

wide rose
#

what do you mean by diverging cycle?

#

like it goes to infinity

#

so it just gets trapped?

sage latch
#

trivial example: i can buy 2 banans for 1 apple. and i can buy 2 apples for 1 banan

#

so if I were to assign some base value to an apple from elsewhere, the value of an apple would diverge by swapping between apples and bananas

#

ofc the graph I'm talking about is a bit more complicated ^.^

wide rose
#

im not sure why the value would diverge

#

do you have some code?

sage latch
#

not yet

#

1 apple = 2 bananas = 4 apples = 8 bananas = 16 apples = 32 bananas

#

so if my 'base value' for an apple is non-zero it will diverge

wide rose
#

wait that doesnt make sense tho

sage latch
#

more realistic case: 1 apple = 5 dollars; 1 apple = 2 bananas; 1 banana = 3 dollars. what's the dollar value of a banana?

wide rose
#

1 apple= 2 bananas = 4 apples

#

that doesnt make sense

#

thats where your error is

#

that cant be true

sage latch
#

the trades are not transitive or reflective

#

it is a directed graph

#

so bananas->apples need not be 1/(apples->bananas)

#

the diverging case would only happen if the dev of the game seriously messes up. But i can not exclude the possibility

wide rose
#

you could probably just add a fail safe in the code

#

but why are trades not transitive

sage latch
#

maybe transitive was the wrong word

#

but they surely are not reflective

#

yeah they should be transitive, I used the wrong word there sorry

wide rose
#

yea but if its not reflective how does anything have value

sage latch
#

example: I can get 1 banana for 2 apples or 1 apple for 2 bananas, but I can also sell 1 apple for 1 dollar (my 'value' is measured in dollars here)

#

wait

#

edited

#

in that case 1 apple is worth 1 dollar and a banana is worth 0.5 dollars

#

that's the simplest case I can imagine for a non-reflective loop with convergent values

#

also the transitions are more complicated.... I think a graph could not be enough to represent it. for example you could get 1 apple 3 kiwi and a grapefruit for 10 bananas. (but only all at once, no single trading) in some cases

wide rose
#

So is a banana worth a dollar?

#

If not then you have arbitrage in the economy

sage latch
#

there is no supply and demand

#

its all 'trades' set by the game dev

#

I think I have an idea how to tackle it. I may need to divide by zero a bit, but that's okay ;)

wide rose
#

ok if you have some code or a graph let me know

wise igloo
#

Best intro data science course also what should I Know before getting into an intro to data science course?

sage latch
#

@wide rose is a json-serialized dictionary with dummy data okay? (some data I don't yet have accurate values for, other data depends on my gamestate and I haven't set up those calculations yet)

wide rose
#

Hey sorry I have to do some studying atm

#

i might be able to help later

#

@sage latch

sage latch
#

It's 1am here, maybe tomorrow?

#

just started making some code to input the data. started with nested loops with global data, now refactoring to reasonable function calls. too lazy to refactor out the global data dict hehe

wide rose
#

hahahha

wise igloo
#

Thanks guys

trail parcel
lapis sequoia
#

Hi, I have a question regarding data visualization. I have a simple ORM Event model with a single attribute, date_created. (to track volume of API calls over time) How would I go about visualizing this in graphs of different resolutions? For instance, there may be a graph with 15 min resolution that sums up all of the events that occurred within that time span, or an hour resolution that sums up the events within that hour. Isn't there some python library that can make an interactive graph with JS that plugs into the web frameworks?

This is all new to me, any pointing in the right direction is appreciated. Thanks.

flat quest
#

I heard plotly is based on plotly.js so that might be what ur looking for. Though its performance isn’t as good as the C-based ones.

lapis sequoia
#

Thank you. And 'histogram' was the word I was looking for, a lot more pieces fell into place once I discovered the concept I had in my head had an actual name lol
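
For the "different resolutions" part, a rough sketch of the bucketing step using only the standard library. It assumes the `date_created` values are naive `datetime` objects already pulled out of the ORM; `bucket_counts` is a made-up helper name. pandas' `resample` covers the same job if the data is already in a DataFrame.

```python
from collections import Counter
from datetime import datetime, timedelta


def bucket_counts(timestamps, resolution):
    """Count events per time bucket of width `resolution` (a timedelta)."""
    epoch = datetime(1970, 1, 1)
    step = resolution.total_seconds()
    counts = Counter()
    for ts in timestamps:
        offset = (ts - epoch).total_seconds()
        # Floor the timestamp to the start of its bucket.
        bucket_start = epoch + timedelta(seconds=(offset // step) * step)
        counts[bucket_start] += 1
    return dict(sorted(counts.items()))


events = [
    datetime(2020, 4, 1, 10, 1),
    datetime(2020, 4, 1, 10, 14),
    datetime(2020, 4, 1, 10, 16),
    datetime(2020, 4, 1, 11, 30),
]
per_quarter_hour = bucket_counts(events, timedelta(minutes=15))
per_hour = bucket_counts(events, timedelta(hours=1))
```

The resulting bucket/count pairs are what a bar chart in plotly (or any other plotting library) would take as x and y.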

solar phoenix
#

does anyone have any experience with Numba?

#

or with vectorizing loops with Numpy

valid drum
#

Hi,
I’m trying to implement a CNN with Numpy only and I have a problem that the Convolutional layer is very slow - takes ~1 second...


def run(self, x, is_training=True):
        """Convolves the filters over 'x' """
        if self.filters is None:
            self.filters = self.initialize_weights((self.units, x.shape[0], *self.filter_size))
            self.grads = self._init_bias_weight_like()

        if is_training:
            self.cache['X'] = x

        n_filt, dim_filt, size_filt, _ = self.filters.shape
        dim_img, size_img, _ = x.shape

        if dim_filt != dim_img:
            raise ValueError("Image and filter dimension must be the same")

        size_out = int((size_img - size_filt) / self.stride) + 1

        out = np.zeros((n_filt, size_out, size_out))
        for filt in range(n_filt):
            y_filt = y_out = 0
            while y_filt + size_filt <= size_img:
                x_filt = x_out = 0
                while x_filt + size_filt <= size_img:
                    out[filt, y_out, x_out] = np.sum(
                        self.filters[filt] * (x[:, y_filt: y_filt + size_filt, x_filt:x_filt + size_filt])
                        + self.bias[filt]
                    )
                    x_filt += self.stride
                    x_out += 1
                y_filt += self.stride
                y_out += 1

        out = self.activation.apply(out, is_training)
        return out

Does anybody have an idea how to improve it? Thanks
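
One possible direction, as a sketch rather than a drop-in replacement: build all the patches at once with `numpy.lib.stride_tricks.sliding_window_view` (available from NumPy 1.20) and contract them against the filters with `np.einsum`, so the Python-level loops disappear. `conv_forward` is a hypothetical standalone function, square images and filters are assumed as in the code above, and the bias is added once per output value. The posted loop adds the bias inside `np.sum`, which counts it once per filter element, so the two will differ unless that was intended.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def conv_forward(x, filters, bias, stride=1):
    """Valid convolution without Python loops.

    x:       (channels, height, width)
    filters: (n_filters, channels, k, k)
    bias:    (n_filters,)
    returns: (n_filters, out_height, out_width)
    """
    k = filters.shape[-1]
    # All k x k patches: shape (channels, height - k + 1, width - k + 1, k, k).
    windows = sliding_window_view(x, (k, k), axis=(1, 2))
    # Keep only the patches that the stride would visit.
    windows = windows[:, ::stride, ::stride]
    # Sum over channel and the two kernel axes for every filter and position.
    out = np.einsum("fcij,chwij->fhw", filters, windows)
    return out + bias[:, None, None]
```

The activation can then be applied to the returned array exactly as before.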

uncut shadow
#

Are you following any tutorial for that? (I'm asking cuz I'm curious)

valid drum
#

@uncut shadow No

lapis ice
#

Does anyone know a paper or a blog which goes in-depth about the architecture/technologies?

eager heath
#

Computer architectures?

wise igloo
#

Thanks guys

lapis ice
#

@eager heath my apology I did not include the extra crucial detail.
I am looking for GAN generator/discriminator data basically

wise igloo
#

I appreciate the advice

lapis sequoia
#

can anyone recommend a good data science book for a beginner ?

steady sparrow
#

can anyone recommend a good data science book for a beginner ?
@lapis sequoia python datascience handbook

lapis sequoia
#

ok thanks

#

hows the book "data science from scratch"?

#

@steady sparrow

steady sparrow
lapis sequoia
#

is it good for a beginner ?

#

@steady sparrow

steady sparrow
#

I think yes
I am using it and it works well for me

#

I am also a beginner btw

lapis sequoia
#

ok

#

thanks

steady sparrow
#

@lapis sequoia you are welcome

solar phoenix
#

what is the fastest way to iterate through a function 100s or 1000s of times that gives a string output and add the output to a list?

#

at the moment i just do, for i in range(1000):

#

then append the output to a list

#

speed matters because i will end up doing it several million times

ivory plank
#

@solar phoenix As you probably already know, append has an amortized O(1) cost [which doesn't mean that each individual append costs O(1) time; it just means that because python over-allocates with append, on average, the append operation costs O(1)], so over many many times, that O(1) cost should be very close to actually being O(1)

spark stag
#

@solar phoenix have you tried using list comprehensions? since they don't call append, they are usually faster for creating a large list item by item

#

iirc it is also faster than list(map(...))

kindred finch
#

It still has to recreate the list every now and then, the same as append although it does have some other optimisations

ivory plank
#

It might be faster to pre-allocate a list if you know the exact number of iterations ahead of time. But even that is debatable. You can easily test which way is faster in your particular code using a smaller run with python's timeit

solar phoenix
#

@spark stag I have not tried that but will now

#

@ivory plank ok did not know about pre allocation

#

thanks all

ivory plank
#

Try this badly written quick program I wrote @solar phoenix , you can appropriately define new ways and give it to the list of functions to time them

#
import timeit
from collections import deque  # needed by deque_way below


def append_way(n, to_append):
    l = []
    for _ in range(n):
        l.append(to_append)

def pre_allocate(n, to_append):
    l = [""]*n
    for i in range(n):
        l[i] = to_append

def list_comprehension(n, to_append):
    l = [to_append for _ in range(n)]

def deque_way (n, to_append):
    d= deque()
    for _ in range(n):
        d.append(to_append)

def main():
    n = 10**1
    to_append = "test"
    for func in ["append_way", "pre_allocate", "list_comprehension", "deque_way"]:
        seconds = timeit.timeit("{}(n,to_append)".format(func), setup="from __main__ import {};n={};to_append='{}'".format(func, n, to_append), number = 1)
        print("{} takes {} seconds".format(func, seconds))



if __name__ == "__main__":
    main()
solar phoenix
#

@ivory plank awesome will do, thanks so much

ivory plank
#

it's not actually seconds btw, it's usecs. I forgot about the defaults (EDIT: actually, it's seconds. The thing that's actually the problem here is that timeit by default repeats the code 1M times. To make it 1 time, add "number =1" in the timeit call. But, none of this actually changes the difference in timing between the different functions)

solar phoenix
#

Understood. A million might be excessive...

lapis sequoia
#

I have a pandas dataframe with a column 'score' in the range [-1, 1] and I have 10-15 terms in other columns. What would be the best tool to understand how these n-terms predict the score?

polar acorn
#

Finding the correlation between each continuous feature and the score is a good start. Plotting each feature vs the score also gives a good indication.
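
A small sketch of that first step in plain Python, assuming the columns have been pulled out as lists of numbers (`pearson` and `rank_features` are made-up helper names). With pandas, `df.corr()` computes the same pairwise Pearson coefficients. Correlation only captures linear relationships, so the plots are still worth making.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences.

    Fails with ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, score):
    """Sort features by absolute correlation with the score, strongest first."""
    corrs = {name: pearson(values, score) for name, values in columns.items()}
    return sorted(corrs.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy values for illustration only.
columns = {"term_a": [1, 2, 3, 4], "term_b": [1, -1, 1, -1]}
score = [-1.0, -0.3, 0.3, 1.0]
ranking = rank_features(columns, score)
```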

wise igloo
#

You guys are so helpful! Thank you

lapis sequoia
#

@polar acorn thank you

polar acorn
#

@wise igloo Are you being sarcastic or something?

wise igloo
#

?

#

Besides python what else should I know before going into an intro to data science course?

polar acorn
#

Programmingwise you should probably take a quick look at numpy and pandas. Mathwise you should be familiar with calculus, basic stats and some linear algebra. All of these can be learnt at the same time as you're doing an intro to data science course, and that is perhaps what I would recommend. Just jump into the course, pause and dive into the math or libraries you don't understand.

lapis sequoia
#

@polar acorn from your comment I take it you mean to start with scatter plots then go into linear regression?

lapis sequoia
#

Hello,

I have a problem and I have spent about two hours on it but still unsolved!!

I have a dataset which contains nearly 600,000 rows. It is the air pollution of a city. I want to train my model on the other 599,999 rows and predict the remaining one.

For example, I drop row 100, train the model on the other 599,999 rows, and my goal is to predict the dropped row. But I get an error.

I really appreciate it if you could help me.

#

df = df.head(100000)
df["Measurement date"] = pd.to_datetime(df["Measurement date"])
df["Year"] = df["Measurement date"].apply(lambda x:x.year)
df["Month"] = df["Measurement date"].apply(lambda x:x.month)
df["Day"] = df["Measurement date"].apply(lambda x:x.dayofweek)
df.drop(["Latitude","Longitude","Address","Measurement date"] , axis=1 , inplace=True)
df.drop(100, axis=0, inplace=True)

a=[101,0.004,0.05,0.002,0.9,59,39,2017,1,3]
mine = pd.DataFrame(index=["Station code","SO2","NO2","O3","CO","PM10","PM2.5","Year","Month","Day"] ,
data=a , columns=["Goal"])

y = df["PM10"]
X = df[["Station code","SO2","NO2","O3","CO","PM2.5","Year","Month","Day"]]
from sklearn.model_selection import train_test_split
X_train, y_train, X_test, y_test = train_test_split(X,y,test_size=0.4, random_state=101)
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
predictions = lm.predict(mine)

print(len(predictions))
print("==========================")
print(df["PM10"])

#

This is the code.

umbral aspen
#

Hi I am working on a multilabel classification problem where I have many possible labels (probably like 30-40 in total)....Generally speaking is creating a somewhat accurate multilabel classification model straightforward when I have so many possible labels? I have about 50k images which I will have tagged and I am looking to train locally with a 1070 gpu

ivory plank
#

@umbral aspen A 30-40 dimensional output isn't uncommon for a conventional problem. The difficulty of the problem depends entirely on your data. The one thing I would look out for is noise in your data and negative samples/negative data [if a particular sample belongs to class 0, does your data ensure that it doesn't also belong to class 1?]. Your GPU is a little underpowered to train state of the art models with large datasets, but also remember that the training difficulty is usually more dependent on the complexity of the model you pick and not the problem itself. I'd personally first do data analysis to find what makes my data easy/difficult to model, and then start with the most simple model that I believe would work for my problem, only moving on to more complex models if my performance isn't sufficient. If your data isn't very complex, even an SVM is pretty good at modeling things.

frail pawn
wise igloo
#

@lapis sequoia are you looking at air quality data

#

Nvm I read your previous post

paper niche
#
X_train, X_test, y_train, y_test = train_test_split(...)

@lapis sequoia
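
To spell that out with a runnable sketch: `train_test_split` returns the two X parts first and then the two y parts, and the row passed to `predict` has to be a single row with the same nine feature columns (without PM10). The data below is a synthetic stand-in, not real measurements, and `held_out` / `train_df` are names chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

features = ["Station code", "SO2", "NO2", "O3", "CO", "PM2.5", "Year", "Month", "Day"]

# Synthetic placeholder table standing in for the air-quality data.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, len(features))), columns=features)
df["PM10"] = 2.0 * df["PM2.5"] + rng.normal(0, 0.01, len(df))

# Keep the row to predict, then remove it from the training data.
held_out = df.loc[[100]]
train_df = df.drop(index=100)

# Order matters: X_train, X_test, y_train, y_test.
X_train, X_test, y_train, y_test = train_test_split(
    train_df[features], train_df["PM10"], test_size=0.4, random_state=101
)

lm = LinearRegression().fit(X_train, y_train)

# One row, same feature columns as in training.
prediction = lm.predict(held_out[features])
print(prediction[0], held_out["PM10"].iloc[0])
```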

umbral aspen
#

@ivory plank I need to first do some manual tagging of the images but then the quality will be good. Also this isn't a problem where I have to classify 1 class per image, as each image could have multiple classes (multi label)... So not sure how much extra complexity this adds to my model...

ivory plank
#

Ah that sounds like a noisy dataset

#

But your problem sounds very similar to ImageNet

#

You might be able to use large parts of ResNet and our current efforts on efficient ImageNets

prisma verge
#

hey, fellow people! i've got pretrained GPT-2 model here that I want to load with gpt2_simple library. how could i do this? two models don't match up, especially since one is tensorflow model and other is pytorch. maybe anyone got gpt2_simple analog for pytorch model?

blazing bridge
lapis sequoia
#

@blazing bridge thanks

blazing bridge
#

I would appreciate that if you like it that you subscribe

#

There will be another course on pandas and sci kit learn and matplotlib

lapis sequoia
#

Can u recommend any book for data science?

blazing bridge
#

Probably the python Data science handbook

#

It has good reviews

prisma verge
#

gosh, i'm wrapping my head around it for 2 hours already and still can't get it to work

#

it seems that those models just lack tokenizers and i don't understand how to finetune them without tokenizers

flat quest
#

i haven't looked through all the code
but the github documentation has a tokenizer step (5.5)

silk forge
#

hey guys

#

well i plotted a decision tree using matplotlib but i can't zoom in for some reason

#

im talking about this zoom to rectangle thing

#
featureNames = ["Sex", "FamilySize", "Age","Pclass" , "Fare","Embarked"]
classNames = ["Survived",'Succumbed']
fig, ax = plt.subplots(figsize=(10, 10))
plot_tree(clf,feature_names=featureNames,class_names=classNames,filled=True,ax=ax)
plt.show()
#

this is my code

eternal orbit
#

really

spark stag
#

@silk forge the code shouldn't be the issue, are you just clicking the magnifying glass or dragging to create a rectangle for it to zoom into, if that doesn't work either you can try hold right mouse button and drag to zoom

silk forge
#

i created a rectangle

#

but it still wont zoom

#

@spark stag

spark stag
#

if there are no errors idk what the issue could be, did you try using right mouse drag to zoom in? it rescales each axis as you drag, it's not the most convenient fix but if it works it's better than nothing

crimson umbra
#

Hey can anyone help me with something related to data visualisation

#

I wanna recreate this graph using matplotlib and I need help figuring out the code

eternal orbit
#

What does the code look like

merry violet
#

@crimson umbra I am happy to have a look for you. Send me the code if you can.

compact delta
#

Hello guys im currently struggling a bit implementing SA (simulated annealing) to optimize a solution for an assignment. My solution consists of a list containing numbers from 0 to 11 representing a position in a storage array. Anyway my code executes but does not find any improvement, which is definitely wrong. Does any1 of u see something wrong here? For the neighbour solution our script said pick some random neighbour

paper niche
#

ur last if statement says currentCost < currentCost? o.0

compact delta
#

oh ^^

#

thx, but still no improvement .. kinda strange

paper niche
#

so you’re only doing 1 round of randint for the neighbour before dropping the temperature? when I implemented this for MC a while ago I seem to recall I attempted multiple random “flips” per temperature

compact delta
#

No word about how to choose our neighbour so i just assumed its like this

#

maybe I have to play with the temps a little bit more but its like no improvement at all, so i thought it must be because i made a mistake in the algo somewhere or with choosing the neighbour but dunno

paper niche
#

yeah the neighbour function seems off.. what's the physical context? as in, what does a neighbour mean in this assignment?

#

when I did this, it was in the context of modeling spins in a lattice. so states are up/down of spins in a lattice, and the neighbours are well-defined

compact delta
#

We have a warehouse and some storage shelves. Each iteration a shelf gets called according to our demand list, placed in a queue and then placed back into our warehouse. We have to optimize the location where it's placed back so that the distance these shelves move gets minimized. Each number in the solution list is the nth free slot. So 0 is the nearest slot (the warehouse is a 1D array) and 11 the farthest free

paper niche
#

ah it's a 1D array. hmm so wouldn't the neighbour just be +/- 1 of the current index?

compact delta
#

Doubt changing more than 1 of the numbers in our list helps. Thing is to calc the cost you have to simulate the whole process and the only chance I see improving it is with a lower temp and something like 0.9999 as cooling coeff. which will take ages to compute

#

Yeah also thought of it. Will try it out. They didnt specify neighbour in our class so I thought like changing a number is also a neighbour but +-1 will prob see better results

paper niche
#

as in, before the while loop, select a random position, inside the while loop find its neighbour (50% chance of +/- 1), change its state, calculate the Cost, perform the acceptance algorithm

#

the next loop, pick its neighbour, and so on
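
Roughly what that could look like, as a sketch under assumptions: the solution is a list of slot numbers in 0..11, a neighbour changes one entry by +/- 1, and several moves are tried per temperature before cooling. `anneal` and its parameters are made up for illustration, and `cost` is a placeholder for whatever simulation computes the real cost.

```python
import math
import random


def anneal(initial, cost, max_value=11, t_start=10.0, t_end=1e-3,
           cooling=0.95, moves_per_temp=50, seed=0):
    """Simulated annealing over a list of integers in [0, max_value]."""
    rng = random.Random(seed)
    current = list(initial)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            # Neighbour: nudge one random entry by +/- 1, clamped to the range.
            candidate = list(current)
            i = rng.randrange(len(candidate))
            candidate[i] = min(max_value, max(0, candidate[i] + rng.choice((-1, 1))))
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements, sometimes accept worse solutions.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = list(current), current_cost
        t *= cooling
    return best, best_cost


# Placeholder cost: prefer near slots. The real cost would simulate the warehouse.
best, best_cost = anneal([11] * 5, cost=sum)
```

Note the acceptance test compares the candidate cost with the current cost, and the best solution seen so far is tracked separately.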

compact delta
#

If we select a random pos before our while loop it wont go over the other pos after its run through it 1 time its done, am not allowed to change that

compact delta
#

yeah big f should have chosen web-dev that would be an easy a

sick nacelle
#

Hi everyone, i'm currently working with pandas. I got this excel file, when i load it to a pandas df there are cell's values in some columns showing as NaN, but in the excel file these cells have values. Is this because of the value's type in the excel file?

raw rapids
#

thats really weird behavior @sick nacelle

#

do you mind uploading the excel sheet

#

and posting your code

brisk moth
#

does anyone know how FSTs work

sick nacelle
#

@raw rapids I can provide the excel file. The code part it's just loading the excel in a df, though i got the code that generates that excel file.

arctic wedgeBOT
#

Hey @sick nacelle!

It looks like you tried to attach file type(s) that we do not allow (.xlsx). We currently allow the following file types: .3gp, .3g2, .avi, .bmp, .gif, .h264, .jpg, .jpeg, .m4v, .mkv, .mov, .mp4, .mpeg, .mpg, .png, .tiff, .wmv, .svg, .psd, .ai, .aep, .xcf, .mp3, .wav, .ogg.

Feel free to ask in #community-meta if you think this is a mistake.

real wigeon
#

so

#

i have no experience with jupyter notebooks, I watched a 30 minute video.

#

is it worth spending time working with it, or should I do a dashboard with some descriptive stats

raw rapids
#

well if you are planning on doing more data science

#

then you should make a larger attempt to learn jupyter notebook

#

but a dashboard with descriptive stats is fine too

#

@sick nacelle , I dont know how you can share the excel files. If you find some online sharing for excel you could ping me

#

@brisk moth, FSTs are usually for NLP projects

brisk moth
#

ya i know

#

id like to implement one ive drawn by hand into python

raw rapids
#

o

brisk moth
#

idk how

raw rapids
#

you want to know how to code it in python

brisk moth
#

yea

#

im also not sure if the fst is even right but yolo

#

do you know how to do that

raw rapids
#

I only know the concept lol

#

I've never coded one before

brisk moth
#

bummer

raw rapids
#

sorry tho

#

there is an openfst package in python

#

and pywrapfst

#

im more into spacy for nlp

#

@brisk moth

#

open fst seems really easy to use

brisk moth
#

so the states are like the Q0 and Q1 and the arcs are the transitions?
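
That reading (states like Q0 and Q1, arcs as transitions that also emit output) can be sketched without any library. This is a toy deterministic transducer in plain Python, not the OpenFst API, and the states, symbols and `transduce` helper are invented for illustration.

```python
def transduce(arcs, start, finals, text):
    """Run a deterministic finite-state transducer over `text`.

    arcs:   {(state, input_symbol): (next_state, output_symbol)}
    Returns the output string, or None if the input is rejected.
    """
    state = start
    output = []
    for symbol in text:
        key = (state, symbol)
        if key not in arcs:
            return None  # no arc for this symbol in this state
        state, out = arcs[key]
        output.append(out)
    # Accept only if we stop in a final state.
    return "".join(output) if state in finals else None


# Toy machine over the alphabet {a, b}.
ARCS = {
    ("q0", "a"): ("q1", "x"),
    ("q0", "b"): ("q0", "b"),
    ("q1", "a"): ("q1", "a"),
    ("q1", "b"): ("q0", "y"),
}

result = transduce(ARCS, "q0", {"q0", "q1"}, "aab")
```

A hand-drawn FST translates the same way: one dictionary entry per arc, keyed by the state it leaves and the symbol it reads.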

real wigeon
#

well if you are planning on doing more data science
@raw rapids yes, but I am trying to pad my resume asap

flat quest
#

its no use padding ur resume unless you actually know how to work with your tools. Otherwise you'll be lost even if u get a job.
Concrete knowledge with your tools will also allow you to create better projects to pad your resume, so there's no point not learning them.

#

@real wigeon

real wigeon
#

right, but jupyter notebooks is less important than knowing pandas, matplotlib, or numpy

#

?

flat quest
#

well its one of the main ways of sharing data science related work
so if you want to display work you've done (for others to see) in an easy to run notebook, jupyter is usually a good way to go. Also, when ur running models / doing data science ur going to be making visualizations, which is much easier in jupyter usually.

charred blaze
#

they're different tools for different things

real wigeon
#

yeah

charred blaze
#

having that said...

#

I'd say your assessment is right

humble gale
#
x1_domain_list = load_alexa("top-100.csv")
x2_domain_list = load_dga("dga-cryptolocke-50.txt")
x3_domain_list = load_dga("dga-post-tovar-goz-50.txt")

x_domain_list=np.concatenate((x1_domain_list, x2_domain_list,x3_domain_list))


y1=[0]*len(x1_domain_list)
y2=[1]*len(x2_domain_list)
y3=[1]*len(x3_domain_list)

y=np.concatenate((y1, y2,y3))

#print (x_domain_list)

cv = CountVectorizer(ngram_range=(2, 2), decode_error="ignore",
                                      token_pattern=r"\w", min_df=1)
x = cv.fit_transform(x_domain_list).toarray()

# apply KMeans and TSNE ...

k_means = KMeans(init = 'k-means++', n_clusters = 2, random_state=170)
k_means.fit(x)

# assign the labels to a variable
k_means_labels = k_means.labels_
# assign the cluster centres to a variable
k_means_cluster_centers = k_means.cluster_centers_

x_embedd = TSNE(n_components=2, 
                learning_rate=100,
                random_state=170).fit_transform(x)


y_pred = k_means.predict(x)


# fig, ax = plt.subplots(figsize=(7,7), dpi=100)
# plt.scatter(x_embedd[:, 0], x_embedd[:, 1], c=y_pred)

print('Before TSNE: ', x.shape)
print('Accuracy: ', np.mean(y_pred==y)*100)
print('After TSNE: ', x_embedd.shape)
#

please feel free to help, I don't know why my accuracy is at 17%, before it was 79%. have I overfit my data or? I am a bit lost, thanks in advance
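
One thing worth checking, hedged since it may not be the whole story: KMeans cluster ids are arbitrary, so cluster 0 does not have to correspond to label 0. Comparing `y_pred == y` directly can then report the accuracy of the flipped mapping, and 17% under one mapping is 83% under the other. A sketch for the two-cluster, 0/1-label case (`two_cluster_accuracy` is a made-up helper):

```python
def two_cluster_accuracy(y_true, cluster_ids):
    """Accuracy of a 2-cluster assignment against 0/1 labels,
    taking the better of the two possible cluster-to-label mappings."""
    matches = sum(int(t == c) for t, c in zip(y_true, cluster_ids))
    acc = matches / len(y_true)
    return max(acc, 1.0 - acc)


# Toy labels for illustration: the clusters are mostly right but numbered the other way.
y_example = [0, 0, 0, 1, 1, 1]
clusters_example = [1, 1, 1, 0, 0, 1]
accuracy = two_cluster_accuracy(y_example, clusters_example)
```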

paper niche
#

what did u do before?

blazing bridge
rustic igloo
#

Can anyone suggest a way i can paste code to Colab with Android mobile phone? Trying to be productive on the way to work and have sublime text on my phone. So once i finish a snippet, i thought I could just copy and paste it to Colab. BUT - so far i have been unsuccessful to paste to Colab via my mobile browser. The attached list appears when I hold my finger on the screen (no paste). Thanks!!

#

btw, i've searched online a bit for the solution, but couldn't find anything useful. Wondering if problem is myself.

burnt swan
ivory plank
#

@rustic igloo Even if the app doesn't allow you to paste, your mobile keyboard should allow you to do so regardless. You should be able to select a text you type and then paste using your phone's default menu or from your phone's keyboard/clipboard

sonic raft
#

Hi guys! I have a very "newbie" question, so I'm learning about ML algorithms, and how to implement them in Python. So my question is about Linear Regression, If I split my data into train and test sets, should I count the accuracy for the train set too, or is it okay if I just count it for the test set?

uncut shadow
#

you should count for both

next smelt
#

Both. Because it will tell you about overfitting /underfitting

sonic raft
#

Thanks! 🙂

#

What do you mean by overfitting / underfitting? (Too low, or too high accuracy score?)

rustic igloo
#

@ivory plank thanks. I am finally able to paste using the phone's Gboard clipboard.

flat quest
#

well a neural model tries to emulate the true relationship between the inputs and outputs right

but since we are not given data on the entire population (only a sample), the pattern in our sample is likely different from the true one.
When we try to fit on the relationship represented by our inputs and outputs in our sample too well (ie pick up patterns that are very subtle) our model will not generalize well to the pattern for the total population since these subtle patterns do not exist in the population.

Underfitting is when we fit on the sample so poorly, that we miss out on major patterns. These major patterns generally also occur in the true relationship, so its important to find those patterns.

Underfitting and overfitting are tradeoffs. The less you try to overfit, the more potential for underfitting.

@sonic raft
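
A small experiment that makes the tradeoff visible, as a sketch on synthetic data: fit polynomials of increasing degree to noisy samples of a sine curve and compare the error on the training sample with the error on fresh points. The data, degrees and the `train_test_mse` helper are all chosen for illustration.

```python
import numpy as np


def train_test_mse(degree, x_train, y_train, x_test, y_test):
    """Fit a polynomial of the given degree and return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse


# Synthetic data: a sine curve plus noise.
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0, 1, 20))
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = np.sort(rng.uniform(0, 1, 200))
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

for degree in (1, 3, 9):
    train_mse, test_mse = train_test_mse(degree, x_train, y_train, x_test, y_test)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

A model that does badly on both sets is underfitting. A model whose training error keeps dropping while the test error stalls or rises is overfitting, which is why scoring both sets is useful.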

sonic raft
#

Thanks, that's really helpful! So Let's say, I underfitted my model(Simple Linear Regression model) , how should I cure this problem? Change train/test scale?(I usually use the 80/20 variation)

#

@flat quest

wild ridge
#

hi everybody
I'm looking for a labeled text-based dataset with less than 50% accuracy for my uni project
can you help me?

flat quest
#

not necessarily change train and test sizes but u could use a more complex model.
for example a multilayer ReLU network can fit progressively more complex relationships, such as non-linear ones with many local maxima and minima

#

@sonic raft

sonic raft
#

Thanks!

charred blaze
#

accuracy at 17%?! that's abysmal

#

something must be wrong with your code itself

stone ruin
#

I imagine there's a bit of bias here, but would the most logical order to learn the 3 languages be Python -> SQL -> R?

charred blaze
#

I would separate SQL from the others

#

you want to know SQL nevertheless

#

you should learn that in parallel to other things

stone ruin
#

Well

charred blaze
#

having that said, what kind of work do you intend to do?

stone ruin
#

Gotta do one at a time

#

pulling csv's out of a GUI built on top of a SQL database

charred blaze
#

I've barely touched R so far in my jobs where I dwelled in machine learning stuff

stone ruin
#

manipulating the data in excel mostly

#

didn't even know VBA

#

but now with all this time, I intend to learn proper data processing / visualization

#

I want to get back into data for banks I guess, gathering insights for what works and what to do next

charred blaze
#

if you want to learn a "real" programming language... go with Python

stone ruin
#

probably some ML to prevent attrition by predicting behaviors that lead to closed accounts

charred blaze
#

R's basically a language created by statisticians and it shows

stone ruin
#

So ultimately tracking customer transaction histories (massive DB of line by line per account)

#

to see what actions trend towards a closed account

#

such as if they stop using a debit card

charred blaze
#

having that said, if you intend to work with time series analysis in general, then I would for sure recommend R

stone ruin
#

or their DD disappears

#

trigger them for a contact from a banker or something

#

but there's 10k - millions of transactions a day depending on the bank size

#

but working with big databases like that to gather insights for customer behaviors to ensure maximum profitability

#

I have 0 coding experience 😦

#

outside of the bit of SQL I had to try to work with during a Salesforce integration

#

although when I caught an error that the expert made I felt pretty good, ahha

#

alright, so I'll go back to learning Python, thanks again! Datacamp is offering a free week, so I wanted to maximize my time with the platform

charred blaze
#

yeah, go with Python + SQL for now.

stone ruin
#

I might go round robin between the two

#

do their intro course to python, then SQL, then do the next python, then next SQL

#

❤️ this server, you all are the best

charred blaze
#

yeah, that seems like a good approach

wind plume
#

Hi guys!! I have an issue with pandas that I'm sure is bc I am new. I don't know if you guys want me to post my code or just a screenshot, but all I can do is a ss for now.

Basically I'm entering in all information from previous dataframes, using user inputs of previous columns!

charred blaze
#

share the code if possible

#

and state specifically what's the problem you're having

wind plume
#

I can in a little bit. Those NaNs shouldn't be there

#

Under dry sample. If I do dry sample first I get nans, if I do weathered sample first I get nans. I tried doing fillna in that for loop but it never worked

#

I think it's something that the two dataframes indices don't match?? But it's hard to know what question to ask and thus what to search for.

wild ridge
#

hi everybody
I'm looking for a labeled text-based dataset with less than 50% accuracy ever achieved for my uni project
can you help me?

charred blaze
#

that code looks a bit weird

#

you intend to append rows to a dataframe

#

but I think you're always replacing entire columns each time you do those assignments

#

nvm, I can see that you really intend to operate column wise

wind plume
#

I don't know the best way to go about this. Just kinda winging it. I know nothing of coding this is like 1 month of quarantine in fruition

charred blaze
#

but it seems off to me, are you sure that's what you really want to do?

#

what do you want to do then

wind plume
#

What do you mean? Code wise

charred blaze
#

yes, code wise it looks weird

wind plume
#

Inefficient??

charred blaze
#

no, just plain wrong

wind plume
#

Or just like logically doesn't make sense

#

Oh shit ok. Let me get my code I'll be back on in a bit.

wind plume
#

Attached is the pastebin. At the top has an explanation of the code, a tldr of what it's meant to do, and what issue I am having.

I want to note I'm very new to coding and this is my first project. I want to use this for myself as I am a material scientist by trade and WFH I spent all this time learning python.

Every problem I've had I addressed by googling questions, but this I haven't been able to see a lot of people have this issue.

#

If you guys can look at this please let me know if there's a fix I can use. If you do look at it and reply, I'd highly appreciate it if you can tag me

radiant nymph
#

Whats the best way to encode text data for building tree models so that we dont get dim curse. I have used target encoding , any alternatives?

flat quest
#

One hot is better for tree based

radiant nymph
#

One creates large vectors > large dims > Complex trees

flat quest
#

U can always limit tree size
But my bad I meant better for non tree based

Target encoding creates relationships that aren’t there like red is 1 doesn’t mean 2 is blue.

But one hot can make the tree split on color red or blue rather than on the

For trees prob just use target encoding for now, since one hot causes those feature to lose importance in the model even when they shouldn’t
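
For reference, a sketch of the two encodings in plain Python (`one_hot` and `target_encode` are made-up helpers, and the colors and targets are toy values). One note on terms: the "red is 1, blue is 2" ordering problem is usually associated with ordinal (label) encoding. Target encoding, as commonly defined, replaces each category with the mean of the target for that category, so it stays a single column regardless of how many categories there are.

```python
from collections import defaultdict


def one_hot(values):
    """One column per category, 1 where the row has that category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def target_encode(values, targets):
    """Replace each category with the mean target seen for it.

    The means should come from training data only, otherwise
    the target leaks into the feature.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    means = {v: sums[v] / counts[v] for v in sums}
    return [means[v] for v in values], means


colors = ["red", "blue", "red", "green"]
targets = [1, 0, 0, 1]
rows, categories = one_hot(colors)
encoded, means = target_encode(colors, targets)
```

With many distinct categories the one-hot matrix gets wide, which is the dimensionality concern raised above.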