#data-science-and-ml
It's really just a proof of concept program at this point
what would you do differently?
btw, not directly functionality related, but
snake case is preferred for Python
@smoky fractal maybe something like this:
diff = symbol_data['close'].tail(bars)
down_mean, _, up_mean = diff.groupby(np.sign(diff)).mean().sort_index()
rsi = 100 - 100 / (1 + up_mean / down_mean)
Has anyone used a Data Discovery/Catalog service like Amundsen?
@velvet thorn I'm still a bit confused about the ups/downs, is that a continuous value at each point or is it just one point that gets returned?
because initially I had them as variables but now they are columns in the dataframe
I think what you mean to ask is - is that a Series or a single value?
and the answer is the former
wait, go back
you mean in yours?
Yes that's exactly what I was asking, thanks. And those three lines you posted do the same thing as my code? Just trying to
or mine
in my original code, I had them as single values. Now I have them as series.
oh wait actually I need to clarify your algorithm
RS is supposed to have the value of the mean increase divided by the value of the mean decrease, right
yeah
I've pretty much just been trying to implement this in python https://www.macroption.com/rsi-calculation/
okay, I edited the code
@smoky fractal at that point
they should be single values
because they are means
so what I think I am doing now is taking the RSI at every point and putting it in that column. At the end of my code now I am returning symbol_data['RSI'][-1] to get the most recent value
Because while yes it is a mean it is still helpful to see it plotted over time as a series
@smoky fractal RSI given the last X points?
yes the function takes in the bars variable as the amount of points to consider
Yes the function returns a single value. To give some context, this utility will be a filter to decide whether to buy a stock or not. If RSI >=X, do/don't buy. So I always want the most recent value
But in other contexts I might want to plot the RSI over time to spot trends
@smoky fractal then look into window functions
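For reference, a minimal sketch of the window-function idea: a rolling RSI computed at every point with pandas, assuming a simple (unsmoothed) moving average of gains and losses. The `rolling_rsi` name and the prices are made up for illustration; `symbol_data['close']` from the discussion would be passed in the same way.

```python
import pandas as pd

def rolling_rsi(close: pd.Series, bars: int) -> pd.Series:
    """RSI at every point, using a simple moving average over `bars` changes."""
    delta = close.diff()                 # bar-to-bar price change
    gains = delta.clip(lower=0)          # positive changes, zeros elsewhere
    losses = -delta.clip(upper=0)        # magnitude of negative changes
    avg_gain = gains.rolling(bars).mean()
    avg_loss = losses.rolling(bars).mean()
    rs = avg_gain / avg_loss             # relative strength
    return 100 - 100 / (1 + rs)

close = pd.Series([10.0, 11.0, 10.0, 11.0, 10.0])   # hypothetical prices
rsi = rolling_rsi(close, bars=2)
latest = rsi.iloc[-1]                    # most recent value, e.g. for the buy filter
```

The whole `rsi` series can be plotted to spot trends, while `rsi.iloc[-1]` gives the single most recent value by position.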
I was brainstorming ideas for a Python module to write, but I got stuck.
what kind of python module do you guys think should have been already available?
Is this related to data science?
My advisor asked me to help one of her students install tensorflow on windows 10; he was getting errors related to Windows not being able to find C++ files
For some reason I'm able to install tensorflow but idk what I installed that enables me to do that.
is it the cpp build tools?
could be. I asked him to install visual studio and that didn't work.
it's separate from vs
you still have to download the build tools
I see
https://visualstudio.microsoft.com/visual-cpp-build-tools/
pretty sure that's the one
not being able to find C++ files
idk what else that'd be
I'm not referring to the IDE so I may be using the wrong term
I'm not sure why they throw around the word "visual" so much
wasn't visual cpp ms's version of the language or smthing
So I want to use data from MySQL what library is good for that?
```
from sqlalchemy import create_engine
engine = create_engine("mysql:///:memory:")
```
Would something like this work?
might be a good #databases question
```
import pandas as pd
from sqlalchemy import create_engine

sqlEngine = create_engine('mysql+pymysql://*', pool_recycle=3600)
dbConnection = sqlEngine.connect()
frame = pd.read_sql("select * from whatever", dbConnection)
pd.set_option('display.expand_frame_repr', False)
dbConnection.close()
```
± whatever options you need
Hey guys, does anyone have a small classification dataset? I want to build a neural network just using numpy and the MNIST one seems a bit much for me
https://www.kaggle.com could probably find one here
yeah, i searched on kaggle but not really sure which one i should practice on
well what are you using it for
some keywords to search for will be helpful
@bitter harbor i want to build a small neural network
for..?
classification, i just learned them so i want to practice
classification of what tho?
any classification dataset
@bitter harbor umm what is pymysql?
and pool_recycle?
idk that was the first thing that came up with the google search
I haven't worked with sql like at all
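On those two questions: in a SQLAlchemy URL of the form `mysql+pymysql://`, `pymysql` names the driver library used to talk to MySQL, and `pool_recycle=3600` tells the connection pool to replace connections older than 3600 seconds. A runnable sketch of the same `pd.read_sql` pattern follows, swapped to Python's built-in `sqlite3` so it works without a MySQL server; the table contents are invented for illustration.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the real MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE whatever (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO whatever VALUES (?, ?)", [(1, "a"), (2, "b")])
conn.commit()

# Same call as with the SQLAlchemy connection: run a query, get a DataFrame back.
frame = pd.read_sql("SELECT * FROM whatever", conn)
conn.close()
```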
ok
should i just use a normal breast cancer dataset?
@winged stratus you want a data set?
i practiced logistic regression on that
@undone flare yeah
what type of operations do you wanna practice on it?
@undone flare a neural network
i really dont know how to build them, so im practicing
we understand that, it's just that 'classification' is a pretty broad term
why not just use the mnist set?
mnist has something like 784 inputs per training example right?
it may take quite a while to train
im just looking for something small and simple
it's ok ill find one
thanks for your time guys
@winged stratus https://www.kaggle.com/kaggle/sf-salaries how about this?
I haven't worked with neural networks so idk what type of dataset is good for that
that's not a lot of inputs tbh
the one you sent arnav has 110811
hi im new to python/numpy and i was trying to understand how to represent different probability functions via numpy
rn im confused as to how i can alter the probability of np random (if its possible)
@quiet pine what exactly do you want to do?
i want to return 1 or 0 given a probability ratio
basically implement a bernoulli rv
ik random returns [0, 1)? i believe
np.random.binomial
alternatively, np.random.choice
(but the former would be more appropriate)
ah yeah im trying to do it w rand specifically because i want to learn how to implement these probabilities
ah, okay
so in that case
the output of np.random.rand is uniformly distributed, right
in the range [0, 1)
ye
.3? no
yup
so now
let's say your probability of success is 0.3
wouldn't you say that the above calculation
could represent a Bernoulli RV?
yup
so in that case
we set 0.7
as an arbitrary bound
but it doesn't have to be 0.7, right?
or rather, 0.7 and 0.3 are arbitrarily chosen
to put it another way...given a uniform distribution in the range [0, 1), and a number p also in that range, what is the probability that a randomly drawn value will be >= p?
think about that and relate it to the nature of a Bernoulli RV
if the value > p then it has probability 1-p and if its less than p it has probability p?
or hm wait let me write this out before i come to a conclusion 1sec
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
^
P(X <= 0.6) = 0.6, P(X >= 0.4) = 0.6 right okay.
ok so if its less than or equal to the value then it represents the probability
ohh okay i got it now thanks
lol that was confusing for how simple it is
ye yw
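The conclusion of that exchange can be sketched as follows: a uniform draw on [0, 1) falls below p with probability p, so comparing it against p yields a Bernoulli variable. The `bernoulli` helper name is made up, and a seeded `default_rng` generator is used here in place of `np.random.rand` only to make the run repeatable.

```python
import numpy as np

def bernoulli(p, size, rng):
    """Return `size` draws that are 1 with probability p, else 0."""
    u = rng.random(size)          # uniform on [0, 1), like np.random.rand
    return (u < p).astype(int)    # P(u < p) = p for a uniform draw

rng = np.random.default_rng(0)    # seeded generator, for repeatability
samples = bernoulli(0.3, 100_000, rng)
print(samples.mean())             # should land close to 0.3
```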
Anyone know why pytorch checkpoint is recognized as a folder in my Ubuntu and a file in my VM? And if checkpoint is in the form of a folder, how do I load it?
The checkpoint is supposed to be 20 Gigs, but it was in the download phase where I believe it lost the correct format
Im gonna try to compress it and see if that works
For those who use VS Code to write Python, I just discovered my IntelliSense is case sensitive and won't come up unless you use the exact case for the word you need. Any workaround for this?
I have a dataset and it has a column called 'CC Exp Date'
and it has dates like 01/20, 04/22, 23/25.... in different rows
How many people have a credit card that expires in 2025?
so
I tried using regex
but I failed miserably lol
nvm got it
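The solution that was found is not shown, but one way to answer the question without regex is a plain string split, sketched here on invented rows in the MM/YY format described above.

```python
import pandas as pd

# Hypothetical rows in the MM/YY format described above.
df = pd.DataFrame({"CC Exp Date": ["01/20", "04/22", "11/25", "06/25"]})

# The year is the part after the slash, so no regex is needed.
year = df["CC Exp Date"].str.split("/").str[1]
expiring_2025 = int((year == "25").sum())
```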
@boreal summit Use Kite
Can anyone confirm whether a 20Gb checkpoint would necessarily use 20Gb RAM or is there a way to reduce the memory taken by the loading checkpoint? I have 8GB mem on my system but can use Kaggle/Colab for 16Gb too.
@grave frost thanks, I'll give it a try.
can anyone link me to some good ai tutorials using python
I need to use scipy's optimise functions to perform gradient descent.
Only issue is, my function is calculated by finding a linear combination of several large matrices
Large enough that I can't store them all in memory, so they're numpy memmapped
However, there's still a significant memory usage spike while calculating each function value
Is there a way to make sure it won't overcap my memory while running? Also, can I keep storing the iteratively computed values to disk, so that in the case it does fail, I can atleast start closer to the minima
does legend() have its default loc set to 0?
anyone here know Pandas? Can you take a look https://stackoverflow.com/questions/64762942/how-do-i-compare-2-separate-csv-files-using-pandas-and-store-it-into-a-dictionar
hey... i'm just getting into machine learning (TensorFlow for now) do i need to get anaconda for that or can i just use my usual python 3.8 with tensorflow?
i already have multiple versions of python i'd rather not install more... is anaconda a requirement for machine learning or is it optional??
optional but recommended bc jupyter notebook is great @open stratus
I have a list of dictionaries:
[{'store': 'a', 'buy': '1.1312', 'sell': '1.1518'},
{'store': 'b', 'buy': '1.1315', 'sell': '1.1517'},
{'store': 'c', 'buy': '1.1316', 'sell': '1.1518'},
etc.]
all of the buys and sells are strings. What is the most performant way to convert them all to floats? I made a try_float method but iteratively that takes quite a while
if anyone has any performant data structures they'd recommend for processing a list of dictionaries that would be great - I'm trying to see if I can use a numpy array
One possibility is to write a dict wrapper that applies a float conversion as the 'buy' and 'sell' values are accessed.
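A simpler baseline to compare against is a single pass with a comprehension that converts only the known numeric keys. This is a sketch on the sample rows from the question, not a benchmarked answer.

```python
rates = [
    {"store": "a", "buy": "1.1312", "sell": "1.1518"},
    {"store": "b", "buy": "1.1315", "sell": "1.1517"},
    {"store": "c", "buy": "1.1316", "sell": "1.1518"},
]

# One pass over the list; only the known numeric keys are converted.
converted = [
    {**row, "buy": float(row["buy"]), "sell": float(row["sell"])}
    for row in rates
]
```

Calling `float` directly avoids the overhead of a try/except wrapper when every value is known to be a valid number.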
how can i scrape and download the mp3 here https://www.ldoceonline.com/dictionary/absent
i can scrape text no problem but i couldnt scrape and download the mp3 file
i just wanna download first class="speaker exafile fas fa-volume-up hideOnAmp"
@hollow sentinel you can get jupyter without anaconda too
so I'm trying to make a financial option tool, and I need to decide between using any of:
- Hidden Markov Models
- Naive Bayes
- Bayesian Networks
- Markov Networks
If I'm inputting past sets of observations, such as price points, bid/ask prices, strike prices, etc, and I input the next day's return on investment for different combinations of those observations, and I want to make the tool predict a return on investment given a unique set of observations, what model should I use? I'm currently going with Bayesian Networks because they're good at inference, but I'm not totally confident.
@prisma isle my bad haha
boys i need help, i want to fix spelling errors in spanish text and i don't know how to start, does somebody have an idea?
@fallow prism Do you want to use ML for that or is any other tool good enough?
Does anyone know how to load a big Pytorch checkpoint (20Gigs) without taking 20~ish Gb RAM? I only have 16G
@grave frost yes, i want to use it for ML, specifically NLP
Hi guys,
any idea if I could do this in one line?
hue = np.mean(hsv[:, :, 0])
saturation = np.mean(hsv[:, :, 1])
value = np.mean(hsv[:, :, 2])
hue, sat, val = np.mean(hsv, ....)
hue, sat, val = (np.mean(hsv[:, :, n]) for n in range(3)), with this few elements it does not matter that you are using a python comp
other than that, maybe np.moveaxis with some clever axis arg for the np.mean
mean returns an array if given an axis, or a scalar if not
I thought I would find some numpy solution
it does not matter that you are using a python comp
@pale thunder
hey, can you elaborate on that compiler? what u had on mind?
that was short for comprehension
ah right
I just checked, and you can iterate natively on last axis, so for matrix in hsv
i was wrong
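For a pure numpy version, `mean` accepts a tuple of axes, so the two spatial axes can be averaged in one call, leaving one value per channel. The image below is a made-up array, assuming the channels sit on the last axis as in the original snippet.

```python
import numpy as np

# Hypothetical 4x5 HSV image: last axis holds the (hue, saturation, value) channels.
hsv = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)

# Average over the two spatial axes, leaving one mean per channel.
hue, sat, val = hsv.mean(axis=(0, 1))
```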
Have you considered time series model?
no, what's that?
But also I think a clarification on your problem will be helpful too.
From my understanding, you want to predict a return on investment based on some past data.
yes
So time series models learn from past data to predict future patterns.
ah
are there any other names for time series models?
i'm using pomegranate, and it doesn't list time series models as an option
it's possible that pomegranate doesnt support it though
I'm not too sure what pomegranate is, but you can look up ARIMA models.
ok
do you know any ML libraries that implement time series?
o nvm, found one that looks good
https://pypi.org/project/pmdarima/ if anyone comes upon this later
oh wait I did a dumb, when I agreed with "you want to predict a return on investment based on some past data", I interpreted past data to mean a set of observations that has just been collected, not data from a significant time ago. the thing being predicted is totally independent of past data @heady hatch
Oh I see.
If that's the case you can try many more algorithms. If you think the relationship is linear or can be transformed into linear, you can try linear models. If not then try some tree based or ensemble models.
I think I should clarify what you mean by totally independent of the past data.
As in there's no relationship or no time relationship?
Because it would be hard to do a prediction on features that don't have any relationships at all.
oh
there's no time relationship
which of these would (most likely) be the best though:
- Hidden Markov Models
- Naive Bayes
- Bayesian Networks
- Markov Networks
(the person I'm doing this for wants to stick to these models)
Hi all,
I am running the following code:
import math
from scipy import stats
o=float(input("Enter Odds(O):"))
r=float(input("Provide ROI(R):"))
s=abs(math.sqrt(abs(r*(o-r))))
print("\nStandard Deviation(S.D.)="+str(s)+"")
n=float(input("Enter n:"))
t=(math.sqrt(n)*(r-1))/s
print("\nT-score="+str(t)+"\n")
p=round((stats.t.sf(t,n))*100,3)
print("\nP-value="+str(p))
for which I'm entering:
Enter Odds(O):4.76
Provide ROI(R):0.1163
Standard Deviation(S.D.)=0.734889318196965
Enter n:8854
T-score=-113.14951036753682
P-value=100.0
I just can't quite understand the p-value here
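One likely reading of that output, offered as a sketch rather than a verdict on the method: `stats.t.sf(t, df)` is the survival function, i.e. the upper-tail probability P(T > t). With a t-score near -113, almost all of the distribution lies above it, so the result rounds to 100%. The lower tail comes from `stats.t.cdf`. The numbers are the ones from the run above.

```python
from scipy import stats

t_score = -113.14951036753682   # from the output above
n = 8854                        # passed as degrees of freedom, as in the original code

upper_tail = stats.t.sf(t_score, n)    # P(T > t): essentially 1 for a very negative t
lower_tail = stats.t.cdf(t_score, n)   # P(T < t): essentially 0

print(round(upper_tail * 100, 3))      # the "P-value" the script reports
```

Which tail is the right one depends on the hypothesis being tested, which the message does not state.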
Hello, a quick question.
I have a DataFrame with a Time vector, where the time has been given as
00:04
00:08
00:11
and so on, which is an object datatype.
How do i change this to a normal time vector, like 4,8,11, etc.
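A minimal sketch of one way to do this, assuming the column is called `Time` and that the first field counts units of 60 of the second (hh:mm or mm:ss); both are assumptions, since the message does not say.

```python
import pandas as pd

df = pd.DataFrame({"Time": ["00:04", "00:08", "00:11"]})   # hypothetical column name

# Split "AA:BB" into its two integer parts.
parts = df["Time"].str.split(":", expand=True).astype(int)

# Assumption: the first field counts units of 60 of the second (hh:mm or mm:ss).
df["Time"] = parts[0] * 60 + parts[1]
```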
@hallow orbit hard to say without knowing the relationship of your features.
But I would start with naive bayes since that's relatively simple in capturing relationship in a probabilistic way.
Can someone help me to understand Random Forests Classifiers?
Do you have any specific questions?
ah yes. I want to use random forest to predict results from a dataset
oh ty
and here is a good write up on different ensemble models: https://scikit-learn.org/stable/modules/ensemble.html
Hey @midnight rain , have you worked much with rf?
I would love to get your opinion on outliers and imbalanced data with rf.
ive done a bit of work with isolation forests
Ty for the info. I'll look into it
but im not a datascientist im a machine learning engineer
Ahh.
so i do more support than primary modeling
if you want to use a RF i really recommend trying an isolation forest
Do you mind if I ask you regarding your responsibility as a mle?
they work very well and i think they tend to work much better than SVMs in production environments
Ahh.
sure whats up
What's your responsibility like as an mle?
Because I've come across a wide definition and would love to add yours to my knowledge base.
mmm right now im integrating data science models into a large project we are working on
i take the jupyter notebooks from the data scientists and then i turn them into a production ready model by optimizing the code as much as possible and adding production ready error handling etc.
im also managing the data pipelines for productionizing the models.
and im working on a large project in Neo4J to create an interface for us to query and get insights out of the data produced from the models and our other data scraping
Oh that's pretty cool.
Do you add monitoring and testing/debugging for the models?
i dont do any monitoring at my current job yet
eventually i'll add a feedback loop from our BI work, but thats further down the product pipeline
@heady hatch thanks!
Are you the sole/few engineers on your team?
oh wait you're still in the convo, sorry for ping
Yea no problem Theelx.
Btw pantsforbirds, thank you so much for the information.
It's really nice to hear what other people are doing and working with so I can evaluate myself and see where I fit in.
yeah im the only MLE on the team right now
Ahh.
its a smaller VC firm that im working at now
Oh, what are deadlines like?
we are semi researching right now so not terrible
also have you guys seen https://www.gdeltproject.org/ ?
oh that looks cool
i have no idea the latency on it
they might have a bunch of different scrapers set up for each news site
the data scale is so insane
is there like a machine learning project idea generator somewhere
I'm getting bored
generator?
make a mosaic generator
lol don't think i'm at that level yet
I'm having a hard time staying motivated to do the Ng course
its not ML actually ;D
oh my bad then
have you been on open ai gym ?
no what's that
any recommendations for a SQL IDE?
i need to practice cleaning data
I downloaded some fruits from kaggle and just classified it
hi @velvet thorn
okay so about resizability of numpy arrays
in the general case, you cannot resize them, because the memory of a numpy array must be contiguous
so it's best (IMO) to treat them as static.
however, @serene scaffold is actually right in that you can technically resize them with the .resize method.
I agree
so what is the best alternative of vector<T> in C++ for python?
but you reaaaaaaally shouldn't do that because of views and stuff
They can start with a list, append to that, then switch to np array right
but list is slow right
@velvet thorn because say you have an array b that is a view into an array a; if you resize a, the behaviour of b is undefined.
Or you could np.full numbers
@candid lodge why do you need resizability?
if it's a tile map
i need a map that is resizable?
what kind of calculations
are you doing
as in
why not just create a new array
copying the values
oh
so what np.append does is create a new array
so np.append(array, values)?
with the passed values added to the end
!e
import numpy as np
a = np.array([1, 2, 3])
print(a)
print(np.append(a, [4, 5]))
print(a)
@velvet thorn :white_check_mark: Your eval job has completed with return code 0.
[1 2 3]
[1 2 3 4 5]
[1 2 3]
you can see that a does not change, because append returns a new array
this is unlike the behaviour of native Python list.append
you can append inplace to vectors in C++, right?
yes
yeah, you can't for numpy arrays (in the general case)
so they're really more like Python lists, except faster
and statically typed
ohh okay thank you
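The list-first approach suggested earlier in the thread can be sketched like this: grow a plain Python list while the size is unknown, then convert once. The per-step values are placeholders.

```python
import numpy as np

rows = []                        # Python lists grow cheaply in place
for i in range(4):
    rows.append([i, i * i])      # hypothetical per-step values

grid = np.array(rows)            # convert once, when the size is known
```

This avoids calling `np.append` in a loop, which would copy the whole array on every step.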
do you know how to make a numpy array and initialise its size
when created
!e
import numpy as np
a = np.zeros((3, 5))
print(a)
@velvet thorn :white_check_mark: Your eval job has completed with return code 0.
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
there you go
not 0 but what if different?
what are the parameters?
you can check the docs
alright
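For a starting value other than zero, `np.full` (mentioned earlier in the thread) takes the shape and the fill value. A short sketch with arbitrary values:

```python
import numpy as np

a = np.full((3, 5), 7.0)   # 3x5 array where every element is 7.0
b = np.empty((3, 5))       # allocated but uninitialised; fill it before reading
b[:] = -1                  # assign one value to every element in place
```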
ohh
@serene scaffold btw sorry to ping you but was just wondering - why did you say arrays could be resized?
like were you thinking of the same thing I was or was there something else
Hey I'm new here
What happens here?
@prime girder we talk about data science
I figured that if you can change the data in an array without creating a new object, then you can also change the size.
ah, okay
@velvet thorn and we don't talk about fight club
thanks for explaining
@serene scaffold yes but we ALSO don't talk about fight club
I feel like there is a lot of context I am missing
but yeah, we discuss data science/machine learning/statistics/etc. and the Python libraries incidental thereto here
Python Discord Rules
We have a small but strict set of rules on our server. Please read over them and take them on board. If you don't understand a rule or need to report an incident, please send a direct message to @sonic vapor!
Rule 1
Do not talk about fight club.
Rule 2
DO NOT TALK ABOUT FIGHT CLUB.
Rule 3
Listen to and respect staff members and their instructions.
Rule 4
This is an English-speaking server, so please speak English to the best of your ability.
Rule 5
Do not provide or request help on projects that may break laws, breach terms of services, be considered malicious or inappropriate. Do not help with ongoing exams. Do not provide or request solutions for graded assignments, although general guidance is okay.
Rule 6
No spamming or unapproved advertising, including requests for paid work. Open-source projects can be shared with others in #python-discussion and code reviews can be asked for in a help channel.
Well there goes my gameplan to get files to hack people with
what is the better code of this
```
test = [5, 2, 6, 3]
counter = 0
for e in test:
    print(e)
    print(counter)
    counter += 1
```
i want to keep track of the index
test = [5, 2, 6, 3]
for i, value in enumerate(test):
    print(value)
    print(i)
@candid lodge
WHY ARE WE SCREAMING
@lapis sequoia please don't disturb the developers.
Guys, each time I try to run GridSearchCV, I always get this invalid parameters error, even though I'm certain all the hyperparameters are spelled and labelled correctly. I use VS Code.
Never mind about this again, just found out I spelt neighbours wrongly. I used neighbours instead of neighbors
@boreal summit
Amen to that I hate American spelling
Cuz a "u" is soo hard to type
@prime girder I've been wondering why the code couldn't run even after doing everything right for the past 2 days. I'm a bit relieved now.
Thanks.
@velvet thorn I think I'm beginning to understand why databases prefer atomic values.
Working with data structures inside columns is a pain.
@heady hatch by a lot
I mean there are databases that do well with those
just not SQL
Hii.. Anyone familiar with pyzbar?
I am getting an error...
while trying to implement this code:
import cv2
import numpy as py
import pyzbar
cap = cv2.VideoCapture(0)
cap.set(3,640)
cap.set(4,480)
while True:
    success,img = cap.read()
    for barcode in decode(img):
        print(barcode.data)
        mydata = barcode.data.decode('utf-8')
        print(mydata)
    cv2.imshow('Result',img)
    cv2.waitKey(1)
Hey where can I see all the datasets available in seaborn?
someone can help me with standardisation process? explain how it's done?
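In short, standardisation subtracts each feature's mean and divides by its standard deviation, so every column ends up with mean 0 and standard deviation 1. A minimal numpy sketch on made-up data:

```python
import numpy as np

# Hypothetical data: rows are samples, columns are features.
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

mean = X.mean(axis=0)            # per-feature mean
std = X.std(axis=0)              # per-feature standard deviation
X_scaled = (X - mean) / std      # each column now has mean 0 and std 1
```

When there is a train/test split, the mean and standard deviation are normally computed on the training data only and then reused on the test data.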
i have a code which saves a image and do further execution of code
now i do not want to save this image , i want to directly do further execution of code
how i can do this?
my code here
with open("imagetosave2.png", "wb") as test_img:
    test_img.write(image_data)
test_img = image.load_img("imagetosave2.png", target_size = (64, 64))
here i do not want to save img2.png this here
ping me when u have ans
@mild topaz whats image data consisting of?
numpy arrays?
or what?
btw is this channel about data science or for machine learning as well?
see my code here
image_data = base64.b64decode(data["image"])
print(type(image_data))
data = io.BytesIO(image_data)
try:
    test_img = Image(io.BytesIO(image_data))
except Exception as e:
    logger.debug({"status": "invalid", "message": "Provide valid base64 string"})
    return {"status": "invalid", "message": "Provide valid base64 string"}
test_img = open("img2.png", "rb")
image_data = test_img.read()
test_img.close()
test_img = image.load_img("img2.png", target_size = (64, 64))
@fierce shadow
both
never worked with those base64 stuff... but I am pretty sure you might have to use PIL.Image
it has many functions to convert images
see here i am not converting any image @fierce shadow
hi! I'm learning some supervised learning stuff and i have a project to find the best classifier for some data; i'm trying svm's rn and I find that the poly kernel takes a long time; I can't seem to find why some kernels take longer than others; it seems to be data dependent too since my friend with a different data set had the same problem but for her the linear kernel was the one that took a long time to run; is there a reason why?
I'm trying to work with some json data but can't figure out how to gather all the information and then use it, for example in a function that gets all the json sections without me specifying the real key name, like 2020-11-11 here:
{
"status": 200,
"type": "stack",
"data": {
"2020-11-11": {
"total_cases": 166707,
"deaths": 6082,
"recovered": 0,
"critical": 129,
"tested": 2431770,
"death_ratio": 0.03648317107260043,
"recovery_ratio": 0
},
"2020-11-10": {
"total_cases": 162240,
"deaths": 6057,
"recovered": 0,
"critical": 92,
"tested": 2431770,
"death_ratio": 0.0373335798816568,
"recovery_ratio": 0
},
"2020-11-09": {
"total_cases": 146461,
"deaths": 6022,
"recovered": 0,
"critical": 92,
"tested": 2431770,
"death_ratio": 0.04111674780316944,
"recovery_ratio": 0
},
"2020-11-08": {
"total_cases": 146461,
"deaths": 6022,
"recovered": 0,
"critical": 92,
"tested": 2431770,
"death_ratio": 0.04111674780316944,
"recovery_ratio": 0
},
"2020-11-07": {
"total_cases": 146461,
"deaths": 6022,
"recovered": 0,
"critical": 92,
"tested": 2431770,
"death_ratio": 0.04111674780316944,
"recovery_ratio": 0
},
"2020-11-06": {
"total_cases": 146461,
"deaths": 6022,
"recovered": 0,
"critical": 92,
"tested": 2431770,
"death_ratio": 0.04111674780316944,
"recovery_ratio": 0
},
"2020-11-05": {
"total_cases": 141764,
"deaths": 6002,
"recovered": 0,
"critical": 90,
"tested": 2242469,
"death_ratio": 0.0423379701475692,
"recovery_ratio": 0
},
"2020-11-04": {
"total_cases": 137730,
"deaths": 5997,
"recovered": 0,
"critical": 73,
"tested": 2242469,
"death_ratio": 0.04354171204530603,
"recovery_ratio": 0
}
}
}
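One way to walk every dated section without naming the keys is to iterate over the inner dictionary with `.items()`. The sketch below uses a trimmed copy of the payload above; the variable names are made up.

```python
import json

# Trimmed copy of the payload above (two dates only).
raw = """
{"status": 200, "type": "stack", "data": {
  "2020-11-11": {"total_cases": 166707, "deaths": 6082},
  "2020-11-10": {"total_cases": 162240, "deaths": 6057}
}}
"""
payload = json.loads(raw)

# .items() yields every (date, stats) pair without naming any date.
cases_by_date = {day: stats["total_cases"] for day, stats in payload["data"].items()}
latest_day = max(payload["data"])   # ISO dates sort correctly as strings
```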
does anyone know something about huggingface and text classification with electra?
@marsh chasm could be that your data is high dimensional. You could reduce the dimensionality using PCA or some other dimensionality reduction technique.
Also, your data might be too complex for the model you're using to train it.
@marsh chasm you could definitely try PCA, but you should also check to see if there are a lot of zeros in your dataset. Sometimes these zeros are treated as a placeholder or null value. You could use the following code to check:
print(np.sum(data == 0)/(data.size))
If this results in a large %, you should consider using the TruncatedSVD dimensionality reduction technique.
With textgenrnn, is it possible to continue training a previously trained model (possibly with more data)? So, like, I'll train with a dataset with 10000 datapoints for 5 epochs one day, and then the next day I can continue to train with that same data (possibly now with 10200 datapoints) from the hdf5 file for another 5 epochs to get even better results?
@boreal summit yeah I'll try PCA thanks
@cerulean spindle ok cool! Thanks so much
@marsh chasm Are you using MNIST dataset?
oh ok
Does anyone here mess with Tensorflow? Pretty much learned what I can from the entire Python Crash Course book and was wanting to move into ML. Only been doing Python for like 8 months. Should I learn about something else before Tensorflow and ML or just go straight into it?
Are you familiar with ML foundation? Such as train, validation, test split, overfitting, underfitting, imbalanced datasets, model evaluation, optimization, different kinds of ml problems, data cleaning, transformation, selection, etc etc?
Along with mathematical foundation such as linear algebra, probabilities, statistics, and calculus?
Or you can also dive straight in and go with a top down approach instead of a bottom up.
Ultimately depends on your learning style and how much you are willing to adapt.
There's fastai, which teaches it from a top-down perspective: you learn the models as tools first and then learn how to take them apart.
I've been using pandas/python/jupyter for many years now, but i recently saw an R notebook and it looked really really clean/easy. Can someone explain to me what benefit R might have to someone who already knows python/pandas/jupyter well?
I believe R is a more statistically minded approach, but I'm not sure.
my initial thoughts were that it looked cleaner, but python is maybe more granular?
Anyone here who knows both R and Python?
I have something I have been doing in R for quite a while but implementing it in Python is a hassle
def children(data):
    if data=0:
        return 'childless'
    if 1 <= data =<3:
        return '1-3 children'
    if data > 3:
        return '4+ children'
im getting a syntax error with this
can anyone help lol
says data=0 is syntax error
data == 0
no worries
whats wrong with the <= 3
if data>=1 and data<=3:
def children(data):
    if data == 0:
        return 'childless'
    if data >= 1 and data <= 3:
        return '1-3 children'
    if data > 3:
        return '4+ children'
u a god
Keep on working on Python
i will one day become a god like you
I am not a god haha
Just learn that comparison syntax and you should be fine
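On the earlier question about `=<`: the operator is spelled `<=` in Python, and comparisons can be chained, so the middle branch can also be written in the original style. A sketch of the same function with that spelling:

```python
def children(data):
    if data == 0:
        return 'childless'
    if 1 <= data <= 3:      # chained comparison, same as data >= 1 and data <= 3
        return '1-3 children'
    if data > 3:
        return '4+ children'
```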
I just wished someone helped me with my issue, I am overcomplicating my code
@torpid cave I'm not familiar with R, but might be able to help you translate.
What are you trying to do in terms of code?
I have one dataframe with responses, and another dataframe with keys
I just need to translate responses to keys
My initial approach (works in R) was:
df.apply(lambda x: df2[df2['key'] == x]['code'].item())
Hm could you give me an example of the dataframes?
for column in insurance.columns:
    pivot = insurance.pivot_table('charges', index=column)
    display(pivot)
    pivot.plot.bar(stacked=False)
oscar im getting an error
with the new columns we just made
one sec @hearty jewel
that code worked with all columns except the new columns
ValueError: Grouper for 'charges' not 1-dimensional
Do you have two columns named "charges"?
df = pd.DataFrame(dict(
Sample1 = [5,2,10,2,2],
Sample2 = [5,5,5,10,10]))
df2 = pd.DataFrame(dict(
Keys = ['A','B','C'],
Values = ['5', '2', '10'] ))
@heady hatch
What I am trying to do is just convert df into df2 letters
so change all the 5 to 'A'?
yep
Your apply makes sense.
I am doing this Frankenstein
def TranslateList(list1):
    def LookValue(value):
        value = str(value)
        value = info[1][info[1]['IDNumber'] == value]['Sample'].item()
        return value
    translation = []
    for item in list1:
        translation.append(LookValue(item))
    return translation
def CreateTranslatedTable():
    # I am writing this now
    pass
But a one-liner should do it
Not sure why I can't get it right
@hearty jewel what are you trying to do? I have some more time before I start work
So another way of doing it is to create a dictionary to index into.
df2_dict = df2.set_index('Values')['Keys'].to_dict()
df1.apply(lambda x: df2_dict[x])
This is assuming that the values to keys mapping is unique.
Yea let me know how that goes.
TypeError: 'Series' objects are mutable, thus they cannot be hashed
damn it
haha
My df is quite complex, let me check the code
But I think a dict should be the way
Hmm, do your values or keys contain Series objects?
Nice nice nice.
Nevermind
info[0][['Sample1','Sample2','Sample3']].apply(lambda x: info_dict.get(x))
Tried that
so info is a list of dataframes
df 0 is the one I am using
df[0]
and the columns with the keys are the ones I am interested
To double check, info_dict is okay? Like you can create it and index into the hashes.
so info_dict is
i want to pivot between each column and charges
info_dict = info[1].drop(['Title'],axis=1).set_index('IDNumber')['Sample'].to_dict()
and show a graph
info_dict = info[1].drop(['Title'],axis=1).set_index('IDNumber')['Sample'].to_dict()
in a for loop
And print info_dict for yourself, is that what you're expecting?
{'329': 'A', '587': 'A', '433': 'B', '274': 'B'}
print(info_dict)
{'329': 'A', '587': 'A', '433': 'B', '274': 'B'}
And that's what you want?
Cool. Now hmm for info.
You said info is a list of dataframes?
yes
info is a list that contains 2 dataframes
I just did it like that to have some order, I dont like having so many variables in the code
Should not affect anything
@hearty jewel maybe try showing me what output you are looking for. I could help with data manipulation.
For graphs I use ggplot in R and the plots I did in Python are just copy/paste so I won't be able to help there
@heady hatch how is that?
nvm I will just read the docs
applymap is elementwise, apply is series.
And I think the error is tripping up when you do dict.get(SeriesObject).
I thought, I should not be coding something this simple that hard
Why is Python not as simple as R
many thanks, you just won lots of internet points
Hope your journey is swell from here on.
info[0][['Sample1','Sample2','Sample3','Diff']].applymap(LookValue)
Works like a charm
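A minimal sketch of the dictionary-lookup idea on the small `df` / `df2` example posted earlier (not the real `info` frames). It assumes each value maps to exactly one key, and casts the values to `str` because `df2['Values']` holds strings. `Series.map` is used per column here, which does the same elementwise lookup that `applymap` does.

```python
import pandas as pd

df = pd.DataFrame(dict(
    Sample1=[5, 2, 10, 2, 2],
    Sample2=[5, 5, 5, 10, 10]))
df2 = pd.DataFrame(dict(
    Keys=['A', 'B', 'C'],
    Values=['5', '2', '10']))

# value -> key lookup table, e.g. {'5': 'A', '2': 'B', '10': 'C'}
lookup = df2.set_index('Values')['Keys'].to_dict()

# apply() hands each column (a Series) to the lambda; Series.map then
# looks up every element of that column in the dict
translated = df.astype(str).apply(lambda col: col.map(lookup))
print(translated)
```

Handing a whole Series to `dict.get` is what raised the "cannot be hashed" error; the lookup has to happen once per element.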
Hey guys trying to train my gan getting a weird traceback
Can someone help me figure out what's going in
*on
The traceback would imply it's a type error but I've just loaded up type8 numpy arrays into tf.dataset.from tensor slices or whatever
Should I upload my h5py files and convert the numpy array to type 64 or whatever it's saying is appropriate
You should train your models with float32 inputs in general. If you're using images, it's recommended that you rescale them to the [0, 1] range.
For gans they recommend you normalise between -1 and 1 tho
Most documentation I've seen does that
Which is what I've done
All my data is between those numbers
Anyway I think I'm gonna try reshape with numpy to float32 cos they're float64
That would be called type casting, not reshaping. [-1, 1] would also work, I personally don't have much experience with GANs
My bad
It is asking me to cast it to a supported type
TypeError: Failed to convert object of type <class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'> to Tensor. Contents: <BatchDataset shapes: ((1, 256, 256, 1), (1,)), types: (tf.float64, tf.int32)>. Consider casting elements to a supported type.
you're dealing with a dataset, which doesn't have a dtype of its own
you should map the dataset with a function that casts its elements to the correct dtype
# Suppose your dataset is the variable `ds`
cast_ds = ds.map(lambda x, y: (tf.cast(x, tf.float32), y))
I mean I cast the thing as float32 before storing it in the tf.dataset file
Didn't work
But idk
How do I implement that into the code
Implement what?
The cast ds
I just did it above
Hii, Can anyone help me as I would like to get excel sheet from complex nested JSON file?
python has a json library that might be useful. That can help you turn it into a python dict. Then pandas can build a dataframe from a dict, and a pandas dataframe can be saved to a csv or workbook.
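A rough sketch of that route using only the standard library, assuming the nesting is dicts inside dicts. The sample records and the `flatten` helper are made up for illustration. With pandas installed, `pandas.json_normalize` plus `DataFrame.to_excel` is another option.

```python
import csv
import io
import json

raw = ('[{"id": 1, "user": {"name": "Ann", "address": {"city": "Oslo"}}},'
       ' {"id": 2, "user": {"name": "Bob", "address": {"city": "Rome"}}}]')

def flatten(record, prefix=''):
    """Turn nested dicts into one flat dict with dotted keys."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + '.'))
        else:
            flat[name] = value
    return flat

rows = [flatten(rec) for rec in json.loads(raw)]

# in-memory buffer for the demo; use open('out.csv', 'w', newline='') for a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The resulting CSV opens directly in Excel; lists inside the JSON would need their own handling, which this sketch leaves out.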
Hi! I was wondering if people here are familiar with the validation_curve functionality of sklearn. Basically I was wondering if for the x axis on a validation curve I can plot instead of a hyperparameter a combination of hyperparameters like so:
my teacher somehow managed to produce that plot, unfortunately i can't see how given the limitation of the validation_curve function with param_name and param_range
The x-axis format reminds me of what matplotlib.pyplot does by default if you have a multi-index dataframe, but I am not sure on that. Even if what I said is correct I am not sure if it will help you. The short answer is I don't have any experience with the validation_curve function.
yeah the thing is i feel like i need the validation_curve in order to plot both the training and validation score (every time i try to look up how to plot them without validation_curve google just tries to show me validation_curve xD)
pls ping me if you could help! i'd greatly appreciate it. i asked in a help channel before but my helper and i got stuck xD
I can just use the gridsearch features
ugh i totally forgot about that
oh wait but still that doesnt show me the training vs validation score
hm
https://colab.research.google.com/drive/1tn4l65t47I6G1MV5fcZ0Q-W91Rtwns9e?usp=sharing https://hastebin.com/eyopocaxir.csharp Any ideas what's going wrong?
I'm following this tutorial https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/text_generation.ipynb but changing as much as possible to try and show to myself I actually understand what is going on
any resources for sql projects besides kaggle?
Hi guys, anyone here knows how to do math in Python?
I mean solving equations by calling out other variables
is this a good channel for web scraping
I mean solving equations by calling out other variables
@torpid cave are you talking about symbolic math?
How can I move the legend?
no seaborn
getting this error not sure how to fix. Note: im still learning python so sorry for my stupidity xd
that only works if you're in a Jupyter notebook
inside a regular script, it's nonsense
so it doesnt work in spyder... damn...
What is better : sns.factorplot(kind='bar') or the sns.barplot()
whats the difference between them if you dont mind me asking
There is actually no difference
it's just that factorplot has a kind attribute
factorplot/catplot
so like if you set kind to violin it will act as sns.violinplot()
I see. Ty
so what would you prefer?
I'm making a random forest code to predict something using a dataset i found. So just going around and researching different codes and reading which one would be best and easiest
I mean would you use factorplot() or the specific kind plots
hmm, I prolly would since it'll make is easier to understand for me
so you would use specific kind plots?
i guess ye
ah ok xD
Hey guys I'm still having trouble with my model
I'm confused because shouldn't a numpy array that is stored in a tf.dataset be a tensor
It should
It doesn't make sense that the error I'm getting is saying BatchDataset cannot be converted to a tensor
you need to distinguish between a Dataset and an element of a Dataset
a Dataset is a Dataset
an element of a Dataset is a Tensor
Datasets are basically a better version of a vanilla Python generator when it comes to iterating over data
they are not convertible to ndarrays directly
But even if I cast it as a float 32 before loading it with from_tensor _slices I still get the same thing
Dataset is not a tensor
as such, it doesn't have a dtype
so you can't cast it
you can only cast the elements of the dataset
Yeah I meant cast the image array
in that case you need to map the dataset
as I have shown previously
the mapping function is applied to each element of the dataset
either that, or you cast the array before converting it into a dataset
Well I tried the latter and it didn't work, I got the same thing
how did you do it
and then?
Same error traceback
what does the error say again?
Just float32 instead of 64
One sec
TypeError: Failed to convert object of type <class 'tensorflow.python.data.ops.dataset_ops.BatchDataset'> to Tensor. Contents: <BatchDataset shapes: ((1, 256, 256, 1), types: (tf.float32)>. Consider casting elements to a supported type.
That's the error I get when I cast it as float32 before converting
can you provide your code again?
Seems that you're passing a dataset to tf.cast which doesn't work because, as mentioned above, datasets don't have a dtype
gen_output = generator(g_dataset, training=True)
this doesn't work
you need to pass in a tensor
not a dataset
Aight
such as the input argument of the function this code lies within
Yeah so now I'm getting a thing saying content is larger than 2gb but my training data (g_x_in) ,is 300mb
Cannot create a tensor proto whose content is larger than 2gb
Hi, i'm new to data science and i want to create some ML project but i need some datasets, is there any recommendation site for good datasets that have more than 5k data? other than kaggle and UCI
can you print input?
your fit function also has a loop that doesn't make sense
# Train
for n, (g_dataset) in train_ds.enumerate():
    print('.', end='')
    if (n+1) % 100 == 0:
        print()
    train_step(g_dataset, target_dataset, epoch)
print()
train_ds.enumerate() yields the train step, and an element of train_ds
so using g_dataset is misleading
datasets don't yield datasets
@weary heart what kind of data are you looking for? Have you considered scraping?
Ok changing that thanks
That's what I get when I print the generator input
i'm looking for some kind of e-commerce datasets or something like that. i haven't tried scraping atm
looks correct
Idk man I'm so confused as to why it doesn't work
look at the call stack and determine the type of each variable that is relevant
check that they have the correct type (don't confuse Dataset with Tensor!)
Well yeah ive changed all calls to tensors
Where appropriate at least
The problem is the size?
But it's totally below 2gb
I've looked up the error on Google and I can't find anything thats relevant
This stinks
can you show the error log?
Sorry to nitpick but is this correct?
g_x_in = np.array(g_x_in) - 127.5/127.5
It looks like array - 1
does that make sense to you?
Oh lmao I didn't mean that as in I wrote it. I picked it from @lapis sequoia 's code.
Let me put brackets around that
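To spell out the precedence issue: division binds tighter than subtraction, so `x - 127.5/127.5` is just `x - 1.0`. A small sketch with plain floats standing in for pixel values (the sample values are made up):

```python
pixels = [0.0, 127.5, 255.0]

# as written: 127.5/127.5 is evaluated first, so this is just `p - 1.0`
wrong = [p - 127.5 / 127.5 for p in pixels]

# with brackets: shifts and scales [0, 255] into [-1, 1]
right = [(p - 127.5) / 127.5 for p in pixels]

print(wrong)
print(right)
```

Writing it as `p / 127.5 - 1` gives the same [-1, 1] result without needing brackets.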
I'll grab the error logs one sec dude
I'll be right back homies
oh
you need to zip the two datasets
if you're passing in a dataset, the y parameter in tf.keras.Model.fit will be ignored
according to the docs:
- A `tf.data` dataset. Should return a tuple of either `(inputs, targets)` or `(inputs, targets, sample_weights)`.
g.map(sns.displot,'total_bill')
What is wrong with this? It draws the displot separately from the grid
but when I do
g.map(sns.distplot,'total_bill')
it works fine, but a warning says distplot will be deprecated in a future release
g.map(sns.histplot,'total_bill')
works, but I just want to know why displot won't work?
@hasty grail you know how to zip it?
tf.data.Dataset.zip
So would I do smth like g_dataset = tf.data.dataset.zip(g_dataset)
async def masscloneemoji(self, ctx, emoji: discord.PartialEmoji, name=None):
What should I change here if I want to be able to add multiple emojis at once?
look at the example in the docs @lapis sequoia
zip is in the sense of the vanilla Python zip
Yeah ok will do
anyone know why displot doesn't work with grid?
g.map(sns.displot,'total_bill')
What is wrong with this? It draws the displot separately from the grid
This is what I am talking about
@hasty grail same error
code?
Sec
g_dataset = tf.data.Dataset.zip(g_dataset) that's not how you zip datasets
did you read the docs?
you need to zip the sample and label datasets together into a single dataset
and pass only that dataset to fit
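A toy illustration of what zipping means here, using the builtin `zip` on plain lists (the values are placeholders, not real tensors). `tf.data.Dataset.zip` follows the same idea but expects a tuple of `Dataset` objects, e.g. `tf.data.Dataset.zip((images_ds, labels_ds))`, where both names are hypothetical datasets.

```python
images = ['img0', 'img1', 'img2']   # stand-ins for image tensors
labels = [7, 2, 1]                  # stand-ins for the matching labels

# zip pairs the i-th image with the i-th label
pairs = list(zip(images, labels))
print(pairs)

# each element is now one (input, target) tuple, which is the shape
# a training loop wants to iterate over
for image, label in pairs:
    print(image, label)
```

Zipping a single dataset with itself, or a dataset with something that is not a dataset, does not produce these pairs.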
I think it wont work because g_x_in is an ndarray
I wasn't sure what to do because it said you need to put dataset objects in it
G_y_in is a dataset
I can comment out the line where it's g_x_in/127.5 - 1
That'll change g_x_in to a dataset and the zip will work but then the issue is how do I normalise all the data
You can normalize the data, then convert it into a dataset
Or use .map on the dataset to apply a mapping function to each element
BUFFER_SIZE = 5000
gen_input = h5py.File('/content/gdrive/My Drive/Colab Notebooks/files/training_mnist_raw.h5','r')
g_x_in = gen_input.get('images')
g_x_in = np.array(g_x_in)/127.5 - 1
g_y_in = gen_input.get('labels')
g_dataset = tf.data.Dataset.from_tensor_slices(g_x_in)
g_dataset = g_dataset.shuffle(BUFFER_SIZE)
g_dataset = g_dataset.batch(BATCH_SIZE,drop_remainder=True)
g_dataset = tf.data.Dataset.zip((g_dataset,g_y_in))
gives me
```TypeError Traceback (most recent call last)
<ipython-input-19-262507c436a9> in <module>()
8 g_dataset = tf.data.Dataset.from_tensor_slices(g_x_in)
9
---> 10 g_dataset = tf.data.Dataset.zip((g_dataset,g_y_in))
11
12
1 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py in zip(datasets)
998 Dataset: A `Dataset`.
999 """
-> 1000 return ZipDataset(datasets)
1001
1002 def concatenate(self, dataset):
/usr/local/lib/python3.6/dist-packages/tensorflow/python/data/ops/dataset_ops.py in __init__(self, datasets)
3481 message = ("The argument to `Dataset.zip()` must be a nested "
3482 "structure of `Dataset` objects.")
-> 3483 raise TypeError(message)
3484 self._datasets = datasets
3485 self._structure = nest.pack_sequence_as(
TypeError: The argument to `Dataset.zip()` must be a nested structure of `Dataset` objects.```
I don't know if it's different because of the fact that its a batch dataset
Because g_dataset before being zipped is a batch dataset object
Images and labels
The point seems to be that the content of your zip isn't properly structured to be accepted by tensorflow.
Since you want to use MNIST, it seems, you should find it easier to do this:
import tensorflow as tf
import tensorflow_datasets as tfds
(ds_train, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train', 'test'],
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)
def normalize_img(image, label):
    """
    Normalizes images: `uint8` -> `float32`.
    """
    return tf.cast(image, tf.float32) / 255., label
ds_train = ds_train.map(normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_train = ds_train.cache()
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)
ds_test = ds_test.map(normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)
@cobalt jetty thanks man, my data is actually like a smaller version of mnist and the images are distorted
It's like 5000 images
Do you think the same code would work with just my unloaded image and label dataset from my h5py files? The images are 256*256
The mnist images are formatted in 28x28 natively, where do you get a 256x256 mnist dataset?
So these are actually photos I took with a camera that captured the mnist images that were projected with a laser using a DMD
And then I distorted those images using scattering media
Lol
Those images were 256
This project is kinda like a superresolution project
I'm essentially correcting the distortion with a gan
Do the sk-learn classifiers not work with just a single feature?
since you want to work on a dataset that is comparable to the MNIST, I would resize your secondary dataset to 28x28.
You might also want to look at the following: https://keras.io/examples/vision/image_classification_from_scratch/
there are some pretty nice pre-processing functions from tf/keras shown there
especially tf.keras.preprocessing.image_dataset_from_directory, which might ease your workflow.
I don't think I would need to do that though, my generator downsamples the image and outputs a 28*28 image
I've seen precedent of this working in a paper which was the inspiration for this project
mhm
I've used that page to help preprocess a relatively chonky dataset to train a NSFW detector.
You're trying to create a LeNet-like network?
I've never heard of that I thought this was based off the original SRGAN
Just appropriated to a physics context
Basically the generated output is 28*28 anyway
So what are you trying to achieve? At first glance it seems like you're trying to compress a 128² picture into a 28² one.
My images (which are 256*256) go through the generator and are processed to eventually look like the target images (mnist) and then go through the discriminator which decides if the image is fake or real
It's essentially a network that is making predictions of what the image looked like prior to distortion
that's actually neat.
Would be if I could get it to work lol
I just can't seem to figure out how to load in data, I've been trying to zip data but I feel like I shouldn't even need to my data isnt over 2gb
My training dataset is 313 mb
what's the issue? You can't load your data in memory?
I don't understand why you'd need to zip the data to work with it
Yeah exactly lol
But no the issue is when I run fit
I'll paste the traceback
This is an error I get when I try to run my code
can I get the code of the cell where it happens?
Sure one sec
also I'm in class right now, I might not answer quickly.
Anyone know how to load a large checkpoint without eating up all the RAM? (I have about a 20G Pytorch checkpoint) Would prefer a solution that can make it work in about 16G of memory....
No problem I really appreciate your help all the same
Also it's 11:40 pm for me so I might knock soon, if you want you can just DM me
Really appreciate everything tho guys thank you all
you have an issue in how your g_dataset or dataset are constructed.
you're trying to pass a file which tensorflow cannot parse as a tensor because it's too big.
However if your dataset is only 353mb, you might have a preprocessing issue
you have to look in your previous cells
how you process and get your two variables.
I don't know I just used sentdex's method for storing data in h5py
I mean he stored his in pickle files but I just used hdf5
Cos pickle sux
Lol
tbh, your dataset is small enough you can just use keras to load pictures directly in batches of like 16, 32, 64
no need to perform some preprocessing like that.
is logistic regression and naive bayes good for numerical classification dataset?
supervised learning classification
depends on the complexity of your dataset, but I'd say that logistic regression is a good start for binary classification iirc.
Yeah I was thinking about that roms
depends on the complexity of your dataset, but I'd say that logistic regression is a good start for binary classification iirc.
@cobalt jetty the target is binary yeah
then I'd go back to the Keras page I sent you and use the functions there.
@cobalt jetty what about naive bayes? is that good for binary target as well?
I've never used naive bayes, so I can't tell.
but like logistic regression is super easy to implement
all of them are easy, same code just changes the model function
make the model, fit the model and test predictions
tell me if i'm skipping something
like with sci-kit learn
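import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
dataset = pd.read_csv("AAAAAAAAAAAAH.csv", sep=";")
train_set, test_set = train_test_split(dataset, test_size=0.2)
model = LogisticRegression()
# assumes the target column is called `label`; every other column is a feature
model.fit(train_set.drop(columns="label"), train_set["label"])
pred = model.predict(test_set.drop(columns="label"))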
yeah that's what i was saying, are there harder ways to implement the learning algorithms?
You could implement the logreg from scratch
doing something from scratch without using libraries is always the hardest.
Anyone know how to load a large checkpoint without eating up all the RAM? (I have about a 20G Pytorch checkpoint) Would prefer a solution that can make it work in about 16G of memory....
if you're talking about all the checkpoints in Pytorch, isn't there a load function where you can specify which checkpoint you want to load.
I'd be surprised if a model weighed 20gb
I've not used Pytorch much so I can't really help
hello everybody, does anyone know how pycharm compares to other python editors like vscode or sublime? what is the difference between scientific python development and pure python development? give me your opinions
yeah I wouldn't think a checkpoint would be 20gb, usually mine only ever go up to like 200mb
but I don't think theres any way to do that anyways, you really do have to load everything into memory anyways since the model itself has to be stored in memory
so even if you could somehow load the checkpoint without an oom error, you'd still have to have the whole model in your memory anyways
@austere swift ik, but model can also be read through SSD which may impact performance but wouldn't matter much since I am not training, only inferencing. So, that wouldn't really impact time taken that much
I've never read the model off disk so I wouldn't know how to do that lol
Making progress on research for some jupyter notebooks I'm working on
With an online calc, puts it to 6s to load the model which doesn't sound that bad, and I can have it done in a day or so
Does anyone know of things relating to audio you'd like to be explained in a simplified way?
@austere swift Are you sure that a 200Mb model would take exactly 200 RAM, perhaps there is some clever memory tricks done on the way to save memory..?
tbh, I'm always intrigued at how people can splice voice out of a clip (with music for instance) or vice versa.
but not enough to read up on that.
@cobalt jetty Using Machine Learning
@cobalt jetty either by using a bandpass, machine learning or trying to recreate the music part and subtracting it
Bandpass and a bit of manual sample editing is probably what they did back in the day
mhm
an uneducated guess of mine was that voice and instruments are usually not recorded with the same mics and so voice and instruments would be recorded on different subparts of let's say a magnetic band.
so one could only read those parts.
but I was wrong, I see.
Guys. Anyone up? I need some help
That's usually the case for source files, but due to size constraints on tapes/cds it all had to be put on one channel (two for stereo), and that was usually stored as interleaved data at best
Can someone help me? I just need to know how to go about an Idea I have, I will code it myself
I just need the framework
@crude marsh what's the issue?
I have this code that basically scrapes a share price from the web and then prints out the price
I need it to record the prices in an excel file
but yeah, the interactive jupyter notebooks I'm working on should hopefully help with understanding some more abstract concepts
@crude marsh try openpyxl, or if even something simple works, you could export it as CSV using the built-in csv library
So, in open csv, Does it have to be in a table form?
!docs csv.writer
csv.writer(csvfile, dialect='excel', **fmtparams)
Return a writer object responsible for converting the user's data into delimited strings on the given file-like object. *csvfile* can be any object with a `write()` method. If *csvfile* is a file object, it should be opened with `newline=''` [1](#id3). An optional *dialect* parameter can be given which is used to define a set of parameters specific to a particular CSV dialect. It may be an instance of a subclass of the [`Dialect`](#csv.Dialect "csv.Dialect") class or one of the strings returned by the [`list_dialects()`](#csv.list_dialects "csv.list_dialects") function. The other optional *fmtparams* keyword arguments can be given to override individual formatting parameters in the current dialect. For full details about the dialect and formatting parameters, see section [Dialects and Formatting Parameters](#csv-fmt-params)... [read more](https://docs.python.org/3/library/csv.html#csv.writer)
Whats this?
The documentation for something that writes to a CSV file, there's an example if you click read more
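A minimal sketch of recording prices with `csv.writer`, assuming one row per reading plus a timestamp. The `record_price` helper, the file name `prices.csv`, and the sample prices are made up; in practice the price string would come from the scraper.

```python
import csv
import os
import tempfile
from datetime import datetime

def record_price(path, symbol, price):
    """Append one (timestamp, symbol, price) row to a CSV file."""
    is_new = not os.path.exists(path)
    # newline='' is what the csv docs recommend for files handed to csv.writer
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(['timestamp', 'symbol', 'price'])
        writer.writerow([datetime.now().isoformat(timespec='seconds'), symbol, price])

# demo in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'prices.csv')
    record_price(path, 'HDFCBANK.NS', '1,234.50')
    record_price(path, 'HDFCBANK.NS', '1,236.00')
    with open(path, newline='') as f:
        rows = list(csv.reader(f))
print(rows)
```

The data does not have to be a table up front: each call adds one row, and Excel can open the resulting file as-is.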
I am not able to access the file
Uhh...
I can only read !docs csv.writer
I dont know. How to enable them?
Hmm. Yeah sure.
This is what I see.
AAh, just a min. found out whats wrong
now I can see
@spark nimbus U still there?
Yeah
Yeah, I can read it now. I will post the code here for your reference
does anyone know whether there's a better way to select rows in a dataframe based on a column value than what I'm currently using?
part_df = df[df['path'].str.startswith(directory_path, na=False)]
I'm dealing with 2+ million rows and the command above takes about 11 seconds to complete
# Imports
import bs4
import requests
# Custom function
def get_share_price(share_url):
    res = requests.get(share_url)
    res.raise_for_status()
    # Element finder
    soup = bs4.BeautifulSoup(res.text, features="html.parser")
    # raw string so the CSS escapes (\( \) \%) are not read as Python escapes
    elems = soup.select(r'#quote-header-info > div.My\(6px\).Pos\(r\).smartphone_Mt\(6px\) > div.D\(ib\).Va\(m\).Maw\(65\%\).Ov\(h\) > div > span.Trsdu\(0\.3s\).Fw\(b\).Fz\(36px\).Mb\(-4px\).D\(ib\)')
    return elems[0].text.strip()
# Get price
price = get_share_price('https://in.finance.yahoo.com/quote/HDFCBANK.NS/history/')
# Call
print('The price of HDFC bank share is ' + price)
@quiet breach Sorry mate, I am intermediate
Why don't u use copy, paste and ctrlF to do it quickly?
@quiet breach
how do you mean?
huh? no
I have a dataframe of 2 million rows
where I must select a subset containing only the rows where the value in the 'path' column matches a string
the initial scope of what I'm working on required this to happen 50 times
so I was fine with it taking 10 seconds per run
now it needs to run 120k times :)
Ahh, I see, since I am an intermediate, I cant really help a lot, but from what I know, you can scrape the code using python to return only the values that meet a specific criteria
@crude marsh doesn't yahoo finance have an API so you don't need to scrape?
Yeah, but I am doing it to have some basic experience with web scraping
ah
then I can move on to some complex projects with confidence
Like scraping wikipedia
Aight, Imma go see if CSV works
Yahoo doesn't support an API anymore IIRC. Last year their API was removed -- maybe it's changed since.
That caused me issues when I tried to recreate my own VIX index calculator.
Yeah, but I want it to create a chart connecting different articles with each other
You know what I mean
?
Not really. What do you mean by articles?
Wait a min. I will show you
I can't find the image
It basically shows how one article leads to another
like there are links to other articles right?
those
what do you mean by article here?
aren't you working with Yahoo Finance, tho?
Yeah, this is my future project
based on your snippet above.
I plan on building it
Not yet though
How can I make my code transfer all the data to csv file?
Any idea. You see my code above
What should I edit so that it transfers it to a CSV file?
transfer the stock data you scraped into a pandas dataframe, then just use the method .to_csv('file.csv')
Ahh. I see
I just need to convert it to a table and then use .to_csv('file.csv')
Right?
I just need to convert it to a table and then use .to_csv('file.csv')
Using pandas
@remote valley i ended up figuring it out; i didn't realize gridsearch had the ability to return test scores and training scores; this way i don't have to use the validation_curve function and can just directly plot it using matplotlib
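A sketch of what that could look like, assuming scikit-learn's `GridSearchCV` with `return_train_score=True`. The iris data, the `SVC` estimator, and the grid values are placeholders, not the original assignment.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# two hyperparameters at once; every combination becomes one point on the x axis
grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1]}
search = GridSearchCV(SVC(), grid, cv=5, return_train_score=True)
search.fit(X, y)

results = search.cv_results_
labels = [str(p) for p in results['params']]
train_scores = results['mean_train_score']
valid_scores = results['mean_test_score']

for label, tr, va in zip(labels, train_scores, valid_scores):
    print(f'{label}: train={tr:.3f} validation={va:.3f}')
```

The `labels`, `train_scores`, and `valid_scores` sequences can then be handed to matplotlib to draw the two curves against the parameter combinations.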
@marsh chasm nice. thanks for telling me. gridsearch does the validation curve stuff for the whole set of parameters and plots with correct axis labels for the parameter set? sounds way easier.
Anybody who have experience using multiindex on Pandas?
https://colab.research.google.com/drive/1tn4l65t47I6G1MV5fcZ0Q-W91Rtwns9e?usp=sharing https://hastebin.com/eyopocaxir.csharp Any ideas what's going wrong?
I'm following this tutorial https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/text_generation.ipynb but changing as much as possible to try and show to myself I actually understand what is going on
Hello guys what does it mean when my cross validation score is less than my model accuracy score?
@grave path could you give more information on what you mean by cv score vs accuracy score? Is your cv using a different metric?
hello nine so I did split my data first and do the scaling for them my accuracy was 84% then I tried to apply cross validation on the scaled model and the accuracy was 79%
print("{:.3f}".format(scores.mean()))
@heady hatch
@grave path So you're saying the scores for the scaled model was lower than the score of the model unscaled?
Could you clarify what you mean by your accuracy was 84%?
Model score = 74%
Model Scaled = 84%
Cross Validation for model scaled = 79%
because I have training and testing data
Right so what's model score and what's model scaled?
Model themselves don't have score unless you're talking about oob.
Model = LogisticRegression()
So
Just to clarify.
You're asking why your training score was higher than your cv score?
isn't cross validation supposed to split the dataset and try to find the best accuracy
Yes
Okay I guess here, give me this information.
- cv score of model not scaled
- cv score of the model scaled.
cv isn't trying to find the best accuracy.
cv is trying to see how your model will generalize on a validation set.
cv score of model not scaled:71%
cv score of the model scaled:79%
Okay so what I'm seeing here is your model is generalizing better on the validation set.
Meaning that it's capturing more signal when the data is scaled.
It doesn't really make sense to compare one model's score on training data to another model's cv score.
Oftentimes in classical ml, if the training score is noticeably higher than the validation score, that could mean your model is overfitting.
what do you mean when you say cv is trying to see how your model will generalize on a validation set?
So do you know how cross validation works, especially the default, which is kfold?
kfold is the number of splits right?
Mhm!
might have misunderstood what it does then
but is 79% considered good when cross validating ?
yes so we split the dataset and each time we change the training and testing split and then the cv will be the mean of these right?
Right.
oh I think you cleared something for me then
It splits the data into k sections; with 5 folds, that's 5 sections.
It treats one of them as the validation set. Your model trains on the rest and is tested on the validation set, and this repeats until each section has been the validation set once.
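A minimal sketch of the fold rotation described above, in plain Python. The helper name `k_fold_indices` is hypothetical and only illustrates the idea; libraries such as scikit-learn can add shuffling and stratification on top of this.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

The cv score is then the mean of the k validation scores, one per fold.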
So this is an overall accuracy since it tests more possible outcomes and the scaled has nothing to do with it
I should compare it on the split I did manually right
Hm could you clarify on what you meant by scaled has nothing to do with it?
It seems like scaling does better, since your data respond well to scaling.
Isn't that what your cv showed? 71 vs 79.
I'm not sure. What do you mean by interpreted in a bad way?
I was thinking that cv must be higher than Scaled data or something is wrong
You can think of scaled vs not scaled model as two different models.
Because the cv score is higher for the model that scales the data, that's a sign that your data provides more signal when it's scaled.
Or that your model isn't able to capture the signal properly when it's not scaled.
cv is just a scoring method to tell you how your model generalizes on data it wasn't trained on.
Because you don't want to just train it on the training set and test it on the training set.
Yeah, I see your point: since cv eventually trains and tests on everything, it will give you the generalized score
since testing on the training data wouldn't make sense, as it's not new data
Does anyone here also know Abstract graph transformation, and deriving binary logic from it?
Thanks a lot for the help Nine
Yea no problem, glad to be of help.
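To make the comparison from this thread concrete, here is a hedged sketch that scores the unscaled and scaled logistic regression with the same 5-fold cv, so the two numbers are directly comparable. The synthetic data from `make_classification` is an assumption standing in for the original dataset, which was not shared.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the data discussed in the chat).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

unscaled = LogisticRegression(max_iter=1000)
# Putting the scaler in a pipeline refits it on each training fold only,
# so the validation fold never influences the scaling.
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores_unscaled = cross_val_score(unscaled, X, y, cv=5)
scores_scaled = cross_val_score(scaled, X, y, cv=5)

print("unscaled cv: {:.3f}".format(scores_unscaled.mean()))
print("scaled cv:   {:.3f}".format(scores_scaled.mean()))
```

On this synthetic data the two means may end up close together; the point is only that cv scores get compared with cv scores, not with a score from a single manual split.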
how do I do this in jupyter? turn text black? I do not see myself copying and pasting cells to make them raw
uh
I have to run cell, nvm
thanks guys, you are awesome
Hi, can I ask something here about matrix operations in an iteration?
Try it, and if people don't answer, try somewhere else.
okay
thanks
I'm having this error:
No loop matching the specified signature and casting was found for ufunc inv
J=sp.lambdify([x, y],[dp1,dp2], "numpy")
f=sp.lambdify([x, y],[dp1,dp2], "numpy")
v = v0
print(v)
for i in range(20):
    Jr=np.array(J(v[0], v[1]))
    fi=np.array(f(v[0],v[1]))
    J_inv=np.linalg.inv(Jr)
    #print(J_inv)
    print("")
    v = v - J_inv @ fi
    print("v")
    print(v)
    print("")
return
basically this is what I want to iterate
J and f are two matrices (J comes from the derivatives), but the thing is I don't know how to make the iteration work without errors
What kind of errors are you getting?
TypeError: No loop matching the specified signature and casting was found for ufunc inv
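One possible direction, sketched under assumptions: this TypeError usually shows up when the array passed to `np.linalg.inv` has `dtype=object`, which can happen when SymPy objects end up inside it. The system `F` and the starting guess below are hypothetical stand-ins for `dp1`, `dp2` and `v0`, which were not shown in full; the Jacobian is built with SymPy's `jacobian` rather than lambdifying the same expressions twice.

```python
import numpy as np
import sympy as sp

x, y = sp.symbols("x y")

# Hypothetical system F(x, y) = 0, standing in for dp1 and dp2.
F = sp.Matrix([x**2 + y**2 - 4, x * y - 1])
J_sym = F.jacobian([x, y])  # 2x2 matrix of partial derivatives

f = sp.lambdify([x, y], F, "numpy")
J = sp.lambdify([x, y], J_sym, "numpy")

v = np.array([2.0, 0.5], dtype=float)  # assumed starting guess
for i in range(20):
    # Force plain float arrays so no SymPy objects reach np.linalg.
    Jr = np.asarray(J(v[0], v[1]), dtype=float)
    fi = np.asarray(f(v[0], v[1]), dtype=float).ravel()
    # Solving Jr @ step = fi avoids forming the inverse explicitly.
    step = np.linalg.solve(Jr, fi)
    v = v - step
    if np.linalg.norm(step) < 1e-12:
        break

print(v)
```

If the cast with `dtype=float` itself raises, that points to symbolic values still being present in `J(...)` or in `v0`, which would be the thing to track down first.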
