# Building a web scraper
@compact halo
no worries at all
do you need the jobs to persist after a restart
like if you turn the computer on and off again
Not really, I will keep track of the larger groups of requests in a database
thats a yes
I can manage that separately, I know how to segment the list of required requests
well you can make a persistent job queue in postgres relatively trivially
just so you are aware
The problem is that this code will require probably upwards of 10k database calls per second, so will need to be in-memory
what are you scraping and why?
I'm scraping twitter to make a crypto predictor
I know you might say it won't work, which may be the case.
I have written a web scraper that can make 800 successful requests a second using public proxies
okay let's say I'm using their API, the question still stands
Can we look at this narrow slice of the problem?
okay
2k tasks is nothing
On fail (state = QUEUED, last_updated = NOW) added back to the front of the queue
This doesn't make sense - you will get infinite retries
u think u would get performance issues and that they would be solved by having multiple queues, but thats not the case
So a task fails from the front of the queue, its state is set to queued, and last updated to now
and one persistently failing task will screw you
then it is added back to the front of the queue
you want to put it in the back of the queue
not the front
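A minimal sketch of the point being made here, assuming a plain `ArrayDeque` of tasks: a failed task goes to the back of the queue so it cannot block the items behind it, and an attempt counter stops infinite retries. The `Task` class, the `MAX_ATTEMPTS` limit of 3, and the `attempt` predicate are hypothetical names for illustration, not something from the thread.

```java
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

class RetryQueueSketch {
    static final int MAX_ATTEMPTS = 3; // assumed limit, pick whatever fits

    static final class Task {
        final String url;
        int attempts = 0;
        Task(String url) { this.url = url; }
    }

    /** Drains the queue and returns the tasks that were given up on. */
    static List<Task> drain(Deque<Task> queue, Predicate<Task> attempt) {
        List<Task> abandoned = new ArrayList<>();
        Task task;
        while ((task = queue.pollFirst()) != null) {   // take from the front
            task.attempts++;
            if (attempt.test(task)) {
                continue;                              // success, nothing to requeue
            }
            if (task.attempts < MAX_ATTEMPTS) {
                queue.addLast(task);                   // retry later, at the back
            } else {
                abandoned.add(task);                   // stop retrying
            }
        }
        return abandoned;
    }
}
```

With this shape a persistently failing task is retried only after everything else currently queued has had its turn, and it is dropped after the third failure.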
and yeah - 10k database calls a second is fine
Well I'm treating the start of the queue as the place where items are taken
maybe will make the gears squeak on a small instance, but not really a problem
So stick to one queue?
multiple queues or one won't affect performance
(they are performant, but thats not why you use them)
it only affects ur logic
I'm okay with having a single queue, but I take your point in readability
and sanity
and a 2k queue is nothing. u can have queues with billions of items
Well I need to loop over the in-progress tasks to check whether they've been forgotten or have died
So I need an efficient memory structure that allows me to loop over each item, remove successful/failed ones
or, do it in postgres and make an index
I've tried it already, doesn't seem to work well
or better - you have job and archived_job tables
iterating a 2k queue every now and then is literally nothing
this queue is iterated in 1 millisecond
If I'm going with two, I need one data structure for the task queue (concurrent linked deque?) and another concurrent linked list?
but yeah, just put it in an array list. 2k^2 is just 4 million
I'm only planning on using it to store the data, it will really slow things down with I/O
a single list?
or ArrayDeque, queue-wise
Okay, so ConcurrentLinkedList, what is its time complexity for removing items?
even if you scan over it every time you want to do something thats fast enough
I was originally planning on using a deque
that makes sense, but the items will have to be iterated over to grab a new request, if only one queue is used
it will have to loop over each item and select the first QUEUED item
again. a 2k queue is iterated, modified, added, removed,... in a millisecond
ur overthinking this
It may be closer to 5000
yeah thats why i pointed out that 2k^2 isn't that big. O(n^2) where n=2k is nothing
thats still a millisecond
If the code makes 800+ successful requests a second, it may starve
who cares
And I can remove an item from the middle of a deque linked list?
with a .remove(item)
perfect!
arraydeque supports any modification, yes
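For reference, a small sketch of removing from the middle of an `ArrayDeque`. `remove(Object)` drops the first matching element and `removeIf` drops every match in one pass. Both are linear scans, which is the cost being called negligible at a few thousand items. The class name and the string "tasks" are placeholders.

```java
import java.util.ArrayDeque;
import java.util.Deque;

class DequeRemoveSketch {
    static Deque<String> demo() {
        Deque<String> tasks = new ArrayDeque<>();
        tasks.add("a");
        tasks.add("b");
        tasks.add("c");
        tasks.add("d");

        tasks.remove("b");                  // removes the first occurrence, wherever it sits
        tasks.removeIf(t -> t.equals("d")); // scans once and drops every match

        return tasks;                       // remaining order: a, c
    }
}
```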
any other suggestions?
yes
for efficiency?
if u get performance issues, its not due to ur choice of datastructure
do something more worth your human life than crypto speculation
but due to ur logic
well, stocks too
I have two ml friends who want to build models
and please use the twitter api. otherwise, ull get banned
master baters more like it
both
damn
sure, thank you
my point is though, that if 2000 calls are being made per second, that adds an additional 2 seconds per original
Should I actually remove the item instead?
you are sending requests over the internet. you have time to make a sandwich in the margin of error
To be honest, the code is currently doing 800 success ps / 8000 failed ps, this would effectively halve
even more considering the failed calls
I need to be efficient, so maybe two data structures makes more sense?
one for queued, one for other
the amount of datastructures and the choice won't affect ur performance the slightest
u could make the worst choice here and it would run at the exact same speed as with the best choice
its more about ur logic
than the choice of container
the code thats calling a lot, (ie. request new task) could just pop from one queue (queued)
that way the background code could do cleanup
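One way the two-structure idea could look, as a rough sketch: workers pop from a queue of waiting tasks, and a background sweep walks a separate in-progress map to requeue anything that has been running too long. Tasks are plain strings here, and the class name, method names, and timeout handling are all assumptions for illustration.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

class TwoStructureSketch {
    final Queue<String> queued = new ConcurrentLinkedQueue<>();
    final Map<String, Long> inProgress = new ConcurrentHashMap<>(); // task -> start time in ms

    /** Hot path: returns the next waiting task, or null if there is none. */
    String take(long nowMillis) {
        String task = queued.poll();
        if (task != null) {
            inProgress.put(task, nowMillis);
        }
        return task;
    }

    /** Called by a worker when a task finished, successfully or not. */
    void complete(String task) {
        inProgress.remove(task);
    }

    /** Background cleanup: requeue tasks that have been in progress longer than the timeout. */
    int requeueStale(long nowMillis, long timeoutMillis) {
        int requeued = 0;
        for (Map.Entry<String, Long> e : inProgress.entrySet()) {
            boolean stale = nowMillis - e.getValue() > timeoutMillis;
            // remove(key, value) only succeeds if no other thread touched the entry meanwhile
            if (stale && inProgress.remove(e.getKey(), e.getValue())) {
                queued.add(e.getKey()); // goes to the back of the queue
                requeued++;
            }
        }
        return requeued;
    }
}
```

As the replies note, this split is about keeping the logic readable rather than about speed at this size.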
just get started. get it working. then check if there is an issue, measure it with a profiler and then fix the bottleneck
if there's any
Okay, will try, but want to get this right the first time
So really appreciate any suggestions
i dont really get what u think is so complex here. it's just a simple task queue
with futures
that's not up to u to decide
use modern multithreading and java will make it as fast as possible out of the box
i'm not just going for fast, i'm going for as fast as possible
which is what ive said
create a thread pool using executorservice
then shoot 2 billion tasks at it
and watch them being done as fast as possible
easy
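A minimal sketch of that suggestion, assuming a fixed-size pool and trivial stand-in tasks (the real work would be the HTTP calls). `ExecutorSketch`, `runAll`, and the task body are made up for the example. The pool owns the queue and the threads, so there is nothing to hand-roll.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ExecutorSketch {
    /** Submits taskCount small tasks to a pool and sums their results. */
    static int runAll(int taskCount, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                final int id = i;
                tasks.add(() -> id % 2);            // stand-in for real work
            }
            int sum = 0;
            // invokeAll queues everything and returns once all tasks are done
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                sum += f.get();
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();                        // let the worker threads exit
        }
    }
}
```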
Sorry for all the questions
is it faster to:
- loop through the array until a viable task is found using an object property
or to
- remove the task from the current queue, add the task to a separate queue and start processing
ur wasting time by optimizing the wrong end
this part is fast anyways
cause ur data is tiny
ur not losing time here
as a side note, it's already making 800 requests a second, I just don't want it to drop too far
its optimised imo
how u work with ur datastructure costs u less time than writing a few if-else that cause branch prediction failures
since ur container is super small
as in the jvm heap size?
2k is nothing
ur toaster can iterate and remove and add 2k items in milliseconds already
I mean, it just needs to always be bigger than the amount of currently processed items
I know, but that 2k iteration needs to be done 800 times a second
in the time that u process a single of ur requests, u can add/remove/whatever trillions of items
per second
but again
why dont u use an ExecutorService
it will be much faster than any of ur selfmade bullshit
So I'm using one producer thread, multiple consumer threads and concurrency
i can tell u from experience that executorservice is much faster than any self made queue/thread/whatever combination
ive replaced a complex and optimized c++ program that processed TB of data with a 20 line java program that did the same with executorservice. and it was 10x faster
Request request1 = new RequestBuilder("GET")
        .setUrl("http://httpforever.com")
        .setProxyServer(new ProxyServer.Builder(proxy.getIP(), proxy.getPort()).build())
        .build();
c.executeRequest(request1, new handler(c, proxy, this));
org.asynchttpclient.AsyncHttpClient
public abstract <T> org.asynchttpclient.ListenableFuture<T> executeRequest( org.asynchttpclient.Request request,
org.asynchttpclient.AsyncHandler<T> handler )
yeah that looks bad
which part 
then do anything else u need to do on that future
org.asynchttpclient
u dont need manual threads at all
nor any queues
do proper future based coding
future.thenApply(...).thenApply(...)...
c.executeRequest(request1, new handler(c, proxy, this)).toCompletableFuture()
It's asynchronous
then spawn all ur thousands of future/requests immediately
and java will execute it as fast as possible automatically. ull reach max throughput
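Roughly what the future-based style could look like, using only the JDK. `fetch` is a hypothetical stand-in for something like `executeRequest(...).toCompletableFuture()`, and it fabricates a body instead of touching the network. All futures are created up front, each one is chained with `thenApply`, and the results are collected once everything has completed.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

class FutureChainSketch {
    // Hypothetical stand-in for an async HTTP call; no network involved
    static CompletableFuture<String> fetch(String url) {
        return CompletableFuture.supplyAsync(() -> "body of " + url);
    }

    static List<Integer> bodyLengths(List<String> urls) {
        // Start every request immediately and chain the follow-up steps
        List<CompletableFuture<Integer>> futures = urls.stream()
                .map(url -> fetch(url)
                        .thenApply(String::trim)
                        .thenApply(String::length))
                .collect(Collectors.toList());

        // Wait for all of them, then read the results in submission order
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }
}
```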
That's pretty much what I'm doing
but ur talking about supervising ur tasks
But I had resource exhaustion, so had to add a maximum concurrency value
manually
that's why it's manual
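A cap on in-flight requests does not have to mean hand-written queues and threads. One possible sketch, assuming the calls return a `CompletableFuture`: a `Semaphore` hands out a fixed number of slots and each future gives its slot back when it completes. `BoundedConcurrencySketch` and `submit` are invented names, and blocking the submitting thread is a design choice made for this sketch, not something the thread prescribes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

class BoundedConcurrencySketch {
    private final Semaphore permits;

    BoundedConcurrencySketch(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    /** Blocks the caller until a slot is free, then starts the async call. */
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> call) {
        permits.acquireUninterruptibly();
        try {
            // Release the slot when the call finishes, whether it succeeded or failed
            return call.get().whenComplete((result, error) -> permits.release());
        } catch (RuntimeException ex) {
            permits.release();   // the call never started, so give the slot back
            throw ex;
        }
    }
}
```

The producer loop then just calls `submit` for every request and naturally slows down once the limit is reached.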
anyways. either u don't understand my suggestions, or u dont want to use them for some reason
i think ur wasting time by optimizing the wrong end. and i think it's much easier to get max performance by approaching it differently
also, if u have working code, run a profiler on it
I have already
@mint viper would you mind taking a quick look at the code?
I know you know way more than me
id suggest u just post it and all helpers can have a look
ur cpu load is nonexistent on that measurement
are u planning to execute on a much worse hardware?
I think it's because everything is on outbound I/O
uve talked about performance. but as emccue and i suspected, ur problem is not cpu at all
its ur logic
ur cpu is chilling
so all the talking we did was wasted time
let the profiler look at ur io activity
see if that's at max throughput all the time
if it is, ur app is the fastest it can be
if not, u still have wasted potential
It maxes out my internet connection
On my home pc it's network bound
and on the server cpu/memory bound
all the time? or just sometimes
if its at a constant 100% of ur IO, then ur app can't be made faster
=== Tweet Scraper ===
Requests : 1314
Success/s: 267.0
Success Total/s: 159.0
Failed/s: 38.0
Available Proxies: 547
Running for: 4 seconds,
on I/O?