#Building a web scraper

mighty heathBOT
#

<@&987246841693360200> please have a look, thanks.

mighty heathBOT
#

While you are waiting for help, here are some tips to improve your experience:

Code is much easier to read if posted with syntax highlighting and proper formatting.

If nobody is calling back, that usually means that your question was not well asked and hence nobody feels confident enough to answer. Try to use your time to elaborate, provide details, context, more code, examples and maybe some screenshots. With enough info, someone knows the answer for sure.

Don't forget to close your thread using the command </help-thread close:1027500463647621170> when your question has been answered, thanks.

slate sundial
#

@compact halo

compact halo
#

sorry it is actually late

#

okay question

slate sundial
#

no worries at all

compact halo
#

do you need the jobs to persist after a restart

#

like if you turn the computer on and off again

slate sundial
#

Not really, I will keep track of the larger groups of requests in a database

compact halo
#

thats a yes

slate sundial
#

I can manage that separately, I know how to segment the list of required requests

compact halo
#

well you can make a persistent job queue in postgres relatively trivially

#

just so you are aware

slate sundial
#

The problem is that this code will require probably upwards of 10k database calls per second, so will need to be in-memory

compact halo
#

what are you scraping and why?

slate sundial
#

I'm scraping twitter to make a crypto predictor

#

I know you might say it won't work, which may be the case.

#

I have written a web scraper that can make 800 successful requests a second using public proxies

mint viper
#

scraping Twitter is against their ToS

#

u must use their api instead

slate sundial
#

okay let's say I'm using their API, the question still stands

#

Can we look at this narrow slice of the problem?

compact halo
#

okay

mint viper
#

2k tasks is nothing

compact halo
#

On fail (state = QUEUED, last_updated = NOW) added back to the front of the queue
This doesn't make sense - you will get infinite retries

mint viper
#

u think u would get performance issues and that they would be solved by having multiple queues, but thats not the case

slate sundial
#

So a task fails from the front of the queue, its state is set to queued, and last updated to now

compact halo
#

and one persistently failing task will screw you

slate sundial
#

then it is added back to the front of the queue

compact halo
#

you want to put it in the back of the queue

#

not the front
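
A minimal sketch of the requeue advice above, assuming a hypothetical `RetryTask` type and an attempt limit of 3 (neither is from the thread). A failed task goes to the back of the queue, and a task that keeps failing is eventually dropped instead of blocking everything behind it:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical task type: an id plus a count of attempts made so far.
class RetryTask {
    final String id;
    int attempts = 0;
    RetryTask(String id) { this.id = id; }
}

class RetryQueueSketch {
    static final int MAX_ATTEMPTS = 3; // assumed limit, not from the thread

    // Runs every task. A failed task goes to the BACK of the queue and is
    // added to the returned list once it has used up MAX_ATTEMPTS.
    static List<String> drain(Deque<RetryTask> queue, Predicate<RetryTask> attempt) {
        List<String> givenUp = new ArrayList<>();
        while (!queue.isEmpty()) {
            RetryTask task = queue.pollFirst();   // take from the front
            task.attempts++;
            if (attempt.test(task)) {
                continue;                          // success: task is done
            }
            if (task.attempts < MAX_ATTEMPTS) {
                queue.addLast(task);               // retry later, behind the others
            } else {
                givenUp.add(task.id);              // stop retrying
            }
        }
        return givenUp;
    }

    public static void main(String[] args) {
        Deque<RetryTask> queue = new ArrayDeque<>();
        queue.add(new RetryTask("a"));
        queue.add(new RetryTask("always-fails"));
        queue.add(new RetryTask("b"));
        // Stand-in for a real request: only "always-fails" ever fails.
        List<String> givenUp = drain(queue, t -> !t.id.equals("always-fails"));
        System.out.println(givenUp);
    }
}
```

With requeue-at-front and no limit, the loop above would never get past the failing task.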

#

and yeah - 10k database calls a second is fine

slate sundial
#

Well I'm treating the start of the queue as the place where items are taken

compact halo
#

maybe will make the gears squeak on a small instance, but not really a problem

compact halo
#

use queues for logical separation

#

not for performance

mint viper
#

multiple queues or one won't affect performance

compact halo
#

(they are performant, but thats not why you use them)

mint viper
#

it only affects ur logic

slate sundial
#

and sanity

mint viper
#

and a 2k queue is nothing. u can have queues with billions of items

slate sundial
#

Well I need to loop over the in-progress tasks to check whether they've been forgotten or are dead

#

So I need an efficient memory structure that allows me to loop over each item, remove successful/failed ones
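
One possible shape for that sweep, as a sketch only: a plain `ArrayList` plus `removeIf`, which drops every finished task in a single pass. The `Task` and `State` types are made up for illustration, and the list is not thread-safe on its own, so a real version would need a lock or a concurrent collection around it.

```java
import java.util.ArrayList;
import java.util.List;

class SweepSketch {
    enum State { QUEUED, IN_PROGRESS, SUCCEEDED, FAILED }

    // Hypothetical task record; the real one would carry the request details.
    static final class Task {
        final int id;
        State state;
        Task(int id, State state) { this.id = id; this.state = state; }
    }

    // Removes finished tasks in one pass and returns how many were removed.
    static int sweep(List<Task> tasks) {
        int before = tasks.size();
        tasks.removeIf(t -> t.state == State.SUCCEEDED || t.state == State.FAILED);
        return before - tasks.size();
    }

    public static void main(String[] args) {
        List<Task> tasks = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            // Every fourth task is marked finished, purely as test data.
            tasks.add(new Task(i, i % 4 == 0 ? State.SUCCEEDED : State.IN_PROGRESS));
        }
        System.out.println("removed " + sweep(tasks) + ", left " + tasks.size());
    }
}
```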

compact halo
#

or, do it in postgres and make an index

slate sundial
#

I've tried it already, doesn't seem to work well

compact halo
#

or better - you have job and archived_job tables

mint viper
#

iterating a 2k queue every now and then is literally nothing

#

this queue is iterated in 1 millisecond

slate sundial
#

If I'm going with two, I need one data structure for the task queue (concurrent linked deque?) and another concurrent linked list?

compact halo
#

but yeah, just put it in an array list. 2k^2 is just 4 million

compact halo
#

sure, why not

#

like he said, 2k is nothing

mint viper
#

or ArrayDeque, queue-wise

slate sundial
#

Okay, so ConcurrentLinkedList, what is its time complexity for removing items?

compact halo
#

even if you scan over it every time you want to do something thats fast enough

slate sundial
#

I was originally planning on using a deque

#

that makes sense, but the items will have to be iterated over to grab a new request, if only one queue is used

#

it will have to loop over each item and select the first QUEUED item
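
For scale, the single-list scan described here could look like the sketch below. The `Task` and `State` names are hypothetical. With the figures mentioned in the thread, 2,000 tasks scanned 800 times a second is at most 1.6 million state checks per second, which is a small amount of work for one core.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class ClaimSketch {
    enum State { QUEUED, IN_PROGRESS }

    // Hypothetical task record with a mutable state field.
    static final class Task {
        final int id;
        State state = State.QUEUED;
        Task(int id) { this.id = id; }
    }

    // Linear scan: returns the first QUEUED task and marks it IN_PROGRESS.
    // synchronized so that two worker threads cannot claim the same task.
    static synchronized Optional<Task> claimNext(List<Task> tasks) {
        for (Task t : tasks) {
            if (t.state == State.QUEUED) {
                t.state = State.IN_PROGRESS;
                return Optional.of(t);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Task> tasks = new ArrayList<>();
        for (int i = 0; i < 2000; i++) {
            tasks.add(new Task(i));
        }
        System.out.println(claimNext(tasks).map(t -> t.id).orElse(-1));
    }
}
```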

mint viper
#

again. a 2k queue is iterated, modified, added, removed,... in a millisecond

#

ur overthinking this

slate sundial
#

It may be closer to 5000

compact halo
#

yeah thats why i pointed out that 2k^2 isn't that big. O(n^2) where n=2k is nothing

mint viper
#

thats still a millisecond

slate sundial
#

If the code makes 800+ successful requests a second, it may starve

mint viper
#

it starts to matter when u approach 2 million

#

not 2k

slate sundial
#

Oh okay, that's completely fine then

#

But what data structure?

compact halo
#

who cares

mint viper
#

any

#

arraydeque is "the best" queue. so maybe start with that

slate sundial
#

And I can remove an item from the middle of a deque linked list?

#

with a .remove(item)

#

perfect!

mint viper
#

arraydeque supports any modification, yes
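
A small illustration of removing from the middle with `java.util.ArrayDeque`. `remove(Object)` deletes the first matching element and scans linearly to find it. Note that `ArrayDeque` is not thread-safe and rejects `null` elements, so with several worker threads it would need external locking.

```java
import java.util.ArrayDeque;

class DequeRemoveSketch {
    // Builds a small deque, removes one element from the middle, returns the deque.
    static ArrayDeque<String> removeFromMiddle() {
        ArrayDeque<String> queue = new ArrayDeque<>();
        queue.add("a");
        queue.add("b");
        queue.add("c");
        // remove(Object) returns true if an element was found and removed.
        boolean removed = queue.remove("b");
        if (!removed) {
            throw new IllegalStateException("element not found");
        }
        return queue;
    }

    public static void main(String[] args) {
        System.out.println(removeFromMiddle()); // prints [a, c]
    }
}
```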

compact halo
#

yes

slate sundial
#

for efficiency?

mint viper
#

if u get performance issues, its not due to ur choice of datastructure

compact halo
#

do something more worth your human life than crypto speculation

mint viper
#

but due to ur logic

slate sundial
#

I have two ml friends who want to build models

mint viper
#

and please use the twitter api. otherwise, ull get banned

slate sundial
#

both with masters

#

yep ofc

#

from here? or twitter?

mint viper
#

both

slate sundial
#

Should I actually remove the item instead?

compact halo
#

you are sending requests over the internet. you have time to make a sandwich in the margin of error

slate sundial
#

even more considering the failed calls

#

I need to be efficient, so maybe two data structures makes more sense?

#

one for queued, one for other

mint viper
#

the number of datastructures and the choice won't affect ur performance in the slightest

slate sundial
#

maybe one queued, one processing

#

I'm more referring to the implementation

mint viper
#

u could make the worst choice here and it would run at the exact same speed as with the best choice

#

its more about ur logic

#

than the choice of container

slate sundial
#

the code thats calling a lot, (ie. request new task) could just pop from one queue (queued)

#

that way the background code could do cleanup
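
A rough sketch of that two-structure idea, under assumptions: task ids are plain strings, timestamps are passed in as milliseconds, and the class and method names are invented for illustration. Workers only touch `claim` and `complete`, and a background job calls `requeueStale` now and then.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

class TwoStructureSketch {
    private final Queue<String> queued = new ConcurrentLinkedQueue<>();
    // task id -> time (ms) at which a worker claimed it
    private final Map<String, Long> inProgress = new ConcurrentHashMap<>();

    void submit(String taskId) {
        queued.add(taskId);
    }

    // Hot path for workers: pop one id and record when it was claimed.
    // Returns null when nothing is queued.
    String claim(long nowMillis) {
        String id = queued.poll();
        if (id != null) {
            inProgress.put(id, nowMillis);
        }
        return id;
    }

    void complete(String taskId) {
        inProgress.remove(taskId);
    }

    // Background cleanup: anything claimed longer ago than timeoutMillis is
    // treated as forgotten and put back at the end of the queue.
    int requeueStale(long nowMillis, long timeoutMillis) {
        int requeued = 0;
        for (Map.Entry<String, Long> e : inProgress.entrySet()) {
            boolean stale = nowMillis - e.getValue() > timeoutMillis;
            // remove(key, value) only succeeds if the entry is unchanged.
            if (stale && inProgress.remove(e.getKey(), e.getValue())) {
                queued.add(e.getKey());
                requeued++;
            }
        }
        return requeued;
    }
}
```

Whether this beats a single list is something only a measurement can tell; the split mainly keeps the worker path and the cleanup path from stepping on each other.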

mint viper
#

just get started. get it working. then check if there is an issue, measure it with a profiler and then fix the bottleneck

#

if there's any

slate sundial
#

Okay, will try, but want to get this right the first time

#

So really appreciate any suggestions

mint viper
#

i dont really get what u think is so complex here. it's just a simple task queue

#

with futures

slate sundial
#

well, it's how many calls the threads will be making

#

and making it time efficient

mint viper
#

that's not up to u to decide

#

use modern multithreading and java will make it as fast as possible out of the box

slate sundial
#

i'm not just going for fast, i'm going for as fast as possible

mint viper
#

which is what ive said

#

create a thread pool using executorservice

#

then shoot 2 billion tasks at it

#

and watch them being done as fast as possible

#

easy
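
In code, the suggestion is roughly the following sketch. The pool size of 8 and the trivial `n % 10` task are placeholders; a real task would perform the request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolSketch {
    // Submits taskCount trivial tasks to a fixed pool and sums their results.
    static long runAll(int taskCount, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                final int n = i;
                tasks.add(() -> n % 10); // stand-in for real work
            }
            long sum = 0;
            // invokeAll blocks until every task has finished.
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                sum += f.get();
            }
            return sum;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(2000, 8)); // prints 9000
    }
}
```

The executor keeps its own internal work queue, so no hand-written queue is needed for the dispatching part.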

slate sundial
#

Sorry for all the questions

#

is it faster to:

  • loop through the array until a viable task is found using an object property

or to

  • remove the task from the current queue, add the task to a separate queue and start processing
mint viper
#

ur wasting time by optimizing the wrong end

#

this part is fast anyways

#

cause ur data is tiny

slate sundial
#

The other end is already optimised

#

what do you mean by that?

mint viper
#

ur not losing time here

slate sundial
#

as a side note, it's already making 800 requests a second, I just don't want it to drop too far

#

its optimised imo

mint viper
#

how u work with ur datastructure costs u less time than writing a few if-else that cause branch prediction failures

#

since ur container is super small

slate sundial
#

as in the jvm heap size?

mint viper
#

2k is nothing

#

ur toaster can iterate and remove and add 2k items in milliseconds already

slate sundial
#

I mean, it just needs to always be bigger than the amount of currently processed items

#

I know, but that 2k iteration needs to be done 800 times a second

mint viper
#

in the time that u process a single one of ur requests, u can add/remove/whatever trillions of items

#

per second

#

but again

#

why dont u use an executorservice

#

it will be much faster than any of ur selfmade bullshit

slate sundial
#

So I'm using one producer thread, multiple consumer threads and concurrency

mint viper
#

i can tell u from experience that executorservice is much faster than any self made queue/thread/whatever combination

slate sundial
#

So I'm running multiple threads that make async requests

#

Let me find a snippet

mint viper
#

ive replaced a complex and optimized c++ program that processed TB of data with a 20 line java program that did the same with executorservice. and it was 10x faster

slate sundial
#

Request request1 = new RequestBuilder("GET")
        .setUrl("http://httpforever.com")
        .setProxyServer(new ProxyServer.Builder(proxy.getIP(), proxy.getPort()).build())
        .build();

c.executeRequest(request1, new handler(c, proxy, this));

slate sundial
#

org.asynchttpclient.AsyncHttpClient
public abstract <T> org.asynchttpclient.ListenableFuture<T> executeRequest(org.asynchttpclient.Request request, org.asynchttpclient.AsyncHandler<T> handler)

mint viper
#

yeah that looks bad

slate sundial
#

which part LOL

mint viper
#

use HttpClient

#

sendAsync

#

it returns a future

slate sundial
#

Well, it's async

#

not the default client

mint viper
#

then do anything else u need to do on that future

slate sundial
#

org.asynchttpclient

mint viper
#

u dont need manual threads at all

#

nor any queues

#

do proper future based coding

#

future.thenApply(...).thenApply(...)...
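
A sketch of what that chaining could look like with the JDK's `java.net.http.HttpClient` (Java 11+), whose `sendAsync` returns a `CompletableFuture`. The class and method names are invented for illustration, and the trim/length steps are arbitrary stand-ins for real processing. One caveat, as far as the JDK client goes: a proxy is configured on the client builder rather than per request, so per-request proxies as in the snippet above would need one client per proxy or a custom `ProxySelector`.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.concurrent.CompletableFuture;

class FutureChainSketch {
    // Pure chaining step, kept separate so it can be tried without a network.
    static CompletableFuture<Integer> bodyLength(CompletableFuture<String> body) {
        return body.thenApply(String::trim).thenApply(String::length);
    }

    // Sends a GET asynchronously and chains further steps on the returned future.
    static CompletableFuture<Integer> fetchLength(HttpClient client, String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        CompletableFuture<String> body = client
                .sendAsync(request, HttpResponse.BodyHandlers.ofString())
                .thenApply(HttpResponse::body);
        return bodyLength(body);
    }

    public static void main(String[] args) {
        // Offline demonstration of the chain; fetchLength(HttpClient.newHttpClient(), url)
        // would run the same steps on a real response.
        int n = bodyLength(CompletableFuture.completedFuture("  hello ")).join();
        System.out.println(n); // prints 5
    }
}
```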

slate sundial
#

c.executeRequest(request1, new handler(c, proxy, this)).toCompletableFuture()

#

It's asynchronous

mint viper
#

then spawn all ur thousands of future/requests immediately

#

and java will execute it as fast as possible automatically. ull reach max throughput

slate sundial
#

That's pretty much what I'm doing

mint viper
#

but ur talking about supervising ur tasks

slate sundial
#

But I had resource exhaustion, so had to add a maximum concurrency value
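
One common way to cap in-flight async calls without a hand-written queue is a `Semaphore`. The sketch below is only an illustration: the class name, the limit of 5, and the `supplyAsync` stand-in for an HTTP call are all assumptions. A permit is taken before each call starts and given back when its future completes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

class BoundedAsyncSketch {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    BoundedAsyncSketch(int maxConcurrent) {
        permits = new Semaphore(maxConcurrent);
    }

    // Blocks the caller until a permit is free, starts the async call, and
    // releases the permit when the future completes (normally or with an error).
    <T> CompletableFuture<T> submit(Supplier<CompletableFuture<T>> call) {
        permits.acquireUninterruptibly();
        peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
        try {
            return call.get().whenComplete((result, error) -> {
                inFlight.decrementAndGet();
                permits.release();
            });
        } catch (RuntimeException e) {
            // The call never started, so give the permit back right away.
            inFlight.decrementAndGet();
            permits.release();
            throw e;
        }
    }

    // Highest number of calls that were in flight at the same time.
    int peakInFlight() {
        return peak.get();
    }

    public static void main(String[] args) {
        BoundedAsyncSketch limiter = new BoundedAsyncSketch(5);
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        for (int i = 0; i < 50; i++) {
            final int n = i;
            // supplyAsync stands in for a real asynchronous HTTP call.
            futures.add(limiter.<Integer>submit(() -> CompletableFuture.supplyAsync(() -> n)));
        }
        futures.forEach(CompletableFuture::join);
        System.out.println("peak in flight: " + limiter.peakInFlight());
    }
}
```

Because `submit` blocks while waiting for a permit, it should be called from a producer thread and not from the thread that completes the futures, otherwise the two can wait on each other.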

mint viper
#

manually

slate sundial
#

that's why it's manual

mint viper
#

wait

#

so ur not just planning, u already have working code ?

slate sundial
#

yes!

#

Do you want to have a quick look?

mint viper
#

anyways. either u don't understand my suggestions, or u dont want to use them for some reason

slate sundial
#

I just need to understand them better

#

I thought I was already using futures

mint viper
#

i think ur wasting time by optimizing the wrong end. and i think it's much easier to get max performance by approaching it differently

#

also, if u have working code, run a profiler on it

slate sundial
#

I have already

#

@mint viper would you mind taking a quick look at the code?

#

I know you know way more than me

mint viper
#

id suggest u just post it and all helpers can have a look

#

ur cpu load is nonexistent on that measurement

mint viper
#

are u planning to execute on much worse hardware?

slate sundial
#

I think it's because everything is on outbound I/O

mint viper
#

uve talked about performance. but as emccue and i suspected, ur problem is not cpu at all

#

its ur logic

#

ur cpu is chilling

#

so all the talking we did was wasted time

#

let the profiler look at ur io activity

#

see if that's at max throughput all the time

#

if it is, ur app is the fastest it can be

#

if not, u still have wasted potential

slate sundial
#

It maxes out my internet connection

#

On my home pc it's network bound

#

and on the server cpu/memory bound

mint viper
#

all the time? or just sometimes

#

if its at a constant 100% of ur IO, then ur app can't be made faster

slate sundial
#

=== Tweet Scraper ===
Requests : 1314
Success/s: 267.0
Success Total/s: 159.0
Failed/s: 38.0
Available Proxies: 547
Running for: 4 seconds,

mint viper
#

just let ur profiler look at it

#

get a nice graph

slate sundial
#

on I/O?

mint viper
#

yes

#

if its at 100% all the time, u already reached max throughput. which means as fast as possible

#

unless u can reduce the io traffic. for example maybe some requests are unnecessary

slate sundial
#

@mint viper having some trouble with VisualVM, my jprofiler evaluation just ended, is there a way to check the network throughput natively from the command line?

#

I mean I could just check the AWS stats