Building a web scraper | Together Java | Page 1

mighty heathBOT Jun 30, 2023, 4:06 AM

#

<@&987246841693360200> please have a look, thanks.

#

While you are waiting for help, here are some tips to improve your experience:

Code is much easier to read if posted with syntax highlighting and proper formatting.

If nobody is calling back, that usually means that your question was not well asked and hence nobody feels confident enough answering. Try to use your time to elaborate, provide details, context, more code, examples and maybe some screenshots. With enough info, someone knows the answer for sure.

Don't forget to close your thread using the command </help-thread close:1027500463647621170> when your question has been answered, thanks.

slate sundial Jun 30, 2023, 4:06 AM

#

@compact halo

compact halo Jun 30, 2023, 4:13 AM

#

sorry it is actually late

#

okay question

slate sundial Jun 30, 2023, 4:13 AM

#

no worries at all

compact halo Jun 30, 2023, 4:13 AM

#

do you need the jobs to persist after a restart

#

like if you turn the computer on and off again

slate sundial Jun 30, 2023, 4:13 AM

#

Not really, I will keep track of the larger groups of requests in a database

compact halo Jun 30, 2023, 4:14 AM

#

thats a yes

slate sundial Jun 30, 2023, 4:14 AM

#

I can manage that separately, I know how to segment the list of required requests

compact halo Jun 30, 2023, 4:14 AM

#

well you can make a persistent job queue in postgres relatively trivially

#

just so you are aware

slate sundial Jun 30, 2023, 4:15 AM

#

The problem is that this code will require probably upwards of 10k database calls per second, so will need to be in-memory

compact halo Jun 30, 2023, 4:15 AM

#

what are you scraping and why?

slate sundial Jun 30, 2023, 4:15 AM

#

I'm scraping twitter to make a crypto predictor

#

I know you might say it won't work, which may be the case.

#

I have written a web scraper that can make 800 successful requests a second using public proxies

mint viper Jun 30, 2023, 4:16 AM

#

scraping Twitter is against their ToS

#

u must use their api instead

slate sundial Jun 30, 2023, 4:16 AM

#

okay let's say I'm using their API, the question still stands

#

Can we look at this narrow slice of the problem?

compact halo Jun 30, 2023, 4:17 AM

#

okay

mint viper Jun 30, 2023, 4:17 AM

#

2k tasks is nothing

compact halo Jun 30, 2023, 4:17 AM

#

On fail (state = QUEUED, last_updated = NOW) added back to the front of the queue
This doesn't make sense - you will get infinite retries

mint viper Jun 30, 2023, 4:17 AM

#

u think u would get performance issues and that they would be solved by having multiple queues, but thats not the case

slate sundial Jun 30, 2023, 4:18 AM

#

So a task fails from the front of the queue, its state is set to queued, and last updated to now

compact halo Jun 30, 2023, 4:18 AM

#

and one persistently failing task will screw you

slate sundial Jun 30, 2023, 4:18 AM

#

then it is added back to the front of the queue

compact halo Jun 30, 2023, 4:18 AM

#

you want to put it in the back of the queue

#

not the front
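
A minimal sketch of the "back of the queue" advice, assuming a hypothetical `Task` record with an attempt counter and an assumed cap of three attempts (neither appears in the thread). A failed task is re-added at the tail so it cannot block the tasks behind it, and it is dropped once the cap is reached so it cannot retry forever. Records need Java 16 or newer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Predicate;

final class RetryQueueSketch {
    // Hypothetical task type: an id plus the number of attempts made so far.
    record Task(String id, int attempts) {}

    static final int MAX_ATTEMPTS = 3; // assumed cap, not from the thread

    // Drains the queue and returns the ids of tasks that were given up on.
    static List<String> drain(Deque<Task> queue, Predicate<Task> attempt) {
        List<String> dead = new ArrayList<>();
        Task task;
        while ((task = queue.pollFirst()) != null) {    // take from the front
            if (attempt.test(task)) {
                continue;                                // success, nothing to requeue
            }
            Task retried = new Task(task.id(), task.attempts() + 1);
            if (retried.attempts() >= MAX_ATTEMPTS) {
                dead.add(retried.id());                  // stop retrying this one
            } else {
                queue.addLast(retried);                  // back of the queue, not the front
            }
        }
        return dead;
    }

    public static void main(String[] args) {
        Deque<Task> queue = new ArrayDeque<>();
        queue.add(new Task("always-fails", 0));
        queue.add(new Task("ok", 0));
        List<String> dead = drain(queue, t -> !t.id().equals("always-fails"));
        System.out.println("gave up on: " + dead);
    }
}
```

With the failing task sent to the tail, the `ok` task is attempted right after the first failure instead of waiting behind repeated retries.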

#

and yeah - 10k database calls a second is fine

slate sundial Jun 30, 2023, 4:18 AM

#

Well I'm treating the start of the queue as the place where items are taken

compact halo Jun 30, 2023, 4:18 AM

#

maybe will make the gears squeak on a small instance, but not really a problem

slate sundial Jun 30, 2023, 4:18 AM

#

mint viper u think u would get performance issues and that they would be solved by having m...

So stick to one queue?

compact halo Jun 30, 2023, 4:18 AM

#

use queues for logical separation

#

not for performance

mint viper Jun 30, 2023, 4:19 AM

#

multiple queues or one won't affect performance

compact halo Jun 30, 2023, 4:19 AM

#

(they are performant, but thats not why you use them)

mint viper Jun 30, 2023, 4:19 AM

#

it only affects ur logic

slate sundial Jun 30, 2023, 4:19 AM

#

compact halo use queues for logical separation

I'm okay with having a single queue, but I take your point in readability

#

and sanity

mint viper Jun 30, 2023, 4:19 AM

#

and a 2k queue is nothing. u can have queues with billions of items

slate sundial Jun 30, 2023, 4:19 AM

#

Well I need to loop over the in progress tasks to check whether they've been forgotten or are dead

#

So I need an efficient memory structure that allows me to loop over each item, remove successful/failed ones
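
One possible shape for that in-progress bookkeeping, sketched under assumptions: task ids are strings, start times are passed in as milliseconds, and the class and method names are made up for illustration. A `ConcurrentHashMap` can be iterated while other threads add and remove entries, so a background sweep does not need to lock out the workers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class InProgressSweepSketch {
    // task id -> time (ms) at which the task was handed to a worker
    private final Map<String, Long> startedAt = new ConcurrentHashMap<>();

    void markStarted(String taskId, long nowMillis) { startedAt.put(taskId, nowMillis); }

    void markFinished(String taskId) { startedAt.remove(taskId); }

    int inProgress() { return startedAt.size(); }

    // Removes and returns the tasks that have been running longer than timeoutMillis.
    List<String> sweepStale(long nowMillis, long timeoutMillis) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Long> e : startedAt.entrySet()) {
            Long started = e.getValue();
            boolean expired = nowMillis - started > timeoutMillis;
            // remove(key, value) only succeeds if the entry is still unchanged,
            // so a task that finished concurrently is not reported as stale.
            if (expired && startedAt.remove(e.getKey(), started)) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        InProgressSweepSketch sweep = new InProgressSweepSketch();
        sweep.markStarted("a", 0L);
        sweep.markStarted("b", 900L);
        System.out.println(sweep.sweepStale(1000L, 500L));
    }
}
```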

compact halo Jun 30, 2023, 4:20 AM

#

or, do it in postgres and make an index

slate sundial Jun 30, 2023, 4:20 AM

#

I've tried it already, doesn't seem to work well

compact halo Jun 30, 2023, 4:20 AM

#

or better - you have job and archived_job tables

mint viper Jun 30, 2023, 4:20 AM

#

iterating a 2k queue every now and then is literally nothing

#

this queue is iterated in 1 millisecond

slate sundial Jun 30, 2023, 4:21 AM

#

If I'm going with two, I need one data structure for the task queue (concurrent linked deque?) and another concurrent linked list?

compact halo Jun 30, 2023, 4:21 AM

#

but yeah, just put it in an array list. 2k^2 is just 4 million

slate sundial Jun 30, 2023, 4:21 AM

#

compact halo or better - you have `job` and `archived_job` tables

I'm only planning on using it to store the data, it will really slow things down with I/O

slate sundial Jun 30, 2023, 4:21 AM

#

compact halo but yeah, just put it in an array list. 2k^2 is just 4 million

a single list?

compact halo Jun 30, 2023, 4:21 AM

#

sure, why not

#

like he said, 2k is nothing

mint viper Jun 30, 2023, 4:22 AM

#

or ArrayDeque, queue-wise

slate sundial Jun 30, 2023, 4:22 AM

#

Okay, so ConcurrentLinkedList, what is its time complexity for removing items?

compact halo Jun 30, 2023, 4:22 AM

#

even if you scan over it every time you want to do something thats fast enough

slate sundial Jun 30, 2023, 4:22 AM

#

I was originally planning on using a deque

#

that makes sense, but the items will have to be iterated over to grab a new request, if only one queue is used

#

it will have to loop over each item and select the first QUEUED item

mint viper Jun 30, 2023, 4:23 AM

#

again. a 2k queue is iterated, modified, added, removed,... in a millisecond

#

ur overthinking this

slate sundial Jun 30, 2023, 4:23 AM

#

It may be closer to 5000

compact halo Jun 30, 2023, 4:23 AM

#

yeah thats why i pointed out that 2k^2 isn't that big. O(n^2) where n=2k is nothing

mint viper Jun 30, 2023, 4:23 AM

#

thats still a millisecond

slate sundial Jun 30, 2023, 4:23 AM

#

If the code makes 800+ successful requests a second, it may starve

mint viper Jun 30, 2023, 4:23 AM

#

it starts to matter when u approach 2 million

#

not 2k

slate sundial Jun 30, 2023, 4:23 AM

#

Oh okay, that's completely fine then

#

But what data structure?

compact halo Jun 30, 2023, 4:24 AM

#

who cares

mint viper Jun 30, 2023, 4:24 AM

#

any

#

arraydeque is "the best" queue. so maybe start with that

slate sundial Jun 30, 2023, 4:24 AM

#

And I can remove an item from the middle of a deque linked list?

#

with a .remove(item)

#

perfect!

mint viper Jun 30, 2023, 4:25 AM

#

arraydeque supports any modification, yes
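
A small sketch of removing from the middle of an `ArrayDeque`; the task names are placeholders. `remove(Object)` does a linear scan and removes the first match, which is cheap at a few thousand items. Note that `ArrayDeque` is not thread-safe, so with several consumer threads it would need external locking, or a concurrent type such as `ConcurrentLinkedDeque` instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

final class DequeRemoveSketch {
    static Deque<String> removeFromMiddle() {
        Deque<String> queue = new ArrayDeque<>(List.of("t1", "t2", "t3", "t4"));
        queue.remove("t3");   // linear scan, removes the first matching element
        queue.pollFirst();    // takes "t1" from the front
        queue.addLast("t5");  // appends at the back
        return queue;
    }

    public static void main(String[] args) {
        System.out.println(removeFromMiddle());
    }
}
```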

slate sundial Jun 30, 2023, 4:25 AM

#

compact halo who cares

any other suggestions?

compact halo Jun 30, 2023, 4:25 AM

#

yes

slate sundial Jun 30, 2023, 4:25 AM

#

for efficiency?

mint viper Jun 30, 2023, 4:26 AM

#

if u get performance issues, its not due to ur choice of datastructure

compact halo Jun 30, 2023, 4:26 AM

#

do something more worth your human life than crypto speculation

mint viper Jun 30, 2023, 4:26 AM

#

but due to ur logic

slate sundial Jun 30, 2023, 4:26 AM

#

compact halo do something more worth your human life than crypto speculation

well, stocks too

#

I have two ml friends who want to build models

mint viper Jun 30, 2023, 4:26 AM

#

and please use the twitter api. otherwise, ull get banned

slate sundial Jun 30, 2023, 4:26 AM

#

both with masters

#

yep ofc

#

from here? or twitter?

compact halo Jun 30, 2023, 4:26 AM

#

slate sundial both with masters

master baters more like it

mint viper Jun 30, 2023, 4:26 AM

#

both

slate sundial Jun 30, 2023, 4:26 AM

#

compact halo master baters more like it

damn

slate sundial Jun 30, 2023, 4:27 AM

#

mint viper both

sure, thank you

slate sundial Jun 30, 2023, 4:28 AM

#

mint viper this queue is iterated in 1 millisecond

my point is though, that if 2000 calls are being made per second, that adds an additional 2 seconds per original

#

Should I actually remove the item instead?

compact halo Jun 30, 2023, 4:29 AM

#

you are sending requests over the internet. you have time to make a sandwich in the margin of error

slate sundial Jun 30, 2023, 4:30 AM

#

compact halo you are sending requests over the internet. you have time to make a sandwich in ...

To be honest, the code is currently doing 800 success ps / 8000 failed ps, this would effectively halve

#

even more considering the failed calls

#

I need to be efficient, so maybe two data structures makes more sense?

#

one for queued, one for other

mint viper Jun 30, 2023, 4:31 AM

#

the number of datastructures and the choice won't affect ur performance in the slightest

slate sundial Jun 30, 2023, 4:31 AM

#

maybe one queued, one processing

#

I'm more referring to the implementation

mint viper Jun 30, 2023, 4:31 AM

#

u could make the worst choice here and it would run at the exact same speed as with the best choice

#

its more about ur logic

#

than the choice of container

slate sundial Jun 30, 2023, 4:32 AM

#

the code thats calling a lot, (ie. request new task) could just pop from one queue (queued)

#

that way the background code could do cleanup

mint viper Jun 30, 2023, 4:32 AM

#

just get started. get it working. then check if there is an issue, measure it with a profiler and then fix the bottleneck

#

if there's any

slate sundial Jun 30, 2023, 4:32 AM

#

Okay, will try, but want to get this right the first time

#

So really appreciate any suggestions

mint viper Jun 30, 2023, 4:33 AM

#

i dont really get what u think is so complex here. it's just a simple task queue

#

with futures

slate sundial Jun 30, 2023, 4:33 AM

#

well, it's how many calls the threads will be making

#

and making it time efficient

mint viper Jun 30, 2023, 4:33 AM

#

that's not up to u to decide

#

use modern multithreading and java will make it as fast as possible out of the box

slate sundial Jun 30, 2023, 4:34 AM

#

i'm not just going for fast, i'm going for as fast as possible

mint viper Jun 30, 2023, 4:34 AM

#

which is what ive said

#

create a thread pool using executorservice

#

then shoot 2 billion tasks at it

#

and watch them being done as fast as possible

#

easy
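
A minimal sketch of the thread-pool suggestion, assuming a fixed pool and a trivial placeholder in place of a real request. The pool size and task count are illustrative, not values from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ExecutorSketch {
    // Runs taskCount placeholder tasks on a fixed pool and returns how many completed.
    static int runAll(int taskCount, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (int i = 0; i < taskCount; i++) {
                final int id = i;
                tasks.add(() -> id % 2);                      // stand-in for one request
            }
            int completed = 0;
            for (Future<Integer> f : pool.invokeAll(tasks)) { // blocks until all tasks are done
                f.get();                                      // rethrows a task failure, if any
                completed++;
            }
            return completed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();                                  // let the worker threads exit
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(2000, 8) + " tasks completed");
    }
}
```

The pool owns its own work queue, so the caller only submits tasks and collects results.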

slate sundial Jun 30, 2023, 4:35 AM

#

Sorry for all the questions

#

is it faster to:

loop through the array until a viable task is found using an object property

or to

remove the task from the current queue, add the task to a separate queue and start processing

mint viper Jun 30, 2023, 4:37 AM

#

ur wasting time by optimizing the wrong end

#

this part is fast anyways

#

cause ur data is tiny

slate sundial Jun 30, 2023, 4:37 AM

#

The other end is already optimised

#

what do you mean by that?

mint viper Jun 30, 2023, 4:37 AM

#

ur not losing time here

slate sundial Jun 30, 2023, 4:38 AM

#

as a side note, it's already making 800 requests a second, I just don't want it to drop too far

#

its optimised imo

mint viper Jun 30, 2023, 4:38 AM

#

how u work with ur datastructure costs u less time than writing a few if-else that cause branch prediction failures

#

since ur container is super small

slate sundial Jun 30, 2023, 4:39 AM

#

as in the jvm heap size?

mint viper Jun 30, 2023, 4:39 AM

#

2k is nothing

#

ur toaster can iterate and remove and add 2k items in milliseconds already

slate sundial Jun 30, 2023, 4:39 AM

#

I mean, it just needs to always be bigger than the amount of currently processed items

#

I know, but that 2k iteration needs to be done 800 times a second

mint viper Jun 30, 2023, 4:40 AM

#

in the time that u process a single of ur requests, u can add/remove/whatever trillions of items

#

per second

#

but again

#

why dont u use an executorservice

#

it will be much faster than any of ur selfmade bullshit

slate sundial Jun 30, 2023, 4:41 AM

#

So I'm using one producer thread, multiple consumer threads and concurrency

mint viper Jun 30, 2023, 4:41 AM

#

i can tell u from experience that executorservice is much faster than any self made queue/thread/whatever combination

slate sundial Jun 30, 2023, 4:42 AM

#

So I'm running multiple threads that make async requests

#

Let me find a snippet

mint viper Jun 30, 2023, 4:42 AM

#

ive replaced a complex and optimized c++ program that processed TB of data with a 20 line java program that did the same with executorservice. and it was 10x faster

slate sundial Jun 30, 2023, 4:42 AM

#

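// Build a GET request routed through one proxy (org.asynchttpclient)
Request request1 = new RequestBuilder("GET")
        .setUrl("http://httpforever.com")
        .setProxyServer(new ProxyServer.Builder(proxy.getIP(), proxy.getPort()).build())
        .build();

// Fire it asynchronously; the handler receives the response events
c.executeRequest(request1, new handler(c, proxy, this));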

slate sundial Jun 30, 2023, 4:42 AM

#

org.asynchttpclient.AsyncHttpClient
public abstract <T> org.asynchttpclient.ListenableFuture<T> executeRequest(
        org.asynchttpclient.Request request,
        org.asynchttpclient.AsyncHandler<T> handler)

mint viper Jun 30, 2023, 4:42 AM

#

yeah that looks bad

slate sundial Jun 30, 2023, 4:43 AM

#

which part LOL

mint viper Jun 30, 2023, 4:43 AM

#

use HttpClient

#

sendAsync

#

it returns a future

slate sundial Jun 30, 2023, 4:43 AM

#

Well, it's async

#

not the default client

mint viper Jun 30, 2023, 4:43 AM

#

then do anything else u need to do on that future

slate sundial Jun 30, 2023, 4:43 AM

#

org.asynchttpclient

mint viper Jun 30, 2023, 4:43 AM

#

u dont need manual threads at all

#

nor any queues

#

do proper future based coding

#

future.thenApply(...).thenApply(...)...
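
A sketch of that chaining style using `CompletableFuture`. The `fakeFetch` method is a made-up stand-in so the example runs offline; in real code the future would come from something like `java.net.http.HttpClient.sendAsync`, which returns a `CompletableFuture`, or from `toCompletableFuture()` on the AsyncHttpClient future.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;

final class FutureChainSketch {
    // Hypothetical stand-in for an async HTTP call that yields a response body.
    static CompletableFuture<String> fakeFetch(String url) {
        return CompletableFuture.supplyAsync(() -> "body of " + url);
    }

    // Starts every request up front, then chains the processing steps on each future.
    static List<Integer> bodyLengths(List<String> urls) {
        List<CompletableFuture<Integer>> futures = urls.stream()
                .map(url -> fakeFetch(url)
                        .thenApply(String::trim)     // step 1: clean up the body
                        .thenApply(String::length))  // step 2: derive a result from it
                .collect(Collectors.toList());
        // Only now wait for the results, in the original order.
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(bodyLengths(List.of("a", "bcd")));
    }
}
```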

slate sundial Jun 30, 2023, 4:44 AM

#

c.executeRequest(request1, new handler(c, proxy, this)).toCompletableFuture()

#

It's asynchronous

mint viper Jun 30, 2023, 4:44 AM

#

then spawn all ur thousands of future/requests immediately

#

and java will execute it as fast as possible automatically. ull reach max throughput

slate sundial Jun 30, 2023, 4:45 AM

#

That's pretty much what I'm doing

mint viper Jun 30, 2023, 4:45 AM

#

but ur talking about supervising ur tasks

slate sundial Jun 30, 2023, 4:45 AM

#

But I had resource exhaustion, so had to add a maximum concurrency value

mint viper Jun 30, 2023, 4:45 AM

#

manually

slate sundial Jun 30, 2023, 4:45 AM

#

that's why it's manual
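
One common way to express a maximum concurrency value without a hand-written queue is a `Semaphore` around the async calls. This is a sketch under assumptions: the request is an empty placeholder, and the names and limits are illustrative rather than taken from the poster's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedAsyncSketch {
    // Launches taskCount async placeholder requests, never more than maxInFlight at once.
    // Returns the highest number of requests observed in flight.
    static int run(int taskCount, int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<CompletableFuture<Void>> all = new ArrayList<>();
        for (int i = 0; i < taskCount; i++) {
            permits.acquireUninterruptibly();          // producer waits while the cap is reached
            peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
            CompletableFuture<Void> f = CompletableFuture
                    .runAsync(() -> { /* stand-in for one request */ })
                    .whenComplete((result, error) -> {
                        inFlight.decrementAndGet();
                        permits.release();             // free the slot on success or failure
                    });
            all.add(f);
        }
        CompletableFuture.allOf(all.toArray(new CompletableFuture[0])).join();
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak in flight: " + run(200, 10));
    }
}
```

Releasing the permit in `whenComplete` matters because it runs for failed requests too, so failures do not leak slots.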

mint viper Jun 30, 2023, 4:45 AM

#

wait

#

so ur not just planning, u already have working code ?

slate sundial Jun 30, 2023, 4:45 AM

#

yes!

#

Do you want to have a quick look?

mint viper Jun 30, 2023, 4:46 AM

#

anyways. either u don't understand my suggestions, or u dont want to use them for some reason

slate sundial Jun 30, 2023, 4:47 AM

#

I just need to understand them better

#

I thought I was already using futures

mint viper Jun 30, 2023, 4:47 AM

#

i think ur wasting time by optimizing the wrong end. and i think it's much easier to get max performance by approaching it differently

#

also, if u have working code, run a profiler on it

slate sundial Jun 30, 2023, 4:47 AM

#

I have already

#

#

@mint viper would you mind taking a quick look at the code?

#

I know you know way more than me

mint viper Jun 30, 2023, 4:53 AM

#

id suggest u just post it and all helpers can have a look

#

ur cpu load is nonexistent on that measurement

slate sundial Jun 30, 2023, 4:54 AM

#

mint viper id suggest u just post it and all helpers can have a look

https://github.com/Scrapium/scrape-tweets-dev-2

GitHub

GitHub - Scrapium/scrape-tweets-dev-2: Development repo for scrapin...

Development repo for scraping tweets and market data via RDS. Using optimisations techniques such as (Threading, asynchronous I/O, non-blocking I/O - ConcurrentLinkedQueues, and runnable tasks for ...

mint viper Jun 30, 2023, 4:54 AM

#

are u planning to execute on much worse hardware?

slate sundial Jun 30, 2023, 4:54 AM

#

I think it's because everything is on outbound I/O

mint viper Jun 30, 2023, 4:55 AM

#

uve talked about performance. but as emccue and i suspected, ur problem is not cpu at all

#

its ur logic

#

ur cpu is chilling

#

so all the talking we did was wasted time

#

let the profiler look at ur io activity

#

see if that's at max throughput all the time

#

if it is, ur app is the fastest it can be

#

if not, u still have wasted potential

slate sundial Jun 30, 2023, 4:57 AM

#

It maxes out my internet connection

#

On my home pc it's network bound

#

and on the server cpu/memory bound

mint viper Jun 30, 2023, 4:58 AM

#

all the time? or just sometimes

#

if its at a constant 100% of ur IO, then ur app can't be made faster

slate sundial Jun 30, 2023, 4:58 AM

#

=== Tweet Scraper ===
Requests : 1314
Success/s: 267.0
Success Total/s: 159.0
Failed/s: 38.0
Available Proxies: 547
Running for: 4 seconds,

mint viper Jun 30, 2023, 4:59 AM

#

just let ur profiler look at it

#

get a nice graph

slate sundial Jun 30, 2023, 4:59 AM

#

on I/O?

mint viper Jun 30, 2023, 4:59 AM

#

yes

#

if its at 100% all the time, u already reached max throughput. which means as fast as possible

#

unless u can reduce the io traffic. for example maybe some requests are unnecessary

slate sundial Jun 30, 2023, 5:13 AM

#

@mint viper having some trouble with VisualVM, my jprofiler evaluation just ended, is there a way to check the network throughput natively from the command line?

#

I mean I could just check the AWS stats

#
