#[tech]: help building a .txt parser

young pond
#

The idea is to parse 200+ .txt log files, each of which could be as large as 500 MB. The parser should go through these logs and flag all the unique errors in them by relevant keywords such as 'failure'. Python is preferred, but suggestions in other languages are also welcome.

#

@viral narwhal hi : )

viral narwhal
#

.

viral narwhal
#

python is best for this use case

young pond
viral narwhal
#

im commuting rn

#

ill see after 8

young pond
viral narwhal
#

relevant keywords such as 'failure' etc

#

how many such keywords?

#

i think this could be as simple as a 100-200 line python script

young pond
#

2-3?

viral narwhal
#

oh okay

#

fairly easy now

#

where are the err logs stored?

young pond
#

At Max 10 tbh to have a cleaner result

viral narwhal
#

on a server right

young pond
viral narwhal
#

oh

#

okay

young pond
#

One issue

viral narwhal
#

cron job -> runs the script on a schedule -> the script picks up any new log file and processes it

young pond
#

The error log is inside a folder and that folder has a lot of other test related files too

viral narwhal
#

if you're going the script way, the simplest setup is to keep the err logs in the same folder as the script (or pass the folder path to it)

young pond
#

Ahh

viral narwhal
#

if they share a common file naming, then it'd be easy

young pond
#

Then I need to manually pick them from the folders, no?

viral narwhal
#

like errlog<smth-smth>.txt

#

so you can parse via file names

young pond
#

Yess its like that

#

Ahhh

viral narwhal
#

so

#

a shell script will always be up and running to check if a new err log file (with a name matching a predefined regex pattern) appears in the shared network folder

young pond
#

It doesn't even have to be real time. But if it is, that is added benefit

viral narwhal
#

or you can trigger this shell script at defined times, like once a day, with a cron job
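
#

a rough python sketch of the scheduled variant, assuming the logs follow an errlog<something>.txt naming pattern. the pattern, the function name `find_new_logs` and the "seen" file are all made up for illustration, not a fixed design:

```python
import re
from pathlib import Path

# Hypothetical naming convention: errlog<anything>.txt
ERRLOG_NAME = re.compile(r"^errlog.*\.txt$", re.IGNORECASE)


def find_new_logs(folder, seen_file):
    """Return error-log paths in `folder` that earlier runs have not recorded."""
    seen_path = Path(seen_file)
    seen = set(seen_path.read_text().splitlines()) if seen_path.exists() else set()
    new_logs = [
        p for p in sorted(Path(folder).iterdir())
        if p.is_file() and ERRLOG_NAME.match(p.name) and p.name not in seen
    ]
    # Remember what we handed out so the next scheduled run skips these files.
    with seen_path.open("a") as fh:
        for p in new_logs:
            fh.write(p.name + "\n")
    return new_logs
```

each scheduled run calls `find_new_logs(folder, seen_file)` once and only gets the files that showed up since the last run. on a windows machine, Task Scheduler can play the role of cron.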

viral narwhal
#

so

#
  • the error logs will be stored on either a remote server or a shared network folder
  • a parsing microservice will check for new logs -> all the unique errors will go to a database
  • these will be fetched by an error monitoring dashboard like Sentry
#

the parsing microservice will be fetching logs from the shared network folder periodically; a cron job will be very helpful for that

#

the logs will be stored with severity levels in the DB along with timestamps

#

the monitoring service (either a third-party software or an in-house solution) will be querying the DB to fetch data
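
#

the DB part could look something like this. it's a minimal sketch using sqlite3 from the standard library; the table layout, the column names and the `hits` counter are assumptions for illustration, and a real setup might use a different database entirely:

```python
import sqlite3
from datetime import datetime, timezone


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the errors table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS errors (
               message    TEXT PRIMARY KEY,  -- one row per unique error
               severity   TEXT NOT NULL,
               first_seen TEXT NOT NULL,     -- ISO 8601 timestamp, UTC
               hits       INTEGER NOT NULL DEFAULT 1
           )"""
    )
    return conn


def record_error(conn, message, severity):
    """Insert a new unique error, or bump the counter if it is already stored."""
    now = datetime.now(timezone.utc).isoformat()
    updated = conn.execute(
        "UPDATE errors SET hits = hits + 1 WHERE message = ?", (message,)
    ).rowcount
    if updated == 0:
        conn.execute(
            "INSERT INTO errors (message, severity, first_seen) VALUES (?, ?, ?)",
            (message, severity, now),
        )
    conn.commit()


def errors_by_severity(conn, severity):
    """The kind of query a monitoring dashboard could run."""
    rows = conn.execute(
        "SELECT message, hits FROM errors WHERE severity = ? ORDER BY message",
        (severity,),
    )
    return rows.fetchall()
```

the parser would call `record_error` for every flagged line, and the dashboard side only ever reads, e.g. `errors_by_severity(conn, "ERROR")`.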

young pond
#

I'm sorry what's a cron job?😢

viral narwhal
young pond
viral narwhal
#

oh okay

young pond
#

I can't do this on Linux either 😔 has to be done on my work laptop

viral narwhal
#

cron is basically a job scheduler in linux

young pond
#

Okayy

viral narwhal
#

really

young pond
#

Yes i can't use my personal pc for demo

viral narwhal
#

bro we talking about microservices and damn work laptop

#

🤣

young pond
#

😭😭😭

viral narwhal
#

this is really smol company behavior

young pond
#

Okay i should have mentioned that before

#

Unfortunately they ain't small. This team is mismanaged as hell

viral narwhal
#

yeah sure

viral narwhal
viral narwhal
young pond
viral narwhal
young pond
viral narwhal
#

oh okay

#

the system design i wrote above will work

viral narwhal
#

do you have admin perms in it?

#

you can use wsl (windows subsystem for linux) to get a linux-like command line

young pond
young pond
viral narwhal
#

this is fairly easy if you believe

young pond
#

😭

viral narwhal
#

a db, a parsing service, a dashboard to view errors

#

thats it

#

intern stuff

young pond
#

I've run stuff in powershell in admin mode

viral narwhal
#

yes, that will work

#

there's no problem

#

bro just write a python parser now

young pond
#

Alright okay, I'll take a look once I'm home

viral narwhal
#

yes sure

#

i can do that for you but i charge hourly

#

sorry not sorry

young pond
#

No worries 😂I'll figure that out

viral narwhal
#

ill help nw

#

since the err logs do share a common file naming convention, a regular expression will help in identifying the desired files in a folder containing other file types

#

python has built-in functions to open files for parsing (combined with the regexp)

#

now regexp can also be used to flag the keywords "failure" etc since the file is just plain text

#

sets are helpful to store unique items

#

this will be the payload to put in the DB

#

and yeah that's it
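
#

put together, the steps above could be sketched roughly like this. the file pattern, the keyword list and the digit masking are all assumptions to be adjusted to the real logs, and `unique_errors` / `scan_folder` are just illustrative names:

```python
import re
from pathlib import Path

# Hypothetical values: adjust the file pattern and keywords to the real logs.
ERRLOG_NAME = re.compile(r"^errlog.*\.txt$", re.IGNORECASE)
KEYWORDS = re.compile(r"failure|error|exception", re.IGNORECASE)
# Digits (timestamps, ids, counters) are masked so that repeats of the same
# error collapse into one entry. This normalisation is an assumption.
DIGITS = re.compile(r"\d+")


def unique_errors(path):
    """Stream one log file and return the set of normalised error lines."""
    found = set()
    # Iterating over the file object reads one line at a time, so a 500 MB
    # file never has to fit in memory.
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if KEYWORDS.search(line):
                found.add(DIGITS.sub("<n>", line.strip()))
    return found


def scan_folder(folder):
    """Collect unique errors from every matching log file in `folder`."""
    found = set()
    for p in sorted(Path(folder).iterdir()):
        if p.is_file() and ERRLOG_NAME.match(p.name):
            found |= unique_errors(p)
    return found
```

without some kind of masking, two lines that differ only in their timestamp would count as two different errors, so the set would end up nearly as big as the log itself.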

young pond
#

Can I pick a parser from git? Will that be an issue?

viral narwhal
#

wdym pick from git?

#

do you mean github?

#

oh okay

young pond
viral narwhal
#

maybe yes

#

who's going to find out tbh

#

chatgpt is always there anyway

young pond
#

He said just implement it

viral narwhal
#

yeah so do it

young pond
viral narwhal
#

bro i dont have working experience

#

i don't know how things work in corporate

young pond
#

Yeah yeah

#

Not needed to know here

viral narwhal
#

but anyway you will have to make changes in it tho

#

as you have to put flagged logs in DB as well

#

coding part is easy lol, the system design is debatable

#

😂

young pond
#

Bt ty so much for your help ya

viral narwhal
#

youll be appreciated

#

throw in the mumbo jumbo as well

viral narwhal
#

realtime data pipeline 🔥

young pond
#

If it flies i might participate with this idea in the hackathon

viral narwhal
#

containerisation and shit 🔥

young pond
#

Omg ur more corporate than me xD

viral narwhal
#

throw up some kubernetes and ansible 🔥

viral narwhal
#

need da dolla

viral narwhal
#

@slow stone

#

👆

slow stone
viral narwhal
#

😔

#

pardon m'lady

slow stone
#

@young pond wow you nerd squad too

viral narwhal
#

senõr enzineer

young pond
young pond
viral narwhal
viral narwhal
young pond
viral narwhal
#

man you're really overworked

#

take it slow

slow stone
young pond
viral narwhal
young pond
slow stone
#

Join and chill

viral narwhal
young pond
slow stone
#

It's ahm lol

#

Nobody pays attention lmfao

viral narwhal
#

it's not for employees lol

young pond
#

Yeah I need to know the product roadmap

#

The rest of it is shit

slow stone
#

Sigh

young pond
young pond
#

And market stats tell us what the bonus situation will be

viral narwhal
#

please internship referral

slow stone
young pond
slow stone
#

Ah then you'd need to lol

young pond
#

Heheh

#

Still junior only

slow stone
#

Damn you old tho

#

Wait what

young pond
#

What do u mean by senior

#

I'm 3 yrs exp

slow stone
#

Ohhh

young pond
#

Hbu

slow stone
#

Okay company designation might be senior engg but you mid level I assume

young pond
#

Not exactly a new hire but not a lead yet

slow stone
#

Architect ish

young pond
#

Next year tho 😈

slow stone
#

Lesgoooo!!

#

All de best

young pond
#

They get new hires to do it too, but they don't own it completely

viral narwhal
young pond
#

If you're into embedded, it's the best

viral narwhal
#

I'm already getting discriminated due to degree

young pond
viral narwhal
#

will prove to be an unfair advantage

young pond
viral narwhal
#

only degree

young pond
viral narwhal
#

I'm already doing whatever i can

#

but still shit mentality of people

young pond
#

Efforts never go unrecognised 💯💯

#

I'll put in a referral if we do an off campus internship hiring next year

young pond
viral narwhal
#

I'll be working on a real-time collaborative code editor with sandboxed environments

young pond
#

Keep in touch with ppl in blr

#

You can target startup internships and convert them into full-time

viral narwhal
young pond
#

Get exp for 2 years, by then the uni tag doesn't hold any value

viral narwhal
#

startup -> 1-1.5yoe -> big PBC

#

this is the plan

young pond
viral narwhal
#

as big corps generally don't hire non btech folks

zealous citrus
#

basil on that grind

young pond
young pond
viral narwhal
#

sad

slow stone
#

I was looking for embedded roles during my switch, gave in to money tho lol

#

At least I am working on C++, not embedded stuff tho but still. Loving it lol

young pond
proper quiver
elder thorn
#

@young pond if this is not a pet project and something that's gonna be used in production time and time again, you can use the ELK stack

#

Elasticsearch will index your logs and make them searchable

#

Logstash will parse the logs and push them to Elasticsearch

#

and Kibana is used for visualization and queries

viral narwhal
proper quiver
#

no

viral narwhal
#

oh okay

#

same

#

gaara's approach reflects a seasoned engineer who has worked with production environments

elder thorn
viral narwhal
elder thorn
viral narwhal
#

damn, senior 🫡

elder thorn
#

Hahaha

#

🥹

viral narwhal
#

as a noob person, i like to duct tape code with hope

#

🤣

elder thorn
#

Very few products are mature

viral narwhal
#

do you have experience with microservices?

elder thorn
#

I have worked as a Data Scientist, Software Dev and SRE

#

😭

viral narwhal
#

quite wide expertise

elder thorn
#

Wish I didn't

#

Sticking to one thing is better for mental peace

viral narwhal
#

hmm

viral narwhal
#

hey doog are you a student?

proper quiver
#

nope i took a drop after 12th

#

just kidding x)

viral narwhal
#

we can be friends :)

#

accept my req

proper quiver
#

yay

elder thorn
#

wholesome

#

friends 🥹

viral narwhal
#

sent one to you too 🥂

proper quiver
elder thorn
#

wow

#

accepted bbgs

young pond
elder thorn
# young pond It might be used in production but for now we're just looking for POC i guess. H...

Look it up: Elasticsearch is basically a NoSQL document store and search engine, and you can run it as a cluster so that it stays up even when a node or two goes down. It's especially useful for text-based lookups. Logstash works like a client: it can run on multiple hosts, parse the logs according to what you've configured, and push them to the ES cluster. Kibana can be used to make graphs, like how many errors there were in the past 3 hours.

#

But you know the problem best so it might not suit your situation

#

You have to decide the tradeoffs

young pond
elder thorn
#

To decide whether it's the right fit

#

I'll go through this thread later to understand the scope of problem

young pond
spiral canyon
#

Error monitoring at big companies is usually done with the ELK stack, but there’s also a tool called fail2ban on Linux. I run a private server and use fail2ban to monitor “errors” in the log files by defining what counts as an error and what action to take. If your use case is small and restricted to one system, you could use fail2ban; otherwise you’ll probably need a cluster and the ELK stack.

elder thorn
elder thorn
young pond
young pond
elder thorn
delicate cairn
#

breh why didnt anyone invite me this looks tasty af

#

@young pond aaaahhh

young pond
young pond
delicate cairn
young pond
delicate cairn
#

oops

sturdy crypt
#

why you giving offis work here

young pond
spiral canyon
viral narwhal
#

any updates on this one?

young pond
#

New challenge is that the logs could be on various networks

viral narwhal
#

oh okay

#

cool

viral narwhal
#

configure something to process the incoming logs and extract the relevant info and push it to elasticsearch

#

and kibana for visualisation

#

gaara's approach
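
#

if the pushing is done by hand instead of through Logstash, the payload could be built roughly like this. it's a sketch of the newline-delimited JSON body that the Elasticsearch `_bulk` endpoint takes; the index name and the document fields are made up, and the actual HTTP request (a POST with `Content-Type: application/x-ndjson`) is left out:

```python
import json


def bulk_body(errors, index="error-logs"):
    """Build an NDJSON body: one action line plus one document line per error."""
    lines = []
    for err in errors:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(err))                           # document line
    # Every line, including the last one, is terminated by a newline.
    return "".join(line + "\n" for line in lines)
```

each entry in `errors` is a plain dict, for example `{"message": "disk failure", "severity": "ERROR"}`.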

#

you might as well use encryption for log data while pushing to the common service

#

this setup seems highly scalable and fault tolerant

#

the microservice approach makes the setup easy and highly customisable to specific business needs, but it'll struggle with scaling

#

while the ELK stack is highly scalable and fault tolerant

coral geyser
#

regex