#How to count words in a lengthy JSON file excluding the words in keys?

narrow wind
#

Hello! I have a long JSON file used for translations, like the following:

{
  "greeting": "Hello World !",
  "whats_your_name": "What's your name ?",
  "submissionForm":{
      "title": "Subscription",
      "username": "Username"
      // ...
   },
    // ...
}

My objective is to count the words inside the string values, not the string keys!

Note: a standalone character like '!' or '?' counts as a word

kindred bramble
#

for all string entries or just the greeting one?

#

so in the above example it would be 2 + 3 + 1 + 1 = 7?

narrow wind
#

for all string entries or just the greeting one?
for any key

timid tiger
#

9 with the ! and ? counted as words
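
A minimal sketch of the counting rule discussed above, assuming a "word" is any run of non-whitespace characters, so a standalone "!" or "?" counts on its own. The class name `WordCountRule` is made up for illustration. Under this assumption "Hello World!" (no space before the "!") would count as 2 words, not 3.

```java
class WordCountRule {
    // Counts whitespace-separated tokens; blank input counts as zero words.
    static int countWords(String text) {
        String trimmed = text.strip();
        if (trimmed.isEmpty()) {
            return 0;
        }
        return trimmed.split("\\s+").length;
    }

    public static void main(String[] args) {
        // The four string values from the example in the question.
        int total = countWords("Hello World !")
                + countWords("What's your name ?")
                + countWords("Subscription")
                + countWords("Username");
        System.out.println(total); // 3 + 4 + 1 + 1 = 9
    }
}
```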

narrow wind
#

so in the above example it would be 2 + 3 + 1 + 1 = 7?
yeah

kindred bramble
#

okay, sounds like u need a flat-stream over all entries then. filter for string type and then ur pretty much done already. just the splitting/counting part (but thats easy)

#

@plain sable does ur json lib support flat-streaming or iterating entries easily?

#

the key aspect that makes it difficult is the flatten-part

#

id assume jackson and gson can also do this in a way, but never dived deep enough into them
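
A rough sketch of the "flat-stream, then filter for strings" idea. It assumes the JSON has already been parsed into plain `Map`/`List`/`String` objects, which is an assumption about how a library hands back the data, not something stated in the thread. The names `FlatStream` and `stringValues` are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Stream;

class FlatStream {
    // Flattens a parsed JSON tree into a stream of its string values.
    // Map keys are never emitted, only the values.
    static Stream<String> stringValues(Object node) {
        if (node instanceof String s) {
            return Stream.of(s);
        }
        if (node instanceof Map<?, ?> map) {
            return map.values().stream().flatMap(FlatStream::stringValues);
        }
        if (node instanceof List<?> list) {
            return list.stream().flatMap(FlatStream::stringValues);
        }
        return Stream.empty(); // numbers, booleans and nulls carry no words
    }

    // Sums the whitespace-separated tokens of every string value.
    static long countWords(Object root) {
        return stringValues(root)
                .map(String::strip)
                .filter(s -> !s.isEmpty())
                .mapToLong(s -> s.split("\\s+").length)
                .sum();
    }
}
```

Once the tree is flattened, the splitting and counting is a single stream pipeline, which is the "easy part" mentioned above.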

timid tiger
#

you could also just parse it yourself ^^

kindred bramble
#

only really an option if its formatted nicely. i.e. one entry per line

#

otherwise pain and suffering follows 😄

kindred bramble
#

yeah but thats not what i mean

narrow wind
#

i think i know what to do

kindred bramble
#

a json file like this is also valid:

#
{ "greeting": "Hello World !", "whats_your_name": "What's your name ?" }
#

but its just one line

narrow wind
#

ah

#

yeah i get it

#

there are some *line breaks*

#

i mean all of it

kindred bramble
#

essentially, for manual text parsing, the key aspect is whether ur sure that its "one entry per line"

#

and not for example two in a line

narrow wind
#

ah yeah, the 1st

timid tiger
#

U don't need a full json parser really

#

Just keep track of if you are inside quotes or not

kindred bramble
#

so theres never two entries or more in a line?

kindred bramble
#

json strings can contain escaped quotes (\") and other escape sequences

#

and more
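
A hedged sketch of the "track whether you are inside quotes" approach, extended to skip over backslash escapes. It assumes a string followed by a colon is a key and anything else is a value, and it treats `\n`, `\t` and `\r` escapes as whitespace. It does not decode `\uXXXX` and does no validation, so it is not a JSON parser. The class name `QuoteScanner` is made up.

```java
class QuoteScanner {
    // Counts whitespace-separated tokens; blank input counts as zero words.
    static int countWords(String text) {
        String trimmed = text.strip();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Scans raw JSON text and counts the words of every string that is not a key.
    static int countValueWords(String json) {
        int total = 0;
        int i = 0;
        while (i < json.length()) {
            if (json.charAt(i) != '"') {
                i++;
                continue;
            }
            i++; // skip the opening quote
            StringBuilder content = new StringBuilder();
            while (i < json.length() && json.charAt(i) != '"') {
                char c = json.charAt(i);
                if (c == '\\' && i + 1 < json.length()) {
                    char next = json.charAt(i + 1);
                    // Whitespace escapes become a space; other escaped chars are kept as-is.
                    content.append(next == 'n' || next == 't' || next == 'r' ? ' ' : next);
                    i += 2;
                } else {
                    content.append(c);
                    i++;
                }
            }
            i++; // skip the closing quote
            // Look past whitespace: a following ':' means this string was a key.
            int j = i;
            while (j < json.length() && Character.isWhitespace(json.charAt(j))) {
                j++;
            }
            boolean isKey = j < json.length() && json.charAt(j) == ':';
            if (!isKey) {
                total += countWords(content.toString());
            }
        }
        return total;
    }
}
```

This version does not care how many entries share a line, but it still trusts the input to be well-formed.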

timid tiger
#

Hmm fair

kindred bramble
#

but yeah, it all comes down to how easy this file is for manual parsing

#

if its just one entry per line, its simple

#

so please be sure about that before we continue
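
For illustration, a line-based sketch that only works under the "one entry per line" assumption, with no escaped quotes inside values. The regular expression and the class name `LineBasedCounter` are my own guesses at what such a manual parser could look like.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineBasedCounter {
    // Matches a whole line of the form: "key": "value" with an optional trailing comma.
    private static final Pattern ENTRY =
            Pattern.compile("^\\s*\"[^\"]*\"\\s*:\\s*\"(.*)\"\\s*,?\\s*$");

    // Counts whitespace-separated tokens; blank input counts as zero words.
    static int countWords(String text) {
        String trimmed = text.strip();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Counts words in the value of every line that looks like a single string entry.
    static int countValueWords(String fileContent) {
        int total = 0;
        for (String line : fileContent.split("\\R")) {
            Matcher m = ENTRY.matcher(line);
            if (m.matches()) {
                total += countWords(m.group(1));
            }
        }
        return total;
    }
}
```

Lines that do not match are skipped without any error, so a file with several entries on one line is silently undercounted.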

kindred bramble
#

how important is it that the result is fully correct? will a business go down if we make a mistake now?

narrow wind
#

that way i'm sure it has one entry per line

kindred bramble
#

how big is the file? MB? GB? TB?

narrow wind
#

there are 940 lines

#

it's 42 KB

#

lel

kindred bramble
#

oh... okay, i thought ur talking about a 60 GB file

#

with a trillion lines

kindred bramble
#

well, it would matter

narrow wind
#

consider N lines

#

my question is not about performance

kindred bramble
#

yeah but in one case u must be streaming and in the other case it fits into the RAM still

#

so how important is it that the result is absolutely correct, regardless of how complex and full of edge-cases the json data is?

#

if it contains escaped quotes, unicode escapes and that kind of shit, a manual parser will likely spit out the wrong values and u wouldnt notice it until its too late

#

hence im asking

#

if ur task is a business critical one, manual parsing is off the table

plain sable
#
import dev.mccue.json.Json;
import dev.mccue.json.JsonObject;
import dev.mccue.json.JsonString;

public class Main {
    // Counts whitespace-separated tokens, so a standalone "!" or "?" is one word.
    static int countWords(String s) {
        String trimmed = s.strip();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    // Walks the object tree and sums the words of every string value.
    // Only values are visited, so keys are never counted.
    static int countWords(Json json) {
        int words = 0;
        if (json instanceof JsonObject o) {
            for (var value : o.values()) {
                if (value instanceof JsonString s) {
                    words += countWords(s.toString());
                } else if (value instanceof JsonObject o2) {
                    // Nested object: recurse into it.
                    words += countWords(o2);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) {
        var json = Json.readString("""
                {
                  "greeting": "Hello World !",
                  "whats_your_name": "What's your name ?",
                  "submissionForm":{
                      "title": "Subscription",
                      "username": "Username"
                   }
                }
                """);

        System.out.println(countWords(json)); // 9 for this sample
    }
}
#

this is the basic strategy

#

which will work with any json library

kindred bramble
#

note that its recursive

#

but shouldnt be a problem unless u have super deeply nested objects

#

like > 200 nesting levels

#

otherwise u can make it non-recursive by adding the child nodes to some job-queue instead

#

similar to an iterative BFS/DFS search
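
A possible shape for the non-recursive variant, using an explicit stack instead of the call stack. It assumes the JSON is available as plain `Map`/`List`/`String` objects rather than any particular library's node types. The class name `IterativeWalk` is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

class IterativeWalk {
    // Iterative depth-first walk: pending nodes live in a deque, not on the call stack.
    static int countWords(Object root) {
        int words = 0;
        Deque<Object> pending = new ArrayDeque<>();
        if (root != null) {
            pending.push(root);
        }
        while (!pending.isEmpty()) {
            Object node = pending.pop();
            if (node instanceof String s) {
                String trimmed = s.strip();
                if (!trimmed.isEmpty()) {
                    words += trimmed.split("\\s+").length;
                }
            } else if (node instanceof Map<?, ?> map) {
                for (Object value : map.values()) {
                    if (value != null) { // ArrayDeque rejects null elements
                        pending.push(value);
                    }
                }
            } else if (node instanceof List<?> list) {
                for (Object element : list) {
                    if (element != null) {
                        pending.push(element);
                    }
                }
            }
            // Other node types (numbers, booleans) contribute no words.
        }
        return words;
    }
}
```

Swapping `push`/`pop` for `addLast`/`pollFirst` would turn the same loop into a breadth-first walk. The total is the same either way.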

kindred bramble
#

jackson can do the above as well

#

all ur doing here is iterating the object-tree

#

its just a plain tree

#

and a simple DFS search on it

#

jackson and gson offer tree iteration as well

narrow wind
#

Thank you all for the help