#String detection goes wrong [FIXED]


past stone
#

I'm developing my own language, "Mafdom", with the help of ChatGPT, and I'm at the point of detecting strings for printing. I'm still working on the lexer, and I'm really stuck on string recognition. Look at this example from the terminal:

"Hello" 'Hello' write: 'Hello, World!' write: "123abc"
STRING: "Hello"
 
STRING: 'Hello'
 
KEYWORD: write
21KEYWORD: write
IDENTIFIER: llo
SYMBOL: ,
 
IDENTIFIER: World
!
Bad string formation

Process finished with exit code 0

Note: the 21 before the second KEYWORD is the index, which I forgot to remove before running this example.

Almost everyone sticks with the C family for programming-language development. But I tried to do something different. And now I'm stuck!

In the Java source, you can see the part that recognizes strings and the other regex matches, and at the top of the file you can see the regex pattern declarations. Take a look at both parts:

#
            String remaining_code = code_with_space.substring(index);

            Matcher keyword_matcher = KEYWORD_PATTERN.matcher(remaining_code);
            Matcher identifier_matcher = IDENTIFIER_PATTERN.matcher(remaining_code);
            Matcher number_matcher = NUMBER_PATTERN.matcher(remaining_code);
            Matcher symbol_matcher = SYMBOL_PATTERN.matcher(remaining_code);
            Matcher string_matcher = STRING_PATTERN.matcher(remaining_code);


            if (remaining_code.startsWith("\"") || remaining_code.startsWith("\'")) {

                int end_quote_idx = 0;

                if (remaining_code.startsWith("\'")) { end_quote_idx = remaining_code.indexOf("\'", 1); }
                else if (remaining_code.startsWith("\"")) { end_quote_idx = remaining_code.indexOf("\"", 1); }

// Cut due to the discord non-nitro char limit

#

                if (end_quote_idx != -1) {

                    tokens.add("STRING: " + remaining_code.substring(0, end_quote_idx + 1));
                    index += end_quote_idx + 1;

                } else {

                    tokens.add("Bad string formation");
                    index = code_with_space.length() - 1;

                }

            } else if (keyword_matcher.lookingAt()) {

                tokens.add("KEYWORD: " + keyword_matcher.group());
                index += keyword_matcher.group().length();

            } else if (identifier_matcher.lookingAt()) {

                tokens.add("IDENTIFIER: " + identifier_matcher.group());
                index += identifier_matcher.group().length();

            } else if (number_matcher.lookingAt()) {

                tokens.add("NUMBER: " + number_matcher.group());
                index += number_matcher.group().length();

            } else if (symbol_matcher.lookingAt()) {

                tokens.add("SYMBOL: " + symbol_matcher.group());
                index += symbol_matcher.group().length();

            } else {

                tokens.add(remaining_code.substring(0, 1));
                index ++;

            }

            if (keyword_matcher.lookingAt()) {

                tokens.add("KEYWORD: " + keyword_matcher.group());

                index += keyword_matcher.group().length();

                continue;

            }
#

Top part:

    private static final Pattern KEYWORD_PATTERN = Pattern.compile("\\b(write|result|value|close)\\b");
    private static final Pattern IDENTIFIER_PATTERN = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");
    private static final Pattern NUMBER_PATTERN = Pattern.compile("\\d+");
    private static final Pattern SYMBOL_PATTERN = Pattern.compile("[\\(\\)\\{\\}\\[\\]=\\+\\-\\*/<>,;:]");
    private static final Pattern STRING_PATTERN = Pattern.compile("\"(?:\\\\\"|[^\"])*\"");

I know you guys aren't experts in Java regex and related things, but I thought you could help me. Can you?

teal sand
#

@past stone Did you resolve this yet?

#

I ran all of this through GPT-4 Turbo

👇

Certainly! It looks like you're trying to build a lexer for your language "Mafdom" and you're encountering difficulties with string tokenization. From the error logs and the code snippets you've provided, I'll offer some guidance.

Your STRING_PATTERN regular expression is intended for double-quoted strings, and it looks like it also attempts to handle escaped double quotes (\"). However, it doesn't cover single-quoted strings. If you want to handle single-quoted strings the same way, you need to adjust your regex pattern to handle both cases.

Additionally, your code for handling the start of strings only accounts for the correct start of a string and consumes the rest of the string without considering whether it ends correctly.

Here's a modified version of the STRING_PATTERN to include both single and double quotes:

private static final Pattern STRING_PATTERN = Pattern.compile("\"(?:\\\\\"|[^\"])*\"|'(?:\\\\'|[^'])*'");

This pattern will recognize both double-quoted strings and single-quoted strings, including escaped quotes.

Next, you should use STRING_PATTERN in the same way you have used the patterns for keywords, identifiers, numbers, and symbols. This involves removing the start-of-string checks and replacing them with a string_matcher.lookingAt(), like so:

if (string_matcher.lookingAt()) {
    tokens.add("STRING: " + string_matcher.group());
    index += string_matcher.group().length();
} else if (keyword_matcher.lookingAt()) {
    // ... Existing code ...
}
// ... Other cases ...
#

This change should ensure that when a string token is found, it properly matches the entire string, whether it's enclosed in single or double quotes and properly handles escape sequences within the string.

Please make sure that your above code snippet reflects this change. With this updated approach, you should be able to tokenize strings correctly in your lexer. Remember that tokenization is just the first step, and you'll also need to handle parsing and interpreting these tokens according to your language specifications.
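To make the suggestion above concrete, here is a minimal sketch of how the two-quote STRING_PATTERN could be driven with lookingAt(). The class name StringPatternDemo and the method stringTokens are made up for illustration, and the sketch only extracts string tokens; it is not the full Mafdom lexer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only STRING_PATTERN comes from the thread.
class StringPatternDemo {
    // Double- or single-quoted string, allowing an escaped quote inside.
    static final Pattern STRING_PATTERN =
            Pattern.compile("\"(?:\\\\\"|[^\"])*\"|'(?:\\\\'|[^'])*'");

    // Walks the input left to right and collects every string literal.
    static List<String> stringTokens(String code) {
        List<String> tokens = new ArrayList<>();
        Matcher m = STRING_PATTERN.matcher(code);
        int index = 0;
        while (index < code.length()) {
            m.region(index, code.length());   // next attempt starts at index
            if (m.lookingAt()) {
                tokens.add("STRING: " + m.group());
                index = m.end();              // advance exactly once per token
            } else {
                index++;                      // not a string start: skip one char
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(stringTokens("\"Hello\" 'Hello, World!' \"123abc\""));
    }
}
```

Using region() on one Matcher avoids building a new substring on every step, and m.end() gives the new index directly, so there is no separate length arithmetic to get wrong.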

peak flicker
#

i can share an example later

#

will be in c++

#

but should show you how to make a lexe

#

r

past stone
#

Unexpectedly, the terminal printed write twice and then "lo", skipping "Hel" and also the first double quote.

#

Different intelligences, different thoughts: this is true of living beings and artificial... beings.

#

As I test it out with Google Bard, the response will of course be different; it might turn out a working script. Let's check it out.

peak flicker
#

you are using regex

#

you dont want to do that

#

you need to make your own regex engine type thing

#

to lex properly

#

i will show you a more intuitive approach in an hour
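For readers who want a feel for the hand-written approach before that example arrives, here is a rough sketch in Java (the thread's language, not the C++ mentioned above). QuoteScanner and scanString are hypothetical names, and the escape rule (a backslash protects the next character) is an assumption about how Mafdom strings should behave.

```java
// Hypothetical helper: scans one quoted string character by character, no regex.
class QuoteScanner {
    /**
     * Returns the index just past the closing quote of the string literal
     * that starts at `start`, or -1 if the literal is never closed.
     * Assumes src.charAt(start) is the opening ' or ".
     */
    static int scanString(String src, int start) {
        char quote = src.charAt(start);       // remember which quote opened it
        int i = start + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\\') {
                i += 2;                       // skip the backslash and the escaped char
            } else if (c == quote) {
                return i + 1;                 // closing quote found
            } else {
                i++;                          // ordinary character
            }
        }
        return -1;                            // ran off the end: unterminated string
    }
}
```

The caller would take src.substring(start, end) as the STRING token and set its index to the returned value, or report "Bad string formation" when it gets -1.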

past stone
#

String detection goes wrong [FIXED]

#

I found that I just needed to put a continue; inside every if branch. 🤡
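Why that works, as far as the posted snippet shows: the first write starts at index 16, so after the KEYWORD branch runs the index is 21. The extra keyword check after the else-if chain then reuses the same matcher, which was built from the old remaining_code, so it adds write a second time and pushes the index to 26. That skips ": 'He" and lands on "llo", which matches the terminal output. A continue in each branch (or simply deleting the trailing check) means each loop iteration consumes exactly one token. Below is a minimal sketch of such a loop; MiniLexer and lex are made-up names, the table-driven loop stands in for the else-if chain, and skipping whitespace instead of emitting it as a token is a simplification of the original.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one branch taken and one index advance per iteration.
class MiniLexer {
    static final String[] NAMES = {"KEYWORD", "IDENTIFIER", "NUMBER", "SYMBOL"};
    static final Pattern[] PATTERNS = {
        Pattern.compile("\\b(write|result|value|close)\\b"),
        Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"),
        Pattern.compile("\\d+"),
        Pattern.compile("[\\(\\)\\{\\}\\[\\]=\\+\\-\\*/<>,;:]")
    };

    static List<String> lex(String code) {
        List<String> tokens = new ArrayList<>();
        int index = 0;
        while (index < code.length()) {
            String rest = code.substring(index);
            char first = rest.charAt(0);
            if (Character.isWhitespace(first)) {   // simplification: drop whitespace
                index++;
                continue;
            }
            if (first == '"' || first == '\'') {
                int end = rest.indexOf(first, 1);  // matching closing quote
                if (end == -1) {
                    tokens.add("Bad string formation");
                    break;                         // nothing sensible left to lex
                }
                tokens.add("STRING: " + rest.substring(0, end + 1));
                index += end + 1;
                continue;                          // token consumed, start over
            }
            boolean matched = false;
            for (int k = 0; k < PATTERNS.length && !matched; k++) {
                Matcher m = PATTERNS[k].matcher(rest);
                if (m.lookingAt()) {
                    tokens.add(NAMES[k] + ": " + m.group());
                    index += m.end();              // advance once, for this token only
                    matched = true;
                }
            }
            if (!matched) {                        // unknown character: emit it raw
                tokens.add(rest.substring(0, 1));
                index++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        lex("\"Hello\" 'Hello' write: 'Hello, World!' write: \"123abc\"")
            .forEach(System.out::println);
    }
}
```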