Text length inconsistency reading from file. | Together C & C++

gaunt forge Jan 19, 2025, 1:23 PM

#

Hey, I'm trying to read a text file and get the length of the text (number of characters) using ftell. I know a newline is represented by 2 characters on Windows, and I adjusted the length accordingly. My problem is with "encoded" text. For example, for a file containing the text "abcd", ftell returns 4. For a file containing the encoded text ")ְ", ftell also returns 4, but those are 3 characters. I don't know how to accommodate this, or why it occurs. Each character is represented by 8 bits, and the encryption maps bytes 1:1. I tried both "r" and "rb" when opening the file and got the same results. I would appreciate any help and insight on the matter, thanks!

fierce thunder Jan 19, 2025, 1:24 PM

#

gaunt forge Hey, I'm trying to read a text file and get the length of the text (amount of ch...

because the number of bytes isn't the number of characters

#

not every character is 1:1

#

what you're likely using here is UTF-8, so some characters take multiple bytes to encode

#

also, it's encoding, not encrypting; they are different

#

;compile

std::cout << sizeof(")ְ");

heady lava Jan 19, 2025, 1:26 PM

#

In order to properly count UTF-8 characters you would need to read through the entirety of the file.
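A minimal sketch of what that scan could look like, assuming the bytes are valid UTF-8 and have already been read into memory. The function name `count_utf8_code_points` is hypothetical. It counts every byte that is not a continuation byte (bit pattern `10xxxxxx`), which gives the number of code points rather than the number of bytes.

```c
#include <stddef.h>

/* Count UTF-8 code points in a byte buffer by skipping continuation
 * bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8;
 * malformed input is not detected. */
size_t count_utf8_code_points(const unsigned char *buf, size_t n_bytes) {
    size_t count = 0;
    for (size_t i = 0; i < n_bytes; i++) {
        if ((buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Note that a code point is not always what a reader perceives as one character: combining marks are separate code points that render on top of the preceding character.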

wild yokeBOT Jan 19, 2025, 1:26 PM

#

Program Output

fierce thunder Jan 19, 2025, 1:26 PM

#

see its 4 chars (+ 1 for the null)

gaunt forge Jan 19, 2025, 1:35 PM

#

fierce thunder what you're likely using here is utf8 and so some characters use multiple indivi...

How can I approach this problem? Let me give one more example. For the text "A!5GFs/", ftell returns 7. Those are 7 characters represented by 7 bytes. After encrypting them using the attached encryption (which is byte for byte, so the number of bytes stays the same), the result is "A!2§C׃/". ftell returns 9 for that text. How is that possible?

fierce thunder Jan 19, 2025, 1:38 PM

#

gaunt forge How can I approach this problem? Let me give one more example. For the text: "A!...

your implementation must be incorrect then, because if you aren't adding extra bytes, it's impossible to end up with extra bytes

#

send your implementation

gaunt forge Jan 19, 2025, 1:46 PM

#

fierce thunder send your implementation

This is the encryption algorithm:

```cpp
int enc(const unsigned char *data_in, unsigned int size_in, unsigned char *data_out, unsigned int size_out) {
    if ((data_in == NULL) || (data_out == NULL))
        return ERR_NULL_PTR;
    int j, i, bit1, bit2;
    char a1[6], a2[4], b1[4], b2[6];
    a1[5] = '\0';
    a2[3] = '\0';
    b1[3] = '\0';
    b2[5] = '\0';
    int odd = size_in % 2;
    printf("Size going in is %d", size_in);
    char new_a[9], new_b[9];
    for (i = 0; i < (int)size_in; i = i + 2) {
        if ((odd == 1) && (i == size_in - 1)) {
            data_out[i] = data_in[i];
            break;
        }
        for (j = 0; j < 8; j++) {
            if ((j >= 0) && (j < 5))
                a1[j] = (char)!!((data_in[i] << j) & 0x80) + 48;
            if ((j >= 5) && (j < 8))
                a2[j - 5] = (char)!!((data_in[i] << j) & 0x80) + 48;
            if ((j >= 0) && (j < 3))
                b1[j] = (char)!!((data_in[i + 1] << j) & 0x80) + 48;
            if ((j >= 3) && (j < 8))
                b2[j - 3] = (char)!!((data_in[i + 1] << j) & 0x80) + 48;
        }
        strcpy(new_a, a1);
        strcat(new_a, b1);
        strcpy(new_b, a2);
        strcat(new_b, b2);
        bit1 = strtol(new_a, NULL, 2);
        bit2 = strtol(new_b, NULL, 2);
        data_out[i] = (unsigned char)bit1;
        data_out[i+1] = (unsigned char)bit2;
    }
    return OK;
}
```

fierce thunder Jan 19, 2025, 1:47 PM

#

```cpp
strcpy(new_a, a1);
strcat(new_a, b1);
strcpy(new_b, a2);
strcat(new_b, b2);
bit1 = strtol(new_a, NULL, 2);
bit2 = strtol(new_b, NULL, 2);
```
any reason you're using these?

gaunt forge Jan 19, 2025, 1:47 PM

#

I was able to verify the encryption with the text:

IT - This is my sentence. i like to moove it moove it and if you can't move it i don't know!
OK this sitting is unaccatple if i like to moove it! can you moove-it? not a single
MOoVE. or re-moOvE ok? am i (moove) or mooveing?
Done!it

which was correctly encrypted to:

J4!
"k    q`k3#
y sekװc®ce)ְi k‰ke#iאk¯kצa k4#
kןsֵ#    q€c.a€k&#kץ#c.#פ#
kצa k4#    #kמ#פ#kֿqב
OI`sˆk3#k4s‰kַ#    q`s®c#cask…#    aְi k‰ke#iאk¯kצa k4! caiְ{/q k¯kצak49אkֿq€a sikַk…
MKןRֵ)ְkע#ak¯KצA kכ9אc-#    !k¯kצa©#q@k¯kצc©kַ8ךCkֵ#)p

fierce thunder Jan 19, 2025, 1:48 PM

#

fierce thunder ```cpp strcpy(new_a, a1); strcat(new_a, b1); strcpy(new_b, a2); strcat(new_b, b2...

seems like an incredibly easy way to accidentally add more characters on to the end of your strings

gaunt forge Jan 19, 2025, 1:49 PM

#

fierce thunder ```cpp strcpy(new_a, a1); strcat(new_a, b1); strcpy(new_b, a2); strcat(new_b, b2...

I used those to form the new bytes, for example combining the 5 MSB of the first byte with the 3 MSB of the second byte to form the encrypted first byte
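For comparison, the same bit mixing can be sketched with shifts and masks instead of intermediate strings. This is a hypothetical rewrite, not the poster's code: the name `mix_pairs` and its signature are made up for illustration, and it assumes `out` has room for at least `n` bytes. It never writes more than `n` bytes, so the transform itself cannot change the length.

```c
#include <stddef.h>

/* Hypothetical bitwise equivalent of the string-based pair mixing.
 * For each pair of input bytes (a, b):
 *   out[i]     = top 5 bits of a, followed by top 3 bits of b
 *   out[i + 1] = low 3 bits of a, followed by low 5 bits of b
 * A trailing unpaired byte (odd n) is copied unchanged. */
void mix_pairs(const unsigned char *in, size_t n, unsigned char *out) {
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        unsigned char a = in[i];
        unsigned char b = in[i + 1];
        out[i]     = (unsigned char)((a & 0xF8) | (b >> 5));
        out[i + 1] = (unsigned char)(((a & 0x07) << 5) | (b & 0x1F));
    }
    if (i < n)
        out[i] = in[i];
}
```

For the 7-byte ASCII input "A!5GFs/" this yields the 7 bytes 41 21 32 A7 43 D3 2F (hex). The bytes A7 and D3 are outside ASCII, so how they display depends on the encoding the viewer assumes.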

fierce thunder Jan 19, 2025, 1:51 PM

#

ah wait I misread nevermind

#

I can see how you would get 1 extra character

#

for (i = 0; i < (int)size_in; i = i + 2) {
  // ...
  data_out[i] = (unsigned char)bit1;
  data_out[i+1] = (unsigned char)bit2;

if size is an odd number then i + 1 will add an extra character at the end

#

which might overwrite a null terminator I suppose

#

and then you accidentally get n new characters

gaunt forge Jan 19, 2025, 1:53 PM

#

I think this segment of my code should cover this corner case:

    for (i = 0; i < (int)size_in; i = i + 2) {
        if ((odd == 1) && (i == size_in - 1)) {
            data_out[i] = data_in[i];
            break;
        }

does it not?

#

If I'm currently on the last byte and size is odd, the last byte is copied unchanged and the break jumps out of the loop

fierce thunder Jan 19, 2025, 1:55 PM

#

ah I didnt see that

#

welp yeah I mean looks fine

#

are you expecting the output to be null terminated?

#

and is it zero'd to start?

gaunt forge Jan 19, 2025, 1:57 PM

#

I don't think I understand the question, can you ask again please? 😅

gaunt forge Jan 19, 2025, 2:00 PM

#

fierce thunder are you expecting the output to be null terminated?

The output is the encrypted text, the memory was allocated with malloc according to the length of the text (obtained with ftell)

fierce thunder Jan 19, 2025, 2:00 PM

#

gaunt forge The output is the encrypted text, the memory was allocated with malloc according...

but how is it output?

#

also ftell isn't always that reliable about the size of files but that's probably fine here

gaunt forge Jan 19, 2025, 2:02 PM

#

fierce thunder but how is it output?

If the function enc returns "OK", data_out is written to a text file

fierce thunder Jan 19, 2025, 2:02 PM

#

how though

#

how is the buffer written

gaunt forge Jan 19, 2025, 2:03 PM

#

fierce thunder also `ftell` isn't always that reliable about the size of files but that probabl...

Honestly I assumed ftell was the problem because it returned a value higher than expected

fierce thunder Jan 19, 2025, 2:04 PM

#

gaunt forge Honestly I assumed ftell was the problem because it returned a value higher than...

the strings that you sent are that length
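A small sketch to check that claim for the text "A!2§C׃/" quoted earlier in the thread. It assumes the two non-ASCII characters are U+00A7 (§) and U+05C3 (׃) as they appear in the pasted message, and that the file holding them was saved as UTF-8. The helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes UTF-8 needs to encode one Unicode code point. */
size_t utf8_encoded_len(uint32_t cp) {
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Total UTF-8 byte length of a sequence of code points. */
size_t utf8_total_len(const uint32_t *cps, size_t n) {
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += utf8_encoded_len(cps[i]);
    return total;
}
```

Under those assumptions, the code points 0x41, 0x21, 0x32, 0xA7, 0x43, 0x5C3, 0x2F add up to 1+1+1+2+1+2+1 = 9 bytes, which matches the value ftell reported for 7 visible characters.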

gaunt forge Jan 19, 2025, 2:04 PM

#

fierce thunder how is the buffer written

Let me attach the code:
Memory allocation:

int allocate_buffer(void **buf, unsigned int buf_size) {
  // TODO
    if (buf == NULL)
        return ERR_NULL_PTR;
    *buf = (unsigned char*)malloc((buf_size) * sizeof(unsigned char));
    if (*buf == NULL) {
        return ERR_MEMORY;
    }
    return OK;
}

reading from input

int load_data_from_file(const char *input_file_path, unsigned char **buf,
                        unsigned int *buf_size) {
  // TODO
    if ((buf == NULL) || (input_file_path == NULL))
        return ERR_NULL_PTR;
    char c;
    FILE* fp = fopen(input_file_path, "r");
    if (fp == NULL)
        return ERR_FILE;
    for (int i = 0; i < (int)*buf_size; i++) {
        fscanf(fp, "%c", &c);
        (*buf)[i] = (unsigned char)c;
    }
    fclose(fp);
    return OK;
}

writing to output:

int write_data_to_file(const char *output_file_path, const unsigned char *buf,
                       unsigned int buf_size) {
  // TODO
    if ((buf == NULL) || (output_file_path == NULL))
        return ERR_NULL_PTR;
    FILE* fp = fopen(output_file_path, "w");
    if (fp == NULL)
        return ERR_FILE;
    for (int i = 0; i < (int)buf_size; i++) {
        fprintf(fp, "%c", (char)buf[i]);
    }
    fclose(fp);
    return OK;
}

fierce thunder Jan 19, 2025, 2:05 PM

#

so I'm more suspicious of how data is getting into the file

gaunt forge Jan 19, 2025, 2:05 PM

#

Hopefully I understood your question?

fierce thunder Jan 19, 2025, 2:05 PM

#

Okay what

#

no no no thats not how you write data to files

#

look into fwrite and fread

#

you're not just writing strings you're writing arbitrary data

#

and trying to pass that through the character APIs might add in extra encoding characters for locales and things

#

how you print a character is not always the same as the character itself

gaunt forge Jan 19, 2025, 2:08 PM

#

I tried to use fread and fwrite but didn't get the right results. Should I use a loop with those?

fierce thunder Jan 19, 2025, 2:08 PM

#

no you just need 1 function

gaunt forge Jan 19, 2025, 2:08 PM

#

fierce thunder how you print a character is not always the same as the character itself

I tried to make sure the buffer is unsigned char so the actual bytes values stay the same

fierce thunder Jan 19, 2025, 2:08 PM

#

https://en.cppreference.com/w/c/io/fwrite

fierce thunder Jan 19, 2025, 2:08 PM

#

gaunt forge I tried to make sure the buffer is unsigned char so the actual bytes values stay...

it's not that, it's that a char value may be different when printed

#

I believe

#

regardless this is a terribly inefficient way to output

#

fwrite(buf, 1, buf_size, fp) something like that

#

and fread for reading in
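A rough sketch of that suggestion, with the files opened in binary mode so the C library passes bytes through untouched. The names `write_bytes` and `read_bytes` are hypothetical and the error handling is kept minimal.

```c
#include <stdio.h>
#include <stddef.h>

/* Write exactly n bytes to path in binary mode.
 * Returns 0 on success, -1 on any failure. */
int write_bytes(const char *path, const unsigned char *buf, size_t n) {
    FILE *fp = fopen(path, "wb");
    if (fp == NULL)
        return -1;
    size_t written = fwrite(buf, 1, n, fp);
    int close_failed = (fclose(fp) != 0);
    return (written == n && !close_failed) ? 0 : -1;
}

/* Read up to n bytes from path in binary mode.
 * Returns the number of bytes actually read (0 if the file cannot be opened). */
size_t read_bytes(const char *path, unsigned char *buf, size_t n) {
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return 0;
    size_t got = fread(buf, 1, n, fp);
    fclose(fp);
    return got;
}
```

The mode matters for arbitrary data: on Windows, a stream opened with "w" expands each 0x0A byte to 0x0D 0x0A on output, so a buffer that happens to contain that byte value grows when written in text mode.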

gaunt forge Jan 19, 2025, 2:10 PM

#

gaunt forge I was able to verify the encryption with the text: ``` IT - This is my sentence....

Though for this text the output is definitely right. So I assume the problem is with multi-byte characters?

fierce thunder Jan 19, 2025, 2:12 PM

#

idk but what you're doing is definitely wrong

gaunt forge Jan 19, 2025, 2:12 PM

#

fierce thunder idk but what you're doing is definitely wrong

I'll try to implement my functions with fread and fwrite, will update soon whether I get the expected results or not. Thank you!

fierce thunder Jan 19, 2025, 2:17 PM

#

It could also possibly be ftell ... I think it's unlikely. But you should probably switch to an alternative, since you don't actually want to be using plain r and w; you want rb and wb, since you're not working with characters

#

and then ftell would be unreliable

#

The alternative here is probably fstat (or _fstat in windows) to get the actual size of the file

gaunt forge Jan 19, 2025, 2:20 PM

#

fierce thunder It could also possible be `ftell` ... I think it unlikely. But you should probab...

I changed my functions of reading and writing to the following:

int write_data_to_file(const char *output_file_path, const unsigned char *buf,
                       unsigned int buf_size) {
  // TODO
    if ((buf == NULL) || (output_file_path == NULL))
        return ERR_NULL_PTR;
    FILE* fp = fopen(output_file_path, "w");
    if (fp == NULL)
        return ERR_FILE;
    fwrite(buf, sizeof(unsigned char), buf_size, fp);
    fclose(fp);
    return OK;
}

int load_data_from_file(const char *input_file_path, unsigned char **buf,
                        unsigned int *buf_size) {
  // TODO
    if ((buf == NULL) || (input_file_path == NULL))
        return ERR_NULL_PTR;
    FILE* fp = fopen(input_file_path, "r");
    if (fp == NULL)
        return ERR_FILE;
    fread(*buf, sizeof(unsigned char), *buf_size, fp);
    fclose(fp);
    return OK;
}

The results are sadly the same, but this is definitely more efficient

gaunt forge Jan 19, 2025, 2:20 PM

#

fierce thunder The alternative here is probably `fstat` (or `_fstat` in windows) to get the act...

I'll try that as well

fierce thunder Jan 19, 2025, 2:22 PM

#

there actually is no standard c way to determine the size of a file

#

which is a problem

#

but fstat is good enough

#

the problem is that with r or w the ftell result has no meaning other than as a parameter to fseek (it's not necessarily the actual byte offset)
but in rb and wb seeking to the end isn't guaranteed to work
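For reference, here is the common seek-to-end idiom as a sketch. The caveat is that the C standard says a binary stream need not meaningfully support fseek with SEEK_END, so this is a widely used convention rather than a portable guarantee. In practice it works on mainstream desktop platforms for regular files. The name `file_size_by_seek` is hypothetical.

```c
#include <stdio.h>

/* Return the size of a file in bytes using fseek/ftell on a binary
 * stream, or -1 on failure. Not guaranteed by the C standard; relies
 * on the platform supporting SEEK_END for binary streams. */
long file_size_by_seek(const char *path) {
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    long size = -1;
    if (fseek(fp, 0L, SEEK_END) == 0)
        size = ftell(fp);
    fclose(fp);
    return size;
}
```

The result is limited to what a long can hold, so very large files need a platform-specific alternative.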

gaunt forge Jan 19, 2025, 2:26 PM

#

It seems that for fstat I need to include the following:

#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

Which are not allowed in the task I'm doing

gaunt forge Jan 19, 2025, 2:27 PM

#

fierce thunder the problem is that with `r` or `w` the result has no meaning other than as a pa...

I don't mind using "r" and "w" instead of "rb" and "wb", its just that I saw the same results for both of them

fierce thunder Jan 19, 2025, 2:38 PM

#

gaunt forge I don't mind using "r" and "w" instead of "rb" and "wb", its just that I saw the...

the problem is that you can't use r or w to determine the size of a file because the ftell result isn't a size
but you also can't rely on rb or wb, because fseek to the end (to get ftell to tell you the size) isn't guaranteed to work there

#

generally the only way with these is to allocate as you read

#

so you allocate a block, then try to read, then if you use up your space you reallocate it to be bigger, etc
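The allocate-as-you-read approach could be sketched like this, using only standard headers. The function name `read_all`, the initial capacity of 4096 and the doubling growth factor are illustrative choices, not anything prescribed in the thread. The caller is assumed to have opened the stream in binary mode and is responsible for freeing the returned buffer.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read an entire stream into a heap buffer that grows as needed.
 * On success returns the buffer and stores the byte count in *out_len.
 * On failure returns NULL. The caller must free() the result. */
unsigned char *read_all(FILE *fp, size_t *out_len) {
    size_t cap = 4096, len = 0;
    unsigned char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;
    for (;;) {
        len += fread(buf + len, 1, cap - len, fp);
        if (len < cap)
            break;                      /* short read: EOF or error */
        size_t new_cap = cap * 2;       /* buffer full: double it */
        unsigned char *bigger = realloc(buf, new_cap);
        if (bigger == NULL) {
            free(buf);
            return NULL;
        }
        buf = bigger;
        cap = new_cap;
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    *out_len = len;
    return buf;
}
```

With this approach the length comes from what fread actually delivered, so no separate size query is needed.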

gaunt forge Jan 19, 2025, 2:56 PM

#

fierce thunder so you allocate a block, then try to read, then if you use up your space you rea...

I see. I'll try to wrestle with it a bit longer and hopefully I'll figure it out. Thank you for all your help! I appreciate it 🙏
