#A clone of `wc` unix command


fresh patrol
#

Hello, I started coding in C relatively recently and decided to write a clone of the wc command, which I called wcc (it just stands for "wc clone").
It isn't fully finished yet: it can't handle stdin and lacks the --help and --version arguments, but I plan to finish it soon and maybe make a new thread or re-post it here.

So, it functions like wc:

  • -l or --lines for line count
  • -w or --words for word count
  • -m or --chars for character count
  • -c or --bytes for byte count
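The four options above could be parsed with `getopt_long()`, which is available on Linux. This is only a minimal sketch, not the thread's actual main.c: the `opt_flags` struct and its `use_bytes` field are mentioned later in the thread, but the other field names and the `parse_flags()` helper are assumptions made for illustration.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag set; only use_bytes is a name that appears in the thread. */
struct opt_flags {
    bool use_lines;
    bool use_words;
    bool use_chars;
    bool use_bytes;
};

/* Parses -l/-w/-m/-c and their long forms into *out.
 * Returns 0 on success, -1 if an unknown option was seen. */
static int parse_flags(int argc, char **argv, struct opt_flags *out)
{
    static const struct option long_opts[] = {
        {"lines", no_argument, NULL, 'l'},
        {"words", no_argument, NULL, 'w'},
        {"chars", no_argument, NULL, 'm'},
        {"bytes", no_argument, NULL, 'c'},
        {NULL, 0, NULL, 0}
    };
    int c;

    *out = (struct opt_flags){false, false, false, false};
    while ((c = getopt_long(argc, argv, "lwmc", long_opts, NULL)) != -1) {
        switch (c) {
        case 'l': out->use_lines = true; break;
        case 'w': out->use_words = true; break;
        case 'm': out->use_chars = true; break;
        case 'c': out->use_bytes = true; break;
        default:  return -1; /* getopt_long already printed a message */
        }
    }
    return 0;
}
```

After `parse_flags()` returns, the global `optind` is the index of the first non-option argument, so file names can be read from `argv[optind]` onward.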

I would like general feedback and hints on what to pay more attention to.

#

Couldn't post full source code, so here is main.c

#

Also works only on Linux, haven't tested on Windows and I assume it won't work there.

# fresh patrol

Fixed a bug where invoking wcc -c <any file> tried to print the word count instead of the byte count, so it printed 0.

# fresh patrol

For anyone interested, check out print_result(), the if (opt_flags->use_bytes) part

# fresh patrol

Fixed a bug where piping something in (e.g. echo "Hello, world!" | wcc -lwmc) or otherwise reading from stdin wouldn't count every option; in the piping example it would only count lines. The reason is pretty simple: pipes and terminals usually don't support file positioning requests, so you can't rewind(stdin). I fixed this by writing everything from stdin to a temporary file and then reading from that instead.
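Whether a stream can be rewound can be checked up front rather than discovered through wrong counts. The sketch below is an illustration only; `is_seekable()` is a hypothetical helper, not a function from wcc. It relies on `ftell()` returning -1 when the stream does not support positioning, as is the case for a pipe.

```c
#include <stdbool.h>
#include <stdio.h>

/* Returns true if fp supports file positioning, i.e. rewind(fp) would
 * actually move back to the start. ftell() fails with -1 on streams
 * such as pipes that cannot report a position. */
static bool is_seekable(FILE *fp)
{
    return ftell(fp) != -1L;
}
```

A caller could use this to decide between a multi-pass strategy for regular files and a single-pass or temporary-file strategy for stdin.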

# fresh patrol

Refactored print_result() to use just sprintf() with an offset, instead of the monstrosity that I wrote before

# fresh patrol

Refactored the process_file() function so that instead of doing 4 passes when using the -lwmc options, it does only one, which makes it much more efficient and removes the need for a temporary file when handling stdin.
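A single pass over the stream might look roughly like the sketch below. It is not the real process_file(): the `file_result` struct, the `line_count` and `word_count` fields, and the `count_stream()` name are assumptions (only `byte_count` and `char_count` appear in the thread). Bytes are counted by re-encoding each wide character with `wcrtomb()`, and a word is taken to be any run of non-whitespace characters, which may differ from GNU wc in edge cases.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical result struct; byte_count and char_count are names from the thread. */
struct file_result {
    unsigned long line_count;
    unsigned long word_count;
    unsigned long char_count;
    unsigned long byte_count;
};

/* Reads fp once and fills in all four counters.
 * Returns 0 on success, -1 on an encoding or I/O error. */
static int count_stream(FILE *fp, struct file_result *fr)
{
    char buf[MB_LEN_MAX];   /* scratch space for the re-encoded character */
    mbstate_t st;
    bool in_word = false;
    wint_t wc;

    memset(&st, 0, sizeof st);
    memset(fr, 0, sizeof *fr);

    while ((wc = fgetwc(fp)) != WEOF) {
        size_t n = wcrtomb(buf, (wchar_t)wc, &st);
        if (n == (size_t)-1)
            return -1;
        fr->byte_count += n;
        fr->char_count++;
        if (wc == L'\n')
            fr->line_count++;
        if (iswspace(wc)) {
            in_word = false;
        } else if (!in_word) {
            in_word = true;     /* first character of a new word */
            fr->word_count++;
        }
    }
    return ferror(fp) ? -1 : 0;
}
```

Because nothing is read twice, the same function works for a regular file and for a non-seekable stdin.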

# fresh patrol

Refactored print_result() again, because I'm an idiot and forgot about printf()'s existence.

harsh oar
#

Why not use something like GitHub for easier version control?

fresh patrol
fresh patrol
#

This will probably be the final version for now; I'll be glad to receive feedback on this final iteration of the code.

solemn laurel
#

you should check ferror() to find out if your fgetwc() didn't like the input (invalid utf8 for instance). also, the character count may be... a bit odd or off with smileys etc (in particular on windows, where it's utf-16 inside)

fresh patrol
# fresh patrol

Refactored some error handling code and added a ferror() check in process_file() in case of invalid wide characters or I/O errors
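Since `fgetwc()` returns WEOF both at end of file and on failure, the check has to happen after the loop. The sketch below shows one way to tell the cases apart; the `read_status` enum and `read_all_wide()` are hypothetical names, and it assumes POSIX behaviour, where an encoding error sets the stream's error indicator and stores EILSEQ in errno.

```c
#include <errno.h>
#include <stdio.h>
#include <wchar.h>

enum read_status { READ_OK, READ_BAD_ENCODING, READ_IO_ERROR };

/* Reads fp to the end with fgetwc() and reports why the loop stopped.
 * *nchars receives the number of characters decoded before that point. */
static enum read_status read_all_wide(FILE *fp, unsigned long *nchars)
{
    *nchars = 0;
    errno = 0;
    while (fgetwc(fp) != WEOF)
        (*nchars)++;

    if (!ferror(fp))
        return READ_OK;                 /* plain end of file */
    /* EILSEQ means the bytes were not valid in the current locale's encoding. */
    return errno == EILSEQ ? READ_BAD_ENCODING : READ_IO_ERROR;
}
```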

fresh patrol
solemn laurel
#

on phone now so I can’t see the latest (you should really put it on github or at least a gist)

fresh patrol
solemn laurel
#

i don’t get the tempbuf[32]

fresh patrol
#

If I don't give wcrtomb() a buffer to write to, it will always return 1 byte, so (fr->byte_count == fr->char_count) will always be true

#

I should probably look into it more, maybe I just missed something, but that's the workaround for now
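The behaviour described here follows from the C standard: when the destination pointer is null, `wcrtomb()` ignores the character argument and acts as if it were converting `L'\0'` into an internal buffer, which yields 1. A small sketch of the difference, using hypothetical helper names `encoded_len()` and `null_dest_len()`:

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Number of bytes wc occupies in the current locale's encoding,
 * or (size_t)-1 if it cannot be represented there. */
static size_t encoded_len(wchar_t wc)
{
    char buf[MB_LEN_MAX];       /* large enough for any single character */
    mbstate_t st;

    memset(&st, 0, sizeof st);  /* start from the initial conversion state */
    return wcrtomb(buf, wc, &st);
}

/* Same call with a null destination: wc is ignored and the call behaves
 * like converting L'\0', so the result is 1 regardless of wc. */
static size_t null_dest_len(wchar_t wc)
{
    mbstate_t st;

    memset(&st, 0, sizeof st);
    return wcrtomb(NULL, wc, &st);
}
```

In a UTF-8 locale, `encoded_len()` would be expected to return more than 1 for a non-ASCII character, while `null_dest_len()` stays at 1. `MB_LEN_MAX` from `<limits.h>` is one portable way to size the scratch buffer.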

fresh patrol
#

Changed the comment inside to (hopefully) better clarify why

solemn laurel
#

cool better than hardcoded 32. did you test what happens with non utf8 binaries?

fresh patrol
#

Not yet, but I probably should

fresh patrol
#

So I decided to test my program on a VM with just an ASCII locale, and it doesn't work :(

#

Can't test with UTF-16LE/BE, but since I don't support Windows it doesn't really matter I guess?

fresh patrol
solemn laurel
#

(almost) nobody uses iso-latin-1 in 2026

fresh patrol
#

I wonder why ISO-8859-1 works fine, but pure ASCII doesn't? Afaik ISO-8859-1 is not a Unicode encoding, so shouldn't it give me an error too?

solemn laurel
#

and you should define "doesn't work": what actually happens?

fresh patrol
#

ASCII doesn't support wide operations while ISO does

#

Since my program uses wide operations, it just zeroes everything out and prints an encoding error to stderr

solemn laurel
#

if it works for iso-8859-1 it works for ascii, because ascii is a subset. unless the input isn't ascii, obviously (which is what I was hinting at with the invalid utf-8 test being needed)

#

iso-latin-1 (the simpler name for 8859-1) does not use wide characters

fresh patrol
#

But it supports them, so I can count normally

#

And I can't do the same with ASCII

#

Let me get some screenshots

solemn laurel
#

no, with iso-latin-1 your byte count and char count will always be exactly equal

#

also if you print your MB_CUR_MAX it will be 1 for both locales
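Checking this takes only a few lines. The sketch below uses a hypothetical helper, `max_bytes_per_char()`, that switches LC_CTYPE to a named locale and returns its `MB_CUR_MAX`; which locale names are installed varies from system to system, so any name other than "C" is an assumption.

```c
#include <locale.h>
#include <stdlib.h>

/* Switches LC_CTYPE to the named locale and returns its MB_CUR_MAX,
 * or 0 if that locale is not available on this system. */
static size_t max_bytes_per_char(const char *locale_name)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return 0;
    return MB_CUR_MAX;
}
```

A value of 1 means every character is exactly one byte in that locale, so the byte and character counts cannot differ; a UTF-8 locale reports a value greater than 1.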

fresh patrol
#

using pure ASCII locale

#

Also something weird with my word counting in the second screenshot compared to wc, so I'll check it out too

solemn laurel
#

so to repeat: if you use ascii and have non-ascii input you get an error (which made you check ferror), just like if you had invalid utf8

and to repeat again too, you can see that the byte count and char count on the one “working” are exactly equal, which shows there are no wide chars in iso-latin-1 either

fresh patrol
#

yes, I understand that the char and byte counts on latin-1 would be equal

solemn laurel
#

well ur not printing both… but…

fresh patrol
#

Let me check real quick then

#

But that should be the case, yes

#

Oh yeah, I forgot that I made a check

#

So it won't even attempt to print out chars

#

Because the count would be equal to bytes

fresh patrol
#

Ohhhh wait, I'm dumb

#

It contains some utf-8 stuff, so that's why it doesn't output

solemn laurel
fresh patrol
solemn laurel
#

and iso latin 1 defines all 256 possible byte values, so it accepts everything in a “garbage in, garbage out” way

#

the first 3 bytes there are called the BOM (byte order mark)

fresh patrol
#

So in the end, I didn't catch that the text I was using had some UTF-8 symbols inside that ASCII can't read

#

therefore an error

solemn laurel
#

how about you try a utf8 locale and see that you get 3 more bytes than characters thanks to the BOM
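The effect of a UTF-8 BOM on the two counts can be reproduced without any locale setup. The sketch below assumes valid UTF-8 input and uses two hypothetical helpers: `has_utf8_bom()` checks for the bytes EF BB BF, and `utf8_char_count()` counts every byte that is not a continuation byte. One caveat: a counter that treats the BOM itself as a character (it encodes the single code point U+FEFF) sees it add two to the gap between bytes and characters; the gap is three only if the BOM is skipped and not counted at all.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if the buffer starts with the UTF-8 byte order mark EF BB BF. */
static bool has_utf8_bom(const unsigned char *s, size_t n)
{
    return n >= 3 && s[0] == 0xEF && s[1] == 0xBB && s[2] == 0xBF;
}

/* Counts code points in valid UTF-8 by counting the bytes that start a
 * sequence, i.e. every byte that is not of the form 10xxxxxx. */
static size_t utf8_char_count(const unsigned char *s, size_t n)
{
    size_t chars = 0;

    for (size_t i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the six bytes EF BB BF 'h' 'i' '\n', this counts four characters: the BOM, two letters, and the newline.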

fresh patrol