
What do you mean by "character"? If you mean code point or "Unicode scalar value", sure, but if you mean user-visible character (grapheme), it's much more complicated: even something "simple" like ö could be one or two code points.


I mean your iterator is char* and you advance it by adding. That's it.

I do NOT mean that char itself corresponds to a glyph or codepoint; you're seriously preaching to the choir with that lecture.


>you advance it by adding

And when do you stop? UTF-8 strings can have zero bytes in them, so treating them as C strings is potentially error-prone depending on the context.


> UTF-8 strings can have zero bytes in them

This is not true. A zero-byte in a utf-8 string is the null-terminator and utf-8 strings can be treated exactly like C strings in terms of where the string ends.

What you do need to look out for is malformed utf-8: for example, one byte before the null terminator you get a lead byte saying the next character is 4 bytes long.

If you're not checking each byte for null and just skipping based on the length indicated by the lead byte then you're in for a crash.

Where utf-8 strings differ from C strings is slicing. You can't just slice the string at some random point without doing extra validation to make sure you only slice on codepoint boundaries.
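Both of those hazards (a lying lead byte running past the terminator, and slicing mid-codepoint) are cheap to guard against by checking every byte as you advance. A minimal sketch in C; the helper names (`utf8_seq_len`, `utf8_next`, `utf8_is_boundary`) are made up here, not from any standard library:

```c
#include <stddef.h>

/* Sequence length implied by a UTF-8 lead byte (0 = continuation/invalid). */
static size_t utf8_seq_len(unsigned char b) {
    if (b < 0x80)           return 1;  /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Advance one code point, checking every byte so a lying lead byte
   can't carry us past the null terminator. NULL = malformed input. */
static const char *utf8_next(const char *s) {
    size_t n = utf8_seq_len((unsigned char)s[0]);
    if (n == 0) return NULL;
    for (size_t i = 1; i < n; i++)
        if (s[i] == '\0' || ((unsigned char)s[i] & 0xC0) != 0x80)
            return NULL;  /* truncated or broken sequence */
    return s + n;
}

/* Valid slice points are exactly the non-continuation bytes. */
static int utf8_is_boundary(unsigned char b) {
    return (b & 0xC0) != 0x80;
}
```

With this, iterating code points is `for (p = s; p && *p; p = utf8_next(p))`, and a slice at byte offset i is safe only if `utf8_is_boundary(s[i])` holds.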


> A zero-byte in a utf-8 string is the null-terminator and utf-8 strings can be treated exactly like C strings in terms of where the string ends.

No, the parent was correct: UTF-8 encodes NUL (i.e. \0) as a single zero byte (in contrast, Modified UTF-8[1] uses an overlong encoding for NUL, so there's never any possibility of an internal zero). Of course, an application/library can choose to restrict itself to only handling UTF-8 that doesn't contain internal NULs, but the spec itself allows for zero bytes in a string.

[1]: https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8
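To make this concrete: the byte sequence 'A', 0x00, 'B' is well-formed UTF-8 (three code points, one of them U+0000), but C string functions only ever see the first byte. A sketch of a counter that takes an explicit length instead (`utf8_count_cp` is a name invented here):

```c
#include <stddef.h>
#include <string.h>

/* Count code points in `len` bytes of (assumed valid) UTF-8.
   Taking an explicit length means an internal U+0000 is just data;
   strlen() on the same buffer would stop at the first zero byte. */
static size_t utf8_count_cp(const char *s, size_t len) {
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)  /* lead or ASCII byte */
            n++;
    return n;
}
```

For the buffer {'A', 0, 'B'}, utf8_count_cp(buf, 3) is 3, while strlen(buf) is 1.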


We are in agreement that the only time a single zero byte can be found in well-formed utf-8 is for the NUL character.

By definition, with a null-terminated string, NUL is the terminator.

If you want to have strings that contain NUL, then by definition you can't use a null-terminated string.

This is true of utf-8 or regular C strings.


The point is, if you handle strings the C way, you're not in conformance with UTF-8.

If someone passes you a text file that is verified to be valid UTF-8 and contains, say, access permissions, then you better not stop parsing it at the first '\0' character.

None of this is a huge problem, but it's something to be aware of. C string handling is incompatible with UTF-8.


File processing and string processing are not the same. If you have a file that has a specific data format outside of the encoding, and that format includes NUL bytes as part of the data, then obviously process the file based on that format.

That's separate from string handling.

UTF-8 was originally designed to be compatible with NUL terminated strings and keep NULs out of well formed text.

In fact, it was the first point in the 'Criteria for the Transformation Format' section of the initial proposal for UTF-8.

https://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt


>File processing and string processing are not the same

The UTF-8 spec doesn't make that distinction as far as I know. There's a simple fact: a valid UTF-8 byte sequence can contain NUL characters. So you can't naively use C string handling functions on it. And as someone else has correctly pointed out, the same is true for ASCII.

I'm just pointing out a potential pitfall and a source of security issues. Some might assume that after validating UTF-8 text input, you could just dump it in a C string and process it using C's string functions. But that's not the case.


Unless you have U+0000, no other sequence of code points produces a 0x00 byte in UTF-8. I don't see this as a huge problem.

If you really do need it, there are some C libraries that use "pascal-ish" structs for strings; UNICODE_STRING in Windows comes to mind. Doing strings in C doesn't force you to use C strings; it's just the most common thing to do.
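A minimal sketch of such a "pascal-ish" string in C, loosely in the spirit of UNICODE_STRING (the struct and function names here are made up for illustration): carry the length alongside the pointer, and internal NULs stop being special:

```c
#include <stddef.h>
#include <string.h>

/* A hypothetical length-prefixed ("pascal-ish") string. Windows'
   UNICODE_STRING follows the same idea (length + buffer) for UTF-16. */
typedef struct {
    size_t len;        /* bytes of text, internal NULs included */
    const char *data;  /* need not be null-terminated */
} Utf8Slice;

static Utf8Slice utf8_slice(const char *data, size_t len) {
    Utf8Slice s = { len, data };
    return s;
}

/* Compare by length and bytes, never by scanning for a terminator. */
static int utf8_slice_eq(Utf8Slice a, Utf8Slice b) {
    return a.len == b.len && memcmp(a.data, b.data, a.len) == 0;
}
```

Because comparison uses memcmp over the stored length, a slice holding "A\0B" compares as three bytes, even though strlen would report one.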


No it's not a huge problem, but if you're not aware of it, it could easily lead to a security breach: https://news.ycombinator.com/item?id=13974919


It's the same for ASCII: the UTF-8 zero byte is NUL.



