Because whenever you want to store or transmit a string, only the byte count matters (the actual size of the string). All the fancy Unicode machinery on top of bytes is for the display layer to handle. The default should be grounded in the reality of the programmer.
Storing and transmitting will always work with low-level storage units like bytes, so your string has to be converted to those first anyway. But string manipulation is extremely common in programming, and I would argue graphemes are the most useful unit there; i.e. as a programmer, my preference would be for Swift's behaviour.
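A quick sketch of the gap between those units, in Python (whose `len` counts code points, so the byte and grapheme counts both differ from it; the Swift comparison in the comment is what `String.count` would report there):

```python
s = "e\u0301"  # "é" written as 'e' plus a combining acute accent

print(len(s))                   # 2: code points
print(len(s.encode("utf-8")))   # 3: bytes in the UTF-8 encoding
# Swift's s.count would report 1: one grapheme cluster,
# i.e. one user-perceived character
```

So "what is the length of this string?" has at least three defensible answers, and the grapheme answer is the one that matches what a user sees on screen.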
Human interaction is a more grounded reality for programmers than the dumb land of pure bytes, so even at that conceptual level the default should be the smart one.
And bytes are the only thing that matters for one specific type of string, conveniently named a "sequence of bytes".
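In Python terms that distinct type is `bytes`, and there the length question is unambiguous; a minimal sketch:

```python
text = "é"                    # str: Unicode text, length in code points
data = text.encode("utf-8")   # bytes: the storage/transmission form

print(len(text))   # 1: one code point (precomposed U+00E9)
print(len(data))   # 2: for a byte string, length simply is the byte count
```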
(and it's not expected that a character's length is > 1 unless you've been conditioned to expect it)