Over the last few days I’ve been copying the bulk of the home directories on my primary file server over to a new volume (don’t ask) and, of course, I did a comparison afterwards to make sure the copy was successful. I’m talking about 3070 home directories, comprising over seven million files structured in any number of strange and wonderful ways.
I wasn’t at all surprised to find that 3069 of those directories had copied perfectly; robocopy is pretty reliable.
I was a little surprised to find that one directory had an anomaly, but still, glitches happen. I became puzzled, though, when I realized what the problem was: an entire subdirectory was missing. Robocopy hadn’t reported any errors. What’s more, when I ran robocopy over that home directory again, it reported that there was nothing to do: as far as it was concerned, source and destination were a perfect match.
Explorer didn’t show me much. The names of the two directories in the source looked the same; in both, the first character was shown as a box. Another little tool of mine, though, could see the difference:
The tool escapes non-ASCII characters with a percent sign followed by a hexadecimal representation, so the first wide character is 0xD898 in the first directory and 0xDADB in the second. Otherwise the names are the same. Only the first one was present in the destination.
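To make that escaping scheme concrete, here is a minimal Python sketch of the idea. It is not the tool itself: the function name `escape_name`, the four-digit `%XXXX` format, and the sample name `ab` are all assumptions made for illustration.

```python
def escape_name(units):
    """Render a sequence of 16-bit code units as printable ASCII.

    Anything outside printable ASCII (and the percent sign itself, so the
    output stays unambiguous) is written as '%' plus four hex digits.
    The exact format is an assumption, not the original tool's.
    """
    out = []
    for u in units:
        if 0x20 <= u < 0x7F and u != ord("%"):
            out.append(chr(u))
        else:
            out.append("%{:04X}".format(u))
    return "".join(out)


# Two hypothetical names that differ only in their first code unit.
first = [0xD898, ord("a"), ord("b")]
second = [0xDADB, ord("a"), ord("b")]
print(escape_name(first))
print(escape_name(second))
```

Working on raw 16-bit code units rather than decoded strings is the point here: a decoder may reject or alter an unpaired surrogate before you ever get to look at it.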
The next step, obviously, was to look up the Unicode code points 0xD898 and 0xDADB. As it turns out, they are “high-surrogate code points” (the range 0xD800 to 0xDBFF), used in UTF-16 to encode Unicode code points that don’t fit in 16 bits, that is, those above U+FFFF. The key here is that surrogate code points are only valid in pairs, a high surrogate followed by a low surrogate (0xDC00 to 0xDFFF): an individual surrogate code point is meaningless.
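The pairing rule is easy to check mechanically. The following Python sketch scans a sequence of 16-bit code units and reports any surrogate that is not part of a valid high/low pair. The function name `lone_surrogates` and the sample names are assumptions for illustration; the ranges are the standard UTF-16 surrogate ranges.

```python
def lone_surrogates(units):
    """Return the indices of surrogate code units that are not part of a
    valid (high, low) pair."""
    bad = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate is only valid if a low surrogate follows it.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            bad.append(i)
        i += 1
    return bad


# A hypothetical name beginning with an unpaired high surrogate.
print(lone_surrogates([0xD898, ord("a"), ord("b")]))
# A properly paired surrogate sequence (U+1F642) followed by a letter.
print(lone_surrogates([0xD83D, 0xDE42, ord("a")]))
```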
Of course, NTFS doesn’t care. It doesn’t really understand Unicode, so one 16-bit character is much like another. As far as NTFS is concerned, those are perfectly good (and distinct) names. Robocopy, however, must for some reason be converting or normalizing the UTF-16 strings, and as a result it sees those two names as identical. (It appears to be ignoring the second occurrence of the “same” name in a single directory; it doesn’t attempt to copy the second subdirectory at all.)
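I don’t know what robocopy actually does internally, so the following Python sketch is only one guess at how such a collision could arise: if every unpaired surrogate is replaced with U+FFFD (the Unicode replacement character) before names are compared, two names that differ only in a lone surrogate become identical. The function name `replace_lone_surrogates` and the sample name `example` are hypothetical.

```python
REPLACEMENT = 0xFFFD  # Unicode replacement character


def replace_lone_surrogates(units):
    """Return a copy of the code units with each unpaired surrogate
    replaced by U+FFFD. Valid surrogate pairs are left alone."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        is_low = 0xDC00 <= u <= 0xDFFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            out.extend(units[i:i + 2])  # keep the valid pair intact
            i += 2
        else:
            out.append(REPLACEMENT if (is_high or is_low) else u)
            i += 1
    return out


# Hypothetical names: same tail, different lone high surrogate in front.
tail = [ord(c) for c in "example"]
name_a = [0xD898] + tail
name_b = [0xDADB] + tail

print(name_a == name_b)  # the raw 16-bit names, as NTFS stores them
print(replace_lone_surrogates(name_a) == replace_lone_surrogates(name_b))
```

Any lossy conversion of this kind would produce the same symptom, so this illustrates the failure mode rather than diagnosing it.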
So, if you’re in the habit of creating files with invalid UTF-16 names, be warned. 🙂
 Using some code I wrote myself. Microsoft don’t seem to have provided a reliable directory-level comparison tool, and I’m not aware of any existing third-party solutions. I should open-source that tool one day.