Tuesday, September 22, 2009

Factor and Unicode

I'm just starting with Factor, more for fun than anything else. Selling it at work will be a real task!

Anyway, I was curious about its Unicode support.

So I tried to reverse a Kannada word! Here's what I got.

( scratchpad ) "ಕನ್ನಧ"

--- Data stack:
"ಕನ್ನಧ"
( scratchpad ) dup

--- Data stack:
"ಕನ್ನಧ"
"ಕನ್ನಧ"
( scratchpad ) reverse dup reverse

--- Data stack:
"ಕನ್ನಧ"
"ಧನ್ನಕ"
"ಕನ್ನಧ"
( scratchpad ) drop

--- Data stack:
"ಕನ್ನಧ"
"ಧನ್ನಕ"
So far so good. But let's change the vowels!
( scratchpad ) "ಸಿದ್ದಾರ್ಥ"

--- Data stack:
"ಸಿದ್ದಾರ್ಥ"
( scratchpad ) dup reverse

--- Data stack:
"ಸಿದ್ದಾರ್ಥ"
"ಥ್ರಾದ್ದಿಸ"
( scratchpad )
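Now we have a problem! The vowel signs and viramas have ended up on different consonants, because reverse works on individual code points rather than on characters as a reader sees them. The first word only survived because its single virama sits between two identical consonants. Unicode support is covered in the FAQ.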
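
Reversing by grapheme cluster instead of by code point should keep each vowel sign attached to its consonant. Here is a minimal sketch, assuming the >graphemes and string-reverse words from the unicode.breaks vocabulary (newer Factor releases may expose them from the unicode vocabulary instead, so adjust the USING: line to match your version). I have not shown the output, since how clusters are split depends on the Unicode rules your release implements.

```factor
! Assumption: >graphemes and string-reverse come from unicode.breaks;
! on newer releases try the unicode vocabulary instead.
USING: prettyprint sequences unicode.breaks ;

! Count grapheme clusters: what a reader would call "characters".
"ಸಿದ್ದಾರ್ಥ" >graphemes length .

! Reverse cluster by cluster rather than code point by code point.
"ಸಿದ್ದಾರ್ಥ" string-reverse .
```

Comparing the result with the plain reverse output shows which marks were moved along with their consonants.
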
Quoting from the FAQ:

Does Factor support Unicode?

There is no one meaning to the phrase "Unicode support", but there are a few things that a modern programming language is expected to support in its library: UTF-8/UTF-16 input and output, Unicode collation, Unicode-appropriate casing operations, normalization, strings can hold any Unicode code point, and support for Unicode text rendering in the UI. Of these, Factor supports all but Unicode font rendering, which should be finished before 1.0 comes out.

How do I convert a character to upper or lower case in Unicode?

This isn't a well-defined operation. For example, the ß character becomes SS in upper case. Some letters have context-dependent case mappings. So if you need to change the case of something, use strings, not individual characters. The Factor Unicode library doesn't implement character mapping, because the behavior could only be incorrect. If what you're converting is just ASCII, then there are character conversion routines defined just for that. For case-insensitive comparison, partial collation keys might be appropriate.
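
To try the string-level casing the FAQ recommends, here is a small sketch. It assumes the >upper word from the unicode.case vocabulary (on newer releases it may live in the unicode vocabulary), and the German input is just an illustration of the ß case the FAQ mentions.

```factor
! Assumption: >upper comes from unicode.case;
! on newer releases try the unicode vocabulary instead.
USING: prettyprint unicode.case ;

! Case conversion is applied to the whole string, so one character
! is free to turn into two, as the FAQ describes for ß.
"straße" >upper .
```

Because the conversion takes a string, the result is allowed to be longer than the input, which a character-to-character mapping could not express.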