A long time ago, way back in the Sixties, somebody invented ASCII (American Standard Code for Information Interchange). Since computers can only deal with numbers internally, ASCII provides a standard so that numbers can represent letters, punctuation, spaces, and all the other characters you see on your computer screen. For instance, in ASCII the number 65 stands for an uppercase A, 97 stands for a lowercase a, 32 represents a space, and so on. Each ASCII character requires 7 bits of storage, which happens to fit nicely into a computer byte, which is 8 bits long.
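The mapping between characters and numbers is easy to check for yourself. The following is a minimal Python sketch (not code from Pineapple News) that looks up the three codes mentioned above and confirms that each one fits in 7 bits. The variable names are only illustrative.

```python
# Look up the ASCII code number for each character mentioned above.
for ch in ("A", "a", " "):
    print(repr(ch), "->", ord(ch))

codes = [ord(ch) for ch in "Aa "]

# Every ASCII code fits in 7 bits, that is, in the range 0 through 127.
all_seven_bit = all(code < 128 for code in codes)

# The mapping also works in the other direction.
letter = chr(65)
```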
ASCII has become the single most entrenched standard in all of computer-dom. Its wide use is why you are able to transfer text files from one computer architecture to another without trouble, from a Macintosh to a DEC minicomputer, say. Alas, ASCII just isn't up to the big job of displaying all the world's many and varied languages. Since each character is stored in 7 bits there can be only 128 unique values, which is suitable for displaying simple text in U.S. English but nothing else.
So the concept of "character sets" was invented to allow the display of more information -- accented characters, monetary unit symbols like the pound and the Euro, and so on -- needed for other languages. The easiest and most obvious first step is to expand the character width to 8 bits, which still fits into a single computer byte but doubles the number of displayable characters. Many character sets use the same mapping for the first 128 characters that ASCII uses and add 128 additional characters, thereby retaining some backward compatibility. This is how ISO-8859-1 (popular in Europe) and many other character sets are designed.
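As a rough illustration of that backward compatibility, the Python sketch below (my own example, not program code) decodes the same bytes as both ASCII and ISO-8859-1. The lower 128 values come out identical, while a byte above 127 only has a meaning in the 8-bit set.

```python
# All 128 byte values that plain ASCII defines.
ascii_bytes = bytes(range(128))

# The first 128 values decode identically under both character sets.
same_low_half = ascii_bytes.decode("ascii") == ascii_bytes.decode("iso-8859-1")

# Values 128-255 are where ISO-8859-1 puts its additional characters.
e_acute = bytes([0xE9]).decode("iso-8859-1")  # lowercase e with acute accent
pound = bytes([0xA3]).decode("iso-8859-1")    # pound sterling sign

# The same high byte is simply invalid in 7-bit ASCII.
try:
    bytes([0xE9]).decode("ascii")
    ascii_accepts_high_byte = True
except UnicodeDecodeError:
    ascii_accepts_high_byte = False
```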
So now there's a whole morass of character sets that are 8-bit extensions of ASCII optimized for particular languages or cultures: ISO-8859-1 through ISO-8859-13, to name the most popular bunch. But this scheme leaves much to be desired. That's a lot of character sets to keep up with. A computer program designed to deal with multiple languages must keep track of all of them and have some way of knowing which character set was used to create a given document. And some languages, Japanese for instance, have far too many unique characters to fit in a character set that is simply an 8-bit extension of ASCII. This has led to bizarre character sets like Shift-JIS, which is almost as complicated as a programming language. Wouldn't it be better if we could use just one character set to display all languages?
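To see why a program has to know which character set produced a document, consider one byte read three different ways. This is a small Python sketch using three members of the ISO-8859 family (Western European, Cyrillic, and Greek). The choice of byte and of character sets is arbitrary.

```python
# One byte with the value 0xE9 (233 decimal).
mystery_byte = bytes([0xE9])

# Decode it under three different 8-bit extensions of ASCII.
charsets = ["iso-8859-1", "iso-8859-5", "iso-8859-7"]
readings = {name: mystery_byte.decode(name) for name in charsets}

# Each character set assigns a different character to the same number.
for name, ch in readings.items():
    print(f"{name}: {ch!r}")
```

Without outside information about the character set, there is no way to tell from the byte alone which of these the writer meant.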
Enter Unicode. As originally designed, a single Unicode character is 16 bits long and occupies two computer bytes. That makes a total of over 65,000 displayable characters, which was meant to be enough to handle all the world's languages and then some. (Actually 16 bits isn't enough for all the thousands of ideograms in Asian languages, and later versions of the standard grew beyond it, but that's another topic altogether.) Unfortunately ASCII and the concept of the 8-bit character are far too entrenched for Unicode to take over the world and become the dominant standard any time soon. This has led to compromise character sets like UTF-8, which represents each 16-bit Unicode character as one, two, or three bytes and provides pretty good backward compatibility with ASCII. (UTF-8 is used extensively in BeOS, by the way.)
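The variable-length nature of UTF-8 can be seen with a few sample characters. This Python sketch (illustrative only) counts how many bytes each one needs and checks that plain ASCII text is byte-for-byte identical in UTF-8.

```python
# An ASCII letter, an accented Latin letter, and the Euro sign.
samples = ["A", "\u00e9", "\u20ac"]

# Count the bytes UTF-8 uses for each character.
lengths = [len(ch.encode("utf-8")) for ch in samples]
for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")

# ASCII characters keep their single-byte ASCII encoding in UTF-8,
# which is where the backward compatibility comes from.
ascii_compatible = "Hello".encode("utf-8") == "Hello".encode("ascii")
```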
Neither Unicode nor its kissin' cousin UTF-8 has taken over the world. On USENET they are used far less than ISO-8859-1, in fact. So a program that tries to be a good international citizen must take all of this into account.
This is the way things would work ideally: Every USENET article ever posted would have a header line indicating what character set was used to create it and Pineapple News would recognize that character set and decode it properly into something displayable on your particular computer. If you believe that is what is actually going to happen I've got a bridge I'd like to sell you.
Many things can and do go wrong. Some articles do not have a header line saying what character set was used to create them. Some articles have such a line but it indicates a character set that Pineapple News is not capable of decoding. Even worse, sometimes an article will indicate a character set that was not the one used to create the article.
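The last failure, an article labeled with the wrong character set, is the sneakiest because decoding usually "succeeds" and just yields the wrong characters. Here is a minimal Python sketch of that situation, assuming an article written in UTF-8 but labeled ISO-8859-1. The sample word is made up.

```python
# Text that was really written and sent as UTF-8...
original = "caf\u00e9"
wire_bytes = original.encode("utf-8")

# ...but labeled ISO-8859-1. Decoding raises no error at all; it just
# turns the two-byte accented letter into two unrelated characters.
misread = wire_bytes.decode("iso-8859-1")

# Decoding with the character set that was actually used recovers the text.
recovered = wire_bytes.decode("utf-8")

print(repr(misread), "versus", repr(recovered))
```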
The final thing that can go wrong is that the program might correctly decode a given article but the font you are using doesn't have the proper "glyphs" to display all the characters. A "glyph" is a representation of how a given character should look onscreen. It describes the arrangement of pixels necessary to display a lowercase letter 'e', for instance. If your font is missing the glyph for a particular character then it will be displayed as a hollow rectangular box. (This could also indicate an error in decoding, if the program had to guess at the character set and got it wrong, say.)
Here's the full list of character sets Pineapple News can decode: ISO-8859-1, ISO-8859-2, etc., all the way to ISO-8859-10; ISO-8859-13 to ISO-8859-15; KOI8-R (Russian); Macintosh; US-ASCII; UTF-8; Windows-1251 and Windows-1252. E-mail me if there's a character set you'd like to see support for that is not on the list. Keep in mind that I am not going to add support for any character set unless I can get my hands on one or two example articles to use for testing.
Many USENET messages specify the correct character set to use to display their text. If a message contains a character set value then Pineapple News will use it to display the article. If the message does not specify a character set then the program must fall back on a default character set value, which you can specify.
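The fallback logic described above can be sketched in a few lines of Python. This is not how Pineapple News is implemented; it is an illustration that assumes the character set is declared in a MIME `Content-Type` header, and the function name `choose_charset` is hypothetical. It also assumes that an unrecognized character set is treated the same as a missing one.

```python
import codecs
from email import message_from_string


def choose_charset(raw_article: str, default: str = "iso-8859-1") -> str:
    """Return the declared character set if usable, else the default."""
    declared = message_from_string(raw_article).get_content_charset()
    if declared is None:
        # No character set in the headers: fall back on the default.
        return default
    try:
        # Check that this Python installation can decode the declared set.
        codecs.lookup(declared)
    except LookupError:
        # Assumption: treat an unknown set like a missing one.
        return default
    return declared


article = "Content-Type: text/plain; charset=KOI8-R\n\nbody text\n"
print(choose_charset(article))
```

Note that `get_content_charset()` returns the name in lowercase, so comparisons against a list of supported sets should be case-insensitive.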
Pineapple News maintains a global character set default for reading other people's messages. It is the CharsetDefault value in the [Storage] section of the program's INI file. It defaults to ISO-8859-1. For more information on changing this value see the help topic PineappleNews.ini Reference.
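For the curious, reading such a value is straightforward in most languages. The Python sketch below parses INI text and looks up CharsetDefault in the [Storage] section, falling back on ISO-8859-1 when the value is absent. The function name and the sample INI text are my own assumptions; only the section name, key name, and default come from the description above.

```python
from configparser import ConfigParser


def global_charset_default(ini_text: str) -> str:
    """Look up CharsetDefault in [Storage], defaulting to ISO-8859-1."""
    parser = ConfigParser()
    parser.read_string(ini_text)
    # fallback covers both a missing section and a missing key.
    return parser.get("Storage", "CharsetDefault", fallback="ISO-8859-1")


# Hypothetical INI contents, for illustration only.
sample_ini = "[Storage]\nCharsetDefault=KOI8-R\n"
print(global_charset_default(sample_ini))
```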
The program also maintains a default character set for each and every newsgroup you subscribe to. When a newsgroup is first created it will inherit the global character set default, as described above. To change a newsgroup to a different default, select it in the storage view, press the right mouse button to bring up the context menu, and select "Character set." A dialog box will display the group's current character set and let you change it.
Sadly, USENET messages sometimes do not specify what character set was used to create them, or worse, specify a set that is incorrect. If you want to change the character set used for any single message, first display it in the message view. Then from the main menu pick View, then Message, then "Character set ..." which will bring up a dialog box showing the message's current character set and allow you to change it to a different one.
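Putting the pieces together, the order of precedence works out to: your manual choice for a message, then the character set the message declares, then the newsgroup's default. The Python sketch below models that order. All of the names (`Newsgroup`, `effective_charset`, `GLOBAL_DEFAULT`) are hypothetical and the code is an illustration of the rules described here, not the program's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

GLOBAL_DEFAULT = "ISO-8859-1"


@dataclass
class Newsgroup:
    name: str
    # A new group inherits the global default when it is created.
    charset_default: str = GLOBAL_DEFAULT


def effective_charset(group: Newsgroup,
                      declared: Optional[str] = None,
                      user_override: Optional[str] = None) -> str:
    """Manual override first, then the declared set, then the group default."""
    return user_override or declared or group.charset_default


group = Newsgroup("comp.sys.be.misc")
print(effective_charset(group))                       # group default
print(effective_charset(group, declared="KOI8-R"))    # message header wins
print(effective_charset(group, "KOI8-R", "UTF-8"))    # manual choice wins
```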
Personally I've noticed that sometimes a message will be displayed with a box (an unrecognized character, in other words) where it should show a "curly open quote" or "curly close quote" character. If the article's current character set is US-ASCII or ISO-8859-1 I've found you can often remedy this by changing the character set to Windows-1251 or Windows-1252. These character sets include curly quote characters while ISO-8859-1 and friends do not.
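The reason this trick works shows up clearly if you decode the same bytes both ways. In this Python sketch (illustrative, with made-up sample bytes), the values 0x93 and 0x94 become invisible control characters under ISO-8859-1 but proper curly quotes under Windows-1252.

```python
# Bytes as they might appear in an article typed on Windows.
raw = b"\x93quoted\x94"

# Under ISO-8859-1 the two outer bytes map to control characters,
# which have no printable form and tend to show up as boxes.
as_latin1 = raw.decode("iso-8859-1")

# Under Windows-1252 the same bytes are the curly open and close quotes.
as_cp1252 = raw.decode("windows-1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```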
You can do all other aspects of character set support properly but without the right font it all falls apart. If you want to read German posts but your display font doesn't have characters with umlauts, you're not going to get very far. The fonts that ship with BeOS seem to be good at displaying US-ASCII and ISO-8859-1 but not much else.
Eventually Pineapple News will allow you to select a font family, style, and color for every view and window. But for right now, to make character set support somewhat usable, I added a hack to allow you to specify the font just for the message view and the edit window used to type articles.
But first, you must have a font installed that has the characters you're interested in. In case you don't know of such a font, there is one available for free on the Internet that works with BeOS: Bitstream Cyberbit, which contains almost all Unicode characters. It is of course huge (the ZIP file I downloaded is 6.14MB). To find it type CYBERBIT.TTF into your favorite Internet search engine. (Thanks to Colin Sarsfield for telling me about this.) Once you've downloaded and unzipped the cyberbit.ttf file, copy it to the folder /boot/beos/etc/fonts/ttfonts/, run the Fonts preferences app, press the Rescan button, and make sure it is available as a selection.
Now you must edit your PineappleNews.ini file so that it contains the necessary lines:
[WindowPosition]
See the help topic PineappleNews.ini Reference for more information on editing the file.
The program might also need to display special characters in the headers view because subject lines and author names can contain them. The headers view uses whatever the BeOS default plain font is set to, which you can change with the Fonts preferences app.
Pineapple News lets you select the character set that will be used to create articles that you post. While you are typing the article, the text is stored in UTF-8, the standard used internally by BeOS. The character set you specify doesn't come into play until the article is saved to disk, at which point the text is translated from UTF-8 to your chosen character set.
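That save-time translation amounts to decoding the UTF-8 text and re-encoding it in the target character set. Below is a minimal Python sketch of the idea. The function name `transcode_for_posting` is hypothetical, and substituting a question mark for characters the target set cannot represent is my assumption, not documented program behavior.

```python
def transcode_for_posting(utf8_bytes: bytes, charset: str) -> bytes:
    """Convert UTF-8 editor text to the chosen posting character set."""
    text = utf8_bytes.decode("utf-8")
    # Assumption: characters missing from the target set become "?".
    return text.encode(charset, errors="replace")


# German text, as the editor would hold it in UTF-8.
typed = "Gr\u00fc\u00dfe".encode("utf-8")
posted = transcode_for_posting(typed, "iso-8859-1")

# UTF-8 needs two bytes for each accented letter; ISO-8859-1 needs one.
print(len(typed), "bytes in UTF-8,", len(posted), "bytes in ISO-8859-1")
```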
To set your preferred character set you must edit the Charset value in the [Message] section stored in PineappleNews.ini. For information on how to edit this file see the help topic PineappleNews.ini Reference.
The default character set is ISO-8859-1 and unless you have a very good reason to change it I'd strongly advise you to leave this setting alone. If you have a burning desire to use a non-standard character set then here are some of the more popular ones.
ISO-8859-2: I'm told this is popular in Central and Eastern Europe, and I have a couple of articles in Polish that use it. If you use this you'll probably have to set a custom message font because the standard BeOS fonts are good for US-ASCII and ISO-8859-1 but not much else.
US-ASCII: Setting this value means that you are going to write nothing that won't fit within the confines of good old-fashioned 7-bit ASCII, the standard that goes back for decades. An advantage is that every newsreader in the world can be counted on to display this kind of article.
UTF-8: This character set represents 16-bit Unicode characters in a way that is mostly backward-compatible with software that expects 8-bit, ASCII-compatible characters. The Powers That Be in the USENET community hope this will become the new worldwide dominant standard but as of this writing there are not many newsreaders that support it.
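If you are unsure whether a given character set can hold what you want to write, a quick test is to try encoding the text and see whether it fails. The Python sketch below does just that. The helper name `fits` and the sample phrases are illustrative assumptions.

```python
def fits(text: str, charset: str) -> bool:
    """Return True if every character in text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


english = "Hello, world"
german = "Gr\u00fc\u00dfe"                  # uses u-umlaut and sharp s
polish = "Za\u017c\u00f3\u0142\u0107"       # uses Polish accented letters

for sample in (english, german, polish):
    usable = [cs for cs in ("us-ascii", "iso-8859-1", "iso-8859-2", "utf-8")
              if fits(sample, cs)]
    print(repr(sample), "fits in:", ", ".join(usable))
```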