
SOME NOTES ON FILE TYPES

by David Broughton

This is an updated reprint from the 2001 Show edition of Hot Key.

There has been some confusion recently over what is meant by "plain text" files. An example was the expression "send it as a plain text HTML file", which seems at first sight to be a contradiction in terms. How can a file be both "plain text" and HTML?

The confusion over file types stems from the fact that what one sees on the monitor screen when a file is displayed depends very much on the software that is displaying it. All files require some kind of formatting: either the software picks this up from the file itself or it does its own formatting according to context.

PLAIN TEXT

Plain text files are the simplest of files, containing a minimum of format information. Each character is stored in one byte as a 7-bit code. These characters consist of the letters of the alphabet, both upper and lower case, plus numerals and punctuation signs. The only formatting information is in the form of control codes, the principal one being a newline character. Plain text files are sometimes called ASCII (American Standard Code for Information Interchange) files.
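As a small illustration of the paragraph above, the following Python sketch (the sample text is made up for the purpose) encodes a short piece of plain text as ASCII and checks that every character occupies one byte whose value fits in 7 bits, with the newline stored as a control code.

```python
# A short piece of plain text containing one newline control code.
# (The wording is just sample data for this sketch.)
text = "Hot Key\nPlain text"

# Encode as ASCII: each character becomes exactly one byte.
data = text.encode("ascii")

# The numeric code of every character, in order.
codes = list(data)

# Every code fits in 7 bits, i.e. lies in the range 0 to 127.
fits_in_7_bits = all(code < 128 for code in codes)

# The newline is a control code (value 10) rather than a printable character.
newline_code = codes[text.index("\n")]

print(codes)
print(fits_in_7_bits, newline_code)
```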

THE INTERNET

The Internet originally only handled plain text (i.e. 7-bit character codes). Files containing other format information and codes other than plain text (i.e. characters requiring 8 bits) had to be coded into plain text. Graphics files, for example, which use all 8 bits of every byte, are coded using a base 64 coding system. This converts every three bytes of 8 bits into four groups of 6 bits, each of which can then be represented by one of 64 characters: the letters of the alphabet (upper and lower case), the numerals and two punctuation signs. The coding looks something like this:

3VPxHwKSdcEtcO/c3YwXYDlHwKSdcEtcO/c3YwXYDl
Lzugozcdtosu27vlJdOyHjAozcdtosu27vlJdOyHjA
m7XWP55+X66j9r7Jsc/bfuGP55+X66j9r7Jsc/bfuG
AAnOixq2Jncbrd3VTCan4u1ixq2Jncbrd3VTCan4u1
f0gvzTgSMVFhODnjH5v97yFzTgSMVFhODnjH5v97yF
RLApyP10y0vCT/E+aAwk6i2yP10y0vCT/E+aAwk6i2
wkrU7J/WMPdMb2aj9p8nCHH7J/WMPdMb2aj9p8nCHH
WgdTK87QklhtElqZF/KmTEpK87QklhtElqZF/KmTEp
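The three-bytes-into-four-characters idea can be tried out with a short Python sketch. The three byte values are arbitrary examples chosen here, not taken from any real file. The first half uses the standard base64 module; the second half does the same regrouping by hand to show where the four characters come from.

```python
import base64
import string

# Three arbitrary 8-bit bytes (24 bits in all), such as might occur in a graphics file.
raw = bytes([0xFF, 0x00, 0x80])

# The standard library turns them into four plain text characters.
encoded = base64.b64encode(raw)
decoded = base64.b64decode(encoded)

# The same thing by hand: treat the 24 bits as one number and
# cut it into four groups of 6 bits each.
value = int.from_bytes(raw, "big")
groups = [(value >> shift) & 0x3F for shift in (18, 12, 6, 0)]

# The 64-character alphabet: letters, numerals and two punctuation signs.
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
by_hand = "".join(alphabet[group] for group in groups)

print(encoded, groups, by_hand)
```

Decoding simply reverses the process, so the original three bytes are recovered exactly.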

Whilst most Internet Service Providers today will pass 8-bit codes through the system without change, the old coding systems for 8-bit characters are still in wide use.

The pound sign used in most Windows software is an 8-bit character (the value is 163). This means that it will not always get sent correctly unless the e-mail software has been set to encode such 8-bit characters in another way. To play safe, I always use "UKP" after a pound quantity but you could set your software to "Quoted-printable", which is a MIME (Multipurpose Internet Mail Extensions) coding system that codes each 8-bit character as three 7-bit characters: an equals sign followed by two hexadecimal digits, so that the pound sign becomes "=A3". It also allows line lengths to be variable according to your window size rather than the fixed length you have set.
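For the curious, here is a minimal Python sketch of what quoted-printable does to the pound sign, using the standard quopri module. The sample line of text is invented for the illustration.

```python
import quopri

# The pound sign as a single 8-bit byte with the value 163.
pound_byte = bytes([163])

# A made-up line of text containing that byte.
line = b"Price: " + pound_byte + b"100"

# Quoted-printable leaves the 7-bit characters alone and replaces the
# 8-bit byte with an equals sign followed by two hexadecimal digits.
encoded = quopri.encodestring(line)

# Decoding restores the original 8-bit byte.
decoded = quopri.decodestring(encoded)

print(encoded)
print(decoded == line)
```

After encoding, every byte in the line is a 7-bit code, so it can pass through systems that handle only plain text.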

Regarding line lengths, incidentally, when composing e-mails the line lengths remain variable, whatever settings you have, until you click on Send. For normal plain text, the line lengths are then set according to the options you have set.

To select "Quoted printable" in Outlook Express go to Tools/Options/Send/Plain Text Settings where you will find this option when you click on MIME where it says "Encode text using".

ATTACHMENTS

When an e-mail contains an attachment that is base-64 encoded, obviously, the recipient does not want to see the coding. The e-mail software takes care of the conversion automatically in the background when the attachment is opened. But you can see the coding if you wish by viewing the e-mail with an editor like Notepad. (More about this below.)

THE META-LANGUAGE HTML

An HTML (HyperText Markup Language) file is a plain text file with formatting expressed in the meta-language HTML. A meta-language is a set of rules that gives ordinary characters of the main language a special meaning when they are interpreted by the particular software it is designed for. A very simple case would be to display a particular piece of text in a bold font. Since there is no plain text version of a bold font, one could use the symbols <B> to denote where the bold font is to start and </B> to mark the end. The < and > symbols are meta-language symbols because they use the main language (plain text) in a special way. Viewed as plain text (in, say, Notepad) you will see <B> and </B>, but if the file is presented to some software that understands the meaning of these symbols (such as a web browser) it will show the text between them in a bold font.

Note that this implies that the symbol < cannot be used in its conventional sense, meaning "less than". So the meta-language has to have some means of representing the less than sign when needed for its original purpose. This is no problem: the less than sign is coded as "&lt;". The pound sign, incidentally, is coded as "&pound;" in HTML or alternatively "&#163;", which shows how extra symbols can be incorporated into a restricted alphabet. Here, the ampersand symbol (&) is being used in a special way.
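These special codings can be tried with Python's standard html module. This is only a sketch with invented sample strings; a web browser does the equivalent of the "unescape" step when it displays a page.

```python
import html

# Text that uses <, > and & in their ordinary senses.
plain = "3 < 5 & 5 > 3"

# Coded for inclusion in an HTML file, where these symbols have special meanings.
coded = html.escape(plain)

# The two ways of writing the pound sign mentioned above.
named = html.unescape("&pound;10")
numbered = html.unescape("&#163;10")

# Both produce the character whose code is 163.
pound_code = ord(named[0])

print(coded)
print(named, numbered, pound_code)
```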

HTML is the language used by web browsers, providing a sophisticated set of formatting options (such as lists and tables) that are not easy to provide with plain text alone. It also has provision for a variety of fonts, sizes and colours of text. But the HTML file is plain text and can be transmitted over the Internet as it is (i.e. not coded into base 64 or any other coding scheme).

WHAT YOU SEE DEPENDS ON THE SOFTWARE

To emphasise the point, how a file appears on the computer monitor depends on the software that is used to display it. Thus, a web browser will display an HTML file formatted according to the meta-language HTML. If the same file is viewed in, say, Notepad, it will be displayed as plain text and so will include all the meta-language tags and other language constructs that only someone who knows HTML would be able to understand. Generally, it would be difficult to read.

If you viewed a graphics file (say, a GIF file) in Notepad, you would see gobbledygook because Notepad has no knowledge of the format used for graphics files. Let a web browser or graphics program open it, however, and you will see the image because it understands the coding.

Now send the GIF file as an attachment in an e-mail over the Internet and it will be coded in base 64 with the file type GIF. This will be mentioned in the header information of the attachment coding. When the e-mail is received, the software will show, for example, a paper clip symbol to denote the attachment. Clicking on this symbol starts a chain of events as follows: first, the base 64 coding is decoded to produce a binary GIF file. This is placed in a temporary location on the hard disk. The name and place of the file are passed to the operating system, which uses the file type information (GIF) to choose an appropriate piece of software to open the file. The software chosen is determined by the file associations table (more about that below). The graphics program opens the file to display the image.
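The chain of events just described can be imitated with Python's standard email library. This is a rough sketch, not a description of how any particular e-mail program works: the subject, file name and "GIF" data are invented (the data is only a GIF signature plus padding, not a displayable image), and the last step of handing the file to the operating system is left out.

```python
import os
import tempfile
from email import message_from_bytes, policy
from email.message import EmailMessage

# Stand-in for a real GIF file: the 6-byte signature plus some zero bytes.
# (Hypothetical data for this sketch only.)
gif_bytes = b"GIF89a" + bytes(10)

# Sending side: attach the binary data; the library codes it in base 64.
outgoing = EmailMessage()
outgoing["Subject"] = "Picture"
outgoing.set_content("See the attached picture.")
outgoing.add_attachment(gif_bytes, maintype="image", subtype="gif",
                        filename="picture.gif")
wire_form = outgoing.as_bytes()  # what actually travels: plain text only

# Receiving side: find the attachment and read its header information.
incoming = message_from_bytes(wire_form, policy=policy.default)
attachment = next(incoming.iter_attachments())
coding = str(attachment["Content-Transfer-Encoding"])
content_type = attachment.get_content_type()

# Step 1: decode the base 64 coding back into binary.
decoded = attachment.get_payload(decode=True)

# Step 2: place it in a temporary file, keeping the file type.
suffix = os.path.splitext(attachment.get_filename())[1]
with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
    tmp.write(decoded)
    temp_path = tmp.name

# Step 3 (not performed here): the operating system would use the file type
# to choose the software that opens temp_path.
with open(temp_path, "rb") as saved:
    round_trip = saved.read()
os.remove(temp_path)  # tidy up the temporary file

print(coding, content_type, suffix)
```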

You don't need to send a file as an attachment if it is already a plain text file (though you can). The file could simply be inserted as part of the e-mail's plain text.

An HTML file, being a plain text file, can be inserted into an e-mail in this way. It will be displayed at the other end as plain text by the e-mail software. This might not be very interesting to the general user. But to a web page author it could be something that is wanted for cutting and pasting into a web page.

Insert an HTML file as an attachment, however, and when the recipient clicks on the paper clip symbol, a web browser will start up to display it formatted.

Neither of these types of e-mail should be confused with "HTML e-mails". What is usually meant by HTML e-mails is e-mails composed with software that has the capability to encode the composed text, with its formatting, into HTML, possibly also containing images, making a multi-part e-mail. Multi-part e-mails are divided into their various parts, and which part is displayed when received depends on the receiving software's capability and the user's options. If there is an HTML part, it is usually decoded and displayed as the default. But there is always a plain text part that you can get at if you know how (depends on the software).
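To make the idea of a multi-part e-mail concrete, here is a minimal Python sketch using the standard email library. The message wording is invented, and real e-mail software may arrange its parts differently. It builds an e-mail with a plain text part and an HTML part, then picks out each part the way receiving software might, depending on its options.

```python
from email import message_from_string, policy
from email.message import EmailMessage

# Compose an e-mail with a plain text part and an HTML alternative.
msg = EmailMessage()
msg["Subject"] = "Meeting"
msg.set_content("The meeting is on Monday.")
msg.add_alternative(
    "<html><body><p>The meeting is on <b>Monday</b>.</p></body></html>",
    subtype="html",
)

# What the recipient's software receives and parses.
received = message_from_string(msg.as_string(), policy=policy.default)
overall_type = received.get_content_type()
part_types = [part.get_content_type() for part in received.iter_parts()]

# Software that prefers HTML picks the HTML part as the default...
preferred = received.get_body(preferencelist=("html", "plain"))

# ...but the plain text part is still there for software set to use it.
plain_part = received.get_body(preferencelist=("plain",))
plain_text = plain_part.get_content().strip()

print(overall_type, part_types)
print(preferred.get_content_type(), "|", plain_text)
```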

But HTML used by e-mail software is limited in its capability. It is frowned upon by many because it:

  1. Increases the size of e-mails by a large factor, from three to ten times.
  2. Forces the recipient to view the text as the composer composed it. If this was in a tiny or unusual font the reader may have difficulty reading it. The usual font size options are disabled on some e-mail packages when HTML is the source. It has the advantage that images can be sent, embedded with the e-mail text, but this tends to make the e-mails even larger. The images are coded into base 64 as separate parts of the e-mail.
  3. Can carry viruses. Some viruses transmitted by e-mail are coded in a form of HTML. This is a serious matter because some e-mail software, like Outlook Express, has the preview pane switched on by default. This displays the e-mail in a separate pane as soon as the e-mail is selected from the main list. If such an e-mail contained a virus, the virus would be activated automatically and you would be in trouble. To switch off the preview pane in Outlook Express, click Layout on the View menu. Then make sure the box labelled "Show preview pane" is not ticked. This will prevent you from accidentally activating a virus. If you suspect a virus in an e-mail you can view it using Notepad or another plain text editor that will not be able to interpret the special formatting tags. Although the HTML text will be difficult to read, there will always be a plain text version as well that is easily read. The Notepad view, by the way, will show all the header information, which you have to skip over first.

To view an e-mail with Notepad, highlight it and use "Save as" from the File menu and save it in a temporary folder somewhere. I use the Windows desktop. I then switch to the desktop and drag the file's icon over the Notepad icon. You can also set up an "Open with" option with Notepad as one of the options. More about this below.

An alternative way is to start up Notepad and open the .eml file from there. Double clicking on a file of type .eml will not work because .eml files are associated with the e-mail software. So another way is to rename the file to have a .TXT file type and double click it after the renaming. Files of type TXT are usually associated with Notepad.

FILE ASSOCIATIONS

I have spoken a lot about file associations so perhaps more explanation is required. Windows and most other user-friendly operating systems make the computer easier to operate by using a file associations table. It consists of a table of file types and software applications so that most file types used in the computer can be associated with the software that is appropriate. Take the word processing program Word, for example. The file type for letters and documents produced by Word is .DOC. If you double click on a file of this type, the operating system looks up its table of file associations and starts up Word, feeding the file name to it as a parameter. The effect is the same as if you had started up Word without a file, clicked on Open in the File menu and navigated to that file.

More than one file type can be associated with the same application, but only one application can be the default for any one file type. However, sometimes you may want to use a different piece of software to open a file. Although you can do this in the old-fashioned way (start the application first and read in the file with "Open"), the alternative software can also be placed in the file associations table, though not as the default. To do this, you must first set up the name of the software in the file associations table for that file type and include it under the action "Open with". Then when you right click the file you will get a selection of applications for opening the file, which will include those you added.

The file associations table can be seen by clicking on My Computer and then View and selecting Folder Options... Then select the File Types tab. In Windows 95 and 98 the list is in alphabetical order of application rather than of file type, which is awkward because you have to scan down to find the one you want. This was corrected in Windows ME where the file types are in alphabetical order. The Windows help screens will guide you through any changes you want to make to this table.
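The idea of a file associations table, with one default application per file type plus optional "Open with" alternatives, can be sketched in a few lines of Python. The table contents and function names below are hypothetical, made up for illustration; Windows keeps its real table elsewhere and in a different form.

```python
import os

# Hypothetical table: file type -> applications, with the default listed first.
associations = {
    ".doc": ["Word"],
    ".txt": ["Notepad"],
    ".eml": ["Outlook Express", "Notepad"],  # Notepad added as an "Open with" choice
    ".gif": ["Web browser", "Graphics program"],
}

def file_type(file_name):
    """Return the file type (the part from the last dot onwards), in lower case."""
    return os.path.splitext(file_name)[1].lower()

def default_application(file_name):
    """What a double click would start: the first entry, or None if unknown."""
    applications = associations.get(file_type(file_name))
    return applications[0] if applications else None

def open_with_choices(file_name):
    """What a right click would offer: every application listed for the type."""
    return list(associations.get(file_type(file_name), []))

print(default_application("Letter.DOC"))
print(open_with_choices("message.eml"))
```

Note that "Word" appears for only one file type here, but nothing stops the same application being listed against several types, as described above.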

FILE TYPES

I have spoken a lot about file types but not fully defined them. Originally, in the days of DOS, file types were up to three characters after the dot of the file name and were called the file name extension. Windows 95 and later versions allowed long file names that could contain dots within the file name part. This can be confusing because the file type is the set of characters that follows the last dot. So if there is more than one dot you could be fooled into thinking that the file type is different from what it is, especially as, by default, Windows does not show the file type in file listings. The advantage of this to a virus writer can be imagined. A nice little EXE (executable) file containing a virus can be named "My picture.jpg.exe". The user sees this as "My picture.jpg" and expects to see the picture when it is double clicked. But in fact it is not a jpg file but an exe file, and the user gets infected with the demonic doings of the virus writer.
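The "last dot" rule is easy to check with Python's standard os.path.splitext function. The helper names below are invented for this sketch, and the hidden-extension listing is a simplification (Windows only hides the extensions of file types it knows about).

```python
import os

def file_type(file_name):
    """Return the characters after the last dot, in upper case (e.g. 'EXE')."""
    return os.path.splitext(file_name)[1].lstrip(".").upper()

def name_as_listed(file_name, hide_extension=True):
    """Imitate a file listing that hides the file type."""
    return os.path.splitext(file_name)[0] if hide_extension else file_name

def looks_disguised(file_name):
    """True if the name still appears to end in a file type once the real one is hidden."""
    return os.path.splitext(name_as_listed(file_name))[1] != ""

suspect = "My picture.jpg.exe"

print(file_type(suspect))        # the real file type
print(name_as_listed(suspect))   # what the user sees in the listing
print(looks_disguised(suspect), looks_disguised("My picture.jpg"))
```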

To make sure you see all file types, go to Folder Options from My Computer/View (also available from the Settings menu that comes up from Start) and select the View tab. Scroll to the option "Hide file extensions for known file types" and make sure the box next to it is not ticked. Then remember an important rule: Never double click an attachment that is an EXE file unless you know for sure who the sender is and that the attachment was intended (because sometimes, even if you know the sender, the sender may not realise that an attachment has been sent).