C Experiments: Text Processing, Experiment No. 2
C Text Files Character Counts, 1999-05-15:
In any analysis involving a text file, it will be useful to know which text characters are involved. So, in this experiment, we develop a program to identify and count the characters in any specified ASCII text file.
The identification will tell us which of the extended ASCII (0-255) characters are contained in the file and how often they appear.
While the count itself may not be of much use outside of obscure trivia contests, it is so easy to add to the program and uses so few additional resources (and so fully satisfies our obsessive penchant for numerical exactitude!) that we will include it anyway.
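The counting step described above can be sketched in a few lines of C. This is only a minimal sketch, not the program that was run against the KJV; the names `count_chars` and `NUM_CODES` are our own illustrative choices. It reads a stream one byte at a time and uses each byte value (0-255) as an index into an array of counters.

```c
#include <stdio.h>

#define NUM_CODES 256   /* extended ASCII codes 0-255 (hypothetical name) */

/* Tally every byte read from `in` into counts[0..255].
   Returns the total number of bytes read.
   `count_chars` is an illustrative name, not the original program's. */
unsigned long count_chars(FILE *in, unsigned long counts[NUM_CODES])
{
    unsigned long total = 0;
    int i;
    int c;   /* int, not char, so EOF stays distinct from byte value 255 */

    for (i = 0; i < NUM_CODES; i++)
        counts[i] = 0;

    /* getc returns each byte as an unsigned char converted to int,
       so c is always in 0..255 inside the loop. */
    while ((c = getc(in)) != EOF) {
        counts[c]++;
        total++;
    }
    return total;
}
```

A caller would typically open the text file with `fopen(path, "rb")`, pass the stream to `count_chars`, and then inspect the `counts` array. Because the counters cost only one array increment per character, the count really does come almost for free once the identification loop exists.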
The Character Count program is designed to be run offline on a local machine and was executed with the KJV.TXT file while running under Red Hat Linux 5.2 on my Cyrix 686 machine.
The program executed very quickly (the only delay came from displaying the progress counter every 10,000 characters, which updated almost faster than the eye could follow) and produced the linked Report, which categorizes the KJV's 4,365,198 characters.
The report consists of two columns. The left column contains the ASCII character codes (0-255, plus an additional code 256 that holds the total character count for the entire document), and the right column contains the number of occurrences of that character within the document.
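The two-column layout just described could be produced along the following lines. This is a sketch under stated assumptions: the exact spacing of the original report is not known, so the column widths here are arbitrary, and the names `write_report`, `TOTAL_CODE`, and `NUM_ROWS` are hypothetical. The one detail taken from the description is that row 256 carries the document total.

```c
#include <stdio.h>

#define TOTAL_CODE 256               /* extra row holding the total count */
#define NUM_ROWS   (TOTAL_CODE + 1)  /* codes 0-255 plus the total row */

/* Store the sum of counts[0..255] in counts[256], then write one line
   per code to `out`: the code on the left, its count on the right.
   Column widths are an assumption, not the original report's format. */
void write_report(FILE *out, unsigned long counts[NUM_ROWS])
{
    int code;

    counts[TOTAL_CODE] = 0;
    for (code = 0; code < TOTAL_CODE; code++)
        counts[TOTAL_CODE] += counts[code];

    for (code = 0; code < NUM_ROWS; code++)
        fprintf(out, "%3d %10lu\n", code, counts[code]);
}
```

Keeping the total in the same array as the per-character counts means the report loop needs no special case for the last row.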
Of particular significance in this report are the non-alphanumeric characters. Ten non-alphanumeric characters appear in the KJV document. Two of these, the apostrophe [' ASCII 39; 1,995 occurrences] and the hyphen [- ASCII 45; 18 occurrences], are actually part of the words in which they appear. They should thus be treated as word characters when parsing the KJV document.
The other eight non-alphanumeric characters [!(),.:;?] are punctuation and should thus be considered as whitespace when parsing the KJV document.
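To make the rule concrete, here is a minimal sketch of how a parser might apply it. The function names `is_word_char` and `count_words` are hypothetical, and the sketch assumes the default "C" locale, where `isalnum` accepts only A-Z, a-z, and 0-9. Letters, digits, the apostrophe, and the hyphen are kept as part of a word; every other character, including the eight punctuation marks, acts as a separator.

```c
#include <ctype.h>
#include <stddef.h>

/* Nonzero if c belongs inside a word under the KJV-specific rule:
   alphanumerics plus the apostrophe (39) and the hyphen (45).
   The cast keeps the argument to isalnum within unsigned char range. */
int is_word_char(int c)
{
    return isalnum((unsigned char)c) || c == '\'' || c == '-';
}

/* Count the words in a NUL-terminated string, treating every
   non-word character (punctuation, spaces, newlines) as whitespace. */
size_t count_words(const char *text)
{
    size_t words = 0;
    int in_word = 0;

    for (; *text != '\0'; text++) {
        if (is_word_char((unsigned char)*text)) {
            if (!in_word) {      /* first character of a new word */
                words++;
                in_word = 1;
            }
        } else {
            in_word = 0;         /* separator ends the current word */
        }
    }
    return words;
}
```

For a different document, only `is_word_char` would need to change, which is where the character report earns its keep.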
Of course, this treatment of non-alphanumeric characters is specific to the KJV document. Other treatments may be more appropriate for other documents. The point is that the Character Counting program will assist the user in making such determinations.