C Experiments: Text Processing
Experiment No. 3
C Text File Common Word Counts, 1999-05-16:
To prepare for further analyses, we need to know the number of words in the text document we are studying and the maximum word length.
In this experiment, we develop a program to extract that information.
Yes, the Unix wc (word count) command would count the words for us, but we will use this opportunity to develop a parsing mechanism for use in later experiments, and we will learn some new concepts along the way. So the apparent duplication of effort will not be wasted.
As a side benefit, we'll also make the program count the occurrences of several very common words that we would expect to appear frequently in any text document.
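The three tallies described above (total words, maximum word length, and counts of a few common words) can be sketched as a small character-at-a-time scanner. This is only an illustration under stated assumptions, not the program used for this experiment: a "word" is taken to be a run of alphabetic characters, matching is case-insensitive, and the names `text_stats`, `stats_feed`, `end_word`, `scan_text` and the four-word list are all hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_WORD   64  /* assumed cap on stored word text */
#define NUM_COMMON 4

/* Hypothetical list of common words; the article's actual list is not shown. */
static const char *const common_words[NUM_COMMON] = { "the", "and", "of", "to" };

struct text_stats {
    long   total_words;
    size_t max_length;                /* length of the longest word seen */
    char   longest[MAX_WORD];         /* lowercased copy of that word */
    long   common_counts[NUM_COMMON]; /* parallel to common_words[] */
    char   current[MAX_WORD];         /* word currently being assembled */
    size_t current_len;
};

/* Close off the word being assembled, if any, and update the tallies. */
static void end_word(struct text_stats *st)
{
    size_t stored;
    int i;

    if (st->current_len == 0)
        return;
    /* Words longer than the buffer are counted in full but stored truncated. */
    stored = st->current_len < MAX_WORD ? st->current_len : MAX_WORD - 1;
    st->current[stored] = '\0';

    st->total_words++;
    if (st->current_len > st->max_length) {
        st->max_length = st->current_len;
        strcpy(st->longest, st->current);
    }
    for (i = 0; i < NUM_COMMON; i++)
        if (strcmp(st->current, common_words[i]) == 0)
            st->common_counts[i]++;
    st->current_len = 0;
}

/* Feed one character: letters extend the current word, anything else ends it. */
void stats_feed(struct text_stats *st, int ch)
{
    if (isalpha((unsigned char)ch)) {
        if (st->current_len < MAX_WORD - 1)
            st->current[st->current_len] = (char)tolower((unsigned char)ch);
        st->current_len++;
    } else {
        end_word(st);
    }
}

/* Scan a whole stream, resetting the statistics first. */
void scan_text(FILE *in, struct text_stats *st)
{
    int ch;

    assert(in != NULL && st != NULL);
    memset(st, 0, sizeof *st);
    while ((ch = fgetc(in)) != EOF)
        stats_feed(st, ch);
    end_word(st); /* the last word may not be followed by a delimiter */
}
```

Calling `scan_text` on a stream opened with `fopen("KJV.TXT", "r")` would leave the totals in the `text_stats` structure. Because the definition of a word here is an assumption (an apostrophe splits a word in two, for example), the totals need not match those of wc or of the report.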
The Common Word Count program is designed to be run offline on a local machine. I ran it on the KJV.TXT file under Red Hat Linux 5.2 on my Cyrix 686 machine.
The program executed quickly (in about 10 seconds, including the display of a progress counter every 10,000 words) and produced the linked Report, which categorizes the KJV's 823,036 words.
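A progress counter of the kind mentioned above might look like the following sketch. The function name `show_progress` and the output format are assumptions; only the 10,000-word interval comes from the text.

```c
#include <assert.h>
#include <stdio.h>

#define PROGRESS_INTERVAL 10000L /* from the text: one update per 10,000 words */

/* Write the running word count when it is a positive multiple of the
   interval. Returns 1 if an update was written, 0 otherwise. */
int show_progress(FILE *out, long words_so_far)
{
    assert(out != NULL);
    if (words_so_far <= 0 || words_so_far % PROGRESS_INTERVAL != 0)
        return 0;
    /* '\r' moves back to the start of the line so each update
       overwrites the previous one instead of scrolling. */
    fprintf(out, "\r%ld words", words_so_far);
    fflush(out);
    return 1;
}
```

Writing the counter to stderr rather than stdout keeps it out of the report if the program's output is redirected to a file.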
The report indicates that the longest word in the KJV Bible is 18 characters long (the transliterated Hebrew name "Mahershalalhashbaz", which appears in Isaiah 8:1,3) and that the most frequent word is "the" with 63,833 occurrences. The second most frequent word is "and" with 51,492 occurrences.