C Experiments:
Text Processing
Experiment No. 5
C Text File Division Locations, 1999-05-27:
We will now use observed structural details of the text document to identify its major divisions and their locations.
The Bible is generally cited by book, chapter, and verse, e.g. Genesis 12:4, John 3:16, or Romans 10:9. Humans usually have no problem recognizing these divisions, but it's not so easy for a computer.
Examining KJV.TXT, we note that each book begins with the book name on a line by itself, followed by a blank line. The book name can be one word (like "Joshua"), two words (like "II Thessalonians"), or three words ("Song of Solomon"). In the file itself, book names are written entirely in uppercase (e.g. "GENESIS").
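The book name rule above can be sketched as a small line test. This is only a sketch: the function names `is_blank_line` and `looks_like_book_name` are made up for illustration, and it assumes that no other line in KJV.TXT consists solely of uppercase letters and spaces, which would need to be checked against the file.

```c
#include <ctype.h>
#include <stdbool.h>

/* True if the line is empty or holds only whitespace (including "\r\n"). */
static bool is_blank_line(const char *line)
{
    for (; *line != '\0'; line++) {
        if (!isspace((unsigned char)*line))
            return false;
    }
    return true;
}

/*
 * True if `line` contains at least one uppercase letter, nothing other
 * than uppercase letters and whitespace, and is followed by a blank line.
 * Assumption: this is enough to tell book names from all other lines.
 */
static bool looks_like_book_name(const char *line, const char *next_line)
{
    int letters = 0;

    for (const char *p = line; *p != '\0'; p++) {
        unsigned char c = (unsigned char)*p;
        if (isupper(c))
            letters++;
        else if (!isspace(c))
            return false;   /* lowercase, digit, or punctuation */
    }
    return letters > 0 && is_blank_line(next_line);
}
```

Roman numerals such as the "II" in "II THESSALONIANS" are themselves uppercase letters, so they need no special handling here.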
Except as noted below, each chapter begins with the word "Chapter" and the chapter number on a line by themselves, and the line is followed by a blank line.
Exception No. 1: The book of Psalms uses the word "Psalm" instead of the word "Chapter".
Exception No. 2: The books of Obadiah, Philemon, II John, III John, and Jude do not have chapter headings because they are each only one chapter long.
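A chapter heading test that covers Exception No. 1 might look like the following sketch. The function name `parse_chapter_heading` is hypothetical, and the code assumes the headings are spelled exactly "Chapter" or "Psalm" (capitalized as quoted above) and separated from the number by a single space.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * If `line` is "Chapter N" or "Psalm N" on a line by itself, stores N in
 * *chapter and returns true. Otherwise returns false and leaves *chapter
 * unchanged.
 */
static bool parse_chapter_heading(const char *line, int *chapter)
{
    static const char *const keywords[] = { "Chapter ", "Psalm " };

    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++) {
        size_t len = strlen(keywords[i]);
        if (strncmp(line, keywords[i], len) != 0)
            continue;

        const char *digits = line + len;
        if (!isdigit((unsigned char)*digits))
            return false;

        char *end;
        long n = strtol(digits, &end, 10);

        /* Only trailing whitespace (such as a line ending) may follow. */
        while (isspace((unsigned char)*end))
            end++;
        if (*end != '\0')
            return false;

        *chapter = (int)n;
        return true;
    }
    return false;
}
```

Exception No. 2 cannot be handled by a line test alone, since those five books simply have no heading line to find.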
Each verse is on a single line by itself and begins with the verse number followed by the text of the verse. The last verse in each chapter is followed by a blank line.
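Since a verse line is the only kind of line that starts with a digit, recognizing one is short. The following is a minimal sketch with a hypothetical function name, `parse_verse_number`; it assumes the verse number starts in the first column with no leading spaces.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * If `line` begins with a decimal digit, stores the leading number in
 * *verse and returns true. Otherwise returns false.
 */
static bool parse_verse_number(const char *line, int *verse)
{
    if (!isdigit((unsigned char)line[0]))
        return false;

    /* strtol stops at the first non-digit, i.e. where the verse text begins. */
    *verse = (int)strtol(line, NULL, 10);
    return true;
}
```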
Our task in this experiment will be to identify the locations (i.e. the particular byte number within KJV.TXT's 4,365,198 bytes) where each book, chapter, and verse begins. The "G" in "GENESIS" is defined as byte number zero.
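One way to obtain these byte numbers is to read the file line by line and keep a running count of the bytes consumed so far. The sketch below uses a hypothetical function name, `collect_line_offsets`, and assumes that the file is opened in binary mode ("rb") so that line endings are counted as stored, that no line is longer than the 4096-byte buffer, and that the file contains no NUL bytes.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Stores the starting byte offset of each line of `fp` in offsets[],
 * up to max_offsets entries, and returns the total number of lines read.
 * The first line always starts at offset zero.
 */
static size_t collect_line_offsets(FILE *fp, long *offsets, size_t max_offsets)
{
    char line[4096];
    long offset = 0;
    size_t count = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (count < max_offsets)
            offsets[count] = offset;
        count++;
        /* fgets keeps the line ending, so strlen counts every byte read. */
        offset += (long)strlen(line);
    }
    return count;
}
```

Counting bytes by hand rather than calling `ftell` keeps the offsets independent of how the C library treats text-mode streams.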
We will specify that each book begins with the first letter of its name, and for location recording we will identify the book name itself as chapter zero, verse zero of the book.
We will also specify that each chapter begins with the first letter of the word "Chapter" (or "Psalm"), and we will identify the chapter heading as verse zero of the chapter for location recording.
We will then specify that each verse begins at the first digit of its verse number.
For Obadiah, Philemon, II John, III John, and Jude, which have no chapter headings, we will treat the whole book as chapter one for recording purposes. Because there is no heading line to point to, we will redundantly record the digit "1" at the start of the first verse as the location of chapter one, verse zero, as well as the location of chapter one, verse one.
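Taken together, the recording conventions can be sketched as a single pass over the file. Everything named here (`struct location`, `record_locations`, and the helper functions) is hypothetical, and the line tests are deliberately minimal: a book name is assumed to be any line made up only of uppercase letters and whitespace, and lines are assumed to fit in the 4096-byte buffer.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One recorded location: where a book, chapter, or verse begins. */
struct location {
    int  book;     /* 1-based book number, in file order */
    int  chapter;  /* 0 for the book name line */
    int  verse;    /* 0 for a book name or chapter heading */
    long offset;   /* byte offset from the start of the file */
};

/* Nonzero if the line holds only uppercase letters and whitespace. */
static int is_book_name(const char *s)
{
    int letters = 0;
    for (; *s != '\0'; s++) {
        if (isupper((unsigned char)*s))
            letters++;
        else if (!isspace((unsigned char)*s))
            return 0;
    }
    return letters > 0;
}

/* Nonzero if the line is a "Chapter N" or "Psalm N" heading. */
static int is_chapter_heading(const char *s, int *chapter)
{
    const char *p;
    if (strncmp(s, "Chapter ", 8) == 0)
        p = s + 8;
    else if (strncmp(s, "Psalm ", 6) == 0)
        p = s + 6;
    else
        return 0;
    if (!isdigit((unsigned char)*p))
        return 0;
    *chapter = atoi(p);
    return 1;
}

/* Appends one record if there is room; returns the new record count. */
static size_t add_location(struct location *out, size_t count, size_t max,
                           int book, int chapter, int verse, long offset)
{
    if (count < max) {
        out[count].book = book;
        out[count].chapter = chapter;
        out[count].verse = verse;
        out[count].offset = offset;
        count++;
    }
    return count;
}

/* Scans fp and fills out[] with up to max records; returns the count. */
static size_t record_locations(FILE *fp, struct location *out, size_t max)
{
    char line[4096];
    long offset = 0;
    int book = 0, chapter = 0, number = 0;
    size_t count = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (is_book_name(line)) {
            book++;
            chapter = 0;    /* the book name is chapter zero, verse zero */
            count = add_location(out, count, max, book, 0, 0, offset);
        } else if (is_chapter_heading(line, &number)) {
            chapter = number;   /* the heading is verse zero */
            count = add_location(out, count, max, book, chapter, 0, offset);
        } else if (isdigit((unsigned char)line[0])) {
            if (chapter == 0) {
                /* No heading since the book name: a one-chapter book, so
                   chapter one, verse zero shares the first verse's offset. */
                chapter = 1;
                count = add_location(out, count, max, book, 1, 0, offset);
            }
            count = add_location(out, count, max, book, chapter,
                                 atoi(line), offset);
        }
        offset += (long)strlen(line);   /* blank lines still advance it */
    }
    return count;
}
```

The one-chapter books are detected here by the absence of a heading before the first verse, rather than by listing their names, so the same code path would apply to any book laid out that way.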
For reference, here are a couple of utility files left over from some previous research which should also prove useful in this experiment:
The Chapters file has one line for each of the 66 books in KJV.TXT, listing the book number, a 3-character mnemonic abbreviation for the book name, the number of chapters in the book, and the book name.
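Reading one line of the Chapters file could be sketched as below. The exact layout of the file is not described here, so this assumes the four fields are separated by whitespace in the order listed, with the book name running to the end of the line. The names `struct book_entry` and `parse_chapters_line` are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* One line of the Chapters file (field layout is an assumption). */
struct book_entry {
    int  number;        /* book number, 1..66 */
    char mnemonic[4];   /* 3-character abbreviation plus terminator */
    int  chapters;      /* number of chapters in the book */
    char name[64];      /* book name, which may contain spaces */
};

/* Parses one whitespace-separated line; returns true on success. */
static bool parse_chapters_line(const char *line, struct book_entry *b)
{
    /* %3s reads at most 3 characters; %63[^\n] reads the rest of the line. */
    return sscanf(line, "%d %3s %d %63[^\n]",
                  &b->number, b->mnemonic, &b->chapters, b->name) == 4;
}
```

If the real file uses fixed columns or another separator, only the format string would need to change.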
The Verses file has one line for each of the 1189 chapters in KJV.TXT, listing the running chapter number, the book mnemonic, the chapter number within the book, the number of verses in the chapter, and the running number of the last verse of the preceding chapter.
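The last field of each Verses line makes it easy to turn a chapter and verse into a running verse number: add the verse number to the running number of the last verse of the preceding chapter. The sketch below assumes whitespace-separated fields in the order listed and running numbers that start at one; `struct chapter_entry`, `parse_verses_line`, and `running_verse_number` are hypothetical names.

```c
#include <stdbool.h>
#include <stdio.h>

/* One line of the Verses file (field layout is an assumption). */
struct chapter_entry {
    int  running_chapter;   /* 1..1189 across the whole file */
    char mnemonic[4];       /* 3-character book abbreviation */
    int  book_chapter;      /* chapter number within the book */
    int  verse_count;       /* number of verses in this chapter */
    int  verses_before;     /* running number of the last verse of the
                               preceding chapter (0 for the first) */
};

/* Parses one whitespace-separated line; returns true on success. */
static bool parse_verses_line(const char *line, struct chapter_entry *e)
{
    return sscanf(line, "%d %3s %d %d %d",
                  &e->running_chapter, e->mnemonic, &e->book_chapter,
                  &e->verse_count, &e->verses_before) == 5;
}

/* Running number of `verse` in this chapter, or 0 if it is out of range. */
static int running_verse_number(const struct chapter_entry *e, int verse)
{
    if (verse < 1 || verse > e->verse_count)
        return 0;
    return e->verses_before + verse;
}
```

For example, Genesis 1 has 31 verses, so the entry for Genesis 2 would carry 31 in its last field and Genesis 2:1 would be running verse 32.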
To be continued...