Computerization

To prepare the Bible text for computerized language analysis, it was divided into sentences. The idea is that in English the sentence is the basic unit that expresses meaning. The Bible text itself is built around the verse, and the verse divisions do give meaning to the text, but in language analysis the sentence is the starting point.

In standard English a period, a question mark, or an exclamation point marks the end of a sentence. Bible English is a bit different. A period always marks the end of a sentence, but a question mark or exclamation point does so only sometimes. Often one of these marks ends a sentence if it's also the end of the verse or if it's followed by a capitalized word, but there are exceptions. For example, Romans 11:7 follows the form of Romans 3:9 and Romans 6:15, so it makes sense to continue the sentence through the question mark there. In the middle of Matthew 27:17, the new sentence that would start after the question mark would be basically just a noun phrase, so it's reasonable not to break the sentence there. The sentence divisions are subject to change over time. As it stands now, the Bible's 31,102 verses are divided into 28,516 sentences.
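The rule of thumb above can be sketched in Python. This is only an illustration under stated assumptions, not the procedure actually used here: the function name `split_sentences`, the `NO_BREAK` exception table, and the sample verses are all made up, and the sketch ignores quotation marks and other details a real text would need.

```python
import re

# Hypothetical exception table: (verse reference, index of the "?" or "!"
# within that verse) where an editor chose NOT to break the sentence.
NO_BREAK = {("Example 1:2", 0)}

def split_sentences(verses, no_break=NO_BREAK):
    """Split a list of (reference, text) verses into sentences.

    A period always ends a sentence. A "?" or "!" ends one only if it is
    at the end of the verse or followed by a capitalized word, and is not
    listed as an exception. Sentences may run across verse boundaries.
    """
    sentences = []
    current = []  # pieces of the sentence being built
    for ref, text in verses:
        text = text.strip()
        start = 0
        mark_index = 0  # counts "?" and "!" seen so far in this verse
        for m in re.finditer(r"[.?!]", text):
            end = m.end()
            is_break = True  # periods always break
            if m.group() != ".":
                rest = text[end:].lstrip()
                at_verse_end = rest == ""
                next_capital = rest[:1].isupper()
                is_break = (at_verse_end or next_capital) and \
                    (ref, mark_index) not in no_break
                mark_index += 1
            if is_break:
                current.append(text[start:end].strip())
                sentences.append(" ".join(current))
                current = []
                start = end
        tail = text[start:].strip()
        if tail:
            current.append(tail)  # sentence continues into the next verse
    if current:
        sentences.append(" ".join(current))
    return sentences

# Made-up sample verses, not actual Bible text.
sample = [
    ("Example 1:1", "What then? shall we go on, and wait"),
    ("Example 1:2", "until morning? No! We shall not."),
    ("Example 1:3", "Who is there? Nobody."),
]
for s in split_sentences(sample):
    print(s)
```

The exception table stands in for the case-by-case judgment described above, such as the decisions made for Romans 11:7 and Matthew 27:17. In practice those decisions would be recorded by hand and revised as the divisions change.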

After the Bible was divided into sentences, it was tokenized. That means each word and each punctuation mark was separated out, as was each possessive "apostrophe-s." Finally, all of this was collected into what language analysts call a corpus, which is just a set of files that represents some body of writing to be studied, in this case sixty-six files of tokenized sentences that represent the Bible.
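As a rough illustration of the tokenizing step, here is a minimal Python sketch. The names `TOKEN_RE`, `tokenize`, and `corpus_line` are hypothetical, the sample sentence is made up, and the one-sentence-per-line layout is an assumption about how a corpus file might look, not a description of the actual files.

```python
import re

# Match, in order of preference:
#   1. a possessive "apostrophe-s" (straight or curly apostrophe),
#   2. a run of word characters (a word or number),
#   3. any single character that is neither a word character nor whitespace
#      (a punctuation mark).
TOKEN_RE = re.compile(r"['\u2019]s\b|\w+|[^\w\s]")

def tokenize(sentence):
    """Return the list of tokens in one sentence."""
    return TOKEN_RE.findall(sentence)

def corpus_line(sentence):
    """Format one tokenized sentence as a single line of a corpus file
    (assumed layout: tokens separated by single spaces)."""
    return " ".join(tokenize(sentence))

# Made-up sample sentence, not a quotation.
print(tokenize("The king's servant said, Who is there?"))
print(corpus_line("The king's servant said, Who is there?"))
```

A simple pattern like this would also split a contraction such as "don't" into three tokens, so a text with contractions would need extra rules.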