Sindarin dictionary statistics
Sep. 2nd 2013
This is a statistical evaluation of the Sindarin dictionary hosted at http://www.sindarin.de.
The frequencies of all Sindarin phonemes are found to be:
Notation (for ease of counting, digraphs were converted to unigraphs):
Assumptions for simplicity:
For the rank-frequency distribution p(r) (where r is a phoneme’s rank), an ad-hoc formula was first proposed by Zipf in 1929:

p(r) = 1 / (s(N) · r),

with the normalization s(N) = ∑_{k=1}^{N} 1/k, where N is the total number of phonemes.
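As a quick numerical check (my own sketch, not part of the original; the function name zipf_p is invented), the Zipf distribution with this normalization can be evaluated exactly using rational arithmetic:

```python
from fractions import Fraction

def zipf_p(r, N):
    """Zipf rank-frequency law: p(r) = 1 / (s(N) * r),
    with s(N) = sum_{k=1}^{N} 1/k the N-th harmonic number."""
    s_N = sum(Fraction(1, k) for k in range(1, N + 1))
    return Fraction(1, r) / s_N

N = 30  # hypothetical phoneme inventory size
# By construction the frequencies sum to exactly 1:
print(sum(zipf_p(r, N) for r in range(1, N + 1)))  # 1
# And the frequency at rank r is exactly 1/r of the top frequency:
print(zipf_p(1, N) / zipf_p(3, N))  # 3
```

The second print makes the power-law character explicit: under Zipf, doubling the rank always halves the frequency, independent of N.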
Several authors have since noticed that it does not fit the data across languages particularly well and have proposed other ad-hoc fitting functions [3, 4]. In 1988, Gusein-Zade proposed a formula based on a sensible assumption, namely that the rank-frequencies are drawn from a uniform probability density and that p(r) can be approximated by the corresponding expectation value for any given language. This leads to:

p(r) = (1/N) ∑_{k=r}^{N} 1/k.

For large N and at fixed rank r, this can be approximated by:

p(r) ≈ (1/N) ln((N+1)/r).
It turns out that this formula describes real-language data rather well and no wild fitting is required (see below). The fact that a model assumption enters the calculation seems to have been overlooked or misunderstood by other authors, probably because Gusein-Zade’s paper was published in Russian. One can see that it makes no sense to generalize the Zipf distribution by adding fittable parameters, like r^(−β) (as seems to be done often), because the dependency is different: approximately log(1/r) rather than a power law. This means that a semilogarithmic plot of p(r) should produce a straight line. This is indeed the case for the Sindarin data, as seen in fig. 1.
Comparing it with data from natural languages (fig. 2) one finds a similarly good agreement for English and Swedish, somewhat worse for Bengali. Except for Bengali, deviations are spread both above and below the Gusein-Zade function which suggests a statistical rather than a systematic error. I do not know how reliable the Bengali data are.
Note that the formula does not predict how common a particular sound is, but rather how frequent the phoneme ranked r is (whatever that phoneme may be). It turns out that this value is completely determined by the total number of phonemes N.
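To make this concrete, here is a small sketch (my own; the function names and the inventory size N = 30 are invented for illustration) that evaluates the expectation-value formula and its logarithmic approximation. Note that rank r and inventory size N are the only inputs:

```python
import math

def guseinzade_p(r, N):
    """Expected frequency of the phoneme ranked r, assuming the N
    frequencies arise from a uniform density (expectation value):
    p(r) = (1/N) * sum_{k=r}^{N} 1/k."""
    return sum(1.0 / k for k in range(r, N + 1)) / N

def guseinzade_approx(r, N):
    """Large-N logarithmic form of the Gusein-Zade function:
    p(r) ~ (1/N) * ln((N+1)/r)."""
    return math.log((N + 1) / r) / N

N = 30  # hypothetical inventory size; the whole curve depends on N alone
for r in (1, 2, 5, 10, 20, 30):
    print(r, round(guseinzade_p(r, N), 4), round(guseinzade_approx(r, N), 4))
```

The exact frequencies sum to 1 for any N (each 1/k term is counted k times), and the sequence is strictly decreasing in r, as a rank-frequency distribution must be.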
Note also that it matters for the individual frequencies whether one considers a dictionary or a text. In the latter case, English [ð] obviously becomes much more common due to all the thes and thats (in Sindarin texts, the frequency of i is expected to go up for the same reason). However, the distribution seems to stay the same: the RP data in figure 2 are from a dictionary, the American English data from a text.
Finally one should note that the RP data for English include diphthongs as separate phonemes, while the American English data do not; but again, this does not seem to affect the distribution itself.
We can thus conclude that the rank-frequency distribution of the Sindarin phonemes is indistinguishable from that of a natural language.
Rank frequencies for vowels only:
Rank frequencies for consonants only:
Place of articulation:
Manner of articulation:
Distribution among stops:
Distribution among fricatives:
Distribution among sonorants/semivowels:
A bigram is a cluster of two letters or, in this case, two phonemes. One can introduce the conditional probability p_i(j) of finding the phoneme j if the preceding phoneme is i. It forms a matrix with normalized rows: ∑_j p_i(j) = 1. If one weights the rows with the frequencies p(i), one obtains the probability of getting the phonemes i and j in two sequential draws: p(i,j) = p(i) p_i(j). This is then of course normalized with respect to the total sum: ∑_{ij} p(i,j) = 1. The procedure is readily generalized to n-grams.
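As a sketch of this construction (my own; the toy word list stands in for the dictionary and is not the real data), the conditional matrix p_i(j) and the joint probabilities p(i,j) can be built from raw counts like this:

```python
from collections import Counter

# Toy stand-in for the dictionary entries (illustrative words only).
words = ["adan", "aran", "mellon", "galad", "ithil", "anor"]

# Count adjacent phoneme pairs within each word.
bigrams = Counter()
for w in words:
    bigrams.update(zip(w, w[1:]))
n_bi = sum(bigrams.values())

# Row totals: how often phoneme i occurs in bigram-initial position.
row = Counter()
for (i, _), c in bigrams.items():
    row[i] += c

# Conditional probabilities p_i(j): each row normalized on its own.
p_cond = {(i, j): c / row[i] for (i, j), c in bigrams.items()}

# Weighting each row with the frequency of i (here taken as its
# frequency in bigram-initial position, so the total is exactly 1)
# gives the joint probabilities p(i,j) = p(i) * p_i(j).
p_joint = {(i, j): (row[i] / n_bi) * p_cond[(i, j)] for (i, j) in bigrams}

# Rows are normalized (sum_j p_i(j) = 1) and the joint sums to 1:
for i in row:
    assert abs(sum(v for (a, _), v in p_cond.items() if a == i) - 1.0) < 1e-12
assert abs(sum(p_joint.values()) - 1.0) < 1e-12
```

The inline assertions mirror the two normalization conditions stated above; in a long corpus the bigram-initial frequency of i coincides with its unigram frequency p(i).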
Linguistically, the matrix shows us a language’s phonotactics and the restrictiveness of its phonology. (Probably one can also use it to write a ruthlessly efficient hangman algorithm.) Obviously, the more evenly the probability is spread across the bigram matrix, the freer the phonology. This is exactly what is measured by the n-gram entropy:

H_n = −∑_{i_1…i_n} p(i_1,…,i_n) log2 p_{i_1…i_{n−1}}(i_n),

where p_{i_1…i_{n−1}}(i_n) is the conditional probability of the n-th phoneme given the preceding n−1; for n = 1 this reduces to H_1 = −∑_i p(i) log2 p(i).
Hn can already be computed for the unigram frequencies p(i), but as discussed above, their distribution is mostly determined by the total number of phonemes N, so the same goes for the entropy. It seems more interesting to look at the bigram entropy H2: the smaller it is, the more restrictive the phonology. Note that for any value of n, Hn has the maximum value Hmax = log2(N), which corresponds to the case where all n-grams are equiprobable; the phonology would then be absolutely free and the phonemes uncorrelated.
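A minimal sketch (my own; the same toy word list as above, not the actual dictionary) of the unigram entropy H1, the conditional bigram entropy H2, and the bound Hmax = log2(N):

```python
import math
from collections import Counter

words = ["adan", "aran", "mellon", "galad", "ithil", "anor"]  # toy data

unigrams, bigrams = Counter(), Counter()
for w in words:
    unigrams.update(w)
    bigrams.update(zip(w, w[1:]))
N = len(unigrams)
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

# H1 = -sum_i p(i) log2 p(i)
H1 = -sum(c / n_uni * math.log2(c / n_uni) for c in unigrams.values())

# H2 = -sum_{ij} p(i,j) log2 p_i(j): the per-phoneme (conditional)
# bigram entropy, which is bounded by Hmax = log2(N).
row = Counter()
for (i, _), c in bigrams.items():
    row[i] += c
H2 = -sum(c / n_bi * math.log2(c / row[i]) for (i, _), c in bigrams.items())

Hmax = math.log2(N)
print(round(H1, 2), round(H2, 2), round(Hmax, 2))
```

On real data, a low H2/Hmax ratio signals a restrictive phonology: most rows of the bigram matrix concentrate their probability on a few continuations.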
The following three tables show pi(j), computed for vowels only, consonants only, and for all phonemes. Colors are used as a visual guide to highlight values from 0.1 to 0.2 (blue); 0.2 to 0.3 (green); 0.3 to 0.4 (purple); 0.4 to 0.5 (orange); and finally above 0.5 (red).
For the unigram and bigram entropies, one obtains:
Unfortunately, data from natural languages are hard to come by. For English, Shannon gives H1=4.14, H1/Hmax=0.88 and H2=3.56, H2/Hmax=0.76. However, this was calculated for the N=26 Latin letters rather than for phonemes. Making a comparison nevertheless, one can see that the phonology of Sindarin is much more restricted, which makes sense.
H2 is expected to be smaller than H1 for any language (which is equivalent to the existence of phonotactics). To find a lower bound, languages like Japanese or Hawaiian are promising candidates.
To get a distribution by source, only unique entries were counted. Because of the ubiquitous conceptual changes by Tolkien, an editorial decision has to be made regarding what to count as unique.
For example, N. naith ’gore’ (Ety:387), S. neith, naith ’angle’ (PE17:55) and S. naith ’spearhead, gore, wedge, narrow promontory’ (UT:282) were regarded as the same (polysemous) word, with various possible translations into English, and a joint reference (Ety:387, PE17:55, UT:282).
On the other hand, S. eitha- ’1. prick with a sharp point, stab 2. treat with scorn, insult’ (HEK-, WJ:365) and S. eitha- ’to ease, assist’ (ATHA-, PE17:148) are clearly two different (homophonous) words, and are therefore kept separate. In this case it is obvious from their different etymologies.
There is a grey zone, however: For example, EN baran ’brown, swart, dark-brown’ and S. baran ’brown, yellow-brown’ suggest a conceptual change, albeit a small one, so that they were counted as separate entries, and thus also as different words for the statistics.
This gives the following absolute and relative counts (compare also the Hiswelóke charts):
Of course, a good number of words are attested in several sources, so that the summed counts exceed the actual number of entries. The Venn diagram in figure 3 shows how words are shared among the two top sources (The Etymologies and Parma Eldalamberon 17) and the rest.
This document was translated from LaTeX by HEVEA.