Size of vocabulary / number of roots

leafar
hieroglyphic
Posts: 50
Joined: 04 Dec 2010 14:16

Size of vocabulary / number of roots

Post by leafar »

How many words do people use? And how many root words do languages tend to have?
eldin raigmore
korean
Posts: 6352
Joined: 14 Aug 2010 19:38
Location: SouthEast Michigan

Re: Size of vocabulary / number of roots

Post by eldin raigmore »

leafar wrote:How many words do people use? And how many root words do languages tend to have?
WALS.info doesn't have any information on this.
Using Dogpile I found some more-nearly-relevant sources online:
Esperanto started out with about 900 roots.
This Wikipedia editor wrote:in polysynthetic languages with very high levels of inflectional morphology, the term "root" is generally synonymous with "free morpheme". Many such languages have a very restricted number of morphemes that can stand alone as a word: Yup'ik, for instance, has no more than two thousand.
http://msdn.microsoft.com/en-us/library/ff819125%28v%3Dvs.85%29.aspx wrote:Agglutinative languages form words through the combination of smaller morphemes to express compound ideas. Each of these morphemes generally has one meaning or function and retains its original form and meaning during the combination process. For languages that have agglutinative morphology, such as Turkish, Finnish, Hungarian, or Korean, it is possible to produce thousands of forms for a given root word.
...
Inflected languages, such as English, French, and Latin, have a very small number of possible word forms for one root word. In inflected languages, morphemes influence one another when binding. Most changes in inflection are present in the stem or word ending. In contrast to agglutinative languages, inflected languages tend to have different functions for a single morpheme. For example, a morpheme can determine both number and case.
See http://en.wikipedia.org/wiki/Natural_se ... talanguage
and http://en.wikipedia.org/wiki/Semantic_primes
and http://en.wikipedia.org/wiki/Word.
http://en.wikipedia.org/wiki/Lexicology#Structuralist_and_neostructuralist_semantics wrote:It may be seen that WordNet "is a type of an online electronic lexical database organized on relational principles, which now comprises nearly 100,000 concepts" as Dirk Geeraerts[6] states it.
Take a look at this; it's a list of (some of the) root words in English. Count them and you'll get a minimum. (By some counts English has about 1,000,000 words, more than any other natlang, but not all of them are root-words.)

Mark Rosenfelder's Language Construction Kit directly addresses one of your questions thus:
Zompist wrote:Where the conlang bug bites, the Speedtalk meme is sure to follow. Let Robert Heinlein explain it:

Long before, Ogden and Richards had shown that eight hundred and fifty words were sufficient vocabulary to express anything that could be expressed by "normal" human vocabularies, with the aid of a handful of special words-- a hundred odd-- for each special field, such as horse racing or ballistics. About the same time phoneticians had analyzed all human tongues into about a hundred-odd sounds, represented by the letters of a general phonetic alphabet.
... One phonetic symbol was equivalent to an entire word in a "normal" language, one Speedtalk word was equal to an entire sentence.
--"Gulf", in Assignment in Eternity, 1953
This is a tempting idea, not least because it promises to save us a good deal of work. Why invent thousands of words if a hundred will do?

The unfortunate truth is that Ogden and Richards cheated. They were able to reduce the vocabulary of Basic English so much by taking advantage of idioms like make good for succeed. That may save a word, but it's still a lexical entry that must be learned as a unit, with no help from its component pieces. Plus, the whole process was highly irregular. (Make bad doesn't mean fail.)

The Speedtalk idea may seem to receive support from such observations as that 80% of English text makes use of only the most frequent 3000 words, and 50% makes use of only 100 words. However (as linguist Henry Kučera points out), there's an inverse relationship between frequency and information content: the most frequent words are function words (prepositions, particles, conjunctions, pronouns), which don't contribute much to meaning (and indeed can be left out entirely, as in newspaper headlines), while the least frequent words are important content words. It doesn't do you much good to understand 80% of the words in a sentence if the remaining 20% are the most important for understanding its meaning.

The other problem is that redundancy isn't a bug, it's a feature. Claude Shannon showed that the information content of English text was about one bit per letter-- not too high considering that for random text it's about five bits a letter. Sounds inefficient, huh? On the other hand, we don't actually hear every sound (or, if we're accomplished readers, read every letter) in a word. We use the built-in redundancy of language to understand what's said anyway.

To put it another way: y cn ndrstnd Nglsh txt vn wtht th vwls, or shouted into a nor'easter, or over a staticky phone line. Similarly distorted Speedtalk would be impossible to understand, since entire morphemes would be missing or mistaken. Very probably the degree of redundancy of human languages is pretty precisely calibrated to the minimum level of information needed to cope with typical levels of distortion.

However, go ahead and play with the Speedtalk idea. It's good for some hours of fun, working out as minimal a set of primitives as you can; and the habit of paraphrase it gives you is very useful in creating languages. Just don't take it too seriously; if you do, your punishment is to learn 850 words of any actual foreign language and be set down in a city of monolingual speakers of that language.


It's not an answer but it's relevant.
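
As a rough sanity check on the Shannon figures in that quote, here is a minimal Python sketch. The 26-letter alphabet (no space or punctuation) and the function names are my own assumptions, not anything from the Kit; the "one bit per letter" estimate is simply taken from the quote.

```python
import math


def max_bits_per_symbol(alphabet_size: int) -> float:
    """Bits per symbol if every symbol were equally likely (uniform random text)."""
    return math.log2(alphabet_size)


def redundancy(actual_bits: float, alphabet_size: int) -> float:
    """Fraction of the maximum capacity that is NOT used to carry information."""
    return 1 - actual_bits / max_bits_per_symbol(alphabet_size)


# Assumption: 26 letters, and the quoted estimate of about 1 bit per letter.
print(round(max_bits_per_symbol(26), 2))
print(round(redundancy(1.0, 26), 2))
```

Under those assumptions, uniformly random text carries about 4.7 bits per letter (which the quote rounds to five), and one bit per letter corresponds to a redundancy of roughly 79%.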

This might be worth reading, or might not, I don't know.

My guess is, if your language is naturalistic and realistic, you probably need about 3,000 to 5,000 morphemes, including roots, to get through most of most quotidian conversations; then for each area of specialization and expertise, you probably need about an additional 30,000 to 50,000 words or morphemes -- some of them being roots -- to get through a specialist conversation at the expert level.

Lojban has around 1350 lexical-content or semantic-content roots, and not quite that many function words and inflectional morphemes.

Anyway:

If your language has a triconsonantal-root-system, like many Afro-Asiatic natlangs do, and has 20 consonants which can each appear anywhere in the root independently of what appears elsewhere, you could form up to 8,000 roots. Of course none of these could be pronounced in root form; you'd have to add the "transfix", the vowels between the consonants, to get something pronounceable. Then there may or may not be a prefix and/or a suffix to finish out the word.
Assuming you have three vowels, a C1-C2-C3 root could fit up to 15 patterns before any prefixes or suffixes:
3 of the form C1VC2C3
3 of the form C1C2VC3
9 of the form C1VC2VC3
With 4 prefixes and 4 suffixes, each optional, and assuming every combination could be used, you'd wind up with 15 × 5 × 5 = 375 words per root.
You could make the number of stems (what happens after applying the template but before applying any prefixes or suffixes) bigger by allowing the middle consonant to be used twice in some of the templates; in other words you might have an additional 27 templates of the form C1VC2VC2VC3.
And of course if you have five vowels instead of three you'd get 35 templates instead of 15;
5 of the form C1VC2C3
5 of the form C1C2VC3
25 of the form C1VC2VC3.
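
The counts above can be checked with a short Python sketch. The function names and the sample root k-t-b are only illustrative, and the 375 figure assumes that the prefix and the suffix are each optional, so there are 4 + 1 choices in each slot.

```python
from itertools import product


def count_roots(n_consonants: int, root_length: int = 3) -> int:
    """Number of consonantal roots if every consonant may fill every slot."""
    return n_consonants ** root_length


def stems(root, vowels, double_middle=False):
    """Apply the vowel templates to a three-consonant root."""
    c1, c2, c3 = root
    out = []
    for v in vowels:
        out.append(c1 + v + c2 + c3)        # C1VC2C3
        out.append(c1 + c2 + v + c3)        # C1C2VC3
    for v1, v2 in product(vowels, repeat=2):
        out.append(c1 + v1 + c2 + v2 + c3)  # C1VC2VC3
    if double_middle:
        for v1, v2, v3 in product(vowels, repeat=3):
            out.append(c1 + v1 + c2 + v2 + c2 + v3 + c3)  # C1VC2VC2VC3
    return out


def words_per_root(n_stems: int, n_prefixes: int, n_suffixes: int) -> int:
    """Assumes the prefix and the suffix are each optional (hence the +1)."""
    return n_stems * (n_prefixes + 1) * (n_suffixes + 1)


print(count_roots(20))                                    # 8000
print(len(stems("ktb", "aiu")))                           # 15
print(len(stems("ktb", "aeiou")))                         # 35
print(words_per_root(len(stems("ktb", "aiu")), 4, 4))     # 375
```

Passing double_middle=True adds the 27 extra C1VC2VC2VC3 templates for three vowels, giving 42 stems per root.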

If your language has a bisyllabic-root-system and all CV syllables, like many Polynesian natlangs do, and has 20 consonants and 5 vowels, which are not unreasonable numbers, you could form up to 10,000 roots.
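
The 10,000 figure is just (20 × 5) squared. A minimal sketch, with hypothetical function names and the assumption that every consonant can combine with every vowel in every syllable:

```python
from itertools import product


def count_cv_roots(n_consonants: int, n_vowels: int, n_syllables: int = 2) -> int:
    """Number of roots made of n_syllables CV syllables, with no phonotactic gaps."""
    return (n_consonants * n_vowels) ** n_syllables


def enumerate_cv_roots(consonants, vowels, n_syllables=2):
    """List the roots explicitly; only practical for small inventories."""
    syllables = [c + v for c, v in product(consonants, vowels)]
    return ["".join(combo) for combo in product(syllables, repeat=n_syllables)]


print(count_cv_roots(20, 5))                 # 10000
print(len(enumerate_cv_roots("ptk", "ai")))  # 36
```

Any real language would rule out some of these combinations, so the formula gives an upper bound rather than an expected root count.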

Most conlangers don't get around to making 5000 morphemes (or words or roots) in their conlang. Most rightly regard it as an achievement if they get to number 1000.
leafar
hieroglyphic
Posts: 50
Joined: 04 Dec 2010 14:16

Re: Size of vocabulary / number of roots

Post by leafar »

Cheers Eldin. Really, 3,000-5,000 morphemes? That many?

I was wondering if there is somewhere where you can type in a word, and then it gives you all related words, i.e. the root that the word comes from and all other words that also come from that root, and all words that are derived from the word that you typed. Or something along those lines.
MrKrov
banned
Posts: 1929
Joined: 12 Aug 2010 02:47
Location: /ai/ > /a:/

Re: Size of vocabulary / number of roots

Post by MrKrov »

You wouldn't want to use something like that too often, lest you get a relexification.
Golahet
cuneiform
Posts: 196
Joined: 12 Aug 2010 16:01

Re: Size of vocabulary / number of roots

Post by Golahet »

leafar wrote:Cheers Eldin. Really, 3,000-5,000 morphemes? That many?
eldin raigmore wrote:My guess is, if your language is naturalistic and realistic, you probably need about 600 to 1,000 morphemes, including roots, to get through most of most quotidian conversations; then for each area of specialization and expertise, you probably need about an additional 6,000 to 20,000 words or morphemes -- some of them being roots -- to get through a specialist conversation at the expert level.
Fixed according to my view, unless "naturalistic and realistic" means having an unnecessary level of unproductivity and that the morphemes aren't carefully selected.

eldin raigmore wrote:Most conlangers don't get around to making 5000 morphemes (or words or roots) in their conlang. Most rightly regard it as an achievement if they get to number 1000.
Most conlangs aren't languages.
eldin raigmore
korean
Posts: 6352
Joined: 14 Aug 2010 19:38
Location: SouthEast Michigan

Re: Size of vocabulary / number of roots

Post by eldin raigmore »

Mahal wrote:
leafar wrote:Cheers Eldin. Really, 3,000-5,000 morphemes? That many?
Mahal wrote:My guess is, if your language is naturalistic and realistic, you probably need about 600 to 1,000 morphemes, including roots, to get through most of most quotidian conversations; then for each area of specialization and expertise, you probably need about an additional 6,000 to 20,000 words or morphemes -- some of them being roots -- to get through a specialist conversation at the expert level.
Fixed according to my view, unless "naturalistic and realistic" means having an unnecessary level of unproductivity and that the morphemes aren't carefully selected.
No, by "realistic and naturalistic" I meant "typical of real natlangs".
As a matter of fact, almost all real natlangs do have an unnecessary level of unproductivity (since, theoretically, no level higher than "zero" is "necessary"), and almost no real natlang does have carefully selected morphemes. (In general their morphemes weren't even consciously "selected".)
I would say that if there were any natlang that got away with 1000 morphemes or fewer, it would qualify as "oligosynthetic". IIRC most of the natural languages which, correctly or incorrectly, have been classified as "oligosynthetic" by some professional academic linguisticians, have usually had more than 800 morphemes whose sounds and meanings had been recorded by said linguisticians before proposing that classification.

Mahal wrote:Most conlangs aren't languages.
You're right, of course.
I venture to say very few conlangs are natlangs. It's been said (though I don't know that it's true) that Urdu was first a military auxlang. Esperanto qualifies now as a natlang though it started as an auxlang. I don't know of any other examples at all. Does anyone?
Last edited by eldin raigmore on 10 Jan 2011 19:42, edited 1 time in total.
Yačay256
greek
Posts: 648
Joined: 12 Aug 2010 01:57
Location: Sacramento, California, USA

Re: Size of vocabulary / number of roots

Post by Yačay256 »

I read in an academic book somewhere that Inuktitut has only about 2,000 roots and about 600 suffixes, so that means that a poly lang could get by with around 3,000 morphemes, give or take.
¡Mñíĝínxàʋày!
¡[ˈmí.ɲ̟ōj.ˌɣín.ʃà.βä́j]!
2-POSS.EXCL.ALIEN-COMP-friend.comrade
Hello, colleagues!
Golahet
cuneiform
Posts: 196
Joined: 12 Aug 2010 16:01

Re: Size of vocabulary / number of roots

Post by Golahet »

eldin raigmore wrote:No, by "realistic and naturalistic" I meant "typical of real natlangs".
Well, I read your post as being about what a conlang needs, not what a natlang has.

eldin raigmore wrote:As a matter of fact, almost all real natlangs do have an unnecessary level of unproductivity (since, theoretically, no level higher than "zero" is "necessary")
What I meant with "unnecessary" was what could easily be avoided with reasonable effort.

eldin raigmore wrote:It's been said (though I don't know that it's true) that Urdu was first a military auxlang. I don't know of any other examples at all. Does anyone?
Isn't Urdu a dialect of Hindustani? If combining what I know with that hearsay maybe Urdu was a condialect? If Urdu counts, then I think Finnish counts, "created" by Mikael Agricola, and Modern Hebrew is another example.
eldin raigmore
korean
Posts: 6352
Joined: 14 Aug 2010 19:38
Location: SouthEast Michigan

Re: Size of vocabulary / number of roots

Post by eldin raigmore »

Golahet wrote:
eldin raigmore wrote:No, by "realistic and naturalistic" I meant "typical of real natlangs".
Well, I read your post as being about what a conlang needs, not what a natlang has.
OK. But it was about what is typical of real natlangs, rather than about what a conlang needs.
Golahet wrote:What I meant with "unnecessary" was what could easily be avoided with reasonable effort.
OK.
Golahet wrote:Isn't Urdu a dialect of Hindustani?
If that's not the usual opinion, then, at least, Urdu and Hindi are closely-related languages that are mostly mutually intelligible, AIUI.
Remember that "a language is a dialect with its own army and navy", so Urdu is a language and Hindi is a different language, just as American and English and Canadian and Australian are all separate languages. :-s :roll: :-s
If Urdu and Hindi are dialects of something, I guess there's argument about what they're dialects of.
Golahet wrote:If combining what I know with that hearsay maybe Urdu was a condialect?
If the story is true that a certain monarch found that the lack of a common speech among his army was a military liability, and invented (or commissioned the invention of) Urdu as a military auxlang out of the languages his army then spoke, then I guess Urdu counts as a language-contact military conlang/auxlang or even a military aux-pidgin. I don't know whether that story is true, though.
Golahet wrote:If Urdu counts, then I think Finnish counts, "created" by Mikael Agricola, and Modern Hebrew is another example.
Why not? I don't know that either of those has a greater claim than Urdu; nor that either one has a lesser claim. I hadn't even heard that about Finnish before. Agricola sounds like a Latin name; is Finnish then a constructed Romance language?
Yačay256
greek
Posts: 648
Joined: 12 Aug 2010 01:57
Location: Sacramento, California, USA

Re: Size of vocabulary / number of roots

Post by Yačay256 »

I cannot find it, but I read somewhere that Indigenous Australian languages generally have very few roots, and I recall that a specific lang whose name I cannot remember had barely over 1,000 morphemes.
eldin raigmore
korean
Posts: 6352
Joined: 14 Aug 2010 19:38
Location: SouthEast Michigan

Re: Size of vocabulary / number of roots

Post by eldin raigmore »

Yačay256 wrote:I cannot find it, but I read somewhere that Indigenous Australian languages generally have very few roots, and I recall that a specific lang whose name I cannot remember had barely over 1,000 morphemes.
It would be nice to know:
* the name of that language;
* the source who said that about the language;
* whether that report still stands up.

It "sounds" (to me) like one of those reports of an "oligosynthetic" natlang that comes up every now and again. Most of them, AIUI, have since been qualified or placed in doubt or even retracted. But "barely over 1000 morphemes" is still a larger inventory of morphemes than are reported in many of those reports of oligosynthetic natlangs, some of which are reported to have something over 800 morphemes; and comfortably larger than (about 167% as large as) the inventory-size Mahal proposed of 600 morphemes, which, as Mahal says, a conlang could probably get away with if the conlanger were careful.

Since, until your last post, the smallest natlangish morpheme-inventory mentioned in this thread was, reportedly, just under 3000 morphemes, having a natlang that has only half or even only a third that many morphemes would be really interesting. I hope somebody -- maybe you -- can find the article or book you are thinking of.
Golahet
cuneiform
Posts: 196
Joined: 12 Aug 2010 16:01

Re: Size of vocabulary / number of roots

Post by Golahet »

eldin raigmore wrote:Remember that "a language is a dialect with its own army and navy", so Urdu is a language and Hindi is a different language, just as American and English and Canadian and Australian are all separate languages.
As soon as you bring the army, you have already left linguistics.

I hadn't even heard that about Finnish before. Agricola sounds like a Latin name; is Finnish then a constructed Romance language?
I don't know much that isn't mentioned on Wikipedia, and what isn't mentioned on Wikipedia I can't confirm. But I think what I've heard is that he (a Finnish bishop) devised Finnish from a Finnic dialect spectrum, which also included a priori vocabulary.

eldin raigmore wrote:
Yačay256 wrote:I cannot find it, but I read somewhere that Indigenous Australian languages generally have very few roots, and I recall that a specific lang whose name I cannot remember had barely over 1,000 morphemes.
It would be nice to know:
* the name of that language;
* the source who said that about the language;
* whether that report still stands up.
Damin springs to mind, but he is probably thinking of something else. I too have some memory of having read such a statement, but I couldn't find it while writing this post.


My own oligosynthetic language I'm working on will have 700+ roots and 120+ affixes.
Thakowsaizmu
runic
Posts: 2518
Joined: 13 Aug 2010 18:57

Re: Size of vocabulary / number of roots

Post by Thakowsaizmu »

Golahet wrote:I don't know much that isn't mentioned on Wikipedia, and what isn't mentioned on Wikipedia I can't confirm. But I think what I've heard is that he (a Finnish bishop) devised Finnish from a Finnic dialect spectrum, which also included a priori vocabulary.
I think you may be confusing a written system with a language as a whole.
eldin raigmore
korean
Posts: 6352
Joined: 14 Aug 2010 19:38
Location: SouthEast Michigan

Re: Size of vocabulary / number of roots

Post by eldin raigmore »

Golahet wrote:I don't know much that isn't mentioned on Wikipedia, and what isn't mentioned on Wikipedia I can't confirm. But I think what I've heard is that he (a Finnish bishop) devised Finnish from a Finnic dialect spectrum, which also included a priori vocabulary.
As has already been pointed out, the Wikipedia article says he invented the system to write Finnish in, not the Finnish language itself. (Though he apparently did invent a lot of Finnish's biblical words.)
Golahet wrote:Damin springs to mind, but he is probably thinking of something else. I too have some memory of having read such a statement, but I couldn't find it while writing this post.
Damin is not a complete language; it is a ceremonial language. It's not like Church Latin, a complete version of a dead natlang used for sacred purposes; it's a language that would be vastly deficient if anyone attempted to use it for any non-ceremonial purposes. If it has even as many as 1000 morphemes it probably has more than it absolutely needs.
Golahet
cuneiform
Posts: 196
Joined: 12 Aug 2010 16:01

Re: Size of vocabulary / number of roots

Post by Golahet »

What I've heard goes beyond merely inventing an orthography. He also defined the standard language to write down. The article on Wikipedia doesn't mention this beyond "but first he had to define rules on which the Finnish standard language still relies", which is why I said I couldn't confirm it. Possibly my sources were exaggerating.

Damin has according to Wikipedia about 150 roots. I know it isn't a complete language.
Bristel
sinic
Posts: 359
Joined: 14 Aug 2010 19:50

Re: Size of vocabulary / number of roots

Post by Bristel »

eldin raigmore wrote:
Golahet wrote:I don't know much that isn't mentioned on Wikipedia, and what isn't mentioned on Wikipedia I can't confirm. But I think what I've heard is that he (a Finnish bishop) devised Finnish from a Finnic dialect spectrum, which also included a priori vocabulary.
As has already been pointed out, the Wikipedia article says he invented the system to write Finnish in, not the Finnish language itself. (Though he apparently did invent a lot of Finnish's biblical words.)
Estonian had some expansion of ex nihilo vocabulary and grammatical changes thanks to Johannes Aavik; maybe this is what Golahet is thinking of?
[bɹ̠ˤʷɪs.təɫ]
Nōn quālibet inīqua cupiditāte illectus hōc agō.
[tiː.mɔ.tʉɥs god.lɐf hɑwk]
Golahet
cuneiform
Posts: 196
Joined: 12 Aug 2010 16:01

Re: Size of vocabulary / number of roots

Post by Golahet »

It's possible it was a conflation of Mikael Agricola and Johannes Aavik.
Bristel
sinic
Posts: 359
Joined: 14 Aug 2010 19:50

Re: Size of vocabulary / number of roots

Post by Bristel »

Golahet wrote:It's possible it was a conflation of Mikael Agricola and Johannes Aavik.
That's probably it.

I've been reading up on Estonian lately, so I knew what to look for.