New CBB

Discuss constructed languages, cultures, worlds, related sciences and much more!
It is currently Sun 19 May 2013, 15:27

All times are UTC + 1 hour [ DST ]




 Post subject: Frequency dictionary
Posted: Mon 23 Apr 2012, 09:03
greek

Joined: Wed 11 Apr 2012, 14:58
Posts: 412
Do you know an online dictionary which shows the searched word in one of several categories like
A = 4000 most frequent words
B = 4001 to 10000 most frequent words
C = 10001 to 25000 most frequent words
D = all others
I'm asking since I want to know which words I should make shorter and which longer.
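
The banding described above could be sketched roughly as follows. This is only an illustration in Python, assuming you already have a list of words ordered from most to least frequent (for example, one derived from a corpus). The names `band_for_rank`, `build_rank_index` and `frequency_band` are made up for this sketch and do not come from any existing dictionary site.

```python
from typing import Dict, Iterable, Optional

# Upper rank limit of each band, as given in the post:
# A = top 4000, B = 4001 to 10000, C = 10001 to 25000, D = everything else.
BANDS = [("A", 4000), ("B", 10000), ("C", 25000)]


def band_for_rank(rank: Optional[int]) -> str:
    """Map a 1-based frequency rank to a band; unknown ranks fall into D."""
    if rank is None:
        return "D"
    for band, upper_limit in BANDS:
        if rank <= upper_limit:
            return band
    return "D"


def build_rank_index(ranked_words: Iterable[str]) -> Dict[str, int]:
    """Turn a most-frequent-first word list into a word -> rank lookup."""
    index: Dict[str, int] = {}
    for rank, word in enumerate(ranked_words, start=1):
        # Keep the first (best) rank if a word appears more than once.
        index.setdefault(word.lower(), rank)
    return index


def frequency_band(word: str, rank_index: Dict[str, int]) -> str:
    """Look up a word and report its band (assumption: case-insensitive)."""
    return band_for_rank(rank_index.get(word.lower()))
```

With a real ranked list loaded into `build_rank_index`, a call such as `frequency_band("needle", index)` would return one of "A", "B", "C" or "D". Any word missing from the list lands in D.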


 Post subject: Re: Frequency dictionary
Posted: Mon 23 Apr 2012, 09:38
MVP

Joined: Sun 22 Aug 2010, 18:46
Posts: 3788
Such a list cannot be absolute, but must be relative to some text corpus, like the Oxford English Corpus, on which the Wikipedia page on the "Most common words in English" is based.

But it's difficult to say which words will be the most common in your conlang (when you can produce a large enough corpus, we might be able to give a rough answer to that question), beyond the rules of thumb that function words like articles etc. tend to be among the most common ones, and also that typical Swadesh-list entries tend to be more common than highly specialised words.
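
To illustrate what "relative to some text corpus" means in practice, here is a minimal sketch in Python that ranks the words of whatever text you feed it. It assumes the corpus is available as a plain string, and the tokenizer (lowercase runs of letters, with internal apostrophes allowed) is a deliberately crude assumption, not how the Oxford English Corpus is processed. The function name `rank_words` is hypothetical.

```python
import re
from collections import Counter
from typing import List, Tuple


def rank_words(text: str) -> List[Tuple[str, int]]:
    """Return (word, count) pairs, most frequent first, for the given text."""
    # Crude tokenizer: lowercase letter runs, keeping forms like "don't" whole.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    # Counter.most_common() sorts by descending count.
    return Counter(tokens).most_common()
```

Running this on two different texts will generally give two different rankings, which is why any frequency list only describes the corpus it was built from.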

_________________
constructedlanguages.net


 Post subject: Re: Frequency dictionary
Posted: Mon 23 Apr 2012, 09:50
greek

Joined: Wed 11 Apr 2012, 14:58
Posts: 412
I know such a list cannot be absolute and that function words are among the most common. What I'm interested in is whether the word 'needle' is more common than 'blanket', whether 'idea' is less frequent than 'mind', or how often we use words like 'absent', 'unite', 'ignition' and 'divide'. In short, I need such a list for verbs, nouns, adjectives and adverbs.


 Post subject: Re: Frequency dictionary
Posted: Mon 23 Apr 2012, 10:47
MVP

Joined: Sun 22 Aug 2010, 18:46
Posts: 3788
nmn wrote:
What I'm interested in is whether the word 'needle' is more common than 'blanket', whether 'idea' is less frequent than 'mind', or how often we use words like 'absent', 'unite', 'ignition' and 'divide'. In short, I need such a list for verbs, nouns, adjectives and adverbs.


The closest thing would be something like the Oxford English Corpus.

But beware: unless you are making an English cipher, words will not be in a one-to-one correspondence with each other. Words like those you list are likely to correspond to several different lexical entries in your language, and those words may in turn translate to different lexical entries in English. (Your conlang may have several different words that translate as 'blanket', and each of these words might also denote things other than those referred to as 'blankets' in English; they may translate as 'covering', 'duvet', 'cloth' or a hundred other words.)

_________________
constructedlanguages.net


 Post subject: Re: Frequency dictionary
Posted: Mon 23 Apr 2012, 19:48
greek

Joined: Wed 11 Apr 2012, 14:58
Posts: 412
Thanks for the link.
xingoxa wrote:
But beware, unless you are making an English cipher, words will not be in a one-to-one correspondence to each other.

I'm aware of this. It's not perfect, but it's far better than pure guessing.


 Post subject: Re: Frequency dictionary
Posted: Mon 23 Apr 2012, 20:26
fire

Joined: Sat 14 Aug 2010, 19:38
Posts: 2794
nmn wrote:
Do you know an online dictionary which shows the searched word in one of several categories like
A = 4000 most frequent words
B = 4001 to 10000 most frequent words
C = 10001 to 25000 most frequent words
D = all others
I'm asking since I want to know which words I should make shorter and which longer.


During the Great Depression (I think -- but they published in 1944), the Thorndike of Thorndike & Barnhart's dictionary did a WPA project (with somebody named Lorge) to see which English words were recognized by how many of a million English readers. They got 18 million words of text and found 30 thousand distinct words in them. So that work tells you about the 30,000 most frequent words, and not only ranks them, but tells their frequency and what fraction of the readership recognizes them. That will cover your A and B and C.

You'll never get D. Lexicographers say there are more than a million words in English, but no dictionary ever has all of them, because new words (or new pronunciations, or new spellings, or new meanings, or new uses) come into currency too frequently.

_________________
I am not responsible for the accuracy of my sources; they're responsible for their own mistakes, if any, and also responsible for defending their own statements if you disagree with them.


Last edited by eldin raigmore on Tue 24 Apr 2012, 23:58, edited 1 time in total.

 Post subject: Re: Frequency dictionary
Posted: Tue 24 Apr 2012, 17:51
cuneiform

Joined: Fri 20 Apr 2012, 21:56
Posts: 94
eldin raigmore wrote:
Lexicographers say there are more than a million words in English, but no dictionary ever has all of them, because new words (or new pronunciations, or new spellings, or new meanings, or new uses) come into currency too frequently.


People keep making this claim here and there. Is there some actual research behind it?
I'm asking because, except for Mr Payack's hoax, I cannot see where it comes from.

_________________
grammaire du jrawélien - textes - lexique - miscellanées


 Post subject: Re: Frequency dictionary
Posted: Tue 24 Apr 2012, 21:20
MVP

Joined: Sun 22 Aug 2010, 18:46
Posts: 3788
bororo wrote:
eldin raigmore wrote:
Lexicographers say there are more than a million words in English, but no dictionary ever has all of them, because new words (or new pronunciations, or new spellings, or new meanings, or new uses) come into currency too frequently.


People keep making this claim here and there. Is there some actual research behind it?
I'm asking because, except for Mr Payack's hoax, I cannot see where it comes from.


[+1]

I've heard claims of 200,000, 500,000, 600,000, 2,000,000, and I'd guess a few more. How did they arrive at these numbers? And how can they be so different?

_________________
constructedlanguages.net


 Post subject: Re: Frequency dictionary
Posted: Tue 24 Apr 2012, 23:54
fire

Joined: Sat 14 Aug 2010, 19:38
Posts: 2794
bororo wrote:
People keep making this claim here and there. Is there some actual research behind it?
I'm asking because, except for Mr Payack's hoax, I cannot see where it comes from.



Well, Benjamin Zimmer, writing on Language Log, seems to think the lexicographers of the Oxford English Dictionary have done such research.
If you count just the head-words of lexical entries, there were 300,000 in the 2nd edition of the OED, and there will be more in the 3rd edition.
And if you count all lexemes (all lexical entries, even those that aren't for head-words), there are over 600,000 of those in the 2nd edition of the OED, and apparently they expect around 1.3 million in the 3rd edition.

Benjamin Zimmer wrote:
.... even if we consider one particular dictionary there is no simple answer to how many "words" it contains. The second edition of the Oxford English Dictionary has about 300,000 headwords, covering 640,000 words and phrases, according to AskOxford. (The Third Edition, now in preparation, will increase that number to 1.3 million or more.) So do we count headwords? All defined words and phrases? Every distinct sense and subsense of those words and phrases? Every spelling variant? Do archaic words make the cut, and if so, what's the chronological cutoff for "English"? In estimating the size of the lexicon, AskOxford remains admirably agnostic in its FAQ (emphasis mine):

AskOxford wrote:
How many words are there in the English language?

There is no single sensible answer to this question. It is impossible to count the number of words in a language, because it is so hard to decide what counts as a word. Is dog one word, or two (a noun meaning 'a kind of animal', and a verb meaning 'to follow persistently')? If we count it as two, then do we count inflections separately too (dogs plural noun, dogs present tense of the verb). Is dog-tired a word, or just two other words joined together? Is hot dog really two words, since we might also find hot-dog or even hotdog?
It is also difficult to decide what counts as 'English'. What about medical and scientific terms? Latin words used in law, French words used in cooking, German words used in academic writing, Japanese words used in martial arts? Do you count Scots dialect? Youth slang? Computing jargon?


The lexicographers preparing the 3rd edition of the OED are the ones to whom I was referring, who say that the new OED will have about 1.3 million entries ("lexemes").


xingoxa wrote:
I've heard claims of 200,000, 500,000, 600,000, 2,000,000, and I'd guess a few more. How did they arrive at these numbers? And how can they be so different?

The above from AskOxford's FAQ should explain it. What's a word? How do you tell that one word is different from another word, instead of just being a use or a form of it? How do you tell that a word is an English word, instead of a foreign word just being used in an English sentence?
The fuzziness at some of the boundaries means that it's hard to count.
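
As a toy illustration of how much the answer depends on the definition, here is a small Python sketch that counts the "words" in one sentence under different rules. It reuses the dog / dog-tired / hot dog examples from the FAQ. The function name `count_words`, the sample sentence and the one-entry lemma table are all made up for this sketch, and the tokenizer is a crude assumption rather than anything a lexicographer would use.

```python
import re
from typing import Dict

# Made-up sample built from the FAQ's examples.
SAMPLE = "The dog dogs the dogs. A dog-tired cook ate a hot dog, a hot-dog and a hotdog."


def count_words(text: str, keep_hyphens: bool, lemmas: Dict[str, str]) -> int:
    """Count distinct 'words' in text under one particular set of rules.

    keep_hyphens: treat "dog-tired" as one word (True) or as two (False).
    lemmas: maps an inflected form to its base form, e.g. {"dogs": "dog"};
            pass an empty dict to count every inflection separately.
    """
    pattern = r"[a-z]+(?:-[a-z]+)*" if keep_hyphens else r"[a-z]+"
    tokens = re.findall(pattern, text.lower())
    # Collapse inflections where the table says so, then count distinct forms.
    return len({lemmas.get(token, token) for token in tokens})
```

Calling `count_words(SAMPLE, True, {})`, `count_words(SAMPLE, False, {})` and `count_words(SAMPLE, False, {"dogs": "dog"})` gives three different totals for the same sentence. Note that none of these rules can tell the noun "dog" from the verb "dog", which is yet another decision a real count would have to make.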

The numbers I used to use -- namely, around 225,000 and around 625,000 -- were based on the counts lexicographers gave in talking about the dictionaries they produced. But I'd have to somehow find out which dictionaries those were and who the lexicographers were, and re-read what they wrote, to find out whether they were talking about head-words of entries (or entries of head-words), or something a little less strict.

Anyway, if you don't say something like "English's millionth word was coined sometime last summer", it seems reasonable to say "lexicographers are now saying English has over a million words". As has been mentioned in an article Zimmer links to, there are over a million names of chemicals.

_________________
I am not responsible for the accuracy of my sources; they're responsible for their own mistakes, if any, and also responsible for defending their own statements if you disagree with them.


Last edited by eldin raigmore on Wed 25 Apr 2012, 00:22, edited 1 time in total.

 Post subject: Re: Frequency dictionary
Posted: Wed 25 Apr 2012, 00:13
MVP

Joined: Sun 22 Aug 2010, 18:46
Posts: 3788
eldin raigmore wrote:
The lexicographers preparing the 3rd edition of the OED are the ones to whom I was referring, who say that the new OED will have about 1.3 million entries ("lexemes").


But there is a difference between saying that (1) a certain dictionary (like, in this case, the new OED) has a certain number of entries, and that (2) a certain language (like, in this case, English) has a certain number of words.

_________________
constructedlanguages.net


 Post subject: Re: Frequency dictionary
Posted: Wed 25 Apr 2012, 00:26
fire

Joined: Sat 14 Aug 2010, 19:38
Posts: 2794
xingoxa wrote:
eldin raigmore wrote:
The lexicographers preparing the 3rd edition of the OED are the ones to whom I was referring, who say that the new OED will have about 1.3 million entries ("lexemes").


But there is a difference between saying that (1) a certain dictionary (like, in this case, the new OED) has a certain number of entries, and that (2) a certain language (like, in this case, English) has a certain number of words.


Of course. But if the dictionary has more than a million head-word entries (non-hyphenated, non-apostropheed, non-otherwise-internally-punctuated, single words without included spaces), then the language must have at least that many.

I'm only defending the remark that "English has over a million words".
I'm not defending any exact number, nor any estimate (±100 days) of when the millionth word was coined.




xingoxa wrote:
Such a list cannot be absolute, but must be relative to some text corpus, like the Oxford English Corpus, on which the Wikipedia page on the "Most common words in English" is based.

@nmn: Wikipedia can also lead you to lists of the most common words in other languages.

_________________
I am not responsible for the accuracy of my sources; they're responsible for their own mistakes, if any, and also responsible for defending their own statements if you disagree with them.


 Post subject: Re: Frequency dictionary
Posted: Wed 25 Apr 2012, 02:55
MVP

Joined: Sun 22 Aug 2010, 18:46
Posts: 3788
eldin raigmore wrote:


Of course. But if the dictionary has more than a million head-word entries (non-hyphenated, non-apostropheed, non-otherwise-internally-punctuated, single words without included spaces), then the language must have at least that many.



It depends on what one means by saying that a word 'belongs' to a language. A dictionary could well include archaic words, words that are rarely used for various reasons, dialectal words, or words that the authors of the dictionary simply want people to use.

So the question for anyone claiming that "English has X number of words" must be, "How widely used should a word be to be considered a part of English?"

_________________
constructedlanguages.net


 Post subject: Re: Frequency dictionary
Posted: Wed 25 Apr 2012, 04:32
runic

Joined: Thu 28 Jul 2011, 03:57
Posts: 1411
Location: Glasgow, Scotland
xingoxa wrote:
It depends on what one means by saying that a word 'belongs' to a language. A dictionary could well include archaic words, words that are rarely used for various reasons, dialectal words, or words that the authors of the dictionary simply want people to use.

So the question for anyone claiming that "English has X number of words" must be, "How widely used should a word be to be considered a part of English?"

That can be a tricky question; I apparently have a habit of using words that are uncommon according to such statistics, but people don't exactly struggle to understand me (at least not due to my vocabulary choices...)
Perhaps a better question to ask would be whether the word is frequently understood when used in everyday speech. That would likely exclude 'higher register' words from most dictionaries, though (which would be somewhat ironic, as I suspect these are the ones most often looked up), so perhaps the two criteria together would cover each other's deficiencies.

_________________
I speak English and a touch of Gàidhlig.
I am creating a conworld, which I refer to as the Carrion Series, that will contain three languages, Iriex, Dvoen and Maxna.


 Post subject: Re: Frequency dictionary
Posted: Wed 25 Apr 2012, 10:42
cuneiform

Joined: Fri 20 Apr 2012, 21:56
Posts: 94
@ eldin

You're right, I misread your post. (I meant that claims of accurate, all-inclusive word counts are difficult to justify.)

_________________
grammaire du jrawélien - textes - lexique - miscellanées

