Investigating surname distribution and frequency – the macro scale

The distribution of the leading British surnames in 1881:

SMITH, JONES, WILLIAMS, TAYLOR, BROWN, DAVIES, EVANS, THOMAS, WILSON

(Maps created using Surname Atlas © Archer Software.)

These maps illustrate the fact that even the leading names are not evenly distributed. Each has its own signature, and these individual distribution patterns are still detectable in the 21st century.

Place these names into categories, i.e. Patronymics, Occupational, Locatives, Topographicals, Nicknames. The main category (in this instance Patronymic) can be subdivided into Genitival (Jones, Williams, Evans, Davies) or straight (Thomas). How many of these names are of Welsh origin? Why do Welsh names predominate? Consider also the relative population sizes of England and Wales. Compare with the leading Scottish and Irish names. What percentage of the top 9 names are Welsh? (For the late 20th century, about 50% by number of bearers.) This is a living example of the contribution of Wales to the socio-cultural complex that is Britain.

The overall frequency curve for all names begins to flatten out into a very long tail. The UK as a whole has very, very many names with only a few bearers. Typically, these rare names are locative.

The graph starts near the (0, 0) point, rises rapidly to about 90% of the population at 10% of the surnames, and then climbs slowly to the (100, 100) point. It is difficult to understand what is going on in this graphic presentation, as all the activity seems to take place at low values of “Percentage of surnames”. Re-displaying the data on a semi-logarithmic scale is more revealing. (Source: © Ken Tucker, Carleton University)

Now one can see that the most popular 1% of all names accommodate over 70% of the population, and that 90% of the surnames, from 10% to 100% (the rare surname types), accommodate a mere 9% of the population. The distribution of surnames is thus highly skewed.
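For readers who want to try this on their own data, here is a minimal sketch of how such a cumulative curve might be computed. It assumes you have a list of bearer counts, one per surname. The function names and the toy data are hypothetical, invented for illustration; they are not Ken Tucker's data or method.

```python
def coverage_curve(counts):
    """Return (pct_of_surnames, pct_of_population) pairs, commonest names first."""
    ordered = sorted(counts, reverse=True)
    total_people = sum(ordered)
    total_names = len(ordered)
    curve = []
    running = 0
    for i, bearers in enumerate(ordered, start=1):
        running += bearers
        curve.append((100.0 * i / total_names, 100.0 * running / total_people))
    return curve


def population_share_of_top(counts, top_pct):
    """Percentage of people whose surname lies in the top `top_pct` per cent of names."""
    ordered = sorted(counts, reverse=True)
    n_top = max(1, round(len(ordered) * top_pct / 100.0))
    return 100.0 * sum(ordered[:n_top]) / sum(ordered)


# Hypothetical toy data: 1,000 surnames whose bearer counts fall off with rank.
toy_counts = [max(1, 100_000 // rank) for rank in range(1, 1001)]

share = population_share_of_top(toy_counts, 1)
print(f"Top 1% of toy surnames cover {share:.1f}% of the toy population")
```

Plotting the first element of each pair from `coverage_curve` on a logarithmic axis gives the semi-logarithmic view. The toy figures will not match the real percentages quoted above; only the skewed shape is comparable.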
(Actually, the above 2 graphs are for contemporary US names; thanks, Ken. The slopes would be very similar for the UK, though. Canadian surnames are similar, suggesting that the shape of the curve is not peculiar to the USA but is intrinsic at least to English language surname distributions. Source: Ken Tucker)

For England and Wales, the top 300 surnames encompass 36% of the population (and if England and Wales has 0.5 million surnames(?), then the top 5,000 surnames, as 1% of names, should cover 70% of the population).

An analysis of the NHS Central Register (England and Wales) found that 965 surnames covered about 50% of the population of England/Wales/IOM, with the following frequency distribution:

| Population | Surnames |
|---|---|
| 10% | 24 |
| 20% | 84 |
| 30% | 213 |
| 40% | 460 |
| 50% | 954 |
| 60% | 1,908 |
| 70% | 3,912 |
| 80% | 10,214 |
| 90% | 100,000 |
| 100% | 1,071,603 |

Notice the long tail that forms after the 1,000th surname. Whereas the first 10% of the population is covered by just 24 names, the step from 80% to 90% alone takes about 90,000 more.

This is not to say that there are 1 million names in England and Wales. The NHS Central Register was not built for this purpose, and is subject to list inflation. Besides, the national population is in constant flux: new names arrive (or are created through hyphenation), and rare names disappear through emigration or on death. One can never give a definitive figure, merely an indication. To be very cautious, I would say that the size of the [UK] surname pool is probably in the range of 0.75-1.25 million names, although a recent unpublished study would suggest a lower range.

Cumulative frequencies for Scotland, from 3 historic surveys: (Surnames prefixed with “Mac” or “Mc” were counted as one.) For the Victorian sample, the top 50 names accounted for 29.65% of the sample; for 1935, 26% of the sample; for 1958, 25.53% of the sample. Point to note: the Victorian sample size was less than a quarter of that of the later surveys.
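Taking the NHS Central Register table above at face value, the number of additional surnames needed for each successive 10% of the population can be read off by differencing the rows. This is only a sketch: the figures are copied from the table as given, and the function name is my own.

```python
# Cumulative rows from the NHS Central Register table:
# (percentage of population, number of surnames needed to cover it)
nhs_cumulative = [
    (10, 24), (20, 84), (30, 213), (40, 460), (50, 954),
    (60, 1_908), (70, 3_912), (80, 10_214), (90, 100_000), (100, 1_071_603),
]


def surnames_per_band(cumulative):
    """Return (from_pct, to_pct, additional_surnames) for each 10% band."""
    bands = []
    prev_pct, prev_names = 0, 0
    for pct, names in cumulative:
        bands.append((prev_pct, pct, names - prev_names))
        prev_pct, prev_names = pct, names
    return bands


for low, high, extra in surnames_per_band(nhs_cumulative):
    print(f"{low:>3}% to {high:>3}%: {extra:>9,} additional surnames")
```

Read this way, each later band needs far more surnames than the one before it, which is the long tail in numeric form. The final band is the one most affected by the list inflation mentioned above.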
For the UK, the top 100 surnames cover 20% of the population. This is an exercise that can be repeated from the ranking of top names on this site, and from the names on the GRO(S) site.

For any large database of surnames, the frequency/ranking complies with Zipf’s law, i.e. the frequency of a name is roughly inversely proportional to its rank. If the data is plotted on a log-log scale, the result approximates a straight line, which represents a power law. (For more, see the Statistics section.)

National surname signatures

The data can be expressed in other ways. For example, the next table is an extract from the 1881 UK census data. (I took the data from the Surname Atlas CD.)

| A: Frequency | B: Names | C: No. of names | D: Population of all names at this frequency |
|---|---|---|---|
| 422,733 | Smith | 1 | 422,733 |
| 339,185 | Jones | 1 | 339,185 |
| 900 | Bloomer, Emslie, etc. | 7 | 6,300 |
| 180 | Applebee, Barkham, etc. | 48 | 9,600 |
| 100 | Acker, Airy, etc. | 130 | 13,000 |
| 50 | Agar, Akinson, etc. | 345 | 17,250 |
| 25 | A’Beckett, etc. | 957 | 23,925 |
| 1 | lots! | | |

The Viking long-boat

You will notice that early on, some names (e.g. Smith, Brown, Williams) are the sole occupants of a frequency. If you then plot column C (the number of names) against column A (the frequency), the result is a graph whose shape is reminiscent of the prow of a Viking longboat. Where do you think your name would fall on this graph?

Occupied Frequencies

There are problems with the above. The mistranscriptions are plotted and, as many are unique, will provide significant initial ‘noise’. Most frequencies are unoccupied by names; probably only about 1 to 4 per cent of the possible range is actually occupied. For example, look at the large number of unoccupied frequencies between Smith and Jones.

The following method overcomes these limitations: the occupied frequencies are ranked rather than the names themselves. Rank 1 of the occupied frequencies is taken just by the surname ‘Smith’, with a population of 422,733.
| Frequency | Rank of Occupied Frequency | Name |
|---|---|---|
| 422,733 | 1 | Smith |
| 339,185 | 2 | Jones |

The rank is then plotted against the ‘population of names at the occupied frequency’ (column D above). At a certain ranking, a frequency will suddenly be occupied by 2 surnames: the initial point of the 2nd stratum is then plotted. The process continues till all the ranks of occupied frequencies are exhausted. In the graph below, the bottom stratum represents all those frequencies that are occupied by just a single name.

[Two graphs: 1881 Census and 1998 Electoral Roll. In each, x axis = ‘Rank’ of the Occupied Frequency, running from leading names to rare names; y axis = frequency population.]

Notes

The advantage of this method is that all surname positions can be plotted.
The two shapes are reassuringly similar.
The shape exhibits strata which represent single occupancy, double occupancy, triple, etc.
The data is not quite like for like, as the UK electoral roll excludes those aged 1-16, and some sections of the population are under-registered.

Features

There are two maxima, one at each end of the x-range.
The left-hand maximum, a single stratum, represents leading names (Smith, Brown, etc).
The right-hand maximum, a diminishing tail, represents all the low-occurring rare surnames.
There is a minimum, which is the lowest single-occupancy frequency.
The overall shape is bounded.
With an increase in the size of the distribution, the number of occupied frequencies increases and the minimum value drifts up or, as Ken has succinctly said, “the bigger the boat, the higher it floats.”

Comparisons

The 1998 graph has ‘floated up’, as expected, because of population increase. The difference between the two maxima has lessened in a hundred-plus years.
The opposite might have been expected: the 1881 census data contains mistranscriptions, and the spelling of surnames has become less idiosyncratic in the interim, so one might have expected the tail to have shrunk. Can you suggest reasons why it might not have? The number of single international students in universities? Single migrant workers (whether Polish bus drivers or Icelandic bankers or Russian football club owners)? :-)

International comparisons

This graph acts as a fingerprint to compare the surname profiles of different nations. For example, a fingerprint of contemporary Canadian surnames shows the reverse of the UK and USA fingerprints, in that the maximum of the ‘tail’ is higher than the beginning maximum. In this case, it can be said that the Canadian bearers of the surname Smith are fewer in number than the holders of unique surnames taken together.

Acknowledgement: This section is based solely on the work of Ken Tucker, Research Fellow, Carleton University, whose words I have used above.

Ken Tucker, “An analysis of the forenames and surnames of England and Wales listed in the UK 1881 census data”, Onoma 38 (2003) 181-216.

Ken Tucker, “Fingerprints & entropy: comparing national distributions of forenames and surnames – a presentation to the ANS annual conference”, Jan 2006.
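A note for anyone wishing to build such a fingerprint from their own data: the occupied-frequency ranking described above can be sketched roughly as follows. This is my own minimal reading of the method, not Ken Tucker's code. The function name is hypothetical, and apart from the Smith and Jones figures (taken from the 1881 extract above) the sample names and counts are invented.

```python
from collections import Counter


def occupied_frequency_ranks(name_counts):
    """Rank the occupied frequencies, highest first.

    name_counts maps each surname to its number of bearers.
    Returns a list of (rank, frequency, occupancy, population) tuples, where
    occupancy is the number of surnames sharing that frequency (the stratum)
    and population is frequency * occupancy (column D in the table above).
    """
    occupancy = Counter(name_counts.values())
    rows = []
    for rank, freq in enumerate(sorted(occupancy, reverse=True), start=1):
        n_names = occupancy[freq]
        rows.append((rank, freq, n_names, freq * n_names))
    return rows


# Smith and Jones as in the 1881 extract; the other entries are made up.
sample = {
    "Smith": 422_733, "Jones": 339_185,
    "NameA": 50, "NameB": 50,
    "NameC": 1, "NameD": 1, "NameE": 1,
}

for rank, freq, n_names, population in occupied_frequency_ranks(sample):
    print(rank, freq, n_names, population)
```

Plotting rank against population, with one series per occupancy value, should give the strata described in the Notes. Unoccupied frequencies never appear, because only the frequencies that some name actually holds are ranked.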