Advanced topic: Databases

Databases increasingly contain a mix of names from different personal naming systems. It is becoming increasingly important to enter these names correctly in the database and to isolate groups.

This kind of work is particularly associated with health databases: for example, some ethnic groups carry a higher genetic risk of carrying or acquiring certain illnesses. Ethnic groups may be identified through their personal names, which can help health research.

For an advanced discussion of the problems of mixed personal names in databases, visit these white papers*.

*The web site www.onomastix.com no longer exists. The relevant page has been retrieved from The Internet Archive. It contains abstracts of three articles as shown below. – MS

A list of the “Onomastix white papers” is also given on the web site of the Thesaurus of British Surnames [link no longer available]. Again, the links are all dead. However, three papers have been located on other web sites and have been retrieved, lest they too should disappear, or because in some cases accessing them involves lengthy and tedious form-filling. They are:

Is Soundex Good Enough for You? On the Hidden Risks of Soundex-Based Name Searching
The Use of Phonological Information in Automatic Name Searching
Whose Name Is It: Names, Ownership and Databases

A fourth paper has not been traced on-line:

Lutz, R., Greene, S.: Measuring Phonological Similarity: The Case of Personal Names. Language Analysis Systems, Inc., Herndon (2002).

Links to the web locations from which the papers were extracted are:

Is Soundex Good Enough for You? On the Hidden Risks of Soundex-Based Name Searching
Names That Sound Alike, But Are Not Spelled Alike: The Use of Phonological Information in Automatic Name Searching [link no longer available]
Whose Name Is It: Names, Ownership and Databases [link no longer available]

Abstracts of three Onomastix White Papers:

The Use of Phonological Information in Automatic Name Searching

“Variation in the spellings of names is a persistent issue in the area of automated name searching in large databases (Hermansen, 1985). In general, the source of spelling variation of names can be analyzed and explained a posteriori. Predicting any individual spelling, however, remains problematic. Sources for spelling variation include: keyboard-based data entry errors (e.g., hitting the wrong key: Genning for Henning), syntactic variation (e.g., out-of-sequence given name and surname such as Richard Thomas for Thomas Richard), morphological variation (e.g., truncated strings such as Rich or R for Richard) and semantically-based variation (e.g., nativizations such as Goldwater for Goldwasser). Of interest in the current paper is variation due to orthographic conventions (e.g., English can represent the same sound in more than one way, as in Stephen ~ Steven) and articulatory variation (e.g., the p in Thompson is a predictable spelling of Thomson based on principles of articulation). While there are multiple sources of name variation, this paper will present evidence 1) that the inherent ambiguity in the English use of roman characters can be mitigated by multiple mappings to unambiguous phonetic characters and 2) that phonologically-similar names can be retrieved through the analysis of sounds into their articulatory features (i.e., place and manner of articulation). It is based on research conducted from September of 1995 through the present.”

Whose Name Is It: Names, Ownership and Databases

“Personal names are important pointers to individuals in a society. Whereas in small, tribal societies, the context between name as label and its referent is transparent and direct, in modern technological societies, there is often great distance between the name as label and the person to whom it refers. This is especially true in cases where names are stored within large databases. These include government, medical, educational and even commercial records that are kept about individuals. Problems arise when attempting to retrieve records from those databases. How a name is stored within data records may, and often does, deviate in form from the way it is entered at the time of query. Indeed, personal names pose special problems in terms of data retrieval because names exhibit much more variation in form than do other lexical items. The word chair can refer to any members of the set of chairs, but its written form is fixed by standard English orthographic conventions. Names such as ‘Leigh’ or ‘Johansen’, ‘Stephen’ or ‘Jeffrey’ have a number of common spellings, and probably a number of uncommon ones as well.”

Download the full (rtf) version here.

Measuring Phonological Similarity: The Case of Personal Names

“The field of computational linguistics has matured and expanded as the power and speed of computers has increased, memory and storage costs have fallen and authoring languages have increased in capabilities and level of sophistication. As a result, information extraction and retrieval techniques have also made remarkable progress. In the ‘Information Age’, there are simply too many data to be analyzed and sorted manually. With ever increasing accuracy, algorithms are processing and extracting relevant information from a wide variety of sources and from data in an increasing number of languages.”
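
As a rough illustration of the kind of name matching these papers discuss, the sketch below implements the standard American Soundex code in Python and applies it to spelling variants cited in the abstracts above. It is not taken from any of the papers; the function name soundex and the table SOUNDEX_CODES are assumptions made for this example, and the papers themselves argue for richer phonological methods than this.

```python
# Illustrative sketch only: standard American Soundex, not the method
# proposed in the white papers. Names here are hypothetical.
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """Return the four-character Soundex code for a name ('' if no letters)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    # Keep the first letter plus three digits, padding with zeros
    return (first + "".join(digits) + "000")[:4]


for a, b in [("Stephen", "Steven"), ("Thompson", "Thomson"), ("Genning", "Henning")]:
    print(a, soundex(a), b, soundex(b))
```

With this implementation, Stephen and Steven share a code, while Thompson and Thomson do not, because the inserted p adds a digit. Genning and Henning also differ, since Soundex always keeps the first letter. These are small examples of why a purely spelling-based code can miss variants that sound alike.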