Structural and lexical features of successive versions of the Read Codes
Timothy E Bentley BDS MSc, Colin Price FRCS, Philip J B Brown MRCGP
NHS Centre for Coding & Classification, Woodgate, Loughborough, Leics., LE11 2TG
Originally introduced in the mid-1980s as the 4-byte set, the Read Codes have evolved from a relatively simple clinical classification designed to capture summarised primary care data, into a comprehensive computer-based thesaurus of clinical terms. In 1988, they were adopted as the standard for general practice computing.
All areas of clinical information are supported, and Pringle concluded in 1990 that Version 2 of the Read Codes was the only coding system to offer full coverage for all aspects of primary care. The Read Thesaurus (Version 3 of the Read Codes) is a comprehensive user-led medical terminology that aims to support clinicians from the primary, secondary and tertiary sectors in recording all facets of the process of care. It was produced during the Terms Projects (1992-95), which involved over 2,000 clinicians from all disciplines in the enhancement of the existing Read Codes.
In this study, parallel sections of the 4-byte set, Version 2 and the Read Thesaurus have been evaluated based on expressivity, natural language representation and intrinsic structural properties. The results are presented and discussed with emphasis on the functionality of each set.
Over 80% of United Kingdom (UK) general practices have computers and most of these utilise one of the early versions of the Read Codes, either the 4-byte set or Version 2. A number of studies have evaluated Read Codes both in clinical usage and also in experimental settings.
Smith noted the limitations of earlier versions when attempting to agree a standard data set for general practice communication of clinical information. Problems encountered related to both structure and content, in particular the absence of what was described as a "true hierarchical structure" in the areas of smoking and alcohol consumption. The authors suggested that the problems resulted from the relatively simple fixed hierarchy structure in the early versions.
Williams studied the use of 4-byte Read codes in recording and retrieving data relating to depressive illness and acknowledged that limitations resulted from difficulties in matching a coding scheme to clinical consensus in any domain. This is a particular problem when clinical ideas evolve such that changes become necessary which cannot be accommodated in a fixed-depth code-dependent hierarchy.
A comparison of the content of the April 1993 Version 2 release with other internationally available coding schemes found that SNOMED was superior to both Read and the UMLS Metathesaurus, but its authors acknowledged that the significant modifications subsequently introduced in Read Version 3 would probably lead to improved performance. Furthermore, SNOMED includes modifiers but, unlike Version 3, lacks an information model for their use. The study was also confined to coverage and made no attempt to assess the equally important area of data retrieval.
A preliminary study of Read Version 3 (including developmental qualifier templates) based on United States family practice terms reported encouraging results and suggested that, with its flexible hierarchy structure and qualifiers, it was a strong contender for adoption as a standard coding scheme.
Cimino has documented the ideal criteria for controlled medical vocabularies including completeness, non-redundancy, synonymy, non-ambiguity, multiple parentage and explicit relationships. In this paper, we examine the three versions of the Read Codes with an emphasis on the fulfilment of these criteria.
Background
The Read Codes were originally devised in the early 1980s by Dr James Read, a Loughborough (UK) general medical practitioner, for use in primary health care. The 4-byte set was first issued in 1985 with the intention of supporting the collection of clinical data by general practitioners using stand-alone personal computers. Throughout the 1980s computers became increasingly widely used in general practice and in 1987 the Joint Computing Group of the Royal College of General Practitioners and the General Medical Services Committee of the British Medical Association established a technical Working Party to evaluate clinical classification systems for general practice. Its 1988 report "The Classification of General Practice Data" recommended that Read Codes should become the standard, and on that basis, they were adopted as the standard for general practice computing by the Royal College of General Practitioners.
In 1990, the Read Clinical Classification was purchased by the Department of Health and became Crown Copyright. At the same time the National Health Service Centre for Coding and Classification (NHS-CCC) was established, with a principal function of maintaining and developing the Read Codes.
Figure 1: Milestones in the Evolution of the Read Thesaurus
1985 4-byte set introduced
1988 Recommended as standard for GP computing
1990 Purchased by Department of Health; Crown Copyright
Version 2 introduced
1992 Terms Projects begin
1994 Version 3 introduced
1995 First hospital partnership sites for Version 3
The Terms Projects were a series of collaborations between the clinical professions in the United Kingdom and the NHS Executive. The aim was to capture and incorporate into the Read Codes the natural language used in health care. The requirement for the recording and retrieval of complex clinical detail, and the need to support different specialist views, led to the development of a new (Version 3) file structure.
When the collected terms provided by the clinical specialties were integrated into the new Thesaurus, all concepts from Version 2 were also included, many as obsolete. Subsequently, all the 4-byte concepts have been integrated into Version 3 although this work is still developmental.
Features
The Read Codes have a number of features, some common to all versions, which are discussed below:
The 4-byte set, as the name implies, contains 4-character alphanumeric Read Codes utilising A to Z (originally upper case only) and 0 to 9. In Version 2, 5-character codes are used with A to Z, a to z and 0 to 9 but excluding letters O, o, I and i to avoid confusion. In Version 3, all characters in the ranges A to Z, a to z and 0 to 9 are available. The theoretical number of available codes therefore increases progressively between versions: 1,679,616 (4-byte); 656,356,768 (Version 2); and 916,132,832 (Version 3).
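The three totals quoted above follow directly from the alphabet size and code length of each version. A minimal sketch of that arithmetic, assuming every position may hold any character of the alphabet (the function and variable names are illustrative only):

```python
import string

def code_space(alphabet: str, length: int) -> int:
    """Number of distinct fixed-length codes over the given alphabet."""
    return len(set(alphabet)) ** length

# 4-byte set: A-Z and 0-9 (36 characters), 4 positions.
FOUR_BYTE_ALPHABET = string.ascii_uppercase + string.digits

# Version 3: A-Z, a-z and 0-9 (62 characters), 5 positions.
VERSION_3_ALPHABET = string.ascii_letters + string.digits

# Version 2: as Version 3 but without O, o, I and i (58 characters).
VERSION_2_ALPHABET = "".join(
    ch for ch in VERSION_3_ALPHABET if ch not in "OoIi"
)

four_byte_total = code_space(FOUR_BYTE_ALPHABET, 4)   # 36 ** 4
version_2_total = code_space(VERSION_2_ALPHABET, 5)   # 58 ** 5
version_3_total = code_space(VERSION_3_ALPHABET, 5)   # 62 ** 5
```

These are theoretical ceilings only; in the path-form versions the usable space is further constrained by the hierarchy encoded in the code itself.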
The concepts within the Read Codes are located in a branching display hierarchy in which levels of increasingly detailed concepts are placed. This structure facilitates browsing and eventual selection of a concept for data entry.
In the 4-byte and Version 2 file structures, the hierarchical position (or ancestry) was implicit in the alphanumeric code (path-form hierarchy representation). In Version 3 the hierarchy is stored as a table of parent-child linkages (link-form hierarchy representation). This migration from path- to link-form representation has allowed the adoption of a directed acyclic graph "hierarchy" structure which permits multiple parentage (see Figure 2), unlimited depth (the maximum in Version 3 at present is fifteen levels in parts of the anatomy chapter) and readily supports refinement and expansion.
Figure 2: Multiple parentage in Version 3
Thyroid disorder            Thyroid disorder
  Goitre                      Thyrotoxicosis
    Toxic goitre                Toxic goitre
It also overcomes a major limitation of the earlier versions in which, when the final level of the fixed-depth hierarchy is reached, new concepts can only be added as "siblings", even when they might be "children" (see Figure 3), or as synonyms, even when they are detailed variants that might warrant their own code.
Figure 3: Limitations of a fixed-depth hierarchy
Version 2   7130.  Mastectomy
            71304    Subcutaneous mastectomy
            71307    Subcutaneous mastectomy - gynaecomastia
Version 3   7130.  Mastectomy
            71304    Subcutaneous mastectomy
            71307      Subcutaneous mastectomy - gynaecomastia
(Ideally 71307 should be a child of 71304 but this is not supported by the fixed Version 2 hierarchy in which it has had to be added as a sibling)
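The difference between path-form and link-form hierarchy representation can be sketched as follows. This is an illustration under stated assumptions, not the released file format: the rule that a parent code is obtained by replacing the last significant character with the "." padding character is inferred from the codes shown in the figures (7130., 71304), and the link table uses term text instead of real Version 3 codes.

```python
def path_parent(code):
    """Path form: the parent is implicit in the code itself.

    Assumption: codes are right-padded with '.', so the parent is found
    by blanking the last significant character.
    """
    stem = code.rstrip(".")
    if len(stem) <= 1:
        return None  # a top-level chapter has no parent
    return stem[:-1].ljust(len(code), ".")

def path_ancestors(code):
    """All ancestors of a path-form code, nearest first."""
    ancestors = []
    parent = path_parent(code)
    while parent is not None:
        ancestors.append(parent)
        parent = path_parent(parent)
    return ancestors

# Link form: the hierarchy is a table of (child, parent) rows, so a
# concept may appear as the child of more than one parent.
LINKS = [
    ("Goitre", "Thyroid disorder"),
    ("Thyrotoxicosis", "Thyroid disorder"),
    ("Toxic goitre", "Goitre"),
    ("Toxic goitre", "Thyrotoxicosis"),
]

def link_parents(concept, links):
    """Direct parents of a concept in a link-form table."""
    return sorted(parent for child, parent in links if child == concept)

def link_ancestors(concept, links):
    """All ancestors of a concept in a directed acyclic graph."""
    seen = set()
    stack = [concept]
    while stack:
        node = stack.pop()
        for parent in link_parents(node, links):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

In the path form a code can only ever have one parent and the depth is bounded by the code length, whereas adding a row to the link table is enough to give a concept a further parent or a new child.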
If hierarchies are to be coherent, some redundancy seems inevitable in the absence of multiple parentage as shown in the two Version 2 examples in Figure 4 where the same concepts have been duplicated to allow inclusion in both the Infections (A) and Nervous system (F) chapters and the Female genital (K) and Congenital (P) chapters respectively.
Figure 4: Redundancy in Version 2
Read Code   Term                      Read Code   Term
A130.       Tuberculous meningitis    F004.       Meningitis - tuberculous
K5623       Atresia of vagina         PC4yB       Atresia of vagina
Comprehensive coverage
At the time of its introduction, the 4-byte set allowed far greater coverage than any existing clinical coding system, including history, signs and symptoms, prevention and administration. Version 2 offered similar domain coverage with an emphasis on supporting the acute sector's requirements for generating central statistical data for diagnoses and operations. Version 3 aims to support the requirements of health care workers in all sectors and specialties for recording clinical data and thus contains significantly increased detail. A recent phase of refinement has ensured that concepts from earlier versions are also included, so that the new Thesaurus can fully support the needs of both specialists and generalists.
The Read Codes are re-issued quarterly (monthly for drugs) allowing timely incorporation of corrections and enhancements. Additions to the 4-byte set usually originate in the primary care sector; those for Version 2 from both primary and acute sectors. Version 3 is undergoing refinement in response to feedback from initial hospital partnership sites.
Version 2 and Version 3 are cross-referenced to the Tabular List of the Classification of Surgical Operations and Procedures produced by the Office of Population Censuses and Surveys (OPCS-4), the International Classification of Diseases 9th revision (ICD-9), and the International Classification of Diseases and Health Related Problems 10th revision (ICD-10).
In each version, a preferred term is assigned to a code but synonymous terms can also be included and the overall ratio of terms to codes for the full sets is illustrated in Figure 5. This suggests that the richness of the terminology in Version 3 (ratio 1.2) is lower than the 4-byte set (ratio 1.5). However, Version 3 contains extensive sections of anatomy, organisms and other qualifying detail which have a low synonymy rate. A comparison of clinical subsets, as reported below, suggests that Version 3 is indeed a richer clinical vocabulary (see Figure 8).
Figure 5: Terms and Concepts: Full sets
All versions support synonymy and Versions 2 and 3 include separate term identifiers. In Version 2, the terms are fixed to concepts with 2-character additional term codes, whereas in Version 3, the labelling of terms with unique 5-character codes enables them to be relocated to different concepts. In Version 3, a dynamic descriptions table links the term identifiers to Read codes (Figure 6).
Figure 6: Terms and codes
            Read Code   Term identifier   Term
Version 2   G3...       00                Ischaemic heart disease
                        11                Arteriosclerotic heart disease
Version 3   G3...       Y201T             Ischaemic heart disease
                        Y201V             Arteriosclerotic heart disease
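The two ways of identifying terms in Figure 6 can be sketched as data structures. The layout below is an assumption made for illustration: the row shape (Read code, term identifier, preferred flag), the function names and the placeholder code ZZZZZ are hypothetical and are not taken from the released files.

```python
# Version 2: a term is addressed by (Read code, 2-character term code),
# so its identity is tied to the concept it is filed under.
V2_TERMS = {
    ("G3...", "00"): "Ischaemic heart disease",
    ("G3...", "11"): "Arteriosclerotic heart disease",
}

# Version 3: each term has its own 5-character identifier ...
V3_TERMS = {
    "Y201T": "Ischaemic heart disease",
    "Y201V": "Arteriosclerotic heart disease",
}

# ... and a separate descriptions table links term identifiers to codes.
# Assumed row shape: (read_code, term_id, is_preferred).
DESCRIPTIONS = [
    ("G3...", "Y201T", True),
    ("G3...", "Y201V", False),
]

# Placeholder target concept for the example; NOT a real Read code.
HYPOTHETICAL_CODE = "ZZZZZ"

def terms_for(code, descriptions, terms):
    """Term texts currently linked to a Read code."""
    return [terms[tid] for c, tid, _ in descriptions if c == code]

def relocate(term_id, old_code, new_code, descriptions):
    """Return a new descriptions table with one term moved.

    Only the link row changes; the term identifier itself is untouched.
    """
    return [
        (new_code if (c == old_code and tid == term_id) else c, tid, pref)
        for c, tid, pref in descriptions
    ]

moved = relocate("Y201V", "G3...", HYPOTHETICAL_CODE, DESCRIPTIONS)
```

Because the Version 2 key contains the Read code, the same move there would mean retiring one term key and issuing another, which is the practical sense in which Version 2 terms are fixed to concepts.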
In Version 3, a non-preferred term can be linked to more than one concept, allowing terms with a more general meaning to point to more specific concepts. This is important to support the functionality of a natural language terminology. In order to avoid ambiguity, it is important to ensure that terms do not depend on hierarchical or contextual clues for accurate assignment of meaning. In the 4-byte set and Version 2, the term IgG (Read code 43J3.) is in the laboratory procedures hierarchy though in isolation it may appear to be a substance.
Attempts to avoid confusion may require unnatural defining terminology, for example, Aspiration - action (a procedure as opposed to a cause of pneumonitis), though it may be possible to preferentially use a natural synonym in a live system.
A final point in relation to terms is the released term lengths: the 4-byte set includes only 30-character terms, whereas Versions 2 and 3 include 60- and 198-character term lengths. The use of unabbreviated terms avoids ambiguity in systems which can support them.
The Read Thesaurus enumerates common clinical concepts but also supports the composition of concepts through the attachment of specified attribute-value pairs as qualifiers to objects. This was an essential factor in enabling Version 3 to represent a wealth of clinical detail in a manageable way. The template table contains allowable object-attribute-value triples to support qualification of concepts and also enables semantic definition as described below. The long-term plans for the development of Read Version 3 include widespread semantic characterisation of concepts throughout the thesaurus, a process referred to as atomic mapping.
The aim was to compare parallel subsets from each of the three versions with respect to:
Subsets
Current activity in Read Code development is focusing on the demands of inter-version compatibility and this has led to some preparatory analytical work at the NHS-CCC. In order to examine differences between the three versions, subsets were identified from the procedures and disorders chapters of Version 3 for comparison with parallel sections of Version 2 and the 4-byte set. Two clinical domains - thyroid disorders and breast operations - were non-randomly selected on the basis of stability, representing 0.5% and 1.25% of the current concepts in the disorders and procedures chapters of Version 3 respectively.
The overall numbers of terms and concept codes in the two clinical domains were determined from the April 1996 releases of the 4-byte set, Version 2 and Version 3.
Two doctors were asked to examine each concept and indicate whether they considered it to be clinical or non-clinical. The latter include a number of concepts which were identified as residual categories derived from formal classifications (often including the rubrics Other specified/OS, Unspecified/NOS, Not elsewhere classified, and so on) and also organisational terms (for example, Breast biopsy and related procedures) which are structural headings, often useful for aggregation, but unlikely to be entered in a clinical record. The proportion of clinically useful representations in each subset was calculated.
The synonym purity of the non-preferred terms from each of the three subsets was assessed by the same two doctors. They were asked to make a judgment for each version, as to whether the non-preferred terms for each concept were synonyms, hyponyms, hypernyms, eponyms, or incorrect.
The following definitions were applied based on previous work in this field:
Semantic definition of enumerated concepts in clinical vocabularies by means of attributed characteristics has been described by several groups. The technique involves applying to each concept a formal frame containing defined attributes and specifying appropriate values for the attributes when present within the concept. For the purposes of this study, the Breast operations hierarchy was analysed for each version of Read. Attribution of definitions was achieved by either manual assignment or by lexical matching between value sets and the preferred terms and subsequent visual inspection. The completed definition frame for Percutaneous biopsy of breast is shown in Figure 7.
Figure 7: Attribute Definition Frame used for analysis
Attribute   Values
Site        Breast structure
Action      Biopsy
Approach    Percutaneous
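The lexical matching step used to fill a definition frame such as the one in Figure 7 can be sketched as below. The value sets are hypothetical: only Breast structure, Biopsy and Percutaneous come from Figure 7, and the matching keys and the remaining values are invented for the example. The study also relied on manual assignment and visual inspection, which this sketch does not attempt to reproduce.

```python
# Hypothetical value sets: lower-case lexical key -> attribute value.
VALUE_SETS = {
    "Site": {"breast": "Breast structure", "nipple": "Nipple structure"},
    "Action": {"biopsy": "Biopsy", "excision": "Excision"},
    "Approach": {"percutaneous": "Percutaneous", "open": "Open"},
}

def define(preferred_term, value_sets=VALUE_SETS):
    """Fill a definition frame by whole-word matching on a preferred term.

    Attributes with no matching value are left out of the frame.
    """
    words = preferred_term.lower().split()
    frame = {}
    for attribute, lexicon in value_sets.items():
        for key, value in lexicon.items():
            if key in words:
                frame[attribute] = value
                break  # one value per attribute in this simple sketch
    return frame

frame = define("Percutaneous biopsy of breast")
```

Each filled attribute counts as one atom, so the frame for Percutaneous biopsy of breast contributes three atoms. A purely lexical match of this kind is easy to mislead, which is why the results were inspected visually.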
We have utilised semantic definition to examine a number of intrinsic properties including the "molecularity" of the concepts in each version and also the size of the concept fields theoretically required to construct each version compositionally.
Figure 8 illustrates the number of terms and concepts with the ratio.
Figure 8: Terms and Codes for breast and thyroid subsets
The analysis of the subsets from the three versions reveals that the total number of concepts and terms, for thyroid disorders and breast procedures combined, increases with successive versions of the Read Codes from 52 concepts (71 terms) in the 4-byte set and 193 concepts (240 terms) in Version 2, to 285 concepts (406 terms) in Version 3. The ratio of terms to concepts is highest in Version 3 (1.42, compared with 1.37 in the 4-byte set and 1.24 in Version 2) because of the provision of more alternative terms.
The results for the three versions in the two subset domains are summarised in Figure 9.
Figure 9: Clinical and non-clinical concepts
                          Thyroid disorders   Breast procedures   Overall
4-byte     Clinical       14 (64%)            24 (80%)            38 (73%)
           Non-clinical   8 (36%)             6 (20%)             14 (27%)
           Total          22                  30                  52
Version 2  Clinical       55 (56%)            56 (58%)            111 (57%)
           Non-clinical   42 (43%)            40 (42%)            82 (43%)
           Total          97                  96                  193
Version 3  Clinical       165 (100%)          115 (96%)           280 (98%)
           Non-clinical   0                   5 (4%)              5 (2%)
           Total          165                 120                 285
The 4-byte set contained over 70% clinical concepts. The relatively high proportion of non-clinical concepts in Version 2 (over 40%) reflects its purpose of generating central statistical data for disorders and operations. In contrast Version 3, generated by clinicians as a vocabulary to support their records, consists almost entirely of clinical concepts (98%).
These results are presented graphically in Figure 10.
Figure 10: Bar chart of clinical and non clinical concepts
Synonymy
The results for the three versions in the two subset domains are summarised in Figure 11.
Figure 11: Preferred and Non-preferred term analyses for subsets
                           Thyroid disorders   Breast procedures
4-byte     Concepts        22                  30
           Total terms     38                  33
           Non-preferred   16                  3
           Synonyms        5 (31%)             2 (66%)
           Hyponyms        5 (31%)             1 (33%)
           Hypernyms       2 (13%)             --
           Eponyms         4 (25%)             --
           Incorrect       --                  --
Version 2  Concepts        97                  96
           Total terms     120                 120
           Non-preferred   23                  24
           Synonyms        2 (7%)              6 (25%)
           Hyponyms        5 (22%)             4 (3%)
           Hypernyms       1 (4%)              --
           Eponyms         4 (17%)             11 (9%)
           Incorrect       11 (48%)            3 (2%)
Version 3  Concepts        165                 120
           Total terms     227                 179
           Non-preferred   62                  59
           Synonyms        15 (25%)            15 (25%)
           Hyponyms        12 (20%)            12 (20%)
           Hypernyms       12 (20%)            12 (20%)
           Eponyms         18 (30%)            18 (30%)
           Incorrect       1 (1.5%)            1 (1.5%)
For thyroid disorders, the 4-byte set contained sixteen non-preferred terms. Although none were considered incorrect, only five were thought to be true synonyms. Version 2 is more expressive than the 4-byte set as it contains 120 terms with 23 (26%) being non-preferred terms. However, eleven of these were marked as incorrect and only two were considered true synonyms. Version 3 is much more expressive than either of the earlier versions, containing 227 terms for Thyroid disorders of which 62 were non-preferred terms. Of these one was considered incorrect whilst the remainder were designated hypernyms/hyponyms.
For breast procedures, the 4-byte set contained only three non-preferred terms, and none was considered incorrect. Of the 24 non-preferred terms in Version 2, only six were considered to be synonyms and the remainder included three that were marked incorrect. As with thyroid disorders, Version 3 provided a rich term set (179 terms) and the non-preferred group included twelve hyponyms, twelve hypernyms and one incorrect.
These results relate to the breast procedures section of each Version.
The total number of attributed atoms in each version
Figure 12 illustrates the increasingly complex representation of clinical information in the enumerated sections of each version (4-byte, Version 2 and Version 3 current).
Figure 12: Total atom count for breast procedures in each version
The total number of atoms in relationship to the number of concepts, i.e. an indicator of the "molecularity" of the version (see Figure 13).
Figure 13: Molecularity for breast operation subsets
               4-byte   Version 2   Version 3
Atoms (x)      70       248         376
Concepts (y)   30       96          120
x/y            2.33     2.58        3.13
Despite the facility to construct complex concepts by adding qualifiers, enumerated Version 3 concepts in the subsets examined have greater molecularity than those from the previous versions.
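The molecularity figures in Figure 13 are the total atom count divided by the concept count. The sketch below reproduces those ratios from the published totals and shows how total atoms, unique atoms and molecularity could be derived from a set of definition frames. The frame for Percutaneous biopsy of breast follows Figure 7; the second frame and all function names are hypothetical.

```python
def molecularity(total_atoms, concept_count):
    """Mean number of atoms per enumerated concept."""
    return total_atoms / concept_count

# (atoms, concepts) for the breast operations subsets, from Figure 13.
FIGURE_13 = {
    "4-byte": (70, 30),
    "Version 2": (248, 96),
    "Version 3": (376, 120),
}

ratios = {
    version: round(molecularity(atoms, concepts), 2)
    for version, (atoms, concepts) in FIGURE_13.items()
}

def atom_statistics(frames):
    """Summarise a list of definition frames (attribute -> value dicts)."""
    atoms = [
        (attribute, value)
        for frame in frames
        for attribute, value in frame.items()
    ]
    return {
        "total_atoms": len(atoms),
        "unique_atoms": len(set(atoms)),
        "molecularity": len(atoms) / len(frames),
    }

# Two example frames; the second is invented for illustration.
EXAMPLE_FRAMES = [
    {"Site": "Breast structure", "Action": "Biopsy", "Approach": "Percutaneous"},
    {"Site": "Breast structure", "Action": "Biopsy"},
]

example_stats = atom_statistics(EXAMPLE_FRAMES)
```

The unique atom count corresponds to the size of the primitive concept set that would be needed to build the same concepts compositionally, while the total count reflects how much detail the enumerated concepts already carry.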
The number of unique atoms in each version, i.e. the size of the primitive concept set which would be required to construct the terminology compositionally
The bar chart (Figure 14) illustrates the size of the atomic concept fields in the 4-byte set, Version 2 and Version 3 (current).
Figure 14: Unique atoms for breast operations in each version
Discussion
Domain completeness in earlier versions of Read is limited primarily by the hierarchy structure (even the comparatively small 4-byte set has over 1 million unique codes available which could in theory support a coded vocabulary several times the size of SNOMED or Read Version 3). It is difficult, however, to expand earlier versions and at the same time to preserve meaningful hierarchical relationships between concepts for the purposes of browsing and analysis. The problems that result from a restricted number of available hierarchy positions are highlighted by Version 2, in which the limitations of the inflexible structure in supporting specialist detail are readily apparent. A further factor is the specific purpose for which these versions are designed - the 4-byte set for primary care data and Version 2 for acute sector encoding. The limited number of requests received for additions to the 4-byte set in particular suggests that it is fulfilling its role for most primary care users. Not surprisingly, evaluation of an early version for use as a comprehensive controlled medical vocabulary will yield sub-optimal results.
Analyses of term/concept ratios and of the atom content of enumerated concepts suggest that Version 3 is a far more expressive terminology. Furthermore, the availability of qualifiers in Version 3, not examined in this study, increases the expressivity further.
The trade-off between the introduction of redundancy and the need to ensure complete hierarchies is a problem which has not been satisfactorily resolved in early versions of Read with their fixed structures. The potential for Version 3 to support multiple classification overcomes this difficulty but also requires that the multiple placement of concepts should be complete if hierarchical analysis is to be effective.
Analysis suggests that Version 2 has the highest incidence of incorrect synonyms and this probably reflects its function in providing a comprehensive term-based index to classifications, the key requirement being the generation of the correct cross-mapping rather than clinical precision. This functional requirement was augmented by the incorporation as synonyms of a number of inclusion statements derived from classifications. This is again only a limiting factor in an inexpansible hierarchy: in Version 3 further concepts can be accommodated independently, rather than as impure synonyms.
Version 3 deliberately excludes concepts with a pure classification function (residual categories, and so on) from its current set, though a small number of organisational concepts have been introduced to support a rational structure.
This paper has outlined significant differences between the three versions of the Read Codes. The subset analyses have provided useful insight into structural and lexical features of the versions but further analyses need to be undertaken using larger and randomised samples to support statistical validation.
The 4-byte set has obtained widespread use in general practice but its limited structure makes it unsuitable for supporting specialist detail and accurate cross-mapping, and insufficiently flexible for long-term future needs.
Version 2 was designed with the acute sector management requirements in mind, that is, encoding to the OPCS-4 and ICD classifications. The priority in this sector is to provide accurate cross-maps to the classifications within the limits of a fixed-level rigid hierarchy which Version 2 is able to support.
The flexible directed acyclic graph hierarchy and the provision of qualifiers in Version 3, though introducing complexities not present in its forerunners, offer the potential to support both specialist depth and generalist breadth. Version 3 most closely fulfils Cimino's criteria for a controlled clinical vocabulary to support all health care sectors.